## Introduction

This notebook is aimed at beginners who want to learn basic data manipulation using the Pandas library in Python.
The datasets used come from the Internet Movie Database (IMDb).
  
The basic questions we are going to answer are:
* Which are the highest grossing movies within the data?
* Which movies have the highest average user vote?
* Which movies have the most 'polarised' votes?
* Which movies have the largest vote difference by sex?
* What is the gross income per director?
* __A Challenge Question!__

## Let's import the modules we need

We will be using **"NumPy"** and **"Pandas"**.
  
**NumPy** is a library for Python that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
  
**Pandas** is a high-level data manipulation tool that is built on the NumPy package.
  
Aliasing “numpy” to “np” and “pandas” to “pd” is optional (you could call them anything you want), but is recommended as a convention.
  
See the Pandas API reference [here](https://pandas.pydata.org/pandas-docs/stable/index.html).
  
See the NumPy documentation [here](https://numpy.org/doc/).

In [None]:
import numpy as np
import pandas as pd

# These are just some display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# Display floating points in a more readable format
pd.options.display.float_format = '{:,.2f}'.format

## Reading in the Data
The key data structure in Pandas is a DataFrame. DataFrames are incredibly powerful as they allow you to store and manipulate tabular data as rows of observations and columns of variables.
  
Pandas allows us to read in data that exists in multiple different formats - in this case plain old CSV is being used.
  
The `.head(x)` method allows us to preview the first x rows of the DataFrame. You should also try out the `.tail()` method, which shows the last rows instead.

In [None]:
# Read in the movies data first
movies_df = pd.read_csv("/kaggle/input/imdb-data/IMDb movies.csv")
movies_df.head(5)

We may look at the data types of each column using the `.info()` method.
  
__Note__ that columns which have mixed data types (for example, entries have a combination of integers and strings) will have the **object** dtype.

In [None]:
# We can take a look at the schema
movies_df.info()
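The note about the **object** dtype can be seen on a small toy frame. This is only a sketch: `toy_df` and its values are made up for illustration and are not taken from the IMDb files.

```python
import pandas as pd

# Toy data (hypothetical, not from the IMDb files)
toy_df = pd.DataFrame({
    "year": [1994, 2008, 2010],          # integers only
    "budget": [1000, "$ 2000", None],    # an integer, a string and a missing value
})

# 'year' is stored as an integer dtype, while the mixed 'budget' column falls back to object
print(toy_df.dtypes)
```

A column that shows up as object when you expected numbers is usually a sign that some entries need cleaning.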

## Some Basic Syntax
The following are some fundamental data manipulation techniques in Pandas.

In [None]:
# To select certain columns from a dataframe we specify them as a list of strings within square brackets
movies_df[['director', 'genre', 'original_title']]

In [None]:
# To filter a dataframe we use expressions within the square 'selecting' brackets
movies_df[movies_df['avg_vote'] >= 7.0].head(3)

In [None]:
# You can call operations via methods available to the 'dataframe' type
movies_df.sort_values(by='avg_vote', ascending=False).head(3)
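Filters can also be combined. The sketch below assumes a toy frame (`toy_df`, with a hypothetical `year` column) rather than the real data, and shows the element-wise AND (`&`) and OR (`|`) operators.

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "avg_vote": [6.5, 7.2, 8.1, 9.0],
    "year": [1999, 2005, 2012, 2020],
})

# Each condition needs its own parentheses; & means AND, | means OR
recent_and_good = toy_df[(toy_df["avg_vote"] >= 7.0) & (toy_df["year"] >= 2010)]
old_or_great = toy_df[(toy_df["year"] < 2000) | (toy_df["avg_vote"] >= 9.0)]

print(recent_and_good)
print(old_or_great)
```

Python's `and` / `or` keywords do not work element-wise on columns, which is why the bitwise operators are used here.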

For more information on methods and techniques in Pandas see the Pandas API reference [here](https://pandas.pydata.org/pandas-docs/stable/index.html).

## Looking at the Basic Stats of a Dataset
When examining a dataset it's useful to look at the basic statistics of each column.

The two calls below give statistics for numeric columns (specified by `include=np.number`) and for 'text-like' columns (specified by `include=['O']`) respectively.
  

Note that the statistics given are different for numbers and text columns.

In [None]:
movies_df.describe(include=np.number)

In [None]:
movies_df.describe(include=['O'])
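To see how the two summaries differ, here is a minimal sketch on a made-up frame. The `genre` column is explicitly created with the object dtype, which is an assumption made so that `include=['O']` picks it up.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); 'genre' is forced to the object dtype
toy_df = pd.DataFrame({
    "avg_vote": [6.5, 7.2, 8.1],
    "genre": pd.Series(["Drama", "Comedy", "Drama"], dtype=object),
})

numeric_stats = toy_df.describe(include=np.number)  # count, mean, std, min, quartiles, max
text_stats = toy_df.describe(include=["O"])         # count, unique, top, freq

print(numeric_stats.index.tolist())
print(text_stats.index.tolist())
```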

__Note__ that 'worlwide_gross_income' is included as an 'object' column instead of a numeric column, which is problematic for the rest of the analysis - we'll have to clean that column.

But first, we want to examine the 'worlwide_gross_income' column - we do this by viewing the first 10 entries. \
We also exclude blank entries from our preview using the **bitwise NOT operator (~)** and the `.isna()` method.

In [None]:
# Examine the 'worlwide_gross_income' column for its issues
movies_df[~(movies_df['worlwide_gross_income'].isna())]['worlwide_gross_income'].head(10)
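There are a few equivalent ways to drop blank entries. The sketch below uses a made-up Series named `income` to compare `~ .isna()` with `.notna()` and `.dropna()`.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with two missing entries
income = pd.Series(["$ 100", np.nan, "$ 250", np.nan], name="worlwide_gross_income")

via_not_isna = income[~income.isna()]   # invert the 'is missing' mask
via_notna = income[income.notna()]      # ask directly for 'is not missing'
via_dropna = income.dropna()            # drop missing entries in one call

print(via_not_isna.equals(via_notna) and via_notna.equals(via_dropna))
```

Which form to use is mostly a matter of taste; all three keep the original index labels.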

Next, we want to get rid of the currency symbols and keep only the numerical figures. For this, we will be using **Regular Expressions** to extract digits only. We will then convert the column to a float data type (i.e. a real number with a decimal point).

> Exercise: Do more research on what Regular Expressions are, and how they are used. \
> Hint - Have you ever used the CTRL+F command before?

In [None]:
# We can use regex to extract the digits from the column
# expand=False returns a Series (rather than a one-column DataFrame)
movies_df['worlwide_gross_income'] = movies_df['worlwide_gross_income'].str.extract(r'(\d+)', expand=False).astype('float')
movies_df[~(movies_df['worlwide_gross_income'].isna())].head(10)
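The regex step can be tried out on a handful of made-up strings. The values in `raw` are assumptions chosen for illustration, not entries copied from the dataset.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical): currency prefix, another prefix, a thousands separator, a blank
raw = pd.Series(["$ 1234567", "INR 5000", "$ 1,234", np.nan])

# (\d+) captures the FIRST run of consecutive digits in each string
digits = raw.str.extract(r"(\d+)", expand=False)
cleaned = digits.astype("float")

print(cleaned)
```

Note that the third entry comes out as 1.0, not 1234.0: the pattern stops at the comma. If a column used thousands separators, they would need to be removed before extracting.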

## Finding the Top Grossing Movies within the Dataset
Now that the gross income column is clean, the first question can be answered.
  
This is a 'one-liner' using Pandas: simply select and then sort.

>Question: what if we wanted to see the lowest grossing movies?

In [None]:
movies_df[['original_title', 'worlwide_gross_income']].sort_values(by='worlwide_gross_income', ascending=False).head(10)
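As an aside, Pandas also offers `.nlargest()` for 'top N' questions. The sketch below compares it with select-and-sort on a made-up frame.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), including one missing income
toy_df = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "worlwide_gross_income": [300.0, np.nan, 900.0, 100.0],
})

# Sort descending and take the first rows (missing values are placed last by default)
top_sorted = toy_df.sort_values(by="worlwide_gross_income", ascending=False).head(2)

# The same result using nlargest
top_nlargest = toy_df.nlargest(2, "worlwide_gross_income")

print(top_sorted)
print(top_nlargest)
```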

## Movies with the Highest Vote
In order to get hold of ratings, the 'IMDb ratings.csv' file will need to be joined onto the movies data.
  
Only the necessary columns will be selected from ratings - this assumes some pre-existing knowledge about the data.
  
_Note that the movies dataframe has got user rating info - but let's assume we want detailed ratings info._

In [None]:
# Here we leverage the 'usecols' argument to read in only the columns we need
ratings_df = pd.read_csv("/kaggle/input/imdb-data/IMDb ratings.csv", usecols=['imdb_title_id', 'mean_vote', 'median_vote', 'males_allages_avg_vote', 'females_allages_avg_vote', 'total_votes'])
ratings_df.head(5)

The `.merge()` method below performs a *SQL-like join*.
  
For a quick refresher on joins see [the W3Schools page](https://www.w3schools.com/sql/sql_join.asp) on joins.
> Would you consider using another type of join here?

In [None]:
movie_ratings_df = movies_df[['imdb_title_id', 'original_title', 'worlwide_gross_income']].merge(
    ratings_df,
    on='imdb_title_id',
    how='inner'
)

movie_ratings_df.head(10)
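To get a feel for how the `how` argument changes the result, here is a small sketch with two made-up frames (`movies` and `ratings` are hypothetical and much smaller than the real data).

```python
import pandas as pd

# Toy frames (hypothetical values); 'tt2' has no rating and 'tt4' has no movie row
movies = pd.DataFrame({"imdb_title_id": ["tt1", "tt2", "tt3"],
                       "original_title": ["A", "B", "C"]})
ratings = pd.DataFrame({"imdb_title_id": ["tt1", "tt3", "tt4"],
                        "mean_vote": [7.1, 8.3, 5.0]})

# Inner join: only IDs present in both frames survive
inner = movies.merge(ratings, on="imdb_title_id", how="inner")

# Left join: every movie is kept; indicator=True adds a '_merge' column saying where each row came from
left = movies.merge(ratings, on="imdb_title_id", how="left", indicator=True)

print(inner)
print(left)
```

In the left join, the movie without a rating gets a missing `mean_vote` and is flagged as `left_only`.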

Let's view the top 10 movies according to average (mean) votes.

In [None]:
movie_ratings_df[['original_title', 'mean_vote', 'total_votes']].sort_values(by='mean_vote', ascending=False).head(10)

>We see a bunch of movies you \[probably\] haven't heard of - why?

We will filter on the `movie_ratings_df` and create a new DataFrame based on the filtered copy.

__Note:__ When creating a new DataFrame from an old one, be sure to remember the **.copy()** method. This ensures you are creating a completely new DataFrame that is not 'tied' to the old one.

In [None]:
filtered_movie_ratings_df = movie_ratings_df[movie_ratings_df['total_votes'] >= 500000].copy()
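A minimal sketch of why the copy matters, using a made-up frame: changes to the filtered copy do not leak back into the original.

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "total_votes": [100, 600000, 900000],
})

# Filter, then copy, so the result is an independent DataFrame
popular_df = toy_df[toy_df["total_votes"] >= 500000].copy()

# Adding a column to the copy leaves the original untouched
popular_df["is_popular"] = True

print(toy_df.columns.tolist())
print(popular_df.columns.tolist())
```

Without `.copy()`, depending on your Pandas version, the column assignment may emit a `SettingWithCopyWarning`.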

In [None]:
filtered_movie_ratings_df[['original_title', 'mean_vote', 'total_votes']].sort_values(by='mean_vote', ascending=False).head(10)

## Movies with the Highest Skewness
>How would you figure out which movies people have polarised opinions on?

In [None]:
# I'll stick with the filtered movies so we can see movies we all know about
filtered_movie_ratings_df['skewness'] = filtered_movie_ratings_df['mean_vote'] - filtered_movie_ratings_df['median_vote']
filtered_movie_ratings_df.head(10)

Skewness can be approximated by the difference between the mean and the median. The mean is an average over all values, so it is pulled towards extreme values, whereas the median is the 'middlemost' value in the data and is not influenced by outliers in the same way. _Technically speaking, you should divide this difference by the standard deviation of the data to get a standardised measure (Pearson's median skewness coefficient also multiplies it by 3)_. More on the standard deviation later.
  
Thus, their difference is a rough measure of how far the data is pulled towards extreme values.
  
See [this](https://www.youtube.com/watch?v=U0NZu6f5TMI) video on the concept of skewness.
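The standardised version mentioned above can be sketched with NumPy. The `votes` array is a made-up set of individual ratings for a single movie; the notebook's data only provides the mean and median per movie, so having the raw votes is an assumption here.

```python
import numpy as np

# Hypothetical individual ratings for one movie: mostly 10s with a few very low scores
votes = np.array([1, 1, 2, 9, 10, 10, 10, 10, 10, 10])

mean = votes.mean()
median = np.median(votes)
std = votes.std()  # population standard deviation (ddof=0)

raw_diff = mean - median                  # the simple measure used in this notebook
pearson_median_skew = 3 * raw_diff / std  # Pearson's median skewness coefficient

print(raw_diff, pearson_median_skew)
```

Here the mean sits below the median, so both measures are negative: the handful of low scores drags the mean down.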

In [None]:
# Get movies with the highest skewness
filtered_movie_ratings_df.sort_values('skewness', ascending=False).head(10)

## Movies with the Largest Vote Difference by Sex
There are multiple ways of doing something like this - the one presented is straightforward, but you should try some other ways.
  
We'll stick with the filtered dataset for this question.

In [None]:
# This operation creates a new column as the element-wise subtraction of two columns
filtered_movie_ratings_df['rating_discrepency_by_sex'] = filtered_movie_ratings_df['males_allages_avg_vote'] - filtered_movie_ratings_df['females_allages_avg_vote']

In [None]:
filtered_movie_ratings_df.sort_values(by='rating_discrepency_by_sex', ascending=False)
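Sorting the signed difference in descending order surfaces movies rated higher by men. One possible variation, sketched here on a made-up frame, is to sort by the absolute difference so that large gaps in either direction appear at the top. The column name `abs_rating_discrepancy` is hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "males_allages_avg_vote": [8.0, 6.0, 7.5],
    "females_allages_avg_vote": [7.5, 7.8, 7.4],
})

# Signed difference, then its absolute value
diff = toy_df["males_allages_avg_vote"] - toy_df["females_allages_avg_vote"]
toy_df["abs_rating_discrepancy"] = diff.abs()

ranked = toy_df.sort_values(by="abs_rating_discrepancy", ascending=False)
print(ranked)
```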

## Finding the Gross Income per Director
Here we'll leverage the `groupby` method in Pandas to find the sum of the 'worlwide_gross_income' per director.

In [None]:
# There's a catch though - this data is de-normalised as two directors appear as one element in the dataframe
# The filter expression is simply searching for 'director elements' that contain commas
movies_df[movies_df['director'].str.contains(',') == True].head(3)

A simple strategy would be to simply allocate the same income to each director on the movie.
  
Note that the income per director will no longer be additive.

In [None]:
# First move - turn the comma-separated values into a list
# Splitting on a comma plus any following whitespace avoids names with a leading space
movies_df['director_list'] = movies_df['director'].str.split(r',\s*')

movies_df[['director', 'director_list']].head(5)

In [None]:
# Now we will use the explode operator to turn each list element into a new row
exploded_movies_df = movies_df.explode('director_list').copy()

exploded_movies_df[['director', 'director_list']][exploded_movies_df['director'].str.contains(',') == True].head(6)

In [None]:
grouped_exploded_movies_df = exploded_movies_df.groupby('director_list', as_index=False).agg({'worlwide_gross_income': 'sum'})

grouped_exploded_movies_df

In [None]:
grouped_exploded_movies_df.sort_values(by='worlwide_gross_income', ascending=False).head(10)
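The whole split, explode and group pipeline can be checked on a tiny made-up frame. The director names and incomes below are hypothetical. The sketch also strips whitespace around each name (an extra step assumed here) so that a name following a comma lands in the same group as the same name on its own.

```python
import pandas as pd

# Toy data (hypothetical values); the second movie has two directors
toy_df = pd.DataFrame({
    "director": ["Ann Lee", "Ann Lee, Bo Kim", "Bo Kim"],
    "worlwide_gross_income": [100.0, 50.0, 30.0],
})

# Split into lists, give each list element its own row, then tidy up the names
toy_df["director_list"] = toy_df["director"].str.split(",")
exploded = toy_df.explode("director_list")
exploded["director_list"] = exploded["director_list"].str.strip()

per_director = exploded.groupby("director_list", as_index=False).agg(
    {"worlwide_gross_income": "sum"}
)
print(per_director)
```

The per-director totals add up to more than the total income of the three movies, because the shared movie is counted once for each of its directors. This is the non-additivity mentioned earlier.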

## Challenge Question!
> Find the variance and standard deviation of ratings per movie.
  
See [here](https://en.wikipedia.org/wiki/Variance) for a definition of variance.
>_Hint:_ The data in `IMDb ratings.csv` has columns that count how many times each rating was given - e.g. `votes_10` counts the number of rating-10 votes for a movie.