## Objective
In this assignment, you will try to find some interesting insights into a few movies released between 1916 and 2016, using Python. This is a compulsory individual assignment wherein you will download a movie dataset, write Python code to explore the data, gain insights into the movies, actors, directors, and collections, and submit the code.

In [1]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [1]:
# Import the numpy and pandas packages

import numpy as np
import pandas as pd

## Task 1: Reading and Inspection

-  ### Subtask 1.1: Import and read

In [1]:
movies = pd.read_csv("../input/MovieAssignmentData.csv") 
movies.head()

-  ### Subtask 1.2: Inspect the dataframe

Inspect the dataframe's columns, shape, variable types, etc.

In [1]:
# Code for inspection here
print(movies.columns)

In [1]:
#Printing the dimensions of the movies dataframe
print(movies.shape)

In [1]:
# Looking at the datatypes of each column
# info() prints its summary directly and returns None, so it is not wrapped in print()
movies.info()

## Task 2: Cleaning the Data

-  ### Subtask 2.1: Inspect Null values

Find the number of Null values in each column and in each row. Also find the percentage of Null values in each column, rounded to two decimal places.

In [1]:
# Code for column-wise null count here
movies.isnull().sum(axis=0)

In [1]:
# Code for row-wise null count here
movies.isnull().sum(axis=1)

In [1]:
# Code for column-wise null percentages here
print(round(100*(movies.isnull().sum()/len(movies.index)), 2))
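
The same percentages can also be obtained from the mean of the boolean null mask. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are assumptions for illustration, not the assignment data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `movies`; the values are invented for illustration.
toy = pd.DataFrame({
    "gross": [1.0, np.nan, 3.0, np.nan],
    "budget": [2.0, 2.5, np.nan, 4.0],
    "language": ["English", "English", "French", "English"],
})

# isna() returns a boolean mask; the mean of a boolean column is the
# fraction of True values, i.e. the fraction of nulls in that column.
null_pct = (toy.isna().mean() * 100).round(2)
print(null_pct)
```

Both forms divide the null count by the number of rows; the mean-based version just does it in one step.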

-  ### Subtask 2.2: Drop unnecessary columns

Drop the columns that are not required for the analysis.

In [1]:
# Code for dropping the columns here. It is advised to keep inspecting the dataframe after each set of operations 
movies = movies.drop(['color','director_facebook_likes','actor_1_facebook_likes','actor_2_facebook_likes','actor_3_facebook_likes',
                     'actor_2_name','cast_total_facebook_likes','actor_3_name','duration','facenumber_in_poster','content_rating',
                     'country','movie_imdb_link','aspect_ratio','plot_keywords'], axis=1)

# Printing the datatypes and dataframe information after dropping the unnecessary columns
# info() prints its summary directly and returns None, so it is not wrapped in print()
movies.info()

In [1]:
#printing the column-wise null percentage for the remaining columns in the dataset
print(round(100*(movies.isnull().sum()/len(movies.index)), 2))

-  ### Subtask 2.3: Drop unnecessary rows using columns with high Null percentages

For every column that has more than 5% Null values, drop the rows in which that column is Null.

In [1]:
# Drop the rows that are null in the columns having more than 5% null values (here, 'gross' and 'budget')
movies = movies[~(movies['gross'].isnull() | movies['budget'].isnull())]

#printing the column-wise null percentage for the remaining columns in the dataset
print(round(100*(movies.isnull().sum()/len(movies.index)), 2))
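
The cell above names `gross` and `budget` by hand. As a hedged alternative, the columns above the threshold can be picked programmatically and passed to `dropna(subset=...)`. The sketch below uses a made-up toy frame; `toy`, `high_null_cols`, and `cleaned` are illustrative names, not part of the assignment:

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; 'gross' and 'budget' are each 20% null.
toy = pd.DataFrame({
    "gross": [1.0, np.nan, 3.0, 4.0, 5.0],
    "budget": [2.0, 2.5, np.nan, 4.0, 1.0],
    "title": ["a", "b", "c", "d", "e"],
})

# Percentage of nulls per column, then the columns above the 5% threshold.
null_pct = toy.isna().mean() * 100
high_null_cols = null_pct[null_pct > 5].index.tolist()

# Drop every row that is null in any of those columns.
cleaned = toy.dropna(subset=high_null_cols)
print(high_null_cols, len(cleaned))
```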

-  ### Subtask 2.4: Drop unnecessary rows

Some of the rows might have greater than five NaN values. Such rows aren't of much use for the analysis and hence, should be removed.

In [1]:
# Number of rows with more than 5 NaN values
print(len(movies[movies.isnull().sum(axis=1) > 5]))

# Number of rows with 5 or fewer NaN values
print(len(movies[movies.isnull().sum(axis=1) <= 5]))

# Code for dropping the rows here
movies = movies[movies.isnull().sum(axis=1) <= 5]

#printing the column-wise null percentage for the remaining columns in the dataset
print(round(100*(movies.isnull().sum()/len(movies.index)), 2))
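
The row filter above can also be written with `dropna(thresh=...)`, where `thresh` is the minimum number of non-null values a row must have to be kept. A minimal sketch on a made-up three-column frame, with a limit of 1 null per row instead of 5 so the effect is visible (`toy` and `max_nulls` are assumed names):

```python
import numpy as np
import pandas as pd

# Toy frame with 0, 1 and 2 nulls in rows 0, 1 and 2 respectively.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, 3.0],
})

max_nulls = 1  # keep rows with at most this many nulls

# Boolean-mask form, as used in the notebook.
mask_kept = toy[toy.isna().sum(axis=1) <= max_nulls]

# Equivalent dropna form: require at least (n_columns - max_nulls) non-null values.
thresh_kept = toy.dropna(thresh=toy.shape[1] - max_nulls)

print(mask_kept.equals(thresh_kept))
```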

-  ### Subtask 2.5: Fill NaN values


In [1]:
#converting language column as a category
movies['language'] = movies['language'].astype('category')

#printing the value counts for the language column
print(movies['language'].value_counts())

# Imputing the NaN values with 'English', the most frequent language
movies.loc[movies['language'].isnull(), ['language']] = 'English'

#printing the null percentage column-wise again
print(round(100*(movies.isnull().sum()/len(movies.index)), 2))
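
Instead of typing the most frequent value by hand, it can be read from the column with `mode()` and passed to `fillna()`. This is only a sketch on invented data, assuming a single most frequent value (`toy` and `top_language` are illustrative names):

```python
import pandas as pd

# Toy column with two missing entries; values are invented for illustration.
toy = pd.DataFrame({"language": ["English", "French", None, "English", None]})

# mode() ignores missing values by default and returns the most frequent value(s);
# [0] takes the first one.
top_language = toy["language"].mode()[0]

# Replace the missing entries with that value.
toy["language"] = toy["language"].fillna(top_language)
print(toy["language"].value_counts())
```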

-  ### Subtask 2.6: Check the number of retained rows


In [1]:
# Code for checking number of retained rows here
print(len(movies.index))

#Percentage of retained rows
print(100*(len(movies.index)/5043))

**Note 1:** We still have around `77%` of the rows!
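
The retained percentage above divides by a hard-coded row count. A hedged alternative is to record the row count right after loading and reuse it later, sketched here on a made-up frame (`raw`, `n_original`, and `retained_pct` are assumed names):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the freshly loaded data.
raw = pd.DataFrame({"gross": [1.0, np.nan, 3.0, 4.0]})

# Remember the original size before any cleaning step.
n_original = len(raw)

# Any cleaning step; here, simply drop rows with nulls.
cleaned = raw.dropna()

retained_pct = 100 * len(cleaned) / n_original
print(retained_pct)
```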

## Task 3: Data Analysis

-  ### Subtask 3.1: Change the unit of columns


In [1]:
# Code for unit conversion here
movies['gross'] = movies['gross']/1000000
movies['budget'] = movies['budget']/1000000
movies

-  ### Subtask 3.2: Find the movies with highest profit

In [1]:
#Code for creating the profit column here
movies['profit'] = movies['gross'] - movies['budget']
movies

In [1]:
# Code for sorting the dataframe here
movies.sort_values('profit',ascending = False)

In [1]:
# Code to get the top 10 profiting movies here
top10 = movies.sort_values('profit',ascending = False).head(10) 
top10

-  ### Subtask 3.3: Dropping duplicate values


In [1]:
# Code for dropping duplicate values here
movies = movies[~movies.duplicated()]
movies

In [1]:
# Repeating Subtask 3.2 (top 10 movies by profit) on the de-duplicated data
top10 = movies.sort_values('profit',ascending = False).head(10)
top10

**Note 2:** We have two movies directed by `James Cameron` in the list.
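
The de-duplicate and top-N steps can also be written with `drop_duplicates()` and `nlargest()`. A minimal sketch on invented numbers (the frame `toy` and the choice of a top 2 are assumptions for illustration):

```python
import pandas as pd

# Toy frame with one fully duplicated row ("B"); values are invented.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "B", "C"],
    "gross": [10.0, 50.0, 50.0, 30.0],
    "budget": [5.0, 20.0, 20.0, 40.0],
})
toy["profit"] = toy["gross"] - toy["budget"]

# Remove rows that are exact duplicates across all columns.
deduped = toy.drop_duplicates()

# nlargest sorts by the given column and keeps the first n rows.
top2 = deduped.nlargest(2, "profit")
print(top2)
```

Dropping duplicates before ranking matters: otherwise the same film can occupy more than one slot in the top list.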

-  ### Subtask 3.4: Find IMDb Top 250

    1. Create a new dataframe `IMDb_Top_250` and store the top 250 movies with the highest IMDb Rating (corresponding to the column: `imdb_score`). Also make sure that for all of these movies, the `num_voted_users` is greater than 25,000. Also add a `Rank` column containing the values 1 to 250 indicating the ranks of the corresponding films.
    2. Extract all the movies in the `IMDb_Top_250` dataframe which are not in the English language and store them in a new dataframe named `Top_Foreign_Lang_Film`.

In [1]:
# Code for extracting the top 250 movies as per the IMDb score here. Make sure that you store it in a new dataframe 
# and name that dataframe as 'IMDb_Top_250'
IMDb_Top_250 = movies.sort_values(by = 'imdb_score', ascending = False)
IMDb_Top_250 = IMDb_Top_250.loc[IMDb_Top_250.num_voted_users > 25000]
IMDb_Top_250 = IMDb_Top_250.iloc[:250, ]
IMDb_Top_250['Rank'] = range(1,251)
IMDb_Top_250

In [1]:
# Code to extract top foreign language films from 'IMDb_Top_250' here
Top_Foreign_Lang_Film = IMDb_Top_250[IMDb_Top_250['language'] != 'English']
Top_Foreign_Lang_Film
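
The two steps above (filter by votes, rank by score, then keep the non-English films) can be sketched on a small made-up frame. Here the rank range is derived from the number of rows actually kept, so the assignment does not fail if fewer than the requested number of films pass the vote filter. All names and values below are assumptions for illustration:

```python
import pandas as pd

# Toy frame; scores, vote counts and languages are invented.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D", "E"],
    "imdb_score": [9.0, 8.5, 9.5, 7.0, 8.0],
    "num_voted_users": [30000, 100, 50000, 40000, 26000],
    "language": ["English", "French", "Hindi", "English", "Japanese"],
})

top_n = 3  # stands in for 250 in the assignment

# Filter on votes first, then sort by score and keep the top rows.
ranked = (
    toy[toy["num_voted_users"] > 25000]
    .sort_values("imdb_score", ascending=False)
    .head(top_n)
    .copy()  # work on a copy so adding a column does not touch `toy`
)

# Rank 1..k, where k is the number of rows actually kept.
ranked["Rank"] = range(1, len(ranked) + 1)

# Non-English films among the ranked ones.
foreign = ranked[ranked["language"] != "English"]
print(ranked)
print(foreign)
```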

- ### Subtask 3.5: Find the best directors

    1. Find out the top 10 directors for whom the mean of `imdb_score` is the highest and store them in a new dataframe `top10director`. 

In [1]:
# Code for extracting the top 10 directors here
top10director = pd.DataFrame(movies.groupby('director_name').imdb_score.mean().sort_values(ascending = False).head(10))
top10director
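
A compact variant of the same group-and-average step, sketched on made-up scores, uses `nlargest()` on the grouped means and `reset_index()` to get an ordinary two-column dataframe (`toy` and `top2director` are illustrative names):

```python
import pandas as pd

# Toy frame; director names and scores are invented.
toy = pd.DataFrame({
    "director_name": ["X", "X", "Y", "Z", "Z"],
    "imdb_score": [8.0, 9.0, 7.0, 9.0, 9.5],
})

# Mean score per director, keep the two highest, and turn the result
# back into a dataframe with 'director_name' as a regular column.
top2director = (
    toy.groupby("director_name")["imdb_score"]
    .mean()
    .nlargest(2)
    .reset_index()
)
print(top2director)
```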

-  ### Subtask 3.6: Find popular genres

    1. Extract the first two genres from the `genres` column and store them in two new columns: `genre_1` and `genre_2`. Some of the movies might have only one genre. In such cases, extract the single genre into both the columns, i.e. for such movies the `genre_2` will be the same as `genre_1`.
    2. Group the dataframe using `genre_1` as the primary column and `genre_2` as the secondary column.
    3. Find the five most popular genre combinations by taking the mean of the `gross` column for each group, and store them in a new dataframe named `PopGenre`.

In [1]:
# Code for extracting the first two genres of each movie here
movies['genre_1'] = movies['genres'].apply(lambda x: x.split('|')[0])
movies['genre_2'] = movies['genres'].apply(lambda x: x.split('|')[0] if len(x.split('|'))<2 else x.split('|')[1])
movies
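
The same extraction can be done without `apply` by using the vectorised `str.split(..., expand=True)`, which avoids splitting each string several times. A minimal sketch on invented genre strings, assuming at least one row has two or more genres so that column `1` exists in the split result:

```python
import pandas as pd

# Toy genre strings in the same pipe-separated layout as the `genres` column.
toy = pd.DataFrame({"genres": ["Action|Adventure|Sci-Fi", "Drama", "Comedy|Romance"]})

# expand=True returns one column per split position; missing positions are null.
parts = toy["genres"].str.split("|", expand=True)

toy["genre_1"] = parts[0]
# Where there is no second genre, fall back to the first one.
toy["genre_2"] = parts[1].fillna(parts[0])
print(toy)
```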

In [1]:
# Code for grouping the dataframe here
movies_by_segment = movies.groupby(['genre_1','genre_2'])

In [1]:
# Code for getting the 5 most popular combo of genres here
PopGenre = pd.DataFrame(movies_by_segment.gross.mean().sort_values(ascending = False).head(5))
PopGenre

**Note 3:** As it turns out, `Family + Sci-Fi` is the most popular combination of genres out there!
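
For reference, grouping on two columns produces a result indexed by (`genre_1`, `genre_2`) pairs. The sketch below shows this on made-up gross values; the numbers are invented and do not reproduce the result in the note:

```python
import pandas as pd

# Toy frame; genre pairs and gross values are invented.
toy = pd.DataFrame({
    "genre_1": ["Action", "Action", "Drama", "Family", "Family"],
    "genre_2": ["Adventure", "Adventure", "Drama", "Sci-Fi", "Sci-Fi"],
    "gross": [100.0, 200.0, 50.0, 400.0, 300.0],
})

# Mean gross per (genre_1, genre_2) pair, keeping the two highest.
pop = (
    toy.groupby(["genre_1", "genre_2"])["gross"]
    .mean()
    .nlargest(2)
    .to_frame()
)
print(pop)
```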

-  ### Subtask 3.7: Find the critic-favorite and audience-favorite actors

    1. Create three new dataframes namely, `Meryl_Streep`, `Leo_Caprio`, and `Brad_Pitt` which contain the movies in which the actors: 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' are the lead actors. Use only the `actor_1_name` column for extraction. Also, make sure that you use the names 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' for the said extraction.
    2. Append the rows of all these dataframes and store them in a new dataframe named `Combined`.
    3. Group the combined dataframe using the `actor_1_name` column.
    4. Find the mean of `num_critic_for_reviews` and `num_user_for_reviews` for each actor and identify the actor with the highest mean in each.

In [1]:
# Code for creating three new dataframes here
# Including all movies in which Meryl_Streep is the lead
Meryl_Streep = movies.loc[movies['actor_1_name'] == 'Meryl Streep']
Meryl_Streep

In [1]:
# Including all movies in which Leo_Caprio is the lead
Leo_Caprio = movies.loc[movies['actor_1_name'] == 'Leonardo DiCaprio']
Leo_Caprio

In [1]:
# Including all movies in which Brad_Pitt is the lead
Brad_Pitt = movies.loc[movies['actor_1_name'] == 'Brad Pitt']
Brad_Pitt

In [1]:
# Code for combining the three dataframes here
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows in the same order
combined = pd.concat([Meryl_Streep, Leo_Caprio, Brad_Pitt])
combined

In [1]:
# Code for grouping the combined dataframe here
combined = combined.groupby('actor_1_name')

In [1]:
# Code for finding the mean of critic reviews and audience reviews here
# Select several columns from a groupby with a list (double brackets)
combined[['num_critic_for_reviews','num_user_for_reviews']].mean().sort_values('num_critic_for_reviews',ascending = False)

**Note 4:** `Leonardo` has aced both the lists!
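
The whole subtask can be sketched end to end on a small made-up frame: build one dataframe per actor, stack them with `pd.concat`, group, average, and read off the top actor per column with `idxmax()`. The review counts below are invented for illustration and are not the dataset's values:

```python
import pandas as pd

# Toy frame; review counts are invented.
toy = pd.DataFrame({
    "actor_1_name": ["Meryl Streep", "Leonardo DiCaprio", "Brad Pitt",
                     "Leonardo DiCaprio", "Tom Hanks"],
    "num_critic_for_reviews": [100, 300, 200, 400, 500],
    "num_user_for_reviews": [150, 900, 700, 1100, 50],
})

actors = ["Meryl Streep", "Leonardo DiCaprio", "Brad Pitt"]

# One dataframe per actor, stacked row-wise.
combined = pd.concat([toy[toy["actor_1_name"] == name] for name in actors])

# Mean of both review counts per actor.
means = combined.groupby("actor_1_name")[
    ["num_critic_for_reviews", "num_user_for_reviews"]
].mean()

# idxmax returns the index label (the actor) of the largest value.
critic_favorite = means["num_critic_for_reviews"].idxmax()
audience_favorite = means["num_user_for_reviews"].idxmax()
print(means)
print(critic_favorite, audience_favorite)
```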