In [1]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [1]:
# Import the numpy and pandas packages
import numpy as np
import pandas as pd

## Task 1: Reading and Inspection

-  ### Subtask 1.1: Import and read

Import and read the movie database. Store it in a variable called `movies`.

In [1]:
# Write your code for importing the csv file here
movies = pd.read_csv('../input/imdb-movie-dataset/movie_data.csv')
movies

-  ### Subtask 1.2: Inspect the dataframe

Inspect the dataframe's columns, shapes, variable types etc.

In [1]:
# Write your code for inspection here
print(movies.shape)
movies.info()  # info() prints column names, dtypes and non-null counts itself and returns None

## Task 2: Cleaning the Data

-  ### Subtask 2.1: Inspect Null values

Find the number of Null values in each column and in each row. Also find the percentage of Null values in each column, rounded to two decimal places.

In [1]:
# Write your code for column-wise null count here
movies.isnull().sum(axis=0) # Number of null values in each column

In [1]:
# Write your code for row-wise null count here
movies.isnull().sum(axis=1) # Number of null values in each row

In [1]:
# Write your code for column-wise null percentages here
round(100*(movies.isnull().sum()/len(movies.index)),2) # Percentage of null values in each column
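
The percentage calculation can be checked on a small example. This is a minimal sketch on a made-up dataframe (the name `toy` and its values are assumptions, not the assignment data). It relies on the fact that the mean of a boolean column is the fraction of `True` values, so `isnull().mean()` gives the same result as dividing the null count by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `movies`
toy = pd.DataFrame({
    "gross": [1.0, np.nan, 3.0, np.nan],
    "budget": [5.0, 6.0, np.nan, 8.0],
    "language": ["English", "Hindi", "English", "English"],
})

# isnull() yields booleans; their column-wise mean is the null fraction
null_pct = (toy.isnull().mean() * 100).round(2)
print(null_pct)
```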

-  ### Subtask 2.2: Drop unnecessary columns

For this assignment, you will mostly be analyzing the movies with respect to ratings, gross collection, popularity, etc. Many of the columns in this dataframe are therefore not required, so it is advised to drop the following columns:
-  color
-  director_facebook_likes
-  actor_1_facebook_likes
-  actor_2_facebook_likes
-  actor_3_facebook_likes
-  actor_2_name
-  cast_total_facebook_likes
-  actor_3_name
-  duration
-  facenumber_in_poster
-  content_rating
-  country
-  movie_imdb_link
-  aspect_ratio
-  plot_keywords

In [1]:
# Write your code for dropping the columns here. It is advised to keep inspecting the dataframe after each set of operations
# Drop all the unneeded columns in a single call
cols_to_drop = [
    'color', 'director_facebook_likes', 'actor_1_facebook_likes',
    'actor_2_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
    'cast_total_facebook_likes', 'actor_3_name', 'duration',
    'facenumber_in_poster', 'content_rating', 'country',
    'movie_imdb_link', 'aspect_ratio', 'plot_keywords',
]
movies = movies.drop(columns=cols_to_drop)
# Re-check the percentage of null values in the remaining columns
(100*(movies.isnull().sum()/len(movies.index)))

-  ### Subtask 2.3: Drop unnecessary rows using columns with high Null percentages

Now, on inspection you might notice that some columns have a large percentage (greater than 5%) of Null values. Drop all the rows which have Null values in such columns.

In [1]:
# Write your code for dropping the rows here
# The columns with more than 5% null values are `gross` and `budget`,
# so drop the rows where either is missing, then recompute the null percentages
movies = movies.dropna(subset=['gross', 'budget'])
(100*(movies.isnull().sum()/len(movies.index)))
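
The columns above the 5% threshold can also be picked out programmatically rather than by eye. The following is a minimal sketch on a made-up dataframe; `toy`, `high_null_cols` and `cleaned` are hypothetical names, and the data is invented so that one column sits above the threshold and one below it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 40 rows
n = 40
toy = pd.DataFrame({
    "gross": np.arange(n, dtype=float),
    "language": ["English"] * n,
})
toy.loc[:3, "gross"] = np.nan       # 4 of 40 missing -> 10%
toy.loc[10, "language"] = np.nan    # 1 of 40 missing -> 2.5%

# Columns whose null percentage exceeds the 5% threshold
null_pct = toy.isnull().mean() * 100
high_null_cols = null_pct[null_pct > 5].index.tolist()

# Drop only the rows that are null in one of those columns
cleaned = toy.dropna(subset=high_null_cols)
print(high_null_cols, len(cleaned))
```

Rows that are null only in a low-percentage column (here `language`) are kept, which matches the intent of the subtask.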

-  ### Subtask 2.4: Fill NaN values

You might notice that the `language` column has some NaN values. Here, on inspection, you will see that it is safe to replace all the missing values with `'English'`.

In [1]:
# Write your code for filling the NaN values in the 'language' column here
movies['language'] = movies['language'].fillna('English')
(100*(movies.isnull().sum()/len(movies.index)))
# Only a small percentage of `language` values is missing, so replacing them with 'English' is safe

-  ### Subtask 2.5: Check the number of retained rows

You might notice that two of the columns, viz. `num_critic_for_reviews` and `actor_1_name`, have small percentages of NaN values left. You can leave these columns as they are for now. Check the number and percentage of the rows retained after completing all the tasks above.

In [1]:
# Write your code for checking number of retained rows here
print(len(movies.index)) # Number of rows retained
(100*(len(movies.index)/5043)) # Percentage of rows retained, relative to the 5043 rows of the original dataframe

**Checkpoint 1:** You might have noticed that we still have around `77%` of the rows!

Yes, more than 77% of the rows are still retained.

## Task 3: Data Analysis

-  ### Subtask 3.1: Change the unit of columns

Convert the unit of the `budget` and `gross` columns from `$` to `million $`.

In [1]:
# Write your code for unit conversion here
movies[['gross','budget']] = movies[['gross','budget']] / 1000000 # Divide by 1 million and assign back so the conversion is kept
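
A quick way to confirm that a unit conversion took effect is to run it on a tiny frame and inspect the stored values. This is a minimal sketch with invented figures (`toy` is a hypothetical name). The point it illustrates is that arithmetic on a dataframe returns a new object, so the result only persists if it is assigned back.

```python
import pandas as pd

# Hypothetical toy frame with amounts in dollars
toy = pd.DataFrame({
    "gross": [250_000_000.0, 40_000_000.0],
    "budget": [100_000_000.0, 60_000_000.0],
})

# Convert both columns to millions of dollars and store the result
toy[["gross", "budget"]] = toy[["gross", "budget"]] / 1_000_000
print(toy)
```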

-  ### Subtask 3.2: Find the movies with highest profit

    1. Create a new column called `profit` which contains the difference of the two columns: `gross` and `budget`.
    2. Sort the dataframe using the `profit` column as reference.
    3. Plot `profit` (y-axis) vs `budget` (x- axis) and observe the outliers using the appropriate chart type.
    4. Extract the top ten profiting movies in descending order and store them in a new dataframe - `top10`

In [1]:
# Write your code for creating the profit column here
movies['profit'] = movies['gross'] - movies['budget']
movies

In [1]:
# Write your code for sorting the dataframe here
movies = movies.sort_values(by = 'profit', ascending = False) # sort_values returns a new dataframe, so assign it back
movies

In [1]:
# Write code for profit vs budget plot here
import matplotlib.pyplot as plt
plt.figure(figsize = (12,6),dpi = 100)
plt.scatter(x = movies['budget'],y = movies['profit'], alpha = 0.8,color = "red")
plt.ylabel("Profit",color="blue",size = 20)
plt.xlabel("Budget",color="blue",size = 20)
plt.title("Profit VS Budget",color = "red",size = 30)
plt.show()

There are outliers in the scatter plot.

In [1]:
# Write your code to get the top 10 profiting movies here
top10=movies.sort_values(by = ['profit'],ascending = False).head(10)
top10
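
As an alternative to sorting and then calling `head`, pandas offers `DataFrame.nlargest`, which selects the top rows by a column directly. A minimal sketch on invented data (the frame `toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy frame; amounts are in millions
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "gross": [10.0, 50.0, 30.0, 5.0],
    "budget": [20.0, 10.0, 5.0, 1.0],
})
toy["profit"] = toy["gross"] - toy["budget"]

# Two most profitable rows, already in descending order of profit
top2 = toy.nlargest(2, "profit")
print(top2)
```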

-  ### Subtask 3.3: Drop duplicate values

After you found out the top 10 profiting movies, you might have noticed a duplicate value. So, it seems like the dataframe has duplicate values as well. Drop the duplicate values from the dataframe and repeat `Subtask 3.2`. Note that the same `movie_title` can be there in different languages. 

In [1]:
# Write your code for dropping duplicate values here
movies = movies.drop_duplicates() # drop_duplicates returns a new dataframe, so assign it back

In [1]:
# Write code for repeating Subtask 3.2 here
movies = movies.sort_values(by = ['profit'], ascending = False)
top10 = movies.drop_duplicates().head(10)
top10
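
Calling `drop_duplicates()` with no arguments removes only rows that are identical in every column. If the same film appears twice with small differences in other columns, one possible approach is to deduplicate on a subset of columns. The sketch below assumes that title plus language identifies a film, which the assignment does not state explicitly; the frame `toy` and its values are made up.

```python
import pandas as pd

# Hypothetical toy frame: "Film X" appears twice in English and once in Hindi
toy = pd.DataFrame({
    "movie_title": ["Film X", "Film X", "Film X", "Film Y"],
    "language": ["English", "English", "Hindi", "English"],
    "profit": [10.0, 10.0, 4.0, 7.0],
})

# Keep the first row for each (title, language) pair
deduped = toy.drop_duplicates(subset=["movie_title", "language"])
print(deduped)
```

The Hindi version of "Film X" survives because the language differs, which is consistent with the note that the same `movie_title` can exist in different languages.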

**Checkpoint 2:** You might spot two movies directed by `James Cameron` in the list.

#### Insight:
Yes there are two movies directed by James Cameron in the list

-  ### Subtask 3.4: Find IMDb Top 250

    1. Create a new dataframe `IMDb_Top_250` and store the top 250 movies with the highest IMDb Rating (corresponding to the column: `imdb_score`). Also make sure that for all of these movies, the `num_voted_users` is greater than 25,000.
Also add a `Rank` column containing the values 1 to 250 indicating the ranks of the corresponding films.
    2. Extract all the movies in the `IMDb_Top_250` dataframe which are not in the English language and store them in a new dataframe named `Top_Foreign_Lang_Film`.

In [1]:
# Write your code for extracting the top 250 movies as per the IMDb score here. Make sure that you store it in a new dataframe 
# and name that dataframe as 'IMDb_Top_250'
eligible = movies.loc[movies.num_voted_users > 25000, :] # Keep only movies with more than 25,000 votes before ranking
IMDb_Top_250 = eligible.sort_values(by=['imdb_score'],ascending = False).head(250).copy()
IMDb_Top_250['Rank'] = range(1, len(IMDb_Top_250) + 1)
IMDb_Top_250
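
The order of operations matters here: the vote threshold has to be applied before taking the top rows, otherwise a film with few votes can occupy a slot in the list. A minimal sketch on invented data (`toy`, `eligible` and `top` are hypothetical names, and a top 2 stands in for the top 250):

```python
import pandas as pd

# Hypothetical toy frame; "B" has too few votes to qualify
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "imdb_score": [9.1, 8.7, 9.5, 8.9],
    "num_voted_users": [30000, 100, 50000, 26000],
})

# 1) filter on votes, 2) sort by score, 3) take the top rows, 4) add ranks
eligible = toy[toy["num_voted_users"] > 25000]
top = eligible.sort_values("imdb_score", ascending=False).head(2).copy()
top["Rank"] = range(1, len(top) + 1)
print(top)
```

Using `len(top)` for the rank range avoids a length mismatch if fewer rows qualify than requested.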

In [1]:
# Write your code to extract top foreign language films from 'IMDb_Top_250' here
Top_Foreign_Lang_Film = IMDb_Top_250.loc[(IMDb_Top_250.language != 'English'),:]
Top_Foreign_Lang_Film

**Checkpoint 3:** Can you spot `Veer-Zaara` in the dataframe?

#### Insight:
Yes, `Veer-Zaara` is in the dataframe, at rank 240.

- ### Subtask 3.5: Find the best directors

    1. Group the dataframe using the `director_name` column.
    2. Find out the top 10 directors for whom the mean of `imdb_score` is the highest and store them in a new dataframe `top10director`. In case of a tie in IMDb score between two directors, sort them alphabetically.

In [1]:
# Write your code for extracting the top 10 directors here
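# Mean IMDb score per director, as a dataframe
director_scores = movies.groupby('director_name')['imdb_score'].mean().reset_index()
# Highest mean first; ties are broken alphabetically by director name
top10director = director_scores.sort_values(by = ['imdb_score', 'director_name'], ascending = [False, True]).head(10)
print(top10director)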
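
The alphabetical tie-break can be expressed by sorting on two keys with different directions. A minimal sketch on invented data, where two hypothetical directors share the same mean score:

```python
import pandas as pd

# Hypothetical toy frame: "Zed" and "Amy" both average 8.5
toy = pd.DataFrame({
    "director_name": ["Zed", "Zed", "Amy", "Bob"],
    "imdb_score": [8.0, 9.0, 8.5, 7.0],
})

# Mean score per director, kept as a regular column
means = toy.groupby("director_name", as_index=False)["imdb_score"].mean()

# Score descending, then name ascending to break ties
ranked = means.sort_values(
    ["imdb_score", "director_name"], ascending=[False, True]
)
print(ranked)
```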

**Checkpoint 4:** No surprises that `Damien Chazelle` (director of Whiplash and La La Land) is in this list.

#### Insight:
Yes, Damien Chazelle is in this list.

-  ### Subtask 3.6: Find popular genres

You might have noticed the `genres` column in the dataframe with all the genres of the movies separated by a pipe (`|`). Out of all the movie genres, the first two are most significant for any film.

1. Extract the first two genres from the `genres` column and store them in two new columns: `genre_1` and `genre_2`. Some of the movies might have only one genre. In such cases, extract the single genre into both the columns, i.e. for such movies the `genre_2` will be the same as `genre_1`.
2. Group the dataframe using `genre_1` as the primary column and `genre_2` as the secondary column.
3. Find out the 5 most popular combo of genres by finding the mean of the gross values using the `gross` column and store them in a new dataframe named `PopGenre`.

In [1]:
# Write your code for extracting the first two genres of each movie here
first = movies['genres'].str.split('|', expand=True) # One column per genre position
movies['genre_1'] = first[0]
movies['genre_2'] = first[1].fillna(first[0]) # Single-genre movies reuse genre_1
movies = movies.drop(['genres'], axis = 1)
movies
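
The single-genre rule is easy to verify on a few hand-written strings. This is a minimal sketch with made-up rows (`toy` and `parts` are hypothetical names) using the vectorised `Series.str.split` with `expand=True`, which pads missing positions with null values.

```python
import pandas as pd

# Hypothetical toy frame: the second movie has only one genre
toy = pd.DataFrame({
    "genres": ["Action|Adventure|Sci-Fi", "Drama", "Comedy|Romance"],
})

# Split on the pipe; missing positions become null
parts = toy["genres"].str.split("|", expand=True)
toy["genre_1"] = parts[0]
# Fall back to the first genre where there is no second one
toy["genre_2"] = parts[1].fillna(parts[0])
print(toy)
```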

In [1]:
# Write your code for grouping the dataframe here
movies_by_segment = movies.groupby(['genre_1','genre_2'])
movies_by_segment

In [1]:
# Write your code for getting the 5 most popular combo of genres here
PopGenre = movies_by_segment['gross'].mean().sort_values(ascending=False).head(5)
PopGenre

**Checkpoint 5:** Well, as it turns out, `Family + Sci-Fi` is the most popular combo of genres out there!

-  ### Subtask 3.7: Find the critic-favorite and audience-favorite actors

    1. Create three new dataframes namely, `Meryl_Streep`, `Leo_Caprio`, and `Brad_Pitt` which contain the movies in which the actors: 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' are the lead actors. Use only the `actor_1_name` column for extraction. Also, make sure that you use the names 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' for the said extraction.
    2. Append the rows of all these dataframes and store them in a new dataframe named `Combined`.
    3. Group the combined dataframe using the `actor_1_name` column.
    4. Find the mean of the `num_critic_for_reviews` and `num_user_for_reviews` columns and identify the actor with the highest mean in each.
    5. Observe the change in number of voted users over decades using a bar chart. Create a column called `decade` which represents the decade to which every movie belongs. For example, the `title_year` values 1923 and 1925 should both be stored as 1920s. Sort the dataframe based on the column `decade`, group it by `decade` and find the sum of users voted in each decade. Store this in a new dataframe called `df_by_decade`.

In [1]:
# Write your code for creating three new dataframes here
# Include all movies in which Meryl_Streep is the lead

Meryl_Streep = movies.loc[(movies.actor_1_name=='Meryl Streep'),:]
Meryl_Streep

In [1]:
# Include all movies in which Leo_Caprio is the lead
Leo_Caprio = movies.loc[(movies.actor_1_name=='Leonardo DiCaprio'),:]
Leo_Caprio

In [1]:
# Include all movies in which Brad_Pitt is the lead
Brad_Pitt = movies.loc[(movies.actor_1_name=='Brad Pitt'),:]
Brad_Pitt

In [1]:
# Write your code for combining the three dataframes here
Combined = pd.concat([Meryl_Streep, Leo_Caprio, Brad_Pitt]) # DataFrame.append was removed in pandas 2.0
Combined

In [1]:
# Write your code for grouping the combined dataframe here
actor_name = Combined.groupby('actor_1_name')
actor_name

In [1]:
# Write the code for finding the mean of critic reviews and audience reviews here
critic_reviews = actor_name['num_critic_for_reviews'].mean().sort_values(ascending=False)
print(critic_reviews)
audience_reviews = actor_name['num_user_for_reviews'].mean().sort_values(ascending=False)
print(audience_reviews)
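
The extract, combine, group and average steps can also be written compactly with `isin` and a single grouped mean over both review columns. This is a minimal sketch on invented numbers (the frame `toy` and every value in it are assumptions, so the output says nothing about the real dataset):

```python
import pandas as pd

# Hypothetical toy frame with made-up review counts
toy = pd.DataFrame({
    "actor_1_name": ["Meryl Streep", "Brad Pitt", "Leonardo DiCaprio",
                     "Brad Pitt", "Someone Else"],
    "num_critic_for_reviews": [100, 200, 400, 300, 50],
    "num_user_for_reviews": [300, 500, 900, 700, 10],
})

# Keep only the three lead actors of interest
actors = ["Meryl Streep", "Leonardo DiCaprio", "Brad Pitt"]
combined = toy[toy["actor_1_name"].isin(actors)]

# Mean of both review columns per actor, in one table
means = combined.groupby("actor_1_name")[
    ["num_critic_for_reviews", "num_user_for_reviews"]
].mean()
print(means)
print(means.idxmax())  # actor with the highest mean in each column
```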

**Checkpoint 6:** `Leonardo` has aced both the lists!

#### Insight:
Yes Leonardo has aced both the lists!

In [1]:
# Write the code for calculating decade here
# Floor each year to the start of its decade and label it, e.g. 1923 -> '1920s'
movies['decade'] = (movies['title_year'] // 10 * 10).astype(np.int64).astype(str) + 's'
movies.drop(['title_year'], axis = 1, inplace = True)
movies = movies.sort_values(['decade'])
movies

In [1]:
# Write your code for creating the data frame df_by_decade here
df_by_decade = movies.groupby('decade')[['num_voted_users']].sum() # Double brackets keep the result a dataframe
df_by_decade
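
The decade labelling and the per-decade sum can be checked with the example years from the task description. A minimal sketch on invented vote counts (`toy` and `by_decade` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy frame; years are stored as floats, as they often are after a CSV import
toy = pd.DataFrame({
    "title_year": [1923.0, 1925.0, 1999.0, 2010.0],
    "num_voted_users": [10, 20, 30, 40],
})

# Floor division by 10 drops the last digit; multiplying back gives the decade start
toy["decade"] = (toy["title_year"] // 10 * 10).astype(int).astype(str) + "s"

# Total votes per decade
by_decade = toy.groupby("decade")["num_voted_users"].sum()
print(by_decade)
```

The integer conversion fails if `title_year` contains missing values, so those rows would need to be handled first.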

In [1]:
# Write your code for plotting number of voted users vs decade
df_by_decade.plot.bar(figsize=(16,8) , color = 'red')
plt.xlabel('Decade',color='black',size=20) # Decades are on the x-axis of the bar chart
plt.ylabel('Number of Voted Users',color='black',size = 20)
plt.title('Number of Voted Users VS Decade',color='black',size=30)
plt.show()