The IMDB Movie dataset is a movie review dataset. It is used to analyze how much interest movies attract according to 
criteria such as director, actors, and title, and to provide perspectives that can support future predictions.

# 1. Import libraries and load dataset: 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
dataset_path = "./data/IMDB-Movie-Data.csv"

In [3]:
# Read data from .csv file
data = pd.read_csv(dataset_path)

In addition, we can specify a column to use as the index of the data table when reading it (by default, pandas creates 
an integer index starting at 0). Here, we choose the Title column as the index. pandas does not require index values 
to be unique, but unique values keep label-based lookups unambiguous:

In [4]:
# Read the data with an explicit index column.
# We will use this later in our analysis
data_indexed = pd.read_csv(dataset_path, index_col="Title")

In [5]:
data_indexed.head()

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
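
As a minimal sketch of why a Title index is useful, the example below reads a small CSV from memory, checks that the titles are unique, and looks up one movie by its label. The three-row `csv_text` excerpt and the names `sample_indexed`, `titles_unique`, and `prometheus` are assumptions made for illustration; they are not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for IMDB-Movie-Data.csv
csv_text = """Rank,Title,Director,Year,Rating
1,Guardians of the Galaxy,James Gunn,2014,8.1
2,Prometheus,Ridley Scott,2012,7.0
3,Split,M. Night Shyamalan,2016,7.3
"""

sample_indexed = pd.read_csv(io.StringIO(csv_text), index_col="Title")

# Confirm that every title is unique before relying on label lookups
titles_unique = sample_indexed.index.is_unique

# Select a single row by its title
prometheus = sample_indexed.loc["Prometheus"]
print(titles_unique)
print(prometheus["Director"], prometheus["Year"])
```

If `is_unique` were False, `.loc` with a repeated title would return several rows instead of one.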


# 2. View the data: Look through the first 5 rows of the data table using head()

In [6]:
# Preview the top 5 rows using head()
data.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


# 3. Understand some basic information about the data:

In [7]:
# Let's first understand the basic information about this data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB


In [8]:
data.describe()

Unnamed: 0,Rank,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
count,1000.0,1000.0,1000.0,1000.0,1000.0,872.0,936.0
mean,500.5,2012.783,113.172,6.7232,169808.3,82.956376,58.985043
std,288.819436,3.205962,18.810908,0.945429,188762.6,103.25354,17.194757
min,1.0,2006.0,66.0,1.9,61.0,0.0,11.0
25%,250.75,2010.0,100.0,6.2,36309.0,13.27,47.0
50%,500.5,2014.0,111.0,6.8,110799.0,47.985,59.5
75%,750.25,2016.0,123.0,7.4,239909.8,113.715,72.0
max,1000.0,2016.0,191.0,9.0,1791916.0,936.63,100.0


- The minimum and maximum values of Year show that the dataset contains movies from 2006 to 2016.
- The average movie rating is 6.7; the lowest is 1.9 and the highest is 9.0.
- The highest revenue achieved was 936.63 million dollars.
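
The same figures can also be read programmatically instead of by eye. The sketch below assumes a hypothetical five-row `sample` frame (values copied from the rows shown by head()), so its numbers differ from the full-dataset statistics above.

```python
import pandas as pd

# Hypothetical excerpt; values copied from the first five rows displayed earlier
sample = pd.DataFrame({
    "Year": [2014, 2012, 2016, 2016, 2016],
    "Rating": [8.1, 7.0, 7.3, 7.2, 6.2],
    "Revenue (Millions)": [333.13, 126.46, 138.12, 270.32, 325.02],
})

# Several statistics for one column in a single call
year_range = sample["Year"].agg(["min", "max"])
mean_rating = sample["Rating"].mean()

# describe() returns a DataFrame, so a single cell can be read with .loc
max_revenue = sample.describe().loc["max", "Revenue (Millions)"]

print(year_range["min"], year_range["max"])
print(round(mean_rating, 2), max_revenue)
```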

# 4. Data Selection 

- Indexing and slicing data: From the data table, we can extract any column as a Series or a DataFrame, depending on 
the selection method we use. Here, we extract some columns using indexing. To extract a single column as a Series, 
do:

In [9]:
# Extract data as a Series
genre = data["Genre"]
genre.to_frame()

Unnamed: 0,Genre
0,"Action,Adventure,Sci-Fi"
1,"Adventure,Mystery,Sci-Fi"
2,"Horror,Thriller"
3,"Animation,Comedy,Family"
4,"Action,Adventure,Fantasy"
...,...
995,"Crime,Drama,Mystery"
996,Horror
997,"Drama,Music,Romance"
998,"Adventure,Comedy"
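
To make the Series versus DataFrame distinction concrete, here is a small sketch on a hypothetical three-row `sample` frame. Single brackets with one column name return a Series, while a list of names returns a DataFrame, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical three-row excerpt of the data table
sample = pd.DataFrame({
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split"],
    "Genre": ["Action,Adventure,Sci-Fi", "Adventure,Mystery,Sci-Fi", "Horror,Thriller"],
})

genre_series = sample["Genre"]    # one column name -> Series
genre_frame = sample[["Genre"]]   # list of column names -> DataFrame

print(type(genre_series).__name__, genre_series.shape)
print(type(genre_frame).__name__, genre_frame.shape)
```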


In [10]:
data.columns

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

We can also select multiple columns at the same time, which creates a new DataFrame:

In [11]:
some_col = data[["Title","Genre","Actors","Director", "Rating"]]
some_col.head()

Unnamed: 0,Title,Genre,Actors,Director,Rating
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi","Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",James Gunn,8.1
1,Prometheus,"Adventure,Mystery,Sci-Fi","Noomi Rapace, Logan Marshall-Green, Michael Fa...",Ridley Scott,7.0
2,Split,"Horror,Thriller","James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",M. Night Shyamalan,7.3
3,Sing,"Animation,Comedy,Family","Matthew McConaughey,Reese Witherspoon, Seth Ma...",Christophe Lourdelet,7.2
4,Suicide Squad,"Action,Adventure,Fantasy","Will Smith, Jared Leto, Margot Robbie, Viola D...",David Ayer,6.2


To select rows, we can take a range of rows from position X up to (but not including) position Y in the data table. 
This is called slicing. For example, to select the rows at positions 10 through 14, do the following:


In [12]:
data.iloc[10:15][["Title","Genre","Actors","Director", "Rating"]]

Unnamed: 0,Title,Genre,Actors,Director,Rating
10,Fantastic Beasts and Where to Find Them,"Adventure,Family,Fantasy","Eddie Redmayne, Katherine Waterston, Alison Su...",David Yates,7.5
11,Hidden Figures,"Biography,Drama,History","Taraji P. Henson, Octavia Spencer, Janelle Mon...",Theodore Melfi,7.8
12,Rogue One,"Action,Adventure,Sci-Fi","Felicity Jones, Diego Luna, Alan Tudyk, Donnie...",Gareth Edwards,7.9
13,Moana,"Animation,Adventure,Comedy","Auli'i Cravalho, Dwayne Johnson, Rachel House,...",Ron Clements,7.7
14,Colossal,"Action,Comedy,Drama","Anne Hathaway, Jason Sudeikis, Austin Stowell,...",Nacho Vigalondo,6.4
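
The output stops at row 14 because `iloc` slices by position and excludes the end position. Label-based slicing with `loc` behaves differently and includes the end label. The sketch below illustrates the difference on a hypothetical 20-row frame with the default integer index.

```python
import pandas as pd

# Hypothetical frame with 20 rows and the default RangeIndex
sample = pd.DataFrame({"Rank": range(1, 21)})

by_position = sample.iloc[10:15]   # positions 10..14, end excluded
by_label = sample.loc[10:15]       # labels 10..15, end included

print(len(by_position), len(by_label))
```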


In [13]:
data.columns

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

# 5. Data Selection - Conditional filtering

We can retrieve rows from the data table that satisfy a set of conditions. For example, we want movies released 
after 2010 and before 2015 (that is, 2011 to 2014), with a rating below 5.0 and revenue in the top 10% of the 
entire dataset. We can implement this as follows:

In [14]:
threshold = data["Revenue (Millions)"].quantile(0.9)
filtered_data = data[((data["Year"] > 2010) & (data["Year"] < 2015)) & (data["Rating"] < 5.0) & (data["Revenue (Millions)"] > threshold)]

In [15]:
filtered_data

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
925,926,The Twilight Saga: Breaking Dawn - Part 1,"Adventure,Drama,Fantasy",The Quileutes close in on expecting parents Ed...,Bill Condon,"Kristen Stewart, Robert Pattinson, Taylor Laut...",2011,117,4.9,190244,281.28,45.0
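
A long boolean expression can be easier to read when each condition gets its own name. The following sketch applies the same three conditions to a hypothetical ten-row frame; the titles and numbers are made up for illustration, and `between` is used here as an alternative to the pair of year comparisons.

```python
import pandas as pd

# Hypothetical ten-row frame; the numbers are made up for illustration
sample = pd.DataFrame({
    "Title": [f"Movie {i}" for i in range(10)],
    "Year": [2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2012, 2011],
    "Rating": [7.0, 4.5, 6.1, 4.8, 7.9, 5.5, 4.0, 8.2, 6.6, 4.9],
    "Revenue (Millions)": [10, 20, 30, 40, 50, 60, 70, 80, 90, 300],
})

# Revenue value below which 90% of the rows fall
threshold = sample["Revenue (Millions)"].quantile(0.9)

# Name each condition so the combined mask is easy to read
in_years = sample["Year"].between(2011, 2014)   # inclusive on both ends
low_rating = sample["Rating"] < 5.0
high_revenue = sample["Revenue (Millions)"] > threshold

filtered = sample[in_years & low_rating & high_revenue]
print(filtered["Title"].tolist())
```

Each mask is a boolean Series, so the masks can be inspected or reused separately before they are combined with `&`.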


# 6. Groupby Operations

Groupby splits the data into groups based on one or more variables (here, columns of the data table) and then 
computes a result for each group. For example, we can find the average rating achieved by each director by grouping 
movie ratings by Director.

In [16]:
data.groupby("Director")["Rating"].mean().to_frame().sort_values("Rating", ascending = False)

Director,Rating
Nitesh Tiwari,8.80
Christopher Nolan,8.68
Olivier Nakache,8.60
Makoto Shinkai,8.60
Aamir Khan,8.50
...,...
Micheal Bafaro,3.50
Jonathan Holbrook,3.20
Shawn Burkett,2.70
James Wong,2.70


In [17]:
data.groupby("Director")["Votes"].mean().to_frame().sort_values("Votes", ascending = False)

Director,Votes
Christopher Nolan,1311817.0
James Cameron,935408.0
Joss Whedon,781241.5
Quentin Tarantino,639896.5
Tim Miller,627797.0
...,...
Mark Williams,164.0
Alexi Pappas,115.0
Gillies MacKinnon,102.0
David Leveaux,96.0


In [18]:
data.groupby("Director")["Revenue (Millions)"].sum().to_frame().sort_values("Revenue (Millions)", ascending = False)

Director,Revenue (Millions)
J.J. Abrams,1683.45
David Yates,1630.51
Christopher Nolan,1515.09
Michael Bay,1421.32
Francis Lawrence,1299.81
...,...
Jalil Lespert,0.00
Jamal Hill,0.00
James Franco,0.00
James Lapine,0.00
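
Some of the 0.00 totals at the bottom of the revenue table may come from missing values rather than from real zero revenue, because a grouped `sum()` skips NaN by default. The sketch below shows this behavior on a hypothetical three-row frame (the director names are placeholders), along with `min_count=1` to keep all-missing groups as NaN and `agg` to compute several statistics at once.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "Director B" has no recorded revenue at all
sample = pd.DataFrame({
    "Director": ["Director A", "Director A", "Director B"],
    "Rating": [8.0, 7.0, 6.5],
    "Revenue (Millions)": [100.0, 50.0, np.nan],
})

grouped = sample.groupby("Director")

# Default sum() skips NaN, so an all-missing group adds up to 0.0
default_sum = grouped["Revenue (Millions)"].sum()

# min_count=1 requires at least one real value, otherwise the result is NaN
strict_sum = grouped["Revenue (Millions)"].sum(min_count=1)

# Several statistics per group in one call
rating_stats = grouped["Rating"].agg(["mean", "count"])

print(default_sum["Director B"], strict_sum["Director B"])
print(rating_stats)
```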


# 7. Sorting Operations: 

Sorting allows us to sort rows in a data table in ascending/descending order based on the 
value of a certain column in the data table. For example, based on the groupby results of the previous 
section, we can find the top 5 directors with the highest average ratings as follows:

In [19]:
data.groupby("Director")["Rating"].mean().to_frame().sort_values("Rating", ascending = False).head()

Director,Rating
Nitesh Tiwari,8.8
Christopher Nolan,8.68
Olivier Nakache,8.6
Makoto Shinkai,8.6
Aamir Khan,8.5
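
When only the top few entries are needed, `nlargest` is a possible shortcut for sorting in descending order and then calling head(). The sketch below compares the two on a hypothetical four-row frame with placeholder director names.

```python
import pandas as pd

# Hypothetical frame with placeholder director names
sample = pd.DataFrame({
    "Director": ["Director A", "Director A", "Director B", "Director C"],
    "Rating": [8.0, 7.0, 9.0, 6.0],
})

mean_rating = sample.groupby("Director")["Rating"].mean()

# Two ways to get the two highest average ratings
top_sorted = mean_rating.sort_values(ascending=False).head(2)
top_nlargest = mean_rating.nlargest(2)

print(top_nlargest)
```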


# 8. View missing values:

Datasets often have missing values in some fields of some samples. We need to handle this before analyzing the 
data. The first step is to check where values are missing:

In [20]:
# Count the null values in each column
data.isnull().sum().to_frame()

Unnamed: 0,0
Rank,0
Title,0
Genre,0
Description,0
Director,0
Actors,0
Year,0
Runtime (Minutes),0
Rating,0
Votes,0
Revenue (Millions),128
Metascore,64


Revenue (Millions) and Metascore are the two columns that contain null values (128 and 64 respectively, which 
matches the non-null counts of 872 and 936 reported by info()). There are two main options for handling missing 
data: replace the missing entries with some value, or remove them.
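
Before choosing between the two options, it can help to see how the missing values are distributed. The sketch below, on a hypothetical four-row frame, counts missing values per column and per row and computes the share of missing values in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two columns
sample = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Revenue (Millions)": [333.13, np.nan, 138.12, np.nan],
    "Metascore": [76.0, 65.0, np.nan, np.nan],
})

missing_per_column = sample.isnull().sum()      # one count per column
missing_per_row = sample.isnull().sum(axis=1)   # one count per row
missing_share = sample.isnull().mean()          # fraction missing per column

print(missing_per_column)
print(missing_per_row.tolist())
print(missing_share["Revenue (Millions)"])
```

A column with a large share of missing values is a candidate for removal, while a column with only a few gaps is usually easier to fill.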

# 9. Dealing with missing values - Deleting

- Deleting: For the removal option, we can remove an entire column that contains many null values (if the analysis 
can do without it) or remove only the rows that contain null values. To delete a column, we do:

In [21]:
data.drop("Metascore", axis = 1).head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions)
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02


In [22]:
data.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


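As the data.head() output above shows, the Metascore column is still present in data. drop() returns a new 
DataFrame and leaves the original unchanged; to keep the change, assign the result back to a variable or pass 
inplace=True.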

To delete rows that contain null values, we use dropna():

In [23]:
data.dropna().head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
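
Like drop(), dropna() returns a new DataFrame. The sketch below, on a hypothetical three-row frame, shows that the original stays unchanged and that the `subset` parameter restricts the null check to chosen columns, which removes fewer rows than checking every column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each numeric column
sample = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Revenue (Millions)": [333.13, np.nan, 138.12],
    "Metascore": [76.0, 65.0, np.nan],
})

# drop() and dropna() return new objects; `sample` itself is unchanged
without_metascore = sample.drop(columns="Metascore")
complete_rows = sample.dropna()

# subset= limits the null check to the listed columns
has_revenue = sample.dropna(subset=["Revenue (Millions)"])

print(sample.shape, without_metascore.shape)
print(len(complete_rows), len(has_revenue))
```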


# 10. Dealing with missing values - Filling:

In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB


In [25]:
aver_revenue = data["Revenue (Millions)"].mean()
print(f"The average of revenue is {aver_revenue} Millions")

The average of revenue is 82.95637614678898 Millions


In [26]:
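# We can fill the missing values with this mean revenue.
# Assign the result back: recent pandas versions warn about fillna(inplace=True) on a column selection
data["Revenue (Millions)"] = data["Revenue (Millions)"].fillna(aver_revenue)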

In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  1000 non-null   float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
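
The mean is only one possible fill value. As a sketch under stated assumptions, the example below uses a hypothetical four-row frame to fill one column by assignment and then fills another column with its median by passing a dict to fillna(). Choosing the median for Metascore is an illustration, not a recommendation from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both columns
sample = pd.DataFrame({
    "Revenue (Millions)": [100.0, np.nan, 50.0, np.nan],
    "Metascore": [76.0, 65.0, np.nan, 40.0],
})

# mean() skips NaN, so this is (100 + 50) / 2
mean_revenue = sample["Revenue (Millions)"].mean()
sample["Revenue (Millions)"] = sample["Revenue (Millions)"].fillna(mean_revenue)

# A dict maps column names to fill values; other columns are left alone
filled = sample.fillna({"Metascore": sample["Metascore"].median()})

print(sample["Revenue (Millions)"].tolist())
print(filled["Metascore"].tolist())
```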


# 11. apply() functions: 

The apply() function is used when we want to run a function on every value of a column (or on every row or column 
of a DataFrame). The value returned by the function becomes the new value at the corresponding position. 
For example, if we want to classify movies into three levels ['Good', 'Average', 'Bad'] based on Rating, we 
can define a function to do this and apply it to the Rating column:

In [28]:
# Classify movies based on ratings
def classify_rating(rating):
    if rating >= 7.5:
        return "Good"
    elif rating >= 6.0:
        return "Average"
    else:
        return "Bad"

In [29]:
# Lets apply this function on our movies data
# creating a new variable in the dataset to hold the rating category
data["Rating Category"] = data["Rating"].apply(classify_rating)

In [32]:
data[["Title","Genre","Rating","Rating Category"]].head()

Unnamed: 0,Title,Genre,Rating,Rating Category
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",8.1,Good
1,Prometheus,"Adventure,Mystery,Sci-Fi",7.0,Average
2,Split,"Horror,Thriller",7.3,Average
3,Sing,"Animation,Comedy,Family",7.2,Average
4,Suicide Squad,"Action,Adventure,Fantasy",6.2,Average


The output above shows the DataFrame after the classify_rating() function is applied. The value returned for each 
row is stored in a new column named Rating Category.
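
For binning numeric values into labeled ranges, `pd.cut` is a possible vectorized alternative to apply(). The sketch below uses a hypothetical `ratings` Series and reproduces the thresholds of classify_rating(); `right=False` is needed so that a rating of exactly 6.0 or 7.5 falls into the higher category, as it does in the function.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, including the boundary values 7.5 and 6.0
ratings = pd.Series([8.1, 7.0, 7.5, 6.0, 5.9], name="Rating")

def classify_rating(rating):
    if rating >= 7.5:
        return "Good"
    elif rating >= 6.0:
        return "Average"
    return "Bad"

via_apply = ratings.apply(classify_rating)

# right=False makes each bin include its left edge: [6.0, 7.5) -> "Average"
via_cut = pd.cut(
    ratings,
    bins=[-np.inf, 6.0, 7.5, np.inf],
    labels=["Bad", "Average", "Good"],
    right=False,
)

print(via_apply.tolist())
print(via_cut.astype(str).tolist())
```

Unlike apply(), `pd.cut` returns a categorical column, which keeps the three levels in a defined order.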