# **Link to the Colab notebook:**
https://colab.research.google.com/drive/1hH12iuzH8RkR6sSw_k3IvYRa0TfSAObI?usp=sharing

# **Link to the dataset:**
https://drive.google.com/file/d/1wfez4vnyXcBe-sKEVqIbFLq09WRoLNKv/view?usp=sharing


### **Step 1: Importing Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualisation style
sns.set(style="whitegrid")

### **Step 2: Loading the Dataset**

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [3]:
df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/Exploratory_Data_Analysis/Dataset/IMdB_India_Top250.csv')
df.head(10)

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,"$138,288.00",945,23 wins & 32 nominations,The real-life story of IPS Officer Manoj Kumar...,SonyLIV
1,Gol Maal,1979,2 hours,8.5,20K,Bollywood (Hindi),Comedy,Hrishikesh Mukherjee,NIL,48,3 wins & 1 nomination,A man's simple lie to secure his job escalates...,"Amazon Prime Video, YouTube, Zee5"
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,"$975,543.00",370,2 nominations,A barber seeks vengeance after his home is bur...,Netflix
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,"$120,481.93",237,7 wins & 1 nomination,A common man's struggles against a corrupt pol...,"Amazon Prime Video, YouTube"
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,"$134,241.00",62,4 wins & 2 nominations total,This final installment in Satyajit Ray's Apu T...,"Amazon Prime Video, Hoichoi"
5,Anbe Sivam,2003,2 hours 40 minutes,8.6,26K,Kollywood (Tamil),"Drama, Comedy",Sundar C.,NIL,115,2 wins & 3 nominations,"Two men, one young and arrogant, the other dam...",Disney+ Hotstar
6,Pariyerum Perumal,2018,2 hours 34 minutes,8.7,19K,Kollywood (Tamil),"Drama, Social",Mari Selvaraj,NIL,169,11 wins & 5 nominations,A law student from a lower caste begins a frie...,Amazon Prime Video
7,3 Idiots,2009,2 hours 50 minutes,8.4,441K,Bollywood (Hindi),"Comedy, Drama",Rajkumar Hirani,"$60,262,836.00",1000,64 wins & 30 nominations,Two friends are searching for their long lost ...,"Netflix, Amazon Prime Video"
8,#Home,2021,2 hours 38 minutes,8.8,16K,Mollywood (Malayalam),"Drama, Family",Rojin Thomas,NIL,541,3 wins & 2 nominations,Oliver Twist (Indrans) wants to be tech-savvy ...,Amazon Prime Video
9,Black Friday,2004,2 hours 23 minutes,8.4,22K,Bollywood (Hindi),"Crime, Drama, Thriller",Anurag Kashyap,"$1,610,897.00",88,1 win & 7 nominations,A film about the investigations following the ...,YouTube


### **Step 3: Exploring the Data**

In [4]:
# Get basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Movie name             250 non-null    object 
 1   Year of release        250 non-null    int64  
 2   Watch  hour            250 non-null    object 
 3   Rating                 250 non-null    float64
 4   Ratedby                250 non-null    object 
 5   Film Industry          250 non-null    object 
 6   Genre                  250 non-null    object 
 7   Director               250 non-null    object 
 8   Box office collection  250 non-null    object 
 9   User reviews           250 non-null    int64  
 10  Awards                 250 non-null    object 
 11  Description            250 non-null    object 
 12  Streaming platform     250 non-null    object 
dtypes: float64(1), int64(2), object(10)
memory usage: 25.5+ KB


In [5]:
# Summary statistics for numerical columns
df.describe()

Unnamed: 0,Year of release,Rating,User reviews
count,250.0,250.0,250.0
mean,2008.388,8.1904,222.184
std,14.708597,0.252513,388.15949
min,1955.0,7.7,10.0
25%,2003.0,8.0,39.0
50%,2013.0,8.15,94.5
75%,2018.0,8.3,231.5
max,2024.0,9.1,3300.0


In [6]:
# Check for missing values
df.isnull().sum()

Unnamed: 0,0
Movie name,0
Year of release,0
Watch hour,0
Rating,0
Ratedby,0
Film Industry,0
Genre,0
Director,0
Box office collection,0
User reviews,0


# **IMPORTANT NOTE:**

In this dataset, missing values are not stored as NaN but are written as the string NIL, so df.isnull().sum() cannot identify them as missing and reports 0 for every column. We therefore replace NIL with NaN so that df.isnull().sum() can count them.
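
As an alternative to replacing the values after loading, the placeholder can be declared at read time. The sketch below is only an illustration: it uses a made-up three-row CSV held in memory rather than the real file, and it assumes the placeholder appears exactly as NIL with no surrounding whitespace.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file
sample_csv = io.StringIO(
    'Movie name,Box office collection\n'
    '12th Fail,"$138,288.00"\n'
    'Gol Maal,NIL\n'
    'Maharaja,"$975,543.00"\n'
)

# na_values tells read_csv to treat the exact string "NIL" as missing
sample = pd.read_csv(sample_csv, na_values=["NIL"])
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

If the placeholder carries stray spaces in the file, it will not match and has to be stripped first, which is what the cleaning function further down does.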

In [7]:
# Strip whitespace from column names
df.columns = df.columns.str.strip()
# Print columns to verify the correct name
print(df.columns)

# Replace "NIL" with NaN in the entire DataFrame
df.replace("NIL", np.nan, inplace=True)

# Replace "NIL" with NaN in the 'Box office collection' column specifically
# (assign the result back instead of calling replace(..., inplace=True) on a column selection)
df['Box office collection'] = df['Box office collection'].replace("NIL", np.nan)


Index(['Movie name', 'Year of release', 'Watch  hour', 'Rating', 'Ratedby',
       'Film Industry', 'Genre', 'Director', 'Box office collection',
       'User reviews', 'Awards', 'Description', 'Streaming platform'],
      dtype='object')


In [8]:
# Function to clean the 'Box office collection' column
def clean_box_office(value):
    if pd.isna(value):  # Already marked as missing
        return np.nan
    value = value.strip()  # Remove leading/trailing spaces
    if value in ('NIL', '$0', '$0.00'):
        return np.nan  # Treat NIL and zero collections as missing
    # Remove $ and commas, then convert to float
    return float(value.replace('$', '').replace(',', ''))

# Apply the function to clean the 'Box office collection' column
df['Box office collection'] = df['Box office collection'].apply(clean_box_office)

# Print the result to verify
df.head()


Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.0,945,23 wins & 32 nominations,The real-life story of IPS Officer Manoj Kumar...,SonyLIV
1,Gol Maal,1979,2 hours,8.5,20K,Bollywood (Hindi),Comedy,Hrishikesh Mukherjee,,48,3 wins & 1 nomination,A man's simple lie to secure his job escalates...,"Amazon Prime Video, YouTube, Zee5"
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.0,370,2 nominations,A barber seeks vengeance after his home is bur...,Netflix
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,120481.93,237,7 wins & 1 nomination,A common man's struggles against a corrupt pol...,"Amazon Prime Video, YouTube"
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.0,62,4 wins & 2 nominations total,This final installment in Satyajit Ray's Apu T...,"Amazon Prime Video, Hoichoi"
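
The same cleaning can also be written without a row-wise function. This is a minimal sketch on a made-up series (`raw` is hypothetical sample data, not the real column), assuming the only non-numeric entries are placeholders such as NIL:

```python
import pandas as pd

# Hypothetical raw values, including a placeholder, stray spaces and a zero
raw = pd.Series(["$138,288.00", "NIL", " $975,543.00 ", "$0"])

# Strip whitespace, drop '$' and ',', then let to_numeric turn leftovers such as 'NIL' into NaN
cleaned = pd.to_numeric(
    raw.str.strip().str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
# Treat a zero collection as missing, as the notebook cell does
cleaned = cleaned.mask(cleaned == 0)
print(cleaned.tolist())
```

Note that `errors="coerce"` silently converts any unparseable string to NaN, so it is worth checking the count of missing values afterwards.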


In [9]:
# Check for missing values
df.isnull().sum()

Unnamed: 0,0
Movie name,0
Year of release,0
Watch hour,0
Rating,0
Ratedby,0
Film Industry,0
Genre,0
Director,0
Box office collection,52
User reviews,0


In [10]:
# Number of unique values in each categorical (non-numeric) column
cols = df.columns
num_cols = df.select_dtypes(include='number').columns
cat_cols = list(set(cols) - set(num_cols))

df[cat_cols].nunique()


Unnamed: 0,0
Movie name,248
Director,182
Film Industry,9
Streaming platform,28
Ratedby,110
Awards,184
Description,250
Genre,106
Watch hour,79


In [11]:
# Unique values in categorical columns
print(df['Streaming platform'].unique())
print(df['Film Industry'].unique())

['SonyLIV' 'Amazon Prime Video, YouTube, Zee5' 'Netflix'
 'Amazon Prime Video, YouTube' 'Amazon Prime Video, Hoichoi'
 'Disney+ Hotstar' 'Amazon Prime Video' 'Netflix, Amazon Prime Video'
 'YouTube' 'Voot, Amazon Prime Video' 'Netflix, Disney+ Hotstar'
 'Yet to be released/Not available' 'Zee5' 'Amazon Prime Video, Hotstar'
 'Hotstar' 'Disney+ Hotstar, Amazon Prime Video' 'Mubi'
 'Not available on major streaming platforms'
 'Amazon Prime Video, Disney+ Hotstar' 'Disney+ Hotstar, Netflix'
 'Netflix, Zee5' 'Amazon Prime Video, Zee5' 'Amazon Prime Video, Netflix'
 'Hoichoi' 'Voot' 'Netflix, SonyLIV' 'Sun NXT' 'Zee5, Amazon Prime Video']
['Bollywood (Hindi)' 'Kollywood (Tamil)' 'Bengali Cinema'
 'Mollywood (Malayalam)' 'Sandalwood (Kannada)' 'Tollywood (Telugu)'
 'Marathi Cinema' 'South Korean Cinema' 'Hollywood (English)']


In [12]:
# get the number of missing data points per column
missing_values_count = df.isnull().sum()

# look at the number of missing points
missing_values_count

Unnamed: 0,0
Movie name,0
Year of release,0
Watch hour,0
Rating,0
Ratedby,0
Film Industry,0
Genre,0
Director,0
Box office collection,52
User reviews,0


Find the percentage of missing values.

In [13]:
# how many total missing values do we have?
total_cells = np.prod(df.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100

1.8769230769230771
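
The overall figure hides which columns are affected. A per-column breakdown can be sketched as follows; `toy` is a made-up frame used only so the example runs on its own, and in the notebook `df` would take its place:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real notebook would use df instead
toy = pd.DataFrame({
    "Rating": [8.9, 8.5, 8.6, 8.7],
    "Box office collection": [138288.0, np.nan, 975543.0, np.nan],
})

# isnull() gives booleans; their column mean is the fraction missing
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```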

### **Step 4: Data Cleaning**

Looking at the data, the missing values are in the Box office collection and Awards columns.

Some of these values are probably missing because they were not recorded, rather than because they don't exist. For such values it makes sense to estimate what they should be instead of leaving them as NaN.

Other values are missing because nothing exists to record. In that case guessing a value makes no sense; it is better to either leave the field empty or replace the NaN with an explicit placeholder such as "neither".

In [14]:
# Drop rows with missing values (if any)
df_cleaned = df.dropna()

In [15]:
# Check for missing values
df_cleaned.isnull().sum()

Unnamed: 0,0
Movie name,0
Year of release,0
Watch hour,0
Rating,0
Ratedby,0
Film Industry,0
Genre,0
Director,0
Box office collection,0
User reviews,0


### **Note on dropping rows:**
First, we check how many rows and columns contain missing data. If only a small portion of the dataset is missing values, dropping rows may be acceptable.

We also calculate the percentage of rows with missing values to ensure we're not losing too much data. Generally, if more than 5-10% of the dataset has missing values, alternative methods like imputation should be considered.

After dropping rows, we verify the dataset's new shape to confirm the number of rows left.
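
These checks can be sketched as below. The frame `toy` and its values are made up for illustration; in the notebook the same expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row frame with gaps in two columns
toy = pd.DataFrame({
    "Movie name": ["A", "B", "C", "D", "E"],
    "Box office collection": [1.0, np.nan, 3.0, 4.0, np.nan],
    "Awards": ["1 win", "2 wins", np.nan, "3 wins", np.nan],
})

# Share of rows that contain at least one missing value
rows_with_missing_pct = toy.isnull().any(axis=1).mean() * 100

# Drop those rows and compare the shapes before and after
toy_cleaned = toy.dropna()
print(f"{rows_with_missing_pct:.1f}% of rows have a missing value")
print("shape before:", toy.shape, "after:", toy_cleaned.shape)
```

Comparing the row-level percentage against the 5-10% guideline before calling dropna() shows how much data the drop would cost.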

In [16]:
# Remove duplicates
df_cleaned = df_cleaned.drop_duplicates()

### **Step 5: Data Transformation**

**Filtering and Sorting**

a. Filter the dataset to display only movies released after the year 2000.

In [17]:
df['Year of release'] > 2000


Unnamed: 0,Year of release
0,True
1,False
2,True
3,False
4,False
...,...
245,False
246,True
247,True
248,True


In [18]:
movies_after_2000 = df[df['Year of release'] > 2000]
movies_after_2000

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.00,945,23 wins & 32 nominations,The real-life story of IPS Officer Manoj Kumar...,SonyLIV
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.00,370,2 nominations,A barber seeks vengeance after his home is bur...,Netflix
5,Anbe Sivam,2003,2 hours 40 minutes,8.6,26K,Kollywood (Tamil),"Drama, Comedy",Sundar C.,,115,2 wins & 3 nominations,"Two men, one young and arrogant, the other dam...",Disney+ Hotstar
6,Pariyerum Perumal,2018,2 hours 34 minutes,8.7,19K,Kollywood (Tamil),"Drama, Social",Mari Selvaraj,,169,11 wins & 5 nominations,A law student from a lower caste begins a frie...,Amazon Prime Video
7,3 Idiots,2009,2 hours 50 minutes,8.4,441K,Bollywood (Hindi),"Comedy, Drama",Rajkumar Hirani,60262836.00,1000,64 wins & 30 nominations,Two friends are searching for their long lost ...,"Netflix, Amazon Prime Video"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
244,Angamaly Diaries,2017,2 hours 12 minutes,7.9,6.8K,Mollywood (Malayalam),"Crime, Drama",Lijo Jose Pellissery,661915.00,39,2 wins & 6 nominations,Vincent Pepe who wanted to be a powerful leade...,Netflix
246,Hazaaron Khwaishein Aisi,2003,2 hours,7.9,5.3K,Bollywood (Hindi),"Political Drama, Romance",Sudhir Mishra,,44,8 wins & 9 nominations,"Set in the backdrop of Indian Emergency 1975, ...",Amazon Prime Video
247,Ghilli,2004,2 hours 40 minutes,8.1,17K,Kollywood (Tamil),"Action, Drama",Dharani,163961.00,47,2 wins & 2 nominations,"Velu, an aspiring kabaddi player, goes to Madu...",Disney+ Hotstar
248,Mukundan Unni Associates,2022,2 hours 8 minutes,7.9,6.2K,Mollywood (Malayalam),"Satire, Drama",Abhinav Sunder Nayak,44990.00,62,1 win,"Advocate Mukundan Unni, played by Vineeth Sree...",Disney+ Hotstar


b. Sort the DataFrame by IMDB Rating in descending order.

In [19]:
sorted_by_rating = df.sort_values(by='Rating', ascending=False)
sorted_by_rating

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform
27,Mayabazar,1957,3 hours 12 minutes,9.1,5.9K,Tollywood (Telugu),"Fantasy, Mythological",Kadiri Venkata Reddy,,39,,Balarama promises Subhadra to get his daughter...,"Amazon Prime Video, YouTube"
17,Sandesham,1991,2 hours 18 minutes,9.0,5.4K,Mollywood (Malayalam),"Political Satire, Comedy",Sathyan Anthikad,,17,2 wins,A satire on contemporary Kerala politics where...,Amazon Prime Video
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.0,945,23 wins & 32 nominations,The real-life story of IPS Officer Manoj Kumar...,SonyLIV
14,Kireedam,1989,2 hours 4 minutes,8.9,8.8K,Mollywood (Malayalam),"Drama, Family",Sibi Malayil,181001.0,33,1 win,The life of a young man turns upside down when...,Amazon Prime Video
8,#Home,2021,2 hours 38 minutes,8.8,16K,Mollywood (Malayalam),"Drama, Family",Rojin Thomas,,541,3 wins & 2 nominations,Oliver Twist (Indrans) wants to be tech-savvy ...,Amazon Prime Video
...,...,...,...,...,...,...,...,...,...,...,...,...,...
237,Nayak: The Real Hero,2001,3 hours 7 minutes,7.8,18K,Bollywood (Hindi),"Action, Political Drama",S. Shankar,,20,4 nominations,A man accepts a challenge by the chief ministe...,Amazon Prime Video
236,Stanley Ka Dabba,2011,1 hour 36 minutes,7.8,7.7K,Bollywood (Hindi),"Drama, Family",Amole Gupte,1241102.0,44,4 wins & 6 nominations,"A school-teacher, who forces children to share...",Disney+ Hotstar
220,Lakshya,2004,3 hours 6 minutes,7.8,26K,Bollywood (Hindi),"War, Drama",Farhan Akhtar,5859242.0,107,6 wins & 14 nominations,"An aimless, jobless, irresponsible grown man j...",Amazon Prime Video
243,Vicky Donor,2012,2 hours 6 minutes,7.7,46K,Bollywood (Hindi),"Comedy, Drama",Shoojit Sircar,6456358.0,82,43 wins & 37 nominations,A man is brought in by an infertility doctor t...,"Amazon Prime Video, Netflix"


c. Create a new DataFrame that contains only movies with an IMDB rating greater than 8.5.

In [20]:
high_rated_movies = df[df['Rating'] > 8.5]
high_rated_movies

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.0,945,23 wins & 32 nominations,The real-life story of IPS Officer Manoj Kumar...,SonyLIV
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.0,370,2 nominations,A barber seeks vengeance after his home is bur...,Netflix
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,120481.93,237,7 wins & 1 nomination,A common man's struggles against a corrupt pol...,"Amazon Prime Video, YouTube"
5,Anbe Sivam,2003,2 hours 40 minutes,8.6,26K,Kollywood (Tamil),"Drama, Comedy",Sundar C.,,115,2 wins & 3 nominations,"Two men, one young and arrogant, the other dam...",Disney+ Hotstar
6,Pariyerum Perumal,2018,2 hours 34 minutes,8.7,19K,Kollywood (Tamil),"Drama, Social",Mari Selvaraj,,169,11 wins & 5 nominations,A law student from a lower caste begins a frie...,Amazon Prime Video
8,#Home,2021,2 hours 38 minutes,8.8,16K,Mollywood (Malayalam),"Drama, Family",Rojin Thomas,,541,3 wins & 2 nominations,Oliver Twist (Indrans) wants to be tech-savvy ...,Amazon Prime Video
10,Manichithrathazhu,1993,2 hours 49 minutes,8.7,13K,Mollywood (Malayalam),"Horror, Thriller, Drama",Fazil,843373.49,53,5 wins,When a forbidden room in an old bungalow is un...,Disney+ Hotstar
11,Rocketry: The Nambi Effect,2022,2 hours 37 minutes,8.7,59K,Bollywood (Hindi),"Biography, Drama",Madhavan,398615.0,1100,6 wins & 19 nominations,The story of Indian Space Research Organizatio...,Amazon Prime Video
13,777 Charlie,2022,2 hours 44 minutes,8.7,41K,Sandalwood (Kannada),"Adventure, Drama",Kiranraj K,7523995.0,392,3 wins & 6 nominations,Dharma is stuck in a rut with his negative and...,"Voot, Amazon Prime Video"
14,Kireedam,1989,2 hours 4 minutes,8.9,8.8K,Mollywood (Malayalam),"Drama, Family",Sibi Malayil,181001.0,33,1 win,The life of a young man turns upside down when...,Amazon Prime Video


**Other data transformations**

In [21]:
# Create a new column 'weighted reviews': the product of the user review count and the rating
df_cleaned['weighted reviews'] = df_cleaned['User reviews'] * df_cleaned['Rating']
df_cleaned.head()

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform,weighted reviews
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.0,945,23 wins & 32 nominations,The real-life story of IPS Officer Manoj Kumar...,SonyLIV,8410.5
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.0,370,2 nominations,A barber seeks vengeance after his home is bur...,Netflix,3182.0
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,120481.93,237,7 wins & 1 nomination,A common man's struggles against a corrupt pol...,"Amazon Prime Video, YouTube",2061.9
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.0,62,4 wins & 2 nominations total,This final installment in Satyajit Ray's Apu T...,"Amazon Prime Video, Hoichoi",520.8
7,3 Idiots,2009,2 hours 50 minutes,8.4,441K,Bollywood (Hindi),"Comedy, Drama",Rajkumar Hirani,60262836.0,1000,64 wins & 30 nominations,Two friends are searching for their long lost ...,"Netflix, Amazon Prime Video",8400.0


In [22]:
# Convert 'Ratedby' from strings such as '126K' to integers and store them in a new column
def convert_k_to_int(value):
    value = value.strip()  # Remove any leading/trailing whitespace
    if 'K' in value:
        # '6.8K' -> 6800; round() guards against floating-point error before int()
        return int(round(float(value.replace('K', '')) * 1000))
    else:
        return int(float(value))  # Handle numeric values without 'K'

# Apply the function and add as a new column
df_cleaned['Ratedby_converted'] = df_cleaned['Ratedby'].apply(convert_k_to_int)
df_cleaned.head()

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform,weighted reviews,Ratedby_converted
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.0,945,23 wins & 32 nominations,The real-life story of IPS Officer Manoj Kumar...,SonyLIV,8410.5,126000
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.0,370,2 nominations,A barber seeks vengeance after his home is bur...,Netflix,3182.0,37000
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,120481.93,237,7 wins & 1 nomination,A common man's struggles against a corrupt pol...,"Amazon Prime Video, YouTube",2061.9,25000
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.0,62,4 wins & 2 nominations total,This final installment in Satyajit Ray's Apu T...,"Amazon Prime Video, Hoichoi",520.8,17000
7,3 Idiots,2009,2 hours 50 minutes,8.4,441K,Bollywood (Hindi),"Comedy, Drama",Rajkumar Hirani,60262836.0,1000,64 wins & 30 nominations,Two friends are searching for their long lost ...,"Netflix, Amazon Prime Video",8400.0,441000


In [23]:
# Average rating for each film industry
Film_Industry_summary_rating = df_cleaned.groupby('Film Industry')['Rating'].mean().reset_index()
print(Film_Industry_summary_rating)

           Film Industry    Rating
0         Bengali Cinema  8.175000
1      Bollywood (Hindi)  8.055696
2    Hollywood (English)  8.000000
3      Kollywood (Tamil)  8.255769
4         Marathi Cinema  8.800000
5  Mollywood (Malayalam)  8.225000
6   Sandalwood (Kannada)  8.300000
7     Tollywood (Telugu)  8.140000


In [24]:
# Create a 'Popularity' column by binning ratings: (0, 8] -> Flop, (8, 8.5] -> Hit, above 8.5 -> SuperHit
df_cleaned['Popularity'] = pd.cut(df_cleaned['Rating'], bins=[0, 8, 8.5, np.inf], labels=['Flop', 'Hit' ,'SuperHit'])
df_cleaned.sample(10)

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform,weighted reviews,Ratedby_converted,Popularity
7,3 Idiots,2009,2 hours 50 minutes,8.4,441K,Bollywood (Hindi),"Comedy, Drama",Rajkumar Hirani,60262836.0,1000,64 wins & 30 nominations,Two friends are searching for their long lost ...,"Netflix, Amazon Prime Video",8400.0,441000,Hit
137,Lucia,2013,2 hours 15 minutes,8.3,13K,Sandalwood (Kannada),Psychological Thriller,Pawan Kumar,61445.78,129,4 wins & 3 nominations,A man suffering from insomnia is tricked into ...,Amazon Prime Video,1070.7,13000,Hit
142,Special 26,2013,2 hours 24 minutes,8.0,60K,Bollywood (Hindi),"Crime, Drama, Thriller",Neeraj Pandey,647074.0,148,13 nominations,A gang of con-men rob prominent rich businessm...,Netflix,1184.0,60000,Flop
167,Saptha Sagaradaache Ello - Side A,2023,2 hours 22 minutes,8.2,7.7K,Sandalwood (Kannada),"Drama, Romance",Hemanth M. Rao,37253.0,88,6 wins & 9 nominations,Manu and Priya hail from a middle-class backgr...,Amazon Prime Video,721.6,7700,Hit
51,Gangs of Wasseypur,2012,5 hours 21 minutes,8.2,105K,Bollywood (Hindi),"Crime, Drama, Action",Anurag Kashyap,4384642.0,306,18 wins & 65 nominations,A clash between Sultan and Shahid Khan leads t...,Amazon Prime Video,2509.2,105000,Hit
207,Super 30,2019,2 hours 34 minutes,7.9,38K,Bollywood (Hindi),"Biography, Drama",Vikas Bahl,24701637.0,617,6 wins & 12 nominations,Based on the life of Patna-based mathematician...,Zee5,4874.3,38000,Flop
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.0,62,4 wins & 2 nominations total,This final installment in Satyajit Ray's Apu T...,"Amazon Prime Video, Hoichoi",520.8,17000,Hit
16,Like Stars on Earth,2007,2 hours 42 minutes,8.3,209K,Bollywood (Hindi),"Drama, Family",Aamir Khan,21897373.0,463,28 wins & 18 nominations,An eight-year-old boy is thought to be a lazy ...,"Netflix, Disney+ Hotstar",3842.9,209000,Hit
99,Virumandi,2004,2 hours 55 minutes,8.4,8K,Kollywood (Tamil),"Action, Drama, Thriller",Kamal Haasan,5746385.54,39,1 win,A temperamental farmer gets embroiled in a blo...,Amazon Prime Video,327.6,8000,Hit
179,Deiva Thirumagal,2011,2 hours 46 minutes,8.2,6.6K,Kollywood (Tamil),"Drama, Family",A.L. Vijay,846407.0,49,13 wins & 13 nominations,A man with disabilities fights for custody of ...,Amazon Prime Video,401.8,6600,Hit
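
Because pd.cut uses right-closed intervals by default, a rating of exactly 8.0 falls in the Flop bin and exactly 8.5 in the Hit bin. The sketch below uses a made-up `ratings` series to show where the boundaries land:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings chosen to sit on and around the bin edges
ratings = pd.Series([7.9, 8.0, 8.1, 8.5, 8.6])

# Same bins and labels as the notebook cell; intervals are right-closed by default
labels = pd.cut(ratings, bins=[0, 8, 8.5, np.inf], labels=["Flop", "Hit", "SuperHit"])
print(pd.DataFrame({"Rating": ratings, "Popularity": labels}))
```

Passing right=False would instead close each interval on the left, moving 8.0 into Hit and 8.5 into SuperHit.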


In [25]:
# 'Box office collection' is already numeric in df_cleaned: the cleaning in Step 3 converted it to floats
# and dropna() removed the rows without a value, so no further cleaning is needed here.

# Now, summarize the box office collection for each Film Industry
Film_Industry_summary_collection = df_cleaned.groupby('Film Industry').agg({
    'Box office collection': 'sum',  # Sum the box office collections
    'Rating': 'mean'                 # Calculate the mean rating
}).reset_index()

# Display the result
print(Film_Industry_summary_collection)


           Film Industry  Box office collection    Rating
0         Bengali Cinema           4.070710e+05  8.175000
1      Bollywood (Hindi)           1.179970e+09  8.055696
2    Hollywood (English)           1.408538e+08  8.000000
3      Kollywood (Tamil)           8.832680e+07  8.255769
4         Marathi Cinema           6.330000e+02  8.800000
5  Mollywood (Malayalam)           5.774348e+07  8.225000
6   Sandalwood (Kannada)           1.059601e+08  8.300000
7     Tollywood (Telugu)           2.896626e+08  8.140000


### **Step 6: Grouping and Aggregation**

**Group By Operations**

a. Group the movies by the Year column and calculate the average IMDB rating for each year.

In [26]:
avg_rating_per_year = df.groupby('Year of release')['Rating'].mean()
avg_rating_per_year

Unnamed: 0_level_0,Rating
Year of release,Unnamed: 1_level_1
1955,8.2
1956,8.2
1957,8.7
1958,7.9
1959,8.4
1960,8.1
1964,8.1
1965,8.3
1968,8.1
1970,7.9
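
A yearly mean is easier to interpret next to the number of movies behind it, since some years may contribute only one title. A minimal sketch using named aggregation; the frame `toy` and the names `avg_rating` and `n_movies` are illustrative choices, not part of the notebook:

```python
import pandas as pd

# Hypothetical sample rows
toy = pd.DataFrame({
    "Year of release": [1955, 1957, 1957, 2023],
    "Rating": [8.2, 9.1, 8.3, 8.9],
})

# Named aggregation: mean rating plus the number of movies behind each mean
per_year = toy.groupby("Year of release").agg(
    avg_rating=("Rating", "mean"),
    n_movies=("Rating", "count"),
)
print(per_year)
```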


b. Group the data by Director and count the number of movies each director has in the top 250.

In [27]:
movies_per_director = df.groupby('Director')['Movie name'].count()
movies_per_director

Unnamed: 0_level_0,Movie name
Director,Unnamed: 1_level_1
A.L. Vijay,1
A.R. Murugadoss,1
Aamir Khan,1
Abhinav Sunder Nayak,1
Abhishek Pathak,1
...,...
Vishal Bhardwaj,3
Vishnuvardhan,1
Vivek Agnihotri,1
Yash Chopra,2
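
The grouped count above is ordered alphabetically by director. If the goal is to see the most frequent directors first, value_counts() is one option. A small sketch with made-up director names:

```python
import pandas as pd

# Hypothetical sample rows
toy = pd.DataFrame({
    "Movie name": ["M1", "M2", "M3", "M4"],
    "Director": ["Director A", "Director B", "Director A", "Director A"],
})

# value_counts() counts rows per director and sorts from most to fewest
movies_per_director = toy["Director"].value_counts()
print(movies_per_director.head())
```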


c. Find the highest-rated movie for each year by grouping the data by Year and selecting the movie with the highest rating in each group.

In [28]:
highest_rated_movie_per_year = df.loc[df.groupby('Year of release')['Rating'].idxmax()]
highest_rated_movie_per_year

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform
43,Pather Panchali,1955,2 hours 5 minutes,8.2,39K,Bengali Cinema,Drama,Satyajit Ray,135342.0,204,11 wins & 2 nominations total,"Impoverished priest Harihar Ray, dreaming of a...","Amazon Prime Video, Hoichoi"
38,Aparajito,1956,1 hour 50 minutes,8.2,16K,Bengali Cinema,Drama,Satyajit Ray,134241.0,59,6 wins & 2 nominations total,"Following his father's death, a boy leaves hom...",Amazon Prime Video
27,Mayabazar,1957,3 hours 12 minutes,9.1,5.9K,Tollywood (Telugu),"Fantasy, Mythological",Kadiri Venkata Reddy,,39,,Balarama promises Subhadra to get his daughter...,"Amazon Prime Video, YouTube"
196,The Music Room,1958,1 hour 35 minutes,7.9,6.8K,Bengali Cinema,Drama,Satyajit Ray,3247.0,42,2 wins & 2 nominations,Depicts the end days of a decadent zamindar (l...,Amazon Prime Video
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.0,62,4 wins & 2 nominations total,This final installment in Satyajit Ray's Apu T...,"Amazon Prime Video, Hoichoi"
138,Mughal-E-Azam,1960,3 hours 17 minutes,8.1,8.9K,Bollywood (Hindi),"Historical, Romance",K. Asif,161434.0,46,3 wins & 3 nominations,A 16th century prince falls in love with a cou...,Amazon Prime Video
120,Charulata,1964,1 hour 57 minutes,8.1,7.3K,Bengali Cinema,"Drama, Romance",Satyajit Ray,,43,8 wins & 2 nominations,The lonely wife of a newspaper editor falls in...,Amazon Prime Video
54,Guide,1965,3 hours 3 minutes,8.3,8.7K,Bollywood (Hindi),"Romance, Drama",Vijay Anand,,62,9 wins & 3 nominations,"When mistaken to be a sage by some villagers, ...","Amazon Prime Video, YouTube"
154,Padosan,1968,2 hours 37 minutes,8.1,7.6K,Bollywood (Hindi),"Comedy, Romance",Jyoti Swaroop,,28,,A simple man from a village falls in love with...,"Amazon Prime Video, YouTube"
221,Mera Naam Joker,1970,3 hours 44 minutes,7.9,5.4K,Bollywood (Hindi),"Drama, Musical",Raj Kapoor,,29,11 wins,"Raju is a joker, a clown. It is what he is and...",Amazon Prime Video
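
Note that idxmax() returns only the first row when several movies in a year share the top rating. If ties should all be kept, one possible approach is to compare each row with its group maximum. The data below is made up for illustration:

```python
import pandas as pd

# Hypothetical sample rows; M1 and M2 tie for the best rating in 2004
toy = pd.DataFrame({
    "Movie name": ["M1", "M2", "M3", "M4"],
    "Year of release": [2004, 2004, 2004, 2013],
    "Rating": [8.4, 8.4, 8.1, 8.3],
})

# transform('max') broadcasts each year's top rating back onto every row
year_max = toy.groupby("Year of release")["Rating"].transform("max")
top_per_year = toy[toy["Rating"] == year_max]
print(top_per_year)
```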


**Other grouping and aggregation operations**

In [29]:
# Average number of raters (Ratedby_converted) by Film Industry and Genre
Film_Industry_Genre_summary = df_cleaned.groupby(['Film Industry', 'Genre']).agg({
    'Ratedby_converted': 'mean'
}).reset_index()

print(Film_Industry_Genre_summary)

          Film Industry                     Genre  Ratedby_converted
0        Bengali Cinema                     Drama            19700.0
1     Bollywood (Hindi)          Action, Thriller            61000.0
2     Bollywood (Hindi)   Action, Thriller, Drama            27000.0
3     Bollywood (Hindi)          Biography, Drama            39000.0
4     Bollywood (Hindi)  Biography, Drama, Social            29000.0
..                  ...                       ...                ...
120  Tollywood (Telugu)           Political Drama             5700.0
121  Tollywood (Telugu)    Romance, Comedy, Drama             8500.0
122  Tollywood (Telugu)            Romance, Drama            45000.0
123  Tollywood (Telugu)             Sports, Drama            23000.0
124  Tollywood (Telugu)         Thriller, Mystery             6800.0

[125 rows x 3 columns]


In [30]:
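# Advanced data manipulation
# Pivot table summarizing total box office collection by Film Industry and Popularity
# (observed=False keeps every Popularity category as a column, including empty ones)
pivot_table = df_cleaned.pivot_table(values='Box office collection', index='Film Industry', columns='Popularity', aggfunc='sum', observed=False)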

print(pivot_table)

Popularity                     Flop           Hit    SuperHit
Film Industry                                                
Bengali Cinema         3.247000e+03  4.038240e+05        0.00
Bollywood (Hindi)      3.648356e+08  8.145978e+08   536903.00
Hollywood (English)    1.408538e+08  0.000000e+00        0.00
Kollywood (Tamil)      1.169611e+07  7.360595e+07  3024743.14
Marathi Cinema         0.000000e+00  0.000000e+00      633.00
Mollywood (Malayalam)  1.057260e+07  4.608420e+07  1086687.49
Sandalwood (Kannada)   0.000000e+00  9.843615e+07  7523995.00
Tollywood (Telugu)     2.676218e+07  2.629004e+08        0.00




**Column Creation and Manipulation**

a. Create a new column named Rating Category that categorizes movies as "Excellent" if the IMDb rating is 9.0 or above, "Good" if it is at least 8.0 but below 9.0, and "Average" if it is below 8.0.

In [31]:
def categorize_rating(rating):
    if rating >= 9.0:
        return 'Excellent'
    elif 8.0 <= rating < 9.0:
        return 'Good'
    else:
        return 'Average'

df['Rating Category'] = df['Rating'].apply(categorize_rating)
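
The same categories can also be produced without a row-wise function. A minimal sketch, assuming ratings are non-negative numbers; the `ratings` series is made-up sample data:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings chosen to sit on and around the category edges
ratings = pd.Series([7.7, 8.0, 8.9, 9.0, 9.1])

# right=False makes each bin closed on the left: [0, 8), [8, 9), [9, inf)
category = pd.cut(
    ratings,
    bins=[0, 8.0, 9.0, np.inf],
    labels=["Average", "Good", "Excellent"],
    right=False,
)
print(category.tolist())
```

One difference from categorize_rating: pd.cut leaves a missing rating as NaN, whereas the function above would label it "Average".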

b. Extract the Year from the movie title (assuming the year is part of the title) and create a new column Extracted Year. The pattern looks for a four-digit year in parentheses; titles without one produce NaN.

In [32]:
df['Extracted Year'] = df['Movie name'].str.extract(r'\((\d{4})\)', expand=False).astype(float)
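
To show what the pattern matches, here is a small sketch on made-up titles (the `titles` series is hypothetical; none of these strings come from the dataset):

```python
import pandas as pd

# Hypothetical titles: only the first two carry a "(YYYY)" suffix
titles = pd.Series(["Movie A (1999)", "Movie B (2012)", "Movie C"])

# expand=False returns a Series; rows without a match become NaN
extracted_year = titles.str.extract(r"\((\d{4})\)", expand=False).astype(float)
print(extracted_year.tolist())
```

The cast to float rather than int is needed because NaN cannot be stored in a plain integer column.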

c. Create a new column that combines the movie name and the director into a single string, separated by a hyphen.

In [33]:
df['Title-Director'] = df['Movie name'] + ' - ' + df['Director']
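
One thing to keep in mind with `+` on string columns is that a missing value on either side makes the whole result NaN. A minimal sketch on made-up rows, using `Series.str.cat` with `na_rep` as one possible way to keep the row (the placeholder text "Unknown" is an assumption):

```python
import numpy as np
import pandas as pd

# Made-up rows; the second one has no director
toy = pd.DataFrame({
    "Movie name": ["Nayakan", "Untitled Example"],
    "Director": ["Mani Ratnam", np.nan],
})

# Plain concatenation: NaN propagates to the result
with_plus = toy["Movie name"] + " - " + toy["Director"]

# str.cat with na_rep substitutes a placeholder for missing values
with_cat = toy["Movie name"].str.cat(toy["Director"], sep=" - ", na_rep="Unknown")

print(with_plus.tolist())
print(with_cat.tolist())
```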

In [34]:
# Display the DataFrame to see the new columns
df[['Movie name', 'Rating', 'Rating Category', 'Extracted Year', 'Title-Director']]

Unnamed: 0,Movie name,Rating,Rating Category,Extracted Year,Title-Director
0,12th Fail,8.9,Good,,12th Fail - Vidhu Vinod Chopra
1,Gol Maal,8.5,Good,,Gol Maal - Hrishikesh Mukherjee
2,Maharaja,8.6,Good,,Maharaja - Nithilan Saminathan
3,Nayakan,8.7,Good,,Nayakan - Mani Ratnam
4,The World of Apu,8.4,Good,,The World of Apu - Satyajit Ray
...,...,...,...,...,...
245,Mr. India,7.7,Average,,Mr. India - Shekhar Kapur
246,Hazaaron Khwaishein Aisi,7.9,Average,,Hazaaron Khwaishein Aisi - Sudhir Mishra
247,Ghilli,8.1,Good,,Ghilli - Dharani
248,Mukundan Unni Associates,7.9,Average,,Mukundan Unni Associates - Abhinav Sunder Nayak


# **Alternative ways to fill missing values**

# **Loading the data**

In [35]:
nfl_data = pd.read_csv("/content/gdrive/MyDrive/Colab Notebooks/Exploratory_Data_Analysis/Dataset/IMdB_India_Top250.csv")

In [36]:
nfl_data.head()

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,"$138,288.00",945,23 wins & 32 nominations,The real-life story of IPS Officer Manoj Kumar...,SonyLIV
1,Gol Maal,1979,2 hours,8.5,20K,Bollywood (Hindi),Comedy,Hrishikesh Mukherjee,NIL,48,3 wins & 1 nomination,A man's simple lie to secure his job escalates...,"Amazon Prime Video, YouTube, Zee5"
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,"$975,543.00",370,2 nominations,A barber seeks vengeance after his home is bur...,Netflix
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,"$120,481.93",237,7 wins & 1 nomination,A common man's struggles against a corrupt pol...,"Amazon Prime Video, YouTube"
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,"$134,241.00",62,4 wins & 2 nominations total,This final installment in Satyajit Ray's Apu T...,"Amazon Prime Video, Hoichoi"


Replace placeholder values such as "NIL" or 0 with NaN so that pandas treats them as missing.

In [37]:
# Strip whitespace from column names
nfl_data.columns = nfl_data.columns.str.strip()
# Print columns to verify the correct name
print(nfl_data.columns)

# Replace "NIL" with NaN in the entire DataFrame
nfl_data.replace("NIL", np.nan, inplace=True)

# Replace "NIL" with NaN in the 'Box office collection' column specifically
# (assign the result back rather than using inplace on a column selection)
nfl_data['Box office collection'] = nfl_data['Box office collection'].replace("NIL", np.nan)


Index(['Movie name', 'Year of release', 'Watch  hour', 'Rating', 'Ratedby',
       'Film Industry', 'Genre', 'Director', 'Box office collection',
       'User reviews', 'Awards', 'Description', 'Streaming platform'],
      dtype='object')
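
Placeholder strings can also be turned into NaN at load time. The sketch below assumes the placeholder is exactly the text NIL and uses a tiny in-memory CSV (made-up rows in the dataset's format) instead of the real file.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file
csv_text = (
    "Movie name,Box office collection\n"
    "Gol Maal,NIL\n"
    'Ghilli,"$163,961.00"\n'
)

# na_values lists extra strings that read_csv should treat as missing
toy = pd.read_csv(io.StringIO(csv_text), na_values=["NIL"])
print(toy)
```

The first row's collection is read as NaN, while the second keeps its original string and still needs the currency cleaning step.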


In [38]:
# Function to clean the 'Box office collection' column
def clean_box_office(value):
    if pd.isna(value):
        return np.nan  # Already missing; nothing to parse
    value = str(value).strip()  # Remove leading/trailing spaces
    if value in ('NIL', '$0', '$0.00'):
        return 0.0
    # Remove $ and commas, then convert to float
    return float(value.replace('$', '').replace(',', ''))

# Apply the function to clean the 'Box office collection' column
nfl_data['Box office collection'] = nfl_data['Box office collection'].apply(clean_box_office)

# Replace 0.0 with NaN in the 'Box office collection' column
# (assign the result back rather than using inplace on a column selection)
nfl_data['Box office collection'] = nfl_data['Box office collection'].replace(0.0, np.nan)

# Print the result to verify
nfl_data.head()


Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.0,945,23 wins & 32 nominations,The real-life story of IPS Officer Manoj Kumar...,SonyLIV
1,Gol Maal,1979,2 hours,8.5,20K,Bollywood (Hindi),Comedy,Hrishikesh Mukherjee,,48,3 wins & 1 nomination,A man's simple lie to secure his job escalates...,"Amazon Prime Video, YouTube, Zee5"
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.0,370,2 nominations,A barber seeks vengeance after his home is bur...,Netflix
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,120481.93,237,7 wins & 1 nomination,A common man's struggles against a corrupt pol...,"Amazon Prime Video, YouTube"
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.0,62,4 wins & 2 nominations total,This final installment in Satyajit Ray's Apu T...,"Amazon Prime Video, Hoichoi"


Looking at the data, the columns with missing values are Box office collection and Awards. The Box office collection values are probably missing because they were not recorded, rather than because they don't exist. So it makes sense to try to estimate them rather than leave them as NaN.

The Awards column also has a lot of missing fields. In this case, though, a field is missing because the movie has no award to report, so it doesn't make sense to guess a value. For this column, it makes more sense to either leave it empty or add a placeholder value like "neither" and use that to replace the NaNs.
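
The two strategies described above could be sketched as follows. This is an illustration on made-up rows, not the notebook's method: using the per-industry median as the estimate is an assumption, and any other estimator could be swapped in.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in both columns
toy = pd.DataFrame({
    "Film Industry": ["Bollywood (Hindi)"] * 3 + ["Kollywood (Tamil)"] * 2,
    "Box office collection": [100.0, np.nan, 300.0, 50.0, np.nan],
    "Awards": ["3 wins", np.nan, "1 win", np.nan, "2 nominations"],
})

# Estimate missing collections from the median of the same film industry
industry_median = (
    toy.groupby("Film Industry")["Box office collection"].transform("median")
)
toy["Box office collection"] = toy["Box office collection"].fillna(industry_median)

# Do not guess awards; mark the gap with an explicit placeholder instead
toy["Awards"] = toy["Awards"].fillna("neither")
print(toy)
```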

# **Drop missing values**
If you're in a hurry or don't have a reason to figure out why your values are missing, one option you have is to just remove any rows or columns that contain missing values.

In [39]:
nfl_data_copy = nfl_data.copy()
# remove all the rows that contain a missing value
nfl_data_copy.dropna()

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards,Description,Streaming platform
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.00,945,23 wins & 32 nominations,The real-life story of IPS Officer Manoj Kumar...,SonyLIV
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.00,370,2 nominations,A barber seeks vengeance after his home is bur...,Netflix
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,120481.93,237,7 wins & 1 nomination,A common man's struggles against a corrupt pol...,"Amazon Prime Video, YouTube"
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.00,62,4 wins & 2 nominations total,This final installment in Satyajit Ray's Apu T...,"Amazon Prime Video, Hoichoi"
7,3 Idiots,2009,2 hours 50 minutes,8.4,441K,Bollywood (Hindi),"Comedy, Drama",Rajkumar Hirani,60262836.00,1000,64 wins & 30 nominations,Two friends are searching for their long lost ...,"Netflix, Amazon Prime Video"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
243,Vicky Donor,2012,2 hours 6 minutes,7.7,46K,Bollywood (Hindi),"Comedy, Drama",Shoojit Sircar,6456358.00,82,43 wins & 37 nominations,A man is brought in by an infertility doctor t...,"Amazon Prime Video, Netflix"
244,Angamaly Diaries,2017,2 hours 12 minutes,7.9,6.8K,Mollywood (Malayalam),"Crime, Drama",Lijo Jose Pellissery,661915.00,39,2 wins & 6 nominations,Vincent Pepe who wanted to be a powerful leade...,Netflix
247,Ghilli,2004,2 hours 40 minutes,8.1,17K,Kollywood (Tamil),"Action, Drama",Dharani,163961.00,47,2 wins & 2 nominations,"Velu, an aspiring kabaddi player, goes to Madu...",Disney+ Hotstar
248,Mukundan Unni Associates,2022,2 hours 8 minutes,7.9,6.2K,Mollywood (Malayalam),"Satire, Drama",Abhinav Sunder Nayak,44990.00,62,1 win,"Advocate Mukundan Unni, played by Vineeth Sree...",Disney+ Hotstar


**Removing all the columns that have at least one missing value instead**

In [40]:
# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,User reviews,Description,Streaming platform
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,945,The real-life story of IPS Officer Manoj Kumar...,SonyLIV
1,Gol Maal,1979,2 hours,8.5,20K,Bollywood (Hindi),Comedy,Hrishikesh Mukherjee,48,A man's simple lie to secure his job escalates...,"Amazon Prime Video, YouTube, Zee5"
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,370,A barber seeks vengeance after his home is bur...,Netflix
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,237,A common man's struggles against a corrupt pol...,"Amazon Prime Video, YouTube"
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,62,This final installment in Satyajit Ray's Apu T...,"Amazon Prime Video, Hoichoi"


In [41]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 13 

Columns with na's dropped: 11
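
Dropping does not have to be all-or-nothing. The sketch below, on made-up rows, measures the share of missing values per column and drops rows based on a single column via the `subset` argument of `dropna`.

```python
import numpy as np
import pandas as pd

# Made-up rows: B lacks a collection, A lacks awards
toy = pd.DataFrame({
    "Movie name": ["A", "B", "C"],
    "Box office collection": [1.0, np.nan, 3.0],
    "Awards": [np.nan, "1 win", "2 wins"],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Drop rows with a missing value in any column
all_rows = toy.dropna()

# Drop rows only when the collection itself is missing
only_box = toy.dropna(subset=["Box office collection"])

print(missing_share)
print(len(all_rows), len(only_box))
```

Here the unrestricted `dropna()` keeps one row, while restricting it to the collection column keeps two.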


# **Filling in missing values automatically**

Another option is to try to fill in the missing values. For this next bit, we take a small subset of the dataset (the first five rows, columns `Movie name` through `Awards`) so that it prints well.

In [42]:
# get a small subset of the dataset
subset_nfl_data = nfl_data.loc[:, 'Movie name':'Awards'].head()
subset_nfl_data

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.0,945,23 wins & 32 nominations
1,Gol Maal,1979,2 hours,8.5,20K,Bollywood (Hindi),Comedy,Hrishikesh Mukherjee,,48,3 wins & 1 nomination
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.0,370,2 nominations
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,120481.93,237,7 wins & 1 nomination
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.0,62,4 wins & 2 nominations total


**Filling missing values with 0.**

In [43]:
# replace all NA's with 0
subset_nfl_data.fillna(0)

Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.0,945,23 wins & 32 nominations
1,Gol Maal,1979,2 hours,8.5,20K,Bollywood (Hindi),Comedy,Hrishikesh Mukherjee,0.0,48,3 wins & 1 nomination
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.0,370,2 nominations
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,120481.93,237,7 wins & 1 nomination
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.0,62,4 wins & 2 nominations total


**Backward-filling each missing value from the next valid value in the same column, then filling any remaining missing values with 0.**

In [44]:
# replace each NA with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
subset_nfl_data.bfill(axis=0).fillna(0)


Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.0,945,23 wins & 32 nominations
1,Gol Maal,1979,2 hours,8.5,20K,Bollywood (Hindi),Comedy,Hrishikesh Mukherjee,975543.0,48,3 wins & 1 nomination
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.0,370,2 nominations
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,120481.93,237,7 wins & 1 nomination
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.0,62,4 wins & 2 nominations total


**Filling missing values by backward-filling within each column.**

In [45]:
subset_nfl_data.bfill(axis=0)


Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.0,945,23 wins & 32 nominations
1,Gol Maal,1979,2 hours,8.5,20K,Bollywood (Hindi),Comedy,Hrishikesh Mukherjee,975543.0,48,3 wins & 1 nomination
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.0,370,2 nominations
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,120481.93,237,7 wins & 1 nomination
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.0,62,4 wins & 2 nominations total


**Filling missing values by forward-filling within each column.**

In [46]:
subset_nfl_data.ffill(axis=0)


Unnamed: 0,Movie name,Year of release,Watch hour,Rating,Ratedby,Film Industry,Genre,Director,Box office collection,User reviews,Awards
0,12th Fail,2023,2 hours 27 minutes,8.9,126K,Bollywood (Hindi),"Drama, Biography",Vidhu Vinod Chopra,138288.0,945,23 wins & 32 nominations
1,Gol Maal,1979,2 hours,8.5,20K,Bollywood (Hindi),Comedy,Hrishikesh Mukherjee,138288.0,48,3 wins & 1 nomination
2,Maharaja,2024,2 hours 30 minutes,8.6,37K,Kollywood (Tamil),"Crime, Drama",Nithilan Saminathan,975543.0,370,2 nominations
3,Nayakan,1987,2 hours 25 minutes,8.7,25K,Kollywood (Tamil),"Crime, Drama",Mani Ratnam,120481.93,237,7 wins & 1 nomination
4,The World of Apu,1959,1 hour 45 minutes,8.4,17K,Bengali Cinema,Drama,Satyajit Ray,134241.0,62,4 wins & 2 nominations total
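
The difference between the two directions is easiest to see on a short Series. This is a minimal sketch on made-up numbers; `limit` is an optional argument of `ffill` and `bfill` that caps how many consecutive gaps get filled.

```python
import numpy as np
import pandas as pd

# Made-up values with a two-element gap and a trailing gap
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()         # carry the last valid value forward
backward = s.bfill()        # pull the next valid value backward
limited = s.ffill(limit=1)  # fill at most one consecutive gap

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

Backward-filling leaves the trailing gap as NaN because no later value exists, which is why the earlier cell chains `fillna(0)` after it.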
