In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
df=pd.read_csv('movie.csv')


In [4]:
df.head()

Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url
0,2021-12-15,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,5083.954,8940,8.3,en,"Action, Adventure, Science Fiction",https://image.tmdb.org/t/p/original/1g0dhYtq4i...
1,2022-03-01,The Batman,"In his second year of fighting crime, Batman u...",3827.658,1151,8.1,en,"Crime, Mystery, Thriller",https://image.tmdb.org/t/p/original/74xTEgt7R3...
2,2022-02-25,No Exit,Stranded at a rest stop in the mountains durin...,2618.087,122,6.3,en,Thriller,https://image.tmdb.org/t/p/original/vDHsLnOWKl...
3,2021-11-24,Encanto,"The tale of an extraordinary family, the Madri...",2402.201,5076,7.7,en,"Animation, Comedy, Family, Fantasy",https://image.tmdb.org/t/p/original/4j0PNHkMr5...
4,2021-12-22,The King's Man,As a collection of history's worst tyrants and...,1895.511,1793,7.0,en,"Action, Adventure, Thriller, War",https://image.tmdb.org/t/p/original/aq4Pwv5Xeu...


In [None]:
df.columns

Index(['Release_Date', 'Title', 'Overview', 'Popularity', 'Vote_Count',
       'Vote_Average', 'Original_Language', 'Genre', 'Poster_Url'],
      dtype='object')

In [None]:
col_df=pd.DataFrame([df.columns])
col_df

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url


In [11]:
col_df1 = pd.DataFrame(df.columns, columns=["column_names"])
col_df1

Unnamed: 0,column_names
0,Release_Date
1,Title
2,Overview
3,Popularity
4,Vote_Count
5,Vote_Average
6,Original_Language
7,Genre
8,Poster_Url


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9837 entries, 0 to 9836
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Release_Date       9837 non-null   object 
 1   Title              9828 non-null   object 
 2   Overview           9828 non-null   object 
 3   Popularity         9827 non-null   float64
 4   Vote_Count         9827 non-null   object 
 5   Vote_Average       9827 non-null   object 
 6   Original_Language  9827 non-null   object 
 7   Genre              9826 non-null   object 
 8   Poster_Url         9826 non-null   object 
dtypes: float64(1), object(8)
memory usage: 691.8+ KB


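->The non-null counts in df.info() are below 9837 for every column except Release_Date, so the data does contain a few missing values (about 9 to 11 per column).
->Columns like Overview, Poster_Url and Original_Language would not be useful in the analysis.
->The Release_Date column needs to be cast to a datetime format so that we can extract the year.
->Vote_Count and Vote_Average are stored as object, so they need to be converted to numeric before any arithmetic.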
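
The non-null counts reported by df.info() can be cross-checked by counting missing cells per column. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not rows from movie.csv); the same two calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movie data (values are made up for illustration).
toy = pd.DataFrame({
    "Title": ["A", None, "C"],
    "Popularity": [10.5, np.nan, 7.1],
    "Genre": ["Drama", "Comedy", None],
})

# isna() marks missing cells; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that have at least one missing value.
rows_with_missing = toy.isna().any(axis=1).sum()
print(rows_with_missing)
```

On the real data, `df.isna().sum()` gives the complement of the "Non-Null Count" column shown by df.info().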

## Exploring the Genre Column

In [14]:
df['Genre'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 9837 entries, 0 to 9836
Series name: Genre
Non-Null Count  Dtype 
--------------  ----- 
9826 non-null   object
dtypes: object(1)
memory usage: 77.0+ KB


In [19]:
genre_df=df['Genre'].to_frame(name='Genre')
genre_df.head()

Unnamed: 0,Genre
0,"Action, Adventure, Science Fiction"
1,"Crime, Mystery, Thriller"
2,Thriller
3,"Animation, Comedy, Family, Fantasy"
4,"Action, Adventure, Thriller, War"


In [26]:
# checking for any null values
df['Genre'].isnull().sum()


11

In [27]:
df['Genre'].notnull().sum()


9826

In [28]:
df['Genre'].shape

(9837,)

In [30]:
#check for duplicated rows
print("no of duplicates in given data ->",df.duplicated().sum())
print("no of duplicates in genre column ->",genre_df.duplicated().sum())

no of duplicates in given data -> 0
no of duplicates in genre column -> 7499


->Since the number of duplicates in the Genre column is high (7499 of 9837 rows), most genre combinations appear more than once.

This can help answer:

->Are there many repeated genre values?
->Is the dataset dominated by a few genres?
->Should these be compressed into categories?
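
Because each Genre cell holds a comma-separated combination, counting whole cells mixes "Thriller" with "Crime, Mystery, Thriller". One possible way to count individual genres is to split and explode the column first. This is a sketch on a few sample values (the Series `genres` is an assumption built from the first rows shown earlier plus one missing value), not necessarily the approach the notebook takes later.

```python
import pandas as pd

# A few Genre values as they appear in the data, plus one missing entry.
genres = pd.Series(
    ["Action, Adventure, Science Fiction", "Crime, Mystery, Thriller", "Thriller", None],
    name="Genre",
)

single_genre_counts = (
    genres.dropna()      # ignore rows without a genre
    .str.split(",")      # "a, b" -> ["a", " b"]
    .explode()           # one row per genre
    .str.strip()         # remove the white space left by the split
    .value_counts()      # frequency of each individual genre
)
print(single_genre_counts)
```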



In [33]:
dom_gen=df['Genre'].value_counts().to_frame(name='dominatedGenre')
dom_gen.head()

Genre,dominatedGenre
Drama,466
Comedy,403
"Drama, Romance",248
Horror,238
"Horror, Thriller",199


In [6]:
df.describe()

Unnamed: 0,Popularity
count,9827.0
mean,40.32057
std,108.874308
min,7.1
25%,16.1275
50%,21.191
75%,35.1745
max,5083.954


# Exploration summary
We have a dataframe consisting of 9837 rows and 9 columns.

The dataset looks fairly tidy: there are no fully duplicated rows, and only a handful of missing values (about 9 to 11 per column).

The Release_Date column needs to be cast to datetime, and we only need to keep the year.

Overview, Original_Language and Poster_Url wouldn't be very useful during analysis.

There are noticeable outliers in the Popularity column (the maximum is 5083.954, while the 75th percentile is 35.1745).

Vote_Average is better categorised for proper analysis.

The Genre column has comma-separated values and white spaces that need to be handled.
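
To put a number on the Popularity outliers, one common heuristic is the 1.5 × IQR rule. This rule is an assumption brought in for illustration; the notebook does not say which outlier definition it uses. The quartiles below are the ones printed by df.describe(), and the Series `popularity` is a small made-up sample containing values from that output.

```python
import pandas as pd

# Quartiles of Popularity as reported by df.describe() above.
q1, q3 = 16.1275, 35.1745

iqr = q3 - q1                    # interquartile range
upper_fence = q3 + 1.5 * iqr     # values above this are flagged as high outliers
lower_fence = q1 - 1.5 * iqr     # values below this are flagged as low outliers
print(lower_fence, upper_fence)

# Small sample: min, median, 75th percentile and max from the describe() output.
popularity = pd.Series([7.1, 21.191, 35.1745, 5083.954])
outliers = popularity[(popularity < lower_fence) | (popularity > upper_fence)]
print(outliers)
```

Under this rule the upper fence sits near 63.7, far below the maximum of 5083.954.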

In [7]:
# Data Cleaning

df.head()


Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url
0,2021-12-15,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,5083.954,8940,8.3,en,"Action, Adventure, Science Fiction",https://image.tmdb.org/t/p/original/1g0dhYtq4i...
1,2022-03-01,The Batman,"In his second year of fighting crime, Batman u...",3827.658,1151,8.1,en,"Crime, Mystery, Thriller",https://image.tmdb.org/t/p/original/74xTEgt7R3...
2,2022-02-25,No Exit,Stranded at a rest stop in the mountains durin...,2618.087,122,6.3,en,Thriller,https://image.tmdb.org/t/p/original/vDHsLnOWKl...
3,2021-11-24,Encanto,"The tale of an extraordinary family, the Madri...",2402.201,5076,7.7,en,"Animation, Comedy, Family, Fantasy",https://image.tmdb.org/t/p/original/4j0PNHkMr5...
4,2021-12-22,The King's Man,As a collection of history's worst tyrants and...,1895.511,1793,7.0,en,"Action, Adventure, Thriller, War",https://image.tmdb.org/t/p/original/aq4Pwv5Xeu...


In [9]:
release_date_parsed = pd.to_datetime(
    df['Release_Date'],
    errors='coerce'   # invalid strings → NaT instead of crashing
)

print(release_date_parsed.dtype)


datetime64[ns]
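
A quick way to see what errors='coerce' does, and how many strings it failed to parse, is to compare missing counts before and after the conversion. The Series `raw` is a made-up sample, not data from movie.csv.

```python
import pandas as pd

# Two valid dates, one invalid string and one value that was already missing.
raw = pd.Series(["2021-12-15", "2022-03-01", "not a date", None])

# Invalid strings become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce")
print(parsed)

# Values that were present in the input but could not be parsed.
n_failed = parsed.isna().sum() - raw.isna().sum()
print(n_failed)
```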


In [11]:
release_year = release_date_parsed.dt.year


In [13]:
df['Release_Date_Clean'] = release_date_parsed
df['Release_Year'] = release_date_parsed.dt.year


In [14]:
df.head()

Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url,Release_Date_Clean,Release_Year
0,2021-12-15,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,5083.954,8940,8.3,en,"Action, Adventure, Science Fiction",https://image.tmdb.org/t/p/original/1g0dhYtq4i...,2021-12-15,2021.0
1,2022-03-01,The Batman,"In his second year of fighting crime, Batman u...",3827.658,1151,8.1,en,"Crime, Mystery, Thriller",https://image.tmdb.org/t/p/original/74xTEgt7R3...,2022-03-01,2022.0
2,2022-02-25,No Exit,Stranded at a rest stop in the mountains durin...,2618.087,122,6.3,en,Thriller,https://image.tmdb.org/t/p/original/vDHsLnOWKl...,2022-02-25,2022.0
3,2021-11-24,Encanto,"The tale of an extraordinary family, the Madri...",2402.201,5076,7.7,en,"Animation, Comedy, Family, Fantasy",https://image.tmdb.org/t/p/original/4j0PNHkMr5...,2021-11-24,2021.0
4,2021-12-22,The King's Man,As a collection of history's worst tyrants and...,1895.511,1793,7.0,en,"Action, Adventure, Thriller, War",https://image.tmdb.org/t/p/original/aq4Pwv5Xeu...,2021-12-22,2021.0
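
Release_Year shows up as 2021.0 rather than 2021, most likely because some dates could not be parsed: their year is missing, and the missing marker forces a float column. If whole-number years are wanted while keeping the missing rows, pandas' nullable Int64 dtype is one option. This is a sketch on a made-up two-row Series, not a step the notebook performs.

```python
import pandas as pd

# One valid date and one missing value, parsed the same way as above.
parsed = pd.to_datetime(pd.Series(["2021-12-15", None]), errors="coerce")

# Nullable integer dtype: whole-number years, missing values shown as <NA>.
year_int = parsed.dt.year.astype("Int64")
print(year_int)
```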


In [15]:
# Step 1: Convert Release_Date safely to datetime
df['Release_Date'] = pd.to_datetime(df['Release_Date'], errors='coerce')

# Step 2: Drop rows where date conversion failed
df = df.dropna(subset=['Release_Date'])

# Step 3: Keep only the year
df['Release_Date'] = df['Release_Date'].dt.year


SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Release_Date'] = df['Release_Date'].dt.year
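
This warning appears because pandas cannot tell whether the frame returned by dropna() is an independent object or a view of the original. One common way to avoid it is to take an explicit copy before assigning to a column. A minimal sketch on a made-up frame (`toy` and `clean` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"Release_Date": ["2021-12-15", "bad", "2022-03-01"]})

# Step 1: parse the dates; the invalid string becomes NaT.
toy["Release_Date"] = pd.to_datetime(toy["Release_Date"], errors="coerce")

# Step 2: drop the failed rows and take an explicit copy,
# so later assignments clearly target a new, independent frame.
clean = toy.dropna(subset=["Release_Date"]).copy()

# Step 3: keep only the year.
clean["Release_Date"] = clean["Release_Date"].dt.year
print(clean)
```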


In [16]:
df.head()

Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url,Release_Date_Clean,Release_Year
0,2021,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,5083.954,8940,8.3,en,"Action, Adventure, Science Fiction",https://image.tmdb.org/t/p/original/1g0dhYtq4i...,2021-12-15,2021.0
1,2022,The Batman,"In his second year of fighting crime, Batman u...",3827.658,1151,8.1,en,"Crime, Mystery, Thriller",https://image.tmdb.org/t/p/original/74xTEgt7R3...,2022-03-01,2022.0
2,2022,No Exit,Stranded at a rest stop in the mountains durin...,2618.087,122,6.3,en,Thriller,https://image.tmdb.org/t/p/original/vDHsLnOWKl...,2022-02-25,2022.0
3,2021,Encanto,"The tale of an extraordinary family, the Madri...",2402.201,5076,7.7,en,"Animation, Comedy, Family, Fantasy",https://image.tmdb.org/t/p/original/4j0PNHkMr5...,2021-11-24,2021.0
4,2021,The King's Man,As a collection of history's worst tyrants and...,1895.511,1793,7.0,en,"Action, Adventure, Thriller, War",https://image.tmdb.org/t/p/original/aq4Pwv5Xeu...,2021-12-22,2021.0


In [18]:
df = df.drop(columns=['Release_Date_Clean', 'Release_Year'])


In [19]:
df.head()

Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url
0,2021,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,5083.954,8940,8.3,en,"Action, Adventure, Science Fiction",https://image.tmdb.org/t/p/original/1g0dhYtq4i...
1,2022,The Batman,"In his second year of fighting crime, Batman u...",3827.658,1151,8.1,en,"Crime, Mystery, Thriller",https://image.tmdb.org/t/p/original/74xTEgt7R3...
2,2022,No Exit,Stranded at a rest stop in the mountains durin...,2618.087,122,6.3,en,Thriller,https://image.tmdb.org/t/p/original/vDHsLnOWKl...
3,2021,Encanto,"The tale of an extraordinary family, the Madri...",2402.201,5076,7.7,en,"Animation, Comedy, Family, Fantasy",https://image.tmdb.org/t/p/original/4j0PNHkMr5...
4,2021,The King's Man,As a collection of history's worst tyrants and...,1895.511,1793,7.0,en,"Action, Adventure, Thriller, War",https://image.tmdb.org/t/p/original/aq4Pwv5Xeu...


Dropping 'Original_Language' & 'Poster_Url'


In [None]:
df.drop(columns=['Original_Language', 'Poster_Url'], inplace=True)
df.head()


Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Genre
0,2021,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,5083.954,8940,8.3,"Action, Adventure, Science Fiction"
1,2022,The Batman,"In his second year of fighting crime, Batman u...",3827.658,1151,8.1,"Crime, Mystery, Thriller"
2,2022,No Exit,Stranded at a rest stop in the mountains durin...,2618.087,122,6.3,Thriller
3,2021,Encanto,"The tale of an extraordinary family, the Madri...",2402.201,5076,7.7,"Animation, Comedy, Family, Fantasy"
4,2021,The King's Man,As a collection of history's worst tyrants and...,1895.511,1793,7.0,"Action, Adventure, Thriller, War"


# Categorizing the Vote_Average column

We will cut the Vote_Average values into 4 categories: Popular, Average, Below_average, not_popular,
using the categorize_col() function defined here.

In [28]:
def categorize_col(df, col, labels):
    """
    Categorize a numeric column based on its quartiles.

    Args:
        df (DataFrame): dataframe we are processing
        col (str): name of the column to categorize (must be numeric)
        labels (list): labels ordered from min to max

    Returns:
        DataFrame: the same dataframe with `col` replaced by its category
    """
    # compute the summary once and take the quartile edges from it
    stats = df[col].describe()
    edges = [stats['min'], stats['25%'], stats['50%'], stats['75%'], stats['max']]

    # include_lowest=True keeps the minimum value inside the first bin
    df[col] = pd.cut(df[col], edges, labels=labels,
                     duplicates='drop', include_lowest=True)
    return df
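
For comparison, pandas also provides pd.qcut, which bins by quantiles directly. The sketch below is an alternative technique, not the function above. It assumes the votes arrive as strings (Vote_Average is an object column in df.info()) and converts them first; the `votes` values and the label spellings are made up for illustration.

```python
import pandas as pd

# Made-up ratings stored as strings, as in an object-dtype column.
votes = pd.Series(["1", "2", "3", "4", "5", "6", "7", "8"])

# Convert to numbers; anything unparseable would become NaN.
votes = pd.to_numeric(votes, errors="coerce")

# Labels ordered from the lowest quartile to the highest.
labels = ["not_popular", "below_avg", "average", "popular"]

# q=4 splits the values into four quantile-based bins.
vote_category = pd.qcut(votes, q=4, labels=labels)
print(vote_category.value_counts())
```

Unlike pd.cut with hand-built edges, pd.qcut includes the minimum value in the first bin by default.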