## IMDb Movie Dataset Preprocessing and Cleaning

### Dataset Description
Poster_Link - Link to the poster image that IMDb uses

Series_Title - Name of the movie

Released_Year - Year in which the movie was released

Certificate - Certificate earned by the movie

Runtime - Total runtime of the movie, in minutes

Genre - Genre of the movie

IMDB_Rating - Rating of the movie on the IMDb site

Overview - Short summary of the plot

Meta_score - Metascore earned by the movie

Director - Name of the director

Star1, Star2, Star3, Star4 - Names of the stars

No_of_Votes - Total number of votes

Gross - Money earned by the movie

In [None]:
import pandas as pd
import numpy as np

movie = pd.read_csv("/kaggle/input/imdb-dataset-of-top-1000-movies-and-tv-shows/imdb_top_1000.csv")


In [23]:
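# Check column dtypes and non-null counts, then summary statistics.
# Note: info() prints its report itself and returns None, so wrapping it in
# print() also emits a trailing "None".

print(movie.info())
print(movie.describe())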

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB
None
       IMDB_Rating  Meta_score  
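
The `info()` report lists `Released_Year` and `Gross` as `object`, so they are stored as strings rather than numbers. A minimal sketch of one way to coerce such columns, assuming the strings may contain thousands separators or an occasional non-numeric entry. The toy values below are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration.
sample = pd.DataFrame({
    "Released_Year": ["1994", "1972", "PG"],
    "Gross": ["28,341,469", "134966411", None],
})

# errors="coerce" turns anything unparseable into NaN instead of raising.
sample["Released_Year"] = pd.to_numeric(sample["Released_Year"], errors="coerce")

# Strip thousands separators (if any) before converting to a number.
sample["Gross"] = pd.to_numeric(
    sample["Gross"].str.replace(",", "", regex=False), errors="coerce"
)

print(sample.dtypes)
```

Coercing first makes later choices such as filling with a median possible, since those need numeric dtypes.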

In [24]:
#handling missing values
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which raises a FutureWarning and stops working in pandas 3.0.
movie['Certificate'] = movie['Certificate'].fillna('Unrated')
movie['Meta_score'] = movie['Meta_score'].fillna(movie['Meta_score'].mode()[0])
movie['Gross'] = movie['Gross'].fillna(movie['Gross'].mode()[0])

print(movie.isnull().sum())

Poster_Link      0
Series_Title     0
Released_Year    0
Certificate      0
Runtime          0
Genre            0
IMDB_Rating      0
Overview         0
Meta_score       0
Director         0
Star1            0
Star2            0
Star3            0
Star4            0
No_of_Votes      0
Gross            0
dtype: int64
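
The imputation step (a constant for `Certificate`, the mode for the other columns) can be tried on a small frame first. This is a sketch on made-up values, written with plain assignment so that it does not rely on modifying a column selection in place:

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Certificate": ["A", None, "UA"],
    "Meta_score": [80.0, None, 80.0],
})

# Assigning the result back avoids chained in-place modification.
toy["Certificate"] = toy["Certificate"].fillna("Unrated")
toy["Meta_score"] = toy["Meta_score"].fillna(toy["Meta_score"].mode()[0])

# Every column should now report zero missing values.
print(toy.isnull().sum())
```

`DataFrame.fillna` also accepts a mapping of column name to fill value, which handles several columns in one call.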


In [25]:
# Convert the runtime from "<N> min" to an "Xh Ym" string
def convert_runtime(runtime):
    # Strip the unit and surrounding whitespace, then parse the minutes
    minutes = int(runtime.replace('min', '').strip())
    hours, remaining_minutes = divmod(minutes, 60)
    return f"{hours}h {remaining_minutes}m"

movie['Duration'] = movie['Runtime'].apply(convert_runtime)
movie = movie.drop(columns=['Runtime'])
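
The conversion above yields a display string, so the numeric runtime is no longer available for sorting or arithmetic. A possible variant, sketched on a toy frame, keeps the minutes as an integer column and builds the string from it. The `Runtime_min` column name is an assumption introduced here, not part of the dataset:

```python
import pandas as pd

toy = pd.DataFrame({"Runtime": ["142 min", "175 min", "96 min"]})

# Pull out the digits and keep them as an integer column for later arithmetic.
toy["Runtime_min"] = toy["Runtime"].str.extract(r"(\d+)", expand=False).astype(int)

# Build the display string from the numeric column.
hours = toy["Runtime_min"] // 60
minutes = toy["Runtime_min"] % 60
toy["Duration"] = hours.astype(str) + "h " + minutes.astype(str) + "m"

print(toy)
```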

In [26]:
# Combine columns (Star1, Star2, Star3, Star4) into one comma-separated column
movie['Stars'] = movie[['Star1', 'Star2', 'Star3', 'Star4']].apply(lambda x: ','.join(x.dropna()), axis=1)
movie = movie.drop(columns=['Star1', 'Star2', 'Star3', 'Star4'])
movie.head(5)

| | Poster_Link | Series_Title | Released_Year | Certificate | Genre | IMDB_Rating | Overview | Meta_score | Director | No_of_Votes | Gross | Duration | Stars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | 2343110 | 28341469 | 2h 22m | Tim Robbins,Morgan Freeman,Bob Gunton,William ... |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | 1620367 | 134966411 | 2h 55m | Marlon Brando,Al Pacino,James Caan,Diane Keaton |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | 2303232 | 534858444 | 2h 32m | Christian Bale,Heath Ledger,Aaron Eckhart,Mich... |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | 1129952 | 57300000 | 3h 22m | Al Pacino,Robert De Niro,Robert Duvall,Diane K... |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | 689845 | 4360000 | 1h 36m | Henry Fonda,Lee J. Cobb,Martin Balsam,John Fie... |
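
Joining the four star columns into one string is compact, but per-actor analysis then needs the names separated again. A minimal sketch of reversing the join with `str.split` and `explode`, assuming no actor name itself contains a comma. The `Star` column name is introduced here for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "Series_Title": ["The Godfather"],
    "Stars": ["Marlon Brando,Al Pacino,James Caan,Diane Keaton"],
})

# Split the combined string back into a list, then give each actor its own row.
per_actor = (
    toy.assign(Star=toy["Stars"].str.split(","))
       .explode("Star")
       .drop(columns="Stars")
       .reset_index(drop=True)
)

print(per_actor)
```

The long form makes counts such as appearances per actor a simple `value_counts()` on the `Star` column.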


In [28]:
# Save the cleaned dataset to a CSV file
movie.to_csv('/kaggle/working/IMDb_cleaned.csv', index=False)
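
A quick way to confirm the export is to read the file back and compare it with what was written. The sketch below does this on a toy frame in a temporary directory, so it does not depend on the Kaggle paths used in this notebook:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Series_Title": ["12 Angry Men"], "Duration": ["1h 36m"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "IMDb_cleaned.csv")
    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the CSV is loaded again.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```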
