# Movie Recommendation System

---
## Day 2

Goal: Prepare the TMDB 5000 Movie Dataset by cleaning and organizing it for recommendations.

### Tasks:

#### Task 1: Install Python and required libraries.


!pip install pandas numpy scikit-learn

#### Task 2: Download the dataset.

- Visit the TMDB 5000 Movie Dataset page on Kaggle, sign in, click “Download”, and unzip the archive to get tmdb_5000_movies.csv and tmdb_5000_credits.csv. Place them in your project folder.
- Rename the files to movies.csv and credits.csv for consistency. Ensure they’re in the same folder as this script.
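
Before loading anything, it can help to confirm that both renamed files are where the script expects them. The following is a minimal sketch, assuming the files were renamed to movies.csv and credits.csv as described above; the helper name `missing_files` is hypothetical and not part of any library.

```python
from pathlib import Path


def missing_files(folder, names=("movies.csv", "credits.csv")):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


# Check the current working directory (assumed to be the project folder).
missing = missing_files(".")
if missing:
    print("Missing files:", missing)
else:
    print("Both dataset files found.")
```

If the list is not empty, `pd.read_csv` would raise a `FileNotFoundError` for those names, so it is worth fixing the folder layout first.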

#### Task 3: Load datasets into Pandas.

In [3]:
# Import pandas
import pandas as pd

In [5]:
movies = pd.read_csv('movies.csv')
credits = pd.read_csv('credits.csv')

In [6]:
movies.shape

(4803, 20)

#### Task 4: Inspect dataset columns.

In [8]:
movies.head(4)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106


In [12]:
print(movies.columns.tolist())
print("*"*50)
print(credits.columns.tolist())

['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']
**************************************************
['movie_id', 'title', 'cast', 'crew']


---
## Day 3

**Goal: Learn how to measure similarity between users to find similar tastes.**

In [2]:
import pandas as pd

movies = pd.read_csv('movies.csv')
credits = pd.read_csv('credits.csv')

### Task 1: Merge datasets.
- First you will have to learn: Merging combines the two datasets on a shared key, here id (movies) and movie_id (credits).
- How to learn easily: Read the Pandas documentation for merge.
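
The idea of merging on a shared key can be tried on two tiny tables first. This is only a sketch: `movies_demo`, `credits_demo`, and their values are made up for illustration. It uses `left_on`/`right_on` to name the key on each side, which is an alternative to renaming the column first.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration).
movies_demo = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits_demo = pd.DataFrame({"movie_id": [1, 2, 4], "cast": ["x", "y", "z"]})

# Tell merge which column is the key on each side.
merged = movies_demo.merge(credits_demo, left_on="id", right_on="movie_id")

# The default is an inner join, so only ids present in both tables survive.
print(merged)
```

Because the key names differ, both `id` and `movie_id` are kept in the result. Renaming `movie_id` to `id` before merging avoids carrying the key twice.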

In [9]:
# Step 1: Rename the 'movie_id' column in the 'credits' DataFrame to 'id'
# This ensures both DataFrames have a common column name for merging.
credits = credits.rename(columns={'movie_id': 'id'})

In [12]:
# Step 2: Merge the 'movies' and 'credits' DataFrames
# We're merging 'on' the 'id' column, which is now common to both.
df = movies.merge(credits, on='id')

In [15]:
# Step 3: Print the shape of the new merged DataFrame (df)
# This helps you verify if the merge resulted in the expected number of rows and columns.
print("Shape after merging movies and credits:", df.shape)
print("Shape before merging movies and credits:", movies.shape, "::", credits.shape)

Shape after merging movies and credits: (4803, 23)
Shape before merging movies and credits: (4803, 20) :: (4803, 4)


In [16]:
print(df.columns)

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title_x', 'vote_average',
       'vote_count', 'title_y', 'cast', 'crew'],
      dtype='object')
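
The title_x and title_y columns in this output appear because both tables contain a title column that is not the merge key, and Pandas appends the default suffixes _x and _y to tell them apart. A minimal sketch of one way to avoid the duplicate, using made-up toy tables and assuming the two title columns hold the same values:

```python
import pandas as pd

# Toy tables that both carry a non-key 'title' column (invented values).
movies_demo = pd.DataFrame({"id": [1, 2], "title": ["A", "B"]})
credits_demo = pd.DataFrame(
    {"id": [1, 2], "title": ["A", "B"], "cast": ["x", "y"]}
)

# Default behaviour: overlapping non-key columns get _x / _y suffixes.
default_merge = movies_demo.merge(credits_demo, on="id")

# One option: drop the duplicate column from one side before merging.
clean_merge = movies_demo.merge(credits_demo.drop(columns="title"), on="id")

print(default_merge.columns.tolist())
print(clean_merge.columns.tolist())
```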


### Task 2: Check for missing data
- First you will have to learn: Missing values (NaN) can break text processing.
- How to learn easily: Learn isna().
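
To see what isna() does before running it on the full dataset, here is a small sketch on a made-up table (`demo` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy table with a few deliberate gaps (invented values).
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["A heist.", np.nan, None],
    "runtime": [120.0, np.nan, 95.0],
})

# isna() gives a True/False table; sum() counts the True values per column.
missing_counts = demo.isna().sum()
print(missing_counts)

# Keep only the columns that actually have gaps.
print(missing_counts[missing_counts > 0])
```

Both `np.nan` and `None` are counted as missing, which is why the overview column reports two gaps here.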

In [18]:
print(df.isna().sum())

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title_x                    0
vote_average               0
vote_count                 0
title_y                    0
cast                       0
crew                       0
dtype: int64


### Task 3: Handle missing values in the overview column.

- First you will have to learn: Replace NaN with empty strings for text columns.
- How to learn easily: Learn fillna()
- The .fillna() method in Pandas replaces missing values (NaN) with a value you choose; for text-based columns, use an empty string ('').
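
A short sketch of why the empty string helps, on a made-up column (`demo` is illustrative only): after the fill, string operations return a usable value for every row instead of NaN.

```python
import numpy as np
import pandas as pd

# Toy column with one missing overview (invented values).
demo = pd.DataFrame({"overview": ["A heist.", np.nan, "A road trip."]})

# Replace missing overviews with an empty string; other values are untouched.
demo["overview"] = demo["overview"].fillna("")

# String operations now give a number for every row.
lengths = demo["overview"].str.len()
print(lengths.tolist())
```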

In [22]:
# For now, we only handle the overview column
df['overview'] = df['overview'].fillna('')

In [23]:
# Check that the overview column no longer has missing values
print(df.isna().sum())

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   0
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title_x                    0
vote_average               0
vote_count                 0
title_y                    0
cast                       0
crew                       0
dtype: int64


### Task 4: Save the cleaned DataFrame.
- First you will have to learn: Saving progress avoids rework.
- How to learn easily: Learn **to_csv()**.

In [25]:
# Save the DataFrame 'df' to a new CSV file
# index=False prevents Pandas from writing the DataFrame index as a column in the CSV
df.to_csv('cleaned_movies.csv', index=False)
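
One thing to watch when loading the saved file later: by default, read_csv turns empty fields back into NaN, so the overview values filled with '' will show up as missing again. The sketch below demonstrates this on a made-up table, using an in-memory buffer instead of a real file. `keep_default_na=False` is one possible workaround; it applies to every column, so re-running fillna('') after loading may be the simpler choice.

```python
import io

import pandas as pd

# Toy table where one overview is an empty string (invented values).
demo = pd.DataFrame({"title": ["A", "B"], "overview": ["A heist.", ""]})

# Write to an in-memory buffer so the sketch leaves no files behind.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default read: the empty field comes back as NaN.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Alternative read: keep empty fields as empty strings (affects all columns).
buffer.seek(0)
reloaded_keep = pd.read_csv(buffer, keep_default_na=False)

print(reloaded["overview"].isna().sum())
print(reloaded_keep["overview"].isna().sum())
```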