# Plan for the Project

1. Understand the Data – Look at the movie dataset and see what information we have.
2. Data Cleaning – Fix any missing or bad data.
3. Exploratory Data Analysis (EDA) – Find patterns, like the most popular movies or highest-rated genres.
4. Feature Engineering – Create useful data points for recommendations.
5. Build the Recommendation System – Use different techniques like:
   - Popularity-based recommendations (simple)
   - Content-based filtering (movies similar to what you like)
   - Collaborative filtering (like Netflix recommendations)
6. Create a Simple UI – Make it easy to use.
7. Final Testing & Submission – Ensure everything works as expected.

In [38]:
import os

print(os.listdir())  # List the files and folders in the current working directory


['.anaconda', '.android', '.aws', '.cache', '.conda', '.condarc', '.continuum', '.emulator_console_auth_token', '.equo', '.gitconfig', '.gradle', '.Icecream PDF Editor', '.idlerc', '.ipynb_checkpoints', '.ipython', '.jupyter', '.knime', '.matplotlib', '.ms-ad', '.nuget', '.virtual_documents', '.vscode', 'anaconda3', 'AndroidStudioProjects', 'AppData', 'Application Data', 'consonatorvowle.ipynb', 'Contacts', 'Cookies', 'data_discibefunc.ipynb', 'Desktop', 'division.ipynb', 'Documents', 'Downloads', 'dtaframe.ipynb', 'Favorites', 'functions.ipynb', 'hakerrank', 'handling_arrays.ipynb', 'Heartdisease_prediction.ipynb', 'Heart_disease_prediction.ipynb', 'input_output.ipynb', 'knime-workspace', 'Links', 'Local Settings', 'loopsinpython.ipynb', 'ml', 'Module10.ipynb', 'Module_11.ipynb', 'movies.csv', 'movie_rec.ipynb', 'Music', 'My Documents', 'NetHood', 'NTUSER.DAT', 'ntuser.dat.LOG1', 'ntuser.dat.LOG2', 'NTUSER.DAT{4b0d0a3b-99f2-11ee-ac8c-9c77c0a42057}.TM.blf', 'NTUSER.DAT{4b0d0a3b-99f2-11

## Step 1: Load the Dataset
Import the necessary Python libraries, then load both datasets into pandas DataFrames.

In [40]:
import pandas as pd  # Importing Pandas library

# Load both datasets (replace the filenames with the actual paths if they differ)
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")


## Step 2: Understand the Data
1️⃣ Exploring movies.csv

This file contains movie details. Look at the output of movies.head(). It should have columns like:

movieId → Unique ID for each movie

title → Name of the movie

genres → Movie genres (e.g., Action, Comedy)

In [42]:
# Display the first 5 rows
print("Movies Dataset:")
display(movies.head())


Movies Dataset:


   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


2️⃣ Exploring ratings.csv

This file contains user ratings for movies. It should have columns like:

userId → ID of the user who rated the movie.

movieId → Movie they rated.

rating → Rating given (from 0.5 to 5.0, in steps of 0.5).

timestamp → When the rating was given.

In [44]:
print("\nRatings Dataset:")
display(ratings.head())


Ratings Dataset:


   userId  movieId  rating   timestamp
0       1       16     4.0  1217897793
1       1       24     1.5  1217895807
2       1       32     4.0  1217896246
3       1       47     4.0  1217896556
4       1       50     4.0  1217896523
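
The timestamp column is easier to read as a date. A minimal sketch, assuming the values are seconds since 1970-01-01 UTC (the usual Unix convention; the notebook itself does not state the unit). The `sample` frame and the `rated_at` column are illustrative names, not part of the dataset:

```python
import pandas as pd

# Toy frame reusing the first two timestamps shown in ratings.head()
sample = pd.DataFrame({"timestamp": [1217897793, 1217895807]})

# Assumption: the values are Unix timestamps in seconds
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s")

print(sample)
```

Under that assumption, the same call on `ratings['timestamp']` would add a readable date column to the full dataset.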


## Step 3: Check for Missing Values

We must check if any data is missing before using it.

In [46]:
print("Missing values in movies dataset:")
print(movies.isnull().sum())  # .isnull().sum() counts the missing values in each column

print("\nMissing values in ratings dataset:")
print(ratings.isnull().sum())

Missing values in movies dataset:
movieId    0
title      0
genres     0
dtype: int64

Missing values in ratings dataset:
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64
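
Both files turn out to be complete, so no cleaning is needed here. If missing values did show up, one possible way to handle them is sketched below on a made-up `toy` frame. The choices (drop rows without a title, fill missing genres with an empty string) are assumptions for illustration, not something this dataset requires:

```python
import pandas as pd

# Made-up rows with deliberately missing values
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", None],
    "genres": ["Comedy", None, "Drama"],
})

# A movie without a title cannot be looked up by name, so drop those rows
cleaned = toy.dropna(subset=["title"]).copy()

# A missing genre can be kept as an empty string
cleaned["genres"] = cleaned["genres"].fillna("")

# Remove exact duplicate rows, if any
cleaned = cleaned.drop_duplicates()

print(cleaned.isnull().sum())
```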


## Step 4: Checking Column Names

In [48]:

print("\nColumns in movies dataset:", movies.columns)
print("\nColumns in ratings dataset:", ratings.columns)



Columns in movies dataset: Index(['movieId', 'title', 'genres'], dtype='object')

Columns in ratings dataset: Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')


## Step 5: Understanding the Data (Basic Analysis)

Before creating a movie recommendation system, let’s explore:

How many movies are there?

How many ratings do we have?

What are the most popular movies?

In [50]:
# Number of movies
print("Total number of movies:", movies.shape[0])

# Number of ratings
print("Total number of ratings:", ratings.shape[0])

# Number of unique users
print("Total number of unique users:", ratings['userId'].nunique())

# Number of unique movies that have been rated
print("Total number of movies rated:", ratings['movieId'].nunique())

# Average rating given to movies
print("Average movie rating:", ratings['rating'].mean())

# Checking the distribution of ratings
print("\nRating Distribution:")
display(ratings['rating'].value_counts().sort_index())


Total number of movies: 10329
Total number of ratings: 105339
Total number of unique users: 668
Total number of movies rated: 10325
Average movie rating: 3.5168503593161127

Rating Distribution:


rating
0.5     1198
1.0     3258
1.5     1567
2.0     7943
2.5     5484
3.0    21729
3.5    12237
4.0    28880
4.5     8187
5.0    14856
Name: count, dtype: int64
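
The counts above show 10329 movies but only 10325 rated ones, so a few movies have no ratings at all. A small sketch of how such movies could be listed, using made-up `movies_toy` and `ratings_toy` frames in place of the real files:

```python
import pandas as pd

# Made-up stand-ins for movies.csv and ratings.csv
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
})
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
})

# Keep the movies whose movieId never appears in the ratings
unrated = movies_toy[~movies_toy["movieId"].isin(ratings_toy["movieId"])]

print(unrated)
```

The same filter on the real `movies` and `ratings` frames should list the movies that have never been rated.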

## Step 6: Find the Most Popular Movies
Now, let’s check which movies have received the most ratings.

In [52]:
# Count the number of ratings per movie
movie_ratings_count = ratings.groupby('movieId')['rating'].count()

# Sort movies by number of ratings (most rated movies first)
most_rated_movies = movie_ratings_count.sort_values(ascending=False).head(10)

# Merge with movie names
most_rated_movies = most_rated_movies.reset_index().merge(movies, on='movieId')

# Show the top 10 most rated movies
print("\nTop 10 Most Rated Movies:")
display(most_rated_movies[['title', 'rating']])



Top 10 Most Rated Movies:


                                       title  rating
0                        Pulp Fiction (1994)     325
1                        Forrest Gump (1994)     311
2           Shawshank Redemption, The (1994)     308
3                       Jurassic Park (1993)     294
4           Silence of the Lambs, The (1991)     290
5  Star Wars: Episode IV - A New Hope (1977)     273
6                         Matrix, The (1999)     261
7          Terminator 2: Judgment Day (1991)     253
8                    Schindler's List (1993)     248
9                          Braveheart (1995)     248
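
The table above counts ratings only. A popularity-based recommender usually also looks at how high the ratings are, while ignoring movies with too few ratings to trust. The sketch below is one possible version on made-up data. The `min_ratings` threshold and the column names `num_ratings` and `avg_rating` are illustrative choices, not part of the dataset:

```python
import pandas as pd

# Made-up stand-ins for movies.csv and ratings.csv
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
ratings_toy = pd.DataFrame({
    "movieId": [1, 1, 2, 3, 3, 3],
    "rating": [5.0, 4.0, 5.0, 3.0, 4.0, 3.5],
})

# Number of ratings and average rating per movie
stats = ratings_toy.groupby("movieId")["rating"].agg(
    num_ratings="count", avg_rating="mean"
)

# Hypothetical threshold: ignore movies with fewer ratings than this
min_ratings = 2

popular = (
    stats[stats["num_ratings"] >= min_ratings]
    .sort_values(["avg_rating", "num_ratings"], ascending=False)
    .reset_index()
    .merge(movies_toy, on="movieId")
)

print(popular[["title", "num_ratings", "avg_rating"]])
```

Here "Movie B" is left out even though its single rating is 5.0, because one rating is below the threshold.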


## Step 7: Choosing a Recommendation Approach

There are two main types of recommendation systems:

1️⃣ Content-Based Filtering → Recommends movies whose features (here, genres) are similar to those of a given movie.
    
2️⃣ Collaborative Filtering → Recommends movies based on the ratings of users with similar tastes.

Since we have movie genres (from movies.csv), we’ll first build a Content-Based Recommendation System. 

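TF-IDF gives less weight to genre words that appear in many movies and more weight to rarer, more distinctive ones, so two movies that share an uncommon genre score as more similar than two that share a very common one.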


In [54]:
# Step 8: Extracting Movie Features
# We’ll use the genres column from movies.csv to find similar movies.

from sklearn.feature_extraction.text import TfidfVectorizer # TF-IDF (Term Frequency-Inverse Document Frequency).

# Fill missing values in genres column (if any)
movies['genres'] = movies['genres'].fillna('')

# Convert genres into a matrix of TF-IDF features
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['genres'])

print("TF-IDF Matrix Shape:", tfidf_matrix.shape)


TF-IDF Matrix Shape: (10329, 23)
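
One detail worth knowing: with its default settings, `TfidfVectorizer` splits text on punctuation, so a hyphenated genre such as "Sci-Fi" becomes two separate words. The sketch below compares the default behaviour with a variant that treats everything between `|` separators as one token. The `token_pattern` shown is an alternative to consider, not what the notebook above uses:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up genre strings in the same format as movies.csv
genres = ["Action|Sci-Fi", "Comedy|Romance"]

# Same settings as the notebook: default tokenizer, English stop words
default_vec = TfidfVectorizer(stop_words="english")
default_vec.fit(genres)
default_vocab = sorted(default_vec.vocabulary_)

# Alternative: every run of characters between "|" separators is one token
pipe_vec = TfidfVectorizer(token_pattern=r"[^|]+")
pipe_vec.fit(genres)
pipe_vocab = sorted(pipe_vec.vocabulary_)

print("Default tokens:", default_vocab)
print("Pipe-split tokens:", pipe_vocab)
```

Whether the split matters for recommendation quality is a judgment call. Keeping genres whole makes the columns of the matrix correspond to actual genres.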


✅ linear_kernel is used to find similar movies based on their genres.

✅ It compares all movies in a pairwise manner.

✅ Higher scores mean the movies are more similar.

✅ This is the brain behind movie recommendations!

In [56]:
from sklearn.metrics.pairwise import linear_kernel 
# linear_kernel computes the dot product between every pair of rows. TF-IDF rows are
# scaled to length 1 by default, so here the dot product equals the cosine similarity.

# Compute similarity between movies
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

print("Similarity Matrix Shape:", cosine_sim.shape)


Similarity Matrix Shape: (10329, 10329)
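
Why a plain dot product works as cosine similarity here can be checked on a tiny example. The sketch below uses three made-up genre strings and relies on the default `norm='l2'` setting of `TfidfVectorizer`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three made-up genre strings
genres = ["Action|Adventure", "Action|Comedy", "Drama"]

# Default norm='l2' scales every row to unit length
X = TfidfVectorizer().fit_transform(genres)

dot = linear_kernel(X, X)      # plain dot products
cos = cosine_similarity(X, X)  # explicit cosine similarity

same = np.allclose(dot, cos)
print("Dot product equals cosine similarity:", same)
print(np.round(dot, 3))
```

Every movie has similarity 1 with itself, and movies with no genre in common (the first and third here) have similarity 0.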


## Step 9: Creating a Movie Recommendation Function
Now that we have our cosine similarity matrix (which tells us how similar movies are), we need a function that:
✔ Takes a movie name as input
✔ Finds the most similar movies
✔ Returns the top 5 recommendations

In [None]:
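# Remove unwanted spaces from the titles
movies['title'] = movies['title'].str.strip()

# Lowercase matching: index the titles in lowercase so that "avatar (2009)"
# and "Avatar (2009)" resolve to the same row
indices = pd.Series(movies.index, index=movies['title'].str.lower())
# Keep the first row per title. drop_duplicates() would compare the values
# (row positions), which are already unique, so it would remove nothing.
indices = indices[~indices.index.duplicated(keep='first')]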



In [78]:
# Create a function to get movie recommendations
import pandas as pd

# Create a mapping from movie titles to row positions, keeping the first row per title
indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]

def recommend_movies(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices.get(title)

    if idx is None:
        return "Movie not found! Please check the title."

    # Get the similarity scores for all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry and keep the next 5. This assumes the movie itself sorts
    # first, which is not guaranteed when other movies have identical genres.
    sim_scores = sim_scores[1:6]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 5 most similar movies
    return movies['title'].iloc[movie_indices]

# Try it with a movie name
print(recommend_movies("Avatar (2009)"))


5725                           Spider-Man 2 (2004)
7055                       Superman Returns (2006)
8095                              Star Trek (2009)
8150    Transformers: Revenge of the Fallen (2009)
8350                                 Avatar (2009)
Name: title, dtype: object
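
Note that "Avatar (2009)" appears in its own recommendations. Several movies share exactly the same genres and therefore the same similarity score, so the movie itself is not necessarily first after sorting, and skipping position 0 removes a different movie. A minimal sketch of a more robust selection, using a made-up 4×4 similarity matrix; `top_similar` is a hypothetical helper, not part of the notebook:

```python
import numpy as np

# Made-up similarity matrix: movies 0, 1 and 2 have identical genres, movie 3 differs
sim = np.array([
    [1.0, 1.0, 1.0, 0.2],
    [1.0, 1.0, 1.0, 0.2],
    [1.0, 1.0, 1.0, 0.2],
    [0.2, 0.2, 0.2, 1.0],
])

def top_similar(idx, sim, n=2):
    """Return the positions of the n most similar movies, never idx itself."""
    scores = [(i, s) for i, s in enumerate(sim[idx]) if i != idx]
    scores.sort(key=lambda x: x[1], reverse=True)
    return [i for i, _ in scores[:n]]

# Slicing approach from the cell above, applied to movie 2
naive = [i for i, _ in sorted(enumerate(sim[2]), key=lambda x: x[1], reverse=True)[1:3]]

# Explicitly filtering out the query movie
picks = top_similar(2, sim)

print("Slicing [1:3]:", naive)
print("Filtering by position:", picks)
```

The slicing version returns movie 2 as a recommendation for itself, while the filtered version does not.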


## Step 11: Handle Missing Titles Gracefully
If the title is not found, this version suggests movie names that contain the first few letters of the input.

In [84]:

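# Rebuild the lookup with lowercase titles so it matches the lowercased input
indices = pd.Series(movies.index, index=movies['title'].str.lower())
indices = indices[~indices.index.duplicated(keep='first')]

def recommend_movies(title, cosine_sim=cosine_sim):
    title = title.strip().lower()  # Remove extra spaces and make lowercase
    idx = indices.get(title)

    if idx is None:
        print("Movie not found! Did you mean one of these?")
        # Suggest titles containing the first 5 characters (literal match, not a regex)
        print(movies['title'][movies['title'].str.contains(title[:5], case=False, na=False, regex=False)].head(5))
        return "Try another movie name."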

    # Get similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:6]
    movie_indices = [i[0] for i in sim_scores]

    return movies['title'].iloc[movie_indices]

print(recommend_movies("Superman Returns"))



Movie not found! Did you mean one of these?
489                              Super Mario Bros. (1993)
696     Supercop (Police Story 3: Supercop) (Jing cha ...
705        Supercop 2 (Project S) (Chao ji ji hua) (1993)
2107                                      Superman (1978)
2108                                   Superman II (1980)
Name: title, dtype: object
Try another movie name.
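
Matching on the first five characters is a rough heuristic: "Superman Returns" produced "Super Mario Bros." but not the intended movie. One alternative is `difflib.get_close_matches` from the Python standard library, which ranks candidates by overall string similarity. The sketch below uses a short made-up title list, and the `cutoff` value is an arbitrary choice:

```python
import difflib

# Made-up subset of titles in the same format as movies.csv
titles = [
    "Superman (1978)",
    "Superman II (1980)",
    "Superman Returns (2006)",
    "Super Mario Bros. (1993)",
]

# Map lowercase titles back to their original spelling
lowered = {t.lower(): t for t in titles}

query = "Superman Returns".strip().lower()
matches = difflib.get_close_matches(query, list(lowered), n=3, cutoff=0.6)
suggestions = [lowered[m] for m in matches]

print(suggestions)
```

On the full dataset this compares the query against every title, which is slower than a prefix search but tends to surface the intended movie first.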


## Step 12.1: Modify the Function to Show a Clean Output
Instead of just returning a list of movie names, let's display them in a nicer format with genres.


In [90]:

def recommend_movies(title, cosine_sim=cosine_sim):
    title = title.strip().lower()  # Remove extra spaces and make lowercase
    idx = indices.get(title)  # Expects `indices` to be keyed by lowercase titles

    if idx is None:
        print("Movie not found! Did you mean one of these?")
        # Suggest titles containing the first 5 characters (literal match, not a regex)
        print(movies['title'][movies['title'].str.contains(title[:5], case=False, na=False, regex=False)].head(5))
        return "Try another movie name."

    # Get similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:6]  # Get top 5 similar movies

    # Get movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Display recommendations in a better format
    print(f"\nTop 5 movies similar to '{title.title()}':\n")
    for i, movie_idx in enumerate(movie_indices):
        movie_title = movies.iloc[movie_idx]['title']
        movie_genres = movies.iloc[movie_idx]['genres']
        print(f"{i+1}. {movie_title}  |  Genres: {movie_genres}")

# Try the function
recommend_movies("Titanic")


Movie not found! Did you mean one of these?
1367                                       Titanic (1997)
1712    Chambermaid on the Titanic, The (Femme de cham...
2699                             Raise the Titanic (1980)
2700                                       Titanic (1953)
2962                                    Titan A.E. (2000)
Name: title, dtype: object


'Try another movie name.'
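
"Titanic" is not found because every title in the dataset ends with its release year, for example "Titanic (1997)". One way to accept titles without the year is to build a second lookup key with the year removed. This is a sketch on made-up rows, assuming every title ends in "(YYYY)"; `base_title` and `find_candidates` are illustrative names:

```python
import pandas as pd

# Made-up rows in the same format as movies.csv
movies_toy = pd.DataFrame({
    "title": ["Titanic (1997)", "Titanic (1953)", "Titan A.E. (2000)"],
})

# Strip a trailing "(YYYY)" and lowercase the rest to get a year-free key
movies_toy["base_title"] = (
    movies_toy["title"]
    .str.replace(r"\s*\(\d{4}\)$", "", regex=True)
    .str.lower()
)

def find_candidates(query):
    """Return all full titles whose year-free name equals the query."""
    key = query.strip().lower()
    return movies_toy.loc[movies_toy["base_title"] == key, "title"].tolist()

hits = find_candidates("Titanic")
print(hits)
```

When several movies share a name, as with the two Titanic entries, the function could ask the user to pick one or default to the most recent release.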