<a href="https://colab.research.google.com/github/ratul-sraj/Hello-World/blob/master/MoviePrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Prediction Based on Cosine Similarity

In [48]:
!wget https://files.grouplens.org/datasets/movielens/ml-32m.zip

--2025-09-02 07:13:20--  https://files.grouplens.org/datasets/movielens/ml-32m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.96.204
Connecting to files.grouplens.org (files.grouplens.org)|128.101.96.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 238950008 (228M) [application/zip]
Saving to: ‘ml-32m.zip’


2025-09-02 07:13:22 (124 MB/s) - ‘ml-32m.zip’ saved [238950008/238950008]



In [49]:
# -o overwrites existing files without prompting, so the cell can be re-run
!unzip -o ml-32m.zip
!rm ml-32m.zip
import pandas as pd


Archive:  ml-32m.zip
  inflating: ml-32m/tags.csv         
  inflating: ml-32m/links.csv        
  inflating: ml-32m/README.txt       
  inflating: ml-32m/checksums.txt    
  inflating: ml-32m/ratings.csv      
  inflating: ml-32m/movies.csv       


In [50]:
df_ratings = pd.read_csv('ml-32m/ratings.csv')
df_tags = pd.read_csv('ml-32m/tags.csv')
df_movies = pd.read_csv('ml-32m/movies.csv')

In [46]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,17,4.0,944249077
1,1,25,1.0,944250228
2,1,29,2.0,943230976
3,1,30,5.0,944249077
4,1,32,5.0,943228858


In [5]:
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,22,26479,Kevin Kline,1583038886
1,22,79592,misogyny,1581476297
2,22,247150,acrophobia,1622483469
3,34,2174,music,1249808064
4,34,2174,weird,1249808102


In [6]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
movie_rated_mean = df_ratings.groupby("movieId").rating.mean()

In [8]:
df_movies = df_movies.merge(movie_rated_mean, left_on="movieId", right_on="movieId")

In [9]:
df_movies.head()

Unnamed: 0,movieId,title,genres,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.897438
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.275758
2,3,Grumpier Old Men (1995),Comedy|Romance,3.139447
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.845331
4,5,Father of the Bride Part II (1995),Comedy,3.059602


In [12]:
# Group df_tags by movieId and aggregate tags into a list
movie_tags = df_tags.groupby('movieId')['tag'].apply(list).reset_index()

# Merge the aggregated tags into df_movies
df_movies = df_movies.merge(movie_tags, on='movieId', how='left')

# Display the updated df_movies DataFrame
df_movies.head()

Unnamed: 0,movieId,title,genres,rating,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.897438,"[children, Disney, animation, children, Disney..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.275758,"[Robin Williams, fantasy, Robin Williams, time..."
2,3,Grumpier Old Men (1995),Comedy|Romance,3.139447,"[comedinha de velhinhos engraÃƒÂ§ada, comedinh..."
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.845331,"[characters, slurs, based on novel or book, ch..."
4,5,Father of the Bride Part II (1995),Comedy,3.059602,"[Fantasy, pregnancy, remake, family, Steve Mar..."


# Task
Find 5 similar movies for a given movie based on their genres and tags.

## Feature engineering

### Subtask:
Create a combined feature representation for each movie from its genres and tags. Options include one-hot encoding the genres and applying TF-IDF to the tags.
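
As an illustration of the one-hot option mentioned above, here is a minimal sketch on a toy frame. The name `toy_movies` is hypothetical, the rows are copied from the `movies.csv` preview, and the notebook itself goes on to use a single TF-IDF representation instead.

```python
import pandas as pd

# Toy stand-in for df_movies (hypothetical name); rows taken from the preview above.
toy_movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One indicator column per genre: 1 if the movie has that genre, 0 otherwise.
genre_dummies = toy_movies["genres"].str.get_dummies(sep="|")
print(genre_dummies)
```

Each row becomes a fixed-length 0/1 vector, which could be stacked next to a TF-IDF matrix of the tags if the two feature types should be weighted separately.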


**Reasoning**:
Handle missing values in the 'tag' column and create a combined string of genres and tags.



In [13]:
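# Movies without tags are NaN after the left merge; replace them with ''.
# astype(str) turns each tag list into its string form, e.g. "['children', 'Disney']";
# the brackets, quotes and commas are dropped later by the TF-IDF tokenizer.
df_movies['tag'] = df_movies['tag'].fillna('').astype(str)
df_movies['combined_features'] = df_movies['genres'] + ' ' + df_movies['tag']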

**Reasoning**:
Initialize and apply TF-IDF vectorization to the combined features.



In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_movies['combined_features'])
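
To see what the vectorizer does with strings shaped like `combined_features`, here is a small sketch on hypothetical inputs (`toy_features` is not part of the notebook). It relies on the default `TfidfVectorizer` tokenizer, which lowercases text and keeps only word tokens of two or more characters, so the `|` separators and the list punctuation do not end up in the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical strings in the same shape as the notebook's combined_features column.
toy_features = [
    "Adventure|Children|Fantasy ['Robin Williams', 'fantasy']",
    "Comedy|Romance",
    "Comedy ['remake', 'family']",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_features)

# The learned vocabulary: one column of the matrix per token.
vocabulary = sorted(toy_tfidf.vocabulary_)
print(vocabulary)
print(toy_matrix.shape)  # (number of movies, number of distinct tokens)
```

Note that multi-word tags such as "Robin Williams" are split into separate tokens under these defaults.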

## Similarity calculation

### Subtask:
Calculate the similarity between movies based on their feature representations using cosine similarity.


**Reasoning**:
Calculate the cosine similarity between movies based on their combined features.



In [15]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
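
With roughly 84,000 movies (the row count visible in the `df_movies` output), a dense all-pairs matrix holds about 7 billion floats, which may not fit in memory on a standard runtime. One possible alternative, sketched here on toy data, is to compute the similarities for a single movie on demand. The names `top_similar`, `toy_titles` and `toy_features` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for df_movies['combined_features'].
toy_titles = ["A", "B", "C", "D"]
toy_features = ["Comedy Romance", "Comedy Romance wedding", "Drama war", "Drama"]
toy_matrix = TfidfVectorizer().fit_transform(toy_features)

def top_similar(idx, matrix, k=2):
    """Return positions of the k rows most similar to row idx, excluding idx itself."""
    # Compare one row against all rows: the result has n_rows entries, not n_rows squared.
    scores = cosine_similarity(matrix[idx:idx + 1], matrix).ravel()
    scores[idx] = -np.inf  # never recommend the query row itself
    order = np.argsort(-scores, kind="stable")  # highest score first
    return order[:k].tolist()

print([toy_titles[i] for i in top_similar(0, toy_matrix)])
```

The trade-off is that each query recomputes one row of similarities rather than reading it from a precomputed matrix.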

## Recommendation function

### Subtask:
Create a function that takes a movie title as input and returns a list of the most similar movies based on the calculated similarity scores.


**Reasoning**:
Define the `get_recommendations` function to find similar movies based on the cosine similarity matrix.



In [16]:
def get_recommendations(title, cosine_sim=cosine_sim):
    """
    Gets movie recommendations based on cosine similarity.

    Args:
        title: The title of the movie.
        cosine_sim: The cosine similarity matrix.

    Returns:
        A list of recommended movie titles, or a message if the movie is not found.
    """
    # Get the index of the movie that matches the title
    if title not in df_movies['title'].values:
        return f"Movie with title '{title}' not found in the dataset."

    idx = df_movies[df_movies['title'] == title].index[0]

    # Get the pairwise similarity scores for the movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Drop the movie itself explicitly; with tied scores it is not guaranteed to sort first
    sim_scores = [pair for pair in sim_scores if pair[0] != idx]

    # Sort the remaining movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 5 most similar movies
    sim_scores = sim_scores[:5]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the 5 most similar movies
    return df_movies['title'].iloc[movie_indices].tolist()

## Summary:

### Data Analysis Key Findings

*   Missing values in the 'tag' column were replaced with empty strings.
*   The 'genres' and 'tag' columns of each movie were concatenated into a single `combined_features` string.
*   TF-IDF vectorization of the combined features represents each movie as a vector.
*   Cosine similarity between every pair of movies was computed from the TF-IDF matrix.
*   The `get_recommendations` function returns the 5 most similar movies for a given title, ranked by cosine similarity.


In [17]:
get_recommendations("Father of the Bride Part II (1995)")

['Steve Martin: A Wild and Crazy Guy (1978)',
 'Father of the Bride (1991)',
 'Roxanne (1987)',
 'Novocaine (2001)',
 '¡Three Amigos! (1986)']

In [18]:
import re

def find_movie_title(title):
    """
    Finds a movie title in the df_movies DataFrame by comparing the part of each title that
    precedes the "(year)" suffix, ignoring surrounding whitespace and capitalization.

    Args:
        title: The movie title to search for.

    Returns:
        The exact movie title from the DataFrame if found, otherwise None.
    """
    # Function to extract the core title content using a regular expression
    def extract_core_title(t):
        # This regex attempts to capture the main title part. It might need adjustment
        # based on the specific patterns in your titles.
        match = re.search(r'^(.*?)\s*\(\d{4}\)', t) # Matches anything before a space followed by (year)
        if match:
            return match.group(1).strip().lower()
        return t.strip().lower() # Fallback to just stripping and lowercasing if pattern not found

    cleaned_input_title = extract_core_title(title)

    for movie_title in df_movies['title']:
        if extract_core_title(movie_title) == cleaned_input_title:
            return movie_title
    return None

# Examples found after cleaning:
found_title = find_movie_title("Toy Story")
print(found_title)
found_title = find_movie_title("  Good Will Hunting (1997)  ")
print(found_title)
found_title = find_movie_title("orca")
print(found_title)

Toy Story (1995)
Good Will Hunting (1997)
Orca (2023)
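
`find_movie_title` re-runs the regular expression over every title on each call. A possible variation, sketched here on a toy list, is to normalize the titles once into a dictionary and look queries up directly. The names `core_title`, `title_lookup` and `find_title_fast` are hypothetical, and the sketch keeps the first matching title, as the loop above does.

```python
import re

def core_title(t):
    """Lowercased title text before the first '(YYYY)', or the whole string if no year is present."""
    match = re.search(r'^(.*?)\s*\(\d{4}\)', t)
    return (match.group(1) if match else t).strip().lower()

# Hypothetical stand-in for df_movies['title'].
toy_titles = ["Toy Story (1995)", "Good Will Hunting (1997)", "Orca (2023)"]

# Build the lookup once; setdefault keeps the first title seen for each normalized key.
title_lookup = {}
for full_title in toy_titles:
    title_lookup.setdefault(core_title(full_title), full_title)

def find_title_fast(query):
    """Return the stored title whose normalized form matches the query, else None."""
    return title_lookup.get(core_title(query))

print(find_title_fast("  Good Will Hunting (1997)  "))
print(find_title_fast("orca"))
```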


In [19]:
df_movies

Unnamed: 0,movieId,title,genres,rating,tag,combined_features
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.897438,"['children', 'Disney', 'animation', 'children'...",Adventure|Animation|Children|Comedy|Fantasy ['...
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.275758,"['Robin Williams', 'fantasy', 'Robin Williams'...","Adventure|Children|Fantasy ['Robin Williams', ..."
2,3,Grumpier Old Men (1995),Comedy|Romance,3.139447,"['comedinha de velhinhos engraÃƒÂ§ada', 'comed...",Comedy|Romance ['comedinha de velhinhos engraÃ...
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.845331,"['characters', 'slurs', 'based on novel or boo...","Comedy|Drama|Romance ['characters', 'slurs', '..."
4,5,Father of the Bride Part II (1995),Comedy,3.059602,"['Fantasy', 'pregnancy', 'remake', 'family', '...","Comedy ['Fantasy', 'pregnancy', 'remake', 'fam..."
...,...,...,...,...,...,...
84427,292731,The Monroy Affaire (2022),Drama,4.000000,,Drama
84428,292737,Shelter in Solitude (2023),Comedy|Drama,1.500000,,Comedy|Drama
84429,292753,Orca (2023),Drama,4.000000,,Drama
84430,292755,The Angry Breed (1968),Drama,1.000000,,Drama


In [20]:
import re

def extract_year(title):
    """
    Extracts the four-digit year from a movie title string, assuming it's in parentheses at the end.

    Args:
        title: The movie title string.

    Returns:
        The extracted year as a string, or None if no year is found.
    """
    # Allow trailing whitespace after the closing parenthesis
    match = re.search(r'\((\d{4})\)\s*$', title)
    if match:
        return match.group(1)
    return None

df_movies['year'] = df_movies['title'].apply(extract_year)

# Display the first few rows with the new 'year' column
display(df_movies.head())

Unnamed: 0,movieId,title,genres,rating,tag,combined_features,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.897438,"['children', 'Disney', 'animation', 'children'...",Adventure|Animation|Children|Comedy|Fantasy ['...,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.275758,"['Robin Williams', 'fantasy', 'Robin Williams'...","Adventure|Children|Fantasy ['Robin Williams', ...",1995
2,3,Grumpier Old Men (1995),Comedy|Romance,3.139447,"['comedinha de velhinhos engraÃƒÂ§ada', 'comed...",Comedy|Romance ['comedinha de velhinhos engraÃ...,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.845331,"['characters', 'slurs', 'based on novel or boo...","Comedy|Drama|Romance ['characters', 'slurs', '...",1995
4,5,Father of the Bride Part II (1995),Comedy,3.059602,"['Fantasy', 'pregnancy', 'remake', 'family', '...","Comedy ['Fantasy', 'pregnancy', 'remake', 'fam...",1995
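
The same extraction can also be written without a Python-level `apply`, using pandas string methods. This is a sketch on a hypothetical toy frame (`toy` is not part of the notebook), and it stores the year as a nullable integer rather than a string.

```python
import pandas as pd

# Hypothetical titles: one normal, one with trailing whitespace, one without a year.
toy = pd.DataFrame({"title": ["Toy Story (1995)", "Orca (2023) ", "Untitled Project"]})

# Capture the four digits inside the final parentheses; rows without a match become NaN.
year_text = toy["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Convert to a nullable integer column so missing years stay missing.
toy["year"] = pd.to_numeric(year_text).astype("Int64")
print(toy)
```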


In [21]:
# Assign with bracket notation; titles without a year become NaT
df_movies['year'] = pd.to_datetime(df_movies['year'], format='%Y')

In [22]:
# Among movies released in 2023, take the index labels of the 5 highest mean ratings
top_movies = df_movies[df_movies['year'].dt.year == 2023].rating.sort_values().tail().index

In [23]:
# top_movies holds index labels, so select with .loc rather than .iloc
df_movies.loc[top_movies]

Unnamed: 0,movieId,title,genres,rating,tag,combined_features,year
84339,291975,Centurion: The Dancing Stallion (2023),Drama,5.0,,Drama,2023-01-01
84205,291381,Why Can't My Life Be a Rom-Com? (2023),Comedy|Romance,5.0,,Comedy|Romance,2023-01-01
83981,290890,When Love Springs (2023),Comedy|Romance,5.0,,Comedy|Romance,2023-01-01
82442,285971,His Only Son (2023),Drama,5.0,,Drama,2023-01-01
84303,291831,Retreat to You (2023),Comedy|Romance,5.0,,Comedy|Romance,2023-01-01


In [24]:
get_recommendations(find_movie_title("Remember the titans"))

['Rudy (1993)',
 'Friday Night Lights (2004)',
 'Gridiron Gang (2006)',
 'Wildcats (1986)',
 'Semi-Tough (1978)']

In [25]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import re

# Create a text input widget
movie_title_input = widgets.Text(
    value='',
    placeholder='Enter movie title here',
    description='Movie Title:',
    disabled=False
)

# Function to provide movie title suggestions that contain the typed text
def get_suggestions(prefix, num_suggestions=5):
    """Suggests movie titles that contain the input text (case-insensitive substring match)."""
    cleaned_prefix = prefix.strip().lower()
    if not cleaned_prefix:
        return []

    suggestions = []
    for title in df_movies['title']:
        # Clean the movie title for comparison (remove year and clean)
        cleaned_title = re.sub(r'\s*\(\d{4}\)\s*$', '', title).strip().lower()
        if cleaned_prefix in cleaned_title:
            suggestions.append(title)

    # Return the top N suggestions
    return suggestions[:num_suggestions]

# Function to get and display recommendations
def get_and_display_recommendations(title):
    with output_widget:
        clear_output()
        if title:
            # Use the find_movie_title function to get the exact title
            exact_title = find_movie_title(title)
            if exact_title:
                recommendations = get_recommendations(exact_title)
                if isinstance(recommendations, list):
                    print(f"Recommendations for '{exact_title}':")
                    for movie in recommendations:
                        print(f"- {movie}")
                else:
                    # This handles the case where get_recommendations returns a "not found" message
                    print(recommendations)
            else:
                print(f"Movie title '{title}' not found or could not be matched.")
        else:
            print("Please enter a movie title.")

# Function to be called when an input suggestion button is clicked
def on_suggestion_button_clicked(b):
    get_and_display_recommendations(b.description)


# Function to be called when the input changes (for suggestions)
def handle_input_change(change):
    with output_widget:
        clear_output()
        entered_text = change['new']
        if entered_text:
            suggestions = get_suggestions(entered_text)
            if suggestions:
                print("Suggestions:")
                # Display suggestions as clickable buttons
                suggestion_buttons = [widgets.Button(description=s, layout=widgets.Layout(width='auto')) for s in suggestions]
                for btn in suggestion_buttons:
                    btn.on_click(on_suggestion_button_clicked)
                display(widgets.VBox(suggestion_buttons)) # Use VBox to arrange buttons vertically
            else:
                print("No suggestions found.")

# Create an output widget to display suggestions and recommendations
output_widget = widgets.Output()

# Observe changes in the input widget
movie_title_input.observe(handle_input_change, names='value')

# Display the widgets
display(movie_title_input, output_widget)

Text(value='', description='Movie Title:', placeholder='Enter movie title here')

Output()