<a href="https://colab.research.google.com/github/shaaranii12/recommendation-system/blob/main/Movie_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Preparation & Preprocessing

In [None]:
# Install the required libraries; -q keeps pip's output to a minimum
!pip install nltk scikit-learn numpy pandas matplotlib wordcloud requests -q

In [None]:
# Importing core libraries for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load the dataset
data = pd.read_csv('/content/drive/MyDrive/Projects/Project2_Rec_System/movies.csv')

In [None]:
data.shape

In [None]:
#Selecting only the important columns needed for recommendation
required_column = ['genres', 'keywords', 'overview', 'title']
data = data[required_column]

In [None]:
data.head()

In [None]:
data.info()

In [None]:
#drop rows with missing values
data = data.dropna().reset_index(drop=True)

For text-based recommendation, each movie needs a single text field that represents it. The next cell merges overview, genres, and keywords into one column called **total_overview**. Later, this column is converted into numerical features with TF-IDF so that the similarity between movies can be measured.

In [None]:
data['total_overview'] = data['overview'] + ' ' + data['genres'] + ' ' + data['keywords']

In [None]:
data = data[['title', 'total_overview']]

In [None]:
from wordcloud import WordCloud

#Generate a word cloud of the total overview column
overview_words = " ".join(data['total_overview'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(overview_words)

#Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Words in Movie Overviews')
plt.show()

A **word cloud** makes it easy to spot the most common keywords, genres, and descriptive words across all movies: the larger a word is drawn, the more often it occurs in the text.
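The same information can be read off numerically. The sketch below counts word frequencies with the standard library; `sample_overviews` is a made-up stand-in for `data['total_overview']`. Note that `WordCloud` applies its own default stopword filtering, so its picture will not match raw counts exactly.

```python
from collections import Counter

# Toy stand-in for data['total_overview']; the real column holds one string per movie.
sample_overviews = [
    "A space marine explores an alien world",
    "An alien invasion threatens the world",
    "A detective hunts a killer in the city",
]

# Lowercase, split on whitespace, and count every word across all texts.
word_counts = Counter(" ".join(sample_overviews).lower().split())

# The highest raw counts, before any stopword filtering.
top_words = word_counts.most_common(3)
print(top_words)
```

Common function words such as "a" dominate the raw counts, which is one reason stopwords are removed later in the notebook.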

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Text Preprocessing

In [None]:
#Define a set of stopwords
stop_words = set(stopwords.words('english'))

**Stop words** are very common words like “the”, “is”, “and” that don’t help tell movies apart. Removing them reduces noise and lets the model focus on meaningful words instead.
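As a minimal illustration of the idea, the sketch below filters a sentence against a tiny hand-written stopword set. Both `demo_stop_words` and `remove_stop_words` are hypothetical names used only here; the notebook itself uses NLTK's much longer English list.

```python
# Hypothetical mini stopword list for illustration; NLTK's English list is much longer.
demo_stop_words = {"the", "is", "and", "a", "of", "in"}

def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in stop_words]

tokens = remove_stop_words("The hero is lost in a city of robots", demo_stop_words)
print(tokens)  # ['hero', 'lost', 'city', 'robots']
```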

In [None]:
import re

def preprocess_text(text):
    #Remove all characters that are not letters or spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    #Convert all text to lowercase
    text = text.lower()

    #Tokenize the text and remove stopwords
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

In [None]:
data['total_overview_processed'] = data['total_overview'].apply(preprocess_text)

## Modeling & Similarity

**TF-IDF** (term frequency-inverse document frequency) gives more weight to words that are distinctive for a movie and less to words that appear in most movies. Instead of just counting how often a word appears, it scales that count by how rare the word is across the whole dataset. As a result, two movies are judged similar when they share distinctive words, not merely common ones.
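To make the weighting concrete, here is a small pure-Python sketch on three made-up documents. It uses the smoothed IDF formula that scikit-learn applies by default, `ln((1 + n) / (1 + df)) + 1`, and stops before the length normalization that `TfidfVectorizer` also performs, so the numbers will not match the vectorizer's output exactly.

```python
import math
from collections import Counter

# Three made-up, already-preprocessed documents.
docs = [
    "alien planet war",
    "alien invasion earth",
    "romantic comedy paris",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    # Raw term count times IDF, before any length normalization.
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(tokenized[0])
print(weights)
```

In the first document, "planet" (found in one document) receives a higher weight than "alien" (found in two), which is the behavior described above.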

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Initialize a TF-IDF vectorizer, keeping only the 10,000 most frequent terms in the corpus
vectors = TfidfVectorizer(max_features=10000)

#Fit the vectorizer on the processed text and transform it into a TF-IDF matrix
matrix = vectors.fit_transform(data['total_overview_processed'])

The TF-IDF matrix is a table in which each movie is a row, each word is a column, and each value shows how important that word is for that movie. Most values are zero, because any one movie uses only a small part of the vocabulary.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

#Compute pairwise cosine similarity between all movie vectors
cos_similarity = cosine_similarity(matrix, matrix)

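The cosine similarity result is a square table with one row and one column per movie; each entry shows how similar one movie is to another. Because TF-IDF weights are never negative, scores range from 0 (no words in common) to 1 (identical word weights), so the higher the number, the more alike the two movies are.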
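For intuition, cosine similarity can be computed by hand as the dot product of two vectors divided by the product of their lengths. The sketch below uses made-up three-term weight vectors; the `cosine` helper is illustrative only and assumes neither vector is all zeros.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|). Assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-term weight vectors for three movies.
movie_a = [1.0, 2.0, 0.0]
movie_b = [2.0, 4.0, 0.0]   # same direction as movie_a, just longer
movie_c = [0.0, 0.0, 3.0]   # shares no terms with movie_a

print(cosine(movie_a, movie_b))  # close to 1: same mix of terms
print(cosine(movie_a, movie_c))  # 0: no terms in common
```

Only the direction of the vectors matters, not their length, so a long overview and a short one can still score as highly similar.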

# Building the function

In [None]:
def recommend_movies(movie_name, cos_similarity=cos_similarity, data=data, top_n=5):
    # Find the index of the given movie
    index = data[data['title'].str.lower() == movie_name.lower()].index
    if len(index) == 0:
        # Always return a list (empty if not found)
        return []
    index = index[0]

    # Get similarity scores for the selected movie against all others
    sim_scores = list(enumerate(cos_similarity[index]))

    # Sort movies by similarity score in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

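    # Drop the movie itself by index (it is not guaranteed to sort first if another
    # movie has an identical score), then keep the top N most similar movies
    sim_scores = [score for score in sim_scores if score[0] != index][:top_n]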

    movie_indices = [i[0] for i in sim_scores]
    return data['title'].iloc[movie_indices].tolist()  # ✅ convert to Python list

In [None]:
# Try out the recommendation function
movie_example = "Avatar"
print(f"Recommendations for the movie {movie_example}:")
recommendation = recommend_movies(movie_example)
print(recommendation)
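The ranking step can be tried in isolation on a hand-made similarity matrix. This is a sketch with hypothetical titles and scores, not the notebook's data; it excludes the query movie by index and uses `heapq.nlargest` so the full list does not have to be sorted.

```python
import heapq

# Hypothetical titles and a hand-made symmetric similarity matrix.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
similarity = [
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
]

def top_similar(query_index, top_n=2):
    # Pair every other movie's score with its index, leaving out the query itself.
    candidates = [
        (score, i) for i, score in enumerate(similarity[query_index]) if i != query_index
    ]
    # nlargest returns the top_n highest-scoring pairs in descending order.
    best = heapq.nlargest(top_n, candidates)
    return [titles[i] for _, i in best]

print(top_similar(0))  # ['Movie B', 'Movie D']
```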

# API Integration

In [None]:
import json
with open("/content/drive/MyDrive/Projects/Project2_Rec_System/config.json") as f:
    config = json.load(f)

OMDB_API_KEY = config["OMDB_API_KEY"]

In [None]:
import requests

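def get_movie_details(title, api_key):
    # Pass the query as params so requests URL-encodes titles containing spaces or '&'
    params = {"t": title, "plot": "full", "apikey": api_key}
    # A timeout keeps the notebook from hanging if the API does not respond
    res = requests.get("https://www.omdbapi.com/", params=params, timeout=10).json()
    if res.get("Response") == "True":
        plot = res.get("Plot", "N/A")
        poster = res.get("Poster", "N/A")
        return plot, poster
    return "N/A", "N/A"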

# Test
plot, poster = get_movie_details("Inception", OMDB_API_KEY)
print("Plot:", plot)
print("Poster:", poster)
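A movie title placed in a query string needs percent-encoding, otherwise a character such as `&` is read as a parameter separator. The sketch below shows what an encoded lookup URL looks like using only the standard library; `build_omdb_url` is a hypothetical helper and `"demo-key"` is a placeholder, not a real API key.

```python
from urllib.parse import urlencode

def build_omdb_url(title, api_key):
    # urlencode percent-encodes characters such as '&' and turns spaces into '+'.
    query = urlencode({"t": title, "plot": "full", "apikey": api_key})
    return f"https://www.omdbapi.com/?{query}"

# "demo-key" is a placeholder, not a real API key.
url = build_omdb_url("Fast & Furious", "demo-key")
print(url)
```

The `&` inside the title becomes `%26`, so it stays part of the `t` parameter instead of starting a new one.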

In [None]:
# Combine the recommender with the OMDb lookup
movies = recommend_movies("Inception")
for m in movies:
    plot, poster = get_movie_details(m, OMDB_API_KEY)
    print(m)
    print("Plot:", plot[:100], "...")
    print("Poster:", poster)
    print("===")

# Deployment to Gradio

In [None]:
import joblib

# Save cleaned dataframe and similarity matrix
joblib.dump(data, "df_cleaned.pkl")
joblib.dump(cos_similarity, "cosine_sim.pkl")

In [None]:
!pip install gradio -q

In [None]:
!pip install -U huggingface_hub -q

In [None]:
from huggingface_hub import login, create_repo

In [None]:
login()

In [None]:
#Define the repository ID
repo_id = "Shaaranii12/reccomendation-model"

#Create a new repository on Hugging Face Hub
create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio", private=False, exist_ok=True)

The movie recommendation system was successfully uploaded and deployed on Hugging Face Spaces, creating a simple web app where users can select a movie and instantly see similar recommendations.

Click the link to explore the app: [Movie Recommendation System](https://huggingface.co/spaces/Shaaranii12/reccomendation-model)