### Sheyam Bitar

Python: Recommender System
 

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [2]:
#Load datasets
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')

In [3]:
#Merge datasets
movie_ratings = pd.merge(movies, ratings, on='movieId')
movie_tags = pd.merge(movies, tags, on='movieId')
movie_data = pd.merge(movie_ratings, movie_tags, on=['movieId', 'userId'])

In [4]:
#Drop unnecessary columns
movie_data = movie_data.drop(['userId', 'timestamp_x', 'timestamp_y'], axis=1)

In [5]:
#Drop duplicate rows
movie_data = movie_data.drop_duplicates()

In [7]:
print(movie_data.columns)

Index(['movieId', 'title_x', 'genres_x', 'rating', 'title_y', 'genres_y',
       'tag'],
      dtype='object')


In [8]:
#Concatenate movie features
movie_data['features'] = movie_data['genres_x'] + ' ' + movie_data['tag']

In [10]:
#Drop rows with missing values in 'features' column
movie_data = movie_data.dropna(subset=['features'])

In [33]:
#due to memory, I have to sample data - I am getting "memoryError"
#Sample a subset of the dataset
sample_size = 30000  # Adjust the sample size as needed
movie_data_sample = movie_data.sample(sample_size, random_state=42)

In [34]:
#TF-IDF Vectorization
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix_sample = tfidf.fit_transform(movie_data_sample['features'])

In [35]:
#Compute similarity matrix
cosine_sim_sample = linear_kernel(tfidf_matrix_sample, tfidf_matrix_sample)

In [36]:
print(movie_data_sample.columns)


Index(['movieId', 'title_x', 'genres_x', 'rating', 'title_y', 'genres_y',
       'tag', 'features'],
      dtype='object')


In [37]:
#Function to recommend movies based on input movie
def recommend_movies(movie_title, cosine_sim=cosine_sim_sample):
    # Case-insensitive matching of movie titles
    movie_indices = movie_data_sample[movie_data_sample['title_x'].str.contains(movie_title, case=False)].index
    
    if len(movie_indices) == 0:
        print("Sorry, no movie found. Please try again with a different title.")
        return
    
    idx = movie_indices[0]  # Take the first matching movie index
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return movie_data_sample['title_x'].iloc[movie_indices]

# User input
user_movie = input("Enter a movie you like: ")

# Generate recommendations
recommendations = recommend_movies(user_movie)

# Display recommendations
print("Recommended movies for you:")
print(recommendations)


Enter a movie you like: Showgirls
Recommended movies for you:
354541                     Kill Bill: Vol. 2 (2004)
245035                   Requiem for a Dream (2000)
799986                              I, Tonya (2017)
158191        Last Temptation of Christ, The (1988)
813703                                  Baal (1970)
476621    Before the Devil Knows You're Dead (2007)
525961                            Sin Nombre (2009)
697737                              The Drop (2014)
472342                      Eastern Promises (2007)
576710                             Town, The (2010)
Name: title_x, dtype: object


-------------------

1. Introduction

The objective of this project is to develop a movie recommender system using the MovieLens dataset. The recommender system allows users to input a movie they like and provides recommendations for other movies they might enjoy watching. In this write-up, we will explain the process of building the recommender system, including data preprocessing, feature engineering, and recommendation generation.



2. Data Preparation

The MovieLens dataset consists of several CSV files containing information about movies, user ratings, and tags. We loaded the necessary datasets into a programming environment and performed data preprocessing steps such as merging datasets, handling missing values, and dropping duplicate rows to ensure data cleanliness and consistency.

3. Feature Engineering

To build the recommender system, we used a content-based filtering approach, which recommends movies based on their features such as genres and tags. We concatenated the genres and tags of each movie to create a combined feature vector that represents the content of the movie.

4. TF-IDF Vectorization

We used the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization technique to convert the text-based movie features into numerical vectors. TF-IDF assigns weights to words based on their frequency in a document relative to their frequency in the entire corpus. This process allowed us to represent each movie as a vector in a high-dimensional space.

5. Similarity Computation

Next, we computed the cosine similarity between pairs of movies based on their TF-IDF vectors. Cosine similarity measures the cosine of the angle between two vectors and indicates the similarity between them. A higher cosine similarity value implies a greater similarity between two movies.

6. Recommendation Generation

To generate recommendations, we defined a function that takes a user's input movie and finds similar movies based on the cosine similarity scores. We used case-insensitive matching to find movies whose titles contain the input keyword. The function returns a list of recommended movies sorted by their similarity scores.

7. User Interaction

Finally, we allowed users to input a movie they like using the input() function in Python. The system then generates and displays recommendations for other movies based on the user's input.