## 10.2 Exercise
### <span style="color:darkred;"><em>Recommender System</em></span>
##### <span style="color:darkred;"><em>Using the small MovieLens data set, create a recommender system that allows users to input a movie they like (in the data set) and recommends ten other movies for them to watch. In your write-up, clearly explain the recommender system process and all steps performed. If you are using a method found online, be sure to reference the source.</em></span>

#### Data Import

In [3]:
# Importing library
import pandas as pd

# Importing csv files into pandas dataframe
movies_df = pd.read_csv("ml-latest-small/movies.csv")
tags_df = pd.read_csv("ml-latest-small/tags.csv")
ratings_df = pd.read_csv("ml-latest-small/ratings.csv")

#### EDA

In [5]:
movies_df.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
movies_df.dtypes

movieId     int64
title      object
genres     object
dtype: object

In [7]:
tags_df.head()

,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [8]:
tags_df.dtypes

userId        int64
movieId       int64
tag          object
timestamp     int64
dtype: object

In [9]:
ratings_df.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [10]:
ratings_df.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [11]:
# Dropping the timestamp columns, which the recommender does not use
tags_df = tags_df.drop(columns=["timestamp"])
ratings_df = ratings_df.drop(columns=["timestamp"])

#### Merging Data Frames

In [13]:
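# Merging data frames: movie metadata is left-joined onto ratings,
# then each user's tags for that movie are left-joined on top
combined_df0 = pd.merge(ratings_df, movies_df, on="movieId", how="left")
combined_df = pd.merge(combined_df0, tags_df, on=["movieId", "userId"], how="left")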

# Data preview
combined_df.head()

,userId,movieId,rating,title,genres,tag
0,1,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,
1,1,3,4.0,Grumpier Old Men (1995),Comedy|Romance,
2,1,6,4.0,Heat (1995),Action|Crime|Thriller,
3,1,47,5.0,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,
4,1,50,5.0,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,
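
One thing worth checking after a left merge like this is whether it added rows. The `tags_df` preview shows a single user applying several tags to the same movie, and in that situation a left merge repeats the rating row once per tag. The sketch below illustrates this with made-up toy values (not taken from MovieLens). The names `tags_joined` and `merged_one_row` are hypothetical, and joining tags per user and movie is only one possible way to keep one row per rating.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from MovieLens)
ratings = pd.DataFrame(
    {"userId": [1, 1], "movieId": [10, 20], "rating": [4.0, 5.0]}
)
tags = pd.DataFrame(
    {
        "userId": [1, 1, 1],
        "movieId": [10, 10, 10],
        "tag": ["funny", "quotable", "classic"],
    }
)

# A left merge repeats the rating row once per matching tag
merged = pd.merge(ratings, tags, on=["movieId", "userId"], how="left")
print(len(ratings), len(merged))  # prints: 2 4

# One option: join each user's tags for a movie into a single string first
tags_joined = (
    tags.groupby(["userId", "movieId"])["tag"]
    .agg(" ".join)
    .reset_index()
)
merged_one_row = pd.merge(
    ratings, tags_joined, on=["movieId", "userId"], how="left"
)
print(len(merged_one_row))  # prints: 2
```

Whether the repeated rows matter depends on what is done next. They do not change a pivot table that averages ratings, but they do add near-identical rows to anything fitted row by row.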


### Hybrid Recommender System
##### Combining content-based similarity (genres and tags) with collaborative filtering (ratings).
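
Before the full cell, here is a minimal sketch of the content-based idea on a toy catalogue. It assumes one feature string per movie, which is a simplification: the cell below vectorizes `combined_df`, which has one row per rating. The titles and feature strings are invented for illustration, and `toy_titles` and `toy_features` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy catalogue: one feature string (genres + tags) per movie
toy_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_features = [
    "Horror Thriller revenge",
    "Horror Thriller haunted",
    "Comedy Romance",
    "Comedy Romance wedding",
]

# Each movie becomes a TF-IDF weighted bag-of-words vector
tfidf_matrix = TfidfVectorizer().fit_transform(toy_features)

# Cosine distance is 1 minus cosine similarity, so 0 means "same direction"
nn_model = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn_model.fit(tfidf_matrix)

# Query with the first movie's own vector
distances, indices = nn_model.kneighbors(tfidf_matrix[0])
neighbours = [toy_titles[i] for i in indices[0]]
print(neighbours)  # prints: ['Movie A', 'Movie B']
```

Note that the query movie comes back as its own nearest neighbor at distance 0, so a recommender built this way usually needs to ask for one extra neighbor and drop the input title.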

In [27]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# NaN handling
combined_df['tag'] = combined_df['tag'].fillna('')
combined_df['genres'] = combined_df['genres'].fillna('')

# Combining tags and genres
combined_df['combined_features'] = combined_df['genres'] + " " + combined_df['tag']

# Vectorizing combined_features with TF-IDF. I nearly went with dummy variables instead, which would have taken a while.
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = tfidf.fit_transform(combined_df['combined_features'])

# Fitting cosine nearest neighbors on the TF-IDF matrix. This is neat; I need to read more about it.
nn_model = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='brute')
nn_model.fit(tfidf_matrix)

# Creating a movie-by-user ratings matrix (unrated entries filled with 0) for collaborative filtering
ratings_pivot = combined_df.pivot_table(index='movieId', columns='userId', values='rating', fill_value=0)
knn = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='brute')
knn.fit(ratings_pivot)

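# Mapping each title to the row of its first occurrence in combined_df
# (Series.drop_duplicates() would deduplicate the row numbers, not the titles)
first_rows = combined_df.drop_duplicates(subset='title')
movie_indices = pd.Series(first_rows.index, index=first_rows['title'])
titles = combined_df['title']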

def hybrid_recommender(user_movie):
    if user_movie not in movie_indices:
        return ["Movie not found. Please try another title."]
    
    # Content-based recommendations using Nearest Neighbors
    idx = movie_indices[user_movie]
    distances, indices = nn_model.kneighbors(tfidf_matrix[idx], n_neighbors=10)
    content_based = [titles.iloc[i] for i in indices[0]]

    # Collaborative filtering recommendations using Nearest Neighbors
    movie_id = combined_df.loc[combined_df['title'] == user_movie, 'movieId'].values[0]
    distances, indices = knn.kneighbors([ratings_pivot.loc[movie_id]], n_neighbors=10)
    collab_based = combined_df[combined_df['movieId'].isin(ratings_pivot.iloc[indices[0]].index)]['title'].tolist()

    # Merging both lists; set() removes duplicates but does not preserve ranking order
    hybrid_result = list(set(content_based + collab_based))[:10]
    return hybrid_result

# Decided on a fairly obscure test, and I'm pleasantly surprised with the recommendations.
sample_recommendations = hybrid_recommender('I Spit on Your Grave (Day of the Woman) (1978)')
print(sample_recommendations)

['Thinner (1996)', 'Poltergeist (1982)', 'Killing Me Softly (2002)', 'Cursed (2005)', 'Scanners (1981)', 'Cujo (1983)', 'I Spit on Your Grave (Day of the Woman) (1978)', 'Perfect Crime, The (Crimen Ferpecto) (Ferpect Crime) (2004)', 'Peaceful Warrior (2006)', 'House on Haunted Hill (1999)']
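
The sample output above includes the input title itself, and because the two lists are merged through `set()`, the ten titles kept are not chosen by rank. A possible refinement, sketched here under assumptions, is to interleave the two ranked lists while skipping the input title and any duplicates. The helper name `blend_recommendations` and its parameters are hypothetical and not part of the notebook above.

```python
from itertools import zip_longest


def blend_recommendations(seed_title, content_based, collab_based, k=10):
    """Interleave two ranked lists, dropping the seed title and duplicates."""
    blended = []
    seen = {seed_title}  # treat the input movie as already "seen"
    # Alternate between the lists so both methods contribute near the top
    for pair in zip_longest(content_based, collab_based):
        for title in pair:
            if title is not None and title not in seen:
                seen.add(title)
                blended.append(title)
    return blended[:k]


# Toy example with placeholder titles
result = blend_recommendations(
    "A", ["A", "B", "C"], ["C", "D", "A", "E"], k=3
)
print(result)  # prints: ['C', 'B', 'D']
```

Inside `hybrid_recommender`, a helper like this could replace the `set()` line. It assumes `content_based` and `collab_based` are each ordered from most to least similar.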
