# MovieLens and Recommender System

### System I: Recommendation Based on Genres

In [1]:
import numpy as np
import pandas as pd
import streamlit as st


In [2]:
import warnings
warnings.filterwarnings('ignore')

#### Load the data

In [17]:
ratings = pd.read_csv('ratings.dat', sep='::', engine = 'python', header=None)
ratings.columns = ['UserID', 'MovieID', 'Rating', 'Timestamp']

movies = pd.read_csv('movies.dat', sep='::', engine = 'python',
                     encoding="ISO-8859-1", header = None)
movies.columns = ['MovieID', 'Title', 'Genres']

users = pd.read_csv(f'users.dat', sep='::', engine = 'python', header = None)
users.columns = ['UserID', 'Gender', 'Age', 'Occupation', 'Zipcode']
    

In [4]:
rating_merged = ratings.merge(movies, left_on = 'MovieID', right_on = 'MovieID')
#rating_merged


### How recommendations are generated by Genre?

* We are going to recommend movies based on highly-rated by users.

* A scenario to be addressed: how do we come up with a score that has Rating & Number of Ratings embedded inside.

* Our idea is, to keep the Rating schema simple. Add a dimnishing factor. And this can shrink the Movies with less ratings more. And perform less shrinkage on movies with more ratings.

* Here is a scheme, we are coming with: 
    * (avg_rating_of_the_movie * rating_count_of_the_movie + min_rating_of_all_movies * avg_rating_count_of_all_movies) / (rating_count_of_the_movie + avg_rating_count_of_all_movies)


#### Weighted Rating computed for each Movie

In [5]:
movie_rating = rating_merged[['MovieID', 'Rating']].groupby("MovieID").agg(['mean', 'count']).droplevel(0, axis=1).reset_index()

movie_rating.rename(columns={"mean": "Rating", "count": "Rating_count"}, inplace=True)


avg_rating_count = movie_rating['Rating_count'].mean() 
#avg_rating = (movie_rating['Rating'] * movie_rating['Rating_count']).sum() / movie_rating['Rating_count'].sum()
#avg_rating = 2.5
avg_rating = movie_rating['Rating'].min()

movie_rating['Weighted_Rating'] = (movie_rating['Rating'] * movie_rating['Rating_count'] + avg_rating * avg_rating_count)  / (movie_rating['Rating_count'] + avg_rating_count)



#### Data frame is built that has Movies with Genres & Weighted Rating

In [6]:
movie_with_rating = movies.join(movie_rating.set_index('MovieID'), how='left', on="MovieID")

movie_with_rating['Weighted_Rating'].fillna(value=avg_rating, inplace=True)

#movie_with_rating.sort_values(by='Rating_count', ascending=False)[0:30]

In [7]:
genre_movie_ratings = movie_with_rating.copy()
genre_movie_ratings['Genres'] = genre_movie_ratings['Genres'].str.split('|')
genre_movie_ratings = genre_movie_ratings.explode('Genres')
#genre_movie_ratings

### Find movie by genre

In [8]:
def get_all_genre():
    genres = genre_movie_ratings['Genres'].unique()
    
    return genres

In [9]:
def find_top_movies_by_genre(genre, n=10):
    top_movies = genre_movie_ratings[genre_movie_ratings['Genres'] == genre]


    top_movies = top_movies.sort_values(by='Weighted_Rating', ascending=False)
    
    top_movies = top_movies[0:n]
    return top_movies



In [10]:
get_all_genre()

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

In [11]:
find_top_movies_by_genre(genre='Drama', n=10)

Unnamed: 0,MovieID,Title,Genres,Rating,Rating_count,Weighted_Rating
315,318,"Shawshank Redemption, The (1994)",Drama,4.554558,2227.0,4.170345
847,858,"Godfather, The (1972)",Drama,4.524966,2223.0,4.143341
523,527,Schindler's List (1993),Drama,4.510417,2304.0,4.142327
2789,2858,American Beauty (1999),Drama,4.317386,3428.0,4.075268
589,593,"Silence of the Lambs, The (1991)",Drama,4.351823,2578.0,4.034177
1959,2028,Saving Private Ryan (1998),Drama,4.337354,2653.0,4.029195
1178,1196,Star Wars: Episode V - The Empire Strikes Back...,Drama,4.292977,2990.0,4.020348
604,608,Fargo (1996),Drama,4.254676,2513.0,3.939032
900,912,Casablanca (1942),Drama,4.412822,1669.0,3.937765
1176,1193,One Flew Over the Cuckoo's Nest (1975),Drama,4.390725,1725.0,3.931993


### System II: Recommendation Based on IBCF

We train a recommender system and make prediction on the test data. We will use the surprise library. First, we must create a dataset object. To do so, we start with a dataframe with columns itemID, userID and rating and use the load_from_df method.

In [12]:
from surprise import Dataset, Reader
ratings = ratings.drop('Timestamp', axis = 1)
ratings.columns = ['userID', 'itemID', 'rating']
ratings = ratings[['itemID', 'userID', 'rating']]
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[["userID", "itemID", "rating"]], reader)

Next, we will split the data into a train and test.

In [13]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size = 0.25)

We'll use the item-based collaborative filtering method with cosine similarity and normalization.

In [27]:
from surprise.prediction_algorithms.knns import KNNWithZScore
sim_options = {'name': 'cosine', 'user_based': False}
algo = KNNWithZScore(sim_options=sim_options).fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


Now let’s make prediction on the test data:

In [28]:
import numpy as np
preds = [algo.predict(uid, iid).est for uid, iid, _ in testset]
results_table = pd.DataFrame(np.stack([testset])[0], 
                             columns = ['itemID', 'userID', 'rating'])
results_table['predicted'] = preds
results_merged = results_table.merge(movies, 
                                     left_on = 'itemID', right_on = 'MovieID')
results_merged = results_merged.drop('itemID', axis = 1)
results_merged.sort_values('userID')

Unnamed: 0,userID,rating,predicted,MovieID,Title,Genres
31562,1.0,4.0,3.614674,1593,Picture Perfect (1997),Comedy|Romance
14132,1.0,5.0,3.786717,749,"Man from Down Under, The (1943)",Drama
157091,1.0,5.0,4.401654,1000,Curdled (1996),Crime
112550,1.0,4.0,3.968580,2154,How Stella Got Her Groove Back (1998),Drama|Romance
4675,1.0,5.0,4.526305,1298,Pink Floyd - The Wall (1982),Drama|Musical|War
...,...,...,...,...,...,...
80273,3952.0,4.0,4.035114,2244,"Allnighter, The (1987)",Comedy|Romance
68199,3952.0,4.0,3.618029,531,"Secret Garden, The (1993)",Children's|Drama
127439,3952.0,5.0,4.137540,2088,Popeye (1980),Adventure|Comedy|Musical
101085,3952.0,4.0,3.620253,1952,Midnight Cowboy (1969),Drama


In [18]:
users

Unnamed: 0,UserID,Gender,Age,Occupation,Zipcode
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455
...,...,...,...,...,...
6035,6036,F,25,15,32603
6036,6037,F,45,1,76006
6037,6038,F,56,1,14706
6038,6039,F,45,0,01060


In [24]:
def get_random_movie_set(n=10):
    movie_set = movies.sample(n)
    return movie_set

get_random_movie_set()

Unnamed: 0,MovieID,Title,Genres
1971,2040,"Computer Wore Tennis Shoes, The (1970)",Children's|Comedy
3844,3914,"Broken Hearts Club, The (2000)",Drama
3711,3780,Rocketship X-M (1950),Sci-Fi
637,642,Roula (1995),Drama
3590,3659,Quatermass II (1957),Sci-Fi|Thriller
3665,3734,Prince of the City (1981),Drama
3482,3551,Marathon Man (1976),Thriller
215,217,"Babysitter, The (1995)",Drama|Thriller
1690,1741,"Midaq Alley (Callejón de los milagros, El) (1995)",Drama
3753,3823,Wonderland (1999),Drama


In [25]:
users.shape

(6040, 5)

In [26]:
movies.shape

(3883, 3)