# Recommender Systems with Surprise Library

## Problem Statement:

Data source: https://grouplens.org/datasets/movielens/100k/

### What's in store?
- Exploratory Data Analysis
- Feature Engineering
- User-based Collaborative Filtering Recommender System using the Surprise library
- Model Evaluation (RMSE, MAE)

Note: I expect to expand this kernel with other recommender systems in the future :) Please suggest improvements and corrections in the comments section.

## Info about the dataset

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of
              user id | item id | rating | timestamp.
              The timestamps are Unix seconds since 1/1/1970 UTC.

u.genre    -- A list of the genres.

u.item     -- Information about the items (movies); this is a pipe ("|")
              separated list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.
              
u.user     -- Demographic information about the users; this is a pipe ("|")
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.
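
The record layouts described above can be sketched in plain Python. The two sample records below are illustrative values written in the documented layout (the URL is a placeholder), and the helper names `parse_rating` and `parse_genres` are hypothetical; nothing is read from disk.

```python
from datetime import datetime, timezone

# Genre names in the order documented for u.item (the last 19 fields).
GENRES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def parse_rating(line):
    """Split one tab-separated u.data record into typed fields."""
    user_id, item_id, rating, ts = line.rstrip("\n").split("\t")
    # The timestamp is Unix seconds since 1/1/1970 UTC.
    when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return int(user_id), int(item_id), int(rating), when


def parse_genres(line):
    """Return the genre names flagged with 1 in a pipe-separated u.item record."""
    fields = line.rstrip("\n").split("|")
    flags = fields[-len(GENRES):]
    return [name for name, flag in zip(GENRES, flags) if flag == "1"]


# Illustrative records (assumed values, placeholder URL).
sample_rating = "196\t242\t3\t881250949"
flags = ["0"] * len(GENRES)
for genre in ("Animation", "Children's", "Comedy"):
    flags[GENRES.index(genre)] = "1"
sample_item = "|".join(
    ["1", "Toy Story (1995)", "01-Jan-1995", "", "http://example.invalid/toy-story"] + flags
)

parsed_rating = parse_rating(sample_rating)
parsed_genres = parse_genres(sample_item)
print(parsed_rating)
print(parsed_genres)
```

A movie can carry several genre flags at once, which is why `parse_genres` returns a list rather than a single label.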

## Load the necessary libraries

In [None]:
#Importing the life-savers

import numpy as np # linear algebra
import pandas as pd # data processing
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

#Surprise library for building the recommendation model
from surprise import KNNWithMeans
from surprise import accuracy
from surprise import BaselineOnly, Reader, Dataset

from surprise.model_selection import cross_validate, train_test_split

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Exploratory Data Analysis

In [None]:
#Loading the genre dataset. Please note the latin-1 encoding used.
genrecols = ['Genre','Genre_Id']
genredf = pd.read_csv("/kaggle/input/movielens100k/u.genre", sep = '|', encoding = 'latin-1', names = genrecols,parse_dates = True)
genredf.set_index("Genre_Id", inplace = True)
genredf.head(20).T

In [None]:
#Saving the list of genres into a list for further use
genres = genredf['Genre'].values.tolist()
genres

In [None]:
#Loading the item dataset. Please note the latin-1 encoding used. The first 5 columns hold movie metadata; the remaining 19 are the genre flags, named from the u.genre list.
itemcols = ['Movie_Id','Title','Release_Date','Video_Release_Date','IMDb_Url']
itemcols_genres = itemcols + genres
itemdf = pd.read_csv("/kaggle/input/movielens100k/u.item", sep = '|', encoding = 'latin-1', names = itemcols_genres, parse_dates = True)
itemdf.head()

In [None]:
#Loading the user dataset. Please note the latin-1 encoding used.
usercols = ['User_Id','Age','Gender','Profession','Zipcode']
userdf = pd.read_csv("/kaggle/input/movielens100k/u.user", sep = '|', encoding = 'latin-1', names = usercols,parse_dates = True)
userdf.head()

In [None]:
sns.countplot(x = userdf["Gender"])
plt.show()

In [None]:
plt.figure(figsize = (16,8))
sns.countplot(x = userdf["Profession"], hue = userdf["Gender"])

plt.xticks(rotation = 90)
plt.show()

In [None]:
#Loading the ratings dataset (u.data). Only the first three columns are read, so Timestamp is dropped.
ratingcols = ['User_Id', 'Movie_Id', 'Rating', 'Timestamp']
ratingsdf = pd.read_csv("/kaggle/input/movielens100k/u.data", sep = '\t', encoding = 'latin-1', names = ratingcols, usecols = range(3), parse_dates = True)
ratingsdf.head()

In [None]:
sns.countplot(x = ratingsdf["Rating"])
plt.show()

In [None]:
#Merging all three dataframes to form our Master dataframe
movies = pd.merge(pd.merge(itemdf, ratingsdf),userdf)
movies.head()

In [None]:
#Setting the Movie_Id as row index
movies.set_index("Movie_Id", inplace = True)
movies.head(3)

## Feature Engineering

In [None]:
#Converting Release_Date to Pandas DateTime type
movies["Release_Date"] = pd.to_datetime(movies["Release_Date"])

#Deriving year, month, day of month and day of week from Release_Date
movies["release_year"] = movies["Release_Date"].dt.year
movies["release_month"] = movies["Release_Date"].dt.month
movies["release_date"] = movies["Release_Date"].dt.day #Day of the month
movies["release_day"] = movies["Release_Date"].dt.dayofweek #The day of the week with Monday=0, Sunday=6.
movies.head()
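
As a standalone illustration of the same derivation, the sketch below uses only the standard library on a single date string. It assumes release dates look like `01-Jan-1995`, and `release_features` is a hypothetical helper name; the notebook itself relies on the pandas `.dt` accessors.

```python
from datetime import datetime


def release_features(date_text):
    """Parse a 'DD-Mon-YYYY' string and derive the same four date parts."""
    parsed = datetime.strptime(date_text, "%d-%b-%Y")
    return {
        "release_year": parsed.year,
        "release_month": parsed.month,
        "release_date": parsed.day,       # day of the month
        "release_day": parsed.weekday(),  # Monday=0, Sunday=6
    }


features = release_features("01-Jan-1995")
print(features)
```

`datetime.weekday()` uses the same Monday=0 convention as the pandas `dayofweek` accessor, so the values are directly comparable.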

In [None]:
#Distribution of release years across the rated movies.
#Each row is one rating, so the plot is weighted by how often a movie was rated; most of them were released from 1990 onwards
sns.histplot(movies["release_year"], kde = True)
plt.show()

In [None]:
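#Distribution of movie releases - monthwise.
#Most releases appear to fall in the January-February period; many entries in u.item carry a 01-Jan release date, so part of this peak may be a placeholder artifact
sns.histplot(movies["release_month"], kde = True)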
plt.show()

In [None]:
#Distribution of movie releases by day of the week.
#Most releases fall on Friday (4), followed by the weekend, Saturday (5) and Sunday (6)
sns.histplot(movies["release_day"], kde = True)
plt.show()

In [None]:
#Distribution of user ages (one row per rating).
#Ratings are concentrated among users aged roughly 18 to 35
plt.figure(figsize = (12,8))
sns.histplot(movies["Age"], kde = True)
sns.rugplot(movies["Age"])
plt.show()

Visualize how the popularity of genres has changed over the years. For any given release year, the graph should show which genre was the most common.

In [None]:
movies.columns

In [None]:
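#Each row of movies is one rating, so these sums count ratings per genre and release year
#numeric_only skips the text and datetime columns, which cannot be summed meaningfully
genre_map = movies.groupby('release_year').sum(numeric_only = True)
genre_map = genre_map.drop(columns = ['unknown','Video_Release_Date','User_Id','Rating','Age','release_month', 'release_date', 'release_day'], errors = 'ignore').T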
genre_map
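
Because the merged frame holds one row per rating, summing the genre flags by release year counts ratings, not distinct movies. The sketch below shows the difference on hypothetical toy rows; `genre_totals` and the data are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical toy rows, one per rating: (movie_id, release_year, genre flags).
rating_rows = [
    (1, 1995, {"Comedy": 1, "Drama": 0}),
    (1, 1995, {"Comedy": 1, "Drama": 0}),
    (1, 1995, {"Comedy": 1, "Drama": 0}),
    (2, 1995, {"Comedy": 0, "Drama": 1}),
    (3, 1996, {"Comedy": 0, "Drama": 1}),
]


def genre_totals(rows, dedupe_movies):
    """Sum genre flags per release year, optionally counting each movie once."""
    seen = set()
    totals = defaultdict(lambda: defaultdict(int))
    for movie_id, year, flags in rows:
        if dedupe_movies:
            if movie_id in seen:
                continue
            seen.add(movie_id)
        for genre, flag in flags.items():
            totals[year][genre] += flag
    return {year: dict(counts) for year, counts in totals.items()}


per_rating = genre_totals(rating_rows, dedupe_movies=False)
per_movie = genre_totals(rating_rows, dedupe_movies=True)
print(per_rating)
print(per_movie)
```

In the toy data a single frequently rated comedy dominates the per-rating totals for 1995, while the per-movie totals show one comedy and one drama. Which view is appropriate depends on whether the heatmap is meant to show viewer attention or release counts.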

In [None]:
#Plotting the heatmap of the above grouped dataset to understand the distribution over the years
plt.figure(figsize = (20,8))
sns.set()
sns.heatmap(genre_map, cmap = 'YlGnBu', linewidths = 0.5, xticklabels = 5, cbar_kws={"orientation": "vertical"})
plt.show()

In [None]:
#!pip install WordCloud

In [None]:
from wordcloud import WordCloud, STOPWORDS

In [None]:
stopwords = set(STOPWORDS)

In [None]:
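#Build the cloud from each distinct title; str(movies['Title']) would only give the truncated Series preview
title_text = " ".join(movies['Title'].drop_duplicates().astype(str))

wordcloud = WordCloud(
                          background_color='white',
                          stopwords=stopwords,
                          max_words=200,
                          max_font_size=40, 
                          random_state=42
                         ).generate(title_text)

fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
fig.savefig("word1.png", dpi=900) #Save before plt.show() so the figure is still open
plt.show()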

In [None]:
movies.head()

## Model Building using Surprise Library

In [None]:
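#u.data has no header row, so supply the column names explicitly
user_item_df = pd.read_csv('/kaggle/input/movielens100k/u.data', sep='\t', names=['User_Id', 'Movie_Id', 'Rating', 'Timestamp'])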
user_item_df.head()

In [None]:
file_path = os.path.expanduser('/kaggle/input/movielens100k/u.data')

reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)

cross_validate(BaselineOnly(), data, verbose=True)
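
`BaselineOnly` predicts a rating as a global mean plus a user offset and an item offset. Surprise fits those offsets with regularized ALS or SGD; the sketch below is a simplified, unregularized stand-in on hypothetical toy data, meant only to show the shape of the estimate. The function names are assumptions, not part of the library.

```python
from collections import defaultdict


def fit_simple_baseline(ratings):
    """ratings: list of (user, item, rating). Returns (mu, user_bias, item_bias)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Item offset: average deviation of the item's ratings from the global mean.
    item_dev = defaultdict(list)
    for _, item, r in ratings:
        item_dev[item].append(r - mu)
    b_i = {item: sum(d) / len(d) for item, d in item_dev.items()}

    # User offset: average of what is left after removing mean and item offset.
    user_dev = defaultdict(list)
    for user, item, r in ratings:
        user_dev[user].append(r - mu - b_i[item])
    b_u = {user: sum(d) / len(d) for user, d in user_dev.items()}
    return mu, b_u, b_i


def predict_baseline(mu, b_u, b_i, user, item):
    """Estimate mu + b_u + b_i, treating unseen users or items as zero offset."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


toy_ratings = [("a", "x", 5), ("a", "y", 3), ("b", "x", 4), ("b", "y", 2)]
mu, b_u, b_i = fit_simple_baseline(toy_ratings)
known_estimate = predict_baseline(mu, b_u, b_i, "a", "x")
cold_start_estimate = predict_baseline(mu, b_u, b_i, "c", "x")
print(known_estimate, cold_start_estimate)
```

A model this simple is a useful yardstick: a neighbourhood model is only worth its extra cost if its cross-validated RMSE beats the baseline.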

In [None]:
trainset, testset = train_test_split(data, test_size=0.20)

In [None]:
algo = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True})
algo.fit(trainset)
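
With `user_based` set to True, `KNNWithMeans` estimates a rating as the target user's mean rating plus a similarity-weighted average of how far each neighbour's rating of the item sits from that neighbour's own mean. The sketch below illustrates that idea on hypothetical toy data. It assumes plain Pearson similarity over co-rated items, not the `pearson_baseline` measure configured above, and it is not the library's implementation.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}.
ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 3, "d": 4},
    "u3": {"a": 1, "b": 5, "c": 2, "d": 1},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(u, v):
    """Pearson correlation of two users over the items both have rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    mu_u = mean(ratings[u][i] for i in common)
    mu_v = mean(ratings[v][i] for i in common)
    num = sum((ratings[u][i] - mu_u) * (ratings[v][i] - mu_v) for i in common)
    den = sqrt(sum((ratings[u][i] - mu_u) ** 2 for i in common)) * sqrt(
        sum((ratings[v][i] - mu_v) ** 2 for i in common)
    )
    return num / den if den else 0.0


def predict(u, item, k=2):
    """Mean-centred k-NN estimate of user u's rating for item."""
    mu_u = mean(ratings[u].values())
    candidates = [(pearson(u, v), v) for v in ratings if v != u and item in ratings[v]]
    nearest = sorted(candidates, reverse=True)[:k]
    positive = [(s, v) for s, v in nearest if s > 0]  # ignore dissimilar users
    if not positive:
        return mu_u  # fall back to the user's own mean
    num = sum(s * (ratings[v][item] - mean(ratings[v].values())) for s, v in positive)
    return mu_u + num / sum(s for s, _ in positive)


estimate = predict("u1", "d")
print(round(estimate, 2))
```

Centring on each user's mean compensates for users who rate systematically high or low, which is the main difference from a plain similarity-weighted average of neighbour ratings.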

## Model Prediction

In [None]:
# we can now query for specific predictions
# raw ids are strings because Dataset.load_from_file reads them from the file as text
uid = str(196)  # raw user id
iid = str(302)  # raw item id

In [None]:
# get a prediction for specific users and items.
pred = algo.predict(uid, iid, verbose=True)

In [None]:
# run the trained model against the testset
test_pred = algo.test(testset)

In [None]:
# show only the first few predictions; the full list has one entry per test rating
test_pred[:5]
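
One common next step is to turn a list of predictions into per-user recommendations by ranking items on the estimated rating. The sketch below uses a stand-in `Prediction` namedtuple with the same field names as Surprise's prediction objects (`uid`, `iid`, `r_ui`, `est`, `details`) so that it runs without the library. The `top_n` helper and the toy values are hypothetical.

```python
from collections import defaultdict, namedtuple

# Stand-in for Surprise's prediction tuples (assumed here so the example is self-contained).
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def top_n(predictions, n=2):
    """Group predictions by user and keep the n items with the highest estimate."""
    per_user = defaultdict(list)
    for p in predictions:
        per_user[p.uid].append((p.iid, p.est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in per_user.items()
    }


toy_predictions = [
    Prediction("196", "302", 4.0, 4.1, {}),
    Prediction("196", "242", 3.0, 3.4, {}),
    Prediction("196", "51", 2.0, 2.2, {}),
    Prediction("186", "302", 3.0, 3.9, {}),
]
recommendations = top_n(toy_predictions, n=2)
print(recommendations)
```

Applied to `test_pred`, this would rank only the held-out items each user has already rated. Recommending unseen movies would require predictions for user-item pairs that are absent from the data.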

## Model Evaluation

In [None]:
#RMSE
print("User-based Model : Test Set")
accuracy.rmse(test_pred, verbose=True)

In [None]:
#MAE
print("User-based Model : Test Set")
accuracy.mae(test_pred, verbose=True)
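
For reference, both metrics can be computed by hand from (true rating, estimated rating) pairs. The sketch below uses hypothetical toy pairs and plain Python, and is intended only to show what the two numbers measure.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Hypothetical (true rating, estimated rating) pairs.
toy_pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]
toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
print(toy_rmse, toy_mae)
```

RMSE squares each error before averaging, so a few large misses raise it more than they raise MAE. Reporting both gives a sense of whether the errors are evenly spread or dominated by outliers.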