# AIES Final Project
---
* Todd McCullough [Git](https://github.com/tamccullough)
---

### Initialize Modules and Data

##### Import the Needed Modules

In [1]:
import numpy as np
import pandas as pd
import heapq
from math import floor

##### Import Surprise
[Surprise](http://surpriselib.com/) is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

In [2]:
from surprise import Reader, Dataset
from surprise import KNNWithMeans
from surprise.model_selection import cross_validate

##### Import Data

In [3]:
movies = pd.read_csv('datasets/movies.csv')
users = pd.read_csv('datasets/users.csv')
ratings = pd.read_csv('datasets/ratings.csv')

### Exploratory Data Analysis

This is the MovieLens dataset provided by [GroupLens](https://grouplens.org/datasets/movielens/).

The data has already been cleaned and organized, and does not require further manipulation.

The first few rows of each set are displayed, along with the set's shape.

In [4]:
movies.head(5)

Unnamed: 0,movieid,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
movies.shape

(3883, 3)

We do not require the timestamp column from the ratings set, so it is dropped.

In [6]:
ratings.pop('timestamp')
ratings.head(5)

Unnamed: 0,userid,movieid,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5


In [7]:
ratings.shape

(1000209, 3)

In [8]:
users.head(2)

Unnamed: 0,userid,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072


In [9]:
users.shape

(6040, 5)

In [10]:
min_movie_ratings = 250
filter_movies = ratings['movieid'].value_counts() > min_movie_ratings
filter_movies = filter_movies[filter_movies].index.tolist()

min_user_ratings = 250
filter_users = ratings['userid'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

ratings = ratings[(ratings['movieid'].isin(filter_movies)) & (ratings['userid'].isin(filter_users))]
ratings.shape

(445094, 3)

For testing purposes, it is useful to find a few users who have rated many films and to test with their user ids.

In [11]:
grouped = ratings.groupby('userid').count().reset_index()
grouped = grouped.sort_values('rating', ascending=False)
grouped.head(10)

Unnamed: 0,userid,movieid,rating
854,4169,1037,1037
222,1181,959,959
871,4277,952,952
335,1680,934,934
401,1941,919,919
409,1980,875,875
1184,5831,870,870
76,424,857,857
1088,5367,852,852
574,2909,847,847


##### Check the Count of Each Rating

As we can see here, the ratings are somewhat skewed: about 85% of them are 3 stars or higher. This could make the predictions somewhat biased.

In [12]:
ratings_count = ratings.groupby('rating').count()
ratings_count = ratings_count.sort_values('rating', ascending=False)
ratings_count

rating,userid,movieid
5,95593,95593
4,160252,160252
3,121540,121540
2,47258,47258
1,20451,20451
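
The skew can be quantified directly from the counts shown above. This is a minimal sketch with the counts copied by hand from the table; the variable names are illustrative and not part of the notebook.

```python
# Rating counts copied from the table above (after filtering)
counts = {5: 95593, 4: 160252, 3: 121540, 2: 47258, 1: 20451}

total = sum(counts.values())
# Share of ratings that are 3 stars or higher
share_high = sum(n for stars, n in counts.items() if stars >= 3) / total
# Mean rating across all filtered ratings
mean_rating = sum(stars * n for stars, n in counts.items()) / total

print(f"total ratings: {total}")
print(f"share >= 3 stars: {share_high:.1%}")
print(f"mean rating: {mean_rating:.2f}")
```

The total matches the shape of the filtered ratings frame (445094 rows), which is a quick sanity check that the counts were copied correctly.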


In [13]:
ratings_count.describe()

Unnamed: 0,userid,movieid
count,5.0,5.0
mean,89018.8,89018.8
std,56170.554899,56170.554899
min,20451.0,20451.0
25%,47258.0,47258.0
50%,95593.0,95593.0
75%,121540.0,121540.0
max,160252.0,160252.0


In [14]:
ratings_count.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 5 to 1
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   userid   5 non-null      int64
 1   movieid  5 non-null      int64
dtypes: int64(2)
memory usage: 120.0 bytes


##### Define a Ratings scale
This scale is determined by the lowest and highest rating possible. 
In this case the lowest rating is 1, while the highest is 5.

In [15]:
reader = Reader(rating_scale=(1,5)) # This just defines the rating scale
data = Dataset.load_from_df(ratings[['userid', 'movieid', 'rating']], reader=reader)

### Build the Recommender Model

##### KNN with Means - Surprise

[KNN with Means](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans) has been chosen for the recommender. It is a basic collaborative filtering algorithm that takes the mean rating of each user into account (or of each item when `user_based` is `False`, as it is by default here).

In [16]:
def build_recommender(user_based=False, sim_type='cosine'):
    sim_options = {
        "name": sim_type,
        "user_based": user_based
    }

    return KNNWithMeans(sim_options=sim_options)
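
To make the role of the means concrete, here is a minimal pure-Python sketch of the mean-centered estimate that the Surprise documentation describes for KNNWithMeans in the item-based case: the target item's mean plus a similarity-weighted average of how far the user's ratings of neighbouring items sit from those items' means. The function name and the toy numbers are hypothetical, and the library's handling of neighbour selection (`k`, `min_k`) is not reproduced here.

```python
def predict_with_means(item_mean, neighbours):
    """Mean-centered KNN estimate.

    neighbours: list of (similarity, user_rating, neighbour_item_mean) tuples
    for items the user has already rated.
    """
    weighted_dev = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    sim_total = sum(sim for sim, _, _ in neighbours)
    if sim_total == 0:
        # No usable neighbours: fall back to the item's own mean
        return item_mean
    return item_mean + weighted_dev / sim_total

# Toy example: the target movie averages 3.5 stars
neighbours = [
    (0.9, 5, 4.0),  # very similar movie, rated 1 star above its mean
    (0.6, 3, 3.5),  # less similar movie, rated half a star below its mean
]
estimate = predict_with_means(3.5, neighbours)
print(round(estimate, 2))
```

Because the user rated the most similar movie above its average, the estimate ends up above the target movie's own mean.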

##### Calculate the Similarity Matrix

Ignoring folds, [build_full_trainset()](https://surprise.readthedocs.io/en/stable/dataset.html#surprise.dataset.DatasetAutoFolds.build_full_trainset) builds the *Trainset* from the entire dataset.

The Trainset is built from the data and also holds extra information about it, such as the mapping between raw and inner ids and the global mean rating.

In [17]:
trainset = data.build_full_trainset()
item_based_recommender = build_recommender()
item_based_recommender.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f231df16fd0>
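
As an illustration of what goes into each entry of that matrix, the sketch below computes the cosine similarity between two items from the ratings of users who rated both, which is how the Surprise documentation defines it. The dictionaries and the function name are made up for the example; the library computes this for all item pairs in optimized code.

```python
from math import sqrt

def cosine_sim(ratings_i, ratings_j):
    """Cosine similarity between two items, using only users who rated both.

    ratings_i, ratings_j: dicts mapping user id -> rating.
    """
    common = ratings_i.keys() & ratings_j.keys()
    if not common:
        # No overlap between the two items: treat them as unrelated
        return 0.0
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = sqrt(sum(ratings_j[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)

# Hypothetical ratings: user id -> stars
movie_a = {1: 5, 2: 3, 3: 4}
movie_b = {1: 4, 2: 2, 4: 5}  # only users 1 and 2 rated both movies

similarity = cosine_sim(movie_a, movie_b)
print(round(similarity, 3))
```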

### Evaluate the Model

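Using [cross_validate()](https://surprise.readthedocs.io/en/stable/model_selection.html#cross-validation) from Surprise, we can quickly evaluate the model with a few metrics; here RMSE and MAE are computed over 5 folds. Note that `cross_validate()` refits the algorithm on each fold, which may leave the estimator fitted on the last fold's training split rather than on the full trainset.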

In [18]:
cross_validate(item_based_recommender, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8547  0.8579  0.8566  0.8535  0.8607  0.8567  0.0025  
MAE (testset)     0.6730  0.6742  0.6746  0.6705  0.6771  0.6739  0.0022  
Fit time          10.69   11.23   11.60   10.92   10.99   11.09   0.31    
Test time         40.55   41.51   39.75   39.06   40.30   40.24   0.82    


{'test_rmse': array([0.85465016, 0.857921  , 0.85661577, 0.85347496, 0.86073975]),
 'test_mae': array([0.67297877, 0.67423061, 0.67460163, 0.67052718, 0.67710954]),
 'fit_time': (10.693198442459106,
  11.22611141204834,
  11.601833581924438,
  10.9208345413208,
  10.990837097167969),
 'test_time': (40.54874277114868,
  41.51346039772034,
  39.74882483482361,
  39.059611082077026,
  40.304763317108154)}
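
For reference, both metrics are simple averages over the prediction errors. The sketch below computes them by hand on a few made-up ratings; the function names and the numbers are illustrative and are not taken from the run above.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of the error in stars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true ratings and model estimates
actual = [5, 3, 4, 1]
predicted = [4.5, 3.5, 3.0, 2.0]

print(round(rmse(actual, predicted), 4))
print(round(mae(actual, predicted), 4))
```

Read this way, the MAE of about 0.67 reported above means the estimates are off by roughly two thirds of a star on average.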

##### Prediction

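Use this test to see how a range of users might rate a specific movie. If a user or movie id is not in the training set (for example, because it was removed by the filtering step), Surprise falls back to a default estimate, the global mean rating. The constant 3.59 in the output below matches the mean of the filtered ratings, which suggests that is what happens here.

In [19]:
# Predict the rating that user ids 0..149 would give movie id 167
for i in range(150):
    prediction = item_based_recommender.predict(i, 167)
    print(round(prediction.est, 2), end=', ')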

3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 3.59, 

### Inference

The functions below run the model and return recommendations for a given user, filtered by genre.

In [20]:
ml_model = item_based_recommender

In [21]:
def get_r(user_id):
    # Select which system to use. Due to memory constraints, item-based is the only viable option
    recommender_system = ml_model
    # N is the maximum number of items to recommend
    N = 2000

    # Movies the user has already rated. Converting to a set and back to a list removes duplicates.
    rated_items = list(set(ratings.loc[ratings['userid'] == user_id]['movieid'].tolist()))
    # Keep only the ratings whose movie appears in the movies table
    ratings_list = movies['movieid'].values.tolist()
    reduced_ratings = ratings.loc[ratings['movieid'].isin(ratings_list)].copy()

    # Every distinct movie id that has at least one rating
    all_item_ids = list(set(reduced_ratings['movieid'].tolist()))

    # new_items holds all the items not yet rated by the user
    new_items = [x for x in all_item_ids if x not in rated_items]

    # Estimate ratings for all unrated items
    predicted_ratings = {}
    for item_id in new_items:
        predicted_ratings[item_id] = recommender_system.predict(user_id, item_id).est

    # Get the item ids with the top N predicted ratings, then order them by movie id
    recommended_ids = heapq.nlargest(N, predicted_ratings, key=predicted_ratings.get)
    recommended_ids = sorted(recommended_ids)

    recommended_df = movies.loc[movies['movieid'].isin(recommended_ids)].copy()
    recommended_df.insert(1, 'pred_rating', 0)

    # Fill in the predictions, truncated to whole stars.
    # This relies on movies being sorted by movieid so that row order matches recommended_ids.
    for idx,item_id in enumerate(recommended_ids):
        recommended_df.iloc[idx, recommended_df.columns.get_loc('pred_rating')] = int(predicted_ratings[item_id])
    return recommended_df.head(N).sort_values('pred_rating', ascending=False)

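def cap_str(item):
    # Genres are stored capitalized (e.g. 'Comedy'); strip stray spaces from the input.
    # Note: capitalize() lowercases everything after the first letter,
    # so 'sci-fi' becomes 'Sci-fi' and will not match 'Sci-Fi'.
    return item.strip().capitalize()

def reg_frame(f_list,items):
    # Build one lookahead per genre so that a movie must list every requested genre
    s_ = ''
    for i in items:
        j = cap_str(i)
        s_ += f'(?=.*{j})'
    f_list = f_list[f_list['genres'].str.contains(fr'^\b{s_}\b',regex=True)]
    return f_list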

def set_up_ml(user_id,genre_list):
    f_list = get_r(user_id)
    items = genre_list.split(',')
    f_list = reg_frame(f_list,items)
    f_list.pop('movieid')
    f_list.pop('pred_rating')
    f_list = f_list.reset_index(drop=True)
    f_list = f_list.T.reset_index(drop=True).T
    #f_list = f_list.head(10)
    return f_list
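
The genre filter in `reg_frame` works by chaining one regex lookahead per genre, so a row matches only if every requested genre appears somewhere in its `genres` string. The standalone sketch below shows the same idea with the standard `re` module. The helper name `build_genre_pattern` is hypothetical, and unlike the notebook code it escapes the input and ignores case, which is an assumption about what a more forgiving version could look like.

```python
import re

def build_genre_pattern(genres):
    """Build a regex that matches only if every genre appears in the string."""
    # One lookahead per genre; re.escape guards against characters such as '-' or "'"
    lookaheads = ''.join(f'(?=.*{re.escape(g.strip())})' for g in genres)
    return re.compile(lookaheads, flags=re.IGNORECASE)

pattern = build_genre_pattern('horror, comedy'.split(','))
samples = ['Comedy|Horror|Musical', 'Comedy|Romance', 'Horror|Thriller']
matches = [s for s in samples if pattern.search(s)]
print(matches)
```

Only the first sample lists both genres, so it is the only one kept. Ignoring case also lets an input such as `sci-fi` match `Sci-Fi`.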

### Get a Recommendation Based on Genres

This is the final code, which will be implemented in a cleaner fashion through the browser interface.

In [22]:
user_id = 4168
genre_list = 'horror,comedy'
table_list = set_up_ml(user_id,genre_list)

In [23]:
table_list

Unnamed: 0,0,1
0,"Little Shop of Horrors, The (1960)",Comedy|Horror
1,Little Shop of Horrors (1986),Comedy|Horror|Musical
2,"American Werewolf in Paris, An (1997)",Comedy|Horror
3,"Rocky Horror Picture Show, The (1975)",Comedy|Horror|Musical|Sci-Fi
4,Ghostbusters II (1989),Comedy|Horror
5,Ghostbusters (1984),Comedy|Horror
6,Fright Night (1985),Comedy|Horror
7,Gremlins (1984),Comedy|Horror
8,Gremlins 2: The New Batch (1990),Comedy|Horror
9,American Psycho (2000),Comedy|Horror|Thriller


### Save the Model

In [25]:
import pickle
filename = 'model/movielens_light_recommender_model.sav'
# Use a context manager so the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(item_based_recommender, f)
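
To reuse the saved model later, for example from the browser interface, it can be loaded back with `pickle.load`. The sketch below shows the save and load round trip on a stand-in dictionary written to a temporary directory, since the real model object and path depend on the notebook's state. Only unpickle files that come from a source you trust.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted recommender (hypothetical content)
stand_in_model = {'name': 'KNNWithMeans', 'user_based': False}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_model.sav')
    # Save: write the object in binary mode
    with open(path, 'wb') as f:
        pickle.dump(stand_in_model, f)
    # Load: read it back in binary mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```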