# AIES Final Project
---
* Todd McCullough [Git](https://github.com/tamccullough)
---

### Initialize Modules and Data

##### Import the Needed Modules

In [1]:
import numpy as np
import pandas as pd
import heapq
from math import floor

##### Import Surprise
[Surprise](http://surpriselib.com/) is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

In [2]:
from surprise import Reader, Dataset
from surprise import KNNWithMeans
from surprise.model_selection import cross_validate

##### Import Data

In [3]:
DIR = 'datasets/'

In [4]:
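# A multi-character delimiter such as '::' is only supported by the Python parsing engine;
# requesting it explicitly avoids the ParserWarning pandas raises when it falls back on its own.
movies = pd.read_csv(DIR+'movies.dat', delimiter='::', engine='python')
ratings = pd.read_csv(DIR+'ratings.dat', delimiter='::', engine='python').drop_duplicates(['movieid', 'userid'], keep='last')
users = pd.read_csv(DIR+'users.dat', delimiter='::', engine='python')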
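
As a minimal sketch of the loading step, the snippet below parses a tiny in-memory sample instead of the real `.dat` files. The header row and the two sample rows are assumptions based on the column names and rows shown later in this notebook; the actual files may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the '::'-delimited layout of movies.dat
raw = (
    "movieid::title::genres\n"
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# engine='python' is needed because the C engine only handles single-character delimiters
sample = pd.read_csv(io.StringIO(raw), delimiter='::', engine='python')
print(sample)
```

The single `|` inside the genres column is left untouched because only the two-character `::` sequence is treated as a separator.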

  """Entry point for launching an IPython kernel.
  
  This is separate from the ipykernel package so we can avoid doing imports until


### Exploratory Data Analysis

This is a dataset provided by [GroupLens](https://grouplens.org/datasets/movielens/).

The data has already been cleaned and organized, and does not require further manipulation.

The first 2 rows of each set are displayed, along with the set's shape.

In [8]:
movies.head(2)

Unnamed: 0,movieid,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy


In [9]:
movies.shape

(3883, 3)

We do not require the timestamp column from this set, and therefore it will be dropped.

In [10]:
ratings.pop('timestamp')
ratings.head(2)

Unnamed: 0,userid,movieid,rating
0,1,1193,5
1,1,661,3


In [11]:
ratings.shape

(1000209, 3)

In [12]:
users.head(2)

Unnamed: 0,userid,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072


In [13]:
users.shape

(6040, 5)

For the purposes of testing, it would be good to get a few users who have rated many films and to test with their user ids.

In [34]:
grouped = ratings.groupby('userid').count().reset_index()
grouped = grouped.sort_values('rating', ascending=False)
grouped.head(10)

Unnamed: 0,userid,movieid,rating
4168,4169,2314,2314
1679,1680,1850,1850
4276,4277,1743,1743
1940,1941,1595,1595
1180,1181,1521,1521
888,889,1518,1518
3617,3618,1344,1344
2062,2063,1323,1323
1149,1150,1302,1302
1014,1015,1286,1286


## Check the Count of Each Star Rating

As the counts below show, the ratings are skewed toward the high end: roughly 84% of all ratings are 3 stars or higher. This could bias the predictions toward higher values.

In [45]:
ratings_count = ratings.groupby('rating').count()
ratings_count = ratings_count.sort_values('rating', ascending=False)
ratings_count

rating,userid,movieid
5,226310,226310
4,348971,348971
3,261197,261197
2,107557,107557
1,56174,56174
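
To put a number on the skew, the counts in the table above can be turned into shares. This is only a sketch that re-enters the five counts by hand; `counts`, `share` and `high_share` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Rating counts copied from the table above (star rating -> number of ratings)
counts = pd.Series({5: 226310, 4: 348971, 3: 261197, 2: 107557, 1: 56174}, name='count')

# Fraction of all ratings that falls on each star value
share = counts / counts.sum()

# Combined fraction of ratings that are 3 stars or higher
high_share = share.loc[[3, 4, 5]].sum()

print(share.round(3))
print(round(high_share, 3))
```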


In [46]:
ratings_count.describe()

Unnamed: 0,userid,movieid
count,5.0,5.0
mean,200041.8,200041.8
std,118174.939749,118174.939749
min,56174.0,56174.0
25%,107557.0,107557.0
50%,226310.0,226310.0
75%,261197.0,261197.0
max,348971.0,348971.0


In [44]:
ratings_count.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 5 to 1
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   userid   5 non-null      int64
 1   movieid  5 non-null      int64
dtypes: int64(2)
memory usage: 120.0 bytes


##### Define a Ratings scale
This scale is determined by the lowest and highest rating possible. 
In this case the lowest rating is 1, while the highest is 5.

In [47]:
reader = Reader(rating_scale=(1,5)) # This just defines the rating scale
data = Dataset.load_from_df(ratings[['userid', 'movieid', 'rating']], reader=reader)

### Build the Recommender Model

##### KNN with Means - Surprise

[KNN with Means](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans) has been chosen for the recommender. It is a basic collaborative filtering algorithm that takes into account the mean rating of each user, or of each item when `user_based=False`, which is the default used below.

In [48]:
def build_recommender(user_based=False, sim_type='cosine'):
    sim_options = {
        "name": sim_type,
        "user_based": user_based
    }

    return KNNWithMeans(sim_options=sim_options)
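
To make the idea behind the algorithm concrete, here is a small NumPy sketch of an item-based estimate: the target item's mean rating plus a similarity-weighted average of how far the user's ratings of neighbouring items sit from those neighbours' means. It assumes the item-based form of the formula given in the Surprise documentation and is not Surprise's own code; the function name and the sample numbers are hypothetical.

```python
import numpy as np

def knn_with_means_estimate(item_mean, neighbor_sims, neighbor_ratings, neighbor_means):
    """Item-based estimate: item mean plus weighted deviation of the neighbours."""
    sims = np.asarray(neighbor_sims, dtype=float)
    # How far the user's rating of each neighbouring item is from that item's mean
    deviations = np.asarray(neighbor_ratings, dtype=float) - np.asarray(neighbor_means, dtype=float)
    if sims.sum() == 0:
        # No usable neighbours: fall back to the item's mean rating
        return float(item_mean)
    return float(item_mean + np.dot(sims, deviations) / sims.sum())

# Hypothetical numbers: the target movie averages 3.5 stars, and the user rated
# two similar movies 5 and 3 stars, whose own means are 4.0 and 3.5.
estimate = knn_with_means_estimate(3.5, [0.9, 0.6], [5, 3], [4.0, 3.5])
print(round(estimate, 2))
```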

##### Calculate the Similarity Matrix

[build_full_trainset()](https://surprise.readthedocs.io/en/stable/dataset.html#surprise.dataset.DatasetAutoFolds.build_full_trainset) ignores folds and builds the *Trainset* from all of the data.

The Trainset is built from the data, but also holds additional information about it.

In [49]:
trainset = data.build_full_trainset()
item_based_recommender = build_recommender()
item_based_recommender.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f22d466d0d0>
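
For intuition about what the similarity matrix holds, the sketch below computes the cosine similarity between two items, using only the users who rated both. That restriction to common users follows the description in the Surprise documentation, but the code is an independent illustration with made-up ratings, not the library's implementation.

```python
import numpy as np

def cosine_sim_common(item_a, item_b):
    """Cosine similarity of two item rating vectors over their common users (NaN = not rated)."""
    a = np.asarray(item_a, dtype=float)
    b = np.asarray(item_b, dtype=float)
    # Keep only the users who rated both items
    common = ~np.isnan(a) & ~np.isnan(b)
    if not common.any():
        return 0.0
    x, y = a[common], b[common]
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical ratings from four users; NaN marks a movie the user has not rated
movie_a = [5, 3, np.nan, 4]
movie_b = [4, np.nan, 2, 4]
similarity = cosine_sim_common(movie_a, movie_b)
print(round(similarity, 4))
```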

### Evaluate the Model

Using [cross_validate()](https://surprise.readthedocs.io/en/stable/model_selection.html#cross-validation) from Surprise, we can quickly evaluate the model on a few metrics.

In [51]:
cross_validate(item_based_recommender, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8917  0.8933  0.8957  0.8943  0.8935  0.8937  0.0013  
MAE (testset)     0.7016  0.7018  0.7028  0.7030  0.7022  0.7023  0.0006  
Fit time          31.43   32.45   32.22   32.89   32.70   32.34   0.51    
Test time         78.26   81.50   82.91   81.76   82.98   81.48   1.72    


{'test_rmse': array([0.89172452, 0.89334468, 0.89570078, 0.89428249, 0.89346105]),
 'test_mae': array([0.70156576, 0.70175556, 0.70283036, 0.70299299, 0.70219336]),
 'fit_time': (31.433568954467773,
  32.4473295211792,
  32.216604709625244,
  32.8928587436676,
  32.70250129699707),
 'test_time': (78.25737023353577,
  81.49770617485046,
  82.91267681121826,
  81.7580394744873,
  82.98236560821533)}
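
For reference, the two reported metrics can be sketched in a few lines of NumPy. RMSE penalises large errors more heavily, while MAE is the average absolute error in stars. The sample ratings below are invented for illustration and are unrelated to the results above.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between true and predicted ratings."""
    err = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(actual, predicted):
    """Mean absolute error between true and predicted ratings."""
    err = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(err)))

# Hypothetical true ratings and model estimates
actual = [5, 3, 4, 1]
predicted = [4.5, 3.5, 4.0, 2.0]
print(round(rmse(actual, predicted), 4), round(mae(actual, predicted), 4))
```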

##### Prediction

Use this test to see how different users might rate a specific movie (here, movie id 167).

In [52]:
# Predict how user ids 0 through 149 would rate movie 167
for user_id in range(150):
    prediction = item_based_recommender.predict(user_id, 167)
    print(round(prediction.est, 2), end=', ')

3.58, 3.97, 3.6, 3.44, 4.1, 3.17, 3.88, 3.87, 3.92, 3.57, 4.14, 3.2, 3.4, 3.47, 3.1, 3.41, 3.83, 4.04, 4.16, 3.71, 4.0, 3.69, 3.19, 3.22, 3.86, 3.78, 3.44, 3.87, 3.35, 3.47, 3.06, 3.68, 3.56, 3.16, 4.13, 3.38, 3.98, 3.93, 3.94, 3.77, 3.59, 3.51, 3.84, 3.97, 3.74, 2.93, 5, 3.38, 3.46, 3.59, 3.53, 4.0, 3.75, 4.44, 3.65, 3.69, 3.61, 3.13, 4.15, 3.12, 3.83, 3.33, 3.53, 3.78, 3.96, 4.32, 3.7, 3.98, 3.25, 4.06, 3.73, 3.4, 3.42, 3.91, 3.58, 4.42, 4.19, 3.24, 3.73, 3.5, 3.95, 4.14, 4.21, 3.89, 3.21, 3.22, 4.2, 2.45, 3.77, 2.72, 3.77, 4.35, 3.07, 3.6, 4.29, 3.28, 3.2, 4.07, 3.77, 3.29, 3.06, 4.9, 3.21, 4.44, 3.78, 4.06, 3.83, 3.23, 3.15, 3.81, 3.55, 3.64, 4.07, 3.97, 3.68, 4.31, 3.67, 3.66, 3.65, 3.91, 3.35, 4.5, 2.96, 3.45, 3.99, 3.91, 4.08, 3.55, 4.56, 4.07, 4.2, 3.25, 4.03, 3.74, 3.28, 3.65, 2.96, 3.72, 4.15, 3.79, 3.51, 3.72, 2.87, 3.46, 3.89, 3.06, 3.84, 3.68, 3.96, 4.11, 

### Inference

The main functions that run the model and return recommendations.

In [54]:
ml_model = item_based_recommender

In [55]:
def get_r(user_id):
    # Select which system to use. Due to memory constraints, item-based is the only viable option.
    recommender_system = ml_model
    # N is the number of items to recommend
    N = 2000

    # Movies the user has already rated; a set removes duplicates and gives fast membership tests
    rated_items = set(ratings.loc[ratings['userid'] == user_id, 'movieid'].tolist())

    # Keep only the ratings for movies that appear in the movies table
    ratings_list = movies['movieid'].values.tolist()
    reduced_ratings = ratings.loc[ratings['movieid'].isin(ratings_list)].copy()

    # All distinct movie ids that have at least one rating
    all_item_ids = list(set(reduced_ratings['movieid'].tolist()))

    # new_items holds every item the user has not rated
    new_items = [x for x in all_item_ids if x not in rated_items]

    # Estimate ratings for all unrated items
    predicted_ratings = {}
    for item_id in new_items:
        predicted_ratings[item_id] = recommender_system.predict(user_id, item_id).est

    # Get the item ids with the top N predicted ratings
    recommended_ids = heapq.nlargest(N, predicted_ratings, key=predicted_ratings.get)

    # Attach each movie's predicted rating by movie id rather than by row position.
    # The rating is truncated to an integer, as in the original version.
    recommended_df = movies.loc[movies['movieid'].isin(recommended_ids)].copy()
    pred_rating = recommended_df['movieid'].map(predicted_ratings).astype(int)
    recommended_df.insert(1, 'pred_rating', pred_rating)
    return recommended_df.sort_values('pred_rating', ascending=False)

def cap_str(item):
    string = item
    return string.capitalize()

def reg_frame(f_list,items):
    # Build one lookahead per genre so that a row must contain every requested genre, in any order
    s_ = ''
    for i in items:
        j = cap_str(i)
        s_ += f'(?=.*{j})'
    f_list = f_list[f_list['genres'].str.contains(fr'^\b{s_}\b',regex=True)]
    return f_list

def set_up_ml(user_id,genre_list):
    f_list = get_r(user_id)
    items = genre_list.split(',')
    f_list = reg_frame(f_list,items)
    f_list.pop('movieid')
    f_list.pop('pred_rating')
    f_list = f_list.reset_index(drop=True)
    f_list = f_list.T.reset_index(drop=True).T
    #f_list = f_list.head(10)
    return f_list
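
The top-N selection inside `get_r` relies on `heapq.nlargest` with the dictionary's `get` method as the key, which returns the ids with the highest predicted ratings without sorting the whole dictionary. A minimal sketch with hypothetical movie ids and scores:

```python
import heapq

# Hypothetical predicted ratings keyed by movie id
predicted = {10: 4.2, 11: 3.1, 12: 4.8, 13: 2.5, 14: 4.5}

# Iterating a dict yields its keys; key=predicted.get ranks them by their score
top_ids = heapq.nlargest(3, predicted, key=predicted.get)

# Pair each id with its score, highest first
top = [(movie_id, predicted[movie_id]) for movie_id in top_ids]
print(top)
```

The result is already ordered from highest to lowest score, so no further sort is needed for ranking.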

### Get a Recommendation Based on Genres

This is the final code, which will be implemented in a cleaner fashion through the browser interface.

In [65]:
user_id = 4168
genre_list = 'horror,comedy'
table_list = set_up_ml(user_id,genre_list)

In [66]:
table_list

Unnamed: 0,0,1
0,"Rocky Horror Picture Show, The (1975)",Comedy|Horror|Musical|Sci-Fi
1,Ghostbusters (1984),Comedy|Horror
2,Little Shop of Horrors (1986),Comedy|Horror|Musical
3,American Psycho (2000),Comedy|Horror|Thriller
4,Abbott and Costello Meet Frankenstein (1948),Comedy|Horror
5,Near Dark (1987),Comedy|Horror
6,Evil Dead II (Dead By Dawn) (1987),Action|Adventure|Comedy|Horror
7,Bad Taste (1987),Comedy|Horror
8,Army of Darkness (1993),Action|Adventure|Comedy|Horror|Sci-Fi
9,Young Frankenstein (1974),Comedy|Horror
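
Every row in the table above contains both requested genres, regardless of their order or of any extra genres. The sketch below shows the lookahead idea behind that filter using only the standard `re` module. `genre_pattern` is a hypothetical helper, and it adds `re.escape` and whitespace stripping, which the notebook's own `reg_frame` does not do.

```python
import re

def genre_pattern(genre_list):
    """Compile a pattern that matches strings containing every requested genre."""
    genres = [g.strip().capitalize() for g in genre_list.split(',')]
    # One positive lookahead per genre: each must appear somewhere in the string
    return re.compile(''.join(f'(?=.*{re.escape(g)})' for g in genres))

pattern = genre_pattern('horror,comedy')

# Genre strings taken from the table above, plus two that should be rejected
samples = [
    'Comedy|Horror|Musical|Sci-Fi',
    'Action|Adventure|Comedy|Horror',
    'Comedy|Romance',
    'Horror|Thriller',
]
matches = [s for s in samples if pattern.search(s)]
print(matches)
```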


## Save the Model

In [67]:
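import os
import pickle

# Build the target path under the user's home directory
PATH = os.path.join(os.path.expanduser('~'), 'mldl', 'models')
filename = os.path.join(PATH, 'movielens_light_recommender_model.sav')

# The context manager makes sure the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(item_based_recommender, f)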
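
Loading the model back is the mirror image of saving it. The round trip below is a sketch that uses a plain dictionary as a stand-in for the fitted recommender and a temporary directory instead of the real models folder, so it runs without Surprise or the dataset; `stand_in_model` is purely illustrative.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted recommender; the real object would be item_based_recommender
stand_in_model = {'algorithm': 'KNNWithMeans', 'sim_options': {'name': 'cosine', 'user_based': False}}

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, 'movielens_light_recommender_model.sav')

    # Save the object
    with open(filename, 'wb') as f:
        pickle.dump(stand_in_model, f)

    # Load it back
    with open(filename, 'rb') as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

When loading the real model, Surprise must be importable in the loading environment, and pickle files should only be loaded from trusted sources.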