# AIES Final Project
---
* Todd McCullough [Git](https://github.com/tamccullough)
---

### Initialize Modules and Data

##### Import the Needed Modules

In [1]:
import numpy as np
import pandas as pd
import heapq
from math import floor

##### Import Surprise
[Surprise](http://surpriselib.com/) is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

In [2]:
from surprise import Reader, Dataset
from surprise import KNNWithMeans
from surprise.model_selection import cross_validate

##### Import Data

Data imported from the MovieLens datasets published by [GroupLens](https://grouplens.org/datasets/movielens/).

In [3]:
import os
os_path = os.path.join(os.path.expanduser('~'))
DIR = 'datasets/'
PATH = os_path+'/mldl/models/'

filename = PATH+'movielens_light_recommender_model.sav'

In [4]:
genres = ['Action','Adventure','Animation',
          'Children','Comedy','Crime',
          'Documentary','Drama','Fantasy',
          'Film-Noir','Horror','Musical',
          'Mystery','Romance','Sci-Fi',
          'Thriller','War','Western']

In [5]:
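# A multi-character separator such as '::' requires the Python parsing engine
movies = pd.read_csv(DIR+'movies.dat', delimiter='::', engine='python')
ratings = pd.read_csv(DIR+'ratings.dat', delimiter='::', engine='python').drop_duplicates(['movieid', 'userid'], keep='last')
users = pd.read_csv(DIR+'users.dat', delimiter='::', engine='python')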


In [23]:
movies.head(2)

Unnamed: 0,movieid,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy


In [24]:
#ratings.pop('timestamp')
ratings.head(2)

Unnamed: 0,userid,movieid,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109


In [25]:
users.head(2)

Unnamed: 0,userid,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072


In [48]:
grouped = ratings.groupby('userid').sum().reset_index()
grouped.sort_values('rating', ascending=False)

Unnamed: 0,userid,movieid,rating,timestamp
4168,4169,4611649,8219,2253251570450
4276,4277,3434119,7207,1696635554925
1679,1680,3484153,6578,1803339881486
1940,1941,3220078,4872,1562701568029
2908,2909,2444231,4809,1232022760245
...,...,...,...,...
3641,3642,36255,54,20295945945
4348,4349,50892,53,26060748987
4364,4365,59333,51,19303672004
4055,4056,46784,51,21240785153


##### Define a Rating Scale
The scale is set by the lowest and highest possible ratings.
In this case the lowest rating is 1 and the highest is 5.

In [37]:
reader = Reader(rating_scale=(1,5)) # This just defines the rating scale
data = Dataset.load_from_df(ratings[['userid', 'movieid', 'rating']], reader=reader)

### Build the Recommender Model

##### KNN with Means - Surprise

[KNN with Means](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans), a basic collaborative filtering algorithm that takes into account the mean ratings of each user, has been chosen for the recommender.

In [38]:
def build_recommender(user_based=False, sim_type='cosine'):
    sim_options = {
        "name": sim_type,
        "user_based": user_based
    }

    return KNNWithMeans(sim_options=sim_options)
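
To make the prediction rule more concrete, here is a minimal pure-Python sketch of a mean-centred neighbour estimate in the item-based setting (`user_based=False`). It is not Surprise's implementation: the function name `predict_with_means` and the toy numbers are hypothetical, and it assumes the similarities and the neighbour set have already been computed.

```python
def predict_with_means(item_mean, neighbours):
    """Mean-centred KNN estimate for one (user, item) pair.

    item_mean  -- mean rating of the target item
    neighbours -- list of (similarity, user_rating, neighbour_mean) tuples,
                  one per similar item the user has already rated
    """
    # Weighted sum of how far the user's ratings sit from each neighbour's mean
    weighted_dev = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    total_sim = sum(sim for sim, _, _ in neighbours)
    if total_sim == 0:
        # No usable neighbours: fall back to the item's own mean
        return item_mean
    return item_mean + weighted_dev / total_sim


# Hypothetical neighbours: (similarity, user's rating, neighbour's mean rating)
neighbours = [(0.75, 5.0, 4.0), (0.25, 3.0, 3.5)]
estimate = predict_with_means(3.5, neighbours)
print(estimate)
```

The user rated the closest neighbour a full point above its mean, so the estimate is pulled above the target item's mean of 3.5.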

##### Calculate the Similarity Matrix

[build_full_trainset()](https://surprise.readthedocs.io/en/stable/dataset.html#surprise.dataset.DatasetAutoFolds.build_full_trainset) builds the *Trainset* from the whole dataset, ignoring folds.

The Trainset is built from the data but also holds additional information about it, such as the mapping between raw and inner ids and the global mean rating.

In [39]:
trainset = data.build_full_trainset()
# user_based_recommender = build_recommender(user_based=True)
item_based_recommender = build_recommender()
# User based seems to give a memory error when fit, because there are many more users than movies.
# user_based_recommender.fit(trainset)
item_based_recommender.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7fa5428cd310>

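The memory error mentioned in the cell comment is plausible because a similarity matrix is square in the number of users (or items), so its size grows quadratically. The sketch below estimates the size of a single dense matrix, assuming 8-byte floats. The helper `dense_similarity_bytes` and the user and item counts are hypothetical and are not taken from this dataset; peak memory while fitting may well be higher than this estimate.

```python
def dense_similarity_bytes(n_entities, bytes_per_value=8):
    """Size in bytes of one dense n x n similarity matrix."""
    return n_entities * n_entities * bytes_per_value


# Hypothetical counts, chosen only to illustrate the quadratic growth
n_users, n_items = 60_000, 4_000

user_gb = dense_similarity_bytes(n_users) / 1e9
item_gb = dense_similarity_bytes(n_items) / 1e9
print(f"user-based: {user_gb:.1f} GB, item-based: {item_gb:.3f} GB")
```

With fifteen times as many users as items, the user-based matrix is 225 times larger.
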
### Evaluate the Model

Using [cross_validate()](https://surprise.readthedocs.io/en/stable/model_selection.html#cross-validation) from Surprise, we can quickly evaluate the model with RMSE and MAE over five folds.

In [40]:
cross_validate(item_based_recommender, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8958  0.8933  0.8932  0.8946  0.8903  0.8934  0.0018  
MAE (testset)     0.7030  0.7016  0.7027  0.7031  0.7006  0.7022  0.0010  
Fit time          30.97   37.09   31.31   31.64   37.44   33.69   2.93    
Test time         86.06   88.81   83.51   84.19   83.73   85.26   1.99    


{'test_rmse': array([0.89575209, 0.8933469 , 0.89321132, 0.89456386, 0.89032265]),
 'test_mae': array([0.70304933, 0.70155716, 0.7026846 , 0.70310216, 0.70062151]),
 'fit_time': (30.972729921340942,
  37.08533072471619,
  31.307833194732666,
  31.643873691558838,
  37.437188148498535),
 'test_time': (86.06465816497803,
  88.81173324584961,
  83.50519371032715,
  84.18715929985046,
  83.73203921318054)}
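
For reference, the two metrics reported above can be sketched in a few lines of plain Python. The `(true, estimated)` pairs below are made up for illustration and are unrelated to the MovieLens results; the helper names `rmse` and `mae` are this sketch's own, not Surprise functions.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Made-up predictions: two exact, one off by 1, one off by 2
pairs = [(5.0, 4.0), (3.0, 3.0), (1.0, 3.0), (4.0, 4.0)]
print(round(rmse(pairs), 4), mae(pairs))
```

Because RMSE squares each error before averaging, a few large misses raise it more than they raise MAE, which is why RMSE is never smaller than MAE.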

##### Prediction

This test shows how different users might rate a specific movie (here, movie 167).

In [1]:
# Predict how user ids 0-149 would rate movie 167
for user in range(150):
    prediction = item_based_recommender.predict(user, 167)
    print(round(prediction.est, 2), end=', ')

NameError: name 'item_based_recommender' is not defined

### Inference

The main functions that run the model and return recommendations.

In [108]:
def get_r(user_id, model):
    # Due to memory constraints, the item-based model is the only viable option
    recommender_system = model
    # N is the number of items to recommend
    N = 1000

    # Movies the user has already rated (a set gives fast membership tests)
    rated_items = set(ratings.loc[ratings['userid'] == user_id, 'movieid'])
    # Rated movies that also appear in the movies table
    all_item_ids = set(ratings.loc[ratings['movieid'].isin(movies['movieid']), 'movieid'])

    # new_items holds every movie the user has not rated yet
    new_items = all_item_ids - rated_items

    # Estimate a rating for each unrated movie
    predicted_ratings = {
        item_id: recommender_system.predict(user_id, item_id).est
        for item_id in new_items
    }

    # Get the movie ids with the N highest predicted ratings
    recommended_ids = heapq.nlargest(N, predicted_ratings, key=predicted_ratings.get)

    recommended_df = movies.loc[movies['movieid'].isin(recommended_ids)].copy()
    # Look each prediction up by movieid so rows and ratings cannot get misaligned;
    # estimates are truncated to whole numbers
    recommended_df.insert(1, 'pred_rating',
                          recommended_df['movieid'].map(predicted_ratings).astype(int))

    return recommended_df.head(N).sort_values('pred_rating', ascending=False)

def set_up_ml(user_id, genre_list, model):
    ml_list = get_r(user_id, model)

    # Build one lookahead per requested genre, e.g. '(?=.*Sci-Fi)(?=.*Comedy)',
    # so that a row only matches when it contains every genre
    items = genre_list.split(',')
    s_ = ''.join(f'(?=.*{i})' for i in items)

    # Collect the rows matching each genre; a movie matching several
    # requested genres appears once per match
    frames = [ml_list[ml_list['genres'].str.contains(i)] for i in items]
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    f_list = pd.concat(frames, sort=False, ignore_index=True)

    # Keep only the rows that contain all requested genres
    f_list = f_list[f_list['genres'].str.contains(fr'^\b{s_}\b', regex=True)]
    return f_list.drop(columns=['movieid', 'pred_rating'])

def mk_tbl(rows):
    # Helper for creating dynamic tables: drop the last row and return the rest
    rows.pop()
    return rows
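
The top-N selection in `get_r` relies on `heapq.nlargest` with the dictionary's own `get` method as the key. A small standalone sketch, using a hypothetical `movieid -> estimate` mapping, shows what that call returns:

```python
import heapq

# Hypothetical predicted ratings keyed by movie id
predicted_ratings = {10: 3.2, 20: 4.7, 30: 4.1, 40: 2.5}

# Iterating a dict yields its keys; key=predicted_ratings.get ranks them by value
top_ids = heapq.nlargest(2, predicted_ratings, key=predicted_ratings.get)
top = [(movie_id, predicted_ratings[movie_id]) for movie_id in top_ids]
print(top)  # [(20, 4.7), (30, 4.1)]
```

`heapq.nlargest` returns the ids already ordered from highest to lowest estimate, and it avoids sorting the whole dictionary when only a few items are needed.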

### Get a Recommendation Based on Genres

The final code that will be implemented in a cleaner fashion through the browser interface.

In [3]:
user_id = 4168
genre_list = 'Fantasy'
table_list = set_up_ml(user_id,genre_list,item_based_recommender)

table_list = table_list.to_numpy()

test = pd.DataFrame(table_list)
test

NameError: name 'item_based_recommender' is not defined

### Save the Model

In [8]:
import os
os_path = os.path.join(os.path.expanduser('~'))
PATH = os_path+'/mldl/models/'

import pickle
filename = PATH+'movielens_light_recommender_model.sav'
#pickle.dump(item_based_recommender, open(filename, 'wb'))

In [9]:
# Use a context manager so the file handle is closed after loading
with open(filename, 'rb') as f:
    ml_recommender = pickle.load(f)
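
The save and load steps can be tried end to end without the fitted model. The sketch below round-trips a small stand-in object through `pickle` in a temporary directory; `model_stub` and the file name are hypothetical placeholders for the real recommender and path.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted recommender (hypothetical; any picklable object works)
model_stub = {"name": "cosine", "user_based": False}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "recommender_stub.sav")
    # Save: the with-block closes the file even if dump raises
    with open(path, "wb") as fh:
        pickle.dump(model_stub, fh)
    # Load the object back from disk
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == model_stub)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.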

In [33]:
cross_validate(ml_recommender, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8933  0.8935  0.8950  0.8926  0.8940  0.8937  0.0008  
MAE (testset)     0.7013  0.7028  0.7034  0.7005  0.7029  0.7022  0.0011  
Fit time          20.51   21.22   21.25   21.31   21.19   21.10   0.30    
Test time         63.17   62.32   62.74   63.12   62.83   62.84   0.31    


{'fit_time': (20.505255222320557,
  21.218019008636475,
  21.251424312591553,
  21.31477117538452,
  21.189316034317017),
 'test_mae': array([0.70134784, 0.70276659, 0.70339275, 0.70052107, 0.70290287]),
 'test_rmse': array([0.89332273, 0.89348256, 0.89503767, 0.89258582, 0.89398495]),
 'test_time': (63.17433524131775,
  62.31940054893494,
  62.737791776657104,
  63.11521077156067,
  62.82955527305603)}

In [11]:
genres

['Action',
 'Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']

In [166]:
user_id = 4168
genre_list = 'Sci-Fi,Comedy'
items = genre_list.split(',')
table_list = set_up_ml(user_id,genre_list,ml_recommender)

In [167]:
table_list

Unnamed: 0,title,genres
5,"Visitors, The (Les Visiteurs) (1993)",Comedy|Sci-Fi
7,"Nutty Professor, The (1996)",Comedy|Fantasy|Romance|Sci-Fi
11,Mystery Science Theater 3000: The Movie (1996),Comedy|Sci-Fi
17,Sleeper (1973),Comedy|Sci-Fi
29,Junior (1994),Comedy|Sci-Fi
38,Coneheads (1993),Comedy|Sci-Fi
46,Tank Girl (1995),Action|Comedy|Musical|Sci-Fi
70,"Visitors, The (Les Visiteurs) (1993)",Comedy|Sci-Fi
90,"Nutty Professor, The (1996)",Comedy|Fantasy|Romance|Sci-Fi
120,Mystery Science Theater 3000: The Movie (1996),Comedy|Sci-Fi


In [168]:
table_list.iloc[1,0]

'Nutty Professor, The (1996)'

In [145]:
len(table_list)-1

11

In [151]:
t_list = table_list.to_numpy()

In [163]:
# Each row of t_list is a (title, genres) pair; print the titles
for title, genre_str in t_list:
    print(title)

Cemetery Man (Dellamorte Dellamore) (1994)
Frighteners, The (1996)
Dracula: Dead and Loving It (1995)
From Dusk Till Dawn (1996)
Serial Mom (1994)
Tales from the Hood (1995)
Cemetery Man (Dellamorte Dellamore) (1994)
Frighteners, The (1996)
Dracula: Dead and Loving It (1995)
From Dusk Till Dawn (1996)
Serial Mom (1994)
Tales from the Hood (1995)


In [164]:
str(t_list[0,0])

'Cemetery Man (Dellamorte Dellamore) (1994)'

In [135]:
for index, row in table_list.iterrows():
    print(row['title'], row['genres'])

Cemetery Man (Dellamorte Dellamore) (1994) Comedy|Horror
Frighteners, The (1996) Comedy|Horror
Dracula: Dead and Loving It (1995) Comedy|Horror
From Dusk Till Dawn (1996) Action|Comedy|Crime|Horror|Thriller
Serial Mom (1994) Comedy|Crime|Horror
Tales from the Hood (1995) Comedy|Horror
Cemetery Man (Dellamorte Dellamore) (1994) Comedy|Horror
Frighteners, The (1996) Comedy|Horror
Dracula: Dead and Loving It (1995) Comedy|Horror
From Dusk Till Dawn (1996) Action|Comedy|Crime|Horror|Thriller
Serial Mom (1994) Comedy|Crime|Horror
Tales from the Hood (1995) Comedy|Horror


In [107]:
table_list

Unnamed: 0,movieid,pred_rating,title,genres
1,735,3,Cemetery Man (Dellamorte Dellamore) (1994),Comedy|Horror
3,799,3,"Frighteners, The (1996)",Comedy|Horror
20,12,3,Dracula: Dead and Loving It (1995),Comedy|Horror
21,70,3,From Dusk Till Dawn (1996),Action|Comedy|Crime|Horror|Thriller
25,532,3,Serial Mom (1994),Comedy|Crime|Horror
28,330,3,Tales from the Hood (1995),Comedy|Horror
38,735,3,Cemetery Man (Dellamorte Dellamore) (1994),Comedy|Horror
65,799,3,"Frighteners, The (1996)",Comedy|Horror
235,12,3,Dracula: Dead and Loving It (1995),Comedy|Horror
239,70,3,From Dusk Till Dawn (1996),Action|Comedy|Crime|Horror|Thriller


In [68]:
df = table_list
gt = 'Horror|Comedy'

In [69]:
table_list[table_list['genres'].str.contains(gt)].head(2)

Unnamed: 0,movieid,pred_rating,title,genres
0,742,3,Thinner (1996),Horror|Thriller
1,735,3,Cemetery Man (Dellamorte Dellamore) (1994),Comedy|Horror


In [70]:
df[(df['genres'].str.contains('Horror')) & (df['genres'].str.contains('Comedy'))]

Unnamed: 0,movieid,pred_rating,title,genres
1,735,3,Cemetery Man (Dellamorte Dellamore) (1994),Comedy|Horror
3,799,3,"Frighteners, The (1996)",Comedy|Horror
20,12,3,Dracula: Dead and Loving It (1995),Comedy|Horror
21,70,3,From Dusk Till Dawn (1996),Action|Comedy|Crime|Horror|Thriller
25,532,3,Serial Mom (1994),Comedy|Crime|Horror
28,330,3,Tales from the Hood (1995),Comedy|Horror
38,735,3,Cemetery Man (Dellamorte Dellamore) (1994),Comedy|Horror
65,799,3,"Frighteners, The (1996)",Comedy|Horror
235,12,3,Dracula: Dead and Loving It (1995),Comedy|Horror
239,70,3,From Dusk Till Dawn (1996),Action|Comedy|Crime|Horror|Thriller


In [92]:
s = ''
for i in items:
    str_ = f'(?=.*{i})'
    s += str_
s

'(?=.*Horror)(?=.*Comedy)'

In [78]:
base = r'^{}'
expr = '(?=.*{})'
words = ['Horror','Comedy']  # example
base.format(''.join(expr.format(w) for w in words))

'^(?=.*Horror)(?=.*Comedy)'
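
The lookahead pattern can be checked outside pandas with the standard `re` module. In this sketch the helper `all_genres_pattern` is a hypothetical name, and `re.escape` is added as a precaution for genre names containing regex metacharacters.

```python
import re


def all_genres_pattern(genres):
    """Build a regex that matches only if every genre appears in the string."""
    # Each (?=.*X) is a zero-width lookahead anchored at the start of the string
    return "^" + "".join(f"(?=.*{re.escape(g)})" for g in genres)


pattern = all_genres_pattern(["Horror", "Comedy"])

for genres in ["Comedy|Horror", "Action|Comedy|Crime|Horror|Thriller", "Horror|Thriller"]:
    print(genres, bool(re.search(pattern, genres)))
```

Since every lookahead starts from the same position, the order of the genres in the string does not matter, and a string missing any one of them fails to match.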

In [101]:
df[df['genres'].str.contains(fr'^\b{s}\b',regex=True)]

Unnamed: 0,movieid,pred_rating,title,genres
1,735,3,Cemetery Man (Dellamorte Dellamore) (1994),Comedy|Horror
3,799,3,"Frighteners, The (1996)",Comedy|Horror
20,12,3,Dracula: Dead and Loving It (1995),Comedy|Horror
21,70,3,From Dusk Till Dawn (1996),Action|Comedy|Crime|Horror|Thriller
25,532,3,Serial Mom (1994),Comedy|Crime|Horror
28,330,3,Tales from the Hood (1995),Comedy|Horror
38,735,3,Cemetery Man (Dellamorte Dellamore) (1994),Comedy|Horror
65,799,3,"Frighteners, The (1996)",Comedy|Horror
235,12,3,Dracula: Dead and Loving It (1995),Comedy|Horror
239,70,3,From Dusk Till Dawn (1996),Action|Comedy|Crime|Horror|Thriller


In [87]:
df[df['genres'].str.contains(r'\b(?:{})\b'.format(words))]

Unnamed: 0,movieid,pred_rating,title,genres
34,1,3,Toy Story (1995),Animation|Children's|Comedy
55,810,3,Kazaam (1996),Children's|Comedy|Fantasy
63,837,3,Matilda (1996),Children's|Comedy
64,801,3,Harriet the Spy (1996),Children's|Comedy
76,586,3,Home Alone (1990),Children's|Comedy
78,588,3,Aladdin (1992),Animation|Children's|Comedy|Musical
83,575,3,"Little Rascals, The (1994)",Children's|Comedy
86,551,3,"Nightmare Before Christmas, The (1993)",Children's|Comedy|Musical
93,569,3,Little Big League (1994),Children's|Comedy
103,673,3,Space Jam (1996),Adventure|Animation|Children's|Comedy|Fantasy
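
The last result looks surprising: every row contains `Children's`, and none is required to contain Horror. A likely explanation is that formatting the *list* `words` into the pattern inserts its `repr`, brackets and quotes included, which the regex engine reads as a character class. The sketch below compares that pattern with an alternation built by `'|'.join`; the names `accidental` and `intended` are this sketch's own.

```python
import re

words = ["Horror", "Comedy"]

# Formatting the list itself yields \b(?:['Horror', 'Comedy'])\b, a character class
accidental = r"\b(?:{})\b".format(words)
# Joining the words with | yields a pattern that matches either genre
intended = r"\b(?:{})\b".format("|".join(re.escape(w) for w in words))

for genres in ["Children's|Comedy", "Comedy|Horror", "Horror|Thriller"]:
    print(genres, bool(re.search(accidental, genres)), bool(re.search(intended, genres)))
```

The accidental pattern matches the apostrophe in `Children's`, because it is a single character from the class with a word boundary on each side, while `Comedy|Horror` does not match at all. The joined pattern matches any string containing at least one of the requested genres.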
