This workbook implements Context Based Filtering for a Cats Recommendation System, working from the raw data all the way to model creation and initial results.

# Table of Contents

* [Load in Data and Segment Features from Context data](#segment)
* [Pre-process feature data](#pre-process)
    - [Make Needed Helper Functions and Imports](#pp_pipeline)
    - [Preprocess Data for model runs](#pp)
* [Collaborative Filtering - Under Construction WIP](#cf)    
* [Conclusion and Next Steps](#conclusion)

# Load in Data and Segment Features from Context data<a id='segment'></a>

First, we load all of our adoptable cats.

In [1]:
from google.colab import drive
import joblib #so I can save files out

drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
cats_DF = pd.read_csv("/content/drive/MyDrive/MLE10PetMatch/Adoptable_cats_20221125.csv",header=0,index_col=0)
cats_DF.shape



(49600, 50)

In [4]:
cats_DF = cats_DF.drop_duplicates()
cats_DF.shape

(49502, 50)

In [5]:
pd.set_option('display.max_columns', 500)
cats_DF.sample(3)

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,primary_photo_cropped,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full
14956,58922967,IN567,https://www.petfinder.com/cat/renny-58922967/i...,Cat,Cat,Baby,Female,Small,Short,[],Renny,Interested in adopting a kitty from us?! Apply...,ps_1582199-93555,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-11-21T00:02:32+0000,2022-11-21T00:02:29+0000,,Domestic Short Hair,,False,False,Black & White / Tuxedo,,,True,True,False,False,True,True,True,True,janet@fultoncoanimalcenter.org,574-223-7387,1540 Wentzel St,,Rochester,IN,46975,US,58922967,cat,in567,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...
2260,58974035,OK30,https://www.petfinder.com/cat/caviar-58974035/...,Cat,Cat,Adult,Female,Small,,[],CAVIAR,All cats and kittens have received age-appropr...,A043177,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-11-27T01:41:24+0000,2022-11-27T01:41:22+0000,,Domestic Short Hair,,False,False,,,,True,False,False,False,False,,,,ila.lee@edmondok.com,405-216-7615,2424 Old Timbers Drive,,Edmond,OK,73034,US,58974035,cat,ok30,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...
29024,58776512,TX114,https://www.petfinder.com/cat/janie-58776512/t...,Cat,Cat,Baby,Female,Medium,Short,"['Friendly', 'Playful', 'Curious', 'Loves kiss...",Janie,Janie is the sweetest little purr-machine. If ...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-11-13T02:31:40+0000,2022-11-13T02:31:39+0000,,Dilute Tortoiseshell,Tabby,True,False,Dilute Tortoiseshell,Tabby (Gray / Blue / Silver),Torbie,True,True,False,False,True,False,,True,emily@coppellhumanesociety.org,,,,Coppell,TX,75019,US,58776512,cat,tx114,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...


In [6]:
cats_DF.columns

Index(['id', 'organization_id', 'url', 'type', 'species', 'age', 'gender',
       'size', 'coat', 'tags', 'name', 'description', 'organization_animal_id',
       'photos', 'primary_photo_cropped', 'videos', 'status',
       'status_changed_at', 'published_at', 'distance', 'breeds.primary',
       'breeds.secondary', 'breeds.mixed', 'breeds.unknown', 'colors.primary',
       'colors.secondary', 'colors.tertiary', 'attributes.spayed_neutered',
       'attributes.house_trained', 'attributes.declawed',
       'attributes.special_needs', 'attributes.shots_current',
       'environment.children', 'environment.dogs', 'environment.cats',
       'contact.email', 'contact.phone', 'contact.address.address1',
       'contact.address.address2', 'contact.address.city',
       'contact.address.state', 'contact.address.postcode',
       'contact.address.country', 'animal_id', 'animal_type',
       'organization_id.1', 'primary_photo_cropped.small',
       'primary_photo_cropped.medium', 'primary_photo

Drop animals with no pictures since they are key to our 'tinder-like' app experience.

In [7]:
cats_DF = cats_DF.dropna(subset=['primary_photo_cropped.full'])# drop rows with 0 pictures
cats_DF.shape # matches na count via sweet viz for cats

(46710, 50)

Next, we separate the dataframe into features to model over and context data that can be shown to the user for any matches. The 'id' column will be our shared key between the two tables.

Of note, the 'distance' field and 'primary_photo_cropped.full' field will be useful data for future model enhancements. For the models so far, we will simply use textual data and assume a 0 distance for all pets.

In [8]:
contextCols = ['id','organization_id','url','type','tags','name','description','organization_animal_id',
              'photos','primary_photo_cropped','videos','status','status_changed_at','published_at',
              'distance','contact.email', 'contact.phone', 'contact.address.address1',
               'contact.address.address2', 'contact.address.city','contact.address.state', 
               'contact.address.postcode','contact.address.country', 'animal_id', 'animal_type',
               'organization_id.1', 'primary_photo_cropped.small','primary_photo_cropped.medium',
               'primary_photo_cropped.large','primary_photo_cropped.full']
featureCols = ['id','age','gender','size','coat','breeds.primary', 'breeds.secondary','breeds.mixed',
              'breeds.unknown','colors.primary','colors.secondary','colors.tertiary',
              'attributes.spayed_neutered','attributes.house_trained','attributes.declawed',
              'attributes.special_needs','attributes.shots_current','environment.children',
              'environment.dogs','environment.cats','type','contact.address.postcode'] # initial columns to keep for training purposes
cats_DF_features = cats_DF[featureCols]
cats_DF_context = cats_DF[contextCols]
cats_DF_features.shape

(46710, 22)
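Because 'id' is kept in both tables, any cat picked from the feature table can be joined back to its display fields. The following is a minimal sketch on toy data: the two small frames, the names, and the example.org URLs are invented for illustration, and only the column names come from this notebook.

```python
import pandas as pd

# Toy stand-ins for cats_DF_features and cats_DF_context (values are invented).
features = pd.DataFrame({"id": [101, 102, 103], "age": ["Baby", "Adult", "Young"]})
context = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["Renny", "Caviar", "Janie"],
    "url": ["https://example.org/101", "https://example.org/102", "https://example.org/103"],
})

# Sanity check: both tables describe the same set of cats.
assert set(features["id"]) == set(context["id"])

# Pretend a model returned these ids as matches, best first.
matched_ids = [103, 101]

# A left join keeps the order of the matches; validate guards against duplicate ids.
matches = (
    pd.DataFrame({"id": matched_ids})
    .merge(context, on="id", how="left", validate="one_to_one")
)
print(matches[["id", "name"]])
```

Using `validate="one_to_one"` makes the join fail loudly if an id is duplicated on either side, which is a cheap check to keep once the tables are split.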

Let's sanity check our missing values now that we just have cats and remove any columns with too many missing values.

In [9]:
valueCounts = cats_DF_features.set_index('type').isna().groupby(level=0).sum()/cats_DF_features.shape[0] # level=0 refers to our index, which we made 'type'


In [10]:
pd.set_option('display.max_columns', 500)
valueCounts 

Unnamed: 0_level_0,id,age,gender,size,coat,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,contact.address.postcode
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Cat,0.0,0.0,0.0,0.0,0.616271,0.0,0.89985,0.0,0.0,0.392935,0.746264,0.915907,0.0,0.0,0.0,0.0,0.0,0.736802,0.829844,0.588311,2.1e-05


In [11]:
valueCounts = cats_DF_context.set_index('type').isna().groupby(level=0).sum()/cats_DF_context.shape[0] # level=0 refers to our index, which we made 'type'


In [12]:
pd.set_option('display.max_columns', 500)
valueCounts 

Unnamed: 0_level_0,id,organization_id,url,tags,name,description,organization_animal_id,photos,primary_photo_cropped,videos,status,status_changed_at,published_at,distance,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1
Cat,0.0,0.0,0.0,0.0,2.1e-05,0.262299,0.313552,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.05106,0.193856,0.371741,0.923464,0.0,0.0,2.1e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


After a quick NA check, we will remove 'coat', 'breeds.secondary', 'colors.secondary', and 'colors.tertiary'. The column 'colors.primary' is also missing many values (about 39%), but it is kept for now because it helps differentiate one cat from another. Additionally, we will bring the address postcode back in as an initial attempt to match nearby cats together. Lastly, 'environment.children', 'environment.dogs', and 'environment.cats' have many missing values, but users derive a lot of value from this information, so they are kept as well.

In [13]:
featureCols = ['id','age','gender','size','breeds.primary','breeds.mixed','breeds.unknown',
               'colors.primary','attributes.spayed_neutered','attributes.house_trained',
               'attributes.declawed','attributes.special_needs','attributes.shots_current',
               'contact.address.postcode','environment.children','environment.dogs','environment.cats']
cats_DF_features = cats_DF[featureCols]
cats_DF_context = cats_DF[contextCols]
cats_DF_features.shape

(46710, 17)
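The column selection above was done by hand from the NA table. As a rough sketch of the same idea in code, the snippet below drops columns whose missing share exceeds a cut-off while keeping a short list of columns regardless. The toy data, the 0.5 threshold, and the `keep_anyway` set are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented, column names mirror the notebook.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "coat": ["Short", np.nan, np.nan, np.nan],             # 75% missing
    "colors.primary": ["Black", np.nan, "White", "Gray"],  # 25% missing
    "environment.dogs": [True, np.nan, np.nan, np.nan],    # 75% missing
})

na_fraction = df.isna().mean()       # share of missing values per column
threshold = 0.5                      # hypothetical cut-off
keep_anyway = {"environment.dogs"}   # kept despite NAs because users value it

kept = [c for c in df.columns
        if na_fraction[c] <= threshold or c in keep_anyway]
print(kept)
```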

In [14]:
cats_DF_features.dtypes

id                             int64
age                           object
gender                        object
size                          object
breeds.primary                object
breeds.mixed                    bool
breeds.unknown                  bool
colors.primary                object
attributes.spayed_neutered      bool
attributes.house_trained        bool
attributes.declawed             bool
attributes.special_needs        bool
attributes.shots_current        bool
contact.address.postcode      object
environment.children          object
environment.dogs              object
environment.cats              object
dtype: object

# Pre-process feature data<a id='pre-process'></a>

## Make Needed Helper Functions and Imports <a id='pp_pipeline'></a>

Make needed helper functions for modeling later on in workbook.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import laplacian_kernel
import numpy as np
from sklearn.decomposition import TruncatedSVD

In [None]:
def remove_columns_with_1_distinct(df):
    drop_col = [e for e in df.columns if df[e].nunique()==1]
    df_return = df.drop(drop_col,axis=1)
    return df_return


In [None]:
def drop_duplicates(df):
    df_return = df.drop_duplicates()
    return df_return


In [None]:
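def linear_similarities(df_1,id_df):
    """Return {cat id: [(score, other id), ...]} ranked by linear-kernel similarity."""
    cs_simil = linear_kernel(df_1,df_1)
    results = {}
    ids = id_df['id'].to_numpy() # positional ids, so a non-default index cannot misalign rows
    for pos, cat_id in enumerate(ids):
        similar_indices = cs_simil[pos].argsort()[:-100:-1] # 99 highest scores, best first
        similar_items = [(cs_simil[pos][i], ids[i]) for i in similar_indices]
        results[cat_id] = similar_items[1:] # drop the top hit, which is normally the cat itself
    return results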

In [None]:
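def cosine_similarities(df_1,id_df):
    """Return {cat id: [(score, other id), ...]} ranked by cosine similarity."""
    cs_simil = cosine_similarity(df_1,df_1)
    results = {}
    ids = id_df['id'].to_numpy() # positional ids, so a non-default index cannot misalign rows
    for pos, cat_id in enumerate(ids):
        similar_indices = cs_simil[pos].argsort()[:-100:-1] # 99 highest scores, best first
        similar_items = [(cs_simil[pos][i], ids[i]) for i in similar_indices]
        results[cat_id] = similar_items[1:] # drop the top hit, which is normally the cat itself
    return results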

In [None]:
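def laplacian_similarities(df_1,id_df):
    """Return {cat id: [(score, other id), ...]} ranked by Laplacian-kernel similarity."""
    cs_simil = laplacian_kernel(df_1,df_1)
    results = {}
    ids = id_df['id'].to_numpy() # positional ids, so a non-default index cannot misalign rows
    for pos, cat_id in enumerate(ids):
        similar_indices = cs_simil[pos].argsort()[:-100:-1] # 99 highest scores, best first
        similar_items = [(cs_simil[pos][i], ids[i]) for i in similar_indices]
        results[cat_id] = similar_items[1:] # drop the top hit, which is normally the cat itself
    return results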
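The three helpers above share one pattern: build the full pairwise similarity matrix, sort each row from highest to lowest score, and skip the cat itself. The sketch below shows that pattern with plain NumPy on a tiny invented one-hot matrix. The function name `top_k_cosine` and the data are hypothetical; unlike the helpers, it excludes each cat by position instead of assuming it sorts first, which matters when several cats tie at the top score.

```python
import numpy as np

def top_k_cosine(features, ids, k):
    """Return {id: [(score, other_id), ...]} with the k most similar other rows."""
    X = np.asarray(features, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # avoid division by zero for all-zero rows
    unit = X / norms
    sim = unit @ unit.T                     # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)          # exclude each row from its own results
    results = {}
    for pos, item_id in enumerate(ids):
        order = np.argsort(-sim[pos], kind="stable")[:k]   # best scores first
        results[item_id] = [(float(sim[pos, j]), ids[j]) for j in order]
    return results

# Invented one-hot rows for three cats.
X = [[1, 0, 1, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]
ids = [11, 22, 33]
recs = top_k_cosine(X, ids, k=2)
print(recs[11])
```

Note that the full matrix needs memory proportional to the square of the number of cats, so for tens of thousands of rows it may be worth computing the similarities in row blocks.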

In [None]:
def item(id,df):  
    ds = df
    colsGrab = ['id']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # look up the row for this id and return its 'id' value

def url(id,df):  
    ds = df
    colsGrab = ['url']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # look up the row for this id and return its listing URL

def picture(id,df):  
    ds = df
    colsGrab = ['primary_photo_cropped.full']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # look up the row for this id and return its full-size photo URL

def recommend(item_id, num,df,reccs):
    print("Recommending " + str(num) + " cats similar to " + str(item(item_id,df)) + "... " 
          + picture(item_id,df) + " - " + url(item_id,df))   
    print("-------")    
    recs = reccs[item_id][:num]   
    for rec in recs: 
        print("Recommended: " + str(item(rec[1],df)) + " (score:" +      str(rec[0]) + ") " 
              + picture(rec[1],df) + " - " + url(rec[1],df))
    
def score(reccs, num):
    print("Finding average recommendation score for top " + str(num) + " recommendations per example")
    results = []
    for key in reccs.keys():
        subRecs = reccs[key][:num]
        for r in subRecs:
            results.append(r[0])
    averageRecc = sum(results) / len(results)
    print("There are " + str(len(results)) + " results with a sum of " + str(sum(results)) + " and an average of: "
          + str(averageRecc))
    return averageRecc

# Collaborative Filtering - Under Construction WIP <a id='cf'></a>

Collaborative filtering uses rankings to recommend new products to customers, and there are several approaches one can take. For this first iteration, we will use a model-based SVD (matrix factorization) approach on user-item interactions.
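To make the matrix factorization idea concrete, here is a minimal NumPy sketch of a biased factorization model fitted by stochastic gradient descent, where a rating is estimated as the global mean plus a user bias, an item bias, and the dot product of a user and an item factor vector. This is an illustration in the spirit of SVD-style recommenders, not the Surprise library's implementation. The function name `fit_mf`, the hyperparameters, and the ratings are all invented.

```python
import numpy as np

def fit_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i by stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])          # global mean rating
    b_u = np.zeros(n_users)                           # user biases
    b_i = np.zeros(n_items)                           # item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))      # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))      # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()                       # use the pre-update user factors for Q
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])

    def predict(u, i):
        return float(mu + b_u[u] + b_i[i] + P[u] @ Q[i])

    return predict

# Invented (user index, item index, rating) triples with 0/1 preferences.
ratings = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 2, 0), (2, 1, 0), (2, 2, 1)]
predict = fit_mf(ratings, n_users=3, n_items=3)
train_rmse = np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in ratings]))
print(train_rmse)
```

The RMSE printed here is measured on the same triples used for fitting, so it only shows that the model can fit the data, not how well it generalizes.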

## Surprise Method

In [16]:
!pip install surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 772.0/772.0 KB 12.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp38-cp38-linux_x86_64.whl size=3366438 sha256=423c463e1d32ac078dceeb52bf1d1b78ccd4dc141cf175edd34ce5176047253c
  Stored in directory: /root/.cache/pip/wheels/af/db/86/2c18183a80ba05da35bf0fb7417aac5cddbd93bcb1b92fd3ea
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.3 surprise-0.1


In [190]:
from surprise import Dataset, SVD, accuracy
from surprise.model_selection import cross_validate
from surprise import BaselineOnly, Reader
import pandas as pd
from sklearn.model_selection import train_test_split
# importing product
from itertools import product

In [18]:
data = Dataset.load_builtin("ml-100k")

# We'll use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9355  0.9410  0.9325  0.9296  0.9356  0.9348  0.0038  
MAE (testset)     0.7359  0.7427  0.7340  0.7339  0.7388  0.7371  0.0033  
Fit time          1.39    1.40    1.38    1.36    1.68    1.44    0.12    
Test time         0.22    0.15    0.23    0.17    0.28    0.21    0.05    


{'test_rmse': array([0.93551333, 0.9410357 , 0.93245018, 0.92958424, 0.93556615]),
 'test_mae': array([0.73590916, 0.74265288, 0.73404822, 0.73391421, 0.73878886]),
 'fit_time': (1.3913798332214355,
  1.4014124870300293,
  1.3821194171905518,
  1.363471269607544,
  1.6792771816253662),
 'test_time': (0.2214958667755127,
  0.15087270736694336,
  0.22713232040405273,
  0.1665782928466797,
  0.27875304222106934)}

This works in the basic case with the built-in MovieLens data. Next, let's see whether it works with our own rankings data.

In [26]:
cat_rankings = pd.read_csv("/content/drive/MyDrive/MLE10PetMatch/petmatch_rankings_cats.csv",header=0)
cat_rankings.shape

(194, 3)

In [33]:
cat_rankings =cat_rankings.rename(columns={"user_name":"user_name","cat_id":"cat_id","preference":"raw_ratings"})
cat_rankings.head(3) # dataframe already in order we desire but renaming preferences

Unnamed: 0,user_name,cat_id,raw_ratings
0,Denise,58935988,0
1,Denise,58708840,1
2,Denise,58969335,0


In [35]:
cat_rankings.groupby('user_name').count() # get idea of rankings data so far

Unnamed: 0_level_0,cat_id,raw_ratings
user_name,Unnamed: 1_level_1,Unnamed: 2_level_1
1,8,8
3,62,62
Denise,32,32
Matt,92,92


In [39]:
cf_train, cf_test = train_test_split(cat_rankings,test_size=0.2,train_size=0.8, random_state=12)

In [40]:
cf_test.groupby('user_name').count()

Unnamed: 0_level_0,cat_id,raw_ratings
user_name,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3,3
3,12,12
Denise,9,9
Matt,15,15


In [65]:
cf_train.groupby('user_name').count()

Unnamed: 0_level_0,cat_id,raw_ratings
user_name,Unnamed: 1_level_1,Unnamed: 2_level_1
1,5,5
3,50,50
Denise,23,23
Matt,77,77


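This split keeps roughly an 80/20 balance overall (155 training and 39 test rows), though it is uneven for users with few rankings: user 1 has 3 of 8 rankings in the test set. This should work for a first pass.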
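If a more even per-user split is wanted later, one option is to sample the test rows within each user instead of across the whole table. The sketch below uses `groupby(...).sample`, which is available in pandas 1.1 and later. The rankings are invented and only the column layout mirrors `cat_rankings`.

```python
import pandas as pd

# Invented rankings in the same three-column layout as cat_rankings.
rankings = pd.DataFrame({
    "user_name": ["A"] * 10 + ["B"] * 5,
    "cat_id": range(15),
    "raw_ratings": [0, 1] * 7 + [0],
})

# Sample 20% of each user's rows for the test set; the rest is the training set.
test = rankings.groupby("user_name").sample(frac=0.2, random_state=12)
train = rankings.drop(test.index)

print(test["user_name"].value_counts().to_dict())
print(len(train), len(test))
```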

In [144]:
from math import nan
user_item_mat_train = pd.DataFrame(list(product(pd.unique(cat_rankings['user_name']),pd.unique(cats_DF['id'])))) # make unique name_id df
user_item_mat_train.columns = ['user_name', 'cat_id'] # reset column names
user_item_mat_train['raw_ratings'] = nan # no values yet

user_item_mat_test = user_item_mat_train.copy()

In [145]:
user_item_mat_test.head(3)

Unnamed: 0,user_name,cat_id,raw_ratings
0,Denise,58980784,
1,Denise,58980778,
2,Denise,58980506,


In [146]:
user_item_mat_train.head(3)

Unnamed: 0,user_name,cat_id,raw_ratings
0,Denise,58980784,
1,Denise,58980778,
2,Denise,58980506,


As we can see, we first need to fix our data to match the correct format.

Now that we have default tables, let's update them with the rankings saved so far!

In [147]:
for index, row in cf_train.iterrows(): # update train table
    cat_id = row['cat_id']
    user_id = row['user_name']
    #print(user_id) # shows names
    indextoChange = user_item_mat_train[(user_item_mat_train['user_name']==user_id) & (user_item_mat_train['cat_id']==cat_id)].index # row(s) to change
    preferencetoChange = row['raw_ratings'] # ranking for animal to use
    user_item_mat_train.loc[indextoChange,'raw_ratings'] = preferencetoChange # .loc accepts an Index of rows; .at expects a single label
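The row-by-row update scans the full user-by-cat table once per ranking. As a possible alternative, the same table can be built with a single left join between the grid of all (user, cat) pairs and the known rankings. This is a sketch on invented data: the ids and ratings are made up and only the column names follow the notebook.

```python
import pandas as pd
from itertools import product

users = ["Denise", "Matt"]
cat_ids = [1, 2, 3]                                   # invented ids
grid = pd.DataFrame(list(product(users, cat_ids)), columns=["user_name", "cat_id"])

known = pd.DataFrame({                                # invented known rankings
    "user_name": ["Denise", "Matt"],
    "cat_id": [2, 3],
    "raw_ratings": [1, 0],
})

# A left join keeps every (user, cat) pair and leaves NaN where no ranking exists.
filled = grid.merge(known, on=["user_name", "cat_id"], how="left")
print(filled)
```

This assumes each (user, cat) pair appears at most once in the known rankings; duplicates would add extra rows to the result.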

In [148]:
cf_train.sample(5)

Unnamed: 0,user_name,cat_id,raw_ratings
74,Matt,58701164,0
15,Denise,58980554,0
133,3,58759916,0
52,Matt,58742497,0
163,3,58852637,0


In [149]:
user_item_mat_train[user_item_mat_train['cat_id']==58861703]

Unnamed: 0,user_name,cat_id,raw_ratings
23624,Denise,58861703,0.0
70334,Matt,58861703,
117044,1,58861703,
163754,3,58861703,


In [150]:
for index, row in cf_test.iterrows(): # update test table
    cat_id = row['cat_id']
    user_id = row['user_name']
    #print(user_id) # shows names
    indextoChange = user_item_mat_test[(user_item_mat_test['user_name']==user_id) & (user_item_mat_test['cat_id']==cat_id)].index # row(s) to change
    preferencetoChange = row['raw_ratings'] # ranking for animal to use
    user_item_mat_test.loc[indextoChange,'raw_ratings'] = preferencetoChange # .loc accepts an Index of rows; .at expects a single label

In [151]:
cf_test.sample(5)

Unnamed: 0,user_name,cat_id,raw_ratings
174,3,58896656,0
21,Denise,58905228,1
179,3,58913431,0
90,Matt,58979831,0
14,Denise,58792788,0


In [152]:
user_item_mat_test[user_item_mat_test['cat_id']==58905228]

Unnamed: 0,user_name,cat_id,raw_ratings
16496,Denise,58905228,1.0
63206,Matt,58905228,
109916,1,58905228,
156626,3,58905228,


In [189]:
'''
rating dataframe will look like this
| user_id | item_id | rating          |
|---------|---------|-----------------|
| 1       | 1       | 5               |
| ...     | ...     | ...             |
| n       | m       | 3               |
'''


In [197]:
reader=Reader(rating_scale=(0,1))
data= Dataset.load_from_df(cat_rankings,reader)

In [210]:
type(data)

surprise.dataset.DatasetAutoFolds

In [200]:
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=10, verbose=True) # initial results on cat_rankings with no fillers and no datasets

Evaluating RMSE, MAE of algorithm SVD on 10 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10 Mean    Std     
RMSE (testset)    0.3310  0.1673  0.2973  0.4274  0.4044  0.1212  0.2129  0.3412  0.2148  0.2690  0.2787  0.0952  
MAE (testset)     0.1788  0.0917  0.1253  0.2854  0.2603  0.0577  0.1372  0.2402  0.1174  0.1735  0.1667  0.0714  
Fit time          0.00    0.01    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    


{'test_rmse': array([0.33101506, 0.16732019, 0.29732069, 0.42736536, 0.40437103,
        0.12122449, 0.2129406 , 0.34119532, 0.21483102, 0.26899917]),
 'test_mae': array([0.1787528 , 0.09174591, 0.12534982, 0.28543058, 0.26028846,
        0.05766877, 0.13717203, 0.24015473, 0.11738239, 0.17352051]),
 'fit_time': (0.0033288002014160156,
  0.007955312728881836,
  0.0025365352630615234,
  0.0025129318237304688,
  0.0026199817657470703,
  0.0029060840606689453,
  0.002421855926513672,
  0.0024132728576660156,
  0.0023963451385498047,
  0.0023949146270751953),
 'test_time': (0.0001575946807861328,
  0.000171661376953125,
  0.00013303756713867188,
  0.0001366138458251953,
  0.0001277923583984375,
  0.00012993812561035156,
  0.00011730194091796875,
  0.00011658668518066406,
  0.00011444091796875,
  0.0001266002655029297)}

Initial CV results are not bad. Let's continue with train-test.
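One way to put an RMSE on 0/1 preferences in context is to compare it with a constant predictor. If every pair is predicted as the mean rating p, the RMSE is sqrt(p * (1 - p)), so a model is only adding value if it beats that number. The ratings below are invented, since the overall share of positive rankings is not reported here.

```python
import numpy as np

# Invented 0/1 preferences: 20% positive.
ratings = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=float)

p = ratings.mean()                                   # share of positive ratings
baseline_pred = np.full_like(ratings, p)             # predict the mean for everyone
baseline_rmse = np.sqrt(np.mean((ratings - baseline_pred) ** 2))
closed_form = np.sqrt(p * (1 - p))

print(baseline_rmse, closed_form)
```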

In [201]:
trainset= data.build_full_trainset()
svd.fit(trainset) # model fit

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7ff562c01640>

In [211]:
type(trainset)

surprise.trainset.Trainset

In [206]:
# Let's check Denise's ratings
cat_rankings[cat_rankings['user_name']=="Denise"]

Unnamed: 0,user_name,cat_id,raw_ratings
0,Denise,58935988,0
1,Denise,58708840,1
2,Denise,58969335,0
3,Denise,58979432,0
4,Denise,58875428,1
5,Denise,58847178,1
6,Denise,58861703,0
7,Denise,58809183,1
8,Denise,58818518,0
9,Denise,58773403,0


In [207]:
# Let's check someone else's ratings to find one Denise did not rate
cat_rankings[cat_rankings['user_name']=="Matt"]

Unnamed: 0,user_name,cat_id,raw_ratings
32,Matt,58893275,0
33,Matt,58656130,0
34,Matt,58713901,0
35,Matt,58886009,0
36,Matt,58927772,0
...,...,...,...
119,Matt,58927110,0
120,Matt,58833286,0
121,Matt,58968665,0
122,Matt,58665847,0


In [208]:
#svd.predict("Denise",58893275) #version if haven't rated it yet
svd.predict("Denise",58708840,1)

Prediction(uid='Denise', iid=58708840, r_ui=1, est=0.48535308409743877, details={'was_impossible': False})

In [214]:
svd.predict("Denise",58893275) #version if haven't rated it yet

Prediction(uid='Denise', iid=58893275, r_ui=None, est=0.438104654231287, details={'was_impossible': False})

In [234]:
# Try Collab Filtering with cat train data
#train_df= user_item_mat_train
#test_df= user_item_mat_test
train_df= cat_rankings#cf_train
#train_df= cf_train
#test_df= cf_test
reader=Reader(rating_scale=(0,1))
train_data= Dataset.load_from_df(train_df,reader)
#test_data= Dataset.load_from_df(test_df,reader)

In [235]:
trainset_full= train_data.build_full_trainset() #REQUIRED
testset_full= trainset_full.build_testset()

In [236]:
cross_validate(SVD(), train_data, cv=2)

{'test_rmse': array([0.31578444, 0.27530422]),
 'test_mae': array([0.18592754, 0.1515951 ]),
 'fit_time': (0.002444028854370117, 0.0025072097778320312),
 'test_time': (0.0009093284606933594, 0.0008852481842041016)}

In [237]:
algo = SVD()
algo.fit(trainset_full)
predictions = algo.test(testset_full,verbose=True)
accuracy.rmse(predictions)  


user: Denise     item: 58935988   r_ui = 0.00   est = 0.35   {'was_impossible': False}
user: Denise     item: 58708840   r_ui = 1.00   est = 0.66   {'was_impossible': False}
user: Denise     item: 58969335   r_ui = 0.00   est = 0.27   {'was_impossible': False}
user: Denise     item: 58979432   r_ui = 0.00   est = 0.43   {'was_impossible': False}
user: Denise     item: 58875428   r_ui = 1.00   est = 0.62   {'was_impossible': False}
user: Denise     item: 58847178   r_ui = 1.00   est = 0.66   {'was_impossible': False}
user: Denise     item: 58861703   r_ui = 0.00   est = 0.37   {'was_impossible': False}
user: Denise     item: 58809183   r_ui = 1.00   est = 0.62   {'was_impossible': False}
user: Denise     item: 58818518   r_ui = 0.00   est = 0.39   {'was_impossible': False}
user: Denise     item: 58773403   r_ui = 0.00   est = 0.24   {'was_impossible': False}
user: Denise     item: 58977844   r_ui = 0.00   est = 0.38   {'was_impossible': False}
user: Denise     item: 58976645   r_ui = 0.

0.21944211494715832

In [238]:
cat_rankings.head(3)

Unnamed: 0,user_name,cat_id,raw_ratings
0,Denise,58935988,0
1,Denise,58708840,1
2,Denise,58969335,0


In [239]:
joblib.dump(algo, '/content/drive/MyDrive/MLE10PetMatch/models/collabfilter_model_cats_v1.pkl')

['/content/drive/MyDrive/MLE10PetMatch/models/collabfilter_model_cats_v1.pkl']

In [240]:
algo_t = joblib.load('/content/drive/MyDrive/MLE10PetMatch/models/collabfilter_model_cats_v1.pkl')

In [241]:
predictions = algo_t.test(testset_full,verbose=True)
accuracy.rmse(predictions) 

user: Denise     item: 58935988   r_ui = 0.00   est = 0.35   {'was_impossible': False}
user: Denise     item: 58708840   r_ui = 1.00   est = 0.66   {'was_impossible': False}
user: Denise     item: 58969335   r_ui = 0.00   est = 0.27   {'was_impossible': False}
user: Denise     item: 58979432   r_ui = 0.00   est = 0.43   {'was_impossible': False}
user: Denise     item: 58875428   r_ui = 1.00   est = 0.62   {'was_impossible': False}
user: Denise     item: 58847178   r_ui = 1.00   est = 0.66   {'was_impossible': False}
user: Denise     item: 58861703   r_ui = 0.00   est = 0.37   {'was_impossible': False}
user: Denise     item: 58809183   r_ui = 1.00   est = 0.62   {'was_impossible': False}
user: Denise     item: 58818518   r_ui = 0.00   est = 0.39   {'was_impossible': False}
user: Denise     item: 58773403   r_ui = 0.00   est = 0.24   {'was_impossible': False}
user: Denise     item: 58977844   r_ui = 0.00   est = 0.38   {'was_impossible': False}
user: Denise     item: 58976645   r_ui = 0.

0.21944211494715832

# Conclusion and Next Steps <a id='conclusion'></a>

**Conclusion of ML Modeling as of 1/2/23**: 
- All three content-based filtering models perform well
- Cosine Similarity appears to be the most sensitive to differences and has a very useful scale of 0-1.
- Can hook up content-based filtering models to PetMatch UI as-is and it should return good results based on overall similarity measures measured so far.
- The user rankings data generated so far requires more formatting than initially expected, but our application tracks all the key required fields for now.
- Collaborative Filtering is harder to implement than initially expected, but we have initial data to give it a try.

**Conclusion of ML Baseline as of 12/6/22**: 
- The average score of the top 5 recommendations per cat in the training set is 10.96. The highest available score is 12.
- The result above uses a simple content-based filtering recommendation model without user preferences, since they are not currently available. Instead, it compares items against each other: you liked this ketchup, so here are 10 other similar types of ketchup.
- Due to the method used to create the simple content-based filtering model, dev and test sets cannot be used, so the training set was used to get an initial idea of the results.
- The cats data version 0.5 features need more ways to delineate one cat from another, but based on visual scans and the average recommendation score, the simple cat CBF model generally excels at giving you cats similar to the one you said you wanted.
- In instances where there is more ambiguity (aka a chosen cat with less defined details), it will still find cats very similar to it but sometimes it can also throw in very similar cats who are a different breed. This might not be a bad thing.
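
For intuition about a score scale that tops out at a whole number: if the baseline scored cats with a linear kernel on one-hot encoded fields (an assumption, since the encoding step is not shown in this section), each score is simply the number of fields on which two cats agree, and the maximum equals the number of encoded fields. The three cats below are invented.

```python
import numpy as np

# Three invented cats described by 3 categorical fields, one-hot encoded.
# Columns: [age=Baby, age=Adult, gender=F, gender=M, size=Small, size=Medium]
cats = np.array([
    [1, 0, 1, 0, 1, 0],   # Baby, F, Small
    [1, 0, 1, 0, 0, 1],   # Baby, F, Medium
    [0, 1, 0, 1, 0, 1],   # Adult, M, Medium
])

# The dot product of two one-hot rows counts the fields where both rows agree.
linear = cats @ cats.T
print(linear)
```

Here the diagonal is 3 because each cat agrees with itself on all three fields, and the first two cats score 2 because they differ only in size.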

**Next Steps**:

- Get more user rankings!
- Incorporate distance more effectively
- Can we use description field for cats at all? 
- Collaborative Filtering for item and user-based
  - Use surprise library possibly
  - Add a timestamp to rankings so we can be time-sensitive in terms of recommendations