This workbook implements Content-Based Filtering for a cat recommendation system, working from the raw data through model creation and initial results.

# Table of Contents

* [Load in Data and Segment Features from Context data](#segment)
* [Pre-process feature data](#pre-process)
    - [Make Needed Helper Functions and Imports](#pp_pipeline)
    - [Preprocess Data for model runs](#pp)
* [Run Content-Based-Filtering Modeling Iterations](#run_pipeline)
    - [Cosine similarity results](#cs)
    - [Overall Content Based Filtering results as of 1/18/2023](#ov)  
* [Conclusion and Next Steps](#conclusion)

# Load in Data and Segment Features from Context data<a id='segment'></a>

First, we load all of our adoptable cats.

In [None]:
from google.colab import drive
import joblib #so I can save files out

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
cats_DF = pd.read_csv("/content/drive/MyDrive/MLE10PetMatch/Adoptable_cats_20221125.csv",header=0,index_col=0)
cats_DF.shape



(49600, 50)

In [None]:
pd.set_option('display.max_columns', 500)
cats_DF.sample(3)

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,primary_photo_cropped,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full
48711,58667767,GA423,https://www.petfinder.com/cat/rosemary-5866776...,Cat,Cat,Baby,Female,Small,,[],Rosemary,,51007780,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-10-26T19:26:19+0000,2022-10-26T19:26:19+0000,,Domestic Short Hair,,False,False,,,,False,False,False,False,False,,,,adoption@dekalbanimalservices.com,(404) 294-2165,3280 Chamblee Dunwoody Rd,,Chamblee,GA,30341,US,58667767,cat,ga423,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...
40050,58754877,ND32,https://www.petfinder.com/cat/luna-58754877/nd...,Cat,Cat,Baby,Female,Small,,['Playful'],Luna,Hi. I&amp;#39;m Luna I am a little shy but ver...,FFR-A-3945,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-11-04T13:21:28+0000,2022-11-04T13:21:27+0000,,Domestic Medium Hair,,False,False,Black,,,True,False,False,False,True,,,,ffrrinc@gmail.com,,,,Bismarck,ND,58507,US,58754877,cat,nd32,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...
9683,58946253,WI207,https://www.petfinder.com/cat/liberty-58946253...,Cat,Cat,Senior,Female,Medium,Short,[],Liberty,Meet Liberty. She is a 7 year old cat looking ...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-11-23T16:15:16+0000,2022-11-23T16:15:15+0000,,Domestic Short Hair,,False,False,,,,True,True,False,False,True,,,True,ocontoareahumane@gmail.com,920-835-1738,150 S. Katch Drive,,Oconto,WI,54153,US,58946253,cat,wi207,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...


In [None]:
cats_DF.columns

Index(['id', 'organization_id', 'url', 'type', 'species', 'age', 'gender',
       'size', 'coat', 'tags', 'name', 'description', 'organization_animal_id',
       'photos', 'primary_photo_cropped', 'videos', 'status',
       'status_changed_at', 'published_at', 'distance', 'breeds.primary',
       'breeds.secondary', 'breeds.mixed', 'breeds.unknown', 'colors.primary',
       'colors.secondary', 'colors.tertiary', 'attributes.spayed_neutered',
       'attributes.house_trained', 'attributes.declawed',
       'attributes.special_needs', 'attributes.shots_current',
       'environment.children', 'environment.dogs', 'environment.cats',
       'contact.email', 'contact.phone', 'contact.address.address1',
       'contact.address.address2', 'contact.address.city',
       'contact.address.state', 'contact.address.postcode',
       'contact.address.country', 'animal_id', 'animal_type',
       'organization_id.1', 'primary_photo_cropped.small',
       'primary_photo_cropped.medium', 'primary_photo

Drop animals with no pictures, since pictures are key to our 'Tinder-like' app experience.

In [None]:
cats_DF = cats_DF.dropna(subset=['primary_photo_cropped.full'])# drop rows with 0 pictures
cats_DF.shape # matches na count via sweet viz for cats

(46805, 50)

Next we separate the dataframe into features to model over and context data that can be shown to the user for any matches. The 'id' column will be our shared key between the two tables.

Of note, the 'distance' field and 'primary_photo_cropped.full' field will be useful data for future model enhancements. For the models so far, we will simply use the categorical text fields and assume a distance of 0 for all pets.

In [None]:
contextCols = ['id','organization_id','url','type','tags','name','description','organization_animal_id',
              'photos','primary_photo_cropped','videos','status','status_changed_at','published_at',
              'distance','contact.email', 'contact.phone', 'contact.address.address1',
               'contact.address.address2', 'contact.address.city','contact.address.state', 
               'contact.address.postcode','contact.address.country', 'animal_id', 'animal_type',
               'organization_id.1', 'primary_photo_cropped.small','primary_photo_cropped.medium',
               'primary_photo_cropped.large','primary_photo_cropped.full']
featureCols = ['id','age','gender','size','coat','breeds.primary', 'breeds.secondary','breeds.mixed',
              'breeds.unknown','colors.primary','colors.secondary','colors.tertiary',
              'attributes.spayed_neutered','attributes.house_trained','attributes.declawed',
              'attributes.special_needs','attributes.shots_current','environment.children',
              'environment.dogs','environment.cats','type','contact.address.postcode'] # initial columns to keep for training purposes
cats_DF_features = cats_DF[featureCols]
cats_DF_context = cats_DF[contextCols]
cats_DF_features.shape

(46805, 22)
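
Because both tables keep the 'id' column, a modeled row can be re-attached to its display data with a join on that key. The sketch below illustrates the idea on a toy frame; the column values and `example.org` URLs are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real table; all values here are illustrative assumptions.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "age": ["Baby", "Young", "Senior"],
    "name": ["Cat A", "Cat B", "Cat C"],
    "url": ["https://example.org/101", "https://example.org/102", "https://example.org/103"],
})

toy_features = toy[["id", "age"]]          # columns the model would see
toy_context = toy[["id", "name", "url"]]   # columns shown to the user

# Re-attach display data to each modeled row by joining on the shared key.
# validate="one_to_one" raises if an id is duplicated on either side.
rejoined = toy_features.merge(toy_context, on="id", how="left", validate="one_to_one")
print(rejoined)
```

The `validate` argument is optional, but it is a cheap guard against duplicated ids silently multiplying rows.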

Let's sanity check our missing values now that we just have cats and remove any columns with too many missing values.

In [None]:
valueCounts = cats_DF_features.set_index('type').isna().groupby(level=0).sum()/cats_DF_features.shape[0] # level=0 refers to our index, which we made 'type'


In [None]:
pd.set_option('display.max_columns', 500)
valueCounts 

type,id,age,gender,size,coat,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,contact.address.postcode
Cat,0.0,0.0,0.0,0.0,0.616857,0.0,0.899925,0.0,0.0,0.393548,0.746523,0.916035,0.0,0.0,0.0,0.0,0.0,0.737015,0.829954,0.588612,2.1e-05


In [None]:
valueCounts = cats_DF_context.set_index('type').isna().groupby(level=0).sum()/cats_DF_context.shape[0] # level=0 refers to our index, which we made 'type'


In [None]:
pd.set_option('display.max_columns', 500)
valueCounts 

type,id,organization_id,url,tags,name,description,organization_animal_id,photos,primary_photo_cropped,videos,status,status_changed_at,published_at,distance,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full
Cat,0.0,0.0,0.0,0.0,2.1e-05,0.262365,0.313022,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.051405,0.193804,0.371499,0.923534,0.0,0.0,2.1e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


After a quick NA check, we remove 'coat', 'breeds.secondary', 'colors.secondary', and 'colors.tertiary', each of which is missing in more than 60% of rows. The column 'colors.primary' is also missing a lot of values (about 39%), but it is kept for now because it helps distinguish one cat from another. Additionally, we keep the address postcode as an initial attempt to match nearby cats together. Lastly, 'environment.children', 'environment.dogs', and 'environment.cats' are missing in roughly 59% to 83% of rows, but users derive a lot of value from this information, so they are kept as well.

In [None]:
featureCols = ['id','age','gender','size','breeds.primary','breeds.mixed','breeds.unknown',
               'colors.primary','attributes.spayed_neutered','attributes.house_trained',
               'attributes.declawed','attributes.special_needs','attributes.shots_current',
               'contact.address.postcode','environment.children','environment.dogs','environment.cats']
cats_DF_features = cats_DF[featureCols]
cats_DF_context = cats_DF[contextCols]
cats_DF_features.shape

(46805, 17)
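
The column selection above was done by hand. A minimal sketch of the same idea in code is shown below, assuming a hypothetical cutoff of 0.5 and a hypothetical `always_keep` set for sparse columns that are still valuable to users; neither the cutoff nor these names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, only to illustrate the selection logic.
toy = pd.DataFrame({
    "age": ["Baby", "Young", "Adult", "Senior"],
    "coat": [np.nan, np.nan, np.nan, "Short"],
    "environment.cats": [True, np.nan, np.nan, np.nan],
})

missing_fraction = toy.isna().mean()   # share of NaN per column
threshold = 0.5                        # hypothetical cutoff, not from the notebook
always_keep = {"environment.cats"}     # hypothetical: sparse but valuable to users

kept = [c for c in toy.columns
        if missing_fraction[c] <= threshold or c in always_keep]
print(missing_fraction)
print(kept)
```

`DataFrame.isna().mean()` gives the same per-column missing fraction as the `groupby` expression used earlier when the table contains a single 'type'.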

In [None]:
cats_DF_features.dtypes

id                             int64
age                           object
gender                        object
size                          object
breeds.primary                object
breeds.mixed                    bool
breeds.unknown                  bool
colors.primary                object
attributes.spayed_neutered      bool
attributes.house_trained        bool
attributes.declawed             bool
attributes.special_needs        bool
attributes.shots_current        bool
contact.address.postcode      object
environment.children          object
environment.dogs              object
environment.cats              object
dtype: object

# Pre-process feature data<a id='pre-process'></a>

## Make Needed Helper Functions and Imports <a id='pp_pipeline'></a>

Make needed helper functions for modeling later on in workbook.

In [None]:
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import laplacian_kernel
from sklearn.metrics.pairwise import linear_kernel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [None]:
def remove_columns_with_1_distinct(df):
    drop_col = [e for e in df.columns if df[e].nunique()==1]
    df_return = df.drop(drop_col,axis=1)
    return df_return


In [None]:
def drop_duplicates(df):
    df_return = df.drop_duplicates()
    return df_return


In [None]:
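def linear_similarities(df_1,id_df):
    cs_simil = linear_kernel(df_1,df_1) # full pairwise matrix of dot products
    results = {}
    ds = id_df # needs id column and a 0..n-1 index aligned with the rows of df_1
    for idx, row in ds.iterrows():
        similar_indices = cs_simil[idx].argsort()[:-100:-1] # indices of the 99 highest scores, best first
        similar_items = [(cs_simil[idx][i], ds['id'][i]) for i in similar_indices]
        results[row['id']] = similar_items[1:] # skip the first entry, assumed to be the item itself
    return results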

In [None]:
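def cosine_similarities(df_1,id_df):
    cs_simil = cosine_similarity(df_1,df_1) # full pairwise cosine similarity matrix
    results = {}
    ds = id_df # needs id column and a 0..n-1 index aligned with the rows of df_1
    for idx, row in ds.iterrows():
        similar_indices = cs_simil[idx].argsort()[:-100:-1] # indices of the 99 highest scores, best first
        similar_items = [(cs_simil[idx][i], ds['id'][i]) for i in similar_indices]
        results[row['id']] = similar_items[1:] # skip the first entry, assumed to be the item itself
    return results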

In [None]:
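def laplacian_similarities(df_1,id_df):
    cs_simil = laplacian_kernel(df_1,df_1) # full pairwise Laplacian kernel matrix
    results = {}
    ds = id_df # needs id column and a 0..n-1 index aligned with the rows of df_1
    for idx, row in ds.iterrows():
        similar_indices = cs_simil[idx].argsort()[:-100:-1] # indices of the 99 highest scores, best first
        similar_items = [(cs_simil[idx][i], ds['id'][i]) for i in similar_indices]
        results[row['id']] = similar_items[1:] # skip the first entry, assumed to be the item itself
    return results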
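
The similarity helpers above build the full pairwise matrix in one call, which for roughly 46,700 cats means about 2.2 billion entries. They also drop the first ranked entry on the assumption that it is the item itself, which may not hold when several cats tie at the top score. The sketch below is one possible alternative, not the notebook's method: it computes similarities in row blocks and excludes the item by position. The function name `top_k_similar`, its parameters, and the toy data are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(features, ids, k=3, chunk_size=2):
    """Return {id: [(score, other_id), ...]} with the k most similar rows per item."""
    results = {}
    n_rows = features.shape[0]
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        # Similarity of this block of rows against every row: shape (stop - start, n_rows)
        block = cosine_similarity(features[start:stop], features)
        for offset, sims in enumerate(block):
            row = start + offset
            sims = sims.copy()
            sims[row] = -np.inf                  # exclude the item itself by position
            top = np.argsort(sims)[::-1][:k]     # indices of the k highest scores
            results[ids[row]] = [(float(sims[i]), ids[i]) for i in top]
    return results

# Toy one-hot style matrix (made-up data): 4 items, 4 binary columns.
toy = csr_matrix(np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
]))
toy_ids = [11, 12, 13, 14]
recs = top_k_similar(toy, toy_ids, k=2)
print(recs[11])
```

Only one block of the matrix is held in memory at a time, so `chunk_size` trades memory for the number of `cosine_similarity` calls.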

In [None]:
def item(id,df):  
    ds = df
    colsGrab = ['id']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # look up the row for this id in the context table and return its id

def url(id,df):  
    ds = df
    colsGrab = ['url']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # look up the row for this id and return its Petfinder url

def picture(id,df):  
    ds = df
    colsGrab = ['primary_photo_cropped.full']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # look up the row for this id and return its full-size photo url

def recommend(item_id, num,df,reccs):
    print("Recommending " + str(num) + " cats similar to " + str(item(item_id,df)) + "... " 
          + picture(item_id,df) + " - " + url(item_id,df))   
    print("-------")    
    recs = reccs[item_id][:num]   
    for rec in recs: 
        print("Recommended: " + str(item(rec[1],df)) + " (score:" +      str(rec[0]) + ") " 
              + picture(rec[1],df) + " - " + url(rec[1],df))
    
def score(reccs, num):
    # Average similarity score over the top `num` recommendations of every item
    print("Finding average reccomendation score for top " + str(num) + " reccomendations per example")
    results = []
    for key in reccs.keys():
        subRecs = reccs[key][:num]
        for r in subRecs:
            results.append(r[0])
    averageRecc = sum(results) / len(results)
    print("There are "+ str(len(results)) + 'results with a sum of' + str(sum(results)) + 'and and average of: ' 
          + str(averageRecc) )
    return averageRecc

## Preprocess Data for model runs <a id='pp'></a>

Now that the essential methods are defined, let's handle the data.

In [None]:
cats_DF_features.head(3) #sneak peek of what we have to work with initially

Unnamed: 0,id,age,gender,size,breeds.primary,breeds.mixed,breeds.unknown,colors.primary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,contact.address.postcode,environment.children,environment.dogs,environment.cats
1,58980784,Baby,Male,Medium,Tuxedo,False,False,Black & White / Tuxedo,True,True,False,False,True,37343,True,,True
13,58980778,Baby,Male,Medium,Domestic Short Hair,False,False,Black,True,True,False,False,True,92057,True,,True
14,58980506,Young,Female,Medium,Domestic Short Hair,False,False,Torbie,True,True,False,False,True,50126,,,True


Besides the id value, which is our shared key, all other fields are categorical. We can use One-Hot encoding to transform them into something more efficient to run models over. 

Some preprocessing must occur before One-Hot Encoding to ensure everything goes as planned. First, we proactively drop duplicate rows. Second, we remove any feature with only 1 distinct value, since content-based filtering relies on differences between items and a constant column adds no information. Third, we cast the postcode to a string so that it is treated as a category rather than a number. Fourth, we map boolean columns to strings so that every column holds a single type. Lastly, we replace NaNs with a special string so that missing values become their own category.

In [None]:
# Preprocess data before encoding occurs for some troublesome fields
X = cats_DF_features.copy() # work on a copy so the original features table is left untouched
X = drop_duplicates(X) # remove duplicate rows
X = remove_columns_with_1_distinct(X) # remove any features with only 1 distinct value
X["contact.address.postcode"] = X["contact.address.postcode"].astype(str) # treat postcode as a str rather than an int
# One-Hot Encoder requires all strings or all ints per column, so map bools to strings (NaNs stay NaN for now)
boolCols = ['breeds.mixed','attributes.spayed_neutered','attributes.house_trained',
            'attributes.declawed','attributes.special_needs','attributes.shots_current',
            'environment.children','environment.dogs','environment.cats']
for col in boolCols:
    X[col] = X[col].map({True: 'True', False: 'False'})
X = X.replace(np.nan,'Not Available') # replace nan's with their own special category, do this last once types all fixed!
X.dtypes

id                             int64
age                           object
gender                        object
size                          object
breeds.primary                object
breeds.mixed                  object
colors.primary                object
attributes.spayed_neutered    object
attributes.house_trained      object
attributes.declawed           object
attributes.special_needs      object
attributes.shots_current      object
contact.address.postcode      object
environment.children          object
environment.dogs              object
environment.cats              object
dtype: object

In [None]:
X_transform = X

In [None]:
X_transform.shape

(46710, 16)

In [None]:
X_transform.head(3)

Unnamed: 0,id,age,gender,size,breeds.primary,breeds.mixed,colors.primary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,contact.address.postcode,environment.children,environment.dogs,environment.cats
1,58980784,Baby,Male,Medium,Tuxedo,False,Black & White / Tuxedo,True,True,False,False,True,37343,True,Not Available,True
13,58980778,Baby,Male,Medium,Domestic Short Hair,False,Black,True,True,False,False,True,92057,True,Not Available,True
14,58980506,Young,Female,Medium,Domestic Short Hair,False,Torbie,True,True,False,False,True,50126,Not Available,Not Available,True


In [None]:
cats_DF_context.head(3)

Unnamed: 0,id,organization_id,url,type,tags,name,description,organization_animal_id,photos,primary_photo_cropped,videos,status,status_changed_at,published_at,distance,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full
1,58980784,TN589,https://www.petfinder.com/cat/zorro-58980784/t...,Cat,"['Friendly', 'Affectionate', 'Playful', 'Funny...",Zorro,Zorro is very sweet and enjoys being in your l...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-11-28T02:29:46+0000,2022-11-28T02:29:45+0000,,bulldog50@epbfi.com,,,,Hixson,TN,37343,US,58980784,cat,tn589,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...
13,58980778,CA2825,https://www.petfinder.com/cat/sammy-58980778/c...,Cat,"['Friendly', 'Playful', 'Loves kisses', 'Athle...",Sammy,"“Sammy” is a tiny black, male kitten. About 10...",,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-11-28T02:29:10+0000,2022-11-28T02:29:09+0000,,info@sunriserescue.com,,,,Oceanside,CA,92057,US,58980778,cat,ca2825,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...
14,58980506,IA16,https://www.petfinder.com/cat/girly-58980506/i...,Cat,"['Friendly', 'Gentle', 'Dignified']",Girly,Girly is a dainty lady! She enjoys getting pet...,7.0,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,"[{'embed': '<iframe title=""Video"" frameborder=...",adoptable,2022-11-28T02:28:45+0000,2022-11-28T02:28:44+0000,,greenbelthumane@hotmail.com,(641) 648-2692,319 River St.,,Iowa Falls,IA,50126,US,58980506,cat,ia16,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...


Since we are using unsupervised content-based filtering, the full data set will be used to create the model. No data splits will be used. 

The entries match! We need to pass our models the numerical data used to measure similarity between items, along with the context data that goes with it. As long as the indexes are the same, we can stitch them back together.

Since we know the indexes match, let's get rid of the id column.

In [None]:
X_transform = X_transform.reset_index(drop=True) # required so keys work properly
X_transform_woID = X_transform.drop(columns='id')
X_transform_woID.dtypes

age                           object
gender                        object
size                          object
breeds.primary                object
breeds.mixed                  object
colors.primary                object
attributes.spayed_neutered    object
attributes.house_trained      object
attributes.declawed           object
attributes.special_needs      object
attributes.shots_current      object
contact.address.postcode      object
environment.children          object
environment.dogs              object
environment.cats              object
dtype: object

Notice that the id column is gone while the row order is unchanged, so we can recover the ids later from X_transform. Now we can apply One-Hot Encoding!

In [None]:
ohe = OneHotEncoder().fit(X_transform_woID) # fit on the whole X since no data splits are used
X_train_transform = ohe.transform(X_transform_woID) # don't need to add the id column because the row order is preserved

# Run Content-Based-Filtering Modeling Iterations <a id='run_pipeline'></a>

Content-Based Filtering is a method of comparing products against each other when you don't have user rankings. This can be a simple way to create models before user ranking data is available and can often do well in recommending similar products. In our case, products are cats. Cosine Similarity will be used moving forward.
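
On one-hot encoded rows, cosine similarity has a simple reading. Every cat has exactly one 1 per feature, so each row has norm sqrt(k) for k features, and the dot product of two rows counts the features on which they agree. The score is therefore (matching features) / k. The sketch below checks this on made-up vectors, assuming the 15 encoded feature columns used in this workbook; the helper `one_hot_row` is hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

n_features = 15  # number of categorical columns fed to the encoder in this workbook

def one_hot_row(categories, n_values=3):
    """Encode one category index per feature as a concatenated one-hot vector."""
    row = np.zeros(len(categories) * n_values)
    for feature, category in enumerate(categories):
        row[feature * n_values + category] = 1.0
    return row

cat_a = [0] * n_features
cat_b = [0] * (n_features - 1) + [1]   # differs from cat_a in exactly one feature

a = one_hot_row(cat_a).reshape(1, -1)
b = one_hot_row(cat_b).reshape(1, -1)

similarity = cosine_similarity(a, b)[0, 0]
print(similarity, 14 / 15)
```

Under this reading, a score of about 0.933 corresponds to two cats that agree on 14 of the 15 features.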

## Cosine similarity results <a id='cs'></a>

In [None]:
Cosine_Model =cosine_similarities(X_train_transform,X_transform) #run similarities with cosine similarity


In [None]:
joblib.dump(Cosine_Model, '/content/drive/MyDrive/MLE10PetMatch/models/cosine_similarity_model_catsv2.pkl')

['/content/drive/MyDrive/MLE10PetMatch/models/cosine_similarity_model_catsv2.pkl']

In [None]:
Cosine_Model = joblib.load('/content/drive/MyDrive/MLE10PetMatch/models/cosine_similarity_model_catsv2.pkl')

In [None]:
pd.options.display.max_colwidth = 100
recommend(item_id=58706766, num=5,df=cats_DF_context,reccs=Cosine_Model)

['Recommending 5 cats similar to [58706766]... https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/58706766/2/?bust=1667159800 - https://www.petfinder.com/cat/pjs-58706766/tx/austin/new-hope-animal-rescue-nfp-tx2339/?referrer_id=c2f7479c-c7e8-422b-bfb4-7c0b8aed0e55']
-------
['Recommended: [58926364] (score:0.9333333333333331) https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/58926364/1/?bust=1669040782 - https://www.petfinder.com/cat/squidlet-58926364/va/alexandria/tails-high-inc-va540/?referrer_id=c2f7479c-c7e8-422b-bfb4-7c0b8aed0e55']
['Recommended: [58706847] (score:0.9333333333333331) https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/58706847/2/?bust=1667160445 - https://www.petfinder.com/cat/sleepy-spice-58706847/tx/austin/new-hope-animal-rescue-nfp-tx2339/?referrer_id=c2f7479c-c7e8-422b-bfb4-7c0b8aed0e55']
['Recommended: [58811866] (score:0.9333333333333331) https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/58811866/1/?bust=1668884795 - https://www.petfinder.com/cat/bria-58811866/az/q

The above shows the scores for one item only, so now let's get an idea of how well this does across the entire training set.

In [None]:
# Gather average score of top 5 recommendations for training set, with a max score of 1!
cosineScore = score(reccs=Cosine_Model, num=5)
cosineScore

Finding average reccomendation score for top 5 reccomendations per example
There are 233550results with a sum of215295.19999944614and and average of: 0.9218377221128072


0.9218377221128072

The overall score for Cosine Similarity, averaged over the top 5 recommendations for every cat in the training set, is 0.922.

## Overall Content Based Filtering results as of 1/18/2023 <a id='ov'></a>

In [None]:
from tabulate import tabulate
table = [['Model Name', 'Score'],
         ['Cosine Similarity',cosineScore]]
print(tabulate(table,headers='firstrow',tablefmt='fancy_grid'))

╒═══════════════════╤══════════╕
│ Model Name        │    Score │
╞═══════════════════╪══════════╡
│ Cosine Similarity │ 0.921838 │
╘═══════════════════╧══════════╛


# Conclusion and Next Steps <a id='conclusion'></a>

**Conclusion of ML Modeling as of 1/18/23:**


*   Cosine Similarity with all examples improves the score by 0.004
*   Will now work with UI and full cat set



**Conclusion of ML Modeling as of 1/2/23**: 
- All three content-based filtering models perform well
- Cosine Similarity appears to be the most sensitive to differences and has a very useful scale of 0-1.
- Can hook up content-based filtering models to the PetMatch UI as-is, and they should return good results based on the overall similarity measures seen so far.
- The User Rankings Data generated requires more formatting than initially expected, but our application tracks all the key required fields for now.
- Collaborative Filtering is harder to implement than initially expected, but we have initial data to give it a try.

**Conclusion of ML Baseline as of 12/6/22**: 
- The average score of the top 5 recommendations per cat in the training set is 10.96. The highest available score is 12.
- The result above uses a simple content-based filtering recommendation model without user preferences, since they are currently not available. Instead it compares items against each other, e.g. you liked this ketchup, so here are 10 other similar types of ketchup.
- Because of the method used to create the simple content-based filtering model, dev and test sets cannot be used, so the training set was used to get an initial idea of the results.
- The cats data version 0.5 features need more ways to delineate one cat from another, but based on visual scans and the average recommendation score, the simple cat CBF model generally excels at giving you cats similar to the one you said you wanted.
- In instances where there is more ambiguity (i.e. a chosen cat with less defined details), it will still find cats very similar to it, but sometimes it can also throw in very similar cats of a different breed. This might not be a bad thing.

**Next Steps**:

- Get more user rankings!
- Incorporate distance more effectively
- Can we use the description field for cats at all? 
- Collaborative Filtering for item and user-based
  - Use surprise library possibly
  - Add timestamp to rankings so we can be time-sensitive in terms of recommendations