This workbook implements Content-Based Filtering for a dog recommendation system, working from the raw data through model creation and initial results.

# Table of Contents

* [Load in Data and Segment Features from Context data](#segment)
* [Pre-process feature data](#pre-process)
    - [Make Needed Helper Functions and Imports](#pp_pipeline)
    - [Preprocess Data for model runs](#pp)
* [Run Content-Based-Filtering Modeling Iterations](#run_pipeline)
    - [Cosine similarity results](#cs)
    - [Overall Content Based Filtering results as of 1/18/2023](#ov)  
* [Conclusion and Next Steps](#conclusion)

# Load in Data and Segment Features from Context data<a id='segment'></a>

First, we load all of our adoptable dogs.

In [1]:
from google.colab import drive
import joblib #so I can save files out

drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
dogs_DF = pd.read_csv("/content/drive/MyDrive/MLE10PetMatch/Adoptable_dogs_20221202_withExtras.csv",header=0,index_col=0)
dogs_DF.shape

(97694, 70)

In [3]:
pd.set_option('display.max_columns', 500)
dogs_DF.sample(3)

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description_x,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped,description_y,temperament,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,group,grooming_frequency_value,grooming_frequency_category,shedding_value,shedding_category,energy_level_value,energy_level_category,trainability_value,trainability_category,demeanor_value,demeanor_category
7776,59002018,FL1720,https://www.petfinder.com/dog/chase-59002018/t...,Dog,Dog,Adult,Male,Medium,Short,"['Athletic', 'Smart', 'Playful', 'Friendly', '...",Chase,Meet Chase! He loves to play and then lounge o...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-11-30T01:57:49+0000,2022-11-30T01:57:48+0000,,Staffordshire Bull Terrier,,False,False,Black,White / Cream,,True,True,,False,True,False,False,False,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,ForgottenCoastAnimalRescue@gmail.com,(850) 527-3214,,,Houston,TX,77038,US,59002018,dog,fl1720,,"At 14 to 16 inches, Staffordshire Bull Terrier...","Clever, Brave, Tenacious",80.0,35.56,40.64,10.886217,17.23651,12.0,14.0,Terrier Group,0.4,Weekly Brushing,0.4,Occasional,0.8,Energetic,0.2,May be Stubborn,0.8,Friendly
52768,58722378,CA2932,https://www.petfinder.com/dog/damon-58722378/c...,Dog,Dog,Baby,Male,Medium,Short,"['Friendly', 'Affectionate', 'Loyal', 'Playful...",Damon,For more info on us contact the hoomans that r...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-11-01T17:25:22+0000,2022-11-01T17:25:22+0000,,Belgian Malinois,Labrador Retriever,True,False,Bicolor,Brown / Chocolate,Brindle,True,False,,False,True,True,True,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,woof@amazingdogs.org,(408) 477-0553,2741 El Camino Real,,Tustin,CA,92782,US,58722378,dog,ca2932,,,,,,,,,,,,,,,,,,,,,
64133,58570875,CA2390,https://www.petfinder.com/dog/craig-58570875/c...,Dog,Dog,Adult,Male,Medium,,[],Craig,,51326029.0,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-10-18T00:06:17+0000,2022-10-18T00:06:17+0000,,Cattle Dog,Mixed Breed,True,False,,,,False,False,,False,False,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adopt@fresnohumane.org,(559) 600-7387,1510 West Dan Ronquillo Dr,,Fresno,CA,93706,US,58570875,dog,ca2390,,,,,,,,,,,,,,,,,,,,,


In [4]:
dogs_DF.columns

Index(['id', 'organization_id', 'url', 'type', 'species', 'age', 'gender',
       'size', 'coat', 'tags', 'name', 'description_x',
       'organization_animal_id', 'photos', 'videos', 'status',
       'status_changed_at', 'published_at', 'distance', 'breeds.primary',
       'breeds.secondary', 'breeds.mixed', 'breeds.unknown', 'colors.primary',
       'colors.secondary', 'colors.tertiary', 'attributes.spayed_neutered',
       'attributes.house_trained', 'attributes.declawed',
       'attributes.special_needs', 'attributes.shots_current',
       'environment.children', 'environment.dogs', 'environment.cats',
       'primary_photo_cropped.small', 'primary_photo_cropped.medium',
       'primary_photo_cropped.large', 'primary_photo_cropped.full',
       'contact.email', 'contact.phone', 'contact.address.address1',
       'contact.address.address2', 'contact.address.city',
       'contact.address.state', 'contact.address.postcode',
       'contact.address.country', 'animal_id', 'animal_type

Drop animals with no pictures, since pictures are key to our 'Tinder-like' app experience.

In [5]:
dogs_DF = dogs_DF.dropna(subset=['primary_photo_cropped.full'])# drop rows with 0 pictures
dogs_DF.shape # matches na count via sweet viz for dogs

(97694, 70)

In [6]:
dogs_DF = dogs_DF.drop_duplicates()
dogs_DF.shape

(97459, 70)

Cosine similarity won't run on the full set of roughly 97,000 examples in this environment, so we keep as many as we can: the first 57,000 rows.

In [7]:
dogs_DF = dogs_DF[0:57000]
dogs_DF.shape

(57000, 70)
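One plausible reason for the limit is memory: a dense pairwise similarity matrix grows with the square of the number of rows. The sketch below is only a back-of-the-envelope estimate and assumes 8-byte floats; `dense_similarity_gib` is a hypothetical helper, not part of the notebook.

```python
def dense_similarity_gib(n_rows: int, bytes_per_value: int = 8) -> float:
    """Approximate size in GiB of a dense n_rows x n_rows similarity matrix."""
    return n_rows * n_rows * bytes_per_value / 2**30


# Compare the de-duplicated data set with the 57,000-row subset kept here.
for n in (97_459, 57_000):
    print(f"{n:>6} rows -> about {dense_similarity_gib(n):.1f} GiB")
```

The estimate ignores the extra working memory needed for sorting each row, so the real requirement is likely somewhat higher.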

In [9]:
# save dog set v2
dogs_DF.to_csv('/content/drive/MyDrive/MLE10PetMatch/dogsAdoptablewithExtrasv2.csv')

Next we separate the dataframe into features to model over and context data that can be shown to the user for any matches. 'id' will be our shared key between the two tables.

Of note, the 'distance' field and 'primary_photo_cropped.full' field will be useful data for future model enhancements. For the models so far, we will simply use textual data and assume a 0 distance for all pets.

In [None]:
contextCols = ['id','organization_id','url','type','tags','name','description_x','organization_animal_id',
              'photos','primary_photo_cropped','videos','status','status_changed_at','published_at',
              'distance','contact.email', 'contact.phone', 'contact.address.address1',
               'contact.address.address2', 'contact.address.city','contact.address.state', 
               'contact.address.postcode','contact.address.country', 'animal_id', 'animal_type',
               'organization_id.1', 'primary_photo_cropped.small','primary_photo_cropped.medium',
               'primary_photo_cropped.large','primary_photo_cropped.full','description_y',
               'temperament','popularity','min_height','max_height','min_weight','max_weight',
               'min_expectancy','max_expectancy','grooming_frequency_value','shedding_value',
               'energy_level_value','trainability_value','demeanor_value']
featureCols = ['id','age','gender','size','coat','breeds.primary', 'breeds.secondary','breeds.mixed',
              'breeds.unknown','colors.primary','colors.secondary','colors.tertiary',
              'attributes.spayed_neutered','attributes.house_trained','attributes.declawed',
              'attributes.special_needs','attributes.shots_current','environment.children',
              'environment.dogs','environment.cats','type','contact.address.postcode',
              'group','grooming_frequency_category','shedding_category','energy_level_category',
               'trainability_category','demeanor_category'] # initial columns to keep for training purposes
dogs_DF_features = dogs_DF[featureCols]
dogs_DF_context = dogs_DF[contextCols]
dogs_DF_features.shape

(57000, 28)
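To illustrate how the shared 'id' key lets the two tables be stitched back together, here is a minimal sketch on a toy frame. The data and the `toy_*` names are made up for illustration; only the pattern mirrors the split above.

```python
import pandas as pd

# Toy stand-in for dogs_DF; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "age": ["Adult", "Baby", "Senior"],
    "size": ["Large", "Small", "Medium"],
    "name": ["Roxy", "Blackie", "Marlie"],
    "url": ["u1", "u2", "u3"],
})

toy_features = toy[["id", "age", "size"]]  # what the model sees
toy_context = toy[["id", "name", "url"]]   # what the user sees

# Re-attach context to any subset of the features through the shared key.
matches = toy_features[toy_features["size"] == "Large"]
shown = matches.merge(toy_context, on="id", how="left", validate="one_to_one")
print(shown)
```

The `validate="one_to_one"` argument makes the merge fail loudly if an id is duplicated in either table, which is a cheap safeguard when the key is assumed to be unique.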

Let's sanity check our missing values now that we just have dogs and remove any columns with too many missing values.

In [None]:
valueCounts = dogs_DF_features.set_index('type').isna().groupby(level=0).sum()/dogs_DF_features.shape[0] # level=0 refers to our index, which we made 'type'


In [None]:
pd.set_option('display.max_columns', 500)
valueCounts 

type,id,age,gender,size,coat,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,contact.address.postcode,group,grooming_frequency_category,shedding_category,energy_level_category,trainability_category,demeanor_category
Dog,0.0,0.0,0.0,0.0,0.640579,0.0,0.627035,0.0,0.0,0.452982,0.629614,0.848456,0.0,0.0,1.0,0.0,0.0,0.68014,0.565544,0.810789,5.3e-05,0.510246,0.517649,0.517982,0.510263,0.511018,0.511018


In [None]:
valueCounts = dogs_DF_context.set_index('type').isna().groupby(level=0).sum()/dogs_DF_context.shape[0] # level=0 refers to our index, which we made 'type'


In [None]:
pd.set_option('display.max_columns', 500)
valueCounts 

type,id,organization_id,url,tags,name,description_x,organization_animal_id,photos,primary_photo_cropped,videos,status,status_changed_at,published_at,distance,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,description_y,temperament,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,grooming_frequency_value,shedding_value,energy_level_value,trainability_value,demeanor_value
Dog,0.0,0.0,0.0,0.0,0.0,0.261333,0.334526,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.070333,0.238965,0.423719,0.940474,0.0,0.0,5.3e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.510246,0.510246,0.54714,0.510246,0.510246,0.513088,0.513088,0.512035,0.512035,0.517649,0.517982,0.510263,0.511018,0.511018


After a quick NA check, we remove 'coat', 'breeds.secondary', 'colors.secondary', 'colors.tertiary', and 'attributes.declawed' (which doesn't make sense for dogs and is missing for every row). 'breeds.unknown' and 'type' are also left out of the new feature list. The column 'colors.primary' is missing many values too (about 45%), but it is kept for now because it helps differentiate one dog from another. We keep the address postcode as an initial attempt to match nearby dogs together. 'environment.children', 'environment.dogs', and 'environment.cats' also have many missing values, but users derive a lot of value from this information, so they stay. Lastly, the AKC fields are missing for roughly half of the rows, but only because the dataset has many mixed breeds and AKC only handles purebreds. Their value for purebreds is significant, so they are kept as well.

In [None]:
featureCols = ['id','age','gender','size','breeds.primary','breeds.mixed',
               'colors.primary','attributes.spayed_neutered','attributes.house_trained',
               'attributes.special_needs','attributes.shots_current',
               'contact.address.postcode','environment.children','environment.dogs','environment.cats',
               'group','grooming_frequency_category','shedding_category',
               'energy_level_category', 'trainability_category', 'demeanor_category']

dogs_DF_features = dogs_DF[featureCols]
dogs_DF_context = dogs_DF[contextCols]
dogs_DF_features.shape

(57000, 21)
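The columns above were chosen by hand. As a rough sketch, the same kind of rule (drop columns with too many missing values, with explicit exceptions) could be written as a function. The 0.6 threshold, the function name `select_feature_columns`, and the toy data are all assumptions made for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def select_feature_columns(df, max_missing=0.6, always_keep=(), always_drop=()):
    """Return (columns to keep, missing fraction per column)."""
    missing = df.isna().mean()  # fraction of missing values per column
    keep = []
    for col in df.columns:
        if col in always_drop:
            continue
        if missing[col] <= max_missing or col in always_keep:
            keep.append(col)
    return keep, missing


# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "age": ["Adult", "Baby", "Adult", "Senior"],
    "coat": [np.nan, np.nan, np.nan, "Short"],                # 75% missing
    "environment.cats": [np.nan, np.nan, np.nan, True],       # 75% missing, kept by exception
    "attributes.declawed": [np.nan, np.nan, np.nan, np.nan],  # always dropped
})
keep, missing = select_feature_columns(
    toy,
    max_missing=0.6,
    always_keep=("environment.cats",),
    always_drop=("attributes.declawed",),
)
print(keep)
```

Keeping the exceptions as explicit arguments documents which columns were retained for product reasons rather than data-quality reasons.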

In [None]:
dogs_DF_features.dtypes

id                              int64
age                            object
gender                         object
size                           object
breeds.primary                 object
breeds.mixed                     bool
colors.primary                 object
attributes.spayed_neutered       bool
attributes.house_trained         bool
attributes.special_needs         bool
attributes.shots_current         bool
contact.address.postcode       object
environment.children           object
environment.dogs               object
environment.cats               object
group                          object
grooming_frequency_category    object
shedding_category              object
energy_level_category          object
trainability_category          object
demeanor_category              object
dtype: object

# Pre-process feature data<a id='pre-process'></a>

## Make Needed Helper Functions and Imports <a id='pp_pipeline'></a>

Make needed helper functions for modeling later on in workbook.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel 
from sklearn.metrics.pairwise import laplacian_kernel
import numpy as np
from sklearn.decomposition import TruncatedSVD

In [None]:
def remove_columns_with_1_distinct(df):
    drop_col = [e for e in df.columns if df[e].nunique()==1]
    df_return = df.drop(drop_col,axis=1)
    return df_return


In [None]:
def drop_duplicates(df):
    df_return = df.drop_duplicates()
    return df_return


In [None]:
def linear_similarities(df_1,id_df):
    cs_simil = linear_kernel(df_1,df_1)
    results = {}
    ds = id_df # needs id column
    for idx, row in ds.iterrows():
       similar_indices = cs_simil[idx].argsort()[:-100:-1] 
       similar_items = [(cs_simil[idx][i], ds['id'][i]) for i in similar_indices] 
       results[row['id']] = similar_items[1:]
    return results

In [None]:
def cosine_similarities(df_1,id_df):
    cs_simil = cosine_similarity(df_1,df_1)
    results = {}
    ds = id_df # needs id column
    for idx, row in ds.iterrows():
       similar_indices = cs_simil[idx].argsort()[:-100:-1] 
       similar_items = [(cs_simil[idx][i], ds['id'][i]) for i in similar_indices] 
       results[row['id']] = similar_items[1:]
    return results

In [None]:
def laplacian_similarities(df_1,id_df):
    cs_simil = laplacian_kernel(df_1,df_1)
    results = {}
    ds = id_df # needs id column
    for idx, row in ds.iterrows():
       similar_indices = cs_simil[idx].argsort()[:-100:-1] 
       similar_items = [(cs_simil[idx][i], ds['id'][i]) for i in similar_indices] 
       results[row['id']] = similar_items[1:]
    return results
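The three helpers above build the full pairwise matrix before picking neighbours. As a hedged alternative, the sketch below computes cosine similarity one block of rows at a time, so only a `chunk_size x n` slice is held in memory at once. It uses plain NumPy on a dense array and assumes the encoded features fit in memory in dense form; `top_k_cosine_chunked` and its parameters are hypothetical names, not part of the notebook.

```python
import numpy as np


def top_k_cosine_chunked(features, ids, k=5, chunk_size=1000):
    """Return {id: [(score, neighbour_id), ...]} with the k most similar items."""
    features = np.asarray(features, dtype=np.float64)
    id_list = list(ids)
    n = len(id_list)
    k = min(k, n - 1)
    if k < 1:
        return {i: [] for i in id_list}
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = features / norms
    results = {}
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        sims = unit[start:stop] @ unit.T           # shape: (chunk, n)
        rows = np.arange(stop - start)
        sims[rows, rows + start] = -np.inf         # never recommend an item to itself
        top = np.argpartition(-sims, k - 1, axis=1)[:, :k]  # unordered top k
        top_scores = np.take_along_axis(sims, top, axis=1)
        order = np.argsort(-top_scores, axis=1)    # sort only the k candidates
        top = np.take_along_axis(top, order, axis=1)
        top_scores = np.take_along_axis(top_scores, order, axis=1)
        for r in rows:
            results[id_list[start + r]] = [
                (float(s), id_list[j]) for s, j in zip(top_scores[r], top[r])
            ]
    return results


# Toy one-hot rows; ids are invented for illustration.
toy = np.array([[1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1]])
demo = top_k_cosine_chunked(toy, [10, 20, 30, 40], k=2, chunk_size=3)
print(demo[10])
```

One design difference: the helpers above drop the first ranked item on the assumption that it is the query itself, which may not hold when two dogs have identical features. This sketch masks the diagonal explicitly instead.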

In [None]:
def item(id,df):  
    ds = df
    colsGrab = ['id']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # look up the row for this id and return its 'id' value

def url(id,df):  
    ds = df
    colsGrab = ['url']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # look up the row for this id and return its listing URL

def picture(id,df):  
    ds = df
    colsGrab = ['primary_photo_cropped.full']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # look up the row for this id and return its full-size photo URL

def recommend(item_id, num,df,reccs):
    print("Recommending " + str(num) + " dogs similar to " + str(item(item_id,df)) + "... " 
          + picture(item_id,df) + " - " + url(item_id,df))   
    print("-------")    
    recs = reccs[item_id][:num]   
    for rec in recs: 
        print("Recommended: " + str(item(rec[1],df)) + " (score:" +      str(rec[0]) + ") " 
              + picture(rec[1],df) + " - " + url(rec[1],df))
    
def score(reccs, num):
    print("Finding average reccomendation score for top 5 reccomendations per example")
    results = []
    for key in reccs.keys():
        subRecs = reccs[key][:num]
        for r in subRecs:
            results.append(r[0])
    averageRecc = sum(results) / len(results)
    print("There are "+ str(len(results)) + 'results with a sum of' + str(sum(results)) + 'and and average of: ' 
          + str(averageRecc) )
    return averageRecc
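To make the metric concrete: `score` averages the similarity values of the first `num` recommendations for every item, so it measures how similar the recommended items are, not whether users liked them. The sketch below reproduces that calculation on a toy dictionary; `mean_top_n` and the toy values are hypothetical and only mirror the shape of the results built above.

```python
def mean_top_n(reccs, n):
    """Average the similarity scores of the first n recommendations per item."""
    scores = [s for recs in reccs.values() for s, _ in recs[:n]]
    return sum(scores) / len(scores) if scores else float("nan")


# Toy results in the same {id: [(score, neighbour_id), ...]} shape.
toy_reccs = {
    1: [(0.9, 2), (0.8, 3), (0.1, 4)],
    2: [(0.9, 1), (0.5, 3)],
}
toy_average = mean_top_n(toy_reccs, 2)
print(toy_average)
```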

## Preprocess Data for model runs <a id='pp'></a>

Now that the essential helper functions are defined, let's handle the data.

In [None]:
dogs_DF_features.head(3) # sneak peek of what we have to work with initially

Unnamed: 0,id,age,gender,size,breeds.primary,breeds.mixed,colors.primary,attributes.spayed_neutered,attributes.house_trained,attributes.special_needs,attributes.shots_current,contact.address.postcode,environment.children,environment.dogs,environment.cats,group,grooming_frequency_category,shedding_category,energy_level_category,trainability_category,demeanor_category
0,59027590,Adult,Female,Large,Golden Retriever,True,,True,True,False,True,7442,True,True,False,Sporting Group,Weekly Brushing,Seasonal,Needs Lots of Activity,Eager to Please,Friendly
1,59027588,Adult,Male,Small,Dandie Dinmont Terrier,True,Black,True,True,False,True,75093,,True,,Terrier Group,Daily Brushing,Infrequent,Regular Exercise,Independent,Reserved with Strangers
2,59027587,Baby,Female,Large,Great Pyrenees,False,White / Cream,True,False,False,True,36541,True,True,True,Working Group,Weekly Brushing,Seasonal,Needs Lots of Activity,Independent,Reserved with Strangers


Besides the id value, which is our shared key, all other fields are categorical. We can use One-Hot Encoding to transform them into a numeric form that similarity models can work with.

Some preprocessing must happen before One-Hot Encoding to ensure everything goes as planned. First, we proactively drop duplicate rows. Second, we remove any features with only 1 distinct value, since content-based filtering relies on differences between items and a feature that is identical for every item adds no information. Third, we convert the postcode to a string and map the boolean fields to strings so that each column holds a single type. Lastly, we replace NaNs with a special string so that missing values form their own category.

In [None]:
# Preprocess data before encoding occurs for some troublesome fields
X = dogs_DF_features
X = drop_duplicates(X) # remove duplicate rows
X = remove_columns_with_1_distinct(X) # remove any features with only 1 distinct value
X["contact.address.postcode"]= X["contact.address.postcode"].astype(str) # fix postcode to be a str rather than an int
# One-Hot Encoder requires each column to be all strings or all ints, so bools are converted to strings
X['breeds.mixed'] = X['breeds.mixed'].map({True: 'True', False: 'False'}) 
X['attributes.spayed_neutered'] = X['attributes.spayed_neutered'].map({True: 'True', False: 'False'}) 
X['attributes.house_trained'] = X['attributes.house_trained'].map({True: 'True', False: 'False'}) 
X['attributes.special_needs'] = X['attributes.special_needs'].map({True: 'True', False: 'False'}) 
X['attributes.shots_current'] = X['attributes.shots_current'].map({True: 'True', False: 'False'}) 
X['environment.children'] = X['environment.children'].map({True: 'True', False: 'False'}) 
X['environment.dogs'] = X['environment.dogs'].map({True: 'True', False: 'False'}) 
X['environment.cats'] = X['environment.cats'].map({True: 'True', False: 'False'}) 

X = X.replace(np.nan,'Not Available') # replace nan's with their own special category, do this last once types all fixed!
X.dtypes

id                              int64
age                            object
gender                         object
size                           object
breeds.primary                 object
breeds.mixed                   object
colors.primary                 object
attributes.spayed_neutered     object
attributes.house_trained       object
attributes.special_needs       object
attributes.shots_current       object
contact.address.postcode       object
environment.children           object
environment.dogs               object
environment.cats               object
group                          object
grooming_frequency_category    object
shedding_category              object
energy_level_category          object
trainability_category          object
demeanor_category              object
dtype: object
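Two details of the postcode handling may be worth a second look. In the sample rows, a New Jersey postcode appears as `7442`, which suggests a leading zero was lost when the CSV was read, and in many pandas versions `astype(str)` turns a missing value into the literal string `'nan'` before the later NaN replacement runs. The sketch below shows one possible way to handle both; it assumes purely numeric postcodes are five-digit US ZIP codes, and `normalize_postcode` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def normalize_postcode(series, width=5, missing_label="Not Available"):
    """Return postcodes as strings, zero-padding numeric ones and labelling NaNs."""
    def fix(value):
        if pd.isna(value):
            return missing_label
        text = str(value).strip()
        if text.endswith(".0"):  # float artefact such as 7442.0
            text = text[:-2]
        # Assumption: all-digit codes are US ZIP codes and should have `width` digits.
        return text.zfill(width) if text.isdigit() else text
    return series.map(fix)


# Toy values covering the cases above; they are invented for illustration.
toy_postcodes = pd.Series([7442, "75093", np.nan, "V5K 0A1"])
cleaned = normalize_postcode(toy_postcodes)
print(cleaned.tolist())
```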

In [None]:
X_transform = X

In [None]:
X_transform.shape

(57000, 21)

In [None]:
X_transform.head(3)

Unnamed: 0,id,age,gender,size,breeds.primary,breeds.mixed,colors.primary,attributes.spayed_neutered,attributes.house_trained,attributes.special_needs,attributes.shots_current,contact.address.postcode,environment.children,environment.dogs,environment.cats,group,grooming_frequency_category,shedding_category,energy_level_category,trainability_category,demeanor_category
0,59027590,Adult,Female,Large,Golden Retriever,True,Not Available,True,True,False,True,7442,True,True,False,Sporting Group,Weekly Brushing,Seasonal,Needs Lots of Activity,Eager to Please,Friendly
1,59027588,Adult,Male,Small,Dandie Dinmont Terrier,True,Black,True,True,False,True,75093,Not Available,True,Not Available,Terrier Group,Daily Brushing,Infrequent,Regular Exercise,Independent,Reserved with Strangers
2,59027587,Baby,Female,Large,Great Pyrenees,False,White / Cream,True,False,False,True,36541,True,True,True,Working Group,Weekly Brushing,Seasonal,Needs Lots of Activity,Independent,Reserved with Strangers


In [None]:
dogs_DF_context.head(3)

Unnamed: 0,id,organization_id,url,type,tags,name,description_x,organization_animal_id,photos,primary_photo_cropped,videos,status,status_changed_at,published_at,distance,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,description_y,temperament,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,grooming_frequency_value,shedding_value,energy_level_value,trainability_value,demeanor_value
0,59027590,NJ519,https://www.petfinder.com/dog/roxy-59027590/nj...,Dog,"['Friendly', 'Affectionate', 'Loyal', 'Gentle'...",Roxy,Roxy is 4 yrs old and weighs 60Lbs. She is spa...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-12-02T05:57:18+0000,2022-12-02T05:57:17+0000,,Doggiedogrescue@aol.com,,,,Pompton Lakes,NJ,7442,US,59027590,dog,nj519,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,"The Golden Retriever is a sturdy, muscular dog...","Friendly, Intelligent, Devoted",3,54.61,60.96,24.94758,34.019428,10.0,12.0,0.4,0.6,1.0,1.0,0.8
1,59027588,TX1203,https://www.petfinder.com/dog/blackie-59027588...,Dog,"['Friendly', 'Gentle']",Blackie,,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-12-02T05:55:53+0000,2022-12-02T05:55:52+0000,,tzrinquiries@tzuzoorescue.com,,,,Plano,TX,75093,US,59027588,dog,tx1203,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,Physical hallmarks of the Dandie Dinmont Terri...,"Independent, Smart, Proud",176,20.32,27.94,8.164663,10.886217,12.0,15.0,0.8,0.2,0.6,0.4,0.4
2,59027587,AL387,https://www.petfinder.com/dog/marlie-59027587/...,Dog,[],Marlie,NEWBIR ALERT!!\n\nThis sweet pup is Marlie! Sh...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-12-02T05:54:25+0000,2022-12-02T05:54:24+0000,,wagsandwhiskersrescuepets@yahoo.com,,,,Grand Bay,AL,36541,US,59027587,dog,al387,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,"Frequently described as “majestic,” Pyrs are b...","Smart, Patient, Calm",66,63.5,81.28,38.555351,45.359237,10.0,12.0,0.4,0.6,1.0,0.4,0.4


Since we are using unsupervised content-based filtering, the full data set is used to create the model. No train/dev/test splits are used.

The entries match: the feature table and the context table show the same ids in the same order. We need to pass our models the encoded feature data to measure similarity between dogs, and keep the context data that goes along with it. As long as the indexes are the same, we can stitch the two back together.

Since we know the indexes match, let's drop the id column from the feature table.

In [None]:
X_transform = X_transform.reset_index(drop=True) # required so keys work properly
X_transform_woID = X_transform.drop(columns='id')
X_transform_woID.dtypes

age                            object
gender                         object
size                           object
breeds.primary                 object
breeds.mixed                   object
colors.primary                 object
attributes.spayed_neutered     object
attributes.house_trained       object
attributes.special_needs       object
attributes.shots_current       object
contact.address.postcode       object
environment.children           object
environment.dogs               object
environment.cats               object
group                          object
grooming_frequency_category    object
shedding_category              object
energy_level_category          object
trainability_category          object
demeanor_category              object
dtype: object
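The claim that the two tables line up row for row was checked by eye on the first few rows. A small explicit check is sketched below; `ids_aligned` is a hypothetical helper and the toy frames are invented for illustration.

```python
import pandas as pd


def ids_aligned(features, context, key="id"):
    """True when both frames hold the same key values in the same row order."""
    left = features[key].reset_index(drop=True)
    right = context[key].reset_index(drop=True)
    return left.equals(right)


toy_features = pd.DataFrame({"id": [1, 2, 3], "age": ["Adult", "Baby", "Adult"]})
toy_context = pd.DataFrame({"id": [1, 2, 3], "name": ["Roxy", "Blackie", "Marlie"]})
shuffled_context = toy_context.iloc[::-1]  # same ids, reversed order

print(ids_aligned(toy_features, toy_context))       # aligned
print(ids_aligned(toy_features, shuffled_context))  # not aligned
```

Running such a check right after de-duplication would catch the case where rows were removed from one table but not the other.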

Notice that indexes are the same and id columns are gone, so we can recover the IDs later! Now we can apply One-Hot Encoding!

In [None]:
ohe = OneHotEncoder().fit(X_transform_woID) # fit the One-Hot Encoder on the whole feature set
X_train_transform = ohe.transform(X_transform_woID) # row order is preserved, so ids can be re-attached by position later

# Run Content-Based-Filtering Modeling Iterations <a id='run_pipeline'></a>

Content-Based Filtering is a method of comparing products against each other when you don't have user rankings. This can be a simple way to create models before user ranking data is available and can often do well in recommending similar products. In our case, products are dogs. Cosine Similarity will be used moving forward.
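On one-hot data, cosine similarity has a simple reading. Each dog has exactly one active category per feature, so for m features the dot product counts the features two dogs share and both vectors have length sqrt(m); the similarity is therefore (shared features) / m. With the 20 encoded features used here, that would make every score a multiple of 0.05. The sketch below checks the idea on toy vectors with three features; the vectors and the `cosine` helper are made up for illustration.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Three one-hot features: age (Adult/Baby), size (Large/Small), gender (Female/Male).
dog_a = np.array([1, 0, 1, 0, 1, 0])
dog_b = np.array([1, 0, 1, 0, 0, 1])  # differs from dog_a only in gender
dog_c = np.array([0, 1, 0, 1, 0, 1])  # differs from dog_a in every feature

print(cosine(dog_a, dog_b))  # two of three features shared
print(cosine(dog_a, dog_c))  # no features shared
```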

## Cosine similarity results <a id='cs'></a>

In [None]:
Cosine_Model =cosine_similarities(X_train_transform,X_transform) #run similarities with cosine similarity


In [None]:
joblib.dump(Cosine_Model, '/content/drive/MyDrive/MLE10PetMatch/models/cosine_similarity_model_dogsv2.pkl')

['/content/drive/MyDrive/MLE10PetMatch/models/cosine_similarity_model_dogsv2.pkl']

In [None]:
Cosine_Model = joblib.load('/content/drive/MyDrive/MLE10PetMatch/models/cosine_similarity_model_dogsv2.pkl')

In [None]:
pd.options.display.max_colwidth = 100
recommend(item_id=58925506, num=5,df=dogs_DF_context,reccs=Cosine_Model)

['Recommending 5 dogs similar to [58925506]... https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/58925506/2/?bust=1669021069 - https://www.petfinder.com/dog/denali-eskimo-58925506/ca/manhattan-beach/caring-songs-rescue-ca2668/?referrer_id=c2f7479c-c7e8-422b-bfb4-7c0b8aed0e55']
-------
['Recommended: [58978435] (score:0.9000000000000002) https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/58978435/1/?bust=1669582687 - https://www.petfinder.com/dog/aaahh-the-taste-of-pepsi-cola-huskimo-58978435/ca/manhattan-beach/caring-songs-rescue-ca2668/?referrer_id=c2f7479c-c7e8-422b-bfb4-7c0b8aed0e55']
['Recommended: [58981654] (score:0.8500000000000002) https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/58981654/4/?bust=1669615450 - https://www.petfinder.com/dog/pepsi-or-cola-pepsi-58981654/ca/manhattan-beach/caring-songs-rescue-ca2668/?referrer_id=c2f7479c-c7e8-422b-bfb4-7c0b8aed0e55']
['Recommended: [58919722] (score:0.8500000000000002) https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/58919722/1/?bust=

The above shows the scores for a single dog only, so now let's see how well the model does across the entire training set.

In [None]:
# Gather average score of top 5 recommendations for training set, with a max score of 1!
cosineScore = score(reccs=Cosine_Model, num=5)
cosineScore

Finding average reccomendation score for top 5 reccomendations per example
There are 285000results with a sum of261553.95000020857and and average of: 0.9177331578954687


0.9177331578954687

The average top-5 score across the whole training set for Cosine Similarity is 0.918.

## Overall Content Based Filtering results as of 1/18/2023 <a id='ov'></a>

In [None]:
from tabulate import tabulate
table = [['Model Name', 'Score'],
         ['Cosine Similarity',cosineScore]]
print(tabulate(table,headers='firstrow',tablefmt='fancy_grid'))

╒═══════════════════╤══════════╕
│ Model Name        │    Score │
╞═══════════════════╪══════════╡
│ Cosine Similarity │ 0.917733 │
╘═══════════════════╧══════════╛


# Conclusion and Next Steps <a id='conclusion'></a>

**Conclusion of ML Modeling as of 1/18/23:**


*   Cosine Similarity with all examples improves the performance by 0.004
*   Will now work with UI and only slightly expanded dog set

**Conclusion of ML Modeling as of 1/2/23**: 
- All three content-based filtering models perform well
- Cosine Similarity appears to be the most sensitive to differences and has a very useful scale of 0-1.
- Can hook up content-based filtering models to PetMatch UI as-is and it should return good results based on overall similarity measures measured so far.
- The user rankings data generated requires more formatting than initially expected, but our application tracks all the key required fields for now.
- Collaborative Filtering is harder to implement than initially expected, but we have initial data to give it a try.

**Conclusion of ML Baseline as of 12/6/22**: 
- The average top-5 recommendation score per cat in the training set is 10.96. The highest available score is 12.  
- The result above uses a simple content-based filtering recommendation model without user preferences, since they are currently not available. Instead it compares items against each other, aka you liked this ketchup so here are 10 other similar types of ketchup. 
- Due to the method used to create the simple content-based filtering model, dev and test sets cannot be used, so the training set was used to get an initial idea of the results. 
- The cats data version 0.5 features need more ways to delineate one cat from another, but based on visual scans and the average recommendation score, the simple cat CBF model generally excels at giving you cats similar to what you stated you wanted.
- In instances where there is more ambiguity (aka a chosen cat with less defined details), it will still find cats very similar to it, but sometimes it can also throw in very similar cats of a different breed. This might not be a bad thing.

**Next Steps**:

- Get more user rankings!
- Incorporate distance more effectively
- Use cosine similarity with all dogs!
- Collaborative Filtering, both item-based and user-based
  - Possibly use the Surprise library
  - Add a timestamp to rankings so we can be time-sensitive in our recommendations