This workbook implements content-based filtering for a dog recommendation system, working from the raw data through model creation and initial results.

# Table of Contents

* [Load in Data and Segment Features from Context data](#segment)
* [Pre-process feature data](#pre-process)
    - [Recheck sweetviz for distinct values and types](#sweet)
    - [Make Pipeline](#pp_pipeline)
* [Run Modeling Pipeline](#run_pipeline)
* [Conclusion and Next Steps](#conclusion)

# Load in Data and Segment Features from Context data<a id='segment'></a>

First, we load all of our adoptable dogs.

In [1]:
import pandas as pd
dogs_DF = pd.read_csv("../data/raw/version0_5/Adoptable_dogs_20221202.csv",header=0,index_col=0)
dogs_DF.shape

(100000, 50)

In [2]:
pd.set_option('display.max_columns', 500)
dogs_DF.sample(3)

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
71770,58091608,CA595,https://www.petfinder.com/dog/tomas-hertl-5809...,Dog,Dog,Young,Male,Large,,[],TOMAS HERTL,11/11/22 18:58 Reason animal needs rescue: dec...,A1272710,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-10-12T20:34:42+0000,2022-10-12T20:34:40+0000,,German Shepherd Dog,,False,False,,,,False,False,,False,False,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,AdoptAPetSJ@sanjoseca.gov,(408) 794-7297,2750 Monterey Road,,San Jose,CA,95111,US,58091608,dog,ca595,
9733,58998702,CA44,https://www.petfinder.com/dog/misty-58998702/c...,Dog,Dog,Adult,Female,Small,Medium,[],Misty,"Misty 4 yr old female, Pomeranian mix. Very sw...",,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-11-29T21:51:33+0000,2022-11-29T21:51:32+0000,,Pomeranian,Chihuahua,True,False,Black,"Tricolor (Brown, Black, & White)",Merle (Blue),True,False,,False,True,,True,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adoptions@petorphans.org,(818) 901-0190,7720 Gloria Avenue,,Van Nuys,CA,91406,US,58998702,dog,ca44,
62238,58632792,TN61,https://www.petfinder.com/dog/terry-58632792/t...,Dog,Dog,Young,Male,Medium,,[],Terry,"Hi there, I&amp;#39;m Terry! You wouldn&amp;#3...",50364300,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-10-23T03:03:14+0000,2022-10-23T03:03:14+0000,,Pit Bull Terrier,Mixed Breed,True,False,,,,True,False,,False,False,False,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adoptions@young-williams.org,(865) 215-6670,3201 Division St.,,Knoxville,TN,37919,US,58632792,dog,tn61,


In [3]:
dogs_DF.columns

Index(['id', 'organization_id', 'url', 'type', 'species', 'age', 'gender',
       'size', 'coat', 'tags', 'name', 'description', 'organization_animal_id',
       'photos', 'videos', 'status', 'status_changed_at', 'published_at',
       'distance', 'breeds.primary', 'breeds.secondary', 'breeds.mixed',
       'breeds.unknown', 'colors.primary', 'colors.secondary',
       'colors.tertiary', 'attributes.spayed_neutered',
       'attributes.house_trained', 'attributes.declawed',
       'attributes.special_needs', 'attributes.shots_current',
       'environment.children', 'environment.dogs', 'environment.cats',
       'primary_photo_cropped.small', 'primary_photo_cropped.medium',
       'primary_photo_cropped.large', 'primary_photo_cropped.full',
       'contact.email', 'contact.phone', 'contact.address.address1',
       'contact.address.address2', 'contact.address.city',
       'contact.address.state', 'contact.address.postcode',
       'contact.address.country', 'animal_id', 'animal_type',
       'organization_id.1', 'primary_photo_cropped'],
      dtype='object')

Next, we drop animals with no pictures, since photos are key to our 'tinder-like' app experience.

In [2]:
dogs_DF = dogs_DF.dropna(subset=['primary_photo_cropped.full']) # drop rows with no full-size primary photo
dogs_DF.shape # compare with the NA count reported by sweetviz

(97694, 50)
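
As a small illustration of the photo filter, the sketch below applies the same `dropna(subset=...)` call to a made-up three-row frame. The toy data and the names `toy_dogs`, `with_photo` and `n_dropped` are assumptions for illustration only; just the column name comes from the real data.

```python
import pandas as pd

# Hypothetical toy listings; only the column name matches the real data.
toy_dogs = pd.DataFrame({
    "id": [1, 2, 3],
    "primary_photo_cropped.full": [
        "https://example.org/a.jpg",
        None,  # this listing has no photo
        "https://example.org/c.jpg",
    ],
})

# Keep only rows that have a full-size primary photo.
with_photo = toy_dogs.dropna(subset=["primary_photo_cropped.full"])

# Count how many listings the filter removed.
n_dropped = len(toy_dogs) - len(with_photo)
print(with_photo["id"].tolist(), n_dropped)
```

In the run above, the same filter removes 100,000 - 97,694 = 2,306 listings.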

Next we separate the dataframe into features to model over and context data that can be shown to the user for any matches. The 'id' column will be our shared key between the two tables.

Of note, the 'distance' field and 'primary_photo_cropped.full' field will be useful data for future model enhancements. For the model baseline, we will simply use textual data and assume a 0 distance for all pets.

In [5]:
contextCols = ['id','organization_id','url','type','tags','name','description','organization_animal_id',
              'photos','primary_photo_cropped','videos','status','status_changed_at','published_at',
              'distance','contact.email', 'contact.phone', 'contact.address.address1',
               'contact.address.address2', 'contact.address.city','contact.address.state', 
               'contact.address.postcode','contact.address.country', 'animal_id', 'animal_type',
               'organization_id.1', 'primary_photo_cropped.small','primary_photo_cropped.medium',
               'primary_photo_cropped.large','primary_photo_cropped.full']
featureCols = ['id','age','gender','size','coat','breeds.primary', 'breeds.secondary','breeds.mixed',
              'breeds.unknown','colors.primary','colors.secondary','colors.tertiary',
              'attributes.spayed_neutered','attributes.house_trained',
              'attributes.special_needs','attributes.shots_current','environment.children',
              'environment.dogs','environment.cats','type','contact.address.postcode']
dogs_DF_features = dogs_DF[featureCols]
dogs_DF_context = dogs_DF[contextCols]
dogs_DF_features.shape

(97694, 21)
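
To show how the shared key is meant to work, here is a minimal sketch on invented data: the model only ever sees the feature table, and the ids it returns are used to look up display fields in the context table. The frame `toy`, the list `matched_ids` and the example URLs are hypothetical.

```python
import pandas as pd

# Hypothetical two-dog catalog with a mix of feature and context columns.
toy = pd.DataFrame({
    "id": [10, 11],
    "age": ["Young", "Adult"],
    "name": ["Terry", "Misty"],
    "url": ["https://example.org/terry", "https://example.org/misty"],
})

# Same idea as featureCols / contextCols: both tables keep 'id'.
toy_features = toy[["id", "age"]]
toy_context = toy[["id", "name", "url"]]

# Pretend the model recommended the dog with id 11.
matched_ids = [11]

# Use the shared key to fetch what the user should see.
shown = toy_context[toy_context["id"].isin(matched_ids)]
print(shown[["name", "url"]].to_dict("records"))
```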

Let's sanity check our missing values now that we just have dogs and remove any columns with too many missing values.

In [6]:
valueCounts = dogs_DF_features.set_index('type').isna().groupby(level=0).sum()/dogs_DF_features.shape[0] # level=0 refers to our index, which we made 'type'


In [7]:
pd.set_option('display.max_columns', 500)
valueCounts 

type,id,age,gender,size,coat,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,contact.address.postcode
Dog,0.0,0.0,0.0,0.0,0.661535,0.0,0.632639,0.0,0.0,0.456026,0.660153,0.885756,0.0,0.0,0.0,0.0,0.668342,0.546216,0.802168,0.000194


In [8]:
valueCounts = dogs_DF_context.set_index('type').isna().groupby(level=0).sum()/dogs_DF_context.shape[0] # level=0 refers to our index, which we made 'type'


In [9]:
pd.set_option('display.max_columns', 500)
valueCounts 

type,id,organization_id,url,tags,name,description,organization_animal_id,photos,primary_photo_cropped,videos,status,status_changed_at,published_at,distance,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full
Dog,0.0,0.0,0.0,0.0,0.0,0.229277,0.344289,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.062931,0.245726,0.439014,0.942883,0.000154,0.000154,0.000194,0.000154,0.0,0.0,0.0,0.0,0.0,0.0,0.0


After this NA check, we remove 'coat', 'breeds.secondary', 'colors.secondary', 'colors.tertiary', 'environment.children', 'environment.dogs' and 'environment.cats', each of which is missing in more than half of the rows. The column 'colors.primary' is also missing many values (about 46%), but it is kept for now because it helps distinguish one dog from another. We also keep the address postcode as an initial attempt to match nearby dogs together.
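
The column selection can also be expressed as a threshold on the missing fraction. The sketch below does this on a toy frame; the cutoff `MAX_MISSING = 0.5` is an assumption for illustration and is not a value stated in this notebook, and the toy data is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per column.
toy = pd.DataFrame({
    "size": ["Small", "Large", "Medium", "Large"],           # 0% missing
    "coat": [np.nan, np.nan, "Short", np.nan],               # 75% missing
    "colors.primary": ["Black", np.nan, "White", "Brown"],   # 25% missing
})

MAX_MISSING = 0.5  # assumed cutoff, chosen only for this example

# isna().mean() gives the fraction of missing values per column.
missing_fraction = toy.isna().mean()

# Columns above the cutoff are dropped, the rest are kept.
to_drop = missing_fraction[missing_fraction > MAX_MISSING].index.tolist()
kept = toy.drop(columns=to_drop)
print(to_drop, list(kept.columns))
```

With the fractions in the feature table above, a 0.5 cutoff would select the same seven columns that are removed here, though the choice in this notebook was made by eye.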

In [10]:
featureCols = ['id','age','gender','size','breeds.primary','breeds.mixed','breeds.unknown',
               'colors.primary','attributes.spayed_neutered','attributes.house_trained',
               'attributes.special_needs','attributes.shots_current','contact.address.postcode']
dogs_DF_features = dogs_DF[featureCols].copy() # copy so the breed fixes below do not write to a view of dogs_DF
dogs_DF_context = dogs_DF[contextCols]
dogs_DF_features.shape

(97694, 13)

In [11]:
dogs_DF_features.dtypes

id                             int64
age                           object
gender                        object
size                          object
breeds.primary                object
breeds.mixed                    bool
breeds.unknown                  bool
colors.primary                object
attributes.spayed_neutered      bool
attributes.house_trained        bool
attributes.special_needs        bool
attributes.shots_current        bool
contact.address.postcode      object
dtype: object

Now we add in AKC data where it exists! First we rename a few Petfinder breeds to match the AKC names, and then we join the tables.

In [12]:
dogs_DF_features.loc[dogs_DF_features['breeds.primary'] == 'Black Labrador Retriever'].head(3)

Unnamed: 0,id,age,gender,size,breeds.primary,breeds.mixed,breeds.unknown,colors.primary,attributes.spayed_neutered,attributes.house_trained,attributes.special_needs,attributes.shots_current,contact.address.postcode
315,59026983,Young,Female,Medium,Black Labrador Retriever,True,False,Black,True,True,False,True,32533
317,59026974,Baby,Female,Medium,Black Labrador Retriever,True,False,,False,False,False,True,17055
318,59026976,Baby,Male,Medium,Black Labrador Retriever,True,False,,False,False,False,True,17055


In [13]:
#replace mistyped breeds that don't match AKC using EDA workbook as a reference!
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Black Labrador Retriever", "breeds.primary"] = 'Labrador Retriever'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Yellow Labrador Retriever", "breeds.primary"] = 'Labrador Retriever'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Poodle", "breeds.primary"] = 'Poodle (Standard)'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Wirehaired Dachshund", "breeds.primary"] = 'Dachshund'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "White German Shepherd", "breeds.primary"] = 'German Shepherd Dog'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Standard Poodle", "breeds.primary"] = 'Poodle (Standard)'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "West Highland White Terrier / Westie", "breeds.primary"] = 'West Highland White Terrier'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Eskimo Dog", "breeds.primary"] = 'American Eskimo Dog'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Newfoundland Dog", "breeds.primary"] = 'Newfoundland'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Mountain Dog", "breeds.primary"] = 'Bernese Mountain Dog'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Miniature Poodle", "breeds.primary"] = 'Poodle (Miniature)'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Miniature Dachshund", "breeds.primary"] = 'Dachshund'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Jack Russell Terrier", "breeds.primary"] = 'Parson Russell Terrier'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Husky", "breeds.primary"] = 'Siberian Husky'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Foxhound", "breeds.primary"] = 'American Foxhound'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Fox Terrier", "breeds.primary"] = 'Wire Fox Terrier'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "English Bulldog", "breeds.primary"] = 'Bulldog'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Corgi", "breeds.primary"] = 'Pembroke Welsh Corgi'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Chocolate Labrador Retriever", "breeds.primary"] = 'Labrador Retriever'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Belgian Shepherd / Tervuren", "breeds.primary"] = 'Belgian Tervuren'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Belgian Shepherd / Sheepdog", "breeds.primary"] = 'Belgian Sheepdog'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Belgian Shepherd / Malinois", "breeds.primary"] = 'Belgian Malinois'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Belgian Shepherd / Laekenois", "breeds.primary"] = 'Belgian Laekenois'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Anatolian Shepherd", "breeds.primary"] = 'Anatolian Shepherd Dog'
dogs_DF_features.loc[dogs_DF_features["breeds.primary"] == "Australian Cattle Dog / Blue Heeler", "breeds.primary"] = 'Australian Cattle Dog'
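
The same kind of renaming can be written more compactly with a lookup dictionary and `Series.replace`. The sketch below uses three of the renames from the cell above on a toy frame; `breed_fixes` and `toy` are hypothetical names, and this is an alternative formulation rather than the code this notebook runs.

```python
import pandas as pd

# A few of the Petfinder -> AKC renames, expressed as a lookup table.
breed_fixes = {
    "Black Labrador Retriever": "Labrador Retriever",
    "Yellow Labrador Retriever": "Labrador Retriever",
    "Husky": "Siberian Husky",
}

# Hypothetical toy data; "Pug" has no entry in the lookup table.
toy = pd.DataFrame({"breeds.primary": ["Husky", "Black Labrador Retriever", "Pug"]})

# Series.replace with a dict swaps exact matches and leaves other values untouched.
toy["breeds.primary"] = toy["breeds.primary"].replace(breed_fixes)
print(toy["breeds.primary"].tolist())
```

Keeping the mapping in one dictionary also makes it easy to reuse the same fixes on a second dataframe.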


Now that the dog breeds are fixed in the Petfinder frame, let's pull in the AKC data!

In [3]:
akc = pd.read_csv("../data/external/akc-data-2020-05-18.csv",header=0)
akc = akc.rename(columns={akc.columns[0]: 'breeds.primary'}) # first column holds the breed name; rename it to match the join key
akc.head(3)

Unnamed: 0,breeds.primary,description,temperament,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,...,grooming_frequency_value,grooming_frequency_category,shedding_value,shedding_category,energy_level_value,energy_level_category,trainability_value,trainability_category,demeanor_value,demeanor_category
0,Affenpinscher,The Affen’s apish look has been described many...,"Confident, Famously Funny, Fearless",148,22.86,29.21,3.175147,4.535924,12.0,15.0,...,0.6,2-3 Times a Week Brushing,0.6,Seasonal,0.6,Regular Exercise,0.8,Easy Training,1.0,Outgoing
1,Afghan Hound,"The Afghan Hound is an ancient breed, his whol...","Dignified, Profoundly Loyal, Aristocratic",113,63.5,68.58,22.679619,27.215542,12.0,15.0,...,0.8,Daily Brushing,0.2,Infrequent,0.8,Energetic,0.2,May be Stubborn,0.2,Aloof/Wary
2,Airedale Terrier,The Airedale Terrier is the largest of all ter...,"Friendly, Clever, Courageous",60,58.42,58.42,22.679619,31.751466,11.0,14.0,...,0.6,2-3 Times a Week Brushing,0.4,Occasional,0.6,Regular Exercise,1.0,Eager to Please,0.8,Friendly


In [15]:
akcFeatures = ['breeds.primary','group','grooming_frequency_category','shedding_category','energy_level_category','trainability_category','demeanor_category']
akctoAdd = akc[akcFeatures]
akctoAdd.head(3)


Unnamed: 0,breeds.primary,group,grooming_frequency_category,shedding_category,energy_level_category,trainability_category,demeanor_category
0,Affenpinscher,Toy Group,2-3 Times a Week Brushing,Seasonal,Regular Exercise,Easy Training,Outgoing
1,Afghan Hound,Hound Group,Daily Brushing,Infrequent,Energetic,May be Stubborn,Aloof/Wary
2,Airedale Terrier,Terrier Group,2-3 Times a Week Brushing,Occasional,Regular Exercise,Eager to Please,Friendly


In the future, we will want to pull in these descriptions, but for now, let's keep it simple. Next we join the frames.

In [16]:
dogs_DF_featuresJoint = pd.merge(dogs_DF_features, akctoAdd, on="breeds.primary",how="left")
dogs_DF_featuresJoint.shape

(97694, 19)
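
Because this is a left join, dogs whose breed has no AKC entry keep their row and get missing values in the AKC columns. A quick way to see which breeds failed to match is the `indicator` option of `pd.merge`, sketched below on invented toy tables (`toy_dogs`, `toy_akc` and `unmatched` are hypothetical names).

```python
import pandas as pd

# Hypothetical toy tables; "Pit Bull Terrier" has no row on the AKC side.
toy_dogs = pd.DataFrame({
    "id": [1, 2, 3],
    "breeds.primary": ["Affenpinscher", "Pit Bull Terrier", "Afghan Hound"],
})
toy_akc = pd.DataFrame({
    "breeds.primary": ["Affenpinscher", "Afghan Hound"],
    "group": ["Toy Group", "Hound Group"],
})

# indicator=True adds a '_merge' column that says where each row came from.
joined = pd.merge(toy_dogs, toy_akc, on="breeds.primary", how="left", indicator=True)

# 'left_only' rows are dogs with no matching AKC breed.
unmatched = joined.loc[joined["_merge"] == "left_only", "breeds.primary"].tolist()
print(len(joined), unmatched)
```

The row count stays the same as the left table, which is consistent with the shape staying at 97,694 rows above.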

In [6]:
#run to generate streamlit file otherwise comment out
dogs_DF_Joint = pd.merge(dogs_DF, akc, on="breeds.primary",how="left")
dogs_DF_Joint.shape

(97694, 70)

In [7]:
#run to generate streamlit file otherwise comment out
dogs_DF_Joint.columns

Index(['id', 'organization_id', 'url', 'type', 'species', 'age', 'gender',
       'size', 'coat', 'tags', 'name', 'description_x',
       'organization_animal_id', 'photos', 'videos', 'status',
       'status_changed_at', 'published_at', 'distance', 'breeds.primary',
       'breeds.secondary', 'breeds.mixed', 'breeds.unknown', 'colors.primary',
       'colors.secondary', 'colors.tertiary', 'attributes.spayed_neutered',
       'attributes.house_trained', 'attributes.declawed',
       'attributes.special_needs', 'attributes.shots_current',
       'environment.children', 'environment.dogs', 'environment.cats',
       'primary_photo_cropped.small', 'primary_photo_cropped.medium',
       'primary_photo_cropped.large', 'primary_photo_cropped.full',
       'contact.email', 'contact.phone', 'contact.address.address1',
       'contact.address.address2', 'contact.address.city',
       'contact.address.state', 'contact.address.postcode',
       'contact.address.country', 'animal_id', 'animal_type

In [8]:
#run to generate streamlit file otherwise comment out
#replace mistyped breeds that don't match AKC using EDA workbook as a reference!
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Black Labrador Retriever", "breeds.primary"] = 'Labrador Retriever'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Yellow Labrador Retriever", "breeds.primary"] = 'Labrador Retriever'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Poodle", "breeds.primary"] = 'Poodle (Standard)'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Wirehaired Dachshund", "breeds.primary"] = 'Dachshund'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "White German Shepherd", "breeds.primary"] = 'German Shepherd Dog'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Standard Poodle", "breeds.primary"] = 'Poodle (Standard)'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "West Highland White Terrier / Westie", "breeds.primary"] = 'West Highland White Terrier'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Eskimo Dog", "breeds.primary"] = 'American Eskimo Dog'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Newfoundland Dog", "breeds.primary"] = 'Newfoundland'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Mountain Dog", "breeds.primary"] = 'Bernese Mountain Dog'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Miniature Poodle", "breeds.primary"] = 'Poodle (Miniature)'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Miniature Dachshund", "breeds.primary"] = 'Dachshund'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Jack Russell Terrier", "breeds.primary"] = 'Parson Russell Terrier'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Husky", "breeds.primary"] = 'Siberian Husky'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Foxhound", "breeds.primary"] = 'American Foxhound'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Fox Terrier", "breeds.primary"] = 'Wire Fox Terrier'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "English Bulldog", "breeds.primary"] = 'Bulldog'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Corgi", "breeds.primary"] = 'Pembroke Welsh Corgi'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Chocolate Labrador Retriever", "breeds.primary"] = 'Labrador Retriever'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Belgian Shepherd / Tervuren", "breeds.primary"] = 'Belgian Tervuren'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Belgian Shepherd / Sheepdog", "breeds.primary"] = 'Belgian Sheepdog'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Belgian Shepherd / Malinois", "breeds.primary"] = 'Belgian Malinois'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Belgian Shepherd / Laekenois", "breeds.primary"] = 'Belgian Laekenois'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Anatolian Shepherd", "breeds.primary"] = 'Anatolian Shepherd Dog'
dogs_DF_Joint.loc[dogs_DF_Joint["breeds.primary"] == "Australian Cattle Dog / Blue Heeler", "breeds.primary"] = 'Australian Cattle Dog'


In [10]:
#run to generate streamlit file otherwise comment out
dogs_DF_Joint.sample(3)["breeds.primary"]

63592        Pit Bull Terrier
95316    Pembroke Welsh Corgi
32798         Black Mouth Cur
Name: breeds.primary, dtype: object

In [11]:
#run to generate streamlit file otherwise comment out
# Name of the CSV file
csvFileName = "../data/raw/Adoptable_dogs_20221202_withExtras.csv"

# Write contents of the DataFrame to a CSV file
dogs_DF_Joint.to_csv(csvFileName);

In [17]:
dogs_DF_featuresJoint.sample(3)

Unnamed: 0,id,age,gender,size,breeds.primary,breeds.mixed,breeds.unknown,colors.primary,attributes.spayed_neutered,attributes.house_trained,attributes.special_needs,attributes.shots_current,contact.address.postcode,group,grooming_frequency_category,shedding_category,energy_level_category,trainability_category,demeanor_category
66011,58508911,Adult,Female,Large,German Shepherd Dog,False,False,"Tricolor (Brown, Black, & White)",True,True,False,True,47001,Herding Group,Weekly Brushing,Regularly,Regular Exercise,Eager to Please,Alert/Responsive
48707,58763272,Young,Male,Large,German Shepherd Dog,False,False,,True,False,False,False,90242,Herding Group,Weekly Brushing,Regularly,Regular Exercise,Eager to Please,Alert/Responsive
17346,58852934,Adult,Male,Extra Large,Pit Bull Terrier,True,False,White / Cream,True,False,False,True,7063,,,,,,


In [18]:
dogs_DF_featuresJoint.columns

Index(['id', 'age', 'gender', 'size', 'breeds.primary', 'breeds.mixed',
       'breeds.unknown', 'colors.primary', 'attributes.spayed_neutered',
       'attributes.house_trained', 'attributes.special_needs',
       'attributes.shots_current', 'contact.address.postcode', 'group',
       'grooming_frequency_category', 'shedding_category',
       'energy_level_category', 'trainability_category', 'demeanor_category'],
      dtype='object')

# Pre-process feature data<a id='pre-process'></a>

## Recheck sweetviz for distinct values and types<a id='sweet'></a>

First, let's re-examine our dataframe for distinct values.

In [19]:
dogs_DF_featuresJoint.head(3)

Unnamed: 0,id,age,gender,size,breeds.primary,breeds.mixed,breeds.unknown,colors.primary,attributes.spayed_neutered,attributes.house_trained,attributes.special_needs,attributes.shots_current,contact.address.postcode,group,grooming_frequency_category,shedding_category,energy_level_category,trainability_category,demeanor_category
0,59027590,Adult,Female,Large,Golden Retriever,True,False,,True,True,False,True,7442,Sporting Group,Weekly Brushing,Seasonal,Needs Lots of Activity,Eager to Please,Friendly
1,59027588,Adult,Male,Small,Dandie Dinmont Terrier,True,False,Black,True,True,False,True,75093,Terrier Group,Daily Brushing,Infrequent,Regular Exercise,Independent,Reserved with Strangers
2,59027587,Baby,Female,Large,Great Pyrenees,False,False,White / Cream,True,False,False,True,36541,Working Group,Weekly Brushing,Seasonal,Needs Lots of Activity,Independent,Reserved with Strangers


In [20]:
# make a special version without postcode so sweetviz can handle it, since postcode has both numbers and letters
featureCols = ['id','age','gender','size','breeds.primary','breeds.mixed','breeds.unknown',
               'colors.primary','attributes.spayed_neutered','attributes.house_trained',
               'attributes.special_needs','attributes.shots_current','group',
               'grooming_frequency_category','shedding_category','energy_level_category',
               'trainability_category','demeanor_category']
contextCols = ['id','organization_id','url','type','tags','name','description','organization_animal_id',
              'photos','primary_photo_cropped','videos','status','status_changed_at','published_at',
              'distance','contact.email', 'contact.phone', 'contact.address.address1',
               'contact.address.address2', 'contact.address.city','contact.address.state', 
               'contact.address.postcode','contact.address.country', 'animal_id', 'animal_type',
               'organization_id.1', 'primary_photo_cropped.small','primary_photo_cropped.medium',
               'primary_photo_cropped.large','primary_photo_cropped.full']
dogs_DF_featuresJoint_test = dogs_DF_featuresJoint[featureCols]
dogs_DF_contextJoint_test = dogs_DF[contextCols]
dogs_DF_featuresJoint_test.shape

(97694, 18)

In [21]:
import sweetviz as sv

dog_data_report = sv.analyze(dogs_DF_featuresJoint_test)
dog_data_report.show_html() #save to html document


Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


## Make Pipeline <a id='pp_pipeline'></a>

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder

In [23]:
def remove_columns_with_1_distinct(df):
    drop_col = [e for e in df.columns if df[e].nunique()==1]
    df_return = df.drop(drop_col,axis=1)
    return df_return
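
A small check of what this helper does, on a made-up frame: a column with a single distinct value carries no information for similarity and is removed. The toy data is an assumption; the function body mirrors the one above. Note that `nunique()` ignores missing values by default, so a column with one real value plus NaNs also counts as having one distinct value.

```python
import pandas as pd

def remove_columns_with_1_distinct(df):
    # Collect columns that only ever take one (non-missing) value.
    drop_col = [c for c in df.columns if df[c].nunique() == 1]
    return df.drop(drop_col, axis=1)

# Hypothetical toy frame: 'breeds.unknown' is constant, 'age' is not.
toy = pd.DataFrame({
    "age": ["Baby", "Adult", "Adult"],
    "breeds.unknown": [False, False, False],
})

cleaned = remove_columns_with_1_distinct(toy)
print(list(cleaned.columns))
```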


In [24]:
def drop_duplicates(df):
    df_return = df.drop_duplicates()
    return df_return


In [25]:
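from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel 

def cosine_similarities(df_1,df_2):
    # linear_kernel returns raw dot products; on one-hot rows this counts shared category values
    cs_simil = linear_kernel(df_1,df_1)
    results = {}
    ds = df_2 # needs an 'id' column and a 0..n-1 index aligned with the rows of df_1
    for idx, row in ds.iterrows():
        similar_indices = cs_simil[idx].argsort()[:-100:-1] # indices of the 99 highest scores, best first
        similar_items = [(cs_simil[idx][i], ds['id'][i]) for i in similar_indices]
        results[row['id']] = similar_items[1:] # skip the first entry, which is expected to be the dog itself
    return results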

#cosineSimilarity = FunctionTransformer(cosine_similarities)
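
To make the relationship between the dot product used above and cosine similarity concrete, here is a toy comparison. It assumes every row is one-hot encoded from the same number of categorical features, so each row contains the same number of ones; under that assumption the dot product is cosine similarity multiplied by that constant, and both give the same ranking. The matrix `onehot` is invented.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three toy items, two categorical features one-hot encoded into 4 columns.
# Each row has exactly two 1s (one per original feature).
onehot = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
])
n_features = 2

dots = linear_kernel(onehot, onehot)        # number of shared feature values
cosines = cosine_similarity(onehot, onehot) # same values divided by n_features

# Neighbours of item 0, best first, excluding item 0 itself.
order = [int(i) for i in dots[0].argsort()[::-1] if i != 0]
print(np.allclose(dots, cosines * n_features), order)
```

If rows could have different numbers of active features, the two measures would no longer rank items identically, and `cosine_similarity` would be the safer choice.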

In [26]:
def item(id,df):  
    ds = df
    colsGrab = ['id']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # id of the first row matching this id

def url(id,df):  
    ds = df
    colsGrab = ['url']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # listing url of the first row matching this id

def picture(id,df):  
    ds = df
    colsGrab = ['primary_photo_cropped.full']
    return ds.loc[ds['id'] == id][colsGrab].values[0] # full-size photo url of the first row matching this id

def recommend(item_id, num,df,reccs):
    print("Recommending " + str(num) + " dogs similar to " + str(item(item_id,df)) + "... " 
          + picture(item_id,df) + " - " + url(item_id,df))   
    print("-------")    
    recs = reccs[item_id][:num]   
    for rec in recs: 
        print("Recommended: " + str(item(rec[1],df)) + " (score:" +      str(rec[0]) + ") " 
              + picture(rec[1],df) + " - " + url(rec[1],df))
    
def score(reccs, num):
    print("Finding average recommendation score for top " + str(num) + " recommendations per example")
    results = []
    for key in reccs.keys():
        subRecs = reccs[key][:num]
        for r in subRecs:
            results.append(r[0])
    averageRecc = sum(results) / len(results)
    print("There are " + str(len(results)) + " results with a sum of " + str(sum(results)) + " and an average of: "
          + str(averageRecc))
    return averageRecc

In [27]:
categorical_features = ['age', 'gender', 'size', 'breeds.primary', 'breeds.mixed',
       'colors.primary', 'attributes.spayed_neutered',
       'attributes.house_trained', 'attributes.special_needs',
       'attributes.shots_current', 'contact.address.postcode', 'group',
       'grooming_frequency_category', 'shedding_category',
       'energy_level_category', 'trainability_category', 'demeanor_category']

categorical_transformer = OneHotEncoder()

In [28]:
# Not used currently but kept for future when distance is more properly implemented
numerical_features = ['id']
numeric_transformer = lambda x:x #change nothing

In [29]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers = [
        #("num", numeric_transformer,numerical_features), 
        ("cat", categorical_transformer, categorical_features),
    ])
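
As a sketch of what this preprocessor produces, the example below runs a `ColumnTransformer` with a single one-hot step on a toy frame. The toy columns and the name `toy_preprocessor` are assumptions; the structure mirrors the cell above. Columns not listed in any transformer are dropped by default, which is why `id` does not appear in the output.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: 'id' is a key, the other two columns are categorical.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "age": ["Baby", "Adult", "Adult"],
    "size": ["Small", "Large", "Small"],
})

# Same shape as the notebook's preprocessor: one categorical step only.
toy_preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(), ["age", "size"])]
)  # remainder defaults to "drop", so 'id' is left out

encoded = toy_preprocessor.fit_transform(toy)

# 2 distinct ages + 2 distinct sizes -> 4 indicator columns.
print(encoded.shape)
```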

In [30]:
from sklearn.base import BaseEstimator

class ContentBasedRecommendor(BaseEstimator):
    def __init__(self):
        pass # constructor not needed for anything yet

    def fit(self,X,y=None):
        #print(X.shape)
        #self.X = X
        #self.y = y
        return cosine_similarities(X,y) 
    
    #def transform(self):
        #pass

    def predict(self,X,num,context_df,reccs):
        item_id = X['id'].values[0]
        return recommend(item_id, num,context_df,reccs) # pass the id, not the whole frame
    
    #def score(self: ContentBasedRecommendor, item_id,num,df_context,reccs):
        

In [31]:
model = Pipeline(
    steps=[("preprocessor", preprocessor),
           ("model", ContentBasedRecommendor())
          ])
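
One thing to keep in mind with this setup is that `Pipeline.fit` returns the pipeline itself, so anything returned by the final step's `fit` is not passed back to the caller. A possible alternative, sketched below under that consideration, is to store the recommendations on the estimator and return `self`. The class `ToyRecommender`, its `top_n` parameter and its `predict(item_id)` signature are hypothetical and are not the design used in this notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.metrics.pairwise import linear_kernel

class ToyRecommender(BaseEstimator):
    """Hypothetical variant that keeps its results on the instance."""

    def __init__(self, top_n=2):
        self.top_n = top_n

    def fit(self, X, y=None):
        # y is assumed to hold one catalog id per row of X.
        ids = list(y)
        sims = linear_kernel(X, X)
        self.reccs_ = {}
        for row, item_id in enumerate(ids):
            # Sort neighbours best first and leave out the item itself.
            order = [i for i in sims[row].argsort()[::-1] if i != row]
            self.reccs_[item_id] = [(sims[row][i], ids[i]) for i in order[: self.top_n]]
        return self  # scikit-learn convention: fit returns the estimator

    def predict(self, item_id):
        # Look up the stored (score, id) pairs for one catalog item.
        return self.reccs_[item_id]

# Toy one-hot rows and made-up ids.
X_toy = np.array([[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1]])
ids_toy = [101, 102, 103]

model_toy = ToyRecommender(top_n=1).fit(X_toy, ids_toy)
best = model_toy.predict(101)
print(best[0][1])
```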

# Run Modeling Pipeline <a id='run_pipeline'></a>

In [32]:
import numpy as np
#target = 'todo' # will be rankings once we have them
#X, y = dogs_DF_featuresJoint.drop(columns=target), dogs_DF_featuresJoint[target]
X = dogs_DF_featuresJoint
X = drop_duplicates(X)
X = remove_columns_with_1_distinct(X)
X = X.replace(np.nan,'Not Available')
X["contact.address.postcode"]= X["contact.address.postcode"].astype(str)
X.dtypes

id                              int64
age                            object
gender                         object
size                           object
breeds.primary                 object
breeds.mixed                     bool
colors.primary                 object
attributes.spayed_neutered       bool
attributes.house_trained         bool
attributes.special_needs         bool
attributes.shots_current         bool
contact.address.postcode       object
group                          object
grooming_frequency_category    object
shedding_category              object
energy_level_category          object
trainability_category          object
demeanor_category              object
dtype: object

In [33]:
from sklearn.model_selection import train_test_split
import numpy as np
# split data
x, x_test = train_test_split(X,test_size=0.4,train_size=0.6, random_state=13)
x_train, x_dev = train_test_split(x,test_size = 0.4,train_size =0.6, random_state=13)

# Given the way the model works so far, x_dev and x_test are not used.
# A dog that is not in the catalog cannot be scored, so for now only x_train is used for initial model results.
# Once we get user rankings, we can move to a model that can use these additional sets.

In [34]:
x_train = x_train.reset_index(drop=True) # index reset required so model fitting can match keys
x_train.shape

(35085, 18)

In [35]:
x_dev.shape

(23390, 18)

In [36]:
x_test.shape

(38984, 18)

In [51]:
x_train.sample(3)
x_train.loc[x_train['breeds.primary'] == 'Maltese'].head(3)

Unnamed: 0,id,age,gender,size,breeds.primary,breeds.mixed,colors.primary,attributes.spayed_neutered,attributes.house_trained,attributes.special_needs,attributes.shots_current,contact.address.postcode,group,grooming_frequency_category,shedding_category,energy_level_category,trainability_category,demeanor_category
1142,58823201,Baby,Female,Small,Maltese,True,White / Cream,True,False,False,True,92691,Toy Group,Daily Brushing,Infrequent,Regular Exercise,Agreeable,Outgoing
1402,58848937,Senior,Male,Small,Maltese,True,Not Available,True,True,False,False,92832,Toy Group,Daily Brushing,Infrequent,Regular Exercise,Agreeable,Outgoing
1929,58536245,Adult,Female,Small,Maltese,True,Not Available,True,False,False,False,75711,Toy Group,Daily Brushing,Infrequent,Regular Exercise,Agreeable,Outgoing


In [38]:

categorical_features_test = ['age', 'gender', 'size', 'breeds.primary', 'breeds.mixed',
       'colors.primary', 'attributes.spayed_neutered',
       'attributes.house_trained', 'attributes.special_needs',
       'attributes.shots_current', 'contact.address.postcode', 'group',
       'grooming_frequency_category', 'shedding_category',
       'energy_level_category', 'trainability_category', 'demeanor_category']
#x_train = x_train.replace(np.nan,'Not Available')
x_train = x_train.reset_index(drop=True) # required so keys work properly
xtrain_dog = x_train[categorical_features_test]
#xtrain_dog = xtrain_cat.replace(np.nan,'Not Available') 
ohe = OneHotEncoder().fit(xtrain_dog) # One Hot Encoding WAAAY better
x_train_test = ohe.transform(xtrain_dog) # don't need to add id columns because same columns preserved
#type(x_train_test)
#x_train_test.shape
test =cosine_similarities(x_train_test,x_train)


Will need to use Parquet files in Databricks for dogs, since there are too many records for local runs.
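
To give a sense of scale: a dense float64 similarity matrix for the 35,085 training rows would take 35085² × 8 bytes, roughly 9.8 GB. One way to stay within local memory, sketched here under the assumption that the score is a dot product of one-hot rows, is to process the catalog in row blocks and keep only the top matches per dog. `top_k_in_chunks` is a hypothetical helper that expects a dense array; it is not the `cosine_similarities` function used in this workbook.

```python
import numpy as np

def top_k_in_chunks(onehot, k=5, chunk_size=1000):
    """Return (indices, scores) of the k best matches per row, excluding the row itself."""
    onehot = np.asarray(onehot, dtype=float)
    n = onehot.shape[0]
    top_idx = np.empty((n, k), dtype=int)
    top_score = np.empty((n, k))
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # similarity of this block of rows against the whole catalog
        block = onehot[start:stop] @ onehot.T
        # mask each row's match with itself
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        idx = np.argsort(-block, axis=1, kind="stable")[:, :k]
        top_idx[start:stop] = idx
        top_score[start:stop] = np.take_along_axis(block, idx, axis=1)
    return top_idx, top_score

# Toy usage on random 0/1 data (made up for illustration)
rng = np.random.default_rng(0)
toy = (rng.random((7, 6)) > 0.5).astype(float)
idx, sc = top_k_in_chunks(toy, k=2, chunk_size=3)
```

Only one block of shape `(chunk_size, n)` is held in memory at a time, so peak memory scales with `chunk_size` rather than with the full catalog squared.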

In [39]:
xtrain_dog.shape #max score of 17

(35085, 17)
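
The maximum of 17 follows from the encoding: each dog has exactly one active column per feature, so the dot product of two one-hot rows counts the features they share. The sketch below shows this on made-up data. It assumes the score is that match count; the printed scores such as 16.0 are consistent with this reading, but `cosine_similarities` is defined elsewhere in the workbook.

```python
import numpy as np

# Toy catalog: 3 dogs described by 3 categorical features (made-up values)
dogs = [
    ("Baby",   "Female", "Maltese"),
    ("Senior", "Male",   "Maltese"),
    ("Baby",   "Female", "Boxer"),
]

# One-hot vocabulary: one column per (feature position, value) pair
columns = sorted({(j, v) for row in dogs for j, v in enumerate(row)})
col_index = {c: k for k, c in enumerate(columns)}

onehot = np.zeros((len(dogs), len(columns)))
for i, row in enumerate(dogs):
    for j, v in enumerate(row):
        onehot[i, col_index[(j, v)]] = 1.0

# Dot products between rows count the features two dogs share
match_counts = onehot @ onehot.T

# Every row has exactly n_features ones, so dividing the match count
# by n_features gives the cosine similarity
n_features = len(dogs[0])
cosine = match_counts / n_features
```

Here the first and third dogs share two of three features, so their match count is 2 and their cosine similarity is 2/3.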

In [40]:
x_train_test.shape

(35085, 5228)

In [41]:
type(x_train_test)
x_train_test.shape
x_train_test.todense()[1]

matrix([[1., 0., 0., ..., 1., 0., 0.]])

In [52]:
pd.options.display.max_colwidth = 100
recommend(item_id=58823201, num=5,df=dogs_DF_context,reccs=test)

['Recommending 5 dogs similar to [58823201]... https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/58823201/1/?bust=1669640592 - https://www.petfinder.com/dog/tinsley-58823201/ca/mission-viejo/leashes-of-love-rescue-inc-ca2366/?referrer_id=c2f7479c-c7e8-422b-bfb4-7c0b8aed0e55']
-------
['Recommended: [58878576] (score:16.0) https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/58878576/1/?bust=1668637211 - https://www.petfinder.com/dog/apricot-58878576/ca/clovis/paw-squad-559-ca2604/?referrer_id=c2f7479c-c7e8-422b-bfb4-7c0b8aed0e55']
['Recommended: [59007650] (score:15.0) https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/59007650/1/?bust=1669827519 - https://www.petfinder.com/dog/rags-59007650/pa/breinigsville/pa-caring-hearts-pa1104/?referrer_id=c2f7479c-c7e8-422b-bfb4-7c0b8aed0e55']
['Recommended: [58878499] (score:15.0) https://dl5zpyw5k3jeb.cloudfront.net/photos/pets/58878499/2/?bust=1668636959 - https://www.petfinder.com/dog/huckleberry-58878499/ca/clovis/paw-squad-559-ca2604/?referrer_id=

In [43]:
# Gather the average score of the top 5 recommendations for the training set (max score is 17)
score(reccs=test, num=5)

Finding average reccomendation score for top 5 reccomendations per example
There are 175425results with a sum of2717862.0and and average of: 15.493014108593416


15.493014108593416
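
The figures in this output can be cross-checked by hand. The short calculation below uses only numbers printed in this workbook (35,085 training rows, top 5 per row, 17 encoded features) and expresses the average as a share of the maximum score. The variable names are illustrative.

```python
n_rows = 35085         # rows in x_train
k = 5                  # recommendations scored per row
n_features = 17        # encoded features, i.e. the maximum score
total_score = 2717862.0  # sum reported by score()

n_results = n_rows * k                  # number of scored recommendations
average = total_score / n_results       # average score per recommendation
share_of_max = average / n_features     # average as a fraction of the maximum
print(n_results, average, round(share_of_max, 3))
```

The average works out to about 91% of the maximum of 17.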

**Below is in-progress code for a pipeline. It is still too buggy to use for an ML baseline.**

In [44]:
#categorical_features_test = ['age','gender','size','breeds.primary','breeds.mixed',
#                        'colors.primary','attributes.spayed_neutered','attributes.house_trained',
#                        'attributes.declawed','attributes.special_needs','attributes.shots_current',
#                        'contact.address.postcode']
#xtrain_cat = x_train[categorical_features_test]
#xtrain_cat.shape


In [45]:
#model = model.fit(X= x_train, y=x_train)
#savedMod = model.fit(X= xtrain_cat, y=x_train)

In [46]:
#item_id=58761493
#array_id = pd.DataFrame([item_id],columns=['id'])
#array_id

In [47]:
#type(savedMod)

In [48]:
#model.predict(X=array_id, num=5,context_df=cats_DF_context,reccs=savedMod)

# Conclusion and Next Steps <a id='conclusion'></a>

**Conclusion of ML Baseline as of 12/6/22**: 
- The average score of the top 5 recommendations per dog in the training set is 15.49. The highest possible score is 17.
- The result above comes from a simple content-based filtering model that does not use user preferences, since they are not yet available. Instead, it compares items against each other: you liked this ketchup, so here are 5 similar types of ketchup.
- Because of how the simple content-based filtering model is built, the dev and test sets cannot be used, so the training set was used to get an initial idea of the results.
- The dogs data version 0.5 features also include AKC data, which seems to help with purebreds. For mixed breeds it helps somewhat, assuming the lister picked both breeds, but it can lead to more variety in the output, which might not be a bad thing. E.g., if you like boxers, you probably like a boxer-labrador mix? In general, based on visual scans and the average recommendation score, the simple dog CBF model excels at giving you dogs similar to the one you asked for, with a little variety.
- Where there is more ambiguity (i.e., the chosen dog has fewer defined details), it still finds very similar dogs, but it sometimes also includes very similar dogs of a different breed. This might not be a bad thing.

**Next Steps**:

- Incorporate distance more effectively
- Can we use the description field for dogs at all?
- Collaborative filtering once user preferences are collected
- Rerun this workbook inside Databricks with Delta files to escape local compute limitations
- Do a more in-depth search of AKC to Petfinder to match more mixed breeds with general traits, e.g. for dogs that list and specify two breeds.