This notebook implements content-based filtering for a cat recommendation system, working from the raw data through model creation and initial results.

# Table of Contents

* [Load in Data and Segment Features from Context data](#segment)
* [Pre-process feature data](#pre-process)
    - [Recheck sweetviz for distinct values and types](#sweet)
    - [Make Pre-Processing Pipeline](#pp_pipeline)
    - [Animal type impact on missing values](#byAnimalmissingValues)
    - [Duplicate ID Check](#duplicateRows)
    - [Org Names for those posting baby animals](#babies)
    - [Distinguish cats from each other](#distinguish)
    - [Search orgs in the Petfinder database](#orgs)
* [Data Augmentation Possibilities](#aug)
* [Conclusion](#conclusion)

# Load in Data and Segment Features from Context data<a id='segment'></a>

First, we load all of our adoptable cats.

In [1]:
import pandas as pd
cats_DF = pd.read_csv("../data/raw/version0_5/Adoptable_cats_20221125.csv",header=0,index_col=0)
cats_DF.shape


(49600, 50)

In [7]:
pd.set_option('display.max_columns', 500)
cats_DF.sample(3)

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,primary_photo_cropped,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full
34914,58795150,FL1036,https://www.petfinder.com/cat/denver-58795150/...,Cat,Cat,Young,Male,Medium,,[],DENVER,,A2130214,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-11-08T20:44:19+0000,2022-11-08T20:44:19+0000,,Domestic Short Hair,,False,False,,,,False,False,False,False,False,,,,pbcacc@pbcgov.org,561-233-1200,7100 Belvedere Road,,West Palm Beach,FL,33411,US,58795150,cat,fl1036,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...
6356,58960012,OR74,https://www.petfinder.com/cat/apple-58960012/o...,Cat,Cat,Adult,Female,Medium,Short,[],Apple,My information is not yet available. Please co...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-11-25T02:24:32+0000,2022-11-25T02:24:31+0000,,Domestic Short Hair,,True,False,Black,,,True,False,False,False,True,,,,kennelmanager@savinggrace.info,(541) 672-3907,450 Old Del Rio Rd.,,Roseburg,OR,97471,US,58960012,cat,or74,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...
29854,58832270,GA150,https://www.petfinder.com/cat/adeline-58832270...,Cat,Cat,Baby,Female,Medium,,[],Adeline,Adeline\nGender: Female\nBreed: DMH\nDOB: 06/1...,B2022042,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,,[],adoptable,2022-11-12T10:49:07+0000,2022-11-12T10:49:07+0000,,Domestic Medium Hair,,False,False,,,,True,False,False,False,True,,,,info@humanemorgan.org,(706) 343-9977,1170 Fairground Road,,Madison,GA,30650,US,58832270,cat,ga150,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...


In [5]:
cats_DF.columns

Index(['id', 'organization_id', 'url', 'type', 'species', 'age', 'gender',
       'size', 'coat', 'tags', 'name', 'description', 'organization_animal_id',
       'photos', 'primary_photo_cropped', 'videos', 'status',
       'status_changed_at', 'published_at', 'distance', 'breeds.primary',
       'breeds.secondary', 'breeds.mixed', 'breeds.unknown', 'colors.primary',
       'colors.secondary', 'colors.tertiary', 'attributes.spayed_neutered',
       'attributes.house_trained', 'attributes.declawed',
       'attributes.special_needs', 'attributes.shots_current',
       'environment.children', 'environment.dogs', 'environment.cats',
       'contact.email', 'contact.phone', 'contact.address.address1',
       'contact.address.address2', 'contact.address.city',
       'contact.address.state', 'contact.address.postcode',
       'contact.address.country', 'animal_id', 'animal_type',
       'organization_id.1', 'primary_photo_cropped.small',
       'primary_photo_cropped.medium', 'primary_photo

Next, we separate the dataframe into features to model over and context data that can be shown to the user for any matches. The 'id' column will be our shared key between the two tables.

Of note, the 'distance' field and 'primary_photo_cropped.full' field will be useful data for future model enhancements. For the model baseline, we will simply use textual data and assume a 0 distance for all pets.

In [13]:
contextCols = ['id','organization_id','url','type','tags','name','description','organization_animal_id',
              'photos','primary_photo_cropped','videos','status','status_changed_at','published_at',
              'distance','contact.email', 'contact.phone', 'contact.address.address1',
               'contact.address.address2', 'contact.address.city','contact.address.state', 
               'contact.address.postcode','contact.address.country', 'animal_id', 'animal_type',
               'organization_id.1', 'primary_photo_cropped.small','primary_photo_cropped.medium',
               'primary_photo_cropped.large','primary_photo_cropped.full']
featureCols = ['id','age','gender','size','coat','breeds.primary', 'breeds.secondary','breeds.mixed',
              'breeds.unknown','colors.primary','colors.secondary','colors.tertiary',
              'attributes.spayed_neutered','attributes.house_trained','attributes.declawed',
              'attributes.special_needs','attributes.shots_current','environment.children',
              'environment.dogs','environment.cats','type']
cats_DF_features = cats_DF[featureCols]
cats_DF_context = cats_DF[contextCols]
cats_DF_features.shape

(49600, 21)
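To illustrate why 'id' is kept in both tables, here is a minimal sketch of looking up display fields for a list of recommended ids. The toy frames, the `recommended_ids` list and the `shown` variable are made up for illustration; they are not part of the notebook's data or model output.

```python
import pandas as pd

# Toy stand-ins for cats_DF_features and cats_DF_context (values are made up).
features = pd.DataFrame({"id": [1, 2, 3], "age": ["Baby", "Adult", "Young"]})
context = pd.DataFrame({"id": [1, 2, 3], "name": ["Apple", "Denver", "Adeline"]})

# Hypothetical model output: ids of recommended cats, best match first.
recommended_ids = [3, 1]

# Look up the display fields by id, preserving the recommendation order.
shown = context.set_index("id").loc[recommended_ids].reset_index()
print(shown)
```

Indexing the context table by 'id' keeps the order of the recommendations, which a plain merge would not guarantee.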

Let's sanity check our missing values now that we just have cats and remove any columns with too many missing values.

In [14]:
valueCounts = cats_DF_features.set_index('type').isna().groupby(level=0).sum()/cats_DF_features.shape[0] # level=0 refers to our index, which we made 'type'


In [15]:
pd.set_option('display.max_columns', 500)
valueCounts 

type,id,age,gender,size,coat,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats
Cat,0.0,0.0,0.0,0.0,0.633246,0.0,0.90244,0.0,0.0,0.406129,0.754194,0.920544,0.0,0.0,0.0,0.0,0.0,0.745968,0.83625,0.603226


In [17]:
valueCounts = cats_DF_context.set_index('type').isna().groupby(level=0).sum()/cats_DF_context.shape[0] # level=0 refers to our index, which we made 'type'


In [18]:
pd.set_option('display.max_columns', 500)
valueCounts 

type,id,organization_id,url,tags,name,description,organization_animal_id,photos,primary_photo_cropped,videos,status,status_changed_at,published_at,distance,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full
Cat,0.0,0.0,0.0,0.0,2e-05,0.280242,0.298327,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.050161,0.187762,0.364395,0.923931,0.0,0.0,2e-05,0.0,0.0,0.0,0.0,0.056351,0.056351,0.056351,0.056351


The NA check shows that 'coat', 'breeds.secondary', 'colors.secondary', 'colors.tertiary', 'environment.children', 'environment.dogs' and 'environment.cats' are each missing in more than 60% of rows, so we remove them. The column 'colors.primary' is also missing many values (about 41%), but it helps distinguish one cat from another, so it is kept for now.
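The same selection can be sketched with an explicit cutoff rather than a hand-written list. This is a minimal illustration on made-up data; the 0.6 threshold (`max_missing`) is an assumption that happens to agree with the columns removed here, not a value stated in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and different amounts of missing data.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "coat": ["Short", np.nan, np.nan, np.nan, np.nan],             # 80% missing
    "colors.primary": ["Black", np.nan, "White", np.nan, "Gray"],  # 40% missing
    "age": ["Baby", "Adult", "Young", "Adult", "Baby"],            # 0% missing
})

# isna().mean() gives the per-column share of missing values directly.
missing_fraction = toy.isna().mean()

max_missing = 0.6  # assumed cutoff, chosen for illustration only
keep_cols = missing_fraction[missing_fraction <= max_missing].index.tolist()
print(missing_fraction)
print(keep_cols)
```

With a single animal type in the data, `isna().mean()` gives the same fractions as the grouped computation above without the `set_index`/`groupby` step.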

In [31]:
featureCols = ['id','age','gender','size','breeds.primary','breeds.mixed','breeds.unknown',
               'colors.primary','attributes.spayed_neutered','attributes.house_trained',
               'attributes.declawed','attributes.special_needs','attributes.shots_current']
cats_DF_features = cats_DF[featureCols]
cats_DF_context = cats_DF[contextCols]
cats_DF_features.shape

(49600, 13)

In [32]:
cats_DF_features.dtypes

id                             int64
age                           object
gender                        object
size                          object
breeds.primary                object
breeds.mixed                    bool
breeds.unknown                  bool
colors.primary                object
attributes.spayed_neutered      bool
attributes.house_trained        bool
attributes.declawed             bool
attributes.special_needs        bool
attributes.shots_current        bool
dtype: object

# Pre-process feature data<a id='pre-process'></a>

First, let's re-examine our dataframe for distinct values.

## Recheck sweetviz for distinct values and types<a id='sweet'></a>

In [19]:
cats_DF_features.head(3)

Unnamed: 0,id,age,gender,size,breeds.primary,breeds.mixed,breeds.unknown,colors.primary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current
0,58980795,Young,Female,Medium,Domestic Short Hair,True,False,,False,False,False,False,True
1,58980784,Baby,Male,Medium,Tuxedo,False,False,Black & White / Tuxedo,True,True,False,False,True
2,58980785,Baby,Female,Medium,Domestic Short Hair,True,False,,True,True,False,False,True


In [109]:
cats_DF_features['colors.primary'].values[0]

nan

In [48]:
import sweetviz as sv

cat_data_report = sv.analyze(cats_DF_features)
cat_data_report.show_html() #save to html document


Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


## Make Pre-Processing Pipeline <a id='pp_pipeline'></a>

In [101]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OrdinalEncoder

In [42]:
#target = 'id' # ID set to target since that is what recommender will return (hope I'm using this right.)
#X, y = cats_DF_features.drop(columns=target), cats_DF_features[target]
X = cats_DF_features

In [43]:
X.shape

(49600, 13)

In [54]:
test = X.drop_duplicates() # same operation as the drop_duplicates helper defined further down
test.shape #49502 matches unique cats per sweetviz. yay

(49502, 13)
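Because 'id' is still one of the columns at this point, dropping duplicates only removes rows that match in every field, id included. The sketch below, on made-up data, separates three different notions of "duplicate" that are easy to conflate; the variable names are illustrative only.

```python
import pandas as pd

# Toy data: id 10 is listed twice, and ids 11 and 12 share identical features.
toy = pd.DataFrame({
    "id": [10, 10, 11, 12],
    "age": ["Baby", "Baby", "Adult", "Adult"],
    "gender": ["Male", "Male", "Female", "Female"],
})

# Rows identical in every column, id included (what drop_duplicates() removes).
exact_dupes = int(toy.duplicated().sum())

# Ids that appear more than once, whatever the other columns contain.
repeated_ids = int(toy["id"].duplicated().sum())

# Rows whose features repeat an earlier row once id is ignored.
same_features = int(toy.drop(columns="id").duplicated().sum())

print(exact_dupes, repeated_ids, same_features)
```

The third count matters for a recommender: cats with identical feature vectors cannot be ranked against each other on these features alone.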

In [44]:
#y.shape

In [56]:
categorical_features = ['age','gender','size','breeds.primary','breeds.mixed','breeds.unknown',
               'colors.primary','attributes.spayed_neutered','attributes.house_trained',
               'attributes.declawed','attributes.special_needs','attributes.shots_current']

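# OrdinalEncoder does not accept handle_unknown='ignore'; unseen categories need an explicit code.
# NaN mirrors the unknown_value used in the later encoding cell.
categorical_transformer = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=float('nan'))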

In [59]:
numerical_features = ['id']
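numeric_transformer = lambda x:x #change nothing; note that ColumnTransformer expects a transformer object, 'drop' or 'passthrough', not a bare function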

In [40]:
def remove_columns_with_1_distinct(df):
    drop_col = [e for e in df.columns if df[e].nunique()==1]
    df_return = df.drop(drop_col,axis=1)
    return df_return

remove1Distinct_transformer = FunctionTransformer(remove_columns_with_1_distinct)

In [55]:
def drop_duplicates(df):
    df_return = df.drop_duplicates()
    return df_return

dups_transformer = FunctionTransformer(drop_duplicates)

In [57]:
cat_transformer = Pipeline(steps=[
    ("rmv1Distinct",remove1Distinct_transformer),
    ("dropDuplicates",dups_transformer), # pass the FunctionTransformer, not the raw function
])

In [60]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers = [
        ("num", numeric_transformer,numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ])

In [87]:
from sklearn.metrics.pairwise import cosine_similarity

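def cosine_similarities(df_1,df_2=None):
    # df_1 (and df_2, if given) need an 'id' column plus numeric, NaN-free feature columns
    df_2 = df_1 if df_2 is None else df_2
    # 'id' is only a key, so it is left out of the similarity computation
    cs_simil = cosine_similarity(df_1.drop(columns='id'),df_2.drop(columns='id'))
    ids_1, ids_2 = df_1['id'].to_numpy(), df_2['id'].to_numpy()
    results = {}
    for pos, row_id in enumerate(ids_1):
       similar_indices = cs_simil[pos].argsort()[:-100:-1] # up to 99 most similar rows, best first
       similar_items = [(cs_simil[pos][i], ids_2[i]) for i in similar_indices]
       results[row_id] = similar_items[1:] # skip the top hit (the row itself when df_2 is df_1)
    return results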

cosineSimilarity = FunctionTransformer(cosine_similarities)
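As a small worked example of the top-k lookup, the sketch below runs cosine similarity on four made-up feature vectors. The ids, vectors and `top_k` value are illustrative assumptions. Note that a full pairwise matrix for the 40,176 training rows would hold about 1.6 billion entries (roughly 13 GB in float64), so this dense approach is only practical on small samples.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ids and already-encoded feature vectors.
ids = np.array([101, 102, 103, 104])
features = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.9],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
])

sim = cosine_similarity(features)   # shape (4, 4)
np.fill_diagonal(sim, -np.inf)      # exclude each item from its own list

top_k = 2
order = np.argsort(-sim, axis=1)[:, :top_k]   # column indices, most similar first
recommendations = {
    int(ids[i]): [int(ids[j]) for j in order[i]] for i in range(len(ids))
}
print(recommendations)
```

Masking the diagonal avoids relying on the item itself always sorting first, which is not guaranteed when another row has an identical feature vector.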

In [114]:
import math
xtrain_cat = x_train[categorical_features]
ohe = OrdinalEncoder(handle_unknown='use_encoded_value',unknown_value=math.nan).fit(xtrain_cat)
x_train_test = ohe.transform(xtrain_cat)
#x_train_test = x_train_test.join(x_train["id"])
x_train_test = pd.DataFrame(x_train_test,columns=xtrain_cat.columns)
x_train_test.shape
#test =cosine_similarities(x_train_test,x_train_test)

(40176, 12)

In [115]:
x_train_test.head(3)

Unnamed: 0,age,gender,size,breeds.primary,breeds.mixed,breeds.unknown,colors.primary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current
0,0.0,1.0,3.0,19.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,3.0,0.0,2.0,21.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
2,1.0,0.0,3.0,58.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
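To make the encoded values above easier to read, here is a minimal sketch of how `OrdinalEncoder` assigns codes. The toy frames are made up, and `unknown_value=-1` is an assumption chosen for readability (the cell above uses NaN instead).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up training rows and new rows; 'Senior' never appears in training.
train = pd.DataFrame({"age": ["Baby", "Adult", "Young"],
                      "gender": ["Male", "Female", "Male"]})
new = pd.DataFrame({"age": ["Senior", "Baby"],
                    "gender": ["Female", "Male"]})

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit(train)

# Codes follow the sorted order of the categories seen during fit.
print(enc.categories_)

encoded = pd.DataFrame(enc.transform(new), columns=train.columns)
print(encoded)
```

The codes are assigned alphabetically, so for a column like 'age' they do not follow the natural age order, and for nominal columns such as 'breeds.primary' the numeric distance between codes carries no meaning. That is worth keeping in mind when these codes feed a similarity measure.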


In [61]:
model = Pipeline(
    steps=[("preprocessor", preprocessor),
           ("model", ContentBasedRecommendor)
          ])
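`ContentBasedRecommendor` is not defined in the cells shown here. Purely as an illustration of what the final pipeline step could look like, the sketch below defines a hypothetical estimator (`ToyContentRecommender`, a made-up name) that assumes column 0 of its input is the id and the remaining columns are encoded features, matching the column order the preprocessor would produce. It swaps in `NearestNeighbors` with a cosine metric instead of a full similarity matrix; that is an assumption, not the notebook's method. A Pipeline step needs an estimator instance, so the class is instantiated before use.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.neighbors import NearestNeighbors


class ToyContentRecommender(BaseEstimator):
    """Hypothetical stand-in: column 0 of X is the id, the rest are features."""

    def __init__(self, n_neighbors=2):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.ids_ = X[:, 0].astype(int)   # remember which id each row belongs to
        self.nn_ = NearestNeighbors(
            n_neighbors=self.n_neighbors, metric="cosine"
        ).fit(X[:, 1:])
        return self

    def recommend(self, X):
        # Return the ids of the nearest fitted rows for each query row.
        X = np.asarray(X, dtype=float)
        _, idx = self.nn_.kneighbors(X[:, 1:])
        return self.ids_[idx]


# Made-up data: [id, feature_1, feature_2, feature_3]
X_toy = np.array([
    [101, 1.0, 0.0, 1.0],
    [102, 1.0, 0.0, 0.9],
    [103, 0.0, 1.0, 0.0],
])
rec = ToyContentRecommender(n_neighbors=2).fit(X_toy)
query = np.array([[999, 1.0, 0.0, 1.0]])   # the query's own id is ignored
print(rec.recommend(query))
```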

## Run Modeling Pipeline <a id='run_pipeline'></a>

In [79]:
from sklearn.model_selection import train_test_split

# split data
x, x_test = train_test_split(X,test_size=0.1,train_size=0.9)
x_train, x_dev = train_test_split(x,test_size = 0.1,train_size =0.9)

In [80]:
x_train.shape

(40176, 13)

In [81]:
x_dev.shape

(4464, 13)

In [82]:
x_test.shape

(4960, 13)
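The split above is random on every run. A minimal sketch of a reproducible version follows, using placeholder row numbers in place of the real dataframe; the `random_state=0` seed is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# One placeholder per row, matching the row count of the feature table.
rows = np.arange(49600)

# Fixing random_state makes both splits repeatable across runs.
rest, test = train_test_split(rows, test_size=0.1, random_state=0)
train, dev = train_test_split(rest, test_size=0.1, random_state=0)
print(len(train), len(dev), len(test))

# Repeating the first split with the same seed returns the same test rows.
_, test_again = train_test_split(rows, test_size=0.1, random_state=0)
```

Passing only `test_size` is enough here, since the training share defaults to the complement.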

In [83]:
model.fit(x_train)

TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. '<function <lambda> at 0x7f27e073aca0>' (type <class 'function'>) doesn't.
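The error comes from the "num" entry of the `ColumnTransformer`: a bare lambda is not a transformer. One way to leave 'id' untouched is the built-in `'passthrough'` specifier, sketched below on made-up data. The toy frame and `unknown_value=-1` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows with the same kinds of columns as the feature table.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "age": ["Baby", "Adult", "Young"],
    "gender": ["Male", "Female", "Male"],
})

preprocessor = ColumnTransformer(
    transformers=[
        ("num", "passthrough", ["id"]),   # keep id unchanged
        ("cat",
         OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         ["age", "gender"]),
    ])

encoded = preprocessor.fit_transform(toy)
print(encoded)
```

Wrapping the identity function in `FunctionTransformer()` (with no `func` argument) would work as well; `'passthrough'` is simply the shorter form.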