This workbook works through the EDA for the PetMatch project. 

It uses the `petpy` package and its methods for interacting with the Petfinder API. The goal of the `petpy` library is to give users an easy-to-use, straightforward Python interface to the rich data available in the Petfinder database. Methods for coercing the resulting JSON data into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) are also available for users who are more interested in using the API for data analysis. More information on the Petfinder API itself can be found on the [API documentation page](https://www.petfinder.com/developers/v2/docs/).

Findings are documented in this workbook. The SweetViz EDA output is shared in the repo along with this workbook and the data used to generate the findings below.

# Table of Contents

* [Obtaining an API and Secret key](#api_key)
* [Installation](#installation)
* [Initial Database EDA](#database_size)
* [SweetViz- Auto EDA](#sweetviz)
    - [Initial EDA findings from SweetViz](#sweetviz_findings)
* [Follow-up Questions](#followup)
    - [How many missing values for list columns](#missingValues)
    - [Animal type impact on missing values](#byAnimalmissingValues)
    - [Duplicate ID Check](#duplicateRows)
    - [Org Names for those posting baby animals](#babies)
    - [Distinguish cats from each other](#distinguish)
    - [Search orgs in the Petfinder database](#orgs)
* [Data Augmentation Possibilities](#aug)
* [Conclusion](#conclusion)

# Obtaining an API and Secret Key <a id='api_key'></a>

Before we can begin extracting data from the API, we first require an API and secret key to authenticate access. To receive an API and secret key, [create a free account with Petfinder](https://www.petfinder.com/developers/) on their developer page and request an API key.

The API and secret key received from Petfinder are what we will use to authenticate our connection to the Petfinder API with `petpy`. Note that authentication has a timeout of 3600 seconds (one hour), after which the connection to the API must be authenticated again.

Storing API keys and other sensitive information in a secure file or as environment variables is considered best practice, since it keeps credentials out of the notebook itself. Here the keys are read from a local `secrets_petfinder` module; the environment-variable alternative is left commented out.

In [1]:
import os
import secrets_petfinder

#key = os.getenv('PETFINDER_KEY')
#secret = os.getenv('PETFINDER_SECRET_KEY')
key = secrets_petfinder.PETFINDER_API_CLIENT_ID
secret = secrets_petfinder.PETFINDER_API_CLIENT_SECRET
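
The commented-out lines above hint at the environment-variable route. A minimal sketch of that alternative follows, assuming the variables are named `PETFINDER_KEY` and `PETFINDER_SECRET_KEY` as in those comments. The helper name `load_petfinder_credentials` is hypothetical and not part of `petpy`.

```python
import os


def load_petfinder_credentials(env=os.environ):
    """Return (key, secret) from environment variables, failing loudly if unset."""
    # Variable names are taken from the commented-out lines in the cell above.
    names = ("PETFINDER_KEY", "PETFINDER_SECRET_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]
```

Calling `key, secret = load_petfinder_credentials()` would then replace the two assignments from `secrets_petfinder`. Raising early avoids passing `None` into the API client by accident.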

# Installation <a id='installation'></a>

If not already installed, install `petpy` using `pip`:

``pip install petpy``

Then, import the package.

In [2]:
import petpy
import pandas as pd

Now that `petpy` is imported, we can authenticate our connection to the API and begin extracting data! The authentication to the Petfinder API occurs when the `Petfinder` class is initialized, which requires the API and secret keys we received in the previous step as parameters.

In [3]:
pf = petpy.Petfinder(key=key, secret=secret)

The `pf` variable is the initialized Petfinder class with our given API and secret key. We can now use this instance to interact with and extract data from the Petfinder API.
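
Because authentication times out after 3600 seconds, a long extraction session may need a fresh client partway through. The sketch below shows one possible way to handle that. `make_refreshing_client` is a hypothetical helper, not part of `petpy`; it simply re-runs whatever factory it is given once the time-to-live has elapsed.

```python
import time


def make_refreshing_client(factory, ttl_seconds=3600, clock=time.monotonic):
    """Return a zero-argument getter that rebuilds the client after ttl_seconds."""
    state = {"client": None, "created": None}

    def get_client():
        now = clock()
        expired = state["created"] is None or now - state["created"] >= ttl_seconds
        if expired:
            # Re-initializing the client is what triggers a new authentication.
            state["client"] = factory()
            state["created"] = now
        return state["client"]

    return get_client
```

In this workbook the factory could be `lambda: petpy.Petfinder(key=key, secret=secret)`, with later calls written as `get_pf().animals(...)` instead of `pf.animals(...)`.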

# Initial Database EDA<a id='database_size'></a>

`The query for animals with status 'found' returned 4,000 records. 'Found' appears to mean that a match has been made but the animal has not yet been physically adopted.`

In [17]:
allAnimalDF_F = pf.animals(return_df=True,status='found',results_per_page=100, pages=100)
allAnimalDF_F.shape

pages parameter exceeded maximum number of available pages available from the Petfinder API. As a result, the maximum number of pages 41 was returned


(4000, 50)

`The query for animals with status 'adopted' returned 35,800 records. Nice.`

In [18]:
allAnimalDF_AD = pf.animals(return_df=True,status='adopted',results_per_page=100, pages=500)
allAnimalDF_AD.shape

(35800, 50)

`There are a LOT of animals with status 'adoptable' in the system. Version 0 of the data captured only about 20K of them. Version 0.5 of the raw data will attempt to capture more before the API cap is met.`

In [11]:
allAnimalDF_A = pf.animals(return_df=True, status='adoptable', results_per_page=100, pages=500)  # use the animal_type field to filter further if wanted
allAnimalDF_A.shape

(49600, 50)
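
One way to capture more adoptable animals is to issue one request per animal type and stack the results. The sketch below is an assumption about how that could be organized, not the method used for Version 0.5. `collect_by_type` is a hypothetical helper, and `fetch` stands in for any function that returns a DataFrame for a given type.

```python
import pandas as pd


def collect_by_type(fetch, animal_types):
    """Call fetch(animal_type) for each type and combine the resulting DataFrames."""
    frames = [fetch(animal_type) for animal_type in animal_types]
    combined = pd.concat(frames, ignore_index=True)
    # The same listing could come back from more than one request; keep one row per id.
    return combined.drop_duplicates(subset="id").reset_index(drop=True)
```

With the client above, `fetch` might be `lambda t: pf.animals(animal_type=t, status='adoptable', results_per_page=100, pages=500, return_df=True)` and `animal_types` a list such as `['dog', 'cat']`. Whether this stays within the API's daily request cap is not checked here.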

In [33]:
pd.set_option('display.max_columns', 500)
allAnimalDF_F.sample(5)

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
1733,57399292,TX2046,https://www.petfinder.com/cat/angel-57399292/t...,Cat,Cat,Baby,Male,Medium,,[],Angel,B028B5F2-8CD2-47C5-8C15-B40EB13638F6.jpegInter...,20222667C,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-09-19T05:22:54+0000,2022-09-19T05:22:54+0000,,Domestic Medium Hair,,False,False,,,,False,False,False,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,info@texasanimalsociety.com,,P.O. Box 130448,,Spring,TX,77393,US,57399292,cat,tx2046,
3956,26446124,OH235,https://www.petfinder.com/dog/371suzy-q-264461...,Dog,Dog,Adult,Female,Medium,Short,[],371~Suzy Q,&quot;Suzy Q&quot; (#371) is AWESOME!!! A fema...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2013-06-21T12:37:28+0000,2013-06-19T21:38:48+0000,,Labrador Retriever,Great Dane,True,False,Black,White / Cream,,False,False,,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,,740-349-6562,544 Dog Leg Rd.,,Heath,OH,43056,US,26446124,dog,oh235,
471,58887781,MD33,https://www.petfinder.com/cat/ginger-bug-58887...,Cat,Cat,Adult,Female,Medium,,[],Ginger Bug,image.jpgPlease come see me at the Worcester C...,S2021252,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-11-19T10:02:35+0000,2022-11-17T18:47:11+0000,,Domestic Short Hair,,False,False,,,,True,False,False,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,,(410) 213-0146,12330 Eagles Nest Rd,,Berlin,MD,21811,US,58887781,cat,md33,
3199,54609972,MD477,https://www.petfinder.com/small-furry/george-a...,Small & Furry,Guinea Pig,Baby,Male,Medium,Short,"[Friendly, Playful, Curious]",George and Fred-3 months old,George and Fred are young bonded Guinea pig br...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-03-19T14:23:43+0000,2022-02-13T17:39:27+0000,,Guinea Pig,,True,False,Blue / Gray,White,,False,False,,False,False,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,Mabelsangels@gmail.com,,,,Chesapeake Beach,MD,20732,US,54609972,small-furry,md477,
3555,51580081,MI957,https://www.petfinder.com/dog/milo-51580081/mi...,Dog,Dog,Senior,Male,Small,Short,[],Milo,"This is Milo. He is a Chihuahua mix, approxim...",,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2021-07-08T12:59:34+0000,2021-05-18T18:20:38+0000,,Chihuahua,,True,False,Apricot / Beige,,,True,True,,False,True,,True,True,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,agdobie1@aol.com,574-514-6832,,,Niles,MI,49120,US,51580081,dog,mi957,


In [34]:
pd.set_option('display.max_columns', 500)
allAnimalDF_AD.sample(5)

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
20249,58736554,MI801,https://www.petfinder.com/dog/buddy-58736554/m...,Dog,Dog,Adult,Male,Large,Medium,"[Friendly, Loyal, Gentle, Playful, Athletic, Q...",Buddy,MEET BUDDY!\n\nThis handsome boy wants to be Y...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adopted,2022-11-09T00:32:40+0000,2022-11-02T22:20:06+0000,,Yellow Labrador Retriever,,False,False,Yellow / Tan / Blond / Fawn,,,True,True,,False,True,True,True,True,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,stefanie.A.Will@gmail.com,(616) 635-8756,,,Kentwood,MI,49548,US,58736554,dog,mi801,
11958,58799493,CO525,https://www.petfinder.com/cat/charm-mcalister-...,Cat,Cat,Baby,Female,Small,Short,"[Friendly, Affectionate, Gentle, Playful, Smar...",Charm McAlister,Hi my name is Charm and I am a beautiful grey ...,,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adopted,2022-11-14T15:34:04+0000,2022-11-09T02:22:28+0000,,Domestic Short Hair,,False,False,Tabby (Gray / Blue / Silver),,,True,True,False,False,True,,,True,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,info@coloradofelinefosterrescue.org,(720) 443-3550,,,Denver,CO,80246,US,58799493,cat,co525,
9999,58817265,WA65,https://www.petfinder.com/dog/anita-58817265/w...,Dog,Dog,Adult,Female,Small,Short,[sweet],Anita,"Meet Anita, a 2-year-old female Chihuahua mix....",51469197,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adopted,2022-11-12T00:36:39+0000,2022-11-10T20:00:14+0000,,Chihuahua,,True,False,Yellow / Tan / Blond / Fawn,,,False,False,,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adoption@yakimahumane.org,(509) 457-6854,2405 West Birchfield Road,,Yakima,WA,98901,US,58817265,dog,wa65,
19224,58744053,MN465,https://www.petfinder.com/dog/meelo-58744053/m...,Dog,Dog,Adult,Female,Small,,[],Meelo,You can fill out an adoption application onlin...,18590263-22-0483,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adopted,2022-11-03T16:00:25+0000,2022-11-03T16:00:25+0000,,Beagle,,True,False,,,,True,True,,False,False,,True,True,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,info@healingheartsrescue.org,,,,Crystal,MN,55428,US,58744053,dog,mn465,
7365,58842774,DE34,https://www.petfinder.com/cat/jake-fcid-number...,Cat,Cat,Baby,Male,Medium,,[],Jake (FCID# 10/03/2022 - 22) C,Jake is a friendly and striking grey cat. He p...,18728818-FCID# 10-03-2022,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adopted,2022-11-21T17:32:54+0000,2022-11-13T14:39:35+0000,,Domestic Short Hair,,False,False,Gray / Blue / Silver,,,True,True,False,False,True,,,True,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,catgalleryinfo@gmail.com,(302) 429-0124,4023 Kennett Pike,Suite 422,Greenville,DE,19807,US,58842774,cat,de34,


In [35]:
pd.set_option('display.max_columns', 500)
allAnimalDF_A.sample(5)

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
7995,58954465,AL358,https://www.petfinder.com/cat/tinsel-58954465/...,Cat,Cat,Baby,Female,Medium,,[],Tinsel,Orphans Tinsel and Twinkle have been with us s...,18771268-10052022Tinsel-K,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-11-24T04:31:22+0000,2022-11-24T04:31:20+0000,,Domestic Short Hair,,True,False,Black & White / Tuxedo,,,False,True,False,False,True,True,True,True,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,info@straylovefoundation.org,(251) 540-2236,P O Box 76,,Magnolia Springs,AL,36555,US,58954465,cat,al358,
14585,58944814,WI451,https://www.petfinder.com/dog/nala-58944814/wi...,Dog,Dog,Young,Female,Large,,[],Nala,,5aa76024-6159-44d6-a579-0a668d5e6d18,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-11-23T12:59:34+0000,2022-11-23T12:59:32+0000,,German Shepherd Dog,Siberian Husky,True,False,,,,False,True,,False,False,True,True,True,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adoptions@unforgettableunderdogs.org,(920) 710-1191,,,Little Chute,WI,54140,US,58944814,dog,wi451,
13974,58945871,FL1733,https://www.petfinder.com/small-furry/loki-589...,Small & Furry,Guinea Pig,Baby,Male,Medium,,[],Loki,You can apply to adopt him here: https://penny...,PWSR-A-733,[],[],adoptable,2022-11-23T15:28:01+0000,2022-11-23T15:28:00+0000,,Guinea Pig,,False,False,,,,True,False,,False,True,,,,,,,,info@pennyandwild.org,(954) 821-8008,,,Miami,FL,33179,US,58945871,small-furry,fl1733,
9564,58952179,GA181,https://www.petfinder.com/dog/capone-1361-5895...,Dog,Dog,Adult,Male,Medium,,[Available for adoption soon],Capone 1361,Owner returned due to behavior of dog and moving,PHCV-A-12127,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-11-24T00:19:53+0000,2022-11-24T00:19:51+0000,,Mixed Breed,,False,False,,White / Cream,,True,False,,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adoptions@pawshumane.org,(706) 565-0035,4900 Milgen Rd.,,Columbus,GA,31908,US,58952179,dog,ga181,
13656,58946362,PA194,https://www.petfinder.com/cat/monteray-jack-58...,Cat,Cat,Baby,Female,Small,,[],Monteray Jack,,BYC-A-2355,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-11-23T16:24:34+0000,2022-11-23T16:24:32+0000,,Domestic Short Hair,,False,False,Brown / Chocolate,Black,,True,False,False,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adoption@becauseyoucare.org,(814) 476-1212,6041 West Road,,McKean,PA,16426,US,58946362,cat,pa194,


In [36]:
pd.set_option('display.max_columns', 500)
allAnimalDF_F.describe(include='all')

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
count,4000.0,4000,4000,4000,4000,4000,4000,4000,1128,4000,4000,3604,2827,4000,4000,4000,4000,4000,0.0,4000,560,4000,4000,1271,639,111,4000,4000,2359,4000,4000,679,780,739,3453,3453,3453,3453,3760,3543,2710,216,4000,4000,4000.0,4000,4000.0,4000,4000,0.0
unique,,775,3985,8,20,4,3,4,5,483,3162,2630,2621,3443,73,1,2793,2776,0.0,180,90,2,1,49,37,28,2,2,2,2,2,2,2,2,3440,3440,3440,3440,774,577,431,61,669,53,769.0,3,3985.0,8,775,
top,,AZ723,https://www.petfinder.com/cat/esme-58643223/oh...,Cat,Cat,Baby,Male,Medium,Short,[],Stray,image.jpg&amp;lt;p&amp;gt;\nFor current adopti...,S2022144,[],[],found,2022-04-20T10:04:56+0000,2022-04-20T10:04:56+0000,,Domestic Short Hair,Mixed Breed,False,False,Black,White / Cream,White / Cream,False,False,False,False,True,True,True,True,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,info@feralcatwarriors.org,(865) 217-6532,3353 Morningside Drive,PO Box 1016,Kingman,AZ,86401.0,US,58643223.0,cat,az723,
freq,,467,2,2359,2359,1556,2029,1886,817,3361,45,132,5,545,3928,4000,32,32,,1785,74,2727,4000,285,223,23,2136,3039,2352,3947,2716,620,731,670,2,2,2,2,467,467,162,45,468,576,468.0,3977,2.0,2359,467,
mean,54707130.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
std,7421166.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
min,22502810.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
25%,54923690.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
50%,56752360.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
75%,58731140.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [37]:
pd.set_option('display.max_columns', 500)
allAnimalDF_AD.describe(include='all')

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
count,35800.0,35800,35800,35800,35800,35800,35800,35800,21376,35800,35800,32079,16597,35800,35800,35800,35800,35800,0.0,35800,8779,35800,35800,24262,13092,6648,35800,35800,15638,35800,35800,17278,18063,16401,34589,34589,34589,34589,35020,26175,19576,4381,35800,35800,35793.0,35800,35800.0,35800,35800,0.0
unique,,3789,35703,8,24,4,3,4,6,6390,17987,27557,16344,34506,1025,1,28826,29758,0.0,334,248,2,1,58,56,48,2,2,2,2,2,2,2,2,34493,34493,34493,34493,3946,2691,1626,246,2455,63,3599.0,3,35703.0,8,3789,
top,,CA2413,https://www.petfinder.com/dog/martha-piper-586...,Dog,Dog,Baby,Female,Medium,Short,[],Charlie,You can fill out an adoption application onlin...,oti,[],[],adopted,2022-11-01T15:01:29+0000,2022-11-01T15:01:29+0000,,Domestic Short Hair,Domestic Short Hair,True,False,Black,White / Cream,Brown / Chocolate,True,False,False,False,True,True,True,True,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,info@wwprca.com,(530) 895-8888,2156 Pillsbury Road,#155,Chico,CA,95973.0,US,58670099.0,dog,ca2413,
freq,,1241,2,19757,19757,17553,18030,19607,15038,20403,78,401,20,1199,34713,35800,119,120,,8985,771,20860,35800,5248,2025,593,27061,18125,15520,35204,27262,16725,17583,15542,2,2,2,2,1241,1241,1241,1241,1257,4301,1241.0,35054,2.0,19757,1241,
mean,58761760.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
std,105158.4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
min,52313960.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
25%,58693340.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
50%,58753880.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
75%,58826210.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [38]:
pd.set_option('display.max_columns', 500)
allAnimalDF_A.describe(include='all')

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
count,20000.0,20000,20000,20000,20000,20000,20000,20000,5229,20000,20000,12785,15444,20000,20000,20000,20000,20000,0.0,20000,4803,20000,20000,9727,5295,1616,20000,20000,9561,20000,20000,4435,4569,4346,17889,17889,17889,17889,18417,16563,13594,1498,20000,20000,20000.0,20000,20000.0,20000,20000,0.0
unique,,3067,19915,8,27,4,3,4,6,2503,11630,10562,15211,17910,271,1,12285,11516,0.0,297,180,2,1,55,50,47,2,2,2,2,2,2,2,2,17815,17815,17815,17815,2973,2242,1582,214,1952,64,2845.0,3,19915.0,8,3067,
top,,TX411,https://www.petfinder.com/dog/bubbles-58949738...,Dog,Dog,Baby,Male,Medium,Short,[],Dog,https://form.jotform.com/73636063084154,Courtesy post,[],[],adoptable,2022-11-25T13:27:59+0000,2022-11-23T05:41:56+0000,,Domestic Short Hair,Mixed Breed,False,False,Black,White / Cream,Brown / Chocolate,True,False,False,False,True,True,True,True,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adopt@spcapolk.org,(936) 755-3020,802 South Houston Avenue,1729 Willey Ave,Houston,CA,77351.0,US,58949738.0,dog,tx411,
freq,,218,2,9861,9861,7882,10058,9813,3970,15365,78,55,4,2016,19713,20000,46,52,,7036,1878,10853,20000,2622,1103,158,12994,14841,9509,19799,12665,4195,4358,4027,2,2,2,2,218,218,218,61,287,2176,218.0,19558,2.0,9861,218,
mean,58951200.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
std,25116.77,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
min,56724330.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
25%,58944120.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
50%,58951530.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
75%,58959670.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


`Given how many categorical and text columns exist, basic metrics from describe() are not very useful. Saving data for Auto EDA using SweetViz instead.`
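
For mostly categorical data, a per-column share of missing values is often more informative than `describe()`. A minimal sketch, using a hypothetical helper name `missing_share`:

```python
import pandas as pd


def missing_share(df):
    """Fraction of missing (NaN/None) values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)
```

For example, `missing_share(allAnimalDF_F).head(15)` would list the sparsest columns of the 'found' data. Note that empty lists such as `[]` in the list columns are not counted as missing by `isna()`.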

In [27]:
# Name of the CSV file
csvFileName = "../data/raw/Found_20221125.csv"

# Write contents of the DataFrame to a CSV file
allAnimalDF_F.to_csv(csvFileName);

In [28]:
# Name of the CSV file
csvFileName = "../data/raw/Adopted_20221125.csv"

# Write contents of the DataFrame to a CSV file
allAnimalDF_AD.to_csv(csvFileName);

In [14]:
# Name of the CSV file
csvFileName = "../data/raw/Adoptable_20221125.csv"

# Write contents of the DataFrame to a CSV file
allAnimalDF_A.to_csv(csvFileName);

# SweetViz- Auto EDA <a id='sweetviz'></a>

In [15]:
# Read in raw data (version 0)
allAnimalDF_F = pd.read_csv("../data/raw/version0/Found_20221125.csv",header=0,index_col=0)
allAnimalDF_AD = pd.read_csv("../data/raw/version0/Adopted_20221125.csv",header=0,index_col=0)
allAnimalDF_A = pd.read_csv("../data/raw/version0/Adoptable_20221125.csv",header=0,index_col=0)

In [16]:
# make one big dataframe to analyze
frames = [allAnimalDF_F, allAnimalDF_AD, allAnimalDF_A]

fullFrame = pd.concat(frames) # full raw data (version 0)
fullFrame.shape

(59800, 50)
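
Since each CSV is read with its own index, a plain `pd.concat` keeps the original row labels, so labels can repeat across the stacked frames. The toy example below illustrates the difference that `ignore_index=True` makes; the two small frames are made up for illustration.

```python
import pandas as pd

found = pd.DataFrame({"id": [1, 2]})
adopted = pd.DataFrame({"id": [3]})

stacked = pd.concat([found, adopted])                        # index: 0, 1, 0
renumbered = pd.concat([found, adopted], ignore_index=True)  # index: 0, 1, 2

print(stacked.index.is_unique, renumbered.index.is_unique)
```

Repeated labels are harmless for SweetViz but can surprise later label-based lookups with `.loc`.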

In [6]:
fullFrame.columns # columns in raw dataset

Index(['id', 'organization_id', 'url', 'type', 'species', 'age', 'gender',
       'size', 'coat', 'tags', 'name', 'description', 'organization_animal_id',
       'photos', 'videos', 'status', 'status_changed_at', 'published_at',
       'distance', 'breeds.primary', 'breeds.secondary', 'breeds.mixed',
       'breeds.unknown', 'colors.primary', 'colors.secondary',
       'colors.tertiary', 'attributes.spayed_neutered',
       'attributes.house_trained', 'attributes.declawed',
       'attributes.special_needs', 'attributes.shots_current',
       'environment.children', 'environment.dogs', 'environment.cats',
       'primary_photo_cropped.small', 'primary_photo_cropped.medium',
       'primary_photo_cropped.large', 'primary_photo_cropped.full',
       'contact.email', 'contact.phone', 'contact.address.address1',
       'contact.address.address2', 'contact.address.city',
       'contact.address.state', 'contact.address.postcode',
       'contact.address.country', 'animal_id', 'animal_type',

In [17]:
fullFrameNoDups = fullFrame.loc[:,~fullFrame.columns.duplicated()]# drop duplicate column names
fullFrameNoDups.columns # no duplicates found because of read_csv marking duplicates with '.1'

Index(['id', 'organization_id', 'url', 'type', 'species', 'age', 'gender',
       'size', 'coat', 'tags', 'name', 'description', 'organization_animal_id',
       'photos', 'videos', 'status', 'status_changed_at', 'published_at',
       'distance', 'breeds.primary', 'breeds.secondary', 'breeds.mixed',
       'breeds.unknown', 'colors.primary', 'colors.secondary',
       'colors.tertiary', 'attributes.spayed_neutered',
       'attributes.house_trained', 'attributes.declawed',
       'attributes.special_needs', 'attributes.shots_current',
       'environment.children', 'environment.dogs', 'environment.cats',
       'primary_photo_cropped.small', 'primary_photo_cropped.medium',
       'primary_photo_cropped.large', 'primary_photo_cropped.full',
       'contact.email', 'contact.phone', 'contact.address.address1',
       'contact.address.address2', 'contact.address.city',
       'contact.address.state', 'contact.address.postcode',
       'contact.address.country', 'animal_id', 'animal_type',
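
`columns.duplicated()` compares column names only, so copies stored under a different name are not caught. A hedged sketch of a value-based check follows. `same_content` is a hypothetical helper, and the case-insensitive comparison is an assumption based on sample rows where `organization_id` is `TX2046` while `organization_id.1` is `tx2046`.

```python
import pandas as pd


def same_content(df, left, right):
    """True if two columns hold the same values, ignoring case for text columns."""
    def normalize(series):
        if pd.api.types.is_numeric_dtype(series):
            return series
        # Text columns: compare as lower-cased strings.
        return series.astype(str).str.lower()

    return normalize(df[left]).equals(normalize(df[right]))
```

For instance, `same_content(fullFrame, 'id', 'animal_id')` and `same_content(fullFrame, 'organization_id', 'organization_id.1')` would test two of the suspected copies.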

In [18]:
# sweetviz can't handle lists, so remove columns with lists for EDA purposes, thus creating data version 0.1. 
fullFrameNoDups = fullFrameNoDups.drop(['photos','videos','tags'],axis=1)# data version 0.1 (only used for SweetViz)
fullFrameNoDups.dtypes

id                                int64
organization_id                  object
url                              object
type                             object
species                          object
age                              object
gender                           object
size                             object
coat                             object
name                             object
description                      object
organization_animal_id           object
status                           object
status_changed_at                object
published_at                     object
distance                        float64
breeds.primary                   object
breeds.secondary                 object
breeds.mixed                       bool
breeds.unknown                     bool
colors.primary                   object
colors.secondary                 object
colors.tertiary                  object
attributes.spayed_neutered         bool
attributes.house_trained           bool
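
After the CSV round trip, the list columns come back as text such as `"[]"` rather than Python lists. The sketch below shows one way to count how often such a cell is effectively empty. `is_empty_list_cell` is a hypothetical helper, and treating unparseable cells as empty is an assumption of this sketch.

```python
import ast


def is_empty_list_cell(cell):
    """True if a CSV cell such as "[]" does not parse to a non-empty list."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return True  # assumption: unparseable or missing cells count as empty
    return not (isinstance(value, list) and value)
```

Applied to the combined frame, `fullFrame['tags'].map(is_empty_list_cell).mean()` would give the share of rows without tags; the same call works for `photos` and `videos`.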


In [9]:
import sweetviz as sv

orig_data_report = sv.analyze(fullFrameNoDups)
#orig_data_report.show_notebook() # makes it read badly in github, will use html format



In [10]:
orig_data_report.show_html() #save to html document

Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


## Initial EDA findings from SweetViz <a id='sweetviz_findings'></a>

*SweetViz Auto EDA run on Data Version 0.1, which removes lists from dataset so SweetViz can analyze the rest of the columns*

**General DataSet Findings:**
- Total dataset size is 59,800 records at the time of this EDA.
- Columns are categorical or text-based, with the exception of dates, ID fields, and a 'distance' column that is populated only if the user provides a reference address in the filter options.
- Dataset provides 23 categorical columns, 22 text columns, 2 numerical columns, and 3 columns that are lists (photos, videos, and tags). 
- Of the list columns, 'photos' has only 6% missing values (based on the columns that hold the same information as single photo links), but 'videos' and 'tags' appear to be rarely filled in, based on manual inspection.
- 3 columns are copies of existing columns and should be removed ('animal_id', 'animal_type', and 'organization_id.1').
- Columns with a high proportion of missing values: 'coat', 'organization_animal_id', 'breeds.secondary', 'colors.primary', 'colors.secondary', 'colors.tertiary', 'attributes.declawed', 'environment.children', 'environment.dogs', 'environment.cats', 'contact.address.address1', and 'primary_photo_cropped'.
- Of the columns with missing values, 'coat', 'colors.primary', 'attributes.declawed', 'environment.children', 'environment.dogs', 'environment.cats', and 'contact.address.address1' are of some concern but are not showstoppers.
- Dataset provides plenty of context columns that would be returned to a potential user upon a match but not used for training or testing the model
- Dataset has 197 duplicate rows; we need to check whether those are returned animals or errors.
- Dataset features are not correlated with each other unless they are identical columns that hold the same data under different column names (the 3 mentioned above). The only exception is the 'breeds.unknown' column, which is 'False' for all animals. This column can safely be dropped.
- Dataset does not include user ratings (Y), so those would have to be generated as users use the system, and the system would need a cold start.
- Dataset does not tag organizations as 'kill' or 'no-kill' shelters. This would need to be manually added to the dataset.

**Specific DataSet Column Findings of note:**
- 'ID' seems to be a unique identifier per animal in the system.
- Dogs and Cats make up 98% of the records that were obtained so far. App should only cover dogs and cats.
- 'species' and 'type' are different. 'species' is more fine-grained in its categories than 'type'.
- 71% of animals in the system are categorized as 'baby' or 'young', rather than older dogs. Why are 'baby' dogs so numerous? 
- There is no gender preference in the animals.
- 83% of animals are categorized as 'small' or 'medium' sized.
- 'coat' column is missing a lot of data, and the best we might be able to do is fill in gaps with general assumptions, for example from AKC breed information. This would be manual work.
- 'description' column is freeform text that potentially holds useful information but would require NLP to parse it automatically.
- 'status' column lets us filter to adoptable pets only.
- 'distance' from user to matched pets is possible with API.
- For 'breeds.primary', *Mixed Breed* does not dominate the dataset (only 4%). Rather, *Domestic Short Hair* (the common cat) is the most frequent value at 30%. How distinct are the cats from each other?
- 48% of the animals are labeled as NOT mixed breed. Generic breed information would be useful for those listed as a single breed.
- 'breeds.unknown' is False for all animals and can be dropped. 
- Colors data has a lot of missing values and might not be usable for training
- Only 46% of 'attributes.declawed' values are filled out, which equals the percentage of cats in the system. We can fill in the missing values for dogs as 'NA'.
- Attributes fields seem very useful and have very little missing data, which is great.
- Environment fields have a lot of missing data, which is unfortunate since these will probably be key fields. We will probably have to fill in missing values with a placeholder so that we can use what we have rather than nothing. Knowing what an animal is okay with environment-wise is valuable.
- Dataset includes plenty of photos with only 6% missing values, which will be great for the app. 
- Contact email, followed by contact phone, is most reliable for output to user once they are matched.
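
The cleanup steps suggested by these findings could be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing: the helper name `apply_initial_findings`, the `REDUNDANT_COLS` list, and the tiny `demo` frame are all assumptions made for the example, and the string 'NA' is used as the placeholder for non-cats only because the finding above proposes it.

```python
import pandas as pd

# Columns flagged in the findings as copies of other columns or as constant.
REDUNDANT_COLS = ["animal_id", "animal_type", "organization_id.1", "breeds.unknown"]


def apply_initial_findings(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop redundant columns and mark 'declawed' as 'NA' for non-cats."""
    cleaned = frame.drop(columns=REDUNDANT_COLS, errors="ignore").copy()
    if "attributes.declawed" in cleaned.columns:
        # Use object dtype so a string placeholder can sit next to booleans.
        cleaned["attributes.declawed"] = cleaned["attributes.declawed"].astype(object)
        not_cat = cleaned["type"] != "Cat"
        missing = cleaned["attributes.declawed"].isna()
        cleaned.loc[not_cat & missing, "attributes.declawed"] = "NA"
    return cleaned


# Tiny made-up frame, only to show the behavior.
demo = pd.DataFrame({
    "id": [1, 2],
    "type": ["Cat", "Dog"],
    "attributes.declawed": [False, None],
    "breeds.unknown": [False, False],
    "animal_id": [1, 2],
})
cleaned_demo = apply_initial_findings(demo)
print(cleaned_demo)
```

Dropping with `errors="ignore"` keeps the helper usable on data versions where some of these columns have already been removed.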

# Follow-up Questions <a id='followup'></a>

**Data Version so far**
- Data Version   0: fullFrame (saved)
- Data Version 0.1: fullFrameNoDups (only used internally within workbook)

**Data used for Follow-up Questions**: Version 0

### How many missing values for list columns <a id='missingValues'></a>

In [29]:
pd.set_option('display.max_columns', 500)
fullFrame.head(3) # display so we can see what 'missing' means for the 3 columns of note: tags, videos, and photos.

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
0,58965809,ND04,https://www.petfinder.com/cat/magnolia-5896580...,Cat,Cat,Adult,Female,Medium,,[],Magnolia,.,22-10411,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-11-26T00:33:44+0000,2022-11-26T00:33:42+0000,,Domestic Short Hair,,False,False,,,,False,False,False,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adopt@cofpets.com,(701) 775-3732,1726 S. Washington St.,,Grand Forks,ND,58203,US,58965809,cat,nd04,
1,58965808,ND04,https://www.petfinder.com/cat/vienna-58965808/...,Cat,Cat,Baby,Female,Small,,[],Vienna,.,22-10412,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-11-26T00:33:44+0000,2022-11-26T00:33:42+0000,,Domestic Medium Hair,,False,False,,,,False,False,False,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adopt@cofpets.com,(701) 775-3732,1726 S. Washington St.,,Grand Forks,ND,58203,US,58965808,cat,nd04,
2,58965806,ND04,https://www.petfinder.com/cat/new-gray-cat-589...,Cat,Cat,Baby,Male,Medium,,[],NEW-Gray cat,.,22-10413,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-11-26T00:33:43+0000,2022-11-26T00:33:41+0000,,Domestic Short Hair,,False,False,,,,False,False,False,False,False,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adopt@cofpets.com,(701) 775-3732,1726 S. Washington St.,,Grand Forks,ND,58203,US,58965806,cat,nd04,


In [39]:
valueCounts = fullFrame['videos'].value_counts() # check videos columns for missing values aka []
print((valueCounts["[]"]/fullFrame.shape[0])*100,"% animals missing videos") # only interested in missing values

97.58193979933111 % animals missing videos


In [40]:
valueCounts = fullFrame['tags'].value_counts() # check tags column for missing values, i.e. []
print((valueCounts["[]"]/fullFrame.shape[0])*100,"% animals missing tags") # only interested in missing values

65.43311036789298 % animals missing tags


In [41]:
valueCounts = fullFrame['photos'].value_counts() # check photos column for missing values, i.e. []
print((valueCounts["[]"]/fullFrame.shape[0])*100,"% animals missing photos") # only interested in missing values

6.287625418060201 % animals missing photos


**List Column Missing Values Findings**
1. 'photos' missing value count matches SweetViz Auto EDA of 6% => Single picture columns can replace this list column.
2. 'videos' missing value count is huge. App should just stick to showing photos.
3. 'tags' missing value count is 65%, and the tags that exist often follow no standard format. They should not be used for training. At best, they can be output to the user when matched to an animal.
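
The three checks above repeat the same pattern, so they could be folded into one helper. The sketch below is an assumption-laden illustration: `percent_empty_lists` is a hypothetical name, the `demo` data is made up, and the helper treats both the string `"[]"` (how the lists look after a CSV round trip, as in this workbook) and a real empty Python list as "missing".

```python
import pandas as pd


def percent_empty_lists(frame: pd.DataFrame, columns) -> pd.Series:
    """Percentage of rows whose list-like column is empty, per column."""

    def is_empty(value) -> bool:
        if isinstance(value, str):
            return value.strip() == "[]"   # list stored as text, e.g. read from CSV
        if isinstance(value, (list, tuple)):
            return len(value) == 0         # real Python list
        return bool(pd.isna(value))        # NaN/None also counts as missing

    return pd.Series({col: frame[col].map(is_empty).mean() * 100 for col in columns})


# Made-up data: string-encoded lists for photos/videos, real lists for tags.
demo = pd.DataFrame({
    "photos": ["[{'small': 'url'}]", "[]", "[{'small': 'url'}]", "[{'small': 'url'}]"],
    "videos": ["[]", "[]", "[]", "[{'embed': 'x'}]"],
    "tags": [[], ["Friendly"], [], ["Costa Mesa"]],
})
empty_pct = percent_empty_lists(demo, ["photos", "videos", "tags"])
print(empty_pct)
```

Unlike indexing `value_counts()` with `"[]"`, this variant does not fail when a column happens to contain no empty lists at all.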

### Animal type impact on missing values <a id='byAnimalmissingValues'></a>

In [42]:
pd.set_option('display.max_columns', 500)
fullFrame.head(3) # display so we can see what 'missing' means, looks like we can just use an NaN check

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
0,58965809,ND04,https://www.petfinder.com/cat/magnolia-5896580...,Cat,Cat,Adult,Female,Medium,,[],Magnolia,.,22-10411,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-11-26T00:33:44+0000,2022-11-26T00:33:42+0000,,Domestic Short Hair,,False,False,,,,False,False,False,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adopt@cofpets.com,(701) 775-3732,1726 S. Washington St.,,Grand Forks,ND,58203,US,58965809,cat,nd04,
1,58965808,ND04,https://www.petfinder.com/cat/vienna-58965808/...,Cat,Cat,Baby,Female,Small,,[],Vienna,.,22-10412,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-11-26T00:33:44+0000,2022-11-26T00:33:42+0000,,Domestic Medium Hair,,False,False,,,,False,False,False,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adopt@cofpets.com,(701) 775-3732,1726 S. Washington St.,,Grand Forks,ND,58203,US,58965808,cat,nd04,
2,58965806,ND04,https://www.petfinder.com/cat/new-gray-cat-589...,Cat,Cat,Baby,Male,Medium,,[],NEW-Gray cat,.,22-10413,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-11-26T00:33:43+0000,2022-11-26T00:33:41+0000,,Domestic Short Hair,,False,False,,,,False,False,False,False,False,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,adopt@cofpets.com,(701) 775-3732,1726 S. Washington St.,,Grand Forks,ND,58203,US,58965806,cat,nd04,


In [43]:
missingValuesCols = ['coat', 'organization_animal_id', 'breeds.secondary', 'colors.primary',
                     'colors.secondary', 'colors.tertiary', 'attributes.declawed',
                     'environment.children', 'environment.dogs', 'environment.cats',
                     'contact.address.address1', 'contact.address.address1','primary_photo_cropped']

In [99]:
valueCounts = fullFrame.set_index('type').isna().groupby(level=0).sum()/fullFrame.shape[0] # level=0 refers to our index, which we made 'type'


In [100]:
pd.set_option('display.max_columns', 500)
valueCounts # show NA counts per type as a fraction of ALL records (not of each type), for all columns in the dataset

type,id,organization_id,url,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
Barnyard,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000435,0.0,0.0,0.000318,6.7e-05,0.0,0.0,0.0,0.0,0.0,0.000468,0.0,0.000452,0.0,0.0,0.000418,0.000468,0.000468,0.0,0.0,0.000468,0.0,0.0,0.000452,0.000452,0.000452,8.4e-05,8.4e-05,8.4e-05,8.4e-05,5e-05,0.000268,0.000184,0.000468,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000468
Bird,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002258,0.0,0.0,0.000569,0.000786,0.0,0.0,0.0,0.0,0.0,0.002258,0.0,0.002258,0.0,0.0,0.001589,0.001722,0.00204,0.0,0.0,0.002258,0.0,0.0,0.002107,0.00209,0.002174,0.000284,0.000284,0.000284,0.000284,0.000368,0.000334,0.00087,0.001672,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002258
Cat,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.251505,0.0,1.7e-05,0.094983,0.177308,0.0,0.0,0.0,0.0,0.0,0.460836,0.0,0.404615,0.0,0.0,0.171773,0.350067,0.416722,0.0,0.0,0.0,0.0,0.0,0.310452,0.363729,0.239482,0.038378,0.038378,0.038378,0.038378,0.018294,0.085686,0.167207,0.413311,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.460836
Dog,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.271739,0.0,0.0,0.089164,0.233562,0.0,0.0,0.0,0.0,0.0,0.521773,0.0,0.342157,0.0,0.0,0.226739,0.317308,0.426572,0.0,0.0,0.521773,0.0,0.0,0.299916,0.228512,0.384833,0.024465,0.024465,0.024465,0.024465,0.023562,0.137224,0.226538,0.46888,0.0,0.0,0.000117,0.0,0.0,0.0,0.0,0.521773
Horse,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000117,0.0,0.0,0.0,6.7e-05,0.0,0.0,0.0,0.0,0.0,0.000117,0.0,0.000117,0.0,0.0,3.3e-05,0.0001,0.000117,0.0,0.0,0.000117,0.0,0.0,0.0001,0.0001,0.0001,0.0,0.0,0.0,0.0,0.0,1.7e-05,6.7e-05,0.000117,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000117
Rabbit,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003997,0.0,0.0,0.001823,0.002224,0.0,0.0,0.0,0.0,0.0,0.006221,0.0,0.005719,0.0,0.0,0.003963,0.0051,0.006087,0.0,0.0,0.006221,0.0,0.0,0.005468,0.005753,0.005719,0.000318,0.000318,0.000318,0.000318,0.000401,0.001003,0.002492,0.005819,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006221
"Scales, Fins & Other",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000819,0.0,0.0,0.000201,0.000385,0.0,0.0,0.0,0.0,0.0,0.000819,0.0,0.000803,0.0,0.0,0.000635,0.000753,0.000819,0.0,0.0,0.000819,0.0,0.0,0.000635,0.000702,0.000719,0.000184,0.000184,0.000184,0.000184,0.000151,0.000151,0.000418,0.000736,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000819
Small & Furry,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005368,0.0,0.0,0.002441,0.002542,0.0,0.0,0.0,0.0,0.0,0.007508,0.0,0.007391,0.0,0.0,0.005217,0.006321,0.007124,0.0,0.0,0.007508,0.0,0.0,0.006421,0.007157,0.007224,0.000987,0.000987,0.000987,0.000987,0.000702,0.001388,0.002241,0.007074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007508


In [101]:
valueCounts[missingValuesCols] # same NA fractions (relative to all records), only for columns we already know have a high missing value rate

type,coat,organization_animal_id,breeds.secondary,colors.primary,colors.secondary,colors.tertiary,attributes.declawed,environment.children,environment.dogs,environment.cats,contact.address.address1,contact.address.address1,primary_photo_cropped
Barnyard,0.000435,6.7e-05,0.000452,0.000418,0.000468,0.000468,0.000468,0.000452,0.000452,0.000452,0.000184,0.000184,0.000468
Bird,0.002258,0.000786,0.002258,0.001589,0.001722,0.00204,0.002258,0.002107,0.00209,0.002174,0.00087,0.00087,0.002258
Cat,0.251505,0.177308,0.404615,0.171773,0.350067,0.416722,0.0,0.310452,0.363729,0.239482,0.167207,0.167207,0.460836
Dog,0.271739,0.233562,0.342157,0.226739,0.317308,0.426572,0.521773,0.299916,0.228512,0.384833,0.226538,0.226538,0.521773
Horse,0.000117,6.7e-05,0.000117,3.3e-05,0.0001,0.000117,0.000117,0.0001,0.0001,0.0001,6.7e-05,6.7e-05,0.000117
Rabbit,0.003997,0.002224,0.005719,0.003963,0.0051,0.006087,0.006221,0.005468,0.005753,0.005719,0.002492,0.002492,0.006221
"Scales, Fins & Other",0.000819,0.000385,0.000803,0.000635,0.000753,0.000819,0.000819,0.000635,0.000702,0.000719,0.000418,0.000418,0.000819
Small & Furry,0.005368,0.002542,0.007391,0.005217,0.006321,0.007124,0.007508,0.006421,0.007157,0.007224,0.002241,0.002241,0.007508


In [102]:
valueCounts = fullFrame.set_index('type').notna().groupby(level=0).sum()/fullFrame.shape[0] # level=0 refers to our index, which we made 'type'

In [103]:
valueCounts # show non-NA counts per type as a fraction of ALL records (not of each type), for all columns in the dataset

type,id,organization_id,url,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
Barnyard,0.000468,0.000468,0.000468,0.000468,0.000468,0.000468,0.000468,3.3e-05,0.000468,0.000468,0.000151,0.000401,0.000468,0.000468,0.000468,0.000468,0.000468,0.0,0.000468,1.7e-05,0.000468,0.000468,5e-05,0.0,0.0,0.000468,0.000468,0.0,0.000468,0.000468,1.7e-05,1.7e-05,1.7e-05,0.000385,0.000385,0.000385,0.000385,0.000418,0.000201,0.000284,0.0,0.000468,0.000468,0.000468,0.000468,0.000468,0.000468,0.000468,0.0
Bird,0.002258,0.002258,0.002258,0.002258,0.002258,0.002258,0.002258,0.0,0.002258,0.002258,0.001689,0.001472,0.002258,0.002258,0.002258,0.002258,0.002258,0.0,0.002258,0.0,0.002258,0.002258,0.000669,0.000535,0.000217,0.002258,0.002258,0.0,0.002258,0.002258,0.000151,0.000167,8.4e-05,0.001973,0.001973,0.001973,0.001973,0.00189,0.001923,0.001388,0.000585,0.002258,0.002258,0.002258,0.002258,0.002258,0.002258,0.002258,0.0
Cat,0.460836,0.460836,0.460836,0.460836,0.460836,0.460836,0.460836,0.209331,0.460836,0.460819,0.365853,0.283528,0.460836,0.460836,0.460836,0.460836,0.460836,0.0,0.460836,0.056221,0.460836,0.460836,0.289064,0.110769,0.044114,0.460836,0.460836,0.460836,0.460836,0.460836,0.150385,0.097107,0.221355,0.422458,0.422458,0.422458,0.422458,0.442542,0.375151,0.293629,0.047525,0.460836,0.460836,0.460836,0.460836,0.460836,0.460836,0.460836,0.0
Dog,0.521773,0.521773,0.521773,0.521773,0.521773,0.521773,0.521773,0.250033,0.521773,0.521773,0.432609,0.288211,0.521773,0.521773,0.521773,0.521773,0.521773,0.0,0.521773,0.179615,0.521773,0.521773,0.295033,0.204465,0.095201,0.521773,0.521773,0.0,0.521773,0.521773,0.221856,0.293261,0.13694,0.497308,0.497308,0.497308,0.497308,0.498211,0.384548,0.295234,0.052893,0.521773,0.521773,0.521656,0.521773,0.521773,0.521773,0.521773,0.0
Horse,0.000117,0.000117,0.000117,0.000117,0.000117,0.000117,0.000117,0.0,0.000117,0.000117,0.000117,5e-05,0.000117,0.000117,0.000117,0.000117,0.000117,0.0,0.000117,0.0,0.000117,0.000117,8.4e-05,1.7e-05,0.0,0.000117,0.000117,0.0,0.000117,0.000117,1.7e-05,1.7e-05,1.7e-05,0.000117,0.000117,0.000117,0.000117,0.000117,0.0001,5e-05,0.0,0.000117,0.000117,0.000117,0.000117,0.000117,0.000117,0.000117,0.0
Rabbit,0.006221,0.006221,0.006221,0.006221,0.006221,0.006221,0.006221,0.002224,0.006221,0.006221,0.004398,0.003997,0.006221,0.006221,0.006221,0.006221,0.006221,0.0,0.006221,0.000502,0.006221,0.006221,0.002258,0.00112,0.000134,0.006221,0.006221,0.0,0.006221,0.006221,0.000753,0.000468,0.000502,0.005903,0.005903,0.005903,0.005903,0.005819,0.005217,0.003729,0.000401,0.006221,0.006221,0.006221,0.006221,0.006221,0.006221,0.006221,0.0
"Scales, Fins & Other",0.000819,0.000819,0.000819,0.000819,0.000819,0.000819,0.000819,0.0,0.000819,0.000819,0.000619,0.000435,0.000819,0.000819,0.000819,0.000819,0.000819,0.0,0.000819,1.7e-05,0.000819,0.000819,0.000184,6.7e-05,0.0,0.000819,0.000819,0.0,0.000819,0.000819,0.000184,0.000117,0.0001,0.000635,0.000635,0.000635,0.000635,0.000669,0.000669,0.000401,8.4e-05,0.000819,0.000819,0.000819,0.000819,0.000819,0.000819,0.000819,0.0
Small & Furry,0.007508,0.007508,0.007508,0.007508,0.007508,0.007508,0.007508,0.00214,0.007508,0.007508,0.005067,0.004967,0.007508,0.007508,0.007508,0.007508,0.007508,0.0,0.007508,0.000117,0.007508,0.007508,0.002291,0.001187,0.000385,0.007508,0.007508,0.0,0.007508,0.007508,0.001087,0.000351,0.000284,0.006522,0.006522,0.006522,0.006522,0.006806,0.00612,0.005268,0.000435,0.007508,0.007508,0.007508,0.007508,0.007508,0.007508,0.007508,0.0


In [104]:
valueCounts[missingValuesCols] # same non-NA fractions (relative to all records), only for columns we already know have a high missing value rate

type,coat,organization_animal_id,breeds.secondary,colors.primary,colors.secondary,colors.tertiary,attributes.declawed,environment.children,environment.dogs,environment.cats,contact.address.address1,contact.address.address1,primary_photo_cropped
Barnyard,3.3e-05,0.000401,1.7e-05,5e-05,0.0,0.0,0.0,1.7e-05,1.7e-05,1.7e-05,0.000284,0.000284,0.0
Bird,0.0,0.001472,0.0,0.000669,0.000535,0.000217,0.0,0.000151,0.000167,8.4e-05,0.001388,0.001388,0.0
Cat,0.209331,0.283528,0.056221,0.289064,0.110769,0.044114,0.460836,0.150385,0.097107,0.221355,0.293629,0.293629,0.0
Dog,0.250033,0.288211,0.179615,0.295033,0.204465,0.095201,0.0,0.221856,0.293261,0.13694,0.295234,0.295234,0.0
Horse,0.0,5e-05,0.0,8.4e-05,1.7e-05,0.0,0.0,1.7e-05,1.7e-05,1.7e-05,5e-05,5e-05,0.0
Rabbit,0.002224,0.003997,0.000502,0.002258,0.00112,0.000134,0.0,0.000753,0.000468,0.000502,0.003729,0.003729,0.0
"Scales, Fins & Other",0.0,0.000435,1.7e-05,0.000184,6.7e-05,0.0,0.0,0.000184,0.000117,0.0001,0.000401,0.000401,0.0
Small & Furry,0.00214,0.004967,0.000117,0.002291,0.001187,0.000385,0.0,0.001087,0.000351,0.000284,0.005268,0.005268,0.0


**Animal Type Impact on Missing Values Findings**
1. Cats are verified to contribute no missing values to 'attributes.declawed', so it is a usable feature for cats.
2. For the remaining columns, 'Cat' and 'Dog' have the highest NA counts, with neither clearly better than the other. This is probably an artifact of the dataset, in which 98% of records are dogs or cats.
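
Because the tables above divide by the total number of records, 'Cat' and 'Dog' dominate by construction. To compare types on an equal footing, the missing rate can instead be computed within each type. The following is a small sketch of that idea; `missing_rate_by_type` is a hypothetical helper and the `demo` values are invented for illustration.

```python
import pandas as pd


def missing_rate_by_type(frame: pd.DataFrame, columns) -> pd.DataFrame:
    """Fraction of missing values per column, computed within each animal type."""
    # isna() gives booleans; the mean of booleans within a group is the missing fraction.
    return frame[columns].isna().groupby(frame["type"]).mean()


# Made-up records: 2 cats, 3 dogs, 1 horse.
demo = pd.DataFrame({
    "type": ["Cat", "Cat", "Dog", "Dog", "Dog", "Horse"],
    "coat": ["Short", None, None, "Long", None, None],
    "attributes.declawed": [False, True, None, None, None, None],
})
rates = missing_rate_by_type(demo, ["coat", "attributes.declawed"])
print(rates)
```

With this normalization, a value of 1.0 means the column is always missing for that type, regardless of how many animals of that type are in the dataset.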

### Duplicate ID Check <a id='duplicateRows'></a>

In [107]:
duplicatedFullFrame = fullFrame[fullFrame.duplicated()]
duplicatedFullFrame.shape # 197 fully duplicated rows, which matches the SweetViz output

(197, 50)

In [113]:
pd.set_option('display.max_columns', 500)
duplicatedFullFrame.sample(3) # on visual inspection, they don't look the same though...

Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
22684,58720904,NJ995,https://www.petfinder.com/dog/shaia-58720904/n...,Dog,Dog,Baby,Female,Small,,[],Shaia,Shaia is a part of a litter of 4 that was foun...,18661035,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adopted,2022-11-01T15:01:29+0000,2022-11-01T15:01:29+0000,,Labrador Retriever,,True,False,,,,False,False,,False,True,True,True,True,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,info@savethesatos.org,(973) 288-1934,5 Bowling Green Parkway,Suite 13,Lake Hopatcong,NJ,7849,US,58720904,dog,nj995,
11314,58949763,CA1886,https://www.petfinder.com/cat/winter-costa-mes...,Cat,Cat,Adult,Female,Medium,,['Costa Mesa'],Winter - Costa Mesa Location,Winter is a gorgeous cat with the softest snow...,PPR-A-20521,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-11-23T21:03:35+0000,2022-11-23T21:03:33+0000,,Domestic Short Hair,,False,False,White,,,True,False,False,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,priceless.pets@yahoo.com,(909) 203-3695,,,Chino Hills,CA,91709,US,58949763,cat,ca1886,
15120,58944059,ID103,https://www.petfinder.com/cat/ellies-58944059/...,Cat,Cat,Senior,Female,Medium,,[],Ellies,,S2022556,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],adoptable,2022-11-23T10:01:21+0000,2022-11-23T10:01:19+0000,,Domestic Short Hair,,False,False,,,,True,False,False,False,False,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,ccpethaven@gmail.com,(208) 466-1298,333 W. Orchard Ave.,,Nampa,ID,83651,US,58944059,cat,id103,


In [136]:
dupIDs = fullFrame.groupby("id").filter(lambda x: len(x) > 1)
dupIDs.shape # grab all rows for animal ids that show up more than once.

(394, 50)
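
The two checks above (fully duplicated rows and repeated ids) can also be combined into one summary that shows whether any repeated id carries conflicting data. This is only a sketch under assumptions: `duplicate_id_report` is a hypothetical helper, the `demo` frame is invented, and every column is assumed to hold hashable values (as is the case when list columns were read back from CSV as strings).

```python
import pandas as pd


def duplicate_id_report(frame: pd.DataFrame, id_col: str = "id") -> dict:
    """Summarize repeated ids and whether their rows are exact copies."""
    # All rows whose id occurs more than once (keep=False marks every occurrence).
    dup_id_rows = frame[frame.duplicated(subset=id_col, keep=False)]
    # All rows that are identical to some other row in every column.
    exact_dup_rows = frame[frame.duplicated(keep=False)]
    # After dropping exact copies, an id that still repeats has conflicting rows.
    deduped = dup_id_rows.drop_duplicates()
    conflicting = int(deduped[id_col].value_counts().gt(1).sum())
    return {
        "rows_with_repeated_id": len(dup_id_rows),
        "rows_that_are_exact_copies": len(exact_dup_rows),
        "ids_with_conflicting_rows": conflicting,
    }


# Made-up data: id 1 is an exact copy, id 3 repeats with different content.
demo = pd.DataFrame({"id": [1, 1, 2, 3, 3], "name": ["A", "A", "B", "C", "C2"]})
report = duplicate_id_report(demo)
print(report)
```

If the first two numbers match and the third is zero, every repeated id is a plain copy and `drop_duplicates()` is enough to clean the data.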

In [142]:
pd.set_option('display.max_rows', None)
dupIDs.sort_values('id') # the duplicates appear to have identical values in every cell, including timestamps: 197 animals simply appear twice


Unnamed: 0,id,organization_id,url,type,species,age,gender,size,coat,tags,name,description,organization_animal_id,photos,videos,status,status_changed_at,published_at,distance,breeds.primary,breeds.secondary,breeds.mixed,breeds.unknown,colors.primary,colors.secondary,colors.tertiary,attributes.spayed_neutered,attributes.house_trained,attributes.declawed,attributes.special_needs,attributes.shots_current,environment.children,environment.dogs,environment.cats,primary_photo_cropped.small,primary_photo_cropped.medium,primary_photo_cropped.large,primary_photo_cropped.full,contact.email,contact.phone,contact.address.address1,contact.address.address2,contact.address.city,contact.address.state,contact.address.postcode,contact.address.country,animal_id,animal_type,organization_id.1,primary_photo_cropped
3108,54796422,GA152,https://www.petfinder.com/dog/22101026-hattie-...,Dog,Dog,Young,Female,Large,,[],22101026 - Hattie,,A2022017,[],[],found,2022-03-01T10:03:16+0000,2022-03-01T10:03:16+0000,,Australian Cattle Dog / Blue Heeler,Labrador Retriever,True,False,,,,False,False,,False,True,,,,,,,,wwhumaneshelter@gmail.com,706.678.2287,358 Brown Road,,Washington,GA,30673,US,54796422,dog,ga152,
3079,54796422,GA152,https://www.petfinder.com/dog/22101026-hattie-...,Dog,Dog,Young,Female,Large,,[],22101026 - Hattie,,A2022017,[],[],found,2022-03-01T10:03:16+0000,2022-03-01T10:03:16+0000,,Australian Cattle Dog / Blue Heeler,Labrador Retriever,True,False,,,,False,False,,False,True,,,,,,,,wwhumaneshelter@gmail.com,706.678.2287,358 Brown Road,,Washington,GA,30673,US,54796422,dog,ga152,
3102,54796437,GA152,https://www.petfinder.com/dog/21210224-raquel-...,Dog,Dog,Adult,Female,Large,,[],21210224 - Raquel,Wilkes County-20 (1).jpg,A2021149,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-03-01T10:03:16+0000,2022-03-01T10:03:16+0000,,American Staffordshire Terrier,Pit Bull Terrier,True,False,,,,False,False,,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,wwhumaneshelter@gmail.com,706.678.2287,358 Brown Road,,Washington,GA,30673,US,54796437,dog,ga152,
3099,54796437,GA152,https://www.petfinder.com/dog/21210224-raquel-...,Dog,Dog,Adult,Female,Large,,[],21210224 - Raquel,Wilkes County-20 (1).jpg,A2021149,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-03-01T10:03:16+0000,2022-03-01T10:03:16+0000,,American Staffordshire Terrier,Pit Bull Terrier,True,False,,,,False,False,,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,wwhumaneshelter@gmail.com,706.678.2287,358 Brown Road,,Washington,GA,30673,US,54796437,dog,ga152,
3098,54796448,GA152,https://www.petfinder.com/dog/21212271-brewer-...,Dog,Dog,Adult,Male,Large,,[],21212271 - Brewer,Brewer.jpg,A2021175,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-03-01T10:03:16+0000,2022-03-01T10:03:16+0000,,Rhodesian Ridgeback,Labrador Retriever,True,False,,,,False,False,,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,wwhumaneshelter@gmail.com,706.678.2287,358 Brown Road,,Washington,GA,30673,US,54796448,dog,ga152,
3101,54796448,GA152,https://www.petfinder.com/dog/21212271-brewer-...,Dog,Dog,Adult,Male,Large,,[],21212271 - Brewer,Brewer.jpg,A2021175,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-03-01T10:03:16+0000,2022-03-01T10:03:16+0000,,Rhodesian Ridgeback,Labrador Retriever,True,False,,,,False,False,,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,wwhumaneshelter@gmail.com,706.678.2287,358 Brown Road,,Washington,GA,30673,US,54796448,dog,ga152,
3080,54796453,GA152,https://www.petfinder.com/dog/22101027-hallie-...,Dog,Dog,Adult,Female,Large,,[],22101027 - Hallie,,A2022020,[],[],found,2022-03-01T10:03:16+0000,2022-03-01T10:03:16+0000,,Labrador Retriever,,True,False,,,,False,False,,False,True,,,,,,,,wwhumaneshelter@gmail.com,706.678.2287,358 Brown Road,,Washington,GA,30673,US,54796453,dog,ga152,
3109,54796453,GA152,https://www.petfinder.com/dog/22101027-hallie-...,Dog,Dog,Adult,Female,Large,,[],22101027 - Hallie,,A2022020,[],[],found,2022-03-01T10:03:16+0000,2022-03-01T10:03:16+0000,,Labrador Retriever,,True,False,,,,False,False,,False,True,,,,,,,,wwhumaneshelter@gmail.com,706.678.2287,358 Brown Road,,Washington,GA,30673,US,54796453,dog,ga152,
3100,54796454,GA152,https://www.petfinder.com/cat/22110233-sugar-5...,Cat,Cat,Baby,Female,Large,,[],22110233 - Sugar,Sugar2.jpg,S2021168,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-03-01T10:03:16+0000,2022-03-01T10:03:16+0000,,Domestic Short Hair,,False,False,,,,False,False,False,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,wwhumaneshelter@gmail.com,706.678.2287,358 Brown Road,,Washington,GA,30673,US,54796454,cat,ga152,
3097,54796454,GA152,https://www.petfinder.com/cat/22110233-sugar-5...,Cat,Cat,Baby,Female,Large,,[],22110233 - Sugar,Sugar2.jpg,S2021168,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,[],found,2022-03-01T10:03:16+0000,2022-03-01T10:03:16+0000,,Domestic Short Hair,,False,False,,,,False,False,False,False,True,,,,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,https://dl5zpyw5k3jeb.cloudfront.net/photos/pe...,wwhumaneshelter@gmail.com,706.678.2287,358 Brown Road,,Washington,GA,30673,US,54796454,cat,ga152,


In [154]:
pd.set_option('display.max_rows', 50)
idCounts = fullFrame['id'].value_counts() # count occurrences of each id once, then filter
dupIDsCounts = idCounts[idCounts > 1]
dupIDsCounts # 197 ids each appear exactly twice

58949765    2
58947823    2
58739033    2
58739072    2
58966191    2
           ..
58720904    2
58670100    2
54796437    2
58720956    2
58720863    2
Name: id, Length: 197, dtype: int64

**Duplicate ID Findings**
1. There are 197 IDs that each appear twice (one duplicate apiece). The extra copy of each must be dropped before building the train and test sets.
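
The cleanup described above could look roughly like the following. This is a minimal sketch on a toy frame, not the project's actual cleaning step: the `toy` DataFrame and its `name` values are stand-ins for `fullFrame`, and keeping the first occurrence is an assumption (the duplicated rows shown earlier look identical apart from the index, so either copy should do).

```python
import pandas as pd

# Toy stand-in for fullFrame (hypothetical subset); only the 'id' column matters here.
toy = pd.DataFrame({
    "id": [54796437, 54796437, 54796448, 54796453],
    "name": ["Raquel", "Raquel", "Brewer", "Hallie"],
})

# Keep the first occurrence of each id and drop any later repeats.
deduped = toy.drop_duplicates(subset="id", keep="first").reset_index(drop=True)

# Sanity check: no id should remain duplicated after the drop.
n_dupes = int(deduped["id"].duplicated().sum())
print(len(toy), "->", len(deduped), "rows;", n_dupes, "duplicate ids left")
```

On the real data the same call would be applied to `fullFrame`, and the row count should fall by 197 if every duplicated ID appears exactly twice.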

### Org Names for those posting baby animals <a id='babies'></a>

In [161]:
babyPostingsOrg = fullFrame[fullFrame['age']=='Baby']
babyPostingsOrg.shape # matches sweetviz number of baby animals

(26991, 50)

In [162]:
babyPostingsOrg['type'].value_counts() # cats and dogs dominate baby animals, with cats ahead of dogs

Cat                     15351
Dog                     11507
Small & Furry              67
Rabbit                     52
Bird                        6
Scales, Fins & Other        4
Barnyard                    4
Name: type, dtype: int64

In [164]:
babyPostingsOrg = babyPostingsOrg[babyPostingsOrg["type"].isin(["Cat","Dog"])]
babyPostingsOrg.shape

(26858, 50)

In [176]:
pd.set_option('display.max_columns', 500)
babyPostingsOrg[["id","organization_id","type","contact.email"]].sample(3)

Unnamed: 0,id,organization_id,type,contact.email
10956,58809376,SC38,Cat,cgumienny@CharlestonAnimalSociety.org
10421,58813917,OH110,Dog,adopt@portageapl.org
2882,55326753,MD450,Dog,info@caninehumane.org


**Org Names for Baby Animal Findings**
1. Cats and dogs again dominate the 'Baby' listings.
2. Manual review of samples gives no clear indication of why so many 'Baby' animals are in the database. The dataset has no organization name, but it does have `contact.email`, and every sampled address looks like a rescue organization. No action required.
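
The manual sample check could be complemented by counting 'Baby' postings per organization, which would show whether a few organizations account for most of them. The sketch below runs on a toy frame. The column names match the notebook, but the rows and e-mail addresses are made up for illustration.

```python
import pandas as pd

# Toy stand-in for babyPostingsOrg (hypothetical rows and placeholder e-mails).
toy_babies = pd.DataFrame({
    "organization_id": ["SC38", "SC38", "OH110", "MD450", "SC38"],
    "contact.email": ["a@example.org", "a@example.org", "b@example.org",
                      "c@example.org", "a@example.org"],
    "type": ["Cat", "Cat", "Dog", "Dog", "Dog"],
})

# Count postings per organization, keeping the e-mail alongside as a readable label.
baby_by_org = (
    toy_babies.groupby(["organization_id", "contact.email"])
    .size()
    .sort_values(ascending=False)
    .rename("baby_postings")
    .reset_index()
)
print(baby_by_org.head(10))
```

Note that `groupby` drops rows whose key is missing by default, so listings without a `contact.email` would need separate handling on the real data.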

### Distinguish cats from each other<a id='distinguish'></a>

In [178]:
Domest_shortHair = fullFrame[fullFrame["breeds.primary"]=="Domestic Short Hair"]
Domest_shortHair.shape # matches sweetviz output

(17806, 50)

In [184]:
onlyCats = fullFrame[fullFrame["type"]=="Cat"]
onlyCats["breeds.primary"].unique() # print all breeds.primary for all cat types, including domestic short hair

array(['Domestic Short Hair', 'Domestic Medium Hair',
       'Domestic Long Hair', 'American Bobtail', 'Tortoiseshell',
       'Siamese', 'Russian Blue', 'Calico', 'Abyssinian', 'Torbie',
       'Dilute Tortoiseshell', 'Tabby', 'American Curl',
       'British Shorthair', 'Bombay', 'Tuxedo', 'American Shorthair',
       'Maine Coon', 'Dilute Calico', 'Snowshoe', 'Turkish Van', 'Bengal',
       'Chausie', 'Tiger', 'Oriental Short Hair', 'Scottish Fold',
       'Oriental Long Hair', 'Balinese', 'Ragdoll', 'Egyptian Mau',
       'Himalayan', 'Manx', 'Siberian', 'Havana', 'Ocicat',
       'Extra-Toes Cat / Hemingway Polydactyl', 'Korat',
       'Norwegian Forest Cat', 'Turkish Angora', 'Burmese', 'Ragamuffin',
       'Japanese Bobtail', 'Persian', 'Exotic Shorthair', 'Chartreux',
       'Munchkin', 'Applehead Siamese', 'Birman', 'Nebelung', 'Silver',
       'Selkirk Rex', 'American Wirehair', 'Tonkinese', 'Cymric',
       'Sphynx / Hairless Cat', 'Oriental Tabby'], dtype=object)

In [186]:
cats_notDSH = onlyCats[onlyCats["breeds.primary"]!="Domestic Short Hair"]
cats_notDSH.shape # only return cat's not domestic short hair 

(9752, 50)

In [189]:
pd.set_option('display.max_rows', None)
cats_notDSH['breeds.primary'].value_counts() 

Domestic Medium Hair                     2288
Domestic Long Hair                       1539
Tabby                                    1416
Siamese                                   766
Tuxedo                                    435
Calico                                    416
American Shorthair                        345
Tortoiseshell                             337
Russian Blue                              262
Maine Coon                                227
Bombay                                    154
Dilute Calico                             138
Torbie                                    134
Tiger                                     124
American Bobtail                          112
Dilute Tortoiseshell                      110
Manx                                      100
Turkish Van                                80
Extra-Toes Cat / Hemingway Polydactyl      79
Bengal                                     68
Abyssinian                                 67
Snowshoe                          

**Distinguish Cat Findings**
1. There is no way to distinguish Domestic Short Hairs from one another.
2. Upside: cat breeds are more diverse than the data suggested at first glance. Still, an ML system may learn far more about Domestic Short Hairs than about other breeds, so modeling and app creation may need to account for a potential bias towards Domestic Short Hairs.
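
One way to quantify the imbalance, and one possible way to offset it later, is inverse-frequency weighting. The sketch below uses the Domestic Short Hair count and the four largest other breed counts from the outputs above. The weighting formula is the same heuristic scikit-learn uses for `class_weight='balanced'`. Whether the eventual model uses class weights at all is an assumption; the notebook does not decide this.

```python
import pandas as pd

# Counts copied from the notebook outputs above (only the five largest cat breeds).
breed_counts = pd.Series({
    "Domestic Short Hair": 17806,
    "Domestic Medium Hair": 2288,
    "Domestic Long Hair": 1539,
    "Tabby": 1416,
    "Siamese": 766,
})

# Share of each breed within this subset.
share = breed_counts / breed_counts.sum()

# Inverse-frequency weights: n_samples / (n_classes * class_count).
weights = breed_counts.sum() / (len(breed_counts) * breed_counts)

print(pd.DataFrame({"share": share.round(3), "weight": weights.round(2)}))
```

Rare breeds receive weights above 1 and Domestic Short Hair a weight below 1, so a weighted loss would pay proportionally more attention to the less common breeds.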

### Search orgs in the Petfinder database <a id='orgs'></a>

`Checking out organization data in database`

In [5]:
wa_orgs = pf.organizations(location='Seattle, WA', distance=50, sort='distance', pages=None, return_df=True)
wa_orgs.shape # got small slice from Seattle, WA

(152, 30)

In [7]:
wa_orgs.sample(3) #manually inspect output

Unnamed: 0,id,name,email,phone,url,website,mission_statement,photos,distance,address.address1,...,hours.saturday,hours.sunday,adoption.policy,adoption.url,social_media.facebook,social_media.twitter,social_media.youtube,social_media.instagram,social_media.pinterest,organization_id
28,WA40,Seattle Area Feline Rescue,adoptions@seattleareafelinerescue.org,(206) 659-6220,https://www.petfinder.com/member/us/wa/shoreli...,https://www.seattleareafelinerescue.org/,SAFe Rescue saves feline lives by taking in ho...,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,14.4302,14717 Aurora Ave N,...,1:00-7:00pm,1:00-7:00pm,Our full adoption details are available on our...,,https://www.facebook.com/SeattleFelineRescue,https://twitter.com/SAFeRescueCats,https://www.youtube.com/channel/UCjW2hnd1GNLtJ...,https://www.instagram.com/seattleareafelineres...,,wa40
50,WA723,Sammamish Animal Sanctuary,diane@sammamishanimalsanctuary.com,(425) 829-5037,https://www.petfinder.com/member/us/wa/sammami...,https://www.sammamishanimalsanctuary.com/adopt...,Sammamish Animal Sanctuary is a safe haven whe...,[],21.6374,,...,,,Telephone interview plus photos of setup\r\nMe...,,,,,,,wa723
5,WA675,Seattle Dogs Homeless Program,seattledogs@hotmail.com,(206) 519-1697,https://www.petfinder.com/member/us/wa/seattle...,https://seattledogs.info,We are a street outreach program and foster ba...,[{'small': 'https://dl5zpyw5k3jeb.cloudfront.n...,3.17,,...,,,"Please do not call us, all applications must b...",https://seattledogs.info/forms,https://facebook.com/seattlesdogs,https://twitter.com/werseattledogs,,https://instagram.com/werseattledogs,,wa675


**Org Findings**
1. There is no 'kill' or 'no-kill' shelter designation, even when you query org-specific data only.

# Data Augmentation Possibilities <a id='aug'></a>

`The chief source of possible data augmentation for this dataset is to pull in other existing data sources about Breeds that have been well documented. One of the best sources for dog breed information is AKC, of which a project already exists to pull this data: https://github.com/tmfilho/akcdata. Let's look at the data we have on hand below.`

In [13]:
akc = pd.read_csv("../data/external/akc-data-2020-05-18.csv",header=0,index_col=0)
akc.shape

(277, 20)

In [193]:
akc.head(3) # sample data from akc pull

Unnamed: 0,description,temperament,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,group,grooming_frequency_value,grooming_frequency_category,shedding_value,shedding_category,energy_level_value,energy_level_category,trainability_value,trainability_category,demeanor_value,demeanor_category
Affenpinscher,The Affen’s apish look has been described many...,"Confident, Famously Funny, Fearless",148,22.86,29.21,3.175147,4.535924,12.0,15.0,Toy Group,0.6,2-3 Times a Week Brushing,0.6,Seasonal,0.6,Regular Exercise,0.8,Easy Training,1.0,Outgoing
Afghan Hound,"The Afghan Hound is an ancient breed, his whol...","Dignified, Profoundly Loyal, Aristocratic",113,63.5,68.58,22.679619,27.215542,12.0,15.0,Hound Group,0.8,Daily Brushing,0.2,Infrequent,0.8,Energetic,0.2,May be Stubborn,0.2,Aloof/Wary
Airedale Terrier,The Airedale Terrier is the largest of all ter...,"Friendly, Clever, Courageous",60,58.42,58.42,22.679619,31.751466,11.0,14.0,Terrier Group,0.6,2-3 Times a Week Brushing,0.4,Occasional,0.6,Regular Exercise,1.0,Eager to Please,0.8,Friendly


In [213]:
pd.options.display.max_seq_items = 300
akcList = akc.index  # breed names from AKC data
akcList

Index(['Affenpinscher', 'Afghan Hound', 'Airedale Terrier', 'Akita',
       'Alaskan Malamute', 'American Bulldog', 'American English Coonhound',
       'American Eskimo Dog', 'American Foxhound', 'American Hairless Terrier',
       'American Leopard Hound', 'American Staffordshire Terrier',
       'American Water Spaniel', 'Anatolian Shepherd Dog',
       'Appenzeller Sennenhund', 'Australian Cattle Dog', 'Australian Kelpie',
       'Australian Shepherd', 'Australian Stumpy Tail Cattle Dog',
       'Australian Terrier', 'Azawakh', 'Barbet', 'Basenji',
       'Basset Fauve de Bretagne', 'Basset Hound',
       'Bavarian Mountain Scent Hound', 'Beagle', 'Bearded Collie',
       'Beauceron', 'Bedlington Terrier', 'Belgian Laekenois',
       'Belgian Malinois', 'Belgian Sheepdog', 'Belgian Tervuren',
       'Bergamasco Sheepdog', 'Berger Picard', 'Bernese Mountain Dog',
       'Bichon Frise', 'Biewer Terrier', 'Black and Tan Coonhound',
       'Black Russian Terrier', 'Bloodhound', 'Blueti

In [203]:
onlyDogs = fullFrame[fullFrame["type"]=="Dog"]
petFinderList = onlyDogs["breeds.primary"].unique() # all unique breeds.primary values for dogs
petFinderList.shape

(223,)

In [207]:
Inter = list(set(akcList).intersection(petFinderList))
len(Inter)

144

In [208]:
pd.options.display.max_seq_items = 300
Inter

['Boxer',
 'Alaskan Malamute',
 'Brussels Griffon',
 'American Staffordshire Terrier',
 'Afghan Hound',
 'Dutch Shepherd',
 'Italian Greyhound',
 'Papillon',
 'Jindo',
 'Border Terrier',
 'Field Spaniel',
 'Smooth Fox Terrier',
 'Beauceron',
 'English Cocker Spaniel',
 'Great Pyrenees',
 'Cairn Terrier',
 'Australian Kelpie',
 'Doberman Pinscher',
 'Rat Terrier',
 'Pyrenean Shepherd',
 'Mountain Cur',
 'Ibizan Hound',
 'Dogo Argentino',
 'Norwich Terrier',
 'Siberian Husky',
 'Miniature Schnauzer',
 'Maltese',
 'Bichon Frise',
 'Icelandic Sheepdog',
 'Pug',
 'Coton de Tulear',
 'Neapolitan Mastiff',
 'Portuguese Podengo',
 'Pekingese',
 'Border Collie',
 'Basenji',
 'Lhasa Apso',
 'Rhodesian Ridgeback',
 'Saluki',
 'Greyhound',
 'Boston Terrier',
 'Samoyed',
 'Bluetick Coonhound',
 'Leonberger',
 'Silky Terrier',
 'Canaan Dog',
 'Nova Scotia Duck Tolling Retriever',
 'Irish Terrier',
 'Dandie Dinmont Terrier',
 'Vizsla',
 'Kuvasz',
 'Spinone Italiano',
 'American Water Spaniel',
 'Cata

In [214]:
inAKCnotPF = set(akcList).difference(set(petFinderList)) # exists in AKC but not in Petfinder
len(inAKCnotPF) # number of breeds found in AKC but not Petfinder

133

In [215]:
pd.options.display.max_seq_items = 500
inAKCnotPF

{'American English Coonhound',
 'American Hairless Terrier',
 'American Leopard Hound',
 'Anatolian Shepherd Dog',
 'Appenzeller Sennenhund',
 'Australian Cattle Dog',
 'Australian Stumpy Tail Cattle Dog',
 'Azawakh',
 'Barbet',
 'Basset Fauve de Bretagne',
 'Bavarian Mountain Scent Hound',
 'Bedlington Terrier',
 'Belgian Laekenois',
 'Belgian Malinois',
 'Belgian Sheepdog',
 'Belgian Tervuren',
 'Bergamasco Sheepdog',
 'Berger Picard',
 'Biewer Terrier',
 'Bohemian Shepherd',
 'Bolognese',
 'Borzoi',
 'Bouvier des Flandres',
 'Bracco Italiano',
 'Braque Francais Pyrenean',
 'Braque du Bourbonnais',
 'Brittany',
 'Broholmer',
 'Bulldog',
 'Caucasian Shepherd Dog',
 'Central Asian Shepherd Dog',
 'Cesky Terrier',
 'Chinese Crested',
 'Chinese Shar-Pei',
 'Chinook',
 'Cirneco dell’Etna',
 'Clumber Spaniel',
 'Croatian Sheepdog',
 'Czechoslovakian Vlcak',
 'Danish-Swedish Farmdog',
 'Deutscher Wachtelhund',
 'Drentsche Patrijshond',
 'Drever',
 'English Foxhound',
 'English Toy Spaniel',

In [216]:
inPFnotAKC = set(petFinderList).difference(set(akcList)) # exists in Petfinder but not in AKC
len(inPFnotAKC) # number of breeds found in Petfinder but not AKC

79

In [217]:
pd.options.display.max_seq_items = 500
inPFnotAKC

{'Akbash',
 'American Bully',
 'Anatolian Shepherd',
 'Aussiedoodle',
 'Australian Cattle Dog / Blue Heeler',
 'Belgian Shepherd / Laekenois',
 'Belgian Shepherd / Malinois',
 'Belgian Shepherd / Sheepdog',
 'Belgian Shepherd / Tervuren',
 'Bernedoodle',
 'Black Labrador Retriever',
 'Black Mouth Cur',
 'Blue Lacy',
 'Brittany Spaniel',
 'Cattle Dog',
 'Cavachon',
 'Cavapoo',
 'Chinese Crested Dog',
 'Chiweenie',
 'Chocolate Labrador Retriever',
 'Cockapoo',
 'Coonhound',
 'Corgi',
 'English Bulldog',
 'English Coonhound',
 'English Pointer',
 'English Shepherd',
 'Eskimo Dog',
 'Feist',
 'Fox Terrier',
 'Foxhound',
 'Goldendoodle',
 'Hound',
 'Husky',
 'Jack Russell Terrier',
 'Klee Kai',
 'Labradoodle',
 'Maltipoo',
 'Manchester Terrier',
 'Maremma Sheepdog',
 'McNab',
 'Miniature Dachshund',
 'Miniature Poodle',
 'Mixed Breed',
 'Morkie',
 'Mountain Dog',
 'Newfoundland Dog',
 'Patterdale Terrier / Fell Terrier',
 'Petit Basset Griffon Vendeen',
 'Pit Bull Terrier',
 'Pomsky',
 'Poo

**AKC Augmentation Findings**
1. Data has already been pulled from AKC, which saves a lot of time.
2. Several categorical columns can be used to augment the Petfinder dataset (e.g. trainability category, energy level category, ...).
3. Of the 223 dog breed values found in Petfinder, 144 match an AKC breed name exactly, so there is some value in merging the AKC dog breed data with Petfinder.
4. Periodic data updates will be required in the future to account for single-breed dogs not currently in the dataset, but AKC has a large list to compare against.
5. As expected, AKC does not cover unofficial or mixed dog breeds.
6. AKC does not cover cats, so augmenting dogs would leave cats with many NA values in the new columns.
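
The merge suggested by these findings might be sketched as follows. Several Petfinder labels differ from AKC names only in spelling (for example 'Brittany Spaniel' versus 'Brittany', both visible in the set differences above), so a small alias map applied before the join could raise the match count beyond 144. The `PF_TO_AKC` map, the `akc_breed` helper column, and both toy frames are assumptions for illustration. The `group` values are placeholders rather than a copy of the AKC file.

```python
import pandas as pd

# Hypothetical alias map from Petfinder breed labels to AKC breed names.
PF_TO_AKC = {
    "Australian Cattle Dog / Blue Heeler": "Australian Cattle Dog",
    "Belgian Shepherd / Malinois": "Belgian Malinois",
    "Brittany Spaniel": "Brittany",
    "English Bulldog": "Bulldog",
}

# Toy stand-ins for fullFrame and akc (the real akc frame is indexed by breed name).
pets = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "breeds.primary": ["Boxer", "Brittany Spaniel", "Mixed Breed", "Domestic Short Hair"],
})
akc_toy = pd.DataFrame(
    {"group": ["Working Group", "Sporting Group"]},
    index=["Boxer", "Brittany"],
)

# Normalize Petfinder labels, then left-join so unmatched animals are kept with NA values.
pets["akc_breed"] = pets["breeds.primary"].replace(PF_TO_AKC)
akc_lookup = akc_toy.rename_axis("akc_breed").reset_index()
merged = pets.merge(akc_lookup, how="left", on="akc_breed", indicator=True)

# Fraction of animals that picked up AKC attributes.
match_rate = (merged["_merge"] == "both").mean()
print(merged[["id", "breeds.primary", "group", "_merge"]])
print("match rate:", match_rate)
```

The left join keeps mixed breeds and cats in the result with NA in the AKC columns, which is the behaviour finding 6 warns about. The `_merge` indicator column makes those rows easy to count or filter.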

# Conclusion <a id='conclusion'></a>

**Key Conclusions from PetMatch EDA**
1. The Petfinder dataset is usable for modeling a recommendation system for dogs and cats.
2. Some cleaning and transformation of categorical data will be needed, but the potential animal features are not correlated.
3. The dataset does not include user ratings (Y), so those would have to be generated as users use the system, and the system would face a cold start.
4. The dataset does not tag organizations as 'kill' or 'no-kill' shelters. This would need to be added to the dataset manually.
5. Purebred dogs in the dataset can be augmented with AKC breed data.