**Author: Autumn Salsberry**

Contact: <a href="https://www.linkedin.com/in/salsbeas/">linkedin.com/in/salsbeas</a>

BrainStation Capstone: Predicting Adoptability of Shelter Dogs

August 8, 2022


The sole purpose of this notebook is to understand the datasets I have and how they relate to each other. Three of these datasets were downloaded together at this <a href="https://data.world/rdowns26/austin-animal-shelter">link</a> and are sourced from an Austin, TX animal shelter. The other dataset is from King County Animal Shelter (Seattle, WA) and was downloaded at this <a href="https://data.world/kingcounty/yaai-7frk">link</a>. It contains fewer details about each animal, but it includes images and labels clarifying whether each pet is adoptable. 

# Table of Contents

* [Looking at `austin_breed_info`](#breed)
    * [Key Points](#breedkey)
* [Looking at `austin_intakes`](#intake)
    * [Key Points](#intakekey)
* [Looking at `austin_outcomes`](#outcome) 
    * [Key Points](#outcomekey)
* [Looking at `adoptable_images`](#images)
    * [Key Points](#imageskey)
* [Goal Attainment Assessment](#goal)

In [2]:
#only need pandas for this one
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [3]:
#these three datasets are from an Austin, TX shelter
austin_breed_info = pd.read_csv('Tabular Data/all_records.csv')
austin_intakes = pd.read_csv('Tabular Data/austin_animal_center_intakes.csv')
austin_outcomes = pd.read_csv('Tabular Data/austin_animal_center_outcomes.csv')


#this one is from King County
adoptable_images = pd.read_csv('Tabular Data/lost_found_adoptable_pets.csv')

# Big Goals
* To predict the adoptability of any particular dog that is in an animal shelter
    * I will use "outcome type" adopted versus any other outcome
        * I will also consider if this should be adjusted depending on the data
        * Consider if the dog's name has any relationship with adoptability 
* Predict the length of time before the dog is adopted 
    * This would be modeled using only the data for dogs that were adopted
* I will use images of dogs up for adoption as the input of this model to predict if it will be adopted or not
    * the images will provide breed and color of a dog
    * additional information may be needed to predict adoptability 

# Looking at `austin_breed_info`
<a class="anchor" id="breed"></a>

This data frame should contain all of the data from the intake and outcome data, but I would like to confirm that. If there is a common key between this and the other two data frames, I will consider concatenating them to ensure no important information is forgotten. I will start by exploring the data frame. 

In [4]:
#the shape is (76977, 38)
austin_breed_info.shape

(76977, 38)

In [5]:
#see what some of the data inputs are
austin_breed_info.head()

Unnamed: 0.1,Unnamed: 0,Animal ID,Name_intake,DateTime_intake,MonthYear_intake,Found_Location,Intake_Type,IntakeCondition,Animal_Type_intake,Sex,...,beagle,terrier,boxer,poodle,rottweiler,dachshund,chihuahua,pit bull,DateTime_length,Days_length
0,0,A730601,,2016-07-07 12:11:00,07/07/2016 12:11:00 PM,1109 Shady Ln in Austin (TX),Stray,Normal,Cat,Intact Male,...,0,0,0,0,0,0,0,0,0 days 20:49:00.000000000,0-7 days
1,1,A683644,*Zoey,2014-07-13 11:02:00,07/13/2014 11:02:00 AM,Austin (TX),Owner Surrender,Nursing,Dog,Intact Female,...,0,0,0,0,0,0,0,0,115 days 23:04:00.000000000,12 weeks - 6 months
2,2,A676515,Rico,2014-04-11 08:45:00,04/11/2014 08:45:00 AM,615 E. Wonsley in Austin (TX),Stray,Normal,Dog,Intact Male,...,0,0,0,0,0,0,0,1,3 days 09:53:00.000000000,0-7 days
3,3,A742953,,2017-01-31 13:30:00,01/31/2017 01:30:00 PM,S Hwy 183 And Thompson Lane in Austin (TX),Stray,Normal,Dog,Intact Male,...,0,0,0,0,0,0,0,0,4 days 00:47:00.000000000,0-7 days
4,4,A679549,*Gilbert,2014-05-22 15:43:00,05/22/2014 03:43:00 PM,124 W Anderson in Austin (TX),Stray,Normal,Cat,Intact Male,...,0,0,0,0,0,0,0,0,24 days 22:11:00.000000000,3-6 weeks


In [6]:
#all of the column names, data types and non-null values
austin_breed_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76977 entries, 0 to 76976
Data columns (total 38 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          76977 non-null  int64 
 1   Animal ID           76977 non-null  object
 2   Name_intake         57493 non-null  object
 3   DateTime_intake     76977 non-null  object
 4   MonthYear_intake    76977 non-null  object
 5   Found_Location      76977 non-null  object
 6   Intake_Type         76977 non-null  object
 7   IntakeCondition     76977 non-null  object
 8   Animal_Type_intake  76977 non-null  object
 9   Sex                 76976 non-null  object
 10  Age                 76977 non-null  object
 11  Breed_intake        76977 non-null  object
 12  Color_intake        76977 non-null  object
 13  Name_outcome        57493 non-null  object
 14  DateTime_outcome    76977 non-null  object
 15  MonthYear_outcome   76977 non-null  object
 16  Outcome_Type        76

The data is sufficiently large with 76,977 rows. I will have to do a lot of data cleaning before I can use this information for modeling because there is a lot of categorical data. There are some null values, but not so many that I am discouraged from using this dataset. 
Now I will look into some of the key variables in this dataset to get a better idea of what I will be working with. 

In [7]:
# There are 5 reasons why animals are accepted into the shelter
austin_breed_info['Intake_Type'].unique()

array(['Stray', 'Owner Surrender', 'Wildlife', 'Public Assist',
       'Euthanasia Request'], dtype=object)

In [8]:
#There are 8 subreasons that help explain the intake type above
austin_breed_info['IntakeCondition'].unique()

array(['Normal', 'Nursing', 'Injured', 'Sick', 'Aged', 'Feral', 'Other',
       'Pregnant'], dtype=object)

In [9]:
# There are 9 reasons why animals leave the shelter (plus NaN)
austin_breed_info['Outcome_Type'].unique()

array(['Transfer', 'Adoption', 'Return to Owner', 'Euthanasia',
       'Disposal', 'Died', 'Rto-Adopt', 'Missing', nan, 'Relocate'],
      dtype=object)

In [10]:
#There are 18 subreasons that help explain the outcome type above (plus NaN)
austin_breed_info['Outcome_Subtype'].unique()

array(['SCRP', 'Foster', nan, 'Partner', 'Medical', 'Aggressive',
       'Rabies Risk', 'Suffering', 'Offsite', 'Behavior', 'In Kennel',
       'Underage', 'Court/Investigation', 'In Foster', 'Possible Theft',
       'At Vet', 'Enroute', 'In Surgery', 'Barn'], dtype=object)

In [11]:
#there are 5 animal categories represented in the data
austin_breed_info[['Animal ID','Animal_Type_intake']].groupby('Animal_Type_intake').count()

Unnamed: 0_level_0,Animal ID
Animal_Type_intake,Unnamed: 1_level_1
Bird,254
Cat,25125
Dog,48097
Livestock,8
Other,3493


In [12]:
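#count the dog rows directly instead of hard-coding the number
dogs = (austin_breed_info['Animal_Type_intake'] == 'Dog').sum()
f"There are {dogs} dogs in this dataset which is {round(dogs/austin_breed_info['Animal_Type_intake'].count()*100, 2)}% of the data."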

'There are 48097 dogs in this dataset which is 62.48% of the data.'

There is a relatively clear explanation of why animals are accepted into the shelter and there are no null values. For the outcome information, there is a lot more variation and there are some NaN values. This could be because the animal was still in the shelter at the time of data collection. There are 5 categories of animals in the data and luckily dogs are the largest portion of the data - around 62%. Finally, let's look at the number of NaN values in each column. 

In [13]:
#checking the null values for each column
austin_breed_info.isna().sum()

Unnamed: 0                0
Animal ID                 0
Name_intake           19484
DateTime_intake           0
MonthYear_intake          0
Found_Location            0
Intake_Type               0
IntakeCondition           0
Animal_Type_intake        0
Sex                       1
Age                       0
Breed_intake              0
Color_intake              0
Name_outcome          19484
DateTime_outcome          0
MonthYear_outcome         0
Outcome_Type              7
Outcome_Subtype       45254
Sex_upon_Outcome          4
Age_upon_Outcome         21
gender_intake          5608
gender_outcome         5611
fixed_intake              1
fixed_outcome             4
fixed_changed             0
Age_Bucket                0
retriever                 0
shepherd                  0
beagle                    0
terrier                   0
boxer                     0
poodle                    0
rottweiler                0
dachshund                 0
chihuahua                 0
pit bull            

In [14]:
austin_breed_info['gender_intake'].unique()

array(['Male', 'Female', nan], dtype=object)

In [15]:
austin_breed_info['gender_outcome'].unique()

array(['Male', 'Female', nan], dtype=object)

In [16]:
austin_breed_info['Sex_upon_Outcome'].unique()

array(['Neutered Male', 'Spayed Female', 'Intact Male', 'Unknown',
       'Intact Female', nan], dtype=object)

It looks like a lot of pet names are missing. I will still consider this in future EDA, but it seems unlikely that I will be able to use it as an independent variable for modeling purposes. Quite a lot of outcome subtypes are missing, but considering the unique categories in this column above, it would be okay to drop this column altogether. There is some gender data missing, but it looks like it can be filled in using either of the sex columns. Finally, there are a lot of `Days_length` values missing, but there are no `DateTime_length` values missing, so I don't think this will be an issue. 
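As a rough sketch of how the gender gaps might be filled from a sex column, the cell below uses a small made-up data frame that only borrows the column names `Sex_upon_Outcome` and `gender_outcome`; the values and the `gender_filled` column are assumptions for illustration. Rows whose sex is recorded as `Unknown` would stay missing under this approach.

```python
import pandas as pd

# Toy rows that mimic two columns of `austin_breed_info` (values are illustrative)
toy = pd.DataFrame({
    "Sex_upon_Outcome": ["Neutered Male", "Spayed Female", "Intact Male", "Unknown", None],
    "gender_outcome": ["Male", None, None, None, None],
})

# Pull "Male"/"Female" out of the sex string; "Unknown" and missing values stay missing
extracted = toy["Sex_upon_Outcome"].str.extract(r"\b(Male|Female)\b", expand=False)

# Only fill the gaps, keeping any gender value that is already present
toy["gender_filled"] = toy["gender_outcome"].fillna(extracted)
print(toy)
```

The same pattern should apply to `gender_intake` with the intake sex column, assuming its values follow the same "Intact Male" style.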

## Key Points
<a class="anchor" id="breedkey"></a>

* Has animal ID column that hopefully is a common key with intake and outcome data frames
* Contains several one-hot encoded breed columns
* Has 48097 dogs
* Has some missing data, but no crucial information
* There is a column that tells me the length of time an animal is in the shelter
    * I can use this to predict the length of time to adoption
* I will have to consider which breeds are represented in both the tabular and image data
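To use the length-of-stay column as a numeric target, the `DateTime_length` strings would first need to be parsed. A minimal sketch, assuming the strings keep the format shown in the `head()` output above (the three sample values are copied from there; `days_in_shelter` is a name I made up):

```python
import pandas as pd

# Strings in the same format as the `DateTime_length` values shown in head()
lengths = pd.Series([
    "0 days 20:49:00.000000000",
    "115 days 23:04:00.000000000",
    "3 days 09:53:00.000000000",
])

# Convert to Timedelta, then to fractional days for use as a regression target
stay = pd.to_timedelta(lengths)
days_in_shelter = stay.dt.total_seconds() / 86400
print(days_in_shelter.round(2))
```

Fractional days keep more information than the `Days_length` buckets, which may matter if time to adoption is modeled as a continuous value.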

# Looking at `austin_intakes`
<a class="anchor" id="intake"></a>

Theoretically, all of the data from this data frame should already be represented in the dataset above. I will explore the data to confirm this is true. 

In [18]:
#the shape is (63328, 12)
austin_intakes.shape

(63328, 12)

This dataset has about 13,600 fewer rows than the previous one (63,328 versus 76,977), so the two do not match row for row. 

In [19]:
#take a look at a few rows in this dataset
austin_intakes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A730601,,07/07/2016 12:11:00 PM,07/07/2016 12:11:00 PM,1109 Shady Ln in Austin (TX),Stray,Normal,Cat,Intact Male,7 months,Domestic Shorthair Mix,Blue Tabby
1,A683644,*Zoey,07/13/2014 11:02:00 AM,07/13/2014 11:02:00 AM,Austin (TX),Owner Surrender,Nursing,Dog,Intact Female,4 weeks,Border Collie Mix,Brown/White
2,A676515,Rico,04/11/2014 08:45:00 AM,04/11/2014 08:45:00 AM,615 E. Wonsley in Austin (TX),Stray,Normal,Dog,Intact Male,2 months,Pit Bull Mix,White/Brown
3,A742953,,01/31/2017 01:30:00 PM,01/31/2017 01:30:00 PM,S Hwy 183 And Thompson Lane in Austin (TX),Stray,Normal,Dog,Intact Male,2 years,Saluki,Sable/Cream
4,A679549,*Gilbert,05/22/2014 03:43:00 PM,05/22/2014 03:43:00 PM,124 W Anderson in Austin (TX),Stray,Normal,Cat,Intact Male,1 month,Domestic Shorthair Mix,Black/White


This doesn't look too different from the first data frame.

In [20]:
#check what columns there are
austin_intakes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63328 entries, 0 to 63327
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Animal ID         63328 non-null  object
 1   Name              44107 non-null  object
 2   DateTime          63328 non-null  object
 3   MonthYear         63328 non-null  object
 4   Found Location    63328 non-null  object
 5   Intake Type       63328 non-null  object
 6   Intake Condition  63328 non-null  object
 7   Animal Type       63328 non-null  object
 8   Sex upon Intake   63327 non-null  object
 9   Age upon Intake   63328 non-null  object
 10  Breed             63328 non-null  object
 11  Color             63328 non-null  object
dtypes: object(12)
memory usage: 5.8+ MB


In [22]:
#how many breeds are listed and what do they look like
pd.DataFrame(austin_intakes['Breed'].unique())

Unnamed: 0,0
0,Domestic Shorthair Mix
1,Border Collie Mix
2,Pit Bull Mix
3,Saluki
4,Domestic Medium Hair Mix
...,...
1969,Brittany/Border Collie
1970,Tortoise
1971,Akita/Mastiff
1972,Dachshund Longhair/Maltese


In [23]:
#compared with the similar column in the previous df
pd.DataFrame(austin_breed_info['Breed_intake'].unique())

Unnamed: 0,0
0,Domestic Shorthair Mix
1,Border Collie Mix
2,Pit Bull Mix
3,Saluki
4,Domestic Medium Hair Mix
...,...
1964,Australian Kelpie/Border Collie
1965,Brittany/Border Collie
1966,Tortoise
1967,Akita/Mastiff


In [21]:
#how many colors are there and how are they listed
pd.DataFrame(austin_intakes['Color'].unique())

Unnamed: 0,0
0,Blue Tabby
1,Brown/White
2,White/Brown
3,Sable/Cream
4,Black/White
...,...
482,Brown Tabby/Gray Tabby
483,Brown/Pink
484,Cream/Yellow
485,Yellow/Orange Tabby


In [24]:
#compared with the similar column in the previous df
pd.DataFrame(austin_breed_info['Color_intake'].unique())

Unnamed: 0,0
0,Blue Tabby
1,Brown/White
2,White/Brown
3,Sable/Cream
4,Black/White
...,...
479,Cream/Silver
480,Brown Tabby/Gray Tabby
481,Brown/Pink
482,Cream/Yellow


It looks like these columns are already represented in the `austin_breed_info` df, so I won't need to concatenate them onto the df. There are approximately 2,000 breeds listed (many are mixed breeds) and 500 colors (many have multiple colors), so I will need to explore these more to determine how to include them in the modeling data frame. 
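One possible way to tame the roughly 2,000 breed labels is to split each into a primary breed and a mixed-breed flag. The sketch below runs on a handful of breed strings taken from the outputs above; the rule that the first breed listed is the "primary" one is my assumption, and `primary_breed` and `is_mix` are hypothetical column names.

```python
import pandas as pd

# A few breed labels copied from the unique() outputs above
breeds = pd.Series([
    "Border Collie Mix",
    "Pit Bull Mix",
    "Saluki",
    "Brittany/Border Collie",
    "Dachshund Longhair/Maltese",
])

# Flag mixed dogs: either labelled "Mix" or listed as two breeds separated by "/"
is_mix = breeds.str.endswith(" Mix") | breeds.str.contains("/", regex=False)

# Primary breed: drop a trailing " Mix", then keep the text before the first "/"
primary = (
    breeds.str.replace(r"\s+Mix$", "", regex=True)
          .str.split("/").str[0]
)

breed_features = pd.DataFrame({"breed": breeds, "primary_breed": primary, "is_mix": is_mix})
print(breed_features)
```

The same split on "/" could be tried for the color column, since many colors are also listed as pairs.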

In [22]:
#checking the animal ratios
austin_intakes[['Animal ID', 'Animal Type']].groupby('Animal Type').count()

Unnamed: 0_level_0,Animal ID
Animal Type,Unnamed: 1_level_1
Bird,255
Cat,23408
Dog,36173
Livestock,8
Other,3484


The animal proportions look about the same, but there are noticeably fewer dogs in this dataset (36,173 versus 48,097). 

In [23]:
austin_intakes.isna().sum()

Animal ID               0
Name                19221
DateTime                0
MonthYear               0
Found Location          0
Intake Type             0
Intake Condition        0
Animal Type             0
Sex upon Intake         1
Age upon Intake         0
Breed                   0
Color                   0
dtype: int64

Similar to the previous dataset, many of the names are missing, along with one value for sex upon intake. I am not worried about this. 

## Key Points
<a class="anchor" id="intakekey"></a>

* Has an Animal ID column that the datasets can likely be joined on
* Breed and color values closely match those already in the previous data frame
    * No need to concatenate these columns onto the previous data frame
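Before relying on Animal ID as a join key, it may help to check how many IDs the frames actually share. A minimal sketch with two made-up frames (`combined` and `intakes` are hypothetical stand-ins, and the IDs `A000001` and `A999999` are invented):

```python
import pandas as pd

# Hypothetical stand-ins for two of the frames; only the key column matters here
combined = pd.DataFrame({"Animal ID": ["A730601", "A683644", "A676515", "A000001"]})
intakes = pd.DataFrame({"Animal ID": ["A730601", "A683644", "A676515", "A999999"]})

combined_ids = set(combined["Animal ID"])
intake_ids = set(intakes["Animal ID"])

# IDs present in both frames, and IDs unique to each side
shared = combined_ids & intake_ids
only_combined = combined_ids - intake_ids
only_intakes = intake_ids - combined_ids

print(f"shared: {len(shared)}, only in combined: {len(only_combined)}, only in intakes: {len(only_intakes)}")
```

If nearly every ID falls in the shared set on the real data, Animal ID is a reasonable candidate key.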

# Looking at `austin_outcomes`
<a class="anchor" id="outcome"></a>

Similarly to the intake data frame, all of this data should already be represented in the first dataset above. I will explore the data to confirm this is true. 

In [24]:
#the shape is (63643, 12)
austin_outcomes.shape

(63643, 12)

This is a similar shape to the intake data frame, but not the exact same size. It is strange that the data frame that should include both intake and outcome data is longer than both by about 13,000 rows, but I will explore a little further. 
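One possible explanation (an assumption on my part, since I don't know how `all_records.csv` was built) is that some animals visit the shelter more than once, so a join on Animal ID alone would pair every intake with every outcome for that animal. The toy example below shows the effect with invented IDs:

```python
import pandas as pd

# Hypothetical rows: animal A1 enters and leaves the shelter twice, A2 once
intakes = pd.DataFrame({
    "Animal ID": ["A1", "A1", "A2"],
    "Intake Type": ["Stray", "Stray", "Owner Surrender"],
})
outcomes = pd.DataFrame({
    "Animal ID": ["A1", "A1", "A2"],
    "Outcome Type": ["Return to Owner", "Adoption", "Transfer"],
})

# Rows whose Animal ID appears more than once in the intake data
repeat_rows = intakes["Animal ID"].duplicated(keep=False).sum()

# Joining on the ID alone pairs each A1 intake with each A1 outcome (2 x 2 rows)
merged = intakes.merge(outcomes, on="Animal ID", how="inner")
print(repeat_rows, len(merged))
```

Checking `duplicated` on the real frames would show whether repeat visits are common enough to account for the extra rows.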

In [25]:
#have a look at the rows in the dataset
austin_outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A741715,*Pebbles,01/11/2017 06:17:00 PM,01/11/2017 06:17:00 PM,03/07/2016,Adoption,,Cat,Spayed Female,10 months,Domestic Shorthair Mix,Calico
1,A658751,Benji,11/13/2016 01:38:00 PM,11/13/2016 01:38:00 PM,07/14/2011,Return to Owner,,Dog,Neutered Male,5 years,Border Terrier Mix,Tan
2,A721285,,02/24/2016 02:42:00 PM,02/24/2016 02:42:00 PM,02/24/2014,Euthanasia,Suffering,Other,Unknown,2 years,Raccoon Mix,Black/Gray
3,A707443,,07/13/2015 01:50:00 PM,07/13/2015 01:50:00 PM,06/21/2015,Transfer,Partner,Cat,Intact Female,3 weeks,Domestic Longhair Mix,Black Smoke
4,A684346,,07/22/2014 04:04:00 PM,07/22/2014 04:04:00 PM,07/07/2014,Transfer,Partner,Cat,Intact Male,2 weeks,Domestic Shorthair Mix,Orange Tabby


Looks pretty similar to the previous two data frames. 

In [26]:
#column names in the data frame
austin_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63643 entries, 0 to 63642
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Animal ID         63643 non-null  object
 1   Name              44458 non-null  object
 2   DateTime          63643 non-null  object
 3   MonthYear         63643 non-null  object
 4   Date of Birth     63643 non-null  object
 5   Outcome Type      63636 non-null  object
 6   Outcome Subtype   29729 non-null  object
 7   Animal Type       63643 non-null  object
 8   Sex upon Outcome  63639 non-null  object
 9   Age upon Outcome  63622 non-null  object
 10  Breed             63643 non-null  object
 11  Color             63643 non-null  object
dtypes: object(12)
memory usage: 5.8+ MB


Breed and color are also present in this dataset like the one above. 

In [27]:
#what outcome types are there
pd.DataFrame(austin_outcomes['Outcome Type'].unique())

Unnamed: 0,0
0,Adoption
1,Return to Owner
2,Euthanasia
3,Transfer
4,Died
5,Disposal
6,Missing
7,Relocate
8,
9,Rto-Adopt


Outcome types look the same as in the first dataset above. There are some NaN values that I am concerned about though. 

In [28]:
austin_outcomes[['Animal ID', 'Outcome Type', 'Animal Type']].groupby(['Animal Type', 'Outcome Type']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Animal ID
Animal Type,Outcome Type,Unnamed: 2_level_1
Bird,Adoption,91
Bird,Died,4
Bird,Disposal,18
Bird,Euthanasia,64
Bird,Missing,1
Bird,Relocate,6
Bird,Return to Owner,7
Bird,Transfer,63
Cat,Adoption,9842
Cat,Died,317


Dogs are most often adopted, euthanized, returned to owner or transferred. It doesn't seem fair to say a dog that was euthanized is not adoptable, so I will need to consider this when I clean the data. 
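A sketch of how the target could be built with that concern in mind: outcomes that arguably say little about adoptability are dropped rather than labeled "not adopted". Which outcomes count as ambiguous is an assumption here (the `ambiguous` set below is only a starting guess), and the sample values are made up.

```python
import pandas as pd

# Made-up outcome values using the category names from the data
outcomes = pd.Series(
    ["Adoption", "Transfer", "Return to Owner", "Euthanasia", None, "Adoption"],
    name="Outcome Type",
)

# Outcomes that arguably say little about adoptability; this list is an assumption
ambiguous = {"Euthanasia", "Died", "Disposal", "Missing"}

# Keep rows with a known, unambiguous outcome, then encode adopted as 1 and anything else as 0
keep = outcomes.notna() & ~outcomes.isin(ambiguous)
adopted = (outcomes[keep] == "Adoption").astype(int)
print(adopted.tolist())
```

Dropping rows changes the base rate, so the adoption percentage would need to be recomputed after whatever filtering is finally chosen.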

In [29]:
#animal proportions
austin_outcomes[['Animal ID', 'Animal Type']].groupby('Animal Type').count()

Unnamed: 0_level_0,Animal ID
Animal Type,Unnamed: 1_level_1
Bird,254
Cat,23679
Dog,36219
Livestock,9
Other,3482


In [30]:
adopted_dogs = 16276
total_dogs = 36219

f"Of the {total_dogs} dogs, only {adopted_dogs} were adopted. That's a {round(adopted_dogs/total_dogs*100, 2)}% adoption rate (base rate)."

"Of the 36219 dogs, only 16276 were adopted. That's a 44.94% adoption rate (base rate)."

The animal counts are very similar to the intake data, but there are still fewer dogs than in the first data frame above. 

In [31]:
#checking for nulls
austin_outcomes.isna().sum()

Animal ID               0
Name                19185
DateTime                0
MonthYear               0
Date of Birth           0
Outcome Type            7
Outcome Subtype     33914
Animal Type             0
Sex upon Outcome        4
Age upon Outcome       21
Breed                   0
Color                   0
dtype: int64

Name data is also missing in this dataset, and the outcome subtype has a lot of NaN values, similar to the first data frame above. I am not concerned about this. 

## Key Points
<a class="anchor" id="outcomekey"></a>

* Has an Animal ID column that can hopefully be used as a common key
* Only 63,643 rows
* Has an outcome column with adopted or not information
* About 45% of dogs reported as being adopted

# Looking at `adoptable_images`
<a class="anchor" id="images"></a>

This is a very small dataset (409 rows) that might be used for real-world testing of the model. It is too small to use for modeling, but I would like to explore it a little more to see if it can be of use in the future. 

In [32]:
#the shape is (409, 25)
adoptable_images.shape

(409, 25)

In [33]:
#what do the rows in this data frame look like?
adoptable_images.head()

Unnamed: 0,impound_no,Animal_ID,Data_Source,Record_Type,Link,Current_Location,Animal_Name,animal_type,Age,Animal_Gender,...,City,State,Zip,jurisdiction,obfuscated_latitude,obfuscated_longitude,Image,image_alt_text,location_for_map,Memo
0,K17-103806,A542208,Regional Animal Services of King County,FOUND,http://petharbor.com/PublicDetail.asp?searchty...,In Public Home,,Cat,,,...,,,,JURISDICTION,,,http://www.petharbor.com/get_image.asp?RES=Det...,Image Copyright HLP Inc. 2017,96TH AVE NE AND NE 28TH ST\n,
1,K17-103558,A541114,Regional Animal Services of King County,LOST,http://petharbor.com/PublicDetail.asp?searchty...,LOST,Neko,Cat,,Male,...,,,,JURISDICTION,,,http://www.petharbor.com/get_image.asp?RES=Det...,Image Copyright HLP Inc. 2017,WALLINGFORD AVE N NEAR 150TH ST\n,
2,K17-104020,A543155,Regional Animal Services of King County,LOST,http://petharbor.com/PublicDetail.asp?searchty...,LOST,Trixie,Dog,,Female,...,,,,JURISDICTION,,,http://www.petharbor.com/get_image.asp?RES=Det...,Image Copyright HLP Inc. 2017,THOMAS RD\n,
3,K17-103994,A543033,Regional Animal Services of King County,FOUND,http://petharbor.com/PublicDetail.asp?searchty...,In Public Home,,Dog,,Female,...,APT,,,JURISDICTION,,,http://www.petharbor.com/get_image.asp?RES=Det...,Image Copyright HLP Inc. 2017,10000 MEYDENBAUER WAY SE\nAPT\n,
4,K17-104418,A544774,Regional Animal Services of King County,LOST,http://petharbor.com/PublicDetail.asp?searchty...,LOST,Kitty,Cat,,Female,...,,,,JURISDICTION,,,http://www.petharbor.com/get_image.asp?RES=Det...,Image Copyright HLP Inc. 2017,76TH AVE SE\n,


In [34]:
# have a look at the columns in this dataset
adoptable_images.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 409 entries, 0 to 408
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   impound_no            409 non-null    object 
 1   Animal_ID             409 non-null    object 
 2   Data_Source           409 non-null    object 
 3   Record_Type           409 non-null    object 
 4   Link                  409 non-null    object 
 5   Current_Location      398 non-null    object 
 6   Animal_Name           299 non-null    object 
 7   animal_type           409 non-null    object 
 8   Age                   228 non-null    object 
 9   Animal_Gender         399 non-null    object 
 10  Animal_Breed          409 non-null    object 
 11  Animal_Color          409 non-null    object 
 12  Date                  409 non-null    object 
 13  Date_Type             409 non-null    object 
 14  Obfuscated_Address    363 non-null    object 
 15  City                  3

The data frame has a lot of detail, but much of it, such as the jurisdiction and location information, is not very helpful for predicting adoptability. I will look into the helpful columns a bit more now. 

In [35]:
#what does the image data look like
adoptable_images['Image'].head()

0    http://www.petharbor.com/get_image.asp?RES=Det...
1    http://www.petharbor.com/get_image.asp?RES=Det...
2    http://www.petharbor.com/get_image.asp?RES=Det...
3    http://www.petharbor.com/get_image.asp?RES=Det...
4    http://www.petharbor.com/get_image.asp?RES=Det...
Name: Image, dtype: object

In [36]:
#what are the outcome types
adoptable_images['Record_Type'].unique()

array(['FOUND', 'LOST', 'ADOPTABLE'], dtype=object)

In [37]:
# what animals are in this dataset
adoptable_images['animal_type'].unique()

array(['Cat', 'Dog', 'Goat/sheep', 'Bird', 'Rabbit Sh', 'Angora',
       'Dead Cat', 'Dead Dog', 'Hamster', 'Dead Bird', 'Lop-Mini'],
      dtype=object)

In [38]:
#what are the outcome breakdowns for dogs
adoptable_images[['impound_no','Record_Type', 'animal_type']].groupby(['animal_type','Record_Type']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,impound_no
animal_type,Record_Type,Unnamed: 2_level_1
Angora,ADOPTABLE,1
Bird,FOUND,1
Cat,ADOPTABLE,42
Cat,FOUND,154
Cat,LOST,91
Dead Bird,FOUND,1
Dead Cat,FOUND,8
Dead Dog,FOUND,2
Dog,ADOPTABLE,11
Dog,FOUND,32


In [39]:
#How many nulls are there?
adoptable_images.isna().sum()

impound_no                0
Animal_ID                 0
Data_Source               0
Record_Type               0
Link                      0
Current_Location         11
Animal_Name             110
animal_type               0
Age                     181
Animal_Gender            10
Animal_Breed              0
Animal_Color              0
Date                      0
Date_Type                 0
Obfuscated_Address       46
City                     95
State                   101
Zip                      98
jurisdiction             30
obfuscated_latitude     361
obfuscated_longitude    361
Image                    11
image_alt_text            0
location_for_map         35
Memo                    181
dtype: int64

In [40]:
#which images (for which animal) are missing?
#select the animal type for every row with a missing image, then print each one
for animal in adoptable_images.loc[adoptable_images['Image'].isna(), 'animal_type']:
    print(animal)

Dead Cat
Dead Cat
Dead Cat
Dead Cat
Dead Cat
Dead Cat
Dead Dog
Dead Bird
Dead Cat
Dead Dog
Dead Cat


It looks like (luckily for us) this shelter did not upload pictures of dead animals. Similar to the Austin shelter data, this dataset is missing names and ages for many animals. It is also missing a lot of location information, but that doesn't affect this project. 

## Key Points
<a class="anchor" id="imageskey"></a>

* There are images of animals in this dataset, but it is too small to be used. There are only 11 adoptable dogs. 
* Not a good dataset for modeling, but could be used as real-world examples to test the model 
    * Can check predictability based on common breeds

# Goal Attainment Assessment
<a class="anchor" id="goal"></a>

* To predict the adoptability of any particular dog that is in an animal shelter
    * There is a column "outcome type" that can be used as the target variable
    * The size of the data frame should be enough to get statistically significant results
    * There are important details about each dog (such as breed, color, age, sex, neutered status) that could plausibly affect the adoptability of a dog. 
* Predict the length of time before the dog is adopted 
    * There is a column that specifies how long a dog was at the shelter and the arrival date and departure time can be used to calculate this if needed
* I will use images of dogs up for adoption as the input of this model to predict if it will be adopted or not
    * The data considered in this notebook does not pertain to this goal, so it is unclear at this time if this is achievable