# Austin Shelter Wrangle Notes <a name='top'></a>

This notebook contains notes and code to develop the `final_adoption_report` for the <a href='https://github.com/stephenfitzsimon/pet_adoption_project'>Austin Shelter Pet Outcomes</a> project.

- Two tables (intake and outcome) are downloaded.
- Rows with a duplicated `animal_id` are dropped.
- Columns dropped: `animal_id_i`.
- Datetime columns are reduced to date strings (`outcome_date`, `intake_date`).
- `animal_type`, `breed`, and `color` were consistent across both tables, so the intake copies are dropped.
- The `name` column had nulls that appeared as mismatches; they are replaced with the string `no name`.
- Drop the 15 rows with `NaN` in `outcome_type`.
    - `outcome_subtype` `NaN` values are replaced with `no subtype`.
    - SCRP is the Stray Cat Return Program.
- Drop nulls in `sex_upon_outcome`, `age_upon_outcome`, and `sex_upon_intake_i`.

### Contents

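1. <a href='#download'>Getting data from the internet</a>
2. <a href='#joining'>Joining the table data</a>
3. <a href='#make_dataframe'>Function to produce single dataframe</a>
4. <a href='#datetime'>Handling the datetime columns</a>
5. <a href='#integrity'>Checking data integrity of select columns</a>
6. <a href='#nulls'>Considering nulls</a>
7. <a href='#age'>Fixing the age columns</a>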

In [1]:
import os
import requests
import pandas as pd
from sodapy import Socrata

## Getting the data from the internet <a name='download'></a>

Write a function that downloads the data from the internet.  Use the <a href='https://dev.socrata.com/'>Socrata Open Data API.</a>

In [2]:
def download_data():
    """
    Returns the pet outcome and pet intake dataframes from the
    Socrata Open Data API (SODA), requesting up to 200,000 rows per table
    """
    client = Socrata("data.austintexas.gov", None)
    results_outcome = client.get("9t4d-g238", limit=200_000)
    results_intake = client.get("wter-evkm", limit=200_000)

    # Convert to pandas DataFrame
    df_outcome = pd.DataFrame.from_records(results_outcome)
    df_intake = pd.DataFrame.from_records(results_intake)
    return df_outcome, df_intake

df_o, df_i = download_data()



In [3]:
df_i

Unnamed: 0,animal_id,datetime,datetime2,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,name
0,A860485,2022-06-29T10:14:00.000,2022-06-29T10:14:00.000,515 South Pleasant Valley in Austin (TX),Public Assist,Normal,Dog,Intact Female,1 month,Pit Bull,White/Tan,
1,A854419,2022-06-29T08:37:00.000,2022-06-29T08:37:00.000,2501 S Ih 32 Nb in Austin (TX),Stray,Normal,Dog,Intact Female,2 years,Basenji,Tan/White,Girly
2,A860475,2022-06-29T08:05:00.000,2022-06-29T08:05:00.000,On Ih 35 Between Exit 240 And 241 in Austin (TX),Stray,Injured,Cat,Intact Female,4 weeks,Domestic Shorthair,Black,
3,A860471,2022-06-29T07:56:00.000,2022-06-29T07:56:00.000,13776 Us 183 in Austin (TX),Stray,Normal,Cat,Intact Male,1 month,Domestic Shorthair,Brown Tabby/White,A860471
4,A860470,2022-06-29T07:47:00.000,2022-06-29T07:47:00.000,7421 Thanas Way in Austin (TX),Stray,Injured,Cat,Neutered Male,3 years,Domestic Longhair,Orange Tabby,A860470
...,...,...,...,...,...,...,...,...,...,...,...,...
141273,A664233,2013-10-01T08:53:00.000,2013-10-01T08:53:00.000,7405 Springtime in Austin (TX),Stray,Injured,Dog,Intact Female,3 years,Pit Bull Mix,Blue/White,Stevie
141274,A664237,2013-10-01T08:33:00.000,2013-10-01T08:33:00.000,Abia in Austin (TX),Stray,Normal,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,
141275,A664235,2013-10-01T08:33:00.000,2013-10-01T08:33:00.000,Abia in Austin (TX),Stray,Normal,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,
141276,A664236,2013-10-01T08:33:00.000,2013-10-01T08:33:00.000,Abia in Austin (TX),Stray,Normal,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,


Wrap the download in a function that first checks for saved `.csv` files and reads them if both exist. Allow the user to force a fresh URL query.

In [4]:
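def get_pet_data(query_url = False):
    """
    Returns the outcome and intake dataframes, reading the cached .csv files
    when both exist. Set query_url=True to force a fresh download.
    """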
    file_o = 'pet_outcomes.csv'
    file_i = 'pet_intake.csv'
    if os.path.isfile(file_o) and os.path.isfile(file_i) and not query_url:
        #return dataframe from file
        print('Returning saved csv files.')
        df_o = pd.read_csv(file_o).drop(columns = ['Unnamed: 0'])
        df_i = pd.read_csv(file_i).drop(columns = ['Unnamed: 0'])
        return df_o, df_i
    else:
        print('Getting data from url...')
        df_o, df_i = download_data()
        print('Saving to .csv files...')
        df_o.to_csv(file_o)
        df_i.to_csv(file_i)
        print('Returned dataframes.')
        return df_o, df_i

df_o, df_i = get_pet_data(query_url=True)



Getting data from url...
Saving to .csv files...
Returned dataframes.
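The `drop(columns = ['Unnamed: 0'])` calls above are needed because `to_csv` writes the index as an unnamed first column. As a minimal sketch of an alternative, assuming the index carries no information worth keeping, writing with `index=False` avoids the extra column altogether. The toy frame and the file name `toy_intake.csv` are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for a downloaded table (values are made up).
toy = pd.DataFrame({"animal_id": ["A1", "A2"], "animal_type": ["Dog", "Cat"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_intake.csv")  # hypothetical file name
    # index=False keeps the RangeIndex out of the file,
    # so no 'Unnamed: 0' column appears when the file is read back.
    toy.to_csv(path, index=False)
    cached = pd.read_csv(path)

cached_columns = list(cached.columns)
print(cached_columns)
```

With this approach the read branch would not need to drop any column, at the cost of changing the format of files already saved on disk.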


In [5]:
df_i

Unnamed: 0,animal_id,datetime,datetime2,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,name
0,A860485,2022-06-29T10:14:00.000,2022-06-29T10:14:00.000,515 South Pleasant Valley in Austin (TX),Public Assist,Normal,Dog,Intact Female,1 month,Pit Bull,White/Tan,
1,A854419,2022-06-29T08:37:00.000,2022-06-29T08:37:00.000,2501 S Ih 32 Nb in Austin (TX),Stray,Normal,Dog,Intact Female,2 years,Basenji,Tan/White,Girly
2,A860475,2022-06-29T08:05:00.000,2022-06-29T08:05:00.000,On Ih 35 Between Exit 240 And 241 in Austin (TX),Stray,Injured,Cat,Intact Female,4 weeks,Domestic Shorthair,Black,
3,A860471,2022-06-29T07:56:00.000,2022-06-29T07:56:00.000,13776 Us 183 in Austin (TX),Stray,Normal,Cat,Intact Male,1 month,Domestic Shorthair,Brown Tabby/White,A860471
4,A860470,2022-06-29T07:47:00.000,2022-06-29T07:47:00.000,7421 Thanas Way in Austin (TX),Stray,Injured,Cat,Neutered Male,3 years,Domestic Longhair,Orange Tabby,A860470
...,...,...,...,...,...,...,...,...,...,...,...,...
141273,A664233,2013-10-01T08:53:00.000,2013-10-01T08:53:00.000,7405 Springtime in Austin (TX),Stray,Injured,Dog,Intact Female,3 years,Pit Bull Mix,Blue/White,Stevie
141274,A664235,2013-10-01T08:33:00.000,2013-10-01T08:33:00.000,Abia in Austin (TX),Stray,Normal,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,
141275,A664237,2013-10-01T08:33:00.000,2013-10-01T08:33:00.000,Abia in Austin (TX),Stray,Normal,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,
141276,A664236,2013-10-01T08:33:00.000,2013-10-01T08:33:00.000,Abia in Austin (TX),Stray,Normal,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,


In [6]:
df_o

Unnamed: 0,animal_id,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,name
0,A851638,2022-06-29T10:52:00.000,2022-06-29T10:52:00.000,2021-02-16T00:00:00.000,Adoption,Foster,Cat,Spayed Female,1 year,Siamese,Flame Point,
1,A860417,2022-06-29T09:48:00.000,2022-06-29T09:48:00.000,2022-03-28T00:00:00.000,Euthanasia,Suffering,Cat,Intact Female,3 months,Domestic Shorthair,White/Cream Tabby,
2,A857678,2022-06-29T08:23:00.000,2022-06-29T08:23:00.000,2022-04-12T00:00:00.000,Adoption,Foster,Cat,Spayed Female,2 months,Domestic Shorthair,Brown Tabby,*Platypus
3,A857677,2022-06-29T08:22:00.000,2022-06-29T08:22:00.000,2022-04-12T00:00:00.000,Adoption,Foster,Cat,Spayed Female,2 months,Domestic Shorthair,Brown Tabby,*Koala
4,A858888,2022-06-29T08:19:00.000,2022-06-29T08:19:00.000,2014-06-06T00:00:00.000,Adoption,Foster,Dog,Spayed Female,8 years,Chihuahua Shorthair Mix,Black,*Daisy
...,...,...,...,...,...,...,...,...,...,...,...,...
141158,A664223,2013-10-01T11:03:00.000,2013-10-01T11:03:00.000,2009-09-30T00:00:00.000,Return to Owner,,Dog,Neutered Male,4 years,Bulldog Mix,White,Moby
141159,A664237,2013-10-01T10:44:00.000,2013-10-01T10:44:00.000,2013-09-24T00:00:00.000,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,
141160,A664236,2013-10-01T10:44:00.000,2013-10-01T10:44:00.000,2013-09-24T00:00:00.000,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,
141161,A664235,2013-10-01T10:39:00.000,2013-10-01T10:39:00.000,2013-09-24T00:00:00.000,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,


In [7]:
df_o, df_i = get_pet_data()

Returning saved csv files.


All the data looks like it's here, both from the URL query and from the saved files.

<a href='#top'>Back to Top</a>

## Joining the table data <a name='joining'></a>

In [8]:
df_o.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141163 entries, 0 to 141162
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   animal_id         141163 non-null  object
 1   datetime          141163 non-null  object
 2   monthyear         141163 non-null  object
 3   date_of_birth     141163 non-null  object
 4   outcome_type      141141 non-null  object
 5   outcome_subtype   64767 non-null   object
 6   animal_type       141163 non-null  object
 7   sex_upon_outcome  141162 non-null  object
 8   age_upon_outcome  141139 non-null  object
 9   breed             141163 non-null  object
 10  color             141163 non-null  object
 11  name              99530 non-null   object
dtypes: object(12)
memory usage: 12.9+ MB


In [9]:
df_i.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141278 entries, 0 to 141277
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   animal_id         141278 non-null  object
 1   datetime          141278 non-null  object
 2   datetime2         141278 non-null  object
 3   found_location    141278 non-null  object
 4   intake_type       141278 non-null  object
 5   intake_condition  141278 non-null  object
 6   animal_type       141278 non-null  object
 7   sex_upon_intake   141277 non-null  object
 8   age_upon_intake   141278 non-null  object
 9   breed             141278 non-null  object
 10  color             141278 non-null  object
 11  name              99600 non-null   object
dtypes: object(12)
memory usage: 12.9+ MB


First join the tables. Start by adding an `_i` suffix (for intake) to every intake column so that its source can be identified after the join.

In [10]:
def rename_intake(df):
    return df.add_suffix('_i')

df_i = rename_intake(df_i)

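Now join the tables on `animal_id`. Both tables contain repeated `animal_id` values, presumably animals that passed through the shelter more than once. These represent $\approx 20,000$ rows. Simply drop every row whose ID is repeated (`keep=False`) and combine the two tables with an inner join.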

In [11]:
df_i.animal_id_i.value_counts()

A721033    33
A718223    14
A718877    12
A706536    11
A737814     9
           ..
A785938     1
A785937     1
A785936     1
A785940     1
A521520     1
Name: animal_id_i, Length: 126391, dtype: int64

In [12]:
df_o.animal_id.value_counts()

A721033    33
A718223    14
A718877    12
A706536    11
A716018     9
           ..
A785807     1
A783403     1
A785975     1
A785976     1
A659834     1
Name: animal_id, Length: 126268, dtype: int64
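To make the effect of `keep=False` concrete, here is a small sketch on made-up data. The IDs and values are invented; only the column names follow the notebook. The `validate="one_to_one"` argument is an optional extra check, not something the notebook's own join uses.

```python
import pandas as pd

# Made-up outcome and intake tables; A2 repeats in outcomes, A1 in intakes.
outcomes = pd.DataFrame({
    "animal_id": ["A1", "A2", "A2", "A3"],
    "outcome_type": ["Adoption", "Transfer", "Adoption", "Return to Owner"],
})
intakes = pd.DataFrame({
    "animal_id_i": ["A1", "A1", "A2", "A3"],
    "intake_type_i": ["Stray", "Stray", "Stray", "Public Assist"],
})

# keep=False removes every row whose ID occurs more than once,
# rather than keeping the first or last occurrence.
outcomes_once = outcomes.drop_duplicates(subset="animal_id", keep=False)
intakes_once = intakes.drop_duplicates(subset="animal_id_i", keep=False)

# The inner join keeps only IDs that survive in both tables;
# validate raises if either side still has a repeated key.
joined = outcomes_once.merge(
    intakes_once,
    how="inner",
    left_on="animal_id",
    right_on="animal_id_i",
    validate="one_to_one",
)
joined_ids = joined["animal_id"].tolist()
print(joined_ids)
```

Only `A3` survives here: `A2` is removed from the outcome side and `A1` from the intake side, so neither can match in the inner join.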

In [13]:
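def join_tables(df_o, df_i):
    """
    Drops every row whose animal id is repeated in either table, then
    inner joins the outcome and intake tables on the animal id.
    """
    # keep=False discards all copies of a repeated id, not just the extras
    df_i = df_i.drop_duplicates(subset='animal_id_i', keep=False)
    df_o = df_o.drop_duplicates(subset='animal_id', keep=False)
    df = df_o.merge(df_i, how='inner', left_on='animal_id', right_on='animal_id_i')
    df = df.drop(columns=['animal_id_i'])
    return df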

df = join_tables(df_o, df_i)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 113980 entries, 0 to 113979
Data columns (total 23 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   animal_id           113980 non-null  object
 1   datetime            113980 non-null  object
 2   monthyear           113980 non-null  object
 3   date_of_birth       113980 non-null  object
 4   outcome_type        113965 non-null  object
 5   outcome_subtype     59080 non-null   object
 6   animal_type         113980 non-null  object
 7   sex_upon_outcome    113979 non-null  object
 8   age_upon_outcome    113956 non-null  object
 9   breed               113980 non-null  object
 10  color               113980 non-null  object
 11  name                73212 non-null   object
 12  datetime_i          113980 non-null  object
 13  datetime2_i         113980 non-null  object
 14  found_location_i    113980 non-null  object
 15  intake_type_i       113980 non-null  object
 16  in

<a href='#top'>Back to top</a>

## Function to produce single dataframe <a name='make_dataframe'></a>

In [14]:
def get_pet_dataframe():
    df_o, df_i = get_pet_data()
    df_i = rename_intake(df_i)
    df = join_tables(df_o, df_i)
    return df

df = get_pet_dataframe()

Returning saved csv files.


<a href='#top'>Back to top</a>

## Handling the datetime columns <a name='datetime'></a>

First, convert the four timestamp columns to the datetime dtype.

In [15]:
df[['animal_id', 'datetime', 'monthyear', 'datetime_i', 'datetime2_i']]

Unnamed: 0,animal_id,datetime,monthyear,datetime_i,datetime2_i
0,A851638,2022-06-29T10:52:00.000,2022-06-29T10:52:00.000,2022-02-16T07:32:00.000,2022-02-16T07:32:00.000
1,A860417,2022-06-29T09:48:00.000,2022-06-29T09:48:00.000,2022-06-28T11:17:00.000,2022-06-28T11:17:00.000
2,A857678,2022-06-29T08:23:00.000,2022-06-29T08:23:00.000,2022-05-19T12:40:00.000,2022-05-19T12:40:00.000
3,A857677,2022-06-29T08:22:00.000,2022-06-29T08:22:00.000,2022-05-19T12:40:00.000,2022-05-19T12:40:00.000
4,A858888,2022-06-29T08:19:00.000,2022-06-29T08:19:00.000,2022-06-06T07:23:00.000,2022-06-06T07:23:00.000
...,...,...,...,...,...
113975,A648744,2013-10-01T12:27:00.000,2013-10-01T12:27:00.000,2013-10-01T11:15:00.000,2013-10-01T11:15:00.000
113976,A664258,2013-10-01T12:27:00.000,2013-10-01T12:27:00.000,2013-10-01T11:15:00.000,2013-10-01T11:15:00.000
113977,A664237,2013-10-01T10:44:00.000,2013-10-01T10:44:00.000,2013-10-01T08:33:00.000,2013-10-01T08:33:00.000
113978,A664236,2013-10-01T10:44:00.000,2013-10-01T10:44:00.000,2013-10-01T08:33:00.000,2013-10-01T08:33:00.000


In [16]:
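df['datetime'] = pd.to_datetime(df['datetime'])
df['monthyear'] = pd.to_datetime(df['monthyear'])
# NOTE: 'dateime_i' is a misspelling of 'datetime_i'. This line adds a new
# 'dateime_i' column and leaves 'datetime_i' as strings; make_date_columns
# below converts the correctly named column.
df['dateime_i'] = pd.to_datetime(df['datetime_i'])
df['datetime2_i'] = pd.to_datetime(df['datetime2_i'])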

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 113980 entries, 0 to 113979
Data columns (total 24 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   animal_id           113980 non-null  object        
 1   datetime            113980 non-null  datetime64[ns]
 2   monthyear           113980 non-null  datetime64[ns]
 3   date_of_birth       113980 non-null  object        
 4   outcome_type        113965 non-null  object        
 5   outcome_subtype     59080 non-null   object        
 6   animal_type         113980 non-null  object        
 7   sex_upon_outcome    113979 non-null  object        
 8   age_upon_outcome    113956 non-null  object        
 9   breed               113980 non-null  object        
 10  color               113980 non-null  object        
 11  name                73212 non-null   object        
 12  datetime_i          113980 non-null  object        
 13  datetime2_i         113980 no

The paired datetime columns within each table do not appear to differ: comparing `datetime` with `monthyear`, and `datetime_i` with `datetime2_i`, returns no rows.

In [18]:
df[df['datetime'] != df['monthyear']]

Unnamed: 0,animal_id,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,...,found_location_i,intake_type_i,intake_condition_i,animal_type_i,sex_upon_intake_i,age_upon_intake_i,breed_i,color_i,name_i,dateime_i


In [19]:
df[df['datetime_i'] != df['datetime2_i']]

Unnamed: 0,animal_id,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,...,found_location_i,intake_type_i,intake_condition_i,animal_type_i,sex_upon_intake_i,age_upon_intake_i,breed_i,color_i,name_i,dateime_i


Therefore, make two columns, `outcome_date` and `intake_date`, that contain the day of each event, and drop the four redundant datetime columns.

In [22]:
def make_date_columns(df):
    """
    Returns df with `outcome_date` and `intake_date` string columns in place
    of the four raw datetime columns.
    """
    df['datetime'] = pd.to_datetime(df['datetime'])
    df['monthyear'] = pd.to_datetime(df['monthyear'])
    df['datetime_i'] = pd.to_datetime(df['datetime_i'])
    df['datetime2_i'] = pd.to_datetime(df['datetime2_i'])
    df['outcome_date'] = df['monthyear'].dt.strftime('%m %d, %Y')
    # the intake date must come from the intake timestamp, not the outcome one
    df['intake_date'] = df['datetime_i'].dt.strftime('%m %d, %Y')
    df = df.drop(columns = ['datetime', 'monthyear', 'datetime_i', 'datetime2_i'])
    return df

make_date_columns(get_pet_dataframe()).info()

Returning saved csv files.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 113980 entries, 0 to 113979
Data columns (total 21 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   animal_id           113980 non-null  object
 1   date_of_birth       113980 non-null  object
 2   outcome_type        113965 non-null  object
 3   outcome_subtype     59080 non-null   object
 4   animal_type         113980 non-null  object
 5   sex_upon_outcome    113979 non-null  object
 6   age_upon_outcome    113956 non-null  object
 7   breed               113980 non-null  object
 8   color               113980 non-null  object
 9   name                73212 non-null   object
 10  found_location_i    113980 non-null  object
 11  intake_type_i       113980 non-null  object
 12  intake_condition_i  113980 non-null  object
 13  animal_type_i       113980 non-null  object
 14  sex_upon_intake_i   113979 non-null  object
 15  age_upon_intake_i   1139
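As a quick illustration of the date formatting, the sketch below applies the same `'%m %d, %Y'` format to the two timestamps shown for animal `A851638` in the table above. It assumes each date column is derived from its own table's timestamp (outcome from `monthyear`, intake from `datetime_i`).

```python
import pandas as pd

# One row copied from the joined table shown earlier (animal A851638).
toy = pd.DataFrame({
    "monthyear": ["2022-06-29T10:52:00.000"],   # outcome timestamp
    "datetime_i": ["2022-02-16T07:32:00.000"],  # intake timestamp
})

outcome_ts = pd.to_datetime(toy["monthyear"])
intake_ts = pd.to_datetime(toy["datetime_i"])

# %m is the zero-padded month number, %d the day, %Y the four-digit year.
toy["outcome_date"] = outcome_ts.dt.strftime("%m %d, %Y")
toy["intake_date"] = intake_ts.dt.strftime("%m %d, %Y")

print(toy[["outcome_date", "intake_date"]])
```

Note that the result is a string column, so it sorts lexically rather than chronologically; keeping a datetime column alongside it may be worth considering if ordering matters later.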

<a href='#top'>Back to top</a>

## Checking data integrity of select columns <a name='integrity'></a>

All the animal types match between the outcome and intake tables.

In [23]:
df[df['animal_type'] != df['animal_type_i']]

Unnamed: 0,animal_id,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,...,found_location_i,intake_type_i,intake_condition_i,animal_type_i,sex_upon_intake_i,age_upon_intake_i,breed_i,color_i,name_i,dateime_i


The only mismatched names are rows where both `name` and `name_i` are `NaN`, since `NaN` never compares equal to itself.

In [30]:
df[df['name'] != df['name_i']][['name', 'name_i']].value_counts(dropna=False)

name  name_i
NaN   NaN       40768
dtype: int64
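The reason missing names show up as mismatches is that `NaN != NaN` evaluates to `True`. A minimal sketch of a null-aware comparison on made-up names follows; the value `Mobi` is invented to provide one genuine mismatch.

```python
import numpy as np
import pandas as pd

# Made-up names: row 0 matches, row 1 is missing in both, row 2 truly differs.
names = pd.DataFrame({
    "name": ["Girly", np.nan, "Moby"],
    "name_i": ["Girly", np.nan, "Mobi"],
})

# A plain != flags the NaN/NaN pair as a mismatch.
naive_mismatch = names["name"] != names["name_i"]

# Treat rows where both sides are missing as equal.
both_missing = names["name"].isna() & names["name_i"].isna()
true_mismatch = naive_mismatch & ~both_missing

print(naive_mismatch.tolist())
print(true_mismatch.tolist())
```

Applied to the real columns, a check of this kind would be expected to return no rows, which is consistent with the `NaN`/`NaN` count above.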

All the breeds are consistent between the two tables.

In [31]:
df[df['breed'] != df['breed_i']]

Unnamed: 0,animal_id,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,...,found_location_i,intake_type_i,intake_condition_i,animal_type_i,sex_upon_intake_i,age_upon_intake_i,breed_i,color_i,name_i,dateime_i


Color is consistent as well.

In [32]:
df[df['color'] != df['color_i']]

Unnamed: 0,animal_id,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,...,found_location_i,intake_type_i,intake_condition_i,animal_type_i,sex_upon_intake_i,age_upon_intake_i,breed_i,color_i,name_i,dateime_i


<a href='#top'>Back to top</a>

## Considering nulls <a name='nulls'></a>

The nulls in the `name` column can be filled with a placeholder string:

In [33]:
df[df['name'] != df['name_i']][['name', 'name_i']].value_counts(dropna=False)

name  name_i
NaN   NaN       40768
dtype: int64

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 113980 entries, 0 to 113979
Data columns (total 24 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   animal_id           113980 non-null  object        
 1   datetime            113980 non-null  datetime64[ns]
 2   monthyear           113980 non-null  datetime64[ns]
 3   date_of_birth       113980 non-null  object        
 4   outcome_type        113965 non-null  object        
 5   outcome_subtype     59080 non-null   object        
 6   animal_type         113980 non-null  object        
 7   sex_upon_outcome    113979 non-null  object        
 8   age_upon_outcome    113956 non-null  object        
 9   breed               113980 non-null  object        
 10  color               113980 non-null  object        
 11  name                73212 non-null   object        
 12  datetime_i          113980 non-null  object        
 13  datetime2_i         113980 no

There are only fifteen `NaN` values in `outcome_type`. These rows can be dropped safely.

In [35]:
df.outcome_type.value_counts(dropna=False)

Adoption           49533
Transfer           37749
Return to Owner    15252
Euthanasia          8914
Died                1276
Disposal             628
Rto-Adopt            532
Missing               55
Relocate              25
NaN                   15
Lost                   1
Name: outcome_type, dtype: int64

Replace the `NaN` values in `outcome_subtype` with `no subtype`. Consider the cross tab of `outcome_subtype` against `outcome_type`.

In [36]:
df.outcome_subtype.value_counts(dropna=False)

NaN                    54900
Partner                31384
Foster                 11019
Rabies Risk             4093
Suffering               3462
SCRP                    2941
Snr                     2831
In Kennel                669
Out State                570
Aggressive               414
Offsite                  338
Medical                  318
In Foster                315
At Vet                   282
Behavior                 125
Enroute                   90
Field                     79
Underage                  36
In Surgery                27
Court/Investigation       25
Prc                       11
In State                  11
Customer S                11
Barn                      10
Possible Theft             7
Emer                       6
Emergency                  6
Name: outcome_subtype, dtype: int64

In [41]:
pd.crosstab(df.outcome_subtype, df.outcome_type)

outcome_type,Adoption,Died,Euthanasia,Missing,Return to Owner,Transfer
outcome_subtype,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Aggressive,0,0,414,0,0,0
At Vet,0,90,192,0,0,0
Barn,3,0,0,0,0,7
Behavior,0,0,125,0,0,0
Court/Investigation,0,0,25,0,0,0
Customer S,0,0,0,0,11,0
Emer,0,0,0,0,0,6
Emergency,0,6,0,0,0,0
Enroute,0,90,0,0,0,0
Field,0,0,0,0,79,0


SCRP is the Stray Cat Return Program, a spay-and-release program for cats. Every SCRP row is a cat, as expected.

In [44]:
df[df.outcome_subtype == 'SCRP'].animal_type.value_counts()

Cat    2941
Name: animal_type, dtype: int64

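`sex_upon_outcome` and `sex_upon_intake_i` each have a single `NaN`, and `age_upon_outcome` has 24. Drop these rows.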

In [46]:
df.sex_upon_outcome.value_counts(dropna=False)

Neutered Male    36133
Spayed Female    33916
Intact Female    16434
Intact Male      16330
Unknown          11166
NaN                  1
Name: sex_upon_outcome, dtype: int64

In [47]:
df.age_upon_outcome.value_counts(dropna=False)

1 year       18687
2 years      16956
2 months     16000
3 months      6079
1 month       5749
3 years       5734
4 months      3969
4 years       3214
5 years       3141
5 months      2922
6 months      2878
3 weeks       2383
2 weeks       2354
6 years       2063
4 weeks       2007
8 years       1937
7 years       1810
8 months      1738
10 years      1694
10 months     1494
7 months      1388
9 months      1070
1 weeks       1038
9 years       1019
12 years       848
1 week         794
11 years       600
11 months      565
13 years       534
3 days         436
2 days         402
14 years       391
1 day          343
15 years       335
6 days         262
4 days         255
0 years        205
5 days         173
5 weeks        152
16 years       143
17 years        84
18 years        52
19 years        25
NaN             24
20 years        21
22 years         6
28 years         1
30 years         1
23 years         1
21 years         1
24 years         1
25 years         1
Name: age_up

In [48]:
df.sex_upon_intake_i.value_counts(dropna=False)

Intact Male      40620
Intact Female    39824
Neutered Male    11843
Unknown          11166
Spayed Female    10526
NaN                  1
Name: sex_upon_intake_i, dtype: int64

## Function to fill nulls and drop nulls

In [53]:
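def null_fill_and_drop(df):
    """
    Fills nulls in `name` and `outcome_subtype`, drops the redundant intake
    columns, then drops any row that still contains a null.
    """
    df['name'] = df['name'].fillna('no name')
    df['outcome_subtype'] = df['outcome_subtype'].fillna('no subtype')
    df = df.drop(columns=['name_i', 'breed_i', 'color_i', 'animal_type_i'])
    df = df.dropna()
    return df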

null_fill_and_drop(get_pet_dataframe()).info()

Returning saved csv files.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 113940 entries, 0 to 113979
Data columns (total 19 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   animal_id           113940 non-null  object
 1   datetime            113940 non-null  object
 2   monthyear           113940 non-null  object
 3   date_of_birth       113940 non-null  object
 4   outcome_type        113940 non-null  object
 5   outcome_subtype     113940 non-null  object
 6   animal_type         113940 non-null  object
 7   sex_upon_outcome    113940 non-null  object
 8   age_upon_outcome    113940 non-null  object
 9   breed               113940 non-null  object
 10  color               113940 non-null  object
 11  name                113940 non-null  object
 12  datetime_i          113940 non-null  object
 13  datetime2_i         113940 non-null  object
 14  found_location_i    113940 non-null  object
 15  intake_type_i       1139
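`null_fill_and_drop` calls `df.dropna()` with no arguments, which removes a row if any column is null. If the intent is to drop rows only for specific columns, `dropna(subset=...)` makes that explicit. The sketch below uses made-up rows and a hypothetical `required` list to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up rows: row 1 lacks an outcome type, row 2 lacks a found location.
toy = pd.DataFrame({
    "outcome_type": ["Adoption", np.nan, "Transfer"],
    "sex_upon_outcome": ["Spayed Female", "Unknown", "Intact Male"],
    "found_location_i": ["Austin (TX)", "Austin (TX)", np.nan],
})

# Hypothetical list of columns that must be present for a row to be kept.
required = ["outcome_type", "sex_upon_outcome"]

cleaned_subset = toy.dropna(subset=required)  # only checks the listed columns
cleaned_any = toy.dropna()                    # checks every column

print(len(cleaned_subset), len(cleaned_any))
```

The subset version keeps the row with the missing location, while the plain call drops it. On the real data the two are likely close, since the remaining columns have few or no nulls after the fills.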

<a href='#top'>Back to top</a>

## Fixing the age columns <a name='age'></a>

The age column contains a variety of strings that can be turned into a number.  The smallest unit present is days.  Therefore convert all into days. 



In [62]:
df[['age_upon_outcome']].value_counts()[:5]

age_upon_outcome
1 year              18687
2 years             16956
2 months            16000
3 months             6079
1 month              5749
dtype: int64

In [63]:
def convert_age_column(sr):
    # placeholder: the conversion to days is not implemented yet
    return sr

convert_age_column(df['age_upon_outcome'])

0           1 year
1         3 months
2         2 months
3         2 months
4          8 years
            ...   
113975      1 year
113976     7 years
113977      1 week
113978      1 week
113979      1 week
Name: age_upon_outcome, Length: 113980, dtype: object
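The `convert_age_column` stub above returns its input unchanged. One possible way to carry out the conversion to days is sketched below. The name `convert_age_to_days`, the `DAYS_PER_UNIT` table, and the choice of 30 days per month and 365 days per year are all assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Assumed conversion factors; months and years are approximations.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def convert_age_to_days(sr):
    """
    Converts strings such as '2 years' or '3 weeks' to a number of days.
    Values that do not match the pattern (including nulls) become NaN.
    """
    # Capture the count and the singular unit; the trailing 's' is optional,
    # which also covers entries such as '1 weeks'.
    parts = sr.str.extract(
        r"^\s*(?P<count>\d+)\s+(?P<unit>day|week|month|year)s?\s*$"
    )
    count = pd.to_numeric(parts["count"], errors="coerce")
    days_per_unit = parts["unit"].map(DAYS_PER_UNIT)
    return count * days_per_unit


# Sample values taken from the value counts above, plus one missing entry.
ages = pd.Series(["1 year", "2 months", "1 weeks", "3 days", None, "0 years"])
age_days = convert_age_to_days(ages)
print(age_days.tolist())
```

Because months and years are approximated, the result is only suitable for coarse comparisons. Computing the age from `date_of_birth` and the outcome date would be more precise, if those columns are trusted.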