# All Pandas All The Time

Pandas is a library we're going to be using pretty much every day in this course, so we're going to do a ton of practice so you can be on your way to becoming a _PANDAS MASTER_.

![Kung fu panda excited](https://data.whicdn.com/images/201331793/original.gif)

Let's continue with the data from the Austin Animal Shelter. 

Data source: [intakes data](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and [outcomes data](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238).

Once again, we'll start with the intake data, which describes the animals as they enter the shelter.

In [1]:
# Imports! Can't use pandas unless we bring it into our notebook
import pandas as pd

In [3]:
!ls data/

Austin_Animal_Center_Intakes_030921.csv
Austin_Animal_Center_Outcomes_030921.csv


In [4]:
# Grab the data, naming the dataframe 'intakes' this time
# Don't forget to read in DateTime as a datetime column
intakes = pd.read_csv('data/Austin_Animal_Center_Intakes_030921.csv',
                      parse_dates=['DateTime'])

In [5]:
# Check out the first few rows
intakes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,2019-01-03 16:19:00,01/03/2019 04:19:00 PM,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,07/05/2015 12:59:00 PM,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,04/14/2016 06:43:00 PM,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,2013-10-21 07:59:00,10/21/2013 07:59:00 AM,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,06/29/2014 10:38:00 AM,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [6]:
# Check information on the dataframe
intakes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124222 entries, 0 to 124221
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Animal ID         124222 non-null  object        
 1   Name              85158 non-null   object        
 2   DateTime          124222 non-null  datetime64[ns]
 3   MonthYear         124222 non-null  object        
 4   Found Location    124222 non-null  object        
 5   Intake Type       124222 non-null  object        
 6   Intake Condition  124222 non-null  object        
 7   Animal Type       124222 non-null  object        
 8   Sex upon Intake   124221 non-null  object        
 9   Age upon Intake   124222 non-null  object        
 10  Breed             124222 non-null  object        
 11  Color             124222 non-null  object        
dtypes: datetime64[ns](1), object(11)
memory usage: 11.4+ MB


Let's do some of the transformations we did last time: dropping the MonthYear column, and changing column names to be lowercase without spaces.

In [8]:
# Drop MonthYear
intakes = intakes.drop(columns='MonthYear')

In [10]:
# Rename columns
intakes = intakes.rename(columns=lambda x: x.replace(" ", "_").lower())
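The lambda in the rename above is one way to clean up the headers. As a rough sketch of an alternative, the same cleanup can be done with the string methods on the column Index. The toy frame `demo` below is made up just for illustration; it is not the shelter data.

```python
import pandas as pd

# Toy frame with headers shaped like the shelter data (values are made up)
demo = pd.DataFrame({"Animal ID": ["A1"], "Intake Type": ["Stray"]})

# Vectorized string methods on the column Index: swap spaces, then lowercase
demo.columns = demo.columns.str.replace(" ", "_", regex=False).str.lower()

print(list(demo.columns))
```

Both approaches give the same column names; the lambda version has the small advantage of returning a new dataframe instead of assigning to `.columns` directly.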

In [11]:
# Sanity check
intakes.head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


## Dealing with Dirty Data

It is a fact of the data science life - you will always be surrounded by 'dirty' data. What does it mean for data to be 'dirty'? What are some of the various ways that data can be 'dirty'?

- missing
- special characters that don't render properly
- inconsistent data types
- commas or other signs messing with your numeric data
- lists or other data structures inside columns
- duplicates
- incorrect inputs/nonsense data


In [14]:
# Check for null values recognized by pandas as blank
intakes.isna().sum()

animal_id               0
name                39064
datetime                0
found_location          0
intake_type             0
intake_condition        0
animal_type             0
sex_upon_intake         1
age_upon_intake         0
breed                   0
color                   0
dtype: int64

There is no one way to deal with null values. What are some of the strategies we can use to deal with them?

- fill nulls with something that shows the value is missing ('unknown', 0)
- fill nulls with average or median
- drop those rows/columns
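Here is a minimal sketch of those three strategies on a tiny made-up dataframe. The `toy` frame and its `weight` column are hypothetical and only there to have a numeric column to play with.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with nulls in both a text and a numeric column
toy = pd.DataFrame({
    "name": ["Rio", None, "Belle", None],
    "weight": [10.0, np.nan, 30.0, 50.0],
})

# Strategy 1: fill with a placeholder that flags the value as missing
filled_name = toy["name"].fillna("unknown")

# Strategy 2: fill a numeric column with its median (30.0 for this toy data)
filled_weight = toy["weight"].fillna(toy["weight"].median())

# Strategy 3: drop any row that has a null in the chosen column
dropped = toy.dropna(subset=["weight"])

print(filled_name.tolist())
print(filled_weight.tolist())
print(len(dropped))
```

Which strategy makes sense depends on the column: a placeholder works well for names, while a median only makes sense for numeric data.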


How, in Pandas, can we fill the values that Pandas recognizes as null? Let's practice by filling nulls in the `name` column with a placeholder value, like 'No Name'.

Helpful link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

In [16]:
# Code here to fill nulls in the Name column
intakes['name'] = intakes['name'].fillna(value='No Name')

Now let's check for nulls again...

In [17]:
# Sanity check
intakes.isna().sum()

animal_id           0
name                0
datetime            0
found_location      0
intake_type         0
intake_condition    0
animal_type         0
sex_upon_intake     1
age_upon_intake     0
breed               0
color               0
dtype: int64

Let's try a different strategy for the one lonely null in the 'Sex upon Intake' column - let's just drop that row, since it's only one observation.

Helpful link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In [21]:
len(intakes)

124222

In [22]:
# Code here to drop the whole row where Sex upon Intake is null
intakes = intakes.dropna(subset=['sex_upon_intake'])

In [23]:
len(intakes)

124221

In [24]:
# Copy/paste code from above to re-check for nulls
intakes.isna().sum()

animal_id           0
name                0
datetime            0
found_location      0
intake_type         0
intake_condition    0
animal_type         0
sex_upon_intake     0
age_upon_intake     0
breed               0
color               0
dtype: int64

How do we find sneaky null or nonsense values that aren't marked by Pandas as null?

In [25]:
# Run this cell without changes
intakes['age_upon_intake'].value_counts()

1 year       21809
2 years      19069
1 month      11910
3 years       7456
2 months      6735
4 years       4458
4 weeks       4414
5 years       4064
3 weeks       3620
3 months      3279
4 months      3203
5 months      3073
6 years       2717
2 weeks       2498
6 months      2396
7 years       2335
8 years       2284
7 months      1860
10 years      1828
9 months      1824
8 months      1500
9 years       1331
10 months     1028
1 week        1022
1 weeks        888
12 years       880
11 months      795
0 years        754
11 years       746
1 day          635
3 days         578
13 years       575
2 days         479
14 years       383
15 years       336
4 days         328
5 weeks        315
6 days         305
5 days         180
16 years       140
17 years        82
18 years        47
19 years        27
20 years        19
-1 years         5
22 years         5
21 years         1
-3 years         1
25 years         1
-2 years         1
23 years         1
24 years         1
Name: age_up

In [26]:
intakes['age_upon_intake'].unique()

array(['2 years', '8 years', '11 months', '4 weeks', '4 years', '6 years',
       '5 months', '14 years', '1 month', '2 months', '18 years',
       '4 months', '1 year', '6 months', '3 years', '4 days', '1 day',
       '5 years', '2 weeks', '15 years', '7 years', '3 weeks', '3 months',
       '12 years', '1 week', '9 months', '10 years', '10 months',
       '7 months', '9 years', '8 months', '1 weeks', '5 days', '2 days',
       '11 years', '0 years', '17 years', '3 days', '13 years', '5 weeks',
       '19 years', '6 days', '16 years', '20 years', '-1 years',
       '22 years', '23 years', '-2 years', '21 years', '-3 years',
       '25 years', '24 years'], dtype=object)

Analyze the values you're finding in the 'Age upon Intake' column. What doesn't quite fit here?

**Note:** using `.value_counts()` is just one way to look at the values of a column. In this case, it works because we can see which values are the most common, and it's verbose enough to show even the less common values that might be problematic.
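Another way to hunt for sneaky values, sketched here on a handful of made-up strings, is to describe what a valid value should look like and flag everything that doesn't match. The pattern and the `'unknown'` entry are assumptions for illustration; `'unknown'` does not appear in the shelter data.

```python
import pandas as pd

# A few sample values; 'unknown' is a made-up example of a sneaky placeholder
ages = pd.Series(["2 years", "1 week", "1 weeks", "-1 years", "0 years", "unknown"])

# Expected shape: a non-negative whole number, a space, then a unit
expected = r"\d+ (?:day|week|month|year)s?"
looks_ok = ages.str.fullmatch(expected)

# Anything that fails the pattern deserves a closer look
suspicious = ages[~looks_ok]
print(suspicious.tolist())
```

A pattern check only catches values with the wrong shape. Values like '1 weeks' or '0 years' pass this pattern, so it complements `.value_counts()` rather than replacing it.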

So - how do we want to deal with the data in here that doesn't make sense?

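- treat the negative ages as data-entry errors (replace them or mark them as missing), and treat '1 week' and '1 weeks' as the same value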


What if our goal is creating a column with a common standard for age, one which we could sort to see which animals are the oldest or youngest?

First, let's see what that would look like if we try it as the column is now:

In [33]:
# Run this cell without changes
intakes['age_upon_intake'].sort_values(ascending=True).unique()

array(['-1 years', '-2 years', '-3 years', '0 years', '1 day', '1 month',
       '1 week', '1 weeks', '1 year', '10 months', '10 years',
       '11 months', '11 years', '12 years', '13 years', '14 years',
       '15 years', '16 years', '17 years', '18 years', '19 years',
       '2 days', '2 months', '2 weeks', '2 years', '20 years', '21 years',
       '22 years', '23 years', '24 years', '25 years', '3 days',
       '3 months', '3 weeks', '3 years', '4 days', '4 months', '4 weeks',
       '4 years', '5 days', '5 months', '5 weeks', '5 years', '6 days',
       '6 months', '6 years', '7 months', '7 years', '8 months',
       '8 years', '9 months', '9 years'], dtype=object)

Let's unpack what is happening in that line of code: we take the `age_upon_intake` column by itself (as a Series), sort the values from lowest to highest (`ascending=True`), then grab only the unique results so we can see how it ordered the values without looking through all 124,221 rows.

Does that do what we want it to? Let's discuss how this worked - how did it sort?

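- it sorted the values as text, character by character, so '-1 years' comes first and '10 months' comes before '2 days'; the size of the number and the unit are ignored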
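To see the sorting behavior in isolation, here is a small sketch with four made-up age strings, comparing a text sort against a sort on the extracted number.

```python
import pandas as pd

ages = pd.Series(["10 years", "9 years", "2 years", "25 years"])

# Sorting strings compares character by character, so "10" sorts before "2"
as_text = ages.sort_values().tolist()

# Pulling out the leading number and sorting on it gives the order we expect
as_number = ages.str.split(" ").str[0].astype(int).sort_values().tolist()

print(as_text)
print(as_number)
```

The text sort puts '10 years' ahead of '2 years' because the character '1' comes before '2', which is why the column needs to become numeric before we can rank animals by age.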


To make our problem a bit easier, without dealing with the different ways that age is broken out, let's only look at animals where the age is given in years. How can we do that?

In [35]:
intakes['age_upon_intake']

0           2 years
1           8 years
2         11 months
3           4 weeks
4           4 years
            ...    
124217      2 years
124218       1 year
124219      4 years
124220      3 years
124221      2 years
Name: age_upon_intake, Length: 124221, dtype: object

In [39]:
# Code here to grab only the animals where age is given in years
in_years = intakes['age_upon_intake'].map(lambda x: "year" in x)

In [40]:
in_years

0          True
1          True
2         False
3         False
4          True
          ...  
124217     True
124218     True
124219     True
124220     True
124221     True
Name: age_upon_intake, Length: 124221, dtype: bool

In [56]:
intakes.loc[in_years]

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray
5,A743852,Odin,2017-02-18 12:46:00,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,2 years,Labrador Retriever Mix,Chocolate
6,A635072,Beowulf,2019-04-16 09:53:00,415 East Mary Street in Austin (TX),Public Assist,Normal,Dog,Neutered Male,6 years,Great Dane Mix,Black
...,...,...,...,...,...,...,...,...,...,...,...
124217,A830428,No Name,2021-03-09 12:07:00,501 Strichen Drive in Travis (TX),Wildlife,Sick,Other,Unknown,2 years,Skunk,Black/White
124218,A830411,No Name,2021-03-09 12:40:00,12609 Dessau Rd in Austin (TX),Stray,Normal,Dog,Intact Male,1 year,Dachshund/Rat Terrier,Brown/White
124219,A830250,*Hansel,2021-03-05 14:31:00,Cesar Chavez And North Lamar in Austin (TX),Stray,Normal,Dog,Intact Male,4 years,German Shepherd,Brown/Black
124220,A830431,Chema,2021-03-09 12:04:00,Austin (TX),Owner Surrender,Normal,Dog,Unknown,3 years,Beagle/Chihuahua Shorthair,Black/Brown


In [49]:
year_rows_v1 = []
for row in intakes['age_upon_intake']:
    if "year" in row:
        year_rows_v1.append(row)

In [50]:
year_rows_v2 = [row for row in intakes['age_upon_intake'] if "year" in row]

In [52]:
year_rows_v1 == year_rows_v2

True

In [58]:
# .str.contains already returns booleans, so no '== True' comparison is needed
years_intake = intakes.loc[intakes['age_upon_intake'].str.contains('year')]

In [60]:
# Check the shape of this subset dataframe
years_intake.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71356 entries, 0 to 124221
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   animal_id         71356 non-null  object        
 1   name              71356 non-null  object        
 2   datetime          71356 non-null  datetime64[ns]
 3   found_location    71356 non-null  object        
 4   intake_type       71356 non-null  object        
 5   intake_condition  71356 non-null  object        
 6   animal_type       71356 non-null  object        
 7   sex_upon_intake   71356 non-null  object        
 8   age_upon_intake   71356 non-null  object        
 9   breed             71356 non-null  object        
 10  color             71356 non-null  object        
dtypes: datetime64[ns](1), object(10)
memory usage: 6.5+ MB


In [61]:
# Sanity check
years_intake.head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray
5,A743852,Odin,2017-02-18 12:46:00,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,2 years,Labrador Retriever Mix,Chocolate
6,A635072,Beowulf,2019-04-16 09:53:00,415 East Mary Street in Austin (TX),Public Assist,Normal,Dog,Neutered Male,6 years,Great Dane Mix,Black


Can we grab only the number of years from this? Let's make a new column where we can put this data.

In [64]:
years_intake['age_upon_intake'][0].split(" ")

['2', 'years']

In [69]:
# Code here to make a new column, 'Age in Years'
years_intake['age_in_years'] = years_intake['age_upon_intake'].str.split(" ").str[0]

# Did you get a 'SettingWithCopyWarning'? No worries - let's discuss

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  years_intake['age_in_years'] = years_intake['age_upon_intake'].str.split(" ").str[0]
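The warning shows up because the subset was created by filtering another dataframe, and Pandas can't tell whether we meant to change the original too. A common way to avoid it, sketched here on a made-up three-row frame, is to take an explicit `.copy()` when creating the subset. The names `toy` and `in_years_df` are just for this example.

```python
import pandas as pd

toy = pd.DataFrame({"age_upon_intake": ["2 years", "4 weeks", "8 years"]})

# .copy() makes the subset an independent DataFrame rather than a slice of toy
in_years_df = toy.loc[toy["age_upon_intake"].str.contains("year")].copy()

# Adding a column to the explicit copy does not touch the original frame
in_years_df["age_in_years"] = (
    in_years_df["age_upon_intake"].str.split(" ").str[0].astype(int)
)

print(in_years_df)
```

The result is the same either way in this notebook; the copy just makes our intent explicit, and in Pandas versions that emit this warning it keeps the output quiet.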


In [71]:
years_intake.head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,age_in_years
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,2
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,8
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,4
5,A743852,Odin,2017-02-18 12:46:00,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,2 years,Labrador Retriever Mix,Chocolate,2
6,A635072,Beowulf,2019-04-16 09:53:00,415 East Mary Street in Austin (TX),Public Assist,Normal,Dog,Neutered Male,6 years,Great Dane Mix,Black,6


In [72]:
years_intake.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71356 entries, 0 to 124221
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   animal_id         71356 non-null  object        
 1   name              71356 non-null  object        
 2   datetime          71356 non-null  datetime64[ns]
 3   found_location    71356 non-null  object        
 4   intake_type       71356 non-null  object        
 5   intake_condition  71356 non-null  object        
 6   animal_type       71356 non-null  object        
 7   sex_upon_intake   71356 non-null  object        
 8   age_upon_intake   71356 non-null  object        
 9   breed             71356 non-null  object        
 10  color             71356 non-null  object        
 11  age_in_years      71356 non-null  object        
dtypes: datetime64[ns](1), object(11)
memory usage: 9.6+ MB


In [77]:
# Code here to transform that column to an integer
years_intake['age_in_years'] = years_intake['age_in_years'].astype('int')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  years_intake['age_in_years'] = years_intake['age_in_years'].astype('int')


In [78]:
# Code here to check your work
years_intake.head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,age_in_years
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,2
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,8
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,4
5,A743852,Odin,2017-02-18 12:46:00,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,2 years,Labrador Retriever Mix,Chocolate,2
6,A635072,Beowulf,2019-04-16 09:53:00,415 East Mary Street in Austin (TX),Public Assist,Normal,Dog,Neutered Male,6 years,Great Dane Mix,Black,6


In [79]:
# Code here to check some statistics on our now-numeric column
years_intake['age_in_years'].describe()

count    71356.000000
mean         3.417428
std          3.166216
min         -3.000000
25%          1.000000
50%          2.000000
75%          4.000000
max         25.000000
Name: age_in_years, dtype: float64

In [80]:
# Code here to check the unique values - in order!
years_intake['age_in_years'].sort_values().unique()

array([-3, -2, -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13,
       14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25])

In [81]:
# Let's check the mean for our now-numeric column
years_intake['age_in_years'].mean()

3.417428106956668

In [82]:
# Now let's check the median
years_intake['age_in_years'].median()

2.0

Let's discuss this column - what does it mean that the mean and median are different? How will that change if we remove some of the nonsense numbers?

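- the mean (about 3.4) is higher than the median (2), so the data is right-skewed: a long tail of older animals pulls the mean up
- removing or zeroing out the negatives nudges the mean up a little while the median stays put, so the data ends up skewed even more!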
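A tiny made-up example shows the same effect with numbers small enough to check by hand. The values in `raw` are invented for illustration.

```python
import pandas as pd

# Made-up ages: mostly young animals, one old one, one nonsense negative
raw = pd.Series([-3, 1, 2, 2, 3, 19])

print(raw.mean())    # 4.0
print(raw.median())  # 2.0

# Replace the nonsense negatives with 0
cleaned = raw.replace([-3, -2, -1], 0)

print(cleaned.mean())    # 4.5
print(cleaned.median())  # 2.0
```

The one large value drags the mean well above the median, and fixing the negative value moves the mean further up without changing the median.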


In [84]:
# Code here to deal with those nonsense numbers
nonsense_years=[-3, -2, -1]

years_intake['age_in_years'] = years_intake['age_in_years'].replace(nonsense_years, 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  years_intake['age_in_years'] = years_intake['age_in_years'].replace(nonsense_years, 0)


In [85]:
# Sanity check
years_intake['age_in_years'].unique()

array([ 2,  8,  4,  6, 14, 18,  1,  3,  5, 15,  7, 12, 10,  9, 11,  0, 17,
       13, 19, 16, 20, 22, 23, 21, 25, 24])

In [87]:
set(years_intake['age_in_years'])

{0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25}

In [86]:
# Code here to re-check your mean/median values
years_intake['age_in_years'].describe()

count    71356.000000
mean         3.417568
std          3.166025
min          0.000000
25%          1.000000
50%          2.000000
75%          4.000000
max         25.000000
Name: age_in_years, dtype: float64
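Going back to the earlier goal of one common standard for age: instead of keeping only the rows given in years, one option is to convert every value to days. This is only a sketch on made-up values, and the conversion factors (30 days per month, 365 days per year) are rough assumptions, not something the data source defines.

```python
import pandas as pd

# Assumed conversion factors; months and years are rough approximations
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

ages = pd.Series(["2 years", "11 months", "4 weeks", "1 week", "1 weeks", "3 days"])

# Split "2 years" into a number column (0) and a unit column (1)
parts = ages.str.split(" ", expand=True)
number = parts[0].astype(int)

# Strip a trailing "s" so "weeks" and "week" map to the same unit
unit = parts[1].str.rstrip("s")

age_in_days = number * unit.map(DAYS_PER_UNIT)
print(age_in_days.tolist())
```

With a single numeric column like this, sorting to find the youngest or oldest animals works across all units, and '1 week' and '1 weeks' end up with the same value.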

### Duplicates - another kind of dirty data (sometimes)

Some duplicates are legitimate, some are not - let's explore and discuss!

Let's go back to our full `intakes` dataframe.

In [89]:
# Check for duplicates
intakes.duplicated().sum()

19

In [93]:
# Now check specifically for Animal IDs that are duplicated
intakes.duplicated(subset=['animal_id']).sum()

1840

In [94]:
# Handle duplicates - keep only one intake row for each animal
# keep='last' keeps the last row listed for each animal_id
# Save it as a new version, named clean_intakes
clean_intakes = intakes.drop_duplicates(subset=['animal_id'], keep='last')
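One thing to keep in mind: `keep='first'` and `keep='last'` refer to row order, not to time. If we specifically want each animal's earliest intake, one approach is to sort by the datetime column before dropping duplicates. Here is a sketch on a made-up `visits` frame.

```python
import pandas as pd

# Made-up visits, deliberately not in chronological order
visits = pd.DataFrame({
    "animal_id": ["A1", "A2", "A1", "A1"],
    "datetime": pd.to_datetime(
        ["2019-05-01", "2018-01-15", "2017-03-10", "2018-07-04"]
    ),
})

# Without sorting, 'last' means the last row listed, whatever its date
last_listed = visits.drop_duplicates(subset=["animal_id"], keep="last")

# Sorting by datetime first makes keep='first' mean the earliest visit
earliest = (
    visits.sort_values("datetime")
          .drop_duplicates(subset=["animal_id"], keep="first")
)

print(last_listed)
print(earliest)
```

For animal A1, the last row listed is the 2018 visit, which is neither its first nor its most recent one, while the sorted version reliably picks the 2017 visit.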

In [95]:
clean_intakes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 111012 entries, 0 to 124221
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   animal_id         111012 non-null  object        
 1   name              111012 non-null  object        
 2   datetime          111012 non-null  datetime64[ns]
 3   found_location    111012 non-null  object        
 4   intake_type       111012 non-null  object        
 5   intake_condition  111012 non-null  object        
 6   animal_type       111012 non-null  object        
 7   sex_upon_intake   111012 non-null  object        
 8   age_upon_intake   111012 non-null  object        
 9   breed             111012 non-null  object        
 10  color             111012 non-null  object        
dtypes: datetime64[ns](1), object(10)
memory usage: 10.2+ MB


## Group By

We can use the `groupby` method to find interesting patterns among groups in our data. Let's use it now to find the average age, in years, of each animal type.

In [96]:
# Run just a groupby on the animal_type column - what's the output?
clean_intakes.groupby(by=['animal_type'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9eb7adbb50>

In [103]:
# Add an aggregation function
# Select age_in_years first so only the numeric column gets aggregated
years_intake.groupby(by=['animal_type'])[['age_in_years']].agg(['mean', 'count'])

animal_type,age_in_years (mean),age_in_years (count)
Bird,1.697778,450
Cat,3.574686,16489
Dog,3.572886,49296
Livestock,1.571429,7
Other,1.567657,5114


## Merging Dataframes

We were given two data sources here - both an Intakes and an Outcomes CSV. Let's merge them!

![Merge diagram from Data Science Made Simple](http://www.datasciencemadesimple.com/wp-content/uploads/2017/09/join-or-merge-in-python-pandas-1.png)

[Image from Data Science Made Simple's post on Joining/Merging Pandas Data Frames](http://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/)

Even more useful: https://pandas.pydata.org/docs/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
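To make the join types concrete, here is a small sketch with two made-up tables that only partly overlap. The frames `left` and `right` are hypothetical; `how` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

left = pd.DataFrame({
    "animal_id": ["A1", "A2", "A3"],
    "intake_type": ["Stray", "Stray", "Wildlife"],
})
right = pd.DataFrame({
    "animal_id": ["A2", "A3", "A4"],
    "outcome_type": ["Adoption", "Transfer", "Adoption"],
})

# Inner join: only IDs that appear in both tables
inner = pd.merge(left, right, on="animal_id", how="inner")

# Outer join: every ID from either table; indicator=True adds a _merge
# column saying where each row came from
outer = pd.merge(left, right, on="animal_id", how="outer", indicator=True)

print(inner)
print(outer)
```

The inner join keeps A2 and A3 only, while the outer join keeps all four IDs and fills the missing side with nulls. The `_merge` column is a handy way to see which animals have an intake without an outcome, or the reverse.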

In [104]:
# Read in our outcomes csv as a dataframe named outcomes
outcomes = pd.read_csv("data/Austin_Animal_Center_Outcomes_030921.csv")

In [105]:
# Check out our outcomes data
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,05/08/2019 06:20:00 PM,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,07/18/2018 04:02:00 PM,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A821648,,08/16/2020 11:38:00 AM,08/16/2020 11:38:00 AM,08/16/2019,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray
3,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
4,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby


What column should we use to merge these DataFrames?

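- the animal ID (`animal_id` once the column names are cleaned up) - it's the identifier both tables share for each animal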


Let's do some quick cleaning on our outcomes dataframe...

In [107]:
# Change the 'DateTime' column here to be recognized as datetime objects
outcomes['DateTime'] = pd.to_datetime(outcomes['DateTime'])
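When every value follows the same layout, we can also tell `pd.to_datetime` the format explicitly, which avoids guessing and can be faster on large columns. This sketch uses two strings copied from the outcomes preview above; the format string is my reading of that layout, so treat it as an assumption.

```python
import pandas as pd

raw = pd.Series(["05/08/2019 06:20:00 PM", "03/18/2014 11:47:00 AM"])

# %I is the 12-hour clock hour and %p is the AM/PM marker
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

print(parsed)
```

If any value doesn't match the format, Pandas raises an error, which is a useful early warning about dirty date strings.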

In [108]:
# Change column names to be lower case and remove spaces
outcomes = outcomes.rename(columns=lambda x: x.replace(" ", "_").lower())

In [109]:
outcomes.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color
0,A794011,Chunk,2019-05-08 18:20:00,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,2018-07-18 16:02:00,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A821648,,2020-08-16 11:38:00,08/16/2020 11:38:00 AM,08/16/2019,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray
3,A720371,Moose,2016-02-13 17:59:00,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
4,A674754,,2014-03-18 11:47:00,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby


In [110]:
# Drop duplicate animal IDs, keeping only the last row listed for each
# Save this as clean_outcomes
clean_outcomes = outcomes.drop_duplicates(subset=['animal_id'], keep='last')

In [111]:
# Sanity check
clean_outcomes.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color
0,A794011,Chunk,2019-05-08 18:20:00,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,2018-07-18 16:02:00,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A821648,,2020-08-16 11:38:00,08/16/2020 11:38:00 AM,08/16/2019,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray
4,A674754,,2014-03-18 11:47:00,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby
6,A814515,Quentin,2020-05-06 07:59:00,05/06/2020 07:59:00 AM,03/01/2018,Adoption,Foster,Dog,Neutered Male,2 years,American Foxhound/Labrador Retriever,White/Brown


Now... let's merge!

In [115]:
# Code here to merge dataframes
total = pd.merge(clean_intakes, clean_outcomes, on='animal_id', 
                 suffixes=('_intakes', '_outcomes'))

In [116]:
# Code here to check out the details of our new dataframe
total.head()

Unnamed: 0,animal_id,name_intakes,datetime_intakes,found_location,intake_type,intake_condition,animal_type_intakes,sex_upon_intake,age_upon_intake,breed_intakes,...,datetime_outcomes,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type_outcomes,sex_upon_outcome,age_upon_outcome,breed_outcomes,color_outcomes
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,...,2019-01-08 15:11:00,01/08/2019 03:11:00 PM,01/03/2017,Transfer,Partner,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,...,2015-07-05 15:13:00,07/05/2015 03:13:00 PM,07/05/2007,Return to Owner,,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,...,2016-04-21 17:17:00,04/21/2016 05:17:00 PM,04/17/2015,Return to Owner,,Dog,Neutered Male,1 year,Basenji Mix,Sable/White
3,A665644,No Name,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,...,2013-10-21 11:39:00,10/21/2013 11:39:00 AM,09/21/2013,Transfer,Partner,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,...,2014-07-02 14:16:00,07/02/2014 02:16:00 PM,06/29/2010,Return to Owner,,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [117]:
total.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110590 entries, 0 to 110589
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   animal_id             110590 non-null  object        
 1   name_intakes          110590 non-null  object        
 2   datetime_intakes      110590 non-null  datetime64[ns]
 3   found_location        110590 non-null  object        
 4   intake_type           110590 non-null  object        
 5   intake_condition      110590 non-null  object        
 6   animal_type_intakes   110590 non-null  object        
 7   sex_upon_intake       110590 non-null  object        
 8   age_upon_intake       110590 non-null  object        
 9   breed_intakes         110590 non-null  object        
 10  color_intakes         110590 non-null  object        
 11  name_outcomes         71998 non-null   object        
 12  datetime_outcomes     110590 non-null  datetime64[ns]
 13 

In [140]:
test = pd.merge(intakes, outcomes, on='animal_id', 
                suffixes=('_intakes', '_outcomes'), how='inner')

In [141]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 160586 entries, 0 to 160585
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   animal_id             160586 non-null  object        
 1   name_intakes          160586 non-null  object        
 2   datetime_intakes      160586 non-null  datetime64[ns]
 3   found_location        160586 non-null  object        
 4   intake_type           160586 non-null  object        
 5   intake_condition      160586 non-null  object        
 6   animal_type_intakes   160586 non-null  object        
 7   sex_upon_intake       160586 non-null  object        
 8   age_upon_intake       160586 non-null  object        
 9   breed_intakes         160586 non-null  object        
 10  color_intakes         160586 non-null  object        
 11  name_outcomes         120756 non-null  object        
 12  datetime_outcomes     160586 non-null  datetime64[ns]
 13 

In [142]:
intakes.duplicated(subset=['animal_id']).sum() + outcomes.duplicated(subset=['animal_id']).sum()

26404

In [143]:
test.duplicated(subset=['animal_id']).sum()

49996

In [144]:
intakes[intakes.duplicated(subset=['animal_id'])]['animal_id'].value_counts()

A721033    32
A718223    13
A718877    11
A706536    10
A716018     8
           ..
A717851     1
A694578     1
A735930     1
A765200     1
A720573     1
Name: animal_id, Length: 10113, dtype: int64

In [147]:
len(intakes.loc[intakes['animal_id'] == 'A721033'])

33

In [145]:
test.loc[test['animal_id'] == 'A721033']

Unnamed: 0,animal_id,name_intakes,datetime_intakes,found_location,intake_type,intake_condition,animal_type_intakes,sex_upon_intake,age_upon_intake,breed_intakes,...,datetime_outcomes,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type_outcomes,sex_upon_outcome,age_upon_outcome,breed_outcomes,color_outcomes
7751,A721033,Lil Bit,2019-02-24 21:53:00,700 Allen St in Austin (TX),Public Assist,Normal,Dog,Neutered Male,3 years,Rat Terrier Mix,...,2019-08-10 11:56:00,08/10/2019 11:56:00 AM,05/20/2015,Return to Owner,,Dog,Neutered Male,4 years,Rat Terrier Mix,Tricolor/Brown Brindle
7752,A721033,Lil Bit,2019-02-24 21:53:00,700 Allen St in Austin (TX),Public Assist,Normal,Dog,Neutered Male,3 years,Rat Terrier Mix,...,2017-01-10 16:20:00,01/10/2017 04:20:00 PM,05/20/2015,Return to Owner,,Dog,Neutered Male,1 year,Rat Terrier Mix,Tricolor/Brown Brindle
7753,A721033,Lil Bit,2019-02-24 21:53:00,700 Allen St in Austin (TX),Public Assist,Normal,Dog,Neutered Male,3 years,Rat Terrier Mix,...,2016-10-21 18:55:00,10/21/2016 06:55:00 PM,05/20/2015,Return to Owner,,Dog,Neutered Male,1 year,Rat Terrier Mix,Tricolor/Brown Brindle
7754,A721033,Lil Bit,2019-02-24 21:53:00,700 Allen St in Austin (TX),Public Assist,Normal,Dog,Neutered Male,3 years,Rat Terrier Mix,...,2019-03-11 16:27:00,03/11/2019 04:27:00 PM,05/20/2015,Return to Owner,,Dog,Neutered Male,3 years,Rat Terrier Mix,Tricolor/Brown Brindle
7755,A721033,Lil Bit,2019-02-24 21:53:00,700 Allen St in Austin (TX),Public Assist,Normal,Dog,Neutered Male,3 years,Rat Terrier Mix,...,2019-05-21 14:42:00,05/21/2019 02:42:00 PM,05/20/2015,Return to Owner,,Dog,Neutered Male,4 years,Rat Terrier Mix,Tricolor/Brown Brindle
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8835,A721033,Lil Bit,2018-02-22 10:28:00,6400 Ben White Blvd in Austin (TX),Public Assist,Normal,Dog,Neutered Male,2 years,Rat Terrier Mix,...,2016-09-12 13:40:00,09/12/2016 01:40:00 PM,05/20/2015,Return to Owner,,Dog,Neutered Male,1 year,Rat Terrier Mix,Tricolor/Brown Brindle
8836,A721033,Lil Bit,2018-02-22 10:28:00,6400 Ben White Blvd in Austin (TX),Public Assist,Normal,Dog,Neutered Male,2 years,Rat Terrier Mix,...,2018-03-08 15:04:00,03/08/2018 03:04:00 PM,05/20/2015,Return to Owner,,Dog,Neutered Male,2 years,Rat Terrier Mix,Tricolor/Brown Brindle
8837,A721033,Lil Bit,2018-02-22 10:28:00,6400 Ben White Blvd in Austin (TX),Public Assist,Normal,Dog,Neutered Male,2 years,Rat Terrier Mix,...,2018-04-17 11:07:00,04/17/2018 11:07:00 AM,05/20/2015,Return to Owner,,Dog,Neutered Male,2 years,Rat Terrier Mix,Tricolor/Brown Brindle
8838,A721033,Lil Bit,2018-02-22 10:28:00,6400 Ben White Blvd in Austin (TX),Public Assist,Normal,Dog,Neutered Male,2 years,Rat Terrier Mix,...,2019-02-12 15:20:00,02/12/2019 03:20:00 PM,05/20/2015,Return to Owner,,Dog,Neutered Male,3 years,Rat Terrier Mix,Tricolor/Brown Brindle


Let's discuss: can anyone guess why I had us remove duplicates before this merge? What would happen if I didn't? How could we make our combined_df better?

- Note the difference in the number of rows between these examples (110590 in the earlier merge versus 160586 in `test`). When an `animal_id` is repeated in both dataframes, the merge pairs every matching intake row with every matching outcome row, creating 1089 rows for this one dog alone (Lil Bit, who has 33 intake rows).
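
To see the effect on a small scale, here is a minimal sketch using made-up toy data (the IDs and column values are hypothetical, not taken from the shelter files). It shows how an ID repeated in both frames multiplies rows, and how the `validate` argument of `pd.merge` can flag the problem before it sneaks into your analysis.

```python
import pandas as pd

# Toy data (hypothetical): animal 'A1' appears 3 times in each frame
toy_intakes = pd.DataFrame({'animal_id': ['A1', 'A1', 'A1', 'A2'],
                            'intake_type': ['Stray', 'Stray', 'Public Assist', 'Stray']})
toy_outcomes = pd.DataFrame({'animal_id': ['A1', 'A1', 'A1', 'A2'],
                             'outcome_type': ['Return to Owner'] * 3 + ['Adoption']})

# Every A1 intake row pairs with every A1 outcome row: 3 * 3 = 9 rows, plus 1 for A2
toy_merged = pd.merge(toy_intakes, toy_outcomes, on='animal_id', how='inner')
print(len(toy_merged))  # 10

# Dropping duplicate IDs first keeps one row per animal
deduped = pd.merge(toy_intakes.drop_duplicates(subset=['animal_id']),
                   toy_outcomes.drop_duplicates(subset=['animal_id']),
                   on='animal_id', how='inner', validate='one_to_one')
print(len(deduped))  # 2

# validate='one_to_one' raises MergeError if a key repeats on either side
try:
    pd.merge(toy_intakes, toy_outcomes, on='animal_id', validate='one_to_one')
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True
print(merge_failed)  # True
```

Keeping only the first row per ID does throw away repeat visits, so it is one option rather than the only one. Matching each intake to the outcome that follows it (for example with `pd.merge_asof`) might be worth exploring as a way to make combined_df better.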


## Level Up!

1. Find the **age in days** for all animals, not just the ones whose age is provided in years. Be sure to do this on the original dataframe, not just on subsets of the dataframe.

   - (Assume a year is 365 days, and a month is 30 days)

        
2. Ask a few questions of the combined dataframe that you couldn't figure out by just looking at the intakes or outcomes dataframes by themselves.

   - Example: Can you find out how long each animal in the combined dataframe has been in the shelter? 
        
       - Hint: Check out datetime objects, a data type that isn't a string or an integer but that pandas can recognize as time! See the [pandas time series guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html).
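
If you want a starting point for either exercise, here is a minimal sketch on toy data. The helper `age_to_days` and the toy values are hypothetical. It assumes age strings look like `'2 years'`, `'11 months'`, or `'4 weeks'`, as in the intakes preview, and it treats a week as 7 days (that part is an assumption; the exercise only fixes years and months). Subtracting one datetime column from another gives a Timedelta column, and `.dt.days` pulls out the whole days.

```python
import pandas as pd

# Days per unit; 365 and 30 follow the exercise, 7 and 1 are assumptions
DAYS_PER_UNIT = {'year': 365, 'month': 30, 'week': 7, 'day': 1}

def age_to_days(age):
    """Convert a string like '2 years' or '11 months' to a number of days."""
    number, unit = age.split()
    unit = unit.rstrip('s')  # 'years' -> 'year'
    return int(number) * DAYS_PER_UNIT[unit]

# Toy rows (hypothetical values) using the merged dataframe's column names
toy = pd.DataFrame({
    'age_upon_intake': ['2 years', '11 months', '4 weeks', '1 day'],
    'datetime_intakes': pd.to_datetime(['2019-01-03', '2016-04-14',
                                        '2013-10-21', '2014-06-29']),
    'datetime_outcomes': pd.to_datetime(['2019-01-08', '2016-04-21',
                                         '2013-12-01', '2014-07-02']),
})

toy['age_in_days'] = toy['age_upon_intake'].apply(age_to_days)

# datetime minus datetime gives a Timedelta; .dt.days extracts whole days
toy['days_in_shelter'] = (toy['datetime_outcomes'] - toy['datetime_intakes']).dt.days

print(toy[['age_in_days', 'days_in_shelter']])
```

The real data may contain missing or oddly formatted ages that this sketch does not handle, so check `value_counts()` on the age column before applying anything like it to the full dataframe.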

In [None]:
# Code here to work on level up #1


In [None]:
# Code here to work on level up #2
