# Module 1 - Manipulating data with Pandas
## Pandas Part 2

![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)

## Scenario:
You have decided that you want to start your own animal shelter, but you want to get an idea of what that will entail and get more information about planning. In this lecture, we will continue to look at a real data set collected by Austin Animal Center over several years, using our pandas skills from the last lecture and learning some new ones in order to explore this data further.

#### _Our goals today are to be able to_: <br/>

Use the pandas library to:

- Get summary info about a dataset and its variables
  - Apply and use info, describe and dtypes
  - Use mean, min, max, and value_counts 
- Use apply and applymap to transform columns and create new values

- Explain lambda functions and use them with apply on a DataFrame
- Explain what a groupby object is and split a DataFrame using a groupby
- Reshape a DataFrame using joins, merges, pivoting, stacking, and melting


## Getting started

Let's take a moment to examine the [Austin Animal Center data set](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238/data). What kinds of questions can we ask this data and what kinds of information can we get back?

In pairs and as a class, let's generate ideas.

## Switch gears

Before we answer those questions about the animal shelter data, let's practice on a simpler dataset.
Read about this dataset here: https://www.kaggle.com/ronitf/heart-disease-uci
![heart-data](images/heartbloodpres.jpeg)

The dataset is most often used to practice classification algorithms. Can one develop a model to predict the likelihood of heart disease based on other measurable characteristics? We will return to that specific question in a few weeks, but for now we wish to use the dataset to practice some pandas methods.

### 1. Get summary info about a dataset and its variables

Applying and using `info`, `describe`, `mean`, `min`, `max`, `apply`, and `applymap` from the Pandas library

The Pandas library has several useful tools built in. Let's explore some of them.

In [4]:
!pwd
!ls -al

/Users/yl/Desktop/Flatiron/dc-ds-071519/1-Module/week-2/day-6-pandas-part-2
total 80
drwxr-xr-x  7 yl  staff    224 Jul 22 09:42 .
drwxr-xr-x  4 yl  staff    128 Jul 22 09:42 ..
drwxr-xr-x  3 yl  staff     96 Jul 22 09:42 .ipynb_checkpoints
-rw-r--r--@ 1 yl  staff  11325 Jul 22 09:43 heart.csv
-rw-r--r--  1 yl  staff  18098 Jul 22 09:42 manipulating_data_with_pandas.ipynb
-rw-r--r--  1 yl  staff   3297 Jul 22 09:37 pre_process_animal_shelter_data.py
-rw-r--r--  1 yl  staff    130 Jul 22 09:37 states.csv


In [2]:
import pandas as pd
uci = pd.read_csv('heart.csv')

In [3]:
uci.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


#### The `.columns` and `.shape` Attributes

In [5]:
uci.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [6]:
uci.shape

(303, 14)

#### The `.info()` and `.describe()` methods and the `.dtypes` attribute

Pandas DataFrames have many useful methods and attributes! Let's look at `.info()`, `.describe()`, and `.dtypes`.

In [7]:
# Call the .info() method on our dataset. What do you observe?

uci.info()
# .info() lists each column's name, its non-null count (which reveals missing values), and its dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB


In [8]:
# Call the .describe() method on our dataset. What do you observe?

uci.describe()
# .describe() gives basic descriptive statistics; the IQR can be computed as 75% - 25%
# By default it summarizes only numeric columns; categorical (text) columns are not shown

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0
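
By default `.describe()` skips non-numeric columns. Here is a minimal sketch of how the `include` parameter brings them into the summary, using a small made-up frame (`pets` and its columns are hypothetical, not part of the heart dataset). It also computes the IQR from the quartile rows.

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column.
pets = pd.DataFrame({
    'kind': ['dog', 'cat', 'dog', 'bird'],
    'weight_kg': [20.0, 4.0, 30.0, 0.5],
})

numeric_summary = pets.describe()            # numeric columns only
all_summary = pets.describe(include='all')   # adds count/unique/top/freq for text columns

# Interquartile range taken straight from the summary table.
iqr = numeric_summary.loc['75%', 'weight_kg'] - numeric_summary.loc['25%', 'weight_kg']
```

In `all_summary`, statistics that do not apply to a column (for example the mean of `kind`) show up as NaN.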


In [10]:
# T = transpose 
uci.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,303.0,54.366337,9.082101,29.0,47.5,55.0,61.0,77.0
sex,303.0,0.683168,0.466011,0.0,0.0,1.0,1.0,1.0
cp,303.0,0.966997,1.032052,0.0,0.0,1.0,2.0,3.0
trestbps,303.0,131.623762,17.538143,94.0,120.0,130.0,140.0,200.0
chol,303.0,246.264026,51.830751,126.0,211.0,240.0,274.5,564.0
fbs,303.0,0.148515,0.356198,0.0,0.0,0.0,0.0,1.0
restecg,303.0,0.528053,0.52586,0.0,0.0,1.0,1.0,2.0
thalach,303.0,149.646865,22.905161,71.0,133.5,153.0,166.0,202.0
exang,303.0,0.326733,0.469794,0.0,0.0,0.0,1.0,1.0
oldpeak,303.0,1.039604,1.161075,0.0,0.0,0.8,1.6,6.2


In [11]:
# Use the code below. How does the output differ from info() ?
uci.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

#### `.mean()`, `.min()`, `.max()`, `.sum()`

The methods `.mean()`, `.min()`, and `.max()` will perform just the way you think they will!

Note that these are methods both for Series and for DataFrames.

In [12]:
uci.ca.mean()

0.7293729372937293

In [13]:
type(uci.age)

pandas.core.series.Series

#### The Axis Variable

In [14]:
uci.sum() # Try [shift] + [tab] here!

age         16473.0
sex           207.0
cp            293.0
trestbps    39882.0
chol        74618.0
fbs            45.0
restecg       160.0
thalach     45343.0
exang          99.0
oldpeak       315.0
slope         424.0
ca            221.0
thal          701.0
target        165.0
dtype: float64

In [15]:
uci.mean()

age          54.366337
sex           0.683168
cp            0.966997
trestbps    131.623762
chol        246.264026
fbs           0.148515
restecg       0.528053
thalach     149.646865
exang         0.326733
oldpeak       1.039604
slope         1.399340
ca            0.729373
thal          2.313531
target        0.544554
dtype: float64
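
The `axis` argument controls the direction of the reduction. A small sketch on a made-up frame (`scores` is hypothetical): `axis=0`, the default, collapses the rows and returns one value per column, while `axis=1` collapses the columns and returns one value per row.

```python
import pandas as pd

scores = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})  # hypothetical data

per_column = scores.sum(axis=0)   # one total per column (the default)
per_row = scores.sum(axis=1)      # one total per row
```

The same argument works for `.mean()`, `.min()`, and `.max()`.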

#### `.value_counts()`

For a DataFrame _Series_, the `.value_counts()` method will tell you how many of each value you've got.

In [None]:
# value_counts() counts how many times each distinct value occurs, for categorical and numeric data alike,
# and sorts the result from most to least frequent

In [17]:
uci['age'].value_counts()

58    19
57    17
54    16
59    14
52    13
51    12
62    11
44    11
60    11
56    11
64    10
41    10
63     9
67     9
55     8
45     8
42     8
53     8
61     8
65     8
43     8
66     7
50     7
48     7
46     7
49     5
47     5
39     4
35     4
68     4
70     4
40     3
71     3
69     3
38     3
34     2
37     2
77     1
76     1
74     1
29     1
Name: age, dtype: int64

In [16]:
uci['age'].value_counts()[:10]  # [:10], takes the top 10

58    19
57    17
54    16
59    14
52    13
51    12
62    11
44    11
60    11
56    11
Name: age, dtype: int64

Exercise: What are the different values for restecg?

In [19]:
# Your code here!
uci['restecg'].value_counts()

1    152
0    147
2      4
Name: restecg, dtype: int64
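
A few variations on `.value_counts()` are often handy. This is a sketch on a made-up Series (`codes` is hypothetical); `normalize` and `.sort_index()` are standard pandas, but the data is only illustrative.

```python
import pandas as pd

codes = pd.Series([1, 0, 1, 2, 1, 0])   # hypothetical values

counts = codes.value_counts()                  # most frequent value first
shares = codes.value_counts(normalize=True)    # proportions instead of raw counts
by_value = codes.value_counts().sort_index()   # ordered by the value itself
```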

### Apply to Animal Shelter Data
Using `.info()` and `.describe()` and `dtypes` what observations can we make about the data?

What are the breed value counts?

How about age counts for dogs?

In [20]:
animal_outcomes = pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD')

In [22]:
animal_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104442 entries, 0 to 104441
Data columns (total 12 columns):
Animal ID           104442 non-null object
Name                71690 non-null object
DateTime            104442 non-null object
MonthYear           104442 non-null object
Date of Birth       104442 non-null object
Outcome Type        104435 non-null object
Outcome Subtype     47571 non-null object
Animal Type         104442 non-null object
Sex upon Outcome    104440 non-null object
Age upon Outcome    104427 non-null object
Breed               104442 non-null object
Color               104442 non-null object
dtypes: object(12)
memory usage: 9.6+ MB


In [23]:
animal_outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A800130,Kolby,07/21/2019 10:56:00 PM,07/21/2019 10:56:00 PM,05/01/2019,Adoption,,Dog,Spayed Female,2 months,Boxer,Brown
1,A799457,Hazel,07/21/2019 10:55:00 PM,07/21/2019 10:55:00 PM,07/08/2013,,,Dog,Spayed Female,6 years,Pit Bull,Tan/White
2,A800069,,07/21/2019 07:57:00 PM,07/21/2019 07:57:00 PM,06/02/2019,Transfer,Partner,Cat,Intact Male,1 month,Domestic Shorthair,Orange Tabby
3,A795483,*Herb,07/21/2019 07:15:00 PM,07/21/2019 07:15:00 PM,04/21/2019,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair,Orange Tabby/White
4,A795482,*Rain,07/21/2019 07:13:00 PM,07/21/2019 07:13:00 PM,04/21/2019,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair,Orange Tabby/White


In [24]:
animal_outcomes.shape

(104442, 12)

In [26]:
animal_outcomes.Breed.value_counts()

Domestic Shorthair Mix                             29921
Pit Bull Mix                                        7934
Labrador Retriever Mix                              6181
Chihuahua Shorthair Mix                             5984
Domestic Medium Hair Mix                            3018
German Shepherd Mix                                 2685
Bat Mix                                             1741
Domestic Shorthair                                  1688
Domestic Longhair Mix                               1487
Australian Cattle Dog Mix                           1345
Siamese Mix                                         1202
Bat                                                  990
Dachshund Mix                                        973
Boxer Mix                                            876
Border Collie Mix                                    855
Miniature Poodle Mix                                 799
Siberian Husky Mix                                   611
Catahoula Mix                  

In [27]:
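# Note: value_counts is missing its parentheses here, so the output below is the bound method itself, not the counts
animal_outcomes[animal_outcomes['Animal Type'] == 'Dog'].Breed.value_counts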

<bound method IndexOpsMixin.value_counts of 0                                         Boxer
1                                      Pit Bull
8                 Pomeranian/Chihuahua Longhair
9                                      Pit Bull
10                                     Pit Bull
11                               Boston Terrier
12                          German Shepherd Mix
15               German Shepherd/Siberian Husky
22                           Labrador Retriever
23                            Staffordshire Mix
28                       Labrador Retriever Mix
29                                Collie Smooth
30                           Labrador Retriever
32                            Border Collie Mix
36                       Labrador Retriever Mix
39                                     Pit Bull
41                          Chihuahua Shorthair
54                      Australian Shepherd Mix
56                           Great Pyrenees Mix
58                   Golden Retriever/Chow C

In [33]:
# Same slip as before: without the trailing (), value_counts is referenced but never called
animal_outcomes[(animal_outcomes['Animal Type'] == 'Dog') & (animal_outcomes['Outcome Type'] == 'Adoption')].Breed.value_counts

<bound method IndexOpsMixin.value_counts of 0                                        Boxer
12                         German Shepherd Mix
15              German Shepherd/Siberian Husky
22                          Labrador Retriever
23                           Staffordshire Mix
29                               Collie Smooth
30                          Labrador Retriever
32                           Border Collie Mix
36                      Labrador Retriever Mix
39                                    Pit Bull
54                     Australian Shepherd Mix
56                          Great Pyrenees Mix
58                  Golden Retriever/Chow Chow
59                                   Akita Mix
93                               Alaskan Husky
94                          Labrador Retriever
96                          Chihuahua Longhair
98                                    Pit Bull
100                                   Pit Bull
101                                  Pekingese
102             

In [34]:
animal_outcomes[(animal_outcomes['Animal Type'] == 'Dog')\
                & (animal_outcomes['Outcome Type'] == 'Adoption')
               ].Breed.value_counts()

Labrador Retriever Mix                            3117
Pit Bull Mix                                      3075
Chihuahua Shorthair Mix                           2864
German Shepherd Mix                               1356
Australian Cattle Dog Mix                          752
Dachshund Mix                                      477
Border Collie Mix                                  459
Boxer Mix                                          405
Catahoula Mix                                      330
Staffordshire Mix                                  329
Miniature Poodle Mix                               317
Australian Shepherd Mix                            295
Siberian Husky Mix                                 274
Jack Russell Terrier Mix                           271
Cairn Terrier Mix                                  262
Rat Terrier Mix                                    261
Pointer Mix                                        261
Chihuahua Longhair Mix                             242
Anatol She

In [35]:
animal_outcomes[(animal_outcomes['Animal Type'] == 'Dog')]\
['Outcome Type'].value_counts()

Adoption           27527
Return to Owner    16873
Transfer           12845
Euthanasia          1643
Rto-Adopt            345
Died                 210
Missing               26
Disposal              14
Name: Outcome Type, dtype: int64

What are the breed `value_counts`?
What's the top breed for adopted dogs?

How about outcome counts for dogs?




### 2.  Changing data

#### DataFrame.applymap() and Series.map()

The ```.applymap()``` method takes a function as input that it will then apply to every entry in the dataframe.

In [36]:
def successor(x):
    '''Adds one to each value of x in a column'''
    return x + 1

In [38]:
uci.applymap(successor).head()
# .applymap works element-wise on a whole DataFrame
# .map works element-wise on a single column (a Series)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,64,2,4,146,234,2,1,151,1,3.3,1,1,2,2
1,38,2,3,131,251,1,2,188,1,4.5,1,1,3,2
2,42,1,2,131,205,1,1,173,1,2.4,3,1,3,2
3,57,2,2,121,237,1,2,179,1,1.8,3,1,3,2
4,58,1,1,121,355,1,2,164,2,1.6,3,1,3,2


The `.map()` method takes a function as input that it will then apply to every entry in the Series.

In [39]:
uci['age'].map(successor).tail(10)

293    68
294    45
295    64
296    64
297    60
298    58
299    46
300    69
301    58
302    58
Name: age, dtype: int64

In [40]:
uci['age_plus'] = uci['age'].map(successor)

In [41]:
uci.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,age_plus
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,64
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,38
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,42
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,57
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,58


In [None]:
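# .applymap: element-wise over every entry of a DataFrame
# .map: element-wise over every entry of a Series
# .apply: passes a whole column or row (DataFrame) or each element (Series) to the function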
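
To make the difference concrete, here is a minimal sketch on a made-up frame (`toy` is hypothetical). Newer pandas releases deprecate `DataFrame.applymap` in favor of `DataFrame.map`, so the sketch picks whichever one is available.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})   # hypothetical data

def successor(x):
    return x + 1

# Element-wise over the whole DataFrame; use DataFrame.map where it exists,
# otherwise fall back to the older DataFrame.applymap.
elementwise = toy.map if hasattr(toy, 'map') else toy.applymap
all_plus_one = elementwise(successor)

# Element-wise over a single column (a Series).
a_plus_one = toy['a'].map(successor)

# .apply hands a whole column (axis=0) or a whole row (axis=1) to the function.
col_ranges = toy.apply(lambda col: col.max() - col.min(), axis=0)
row_totals = toy.apply(lambda row: row['a'] + row['b'], axis=1)
```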

In [49]:
# Python 3's round() sends ties to the nearest even integer, so 4.5 rounds down to 4
asd = round(4.5)
asd

4

In [46]:
# round(3.5) also gives 4: ties go to the nearest even integer
sdf = round(3.5)
sdf

4

#### Anonymous Functions (Lambda Abstraction)

Simple functions can be defined right in the function call. This is called 'lambda abstraction'; the function thus defined has no name and hence is "anonymous".

In [42]:
uci['oldpeak'].head()

0    2.3
1    3.5
2    1.4
3    0.8
4    0.6
Name: oldpeak, dtype: float64

In [43]:
uci['oldpeak'].map(lambda x: round(x))[:4]

0    2
1    4
2    1
3    1
Name: oldpeak, dtype: int64

Exercise: Use an anonymous function to turn the entries in age to strings

In [47]:
uci['age'].map(lambda x:str(x)).head(2)

0    63
1    37
Name: age, dtype: object

### Apply to Animal Shelter Data

Use an `apply` to change the dates from strings to datetime objects. Similarly, use an apply to change the ages of the animals from strings to floats.

In [None]:
# Your code here

In [52]:
animal_outcomes.columns

Index(['Animal ID', 'Name', 'DateTime', 'MonthYear', 'Date of Birth',
       'Outcome Type', 'Outcome Subtype', 'Animal Type', 'Sex upon Outcome',
       'Age upon Outcome', 'Breed', 'Color'],
      dtype='object')

In [53]:
animal_outcomes.DateTime.dtype

dtype('O')

In [54]:
animal_outcomes.DateTime.head(1)

0    07/21/2019 10:56:00 PM
Name: DateTime, dtype: object

In [57]:
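# First attempt: the format omits the time portion, so parsing fails and
# errors='ignore' silently returns the original string unchanged (see the output below)
animal_outcomes['date_outcomes'] = animal_outcomes.\
DateTime.map(lambda x:pd.to_datetime(x, format = '%m/%d/%Y', errors = 'ignore'))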
animal_outcomes.head(2)

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,date_outcomes
0,A800130,Kolby,07/21/2019 10:56:00 PM,07/21/2019 10:56:00 PM,05/01/2019,Adoption,,Dog,Spayed Female,2 months,Boxer,Brown,07/21/2019 10:56:00 PM
1,A799457,Hazel,07/21/2019 10:55:00 PM,07/21/2019 10:55:00 PM,07/08/2013,,,Dog,Spayed Female,6 years,Pit Bull,Tan/White,07/21/2019 10:55:00 PM


In [58]:
animal_outcomes['date_outcomes'] = animal_outcomes.\
DateTime.map(lambda x:pd.to_datetime(x[:10], format = '%m/%d/%Y', errors = 'ignore'))
animal_outcomes.head(2)
# x[:10] slices the string like a list, keeping only the first 10 characters (the MM/DD/YYYY date)

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,date_outcomes
0,A800130,Kolby,07/21/2019 10:56:00 PM,07/21/2019 10:56:00 PM,05/01/2019,Adoption,,Dog,Spayed Female,2 months,Boxer,Brown,2019-07-21
1,A799457,Hazel,07/21/2019 10:55:00 PM,07/21/2019 10:55:00 PM,07/08/2013,,,Dog,Spayed Female,6 years,Pit Bull,Tan/White,2019-07-21
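
As an alternative to mapping a lambda over each string, `pd.to_datetime` can parse a whole Series at once. This is a sketch under assumptions: the sample values are copied from the output above plus one made-up bad entry, and `errors='coerce'` is used so that unparseable values become `NaT` instead of passing through silently as strings.

```python
import pandas as pd

# Two real-looking timestamps and one deliberately bad (hypothetical) value.
raw = pd.Series(['07/21/2019 10:56:00 PM', '07/21/2019 07:13:00 PM', 'not a date'])

# The format string covers the date, the 12-hour clock time, and the AM/PM marker.
parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p', errors='coerce')

# Keep the date but reset the time of day to midnight.
dates_only = parsed.dt.normalize()
```

Checking `parsed.isna().sum()` afterwards shows how many values failed to parse.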


In [67]:
animal_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104442 entries, 0 to 104441
Data columns (total 12 columns):
Animal ID           104442 non-null object
Name                71690 non-null object
MonthYear           104442 non-null object
Date of Birth       104442 non-null object
Outcome Type        104435 non-null object
Outcome Subtype     47571 non-null object
Animal Type         104442 non-null object
Sex upon Outcome    104440 non-null object
Age upon Outcome    104427 non-null object
Breed               104442 non-null object
Color               104442 non-null object
date_outcomes       104442 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(11)
memory usage: 9.6+ MB


In [63]:
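# Drop the original string column now that date_outcomes holds the parsed dates.
# errors='ignore' makes the cell safe to re-run: dropping an already-removed column no longer raises a KeyError.
animal_outcomes.drop(columns ='DateTime', inplace = True, errors = 'ignore')
# animal_outcomes.head(2)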

In [73]:
animal_outcomes['dob'] = animal_outcomes['Date of Birth'].map(lambda x:pd.to_datetime(x, format = '%m/%d/%Y', errors = 'ignore'))
animal_outcomes.head(2)

Unnamed: 0,Animal ID,Name,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,date_outcomes,dob
0,A800130,Kolby,07/21/2019 10:56:00 PM,05/01/2019,Adoption,,Dog,Spayed Female,2 months,Boxer,Brown,2019-07-21,2019-05-01
1,A799457,Hazel,07/21/2019 10:55:00 PM,07/08/2013,,,Dog,Spayed Female,6 years,Pit Bull,Tan/White,2019-07-21,2013-07-08


In [70]:
animal_outcomes['Age upon Outcome']

0         2 months
1          6 years
2          1 month
3         2 months
4         2 months
5          5 years
6         3 months
7         3 months
8          3 years
9          2 years
10         4 years
11          1 year
12        4 months
13         1 month
14         1 month
15          1 year
16        3 months
17        7 months
18         1 month
19         1 month
20        2 months
21          1 year
22        6 months
23         2 years
24        2 months
25         2 years
26         1 month
27         1 month
28         4 years
29        6 months
            ...   
104412     3 years
104413    14 years
104414     2 years
104415    14 years
104416    9 months
104417    10 years
104418     3 weeks
104419     3 weeks
104420     1 weeks
104421     1 weeks
104422      1 week
104423    4 months
104424     1 month
104425     3 years
104426     7 years
104427      1 year
104428     2 years
104429     3 years
104430      1 year
104431    5 months
104432    6 months
104433     9

In [75]:
animal_outcomes['age_days'] = (animal_outcomes.date_outcomes - animal_outcomes.dob).dt.days
animal_outcomes.head(2)

Unnamed: 0,Animal ID,Name,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,date_outcomes,dob,age_days
0,A800130,Kolby,07/21/2019 10:56:00 PM,05/01/2019,Adoption,,Dog,Spayed Female,2 months,Boxer,Brown,2019-07-21,2019-05-01,81
1,A799457,Hazel,07/21/2019 10:55:00 PM,07/08/2013,,,Dog,Spayed Female,6 years,Pit Bull,Tan/White,2019-07-21,2013-07-08,2204


In [76]:
animal_outcomes['age_years'] = (animal_outcomes.age_days/365)
animal_outcomes.head(1)

Unnamed: 0,Animal ID,Name,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,date_outcomes,dob,age_days,age_years
0,A800130,Kolby,07/21/2019 10:56:00 PM,05/01/2019,Adoption,,Dog,Spayed Female,2 months,Boxer,Brown,2019-07-21,2019-05-01,81,0.221918


In [83]:
animal_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104442 entries, 0 to 104441
Data columns (total 13 columns):
Animal ID           104442 non-null object
Name                71690 non-null object
Date of Birth       104442 non-null object
Outcome Type        104435 non-null object
Outcome Subtype     47571 non-null object
Animal Type         104442 non-null object
Sex upon Outcome    104440 non-null object
Breed               104442 non-null object
Color               104442 non-null object
date_outcomes       104442 non-null datetime64[ns]
dob                 104442 non-null datetime64[ns]
age_days            104442 non-null int64
age_years           104442 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1), object(9)
memory usage: 10.4+ MB


In [96]:
# errors='ignore' makes this cell safe to re-run after the columns have already been dropped
animal_outcomes.drop(columns = ['MonthYear', 'Age upon Outcome'], inplace = True, errors = 'ignore')

In [91]:
animal_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104442 entries, 0 to 104441
Data columns (total 14 columns):
Animal ID           104442 non-null object
Name                71690 non-null object
Date of Birth       104442 non-null object
Outcome Type        104435 non-null object
Outcome Subtype     47571 non-null object
Animal Type         104442 non-null object
Sex upon Outcome    104440 non-null object
Breed               104442 non-null object
Color               104442 non-null object
date_outcomes       104442 non-null datetime64[ns]
dob                 104442 non-null datetime64[ns]
age_days            104442 non-null int64
age_years           104442 non-null float64
year                104442 non-null int64
dtypes: datetime64[ns](2), float64(1), int64(2), object(9)
memory usage: 11.2+ MB


In [95]:
animal_outcomes.head()

Unnamed: 0,Animal ID,Name,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Breed,Color,date_outcomes,dob,age_days,age_years,year,month
0,A800130,Kolby,05/01/2019,Adoption,,Dog,Spayed Female,Boxer,Brown,2019-07-21,2019-05-01,81,0.221918,2019,7
1,A799457,Hazel,07/08/2013,,,Dog,Spayed Female,Pit Bull,Tan/White,2019-07-21,2013-07-08,2204,6.038356,2019,7
2,A800069,,06/02/2019,Transfer,Partner,Cat,Intact Male,Domestic Shorthair,Orange Tabby,2019-07-21,2019-06-02,49,0.134247,2019,7
3,A795483,*Herb,04/21/2019,Adoption,,Cat,Neutered Male,Domestic Shorthair,Orange Tabby/White,2019-07-21,2019-04-21,91,0.249315,2019,7
4,A795482,*Rain,04/21/2019,Adoption,,Cat,Neutered Male,Domestic Shorthair,Orange Tabby/White,2019-07-21,2019-04-21,91,0.249315,2019,7


In [99]:
# errors='ignore' makes this cell safe to re-run after the column has already been dropped
animal_outcomes.drop(columns = ['Outcome Subtype'], inplace = True, errors = 'ignore')
animal_outcomes.head(2)

In [100]:
animal_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104442 entries, 0 to 104441
Data columns (total 14 columns):
Animal ID           104442 non-null object
Name                71690 non-null object
Date of Birth       104442 non-null object
Outcome Type        104435 non-null object
Animal Type         104442 non-null object
Sex upon Outcome    104440 non-null object
Breed               104442 non-null object
Color               104442 non-null object
date_outcomes       104442 non-null datetime64[ns]
dob                 104442 non-null datetime64[ns]
age_days            104442 non-null int64
age_years           104442 non-null float64
year                104442 non-null int64
month               104442 non-null int64
dtypes: datetime64[ns](2), float64(1), int64(3), object(8)
memory usage: 11.2+ MB


In [101]:
animal_outcomes.head(2)

Unnamed: 0,Animal ID,Name,Date of Birth,Outcome Type,Animal Type,Sex upon Outcome,Breed,Color,date_outcomes,dob,age_days,age_years,year,month
0,A800130,Kolby,05/01/2019,Adoption,Dog,Spayed Female,Boxer,Brown,2019-07-21,2019-05-01,81,0.221918,2019,7
1,A799457,Hazel,07/08/2013,,Dog,Spayed Female,Pit Bull,Tan/White,2019-07-21,2013-07-08,2204,6.038356,2019,7


## 3. Methods for Re-Organizing DataFrames
#### `.groupby()`

Those of you familiar with SQL have probably used the GROUP BY command. Pandas has this, too.

The `.groupby()` method is especially useful for aggregate functions applied to the data grouped in particular ways.

In [84]:
uci.groupby('sex')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d3dab70>

#### `.groups` and `.get_group()`

In [85]:
uci.groupby('sex').groups

{0: Int64Index([  2,   4,   6,  11,  14,  15,  16,  17,  19,  25,  28,  30,  35,
              36,  38,  39,  40,  43,  48,  49,  50,  53,  54,  59,  60,  65,
              67,  69,  74,  75,  82,  84,  85,  88,  89,  93,  94,  96, 102,
             105, 107, 108, 109, 110, 112, 115, 118, 119, 120, 122, 123, 124,
             125, 127, 128, 129, 130, 131, 134, 135, 136, 140, 142, 143, 144,
             146, 147, 151, 153, 154, 155, 161, 167, 181, 182, 190, 204, 207,
             213, 215, 216, 220, 223, 241, 246, 252, 258, 260, 263, 266, 278,
             289, 292, 296, 298, 302],
            dtype='int64'),
 1: Int64Index([  0,   1,   3,   5,   7,   8,   9,  10,  12,  13,
             ...
             288, 290, 291, 293, 294, 295, 297, 299, 300, 301],
            dtype='int64', length=207)}

In [86]:
uci.groupby('sex').get_group(0) # .tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,age_plus
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,42
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,58
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1,57
11,48,0,2,130,275,0,1,139,0,0.2,2,0,2,1,49
14,58,0,3,150,283,1,0,162,0,1.0,2,0,2,1,59
15,50,0,2,120,219,0,1,158,0,1.6,1,0,2,1,51
16,58,0,2,120,340,0,1,172,0,0.0,2,0,2,1,59
17,66,0,3,150,226,0,1,114,0,2.6,0,0,2,1,67
19,69,0,3,140,239,0,1,151,0,1.8,2,2,2,1,70
25,71,0,1,160,302,0,1,162,0,0.4,2,2,2,1,72


### Aggregating

In [None]:
uci.groupby('sex').std()
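
A groupby splits the rows by key, applies a function to each group, and combines the results. Here is a minimal sketch on made-up data (`toy` only loosely mimics the heart dataset's columns, and the output names `mean_chol_by_sex` and `min_age` are our own).

```python
import pandas as pd

# Hypothetical toy data, loosely shaped like the heart dataset.
toy = pd.DataFrame({
    'sex':  [0, 0, 1, 1, 1],
    'chol': [200, 300, 210, 230, 250],
    'age':  [50, 60, 40, 45, 65],
})

# One column, one statistic per group.
mean_chol = toy.groupby('sex')['chol'].mean()

# Several statistics at once, using named aggregation: new_name=(column, function).
summary = toy.groupby('sex').agg(
    mean_chol_by_sex=('chol', 'mean'),
    min_age=('age', 'min'),
)
```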

Exercise: Tell me the average cholesterol level for those with heart disease.

In [None]:
# Your code here!


### Apply to Animal Shelter Data

#### Task 1
- Use a groupby to show the average age of the different kinds of animal types.
- What about by animal types **and** gender?
 

In [87]:
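# Select the numeric age columns explicitly; recent pandas versions raise an error when asked to average text columns
animal_outcomes.groupby('Animal Type')[['age_days', 'age_years']].mean()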

Animal Type,age_days,age_years
Bird,511.893305,1.402447
Cat,539.735487,1.478727
Dog,1023.865479,2.805111
Livestock,419.6875,1.149829
Other,463.717912,1.27046
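
Grouping by more than one column works the same way: pass a list of keys. This is a sketch on a few made-up rows that reuse the shelter data's column names; the values are hypothetical.

```python
import pandas as pd

# Hypothetical rows using the shelter data's column names.
toy = pd.DataFrame({
    'Animal Type': ['Dog', 'Dog', 'Cat', 'Cat'],
    'Sex upon Outcome': ['Spayed Female', 'Neutered Male', 'Spayed Female', 'Spayed Female'],
    'age_years': [2.0, 4.0, 1.0, 3.0],
})

# The result is a Series indexed by (Animal Type, Sex upon Outcome) pairs.
by_type_and_sex = toy.groupby(['Animal Type', 'Sex upon Outcome'])['age_years'].mean()

# .unstack() moves the second key into the columns, which is easier to read.
as_table = by_type_and_sex.unstack()
```

Combinations that never occur in the data, such as a neutered male cat in this toy example, appear as NaN in `as_table`.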


In [88]:
animal_outcomes.head()

Unnamed: 0,Animal ID,Name,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Breed,Color,date_outcomes,dob,age_days,age_years
0,A800130,Kolby,05/01/2019,Adoption,,Dog,Spayed Female,Boxer,Brown,2019-07-21,2019-05-01,81,0.221918
1,A799457,Hazel,07/08/2013,,,Dog,Spayed Female,Pit Bull,Tan/White,2019-07-21,2013-07-08,2204,6.038356
2,A800069,,06/02/2019,Transfer,Partner,Cat,Intact Male,Domestic Shorthair,Orange Tabby,2019-07-21,2019-06-02,49,0.134247
3,A795483,*Herb,04/21/2019,Adoption,,Cat,Neutered Male,Domestic Shorthair,Orange Tabby/White,2019-07-21,2019-04-21,91,0.249315
4,A795482,*Rain,04/21/2019,Adoption,,Cat,Neutered Male,Domestic Shorthair,Orange Tabby/White,2019-07-21,2019-04-21,91,0.249315


#### Task 2:
- Create new columns `year` and `month` by mapping lambda functions such as `lambda x: x.year` over the outcome date column
- Use `groupby` and `.size()` to tell me how many animals are adopted by month

In [90]:
# Your code here
animal_outcomes['year'] = animal_outcomes.date_outcomes.map(lambda x:x.year)
animal_outcomes.head()

Unnamed: 0,Animal ID,Name,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Breed,Color,date_outcomes,dob,age_days,age_years,year
0,A800130,Kolby,05/01/2019,Adoption,,Dog,Spayed Female,Boxer,Brown,2019-07-21,2019-05-01,81,0.221918,2019
1,A799457,Hazel,07/08/2013,,,Dog,Spayed Female,Pit Bull,Tan/White,2019-07-21,2013-07-08,2204,6.038356,2019
2,A800069,,06/02/2019,Transfer,Partner,Cat,Intact Male,Domestic Shorthair,Orange Tabby,2019-07-21,2019-06-02,49,0.134247,2019
3,A795483,*Herb,04/21/2019,Adoption,,Cat,Neutered Male,Domestic Shorthair,Orange Tabby/White,2019-07-21,2019-04-21,91,0.249315,2019
4,A795482,*Rain,04/21/2019,Adoption,,Cat,Neutered Male,Domestic Shorthair,Orange Tabby/White,2019-07-21,2019-04-21,91,0.249315,2019


In [93]:
animal_outcomes['month'] = animal_outcomes.date_outcomes.map(lambda x:x.month)
animal_outcomes.head()

Unnamed: 0,Animal ID,Name,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Breed,Color,date_outcomes,dob,age_days,age_years,year,month
0,A800130,Kolby,05/01/2019,Adoption,,Dog,Spayed Female,Boxer,Brown,2019-07-21,2019-05-01,81,0.221918,2019,7
1,A799457,Hazel,07/08/2013,,,Dog,Spayed Female,Pit Bull,Tan/White,2019-07-21,2013-07-08,2204,6.038356,2019,7
2,A800069,,06/02/2019,Transfer,Partner,Cat,Intact Male,Domestic Shorthair,Orange Tabby,2019-07-21,2019-06-02,49,0.134247,2019,7
3,A795483,*Herb,04/21/2019,Adoption,,Cat,Neutered Male,Domestic Shorthair,Orange Tabby/White,2019-07-21,2019-04-21,91,0.249315,2019,7
4,A795482,*Rain,04/21/2019,Adoption,,Cat,Neutered Male,Domestic Shorthair,Orange Tabby/White,2019-07-21,2019-04-21,91,0.249315,2019,7
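
One possible shape for the second part of the task, sketched on made-up rows (the values are hypothetical; the column names follow the cells above). It also uses the `.dt` accessor, which extracts the month for the whole column without a lambda.

```python
import pandas as pd

# Hypothetical rows using the column names from the cells above.
toy = pd.DataFrame({
    'Outcome Type': ['Adoption', 'Transfer', 'Adoption', 'Adoption'],
    'date_outcomes': pd.to_datetime(['2019-07-21', '2019-07-20', '2019-06-02', '2019-07-01']),
})

# Same result as .map(lambda x: x.month), computed for the whole column at once.
toy['month'] = toy['date_outcomes'].dt.month

# Keep adoptions only, then count the rows in each month.
adoptions = toy[toy['Outcome Type'] == 'Adoption']
adoptions_per_month = adoptions.groupby('month').size()
```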


## 4. Reshaping a DataFrame

### `.pivot()`

Those of you familiar with Excel have probably used Pivot Tables. Pandas has a similar functionality.

In [94]:
uci.pivot(values = 'sex', columns = 'target').head()

target,0,1
0,,1.0
1,,1.0
2,,0.0
3,,1.0
4,,0.0
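
In the output above no `index` was given, so the existing row index is kept and each row fills only its own `target` column, which is why half the cells are NaN. A pivot is usually easier to read when `index`, `columns`, and `values` are all specified. Here is a sketch on made-up long-format data (`long_df` and its columns are hypothetical).

```python
import pandas as pd

# Hypothetical long-format data: one row per (city, year) pair.
long_df = pd.DataFrame({
    'city': ['Austin', 'Austin', 'Dallas', 'Dallas'],
    'year': [2018, 2019, 2018, 2019],
    'adoptions': [100, 120, 80, 90],
})

# One row per city, one column per year, cells filled from 'adoptions'.
wide = long_df.pivot(index='city', columns='year', values='adoptions')
```

`.pivot()` raises an error if an (index, columns) pair appears more than once; `pd.pivot_table()` handles that case by aggregating the duplicates.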


### Methods for Combining and Reshaping DataFrames: `.join()`, `.merge()`, `pd.concat()`, `pd.melt()`

### `.join()`

In [None]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns = ['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns = ['age', 'HP'])

In [None]:
toy1.join(toy2.set_index('age'),
          on = 'age',
          lsuffix = '_A',
          rsuffix = '_B').head()
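
For reference, here is the same join rebuilt as a self-contained sketch, with a note on what each argument does. The result is stored in a variable named `joined` (our own name) so it can be inspected.

```python
import pandas as pd

toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'HP'])

# .join() matches the left frame's 'age' column (on='age') against the right
# frame's index, so toy2 needs 'age' as its index first. Both frames have an
# 'HP' column, and the suffixes keep the two apart.
joined = toy1.join(toy2.set_index('age'), on='age', lsuffix='_A', rsuffix='_B')
```

The result has the columns `age`, `HP_A`, and `HP_B`.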

### `.merge()`

In [None]:
ds_chars = pd.read_csv('ds_chars.csv', index_col = 0)

In [None]:
states = pd.read_csv('states.csv', index_col = 0)

In [None]:
ds_chars.merge(states,
               left_on='home_state',
               right_on = 'state',
               how = 'inner')
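
The `how` argument decides which rows survive a merge. This is a sketch with made-up stand-ins for the two CSV files (the frames `people` and `state_info` and all their values are hypothetical; only the key names `home_state` and `state` come from the cell above).

```python
import pandas as pd

# Hypothetical stand-ins for ds_chars.csv and states.csv.
people = pd.DataFrame({'name': ['Ana', 'Ben', 'Cy'],
                       'home_state': ['TX', 'WA', 'ZZ']})
state_info = pd.DataFrame({'state': ['TX', 'WA'],
                           'nickname': ['Lone Star State', 'Evergreen State']})

# Inner: keep only rows whose key appears in both frames.
inner = people.merge(state_info, left_on='home_state', right_on='state', how='inner')

# Left: keep every row of the left frame; unmatched rows get NaN on the right.
left = people.merge(state_info, left_on='home_state', right_on='state', how='left')
```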

### `pd.concat()`

Exercise: Look up the documentation on pd.concat (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) and use it to concatenate ds_chars and states.
<br/>
Your result should still have only five rows!

In [None]:
pd.concat([ds_chars, states])
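
The row count of the result depends on the `axis` argument. Here is a sketch on two tiny made-up frames (`left` and `right` are hypothetical) showing the difference.

```python
import pandas as pd

left = pd.DataFrame({'name': ['Ana', 'Ben']})     # hypothetical
right = pd.DataFrame({'state': ['TX', 'WA']})     # hypothetical

# axis=0 (the default) appends rows; columns missing from a frame are filled with NaN.
stacked = pd.concat([left, right])

# axis=1 places the frames side by side, aligning rows on the index.
side_by_side = pd.concat([left, right], axis=1)
```

Here `stacked` has four rows and `side_by_side` has two.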

### `pd.melt()`

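Melting converts a wide DataFrame into long format: the columns named in `value_vars` are stacked into a pair of columns, 'variable' (the original column name) and 'value' (its entry), while the `id_vars` columns are repeated as identifiers.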

In [None]:
ds_chars.head()

In [None]:
pd.melt(ds_chars,
        id_vars=['name'],
        value_vars=['HP', 'home_state'])
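
Since `ds_chars.csv` is not shown here, this is a sketch of the same call on a made-up stand-in (`chars` and its values are hypothetical; the column names follow the cell above).

```python
import pandas as pd

# Hypothetical stand-in for ds_chars.
chars = pd.DataFrame({
    'name': ['Ana', 'Ben'],
    'HP': [142, 47],
    'home_state': ['TX', 'WA'],
})

# Each (name, column) pair becomes its own row.
long_form = pd.melt(chars, id_vars=['name'], value_vars=['HP', 'home_state'])
```

Two rows and two melted columns give four rows, with the columns `name`, `variable`, and `value`.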

## Bringing it all together with the Animal Shelter Data

Join the data from the [Austin Animal Shelter Intake dataset](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) to the outcomes dataset by Animal ID.

Use the dates from each dataset to see how long animals spend in the shelter. Does it differ by time of year? By outcome?

The URL for the Intake Dataset is here: https://data.austintexas.gov/api/views/wter-evkm/rows.csv?accessType=DOWNLOAD

_Hints_ :
- import and clean the intake dataset first
- use apply/applymap/lambda to change the variables to their proper format in the intake data
- rename the columns in the intake dataset *before* joining
- create a new days-in-shelter variable
- Notice that some values in the "days_in_shelter" column are NaN or negative (remove these rows using a comparison operator such as `>=` together with `.notna()`, or equivalently `~` with `.isna()`)
- Use `groupby` to get some interesting information about the dataset

Make sure to export and save your cleaned dataset. We will use it in a later lecture!

Use the notation `df.to_csv()` to write the `df` to a csv. Read more in the `to_csv()` documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)
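
A minimal sketch of the filtering and export steps, on made-up data (the frame `merged` and its values are hypothetical; `days_in_shelter` is the column name suggested in the hints). It writes to an in-memory buffer so that it runs anywhere; pass a file path to `to_csv()` instead to save to disk.

```python
import io
import pandas as pd

# Hypothetical merged data with one valid, one negative, and one missing value.
merged = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A3'],
    'days_in_shelter': [12.0, -3.0, None],
})

# Keep rows whose value is present and non-negative.
keep = merged['days_in_shelter'].notna() & (merged['days_in_shelter'] >= 0)
cleaned = merged[keep]

# index=False leaves the row index out of the file.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
```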

In [None]:
#code here