# Module 1 - Reshaping Data with Pandas
## Pandas Part 3

In [3]:
names = ["Benjamin", "Bernadette", "Brian", "Betty", "Bella", "Brunhilda", "Bruno"]

def is_short(a):
    short_names = list(filter(lambda x: len(x) < 8, a))
    return short_names
is_short(names)

['Brian', 'Betty', 'Bella', 'Bruno']

In [4]:
import pandas as pd
uci = pd.read_csv('data/heart.csv')

## 3. Methods for Re-Organizing DataFrames
#### `.groupby()`

If you are familiar with SQL, you have probably used the `GROUP BY` clause. Pandas offers the same idea through `.groupby()`.

The `.groupby()` method is especially useful for applying aggregate functions (mean, count, sum, and so on) to each group of rows separately.
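
As a minimal sketch of the idea, the toy DataFrame below (hypothetical values, not taken from `heart.csv`) is grouped by `sex` and then aggregated. Note that the grouping step alone computes nothing; the work happens when an aggregate such as `.mean()` is called.

```python
import pandas as pd

# Hypothetical toy data standing in for the heart dataset
toy = pd.DataFrame({
    "sex": [0, 1, 0, 1, 1],
    "chol": [200, 240, 260, 220, 260],
})

# Grouping alone returns a DataFrameGroupBy object; nothing is aggregated yet
grouped = toy.groupby("sex")

# Roughly equivalent to: SELECT sex, AVG(chol) FROM toy GROUP BY sex
mean_chol = grouped["chol"].mean()
print(mean_chol)
```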

In [5]:
uci.groupby('sex')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022CA7562B48>

#### `.groups` and `.get_group()`

In [6]:
uci.groupby('sex').groups

{0: Int64Index([  2,   4,   6,  11,  14,  15,  16,  17,  19,  25,  28,  30,  35,
              36,  38,  39,  40,  43,  48,  49,  50,  53,  54,  59,  60,  65,
              67,  69,  74,  75,  82,  84,  85,  88,  89,  93,  94,  96, 102,
             105, 107, 108, 109, 110, 112, 115, 118, 119, 120, 122, 123, 124,
             125, 127, 128, 129, 130, 131, 134, 135, 136, 140, 142, 143, 144,
             146, 147, 151, 153, 154, 155, 161, 167, 181, 182, 190, 204, 207,
             213, 215, 216, 220, 223, 241, 246, 252, 258, 260, 263, 266, 278,
             289, 292, 296, 298, 302],
            dtype='int64'),
 1: Int64Index([  0,   1,   3,   5,   7,   8,   9,  10,  12,  13,
             ...
             288, 290, 291, 293, 294, 295, 297, 299, 300, 301],
            dtype='int64', length=207)}

In [7]:
uci.groupby('sex').get_group(0)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
11,48,0,2,130,275,0,1,139,0,0.2,2,0,2,1
14,58,0,3,150,283,1,0,162,0,1.0,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,55,0,0,128,205,0,2,130,1,2.0,1,1,3,0
292,58,0,0,170,225,1,0,146,1,2.8,1,2,1,0
296,63,0,0,124,197,0,1,136,1,0.0,1,0,2,0
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0


### Aggregating

In [8]:
uci.groupby('sex').mean()

sex,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,55.677083,1.041667,133.083333,261.302083,0.125,0.572917,151.125,0.229167,0.876042,1.427083,0.552083,2.125,0.75
1,53.758454,0.932367,130.94686,239.289855,0.15942,0.507246,148.961353,0.371981,1.115459,1.386473,0.811594,2.400966,0.449275
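
Beyond `.mean()`, a grouped object can apply several aggregations at once with `.agg()`. The sketch below uses a small made-up DataFrame (the values are assumptions for illustration, not rows from `heart.csv`).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": [0, 0, 1, 1],
    "age": [50, 60, 40, 44],
    "chol": [250, 270, 230, 210],
})

# Select a column first, then list the aggregations to apply to it
summary = toy.groupby("sex")["chol"].agg(["mean", "max", "count"])
print(summary)

# A dict applies a different aggregation to each column
per_column = toy.groupby("sex").agg({"age": "mean", "chol": "max"})
print(per_column)
```

Selecting the columns of interest before aggregating also avoids trying to average columns that are not numeric.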


Exercise: Tell me the average cholesterol level for those with heart disease.

In [9]:
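# Your code here!
# Note: this groups by `cp` (chest pain type), not heart disease status; for the
# exercise, try uci.groupby('target')['chol'].mean() (assuming target == 1 means disease).
uci.groupby('cp').get_group(1).mean()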

age          51.360
sex           0.640
cp            1.000
trestbps    128.400
chol        244.780
fbs           0.100
restecg       0.620
thalach     162.420
exang         0.080
oldpeak       0.316
slope         1.680
ca            0.420
thal          2.140
target        0.820
dtype: float64

### Apply to Animal Shelter Data 

In [10]:
# Note: `wter-evkm` is the Intakes dataset, even though the variable is named `animal_outcomes`.
animal_outcomes = pd.read_csv('https://data.austintexas.gov/api/views/wter-evkm/rows.csv?accessType=DOWNLOAD')

In [44]:
animal_outcomes

Unnamed: 0,animal_id,name,datetime,monthyear,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,days_upon_outcome,age
0,A786884,*Brock,01/03/2019 04:19:00 PM,01/03/2019 04:19:00 PM,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,730,1.43
1,A706918,Belle,07/05/2015 12:59:00 PM,07/05/2015 12:59:00 PM,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,2920,4.93
2,A724273,Runster,04/14/2016 06:43:00 PM,04/14/2016 06:43:00 PM,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,330,4.15
3,A665644,,10/21/2013 07:59:00 AM,10/21/2013 07:59:00 AM,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,28,6.64
4,A682524,Rio,06/29/2014 10:38:00 AM,06/29/2014 10:38:00 AM,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,1460,5.95
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117733,A818421,,06/08/2020 04:14:00 PM,06/08/2020 04:14:00 PM,5411 Evans in Austin (TX),Stray,Normal,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Tabby,21,0.00
117734,A818435,,06/08/2020 05:52:00 PM,06/08/2020 05:52:00 PM,Lady Bird Lake in Austin (TX),Stray,Normal,Cat,Intact Female,3 weeks,Domestic Shorthair Mix,Brown Tabby,21,0.00
117735,A818434,,06/08/2020 05:52:00 PM,06/08/2020 05:52:00 PM,Lady Bird Lake in Austin (TX),Stray,Normal,Cat,Intact Female,3 weeks,Domestic Shorthair Mix,Lynx Point,21,0.00
117736,A811195,*Lawerance,06/08/2020 11:35:00 PM,06/08/2020 11:35:00 PM,8712 Manor Rd in Travis (TX),Stray,Normal,Dog,Neutered Male,5 years,German Shepherd,Brown/Black,1825,0.00


In [40]:
import datetime

# Years elapsed between each record's timestamp and today, rounded to 2 decimals
now = datetime.datetime.now()
animal_outcomes['age'] = pd.to_datetime(animal_outcomes.datetime).map(
    lambda x: round((now - x).days / 365, 2))


#### Task 1
- Use a groupby to show the average age of the different kinds of animal types.
- What about by animal types **and** gender?

In [47]:
animal_outcomes.loc[:, ['animal_type', 'age']].groupby('animal_type').mean()

animal_type,age
Bird,3.079106
Cat,3.425278
Dog,3.43151
Livestock,2.779524
Other,3.538572
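
For the second part of Task 1, `groupby` also accepts a list of column names. The following is a minimal sketch on made-up rows; the column names mirror the shelter data, but the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows shaped like the shelter data
toy = pd.DataFrame({
    "animal_type": ["Dog", "Dog", "Cat", "Cat", "Dog"],
    "sex_upon_intake": ["Intact Male", "Spayed Female", "Intact Male",
                        "Intact Male", "Intact Male"],
    "age": [2.0, 4.0, 1.0, 3.0, 6.0],
})

# Passing a list of keys groups by every observed combination of the two columns
by_type_and_sex = toy.groupby(["animal_type", "sex_upon_intake"])["age"].mean()
print(by_type_and_sex)

# unstack() moves the inner key into columns for a side-by-side view
print(by_type_and_sex.unstack())
```

The result is indexed by both keys (a MultiIndex), so a single combination can be looked up with a tuple such as `("Dog", "Intact Male")`.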


In [51]:
# Pass the axis by keyword; the positional form is not accepted by recent pandas versions
animal_outcomes.drop('days_upon_outcome', axis=1)

Unnamed: 0,animal_id,name,datetime,monthyear,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,age
0,A786884,*Brock,01/03/2019 04:19:00 PM,01/03/2019 04:19:00 PM,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,1.43
1,A706918,Belle,07/05/2015 12:59:00 PM,07/05/2015 12:59:00 PM,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,4.93
2,A724273,Runster,04/14/2016 06:43:00 PM,04/14/2016 06:43:00 PM,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,4.15
3,A665644,,10/21/2013 07:59:00 AM,10/21/2013 07:59:00 AM,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,6.64
4,A682524,Rio,06/29/2014 10:38:00 AM,06/29/2014 10:38:00 AM,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,5.95
...,...,...,...,...,...,...,...,...,...,...,...,...,...
117733,A818421,,06/08/2020 04:14:00 PM,06/08/2020 04:14:00 PM,5411 Evans in Austin (TX),Stray,Normal,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Tabby,0.00
117734,A818435,,06/08/2020 05:52:00 PM,06/08/2020 05:52:00 PM,Lady Bird Lake in Austin (TX),Stray,Normal,Cat,Intact Female,3 weeks,Domestic Shorthair Mix,Brown Tabby,0.00
117735,A818434,,06/08/2020 05:52:00 PM,06/08/2020 05:52:00 PM,Lady Bird Lake in Austin (TX),Stray,Normal,Cat,Intact Female,3 weeks,Domestic Shorthair Mix,Lynx Point,0.00
117736,A811195,*Lawerance,06/08/2020 11:35:00 PM,06/08/2020 11:35:00 PM,8712 Manor Rd in Travis (TX),Stray,Normal,Dog,Neutered Male,5 years,German Shepherd,Brown/Black,0.00


#### Task 2
- Create new columns `year` and `month` by applying a lambda function (for example `lambda x: x.year`) to the parsed date column.
- Use `groupby` and `.size()` to count how many animals are adopted in each month.

In [None]:
# Your code here
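
One possible shape for this task, sketched on three made-up timestamps rather than the real dataset. The column name `datetime` and the timestamp format are assumptions based on the table shown earlier.

```python
import pandas as pd

# Hypothetical timestamps in the same style as the shelter data
toy = pd.DataFrame({
    "datetime": [
        "01/03/2019 04:19:00 PM",
        "01/15/2019 09:00:00 AM",
        "07/05/2015 12:59:00 PM",
    ]
})

# Parse the strings once, then pull out the parts with lambda functions
dates = pd.to_datetime(toy["datetime"], format="%m/%d/%Y %I:%M:%S %p")
toy["year"] = dates.map(lambda x: x.year)
toy["month"] = dates.map(lambda x: x.month)

# size() counts the rows in each group
counts = toy.groupby("month").size()
print(counts)
```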

## 4. Reshaping a DataFrame

### `.pivot()`

If you are familiar with Excel, you have probably used Pivot Tables. Pandas has similar functionality.

In [52]:
uci

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [64]:
uci.pivot(values=['cp','sex'], columns='target')

,cp,cp,sex,sex
target,0,1,0,1
0,,3.0,,1.0
1,,2.0,,1.0
2,,1.0,,0.0
3,,1.0,,1.0
4,,0.0,,0.0
...,...,...,...,...
298,0.0,,0.0,
299,3.0,,1.0,
300,0.0,,1.0,
301,0.0,,1.0,
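
A pivot is often easier to read when `index`, `columns`, and `values` are all given. The sketch below reshapes a small, made-up long-format table (hypothetical patients and measurements) into wide format.

```python
import pandas as pd

# Hypothetical long-format data: one row per (patient, measure) pair
long_df = pd.DataFrame({
    "patient": ["a", "a", "b", "b"],
    "measure": ["chol", "trestbps", "chol", "trestbps"],
    "value": [233, 145, 250, 130],
})

# Rows come from `index`, new column labels from `columns`, cells from `values`
wide = long_df.pivot(index="patient", columns="measure", values="value")
print(wide)
```

`.pivot()` raises an error if an index/column pair appears more than once; `pd.pivot_table()` handles that case by aggregating the duplicates.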


### Methods for Combining DataFrames: `.join()`, `.merge()`, `pd.concat()` (and Reshaping with `pd.melt()`)

### `.join()`

In [65]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'HP'])

In [69]:
toy1.join(toy2.set_index('age'),
          on='age',
          lsuffix='_A',
          rsuffix='_B')

Unnamed: 0,age,HP_A,HP_B
0,63,142,100
1,33,47,200


### `.merge()`

In [None]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col=0)

In [None]:
states = pd.read_csv('data/states.csv', index_col=0)

In [None]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='inner')
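
Since `data/ds_chars.csv` and `data/states.csv` are not shown here, the following sketch uses two small stand-in DataFrames. The rows and the `region` column are hypothetical; only `home_state` and `state` come from the call above.

```python
import pandas as pd

# Hypothetical stand-ins for ds_chars and states
chars = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "HP": [142, 47, 90],
    "home_state": ["WA", "TX", "ZZ"],   # "ZZ" has no match in `states`
})
states = pd.DataFrame({
    "state": ["WA", "TX", "NY"],
    "region": ["West", "South", "Northeast"],
})

# Inner join: keep only rows whose key appears in both frames
inner = chars.merge(states, left_on="home_state", right_on="state", how="inner")
print(inner)

# Left join: keep every row of `chars`, filling unmatched columns with NaN
left = chars.merge(states, left_on="home_state", right_on="state", how="left")
print(left)
```

Comparing the row counts of the two results is a quick way to see how many keys failed to match.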

### `pd.concat()`

In [None]:
pd.concat([ds_chars, states], sort=False)
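
Unlike `.merge()`, `pd.concat()` does not match on keys; it simply stacks frames along an axis. A minimal sketch with two made-up single-row frames shows what happens when the columns differ.

```python
import pandas as pd

# Hypothetical frames with no columns in common
a = pd.DataFrame({"name": ["Ana"], "HP": [142]})
b = pd.DataFrame({"state": ["WA"], "region": ["West"]})

# Default axis=0 stacks rows; columns missing from one frame become NaN
stacked = pd.concat([a, b], sort=False)
print(stacked)

# ignore_index=True replaces the repeated original index with 0..n-1
reindexed = pd.concat([a, b], ignore_index=True, sort=False)

# axis=1 places the frames side by side, aligned on the index
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side)
```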

### `pd.melt()`

Melting unpivots a DataFrame from wide to long format: the identifier columns are kept, and every other selected column is turned into a pair of 'variable' (the old column name) and 'value' entries.

In [None]:
ds_chars.head()

In [None]:
pd.melt(ds_chars,
        id_vars=['name'],
        value_vars=['HP', 'home_state'])
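
To see the effect without the CSV file, here is the same call on a two-row stand-in for `ds_chars` (the rows are hypothetical; the column names follow the call above).

```python
import pandas as pd

# Hypothetical stand-in for ds_chars
wide = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "HP": [142, 47],
    "home_state": ["WA", "TX"],
})

# Each (name, column) pair becomes its own row
long_df = pd.melt(wide, id_vars=["name"], value_vars=["HP", "home_state"])
print(long_df)
```

Two rows and two value columns yield four rows in the melted result, one per original cell.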

## Bringing it all together with the Animal Shelter Data

Join the data from the [Austin Animal Shelter Intake dataset](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) to the outcomes dataset by Animal ID.

Use the dates from each dataset to see how long animals spend in the shelter. Does it differ by time of year? By outcome?

_Hints_:
- Import and clean the intake dataset first.
- Use `apply`/`applymap`/`lambda` to convert the variables in the intake data to their proper types.
- Rename the columns in the intake dataset *before* joining.
- Create a new `days_in_shelter` column.
- Some values in `days_in_shelter` will be `NaN` or negative; remove these rows using the `<` operator together with `isna()` or `dropna()`.
- Use `groupby` to get aggregate information about the dataset (your choice).

To save your dataset:
Use `df.to_csv()` or `df.to_excel()` to write `df` to a CSV or Excel file. Read more in the `to_csv()` documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)
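
The hints above can be sketched end to end on toy data. Everything in this block is an assumption for illustration: the rows are made up, the column names `intake_date`, `outcome_date`, and `outcome_type` are hypothetical, and each animal is assumed to appear only once per table.

```python
import pandas as pd

# Hypothetical, already-cleaned intake and outcome tables
intakes = pd.DataFrame({
    "animal_id": ["A1", "A2", "A3"],
    "intake_date": pd.to_datetime(["2019-01-03", "2019-02-01", "2019-03-10"]),
})
outcomes = pd.DataFrame({
    "animal_id": ["A1", "A2"],
    "outcome_date": pd.to_datetime(["2019-01-10", "2019-01-25"]),
    "outcome_type": ["Adoption", "Transfer"],
})

# Join on the animal ID; A3 has no outcome yet, so its outcome columns are missing
merged = intakes.merge(outcomes, on="animal_id", how="left")

# Subtracting datetimes gives a timedelta; .dt.days converts it to whole days
merged["days_in_shelter"] = (merged["outcome_date"] - merged["intake_date"]).dt.days

# Drop missing durations, then drop negative ones
valid = merged.dropna(subset=["days_in_shelter"])
valid = valid[valid["days_in_shelter"] >= 0]

# Aggregate the remaining rows by outcome
avg_days = valid.groupby("outcome_type")["days_in_shelter"].mean()
print(avg_days)
```

In the real data an animal can enter the shelter more than once, so matching each intake to the correct outcome may need more care than a single merge on the ID.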

In [261]:
animal_intakes = pd.read_csv('https://data.austintexas.gov/api/views/wter-evkm/rows.csv?accessType=DOWNLOAD')
animal_outcomes = pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD')

In [262]:
animal_intakes.isna().sum()

Animal ID               0
Name                37058
DateTime                0
MonthYear               0
Found Location          0
Intake Type             0
Intake Condition        0
Animal Type             0
Sex upon Intake         1
Age upon Intake         0
Breed                   0
Color                   0
dtype: int64

In [263]:
animal_intakes = animal_intakes.dropna(subset=['Sex upon Intake'])
animal_intakes = animal_intakes.drop('Name',axis=1)

In [264]:
animal_intakes.isna().sum()

Animal ID           0
DateTime            0
MonthYear           0
Found Location      0
Intake Type         0
Intake Condition    0
Animal Type         0
Sex upon Intake     0
Age upon Intake     0
Breed               0
Color               0
dtype: int64

In [265]:
animal_intakes = animal_intakes.drop('DateTime', axis=1)

In [266]:
animal_intakes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 117744 entries, 0 to 117744
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         117744 non-null  object
 1   MonthYear         117744 non-null  object
 2   Found Location    117744 non-null  object
 3   Intake Type       117744 non-null  object
 4   Intake Condition  117744 non-null  object
 5   Animal Type       117744 non-null  object
 6   Sex upon Intake   117744 non-null  object
 7   Age upon Intake   117744 non-null  object
 8   Breed             117744 non-null  object
 9   Color             117744 non-null  object
dtypes: object(10)
memory usage: 9.9+ MB


In [267]:
animal_outcomes.columns = [x.replace(' ', '_').lower() for x in animal_outcomes.columns]

In [268]:
animal_intakes.columns = [x.replace(' ', '_').lower() for x in animal_intakes.columns]

In [269]:
animal_outcomes = animal_outcomes.drop('datetime', axis=1)

In [273]:
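# `to_datetime` is a top-level pandas function, not a string method; the original
# `.map(lambda x: x.to_datetime)` raised AttributeError: 'str' object has no attribute 'to_datetime'.
animal_intakes['monthyear'] = pd.to_datetime(animal_intakes['monthyear'])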

## 5. Pandas Practice

### Introduction

In [1]:
# Find and import the World Cup data held in the data/ folder

### Practice Questions <a id="practice"></a>

1. Subset the DataFrame to only non-null rows.

In [None]:
#Your code here.

2. How many of the matches were in Montevideo?  

In [None]:
#Your code here.

2b. If you haven't already, investigate why this code returns zero:

```python
print(len(df[df.City=="Montevideo"]))
```

In [None]:
#Your code here.

3. How many matches did USA play in 2014?  

Hint: they could have been home or away.  

You can combine conditions like this:  
```python
# Returns rows where either condition is true
df[(condition1) | (condition2)]

# Returns rows where both conditions are true  
df[(condition1) & (condition2)]
```
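
As a small illustration of combining conditions, the sketch below uses a made-up table with placeholder team names. The column names `year`, `home`, and `away` are hypothetical and will differ from the World Cup file.

```python
import pandas as pd

# Hypothetical match table with placeholder teams
matches = pd.DataFrame({
    "year": [2014, 2014, 2010, 2014],
    "home": ["A", "B", "A", "C"],
    "away": ["B", "A", "C", "B"],
})

# Each comparison yields a boolean Series; wrap each one in parentheses
played = (matches["home"] == "A") | (matches["away"] == "A")
in_2014 = matches["year"] == 2014

# Rows where team A played (home or away) AND the year is 2014
a_in_2014 = matches[played & in_2014]
print(len(a_in_2014))
```

The parentheses matter because `|` and `&` bind more tightly than `==` in Python.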

In [None]:
#Your code here.

4. How many teams played in 1986?

In [None]:
#Your code here.

5. How many matches were there with 5 or more total goals?

In [None]:
#Your code here.

6. Come up with, and answer, two other questions that could be addressed by filtering or subsetting this DataFrame.

In [None]:
#6a Question:

In [None]:
#6a Solution (with code):

In [None]:
#6b Question:

In [None]:
#6b Solution (with code):