# Module 1 - Manipulating data with Pandas


### Pandas Part 2


![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)


### Scenario:

You have decided that you want to start your own animal shelter, but you want to get an idea of what that will entail and get more information about planning. In this lecture, we continue to look at a real data set collected by Austin Animal Center over several years, using our pandas skills from the last lecture and learning some new ones to explore the data further.

#### Our goals today are to be able to: 

Use the pandas library to:

- Get summary info about a dataset and its variables.
  - Apply and use `info`, `describe` and `dtypes`.
  - Use `mean`, `min`, `max`, and `value_counts`.
- Use `apply` and `applymap` to transform columns and create new values.
- Explain lambda functions and use them with `apply` on a DataFrame.
- Explain what a groupby object is and split a DataFrame using a groupby.


Before returning to the shelter data, we practice on the Heart Disease UCI dataset: https://www.kaggle.com/ronitf/heart-disease-uci/version/1

- The dataset is most often used to practice classification algorithms.  We will return to classification in a few weeks, but for now we wish to use the dataset to practice some pandas methods.

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('heart.csv')

In [3]:
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [4]:
print(df.shape)

(303, 14)


In [5]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB
None


- We can change the column type with the `astype` method. Let’s apply this method to the target feature and convert it to a bool:



In [6]:
df['target'] = df['target'].astype('bool')

In [7]:
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,True
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,True
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,True
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,True
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,True
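The type change can be checked with `dtypes`. The following is a minimal sketch on a hypothetical three-row frame (`toy` is an illustrative stand-in, not the contents of heart.csv):

```python
import pandas as pd

# Hypothetical toy frame standing in for heart.csv
toy = pd.DataFrame({"age": [63, 37, 41], "target": [1, 1, 0]})

print(toy.dtypes)    # both columns start out with an integer dtype

# astype returns a new Series, so assign it back to keep the change
toy["target"] = toy["target"].astype("bool")

print(toy.dtypes)    # target is now bool, age is unchanged
```

Any nonzero integer becomes `True` and zero becomes `False`, which is why this conversion suits a 0/1 label column.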


- The `describe` method shows basic statistical characteristics of each numerical feature (int64 and float64 types): the number of non-missing values, mean, standard deviation, minimum and maximum, median (the 50% row), and the 0.25 and 0.75 quantiles.

In [9]:
df.describe()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0
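To see how the rows of `describe` relate to the individual statistics, here is a small sketch on a hypothetical one-column frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical cholesterol values, not taken from heart.csv
toy = pd.DataFrame({"chol": [200, 220, 240, 260, 300]})

stats = toy.describe()
print(stats)

# The "50%" row is the median; "25%" and "75%" are the first and third quartiles
median_matches = stats.loc["50%", "chol"] == toy["chol"].median()
print(median_matches)
```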


- In order to see statistics on non-numerical features, one has to explicitly indicate the data type of interest in the `include` option as a list.

In [10]:
df.describe(include=['object', 'bool'])

,target
count,303
unique,2
top,True
freq,165


- For categorical (type object) and boolean (type bool) features we can use the `value_counts` method. It also works on int and float types. Try it out!
- Let’s have a look at the distribution of target:



In [10]:
df['target'].value_counts()

True     165
False    138
Name: target, dtype: int64

- To calculate fractions, pass `normalize=True` to the `value_counts` method.

In [11]:
df['target'].value_counts(normalize=True)

True     0.544554
False    0.455446
Name: target, dtype: float64
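As a quick check of what `normalize=True` does, this sketch compares it with dividing the counts by the number of rows. The `labels` Series is a hypothetical example with integer codes:

```python
import pandas as pd

# Hypothetical 0/1 labels, not taken from heart.csv
labels = pd.Series([1, 1, 0, 1, 0], name="target")

counts = labels.value_counts()                 # absolute frequencies
shares = labels.value_counts(normalize=True)   # relative frequencies

# The normalized result equals the counts divided by the number of rows
manual = counts / len(labels)
print(shares)
print(manual)
```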

### Sorting

- A DataFrame can be sorted by the values of one of its features (i.e., columns). For example, we can sort by *chol* (use `ascending=False` to sort in descending order):

In [12]:
df.sort_values(by='chol', ascending=False).head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
85,67,0,2,115,564,0,0,160,0,1.6,1,0,3,True
28,65,0,2,140,417,1,0,157,0,0.8,2,1,2,True
246,56,0,0,134,409,0,0,150,1,1.9,1,2,3,False
220,63,0,0,150,407,0,0,154,0,4.0,1,3,3,False
96,62,0,0,140,394,0,0,157,0,1.2,1,0,2,True


- We can also sort by multiple columns:



In [13]:
df.sort_values(by=['chol', 'age'], ascending=[True, False]).head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
111,57,1,2,150,126,1,1,173,0,0.2,2,1,3,True
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,False
53,44,0,2,108,141,0,1,175,0,0.6,1,0,2,True
151,71,0,0,112,149,0,1,125,0,1.6,1,0,2,True
267,49,1,2,118,149,0,0,126,0,0.8,2,3,2,False
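With several sort keys, the later keys only break ties in the earlier ones. A minimal sketch on a hypothetical frame with repeated *chol* values makes this visible:

```python
import pandas as pd

# Hypothetical values chosen so that chol has ties
toy = pd.DataFrame({
    "chol": [200, 150, 200, 150],
    "age":  [40, 55, 60, 30],
})

# chol ascending first; within equal chol, age descending
ordered = toy.sort_values(by=["chol", "age"], ascending=[True, False])
print(ordered)

# sort_values returns a new frame and keeps the original index labels
print(ordered.index.tolist())
```

The original `toy` frame is left untouched, so assign the result to a name if the sorted order is needed later.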


### Finding Basic Statistics

- We can find the mean of a specific column.

In [14]:
df['target'].mean()

0.5445544554455446

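- We can use boolean indexing to select the rows for people with heart disease, which is the first step toward finding the mean of their features. Here we check the `target` column of the selection: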

In [33]:
df_1 = df[df['target'] == 1]
target = df_1['target']
target.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64
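To go from the selected rows to the mean of every feature, call `mean` on the filtered frame. A small sketch, assuming a hypothetical four-row frame in place of heart.csv:

```python
import pandas as pd

# Hypothetical rows, not taken from heart.csv
toy = pd.DataFrame({
    "age":    [63, 37, 41, 56],
    "chol":   [233, 250, 204, 236],
    "target": [1, 1, 0, 1],
})

mask = toy["target"] == 1     # boolean Series, one entry per row
with_disease = toy[mask]      # keeps only the rows where mask is True

# Column-wise mean over the selected rows
feature_means = with_disease.mean()
print(feature_means)
```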

- We can do the above for a specific feature as well.

In [16]:
df[df['target'] == 1]['chol'].mean()

242.23030303030302

- What is the maximum cholesterol among people with heart disease who are between 30 and 40 years old (inclusive)?

In [17]:
df[(df['target'] == 1) & (df['age'] <= 40) & (df['age'] >= 30)]['chol'].max()

321
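When combining conditions, each comparison needs its own parentheses. The sketch below, on a hypothetical frame, also shows `Series.between` as a more compact way to write an inclusive range:

```python
import pandas as pd

# Hypothetical rows, not taken from heart.csv
toy = pd.DataFrame({
    "age":    [29, 35, 40, 45, 38],
    "chol":   [204, 310, 199, 260, 250],
    "target": [1, 1, 1, 1, 0],
})

# Parentheses are required because & binds tighter than == or <=
explicit = toy[(toy["target"] == 1) & (toy["age"] >= 30) & (toy["age"] <= 40)]

# between() is inclusive on both ends by default
compact = toy[(toy["target"] == 1) & toy["age"].between(30, 40)]

max_chol = explicit["chol"].max()
print(max_chol)
```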

### Changing Data

- https://chrisalbon.com/python/data_wrangling/pandas_apply_operations_to_dataframes/

- To apply functions to each column, use apply():

In [18]:
def square(x):
    return x**2

In [19]:
df.apply(square).head() 

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,3969,1,9,21025,54289,1,0,22500,0,5.29,0,0,1,1
1,1369,1,4,16900,62500,0,1,34969,0,12.25,0,0,4,1
2,1681,0,1,16900,41616,0,0,29584,0,1.96,4,0,4,1
3,3136,1,1,14400,55696,0,1,31684,0,0.64,4,0,4,1
4,3249,0,0,14400,125316,0,1,26569,1,0.36,4,0,4,1
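By default `apply` hands each column to the function as a Series; with `axis=1` it hands over each row instead. A minimal sketch with two hypothetical helper functions (`value_range` and `row_total` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def value_range(col):
    # Receives a whole column and returns one number
    return col.max() - col.min()

def row_total(row):
    # Receives a whole row and returns one number
    return row["a"] + row["b"]

col_range = toy.apply(value_range)        # axis=0 (default): one result per column
row_sum = toy.apply(row_total, axis=1)    # axis=1: one result per row

print(col_range)
print(row_sum)
```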


- Apply `square` to a single column **cp** 

In [20]:
# your code here
df['cp'].apply(square).head()

0    9
1    4
2    1
3    1
4    0
Name: cp, dtype: int64

- We can also use `apply` to build a boolean mask and subset the rows on a specific feature.
- Here we use a `lambda` function. More on them later!

In [21]:
df[df['sex'].apply(lambda gender: gender == 1)].head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,True
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,True
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,True
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,True
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,True
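A lambda is an anonymous function written as a single expression. The sketch below, on a hypothetical frame, shows that a named function, a lambda, and a plain vectorized comparison all produce the same mask (`is_male` is an illustrative name):

```python
import pandas as pd

# Hypothetical rows, not taken from heart.csv
toy = pd.DataFrame({"sex": [1, 0, 1], "chol": [233, 204, 250]})

def is_male(gender):
    return gender == 1

# The named function and the lambda compute the same thing
mask_def = toy["sex"].apply(is_male)
mask_lambda = toy["sex"].apply(lambda gender: gender == 1)

# For a simple comparison the vectorized form gives the same mask without apply
mask_vectorized = toy["sex"] == 1

selected = toy[mask_lambda]
print(selected)
```

For simple comparisons the vectorized form is usually shorter and faster; `apply` with a lambda is more useful when the per-value logic cannot be written as one column expression.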


- The `applymap` method takes a function as input and applies it to every entry in the DataFrame. (In pandas 2.1 and later the same method is called `DataFrame.map`, and `applymap` is deprecated.)

In [12]:
def plusone(x):
    return x+1

In [14]:
df.applymap(plusone).head()


,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,64,2,4,146,234,2,1,151,1,3.3,1,1,2,2
1,38,2,3,131,251,1,2,188,1,4.5,1,1,3,2
2,42,1,2,131,205,1,1,173,1,2.4,3,1,3,2
3,57,2,2,121,237,1,2,179,1,1.8,3,1,3,2
4,58,1,1,121,355,1,2,164,2,1.6,3,1,3,2
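Because `applymap` was renamed to `DataFrame.map` in pandas 2.1, a version-tolerant sketch can pick whichever element-wise method is available. The frame and the `elementwise` name are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

def plusone(x):
    return x + 1

# Use DataFrame.map where it exists (pandas >= 2.1), else fall back to applymap
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
shifted = elementwise(plusone)

# For plain arithmetic, the vectorized expression gives the same values
same = toy + 1
print(shifted)
```

Element-wise application is most useful for logic that has no vectorized equivalent, such as custom string formatting of each cell.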


- The `map` method can be used to replace values in a **column** by passing a dictionary of the form {old_value: new_value} as its argument.


In [24]:
dict_values = {1: 'male', 0: 'female'}
df['sex'] = df['sex'].map(dict_values)
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,male,3,145,233,1,0,150,0,2.3,0,0,1,True
1,37,male,2,130,250,0,1,187,0,3.5,0,0,2,True
2,41,female,1,130,204,0,0,172,0,1.4,2,0,2,True
3,56,male,1,120,236,0,1,178,0,0.8,2,0,2,True
4,57,female,0,120,354,0,1,163,1,0.6,2,0,2,True
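One detail worth knowing when passing a dictionary to `map`: values without a dictionary entry become `NaN`. A minimal sketch on a hypothetical coded column, where the code `2` is deliberately left unmapped:

```python
import pandas as pd

# Hypothetical codes; 2 has no entry in the dictionary below
sex = pd.Series([1, 0, 1, 2], name="sex")

labels = sex.map({1: "male", 0: "female"})
print(labels)

# Values without a dictionary entry come back as NaN
n_missing = labels.isna().sum()
print(n_missing)
```

Checking `isna().sum()` after a `map` is a cheap way to catch codes that the dictionary did not cover.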


### Grouping

In general, grouping data in Pandas works as follows:

**df.groupby(by=grouping_columns)[columns_to_show].function()**


1. First, the `groupby` method splits the rows into groups according to the values in grouping_columns. Those values become the index of the resulting DataFrame.
2. Then, the columns of interest are selected (columns_to_show). If columns_to_show is omitted, all columns other than the grouping columns are included.
3. Finally, one or several functions are applied to each group, per selected column.
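The three steps above can be sketched on a hypothetical four-row frame. The object returned by `groupby` is lazy: it only records how to split the rows until a function is applied.

```python
import pandas as pd

# Hypothetical rows, not taken from heart.csv
toy = pd.DataFrame({
    "sex":  [0, 1, 1, 0],
    "chol": [200, 240, 260, 220],
    "age":  [50, 60, 40, 70],
})

grouped = toy.groupby("sex")    # a GroupBy object; nothing is aggregated yet

# Step 1: the split. Iterating yields (group key, sub-DataFrame) pairs
for key, part in grouped:
    print(key, len(part))

# Steps 2 and 3: select a column, then aggregate each group
mean_chol = grouped["chol"].mean()
print(mean_chol)
```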

In [25]:
df = pd.read_csv('heart.csv')
df.groupby(['sex']).mean()

sex,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,55.677083,1.041667,133.083333,261.302083,0.125,0.572917,151.125,0.229167,0.876042,1.427083,0.552083,2.125,0.75
1,53.758454,0.932367,130.94686,239.289855,0.15942,0.507246,148.961353,0.371981,1.115459,1.386473,0.811594,2.400966,0.449275


In [26]:
columns_to_show = ['sex', 'chol', 'trestbps']

df.groupby(['target'])[columns_to_show].describe(percentiles=[])

,sex,sex,sex,sex,sex,sex,chol,chol,chol,chol,chol,chol,trestbps,trestbps,trestbps,trestbps,trestbps,trestbps
target,count,mean,std,min,50%,max,count,mean,std,min,50%,max,count,mean,std,min,50%,max
0,138.0,0.826087,0.380416,0.0,1.0,1.0,138.0,251.086957,49.454614,131.0,249.0,409.0,138.0,134.398551,18.729944,100.0,130.0,200.0
1,165.0,0.563636,0.497444,0.0,1.0,1.0,165.0,242.230303,53.552872,126.0,234.0,564.0,165.0,129.30303,16.169613,94.0,130.0,180.0


- Let’s do the same thing, but slightly differently by passing a list of functions to agg():



In [27]:
columns_to_show = ['sex', 'chol', 'trestbps']

df.groupby(['target'])[columns_to_show].agg([np.mean, np.std, np.min, 
                                            np.max])

,sex,sex,sex,sex,chol,chol,chol,chol,trestbps,trestbps,trestbps,trestbps
target,mean,std,amin,amax,mean,std,amin,amax,mean,std,amin,amax
0,0.826087,0.380416,0,1,251.086957,49.454614,131,409,134.398551,18.729944,100,200
1,0.563636,0.497444,0,1,242.230303,53.552872,126,564,129.30303,16.169613,94,180
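`agg` also accepts aggregation names as strings, and named aggregation lets you label the output columns yourself. A small sketch on a hypothetical frame (the output names `mean_chol` and `max_chol` are illustrative):

```python
import pandas as pd

# Hypothetical rows, not taken from heart.csv
toy = pd.DataFrame({
    "target": [0, 0, 1, 1],
    "chol":   [250, 230, 200, 260],
})

# String names select pandas' built-in aggregations
summary = toy.groupby("target")["chol"].agg(["mean", "min", "max"])
print(summary)

# Named aggregation: output_name=(column, aggregation)
named = toy.groupby("target").agg(
    mean_chol=("chol", "mean"),
    max_chol=("chol", "max"),
)
print(named)
```

String names produce column labels such as `min` and `max` rather than the NumPy function names.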


### Class Fun!!

- Apply the new tools to the Animal Center data

In [28]:
import pandas as pd
shelter_data=pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD') 
shelter_data.head()

,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A799505,Trixie,07/09/2019 07:30:00 PM,07/09/2019 07:30:00 PM,07/09/2006,Return to Owner,,Dog,Spayed Female,13 years,Norwich Terrier Mix,Tan/White
1,A791983,*Spice,07/09/2019 07:09:00 PM,07/09/2019 07:09:00 PM,04/01/2019,Adoption,,Cat,Neutered Male,3 months,Domestic Shorthair Mix,Orange Tabby
2,A799407,Puppy Boy,07/09/2019 06:22:00 PM,07/09/2019 06:22:00 PM,07/08/2014,Return to Owner,,Dog,Neutered Male,5 years,Chihuahua Shorthair,Tan/White
3,A799344,Gibby,07/09/2019 06:11:00 PM,07/09/2019 06:11:00 PM,07/07/2017,Return to Owner,,Dog,Intact Male,2 years,French Bulldog,Brown Brindle
4,A799426,Copper,07/09/2019 05:46:00 PM,07/09/2019 05:46:00 PM,07/08/2017,Return to Owner,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby


In [29]:
# Save a copy of the data to your local drive; index=False leaves out the row index column.
shelter_data.to_csv('Shelter_data.csv', index=False)
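When saving with `to_csv`, the row index is written as an extra unnamed column unless `index=False` is passed. The sketch below shows the difference on a hypothetical two-row frame, writing to in-memory buffers so that no files are created:

```python
import io
import pandas as pd

# Hypothetical rows in the style of the shelter data
toy = pd.DataFrame({"Name": ["Trixie", "Gibby"], "Animal Type": ["Dog", "Dog"]})

with_index = io.StringIO()
toy.to_csv(with_index)                    # default: the row index is written too
without_index = io.StringIO()
toy.to_csv(without_index, index=False)    # index left out

with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Reading the first buffer back yields a leading `Unnamed: 0` column, which is the saved index.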