# Module 1 - Manipulating data with Pandas


### Pandas Part 2


#### Our goals today are to be able to: 

Use the pandas library to:

- Get summary info about a dataset and its features (variables).  
  Apply and use `info`, `describe`, and `dtypes`.  
  Use `mean`, `min`, `max`, and `value_counts`.
- Use `apply` and `applymap` to transform columns and create new values.

- Explain `lambda` functions and use them on a DataFrame.

- Explain `groupby` and split a DataFrame using a groupby.


https://www.kaggle.com/ronitf/heart-disease-uci/version/1

- The dataset is most often used to practice classification algorithms.  We will use several different classification algorithms in a few weeks, but for now we wish to use the dataset to practice some pandas methods.

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('heart.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

- We can change the column type with the `astype` method. Let’s apply this method to the target feature and convert it to a bool:



In [None]:
df['target'] = df['target'].astype('bool')

In [None]:
df.head()

- The `describe` method gives summary statistics for the numerical features (int64 and float64 types): count, mean, standard deviation, minimum, maximum, median, and the 0.25 and 0.75 quartiles.

In [None]:
df.describe()

- To see statistics on non-numerical features, pass the data types of interest as a list to the `include` parameter.

In [None]:
df.describe(include=['object', 'bool'])

- For categorical (type object) and boolean (type bool) features we can use the `value_counts` method. Let’s have a look at the distribution of target:
- `value_counts` will also work on int and float types. Try it out!



In [None]:
df['target'].value_counts()

- To calculate fractions instead of counts, pass `normalize=True` to the `value_counts` method.

In [None]:
df['target'].value_counts(normalize=True)
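
- As noted above, `value_counts` also works on numeric columns. Here is a minimal sketch on a small made-up frame (the column names mirror `heart.csv`, but the values are hypothetical and chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data; not taken from heart.csv.
toy = pd.DataFrame({"cp": [0, 2, 2, 1, 0, 2], "target": [1, 1, 0, 0, 1, 1]})

# Absolute counts per distinct value, most frequent value first.
counts = toy["cp"].value_counts()

# The same breakdown expressed as fractions that sum to 1.
shares = toy["cp"].value_counts(normalize=True)

print(counts)
print(shares)
```
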

### Sorting

- A DataFrame can be sorted by the values of one of its features (i.e., columns). For example, we can sort by *age* (use `ascending=False` to sort in descending order):

In [None]:
df.sort_values(by='age', ascending=False).head()


- We can also sort by multiple columns:



In [None]:
df.sort_values(by=['age', 'chol'], ascending=[True, False]).head()

### Finding Basic Statistics

- We can find the mean of a specific column.

In [None]:
# df.age.mean() is another option
df['age'].mean()

- We can use boolean indexing to find the mean of the features for people with heart disease.

In [None]:
df_1=df[df['target'] == 1]['age'].mean()
df_1

- We can do the same for other features as well.

In [None]:
df[df['target'] == 1]['chol'].mean()

- What is the average cholesterol for people with heart disease in their 30s?

In [None]:
# Keep this as the last line of the cell so the notebook displays the result.
df[(df['target'] == 1) & (df['age'] < 40) & (df['age'] >= 30)]['chol'].mean()
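
- Long boolean conditions are easier to read when each mask gets a name. The sketch below uses a small made-up frame (hypothetical values, column names borrowed from `heart.csv`) and `Series.between`, which is inclusive on both ends by default, so for whole-number ages `between(30, 39)` matches the same rows as `>= 30` combined with `< 40`:

```python
import pandas as pd

# Hypothetical toy data; not taken from heart.csv.
toy = pd.DataFrame({
    "age": [29, 34, 38, 41, 35, 52],
    "chol": [204, 210, 231, 198, 250, 260],
    "target": [1, 1, 0, 1, 1, 0],
})

# Each condition is a boolean Series aligned with the rows of toy.
has_disease = toy["target"] == 1
in_thirties = toy["age"].between(30, 39)

# Combine the masks with & and select rows and column in a single .loc call.
mean_chol = toy.loc[has_disease & in_thirties, "chol"].mean()
print(mean_chol)
```
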

### Changing Data

- https://chrisalbon.com/python/data_wrangling/pandas_apply_operations_to_dataframes/

- To apply a function to each column or row, use `apply()`:

In [None]:
df.apply(np.mean,axis=0)

- We can also call `apply` on a single column to build a boolean mask and use it to filter rows. Here we keep only the rows where `sex` equals 1.
- Here we use a `lambda` function. More on them later!

In [None]:
df_f=df[df['sex'].apply(lambda x: x == 1)]
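
- A `lambda` is an anonymous function written as a single expression. The sketch below, on a made-up frame with hypothetical values, shows that a `lambda`, an equivalent named function (`is_male` is an illustrative name), and a plain vectorized comparison all select the same rows:

```python
import pandas as pd

# Hypothetical toy data; not taken from heart.csv.
toy = pd.DataFrame({"sex": [1, 0, 1, 1], "age": [63, 41, 56, 57]})

# A named function and a lambda that do exactly the same thing.
def is_male(x):
    return x == 1

is_male_lambda = lambda x: x == 1

via_def = toy[toy["sex"].apply(is_male)]
via_lambda = toy[toy["sex"].apply(is_male_lambda)]

# For a simple comparison, the vectorized form is shorter and usually faster.
via_vector = toy[toy["sex"] == 1]

print(via_lambda)
```
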

- The `applymap` method takes a function as input that it will then apply to every entry in the dataframe.

In [None]:
def square(x):
    return x**2
df.applymap(square).head()
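
- Note that pandas 2.1 deprecated `DataFrame.applymap` in favor of `DataFrame.map`. The sketch below picks whichever method is available, so it should run on either side of that change; the toy frame and its values are made up for illustration. For simple arithmetic, a vectorized expression such as `toy ** 2` gives the same result without calling a Python function per element.

```python
import pandas as pd

def square(x):
    return x ** 2

# Hypothetical toy data used only for illustration.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# DataFrame.map exists in pandas >= 2.1; older versions only have applymap.
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
squared = elementwise(square)

# Vectorized equivalent for this particular function.
vectorized = toy ** 2

print(squared)
```
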

- The `map` method can be used to replace values in a **column** by passing a dictionary of the form {old_value: new_value} as its argument.


In [None]:
dict_values = {1: 'male', 0: 'female'}
# map returns a new Series; df itself is unchanged unless the result is assigned back.
df['sex'].map(dict_values).head()
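
- Two details of `map` are easy to miss: it returns a new Series rather than changing the DataFrame, and any value missing from the dictionary becomes `NaN`. A minimal sketch on made-up data, assuming the encoding 1 = male and 0 = female (the `sex_label` column name is hypothetical):

```python
import pandas as pd

# Hypothetical toy data; not taken from heart.csv.
toy = pd.DataFrame({"sex": [1, 0, 1, 0]})
sex_labels = {1: "male", 0: "female"}

# Assign the result to keep it; here it goes into a new column.
toy["sex_label"] = toy["sex"].map(sex_labels)

# Values without a dictionary entry are mapped to NaN.
partial = toy["sex"].map({1: "male"})

print(toy)
print(partial)
```
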

### Grouping

In general, grouping data in Pandas works as follows:

**df.groupby(by=grouping_columns)[columns_to_show].function()**


1. First, the `groupby` method splits the rows into groups according to the values in grouping_columns. Those values become the index of the resulting DataFrame.
2. Then, columns of interest are selected (columns_to_show). 
3. Finally, one or several functions are applied to the obtained groups per selected columns.
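
The three steps above can be sketched one at a time on a small made-up frame (hypothetical values, column names borrowed from `heart.csv`):

```python
import pandas as pd

# Hypothetical toy data; not taken from heart.csv.
toy = pd.DataFrame({
    "target": [1, 1, 0, 0],
    "chol": [200, 240, 260, 300],
    "trestbps": [120, 130, 140, 150],
})

# Step 1: split the rows into groups by the values of the grouping column.
grouped = toy.groupby("target")

# Step 2: select the columns of interest.
selected = grouped[["chol", "trestbps"]]

# Step 3: apply a function to each group; the group keys become the index.
result = selected.mean()

print(result)
```
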

In [None]:
df.groupby(['sex']).mean()

In [None]:
columns_to_show = ['sex', 'chol', 'trestbps']

df.groupby(['target'])[columns_to_show].describe(percentiles=[])

- Let’s do the same thing, but slightly differently by passing a list of functions to agg():



In [None]:
columns_to_show = ['sex', 'chol', 'trestbps']

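# Passing function names as strings avoids the FutureWarning that recent pandas raises for NumPy callables.
df.groupby(['target'])[columns_to_show].agg(['mean', 'std', 'min', 'max'])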
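
- When different columns need different statistics, `agg` also accepts named aggregations of the form `new_name=(column, function)`. A minimal sketch on made-up data (the output column names `mean_chol` and `max_trestbps` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; not taken from heart.csv.
toy = pd.DataFrame({
    "target": [1, 1, 0, 0],
    "chol": [200, 240, 260, 300],
    "trestbps": [120, 130, 140, 150],
})

# Each keyword names an output column and says which input column and function feed it.
summary = toy.groupby("target").agg(
    mean_chol=("chol", "mean"),
    max_trestbps=("trestbps", "max"),
)

print(summary)
```
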

### Class Fun!!

- Apply the new tools to the Animal Center data

In [None]:
import pandas as pd
shelter_data=pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD') 
shelter_data.head()

In [None]:
# Save a copy of your manipulations to your local drive.
# index=False keeps the row index out of the file.
shelter_data.to_csv('Shelter_data.csv', index=False)
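
- By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference using in-memory buffers instead of files; the toy frame and its values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical toy data used only for illustration.
toy = pd.DataFrame({"animal": ["cat", "dog"], "age_years": [2, 5]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```
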