# Manipulating Data with Python

In this chapter, you will learn how to transform data into the format you need.

### Indexing DataFrame

You can use `df.loc[]` and `df.iloc[]` to select part of a DataFrame. `loc` selects by label, and `iloc` selects by integer position.

There are many ways to query data. See the following examples.

In [None]:
import pandas as pd

pokemon_df = pd.read_csv('./res/pokemon.csv', na_values=['NA'])

print(pokemon_df.head())

print(pokemon_df['Name'].head())
print(pokemon_df[['Name', 'HP']].head())

print(pokemon_df.iloc[0])
print(pokemon_df.iloc[0:1])
print(pokemon_df.iloc[0:1, 0:1])

print(pokemon_df.loc[0])
print(pokemon_df.loc[0:1])
print(pokemon_df.loc[0:1, 'Name': 'HP'].head())
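The difference between label slicing and position slicing is easy to miss in the output above. The sketch below uses a small made-up DataFrame (`toy_df` and its values are hypothetical, not taken from `pokemon.csv`) to show that `iloc` excludes the end position while `loc` includes the end label.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame(
    {'Name': ['A', 'B', 'C'], 'HP': [45, 60, 80], 'Attack': [49, 62, 82]}
)

# iloc slices by position and excludes the end position: only row 0
by_position = toy_df.iloc[0:1]

# loc slices by label and includes the end label: rows 0 and 1
by_label = toy_df.loc[0:1]

# loc can slice columns by label too, again including both ends
by_label_and_column = toy_df.loc[0:1, 'Name':'HP']

print(by_position.shape)
print(by_label.shape)
print(by_label_and_column.shape)
```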

### Q: Passenger information from the Titanic is stored in `./res/titanic.csv`. Read it into a DataFrame named `titanic_df`, then print only the `PassengerId` and `Survived` columns.

In [None]:
# Write your code here


### Changing Index And Label

You can use your own index instead of the default integer index.

A custom index makes it easier to filter and sort data.

There are two ways to set an index.

* Call `df.set_index()` and pass the column names. It returns a new DataFrame, so assign the result.

* Add the argument `index_col=['column_name']` when reading the file.

In [None]:
import pandas as pd

pokemon_df = pd.read_csv('./res/pokemon.csv', na_values=['NA'])
print(pokemon_df.head())


# set_index() returns a new DataFrame, so assign the result back
pokemon_df = pokemon_df.set_index(['Name'])
print(pokemon_df.head())


pokemon_df = pd.read_csv('./res/pokemon.csv', index_col=['Name'], na_values=['NA'])
print(pokemon_df.head())
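One detail worth keeping in mind is that `set_index()` returns a new DataFrame by default rather than changing the original. The following sketch, on a hypothetical `toy_df`, shows this, and also that `reset_index()` moves the index back into a column.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({'Name': ['A', 'B'], 'HP': [45, 60]})

# The result is discarded here, so toy_df keeps its default index
toy_df.set_index('Name')
unchanged_columns = list(toy_df.columns)

# Keep the result to actually work with the new index
indexed_df = toy_df.set_index('Name')

# reset_index() turns the index back into a regular column
restored_df = indexed_df.reset_index()

print(unchanged_columns)
print(indexed_df.index.name)
print(list(restored_df.columns))
```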

You can also use several columns as the index. If you need multiple index levels, pass multiple column names.

Call `df.sort_index()` to sort the rows by their index values.

In [None]:
pokemon_df = pd.read_csv('./res/pokemon.csv', index_col=['Type 1', 'Type 2'], na_values=['NA'])

print(pokemon_df.head())

pokemon_df = pokemon_df.sort_index()

print(pokemon_df.head())
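To see what sorting a multi-level index does without needing the CSV file, here is a minimal sketch on hypothetical data (`toy_df` and its values are made up). Rows are ordered by the first level, then by the second.

```python
import pandas as pd

# Hypothetical toy data with two index levels
toy_df = pd.DataFrame(
    {
        'Type 1': ['Grass', 'Bug', 'Grass', 'Bug'],
        'Type 2': ['Poison', 'Flying', 'Flying', 'Electric'],
        'HP': [45, 60, 70, 50],
    }
).set_index(['Type 1', 'Type 2'])

# sort_index() returns a new DataFrame ordered by 'Type 1', then 'Type 2'
sorted_df = toy_df.sort_index()

print(toy_df.index.is_monotonic_increasing)
print(sorted_df.index.is_monotonic_increasing)
print(sorted_df)
```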

### Q: Read `./res/titanic.csv` again. This time, use `Survived` and `Pclass` as indices and read it into `titanic_df`.

In [None]:
# Write your code here


### Filtering from multiple indices

You can query with `df.loc[]` by passing the index values of the rows you need.

To filter by a range of index values, pass a `slice()` object with the range bounds. Slicing a multi-level index works reliably only when the index is sorted.

In [None]:
print(pokemon_df.info())

print(pokemon_df.loc[['Bug', 'Grass']].info())

print(pokemon_df.loc[('Bug', 'Electric')].info())

print(pokemon_df.loc[(slice('Bug', 'Grass'), slice(None))].info())
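The three kinds of lookup above can be compared on a small example. This is a sketch on hypothetical data (`toy_df` is made up). It also shows `pd.IndexSlice`, which is an alternative spelling of the same `slice()` query.

```python
import pandas as pd

# Hypothetical toy data with a sorted two-level index
toy_df = pd.DataFrame(
    {
        'Type 1': ['Bug', 'Bug', 'Fire', 'Grass', 'Water'],
        'Type 2': ['Electric', 'Flying', 'Flying', 'Poison', 'Ice'],
        'HP': [50, 60, 78, 45, 90],
    }
).set_index(['Type 1', 'Type 2']).sort_index()

# A list of first-level labels: every row whose 'Type 1' is Bug or Grass
outer_list = toy_df.loc[['Bug', 'Grass']]

# A tuple names one label per level: a single (Type 1, Type 2) pair
single_hp = toy_df.loc[('Bug', 'Electric'), 'HP']

# slice() selects a label range on one level; slice(None) means "everything"
type_range = toy_df.loc[(slice('Bug', 'Grass'), slice(None)), :]

# pd.IndexSlice expresses the same query with slice syntax
same_range = toy_df.loc[pd.IndexSlice['Bug':'Grass', :], :]

print(len(outer_list), single_hp, len(type_range))
```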

### Q: Filter the rows with `Survived = 1` and `Pclass = 2` from DataFrame `titanic_df`.

In [None]:
# Write your code here


### Filtering NaN in DataFrame 

Use `df.dropna()` to drop rows that contain NaN.

The argument `how='any'` drops a row if any of its fields is NaN.

The argument `how='all'` drops a row only if all of its fields are NaN.

In [None]:
print(pokemon_df.info())
print(pokemon_df.shape)

print(pokemon_df[['Name', 'HP']].dropna(how='any').info())
print(pokemon_df[['Name', 'HP']].dropna(how='any').shape)

print(pokemon_df[['Name', 'HP']].dropna(how='all').info())
print(pokemon_df[['Name', 'HP']].dropna(how='all').shape)
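The difference between the two modes is easier to see when the missing values are known in advance. Here is a minimal sketch on hypothetical data (`toy_df` is made up), which also shows the `subset` argument for checking only certain columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is missing HP, row 2 is missing everything
toy_df = pd.DataFrame({'Name': ['A', 'B', None], 'HP': [45, np.nan, np.nan]})

# how='any' keeps only fully populated rows
any_dropped = toy_df.dropna(how='any')

# how='all' removes only the row where every field is missing
all_dropped = toy_df.dropna(how='all')

# subset limits the NaN check to the listed columns
hp_required = toy_df.dropna(subset=['HP'])

print(any_dropped.shape, all_dropped.shape, hp_required.shape)
```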

### Q: Check `df.shape` and `df.info()` of DataFrame `titanic_df`, drop the rows in which any field is NaN, and check them again.

In [None]:
# Write your code


### Transforming DataFrame

##### apply

You can transform a DataFrame with your own custom function.

When you call `apply` on a DataFrame, each column is passed to your function as a pandas `Series`, so you can use the methods that `Series` provides to manipulate the data.

In [None]:
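# Reload with the default index so that 'Type 1' is a regular column again
pokemon_df = pd.read_csv('./res/pokemon.csv', na_values=['NA'])

print(pokemon_df['Type 1'].value_counts())

def to_upper_case(series):
    return series.str.upper()

print(pokemon_df[['Type 1']].apply(to_upper_case).head())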
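To make it clearer what the function receives, the sketch below applies a function that reduces each `Series` to a single number. The data and the helper `value_range` are hypothetical. With `axis=1`, each row is passed instead of each column.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({'HP': [45, 60, 80], 'Attack': [49, 62, 82]})

def value_range(series):
    # series is one column (or one row when axis=1)
    return series.max() - series.min()

# Default axis=0: the function is called once per column
column_ranges = toy_df.apply(value_range)

# axis=1: the function is called once per row
row_ranges = toy_df.apply(value_range, axis=1)

print(column_ranges)
print(row_ranges)
```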

##### map and lambda function

You can use `map` with a lambda function to transform each element of a pandas Series.

For simple operations, you can write the logic directly in the lambda function instead of defining a named function.

In [None]:
print(pokemon_df['Type 1'].value_counts().head())

print(pokemon_df['Type 1'].map(lambda type1: type1.upper()).head())
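A lambda such as `type1.upper()` fails if the Series contains a missing value, because NaN has no `upper()` method. A minimal sketch of one way around this, assuming a hypothetical Series `toy_types`, is the `na_action='ignore'` argument of `Series.map`, which skips missing values.

```python
import pandas as pd

# Hypothetical toy Series with one missing value
toy_types = pd.Series(['Grass', None, 'Fire'])

# na_action='ignore' leaves missing values untouched instead of calling the lambda
upper_types = toy_types.map(lambda type1: type1.upper(), na_action='ignore')

print(upper_types)
```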

### Grouping Data

You can perform a GROUP BY operation on a DataFrame and calculate aggregated results.

When you call `df.groupby()`, you get a `DataFrameGroupBy` object, on which you can run calculations such as `count()`, `sum()`, and `mean()`.

In [None]:
by_type1 = pokemon_df.groupby('Type 1')

print(type(by_type1))

by_type1_count = by_type1['Name'].count()

print(by_type1_count.head())

If you want to run several calculations on a `DataFrameGroupBy` object at once, use the `agg` function and pass it the list of functions you need.

In [None]:
by_type1_hp_boundary = by_type1['HP'].agg(['max', 'min'])

by_type1_hp_boundary.head()
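The shape of the `agg` result is easier to check on data small enough to verify by hand. The sketch below uses a hypothetical `toy_df`. It also shows keyword-style named aggregation, where the output column names (`hp_max` and `hp_min` here are made-up names) are chosen by the caller.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame(
    {'Type 1': ['Bug', 'Bug', 'Fire', 'Fire', 'Fire'], 'HP': [50, 60, 78, 39, 58]}
)

# A list of function names gives one output column per function
hp_stats = toy_df.groupby('Type 1')['HP'].agg(['max', 'min', 'mean'])

# Named aggregation: keyword = function name, keyword becomes the column name
hp_named = toy_df.groupby('Type 1')['HP'].agg(hp_max='max', hp_min='min')

print(hp_stats)
print(hp_named)
```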

You can also group by computed values, such as a Series returned by a function, instead of an index or column name.

In [None]:
# Group by Attack in bins of 20 (0-19, 20-39, ...)
def get_attack_group(df):
    # Floor division maps each value to the lower edge of its bin
    return (df['Attack'] // 20) * 20

print(pokemon_df.groupby(get_attack_group(pokemon_df))['#'].count())
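There are two related techniques here, and a small sketch may help keep them apart. Passing a Series of computed keys groups rows by those values. Passing a callable makes pandas call it on each index label. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the default index 0..4
toy_df = pd.DataFrame({'Attack': [5, 19, 20, 39, 45]})

# Group by a computed Series: lower edge of each 20-wide bin
attack_bin = (toy_df['Attack'] // 20) * 20
bin_counts = toy_df.groupby(attack_bin)['Attack'].count()

# Group by a callable: it receives each index label, here 0, 1, 2, 3, 4
parity_counts = toy_df.groupby(lambda label: label % 2)['Attack'].count()

print(bin_counts)
print(parity_counts)
```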

##### Filling missing data

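`DataFrameGroupBy.transform()` applies a function to each group and returns a result with the same index as the original data.

`transform()` processes each column of each group independently, as a Series.

You can write a custom function, for example one that fills missing data with the group mean.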

In [None]:
print(pokemon_df.info())

by_type = pokemon_df.groupby(['Type 1'])

def impute_mean(series):
    return series.fillna(series.mean())
    
pokemon_df['fillna_hp'] = by_type['HP'].transform(impute_mean)

print(pokemon_df.info())

pokemon_df[['fillna_hp', 'fillna_attack']] = by_type[['HP', 'Attack']].transform(impute_mean)

print(pokemon_df.info())
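Since `info()` only reports counts, here is a minimal sketch on hypothetical data where the filled values can be checked by hand. Each NaN is replaced by the mean of its own group, and the result keeps the original row order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing HP in each group
toy_df = pd.DataFrame(
    {
        'Type 1': ['Bug', 'Bug', 'Bug', 'Fire', 'Fire'],
        'HP': [40.0, np.nan, 60.0, 70.0, np.nan],
    }
)

def impute_mean(series):
    # series holds the HP values of a single group
    return series.fillna(series.mean())

# Bug mean is 50, Fire mean is 70; the output aligns with toy_df's index
filled_hp = toy_df.groupby('Type 1')['HP'].transform(impute_mean)

print(filled_hp)
```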

You can also call `apply` on a `DataFrameGroupBy` object. With custom functions, you get more flexibility, because each group is passed to your function as a DataFrame.

These functions can be aggregations, transformations, or more complex workflows.

In [None]:
by_type = pokemon_df.groupby(['Type 1'])

def cal_complex_score(group):
    # group is a DataFrame holding the rows of one 'Type 1' value
    return (group['HP'].max() + group['Attack'].max() + group['Defense'].max()) - \
    (group['HP'].min() + group['Attack'].min() + group['Defense'].min())

print(by_type.apply(cal_complex_score).head())
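The same pattern can be sketched on hypothetical data where the result is easy to verify. The helper `stat_spread` and the values are made up. Selecting the value columns before calling `apply` is an assumption made here so that the grouping column is not passed to the function.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame(
    {
        'Type 1': ['Bug', 'Bug', 'Fire', 'Fire'],
        'HP': [40, 60, 70, 50],
        'Attack': [30, 50, 90, 60],
    }
)

def stat_spread(group):
    # group is a DataFrame with the HP and Attack rows of one type
    best = group['HP'].max() + group['Attack'].max()
    worst = group['HP'].min() + group['Attack'].min()
    return best - worst

# One scalar per group, indexed by 'Type 1'
spread = toy_df.groupby('Type 1')[['HP', 'Attack']].apply(stat_spread)

print(spread)
```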

### Q: The `Age` column of DataFrame `titanic_df` has NaN values. Fill them with the median `Age` of the same `Sex` and `Pclass` group.

In [None]:
# Write your code here


##### Filtering data

You can use `filter()` with a lambda function on a `DataFrameGroupBy` object to keep or drop whole groups. The function receives each group as a DataFrame and returns `True` to keep all of its rows.

In [None]:
by_type = pokemon_df.groupby('Type 1')

# Keep only the types whose weakest member still has more than 40 HP
by_type_filtered_min = by_type.filter(lambda group: group['HP'].min() > 40)

print(by_type_filtered_min.head())
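Note that `filter()` decides per group, not per row. In this sketch on hypothetical data, the Bug row with 60 HP is dropped as well, because its group contains a member with only 30 HP.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame(
    {'Type 1': ['Bug', 'Bug', 'Fire', 'Fire'], 'HP': [30, 60, 70, 50]}
)

# Bug has a minimum HP of 30, so the whole Bug group is removed
kept = toy_df.groupby('Type 1').filter(lambda group: group['HP'].min() > 40)

print(kept)
```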

### Q: Group DataFrame `titanic_df` by column `Pclass` and keep only the classes that have more than 200 people.

In [None]:
# Write your code here


### Stacking and Unstacking DataFrame

With a multi-index DataFrame, you can use `df.unstack()` to move an index level into the columns. This often makes the data easier to read and compare.

Pass the index level you want to show as columns in the `level` argument, and pandas does the rest.

The result is similar to that of `df.pivot_table()`, but `unstack()` works directly on the levels of an existing multi-level index.

In [None]:
pokemon_df = pd.read_csv('./res/pokemon.csv', na_values='NA', index_col=['Type 1', 'Type 2'])[['Attack', 'Defense']].head(10)

print(pokemon_df)

pokemon_max_power_df = pokemon_df.groupby(['Type 1', 'Type 2']).max()

print(pokemon_max_power_df)

unstack_pokemon_max_power_df = pokemon_max_power_df.unstack(level='Type 2')

print(unstack_pokemon_max_power_df)

You can call `df.stack()` to move the columns back into the index and restore the original layout.

In [None]:
print(unstack_pokemon_max_power_df.stack(level='Type 2'))
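The round trip can be sketched on a small hypothetical DataFrame. Combinations that do not exist in the data, such as Fire and Poison here, show up as NaN after `unstack()`. The `dropna()` after `stack()` is an assumption added so that the result is the same whether or not the installed pandas version keeps those NaN rows.

```python
import pandas as pd

# Hypothetical toy data with a two-level index
toy_df = pd.DataFrame(
    {
        'Type 1': ['Bug', 'Bug', 'Fire'],
        'Type 2': ['Flying', 'Poison', 'Flying'],
        'Attack': [45, 90, 104],
    }
).set_index(['Type 1', 'Type 2'])

# unstack: 'Type 2' moves from the index into the columns
wide = toy_df.unstack(level='Type 2')

# stack: 'Type 2' moves back into the index; drop rows for missing pairs
long_again = wide.stack(level='Type 2').dropna()

print(wide)
print(long_again)
```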

### Q: Group DataFrame `titanic_df` by the indices `Survived` and `Pclass`, and find the max values of columns `Age` and `Fare`. Then unstack the result using `level='Pclass'` and see what you find.

In [None]:
# Write your code here
