# Pandas

In this notebook, we explore pandas functions commonly used for data wrangling. The functions are grouped into sections based on their purpose.

First, let us load a sample dataset with the read_csv function so that we have data to run the operations on.

In [82]:
import pandas as pd
df = pd.read_csv("data/Bank Customer Churn Prediction.csv")

In [83]:
df.head()

,customer_id,credit_score,country,gender,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
0,15634602,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,15647311,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,15619304,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,15701354,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,15737888,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


## Summarizing data

By default, pandas summary functions operate on each column separately and return one result per column.

Counting non-null values in the dataframe for each column

In [84]:
df.count()

customer_id         10000
credit_score        10000
country             10000
gender              10000
age                 10000
tenure              10000
balance             10000
products_number     10000
credit_card         10000
active_member       10000
estimated_salary    10000
churn               10000
dtype: int64

Calculating the mean for specific columns

In [85]:
df[["credit_score", "tenure"]].mean()

credit_score    650.5288
tenure            5.0128
dtype: float64
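
As a small illustration of the column-wise default, the sketch below uses a made-up three-row frame (the name `toy` and its values are assumptions for illustration, not part of the bank dataset). It shows that count() skips missing values and that passing `axis=1` makes the same functions work across each row instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the bank dataset
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [10.0, 20.0, 30.0]})

# Default (axis=0): one result per column; NaN is not counted
col_counts = toy.count()
col_means = toy.mean()

# axis=1: one result per row; NaN is skipped when averaging
row_means = toy.mean(axis=1)

print(col_counts)
print(col_means)
print(row_means)
```

Here column `a` has a count of 2 because its third value is missing, while the row-wise mean of the third row is simply the one value that is present.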

## Grouping Data and Aggregation Functions

Calling the groupby function on a dataframe returns a GroupBy object, on which we can run different aggregation functions.

### size()

For example, we can check the size of each group once we group the dataset by gender and apply the size() aggregation function.

In [86]:
df.groupby(by="gender").size()

gender
Female    4543
Male      5457
dtype: int64
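
One detail worth knowing is how size() differs from count() on groups. The sketch below uses a small made-up frame (the `toy` name and its values are assumptions, not the bank dataset) in which some ages are missing: size() counts every row in the group, while count() only counts the non-null values of the selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing ages
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male", "Male"],
    "age": [42, np.nan, 39, 43, np.nan],
})

# size(): number of rows in each group, missing values included
group_sizes = toy.groupby("gender").size()

# count(): number of non-null "age" values in each group
group_counts = toy.groupby("gender")["age"].count()

print(group_sizes)
print(group_counts)
```

In the bank dataset above, every column has 10000 non-null values, so the two functions would give the same numbers there.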

### agg()

We can also apply more than one aggregation function by calling **.agg()** on the result of groupby. In this case, we pass a list of strings as the **"func"** argument of the agg function, where each string in the list is the name of an aggregation function in pandas.

In [87]:
df.groupby(by="gender")["age"].agg(func=["sum", "mean"])

gender,sum,mean
Female,178260,39.238389
Male,210958,38.658237
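
Besides a list of function names, agg() also accepts keyword arguments of the form `new_name=(column, function)`, often called named aggregation. The sketch below shows this on a made-up frame (the `toy` data and the output names `mean_age` and `max_balance` are assumptions chosen for illustration). It lets us aggregate different columns with different functions and name the output columns in one call.

```python
import pandas as pd

# Hypothetical toy frame, not the bank dataset
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male"],
    "age": [40, 44, 30, 50],
    "balance": [0.0, 100.0, 250.0, 50.0],
})

# Each keyword becomes an output column: name=(source column, function)
summary = toy.groupby("gender").agg(
    mean_age=("age", "mean"),
    max_balance=("balance", "max"),
)

print(summary)
```

The result has one row per group and exactly the columns named in the call, which avoids the multi-level column header that a list of functions applied to several columns would produce.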


### rank() 

Applying the rank function after groupby returns one rank per input row, computed within each group. Unlike size() or agg(), it does not collapse a group to a single value, so the result has the same length as the original data. The ranking is based on the values of the column selected after the **groupby()** call; when a single column is selected, the result is a Series.

There are five methods that can be specified to handle ties in the rank function. Suppose two rows are tied on the same value and the row before them has rank 2, so the tied rows occupy positions 3 and 4. Each method then assigns ranks as follows:

- **method = "average":** Both rows get the average of ranks 3 & 4, which is 3.5. The next row gets 5. **(Default Method)**
- **method = "min":** Both rows get the minimum of 3 & 4, which is 3. The next row gets 5.
- **method = "max":** Both rows get the maximum of 3 & 4, which is 4. The next row gets 5.
- **method = "dense":** Both rows get 3 and the next row gets 4.
- **method = "first":** The first row encountered gets 3 and the second row encountered gets 4. The next row gets 5.

In [79]:
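# Small example frame with tied values inside each group
df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"],
    "value": [2, 4, 2, 3, 5, 1, 2, 4, 1, 5],
})

# Rank "value" within each group, once per tie-breaking method
for method in ["average", "min", "max", "dense", "first"]:
    df[f"{method}_rank"] = df.groupby("group")["value"].rank(method=method)
df.head()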

,group,value,average_rank,min_rank,max_rank,dense_rank,first_rank
0,a,2,1.5,1.0,2.0,1.0,1.0
1,a,4,4.0,4.0,4.0,3.0,4.0
2,a,2,1.5,1.0,2.0,1.0,2.0
3,a,3,3.0,3.0,3.0,2.0,3.0
4,a,5,5.0,5.0,5.0,4.0,5.0
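
The tie scenario described in the list above (two rows tied right after a row with rank 2) can also be reproduced without groupby. The sketch below uses a made-up Series (the values are assumptions chosen so that the tie falls on positions 3 and 4) and collects the result of each method into one frame for comparison.

```python
import pandas as pd

# Hypothetical values: the two 30s are tied and occupy positions 3 and 4
s = pd.Series([10, 20, 30, 30, 40])

# One column of ranks per tie-breaking method
methods = ["average", "min", "max", "dense", "first"]
ranks = pd.DataFrame({m: s.rank(method=m) for m in methods})

print(ranks)
```

Reading across the rows for the two tied values gives 3.5, 3, 4, 3 and 3/4 respectively, and only the "dense" column assigns 4 instead of 5 to the last row.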


### cummax() and cummin()

In general, the cummax and cummin functions calculate the cumulative maximum and cumulative minimum along a column. We can think of this operation as moving down a column and recording the largest or smallest value seen so far.
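
As a quick sketch of the ungrouped behaviour, the snippet below runs both functions on a single made-up Series (the variable names `running_max` and `running_min` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical values in a single column
values = pd.Series([4, 2, 6, 3, 8, 1, 5])

# Largest and smallest value seen so far at each position
running_max = values.cummax()
running_min = values.cummin()

print(pd.DataFrame({"value": values, "cummax": running_max, "cummin": running_min}))
```

The running maximum only changes when a new high appears (4, then 6, then 8), and the running minimum only changes when a new low appears (4, then 2, then 1).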

When applied after groupby, these functions do the same thing, but separately within each group, so the running value restarts at the first row of every group. Let us consider the following example:

In [67]:
# Create a sample DataFrame
data = {'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'Value': [4, 2, 6, 3, 8, 1, 5]}
df = pd.DataFrame(data)

# Apply cummax and cummin within each group
df['Cumulative Max'] = df.groupby('Group')['Value'].cummax()
df['Cumulative Min'] = df.groupby('Group')['Value'].cummin()

df

,Group,Value,Cumulative Max,Cumulative Min
0,A,4,4,4
1,A,2,4,2
2,A,6,6,2
3,B,3,3,3
4,B,8,8,3
5,B,1,8,1
6,B,5,8,1
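
To make the "separately within each group" behaviour concrete, the sketch below rebuilds the grouped cumulative maximum by hand. It is only an illustration under the same sample data as above (the names `grouped` and `manual` are assumptions): it runs cummax on each group's slice and stitches the pieces back together in the original row order.

```python
import pandas as pd

# Same sample data as the example above
df = pd.DataFrame({
    "Group": ["A", "A", "A", "B", "B", "B", "B"],
    "Value": [4, 2, 6, 3, 8, 1, 5],
})

# Grouped cumulative maximum in a single call
grouped = df.groupby("Group")["Value"].cummax()

# Manual equivalent: cummax on each group's slice, then restore row order
pieces = [part["Value"].cummax() for _, part in df.groupby("Group")]
manual = pd.concat(pieces).sort_index()

print(list(grouped) == list(manual))
```

Both approaches give the same values; the single groupby call is shorter and keeps the original index without any extra sorting.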
