# Renaming Column using Pandas

# Binning using Pandas

In [None]:
Introduction
When dealing with continuous numeric data, it is often helpful to bin the data into multiple buckets for further analysis. 
There are several different terms for binning including bucketing, discrete binning and discretization.

Pandas supports these approaches using the cut and qcut functions. 
In this section you will learn how to use the pandas functions to convert continuous data to a set of discrete buckets.
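As a brief aside, here is a minimal sketch of quantile-based binning with pd.qcut, which chooses bin edges so that each bucket holds roughly the same number of rows. The sample ages and the labels Q1 to Q4 are illustrative assumptions, not part of the walkthrough that follows.

```python
import pandas as pd

# Hypothetical sample ages, used only for illustration
ages = pd.Series([46, 50, 40, 24, 34, 36, 42, 23], name='age')

# qcut derives the bin edges from the data's quantiles (here: quartiles)
quartiles = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Count how many ages landed in each quartile, in Q1..Q4 order
quartile_counts = quartiles.value_counts().sort_index()
print(quartile_counts)
```

In contrast, pd.cut takes bin edges that you define yourself, which is the approach used in the rest of this section.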



Binning in Pandas with Age Example
Create Random Age Data
First, let's create a simple pandas DataFrame assigned to the variable df_ages with just one column for age. 
This column will contain 8 random age values between 21 (inclusive) and 51 (exclusive).

In [4]:
import pandas as pd
import numpy as np
df_ages = pd.DataFrame({'age': np.random.randint(21, 51, 8)})

In [5]:
df_ages

Unnamed: 0,age
0,46
1,50
2,40
3,24
4,34
5,36
6,42
7,23
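Because the ages are random, rerunning the cell gives different values each time. If reproducible output is needed, one option is a seeded NumPy generator. This is a sketch only: the seed value 42 and the name df_ages_seeded are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same "random" ages on every run.
# The seed (42) is an arbitrary choice for illustration.
rng = np.random.default_rng(seed=42)

# integers() draws from [21, 51): 21 inclusive, 51 exclusive
df_ages_seeded = pd.DataFrame({'age': rng.integers(21, 51, size=8)})
print(df_ages_seeded)
```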


Create New Column of age_bins Via Defining Bin Edges

In [6]:
df_ages['age_bins'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49])

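Print out df_ages. We can see that each age value is assigned to the bin that contains it. By default the bins are closed on the right, so for integer ages (20, 29] covers 21 through 29. The age 50 lies outside every bin, so its age_bins value is missing (NaN).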

In [7]:
df_ages

Unnamed: 0,age,age_bins
0,46,"(39.0, 49.0]"
1,50,
2,40,"(39.0, 49.0]"
3,24,"(20.0, 29.0]"
4,34,"(29.0, 39.0]"
5,36,"(29.0, 39.0]"
6,42,"(39.0, 49.0]"
7,23,"(20.0, 29.0]"
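The empty age_bins entry for age 50 comes from the bin edges: the last bin stops at 49. A small sketch, reusing the ages shown above as a fixed Series, compares the original edges with a hypothetical extra edge at 59 that covers the value.

```python
import pandas as pd

# The same ages as in the table above, hard-coded for reproducibility
ages = pd.Series([46, 50, 40, 24, 34, 36, 42, 23], name='age')

# Original edges: 50 is greater than the last edge (49), so it becomes NaN
narrow = pd.cut(x=ages, bins=[20, 29, 39, 49])

# Assumed extra edge at 59 so that 50 falls inside (49, 59]
wide = pd.cut(x=ages, bins=[20, 29, 39, 49, 59])

print(narrow.isna().sum())  # number of unbinned ages with the original edges
print(wide.isna().sum())    # number of unbinned ages with the extra edge
```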


Let's verify the unique age_bins values.

In [8]:
df_ages['age_bins'].unique()

[(39.0, 49.0], NaN, (20.0, 29.0], (29.0, 39.0]]
Categories (3, interval[int64]): [(20, 29] < (29, 39] < (39, 49]]
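Besides listing the unique bins, it can help to count how many rows fall into each one. The following sketch uses value_counts on the same fixed ages; the variable names are illustrative.

```python
import pandas as pd

# The same ages as in the table above, hard-coded for reproducibility
ages = pd.Series([46, 50, 40, 24, 34, 36, 42, 23], name='age')
age_bins = pd.cut(x=ages, bins=[20, 29, 39, 49])

# value_counts() ignores NaN by default; sort_index() orders the bins
# from lowest to highest interval
bin_counts = age_bins.value_counts().sort_index()
print(bin_counts)
```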

Create New Column of age_by_decade With Labels 20s, 30s, and 40s

This code creates a new column called age_by_decade. The call uses the same first two arguments as above (x and bins) plus a third argument, labels: a list with one label per bin, which replaces the interval notation with the names 20s, 30s, and 40s.

In [9]:
df_ages['age_by_decade'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49], labels=['20s', '30s', '40s'])

In [10]:
df_ages

Unnamed: 0,age,age_bins,age_by_decade
0,46,"(39.0, 49.0]",40s
1,50,,
2,40,"(39.0, 49.0]",40s
3,24,"(20.0, 29.0]",20s
4,34,"(29.0, 39.0]",30s
5,36,"(29.0, 39.0]",30s
6,42,"(39.0, 49.0]",40s
7,23,"(20.0, 29.0]",20s
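With the edges used above, age 50 still has no label. One possible variation, sketched here under the assumption that a '50s' bucket is wanted, uses round edges with right=False so that each bin is closed on the left, for example [40, 50).

```python
import pandas as pd

# The same ages as in the table above, hard-coded for reproducibility
ages = pd.Series([46, 50, 40, 24, 34, 36, 42, 23], name='age')

# right=False makes each bin include its left edge and exclude its right
# edge, so 40 lands in [40, 50) and 50 lands in [50, 60)
decades = pd.cut(
    x=ages,
    bins=[20, 30, 40, 50, 60],
    right=False,
    labels=['20s', '30s', '40s', '50s'],
)
print(decades.tolist())
```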


# Handling Missing Values

The ‘fillna()’ function fills missing values in a Pandas DataFrame. It is commonly used to replace missing values with the mean, mode, or median of the column. For example, suppose we want to impute the ‘Gender’, ‘Married’ and ‘Self_Employed’ columns with their respective modes.

In [None]:
#First we import scipy function to determine the mode

In [20]:
data=pd.read_csv('loan.csv')

In [21]:
from scipy.stats import mode
mode(data['Gender'])

ModeResult(mode=array(['Male'], dtype=object), count=array([489]))

This returns both the mode and its count. Note that the mode is returned as an array because several values can share the highest frequency. We will always take the first one, using:

In [22]:
mode(data['Gender']).mode[0]

'Male'
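Depending on the installed version, scipy.stats.mode may not accept a column of strings, since recent SciPy releases focus on numeric input. As an alternative sketch, pandas has its own Series.mode method, which skips missing values by default. The small gender Series below is made-up data for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
gender = pd.Series(['Male', 'Female', 'Male', np.nan, 'Male'], name='Gender')

# Series.mode() returns a Series because there can be ties;
# take the first entry as the single most frequent value
most_common = gender.mode()[0]
print(most_common)
```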

In [None]:
Now we can fill the missing values in the Pandas Dataframe data and check using technique #2.

In [23]:

#Impute the values (assign the result back; inplace=True on a selected column may not update data in newer pandas):
data['Gender'] = data['Gender'].fillna(mode(data['Gender']).mode[0])
data['Married'] = data['Married'].fillna(mode(data['Married']).mode[0])
data['Self_Employed'] = data['Self_Employed'].fillna(mode(data['Self_Employed']).mode[0])
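The same imputation can be written as a loop over column names. The sketch below runs on a tiny made-up DataFrame called loans (not the real loan.csv) and uses the pandas Series.mode method instead of SciPy; both choices are assumptions made to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loan data, with one gap per column
loans = pd.DataFrame({
    'Gender': ['Male', np.nan, 'Male', 'Female'],
    'Married': ['Yes', 'Yes', np.nan, 'No'],
    'Self_Employed': ['No', 'No', 'No', np.nan],
})

# Fill each column's missing entries with that column's most frequent value
for col in ['Gender', 'Married', 'Self_Employed']:
    loans[col] = loans[col].fillna(loans[col].mode()[0])

print(loans)
```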

In [None]:
#Now check the number of missing values again to confirm:

In [37]:
def num_missing(x):
    # Count the missing (NaN/None) entries in a column
    return x.isnull().sum()

In [38]:
print(data.apply(num_missing, axis=0))

Loan_ID               0
Gender                0
Married               0
Dependents           15
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
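The per-column count can also be computed without a helper function, by calling isnull().sum() on the whole DataFrame. The sketch below uses a small made-up frame so that it runs without loan.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing LoanAmount
frame = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female'],
    'LoanAmount': [120.0, np.nan, 95.0],
})

# isnull() marks missing cells as True; sum() counts them per column
missing_counts = frame.isnull().sum()
print(missing_counts)
```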


Hence, it is confirmed that the missing values in ‘Gender’, ‘Married’ and ‘Self_Employed’ have been imputed; the other columns still contain missing values. Please note that this is the most primitive form of imputation. More sophisticated techniques include modeling the missing values or using grouped averages (mean/mode/median).
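As a rough sketch of the grouped-average idea, the example below fills missing LoanAmount values with the median of each row's Property_Area group. The data is made up, and grouping by Property_Area is an assumption for illustration; in practice the grouping column should be chosen from what is known about the data.

```python
import numpy as np
import pandas as pd

# Hypothetical loan data with two missing amounts
loans = pd.DataFrame({
    'Property_Area': ['Urban', 'Urban', 'Rural', 'Rural', 'Urban'],
    'LoanAmount': [100.0, np.nan, 60.0, np.nan, 140.0],
})

# transform('median') returns, for every row, the median of its group
group_median = loans.groupby('Property_Area')['LoanAmount'].transform('median')

# Fill each gap with the median of the matching Property_Area
loans['LoanAmount'] = loans['LoanAmount'].fillna(group_median)
print(loans)
```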