# Over and Under Sampling

Over- and under-sampling are techniques for classification problems with imbalanced classes. Sometimes a classification dataset is heavily tipped to one side: for example, we have 2000 examples for class 1 but only 200 for class 2. That can throw off a lot of the machine learning techniques we use to model the data and make predictions, because a model can score well just by favoring the majority class. Over- and under-sampling can combat that. Check out the graphic below for an illustration.

<img src="Sampling.png">

## Under and Over Sampling
On both the left and right sides of the image above, our blue class has far more samples than the orange class. In this case, we have two pre-processing options that can help when training our machine learning models.

Undersampling means we select only some of the data from the majority class, using only as many examples as the minority class has. The selection should be made (typically at random) so that the probability distribution of the majority class is maintained. That was easy! We evened out our dataset just by taking fewer samples, at the cost of throwing some data away.

Oversampling means we create copies of minority-class examples until the minority class has as many examples as the majority class. The copies should be made so that the distribution of the minority class is maintained. We evened out our dataset without collecting any more data!
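
The two options described above can be sketched with plain pandas. This is a minimal illustration on a made-up toy DataFrame (the column names `feature` and `label` and the class sizes are assumptions for the example, not part of the Kiva data used later), not the method the rest of this notebook relies on.

```python
import pandas as pd

# Toy imbalanced dataset (hypothetical): 20 "blue" rows and 5 "orange" rows
toy = pd.DataFrame({
    "feature": range(25),
    "label": ["blue"] * 20 + ["orange"] * 5,
})

majority = toy[toy["label"] == "blue"]
minority = toy[toy["label"] == "orange"]

# Under-sampling: keep a random subset of the majority class,
# the same size as the minority class
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

# Over-sampling: keep every original row and add random copies of
# minority rows (drawn with replacement) until the classes match
extra = minority.sample(n=len(majority) - len(minority), replace=True, random_state=0)
oversampled = pd.concat([majority, minority, extra])

print(undersampled["label"].value_counts().to_dict())
print(oversampled["label"].value_counts().to_dict())
```

After under-sampling both classes have 5 rows; after over-sampling both have 20. Sampling at random is what keeps the within-class distribution roughly intact.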

## Read more
https://elitedatascience.com/imbalanced-classes


In [1]:
import re
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression



In [2]:
df = pd.read_csv('kiva_loans_20181016.csv')
print(len(df))
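# Optional: work with a 20% random sample to speed things up
# df = df.sample(n=round(len(df)*.2))
# print(len(df))
# Optional: manually under-sample the majority class (status == 1) down to the size of the minority class (status == 0)
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
# df = pd.concat([df.loc[df.status==0], df.loc[df.status==1].sample(len(df.loc[df.status==0]))])
# df.status.value_counts()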

671205


In [3]:
df.shape

(671205, 18)

In [4]:
df.status.value_counts()

1    622877
0     48328
Name: status, dtype: int64
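
To put a number on the imbalance in the counts above, the short sketch below computes each class's share and the majority-to-minority ratio. The counts are copied by hand from the output above so that the snippet runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({1: 622877, 0: 48328}, name="status")

# Share of each class in the dataset
shares = counts / counts.sum()

# Ratio of majority class size to minority class size
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3))
print(f"majority:minority ratio = {imbalance_ratio:.1f}:1")
```

Status 0 makes up only about 7% of the rows, roughly a 13-to-1 imbalance. On the real DataFrame, `df.status.value_counts(normalize=True)` gives the same shares directly.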

In [5]:
df.dtypes

id                     int64
date                  object
activity              object
sector                object
use                   object
funded_amount          int64
loan_amount            int64
diff_funded_loan       int64
status                 int64
country_code          object
country               object
currency              object
gender                object
borrower_genders      object
lender_count           int64
term_in_months         int64
repayment_interval    object
tags                  object
dtype: object

In [6]:
df.head()

Unnamed: 0,id,date,activity,sector,use,funded_amount,loan_amount,diff_funded_loan,status,country_code,country,currency,gender,borrower_genders,lender_count,term_in_months,repayment_interval,tags
0,653051,1/1/14,Fruits & Vegetables,Food,"To buy seasonal, fresh fruits to sell.",300,300,0,1,PK,Pakistan,PKR,female,female,12,12,irregular,
1,653053,1/1/14,Rickshaw,Transportation,to repair and maintain the auto rickshaw used ...,575,575,0,1,PK,Pakistan,PKR,group,"female, female",14,11,irregular,
2,653068,1/1/14,Transportation,Transportation,To repair their old cycle-van and buy another ...,150,150,0,1,IN,India,INR,female,female,6,43,bullet,"user_favorite, user_favorite"
3,653063,1/1/14,Embroidery,Arts,to purchase an embroidery machine and a variet...,200,200,0,1,PK,Pakistan,PKR,female,female,8,11,irregular,
4,653084,1/1/14,Milk Sales,Food,to purchase one buffalo.,400,400,0,1,PK,Pakistan,PKR,female,female,16,14,monthly,


In [7]:
print(df.country.describe())
print("-"*50)
print(df.country.value_counts())

count          671205
unique             87
top       Philippines
freq           160441
Name: country, dtype: object
--------------------------------------------------
Philippines                         160441
Kenya                                75825
El Salvador                          39875
Cambodia                             34836
Pakistan                             26857
Peru                                 22233
Colombia                             21995
Uganda                               20601
Tajikistan                           19580
Ecuador                              13521
Paraguay                             11903
Nicaragua                            11781
India                                11237
Vietnam                              10843
Nigeria                              10136
Bolivia                               8806
Lebanon                               8792
Armenia                               8631
Palestine                             8167
Samoa          

In [8]:
print(df.sector.describe())
print("-"*50)
print(df.sector.value_counts())

count          671205
unique             15
top       Agriculture
freq           180302
Name: sector, dtype: object
--------------------------------------------------
Agriculture       180302
Food              136657
Retail            124494
Services           45140
Personal Use       36385
Housing            33731
Clothing           32742
Education          31013
Transportation     15518
Arts               12060
Health              9223
Construction        6268
Manufacturing       6208
Entertainment        830
Wholesale            634
Name: sector, dtype: int64


In [9]:
print(df.activity.describe())
print("-"*50)
print(df.activity.value_counts())

count      671205
unique        163
top       Farming
freq        72955
Name: activity, dtype: object
--------------------------------------------------
Farming                           72955
General Store                     64729
Personal Housing Expenses         32448
Food Production/Sales             28106
Agriculture                       27023
Pigs                              26624
Retail                            24771
Clothing Sales                    22339
Home Appliances                   20267
Higher education costs            19742
Fruits & Vegetables               16610
Grocery Store                     15102
Livestock                         13095
Fish Selling                      13060
Food                              10197
Fishing                           10066
Services                           9807
Poultry                            9783
Tailoring                          9657
Animal Sales                       9237
Food Stall                         8905
Sewing 

In [36]:
df1 = df[['status','loan_amount', 'activity', 'sector',  'country','gender','term_in_months']]

In [37]:
df1.head()

Unnamed: 0,status,loan_amount,activity,sector,country,gender,term_in_months
0,1,300,Fruits & Vegetables,Food,Pakistan,female,12
1,1,575,Rickshaw,Transportation,Pakistan,group,11
2,1,150,Transportation,Transportation,India,female,43
3,1,200,Embroidery,Arts,Pakistan,female,11
4,1,400,Milk Sales,Food,Pakistan,female,14


In [38]:
df1.shape

(671205, 7)

In [39]:
X = df1.drop(['status'], axis=1)
feature_names = X.columns
y = df1['status']

There is a full-featured Python package for handling imbalanced data: imbalanced-learn, imported as `imblearn`. It is available as a scikit-learn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn

https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html


In [40]:
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class rows until both classes are the same size
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

[(0, 622877), (1, 622877)]


In [41]:
# fit_resample returned plain arrays here, so the columns come back unnamed (0 to 5)
df2 = pd.DataFrame(X_resampled)
print(df2.head())
print(df2.shape)

# Restore the original feature names saved before resampling
df2.columns = feature_names

print(df2.head())


     0                    1               2         3       4   5
0  300  Fruits & Vegetables            Food  Pakistan  female  12
1  575             Rickshaw  Transportation  Pakistan   group  11
2  150       Transportation  Transportation     India  female  43
3  200           Embroidery            Arts  Pakistan  female  11
4  400           Milk Sales            Food  Pakistan  female  14
(1245754, 6)
  loan_amount             activity          sector   country  gender  \
0         300  Fruits & Vegetables            Food  Pakistan  female   
1         575             Rickshaw  Transportation  Pakistan   group   
2         150       Transportation  Transportation     India  female   
3         200           Embroidery            Arts  Pakistan  female   
4         400           Milk Sales            Food  Pakistan  female   

  term_in_months  
0             12  
1             11  
2             43  
3             11  
4             14  


In [42]:
# Wrap the resampled labels in a DataFrame as well
y = pd.DataFrame(y_resampled)

print(y.head())
print(y.shape)

   0
0  1
1  1
2  1
3  1
4  1
(1245754, 1)
