**Imbalanced Dataset**: A dataset with an unequal class distribution.

Consider a fraud detection task: you have a training set of previously observed transactions, each of which was either:

a) Normal

b) Fraudulent

Most transactions are normal, and it is not unusual for fraudulent transactions to account for less than 0.1% of the total! Building a model for this task can be tricky: if performance is judged only by an accuracy score, a model that always predicts “normal” will score very highly while never catching a single fraud.
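To make the accuracy problem concrete, here is a minimal sketch using made-up labels (999 normal, 1 fraudulent, so 0.1% fraud). The helper names `accuracy` and `recall` and the toy data are illustrative assumptions, not part of the original notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical toy labels: 0 = normal, 1 = fraudulent (0.1% fraud).
y_true = [0] * 999 + [1]
# A "model" that always predicts normal.
y_pred = [0] * 1000

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(f"accuracy = {baseline_accuracy:.3f}, fraud recall = {baseline_recall:.3f}")
# prints: accuracy = 0.999, fraud recall = 0.000
```

The constant predictor reaches 99.9% accuracy yet finds none of the fraud, which is why a metric that looks at the minority class (such as recall) is more informative here.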

**Three challenges with imbalanced data**
To understand the challenges associated with imbalanced data, we first introduce some notation:

The majority class is the class with the highest number of samples;

The minority class is the class with the lowest number of samples;

The class ratio for a given dataset is defined as the ratio between the size of the minority class and the size of the majority class. Empirically, class ratios of at least 25% do not affect performance by large margins. This is no longer true, however, as the ratio becomes smaller.
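The definition above can be sketched in a few lines. The function name `class_ratio` is an assumption made for illustration; the label counts (45503 legit, 142 fraudulent) are the ones reported by `value_counts()` later in this notebook.

```python
from collections import Counter


def class_ratio(labels):
    """Size of the minority class divided by the size of the majority class."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a class ratio")
    return min(counts.values()) / max(counts.values())


# Label counts taken from this notebook: 0 = legit, 1 = fraudulent.
labels = [0] * 45503 + [1] * 142
ratio = class_ratio(labels)
print(f"class ratio = {ratio:.2%}")
```

The resulting ratio is far below the 25% mark mentioned above, so this dataset falls in the regime where imbalance is expected to matter.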

**Our Goals:**
Understand the distribution of the "little" data that was provided to us.
Create a sub-dataframe with a 50/50 ratio of "Fraud" and "Non-Fraud" transactions. (NearMiss is one algorithm for this; the cells below use simple random under-sampling instead.)
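For reference, the idea behind NearMiss (version 1) is to keep the majority samples that lie closest to the minority class. What follows is a minimal pure-Python sketch of that selection rule on made-up 2-D points, not the code used in this notebook; the function name `near_miss_1` and its parameters are assumptions. In practice a library implementation (for example the `NearMiss` sampler in imbalanced-learn) would normally be preferred.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points whose mean distance to their
    k nearest minority points is smallest (the NearMiss-1 rule)."""

    def mean_dist_to_nearest_minority(point):
        distances = sorted(math.dist(point, m) for m in minority)
        nearest = distances[:k]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=mean_dist_to_nearest_minority)
    return ranked[:n_keep]


# Hypothetical toy data: 2 minority points, 6 majority points.
minority = [(0.0, 0.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (0.0, 1.0), (9.0, 9.0), (2.0, 0.0), (7.0, 1.0)]

# Keep as many majority points as there are minority points (50/50 split).
kept = near_miss_1(majority, minority, n_keep=len(minority), k=2)
print(kept)
# prints: [(0.5, 0.5), (0.0, 1.0)]
```

Unlike random under-sampling, this rule is deterministic and biases the retained majority samples toward the class boundary.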

In [1]:
import numpy as np
import pandas as pd

In [2]:
credit_card_data = pd.read_csv('/content/credit_data.csv')

In [3]:
credit_card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [5]:
credit_card_data.shape

(45646, 31)

In [7]:
credit_card_data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
45641,42436,-2.481639,-2.439949,0.363642,1.216827,2.572442,-1.26422,-0.443652,0.075853,0.073188,...,-0.039426,0.480591,1.779358,-0.7567,-0.161099,0.685617,0.223071,0.139619,0.0,0.0
45642,42436,1.223475,0.014944,0.471312,-0.03841,-0.566793,-0.86797,-0.058213,-0.14408,0.164904,...,-0.053292,-0.09368,0.106348,0.471407,0.135555,0.968336,-0.065171,0.005184,7.49,0.0
45643,42436,1.258657,0.421016,0.325437,0.684259,-0.292529,-1.052786,0.145228,-0.253567,-0.100521,...,-0.278029,-0.757417,0.119613,0.369393,0.246145,0.091553,-0.017156,0.032557,0.89,0.0
45644,42437,-0.500147,1.00077,1.809639,-0.114551,0.333865,-0.577076,1.062325,-0.51305,-0.048285,...,-0.193814,-0.16141,0.036965,0.400154,-0.802486,-0.076097,-0.214317,-0.22916,2.69,0.0
45645,42437,-0.652459,0.17729,1.955607,-1.879724,-0.368457,,,,,...,,,,,,,,,,


In [8]:
# determine the distribution of the two classes

credit_card_data['Class'].value_counts()

Class
0.0    45503
1.0      142
Name: count, dtype: int64

This is a highly imbalanced dataset: only 142 of the 45,645 labeled transactions (about 0.31%) are fraudulent.

0 --> Legit Transactions

1 --> Fraudulent Transactions
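Note that `shape` reports 45646 rows while the two class counts sum to 45645: the last row shown by `tail()` is incomplete, and `value_counts()` skips missing values by default. A small sketch of how such rows could be detected and dropped is shown below; the miniature DataFrame `toy` is a made-up stand-in for `credit_card_data`.

```python
import pandas as pd

# Hypothetical miniature stand-in for credit_card_data, with one incomplete row.
toy = pd.DataFrame({
    "Amount": [149.62, 2.69, 378.66, 1.00, None],
    "Class": [0.0, 0.0, 0.0, 1.0, None],
})

# Count rows whose label is missing.
n_missing = int(toy["Class"].isna().sum())

# Include NaN as its own bucket so the counts add up to the number of rows.
counts_with_nan = toy["Class"].value_counts(dropna=False)

# Keep only rows that have a label before computing class shares.
labeled = toy.dropna(subset=["Class"])
fraud_share = float((labeled["Class"] == 1).mean())

print(counts_with_nan)
print(f"missing labels: {n_missing}, fraud share among labeled rows: {fraud_share:.2%}")
```

The boolean filters used in the next cell (`Class == 0` and `Class == 1`) also exclude the unlabeled row, since a comparison with a missing value evaluates to False.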

In [9]:
# separating the legit and fraudulent transactions

legit = credit_card_data[credit_card_data.Class == 0]
fraud = credit_card_data[credit_card_data.Class == 1]

In [10]:
legit.shape

(45503, 31)

In [11]:
fraud.shape

(142, 31)

**Under-Sampling**

Build a sample dataset containing an equal number of legit and fraudulent transactions by randomly drawing 142 legit transactions, one for each of the 142 fraudulent ones.

In [14]:
legit_sample = legit.sample(n=142)

In [15]:
print(legit_sample.shape)

(142, 31)


**Concatenate the TWO Dataframes**



In [16]:
new_dataset = pd.concat([legit_sample, fraud], axis = 0)
# axis=0 concatenates row-wise, i.e. stacks the two DataFrames on top of each other

In [17]:
new_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
31472,36381,1.135098,-1.42228,1.059353,-0.501487,-1.576462,0.78881,-1.485158,0.382476,0.374725,...,0.222936,0.603012,-0.279977,-0.476777,0.46032,-0.04505,0.047191,0.026024,112.03,0.0
6019,6760,1.106749,-0.443511,1.233067,0.419639,-1.258706,-0.407504,-0.742668,-0.022259,2.450776,...,-0.239373,-0.407701,0.027451,0.401553,0.049561,0.946797,-0.07539,0.019086,67.13,0.0
38250,39314,1.315628,0.243653,-0.012243,0.174643,0.336275,0.134965,0.001303,-0.031962,-0.283308,...,-0.283366,-0.794221,-0.056043,-1.006506,0.405274,0.157946,-0.027495,-0.002804,0.89,0.0
45214,42253,-0.952581,1.094085,1.459546,0.709784,0.427683,-0.495198,0.997167,-0.118206,-0.729915,...,-0.474931,-1.386448,-0.053405,-0.190191,0.22231,-0.682166,0.187508,0.167792,55.84,0.0
5516,5574,-0.381526,2.625176,-2.390895,1.928058,0.226136,-1.810457,0.045186,0.627731,0.966145,...,-0.211417,-0.21893,0.323634,0.184009,-0.482339,-0.428442,0.205543,0.006366,0.89,0.0


In [18]:
new_dataset.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
44091,41791,-7.222731,6.155773,-10.82646,4.180779,-6.123555,-3.114136,-6.895112,5.161516,-2.516477,...,0.9127,-0.630358,0.190887,-0.061263,0.379775,-0.266845,1.193695,0.257468,99.99,1.0
44223,41851,-19.139733,9.286847,-20.134992,7.818673,-15.652208,-1.668348,-21.340478,0.6419,-8.55011,...,-2.182692,0.520543,-0.760556,0.662767,-0.948454,0.121796,-3.381843,-1.256524,139.9,1.0
44270,41870,-20.906908,9.843153,-19.947726,6.155789,-15.142013,-2.239566,-21.234463,1.151795,-8.73967,...,-1.977196,0.652932,-0.519777,0.541702,-0.053861,0.112671,-3.765371,-1.071238,1.0,1.0
44556,41991,-4.566342,3.353451,-4.572028,3.616119,-2.493138,-1.09,-5.551433,0.447783,-2.424414,...,2.674466,-0.02088,-0.302447,-0.086396,-0.51606,-0.295102,0.195985,0.141115,1.0,1.0
45203,42247,-2.524012,2.098152,-4.946075,6.456588,3.173921,-3.058806,-0.18471,-0.39042,-3.649812,...,0.027935,0.220366,0.976348,-0.290539,1.161002,0.663954,0.456023,-0.405682,1.0,1.0


In [19]:
new_dataset['Class'].value_counts()

Class
0.0    142
1.0    142
Name: count, dtype: int64
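The under-sampling and concatenation steps above hard-code `n=142` and draw a different random sample on every run. One possible way to generalize them is sketched below; the helper name `undersample_majority`, the `random_state` value, and the toy DataFrame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def undersample_majority(df, label_col="Class", random_state=42):
    """Randomly down-sample every class to the size of the smallest class.

    Rows with a missing label are dropped, because groupby skips NaN keys.
    """
    n_minority = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # axis=0 stacks the per-class samples row-wise.
    return pd.concat(parts, axis=0)


# Hypothetical toy data: 8 legit rows (0.0) and 2 fraudulent rows (1.0).
toy = pd.DataFrame({"Amount": range(10), "Class": [0.0] * 8 + [1.0] * 2})

balanced = undersample_majority(toy)
print(balanced["Class"].value_counts())
```

Fixing `random_state` makes the sample reproducible across runs. The result is still grouped by class, so it may be worth shuffling the rows (for example with `balanced.sample(frac=1, random_state=42)`) before any train/test split.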