# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', that is, the factors that actually determine the outcome. This is called overfitting; it produces good results in training and bad results when the model is applied to real data. Similarly, we could have a model that is too simplistic to accurately capture the signal (underfitting). Such a model never works well, neither in training nor on real data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [16]:
# Your code here
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import RobustScaler, StandardScaler, PolynomialFeatures, MinMaxScaler
import seaborn as sns

In [2]:
kaggle=pd.read_csv('imbalance.csv')

In [3]:
print(len(kaggle))

6362620


In [4]:
kaggle_sample = kaggle.sample(100000)

In [5]:
print(len(kaggle_sample))

100000
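Since `sample` draws different rows on every run, the numbers further down will change each time the notebook is re-executed. A minimal sketch of a reproducible draw follows; the `full` frame is a hypothetical synthetic stand-in for the real CSV, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full dataset: 1,000,000 rows, roughly 0.1% fraud.
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "amount": rng.exponential(scale=1e5, size=1_000_000),
    "isFraud": (rng.random(1_000_000) < 0.001).astype(int),
})

# random_state makes the sample reproducible across notebook runs.
sample_a = full.sample(n=100_000, random_state=42)
sample_b = full.sample(n=100_000, random_state=42)

print(sample_a.index.equals(sample_b.index))  # same rows both times
print(full["isFraud"].mean(), sample_a["isFraud"].mean())
```

With the real data the same idea would be `kaggle.sample(100000, random_state=...)`.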


In [7]:
kaggle_sample.head()

index,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
5822068,402,PAYMENT,32872.81,C622466656,0.0,0.0,M461115506,0.0,0.0,0,0
575692,25,PAYMENT,6484.0,C499402175,0.0,0.0,M1253513532,0.0,0.0,0,0
5536781,381,CASH_IN,100476.84,C1729690147,6027104.86,6127581.7,C366794197,629542.16,529065.32,0,0
1769664,162,CASH_IN,252551.63,C1714481096,15714.0,268265.63,C343228954,3156572.74,2904021.11,0,0
4882012,348,TRANSFER,429758.75,C60573377,0.0,0.0,C1314990685,1330178.45,1759937.2,0,0


In [6]:
kaggle_sample.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [9]:
kaggle_sample.type.value_counts()

CASH_OUT    34985
PAYMENT     33709
CASH_IN     22333
TRANSFER     8328
DEBIT         645
Name: type, dtype: int64

In [10]:
kaggle_sample.nameOrig.value_counts()

C1520724606    2
C1831353184    2
C1725839947    2
C622466656     1
C1982736652    1
              ..
C328540088     1
C331960921     1
C1665592160    1
C1840636014    1
C959229039     1
Name: nameOrig, Length: 99997, dtype: int64

In [8]:
kaggle_sample.nameDest.value_counts()

C392347203     8
C265067678     5
C306206744     5
C451111351     5
C241558961     5
              ..
M116356543     1
M690342430     1
C1964398989    1
C1445814382    1
C68063083      1
Name: nameDest, Length: 92923, dtype: int64

In [12]:
#important features should be type and amount

### What is the distribution of the outcome? 

In [13]:
# Your response here
kaggle_sample.isFraud.value_counts(normalize=True)

0    0.99875
1    0.00125
Name: isFraud, dtype: float64
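To put this distribution in perspective, it helps to work out what a model that never predicts fraud would score. The short sketch below only reuses the two shares printed above; the variable names are illustrative.

```python
import pandas as pd

# Class shares copied from the value_counts output above.
shares = pd.Series({0: 0.99875, 1: 0.00125}, name="isFraud")

# A model that always predicts the majority class reaches this accuracy.
baseline_accuracy = shares.max()
# Number of legitimate transactions per fraudulent one.
imbalance_ratio = shares.loc[0] / shares.loc[1]

print(f"baseline accuracy: {baseline_accuracy:.3%}")
print(f"legitimate per fraud: {imbalance_ratio:.0f}")
```

Any accuracy figure reported later should be read against this baseline, since there are roughly 800 legitimate transactions for every fraudulent one.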

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [14]:
# Your code here
kaggle_sample.drop(columns=['nameDest','nameOrig'],inplace=True) # dropping columns with too many unique values
kaggle_sample

index,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
5822068,402,PAYMENT,32872.81,0.00,0.00,0.00,0.00,0,0
575692,25,PAYMENT,6484.00,0.00,0.00,0.00,0.00,0,0
5536781,381,CASH_IN,100476.84,6027104.86,6127581.70,629542.16,529065.32,0,0
1769664,162,CASH_IN,252551.63,15714.00,268265.63,3156572.74,2904021.11,0,0
4882012,348,TRANSFER,429758.75,0.00,0.00,1330178.45,1759937.20,0,0
...,...,...,...,...,...,...,...,...,...
2529672,205,CASH_OUT,223287.37,872580.00,649292.63,0.00,223287.37,0,0
2313527,188,PAYMENT,30441.18,213781.56,183340.37,0.00,0.00,0,0
996649,45,CASH_OUT,222415.96,0.00,0.00,2960006.78,3182422.74,0,0
1481409,141,CASH_OUT,37126.11,0.00,0.00,2230913.91,2268040.02,0,0
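On the time variable: the cell above leaves `step` as a raw integer. One possible treatment, assuming (as the dataset description on Kaggle indicates) that one step corresponds to one hour of simulated time, is to derive an hour-of-day feature so that the model can pick up daily patterns. The column names `hour_of_day` and `day` are illustrative, not part of the dataset.

```python
import pandas as pd

# Steps taken from the sample rows shown above.
df = pd.DataFrame({"step": [402, 25, 381, 162, 348]})

# Assumption: one step is one hour, so step % 24 is a position in the daily cycle.
df["hour_of_day"] = df["step"] % 24
# Assumption: 24 consecutive steps make up one simulated day.
df["day"] = df["step"] // 24

print(df)
```

Whether hour 0 of this cycle is actually midnight is not something the data confirms, so the feature is best read as a position in a 24-step cycle.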


In [17]:
le = LabelEncoder()
label_cols = ['type']
kaggle_sample[label_cols] = kaggle_sample[label_cols].apply(le.fit_transform)
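`LabelEncoder` maps the five transaction types to the integers 0 to 4, which a linear model such as logistic regression will treat as an ordered quantity. An alternative worth considering is one-hot encoding, sketched here on a toy frame (the amounts are placeholders, not results):

```python
import pandas as pd

# Toy frame using the five transaction types seen in value_counts above.
df = pd.DataFrame({
    "type": ["CASH_OUT", "PAYMENT", "CASH_IN", "TRANSFER", "DEBIT"],
    "amount": [222415.96, 6484.0, 100476.84, 429758.75, 50.0],
})

# One indicator column per category; no artificial ordering is introduced.
encoded = pd.get_dummies(df, columns=["type"], prefix="type", dtype=int)

print(encoded.columns.tolist())
```

Each row ends up with exactly one `type_*` column set to 1.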

### Run a logistic regression classifier and evaluate its accuracy.

In [18]:
# Your code here
y = kaggle_sample['isFraud']
X = kaggle_sample.drop(labels='isFraud', axis=1)

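scaler = StandardScaler()
# NOTE: fit_transform returns a scaled copy that is not assigned here, so X
# itself stays unscaled and the models below are trained on the raw features.
scaler.fit_transform(X)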

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2)
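For scaling to take effect, the transformed values have to be what the model is trained on, and the scaler should be fitted on the training split only so that no test information leaks in. A minimal sketch of one way to arrange this with a scikit-learn pipeline is shown below. It runs on synthetic data from `make_classification` (a stand-in, not the fraud sample), and the seeds and `max_iter` value are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for X and y (about 1% positives).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, weights=[0.99], random_state=0
)

# stratify keeps the class shares the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training split only, then applies
# the same transformation whenever it predicts or scores.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

print(model.score(X_te, y_te))
```

Stratifying the split matters with rare classes, because an unstratified 20% test set can end up with very few fraud cases by chance.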

In [19]:
lr = LogisticRegression()

lr.fit(X_train, y_train)

acc = lr.score(X_test, y_test)*100

print(round(acc, 2))

99.8
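An accuracy of this size says little on its own when about 99.9% of the rows belong to one class. The sketch below illustrates why, using hypothetical test labels with the same fraud share as the sample and a `DummyClassifier` that always predicts "not fraud"; none of these numbers come from the real model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical test labels: 20,000 rows with 25 frauds (0.125%, as in the sample).
y_true = np.zeros(20_000, dtype=int)
y_true[:25] = 1
X_dummy = np.zeros((20_000, 1))  # features are ignored by the dummy model

# A "model" that always predicts the majority class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_hat = dummy.predict(X_dummy)

print("accuracy:", accuracy_score(y_true, y_hat))
print("fraud recall:", recall_score(y_true, y_hat))
print(confusion_matrix(y_true, y_hat))
```

The dummy reaches the same headline accuracy while catching no fraud at all, so recall, precision and the confusion matrix are more informative than accuracy for this problem.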


### Now pick a model of your choice and evaluate its accuracy.

In [21]:
# Your code here
# NOTE: RandomForestRegressor treats the 0/1 label as a number, so the score
# printed below is R² (r2_score), not a classification accuracy.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

regr = RandomForestRegressor(max_depth=10, random_state=1,n_estimators=90)

regr.fit(X_train, y_train)

y_pred_forest=regr.predict(X_test)

print(r2_score(y_test, y_pred_forest))

0.6375123654485192
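Because `isFraud` is a class label, a classifier evaluated with classification metrics would be directly comparable to the logistic regression. The following is a sketch of that variant, not the code that produced the score above. It uses synthetic data from `make_classification`, and `class_weight="balanced"` is one assumed way of handling the imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the fraud sample (about 2% positives).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, weights=[0.98], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

# Same depth and tree count as the regressor above, but as a classifier.
# class_weight="balanced" reweights classes inversely to their frequency.
clf = RandomForestClassifier(
    n_estimators=90, max_depth=10, class_weight="balanced", random_state=1
)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Per-class precision, recall and F1 instead of a single accuracy figure.
print(classification_report(y_te, y_pred, zero_division=0))
```

The row for class 1 in the report is the one to watch, since it describes how well the rare class is detected.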


### Which model worked better and how do you know?

In [2]:
# Your response here
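# The two scores cannot be compared directly: 99.8 is the accuracy (in %) of the
# logistic regression, while 0.64 is the R² of a random forest regressor.
# Accuracy is also misleading here: given the class distribution above, always
# predicting "not fraud" would already score about 99.9%.
# A fair comparison needs the same classification metrics (e.g. recall and
# precision on the fraud class) for both models.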
