# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when data is noisy we can, if we are not careful, end up fitting the model to the noise rather than to the 'signal', the factors that actually determine the outcome. This is called overfitting: the model scores well in training but poorly when applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), which produces a model that never works well.
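
The contrast between the two failure modes can be illustrated on synthetic data. The sketch below is not part of the lab: it assumes scikit-learn's `make_classification` as a stand-in dataset with deliberately noisy labels (`flip_y`), and compares an unrestricted decision tree with a depth-1 tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the lab dataset);
# flip_y assigns random labels to about 20% of the samples to act as noise.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Unrestricted tree: free to memorize the noise (overfitting risk).
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Depth-1 tree: a single split, possibly too simple (underfitting risk).
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)

deep_train, deep_test = deep.score(X_train, y_train), deep.score(X_test, y_test)
stump_train, stump_test = stump.score(X_train, y_train), stump.score(X_test, y_test)

print(f"deep tree  train={deep_train:.3f} test={deep_test:.3f}")
print(f"stump      train={stump_train:.3f} test={stump_test:.3f}")
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting; similar but mediocre scores on both sets for the stump would point to underfitting.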


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/ealaxi/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [1]:
# Your code here
import zipfile 
import pandas as pd 

with zipfile.ZipFile("../archive (3).zip") as z:
    with z.open("PS_20174392719_1491204439457_log.csv") as f:
        kaggle_df = pd.read_csv(f)


In [2]:
kaggle_df = kaggle_df.sample(100000)
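
For a repeatable sample, `DataFrame.sample` accepts a `random_state`. The sketch below uses a small made-up frame (`toy_df` and its values are illustrative, not the PaySim data) to show that the same seed returns the same rows.

```python
import pandas as pd

# Made-up stand-in for the full transaction table (illustrative only)
toy_df = pd.DataFrame({
    "amount": [float(i) for i in range(1000)],
    "isFraud": [1 if i % 100 == 0 else 0 for i in range(1000)],
})

# Fixing random_state makes the draw repeatable across notebook runs
sample_a = toy_df.sample(n=200, random_state=42)
sample_b = toy_df.sample(n=200, random_state=42)

# True when both draws selected the same rows in the same order
print(sample_a.index.equals(sample_b.index))
# Share of fraud rows that ended up in the sample
print(sample_a["isFraud"].mean())
```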

In [3]:
kaggle_df.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [4]:
kaggle_df.shape

(100000, 11)

In [5]:
kaggle_df.describe()

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 243.72349 | 184535.9 | 838736.4 | 860169.1 | 1115018.0 | 1243651.0 | 0.00117 | 0.0 |
| std | 142.367774 | 706558.2 | 2888697.0 | 2924544.0 | 3388285.0 | 3773148.0 | 0.034185 | 0.0 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 156.0 | 13372.34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 239.0 | 75325.24 | 14475.5 | 0.0 | 134571.3 | 216975.3 | 0.0 | 0.0 |
| 75% | 336.0 | 208728.2 | 107714.4 | 148626.4 | 952692.1 | 1116704.0 | 0.0 | 0.0 |
| max | 739.0 | 53670510.0 | 34657150.0 | 34616320.0 | 191916700.0 | 236251200.0 | 1.0 | 0.0 |


In [6]:
kaggle_df.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2110643 | 183 | CASH_IN | 276041.01 | C148948261 | 517099.0 | 793140.01 | C562863101 | 40681.95 | 0.0 | 0 | 0 |
| 1495837 | 142 | CASH_IN | 302753.44 | C343390956 | 6220797.9 | 6523551.34 | C515132998 | 15350502.11 | 15047748.67 | 0 | 0 |
| 1108056 | 130 | CASH_OUT | 402452.54 | C77948695 | 247366.78 | 0.0 | C1384245820 | 826029.96 | 1228482.5 | 0 | 0 |
| 3394546 | 255 | CASH_OUT | 3376.9 | C809201105 | 0.0 | 0.0 | C1824584041 | 88984.17 | 92361.06 | 0 | 0 |
| 165619 | 12 | CASH_IN | 87338.92 | C18630671 | 16099751.95 | 16187090.87 | C1396755641 | 5018206.33 | 5505886.84 | 0 | 0 |


### What is the distribution of the outcome? 

In [7]:
# Your response here
# Note: called on the whole DataFrame, value_counts counts unique rows, not the outcome column
kaggle_df.value_counts(normalize=True)

step  type      amount    nameOrig     oldbalanceOrg  newbalanceOrig  nameDest     oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud
1     CASH_IN   80448.13  C161982472   4274305.14     4354753.26      C20671747    124139.21       43691.09        0        0                 0.00001
306   PAYMENT   1054.53   C73563022    38874.00       37819.47        M2109867061  0.00            0.00            0        0                 0.00001
                2233.57   C644344366   0.00           0.00            M432445649   0.00            0.00            0        0                 0.00001
                2195.00   C1797312755  0.00           0.00            M852421074   0.00            0.00            0        0                 0.00001
                2186.72   C82894008    311.00         0.00            M698318441   0.00            0.00            0        0                 0.00001
                                                                                                               

In [8]:
kaggle_df.isFraud.value_counts(normalize=True)

0    0.99883
1    0.00117
Name: isFraud, dtype: float64
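
These proportions already fix the accuracy that a trivial "always predict 0" classifier would reach, which is a useful yardstick for any model trained later. A minimal sketch, using only the two shares printed above (the variable names are illustrative):

```python
# Class proportions reported by isFraud.value_counts(normalize=True) above
class_share = {0: 0.99883, 1: 0.00117}

# A classifier that always predicts the majority class is right
# exactly as often as that class occurs.
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

# Ratio of legitimate to fraudulent transactions in the sample
imbalance_ratio = class_share[0] / class_share[1]

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3%}")
print(f"about {imbalance_ratio:.0f} legitimate transactions per fraud")
```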

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [9]:
# Your code here
# Yes, as long as each step represents the same amount of time
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
label_columns = ["type"]
kaggle_df[label_columns] = kaggle_df[label_columns].apply(le.fit_transform)
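
`LabelEncoder` maps each category to an integer code assigned in sorted order, which a linear model may read as an ordering between transaction types. One common alternative is one-hot encoding. The sketch below uses `pandas.get_dummies` on a made-up frame (`toy_df` is illustrative and only uses type values visible in the `head()` and `value_counts()` outputs).

```python
import pandas as pd

# Made-up rows (illustrative only)
toy_df = pd.DataFrame({
    "type": ["CASH_IN", "CASH_OUT", "PAYMENT", "CASH_OUT"],
    "amount": [276041.01, 402452.54, 1054.53, 3376.90],
})

# One indicator column per category, so no artificial order is implied
encoded = pd.get_dummies(toy_df, columns=["type"], dtype=int)
print(encoded.columns.tolist())
```

Tree-based models are usually less sensitive to this choice than logistic regression, so integer codes may be acceptable there.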

In [10]:
kaggle_df.drop(labels=['nameDest', 'nameOrig'], axis=1, inplace=True)  # drop the account-ID string columns; leaving them in made the regression raise an error
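
On the time question raised in this section: the dataset's Kaggle page describes one step as one hour of simulated time. Treating that as an assumption, an hour-of-day feature could be derived from `step`. The helper names `hour_of_day` and `day_index` are hypothetical, and aligning a step that is a multiple of 24 with hour 0 is an arbitrary choice, since the notebook does not say which clock hour step 1 maps to.

```python
def hour_of_day(step: int) -> int:
    """Map a step counter to an hour in 0..23, assuming 1 step = 1 hour."""
    return step % 24


def day_index(step: int) -> int:
    """Number of whole 24-step days elapsed before this step."""
    return step // 24


# Step values taken from the head() output above
for step in (183, 142, 130, 255, 12):
    print(step, hour_of_day(step), day_index(step))
```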

### Run a logistic regression classifier and evaluate its accuracy.

In [11]:
# Your code here
Y = kaggle_df["isFraud"]
X = kaggle_df.drop(["isFraud"], axis=1)

from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=3)
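
With so few positive cases, a purely random split can leave the test set with an unrepresentative number of frauds. `train_test_split` has a `stratify` argument that preserves class proportions. A minimal sketch on synthetic labels (the arrays below are illustrative, not the lab data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 2% positives (illustrative, not the lab data)
y = np.array([1] * 20 + [0] * 980)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3
)

print("train positive rate:", y_train.mean())
print("test positive rate: ", y_test.mean())
```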

In [17]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()  # class_weight='balanced' left disabled
lr.fit(X_train, Y_train)
acc = lr.score(X_test, Y_test)*100

print("Logistic Regression Test : ",acc)

Logistic Regression Test :  99.795
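
The `class_weight='balanced'` option that is commented out in the cell above reweights each class inversely to its frequency, which usually trades some accuracy for better recall on the rare class. The sketch below illustrates this on synthetic data from `make_classification` (an assumption, not the PaySim sample). It also standardizes the features in a pipeline, since the balance columns in this dataset span many orders of magnitude.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 2% positives (illustrative only)
X, y = make_classification(
    n_samples=5000, n_features=8, n_informative=4,
    weights=[0.98], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

recalls = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight=weight, max_iter=1000),
    )
    model.fit(X_train, y_train)
    # Recall: share of true positives that the model catches
    recalls[label] = recall_score(y_test, model.predict(X_test))

print(recalls)
```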


In [18]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = lr.predict(X_test)
print(accuracy_score(Y_test, y_pred)*100)
cm = confusion_matrix(Y_test, y_pred)
print(cm)

99.795
[[19948    30]
 [   11    11]]
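
Accuracy hides how the classifier does on the fraud class, so it can help to read precision and recall off the confusion matrix. The helper below (`rates_from_confusion` is a hypothetical name) assumes scikit-learn's layout, with rows as true labels and columns as predictions, and is applied to the matrix printed above.

```python
def rates_from_confusion(cm):
    """Precision, recall and F1 for the positive class.

    Expects scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1


# Logistic regression confusion matrix printed above
precision, recall, f1 = rates_from_confusion([[19948, 30], [11, 11]])
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

scikit-learn's `classification_report` gives the same quantities directly from `Y_test` and `y_pred`.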


### Now pick a model of your choice and evaluate its accuracy.

In [19]:
# Your code here
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, Y_train)
y_pred = dtc.predict(X_test)
acc = dtc.score(X_test, Y_test)*100
print("Decision Tree Test Accuracy ", acc)
cm = confusion_matrix(Y_test, y_pred)
print(cm)


Decision Tree Test Accuracy  99.92
[[19968    10]
 [    6    16]]
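
The bottom row of this matrix shows only 22 fraud cases in the test split, so a single split gives a noisy estimate. One way to check stability is stratified cross-validation scored on recall. The sketch below runs on synthetic data from `make_classification` (an assumption, not the PaySim sample), and the fold count of 5 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 2% positives (illustrative only)
X, y = make_classification(
    n_samples=5000, n_features=8, n_informative=4,
    weights=[0.98], random_state=0,
)

# Stratified folds keep the rare class present in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="recall"
)

print("recall per fold:", scores.round(3))
print("mean recall:", scores.mean().round(3))
```

A wide spread between folds would suggest that differences of a few cases between two models on one split should not be over-interpreted.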


### Which model worked better and how do you know?

In [14]:
# Your response here
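# The decision tree worked better: 99.92% test accuracy versus 99.795% for logistic regression.
# Accuracy alone is a weak yardstick here, since predicting "not fraud" every time would already score 99.89% on this test set (19978 of 20000).
# The confusion matrices are more telling: the tree catches 16 of the 22 frauds with 10 false alarms, while logistic regression catches 11 of 22 with 30 false alarms.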
