# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when the data is noisy and we are not careful, we can end up fitting the model to the noise rather than the 'signal', that is, the factors that actually determine the outcome. This is called overfitting: the model scores well in training and poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal at all (underfitting), in which case it performs poorly on both training data and real data.
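
To make the two failure modes concrete, here is a minimal sketch on made-up data (a noisy sine wave, not the fraud dataset). It fits polynomials of increasing degree and compares the error on the points used for fitting with the error on fresh points. The signal, noise level, and degrees are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def noisy_signal(x):
    # Hypothetical "signal" (a sine wave) plus random noise.
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)


# Points used for fitting, and fresh points the model has never seen.
x_fit = np.linspace(0, 1, 15)
y_fit = noisy_signal(x_fit)
x_new = rng.uniform(0, 1, 200)
y_new = noisy_signal(x_new)

errors = {}
for degree in (1, 3, 9):
    poly = Polynomial.fit(x_fit, y_fit, degree)
    fit_mse = np.mean((poly(x_fit) - y_fit) ** 2)
    new_mse = np.mean((poly(x_new) - y_new) ** 2)
    errors[degree] = (fit_mse, new_mse)
    print(f"degree {degree}: fit MSE {fit_mse:.3f}, new-data MSE {new_mse:.3f}")
```

The error on the fitting points can only go down as the degree rises, because each higher-degree model contains the lower-degree ones. The error on fresh points has no such guarantee, which is why a model should be judged on data it was not trained on.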


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data?resource=download . Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset. Use a sample of n=100000 elements instead, so your computer doesn't freeze.

In [184]:
import pandas as pd

n = 100000
# Note: nrows reads the first n rows of the file, not a random sample
df = pd.read_csv("Fraud.csv", nrows=n)
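
If a random sample is preferred over the first rows of the file, one option is to read the CSV in chunks and keep a random fraction of each chunk. The sketch below uses a small in-memory CSV as a hypothetical stand-in for `Fraud.csv` so that it runs on its own. The toy columns, the chunk size, and the known total row count are assumptions.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for Fraud.csv: 1,000 rows built in memory.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "step": np.arange(1000) // 100 + 1,
    "amount": rng.uniform(1, 1e5, size=1000).round(2),
})
csv_buffer = io.StringIO(toy.to_csv(index=False))

n_rows = 100  # sample size (100000 in the lab)

# Option 1: the first n_rows rows only, which covers just the earliest steps.
head_df = pd.read_csv(csv_buffer, nrows=n_rows)

# Option 2: read in chunks and keep a random fraction of each chunk, so the
# whole file never has to be held in memory at once.
csv_buffer.seek(0)
frac = n_rows / len(toy)  # assumes the total row count is known
parts = [
    chunk.sample(frac=frac, random_state=42)
    for chunk in pd.read_csv(csv_buffer, chunksize=250)
]
sample_df = pd.concat(parts, ignore_index=True)

print("Steps in first rows:", head_df["step"].unique())
print("Distinct steps in random sample:", sample_df["step"].nunique())
```

With the real file, `csv_buffer` would be replaced by the path to the CSV. The trade-off is that a random sample covers the whole time range, while the first rows are faster to read.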

### What is the distribution of the outcome? 

In [185]:
df.head(20)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0
5,1,PAYMENT,7817.71,C90045638,53860.0,46042.29,M573487274,0.0,0.0,0,0
6,1,PAYMENT,7107.77,C154988899,183195.0,176087.23,M408069119,0.0,0.0,0,0
7,1,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,0.0,0.0,0,0
8,1,PAYMENT,4024.36,C1265012928,2671.0,0.0,M1176932104,0.0,0.0,0,0
9,1,DEBIT,5337.77,C712410124,41720.0,36382.23,C195600860,41898.0,40348.79,0,0


In [186]:
fraud_counts = df['isFraud'].value_counts()

print("Number of Fraudulent Transactions:", fraud_counts[1])
print("Number of Non-Fraudulent Transactions:", fraud_counts[0])

Number of Fraudulent Transactions: 116
Number of Non-Fraudulent Transactions: 99884
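
These counts already show how imbalanced the outcome is. As a quick sketch, the fraud rate and the accuracy of a trivial model that always predicts "not fraud" can be computed directly from the two numbers printed above (the variable names are illustrative):

```python
# Class counts reported above for the 100,000-row subset.
n_fraud = 116
n_legit = 99884
total = n_fraud + n_legit

fraud_rate = n_fraud / total
# Accuracy of a trivial model that predicts "not fraud" for every row.
baseline_accuracy = n_legit / total

print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.5f}")
```

Any accuracy figure reported later is best read against this baseline rather than against 50%.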


### Clean the dataset. Pre-process it to make it suitable for ML training. Feel free to explore, drop, encode, transform, etc. Whatever you feel will improve the model score.

In [187]:
# Checking the data types
df.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [188]:
# Checking the missing values
df.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [189]:
# Display unique values in each column
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values in '{column}': {unique_values}")
    print()

Unique values in 'step': [ 1  2  3  4  5  6  7  8  9 10]

Unique values in 'type': ['PAYMENT' 'TRANSFER' 'CASH_OUT' 'DEBIT' 'CASH_IN']

Unique values in 'amount': [9.8396400e+03 1.8642800e+03 1.8100000e+02 ... 1.8377491e+05 8.2237170e+04
 2.0096560e+04]

Unique values in 'nameOrig': ['C1231006815' 'C1666544295' 'C1305486145' ... 'C104331851' 'C707662966'
 'C1868032458']

Unique values in 'oldbalanceOrg': [1.7013600e+05 2.1249000e+04 1.8100000e+02 ... 2.1509709e+05 1.5992900e+05
 1.1011700e+05]

Unique values in 'newbalanceOrig': [160296.36  19384.72      0.   ... 155908.34 222947.91  90020.44]

Unique values in 'nameDest': ['M1979787155' 'M2044282225' 'C553264065' ... 'M1257036576' 'M1785344556'
 'M1419201886']

Unique values in 'oldbalanceDest': [     0.    21182.    41898.   ...  39334.53  54925.05 592635.66]

Unique values in 'newbalanceDest': [     0.    40348.79 157982.12 ... 183153.72  52596.25 118762.9 ]

Unique values in 'isFraud': [0 1]

Unique values in 'isFlaggedFraud': [0]


In [190]:

# Check the number of different values in the 'nameOrig' column
num_unique_nameOrig = df['nameOrig'].nunique()
print("Number of unique values in 'nameOrig':", num_unique_nameOrig)

Number of unique values in 'nameOrig': 100000


In [191]:
# One-hot encode the 'type' column (the original column is replaced;
# drop_first=True drops the first category, CASH_IN, as the baseline)
df = pd.get_dummies(df, columns=['type'], prefix='type', drop_first=True)

# Display the first few rows of the dummified DataFrame
df.head()

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,1,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,0,0,1,0
1,1,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,0,0,1,0
2,1,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0,0,0,1
3,1,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,1,0,0,0
4,1,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,0,0,1,0


In [192]:
# Drop the columns 'isFlaggedFraud' (only has the value 0 in this subset) and 'step' (a time index, not used as a feature here)
df = df.drop(['isFlaggedFraud', 'step'], axis=1)


In [193]:
c_starting = df[df['nameDest'].str.startswith('C')]
m_starting = df[df['nameDest'].str.startswith('M')]

correlation_c = c_starting[['oldbalanceDest', 'newbalanceDest']].corr()
correlation_m = m_starting[['oldbalanceDest', 'newbalanceDest']].corr()

print("Correlation matrix for 'C' starting values:")
print(correlation_c)

print("\nCorrelation matrix for 'M' starting values:")
print(correlation_m)

Correlation matrix for 'C' starting values:
                oldbalanceDest  newbalanceDest
oldbalanceDest        1.000000        0.932956
newbalanceDest        0.932956        1.000000

Correlation matrix for 'M' starting values:
                oldbalanceDest  newbalanceDest
oldbalanceDest             NaN             NaN
newbalanceDest             NaN             NaN


In [194]:
# Create new binary columns 'NameDestC' (destination ID starts with 'C') and 'NameDestM' (starts with 'M')
df['NameDestC'] = df['nameDest'].str.startswith('C').astype(int)
df['NameDestM'] = df['nameDest'].str.startswith('M').astype(int)

# Drop the original 'nameDest' column
df.drop(columns=['nameDest'], inplace=True)

In [195]:
# Checking if all the values start with letter C in the column 'nameOrig'
all_start_with_C = df['nameOrig'].str.startswith('C').all()
print("All values start with 'C':", all_start_with_C)

All values start with 'C': True


In [196]:
# Convert nameOrig to a number by stripping the leading 'C' (let's check whether the model finds any relation to it;
# note that it is an account identifier, unique in every row)
df['nameOrig'] = df['nameOrig'].str.replace('C', '').astype(int)


In [197]:
df

Unnamed: 0,amount,nameOrig,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,NameDestC,NameDestM
0,9839.64,1231006815,170136.0,160296.36,0.00,0.00,0,0,0,1,0,0,1
1,1864.28,1666544295,21249.0,19384.72,0.00,0.00,0,0,0,1,0,0,1
2,181.00,1305486145,181.0,0.00,0.00,0.00,1,0,0,0,1,1,0
3,181.00,840083671,181.0,0.00,21182.00,0.00,1,1,0,0,0,1,0
4,11668.14,2048537720,41554.0,29885.86,0.00,0.00,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,4020.66,1410794718,159929.0,155908.34,0.00,0.00,0,0,0,1,0,0,1
99996,18345.49,744303677,6206.0,0.00,0.00,0.00,0,0,0,1,0,0,1
99997,183774.91,104331851,39173.0,222947.91,54925.05,0.00,0,0,0,0,0,1,0
99998,82237.17,707662966,6031.0,0.00,592635.66,799140.46,0,1,0,0,0,1,0


In [198]:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Select the columns you want to normalize
columns_to_normalize = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']

# Apply Min-Max Scaling to the selected columns
# (note: the scaler is fitted on the full dataset here, before the train/test split)
df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
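
Fitting the scaler on every row lets the test rows influence the learned minimum and maximum. A common alternative, sketched here on synthetic stand-in data rather than the fraud dataset, is to put the scaler and the model in a scikit-learn pipeline so that the scaler is fitted on the training rows only. The data, names, and sizes below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the fraud dataset): two numeric features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1e5, size=(1000, 2))
y_demo = (X_demo[:, 0] > 5e4).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only, then reuses the
# learned min/max when it transforms the test rows.
pipe = make_pipeline(MinMaxScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

scaler_fitted = pipe.named_steps["minmaxscaler"]
test_score = pipe.score(X_te, y_te)
print("Training minimum per feature:", scaler_fitted.data_min_)
print("Test accuracy:", test_score)
```

On a dataset this large the effect on the scores is probably small, but the pipeline form keeps the evaluation honest and also makes cross-validation safe.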


### Run a logistic regression classifier and evaluate its accuracy.

In [199]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the dataset into features (X) and target variable (y)
X = df.drop('isFraud', axis=1)
y = df['isFraud']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Make predictions using the trained model
y_pred = logreg.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9989
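
Accuracy on its own can be misleading when one class is this rare. The sketch below uses illustrative labels, not the lab's actual test split: 20,000 rows with 23 positives, which is roughly the fraud rate seen earlier. It scores a "model" that never predicts fraud and reports recall, precision, and the confusion matrix next to accuracy.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Illustrative labels only: 20,000 rows, 23 of them positive.
y_true = np.zeros(20000, dtype=int)
y_true[:23] = 1

# A "model" that never predicts fraud.
y_never = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_never)
rec = recall_score(y_true, y_never, zero_division=0)
prec = precision_score(y_true, y_never, zero_division=0)
cm = confusion_matrix(y_true, y_never)

print("Accuracy:", acc)
print("Recall:", rec)
print("Precision:", prec)
print(cm)  # rows are true classes, columns are predicted classes
```

The same four calls can be applied to `y_test` and `y_pred` from the cell above to see how many fraudulent transactions the logistic regression actually catches.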


### Now pick a model of your choice and evaluate its accuracy.

In [200]:
from sklearn.ensemble import RandomForestClassifier

# Create and train a Random Forest classifier
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)

# Evaluate its accuracy on the test set
accuracy_rf = random_forest.score(X_test, y_test)
print("Random Forest Accuracy:", accuracy_rf)


Random Forest Accuracy: 0.9994
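
With so few positive rows, two further options are worth considering: a stratified split, so that train and test keep the same fraud rate, and class weights, so that the rare class counts for more during training. The sketch below tries both on synthetic imbalanced data (not the fraud dataset) and compares the two model families on recall and average precision instead of accuracy. The data generator settings and hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data with a few percent positives.
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=10, weights=[0.98], random_state=0
)

# stratify keeps the positive rate the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    results[name] = {
        "recall": recall_score(y_te, model.predict(X_te)),
        "avg_precision": average_precision_score(y_te, scores),
    }
    print(name, results[name])
```

Which model comes out ahead depends on the data and on the metric, so these numbers say nothing about the fraud dataset itself. The point is only that the comparison is made on metrics that reflect the rare class.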


### Which model worked better and how do you know?

In [201]:
"""
The Random Forest model worked slightly better: its test accuracy was 0.9994,
compared with 0.9989 for the Logistic Regression model, so it made more correct
predictions on the held-out data.

The margin is small, though, and with only 116 fraudulent rows out of 100,000,
accuracy alone says little about how well either model detects fraud.
"""
