# Introduction to SMOTE

In this notebook, we will use the SMOTE (Synthetic Minority Oversampling Technique) algorithm to oversample the minority class in the dataset. We will then train a logistic regression model on the oversampled dataset and compare its results with those of the same model trained on the original dataset.

We will use the following packages:

- [imbalanced-learn](https://imbalanced-learn.readthedocs.io/en/stable/): a Python package to deal with imbalanced datasets
- [scikit-learn](https://scikit-learn.org/stable/): a Python package for machine learning
- [pandas](https://pandas.pydata.org/): a Python package for data manipulation and analysis
- [numpy](https://numpy.org/): a Python package for scientific computing


### Import packages

In [4]:
# Import packages
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

In [6]:
# Load the transactions data
df = pd.read_csv('/Users/jasonrobinson/Documents/Data-Engineering-Credit-Card-Transactions/data/transactions.csv')
df.head()

| | User | Card | Year | Month | Day | Time | Amount | Use Chip | Merchant Name | Merchant City | Merchant State | Zip | MCC | Errors? | Is Fraud? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 2002 | 9 | 1 | 06:21 | $134.09 | Swipe Transaction | 3527213246127876953 | La Verne | CA | 91750.0 | 5300 | NaN | No |
| 1 | 0 | 0 | 2002 | 9 | 1 | 06:42 | $38.48 | Swipe Transaction | -727612092139916043 | Monterey Park | CA | 91754.0 | 5411 | NaN | No |
| 2 | 0 | 0 | 2002 | 9 | 2 | 06:22 | $120.34 | Swipe Transaction | -727612092139916043 | Monterey Park | CA | 91754.0 | 5411 | NaN | No |
| 3 | 0 | 0 | 2002 | 9 | 2 | 17:45 | $128.95 | Swipe Transaction | 3414527459579106770 | Monterey Park | CA | 91754.0 | 5651 | NaN | No |
| 4 | 0 | 0 | 2002 | 9 | 3 | 06:23 | $104.71 | Swipe Transaction | 5817218446178736267 | La Verne | CA | 91750.0 | 5912 | NaN | No |


In [7]:
df.shape

(19963, 15)
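
SMOTE interpolates between numeric feature vectors, so string columns such as `Amount` (for example `$134.09`) and the `Is Fraud?` label need to be converted before resampling. The following is a minimal sketch of that step, not necessarily how the rest of this notebook proceeds. It uses a small hypothetical frame named `sample`, built from a few rows of the `df.head()` output above, and it assumes that fraudulent rows are labelled `Yes`, since only `No` appears in the rows shown.

```python
import pandas as pd

# A few rows copied from the df.head() output above; in the notebook the
# same steps would be applied to `df` directly. `sample` is a hypothetical name.
sample = pd.DataFrame(
    {
        "Amount": ["$134.09", "$38.48", "$120.34"],
        "Use Chip": ["Swipe Transaction"] * 3,
        "Is Fraud?": ["No", "No", "No"],
    }
)

# Strip the dollar sign so that Amount becomes a float column.
sample["Amount"] = sample["Amount"].str.replace("$", "", regex=False).astype(float)

# Encode the label as 0/1 (assumed mapping: "Yes" -> 1, "No" -> 0).
sample["Is Fraud?"] = sample["Is Fraud?"].map({"No": 0, "Yes": 1})

# Inspect the class balance; on the full dataset this shows how rare fraud is.
print(sample["Is Fraud?"].value_counts())
```

Running `value_counts()` on the full `df` is a quick way to confirm how imbalanced the two classes are before any resampling.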

Title: Dealing with Class Imbalance
Author: Jason Robinson
Date: 2022-12-25
Category: Data Science
Tags: imbalanced-learn, SMOTE, logistic regression, class imbalance
Slug: cls_imb_smote
Status: draft

In machine learning, we often encounter datasets with imbalanced classes. For example, in a dataset of credit card transactions, the vast majority of transactions are legitimate, while only a small fraction are fraudulent. A model trained on such a dataset tends to be biased towards the majority class and often fails to detect the minority class: it can reach high accuracy simply by predicting the majority class every time. This is a problem because the minority class is usually the one we are interested in detecting. In this post, we will use the SMOTE (Synthetic Minority Oversampling Technique) algorithm to oversample the minority class in the dataset. We will then train a logistic regression model on the oversampled dataset and compare its results with those of the same model trained on the original dataset.


