# Importing necessary libraries and importing dataset
Name: Divyam Malik <br>
Roll No: 102103142 <br>
Group: 3CO5

# Sampling Techniques and Model Evaluation

This Python code demonstrates the application of various sampling techniques on credit card fraud data, along with the evaluation of machine learning models using different sampling strategies. The project includes:

- **Data Preprocessing:** Normalization of the 'Amount' feature and preparation of the dataset.
- **Sampling Techniques:** Utilization of Simple Random Sampling, Systematic Sampling, Cluster Sampling, Stratified Sampling, and Bootstrap Sampling.
- **Machine Learning Models:** Evaluation of models such as Random Forest, Logistic Regression, Support Vector Machine (SVM), Decision Trees, and AdaBoost for each sampling strategy.
- **Model Evaluation:** Calculation of accuracy scores for each model to assess their performance.

The project compares how each sampling method affects model accuracy on a credit card fraud dataset whose classes are first balanced by random oversampling.


In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
from sklearn.preprocessing import normalize

In [5]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [6]:
url = "https://raw.githubusercontent.com/AnjulaMehto/Sampling_Assignment/main/Creditcard_data.csv"
df = pd.read_csv(url)

In [7]:
df.describe()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,772.0,772.0,772.0,772.0,772.0,772.0,772.0,772.0,772.0,772.0,...,772.0,772.0,772.0,772.0,772.0,772.0,772.0,772.0,772.0,772.0
mean,283.005181,-0.176963,0.217169,0.875172,0.285628,-0.005029,0.159081,0.123329,-0.057547,-0.030384,...,0.004888,-0.096995,-0.040344,-0.002501,0.114337,0.022782,0.023353,-0.017045,68.66829,0.011658
std,171.834196,1.294724,1.173401,1.031878,1.258758,1.098143,1.225682,0.852075,0.830144,0.878183,...,0.609335,0.607228,0.358724,0.621507,0.429667,0.484227,0.300934,0.278332,197.838269,0.107411
min,0.0,-6.093248,-12.114213,-5.694973,-4.657545,-6.631951,-3.498447,-4.925568,-7.494658,-2.770089,...,-4.134608,-2.776923,-3.553381,-1.867208,-1.389079,-1.243924,-2.377933,-2.735623,0.0,0.0
25%,126.5,-0.896416,-0.174684,0.308677,-0.460058,-0.534567,-0.630717,-0.296289,-0.16788,-0.517068,...,-0.213746,-0.525289,-0.176915,-0.379766,-0.166227,-0.313631,-0.047868,-0.033083,5.9875,0.0
50%,282.0,-0.382618,0.285843,0.905435,0.395919,-0.116612,-0.109581,0.116329,0.034755,-0.08227,...,-0.075802,-0.076551,-0.048353,0.091886,0.143723,-0.026414,0.023199,0.021034,16.665,0.0
75%,432.0,1.110739,0.885745,1.532969,1.117559,0.452818,0.482972,0.57539,0.252395,0.412261,...,0.095149,0.307438,0.070085,0.426339,0.425798,0.260408,0.112199,0.087023,55.5275,0.0
max,581.0,1.586093,5.267376,3.772857,4.075817,7.672544,5.122103,4.808426,2.134599,5.459274,...,5.27342,1.57475,3.150413,1.215279,1.13672,3.087444,2.490503,1.57538,3828.04,1.0


In [8]:
df.Class.value_counts()

0    763
1      9
Name: Class, dtype: int64

In [9]:
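# Scale 'Amount' to unit L2 norm, computed over the whole column
Amount = normalize([df['Amount']])[0]
df['Amount'] = Amount
# Drop the first column ('Time'); V1..V28, Amount and Class remain
df = df.iloc[:, 1:]
df.head()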

,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0.025729,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,0.000463,1
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0.065115,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0.021237,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,0.012036,0
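
For reference, `normalize` applied to a single row rescales that whole vector to unit L2 norm, so each amount is divided by the norm of the entire column rather than being mapped to a fixed range. A minimal sketch on made-up amounts (the values are illustrative and not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import normalize

# Illustrative amounts; not taken from the dataset.
amounts = np.array([3.0, 4.0, 12.0])

# normalize() works row-wise with the L2 norm by default, so wrapping the
# column in a list treats it as a single row of length len(amounts).
scaled = normalize([amounts])[0]

# Equivalent manual computation: divide by the Euclidean norm of the column.
manual = amounts / np.linalg.norm(amounts)

print(scaled)
print(manual)
```

This explains why the normalized `Amount` values in the table above are so small: each one is a fraction of the norm of all 772 amounts.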


In [10]:
x = df.drop('Class', axis=1)
y = df['Class']

In [12]:
sampler = RandomOverSampler(sampling_strategy=0.95)
x_resample, y_resample = sampler.fit_resample(x, y)
print(y_resample.value_counts())

0    763
1    724
Name: Class, dtype: int64
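
The minority count of 724 follows from `sampling_strategy=0.95`, which requests a minority-to-majority ratio of 0.95 after resampling. The arithmetic below is a sketch of that relationship; truncating to an integer is an assumption that matches the output above.

```python
# Majority class size taken from the value_counts() output earlier.
n_majority = 763
ratio = 0.95

# Target number of minority rows after oversampling (assumed to truncate).
n_minority_target = int(n_majority * ratio)

print(n_minority_target)  # 724
```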


In [13]:
resample = pd.concat([x_resample, y_resample], axis=1)
resample

,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0.025729,0
1,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,0.000463,1
2,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0.065115,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0.021237,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,0.012036,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1482,-0.928088,0.398194,1.741131,0.182673,0.966387,-0.901004,0.879016,-0.156590,-0.142117,-0.574775,...,0.066353,0.281378,-0.257966,0.385384,0.391117,-0.453853,-0.104448,-0.125765,0.000172,1
1483,-0.928088,0.398194,1.741131,0.182673,0.966387,-0.901004,0.879016,-0.156590,-0.142117,-0.574775,...,0.066353,0.281378,-0.257966,0.385384,0.391117,-0.453853,-0.104448,-0.125765,0.000172,1
1484,0.073497,0.551033,0.451890,0.114964,0.822947,0.251480,0.296319,0.139497,-0.123050,-0.142617,...,-0.128758,-0.381932,0.151012,-1.363967,-1.389079,0.075412,0.231750,0.230171,0.000170,1
1485,0.073497,0.551033,0.451890,0.114964,0.822947,0.251480,0.296319,0.139497,-0.123050,-0.142617,...,-0.128758,-0.381932,0.151012,-1.363967,-1.389079,0.075412,0.231750,0.230171,0.000170,1
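
Because random oversampling duplicates minority rows (rows 1482 and 1483 above are identical), a train/test split made after oversampling can place copies of the same row on both sides, which tends to inflate test accuracy. One common alternative, sketched below on synthetic data, is to split first and oversample only the training portion. The data, sizes, and variable names are assumptions for illustration, and `sklearn.utils.resample` stands in for `RandomOverSampler`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample as sk_resample

# Synthetic imbalanced data: 180 majority rows and 20 minority rows.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
data["Class"] = 0
data.loc[:19, "Class"] = 1  # label-based slice, so rows 0..19 inclusive

# Split first, keeping the class ratio in both parts.
train, test = train_test_split(
    data, test_size=0.2, stratify=data["Class"], random_state=42
)

# Oversample the minority class inside the training part only.
majority = train[train["Class"] == 0]
minority = train[train["Class"] == 1]
minority_up = sk_resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Class"].value_counts())
print(test["Class"].value_counts())
```

With this ordering the test rows are never duplicated into the training set, so the test score reflects unseen data.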


## Applying five sampling techniques to five models
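
Each sampling section below repeats the same split, fit, and score steps. As a sketch, those steps could be wrapped in one helper. The function name `evaluate_models` and the synthetic data in the usage lines are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(sample, target="Class", test_size=0.2, seed=42):
    """Hypothetical helper: fit the five models and return their accuracies."""
    X = sample.drop(target, axis=1)
    y = sample[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    models = {
        "Random Forest": RandomForestClassifier(random_state=seed),
        "Logistic Regression": LogisticRegression(),
        "SVM": SVC(random_state=seed),
        "Decision Trees": DecisionTreeClassifier(random_state=seed),
        "AdaBoost": AdaBoostClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Usage on synthetic data (illustrative only).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(120, 4)), columns=list("abcd"))
demo["Class"] = (demo["a"] + demo["b"] > 0).astype(int)

scores = evaluate_models(demo)
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))
```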

### Simple random sampling
Simple random sampling involves randomly selecting individuals or elements from a population, ensuring each has an equal chance of being chosen.

In [27]:
n = int((1.96*1.96 * 0.5*0.5)/(0.05**2))
SimpleSampling = resample.sample(n=n, random_state=42)
SimpleSampling.shape

X = SimpleSampling.drop('Class', axis=1)
y = SimpleSampling['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models
rf_model = RandomForestClassifier(random_state=42)
lr_model = LogisticRegression()
svm_model = SVC(random_state=42)
dt_model = DecisionTreeClassifier(random_state=42)
ada_model = AdaBoostClassifier(random_state=42)

models = [rf_model, lr_model, svm_model, dt_model, ada_model]
model_names = ['Random Forest', 'Logistic Regression', 'SVM', 'Decision Trees', 'AdaBoost']

accuracies = []

for model, name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"{name} : {accuracy:.4f}")

# Print the best model for Simple Random Sampling
best_model_idx = np.argmax(accuracies)
best_model_name = model_names[best_model_idx]
print(f"\nBest Model for Simple Random Sampling: {best_model_name} with Accuracy: {accuracies[best_model_idx]:.4f}")

Random Forest : 0.9870
Logistic Regression : 0.8701
SVM : 0.8831
Decision Trees : 0.9610
AdaBoost : 0.9481

Best Model for Simple Random Sampling: Random Forest with Accuracy: 0.9870
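
The sample size `n` in the cell above appears to follow Cochran's formula for estimating a proportion, `n = Z^2 * p * (1 - p) / E^2`, with `Z = 1.96` (95% confidence), `p = 0.5`, and margin of error `E = 0.05`. Reading `0.5*0.5` as `p * (1 - p)` is an interpretation rather than something the notebook states. A small sketch, with a hypothetical function name:

```python
def cochran_sample_size(z=1.96, p=0.5, margin=0.05):
    """Sample size for estimating a proportion, truncated to an integer."""
    return int((z ** 2) * p * (1 - p) / margin ** 2)


n = cochran_sample_size()
print(n)  # 384
```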


### Systematic Sampling
Systematic sampling involves selecting every kth element from a list or population after randomly choosing a starting point, providing a structured method for representative sampling.

In [31]:
import random

SystematicSampling = resample.sample(frac=1, random_state=42).reset_index(drop=True)

sampling_interval = 2
SystematicSample = SystematicSampling.iloc[::sampling_interval]
SystematicSample.shape

X = SystematicSample.drop('Class', axis=1)
y = SystematicSample['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models
rf_model = RandomForestClassifier(random_state=42)
lr_model = LogisticRegression()
svm_model = SVC(random_state=42)
dt_model = DecisionTreeClassifier(random_state=42)
ada_model = AdaBoostClassifier(random_state=42)

models = [rf_model, lr_model, svm_model, dt_model, ada_model]
model_names = ['Random Forest', 'Logistic Regression', 'SVM', 'Decision Trees', 'AdaBoost']

accuracies = []

for model, name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"{name} : {accuracy:.4f}")

# Print the best model for Systematic Sampling
best_model_idx = np.argmax(accuracies)
best_model_name = model_names[best_model_idx]
print(f"\nBest Model for Systematic Sampling: {best_model_name} with Accuracy: {accuracies[best_model_idx]:.4f}")

Random Forest : 1.0000
Logistic Regression : 0.8926
SVM : 0.9664
Decision Trees : 1.0000
AdaBoost : 0.9933

Best Model for Systematic Sampling: Random Forest with Accuracy: 1.0000
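
The cell above shuffles the rows and then takes every second one starting from position 0. A variant that draws the starting point at random, as in the description, could look like the sketch below. The toy `population` frame and the interval `k = 4` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy population of 20 ordered rows.
population = pd.DataFrame({"value": np.arange(20)})

k = 4                             # sampling interval
start = int(rng.integers(0, k))   # random start in [0, k)

# Take every k-th row beginning at the random start.
systematic = population.iloc[start::k]

print(start, systematic["value"].tolist())
```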


### Cluster Sampling 
Cluster sampling involves dividing a population into clusters, randomly selecting some clusters, and then including all individuals or elements within the chosen clusters in the sample.

In [32]:
from sklearn.cluster import KMeans

num_clusters = 10

kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)

clusters = kmeans.fit_predict(resample)
clusters = pd.Series(clusters)

selected_clusters = random.sample(range(num_clusters), 3)
ClusterSample = resample.loc[clusters.isin(selected_clusters)]
print(ClusterSample.shape)

X = ClusterSample.drop('Class', axis=1)
y = ClusterSample['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models
rf_model = RandomForestClassifier(random_state=42)
lr_model = LogisticRegression()
svm_model = SVC(random_state=42)
dt_model = DecisionTreeClassifier(random_state=42)
ada_model = AdaBoostClassifier(random_state=42)

models = [rf_model, lr_model, svm_model, dt_model, ada_model]
model_names = ['Random Forest', 'Logistic Regression', 'SVM', 'Decision Trees', 'AdaBoost']

accuracies = []

for model, name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"{name} : {accuracy:.4f}")

# Print the best model for Cluster Sampling
best_model_idx = np.argmax(accuracies)
best_model_name = model_names[best_model_idx]
print(f"\nBest Model for Cluster Sampling: {best_model_name} with Accuracy: {accuracies[best_model_idx]:.4f}")

(473, 30)
Random Forest : 1.0000
Logistic Regression : 0.9579
SVM : 1.0000
Decision Trees : 0.9789
AdaBoost : 1.0000

Best Model for Cluster Sampling: Random Forest with Accuracy: 1.0000
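
Note that `random.sample` is unseeded in the cell above, so the selected clusters (and therefore the 473-row sample) can change between runs. The sketch below shows the same idea with a seeded generator and with clustering on the feature columns only. The synthetic data, column names, and seed are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic stand-in for the resampled data.
data = pd.DataFrame(rng.normal(size=(300, 2)), columns=["f1", "f2"])
data["Class"] = rng.integers(0, 2, size=len(data))

# Form clusters from the features only, leaving the label out.
num_clusters = 10
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
labels = kmeans.fit_predict(data[["f1", "f2"]])

# Pick 3 distinct clusters reproducibly and keep every row they contain.
selected = rng.choice(num_clusters, size=3, replace=False)
cluster_sample = data[np.isin(labels, selected)]

print(sorted(selected.tolist()), cluster_sample.shape)
```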


### Stratified Sampling
Stratified sampling involves dividing a population into distinct subgroups or strata based on certain characteristics, and then independently sampling from each stratum to ensure representation of diverse segments within the overall sample.

In [33]:
# Sample 45% of the rows from each class so the class balance is preserved
StratifiedSampling = resample.groupby('Class')
StratifiedSample = StratifiedSampling.sample(frac=0.45)
StratifiedSample.shape

X = StratifiedSample.drop('Class', axis=1)
y = StratifiedSample['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models
rf_model = RandomForestClassifier(random_state=42)
lr_model = LogisticRegression()
svm_model = SVC(random_state=42)
dt_model = DecisionTreeClassifier(random_state=42)
ada_model = AdaBoostClassifier(random_state=42)

models = [rf_model, lr_model, svm_model, dt_model, ada_model]
model_names = ['Random Forest', 'Logistic Regression', 'SVM', 'Decision Trees', 'AdaBoost']

accuracies = []

for model, name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"{name} : {accuracy:.4f}")

# Print the best model for Stratified Sampling
best_model_idx = np.argmax(accuracies)
best_model_name = model_names[best_model_idx]
print(f"\nBest Model for Stratified Sampling: {best_model_name} with Accuracy: {accuracies[best_model_idx]:.4f}")

Random Forest : 1.0000
Logistic Regression : 0.8955
SVM : 0.9701
Decision Trees : 0.9925
AdaBoost : 0.9851

Best Model for Stratified Sampling: Random Forest with Accuracy: 1.0000
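
Sampling the same fraction from each class keeps the class proportions of the sample close to those of the population. A toy check, using a made-up 60/40 frame and a fixed `random_state` added here as an assumption for reproducibility:

```python
import numpy as np
import pandas as pd

# Made-up population: 60 rows of class 0 and 40 rows of class 1.
data = pd.DataFrame({"x": np.arange(100), "Class": [0] * 60 + [1] * 40})

# Draw half of each stratum independently.
stratified = data.groupby("Class").sample(frac=0.5, random_state=42)

counts = stratified["Class"].value_counts().sort_index().to_dict()
print(counts)  # {0: 30, 1: 20}
```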


### Bootstrap Sampling
Bootstrap sampling, or bootstrapping, is a resampling technique that involves repeatedly and randomly sampling with replacement from the observed data to create multiple simulated datasets, enabling estimation of the sampling distribution and statistical inference.

In [34]:
desired_sample_size = 400
# A single draw of len(df) rows with replacement already exceeds
# desired_sample_size, so one bootstrap sample is drawn and then truncated.
BootstrapSamples = resample.sample(n=len(df), replace=True, random_state=42)
BootstrapSamples = BootstrapSamples.iloc[:desired_sample_size, :]
print("Final Shape of Bootstrap Samples DataFrame:", BootstrapSamples.shape)

X = BootstrapSamples.drop('Class', axis=1)
y = BootstrapSamples['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models
rf_model = RandomForestClassifier(random_state=42)
lr_model = LogisticRegression()
svm_model = SVC(random_state=42)
dt_model = DecisionTreeClassifier(random_state=42)
ada_model = AdaBoostClassifier(random_state=42)

models = [rf_model, lr_model, svm_model, dt_model, ada_model]
model_names = ['Random Forest', 'Logistic Regression', 'SVM', 'Decision Trees', 'AdaBoost']

accuracies = []

for model, name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"{name} : {accuracy:.4f}")

# Print the best model for Bootstrap Sampling
best_model_idx = np.argmax(accuracies)
best_model_name = model_names[best_model_idx]
print(f"\nBest Model for Bootstrap Sampling: {best_model_name} with Accuracy: {accuracies[best_model_idx]:.4f}")

Final Shape of Bootstrap Samples DataFrame: (400, 30)
Random Forest : 1.0000
Logistic Regression : 0.9250
SVM : 0.9750
Decision Trees : 0.9625
AdaBoost : 1.0000

Best Model for Bootstrap Sampling: Random Forest with Accuracy: 1.0000
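
In the usual bootstrap workflow, each replicate is an independent draw with replacement, and a statistic is computed per replicate. The sketch below estimates the spread of a sample mean that way. The synthetic data, the number of replicates, and the choice of the mean as the statistic are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic observed data (illustrative only).
observed = rng.exponential(scale=2.0, size=200)

n_bootstrap = 1000
boot_means = np.empty(n_bootstrap)
for i in range(n_bootstrap):
    # Each replicate is a fresh draw with replacement, same size as the data.
    replicate = rng.choice(observed, size=len(observed), replace=True)
    boot_means[i] = replicate.mean()

# Percentile interval for the mean from the bootstrap distribution.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={observed.mean():.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

Reusing one generator across iterations, rather than a fixed `random_state` per draw, is what makes the replicates differ from each other.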
