<h1><center>Over Sampling and its effects on Model Metrics</center></h1>

We will explore the impact of sampling on accuracy, precision and recall.

We will use 3 sampling techniques:
* Over Sampling 
* Under Sampling
* SMOTE (Synthetic Minority Over-sampling Technique)

We will see which one has the most impact on the metrics. We will also look at when sampling is needed and whether it really helps.
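
To see why accuracy alone is not enough on imbalanced data, here is a minimal sketch in plain Python. The 9:1 toy labels and the always-negative "model" are assumptions made up for illustration; they are not the notebook's dataset or any of its models.

```python
# Toy labels with an assumed 9:1 imbalance (illustrative only, not the notebook's data).
y_true = [0] * 90 + [1] * 10
# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

def confusion_counts(y_true, y_pred, pos_label=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    tn = sum(1 for t, p in pairs if t != pos_label and p != pos_label)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when there are no positive labels or predictions.
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print("accuracy:", accuracy, "recall:", recall, "precision:", precision)
```

Accuracy looks healthy here while recall is zero, which is why this notebook tracks recall and precision alongside accuracy.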

Here, we use a simple dataset with two independent columns, no outliers and no null values.

In [None]:
import pandas as pd
import numpy as np

In [None]:
data=pd.read_csv("/kaggle/input/social-network-ads/Social_Network_Ads.csv")
data.head()

In [None]:
data.Purchased.value_counts()

In [None]:
data.info()

Let's look at the importance of class imbalance

In [None]:
import seaborn as sns 
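# distplot is deprecated in recent seaborn releases; histplot with kde=True gives a comparable plot
sns.histplot(data.EstimatedSalary, kde=True)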

In [None]:
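# distplot is deprecated in recent seaborn releases; histplot with kde=True gives a comparable plot
sns.histplot(data.Age, kde=True)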

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.boxplot(x=data.EstimatedSalary)

In [None]:
from sklearn.model_selection import train_test_split
X=data.drop("Purchased",axis=1)
y=data.Purchased
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=300)

In [None]:
samplingdf=pd.DataFrame(columns=["Model","Sampling","Accuracy","Recall","Precision"])

In [None]:
def Model_pipeline(X_train,X_test,y_train,y_test,sampling,samplingdf):    
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_trainstand=scaler.fit(X_train).transform(X_train)
    X_teststand=scaler.transform(X_test)
    #classification Metrics
    def metrics(clf,model,sampling,samplingdf):
        print("Model Type:",model,sampling)
        from sklearn.metrics import accuracy_score
        from sklearn.metrics import recall_score
        from sklearn.metrics import precision_score
        y_pred_test = clf.predict(X_teststand)
        print("Accuracy for Test set:")
        print(accuracy_score(y_test,y_pred_test)) 
        print("\n")
        print("Recall for Test set:")
        print(recall_score(y_test,y_pred_test,pos_label=1))
        print("\n")
        print("Precision for Test set:")
        print(precision_score(y_test,y_pred_test,pos_label=1))
        print("\n")
        print("------------------------------------------------------------------------------")
        # Record this run as one row; DataFrame.append was removed in pandas 2.0, so use pd.concat
        row=pd.DataFrame([[model,sampling,accuracy_score(y_test,y_pred_test),
                recall_score(y_test,y_pred_test,pos_label=1),precision_score(y_test,y_pred_test,pos_label=1)]], columns=samplingdf.columns)
        samplingdf=pd.concat([samplingdf,row],ignore_index=True)
        return samplingdf
    
    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression(random_state=0).fit(X_trainstand, y_train)
    samplingdf=metrics(clf,"Logistic Regression",sampling,samplingdf)
    
    from sklearn.tree import DecisionTreeClassifier
    clf = DecisionTreeClassifier(random_state=0).fit(X_trainstand,y_train)
    samplingdf=metrics(clf,"Decision Tree",sampling,samplingdf)
    
    #GBM
    from sklearn.ensemble import GradientBoostingClassifier
    clf=GradientBoostingClassifier(random_state=0).fit(X_trainstand,y_train)
    samplingdf=metrics(clf,"Gradient Booster",sampling,samplingdf)
    
    #RandomForest
    from sklearn.ensemble import RandomForestClassifier
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_trainstand, y_train)
    samplingdf=metrics(clf,"Random Forest",sampling,samplingdf)
    
    #XGBoost
    from xgboost import XGBClassifier
    XGB_model = XGBClassifier(learning_rate=0.05)
    XGB_model.fit(X_trainstand, y_train)
    # Score the XGBoost model itself; passing clf here would re-score the random forest
    samplingdf=metrics(XGB_model,"XGBoost",sampling,samplingdf)
    
    #SVM
    from sklearn.svm import SVC
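    # SVC defaults to the RBF kernel, so a linear kernel is set here to keep this run distinct from the kernel SVM below
    clf = SVC(kernel="linear", random_state=0)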
    clf.fit(X_trainstand, y_train)
    samplingdf=metrics(clf,"SVM normal",sampling,samplingdf)
    
    #kernel SVM (RBF)
    clf = SVC(kernel="rbf")
    clf.fit(X_trainstand, y_train)
    samplingdf=metrics(clf,"SVM Kernel",sampling,samplingdf)
    
    return samplingdf

In [None]:
samplingdf=Model_pipeline(X_train,X_test,y_train,y_test,"No sampling",samplingdf)

In [None]:
samplingdf

# Over Sampling

In [None]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=300)
X_train, y_train = ros.fit_resample(X_train, y_train)

In [None]:
samplingdf=Model_pipeline(X_train,X_test,y_train,y_test,"Over sampling",samplingdf)
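
As a rough picture of what random over sampling does, the sketch below duplicates randomly chosen minority rows until the classes are the same size. It is a simplified stand-in written for illustration, not imblearn's implementation; `random_oversample` and the toy rows are hypothetical names and values.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until it matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Sample with replacement, so the same minority row can be copied more than once.
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out

# Hypothetical toy rows in the shape [Age, EstimatedSalary].
X_toy = [[25, 40000], [30, 52000], [35, 61000], [40, 70000], [48, 90000], [52, 110000]]
y_toy = [0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X_toy, y_toy)
print(Counter(y_res))
```

No new information is created: the added rows are exact copies, which is one reason to resample only the training split, as done above.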

# Under sampling

In [None]:
from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=300)
X_train, y_train = cc.fit_resample(X_train, y_train)

In [None]:
samplingdf=Model_pipeline(X_train,X_test,y_train,y_test,"Under sampling",samplingdf)
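
`ClusterCentroids` under samples by clustering the majority class and keeping the cluster centroids in place of the original rows. The sketch below illustrates that idea with a small hand-written k-means in plain Python. It is an assumption-laden simplification (evenly spaced initial centroids, a fixed number of iterations, toy 2-D points), and `kmeans_centroids` and `centroid_undersample` are hypothetical helper names, not imblearn functions.

```python
def kmeans_centroids(points, k, iters=20):
    """Plain Lloyd's algorithm; initial centroids are evenly spaced picks from the data (a simplification)."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def centroid_undersample(X, y, majority_label, minority_label):
    """Replace the majority class with as many centroids as there are minority rows."""
    majority = [x for x, lab in zip(X, y) if lab == majority_label]
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    centroids = kmeans_centroids(majority, k=len(minority))
    X_out = centroids + minority
    y_out = [majority_label] * len(centroids) + [minority_label] * len(minority)
    return X_out, y_out

# Hypothetical toy data: six majority points in two clumps, two minority points.
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [5.0, 5.0], [5.2, 4.8], [4.8, 5.2],
         [3.0, 3.0], [3.2, 2.9]]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = centroid_undersample(X_toy, y_toy, majority_label=0, minority_label=1)
print(X_res, y_res)
```

Unlike random under sampling, the retained majority rows are synthetic averages rather than real observations.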

# SMOTE

In [None]:
from imblearn.over_sampling import SMOTE 
sm = SMOTE(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=300)
X_train, y_train = sm.fit_resample(X_train, y_train)

In [None]:
samplingdf=Model_pipeline(X_train,X_test,y_train,y_test,"SMOTE",samplingdf)
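
The core idea of SMOTE is to create new minority points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step in plain Python. `smote_sketch`, the neighbour count `k=2` and the toy points are assumptions for illustration; the library version differs in details such as its default number of neighbours.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points on line segments between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of base, excluding base itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority rows in the shape [Age, EstimatedSalary in thousands].
minority_toy = [[45.0, 90.0], [48.0, 95.0], [52.0, 110.0], [55.0, 120.0]]

synthetic = smote_sketch(minority_toy, n_new=4)
print(synthetic)
```

Because each synthetic point lies between two existing minority points, SMOTE adds variety that plain duplication does not, while staying inside the region the minority class already occupies.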

In [None]:
samplingdf

In [None]:
samplingdf.sort_values(by=["Model","Sampling"])

# Analysing the Results

In [None]:
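# numeric_only=True skips the text "Model" column, which recent pandas versions refuse to average
samplingdf.groupby("Sampling").mean(numeric_only=True)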

In [None]:
import seaborn as sns 

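# Median of each metric per sampling method; numeric_only=True skips the text "Model" column
sns.barplot(data=samplingdf.groupby("Sampling").median(numeric_only=True).reset_index(),x="Sampling",y="Accuracy")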


In [None]:
sns.barplot(data=samplingdf.groupby("Sampling").median(numeric_only=True).reset_index(),x="Sampling",y="Recall")

In [None]:
sns.barplot(data=samplingdf.groupby("Sampling").median(numeric_only=True).reset_index(),x="Sampling",y="Precision")

Considering the aggregate across all models, sampling does not have much of an effect on accuracy: the difference is barely 1%.

However, over sampling and SMOTE in particular increase recall considerably, with a gain of over 6% for SMOTE.

But we have to keep in mind that precision is affected significantly as we upsample, with a reduction of more than 3-4%. This might not be important in some cases; a cancer prediction, for example, depends on recall. However, it is important in applications such as YouTube recommendations.
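
One way to see why the two metrics tend to move in opposite directions is to write them in terms of true positives (TP), false positives (FP) and false negatives (FN). These are the standard definitions, not something specific to this notebook:

$$
\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}
$$

A model trained on a rebalanced set tends to predict the positive class more often. That usually turns some FN into TP, which raises recall, but it can also add FP, which lowers precision.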

## Let's look at how it affects weak models

In [None]:
samplingdf[samplingdf.Model=="Logistic Regression"]

In [None]:
sns.barplot(data=samplingdf[samplingdf.Model=="Logistic Regression"],x="Sampling",y="Accuracy")

There doesn't seem to be much effect on accuracy: over sampling gives a 1%-2% increase, which I would not consider significant.

In [None]:
sns.barplot(data=samplingdf[samplingdf.Model=="Logistic Regression"],x="Sampling",y="Recall")

Recall has increased immensely due to sampling, with a 40-45% increase; any sampling definitely increases recall. Over sampling and under sampling seem to give the same recall.

In [None]:
sns.barplot(data=samplingdf[samplingdf.Model=="Logistic Regression"],x="Sampling",y="Precision")

Precision has taken a hit, with a 10-12% decrease due to sampling.

## How about Strong Models?

In [None]:
samplingdf[samplingdf.Model=="XGBoost"]

In [None]:
sns.barplot(data=samplingdf[samplingdf.Model=="XGBoost"],x="Sampling",y="Accuracy")

In [None]:
sns.barplot(data=samplingdf[samplingdf.Model=="XGBoost"],x="Sampling",y="Recall")

In [None]:
sns.barplot(data=samplingdf[samplingdf.Model=="XGBoost"],x="Sampling",y="Precision")

From the above charts we see that for strong models:
* Accuracy increases, but barely, and only when using SMOTE
* Recall increases significantly, especially when using SMOTE
* Precision decreases, especially when under sampling, but the decrease is insignificant when using SMOTE

# Conclusion

### From all the above, I have come to the conclusion that sampling is extremely effective in increasing recall, which is a popular metric. However, when precision matters, the safest bet is not to use any sampling.

### I believe that the best sampling technique to use in any circumstance is SMOTE, since it has the highest increase in recall in most cases and the lowest decrease in precision.