The data consists of 5 columns:

* Variance of Wavelet Transformed image (continuous)
* Skewness of Wavelet Transformed image (continuous)
* Kurtosis of Wavelet Transformed image (continuous)
* Entropy of image (continuous)
* Class (integer)

The class column indicates whether or not a bank note is authentic.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


data = pd.read_csv('../input/bank-note-authentication-uci-data/BankNote_Authentication.csv')
data.head()

In [None]:
data.isnull().sum()

In [None]:
print(f"So, the data has no missing values.\n\n\nNo. of observations in our dataset is {data.shape[0]}")

In [None]:
display("Let's have a look at the relationships between the features of our data")
sns.pairplot(data,hue='class');

In [None]:
sns.countplot(x='class',data=data)
plt.title('Classes (Authentic 1 vs Fake 0)');

In [None]:
data['class'].value_counts()

Our data is not balanced. Standard classifiers such as Decision Tree and Logistic Regression tend to be biased towards the class with the higher number of instances: they can favour predicting the majority class, while the features of the minority class are treated as noise and often ignored. The minority class is therefore more likely to be misclassified than the majority class, which can affect our model badly. So, here I choose over-sampling.

Over-sampling increases the number of instances in the minority class by randomly replicating them, so that the minority class is better represented in the sample. The aim is to obtain approximately the same number of instances for both classes. I have used the `resample` function from `sklearn.utils`.

In [None]:
from sklearn.utils import resample, shuffle

# Separate the two classes
df_majority = data[data['class'] == 0]
df_minority = data[data['class'] == 1]

# Draw minority rows with replacement until both classes have the same size
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=123)
balanced_df = pd.concat([df_minority_upsampled, df_majority])
balanced_df = shuffle(balanced_df)
balanced_df['class'].value_counts()
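
Because over-sampling duplicates rows, copies of the same observation can land in both the training and the test set when the split is done afterwards. One common alternative is to split first and over-sample only the training part. The following is a minimal sketch of that idea on synthetic data, not on the banknote data; the frame `demo`, its column `feature` and all `demo_*` / `train_*` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data (hypothetical values): 60 rows of class 0, 40 of class 1
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature": rng.normal(size=100),
    "class": [0] * 60 + [1] * 40,
})

# 1. Split first, keeping the class ratio in both parts
demo_train, demo_test = train_test_split(
    demo, test_size=0.3, stratify=demo["class"], random_state=123
)

# 2. Over-sample the minority class in the training part only
train_majority = demo_train[demo_train["class"] == 0]
train_minority = demo_train[demo_train["class"] == 1]
train_minority_up = resample(
    train_minority, replace=True, n_samples=len(train_majority), random_state=123
)
demo_train_balanced = pd.concat([train_majority, train_minority_up])
demo_counts = demo_train_balanced["class"].value_counts()

# The test part keeps its original rows, none of which appear in the training part
shared_rows = set(demo_train_balanced.index) & set(demo_test.index)
print(demo_counts)
print(len(shared_rows))
```

With this ordering the test set contains only original, unduplicated rows, so the reported scores are not inflated by the model having already seen copies of the test rows.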

In [None]:
# importing libraries needed for modelling
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

*First, I apply the Random Forest algorithm from sklearn directly, and then again by creating a pipeline.*

In [None]:
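# Standardise the features (note: the scaler is fitted on all rows, before the train/test split)
features = balanced_df.drop('class', axis=1)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
df_feat = pd.DataFrame(scaled_features, columns=features.columns)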

X = df_feat
y = balanced_df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train,y_train)
rfc_preds = rfc.predict(X_test)
print(classification_report(y_test,rfc_preds))
print(confusion_matrix(y_test,rfc_preds))

In [None]:
score=cross_val_score(rfc,X,y,cv=5)
(100*score.mean()).round(2)

In [None]:
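# Fit the pipeline on the unscaled features, so that scaling happens inside the pipeline
X_raw = balanced_df.drop('class', axis=1)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_raw, y, test_size=0.3)
my_pipeline = Pipeline([('scaler', StandardScaler()), ('rfc', RandomForestClassifier())])
my_pipeline.fit(Xr_train, yr_train)
pp_preds = my_pipeline.predict(Xr_test)
print(classification_report(yr_test, pp_preds))
print(confusion_matrix(yr_test, pp_preds))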

*The benefit of creating a pipeline is that it automates the preprocessing and modelling steps. Here there were no missing values to handle; otherwise an imputation step could also be included in the pipeline. Then, when predicting on new, unseen data, we simply pass it through the pipeline and need not apply standardization and imputation to it separately.*

*A pipeline also helps to get more reliable results from k-fold cross-validation. Within each split, standardization is fitted only on the training folds and never on the held-out fold; otherwise train-test data leakage occurs.*
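
The two points above (imputation inside the pipeline, and preprocessing fitted per fold) can be sketched as follows. This is a minimal illustration on synthetic data with a few artificially missing values, not on the banknote data; `X_demo`, `y_demo`, `demo_pipeline` and the choice of a median `SimpleImputer` are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 rows, 4 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Knock out a few values to simulate missing data
X_demo[rng.integers(0, 200, size=10), rng.integers(0, 4, size=10)] = np.nan

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("rfc", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# In each of the 5 splits, imputer and scaler are fitted on the training folds only
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print((100 * demo_scores.mean()).round(2))

# Raw, unseen rows (even with missing values) go straight through the fitted pipeline
demo_pipeline.fit(X_demo, y_demo)
new_rows = np.array([[0.5, np.nan, -1.0, 2.0]])
demo_new_preds = demo_pipeline.predict(new_rows)
print(demo_new_preds)
```

Note that the pipeline receives the raw features; passing it data that has already been standardized on all rows would bring back the leakage the pipeline is meant to prevent.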

In [None]:
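# Cross-validate on the unscaled features so the scaler is refitted inside each fold
score1 = cross_val_score(my_pipeline, balanced_df.drop('class', axis=1), y, cv=5)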
(100*score1.mean()).round(2)