# Basic Pipeline

A basic machine learning pipeline includes:
1. Understand the problem
2. Explore the data
3. Pre-process (clean) the data
4. Divide the data into training set and testing set
5. Train a machine learning classifier
6. Test the performance of the trained model

In this practical, we will go through a toy example to get an intuition for these key points.

## 1. Understand the problem

This step includes identifying the problem. Is it a regression or a classification problem? If it is a classification problem, do you want to perform supervised or unsupervised classification? Do you already know the true classes of the given data, or do you want to find the intrinsic clusters in the data and then draw an inference?

### Look at the data

In [None]:
import pandas as pd
data = pd.read_csv('./data/data.csv')
# Print the shape explicitly: a cell only displays the value of its last expression
print(data.shape)
data.head()

Is this a classification problem? 

## 2. Explore the data 

### Check for missing values

In [None]:
data.isnull().sum()

None of the columns contains any missing values.

### Check summary statistics

In [None]:
data.describe()

Do you see anything odd in this data?

### Visualize this data 

Visualizing the data, whenever possible, not only helps to identify outliers/anomalies but also gives an intuition of the underlying decision boundary that best separates the two classes.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x="Feature1", y="Feature2", hue="Class", data=data,
                edgecolor="black", markers=["s", "^"], style="Class")

Which decision boundary would best separate the two classes?

How does adding more noise affect the ease of finding a decision boundary?

## 3. Pre-process the data 

Are there any outliers in the plot above?
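Besides looking at the plot, outlier candidates can also be flagged numerically. The following is only a minimal sketch of one common heuristic, the 1.5 x IQR rule, and not necessarily the method intended for this practical. The `toy` data frame and its values are made up for illustration; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy data (made-up values); the last row is an obvious outlier in Feature1
toy = pd.DataFrame({"Feature1": [1.0, 1.2, 0.9, 1.1, 1.3, 9.0],
                    "Feature2": [2.0, 2.1, 1.9, 2.2, 2.0, 2.1]})

# Per-column quartiles and interquartile range
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Flag a row if any feature lies more than 1.5 * IQR outside the quartiles
is_outlier = ((toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)).any(axis=1)
print(toy[is_outlier])
```

Whether a flagged point should actually be removed depends on the problem: it may be a measurement error, or it may be a rare but genuine observation.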

## 4. Divide the data into training and testing

In this example, we will use 75% of the data for training the classifier and the remaining 25% for testing it.

In [None]:
from sklearn.model_selection import train_test_split

# Define a random state for reproducibility of results
randomState = 123
# Play around with the split size to see how increasing or decreasing
# the training data size affects the classifier performance in the next steps
testSize = 0.25
dataTrain, dataTest = train_test_split(data, test_size=testSize, random_state=randomState)

# Check that the splitting worked perfectly
print(dataTrain.shape, dataTest.shape)

### Extract features and classes

In [None]:
yTrain = dataTrain.pop('Class')
xTrain = dataTrain

yTest = dataTest.pop('Class')
xTest = dataTest

print(xTrain.shape, yTrain.shape, xTest.shape, yTest.shape)

## 5. Train a machine learning classifier 

Only use the training set for this analysis!! To test the performance of the classifier on real-world data in a fair way, the testing set must stay strictly unseen by the classifier, so we exclude it completely during training. The classifier learns to predict the classes from the training data, and we then use the testing data to check how well it performs on new, unseen data.

To evaluate the performance of the classifier, we will compute the $accuracy$ of predictions, defined as:

$$\text{Accuracy} = \frac{\text{Number of data points correctly predicted}}{\text{Total number of data points}}$$

Several other metrics exist to evaluate the performance of a classifier, but they are outside the scope of this practical.
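As a quick check of the accuracy definition, it can be computed by hand and compared with scikit-learn's `accuracy_score`. This is only an illustrative sketch: the labels `y_true` and `y_pred` below are made up and do not come from the practical's data set.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels (made up for illustration)
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# Accuracy by hand: correctly predicted points divided by total points
manual_accuracy = np.sum(y_true == y_pred) / len(y_true)

# The same quantity computed by scikit-learn
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```

Here 6 of the 8 predictions match, so both values are 0.75. For scikit-learn classifiers, `clf.score(X, y)` returns this same mean accuracy.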

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Initializing classifiers
clf1 = LogisticRegression(random_state=randomState, solver='lbfgs')
clf2 = RandomForestClassifier(random_state=randomState, n_estimators=100)
clf3 = GaussianNB()
clf4 = SVC(gamma='auto')

In [None]:
import matplotlib.gridspec as gridspec
import itertools
from mlxtend.plotting import plot_decision_regions
import numpy as np

# Training 4 different classifiers to analyze their performance on the training data
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10,8))
labels = ['Logistic Regression', 'Random Forest', 'Naive Bayes', 'SVM']
for clf, lab, grd in zip([clf1, clf2, clf3, clf4],
                         labels,
                         itertools.product([0, 1], repeat=2)):
    clf.fit(xTrain, yTrain)
    
    # Plot the decision boundaries for the trained classifier
    ax = plt.subplot(gs[grd[0], grd[1]])
    # Use the built-in int: abstract NumPy types such as np.integer are not valid dtypes for astype
    fig = plot_decision_regions(X=xTrain.values, y=yTrain.values.astype(int), clf=clf, legend=2)
    # Print the classification score (training accuracy) in the title
    score = clf.score(xTrain, yTrain)
    plt.title('%s, Accuracy = %1.3f' % (lab, score))

plt.show()


Which classifier gives the best decision boundary?

## 6. Test the performance of the trained model

Only use the testing set for this analysis!! Because your classifier was trained on the training set, using the same data again to test its performance would give overly optimistic results and defeat the purpose of a "test".
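The gap between training and testing accuracy can be seen in a small experiment. The sketch below is an assumption-laden illustration rather than part of the practical: it uses synthetic data from `make_classification` (with some label noise via `flip_y`) instead of `data.csv`, and a random forest as an example of a flexible classifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-feature, two-class data standing in for the practical's data (assumption)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=123)

# Same 75% / 25% split as used in this practical
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123)

forest = RandomForestClassifier(n_estimators=100, random_state=123)
forest.fit(X_train, y_train)

# Accuracy on data the model has already seen vs. data it has never seen
train_accuracy = forest.score(X_train, y_train)
test_accuracy = forest.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

The training accuracy should come out noticeably higher than the testing accuracy, because the forest can partly memorize the noisy training labels.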

In [None]:
bestClassifier = None  # Replace None with the best classifier found above (clf1, clf2, clf3 or clf4)
yPredict = bestClassifier.predict(xTest)
ax = plt.subplot()
fig = plot_decision_regions(X=xTest.values, y=yTest.values.astype(int), clf=bestClassifier, legend=2)
plt.title("Best classifier")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

How does the "best classifier" according to our training data perform on the new data?

In [None]:
#Print the metrics
from sklearn.metrics import classification_report
report = classification_report(yTest,yPredict)
print(report)

Depending on the problem at hand, it can make sense to look at performance metrics other than accuracy, such as the precision and recall shown in the report above.
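One situation where accuracy alone can mislead is class imbalance. The following sketch uses made-up labels (not the practical's data) in which class 1 is rare, and compares accuracy with precision and recall from `sklearn.metrics`.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical imbalanced labels: 8 points of class 0, 2 points of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A classifier that misses one of the two class-1 points
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels the confusion matrix unravels to (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(accuracy, precision, recall)
```

The accuracy is 0.9, yet the recall for class 1 is only 0.5: half of the rare class is missed, which the accuracy on its own does not reveal.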

# Advanced topics

If this has been exciting so far, it is worth checking out the following concepts (in no order of importance, i.e. they are all important to know!!) for streamlining your pipeline according to the given data and the problem at hand.
1. Performance metrics
2. Bias vs Variance
3. Cross-validation
4. Available classifiers
5. Hyperparameter tuning
6. Dimensionality reduction
7. Regularization
8. Categorical vs continuous data
9. Loss functions
10. Data carpentry