### Guidelines
* (✔) You are given a dataset with 2 features and a label
* (✔) Preprocess in Jupyter Notebook
    * (✔) Deal with missing values
    * (✔) Deal with categorical features
    * (✔) Build the preprocessing with the pipeline class
* (✔) Build the model with Jupyter Notebook
    * (✔) Classification Algorithm
    * (✔) Hyperparameter tuning if necessary
    * (✔) Accuracy of >= 70%
* (✔) Save the model and preprocessing as 'model.pkl' using the joblib module
* You have to build a flask-based web app
    * (✔) 2 routes
    * (✔) Gathering valid inputs from the client
    * (✔) Predicting based on the inputs and serving the prediction back
        * (✔) Load the saved model for predictions
    * (✔) Interface for the client to feed inputs and see prediction
        * (✔) Developed with a single HTML page
        * (✔) Prediction returned once the client presses "Submit"
        * (✔) Warning and wait for correct inputs if the client does not enter an input or gives the wrong inputs
    * (✔) Optional: CSS Styling
* (✔) Deploy the predictive model
    * (✔) Serve the saved model for any client to use
    * (✔) User should be able to enter in data and get a prediction
* (✔) Answer the Questions

### Grading
* WebApp runs flawlessly, as per above specification and gives predictions - 60%
* Model was built correctly, including all preprocessing using the "Pipeline" class - 15%
* Model was developed with an accuracy of 70% or above - 15%
* Questions below are answered - 10%

In [50]:
'''Import standard data science libraries'''
from matplotlib import pyplot as plt # used for plotting graphs
import pandas as pd # used for data manipulation and analysis
import numpy as np # used for numerical computing
import os # used for file handling
import joblib # used to save the model

'''Import sklearn libraries'''
from sklearn.preprocessing import StandardScaler # used for scaling data
from sklearn.preprocessing import OneHotEncoder # used for encoding data
from sklearn.model_selection import train_test_split # used for splitting dataset
from sklearn.metrics import accuracy_score # used for evaluating model
from sklearn.compose import ColumnTransformer # used to apply different preprocessing to different columns
from sklearn.pipeline import Pipeline # used to chain together different transformers
from sklearn.impute import SimpleImputer # used to fill in missing values

'''Different Classification Algorithms'''
from sklearn.ensemble import RandomForestClassifier # random forest classifier
from sklearn.svm import SVC # support vector machine classifier
from sklearn.linear_model import LogisticRegression # logistic regression classifier
from sklearn.model_selection import GridSearchCV # used for hyperparameter tuning
from sklearn.tree import DecisionTreeClassifier # decision tree classifier
from sklearn.naive_bayes import GaussianNB # naive bayes classifier


'''Import warnings module to ignore warnings'''
import warnings
warnings.filterwarnings('ignore')

In [2]:
'''Read the data'''
data = pd.read_csv('MP3_Dataset.csv') # read the data
print(f"Data:\n {data.head(10)}") # print the first 10 rows of the data

Data:
        Test Group  label
0  0.496714     C      1
1 -0.138264     C      1
2       NaN     B      0
3  1.523030     B      0
4 -0.234153     A      0
5 -0.234137     A      0
6  1.579213     B      1
7  0.767435     C      1
8 -0.469474     B      1
9       NaN     C      1


In [3]:
'''Separate the data into features and labels'''
X = data.drop('label', axis=1) # drop the label column from the data and assign the rest to X
y = data['label'] # assign the label column to y
print(f"Features:\n {X.head(10)} \n\nLabels:\n{y.head(10)}") # print the first 10 rows of the features and labels

Features:
        Test Group
0  0.496714     C
1 -0.138264     C
2       NaN     B
3  1.523030     B
4 -0.234153     A
5 -0.234137     A
6  1.579213     B
7  0.767435     C
8 -0.469474     B
9       NaN     C 

Labels:
0    1
1    1
2    0
3    0
4    0
5    0
6    1
7    1
8    1
9    1
Name: label, dtype: int64


In [4]:
'''Print the number of missing values in the data'''
print(f"Missing values in different features:\n{X.isna().sum()}") # print the number of missing values in each column

Missing values in different features:
Test     10
Group     0
dtype: int64


In [5]:
'''Separate the feature types to define which columns each transformer processes'''
numerical_feature = ['Test'] # numerical feature
categorical_feature = ['Group'] # categorical feature

In [6]:
'''Split the data into training and testing set'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # split into train and test

In [7]:
'''Build the preprocessing pipeline'''
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')), # fill in numeric missing values with the mean of the column
    ('scaler', StandardScaler()) # scale the numerical data to have a mean of 0 and a standard deviation of 1
]) # Build the numeric transformer

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # fill in categorical missing values with the mode of the column
    ('encoder', OneHotEncoder(handle_unknown='ignore')) # encode the categorical data into a one-hot encoded format
]) # Build the categorical transformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_feature), # use the numeric_transformer on the numerical_features
        ('cat', categorical_transformer, categorical_feature) # use the categorical_transformer on the categorical_features
    ]) # Combine the transformers into a single preprocessor
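To see what a preprocessor of this shape produces, here is a minimal sketch on a toy frame. The values are made up for illustration (they are not from `MP3_Dataset.csv`), and `toy`, `demo_preprocessor` and `transformed` are hypothetical names; only the column names `Test` and `Group` come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made-up values) with the same column names as the dataset
toy = pd.DataFrame({
    'Test': [1.0, np.nan, 3.0, 2.0],
    'Group': ['A', 'B', 'C', 'A'],
})

# Same structure as the preprocessor above, rebuilt so the sketch runs on its own
demo_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), ['Test']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), ['Group']),
])

transformed = demo_preprocessor.fit_transform(toy)
# ColumnTransformer can return a sparse matrix; densify it for display if so
if hasattr(transformed, 'toarray'):
    transformed = transformed.toarray()

# One scaled numeric column followed by one one-hot column per group
print(transformed.shape)
print(transformed)
```

The missing `Test` value is replaced by the column mean, so after scaling it sits at 0 in the first output column.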

### **Tuning 5 Classification Models to Various Hyperparameters**

In [8]:
'''Building a Random Forest Classifier'''
rfc = RandomForestClassifier()
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],    # Minimum number of samples required to be at a leaf node
    'max_features': ['auto', 'sqrt', 'log2']  # Number of features to consider when looking for the best split
}
grid_search_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=15, scoring='accuracy', verbose=1, n_jobs=-1)
rfc_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', grid_search_rfc)])
rfc_pipeline.fit(X_train, y_train)
print("Best Parameters: ", rfc_pipeline.named_steps['classifier'].best_params_)
print(f"Best Score: {rfc_pipeline.named_steps['classifier'].best_score_*100:.2f}%")

Fitting 15 folds for each of 324 candidates, totalling 4860 fits
Best Parameters:  {'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 50}
Best Score: 87.56%
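
In the cell above, the preprocessor is fit on all of `X_train` before `GridSearchCV` splits it into folds, so the imputer mean and scaler statistics also see each validation fold. One possible alternative, sketched here on synthetic stand-in data (not the MP3 dataset), is to pass the whole pipeline to `GridSearchCV` and prefix the parameter names with the step name. The names `demo_X`, `demo_y`, `full_pipeline` and `search`, and the small grid, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with the same column names as the notebook
rng = np.random.default_rng(0)
n = 120
demo_X = pd.DataFrame({
    'Test': rng.normal(size=n),
    'Group': rng.choice(['A', 'B', 'C'], size=n),
})
demo_y = ((demo_X['Group'] == 'C') | (demo_X['Test'] > 0.5)).astype(int)
demo_X.loc[demo_X.index % 15 == 0, 'Test'] = np.nan  # inject some missing values

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), ['Test']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), ['Group']),
])

full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# 'classifier__' routes each parameter to the pipeline step named 'classifier'
param_grid = {
    'classifier__n_estimators': [25, 50],
    'classifier__max_depth': [None, 5],
}

# The preprocessor is now refit inside every fold, on that fold's training part only
search = GridSearchCV(full_pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(demo_X, demo_y)

print("Best Parameters: ", search.best_params_)
print(f"Best Score: {search.best_score_ * 100:.2f}%")
```

With this layout, `search.best_estimator_` is already a fitted preprocessing-plus-model pipeline. Fixing `random_state` also makes the selected parameters repeatable between runs.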


In [9]:
'''Building a Support Vector Machine Classifier'''
svc = SVC()
param_grid = {
    'C': [0.1, 1, 10, 100, 1000], # Regularization parameter
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'], # Kernel type
    'degree': [2, 3, 4],  # Degree of the polynomial kernel (only for 'poly')
    'gamma': ['scale', 'auto'],  # Kernel coefficient for 'rbf', 'poly' and 'sigmoid'
    'coef0': [0.0, 0.1, 0.5, 1.0]  # Independent term in kernel function (only for 'poly' and 'sigmoid')
}
grid_search_svc = GridSearchCV(estimator=svc, param_grid=param_grid, cv=15, scoring='accuracy', verbose=1, n_jobs=-1)
svc_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', grid_search_svc)])
svc_pipeline.fit(X_train, y_train)
print("Best Parameters: ", svc_pipeline.named_steps['classifier'].best_params_)
print(f"Best Score: {svc_pipeline.named_steps['classifier'].best_score_*100:.2f}%")

Fitting 15 folds for each of 480 candidates, totalling 7200 fits
Best Parameters:  {'C': 1, 'coef0': 1.0, 'degree': 2, 'gamma': 'auto', 'kernel': 'sigmoid'}
Best Score: 79.56%


In [10]:
'''Building a Logistic Regression Classifier'''
logreg = LogisticRegression()
param_grid = {
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],  # Regularization type
    'C': [0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength
    'max_iter': [20, 50, 100, 200]  # Maximum number of iterations
}
grid_search_logreg = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=15, scoring='accuracy', verbose=1, n_jobs=-1)
logreg_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', grid_search_logreg)])
logreg_pipeline.fit(X_train, y_train)
print("Best Parameters: ", logreg_pipeline.named_steps['classifier'].best_params_)
print(f"Best Score: {logreg_pipeline.named_steps['classifier'].best_score_*100:.2f}%")

Fitting 15 folds for each of 80 candidates, totalling 1200 fits
Best Parameters:  {'C': 1, 'max_iter': 20, 'penalty': 'l2'}
Best Score: 76.00%
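
With the default `lbfgs` solver, `LogisticRegression` supports only the `l2` penalty or no penalty, so the `l1` and `elasticnet` candidates in the grid above cannot be fit. `GridSearchCV` scores failed fits as NaN by default, and the suppressed warnings hide this. A hedged sketch of a grid that pairs each penalty with a compatible solver, run on synthetic data from `make_classification` (an assumption, not the notebook's dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, used only so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=120, n_features=4, random_state=0)

# A list of dicts lets each solver be searched only with penalties it supports
param_grid = [
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.1, 1, 10]},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]},
]

search = GridSearchCV(LogisticRegression(max_iter=200), param_grid,
                      cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

# Count candidates whose fits failed (NaN mean score); none are expected here
failed = int(np.isnan(search.cv_results_['mean_test_score']).sum())
print(f"Candidates: {len(search.cv_results_['params'])}, failed: {failed}")
print("Best Parameters: ", search.best_params_)
```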


In [11]:
'''Decision Tree Classifier'''
dt = DecisionTreeClassifier()
param_grid = {
    'criterion': ['gini', 'entropy'],  # Function to measure the quality of a split
    'splitter': ['best', 'random'],    # Strategy used to choose the split at each node
    'max_depth': [None, 3, 6, 9, 18, 27, 36],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],    # Minimum number of samples required to be at a leaf node
    'max_features': [None, 'auto', 'sqrt', 'log2']  # Number of features to consider when looking for the best split
}
grid_search_dt = GridSearchCV(estimator=dt, param_grid=param_grid, cv=15, scoring='accuracy', verbose=1, n_jobs=-1)
dt_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', grid_search_dt)])
dt_pipeline.fit(X_train, y_train)
print("Best Parameters: ", dt_pipeline.named_steps['classifier'].best_params_)
print(f"Best Score: {dt_pipeline.named_steps['classifier'].best_score_*100:.2f}%")

Fitting 15 folds for each of 1008 candidates, totalling 15120 fits
Best Parameters:  {'criterion': 'gini', 'max_depth': 27, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'splitter': 'random'}
Best Score: 87.78%


In [12]:
'''Naive Bayes Classifier'''
gnb = GaussianNB()
param_grid = {
    'var_smoothing': [1e-09, 1e-08, 1e-07, 1e-06, 1e-05]  # Portion of the largest variance of all features
}
grid_search_gnb = GridSearchCV(estimator=gnb, param_grid=param_grid, cv=15, scoring='accuracy', verbose=1, n_jobs=-1)
gnb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', grid_search_gnb)])
gnb_pipeline.fit(X_train, y_train)
print("Best Parameters: ", gnb_pipeline.named_steps['classifier'].best_params_)
print(f"Best Score: {gnb_pipeline.named_steps['classifier'].best_score_*100:.2f}%")

Fitting 15 folds for each of 5 candidates, totalling 75 fits
Best Parameters:  {'var_smoothing': 1e-09}
Best Score: 74.89%


### **Testing the Classification Models on the Test Set**

In [45]:
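'''Instantiate Models With Best Parameters'''
# Note: the random forest and decision tree settings below differ from the grid-search
# output printed above; no random_state was fixed, so a rerun of the search can select different values
rfc = RandomForestClassifier(max_depth=None, max_features='log2', min_samples_leaf=4, min_samples_split=5, n_estimators=50)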
svc = SVC(C=1, coef0=1.0, degree=2, gamma='auto', kernel='sigmoid')
logreg = LogisticRegression(C=1, max_iter=20, penalty='l2')
dt = DecisionTreeClassifier(criterion='gini', max_depth=9, max_features='log2', min_samples_leaf=2, min_samples_split=5, splitter='best')
gnb = GaussianNB(var_smoothing=1e-09)

In [46]:
'''Build the model pipelines'''
# Random Forest Classifier
rfc_test_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor), # preprocess the data
    ('classifier', rfc) # classify the data using a random forest classifier
]) # Build the model

# SVM Classifier
svc_test_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor), # preprocess the data
    ('classifier', svc) # classify the data using a support vector machine classifier
]) # Build the model

# Logistic Regression Classifier
logreg_test_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor), # preprocess the data
    ('classifier', logreg) # classify the data using a logistic regression classifier
]) # Build the model

# Decision Tree Classifier
dt_test_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor), # preprocess the data
    ('classifier', dt) # classify the data using a decision tree classifier
]) # Build the model

# Gaussian Naive Bayes Classifier
gnb_test_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor), # preprocess the data
    ('classifier', gnb) # classify the data using a Gaussian naive Bayes classifier
]) # Build the model

In [47]:
'''Train and Test the Models'''
# Random Forest Classifier
rfc_test_pipeline.fit(X_train, y_train) # fit the model on the training data
y_pred_rfc = rfc_test_pipeline.predict(X_test) # predict the labels of the test data

# SVM Classifier
svc_test_pipeline.fit(X_train, y_train) # fit the model on the training data
y_pred_svm = svc_test_pipeline.predict(X_test) # predict the labels of the test data

# Logistic Regression Classifier
logreg_test_pipeline.fit(X_train, y_train) # fit the model on the training data
y_pred_logreg = logreg_test_pipeline.predict(X_test) # predict the labels of the test data

# Decision Tree Classifier
dt_test_pipeline.fit(X_train, y_train) # fit the model on the training data
y_pred_dt = dt_test_pipeline.predict(X_test) # predict the labels of the test data

# Gaussian Naive Bayes Classifier
gnb_test_pipeline.fit(X_train, y_train) # fit the model on the training data
y_pred_gnb = gnb_test_pipeline.predict(X_test) # predict the labels of the test data

In [48]:
'''Evaluate the models'''
# Random Forest Classifier
rfc_acc = accuracy_score(y_test, y_pred_rfc) # calculate the accuracy of the model
print(f"Random Forest Accuracy: {rfc_acc}") # print the accuracy of the model

# SVM Classifier
svm_acc = accuracy_score(y_test, y_pred_svm) # calculate the accuracy of the model
print(f"SVM Accuracy: {svm_acc}") # print the accuracy of the model

# Logistic Regression Classifier
lr_acc = accuracy_score(y_test, y_pred_logreg) # calculate the accuracy of the model
print(f"Logistic Regression Accuracy: {lr_acc}") # print the accuracy of the model

# Decision Tree Classifier
dt_acc = accuracy_score(y_test, y_pred_dt) # calculate the accuracy of the model
print(f"Decision Tree Accuracy: {dt_acc}") # print the accuracy of the model

# Gaussian Naive Bayes Classifier
gnb_acc = accuracy_score(y_test, y_pred_gnb) # calculate the accuracy of the model
print(f"Gaussian Naive Bayes Accuracy: {gnb_acc}") # print the accuracy of the model

Random Forest Accuracy: 0.7
SVM Accuracy: 0.65
Logistic Regression Accuracy: 0.7
Decision Tree Accuracy: 0.65
Gaussian Naive Bayes Accuracy: 0.75


* #### Looks like Gaussian Naive Bayes performs the best of the five models
* #### It has a test accuracy of 0.75, which is above the 0.70 threshold
* #### So let's build our model around it!
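
To put cross-validation accuracy and held-out test accuracy side by side for a Gaussian Naive Bayes pipeline, a minimal sketch follows. It uses synthetic stand-in data rather than the MP3 dataset, and `demo_X`, `demo_y`, `demo_pipeline`, the 20% label noise and the 5-fold setting are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: the label follows Group 'C', flipped for about 20% of rows
rng = np.random.default_rng(42)
n = 100
demo_X = pd.DataFrame({
    'Test': rng.normal(size=n),
    'Group': rng.choice(['A', 'B', 'C'], size=n),
})
noise = rng.random(n) < 0.2
demo_y = np.where((demo_X['Group'] == 'C').to_numpy() ^ noise, 1, 0)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y)

demo_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['Test']),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ]), ['Group']),
    ], sparse_threshold=0)),  # force dense output, which GaussianNB requires
    ('classifier', GaussianNB()),
])

# Cross-validated accuracy on the training split only
cv_scores = cross_val_score(demo_pipeline, X_tr, y_tr, cv=5, scoring='accuracy')

# Accuracy on the held-out test split after fitting on the full training split
demo_pipeline.fit(X_tr, y_tr)
test_acc = accuracy_score(y_te, demo_pipeline.predict(X_te))

print(f"CV accuracy:   {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")
print(f"Test accuracy: {test_acc:.2f}")
```

The fold-to-fold spread gives a rough sense of how much a single test-set score on a small split can move by chance.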

In [49]:
'''Save the model with the best accuracy as model.pkl'''
joblib.dump(gnb_test_pipeline, 'model.pkl') # save the model as model.pkl

['model.pkl']
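
Because the saved object is the whole pipeline, loading it back gives preprocessing and prediction in one call. A minimal round-trip sketch, assuming a small synthetic training set and a temporary directory so it runs on its own (`demo_model`, `loaded` and `new_rows` are hypothetical names):

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in training data (not the MP3 dataset)
rng = np.random.default_rng(1)
train = pd.DataFrame({
    'Test': rng.normal(size=60),
    'Group': rng.choice(['A', 'B', 'C'], size=60),
})
labels = (train['Group'] == 'C').astype(int)

demo_model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['Test']),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ]), ['Group']),
    ], sparse_threshold=0)),  # dense output for GaussianNB
    ('classifier', GaussianNB()),
]).fit(train, labels)

# New inputs must be a DataFrame with the same column names used in training
new_rows = pd.DataFrame({'Test': [0.25, np.nan], 'Group': ['C', 'B']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    joblib.dump(demo_model, path)   # save preprocessing + classifier together
    loaded = joblib.load(path)      # what a serving app would do at startup
    prediction = loaded.predict(new_rows)

print(prediction)
```

The second row has a missing `Test` value, which the loaded pipeline's imputer fills before the classifier sees it.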

## Questions
1. What did you specifically do to get an accuracy of 70% or above for the model?
* Ans) I first listed all of the classification models we covered during the year. I then picked the 5 that the instructor had done examples for (Random Forest, Logistic Regression, Decision Tree, Naive Bayes, and SVM). I looked at the documentation for the models in sklearn's docs to determine the most common hyperparameters for each of them. I then plugged a selection of the hyperparameters into GridSearchCV to find the best hyperparameters for each model and also to see their cross-validation accuracy on the training data. After I was done, I added the models into the pipeline and evaluated them on the test data. From this, I found that Gaussian Naive Bayes performed the best with a test accuracy of 75%. Interestingly, this model had the lowest cross-validation accuracy (74.89%), although that was still above 70%. The models that scored highest in cross-validation (Decision Tree at 87.78% and Random Forest at 87.56%) dropped to 65% and 70% on the test set, which points to overfitting. This shows that there is a trade-off between validation and test performance, so it is necessary to weigh both to avoid overfitting and underfitting.

2. What were the challenges you faced and how did you address them / solve them in building the flask-based app?
* Ans) The main challenge I faced when building the flask app was issuing warnings to the user. For this, I threw ValueError exceptions if the user entered the wrong information. Any other exceptions are printed as a stack trace to the user. This relates to another challenge: I did not know if I needed to define ranges for user input of the "Test Value", since the model is only trained on a certain range. I eventually decided to limit the input to the lowest and highest values the model was trained on, rounded to whole numbers, so that the app reflects the model's scope.

3. What were your key takeaways / learnings from this course?
* Ans) From this course, my biggest takeaway was learning preprocessing techniques. Models are easy, as they are simply plug and chug with some parameter or hyperparameter tuning from looking at the documentation. However, understanding the preprocessing that goes into making data ready is perhaps the most challenging part of this course and is where I learned the most. From TF-IDF vectorizers, to tensors, to pipelines, to lemmatization, to one-hot encoding, and so much more, preprocessing is definitely the key takeaway from this course.
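
The input validation described in the answer to question 2 could look roughly like the sketch below. It is not the app's actual code: `parse_inputs`, the form field names `test` and `group`, the bounds `TEST_MIN` and `TEST_MAX`, and the set of valid groups are all assumptions for illustration. It is written as a plain function so it runs without Flask; inside a route, `request.form` could be passed in and the `ValueError` message shown as the warning.

```python
# Hypothetical bounds standing in for the rounded min/max of the training data
TEST_MIN, TEST_MAX = -3.0, 3.0
# Assumed from the groups visible in the data preview
VALID_GROUPS = {'A', 'B', 'C'}


def parse_inputs(form):
    """Validate raw form values and return (test_value, group).

    Raises ValueError with a user-facing message when an input is
    missing, not numeric, out of range, or not a known group.
    """
    raw_test = (form.get('test') or '').strip()
    raw_group = (form.get('group') or '').strip().upper()

    if not raw_test:
        raise ValueError('Please enter a Test value.')
    try:
        test_value = float(raw_test)
    except ValueError:
        raise ValueError('Test value must be a number.') from None

    # NaN and infinities fail this comparison too, so they are rejected here
    if not TEST_MIN <= test_value <= TEST_MAX:
        raise ValueError(f'Test value must be between {TEST_MIN} and {TEST_MAX}.')

    if raw_group not in VALID_GROUPS:
        raise ValueError(f'Group must be one of {sorted(VALID_GROUPS)}.')

    return test_value, raw_group


# Usage: a valid submission, then an invalid one
print(parse_inputs({'test': '0.5', 'group': 'b'}))
try:
    parse_inputs({'test': 'abc', 'group': 'A'})
except ValueError as err:
    print(f'Warning shown to the user: {err}')
```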