> **tomato juice dataset**
<br>` 'quality' is the target feature for classification `
<br>` the other features are chemical properties of our product `

**Import the main libraries**

In [None]:
import numpy as np
import pandas as pd

from time import time

_import the local library_

In [None]:
# add parent folder path where lib folder is
import sys
if ".." not in sys.path:
    sys.path.insert(0, "..")

In [None]:
from mylib import show_labels_dist, show_metrics, bias_var_metrics

**Import the Dataset**

In [None]:
## file path: windows style
df = pd.read_csv('..\\datasets\\tomatjus.csv')

## file path: unix style
#df = pd.read_csv('../datasets/tomatjus.csv')

# shape method gives the dimensions of the dataset
print('Dataset dimensions: {} rows, {} columns'.format(
    df.shape[0], df.shape[1]))

In [None]:
df.info()

***
**Data Preparation and EDA** (unique to this dataset)
* _Check for missing values_
* _Quick visual check of unique values_
* _Split the classification feature out of the dataset_
* _Check column names of categorical attributes ( for get_dummies() )_
* _Check column names of numeric attributes ( for Scaling )_

**_Let's skip the checking_**

**<br>Classification target feature**
<br>"the Right Answers", or more formally "the desired outcome"
<br>Must be in a separate dataset for classification.

_Make it a multi-class problem, using text labels_

In [None]:
##  divide into classes by giving a range for quality
##  Make it a multi-class problem: {3,4,5} {6} {7,8}
bins = (2, 5, 6, 8)
group_names = ['Average', 'Premium', 'Special']
df['quality'] = pd.cut(df['quality'], bins = bins, labels = group_names)
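
To make the bin edges concrete, here is a small self-contained sketch. The scores are made up for illustration and are not taken from the dataset. With the default `right=True`, `pd.cut` treats each interval as open on the left and closed on the right, so the edges `(2, 5, 6, 8)` give (2, 5], (5, 6] and (6, 8].

```python
import pandas as pd

# Toy scores covering every value the bins are meant to handle
scores = pd.Series([3, 4, 5, 6, 7, 8])

bins = (2, 5, 6, 8)
group_names = ['Average', 'Premium', 'Special']

# right=True (the default) closes each interval on the right:
# (2, 5] -> Average, (5, 6] -> Premium, (6, 8] -> Special
labels = pd.cut(scores, bins=bins, labels=group_names)

print(dict(zip(scores, labels)))

# Scores outside (2, 8] are not assigned to any bin and become NaN
outside = pd.cut(pd.Series([2, 9]), bins=bins, labels=group_names)
print(outside.isna().tolist())
```

Any quality score outside the outer edges would turn into a missing label, so it is worth checking the range of the column before binning.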

* Split the classification feature out of the dataset 

In [None]:
## Feature being predicted ("the Right Answer")
labels_col = 'quality'
y = df[labels_col]

## Features used for prediction 
# pandas has a lot of rules about returning a 'view' vs. a copy from slice
# so we force it to create a new dataframe 
X = df.copy()
X.drop(labels_col, axis=1, inplace=True)

***
**<br>Create Test // Train Datasets**
> Split X and y datasets into Train and Test subsets,<br>keeping relative proportions of each class (stratify)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=50, 
                                                    stratify=y)
# train_test_split does random selection, 
#      so we should reset the dataframe indexes
X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)
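
The effect of `stratify` is easiest to see on a small example. The sketch below uses made-up labels (60% A, 30% B, 10% C) and a placeholder feature column, and compares the class proportions in the two subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 60 A, 30 B, 10 C (made-up data)
y_toy = pd.Series(['A'] * 60 + ['B'] * 30 + ['C'] * 10)
X_toy = pd.DataFrame({'feature': range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=50, stratify=y_toy)

# Class proportions in each subset should match the full dataset
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({'train': train_share, 'test': test_share}))
```

Without `stratify`, a rare class could end up over- or under-represented in the test subset purely by chance.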

**<br>Scaling** comes _after_ test // train split

In [None]:
numeri = X.select_dtypes(include=['float64','int64']).columns
print(numeri.to_list())

In [None]:
# scaling the Numeric columns 
# StandardScaler: zero mean, unit variance (values are not bounded)
# MinMaxScaler: range zero to 1 (on the data it was fitted to)

# from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# sklearn docs say 
#   "Don't cheat - fit only on training data, then transform both"
#   fit() expects 2D array: reshape(-1, 1) for single col or (1, -1) single row

for i in numeri:
    arr = np.array(X_train[i])
    scale = MinMaxScaler().fit(arr.reshape(-1, 1))
    X_train[i] = scale.transform(arr.reshape(len(arr),1))

    arr = np.array(X_test[i])
    X_test[i] = scale.transform(arr.reshape(len(arr),1))  
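
As a minimal sketch of the "fit on train, transform both" rule, the example below fits a single `MinMaxScaler` on all numeric columns at once instead of looping over them. The data and the column names (`acidity`, `sugar`) are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up numeric data; column names are illustrative only
train = pd.DataFrame({'acidity': [1.0, 3.0, 5.0], 'sugar': [10.0, 20.0, 30.0]})
test = pd.DataFrame({'acidity': [0.0, 6.0], 'sugar': [15.0, 25.0]})

cols = train.columns

# fit() learns each column's min and max from the training data only
scaler = MinMaxScaler().fit(train[cols])

# transform() reuses those training statistics for both subsets
train_scaled = pd.DataFrame(scaler.transform(train[cols]), columns=cols)
test_scaled = pd.DataFrame(scaler.transform(test[cols]), columns=cols)

print(train_scaled)
print(test_scaled)   # test values may fall outside [0, 1]
```

Because the scaler never sees the test data, a test value below the training minimum or above the training maximum lands outside the zero-to-one range. That is expected and not a sign of an error.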

**<br>Classifier Selection**

In [None]:
# prepare list
models = []

##  --  Linear  --  ## 
#from sklearn.linear_model import LogisticRegression 
#models.append (("LogReg",LogisticRegression())) 
#from sklearn.linear_model import SGDClassifier 
#models.append (("StocGradDes",SGDClassifier())) 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 
models.append(("LinearDA", LinearDiscriminantAnalysis())) 
#from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis 
#models.append(("QuadraticDA", QuadraticDiscriminantAnalysis())) 

##  --  Support Vector  --  ## 
#from sklearn.svm import SVC 
#models.append(("SupportVectorClf", SVC())) 
#from sklearn.svm import LinearSVC 
#models.append(("LinearSVC", LinearSVC())) 
#from sklearn.linear_model import RidgeClassifier
#models.append (("RidgeClf",RidgeClassifier())) 

##  --  Non-linear  --  ## 
#from sklearn.tree import DecisionTreeClassifier 
#models.append (("DecisionTree",DecisionTreeClassifier())) 
#from sklearn.naive_bayes import GaussianNB 
#models.append (("GaussianNB",GaussianNB())) 
#from sklearn.neighbors import KNeighborsClassifier 
#models.append(("K-NNeighbors", KNeighborsClassifier())) 

##  --  Ensemble: bagging  --  ## 
#from sklearn.ensemble import RandomForestClassifier 
#models.append(("RandomForest", RandomForestClassifier())) 
##  --  Ensemble: boosting  --  ## 
#from sklearn.ensemble import AdaBoostClassifier 
#models.append(("AdaBoost", AdaBoostClassifier())) 
#from sklearn.ensemble import GradientBoostingClassifier 
#models.append(("GradientBoost", GradientBoostingClassifier())) 

##  --  NeuralNet (simplest)  --  ## 
#from sklearn.linear_model import Perceptron 
#models.append (("SingleLayerPtron",Perceptron())) 
#from sklearn.neural_network import MLPClassifier 
#models.append(("MultiLayerPtron", MLPClassifier()))

print(models)

**<br>Target Label Distributions** (standard block)

In [None]:
# from our local library
show_labels_dist(X_train,X_test,y_train,y_test)

**<br>Fit and Predict** (standard block)

In [None]:
# evaluate each model in turn
results = []

print('macro average: unweighted mean per label')
print('weighted average: support-weighted mean per label')
print('MCC: correlation between prediction and ground truth')
print('     (+1 perfect, 0 random prediction, -1 inverse)\n')

for name, clf in models:
    trs = time()
    print('Confusion Matrix:', name)
    
    clf.fit(X_train, y_train)
    ygx = clf.predict(X_test)
    results.append((name, ygx))
    
    tre = time() - trs
    print ("Run Time {} seconds".format(round(tre,2)) + '\n')
    
    # Easy way to ensure that the confusion matrix rows and columns
    #   are labeled exactly as the classifier has coded the classes
    #   [[note the _ at the end of clf.classes_ ]]

    show_metrics(y_test, ygx, clf.classes_)   # from our local library
    print('\nParameters: ', clf.get_params(), '\n\n')
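
The local `show_metrics` helper is not shown here, so the following is only a sketch of the quantities named in the printed legend, using functions from `sklearn.metrics` directly. The labels are made up, and recall stands in for whichever per-label score is being averaged.

```python
from sklearn.metrics import confusion_matrix, matthews_corrcoef, recall_score

# Made-up labels: three 'Average', one 'Special'
y_true = ['Average', 'Average', 'Average', 'Special']
# A lazy classifier that always predicts the majority class
y_pred = ['Average', 'Average', 'Average', 'Average']

classes = ['Average', 'Special']
print(confusion_matrix(y_true, y_pred, labels=classes))

# Per-class recall is 1.0 for 'Average' and 0.0 for 'Special'
macro = recall_score(y_true, y_pred, average='macro')        # unweighted mean
weighted = recall_score(y_true, y_pred, average='weighted')  # weighted by support
mcc = matthews_corrcoef(y_true, y_pred)

print('macro recall:', macro)
print('weighted recall:', weighted)
print('MCC:', mcc)
```

In this toy case the weighted average (0.75) looks respectable, while the macro average (0.5) and the MCC (0) reveal that the minority class is never predicted.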

**Bias - Variance Decomposition** (standard block)

In [None]:
# from our local library
# reduce (cross-validation) folds for faster results
folds = 20
for name, clf in models:
    print('Bias // Variance Decomposition:', name)
    bias_var_metrics(X_train,X_test,y_train,y_test,clf,folds)

***
**Methods for imbalanced datasets**<br>
> * Over / Under sampling
> * Assigning class weights

_Use only the training data and labels!_ A model learns _patterns_ from _groups_ of observations, then uses those patterns to predict which class each _individual_ observation drawn from the population belongs to.

In other words, the training data should represent the characteristics of the _classes_ to make the predictive model, while the test data should represent the _population_ where the actual mix of classes (distribution) is unknown.

* Class Balance

In [None]:
from yellowbrick.target import ClassBalance
# The ClassBalance visualizer has a “compare” mode, 
#   to create a side-by-side bar chart instead of a single bar chart 

# Instantiate the visualizer
visualizer = ClassBalance()
visualizer.fit(y_train, y_test)        # Fit the data to the visualizer
_ = visualizer.show()                  # Finalize and render the figure
# assign visualizer.show() to a null variable to avoid printing some trash

In [None]:
# save our original datasets before we test the modified ones
XtrainOriginal = X_train
ytrainOriginal = y_train
XtestOriginal = X_test

***
**<br>Handling Imbalance with Over-sampling // Under-sampling**
<br><br>
Sampling with replacement produces duplicate observations, so we only do this for the training data, to give the model a more balanced set of observations to work with. The pandas _DataFrameGroupBy.sample_ method applies one `replace` setting to every group: with `replace=False` it raises an error when `n` is larger than a group's size, and with `replace=True` replacement occurs even in groups that are being downsampled.
<br><br>
Here we create a dictionary that maps each class to a number of samples, then use _groupby.apply_ with a lambda to create the sample, with conditional logic to decide per group whether to sample with or without replacement:

In [None]:
# labels_col = 'quality'
yt = pd.DataFrame(y_train)    # series to dataframe
ff = yt[[labels_col]].apply(lambda x: x.value_counts())
ss = ff[labels_col].to_dict()
ss

In [None]:
# set new values - anything goes!
ss['Premium'] = 450
ss['Average'] = ss['Average'] - ss['Special']
ss['Special'] = round(ss['Special'] * 2.5)
ss

In [None]:
# add the labels back to the features dataframe 
xy_train = X_train.copy()
xy_train[labels_col] = yt[labels_col]

In [None]:
# notice the index numbers - sequential, because we reset them after train_test_split()
xy_train.head(4)

In [None]:
# technically, this is a "one-liner" ...
balanced_df = xy_train.groupby(labels_col, 
                               as_index=False, group_keys=False, sort=False
                        ).apply(lambda g: g.sample(n=ss[g.name],
                                                   replace=(len(g) < ss[g.name])
                                                  )).reset_index(drop=True)
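
Recent pandas releases have changed how `groupby.apply` treats the grouping column, so the one-liner may warn or behave differently depending on the installed version. As a hedged alternative, the sketch below does the same per-class resampling with an explicit loop. The data frame `toy` and the dictionary `targets` are made up for illustration, and `random_state` is added only to make the example repeatable.

```python
import pandas as pd

# Made-up training frame: 6 Average, 3 Premium, 1 Special
toy = pd.DataFrame({
    'x': range(10),
    'quality': ['Average'] * 6 + ['Premium'] * 3 + ['Special'] * 1,
})

# Desired number of rows per class (hypothetical values)
targets = {'Average': 4, 'Premium': 3, 'Special': 3}

parts = []
for label, group in toy.groupby('quality', sort=False):
    n = targets[label]
    # Sample with replacement only when the class has too few rows
    parts.append(group.sample(n=n, replace=len(group) < n, random_state=0))

balanced = pd.concat(parts).reset_index(drop=True)
counts = balanced['quality'].value_counts().to_dict()
print(counts)
```

When the label column is categorical, as it is after `pd.cut`, passing `observed=True` to `groupby` skips categories that have no rows.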

In [None]:
# we reset the index at the end, so it is neat
balanced_df.head(4)

In [None]:
## Feature being predicted ("the Right Answer")
ytrain_b = balanced_df[labels_col]

## Features used for prediction 
Xtrain_b = balanced_df.drop(labels_col, axis=1)

In [None]:
# substitute the datasets
X_train = Xtrain_b
y_train = ytrain_b

**<br>Target Label Distributions** (standard block)

* Class Balance

**<br>Fit and Predict** (standard block)

**Bias - Variance Decomposition** (standard block)

***
**Handling Imbalance with Class Weights**

Balanced weighting is a widely used method for imbalanced classification. Each class is assigned a weight, typically larger for the minority classes, and the classifier applies these weights during training so that errors on rare classes count for more.

Unlike over- or under-sampling (a pre-processing step), balanced weighting does not modify the dataset. Instead, each observation is weighted so that wrong predictions for the minority class are given more weight when the loss value is calculated during the model training process. Weights for the loss function can be arbitrary, but a typical choice is weights based on the distribution of labels. 

**SKlearn Classifiers with Class Weights**
<br>A limited number of classifiers can take `class_weight='balanced'` as an argument. This uses the values of y (labels) to automatically set weights inversely proportional to the class frequencies in the training labels, computed as<br>`n_samples / (n_classes * np.bincount(y))`
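
To see what the formula produces, the sketch below computes the weights by hand for made-up, integer-coded labels and compares them with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up integer-coded labels: 6 of class 0, 3 of class 1, 1 of class 2
y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)

n_samples = len(y_toy)
n_classes = len(np.unique(y_toy))

# The 'balanced' heuristic written out by hand
manual = n_samples / (n_classes * np.bincount(y_toy))

# The same weights as computed by scikit-learn
auto = compute_class_weight(class_weight='balanced',
                            classes=np.unique(y_toy), y=y_toy)

print(manual)
print(auto)
```

The rarest class receives the largest weight, and a class that holds exactly `1 / n_classes` of the samples would receive a weight of 1.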

In [None]:
# prepare list - these support a class_weight argument
models = []

##  --  Linear  --  ## 
#from sklearn.linear_model import LogisticRegression 
#models.append (("LogReg",LogisticRegression(class_weight='balanced'))) 
#from sklearn.linear_model import SGDClassifier 
#models.append (("StocGradDes",SGDClassifier(class_weight='balanced'))) 

##  --  Support Vector  --  ## 
#from sklearn.svm import SVC 
#models.append(("SupportVectorClf", SVC(class_weight='balanced'))) 
#from sklearn.svm import LinearSVC 
#models.append(("LinearSVC", LinearSVC(class_weight='balanced'))) 
#from sklearn.linear_model import RidgeClassifier
#models.append (("RidgeClf",RidgeClassifier(class_weight='balanced'))) 

##  --  Non-linear  --  ## 
from sklearn.tree import DecisionTreeClassifier 
models.append (("DecisionTree",DecisionTreeClassifier(class_weight='balanced'))) 

##  --  Ensemble: bagging  --  ## 
#from sklearn.ensemble import RandomForestClassifier 
#models.append(("RandomForest", RandomForestClassifier(class_weight='balanced'))) 

print(models)

**_Make sure we are using the right dataset !!_**

In [None]:
X_train = XtrainOriginal
y_train = ytrainOriginal

**<br>Fit and Predict** (standard block)

**Bias - Variance Decomposition** (standard block)

***
***