> **tomato juice dataset**
<br>` 'quality' is the target feature for classification `
<br>` the other features are chemical properties of our product `

**Import the main libraries**

In [None]:
import numpy as np
import pandas as pd

import warnings
# suppress all warnings
warnings.filterwarnings("ignore")

**Import the Dataset**

In [None]:
## file path: windows style
df = pd.read_csv('..\\datasets\\tomatjus.csv')

## file path: unix style
#df = pd.read_csv('../datasets/tomatjus.csv')

# the shape attribute gives the dimensions of the dataset
print('Dataset dimensions: {} rows, {} columns'.format(
    df.shape[0], df.shape[1]))
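
As a possible alternative to keeping both path styles, the path can be built from its parts with `pathlib`, which picks the separator for the operating system in use. This is only a sketch; the name `csv_path` is an assumption, and the read itself is left commented out because it needs the real file.

```python
from pathlib import Path

# Build the path from its parts; pathlib inserts the right separator
# for whichever operating system the notebook runs on.
csv_path = Path('..') / 'datasets' / 'tomatjus.csv'
print(csv_path)

# read_csv accepts Path objects as well as strings:
# df = pd.read_csv(csv_path)
```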

In [None]:
df.info()

***
**Data Preparation and EDA** (unique to this dataset)
* _Check for missing values_
* _Quick visual check of unique values_
* _Split the classification feature out of the dataset_
* _Check column names of categorical attributes ( for get_dummies() )_
* _Check column names of numeric attributes ( for Scaling )_

**Check for missing values**

In [None]:
cnt = 0
print('Missing Values - ')
for col in df.columns:
    # count the null entries in this column
    n_null = df[col].isnull().sum()
    if n_null > 0:
        cnt = cnt + 1
        print('\t', col, ':', n_null, 'null values')
print('Total', cnt, 'features with null values')

# address missing values here
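
If the check reports any nulls, one simple way to address them is sketched below. It uses a made-up toy frame (the column names are hypothetical, not from the tomato juice data) and a deliberately basic policy: median for numeric columns, most frequent value for the rest. Other policies may suit the real data better.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (column names are made up)
toy = pd.DataFrame({'acidity': [4.1, np.nan, 4.3, 4.0],
                    'sugar':   [2.0, 2.5, np.nan, np.nan],
                    'colour':  ['red', 'red', None, 'orange']})

# Count nulls per column in a single call
null_counts = toy.isnull().sum()
print(null_counts[null_counts > 0])

# Basic policy: median for numeric columns, most frequent value otherwise
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].median())
    else:
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

In a stricter workflow the fill values would be computed on the training subset only and then applied to the test subset, for the same reason that scaling is fitted on training data only.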

**Quick visual check of unique values, deal with unique identifiers**

In [None]:
# Identify columns with only one value 
# or with number of unique values == number of rows
n_eq_one = []
n_eq_all = []

print('Unique value count (',df.shape[0],'Rows in the dataset )')
for col in df.columns:
    lc = len(df[col].unique())
    print(col, ' ::> ', lc)
    if lc == 1:
        n_eq_one.append(df[col].name)
    if lc == df.shape[0]:
        n_eq_all.append(df[col].name)

In [None]:
# Drop columns with only one value
if len(n_eq_one) > 0:
    print('Dropping single-valued features')
    print(n_eq_one)
    df.drop(n_eq_one, axis=1, inplace=True)

# Drop or bin columns with number of unique values == number of rows
if len(n_eq_all) > 0:
    print('Dropping unique identifiers')
    print(n_eq_all)
    df.drop(n_eq_all, axis=1, inplace=True)

# continue with feature selection / feature engineering

**Classification target feature**
<br>"the Right Answers", or more formally "the desired outcome"
<br>It must be held in a separate dataset for classification.

_Make it a multi-class problem, using text labels_

In [None]:
##  divide into classes by giving a range for quality
##  Make it a multi-class problem: {3,4,5} {6} {7,8}
bins = (2, 5, 6, 8)
group_names = ['Average', 'Premium', 'Special']
df['quality'] = pd.cut(df['quality'], bins = bins, labels = group_names)
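
To see how these bins map scores to labels, here is a small sketch on a hand-made series of scores (the series `scores` is an assumption for illustration). `pd.cut` uses right-closed intervals by default, so the bins read as (2, 5], (5, 6] and (6, 8]; a score outside that range would come back as NaN.

```python
import pandas as pd

scores = pd.Series([3, 4, 5, 6, 7, 8])      # illustrative quality scores
bins = (2, 5, 6, 8)
group_names = ['Average', 'Premium', 'Special']

# Intervals are right-closed by default: (2, 5], (5, 6], (6, 8]
binned = pd.cut(scores, bins=bins, labels=group_names)
print(pd.DataFrame({'quality': scores, 'class': binned}))
```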

* Split the classification feature out of the dataset 

In [None]:
## Feature being predicted ("the Right Answer")
labels_col = 'quality'
y = df[labels_col]

## Features used for prediction 
# pandas has a lot of rules about returning a 'view' vs. a copy from slice
# so we force it to create a new dataframe 
X = df.copy()
X.drop(labels_col, axis=1, inplace=True)

In [None]:
# generate a sorted list of unique labels to use later
from sklearn.utils.multiclass import unique_labels
targetlabels = unique_labels(y)

**Check column names of categorical attributes**
<br>Features with text values (categorical attributes) need to be encoded
<br>by changing them to numeric types that the algorithms find easier to work with

In [None]:
categori = X.select_dtypes(include=['object','category']).columns
print(categori.to_list())
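
If that list is not empty, the categorical columns could be encoded with `get_dummies()`. The sketch below uses a toy frame with a hypothetical `variety` column to show the idea; it is not taken from the tomato juice data.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (names are made up)
toy = pd.DataFrame({'variety': ['roma', 'cherry', 'roma'],
                    'ph': [4.2, 4.4, 4.1]})
toy['variety'] = toy['variety'].astype('category')

# Same selection as in the notebook, then one-hot encode those columns
categori = toy.select_dtypes(include=['object', 'category']).columns
encoded = pd.get_dummies(toy, columns=list(categori))
print(encoded.columns.to_list())
```

Each category becomes its own indicator column, named from the original column and the category value.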

**Check column names of numeric attributes**
<br>Features with numeric values need to be normalised
<br>by changing them to small numbers in a specific range (scaling)

In [None]:
numeri = X.select_dtypes(include=['float64','int64']).columns
print(numeri.to_list())

In [None]:
# Scaling is done later in the pipeline, after the test / train split

***
**<br>Create Test // Train Datasets**
> Split X and y datasets into Train and Test subsets,<br>keeping relative proportions of each class (stratify)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=50, stratify=y)

**<br>Target Label Distributions**

In [None]:
# the shape attribute gives the dimensions of the dataset
print('X_train: {} rows, {} columns'.format(X_train.shape[0], X_train.shape[1]))
print('X_test:  {} rows, {} columns'.format(X_test.shape[0], X_test.shape[1]))
print()
print('y_train: {} rows, 1 column'.format(y_train.shape[0]))
print('y_test:  {} rows, 1 column'.format(y_test.shape[0]))
print()

## Here's a nice report:  
# 1. series to dataframe conversion
my_train = pd.DataFrame(y_train)
my_test = pd.DataFrame(y_test)
# 2. dataframe copy with [[ -- ]]
av_train = my_train[[labels_col]].apply(lambda x: x.value_counts())
av_test = my_test[[labels_col]].apply(lambda x: x.value_counts())
# 3. add a new column
av_train['pct_train'] = round((100 * av_train / av_train.sum()),2)
av_test['pct_test'] = round((100 * av_test / av_test.sum()),2)
# 4. combine the dataframes
av_tt = pd.concat([av_train,av_test], axis=1) 
# 5. print the report
print('Frequency and Distribution of labels')
print(av_tt)

***
Next are standard steps for all datasets: _scaling, classifiers, results_

**Scaling** comes _after_ test // train split

In [None]:
# scaling the Numeric columns 
# from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# StandardScaler: zero mean and unit variance (values are not bounded); MinMaxScaler range: zero to 1
# ColumnTransformer returns a numpy.ndarray so we lose the feature names;
# we process one column at a time to preserve the dataframe

# sklearn docs say 
#   "Don't cheat - fit only on training data, then transform both"
#   fit() expects 2D array: reshape(-1, 1) for single col or (1, -1) single row

for i in numeri:
    arr = np.array(X_train[i])
    scale = MinMaxScaler().fit(arr.reshape(-1, 1))
    X_train[i] = scale.transform(arr.reshape(len(arr),1))

    arr = np.array(X_test[i])
    X_test[i] = scale.transform(arr.reshape(len(arr),1))
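
A possible alternative to the per-column loop is to fit one scaler on all numeric columns and assign the result back into the same columns, which also keeps the dataframe and its feature names. The sketch below uses two small made-up frames (`train`, `test`) rather than the real split.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy train / test frames (values are made up)
train = pd.DataFrame({'ph': [4.0, 4.2, 4.4], 'sugar': [2.0, 3.0, 4.0]})
test = pd.DataFrame({'ph': [4.1, 4.6], 'sugar': [2.5, 5.0]})
numeri = ['ph', 'sugar']

# Fit on the training data only, then transform both subsets
scaler = MinMaxScaler().fit(train[numeri])
train[numeri] = scaler.transform(train[numeri])
test[numeri] = scaler.transform(test[numeri])

print(train)
print(test)
```

Because the scaler only sees the training data, values in the test subset can fall slightly outside the zero to 1 range.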

**<br>Classifier Selection**

The feature engineering process involves selecting the minimum 
   required features to produce a valid model because the more 
   features a model contains, the more complex it is (and the 
   more sparse the data), therefore the more sensitive the model 
   is to errors due to variance. 
   
A common approach to eliminating features is to find their relative 
   importance, then eliminate weak features or combinations of 
   features and re-evaluate to see if the model fares better during 
   cross-validation.
   
In scikit-learn, tree models and ensembles of trees provide a 
   `feature_importances_` attribute when fitted, and linear models provide a `coef_` attribute when fitted.
<br><br>
`feature_importances_`
> `sklearn.ensemble.AdaBoostClassifier()`
  `sklearn.ensemble.ExtraTreesClassifier()`
  `sklearn.ensemble.GradientBoostingClassifier()`
  `sklearn.ensemble.RandomForestClassifier()`
  `sklearn.tree.DecisionTreeClassifier()`
  `sklearn.tree.ExtraTreeClassifier()`

`coef_`
> `sklearn.linear_model.LogisticRegression()`
  `sklearn.linear_model.RidgeClassifier()`
  `sklearn.linear_model.SGDClassifier()`
  `sklearn.discriminant_analysis.LinearDiscriminantAnalysis()`
  `sklearn.svm.LinearSVC()`
   
NOTE: `feature_importances_` and `coef_` can be misleading for 
      high cardinality features (many unique values). 
      The `permutation_importance` function works with any classifier, 
      and is an alternative in these cases
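
As a rough illustration of the note above, the sketch below runs `permutation_importance` from `sklearn.inspection` on synthetic data. The data, the variable names and the small `n_estimators` / `n_repeats` values are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real features
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0,
                                      stratify=y_demo)

clf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)

# Shuffle each feature in turn on held-out data and record the drop in score
result = permutation_importance(clf_demo, Xte, yte, n_repeats=5, random_state=0)

for i in result.importances_mean.argsort()[::-1]:
    print('feature {}: {:.3f} +/- {:.3f}'.format(
        i, result.importances_mean[i], result.importances_std[i]))
```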

In [None]:
# prepare list
models = []

## feature_importances_
#from sklearn.tree import DecisionTreeClassifier
#models.append(('DecisionTree', DecisionTreeClassifier()))
from sklearn.ensemble import RandomForestClassifier
models.append(('RandomForest', RandomForestClassifier()))
#from sklearn.ensemble import AdaBoostClassifier
#models.append(('AdaBoostClassifier', AdaBoostClassifier()))
#from sklearn.ensemble import GradientBoostingClassifier
#models.append(('GradientBoostingClassifier', GradientBoostingClassifier()))

## coef_
#from sklearn.linear_model import LogisticRegression
#models.append(('LogisticRegression', LogisticRegression()))
#from sklearn.linear_model import SGDClassifier
#models.append(('StochasticGradientDescent', SGDClassifier()))
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
models.append(('LinearDiscriminantAnalysis', LinearDiscriminantAnalysis()))
#from sklearn.svm import LinearSVC
#models.append(('LinearSVC', LinearSVC()))

print(models)

 ***
 **_Each of these examples works with one classifier at a time_**, for example
>models[0][1]  <br>models[1][1]  <br>models[2][1]
 ***

**<br>Feature Importance Visualisation**
<br>Generally these visualisers call `classifier.fit()` themselves
or expect it immediately beforehand,
<br>so prior predictions are not relevant

In [None]:
import matplotlib.pyplot as plt

# Create a list of the feature names
cols = list(X_train.columns)

The Yellowbrick visualizer uses `feature_importances_` or `coef_`  to rank and plot importance according to the explained variance each feature contributes to the model. 
* Using `feature_importances_` it is usual to plot relative importance, 
   where the most important feature is 100%, 
   and the rest are relative percent of the most important feature.
* Using `coef_` it is better to set `relative=False` 
   to draw the true magnitude of the coefficient (which may be negative). 
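
One plausible reading of "relative importance" as described above is each value expressed as a percentage of the largest absolute value. The sketch below shows that arithmetic on made-up numbers; it is an illustration, not necessarily the exact computation Yellowbrick performs.

```python
import numpy as np

# Made-up importance values for four features
importances = np.array([0.10, 0.40, 0.25, 0.05])

# Scale so that the most important feature is 100%
relative = 100.0 * importances / np.abs(importances).max()
print(relative)
```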

In [None]:
from yellowbrick.model_selection import feature_importances

clf = models[0][1]
# Use the quick method and immediately show the figure
viz = feature_importances(clf, X_train, y_train, labels=cols, relative=False)
# default: relative=True
viz = feature_importances(clf, X_train, y_train, labels=cols, topn=5)

**<br>Stacked Feature Importances**
<br>This one only works with classifiers that return `coef_`
<br>`feature_importances_` is always an array of shape `(n_features,)`
<br><br>
`coef_` holds a single set of coefficients for binary classification, 
   but in the multiclass case it is an array of shape 
   `(n_classes, n_features)` so the relative importance of each 
   feature to the prediction of a specific class can be shown.
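
The difference in shape can be checked directly. The sketch below fits two of the classifiers used in this notebook on the three-class iris data that ships with scikit-learn (chosen only because it is small and built in) and prints the shapes of the two attributes.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier

# Three classes, four features
X_iris, y_iris = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis().fit(X_iris, y_iris)
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_iris, y_iris)

print('coef_ shape:               ', lda.coef_.shape)
print('feature_importances_ shape:', rf.feature_importances_.shape)
```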

In [None]:
from yellowbrick.model_selection import FeatureImportances

clf = models[1][1]

viz = FeatureImportances(clf, stack=True, labels=cols, relative=False)
viz.fit(X_train, y_train)
viz.show()

viz = FeatureImportances(clf, stack=True, labels=cols, topn=5)
viz.fit(X_train, y_train)
viz.show()

**<br>RFE with CV**
<br>ALWAYS EXPECT CROSS VALIDATION TO TAKE SOME TIME
<br>Recursive Feature Elimination (RFE) is a feature selection method 
   that fits a model and removes the weakest feature (or features) 
   until the specified number of features is reached. Features are ranked by the model’s `feature_importances_` or `coef_`
   attribute, and by recursively eliminating a small number of 
   features per loop, RFE attempts to eliminate dependencies and 
   collinearity that may exist in the model. 
   
   To find the optimal number of features, the RFECV visualizer uses cross-validation to score different feature subsets and select the best scoring collection of features, and plots the number of features in the model along with 
their cross-validated test score and variability (some warning output may still appear in spite of filterwarnings).
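
For reference, the same selection idea can be sketched without the plot by using scikit-learn's own `RFECV` from `sklearn.feature_selection`. The synthetic data, the variable names and the small forest size below are assumptions made to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 8 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

selector = RFECV(RandomForestClassifier(n_estimators=25, random_state=0),
                 step=1,   # remove one feature per iteration
                 cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=11),
                 scoring='f1_weighted')
selector.fit(X_demo, y_demo)

print('features kept:', selector.n_features_)
print('support mask: ', selector.support_)
print('ranking:      ', selector.ranking_)
```

Features that are kept all have rank 1; higher ranks mark features that were eliminated earlier.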

In [None]:
from sklearn.model_selection import StratifiedKFold
from yellowbrick.model_selection import rfecv

clf = models[0][1]
crosval = StratifiedKFold(shuffle=True, random_state = 11)

# Create the RFECV visualizer using the quick method
viz = rfecv(clf, X_train, y_train, 
            cv=crosval, scoring='f1_weighted', n_jobs= -1)

In [None]:
# To know which features are being kept we can use the support_ attribute.
viz.support_

In [None]:
# To see the ranking from best (1) to worst, check the ranking_ attribute.
viz.ranking_

In [None]:
# the dataframe column index of selected features 
colndx = viz.get_support(indices=True)
colndx

In [None]:
# let's use a dataframe for display
features_kept = pd.DataFrame({'columns': X_train.columns,
                             'Kept': viz.support_})
features_kept

In [None]:
# create a new dataframe using only the selected features
# X_RFE = X_train.iloc[:, viz.support_]
X_RFE = X_train.iloc[:, colndx]
X_RFE.head()
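
With `.iloc`, column positions belong in the second slot (after the row selector), and the same positions would normally be applied to the test subset so that both frames keep identical features. The sketch below shows this on two small made-up frames; the index array stands in for the positions returned by the selector.

```python
import numpy as np
import pandas as pd

# Toy frames with four feature columns (values are made up)
train = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'])
test = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['a', 'b', 'c', 'd'])

colndx = np.array([0, 2])          # hypothetical positions of the kept features

# All rows, selected columns only; same selection for both subsets
train_sel = train.iloc[:, colndx]
test_sel = test.iloc[:, colndx]

print(train_sel.columns.to_list())
print(test_sel.shape)
```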