<h1>All-in-One project<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Define-a-problem-statement-and-goals" data-toc-modified-id="Define-a-problem-statement-and-goals-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Define a problem statement and goals</a></span></li><li><span><a href="#Understand-the-dataset-and-features" data-toc-modified-id="Understand-the-dataset-and-features-2"><span class="toc-item-num">2&nbsp;&nbsp;</span><a href="https://archive.ics.uci.edu/ml/datasets/hepatitis" target="_blank">Understand the dataset and features</a></a></span><ul class="toc-item"><li><span><a href="#Features" data-toc-modified-id="Features-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Features</a></span></li><li><span><a href="#Loading-the-dataset" data-toc-modified-id="Loading-the-dataset-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Loading the dataset</a></span></li><li><span><a href="#Knowing-the-dataset-shape-and-features'-values-and-data-types" data-toc-modified-id="Knowing-the-dataset-shape-and-features'-values-and-data-types-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Knowing the dataset shape and features' values and data types</a></span></li><li><span><a href="#Understanding-the-outcome-values:" data-toc-modified-id="Understanding-the-outcome-values:-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Understanding the outcome values:</a></span></li></ul></li><li><span><a href="#Splitting-the-data:-Use-any-Splitting-Criterion" data-toc-modified-id="Splitting-the-data:-Use-any-Splitting-Criterion-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Splitting the data: Use any Splitting Criterion</a></span></li><li><span><a href="#Data-preprocessing-and-algorithm-selection-steps-on-training/validation-then-use-the-output-parameters-from-the-training-on-testing" data-toc-modified-id="Data-preprocessing-and-algorithm-selection-steps-on-training/validation-then-use-the-output-parameters-from-the-training-on-testing-4"><span 
class="toc-item-num">4&nbsp;&nbsp;</span>Data preprocessing and algorithm selection steps on training/validation then use the output parameters from the training on testing</a></span><ul class="toc-item"><li><span><a href="#Select-the-best-model-and-features" data-toc-modified-id="Select-the-best-model-and-features-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Select the best model and features</a></span></li><li><span><a href="#How-do-you-comment-on-under-fitting-or-over-fitting?" data-toc-modified-id="How-do-you-comment-on-under-fitting-or-over-fitting?-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>How do you comment on under-fitting or over-fitting?</a></span></li></ul></li></ul></div>

# Define a problem statement and goals

# [Understand the dataset and features](https://archive.ics.uci.edu/ml/datasets/hepatitis)

## Features
   
     1. Class: DIE, LIVE
     2. AGE: 10, 20, 30, 40, 50, 60, 70, 80
     3. SEX: male, female
     4. STEROID: no, yes
     5. ANTIVIRALS: no, yes
     6. FATIGUE: no, yes
     7. MALAISE: no, yes
     8. ANOREXIA: no, yes
     9. LIVER BIG: no, yes
    10. LIVER FIRM: no, yes
    11. SPLEEN PALPABLE: no, yes
    12. SPIDERS: no, yes
    13. ASCITES: no, yes
    14. VARICES: no, yes
    15. BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00
    16. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250
    17. SGOT: 13, 100, 200, 300, 400, 500
    18. ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
    19. PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90
    20. HISTOLOGY: no, yes

## Loading the dataset

In [None]:
import numpy as np
import pandas as pd
import math
from scipy.stats import loguniform
from statistics import mean
from scipy import stats
#import scikit_posthocs
import warnings 

from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, RandomizedSearchCV
from sklearn.metrics import classification_report,f1_score
from sklearn.impute import SimpleImputer 
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler
from scikit_posthocs import posthoc_nemenyi_friedman


warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv('~/DATA/hepatitis.csv', thousands=',', na_values='?')

In [None]:
data

## Knowing the dataset shape and features' values and data types

1. Decide whether this is a classification, regression, or unsupervised problem.
2. Check whether the dataset is small and which values are acceptable for each feature.
3. Any decision made in item 2 must also be applied to external data. Example: if you remove noise or outliers from the whole dataset to clean it, keep track of these decisions so you can apply them to external examples.
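
One way to make items 2 and 3 concrete is to store the acceptable ranges in a single dictionary and reuse it on any external data. The sketch below is only an illustration: the `ALLOWED` ranges, the `out_of_range` helper, and the toy rows are assumptions, not values taken from the hepatitis file.

```python
import pandas as pd

# Hypothetical acceptable ranges (assumed for illustration only).
ALLOWED = {
    "AGE": (10, 80),
    "BILIRUBIN": (0.39, 4.00),
}

def out_of_range(df, allowed):
    """Count, per column, the non-missing values outside [low, high]."""
    report = {}
    for col, (low, high) in allowed.items():
        values = df[col].dropna()  # missing values are handled separately by imputation
        report[col] = int(((values < low) | (values > high)).sum())
    return report

# Toy rows, not real patients.
toy = pd.DataFrame({
    "AGE": [30, 95, 45, float("nan")],
    "BILIRUBIN": [0.9, 1.2, 7.5, 0.1],
})
report = out_of_range(toy, ALLOWED)
print(report)
```

Because the ranges live in one place, the same `ALLOWED` dictionary can be applied unchanged to an external dataset.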

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
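#From the dtypes above, no column has dtype object, so nothing contradicts the feature description
#Column 0 is the outcome (Class); the remaining columns are the features
allFeatures=data.columns[1:len(data.columns)]
#Categorical columns are selected by position; this relies on the CSV column order,
#so check the result against the feature list above
catFeatures=data.columns[list(range(2,14))+list(range(15,17))+list(range(18,20))]
numFeatures= [i for i in allFeatures if not(i in catFeatures)]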

In [None]:
numFeatures

In [None]:
catFeatures

In [None]:
#Convert the categorical columns to dtype object (this modifies data in place)
for c in catFeatures:
    data[c]=data[c].astype('object')

In [None]:
data.dtypes

In [None]:
data[numFeatures].describe()

In [None]:
data[numFeatures].hist()

In [None]:
plt.rcParams["figure.figsize"] = [7.50, 3.50]
plt.rcParams["figure.autolayout"] = True



for cat in catFeatures:
    fig, ax = plt.subplots()
    data[cat].value_counts().plot(ax=ax, kind='bar', xlabel=cat, ylabel='frequency')
    plt.show()

## Understanding the outcome values:
1. For classification, check whether the classes are imbalanced (equal class ratios mean the data is balanced).
2. For regression, check whether the outcome is skewed.
3. I prefer not to correct any problem with the outcome (imbalance or skew) before splitting.
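
As a rough illustration of items 1 and 2, the sketch below measures class imbalance and outcome skew on toy data. The helper names `class_ratios` and `imbalance_ratio` and the toy values are assumptions made for this example.

```python
import pandas as pd

def class_ratios(y):
    """Relative frequency of each class label."""
    return y.value_counts(normalize=True)

def imbalance_ratio(y):
    """Majority count divided by minority count; 1.0 means perfectly balanced."""
    counts = y.value_counts()
    return counts.max() / counts.min()

# Toy classification outcome: six samples of class 2, two of class 1.
y_toy = pd.Series([2, 2, 2, 2, 2, 2, 1, 1])
ratios = class_ratios(y_toy)
ratio = imbalance_ratio(y_toy)

# Toy regression outcome with one large value, hence a long right tail.
skew = pd.Series([1.0, 1.2, 1.1, 1.3, 9.0]).skew()

print(ratios)
print("imbalance ratio:", ratio)
print("skewness:", skew)
```

A ratio well above 1 signals imbalance, and a skewness far from 0 signals a skewed outcome. Both are only diagnostics here, since no correction is applied before splitting.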

In [None]:
#Class counts: the dataset is imbalanced
data['Class'].value_counts()

In [None]:
X = data[data.columns[1:20]]
y = data[data.columns[0]]

# Splitting the data: Use any Splitting Criterion

<b> In your project </b>

 1. If the number of data points is $\le 1000$, use bootstrapping.
 2. If you don't have enough data points relative to the number of features, use a feature selection or dimensionality reduction algorithm to reduce the number of features. A rule of thumb is that the number of data points should be at least 10 to 20 times the number of features.
 3. If you have very few features, you may randomly select data points so that the rule in item 2 is satisfied, repeat the process many times to create different models, and average their performance.
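
The rules above can be sketched as follows. This is a minimal illustration, assuming a dataset of 155 rows and 19 features; `enough_samples` and `bootstrap_indices` are hypothetical helper names, and the out-of-bag rows (those never drawn in a round) serve as that round's evaluation set.

```python
import numpy as np

def enough_samples(n_samples, n_features, factor=10):
    """Rule of thumb: at least `factor` data points per feature."""
    return n_samples >= factor * n_features

def bootstrap_indices(n_samples, n_rounds, seed=0):
    """Yield (train_idx, oob_idx) pairs; train_idx is drawn with replacement."""
    rng = np.random.default_rng(seed)
    all_idx = np.arange(n_samples)
    for _ in range(n_rounds):
        train_idx = rng.integers(0, n_samples, size=n_samples)
        oob_idx = np.setdiff1d(all_idx, train_idx)  # rows not drawn in this round
        yield train_idx, oob_idx

# Assumed sizes for illustration.
print("enough samples (factor 10):", enough_samples(155, 19))

splits = list(bootstrap_indices(n_samples=155, n_rounds=5))
for train_idx, oob_idx in splits:
    print(len(train_idx), "training rows,", len(oob_idx), "out-of-bag rows")
```

Each round would fit a model on `train_idx` and score it on `oob_idx`; averaging the scores over the rounds gives the bootstrap estimate.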

In [None]:
#Since the dataset has fewer than 1000 rows, we could use bootstrapping from the beginning, but here we do a single hold-out split: 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Data preprocessing and algorithm selection steps on training/validation then use the output parameters from the training on testing

1. Remove outliers/noise and compute descriptive statistics if you have not done so already.
2. Impute missing values, balance the data (imbalance algorithms and/or stratified K-fold), scale numeric features or encode categorical features, and optimize hyper-parameters.
3. Apply feature selection/dimensionality reduction methods; you may use them before or in parallel with Step 4.
4. If your problem is classification/regression, you can try clustering to understand the problem better.
5. Select the algorithm(s):
    - Do a literature review to collect the methods that have been used.
    - Determine what is new in your method:
        - Using a new algorithm/dataset/features. In the project, you need to use at least one new algorithm not taught in the course (ask the instructor).
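
The idea in the section title, learning preprocessing parameters on the training data and reusing them on the test data, can also be expressed with a scikit-learn `Pipeline`. The sketch below is an alternative to the step-by-step cells in this notebook, not a replacement for them; the column subset and the toy rows are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["AGE", "BILIRUBIN"]   # assumed numeric subset
cat_cols = ["SEX", "STEROID"]     # assumed categorical subset

preprocess = ColumnTransformer([
    # Numeric: mean imputation, then standardization.
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), num_cols),
    # Categorical: mode imputation, then one-hot encoding.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression())])

# Toy training rows, not real patients.
train = pd.DataFrame({
    "AGE": [30, 50, np.nan, 61, 45, 38],
    "BILIRUBIN": [0.9, 1.2, 2.3, np.nan, 0.7, 1.0],
    "SEX": [1, 1, 2, 1, np.nan, 2],
    "STEROID": [1, 2, 2, np.nan, 1, 2],
})
y = [2, 2, 1, 1, 2, 2]
clf.fit(train, y)  # means, modes, and scales are learned from training rows only

test = pd.DataFrame({"AGE": [np.nan], "BILIRUBIN": [1.5],
                     "SEX": [2], "STEROID": [np.nan]})
pred = clf.predict(test)  # test rows are transformed with the training parameters
print(pred)
```

Wrapping the steps in one estimator also means that, inside cross-validation, the imputer and scaler are refit on each training fold, which avoids leaking validation data into the preprocessing.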

In [None]:
#check nulls in training
X_train.isnull().sum()

In [None]:
#See https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

imp_mode.fit(X_train[catFeatures])
imp_mean.fit(X_train[numFeatures])

In [None]:
X_train[catFeatures]=imp_mode.transform(X_train[catFeatures])
X_test[catFeatures]=imp_mode.transform(X_test[catFeatures])

X_train[numFeatures]=imp_mean.transform(X_train[numFeatures])
X_test[numFeatures]=imp_mean.transform(X_test[numFeatures])


In [None]:
X_train.isnull().sum()

In [None]:
X_test.isnull().sum()

In [None]:
scalor=StandardScaler()

scalor.fit(X_train[numFeatures])
X_train[numFeatures]=scalor.transform(X_train[numFeatures])
X_test[numFeatures]=scalor.transform(X_test[numFeatures])

In [None]:
X_train

In [None]:
X_test

In [None]:
X_train=pd.get_dummies(X_train)

In [None]:
X_train

In [None]:
X_test=pd.get_dummies(X_test)

In [None]:
X_test

In [None]:
X_test = X_test.reindex(columns = X_train.columns, fill_value=0)
X_test

In [None]:
y_train.value_counts()

## Select the best model and features

In [None]:
#The data is imbalanced: we can use an imbalance algorithm (https://pypi.org/project/imbalanced-learn/), RepeatedStratifiedKFold directly, or both.

rsFolds=RepeatedStratifiedKFold(n_splits=20, n_repeats=10, random_state=0)
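
To see why stratified folds help with imbalanced classes, the sketch below checks the minority share in every validation fold. The toy labels (20% minority) and the smaller `n_splits` and `n_repeats` are assumptions chosen to keep the output short.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy labels: 20 samples of class 1 (minority) and 80 of class 2.
y = np.array([1] * 20 + [2] * 80)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
minority_share = [np.mean(y[val_idx] == 1) for _, val_idx in cv.split(X, y)]
print(minority_share)
```

Every validation fold keeps the class ratio of the full training set, so no fold ends up without minority samples.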

In [None]:
classifiers=[LogisticRegression(), SVC(), KNeighborsClassifier()]

#Logistic Regression parameter space
parmLogi=dict()
parmLogi['solver'] = [ 'liblinear']
parmLogi['penalty'] = [ 'l1', 'l2']
parmLogi['C'] = loguniform(1e-5, 100)

#Support vector classifier parameter space

parm_SVCKernel = {'C': [0.1, 1, 10, 100, 1000], 
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']} 


#KNeighborsClassifier parameter Space
max_K=int(math.sqrt(1500))
parm_KNN= {
    'n_neighbors':list(range(3,max_K,2)),
    'metric':['euclidean','minkowski']
}

paramatersClassifiers=[parmLogi,parm_SVCKernel,parm_KNN]

In [None]:
#Draw 100 bootstrap resamples of the test set

test_splits=[]

for i in range(0,100):
    #Use a different seed per iteration; a fixed seed would produce 100 identical resamples
    S,S_y=resample(X_test,y_test,replace=True,random_state=i)
    test_splits.append([S,S_y])

In [None]:
np.random.seed(0)
target_names = ['class 0', 'class 1']
scoresTests=[]
for i in range(0,3):
    print('Classifier_'+str(i)+' --> ',classifiers[i])
    model= RandomizedSearchCV(classifiers[i], paramatersClassifiers[i], scoring='f1_micro', cv=rsFolds, random_state=1)
    model.fit(X_train,y_train)
    print('Best Score: %s' % model.best_score_)
    print('Best Hyperparameters: %s' % model.best_params_)
    y_pred=model.predict(X_train)
    print(classification_report(y_train, y_pred, target_names=target_names))
    scorePerClassifier=[]
    for j in range(0, len(test_splits)):
        y_test_pred=model.predict(test_splits[j][0])
        scorePerClassifier.append(f1_score(test_splits[j][1], y_test_pred, average='micro'))
    print (mean(scorePerClassifier))
    scoresTests.append(scorePerClassifier)

In [None]:
groups=[scoresTests[i] for i in range(0,len(classifiers))]
print(len(groups), len(groups[0]))

In [None]:
#Check whether the Friedman test is significant

chi_square,p_value_mean=stats.friedmanchisquare(*groups)
print(p_value_mean)

In [None]:
#If the Friedman test is significant, then do the pairwise post-hoc test posthoc_nemenyi_friedman

In [None]:
trans_groups=np.array(groups).T
print(trans_groups)

In [None]:

p=posthoc_nemenyi_friedman(trans_groups)
print(p)
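
The decision rule used in the cells above (run the Friedman test first, and proceed to the pairwise post-hoc test only if it is significant) can be sketched on synthetic scores. The scores, the 0.05 threshold `alpha`, and the variable names below are assumptions for illustration; they are not results from this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_resamples = 30

# Synthetic per-resample scores for three hypothetical classifiers.
# A shared base term mimics resamples that are easier or harder for all models.
base = rng.normal(0.80, 0.03, size=n_resamples)
scores_a = base + 0.05 + rng.normal(0, 0.005, n_resamples)
scores_b = base + rng.normal(0, 0.005, n_resamples)
scores_c = base - 0.05 + rng.normal(0, 0.005, n_resamples)

stat, p_value = stats.friedmanchisquare(scores_a, scores_b, scores_c)

alpha = 0.05  # assumed significance level
run_posthoc = p_value < alpha
print("Friedman p-value:", p_value)
print("run pairwise post-hoc test:", run_posthoc)

# If run_posthoc is True, the scores would be arranged as an array of shape
# (n_resamples, n_classifiers) and passed to the post-hoc test.
block_matrix = np.column_stack([scores_a, scores_b, scores_c])
print(block_matrix.shape)
```

The Friedman test ranks the classifiers within each resample, so it only answers whether at least one classifier differs; the post-hoc test is what identifies which pairs differ.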

## How do you comment on under-fitting or over-fitting?

<b> From the above results, you can say the best is KNN. Why? </b>
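
One common way to comment on under-fitting or over-fitting is to compare the training score with the validation score. The sketch below does this on synthetic data from `make_classification`; the helper `train_val_gap` and the two values of `n_neighbors` are assumptions for illustration, not the settings selected above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only to illustrate the diagnostic.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

def train_val_gap(model, X, y, cv=5):
    """Mean training score, mean validation score, and their difference."""
    res = cross_validate(model, X, y, cv=cv, scoring="f1_micro",
                         return_train_score=True)
    train = res["train_score"].mean()
    val = res["test_score"].mean()
    return train, val, train - val

results = {}
for k in (1, 15):
    results[k] = train_val_gap(KNeighborsClassifier(n_neighbors=k), X, y)
    print("k =", k, "train/val/gap =", results[k])
```

A high training score with a much lower validation score suggests over-fitting; low scores on both suggest under-fitting; similar, reasonably high scores suggest an adequate fit. With `n_neighbors=1` the training score is perfect by construction, because each training point is its own nearest neighbor.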