## Table of Contents
## 1)  [First Question](#q1)
###    - [First Evaluation](#e1)
## 2) [Second Question](#q2)
###    - [Second Evaluation](#e2)

## Problem 3
* Dataset: Red Wine Quality, source: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009?select=winequality-red.csv
* The first question is a classification problem: can wine quality be predicted from the other variables?
* The second question is a numerical prediction (regression) problem: can pH be predicted from the other variables?
* After each trial run, the notebook documents the evaluation and the DataCamp course the methods came from.

In [48]:
# Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import seaborn as sns

from scipy import stats
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.compose import ColumnTransformer
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, classification_report
import plotly.express as px
from kmodes.kmodes import KModes

from scipy.cluster.hierarchy import linkage, dendrogram, cut_tree
import joblib

from sklearn import set_config
sns.set_palette('Set2')
set_config(display='diagram')

from yellowbrick.cluster import KElbowVisualizer


## First Question <a class="anchor" id="q1"></a>
### Predict the wine quality from predictor variables
### Classification methods will be used
### Dataset: Red Wine Quality, source: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009?select=winequality-red.csv

In [3]:
# import dataset

wine = pd.read_csv('data/winequality-red.csv')
wine.head()

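| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |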


In [8]:
wine.columns
wine['quality'].unique()

array([5, 6, 7, 4, 8, 3])

The target variable, `quality`, is already discretized into integer scores (3 through 8), so no binning is needed.

In [9]:
wine.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
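Alongside the missing-value check, it can help to look at how the quality scores are distributed, since a skewed target affects how accuracy should be read. The following is a minimal sketch on a small hypothetical Series (`quality` here is a stand-in, not the real column from the CSV):

```python
import pandas as pd

# Hypothetical stand-in for wine['quality']; the real column comes from the CSV
quality = pd.Series([5, 5, 6, 6, 6, 7, 5, 4, 6, 8], name="quality")

# Relative frequency of each score, ordered by score
class_share = quality.value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data the same call would be `wine['quality'].value_counts(normalize=True).sort_index()`.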

In [14]:
X = wine.drop('quality', axis=1)
y = wine['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
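The split above is purely random. When some quality scores are rare, one option is a stratified split, which keeps each class's share roughly equal in the train and test sets. A minimal sketch on hypothetical data (`X_demo` and `y_demo` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels standing in for wine['quality']
rng = np.random.default_rng(42)
y_demo = np.array([5] * 40 + [6] * 40 + [7] * 16 + [8] * 4)
X_demo = rng.normal(size=(len(y_demo), 3))

# stratify=y_demo preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

labels, counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```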

In [11]:
rf_grid = GridSearchCV(RandomForestClassifier(random_state=42), {}, cv=5)
rf_grid.fit(X_train, y_train)
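With an empty parameter grid, `GridSearchCV` evaluates a single candidate, so it acts as plain 5-fold cross-validation of the default model. The sketch below, on synthetic data from `make_classification` (an assumption for illustration, not the wine data), suggests that `cross_val_score` gives the same mean score:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data standing in for the wine features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=42)

# Empty grid: one candidate, evaluated with 5-fold cross-validation
grid = GridSearchCV(model, {}, cv=5).fit(X_demo, y_demo)

# The same evaluation without the grid-search wrapper
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print(grid.best_score_, cv_scores.mean())
```

The grid-search form has the advantage that it also refits the model on the full training set, which is what allows the later `score(X_test, y_test)` call.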

In [16]:
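# best_score_ is the mean cross-validated accuracy within the training set, not a training-set score
print('train score', rf_grid.best_score_)
print('test score', rf_grid.score(X_test, y_test))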

train score 0.6814574616457462
test score 0.67


In [36]:
n_est = [round(num) for num in np.logspace(start=2, stop=3, num=10)]
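To make the candidate values explicit, the expression above can be evaluated on its own. It produces ten tree counts spaced evenly on a log scale between 100 and 1000:

```python
import numpy as np

# Ten values between 10**2 and 10**3, evenly spaced in log space, rounded to integers
n_est = [round(num) for num in np.logspace(start=2, stop=3, num=10)]
print(n_est)
```

Combined with two split criteria and 5-fold cross-validation, this grid amounts to 10 x 2 x 5 = 100 model fits, plus one final refit.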


In [37]:
rf_param_grid = {
    'n_estimators': n_est,
    'criterion': ['gini', 'entropy']
}

In [38]:
rf_grid2 = GridSearchCV(RandomForestClassifier(random_state=42), rf_param_grid, cv=5)
rf_grid2.fit(X_train, y_train)

In [40]:
print('train score', rf_grid2.best_score_)
print('test score', rf_grid2.score(X_test, y_test))
rf_grid2.best_estimator_

train score 0.6889609483960948
test score 0.675


In [41]:
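# Note: scikit-learn expects (y_true, y_pred). The predictions are passed first here,
# so rows of the matrix below are predicted labels, columns are true labels,
# and the precision and recall columns of the report are swapped.
print('Confusion Matrix \n',confusion_matrix(rf_grid2.predict(X_test),y_test), '\n')
print('Accuracy Score \n', rf_grid2.score(X_test, y_test), '\n')
print('Classification Report \n',classification_report(rf_grid2.predict(X_test),y_test))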

Confusion Matrix 
 [[  0   0   0   0   0   0]
 [  0   0   0   0   0   0]
 [  1   7 124  36   0   0]
 [  0   6  39 122  23   0]
 [  0   0   1  11  24   5]
 [  0   0   0   0   1   0]] 

Accuracy Score 
 0.675 

Classification Report 
               precision    recall  f1-score   support

           3       0.00      0.00      0.00         0
           4       0.00      0.00      0.00         0
           5       0.76      0.74      0.75       168
           6       0.72      0.64      0.68       190
           7       0.50      0.59      0.54        41
           8       0.00      0.00      0.00         1

    accuracy                           0.68       400
   macro avg       0.33      0.33      0.33       400
weighted avg       0.71      0.68      0.69       400
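scikit-learn documents the argument order of `confusion_matrix` and `classification_report` as `(y_true, y_pred)`, whereas the cell above passes the predictions first. A small sketch on hypothetical labels (`y_true_demo` and `y_pred_demo` are made up for illustration) shows that swapping the arguments transposes the matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
y_true_demo = np.array([5, 5, 6, 6, 7])
y_pred_demo = np.array([5, 6, 6, 6, 6])

# Documented order: true labels first, predictions second
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[5, 6, 7])

# Swapping the arguments yields the transposed matrix
cm_swapped = confusion_matrix(y_pred_demo, y_true_demo, labels=[5, 6, 7])

print(cm)
print(cm_swapped)
```

For the notebook's model, the documented order would be `confusion_matrix(y_test, rf_grid2.predict(X_test))`, and likewise for `classification_report`.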





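### Evaluation <a class="anchor" id="e1"></a>
The random forest classifier reached about 67% test accuracy, and hyperparameter tuning improved it only marginally
(0.670 to 0.675). This is a moderate score. Standardizing or normalizing the features is unlikely to change it much,
because tree-based splits are insensitive to monotonic rescaling. The scarcity of wines rated 3, 4, or 8 is a more
likely limit: the model never predicted a score of 3 or 4 on the test set.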
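One way to put the 67% in context is a majority-class baseline. Because the matrix above was built with predictions as the first argument, its columns correspond to the true labels. The largest column sum is 169 of 400 (score 6), so always predicting 6 would score about 42% on this test set. The sketch below shows how such a baseline could be computed with `DummyClassifier`, using hypothetical labels rather than the real data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with an imbalance similar in spirit to the wine scores
y_demo = np.array([5] * 41 + [6] * 42 + [7] * 12 + [4] * 3 + [8] * 2)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print(round(baseline_acc, 2))
```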

The methods used here, the random forest classifier and hyperparameter tuning, come from the DataCamp course
'Machine Learning with Tree-Based Models in Python'.

## Second Question <a class="anchor" id="q2"></a>
### Dataset: Using same dataset as First Question
### Numerical prediction problem: predict the pH of a wine from the predictor variables

In [47]:
X = wine.drop('pH', axis=1)
y = wine['pH']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [50]:
rfr_pipe = Pipeline([('scaler', StandardScaler()), ('rf', RandomForestRegressor())])

In [51]:
rfr_grid = GridSearchCV(rfr_pipe, {}, cv=5)
rfr_grid.fit(X_train, y_train)

In [53]:
# best_score_ is the mean cross-validated R^2 within the training set, not a training-set score
print('train score', rfr_grid.best_score_)
print('test score', rfr_grid.score(X_test, y_test))
print('train std', rfr_grid.cv_results_['std_test_score'][rfr_grid.best_index_])

train score 0.7347060485698234
test score 0.7892788011293226
train std 0.025818950412307664


### Evaluation <a class="anchor" id="e2"></a>
The test-set R^2 (coefficient of determination) is about 0.79, and the mean cross-validated R^2 is about 0.73. This is
a strong score: on the held-out data, the predictor variables explain roughly 79% of the variance in the target variable, pH.
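For reference, R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on hypothetical pH values (the arrays are made up for illustration) and compares it with `r2_score`. It also reports the mean absolute error, which is expressed in pH units and may be easier to interpret:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Hypothetical pH values and predictions, for illustration only
y_true = np.array([3.51, 3.20, 3.26, 3.16, 3.40])
y_pred = np.array([3.45, 3.25, 3.30, 3.20, 3.35])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

mae = mean_absolute_error(y_true, y_pred)

print(r2_manual, r2_score(y_true, y_pred))
print(mae)
```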

The random forest regressor method comes from the DataCamp course 'Machine Learning with Tree-Based Models in Python'.

