<h3> In this notebook we will perform preprocessing of the total wine database including</h3>
<ol>
    <li> Creating dummy variables from categorical variables where needed </li>
    <li> Scaling continuous variables</li>
    <li> Split into a training and test test </li>
    <li> Saving the pre-processed and split data into separate CSV files </li>
</ol>

In [31]:
# Load Required Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [32]:
# Load wine database files
wine_qual = pd.read_csv('../data/WineQual.csv')

In [33]:
# Print out head
wine_qual.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,wine_color
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,2
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,2
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,2
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,2
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,2


In [34]:
# Create explanatory and predictor variables
x_cols = list(wine_qual.columns)
x_cols.remove('quality')
y = wine_qual['quality']
X = wine_qual[x_cols]

In [35]:
# Create dummies and add back to data table
# From before, red is 2 and white is 1
# to make this a dummy variable, subtract 1 from the wine_color column to make red=1 and white=0
# Quality is an ordinal variable where the higher the number the better the quality
X = pd.get_dummies(X, drop_first = True)
X.loc[:,'wine_color'] -= 1

In [36]:
# Stratify along wine_color for test train split
X_train, X_test, y_train, y_test = train_test_split(X, y,stratify= y, random_state = 42, \
                                                    test_size=0.20)

In [37]:
# Perform standard scaler on training set and also apply this scaler to test set
# To prevent leakage and because we learn something from the data when we scale, scaling should be done first
# to training data and this scaler applied to the test data.
s = StandardScaler()

scaled_X_train = s.fit_transform(X_train)
df_scaled_X_train = pd.DataFrame(scaled_X_train, index=X_train.index, columns=x_cols)

scaled_X_test = s.fit_transform(X_test)
df_scaled_X_test = pd.DataFrame(scaled_X_test, index=X_test.index, columns=x_cols)


In [38]:
print(df_scaled_X_test.head())

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
6338      -0.015878          0.158563     0.009537       -0.677853   0.189381   
3921       0.059319          0.098361     0.151232        1.922783   0.032669   
2051      -0.316669         -0.804670    -0.344700        2.603418  -0.437469   
2059      -1.143844          0.700381     1.001401       -0.799758  -0.500154   
4557      -0.843053          0.158563     1.001401        1.353894  -0.437469   

      free sulfur dioxide  total sulfur dioxide   density        pH  \
6338            -0.797747             -1.547784 -0.011850  0.041330   
3921             0.885150              1.033111  0.852970 -0.503226   
2051             0.613715              0.552531  1.650082  0.464874   
2059             0.070845              0.961914 -0.473498 -0.563732   
4557             1.699455              1.691685  0.696010 -0.321707   

      sulphates   alcohol  wine_color  
6338   1.286766  0.689241   -1.704084  
3921  

<h3>This is the modeling step for the capstone project</h3>
<ol>We will look at several models including:
    <li>Ordinal Regression with various kernals</li>
    <ol>
        <li>Probit</li>
        <li>Logit</li>
        <li>One Customer Kernal</li>
    </ol>
    <li>Tree Regression</li>
    <ol>
        <li>Random Forest Regression</li>
        <li>Other forest methodologies</li>
    </ol>
</ol>

<h3>Load additional required packages</h3>

In [39]:
import scipy.stats as stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

In [52]:
# Probit Model First
mod_prob = OrderedModel(y_train, df_scaled_X_train, distr='probit')
res_prob = mod_prob.fit(method='bfgs')
res_prob.summary()

Optimization terminated successfully.
         Current function value: 1.093262
         Iterations: 34
         Function evaluations: 36
         Gradient evaluations: 36


0,1,2,3
Dep. Variable:,quality,Log-Likelihood:,-5681.7
Model:,OrderedModel,AIC:,11400.0
Method:,Maximum Likelihood,BIC:,11520.0
Date:,"Tue, 19 Apr 2022",,
Time:,21:56:05,,
No. Observations:,5197,,
Df Residuals:,5179,,
Df Model:,18,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
fixed acidity,0.1821,0.036,5.099,0.000,0.112,0.252
volatile acidity,-0.3764,0.023,-16.498,0.000,-0.421,-0.332
citric acid,-0.0084,0.020,-0.433,0.665,-0.047,0.030
residual sugar,0.4769,0.049,9.695,0.000,0.380,0.573
chlorides,-0.0400,0.020,-2.027,0.043,-0.079,-0.001
free sulfur dioxide,0.1593,0.023,7.009,0.000,0.115,0.204
total sulfur dioxide,-0.1044,0.030,-3.430,0.001,-0.164,-0.045
density,-0.5245,0.077,-6.805,0.000,-0.676,-0.373
pH,0.1302,0.025,5.226,0.000,0.081,0.179


In [55]:
predicted = res_prob.model.predict(res_prob.params, exog=scaled_X_test)


[[1.23845725e-04 2.79821103e-03 1.48453983e-01 ... 2.71528189e-01
  4.34241281e-02 7.69120001e-04]
 [5.56849727e-04 8.78268992e-03 2.56310681e-01 ... 1.70793579e-01
  1.73167428e-02 1.77201468e-04]
 [1.88859999e-03 2.15194666e-02 3.73211216e-01 ... 9.92798477e-02
  6.66875814e-03 4.14577621e-05]
 ...
 [9.18000858e-05 2.21862529e-03 1.31947057e-01 ... 2.91436382e-01
  5.07631078e-02 9.95969405e-04]
 [5.65384264e-03 4.64476272e-02 4.88139895e-01 ... 5.12711859e-02
  2.27867346e-03 8.57998289e-06]
 [2.15394432e-05 7.09564669e-04 7.19335775e-02 ... 3.77887241e-01
  9.75170963e-02 3.05618482e-03]]


In [58]:
print(predicted[1].max())

0.5460622558445611
