<h3> In this notebook we will perform preprocessing of the total wine database including</h3>
<ol>
    <li> Creating dummy variables from categorical variables where needed </li>
    <li> Scaling continuous variables</li>
    <li> Split into a training and test test </li>
    <li> Saving the pre-processed and split data into separate CSV files </li>
</ol>

In [31]:
# Load Required Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [32]:
# Load wine database files
wine_qual = pd.read_csv('../data/WineQual.csv')

In [33]:
# Print out head
wine_qual.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,wine_color
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,2
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,2
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,2
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,2
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,2


In [34]:
# Create explanatory and predictor variables
x_cols = list(wine_qual.columns)
x_cols.remove('quality')
y = wine_qual['quality']
X = wine_qual[x_cols]

In [35]:
# Create dummies and add back to data table
# From before, red is 2 and white is 1
# to make this a dummy variable, subtract 1 from the wine_color column to make red=1 and white=0
# Quality is an ordinal variable where the higher the number the better the quality
X = pd.get_dummies(X, drop_first = True)
X.loc[:,'wine_color'] -= 1

In [36]:
# Stratify along wine_color for test train split
X_train, X_test, y_train, y_test = train_test_split(X, y,stratify= y, random_state = 42, \
                                                    test_size=0.20)

In [37]:
# Perform standard scaler on training set and also apply this scaler to test set
# To prevent leakage and because we learn something from the data when we scale, scaling should be done first
# to training data and this scaler applied to the test data.
s = StandardScaler()

scaled_X_train = s.fit_transform(X_train)
df_scaled_X_train = pd.DataFrame(scaled_X_train, index=X_train.index, columns=x_cols)

scaled_X_test = s.fit_transform(X_test)
df_scaled_X_test = pd.DataFrame(scaled_X_test, index=X_test.index, columns=x_cols)


In [38]:
print(df_scaled_X_test.head())

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
6338      -0.015878          0.158563     0.009537       -0.677853   0.189381   
3921       0.059319          0.098361     0.151232        1.922783   0.032669   
2051      -0.316669         -0.804670    -0.344700        2.603418  -0.437469   
2059      -1.143844          0.700381     1.001401       -0.799758  -0.500154   
4557      -0.843053          0.158563     1.001401        1.353894  -0.437469   

      free sulfur dioxide  total sulfur dioxide   density        pH  \
6338            -0.797747             -1.547784 -0.011850  0.041330   
3921             0.885150              1.033111  0.852970 -0.503226   
2051             0.613715              0.552531  1.650082  0.464874   
2059             0.070845              0.961914 -0.473498 -0.563732   
4557             1.699455              1.691685  0.696010 -0.321707   

      sulphates   alcohol  wine_color  
6338   1.286766  0.689241   -1.704084  
3921  

<h3>This is the modeling step for the capstone project</h3>
<ol>We will look at several models including:
    <li>Ordinal Regression with various kernals</li>
    <ol>
        <li>Probit</li>
        <li>Logit</li>
        <li>One Customer Kernal</li>
    </ol>
    <li>Tree Regression</li>
    <ol>
        <li>Random Forest Regression</li>
        <li>Other forest methodologies</li>
    </ol>
</ol>

<h3>Load additional required packages</h3>

In [39]:
import scipy.stats as stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

In [62]:
# Logit Model First
mod_prob = OrderedModel(y_train, df_scaled_X_train, distr='logit')
res_prob = mod_prob.fit(method='bfgs')
res_prob.summary()

Optimization terminated successfully.
         Current function value: 1.083973
         Iterations: 62
         Function evaluations: 63
         Gradient evaluations: 63


0,1,2,3
Dep. Variable:,quality,Log-Likelihood:,-5633.4
Model:,OrderedModel,AIC:,11300.0
Method:,Maximum Likelihood,BIC:,11420.0
Date:,"Tue, 19 Apr 2022",,
Time:,22:26:50,,
No. Observations:,5197,,
Df Residuals:,5179,,
Df Model:,18,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
fixed acidity,0.3509,0.064,5.455,0.000,0.225,0.477
volatile acidity,-0.6657,0.042,-16.025,0.000,-0.747,-0.584
citric acid,-0.0087,0.034,-0.254,0.800,-0.076,0.058
residual sugar,0.8589,0.088,9.709,0.000,0.686,1.032
chlorides,-0.0695,0.035,-1.977,0.048,-0.138,-0.001
free sulfur dioxide,0.3054,0.041,7.423,0.000,0.225,0.386
total sulfur dioxide,-0.2266,0.054,-4.205,0.000,-0.332,-0.121
density,-0.9667,0.139,-6.953,0.000,-1.239,-0.694
pH,0.2538,0.045,5.671,0.000,0.166,0.341


<h3>We should drop citric acid as it has a large p-value and is colinear with acidity</h3>

In [63]:
df_scaled_X_train.drop('citric acid', inplace=True, axis=1)

<h3>Re-run the model</h3>

In [65]:
# Without citric acid
mod_prob = OrderedModel(y_train, df_scaled_X_train, distr='logit')
res_prob = mod_prob.fit(method='bfgs')
res_prob.summary()

Optimization terminated successfully.
         Current function value: 1.083979
         Iterations: 59
         Function evaluations: 60
         Gradient evaluations: 60


0,1,2,3
Dep. Variable:,quality,Log-Likelihood:,-5633.4
Model:,OrderedModel,AIC:,11300.0
Method:,Maximum Likelihood,BIC:,11410.0
Date:,"Tue, 19 Apr 2022",,
Time:,22:33:34,,
No. Observations:,5197,,
Df Residuals:,5180,,
Df Model:,17,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
fixed acidity,0.3476,0.063,5.515,0.000,0.224,0.471
volatile acidity,-0.6622,0.039,-16.889,0.000,-0.739,-0.585
residual sugar,0.8594,0.088,9.716,0.000,0.686,1.033
chlorides,-0.0710,0.035,-2.047,0.041,-0.139,-0.003
free sulfur dioxide,0.3055,0.041,7.428,0.000,0.225,0.386
total sulfur dioxide,-0.2281,0.054,-4.257,0.000,-0.333,-0.123
density,-0.9683,0.139,-6.973,0.000,-1.241,-0.696
pH,0.2547,0.045,5.714,0.000,0.167,0.342
sulphates,0.3039,0.034,8.909,0.000,0.237,0.371


In [67]:
predicted = res_prob.model.predict(res_prob.params, which = 'cumprob')


In [69]:
predicted[1]

array([4.20322672e-04, 3.82526978e-03, 8.63726010e-02, 5.65682502e-01,
       9.31322006e-01, 9.98285074e-01, 1.00000000e+00])