# Wine Quality Python Project

Yi Ren

The [wine quality](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) dataset comes from the UCI Machine Learning Repository. This homework walks through reading the data, combining it, creating or manipulating variables, and fitting and choosing a final model for both multiple linear regression and logistic regression.

## Read in and Combine Data
### Read in two .csv files
Read in the `winequality-red.csv` and `winequality-white.csv` files available on the UCI Machine Learning Repository site. Both files are semicolon-delimited.

In [59]:
import pandas as pd
red_df = pd.read_csv('winequality-red.csv', sep = ';')
white_df = pd.read_csv('winequality-white.csv', sep = ';')

### Combine datasets and label the wine type
Combine the two datasets and create a new variable that represents the type of wine (red or white).

In [60]:
red_df['type'] = 'red'
white_df['type'] = 'white'
wine_df = pd.concat([red_df,white_df])
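
As a small illustration of the label-and-stack step, the sketch below uses two tiny made-up frames (the values are invented, not taken from the wine files). It also passes `ignore_index = True`, an optional choice that the cell above does not make; without it the combined frame keeps each file's original row labels, so index values repeat.

```python
import pandas as pd

# Toy stand-ins for the two wine files (values are invented for illustration)
red_toy = pd.DataFrame({'alcohol': [9.4, 9.8], 'pH': [3.51, 3.20]})
white_toy = pd.DataFrame({'alcohol': [8.8, 9.5, 10.1], 'pH': [3.00, 3.30, 3.26]})

# Tag each frame with its wine type before stacking
red_toy['type'] = 'red'
white_toy['type'] = 'white'

# ignore_index = True renumbers the rows 0..n-1 instead of repeating labels
toy_df = pd.concat([red_toy, white_toy], ignore_index = True)

print(toy_df['type'].value_counts())
print(toy_df.index.is_unique)
```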

## Split the Data

Split the data set into a training set (80%) and a test set (20%), with `alcohol` held out as the response.

In [61]:
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    wine_df.drop('alcohol', axis = 1),
    wine_df['alcohol'], 
    test_size = 0.20,
    random_state = 40)

## Regression Task (alcohol as Response)

### Train Models
#### 1. Fit four different multiple linear regression models

+ MLR with fixed acidity, residual sugar, density, pH and sulphates

In [62]:
from sklearn.model_selection import cross_validate
from sklearn import linear_model
mlr1 = linear_model.LinearRegression()
cv1 = cross_validate(mlr1,
                     X_train[['fixed acidity','residual sugar','density','pH','sulphates']].values,
                     y_train.values,
                     cv = 5,
                     scoring = ('r2', 'neg_mean_squared_error'))

+ MLR with fixed acidity, residual sugar, density, pH and sulphates, plus their pairwise interaction terms

In [63]:
from sklearn.preprocessing import PolynomialFeatures
mlr2 = linear_model.LinearRegression()
poly = PolynomialFeatures(interaction_only = True, include_bias = False)
cv2 = cross_validate(mlr2,
                     poly.fit_transform(X_train[['fixed acidity','residual sugar','density','pH','sulphates']].values),
                     y_train.values,
                     cv = 5,
                     scoring = ('r2', 'neg_mean_squared_error'))

+ MLR with residual sugar, density and pH with quadratic terms and interaction terms.

In [64]:
mlr3 = linear_model.LinearRegression()
poly2 = PolynomialFeatures(degree = 2, include_bias = False)
cv3 = cross_validate(mlr3,
                     poly2.fit_transform(X_train[['residual sugar','density', 'pH']].values),
                     y_train.values,
                     cv = 5,
                     scoring = ('r2', 'neg_mean_squared_error'))

+ MLR with residual sugar, density and pH with quadratic terms, cubic terms and interaction terms.

In [65]:
mlr4 = linear_model.LinearRegression()
poly3 = PolynomialFeatures(degree = 3, include_bias = False)
cv4 = cross_validate(mlr4,
                     poly3.fit_transform(X_train[['residual sugar','density', 'pH']].values),
                     y_train.values,
                     cv = 5,
                     scoring = ('r2', 'neg_mean_squared_error'))

Use 5-fold CV to select the best MLR model by comparing the negative MSE summed over the folds.

In [66]:
print(round(sum(cv1['test_neg_mean_squared_error']),4),
      round(sum(cv2['test_neg_mean_squared_error']),4),
      round(sum(cv3['test_neg_mean_squared_error']),4),
      round(sum(cv4['test_neg_mean_squared_error']),4))

-1.7386 -1.5164 -2.4233 -2.3054


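These values are negative MSEs summed over the five folds, so the value closest to zero is best. The second model (main effects plus interactions) has the best score, -1.5164, which corresponds to a mean CV MSE of about 0.303.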
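
The arithmetic behind this comparison can be sketched as follows. The four sums are copied from the printed output above; the dictionary keys are just labels chosen for this example. Taking the square root of the mean MSE gives an RMSE-like figure on the scale of `alcohol`, which is not necessarily equal to the average of per-fold RMSEs.

```python
import math

# Summed negative MSE over 5 folds, copied from the printed output above
summed_neg_mse = {'mlr1': -1.7386, 'mlr2': -1.5164, 'mlr3': -2.4233, 'mlr4': -2.3054}
n_folds = 5

for name, total in summed_neg_mse.items():
    mean_mse = -total / n_folds          # flip the sign and average over folds
    rmse = math.sqrt(mean_mse)           # back on the scale of the response
    print(name, round(mean_mse, 4), round(rmse, 4))

# The best model has the negative MSE closest to zero, i.e. the largest value
best = max(summed_neg_mse, key = summed_neg_mse.get)
print(best)   # mlr2
```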

#### 2. Fit a LASSO model

In [67]:
from sklearn.linear_model import LassoCV
reg_Lasso = LassoCV(cv = 5, random_state = 0).fit(X_train[['fixed acidity','residual sugar','density','pH','sulphates']].values, y_train.values)

`LassoCV` above selects the tuning parameter by 5-fold CV; refit a Lasso with that value on the full training data.

In [68]:
from sklearn.linear_model import Lasso
# Note: this assignment rebinds the name `Lasso` from the class to the fitted model
Lasso = Lasso(alpha = reg_Lasso.alpha_).fit(X_train[['fixed acidity','residual sugar','density','pH','sulphates']].values, y_train.values)
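
`LassoCV` refits on the full data it was given once it has picked `alpha_`, so the explicit refit above should give essentially the same coefficients as `reg_Lasso` itself. The sketch below checks this on synthetic data (invented, not the wine data); the names `lasso_cv` and `lasso_refit` are hypothetical and exist only in this example.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# Synthetic regression data (invented; stands in for the wine predictors)
rng = np.random.default_rng(0)
X_syn = rng.normal(size = (200, 5))
y_syn = X_syn @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale = 0.5, size = 200)

# LassoCV chooses alpha by 5-fold CV, then refits on all of X_syn with that alpha
lasso_cv = LassoCV(cv = 5, random_state = 0).fit(X_syn, y_syn)

# A plain Lasso refit with the selected alpha, mirroring the cell above
lasso_refit = Lasso(alpha = lasso_cv.alpha_).fit(X_syn, y_syn)

# Largest coefficient gap between the two fits (expected to be tiny)
print(lasso_cv.alpha_, np.max(np.abs(lasso_cv.coef_ - lasso_refit.coef_)))
```

If the gap is negligible, `reg_Lasso.predict` could be used directly and the separate refit becomes optional.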

#### 3. Fit a Ridge Regression Model

In [69]:
from sklearn.linear_model import RidgeCV
reg_Ridge = RidgeCV(cv = 5).fit(X_train[['fixed acidity','residual sugar','density','pH','sulphates']].values, y_train.values)

`RidgeCV` above selects the tuning parameter by 5-fold CV; refit a Ridge model with that value on the full training data.

In [70]:
from sklearn.linear_model import Ridge
# Note: this assignment rebinds the name `Ridge` from the class to the fitted model
Ridge = Ridge(alpha = reg_Ridge.alpha_).fit(X_train[['fixed acidity','residual sugar','density','pH','sulphates']].values, y_train.values)

#### 4. Fit an Elastic Net Model

In [71]:
from sklearn.linear_model import ElasticNetCV
reg_EN = ElasticNetCV(cv = 5,
                     random_state = 0,
                     l1_ratio = [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.96, 0.98, 0.99, 1],
                     n_alphas = 50).fit(X_train[['fixed acidity','residual sugar','density','pH','sulphates']].values, y_train.values)

Refit on full training data with best tuning parameters

In [72]:
from sklearn.linear_model import ElasticNet
EN = ElasticNet(alpha = reg_EN.alpha_, l1_ratio = reg_EN.l1_ratio_).fit(X_train[['fixed acidity','residual sugar','density','pH','sulphates']].values, y_train.values)

### Test Models
Compare the four selected models (the second MLR, Lasso, Ridge and Elastic Net) on the test set.

In [73]:
from sklearn.model_selection import cross_val_predict
mlr2 = linear_model.LinearRegression()
poly = PolynomialFeatures(interaction_only = True, include_bias = False)
mlr_pred = cross_val_predict(mlr2,
                     poly.fit_transform(X_test[['fixed acidity','residual sugar','density','pH','sulphates']].values),
                     y_test.values,
                     cv = 5)
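
Note that `cross_val_predict` on the test set refits the model on folds of the test data, so these predictions do not come from a model trained on the training set. One alternative, sketched below on synthetic data (invented, not the wine data), is to wrap the feature expansion and the regression in a pipeline, fit it on the training rows, and only predict on the test rows. The names `mlr_pipe`, `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with one true interaction (invented for illustration)
rng = np.random.default_rng(40)
X_syn = rng.normal(size = (300, 3))
y_syn = X_syn[:, 0] + 2 * X_syn[:, 1] * X_syn[:, 2] + rng.normal(scale = 0.1, size = 300)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size = 0.20, random_state = 40)

# The pipeline learns the feature expansion and the coefficients from training rows only
mlr_pipe = make_pipeline(
    PolynomialFeatures(interaction_only = True, include_bias = False),
    LinearRegression())
mlr_pipe.fit(X_tr, y_tr)

# Test rows are used only for prediction, never for fitting
test_rmse = np.sqrt(mean_squared_error(y_te, mlr_pipe.predict(X_te)))
print(round(test_rmse, 3))
```

Applied to the wine data, this would put the MLR on the same footing as the Lasso, Ridge and Elastic Net predictions, although the resulting metrics would differ from the ones printed in this document.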

In [74]:
Lasso_pred = Lasso.predict(X_test[['fixed acidity','residual sugar','density','pH','sulphates']].values)
Ridge_pred = Ridge.predict(X_test[['fixed acidity','residual sugar','density','pH','sulphates']].values)
EN_pred = EN.predict(X_test[['fixed acidity','residual sugar','density','pH','sulphates']].values)

+ Using RMSE as model metric

In [75]:
from sklearn.metrics import mean_squared_error
print([np.sqrt(mean_squared_error(y_test, mlr_pred)),
       np.sqrt(mean_squared_error(y_test, Lasso_pred)),
       np.sqrt(mean_squared_error(y_test, Ridge_pred)),
       np.sqrt(mean_squared_error(y_test, EN_pred))])

[0.9259848727020377, 1.1488355339128369, 1.06468740113817, 1.148970504239183]


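The second MLR has the lowest RMSE (0.926). Its predictions, however, come from `cross_val_predict` run on the test set, so that model is refit on test folds, while the other three models were fit on the training set; the comparison is therefore not like-for-like. Among the penalized models, Ridge (1.065) outperforms Lasso (1.1488) and Elastic Net (1.1490), which are nearly identical.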

+ Using MAE as model metric

In [76]:
print([sum(abs(mlr_pred-y_test))/len(y_test),
       sum(abs(Lasso_pred-y_test))/len(y_test),
       sum(abs(Ridge_pred-y_test))/len(y_test),
       sum(abs(EN_pred-y_test))/len(y_test)])

[0.4367529343587056, 0.9306031779209479, 0.8541579493412735, 0.9312620451368903]


The second MLR also has the lowest MAE (0.437), though its predictions come from `cross_val_predict` on the test set rather than from a model fit on the training set. Ridge (0.854) again outperforms Lasso (0.931) and Elastic Net (0.931), which are nearly identical.
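
For reference, the two metrics can be checked by hand on a toy example (invented values). The sketch also uses `mean_absolute_error` from `sklearn.metrics`, which this document does not call but which computes the same quantity as the manual MAE formula above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy values (invented): two exact predictions and one miss by 2
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

# MAE averages absolute errors: (0 + 0 + 2) / 3
mae_manual = np.mean(np.abs(y_pred - y_true))
mae_sklearn = mean_absolute_error(y_true, y_pred)

# RMSE squares errors before averaging, so the single large miss weighs more: sqrt(4 / 3)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(round(mae_manual, 4), round(mae_sklearn, 4), round(rmse, 4))   # 0.6667 0.6667 1.1547
```

Because RMSE penalizes large errors more heavily than MAE, a model's RMSE is always at least as large as its MAE, and a wide gap between the two suggests a few large misses.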

## Classification Task (Wine Type as Response)

### Split the data
Split the data set into a training set and a test set, this time with `type` as the response.

In [77]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    wine_df.drop('type', axis = 1),
    wine_df['type'], 
    test_size = 0.20,
    random_state = 40)

### Fit logistic regression
Use negative log-loss as the scoring metric when choosing tuning parameters during training.
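
As a reminder of what this metric measures, here is a minimal hand computation of binary log-loss on toy labels and probabilities (invented values). The helper `binary_log_loss` is written for this example and is not part of scikit-learn; the `'neg_log_loss'` scorer is simply the negative of this quantity, so that larger is better.

```python
import math

# Toy labels (1 = white, 0 = red) and predicted probabilities of "white" (invented)
y_true = [1, 0, 1, 1]
p_white = [0.9, 0.2, 0.6, 0.99]

def binary_log_loss(labels, probs):
    """Average of -[y*log(p) + (1 - y)*log(1 - p)] over the observations."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

loss = binary_log_loss(y_true, p_white)
print(round(loss, 4))    # lower is better; the 'neg_log_loss' scorer reports -loss
```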

#### 1. Fit a Logistic Regression Model without penalty

In [78]:
from sklearn.linear_model import LogisticRegression
log_reg_full = LogisticRegression(solver = 'newton-cholesky',
                                  penalty = None,
                                  random_state = 0)

In [79]:
log_reg_full.fit(X_train2, y_train2)

#### 2. Fit a Lasso (L1-penalized) Logistic Regression Model

In [80]:
from sklearn.linear_model import LogisticRegressionCV
log_reg_cv1 = LogisticRegressionCV(cv = 5,
                                   solver = 'saga',
                                   penalty = 'l1',
                                   Cs = 250,
                                   scoring = 'neg_log_loss',
                                   max_iter = 8000,
                                   random_state = 5)

In [81]:
log_reg_cv1.fit(X_train2, y_train2)

In [82]:
log_reg_lasso = LogisticRegression(solver = 'saga',
                                   penalty = 'l1',
                                   C = log_reg_cv1.C_[0],
                                   max_iter = 8000,
                                   random_state = 10)
log_reg_lasso_fit = log_reg_lasso.fit(X_train2, y_train2)
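
In scikit-learn's logistic regression, `C` is the inverse of the penalty strength, so the `C_` chosen by CV controls how many coefficients the L1 penalty shrinks to exactly zero. The sketch below illustrates this on synthetic data (invented, not the wine data). It uses the `liblinear` solver rather than `saga`, purely as a convenience for a small example, and `count_zero_coefs` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data: only the first of six features carries signal (invented)
rng = np.random.default_rng(5)
X_syn = rng.normal(size = (200, 6))
y_syn = (X_syn[:, 0] + rng.normal(scale = 0.5, size = 200) > 0).astype(int)

def count_zero_coefs(C):
    """Fit an L1-penalized logistic regression and count coefficients that are exactly zero."""
    model = LogisticRegression(solver = 'liblinear', penalty = 'l1', C = C)
    model.fit(X_syn, y_syn)
    return int(np.sum(model.coef_ == 0))

# C is the inverse of the penalty strength: small C means a strong penalty
zeros_strong = count_zero_coefs(0.001)
zeros_weak = count_zero_coefs(100.0)
print(zeros_strong, zeros_weak)
```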

#### 3. Fit a Ridge (L2-penalized) Logistic Regression Model

In [83]:
log_reg_cv2 = LogisticRegressionCV(cv = 5,
                                   solver = 'saga',
                                   penalty = 'l2',
                                   Cs = 250,
                                   max_iter = 8000,
                                   scoring = 'neg_log_loss',
                                   random_state = 10)

In [84]:
log_reg_cv2.fit(X_train2, y_train2)

In [85]:
log_reg_ridge = LogisticRegression(solver = 'saga',
                                    penalty = 'l2',
                                    C = log_reg_cv2.C_[0],
                                    max_iter = 8000,
                                    random_state = 10)
log_reg_ridge_fit = log_reg_ridge.fit(X_train2, y_train2)

#### 4. Fit an Elastic Net Model

In [86]:
log_reg_cv3 = LogisticRegressionCV(cv = 5,
                                   solver = 'saga',
                                   penalty = 'elasticnet',
                                   Cs = 250,
                                   max_iter = 8000,
                                   scoring = 'neg_log_loss',
                                   l1_ratios = [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.96, 0.98, 0.99, 1],
                                   tol = 0.001,
                                   random_state = 10)

In [87]:
log_reg_cv3.fit(X_train2, y_train2)

In [95]:
from sklearn.linear_model import LogisticRegression
log_reg_elastic = LogisticRegression(solver = 'saga',
                                     penalty = 'elasticnet',
                                     l1_ratio = log_reg_cv3.l1_ratio_[0],
                                     C = log_reg_cv3.C_[0],
                                     max_iter = 8000,
                                     random_state = 5)
log_reg_elastic_fit = log_reg_elastic.fit(X_train2, y_train2)

### Compare your models on both log-loss and accuracy

In [96]:
cv_proba_preds_full = log_reg_full.predict_proba(X_test2)
cv_proba_preds_lasso = log_reg_lasso.predict_proba(X_test2)
cv_proba_preds_ridge = log_reg_ridge.predict_proba(X_test2)
cv_proba_preds_elastic = log_reg_elastic.predict_proba(X_test2)

In [97]:
from sklearn.metrics import log_loss, accuracy_score
print([log_loss(y_test2, cv_proba_preds_full),
       log_loss(y_test2, cv_proba_preds_lasso),
       log_loss(y_test2, cv_proba_preds_ridge),
       log_loss(y_test2, cv_proba_preds_elastic)])

[0.058538477010746014, 0.09254222903934638, 0.09254229518442608, 0.09254167348009674]


In [98]:
cv_preds_full = log_reg_full.predict(X_test2)
cv_preds_lasso = log_reg_lasso.predict(X_test2)
cv_preds_ridge = log_reg_ridge.predict(X_test2)
cv_preds_elastic = log_reg_elastic.predict(X_test2)
print([accuracy_score(y_test2, cv_preds_full),
       accuracy_score(y_test2, cv_preds_lasso),
       accuracy_score(y_test2, cv_preds_ridge),
       accuracy_score(y_test2, cv_preds_elastic)])

[0.9938461538461538, 0.9769230769230769, 0.9769230769230769, 0.9769230769230769]


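The best logistic regression model is the full (unpenalized) model, which has the lowest log-loss (0.059) and the highest accuracy (0.994) on the test set. The three penalized models are nearly indistinguishable, each with a log-loss of about 0.093 and an accuracy of 0.977.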