# Building a Model to Predict Red Wine Quality

Using the UCI Machine Learning Red Wine Quality dataset, I'll demonstrate how to quickly train and compare different algorithms to build a model that classifies whether a wine is high quality or not.

Below is an outline of my work:

1. Import libraries and read in the dataset.
2. Review variable correlation and check dataset for nulls, of which there are none.
3. Plot wine quality ratings in a histogram and find the 50th percentile.
4. Build my supervisor ('IsHighQualityWine') from the 50th percentile value of the wine quality column.
5. Since all of the data is numerical, separate it into X (independent variables) and Y (the supervisor) to prepare for training.
6. Build a function that takes the name of the algorithm to train and test with and outputs metrics such as precision, recall, and f-score.
7. Finally, build one last function that runs the BuildModel function once for each of several algorithms.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import sklearn.metrics as metrics
from sklearn.metrics import roc_curve, confusion_matrix, f1_score, recall_score, precision_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

file_path = '/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv'
quality = pd.read_csv(file_path)
display(quality)

Get an idea of variable correlation:

In [None]:
# Use every column; iloc[:,1:] would silently drop the first feature column
correlation = quality.corr()
display(correlation)
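
To see which variables move most closely with the rating, one option is to pull the `quality` column out of the correlation matrix and sort it by absolute value. The sketch below is illustrative only: the `toy` DataFrame borrows three column names from the dataset, but its values are made up.

```python
import pandas as pd

# Made-up values for illustration only; not rows from the wine dataset
toy = pd.DataFrame({
    'alcohol': [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    'volatile acidity': [0.9, 0.8, 0.6, 0.5, 0.4, 0.3],
    'pH': [3.3, 3.1, 3.4, 3.2, 3.3, 3.2],
    'quality': [4, 5, 5, 6, 7, 7],
})

# Correlation of each feature with quality, strongest (by magnitude) first
target_corr = (
    toy.corr()['quality']
    .drop('quality')
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

The same three lines applied to the `quality` DataFrame would rank the real features.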

Check for nulls in the dataset:

In [None]:
quality.isna().sum()

Look at the frequency of wine quality ratings:

In [None]:
plt.hist(quality['quality'])

Find the 50th percentile of wine quality ratings and use that to build a supervisor, IsHighQualityWine:

In [None]:
isHQ_split = np.percentile(quality['quality'],50)
print('I will split wine quality values at ' + str(isHQ_split) + ' to build my supervisor.')
quality['IsHighQualityWine'] = np.where(quality['quality'] >= isHQ_split,1,0)
quality.head(10)
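
Because every wine rated at or above the threshold lands in the positive class, a median split does not guarantee two equally sized classes when many ratings tie at the median. A minimal sketch of the same labeling logic on a short, made-up list of ratings (the `toy` DataFrame is hypothetical) shows how to check the resulting class balance:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for illustration only; not taken from the wine dataset
toy = pd.DataFrame({'quality': [3, 4, 5, 5, 6, 6, 6, 6, 7, 8]})

# Same logic as above: split at the 50th percentile, ties go to the positive class
split = np.percentile(toy['quality'], 50)
toy['IsHighQualityWine'] = np.where(toy['quality'] >= split, 1, 0)

# Share of rows in each class
class_share = toy['IsHighQualityWine'].value_counts(normalize=True)
print(split)
print(class_share)
```

Running `quality['IsHighQualityWine'].value_counts(normalize=True)` on the real data gives the same check for the notebook's supervisor.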

My BuildModel function trains a model, evaluates it on a held-out test set, and prints the results. It supports four algorithms: Logistic Regression, Decision Tree, Random Forest, and Naive Bayes.

In [None]:
def BuildModel(df, Algorithm, ScaleData = 0):
    # Separate data into independent and dependent variables
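    # All columns except the last two (quality and IsHighQualityWine);
    # starting the slice at 1 would drop the first feature column
    X = df.iloc[:,:-2] # features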
    Y = df.iloc[:,-1] # supervisor

    # Split first so the scaler only ever sees training rows (no test-set leakage)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

    if ScaleData == 1:
        scale = StandardScaler()
        # Fit on the training rows, then apply the same statistics to the test rows
        X_train = pd.DataFrame(scale.fit_transform(X_train), index = X_train.index, columns = X_train.columns)
        X_test = pd.DataFrame(scale.transform(X_test), index = X_test.index, columns = X_test.columns)
    
    # Keep the training set in case it is needed later when evaluating the model
    TrainingSet = pd.merge(X_train, pd.DataFrame(Y_train), left_index  = True, right_index = True)
    
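    # Pick the classifier that matches the requested algorithm name
    if Algorithm == 'LogisticRegression':
        Classifier = LogisticRegression(max_iter=1000)
    elif Algorithm == 'DecisionTree':
        Classifier = tree.DecisionTreeClassifier()
    elif Algorithm == 'RandomForest':
        Classifier = RandomForestClassifier(n_estimators = 1000)
    elif Algorithm == 'NaiveBayes':
        Classifier = GaussianNB()
    else:
        # Fail early on a misspelled name instead of using an unbound Classifier below
        raise ValueError('Unknown algorithm: {}'.format(Algorithm))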
            
    Classifier = Classifier.fit(X_train,Y_train)
    Y_pred = Classifier.predict(X_test)
    
    # Evaluate the model on the held-out test set
    cf = confusion_matrix(Y_test,Y_pred)
    f_score = f1_score(Y_test, Y_pred)
    precision = precision_score(Y_test,Y_pred)
    recall = recall_score(Y_test,Y_pred)
    accuracy = accuracy_score(Y_test,Y_pred)

    print('Using the {} classifier, we receive a precision value of {}, recall value of {}, and f-score of {}. \n' \
          .format(Algorithm,round(precision,4),round(recall,4),round(f_score,4)))

I'll build a logistic regression model to test my function, first without scaling the features and then with scaling:

In [None]:
BuildModel(quality,'LogisticRegression')

In [None]:
BuildModel(quality,'LogisticRegression', 1)
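
One alternative way to express the scaled variant is scikit-learn's `Pipeline`, which fits the scaler on the training rows only and reuses those statistics on the test rows. The sketch below is illustrative, not the notebook's method: it uses synthetic data from `make_classification` in place of the wine features, and every variable name in it is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features and supervisor (assumption)
X_demo, Y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=0)

# The scaler is fit inside the pipeline, on the training split only
scaled_lr = Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
scaled_lr.fit(X_tr, Y_tr)

# predict() applies the already-fitted scaler before the classifier
demo_f_score = f1_score(Y_te, scaled_lr.predict(X_te))
print(round(demo_f_score, 4))
```

Bundling the two steps this way makes it hard to scale the test rows with statistics computed from them by accident.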

The BuildAndCompareModels function below calls the BuildModel function so a user can build and compare models that use different algorithms. To use it, I'll pass in a list such as ['LogisticRegression','DecisionTree'], and it will loop through the list and build each model specified. The output is the precision, recall, and f-score for each model that was built.

In [None]:
def BuildAndCompareModels(df, AlgArray, ScaleData = 0):
    # Build and report on one model per requested algorithm
    for Algorithm in AlgArray:
        BuildModel(df,Algorithm,ScaleData)

I'll build a model with each of the four algorithms and see how they compare.

In [None]:
BuildAndCompareModels(quality,['LogisticRegression','NaiveBayes','DecisionTree','RandomForest'])
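
Printed sentences are hard to compare by eye. A possible variation, sketched here under stated assumptions, collects the metrics into one table instead. It runs on synthetic data from `make_classification` rather than the wine data, and the `demo_models` mapping and `comparison` table are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features and supervisor (assumption)
X_demo, Y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=0)

# Hypothetical mapping from algorithm name to an unfitted classifier
demo_models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'NaiveBayes': GaussianNB(),
    'DecisionTree': DecisionTreeClassifier(random_state=0),
}

# One row of test-set metrics per algorithm
rows = []
for name, model in demo_models.items():
    pred = model.fit(X_tr, Y_tr).predict(X_te)
    rows.append({
        'Algorithm': name,
        'Precision': round(precision_score(Y_te, pred), 4),
        'Recall': round(recall_score(Y_te, pred), 4),
        'FScore': round(f1_score(Y_te, pred), 4),
    })

comparison = pd.DataFrame(rows).set_index('Algorithm')
print(comparison)
```

A table like this can also be sorted by any metric, which helps once more than a handful of algorithms are in play.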

# Summary:
* Logistic Regression, Naive Bayes, and Decision Tree all performed about the same
* We got slightly higher performance with Random Forest
* Random Forest took only marginally longer to run than the other three algorithms

To get the best predictions at a low compute cost, I would choose to deploy the Random Forest model. It's also worth noting that Random Forest is not much more difficult to explain to a client than something like Logistic Regression, so if there were a business need to keep the deployed model simple, this selection would meet that need.
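
A compute-time comparison like the one in the summary can be measured rather than eyeballed. The sketch below is one way to do that, assuming synthetic data from `make_classification` and a smaller forest (100 trees instead of the notebook's 1000) so that it runs quickly; the `candidates` and `fit_seconds` names are hypothetical.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption), not the wine dataset
X_demo, Y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Hypothetical candidates; 100 trees keeps the sketch quick (the notebook uses 1000)
candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Wall-clock time spent fitting each model
fit_seconds = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X_demo, Y_demo)
    fit_seconds[name] = time.perf_counter() - start

for name, seconds in fit_seconds.items():
    print('{}: {:.3f} s to fit'.format(name, seconds))
```

Timings vary by machine and by run, so repeating the measurement a few times gives a steadier picture than a single pass.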