## Step 1: Importing the required libraries

We rely mainly on the scikit-learn library, which ships with many ready-made algorithms and helper functions.

In [None]:
# import all required modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
%precision 2
%matplotlib inline

## Step 2: Importing dataset

In [None]:
# load dataset from the directory where your dataset is saved.
data = pd.read_csv('winequality-red.csv', sep=',')

In [None]:
# show dataset head to have a feel of the type of data you're working with.
data.head()

In [None]:
# confirm there are no null values
# null values can hurt the accuracy of your model. If any are found, either drop the
# affected features or fill the gaps, for example with the column average.
data.info()
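
The null check and the "method of averages" mentioned in the comments above could look roughly like the sketch below. The `toy` frame and its values are made up for illustration and are not taken from the wine dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value; not the wine data.
toy = pd.DataFrame({
    "pH": [3.2, np.nan, 3.4],
    "alcohol": [9.4, 9.8, 10.0],
})

# Count the missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Replace each missing value with the mean of its own column.
toy_filled = toy.fillna(toy.mean())
print(toy_filled)
```

On the real data, the same two calls would be applied to `data` instead of `toy`.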

In [None]:
data['quality'].value_counts()

## Step 3: Preprocessing data

Here we define the rule that separates good wine from bad wine, transform the data accordingly, and assign the X and y variables.

In [None]:
# preprocess the data
# starting from good and bad wine quality rules
bins = (2, 6.5, 8)
group_names = ['bad', 'good']
data['quality'] = pd.cut(data['quality'], bins = bins, labels = group_names)
data['quality'].unique()

In [None]:
# use the labelencoder
data['quality'] = LabelEncoder().fit_transform(data['quality'])
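
To see what the binning and encoding cells above do, here is a small sketch on a handful of made-up scores (the `scores` series is hypothetical). It assumes the same bins and group names as above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical quality scores, chosen to fall on both sides of the 6.5 cut.
scores = pd.Series([3, 5, 6, 7, 8])

# pd.cut intervals are right-closed: (2, 6.5] -> 'bad', (6.5, 8] -> 'good'.
labels = pd.cut(scores, bins=(2, 6.5, 8), labels=['bad', 'good'])
print(labels.tolist())

# LabelEncoder sorts class names alphabetically, so 'bad' -> 0 and 'good' -> 1.
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels.astype(str))
print(list(encoder.classes_), encoded.tolist())
```

Any score outside the interval (2, 8] would fall outside both bins and come back as NaN, so it is worth checking the range of `quality` before binning.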

In [None]:
# show dataset head again to confirm quality transformation
data.head(10)

In [None]:
# assign X and y features
X = data.drop('quality', axis = 1)
y = data['quality']

Let's plot the number of good versus bad wines.

In [None]:
# pass the column by keyword; recent seaborn versions expect x as a keyword argument
sns.countplot(x='quality', data=data)

In [None]:
data['quality'].value_counts()

In [None]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
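
If the counts above show that one class is much larger than the other, one option (not used in the split above) is to pass `stratify=y` so that both sets keep the same class ratio. A minimal sketch with made-up labels, where `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# With stratify, train and test both keep the original share of ones.
print('share of ones in train:', y_tr.mean())
print('share of ones in test:', y_te.mean())
```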

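To keep features with large numeric ranges from dominating the others, we use the StandardScaler from sklearn to put all features on a comparable scale. The scaler is fitted on the training set only and then applied unchanged to the test set.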

In [None]:
# apply standard scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
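
The effect of the scaling cell above can be checked on a tiny made-up example. The `train` and `test` arrays below are hypothetical. The point is that the mean and standard deviation are learned from the training rows and reused for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0))  # close to 0 for each column
print(train_scaled.std(axis=0))   # close to 1 for each column
print(test_scaled)
```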

## Step 4: Applying Machine Learning Algorithms

The sklearn library also comes with several algorithms that can be applied easily and quickly. Since this is a classification problem, where the output can only be good (1) or bad (0), let's try four well-known algorithms: random forest, logistic regression, a support vector machine, and a neural network. You may also include decision trees in your own version of the code.

# Random Forest Classifier

In [None]:
%%time
# %%time runs the cell once and keeps its variables (%%timeit would discard them)
rfc = RandomForestClassifier(n_estimators = 200)
# fit the model on the training data
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)
pred_rfc[:20]

### RFC model accuracy 

In [None]:
# use the confusion_matrix and classification_report to measure model accuracy
print(classification_report(y_test, pred_rfc))
print(confusion_matrix(y_test, pred_rfc))
rfc_accuracy = accuracy_score(y_test, pred_rfc)
print('Model Accuracy:', str(rfc_accuracy * 100) + '%')
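
Accuracy alone can flatter a model when one class is much more common than the other. One way to put the score in context is to compare it with a classifier that always predicts the most frequent class. The sketch below uses sklearn's `DummyClassifier`, which is not part of the original notebook. The label arrays are made up and stand in for `y_train` and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels standing in for y_train / y_test.
y_train_demo = np.array([0] * 86 + [1] * 14)
y_test_demo = np.array([0] * 17 + [1] * 3)

# This strategy ignores the features, so a column of zeros is enough here.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)
pred_baseline = baseline.predict(np.zeros((len(y_test_demo), 1)))

baseline_accuracy = accuracy_score(y_test_demo, pred_baseline)
print('Baseline accuracy:', baseline_accuracy)
```

A model is only adding value to the extent that it beats this baseline, which is also why the per-class precision and recall in the classification report are worth reading alongside the accuracy.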

The random forest algorithm is about 90% accurate. That is pretty good, but can we do better? Let's try three more algorithms.

# Logistic Regression

In [None]:
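%%time
# %%time keeps lr and pred_lr available to later cells (%%timeit would discard them)
lr = LogisticRegression(random_state=10, solver='liblinear')
# fit the model on the training data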
lr.fit(X_train, y_train)
pred_lr = lr.predict(X_test)
pred_lr[:20]

### Logistic Regression accuracy

In [None]:
print(classification_report(y_test, pred_lr))
print(confusion_matrix(y_test, pred_lr))
lr_accuracy = accuracy_score(y_test, pred_lr)
print('Model Accuracy:', str(lr_accuracy * 100) + '%')

This ran much faster than RFC while maintaining a decent accuracy score.

# SVM Classifier

In [None]:
%%time
# %%time keeps clf and pred_clf available to later cells (%%timeit would discard them)
clf = svm.SVC()
# fit the model on the training data
clf.fit(X_train, y_train)
pred_clf = clf.predict(X_test)
pred_clf[:20]

### SVM model accuracy

In [None]:
print(classification_report(y_test, pred_clf))
print(confusion_matrix(y_test, pred_clf))
clf_accuracy = accuracy_score(y_test, pred_clf)
print('Model Accuracy:', str(clf_accuracy * 100) + '%')

The SVM classifier also did a good job. Still, with an accuracy of about 90%, the Random Forest Classifier remains our go-to algorithm. What about neural networks? Let's see how one performs on our data.

# Neural Networks

In [None]:
%%time
# %%time keeps mlpc and pred_mlpc available to later cells (%%timeit would discard them)
mlpc = MLPClassifier(hidden_layer_sizes = (11, 11, 11), max_iter = 2000)
# fit the model on the training data
mlpc.fit(X_train, y_train)
pred_mlpc = mlpc.predict(X_test)
pred_mlpc[:20]

### Neural Networks accuracy

In [None]:
print(classification_report(y_test, pred_mlpc))
print(confusion_matrix(y_test, pred_mlpc))
mlpc_accuracy = accuracy_score(y_test, pred_mlpc)
print('Model Accuracy:', str(mlpc_accuracy * 100) + '%')

This is really good as well. It shows that the Random Forest Classifier, Logistic Regression, or Neural Networks would each be a reasonable choice for this problem. When results are this close, you may instead choose an algorithm based on how long it takes to run.

In this case, the fastest algorithm that still holds a decent accuracy score is Logistic Regression.
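
Outside a notebook, the same runtime comparison can be sketched with `time.perf_counter`. The data below is synthetic, generated with `make_classification` as a hypothetical stand-in with the same number of features as the wine data, so the timings only illustrate the method.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical), with 11 features like the wine data.
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=42)

models = {
    'RFC': RandomForestClassifier(n_estimators=200, random_state=42),
    'LR': LogisticRegression(solver='liblinear', random_state=10),
}

# Time a single fit of each model.
fit_seconds = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    fit_seconds[name] = time.perf_counter() - start
    print(f'{name}: fit in {fit_seconds[name]:.3f} s')
```

A single run is noisy, so repeating the fit a few times and averaging gives a fairer comparison.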

# Test New Data

In [None]:
data.head(10)

In [None]:
# test new data
Xnew = [[7.5, 0.60, 0.06, 1.5, 0.055, 13.0, 30.0, 0.9950, 3.33, 0.56, 9.8]]
Xnew = sc.transform(Xnew)
ynew_rfc = rfc.predict(Xnew)
ynew_lr = lr.predict(Xnew)
ynew_clf = clf.predict(Xnew)
ynew_mlpc = mlpc.predict(Xnew)
print('Wine quality (RFC):', ynew_rfc)
print('Wine quality (LR):', ynew_lr)
print('Wine quality (SVM):', ynew_clf)
print('Wine quality (Neural Networks):', ynew_mlpc)

In [None]:
# test existing good wine
wine = [[7.8, 0.58, 0.02, 2.0, 0.073, 9.0, 18.0, 0.9968, 3.36, 0.57, 9.5]]
wine = sc.transform(wine)
ynew_rfc = rfc.predict(wine)
ynew_lr = lr.predict(wine)
ynew_clf = clf.predict(wine)
ynew_mlpc = mlpc.predict(wine)
print('Wine quality (RFC):', ynew_rfc)
print('Wine quality (LR):', ynew_lr)
print('Wine quality (SVM):', ynew_clf)
print('Wine quality (Neural Networks):', ynew_mlpc)
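
The cells above pass plain lists to a scaler that was fitted on a DataFrame, which recent scikit-learn versions may flag with a feature-name warning. One alternative is to bundle the scaler and the model in a pipeline and pass new samples as a DataFrame with the training column names. The sketch below assumes this approach. It uses synthetic data and placeholder column names (`feature_0` and so on), not the wine columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), with 11 named columns.
X_arr, y_demo = make_classification(n_samples=200, n_features=11, random_state=42)
columns = [f'feature_{i}' for i in range(11)]
X_demo = pd.DataFrame(X_arr, columns=columns)

# The pipeline scales and then classifies, so raw rows can go straight to predict.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))
pipe.fit(X_demo, y_demo)

# Build the new sample with the same column names that were used in fit.
new_row = pd.DataFrame([X_arr[0]], columns=columns)
prediction = pipe.predict(new_row)

# Map the encoded class back to the names used earlier: 0 -> 'bad', 1 -> 'good'.
label = {0: 'bad', 1: 'good'}[int(prediction[0])]
print('Predicted quality:', label)
```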

What else? A good next step would be to try more algorithms, such as decision trees or k-nearest neighbors. You can then time each one, for example with the "%%time" cell magic, and choose the algorithm that best balances accuracy and runtime.
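
As a starting point, the two suggested algorithms could be tried as in the sketch below. It runs on synthetic data (a hypothetical stand-in for the wine features) and uses default-style settings that are assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), with 11 features like the wine data.
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# Scale with statistics learned from the training split only.
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

candidates = {
    'Decision tree': DecisionTreeClassifier(random_state=42),
    'k-NN': KNeighborsClassifier(n_neighbors=5),
}

# Fit each candidate and record its test accuracy.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f'{name} accuracy: {scores[name] * 100:.1f}%')
```

In the notebook itself, the same loop would use the existing `X_train`, `X_test`, `y_train` and `y_test` instead of the synthetic split.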