# Red Wine Quality Prediction
The goal of this analysis and model is to predict the quality of red wine based on its measured features. Quality is scored on a scale of 0 to 10.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
wine = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

## Variable Identification
First I will explore each variable. I want to find out the data type of each one and how many null entries the dataset contains.

In [None]:
wine.info()

Wow! No missing entries!

In [None]:
wine.describe()

It looks like we have no wines with a quality below 3 or above 8, since these are the min and max of the quality target column.

In [None]:
wine.head()

## Univariate Analysis
Now I will visualize some features to look for outliers and any interesting patterns.

In [None]:
sb.countplot(x="quality", data=wine)

We have a lot of wines with a quality of 5 or 6, but the other quality levels have few entries or none at all. This class imbalance needs to be dealt with.
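
To put numbers on what the count plot shows, a small helper like the one below can report the count and share of each quality level. This is only a sketch: `class_distribution` is a hypothetical helper name, and the `toy` frame holds made-up values rather than the real dataset. In this notebook the call would be `class_distribution(wine)`.

```python
import pandas as pd

def class_distribution(df: pd.DataFrame, target: str = "quality") -> pd.DataFrame:
    """Return the row count and share for each value of the target column."""
    counts = df[target].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})

# Toy stand-in for the wine table (made-up values, not the real data).
toy = pd.DataFrame({"quality": [3, 5, 5, 5, 6, 6, 6, 7, 5, 6]})
dist = class_distribution(toy)
print(dist)
```

A very small share for some levels is a hint that those classes may need to be merged or resampled before training.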

## Bi-variate Analysis
Now I will compare features against each other to try and find some correlation between them.

In [None]:
def correlation_heatmap(train):
    # Pairwise correlations between all columns
    correlations = train.corr()

    fig, ax = plt.subplots(figsize=(14,14))
    # Draw on the axes created above so the figure size is respected
    sb.heatmap(correlations, vmax=1.0, center=0, fmt='.2f', square=True, linewidths=.5, annot=True, cbar_kws={"shrink":.70}, ax=ax)
    plt.show()

correlation_heatmap(wine)

In [None]:
grid = sb.PairGrid(wine)
grid.map(sb.scatterplot)

In [None]:
f, ax = plt.subplots(figsize=(15, 6.5))
sb.scatterplot(x="volatile acidity", y="quality",
              data=wine)

In [None]:
sb.boxplot(x="volatile acidity", y="quality", data=wine, whis=[0,100], width=.6, orient="h")

What we can determine from this is that wines with a lower volatile acidity tend to have a higher quality, and vice versa. Volatile acidity is what leads to an acidic taste, so this relationship makes sense.
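
The relationship read off the plots can also be checked numerically. The sketch below computes the Pearson correlation between one feature and quality, plus the median of that feature per quality level. `feature_vs_quality` is a hypothetical helper and the `toy` values are made up for illustration. On the real data the call would be `feature_vs_quality(wine)`.

```python
import pandas as pd

def feature_vs_quality(df, feature="volatile acidity", target="quality"):
    """Return the Pearson correlation with the target and per-level medians."""
    r = df[feature].corr(df[target])
    medians = df.groupby(target)[feature].median()
    return r, medians

# Toy stand-in (made-up values): acidity falls as quality rises.
toy = pd.DataFrame({
    "volatile acidity": [0.90, 0.80, 0.60, 0.50, 0.40, 0.30],
    "quality": [3, 3, 5, 5, 7, 7],
})
r, medians = feature_vs_quality(toy)
print(r)
print(medians)
```

A negative correlation together with medians that fall as quality rises would support the reading of the box plot.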

## Preprocessing Data
Here I will transform the data so that the model can perform better on it.

In [None]:
# Turn the quality score into a binary label: 1 for a quality of 7 or higher, 0 otherwise
wine['quality'] = [1 if x>=7 else 0 for x in wine['quality']]
wine.head()
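
The same binarization can be written as a vectorized comparison, which also makes it easy to check how many positives the cut-off produces. This is a sketch on a made-up series. `binarize_quality` and `GOOD_THRESHOLD` are hypothetical names, with the threshold of 7 taken from the cell above.

```python
import pandas as pd

GOOD_THRESHOLD = 7  # same cut-off as the list comprehension above

def binarize_quality(quality: pd.Series, threshold: int = GOOD_THRESHOLD) -> pd.Series:
    """Map scores at or above the threshold to 1 and everything else to 0."""
    return (quality >= threshold).astype(int)

# Made-up scores for illustration only.
scores = pd.Series([3, 5, 6, 7, 8], name="quality")
labels = binarize_quality(scores)
print(labels.tolist())  # [0, 0, 0, 1, 1]
print(labels.mean())    # share of positive labels: 0.4
```

The mean of the binary labels is the share of positives, which is worth checking because a cut-off this high is likely to leave the positive class in the minority.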

In [None]:
X = wine.drop('quality', axis=1)
y = wine['quality']

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

num_attribs = list(X)

pipeline = ColumnTransformer([
    ("num", StandardScaler(), num_attribs)
])

prepared = pipeline.fit_transform(X)

In [None]:
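# Keep the original column names and row index on the scaled features
X_df = pd.DataFrame(data=prepared, columns=num_attribs, index=X.index)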

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
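# n_splits=1 gives a single stratified 90/10 split (the default of 10 would loop over ten splits and keep only the last)
StratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
for train_index, test_index in StratSplit.split(X_df, y):
    X_train, X_test = X_df.iloc[train_index], X_df.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]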
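
One thing to note about the cells above is that `StandardScaler` is fitted on every row before the split, so the test rows influence the scaling statistics. A possible alternative, sketched below, is to split first and put the scaler in a pipeline so it only sees training rows. The data here is synthetic (`make_classification` with imbalanced classes is my assumption, not the wine dataset), and `knn_pipeline` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features: 11 columns, imbalanced classes.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)

# Stratified 90/10 split, mirroring test_size=0.1 above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=42
)

# The scaler is fitted inside the pipeline, so it only sees training rows.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipeline.fit(X_tr, y_tr)
print(knn_pipeline.score(X_te, y_te))
```

With this layout the test rows are scaled using the training mean and standard deviation, which is closer to how the model would see genuinely new wines.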

## Train Model
Now I will train a KNN classifier on our data and fine-tune the model.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)

In [None]:
from sklearn.metrics import roc_curve, auc
from matplotlib.legend_handler import HandlerLine2D

neighbors = list(range(1, 30))
train_results = []
test_results = []
for n in neighbors:
    model = KNeighborsClassifier(n_neighbors=n)
    model.fit(X_train, y_train)

    # AUC on the training set
    train_pred = model.predict(X_train)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)

    # AUC on the test set
    y_pred = model.predict(X_test)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

# Plot train and test AUC against the number of neighbors
line1, = plt.plot(neighbors, train_results, color="b", label="Train AUC")
line2, = plt.plot(neighbors, test_results, color="r", label="Test AUC")
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('AUC score')
plt.xlabel('n_neighbors')
plt.show()

We see a spike at 3 neighbors for both the training and test sets, so I have set `n_neighbors=3` in the model above!
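
The loop above picks `n_neighbors` by looking at the same test set that is used for the final report, which can make the test scores look better than they should. One way around this, sketched here under my own assumptions, is to choose `n_neighbors` with cross-validation on the training data only. The data is synthetic again, and the step names `scale` and `knn` are arbitrary. Note that `scoring="roc_auc"` ranks wines by predicted probability rather than by hard 0/1 predictions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (not the wine data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Try the same range of k as the loop above, scored by cross-validated ROC AUC.
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 30))},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

After the search, `search.best_estimator_` is refitted on all of the data passed to `fit`, so it could be evaluated once on the held-out test set.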

In [None]:
from sklearn.metrics import classification_report
y_pred = neigh.predict(X_train)
print(classification_report(y_train, y_pred))

In [None]:
y_pred = neigh.predict(X_test)
print(classification_report(y_test, y_pred))