# Wine quality study<br>
## Purpose of analysis
- Understand the characteristics of the data through EDA.
- Try to extract structure that is hard to see directly, using cluster & Isomap analysis.
- Predict the wine type and the quality rating in 2 ways: "Logistic Regression" and "Decision Tree".

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
## Libraries
import os
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()

# For clustering
from sklearn import cluster
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import Isomap

# Data preprocessing library (StandardScaler is already imported above)
from sklearn.model_selection import train_test_split

# Machine learning
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler

# Evaluation library
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc

# Data loading

In [None]:
df = pd.read_csv("../input/wine-quality/winequalityN.csv", header=0)
df.head()

# Data checking

In [None]:
# Data size
print("Data size:{}".format(df.shape))

In [None]:
# Data info
df.info()

Some columns contain null values, but the number is small.

In [None]:
# "type" values
df["type"].value_counts()

In [None]:
# Overview, Basic features
df.describe()

## Handling "Null data"
The number of null values is small compared with the whole dataset. So, I decided to drop the rows that contain them.
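
Before dropping rows, it can help to see how many values are missing per column and what share of rows `dropna()` would remove. The sketch below uses a small made-up frame (`toy` and its values are assumptions, not the wine data) only to show the calls.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wine data (values are made up)
toy = pd.DataFrame({
    "type": ["white", "red", "white", "red", "white"],
    "pH": [3.1, np.nan, 3.3, 3.2, 3.0],
    "alcohol": [9.5, 10.1, 11.0, 9.8, 12.3],
    "quality": [5, 6, 6, 5, 7],
})

# Missing values per column, as counts and as a share of all rows
null_count = toy.isnull().sum()
null_share = toy.isnull().mean()
print(pd.DataFrame({"count": null_count, "share": null_share}))

# Share of rows that dropna() would remove
rows_lost = 1 - len(toy.dropna()) / len(toy)
print("rows lost: {:.1%}".format(rows_lost))
```

If the share of lost rows were large, imputing values would be worth considering instead of deleting.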

In [None]:
df.dropna(inplace=True)
df.shape

## EDA

### Category counts

In [None]:
# Wine type and Wine quality data count
fig, ax = plt.subplots(1,2,figsize=(15,6))

# Wine type
sns.countplot(x="type", data=df, ax=ax[0])
ax[0].set_title("Wine type")

# Wine quality
sns.countplot(x="quality", data=df, ax=ax[1])
ax[1].set_title("Wine quality")

### Data pairplot by wine type

In [None]:
# Pairplot of all features colored by wine type, on a random sample of 1000 rows.
sns.pairplot(df.sample(n=1000, random_state=10), hue="type", hue_order=['white', 'red'])
plt.legend()

- The "white" and "red" distributions are clearly separated in some features.
- For prediction with machine learning, outliers may need to be dropped.

### Data pairplot by wine quality

In [None]:
# Pairplot of all features colored by wine quality, on a random sample of 1000 rows.
sns.pairplot(df.sample(n=1000, random_state=10), hue="quality", hue_order=[3,4,5,6,7,8,9])
plt.legend()

- "volatile acidity", "citric acid", "free sulfur dioxide", "density", "pH", "alcohol" may be important for wine quality.<br>
### Next, check the correlation with quality.

In [None]:
corr_values = df.iloc[:,1:]

# Heatmap
plt.figure(figsize=(15,15))
hm = sns.heatmap(corr_values.corr(),
                cbar=True,
                annot=True,
                square=True,
                cmap="RdBu_r",
                fmt=".2f",
                annot_kws={"size":10},
                yticklabels=corr_values.columns,
                vmax=1,
                vmin=-1,
                center=0)
plt.xlabel("Variables")
plt.ylabel("Variables")

- No feature shows a clear correlation with quality.
- Relatively speaking, 'free sulfur dioxide' & 'total sulfur dioxide' are positively correlated, and 'density' & 'alcohol' are negatively correlated.
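
To read the quality row of a correlation matrix more easily than from a heatmap, the values can be pulled out and ranked by absolute size. This is only a sketch on synthetic data: the frame `toy` and the way its columns are generated are assumptions for illustration, not the wine dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (made up): quality depends weakly on alcohol
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.5, 1.0, n)
toy = pd.DataFrame({
    "alcohol": alcohol,
    "density": 1.0 - 0.002 * alcohol + rng.normal(0, 0.002, n),
    "pH": rng.normal(3.2, 0.15, n),
    "quality": 0.5 * alcohol + rng.normal(0, 1.0, n),
})

# Correlation of every feature with quality (quality itself removed)
quality_corr = toy.corr()["quality"].drop("quality")

# Order by absolute value so strong negative correlations are not hidden
order = quality_corr.abs().sort_values(ascending=False).index
ranked = quality_corr.reindex(order)
print(ranked.round(2))
```

On the real data the same two lines would be applied to `df.iloc[:,1:]`.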

### Cluster & Isomap analysis

In [None]:
# Cluster data set
cluster_params = df.iloc[:,1:-1]

# Standardize
sc = StandardScaler()
sc.fit(cluster_params)
params = sc.transform(cluster_params)

# Create 5 clusters
kmeans = cluster.KMeans(n_clusters=5, max_iter=30, init="random", random_state=0)
kmeans.fit(params)
labels = kmeans.labels_
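
The number of clusters above is fixed at 5 by hand. One possible way to sanity-check such a choice is to compare the silhouette score for several values of k. The sketch below does this on synthetic blobs (the data, `scores` and `best_k` are assumptions for illustration); it is not a claim about which k is best for the wine data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic blobs (made-up data)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 6], [-6, 6]])
toy = np.vstack([c + rng.normal(0, 0.7, size=(60, 2)) for c in centers])
toy_std = StandardScaler().fit_transform(toy)

# Fit KMeans for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels_k = km.fit_predict(toy_std)
    scores[k] = silhouette_score(toy_std, labels_k)

# Higher silhouette means tighter, better separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```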

In [None]:
# Isomap, Compress to 2D.
iso = Isomap(n_components=2)
iso.fit(params)
data_projected = iso.transform(params)
data_projected.shape

## 2D plot by Wine quality

In [None]:
plt.figure(figsize=(13,10))
# quality takes the 7 values 3..9, so use 7 colors and matching colorbar ticks
plt.scatter(data_projected[:,0], data_projected[:,1], c=df["quality"], edgecolor='none', alpha=0.7, cmap=plt.get_cmap('hsv', 7))
plt.colorbar(label="quality", ticks=range(3, 10))

- The region around "horizontal value=0" & "vertical value<-5" holds the good quality scores, like a trench.
- There may be 2 routes into that region: one from the upper left side and one from the upper right side.

## 2D plot Cluster

In [None]:
plt.figure(figsize=(13,10))
plt.scatter(data_projected[:,0], data_projected[:,1], c=labels, edgecolor='none', alpha=0.8, cmap=plt.get_cmap('nipy_spectral', 5))
plt.colorbar(label="cluster", ticks=range(5))

- The good quality cluster is cluster 1.
- The points separate into an upper right and an upper left group, so next I check the 2 routes.<br>
1st route : cluster 2 ⇒ cluster 1 ⇒ cluster 4<br>
2nd route : cluster 3 ⇒ cluster 0 ⇒ cluster 4

In [None]:
df["Cluster"] = labels

# Group by Cluster and compare the mean values (numeric columns only, "type" is text).
df.groupby("Cluster").mean(numeric_only=True).reset_index()

### Analysis of 1st route<br>
cluster 2 ⇒ cluster 1 ⇒ cluster 4

In [None]:
# Build the dataframe of cluster means along the route
df_1st_route = df.groupby("Cluster").mean(numeric_only=True).reset_index().query("Cluster==2 | Cluster==1 | Cluster==4").sort_values(by="quality")

# sort index by cluster route
def cluster_route_flg(x):
    if x["Cluster"] == 2:
        res=1
    elif x["Cluster"] == 1:
        res=2
    else:
        res=3
    return res

df_1st_route["Cluster_route"] = df_1st_route.apply(cluster_route_flg, axis=1)

df_1st_route.reset_index(inplace=True)
df_1st_route.head()

- 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'alcohol' may have a roughly linear relationship with quality.

In [None]:
# Preparing plot values
x = df_1st_route.iloc[:,1:-1]
y = df_1st_route["quality"]

# Visualization
fig, ax = plt.subplots(1,5, figsize=(25,4))
plt.subplots_adjust(wspace=0.3, hspace=0.3)

# residual sugar
ax[0].plot(x["residual sugar"], y, 'o',markersize=5)
ax[0].set_xlabel("residual sugar")
ax[0].set_ylabel("quality")

# chlorides
ax[1].plot(x["chlorides"], y, 'o',markersize=5)
ax[1].set_xlabel("chlorides")
ax[1].set_ylabel("quality")

# free sulfur dioxide
ax[2].plot(x["free sulfur dioxide"], y, 'o',markersize=5)
ax[2].set_xlabel("free sulfur dioxide")
ax[2].set_ylabel("quality")

# total sulfur dioxide
ax[3].plot(x["total sulfur dioxide"], y, 'o',markersize=5)
ax[3].set_xlabel("total sulfur dioxide")
ax[3].set_ylabel("quality")

# alcohol
ax[4].plot(x["alcohol"], y, 'o',markersize=5)
ax[4].set_xlabel("alcohol")
ax[4].set_ylabel("quality")

### Analysis of 2nd route<br>
cluster 3 ⇒ cluster 0 ⇒ cluster 4

In [None]:
# Build the dataframe of cluster means along the route
df_2nd_route = df.groupby("Cluster").mean(numeric_only=True).reset_index().query("Cluster==3 | Cluster==0 | Cluster==4")

# sort index by cluster route
def cluster_route_flg(x):
    if x["Cluster"] == 3:
        res=1
    elif x["Cluster"] == 0:
        res=2
    else:
        res=3
    return res

df_2nd_route["Cluster_route"] = df_2nd_route.apply(cluster_route_flg, axis=1)

df_2nd_route.reset_index(inplace=True)
df_2nd_route.head()

- 'volatile acidity', 'residual sugar', 'free sulfur dioxide', 'total sulfur dioxide', 'alcohol' may have a roughly linear relationship with quality.

In [None]:
# Preparing plot values
x = df_2nd_route.iloc[:,1:-1]
y = df_2nd_route["quality"]

# Visualization
fig, ax = plt.subplots(1,5, figsize=(25,4))
plt.subplots_adjust(wspace=0.3, hspace=0.3)

# volatile acidity
ax[0].scatter(x["volatile acidity"], y, s=40)
ax[0].set_xlabel("volatile acidity")
ax[0].set_ylabel("quality")

# residual sugar
ax[1].scatter(x["residual sugar"], y, s=40)
ax[1].set_xlabel("residual sugar")
ax[1].set_ylabel("quality")

# free sulfur dioxide
ax[2].scatter(x["free sulfur dioxide"], y, s=40)
ax[2].set_xlabel("free sulfur dioxide")
ax[2].set_ylabel("quality")

# total sulfur dioxide
ax[3].scatter(x["total sulfur dioxide"], y, s=40)
ax[3].set_xlabel("total sulfur dioxide")
ax[3].set_ylabel("quality")

# alcohol
ax[4].scatter(x["alcohol"], y, s=40)
ax[4].set_xlabel("alcohol")
ax[4].set_ylabel("quality")

- 'volatile acidity' appears here but not in the 1st route.
- Interestingly, the 'residual sugar' trend is the reverse of the 1st route.
- This suggests that the best cluster, cluster 1, is a strongly high quality cluster, and that it can be reached in 2 ways depending on some additional feature.
*That feature may be the wine type, white or red.
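
One way to check whether the 2 routes correspond to wine type is to tabulate the share of each type inside every cluster. The sketch below uses a tiny made-up frame (`toy` and its values are assumptions); on the notebook data the same call would take `df["Cluster"]` and `df["type"]`.

```python
import pandas as pd

# Made-up cluster labels and wine types, only to show the call
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 1, 3, 3],
    "type": ["red", "red", "white", "white", "white", "white", "red", "red"],
})

# normalize="index" turns the counts into per-cluster shares (each row sums to 1)
type_share = pd.crosstab(toy["Cluster"], toy["type"], normalize="index")
print(type_share.round(2))
```

If the clusters on one route were mostly one type, that would support the idea above.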

## Prediction of wine type by Logistic Regression & Decision Tree

### Data preprocessing

In [None]:
# Making the flag of wine type
mapping = {"white":0, "red":1}
df["type_flg"] = df['type'].map(mapping)
df['type_flg'].value_counts()

In [None]:
# Data preparing
X = df.iloc[:,1:12]
y = df["type_flg"]

# Data splitting to make the training data and validation data
# training data :70%, validation(test data) :30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardize the data (the scaler is fitted on the training data only)
sc = StandardScaler()
sc.fit(X_train)

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
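
Note that the scaler here is fitted once on the whole training set, so during cross-validation each validation fold has already contributed to the scaling. A possible alternative is to put the scaler inside a `Pipeline`, so that it is refitted on each training fold. The sketch below uses synthetic data from `make_classification`; the names `pipe`, `grid` and `search` are assumptions for illustration, and `cv=5` is chosen only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 11 features, like the wine feature columns
X_toy, y_toy = make_classification(n_samples=400, n_features=11,
                                   n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=0)

# Scaling and the classifier are fitted together inside every CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameters of a pipeline step are addressed as "<step>__<param>"
grid = {"clf__C": [0.001, 0.01, 0.1, 1.0], "clf__penalty": ["l1", "l2"]}
search = GridSearchCV(pipe, grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.score(X_te, y_te), 3))
```

With this layout the raw `X_test` can be passed directly to `search.predict`, because the fitted pipeline applies the scaling itself.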

In [None]:
# Logistic Regression
# The "liblinear" solver supports both the l1 and l2 penalties searched below
lr = LogisticRegression(solver="liblinear")

param_range = [0.001, 0.01, 0.1, 1.0]
penalty = ['l1', 'l2']
param_grid = [{"C":param_range, "penalty":penalty}]

gs_lr = GridSearchCV(estimator=lr, param_grid=param_grid, scoring="accuracy", cv=10, n_jobs=-1)
gs_lr = gs_lr.fit(X_train_std, y_train)

print(gs_lr.best_score_.round(3))
print(gs_lr.best_params_)

In [None]:
# Decision tree
tree = DecisionTreeClassifier(max_depth=4, random_state=10)

param_range = [3, 6, 9, 12]
leaf = [10, 15, 20]
criterion = ["entropy", "gini"]
param_grid = [{"max_depth":param_range, "criterion":criterion, "max_leaf_nodes":leaf}]

gs_tree = GridSearchCV(estimator=tree, param_grid=param_grid, scoring="accuracy", cv=10, n_jobs=-1)
gs_tree = gs_tree.fit(X_train, y_train)

print(gs_tree.best_score_.round(3))
print(gs_tree.best_params_)

In [None]:
print("-"*50)
# Logistic Regression Result
y_pred = gs_lr.best_estimator_.predict(X_test_std)
print("Logistic Regression Result")
print("confusion_matrix = \n", confusion_matrix(y_true=y_test, y_pred=y_pred))
print("accuracy = %.3f" % accuracy_score(y_true=y_test, y_pred=y_pred))
print("precision = %.3f" % precision_score(y_true=y_test, y_pred=y_pred))
print("recall = %.3f" % recall_score(y_true=y_test, y_pred=y_pred))
print("f1_score = %.3f" % f1_score(y_true=y_test, y_pred=y_pred))
print("-"*50)

# Decision tree
y_pred = gs_tree.best_estimator_.predict(X_test)
print("Decision tree")
print("confusion_matrix = \n", confusion_matrix(y_true=y_test, y_pred=y_pred))
print("accuracy = %.3f" % accuracy_score(y_true=y_test, y_pred=y_pred))
print("precision = %.3f" % precision_score(y_true=y_test, y_pred=y_pred))
print("recall = %.3f" % recall_score(y_true=y_test, y_pred=y_pred))
print("f1_score = %.3f" % f1_score(y_true=y_test, y_pred=y_pred))
print("-"*50)
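
The same block of metric prints is used for every model in this notebook. It could be wrapped in one small helper; the function name `report` and the toy labels below are assumptions for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def report(name, y_true, y_pred):
    """Print the metrics used in this notebook and return them as a dict."""
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),
    }
    print(name)
    print("confusion_matrix = \n", confusion_matrix(y_true, y_pred))
    for key, value in scores.items():
        print("{} = {:.3f}".format(key, value))
    print("-" * 50)
    return scores

# Toy labels: 2 true positives, 1 false positive, 1 false negative, 1 true negative
demo = report("Toy example", [0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
```

In the cells of this notebook it would be called as, for example, `report("Decision tree", y_test, y_pred)`.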

### It was easy to identify the type of wine and it could be predicted with high accuracy.

## Prediction of quality type by Logistic Regression & Decision Tree

To make this a binary classification, create the flag quality_flg.<br>
High_quality : (flag=1, quality>=7)<br>
Low_quality : (flag=0, quality<7)<br>
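
The flag defined above can also be computed without a row-wise function, by comparing the whole column at once. A minimal sketch on made-up quality values (the frame `toy` is an assumption):

```python
import pandas as pd

# Made-up quality scores covering both sides of the threshold
toy = pd.DataFrame({"quality": [3, 5, 6, 7, 8, 9]})

# The comparison gives True/False per row; astype(int) turns that into 1/0
toy["quality_flg"] = (toy["quality"] >= 7).astype(int)
print(toy)
```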

In [None]:
def quality_flag(x):
    if x["quality"] >= 7:
        res = 1
    else:
        res = 0
    return res

df["quality_flg"] = df.apply(quality_flag, axis=1)
df["quality_flg"].value_counts()

There are fewer high quality samples than low quality samples.
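
With imbalanced classes, a plain random split can leave the training and test sets with slightly different class ratios. `train_test_split` has a `stratify` argument that keeps the ratio the same in both parts. The sketch below shows this on made-up labels (20% positive is an assumption, not the ratio in the wine data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 200 positive and 800 negative labels
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 200 + [0] * 800)

# stratify=y_toy keeps the share of positives equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

print("positive share in train:", y_tr.mean())
print("positive share in test :", y_te.mean())
```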

In [None]:
# Data preparing
X = df.iloc[:,1:12]
y = df["quality_flg"]

# Data splitting to make the training data and validation data
# training data :70%, validation(test data) :30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardize the data (the scaler is fitted on the training data only)
sc = StandardScaler()
sc.fit(X_train)

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

In [None]:
# Logistic Regression
# The "liblinear" solver supports both the l1 and l2 penalties searched below
lr = LogisticRegression(solver="liblinear")

param_range = [0.001, 0.01, 0.1, 1.0]
penalty = ['l1', 'l2']
param_grid = [{"C":param_range, "penalty":penalty}]

gs_lr = GridSearchCV(estimator=lr, param_grid=param_grid, scoring="accuracy", cv=10, n_jobs=-1)
gs_lr = gs_lr.fit(X_train_std, y_train)

print(gs_lr.best_score_.round(3))
print(gs_lr.best_params_)

In [None]:
# Decision tree
tree = DecisionTreeClassifier(max_depth=4, random_state=10)

param_range = [3, 6, 9, 12]
leaf = [10, 15, 20]
criterion = ["entropy", "gini"]
param_grid = [{"max_depth":param_range, "criterion":criterion, "max_leaf_nodes":leaf}]

gs_tree = GridSearchCV(estimator=tree, param_grid=param_grid, scoring="accuracy", cv=10, n_jobs=-1)
gs_tree = gs_tree.fit(X_train, y_train)

print(gs_tree.best_score_.round(3))
print(gs_tree.best_params_)

In [None]:
print("-"*50)
# Logistic Regression Result
y_pred = gs_lr.best_estimator_.predict(X_test_std)
print("Logistic Regression Result")
print("confusion_matrix = \n", confusion_matrix(y_true=y_test, y_pred=y_pred))
print("accuracy = %.3f" % accuracy_score(y_true=y_test, y_pred=y_pred))
print("precision = %.3f" % precision_score(y_true=y_test, y_pred=y_pred))
print("recall = %.3f" % recall_score(y_true=y_test, y_pred=y_pred))
print("f1_score = %.3f" % f1_score(y_true=y_test, y_pred=y_pred))
print("-"*50)

# Decision tree
y_pred = gs_tree.best_estimator_.predict(X_test)
print("Decision tree")
print("confusion_matrix = \n", confusion_matrix(y_true=y_test, y_pred=y_pred))
print("accuracy = %.3f" % accuracy_score(y_true=y_test, y_pred=y_pred))
print("precision = %.3f" % precision_score(y_true=y_test, y_pred=y_pred))
print("recall = %.3f" % recall_score(y_true=y_test, y_pred=y_pred))
print("f1_score = %.3f" % f1_score(y_true=y_test, y_pred=y_pred))
print("-"*50)

### It was harder to identify the quality of wine than the wine type.<br>
Recall is low; I think this is because there are few high quality samples.<br>
### So next, I tried an "over sampling method" to raise the recall score.
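
Random over sampling simply repeats minority-class rows, drawn with replacement, until both classes have the same size. The sketch below shows the idea with `sklearn.utils.resample` on made-up data. It is an illustration of the technique, not the `RandomOverSampler` implementation, and all names in it are assumptions.

```python
import numpy as np
from sklearn.utils import resample

# Made-up training data: 15 positive rows and 85 negative rows
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 2))
y_tr = np.array([1] * 15 + [0] * 85)

minority = X_tr[y_tr == 1]
majority = X_tr[y_tr == 0]

# Draw minority rows with replacement until they match the majority count
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=10)

X_bal = np.vstack([majority, minority_up])
y_bal = np.array([0] * len(majority) + [1] * len(minority_up))
print(np.bincount(y_bal))
```

Only the training data is resampled; the test data keeps its original class ratio so that the evaluation stays realistic.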

In [None]:
# Set a RandomOverSampler
ros = RandomOverSampler(sampling_strategy = 'auto', random_state=10)

In [None]:
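# Making the training data (fit_resample is the current imblearn method name)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train_std, y_train)

# Logistic Regression
# The "liblinear" solver supports both the l1 and l2 penalties searched below
lr = LogisticRegression(solver="liblinear")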

param_range = [0.001, 0.01, 0.1, 1.0]
penalty = ['l1', 'l2']
param_grid = [{"C":param_range, "penalty":penalty}]

gs_lr = GridSearchCV(estimator=lr, param_grid=param_grid, scoring="accuracy", cv=10, n_jobs=-1)
gs_lr = gs_lr.fit(X_train_resampled, y_train_resampled)

print(gs_lr.best_score_.round(3))
print(gs_lr.best_params_)

In [None]:
# Making the training data (fit_resample is the current imblearn method name)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)

# Decision tree
tree = DecisionTreeClassifier(max_depth=4, random_state=10)

param_range = [3, 6, 9, 12]
leaf = [10, 15, 20]
criterion = ["entropy", "gini"]
param_grid = [{"max_depth":param_range, "criterion":criterion, "max_leaf_nodes":leaf}]

gs_tree = GridSearchCV(estimator=tree, param_grid=param_grid, scoring="accuracy", cv=10, n_jobs=-1)
gs_tree = gs_tree.fit(X_train_resampled, y_train_resampled)

print(gs_tree.best_score_.round(3))
print(gs_tree.best_params_)

In [None]:
print("-"*50)
# Logistic Regression Result
y_pred = gs_lr.best_estimator_.predict(X_test_std)
print("Logistic Regression Result")
print("confusion_matrix = \n", confusion_matrix(y_true=y_test, y_pred=y_pred))
print("accuracy = %.3f" % accuracy_score(y_true=y_test, y_pred=y_pred))
print("precision = %.3f" % precision_score(y_true=y_test, y_pred=y_pred))
print("recall = %.3f" % recall_score(y_true=y_test, y_pred=y_pred))
print("f1_score = %.3f" % f1_score(y_true=y_test, y_pred=y_pred))
print("-"*50)

# Decision tree
y_pred = gs_tree.best_estimator_.predict(X_test)
print("Decision tree")
print("confusion_matrix = \n", confusion_matrix(y_true=y_test, y_pred=y_pred))
print("accuracy = %.3f" % accuracy_score(y_true=y_test, y_pred=y_pred))
print("precision = %.3f" % precision_score(y_true=y_test, y_pred=y_pred))
print("recall = %.3f" % recall_score(y_true=y_test, y_pred=y_pred))
print("f1_score = %.3f" % f1_score(y_true=y_test, y_pred=y_pred))
print("-"*50)

Recall improved, but accuracy and precision went down.<br>
These metrics trade off against each other, so we have to decide which one matters more in this situation.
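
The same trade-off can be seen without resampling, by moving the decision threshold on the predicted probability. The sketch below uses synthetic imbalanced data; the data, the thresholds and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 20% positive samples
X_toy, y_toy = make_classification(n_samples=1000, n_features=11,
                                   n_informative=5, weights=[0.8, 0.2],
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# A lower threshold predicts "positive" more often, so recall cannot go down
recalls = []
for threshold in [0.2, 0.35, 0.5, 0.65]:
    y_hat = (proba >= threshold).astype(int)
    p = precision_score(y_te, y_hat, zero_division=0)
    r = recall_score(y_te, y_hat)
    recalls.append(r)
    print("threshold={:.2f} precision={:.3f} recall={:.3f}".format(threshold, p, r))
```

Which threshold to use depends on whether missing a high quality wine or mislabeling a low quality one is the more costly mistake.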

### The next attempt is to separate the model by wine type, white or red.

I tried this for red wine only and checked the result.

### Red wine model

In [None]:
# Data preparing
X = df.query("type=='red'").iloc[:,1:12]
y = df.query("type=='red'")["quality_flg"]

# Data splitting to make the training data and validation data
# training data :70%, validation(test data) :30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardize the data (the scaler is fitted on the training data only)
sc = StandardScaler()
sc.fit(X_train)

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

In [None]:
# Logistic Regression
# The "liblinear" solver supports both the l1 and l2 penalties searched below
lr = LogisticRegression(solver="liblinear")

param_range = [0.001, 0.01, 0.1, 1.0]
penalty = ['l1', 'l2']
param_grid = [{"C":param_range, "penalty":penalty}]

gs_lr = GridSearchCV(estimator=lr, param_grid=param_grid, scoring="accuracy", cv=10, n_jobs=-1)
gs_lr = gs_lr.fit(X_train_std, y_train)

print(gs_lr.best_score_.round(3))
print(gs_lr.best_params_)

In [None]:
# Decision tree
tree = DecisionTreeClassifier(max_depth=4, random_state=10)

param_range = [3, 6, 9, 12]
leaf = [10, 15, 20]
criterion = ["entropy", "gini"]
param_grid = [{"max_depth":param_range, "criterion":criterion, "max_leaf_nodes":leaf}]

gs_tree = GridSearchCV(estimator=tree, param_grid=param_grid, scoring="accuracy", cv=10, n_jobs=-1)
gs_tree = gs_tree.fit(X_train, y_train)

print(gs_tree.best_score_.round(3))
print(gs_tree.best_params_)

In [None]:
print("-"*50)
# Logistic Regression Result
y_pred = gs_lr.best_estimator_.predict(X_test_std)
print("Logistic Regression Result")
print("confusion_matrix = \n", confusion_matrix(y_true=y_test, y_pred=y_pred))
print("accuracy = %.3f" % accuracy_score(y_true=y_test, y_pred=y_pred))
print("precision = %.3f" % precision_score(y_true=y_test, y_pred=y_pred))
print("recall = %.3f" % recall_score(y_true=y_test, y_pred=y_pred))
print("f1_score = %.3f" % f1_score(y_true=y_test, y_pred=y_pred))
print("-"*50)

# Decision tree
y_pred = gs_tree.best_estimator_.predict(X_test)
print("Decision tree")
print("confusion_matrix = \n", confusion_matrix(y_true=y_test, y_pred=y_pred))
print("accuracy = %.3f" % accuracy_score(y_true=y_test, y_pred=y_pred))
print("precision = %.3f" % precision_score(y_true=y_test, y_pred=y_pred))
print("recall = %.3f" % recall_score(y_true=y_test, y_pred=y_pred))
print("f1_score = %.3f" % f1_score(y_true=y_test, y_pred=y_pred))
print("-"*50)

The accuracy of each model improved, and precision and recall improved too.<br>
Separating the model by wine type can be one way to improve prediction accuracy.

This Wine quality study ends here.