# Red Wine Quality Classification

Dataset URL: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009


## Problem
**The task is to classify a wine as good if the value of its quality column is greater than 6.5, and as bad otherwise.**

Besides regression modelling, an interesting exercise is to set an arbitrary cutoff for the dependent variable (wine quality), e.g. classifying wines rated 7 or higher as 'good/1' and the remainder as 'not good/0'.

* $quality$ > 6.5 => "good"
* otherwise => "bad"


## Inspiration
Use machine learning to determine which physicochemical properties make a wine 'good'!

## Load Data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
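from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; alias its replacement (available since 1.0) so later cells still work
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator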
from scipy.stats import norm, boxcox
from collections import Counter
from scipy import stats

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv("/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")
data.head()

In [None]:
# columns 
data.columns

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
#check missing values
data.isnull().sum()

The output confirms that there are no missing values.

## Data Analysis

In [None]:
# Check the pairwise correlation coefficients between variables
plt.figure(figsize=(10,10))
sns.heatmap(data.corr(), annot=True)

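The heatmap shows which columns have the lowest correlation coefficients with the target column 'quality'. Note that these are the most negative coefficients, which is not the same as the weakest association.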

1. volatile acidity
2. total sulfur dioxide
3. density
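
A sketch of how such a ranking could be made explicit: sort the features by the absolute value of their correlation with the target, so that strongly negative coefficients count as strong relationships. The DataFrame `toy` and its columns `a`, `b`, `c`, and `target` are made up for illustration and are not the wine data.

```python
import pandas as pd

# Hypothetical toy data: 'a' rises with the target, 'b' falls with it,
# and 'c' is only weakly related to it.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [6, 5, 4, 3, 2, 1],
    "c": [2, 9, 1, 8, 2, 9],
    "target": [1, 2, 3, 4, 5, 6],
})

# Correlation of every feature with the target (the target itself is dropped).
corr_with_target = toy.corr()["target"].drop("target")

# Rank by magnitude: the smallest absolute value is the weakest linear association.
ranked = corr_with_target.abs().sort_values()
weakest = ranked.index[0]

print(corr_with_target)
print("weakest association:", weakest)
```

In this toy example `b` has the lowest coefficient, yet it is as informative as `a`; only `c` is weakly associated with the target.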

In [None]:
data[["volatile acidity","quality"]].groupby(["quality"], as_index = False).mean().sort_values(by = "quality").style.background_gradient("Reds")

In [None]:
data[["total sulfur dioxide","quality"]].groupby(["quality"], as_index = False).mean().sort_values(by = "quality").style.background_gradient("Reds")

In [None]:
data[["density", "quality"]].groupby(["quality"], as_index = False).mean().sort_values(by = "quality").style.background_gradient("Reds")

Looking at the mean values per quality grade, no clear pattern could be identified.

#### Quality Class Conversion

* if quality > 6.5: 1
* else: 0

The values of the quality column are converted according to this rule.

In [None]:
bins = (2, 6.5, 8)
labels = [0, 1]
data['quality'] = pd.cut(x = data['quality'], bins = bins, labels = labels)

In [None]:
data['quality'].value_counts()

In [None]:
data

The values of the quality column have been changed to 0 and 1.
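
As a sanity check on the binning, the sketch below compares `pd.cut` with the bins used above against a direct threshold comparison. The `quality` Series here is a made-up example, assuming integer scores between 3 and 8.

```python
import pandas as pd

# Hypothetical quality scores, assumed to lie in the 3-8 range.
quality = pd.Series([3, 4, 5, 6, 7, 8])

# pd.cut with bins (2, 6.5, 8) creates the intervals (2, 6.5] and (6.5, 8].
by_cut = pd.cut(quality, bins=(2, 6.5, 8), labels=[0, 1]).astype(int)

# Direct comparison against the threshold.
by_threshold = (quality > 6.5).astype(int)

print(by_cut.tolist())
print(by_threshold.tolist())
```

Both approaches agree on this input. One difference is that `pd.cut` returns NaN for any score outside (2, 8], whereas the comparison always returns 0 or 1.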

In [None]:
plt.figure(figsize=(7,5))
sns.countplot(x=data["quality"], palette='Set2')  # pass the column as x; recent seaborn versions treat the first positional argument as `data`
plt.title("quality distribution", color = "black", fontweight= 'bold', fontsize = 11)
plt.show()

The plot shows that the dataset is **not balanced** across classes.

## Data Preprocessing

In [None]:
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
import collections

In [None]:
y = data['quality']
x = data.drop(['quality', 'volatile acidity', 'total sulfur dioxide', 'density'], axis=1)

In [None]:
x.head()

The three columns identified during data analysis as least associated with the target (quality) are removed.

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,random_state=22)

The data is split into train and test sets.
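
Since the classes are imbalanced, one optional variation is a stratified split, which keeps the class ratio the same in both sets. This is a sketch on made-up data (`X_toy`, `y_toy` are hypothetical), not a change the notebook itself makes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 9:1 class ratio in both the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=22, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```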

### Imbalanced Dataset

As seen above, this is an imbalanced dataset in which class 1 occurs less often than class 0.

In [None]:
smote = SMOTE(random_state=14)
x_train_sm, y_train_sm = smote.fit_resample(x_train, y_train)

print("Before: ", collections.Counter(y_train))
print("After: ", collections.Counter(y_train_sm))

SMOTE is therefore applied to the training set to balance the classes.
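
To give a rough idea of what SMOTE does, the sketch below creates synthetic minority samples by interpolating between a sample and one of its nearest minority-class neighbours. It is a simplified illustration in plain NumPy, not imblearn's implementation; the function `synthesize`, its parameters, and the `minority` array are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(14)

# Hypothetical minority-class samples with two features each.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])

def synthesize(samples, n_new, k=2):
    """Create n_new points, each on the segment between a sample and one of its k nearest neighbours."""
    new_points = []
    for _ in range(n_new):
        base = samples[rng.integers(len(samples))]
        # Distances from the base sample to every minority sample.
        dists = np.linalg.norm(samples - base, axis=1)
        # Position 0 of the sorted order is the base sample itself, so skip it.
        neighbours = np.argsort(dists)[1:k + 1]
        neighbour = samples[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment, in [0, 1)
        new_points.append(base + gap * (neighbour - base))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=5)
print(synthetic)
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the region already covered by the minority class.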

### Feature Scaling

In [None]:
scaler = StandardScaler()
x_train_sm = scaler.fit_transform(x_train_sm) 
x_test = scaler.transform(x_test) 
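
The arithmetic behind this step can be sketched in plain NumPy: the mean and standard deviation are computed from the training data only and then reused for the test data. The arrays `train` and `test` are made-up examples; `StandardScaler` uses the population standard deviation (`ddof=0`), which is what `np.std` computes by default.

```python
import numpy as np

# Hypothetical feature matrices (rows are samples, columns are features).
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std

print(train_scaled)
print(test_scaled)
```

Fitting on the training data alone keeps information from the test set out of the preprocessing step.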

## Model Train

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
accs = []

### Decision Tree

In [None]:
dt=DecisionTreeClassifier()

#train
dt.fit(x_train_sm, y_train_sm)
y_pred=dt.predict(x_test)

a = dt.score(x_test, y_test)
accs.append(a)
print("Accuracy:", a * 100)
print(classification_report(y_test, y_pred))

print()

cm_aaa = confusion_matrix(y_test, y_pred)

plot_confusion_matrix(dt, x_test, y_test, cmap='binary')
plt.show()

### RandomForest

In [None]:
rf=RandomForestClassifier()

#train
rf.fit(x_train_sm, y_train_sm)
y_pred=rf.predict(x_test)

a = rf.score(x_test, y_test)
accs.append(a)
print("Accuracy:", a * 100)
print(classification_report(y_test, y_pred))

print()

cm_aaa = confusion_matrix(y_test, y_pred)

plot_confusion_matrix(rf, x_test, y_test, cmap="binary")
plt.show()

### XGB

In [None]:
xgb=XGBClassifier()

#train
xgb.fit(x_train_sm, y_train_sm)
y_pred=xgb.predict(x_test)

a = xgb.score(x_test, y_test)
accs.append(a)
print("Accuracy:", a * 100)
print(classification_report(y_test, y_pred))

print()

cm_aaa = confusion_matrix(y_test, y_pred)

plot_confusion_matrix(xgb, x_test, y_test, cmap="binary")
plt.show()

## Model Evaluation

In [None]:
df_result = pd.DataFrame({"accuracy":accs, "model":["Decision Tree","RandomForest",
             "XGBClassifier"]})

df_result.style.background_gradient("Greens")

In [None]:
g = sns.barplot(x="accuracy", y="model", data=df_result, palette='Set3')  # x and y must be keyword arguments in recent seaborn versions
g.set_xlabel("score")
g.set_title("Classifier Model Test Accuracy", color = "Black")
plt.show()

## Conclusion

After training three models,

the Random Forest Classifier performed best, with an accuracy of about 90.00%. Its f1-score was 0.94.
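
On an imbalanced test set, accuracy is easier to interpret next to per-class f1-scores and a majority-class baseline. The sketch below uses made-up labels (a hypothetical 90/10 split, not the actual test set) and hand-written helper functions to show how such a baseline scores.

```python
import numpy as np

# Hypothetical test labels: 90 'bad' (0) and 10 'good' (1) wines.
y_true = np.array([0] * 90 + [1] * 10)

# A baseline that always predicts the majority class.
y_majority = np.zeros_like(y_true)

def accuracy(y, p):
    """Fraction of predictions that match the labels."""
    return float((y == p).mean())

def f1_for_class(y, p, cls):
    """f1-score treating `cls` as the positive class."""
    tp = int(((y == cls) & (p == cls)).sum())
    fp = int(((y != cls) & (p == cls)).sum())
    fn = int(((y == cls) & (p != cls)).sum())
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

acc = accuracy(y_true, y_majority)
f1_bad = f1_for_class(y_true, y_majority, 0)
f1_good = f1_for_class(y_true, y_majority, 1)

print("baseline accuracy:", acc)
print("baseline f1 (class 0):", round(f1_bad, 3))
print("baseline f1 (class 1):", f1_good)
```

The per-class rows of `classification_report` provide the same breakdown for the trained models, so the f1-score of class 1 can be compared against such a baseline.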

<br></br>

### Additional Notes

The results differ slightly from run to run.

However, the accuracy was consistently around 88 to 90%.
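
Part of this variation likely comes from the tree-based models being created without a fixed seed, while the split and SMOTE are seeded. A minimal sketch of how passing `random_state` makes a model reproducible, using synthetic data (`X_toy`, `y_toy`) and a hypothetical helper `fit_and_predict`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data standing in for the wine features.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

def fit_and_predict(seed):
    """Train a small forest with a fixed seed and return its class probabilities."""
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X_toy, y_toy)
    return model.predict_proba(X_toy)

run_a = fit_and_predict(seed=42)
run_b = fit_and_predict(seed=42)

print("identical runs:", bool((run_a == run_b).all()))
```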