<img src="https://upload.wikimedia.org/wikipedia/commons/1/11/Ch%C3%A2teau_P%C3%A9trus.jpg" width="700" style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 50%;" >

### 1. Data Preview
### 2. EDA
### 3. Data Preprocessing
### 4. Model Training
### 5. Model Evaluation

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Data Preview 

In [None]:
df = pd.read_csv("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.describe().transpose()

### It looks like there is no missing data. Let's check!

In [None]:
df.isnull().sum()

# 2. EDA

In [None]:
sns.countplot(data=df, x = 'quality')

According to the count plot above, our data is clearly imbalanced. A large share of the wines are rated 5, while 8, the best quality in this dataset, is rare.
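To put a number on the imbalance, the share of each rating can be printed alongside the plot. The sketch below uses a small made-up `quality` series as a stand-in for `df['quality']`; the values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df['quality']; replace with the real column.
quality = pd.Series([5, 5, 5, 6, 6, 5, 7, 4, 5, 6], name="quality")

# Fraction of rows per rating, sorted by rating rather than by frequency.
shares = quality.value_counts(normalize=True).sort_index()
print(shares)
```

On the real data, the same two lines applied to `df['quality']` give the exact proportions behind the count plot.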

In [None]:
fig = plt.figure(figsize=(7,4), dpi=150)
sns.heatmap(data= df.corr(), annot=True)


The heat map shows that there isn't any strong relationship among our features or between the features and our label. The correlation between pH and fixed acidity is -0.68, which makes sense since pH is inversely related to acidity. For our label, alcohol has the highest correlation, which is 0.48.
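The heat map is dense, so it can help to pull out just the label's column and rank the features by it. This is a minimal sketch, assuming a DataFrame with numeric columns and a `quality` column. The helper name `rank_by_label_correlation` and the toy frame are hypothetical and only there for illustration.

```python
import pandas as pd

def rank_by_label_correlation(frame: pd.DataFrame, label: str) -> pd.Series:
    """Return Pearson correlations of every feature with `label`, strongest first."""
    corr = frame.corr()[label].drop(label)
    # Sort by absolute value so strong negative correlations are not buried.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Toy data for illustration only; use the notebook's df in practice.
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "noise":   [2.0, 1.0, 2.0, 1.0],
    "quality": [4, 5, 6, 7],
})
ranked = rank_by_label_correlation(toy, "quality")
print(ranked)
```

This should be run before the label is converted to categories, since `corr()` works on numeric columns.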
### Now let's draw a few more plots to see the relationships more clearly.

In [None]:
sns.scatterplot(data=df, x = 'quality', y='alcohol')

In [None]:
sns.scatterplot(data=df, x = 'alcohol', y='density')

In [None]:
sns.barplot(data=df, x = 'quality',y = 'citric acid')

# 3. Data Preprocessing

### Since our dataset is imbalanced, it's reasonable to convert our label to binary. We can divide it into good and bad: 3 to 5 is bad wine and 6 to 8 is good wine.

In [None]:
bins = (2, 5.5, 8)
group_names = ['bad', 'good']
df['quality'] = pd.cut(df['quality'], bins = bins, labels = group_names)
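A quick check of how these bin edges behave: `pd.cut` uses right-closed intervals by default, so `(2, 5.5]` covers ratings 3 to 5 and `(5.5, 8]` covers ratings 6 to 8. The sketch below applies the same bins to one example of each rating; the `ratings` series is made up for illustration.

```python
import pandas as pd

ratings = pd.Series([3, 4, 5, 6, 7, 8])  # one example per rating, illustrative
bins = (2, 5.5, 8)
group_names = ['bad', 'good']

# Ratings 3-5 fall in the first bin ('bad'), ratings 6-8 in the second ('good').
labels = pd.cut(ratings, bins=bins, labels=group_names)
print(labels.tolist())  # ['bad', 'bad', 'bad', 'good', 'good', 'good']
```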

In [None]:
df['quality'].value_counts()

### Well, that's a reasonably balanced set. Let's move on to our models.

# 4. Model Training

## Train|Test Split

In [None]:
from sklearn.model_selection import train_test_split
X = df.drop('quality', axis = 1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## SVC

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
svc = SVC()
param_grid = {'C' : [0.001,0.01,0.1,0.4,0.5,1]}
grid = GridSearchCV(svc, param_grid)
grid.fit(X_train,y_train)

In [None]:
grid.best_params_

In [None]:
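from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
grid_pred = grid.predict(X_test)
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(grid, X_test, y_test)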

In [None]:
print(classification_report(y_test,grid_pred))

## Note: We could also keep the original multi-class labels and handle the imbalance directly. `SVC` has a hyperparameter called `class_weight`. If we set `class_weight='balanced'`, the model gives more weight to the under-represented classes.
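For reference, scikit-learn's `'balanced'` mode weights each class by `n_samples / (n_classes * count)`, so rarer classes get larger weights. The sketch below reproduces that formula in plain Python on a made-up label list. The helper `balanced_class_weights` is hypothetical, and the labels are illustrative rather than the dataset's real counts.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights following n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Illustrative labels: rating 5 is common, rating 8 is rare.
toy_labels = [5] * 6 + [6] * 3 + [8] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

In the notebook this would correspond to `SVC(class_weight='balanced')` fitted on the original multi-class `quality` column instead of the binary one.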

## Random Forest

In [None]:
from sklearn.ensemble  import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(rfc, X_test, y_test)

In [None]:
print(classification_report(y_test,pred_rfc))

## Random Forest performed better.

## If my notebook was helpful to you, make sure to give it an upvote.
## Thank you!