# Wine quality prediction with Machine Learning
#### Ruimeng Wang

## Part 1. Basic Data Analysis

Let us start with some basic analysis of the dataset. First, let us look at the first few rows.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

red_wine=pd.read_csv("C:/Users/wrm/Desktop/wine-quality/winequality-red.csv")
white_wine=pd.read_csv("C:/Users/wrm/Desktop/wine-quality/winequality-white.csv")
print (red_wine.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 

Let us check whether there are any missing values in the dataset.

In [3]:
print (red_wine.isnull().any())

fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
quality                 False
dtype: bool
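As a complement to the check above, the number of missing values per column can be counted instead of only flagged. The following is a minimal sketch on a small made-up dataframe (the `toy` frame and its values are assumptions for illustration, not part of the wine data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 9.8],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 6],
})

# isnull() marks each missing cell; sum() counts them per column
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("total missing:", total_missing)
```

On the real data, the same two calls on `red_wine` should report zero for every column, in line with the output above.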


So there are no null values in the dataframe. Next, let us use the describe method to look at its basic summary statistics.

In [4]:
print (red_wine.describe())

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000         

Now, let us visualize the data with a pairwise scatter plot of all columns.

In [5]:
sns.pairplot(red_wine)
plt.show()

## Part 2. Machine Learning

In [6]:
X=red_wine.drop(columns=["quality"])
y=red_wine["quality"]
#print (X.head())
#print (y.head())

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier()
clf.fit(X_train,y_train)
y_predict=clf.predict(X_test)
#print (y_predict)

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
print (classification_report(y_predict,y_test))
print (accuracy_score(y_predict,y_test))

              precision    recall  f1-score   support

           3       0.00      0.00      0.00         0
           4       0.00      0.00      0.00         3
           5       0.72      0.68      0.70       231
           6       0.67      0.60      0.63       235
           7       0.46      0.56      0.50        57
           8       0.14      0.50      0.22         2

   micro avg       0.63      0.63      0.63       528
   macro avg       0.33      0.39      0.34       528
weighted avg       0.66      0.63      0.64       528

0.6268939393939394
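One caveat when reading this report: scikit-learn's metric functions take the true labels first and the predictions second, while the call above passes `y_predict` first. Accuracy is symmetric, so the 0.627 figure is unaffected, but per-class precision and recall trade places when the arguments are swapped, and the support column then counts predicted labels rather than true ones. The sketch below illustrates this on a few made-up labels (`y_true` and `y_pred` here are assumptions for illustration only):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels; every class appears in both lists
y_true = [5, 5, 5, 6, 6, 7]
y_pred = [5, 5, 6, 6, 7, 7]
labels = [5, 6, 7]

# Documented order: (y_true, y_pred)
precision = precision_score(y_true, y_pred, labels=labels, average=None)
recall = recall_score(y_true, y_pred, labels=labels, average=None)

# Swapped order: predictions are treated as the ground truth
swapped_precision = precision_score(y_pred, y_true, labels=labels, average=None)
swapped_recall = recall_score(y_pred, y_true, labels=labels, average=None)

print("precision:", precision, "recall:", recall)
print("swapped precision:", swapped_precision, "swapped recall:", swapped_recall)
```

Calling `classification_report(y_test, y_predict)` would give the conventional reading of the columns.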




Amazing! With the default model alone we already get a fairly high prediction accuracy (about 63%, whereas uniform random guessing over the six quality classes gives only about 17%). Let us now implement some feature engineering and parameter tuning.
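The random-guess comparison above can be made concrete with a little arithmetic. Guessing uniformly among k classes is right 1/k of the time on average, whatever the class balance; a stricter reference point is always predicting the most common class. The sketch below computes both for a made-up label list (the helper `baseline_accuracies` and the `toy_quality` values are hypothetical, not taken from the wine data):

```python
from collections import Counter

def baseline_accuracies(labels):
    """Return (uniform-guess accuracy, majority-class accuracy) for a label list."""
    counts = Counter(labels)
    uniform = 1.0 / len(counts)                   # expected accuracy of a uniform random guess
    majority = max(counts.values()) / len(labels)  # accuracy of always predicting the top class
    return uniform, majority

# Hypothetical labels covering six quality classes
toy_quality = [3, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8]
uniform, majority = baseline_accuracies(toy_quality)
print("uniform guess:", round(uniform, 3))
print("majority class:", round(majority, 3))
```

Running `baseline_accuracies(list(y_test))` on the real split would show how much of the model's accuracy goes beyond simply predicting the most frequent quality score.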

In [8]:
from sklearn.model_selection import GridSearchCV
#Finding best parameters for our RF model
rfc = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt' ,n_estimators=50, oob_score = True) 

param_grid = { 
    'n_estimators': [200,300,400,500,600, 700, 800],
    'max_features': ['sqrt', 'log2']  # 'auto' was an alias of 'sqrt' for classifiers; newer scikit-learn versions no longer accept it
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
CV_rfc.fit(X_train, y_train)
print (CV_rfc.best_params_)



{'max_features': 'sqrt', 'n_estimators': 500}
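Once the search has finished, the refitted best model is available as `best_estimator_` and can be scored on the held-out test set. The sketch below shows the pattern on synthetic data with a deliberately tiny grid so that it runs quickly; the dataset, the grid values, and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration, not the settings used above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 11 features, 3 classes (hypothetical, not the wine data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# Small illustrative grid; cross-validation uses the training split only
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [20, 50], "max_features": ["sqrt", "log2"]},
    cv=3,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on the full training split
best_model = search.best_estimator_
test_accuracy = best_model.score(X_te, y_te)

print("best params:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(test_accuracy, 3))
```

With the notebook's own objects, the equivalent step would be `CV_rfc.best_estimator_.score(X_test, y_test)`, which allows a direct comparison with the untuned accuracy reported earlier.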
