**About the dataset:**
Input variables (based on physicochemical tests):
* 1 - fixed acidity
* 2 - volatile acidity
* 3 - citric acid
* 4 - residual sugar
* 5 - chlorides
* 6 - free sulfur dioxide
* 7 - total sulfur dioxide
* 8 - density
* 9 - pH
* 10 - sulphates
* 11 - alcohol

Output variable (based on sensory data):
* 12 - quality (score between 0 and 10)

In [None]:
#import basic libraries
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')

In [None]:
raw_data = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
# Check the data
raw_data.info()

There are 1599 observations in the dataset. Luckily, no missing values!
Let's have a look at the data.

In [None]:
raw_data.head()

In [None]:
sns.countplot(x=raw_data.quality);

*Although quality can take values from 0 to 10, only six values (3, 4, 5, 6, 7, 8) occur here. Let us partition them into 'good' and 'bad' ranges: values less than or equal to 5 correspond to bad-quality wine, and values of 6 or more to good-quality wine.*

In [None]:
# Correlation of each feature with quality; [:-1] drops quality's correlation with itself
raw_data.corr()['quality'].sort_values()[:-1]

In [None]:
data = raw_data.copy()
plt.figure(figsize=(12,12))
sns.heatmap(data.corr(),annot=True);

Quality correlates most strongly with alcohol (positively) and volatile acidity (negatively).
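
Reading the strongest correlations off a heatmap can be error-prone, so a ranking by absolute correlation may help. The sketch below uses a small synthetic frame (the `toy` data and its coefficients are made up for illustration, not taken from the wine dataset); the same two lines at the end should work on any numeric DataFrame with a `quality` column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.53, 0.18, n)
ph = rng.normal(3.3, 0.15, n)
# Quality rises with alcohol and falls with volatile acidity, plus noise.
quality = 0.5 * alcohol - 2.0 * volatile_acidity + rng.normal(0, 0.5, n)
toy = pd.DataFrame({'alcohol': alcohol,
                    'volatile acidity': volatile_acidity,
                    'pH': ph,
                    'quality': quality})

# Correlation of each feature with the target, ranked by absolute strength.
corr_with_quality = toy.corr()['quality'].drop('quality')
order = corr_with_quality.abs().sort_values(ascending=False).index
ranked = corr_with_quality.reindex(order)
print(ranked)
```

Ranking by absolute value keeps strong negative correlations near the top instead of pushing them to the bottom of a plain sort.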


In [None]:
def quality_trans(x):
    # 0 = bad (quality <= 5), 1 = good (quality >= 6)
    if x<6:
        return 0
    else:
        return 1
data.quality = data.quality.map(quality_trans)
sns.countplot(x=data.quality);
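
The mapping above can also be written as a single vectorized comparison. This is only a sketch on a hypothetical Series of scores, assuming the same cut-off of 6:

```python
import pandas as pd

# Hypothetical scores covering the six values seen in the data.
quality = pd.Series([3, 4, 5, 6, 7, 8])

# 1 ('good') for scores of 6 and above, 0 ('bad') otherwise.
is_good = (quality >= 6).astype(int)
print(is_good.tolist())  # [0, 0, 0, 1, 1, 1]
```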

In [None]:
data.quality.value_counts()

To balance the classes, let's upsample the minority class using `resample` from `sklearn.utils`.

In [None]:
from sklearn.utils import resample,shuffle
df_majority = data[data['quality']==1]
df_minority = data[data['quality']==0]
# Sample the minority class with replacement until it matches the majority class size
df_minority_upsampled = resample(df_minority,replace=True,n_samples=len(df_majority),random_state = 123)
balanced_df = pd.concat([df_minority_upsampled,df_majority])
balanced_df = shuffle(balanced_df,random_state=123)
balanced_df.quality.value_counts()
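
One thing to keep in mind with upsampling: if it happens before the train/test split, copies of the same row can end up on both sides of the split. A possible alternative ordering, sketched here on a made-up frame (`toy` and its column values are assumptions for illustration), is to split first and upsample only the training part:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced frame (hypothetical data): 60 rows of class 1, 40 of class 0.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'feature': rng.normal(size=100),
                    'quality': [1] * 60 + [0] * 40})

# Split first, so the held-out rows are never duplicated into training.
train_df, test_df = train_test_split(
    toy, test_size=0.3, stratify=toy['quality'], random_state=42)

# Upsample the minority class inside the training split only.
train_major = train_df[train_df['quality'] == 1]
train_minor = train_df[train_df['quality'] == 0]
train_minor_up = resample(train_minor, replace=True,
                          n_samples=len(train_major), random_state=123)
train_balanced = pd.concat([train_major, train_minor_up]).sample(
    frac=1, random_state=0)

print(train_balanced['quality'].value_counts())
```

The test split keeps its original class ratio here, which is usually closer to what the model would see on new data.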

In [None]:
balanced_df.describe()

The feature means differ widely in magnitude, so we will standardize the data to put all features on the same scale.
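
As a quick illustration of what standardization does, the sketch below compares `StandardScaler` with the formula z = (x - mean) / std computed by hand. The two columns are hypothetical numbers chosen only to sit on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
X = np.array([[0.995, 30.0],
              [0.997, 60.0],
              [0.999, 90.0]])

# StandardScaler applies z = (x - mean) / std per column,
# using the population standard deviation (ddof=0).
scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

After scaling, each column has mean 0 and standard deviation 1, so no feature dominates purely because of its units.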

Also, the mean and max of residual sugar are far apart, which suggests the presence of outliers.

In [None]:
sns.boxplot(x=balanced_df['residual sugar']);

In [None]:
len(balanced_df[balanced_df['residual sugar']>4])
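
The threshold of 4 above is read off the box plot by eye. The whiskers of a default box plot follow the 1.5 × IQR rule, so the cut-off can also be computed. A minimal sketch on hypothetical values (not the real residual sugar column):

```python
import pandas as pd

# Hypothetical residual-sugar-like values; the last two are deliberately extreme.
values = pd.Series([1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.5, 2.6, 9.0, 15.5])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr   # upper whisker limit in a default box plot
lower_fence = q1 - 1.5 * iqr   # lower whisker limit

# Points beyond either fence are the ones drawn as individual markers.
outliers = values[(values > upper_fence) | (values < lower_fence)]
print(upper_fence, len(outliers))
```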

In [None]:
sns.boxplot(x=balanced_df['volatile acidity']);

For now, I am not handling the outliers because the dataset is small.

In [None]:
# standardization
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X = balanced_df.drop('quality',axis=1)
y = balanced_df.quality
# Scale every feature to zero mean and unit variance
scaled_X = pd.DataFrame(StandardScaler().fit_transform(X),columns=X.columns)
# Hold out 30% of the rows for testing
x_train,x_test,y_train,y_test = train_test_split(scaled_X,y,test_size=0.3,shuffle=True,random_state=42)
x_train.shape,x_test.shape
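
The cell above fits the scaler on all rows before splitting, so the test rows contribute to the mean and standard deviation. A possible variant, shown here as a sketch on synthetic data from `make_classification` (an assumption, standing in for the wine features), wraps the scaler and a model in a pipeline so the scaling statistics come from the training rows only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 11 features, mirroring the number of input variables.
X_toy, y_toy = make_classification(n_samples=300, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42)

# The pipeline fits the scaler on the training rows only, then reuses
# those training statistics to transform the test rows at predict time.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(round(test_accuracy, 3))
```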

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import f1_score,accuracy_score

classifiers = {
    'Logistic Regression' : LogisticRegression(),
    'Decision Tree' : DecisionTreeClassifier(),
    'Random Forest' : RandomForestClassifier(),
    'Support Vector Machines' : SVC(),
    'K-nearest Neighbors' : KNeighborsClassifier(),
    'XGBoost' : XGBClassifier()
}
results=pd.DataFrame(columns=['Accuracy in %','F1-score'])
# Fit each classifier on the training split and score it on the held-out test split
for method,model in classifiers.items():
    model.fit(x_train,y_train)
    pred = model.predict(x_test)
    results.loc[method]= [round(100*accuracy_score(y_test,pred),2),
                         round(f1_score(y_test,pred),2)]
results
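
For readers less familiar with the F1-score column, it is the harmonic mean of precision and recall. The sketch below computes it by hand from confusion-matrix counts and compares it with `f1_score`; the labels and predictions are hypothetical, not output from the models above.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1_manual = 2 * precision * recall / (precision + recall)

print(round(f1_manual, 4), round(f1_score(y_true, y_pred), 4))
```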

In [None]:
# Now let's evaluate the random forest model using 10-fold cross-validation.
from sklearn.model_selection import cross_val_score
rfc_eval = cross_val_score(estimator = RandomForestClassifier(), X = x_train, y = y_train, cv = 10)
rfc_eval.mean()
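
The mean alone does not show how much the folds disagree, so it can be useful to report the spread as well. A minimal sketch, assuming synthetic data from `make_classification` and an explicit `StratifiedKFold` (both are illustrative choices, not part of the analysis above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=11, random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring='f1')

print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A small standard deviation across folds suggests the score is not driven by one lucky split.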

The random forest model seems promising. :)
