# Red Wine Quality Prediction using Regression

# Problem Statement

The dataset is related to the red variant of the Portuguese "Vinho Verde" wine. For more details,
consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical
(inputs) and sensory (the output) variables are available (e.g. there is no data about grape types,
wine brand, wine selling price, etc.).

This dataset can be viewed as a regression task. The classes are ordered and not balanced (e.g.
there are many more normal wines than excellent or poor ones).

Apply regression to predict the quality of the wine.


In [4]:
"""
Attribute Information:

Input variables (based on physicochemical tests):

1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
"""
print()




# Importing the required modules

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,confusion_matrix,mean_absolute_error
%matplotlib inline

# Loading the dataset

In [6]:
df = pd.read_csv("winequality-red.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'winequality-red.csv'
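The load fails when the CSV is not in the working directory. A minimal sketch of a more defensive loader follows; `load_wine_csv` is a hypothetical helper, not part of the original notebook. It assumes the file may be either comma-separated or semicolon-separated (the UCI distribution uses semicolons), so it asks pandas to sniff the delimiter.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_wine_csv(path):
    """Hypothetical helper: read the wine CSV, failing early with a clear message."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path.resolve()}")
    # sep=None with the python engine asks pandas to sniff the delimiter
    return pd.read_csv(path, sep=None, engine="python")


# Demo on a tiny semicolon-separated file written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy-wine.csv"
    toy_path.write_text("fixed acidity;quality\n7.4;5\n7.8;6\n")
    demo = load_wine_csv(toy_path)

print(demo.columns.tolist())
```

With the real file in place, `df = load_wine_csv("winequality-red.csv")` would play the same role as the `pd.read_csv` call in the cell.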

In [None]:
# visualize the first ten rows of dataset
df.head(10)

In [None]:
# Check the shape of the dataset for the number of rows and columns
df.shape

# Data Analyzing and Wrangling

As we can see, some column names contain spaces, which is inconvenient when working with the data, so we replace the spaces with "_".

In [None]:
# replace spaces with _ for each column
df.columns = df.columns.str.replace(" ","_")

In [None]:
# check whether there are object data types
df.info()

In [None]:
df.describe()

In [None]:
# check whether there are missing values
df.isnull().sum()

There are no object data types or null values in the dataset, so it is ready to use.

# Exploratory Data Analysis

Target variable is "quality". Let us plot some information about it.

In [None]:
df["quality"].value_counts()

In [None]:
# visualizing the different quality values
sns.countplot(x=df["quality"])

In [None]:
df.count()

Let us study the correlation between the label "quality" and the features of the dataset to see which features are most strongly correlated with it and therefore matter most when predicting the quality of a wine.

In [None]:
# calculate and order the correlations with respect to quality
correlations = df.corr()["quality"].sort_values(ascending=False)
correlations

From the above data, we can infer that alcohol has the strongest positive correlation with wine quality, whereas volatile_acidity has the strongest negative correlation.

In [None]:
correlations.plot(kind="bar")

Let's plot the correlation matrix to get a better understanding of how the features correlate with each other.

In [None]:
# heatmap to plot all correlations between features
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(),annot=True,cmap="coolwarm")

In [None]:
"""
From this matrix we can observe, apart from the information we had before, some clear correlations (absolute value >= 0.5) among features, such as:

fixed_acidity        -->  citric_acid, density, pH
volatile_acidity     -->  citric_acid
citric_acid          -->  volatile_acidity, pH
free_sulfur_dioxide  -->  total_sulfur_dioxide
total_sulfur_dioxide -->  free_sulfur_dioxide
density              -->  fixed_acidity, alcohol
pH                   -->  fixed_acidity, citric_acid
alcohol              -->  density

From all these features, we select those that correlate strongly with quality and leave out those that are redundant and add little information.
"""
print()
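The list of feature pairs above was read off the heatmap. As a rough sketch, the same kind of list can be produced programmatically. `correlated_pairs` is a hypothetical helper, and the toy DataFrame below is synthetic, so its columns and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def correlated_pairs(data, threshold=0.5):
    """Return (col_a, col_b, r) for every pair with |r| >= threshold."""
    corr = data.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    # strongest correlations first
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Synthetic data: "b" is a noisy copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
pairs = correlated_pairs(toy)
print(pairs)
```

On the wine data, `correlated_pairs(df.drop(columns="quality"))` would list the feature pairs without the label.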

In [None]:
print(abs(correlations) > 0.2)
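The cell above prints a boolean mask. A short sketch of turning such a mask into a list of feature names is shown here; the correlation values are made up for illustration and are not the values computed from the wine dataset.

```python
import pandas as pd

# Illustrative (made-up) correlations with the label, including the label itself
corr_with_quality = pd.Series({
    "alcohol": 0.6,
    "sulphates": 0.3,
    "pH": -0.05,
    "volatile_acidity": -0.4,
    "quality": 1.0,
})

# Keep features whose absolute correlation exceeds 0.2, then drop the label
mask = corr_with_quality.abs() > 0.2
selected = corr_with_quality[mask].drop("quality").index.tolist()
print(selected)
```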

From all the values, we are selecting alcohol, sulphates, citric_acid and volatile_acidity in order to study them better and see the distribution of values that separate the different qualities.

Alcohol percent in different quality wines

In [None]:
alc =sns.boxplot(x="quality",y="alcohol", data=df)
alc.set(title="Alcohol Percent in Different Quality Wines")

From the above boxplot, we can observe that wine quality tends to increase as the alcohol percentage increases.

Sulphates percent in different quality wines

In [None]:
sp =sns.boxplot(x="quality",y="sulphates", data=df)
sp.set(title="Sulphates Percent in Different Quality Wines")

From the above boxplot, we can observe a slight increase in wine quality as the sulphate level increases.

Citric acid percent in different quality wines

In [None]:
cit =sns.boxplot(x="quality",y="citric_acid", data=df)
cit.set(title="Citric Acid Percent in Different Quality Wines")

From the above boxplot, we can observe that wines with more citric acid tend to receive higher quality ratings.

Volatile acid percent in different quality wines

In [None]:
vol =sns.boxplot(x="quality",y="volatile_acidity", data=df)
vol.set(title="Volatile Acidity Percent in Different Quality Wines")

For volatile acidity, we can clearly observe that the less of it is present, the higher the wine is rated.

From the above features, we see a clear correlation between volatile_acidity and citric_acid, but we select volatile_acidity because it correlates more strongly with quality than citric_acid does.

# Features Selection

In [None]:
# return the columns whose absolute correlation with an earlier column exceeds the threshold
def get_correlation(data, threshold):
    corr_col = []
    corr_matrix = data.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i,j]) > threshold:
                col_name = corr_matrix.columns[i]
                if col_name not in corr_col:
                    corr_col.append(col_name)
    return corr_col
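To make the behaviour of `get_correlation` easier to see, here is an equivalent sketch that uses an upper-triangle mask instead of nested loops. The name `get_correlation_vectorized` and the toy data are assumptions for illustration; for each highly correlated pair, the later column is the one reported, as in the loop version.

```python
import numpy as np
import pandas as pd


def get_correlation_vectorized(data, threshold):
    """Columns whose |correlation| with any earlier column exceeds threshold."""
    corr = data.corr().abs()
    # Keep only entries above the diagonal; everything else becomes NaN
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Column c of `upper` holds its correlations with the columns before it
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Synthetic data: "b" nearly duplicates "a", "c" is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.05 * rng.normal(size=300),
    "c": rng.normal(size=300),
})
to_drop = get_correlation_vectorized(toy, 0.6)
print(to_drop)
```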

Slicing the dataset into features and label (quality)

In [None]:
x = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [None]:
x

In [None]:
y

Splitting the data into training and testing datasets

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
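Because the quality classes are not balanced, a plain random split can leave rare scores under-represented in the test set. One option, not used in the original notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels; note that stratification requires every class to have at least two samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: many 5s, fewer 6s, very few 7s
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([5] * 120 + [6] * 60 + [7] * 20)

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Count how many samples of each class ended up in the test split
values, counts = np.unique(y_te, return_counts=True)
test_counts = {int(v): int(c) for v, c in zip(values, counts)}
print(test_counts)
```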

In [None]:
corr_features = get_correlation(x_train,0.6)
corr_features

Drop the columns in corr_features, which are highly correlated with other features

In [None]:
# drop() returns a new DataFrame, so the result must be assigned back
x_train = x_train.drop(corr_features,axis=1)
x_test = x_test.drop(corr_features,axis=1)

Data scaling of the features dataset

In [None]:
sc = StandardScaler()
# fit the scaler on the training data only, then reuse its statistics for the test data
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
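The scaler's mean and standard deviation should come from the training split only, so that no information from the test split leaks into the model. One way to make this hard to get wrong is to wrap the scaler and the model in a scikit-learn pipeline. This is a sketch on synthetic data, not the notebook's own approach; all variable names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with an exact linear relationship
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=10.0, scale=2.0, size=(80, 2))
X_test_toy = rng.normal(loc=10.0, scale=2.0, size=(20, 2))
y_train_toy = X_train_toy @ np.array([1.0, -2.0]) + 3.0

# fit() fits the scaler on the training data, then fits the regressor;
# predict() only applies the already-fitted scaler to new data
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train_toy, y_train_toy)
pred_toy = model.predict(X_test_toy)

# The scaler inside the pipeline has seen the training split only
scaler = model.named_steps["standardscaler"]
print(scaler.mean_)
```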

In [None]:
x_train

In [None]:
y_train

# Linear Regression

In [None]:
# fit the model
linear_reg = LinearRegression()
linear_reg.fit(x_train, y_train)

In [None]:
# predict using x_test values
y_pred = linear_reg.predict(x_test)
print(y_pred[:100])
y_pred = np.round(y_pred)

Accuracy of Linear Regression Model

In [None]:
# plot the distribution of absolute errors between predictions and test labels
sns.displot(abs(y_test-y_pred))

In [None]:
# Evaluation of the model
print('Mean Absolute Error     : ',mean_absolute_error(y_test, y_pred))
# RMSE is the square root of the mean squared error, not of the mean absolute error
print('Root Mean Squared Error : ',np.sqrt(np.mean((y_test - y_pred) ** 2)))
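The two metrics summarise the same errors differently: the mean absolute error averages the error magnitudes, while the root mean squared error squares them first and therefore weights large misses more heavily. A small worked example with made-up values:

```python
import numpy as np

# Made-up labels and predictions, for illustration only
y_true = np.array([5, 6, 5, 7])
y_hat = np.array([5, 5, 6, 5])

errors = y_true - y_hat                 # [0, 1, -1, 2]
mae = np.mean(np.abs(errors))           # (0 + 1 + 1 + 2) / 4 = 1.0
rmse = np.sqrt(np.mean(errors ** 2))    # sqrt((0 + 1 + 1 + 4) / 4) = sqrt(1.5)

print(mae, rmse)
```

Here the single error of size 2 pushes the RMSE above the MAE.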

In [None]:
acc_score = round(accuracy_score(y_test, y_pred)*100,2)
print("Accuracy of the Linear Regression model = ",acc_score,"%")
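Accuracy is a classification metric, so applying it here relies on rounding the regression output to whole quality scores. The sketch below, on made-up numbers, shows that step explicitly and adds a "within one point" accuracy as a softer measure; the tolerance metric and the clipping to the 0 to 10 range are additions for illustration, not part of the original notebook. Note that NumPy rounds halves to the nearest even integer.

```python
import numpy as np

# Made-up raw regression outputs and true quality scores
raw_pred = np.array([5.4, 6.6, 4.9, 7.2, 5.5])
y_true = np.array([5, 6, 5, 7, 6])

# Round to the nearest integer and keep the result inside the 0-10 score range
y_class = np.clip(np.rint(raw_pred), 0, 10).astype(int)

exact_accuracy = np.mean(y_class == y_true)                 # exact matches only
within_one_accuracy = np.mean(np.abs(y_class - y_true) <= 1)  # off by at most one point

print(y_class, exact_accuracy, within_one_accuracy)
```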

In [None]:
cnf = confusion_matrix(y_test,y_pred)
cnf

In [None]:
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot = True,cmap="coolwarm")

# SVM Regressor

In [None]:
svr = SVR(kernel="rbf")
svr.fit(x_train, y_train)

In [None]:
y_pred_svr = svr.predict(x_test)
print(y_pred_svr[:100])
y_pred_svr = np.round(y_pred_svr)

In [None]:
# plot the distribution of absolute errors between predictions and test labels
sns.displot(abs(y_test-y_pred_svr))

In [None]:
# Evaluation of the model
print('Mean Absolute Error     : ',mean_absolute_error(y_test, y_pred_svr))
# RMSE is the square root of the mean squared error, not of the mean absolute error
print('Root Mean Squared Error : ',np.sqrt(np.mean((y_test - y_pred_svr) ** 2)))

Accuracy of SVM Regressor Model

In [None]:
acc_score_svr = round(accuracy_score(y_test, y_pred_svr)*100,2)
print("Accuracy of the SVM Regressor model = ",acc_score_svr,"%")

In [None]:
cf_matrix = confusion_matrix(y_test, y_pred_svr)
sns.heatmap(cf_matrix, annot = True,cmap="coolwarm")

# Decision Tree Regressor

In [None]:
dtr = DecisionTreeRegressor(random_state=142)
dtr.fit(x_train, y_train)

In [None]:
y_pred_dtr = dtr.predict(x_test)
print(y_pred_dtr)
y_pred_dtr = np.round(y_pred_dtr)

In [None]:
# plot the distribution of absolute errors between predictions and test labels
sns.displot(abs(y_test-y_pred_dtr))

In [None]:
# Evaluation of the model
print('Mean Absolute Error     : ',mean_absolute_error(y_test, y_pred_dtr))
# RMSE is the square root of the mean squared error, not of the mean absolute error
print('Root Mean Squared Error : ',np.sqrt(np.mean((y_test - y_pred_dtr) ** 2)))

Accuracy of Decision Tree Regressor Model

In [None]:
acc_score_dtr = round(accuracy_score(y_test, y_pred_dtr)*100,2)
print("Accuracy of the Decision Tree Regressor model = ",acc_score_dtr,"%")

In [None]:
cf_matrix = confusion_matrix(y_test, y_pred_dtr)
sns.heatmap(cf_matrix, annot = True,cmap="coolwarm")

# Random Forest Regressor

In [None]:
rfr = RandomForestRegressor(n_estimators=10,random_state = 1)
rfr.fit(x_train, y_train)

In [None]:
y_pred_rfr = rfr.predict(x_test)
print(y_pred_rfr)
y_pred_rfr = np.round(y_pred_rfr)

In [None]:
# plot the distribution of absolute errors between predictions and test labels
sns.displot(abs(y_test-y_pred_rfr))

In [None]:
# Evaluation of the model
print('Mean Absolute Error     : ',mean_absolute_error(y_test, y_pred_rfr))
# RMSE is the square root of the mean squared error, not of the mean absolute error
print('Root Mean Squared Error : ',np.sqrt(np.mean((y_test - y_pred_rfr) ** 2)))

Accuracy of Random Forest Regressor Model

In [None]:
acc_score_rfr = round(accuracy_score(y_test, y_pred_rfr)*100,2)
print("Accuracy of the Random Forest Regressor model = ",acc_score_rfr,"%")

In [None]:
cf_matrix = confusion_matrix(y_test, y_pred_rfr)
sns.heatmap(cf_matrix, annot = True,cmap="coolwarm")

From these observations, we found that, for this particular dataset, the Random Forest Regressor predicted the quality of red wine with the highest accuracy, followed by Linear Regression and the SVM Regressor. The Decision Tree Regressor had the lowest accuracy.

We aimed to improve the models by using correlations to select the features that best fit them.
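A comparison like the one above is easier to reproduce when all models are fitted and scored in one loop. The sketch below does this on synthetic data from `make_regression`, so its numbers and ranking say nothing about the wine dataset; the model settings mirror the ones used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the wine data
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "SVM Regressor": SVR(kernel="rbf"),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=142),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=10, random_state=1),
}

# Fit each model on the same split and record its test-set error
rows = []
for name, estimator in models.items():
    estimator.fit(X_a, y_a)
    rows.append({"model": name, "MAE": mean_absolute_error(y_b, estimator.predict(X_b))})

# Lowest error first
summary = pd.DataFrame(rows).sort_values("MAE").reset_index(drop=True)
print(summary)
```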
