<center><h1 style="font-size:300%;"><b style="color:red;">RED WINE QUALITY - DATA ANALYSIS</b></h1></center>

<h1 style="font-size:200%;">ABOUT DATASET</h1>

<p style="font-size:150%;">The Wine Quality dataset contains physicochemical measurements of wines. The full
dataset comes in two parts, red wine and white wine; this notebook uses the red wine part. Each
wine has a quality label in the range of 0 to 10. In the next section, we load the dataset into
Python and perform an initial analysis to see what it contains.</p>

<h1 style="font-size:200%;">FEATURES DESCRIPTION</h1>
<ul>
    <li style="font-size:150%;"><b>Fixed acidity:</b> It indicates the amount of tartaric acid in the wine and is measured in g/dm3.</li>
    <li style="font-size:150%;"><b>Volatile acidity:</b> It indicates the amount of acetic acid in the wine and is measured in g/dm3.</li>
    <li style="font-size:150%;"><b>Citric acid:</b> It indicates the amount of citric acid in the wine and is measured in g/dm3.</li>
    <li style="font-size:150%;"><b>Residual sugar:</b> It indicates the amount of sugar left in the wine after the fermentation process is done. It is measured in g/dm3.</li>
    <li style="font-size:150%;"><b>Chlorides:</b> It indicates the amount of salt (sodium chloride) in the wine and is measured in g/dm3.</li>
    <li style="font-size:150%;"><b>Free sulfur dioxide:</b> It measures the amount of sulfur dioxide (SO2) in free form. It is measured in mg/dm3.</li>
    <li style="font-size:150%;"><b>Total sulfur dioxide:</b> It measures the total amount of SO2 in the wine, in mg/dm3. This chemical works as an antioxidant and antimicrobial agent.</li>
    <li style="font-size:150%;"><b>Density:</b> It indicates the density of the wine and is measured in g/cm3.</li>
    <li style="font-size:150%;"><b>pH:</b> It indicates the pH value of the wine on a scale from 0 to 14, where 0 indicates very high acidity and 14 indicates very high basicity.</li>
    <li style="font-size:150%;"><b>Sulphates:</b> It indicates the amount of potassium sulphate in the wine and is measured in g/dm3.</li>
    <li style="font-size:150%;"><b>Alcohol:</b> It indicates the alcohol content of the wine, in percent by volume.</li>
    <li style="font-size:150%;"><b>Quality:</b> It indicates the quality of the wine on a scale from 0 to 10. The higher the value, the better the wine.</li>
</ul>


------------

# Import Required Libraries and Read the Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
df.head()

# Descriptive Statistics

In [None]:
#Statistical Analysis:
df.describe()

In [None]:
#datatype information:
df.info()

In [None]:
#Get the datatypes of each feature:
df.dtypes

# DATA PREPROCESSING

In [None]:
#check for missing values:
df.isnull().sum()

# EXPLORATORY DATA ANALYSIS

In [None]:
#count plot of quality variable:

sns.set(rc={'figure.figsize':(14, 8)})
sns.countplot(x='quality', data=df)

In [None]:
#lets see whether our data has outliers or not:


# create box plots
fig, ax = plt.subplots(ncols=6, nrows=2, figsize=(20,10))
index = 0
ax = ax.flatten()

for col, value in df.items():
    sns.boxplot(y=col, data=df, color='r', ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)
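
* <p style="font-size:150%;">The box plots show outliers visually. To put a number on them, here is a minimal sketch that counts, per column, the values outside the 1.5 × IQR fences (the default whisker rule in seaborn box plots). The helper name <code>count_iqr_outliers</code> and the small sample frame are assumptions made for illustration and are not part of the dataset.</p>

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons broadcast the per-column fences across all rows
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

# Made-up values standing in for two columns of df
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 10.0, 10.5, 14.9],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.30, 3.28],
})
outlier_counts = count_iqr_outliers(sample)
print(outlier_counts)
```

* <p style="font-size:150%;">On the notebook's data, the equivalent call would be <code>count_iqr_outliers(df)</code>.</p>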

## FIND CORRELATED COLUMNS:

In [None]:
#Method 1
sns.pairplot(df)

In [None]:
#Method 2
sns.heatmap(df.corr(), annot=True, fmt='.2f', linewidths=2)

<h1 style="font-size:160%;">Insights From Above Figure:</h1>

<ul>
    <li style="font-size:130%;">Alcohol is positively correlated with the quality of the red wine.</li>
    <li style="font-size:130%;">Alcohol has a weak positive correlation with the pH value.</li>
    <li style="font-size:130%;">Citric acid and density have a strong positive correlation with fixed acidity.</li>
    <li style="font-size:130%;">pH has a negative correlation with density, fixed acidity, citric acid, and sulphates.</li>
</ul>
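
* <p style="font-size:150%;">Reading individual cells off a large heatmap is error-prone, so it can help to rank the features by their correlation with the target. The sketch below does this on a made-up frame; only the column names follow the dataset, and the values are assumptions for illustration.</p>

```python
import pandas as pd

# Made-up rows standing in for df; only the column names follow the dataset
sample = pd.DataFrame({
    "alcohol":          [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "volatile acidity": [0.70, 0.88, 0.60, 0.45, 0.40, 0.30],
    "quality":          [5, 5, 6, 6, 7, 8],
})

# Correlation of every feature with the target, strongest positive first
quality_corr = (
    sample.corr()["quality"]
    .drop("quality")
    .sort_values(ascending=False)
)
print(quality_corr)
```

* <p style="font-size:150%;">Replacing <code>sample</code> with <code>df</code> gives the same ranking for the real data.</p>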

In [None]:
# Let's see how the alcohol concentration is distributed.
# histplot with kde=True replaces the deprecated distplot.
sns.histplot(df['alcohol'], kde=True)

* <p style="font-size:150%;">We can see that the alcohol distribution is positively skewed, that is, it has a longer tail on the right. We can verify this using the skew function from scipy.stats. Check the snippet given here:</p>

In [None]:
from scipy.stats import skew
skew(df['alcohol'])

* <p style="font-size:150%;">A positive output confirms that the alcohol column is positively skewed.</p>
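
* <p style="font-size:150%;">To see what that number means, here is a minimal sketch of the skewness formula that <code>scipy.stats.skew</code> uses by default: the third central moment divided by the second central moment raised to the power 1.5. The function name <code>sample_skewness</code> and the two small lists are assumptions for illustration.</p>

```python
import numpy as np

def sample_skewness(values) -> float:
    """Biased Fisher-Pearson skewness: m3 / m2**1.5 (central moments)."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m3 = np.mean(centered ** 3)  # third central moment
    return float(m3 / m2 ** 1.5)

# Made-up values: one list with a long right tail, one symmetric list
right_tailed = [9.0, 9.2, 9.5, 9.5, 10.0, 10.4, 11.0, 14.0]
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]

print(sample_skewness(right_tailed))  # positive for a right tail
print(sample_skewness(symmetric))     # zero for a symmetric sample
```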


In [None]:
#Dist plot of all features:
# create dist plot
fig, ax = plt.subplots(ncols=6, nrows=2, figsize=(20,10))
index = 0
ax = ax.flatten()

for col, value in df.items():
    sns.histplot(value, kde=True, color='r', ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

* <p style="font-size:150%;">The figures above show the distribution of each feature. A few of them are approximately normally distributed, while the others are right-skewed. The range of each feature is also fairly small.</p>

## Alcohol Vs Quality:

In [None]:
sns.boxplot(x='quality', y='alcohol', data = df)

* <p style="font-size:150%;">The figure above shows some dots beyond the whiskers of the boxes. Those are outliers. Most of them belong to wines with quality 5 and 6. We can hide the outliers in the plot by passing the argument <code>showfliers=False</code>:</p>


In [None]:
sns.boxplot(x='quality', y='alcohol', data = df, showfliers=False)

* <p style="font-size:150%;">Overall, wines with a higher quality rating tend to have a higher alcohol concentration.</p>
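
* <p style="font-size:150%;">The trend in the box plot can also be read as a table. The sketch below groups a made-up sample by quality and reports the row count and median alcohol per group; the values are assumptions for illustration, not rows from the dataset.</p>

```python
import pandas as pd

# Made-up rows standing in for df[['quality', 'alcohol']]
sample = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 7, 7, 8],
    "alcohol": [9.4, 9.8, 9.5, 10.5, 10.1, 11.5, 12.0, 12.8],
})

# Number of wines and median alcohol for each quality level
alcohol_by_quality = sample.groupby("quality")["alcohol"].agg(["count", "median"])
print(alcohol_by_quality)
```

* <p style="font-size:150%;">The count column matters here: a median computed from only a handful of wines is much less reliable than one computed from hundreds.</p>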

## Alcohol versus pH

In [None]:
sns.jointplot(x='alcohol',y='pH',data=df, kind='reg')

* <p style="font-size:150%;">This figure shows that alcohol is weakly positively correlated with pH. The regression line drawn in the figure illustrates the relationship between them.</p>

* <p style="font-size:150%;">We can quantify the correlation using the Pearson correlation coefficient (<code>pearsonr</code>) from scipy.stats, as shown here:</p>

In [None]:
from scipy.stats import pearsonr
def get_correlation(column1, column2, df):
    pearson_corr, p_value = pearsonr(df[column1], df[column2])
    print("Correlation between {} and {} is {}".format(column1,column2, pearson_corr))
    print("P-value of this correlation is {}".format(p_value))

In [None]:
get_correlation('alcohol','pH', df)
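
* <p style="font-size:150%;">For reference, the Pearson coefficient is the covariance of the two columns divided by the product of their standard deviations. A minimal sketch of that formula follows; the name <code>pearson_r</code> and the sample values are assumptions for illustration.</p>

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation: sum(xc*yc) / sqrt(sum(xc^2) * sum(yc^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Made-up values standing in for df['alcohol'] and df['pH']
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
ph = [3.51, 3.20, 3.30, 3.39, 3.45]

r = pearson_r(alcohol, ph)
print(r)
```

* <p style="font-size:150%;">The result always lies between -1 and 1; values near 0 indicate a weak linear relationship.</p>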

# MODEL DEVELOPMENT AND EVALUATION

In [None]:
#Import Model libraries:

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split,cross_validate
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score

### HANDLING IMBALANCED DATA

In [None]:
X = df.drop('quality', axis=1)
y = df['quality']

In [None]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE(k_neighbors=4)
# transform the dataset
X, y = oversample.fit_resample(X, y)
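
* <p style="font-size:150%;">One caveat: when oversampling happens before the train/test split, points derived from the same original rows can end up on both sides of the split, which tends to make the scores look better than they are. A common alternative is to split first and oversample only the training part. The sketch below illustrates that order on made-up data. It uses plain random oversampling with pandas <code>sample</code> as a stand-in for SMOTE, and all names and values are assumptions for illustration.</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 20 rows of class 5, 8 rows of class 7
data = pd.DataFrame({
    "alcohol": [9.0 + 0.1 * i for i in range(28)],
    "quality": [5] * 20 + [7] * 8,
})

# 1. Split first, so the test rows are never touched by resampling
train, test = train_test_split(
    data, test_size=0.25, stratify=data["quality"], random_state=42
)

# 2. Oversample every class in the training part up to the largest class
target_size = int(train["quality"].value_counts().max())
balanced_train = pd.concat([
    group.sample(n=target_size, replace=True, random_state=42)
    for _, group in train.groupby("quality")
])

print(train["quality"].value_counts().to_dict())
print(balanced_train["quality"].value_counts().to_dict())
```

* <p style="font-size:150%;">With imbalanced-learn, the same idea would be to call <code>fit_resample</code> on the training split only.</p>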

In [None]:
# classify function
from sklearn.model_selection import cross_val_score, train_test_split
def classify(model, X, y):
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    # train the model
    model.fit(x_train, y_train)
    print("Accuracy:", model.score(x_test, y_test) * 100)
    
    # cross-validation
    score = cross_val_score(model, X, y, cv=5)
    print("CV Score:", np.mean(score)*100)
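
* <p style="font-size:150%;">Accuracy alone can hide weak performance on individual classes. As a complement, the sketch below also reports the macro-averaged F1 score, which weights every class equally. It runs on made-up data; <code>X_demo</code> and <code>y_demo</code> are assumptions for illustration and stand in for <code>X</code> and <code>y</code>.</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Made-up two-feature data with three classes (0, 1, 2)
X_demo = rng.normal(size=(300, 2))
y_demo = np.digitize(X_demo[:, 0] + 0.3 * rng.normal(size=300), [-0.5, 0.5])

x_train, x_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
model = DecisionTreeClassifier(random_state=42).fit(x_train, y_train)
y_pred = model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(f"Accuracy: {accuracy:.3f}  Macro F1: {macro_f1:.3f}")
```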

In [None]:
model = DecisionTreeClassifier()
classify(model, X, y)

In [None]:
model = RandomForestClassifier()
classify(model, X, y)

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
classify(model, X, y)

In [None]:
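import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
# Recent XGBoost versions expect class labels 0..n_classes-1, so re-encode quality
y_encoded = LabelEncoder().fit_transform(y)
model = xgb.XGBClassifier()
classify(model, X, y_encoded)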

<center><h1 style="font-size:300%;"><b style="color:red;">END OF IMPLEMENTATION</b></h1></center>

<h1 style="font-size:200%;">CONCLUSION</h1>

<p style="font-size:150%;">In this kernel, I used the Wine Quality dataset provided by UCI to perform EDA. We covered data loading, checks for missing values and outliers, distribution and correlation analysis between variables, a simple regression plot, and building classical ML models on the dataset.</p>


<center><h1 style="font-size:300%;">If you like this kernel, please share your views in the comments section and give it an upvote</h1></center>