# Objective  
The goal of this kernel is to explore the features in the dataset to make decisions on how to engineer new features and accurately predict wine quality. 

[Basic Analysis](#Basic-Analysis)  
[Feature Distributions](#Feature-Distributions)

First, we import the necessary libraries.

In [None]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

from sklearn.preprocessing import StandardScaler

In [None]:
df = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

# Basic Analysis

In [None]:
df.head()

Using .info(), we see that all features are numeric and none have missing values, so no imputation or encoding is needed before we continue.

In [None]:
df.info()
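
The same check can be made explicitly by counting missing values and testing the dtypes. The sketch below is only an illustration: `toy` is a made-up frame with invented values standing in for `df`, and the same two calls would apply to the real data.

```python
import pandas as pd

# Hypothetical stand-in for df: two numeric features and a target (invented values).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})

# Number of missing values per column; all zeros means nothing needs imputing.
missing_per_column = toy.isna().sum()

# True when every column has a numeric dtype, so no encoding is required.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in toy.dtypes)

print(missing_per_column)
print(all_numeric)
```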

The heatmap below shows the pairwise correlations between columns. Some of our features are strongly correlated with each other, which may be useful in the feature engineering process.

In [None]:
sns.heatmap(df.corr())
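
A heatmap is easy to scan but hard to read exact values from, so it can help to list the strongest relationships numerically. The following is a minimal sketch on synthetic data: the column names `feat_a`, `feat_b`, `feat_c`, and `target` and the 0.8 cutoff are assumptions made for illustration, not values taken from the wine dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)

# Synthetic frame: feat_b is almost a copy of feat_a, feat_c is unrelated noise.
toy = pd.DataFrame({
    "feat_a": base,
    "feat_b": base + rng.normal(scale=0.1, size=n),
    "feat_c": rng.normal(size=n),
})
toy["target"] = 2 * base + rng.normal(scale=0.5, size=n)

corr = toy.corr()

# Absolute correlation of each feature with the target, strongest first.
target_corr = corr["target"].drop("target").abs().sort_values(ascending=False)

# Feature-feature pairs above an (assumed) cutoff, using only the upper triangle
# so each pair appears once and the diagonal is skipped.
features = corr.drop(index="target", columns="target")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
strong_pairs = upper.stack()
strong_pairs = strong_pairs[strong_pairs.abs() > 0.8]

print(target_corr)
print(strong_pairs)
```

On the real data, the equivalent of `target_corr` would be computed from `df.corr()['quality']`.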

In [None]:
sns.pairplot(df, hue='quality')

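Along the diagonal of the pairplot, many of the features are heavily skewed, and they also sit on very different scales, so we will standardize the features using sklearn's StandardScaler. Standardizing centers each feature at 0 with unit variance; it puts the features on a comparable scale but does not remove the skew itself. Note that we do not scale the 'quality' column because that's our target variable.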

In [None]:
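# Standardize every feature column; 'quality' is the target and stays unscaled.
features = df.drop(columns='quality')
scaled_df = pd.DataFrame(StandardScaler().fit_transform(features), columns=features.columns, index=features.index)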

In [None]:
scaled_df.head()
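
Standardization is a linear transformation, so it changes the location and spread of a feature but not the shape of its distribution. The sketch below illustrates this on a synthetic log-normal sample (an assumption standing in for a skewed column). It standardizes with plain pandas using the population standard deviation (`ddof=0`), which is the convention StandardScaler follows, and includes a `log1p` transform purely as a contrast.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical right-skewed feature (synthetic, not from the wine data).
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="skewed_feature")

# Standardize: subtract the mean, divide by the population standard deviation.
z = (x - x.mean()) / x.std(ddof=0)

skew_before = x.skew()
skew_after = z.skew()          # same as skew_before up to floating-point error

# For contrast, a nonlinear transform does change the shape.
skew_log = np.log1p(x).skew()

print(f"mean after scaling: {z.mean():.3f}, std after scaling: {z.std(ddof=0):.3f}")
print(f"skew before: {skew_before:.3f}, after scaling: {skew_after:.3f}, after log1p: {skew_log:.3f}")
```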

# Feature Distributions  
We will use boxplots to visualize the distributions of the features, which helps identify outliers that we may want to remove.

In [None]:
fig = go.Figure() 
fig.add_trace(go.Box(x=df['fixed acidity'],name="Fixed Acidity"))
fig.show()
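
Points drawn beyond the whiskers of a boxplot are conventionally those more than 1.5 times the interquartile range (IQR) outside the quartiles. As a rough sketch of that rule, the code below computes the fences by hand for a short list of invented values. The exact quartile method a plotting library uses may differ slightly from pandas' default, so treat the numbers as approximate.

```python
import pandas as pd

# Invented values for illustration; the last two are deliberately extreme.
values = pd.Series(
    [7.4, 7.8, 7.8, 11.2, 7.4, 7.9, 7.3, 7.8, 7.5, 15.9],
    name="fixed acidity",
)

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the candidates a boxplot would flag.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(lower_fence, upper_fence)
print(outliers)
```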


In [None]:
fig = go.Figure() 
fig.add_trace(go.Box(x=df['volatile acidity'],name="Volatile Acidity"))
fig.add_trace(go.Box(x=df['citric acid'],name="Citric Acid"))
fig.show()

We'll remove the 'Volatile Acidity' outlier by keeping only the rows where the value is under 1.5.

In [None]:
adjusted_df = df[df['volatile acidity'] < 1.5]

In [None]:
fig = go.Figure() 
fig.add_trace(go.Box(x=scaled_df['chlorides'],name="Chlorides"))
fig.add_trace(go.Box(x=scaled_df['sulphates'],name="Sulphates"))
fig.show()

In [None]:
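# The thresholds appear to be read from the standardized boxplot above, so the masks use scaled_df.
# Filter adjusted_df (not df) so that earlier outlier removals are kept.
adjusted_df = adjusted_df[scaled_df.loc[adjusted_df.index, 'sulphates'] < 7]
adjusted_df = adjusted_df[scaled_df.loc[adjusted_df.index, 'chlorides'] < 6]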

In [None]:
fig = go.Figure() 
fig.add_trace(go.Box(x=scaled_df['free sulfur dioxide'],name="Free Sulfur Dioxide"))
fig.add_trace(go.Box(x=scaled_df['total sulfur dioxide'],name="Total Sulfur Dioxide"))
fig.show()

In [None]:
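# Threshold is in standardized units (see the boxplot above); filter adjusted_df to keep earlier removals.
adjusted_df = adjusted_df[scaled_df.loc[adjusted_df.index, 'total sulfur dioxide'] < 6]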

In [None]:
fig = go.Figure() 
fig.add_trace(go.Box(x=scaled_df['residual sugar'],name="Residual Sugar"))
fig.show()

In [None]:
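# Threshold is in standardized units (see the boxplot above); filter adjusted_df to keep earlier removals.
adjusted_df = adjusted_df[scaled_df.loc[adjusted_df.index, 'residual sugar'] < 7]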
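
When several thresholds are applied one after another, each filter has to build on the previous result, otherwise only the last one takes effect. One way to make that explicit is to accumulate a single boolean mask and apply it once. The sketch below assumes a small invented frame `toy_scaled` of standardized values and a hypothetical `limits` dictionary; neither comes from the notebook.

```python
import pandas as pd

# Hypothetical standardized features (values invented for illustration).
toy_scaled = pd.DataFrame({
    "sulphates": [0.1, 7.5, -0.3, 0.4, 1.2],
    "chlorides": [0.2, 0.0, 6.8, -0.1, 0.5],
    "residual sugar": [-0.5, 0.3, 0.1, 9.0, 0.2],
})

# Hypothetical upper limits, one per feature, in standardized units.
limits = {"sulphates": 7, "chlorides": 6, "residual sugar": 7}

# Start by keeping every row, then require each feature to be under its limit.
keep = pd.Series(True, index=toy_scaled.index)
for column, limit in limits.items():
    keep &= toy_scaled[column] < limit

# A row survives only if it passes every threshold.
filtered = toy_scaled[keep]
print(filtered)
```

Because the mask shares its index with the frame, it could also be used to select the matching rows of an unscaled frame that has the same index.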

In [None]:
fig = go.Figure() 
fig.add_trace(go.Box(x=scaled_df['density'],name="Density"))
fig.show()

In [None]:
fig = go.Figure() 
fig.add_trace(go.Box(x=scaled_df['pH'],name="pH"))
fig.show()

In [None]:
fig = go.Figure() 
fig.add_trace(go.Box(x=scaled_df['alcohol'],name="Alcohol"))
fig.show()

In [None]:
fig = go.Figure() 
fig.add_trace(go.Box(x=scaled_df['fixed acidity'],name="Fixed Acidity"))
fig.add_trace(go.Box(x=scaled_df['volatile acidity'],name="Volatile Acidity"))
fig.add_trace(go.Box(x=scaled_df['citric acid'],name="Citric Acid"))
fig.add_trace(go.Box(x=scaled_df['residual sugar'],name="Residual Sugar"))
fig.add_trace(go.Box(x=scaled_df['chlorides'],name="Chlorides"))
fig.add_trace(go.Box(x=scaled_df['free sulfur dioxide'],name="Free Sulfur Dioxide"))
fig.add_trace(go.Box(x=scaled_df['total sulfur dioxide'],name="Total Sulfur Dioxide"))
fig.add_trace(go.Box(x=scaled_df['density'],name="Density"))
fig.add_trace(go.Box(x=scaled_df['pH'],name="pH"))
fig.add_trace(go.Box(x=scaled_df['sulphates'],name="Sulphates"))
fig.add_trace(go.Box(x=scaled_df['alcohol'],name="Alcohol"))
fig.update_layout(title="Summary of Wine Quality Features")
fig.show()

Once we've removed the outliers, we can rescale the features using adjusted_df.

In [None]:
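# Refit the scaler on the outlier-filtered rows; 'quality' is the target and stays unscaled.
adjusted_features = adjusted_df.drop(columns='quality')
scaled_df = pd.DataFrame(StandardScaler().fit_transform(adjusted_features), columns=adjusted_features.columns, index=adjusted_features.index)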
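
After rows have been filtered out, the remaining index has gaps. Wrapping a bare array of scaled values in a new DataFrame resets the index to 0, 1, 2, ..., which matters if the target column is joined back on later. The sketch below shows the difference on an invented three-row frame; standardization is done with plain pandas (`ddof=0`) so the example runs without scikit-learn, and all names in it are hypothetical.

```python
import pandas as pd

# Hypothetical filtered frame: rows 0 and 2 were dropped, so the index has gaps.
filtered = pd.DataFrame(
    {"alcohol": [9.0, 10.0, 11.0], "quality": [5, 6, 7]},
    index=[1, 3, 4],
)

features = filtered.drop(columns="quality")
# Plain array of standardized values, similar to what a scaler returns.
scaled_values = ((features - features.mean()) / features.std(ddof=0)).to_numpy()

# Passing the original index keeps rows aligned with the target.
aligned = pd.DataFrame(scaled_values, columns=features.columns, index=features.index)
aligned_with_target = aligned.join(filtered["quality"])

# Without it, the new frame is indexed 0..2 and the join mostly misses.
misaligned = pd.DataFrame(scaled_values, columns=features.columns)
misaligned_with_target = misaligned.join(filtered["quality"])

print(aligned_with_target)
print(misaligned_with_target)
```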

In [None]:
sns.pairplot(scaled_df)