### Importing the required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

### Loading the dataset

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/Wine_Dataset/winequality-red.csv",sep=";")

In [3]:
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [4]:
print("Shape of our dataframe is: ",df.shape)

Shape of our dataframe is:  (1599, 12)


Of the 12 columns, one is the target variable ('quality') and the remaining 11 are input variables.

In [6]:
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


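 - For most columns the mean is greater than the median (the 50% row), which points to right-skewed distributions. The exceptions are "density" and "pH", where the two are nearly equal, and "quality", where the mean (5.64) is below the median (6).
 - There is a notably large gap between the 75th percentile and the maximum for the predictors "residual sugar" (2.6 vs. 15.5), "free sulfur dioxide" (21 vs. 72) and "total sulfur dioxide" (62 vs. 289). Some values of these three variables therefore lie far outside the range that covers the bulk of the data (up to the 75th percentile), which suggests that the dataset contains extreme values, i.e. outliers.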
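
The percentile comparison above can be made systematic with the interquartile-range (IQR) rule. The following is a minimal sketch and not part of the original notebook: the helper name `iqr_outlier_counts`, the multiplier `k = 1.5`, and the toy values are all assumptions chosen for illustration.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own bounds (aligned on column labels).
    outside = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return outside.sum()


# Toy values for illustration only; they are not rows from the wine data.
toy = pd.DataFrame({
    "residual sugar": [1.9, 2.0, 2.2, 2.4, 2.6, 15.5],
    "pH": [3.2, 3.3, 3.3, 3.4, 3.4, 3.5],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

Calling `iqr_outlier_counts(df)` on the wine dataframe would give a per-column count that can be compared with the observations above.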

In [7]:
df['quality'].unique()

array([5, 6, 7, 4, 8, 3], dtype=int64)

**A few key insights from the target variable alone:**

 - The target (dependent) variable is discrete and categorical (ordinal) in nature.
 - The **“quality”** score scale ranges from 0 to 10, with 0 being the poorest and 10 the best.
 - No observation received a rating of 0, 1, 2, 9 or 10; the only scores present are 3 to 8.

In [8]:
df['quality'].value_counts()

5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64

- This gives the number of observations for each quality score, in descending order.
- **“quality”** has most of its values concentrated in categories 5, 6 and 7.
- Only a few observations fall in categories 3 and 8.
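
To put numbers on this imbalance, the counts printed above can be turned into proportions. This is a small sketch that re-enters the printed counts by hand so it runs on its own; the variable names `share` and `imbalance_ratio` are illustrative. On the real dataframe, `df['quality'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10}, name="quality")

# Fraction of all observations in each category, ordered by score.
share = (counts / counts.sum()).sort_index()

# Ratio between the most and least frequent categories.
imbalance_ratio = counts.max() / counts.min()

print(share.round(3))
print("Share of scores 5 and 6 combined:", round(share[5] + share[6], 3))
print("Largest-to-smallest class ratio:", imbalance_ratio)
```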

**Renaming the columns**

In [10]:
# Keys must match the existing column names exactly ('sulfur', not 'sulphur'); unmatched keys are silently ignored.
df.rename(columns={'fixed acidity':'fixed_acidity','citric acid':'citric_acid','volatile acidity':'volatile_acidity','residual sugar':'residual_sugar',
                  'free sulfur dioxide':'free_sulphur_dioxide','total sulfur dioxide': 'total_sulphur_dioxide'},inplace=True)
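
By default, `DataFrame.rename` ignores mapping keys that do not match an existing column, so a misspelled key goes unnoticed. The sketch below, on an empty toy frame whose column names are assumptions for illustration, shows two ways to guard against this: passing `errors="raise"`, or deriving the new names with a function instead of typing each one.

```python
import pandas as pd

# Toy frame with a few of the dataset's column names (illustration only).
toy = pd.DataFrame(columns=["fixed acidity", "free sulfur dioxide", "quality"])

# The second key is deliberately misspelled ('sulphur' instead of 'sulfur').
mapping = {
    "fixed acidity": "fixed_acidity",
    "free sulphur dioxide": "free_sulphur_dioxide",
}

# Default behaviour: the unmatched key is skipped without any warning.
silently = toy.rename(columns=mapping)

# errors="raise" turns the unmatched key into a KeyError.
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True

# Alternative: replace spaces with underscores in every column name.
snake = toy.rename(columns=lambda name: name.replace(" ", "_"))

print(list(silently.columns), raised, list(snake.columns))
```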

**Splitting the data into feature and label sets**

In [11]:
X = df.iloc[:,:11]
y = df.iloc[:,-1]
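
The positional slices above rely on 'quality' being the last of the 12 columns. A label-based split is one alternative that does not depend on column order. The sketch below uses a toy frame with made-up values and hypothetical names (`toy`, `X_toy`, `y_toy`) so that it runs on its own.

```python
import pandas as pd

# Made-up values for illustration; not rows from the wine data.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 6],
})

# Select the target by name and treat every other column as a feature.
X_toy = toy.drop(columns="quality")
y_toy = toy["quality"]

print(X_toy.shape, y_toy.name)
```

Applied to the notebook's dataframe, the same pattern would be `X = df.drop(columns='quality')` and `y = df['quality']`.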

### Using the Pywedge library for EDA

In [17]:
import pywedge as pw
charts = pw.Pywedge_Charts(df, c=None, y = 'quality')

In [24]:
plots = charts.make_charts()

*[Interactive widget output: Pywedge Make_Charts, eight chart tabs, charts compiled by Pywedge make_charts. Not rendered in this export.]*