# [CSMODEL] Project - Case Study

**Section:** S11<br>
**Group no.:** Group 2<br>
**Group Members:**
- Cadao, Krischelle Lourdes
- Hernandez, Pierre Vincent
- Villaceran, Marissa Ann

## Import Libraries

These libraries will be used in the notebook:
- **`numpy`** is a software library for Python that contains a large collection of mathematical functions, as well as convenient data structures to represent vectors and matrices.
- **`pandas`** is a software library for Python that is designed for data manipulation and data analysis.
- **`matplotlib`** is a software library for data visualization, which allows us to easily render various types of graphs.
- **`seaborn`** is a software library for data visualization that builds on top of `matplotlib` and integrates with `pandas` data structures.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# sets the theme of the charts
# note: matplotlib 3.6+ renames this style to 'seaborn-v0_8-darkgrid'
plt.style.use('seaborn-darkgrid')

# inline plots
%matplotlib inline
# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

## Red Wine Quality Dataset

The dataset includes information on the red variant of vinho verde, a wine product from the Minho region of Portugal. The most common physicochemical tests were selected for analysis, where chemical components are assessed to ensure purity and the absence of harmful contaminants or residuals from the manufacturing process. The data collection process for the Red Wine Quality dataset took place from May 2004 to February 2007, and the samples were taken directly from CVRVV, an authorized certification agency that aims to improve the quality and marketing of vinho verde.

The data was collected using a computerized system called iLab, which manages the data from the wine testing from producers to laboratory and sensory analysis. As a result, every entry in the dataset corresponds to an analytical or sensory test. During the preprocessing stage, the database was transformed to include a distinct wine sample (with all tests) per row.

The data collection method implies that a data-driven strategy based on the chemical components can be integrated into the evaluation of red wine quality, rather than relying solely on sensory analysis by human tasters, which can be subjective and prone to error.


The dataset contains *1599 observations* or rows and *12 variables* or columns (11 physicochemical input variables and 1 output variable). Each row represents a red wine sample from the north of Portugal, collected to model red wine quality based on physicochemical tests. Each sample includes the **`fixed acidity`**, **`volatile acidity`**, **`citric acid`**, **`residual sugar`**, **`chlorides`**, **`free sulfur dioxide`**, **`total sulfur dioxide`**, **`density`**, **`pH`**, **`sulphates`**, **`alcohol`**, and its output, the **`quality`**. Each entry in the dataset has exactly one value in every column. The dataset, as provided on Kaggle, consists of a single `csv` file.

The following are the descriptions of each variable in the dataset.<br>

**Input variables:**
- **`fixed acidity`**
    - most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
- **`volatile acidity`**
    - the amount of acetic acid in wine, which at excessively high levels can lead to an unpleasant, vinegar taste
- **`citric acid`**
    - found in small quantities, citric acid can add 'freshness' and flavor to wines
- **`residual sugar`**
    - the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
- **`chlorides`**
    - the amount of salt in the wine
- **`free sulfur dioxide`**
    - the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
- **`total sulfur dioxide`**
    - amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
- **`density`**
    - the density of wine is close to that of water depending on the percent alcohol and sugar content
- **`pH`**
    - describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
- **`sulphates`**
    - a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
- **`alcohol`**
    - the percent alcohol content of the wine

**Output variable (based on sensory data):**

- **`quality`**
    - score between 0 and 10

# Reading the Dataset

The first step is to load the dataset using `pandas`. This will load the dataset into a pandas `DataFrame`. The dataset was loaded using the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function.

In [None]:
redwinequality_df = pd.read_csv('winequality-red.csv')
redwinequality_df.head(10).style.background_gradient(axis=None)

The dataset should now be loaded in the `redwinequality_df` variable. `redwinequality_df` is a [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). It is a data structure for storing tabular data, and the main data structure used in pandas.

In [None]:
redwinequality_df.columns

# Cleaning the Dataset

To begin cleaning the dataset, the [`info`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) function is called to display general information about the dataset.

In [None]:
redwinequality_df.info()

The output of the function confirms that there are **1599** observations and **12** variables in the dataset.

#### Multiple representation of same categorical values

Since all the variables are of a numerical datatype, checking for multiple representations of the same categorical value is not needed.

#### Datatype and formatting of values

It is also evident that the datatype and formatting of the values of each variable are consistent and correct. This can be checked further by accessing the `dtypes` property of the `redwinequality_df` DataFrame.

In [None]:
redwinequality_df.dtypes

#### Checking for missing data (`NaN`s)

To handle missing data in our dataset, each variable will be checked if it contains a `NaN` / `null` value. The [`isnull`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) and [`any`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html) functions will be used to list each variable with a boolean value indicating if the variable contains a `NaN` / `null` value.

In [None]:
redwinequality_df.isnull().any()

By looking at the resulting list, there are no missing data or variables containing `NaN`s.
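A per-column count can complement the boolean check above. The following is a minimal sketch on a made-up toy frame (`toy_df` and its values are illustrative assumptions, not data from the wine dataset); chaining `isnull()` with `sum()` reports how many values are missing in each column instead of only whether any are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value
toy_df = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26],
    "alcohol": [9.4, 9.8, 9.8],
})

# isnull() marks each missing cell as True; sum() counts the True values per column
missing_counts = toy_df.isnull().sum()
print(missing_counts)
```

Applied to `redwinequality_df`, the same call would be expected to list a count for each of the 12 columns.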

#### Duplicate data

As stated by the source of the dataset, the data was collected using a computerized system called iLab, which manages the data from the wine testing from producers to laboratory and sensory analysis. During the preprocessing stage, the database was transformed to include a ***distinct wine sample (with all tests) per row***, which means that each row or observation in the database corresponds to a unique sample. Thus, repeated values within the same variable are permissible and are part of the observations to be analyzed.
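Whether fully identical rows occur can also be inspected directly rather than assumed. The sketch below uses a made-up toy frame (`toy_df` and its values are assumptions for illustration only); `DataFrame.duplicated` flags each row that repeats an earlier row across all columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for redwinequality_df
toy_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 10.0],
    "pH": [3.51, 3.20, 3.51, 3.16],
    "quality": [5, 5, 5, 6],
})

# True for every row that is identical to some earlier row
duplicate_mask = toy_df.duplicated()

# Number of repeated rows
n_duplicates = int(duplicate_mask.sum())
print(n_duplicates)
```

Identical measurements from two different wine samples would be flagged by this check as well, so a nonzero count on the real dataset would not by itself mean that a sample was recorded twice.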

# Exploratory Data Analysis

### 1. Is there a correlation between the quality rating and any of the physicochemical properties in the dataset?

To examine the correlation between the quality rating and the physicochemical properties in the dataset, the correlation of every variable with every other variable is computed. The **`corr`** function computes the pairwise correlation of columns.

Since all the variables are numerical and span only a small range of values, the **Pearson correlation** method, which measures the strength of a linear relationship, is used for computing the correlation coefficient.
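To make the coefficient concrete, the sketch below computes Pearson's r by hand as the covariance of two variables divided by the product of their standard deviations, and compares it with the result from pandas. The helper name `pearson_r` and the toy values are assumptions for illustration; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson's r: sum of co-deviations over the root of the product of squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

# Illustrative values only
toy_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "quality": [5, 5, 6, 6, 7],
})

manual_r = pearson_r(toy_df["alcohol"], toy_df["quality"])
pandas_r = toy_df["alcohol"].corr(toy_df["quality"], method="pearson")
print(round(manual_r, 4), round(pandas_r, 4))
```

Both values should agree up to floating-point error, and r always lies between -1 and 1.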

In [None]:
# pearson correlation between each variable with each other
all_corr = redwinequality_df.corr(method='pearson')

# print numerical correlation values
all_corr

The correlation result can be visually represented using a heatmap.

In [None]:
# Matplotlib heatmap of correlation of each variable with each other
plt.rcParams['axes.grid'] = False # to disable white grid lines
fig, ax = plt.subplots(figsize=(10,10))
im = ax.imshow(all_corr, cmap='plasma', interpolation='nearest')
ax.set_title('Correlation of all variables with each other', size=20)
ax.set_xticks(range(len(all_corr.columns)), all_corr.columns, rotation='vertical', size=12)
ax.set_yticks(range(len(all_corr.columns)), all_corr.columns, size=12)
fig.colorbar(im, orientation='vertical')
for i in range(len(all_corr.columns)):
    for j in range(len(all_corr.columns)):
        text = ax.text(j, i, round(all_corr.iloc[i, j], 2),
                       ha="center", va="center", color="black", size=12)
plt.show()
plt.rcParams['axes.grid'] = True # to enable white grid lines

Since the area of interest is the correlation between the quality rating and the physicochemical properties in the dataset, the `quality` column is extracted from the `all_corr` dataframe.

The extracted `quality` column from the correlation dataframe is then visualized using a heatmap.

In [None]:
# correlation between quality rate variable and the physicochemical properties
quality_rest = all_corr[['quality']].drop('quality', axis=0).sort_values(by='quality', ascending=False)

quality_rest 
# quality_rest = all_corr[['quality']].sort_values(by='quality', ascending=False)

# Matplotlib heatmap of correlation between quality rate and the rest of the variables
plt.rcParams['axes.grid'] = False # to disable white grid lines
fig, ax = plt.subplots(figsize=(10,10))
im = ax.imshow(quality_rest, cmap='plasma', interpolation='nearest')
ax.set_title('Correlation between quality rate and the physicochemical properties', size=20)
ax.set_xticks(range(len(quality_rest.columns)), quality_rest.columns, size=12)
ax.set_yticks(range(len(quality_rest.index)), quality_rest.index, size=12)
fig.colorbar(im, orientation='vertical')
for i in range(len(quality_rest.index)):
    text = ax.text(0, i, round(quality_rest.iloc[i,0], 2),
                    ha="center", va="center", color="black", size=12)
plt.show()
plt.rcParams['axes.grid'] = True # to enable white grid lines

Based on the correlation heatmap of `quality rate vs. the physicochemical properties`, the **`alcohol`** variable yielded the highest correlation value with the **`quality`** variable at **0.48**. This suggests that the *quality rate* of a red wine is **positively correlated** with its *alcohol content*.

In contrast, the **`volatile acidity`** variable yielded the lowest (most negative) correlation value with the **`quality`** variable at **-0.39**. This suggests that the *quality rate* of a red wine is **negatively correlated** with its *volatile acidity*.

It is also evident that some physicochemical properties have almost no correlation (values close to **0**) with the **`quality`** variable, such as the **`residual sugar`**, **`free sulfur dioxide`**, and **`pH`** variables. These variables might not have much of a linear relationship with the *quality rate* of a red wine.

### 2. How are the physicochemical properties of red wine distributed? 


The central limit theorem states that as the sample size increases, the sampling distribution of the mean approaches a normal distribution. In this case, since there are 1599 observations, the sampling distribution of the mean can reasonably be assumed to be approximately normal. Note that this statement concerns the sample mean and not the individual values, which may still be skewed. For this reason, each physicochemical property is summarized by its mean and standard deviation, and the shape of its distribution is examined separately using a boxplot.
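The distinction between the distribution of the values and the sampling distribution of the mean can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is synthetic (exponential, chosen only because it is clearly right-skewed), and the helper `skewness` is a hypothetical name for the standard moment-based skewness coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed population; not related to the wine data
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size n and record the mean of each one
n = 1599
sample_means = np.array(
    [rng.choice(population, size=n).mean() for _ in range(1000)]
)

def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

pop_skew = skewness(population)      # strongly positive for this population
means_skew = skewness(sample_means)  # expected to be close to 0

print(round(pop_skew, 2), round(means_skew, 2))
```

The individual values stay skewed while the sample means cluster symmetrically around the population mean, which is the behavior the theorem describes.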

#### 2.1 Distribution of Fixed Acidity

In [None]:
fixacid_df= redwinequality_df.agg({"fixed acidity": ["mean", "median","std", "min", "max"]})
round(fixacid_df,2)

The distribution of `fixed acidity` is **8.32 ± 1.74** (mean ± std).

In [None]:
redwinequality_df.boxplot("fixed acidity", figsize=(15, 10))
plt.show()

In [None]:
# Get the median of the fixed acidity
fixacid_median = redwinequality_df["fixed acidity"].median()
fixacid_median_value = float(fixacid_median)

# Calculate the interquartile range (IQR)
fixacid_q1, fixacid_q3 = np.percentile(redwinequality_df["fixed acidity"], [25, 75])
fixacid_iqr = fixacid_q3 - fixacid_q1

# Calculate the upper and lower whiskers
fixacid_upper_whisker = min(redwinequality_df["fixed acidity"].max(), fixacid_q3 + 1.5*fixacid_iqr)
fixacid_lower_whisker = max(redwinequality_df["fixed acidity"].min(),fixacid_q1 - 1.5*fixacid_iqr)

#Calculate center of the box for visual inspection
fixacid_center=(fixacid_q3 +fixacid_q1)/2

#Print values 
print("Median of fixed acidity: {:.2f}".format(fixacid_median_value))
print("Third Quartile: {:.2f}".format(fixacid_q3))
print("First Quartile: {:.2f}".format(fixacid_q1))
print("Upper whisker: {:.2f}".format(fixacid_upper_whisker))
print("Lower whisker: {:.2f}".format(fixacid_lower_whisker))
print("Center of the box: {:.2f}".format(fixacid_center))

Based on the boxplot of `fixed acidity`, there are outliers above the upper whisker (12.35) but none below the lower whisker (4.60). The middle 50% of the `fixed acidity` values, which is the typical range, falls between 7.10 and 9.20.

Moreover, based on visual inspection, the median falls below the center of the box; thus, the **distribution is positively skewed**.
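Because the same quartile and whisker computation is applied to every property, it could be wrapped in one helper. The following is a minimal sketch, assuming a hypothetical function name `box_summary` and made-up toy values; it also reports the sample skewness from `Series.skew` as a numerical companion to the visual check of the median against the box center.

```python
import numpy as np
import pandas as pd

def box_summary(values):
    """Quartiles, 1.5*IQR whisker limits clipped to the data range, box center, and skewness."""
    values = pd.Series(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # whisker limits never extend past the observed minimum and maximum
        "upper_whisker": min(values.max(), q3 + 1.5 * iqr),
        "lower_whisker": max(values.min(), q1 - 1.5 * iqr),
        "center": (q1 + q3) / 2,
        # positive skew: longer right tail, negative skew: longer left tail
        "skew": values.skew(),
    }

# Illustrative values with one large value on the right
toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 30]
summary = box_summary(toy_values)
print(summary)
```

Calling `box_summary(redwinequality_df["fixed acidity"])` would be expected to reproduce the quartiles and whisker limits computed in the cell above.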

#### 2.2 Distribution of Volatile Acidity

In [None]:
volacid_df=redwinequality_df.agg({"volatile acidity": ["mean", "median","std", "min", "max"]})
round(volacid_df,2)

The distribution of volatile acidity is **0.53 ± 0.18** (mean ± std).

In [None]:
redwinequality_df.boxplot("volatile acidity", figsize=(15, 10))
plt.show()

In [None]:
# Get the median of the volatile acidity
volacid_median = redwinequality_df["volatile acidity"].median()
volacid_median_value = float(volacid_median)

# Calculate the interquartile range (IQR)
volacid_q1, volacid_q3 = np.percentile(redwinequality_df["volatile acidity"], [25, 75])
volacid_iqr = volacid_q3 - volacid_q1

# Calculate the upper and lower whiskers
volacid_upper_whisker = min(redwinequality_df["volatile acidity"].max(), volacid_q3 + 1.5*volacid_iqr)
volacid_lower_whisker = max(redwinequality_df["volatile acidity"].min(),volacid_q1 - 1.5*volacid_iqr)

#Calculate center of the box for visual inspection
volacid_center=(volacid_q3 +volacid_q1)/2

#Print values 
print("Median of volatile acidity: {:.2f}".format(volacid_median_value))
print("Third Quartile: {:.2f}".format(volacid_q3))
print("First Quartile: {:.2f}".format(volacid_q1))
print("Upper whisker: {:.2f}".format(volacid_upper_whisker))
print("Lower whisker: {:.2f}".format(volacid_lower_whisker))
print("Center of the box: {:.2f}".format(volacid_center))

Based on the boxplot of `volatile acidity`, there are outliers above the upper whisker (1.02) but none below the lower whisker (0.12). The middle 50% of the `volatile acidity` values, which is the typical range, falls between 0.39 and 0.64.

Moreover, based on visual inspection, the median falls on the center of the box; thus, the **distribution is symmetric**.

#### 2.3 Distribution of Citric Acid

In [None]:
citricacid_df= redwinequality_df.agg({"citric acid": ["mean", "median","std", "min", "max"]})
round(citricacid_df,2)

The distribution of citric acid is **0.27 ± 0.19** (mean ± std).

In [None]:
redwinequality_df.boxplot("citric acid", figsize=(15, 10))
plt.show()

In [None]:
# Get the median of the citric acid
citricacid_median = redwinequality_df["citric acid"].median()
citricacid_median_value = float(citricacid_median)

# Calculate the interquartile range (IQR)
citricacid_q1, citricacid_q3 = np.percentile(redwinequality_df["citric acid"], [25, 75])
citricacid_iqr = citricacid_q3 - citricacid_q1

# Calculate the upper and lower whiskers
citricacid_upper_whisker = min(redwinequality_df["citric acid"].max(), citricacid_q3 + 1.5*citricacid_iqr)
citricacid_lower_whisker = max(redwinequality_df["citric acid"].min(),citricacid_q1 - 1.5*citricacid_iqr)

# Calculate center of the box for visual inspection
citricacid_center = (citricacid_q3 + citricacid_q1)/2

# Print values 
print("Median of citric acid: {:.2f}".format(citricacid_median_value))
print("Third Quartile: {:.2f}".format(citricacid_q3))
print("First Quartile: {:.2f}".format(citricacid_q1))
print("Upper whisker: {:.2f}".format(citricacid_upper_whisker))
print("Lower whisker: {:.2f}".format(citricacid_lower_whisker))
print("Center of the box: {:.2f}".format(citricacid_center))

round(redwinequality_df[['citric acid']].describe(),2)


Based on the boxplot of `citric acid`, there are outliers above the upper whisker (0.91) but none below the lower whisker (0.00). The middle 50% of the `citric acid` values, which is the typical range, falls between 0.09 and 0.42.

Moreover, based on visual inspection, the median falls on the center of the box; thus, the **distribution is symmetric**.

#### 2.4 Distribution of Residual Sugar

In [None]:
sugar_df= redwinequality_df.agg({"residual sugar": ["mean", "median","std", "min", "max"]})
round(sugar_df,2)

The distribution of residual sugar is **2.54 ± 1.41** (mean ± std).

In [None]:
redwinequality_df.boxplot("residual sugar", figsize=(15, 10))
plt.show()

In [None]:
# Get the median of the residual sugar
sugar_median = redwinequality_df["residual sugar"].median()
sugar_median_value = float(sugar_median)

# Calculate the interquartile range (IQR)
sugar_q1, sugar_q3 = np.percentile(redwinequality_df["residual sugar"], [25, 75])
sugar_iqr = sugar_q3 - sugar_q1

# Calculate the upper and lower whiskers
sugar_upper_whisker = min(redwinequality_df["residual sugar"].max(), sugar_q3 + 1.5*sugar_iqr)
sugar_lower_whisker = max(redwinequality_df["residual sugar"].min(),sugar_q1 - 1.5*sugar_iqr)

#Calculate center of the box for visual inspection
sugar_center=(sugar_q3 +sugar_q1)/2

#Print values 
print("Median of residual sugar: {:.2f}".format(sugar_median_value))
print("Third Quartile: {:.2f}".format(sugar_q3))
print("First Quartile: {:.2f}".format(sugar_q1))
print("Upper whisker: {:.2f}".format(sugar_upper_whisker))
print("Lower whisker: {:.2f}".format(sugar_lower_whisker))
print("Center of the box: {:.2f}".format(sugar_center))

Based on the boxplot of `residual sugar`, there are outliers above the upper whisker (3.65) but none below the lower whisker (0.90). The middle 50% of the `residual sugar` values, which is the typical range, falls between 1.90 and 2.60.

Moreover, based on visual inspection, the median is near the center of the box; thus, the **distribution is approximately symmetric**.

#### 2.5 Distribution of Chlorides

In [None]:
chloride_df= redwinequality_df.agg({"chlorides": ["mean", "median","std", "min", "max"]})
round(chloride_df,2)

The distribution of chlorides is **0.09 ± 0.05** (mean ± std).

In [None]:
redwinequality_df.boxplot("chlorides", figsize=(15, 10))
plt.show()

In [None]:
# Get the median of the chlorides
chloride_median = redwinequality_df["chlorides"].median()
chloride_median_value = float(chloride_median)

# Calculate the interquartile range (IQR)
chloride_q1, chloride_q3 = np.percentile(redwinequality_df["chlorides"], [25, 75])
chloride_iqr = chloride_q3 - chloride_q1

# Calculate the upper and lower whiskers
chloride_upper_whisker = min(redwinequality_df["chlorides"].max(), chloride_q3 + 1.5*chloride_iqr)
chloride_lower_whisker = max(redwinequality_df["chlorides"].min(),chloride_q1 - 1.5*chloride_iqr)

#Calculate center of the box for visual inspection
chloride_center=(chloride_q3 +chloride_q1)/2

#Print values 
print("Median of chlorides: {:.2f}".format(chloride_median_value))
print("Third Quartile: {:.2f}".format(chloride_q3))
print("First Quartile: {:.2f}".format(chloride_q1))
print("Upper whisker: {:.2f}".format(chloride_upper_whisker))
print("Lower whisker: {:.2f}".format(chloride_lower_whisker))
print("Center of the box: {:.2f}".format(chloride_center))

Based on the boxplot of `chlorides`, there are outliers beyond both the upper whisker (0.12) and the lower whisker (0.04). The middle 50% of the `chlorides` values, which is the typical range, falls between 0.07 and 0.09.

Moreover, based on visual inspection, the median is at the center of the box; thus, the **distribution is symmetric**.

#### 2.6 Distribution of Free Sulfur Dioxide

In [None]:
free_sd_df= redwinequality_df.agg({"free sulfur dioxide": ["mean", "median","std", "min", "max"]})
round(free_sd_df,2)

The distribution of free sulfur dioxide is **15.87 ± 10.46** (mean ± std).

In [None]:
redwinequality_df.boxplot("free sulfur dioxide", figsize=(15, 10))
plt.show()

In [None]:
# Get the median of the free sulfur dioxide
free_sd_median = redwinequality_df["free sulfur dioxide"].median()
free_sd_median_value = float(free_sd_median)

# Calculate the interquartile range (IQR)
free_sd_q1, free_sd_q3 = np.percentile(redwinequality_df["free sulfur dioxide"], [25, 75])
free_sd_iqr = free_sd_q3 - free_sd_q1

# Calculate the upper and lower whiskers
free_sd_upper_whisker = min(redwinequality_df["free sulfur dioxide"].max(), free_sd_q3 + 1.5*free_sd_iqr)
free_sd_lower_whisker = max(redwinequality_df["free sulfur dioxide"].min(),free_sd_q1 - 1.5*free_sd_iqr)

#Calculate center of the box for visual inspection
free_sd_center=(free_sd_q3 +free_sd_q1)/2

#Print values 
print("Median of free sulfur dioxide: {:.2f}".format(free_sd_median_value))
print("Third Quartile: {:.2f}".format(free_sd_q3))
print("First Quartile: {:.2f}".format(free_sd_q1))
print("Upper whisker: {:.2f}".format(free_sd_upper_whisker))
print("Lower whisker: {:.2f}".format(free_sd_lower_whisker))
print("Center of the box: {:.2f}".format(free_sd_center))

Based on the boxplot of `free sulfur dioxide`, there are outliers above the upper whisker (42.00) but none below the lower whisker (1.00). The middle 50% of the `free sulfur dioxide` values, which is the typical range, falls between 7.00 and 21.00.

Moreover, based on visual inspection, the median is at the center of the box; thus, the **distribution is symmetric**.

#### 2.7 Distribution of Total Sulfur Dioxide

In [None]:
total_sd_df= redwinequality_df.agg({"total sulfur dioxide": ["mean", "median","std", "min", "max"]})
round(total_sd_df,2)

The distribution of total sulfur dioxide is **46.47 ± 32.90** (mean ± std).

In [None]:
redwinequality_df.boxplot("total sulfur dioxide", figsize=(15, 10))
plt.show()

In [None]:
# Get the median of the total sulfur dioxide
total_sd_median = redwinequality_df["total sulfur dioxide"].median()
total_sd_median_value = float(total_sd_median)

# Calculate the interquartile range (IQR)
total_sd_q1, total_sd_q3 = np.percentile(redwinequality_df["total sulfur dioxide"], [25, 75])
total_sd_iqr = total_sd_q3 - total_sd_q1

# Calculate the upper and lower whiskers
total_sd_upper_whisker = min(redwinequality_df["total sulfur dioxide"].max(), total_sd_q3 + 1.5*total_sd_iqr)
total_sd_lower_whisker = max(redwinequality_df["total sulfur dioxide"].min(),total_sd_q1 - 1.5*total_sd_iqr)

#Calculate center of the box for visual inspection
total_sd_center=(total_sd_q3 +total_sd_q1)/2

#Print values 
print("Median of total sulfur dioxide: {:.2f}".format(total_sd_median_value))
print("Third Quartile: {:.2f}".format(total_sd_q3))
print("First Quartile: {:.2f}".format(total_sd_q1))
print("Upper whisker: {:.2f}".format(total_sd_upper_whisker))
print("Lower whisker: {:.2f}".format(total_sd_lower_whisker))
print("Center of the box: {:.2f}".format(total_sd_center))

Based on the boxplot of `total sulfur dioxide`, there are outliers above the upper whisker (122.00) but none below the lower whisker (6.00). The middle 50% of the `total sulfur dioxide` values, which is the typical range, falls between 22.00 and 62.00.

Moreover, based on visual inspection, the median is near the center of the box; thus, the **distribution is approximately symmetric**.

#### 2.8 Distribution of Density

In [None]:
density_df= redwinequality_df.agg({"density": ["mean", "median","std", "min", "max"]})
round(density_df,4)

The distribution of density is **0.9967 ± 0.0019** (mean ± std).

In [None]:
redwinequality_df.boxplot("density", figsize=(15, 10))
plt.show()

In [None]:
# Get the median of the density
density_median = redwinequality_df["density"].median()
density_median_value = float(density_median)

# Calculate the interquartile range (IQR)
density_q1, density_q3 = np.percentile(redwinequality_df["density"], [25, 75])
density_iqr = density_q3 - density_q1

# Calculate the upper and lower whiskers
density_upper_whisker = min(redwinequality_df["density"].max(), density_q3 + 1.5*density_iqr)
density_lower_whisker = max(redwinequality_df["density"].min(),density_q1 - 1.5*density_iqr)

#Calculate center of the box for visual inspection
density_center=(density_q3 +density_q1)/2

#Print values 
print("Median of density: {:.4f}".format(density_median_value))
print("Third Quartile: {:.4f}".format(density_q3))
print("First Quartile: {:.4f}".format(density_q1))
print("Upper whisker: {:.4f}".format(density_upper_whisker))
print("Lower whisker: {:.4f}".format(density_lower_whisker))
print("Center of the box: {:.4f}".format(density_center))

Based on the boxplot of `density`, there are outliers beyond both the upper whisker (1.0012) and the lower whisker (0.9922). The middle 50% of the `density` values, which is the typical range, falls between 0.9956 and 0.9978.

Moreover, based on visual inspection, the median is near the center of the box; thus, the **distribution is approximately symmetric**.

#### 2.9 Distribution of pH

In [None]:
ph_df = redwinequality_df.agg({"pH": ["mean", "median","std", "min", "max"]})
round(ph_df,2)

The distribution of pH is **3.31 ± 0.15** (mean ± std).

In [None]:
redwinequality_df.boxplot("pH", figsize=(15, 10))
plt.show()

In [None]:
# Get the median of the pH
ph_median = redwinequality_df["pH"].median()
ph_median_value = float(ph_median)

# Calculate the interquartile range (IQR)
ph_q1, ph_q3 = np.percentile(redwinequality_df["pH"], [25, 75])
ph_iqr = ph_q3 - ph_q1

# Calculate the upper and lower whiskers
ph_upper_whisker = min(redwinequality_df["pH"].max(), ph_q3 + 1.5*ph_iqr)
ph_lower_whisker = max(redwinequality_df["pH"].min(),ph_q1 - 1.5*ph_iqr)

#Calculate center of the box for visual inspection
ph_center=(ph_q3 +ph_q1)/2

#Print values 
print("Median of pH: {:.2f}".format(ph_median_value))
print("Third Quartile: {:.2f}".format(ph_q3))
print("First Quartile: {:.2f}".format(ph_q1))
print("Upper whisker: {:.2f}".format(ph_upper_whisker))
print("Lower whisker: {:.2f}".format(ph_lower_whisker))
print("Center of the box: {:.2f}".format(ph_center))

Based on the boxplot of `pH`, there are outliers beyond both the upper whisker (3.68) and the lower whisker (2.92). The middle 50% of the `pH` values, which is the typical range, falls between 3.21 and 3.40.

Moreover, based on visual inspection, the median is near the center of the box; thus, the **distribution is approximately symmetric**.

#### 2.10 Distribution of Sulphates

In [None]:
sulph_df = redwinequality_df.agg({"sulphates": ["mean", "median","std", "min", "max"]})
round(sulph_df,2)

The distribution of sulphates is **0.66 ± 0.17** (mean ± std).

In [None]:
redwinequality_df.boxplot("sulphates", figsize=(15, 10))
plt.show()

In [None]:
# Get the median of the sulphates
sulph_median = redwinequality_df["sulphates"].median()
sulph_median_value = float(sulph_median)

# Calculate the interquartile range (IQR)
sulph_q1, sulph_q3 = np.percentile(redwinequality_df["sulphates"], [25, 75])
sulph_iqr = sulph_q3 - sulph_q1

# Calculate the upper and lower whiskers
sulph_upper_whisker = min(redwinequality_df["sulphates"].max(), sulph_q3 + 1.5*sulph_iqr)
sulph_lower_whisker = max(redwinequality_df["sulphates"].min(),sulph_q1 - 1.5*sulph_iqr)

#Calculate center of the box for visual inspection
sulph_center=(sulph_q3 +sulph_q1)/2

#Print values 
print("Median of sulphates: {:.2f}".format(sulph_median_value))
print("Third Quartile: {:.2f}".format(sulph_q3))
print("First Quartile: {:.2f}".format(sulph_q1))
print("Upper whisker: {:.2f}".format(sulph_upper_whisker))
print("Lower whisker: {:.2f}".format(sulph_lower_whisker))
print("Center of the box: {:.2f}".format(sulph_center))

Based on the boxplot of `sulphates`, there are outliers above the upper whisker (1.00) but none below the lower whisker (0.33). The middle 50% of the `sulphates` values, which is the typical range, falls between 0.55 and 0.73.

Moreover, based on visual inspection, the median is near the center of the box; thus, the **distribution is approximately symmetric**.

#### 2.11 Distribution of Alcohol

In [None]:
alc_df = redwinequality_df.agg({"alcohol": ["mean", "median","std", "min", "max"]})
round(alc_df,2)

The distribution of alcohol is **10.42 ± 1.07** (mean ± std).

In [None]:
redwinequality_df.boxplot("alcohol", figsize=(15, 10))
plt.show()

In [None]:
# Get the median of the alcohol
alc_median = redwinequality_df["alcohol"].median()
alc_median_value = float(alc_median)

# Calculate the interquartile range (IQR)
alc_q1, alc_q3 = np.percentile(redwinequality_df["alcohol"], [25, 75])
alc_iqr = alc_q3 - alc_q1

# Calculate the upper and lower whiskers
alc_upper_whisker = min(redwinequality_df["alcohol"].max(), alc_q3 + 1.5*alc_iqr)
alc_lower_whisker = max(redwinequality_df["alcohol"].min(),alc_q1 - 1.5*alc_iqr)

#Calculate center of the box for visual inspection
alc_center=(alc_q3 +alc_q1)/2

#Print values 
print("Median of alcohol: {:.2f}".format(alc_median_value))
print("Third Quartile: {:.2f}".format(alc_q3))
print("First Quartile: {:.2f}".format(alc_q1))
print("Upper whisker: {:.2f}".format(alc_upper_whisker))
print("Lower whisker: {:.2f}".format(alc_lower_whisker))
print("Center of the box: {:.2f}".format(alc_center))

Based on the boxplot of `alcohol`, there are outliers above the upper whisker (13.50) but none below the lower whisker (8.40). The middle 50% of the `alcohol` values, which is the typical range, falls between 9.50 and 11.10.

Moreover, based on visual inspection, the median is near the center of the box; thus, the **distribution is approximately symmetric**.

### 3. Are there any variables that are highly correlated with each other?

To easily determine which variables are highly correlated with each other, the **`corr`** function and a heatmap are used again.

In [None]:
# pearson correlation between each variable with each other
all_corr = redwinequality_df.corr(method='pearson')

# Matplotlib heatmap of correlation of each variable with each other
plt.rcParams['axes.grid'] = False # to disable white grid lines
fig, ax = plt.subplots(figsize=(10,10))
im = ax.imshow(all_corr, cmap='plasma', interpolation='nearest')
ax.set_title('Correlation of all variables with each other', size=20)
ax.set_xticks(range(len(all_corr.columns)), all_corr.columns, rotation='vertical', size=12)
ax.set_yticks(range(len(all_corr.columns)), all_corr.columns, size=12)
fig.colorbar(im, orientation='vertical')
for i in range(len(all_corr.columns)):
    for j in range(len(all_corr.columns)):
        text = ax.text(j, i, round(all_corr.iloc[i, j], 2),
                       ha="center", va="center", color="black", size=12)
plt.show()
plt.rcParams['axes.grid'] = True # to enable white grid lines

From the correlation heatmap above, we can easily see which variables are strongly related to each other, both positively and negatively.

The highly positively correlated pairs are:
1. fixed acidity and citric acid
2. free sulfur dioxide and total sulfur dioxide
3. fixed acidity and density
4. alcohol and quality

The highly negatively correlated pairs are:
1. citric acid and volatile acidity
2. fixed acidity and pH
3. density and alcohol
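
Reading pairs off a heatmap by eye can miss some or include borderline ones. As a hedged sketch, the pairs can also be listed programmatically by keeping the upper triangle of the correlation matrix and filtering by absolute value. The helper `top_correlated_pairs`, the `threshold` of 0.4, and the toy columns `a` to `d` are all assumptions made for illustration; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.4):
    """List variable pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.corr(method="pearson")
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    # Order from strongest to weakest, regardless of sign
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order)

# Hypothetical toy data (not taken from the dataset)
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2, 4, 5, 8, 10, 13, 14, 16],   # rises with a
    "c": [8, 7, 6, 5, 4, 3, 2, 1],       # falls as a rises
    "d": [1, -1, 1, -1, 1, -1, 1, -1],   # weakly related to the others
})
strong_pairs = top_correlated_pairs(toy_df)
print(strong_pairs)
```

On the notebook's data, the equivalent call would be `top_correlated_pairs(redwinequality_df)`, with the threshold chosen to match whatever is meant by "highly correlated".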

### 4. How does the alcohol content vary across different quality scores?

To observe how the *alcohol content* varies across quality scores, its distribution per score is summarized as the mean ± standard deviation, i.e., the range from the mean minus the standard deviation to the mean plus the standard deviation.

In [None]:
# alcohol content per quality score
alcohol_per_quality = redwinequality_df.groupby("quality").agg(
    {"alcohol": ["count", "min", "max", "sum", "median", "mean", "std"]}
)

# string representation "mean ± std" for the distribution column
alcohol_per_quality["alcohol", "distribution"] = (
    alcohol_per_quality["alcohol"]["mean"].round(2).astype(str)
    + " ± "
    + alcohol_per_quality["alcohol"]["std"].round(2).astype(str)
)

round(alcohol_per_quality,2)

The distribution of alcohol per quality score can be found in the **`distribution`** column of the **`alcohol_per_quality`** dataframe.
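
A string such as "mean ± std" is easy to read but cannot be compared or plotted. As a small sketch, the lower and upper bounds can be kept as numeric columns alongside it. The frame `toy_df` and the column names `lower` and `upper` are hypothetical; the values are not from the dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for redwinequality_df
toy_df = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 6],
    "alcohol": [9.0, 9.5, 10.0, 10.0, 11.0, 12.0],
})

# Mean and sample standard deviation of alcohol per quality score
summary = toy_df.groupby("quality")["alcohol"].agg(["mean", "std"])

# Readable label with a fixed number of decimals
summary["distribution"] = (
    summary["mean"].map("{:.2f}".format) + " ± " + summary["std"].map("{:.2f}".format)
)

# Numeric bounds: mean minus and mean plus one standard deviation
summary["lower"] = summary["mean"] - summary["std"]
summary["upper"] = summary["mean"] + summary["std"]

print(summary)
```

Formatting with `"{:.2f}"` keeps trailing zeros, so every label shows two decimals, which `round(...).astype(str)` does not guarantee.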

The distribution of the **alcohol content** per **quality score** can be visually represented using boxplots.

In [None]:
# boxplot of alcohol content per quality rate
redwinequality_df.boxplot("alcohol", by="quality", figsize=(15, 10))
plt.show()

To determine the skewness of the alcohol content distribution per quality score, the **median** (**`50%`**) and **mean** must be retrieved. This is done by grouping the **`alcohol`** column by **`quality`** using the **`groupby`** function and calling the **`describe`** function on the **`alcohol`** column.

In [None]:
bx_vals = redwinequality_df.groupby("quality")["alcohol"].describe()
bx_vals

The distribution of **`alcohol`** in a **`quality`** group is positively skewed (longer right tail) if the **mean** is greater than the **median**, and negatively skewed (longer left tail) if the **mean** is less than the **median**. If the **mean** and **median** are equal, this comparison indicates no skew.

The **`quality`** groups that are positively skewed are assigned to **`pos_skew`**, those that are negatively skewed to **`neg_skew`**, and those with no skew to **`no_skew`**. Each list is obtained by filtering the index of **`bx_vals`** and calling the **`tolist`** function.

In [None]:
# mean > median: positive (right) skew; mean < median: negative (left) skew
pos_skew = bx_vals.index[bx_vals['mean'] > bx_vals['50%']].tolist()
neg_skew = bx_vals.index[bx_vals['mean'] < bx_vals['50%']].tolist()
no_skew = bx_vals.index[bx_vals['mean'] == bx_vals['50%']].tolist()

print("Negative skew: ", neg_skew)
print("Positive skew: ", pos_skew)
print("No skew: ", no_skew)

Based on the results, the alcohol distribution of **`quality`** groups **`[3, 4, 5, 6]`** is positively skewed, while that of groups **`[7, 8]`** is negatively skewed. No **`quality`** group has an equal mean and median.
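
Comparing the mean with the median is only a quick heuristic. The sample skewness coefficient gives a direct check: a positive value indicates a longer right tail and a negative value a longer left tail. The sketch below uses the pandas `skew` method on grouped data; `toy_df` and its values are hypothetical and only illustrate the sign convention.

```python
import pandas as pd

# Hypothetical toy data (not taken from the dataset)
toy_df = pd.DataFrame({
    "quality": [5] * 5 + [7] * 5,
    "alcohol": [
        9.0, 9.2, 9.4, 9.6, 12.0,     # group 5: one large value, long right tail
        9.0, 11.4, 11.6, 11.8, 12.0,  # group 7: one small value, long left tail
    ],
})

grouped = toy_df.groupby("quality")["alcohol"]

# Sample skewness per quality group, next to the mean and median
skewness = grouped.skew()
comparison = grouped.agg(["mean", "median"])
comparison["skew"] = skewness

print(comparison)
```

In this toy data, group 5 has a mean above its median and a positive coefficient, while group 7 has a mean below its median and a negative coefficient. On the real data, the equivalent call would be `redwinequality_df.groupby("quality")["alcohol"].skew()`.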

# Formulated Research Question


Our exploratory data analysis examined how each measured variable relates to the quality of the wine. The study aims to determine the relationship between the physicochemical properties of wine and its quality rating. Accordingly, the primary research question focuses on predicting the quality rating from these properties by modeling the dataset. Thus, our research aims to answer the question:

**Can physicochemical properties of wine predict its quality rating?**

The relevance of the study lies in quality assessment in the wine industry. Most traditional systems assess wine quality through human experts, which is time-consuming and expensive. This project aims to determine which physicochemical properties are the best indicators of red wine quality and to produce insights into how these factors contribute to our model's quality assessment. This can provide an easier and more efficient way to predict the quality of a wine. Through this, winemakers and consumers would be able to examine how a small change in each property could affect the quality of the wine. In addition, the model can indicate the importance of each physicochemical property, which helps identify measurements that could be dropped to reduce cost.


#### References


- Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez (2009); A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal.
- The importance of the physicochemical composition of wine on the score awarded in an official contest. (2022, June 13). IVES. https://ives-openscience.eu/13679/