# Red Wine Quality - Descriptive Statistics

## Table of Contents

1. [Red Wine Quality - Descriptive Statistics](#red-wine-quality---descriptive-statistics)
2. [Importing necessary libraries](#importing-necessary-libraries)
3. [Loading Dataset](#loading-dataset)
4. [Initial Information About Dataset](#initial-information-about-dataset)
    - [Basic information](#basic-information)
    - [Process null values](#process-null-values)
        - [Check null values](#check-null-values)
    - [Process duplicate rows](#process-duplicate-rows)
        - [Check duplicate rows](#check-duplicate-rows)
        - [Remove duplicate rows](#remove-duplicate-rows)
    - [Reformat the column's name](#reformat-the-columns-name)
5. [Descriptive statistics of numeric variables](#descriptive-statistics-of-numeric-variables)
6. [The End](#the-end)

## Importing necessary libraries

In [2]:
import pandas as pd

## Loading Dataset

In [3]:
wine = pd.read_csv('winequality-red.csv')
df = wine.copy()
df.head()

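|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |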


## Initial Information About Dataset

### Basic information

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


Based on the description included with the dataset file, there are 11 input variables and 1 output variable.

#### Input variables:
- **Fixed acidity**: most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)
- **Volatile acidity**: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
- **Citric acid**: found in small quantities, citric acid can add 'freshness' and flavor to wines
- **Residual sugar**: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
- **Chlorides**: the amount of salt in the wine
- **Free sulfur dioxide**: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
- **Total sulfur dioxide**: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
- **Density**: the density of wine is close to that of water, depending on the percent alcohol and sugar content
- **pH**: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
- **Sulphates**: a wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
- **Alcohol**: the percent alcohol content of the wine

#### Output variable
- **Quality**: output variable (based on sensory data, score between 0 and 10)
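
The input/output split described above can be sketched in pandas. This is a minimal illustration on a three-column stand-in frame (values copied from the `df.head()` output); the names `sample`, `X`, and `y` are assumptions for the example and are not part of the original notebook.

```python
import pandas as pd

# Stand-in frame: two input columns and the output column.
sample = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2],
    "alcohol": [9.4, 9.8, 9.8],
    "quality": [5, 5, 6],
})

# Every column except `quality` is an input variable.
X = sample.drop(columns="quality")
# `quality` is the single output variable.
y = sample["quality"]

print(X.shape, y.shape)
```

Applied to the full dataset, the same two lines would yield 11 input columns and one output column.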

### Process null values

#### Check null values

In [5]:
print("Totally there are {} null values in the dataset".format(df.isnull().sum().sum()))

Totally there are 0 null values in the dataset
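
When the total is not zero, a per-column breakdown is usually more useful than a single number. The sketch below uses a small made-up frame (`toy`, with deliberately injected `NaN` values) to show how `isnull().sum()` gives one count per column and how the grand total follows from it.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing values, for illustration only.
toy = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26],
    "alcohol": [9.4, 9.8, np.nan],
    "quality": [5, 5, 5],
})

# isnull() marks each cell; sum() counts the True cells per column.
nulls_per_column = toy.isnull().sum()
# Summing the per-column counts gives the dataset-wide total.
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column)
print(f"Total null values: {total_nulls}")
```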


### Process duplicate rows

#### Check duplicate rows

In [6]:
duplicated_rows = df[df.duplicated()]
rows, columns = duplicated_rows.shape
print(f"Rows that have duplicated values: {rows}")

Rows that have duplicated values: 240


The dataset contains 240 duplicated rows, which could hurt both the training time and the performance of a model. For that reason, we remove them.
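
It helps to be precise about what the count of 240 means. `DataFrame.duplicated()` uses `keep="first"` by default, so it flags only the repeats after the first occurrence of each row; that is why dropping duplicates takes the dataset from 1599 to 1359 rows. The sketch below shows the difference between the default and `keep=False` on a made-up frame (`toy` is an illustrative name).

```python
import pandas as pd

# Made-up frame: rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 9.4],
    "quality": [5, 5, 5, 5],
})

# Default keep="first": only the later copies (rows 2 and 3) are flagged.
extra_copies = int(toy.duplicated().sum())

# keep=False: every row that has a twin is flagged, including the first one.
all_involved = int(toy.duplicated(keep=False).sum())

print(extra_copies, all_involved)
```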

#### Remove duplicate rows

In [7]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1359 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1359 non-null   float64
 1   volatile acidity      1359 non-null   float64
 2   citric acid           1359 non-null   float64
 3   residual sugar        1359 non-null   float64
 4   chlorides             1359 non-null   float64
 5   free sulfur dioxide   1359 non-null   float64
 6   total sulfur dioxide  1359 non-null   float64
 7   density               1359 non-null   float64
 8   pH                    1359 non-null   float64
 9   sulphates             1359 non-null   float64
 10  alcohol               1359 non-null   float64
 11  quality               1359 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 138.0 KB
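
Note that the index in the output above still runs from 0 to 1598 even though only 1359 rows remain: `drop_duplicates()` keeps the original row labels. If a contiguous index is wanted, one option (not used in the original notebook) is `reset_index(drop=True)`, sketched here on a made-up frame.

```python
import pandas as pd

# Made-up frame: row 1 duplicates row 0.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.4, 9.8],
    "quality": [5, 5, 5],
})

# The surviving rows keep their original labels, leaving a gap at 1.
deduped = toy.drop_duplicates()
print(list(deduped.index))

# drop=True discards the old labels instead of adding them as a column.
renumbered = deduped.reset_index(drop=True)
print(list(renumbered.index))
```

Whether to renumber is a design choice: keeping the original labels makes it easy to trace rows back to the source file.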


### Reformat the column's name

In [8]:
df.rename(columns={"fixed acidity": "fixed_acidity", "volatile acidity": "volatile_acidity",
                   "citric acid": "citric_acid", "residual sugar": "residual_sugar",
                   "free sulfur dioxide": "free_sulfur_dioxide",
                   "total sulfur dioxide": "total_sulfur_dioxide"}, inplace=True)

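As an alternative to listing every column by hand, the same renaming can be sketched with a string operation on the column index. This is an assumption about what the cell is meant to achieve (replace spaces with underscores), shown on an empty stand-in frame named `toy`.

```python
import pandas as pd

# Empty stand-in frame carrying a few of the dataset's column names.
toy = pd.DataFrame(columns=["fixed acidity", "free sulfur dioxide", "pH"])

# Replace every space with an underscore; regex=False treats " " literally.
toy.columns = toy.columns.str.replace(" ", "_", regex=False)

print(list(toy.columns))
```

Columns without spaces, such as `pH`, pass through unchanged.
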
## Descriptive statistics of numeric variables

In [9]:
df.describe().apply(lambda x: x.apply('{:.2f}'.format))

|   | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1359.0 | 1359.0 | 1359.0 | 1359.0 | 1359.0 | 1359.0 | 1359.0 | 1359.0 | 1359.0 | 1359.0 | 1359.0 | 1359.0 |
| mean | 8.31 | 0.53 | 0.27 | 2.52 | 0.09 | 15.89 | 46.83 | 1.0 | 3.31 | 0.66 | 10.43 | 5.62 |
| std | 1.74 | 0.18 | 0.2 | 1.35 | 0.05 | 10.45 | 33.41 | 0.0 | 0.16 | 0.17 | 1.08 | 0.82 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.01 | 1.0 | 6.0 | 0.99 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 1.0 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.08 | 14.0 | 38.0 | 1.0 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.43 | 2.6 | 0.09 | 21.0 | 63.0 | 1.0 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.61 | 72.0 | 289.0 | 1.0 | 4.01 | 2.0 | 14.9 | 8.0 |


The output reports the following statistics for each column:

- **Count**: Shows the number of non-null values in each column.
- **Mean**: Indicates the average value for each numerical column.
- **Std**: The standard deviation, representing how spread out the values are from the mean.
- **Min** & **Max**: Show the minimum and maximum values, thus giving the range of the data (range = max - min).
- **25%** (Q1), **50%** (Median), **75%** (Q3): These are the quartiles. The 50% is the median. These give insights into how the data is distributed (e.g. symmetric, skewed).
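
The range and the interquartile range (IQR = Q3 - Q1) are not printed by `describe()`, but both can be derived from its output. The sketch below does this for a made-up series (`values`); for `residual_sugar` in the table above, the same arithmetic gives a range of 15.5 - 0.9 = 14.6 and an IQR of 2.6 - 1.9 = 0.7.

```python
import pandas as pd

# Made-up values with one large observation, for illustration only.
values = pd.Series([1, 2, 3, 4, 10], name="example")

# describe() returns a Series indexed by statistic name.
summary = values.describe()

# Range: distance between the largest and smallest value.
value_range = summary["max"] - summary["min"]
# IQR: width of the middle 50% of the data.
iqr = summary["75%"] - summary["25%"]

print(f"range = {value_range}, IQR = {iqr}")
```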

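Distribution/Spread: If the mean and median (50%) are close, the distribution is likely fairly symmetric. A mean noticeably above the median suggests a right skew (a long tail of large values), and a mean below the median suggests a left skew. For example, `total_sulfur_dioxide` has a mean of 46.83 against a median of 38.0 and a maximum of 289.0, which points to a right-skewed distribution.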
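
The mean-versus-median comparison can be checked numerically. The sketch below contrasts two made-up series and also calls `Series.skew()`, a pandas method that returns a sample skewness estimate (positive for a right tail); it is offered as one possible follow-up, not as part of the original analysis.

```python
import pandas as pd

# Made-up data: one symmetric series, one with a large value on the right.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 2, 3, 4, 20])

for name, s in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    # A positive gap means the mean is pulled above the median.
    gap = s.mean() - s.median()
    print(f"{name}: mean - median = {gap:.2f}, skew = {s.skew():.2f}")
```

On the wine data, the same loop could be run over the columns of `df` to flag which variables have the longest right tails.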

## The End