# Exploratory Data Analysis on Wine Quality

The scope of this analysis is to understand the relationship between the various parameters that impact the quality ratings of both red and white wine. The data set used for the analysis is downloaded from the UCI repository https://archive.ics.uci.edu/ml/datasets/Wine+Quality and consists of 6000+ samples covering both red and white wine types.


##  Data Description

**Fixed acidity:** most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

**Volatile acidity:** the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste

**Citric acid:** found in small quantities, citric acid can add 'freshness' and flavor to wines

**Residual sugar:** the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

**Chlorides:** the amount of salt in the wine

**Free sulfur dioxide:** the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

**Total sulfur dioxide:** amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

**Density:** the density of wine is close to that of water, depending on the percent alcohol and sugar content

**pH:** describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

**Sulphates:** a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

**Alcohol:** the percent alcohol content of the wine

**Quality:** the quality is a rating from 0 to 10 given to the wines by assessors; type denotes whether the wine is red or white.


## Read and Clean raw data 

First, we read the two data tables with read.csv.

In [None]:
red <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', header = TRUE, sep = ';')
white <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', header = TRUE, sep = ';')

##  Task 1:
* What are the dimensions of the data set, i.e. how many records and how many variables are there?
* What are the names of the variables?
* What is the type of each variable?

Functions to explore the data:

In [None]:
?dim()

In [None]:
?str()

In [None]:
?names()

Now, you explore the dataset.
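
As a warm-up, here is a minimal sketch of what the three functions return. It assumes a small made-up data frame called `toy` (the name and values are illustrative only, not taken from the wine files).

```r
# Toy stand-in for the wine tables (values are made up for illustration)
toy <- data.frame(
  pH      = c(3.51, 3.20, 3.26),
  alcohol = c(9.4, 9.8, 9.8),
  quality = c(5L, 5L, 6L)
)

dim(toy)    # number of rows (records) and number of columns (variables)
names(toy)  # the variable names
str(toy)    # the type of each variable plus a preview of its values
```

Replacing `toy` with `red` or `white` should answer the Task 1 questions for the real data.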

**There are 6497 records of wine evaluation, and 12 variables are in the data set. Two datasets have the same variable names.**


We first append the two datasets to create the master dataset for further analysis. A factor variable "type" is introduced as the indicator of wine type.

In [None]:
# We want 'type' to be 'Red' & 'White' ('red' & 'white', to match redwhitewine.csv)
red[, 'type'] <- 'Red'
white[, 'type'] <- 'White'
# Combine the two datasets and change type to a factor variable
wine <- rbind(red, white) # and join
wine$type <- as.factor(wine$type) 

### <font color='blue'>Additional information: differences between different data binding methods </font> 


![](http://media.wiley.com/Lux/72/326772.image0.jpg)
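
The differences between the binding methods can be sketched on toy tables. The data frames `a`, `b` and `extra` below are hypothetical and only serve to show how `rbind()`, `cbind()` and `merge()` change the shape of the result.

```r
# Toy tables (hypothetical values) to contrast the three binding methods
a     <- data.frame(id = 1:2, pH = c(3.5, 3.2))
b     <- data.frame(id = 3:4, pH = c(3.3, 3.1))
extra <- data.frame(id = c(1L, 3L), alcohol = c(9.4, 9.8))

stacked <- rbind(a, b)                       # same columns, rows appended
side    <- cbind(a, b)                       # same row count, columns side by side
joined  <- merge(stacked, extra, by = "id")  # keeps only ids present in both tables

dim(stacked)
dim(side)
dim(joined)
```

In this notebook `rbind()` is the natural choice, because the red and white tables share the same columns and we only want to stack their rows.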

In [None]:
head(wine)

Some names are too long, so we use some common abbreviations:

RS - residual sugar <br>
ABV - percentage of alcohol <br>
SO2 - sulfur.dioxide 


In [None]:
names(wine)[c(4,6,7,11)]<-c("RS","free.SO2", "total.SO2","ABV")
head(wine)

In [None]:
write.csv(wine, "Wine.csv",row.names=FALSE)

This is the cleaned data; the cell above saves the master data to a CSV file for easy access later.

Next, we check the data types.

In [None]:
str(wine)

Only one variable is a factor; all other variables are numeric.

## Task 2:
Find the descriptive statistics of variables. Describe the interesting observations.
There are many ways in R to compute the descriptive statistics. Here, you can try the function 
<font color="DodgerBlue">summary()</font>

In [None]:
summary(wine)


Now, try to use the <font color="DodgerBlue">summary()</font> function to print out summary statistics such as
<ul>
  <li>min/max</li>
  <li>median/mean</li>
  <li>quantiles</li>
</ul>


If you need advanced descriptive statistics, you can use describe from the <a href="https://www.rdocumentation.org/packages/psych/versions/1.8.12"> psych </a>library.

In [None]:
library(psych)

Try the <a href="https://www.rdocumentation.org/packages/psych/versions/1.8.12/topics/describe">describe()</a> function

Summarize your observations from the Summary below:

We can compare the standard deviation for variables. Find
<ul>
        <li>The variable with the smallest standard deviation</li>
        <li>The variable with the largest standard deviation</li>
</ul>

In order to do so, you might need

<ul>
    <li> <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/sort.html">sort()</a></li>
    <li> <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/apply.html">apply()</a></li>
    <li> <a href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/sd.html">sd()</a></li>
</ul>

### <font color='blue'>Additional information: what is apply doing?</font> 

Tutorial on apply function can be found, for example, https://www.datacamp.com/community/tutorials/r-tutorial-apply-family
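
A minimal sketch of how `apply()`, `sd()` and `sort()` fit together, assuming a small made-up numeric data frame `toy` (hypothetical values and column names):

```r
# Toy numeric table (hypothetical values)
toy <- data.frame(
  pH       = c(3.51, 3.20, 3.26, 3.16),
  alcohol  = c(9.4, 9.8, 9.8, 9.8),
  totalSO2 = c(34, 67, 54, 60)
)

# MARGIN = 2 applies sd() to every column; MARGIN = 1 would go row by row
col_sd <- apply(toy, 2, sd)

sort(col_sd)              # standard deviations from smallest to largest
names(which.min(col_sd))  # variable with the smallest standard deviation
names(which.max(col_sd))  # variable with the largest standard deviation
```

On the wine data the factor column `type` would have to be dropped first, for example with `wine[, sapply(wine, is.numeric)]`, because `sd()` is only meaningful for numeric columns.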

## Task 3
Generate a boxplot for all variables, boxplots for variables two by two, and separate boxplots for each variable.

<font color='blue'>Additional information:  What is boxplot ?</font> 

![](http://www.comfsm.fm/~dleeling/statistics/boxplot-explained.png)

You can consider using 
<ul>
    <li><a href="https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/boxplot.html">boxplot()</a></li>
    <li> the <b>ggplot2</b> library: you can refer to this <a href="http://t-redactyl.io/blog/2016/04/creating-plots-in-r-using-ggplot2-part-10-boxplots.html">online tutorial</a></li>
</ul>
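
The base `boxplot()` call supports both styles asked for in this task. The sketch below assumes a simulated data frame `toy` (made-up values, not the wine data) and shows one plot for several variables and one plot for a single variable split by group.

```r
set.seed(1)
# Simulated stand-in for the wine data (values are made up)
toy <- data.frame(
  pH   = rnorm(40, mean = 3.2, sd = 0.15),
  ABV  = rnorm(40, mean = 10.5, sd = 1.2),
  type = rep(c("Red", "White"), each = 20)
)

# several numeric variables side by side in one plot
boxplot(toy[, c("pH", "ABV")], main = "All variables")

# one variable, split by a grouping variable via the formula interface
bp <- boxplot(ABV ~ type, data = toy, main = "ABV by wine type")
bp$names  # the groups that were drawn
```

When variables live on very different scales, as total SO2 and density do in the wine data, separate boxplots per variable are usually easier to read than one combined plot.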

## Task 4: Check correlation
Investigate the associations between variables. Calculate the correlation between variables.
As discussed in the lecture, we could use the scatter plot, where we plot one variable against another.
There are many methods, for example
<ul>
    <li><font color="DodgerBlue">plot()</font></li>
    <li><font color="DodgerBlue">pairs()</font></li>
    <li><font color="DodgerBlue">cor()</font></li>
    <li><font color="DodgerBlue">scatterplotMatrix()</font> in the <b>car</b> library</li>
</ul>

Scatterplots are useful for examining properties of bivariate data such as linearity, slope, and strength.
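
To see how `cor()` and `pairs()` complement each other, here is a small sketch on simulated data. The data frame `toy` and its columns `up`, `down` and `noise` are invented for illustration.

```r
set.seed(42)
x <- rnorm(100)
# Simulated variables with a known relationship to x
toy <- data.frame(
  x     = x,
  up    = 2 * x + rnorm(100, sd = 0.5),  # rising linear trend
  down  = -x + rnorm(100, sd = 0.5),     # falling linear trend
  noise = rnorm(100)                     # unrelated to x
)

r <- cor(toy)   # matrix of pairwise Pearson correlations
round(r, 2)
pairs(toy)      # scatter plot for every pair of variables
```

The sign of each entry in `r` should match the direction of the corresponding scatter plot, and entries near zero should correspond to shapeless clouds.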


####  <font color='blue'>Additional information: What is correlation coefficient? </font> 

![](http://www.resacorp.com/images/slrund031.gif)
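
For reference, the Pearson correlation coefficient, which `cor()` computes by default, can be written as follows for $n$ paired observations $(x_i, y_i)$ with sample means $\bar{x}$ and $\bar{y}$:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The value always lies between $-1$ and $1$; values near $\pm 1$ indicate a nearly linear relationship, and values near $0$ indicate little linear association.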



Strong correlation?

![](http://www.uow.edu.au/content/groups/public/@web/@stsv/documents/mm/uow153493.gif)

## Task 5
Find the distribution of variables, and describe the shape of the distribution.
A histogram is one of the ways to study the distribution of a variable. 
Try to generate the histogram of 
* ABV: the Alcohol content 

You can consider the <a href="https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/hist.html">hist()</a> function.

The distribution of alcohol data is skewed to 
* the right?
* or the left?
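
A rough way to back up the visual impression is to compare the mean with the median: for right-skewed data the mean is usually pulled above the median. The sketch below assumes simulated values in `abv_like`, which are not the real ABV column.

```r
set.seed(7)
# Simulated right-skewed values (not the real ABV column)
abv_like <- 8 + rexp(500, rate = 0.5)

h <- hist(abv_like, breaks = 20,
          main = "Toy right-skewed variable", xlab = "value")

# a long tail to the right pulls the mean above the median
mean(abv_like) > median(abv_like)
```

The same comparison can be run on `wine$ABV` once the histogram has been drawn.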

Now, further study the distribution of <b>ABV</b> for red wine only

The distribution of alcohol content data for red wine is skewed to 
* the right?
* or the left?

Alternatively, we can use ggplot. Let's compare four indicators 
* pH
* free.SO2
* total.SO2
* ABV

for red wine and white wine separately with ggplot. You might need the following libraries
* gridExtra

Start with red wine.
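
A minimal sketch of the plotting pattern, assuming the ggplot2 and gridExtra packages are installed. The data frame `toy_red` is simulated and only stands in for the red wine subset; the same pattern extends to four panels.

```r
library(ggplot2)
library(gridExtra)

set.seed(3)
# Simulated stand-in for the red wine subset (values are made up)
toy_red <- data.frame(
  pH  = rnorm(200, mean = 3.3, sd = 0.15),
  ABV = 9 + rexp(200, rate = 0.7)
)

# one histogram per indicator
p1 <- ggplot(toy_red, aes(x = pH)) + geom_histogram(bins = 30)
p2 <- ggplot(toy_red, aes(x = ABV)) + geom_histogram(bins = 30)

# arrange the panels in a grid
grid.arrange(p1, p2, ncol = 2)
```

For the real data, `toy_red` could be replaced by something like `wine[wine$type == "Red", ]`.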

What are your observations from the four histograms?

A similar distribution analysis for white wine is given below

What are your observations?

# Further studies

## Correlation plots


As shown below, installing and loading GGally fails here. This could be an R version issue; not all libraries work all the time.

There are alternatives:

* library(corrgram)
* library(corrplot)

#### In this case none of them worked either (they may work for you)
Below is an example of what was produced by GGally:

<img src = "GGally.png">

from http://rstudio-pubs-static.s3.amazonaws.com/24803_abbae17a5e154b259f6f9225da6dade0.html

The result is a mashup of scatter plots, correlations and boxplots, which is a bit messy. There are alternatives, including the do-it-yourself (DIY) version in the cells below.

In [None]:
library(GGally) 
# GGally install:
# install.packages('GGally', repos = c('http://cran.ms.unimelb.edu.au'))

In [None]:
# may need to update R
# there are alternatives: 
# Correlogram Example
# install.packages('corrplot', repos = c('http://cran.ms.unimelb.edu.au'))
# library(corrplot)
# install.packages('corrgram', repos = c('http://cran.ms.unimelb.edu.au'))
# library(corrgram)
# both failed, DIY

In [None]:
# DIY correlation plot
# http://stackoverflow.com/questions/31709982/how-to-plot-in-r-a-correlogram-on-top-of-a-correlation-matrix
# there's some truth to the quote that modern programming is often stitching together pieces from SO 

colorRange <- c('#69091e', '#e37f65', 'white', '#aed2e6', '#042f60')
## colorRamp() returns a function which takes as an argument a number
## on [0,1] and returns a color in the gradient in colorRange
myColorRampFunc <- colorRamp(colorRange)

panel.cor <- function(w, z, ...) {
    correlation <- cor(w, z)

    ## because the func needs [0,1] and cor gives [-1,1], we need to shift and scale it
    col <- rgb(myColorRampFunc((1 + correlation) / 2 ) / 255 )

    ## take the square root so that the circle's area, not its radius,
    ## scales with |correlation| (avoids visual bias due to "area vs diameter")
    radius <- sqrt(abs(correlation))
    radians <- seq(0, 2*pi, length.out = 50) # 50 is arbitrary
    x <- radius * cos(radians)
    y <- radius * sin(radians)
    ## make them full loops
    x <- c(x, tail(x,n=1))
    y <- c(y, tail(y,n=1))

    ## trick: "don't create a new plot" thing by following the
    ## advice here: http://www.r-bloggers.com/multiple-y-axis-in-a-r-plot/
    ## This allows the circle to be drawn inside the current pairs() panel
    par(new=TRUE)
    plot(0, type='n', xlim=c(-1,1), ylim=c(-1,1), axes=FALSE, asp=1)
    polygon(x, y, border=col, col=col)
}

# usage e.g.:
# pairs(mtcars, upper.panel = panel.cor)

In [None]:
pairs(wine[sample.int(nrow(wine),1000),], upper.panel=panel.cor)

Nice, blue indicates positive correlation, red indicates negative correlation.
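
The colour mapping used in `panel.cor` can be checked in isolation. The sketch below reuses the same palette and the same shift-and-scale step; the helper name `cor_to_col` is introduced here for illustration only.

```r
colorRange <- c('#69091e', '#e37f65', 'white', '#aed2e6', '#042f60')
ramp <- colorRamp(colorRange)

# map a correlation in [-1, 1] to [0, 1], then to a hex colour
cor_to_col <- function(r) rgb(ramp((1 + r) / 2) / 255)

# strongest negative, zero and strongest positive correlation
cor_to_col(c(-1, 0, 1))
```

A correlation of zero maps to the middle of the palette, which is white, so such a circle would be invisible even if its radius were not zero.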


#### What does the absence of a circle (red or blue) indicate?
#### What shape is the corresponding scatter likely to be?


Now look at the blue (positive) and red (negative) correlations, for example 'free.SO2' and 'total.SO2' (on the plot diagonal). To the right of 'free.SO2' is a large blue circle representing a positive (+ve) correlation with total SO2 (more free SO2 goes with more total SO2). The corresponding scatter plot is below 'free.SO2' and shows an upward trend from left to right. Conversely, note density vs. alcohol: a large red circle sits 3 squares to the right of density, and the corresponding scatter plot is diametrically opposite (3 squares below density), showing a downward trend from left to right.

Observations from the plot:


* As suspected, free SO2 and total SO2 are highly correlated with each other (correlation coefficient 0.72) and negatively correlated with acidity.

* pH is negatively correlated with fixed acidity, citric acid, total SO2 and residual sugar. The negative correlation with the residual sugar makes sense, since sugar has not yet oxidized into acids. Moreover, pH is positively correlated with the volatile acidity, which is a bit counter-intuitive.

* Residual sugar and density are also positively correlated, which we guess makes sense, adding sugar ought to increase the density!

* According to the description, sulphates are added to produce SO2, which acts as an antimicrobial and antioxidant; SO2 is also added for the same purpose. Why then are sulphates and total SO2 negatively correlated? Perhaps one is converted into the other.

* It is surprising to see positive correlation between total SO2 and residual sugar, maybe more SO2 is added to prevent sugar from being converted, and thus make sure that wine tastes a bit sugary.

* It is nice to see volatile acidity is negatively correlated with SO2, as SO2 is added in the wine to prevent acetic acid formation.

* Wine quality is highly correlated with alcohol quantity and density. However, alcohol and density are negatively correlated (correlation coefficient: -0.69), so there might be a collinearity problem. It is the alcohol that reduces the density, due to chemistry, hence alcohol amount is a good choice as a wine quality predictor.

* Wine quality is negatively correlated with the volatile acidity, as too high levels of it lead to a vinegary taste, supporting the description of the data set.

https://en.wikipedia.org/wiki/Wine_fault#Sulfur_compounds


#### Should red & white wine be analysed together?


In [None]:
pairs(wine[wine$type=="Red",-13], upper.panel = panel.cor) # just red

In [None]:
pairs(wine[wine$type=="White",-13], upper.panel = panel.cor) # just white

Are the correlation matrices the same?
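
Besides comparing the two plots by eye, the correlation matrices can be compared numerically. The sketch below assumes a simulated data frame `toy` in which the ABV and density relationship is deliberately made steeper for one type; all values are invented.

```r
set.seed(11)
n <- 150
# Simulated stand-in with a type column (values are made up)
toy <- data.frame(
  type = rep(c("Red", "White"), each = n),
  ABV  = rnorm(2 * n, mean = 10.5, sd = 1)
)
# hypothetical relationship: density falls with ABV, more steeply for "White"
slope <- ifelse(toy$type == "Red", -0.0005, -0.002)
toy$density <- 0.995 + slope * (toy$ABV - 10.5) + rnorm(2 * n, sd = 0.001)

# one correlation matrix per wine type
num_cols    <- c("ABV", "density")
cor_by_type <- lapply(split(toy[, num_cols], toy$type), cor)

# element-wise difference; entries far from zero flag pairs that behave differently
cor_diff <- cor_by_type$White - cor_by_type$Red
round(cor_diff, 2)
```

Large entries in `cor_diff` would suggest that pooling red and white wine hides type-specific relationships.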


Let's have a look at red and white wine separately. 

In [None]:
library(ggplot2) # needed for ggplot() in this and the following cells

ggplot(aes(x=ABV),data =wine) + 
    geom_density(aes(fill = type)) +
    facet_wrap(~quality) +
    ggtitle('Alcohol and Quality Relationship')

There seems to be no significant bias in the alcohol content, even though for quality levels 3 and 5 the red wine samples 
with higher alcohol content show a higher density reading than white wine.
From our earlier scatterplot matrices, alcohol seems to exhibit a strong correlation with pH value.

In [None]:
ggplot(aes(x=ABV, y=pH),data = wine) + 
    # shape 21 is a filled circle: color sets the outline, fill sets the interior
    geom_jitter(aes(color = type, fill = type), alpha=1/10, shape=21, size=4) +
    facet_wrap(~quality) +
    scale_color_brewer(type = 'div') +
    ggtitle('Alcohol content and pH Relationship')

As expected, there seems to be a dip in density as the alcohol content increases,
and white wine exhibits a more prominent dip.

The negative correlations of alcohol with total SO2, free SO2 and chlorides are analysed below:

In [None]:
library(gridExtra) # provides grid.arrange()

# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
a1<-ggplot(aes(x=ABV,y=total.SO2), data = wine) +
    geom_density(aes(color=type),stat='summary',fun=median)

a2<-ggplot(aes(x=ABV,y=free.SO2), data = wine) +
    geom_density(aes(color=type),stat='summary',fun=median)

a3<-ggplot(aes(x=ABV,y=chlorides), data = wine) +
    geom_density(aes(color=type),stat='summary',fun=median)

grid.arrange(a1,a2,a3,ncol=2)

The observations from the above analysis are as follows:

* Total SO2: 
White wine exhibits higher total SO2 content than red wine across all alcohol levels. 
Total SO2 content decreases with alcohol content for white wine.

* Free SO2: 
Again, white wine exhibits higher free SO2 levels across all alcohol content, though the difference between red and white wine seems to be smaller than the total SO2 difference. The free SO2 content decreases as the alcohol content increases for white wine.

* Chloride: 
Red wine has a higher chloride content than white wine as alcohol content increases.
For red wine, the chloride content is quite high at lower alcohol content, between 8 and 9, but then exhibits a steady reduction up to an alcohol content of 13 before a spike. White wine exhibits lower chloride levels across alcohol content levels and holds a steady pattern throughout.

* Sulphur Dioxide: 
The use of SO2 in wines has been a topic of discussion for a long time due to health-related issues. It will be interesting to see the distribution of SO2 across red and white wine, and its final impact on quality.

Analysis of Free SO2 across the Red and White wine is provided below

In [None]:
ggplot(aes(x = quality, y = free.SO2), data = wine) + 
    geom_point(aes(color=type),alpha=1/4, position = 'jitter') +
    ggtitle('Free SO2 and Quality Relationship')


This indicates that, for the same quality rating, white wine has higher free SO2 than red wine on average, across all
the quality ratings.
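
The visual impression can be cross-checked with a grouped mean. This is a minimal sketch on a made-up data frame `toy`; the column names mirror the wine data but the values are invented.

```r
# Toy stand-in (hypothetical values)
toy <- data.frame(
  type     = c("Red", "Red", "Red", "White", "White", "White"),
  quality  = c(5, 5, 6, 5, 6, 6),
  free.SO2 = c(11, 25, 15, 45, 30, 47)
)

# mean free SO2 for every combination of quality (rows) and type (columns)
m <- tapply(toy$free.SO2, list(toy$quality, toy$type), mean)
m

# TRUE if white wine is higher at every quality level present
all(m[, "White"] > m[, "Red"])
```

Running the same `tapply()` call on `wine` gives one number per quality level and type, which is easier to compare than overlapping jittered points.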

Then, we analyze the Total SO2.

In [None]:
ggplot(aes(x = quality, y = total.SO2), data = wine) + 
    geom_point(aes(color=type),alpha=1/4, position = 'jitter') +
    ggtitle('Total SO2 and Quality Relationship')

The analysis plot indicates again the existence of higher total SO2 in the White wine samples as compared to Red wine.

The relationship of total SO2 with sulphates and residual sugar is analysed below:

In [None]:
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
b1<-ggplot(aes(x=total.SO2,y=sulphates), data = wine) +
    geom_density(aes(color=type),stat='summary',fun=median)

b2<-ggplot(aes(x=total.SO2,y=RS), data = wine) +
    geom_density(aes(color=type),stat='summary',fun=median)

grid.arrange(b1,b2,ncol=2)

The observations from the above analysis are provided below:

* Sulphate:

The sulphate level is higher for red wine than for white wine, with a huge spike around 150.

For total SO2 levels around 250, the sulphate level of white wine is higher than that of red wine.

White wine has total SO2 levels higher than 280 units.

* Residual Sugar

White wine exhibits a high level of Residual sugar around 250 as compared to Red wine, and in general the quantity of Residual sugar seems to increase as total.SO2 increases. The two lines a lot.

The relationships between free.SO2 and Sulphate/Residual Sugar are analyzed as below:

In [None]:
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
c1<-ggplot(aes(x=free.SO2,y=sulphates), data = wine) +
    geom_density(aes(color=type),stat='summary',fun=median)

c2<-ggplot(aes(x=free.SO2,y=RS), data = wine) +
    geom_density(aes(color=type),stat='summary',fun=median)

grid.arrange(c1,c2,ncol=2)

Similarly, one can conclude that

* Sulphate level is quite high for the red wine as compared to white wine.
Red wine does not exhibit a Free SO2 level beyond 70 units

* Residual Sugar:
white wine exhibits a higher level of residual sugar and has peaks around 150.

A final comparison is done between red and white wine to understand the difference between the two variants 
in terms of total and free SO2 and the pH values.

In [None]:
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
s1<-ggplot(aes(x=pH,y=free.SO2), data = wine) +
    geom_line(aes(color=type),stat='summary',fun=median)

s2<-ggplot(aes(x=pH,y=total.SO2), data = wine) +
    geom_line(aes(color=type),stat='summary',fun=median)

grid.arrange(s1,s2,ncol=2)

The above plot indicates that white wine exhibits higher SO2 levels than red wine at 
similar pH values, across all pH values within the sample.
There seems to be a higher variation in both SO2 values for 
both red and white wines between pH values of 3.5 and 4.0. 
A closer look at this pH interval is given below.

In [None]:
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
t1<-ggplot(aes(x=pH,y=free.SO2), data = wine) +
    geom_line(aes(color=type),stat='summary',fun=median) +
    xlim(3.5,4.0)

t2<-ggplot(aes(x=pH,y=total.SO2), data = wine) +
    geom_line(aes(color=type),stat='summary',fun=median) +
    xlim(3.5,4.0)

grid.arrange(t1,t2,ncol=2)

The above plots indicate a high free SO2 peak for white wine (60 units) at a pH value of 3.65, and a high peak for 
red wine (41 units) at a pH value of 3.75. In the case of total SO2, white wine peaks at around 180 units at a pH level 
around 3.62, while red wine exhibits a peak of around 105 units at a pH level of 3.85. Also, it is observed that only 
red wine has pH values beyond 3.85, and the total and free SO2 levels there are low.
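
Peak positions like these can also be read off numerically instead of from the plot. The sketch below assumes a made-up data frame `toy` and mimics what the summary lines draw: the median of the y variable at each distinct pH value, restricted to the 3.5 to 4.0 window.

```r
# Toy stand-in (hypothetical values)
toy <- data.frame(
  pH       = c(3.40, 3.55, 3.55, 3.65, 3.65, 3.75, 3.90),
  free.SO2 = c(30, 20, 40, 50, 58, 41, 10)
)

# keep only the pH window of interest
in_window <- subset(toy, pH >= 3.5 & pH <= 4.0)

# median free SO2 at each distinct pH value
med <- aggregate(free.SO2 ~ pH, data = in_window, FUN = median)

# the row with the highest median is the peak
peak <- med[which.max(med$free.SO2), ]
peak
```

Applying the same steps to each wine type separately, for example on `wine[wine$type == "White", ]`, should reproduce the peak positions visible in the plots.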

In [None]:
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
ggplot(aes(x=ABV,y=free.SO2), data = wine) +
    geom_line(aes(color=type),stat='summary',fun=median) +
    ggtitle('Alcohol and Free SO2 relationship')

The above plot indicates that, for the same alcohol content, free.SO2 is higher in white wine than in 
red wine, and that free SO2 decreases quite significantly as the alcohol content increases.

In [None]:
ggplot(aes(x = pH),data = wine) + 
    geom_density(aes(fill = type)) +
    facet_wrap(~quality) +
    ggtitle('pH values relationship with Quality')

From the above plot, there doesn't seem to be any specific relation between pH values and quality in terms of the spread. However, red wine tends to exhibit a higher pH value density than white wine for quality ratings up to 7, while the quality rating of 8 has more similar density values. The quality rating of 9 exhibits a narrower spread of pH values, between 3.1 and 3.6.


## Summary

pH value is considered an important parameter when determining the quality of wine. The analysis of the samples, however, indicates that no specific pH values bias the quality ratings, and that a higher density of red wine samples showed higher pH values than white wine samples for the same quality ratings. The pH value was found to be optimal between 3.0 and 3.5. A pH value higher than 3.5 tends to go with higher SO2 values, which can be a concern for people worried about the health effects of SO2. Samples with higher alcohol content exhibited lower SO2 counts, and white wine samples exhibited a higher level of SO2 components than red wine for the same level of alcohol.

Some of the learnings from the analysis were as follows:

* The understanding that red wine generally exhibits more SO2 than white wine does not seem to hold for the samples considered. The analysis indicates that white wine exhibits a higher level of SO2.
* It always seemed that pH value was a key factor in determining the quality of wines, but from the analysis it seems that pH does not exhibit any pattern that can be used as a key deterministic variable for wine quality testing by sensory analysis.
* From the samples analyzed, the wines with higher alcohol content exhibited lower SO2 content than samples with lower alcohol content.
* For the buyer conscious of the sugar content in wines, white wine exhibits more residual sugar, and we have seen spikes in the residual sugar for certain ranges of free and total SO2, primarily with white wine.

A limitation of the current analysis is that the data consists of samples collected from a specific region of Portugal. It will be interesting to obtain datasets across various wine-making regions to eliminate any bias created by the specific qualities of the product.



## Reference

* http://rstudio-pubs-static.s3.amazonaws.com/24803_abbae17a5e154b259f6f9225da6dade0.html
* https://rpubs.com/Bilal_Mahmood/EDA