# Table of Contents

* [Introduction](#introduction)
* [Data at a glance](#data-at-a-glance)
* [Input Data](#read-data)
* [Data Cleaning](#data-cleaning)
* [Data Preprocessing](#data-preprocessing)
* [XGBoost](#xgboost)
    - [XGBClassifier](#XGBClassifier)
    - [XGBRegressor](#XGBRegressor)


<a id="introduction"></a>
# Introduction

![picture resource: https://mountaintoplodge.com/blog/wine-tastings-poconos/](https://mountaintoplodge.com/wp-content/uploads/2018/09/wine-tasting-1500x609.jpg)
(taken from https://mountaintoplodge.com/)

**goal**: The objective of this notebook is to model wine quality with XGBoost and random forest, and to evaluate the results with the corresponding metrics, such as the r2 score and the ROC curve with its AUC.

**about dataset**
The dataset is related to red and white variants of the Portuguese wine. We consider a wine good if its quality is greater than 6.5.
* **fixed acidity** refers to most of the acids involved with wine
* **volatile acidity** shows the amount of acetic acid in wine, which at <mark>too high levels can lead to an unpleasant, vinegar taste</mark>
* **citric acid** found in <mark>small quantities, citric acid can add 'freshness'</mark> and flavor to wines
* **residual sugar** is the amount of sugar remaining after fermentation stops; it's <mark>rare to find wines with less than 1 gram/liter</mark>
* **chlorides** means the amount of salt in the wine
* **free sulfur dioxide** is the free form of SO2, which exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion
* **total sulfur dioxide** is the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine
* **density** is the density of the wine, which is close to that of water, depending on the percent alcohol
* **pH** describes how acidic or basic a wine is on a <mark>scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 </mark>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# for correlation matrix
from pandas.plotting import scatter_matrix
from matplotlib import cm

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# hyperparameter search with k-fold cross-validation
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import confusion_matrix, classification_report, auc, roc_curve, r2_score
from collections import OrderedDict
from sklearn import metrics
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
"""
<a id='data-at-a-glance'></a>
# Data at a glance
In the following table, we give a peek at the data to have a general understanding of the data. 
* All of the features are numeric and the dependent variable( wine quality ) is 
"""

In [None]:
"""
<a id='XGBoost-review' ></a>
# A Brief Review on XGBoost
XGBoost stands for eXtreme Gradient Boosting and was first proposed by [Tianqi Chen](https://arxiv.org/abs/1603.02754). 
XGBoost is a model for supervised learning problems, where we predict a target variable from features.

### Boosting Algorithm 
Boosting is an ensemble learning technique that combines weak (base) learners into a strong learner. Though each base learner tends to have limited power, combining them yields a more accurate model. *We can consider XGBoost a special case of Newton boosting, which can be interpreted as Newton's method in function space: a numerical optimization algorithm that works on functions rather than on parameters.*

Given a finite training set $\{(x_i, y_i)\}_{i=1}^{n}$, we hope to minimize the objective function:
$$\hat{\Phi}(F)=\sum^{n}_{i=1}L(y_i,F(x_i))$$

### Numerical Optimization in Function Space

Assume the true risk $R(f)$ is known to us, and that we minimize it iteratively in function space. At each iteration, we are minimizing
$$R(f(x))=E[L(Y,f(x))|X=x] , \forall x \in \mathcal{X}$$

#### Construct $F(x)$
In practice, we view boosting algorithms as numerical optimization in function space. A boosting algorithm implements a sequential process: it generates weak learners and combines them into a strong learner. We can consider the weak learners as a set of basis functions. The algorithm sequentially adds base models to improve the fit.

$$f^{(m)}(x)=f^{(m-1)}(x)+f_m(x)$$
$f^{(m-1)}$ is the current estimate. We take the "step" $f_m$ in function space and get $f^{(m)}$. Therefore, we can view $F(x)$ as the combination of the initial guess and all successive "steps" taken previously in function space.

$$F(x)\equiv f^{(M)}(x)\equiv \sum_{m=0}^{M}f_m(x)$$

where $F(x)$ is the prediction, $M$ is the number of weak learners and $f_0(x)$ is the initial guess.

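In forward stagewise additive modeling (FSAM), each step after the initial guess is a base learner $\hat{c}_m$ scaled by a weight $\hat{\theta}_m$:
$$\hat{f}(x)=\hat{f}^{(M)}(x)=\hat{f}_0(x)+\sum_{m=1}^{M}\hat{\theta}_m \hat{c}_m(x)$$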
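
The additive update $f^{(m)}(x)=f^{(m-1)}(x)+f_m(x)$ can be sketched in a few lines. This is only an illustration, not XGBoost itself: it assumes squared loss, uses one-split regression stumps as weak learners, and runs on synthetic data. The helper `fit_stump`, the learning rate of 0.5 and the 50 rounds are all choices made for this example.

```python
import numpy as np

def fit_stump(x, residual):
    # Weak learner: the single split of x that best fits the residuals (least squares).
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda z: np.where(z <= threshold, left_value, right_value)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200))
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5                        # shrinkage on every step (illustrative value)
prediction = np.full_like(y, y.mean())     # f_0: the initial guess
mse_history = [np.mean((y - prediction) ** 2)]

for m in range(50):
    residual = y - prediction              # negative gradient of the loss 0.5 * (y - f)^2
    step = fit_stump(x, residual)          # f_m: the weak learner for this round
    prediction = prediction + learning_rate * step(x)   # f^(m) = f^(m-1) + shrunken f_m
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE after f_0: {mse_history[0]:.4f}, after 50 rounds: {mse_history[-1]:.4f}")
```

Each round fits the weak learner to what the current estimate still gets wrong, so the training error cannot increase from one round to the next.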



### System Features:
* Parallelization : The parallelization happens during tree construction and enables distributed training and prediction across nodes.
* Cache Optimization : XGBoost keeps all of the immediate calculation 
* Distributed Computing
* Out-of-Core Computing

### Model Features
* Gradient Boosting
* Stochastic Gradient Boosting
* Regularized Gradient Boosting

### Algorithm Features
* Sparse Aware
* Block Structure
* Continued Training

### CART algorithm 
"""

<a id="read-data"></a>
### Read the data 

In [None]:
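# dirname and filename still hold the last file printed by the os.walk loop above,
# so this assumes the input directory contains a single CSV file
df = pd.read_csv(os.path.join(dirname, filename))
df.head()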

After reading the data, we want to make sure it was loaded successfully. Running the `read_csv` function from the pandas library on its own displays nothing.

So we have to find a way to see the data. <mark>Instead of showing all the data, we can show only the first few rows with the `head` function.</mark>

<a id="data-cleaning"></a>
## Data Cleaning
In data cleaning we go through the following steps to make sure our data is concise and neat.
1. Dropping unnecessary columns in a DataFrame: if we don't eliminate the unnecessary columns, we add redundant parameters to the model, which makes it unnecessarily complex.
2. Detecting missing values
3. Fixing expected types
4. Detecting outliers
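
The four steps above can be sketched on a toy DataFrame. Nothing here comes from the wine data: the `row_id` column, the values and the plausible pH range of 2.5 to 4.5 are all made up for illustration.

```python
import pandas as pd

# Toy frame, not the wine data: 'row_id' is a hypothetical unnecessary column.
toy = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "alcohol": ["9.4", "9.8", None, "10.5"],   # numbers stored as text, one missing
    "pH": [3.51, 3.20, 3.26, 13.9],            # 13.9 is implausible for wine
})

# 1. Drop columns that carry no information for the model.
toy = toy.drop(columns=["row_id"])

# 2. Count missing values per column.
missing_per_column = toy.isna().sum()

# 3. Fix the expected types: text -> float (unparseable entries become NaN).
toy["alcohol"] = pd.to_numeric(toy["alcohol"], errors="coerce")

# 4. Flag values outside a plausible range (the range itself is an assumption).
ph_outliers = toy[(toy["pH"] < 2.5) | (toy["pH"] > 4.5)]

print(missing_per_column)
print(ph_outliers)
```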

In [None]:
df.isna().sum()

<a id="outliner"></a>
### Outlier
There is no missing value in this data. This is good news, because it means that our dataset is clean. However, we cannot relax our guard yet, because <mark>there is still a possibility that the recorder entered a wrong value. In other words, we are going to check for outliers.</mark>

**We can do so by checking whether the maximum and minimum of every variable are in the normal range. Another possible way is to look at a scatter plot.** If a point behaves significantly differently from the others, then it may be an outlier.

In regression models, outliers can sometimes alter the result. Decision trees, however, are robust to outliers.
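
Besides reading the minimum and maximum by eye, a range check can be automated. The sketch below uses the interquartile-range (IQR) rule, a common convention that this notebook does not otherwise use. The function name `iqr_outlier_mask`, the factor `k=1.5` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def iqr_outlier_mask(frame, k=1.5):
    # True where a value lies more than k * IQR outside the quartiles of its column.
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return (frame < q1 - k * iqr) | (frame > q3 + k * iqr)

# Made-up values; the last one is far away from the rest.
toy = pd.DataFrame({"residual sugar": [1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.6, 15.5]})
mask = iqr_outlier_mask(toy)
print(mask.sum())   # number of flagged values per column
```

On skewed features such as the ones in this dataset, the rule tends to flag many legitimate values, so a flagged point is a candidate for inspection rather than an automatic deletion.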

In [None]:
df.describe()

From the table above, it is not hard to see that all of the variables are in the normal range. For example, it is rare to observe wines with residual sugar below 1 gram/liter, and our minimum residual sugar is 0.9 gram/liter, which is close to 1 gram/liter.

So we can say that, at first glance, our data set may not have outliers!

Woo-hoo!

<a id="histogram"></a>
## Density Plot and Histogram
We hope to discover the underlying frequency distribution of every variable through histograms and density plots. Besides the underlying distribution, histograms also let us double-check our conclusion on outliers. Note that since all of our variables are continuous, we use histograms, which plot the frequency of values in a continuous data set that has been divided into bins. If our data were discrete, we would choose bar charts instead.

In [None]:

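# One panel per column; the 3 x 4 grid assumes 12 columns (11 features + quality)
columns = df.columns
fig, axes = plt.subplots(3, 4, figsize=(20, 25))  # rows, columns
for i in range(3):
    for j in range(4):
        # histplot with kde=True replaces the deprecated sns.distplot
        sns.histplot(df[columns[4 * i + j]], kde=True, stat="density", ax=axes[i, j])

plt.show()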



The density plot draws a smooth line around the histogram and estimates the underlying distribution of the variable. From the histograms and density plots above, we can see that most of the features follow a right-skewed distribution.
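
Skewness can also be read as a number instead of from the plot. A minimal sketch with pandas' `skew` on synthetic columns (the column names and distributions are made up): values near 0 indicate a symmetric distribution, and clearly positive values indicate a right skew.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=10, scale=1, size=2000),
    "right_skewed": rng.lognormal(mean=0, sigma=1, size=2000),
})

skewness = toy.skew()   # sample skewness of every column
print(skewness)
```

The same call on the wine DataFrame, `df.skew()`, would give one value per feature.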

Fortunately, **tree models are insensitive to skewed data**. <mark>Since tree models are non-parametric methods, which make no assumption about the population distribution, they are more robust to outliers and nonlinear relationships.</mark>

Decision trees try to find the variable and split point that maximize the difference between the two branches, so the data distribution doesn't play an important role.
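
One way to see this is that a tree only uses the order of the values of a feature, so a monotone transform such as the logarithm leaves the splits of the training samples unchanged. The sketch below checks this on synthetic data with scikit-learn's `DecisionTreeClassifier`, used here as a stand-in for the tree models in this notebook. The feature, the labels and the 10% label noise are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x_raw = np.exp(np.linspace(-2.0, 3.0, 300)).reshape(-1, 1)   # right-skewed feature
labels = (x_raw[:, 0] > 2.0).astype(int)
noisy = rng.random(300) < 0.1
labels = np.where(noisy, 1 - labels, labels)                  # flip about 10% of the labels

x_log = np.log(x_raw)   # monotone transform that removes the skew

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(x_raw, labels)
tree_log = DecisionTreeClassifier(max_depth=3, random_state=0).fit(x_log, labels)

# Compare the predictions of both trees on the training samples.
same_predictions = np.array_equal(tree_raw.predict(x_raw), tree_log.predict(x_log))
print(same_predictions)
```

The comparison is made on the training samples only. Between two training values the two trees place their thresholds at slightly different points, so predictions for unseen values can differ.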

<a id="data-preprocessing"></a>
## Data Preprocessing
From the dataset description, we know that good wine is defined as having quality greater than 6.5. Because quality takes integer values, `>= 6.5` and `> 6.5` select the same wines. Since ">=" (greater than or equal to) is a comparison operator, the result is boolean. <mark>We want our response variable to be an integer, so we have to change the data type after the comparison.</mark>

In [None]:
df['quality'] = df['quality'].astype(int)  # np.int was removed in NumPy 1.24; the builtin int works
df['good_bad_wine'] = df['quality'] >= 6.5  # boolean result of the comparison
df['good_bad_wine'] = df['good_bad_wine'].astype(int)  # 1 = good, 0 = bad
df.head()


Again, after the preprocessing, we want to know if we are in the right direction and have a look at the dataset. We use the head function to check the dataset. 

<a id="scatter-plot"></a>
## Scatter Plot : Relationship between variables
We take a look at the raw data with scatter plots, in which we can observe the relationship between two variables. In regression analysis, we care about multicollinearity: if the correlation between independent variables is high, it can cause problems when interpreting the results.

A regression coefficient can be interpreted as the mean change in the dependent variable for each 1 unit change in an independent variable when the other independent variables are held constant. The stronger the correlation, the harder it is to change one variable without changing another, because the variables change in unison.

Fortunately, **decision trees and boosted tree algorithms are immune to multicollinearity**. <mark>When a tree decides on a split among perfectly correlated features, it chooses only one of them.</mark>
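
For the regression setting described above, a common way to quantify multicollinearity is the variance inflation factor, $VIF_j = 1/(1-R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on all other features. The notebook itself does not compute it. The sketch below is a plain NumPy version on synthetic columns, and the function name `variance_inflation_factors` is chosen for this example.

```python
import numpy as np

def variance_inflation_factors(features):
    # VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns.
    n_rows, n_cols = features.shape
    vifs = []
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])   # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of a
c = rng.normal(size=500)                  # unrelated to a and b
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs.round(1))
```

The two nearly identical columns get a large factor, while the unrelated column stays close to 1.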

In [None]:
def hide_current_axis(*args, **kwds):
    # Blank out a panel; used for the redundant upper triangle of the grid
    plt.gca().set_visible(False)


# PairGrid creates its own figure; colour each point by its good/bad label
grid = sns.PairGrid(df, hue='good_bad_wine')
grid.map_lower(sns.scatterplot, alpha=0.3, edgecolor='none')
grid.map_diag(plt.hist)
grid.map_upper(hide_current_axis)



We give each point a distinct hue to show its membership and distinguish its wine quality.

The graph shows that good wine has an obvious pattern in some independent variables, even if cheap plonk doesn't show any pattern in them. For example, good wine tends to have low volatile acidity, slightly higher sulphates than average wine, low total sulfur dioxide and high alcohol.


<a id="correlation-matrix"></a>
### correlation matrix
We use a correlation matrix to show the correlation coefficients between variables. We are using a matrix of Pearson-type correlations. These correlations can be influenced by outliers, unequal variances, non-normality and nonlinearities.
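
The sensitivity to outliers is easy to reproduce. In the sketch below (made-up numbers, not the wine data) a single corrupted value flips the sign of the Pearson coefficient, while a rank-based (Spearman) coefficient, computed here as the Pearson correlation of the ranks, stays positive.

```python
import pandas as pd

x = pd.Series(range(1, 11), dtype=float)
y = x.copy()
y.iloc[-1] = -100.0                   # a single corrupted measurement

pearson = x.corr(y)                   # Pearson is the pandas default
spearman = x.rank().corr(y.rank())    # Spearman = Pearson on the ranks

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

For the wine data, `df.corr(method='spearman')` would give the rank-based matrix for comparison with the Pearson heatmap.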

In [None]:
plt.figure(figsize = (10,10))
sns.heatmap(df.corr(),annot = True,cmap='PuBuGn')
plt.show()

From the correlation graph, we observe the following:
1. The quality of wine is somewhat related to volatile acidity. The higher the wine quality, the lower the volatile acidity.
2. Citric acid, sulphates and alcohol are also related to wine quality. Alcohol is the most strongly related, while sulphates and citric acid are less so. The higher the wine quality, the higher the citric acid, sulphates and alcohol.
3. Fixed acidity, residual sugar, free sulfur dioxide and pH don't show an apparent pattern across different quality levels.

In the classification models below, we only consider wine as good or bad, using the `good_bad_wine` label created in the preprocessing step: wine with quality of at least 6.5 is classified as 'good'. Otherwise, the wine is considered bad.

<a id="xgboost"></a>
## XGBoost
As we know, XGBoost provides a classifier and a regressor. The classifier deals with discrete targets, and the regressor deals with continuous outcomes. In this project, we will first train the data with <mark>XGBClassifier and use the ROC curve and AUC score to evaluate the outcome</mark>. We then use <mark>XGBRegressor to solve the problem again, and evaluate the model with the r2 score.</mark>

<a id="XGBClassifier"></a>
### XGBClassifier
We first initialize XGBoost, and use GridSearchCV with a pipeline to find the best model.

[GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.
In this notebook the pipeline is passed to GridSearchCV as its estimator. When `refit=True` (the default), GridSearchCV refits the pipeline with the best-scoring parameter combination on the whole data that is passed to `fit()`.
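
The pattern can be tried on a small scale before running the full search. The sketch below is not the notebook's model: it assumes synthetic data from `make_classification` and swaps in scikit-learn's `DecisionTreeClassifier` for XGBClassifier so that it runs without xgboost. The step names `pca` and `tree` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features and the good/bad label.
features, target = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

pipeline = Pipeline([("pca", PCA()), ("tree", DecisionTreeClassifier(random_state=0))])
param_grid = {
    "pca__n_components": [4, None],   # <step name>__<parameter name>
    "tree__max_depth": [2, 4],
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=3, scoring="roc_auc", refit=True)
search.fit(features, target)

print(search.best_params_)
# Because refit=True, predict_proba uses the best pipeline refitted on all the data.
probabilities = search.predict_proba(features)[:, 1]
```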



In [None]:
x = df[df.columns[0:11]].values
y = df['good_bad_wine'].values.astype(int)
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.33, random_state=42)
print('X train size: ', x_train.shape)
print('X test size: ', x_test.shape)
print('y train size: ', y_train.shape)
print('y test size: ', y_test.shape)

Among the parameters, `gamma` specifies the minimum loss reduction required to make a split. `reg_alpha` and `reg_lambda` are regularization parameters (L1 and L2 penalties on the leaf weights). In `tuned_parameters`, each key starts with the name of the pipeline step, followed by two underscores and the parameter name.
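
How `gamma` and `reg_lambda` act on a single split can be sketched with the split criterion from the XGBoost paper linked in the review above. With $G$ and $H$ the sums of gradients and Hessians in a node, the gain of a split is $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$. The code is a simplified illustration, not the library's implementation: it leaves out `reg_alpha`, assumes the loss $\frac{1}{2}(y-\hat{y})^2$, and uses made-up labels. The function name `split_gain` is chosen for this example.

```python
import numpy as np

def split_gain(grad_left, hess_left, grad_right, hess_right, reg_lambda, gamma):
    # Loss reduction of a split; the split is only worth making if the gain is positive.
    def score(grad_sum, hess_sum):
        return grad_sum ** 2 / (hess_sum + reg_lambda)
    parent = score(grad_left + grad_right, hess_left + hess_right)
    return 0.5 * (score(grad_left, hess_left) + score(grad_right, hess_right) - parent) - gamma

# Loss 0.5 * (y - prediction)^2: gradient = prediction - y, Hessian = 1 for every sample.
labels = np.array([5.0, 5.0, 6.0, 7.0, 7.0, 8.0])   # made-up quality scores
prediction = np.full_like(labels, labels.mean())
grad = prediction - labels
hess = np.ones_like(labels)

left, right = slice(0, 3), slice(3, 6)               # candidate split: first three vs last three
sums = (grad[left].sum(), hess[left].sum(), grad[right].sum(), hess[right].sum())

gain_no_penalty = split_gain(*sums, reg_lambda=0.0, gamma=0.0)
gain_penalised = split_gain(*sums, reg_lambda=1.0, gamma=0.5)
print(gain_no_penalty, gain_penalised)
```

A larger `reg_lambda` shrinks every score and a larger `gamma` is subtracted from the gain directly, so both make the trees more conservative.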

In [None]:
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

my_pipeline = Pipeline([('pca', PCA()),  ('xgbrg', XGBClassifier(random_state = 2))])
tuned_parameters = {'pca__n_components': [5, 10,None],'xgbrg__gamma':[0,0.5,1],'xgbrg__reg_alpha':[0,0.5,1], 'xgbrg__reg_lambda':[0,0.5,1],"xgbrg__learning_rate": [0.1, 0.5, 1]}

xgb = GridSearchCV(my_pipeline, cv=5,param_grid = tuned_parameters,  scoring='roc_auc')
xgb.fit(x_train, y_train)
print('The best model is: ', xgb.best_params_)


The GridSearchCV results tell us that it is better not to perform PCA in preprocessing. The best model has gamma 0.5, learning rate 0.5 and reg
The following is the ROC curve on the default model. 

In [None]:
prediction = xgb.predict_proba(x_test)[:,1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,prediction)

plt.subplots(figsize=(6,6))
plt.plot(false_positive_rate, true_positive_rate)
plt.xlim(0,1)
plt.ylim(0,1)
plt.xlabel('False positive rate', fontsize = 14)
plt.ylabel('True positive rate', fontsize = 14)
plt.show()

A ROC curve shows the performance of a classification model across all classification thresholds. It plots the true positive rate against the false positive rate. Recall that the True Positive Rate (TPR) and False Positive Rate (FPR) are defined as follows:

$$TPR = \frac{TP}{TP+FN}$$
$$FPR = \frac{FP}{FP+TN}$$

where TP, FN, FP and TN are the numbers of true positives, false negatives, false positives and true negatives. Lowering the classification threshold classifies more items as positive, which increases both FP and TP.
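
The effect of the threshold can be checked by hand on a tiny example. The labels and scores below are made up, and `tpr_fpr` is a helper written for this sketch, not a scikit-learn function.

```python
import numpy as np

def tpr_fpr(labels, scores, threshold):
    # Classify as positive when the score reaches the threshold, then count the outcomes.
    predicted_positive = scores >= threshold
    tp = np.sum(predicted_positive & (labels == 1))
    fp = np.sum(predicted_positive & (labels == 0))
    fn = np.sum(~predicted_positive & (labels == 1))
    tn = np.sum(~predicted_positive & (labels == 0))
    return tp / (tp + fn), fp / (fp + tn)

labels = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])   # made-up predicted probabilities

for threshold in (0.8, 0.5, 0.2):
    tpr, fpr = tpr_fpr(labels, scores, threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Each threshold gives one point of the ROC curve, and `roc_curve` from scikit-learn sweeps the thresholds for us.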


In [None]:
print("AUC is: ", auc(false_positive_rate, true_positive_rate))

We first fit the training data with the default model, and get an AUC score of 0.902.

<a id="XGBRegressor"></a>
### XGBRegressor


In [None]:
from xgboost import XGBRegressor
x = df[df.columns[0:11]].values
y = df['quality'].values.astype(int)
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.33, random_state=42)
print('X train size: ', x_train.shape)
print('X test size: ', x_test.shape)
print('y train size: ', y_train.shape)
print('y test size: ', y_test.shape)


In [None]:
reg = XGBRegressor(random_state = 2)
reg.fit(x_train, y_train)

In [None]:
tuned_parameters = {'gamma':[0,0.5,1], 'reg_lambda': [1,5,10], 'reg_alpha':[0,1,5], 'subsample': [0.6,0.8,1.0]}

reg = GridSearchCV(XGBRegressor(random_state = 2, n_estimators=500), tuned_parameters, cv=5, scoring='r2')
reg.fit(x_train, y_train)

In [None]:
print('The best model is: ', reg.best_params_)

In [None]:
"""
prediction = reg.predict(x_test)
plt.plot(y_test, prediction, linestyle='', marker='o')
plt.xlabel('true values', fontsize = 16)
plt.ylabel('predicted values', fontsize = 16)
plt.show()

"""

In [None]:
prediction = reg.predict(x_test)
print('The r2_score on the test set is: ',r2_score(y_test, prediction))
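
The r2 score printed above compares the squared error of the model with the squared error of always predicting the mean: $R^2 = 1 - \frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2}$. A minimal NumPy sketch with made-up quality values (the helper `r2` is written for this example):

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left by the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

quality = np.array([5.0, 6.0, 5.0, 7.0, 6.0])        # made-up true values
good_guess = np.array([5.2, 5.9, 5.1, 6.6, 6.1])     # made-up predictions
mean_guess = np.full_like(quality, quality.mean())

print(r2(quality, good_guess), r2(quality, mean_guess))
```

A score of 1 means perfect predictions, 0 means no better than the mean, and negative values mean worse than the mean.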


<a id="random-forest"></a>
## Random Forest


Random forest draws several bootstrap samples from the training data and trains a decision tree on each of them. The final prediction is the result of a vote over the individual predictions of all trees. By voting, that is, averaging the prediction results, random forest mitigates the tendency of a single decision tree to overfit.
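
The idea can be written out by hand with plain decision trees. This is a simplified sketch on synthetic data, not a replacement for `RandomForestClassifier`: the 25 trees and the data from `make_classification` are choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

features, target = make_classification(n_samples=400, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for seed in range(25):
    # Bootstrap sample: draw as many rows as the data has, with replacement.
    rows = rng.integers(0, len(features), size=len(features))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(features[rows], target[rows]))

votes = np.stack([tree.predict(features) for tree in trees])   # shape (25, 400)
majority = (votes.mean(axis=0) > 0.5).astype(int)              # majority vote per sample
accuracy = (majority == target).mean()
print(f"training accuracy of the vote: {accuracy:.3f}")
```

scikit-learn's own implementation combines the trees by averaging their predicted class probabilities rather than counting hard votes, which usually gives very similar results.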

Here we split the data into training and testing sets. As before, the test set is 33% of the total data, and we use the same random state as in the previous model. We set the random state because we want to get the same result every time we run the program.

In [None]:
x = df[df.columns[0:11]].values
y = df['good_bad_wine'].values.astype(int)
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.33, random_state=42)



In scikit-learn, RandomForestClassifier exposes most of the tree-growing parameters of DecisionTreeClassifier, for example `max_depth`, `max_features` and `min_samples_leaf`.

In [None]:
"""
tuned_parameters = {'n_estimators':[500], 'max_depth': [2,3,5,7], 'max_features': [0.5,0.7,0.9],'n_jobs':[-1],'min_samples_leaf':[1,5,10]} 
#,'random_state':[14]
clf = GridSearchCV(RandomForestClassifier(), tuned_parameters, cv=5, scoring='roc_auc')
clf.fit(x_train, y_train)
"""

In [None]:
"""
# Generate the "OOB error rate" vs. "n_estimators" plot.

for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()
"""

The random forest classifier with `max_features='log2'` performs best. The model performs best when the number of estimators is around 150 to 250.

In [None]:
rf = RandomForestClassifier(n_estimators = 100,oob_score = True)

max_features = ['sqrt','log2',None]
n_estimators = [int(x) for x in np.linspace(start = 100,stop = 250,num = 16 )]

param_grid = {'max_features':max_features,'n_estimators' : n_estimators }
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(x_train,y_train)



##### Reference
* https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
* https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/