#### Hi, welcome to my project! Today we will classify wine color from its chemical features using a Decision Tree classifier, and then fit a Decision Tree regressor to predict a continuous value.
#### We will use the wine quality data set for these exercises. It contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol, along with a quality score (3-9, higher is better) and a color (red or white). The file is named Wine_Quality_Data.csv.

### Let's import all libraries we will need at the beginning of the analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load the data set and examine its features, then check for null values (and drop any rows that contain them):

In [None]:
### BEGIN SOLUTION
filepath = '../input/wine-quality/Wine_Quality_Data.csv'
data = pd.read_csv(filepath, sep=',')

In [None]:
data.shape

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.dtypes

In [None]:
data.color.value_counts()

In [None]:
data.color.value_counts(normalize=True)

In [None]:
data.isnull().sum().sum()

Our data set does not contain any null values, so we can move on to encoding the label column.
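Had the check turned up missing values, the rows containing them could be dropped with `dropna`. The snippet below is only a sketch on a made-up frame (`toy` and its values are assumptions, not part of the wine data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wine data (values are made up)
toy = pd.DataFrame({
    'alcohol': [9.4, np.nan, 10.1, 11.2],
    'pH': [3.51, 3.20, np.nan, 3.16],
    'color': ['red', 'white', 'white', 'red'],
})

# Count missing values per column
null_counts = toy.isnull().sum()

# Drop every row that contains at least one missing value,
# then rebuild a default 0..n-1 index
clean = toy.dropna().reset_index(drop=True)

print(null_counts)
print(clean.shape)
```

Resetting the index matters for the rest of this notebook: the split indices used later are positional, and they are passed to `.loc`, which only lines up with row positions when the index is the default range.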

### Let's convert our color label to an integer. This is a quick way to do it using pandas.

**White=0, Red=1** 

In [None]:
# np.int was removed from NumPy (1.24+); the built-in int does the same job
data['color'] = data['color'].replace('white',0).replace('red',1).astype(int)

In [None]:
data.iloc[:,-1]

In [None]:
data.iloc[:,-1].value_counts()
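As an alternative to chained `replace` calls, the same encoding can be expressed with `Series.map` and an explicit dictionary. This is a minimal sketch on a made-up column (`toy` and `color_codes` are hypothetical names), not the notebook's own method:

```python
import pandas as pd

# Made-up label column standing in for data['color']
toy = pd.DataFrame({'color': ['white', 'red', 'white', 'white']})

# Explicit mapping: White=0, Red=1
color_codes = {'white': 0, 'red': 1}
encoded = toy['color'].map(color_codes)

# map() turns labels missing from the dictionary into NaN,
# so unexpected values are easy to detect before casting
if encoded.isna().any():
    raise ValueError('unexpected color label found')

encoded = encoded.astype(int)
print(encoded.tolist())
```

One design point in favor of `map` is that a misspelled or unexpected label surfaces as `NaN` instead of silently staying a string.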

## Splitting our dataset:

### Now we have to split our data into train and test sets. In this project we will use StratifiedShuffleSplit and keep the split indices for later use.

In [None]:
# Compare names with != ; "x not in 'color'" would test for substrings instead
feature_cols = [x for x in data.columns if x != 'color']
feature_cols
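A note on filtering column names: applied to two strings, Python's `in` operator tests for a substring, while `!=` compares the whole name. The sketch below uses made-up column names (`'col'` is hypothetical) to show where the two differ:

```python
# Made-up column names; 'col' is a hypothetical column used only for illustration
columns = ['alcohol', 'pH', 'color', 'col']

# Substring test: 'col' is a substring of 'color', so it is dropped as well
by_substring = [c for c in columns if c not in 'color']

# Equality test: only the exact label column is dropped
by_equality = [c for c in columns if c != 'color']

print(by_substring)
print(by_equality)
```

With the wine data both forms happen to give the same list, since no other column name is a substring of `'color'`, but the equality test states the intent directly.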

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Split the data into two parts, with 1000 points in the test set
# StratifiedShuffleSplit is a splitter object; its split() method returns a generator
strat_shuff_split = StratifiedShuffleSplit(n_splits=1, test_size=1000, random_state=42)

# Get the (positional) train and test indices from the generator
train_idx, test_idx = next(strat_shuff_split.split(data[feature_cols], data['color']))

# Create the data sets
X_train = data.loc[train_idx, feature_cols]
y_train = data.loc[train_idx, 'color']

X_test = data.loc[test_idx, feature_cols]
y_test = data.loc[test_idx, 'color']

### Now check the percent composition of each color class in the train and test data sets. The data set is mostly white wine, as can be seen below.

In [None]:
y_test.value_counts(normalize=True).sort_index()

In [None]:
y_train.value_counts(normalize=True).sort_index()
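The matching class proportions above are what stratification is meant to deliver. To see the effect in isolation, here is a small sketch on synthetic data (the frame `toy`, its 25% positive rate, and the `toy_*` names are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in: 100 rows, 25% of them in class 1
rng = np.random.default_rng(0)
toy = pd.DataFrame({'feature': rng.normal(size=100),
                    'label': [1] * 25 + [0] * 75})

# One stratified split with 20 rows held out for testing
splitter = StratifiedShuffleSplit(n_splits=1, test_size=20, random_state=42)
toy_train_idx, toy_test_idx = next(splitter.split(toy[['feature']], toy['label']))

# Share of class 1 in each part; both should stay close to 0.25
train_ratio = toy.loc[toy_train_idx, 'label'].mean()
test_ratio = toy.loc[toy_test_idx, 'label'].mean()

print(train_ratio, test_ratio)
```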

# Decision Tree Classifier:
Let's define our model without setting limits on maximum depth, features, or leaves.

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt = dt.fit(X_train, y_train)

Determine how many nodes are present and the depth of this tree:

In [None]:
dt.tree_.node_count, dt.tree_.max_depth

In [None]:
dt.classes_

### Let's define a function which can compute error metrics of our model:

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def measure_error(y_true, y_pred, label):
    return pd.Series({'accuracy':accuracy_score(y_true, y_pred),
                      'precision': precision_score(y_true, y_pred),
                      'recall': recall_score(y_true, y_pred),
                      'f1': f1_score(y_true, y_pred)},
                      name=label)
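To make the four metrics concrete, they can be computed by hand on a tiny example and compared with scikit-learn. The labels below are made up; class 1 is treated as the positive class, which is the scikit-learn default for binary targets:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels: 3 positives and 5 negatives
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# Count the four outcomes directly
tp = int(((y_true_demo == 1) & (y_pred_demo == 1)).sum())  # true positives
fp = int(((y_true_demo == 0) & (y_pred_demo == 1)).sum())  # false positives
fn = int(((y_true_demo == 1) & (y_pred_demo == 0)).sum())  # false negatives
tn = int(((y_true_demo == 0) & (y_pred_demo == 0)).sum())  # true negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```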

Predict the labels for X_train and X_test with our model:

In [None]:
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

### Now let's use our function with its corresponding arguments:

In [None]:
a=measure_error(y_train, y_train_pred, 'train')
b=measure_error(y_test, y_test_pred, 'test')

In [None]:
c=pd.concat([a,b],axis=1)
c

**The decision tree predicts a little better on the training data than the test data, which is consistent with (mild) overfitting. Also notice the perfect recall score for the training data. In many instances, this prediction difference is even greater than that seen here.**
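The perfect training score is typical of a tree grown without limits: it keeps splitting until every leaf is pure. The sketch below illustrates this on synthetic, deliberately noisy data from `make_classification` (an assumption for illustration, not the wine data); the size of the train/test gap will differ from the one seen above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.1 assigns random labels to about 10% of the samples
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0, stratify=y_demo)

# No limit on depth, features, or leaves
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = full_tree.score(X_tr, y_tr)  # accuracy on data the tree has seen
test_acc = full_tree.score(X_te, y_te)   # accuracy on held-out data

print(train_acc, test_acc)
```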

### The same comparison with more meaningful variable names:

In [None]:
# The error on the training and test data sets
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

train_test_full_error = pd.concat([measure_error(y_train, y_train_pred, 'train'),
                              measure_error(y_test, y_test_pred, 'test')],
                              axis=1)

train_test_full_error
### END SOLUTION

### Let's use grid search with cross validation to find the best parameters of our decision tree. 

In [None]:
### BEGIN SOLUTION
from sklearn.model_selection import GridSearchCV

# n_features_ was removed in scikit-learn 1.2; n_features_in_ replaces it
param_grid = {'max_depth':range(1, dt.tree_.max_depth+1, 2),
              'max_features': range(1, dt.n_features_in_+1)}

GR = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid=param_grid,
                  scoring='accuracy',
                  n_jobs=-1)

GR = GR.fit(X_train, y_train)

In [None]:
GR.best_score_

In [None]:
GR.best_estimator_

The number of nodes and the maximum depth of the tree:

In [None]:
GR.best_estimator_.tree_.node_count, GR.best_estimator_.tree_.max_depth

In [None]:
GR.classes_
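For a closer look at what the grid search does, the sketch below runs the same kind of search on a small synthetic data set (an assumption for illustration) and inspects `best_params_` and the number of parameter combinations that were tried:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic classification problem with 6 features
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)

# 4 depths x 6 feature counts = 24 candidate settings
demo_grid = {'max_depth': range(1, 8, 2),
             'max_features': range(1, X_demo.shape[1] + 1)}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid=demo_grid,
                      scoring='accuracy',
                      cv=5)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_['params'])
print(search.best_params_)
print(n_candidates, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the winning setting, so it is computed on held-out folds of the training data and not on the test set.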

### Let's measure the errors on the train and test sets as before and compare them to those from the previous tree:

In [None]:
y_train_pred_gr = GR.predict(X_train)
y_test_pred_gr = GR.predict(X_test)

train_test_gr_error = pd.concat([measure_error(y_train, y_train_pred_gr, 'train'),
                                 measure_error(y_test, y_test_pred_gr, 'test')],
                                axis=1)

In [None]:
train_test_gr_error

These test metrics are slightly better than those of the unrestricted tree, which suggests that the first tree overfit the data, but only slightly.

## Confusion matrix for both training and testing datasets: 

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
cm=confusion_matrix(y_train,y_train_pred_gr, labels=GR.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=GR.classes_)
disp.plot(cmap='Blues')
plt.title('Confusion matrix for training dataset')

In [None]:
cm=confusion_matrix(y_test,y_test_pred_gr, labels=GR.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=GR.classes_)
disp.plot(cmap='Blues')
plt.title('Confusion matrix for testing dataset')
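When reading these plots, it helps to remember that scikit-learn puts the true labels on the rows and the predicted labels on the columns. A minimal sketch with made-up labels shows how the four cells can be unpacked for a binary problem:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 3 positives and 5 negatives
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

# For labels [0, 1] the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = cm_demo.ravel()

print(cm_demo)
print(tn, fp, fn, tp)
```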

# Decision Tree Regressor:

In this part of the project we will build a Decision Tree Regressor (DTR) model to predict a continuous label. Here the label is residual sugar.

In [None]:
### BEGIN SOLUTION
feature_cols = [x for x in data.columns if x != 'residual_sugar']

# Create the data sets
X_train = data.loc[train_idx, feature_cols]
y_train = data.loc[train_idx, 'residual_sugar']

X_test = data.loc[test_idx, feature_cols]
y_test = data.loc[test_idx, 'residual_sugar']

Below we can see that all of the feature columns are numeric, which makes them suitable for regression algorithms.

In [None]:
data[feature_cols].dtypes

In [None]:
from sklearn.tree import DecisionTreeRegressor

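# Fit an unrestricted tree first to learn the maximum depth to search up to
dr = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# n_features_ was removed in scikit-learn 1.2; n_features_in_ replaces it
param_grid = {'max_depth':range(1, dr.tree_.max_depth+1, 2),
              'max_features': range(1, dr.n_features_in_+1)}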

GR_sugar = GridSearchCV(DecisionTreeRegressor(random_state=42),
                     param_grid=param_grid,
                     scoring='neg_mean_squared_error',
                      n_jobs=-1)

GR_sugar = GR_sugar.fit(X_train, y_train)

Let's see the best estimator found by GridSearchCV:

In [None]:
GR_sugar.best_estimator_

The number of nodes and the maximum depth of the tree. This tree has lots of nodes, which is not so surprising given the continuous data.

In [None]:
GR_sugar.best_estimator_.tree_.node_count, GR_sugar.best_estimator_.tree_.max_depth
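One detail of the `'neg_mean_squared_error'` scoring is easy to miss: grid search always maximizes its score, so scikit-learn reports the negated MSE. The sketch below, on synthetic data from `make_regression` (an assumption for illustration), shows how to recover the cross-validated MSE from `best_score_`:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Small synthetic regression problem
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

search = GridSearchCV(DecisionTreeRegressor(random_state=42),
                      param_grid={'max_depth': [2, 4, 6]},
                      scoring='neg_mean_squared_error',
                      cv=5)
search.fit(X_demo, y_demo)

# best_score_ is the negated MSE, so flip the sign to get the MSE itself
best_cv_mse = -search.best_score_

print(search.best_params_, best_cv_mse)
```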

### Let's compute error metrics on the train and test data sets. Because the target is continuous, we will use the mean squared error and the coefficient of determination (R2 score):

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

y_train_pred_gr_sugar = GR_sugar.predict(X_train)
y_test_pred_gr_sugar  = GR_sugar.predict(X_test)

train_test_gr_sugar_error = pd.Series({'train': mean_squared_error(y_train, y_train_pred_gr_sugar),
                                         'test':  mean_squared_error(y_test, y_test_pred_gr_sugar)},
                                          name='MSE').to_frame().T

train_test_gr_sugar_r2 = pd.Series({'train': r2_score(y_train, y_train_pred_gr_sugar),
                                         'test':  r2_score(y_test, y_test_pred_gr_sugar)},
                                          name='R2 score').to_frame().T

pd.concat([train_test_gr_sugar_error, train_test_gr_sugar_r2])
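Both metrics are simple to compute by hand, which helps when interpreting the table above. The following sketch uses four made-up values and checks the manual results against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values
y_actual = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 5.0])

# MSE: mean of the squared residuals
ss_res = np.sum((y_actual - y_hat) ** 2)
mse = ss_res / len(y_actual)

# R2: 1 minus the ratio of the residual sum of squares to the total sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2)
```

An R2 of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible when the model does worse than that.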

## Plotting of actual vs predicted residual sugar:

We could create a new dataframe with the actual test labels and the predictions as columns, then set the first one as the index and use the pandas plot method:

In [None]:
ph_test_predict = pd.DataFrame({'test':y_test.values,
                                'predict': y_test_pred_gr_sugar}).set_index('test').sort_index()
ph_test_predict

In [None]:
sns.set_context('notebook')
sns.set_style('white')
fig = plt.figure(figsize=(6,6))
ax = plt.axes()

ph_test_predict = pd.DataFrame({'test':y_test.values,
                                'predict': y_test_pred_gr_sugar}).set_index('test').sort_index()

ph_test_predict.plot(marker='o', ls='', ax=ax)
ax.set(xlabel='Test', ylabel='Predict', xlim=(0,35), ylim=(0,35));

Or we could just use the scatter plot from matplotlib:

In [None]:
fig = plt.figure(figsize=(6,6))
ax = plt.axes()
ax.scatter(y_test,y_test_pred_gr_sugar)

ax.set(xlabel='Test', ylabel='Predict', xlim=(0,35), ylim=(0,35));
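A diagonal reference line can make plots like these easier to read, since points on the line are perfect predictions. The sketch below uses synthetic values (`actual_demo` and `predicted_demo` are made up) so that it runs on its own:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic "actual" values and noisy "predictions" standing in for the real ones
rng = np.random.default_rng(0)
actual_demo = rng.uniform(0, 30, size=50)
predicted_demo = actual_demo + rng.normal(scale=3, size=50)

fig_demo, ax_demo = plt.subplots(figsize=(6, 6))
ax_demo.scatter(actual_demo, predicted_demo, alpha=0.6)

# y = x line: points above it are over-predictions, points below are under-predictions
ax_demo.plot([0, 35], [0, 35], color='gray', ls='--', label='perfect prediction')

ax_demo.set(xlabel='Test', ylabel='Predict', xlim=(0, 35), ylim=(0, 35))
ax_demo.legend()
```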

### In order to display the decision trees we built, we require an additional command line program (GraphViz) and Python library (PyDotPlus). GraphViz can be installed with a package manager on Linux and Mac. For PyDotPlus, either pip or conda (conda install -c conda-forge pydotplus) can be used to install the library.

# Displaying decision trees: 

### First decision tree, where wine color was predicted and the number of features and/or splits are not limited.
### Last decision tree, where wine color was predicted but a grid search was used to find the optimal depth and number of features.

In [None]:
!conda install -c conda-forge pydotplus -y

In [None]:
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

In [None]:
### BEGIN SOLUTION
# Create an output destination for the file
dot_data = StringIO()

export_graphviz(dt, out_file=dot_data, filled=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

# View the tree image
filename = 'wine_tree.png'
graph.write_png(filename)
Image(filename=filename) 

In [None]:
# Create an output destination for the file
dot_data = StringIO()

export_graphviz(GR.best_estimator_, out_file=dot_data, filled=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

# View the tree image
filename = 'wine_tree_prune.png'
graph.write_png(filename)
Image(filename=filename) 
### END SOLUTION
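If GraphViz or PyDotPlus is not available, scikit-learn's own `export_text` offers a dependency-free way to inspect a tree as plain text. This is a swapped-in alternative to the approach above, sketched on the bundled iris data set with a shallow tree (both are assumptions chosen to keep the output short):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Bundled example data; used here only to keep the sketch self-contained
iris = load_iris()

# A shallow tree so the printed rules stay readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(iris.data, iris.target)

# Render the split rules as indented text
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

The same call should work for the trees fitted in this notebook by passing, for example, the fitted classifier and the list of feature columns it was trained on.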