<a href="https://colab.research.google.com/github/thuc-github/MIS710-T12023/blob/main/Week%205/MIS701_Lab_5_DT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **MIS710 Lecture 5**

**Introduction to Decision Tree and KNN**
Author: Associate Professor Lemai Nguyen

Objective:
**Predict golf playing**
Predict whether a player is likely to play golf based on weather conditions.

**Context**: A golf club needs to predict whether a golfer will come to play, based on their play history and weather conditions.

**Data**: 
Outlook = The outlook of the weather

Temperature = The temperature of the weather

Humidity = The humidity of the weather

Windy = Whether it is windy that day or not

Play = The label: whether the golfer played golf that day or not

**Source**: Kotu and Deshpande, 2019, chapter 4
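
To see how a decision tree might score a split on a variable such as Outlook, the sketch below computes entropy and information gain by hand. The six rows are made up for illustration and are not the course dataset; the helper name `entropy` is also an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows for illustration only (NOT the Kotu and Deshpande data)
toy = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'overcast', 'rain', 'rain'],
    'Play':    ['no',    'no',    'yes',      'yes',      'yes',  'no'],
})

def entropy(labels):
    """Shannon entropy (in bits) of a column of class labels."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

# Entropy of the label before any split
parent_entropy = entropy(toy['Play'])

# Weighted average entropy of the subsets created by splitting on Outlook
weights = toy['Outlook'].value_counts(normalize=True)
child_entropy = sum(
    weights[value] * entropy(toy.loc[toy['Outlook'] == value, 'Play'])
    for value in weights.index
)

# Information gain = reduction in entropy achieved by the split
information_gain = parent_entropy - child_entropy
print(f'Parent entropy: {parent_entropy:.3f}')
print(f'Information gain from Outlook: {information_gain:.3f}')
```

A tree built with `criterion="entropy"` prefers the split with the largest information gain at each node.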

**Loading Libraries and Functions**

Read about Logistic Regression at:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Train Test Split:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split

Classification metrics:
https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics

In [None]:
!pip install pydotplus #interface for graph visualisation
!pip install graphviz #for graph visualisation

In [None]:
# load libraries
import pandas as pd #for data manipulation and analysis
import numpy as np
 
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.tree import DecisionTreeRegressor # Import Decision Tree Regressor

from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for model evaluation

#print confusion matrix and evaluation report
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import ConfusionMatrixDisplay

# **Biopsy**

Let's try another dataset we are familiar with from Week 4.

## **Loading data**

In [None]:
url='https://raw.githubusercontent.com/VanLan0/MIS710-ML/main/Datasets/biopsy_ln.csv'

In [None]:
# load dataset from the url defined in the previous cell
#records = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/MIS710/biopsy_ln.csv")
records = pd.read_csv(url)


## **Inspecting and preparing data**

In [None]:
#explore the dataset
print(records)

In [None]:
records.info()

In [None]:
#Inspect missing data
records.isnull().sum()

In [None]:
#drop irrelevant variables
records=records.drop(['ID'], axis=1)

In [None]:
#convert the categorical diagnosis to a numerical label (1 = cancerous, 0 = healthy)
def coding_diagnosis(x):
    if x=='cancerous': return 1
    if x=='healthy': return 0

records['Diagnosis'] = records['diagnosis'].apply(coding_diagnosis)

print(records.sample(10))

In [None]:
records.info()

### **Visually Exploring Data**
1. Explore histograms of continuous variables
2. Generate bar charts of categorical variables
3. Convert data as needed
4. Explore relationships among the variables using heatmaps
5. Explore logistic regression relationships between variables

In [None]:
#create histograms
for i in records.iloc[:,:6]: 
    plt.hist(records[i])
    plt.title(i)
    plt.show()

In [None]:
#create a bar chart of the original (text) diagnosis
plot=sns.countplot(data=records, x='diagnosis')
plt.show()

In [None]:
#create a bar chart of the encoded Diagnosis label
plot=sns.countplot(data=records, x='Diagnosis')
plt.show()


In [None]:
#numeric_only=True skips the text column 'diagnosis' (requires pandas >= 1.5)
sns.heatmap(data=records.corr(numeric_only=True), cmap="Blues", annot=True)

In [None]:
sns.regplot(x=records['V2'], y=records['Diagnosis'], logistic=True, ci=None)

In [None]:
for i in records.iloc[:,0:5]: 
  sns.regplot(x=records[i], y=records['Diagnosis'], logistic=True, ci=None)
  plt.title(i)
  plt.show()

## **Selecting Features and Label**

Select predictors (attributes) for classification.
Set role (target).

In [None]:
#Selecting predictors and label
features = records.columns[0:5]
X=records[features]  #Input data
y=records['Diagnosis'] # Target variable

In [None]:
y.head()

In [None]:
X.head()

## **Splitting the Dataset**

Split arrays or matrices into random train and test subsets
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split
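
The split used in this lab passes `stratify=y`. As a rough illustration of what that argument does, the sketch below splits made-up imbalanced labels (the arrays `X_demo` and `y_demo` are hypothetical) and checks that both subsets keep the same class share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print('Share of class 1 in train:', y_tr.mean())
print('Share of class 1 in test: ', y_te.mean())
```

Without `stratify`, a small test set can end up with noticeably more or fewer positive cases than the training set, which makes the evaluation metrics less reliable.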

In [None]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)  # 80% training and 20% testing 

#inspect the split datasets
print(X_train.head())
print(y_train.head())

print('Training dataset size:',X_train.shape[0])
print('Test dataset size:',X_test.shape[0])


## **Training and Applying a Decision Tree Classifier**

* Train a model using the training dataset.
* Make predictions using the model for the test dataset.

Read about DecisionTreeClassifier at: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
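
To get a feel for how `max_depth` affects a tree before applying it to the biopsy data, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the names `X_syn`, `y_syn` and `results` are assumptions of this example, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the biopsy dataset)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1, stratify=y_syn
)

# Compare a stump, a shallow tree, and a fully grown tree (max_depth=None)
results = {}
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=1)
    tree.fit(X_tr, y_tr)
    results[depth] = {
        'depth': tree.get_depth(),
        'train_acc': tree.score(X_tr, y_tr),
        'test_acc': tree.score(X_te, y_te),
    }
    print(depth, results[depth])
```

A fully grown tree typically fits the training data almost perfectly, so comparing its training and test accuracy is a quick way to spot overfitting.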



In [None]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3) #defaults: criterion="gini", max_depth=None (no depth limit)

# Train Decision Tree Classifier with the training dataset
clf = clf.fit(X_train, y_train)

#Make predictions for the test dataset
y_pred = clf.predict(X_test)


**Inspect Predictions**

In [None]:
#join unseen y_test with predicted value into a data frame
inspection=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})

#join X_test with the new dataframe
inspection=pd.concat([X_test,inspection], axis=1)

inspection.sample(20)

## **Evaluating the model**



1.   Calculate Accuracy, Precision, Recall, F1


Classification metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics
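
The four metrics can also be derived by hand from the confusion matrix, which may help when interpreting them. The sketch below uses ten made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not model output from this lab).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical actual and predicted labels for illustration
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print('By hand:', accuracy, precision, recall, f1)
print('sklearn:', accuracy_score(y_true_demo, y_pred_demo),
      precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```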







In [None]:
#Model Evaluation, calculate metrics: Accuracy, Precision, Recall, F1,
print("Accuracy: ", '%.3f' % metrics.accuracy_score(y_test,y_pred))
print("Precision: ",'%.3f' % metrics.precision_score(y_test,y_pred))
print("Recall: ", '%.3f' % metrics.recall_score(y_test,y_pred))
print("F1: ", '%.3f' % metrics.f1_score(y_test,y_pred))


In [None]:
#print confusion matrix and evaluation report
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

### **Plot ROC (Receiver operating characteristic) curve and confusion matrix**

ROC curve
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html

Confusion matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html
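
The ROC curve is built from predicted probabilities rather than hard class predictions. As a small illustration, the sketch below computes the curve points and the area under the curve (AUC) for four made-up scores; `y_true_demo` and `scores_demo` are hypothetical values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of class 1
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold gives one (false positive rate, true positive rate) point
fpr, tpr, thresholds = roc_curve(y_true_demo, scores_demo)
auc = roc_auc_score(y_true_demo, scores_demo)

print('FPR:', fpr)
print('TPR:', tpr)
print('AUC:', auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.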

In [None]:
#get predicted probabilities for the main class
y_pred_probs = clf.predict_proba(X_test)
y_pred_probs = y_pred_probs[:, 1]
y_pred_probs

In [None]:
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import ConfusionMatrixDisplay

RocCurveDisplay.from_predictions(y_test, y_pred_probs)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()

### **Visualising the trees**

In [None]:
!pip install pydotplus #interface for graph visualisation
!pip install graphviz #for graph visualisation

In [None]:
import six
import sys
sys.modules['sklearn.externals.six'] = six

In [None]:
#Import libraries and classes
from io import StringIO  # standard-library replacement for the removed sklearn.externals.six
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=list(features), class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('Biopsy.png')
Image(graph.create_png())
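
If Graphviz or pydotplus is not available, a text rendering of the tree is one possible alternative. The sketch below uses `export_text` from `sklearn.tree` on a small synthetic dataset; the data and the feature names `f0` to `f3` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=1)
demo_names = ['f0', 'f1', 'f2', 'f3']

demo_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=1)
demo_tree.fit(X_syn, y_syn)

# export_text returns the decision rules as an indented string
rules = export_text(demo_tree, feature_names=demo_names)
print(rules)
```

Each indented line is a test on one feature, and each leaf line reports the predicted class.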

In [None]:
# Create Logistic Regression classifier object for comparison with the decision tree
from sklearn.linear_model import LogisticRegression

#Create an initial Logistic Regression model
logreg = LogisticRegression(max_iter=100)

# Complete the code to train Logistic Regression Classifier with the training dataset
logreg = logreg.fit(X_train, y_train)

#Complete the code to make predictions for the test dataset
y_pred = logreg.predict(X_test)


In [None]:
#print confusion matrix and evaluation report
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# **Telco Churn**


In [None]:
# Load necessary libraries here
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for model evaluation
from sklearn.metrics import roc_curve, precision_recall_curve, confusion_matrix, classification_report

## Import dataset

In [None]:
# Load data using pandas.read_csv(filepath_or_url, sep=',')
url = 'https://raw.githubusercontent.com/thuc-github/MIS710-T12023/main/Week%204/WA_Fn-UseC_-Telco-Customer-Churn.csv'

df = pd.read_csv(url)

## EDA

* How many rows and columns are in the dataset?
* Return the first n rows.
* What are the columns and their datatypes?
* Are there any missing values?
* Are there any strong correlations in the dataset?
* How should categorical features be handled?



### Data Exploration
* Demographics (age, gender, partner and dependent status)
* Customer account information (tenure, contracts)
* Distribution of services
* Relationships between variables
* Distribution of the target variable (`Churn`)

## Data preparation 


1.   Prepare X, y
2.   Prepare X_train, X_test, y_train, y_test (hint: use `train_test_split`)
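
One way to carry out these two steps is sketched below on a few made-up rows. The column names mirror the Telco dataset, but the values are hypothetical, and one-hot encoding with `pd.get_dummies` is only one of several reasonable choices for categorical features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical rows for illustration (NOT taken from the Telco file)
demo = pd.DataFrame({
    'tenure': [1, 34, 2, 45, 8, 22, 10, 62],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month', 'One year',
                 'Month-to-month', 'Two year', 'Month-to-month', 'Two year'],
    'Churn': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'No'],
})

# 1. Prepare y (encode the label) and X (one-hot encode categorical predictors)
y_demo = demo['Churn'].map({'No': 0, 'Yes': 1})
X_demo = pd.get_dummies(demo.drop(columns='Churn'), columns=['Contract'], drop_first=True)

# 2. Prepare the training and test subsets, keeping the churn rate similar in both
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

print(X_demo.columns.tolist())
print(X_tr.shape, X_te.shape)
```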



## Model implementation

1. Try with the original data. What's the performance?
2. Let's add data normalisation. Has the performance been improved?
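
A minimal sketch of this comparison is shown below, assuming synthetic data in place of the Telco features. Wrapping the scaler and the model in a pipeline is a design choice of this example: it ensures the scaler is fitted on the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Telco dataset)
X_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=0)
X_syn[:, 0] *= 1000  # put one feature on a much larger scale than the others

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

# 1. Model on the original data
raw_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
raw_acc = raw_model.score(X_te, y_te)

# 2. Same model with standardisation applied inside a pipeline
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)
scaled_acc = scaled_model.score(X_te, y_te)

print('Accuracy without scaling:', raw_acc)
print('Accuracy with scaling:   ', scaled_acc)
```

Whether scaling helps depends on the data; for logistic regression it mainly affects how quickly and reliably the solver converges.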

## Performance evaluation
* Classification report
* Confusion matrix 
* Feature importance (weights)
* ROC and AUC

# Insurance

## Loading libraries

In [None]:
import numpy as np 
import pandas as pd 
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score,mean_squared_error, mean_absolute_error

## Import dataset

In [None]:
# Load data using pandas.read_csv(filepath_or_url, sep=',')
url = 'https://raw.githubusercontent.com/thuc-github/MIS710-T12023/main/Week%203/insurance.csv'

df = pd.read_csv(url)


## EDA

* How many rows and columns are in the dataset?
* Return the first n rows.
* What are the columns and their datatypes?
* Are there any missing values?
* How should categorical features be handled?
* Are there any strong correlations in the dataset?
* What are the stats for `charges`? Plot the overall distribution of `charges`, and the distribution of charges for smokers and non-smokers. Practice more with the `bmi`, `age` and `sex` variables.



In [None]:
# How many rows and columns in the dataset?
print(df.shape)

# Return the first n rows.
print(df.head())

# What are the columns and their datatypes?
df.info()

# Are there any missing values?
print(df.isnull().sum())

# Any strong correlation in the dataset? (numeric columns only; requires pandas >= 1.5)
print(df.corr(numeric_only=True))

In [None]:
# Correlation plot (numeric columns only, because the text columns are not encoded yet)
f, ax = plt.subplots(figsize=(10, 8))
corr = df.corr(numeric_only=True)
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(240,10,as_cmap=True),
            square=True, ax=ax)

In [None]:
# How to deal with categorical features?

from sklearn.preprocessing import LabelEncoder
#sex
le = LabelEncoder()
le.fit(df.sex.drop_duplicates()) 
df.sex = le.transform(df.sex)
# smoker or not
le.fit(df.smoker.drop_duplicates()) 
df.smoker = le.transform(df.smoker)
#region
le.fit(df.region.drop_duplicates()) 
df.region = le.transform(df.region)


In [None]:
'''
What are the stats for the charges? Plot the overall distribution of charges,
and the distribution of charges for smokers and non-smokers.
'''
df.charges.describe()

In [None]:
df.charges.hist(bins=50, figsize=(12,8))

In [None]:
df.charges.hist(by=df.smoker, bins=50, figsize=(12,8))

In [None]:
# Alternative using seaborn (histplot with kde=True replaces the deprecated distplot)

f= plt.figure(figsize=(12,5))

ax=f.add_subplot(121)
sns.histplot(df[(df.smoker == 1)]["charges"], kde=True, color='c', ax=ax)
ax.set_title('Distribution of charges for smokers')

ax=f.add_subplot(122, sharex = ax)
sns.histplot(df[(df.smoker == 0)]['charges'], kde=True, color='b', ax=ax)
ax.set_title('Distribution of charges for non-smokers')

## Data preparation 


1.   Prepare X, y
2.   Prepare X_train, X_test, y_train, y_test (hint: use `train_test_split`)



In [None]:
X = df.drop(['charges'], axis = 1)
y = df.charges

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)

## Model implementation

1. Try with the original data. What's the performance?
2. Let's add data normalisation. Has the performance been improved?
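
For ordinary linear regression, standardising the features rescales the coefficients but is not expected to change the predictions. The sketch below checks this on synthetic data; the arrays and coefficients are made up for this example and are not the insurance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (NOT the insurance dataset)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y_syn = X_syn @ np.array([3.0, 0.5, 0.01]) + rng.normal(scale=1.0, size=200)

# Simple 80/20 split by position
X_tr, X_te = X_syn[:160], X_syn[160:]
y_tr, y_te = y_syn[:160], y_syn[160:]

# 1. Original data
plain = LinearRegression().fit(X_tr, y_tr)
r2_plain = r2_score(y_te, plain.predict(X_te))

# 2. Standardised data (scaler fitted on the training data inside the pipeline)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)
r2_scaled = r2_score(y_te, scaled.predict(X_te))

print('R2 without scaling: %.6f' % r2_plain)
print('R2 with scaling:    %.6f' % r2_scaled)
```

Scaling matters more for models that depend on distances or regularisation, such as KNN or penalised regression.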

In [None]:
lr = LinearRegression().fit(X_train,y_train)

y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

print('MSE_Train: {}, MSE_Test: {}, MAE_Train: {}, MAE_Test: {}'.format(mean_squared_error(y_train, y_train_pred),
                                                      mean_squared_error(y_test, y_test_pred),
                                                      mean_absolute_error(y_train, y_train_pred),
                                                      mean_absolute_error(y_test, y_test_pred)))

print('R2 train data: %.3f, R2 test data: %.3f' % (
r2_score(y_train,y_train_pred),
r2_score(y_test,y_test_pred)))

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Fit a decision tree regressor (named tree_reg to avoid confusion with the linear model lr)
tree_reg = DecisionTreeRegressor(criterion='friedman_mse', max_depth=5, max_leaf_nodes=10, min_samples_leaf=2, min_samples_split=2).fit(X_train,y_train)

y_train_pred = tree_reg.predict(X_train)
y_test_pred = tree_reg.predict(X_test)

print('MSE_Train: {}, MSE_Test: {}, MAE_Train: {}, MAE_Test: {}'.format(mean_squared_error(y_train, y_train_pred),
                                                      mean_squared_error(y_test, y_test_pred),
                                                      mean_absolute_error(y_train, y_train_pred),
                                                      mean_absolute_error(y_test, y_test_pred)))

print('R2 train data: %.3f, R2 test data: %.3f' % (
r2_score(y_train,y_train_pred),
r2_score(y_test,y_test_pred)))

In [None]:
plt.figure(figsize=(10,6))

plt.scatter(y_train_pred, y_train_pred - y_train,
          c = 'black', marker = 'o', s = 35, alpha = 0.5,
          label = 'Train data')
plt.scatter(y_test_pred, y_test_pred - y_test,
          c = 'c', marker = 'o', s = 35, alpha = 0.7,
          label = 'Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc = 'upper left')
plt.hlines(y = 0, xmin = 0, xmax = 60000, lw = 2, color = 'red')
plt.show()

# **House Price**
Let's try another dataset we are familiar with from Week 3.

## **Loading data**

In [None]:
# load dataset
records = pd.read_csv('https://raw.githubusercontent.com/VanLan0/MIS710-ML/main/Datasets/Housing3.csv')

#explore the dataset
print(records)

print('Sample size:', records.shape[0])
print('Number of columns:', records.shape[1]) 

## **Inspecting and handling missing data and incorrectly recorded data**

In [None]:
#write code to display the following

In [None]:
#area is wrongly documented as string; values that cannot be parsed become NaN
records['area'] = pd.to_numeric(records['area'], errors='coerce')

In [None]:
#write code to inspect missing data

In [None]:
#Fill in missing numerical data with mean and categorical data with mode
#(assign the result back; fillna with inplace=True on a single column is unreliable in recent pandas)
records['area'] = records['area'].fillna(records['area'].mean())
records['furnishingstatus'] = records['furnishingstatus'].fillna(records['furnishingstatus'].mode()[0]) #there can be more than one mode

#handle missing mainroad data

In [None]:
#Last week, we learned to convert categorical variables to numerical using LabelEncoder
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [None]:
#for example (encode one column at a time):
records['mainroad_N'] = encoder.fit_transform(records['mainroad'])
records['basement_N'] = encoder.fit_transform(records['basement'])

#encode other variables as needed

## **Exploratory data analysis**

In [None]:
#explore data yourself

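#for example, generate a clustered heatmap with a dendrogram to show hierarchical clustering
#(numeric columns only; square=True is dropped because clustermap ignores it)
sns.clustermap(records.corr(numeric_only=True), cmap='Blues', annot=True, row_cluster=False)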

## **Selecting Features and Label**
Select predictors (attributes) for regression. Set role (target).

In [None]:
#Select predictors
#features=['area','bedrooms', ....]
X=records[features]


In [None]:
#specify the label
y=records['price']
y.head()

## **Splitting the Dataset**
Split arrays or matrices into random train and test subsets https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split

In [None]:
#from sklearn.model_selection import train_test_split # Import train_test_split function

# Split dataset into training set 70% and test set 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70% training and 30% testing 

#inspect the split datasets
print(X_train.head())
print(y_train.head())

print('Training dataset size:',X_train.shape)
print('Test dataset size:',X_test.shape)


## **Training a Decision Tree Regressor**
* Train a model using the training dataset.
* Make predictions using the model for the test dataset.

Read about DecisionTreeRegressor at: 
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor
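
As a warm-up for the exercise cells, the train, predict and evaluate steps can be sketched on synthetic data. The dataset from `make_regression` and the names `demo_regressor` and `demo_pred` are assumptions of this example; the housing exercise itself still needs your own code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the housing dataset)
X_syn, y_syn = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1)

# Train: max_leaf_nodes caps the number of leaves, so the tree stays small
demo_regressor = DecisionTreeRegressor(max_depth=20, max_leaf_nodes=15, random_state=1)
demo_regressor.fit(X_tr, y_tr)

# Predict on the unseen test data
demo_pred = demo_regressor.predict(X_te)

# Evaluate
mae = mean_absolute_error(y_te, demo_pred)
rmse = np.sqrt(mean_squared_error(y_te, demo_pred))
print('Leaves:', demo_regressor.get_n_leaves())
print('MAE: %.2f, RMSE: %.2f' % (mae, rmse))
```

Each leaf predicts the mean target value of the training samples that fall into it, so a tree with 15 leaves can output at most 15 distinct values.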

In [None]:
#Import DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor
#instantiate a decision tree regressor and fit it with the training data
regressor = DecisionTreeRegressor(max_depth=20, max_leaf_nodes=15, random_state=1)

#write code to train the regressor



## **Applying the Decision Tree Regressor on the testset**

In [None]:
#predict prices



## **Evaluating the model performance**


In [None]:
#Evaluate the model
from sklearn import metrics
print('Mean Absolute Error:', '%.0f' % metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', '%.0f' % metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', '%.0f' %np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)
records['price'].describe()

Comment on the errors in relation to the price stats
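
One way to frame such a comment is to express the error as a share of the average price. The sketch below uses made-up prices and predictions (not results from the housing data) to show the calculation.

```python
import numpy as np

# Hypothetical actual and predicted prices for illustration only
actual = np.array([300_000, 450_000, 500_000, 620_000, 800_000])
predicted = np.array([320_000, 430_000, 540_000, 600_000, 750_000])

mae = np.mean(np.abs(actual - predicted))             # average absolute error
rmse = np.sqrt(np.mean((actual - predicted) ** 2))    # penalises large errors more
mae_share_of_mean = mae / actual.mean()               # error relative to a typical price

print('MAE: %.0f' % mae)
print('RMSE: %.0f' % rmse)
print('MAE as a share of the mean price: %.1f%%' % (100 * mae_share_of_mean))
```

Comparing the errors with the mean, standard deviation and range of `price` indicates whether the model is accurate enough to be useful.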

In [None]:
from io import StringIO  # standard-library replacement for the removed sklearn.externals.six
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
#class_names is omitted because it only applies to classifiers
export_graphviz(regressor, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=list(features))
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('HousePricePrediction.png')
Image(graph.create_png())


# **Congratulations**
You can now try another dataset on your own: https://www.kaggle.com/datasets/ahmettyilmazz/fuel-consumption