# Classification with Python
Docs: https://scikit-learn.org/stable/modules/tree.html#classification<br/>

Two major types of problems that machine learning algorithms try to solve are:<br/>

Regression: predict a continuous value for a given data point<br/>
Classification: predict the class of a given data point<br/>

Classification example: given all the other factors (i.e. features), predict whether a passenger died in the sinking of the Titanic.<br/>
Regression example: given all the other factors (i.e. features), predict the age of a passenger.

## Decision Trees

A decision tree is a tree-like graph in which each internal node is a place where we pick an attribute and ask a question, each edge represents one of the answers to that question, and each leaf represents the actual output or class label. Every individual test is simple (for example, a threshold on a single feature), but a sequence of such tests can represent non-linear decision boundaries.

Decision trees classify the examples by sorting them down the tree from the root to some leaf node, with the leaf node providing the classification to the example. Each node in the tree acts as a test case for some attribute, and each edge descending from that node corresponds to one of the possible answers to the test case. This process is recursive in nature and is repeated for every subtree rooted at the new nodes.
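
The traversal described above can be sketched in a few lines of plain Python. The tree below is a hypothetical, hand-written example (it is not the tree that scikit-learn learns later in this notebook), and the names `toy_tree` and `classify` are made up for illustration.

```python
# Hypothetical hand-written tree, not one learned from the data.
# Internal nodes are dicts that name the attribute to test and map each
# possible answer (edge) to a subtree; leaves are plain class labels.
toy_tree = {
    "attribute": "Weather",
    "branches": {
        "Cloudy": "Yes",
        "Rainy": "No",
        "Sunny": {
            "attribute": "Temperature",
            "branches": {"Hot": "No", "Mild": "Yes", "Cool": "Yes"},
        },
    },
}


def classify(node, example):
    """Sort one example down the tree from the root to a leaf."""
    if not isinstance(node, dict):
        return node  # reached a leaf: this is the class label
    answer = example[node["attribute"]]  # ask the question at this node
    # Follow the matching edge and repeat on the subtree rooted there
    return classify(node["branches"][answer], example)


print(classify(toy_tree, {"Weather": "Sunny", "Temperature": "Hot"}))    # No
print(classify(toy_tree, {"Weather": "Cloudy", "Temperature": "Mild"}))  # Yes
```

Each recursive call handles one subtree, which mirrors the "repeated for every subtree" description in the text.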

![image.png](attachment:image.png)

You could use this table directly to decide whether to play or not. But what if the weather pattern on Saturday does not match any of the rows in the table? A decision tree is a good way to represent data like this: instead of looking up an exact row, it follows a path of attribute tests from the root to a leaf, so every combination of attribute values leads to a decision.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Let's build a simple decision tree using the scikit-learn package in Python.

In [2]:
%cd "C:\\Users\\yasin.unlu\\Documents\\Original Docs\\Documents1\\Docs\\Teaching\\PythonForDataScienceSummer2020\\Week-8"

C:\Users\yasin.unlu\Documents\Original Docs\Documents1\Docs\Teaching\PythonForDataScienceSummer2020\Week-8


Let's read the data from a CSV file.

In [73]:
import pandas as pd
df = pd.read_csv('play.csv')
df

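|   | Day | Weather | Temperature | Humidity | Wind | Play |
|---|---|---|---|---|---|---|
| 0 | 1 | Sunny | Hot | 90 | 10 | No |
| 1 | 2 | Cloudy | Hot | 95 | 5 | Yes |
| 2 | 3 | Sunny | Mild | 70 | 30 | Yes |
| 3 | 4 | Cloudy | Mild | 89 | 25 | Yes |
| 4 | 5 | Rainy | Mild | 85 | 25 | No |
| 5 | 6 | Rainy | Cool | 60 | 30 | No |
| 6 | 7 | Rainy | Mild | 92 | 20 | Yes |
| 7 | 8 | Sunny | Hot | 95 | 20 | No |
| 8 | 9 | Cloudy | Hot | 65 | 12 | Yes |
| 9 | 10 | Rainy | Mild | 100 | 25 | No |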


#### Clean the Data:

Let's drop unnecessary columns from the dataframe. In this simple data set, the 'Day' column is just a running row number and does not help train a model, so we will drop it.

In [74]:
# columns= already identifies the axis, so axis=1 is redundant
df.drop(columns='Day', inplace=True)

In [75]:
df

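|   | Weather | Temperature | Humidity | Wind | Play |
|---|---|---|---|---|---|
| 0 | Sunny | Hot | 90 | 10 | No |
| 1 | Cloudy | Hot | 95 | 5 | Yes |
| 2 | Sunny | Mild | 70 | 30 | Yes |
| 3 | Cloudy | Mild | 89 | 25 | Yes |
| 4 | Rainy | Mild | 85 | 25 | No |
| 5 | Rainy | Cool | 60 | 30 | No |
| 6 | Rainy | Mild | 92 | 20 | Yes |
| 7 | Sunny | Hot | 95 | 20 | No |
| 8 | Cloudy | Hot | 65 | 12 | Yes |
| 9 | Rainy | Mild | 100 | 25 | No |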


The dataframe stores both string and integer columns.

In [9]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Weather      10 non-null     object
 1   Temperature  10 non-null     object
 2   Humidity     10 non-null     int64 
 3   Wind         10 non-null     int64 
 4   Play         10 non-null     object
dtypes: int64(2), object(3)
memory usage: 528.0+ bytes


scikit-learn's decision trees require numeric input, so the string feature columns must be encoded as numbers. We encode the 'Play' label the same way.

In [76]:
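# Encode the string columns as integer codes
from sklearn.preprocessing import LabelEncoder
df_clean = df.copy()  # keep the original dataframe unchanged
labelencoder = LabelEncoder()
# fit_transform refits the encoder for each column, so one instance can be reused.
# Note: LabelEncoder is designed for target labels; OrdinalEncoder is the
# scikit-learn encoder intended for feature columns.
df_clean['Weather'] = labelencoder.fit_transform(df['Weather'])
df_clean['Temperature'] = labelencoder.fit_transform(df['Temperature'])
df_clean['Play'] = labelencoder.fit_transform(df['Play'])
df_clean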

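|   | Weather | Temperature | Humidity | Wind | Play |
|---|---|---|---|---|---|
| 0 | 2 | 1 | 90 | 10 | 0 |
| 1 | 0 | 1 | 95 | 5 | 1 |
| 2 | 2 | 2 | 70 | 30 | 1 |
| 3 | 0 | 2 | 89 | 25 | 1 |
| 4 | 1 | 2 | 85 | 25 | 0 |
| 5 | 1 | 0 | 60 | 30 | 0 |
| 6 | 1 | 2 | 92 | 20 | 1 |
| 7 | 2 | 1 | 95 | 20 | 0 |
| 8 | 0 | 1 | 65 | 12 | 1 |
| 9 | 1 | 2 | 100 | 25 | 0 |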
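
To see where these integer codes come from, here is a minimal pure-Python sketch of the idea behind label encoding: each distinct value is mapped to its position in the sorted list of distinct values. The helper `label_encode` is a hypothetical name used only for illustration; it is not part of scikit-learn.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# The 'Weather' column from the table above
weather = ["Sunny", "Cloudy", "Sunny", "Cloudy", "Rainy",
           "Rainy", "Rainy", "Sunny", "Cloudy", "Rainy"]

codes, mapping = label_encode(weather)
print(mapping)  # {'Cloudy': 0, 'Rainy': 1, 'Sunny': 2}
print(codes)    # [2, 0, 2, 0, 1, 1, 1, 2, 0, 1]
```

The codes match the encoded 'Weather' column. Keep in mind that the numbers are only labels: the fact that Sunny (2) is "greater than" Cloudy (0) has no meaning for this attribute.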


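Let's scale 'Humidity' and 'Wind' so that they share the 0 to 2 range of the encoded columns. Decision trees do not strictly need this step, because each split compares a single feature with a threshold, but it does no harm here.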

In [88]:
# Feature scaling
from sklearn.preprocessing import MinMaxScaler
# Upper bound is 2 because the encoded 'Weather' and 'Temperature' values go up to 2
scaler = MinMaxScaler(feature_range=(0, 2))
df_clean['Humidity'] = scaler.fit_transform(df[['Humidity']])
df_clean['Wind'] = scaler.fit_transform(df[['Wind']])

In [89]:
df_clean

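|   | Weather | Temperature | Humidity | Wind | Play |
|---|---|---|---|---|---|
| 0 | 2 | 1 | 1.5 | 0.4 | 0 |
| 1 | 0 | 1 | 1.75 | 0.0 | 1 |
| 2 | 2 | 2 | 0.5 | 2.0 | 1 |
| 3 | 0 | 2 | 1.45 | 1.6 | 1 |
| 4 | 1 | 2 | 1.25 | 1.6 | 0 |
| 5 | 1 | 0 | 0.0 | 2.0 | 0 |
| 6 | 1 | 2 | 1.6 | 1.2 | 1 |
| 7 | 2 | 1 | 1.75 | 1.2 | 0 |
| 8 | 0 | 1 | 0.25 | 0.56 | 1 |
| 9 | 1 | 2 | 2.0 | 1.6 | 0 |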
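
The scaled values can be checked by hand. Min-max scaling maps the smallest value of a column to the lower bound of the range and the largest to the upper bound, with everything else placed linearly in between. The function `min_max_scale` below is a hypothetical helper written for illustration, and it assumes the column contains at least two different values.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values so that min -> lower bound and max -> upper bound."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    # Assumes v_max > v_min; a constant column would divide by zero here
    return [lo + (v - v_min) / (v_max - v_min) * (hi - lo) for v in values]


# The original 'Humidity' column
humidity = [90, 95, 70, 89, 85, 60, 92, 95, 65, 100]

scaled = min_max_scale(humidity, feature_range=(0, 2))
print([round(s, 2) for s in scaled])
# [1.5, 1.75, 0.5, 1.45, 1.25, 0.0, 1.6, 1.75, 0.25, 2.0]
```

For example, a humidity of 90 becomes (90 - 60) / (100 - 60) * 2 = 1.5, which matches the first row of the scaled table.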



#### Prepare Features and Response Variables:

Now, let's prepare features and response dataframes.

In [90]:
features = df_clean.loc[:,'Weather':'Wind']
response = df_clean[['Play']]

#### Split the Data:

Let's split the data into a training set (70% of the rows) and a test set (30% of the rows).

In [106]:
from sklearn.model_selection import train_test_split
# Hold out 30% of the rows for testing; random_state makes the split reproducible
features_train, features_test, response_train, response_test = train_test_split(
    features, response, test_size=0.30, random_state=0)
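
Conceptually, a random train/test split shuffles the row indices and holds out a fraction of them. The sketch below illustrates that idea with the standard library only. `simple_train_test_split` is a hypothetical helper, and because it uses a different random number generator it will not select the same rows as scikit-learn's `train_test_split`, even with the same seed.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.30, seed=0):
    """Shuffle row indices and hold out a fraction of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = math.ceil(test_size * len(rows))  # assumption: round the test size up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-ins for the 10 rows of the dataframe
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 7 3
```

With only 10 rows, the test set contains just 3 examples, so any score computed on it is a very rough estimate.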

In [107]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state = 0)
classifier.fit(features_train, response_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')
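
The output above shows `criterion='gini'`: candidate splits are scored by Gini impurity, which is 0 for a group containing a single class and 0.5 for a 50/50 mix of two classes. The following is a minimal sketch of that score, not scikit-learn's implementation. For readability it uses the full, unencoded table and a categorical test (`Weather == 'Cloudy'`), whereas the fitted model only sees the 7 training rows and splits on numeric thresholds. The helper names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity after a split: group impurities weighted by group size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


weather = ["Sunny", "Cloudy", "Sunny", "Cloudy", "Rainy",
           "Rainy", "Rainy", "Sunny", "Cloudy", "Rainy"]
play = ["No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No"]

# Candidate split: is the weather cloudy?
cloudy = [p for w, p in zip(weather, play) if w == "Cloudy"]
other = [p for w, p in zip(weather, play) if w != "Cloudy"]

print(round(gini(play), 3))                    # 0.5   (5 Yes, 5 No)
print(round(weighted_gini(cloudy, other), 3))  # 0.286 (all cloudy days are Yes)
```

A lower weighted impurity after the split means the split separates the classes better, which is what the tree-building procedure looks for at each node.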

#### Predictions:

In [114]:
response_pred = classifier.predict(features_test)
print('Predictions ',response_pred)
print('Actuals ',list(response_test['Play']))

Predictions  [1 1 1]
Actuals  [1, 1, 0]


#### Evaluate Classifier Model - Accuracy

![image.png](attachment:image.png)

In [109]:
from sklearn.metrics import accuracy_score
print('Accuracy Score on test data: ', accuracy_score(y_true=response_test, y_pred=response_pred))

Accuracy Score on test data:  0.6666666666666666
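
Accuracy is simply the fraction of predictions that match the actual labels. As a quick check, the same number can be reproduced by hand from the predictions and actuals printed earlier:

```python
predictions = [1, 1, 1]  # values printed in the Predictions section
actuals = [1, 1, 0]

# Count positions where prediction and actual label agree
correct = sum(p == a for p, a in zip(predictions, actuals))
accuracy = correct / len(actuals)
print(correct, "of", len(actuals), "correct")  # 2 of 3 correct
print(accuracy)                                # 0.6666666666666666
```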


#### Evaluate Classifier Model - Confusion Matrix:

![image.png](attachment:image.png)

To look at the errors in more detail, we import the confusion_matrix function from the sklearn.metrics module. The confusion matrix tabulates predictions against true classes: each row corresponds to a true class and each column to a predicted class, so the off-diagonal entries count the misclassifications.

In [110]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(response_test, response_pred)
print(cm)

[[0 1]
 [0 2]]


In [117]:
print('Number of negative predictions while it is actually negative: ',cm[0,0])
print('Number of positive predictions while it is actually negative: ',cm[0,1])
print('Number of negative predictions while it is actually positive: ',cm[1,0])
print('Number of positive predictions while it is actually positive: ',cm[1,1])

Number of negative predictions while it is actually negative:  0
Number of positive predictions while it is actually negative:  1
Number of negative predictions while it is actually positive:  0
Number of positive predictions while it is actually positive:  2
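
The same counts can be built by hand, which makes the row and column convention explicit. This is a minimal sketch that assumes the classes are the integers 0 and 1; `confusion_counts` is a hypothetical helper, not a scikit-learn function.

```python
def confusion_counts(actuals, predictions, n_classes=2):
    """Count (actual, predicted) pairs: rows are actual classes, columns are predictions."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(actuals, predictions):
        matrix[actual][predicted] += 1
    return matrix


cm_manual = confusion_counts(actuals=[1, 1, 0], predictions=[1, 1, 1])
print(cm_manual)  # [[0, 1], [0, 2]]
```

The first column is all zeros because the classifier never predicted class 0 on this test set, which the accuracy score alone does not reveal.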
