## Machine Learning Tutorial 10: Decision Tree

The decision tree algorithm is used to solve classification problems in machine learning. In this tutorial we will use a decision tree to predict whether an employee's salary is more than 100k. First we will cover some theoretical concepts and then proceed with the implementation.


#### Topics covered:
* How to solve a classification problem using the decision tree algorithm
* Theory: the rationale behind decision trees, using a use case of predicting salary based on the department, degree, and company that a person works for
* How do you select the ordering of features? High vs. low information gain and entropy
* Gini impurity
* Create an sklearn model using DecisionTreeClassifier
* **Exercise** - Find out the survival rate of Titanic passengers using a decision tree
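
The entropy, Gini impurity, and information gain named in the topic list can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's internal implementation; the helper names `entropy`, `gini`, and `information_gain` and the toy labels are assumptions made up for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy labels: 4 samples of class 0 and 4 of class 1
parent = [0, 0, 0, 0, 1, 1, 1, 1]
pure_split = [[0, 0, 0, 0], [1, 1, 1, 1]]   # each child holds one class
mixed_split = [[0, 0, 1, 1], [0, 0, 1, 1]]  # children are as mixed as the parent

print(entropy(parent))                        # 1.0
print(gini(parent))                           # 0.5
print(information_gain(parent, pure_split))   # 1.0
print(information_gain(parent, mixed_split))  # 0.0
```

A split with high information gain produces purer child groups, which is why a feature that yields such a split is a good candidate to test near the top of the tree.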

### Classification Types

- Binary Classification
    - **Will customer buy life insurance?** -
        - **Yes/No**
- Multiclass Classification
    - **Which party a person is going to vote for?** - 
        - **Democratic/Republican/Independent**
    

In [13]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import tree

In [3]:
df = pd.read_csv("salaries.csv")
df.head()

Unnamed: 0,company,job,degree,salary_more_then_100k
0,google,sales executive,bachelors,0
1,google,sales executive,masters,0
2,google,business manager,bachelors,1
3,google,business manager,masters,1
4,google,computer programmer,bachelors,0


In [5]:
inputs = df.drop('salary_more_then_100k', axis='columns')
target = df['salary_more_then_100k']

In [6]:
target

0     0
1     0
2     1
3     1
4     0
5     1
6     0
7     0
8     0
9     1
10    1
11    1
12    1
13    1
14    1
15    1
Name: salary_more_then_100k, dtype: int64

In [9]:
le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()

In [10]:
# Use a separate encoder per column so each one keeps its own label mapping
inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])
inputs.head()

Unnamed: 0,company,job,degree,company_n,job_n,degree_n
0,google,sales executive,bachelors,2,2,0
1,google,sales executive,masters,2,2,1
2,google,business manager,bachelors,2,0,0
3,google,business manager,masters,2,0,1
4,google,computer programmer,bachelors,2,1,0
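
To see where the integer codes in the table come from, the mapping learned by a `LabelEncoder` can be inspected through its `classes_` attribute. The sketch below is a standalone illustration that uses only the five job values shown in the output above, not the full `salaries.csv` file; the variable names `jobs`, `job_codes`, `mapping`, and `decoded` are assumptions for this example.

```python
from sklearn.preprocessing import LabelEncoder

# The job values from the first five rows shown above
jobs = ["sales executive", "sales executive", "business manager",
        "business manager", "computer programmer"]

le_job = LabelEncoder()
job_codes = le_job.fit_transform(jobs)

# classes_ holds the distinct labels in sorted order; a label's code is its position
print(list(le_job.classes_))
print(list(job_codes))  # [2, 2, 0, 0, 1]

# Build a label -> code lookup and decode codes back to labels
mapping = {label: code for code, label in enumerate(le_job.classes_)}
decoded = list(le_job.inverse_transform([0, 1, 2]))
print(mapping)
print(decoded)
```

The codes follow the sorted order of the labels, so they are identifiers rather than meaningful quantities.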


In [11]:
inputs_n = inputs.drop(['company','job','degree'],axis='columns')
inputs_n

Unnamed: 0,company_n,job_n,degree_n
0,2,2,0
1,2,2,1
2,2,0,0
3,2,0,1
4,2,1,0
5,2,1,1
6,0,2,1
7,0,1,0
8,0,0,0
9,0,0,1


In [14]:
model = tree.DecisionTreeClassifier()

In [15]:
model.fit(inputs_n,target)

In [16]:
# Accuracy on the same rows the model was trained on, not on unseen data
model.score(inputs_n, target)

1.0

In [17]:
model.predict([[2,2,1]])



array([0], dtype=int64)
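
One way to inspect what a fitted tree has learned is `sklearn.tree.export_text`, which prints the split rules as indented text. The sketch below is a standalone illustration: it retypes only the ten encoded rows displayed above together with their first ten target values, so the resulting rules may differ from a tree trained on the full 16-row file. The names `rows`, `labels`, `demo_model`, and `query` are assumptions for this example.

```python
import pandas as pd
from sklearn import tree

# The ten encoded rows shown above (the full dataset has 16 rows)
rows = pd.DataFrame({
    "company_n": [2, 2, 2, 2, 2, 2, 0, 0, 0, 0],
    "job_n":     [2, 2, 0, 0, 1, 1, 2, 1, 0, 0],
    "degree_n":  [0, 1, 0, 1, 0, 1, 1, 0, 0, 1],
})
# The first ten values of the target column
labels = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]

demo_model = tree.DecisionTreeClassifier(random_state=0)
demo_model.fit(rows, labels)

# Print the learned if/else rules in a readable form
rules = tree.export_text(demo_model, feature_names=list(rows.columns))
print(rules)

# Predict for company_n=2, job_n=2, degree_n=1 (google, sales executive, masters)
query = pd.DataFrame([[2, 2, 1]], columns=rows.columns)
prediction = demo_model.predict(query)
print(prediction)
```

Passing the query as a DataFrame with the same column names as the training data keeps the feature names consistent between `fit` and `predict`.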

## Exercise

Given the data in the file **titanic.csv**, carry out the following tasks:

**Build a decision tree model to predict survival based on certain parameters**

Using the following columns from this file, build a model to predict whether a person would survive or not:

* **Pclass**
* **Sex**
* **Age**
* **Fare**

Calculate the score of your model.

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [20]:
df = pd.read_csv("titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [22]:
# Select relevant columns
# .copy() makes an independent frame so later column assignments do not touch a slice
df = df[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].copy()

In [23]:
# Handle missing values
# Assign the result back; inplace=True on a selected column may not update df
df['Age'] = df['Age'].fillna(df['Age'].mean())

In [25]:
# Convert categorical data to numerical data
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
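
The effect of `drop_first=True` is easier to see on a tiny frame. The sketch below is a standalone illustration that retypes the `Sex` and `Age` values of the five rows shown in `df.head()`; the names `toy` and `encoded` are assumptions for this example.

```python
import pandas as pd

# Sex and Age values from the five rows shown in df.head()
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Without drop_first we would get Sex_female and Sex_male, which carry the same
# information; drop_first=True keeps only Sex_male
encoded = pd.get_dummies(toy, columns=["Sex"], drop_first=True)

print(list(encoded.columns))                   # ['Age', 'Sex_male']
print(encoded["Sex_male"].astype(int).tolist())  # [1, 0, 0, 0, 1]
```

For a two-valued column, a single indicator column is enough: a row with `Sex_male` equal to 0 (or `False`, depending on the pandas version) is female.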

In [26]:
# Split the data into features and target
X = df[['Pclass', 'Sex_male', 'Age', 'Fare']]
y = df['Survived']

In [27]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [28]:
# Build and train the decision tree model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

In [29]:
# Make predictions
y_pred = model.predict(X_test)

In [30]:
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

Model Accuracy: 0.74
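
The salary model scored 1.0 on its own training rows, while the Titanic model scores lower on held-out rows. The sketch below illustrates that gap between training and test accuracy. It does not use `titanic.csv`; the data is synthetic, generated with `make_classification`, and all names and parameter values here are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A tree with no depth limit keeps splitting until every training leaf is pure
demo_tree = DecisionTreeClassifier(random_state=42)
demo_tree.fit(Xd_train, yd_train)

train_acc = accuracy_score(yd_train, demo_tree.predict(Xd_train))
test_acc = accuracy_score(yd_test, demo_tree.predict(Xd_test))

print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy: {test_acc:.2f}")
```

Accuracy measured on held-out rows, as in the exercise solution, is the more honest estimate of how the model behaves on new data.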
