# Introduction to the decision tree model
Link to the YouTube video tutorial: https://www.youtube.com/watch?v=PHxYNGo8NcI&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=11  <br />


In this example:  <br />
    1) A decision tree model (a machine learning model) is used for a classification task.  <br /> 
    2) The independent variables (features) of this dataset, company, job, and degree, are all categorical.  <br />
    3) The dependent variable (ground truth/target) of this dataset is salary_more_then_100k, which indicates whether a person with the given attributes earns more than 100k.  <br />



When the classes in a dataset can be separated by a single line, as below, logistic regression is enough to draw the decision boundary.  <br />
<img src="hidden\easier-logistic.png" alt="This image describes the dataset easier for logistic regression model" style="width: 300px;"/>  <br />

However, if your dataset is more complex, as below, a single line cannot serve as the decision boundary. You may have to split the dataset again and again to arrive at the decision boundaries, and this is what the decision tree algorithm does for you.  <br />
<img src="hidden\easier-decisiontree.png" alt="This image describes the dataset easier for decision tree model" style="width: 300px;"/>  <br />

Example of how to build a decision tree:  <br />
1) First, split the dataset by company, so the decision tree has 3 branches: Google, Facebook, and ABC Pharma.  <br />
    1) If the company is Facebook, the answer is always YES (salary is over 100k), regardless of job title and degree.  <br />
    2) Google and ABC Pharma contain mixed samples, so you need to ask a further question (split the decision tree further) to get your answer.  <br />
<img src="hidden\tree1.png" alt="Split 1" style="width: 400px;"/>  <br />


2) For Google, the next question is the job title.  <br />
    1) If it is business manager, the answer is always YES (regardless of the degree).  <br />
    2) If it is sales executive, the answer is always NO (regardless of the degree).  <br />
    3) Computer programmer contains mixed samples, so you need to ask a further question (split the decision tree further) to get your answer.  <br />
<img src="hidden\tree2.png" alt="Split 2" style="width: 400px;"/>  <br />

3) Repeating this splitting process eventually produces the decision tree below.  <br />
    1) The decision tree shows that the dataset is split by company first, then by job title, and finally by degree.  <br />
<img src="hidden\tree3.png" alt="Split 3" style="width: 400px;"/>  <br />
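The splits described above can be read as nested conditions. The following is a minimal sketch, not the trained model: the function name `salary_over_100k` is hypothetical, the computer programmer rule is read off the two Google rows in the dataset table later in this notebook, and the ABC Pharma branch is left open because the text above does not spell it out.

```python
def salary_over_100k(company, job, degree):
    """Hand-written version of the splits described above (illustrative only)."""
    if company == "facebook":
        # Pure branch: every Facebook sample is YES
        return True
    if company == "google":
        if job == "business manager":
            return True
        if job == "sales executive":
            return False
        if job == "computer programmer":
            # Mixed branch, so split once more on degree
            # (bachelors -> 0, masters -> 1 in the Google rows of the dataset)
            return degree == "masters"
    # ABC Pharma (or unknown input): further splits are not described in the text
    return None


print(salary_over_100k("facebook", "sales executive", "bachelors"))
print(salary_over_100k("google", "computer programmer", "masters"))
```

Each `if` corresponds to one branch of the tree, and each `return` to a leaf.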

The order in which features are split matters, because it affects the performance of the algorithm. At every split, choose the feature that gives a high information gain, i.e. one that produces branches whose samples are as pure as possible (e.g. the Facebook branch here contains only YES samples).  <br />
<img src="hidden\information-gain.png" alt="This image describes information gain" style="width: 750px;"/>  <br />
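To make information gain concrete, here is a small sketch that computes it with entropy for the six Google rows of this dataset. The helper names `entropy` and `information_gain` are illustrative, not part of any library. Note that scikit-learn's DecisionTreeClassifier uses Gini impurity by default; entropy is available through criterion="entropy".

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# The six Google rows from the dataset (job, degree, salary_more_then_100k)
job = ["sales executive", "sales executive", "business manager",
       "business manager", "computer programmer", "computer programmer"]
degree = ["bachelors", "masters", "bachelors", "masters", "bachelors", "masters"]
target = [0, 0, 1, 1, 0, 1]

gain_job = information_gain(job, target)
gain_degree = information_gain(degree, target)
print(f"gain when splitting on job:    {gain_job:.3f}")
print(f"gain when splitting on degree: {gain_degree:.3f}")
```

Within the Google branch, splitting on job gives the larger gain, which is consistent with asking about the job title before the degree.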

### Load the dataset

In [146]:
import pandas as pd

# Load the dataset from CSV file into pandas dataframe called df
df = pd.read_csv("salaries.csv")

# show the first five rows of the df dataframe
df.head()

Unnamed: 0,company,job,degree,salary_more_then_100k
0,google,sales executive,bachelors,0
1,google,sales executive,masters,0
2,google,business manager,bachelors,1
3,google,business manager,masters,1
4,google,computer programmer,bachelors,0


### Data preprocessing

In [147]:
# save the independent variables to a variable called inputs, with dataframe format
inputs = df.drop('salary_more_then_100k',axis='columns')

# show the independent variables in inputs dataframe
inputs

Unnamed: 0,company,job,degree
0,google,sales executive,bachelors
1,google,sales executive,masters
2,google,business manager,bachelors
3,google,business manager,masters
4,google,computer programmer,bachelors
5,google,computer programmer,masters
6,abc pharma,sales executive,masters
7,abc pharma,computer programmer,bachelors
8,abc pharma,business manager,bachelors
9,abc pharma,business manager,masters


In [148]:
# save the dependent variable to a variable called target (a pandas Series, since a single column is selected)
target = df['salary_more_then_100k']

# show the dependent variable in the target Series
target

0     0
1     0
2     1
3     1
4     0
5     1
6     0
7     0
8     0
9     1
10    1
11    1
12    1
13    1
14    1
15    1
Name: salary_more_then_100k, dtype: int64

In [149]:
# encode the categorical variables of the independent variables from categorical names (text labels) into integer labels using label encoder
# since all the 3 independent variables (company, job, and degree) are categorical variables, you need to create 3 label encoder objects (3 label encoders with different names)

from sklearn.preprocessing import LabelEncoder

le_company = LabelEncoder() # create a label encoder called le_company
le_job = LabelEncoder()  # create a label encoder called le_job
le_degree = LabelEncoder()  # create a label encoder called le_degree

After encoding, the mapping between text labels and integer labels for each categorical variable is as follows (LabelEncoder assigns integers in sorted order of the labels):

Categorical variable: company
1) google       -> 2
2) abc pharma   -> 0
3) facebook     -> 1

Categorical variable: job
1) sales executive      -> 2
2) business manager     -> 0
3) computer programmer  -> 1

Categorical variable: degree
1) bachelors    -> 0
2) masters      -> 1

In [150]:
# add a company_encoded column: le_company turns the text labels of the company column into integer labels
inputs['company_encoded'] = le_company.fit_transform(inputs['company'])
# add a job_encoded column: le_job turns the text labels of the job column into integer labels
inputs['job_encoded'] = le_job.fit_transform(inputs['job'])
# add a degree_encoded column: le_degree turns the text labels of the degree column into integer labels
inputs['degree_encoded'] = le_degree.fit_transform(inputs['degree'])
# show the inputs data frame
inputs

Unnamed: 0,company,job,degree,company_encoded,job_encoded,degree_encoded
0,google,sales executive,bachelors,2,2,0
1,google,sales executive,masters,2,2,1
2,google,business manager,bachelors,2,0,0
3,google,business manager,masters,2,0,1
4,google,computer programmer,bachelors,2,1,0
5,google,computer programmer,masters,2,1,1
6,abc pharma,sales executive,masters,0,2,1
7,abc pharma,computer programmer,bachelors,0,1,0
8,abc pharma,business manager,bachelors,0,0,0
9,abc pharma,business manager,masters,0,0,1


In [151]:
# create a dataframe that consists only of the independent variables with integer labels (by dropping the columns with text labels)
inputs_encoded = inputs.drop(['company','job','degree'], axis='columns')

# show the inputs_encoded dataframe
inputs_encoded

Unnamed: 0,company_encoded,job_encoded,degree_encoded
0,2,2,0
1,2,2,1
2,2,0,0
3,2,0,1
4,2,1,0
5,2,1,1
6,0,2,1
7,0,1,0
8,0,0,0
9,0,0,1


### Develop the machine learning model (decision tree model)

Information retrieved from: https://stackoverflow.com/questions/69326639/sklearn-warning-valid-feature-names-in-version-1-0  <br />
1) The warning:   <br />
    <img src="hidden\SKLearn-warning.png" alt="This image describes the SKLearn valid feature name warning" style="width: 400px;"/>  <br />
2) The solution:  <br />
    <img src="hidden\SKLearni-warning-solution.png" alt="This image describes the solution to SKLearn valid feature name warning" style="width: 400px;"/>  <br />


Information retrieved from: https://www.geeksforgeeks.org/python-pandas-dataframe-values/
1) Explanation of how .values on a dataframe solves the "valid feature name" warning from SKLearn: .values returns only the values (entries) of the dataframe, without its column names.  <br />
    1) <img src="hidden\values-function1.png" alt="This image describes the .values attribute, part 1" style="width: 400px;"/>  <br />
    2) <img src="hidden\values-function2.png" alt="This image describes the .values attribute, part 2" style="width: 400px;"/>  <br />

In [152]:
from sklearn import tree

# create the machine learning model (decision tree model/classifier)
model = tree.DecisionTreeClassifier() 

# train the decision tree model
# fitting on .values (entries only, no column names) avoids the "valid feature name" warning from SKLearn when predicting on plain lists later
model.fit(inputs_encoded.values, target)

# show the accuracy of the trained model on the training data itself (not on unseen data)
print(model.score(inputs_encoded.values, target))

1.0
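A score of 1.0 here only says that the tree reproduces its own training data. To see which splits a fitted tree actually learned, scikit-learn's export_text can print them as text. The sketch below is self-contained and assumes only the six encoded Google rows shown earlier, so the printed tree is smaller than the one trained on the full dataset.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded Google rows from inputs_encoded: [company_encoded, job_encoded, degree_encoded]
X = [[2, 2, 0], [2, 2, 1], [2, 0, 0], [2, 0, 1], [2, 1, 0], [2, 1, 1]]
# Matching salary_more_then_100k values
y = [0, 0, 1, 1, 0, 1]

# random_state is fixed only to make the sketch reproducible
sketch_model = DecisionTreeClassifier(random_state=0)
sketch_model.fit(X, y)

# Print the learned splits as indented text rules
rules = export_text(
    sketch_model,
    feature_names=["company_encoded", "job_encoded", "degree_encoded"],
)
print(rules)
```

Because the encoded features are treated as numbers, the learned splits are thresholds such as "job_encoded <= 0.50" rather than one branch per category.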


### Apply the trained machine learning model (decision tree model)

Predict a given sample

In [153]:
# predict for a person who works at google as a sales executive with a masters degree
# the input follows the feature order used in training: [[company_encoded, job_encoded, degree_encoded]]
print(model.predict([[2,2,1]]))

[0]


In [154]:
# predict for a person who works at google as a business manager with a masters degree
# the input follows the feature order used in training: [[company_encoded, job_encoded, degree_encoded]]
print(model.predict([[2,0,1]]))

[1]
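Typing the integer codes by hand is error-prone, so one option is to let the encoders translate text labels before predicting. The following is a minimal, self-contained sketch: the helpers `encode` and `predict_from_text` are hypothetical, the encoders are fitted on the label lists from the mapping section, and the model is trained only on the ten rows shown in the outputs above rather than the full CSV.

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Fit one encoder per feature on the known labels (same mapping as listed above)
le_company = LabelEncoder().fit(["abc pharma", "facebook", "google"])
le_job = LabelEncoder().fit(["business manager", "computer programmer", "sales executive"])
le_degree = LabelEncoder().fit(["bachelors", "masters"])


def encode(company, job, degree):
    """Turn text labels into [company_encoded, job_encoded, degree_encoded]."""
    return [
        int(le_company.transform([company])[0]),
        int(le_job.transform([job])[0]),
        int(le_degree.transform([degree])[0]),
    ]


# The ten rows displayed earlier: (company, job, degree, salary_more_then_100k)
rows = [
    ("google", "sales executive", "bachelors", 0),
    ("google", "sales executive", "masters", 0),
    ("google", "business manager", "bachelors", 1),
    ("google", "business manager", "masters", 1),
    ("google", "computer programmer", "bachelors", 0),
    ("google", "computer programmer", "masters", 1),
    ("abc pharma", "sales executive", "masters", 0),
    ("abc pharma", "computer programmer", "bachelors", 0),
    ("abc pharma", "business manager", "bachelors", 0),
    ("abc pharma", "business manager", "masters", 1),
]
X = [encode(company, job, degree) for company, job, degree, _ in rows]
y = [label for *_, label in rows]

text_model = DecisionTreeClassifier(random_state=0).fit(X, y)


def predict_from_text(company, job, degree):
    """Encode the text labels, then return the predicted class as an int."""
    return int(text_model.predict([encode(company, job, degree)])[0])


print(predict_from_text("google", "sales executive", "masters"))
print(predict_from_text("google", "business manager", "masters"))
```

Passing a label the encoder never saw during fit raises an error, which is usually preferable to silently predicting from a wrong code.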
