This notebook introduces supervised learning by solving a classification problem with **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0). 

# **Classification Problem**
We will follow these steps to solve a machine learning problem:


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and Improve


# Problem description
In the text cell below, state what you will predict in this classification problem (y) and which columns you will use as inputs for the prediction (X).

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [2]:
df = pd.read_csv('originalfile.csv')
# Drop identifier columns that should not be used as features
df = df.drop(['Overall_rank', 'Country_or_region', 'Region_Number'], axis=1)
# Convert the float score to a category for classification
bins = [2.0, 3.0, 5.0, 6.0, 7.0, 10.0]
labels = ["Unhappiest", "Unhappy", "Neutral", "Happy", "Happiest"]
df['Score'] = pd.cut(df.Score, bins=bins, labels=labels)
# Replace each category with its integer code (0 = "Unhappiest", ..., 4 = "Happiest")
df['Score'] = df.Score.cat.codes
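To see what the binning step does, here is a minimal sketch that applies the same `bins` and `labels` to a handful of made-up scores (the values in `sample` are illustrative, not taken from the data set). It also shows that a score falling outside the bin range becomes a missing category, which `cat.codes` encodes as -1.

```python
import pandas as pd

bins = [2.0, 3.0, 5.0, 6.0, 7.0, 10.0]
labels = ["Unhappiest", "Unhappy", "Neutral", "Happy", "Happiest"]

# Made-up scores for illustration; 1.5 lies below the lowest bin edge
sample = pd.Series([2.5, 4.2, 5.5, 6.3, 7.8, 1.5])

# Each interval is open on the left and closed on the right, e.g. (2.0, 3.0]
binned = pd.cut(sample, bins=bins, labels=labels)
codes = binned.cat.codes

print(pd.DataFrame({"score": sample, "label": binned, "code": codes}))
```

If any row of the real data ended up with code -1, it would be worth widening the bins or dropping that row before training.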

2. Display columns and describe the data set

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Score                         156 non-null    int8   
 1   GDP_per_capita                156 non-null    float64
 2   Social_support                156 non-null    float64
 3   Healthy_life_expectancy       156 non-null    float64
 4   Freedom_to_make_life_choices  156 non-null    float64
 5   Generosity                    156 non-null    float64
 6   Perceptions_of_corruption     156 non-null    float64
dtypes: float64(6), int8(1)
memory usage: 7.6 KB


In [4]:
df.describe()

| | Score | GDP_per_capita | Social_support | Healthy_life_expectancy | Freedom_to_make_life_choices | Generosity | Perceptions_of_corruption |
|---|---|---|---|---|---|---|---|
| count | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 | 156.0 |
| mean | 2.051282 | 0.905147 | 1.208814 | 0.725244 | 0.392571 | 0.184846 | 0.110603 |
| std | 1.021036 | 0.398389 | 0.299191 | 0.242124 | 0.143289 | 0.095254 | 0.094538 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.60275 | 1.05575 | 0.54775 | 0.308 | 0.10875 | 0.047 |
| 50% | 2.0 | 0.96 | 1.2715 | 0.789 | 0.417 | 0.1775 | 0.0855 |
| 75% | 3.0 | 1.2325 | 1.4525 | 0.88175 | 0.50725 | 0.24825 | 0.14125 |
| max | 4.0 | 1.684 | 1.624 | 1.141 | 0.631 | 0.566 | 0.453 |


3. Prepare Data

In [5]:
# Run this section to inspect X
X = df.drop(columns = ['Score'])
X

| | GDP_per_capita | Social_support | Healthy_life_expectancy | Freedom_to_make_life_choices | Generosity | Perceptions_of_corruption |
|---|---|---|---|---|---|---|
| 0 | 0.350 | 0.517 | 0.361 | 0.000 | 0.158 | 0.025 |
| 1 | 0.947 | 0.848 | 0.874 | 0.383 | 0.178 | 0.027 |
| 2 | 1.002 | 1.160 | 0.785 | 0.086 | 0.073 | 0.114 |
| 3 | 1.092 | 1.432 | 0.881 | 0.471 | 0.066 | 0.050 |
| 4 | 0.850 | 1.055 | 0.815 | 0.283 | 0.095 | 0.064 |
| ... | ... | ... | ... | ... | ... | ... |
| 151 | 0.960 | 1.427 | 0.805 | 0.154 | 0.064 | 0.047 |
| 152 | 0.741 | 1.346 | 0.851 | 0.543 | 0.147 | 0.073 |
| 153 | 0.287 | 1.163 | 0.463 | 0.143 | 0.108 | 0.077 |
| 154 | 0.578 | 1.058 | 0.426 | 0.431 | 0.247 | 0.087 |


In [6]:
# Run this section to inspect y
y = df['Score']
y

0      1
1      1
2      2
3      3
4      1
      ..
151    1
152    2
153    1
154    1
155    1
Name: Score, Length: 156, dtype: int8
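Before training, it can help to check how many rows fall into each class, since a class with very few rows is hard to learn and may be missing from the training or test split entirely. The sketch below counts classes on a short made-up series (`y_demo` is a hypothetical stand-in for `y`); the same call should work on the real `y`.

```python
import pandas as pd

# Made-up class codes standing in for y
y_demo = pd.Series([1, 1, 2, 3, 1, 2, 4, 0], name="Score")

# Number of rows per class, ordered by class code
counts = y_demo.value_counts().sort_index()
print(counts)
```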

4. Split the data, train the model, and calculate accuracy

In [13]:
# Train on 80% of the data set and use the remaining 20% to test
X_train, X_test, y_train, y_test = train_test_split(
                                    X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy
score = accuracy_score(y_test, predictions)
score

0.6875
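Because `train_test_split` shuffles the rows randomly, the accuracy above changes from run to run, and a 32-row test set gives a noisy estimate. One common way to get a steadier number is k-fold cross-validation. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for `X` and `y`, and the names `X_demo`, `y_demo`, and `scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels (156 rows, 6 features)
X_demo, y_demo = make_classification(
    n_samples=156, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0)

# Fixing random_state makes the tree itself reproducible
demo_model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: train on 4 folds, score on the 5th, repeat 5 times
scores = cross_val_score(demo_model, X_demo, y_demo, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

On the real data, a class with fewer rows than the number of folds will trigger a warning, so the fold count may need to be reduced. Passing `random_state` to `train_test_split` is a simpler option if the goal is only a repeatable split.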

5. Persisting Models

In [14]:
# Save the model to file
joblib.dump(model, '2019HapinessIndex.joblib')


['2019HapinessIndex.joblib']

5.b. Import the model and make predictions

In [15]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('2019HapinessIndex.joblib')
predictions = model.predict(X_test)
predictions

array([2, 3, 1, 1, 3, 4, 3, 1, 3, 3, 1, 1, 2, 2, 2, 2, 1, 0, 4, 3, 1, 1,
       1, 3, 3, 1, 1, 1, 2, 4, 3, 3], dtype=int8)
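The predictions are integer codes, which are hard to read on their own. Since the codes were produced by `cat.codes` from the `labels` list in the import step, each code is an index into that list. A minimal sketch of the reverse mapping, using a short made-up array (`demo_predictions` is hypothetical):

```python
import numpy as np

labels = ["Unhappiest", "Unhappy", "Neutral", "Happy", "Happiest"]

# Made-up predictions in the same format the model returns
demo_predictions = np.array([2, 3, 1, 1, 4, 0], dtype=np.int8)

# Each code is the position of its label in the list
predicted_labels = [labels[code] for code in demo_predictions]
print(predicted_labels)
```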

6. (Optional) Visualize decision trees

In [16]:
tree.export_graphviz(model, out_file = '2019HapinessIndex.dot',
                    feature_names = list(X.columns),
                    # One name per class, in the order the model stores them
                    class_names = [str(c) for c in model.classes_],
                    label = 'all',
                    rounded = True,
                    filled = True)

# Download the file 2019HapinessIndex.dot and open it in VS Code.
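If no Graphviz viewer is available, scikit-learn can also print a tree as plain text with `export_text`. The sketch below is a stand-in example rather than part of the original notebook: it fits a shallow tree on the built-in iris data set so that it runs on its own. For the model in this notebook, the equivalent call would pass `model` and `list(X.columns)` instead.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Built-in data set used only so the example is self-contained
iris = load_iris()

# A shallow tree keeps the printed rules short
demo_model = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_model.fit(iris.data, iris.target)

# One line per split or leaf, indented by depth
rules = export_text(demo_model, feature_names=list(iris.feature_names))
print(rules)
```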
