# Decision trees and random forests

## Data

Use the [World Values Survey](http://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp) data files, together with the corresponding questionnaire and codebook files, to understand what is in the data.

## Overarching research question

Explain which variables affect happiness (`V10`) using decision-tree learning.

## Method

Many tools exist for this; we use [scikit-learn](https://scikit-learn.org/stable/modules/tree.html).

In [None]:
## create a new data matrix for the decision tree analysis

import csv

dependent = 'V10'
independent = [ 'V4', 'V5', 'V6', 'V7', 'V8', 'V9' ]

dependent_data = []
independent_data = []

## open the file in a with block so that it is closed after reading
with open( 'data/wvs.csv', newline = '' ) as data_file:

    for line in csv.DictReader( data_file ):

        ## csv returns every value as a string
        dependent_data.append( line[ dependent ] )
        independent_data.append( [ line[ variable ] for variable in independent ] )

print( len( dependent_data ) ) ## number of rows read
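
The loop above keeps every row, including rows that contain codes for missing answers. A minimal sketch of a stricter loader follows, assuming that negative codes (such as `-1` or `-2`) mark missing answers in this file; check the codebook before relying on that. The inline `sample_csv` text and the `load_rows` helper are hypothetical stand-ins so that the sketch runs without `data/wvs.csv`.

```python
import csv
import io

dependent = 'V10'
independent = ['V4', 'V5', 'V6', 'V7', 'V8', 'V9']

# Hypothetical stand-in for data/wvs.csv; the values are made up.
sample_csv = """V4,V5,V6,V7,V8,V9,V10
1,2,1,3,1,2,1
2,-1,1,2,1,1,2
1,1,2,2,3,1,-2
3,2,2,1,1,4,3
"""

def load_rows(handle):
    """Return (X, y) as integers, skipping rows with a negative code."""
    X, y = [], []
    for line in csv.DictReader(handle):
        values = [int(line[v]) for v in independent + [dependent]]
        if min(values) < 0:  # assumption: negative codes mean "missing"
            continue
        X.append(values[:-1])
        y.append(values[-1])
    return X, y

X, y = load_rows(io.StringIO(sample_csv))
print(len(X), 'rows kept')
```

With the real file, the same helper could be called on the handle returned by `open('data/wvs.csv', newline='')`.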

In [None]:
from sklearn import tree

decision_tree = tree.DecisionTreeClassifier( max_depth = 3 ) # use tree.DecisionTreeRegressor() instead for a regression model
model = decision_tree.fit( independent_data, dependent_data )

In [None]:
import graphviz
from IPython.display import display
image = tree.export_graphviz( model, feature_names = independent, class_names = True ) # label the splits with the variable names
display( graphviz.Source( image ) ) ## alternatively, write the graphviz source to a file and render it with another tool
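
If graphviz is not available, a fitted tree can also be printed as plain text with `tree.export_text`. The sketch below uses made-up toy data in which the label depends only on the first column, so the printed rules are easy to check by eye; `toy_model` and the data are illustrative assumptions, not part of the WVS analysis.

```python
from sklearn import tree

feature_names = ['V4', 'V5', 'V6', 'V7', 'V8', 'V9']

# Made-up toy data: the label depends only on the first column.
X = [[a, b, 1, 1, 1, 1] for a in (1, 2, 3, 4) for b in (1, 2)]
y = [1 if row[0] <= 2 else 2 for row in X]

toy_model = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text prints one line per split or leaf, indented by depth.
rules = tree.export_text(toy_model, feature_names=feature_names)
print(rules)
```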

## Model analysis

As we discussed, there are [many different approaches to evaluating the quality of a model](https://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation). Beyond single metrics (such as the accuracy score or the kappa score), a confusion matrix can help to understand where the model makes its errors.

In [None]:
from sklearn import metrics

predicted_values = model.predict( independent_data )
metrics.confusion_matrix( dependent_data, predicted_values ) # true values first: rows are true classes, columns are predictions
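
The single metrics mentioned above can be computed alongside the confusion matrix. The following is a minimal sketch with made-up labels standing in for `dependent_data` and the model predictions. Note that the scikit-learn metric functions expect the true values as the first argument.

```python
from sklearn import metrics

# Made-up labels, for illustration only.
true_values = [1, 1, 2, 2, 3, 3, 1, 2]
predicted = [1, 1, 2, 1, 3, 2, 1, 2]

accuracy = metrics.accuracy_score(true_values, predicted)
kappa = metrics.cohen_kappa_score(true_values, predicted)
matrix = metrics.confusion_matrix(true_values, predicted)  # rows: true, columns: predicted

print('accuracy', accuracy)
print('kappa', kappa)
print(matrix)
```

The kappa score corrects the accuracy for the agreement expected by chance, which matters when one answer category dominates the data.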

## Tasks

* Choose better or more interesting variables to model.
* Improve the data preprocessing (remove missing values etc.).
* Split the data into a training set and a test set before fitting the model. Does that improve the analysis at all?
* Increase the maximum depth of the decision tree. Does it improve the analysis at all?
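
As a starting point for the last two tasks, here is a minimal sketch of a train/test split combined with different tree depths. It runs on synthetic data from `make_classification` rather than on the WVS file; the sample size, the depths and the 25% test share are arbitrary assumptions.

```python
from sklearn import model_selection, tree
from sklearn.datasets import make_classification

# Synthetic stand-in for independent_data / dependent_data (not WVS data).
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for depth in (1, 3, 10, None):  # None lets the tree grow until the leaves are pure
    model = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each depth.
    scores[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(depth, scores[depth])
```

Comparing the two numbers per depth shows whether a deeper tree only memorises the training data or also predicts unseen rows better.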

## Random forests: a decision tree on steroids

The challenge with a decision tree is that it fits a single model to the dataset.
Like many other machine learning algorithms (both supervised and unsupervised), its result depends on a single random state.
A single tree can therefore overfit or fall short of the best achievable accuracy.
[Random forests](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) address these issues by training many trees, each on a random sample of the data, and combining their predictions into one classifier.
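
The idea of combining many trees can be sketched by hand. The following is a simplified illustration and not scikit-learn's actual implementation: each tree is trained on a bootstrap sample (rows drawn with replacement) and considers a random subset of the features at each split, and the trees then vote. The synthetic data and the choice of 25 trees are assumptions for illustration.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic two-class data (labels 0 and 1), not WVS data.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    rows = rng.integers(0, len(X), size=len(X))  # bootstrap: sample rows with replacement
    t = tree.DecisionTreeClassifier(max_features='sqrt', random_state=i)
    trees.append(t.fit(X[rows], y[rows]))

# Each tree predicts every row; the majority label wins.
votes = np.array([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)  # works because the labels are 0 and 1
print('agreement with true labels:', (majority == y).mean())
```

An odd number of trees avoids ties in the vote for a two-class problem.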

In [None]:
from sklearn import ensemble

random_forest = ensemble.RandomForestClassifier()
model = random_forest.fit( independent_data, dependent_data )

In [None]:
predicted_values = model.predict( independent_data )
metrics.confusion_matrix( dependent_data, predicted_values ) # true values first: rows are true classes, columns are predictions
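
Because the research question asks which variables matter, the impurity-based `feature_importances_` of a fitted forest may be a useful next step. The sketch below runs on synthetic data in which only the first three columns carry information; the `V4` to `V9` names are attached purely for display, and the forest settings are assumptions.

```python
from sklearn import ensemble, model_selection
from sklearn.datasets import make_classification

feature_names = ['V4', 'V5', 'V6', 'V7', 'V8', 'V9']

# Synthetic stand-in data (not WVS data); shuffle=False keeps the 3 informative columns first.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0)

forest = ensemble.RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print('test accuracy', forest.score(X_test, y_test))
# Importances sum to 1; a larger value means the variable contributed more to the splits.
for name, importance in sorted(zip(feature_names, forest.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(name, round(importance, 3))
```

Importances describe how the model uses the variables, not causal effects, so they should be read together with the held-out accuracy.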