# Decision Tree Regression on the World Population

In this test we'll train a simple decision tree model using the world population data from the Analyse Supplementary Exam. 

### Imports

In [2]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

In [3]:
population_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/world_population.csv', index_col='Country Code')

In [3]:
population_df.head()

Country Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
ABW,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,57715.0,58055.0,58386.0,58726.0,...,101353.0,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0,105264.0
AFG,8996351.0,9166764.0,9345868.0,9533954.0,9731361.0,9938414.0,10152331.0,10372630.0,10604346.0,10854428.0,...,27294031.0,28004331.0,28803167.0,29708599.0,30696958.0,31731688.0,32758020.0,33736494.0,34656032.0,35530081.0
AGO,5643182.0,5753024.0,5866061.0,5980417.0,6093321.0,6203299.0,6309770.0,6414995.0,6523791.0,6642632.0,...,21759420.0,22549547.0,23369131.0,24218565.0,25096150.0,25998340.0,26920466.0,27859305.0,28813463.0,29784193.0
ALB,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,1914573.0,1965598.0,2022272.0,2081695.0,...,2947314.0,2927519.0,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0
AND,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,19647.0,20758.0,21890.0,23058.0,...,83861.0,84462.0,84449.0,83751.0,82431.0,80788.0,79223.0,78014.0,77281.0,76965.0


### Question 1

The world population data spans from 1960 to 2017. We'd like to build a predictive model that can give us the best guess at what the world population in a given future year might be. To do this, we're going to hold out the 2017 value from our training data and use it to test the accuracy of our prediction.

Since the given dataframe (`population_df`) only has population by country per year, we need to find the **total** world population for each year. To achieve this, we'll write a function that computes the sum of the populations of the different countries in `population_df` for each year. This function must return a 2-d numpy array that contains the year and the total world population.

_**Function Specifications:**_
* Should have no input and return a numpy `array` type as output.
* The array should only have two columns containing the year and the population, in other words, it should have a shape `(?, 2)` where `?` is the length of the data.
* The values within the array should be of type `np.int64`.
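
One possible way to meet these specifications is sketched below on a small made-up table rather than the real data. The country codes and values in `toy_df` are invented for illustration. Each year column is summed, and the years are then stacked next to the totals:

```python
import numpy as np
import pandas as pd

# Toy stand-in for population_df: one row per country, one column per year.
toy_df = pd.DataFrame(
    {"1960": [10.0, 20.0, np.nan], "1961": [11.0, 22.0, 5.0]},
    index=pd.Index(["AAA", "BBB", "CCC"], name="Country Code"),
)

# Sum over countries for each year; missing values are skipped by default.
totals = toy_df.sum(axis=0)

# Column labels are strings, so convert them to integers explicitly.
years = np.array([int(label) for label in totals.index], dtype=np.int64)

# Stack into a (n_years, 2) array of [year, total population].
result = np.column_stack([years, totals.to_numpy().astype(np.int64)])
print(result.shape, result.dtype)
```

The same idea should carry over to `population_df`, assuming its columns are all year labels once `Country Code` is the index.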

_**Further Reading:**_

Data types are associated with memory allocation. As such, your choice of data type affects the range and precision of the values your program can represent. For example, the `np.int32` data type in numpy can only store values between -2147483648 and 2147483647; values outside this range overflow, which can cause run-time errors or silently wrong results. To avoid this, we can use data types with larger memory capacity, e.g. `np.int64`.

https://docs.scipy.org/doc/numpy/user/basics.types.html
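
As a rough illustration of the range limits described above, the sketch below uses `np.iinfo` to check whether the 2017 world total from this notebook's expected output fits in a 32-bit or a 64-bit integer. The variable names are illustrative and not part of the exercise.

```python
import numpy as np

int32_info = np.iinfo(np.int32)
int64_info = np.iinfo(np.int64)

# 2017 world total, taken from the expected output of Question 1.
world_population_2017 = 7501739318

fits_in_int32 = int32_info.min <= world_population_2017 <= int32_info.max
fits_in_int64 = int64_info.min <= world_population_2017 <= int64_info.max

print("int32 range:", int32_info.min, "to", int32_info.max)
print("fits in int32:", fits_in_int32)
print("fits in int64:", fits_in_int64)
```

The world totals exceed the 32-bit range, which is why the specification asks for `np.int64`.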

In [4]:
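def get_total_population_by_country_year():
    # Reshape from wide (one column per year) to long (one row per country-year).
    df = population_df.melt(var_name='Year', value_name='Population')
    # Sum the populations of all countries for each year.
    df = df.groupby('Year').sum()
    # Return a (n_years, 2) array of [year, total population] as 64-bit integers.
    arr = df.reset_index().values.astype(np.int64)

    return arr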

get_total_population_by_country_year()

array([[      1960, 3014940395],
       [      1961, 3055691989],
       [      1962, 3108379009],
       [      1963, 3173207428],
       [      1964, 3238441149],
       [      1965, 3305101319],
       [      1966, 3374903353],
       [      1967, 3444384585],
       [      1968, 3514639116],
       [      1969, 3589069293],
       [      1970, 3664271341],
       [      1971, 3741545439],
       [      1972, 3818075376],
       [      1973, 3893726301],
       [      1974, 3970035481],
       [      1975, 4044577268],
       [      1976, 4117105339],
       [      1977, 4189387395],
       [      1978, 4262884975],
       [      1979, 4338225244],
       [      1980, 4414334568],
       [      1981, 4492427948],
       [      1982, 4573445316],
       [      1983, 4655199096],
       [      1984, 4736682102],
       [      1985, 4819699772],
       [      1986, 4905221325],
       [      1987, 4992879504],
       [      1988, 5081453078],
       [      1989, 5170171686],
       [  

_**Expected Outputs:**_
```python
get_total_population_by_country_year()
```
> ```
array([[      1960, 3014940395],
       [      1961, 3055691989],
       [      1962, 3108379009],
       [      1963, 3173207428],
        ...
       [      2015, 7329250474],
       [      2016, 7415694711],
       [      2017, 7501739318]], dtype=int64)
```




### Question 2

Now that we have our data, we need to split it into a set we will train on and a set we will make predictions on. In this case, we split the values so that we train on all but the last year in our dataset. We also need to split our data into the predictive features (denoted `X`) and the response (denoted `y`).

Write a function that takes a 2-d numpy array as input and returns four variables in the form `(X_train, y_train), (X_test, y_test)`, where `(X_train, y_train)` are the features / response of the training set, and `(X_test, y_test)` are the features / response of the testing set.

_**Function Specifications:**_
* Should take a 2-d numpy `array` as input.
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.
* `(X_test, y_test)` should just be the last entry of the given input. They should be in the form of an `array`, not single values.
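
The difference between returning an `array` and a single value comes down to how the last row is indexed. A minimal sketch on a made-up three-row array (the values in `toy` are illustrative only):

```python
import numpy as np

toy = np.array([[2015, 100], [2016, 110], [2017, 120]], dtype=np.int64)

# Slicing with -1: keeps the dimension, giving a one-element array.
last_as_array = toy[-1:, 0]

# Indexing with -1 drops the dimension, giving a scalar.
last_as_scalar = toy[-1, 0]

print(last_as_array.shape)      # one-element 1-d array
print(np.ndim(last_as_scalar))  # zero dimensions
```

Keeping the test values as arrays means they can later be reshaped and passed to a model in the same way as the training data.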


In [5]:
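def feature_response_split(arr):
    # Train on every row except the last: column 0 is the year, column 1 the population.
    X_train = arr[:-1, 0]
    y_train = arr[:-1, 1]
    # Slice with -1: (not -1) so the test values stay one-element arrays, not scalars.
    X_test = arr[-1:, 0]
    y_test = arr[-1:, 1]

    return (X_train, y_train), (X_test, y_test)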

In [6]:
data = get_total_population_by_country_year()
feature_response_split(data)

((array([1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
         1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
         1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
         1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
         2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
         2015, 2016], dtype=int64),
  array([3014940395, 3055691989, 3108379009, 3173207428, 3238441149,
         3305101319, 3374903353, 3444384585, 3514639116, 3589069293,
         3664271341, 3741545439, 3818075376, 3893726301, 3970035481,
         4044577268, 4117105339, 4189387395, 4262884975, 4338225244,
         4414334568, 4492427948, 4573445316, 4655199096, 4736682102,
         4819699772, 4905221325, 4992879504, 5081453078, 5170171686,
         5267861414, 5355034619, 5439046865, 5523974088, 5607765176,
         5692526372, 5775191117, 5857799900, 5939330037, 6019808586,
         6099498206, 6178999138, 6258

_**Expected Outputs:**_
```python
data = get_total_population_by_country_year()
feature_response_split(data)
```
> ```
((array([1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
         1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
         1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
         1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
         2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
         2015, 2016], dtype=int64),
  array([3014940395, 3055691989, 3108379009, 3173207428, 3238441149,
         3305101319, 3374903353, 3444384585, 3514639116, 3589069293,
         3664271341, 3741545439, 3818075376, 3893726301, 3970035481,
         4044577268, 4117105339, 4189387395, 4262884975, 4338225244,
         4414334568, 4492427948, 4573445316, 4655199096, 4736682102,
         4819699772, 4905221325, 4992879504, 5081453078, 5170171686,
         5267861414, 5355034619, 5439046865, 5523974088, 5607765176,
         5692526372, 5775191117, 5857799900, 5939330037, 6019808586,
         6099498206, 6178999138, 6258066893, 6337336633, 6417178545,
         6497569010, 6578653086, 6660306328, 6743298983, 6826490839,
         6909731743, 6991803968, 7071734672, 7157142528, 7243184776,
         7329250474, 7415694711], dtype=int64)),
 (array([2017], dtype=int64), array([7501739318], dtype=int64)))
 ```

### Question 3

Now that we have formatted our data, we can fit a model using sklearn's `DecisionTreeRegressor` class. We'll write a function that will take as input the features and response variables that we created in the last question, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `DecisionTreeRegressor` model.
* The returned model should be fitted to the data.

_**Hint:**_
You may need to reshape the data within the function. You can use `.reshape(-1, 1)` to do this.
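
To see what the reshape in the hint does, the short sketch below turns a flat array of years into a single-column 2-d array, which is the `(n_samples, n_features)` layout that sklearn estimators expect for features. The array `years` is a made-up example.

```python
import numpy as np

years = np.array([1960, 1961, 1962], dtype=np.int64)

# -1 lets numpy infer the number of rows; 1 fixes a single column.
column = years.reshape(-1, 1)

print(years.shape)   # flat, one axis
print(column.shape)  # one row per sample, one feature column
```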


In [7]:
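def train_model(X_train, y_train):
    # sklearn expects a 2-d feature matrix of shape (n_samples, n_features).
    X_train = X_train.reshape(-1, 1)
    # Reshape the response into a single column as well.
    y_train = y_train.reshape(-1, 1)

    model = DecisionTreeRegressor()

    # fit() returns the fitted estimator itself.
    return model.fit(X_train, y_train)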

In [8]:
data = get_total_population_by_country_year()
(X_train, y_train), _ = feature_response_split(data)
train_model(X_train, y_train).predict([[2017]])

array([7.41569471e+09])

_**Expected Outputs:**_
```python
train_model(X_train, y_train).predict([[2017]]) == array([7.41569471e+09])
```
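
The prediction for 2017 matches the 2016 total because a decision tree predicts the value stored in whichever leaf an input falls into, so it cannot extrapolate a trend beyond the training range. A minimal sketch illustrates this, assuming only the last four training rows from Question 2 and a fixed `random_state` (added here purely for reproducibility):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Last four training rows from Question 2: years 2013 to 2016.
years = np.array([2013, 2014, 2015, 2016]).reshape(-1, 1)
totals = np.array(
    [7157142528, 7243184776, 7329250474, 7415694711], dtype=np.float64
)

model = DecisionTreeRegressor(random_state=0).fit(years, totals)

# Any year past 2016 lands in the same leaf as 2016.
pred_2017 = model.predict([[2017]])
pred_2050 = model.predict([[2050]])

print(pred_2017, pred_2050)
```

Both predictions equal the 2016 total, so the error in Question 4 reflects one year of population growth rather than a fitted trend.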

### Question 4

We would now like to evaluate the model on the testing data that we produced in Question 2. The evaluation metric is the Root Mean Squared Logarithmic Error (RMSLE), which is given by:

$$
\mathrm{RMSLE} = \sqrt{\frac{1}{N}\sum_{i=1}^N \left[\log(1+p_i) - \log(1+y_i)\right]^2}
$$

where $p_i$ refers to the $i^{\rm th}$ prediction made from `X_test`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.
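
As a worked instance of the formula, the sketch below evaluates it with numpy for the single test point in this notebook, using the 2017 prediction and the true 2017 total shown in the earlier outputs. The helper name `rmsle` is hypothetical and not part of the exercise; `np.log1p(x)` computes `log(1 + x)`.

```python
import numpy as np

def rmsle(predictions, targets):
    """Root mean squared logarithmic error between two equal-length arrays."""
    predictions = np.ravel(predictions).astype(np.float64)
    targets = np.ravel(targets).astype(np.float64)
    log_diff = np.log1p(predictions) - np.log1p(targets)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Prediction for 2017 (equal to the 2016 total) and the true 2017 total.
p = np.array([7415694711])
y = np.array([7501739318])

score = rmsle(p, y)
print(round(score, 3))  # 0.012
```

Because the metric works on logarithms, it measures relative rather than absolute error, which suits quantities as large as population totals.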

_**Function Specifications:**_
* Should take a trained model and two `arrays` as input. This will be the `X_test` and `y_test` variables from Question 2. 
* Should return the RMSLE of the model's predictions for `X_test` compared to the values of `y_test`.
* The output should be a `float` rounded to 3 decimal places.

_**Hint:**_
The Root Mean Squared Logarithmic Error is used to calculate the score in the Kaggle House Prices Competition.

In [9]:
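def test_model(model, X_test, y_test):
    # sklearn expects a 2-d feature matrix of shape (n_samples, 1).
    X_test = X_test.reshape(-1, 1)
    # Flatten predictions and targets so they align element-wise.
    y_pred = np.ravel(model.predict(X_test))
    y_true = np.ravel(y_test)

    # RMSLE: root of the mean squared difference between log(1 + p) and log(1 + y).
    rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

    return round(float(rmsle), 3)

In [10]:
data = get_total_population_by_country_year()
(X_train, y_train), (X_test, y_test) = feature_response_split(data)
lm = train_model(X_train, y_train)
test_model(lm, X_test, y_test)

0.012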

_**Expected Outputs:**_
```python
test_model(lm, X_test, y_test) == 0.012
```