# Úkol č. 4 - regrese
**Deadline úkolu je uveden na [course pages](https://courses.fit.cvut.cz/BI-VZD/homeworks/index.html).**

  * Cílem tohoto úkolu je vyzkoušet si řešit regresní problém na reálných datech.
  
> **Nejdůležitější na úkolu je to, abyste udělali vše procesně správně: korektní rozdělení datasetu, ladění hyperparametrů, vyhodnocení výsledků atp.**

## Dataset

  * Zdrojem dat je soubor `LifeExpectancyData.csv` na course pages (originál zde: https://www.kaggle.com/kumarajarshi/life-expectancy-who).
  * Popis datasetu najdete na uvedené stránce s originálem datasetu.
  * Cílová (vysvětlovaná) proměnná se jmenuje `Life expectancy `.
  

## Pokyny k vypracování
Body zadání, za jejichž (poctivé) vypracování získáte 12 bodů:

  1. Odeberte z dat body u kterých neznáte vysvětlovanou proměnnou.
  1. Rozdělte data na trénovací a testovací množinu.
  1. Proveďte základní průzkum dat. Na jeho základě adekvátně reagujte na problematické věci v datech (chybějící hodnoty, atd.).
  1. Aplikujte lineární a hřebenovou regresi a výsledky řádně vyhodnoťte:
    * K měření chyby použijte `mean_absolute_error`.
    * Experimentujte s tvorbou nových příznaků (na základě těch dostupných).
    * Experimentujte se standardizací/normalizací dat.
    * Vyberte si hyperparametry modelů k ladění a najděte jejich nejlepší hodnoty.
  1. Použijte i jiný model než jen lineární a hřebenovou regresi.


## Poznámky k odevzdání

  * Řiďte se pokyny ze stránky https://courses.fit.cvut.cz/BI-VZD/homeworks/index.html.
  * Odevzdejte tento Jupyter Notebook.
  * Opravující Vám může umožnit úkol dodělat či opravit a získat tak další body. První verze je ale důležitá a bude-li odbytá, budete za to penalizováni.

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler
from sklearn.impute import KNNImputer

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neighbors import KNeighborsRegressor

import matplotlib
import matplotlib.pyplot as plt
from plotly import graph_objects as go
import plotly.express as px

%matplotlib inline

np.set_printoptions(precision=5, suppress=True)  # suppress scientific float notation (so 0.000 is printed as 0.

random_seed = 727

In [66]:
def encode_categories(data, dummies=False):
    label_encoder = LabelEncoder()
    for col in data.select_dtypes('object').columns:
        data[col] = data[col].fillna('NaN')
        data[col] = label_encoder.fit_transform(df[col])
        if dummies:
            data = pd.concat([
                data.drop(columns=[col]), pd.get_dummies(data[col], prefix=('d_' + col))
            ], axis=1)
    return data

def simplePreprocessing(data):
    data.rename(columns={'Life expectancy ':'Life expectancy', 
    'Measles ' : 'Measles',
    ' BMI ' : 'BMI',
    'under-five deaths ' : 'under-five deaths ',
    'Diphtheria ' : 'Diphtheria',
    ' HIV/AIDS' : 'HIV/AIDS',
    ' thinness  1-19 years' : 'thinness  1-19 years',
    ' thinness 5-9 years' : 'thinness 5-9 years'
    }, inplace=True)
    
    #removes datapoints with missing target
    data.drop(data[data['Life expectancy'].isna()].index, inplace=True)
    
    data = data.drop(columns=['Country', 'Year'])
    
    data.rename(columns={'Status' : 'Developed'}, inplace=True)
    
    data['Developed'] = data['Developed'].apply(lambda x: 1 if x == 'Developed' else 0)
    
    return encode_categories(data, True)

def removeFloat(data):
    data['Life expectancy'] = data['Life expectancy'].apply(lambda x: x*10).astype('int64')
    data['Adult Mortality'] = data['Adult Mortality'].apply(lambda x: x*10).astype('int64')
    data['Alcohol'] = data['Alcohol'].apply(lambda x: x*100).astype('int64')
    data['Hepatitis B'] = data['Alcohol'].astype('int64')
    data['BMI'] = data['BMI'].astype('int64')
    data['Polio'] = data['Polio'].astype('int64')
    data['Diphtheria'] = data['Diphtheria'].astype('int64')
    return data

def transformDataFunctionCreation(train):

    scaler = MinMaxScaler()
    train = pd.DataFrame(scaler.fit_transform(train),index=train.index, columns=train.columns)
    def scale(x):
        return pd.DataFrame(scaler.transform(x),index=x.index, columns=x.columns)

    imputer = KNNImputer(n_neighbors=5, weights='distance')
    train = pd.DataFrame(imputer.fit_transform(train),index=train.index, columns=train.columns)
        
    def transformFunction(x):
        x = scale(x)
        x = pd.DataFrame( imputer.transform(x),index=x.index, columns=x.columns)
        return x
        
    
    return train, transformFunction

In [63]:
df = pd.read_csv('LifeExpectancyData.csv')
df = simplePreprocessing(df)
display(df)
X_rest, X_test, y_rest, y_test = train_test_split(
    df.drop(columns=['Life expectancy']), df['Life expectancy'], test_size=0.2, random_state=random_seed
)

Unnamed: 0,Developed,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,0,65.0,263.0,62,0.01,71.279624,65.0,1154,19.1,83,6.0,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1
1,0,59.9,271.0,64,0.01,73.523582,62.0,492,18.6,86,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,0,59.9,268.0,66,0.01,73.219243,64.0,430,18.1,89,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9
3,0,59.5,272.0,69,0.01,78.184215,67.0,2787,17.6,93,67.0,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8
4,0,59.2,275.0,71,0.01,7.097109,68.0,3013,17.2,97,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,0,44.3,723.0,27,4.36,0.000000,68.0,31,27.1,42,67.0,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2
2934,0,44.5,715.0,26,4.06,0.000000,7.0,998,26.7,41,7.0,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5
2935,0,44.8,73.0,25,4.43,0.000000,73.0,304,26.3,40,73.0,6.53,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0
2936,0,45.3,686.0,25,1.72,0.000000,76.0,529,25.9,39,76.0,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8


In [64]:
X, f = transformDataFunctionCreation(X_rest)
X

Unnamed: 0,Developed,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
664,0.0,0.015235,0.000556,0.245665,0.037102,0.959184,0.000000,0.719160,0.000417,1.000000,0.082414,0.958763,0.000000,0.045101,0.023667,0.120438,0.109155,0.815873,0.826087
1114,0.0,0.319945,0.000556,0.415607,0.018999,0.948980,0.000000,0.038058,0.000417,0.947917,0.352873,0.948454,0.035644,0.019517,0.000578,0.204380,0.190141,0.656085,0.541063
2915,0.0,0.727147,0.018333,0.129480,0.000097,0.816327,0.002730,0.228346,0.021667,0.833333,0.333140,0.814433,0.312871,0.000096,0.009571,0.251825,0.239437,0.506878,0.526570
325,0.0,0.128809,0.000000,0.261850,0.033006,0.897959,0.000268,0.678478,0.000000,0.062500,0.534533,0.896907,0.000000,0.038682,0.000288,0.091241,0.088028,0.758730,0.642512
161,0.0,0.020776,0.000000,0.545665,0.000000,0.969388,0.000000,0.818898,0.000000,0.968750,0.427742,0.969072,0.000000,0.093885,0.017063,0.087591,0.084507,0.834921,0.608696
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2710,0.0,0.315789,0.003889,0.134104,0.006826,0.969388,0.000065,0.489501,0.003333,1.000000,0.172374,0.989691,0.000000,0.008108,0.000004,0.124088,0.119718,0.000000,0.492754
930,0.0,0.106648,0.001111,0.343174,0.000000,0.867347,0.000934,0.801837,0.001250,0.989583,0.430273,0.989691,0.000000,0.306493,0.005149,0.021898,0.017606,0.946032,0.787440
1487,0.0,0.860111,0.002778,0.169942,0.000185,0.584126,0.000000,0.321522,0.002917,0.843750,0.379571,0.845361,0.641584,0.000388,0.000149,0.054745,0.052817,0.471958,0.502415
2581,0.0,0.260388,0.008333,0.354335,0.014809,0.969388,0.026882,0.019685,0.007500,0.989583,0.186303,0.989691,0.009901,0.019780,0.049893,0.328467,0.323944,0.704762,0.570048


In [17]:
classifiers = [(LinearRegression, {}),
               (Ridge, {}),
               (DecisionTreeRegressor, {'max_depth': range(1,101), 'criterion': ['entropy', 'gini']}),
               (RandomForestRegressor, {'n_estimators': range(1, 100, 5), 'max_depth': range(1, 5)}),
               (AdaBoostRegressor,  {'n_estimators': range(1,100,5), 'learning_rate': [0.01, 0.05, 0.1, 0.3, 0.5, 1]}),
               (KNeighborsRegressor, {'n_neighbors' : range(2, 100)})]
n_splits = 5

In [67]:
lr_model = LinearRegression()

X, preprocessFunc = transformDataFunctionCreation(X_rest)
lr_model.fit(X, y_rest)

lr_pred = lr_model.predict(preprocessFunc(X_test))

lr_coef = lr_model.coef_
lr_error = np.sqrt(mean_absolute_error(y_test, lr_pred))

print(lr_error)

1911.1909362615686


In [68]:
for c, p in classifiers:
    model = c()
    print('using', c.__name__)
    X_train, preprocessFunc = transformDataFunctionCreation(X_rest)
    model.fit(X_train, y_rest)
    pred = model.predict(preprocessFunc(X_test))

    error = np.sqrt(mean_absolute_error(y_test, pred))

    print(lr_error)

using LinearRegression
1911.1909362615686
using Ridge
1911.1909362615686
using DecisionTreeRegressor
1911.1909362615686
using RandomForestRegressor
1911.1909362615686
using AdaBoostRegressor
1911.1909362615686
using KNeighborsRegressor
1911.1909362615686
