# Linear Regression



# Objective

The purpouse of this notebook is to practice basic concepts of linear regression such as preprocessing, training a model or evaluating the model.

# The data

The data was collected by insurance companies in US. The dataset is loaded in a csv file with the following features:

- age
- sex
- bmi
- children
- smoker
- region
- charges

In [28]:
import pandas as pd
import sklearn.compose
import warnings

warnings.filterwarnings('ignore')

url_test = 'https://github.com/robertomancebom/LinearRegressionPractice/blob/master/test.csv?raw=true'
url_train = 'https://github.com/robertomancebom/LinearRegressionPractice/blob/master/train.csv?raw=true'

train_data = pd.read_csv(url_train)
test_data = pd.read_csv(url_test)

train_data.describe(include='all')

Unnamed: 0,id,age,sex,bmi,children,smoker,region,charges
count,936.0,936.0,936,936.0,936.0,936,936,936.0
unique,,,2,,,2,4,
top,,,male,,,no,southeast,
freq,,,481,,,733,251,
mean,670.162393,39.201923,,30.483323,1.092949,,,13543.401938
std,385.864903,13.978319,,5.998443,1.19487,,,12285.440739
min,0.0,18.0,,16.815,0.0,,,1121.8739
25%,333.75,27.0,,26.125,0.0,,,4835.844225
50%,673.5,39.0,,30.25,1.0,,,9521.1343
75%,1007.5,51.0,,34.21,2.0,,,17388.57055


# Preprocessing

Once we've seen the data, we can differenciate two types of variables: numeric and categorical variables.

Moreover, the range of these data is different as seen among age, children or bmi. To work with the same values we need to use transformers.

On one hand, we need to change categorical variables to numerical variables in order to work correctly. To achieve this we are using **OneHotEncoder** and **OrdinalEncoder**.

On the other hand we need to preprocess the numerical variables. After studying different possibilities, we decided to use **StandardScaler**.

Furthermore, with the children variable, we used **KBinsDiscretizer** to divide the data in 3 groups.

Once we have preprocessed the dataset, we split the training data into two matrixes: 
* X with the features.
* Y with the target value.

In [29]:
X = train_data.iloc[:,:-1]
y = train_data.iloc[:,-1:]

column_transformer = sklearn.compose.ColumnTransformer(transformers=[
    ("id", "drop", [0]),
    ("age", sklearn.preprocessing.StandardScaler(),[1]),
    ("sex", sklearn.preprocessing.OneHotEncoder(drop='first'),[2]),
    ("bmi", sklearn.preprocessing.StandardScaler(),[3]),
    ("children",sklearn.preprocessing.KBinsDiscretizer(n_bins=3),[4]),
    ("smoker", sklearn.preprocessing.OneHotEncoder(drop='first'), [5]),
    ("region", sklearn.preprocessing.OrdinalEncoder(), [6]),
])

X = column_transformer.fit_transform(X)
X

array([[ 1.34552133,  1.        , -0.02974405, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.05921091,  0.        , -0.2674327 , ...,  1.        ,
         0.        ,  0.        ],
       [ 1.77498696,  0.        ,  1.53733305, ...,  0.        ,
         0.        ,  3.        ],
       ...,
       [ 1.13078852,  0.        , -0.1089736 , ...,  0.        ,
         0.        ,  0.        ],
       [-1.37442767,  0.        ,  0.46981911, ...,  0.        ,
         0.        ,  3.        ],
       [ 1.27394373,  1.        ,  1.94265475, ...,  0.        ,
         1.        ,  2.        ]])

# Training

Once we have the data preprocessed, we need to choose an algorithm to train a model.

In this case we are going to try three different algorithms **Ridge Regression**, **Gradient Boosting** and **Random Forest**.

In [30]:
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge

rd = Ridge(random_state=30)
gbr = GradientBoostingRegressor(random_state=30, min_samples_leaf=85,
                                  learning_rate=0.11, max_depth=4,
                                  n_estimators=150, max_features=4,
                                  min_weight_fraction_leaf=0.1)
rf = RandomForestRegressor(random_state=30)

model_rd = rd.fit(X,y.values.ravel())
model_gbr = gbr.fit(X, y.values.ravel())
model_rf = rf.fit(X,y.values.ravel())

# Model evaluation

In order to evaluate the model and getting the best hyperparameters, we use cross-validation. With this method we are trying to avoid overfitting and getting a better model overall.

Moreover, we used 'R2' as the evaluation metric.


In [31]:
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

score_rd = cross_val_score(model_rd, X, y, scoring='r2', cv=cv, n_jobs=-1)
score_gbr = cross_val_score(model_gbr, X, y, scoring='r2', cv=cv, n_jobs=-1)
score_rf = cross_val_score(model_rf, X, y, scoring='r2', cv=cv, n_jobs=-1)

cv_rd, rd_score = np.mean(score_rd), model_rd.score(X, y.values.ravel())
cv_gbr, gbr_score = np.mean(score_gbr), model_gbr.score(X, y.values.ravel())
cv_rf, rf_score = np.mean(score_rf), model_rf.score(X, y.values.ravel())

print('Ridge ---> r2:', rd_score, ' cv:', cv_rd)
print('Gradient Boosting ---> r2', gbr_score, ' cv:', cv_gbr)
print('Random Forest ---> r2', rf_score, ' cv:', cv_rf)

Ridge ---> r2: 0.7498543944130304  cv: 0.7388192001917021
Gradient Boosting ---> r2 0.8851182702290967  cv: 0.8474431856909658
Random Forest ---> r2 0.9756587703909264  cv: 0.8246365453055507


# The results

Once we have trained and evaluated the model, we have to predict the value of charges for the test dataset.

In [32]:
ids = test_data.iloc[:,:1]

test = column_transformer.fit_transform(test_data.iloc[:,:])

# We use Gradient Boosting to predict the values
pred = model_gbr.predict(test)

df = pd.DataFrame(pred, columns=['charges'])
df = ids.join(df)

df.to_csv('out.csv', index=False)
df

Unnamed: 0,id,charges
0,1319,8269.711563
1,12,4223.228297
2,487,1875.530903
3,1118,39406.886902
4,460,10495.700214
...,...,...
397,332,13805.873068
398,226,4324.570638
399,1285,9333.570908
400,631,3329.465711
