# Applying a linear regression model to the Healthcare for All case study

## Get our toolkit - import modules / libraries

In [None]:
!ls

In [None]:
# pandas, numpy, matplotlib, %matplotlib inline, seaborn 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')


In [None]:
#import the model from sklearn
from sklearn import linear_model
#import evaluation metrics from sklearn
from sklearn.metrics import mean_squared_error, r2_score
#import TTsplit from sklearn
from sklearn.model_selection import train_test_split

In [None]:
data=pd.read_csv('regression_data.csv')

This is an ordinary least squares Linear Regression.

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
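
To make "minimize the residual sum of squares" concrete, here is a minimal sketch on made-up toy data (the names `X_toy`, `y_toy` and the true weights are assumptions for illustration, not part of the case study). It solves the least squares problem directly with NumPy and checks that scikit-learn arrives at the same intercept and coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y = 2 + 3*x1 - 1*x2 plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 2 + 3 * X_toy[:, 0] - 1 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# Least squares by hand: a leading column of ones makes the first weight the intercept
A = np.column_stack([np.ones(len(X_toy)), X_toy])
w, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

# scikit-learn should recover the same intercept and coefficients
lr = LinearRegression().fit(X_toy, y_toy)
rss = np.sum((y_toy - lr.predict(X_toy)) ** 2)  # the quantity being minimized

print("by hand:", w)
print("sklearn:", lr.intercept_, lr.coef_)
print("residual sum of squares:", rss)
```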

Linear Regression Pros and Cons 

+ Pros: Easy to interpret results, computationally inexpensive
+ Cons: Poorly models nonlinear data
+ Works with: Numeric values, nominal values, normally distributed data 

## Get the data, review the shape and clarify headings. 

If any basic data cleaning steps are needed (e.g. handling nulls), do them now.

- describe() 
- dtypes
- shape
- info()
- unique()
- head()
- tail()
- query()
- value_counts()
- groupby().agg()
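
As a rough illustration of how a few of these calls fit together, the sketch below uses a small made-up frame (the `toy` frame and its column names are hypothetical, not columns of `regression_data.csv`). The cleaning rule at the end is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are invented for illustration
toy = pd.DataFrame({
    "state": ["FL", "TX", "FL", None, "CA"],
    "gift": [10.0, np.nan, 25.0, 5.0, 15.0],
})

print(toy.shape)                     # (rows, columns)
print(toy.dtypes)                    # data type of each column
print(toy.isna().sum())              # null count per column
print(toy["state"].value_counts())   # frequency of each category
print(toy.groupby("state")["gift"].agg(["mean", "count"]))

# One simple cleaning choice: drop rows with a missing category,
# then fill missing numeric values with the column median
clean = toy.dropna(subset=["state"]).copy()
clean["gift"] = clean["gift"].fillna(clean["gift"].median())
print(clean)
```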


In [None]:
data.head()

In [None]:
data.info()

## Clarify the objective, including determining the label (column) we wish to predict

Once the label has been identified, set it as y and remove that column from the rest of the data (X).

Our objective:

Given a predictor variable X and a response variable y, we fit a straight line to the data that minimizes the distance (most commonly the average squared distance) between the sample points and the fitted line. We can then use the intercept and slope learned from this data to predict the outcome for new data.


In [None]:
y=data['TARGET_D'] # DEPENDENT VAR / LABEL 

In [None]:
data.info()

In [None]:
X=data.drop(['TARGET_D'], axis=1) # INDEPENDENT VAR / FEATURES 

# Pre processing

In Pre Processing the data analyst makes best efforts to give the ML model a 'fighting chance':
* cleaning the data, dealing with nulls, outliers 
* removing similar columns which present a multicollinearity risk
* reducing heavy skew through transformations (e.g. log or Box-Cox) and putting features on a comparable scale
* transforming all non numeric variables into numbers 

We will also need an important step to ensure relevance :
* To determine whether our machine learning algorithm not only performs well on the training set but also generalizes well to new data, we will  randomly divide the dataset into a separate training and test set. 
* We use the training set to train and optimize our machine learning model, while we keep the test set until the very end to evaluate the final model.

### Split the data into numeric and categorical features (columns) for pre processing - not including the label we will predict

In [None]:
X_num = X.select_dtypes(include = np.number)

In [None]:
X_num.head()

In [None]:
X_cat = X.select_dtypes(include = object)

In [None]:
X_cat.head()

### Initial pre-processing steps to consider 

- Check for multicollinearity
- if any columns are highly correlated, we should drop them now

- Transformation methods on a chosen feature, for one or more skewed columns or a column with distant, legitimate outliers: **Box-Cox** or **log transform**

- are there any other numerical columns we want to drop now because they are not correlated at all to the target variable?

- if we identify outliers, they can be removed by calculating the IQR (inter quartile range)
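
A minimal sketch of two of these checks, on made-up data (the frame `toy_num`, its columns, and the 0.9 correlation cut-off are assumptions for illustration): flag one column of any highly correlated pair, then keep only rows within 1.5 × IQR of the quartiles for a chosen column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy_num = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=300),  # nearly collinear with "a"
    "b": rng.normal(size=300),
})

# 1) Multicollinearity: inspect the upper triangle of the absolute correlation matrix
corr = toy_num.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
threshold = 0.9  # assumed cut-off; tune for your data
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy_num.drop(columns=to_drop)

# 2) Outliers: keep rows within 1.5 * IQR of the quartiles for one column
q1, q3 = reduced["b"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr
filtered = reduced[reduced["b"].between(lower, upper_bound)]

print("dropped columns:", to_drop)
print("rows before/after outlier filter:", len(reduced), len(filtered))
```

If rows are removed from the features this way, the same rows need to be removed from y so that the two stay aligned.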


##### After making any changes to the data - replot to see the impact 

- if satisfied with a proposed change, replace the column 
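
One way to judge a transformation without relying only on plots is to compare skewness before and after. The sketch below does this for a log transform and a Box-Cox transform on a made-up, strictly positive column (the series `skewed` and its name `income` are hypothetical); it assumes SciPy is installed.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed, strictly positive feature
rng = np.random.default_rng(2)
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000), name="income")

# Log transform: log(1 + x) is also defined when x is zero
log_version = np.log1p(skewed)

# Box-Cox: requires strictly positive input and estimates the power lambda
boxcox_values, lam = stats.boxcox(skewed.to_numpy())
boxcox_version = pd.Series(boxcox_values, name="income_boxcox")

print("skew before:       ", skewed.skew())
print("skew after log1p:  ", log_version.skew())
print("skew after Box-Cox:", boxcox_version.skew(), "lambda:", lam)
```

A skewness closer to zero suggests a more symmetric column; if the change looks worthwhile, the transformed series can replace the original column.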

### sklearn rescaling methods - for all numerical columns

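+ The idea is to put numerical features on a comparable scale. Scaling alone does not change the shape of a distribution; to reduce skew, use a transformation such as log or Box-Cox. Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed features, although less skewed features often help.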


#### numerical rescaling - common options 

+ Normalizer 

+ StandardScaler

+ MinMaxScaler

[compare the effects of scalers](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html)

[when to use which](https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02)

- Use MinMaxScaler as your default
- Use RobustScaler if you have outliers and can handle a larger range
- Use StandardScaler if you need features with zero mean and unit variance
- Use Normalizer sparingly - it normalizes rows, not columns
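
The row-versus-column point is easy to miss, so here is a small sketch on a made-up array (`demo` is hypothetical). StandardScaler works down each column, while Normalizer rescales each row to unit length, which makes the first two rows identical here.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

demo = np.array([[3.0, 4.0],
                 [6.0, 8.0],
                 [1.0, 0.0]])

# Column-wise: each feature ends up with mean 0 and standard deviation 1
col_scaled = StandardScaler().fit_transform(demo)
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))

# Row-wise: each sample is divided by its own L2 norm
row_scaled = Normalizer().fit_transform(demo)
print(row_scaled)
print(np.linalg.norm(row_scaled, axis=1))  # every row now has length 1
```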


##### check the shape of each normalised numerical X before going further. 

In [None]:
# option 1: rescaling all numeric features using Normalizer (scales each row to unit length)

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(X_num, alpha=0.2, figsize=(6,6),diagonal='kde');

In [None]:
from sklearn.preprocessing import Normalizer
transformer=Normalizer().fit(X_num)
X_normalized=transformer.transform(X_num)


In [None]:
X_normalized

In [None]:
X_norm=pd.DataFrame(X_normalized,columns=X_num.columns) # reuse the original column names

In [None]:
scatter_matrix(X_norm, alpha=0.2, figsize=(6,6),diagonal='kde');

In [None]:
#option 2 standardising all numeric features / rescaling using Standard Scaler

In [None]:
from sklearn.preprocessing import StandardScaler
transformer=StandardScaler().fit(X_num)
X_standardised=transformer.transform(X_num)

In [None]:
X_standardised

In [None]:
X_std=pd.DataFrame(X_standardised,columns=X_num.columns) # reuse the original column names

In [None]:
scatter_matrix(X_std, alpha=0.2, figsize=(6,6),diagonal='kde');

In [None]:
#option 3 standardising all numeric features / rescaling using MinMax scaler
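
A minimal sketch of option 3, following the same fit/transform pattern as the two options above. It runs on a small stand-in frame (`X_num_demo` and its values are made up); in the notebook the same calls would be applied to `X_num`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for X_num; replace with the notebook's own frame
X_num_demo = pd.DataFrame({
    "AVGGIFT": [5.0, 10.0, 20.0, 45.0],
    "HV1": [500.0, 900.0, 1500.0, 2500.0],
})

# MinMaxScaler maps each column to the range [0, 1]
transformer = MinMaxScaler().fit(X_num_demo)
X_minmax = pd.DataFrame(transformer.transform(X_num_demo),
                        columns=X_num_demo.columns)
print(X_minmax)
```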


### select which numerical transformation process you will use - this replaces X_num

In [None]:
X_num=X_normalized

## pre processing categorical columns 

In [None]:
#review categorical data 
X_cat

### Turning categories into numbers 

#### One hot encoder 

[Explanation of OHE](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)

In [None]:
#option 1 using OHE 
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='error', drop='first').fit(X_cat)
encoded = encoder.transform(X_cat).toarray()
encoded


#### Label encoding 

[explanation of LE](https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/)

In [None]:
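#option 2 using LE - assigns an integer to each category, in sorted (alphabetical) order
from sklearn import preprocessing
# LabelEncoder handles one column at a time, so apply it column by column
X_cat.apply(lambda col: preprocessing.LabelEncoder().fit_transform(col))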


#### Get Dummies 

[explanation of getting dummies](https://www.geeksforgeeks.org/python-pandas-get_dummies-method/)

In [None]:
#option 3 using dummies - creates one binary indicator column per category (first level dropped)
pd.get_dummies(X_cat, drop_first=True)

### select which categorical transformation process you will use - this replaces X_cat

In [None]:
X_cat=encoded

## Bring categorical and numerical X back together 


**First, check the shape of both arrays to ensure no rows have been lost**

In [None]:
X_num.shape

In [None]:
X_cat.shape

In [None]:
X = np.concatenate((X_num, X_cat), axis=1)

#### Check that the number of rows in y matches X

In [None]:
y.shape

In [None]:
X.shape

## Split the data into train and test, randomly, as a %, using 

> from sklearn.model_selection import train_test_split


In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.4, random_state=100)

In [None]:
#import TTsplit from sklearn
#from sklearn.model_selection import train_test_split

## Apply the model to pre processed data

In [None]:
lm=linear_model.LinearRegression() #configure model
model=lm.fit(X_train,y_train) #train model on the training set
predictions=lm.predict(X_test) #predict the label for the test set

### Measure the accuracy of linear regression 

One of the primary measures of fit we can use in linear regression is R2 (r-squared).

R-squared tells us goodness of fit, i.e. how well the regression model fits the observed data. For example, an R-squared of 60% means that the model explains 60% of the variance in the target. Generally, a higher R-squared indicates a better fit. It gives us a single score for how well the regression predictions approximate the real data points. An R2 of 1 indicates that the predictions fit the data perfectly; on a test set, R2 can be negative when the model does worse than always predicting the mean.

In [None]:
r2_score(y_test,predictions)
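
To show what `r2_score` computes, the sketch below reproduces it by hand on a tiny made-up example (`y_true` and `y_pred` are hypothetical values): one minus the ratio of unexplained variation to total variation around the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```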

The intercept (often labeled the constant) is the expected mean value of Y when all X=0. Start with a regression equation with one predictor, X. If X sometimes equals 0, the intercept is simply the expected mean value of Y at that value. 

If X never equals 0, then the intercept has no intrinsic meaning.

In [None]:
lm.intercept_

In linear regression, coefficients are the values that multiply the predictor values. 

The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable.

A positive sign indicates that as the predictor variable increases, the response variable also increases.
A negative sign indicates that as the predictor variable increases, the response variable decreases.

The coefficient value represents the mean change in the response given a one-unit increase in the predictor, holding the other predictors constant. For example, if a coefficient is +3, the mean response value increases by 3 for every one-unit increase in that predictor.

In [None]:
lm.coef_
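
A bare array of coefficients is hard to read, so it can help to pair each value with its feature name. The sketch below does this on made-up data (`features`, `target` and the true weights are assumptions for illustration), where the fitted values should match the weights used to generate the target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data generated without noise: target = 1 + 3*x1 - 2*x2
rng = np.random.default_rng(3)
features = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
target = 1.0 + 3.0 * features["x1"] - 2.0 * features["x2"]

fitted = LinearRegression().fit(features, target)

# Label each coefficient with the column it belongs to
coef_table = pd.Series(fitted.coef_, index=features.columns).sort_values()
print("intercept:", fitted.intercept_)
print(coef_table)
```

In this notebook X is a NumPy array after concatenation, so the names would have to be carried over separately, for example from the numeric column names plus the encoder's output names (recent scikit-learn versions expose these via `get_feature_names_out()`).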

## review against our objective, summarise the accuracy of your model :
+ what are we seeking to predict?
+ how accurately can we do that?
+ what might be some reasons for inaccuracy?
+ are there any improvements we can make?

## Additional Evaluation Metrics for LR

https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e

MSE (mean squared error) represents the average of the squared difference between the original and predicted values in the data set. It measures the variance of the residuals.


A residual is the vertical distance between a data point and the regression line. Each data point has one residual. They are positive if they are above the regression line and negative if they are below the regression line. ... In other words, the residual is the error that isn't explained by the regression line
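
As a small worked example with made-up numbers (`observed` and `fitted_values` are hypothetical), the residuals and the error metrics built from them can be computed directly with NumPy.

```python
import numpy as np

observed = np.array([10.0, 12.0, 15.0, 20.0])
fitted_values = np.array([11.0, 12.0, 14.0, 18.0])

# Positive residual: the point lies above the fitted line; negative: below it
residuals = observed - fitted_values

mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, in the units of y

print(residuals, mae, mse, rmse)
```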

In [None]:
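from sklearn import metrics
print(metrics.mean_absolute_error(y_test, predictions))  # MAE on the test set
print(metrics.mean_squared_error(y_test, predictions))   # MSE on the test set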

RMSE (Root Mean Squared Error) is the square root of Mean Squared error. It measures the standard deviation of residuals.

In [None]:
# RMSE is the square root of MSE; np.sqrt avoids relying on the version-dependent squared=False argument
rmse = np.sqrt(mean_squared_error(y_test, predictions))
rmse

The adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in a regression model. It is calculated as:

Adjusted R2 = 1 – [(1-R2)*(n-1)/(n-p-1)]

where:

* R2: The R2 of the model
* n: The number of observations
* p: The number of predictor variables
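
The formula translates directly into a small helper. This is a sketch (the function name `adjusted_r2` is made up; scikit-learn does not provide it under that name), and in the notebook `n` and `p` would come from the shape of the data the R2 was computed on.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Example: R2 of 0.60 from 100 observations and 5 predictors
print(adjusted_r2(0.60, n=100, p=5))
```

Because the penalty grows with the number of predictors, the adjusted value is never higher than the plain R2 for the same model.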