# Laboratory work #7 Linear Regression

In this work, your goal is to become familiar with linear regression, multiple linear regression, and backward elimination.

You can complete this laboratory work in a team of two students.

## Task 1 Introduction
1. Create a team of two students and choose a dataset suitable for a regression task. You can use www.kaggle.com or any similar source of datasets.
2. Describe your dataset: what kind of information does it contain, what are the features, and what is the target value?

In [24]:
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd  

df = pd.read_csv('Breast_cancer_data.csv')

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   mean_radius      569 non-null    float64
 1   mean_texture     569 non-null    float64
 2   mean_perimeter   569 non-null    float64
 3   mean_area        569 non-null    float64
 4   mean_smoothness  569 non-null    float64
 5   diagnosis        569 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 26.8 KB


In [26]:
df.describe()

| | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | diagnosis |
|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.627417 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.483918 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.0 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.0 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 1.0 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 1.0 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 1.0 |


In [27]:
x = df.drop('diagnosis', axis=1)
y = df.diagnosis

In [28]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

In [29]:
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(455, 5) (114, 5) (455,) (114,)


## Linear and Multiple Linear Regression

Linear regression is a supervised machine learning algorithm that performs a regression task: it models the target value as a linear function of the independent variables. Train a `LinearRegression()` model on your data.
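To make the idea concrete, here is a minimal sketch of an ordinary least squares fit written with NumPy only. The toy data, the variable names, and the choice of `np.linalg.lstsq` are assumptions made for illustration; they are not part of the lab's dataset or of scikit-learn's internals.

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 2*x + 1 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Design matrix with a column of ones so the model has an intercept.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: find beta that minimises ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

# Predict the target for a new input value.
x_query = 4.0
y_hat = intercept + slope * x_query
print(f"intercept={intercept:.2f}, slope={slope:.2f}, prediction={y_hat:.2f}")
```

With more than one input column the same call fits a multiple linear regression; only the design matrix gains extra columns.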

In [30]:
from sklearn.linear_model import LinearRegression

In [31]:
l_r = LinearRegression()
l_r.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [32]:
# Round the continuous predictions and clip them to the class labels 0 and 1
y_pred_lr = np.clip(np.round(l_r.predict(x_test)), 0, 1).astype(int)

# Put actual and predicted labels side by side for comparison
m = pd.Series(y_test, name='actual data').reset_index(drop=True)
n = pd.Series(y_pred_lr, name='predicted data')
visualization = pd.concat([m, n], axis = 1)
print(visualization['actual data'].value_counts())
print(visualization['predicted data'].value_counts())
print(visualization)

1    67
0    47
Name: actual data, dtype: int64
1    69
0    45
Name: predicted data, dtype: int64
     actual data  predicted data
0              0               0
1              1               1
2              1               1
3              1               1
4              1               1
..           ...             ...
109            0               0
110            1               1
111            0               0
112            0               0
113            1               1

[114 rows x 2 columns]


## Backward Elimination

Backward elimination is a feature selection technique used while building a machine learning model. It removes features that do not have a significant effect on the dependent variable (the predicted output). Perform backward elimination.

The main steps of the backward elimination process are:

Step-1: Select a significance level (SL) that a predictor must meet to stay in the model, for example SL = 0.05.

Step-2: Fit the complete model with all possible predictors (independent variables).

Step-3: Find the predictor with the highest P-value. If its P-value > SL, go to Step-4; otherwise finish, because the model is ready.

Step-4: Remove that predictor.

Step-5: Rebuild and fit the model with the remaining variables, then return to Step-3.
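The steps above can be sketched as a loop. This is a rough illustration under stated assumptions, not the procedure used in the cells that follow. The helper names `ols_p_values` and `backward_elimination` are hypothetical. P-values are computed with a normal approximation to the t-distribution, which is reasonable only when there are many more rows than columns. An intercept column is always included and never removed. The `sm.OLS` calls in this notebook do not add an intercept, which is why their summaries report uncentered R-squared; `sm.add_constant` would add one.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Fit OLS and return (coefficients, two-sided p-values).

    Assumption: p-values use a normal approximation to the t-distribution.
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)       # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)          # coefficient covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return beta, p_values

def backward_elimination(X, y, names, sl=0.05):
    """Drop the predictor with the highest p-value until all are <= sl."""
    names = list(names)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    while names:
        design = np.column_stack([np.ones(len(y)), X])  # intercept always kept
        _, p = ols_p_values(design, y)                  # Step-2 / Step-5: fit
        p = p[1:]                                       # ignore the intercept
        worst = int(np.argmax(p))                       # Step-3: highest p-value
        if p[worst] <= sl:
            break                                       # model is ready
        names.pop(worst)                                # Step-4: remove it
        X = np.delete(X, worst, axis=1)
    return names

# Synthetic demo data (made up): only the first two columns drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=500)
kept = backward_elimination(X_demo, y_demo, ["a", "b", "noise"])
print(kept)
```

The loop stops as soon as the least significant remaining predictor passes the threshold, so a model whose predictors all have P-values below SL is returned unchanged.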

In [33]:
import statsmodels.api as sm

In [34]:
x_opt = x[['mean_radius', 'mean_texture', 'mean_perimeter', 'mean_area', 'mean_smoothness']]
ols_r = sm.OLS(endog = y, exog = x_opt).fit()
ols_r.summary()

| Dep. Variable: | diagnosis | R-squared (uncentered): | 0.808 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.806 |
| Method: | Least Squares | F-statistic: | 473.3 |
| Date: | Sat, 07 Nov 2020 | Prob (F-statistic): | 3.93e-199 |
| Time: | 23:07:04 | Log-Likelihood: | -205.93 |
| No. Observations: | 569 | AIC: | 421.9 |
| Df Residuals: | 564 | BIC: | 443.6 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| mean_radius | 0.7246 | 0.060 | 12.141 | 0.000 | 0.607 | 0.842 |
| mean_texture | -0.0085 | 0.003 | -2.436 | 0.015 | -0.015 | -0.002 |
| mean_perimeter | -0.0942 | 0.010 | -9.101 | 0.000 | -0.115 | -0.074 |
| mean_area | -0.0017 | 0.000 | -8.880 | 0.000 | -0.002 | -0.001 |
| mean_smoothness | 3.0106 | 1.060 | 2.840 | 0.005 | 0.929 | 5.093 |

| Omnibus: | 25.729 | Durbin-Watson: | 1.762 |
|---|---|---|---|
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 40.664 |
| Skew: | -0.348 | Prob(JB): | 1.48e-09 |
| Kurtosis: | 4.11 | Cond. No. | 54300.0 |


In [35]:
x_opt = x[['mean_radius', 'mean_texture', 'mean_perimeter', 'mean_area']]
ols_r = sm.OLS(endog = y, exog = x_opt).fit()
ols_r.summary()

| Dep. Variable: | diagnosis | R-squared (uncentered): | 0.805 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.803 |
| Method: | Least Squares | F-statistic: | 582.4 |
| Date: | Sat, 07 Nov 2020 | Prob (F-statistic): | 8.34e-199 |
| Time: | 23:07:04 | Log-Likelihood: | -209.97 |
| No. Observations: | 569 | AIC: | 427.9 |
| Df Residuals: | 565 | BIC: | 445.3 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| mean_radius | 0.6773 | 0.058 | 11.744 | 0.000 | 0.564 | 0.791 |
| mean_texture | -0.0085 | 0.004 | -2.406 | 0.016 | -0.015 | -0.002 |
| mean_perimeter | -0.0809 | 0.009 | -8.710 | 0.000 | -0.099 | -0.063 |
| mean_area | -0.0021 | 0.000 | -17.291 | 0.000 | -0.002 | -0.002 |

| Omnibus: | 30.606 | Durbin-Watson: | 1.774 |
|---|---|---|---|
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 75.338 |
| Skew: | -0.242 | Prob(JB): | 4.37e-17 |
| Kurtosis: | 4.715 | Cond. No. | 2970.0 |


In [36]:
x_opt = x[['mean_radius', 'mean_texture', 'mean_perimeter']]
ols_r = sm.OLS(endog = y, exog = x_opt).fit()
ols_r.summary()

| Dep. Variable: | diagnosis | R-squared (uncentered): | 0.701 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.7 |
| Method: | Least Squares | F-statistic: | 443.4 |
| Date: | Sat, 07 Nov 2020 | Prob (F-statistic): | 4.1e-148 |
| Time: | 23:07:06 | Log-Likelihood: | -330.8 |
| No. Observations: | 569 | AIC: | 667.6 |
| Df Residuals: | 566 | BIC: | 680.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| mean_radius | 1.1231 | 0.064 | 17.622 | 0.000 | 0.998 | 1.248 |
| mean_texture | 0.0241 | 0.004 | 6.562 | 0.000 | 0.017 | 0.031 |
| mean_perimeter | -0.1713 | 0.009 | -18.053 | 0.000 | -0.190 | -0.153 |

| Omnibus: | 31.453 | Durbin-Watson: | 1.815 |
|---|---|---|---|
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 35.0 |
| Skew: | -0.593 | Prob(JB): | 2.51e-08 |
| Kurtosis: | 2.734 | Cond. No. | 348.0 |


In [37]:
x_opt = x[['mean_radius', 'mean_texture']]
ols_r = sm.OLS(endog = y, exog = x_opt).fit()
ols_r.summary()

| Dep. Variable: | diagnosis | R-squared (uncentered): | 0.53 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.528 |
| Method: | Least Squares | F-statistic: | 319.2 |
| Date: | Sat, 07 Nov 2020 | Prob (F-statistic): | 1.38e-93 |
| Time: | 23:07:06 | Log-Likelihood: | -460.18 |
| No. Observations: | 569 | AIC: | 924.4 |
| Df Residuals: | 567 | BIC: | 933.1 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| mean_radius | -0.0243 | 0.006 | -4.114 | 0.000 | -0.036 | -0.013 |
| mean_texture | 0.0460 | 0.004 | 10.584 | 0.000 | 0.037 | 0.055 |

| Omnibus: | 2789.542 | Durbin-Watson: | 1.6 |
|---|---|---|---|
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 63.2 |
| Skew: | -0.452 | Prob(JB): | 1.89e-14 |
| Kurtosis: | 1.641 | Cond. No. | 7.76 |


In [38]:
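# Training subset restricted to the two remaining columns
x_new = x_train[['mean_radius', 'mean_texture']]
# Note: this fit still uses the full data (y, x_opt), not x_new,
# so the summary below repeats the one from the previous cell
ols_r = sm.OLS(endog = y, exog = x_opt).fit()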
ols_r.summary()

| Dep. Variable: | diagnosis | R-squared (uncentered): | 0.53 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.528 |
| Method: | Least Squares | F-statistic: | 319.2 |
| Date: | Sat, 07 Nov 2020 | Prob (F-statistic): | 1.38e-93 |
| Time: | 23:07:26 | Log-Likelihood: | -460.18 |
| No. Observations: | 569 | AIC: | 924.4 |
| Df Residuals: | 567 | BIC: | 933.1 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| mean_radius | -0.0243 | 0.006 | -4.114 | 0.000 | -0.036 | -0.013 |
| mean_texture | 0.0460 | 0.004 | 10.584 | 0.000 | 0.037 | 0.055 |

| Omnibus: | 2789.542 | Durbin-Watson: | 1.6 |
|---|---|---|---|
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 63.2 |
| Skew: | -0.452 | Prob(JB): | 1.89e-14 |
| Kurtosis: | 1.641 | Cond. No. | 7.76 |


In [39]:
l_r.fit(x_new, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [40]:
y_preds_lr_new = l_r.predict(x_test[['mean_radius', 'mean_texture']])

## Metrics

Create a method print_metrics() that prints the following metrics: MAE (mean absolute error), MSE (mean squared error), MAPE (mean absolute percentage error), and r2_score.
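A possible shape for such a helper is sketched below using NumPy only. The split into `compute_metrics` and `print_metrics`, the `title` parameter, and the choice to skip rows where the true value is 0 when computing MAPE are assumptions of this sketch, not requirements stated in the task.

```python
import numpy as np

def compute_metrics(y_true, y_pred):
    """Return MAE, MSE, MAPE (in percent) and R^2 as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)

    # MAPE divides by y_true, so it is undefined where y_true == 0.
    # Assumption: those rows are skipped; NaN is returned if none remain.
    nonzero = y_true != 0
    if nonzero.any():
        mape = np.mean(np.abs(errors[nonzero] / y_true[nonzero])) * 100
    else:
        mape = float("nan")

    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"MAE": mae, "MSE": mse, "MAPE": mape, "R2": r2}

def print_metrics(y_true, y_pred, title="Metrics"):
    """Print each metric on its own labelled line."""
    print(title)
    for name, value in compute_metrics(y_true, y_pred).items():
        print(f"  {name}: {value:.4f}")

# Small hand-made example, not taken from the dataset.
print_metrics([1, 2, 4], [1, 3, 2], title="Toy example:")
```

Labelling each value avoids having to remember the order in which bare numbers were printed.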

In [41]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
print('Metrics for Linear Regression:')
print(mean_squared_error(y_test, y_pred_lr))   # MSE
print(r2_score(y_test, y_pred_lr))             # R^2
print(mean_absolute_error(y_test, y_pred_lr))  # MAE

Metrics for Linear Regression:
0.07017543859649122
0.7103842489679263
0.07017543859649122


In [42]:
print('Metrics for Backward Elimination Linear Regression:')
print(mean_squared_error(y_test, y_preds_lr_new))   # MSE
print(r2_score(y_test, y_preds_lr_new))             # R^2
print(mean_absolute_error(y_test, y_preds_lr_new))  # MAE

Metrics for Backward Elimination Linear Regression:
0.09835428065858413
0.5940894787427884
0.2574123420365969


In [21]:
def mean_absolute_percentage_error(y_true, y_pred):
    # MAPE divides by y_true. Here y_true holds the class labels 0 and 1,
    # so rows with y_true == 0 are divided by zero and the result is inf.
    mape = np.abs((y_true - y_pred) / y_true).mean(axis=0) * 100
    return mape

In [22]:
mean_absolute_percentage_error(y_test, y_pred_lr)

inf

In [23]:
mean_absolute_percentage_error(y_test, y_preds_lr_new)

inf

## Conclusion

Analyze the work that you have done and make a conclusion. Make a short report on your work. 

The error is larger for linear regression after backward elimination (MSE 0.098 vs. 0.070, MAE 0.257 vs. 0.070), and its r2_score is lower (0.594 vs. 0.710) than for linear regression trained on all columns. A likely explanation is that the removed columns still carried information about the target: in the full OLS summary every predictor has a P-value below SL = 0.05, so by Step-3 none of them needed to be removed. The model could be more precise if the dataset contained columns that correlate more strongly with the target.
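One caveat when reading these numbers: `y_pred_lr` was rounded and clipped to 0/1 before scoring, while `y_preds_lr_new` was scored as raw continuous output, so the two sets of metrics are not measured on the same footing. The sketch below shows how much the same predictions can differ depending on whether they are scored raw or as labels. The helper name `to_labels` and the five sample values are hypothetical and are not results from the dataset.

```python
import numpy as np

def to_labels(scores):
    """Round continuous regression outputs and clip them to the labels 0 and 1."""
    return np.clip(np.round(np.asarray(scores, dtype=float)), 0, 1).astype(int)

# Hypothetical values for illustration only (not taken from the dataset).
y_true = np.array([0, 1, 1, 0, 1])
raw_scores = np.array([0.1, 0.9, 1.2, 0.4, 0.6])

# Scored as raw regression output: every small deviation counts as error.
mae_raw = np.mean(np.abs(y_true - raw_scores))

# Scored as labels: the MAE equals the share of misclassified rows.
mae_labels = np.mean(np.abs(y_true - to_labels(raw_scores)))

print(f"MAE on raw scores: {mae_raw:.2f}")
print(f"MAE on labels:     {mae_labels:.2f}")
```

Applying the same post-processing to both models before computing the metrics would make the comparison direct.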