# Fundamentals of machine learning using Python 
## Linear regression models

***
<br>

## Regression vs. Classification

* Regression predicts a continuous value, while classification predicts a discrete class.
* Regression tasks include predicting someone's age, the price of a house, or the value of any other numeric variable. Classification means predicting which class something belongs to (such as whether a tumor is benign or malignant).
* In both regression and classification, we use data to predict labels (an umbrella term for the target variables). Labels can be anything from "B" (a class) for classification tasks to 123 (a number) for regression tasks.
* Because we also supply the labels during training, these are supervised learning algorithms.

## What is linear regression?

* Linear regression analysis is used to predict the value of a variable based on the value of another variable.
* The variable you want to predict is called the dependent variable.
* The variable you are using to predict the other variable's value is called the independent variable.
* Linear regression estimates the coefficients of a linear equation involving one or more independent variables, choosing the coefficients that best predict the value of the dependent variable.
* Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values, typically measured as the sum of squared differences (ordinary least squares).
* Linear regression models are relatively simple and provide an easy-to-interpret mathematical formula that can generate predictions.
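
The idea of a "best-fitting" line can be sketched with NumPy alone. The snippet below is a minimal illustration, not the method used later in this notebook: the five points are made up, and `sum_squared_errors` is a hypothetical helper introduced only for this example. It computes the least-squares slope and intercept for one independent variable and shows that nudging the slope away from that solution increases the squared error.

```python
import numpy as np

# Made-up points that lie roughly on a line (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Closed-form least-squares estimates for one independent variable
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

def sum_squared_errors(a, b):
    """Sum of squared differences between y and the line a * x + b."""
    return float(np.sum((y - (a * x + b)) ** 2))

best = sum_squared_errors(slope, intercept)
nudged = sum_squared_errors(slope + 0.1, intercept)  # a slightly different line

print(slope, intercept, best, nudged)
```

The error for the nudged line is larger than for the fitted one, which is what "minimizes the discrepancies" means in practice.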

<img src="img/linear-regression.png" style="width:400px">

## Example 1: Simple linear regression with a single independent variable

In [1]:
# load dataset

import pandas as pd

students_dataset = pd.read_csv("data/student_scores.csv")
students_dataset.head()

   Hours  Scores
0    2.5      21
1    5.1      47
2    3.2      27
3    8.5      75
4    3.5      30


In [2]:
# data preprocessing
# Scikit-Learn's linear regression model expects a 2D input

X = students_dataset["Hours"].values.reshape(-1, 1)
y = students_dataset["Scores"].values.reshape(-1, 1)
X.shape, y.shape

((25, 1), (25, 1))
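
As a small aside, here is what `reshape(-1, 1)` does to a 1D array. The three values are taken from the first rows of the dataset shown above; the variable names are chosen just for this sketch.

```python
import numpy as np

hours = np.array([2.5, 5.1, 3.2])  # 1D array: shape (3,)
X_column = hours.reshape(-1, 1)    # 2D column: 3 samples, 1 feature

print(hours.shape, X_column.shape)
```

The `-1` asks NumPy to infer the number of rows from the array length, so the same line works for any dataset size.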

In [3]:
# use one part of the data to train the model and another part to test it

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

((20, 1), (5, 1))

In [4]:
# training a linear regression model

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

In [5]:
# reading the form of the model

print(regressor.coef_, regressor.intercept_)

[[9.68207815]] [2.82689235]


This means that the model learned the following relationship between the variables `Scores` and `Hours`:<br>
$Scores = 9.68207815 \cdot Hours + 2.82689235$

In [6]:
# making predictions

score = regressor.predict([[9.5]])
print(score)

[[94.80663482]]
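
The prediction can be checked by hand against the fitted equation. This sketch simply plugs 9.5 hours into the coefficient and intercept printed in cell 5. The printed values are rounded, so the result agrees with `regressor.predict` only up to the last few digits.

```python
# Coefficient and intercept as printed by the notebook above
slope = 9.68207815
intercept = 2.82689235

hours = 9.5
predicted_score = slope * hours + intercept

print(round(predicted_score, 4))  # 94.8066
```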


In [7]:
# model evaluation - comparison of actual and predicted score values for the test set

y_pred = regressor.predict(X_test)
df_preds = pd.DataFrame({'Actual': y_test.squeeze(), 'Predicted': y_pred.squeeze()})
df_preds

   Actual  Predicted
0      81  83.188141
1      30  27.032088
2      21  27.032088
3      76  69.633232
4      62  59.951153
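
A side-by-side table is easier to judge when it is summarized by a single number. The sketch below is an addition to the original notebook: it computes three common error metrics (mean absolute error, mean squared error, and root mean squared error) with NumPy, using the five actual and predicted values copied from the table above.

```python
import numpy as np

# Values copied from the Actual / Predicted table above
actual = np.array([81, 30, 21, 76, 62], dtype=float)
predicted = np.array([83.188141, 27.032088, 27.032088, 69.633232, 59.951153])

errors = actual - predicted

mae = float(np.mean(np.abs(errors)))  # mean absolute error
mse = float(np.mean(errors ** 2))     # mean squared error
rmse = float(np.sqrt(mse))            # root mean squared error, in score units

print(round(mae, 2), round(mse, 2), round(rmse, 2))
```

In a notebook where scikit-learn is available, `sklearn.metrics.mean_absolute_error(y_test, y_pred)` and `sklearn.metrics.mean_squared_error(y_test, y_pred)` should give the same MAE and MSE.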


## Example 2: Multiple linear regression

* We can predict using several independent variables instead of one. This is the more common scenario in practice, where many factors affect a result.

In [8]:
# load dataset
# petrol consumption data on 48 US States

import pandas as pd

petrol_dataset = pd.read_csv("data/petrol_consumption.csv")
petrol_dataset.head()

   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                         0.580                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


In [9]:
# preparing the data

# independent variables
X = petrol_dataset[['Average_income', 'Paved_Highways', 'Population_Driver_licence(%)', 'Petrol_tax']]

# dependent variable
y = petrol_dataset['Petrol_Consumption']

X.shape, y.shape

((48, 4), (48,))

In [10]:
# train and test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

((38, 4), (10, 4))

In [11]:
# training the multiple linear regression model

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

In [12]:
# reading the form of the model

print(regressor.coef_, regressor.intercept_)

feature_names = X.columns
model_coefficients = regressor.coef_

coefficients_df = pd.DataFrame(data = model_coefficients, index = feature_names, columns = ['Coefficient value'])
print(coefficients_df)

[-5.65355145e-02 -4.38217137e-03  1.34686930e+03 -3.69937459e+01] 361.4508790666834
                              Coefficient value
Average_income                        -0.056536
Paved_Highways                        -0.004382
Population_Driver_licence(%)        1346.869298
Petrol_tax                           -36.993746


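By looking at the coefficients, we can see that, according to our model, `Average_income` and `Paved_Highways` have the coefficients closest to 0, so a one-unit change in either has the least effect on predicted petrol consumption. `Population_Driver_licence(%)` and `Petrol_tax`, with coefficients of 1346.87 and -36.99 respectively, have the largest effect per unit. Keep in mind that each coefficient depends on the units of its feature: `Population_Driver_licence(%)` is stored as a fraction (roughly 0.5 to 0.6 in the rows shown above), so a full one-unit change is far larger than anything in the data, and raw coefficients alone do not rank features measured on different scales.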
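
How much a coefficient's size depends on the feature's units can be seen in a small experiment. The sketch below uses synthetic data (not the petrol dataset) and a hypothetical helper `fit_line`. It fits the same relationship twice: once with the feature stored as a fraction and once with the same feature expressed in percent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): the target depends on a share between 0.4 and 0.7
share = rng.uniform(0.4, 0.7, size=50)
y = 1000.0 * share + 100.0

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    A = np.column_stack([x, np.ones_like(x)])
    slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
    return slope, intercept

slope_fraction, _ = fit_line(share, y)         # feature as a fraction (0.4 to 0.7)
slope_percent, _ = fit_line(share * 100.0, y)  # same feature in percent (40 to 70)

print(slope_fraction, slope_percent)
```

The two fits make identical predictions, yet the coefficient shrinks by a factor of 100 when the feature is rescaled. One common way to make coefficients comparable is to standardize the features before fitting, for example with scikit-learn's `StandardScaler`.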

In [13]:
# making predictions and model evaluation

y_pred = regressor.predict(X_test)
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(results)

    Actual   Predicted
27     631  606.692665
40     587  673.779442
26     577  584.991490
43     591  563.536910
24     460  519.058672
37     704  643.461003
12     525  572.897614
19     640  687.077036
4      410  547.609366
25     566  530.037630
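
To summarize this comparison, one option is the coefficient of determination (R²), which measures the share of the variance in the actual values that the predictions explain. The sketch below is an addition to the original notebook and uses only the ten rows printed above, copied by hand.

```python
import numpy as np

# Values copied from the Actual / Predicted output above
actual = np.array([631, 587, 577, 591, 460, 704, 525, 640, 410, 566], dtype=float)
predicted = np.array([606.692665, 673.779442, 584.991490, 563.536910, 519.058672,
                      643.461003, 572.897614, 687.077036, 547.609366, 530.037630])

ss_res = np.sum((actual - predicted) ** 2)      # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation around the mean

r2 = float(1.0 - ss_res / ss_tot)
mae = float(np.mean(np.abs(actual - predicted)))

print(round(r2, 2), round(mae, 2))  # 0.39 53.47
```

An R² of about 0.39 means the model explains well under half of the variation in the test set. In the notebook itself, `regressor.score(X_test, y_test)` reports the same quantity.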


## --- Exercise ---

Scikit-learn's standard `diabetes` dataset describes 442 patients with diabetes. Each patient is described by 10 independent variables: age, sex, body mass index, average blood pressure, and six blood serum measurements. The dependent variable is a quantitative measure of disease progression one year after baseline. Build a linear regression model and identify the variables with the greatest impact on the output.

In [14]:
from sklearn import datasets

diabetes = datasets.load_diabetes()

X = diabetes.data
y = diabetes.target

X.shape, y.shape

((442, 10), (442,))

In [None]:
# Write your code here