
# Linear Regression: Univariate

---
This script contains examples of how to train and test a linear regression model in Python. To run it, you will need the datasets it loads: "banking.csv" and "data_BTC.csv".

**Note on importing libraries:**

General syntax to import specific functions from a library:
*from (library) import (specific library function)*
*from pandas import DataFrame*

General syntax to import a whole library under an alias:
*import (library) as (nickname/alias for the library)*
*import matplotlib.pyplot as plt*
*import pandas as pd*
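
As a minimal sketch of the two import styles, the example below uses the standard-library `math` module as a stand-in (it is not used elsewhere in this script). The variable names are illustrative only.

```python
# Style 1: import one specific function from a library
from math import sqrt

# Style 2: import the whole library under an alias
import math as m

# A directly imported function is called by its bare name ...
root_direct = sqrt(16)
# ... while an alias import requires the alias as a prefix
root_via_alias = m.sqrt(16)

print(root_direct, root_via_alias)
```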

**Libraries:**

**Pandas** -- is a software library written for the Python programming language for data manipulation and analysis (dataframes, reading and writing, data alignment, reshaping, slicing, indexing, data structure insertion and deletion, merging, time series functionality, etc.).

**NumPy** -- is a library for Python adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.


**Matplotlib** -- is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

**Seaborn** -- is a Python data visualization library based on matplotlib. It is used to create more attractive and informative statistical graphics. Although seaborn is a separate package, it can also be used to improve the appearance of matplotlib graphics.

**os** -- is a module that provides functions for interacting with the operating system, such as querying system information and, to a limited extent, controlling processes.

**io** -- is a module that provides the Python interfaces to stream handling.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import os
import io
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import metrics


In [None]:
# To make this notebook's output stable across runs (i.e. reproducible), fix the random seed
np.random.seed(42)
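
Reproducibility comes from seeding NumPy's random number generator with `np.random.seed`. The sketch below, with illustrative variable names, shows that re-seeding with the same value reproduces the same draws.

```python
import numpy as np

# Seeding the generator fixes the sequence of pseudo-random numbers
np.random.seed(42)
first_draw = np.random.rand(3)

# Re-seeding with the same value reproduces exactly the same numbers
np.random.seed(42)
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))
```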

In [None]:
print?

In [None]:
# Let's generate some linear looking data:
# Note: numpy.random.randn draws samples from the standard normal distribution, while numpy.random.rand draws from the uniform distribution on [0, 1)
X = 2 * np.random.rand(100, 1)

In [None]:
y = 4 + 3 * X + np.random.randn(100, 1)  # notice the difference: X was drawn from a uniform distribution (rand), while the noise added to y is drawn from a normal distribution (randn)
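
To make the difference between the two generators concrete, here is a small sketch on throwaway samples. The sample size and variable names are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)
uniform_draws = np.random.rand(1000)   # uniform on [0, 1): never negative, never >= 1
normal_draws = np.random.randn(1000)   # standard normal: centred on 0, unbounded

print("rand  range:", uniform_draws.min(), uniform_draws.max())
print("randn range:", normal_draws.min(), normal_draws.max())
```

The uniform draws stay inside the unit interval, while the normal draws spread on both sides of zero.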

In [None]:
print(np.c_[X, y])  # np.c_ stacks X and y side by side as columns (concatenation along the second axis)
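
A tiny sketch of what `np.c_` does, using two made-up column vectors instead of the generated data:

```python
import numpy as np

a = np.array([[1], [2], [3]])
b = np.array([[10], [20], [30]])

# Each row of the result pairs the corresponding rows of a and b
pairs = np.c_[a, b]
print(pairs.shape)
print(pairs)
```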

In [None]:
# Let's plot (info on the marker and the color --> https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html)
plt.plot(X, y, "b.")
plt.xlabel("$x$", fontsize=18)
plt.ylabel("$y$", fontsize=18)
plt.axis([0, 2, 0, 12])

In [None]:
# Training a linear model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression() # create an object for the linear regression
lin_reg.fit(X, y) # fit the data
Y_predict = lin_reg.predict(X)


In [None]:
plt.plot(X, y, "b.")
plt.plot(X, Y_predict, color='red')
plt.show()

In [None]:
X_new = np.array([[0.5], [1.75]])
y_predict = lin_reg.predict(X_new)
y_predict
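
`LinearRegression` fits by ordinary least squares. As a rough sketch of the same fit done directly in NumPy, the example below generates its own data in the same way as above. The names `X_demo`, `y_demo` and `X_b`, and the use of `np.linalg.lstsq`, are illustrative choices and not part of the original script.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = 2 * rng.random((100, 1))
y_demo = 4 + 3 * X_demo + rng.standard_normal((100, 1))

# Add a column of ones so that the first coefficient acts as the intercept
X_b = np.c_[np.ones((100, 1)), X_demo]

# Least-squares solution of X_b @ theta ~ y_demo
theta, *_ = np.linalg.lstsq(X_b, y_demo, rcond=None)
intercept, slope = theta.ravel()

# Predictions for two new inputs, mirroring X_new above
X_new_demo = np.array([[0.5], [1.75]])
y_new_demo = intercept + slope * X_new_demo

print("intercept:", intercept, "slope:", slope)
```

The estimates should land near the true values 4 and 3, with some deviation caused by the noise.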

**Let's try with real data**

---

Read the CSV file: banking.csv

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
dataset = pd.read_csv(io.BytesIO(uploaded['banking.csv']))

In [None]:
dataset.head()

In [None]:
dataset.tail()

In [None]:
dataset.shape # Returns the dimensions of the DataFrame as (rows, columns).

In [None]:
dataset.dtypes # Returns the dtypes in the DataFrame.

In [None]:
# Check for NAs (missing values)
dataset.isna().any() # Generate a boolean mask indicating missing values

In [None]:
dataset.isna().sum()

In [None]:
# Describe the data (printed explicitly, since only the last expression of a cell is displayed)
print(dataset.describe())

print(dataset.columns)

In [None]:
# Select the target and independent variables
# Encode the categorical 'loan' column as a single numeric column (integer category codes),
# so that X has shape (n, 1), as a univariate regression requires
X = dataset['loan'].astype('category').cat.codes.values.reshape(-1, 1)
y = dataset['age'].values.reshape(-1, 1)

In [None]:
# Scatter plot
plt.scatter(X.ravel(), y.ravel(), marker='.') # flatten both arrays to 1-D for plotting
plt.xlabel('loan')
plt.ylabel('age')
plt.show()

In [None]:
# Remove non-numeric vars and check the correlations
corrmat = dataset.drop(['marital', 'education', 'housing', 'loan', 'contact', 'poutcome'], axis=1).corr()
corrmat

In [None]:
# Plot correlation heatmap
sns.heatmap(corrmat, cmap="YlGnBu", linewidths=0.1)
# sns.heatmap(corrmat, cmap="Blues")
# sns.heatmap(corrmat, cmap="BuPu")
# sns.heatmap(corrmat, cmap="Greens")
plt.show()

In [None]:
# Split train and test set: 20% of the data goes to the test set.
# We set random_state because every run without it gives a different split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
# Fit the model on training set
model = LinearRegression()
model.fit(X_train, y_train) # training the algorithm

In [None]:
# Get coefficients
print('Intercept:', model.intercept_)
print('Slope:', model.coef_)
print('\nThe fitted model is y=', round(model.coef_[0][0], 2), '* x +', round(model.intercept_[0], 2))

In [None]:
# Get fitted value on test set
y_test_predicted = model.predict(X_test)

# Compare predictions
pd.DataFrame({'True': y_test.flatten(), 'Predicted': y_test_predicted.flatten()}) # .flatten --> collapses an array into one dimension

In [None]:
# Plot model
plt.scatter(X_test, y_test,  color='gray')
plt.plot(X_test, y_test_predicted, color='red', linewidth=2)
plt.show()

In [None]:
# Plot some predicted vs true values
points_to_plot=30
plt.scatter(X_test[:points_to_plot], y_test[:points_to_plot],  color='blue', marker='o', facecolors='none', label='true value')
plt.scatter(X_test[:points_to_plot], y_test_predicted[:points_to_plot],  color='red', marker='x', label='predicted')
plt.legend()
plt.show()

In [None]:
# Evaluate Root Mean Square Error (RMSE)
RMSE_test = np.sqrt(metrics.mean_squared_error(y_test, y_test_predicted))
print('Root Mean Squared Error on test set:', RMSE_test)
print('Mean of y_test:', y_test.mean())

In [None]:
# Evaluate R-squared
R2 = metrics.r2_score(y_test, y_test_predicted)
print('R-squared:', R2)
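
Both metrics can also be computed by hand, which makes their definitions explicit. The sketch below uses four made-up values instead of the model output, and the `_demo` names are illustrative.

```python
import numpy as np

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_hat_demo = np.array([2.5, 5.5, 7.0, 10.0])

residuals_demo = y_true_demo - y_hat_demo

# RMSE: square root of the mean squared residual (same units as y)
rmse_demo = np.sqrt(np.mean(residuals_demo ** 2))

# R-squared: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(residuals_demo ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print("RMSE:", rmse_demo)
print("R-squared:", r2_demo)
```

Comparing the RMSE with the mean of the target, as done above for the test set, gives a sense of how large the typical error is relative to the values being predicted.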


**Linear regression for forecasting**

---

In this next section, we aim to train a linear model that will predict the Close price of the Bitcoin cryptocurrency.

In [None]:
# Let's import the dataset including Bitcoin prices
from google.colab import files
uploaded = files.upload()

In [None]:
data = pd.read_csv(io.BytesIO(uploaded['data_BTC.csv']))

In [None]:
# Let's check if we imported correctly
data.head()

In [None]:
# Let's get some summary stats on the prices
data.describe()

In [None]:
# Let's plot the movement of the BTC Close Price
plt.figure(figsize=(12, 6))
plt.plot(data['Date'], data['BTC-USD.Close'], label='Bitcoin Price', color='blue')
plt.xlabel('Date')
plt.ylabel('Price (USD)')
plt.title('Bitcoin Price Over Time')
plt.legend()
plt.show()


In [None]:
# We can use a linear model to forecast (predict) the Bitcoin close price at t+1 from a certain number of lagged prices.
# So let's create lag features. Specifically, we define a function that creates the lagged values.
def create_lagged_features(data, lag):
    for i in range(1, lag+1):
        data[f'lag_{i}'] = data['BTC-USD.Close'].shift(i)
    data.dropna(inplace=True)
    return data
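
To see what the lagging does, here is a small sketch on a made-up five-row price column. The frame `prices` and the column name `close` are hypothetical stand-ins for the Bitcoin data.

```python
import pandas as pd

prices = pd.DataFrame({'close': [10.0, 11.0, 12.0, 13.0, 14.0]})

# shift(i) moves the column down by i rows, so row t holds the value from t - i
for i in range(1, 3):
    prices[f'lag_{i}'] = prices['close'].shift(i)

# The first rows have no history for every lag, so they contain NaN and are dropped
lagged = prices.dropna()
print(lagged)
```

With two lags, the first two rows are lost; with a lag of 20, the first 20 rows of the dataset are dropped in the same way.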

In [None]:
# Create lag features with a lag of 20 days
data = create_lagged_features(data, lag=20)

In [None]:
data.head()

In [None]:
# Split the data into training and testing sets. Remember: we never use the full dataset for training. We always split the data, using one portion for training and another for testing.
X = data.drop(['Date','BTC-USD.Close'], axis=1)
y = data['BTC-USD.Close']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
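
Note that `train_test_split` shuffles the rows by default, so with time-ordered data the training set can contain observations dated after some of the test observations. One common alternative for forecasting is a chronological split, where the last portion of the series is held out. It is shown here only as a sketch on a stand-in array and is not the approach used in this script. Passing `shuffle=False` to `train_test_split` should give the same kind of split.

```python
import numpy as np

n_obs = 10
series = np.arange(n_obs)  # stand-in for observations sorted by date

# Keep the first 80% for training and the most recent 20% for testing
split_at = int(n_obs * 0.8)
train_part, test_part = series[:split_at], series[split_at:]

print("train:", train_part)
print("test: ", test_part)
```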


In [None]:
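# Train a linear regression model with statsmodels
# (note: sm.OLS does not add an intercept unless X contains a constant column, e.g. one added with sm.add_constant)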
model = sm.OLS(y_train, X_train).fit()

# Print the summary, which includes coefficients and p-values
print(model.summary())

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
from sklearn.metrics import mean_squared_error, r2_score

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Calculate the r2
r2 = r2_score(y_test, y_pred)

print(f"Root Mean Squared Error on Test Set: {rmse:.2f}")
print(f"R-squared on Test Set: {r2:.2f}")

In [None]:
for true, predicted in zip(y_test, y_pred):
    print(f"True: {true:.2f}, Predicted: {predicted:.2f}")

In [None]:
# Let's plot
plt.scatter(X_test.iloc[:, 0], y_test,  color='blue', marker='o', facecolors='none', label='true value')
plt.scatter(X_test.iloc[:, 0], y_pred,  color='red', marker='x', label='predicted')
plt.legend()
plt.show()

**Linear regression assumes the following:**

---



1. **linear relationship** between regressor(s) and target
2. little or **no multicollinearity** between regressors
3. **homoscedasticity**, i.e. the variance of the error terms (i.e. residuals) doesn't vary too much for all observations
4. **normal distribution of error terms** (i.e. residuals)
5. no correlation between regressors and residuals, and little or **no autocorrelation in residuals** for time series, i.e. no correlation between $e_t$ and $e_{t-1}$
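
For the autocorrelation part of assumption 5, one simple check is the correlation between each residual and the one before it. The sketch below uses synthetic series, not the model residuals, and the helper `lag1_autocorr` is a hypothetical name introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
white_noise = rng.standard_normal(500)   # independent draws: no autocorrelation expected
random_walk = np.cumsum(white_noise)     # each value builds on the previous one

def lag1_autocorr(e):
    """Correlation between e_t and e_{t-1}."""
    return np.corrcoef(e[1:], e[:-1])[0, 1]

print("white noise:", lag1_autocorr(white_noise))
print("random walk:", lag1_autocorr(random_walk))
```

A value close to 0 is consistent with the assumption, while a value close to 1 (or -1) points to autocorrelated residuals.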

In [None]:
# Checking Assumption 1 - linear relationship between regressors and target
# Scatter plot of y vs. one regressor (here the previous close, lag_1, from the forecasting data above)

plt.scatter(X['lag_1'], y, marker='o')
plt.title('Previous Close (lag_1) and Bitcoin Close Price')
plt.xlabel('lag_1 (previous close, USD)')
plt.ylabel('BTC-USD.Close (USD)')
plt.show()

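*Checking Assumption 2 - little to no multicollinearity between regressors*

There is only 1 regressor in a univariate regression, so this assumption holds trivially there. The lagged-price model above has 20 regressors, so the check does matter for it.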

In [None]:
# Checking Assumption 3 - Homoscedasticity
# Plot the residuals and check their "shape"
# (y_test and y_pred come from the forecasting model fitted above)
residuals_test = y_test - y_pred
plt.figure(figsize=(20,10))
plt.scatter(np.arange(0,len(residuals_test)), residuals_test, marker='o', facecolors='none', color='black', alpha=0.5)
plt.show()

In [None]:
# Checking Assumption 4 - Normal distribution of residuals
# Check if residual distribution looks like a normal distribution with same mean and variance

resid_mean = residuals_test.mean()
resid_std = residuals_test.std()
normal_distr = np.random.normal(resid_mean, resid_std, len(residuals_test))

fig, ax = plt.subplots(1, 3, sharex='col', sharey='row', figsize=(20,10))
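# sns.distplot is deprecated; histplot with kde=True and stat='density' is its replacement
sns.histplot(residuals_test, kde=True, stat='density', ax=ax[0])
ax[0].set_title('Residual distribution', fontsize=20)
sns.histplot(normal_distr, kde=True, stat='density', ax=ax[1])
ax[1].set_title('Normal distribution', fontsize=20)
sns.histplot(residuals_test, kde=True, stat='density', label='residuals', ax=ax[2])
sns.histplot(normal_distr, kde=True, stat='density', label='normal\ndistribution', ax=ax[2])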
ax[2].legend(loc='center left', bbox_to_anchor=(1.0, 0.5), fontsize=20)
plt.show()

In [None]:
# Check the QQ-plot, i.e. plot the quantiles of the residuals against the quantiles of a normal distribution

percentile_set = np.linspace(0,100,10000) # 10,000 evenly spaced percentiles between 0% and 100%
residual_percentile = np.percentile(residuals_test, percentile_set)
normal_percentile = np.percentile(normal_distr, percentile_set)

plt.scatter(normal_percentile, residual_percentile, marker='o', facecolor='none', color='blue')
plt.ylabel('residual percentiles', fontsize=20)
plt.xlabel('normal percentiles', fontsize=20)
# plot bisector
line = np.linspace(normal_percentile.min(), normal_percentile.max())
plt.plot(line, line, color="black", ls="dashed")
plt.show()
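
The QQ-plot above compares the residuals with a random normal sample, so the reference quantiles change slightly from run to run. A possible variation, sketched here on hypothetical residuals, takes the reference quantiles from the theoretical normal distribution instead. Using `statistics.NormalDist` from the standard library is an illustrative choice and not part of the original script.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
resid_demo = rng.normal(loc=0.0, scale=2.0, size=1000)  # hypothetical residuals

# Compare the 1%, 2%, ..., 99% quantiles of the sample and of the fitted normal
probs = np.linspace(0.01, 0.99, 99)
sample_q = np.quantile(resid_demo, probs)
reference = NormalDist(mu=resid_demo.mean(), sigma=resid_demo.std())
theory_q = np.array([reference.inv_cdf(p) for p in probs])

# For normally distributed residuals the points (theory_q, sample_q) lie close to the bisector
print("largest gap between quantiles:", np.max(np.abs(sample_q - theory_q)))
```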

In [None]:
# Checking Assumption 5 - Independence of residuals
# Check correlation between residuals and regressors
np.corrcoef(residuals_test.T, X_test.T)