# Regression-PCA-Regularisation

**Linear Regression: Train/Test Split and Metric Analysis**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset from the Excel workbook
pd.set_option("display.notebook_repr_html", False)  # disable "rich" output
data = pd.read_excel("Real estate valuation data set.xlsx")

In [2]:
#******************************************************************************************************************

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

#******************************************************************************************************************
#****** Define X and Y variables in the dataset
X = data.drop(columns=["Y house price of unit area"])
y = data["Y house price of unit area"]

#****** split the dataset into training and test datasets (80-20 split, test size = 20%, training size 80%)
X_train_data_1, X_test_data_1, y_train_data_1, y_test_data_1 = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train_data_1, y_train_data_1)

y_pred = model.predict(X_test_data_1)

#******************************************************************************************************************
#****** Check model
mse_1 = mean_squared_error(y_test_data_1, y_pred)
mae_1 = mean_absolute_error(y_test_data_1, y_pred)
r2_1 = r2_score(y_test_data_1, y_pred)

print("Mean Squared Error:", mse_1)
print("Mean Absolute Error:", mae_1)
print("R^2 Score:", r2_1)

#******************************************************************************************************************

Mean Squared Error: 54.59884830498824
Mean Absolute Error: 5.418032735899282
R^2 Score: 0.6745414195692352


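Mean squared error (MSE) is the average of the squared differences between the predicted values and the actual values.
Mean absolute error (MAE) is the average of the absolute differences between the predicted values and the actual values.
R squared measures how much of the variance in the target variable the model explains. Its maximum is 1 (a perfect fit), 0 means the model does no better than always predicting the mean, and on test data it can be negative.
In question 1, the MSE and MAE are 54.6 and 5.42 respectively. MSE is in squared units of the target and MAE is in the units of the target, so the predictions are off by about 5.4 units of house price per unit area on average. Taken alone, these values do not point to a strong model, although they can only be judged against the scale of the target. The R squared of 0.67 means that the model explains about 67% of the variance in the target variable on the test set.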
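The three definitions above can be checked by hand on a small example. The sketch below uses made-up numbers (not the real estate data) and plain NumPy; the variable names are chosen for illustration only.

```python
import numpy as np

# Made-up values for illustration only; not taken from the real estate data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred                 # [1, 0, -1, -2]

mse = np.mean(errors ** 2)               # (1 + 0 + 1 + 4) / 4 = 1.5
mae = np.mean(np.abs(errors))            # (1 + 0 + 1 + 2) / 4 = 1.0

ss_res = np.sum(errors ** 2)             # residual sum of squares = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 20
r2 = 1 - ss_res / ss_tot                 # 1 - 6/20 = 0.7

# Predictions that are worse than always predicting the mean give R^2 < 0.
y_bad = np.array([9.0, 7.0, 5.0, 3.0])
r2_bad = 1 - np.sum((y_true - y_bad) ** 2) / ss_tot  # 1 - 80/20 = -3.0

print("MSE:", mse)
print("MAE:", mae)
print("R^2:", r2)
print("R^2 (bad predictions):", r2_bad)
```

The last case shows why R squared is not bounded below by 0 when it is evaluated on held-out data.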

**Dimensionality reduction with PCA and Linear regression**

In [11]:
#******************************************************************************************************************

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

#******************************************************************************************************************
#****** Apply StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#****** Apply PCA
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

#****** Split the PCA transformed dataset into 80-20 split
X_train_pca, X_test_pca, y_train_data_2, y_test_data_2 = train_test_split(X_pca, y, test_size=0.2, random_state=42)

#****** Linear Regression
model_pca = LinearRegression()
model_pca.fit(X_train_pca, y_train_data_2)

y_pred_pca = model_pca.predict(X_test_pca)

#******************************************************************************************************************

mse_pca = mean_squared_error(y_test_data_2, y_pred_pca)
mae_pca = mean_absolute_error(y_test_data_2, y_pred_pca)
r2_pca = r2_score(y_test_data_2, y_pred_pca)

print("Mean Squared Error for model with and without PCA:", mse_pca, "&", mse_1)
print("Mean Absolute Error for model with and without PCA:", mae_pca, "&", mae_1)
print("R^2 Score for model with and without PCA:", r2_pca, "&", r2_1)

#******************************************************************************************************************

Mean Squared Error for model with and without PCA: 58.774641855017535 & 54.59884830498824
Mean Absolute Error for model with and without PCA: 5.82883266736958 & 5.418032735899282
R^2 Score for model with and without PCA: 0.6496499084264935 & 0.6745414195692352


Compared to the model built in question 1, the PCA model in question 2 has a higher MSE (58.8 vs 54.6), a higher MAE (5.83 vs 5.42) and a lower R squared (0.65 vs 0.67). This shows that the original model in question 1 predicts the test data better.
The weaker result in question 2 is probably due to keeping only the first three principal components. These components may not capture enough of the information in the dataset, so some of it is lost. PCA also chooses directions by feature variance alone, without looking at the target, so the discarded components may still carry predictive signal.
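One way to judge how much information a given number of components keeps is to look at the explained variance ratio of a fitted PCA. The sketch below is a minimal illustration on synthetic data (the array `X_demo` and the 95% threshold are assumptions for the example, not values from the notebook); the same inspection could be run on the scaled real estate features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix (hypothetical: 200 rows, 6 columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
# Make one column nearly redundant so the components are not all equal.
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA without limiting n_components so every component is kept.
pca_full = PCA()
pca_full.fit(X_demo_scaled)

ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for k, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"{k} component(s): {r:.3f} of variance, {c:.3f} cumulative")

# Smallest number of components whose cumulative ratio reaches 95%
# (the threshold is an arbitrary choice for this illustration).
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95% of the variance:", n_95)
```

A high cumulative ratio only says that the feature variance is preserved; it does not guarantee that the retained components are the ones most useful for predicting the target.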


**Logistic Regression with PCA on the Iris Dataset**

In [4]:
#Import IRIS dataset from Sklearn
from sklearn.datasets import load_iris

iris_data = load_iris()
X_iris = iris_data.data
y_iris = iris_data.target

In [8]:
#******************************************************************************************************************

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

#******************************************************************************************************************
#****** Apply StandardScaler to standardise the features
scaler_3 = StandardScaler()
X_scaled_3 = scaler_3.fit_transform(X_iris)

#****** Apply PCA to select the first three principal components
pca_3 = PCA(n_components=3)
X_iris_pca = pca_3.fit_transform(X_scaled_3)

#******************************************************************************************************************

In [9]:
#******************************************************************************************************************

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#******************************************************************************************************************
# Split the dataset into training and testing sets (80-20 split)
X_train_data_3, X_test_data_3, y_train_data_3, y_test_data_3 = train_test_split(X_iris_pca, y_iris, test_size=0.2, random_state=42)

# Train a logistic regression model
log_reg_model = LogisticRegression()
log_reg_model.fit(X_train_data_3, y_train_data_3)

# Make predictions
y_pred_3 = log_reg_model.predict(X_test_data_3)

#******************************************************************************************************************
# Performance evaluation
accuracy = accuracy_score(y_test_data_3, y_pred_3)
precision = precision_score(y_test_data_3, y_pred_3, average='weighted')
recall = recall_score(y_test_data_3, y_pred_3, average='weighted')
f1 = f1_score(y_test_data_3, y_pred_3, average='weighted')

print("Performance metrics:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

#******************************************************************************************************************

Performance metrics:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0


Accuracy is the proportion of correct predictions out of all predictions. An accuracy of 1 means that every test sample was classified correctly.

Precision is the proportion of true positive predictions over all positive predictions. This means a precision score of 1 indicates that there is no false positive classification.

Recall is the proportion of true positive predictions over actual positive instances. This means a recall score of 1 indicates all positive instances are correctly classified.

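F1 is the harmonic mean of precision and recall. Because the iris data has three classes, precision, recall and F1 are computed per class (treating that class as "positive") and then averaged, weighted by the number of true samples in each class (average='weighted').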

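In this model, all scores are 1, which implies that the logistic model classifies every sample in the test set correctly. The test set holds only 30 of the 150 iris samples, so a perfect score here does not guarantee perfect performance on new data.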
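The definitions above can be worked through from a confusion matrix. The sketch below uses made-up labels for three classes (not the iris predictions, which are all correct and so less informative) and compares the hand-computed weighted scores with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels for illustration only (three classes, ten samples).
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0, 0])

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = np.zeros((3, 3), dtype=int)
np.add.at(cm, (y_true_demo, y_pred_demo), 1)

tp = np.diag(cm)                       # correct predictions per class
support = cm.sum(axis=1)               # true samples per class
predicted = cm.sum(axis=0)             # predicted samples per class

accuracy_demo = tp.sum() / cm.sum()    # correct / total
precision_per_class = tp / predicted   # TP / (TP + FP)
recall_per_class = tp / support        # TP / (TP + FN)
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))

# Weighted average: each class counts in proportion to its support.
weights = support / support.sum()
precision_w = np.sum(weights * precision_per_class)
recall_w = np.sum(weights * recall_per_class)
f1_w = np.sum(weights * f1_per_class)

print("Accuracy:", accuracy_demo)
print("Weighted precision:", precision_w)
print("Weighted recall:", recall_w)
print("Weighted F1:", f1_w)

# Cross-check against scikit-learn.
assert np.isclose(accuracy_demo, accuracy_score(y_true_demo, y_pred_demo))
assert np.isclose(precision_w, precision_score(y_true_demo, y_pred_demo, average="weighted"))
assert np.isclose(recall_w, recall_score(y_true_demo, y_pred_demo, average="weighted"))
assert np.isclose(f1_w, f1_score(y_true_demo, y_pred_demo, average="weighted"))
```

Note that the support-weighted recall always equals the accuracy, since both reduce to the number of correct predictions divided by the number of samples.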


**Regularisation in Logistic Regression: L1 and L2**

In [10]:
#******************************************************************************************************************

# penalty='l2' is also scikit-learn's default, so this matches log_reg_model above
reg_log_reg_model = LogisticRegression(penalty='l2')
reg_log_reg_model.fit(X_train_data_3, y_train_data_3)

y_pred_reg = reg_log_reg_model.predict(X_test_data_3)

#******************************************************************************************************************
#****** Comparing with the previous model
accuracy_reg = accuracy_score(y_test_data_3, y_pred_reg)
precision_reg = precision_score(y_test_data_3, y_pred_reg, average='weighted')
recall_reg = recall_score(y_test_data_3, y_pred_reg, average='weighted')
f1_reg = f1_score(y_test_data_3, y_pred_reg, average='weighted')

print("Accuracy in the previous model and this model:", accuracy, "&", accuracy_reg)
print("Precision in the previous model and this model:", precision, "&", precision_reg)
print("Recall in the previous model and this model:", recall, "&", recall_reg)
print("F1 Score in the previous model and this model:", f1, "&", f1_reg)

#******************************************************************************************************************

Accuracy in the previous model and this model: 1.0 & 1.0
Precision in the previous model and this model: 1.0 & 1.0
Recall in the previous model and this model: 1.0 & 1.0
F1 Score in the previous model and this model: 1.0 & 1.0


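The L2-regularised model also scores 1 on every metric, so on this test set it classifies every sample correctly.
This matches the scores in Question 3, which is expected: LogisticRegression applies an L2 penalty by default, so the two models are configured identically and the comparison does not isolate the effect of regularisation. Seeing a difference would require a different penalty (for example L1) or a different regularisation strength C.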
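A comparison that does vary the regularisation could look like the sketch below. It is an illustration under stated assumptions, not the notebook's method: it uses the four original iris features instead of the PCA components, the `saga` solver (which supports both L1 and L2 penalties), and two arbitrarily chosen values of `C`. Smaller `C` means stronger regularisation. Recent scikit-learn releases may warn about the `penalty` argument.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for penalty in ("l1", "l2"):
    for C in (0.01, 1.0):  # assumed values, chosen only to contrast strengths
        # The scaler is fitted on the training split only, inside the pipeline.
        pipe = make_pipeline(
            StandardScaler(),
            LogisticRegression(
                penalty=penalty, C=C, solver="saga",
                max_iter=5000, random_state=0,
            ),
        )
        pipe.fit(X_tr, y_tr)
        coef = pipe[-1].coef_  # shape (n_classes, n_features)
        results[(penalty, C)] = {
            "accuracy": accuracy_score(y_te, pipe.predict(X_te)),
            "zero_coefs": int(np.sum(np.isclose(coef, 0.0))),
            "coef_shape": coef.shape,
        }

for (penalty, C), r in results.items():
    print(f"penalty={penalty}, C={C}: accuracy={r['accuracy']:.3f}, "
          f"zero coefficients={r['zero_coefs']} of {coef.size}")
```

Counting the coefficients that are exactly zero is one way to see the practical difference between the penalties: L1 tends to drive some coefficients to zero, while L2 shrinks them without usually eliminating them.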