
# **Part 1: Introduction**


This report presents the findings of machine learning implementations for two distinct tasks: regression and classification. The primary aim is to predict housing prices using the California Housing dataset for the regression task and to predict survival outcomes for passengers aboard the Titanic in the classification task. For each task, various machine learning models have been employed, including baseline models for comparison. The report provides a detailed overview of the methodologies employed, including preprocessing steps, model selection rationale, and evaluation metrics. By analyzing the performance of different models, we aim to identify the most effective approach for each task. This report offers insights into the predictive capabilities of different machine learning algorithms and their suitability for specific prediction tasks.

# **Part 2: Regression**

First, we import the necessary libraries.



In [50]:
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
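# Regression models (Part 2) and classification models (Part 3)
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR, SVC
from sklearn.metrics import mean_squared_error, accuracy_score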
from sklearn.datasets import fetch_openml

**Part 2-1: Pre-processing**


We begin by loading the California Housing dataset and splitting it into training and test sets. We then preprocess the data by imputing missing numerical values with the median and one-hot encoding the categorical `ocean_proximity` feature. The numerical features are not scaled in this cell.

In [51]:

# Load California Housing dataset
california_housing = pd.read_csv('/content/housing_coursework_entire_dataset_23-24.csv')

# Display the first few rows of the dataset
print(california_housing.head())

# Split features and target variable
X_reg = california_housing.drop(columns=['median_house_value'])
y_reg = california_housing['median_house_value']

# Split the dataset into training and test sets
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Handle categorical feature 'ocean_proximity'
# scikit-learn >= 1.2 uses `sparse_output`; the older `sparse` argument was removed in 1.4
ocean_proximity_encoder = OneHotEncoder(sparse_output=False)
X_train_reg_ocean_encoded = ocean_proximity_encoder.fit_transform(X_train_reg[['ocean_proximity']])
X_test_reg_ocean_encoded = ocean_proximity_encoder.transform(X_test_reg[['ocean_proximity']])

# Impute missing values for numerical features
num_features = X_train_reg.select_dtypes(include=np.number).columns
imputer_reg = SimpleImputer(strategy='median')
X_train_reg_imputed = imputer_reg.fit_transform(X_train_reg[num_features])
X_test_reg_imputed = imputer_reg.transform(X_test_reg[num_features])

# Concatenate imputed numerical features with encoded categorical features
X_train_reg_processed = np.concatenate([X_train_reg_imputed, X_train_reg_ocean_encoded], axis=1)
X_test_reg_processed = np.concatenate([X_test_reg_imputed, X_test_reg_ocean_encoded], axis=1)

# Main model: Random Forest Regression
rf_reg_main = RandomForestRegressor()
rf_reg_main.fit(X_train_reg_processed, y_train_reg)

# Baseline models
linear_reg_baseline = LinearRegression()
linear_reg_baseline.fit(X_train_reg_processed, y_train_reg)

svr_reg_baseline = SVR()
svr_reg_baseline.fit(X_train_reg_processed, y_train_reg)

# Evaluate the models
mse_rf_reg_main = mean_squared_error(y_test_reg, rf_reg_main.predict(X_test_reg_processed))
mse_linear_reg_baseline = mean_squared_error(y_test_reg, linear_reg_baseline.predict(X_test_reg_processed))
mse_svr_reg_baseline = mean_squared_error(y_test_reg, svr_reg_baseline.predict(X_test_reg_processed))

print("Random Forest Regression (Main Model) Mean Squared Error:", mse_rf_reg_main)
print("Linear Regression (Baseline Model) Mean Squared Error:", mse_linear_reg_baseline)
print("Support Vector Regression (Baseline Model) Mean Squared Error:", mse_svr_reg_baseline)


   No.  longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    1    -122.12     37.70                  17         2488           617.0   
1    2    -122.21     38.10                  36         3018           557.0   
2    3    -122.22     38.11                  43         1939           353.0   
3    4    -122.20     37.78                  52         2300           443.0   
4    5    -122.19     37.79                  50          954           217.0   

   population  households  median_income  median_house_value ocean_proximity  
0        1287         538         2.9922              179900        NEAR BAY  
1        1445         556         3.8029              129900        NEAR BAY  
2         968         392         3.1848              112700        NEAR BAY  
3        1225         423         3.5398              158400        NEAR BAY  
4         546         201         2.6667              172800        NEAR BAY  




Random Forest Regression (Main Model) Mean Squared Error: 3561240998.8646398
Linear Regression (Baseline Model) Mean Squared Error: 3883422015.410895
Support Vector Regression (Baseline Model) Mean Squared Error: 12812942703.265884
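
The imputation and encoding steps above can also be written as a single scikit-learn transformer, which makes it straightforward to add a scaling step. The following is a minimal sketch on a made-up four-row frame, not the notebook's actual preprocessing: the toy values and the `preprocess` name are assumptions for illustration, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing data (values are invented for illustration).
toy = pd.DataFrame({
    "total_bedrooms": [617.0, np.nan, 353.0, 443.0],
    "median_income": [2.9922, 3.8029, 3.1848, 3.5398],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})

numeric_cols = ["total_bedrooms", "median_income"]
categorical_cols = ["ocean_proximity"]

preprocess = ColumnTransformer([
    # Median imputation followed by standardization for numeric columns.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # One-hot encoding; categories unseen during fit become all-zero columns.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on the (toy) training rows only, then reuse the fitted transformer.
features = preprocess.fit_transform(toy)
print(features.shape)
```

In a real run, `fit_transform` would be called on the training split and `transform` on the test split, so that the test data never influences the imputation medians or scaling statistics.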


**Part 2-2: Methodology**

For the regression task, we choose Random Forest Regression as the main model because it can capture non-linear relationships and feature interactions in the data. A random forest builds many decision trees on bootstrap samples of the training data and averages their predictions, which reduces the overfitting of any single tree and improves generalization.
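The averaging step can be checked directly in scikit-learn. The sketch below uses synthetic data from `make_regression` as a stand-in for the housing features (an assumption for illustration) and compares the forest's prediction with the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the housing features (an assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predict with every individual tree, then average across trees.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction is the same average.
matches = np.allclose(manual_average, forest.predict(X))
print(matches)
```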

In [52]:
# Main model: Random Forest Regression
rf_reg_main = RandomForestRegressor()


# **Part 2-3: Experiment**

We'll compare the performance of the Random Forest Regression model with two baseline models: Linear Regression and Support Vector Regression. We'll evaluate the models using Mean Squared Error (MSE) as the evaluation metric.

In [53]:
# Baseline models
lr_reg_baseline = LinearRegression()
svr_reg_baseline = SVR()

# Train the models
rf_reg_main.fit(X_train_reg_processed, y_train_reg)
lr_reg_baseline.fit(X_train_reg_processed, y_train_reg)
svr_reg_baseline.fit(X_train_reg_processed, y_train_reg)

# Evaluate the models
mse_rf_reg_main = mean_squared_error(y_test_reg, rf_reg_main.predict(X_test_reg_processed))
mse_lr_reg_baseline = mean_squared_error(y_test_reg, lr_reg_baseline.predict(X_test_reg_processed))
mse_svr_reg_baseline = mean_squared_error(y_test_reg, svr_reg_baseline.predict(X_test_reg_processed))

print("Random Forest Regression (Main Model) Mean Squared Error:", mse_rf_reg_main)
print("Linear Regression (Baseline Model) Mean Squared Error:", mse_lr_reg_baseline)
print("Support Vector Regression (Baseline Model) Mean Squared Error:", mse_svr_reg_baseline)


Random Forest Regression (Main Model) Mean Squared Error: 3388814189.012835
Linear Regression (Baseline Model) Mean Squared Error: 3883422015.410895
Support Vector Regression (Baseline Model) Mean Squared Error: 12812942703.265884
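
The Random Forest MSE printed here differs from the one printed by the preprocessing cell (3388814189.01 versus 3561240998.86), while the two baselines are identical across runs. This is expected when the forest is created without a `random_state`, since each fit draws different bootstrap samples. A minimal sketch of the effect on synthetic data (the data and the `fit_and_predict` helper are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

def fit_and_predict(seed):
    """Train a small forest with the given seed and return its predictions."""
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    return model.fit(X, y).predict(X)

# The same seed reproduces the same forest; a different seed does not.
same_seed_equal = np.array_equal(fit_and_predict(42), fit_and_predict(42))
different_seed_equal = np.array_equal(fit_and_predict(42), fit_and_predict(7))
print(same_seed_equal, different_seed_equal)
```

Passing a fixed `random_state` to `RandomForestRegressor` would make the reported MSE repeatable from run to run.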



**Part 2-3-1: Experimental Settings**
For the regression task, three models were implemented: Random Forest Regression as the main model, and Linear Regression and Support Vector Regression as baselines. As the code above shows, each model was instantiated with scikit-learn's default hyperparameters; no hyperparameter search was run. All models were trained on the same 80% training split and evaluated on the held-out 20% test split (`random_state=42`).
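For reference, one common way to search over hyperparameters is a cross-validated grid search. The sketch below is not part of the experiment: it runs on synthetic data, and the grid values are hypothetical choices made only to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Hypothetical search grid; sensible values depend on the data.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mse = -search.best_score_  # flip the sign back to a plain MSE
print(best_params, best_cv_mse)
```

The search should be fitted on the training split only, so the test split stays untouched for the final comparison.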

**Part 2-3-2: Results**
The chosen evaluation metric was Mean Squared Error (MSE), the average squared difference between the predicted and actual values. Because the errors are squared, large mistakes are penalized more heavily than small ones, and the result is expressed in squared units of the target, which is why the values below are so large.

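The results of the experiment, as printed by the cell above, are as follows (the Random Forest value changes slightly between runs because the model is created without a `random_state`):
- Random Forest Regression (Main Model) Mean Squared Error: 3388814189.012835
- Linear Regression (Baseline Model) Mean Squared Error: 3883422015.410895
- Support Vector Regression (Baseline Model) Mean Squared Error: 12812942703.265884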
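
MSE values of this size are hard to read because they are in squared units of the house value. Taking the square root gives the Root Mean Squared Error (RMSE), which is in the same units as `median_house_value`. A short sketch using the MSE values printed by the experiment cell:

```python
import math

# MSE values as printed by the experiment cell.
mse = {
    "Random Forest": 3388814189.012835,
    "Linear Regression": 3883422015.410895,
    "SVR": 12812942703.265884,
}

# RMSE is the square root of MSE and is in the same units as the target.
rmse = {name: math.sqrt(value) for name, value in mse.items()}

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:,.0f}")
```

On this scale the Random Forest's typical error is roughly 58,000, Linear Regression's roughly 62,000, and SVR's roughly 113,000.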

**Part 2-3-3: Discussion**
The Random Forest Regression model outperformed both baseline models, achieving the lowest MSE. This is consistent with its ability to model non-linear relationships and its relative insensitivity to outliers, although the experiment does not isolate which factor matters most. Linear Regression, despite its simplicity, performed acceptably but had a higher MSE than the Random Forest. Support Vector Regression had by far the highest MSE, more than three times that of Linear Regression. A likely cause is that SVR was run with its default settings on unscaled features and an unscaled target: the RBF kernel is sensitive to feature scale, and the default regularization (`C=1.0`) is very small relative to house values in the hundreds of thousands. Overall, the experiment indicates that Random Forest Regression is the most suitable of the three models for predicting median house prices in the California Housing dataset.
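SVR's sensitivity to scale can be illustrated on synthetic data. The sketch below is an assumption-laden illustration, not a rerun of the experiment: the data are generated so that the features have very different scales and the target lies in the hundreds of thousands, and the same default `SVR` is trained once on raw values and once with standardized features and target.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: features on very different scales, target in the 100,000s.
rng = np.random.default_rng(0)
z = rng.normal(size=(400, 3))
X = z * np.array([1.0, 100.0, 10000.0])
y = 200000 + 50000 * z[:, 0] + 30000 * z[:, 1] + rng.normal(scale=5000, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Default SVR on raw features and raw target.
raw_svr = SVR().fit(X_train, y_train)
mse_raw = mean_squared_error(y_test, raw_svr.predict(X_test))

# Same SVR, but with standardized features and a standardized target.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_svr.fit(X_train, y_train)
mse_scaled = mean_squared_error(y_test, scaled_svr.predict(X_test))

print(mse_scaled < mse_raw)
```

Whether scaling would close the gap on the housing data is not shown by this report; the sketch only demonstrates why unscaled inputs are a plausible explanation.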



# **Part 3: Classification**


**Part 3-1: Pre-processing**


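We load the Titanic dataset and preprocess it by imputing missing values with the most frequent value in each column and one-hot encoding the categorical variables. Only the first 650 rows are used for the train/test split.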

In [54]:
# Load Titanic dataset
titanic_data = pd.read_csv('/content/Titanic_coursework_entire_dataset_23-24.csv')

# Select features and target
X_cls = titanic_data.drop(columns=['Survival'])
y_cls = titanic_data['Survival']

# Split into training and test sets
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(X_cls[:650], y_cls[:650], test_size=0.2, random_state=42)

# Combine training and test sets for one-hot encoding
X_combined_cls = pd.concat([X_train_cls, X_test_cls])

# Impute missing values
imputer_cls = SimpleImputer(strategy='most_frequent')
X_combined_cls_imputed = imputer_cls.fit_transform(X_combined_cls)

# One-hot encoding for categorical variables
# scikit-learn >= 1.2 uses `sparse_output`; the older `sparse` argument was removed in 1.4
encoder_cls = OneHotEncoder(sparse_output=False, drop='first')
X_combined_cls_encoded = encoder_cls.fit_transform(X_combined_cls_imputed[:, [1, 3, 6, 7, 8]])

# Split back into training and test sets
X_train_cls_encoded = X_combined_cls_encoded[:len(X_train_cls)]
X_test_cls_encoded = X_combined_cls_encoded[len(X_train_cls):]
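
The cell above fits the imputer and encoder on the combined training and test rows so that every category is known to the encoder. An alternative that keeps the test rows out of the fitting step is to fit on the training rows only and let the encoder ignore unseen categories. A minimal sketch, assuming hypothetical column names (`Sex`, `Embarked`) and toy values, since the real dataset's columns are not shown here:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical columns; the real dataset's names may differ.
train = pd.DataFrame({"Sex": ["male", "female", "female", np.nan],
                      "Embarked": ["S", "C", "S", "S"]}, dtype=object)
test = pd.DataFrame({"Sex": ["female", "male"],
                     "Embarked": ["Q", np.nan]}, dtype=object)  # "Q" never appears in train

encode = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    # Unseen categories are encoded as all zeros instead of raising an error.
    OneHotEncoder(handle_unknown="ignore"),
)

train_encoded = encode.fit_transform(train)  # statistics learned from train only
test_encoded = encode.transform(test)        # test rows reuse those statistics
print(train_encoded.shape, test_encoded.shape)
```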



# **Part 3-2: Methodology**

For the classification task, the main model is the Gradient Boosting Classifier, which builds an ensemble of shallow decision trees sequentially, with each new tree fitted to correct the errors of the ensemble so far. Logistic Regression, which models the probability that an input belongs to a class using the logistic function, is kept as a simple and interpretable baseline.
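The main model instantiated in the next cell is a Gradient Boosting Classifier, which adds trees one at a time. To illustrate how such an ensemble builds up its prediction, the sketch below uses synthetic data from `make_classification` (an assumption, standing in for the encoded Titanic features) and records test accuracy after each added tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded Titanic features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

booster = GradientBoostingClassifier(n_estimators=20, random_state=0)
booster.fit(X_train, y_train)

# staged_predict yields the ensemble's prediction after each added tree.
staged_accuracy = [accuracy_score(y_test, stage_pred)
                   for stage_pred in booster.staged_predict(X_test)]

final_accuracy = accuracy_score(y_test, booster.predict(X_test))
print(len(staged_accuracy), staged_accuracy[0], staged_accuracy[-1])
```

The last staged value equals the accuracy of the fully trained model, since the final stage is the complete ensemble.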


In [55]:
# Main model: Gradient Boosting Classifier
gb_cls_main = GradientBoostingClassifier()
gb_cls_main.fit(X_train_cls_encoded, y_train_cls)



# **Part 3-3: Experiment**

We'll compare the performance of the Gradient Boosting Classifier with two baseline models: a Support Vector Machine (SVM) classifier and Logistic Regression. We'll evaluate the models using accuracy as the evaluation metric.

In [56]:
# Baseline models
svm_cls_baseline = SVC()
svm_cls_baseline.fit(X_train_cls_encoded, y_train_cls)

# Additional Baseline Model: Logistic Regression
logistic_cls_baseline = LogisticRegression()
logistic_cls_baseline.fit(X_train_cls_encoded, y_train_cls)

# Evaluate the models
accuracy_gb_cls_main = accuracy_score(y_test_cls, gb_cls_main.predict(X_test_cls_encoded))
accuracy_svm_cls_baseline = accuracy_score(y_test_cls, svm_cls_baseline.predict(X_test_cls_encoded))
accuracy_logistic_cls_baseline = accuracy_score(y_test_cls, logistic_cls_baseline.predict(X_test_cls_encoded))

print("Gradient Boosting Classifier (Main Model) Accuracy:", accuracy_gb_cls_main)
print("Support Vector Machine (SVM) Classifier (Baseline Model) Accuracy:", accuracy_svm_cls_baseline)
print("Logistic Regression (Baseline Model) Accuracy:", accuracy_logistic_cls_baseline)


Gradient Boosting Classifier (Main Model) Accuracy: 0.8
Support Vector Machine (SVM) Classifier (Baseline Model) Accuracy: 0.8076923076923077
Logistic Regression (Baseline Model) Accuracy: 0.7923076923076923


**Part 3-3-1: Experimental Settings**

In this classification experiment, I used three models: the Gradient Boosting Classifier as the main model, and the Support Vector Machine (SVM) Classifier and Logistic Regression as baseline models. Preprocessing consisted of imputing missing values with the most frequent value and one-hot encoding the categorical variables. As the code above shows, all models were trained with scikit-learn's default hyperparameters; no hyperparameter search was run.

**Part 3-3-2: Results**

To evaluate the classification models, I used accuracy, the proportion of test instances that are classified correctly. It is easy to interpret, but it does not distinguish between the two kinds of error and can be misleading when the classes are imbalanced. On the 130-passenger test set (20% of the 650 rows used), the Gradient Boosting Classifier achieved an accuracy of 0.8 (104 correct), the SVM Classifier 0.8077 (105 correct), and the Logistic Regression baseline 0.7923 (103 correct).
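The reported accuracies can be turned back into counts of correctly classified passengers, which shows how close the three models are. The sketch below assumes the test set holds 130 rows (20% of the 650 rows passed to `train_test_split`) and also computes a rough binomial standard error for each accuracy.

```python
import math

n_test = 650 * 20 // 100  # 20% of the 650 rows used in the split

accuracy = {
    "Gradient Boosting": 0.8,
    "SVM": 0.8076923076923077,
    "Logistic Regression": 0.7923076923076923,
}

# Number of correctly classified test passengers behind each accuracy.
correct = {name: round(acc * n_test) for name, acc in accuracy.items()}

# Binomial standard error of an accuracy estimated from n_test samples.
std_error = {name: math.sqrt(acc * (1 - acc) / n_test) for name, acc in accuracy.items()}

for name in accuracy:
    print(f"{name}: {correct[name]}/{n_test} correct, standard error ~ {std_error[name]:.3f}")
```

The models differ by one passenger each, while the standard error of each accuracy is several percentage points, so the differences are well within sampling noise.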

**Part 3-3-3: Discussion**

The SVM Classifier obtained the highest accuracy, closely followed by the Gradient Boosting Classifier and then Logistic Regression. The gaps are small, however: on a 130-passenger test set, each step between the three models corresponds to a single passenger, so this one split does not establish a reliable ranking. The SVM's competitive result may reflect its ability to capture non-linear relationships in the data, and Gradient Boosting's edge over Logistic Regression may reflect its ensemble nature, which combines many weak learners into a stronger model. Overall, all three models reached roughly 80% accuracy on the Titanic data, and the Gradient Boosting Classifier is retained as the main model because its performance is competitive with the best baseline.
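When models are this close on a single split, k-fold cross-validation gives a steadier comparison by averaging accuracy over several train/test partitions. The following is a minimal sketch on synthetic data (an assumption, not the Titanic features), using the same three model families with mostly default settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=30, random_state=0),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Stratified folds keep the class balance similar in every partition.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {name: cross_val_score(model, X, y, cv=folds, scoring="accuracy")
             for name, model in models.items()}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the mean together with the spread across folds makes it clearer whether one model is consistently better or merely ahead on one split.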

# **Part 4: Conclusion**

In this study, I applied various machine learning models to two distinct tasks: regression on the California Housing dataset and classification on the Titanic dataset. Through experimentation and analysis, I aimed to identify the most suitable model for each task and to evaluate its performance.

For the regression task, I implemented three models: Random Forest Regression as the main model, and Linear Regression and Support Vector Regression as baseline models. Evaluating them with their default hyperparameters using Mean Squared Error (MSE), I found that Random Forest Regression achieved the lowest MSE, outperforming both baseline models. This result is consistent with its capability to handle non-linear relationships and outliers.

For the classification task, I initially employed Logistic Regression and Support Vector Machine (SVM) classifiers. Following a suggestion, I replaced Logistic Regression with the Gradient Boosting Classifier as the main model and kept Logistic Regression as a baseline. On the test set, the Gradient Boosting Classifier reached an accuracy of 0.8, comparable to the SVM Classifier (0.8077) and Logistic Regression (0.7923).

In conclusion, the results suggest that Random Forest Regression is the most suitable model for the regression task, while Gradient Boosting Classifier shows promise for the classification task. These findings provide valuable insights into selecting appropriate machine learning models for predictive tasks and highlight the importance of thorough experimentation and analysis to determine optimal model performance. Further exploration and refinement of these models could lead to improved predictive capabilities and better decision-making in real-world scenarios.



