<a href="https://colab.research.google.com/github/sgathai/Various-note-books-_Practising/blob/master/%5BSolution_Notebook%5D_AfterWork_ML_Essentials_with_scikit_learn_Course.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Solution Notebook] AfterWork: ML Essentials with scikit-learn Course

## Prerequisites

In [None]:
# First import the libraries that we need.
# ----
import pandas as pd                   # library for performing data manipulation.
import numpy as np                    # library for performing scientific computations.
import matplotlib.pyplot as plt       # library for performing visualization.

## 1. Classification

The goal here is to categorize input data into predefined classes or labels.

### Example

In [None]:
# In this example, we will use the random forest classifier and gradient boosting
# classifier to predict discrete outcomes on a dataset.
# ---
# These models learn patterns in the training data and then use the information
# learned to classify the testing data.
# ---
# Dataset url = https://bit.ly/3Sn7blU
# ---
# We first import our classification models.
# ---
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

#### Data Importation

In [None]:
# We import and preview our dataset.
# ---
# Dataset url = https://bit.ly/3Sn7blU
# ---
# This dataset contains observations of iris flowers including sepal and petal
# length and width for each flower, as well as the species of the flower.
# The labels are the species of the iris flower.
# ---
iris_df = pd.read_csv('https://bit.ly/3Sn7blU')

# Check the first records.
iris_df.head()

In [None]:
# Check the last records.
iris_df.tail()

#### Data Exploration/ Cleaning/ Preparation/ Statistical Analysis



We will not perform extensive exploration/ cleaning/ preparation/ statistical analysis steps here since the main focus of this part of the session is to make predictions on the dataset using a classification algorithm.

In [None]:
# We separate the features from the labels.
# features
iris_X = iris_df.drop('Species', axis=1)

# labels
iris_y = iris_df['Species']

In [None]:
# We scale the values of our features so that they contribute on a comparable scale.
# StandardScaler standardizes each feature to zero mean and unit variance.
# ---
# We can perform feature scaling using StandardScaler.
# ---
# Perform standardization.
from sklearn.preprocessing import StandardScaler

# Implement the standard scaler
standard_scaler = StandardScaler().fit(iris_X)
iris_X = standard_scaler.transform(iris_X)
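
To see what standardization does, here is a minimal sketch on a small made-up array. The names `toy_X` and `toy_scaler` and the values are illustrative assumptions, not part of the iris dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: two columns on very different scales.
toy_X = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

# Learn each column's mean and standard deviation, then rescale.
toy_scaler = StandardScaler().fit(toy_X)
toy_X_scaled = toy_scaler.transform(toy_X)

# After scaling, each column has mean 0 and standard deviation 1.
print(toy_X_scaled.mean(axis=0))
print(toy_X_scaled.std(axis=0))
```

Note that the scaled values are not confined to a fixed range; they are expressed as distances from the column mean in units of standard deviation.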

In [None]:
# Split the data into a training set and a testing set.
# To the train_test_split function we pass the feature matrix (iris_X), the target vector (iris_y), and test_size which
# determines the percentage of data used for testing (20% in our case).
from sklearn.model_selection import train_test_split

iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris_X, iris_y, test_size=0.2)

#### Method 1: Random Forest Classifier

In [None]:
# We create a random forest classifier that we'll use to perform classification.
# ---
# The random forest classifier is an ensemble learning method that builds multiple decision
# trees and merges their predictions to obtain a more accurate and stable result.
# ---
random_forest_clf = RandomForestClassifier()

Hyperparameter tuning/ model selection/ model parameter optimization is the process of searching for the best set of hyperparameters for a model.

Hyperparameters are configuration settings for a machine learning model that are set before the training process begins while model parameters are those that are learned from the data during training.

Hyperparameters influence the overall behaviour of the model and need to be specified by the user. Examples of hyperparameters include:
- Learning rates.
- Depth of a decision tree.
- Number of layers in a neural network.
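
The difference between hyperparameters and model parameters can be sketched as follows. This example uses a logistic regression model on synthetic data purely for illustration; `toy_X`, `toy_y` and `toy_model` are hypothetical names that are not used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only).
toy_X, toy_y = make_classification(n_samples=100, n_features=4, random_state=0)

# `C` is a hyperparameter: we choose its value before training.
toy_model = LogisticRegression(C=0.5)
print(toy_model.get_params()["C"])

# `coef_` and `intercept_` are model parameters: they are learned by fit().
toy_model.fit(toy_X, toy_y)
print(toy_model.coef_.shape)
print(toy_model.intercept_.shape)
```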



In [None]:
# We perform hyperparameter tuning to optimize the model.
# Hyperparameter tuning involves searching a set of candidate hyperparameter
# values for the combination that gives the best model performance.
from sklearn.model_selection import GridSearchCV

# We define the hyperparameter grid for the classifier.
# `n_estimators` represents the number of trees in the random forest.
# `max_depth` sets the maximum depth of each tree in the random forest.
# `min_samples_split` is the minimum number of samples required to split an internal node.
# `min_samples_leaf` sets the minimum number of samples required to be at a leaf node.
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# We perform a grid search.
# GridSearchCV systematically evaluates all possible combinations of hyperparameter values
# specified above. The model is trained and evaluated for each combination and the set of
# hyperparameters that yields the best performance is chosen.
grid_search_rfc = GridSearchCV(estimator=random_forest_clf, param_grid=param_grid, cv=5, scoring='accuracy')

# We pass our training data to GridSearch.
grid_search_rfc.fit(iris_X_train, iris_y_train)

# We get the best model from GridSearch
best_model_rfc = grid_search_rfc.best_estimator_
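
After fitting, a grid search object also records which combination won and how it scored. The sketch below illustrates this on scikit-learn's bundled copy of the iris data with a deliberately small grid; `demo_X`, `demo_y`, `demo_grid` and `demo_search` are hypothetical names, and the grid values are chosen only to keep the example fast.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Bundled iris data, used here so the example needs no download.
demo_X, demo_y = load_iris(return_X_y=True)

# A small illustrative grid: 2 x 2 = 4 candidate combinations.
demo_grid = {'n_estimators': [10, 20], 'max_depth': [None, 3]}
demo_search = GridSearchCV(RandomForestClassifier(random_state=0), demo_grid, cv=3, scoring='accuracy')
demo_search.fit(demo_X, demo_y)

# The winning combination and its mean cross-validated accuracy.
print(demo_search.best_params_)
print(demo_search.best_score_)

# Every combination that was tried is recorded in cv_results_.
print(len(demo_search.cv_results_['params']))
```

The number of model fits grows as the product of the grid sizes times the number of folds, so larger grids can take noticeably longer to run.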

In [None]:
# We make predictions using our trained classifier.
y_pred_rfc = best_model_rfc.predict(iris_X_test)

In [None]:
# We then evaluate the performance of our classifier.
# ---
# Accuracy is the proportion of correctly classified samples out of the total number of samples.
# The closer the accuracy score is to 1, the better the model is at making predictions.
# A confusion matrix is a table representing the performance of a model. It shows the true positive, true
# negative, false positive, and false negative counts for each class. It is used to understand where a model
# is making errors.
# A classification report is a report displaying precision, recall, F1 score and support for each class. It is
# useful for understanding the performance of a model across different classes.
# ---
# Accuracy is a global measure while classification report and confusion matrix offer insights into class-specific performance.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

accuracy_rfc = accuracy_score(iris_y_test, y_pred_rfc)
conf_matrix_rfc = confusion_matrix(iris_y_test, y_pred_rfc)
classification_rep_rfc = classification_report(iris_y_test, y_pred_rfc)

print("Accuracy: ")
print(accuracy_rfc)
print("Confusion Matrix:")
print(conf_matrix_rfc)
print("Classification Report:")
print(classification_rep_rfc)

Interpretation:
- **Accuracy:** An accuracy of 1.0 means that our model correctly classified every instance in the test dataset, i.e. perfect performance on this split.
- **Confusion matrix:** The diagonal elements represent the number of correctly identified instances for each class. Off-diagonal elements are zeros, indicating no misclassifications.
- **Classification report:**
  - **Precision:** All classes have a precision score of 1.0, indicating no false positives.
  - **Recall:** All classes have a recall score of 1.0, indicating no false negatives.
  - **F1-score:** All classes have an F1-score of 1.0. F1 is the harmonic mean of precision and recall, so it reaches 1.0 only when both are perfect.
  - **Support:** This is the number of actual instances for each class.
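
To make these metrics concrete, here is a small worked sketch with an imperfect set of predictions. The labels `cat` and `dog` and the variable names are made up for illustration; they are not taken from the iris results above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels: one cat is misclassified as a dog.
toy_true = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog']
toy_pred = ['cat', 'cat', 'dog', 'dog', 'dog', 'dog']

# Rows are the true classes and columns are the predicted classes.
toy_cm = confusion_matrix(toy_true, toy_pred, labels=['cat', 'dog'])
print(toy_cm)

# Precision for 'dog': correct dog predictions / all dog predictions = 3 / 4.
dog_precision = toy_cm[1, 1] / toy_cm[:, 1].sum()
# Recall for 'dog': correct dog predictions / all actual dogs = 3 / 3.
dog_recall = toy_cm[1, 1] / toy_cm[1, :].sum()
print(dog_precision, dog_recall)

# The same values computed by scikit-learn.
print(precision_score(toy_true, toy_pred, pos_label='dog'))
print(recall_score(toy_true, toy_pred, pos_label='dog'))
```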

Cross validation is a resampling technique used to assess the performance and the generalization capability of a model.

It involves dividing the dataset into multiple subsets, training the model on some of these subsets, and evaluating it on the remaining subsets.

Our goal is to obtain a more reliable estimate of a model's performance compared to a single train-test split and this is particularly important when dealing with limited datasets.
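
The procedure described above can be sketched as an explicit loop. This is only an illustration of the idea on scikit-learn's bundled iris data; the names `demo_X`, `demo_y`, `folds` and `fold_scores` are assumptions, and the shuffling and seed used here are choices made for this sketch rather than what `cross_val_score` does by default.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

demo_X, demo_y = load_iris(return_X_y=True)

# Split the data into 5 folds, keeping class proportions similar in each fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in folds.split(demo_X, demo_y):
    # Train a fresh model on four folds and score it on the held-out fold.
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    model.fit(demo_X[train_idx], demo_y[train_idx])
    fold_scores.append(model.score(demo_X[test_idx], demo_y[test_idx]))

print(fold_scores)
print(np.mean(fold_scores))
```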

In [None]:
# We perform k-fold cross validation to obtain a more accurate and reliable estimate of the model's performance.
# The dataset is divided into k subsets (folds) and the model is trained and evaluated k times, each time
# using a different fold as the test set and the remaining folds as the training set.
# ---
# We use k=5.
# ---
from sklearn.model_selection import cross_val_score

cv_scores_rfc = cross_val_score(best_model_rfc, iris_X, iris_y, cv=5)
print("\nCross-Validation Scores: ", cv_scores_rfc)
print("Mean Cross-Validation: ", np.mean(cv_scores_rfc))

Interpretation:
- The cross-validation scores indicate how well the model generalizes to different subsets of the data.
- The model's accuracy ranges from 83.33% to 100% across different folds.
- The mean cross-validation accuracy (95.33%) provides a more robust estimate of the model's overall performance, considering variability across different data subsets.

The variation in scores across folds is due to differences in the composition of training and validation sets in each fold.

A high mean cross-validation score (close to 1.0) suggests that the model performs well on average across different subsets of the data.

#### Method 2: Gradient Boosting Classifier

In [None]:
# We create a gradient boosting classifier that we'll use to perform classification.
# ---
# The gradient boosting classifier is an ensemble learning method that builds decision
# trees sequentially, each tree correcting the errors of the previous ones.
# ---
gradient_boosting_clf = GradientBoostingClassifier()

Hyperparameter tuning/ model selection/ model parameter optimization is the process of searching for the best set of hyperparameters for a model.

Hyperparameters are configuration settings for a machine learning model that are set before the training process begins while model parameters are those that are learned from the data during training.

Hyperparameters influence the overall behaviour of the model and need to be specified by the user. Examples of hyperparameters include:
- Learning rates.
- Depth of a decision tree.
- Number of layers in a neural network.



In [None]:
# We perform hyperparameter tuning to optimize the model.
# Hyperparameter tuning involves searching a set of candidate hyperparameter
# values for the combination that gives the best model performance.
from sklearn.model_selection import GridSearchCV

# We define hyperparameter grids for the classifier.
# `n_estimators` represents the number of trees trained during the boosting process.
# `learning_rate` scales the contribution of each tree when updating the model in the boosting process.
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.2],
}

# We perform a grid search.
# GridSearchCV systematically evaluates all possible combinations of hyperparameter values
# specified above. The model is trained and evaluated for each combination and the set of
# hyperparameters that yields the best performance is chosen.
grid_search_gbc = GridSearchCV(estimator=gradient_boosting_clf, param_grid=param_grid, cv=5, scoring='accuracy')

# We pass our training data to GridSearch.
grid_search_gbc.fit(iris_X_train, iris_y_train)

# We get the best model from GridSearch.
best_model_gbc = grid_search_gbc.best_estimator_

#### Exercise

- Use `best_model_gbc` to make predictions on the testing set.
- Evaluate the model and interpret the results.
- Perform k-fold cross-validation and evaluate the results.

In [None]:
# We then make predictions using our trained classifier.
y_pred_gbc = best_model_gbc.predict(iris_X_test)

In [None]:
# We evaluate the performance of our classifier.
# ---
# Accuracy is the proportion of correctly classified samples out of the total number of samples.
# The closer the accuracy score is to 1, the better the model is at making predictions.
# A confusion matrix is a table representing the performance of a model. It shows the true positive, true
# negative, false positive, and false negative counts for each class. It is used to understand where a model
# is making errors.
# A classification report is a report displaying precision, recall, F1 score and support for each class. It is
# useful for understanding the performance of a model across different classes.
# ---
# Accuracy is a global measure while classification report and confusion matrix offer insights into class-specific performance.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Calculate accuracy, classification report and confusion matrix
accuracy_gbc = accuracy_score(iris_y_test, y_pred_gbc)
conf_matrix_gbc = confusion_matrix(iris_y_test, y_pred_gbc)
classification_rep_gbc = classification_report(iris_y_test, y_pred_gbc)

# Display the metrics calculated above and interpret them.
print("Accuracy: ")
print(accuracy_gbc)
print("Confusion Matrix:")
print(conf_matrix_gbc)
print("Classification Report:")
print(classification_rep_gbc)

In [None]:
# We perform k-fold cross validation to obtain a more accurate and reliable estimate of the model's performance.
# The dataset is divided into k subsets (folds) and the model is trained and evaluated k times, each time
# using a different fold as the test set and the remaining folds as the training set.
# ---
# We use k=5.
# ---
from sklearn.model_selection import cross_val_score

# Calculate the cross-validation scores.
cv_scores_gbc = cross_val_score(best_model_gbc, iris_X, iris_y, cv=5)
# Print out the scores and find the average score
print("\nCross-Validation Scores: ", cv_scores_gbc)
print("Mean Cross-Validation: ", np.mean(cv_scores_gbc))

### Challenge 1

In [None]:
# Classification challenge 1.
# ---
# A cancer researcher has provided data on breast tumor features such as size and density.
# Implement a support vector machine classification model to predict whether the tumors are benign or malignant.
# Evaluate the performance of your classifier.
# Hint: Check for missing values in the data and handle them using an imputation technique.
# ---
# Dataset URL = https://bit.ly/data_breast_cancer
# ---
# YOUR CODE GOES BELOW.

### Solution for challenge 1

Pre-code for data loading and preprocessing has been provided below.

In [None]:
# Load the Breast Cancer dataset.
cancer_df = pd.read_csv('https://bit.ly/data_breast_cancer')

# Extract features and labels.
cancer_X = cancer_df.drop('diagnosis', axis=1)
cancer_y = cancer_df['diagnosis']

In [None]:
# Check for NaN values in data.
nan_values = cancer_X.isnull().any()
if nan_values.any():
    print("There are missing values in the data.")
    print(nan_values)

In [None]:
from sklearn.impute import SimpleImputer

# Impute missing values with the most frequent value.
imputer = SimpleImputer(strategy='most_frequent')
cancer_X_imputed = imputer.fit_transform(cancer_X)

In [None]:
# Standardize the features using StandardScaler.
scaler = StandardScaler()
cancer_X_scaled = scaler.fit_transform(cancer_X_imputed)

In the cell below, write your code for the rest of your solution.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

# Split the dataset into a training and testing set using train_test_split.
cancer_X_train, cancer_X_test, cancer_y_train, cancer_y_test = train_test_split(cancer_X_scaled, cancer_y, test_size=0.2, random_state=42)

# Initialize a support vector machine classifier.
svm_clf = SVC()

# Train your classifier.
svm_clf.fit(cancer_X_train, cancer_y_train)

# Make predictions on the testing set using your classifier.
y_pred_svm = svm_clf.predict(cancer_X_test)

# Evaluate the performance of your classifier.
accuracy_svm = accuracy_score(cancer_y_test, y_pred_svm)
conf_matrix_svm = confusion_matrix(cancer_y_test, y_pred_svm)
classification_rep_svm = classification_report(cancer_y_test, y_pred_svm)

print("Accuracy:")
print(accuracy_svm)
print("Confusion Matrix:")
print(conf_matrix_svm)
print("Classification Report:")
print(classification_rep_svm)

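# Perform 5-fold cross validation and interpret the result.
# We use the scaled features so that cross validation matches the data the classifier was trained on.
cv_scores_svm = cross_val_score(svm_clf, cancer_X_scaled, cancer_y, cv=5)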
print("\nCross-Validation Scores: ", cv_scores_svm)
print("Mean Cross-Validation: ", np.mean(cv_scores_svm))

### Challenge 2

In [None]:
# Classification challenge 2.
# ---
# This challenge is an extension of challenge 1. It requires you to perform hyperparameter tuning.
# Using the same cancer dataset, implement a logistic regression classification model to predict
# whether tumors are benign or malignant.
# Perform hyperparameter tuning on your model to optimize its performance.
# ---
# YOUR CODE GOES BELOW.

### Solution for challenge 2

In [None]:
# Make use of the split dataset from challenge 1.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize a logistic regression classifier.
logistic_reg = LogisticRegression()

# Train your classifier.
logistic_reg.fit(cancer_X_train, cancer_y_train)

# Make predictions on the testing set using your classifier.
logistic_y_pred = logistic_reg.predict(cancer_X_test)

# Evaluate the performance of your classifier.
logistic_accuracy = accuracy_score(cancer_y_test, logistic_y_pred)
logistic_conf_matrix = confusion_matrix(cancer_y_test, logistic_y_pred)
logistic_class_report = classification_report(cancer_y_test, logistic_y_pred)

# Print the results
print("Accuracy:", logistic_accuracy)
print("Confusion Matrix:\n", logistic_conf_matrix)
print("Classification Report:\n", logistic_class_report)

In [None]:
# Perform hyperparameter tuning on your logistic regression classifier.
# Define hyperparameter grid for optimization.
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': np.logspace(-4, 4, 20),
    'solver': ['liblinear']
}

# Perform GridSearchCV for hyperparameter tuning.
grid_search = GridSearchCV(logistic_reg, param_grid, cv=5, scoring='accuracy')
grid_search.fit(cancer_X_train, cancer_y_train)

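# We get the best model from the logistic regression grid search.
logistic_best_model = grid_search.best_estimator_

# Train the best model. GridSearchCV already refits the best model on the full
# training set (refit=True by default), so this call is optional.
logistic_best_model.fit(cancer_X_train, cancer_y_train)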

# Make predictions on the test set
logistic_y_pred = logistic_best_model.predict(cancer_X_test)

# Evaluate the model
logistic_accuracy = accuracy_score(cancer_y_test, logistic_y_pred)
logistic_conf_matrix = confusion_matrix(cancer_y_test, logistic_y_pred)
logistic_class_report = classification_report(cancer_y_test, logistic_y_pred)

# Print the results
print("Accuracy:", logistic_accuracy)
print("Confusion Matrix:\n", logistic_conf_matrix)
print("Classification Report:\n", logistic_class_report)

# Perform 5-fold cross validation and interpret the result.
cv_scores_logistic = cross_val_score(logistic_best_model, cancer_X_scaled, cancer_y, cv=5)
print("\nCross-Validation Scores: ", cv_scores_logistic)
print("Mean Cross-Validation: ", np.mean(cv_scores_logistic))
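
The `C` grid used in the tuning step above is built with `np.logspace`, which spaces values evenly on a logarithmic scale. The short sketch below shows what that grid contains and how many candidate combinations it produces; `c_values` and `n_candidates` are hypothetical names used only here.

```python
import numpy as np

# 20 values spaced evenly on a log scale between 10**-4 and 10**4.
c_values = np.logspace(-4, 4, 20)
print(c_values[0], c_values[-1], len(c_values))

# Candidate combinations in the grid: 2 penalties x 20 values of C x 1 solver.
n_candidates = 2 * len(c_values) * 1
print(n_candidates)

# With 5-fold cross validation, each candidate is fitted 5 times.
print(n_candidates * 5)
```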

## 2. Regression

The goal here is to predict a continuous value.

### Example

In [None]:
# In this example, we will use the linear regressor and ridge
# regressor to predict continuous outcomes on a dataset.
# ---
# These models learn patterns in the training data and then use
# the information learned to predict a value.
# ---
# Dataset url = https://bit.ly/data_boston_housing
# ---
# We first import our regression models.
# ---
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

#### Data Importation

In [None]:
# We import and preview our dataset.
# ---
# Dataset url = https://bit.ly/data_boston_housing
# ---
# This dataset contains observations of the characteristics of the area in which a house is located and the features of the house.
# Our goal is to predict median housing price given the characteristics of its location and the house's features.
# ---
housing_df = pd.read_csv('https://bit.ly/data_boston_housing')

# Check the first records.
housing_df.head()

In [None]:
# Check the last records.
housing_df.tail()

#### Data Exploration/ Cleaning/ Preparation/ Statistical Analysis



We will not perform extensive exploration/ cleaning/ preparation/ statistical analysis steps here since the main focus of this part of the session is to make predictions on the dataset using a regression algorithm.

In [None]:
# Check for missing values.
print("\nMissing values in the dataset:")
print(housing_df.isnull().sum())

In [None]:
# Handle missing values using mean imputation.
# Imputation is the process of replacing missing or incomplete data with substituted values.
# Here, we use SimpleImputer to perform mean imputation which replaces missing values
# with the mean of the observed values in the same column.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(housing_df), columns=housing_df.columns)
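
As a quick illustration of mean imputation, the sketch below fills gaps in a tiny made-up table. The column names `rooms` and `age` and all values are hypothetical and are not taken from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical table with one missing value in each column.
toy_df = pd.DataFrame({'rooms': [4.0, np.nan, 6.0],
                       'age': [10.0, 20.0, np.nan]})

# Replace each missing value with the mean of the observed values in its column.
toy_imputer = SimpleImputer(strategy='mean')
toy_filled = pd.DataFrame(toy_imputer.fit_transform(toy_df), columns=toy_df.columns)

# The missing 'rooms' value becomes (4 + 6) / 2 = 5.0 and
# the missing 'age' value becomes (10 + 20) / 2 = 15.0.
print(toy_filled)
```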

In [None]:
# Separate the features from the labels.
# features
housing_X = df_imputed.drop('MEDV', axis=1)

# labels
housing_y = df_imputed['MEDV']

In [None]:
# Split the data into a training set and a testing set.
# To the train_test_split function we pass the feature matrix (housing_X), the target vector (housing_y), and test_size which
# determines the percentage of data used for testing (20% in our case).
from sklearn.model_selection import train_test_split

housing_X_train, housing_X_test, housing_y_train, housing_y_test = train_test_split(housing_X, housing_y, test_size=0.2)

#### Method 1: Linear Regression

In [None]:
# We create a linear regressor that we'll use to make predictions.
# ---
# The linear regressor is a method that finds the best-fit linear relationship that
# minimizes the difference between predicted values and actual values of the target variable.
# ---
linear_regressor = LinearRegression()

In [None]:
# We pass our training data to the model.
linear_regressor.fit(housing_X_train, housing_y_train)

In [None]:
# We make predictions using our trained regressor.
y_pred_linear = linear_regressor.predict(housing_X_test)

In [None]:
# We evaluate the performance of our regressor.
# ---
# Mean Squared Error (MSE) is the average squared difference between the predicted and actual values.
# A lower MSE indicates better model performance.
# R2 Score/ Coefficient of Determination measures the proportion of variance in the dependent
# variable that is explained by the independent variables in the regression model. A score of 1
# indicates a perfect fit, 0 indicates that the model explains none of the variance, and the
# score can be negative when the model fits worse than always predicting the mean.
# ---
from sklearn.metrics import mean_squared_error, r2_score
# Evaluate the model on the testing set.
mse_linear = mean_squared_error(housing_y_test, y_pred_linear)
r2_linear = r2_score(housing_y_test, y_pred_linear)

# Display results.
print("\nModel Evaluation:")
print("Mean Squared Error (MSE): ")
print(mse_linear)
print("R-squared (R2): ")
print(r2_linear)

Interpretation:
- The MSE of 30.85 suggests that, on average, the squared difference between predicted and actual values is 30.85. Smaller MSE values are desirable.
- The R-squared value of 0.68 indicates that the model captures about 68.18% of the variability in the target variable. While this is a reasonable fit, it also suggests that there is room for improvement.
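
Both metrics can be computed by hand, which may help when interpreting them. The sketch below uses four made-up values (`toy_actual` and `toy_predicted` are hypothetical and unrelated to the housing results) and checks the hand calculation against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values.
toy_actual = np.array([3.0, 5.0, 7.0, 9.0])
toy_predicted = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: the mean of the squared errors, (1 + 0 + 1 + 0) / 4 = 0.5.
mse_by_hand = np.mean((toy_actual - toy_predicted) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares) = 1 - 2 / 20 = 0.9.
ss_res = np.sum((toy_actual - toy_predicted) ** 2)
ss_tot = np.sum((toy_actual - toy_actual.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

print(mse_by_hand, mean_squared_error(toy_actual, toy_predicted))
print(r2_by_hand, r2_score(toy_actual, toy_predicted))
```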

Cross validation is a resampling technique used to assess the performance and the generalization capability of a model.

It involves dividing the dataset into multiple subsets, training the model on some of these subsets, and evaluating it on the remaining subsets.

Our goal is to obtain a more reliable estimate of a model's performance compared to a single train-test split and this is particularly important when dealing with limited datasets.

In [None]:
# We perform k-fold cross validation to obtain a more accurate and reliable estimate of the model's performance.
# The dataset is divided into k subsets (folds) and the model is trained and evaluated k times, each time
# using a different fold as the test set and the remaining folds as the training set.
# ---
# We use k=5.
# ---
from sklearn.model_selection import cross_val_score

cv_scores_linear = cross_val_score(linear_regressor, housing_X, housing_y, cv=5, scoring='neg_mean_squared_error')
cv_rmse_scores_linear = np.sqrt(-cv_scores_linear)
print("\nCross-Validation RMSE Scores:")
print(cv_rmse_scores_linear)
print("Mean RMSE Score:", np.mean(cv_rmse_scores_linear))

Interpretation:
- RMSE is a measure of the average magnitude of errors between predicted and actual values in a regression model.
- The model's RMSE varies across different folds, ranging from 3.43 to 9.01.
- The mean RMSE (5.88) provides an overall measure of the model's predictive performance, considering variability across different data subsets.
- A lower RMSE indicates better model performance, as it represents smaller errors in predictions.
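
The RMSE values come from scores that scikit-learn reports as negative numbers, because its scorers follow a "higher is better" convention. The sketch below shows the conversion on synthetic regression data; `demo_X`, `demo_y`, `neg_mse` and `rmse` are hypothetical names, and the data is generated only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only).
demo_X, demo_y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 'neg_mean_squared_error' returns the MSE multiplied by -1 for each fold.
neg_mse = cross_val_score(LinearRegression(), demo_X, demo_y, cv=5, scoring='neg_mean_squared_error')
print(neg_mse)

# Flip the sign to recover the MSE, then take the square root to get the RMSE,
# which is expressed in the same units as the target variable.
rmse = np.sqrt(-neg_mse)
print(rmse)
print(rmse.mean())
```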

#### Method 2: Ridge Regression

In [None]:
# We create a ridge regressor that we'll use to make predictions.
# ---
# The ridge regressor is a method that finds the best-fit linear relationship that
# minimizes the difference between predicted values and actual values of the target variable.
# It prevents overfitting and handles multicollinearity (high correlation between predictors)
# by adding a regularization term to the cost function.
# ---
ridge_regressor = Ridge()
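
To illustrate how the regularization term behaves with highly correlated predictors, the sketch below fits ordinary linear regression and ridge regression on synthetic data in which the second feature is almost a copy of the first. All names and values here are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two nearly identical features (strong multicollinearity).
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
toy_X = np.column_stack([x1, x2])

# The target depends on x1 plus a little noise.
toy_y = 3 * x1 + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(toy_X, toy_y)
ridge = Ridge(alpha=1.0).fit(toy_X, toy_y)

# Compare the coefficient vectors and their overall size.
print(ols.coef_, np.linalg.norm(ols.coef_))
print(ridge.coef_, np.linalg.norm(ridge.coef_))
```

The penalty shrinks the ridge coefficient vector, so its norm is no larger than that of the unregularized fit.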

Hyperparameter tuning/ model selection/ model parameter optimization is the process of searching for the best set of hyperparameters for a model.

Hyperparameters are configuration settings for a machine learning model that are set before the training process begins while model parameters are those that are learned from the data during training.

Hyperparameters influence the overall behaviour of the model and need to be specified by the user. Examples of hyperparameters include:
- Learning rates.
- Depth of a decision tree.
- Number of layers in a neural network.



In [None]:
# We perform hyperparameter tuning to optimize the model.
# Hyperparameter tuning involves searching a set of candidate hyperparameter
# values for the combination that gives the best model performance.
from sklearn.model_selection import GridSearchCV

# We define the hyperparameter grid for the regressor.
# `alpha` controls the strength of the regularization term in the model. Smaller
# values allow for less regularization while larger values impose stronger regularization.
# Our model uses ridge regularization (L2) to prevent overfitting.
param_grid = {'alpha': [0.1, 1, 10]}

# We perform a grid search.
grid_search_ridge = GridSearchCV(estimator=ridge_regressor, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')

# We pass our training data to GridSearch.
grid_search_ridge.fit(housing_X_train, housing_y_train)

# We get the best model from GridSearch.
best_model_ridge = grid_search_ridge.best_estimator_

#### Exercise

- Use `best_model_ridge` to make predictions on the testing set.
- Evaluate the model and interpret the results.
- Perform k-fold cross-validation and evaluate the results.

In [None]:
# We then make predictions using our trained regressor.
y_pred_ridge = best_model_ridge.predict(housing_X_test)

In [None]:
# We evaluate the performance of our regressor.
# ---
# Mean Squared Error (MSE) is the average squared difference between the predicted and actual values.
# A lower MSE indicates better model performance.
# R2 Score/ Coefficient of Determination measures the proportion of variance in the dependent
# variable that is explained by the independent variables in the regression model. A score of 1
# indicates a perfect fit, 0 indicates that the model explains none of the variance, and the
# score can be negative when the model fits worse than always predicting the mean.
# ---
from sklearn.metrics import mean_squared_error, r2_score

# Calculate MSE and R2 score.
mse_ridge = mean_squared_error(housing_y_test, y_pred_ridge)
r2_ridge = r2_score(housing_y_test, y_pred_ridge)

# Display results.
print("\nModel Evaluation:")
print("Mean Squared Error (MSE): ")
print(mse_ridge)
print("R-squared (R2): ")
print(r2_ridge)

In [None]:
# We perform k-fold cross validation to obtain a more accurate and reliable estimate of the model's performance.
# The dataset is divided into k subsets (folds) and the model is trained and evaluated k times, each time
# using a different fold as the test set and the remaining folds as the training set.
# ---
# We use k=5.
# ---
from sklearn.model_selection import cross_val_score

# Calculate the cross-validation scores using the tuned model.
cv_scores_ridge = cross_val_score(best_model_ridge, housing_X, housing_y, cv=5, scoring='neg_mean_squared_error')
cv_rmse_scores_ridge = np.sqrt(-cv_scores_ridge)

# Print out the scores and find the average score.
print("\nCross-Validation RMSE Scores:")
print(cv_rmse_scores_ridge)
print("Mean RMSE Score:", np.mean(cv_rmse_scores_ridge))

### Challenge 1

In [None]:
# Regression challenge 1.
# ---
# A 1990 census gave data on housing prices and some summary statistics about these houses.
# Implement a lasso regression model to predict housing prices.
# Evaluate the performance of your regressor.
# ---
# Dataset URL = https://bit.ly/3Sqd1nU
# ---
# YOUR CODE GOES BELOW.

### Solution for challenge 1

Pre-code for data loading and preprocessing has been provided below.

In [None]:
# Load the California housing dataset.
california_df = pd.read_csv('https://bit.ly/3Sqd1nU')

# Extract features and labels
california_X = california_df.drop('median_house_value', axis=1)
california_y = california_df['median_house_value']

In [None]:
# Encode categorical variable `ocean_proximity`
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
california_X['ocean_proximity'] = label_encoder.fit_transform(california_X['ocean_proximity'])

california_X.head()
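
To show what the encoding step produces, here is a minimal sketch on a short made-up list of categories. The list `toy_categories` is hypothetical and only resembles the kind of values found in `ocean_proximity`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values.
toy_categories = ['INLAND', 'NEAR BAY', 'INLAND', 'ISLAND']

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_categories)

# Classes are sorted alphabetically and numbered from 0.
print(list(toy_encoder.classes_))
print(list(toy_codes))

# The original labels can be recovered from the integer codes.
print(list(toy_encoder.inverse_transform(toy_codes)))
```

The integer codes carry an ordering that has no real meaning for the categories, and a linear model may treat that ordering as informative. One-hot encoding, for example with scikit-learn's `OneHotEncoder`, is a common alternative when that is a concern.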

In [None]:
# Check for NaN values in data
nan_values = california_X.isnull().any()
if nan_values.any():
    print("There are missing values in the data.")
    print(nan_values)

In [None]:
from sklearn.impute import SimpleImputer

# Impute missing values with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
california_X_imputed = imputer.fit_transform(california_X)

In the cell below, write your code for the rest of your solution.

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

# Split the dataset into a training and testing set using train_test_split.
california_X_train, california_X_test, california_y_train, california_y_test = train_test_split(california_X_imputed, california_y, test_size=0.2, random_state=42)

# Initialize a lasso regressor.
lasso_reg = Lasso()

# Train your regressor.
lasso_reg.fit(california_X_train, california_y_train)

# Make predictions on the testing set using your regressor.
y_pred_lasso = lasso_reg.predict(california_X_test)

# Evaluate the performance of your regressor.
mse_lasso = mean_squared_error(california_y_test, y_pred_lasso)
r2_lasso = r2_score(california_y_test, y_pred_lasso)

# Display results
print("\nModel Evaluation:")
print(f"Mean Squared Error (MSE): {mse_lasso}")
print(f"R-squared (R2): {r2_lasso}")

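# Perform 5-fold cross validation and interpret the result.
# We use the California housing features and labels, matching the data the regressor was trained on.
cv_scores_lasso = cross_val_score(lasso_reg, california_X_imputed, california_y, cv=5, scoring='neg_mean_squared_error')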
cv_rmse_scores_lasso = np.sqrt(-cv_scores_lasso)

# Print out the scores and find the average score.
print("\nCross-Validation RMSE Scores:")
print(cv_rmse_scores_lasso)
print("Mean RMSE Score:", np.mean(cv_rmse_scores_lasso))

### Challenge 2

In [None]:
# Regression challenge 2.
# ---
# This challenge is an extension of challenge 1. It requires you to perform hyperparameter tuning.
# Using the same California housing dataset, implement a random forest regression model to predict
# median housing prices given features of a house.
# Evaluate the performance of your regressor.
# ---
# YOUR CODE GOES BELOW.

### Solution for challenge 2

In [None]:
# Make use of the split dataset from challenge 1.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Initialize a random forest regressor.
rfr_regressor = RandomForestRegressor()

# Train your regressor.
rfr_regressor.fit(california_X_train, california_y_train)

# Make predictions on the testing set using your regressor.
y_pred_rfr = rfr_regressor.predict(california_X_test)

# Evaluate the performance of your regressor.
mse_rfr = mean_squared_error(california_y_test, y_pred_rfr)
r2_rfr = r2_score(california_y_test, y_pred_rfr)

# Display results
print("\nModel Evaluation:")
print(f"Mean Squared Error (MSE): {mse_rfr}")
print(f"R-squared (R2): {r2_rfr}")

In [None]:
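# Perform hyperparameter tuning on your random forest regressor.
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for tuning.
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Search the grid with cross validation on the training set.
# The choices of cv=5 and the scoring metric are ours; adjust them as needed.
# Note: 144 combinations x 5 folds = 720 fits, so this can take a while.
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)
grid_search.fit(california_X_train, california_y_train)
print("Best hyperparameters:", grid_search.best_params_)

# Get the best model and evaluate its performance.
# GridSearchCV refits the best estimator on the full training set by default,
# so no additional call to fit() is needed.
best_rfr_regressor = grid_search.best_estimator_
y_pred_best_rfr = best_rfr_regressor.predict(california_X_test)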

mse_best_rfr = mean_squared_error(california_y_test, y_pred_best_rfr)
r2_best_rfr = r2_score(california_y_test, y_pred_best_rfr)

# Display results
print("\nModel Evaluation:")
print(f"Mean Squared Error (MSE): {mse_best_rfr}")
print(f"R-squared (R2): {r2_best_rfr}")
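
The grid above has 4 x 4 x 3 x 3 = 144 combinations, so an exhaustive search can be slow. The sketch below runs the same kind of search on a much smaller grid and on synthetic data, to show how the number of candidates grows and which attributes hold the results. The data, the reduced grid, and the `demo_` names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption, not the housing data).
demo_X, demo_y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)

# A deliberately tiny grid: 2 x 2 = 4 candidate combinations.
demo_grid = {'n_estimators': [10, 20], 'max_depth': [None, 3]}

demo_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    demo_grid,
    cv=3,
    scoring='neg_mean_squared_error',
)
demo_search.fit(demo_X, demo_y)

# Each candidate is evaluated on every fold: 4 candidates x 3 folds = 12 fits.
print("Candidates tried:", len(demo_search.cv_results_['params']))
print("Best parameters:", demo_search.best_params_)
print("Best cross-validated MSE:", -demo_search.best_score_)
```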

## 3. Clustering

The goal here is to group similar data points together based on certain characteristics or features.

### Example

In [None]:
# In this example, we will use the k-means clustering and hierarchical
# clustering to group data.
# ---
# These models learn patterns in the data and then use the information
# learned to group the data into clusters.
# ---
# Dataset url = https://bit.ly/3OdeHP6
# ---
# We first import our clustering models.
# ---
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering

#### Data Importation

In [None]:
# We import and preview our dataset.
# ---
# Dataset url = https://bit.ly/3OdeHP6
# ---
# This dataset contains observations of wine features with the labels removed
# for clustering task.
# ---
wine_df = pd.read_csv('https://bit.ly/3OdeHP6')

# Checking the first records.
wine_df.head()

In [None]:
# Checking the last records.
wine_df.tail()

#### Data Exploration/ Cleaning/ Preparation/ Statistical Analysis

We will not perform extensive exploration/ cleaning/ preparation/ statistical analysis steps here since the main focus of this part of the session is to perform clustering analysis on the dataset.

In [None]:
# We scale the values of our features in order to give them equal importance.
# Scaling allows the datapoints of our features to lie within the same upper and lower limits.
# ---
# We can perform feature scaling using MinMaxScaler.
# ---
# Scale features to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
wine_df_scaled = scaler.fit_transform(wine_df)
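
`MinMaxScaler` maps each feature to the range [0, 1], whereas `StandardScaler` (used later in the challenges) rescales each feature to zero mean and unit variance. A minimal sketch of the difference, using a made-up toy array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (hypothetical values).
toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

toy_minmax = MinMaxScaler().fit_transform(toy)
toy_standard = StandardScaler().fit_transform(toy)

print(toy_minmax)    # each column now spans [0, 1]
print(toy_standard)  # each column now has mean 0 and unit variance
```

Either choice prevents a feature with large raw values from dominating the distance calculations that clustering relies on.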

In [None]:
# Visualize the scaled dataset
plt.scatter(wine_df_scaled[:, 0], wine_df_scaled[:, 1], edgecolor='k')
plt.title('Scaled Wine Dataset')
plt.xlabel('Feature 1 (Scaled)')
plt.ylabel('Feature 2 (Scaled)')
plt.show()

#### Method 1: K-Means Clustering

In [None]:
# The code utilizes the Elbow Method to determine the optimal number of clusters (K) for K-Means
# It calculates the inertia (within-cluster sum of squares) for different values of K and
# plots the results.

# Choose the number of clusters using the Elbow Method.
inertia = []
# The loop below iterates from k=1 to k=10, creating and fitting a model to the dataset for
# each value of k.
# It then reads `kmeans.inertia_`, which is the sum of squared distances of samples
# to their closest cluster center and adds this value to the list inertia.
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(wine_df_scaled)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Method.
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.show()
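
To make the definition of inertia concrete, the sketch below recomputes it by hand and compares the result with the fitted model's `inertia_` attribute. It uses synthetic data from `make_blobs`; the data and the `demo_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups (an assumption, not the wine data).
demo_X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(demo_X)

# Look up the centre assigned to each sample, then sum the squared distances.
assigned_centres = demo_kmeans.cluster_centers_[demo_kmeans.labels_]
manual_inertia = np.sum((demo_X - assigned_centres) ** 2)

print(manual_inertia, demo_kmeans.inertia_)
```

Inertia always decreases as K grows, which is why the elbow method looks for the point of diminishing returns rather than the minimum.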

In [None]:
# Choose the optimal number of clusters and train the KMeans model.
# From the plot above, we look for the "elbow" point where the inertia starts
# decreasing at a slower rate. This point is considered a good estimate for
# the optimal number of clusters.
# In our case, our "elbow" point is k=3 and we assign this value to the variable
# `optimal_k`.
optimal_k = 3
kmeans_model = KMeans(n_clusters=optimal_k, random_state=42)
y_pred_kmeans = kmeans_model.fit_predict(wine_df_scaled)

In [None]:
# Visualize the clustering result.
# We create a scatter plot using Matplotlib.
# `wine_df_scaled[:, 0]` extracts values from all rows of the first column
# `wine_df_scaled[:, 1]` extracts values from all rows of the second column
# The `c` parameter specifies the colour of each point in the scatter plot. In our case,
# the colour is determined by the cluster labels assigned to datapoints by our model.
plt.scatter(wine_df_scaled[:, 0], wine_df_scaled[:, 1], c=y_pred_kmeans)
plt.title('Clustering Result (Wine Dataset)')
plt.xlabel('Feature 1 (Scaled)')
plt.ylabel('Feature 2 (Scaled)')
plt.show()

In [None]:
# The Silhouette Score is calculated to assess the quality of the clustering result.
# It is used to measure how well-separated clusters are in our clustering result.
# It quantifies how similar an object is to its own cluster (cohesion) compared to
# other clusters (separation).
# A near +1 score indicates an object is well matched to its own cluster and poorly matched
# to neighbouring clusters, hence good clustering.
# A near 0 score indicates an object is on or very close to the decision boundary between two
# neighbouring clusters.
# A near -1 score indicates that an object might be assigned to the wrong cluster.
# To calculate the average silhouette score for the entire dataset we use the silhouette_score()
# function and to this function we pass our dataset (wine_df_scaled) and the array containing
# the cluster labels assigned by our model to each data point (y_pred_kmeans).
from sklearn.metrics import silhouette_score

# Evaluate the clustering using silhouette score
silhouette_avg = silhouette_score(wine_df_scaled, y_pred_kmeans)
print('Silhouette Score: ')
print(silhouette_avg)

Interpretation:
- The silhouette score ranges from -1 to 1, where a higher score indicates better separation between clusters.
- A score around 0.30 suggests that the clusters are only moderately separated: points are, on average, closer to their own cluster than to the nearest neighbouring cluster, but there is still room for improvement.
- Values close to 1.0 would indicate very well-defined clusters, values close to 0 would suggest overlapping clusters, and values close to -1.0 would suggest that points have been assigned to the wrong cluster.
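
The silhouette score can also be used alongside the elbow plot to choose K, by computing it for several candidate values and comparing. A minimal sketch on synthetic data (the data and the `demo_` names are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three groups (an assumption, not the wine data).
demo_X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# The silhouette score needs at least two clusters, so start from k=2.
demo_scores = {}
for k in range(2, 7):
    demo_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(demo_X)
    demo_scores[k] = silhouette_score(demo_X, demo_labels)

for k, score in demo_scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```

The value of K with the highest average score is a reasonable candidate, although it should be weighed against the elbow plot and domain knowledge.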

#### Method 2: Hierarchical Clustering

In [None]:
# A dendrogram to help visually determine the optimal number of clusters for hierarchical clustering.
# A dendrogram is a tree-like diagram that illustrates the hierarchical relationship between data points.
from scipy.cluster.hierarchy import dendrogram, linkage

# Create a dendrogram to determine the optimal number of clusters
linked = linkage(wine_df_scaled, 'ward')  # 'ward' method minimizes the variance within each cluster
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Sample Index')
plt.ylabel('Cluster Distance')
plt.show()
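
Cutting the dendrogram can also be done programmatically with SciPy's `fcluster`, which turns the linkage matrix into flat cluster labels. The sketch below asks for three clusters on synthetic data; the data and the `demo_` names are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data with three groups (an assumption, not the wine data).
demo_X, _ = make_blobs(n_samples=60, centers=3, random_state=42)
demo_linked = linkage(demo_X, 'ward')

# Ask for at most 3 flat clusters; SciPy chooses the matching cut height.
demo_labels = fcluster(demo_linked, t=3, criterion='maxclust')

# Labels returned by fcluster start at 1, not 0.
print(np.unique(demo_labels, return_counts=True))
```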

#### Exercise

- Select optimal k, initialize the model and fit the model.
- Evaluate the model and interpret the results.

In [None]:
# AgglomerativeClustering algorithm is used for hierarchical clustering with a specified number of clusters
# Choose the optimal number of clusters and perform hierarchical clustering
# By observing the dendrogram, we can identify branches where clusters are formed and determine a suitable
# cut point to define the number of clusters.
# The height at which we cut the dendrogram corresponds to the desired number of clusters. In our case, k=3.

optimal_k = 3
hierarchical_model = AgglomerativeClustering(n_clusters=optimal_k)
y_pred_hierarchical = hierarchical_model.fit_predict(wine_df_scaled)

In [None]:
# Visualize the clustering result.
# We create a scatter plot using Matplotlib.
# `wine_df_scaled[:, 0]` extracts values from all rows of the first column
# `wine_df_scaled[:, 1]` extracts values from all rows of the second column
# The `c` parameter specifies the colour of each point in the scatter plot. In our case,
# the colour is determined by the cluster labels assigned to datapoints by our model.
plt.scatter(wine_df_scaled[:, 0], wine_df_scaled[:, 1], c=y_pred_hierarchical)
plt.title('Clustering Result (Wine Dataset)')
plt.xlabel('Feature 1 (Scaled)')
plt.ylabel('Feature 2 (Scaled)')
plt.show()

In [None]:
# The Silhouette Score is calculated to assess the quality of the clustering result.
# It is used to measure how well-separated clusters are in our clustering result.
# It quantifies how similar an object is to its own cluster (cohesion) compared to
# other clusters (separation).
# A near +1 score indicates an object is well matched to its own cluster and poorly matched
# to neighbouring clusters, hence good clustering.
# A near 0 score indicates an object is on or very close to the decision boundary between two
# neighbouring clusters.
# A near -1 score indicates that an object might be assigned to the wrong cluster.
# To calculate the average silhouette score for the entire dataset we use the silhouette_score()
# function and to this function we pass our dataset (wine_df_scaled) and the array containing
# the cluster labels assigned by our model to each data point (y_pred_hierarchical).
from sklearn.metrics import silhouette_score

# Evaluate the clustering using silhouette score
silhouette_avg = silhouette_score(wine_df_scaled, y_pred_hierarchical)
print(f'Silhouette Score: {silhouette_avg:.2f}')
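
Since both methods were asked for the same number of clusters, it can be useful to check how closely their groupings agree. One option is the adjusted Rand index, sketched below on synthetic data. The data and the `demo_` names are assumptions for illustration, and this comparison is an optional extra rather than part of the exercise.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three groups (an assumption, not the wine data).
demo_X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

demo_kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(demo_X)
demo_hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(demo_X)

# 1.0 means identical groupings regardless of how the clusters are numbered;
# values near 0 mean the agreement is no better than chance.
demo_agreement = adjusted_rand_score(demo_kmeans_labels, demo_hier_labels)
print(f"Adjusted Rand Index: {demo_agreement:.3f}")
```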

### Challenge 1

In [None]:
# Clustering challenge 1.
# ---
# A biology researcher provides data on observations of iris flowers including
# sepal and petal length and width for each flower, as well as the species of the flower.
# The labels are the species of the flower.
# Perform k-means clustering analysis on the dataset.
# ---
# Dataset URL = https://bit.ly/data_iris_dataset
# Hint: We will drop the label column to perform clustering
# ---
# YOUR CODE GOES BELOW.

### Solution for challenge 1

Pre-code for data loading and preprocessing has been provided below.

In [None]:
from sklearn.preprocessing import StandardScaler

# import the dataset using pandas.
iris_df = pd.read_csv('https://bit.ly/data_iris_dataset')

# drop label column.
X = iris_df.drop('Species', axis=1)

# standardize features.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In the cell below, write your code for the rest of your solution.

In [None]:
# Use the Elbow Method to determine the optimal number of clusters.
# Plot the Elbow Method to find the "elbow" point.
inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Method
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.show()

In [None]:
from sklearn.metrics import silhouette_score

# Apply KMeans clustering.
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Visualize the clustering result.
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters)
plt.title('Clustering Result (Iris Dataset)')
plt.xlabel('Feature 1 (Standardized)')
plt.ylabel('Feature 2 (Standardized)')
plt.show()

# Evaluate the clustering using silhouette score.
silhouette_avg = silhouette_score(X_scaled, clusters)
print('Silhouette Score: ',silhouette_avg)
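
Because the species labels were only dropped, not missing, the clusters can be checked against them with a cross-tabulation. The sketch below uses scikit-learn's bundled copy of the iris data via `load_iris` so that it runs without a download. That copy is an assumption and its column layout may differ from the CSV used above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Bundled iris data (an assumption; not the CSV loaded from the URL above).
iris = load_iris()
demo_X_scaled = StandardScaler().fit_transform(iris.data)
demo_clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(demo_X_scaled)

# Rows are the true species, columns are the cluster ids assigned by k-means.
demo_species = pd.Series(iris.target_names[iris.target], name='species')
demo_table = pd.crosstab(demo_species, pd.Series(demo_clusters, name='cluster'))
print(demo_table)
```

Cluster ids are arbitrary, so what matters is whether each species falls mostly into a single column, not which column it is.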

### Challenge 2

In [None]:
# Clustering challenge 2.
# ---
# This challenge is an extension of challenge 1.
# Using the same iris dataset, implement a hierarchical clustering model to assign
# datapoints to clusters.
# Evaluate the performance of your clustering model.
# ---
# YOUR CODE GOES BELOW.

### Solution for challenge 2

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

# Create a dendrogram to determine the optimal number of clusters
linked = linkage(X_scaled, 'ward')  # 'ward' method minimizes the variance within each cluster
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Sample Index')
plt.ylabel('Cluster Distance')
plt.show()

In [None]:
# Apply hierarchical clustering.
optimal_k = 3
hierarchical_model = AgglomerativeClustering(n_clusters=optimal_k)
y_pred_hierarchical = hierarchical_model.fit_predict(X_scaled)

# Visualize the clustering result.
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_pred_hierarchical)
plt.title('Clustering Result (Iris Dataset)')
plt.xlabel('Feature 1 (Standardized)')
plt.ylabel('Feature 2 (Standardized)')
plt.show()

# Evaluate the clustering using silhouette score.
silhouette_avg = silhouette_score(X_scaled, y_pred_hierarchical)
print('Silhouette Score: ',silhouette_avg)