<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2024_sem1)</div>

# IFN619 :: C2 - Machine Learning - Tutorial

Ensure that you have worked through the studio notebook before doing this tutorial, as the exercises below will build on what you did in the studio notebook.

In [None]:
from sklearn.preprocessing import minmax_scale

from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import pandas as pd
import plotly.express as px


### K-means algorithm Exercises

1. Load the [Queensland Ambulance Service Locations and Coordinates Data](https://data.qld.gov.au/dataset/679424b4-ccf8-46cd-8e0b-f16c49572dbb) into a dataframe
2. Perform a k-means clustering based on coordinates (start with 2 clusters)
3. Visualise on a map
4. Try increasing the number of clusters to identify potentially meaningful groupings
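
One common way to judge how many clusters might be meaningful is to compare the inertia (within-cluster sum of squared distances) across several values of `n_clusters`. The sketch below is only an illustration: it uses synthetic coordinates (three made-up blobs), not the ambulance station data, and the names `coords` and `inertias` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for station coordinates: three well-separated blobs.
# These points are made up for illustration and are NOT the QAS data.
rng = np.random.default_rng(0)
centres = np.array([[145.0, -17.0], [153.0, -27.5], [139.5, -20.7]])
coords = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centres])

# Fit k-means for a range of k and record the inertia for each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Inertia generally keeps falling as clusters are added, so the point of interest is where the improvement levels off rather than the smallest value. With real data the levelling-off is often less clear-cut than with these synthetic blobs.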



#### Load the data

In [None]:
# Load the data
qas_df = pd.read_csv("https://www.data.qld.gov.au/datastore/dump/83360397-4dcb-495c-a9c8-342a5ef6b5aa?bom=True", index_col="_id")
qas_df

#### Create the model and fit to relevant data

In [None]:
clst2 = KMeans(n_clusters=2, random_state=0).fit(qas_df[['X Coordinates','Y Coordinates']])
qas_df['cluster2'] = clst2.labels_
qas_df

#### Visualise the clusters

In [None]:
qas_map = px.scatter_mapbox(qas_df, 
    lat="Y Coordinates", 
    lon="X Coordinates",
    color="cluster2") 
    
qas_map.update_layout(mapbox_style="open-street-map",   # changed from stamen-terrain
    mapbox_center_lat = -22.5, 
    mapbox_center_lon = 144,  
    mapbox_zoom = 3.0, 
    margin={"r":0,"t":0,"l":0,"b":0})

qas_map.show()

#### Try more clusters

In [None]:
clst6 = KMeans(n_clusters=6, random_state=0).fit(qas_df[['X Coordinates','Y Coordinates']])
qas_df['cluster6'] = clst6.labels_
qas_df

#### Visualise

In [None]:


qas_map = px.scatter_mapbox(qas_df, 
    lat="Y Coordinates", 
    lon="X Coordinates",
    color="cluster6",
    hover_name = "Entity Name"
) 
    
qas_map.update_layout(mapbox_style="open-street-map",   # changed from stamen-terrain
    mapbox_center_lat = -22.5, 
    mapbox_center_lon = 144,  
    mapbox_zoom = 3.0, 
    margin={"r":0,"t":0,"l":0,"b":0})

qas_map.show()


### Linear Regression algorithm

1. Load the [Great Barrier Reef Carbon Dioxide Measurements](https://www.csiro.au/en/education/Resources/Educational-datasets/GBR-Carbon-Study) data located in the data folder (name gbr.csv)
2. Perform linear regression to predict CO2
3. Experiment with predictions (increase the temperature to see what happens)
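
For the third step, it may help to know that in a linear model, raising one feature by one unit (with the others held fixed) changes the prediction by exactly that feature's coefficient. The sketch below illustrates this on synthetic data with a made-up relationship. It borrows the column names `pressure` and `sea_surface_temp` for readability, but the values and coefficients are assumptions and do not come from gbr.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features; the ranges are assumptions chosen for illustration.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "pressure": rng.uniform(1000, 1020, size=200),
    "sea_surface_temp": rng.uniform(22, 30, size=200),
})
# Made-up linear relationship plus noise (not the real CO2 relationship).
y = 0.5 * X["pressure"] + 4.0 * X["sea_surface_temp"] + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)

# Predict for a baseline observation, then for the same observation 1 degree warmer.
base = pd.DataFrame({"pressure": [1009.49], "sea_surface_temp": [28.0]})
warmer = base.assign(sea_surface_temp=base["sea_surface_temp"] + 1.0)
delta = model.predict(warmer)[0] - model.predict(base)[0]

print(f"Change in prediction: {delta:.3f}")
print(f"Temperature coefficient: {model.coef_[1]:.3f}")
```

The two printed numbers match, which is why inspecting `coef_` on the fitted model is a quick way to anticipate what a temperature increase will do to the predicted value.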


#### Load the data

In [None]:
lr_df = pd.read_csv('./data/gbr.csv')
lr_df

In [None]:
lr_df.describe()

#### Check for correlations

In [None]:
lr_df.corr()

In [None]:
lr_corr_fig = px.imshow(lr_df.corr(), color_continuous_scale = 'RdBu',zmin=-1, zmax=1) # check the documentation!! 
lr_corr_fig.show()

#### Visualise correlations

In [None]:
lr_corr_mt = px.scatter_matrix(lr_df)
lr_corr_mt.show()

#### Select domain features - independent variables

In [None]:
X_data = lr_df[['pressure', 'sea_surface_temp']] #'salinity']]
X_data

#### Select range feature - dependent variable

In [None]:
y_data = lr_df['co2']
y_data

#### Create train/test data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, shuffle=True, train_size=0.8, random_state=99)

#### Create the model and fit to training data

In [None]:
linear_model = LinearRegression() 
linear_model.fit(X_train, y_train) 

#### Use the model to predict based on the test features

In [None]:
linear_predictions = linear_model.predict(X_test) 
linear_predictions # predicted CO2

In [None]:
lr_R2 = r2_score(y_test, linear_predictions) 
print(f'The model R squared score is: {lr_R2}')

# R-squared is a coefficient that measures the quality of the model's predictions.
# It indicates the proportion of variance in the dependent variable that the independent
# variables explain. A score of 1 means the variance is fully explained by the independent
# variables, and 0 means the model does no better than always predicting the mean.
# On test data the score can also be negative when the model does worse than that baseline.
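
The R-squared score described above can be reproduced by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses small made-up arrays (not the tutorial data) and a hypothetical helper `manual_r2` to show the calculation, including a case where the score falls below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
good_pred = np.array([2.8, 5.1, 7.2, 8.9])   # close to the true values
bad_pred = np.array([9.0, 7.0, 5.0, 3.0])    # trend in the wrong direction

def manual_r2(y, y_hat):
    """R-squared as 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_good = manual_r2(y_true, good_pred)
r2_bad = manual_r2(y_true, bad_pred)

print(r2_good, r2_score(y_true, good_pred))
print(r2_bad, r2_score(y_true, bad_pred))
```

The second pair of values is negative because those predictions are worse than simply predicting the mean of `y_true` for every observation.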

#### Visualise the predictions vs actual values (for test data)

In [None]:
# Create a chart to check the differences between what has been predicted and the real values

y_test_fig_df = pd.DataFrame(y_test)
linear_prediction_fig_df = pd.DataFrame(linear_predictions)
linear_prediction_fig_df.columns = ['Predicted CO2']
linear_prediction_fig_df['Test Index'] = y_test_fig_df.index
linear_prediction_fig_df.set_index('Test Index', inplace=True)
linear_fig_df = linear_prediction_fig_df.join(y_test_fig_df)
linear_fig = px.scatter(linear_fig_df)
linear_fig.show()

#### Try a prediction on unseen data

In [None]:
new_lr_prediction = linear_model.predict(pd.DataFrame({'pressure': [1009.49], 'sea_surface_temp': [32]})) 
new_lr_prediction

### Logistic Regression algorithm

1. Load the [Thyroid sickness determination dataset](https://www.kaggle.com/datasets/bidemiayinde/thyroid-sickness-determination) in the data folder
2. Perform logistic regression
3. Change features to improve classification


#### Load and clean data

In [None]:
log_df = pd.read_csv('./data/health.csv')
# we want to predict thyroid disease, i.e. if log_df['Class'] is sick or negative

# transform categorical variables into numeric codes (LabelEncoder assigns codes in sorted order)
log_df['sex_n'] = LabelEncoder().fit_transform(log_df['sex'])
log_df['class_n'] = LabelEncoder().fit_transform(log_df['Class'])
log_df

In [None]:
log_df.info()

In [None]:
# age should be integer instead of object
log_df['age'].value_counts()

In [None]:
log_df[log_df['age'].str.isnumeric()==False]

In [None]:
log_df = log_df[log_df['age'].str.isnumeric()==True].copy()
log_df['age'] = log_df['age'].astype('int')
log_df.info()

In [None]:
log_df.describe()

#### Check correlations

In [None]:
log_df.corr(numeric_only=True)

In [None]:
log_corr_fig = px.imshow(log_df.corr(numeric_only=True), color_continuous_scale = 'RdBu', zmin=-1, zmax = 1)
log_corr_fig.show()

#### Select independent and dependent variables

In [None]:
X_data = log_df[['T3', 'TT4', 'T4U']] 
X_data

In [None]:
y_data = log_df['class_n'] 
y_data

#### Create train/test split, check class balance, and scale

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, shuffle=True, train_size=0.8, random_state=99)

In [None]:
# Check the class balance
y_train.value_counts(normalize=True)

The classes are very imbalanced: there are far more negative (healthy) cases than positive (sick) cases.
Because of this class imbalance in the variable we are going to predict, the model is likely to lean towards predicting 'negative' (healthy) simply because of the biased data rather than because of the independent variables. In any classification model (logistic regression, decision trees, etc.), the class balance needs to be considered, for example through the `class_weight` parameter of `LogisticRegression`.
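
To make the effect of `class_weight='balanced'` more concrete, the sketch below computes the balanced weights by hand as `n_samples / (n_classes * count_per_class)` and compares them with scikit-learn's `compute_class_weight`. The label array is made up (92 negatives, 8 positives) and is only an assumption standing in for the thyroid labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 92 negative (0) and 8 sick (1).
y = np.array([0] * 92 + [1] * 8)
classes = np.unique(y)

# Balanced weights: n_samples / (n_classes * number of samples in each class).
manual_weights = len(y) / (len(classes) * np.bincount(y))

# The same weights as computed by scikit-learn.
sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(dict(zip(classes, manual_weights)))
print(dict(zip(classes, sk_weights)))
```

The rare class receives the larger weight, so errors on sick cases count for more during fitting than errors on the many healthy cases.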

Additionally, it is common practice to scale the data to obtain a better model. We are going to use standardisation, which rescales each feature to have a mean of 0 and a standard deviation of 1.
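
As a small illustration of what standardisation does, the sketch below applies `StandardScaler` to a tiny made-up array and checks it against the manual calculation `(x - mean) / std`. The numbers are assumptions for illustration; note that the scaler is fitted on the training rows only and then reused for the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 150.0], [3.0, 200.0], [4.0, 250.0]])
X_test = np.array([[2.5, 175.0]])

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics on the test data (no refitting).
X_test_scaled = scaler.transform(X_test)

# Manual equivalent: subtract the column mean, divide by the column standard deviation.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

Fitting the scaler on the training data alone keeps information from the test set out of the model, which is why the test data is passed through `transform` rather than `fit_transform`.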

In [None]:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)

#### Create the model and fit to training data

In [None]:

logistic_model = LogisticRegression(class_weight='balanced') # (class_weight={0: 0.92, 1: 0.07})

In [None]:
# Fit the model to the training dataset
logistic_model.fit(X_train, y_train)

#### Test model on test data and check with confusion matrix

In [None]:
# to evaluate model use confusion matrix
logistic_prediction = logistic_model.predict(X_test)  # Use the model to predict based on the testing dataset
cm = confusion_matrix(y_test, logistic_prediction) # Compare the model's prediction against the true value in the testing dataset
cm

#### Visualise the confusion matrix

In [None]:
cm_fig = px.imshow(cm, labels={'x': 'Predicted label', 'y': 'Actual label'})
cm_fig.show()

#### Create a test report

In [None]:
report = classification_report(y_test, logistic_prediction)
print(report)

Consider the precision and recall of the model.

Precision: What proportion of positive identifications was actually correct?
That is: (true positives / (true positives + false positives))

Recall: What proportion of actual positives was identified correctly?
That is: (true positives / (true positives + false negatives))

0 means not sick, 1 means sick. 
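
The two formulas above can be checked by hand against a confusion matrix. The sketch below uses small made-up label arrays (not the thyroid predictions) with 0 for not sick and 1 for sick, and compares the manual calculation with scikit-learn's `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 0 = not sick, 1 = sick.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted sick cases that were truly sick
recall = tp / (tp + fn)      # share of truly sick cases that the model found

print(f"precision={precision:.2f}, recall={recall:.2f}")
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

In a screening setting, a missed sick patient (false negative) is often more costly than a false alarm, so recall for class 1 is usually the figure to watch in the report.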