# Sheet 3 - Exercise 5

## Instructions To Run This Notebook

### On your local machine (assuming macOS or a Linux-based distribution)

1. Create a folder: `mkdir sheet3`
2. Go to the folder: `cd sheet3`
3. Install all the necessary dependencies (we assume you already have Python and pip) by running the following command:
    `pip install pandas numpy scikit-learn scipy`
4. Download the `Presence.zip` archive (`Dataset for Exercise Sheet`) from Moodle.
5. Extract `Presence.zip` (make sure the `.csv` files are now inside the `Presence` folder).
6. Put the `Presence` folder with the data in the `sheet3` folder you have just created.
7. Put this notebook, `sheet3_exercise5.ipynb`, in the same `sheet3` folder.
8. Now you can run this notebook in *Jupyter* or *VSCode* by clicking "Run All".

### On HPC 

To run this notebook on HPC, you first need to set up Jupyter on HPC by following these steps:

1. Log into your HPC instance by using the command: `ssh -p 8022 <your-login>@access-iris.uni.lu` (replace your-login with your HPC login)
2. Once logged in, initialize a session using `si`
3. Create a directory inside HPC: `mkdir sheet3`
4. Go to the directory: `cd sheet3`
5. Load python: `module load lang/Python`
6. Create a Python environment for this activity: `python -m venv jupyter_env`
7. Activate the environment: `source jupyter_env/bin/activate`
8. Install/Upgrade pip: `python -m pip install --upgrade pip`
9. Install Jupyter: `python -m pip install jupyter ipykernel`
10. Install all the required dependencies: `pip install pandas numpy scikit-learn scipy`
11. Register our environment to the Jupyter kernel: `python -m ipykernel install --sys-prefix --name jupyter_env`
12. Run the following command to launch a Jupyter server on HPC: `jupyter notebook --no-browser --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.disable_check_xsrf=True --ip=0.0.0.0`


Now, on your local machine, open a new terminal tab and do the following:
1. Download the `Presence.zip` archive (`Dataset for Exercise Sheet`) from Moodle.
2. Move the archive to the directory where you store the `sheet3_exercise5.ipynb` notebook (e.g. the `sheet3` directory).
3. Copy the files from your local machine to HPC. From the directory that contains `sheet3_exercise5.ipynb` and `Presence.zip`, run the following commands:\
`scp -P 8022 sheet3_exercise5.ipynb <your-login>@access-iris.uni.lu:~/sheet3`\
`scp -P 8022 Presence.zip <your-login>@access-iris.uni.lu:~/sheet3`
4. On HPC, go to the `sheet3` directory and extract the data archive: `unzip Presence.zip` (make sure it was extracted into the `sheet3` directory and that the `Presence` folder contains all the necessary `.csv` files).
5. Forward the Jupyter port to your local machine by running: `ssh -p 8022 -NL 8888:iris-XXX:8888 <your-login>@access-iris.uni.lu`, replacing `XXX` with the number of the iris node assigned to you on HPC. For example, if the HPC prompt reads *(jupyter_env) 0 [omahfoud@iris-055 sheet3](3487277 1N/T/1CN)$*, replace `XXX` with *055*.
6. On your machine, open your favorite browser and go to `localhost:8888` or `http://127.0.0.1:8888/`. This should open your Jupyter interface.
7. Click on **sheet3_exercise5.ipynb**, then click on **Run All**.

# Import necessary libraries

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from scipy.spatial.distance import cdist

# Read csv files containing the data and drop unimportant features

In [2]:
dataframes = {}

categorical_columns = ['area_layer_id', 'area_layer_name_it', 'period_layer_id', 'period_layer_name_it', 'customer_class_id', 
                       'customer_class_name_it', 'visitor_class_id', 'visitor_class_name_it', 'detailed_visitor_class_id',
                       'detailed_visitor_class_name_it', 'recurrency_class_id', 'recurrency_class_name_it', 'seasonality_class_id',
                       'seasonality_class_name_it', 'naval_tourist_id', 'naval_tourist_name_it']

continuous_columns = ['area_id', 'period_id', 'period_start', 'period_end', 'unique_users']

for i, filename in enumerate(os.listdir('Presence'), start=1):
    if filename.endswith('.csv'):
        file_path = os.path.join('Presence', filename)
        
        # We are dropping this feature since area_name_it has too many categories to be considered categorical
        dataframes[i] = pd.read_csv(file_path).drop(columns=['area_name_it'])
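
`os.listdir` returns entries in arbitrary order, and the `enumerate` call above also counts entries that are not `.csv` files, so the labels "file 1" and "file 2" in the outputs may not refer to the same file on every machine. A minimal sketch of a deterministic variant, assuming the same folder layout; the helper name `load_presence_csvs` and its `drop_columns` parameter are hypothetical and not part of the notebook:

```python
import os
import pandas as pd

def load_presence_csvs(folder, drop_columns=('area_name_it',)):
    """Hypothetical helper: read every .csv file in `folder` in sorted order."""
    # Filter first, then sort, so numbering depends only on the CSV file names.
    csv_files = sorted(f for f in os.listdir(folder) if f.endswith('.csv'))
    frames = {}
    for i, filename in enumerate(csv_files, start=1):
        df = pd.read_csv(os.path.join(folder, filename))
        # errors='ignore' keeps the call safe if a file lacks the column.
        frames[i] = df.drop(columns=list(drop_columns), errors='ignore')
    return frames
```

Calling `load_presence_csvs('Presence')` would then give a dictionary with the same shape as `dataframes`, keyed by the alphabetical position of each file.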

# Define Pearson Correlation Coefficient

In [3]:
def pearson_correlation(x, y):
    N = len(x)
    sum_x = np.sum(x)
    sum_y = np.sum(y)
    sum_xy = np.sum(x * y)
    sum_x_squared = np.sum(x ** 2)
    sum_y_squared = np.sum(y ** 2)

    numerator = N * sum_xy - sum_x * sum_y
    denominator = np.sqrt((N * sum_x_squared - sum_x ** 2) * (N * sum_y_squared - sum_y ** 2))

    if denominator == 0:
        return 0
    return numerator / denominator
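
As a sanity check, the hand-written coefficient can be compared with NumPy's `np.corrcoef` on synthetic data. The sketch below restates the same formula in compact form so that it runs on its own; the random inputs are made up for illustration only.

```python
import numpy as np

def pearson_correlation(x, y):
    # Same computational formula as in the cell above, restated compactly.
    n = len(x)
    numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    denominator = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                          (n * np.sum(y ** 2) - np.sum(y) ** 2))
    return 0 if denominator == 0 else numerator / denominator

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)  # correlated with x by construction

ours = pearson_correlation(x, y)
reference = np.corrcoef(x, y)[0, 1]
print('matches np.corrcoef:', np.isclose(ours, reference))

# A constant vector has zero variance, so the function falls back to 0.
constant = np.ones(50)
print('constant input:', pearson_correlation(x, constant))
```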

# Define k-means Clustering Analysis function

In [8]:
def kmeans_clustering_analysis(feature, num_clusters, features='all'):
    for k in dataframes:

        # K-means clustering
        X = dataframes[k].copy()

        area_layer_name = X[feature]
        X.drop(columns=[feature], inplace=True)

        X['period_start'] = X['period_start'].astype(str).str[-2:].astype(float).astype(int)
        X['period_end'] = X['period_end'].astype(str).str[-2:].astype(float).astype(int)

        categorical_feature_columns = [col for col in categorical_columns if col != feature]
        encoder = OneHotEncoder(drop='first')
        encoded_categorical = encoder.fit_transform(X[categorical_feature_columns]).toarray()
        encoded_categorical_df = pd.DataFrame(encoded_categorical, columns=encoder.get_feature_names_out(categorical_feature_columns))
        encoded_categorical_feature_columns = encoded_categorical_df.columns

        X = pd.concat([encoded_categorical_df, X[continuous_columns].reset_index(drop=True)], axis=1)

        X_standardized = StandardScaler().fit_transform(X)
        X = pd.DataFrame(X_standardized, columns=X.columns)

        if features == 'categorical':
            X = X[encoded_categorical_feature_columns]
        elif features == 'continuous':
            X = X[continuous_columns]
            
        feature_columns = X.columns
            
        kmeans = KMeans(n_clusters=num_clusters, random_state=42)
        kmeans.fit(X)
        y_kmeans = kmeans.predict(X)

        X = pd.concat([X, area_layer_name.reset_index(drop=True)], axis=1)

        X['kmeans_cluster'] = y_kmeans

        label_encoders = {}
        le = LabelEncoder()
        X['human_cluster'] = le.fit_transform(X[feature])
        label_encoders[feature] = le

        # Evaluation
        kmeans_clusters = np.sort(X['kmeans_cluster'].unique())
        human_clusters = np.sort(X['human_cluster'].unique())

        similarity_matrix = np.zeros((len(kmeans_clusters), len(human_clusters)))

        for i, kc in enumerate(kmeans_clusters):
            for j, hc in enumerate(human_clusters):
                kmeans_set = set(X.index[X['kmeans_cluster'] == kc])
                human_set = set(X.index[X['human_cluster'] == hc])
                intersection = len(kmeans_set.intersection(human_set))
                union = len(kmeans_set.union(human_set))
                similarity_matrix[i, j] = intersection / union if union != 0 else 0

        overall_similarity = np.sum(similarity_matrix) / max(len(kmeans_clusters), len(human_clusters))

        similarity_df = pd.DataFrame(similarity_matrix, index=kmeans_clusters, columns=human_clusters)

        print(f'Similarity Matrix for file {k}:')
        print(similarity_df)
        print(f'\nOverall Similarity for file {k}:', overall_similarity)

        kmeans_centroids = []
        for kc in kmeans_clusters:
            centroid = X[X['kmeans_cluster'] == kc][feature_columns].mean().values
            kmeans_centroids.append(centroid)

        human_centroids = []
        for hc in human_clusters:
            centroid = X[X['human_cluster'] == hc][feature_columns].mean().values
            human_centroids.append(centroid)

        kmeans_centroids = np.array(kmeans_centroids)
        human_centroids = np.array(human_centroids)

        distance_matrix = cdist(kmeans_centroids, human_centroids, metric='euclidean')

        distance_matrix = pd.DataFrame(distance_matrix, index=kmeans_clusters, columns=human_clusters)

        print(f'\nCentroid Euclidean Distance Matrix for file {k}:')
        print(distance_matrix)

        correlation_matrix = np.zeros((len(kmeans_clusters), len(human_clusters)))

        for i, kc in enumerate(kmeans_centroids):
            for j, hc in enumerate(human_centroids):
                correlation_matrix[i, j] = pearson_correlation(kc, hc)

        pearson_matrix = pd.DataFrame(correlation_matrix, index=kmeans_clusters, columns=human_clusters)

        print(f'\nCentroid Pearson Correlation Matrix for file {k}:')
        print(pearson_matrix)
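
The similarity matrix built inside the function is a pairwise Jaccard index (intersection over union) between each k-means cluster and each human-defined class. A minimal sketch on made-up labels, using boolean masks instead of index sets; the helper name `jaccard_matrix` is hypothetical:

```python
import numpy as np

def jaccard_matrix(labels_a, labels_b):
    """Pairwise intersection-over-union between the groups of two labelings."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    values_a = np.unique(labels_a)
    values_b = np.unique(labels_b)
    matrix = np.zeros((len(values_a), len(values_b)))
    for i, a in enumerate(values_a):
        for j, b in enumerate(values_b):
            in_a = labels_a == a
            in_b = labels_b == b
            union = np.sum(in_a | in_b)
            matrix[i, j] = np.sum(in_a & in_b) / union if union else 0.0
    return matrix

# Toy data: both labelings describe the same partition under different names.
kmeans_labels = [0, 0, 1, 1, 2, 2]
human_labels = [1, 1, 0, 0, 2, 2]

m = jaccard_matrix(kmeans_labels, human_labels)
overall = m.sum() / max(m.shape)  # same aggregation as overall_similarity
print(m)
print(overall)
```

When the two partitions coincide, each row contains a single 1 and the aggregate equals 1. Because the aggregate sums every entry of the matrix, partial overlaps spread across several cells also contribute to it.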

# Question 1

In [26]:
kmeans_clustering_analysis('area_layer_name_it', 3)

Similarity Matrix for file 1:
          0         1         2
0  0.119322  0.497260  0.005040
1  0.080525  0.345748  0.003456
2  0.017990  0.101726  0.000000

Overall Similarity for file 1: 0.3903555396297342

Centroid Euclidean Distance Matrix for file 1:
          0         1          2
0  4.092259  2.034382  16.810949
1  4.713511  2.701600  16.945167
2  9.515172  8.044827  19.020262

Centroid Pearson Correlation Matrix for file 1:
          0         1         2
0  0.115456 -0.113153  0.002941
1 -0.018086  0.019308 -0.010701
2 -0.143729  0.138798  0.009702
Similarity Matrix for file 2:
          0         1         2
0  0.106832  0.436710  0.005106
1  0.078528  0.311078  0.005116
2  0.104261  0.203029  0.007488

Overall Similarity for file 2: 0.4193830132528777

Centroid Euclidean Distance Matrix for file 2:
          0         1          2
0  4.585973  2.787239  14.838787
1  4.686359  2.753048  14.849875
2  4.847531  3.523054  14.860551

Centroid Pearson Correlation Matrix for file

# Question 2

In [27]:
kmeans_clustering_analysis('period_layer_name_it', 3)

Similarity Matrix for file 1:
          0         1         2
0  0.442803  0.029418  0.105946
1  0.397008  0.031437  0.103667
2  0.090733  0.021122  0.050555

Overall Similarity for file 1: 0.4242295301756515

Centroid Euclidean Distance Matrix for file 1:
          0          1         2
0  2.320996   8.477549  3.485179
1  2.384391   8.552009  3.682138
2  8.147308  11.717546  8.726207

Centroid Pearson Correlation Matrix for file 1:
          0         1         2
0 -0.037142  0.017535  0.031260
1 -0.012004  0.000385  0.014274
2  0.070214 -0.026063 -0.064688
Similarity Matrix for file 2:
          0         1         2
0  0.134879  0.037826  0.093636
1  0.304443  0.028992  0.106631
2  0.477199  0.031935  0.133156

Overall Similarity for file 2: 0.44956558862396206

Centroid Euclidean Distance Matrix for file 2:
          0         1         2
0  4.152798  8.714508  4.660378
1  2.987399  8.510508  3.986266
2  2.459095  8.275057  3.400579

Centroid Pearson Correlation Matrix for file 2:

# Question 3

In [28]:
kmeans_clustering_analysis('customer_class_id', 3)

Similarity Matrix for file 1:
          0         1         2
0  0.000000  0.299909  0.357646
1  0.001562  0.357003  0.231961
2  0.000000  0.146271  0.000000

Overall Similarity for file 1: 0.46478374824008833

Centroid Euclidean Distance Matrix for file 1:
           0         1         2
0  40.401847  2.633191  2.636970
1  39.907468  2.754545  2.710882
2  40.841059  7.646704  9.045144

Centroid Pearson Correlation Matrix for file 1:
          0         1         2
0 -0.119100 -0.166507  0.172025
1  0.129643 -0.150191  0.145773
2 -0.001613  0.446329 -0.448406
Similarity Matrix for file 2:
          0         1         2
0  0.000429  0.381603  0.189135
1  0.000000  0.324168  0.379503
2  0.000000  0.088029  0.000000

Overall Similarity for file 2: 0.4542891613907298

Centroid Euclidean Distance Matrix for file 2:
           0         1          2
0  76.774016  2.721902   3.781602
1  76.886896  3.033417   2.199362
2  77.473148  9.605085  10.940562

Centroid Pearson Correlation Matrix for

# Question 4

In [29]:
kmeans_clustering_analysis('visitor_class_id', 4)

Similarity Matrix for file 1:
          0        1         2         3
0  0.000000  0.00000  0.399991  0.113482
1  0.000000  0.20759  0.224206  0.268125
2  0.203736  0.00000  0.025245  0.141144
3  0.000000  0.00000  0.143038  0.183952

Overall Similarity for file 1: 0.4776272705947871

Centroid Euclidean Distance Matrix for file 1:
           0          1         2         3
0  11.543043   8.957766  2.062156  3.521861
1  11.834023   6.526294  2.935782  2.877453
2   6.071104  11.004695  8.910298  7.784368
3  12.290376   8.589536  3.350771  3.238021

Centroid Pearson Correlation Matrix for file 1:
          0         1         2         3
0 -0.108608 -0.499138  0.616741 -0.221172
1 -0.382767  0.541773 -0.245717 -0.114755
2  0.838595  0.020716 -0.419421  0.230438
3 -0.340818 -0.217405  0.088799  0.203200
Similarity Matrix for file 2:
          0         1         2         3
0  0.000000  0.273307  0.000000  0.316919
1  0.000000  0.000000  0.414642  0.000000
2  0.023853  0.000000  0.144888

# Question 5

In [30]:
kmeans_clustering_analysis('detailed_visitor_class_id', 5)

Similarity Matrix for file 1:
          0         1         2         3         4
0  0.077883  0.000000  0.156124  0.174720  0.164988
1  0.145195  0.000000  0.226360  0.184423  0.178182
2  0.000000  0.972222  0.000000  0.000000  0.000000
3  0.009889  0.021122  0.032205  0.030307  0.030758
4  0.024056  0.000000  0.175979  0.188955  0.190731

Overall Similarity for file 1: 0.5968198063209009

Centroid Euclidean Distance Matrix for file 1:
           0          1          2         3          4
0   8.282655   8.991159   3.385137  3.064419   3.447442
1   7.704414   8.373187   3.090078  2.971519   3.378215
2  11.086929   0.284400   8.551175  8.240315   8.541236
3  13.172779  12.537963  10.055864  9.992483  10.104040
4   9.462233   8.382796   2.894748  2.310742   2.644498

Centroid Pearson Correlation Matrix for file 1:
          0         1         2         3         4
0  0.114226 -0.396314  0.105632  0.157463  0.104568
1  0.265418 -0.238871  0.079108 -0.107424 -0.059830
2  0.003140  0.999

# Question 6

In [21]:
combined_df = pd.concat(dataframes.values(), ignore_index=True)

combined_df['period_start'] = combined_df['period_start'].astype(str).str[-2:].astype(float).astype(int)
combined_df['period_end'] = combined_df['period_end'].astype(str).str[-2:].astype(float).astype(int)

label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    # Fit the same encoder that is stored, so it can later decode the labels
    combined_df[col] = le.fit_transform(combined_df[col])
    label_encoders[col] = le
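
Storing a fitted encoder per column makes it possible to map the integer codes back to the original category values later with `inverse_transform`. A minimal sketch on made-up values; the column name `label` and its contents are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'label': ['b', 'a', 'b', 'c']})  # made-up categories

le = LabelEncoder()
codes = le.fit_transform(toy['label'])   # classes are sorted: a, b, c
restored = le.inverse_transform(codes)   # works only on a fitted encoder

print(list(le.classes_))
print(list(codes))
print(list(restored))
```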

In [14]:
def generate_decision_trees(X, y):
    precision, recall = [], []

    for _ in range(30):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

        dt_classifier = DecisionTreeClassifier().fit(X_train, y_train)

        predictions = dt_classifier.predict(X_test)

        precision.append(precision_score(y_test, predictions, average='weighted'))
        recall.append(recall_score(y_test, predictions, average='weighted'))
    
    return precision, recall

def print_results(precision, recall):
    print('Precision')
    print('Min: ', min(precision))
    print('Max: ', max(precision))
    print('Avg: ', np.mean(precision))
    print('Std: ', np.std(precision), '\n')

    print('Recall')
    print('Min: ', min(recall))
    print('Max: ', max(recall))
    print('Avg: ', np.mean(recall))
    print('Std: ', np.std(recall), '\n')
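
`generate_decision_trees` draws 30 unseeded splits, so its minimum, maximum, and standard deviation can differ between runs. A minimal sketch of a seeded variant, assuming that reproducibility is wanted; the name `repeated_tree_scores`, its parameters, and the synthetic data from `make_classification` are illustrative and not part of the exercise:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def repeated_tree_scores(X, y, n_runs=30, test_size=0.2):
    """Hypothetical seeded variant: run i uses random_state=i for split and tree."""
    precision, recall = [], []
    for seed in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
        predictions = model.predict(X_test)
        precision.append(precision_score(
            y_test, predictions, average='weighted', zero_division=0))
        recall.append(recall_score(
            y_test, predictions, average='weighted', zero_division=0))
    return precision, recall

# Synthetic stand-in for the real feature matrix and target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0)

first = repeated_tree_scores(X_demo, y_demo, n_runs=5)
second = repeated_tree_scores(X_demo, y_demo, n_runs=5)
print(first == second)  # identical seeds give identical score lists
```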

## area_layer_name_it

In [15]:
# 1)
cols = combined_df.columns.drop(['area_layer_name_it'])
X, y = combined_df[cols].values, combined_df['area_layer_name_it'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt_classifier = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

predictions = dt_classifier.predict(X_test)

precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')

print("Precision:", precision)
print("Recall:", recall, '\n\n')

# 2)
precision, recall = generate_decision_trees(X, y)
    
print_results(precision, recall)

Precision: 1.0
Recall: 1.0 


Precision
Min:  1.0
Max:  1.0
Avg:  1.0
Std:  0.0 

Recall
Min:  1.0
Max:  1.0
Avg:  1.0
Std:  0.0 
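
A perfect score on every split is what one would expect if the feature matrix still contains a column that encodes the target; here, for instance, `area_layer_id` remains in `X` while `area_layer_name_it` is predicted. Whether that mapping is really one-to-one in the Presence data is an assumption. The sketch below shows one way to check it on a made-up frame; the helper name `determines` is hypothetical:

```python
import pandas as pd

def determines(df, feature, target):
    """True if every value of `feature` maps to exactly one value of `target`."""
    return bool(df.groupby(feature)[target].nunique().max() == 1)

# Made-up rows that only mimic the column names of the dataset.
toy = pd.DataFrame({
    'area_layer_id': [1, 1, 2, 3],
    'area_layer_name_it': ['x', 'x', 'y', 'z'],
    'unique_users': [10, 12, 10, 7],
})

target = 'area_layer_name_it'
leaky = [c for c in toy.columns if c != target and determines(toy, c, target)]
print(leaky)
```

Running the same check on `combined_df` for each target would list the columns that a decision tree can use to reproduce the label exactly.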



## period_layer_name_it

In [16]:
# 1)
cols = combined_df.columns.drop(['period_layer_name_it'])
X, y = combined_df[cols].values, combined_df['period_layer_name_it'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt_classifier = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

predictions = dt_classifier.predict(X_test)

precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')

print("Precision:", precision)
print("Recall:", recall, '\n\n')

# 2)
precision, recall = generate_decision_trees(X, y)
    
print_results(precision, recall)

Precision: 1.0
Recall: 1.0 


Precision
Min:  1.0
Max:  1.0
Avg:  1.0
Std:  0.0 

Recall
Min:  1.0
Max:  1.0
Avg:  1.0
Std:  0.0 



## customer_class_id

In [17]:
# 1)
cols = combined_df.columns.drop(['customer_class_id'])
X, y = combined_df[cols].values, combined_df['customer_class_id'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt_classifier = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

predictions = dt_classifier.predict(X_test)

precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')

print("Precision:", precision)
print("Recall:", recall, '\n\n')

# 2)
precision, recall = generate_decision_trees(X, y)
    
print_results(precision, recall)

Precision: 1.0
Recall: 1.0 


Precision
Min:  1.0
Max:  1.0
Avg:  1.0
Std:  0.0 

Recall
Min:  1.0
Max:  1.0
Avg:  1.0
Std:  0.0 



## visitor_class_id

In [18]:
# 1)
cols = combined_df.columns.drop(['visitor_class_id'])
X, y = combined_df[cols].values, combined_df['visitor_class_id'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt_classifier = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

predictions = dt_classifier.predict(X_test)

precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')

print("Precision:", precision)
print("Recall:", recall, '\n\n')

# 2)
precision, recall = generate_decision_trees(X, y)
    
print_results(precision, recall)

Precision: 1.0
Recall: 1.0 


Precision
Min:  1.0
Max:  1.0
Avg:  1.0
Std:  0.0 

Recall
Min:  1.0
Max:  1.0
Avg:  1.0
Std:  0.0 



## detailed_visitor_class_id

In [19]:
# 1)
cols = combined_df.columns.drop(['detailed_visitor_class_id'])
X, y = combined_df[cols].values, combined_df['detailed_visitor_class_id'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt_classifier = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

predictions = dt_classifier.predict(X_test)

precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')

print("Precision:", precision)
print("Recall:", recall, '\n\n')

# 2)
precision, recall = generate_decision_trees(X, y)
    
print_results(precision, recall)

Precision: 1.0
Recall: 1.0 


Precision
Min:  1.0
Max:  1.0
Avg:  1.0
Std:  0.0 

Recall
Min:  1.0
Max:  1.0
Avg:  1.0
Std:  0.0 

