# Data Science
# Exercise 5 - Comparative Experimentation
<br/>Student:
<br/>se21m024
<br/>Thomas Stummer
<br/><br/>The interpretation of the data can be found in the document <b><i>se21m024_Stummer_ex5_comp_exp.pdf</i></b>.
<br/><br/>
The library <i>Surprise</i> (https://surprise.readthedocs.io/en/stable/index.html) was used to create the following results. The code closely follows the example code provided in the library's official documentation.
<br/><br/>
Big data set: Covertype<br>
The data set was provided by Jock A. Blackard and Colorado State University and downloaded from https://archive.ics.uci.edu/ml/datasets/Covertype.
<br/><br/>
Small data set: Heart Failure Prediction<br>
The data set was published by Davide Chicco and Giuseppe Jurman in "Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone", BMC Medical Informatics and Decision Making 20, 16 (2020) (https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5), and was downloaded from https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data.

# Import necessary dependencies

In [1]:
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise import Reader
from surprise import KNNWithMeans
from surprise import CoClustering
import datetime

## Import data sets

In [2]:
# Load the MovieLens 100k ratings (tab-separated: user, item, rating, timestamp)
reader = Reader(line_format='user item rating timestamp', sep='\t')
data_100k = Dataset.load_from_file('./Data/ml-100k/u.data', reader=reader)

# Load the MovieLens 1m ratings (fields separated by '::')
reader = Reader(line_format='user item rating timestamp', sep='::')
data_1m = Dataset.load_from_file('./Data/ml-1m/ratings.dat', reader=reader)
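
To make the two `Reader` configurations concrete, the following minimal sketch mimics how one line of each ratings file maps onto the field names given in `line_format`. The helper `parse_rating_line` and the sample lines are assumptions for illustration only; this is not how Surprise parses files internally.

```python
def parse_rating_line(line, sep, line_format='user item rating timestamp'):
    """Split one ratings line into a dict keyed by the names in line_format.

    Hypothetical helper for illustration; Surprise's Reader does its own parsing.
    """
    names = line_format.split()
    values = line.strip().split(sep)
    if len(values) != len(names):
        raise ValueError(f'expected {len(names)} fields, got {len(values)}')
    record = dict(zip(names, values))
    # Ratings are numeric; user and item ids stay as raw strings
    record['rating'] = float(record['rating'])
    return record


# Illustrative sample lines in the two file layouts
sample_100k = parse_rating_line('196\t242\t3\t881250949\n', sep='\t')
sample_1m = parse_rating_line('1::1193::5::978300760\n', sep='::')
print(sample_100k)
print(sample_1m)
```

The only difference between the two files, as far as the loader is concerned, is the separator; the field order is the same.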

## Set up algorithms

In [15]:
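# Define the evaluation routine
def generate_predictions(data, user_based, num_splits, use_coClustering=False):
    """Fit one algorithm on num_splits random 80/20 splits and print MSE and runtime per split.

    user_based selects user-based (True) or item-based (False) k-NN with means;
    use_coClustering=True evaluates co-clustering instead and ignores user_based.
    """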

    # Student ID: se21m024 -> random_state = 21024
    random_state = 21024

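    # Create num_splits random 80/20 training and test sets. Only the first split
    # is seeded, so the remaining splits (and their results) vary between runs.
    train_test_sets = [train_test_split(data, test_size=.2, random_state=random_state)]

    for _ in range(num_splits - 1):
        train_test_sets.append(train_test_split(data, test_size=.2))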

    # Configure the algorithm (Co-clustering or k-NN with means)
    if use_coClustering:
        algorithm = CoClustering(n_cltr_u=3, n_cltr_i=3, n_epochs=20, random_state=random_state)
    else:
        algorithm_options = {'name': 'cosine', 'user_based': user_based}
        algorithm = KNNWithMeans(k=40, min_k=1, sim_options=algorithm_options, random_state=random_state)

    # Fit on each training set, predict the test set, and record the MSE
    # together with the combined runtime of fitting and testing
    results = []
    for train_set, test_set in train_test_sets:
        start_time = datetime.datetime.now()
        algorithm.fit(train_set)
        predictions = algorithm.test(test_set)
        end_time = datetime.datetime.now()
        mse = accuracy.mse(predictions)
        runtime_sec = (end_time - start_time).total_seconds()
        results.append([mse, runtime_sec])

    # Print results
    print('\nResults')
    for mse, runtime_sec in results:
        print('MSE: ' + str(mse), 'Runtime: ' + str(runtime_sec) + ' seconds')
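
The metric reported by `accuracy.mse` is the mean squared error between the true ratings and the estimated ratings of the test set. A minimal sketch of that computation follows, assuming the predictions are given as hypothetical `(true_rating, estimated_rating)` pairs rather than Surprise prediction objects; the function name and the sample ratings are made up for illustration.

```python
def mean_squared_error(pairs):
    """Return the mean of the squared differences for (true, estimated) pairs."""
    if not pairs:
        raise ValueError('at least one prediction is required')
    return sum((true - est) ** 2 for true, est in pairs) / len(pairs)


# Made-up ratings for illustration only
example_pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 5.0)]
example_mse = mean_squared_error(example_pairs)
print(example_mse)
```

Because the errors are squared, a single prediction that is off by one full star weighs as much as four predictions that are off by half a star.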

# Generate and evaluate predictions

## 100k data set

### User Based Filtering (k-NN with means)

In [16]:
print('\n### 100k User based (k-NN with means) ###')
generate_predictions(data_100k, True, 5)


### 100k User based (k-NN with means) ###
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9048
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9184
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9023
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8967
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9123

Results
MSE: 0.9047969682388375 Runtime: 4.460698 seconds
MSE: 0.9183666176616443 Runtime: 4.445627 seconds
MSE: 0.9023409577181297 Runtime: 4.426486 seconds
MSE: 0.8966928864367588 Runtime: 4.857592 seconds
MSE: 0.91225430391259 Runtime: 4.36691 seconds


### Item Based Filtering (k-NN with means)

In [17]:
print('\n### 100k Item based (k-NN with means) ###')
generate_predictions(data_100k, False, 5)


### 100k Item based (k-NN with means) ###
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8800
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8920
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9014
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9040
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8928

Results
MSE: 0.8800046616403912 Runtime: 5.555936 seconds
MSE: 0.8919933604783681 Runtime: 5.49285 seconds
MSE: 0.9013814008747711 Runtime: 5.946492 seconds
MSE: 0.9039946022839733 Runtime: 5.665933 seconds
MSE: 0.8928368768979199 Runtime: 5.487841 seconds


### Co-clustering

In [18]:
print('\n### 100k Co-clustering ###')
generate_predictions(data_100k, False, 5, True)


### 100k Co-clustering ###
MSE: 0.9255
MSE: 0.9357
MSE: 0.9327
MSE: 0.9323
MSE: 0.9168

Results
MSE: 0.9255341256709398 Runtime: 2.064703 seconds
MSE: 0.935717379767046 Runtime: 1.927965 seconds
MSE: 0.9326897055015237 Runtime: 2.309413 seconds
MSE: 0.9322633368773251 Runtime: 1.913248 seconds
MSE: 0.9167665686432691 Runtime: 1.909768 seconds


## 1m data set

### User Based Filtering (k-NN with means)

In [19]:
print('\n### 1m User based (k-NN with means) ###')
generate_predictions(data_1m, True, 5)


### 1m User based (k-NN with means) ###
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8871
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8827
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8815
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8863
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8797

Results
MSE: 0.8871383941625158 Runtime: 188.827831 seconds
MSE: 0.8827446643560981 Runtime: 187.465062 seconds
MSE: 0.8814503117906278 Runtime: 190.810032 seconds
MSE: 0.8862540528986809 Runtime: 188.107589 seconds
MSE: 0.8796519293664954 Runtime: 189.922178 seconds


### Item Based Filtering (k-NN with means)

In [20]:
print('\n### 1m Item based (k-NN with means) ###')
generate_predictions(data_1m, False, 5)


### 1m Item based (k-NN with means) ###
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8011
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8016
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8013
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7968
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7962

Results
MSE: 0.8010809546318405 Runtime: 85.472311 seconds
MSE: 0.8015950374291422 Runtime: 86.592962 seconds
MSE: 0.8012576493558005 Runtime: 86.56306 seconds
MSE: 0.796758773341108 Runtime: 86.910761 seconds
MSE: 0.7961656679567479 Runtime: 87.845872 seconds


### Co-clustering

In [21]:
print('\n### 1m Co-clustering ###')
generate_predictions(data_1m, False, 5, True)


### 1m Co-clustering ###
MSE: 0.8401
MSE: 0.8302
MSE: 0.8389
MSE: 0.8378
MSE: 0.8456

Results
MSE: 0.8401407533473455 Runtime: 20.937122 seconds
MSE: 0.8302378049312799 Runtime: 20.427174 seconds
MSE: 0.838876977008309 Runtime: 21.269789 seconds
MSE: 0.8377893712206187 Runtime: 20.859609 seconds
MSE: 0.8456028526239072 Runtime: 22.456361 seconds
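
To compare the three 1m runs at a glance, the per-split values can be averaged. The sketch below copies the numbers from the outputs above (MSE rounded to four decimals, as printed by Surprise); the variable names are assumptions, and the averages are only as representative as five splits allow.

```python
from statistics import mean

# (MSE, runtime in seconds) per split, copied from the 1m outputs above
results_1m = {
    'user-based k-NN': [(0.8871, 188.827831), (0.8827, 187.465062),
                        (0.8815, 190.810032), (0.8863, 188.107589),
                        (0.8797, 189.922178)],
    'item-based k-NN': [(0.8011, 85.472311), (0.8016, 86.592962),
                        (0.8013, 86.56306), (0.7968, 86.910761),
                        (0.7962, 87.845872)],
    'co-clustering': [(0.8401, 20.937122), (0.8302, 20.427174),
                      (0.8389, 21.269789), (0.8378, 20.859609),
                      (0.8456, 22.456361)],
}

# Average MSE and runtime per algorithm
summary = {
    name: (mean(mse for mse, _ in runs), mean(sec for _, sec in runs))
    for name, runs in results_1m.items()
}
for name, (mean_mse, mean_runtime) in summary.items():
    print(f'{name}: mean MSE {mean_mse:.4f}, mean runtime {mean_runtime:.1f} s')

# Algorithm with the lowest average error and with the shortest average runtime
lowest_mse = min(summary, key=lambda name: summary[name][0])
fastest = min(summary, key=lambda name: summary[name][1])
print('Lowest mean MSE:', lowest_mse)
print('Fastest:', fastest)
```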
