<a href="https://colab.research.google.com/github/samkaj/appliedml-1/blob/main/assignment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1

By Group 5: Samuel Kajava and Torbjörn Livén

The assignment instructions are available [here](https://www.cse.chalmers.se/~richajo/dit866/assignments/a1/assignment1.html).

In [3]:
# Fetch the CTG data set.
!wget https://www.cse.chalmers.se/~richajo/dit866/data/CTG.csv

--2024-01-15 20:42:13--  https://www.cse.chalmers.se/~richajo/dit866/data/CTG.csv
Resolving www.cse.chalmers.se (www.cse.chalmers.se)... 129.16.221.33
Connecting to www.cse.chalmers.se (www.cse.chalmers.se)|129.16.221.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 307385 (300K) [text/plain]
Saving to: ‘CTG.csv.1’


2024-01-15 20:42:14 (474 KB/s) - ‘CTG.csv.1’ saved [307385/307385]



In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the CSV file.
data = pd.read_csv('CTG.csv', skiprows=1)

# Select the relevant numerical columns.
selected_cols = ['LB', 'AC', 'FM', 'UC', 'DL', 'DS', 'DP', 'ASTV', 'MSTV', 'ALTV',
                 'MLTV', 'Width', 'Min', 'Max', 'Nmax', 'Nzeros', 'Mode', 'Mean',
                 'Median', 'Variance', 'Tendency', 'NSP']
data = data[selected_cols].dropna()

# Shuffle the dataset.
data_shuffled = data.sample(frac=1.0, random_state=0)

# Split into input part X and output part Y.
X = data_shuffled.drop('NSP', axis=1)

# Map the diagnosis code to a human-readable label.
def to_label(y):
    return [None, 'normal', 'suspect', 'pathologic'][int(y)]

Y = data_shuffled['NSP'].apply(to_label)

# Partition the data into training and test sets.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=0)

In [5]:
# Take a peek at the data.
X.head()

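|  | LB | AC | FM | UC | DL | DS | DP | ASTV | MSTV | ALTV | ... | Width | Min | Max | Nmax | Nzeros | Mode | Mean | Median | Variance | Tendency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 658 | 130.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 24.0 | 1.2 | 12.0 | ... | 35.0 | 120.0 | 155.0 | 1.0 | 0.0 | 134.0 | 133.0 | 135.0 | 1.0 | 0.0 |
| 1734 | 134.0 | 9.0 | 1.0 | 8.0 | 5.0 | 0.0 | 0.0 | 59.0 | 1.2 | 0.0 | ... | 109.0 | 80.0 | 189.0 | 6.0 | 0.0 | 150.0 | 146.0 | 150.0 | 33.0 | 0.0 |
| 1226 | 125.0 | 1.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 43.0 | 0.7 | 31.0 | ... | 21.0 | 120.0 | 141.0 | 0.0 | 0.0 | 131.0 | 130.0 | 132.0 | 1.0 | 0.0 |
| 1808 | 143.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 69.0 | 0.3 | 6.0 | ... | 27.0 | 132.0 | 159.0 | 1.0 | 0.0 | 145.0 | 144.0 | 146.0 | 1.0 | 0.0 |
| 825 | 152.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 62.0 | 0.4 | 59.0 | ... | 25.0 | 136.0 | 161.0 | 0.0 | 0.0 | 159.0 | 156.0 | 158.0 | 1.0 | 1.0 |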


## Step 2. Training the baseline classifier

We begin by training a dummy classifier as a baseline for the classifiers we try later. It always predicts the most frequent class in the training data, so its cross-validated accuracy is the score that any useful classifier has to beat.

In [6]:
from sklearn.dummy import DummyClassifier

# Create dummy classifier.
clf = DummyClassifier(strategy='most_frequent')

In [7]:
from sklearn.model_selection import cross_val_score

# Perform cross validation.
dummy_cross_val = cross_val_score(clf, Xtrain, Ytrain)
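
For reference, when `cross_val_score` is called on a classifier without a `cv` argument, it uses five stratified folds and the estimator's own `score` method, which is accuracy for classifiers. The sketch below checks this on synthetic data. The generated dataset and the `*_demo` names are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced three-class data (hypothetical stand-in for CTG).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3,
    weights=[0.78, 0.14, 0.08], random_state=0)

clf_demo = DummyClassifier(strategy='most_frequent')

# Default call, as used in the notebook.
default_scores = cross_val_score(clf_demo, X_demo, y_demo)

# The same evaluation with the defaults spelled out.
explicit_scores = cross_val_score(
    clf_demo, X_demo, y_demo,
    cv=StratifiedKFold(n_splits=5), scoring='accuracy')

print(default_scores)
print(np.allclose(default_scores, explicit_scores))
```

The result is one accuracy value per fold, which is why the scores need to be aggregated before classifiers can be compared.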

Cross-validation returns one accuracy score per fold. We aggregate these scores into a single number by taking their mean, which makes the result directly comparable across classifiers.

In [8]:
import numpy as np

# Returns the aggregation of an array of numbers.
def aggregate(arr: np.ndarray):
  # return the mean of the array
  return np.mean(arr)

dummy_aggr = aggregate(dummy_cross_val)
dummy_aggr

0.7805882352941176
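
A `most_frequent` baseline scores exactly the share of the majority class in the data it is evaluated on, so the value above can be read as the approximate proportion of the most common label. A minimal sketch of that relationship follows. The 78/14/8 label split and the `*_demo` names are made up for illustration and are not taken from the CTG data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical label distribution: 78 / 14 / 8 out of 100 samples.
y_demo = np.array(['normal'] * 78 + ['suspect'] * 14 + ['pathologic'] * 8)

# The dummy classifier ignores the features, so random values suffice.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(len(y_demo), 3))

baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
accuracy = baseline.score(X_demo, y_demo)

# Share of the majority class among the labels.
majority_share = np.mean(y_demo == 'normal')

print(accuracy, majority_share)
```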

## Step 3. Trying out different classifiers

In this step, we use a number of classifiers and compare the aggregated results using the `aggregate()` function defined above. We choose to scale the data to help the linear classifiers converge [[1]](https://scikit-learn.org/stable/modules/preprocessing.html).

In [9]:
from sklearn.preprocessing import StandardScaler

# Scale the data.
scaler = StandardScaler()
scaler.fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)
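
Fitting the scaler once on the whole training set means that, during cross-validation, each held-out fold has already contributed to the scaling statistics. The effect is usually small, but one possible alternative is to wrap the scaler and the classifier in a pipeline so that the scaler is refit on the training part of every fold. The sketch below illustrates the idea on synthetic data. The dataset and the names `X_demo`, `y_demo` and `pipe` are assumptions, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data used only to keep the example self-contained.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# In each fold, the scaler is fit on the training part only and then
# applied to both the training part and the held-out part.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_demo, y_demo)

print(scores.mean())
```

With this setup the unscaled features are passed to `cross_val_score`, and the same fitted pipeline can later be applied to the test set without a separate `transform` call.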

In [10]:
import sklearn.tree as tree
import sklearn.ensemble as ensemble
import sklearn.linear_model as linear
import sklearn.svm as svm 
import sklearn.neural_network as nn

# Run cross validation on decision tree, random forest, gradient boosting, perceptron, logistic regression, linear SVC and MLP.
tree_clf = tree.DecisionTreeClassifier()
tree_cross_val = cross_val_score(tree_clf, Xtrain, Ytrain)

forest_clf = ensemble.RandomForestClassifier()
forest_cross_val = cross_val_score(forest_clf, Xtrain, Ytrain)

gb_clf = ensemble.GradientBoostingClassifier(max_depth=10)
gb_cross_val = cross_val_score(gb_clf, Xtrain, Ytrain)

perceptron_clf = linear.Perceptron()
perceptron_cross_val = cross_val_score(perceptron_clf, Xtrain, Ytrain)

logreg_clf = linear.LogisticRegression()
logreg_cross_val = cross_val_score(logreg_clf, Xtrain, Ytrain)

linsvc_clf = svm.LinearSVC(dual=False)
linsvc_cross_val = cross_val_score(linsvc_clf, Xtrain, Ytrain)

mlp_clf = nn.MLPClassifier(max_iter=3000, hidden_layer_sizes=(100,100), solver='adam')
mlp_cross_val = cross_val_score(mlp_clf, Xtrain, Ytrain)

# Aggregate the results.
tree_aggr = aggregate(tree_cross_val)
forest_aggr = aggregate(forest_cross_val)
gb_aggr = aggregate(gb_cross_val)
perceptron_aggr = aggregate(perceptron_cross_val)
logreg_aggr = aggregate(logreg_cross_val)
linsvc_aggr = aggregate(linsvc_cross_val)
mlp_aggr = aggregate(mlp_cross_val)

# Print the results.
print('Baseline:\n---------')
print('Dummy:', dummy_aggr)

print('\nTree-based:\n-----------')
print('Decision tree:', tree_aggr)
print('Random forest:', forest_aggr)
print('Gradient boost:', gb_aggr)

print('\nLinear:\n-------')
print('Perceptron:', perceptron_aggr)
print('Logistic regression:', logreg_aggr)
print('Linear SVC:', linsvc_aggr)

print('\nNeural net:\n-----------')
print('MLP:', mlp_aggr)

Baseline:
---------
Dummy: 0.7805882352941176

Tree-based:
-----------
Decision tree: 0.9229411764705882
Random forest: 0.9364705882352942
Gradient boost: 0.9476470588235294

Linear:
-------
Perceptron: 0.8729411764705883
Logistic regression: 0.891764705882353
Linear SVC: 0.8905882352941177

Neural net:
-----------


### Choosing the best classifier

With the cross-validation results aggregated, we select the classifier with the highest mean score, retrain it on the full training set, and evaluate it on the held-out test set.

In [12]:
from sklearn.metrics import accuracy_score

# Candidate models and their aggregation results.
candidates = {
    'Decision tree': (tree_clf, tree_aggr),
    'Random forest': (forest_clf, forest_aggr),
    'Gradient boost': (gb_clf, gb_aggr),
    'Perceptron': (perceptron_clf, perceptron_aggr),
    'Logistic regression': (logreg_clf, logreg_aggr),
    'Linear SVC': (linsvc_clf, linsvc_aggr),
    'MLP': (mlp_clf, mlp_aggr)
}

# Find the model with the highest aggregated cross-validation score.
best = max(candidates, key=lambda name: candidates[name][1])

print('Best model:', best)
best_clf = candidates[best][0]
best_clf.fit(Xtrain, Ytrain)
Yguess = best_clf.predict(Xtest)
print('Accuracy:', accuracy_score(Ytest, Yguess))

Best model: Gradient boost
Accuracy: 0.9272300469483568
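
Because the baseline already reaches about 78% accuracy, a single accuracy number says little about how well the rarer classes are recognised. A per-class breakdown can complement it. The following is only a sketch on synthetic data with a made-up class balance. The dataset, the `*_demo` and `*_tr`/`*_te` names, and the default-parameter gradient boosting model are assumptions for illustration and do not reproduce the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class data (hypothetical stand-in for CTG).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.78, 0.14, 0.08], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred, labels=[0, 1, 2])
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_te, y_pred, zero_division=0))
```

In the notebook itself, the same two calls could be applied to `Ytest` and `Yguess`.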
