# DAA ML Intro - Activity 1

Here we are going to look at some ML code for a toy dataset to see how easy it is to implement in Python.

To start, we need to install and import some packages:
* `ucimlrepo` is a helper library for fetching datasets from the UCI Machine Learning Repository.
* `pandas` provides dataframes for working with tabular data.
* `scikit-learn` is the workhorse ML library for Python, with a consistent interface to most classical machine learning models.

In [10]:
!pip3 install ucimlrepo pandas scikit-learn

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.


In [11]:
import pandas as pd

from ucimlrepo import fetch_ucirepo
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

Next, we need to fetch the dataset. This is a landmark dataset for health machine learning. It contains measurements of a number of different parameters taken from fine needle aspirates of breast masses, each labelled as benign (B) or malignant (M). More information about the dataset can be found here: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

Note: In Python, text after an octothorpe (#) is ignored. We'll use this for comments.

In [12]:
# Fetch the dataset (the returned object is a container, not a dataframe)
dataset = fetch_ucirepo(id=17)

# Split the dataset into features and targets.
# This step is from the UCI Repo instructions for obtaining the dataset.
X = dataset.data.features
y = dataset.data.targets

# Encode the target variable to 1 for M and 0 for B
# This is needed to encode the labels into numbers for the machine learning model to use.
# A common theme with ML/AI is turning data into numbers!
y = pd.Categorical(y['Diagnosis'], categories=['B', 'M']).codes
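
To make the encoding step concrete, here is a minimal pure-Python sketch of what the mapping does. The `labels` list is made-up illustrative data, not rows from the dataset, and the dictionary lookup is only a stand-in for `pd.Categorical(...).codes`, which assigns each label the position of its category in `categories`.

```python
# Illustrative labels only (not taken from the real dataset).
labels = ['M', 'B', 'B', 'M', 'B']

# The position of each category in this list becomes its integer code,
# mirroring categories=['B', 'M'] in the cell above.
categories = ['B', 'M']
code_for = {category: code for code, category in enumerate(categories)}

encoded = [code_for[label] for label in labels]
print(encoded)  # [1, 0, 0, 1, 0]
```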

Now that we have the data in the correct format, we can see what features we are working with.

In [13]:
# Look at the feature column names:
print(X.columns)

Index(['radius1', 'texture1', 'perimeter1', 'area1', 'smoothness1',
       'compactness1', 'concavity1', 'concave_points1', 'symmetry1',
       'fractal_dimension1', 'radius2', 'texture2', 'perimeter2', 'area2',
       'smoothness2', 'compactness2', 'concavity2', 'concave_points2',
       'symmetry2', 'fractal_dimension2', 'radius3', 'texture3', 'perimeter3',
       'area3', 'smoothness3', 'compactness3', 'concavity3', 'concave_points3',
       'symmetry3', 'fractal_dimension3'],
      dtype='object')


Helpfully, our data has already been processed into a format that is ready for machine learning: every feature is numeric and there are no missing values. This is almost never the case. Real-world data usually needs some manual and some automatic cleaning before it can be used for training, due to missing data, measurement errors, etc.

We start by using `train_test_split`, a common function that randomly splits our data into a training set and a testing set. `test_size` is set to 0.2, which puts roughly 20% of the data in the testing set. Note that measuring performance on held-out data is a fairly "empirical" approach compared to classical statistics. This is common throughout ML, though there is a lot of ongoing work on improving these methods!

In [14]:
# Train test split - try changing the test_size parameter to see what happens!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
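
As a rough illustration of what a random split does, the sketch below shuffles row indices and holds out a fraction for testing. It is a simplified stand-in, not scikit-learn's implementation: the function name `simple_split` and the seed value are hypothetical. The dataset has 569 samples, and rounding the test size up is an assumption chosen so that this gives 114 test rows, which matches the total of the confusion matrix shown later in this activity.

```python
import math
import random

def simple_split(n_samples, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle repeatable between runs.
    random.Random(seed).shuffle(indices)
    # Assumption: round the number of test rows up.
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(569, test_size=0.2)
print(len(train_idx), len(test_idx))  # 455 114
```

Because the shuffle is random, the `train_test_split` cell gives a different split, and so slightly different metrics, each time it runs. `train_test_split` accepts a `random_state` argument if repeatable results are wanted.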

Next, we create the KNN (k-nearest neighbours) model object. This doesn't train anything yet; it only initialises an instance of the class, ready for training. `scikit-learn` has a lot of options we can pass into this constructor to tinker with how the model works (e.g. changing how the distance between points is measured), but for now the default parameters should be fine. The only value we need to pass is the number of neighbours to work with.

In [15]:
# Create a KNN classifier with 5 neighbors - try changing this to see what happens!
knn = KNeighborsClassifier(n_neighbors=5)
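
The "distance between points" mentioned above can be measured in more than one way. As a small sketch using only the standard library, here are two common choices applied to a pair of made-up two-dimensional points. Real samples in this dataset have 30 features rather than 2.

```python
import math

# Two made-up points for illustration.
a = (0.0, 0.0)
b = (3.0, 4.0)

# Euclidean distance: straight-line distance between the points.
euclidean = math.dist(a, b)

# Manhattan distance: sum of absolute differences along each axis.
manhattan = sum(abs(x - y) for x, y in zip(a, b))

print(euclidean, manhattan)  # 5.0 7.0
```

In `scikit-learn`, the `metric` argument of `KNeighborsClassifier` controls this choice, and the default settings correspond to Euclidean distance.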

We can then train the model using the `.fit()` method. We'll see that this is a common theme for all of `scikit-learn`'s models - it's designed in a way that should make it easy to plug and play different model types!

In [16]:
# Fit the classifier to the training data
knn.fit(X_train, y_train)

We can then use the model to predict labels for the inputs in our test set. We now have two sets of labels for the test set: the true labels from the dataset and the predicted labels from our model.

In [17]:
# Predict the test data
y_pred = knn.predict(X_test)
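
To show the idea behind `.fit()` and `.predict()` for k-nearest neighbours, here is a minimal from-scratch sketch. It is not scikit-learn's implementation: the function name `knn_predict` and the tiny training set are hypothetical. In this simplified view, "fitting" just means keeping the training data around, and prediction is a majority vote among the k closest training points.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest points."""
    # Pair each training point's distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest training points.
    nearest_labels = [label for _, label in sorted(distances)[:k]]
    # The most common label among the neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2D data: class 0 clusters near the origin, class 1 near (5, 5).
train_points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (0.5, 0.5)))  # 0
print(knn_predict(train_points, train_labels, (5.5, 5.5)))  # 1
```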

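We can then use the `accuracy_score`, `f1_score` and `confusion_matrix` functions to print out information about how well the model is doing. Note that these all measure something slightly different! Accuracy is the fraction of correct predictions, the F1 score balances precision and recall for the positive (malignant) class, and the confusion matrix counts each combination of true and predicted label.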

In [18]:
# Print some helpful metrics.
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'F1 Score: {f1_score(y_test, y_pred)}')
print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}') 

Accuracy: 0.9122807017543859
F1 Score: 0.875
Confusion Matrix:
[[69  5]
 [ 5 35]]
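
The accuracy and F1 score above can be recomputed by hand from the confusion matrix, which helps show what each one measures. This sketch assumes scikit-learn's layout for binary labels, with rows as true classes and columns as predicted classes, so the matrix reads [[TN, FP], [FN, TP]].

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 69, 5
fn, tp = 5, 35

# Accuracy: fraction of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: of the samples predicted malignant, how many really were.
precision = tp / (tp + fp)
# Recall: of the truly malignant samples, how many were found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))  # 0.9123
print(f1)                  # 0.875
```

Under this reading, 5 of the 40 malignant samples in the test set were missed (false negatives), which accuracy alone does not highlight.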
