# Introduction to Machine Learning

Today is the day! Finally, you will enter the mystical magical world of Machine Learning. Let us begin!!!

In today's workshop, we will introduce you to two famous Machine Learning models. The first one is **K-Nearest-Neighbour** and the second one is a __Decision Tree Model__. In this workshop, everything you have learned in the previous weeks will come together. This means that you will need not only NumPy, but also Pandas. But don't worry, as responsible tutors, we will be by your side. :)

Comments for slides:

- What is Machine Learning

- Training vs Testing
- Models
    - KNN 
    - Decision Tree
    
- Evaluation
    - Accuracy 
    
- Optimisation
    - Hyper Parameter

- Overfitting vs. Underfitting

## Importing the packages

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd

## Breast Cancer Detection

Now, we work with the Breast Cancer Wisconsin (Diagnostic) Database. The goal is to create a classifier that can help with the diagnosis of patients.

### Download the data

In [None]:
from sklearn.datasets import load_breast_cancer

# Loads the breast cancer dataset
cancer = load_breast_cancer() 

# Dataset description
print(cancer.DESCR)

### Great! We have downloaded the cancer dataset. Let's deep dive into it!

In [None]:
print("Let's print the dataset")
print(cancer)

In [None]:
# As you can see this is quite confusing. Maybe, we know more if we know the data type
print("The data type is: ")
print(type(cancer))

# For further explanation: https://scikit-learn.org/stable/datasets/index.html

In [None]:
# First look at the features
print("The feature data looks like this: ")
print(cancer.data, "\n")
print("The data type is: ")
print(type(cancer.data))

print()
# Now, let's look at the labels / targets
print("The label data looks like this: ")
print(cancer.target, "\n")
print("The data type is: ")
print(type(cancer.target))

### Before we start, there are 2 quick things to check/do.

First of all, NumPy is cool, but let us work with Pandas. It is just way more comfy.. :)

And secondly, what does 0 mean and what does 1 mean? We should really understand our labelling before we do some predicting.
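
One way to answer the labelling question is to look at `cancer.target_names`, where the position of each name is the integer label it stands for. The sketch below is self-contained (it reloads the dataset); `label_names` is just an illustrative helper variable.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target_names[i] is the class name that the integer label i stands for
for label, name in enumerate(cancer.target_names):
    count = int(np.sum(cancer.target == label))
    print(f"{label} -> {name} ({count} samples)")

# Illustrative lookup table from integer label to class name
label_names = dict(enumerate(cancer.target_names))
```

So in this dataset 0 means malignant and 1 means benign, which is easy to get backwards.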



In [None]:
# Converts the data into Pandas dataframe

# This will create a data frame with the cancer data and features as columns
df_cancer = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)

print("Dataframe without label column")
print(df_cancer.shape, "\n")

# In order to get target as additional column we have to create a pandas series
df_cancer['target'] = pd.Series(cancer.target)

print("Dataframe with label column")
print(df_cancer.shape, "\n")

df_cancer.head() # Gives first rows of newly created DataFrame

In [None]:
cancer_ser = df_cancer['target'].value_counts() # Creates a series which counts the number of 0 and 1s


print(type(cancer_ser))
cancer_ser

# Okay, now we actually know the share of our classes
# cancer_ser.rename(index = {1: 'benign', 0: 'malignant'}, inplace=True) # Changes the naming of the index
# print(cancer_ser)

### Wicked! We will focus on the train and test split of our data

In [None]:
# Import the train_test_split functionality
from sklearn.model_selection import train_test_split

# Preparing X and Y values
X = df_cancer.iloc[:, :-1]
y = df_cancer.iloc[:, -1]

print("Initial data dimensions")
print(X.shape)
print(y.shape, "\n")

# Creates four data sets with training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print("After train / test split")
print(X_train.shape)
print(y_train.shape)
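
A short side note on what `train_test_split` does by default: without a `test_size` argument it holds out 25% of the rows for testing. The sketch below also shows the optional `stratify` argument, which is not used in this workshop and is included only as an illustration; it keeps the share of each class roughly equal in both parts.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Default split: 75% training rows, 25% test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("Train:", X_train.shape, "Test:", X_test.shape)

# Optional: stratify=y keeps the benign/malignant ratio similar in both parts
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print("Share of benign cases (all / train / test):")
print(y.mean(), y_train_s.mean(), y_test_s.mean())
```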


## Drum roll....It is time for K-Nearest-Neighbour!

### Model and Training

In [None]:
# Import the KNN model
from sklearn.neighbors import KNeighborsClassifier

# Defining neighbours
k = 1

# Creating a KNN model
knn = KNeighborsClassifier(n_neighbors = k) 

# Training the model with training data
knn.fit(X_train, y_train) 
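
To get a feeling for what happens inside the model, here is a minimal from-scratch sketch of the K-Nearest-Neighbour idea on toy data. It is not how scikit-learn implements it internally; `knn_predict`, `X_toy` and `y_toy` are made-up names for illustration, and Euclidean distance is an assumption.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=1):
    """Predict the label of x_new by majority vote among its k nearest points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Two small clusters: class 0 around (1, 1), class 1 around (5, 5)
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([4.9, 5.0]), k=3)
print(pred_a, pred_b)
```

Note that "training" a KNN model is essentially just storing the data; all the work happens at prediction time.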

### Prediction

Now, using the KNN classifier, we predict the class labels for the test set `X_test`.

The same `predict` call also works for a single, hypothetical patient. For example, `df_cancer.mean()[:-1].values.reshape(1, -1)` gets the mean value for each feature, ignores the label column, and finally reshapes the data from 1 dimension to 2, which is the shape `predict` expects.
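
As a rough sketch of the single-patient case, the snippet below rebuilds the DataFrame and the model from the cells above and classifies an "average patient". `mean_patient` and `mean_pred` are illustrative names; the row is kept as a one-row DataFrame (instead of a bare NumPy array) so that the column names match the training data.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
df_cancer = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df_cancer['target'] = pd.Series(cancer.target)

X = df_cancer.iloc[:, :-1]
y = df_cancer.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Mean of every feature (label column dropped), as one row with 30 columns
mean_patient = df_cancer.mean().iloc[:-1].to_frame().T
mean_pred = knn.predict(mean_patient)

print(mean_patient.shape)
print(f"Predicted label for the average patient: {mean_pred[0]}")
```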

In [None]:
# Creates the predictions based on the test set
y_pred = knn.predict(X_test)

print("Predictions are: ")
print(y_pred)

### Evaluate The Prediction With Accuracy

In order to evaluate the performance of the classifier, the score (mean accuracy) of the KNN classifier will be computed for `X_test` and `y_test`.

In [None]:
knn_acc = knn.score(X_test, y_test)

print(f"The accuracy of the KNN classifier is: {knn_acc}")
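
Accuracy is simply the share of test samples whose predicted label equals the true label. The sketch below recomputes it by hand and compares it with `knn.score`; it is self-contained and uses the plain NumPy arrays from `load_breast_cancer(return_X_y=True)` instead of the DataFrame, which is an assumption made for brevity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Accuracy by hand: fraction of positions where prediction and truth agree
manual_acc = np.mean(y_pred == y_test)
# Accuracy as reported by the model
score_acc = knn.score(X_test, y_test)

print(manual_acc, score_acc)
```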

### Let us visualize this result (comparing training vs. testing)

In [None]:
mal_train_X = X_train[y_train==0]
mal_train_y = y_train[y_train==0]
ben_train_X = X_train[y_train==1]
ben_train_y = y_train[y_train==1]

mal_test_X = X_test[y_test==0]
mal_test_y = y_test[y_test==0]
ben_test_X = X_test[y_test==1]
ben_test_y = y_test[y_test==1]

scores = [knn.score(mal_train_X, mal_train_y), knn.score(ben_train_X, ben_train_y), 
          knn.score(mal_test_X, mal_test_y), knn.score(ben_test_X, ben_test_y)]


plt.figure()

# Plot the scores as a bar chart
bars = plt.bar(np.arange(4), scores, color=['#4c72b0','#4c72b0','#55a868','#55a868'])

# directly label the score onto the bars
for bar in bars:
    height = bar.get_height()
    plt.gca().text(bar.get_x() + bar.get_width()/2, height*.90, '{0:.{1}f}'.format(height, 2), 
                 ha='center', color='w', fontsize=11)

plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=True)

# remove the frame of the chart
for spine in plt.gca().spines.values():
    spine.set_visible(False)

plt.xticks([0,1,2,3], ['Malignant\nTraining', 'Benign\nTraining', 'Malignant\nTest', 'Benign\nTest'], alpha=0.8);
plt.title('Training and Test Accuracies for Malignant and Benign Cells', alpha=0.8)

# Challenge I

As you know now, the KNN classifier has a hyper-parameter to tune: its **K** value. In this first challenge, compute and plot the accuracy of the KNN classifier for the values k: {1,2,3,4,5,6,7,8,9,10}

**What is the best K-value for your model ?**

In [None]:
# Your code here

In [None]:
# Write the k value with the best score
k = 
# Hint np.argmax
best_score = 

print(f"The best score is: {best_score}" )
print(f"The best score, you get with k = {k}")
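
If you get stuck, here is one possible approach, offered as a sketch rather than the official solution. It assumes the same split as above (`random_state=0`) and uses the names `k_values`, `scores` and `best_k` purely for illustration. Keep in mind that `np.argmax` returns the first index if several k values tie.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

k_values = list(range(1, 11))
scores = []
for k in k_values:
    # Train one model per k and record its test accuracy
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# Position of the highest accuracy, mapped back to its k value
best_index = int(np.argmax(scores))
best_k = k_values[best_index]
best_score = scores[best_index]

print(f"The best score is: {best_score}")
print(f"The best score, you get with k = {best_k}")
```

Plotting `scores` against `k_values` (for example with `plt.plot`) then shows how the accuracy changes with k.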

# We made it! Our first Machine Learning use case!


Now, it is your turn to work on a real-life use case, tweak the data in the right manner and create a model. This time, you will use the Decision Tree model. Of course, you can also try the KNN model on this use case.

# Challenge II 

Alright, this is quite an oldie but really important and cool. We will predict hand-written digits!


In [None]:
from sklearn import datasets
digits = datasets.load_digits()

In [None]:
# Dataset description
print(digits.DESCR)

First, we want to know the dimensionality of the data. 

**What are the dimensions of digits.data?**

and **what are the dimensions of digits.target?**

In [None]:
# Here is your code

**Let us quickly plot the data with the command below!**

In [None]:
# Just run the code
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')

**Now create an np.array Y_digit which holds your labels, and an np.array X_digit which holds the pixel matrices. When you have the data stored separately in X and Y, use the scikit-learn function to split the data into a training and a test set!**


In [None]:
from sklearn.model_selection import train_test_split

**In order to check the dimensions of your training and test set, print the shapes of X_train and y_train.**

**How big is your training set now?**

## Decision Tree Classifier

This time, we will use a DT classifier to work with X_train and y_train

**Hint:** Here you will find the info how to set-up your Decision Tree in Sklearn: 

In [None]:
# Load the Decision Tree model from Sklearn
from sklearn.tree import DecisionTreeClassifier

### Create a Decision Tree model and train it with your training data (X_train, y_train)

In [None]:
# Create a Decision Tree model instance 
# Your code here

### Prediction of your Decision Tree model

In [None]:
# Predict the classes for X_test and save it to y_pred
# Your code here

**Let's have a closer look at our predictions. Plot the first element of X_test (this should be an image) and print the first element of y_pred (this is an integer)**

**Is the prediction accurate?**

**HINT 1:** The function in order to plot the image is: plt.imshow(first_X_value, cmap='binary', interpolation='nearest')

**HINT 2:** Maybe, you have to reshape the first element of your X_test
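
To illustrate why the reshape in HINT 2 is needed: each row of `digits.data` is a flat vector of 64 pixel values, while `plt.imshow` expects a 2D grid. The sketch below uses the full dataset rather than your `X_test`, and the names `first_flat` and `first_image` are just for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# One sample as stored in digits.data: a flat vector of 64 pixel values
first_flat = digits.data[0]
print(first_flat.shape)

# Reshaped back into the 8x8 grid that plt.imshow can display
first_image = first_flat.reshape(8, 8)
print(first_image.shape)

# The reshaped vector is the same picture as the entry in digits.images
print(np.array_equal(first_image, digits.images[0]))
```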

In [None]:
# Your code here

### Evaluate your model 

In [None]:
# Evaluate your predictions with y_test. What is your accuracy? --> Save your accuracy in variable dt_acc
from sklearn.metrics import accuracy_score

# Your code here

### We can see that our model has its limitations...

In [None]:
print(f"This is our target test set: {y_test}", "\n")

print(f"This is our target prediction set: {y_pred}", "\n")

# Compare the two arrays position by position: wherever they differ, the model got it wrong.
# In our run, the second element was not correctly predicted...damn it! (Your result may differ depending on your split and your tree.)
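
Instead of comparing the two printed arrays by eye, you can let NumPy find the misclassified positions, and let a confusion matrix show which digits get mixed up. This is only a sketch under assumptions: it rebuilds the data and the tree from scratch, and `random_state=0` for both the split and the tree is an arbitrary choice, so the numbers will not necessarily match your own run.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_digit, Y_digit = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digit, Y_digit, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Positions in the test set where prediction and true label disagree
wrong = np.flatnonzero(y_pred != y_test)
print(f"{len(wrong)} of {len(y_test)} test digits were misclassified")

# Rows are true digits, columns are predicted digits
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Large numbers off the diagonal point to pairs of digits the model confuses most often.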

In [None]:
X_second = X_test[1].reshape(8,8)
print(X_second.shape, "\n")

print(f"The second prediction is a: {y_pred[1]}", "\n")
print(f"The corresponding image is a: ")
plt.imshow(X_second, cmap='binary', interpolation='nearest')

### If you can, create a KNN classifier and compute its performance for this task!

### Plot the second prediction, to see if  the classifier got it right!

In [None]:
# This challenge is optional!

from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Let's build the classifier, train it and predict the values
knn = KNeighborsClassifier(n_neighbors = 3) 
knn.fit(X_train, y_train)
 
y_pred2 = knn.predict(X_test)


In [None]:
X_second = X_test[1].reshape(8,8)
print(X_second.shape, "\n")

print(f"The second prediction is a: {y_pred2[1]}", "\n")
print(f"The corresponding image is a: ")
plt.imshow(X_second, cmap='binary', interpolation='nearest')

## WOOOOW, the prediction is right! KNN, with k=3 did a great job!

## Congrats! You did it! You officially completed our Python workshop series. We hope that you enjoyed it as much as we did.

![great](img/great.jpg)


### Learning to program can be very challenging in the beginning (like everything, lol). But with the concepts and tools we introduced, you should be able to tackle your own challenges.

### In the upcoming months, we will send you a feedback form. We would love to hear your feedback in order to improve this workshop series as much as possible.

#### Until then...Don't forget: https://www.youtube.com/watch?v=SJUhlRoBL8M

# Additional info about ML (Regression)

In order to also give you a notion of regression tasks, we quickly introduce Linear Regression. This model is very popular and quite efficient in its prediction. Of course, it has its limitations. Run the example on your own and see how you can use Machine Learning for a real-value prediction task.

## Linear Regression

In this first use case, we will predict the y value given some x coordinates. First, we import the linear model from sklearn. We also import `train_test_split` and the `mean_squared_error` and `r2_score` functions.

In [None]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
np.random.seed(0) # Fixes the random seed so that everybody gets the same random numbers

n = 100 # Number of data points
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10

# Let's have quick look at the data dimensions
print("Original data")
print(x.shape)
print(y.shape, "\n")

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0) # train_test_split function

# Okay, how did the dimensions change
print("After train / test split")
print(X_train.shape)
print(y_train.shape, "\n")

# Necessary reshaping for later data manipulation
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

print("Properly reshaped")
print(X_train.shape)
print(y_train.shape)

In [None]:
# Let us plot the data
def plot_data():
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);
    
plot_data()

In [None]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

In [None]:
# The coefficients
print('Coefficients: ', regr.coef_,  "\n")

# The mean squared error
print(f"Mean squared error: {mean_squared_error(y_test, y_pred)}",  "\n")

# R2 score (coefficient of determination): 1 is perfect prediction
print(f"R2 score: {r2_score(y_test, y_pred)}",   "\n")

# Plot outputs
plt.scatter(X_test, y_test,  color='blue')
plt.plot(X_test, y_pred, color='red', linewidth=2)

plt.show()
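
For the curious, both metrics can be computed by hand with NumPy. The sketch below uses a tiny made-up example (`y_true` and `y_hat` are illustrative values, not workshop data) and checks the result against the sklearn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Tiny made-up example: true values and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])

# Mean squared error: average of the squared differences
mse = np.mean((y_true - y_hat) ** 2)

# R2: 1 minus (squared error of the model / squared error of always predicting the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, mean_squared_error(y_true, y_hat))
print(r2, r2_score(y_true, y_hat))
```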


### Quiz: Is this underfitting or overfitting?

In [None]:
# Underfitting
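
One way to check the quiz answer yourself is to compare the R2 score on the training data with the score on the test data: a model that underfits does poorly on both. The sketch below regenerates the same data and, as a contrast, also fits a more flexible model. The degree-5 polynomial and the names `linear` and `poly` are assumptions chosen for illustration, not part of the workshop.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data as above
np.random.seed(0)
n = 100
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10

X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, random_state=0
)

# A straight line versus a more flexible polynomial model
linear = LinearRegression().fit(X_train, y_train)
poly = make_pipeline(PolynomialFeatures(degree=5), LinearRegression()).fit(X_train, y_train)

for name, model in [("linear", linear), ("degree-5 polynomial", poly)]:
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    print(f"{name}: train R2 = {train_r2:.2f}, test R2 = {test_r2:.2f}")
```

If the flexible model scores clearly better on both sets, the straight line was too simple for the wavy data.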