In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("11-exercise-pids2024.ipynb")

# Exercise sheet 11

**Hello everyone!**

**Points: 15**

Topics of this exercise sheet are:
* Classification
* Cross validation
* Grid search
* Data cleaning

Please let us know if you have questions or problems! <br>
Contact us during the exercise session or on [Piazza](https://piazza.com/unibas.ch/spring2024/63982).

**Automatic Feedback**

This notebook can be automatically graded using Otter grader. To find out how many points you get, run `grader.check_all()` in a new cell.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.datasets import fetch_openml

# Question 1: Classification with neural networks (5 Points)


In this first task, you will use a neural network for classification. For this, we load the dataset `cancer.csv`. When you run `df.head()`, you will get

```text
   radius_mean  texture_mean  ...  fractal_dimension_worst  diagnosis
0        13.74         17.91  ...                  0.07014          0
1        13.37         16.39  ...                  0.07628          0
2        14.69         13.98  ...                  0.09208          0
3        12.91         16.33  ...                  0.06949          0
4        13.62         23.23  ...                  0.06953          0
```

Each line represents a cell. The last column, `diagnosis`, is 1 if the cell is cancerous, and 0 if it is benign. All other columns describe geometric features of the cell. 

In [None]:
df = pd.read_csv("daten/cancer.csv")
df.head()

### Question 1a: Loading and preprocessing the data (1 Point)

Create two DataFrames again, `X` and `y`, where `y` contains only the diagnoses and `X` contains all the other columns. Store the scaled features in `X_scaled`.
*Hint: Don't forget to scale your data using [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).*
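
As a rough illustration of what `StandardScaler` does, here is a minimal sketch on a small made-up array (the name `X_toy` and its values are assumptions for illustration, not the cancer data). After scaling, each column should have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two columns on very different scales (made-up values)
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation of each column,
# then subtracts the mean and divides by the standard deviation
X_toy_scaled = scaler.fit_transform(X_toy)

print(X_toy_scaled.mean(axis=0))  # close to 0 for every column
print(X_toy_scaled.std(axis=0))   # close to 1 for every column
```

Keeping the fitted `scaler` object around is useful, because the same transformation can later be applied to other data.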

In [None]:
class Question1a:
    ...
    X_scaled = ...

In [None]:
grader.check("Question 1a")

### Question 1b: Train the model (2 Points)

Train an [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) 
using two hidden layers with `10` neurons each and set `max_iter` to 30000. 
Test how well your model performs on the data you trained on by computing the *accuracy*. 
The *accuracy* is the number of correct predictions divided by the total number of predictions.
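
To make the definition of accuracy concrete, the following sketch computes it by hand and with `accuracy_score` on two short made-up label arrays (`y_true` and `y_pred` are hypothetical values, not model output).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions; they differ only at index 2
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Fraction of positions where the prediction matches the true label
manual_accuracy = np.mean(y_true == y_pred)
# The same quantity computed by scikit-learn
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)  # both 0.8 (4 of 5 correct)
```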


In [None]:
class Question1b:
    ...
    accuracy = ...
    print(f"accuracy is {accuracy}")

In [None]:
grader.check("Question 1b")

### Question 1c: Test on a test dataset (2 Points)

Load the test set `cancer_test.csv`, which contains a new set of patients. Use the scaler from `Question1a` to scale the data, then compute the accuracy again. What do you observe?
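
Reusing a fitted scaler means calling `transform` on the new data rather than fitting again, so that the new data is scaled with the statistics of the training data. A minimal sketch with made-up one-column arrays (all names and values here are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data: mean 5, standard deviation 5
X_train_toy = np.array([[0.0], [10.0]])
# Made-up new data that the scaler has never seen
X_test_toy = np.array([[5.0], [15.0]])

# Fit on the training data only
scaler = StandardScaler().fit(X_train_toy)
# Apply the training mean and standard deviation to the new data (no refit)
X_test_scaled = scaler.transform(X_test_toy)

print(X_test_scaled.ravel())  # (5 - 5) / 5 = 0 and (15 - 5) / 5 = 2
```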

In [None]:
class Question1c:
    ...
    accuracy = ...
    print(f"accuracy is {accuracy}")

In [None]:
grader.check("Question 1c")

# Question 2: Cross-validation and Grid search (5 Points)

In this exercise, we compare the quality of different models using cross-validation on the training data. 
We use the cancer dataset from the previous exercise.

### Question 2a: Cross-validation (2 Points)

Read the documentation of [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) (and use a search engine or an AI assistant if needed) to find out how to do cross-validation in scikit-learn. Perform a 5-fold cross-validation on the scaled data from `Question1a` using the classifier
defined in `Question1b`. Compute the mean and standard deviation of the `test_score`. 
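
To show the general shape of a `cross_validate` call, here is a sketch on synthetic data with a different estimator. The choice of `LogisticRegression`, the `make_classification` data, and the `_demo` names are all assumptions for illustration and are not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data (not the cancer dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation; the result is a dictionary of arrays,
# with one entry per fold under the key "test_score"
cv_demo = cross_validate(LogisticRegression(), X_demo, y_demo, cv=5)

print(cv_demo["test_score"])
print(cv_demo["test_score"].mean(), cv_demo["test_score"].std())
```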

In [None]:
class Question2a:
    cv = ...
    print("cv result: ", cv)
    mean_test_score = ...
    std_test_score = ...
    print(f"mean {mean_test_score} standard deviation {std_test_score}")

In [None]:
grader.check("Question 2a")

Is the average test score closer to what you got in training or to what you got when applying the classifier to new patients?

### Question 2b: Setting up a param grid (1 Point)

Next, we want to do a grid search. For this, we set up a dictionary with all parameters we want to search. 
The keys are the parameters of the classifier we want to vary, and the values
are lists of all the values each parameter can take on. (Check the documentation of [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) to find the parameter names.) Define a dictionary `param_grid` for a grid search over
six different models, with layer sizes `(10, ), (10, 10), (10, 10, 10)` and with activation functions `relu` and `tanh`. 
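
The structure of such a dictionary can be sketched with a different classifier. The parameter names below belong to `KNeighborsClassifier`, chosen purely for illustration, and `ParameterGrid` is used only to count the resulting combinations.

```python
from sklearn.model_selection import ParameterGrid

# Keys are parameter names of the estimator, values are lists of candidates.
# These names are for KNeighborsClassifier (an illustrative stand-in).
param_grid_demo = {
    "n_neighbors": [1, 3, 5],
    "weights": ["uniform", "distance"],
}

# Every combination of one value per key is one model to evaluate
n_models = len(ParameterGrid(param_grid_demo))
print(n_models)  # 3 * 2 = 6
```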

In [None]:
class Question2b:
    param_grid = ...

In [None]:
grader.check("Question 2b")

### Question 2c: Do the grid search (2 Points)

Perform a grid search using [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), where you use the given MLPClassifier as the estimator and the parameter grid you set up above. 
Use `accuracy` as the scoring metric and choose 5-fold cross-validation. 
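
The general pattern of a grid search can be sketched as follows, again on synthetic data and with `KNeighborsClassifier` as a stand-in estimator. The estimator, the data, and the `_demo` names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not the cancer dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
param_grid_demo = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}

# Each of the 6 parameter combinations is evaluated with 5-fold cross-validation
search_demo = GridSearchCV(
    KNeighborsClassifier(), param_grid_demo, scoring="accuracy", cv=5
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)  # combination with the highest mean accuracy
print(search_demo.best_score_)   # its mean cross-validated accuracy
```

After fitting, `cv_results_` holds the scores of every combination, which can help when comparing models beyond the single best one.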

In [None]:

class Question2c:
    ...
    grid_search = ...

    print(grid_search)

In [None]:
grader.check("Question 2c")


# Question 3: A more complex example (5 Points)

In this exercise, we walk through a data cleaning process and work with non-numerical data before we 
do a classification. 

In [None]:

# Load the dataset
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
y = y.apply(lambda x : int(x))


### Question 3a: Data cleaning (I) (2 Points)

Not all columns contain meaningful information for prediction. 
Extract the columns `pclass, sex, cabin, fare, boat`. 
You can read about the meaning of the columns [here](https://www.openml.org/search?type=data&sort=runs&id=40945&status=active).
Fill the `nan`s in column `boat` with a value of `0` and replace the non-numerical values with another unique value. 
Change `sex` to 0 for male and 1 for female. Assign the resulting DataFrame to the variable `X_cleaned`. 


*Hint: To find out what values you have in the column 'boat' use the method `unique`*. 
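
One possible way to approach this kind of cleaning is sketched below on a tiny made-up DataFrame. The frame `toy`, its values, and the placeholder `-1` for non-numerical labels are assumptions for illustration; the real columns contain more distinct values.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame: one missing boat, one numeric label, one letter label
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "boat": [np.nan, "2", "C"],
})
print(toy["boat"].unique())  # inspect which values occur

cleaned = toy.copy()
# Encode sex as 0 (male) and 1 (female)
cleaned["sex"] = cleaned["sex"].map({"male": 0, "female": 1})

missing = toy["boat"].isna()
# Non-numerical labels cannot be parsed and become NaN
boat_num = pd.to_numeric(toy["boat"], errors="coerce")
# Assumed placeholder -1 for non-numerical labels, then 0 for missing entries
cleaned["boat"] = boat_num.fillna(-1).mask(missing, 0)

print(cleaned)
```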

In [None]:

class Question3a:
    ...
    X_cleaned = ...
    
    display(X_cleaned)


In [None]:
grader.check("Question 3a")

### Question 3b: Data cleaning (II) (2 Points)

Now drop all rows that contain `nan`s. Make sure you also drop the corresponding entries from the labels `y`. 
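
Keeping features and labels aligned can be done through the index. A minimal sketch with made-up data (`X_toy`, `y_toy`, and their values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up features with one missing value in row 1, and matching labels
X_toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
y_toy = pd.Series([0, 1, 1])

# dropna removes every row that contains at least one NaN
X_kept = X_toy.dropna()
# Select the labels that belong to the surviving rows via their index
y_kept = y_toy.loc[X_kept.index]

print(X_kept.index.tolist())  # [0, 2]
print(y_kept.tolist())        # [0, 1]
```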

In [None]:
class Question3b:
    X = Question3a.X_cleaned 
    
    X_nonan = ...
    display(X_nonan)

In [None]:
grader.check("Question 3b")

### Question 3c: Split into training, validation, and test sets (1 Point)

Use scikit-learn's [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
to split the data into a training, a validation, and a test set. The training set should contain around 60% of the data and the other two sets around 20% each. Assign the data to the variables `X_train, y_train, X_val, y_val, X_test, y_test`.
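
Since `train_test_split` produces only two parts per call, a three-way split needs two calls. The sketch below shows one way to do this on made-up arrays (the `_toy` names, the sizes, and `random_state=0` are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 samples with 2 features each, alternating labels
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.arange(50) % 2

# First call: keep 60% for training, set 40% aside
X_train_toy, X_rest, y_train_toy, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=0
)
# Second call: split the remaining 40% in half (20% validation, 20% test)
X_val_toy, X_test_toy, y_val_toy, y_test_toy = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))  # 30 10 10
```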

In [None]:

class Question3c:
    X_train = ...
    y_train = ...
    X_val = ...
    y_val = ...
    X_test = ...
    y_test = ...
    

In [None]:
grader.check("Question 3c")

### Question 3d: Training the classifier

Now you can train the classifier. You won't get points for it, but wouldn't it be unsatisfying not to do it?


In [None]:
class Question3d:

    ...
    accuracy = ...
    print("accuracy: ", accuracy)