# Preprocessor Tuning

## (0) The `tumors` Dataset

* 👩🏻‍⚕️ The following dataset describes tumors that are either <font color=red>malignant</font> or <font color=green>benign</font>. 
* 🎯 The task is to detect as many malignant tumors as possible.

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)

url = "https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/tumors_dataset.csv"
data = pd.read_csv(url)

data.head()

In [None]:
# Share of each class in the target, rounded to two decimals
data.malignant.value_counts(normalize=True).round(2)

## (1) Building a Pipeline

❓ **Question: Building a Pipeline** ❓

Combine the following steps in a **`Pipeline`** object named `pipeline`:

1. Impute missing values with a **`KNNImputer`**
2. Scale all the (numerical) features with a **`MinMaxScaler`**
3. Fit a **`LogisticRegression`** model with default parameters
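As a rough illustration of how the three steps listed above chain together, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo`, the step names (`"imputer"`, `"scaler"`, `"model"`) and the name `demo_pipeline` are assumptions made for this example only; they are not part of the challenge, and the real `tumors` data may need different handling.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the tumors data (assumption: numerical features, binary target)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # blank out roughly 5% of the values

# Each step is a (name, estimator) pair; the step names are arbitrary labels
demo_pipeline = Pipeline([
    ("imputer", KNNImputer()),        # 1. fill missing values from nearest neighbors
    ("scaler", MinMaxScaler()),       # 2. rescale every feature to [0, 1]
    ("model", LogisticRegression()),  # 3. classifier with default parameters
])

# Fitting the pipeline fits each step in order on the output of the previous one
demo_pipeline.fit(X_demo, y_demo)
demo_predictions = demo_pipeline.predict(X_demo)
```

Naming the steps explicitly is a design choice: it makes the hyperparameter names used later in a grid search (for example `imputer__n_neighbors`) easy to read.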

In [None]:
# YOUR CODE HERE

## (2) Optimizing a pipelined model

❓ **Question: GridSearching a Pipeline** ❓

* What is the optimal number of neighbors for the KNN imputer: 2, 5, or 10?
    * Perform a grid search on your pipeline and save your answer in a variable called `n_best`.
    * _Be careful: choose a scoring metric that matches the task stated at the top of the notebook (detecting as many malignant tumors as possible), just saying... :)_
* Feel free to run the grid search on the whole dataset instead of using a train/test split. In this challenge, the goal is just to become familiar with Pipelines :)
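The grid search described above could look roughly like the sketch below, again on synthetic data. The names `X_demo`, `y_demo`, `demo_pipeline`, `demo_search` and `demo_n_best` are hypothetical, and `scoring="recall"` is only one plausible reading of "detect as many malignant tumors as possible"; it also assumes the positive class is encoded as `1`.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the tumors data (assumption, not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # blank out roughly 5% of the values

demo_pipeline = Pipeline([
    ("imputer", KNNImputer()),
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])

# Pipeline hyperparameters are addressed as "<step name>__<parameter name>"
param_grid = {"imputer__n_neighbors": [2, 5, 10]}

demo_search = GridSearchCV(
    demo_pipeline,
    param_grid=param_grid,
    scoring="recall",  # assumption: recall on the positive class (label 1)
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# Best value found among the candidates, and its mean cross-validated score
demo_n_best = demo_search.best_params_["imputer__n_neighbors"]
demo_best_score = demo_search.best_score_
```

Because the imputer sits inside the pipeline, it is refitted on the training folds only at each split, which keeps information from the validation fold out of the imputation.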



In [None]:
n_best = None

In [None]:
# YOUR CODE HERE

In [None]:
# YOUR CODE HERE

## (3) Evaluating a pipeline

❓ **Question: What is the performance of the optimal pipeline?** ❓

- Make sure you cross-validate your optimal pipeline! 
- Store your result as a `float` number in a variable named `cv_score`
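One way to cross-validate a pipeline and reduce the result to a single `float` is sketched below on synthetic data. The names `X_demo`, `y_demo`, `demo_pipeline`, `demo_scores` and `demo_cv_score` are hypothetical, `n_neighbors=5` is a placeholder rather than the answer to the previous question, and averaging the fold scores with the recall metric is an assumption about what "performance" means here.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the tumors data (assumption, not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # blank out roughly 5% of the values

demo_pipeline = Pipeline([
    ("imputer", KNNImputer(n_neighbors=5)),  # placeholder value, not the tuned one
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])

# One score per fold; the whole pipeline is refitted on each training split
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="recall")

# Collapse the fold scores into a plain Python float
demo_cv_score = float(demo_scores.mean())
```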

In [None]:
cv_score = None

In [None]:
# YOUR CODE HERE

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'solution',
    n_best=n_best,
    cv_score=cv_score
)

result.write()
print(result.check())

## (4) Predicting using a fitted and pipelined model

👇 Here is a new tumor.

In [None]:
new_url = "https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/new_tumor.csv"

new_data = pd.read_csv(new_url)
new_data

❓ **Question: Using your optimal pipeline, predict whether the new tumor is malignant or not** ❓
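Predicting on a new observation with a fitted pipeline might look like the sketch below. The feature names (`f1`, `f2`, `f3`), the synthetic training data and the single-row `new_row` DataFrame are all made up for illustration; the real `new_tumor.csv` has its own columns, which must match those used to fit the pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature names and synthetic training data (assumption)
feature_names = ["f1", "f2", "f3"]
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=feature_names)
y_demo = (X_demo["f1"] > 0).astype(int)

demo_pipeline = Pipeline([
    ("imputer", KNNImputer()),
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])
demo_pipeline.fit(X_demo, y_demo)

# A new observation with the same columns; the missing value is imputed by the pipeline
new_row = pd.DataFrame([[2.5, np.nan, -0.3]], columns=feature_names)

# predict() applies imputation and scaling with the fitted parameters, then classifies
demo_prediction = demo_pipeline.predict(new_row)
demo_proba = demo_pipeline.predict_proba(new_row)  # one probability per class
```

The point of the pipeline here is that the new row needs no manual preprocessing: the same fitted imputer and scaler are reused automatically.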

In [None]:
# YOUR CODE HERE

🏁 Congratulations! You are now an expert at pipelining!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!