# Preprocessor Tuning

## The ```breast tumors``` Dataset

- The following dataset describes tumors that are either <font color=red>malignant</font> or <font color=green>benign</font>. 
- The task is to detect as many malignant tumors as possible.

In [None]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer
pd.set_option('display.max_columns', None)

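data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Caution: scikit-learn encodes this target as 0 = malignant, 1 = benign
# (see data.target_names), so a 1 in the 'malignant' column denotes a benign tumor.
df['target'] = data.target
df = df.rename(columns={'target': 'malignant'})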

In [None]:
df['malignant'].value_counts(normalize=True).round(2)

## Building a Pipeline

❓ **>>>** Combine the following steps in a **`Pipeline`** object named `pipeline`:

1. Impute missing values with a **`KNNImputer`**
2. Scale all the (numerical) features with a **`MinMaxScaler`**
3. Model a **`LogisticRegression`** with default parameters

In [None]:
from sklearn.pipeline import Pipeline

from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Code here!
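
One possible way to chain the three steps listed above is sketched here. The step names (`"imputer"`, `"scaler"`, `"model"`) are illustrative choices, not names required by the exercise; any unique strings would work.

```python
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Each step is a (name, estimator) tuple; the names are assumptions
# chosen for readability and are reused when tuning hyperparameters.
pipeline = Pipeline([
    ("imputer", KNNImputer()),        # 1. fill in missing values
    ("scaler", MinMaxScaler()),       # 2. scale features to [0, 1]
    ("model", LogisticRegression()),  # 3. classifier with default parameters
])
```

Naming the steps explicitly makes it possible to address a step's hyperparameters later with the `<step name>__<parameter>` syntax.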


## Optimizing a pipelined model

❓ **>>>** What is the optimal number of neighbors for the KNN imputer: 2, 5, or 10?

- Run a `GridSearchCV` on your pipeline and save your answer in a variable named `n_best`.
- Be careful: choose a scoring metric for the grid search that is relevant to the task.
- Feel free to grid search on the whole dataset instead of using a train/test split in this challenge. Here, the goal is just to become familiar with Pipelines.

In [None]:
n_best = None
# Code here!


✅ **Expected result**: ```n_best``` = 2
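
A minimal sketch of such a grid search follows, assuming the step names `"imputer"`, `"scaler"` and `"model"` (illustrative, not imposed by the exercise), 5-fold cross-validation, and `scoring="recall"` as one candidate metric for "detect as many as possible". Note that `"recall"` scores the class encoded as 1, so it is worth checking `target_names` to see which tumor type that is. Because the dataset as shipped by scikit-learn has no missing values, the imputer may leave the features unchanged, in which case the three candidates tie and `GridSearchCV` reports the first one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Reload the data so the sketch runs on its own.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

pipeline = Pipeline([
    ("imputer", KNNImputer()),
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])

# "<step name>__<parameter>" targets a hyperparameter of one pipeline step.
grid_search = GridSearchCV(
    pipeline,
    param_grid={"imputer__n_neighbors": [2, 5, 10]},
    scoring="recall",  # assumption: recall of the class labeled 1
    cv=5,              # assumption: 5 folds
)
grid_search.fit(X, y)

n_best = grid_search.best_params_["imputer__n_neighbors"]
```

Inspecting `grid_search.cv_results_["mean_test_score"]` shows whether the candidates actually differ or merely tie.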

## Evaluating a pipeline

❓ **>>>** **What is the performance of the optimal pipeline?**

- Make sure you cross-validate your optimal pipeline! 
- Store your result as a `float` number in a variable named `cv_score`

In [None]:
cv_score = None
# Code here!

✅ **Expected result**: ```cv_score``` = 0.9915884194053209
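
The cross-validation step could look like the sketch below. It assumes the same illustrative step names, 5 folds, and `scoring="recall"` (which scores the class encoded as 1); the exact value obtained may vary with the scikit-learn version and these choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Reload the data so the sketch runs on its own.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Pipeline rebuilt with the number of neighbors selected by the grid search.
optimal_pipeline = Pipeline([
    ("imputer", KNNImputer(n_neighbors=2)),
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])

# One recall score per fold, averaged and converted to a plain Python float.
fold_scores = cross_val_score(optimal_pipeline, X, y, cv=5, scoring="recall")
cv_score = float(fold_scores.mean())
```

After a grid search with the default `refit=True`, `grid_search.best_estimator_` could be passed to `cross_val_score` instead of rebuilding the pipeline by hand.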

## Predicting using a fitted and pipelined model

Here is a new tumor.

In [None]:
new_obs = {'mean radius': [20.57],
           'mean texture': [17.77],
           'mean perimeter': [132.9],
           'mean area': [1326.0],
           'mean smoothness': [0.08474],
           'mean compactness': [0.07864],
           'mean concavity': [0.0869],
           'mean concave points': [0.07017],
           'mean symmetry': [0.1812],
           'mean fractal dimension': [0.05667],
           'radius error': [0.5435],
           'texture error': [0.7339],
           'perimeter error': [3.398],
           'area error': [74.08],
           'smoothness error': [0.005225],
           'compactness error': [0.01308],
           'concavity error': [0.0186],
           'concave points error': [0.0134],
           'symmetry error': [0.01389],
           'fractal dimension error': [0.003532],
           'worst radius': [24.99],
           'worst texture': [23.41],
           'worst perimeter': [158.8],
           'worst area': [1956.0],
           'worst smoothness': [0.1238],
           'worst compactness': [0.1866],
           'worst concavity': [0.2416],
           'worst concave points': [0.186],
           'worst symmetry': [0.275],
           'worst fractal dimension': [0.08902]}

❓ **>>>** Using your optimal pipeline, predict whether the new tumor is malignant or not. Also display the predicted probabilities.

In [None]:
# Code here!
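
A possible sketch of the prediction step is given below. To keep it self-contained, it uses one row of the dataset as a stand-in for the new tumor; in the notebook, `pd.DataFrame(new_obs)` would take its place. The step names and the variable names `new_tumor`, `prediction`, `probabilities` and `label` are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

optimal_pipeline = Pipeline([
    ("imputer", KNNImputer(n_neighbors=2)),
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])

# The pipeline must be fitted before it can predict.
optimal_pipeline.fit(X, y)

# Stand-in observation: a one-row DataFrame with the same columns as X.
new_tumor = X.iloc[[1]]

prediction = optimal_pipeline.predict(new_tumor)[0]           # encoded class
probabilities = optimal_pipeline.predict_proba(new_tumor)[0]  # one value per class

# Map the encoded class back to its name; the column order of
# predict_proba follows optimal_pipeline.classes_.
label = data.target_names[prediction]
print(label, dict(zip(data.target_names, probabilities.round(4))))
```

Looking the label up in `target_names` avoids guessing which tumor type a predicted 0 or 1 stands for.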
