<a href="https://colab.research.google.com/github/w4bo/AA2425-unibo-mldm/blob/master/slides/lab-09-breastcancer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The `BreastCancer` challenge

**Goal**: predict the `diagnosis` of each data item.

**Metric**: submissions are evaluated using the accuracy score.

- When splitting the data into training and test sets, the test set should contain 30% of the data.
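
As a rough illustration of the 70/30 split and the accuracy metric, the sketch below uses a synthetic dataset from `make_classification` as a stand-in for the real data (an assumption made so the snippet runs on its own). The `KNeighborsClassifier` settings and the `stratify` option are illustrative choices, not requirements of the challenge.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

SEED = 42

# Synthetic stand-in with the same shape as the challenge data (569 rows, 30 features).
X, y = make_classification(n_samples=569, n_features=30, random_state=SEED)

# Hold out 30% of the rows; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED, stratify=y
)

# Illustrative model choice: any sklearn classifier could be used here.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Accuracy = fraction of test items whose predicted label equals the true label.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"train={len(X_train)} test={len(X_test)} accuracy={accuracy:.3f}")
```

Passing `random_state` explicitly makes the split reproducible regardless of the global NumPy seed.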

**Requirements**: you are allowed to use the `numpy`, `pandas`, `matplotlib`, `seaborn`, and `sklearn` Python libraries.

1. You can import any model from `sklearn`.
2. Try `sklearn` [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).
3. Explore AutoML with `FLAML`.
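
A pipeline chains preprocessing and a model into a single estimator, so the same transformations are fitted on the training data and reused on the test data. The sketch below is one possible arrangement, assuming a `StandardScaler` followed by a `KNeighborsClassifier` on synthetic stand-in data; the step names `"scale"` and `"knn"` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the snippet is self-contained.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Each step is a (name, estimator) pair; the last step is the classifier.
pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ]
)

# fit() learns the scaling parameters on the training data only.
pipe.fit(X_train, y_train)

# For classifiers, score() returns the accuracy on the given data.
test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the scaler sits inside the pipeline, the test data never influences the scaling statistics.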

# Setup

In [None]:
#| echo: false
#| output: false

!pip install flaml[automl]

In [None]:
# Import the libraries used for machine learning
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv), data manipulation as in SQL
import matplotlib.pyplot as plt  # used for plotting
import seaborn as sns  # used for plotting
from sklearn.model_selection import train_test_split  # split the data into training and test sets
from sklearn.neighbors import KNeighborsClassifier  # import a machine learning model
from sklearn import metrics  # check the error and accuracy of the model

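# Seed the Python and NumPy random generators for reproducibility
import random
import os
seed = 42
random.seed(seed)
# Note: PYTHONHASHSEED set at runtime only affects Python processes started afterwards
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)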

# read the data
df = pd.read_csv("https://raw.githubusercontent.com/w4bo/teaching-handsondatapipelines/main/materials/datasets/breastcancer.csv")

# Data understanding

Hints:

- There are 569 observations with 30 features each
- Each observation is labelled with a `diagnosis`

In [None]:
df

# Data profiling

In [None]:
df.info()

# Feature semantics

Hint:

- id of the observation
- diagnosis (M = malignant, B = benign)
- Ten real-valued features are computed for each cell nucleus:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

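`*_mean`: the mean of the feature over all cell nuclei in the image

`*_se`: the standard error of the feature over all cell nuclei in the image

`*_worst`: the "worst" (largest) value of the feature, computed as the mean of the three largest values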
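
The suffix convention above makes it easy to group columns by statistic. The following is a small sketch using plain Python; the column names in `columns` are hypothetical examples that follow the `*_mean`, `*_se`, `*_worst` pattern, and `group_by_suffix` is a helper introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical column names following the naming pattern described above.
columns = [
    "id", "diagnosis",
    "radius_mean", "texture_mean",
    "radius_se", "texture_se",
    "radius_worst", "texture_worst",
]

def group_by_suffix(names, suffixes=("mean", "se", "worst")):
    """Group column names by their trailing statistic suffix."""
    groups = defaultdict(list)
    for name in names:
        # Split on the last underscore: "radius_mean" -> ("radius", "_", "mean").
        base, _, suffix = name.rpartition("_")
        if base and suffix in suffixes:
            groups[suffix].append(name)
        else:
            groups["other"].append(name)
    return dict(groups)

groups = group_by_suffix(columns)
print(groups)
```

With the real data, the same helper could be called as `group_by_suffix(df.columns)`.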


# ... and now?

Take a first look at the dataset:

- Do we consider all features?
- Are there null values?
- What are the attribute types?
- What are the attribute ranges?
- How many distinct labels are there?
- Are the classes imbalanced?
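
Most of the questions above can be answered with a few `pandas` calls. The sketch below gathers them in a hypothetical helper, `first_look`, and runs it on a tiny made-up frame (`toy`, whose values are invented for illustration and are not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up miniature frame; the values are illustrative only.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "diagnosis": ["M", "B", "B", "B", "M"],
    "radius_mean": [17.9, 12.4, 11.8, np.nan, 20.1],
    "texture_mean": [10.4, 15.7, 18.2, 14.3, 21.0],
})

def first_look(frame, label):
    """Collect null counts, dtypes, numeric ranges, and class shares."""
    numeric = frame.select_dtypes(include="number")
    return {
        # Number of missing values per column.
        "null_counts": frame.isna().sum().to_dict(),
        # Column types as strings.
        "dtypes": frame.dtypes.astype(str).to_dict(),
        # One row per numeric column, with its min and max (NaN is skipped).
        "ranges": numeric.agg(["min", "max"]).T,
        # Relative frequency of each label, to spot class imbalance.
        "class_share": frame[label].value_counts(normalize=True).to_dict(),
    }

report = first_look(toy, label="diagnosis")
print(report["null_counts"])
print(report["ranges"])
print(report["class_share"])
```

On the real data this would be called as `first_look(df, label="diagnosis")`. Whether a column such as `id` should be kept as a feature remains a judgment call that the report alone does not settle.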