# Hackathon: Medical Diagnoses

You are working for a hospital and you want to be able to detect a certain disease using data taken from medical images.

You have data describing an image of a medical sample, along with knowledge of whether the patient the sample was taken from was diagnosed with the disease or not.

You want to be able to capture every patient who needs to be assessed further so as to rule out any complications related to the disease.


<img src="images/doctor.png" style="display: block;margin-left: auto;margin-right: auto;height: 300px"/>

## About the data

This dataset ships with Scikit-Learn as one of its built-in datasets for exploring and practising machine learning: the [Breast Cancer Wisconsin (Diagnostic) dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset).

The features in the dataset were computed from a digitized image of a breast tissue sample; they describe characteristics of the cell nuclei present in the image.


|Column|Description|Type|
|:---|:---|:---|
|id|ID number|float|
|diagnosis|The diagnosis of breast tissues (M = malignant, B = benign)|str|
|radius_mean|mean of distances from center to points on the perimeter|float|
|texture_mean|standard deviation of gray-scale values|float|
|perimeter_mean|mean size of the perimeter of the tumor|float|
|area_mean|mean size of the tumor|float|
|smoothness_mean|mean of local variation in radius lengths|float|
|compactness_mean|mean of perimeter^2 / area - 1.0|float|
|concavity_mean|mean of severity of concave portions of the contour|float|
|concave points_mean|mean for number of concave portions of the contour|float|
|fractal_dimension_mean|mean for "coastline approximation" - 1|float|
|radius_se|standard error for the mean of distances from center to points on the perimeter|float|
|texture_se|standard error for standard deviation of gray-scale values|float|
|perimeter_se|standard error for the perimeter of the tumor|float|
|area_se|standard error for the size of the tumor|float|
|smoothness_se|standard error for local variation in radius lengths|float|
|compactness_se|standard error for perimeter^2 / area - 1.0|float|
|concavity_se|standard error for severity of concave portions of the contour|float|
|concave points_se|standard error for number of concave portions of the contour|float|
|fractal_dimension_se|standard error for "coastline approximation" - 1|float|
|radius_worst|"worst" or largest mean value for mean of distances from center to points on the perimeter|float|
|texture_worst|"worst" or largest mean value for standard deviation of gray-scale values|float|
|perimeter_worst|"worst" or largest mean value for mean of perimeter|float|
|area_worst|"worst" or largest mean value for mean of area|float|
|smoothness_worst|"worst" or largest mean value for local variation in radius lengths|float|
|compactness_worst|"worst" or largest mean value for perimeter^2 / area - 1.0|float|
|concavity_worst|"worst" or largest mean value for severity of concave portions of the contour|float|
|concave points_worst|"worst" or largest mean value for number of concave portions of the contour|float|
|fractal_dimension_worst|"worst" or largest mean value for "coastline approximation" - 1|float|

In [None]:
import pandas as pd
medical = pd.read_csv('data/medical.csv')
medical.head()

### Exercise: 

Perform some preliminary analysis on the dataset.

1. How many patients is there data for?

2. How could you show the datatype of each column? Are there any missing values?

3. How many predictive features are there?

4. Run a pair plot on the features you think are the most correlated, using hue as the diagnosis.
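
The questions above can be approached along these lines. This is only a sketch, not the official answer: it assumes Scikit-Learn's built-in copy of the dataset as a stand-in for `data/medical.csv`, so the column names differ (e.g. `mean radius` rather than `radius_mean`) and there is no `id` column.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Stand-in for data/medical.csv: scikit-learn's built-in copy of the data.
# Assumption: same rows, but different column names and no id column.
medical = load_breast_cancer(as_frame=True).frame
# scikit-learn encodes malignant as 0 and benign as 1; relabel to M/B.
medical["diagnosis"] = medical.pop("target").map({0: "M", 1: "B"})

n_patients, n_columns = medical.shape        # 1. one row per patient
dtypes = medical.dtypes                      # 2. datatype of each column
n_missing = int(medical.isna().sum().sum())  # 2. total number of missing values
n_features = n_columns - 1                   # 3. every column except the target

# 4. Use the correlation matrix to shortlist features for the pair plot.
corr = medical.drop(columns="diagnosis").corr()
top_pair = corr.loc["mean radius", "mean perimeter"]

print(n_patients, n_features, n_missing)
print(dtypes.value_counts())
print(round(top_pair, 2))
```

If seaborn is installed, the shortlisted columns could then be passed to `seaborn.pairplot(medical, vars=[...], hue="diagnosis")`.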

**Answers** Data Exploration

In [None]:
# %load answers/data-exploration-medical.py

## Prepare `X` and `y`

Split the data into `X` and `y`, where `X` is the feature matrix and `y` is the target (`diagnosis`).

Exclude `id` from the feature matrix: it is a unique identifier, so it carries no predictive information.

Check the shape of `X` and `y`. 
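
One way the split could look is sketched below. It again assumes Scikit-Learn's built-in copy of the data as a stand-in for `data/medical.csv`; the `id` column is added artificially here so that dropping it can be shown, and the variable name `features` is an assumption chosen to match the snippets later in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Stand-in for data/medical.csv (assumption: same rows, different column names).
medical = load_breast_cancer(as_frame=True).frame
medical["diagnosis"] = medical.pop("target").map({0: "M", 1: "B"})
medical.insert(0, "id", range(len(medical)))  # hypothetical id column

target = "diagnosis"
# Every column except the identifier and the target is a predictive feature.
features = [col for col in medical.columns if col not in ("id", target)]

X = medical[features]  # feature matrix
y = medical[target]    # target vector

print(X.shape, y.shape)
```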

**Answers** Prepare X and y

In [None]:
# %load answers/prepare-x-y-medical.py

## Train Test Split

Use `train_test_split` from `sklearn.model_selection` to split the data into `X_train`, `X_test`, `y_train` and `y_test`.

Use a `random_state` to ensure the split is the same each time it is run.

Check the shape of `X_train`, `X_test`, `y_train` and `y_test`
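
A minimal sketch of the split, assuming Scikit-Learn's built-in copy of the data in place of `data/medical.csv`. The `random_state` value is arbitrary, and `stratify=y` is an optional extra (not required by the exercise) that keeps the class balance similar in both sets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in for the X and y prepared from data/medical.csv.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# A fixed random_state makes the split reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```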

**Answers** Train test split

In [None]:
# %load answers/train-test-medical.py

## Preprocessing

None of the features are categorical and there are no missing values to deal with. However, since we are building a `LogisticRegression` model, we will want to scale the data so that the features share a common scale and the coefficients can be compared with one another.

Choose from the below and import it in from `sklearn.preprocessing`

- `StandardScaler`
- `RobustScaler`
- `MinMaxScaler`

Instantiate your scaler (e.g. `scaler = RobustScaler()`) and try it out by running:

```python
pd.DataFrame(scaler.fit_transform(X_train), columns=features)
```
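
To see the effect of a scaler end to end, a sketch along these lines may help. It assumes `StandardScaler` (any of the three would work) and Scikit-Learn's built-in copy of the data, with `features` taken to be the column names of `X`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
features = X.columns
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
# Learn the scaling statistics from the training data only...
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=features)
# ...and reuse them on the test data so no test information leaks in.
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=features)

print(X_train_scaled.describe().loc[["mean", "std"]].round(2))
```

Note that the scaler is fitted on `X_train` only; the test set is transformed with the statistics learned from the training set.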

**Answers** Preprocessing

In [None]:
# %load answers/preprocessing-medical.py

## Building the Model

Now that we have a scaler chosen, we're ready to build a pipeline.

- Import `Pipeline` from `sklearn.pipeline` and `LogisticRegression` from `sklearn.linear_model`.
- Instantiate the model with no parameters
- Instantiate the pipeline with the scaler and model as the 2 steps.

Fit the pipeline to `X_train` and `y_train`
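
The steps above could be sketched as follows. The step name `model` matches the one used later in this notebook; the step name `scaler`, the choice of `StandardScaler` and the variable name `pipeline_lr` are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the X and y prepared from data/medical.csv.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two named steps: the scaler runs first, then the model.
pipeline_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline_lr.fit(X_train, y_train)

print(pipeline_lr.named_steps)
```

Because the scaler lives inside the pipeline, it is fitted on the training data only whenever `fit` is called.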

**Answers** Building the Model

In [None]:
# %load answers/build-model-medical.py

## Performance metrics

Find the accuracy score on both the train and the test. Is your model generalising well? Is your model overfitting?

The high accuracy score on both train and test demonstrates a model that generalises well. Since the drop from train to test is minor, this also suggests that the model is not overfitting.
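
A sketch of how the two scores might be computed, assuming the same stand-in data and pipeline as before. Since the goal is to capture every patient who needs further assessment, the recall on the malignant class is also shown; in Scikit-Learn's copy of the data malignant is encoded as `0`, hence `pos_label=0`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X_train, y_train)

# Compare accuracy on seen (train) and unseen (test) data.
train_accuracy = accuracy_score(y_train, pipeline_lr.predict(X_train))
test_accuracy = accuracy_score(y_test, pipeline_lr.predict(X_test))
# Share of truly malignant samples (label 0 here) that the model catches.
malignant_recall = recall_score(y_test, pipeline_lr.predict(X_test), pos_label=0)

print(round(train_accuracy, 3), round(test_accuracy, 3), round(malignant_recall, 3))
```

`pipeline_lr.score(X, y)` returns the same accuracy as `accuracy_score` for a classifier, so either form can be used.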

**Answers** Performance metrics

In [None]:
# %load answers/performance-metrics-medical.py

## Feature Importances

Look at the coefficients of each feature from the model. Which are the most important features to determine the diagnosis?

- Create a dataframe using
    - `pd.DataFrame(pipeline_lr['model'].coef_[0], index=features, columns=['Importances'])`
    - Use assign to update the column `Importances` to be absolute values
    - Sort the data by `Importances`
    - Save the dataframe to a variable called `coefs_plotter`
- Plot the data using `coefs_plotter.plot(y='Importances', kind='barh')`
    


**Bonus:** Add a column `color_negative` that is `darkorange` when the original `Importances` column was negative and `royalblue` when not. Add the parameter `color=coefs_plotter['color_negative']` to color the bars accordingly.
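
One possible way to build `coefs_plotter`, including the bonus column, is sketched below. It assumes the stand-in data and pipeline from the earlier sketches; the colour column is created before the absolute values are taken so that the original sign is still available.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
features = X.columns
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipeline_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X_train, y_train)

coefs_plotter = (
    pd.DataFrame(pipeline_lr["model"].coef_[0], index=features, columns=["Importances"])
    .assign(
        # Evaluated first, so it still sees the signed coefficients.
        color_negative=lambda d: np.where(d["Importances"] < 0, "darkorange", "royalblue"),
        # Then overwrite the coefficients with their absolute values.
        Importances=lambda d: d["Importances"].abs(),
    )
    .sort_values("Importances")
)
print(coefs_plotter.tail())

# With matplotlib installed, the chart could be drawn with:
# coefs_plotter.plot(y="Importances", kind="barh", color=coefs_plotter["color_negative"])
```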

**Answers**: Feature Importances

In [None]:
# %load answers/feature-importances-medical.py

## Hyperparameter Tuning

Import `GridSearchCV` from `sklearn.model_selection` and use it to search across the different scalers that you can use.

- Check out the actual names of the model parameters using `pipeline.get_params()`
- Create a parameter dictionary for your grid
- Make sure to instantiate the grid (e.g. `grid = GridSearchCV(pipeline, parameters)`)

Fit your grid to `(X_train, y_train)`.

According to the `GridSearchCV`, which scaler was best (see `grid.best_params_`) and what was its mean cross-validated accuracy (`grid.best_score_`)?

Use `grid.score(X, y)` to find the accuracy score on your train and test data.

Make a dataframe from the `cv_results_` from your `grid` and `sort_values()` by `rank_test_score` to see the best models and their parameters.
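
The search over scalers could be sketched as follows, assuming the stand-in data from earlier and a pipeline whose first step is named `scaler` (an assumption; check `pipeline.get_params()` for the names in your own pipeline). Using the step name as the key swaps out the whole step.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

# The key is the step name, so each candidate replaces the whole scaler step.
parameters = {"scaler": [StandardScaler(), RobustScaler(), MinMaxScaler()]}
grid = GridSearchCV(pipeline, parameters)
grid.fit(X_train, y_train)

print(grid.best_params_, round(grid.best_score_, 3))
print(round(grid.score(X_train, y_train), 3), round(grid.score(X_test, y_test), 3))

# One row per candidate, best first.
results = pd.DataFrame(grid.cv_results_).sort_values("rank_test_score")
print(results[["param_scaler", "mean_test_score", "rank_test_score"]])
```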

**Answers** Hyperparameter Tuning

In [None]:
# %load answers/hyperparameter-tuning-medical.py

## <mark>Bonus: Custom Scikit-Learn Transformer</mark>

This dataset has quite a lot of features/columns. Accordingly, we may want to perform some feature selection, i.e. drop some of the features.

This may be for visualisation purposes or to aid model performance. For example, we saw in the feature importances chart that some features were more important to the model than others. Additionally, we saw previously that some features were (unsurprisingly) correlated with one another, e.g. `perimeter_mean` and `area_mean`, which can impede model performance.

We could choose to manually drop some. However, it could be more efficient to automate this.

### Part A

Create a custom transformer that is able to drop all columns containing a particular string, e.g. all feature names containing the string `perimeter_`.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class DropColumnsContaining(BaseEstimator, TransformerMixin):
    def __init__(self):
        #TODO
        pass
    
    def fit(self, X, y=None):
        #TODO
        return self
    
    def transform(self, X):
        return #TODO
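
If you get stuck, one possible way to fill in the skeleton is sketched below. It is not necessarily the answer in the answers folder: the parameter name `substring`, its default value and the fitted attribute `columns_to_drop_` are assumptions made for illustration, and the transformer expects a pandas DataFrame as input.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropColumnsContaining(BaseEstimator, TransformerMixin):
    """Drop every column whose name contains `substring`."""

    def __init__(self, substring="perimeter_"):
        # Store the argument unchanged so get_params/set_params keep working.
        self.substring = substring

    def fit(self, X, y=None):
        # Remember which columns to drop, based on the frame seen during fit.
        self.columns_to_drop_ = [col for col in X.columns if self.substring in col]
        return self

    def transform(self, X):
        # Return a copy of X without the remembered columns.
        return X.drop(columns=self.columns_to_drop_)


# Tiny made-up frame to try the transformer on.
demo = pd.DataFrame({
    "perimeter_mean": [1.0, 2.0],
    "perimeter_se": [0.1, 0.2],
    "area_mean": [3.0, 4.0],
})
reduced = DropColumnsContaining("perimeter_").fit_transform(demo)
print(list(reduced.columns))
```

Keeping the constructor free of any logic is what allows `GridSearchCV` to clone the transformer and try different values of `substring`.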

In [None]:
# %load answers/custom-transformer-medical-A.py

### Part B

Add your custom transformer to your `Pipeline` and experiment to see whether dropping certain columns can aid performance.
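
The experiment might look something like the sketch below. The transformer is repeated so that the block runs on its own, and it is only one possible Part A solution. The step name `dropper`, the candidate substrings and the use of `cross_val_score` are assumptions; the stand-in data uses column names such as `mean perimeter`, so the substrings have no trailing underscore.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DropColumnsContaining(BaseEstimator, TransformerMixin):
    """One possible Part A solution, repeated to keep this sketch self-contained."""

    def __init__(self, substring="perimeter"):
        self.substring = substring

    def fit(self, X, y=None):
        self.columns_to_drop_ = [col for col in X.columns if self.substring in col]
        return self

    def transform(self, X):
        return X.drop(columns=self.columns_to_drop_)


X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


def build_pipeline(substring):
    # The dropper must come first: it needs the column names of a DataFrame.
    return Pipeline([
        ("dropper", DropColumnsContaining(substring)),
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),
    ])


# Mean cross-validated accuracy on the training data for each candidate.
mean_scores = {}
for substring in ["perimeter", "area", "worst"]:
    scores = cross_val_score(build_pipeline(substring), X_train, y_train, cv=5)
    mean_scores[substring] = scores.mean()
    print(substring, round(mean_scores[substring], 3))
```

Since `substring` is a regular parameter, it could also be tuned with `GridSearchCV` using the key `dropper__substring`.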

In [None]:
# %load answers/custom-transformer-medical-B.py

<img src='images/gdd-logo.png' align=right width=300px>

# Conclusion

You have now gone through a model build from start to finish following these steps:

- Data Exploration
- Split into X and y
- Using train_test_split
- Preprocessing using scalers
- Building the Model
- Performance Metrics
- Hyperparameter Tuning

This puts you in a great position to tackle your first Machine Learning problems!

## Next Steps

There are more advanced techniques that you might also want to implement. Your next topic of learning may be one of the following:

- Further metrics and how to interpret them
- Engineering and selecting features
- Building your own Scikit-Learn estimator
- Model interpretation

All of these are visited in our Advanced Data Science with Python course!