# I. INTRODUCTION

Welcome, everyone! This is my first kernel. My main objective is to put into practice what I am learning in data science and machine learning. The study is about finding the best machine learning algorithm to detect whether a breast tumor is malignant or benign. It is therefore a binary classification problem. A second objective of this kernel is to learn more about breast cancer, and how data science can help.

Disclaimer 1: this kernel is going to evolve over time. At first, it will propose a basic machine learning method to deal with the problem, and I will leave a lot of open questions (things that I truly don't know). Eventually, I should be able to propose a complete study with an appropriate comparison of the different ML algorithms, taking into account the relevant hyperparameters of the problem.

Disclaimer 2: like Jon Snow, I know nothing... about breast cancer, or biology and medicine in general. If I say stupid things in these areas, please do correct me.

Disclaimer 3: most of this work is based on the kernel of another Kaggle user, DATAI. Thank you.

Here is something I read about breast cancer. It dates back to 2005, but a quick read of other sources suggests things haven't changed much in the meantime:

"Breast cancer is a major cause of concern in the United States today. At a rate of nearly one in three cancers diagnosed, breast cancer is the most frequently diagnosed cancer in women in the United States. The American Cancer Society projected that 211,300 invasive and 55,700 in situ cases would be diagnosed in 2003 [1]. Furthermore, breast cancer is the second leading cause of death for women in the United States, and is the leading cause of cancer deaths among women ages 40—59 [1,2]. According to The American Cancer Society 39,800 breast cancer related deaths are expected in 2003 [2]. Though predominantly in women, breast cancer can also occur in men. In the United States, of the 40,600 deaths from breast cancer in 2001, 400 were men [3]. Even though in the last couple of decades, with increased emphasis towards cancer related research, new and innovative methods for early detection and treatment have been developed, which helped decrease the cancer related death rates [4—6], cancer in general and breast cancer in specific is still a major cause of concern in the United States."
Source : Delen, D., Walker, G., & Kadam, A. (2005). Predicting breast cancer survivability: a comparison of three data mining methods. Artificial intelligence in medicine, 34(2), 113-127.


From this short extract, one can note that artificial intelligence has been used in medicine for a long time. Early detection seems to be of paramount importance in decreasing the cancer death rate. Artificial intelligence could go further and help cure cancer, by tailoring specific treatments based on the genome analysis of the patient (Precision Medicine, [7]).


In this data analysis report, the data comes from Breast Cancer Wisconsin (Diagnostic) Data Set. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

This study is decomposed in six steps :
#### Data analysis
#### Data visualization
#### Basic ML analysis
In this section I will insist on the proper way to estimate the generalization error
#### Hyperparameter tuning
Coming soon
#### "I wanna be the very best", Ash
Coming soon
#### Conclusion


# **Enjoy your data analysis!!!**


[1] http://www.cancer.org/

[2] http://www.komen.org/bci/bhealth/QA/q_and_a.asp

[3] José M. Jerez-Aragonés, José A. Gómez-Ruiz, Gonzalo Ramos-Jiménez, José Muñoz-Pérez, Emilio Alba-Conejo,
A combined neural network and decision trees model for prognosis of breast cancer relapse, Artificial Intelligence in Medicine, Volume 27, Issue 1, 2003, Pages 45-63, ISSN 0933-3657

[4] Edwards, B. K., Howe, H. L., Ries, L. A., Thun, M. J., Rosenberg, H. M., Yancik, R., ... & Feigal, E. G. (2002). Annual report to the nation on the status of cancer, 1973–1999, featuring implications of age and aging on US cancer burden. Cancer, 94(10), 2766-2792.

[5] unfound

[6] Warren, J. (2003). Cancer death rates falling, but slowly. WebMD medical news, 2(003).

[7] Bertalan Mesko (2017) The role of artificial intelligence in precision medicine, Expert Review of Precision Medicine and Drug Development, 2:5, 239-241, DOI: 10.1080/23808993.2017.1380516

# II. Data Analysis

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # data visualization library  
import matplotlib.pyplot as plt
import time
from subprocess import check_output
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
data = pd.read_csv('../input/data.csv')

Before doing anything like feature selection, feature extraction or classification, we start with a basic data analysis.
Let's look at the features of the data.

In [None]:
data.head()  # head method shows only first 5 rows

**Some remarks concerning the data :**

1) The **id** column is useless for the classification

2) **Diagnosis** is the class label, M for Malignant, B for Benign

3) **Unnamed: 32** feature includes NaN and is useless for the classification

4) The ranges of values seem to differ a lot between features (we will use the describe function later on)


From these preliminary remarks, one can rearrange the data by dropping the columns **id**, **Unnamed: 32** and **diagnosis** from the feature table. The information about **diagnosis** must be kept, so it is stored separately as the label vector.

In [None]:
# feature names as an Index (pandas object holding the list of column names and their dtype)
col = data.columns       # .columns gives columns names in data 
print(col)

In [None]:
# y includes our labels and x includes our features
y = data['diagnosis']                   # M or B 
drop_cols = ['Unnamed: 32','id','diagnosis']   # named drop_cols to avoid shadowing the built-in list
x = data.drop(drop_cols,axis = 1 )
x.head()

In [None]:
sns.set(style="darkgrid")
ax = sns.countplot(x = y)       # M = 212, B = 357
counts = y.value_counts()       # sorted by frequency, not alphabetically, so index by label
B, M = counts['B'], counts['M']
print('Number of Benign: ',B)
print('Number of Malignant : ',M)

This is good news. It's always simpler to work on a classification problem when the different classes are roughly equally represented. Here, the count ratio between malignant and benign tumors is close to 0.6 (212 / 357), which is acceptable.

Now that we have features, **what do they mean** ? Ten real-valued features are computed for each cell nucleus of the image:

1) radius (mean of distances from center to points on the perimeter)

2) texture (standard deviation of gray-scale values)

3) perimeter

4) area

5) smoothness (local variation in radius lengths)

6) compactness (perimeter^2 / area - 1.0)

7) concavity (severity of concave portions of the contour)

8) concave points (number of concave portions of the contour)

9) symmetry

10) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
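As a small illustration of how the 30 feature names are built, here is a sketch that combines the ten base measurements with the three statistics. The spellings follow the column names used later in this kernel (for example `concave points_mean`); the variable names `base_features` and `stat_suffixes` are my own.

```python
# The ten base measurements, spelled as in the CSV header
# (note the space in "concave points").
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Suffixes for the mean, the standard error and the "worst" value.
stat_suffixes = ["mean", "se", "worst"]

# Columns are grouped by statistic: all means, then all SEs, then all worsts.
feature_names = [f"{feature}_{stat}" for stat in stat_suffixes for feature in base_features]

print(len(feature_names))                                        # 30
print(feature_names[0], feature_names[10], feature_names[20])    # radius_mean radius_se radius_worst
```

With the id and diagnosis columns in front, these three names land on fields 3, 13 and 23, which matches the description above.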

All feature values are recoded with four significant digits.

According to the provided information, all values are present, but let's not trust everything we read on the internet:

In [None]:
#check if there is a NaN value in our data frame x, a False indicates there are no missing values
x.isnull().values.any()

Let's now use the describe function in order to look at our features :

In [None]:
x.describe()

This type of information on our features (mean, std, ...) helps us understand our data. For example, we clearly see that **area_mean** and **smoothness_mean** are not on the same scale. Many machine learning algorithms (in particular gradient-based and distance-based ones) converge poorly when facing unscaled features. Moreover, it is much more difficult to visualize our data when the features are not on the same scale. For these two reasons, we need a **normalization** of our features, prior to anything else (visualization, feature selection, classification algorithm).

**Standardization** will be used to ensure a good **normalization** of our features. Other techniques exist (min-max scaling, for instance), but **standardization** is one of the most common choices. The idea is to take a feature, compute its mean and standard deviation, then subtract the mean from the feature, and divide the result by the standard deviation. An equation is clearer than an explanation:

In [None]:
data = x
data_normal = (data - data.mean()) / (data.std())              # process of normalization by standardization
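To check what standardization does, here is a minimal sketch on toy data. It applies the same formula, $z = (x - \mu) / \sigma$, and verifies that every standardized column ends up with mean 0 and standard deviation 1. The numbers in `toy` are made up for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for two features on very different scales
# (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    "area_mean": [350.0, 500.0, 650.0, 1000.0, 1200.0],
    "smoothness_mean": [0.08, 0.09, 0.10, 0.11, 0.12],
})

# Same formula as in the cell above: z = (x - mean) / std.
# pandas uses the sample standard deviation (ddof=1) by default.
toy_normal = (toy - toy.mean()) / toy.std()

col_means = toy_normal.mean().to_numpy()
col_stds = toy_normal.std().to_numpy()
print(col_means)  # approximately 0 for every column
print(col_stds)   # approximately 1 for every column
```

Note that scikit-learn's `StandardScaler` divides by the population standard deviation (ddof=0), so its output differs very slightly from the pandas formula used here.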

# III. Visualization
The visualization will help us understand our data. By eye, we should be able to build an intuition about which features could be good candidates for the classification.

In order to do so, we'll use Seaborn. Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Let's dive in.

We start with violin plots. Since we have 30 features to study, I created 3 groups of 10 features each. The quartiles of each distribution are drawn inside the violins, i.e. the 25th percentile, the median and the 75th percentile.

In [None]:
def violin_plot(beginning,end):
    data = pd.concat([y,data_normal.iloc[:,beginning:end]],axis=1)
    data = pd.melt(data, id_vars="diagnosis", var_name="features", value_name='value')
    plt.figure(figsize=(10,10))
    sns.violinplot(x="features", y="value", hue="diagnosis", data=data, split=True, inner="quart")
    plt.xticks(rotation=45)

In [None]:
# first ten features
violin_plot(0,10)

Let's interpret the plot above. **radius_mean**, **perimeter_mean**, **area_mean**, **compactness_mean**, **concavity_mean** and **concave points_mean** are well separated between malignant and benign tumors, as the 75th percentile of benign tumors is below the 25th percentile of malignant tumors. These 6 features would be good candidates for the classifier.

On the contrary, **fractal_dimension_mean** has roughly the same median for both tumor types, so it wouldn't be a good candidate for the classifier.

The same sort of separation can be observed for the following features :

In [None]:
# Second ten features
violin_plot(10,20)

In [None]:
# Third ten features (columns 20 to 29; the end index is exclusive)
violin_plot(20,30)

From the graph above, we notice some similarities between **radius_worst** and **perimeter_worst** on one hand, and **concavity_worst** and **concave points_worst** on the other hand. Other similarities exist of course. If two violins look similar, it might indicate a correlation between the features, and if two features are correlated, one can ask if it's possible (or not) to drop one.

In order to compare two features, let's first use a joint plot.

In [None]:
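def joint_plot(feature1, feature2):
    """ I have a FutureWarning on this function, does anyone know how to get rid of it? """
    # kind must be "reg" (regression fit); recent seaborn versions expect x and y as keyword arguments
    sns.jointplot(x=x.loc[:,feature1], y=x.loc[:,feature2], kind="reg", color="#ce1414")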

In [None]:
joint_plot("radius_worst","perimeter_worst")

In [None]:
joint_plot("concavity_worst","concave points_worst")

As expected, **radius_worst** and **perimeter_worst** are strongly correlated: there is a linear relationship between the two features, linked to the 2π ratio between perimeter and radius (exact for a circle). **concave points_worst** and **concavity_worst** also seem correlated.
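To make the link between the circle formula and the correlation concrete, here is a small sketch on synthetic data. It assumes hypothetical radii drawn at random and an arbitrary noise level; nothing here comes from the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical radii; for a perfect circle the perimeter is exactly 2*pi*r.
radius = rng.uniform(5.0, 25.0, size=200)
perimeter_circle = 2 * np.pi * radius

# Real nuclei are not perfect circles, so add some noise (assumed level).
perimeter_noisy = perimeter_circle + rng.normal(0.0, 5.0, size=200)

# Pearson correlation coefficient between radius and each perimeter.
r_exact = np.corrcoef(radius, perimeter_circle)[0, 1]
r_noisy = np.corrcoef(radius, perimeter_noisy)[0, 1]
print(r_exact)  # 1.0 up to floating-point error
print(r_noisy)  # slightly below 1
```

A perfectly linear relationship gives a correlation of 1, and moderate deviations from the circular shape only lower it slightly, which is consistent with the joint plot.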

**What if we want to observe all correlations between features?** To do so, Seaborn provides a useful (and beautiful) heatmap function.

In [None]:
#correlation map
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(x.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)

# IV. Basic Machine Learning Analysis

OK. We are going to do some machine learning. The previous analysis gave us good insights into what should be relevant for our classifier, and it also gave us confidence that a classifier can achieve good results. In this section I will give the general methodology of machine learning for a given classifier.

# But let's start with the beginning, what do we want ?

"In a general classification problem, the goal is to learn a classifier that performs well on unseen data drawn from the same distribution as the available data; in other words, to learn classifiers with good generalization. One common way to estimate generalization capabilities is to measure the performance of the learned classifier on test data that has not been used to train the classifier. When a large test data set cannot be held out or easily acquired, resampling methods, such as cross validation, are commonly used to estimate the generalization error. The resulting **estimates of generalization** can also be used for model selection by choosing from various possible classification algorithms (models) the one that has the lowest cross validation error (and hence the lowest expected generalization error)." 

Text from : Rao, R. B., Fung, G., & Rosales, R. (2008, April). On the dangers of cross-validation. An experimental evaluation. In Proceedings of the 2008 SIAM International Conference on Data Mining (pp. 588-596). Society for Industrial and Applied Mathematics. https://people.csail.mit.edu/romer/papers/CrossVal_SDM08.pdf

This explanation can be found in many books, as it is the very starting point of machine learning, yet I chose this reference because of its title: "On the dangers of cross-validation". The article states that cross-validation can no longer be trusted to demonstrate the generalization of a model when a large number of models are compared. In the present data analysis, we are not going to compare that many models, so this specific danger doesn't affect us. However, keep in mind that cross-validation is not a perfect way to ensure a good generalization of your model. It can only give you an **estimate of the generalization**.

## The first thing we want is a single performance criterion
I am still new to Kaggle, but in the kernels I browsed, I rarely saw any discussion on the choice of the performance criterion. All I see is "accuracy" everywhere. It seems to me that, when detecting cancer, other performance criteria could be introduced. Since we want to detect cancer every time there is one, we might focus on having a good recall (the fraction of malignant tumors that the classifier actually flags as malignant) rather than a good accuracy or precision (https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9). **I'm going to use the recall as my performance criterion for the rest of this analysis. It might actually not be the best criterion, and if someone tells me why I'll update it.**
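To see why accuracy alone can be misleading, here is a minimal sketch that computes accuracy, precision and recall from confusion-matrix counts. The counts are hypothetical (100 tumors, 10 of them malignant) and the helper `classification_metrics` is my own, not a library function.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # among tumors flagged malignant, how many really are
    recall = tp / (tp + fn)      # among malignant tumors, how many were flagged
    return accuracy, precision, recall

# Hypothetical screening of 100 tumors, 10 of them malignant.
# A lazy classifier flags a single tumor (correctly) and calls everything else benign.
accuracy, precision, recall = classification_metrics(tp=1, fp=0, fn=9, tn=90)
print(accuracy, precision, recall)  # 0.91 1.0 0.1
```

This classifier looks good on accuracy and precision, yet it misses 9 malignant tumors out of 10. Only the recall exposes that.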

## The second thing we want is a good generalization of our algorithm
To ensure that, we can use k-fold cross validation, for instance. But I have some questions regarding k-fold cross validation. First of all, how should we choose k to get the best estimate of the generalization? Secondly, training the same algorithm with different random_states leads to different results, which impacts the generalization error and everything that comes after (feature selection!). This suggests that relying on a single k-fold cross validation is unreliable (in our application), as its result is stochastic. I'll propose a methodology to address this issue, but I'm not sure about it.
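One way to look at both questions is to repeat the k-fold split several times for a few values of k, and to report the spread of the recall alongside its mean. The sketch below does this with scikit-learn's `RepeatedStratifiedKFold`. It is not the methodology proposed later in this kernel, and the values of k, the number of repeats and `n_estimators=10` are arbitrary choices of mine. It uses the copy of the dataset that ships with scikit-learn, where target 0 means malignant.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# scikit-learn ships a copy of the Wisconsin (Diagnostic) data.
# There, target 0 means malignant, so re-encode malignant as the positive class 1.
bunch = load_breast_cancer()
X_demo = bunch.data
y_demo = (bunch.target == 0).astype(int)

summary = {}
for k in (3, 5, 10):
    # Repeat each k-fold split 3 times with different shuffles (assumed values).
    cv = RepeatedStratifiedKFold(n_splits=k, n_repeats=3, random_state=0)
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="recall")
    summary[k] = (scores.mean(), scores.std())
    print(f"k={k:2d}  recall = {scores.mean():.3f} +/- {scores.std():.3f}")
```

The standard deviation gives a rough idea of how much a single fold can be off, which is exactly the information a single cross validation score hides.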

## The fewer features the better
For a given performance criterion, would you rather have a machine learning algorithm with a performance of 97.01% and 5 features, or one with a performance of 97.02% and 500 features? I'm pushing it to the extreme, but I never saw the number of features taken into account in the final decision of which algorithm is better. To some degree, being able to humanly understand your machine learning algorithm should be taken into account, shouldn't it? Reducing the number of features, in our application, could simplify the work of doctors when detecting malignant tumors... **Right now, I don't know. Therefore, I won't consider the number of features as part of the performance criterion. However, features will still be selected in order to maximize the classical performance criterion (recall).**

In this section, we will select features with different methods: feature selection with correlation, recursive feature elimination (RFE) and recursive feature elimination with cross validation (RFECV). We will use the Random Forest classifier to train our model. Our observations will lead to the conclusion that a single cross validation is not a good estimate of the generalization error, and a methodology will be proposed to improve this estimate. **Did you know about this methodology? Is it wrong? I'm afraid of either being wrong, or having reinvented the wheel...**

## 1) Feature Selection with correlation and Random Forest classification

We start with the simplest possible method. At this stage, we still don't care about cross-validation.

As can be seen in the heatmap, **radius_mean, perimeter_mean and area_mean** are correlated with each other, so we will only use **area_mean**. My first reason is not backed up by data: if I had to give a wild guess, I would say that experimental measures of area might have lower uncertainties than measures of radius or perimeter. I might be wrong on this one. Another reason to choose **area_mean**, backed up by data this time, is that the feature seems to express more differences between malignant and benign tumors on my violin plots. Swarm plots could help with this decision, and I invite you to try them.

**Compactness_mean, concavity_mean and concave points_mean** are correlated -> I choose **concavity_mean**.

**radius_se, perimeter_se and area_se** are correlated -> I choose  **area_se**.

**radius_worst, perimeter_worst and area_worst** are correlated -> I choose  **area_worst**.

**Compactness_worst, concavity_worst and concave points_worst** are correlated -> I choose  **concavity_worst**.

**Compactness_se, concavity_se and concave points_se** are correlated -> I choose  **concavity_se**.

**texture_mean and texture_worst** are correlated -> I choose  **texture_mean**.

**area_worst and area_mean** are correlated -> I choose  **area_mean**.
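The groups above were read off the heatmap by eye. As a complement, here is a sketch of how correlated pairs could be listed programmatically. The helper `correlated_pairs` and the 0.9 threshold are my own assumptions, and the toy frame is synthetic. On the real data one would call it on `x` instead.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, no self-pairs
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# Toy frame (hypothetical values): perimeter is built from radius, texture is independent.
rng = np.random.default_rng(0)
radius = rng.uniform(5.0, 25.0, size=200)
toy = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius + rng.normal(0.0, 2.0, size=200),
    "texture_mean": rng.uniform(10.0, 30.0, size=200),
})
pairs = correlated_pairs(toy, threshold=0.9)
print(pairs)   # only the (radius_mean, perimeter_mean) pair is reported
```

Deciding which feature of each pair to keep still requires a judgment call, as in the list above.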




In [None]:
drop_list = ['perimeter_mean','radius_mean','compactness_mean','concave points_mean','radius_se','perimeter_se','radius_worst','perimeter_worst','compactness_worst','concave points_worst','compactness_se','concave points_se','texture_worst','area_worst']
x_1 = x.drop(drop_list,axis = 1 )        # do not modify x, we will use it later 
x_1.head()

After dropping correlated features, we end up with a correlation matrix that contains far fewer strongly correlated pairs:

In [None]:
#correlation map
f,ax = plt.subplots(figsize=(14, 14))
sns.heatmap(x_1.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)

We reduced the number of features from 30 to 16. To check if our feature selection is correct, let's use the Random Forest algorithm and find the recall for the chosen features.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.metrics import accuracy_score, recall_score

# recall_score needs to know which class is the positive one (pos_label=1 by default),
# whereas accuracy_score works directly with the string labels.
# We therefore write y in the binary format, with B=0 and M=1
y = y.replace("B", 0)
y = y.replace("M", 1)

# split data train 70 % and test 30 %
x_train, x_test, y_train, y_test = train_test_split(x_1, y, test_size=0.3, random_state=42)

# random forest classifier with default settings
# (n_estimators defaulted to 10 in scikit-learn < 0.22, and to 100 since)
clf_rf = RandomForestClassifier(random_state=43)
clf_rf.fit(x_train,y_train)

y_pred = clf_rf.predict(x_test)      # predict once and reuse for every metric
recall = recall_score(y_test,y_pred)
print('Recall is: ', recall)
accuracy = accuracy_score(y_test,y_pred)
print('Accuracy is: ', accuracy)
f1 = f1_score(y_test,y_pred)
print('F1 score is: ', f1)
cm = confusion_matrix(y_test,y_pred)
sns.heatmap(cm,annot=True,fmt="d")

Recall is almost 92.1%, and the accuracy and F1 scores are good. It is already quite nice to get such results with a naive feature selection and without any optimization. Right now, the recall is computed on the test set, and we will do the same with the next method. Let's not bother with cross-validation yet.

## 2) Recursive feature elimination (RFE) with Random Forest

RFE uses a classification method (random forest in our example) to assign a weight to each feature. The features whose absolute weights are the smallest are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features is reached.

Like the previous method, we will use 16 features, however this time the 16 features will be computed through the RFE.
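The elimination loop described above can be sketched by hand. This is a simplified illustration, not scikit-learn's actual implementation: it uses the random forest's `feature_importances_` as the weights, removes one feature per iteration, and runs on synthetic data. The function name `manual_rfe` and `n_estimators=10` are my own choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def manual_rfe(X, y, n_features_to_select, random_state=0):
    """Drop the least important feature one at a time (step=1) and refit."""
    remaining = list(range(X.shape[1]))          # indices of surviving features
    while len(remaining) > n_features_to_select:
        clf = RandomForestClassifier(n_estimators=10, random_state=random_state)
        clf.fit(X[:, remaining], y)
        weakest = int(np.argmin(clf.feature_importances_))
        del remaining[weakest]                   # prune, then loop again
    return remaining

# Synthetic data (assumption): 8 features, only 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
kept = manual_rfe(X_demo, y_demo, n_features_to_select=3)
print(kept)   # column indices of the 3 surviving features
```

Because the forest is refit after every removal, the importances are re-evaluated on the pruned set each time, which is what makes the procedure recursive.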

In [None]:
from sklearn.feature_selection import RFE

# split data train 70 % and test 30 %, this time with x and not x_1 in order to have all the features
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Create the RFE object and rank each feature
clf_rf_2 = RandomForestClassifier(random_state=43)
rfe = RFE(estimator=clf_rf_2, n_features_to_select=16, step=1)
rfe = rfe.fit(x_train, y_train)

In [None]:
print('16 best features chosen by RFE:',x_train.columns[rfe.support_])

The 16 best features chosen by RFE differ from those obtained with the previous naive method. Therefore we need to calculate the recall again.

In [None]:
recall = recall_score(y_test,rfe.predict(x_test))
print('Recall is: ', recall)
accuracy = accuracy_score(y_test,rfe.predict(x_test))
print('Accuracy is: ', accuracy)
f1 = f1_score(y_test,rfe.predict(x_test))
print('F1 score is: ', f1)
cm = confusion_matrix(y_test,rfe.predict(x_test))
sns.heatmap(cm,annot=True,fmt="d")

Recall is 93.7%, slightly better than the previous naive approach for the same classifier with the same random_state and the same test set. Let's now find the optimal number of features!

## 3) Recursive feature elimination with cross validation and Random Forest classification
Scikit-learn provides an algorithm that automatically finds the optimal number (and choice) of features required for the best score: the RFECV.

The idea is to apply the previous RFE with an additional hyperparameter that is the appropriate number of features. The algorithm will evaluate the generalisation error obtained by keeping N features, and choose N in order to minimize the generalisation error (or maximize the recall). The computation of the generalisation error is based on a k-fold cross validation, with k being another (problematic ?) hyperparameter. Let's use the technique with a 5-fold cross validation and with a Random_state of 43.

In [None]:
from sklearn.feature_selection import RFECV

clf_rf_3 = RandomForestClassifier(random_state=43) 
rfecv = RFECV(estimator=clf_rf_3, step=1, cv=5, scoring='recall')   #5-fold cross-validation
rfecv = rfecv.fit(x_train, y_train)

print('Optimal number of features :', rfecv.n_features_)
print('Best features :', x_train.columns[rfecv.support_])

We found the best 20 features for "best classification". **Or did we ?** First of all I must say that I don't like this solution. Having **radius_mean**, **perimeter_mean** and **area_mean** together doesn't seem good to me.

Let's look at the evolution of the recall with the increase in the number of features :

In [None]:
# Plot number of features VS. cross-validation scores
# (scikit-learn >= 1.0 exposes the mean scores in cv_results_; older versions used rfecv.grid_scores_)
cv_scores = rfecv.cv_results_["mean_test_score"]
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score of number of selected features")
plt.plot(range(1, len(cv_scores) + 1), cv_scores)
plt.show()

Looking at this plot, it appears that keeping 8, 14, 17, 20, 21 or 26 features gives roughly the same recall. Personally I would have chosen 8 features rather than 20. This is the tradeoff between having a good score and having a good understanding of your ML model. As I said, right now I choose to keep the best score without any consideration for the number of features.

## 4) Improving the estimate of the generalization error
In our search for the best algorithm, we should consider fine tuning the hyperparameters of the machine learning algorithm (here, the Random Forest). However, it is useless to fine tune hyperparameters if, in the end (it doesn't even matter), you can't estimate the generalization error properly. Therefore, before trying to optimize our algorithm, let's first check the quality of the estimate of the generalization error, and improve it if necessary.

I can't stress this enough: this section is not about improving any algorithm (i.e. it's not about improving the generalization error), **it's about improving the estimate of the generalization error**. If our estimate of the generalization error is unreliable, then our choice of the best algorithm is unreliable too.

Two parameters got my attention: k (from the k-fold cross validation) and random_state (the seed you give the ML algorithm to initialize its random number generator).

First of all, let's look at the evolution of the recall found with the RFECV with a fixed k-fold cross validation, but a varying Random_state :

In [None]:
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score of number of selected features")

for rs in range(10):
    clf_rf_4 = RandomForestClassifier(random_state = rs)
    rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5, scoring='recall')   #5-fold cross-validation
    rfecv = rfecv.fit(x_train, y_train)
    cv_scores = rfecv.cv_results_["mean_test_score"]   # rfecv.grid_scores_ in scikit-learn < 1.0
    plt.plot(range(1, len(cv_scores) + 1), cv_scores)

plt.show()

I stopped at ten random states, just enough to see the variance between the results for the same k-fold cross validation. Let's now look at how the optimal number of features changes:

In [None]:
opt_features = []
for rs in range(10):
    clf_rf_4 = RandomForestClassifier(random_state = rs) 
    rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5, scoring='recall')   #5-fold cross-validation
    rfecv = rfecv.fit(x_train, y_train)
    opt_features.append(rfecv.n_features_)
print("Number of optimal features :", opt_features)

The number of optimal features fluctuates a lot for the same classifier. It is therefore quite unreliable to base our understanding of the optimal feature selection on a single run of the cross validation.

# What can we do about it ?
I don't have any expertise in the fields of algorithm validation or feature selection, so the method I propose might either be erroneous or already in practice. Either way, please tell me, and help me learn more about the subjects. I'll update the kernel accordingly.

The first step would be, for a given k-fold cross validation, to average the curves obtained via the RFECV over a large range of random states. This would give the best number of features selected (statistically speaking).

Doing so, we end up with a problem that leads to our second step. The number N obtained in the first step is an average that does not refer to any particular set of features. A priori, different sets of N features could give a high recall. Therefore, once the optimal number N of features is fixed, one has to run an RFE algorithm with a fixed value of N features, for a large number of random states. Each run may give a different set of N features. Once the computations are done, pick the N most frequently occurring features.

## First step : averaging the solutions of the RFECV to obtain an averaged-RFECV
We saw that relying on one cross validation is unreliable for our feature selection. Our feature selection will instead be based on the averaged optimal number of features:

In [None]:
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score of number of selected features")

number_of_random_states = 100
average_optimal = np.zeros(x_train.shape[1])   # one averaged score per number of features (30 here)

for rs in range(number_of_random_states):
    clf_rf_4 = RandomForestClassifier(random_state = rs)
    rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5, scoring='recall')   #5-fold cross-validation
    rfecv = rfecv.fit(x_train, y_train)
    average_optimal += np.asarray(rfecv.cv_results_["mean_test_score"])   # rfecv.grid_scores_ in scikit-learn < 1.0
average_optimal /= number_of_random_states
plt.plot(range(1, len(average_optimal) + 1), average_optimal)
print("Number of features selected :", np.argmax(average_optimal)+1)
print("Evaluation of the optimal recall :", np.max(average_optimal))
plt.show()

The smoothness of the curve gives me more confidence in my result. Here, I find that 11 features are required to optimize my algorithm, for a given k-fold cross validation and a given set of hyperparameters of the algorithm. We also find that 91.2% is a more reliable estimate of the recall we can expect on unseen data.

## Second step : but what are these 11 features ?

The previous algorithm gave us the best number of features, statistically speaking, but we lost the information provided by a single deterministic RFECV run, namely the set of features itself. To retrieve this information, let's run an RFE algorithm with a fixed number of 11 features for several random states, and keep the features that appear most often in the solutions.

Every time a feature appears in a solution, I increment its weight by 1.

In [None]:
from collections import Counter

most_appearing_features = []

for rs in range(10):
    # Create the RFE object and rank each feature
    clf_rf_2 = RandomForestClassifier(random_state=rs)      
    rfe = RFE(estimator=clf_rf_2, n_features_to_select=11, step=1)
    rfe = rfe.fit(x_train, y_train)
    most_appearing_features.append(x_train.columns[rfe.support_].tolist())
most_appearing_features = [item for sublist in most_appearing_features for item in sublist] #flatten the list

print('Most appearing features :')
Counter(most_appearing_features).most_common(11) #find the 11 most common features

Now that we have our set of 11 features, let's compute our final recall. To do so, we train with these 11 features, for a large number of random_states.

In [None]:
drop_list = ["radius_se","texture_se","perimeter_mean","perimeter_se","area_mean","smoothness_mean","smoothness_se","smoothness_worst","compactness_mean","compactness_se","compactness_worst","concavity_se","concave points_se","symmetry_mean","symmetry_se","symmetry_worst","fractal_dimension_mean","fractal_dimension_se","fractal_dimension_worst"]
print("Number of eliminated features :", len(drop_list))
x_2 = x.drop(drop_list,axis = 1)        # do not modify x, we will use it later 
x_2.head()

In [None]:
# split data train 70 % and test 30 %
x_train, x_test, y_train, y_test = train_test_split(x_2, y, test_size=0.3, random_state=42)

number_of_random_states = 10
recall = 0

for rs in range(number_of_random_states):
    # random forest classifier with default settings
    # (n_estimators defaulted to 10 in scikit-learn < 0.22, and to 100 since)
    clf_rf = RandomForestClassifier(random_state=rs)
    clf_rf.fit(x_train,y_train)
    recall += recall_score(y_test,clf_rf.predict(x_test))

recall /= number_of_random_states
print('Recall is: ', recall)

The recall on the test set is 94.0%.

I could have evaluated the test error with a single random_state, but I chose to average the results to stay consistent with the procedure.

# V. Conclusion
Let's wrap up what we did.

First, we did a data analysis and a data visualization. Both were very basic, yet they gave us good insights into our data.

Secondly, we used a machine learning algorithm, namely the Random Forest, to classify the data between malignant and benign tumors. A single k-fold cross validation proved insufficient to properly estimate the generalization error, so I "developed" a methodology to improve this estimate. The steps of the method are as follows:

1) Run the RFECV many times to get the average generalization error and its evolution with the number of features.

2) Pick N, the number of features that maximizes the averaged recall

3) Run the RFE many times with a fixed number of features N

4) Choose the N features that get picked up most often in the RFE

5) Train a large number of Random Forests with this set of features and average the recall on the test set

Let's wrap up what's left to do :

1) We still have to work on the k-fold cross validation, as I still don't understand the influence of k on the generalization error

2) Once we have a methodology to properly estimate the generalization error, we can tune the hyperparameters of the Random Forest to optimize it

3) We can do the same with other algorithms, and choose the best

## I hope you enjoyed this kernel

## If you have any questions or advice, don't hesitate ...