[Help taken from](https://www.kaggle.com/kanncaa1/feature-selection-and-data-visualization)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')
df.head()

In [None]:
df.info()

In [None]:
cols_to_drop = ['id', 'Unnamed: 32']
df = df.drop(cols_to_drop, axis=1)
df.head()

In [None]:
df.describe().T

### First-hand observations:

1. The features are on very different scales, so the data needs to be standardized or normalized before we visualize or model it.
2. Some features (for example 'area_mean' and 'area_worst') have much larger values than the other columns.

In [None]:
# pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='diagnosis', data=df)

Many of the features are highly correlated with one another, as the sorted correlation coefficients below and the heatmap later in the notebook show. We will have to drop the redundant ones before modelling. We also need further visualizations to check for anomalies in the dataset.

In [None]:
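# 'diagnosis' is still a string column here, so exclude it before computing correlations
correlation_coeffs = df.drop('diagnosis', axis=1).corr()
correlation_stack = correlation_coeffs.unstack()
correlation_stack_sorted = correlation_stack.sort_values(kind="quicksort", ascending=True)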

In [None]:
correlation_stack_sorted[-50:]
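The stacked series above lists every pair twice and also includes each feature's correlation with itself, which crowds the top of the ranking. A minimal sketch of one way to keep only distinct pairs is shown here; the helper name `top_correlated_pairs` and the toy columns are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=5):
    """Return the n most strongly correlated distinct column pairs."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle: drops self-pairs and mirrored duplicates.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Hypothetical toy data: 'b' is almost a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})
top_pairs = top_correlated_pairs(toy, n=3)
print(top_pairs)
```

On the real data the same helper could be called on the numeric feature columns, assuming the 'diagnosis' column has been excluded first.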

## 1. Violin, box and swarm plots

In [None]:
y = df['diagnosis']
df.drop('diagnosis', axis=1, inplace=True)
df_std = (df - df.mean())/df.std()

In [None]:
df_std.head()
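As a quick sanity check on the z-score formula used above, the sketch below applies it to a small made-up frame and confirms that each column ends up with mean 0 and standard deviation 1. The toy column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features on very different scales.
toy = pd.DataFrame({
    "small": [0.1, 0.2, 0.3, 0.4],
    "large": [100.0, 250.0, 400.0, 1000.0],
})

# Same z-score formula as in the notebook (pandas std() uses ddof=1 by default).
toy_std = (toy - toy.mean()) / toy.std()

col_means = toy_std.mean()
col_stds = toy_std.std()
print(col_means.round(6))
print(col_stds.round(6))
```

After this transformation both columns can share one axis in a plot, which is why the notebook standardizes before drawing the violin plots.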

In [None]:
def plot_data(data, y, plot_type):
    data = pd.concat([y, data], axis=1)
    data = pd.melt(data,id_vars="diagnosis",
                   var_name="features",
                   value_name='value')
    plt.figure(figsize=(10,10))
    if plot_type=='violin':
        sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")
    elif plot_type=='box':
        sns.boxplot(x="features", y="value", hue="diagnosis", data=data)
    elif plot_type=='swarm':
        sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)
    plt.xticks(rotation=90)

In [None]:
plot_data(df_std.iloc[:, 0:10], y, "violin")

In [None]:
# start at 10 so that column 10 is not skipped (iloc slices exclude the end index)
plot_data(df_std.iloc[:, 10:20], y, "violin")

In [None]:
# start at 20 so that column 20 is not skipped
plot_data(df_std.iloc[:, 20:30], y, "violin")

## Observations:

1. Many features have clearly different distributions (and medians) for the two diagnosis types ('M' for malignant, 'B' for benign).
2. Features like 'radius_mean' and 'concavity_mean' have considerably different medians for the two types, so they should be good for classification.
3. Features like 'symmetry_mean', 'fractal_dimension_mean' and 'texture_se' have nearly the same median for both types, so they are unlikely to add much to the classification task.
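The visual comparison of medians can also be expressed numerically. The following is a rough sketch, assuming standardized features and 'B'/'M' labels as in this notebook; the helper `median_gap` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def median_gap(features, labels):
    """Absolute difference between the per-class medians, largest first."""
    medians = features.groupby(labels).median()
    gap = (medians.loc["M"] - medians.loc["B"]).abs()
    return gap.sort_values(ascending=False)

# Hypothetical toy data: one feature separates the classes, one does not.
rng = np.random.default_rng(1)
labels = pd.Series(["B"] * 100 + ["M"] * 100, name="diagnosis")
toy = pd.DataFrame({
    "separating": np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)]),
    "overlapping": rng.normal(0, 1, 200),
})
gaps = median_gap(toy, labels)
print(gaps)
```

Something like `median_gap(df_std, y)` would rank the real features the same way, provided `y` still holds the 'B'/'M' strings at that point.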

In [None]:
plot_data(df_std.iloc[:, 0:10], y, 'box')

In [None]:
# start at 10 so that column 10 is not skipped
plot_data(df_std.iloc[:, 10:20], y, 'box')

In [None]:
plot_data(df_std.iloc[:, 0:10], y, 'swarm')

In [None]:
plot_data(df_std.iloc[:, 10:20], y, 'swarm')

In [None]:
# start at 20 so that column 20 is not skipped
plot_data(df_std.iloc[:, 20:30], y, 'swarm')

#### From the above swarm plots we can draw the following conclusions:
1. 'perimeter_mean', 'area_mean' and 'concavity_mean' do a good job of separating the two classes.
2. In the 2nd plot, none of the features separates the classes well.
3. In the 3rd plot, 'area_worst', 'perimeter_worst' and 'concavity_worst' are also able to separate the classes.

#### Next, we have to select the features that are best suited for classification.

In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(df.corr(), cmap='Blues', annot=True,linewidths=.5, fmt= '.1f')
plt.show()

In [None]:
df.columns

### Manually selecting features from observations made above:

1. The features 'radius_mean', 'perimeter_mean' and 'area_mean' are highly correlated, so we keep only one of them -> 'perimeter_mean'.
2. The features 'radius_worst', 'perimeter_worst' and 'area_worst' are highly correlated, so we keep only one of them -> 'area_worst'.
3. 'compactness_mean', 'concavity_mean' and 'concave points_mean' are also correlated, we choose -> 'concave points_mean'
4. 'radius_se', 'perimeter_se' and 'area_se' are also correlated, we choose -> 'radius_se'
5. 'compactness_worst', 'concavity_worst' and 'concave points_worst' are also correlated, we choose -> 'concavity_worst'
6. 'texture_mean' and 'texture_worst' are also correlated, we choose -> 'texture_mean'
7. 'compactness_se' and 'concave points_se' are also dropped in the code below, which leaves 'concavity_se' from that group.
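The manual grouping above can be approximated programmatically. The sketch below is one possible approach, not the method used in this notebook: it drops any column whose absolute correlation with an earlier column exceeds a threshold. The function name, the 0.9 threshold and the toy data are all assumptions, and unlike the manual choice it always keeps the first column of each correlated group.

```python
import numpy as np
import pandas as pd

def correlated_columns_to_drop(frame, threshold=0.9):
    """List columns that correlate above `threshold` with an earlier column."""
    corr = frame.corr().abs()
    # Strict upper triangle: entry (i, j) is kept only when column i comes before column j.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Hypothetical toy data: perimeter is a near-linear function of radius.
rng = np.random.default_rng(3)
radius = rng.uniform(5, 25, 300)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 0.5, 300),
    "texture": rng.normal(20, 4, 300),
})
to_drop = correlated_columns_to_drop(toy, threshold=0.9)
print(to_drop)
```

Which member of a correlated group is kept depends on column order here, so the result should still be reviewed against domain knowledge and the swarm plots.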

In [None]:
cols_to_drop = ['area_mean','radius_mean','compactness_mean','concavity_mean',
                'area_se','perimeter_se','perimeter_worst', 
                'compactness_worst','concave points_worst','compactness_se',
                'concave points_se','texture_worst','radius_worst']

data = df.drop(cols_to_drop, axis=1)
data.head()

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(data.corr(), cmap='Blues', annot=True,linewidths=.5, fmt= '.1f')
plt.show()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.metrics import accuracy_score, classification_report

# split data: 70% train and 30% test
x_train, x_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state=42)

# random forest classifier with default settings (n_estimators=100 since scikit-learn 0.22)
clf_rf = RandomForestClassifier(random_state=43)
clf_rf.fit(x_train, y_train)

# predict once and reuse the predictions for both metrics
y_pred_rf = clf_rf.predict(x_test)
ac = accuracy_score(y_test, y_pred_rf)
print('Accuracy is: ', ac)
cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot=True, fmt="d")

### A simple model with the manually chosen features gave us 97% accuracy

In [None]:
print(classification_report(y_test,clf_rf.predict(x_test)))
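The precision, recall and F1 values in a classification report can all be derived from the confusion matrix. The sketch below works through the arithmetic for the malignant class on a made-up matrix; the counts are hypothetical and are not results from this notebook.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn layout:
# rows = true class, columns = predicted class, class order [B, M].
cm = np.array([[50, 2],
               [3, 45]])

# Treat 'M' as the positive class.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted 'M' that are truly 'M'
recall = tp / (tp + fn)              # share of true 'M' that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a diagnosis task, recall on the malignant class is usually the number to watch, since a false negative is a missed malignant case.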

In [None]:
features = pd.DataFrame()
features['Feature'] = x_train.columns
features['Importance'] = clf_rf.feature_importances_
features.sort_values(by=['Importance'], ascending=False, inplace=True)
features.set_index('Feature', inplace=True)
features.plot(kind='bar', figsize=(20, 10))

### Now we shall repeat the process, this time using automatic feature selection algorithms

## 1. Recursive Feature Elimination with CV

Recursive Feature Elimination (RFE) works by repeatedly fitting a model, removing the least important attributes, and refitting on those that remain. RFECV wraps this procedure in cross-validation to find the optimal number of features and which features to keep.

It uses the cross-validated model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute.
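To make the elimination loop concrete, here is a minimal sketch of plain RFE without cross-validation. It is an illustration under stated assumptions rather than what scikit-learn does internally: a least-squares linear model stands in for the random forest, the absolute coefficient stands in for the importance score, and the function `rfe_linear` and the synthetic data are hypothetical.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Drop the weakest feature one at a time until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit a least-squares model on the features still in play.
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Eliminate the feature with the smallest absolute coefficient.
        weakest = int(np.argmin(np.abs(coef)))
        remaining.pop(weakest)
    return remaining

# Synthetic data: only columns 0 and 3 influence the target.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))  # already on a common scale
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

selected = rfe_linear(X, y, n_keep=2)
print(selected)
```

RFECV adds one more layer on top of this loop: it scores every subset size with cross-validation and keeps the size with the best score instead of requiring `n_keep` up front.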

In [None]:
from sklearn.feature_selection import RFECV

# The "accuracy" scoring is proportional to the number of correct classifications
rf_clf2 = RandomForestClassifier()

# cv=3 means 3-fold cross-validation
rfecv = RFECV(estimator=rf_clf2, step=1, cv=3, scoring='accuracy')
rfecv = rfecv.fit(data, y)

print('Optimal number of features :', rfecv.n_features_)
# index the same frame that was passed to fit()
print('Best features :', data.columns[rfecv.support_])

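Running RFECV on all rows of the manually reduced feature set ('data') narrowed the selection down to 14 important features for classification in this run. Because the random forest is not seeded here, the exact number can vary between runs.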

## 2. XGBoost

### Feature Importance in Gradient Boosting

A benefit of using gradient boosting is that after the boosted trees are constructed, it is relatively straightforward to retrieve importance scores for each attribute.

Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance.

In [None]:
from numpy import sort
from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel

### First fit on the entire dataset

In [None]:
y = y.map({'B':0, 'M':1}).astype('int')

In [None]:
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=7, stratify=y)

In [None]:
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
# (predict() already returns class labels, so no rounding is needed)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features whose importance is at least the threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

We can see that a naive XGBoost classifier with just 7 features achieves around 93% accuracy on the test set. This accuracy could likely be increased further with hyperparameter tuning.
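The threshold loop above relies on `SelectFromModel` keeping every feature whose importance is greater than or equal to the threshold. The sketch below reproduces that selection rule with plain NumPy; the importance scores and feature names are hypothetical and not taken from the fitted model.

```python
import numpy as np

# Hypothetical importance scores, one per feature.
importances = np.array([0.40, 0.05, 0.25, 0.00, 0.30])
feature_names = np.array(["f0", "f1", "f2", "f3", "f4"])

# Sweep the thresholds from smallest to largest, as in the loop above.
for thresh in np.sort(importances):
    keep = importances >= thresh
    print(f"thresh={thresh:.2f}, n={keep.sum()}, kept={feature_names[keep].tolist()}")

# Features that survive a threshold of 0.25.
kept_at_025 = feature_names[importances >= 0.25].tolist()
print(kept_at_025)
```

Each step of the sweep removes the least important remaining feature, so the printed accuracies trace how performance changes as the model shrinks from all features down to one.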

In [None]:
# How to get back feature_importances_ (gain based) from plot_importance fscore
# Calculate two types of feature importance:
# Weight = number of times a feature appears in the trees
# Gain = average gain of the splits which use the feature (averaged over all its appearances)
# Normalized gain = proportion of average gain out of total average gain

k = model.get_booster().trees_to_dataframe()
group = k[k['Feature'] != 'Leaf'].groupby('Feature').agg(
    fscore=('Gain', 'count'),
    feature_importance_gain=('Gain', 'mean'))

# Feature importance same as plot_importance(importance_type='weight'), the default
print(group['fscore'].sort_values(ascending=False))
# Feature importance same as model.feature_importances_ (default importance_type='gain')
group['feature_importance_gain_norm'] = group['feature_importance_gain'] / group['feature_importance_gain'].sum()
print(group.sort_values(by='feature_importance_gain_norm', ascending=False))
# Feature importance same as plot_importance(importance_type='gain')
group[['feature_importance_gain']].sort_values(by='feature_importance_gain', ascending=False)

Link to above comment - [CODE LINK](https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/#comment-540697)
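To see how weight, gain and normalized gain relate, the sketch below runs the same aggregation on a tiny hand-made split table. It only mimics the 'Feature' and 'Gain' columns of `trees_to_dataframe()`; the rows and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical split table; 'Leaf' rows are terminal nodes, not splits.
splits = pd.DataFrame({
    "Feature": ["f0", "f1", "f0", "Leaf", "f1", "f0", "Leaf"],
    "Gain":    [10.0, 4.0, 6.0, 0.5, 2.0, 2.0, -0.3],
})

internal = splits[splits["Feature"] != "Leaf"]
summary = internal.groupby("Feature").agg(
    weight=("Gain", "count"),     # how often the feature is used to split
    mean_gain=("Gain", "mean"),   # average gain over those splits
)
# Normalize so that the average gains sum to 1 across features.
summary["gain_norm"] = summary["mean_gain"] / summary["mean_gain"].sum()
print(summary)
```

Here 'f0' is used three times with an average gain of 6 and 'f1' twice with an average gain of 3, so the normalized gains come out at two thirds and one third.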
