<h1 align=center>Data Visualization with Seaborn: Feature Selection and Classification</h1>


### About the Dataset:

**Attribute Information**:

- ID number
- Diagnosis (M = malignant, B = benign)
- Fields 3-32: real-valued features

Ten real-valued features are computed for each cell nucleus:

1. radius (mean of distances from center to points on the perimeter) 
2. texture (standard deviation of gray-scale values) 
3. perimeter 
4. area 
5. smoothness (local variation in radius lengths) 
6. compactness (perimeter^2 / area - 1.0) 
7. concavity (severity of concave portions of the contour) 
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
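The field numbering above can be checked with a short sketch. It assumes the 30 feature columns are ordered statistic by statistic (all means, then all standard errors, then all "worst" values) and are named `<feature>_<statistic>`; the spelling `fractal_dimension` is an assumption, while `concave points` keeps its space as in the column names used later in this notebook.

```python
# Base measurements, in the order listed above. The underscore spelling of
# "fractal_dimension" is an assumption; "concave points" keeps its space,
# as in the column names used later in this notebook.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
statistics = ["mean", "se", "worst"]

# Statistic-major ordering: all means first, then all SEs, then all worst values.
feature_names = [f"{feature}_{stat}" for stat in statistics for feature in base_features]

# Fields 1 and 2 are the ID and the diagnosis, so feature i (0-based) is field i + 3.
field_of = {name: index + 3 for index, name in enumerate(feature_names)}

print(len(feature_names))
print(field_of["radius_mean"], field_of["radius_se"], field_of["radius_worst"])
```

Under these assumptions the sketch yields 30 names and places the three radius columns at fields 3, 13 and 23, matching the description.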

All feature values are recorded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

### Loading Libraries and Data

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt
sns.set_theme()  # replaces plt.style.use('seaborn'), a style name deprecated in Matplotlib 3.6
import time

In [None]:
data = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')

<h2 align=center> Exploratory Data Analysis </h2>

### Separate Target from Features

In [None]:
data.head()

In [None]:
col = data.columns       
print(col)

In [None]:
y = data.diagnosis                           
drop_cols = ['Unnamed: 32','id','diagnosis']
x = data.drop(drop_cols,axis = 1 )
x.head()

### Plot Diagnosis Distributions

In [None]:
ax = sns.countplot(x=y)
# value_counts() sorts by frequency, so the larger class (benign) comes first
B, M = y.value_counts()
print('Number of Benign: ',B)
print('Number of Malignant : ',M)

In [None]:
x.describe()

<h2 align=center> Data Visualization </h2>

### Visualizing Standardized Data with Seaborn

In [None]:
data_dia = y
data = x
data_n_2 = (data - data.mean()) / (data.std())              
data = pd.concat([y,data_n_2.iloc[:,0:10]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")
plt.xticks(rotation=45);
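The cell above standardizes every feature and reshapes the frame before plotting. As a rough illustration of those two steps, here is a sketch on a made-up four-row frame; `toy_x`, `toy_y` and all values in them are hypothetical.

```python
import pandas as pd

# Toy stand-in for the feature matrix and target (values are made up).
toy_x = pd.DataFrame({
    "radius_mean": [10.0, 12.0, 14.0, 20.0],
    "area_mean": [300.0, 450.0, 600.0, 1250.0],
})
toy_y = pd.Series(["B", "B", "B", "M"], name="diagnosis")

# Z-score each column: subtract the column mean, divide by the sample std (ddof=1).
toy_z = (toy_x - toy_x.mean()) / toy_x.std()

# Wide -> long: one row per (sample, feature) pair, which is the shape
# seaborn expects for x="features", y="value", hue="diagnosis".
toy_long = pd.melt(
    pd.concat([toy_y, toy_z], axis=1),
    id_vars="diagnosis",
    var_name="features",
    value_name="value",
)

print(toy_long.shape)
```

After standardization every column has mean 0 and standard deviation 1, so features measured on very different scales (radius vs. area) can share one y-axis.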

### Violin Plots and Box Plots

In [None]:
data = pd.concat([y,data_n_2.iloc[:,10:20]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")
plt.xticks(rotation=45);

In [None]:
data = pd.concat([y,data_n_2.iloc[:,20:31]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")
plt.xticks(rotation=45);

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x="features", y="value", hue="diagnosis", data=data)
plt.xticks(rotation=45);

### Using Joint Plots for Feature Comparison

In [None]:
sns.jointplot(x=x.loc[:,'concavity_worst'],
              y=x.loc[:,'concave points_worst'],
              kind="reg",
              color="#ce1414");

### Observing the Distribution of Values and their Variance with Swarm Plots

In [None]:
#sns.set(style="whitegrid", palette="muted")
data_dia = y
data = x
data_n_2 = (data - data.mean()) / (data.std())  
data = pd.concat([y,data_n_2.iloc[:,0:10]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)
plt.xticks(rotation=45);

In [None]:
data = pd.concat([y,data_n_2.iloc[:,10:20]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)
plt.xticks(rotation=45);

In [None]:
data = pd.concat([y,data_n_2.iloc[:,20:31]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)
plt.xticks(rotation=45);

### Observing all Pair-wise Correlations

In [None]:
#correlation map
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(x.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax);

### Dropping Correlated Columns from Feature Matrix


In [None]:
drop_cols=['perimeter_mean','radius_mean','compactness_mean',
              'concave points_mean','radius_se','perimeter_se',
              'radius_worst','perimeter_worst','compactness_worst',
              'concave points_worst','compactness_se','concave points_se',
              'texture_worst','area_worst']
df=x.drop(drop_cols,axis=1)
df.head()
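The columns dropped above were picked from the heatmap. One possible way to shortlist candidates programmatically is to list every feature pair whose absolute correlation exceeds a cutoff. The sketch below runs on synthetic data; the 0.9 `threshold` and the `toy` frame are assumptions for illustration, not the rule used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.uniform(5.0, 25.0, size=200)

# Synthetic stand-in: perimeter and area are functions of radius, texture is independent.
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0.0, 0.5, size=200),
    "area": np.pi * radius ** 2,
    "texture": rng.normal(20.0, 4.0, size=200),
})

threshold = 0.9  # hypothetical cutoff, not taken from the notebook
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is listed once and the diagonal is skipped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > threshold].sort_values(ascending=False)

print(high_pairs)
```

From each highly correlated pair, one column can be kept and the other dropped; which one to keep remains a judgment call.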

In [None]:
fig,ax=plt.subplots(figsize=(18,18))
sns.heatmap(df.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)

### Classification using XGBoost (minimal feature selection)

In [None]:
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.metrics import accuracy_score

In [None]:
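# Encode labels as integers (B=0, M=1); recent XGBoost releases reject string class labels
y_enc=y.map({'B':0,'M':1})
x_train,x_test,y_train,y_test=train_test_split(df,y_enc,test_size=0.3,random_state=42)

clf_1=xgb.XGBClassifier(random_state=42)
clf_1=clf_1.fit(x_train,y_train)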

In [None]:
preds=clf_1.predict(x_test)

print('Accuracy Score :',accuracy_score(y_test,preds))


In [None]:
cm=confusion_matrix(y_test,preds)
sns.heatmap(cm,annot=True,fmt='d')
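To read the heatmap above, it helps to know that scikit-learn's `confusion_matrix(y_true, y_pred)` puts true classes on the rows and predicted classes on the columns. A minimal sketch with made-up labels (not results from this dataset) shows how the four cells map to recall and precision for the malignant class:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = ["B", "B", "B", "M", "M", "M", "M", "B"]
y_pred = ["B", "B", "M", "M", "M", "B", "M", "B"]

# Rows are true classes and columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["B", "M"])
tn, fp, fn, tp = cm.ravel()  # treating "M" as the positive class

recall_malignant = tp / (tp + fn)     # share of malignant cases that were caught
precision_malignant = tp / (tp + fp)  # share of malignant predictions that were right

print(cm)
print(recall_malignant, precision_malignant)
```

For a diagnostic task, the false-negative cell (malignant predicted as benign) is usually the one to watch most closely, which accuracy alone does not reveal.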

### Univariate Feature Selection and XGBoost

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
select_feature=SelectKBest(chi2,k=10)
select_feature=select_feature.fit(x_train,y_train)

In [None]:
print('Score list:',select_feature.scores_)
print('Features list :',x_train.columns)
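Printing the scores and the column names separately makes them hard to match up. A possible alternative is to pair them in a `pd.Series` and read the chosen columns from `get_support()`. The sketch below uses synthetic non-negative data (`toy`, `labels` and the column names are hypothetical), since `chi2` does not accept negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)

# Synthetic, non-negative features (chi2 rejects negative values).
# "signal" is shifted upward for class 1; the two noise columns ignore the label.
toy = pd.DataFrame({
    "noise_a": rng.uniform(0.0, 10.0, size=100),
    "signal": rng.uniform(0.0, 10.0, size=100) + 20.0 * labels,
    "noise_b": rng.uniform(0.0, 10.0, size=100),
})

selector = SelectKBest(chi2, k=1).fit(toy, labels)

# Pair each score with its column name and sort, instead of reading two separate printouts.
scores = pd.Series(selector.scores_, index=toy.columns).sort_values(ascending=False)
selected = list(toy.columns[selector.get_support()])

print(scores)
print(selected)
```

The same two lines (`pd.Series(...)` and `get_support()`) should apply unchanged to `select_feature` and `x_train` from the cells above.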

In [None]:
x_train_2=select_feature.transform(x_train)
x_test_2=select_feature.transform(x_test)

clf_2=xgb.XGBClassifier()
clf_2=clf_2.fit(x_train_2,y_train)

preds_2=clf_2.predict(x_test_2)

print('Accuracy score =',accuracy_score(y_test,preds_2))

# Pass y_true first so rows are true classes, matching the earlier confusion matrix
cm_2=confusion_matrix(y_test,preds_2)
sns.heatmap(cm_2,annot=True,fmt='d')


### Recursive Feature Elimination with Cross-Validation

In [None]:
from sklearn.feature_selection import RFECV

clf_3=xgb.XGBClassifier()
rfecv=RFECV(estimator=clf_3,step=1,cv=5,scoring='accuracy',n_jobs=-1).fit(x_train,y_train)

print('Optimal number of features =',rfecv.n_features_)
print('Best features =',x_train.columns[rfecv.support_])

In [None]:
accuracy_score(y_test,rfecv.predict(x_test))

In [None]:
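# grid_scores_ was removed in scikit-learn 1.2; cv_results_ holds the mean CV score per feature count
cv_scores=rfecv.cv_results_['mean_test_score']
num_features=list(range(1,len(cv_scores)+1))
ax=sns.lineplot(x=num_features,y=cv_scores)
ax.set(xlabel='Number of selected features',ylabel='Mean CV accuracy')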
