# Intro

Hello everyone. In this notebook I will show how to take a high-dimensional dataset and reduce its dimensionality with **Feature Selection**. First we will use **data visualization** to learn about the features we have, then we will use **feature selection** to build a good model with fewer features. I try to learn new things every day and improve myself. If you come across any mistakes, please mention them in the comments. Your feedback is very important to me.

Notebook content is as follows:

 - [Base Model](#1)
 - [EDA](#2)
 - [Visualization](#3)
 - [Feature Selection and Random Forest Classification](#4)
 - [Conclusion](#5)
 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

import warnings
warnings.filterwarnings("ignore")

In [None]:
breast = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")

df = breast.copy()

I will drop the `Unnamed: 32` and `id` columns, which are not needed as features.

In [None]:
df.drop(["Unnamed: 32","id"],axis = 1, inplace = True)

<a id="1"></a> <br>
# **Base Model**

In [None]:
X = df.drop("diagnosis", axis = 1)
y = df["diagnosis"]

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, stratify = y, random_state = 42)

In [None]:
logreg = LogisticRegression().fit(X_train,y_train)
y_pred = logreg.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
plt.figure(figsize=(3,3))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, annot_kws={"fontsize":20}, fmt='d', cbar=False, cmap='PuBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Base Model', color='navy', fontsize=15)
plt.show()

The base model, built with all the features, gives an **accuracy score of 0.94**. Now let's try feature elimination and see if we can get better results. But first, let's visualize the data to understand the features.
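As a small illustration of how the accuracy score relates to the confusion matrix plotted above, here is a sketch on made-up labels (the `y_true` and `y_hat` lists are hypothetical, not taken from the dataset). Accuracy is the sum of the diagonal divided by the total count, and with string labels `B` and `M` the matrix rows and columns follow sorted label order, so index 0 is `B` and index 1 is `M`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, only for illustration.
y_true = ["B", "B", "B", "M", "M", "B", "M", "B"]
y_hat = ["B", "B", "M", "M", "M", "B", "B", "B"]

# Rows are actual classes, columns are predicted classes, in the order B, M.
cm = confusion_matrix(y_true, y_hat, labels=["B", "M"])

# Correct predictions sit on the diagonal.
acc_from_cm = np.trace(cm) / cm.sum()

print(cm)
print("Accuracy from confusion matrix:", acc_from_cm)
print("Accuracy from accuracy_score:  ", accuracy_score(y_true, y_hat))
```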

<a id="2"></a> <br>
# **EDA**

In [None]:
y = df.diagnosis
x = df.drop("diagnosis", axis = 1)

x.head()

In [None]:
ax = sns.countplot(x = "diagnosis", data = df)
plt.show()

b,m = df.diagnosis.value_counts()
print("Number of Benign: ", b)
print("Number of Malignant: ", m)


In [None]:
x.describe().T

<a id="3"></a> <br>
# **Visualization**

- Before visualization, we need to apply **normalization** or **standardization**, because the value ranges of the features differ too much to compare on a single plot.
- I plot the features in 3 groups of 10 features each so they are easier to read.
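To make the standardization step concrete, here is a minimal sketch on a toy DataFrame (the values are made up; only the column names are borrowed from the dataset). It applies the same z-score formula used in the cell below: subtract each column's mean and divide by its standard deviation.

```python
import pandas as pd

# Toy data with very different scales (made-up values).
toy = pd.DataFrame({
    "area_mean": [400.0, 600.0, 1000.0, 1200.0],
    "smoothness_mean": [0.08, 0.10, 0.09, 0.11],
})

# Z-score standardization, column by column.
# pandas uses the sample standard deviation (ddof=1) by default.
toy_std = (toy - toy.mean()) / toy.std()

print(toy_std.round(3))
print(toy_std.mean().round(6))  # close to 0 for every column
print(toy_std.std().round(6))   # close to 1 for every column
```

After this step both columns live on a comparable scale, so they can share one axis in a violin or swarm plot.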

## **Violin Plots**

In [None]:
data_dia = y
data = x

#standardization
data_n2 = (data-data.mean()) / data.std()

data = pd.concat([y,data_n2.iloc[:,0:10]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')

# plotting the violin plot
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

In the **texture_mean** feature, 
  - the medians of **Malignant** and **Benign** look separated, so it can be useful for classification.<br>
  
However, in the **fractal_dimension_mean** feature, 
  - the medians of **Malignant** and **Benign** do not look separated, so it gives little information for classification.
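The visual comparison of medians can also be checked numerically. The sketch below uses a toy frame with hypothetical columns `feature_a` and `feature_b` (not from the dataset) and computes the per-class median and the gap between the two classes. On the real data, one would apply the same idea to the standardized features.

```python
import pandas as pd

# Toy data (made-up values): feature_a separates the classes, feature_b does not.
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "M", "B", "B", "B"],
    "feature_a": [2.0, 2.5, 3.0, 0.5, 1.0, 1.5],
    "feature_b": [1.0, 2.0, 3.0, 1.1, 2.1, 2.9],
})

# Median of every feature within each class.
medians = toy.groupby("diagnosis").median()

# Absolute distance between the malignant and benign medians.
gap = (medians.loc["M"] - medians.loc["B"]).abs()

print(medians)
print(gap.sort_values(ascending=False))
```

A larger gap suggests the feature is more likely to help the classifier, although medians alone ignore how much the two distributions overlap.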

In [None]:
# second group of ten features
data = pd.concat([y,data_n2.iloc[:,10:20]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')

# plotting the violin plot
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

In [None]:
# last group of ten features
data = pd.concat([y,data_n2.iloc[:,20:30]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')

# plotting the violin plot
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

- Looking across all the plots, some features look very similar to one another, while others look very different. The different ones will help during classification.

- To compare two features more closely, let's use a **joint plot** for some similar and some different features.

## **Joint-Plot**
### **Concavity Worst - Concave Points Worst**

In [None]:
sns.jointplot(x ='concavity_worst', y = 'concave points_worst', 
              data = x, kind="reg", color="#D81B60");

### **Symmetry Worst - Fractal Dimension Worst**

In [None]:
sns.jointplot(x = "symmetry_worst", y = "fractal_dimension_worst",
              data = x, kind = "reg",color="#D81B60" );

In [None]:
sns.jointplot(x ='concavity_se', y = 'concave points_se', 
              data = x, kind="reg", color="#D81B60");

* In the violin plots, the distributions of some features were very close to each other, so we looked more closely at the relationship between them. We saw that there is a linear relationship between them.
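The strength of a linear relationship like the ones in the joint plots can be summarized with the Pearson correlation coefficient. The sketch below uses synthetic columns (`a`, `b_linear`, `c_noise` are hypothetical names, generated with a fixed seed) to contrast a nearly linear pair with an unrelated pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Synthetic data: b_linear follows a closely, c_noise is independent of a.
toy = pd.DataFrame({
    "a": a,
    "b_linear": 2.0 * a + rng.normal(scale=0.1, size=200),
    "c_noise": rng.normal(size=200),
})

# Series.corr computes the Pearson correlation by default.
r_linear = toy["a"].corr(toy["b_linear"])
r_noise = toy["a"].corr(toy["c_noise"])

print("a vs b_linear:", round(r_linear, 3))
print("a vs c_noise: ", round(r_noise, 3))
```

A coefficient close to 1 or -1 means one feature is almost a linear function of the other, which is the motivation for keeping only one of them later.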

## **PairGrid**
**Instead of looking at them individually, let's look at three or four properties in one graph.**

In [None]:
df = x.loc[:,['radius_worst','perimeter_worst','area_worst']]
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_upper(sns.scatterplot)
g.map_diag(sns.kdeplot, lw=3)
plt.show()

In [None]:
df = x.loc[:,['perimeter_mean','area_mean','area_worst',"concavity_mean"]]
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_upper(sns.scatterplot)
g.map_diag(sns.kdeplot, lw=3)
plt.show()

## **Swarm Plots**

In [None]:
data = pd.concat([y,data_n2.iloc[:,0:10]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(8,8))
sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)

plt.xticks(rotation=90)

- For example, we can easily see that in the **area_mean** feature the two classes are well separated, which is useful for classification.
- However, in **symmetry_mean** malignant and benign look mixed, so it is hard to classify using this feature.

In [None]:
data = pd.concat([y,data_n2.iloc[:,10:20]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(8,8))
sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)

plt.xticks(rotation=90)

- Here we can make the same comment as before: in **area_se** the two classes are almost well separated,
- but in **fractal_dimension_se**, for example, they are not.

In [None]:
data = pd.concat([y,data_n2.iloc[:,20:30]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(8,8))
sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)

plt.xticks(rotation=90)

<a id="4"></a> <br>
# **Feature Selection and Random Forest Classification**

In this part we will select features with two different methods:
-  **Correlation**
-  **Univariate Feature Selection**

We will then use **Random Forest Classification** to train our model and make predictions.

## **1. Correlation Matrix and Random Forest Classification**

In [None]:
corr = x.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24; use the built-in bool
f, ax = plt.subplots(figsize=(20, 15))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, annot=True,fmt='.2f',mask=mask, cmap=cmap, ax=ax);

I will drop features whose correlation coefficient with another feature is higher than 0.9.

**Highly Correlated Features**

-  radius_mean, perimeter_mean and area_mean are correlated, so I choose **area_mean**
-  compactness_mean, concavity_mean and concave points_mean are correlated, so I choose **concavity_mean**
-  radius_se, perimeter_se and area_se are correlated, so I choose **area_se**
-  radius_worst, perimeter_worst and area_worst are correlated, so I choose **area_worst**
-  compactness_worst, concavity_worst and concave points_worst are correlated, so I choose **concavity_worst**
-  compactness_se, concavity_se and concave points_se are correlated, so I choose **concavity_se**
-  texture_mean and texture_worst are correlated, so I choose **texture_mean**
-  area_worst and area_mean are correlated, so I choose **area_mean**
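Reading correlated groups off a 30x30 heatmap by eye is error-prone, so here is a sketch of how the pairs above the 0.9 threshold could be listed programmatically. The helper name `high_corr_pairs` and the toy columns are hypothetical; on the notebook data one would call it on `x` instead of `toy`.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is listed once.
    keep = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(keep).stack()
    # Masked cells are NaN and never pass the comparison.
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: perimeter and area are derived from radius, texture is independent.
rng = np.random.default_rng(1)
radius = rng.uniform(5, 25, size=100)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "area": np.pi * radius ** 2,
    "texture": rng.uniform(10, 30, size=100),
})

pairs = high_corr_pairs(toy, threshold=0.9)
print(pairs)
```

Which feature to keep from each correlated group is still a judgment call. The list only shows where the redundancy is.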

In [None]:
drop_list1 = ['perimeter_mean','radius_mean','compactness_mean','concave points_mean','radius_se','perimeter_se','radius_worst','perimeter_worst','compactness_worst','concave points_worst','compactness_se','concave points_se','texture_worst','area_worst']
x1 = x.drop(drop_list1, axis = 1 )       
x1.head()

Now let's create the correlation matrix again.

In [None]:
corr = x1.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24; use the built-in bool
f, ax = plt.subplots(figsize=(12, 6))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, annot=True,fmt='.2f',mask=mask, cmap=cmap, ax=ax);

### **Random Forest Model**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.metrics import accuracy_score


x_train, x_test, y_train, y_test = train_test_split(x1, y, test_size=0.3, random_state=42)

# default n_estimators (100 in scikit-learn >= 0.22)
clf_rf = RandomForestClassifier(random_state=43)      
clr_rf = clf_rf.fit(x_train,y_train)

ac_score = accuracy_score(y_test,clf_rf.predict(x_test))
print('Accuracy is: ',ac_score)

cnf_m = confusion_matrix(y_test,clf_rf.predict(x_test))

plt.figure(figsize=(3,3))
sns.heatmap(cnf_m, annot=True, annot_kws={"fontsize":20}, fmt='d', cbar=False, cmap='PuBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Correlation-Based Selection', color='navy', fontsize=15)
plt.show()

## **2. Univariate Feature Selection and Random Forest Classification**
- In univariate feature selection, we will use **SelectKBest**, which removes all but the **k highest-scoring** features.
- In this method we need to choose how many features to keep (k). I will try the model one value at a time with 
  - k = 4
  - k = 5 
  - k = 6
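Before running it on the real data, here is a minimal sketch of what `SelectKBest` with `chi2` does, on synthetic data with hypothetical column names. Note that `chi2` requires non-negative feature values, which holds for the raw measurements used in this notebook but not for standardized ones. The sketch also uses `get_support()` to read the selected column names directly instead of matching scores to names by eye.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200
labels = np.array(["B"] * (n // 2) + ["M"] * (n // 2))
is_m = labels == "M"

# Synthetic non-negative features: two depend on the class, two are pure noise.
toy = pd.DataFrame({
    "informative_1": rng.uniform(0, 1, n) + 2.0 * is_m,
    "informative_2": rng.uniform(0, 1, n) + 1.0 * is_m,
    "noise_1": rng.uniform(0, 1, n),
    "noise_2": rng.uniform(0, 1, n),
})

# Score every feature against the labels and keep the 2 best.
selector = SelectKBest(chi2, k=2).fit(toy, labels)

scores = pd.Series(selector.scores_, index=toy.columns).sort_values(ascending=False)
selected = list(toy.columns[selector.get_support()])

print(scores)
print("Selected features:", selected)
```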

#### **k = 4**

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# find best scored 4 features
select_feature = SelectKBest(chi2, k=4).fit(x_train, y_train)

print('Score list:', select_feature.scores_)
print('Feature list:', x_train.columns)

According to the score list, the best 4 features are: 
  - **texture_mean**,
  - **area_mean**,
  - **area_se**,
  - **concavity_worst**

We now build a new model with them.

In [None]:
x_train_2 = select_feature.transform(x_train)
x_test_2 = select_feature.transform(x_test)

#random forest classifier with default n_estimators (100 in scikit-learn >= 0.22)
clf_rf_2 = RandomForestClassifier()      
clr_rf_2 = clf_rf_2.fit(x_train_2,y_train)

ac_2 = accuracy_score(y_test,clf_rf_2.predict(x_test_2))
print('Accuracy is: ',ac_2)

cm_2 = confusion_matrix(y_test,clf_rf_2.predict(x_test_2))


plt.figure(figsize=(3,3))
sns.heatmap(cm_2, annot=True, annot_kws={"fontsize":20}, fmt='d', cbar=False, cmap='PuBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('SelectKBest (k=4)', color='navy', fontsize=15)
plt.show()

**It gave a better result than before, predicting class 1 (malignant) much better.**

#### **k = 5**

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# find best scored 5 features
select_feature = SelectKBest(chi2, k=5).fit(x_train, y_train)

print('Score list:', select_feature.scores_)
print('Feature list:', x_train.columns)

In [None]:
x_train_3 = select_feature.transform(x_train)
x_test_3 = select_feature.transform(x_test)

#random forest classifier with default n_estimators (100 in scikit-learn >= 0.22)
clf_rf_2 = RandomForestClassifier()      
clr_rf_2 = clf_rf_2.fit(x_train_3,y_train)

ac_3 = accuracy_score(y_test,clf_rf_2.predict(x_test_3))
print('Accuracy is: ',ac_3)  # report the k=5 score (previously printed ac_2)

cm_3 = confusion_matrix(y_test,clf_rf_2.predict(x_test_3))

plt.figure(figsize=(3,3))
sns.heatmap(cm_3, annot=True, annot_kws={"fontsize":20}, fmt='d', cbar=False, cmap='PuBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('SelectKBest (k=5)', color='navy', fontsize=15)
plt.show()

#### **k = 6**

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# find best scored 6 features
select_feature = SelectKBest(chi2, k=6).fit(x_train, y_train)

print('Score list:', select_feature.scores_)
print('Feature list:', x_train.columns)

In [None]:
x_train_4 = select_feature.transform(x_train)
x_test_4 = select_feature.transform(x_test)

#random forest classifier with default n_estimators (100 in scikit-learn >= 0.22)
clf_rf_3 = RandomForestClassifier()      
clr_rf_3 = clf_rf_3.fit(x_train_4,y_train)

ac_4 = accuracy_score(y_test,clf_rf_3.predict(x_test_4))
print('Accuracy is: ',ac_4)  # report the k=6 score (previously printed ac_2)

cm_4 = confusion_matrix(y_test,clf_rf_3.predict(x_test_4))

plt.figure(figsize=(3,3))
sns.heatmap(cm_4, annot=True, annot_kws={"fontsize":20}, fmt='d', cbar=False, cmap='PuBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('SelectKBest (k=6)', color='navy', fontsize=15)
plt.show()

- **k = 4** gave a better result (0.96) than the other k values, predicting class 1 (malignant) much better.
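The three experiments above can also be written as one loop, which avoids mixing up variable names between runs. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled copy of the Wisconsin breast cancer data (`load_breast_cancer`) as a stand-in for the Kaggle CSV, it does not drop the correlated features first, and it fixes `random_state` so every k is compared under the same conditions. Its scores are therefore not expected to match the numbers reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): the copy of the dataset that ships with scikit-learn.
data = load_breast_cancer(as_frame=True)
X_all, y_all = data.data, data.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

results = {}
for k in (4, 5, 6):
    # Fit the selector on the training split only, then apply it to both splits.
    selector = SelectKBest(chi2, k=k).fit(X_tr, y_tr)
    model = RandomForestClassifier(random_state=43)
    model.fit(selector.transform(X_tr), y_tr)
    results[k] = accuracy_score(y_te, model.predict(selector.transform(X_te)))

for k, acc in results.items():
    print(f"k = {k}: accuracy = {acc:.3f}")
```

Because a single train/test split is noisy, small differences between k values should be read with caution. Cross-validation would give a steadier comparison.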

<a id="5"></a> <br>
# **Conclusion**
In short, in this notebook I tried to show **Data Visualization** and **Feature Selection** techniques. We started with 33 columns (30 usable features once `id`, `Unnamed: 32` and the `diagnosis` target are set aside) and reduced them to 4 features using statistics and some algorithms.
