<h1 style="font-size:350%;"><center><b style="color:navy;">BREAST CANCER - EXPLORATORY DATA ANALYSIS</b></center></h1>

<h1 style="font-size:200%;"><b>OBJECTIVE</b></h1>
<ul>
    <li style="font-size:150%;">The goal of this kernel is to perform <b>EXPLORATORY DATA ANALYSIS</b> on the Breast Cancer Wisconsin dataset and build a machine learning model with good accuracy. This helps in understanding the importance of the attributes, and thereby in predicting breast cancer from them.</li>
</ul>

<h1 style="font-size:200%;"><b>STEPS PERFORMED</b></h1>
<ul>
    <li style="font-size:150%;">Data Cleaning</li>
    <li style="font-size:150%;">Data Visualization</li>
    <li style="font-size:150%;">PCA</li>
    <li style="font-size:150%;">Model Building</li>
    <li style="font-size:150%;">Conclusion</li>
</ul>

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Importing the dataset
data = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")

# DATA CLEANING

In [None]:
# Printing the first 5 rows
data.head()

In [None]:
# Get the dimensions of the data
data.shape

In [None]:
# get the columns list:
data.columns

In [None]:
# Target Variable:
data.diagnosis.value_counts()

In [None]:
#get the datatype of columns:
data.dtypes

## MISSING VALUES

In [None]:
# Check for null values:
data.isnull().sum()

In [None]:
# Drop the 'Unnamed: 32' column and the 'id' identifier column:
data.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)

In [None]:
# statistics of our data:
data.describe().T

# DATA VISUALIZATION

In [None]:
# Finding out the correlation between the numeric features
# (diagnosis is a string column, so it is excluded)
corr = data.select_dtypes(include='number').corr()
corr.shape

In [None]:
# Plotting the heatmap of correlation between features
plt.figure(figsize=(20,20))
sns.heatmap(corr, cbar=True, square= True, fmt='.1f', annot=True, annot_kws={'size':15}, cmap='Greens')
plt.show()

In [None]:
# Analyzing the target variable

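plt.title('Count of each diagnosis')
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='diagnosis', data=data)
plt.xlabel('Diagnosis (M = malignant, B = benign)')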
plt.ylabel('Count')
plt.show()

In [None]:
#plot the histograms for each feature:
data.hist(figsize = (30,30), color = 'orange')
plt.show()

In [None]:
melted_data = pd.melt(data,id_vars = "diagnosis",value_vars = ['radius_worst', 'texture_worst', 'perimeter_worst'])
plt.figure(figsize = (15,10))
sns.boxplot(x = "variable", y = "value", hue="diagnosis",data= melted_data)
plt.show()

In [None]:
data.columns

In [None]:
# Generate a pair plot (scatter-plot matrix) of the following columns:

columns = ['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']

sns.pairplot(data=data[columns], hue="diagnosis", palette='rocket')

In [None]:
# Distribution density plot KDE (kernel density estimate)
sns.FacetGrid(data, hue="diagnosis", height=6).map(sns.kdeplot, "radius_mean").add_legend()
plt.show()

In [None]:
# Plotting the distribution of the mean radius
sns.stripplot(x="diagnosis", y="radius_mean", data=data, jitter=True, edgecolor="gray")
plt.show()

<h1>Drop the Columns with high correlation</h1>

<ul>
    <li style="font-size:130%;">Multicollinearity is a problem because it undermines the statistical significance of the independent variables; we address it by removing highly correlated predictors.</li>
    <li style="font-size:130%;">The heatmap confirms multicollinearity between some of the variables. For instance, the radius_mean column has correlations of 1 and 0.99 with the perimeter_mean and area_mean columns, respectively. This is because the three columns essentially contain the same information: the physical size of the observation (the cell). Therefore we should pick only ONE of the three columns for further analysis.</li>
    <li style="font-size:130%;">Multicollinearity is also apparent between the "mean" columns and the "worst" columns. For instance, the radius_mean column has a correlation of 0.97 with the radius_worst column.
There is also multicollinearity among the compactness, concavity, and concave points attributes, so we can keep just ONE of them; I am going with compactness.</li>
</ul>
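
The correlated pairs discussed above can also be listed programmatically rather than read off the heatmap. The sketch below is a minimal illustration, not part of the original analysis: the helper name `highly_correlated_pairs`, the 0.9 threshold, and the small synthetic frame are assumptions made for demonstration. On the real dataset you would pass `data` instead of `demo`.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) fail this comparison and are filtered out
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 2)) for (a, b), r in pairs.items()]


# Synthetic stand-in data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
demo = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,           # exact linear function of radius
    "texture": rng.uniform(10, 40, size=200),  # drawn independently of radius
})

result = highly_correlated_pairs(demo)
print(result)
```

Each returned tuple names two columns that carry nearly the same information, so only one of them needs to be kept.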

In [None]:
# From the correlation matrix we learned that the columns below are highly correlated
# with columns we keep (for example, perimeter and area with radius_mean), so we drop them:

# first, drop all "worst" columns
cols = ['radius_worst', 
        'texture_worst', 
        'perimeter_worst', 
        'area_worst', 
        'smoothness_worst', 
        'compactness_worst', 
        'concavity_worst',
        'concave points_worst', 
        'symmetry_worst', 
        'fractal_dimension_worst']
data = data.drop(cols, axis=1)

# then, drop all columns related to the "perimeter" and "area" attributes
cols = ['perimeter_mean',
        'perimeter_se', 
        'area_mean', 
        'area_se']
data = data.drop(cols, axis=1)

# lastly, drop all columns related to the "concavity" and "concave points" attributes
cols = ['concavity_mean',
        'concavity_se', 
        'concave points_mean', 
        'concave points_se']
data = data.drop(cols, axis=1)

# verify remaining columns
data.columns

In [None]:
# Draw the heatmap again, with the new correlation matrix (numeric columns only)
corr = data.select_dtypes(include='number').corr().round(2)

# Define custom colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

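# Mask the upper triangle (including the diagonal) so each pair is shown once;
# the np.bool alias was removed from NumPy, so use the built-in bool
mask = np.zeros_like(corr, dtype=bool)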
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
plt.tight_layout()

# TRAIN TEST SPLIT

In [None]:
# Splitting target variable and independent variables
X = data.drop(['diagnosis'], axis = 1)
y = data['diagnosis']

In [None]:
# Splitting the data into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 0)
print("Size of training set:", X_train.shape)
print("Size of test set:", X_test.shape)

# Random Forest Model

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# Random Forest Classifier

# Import library of RandomForestClassifier model
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest Classifier
rf = RandomForestClassifier()

# Hyperparameter Optimization
parameters = {'n_estimators': [4, 6, 9, 10, 15], 
              'max_features': ['log2', 'sqrt'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10], 
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1, 5, 8]
             }

# Run the grid search
grid_obj = GridSearchCV(rf, parameters)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the rf to the best combination of parameters
rf = grid_obj.best_estimator_

# Train the model using the training sets 
rf.fit(X_train,y_train)

In [None]:
# Prediction on test data
y_pred = rf.predict(X_test)

In [None]:
from sklearn import metrics
# Calculating the accuracy
acc_rf = round( metrics.accuracy_score(y_test, y_pred) * 100 , 2 )
print( 'Accuracy of Random Forest model : ', acc_rf )

# Support Vector Machine

In [None]:
# SVM Classifier

# Standardize the features (scaler fit on the training set only), since SVMs are sensitive to feature scale
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
# Import Library of Support Vector Machine model
from sklearn import svm

# Create a Support Vector Classifier
svc = svm.SVC()

# Hyperparameter Optimization
parameters = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Run the grid search
grid_obj = GridSearchCV(svc, parameters)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the svc to the best combination of parameters
svc = grid_obj.best_estimator_

# Train the model using the training sets 
svc.fit(X_train,y_train)
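
One caveat with scaling before the grid search is that the scaler has then already seen every cross-validation fold. A common alternative is to put the scaler and the classifier in a scikit-learn `Pipeline`, so that scaling is refit inside each fold. The sketch below illustrates this on synthetic data from `make_classification`; the variable names and the reduced parameter grid are assumptions for illustration, not the notebook's own setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the features and labels (hypothetical data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [1, 10], "svc__gamma": [0.001, 0.0001]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Test accuracy:", round(search.score(X_te, y_te), 3))
```

Because `GridSearchCV` refits the best pipeline on the full training set by default, `search` can be used directly for prediction afterwards.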

In [None]:
# Prediction on test data
y_pred = svc.predict(X_test)

In [None]:
# Calculating the accuracy
acc_svm = round( metrics.accuracy_score(y_test, y_pred) * 100, 2 )
print( 'Accuracy of SVM model : ', acc_svm )
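
Accuracy alone does not show how the errors split between malignant and benign cases, which matters for a diagnostic task. A minimal sketch with scikit-learn's `confusion_matrix` and `classification_report` follows; the two short label lists are made-up stand-ins for `y_test` and `y_pred`, not results from the models above.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels standing in for y_test and y_pred (B = benign, M = malignant)
y_true_demo = ["M", "M", "M", "B", "B", "B", "B", "M"]
y_pred_demo = ["M", "B", "M", "B", "B", "M", "B", "M"]

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["B", "M"])
print(cm)

# Per-class precision, recall, and F1
print(classification_report(y_true_demo, y_pred_demo, labels=["B", "M"]))
```

The off-diagonal cell in the "M" row counts malignant cases predicted as benign, which is usually the costlier kind of error here.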

# CONCLUSION

<p style="font-size:180%;">In this kernel I performed EDA and built two machine learning models (Random Forest and SVM).</p>
