In this notebook I carry out some well-known exploratory data analysis and statistical analysis on the Breast Cancer Wisconsin dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv").drop(['Unnamed: 32'],axis=1)
df.head(n=6).T

**This dataset has some very interesting summary statistics. Let's have a look at them.**

In [None]:
df.describe().T

### **Correlation = Statistical relationship between two variables/features/dimensions**<br>
It is possible that dimensions in our dataset are closely dependent on, or associated with, each other, so that when one changes the others tend to change with it. A dataset with this many dimensions is a likely suspect for highly correlated features. Let's dig into that.

A correlation can be positive, meaning both variables move in the same direction, or negative, meaning that when one variable's value increases, the other variable's value decreases. Correlation can also be neutral (zero), meaning that the variables show no such relationship.

- Positive correlation: both variables change in the same direction.
- Neutral correlation: no relationship in the change of the variables.
- Negative correlation: the variables change in opposite directions.
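
To make the three cases concrete, here is a minimal sketch on synthetic data (not the cancer dataset). The column names `positive`, `negative`, `unrelated` and `monotonic` are made up for illustration. The last column is there to show why both Pearson and Spearman are worth looking at: Spearman works on ranks, so it treats any strictly increasing relationship as perfect, while Pearson only measures the linear part.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
noise = rng.normal(scale=0.3, size=500)

# Hypothetical toy columns, each built from x in a different way
toy = pd.DataFrame({
    "x": x,
    "positive": 2.0 * x + noise,         # rises with x
    "negative": -2.0 * x + noise,        # falls as x rises
    "unrelated": rng.normal(size=500),   # drawn independently of x
    "monotonic": np.exp(x),              # increases with x, but not linearly
})

# Correlation of every column with x, under both methods
pearson = toy.corr(method="pearson")["x"]
spearman = toy.corr(method="spearman")["x"]
print(pd.DataFrame({"pearson": pearson, "spearman": spearman}).round(2))
```

The `monotonic` row is the one to watch: its Spearman value should be essentially 1, while its Pearson value is noticeably lower.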

In [None]:
print(df.dtypes)

In [None]:
def plot_corr_heatmap(data, method):
    """Draw an annotated correlation heatmap for the numeric columns of `data`."""
    # Correlation is only defined for numeric columns, so non-numeric ones are left out
    corr = data.select_dtypes(include='number').corr(method=method)
    plt.figure(figsize=(21,21))
    plt.title(f"{method.capitalize()} Correlation Heatmap")
    sns.heatmap(corr,
               xticklabels=corr.columns.values,
               yticklabels=corr.columns.values,
               annot=True, # show the correlation value in each cell
               vmin=-1,
               vmax=1,
               center=0,
               fmt='0.2g', # two significant digits
               cmap='coolwarm',
               linewidths=3, # width of the lines separating cells
               linecolor='white', # color of the lines separating cells
               square=False, # set to True to force square cells
               cbar_kws={'orientation': 'vertical'}
               )
    # Workaround for matplotlib 3.1.1, which clipped the top and bottom rows of heatmaps
    b, t = plt.ylim()
    plt.ylim(b + 0.5, t - 0.5)
    plt.show()

plot_corr_heatmap(df, 'pearson')
plot_corr_heatmap(df, 'spearman')

> **Such a beauty. The correlation matrix heatmap reveals interesting facts. Notice the darker cells in the heatmap: they mark a stronger relationship between dimensions. The darker the cell, the larger the absolute correlation; dark red is strongly positive and dark blue is strongly negative.**
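
Reading dark cells off a large heatmap by eye gets tiring, so the same information can be pulled out as a table. The sketch below is one possible way to do it. `top_correlated_pairs` is a helper name I made up, the 0.9 threshold is an arbitrary choice, and the demo frame is synthetic, with column names that only loosely imitate the real ones.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, threshold=0.9, method="pearson"):
    """Return feature pairs whose absolute correlation is at least `threshold`."""
    corr = frame.select_dtypes(include="number").corr(method=method)
    cols = corr.columns
    # Indices of the strict upper triangle, so each pair is listed exactly once
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "feature_a": cols[i],
        "feature_b": cols[j],
        "corr": corr.to_numpy()[i, j],
    })
    strong = pairs[pairs["corr"].abs() >= threshold]
    # Strongest relationships first, regardless of sign
    order = strong["corr"].abs().sort_values(ascending=False).index
    return strong.loc[order].reset_index(drop=True)

# Synthetic demo: perimeter is almost a linear function of radius, texture is independent
rng = np.random.default_rng(0)
radius = rng.normal(14, 3.5, size=300)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 1, size=300),
    "texture": rng.normal(19, 4, size=300),
})

strong_pairs = top_correlated_pairs(toy, threshold=0.9)
print(strong_pairs)
```

On the real data the call would be `top_correlated_pairs(df.drop(columns=['id']))`, assuming `df` is loaded as above.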

In [None]:
# Sort the feature columns by their mean value (keep 'diagnosis' first, drop the 'id' column)
feature_means = df.drop(columns=['id', 'diagnosis']).mean().sort_values()
sorted_mean_df = df[['diagnosis'] + list(feature_means.index)]

**Let's have a look at how each dimension is distributed.**

In [None]:
plt.figure(figsize=(20,20))
sns.boxplot(data= sorted_mean_df.drop(['diagnosis','texture_se','perimeter_se','radius_se'],axis=1) ,width=0.3 , saturation=0.9,orient="h")

It's clearly visible that the dimensions are centered on very different values and have very different spreads. If you look closely, most of them are not even close to a normal distribution.
**Let's have a look at the dimensions that deviate from the normal distribution pattern.**

In [None]:
df.columns

In [None]:
plt.figure(figsize=(21,15))
sns.boxplot(data= sorted_mean_df[['radius_mean','symmetry_mean','compactness_se','concavity_se','concave points_se','texture_worst','smoothness_worst','compactness_worst','symmetry_worst', 'fractal_dimension_worst']],
                                   orient='h',
                                   width=0.2,
                                   saturation=0.9)

If you look closely at the box plots of the selected dimensions above, the median line sits off-center inside the box, which indicates either positive or negative skew. For a symmetric distribution, the median stays close to the midpoint between Q1 and Q3.
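
Skew can also be put into numbers rather than judged from the plot. The following is a small sketch on synthetic data. The columns `symmetric` and `right_skewed` are invented for the illustration. It compares mean, median and the sample skewness from pandas `DataFrame.skew()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two hypothetical columns: one roughly normal, one with a long right tail
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=10, scale=2, size=2000),
    "right_skewed": rng.lognormal(mean=0, sigma=0.8, size=2000),
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),  # near 0 for symmetric data, positive for a right tail
})
print(summary.round(3))
```

For the right-skewed column the mean is pulled above the median by the tail, which is the same asymmetry the box plot shows. Running `df.skew(numeric_only=True)` on the real data (assuming a pandas version that accepts that argument) would give the per-feature values.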

Effect of a Box-Cox transformation on the skew of `radius_mean`:

In [None]:
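from scipy import stats

# Box-Cox needs strictly positive input; it returns the transformed values and the fitted lambda
radius_boxcox, fitted_lambda = stats.boxcox(df['radius_mean'])

# Before the transformation
plt.figure(figsize=(21,5))
sns.boxplot(x=df['radius_mean'], width=0.3)
# After the transformation
plt.figure(figsize=(21,5))
sns.boxplot(x=radius_boxcox, width=0.3)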
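
For reference, Box-Cox maps each positive value x to (x^λ - 1) / λ when λ is not zero and to log(x) when λ is zero. `scipy.stats.boxcox` picks λ by maximum likelihood when none is passed. Here is a minimal sketch of the effect on synthetic, strictly positive, right-skewed data. The lognormal sample is just a stand-in, not the real `radius_mean` column.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in: strictly positive values with a long right tail
sample = rng.lognormal(mean=2.5, sigma=0.5, size=1000)

# With no lambda given, boxcox returns (transformed values, fitted lambda)
transformed, fitted_lambda = stats.boxcox(sample)

skew_before = stats.skew(sample)
skew_after = stats.skew(transformed)
print(f"lambda={fitted_lambda:.3f}, skew before={skew_before:.3f}, skew after={skew_after:.3f}")
```

Lognormal data becomes normal under a log transform, so the fitted λ should land near zero here and the skewness should shrink toward zero.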

In [None]:
plt.figure(figsize=(15,5))
sns.histplot(df['radius_mean'], kde=True, stat='density')  # distplot is deprecated since seaborn 0.11

In [None]:
from sklearn.preprocessing import StandardScaler

selected_cols = ['radius_mean','symmetry_mean','compactness_se','concavity_se','concave points_se','texture_worst','smoothness_worst','compactness_worst','symmetry_worst', 'fractal_dimension_worst']

# Standardize each selected column to zero mean and unit variance
scaler = StandardScaler()
dataframe = pd.DataFrame(scaler.fit_transform(df[selected_cols]), columns=selected_cols)
dataframe.head()

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(dataframe['radius_mean'], kde=True, stat='density')  # distplot is deprecated since seaborn 0.11

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df['radius_mean'], kde=True, stat='density')  # distplot is deprecated since seaborn 0.11