![](https://storage.googleapis.com/kaggle-datasets-images/180/384/3da2510581f9d3b902307ff8d06fe327/dataset-cover.jpg)
<br><br>This notebook walks through basic statistical analysis steps, using the Wisconsin Breast Cancer dataset as an example. If you want to classify the tumors instead, have a look at my other notebook [Logistic Regression with Breast Cancer Data](https://www.kaggle.com/redwankarimsony/logistic-regression-with-breast-cancer-data) where I implemented gradient descent from scratch to classify benign and malignant breast cancer. 

If this notebook helps you, please do <font color='red'> **UPVOTE...** </font>

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use("ggplot")

import plotly.graph_objects as go

from IPython.display import clear_output
import warnings
warnings.filterwarnings("ignore")

data = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
data.drop(['Unnamed: 32','id'], axis = 1, inplace=True)
data.head()

## <font color='blue'>Histogram:</font> 
A histogram shows how many times each value appears in a dataset. This description is called the distribution of the variable.
The most common way to represent the distribution of a variable is a histogram, a graph that shows the frequency of each value, where frequency = number of times the value appears.
Example: `[1,1,1,1,2,2,2]`. The frequency of 1 is `four` and the frequency of 2 is `three`.
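
As a minimal sketch of the frequency idea, the toy list from the example above can be counted with `np.unique`. The variable names here are only illustrative and are not used elsewhere in the notebook.

```python
import numpy as np

# Toy data from the example above (not the breast cancer dataset)
values = np.array([1, 1, 1, 1, 2, 2, 2])

# return_counts=True gives each distinct value together with its frequency
unique_values, counts = np.unique(values, return_counts=True)
frequency = {int(v): int(c) for v, c in zip(unique_values, counts)}
print(frequency)  # {1: 4, 2: 3}
```

A histogram does the same counting, except that continuous values are first grouped into bins.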

In [None]:
radius_mean_m = data[data['diagnosis'] =='M']['radius_mean']
radius_mean_b = data[data['diagnosis'] =='B']['radius_mean']

fig = go.Figure()
fig.add_trace(go.Histogram(x=radius_mean_b, name = 'Benign'))
fig.add_trace(go.Histogram(x=radius_mean_m, name = 'Malignant'))

# Overlay both histograms
fig.update_layout(title = 'Histogram Comparison of radius_mean', 
                  title_x = 0.5,
                  xaxis_title ='Radius Mean Value ->',
                  yaxis_title = 'Value Counts ->',
                  barmode='overlay')

# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()
print(f'Mean of radius_mean values (Benign): {radius_mean_b.mean()}')
print(f'Mean of radius_mean values (Malignant): {radius_mean_m.mean()}')

### &#9673;<font color = 'blue'> Understanding Histogram Plot </font>
1. From this graph you can see that the `radius_mean` of malignant tumors is mostly larger than the `radius_mean` of benign tumors.
2. The benign distribution (blue in the graph) is approximately bell-shaped, which is the shape of a normal (Gaussian) distribution. The same is not true for the malignant class, whose `radius_mean` distribution is more irregular and spread out.
3. The printed means also show that the mean value of the malignant class is higher than that of the benign class. 




##  <font color = 'blue'> Finding Outliers </font>
1. Looking at the histogram, you can see that there are rare values in the malignant distribution (red in the graph).
2. These values can be errors or rare events. Such errors and rare events are called outliers.
3. **Calculating outliers:**
* First calculate the first quartile $(Q1)(25\%)$ and the third quartile $(Q3)(75\%)$
* Then find `IQR (interquartile range) = Q3 - Q1`
* Finally compute `Q1 - 1.5*IQR` and `Q3 + 1.5*IQR`
* Anything outside this range is an outlier. Let's write the code for the benign tumor distribution of the feature `radius_mean`

For visually inspecting the outliers, [Box Plot](https://plotly.com/python/box-plots/) is an excellent choice. 

In [None]:
radius_mean_m = data[data['diagnosis'] =='M']['radius_mean']
radius_mean_b = data[data['diagnosis'] =='B']['radius_mean']

# Calculating quartile points
# (access by label: positional access like desc[4] is deprecated in recent pandas)
desc = radius_mean_b.describe()
print(desc)
Q1 = desc['25%']
Q3 = desc['75%']
IQR = Q3-Q1
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

# Finding Outliers
a = radius_mean_b[radius_mean_b < lower_bound].values 
b = radius_mean_b[radius_mean_b > upper_bound].values
outliers = np.concatenate([a,b], axis = 0)

print("Anything outside this range is an outlier: (", lower_bound ,",", upper_bound,")")
print(f'Outliers: {outliers}')

## <font color='blue'>Box Plots:</font> 
A boxplot is a standardized way of displaying the dataset based on a five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.

* **Minimum:** the lowest data point excluding any outliers.
* **Maximum:** the largest data point excluding any outliers.
* **Median (Q2 / 50th percentile):** the middle value of the dataset.
* **First quartile (Q1 / 25th percentile):** also known as the lower quartile $q_n(0.25)$, is the median of the lower half of the dataset.
* **Third quartile (Q3 / 75th percentile):** also known as the upper quartile $q_n(0.75)$, is the median of the upper half of the dataset.

The `interquartile range` or `IQR` is not part of the five-number summary above, but it is needed to construct the box plot because it determines how far the whiskers (the minimum and maximum excluding outliers) may extend. It is defined below:

**Interquartile range (IQR):** is the distance between the upper and lower quartiles.

$${\displaystyle {\text{IQR}}=Q_{3}-Q_{1}=q_{n}(0.75)-q_{n}(0.25)}$$
A boxplot is constructed of two parts, a box and a set of whiskers shown in the following figure. The lowest point is the minimum of the data set and the highest point is the maximum of the data set. The box is drawn from `Q1` to `Q3` with a horizontal line drawn in the middle to denote the median.
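
To make these definitions concrete, here is a small sketch that computes the box plot ingredients by hand. The `sample` array is made-up toy data, and the quartiles come from `np.percentile` with its default interpolation. Plotting libraries may use a slightly different quartile convention, so treat the numbers as illustrative.

```python
import numpy as np

# Hypothetical toy sample with one extreme value
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles and interquartile range
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond them are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
whisker_min, whisker_max = inside.min(), inside.max()
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("Whiskers:", whisker_min, whisker_max)
print("Outliers:", outliers)
```

Here the value 30 falls above the upper fence, so it would be drawn as a separate point instead of stretching the whisker.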

In [None]:
radius_mean_m = data[data['diagnosis'] =='M']['radius_mean']
radius_mean_b = data[data['diagnosis'] =='B']['radius_mean']

fig = go.Figure()
fig.add_trace(go.Box(y=radius_mean_m, name='Malignant', marker_color = 'indianred'))
fig.add_trace(go.Box(y=radius_mean_b, name = 'Benign', marker_color = 'lightseagreen'))

fig.update_layout(title='Distribution of radius_mean for Benign and Malignant Class',
                  title_x = 0.5,
                  xaxis_title = 'Feature',
                  yaxis_title = 'Value',
                  height = 400,
                  width = 800)
fig.show()

## <font color='blue'> Summary Statistics</font>
* Mean: average value of the distribution
* Variance: spread of the distribution
* Standard deviation: square root of the variance
* Let's look at the summary statistics of all the feature columns in the given dataset. 

In [None]:
data.describe()
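
To connect the definitions above with actual numbers, here is a small sketch on made-up toy data (the `toy` array is hypothetical). Note that `np.var` and `np.std` divide by `n` by default, while `describe()` in pandas reports the sample standard deviation, which divides by `n - 1`. Passing `ddof=1` to NumPy gives the same convention.

```python
import numpy as np

# Hypothetical toy data
toy = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = toy.mean()
variance = toy.var()        # population variance (divides by n)
std = toy.std()             # population standard deviation
sample_std = toy.std(ddof=1)  # sample standard deviation (divides by n - 1)

print("mean:", mean)              # 5.0
print("variance:", variance)      # 4.0
print("standard deviation:", std) # 2.0
print("sample standard deviation:", sample_std)
```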

## <font color='blue'>CDF (Cumulative Distribution Function) </font>
1. The cumulative distribution function is the probability that the variable takes a value less than or equal to $x$: $P(X \le x)$
2. Let's explain it using the CDF graph of the benign `radius_mean`.
In the graph, what is $P(X \le 12)$? Reading the curve at 12 gives approximately $0.5$. 
3. In other words, the probability that the variable takes a value less than or equal to 12 (radius mean) is about 0.5.
One way to plot the CDF from the data is with an empirical CDF function:

In [None]:
def ecdf(x):
    x = np.sort(x)
    def result(v):
        return np.searchsorted(x, v, side='right') / x.size
    return result

fig = go.Figure()
fig.add_scatter(x=np.unique(radius_mean_b), 
                y=ecdf(radius_mean_b)(np.unique(radius_mean_b)), 
                line_shape='hv')

fig.update_layout(title='CDF Curve for the feature radius_mean', title_x = 0.5,   
                  xaxis_title = 'Radius Mean Value',
                  yaxis_title = 'CDF',
                  height = 400, width = 600)
fig.show()
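
Another common way to build an empirical CDF is to sort the values and pair the i-th smallest value with `i / n`. The sketch below uses randomly generated stand-in data (an assumption, not the breast cancer dataset) so that it runs on its own. With the real data you could pass `radius_mean_b` instead of `sample`.

```python
import numpy as np

# Hypothetical stand-in for a feature column
rng = np.random.default_rng(0)
sample = rng.normal(loc=12.0, scale=2.0, size=500)

# Sorted values on the x-axis, cumulative fractions on the y-axis
x_sorted = np.sort(sample)
y = np.arange(1, x_sorted.size + 1) / x_sorted.size

# P(X <= 12) is simply the fraction of values that are <= 12
p_le_12 = np.mean(sample <= 12.0)
print("P(X <= 12) =", p_le_12)
```

Plotting `x_sorted` against `y` as a step curve gives the same kind of CDF graph as above.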

## <font color='blue'>Relationship Between Variables</font>
1. We can say that two variables are related to each other if one of them gives information about the other.
For example, price and distance: if you travel a long distance by taxi, you pay more. Therefore, we can say that price and distance are positively related.

2. A scatter plot is the simplest way to check the relationship between two variables. Let's look at the relationship between `radius_mean` and `area_mean`.

3. In the following scatter plot you can see that when `radius_mean` increases, `area_mean` also increases. Therefore, they are positively correlated with each other.

In [None]:
# jointplot creates its own figure, so set the size with `height`
# instead of calling plt.figure (which would leave an empty figure behind)
g = sns.jointplot(x = data['radius_mean'], y= data['area_mean'] ,kind="reg", color='green', height=8)
plt.suptitle('Relation between radius_mean and area_mean', y=1.02)
g.ax_joint.grid(True)
plt.show()

g = sns.jointplot(x = data['radius_mean'], y= data['texture_mean'] ,kind="reg", color='crimson', height=8)
plt.suptitle('Relation between radius_mean and texture_mean', y=1.02)
g.ax_joint.grid(True)
plt.show()

### &#9673; <font color='blue'>Observation:</font> 
1. In the first figure (green), the features `radius_mean` and `area_mean` are highly dependent on each other, so they are strongly positively correlated. 
2. In the second figure (crimson), `radius_mean` and `texture_mean` show only a weak dependency on each other.

It is also possible to observe the relation between more than two features.





## <font color='blue'>Multiple Feature Correlation:</font> 

In [None]:
# We can also look at the relationship between more than two features
sns.set(style = "darkgrid")
# df = data.loc[:,["radius_mean","area_mean","fractal_dimension_se" ]]
df = data[['diagnosis', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'radius_mean']]
g = sns.PairGrid(df, hue = 'diagnosis',diag_sharey = False)
g.map_upper(sns.regplot )
g.map_lower(sns.kdeplot, color = 'blue')
g.map_diag(sns.kdeplot, color = 'green')
plt.show()

Similar plots can also be produced with the `sns.pairplot()` function. There is an interesting observation here: the Benign data follow one relationship and the Malignant data follow another.

In [None]:
df = data[['diagnosis', 'smoothness_mean', 'compactness_mean', 'concavity_mean']]
g = sns.pairplot(df, hue = 'diagnosis')
g.map_lower(sns.kdeplot, color = 'blue')
g.map_upper(sns.regplot)

## <font color='blue'> Correlation (Pearson) </font>
A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables or features. The variables may be two columns of a given data set of observations, often called a sample, or two components of a multivariate random variable with a known distribution. 

Pearson correlation coefficient `(r)`: 

<font color='blue'> $$r =\frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2} \sum\left(y_{i}-\bar{y}\right)^{2}}}$$ </font>
where <br>
$r$	=	correlation coefficient <br>
$x_{i}$	=	values of the x-variable in a sample<br>
$\bar{x}$	=	mean of the values of the x-variable<br>
$y_{i}$	=	values of the y-variable in a sample<br>
$\bar{y}$	=	mean of the values of the y-variable
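
To see how the formula works term by term, here is a minimal sketch that implements it directly with NumPy and compares the result with `np.corrcoef`. The function name `pearson_r` and the toy lists are hypothetical and only for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Hypothetical toy data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print("r from the formula:", r)
print("r from np.corrcoef:", np.corrcoef(x, y)[0, 1])
```

Both lines should print the same value, since `np.corrcoef` computes the same coefficient.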

Let's have a look at the correlation of all the features. 



In [None]:
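f,ax=plt.subplots(figsize = (18,18))
# Keep only numeric columns: `diagnosis` holds strings and cannot be correlated
sns.heatmap(data.select_dtypes(include='number').corr(method='pearson'),annot= True,linewidths=0.6,fmt = ".1f",ax=ax)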
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Pearson Correlation Map')
plt.savefig('graph.png')
plt.show()

### &#9673; <font color='blue'> Understanding the Pearson Correlation Matrix: </font>
It's a huge matrix that includes a lot of numbers. To understand it, you need to know several things first. 
1. The range of these numbers is -1 to 1.
2. A value close to 1 means the two variables are strongly positively correlated, like `radius_mean` and `area_mean`.
3. A value close to `zero` means there is little or no linear correlation between the variables, like `radius_mean` and `fractal_dimension_se`.
4. A value close to -1 means the two variables are strongly negatively correlated. For example, `radius_mean` and `fractal_dimension_mean` have a correlation of about -0.3 rather than -1, but the negative sign already tells you that the correlation is negative.

## <font color='blue'> Spearman's Rank Correlation </font>
In statistics, Spearman's rank correlation coefficient or Spearman's rho, named after Charles Spearman and often denoted by the Greek letter $\rho$ or as $r_s$, is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function. 

$$\rho=1-\frac{6 \sum d_{i}^{2}}{n (n^{2}-1)}$$
where, <br>
$\rho	$ =	Spearman's rank correlation coefficient <br>
$d_{i}	$ =	difference between the two ranks of each observation <br>
$n	$ =	number of observations
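
As a rough sketch, the formula can be implemented with NumPy as below. The function name `spearman_rho` and the toy lists are hypothetical. This simple ranking assumes there are no tied values, because the $d_i$ formula is only exact when all ranks are distinct.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rho from the rank-difference formula (assumes no ties)."""
    x = np.asarray(x)
    y = np.asarray(y)
    # argsort of argsort gives 0-based ranks; add 1 for ranks 1..n
    rank_x = np.argsort(np.argsort(x)) + 1
    rank_y = np.argsort(np.argsort(y)) + 1
    d = rank_x - rank_y
    n = x.size
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Hypothetical toy data
x = [10, 20, 30, 40, 50]
y = [1, 8, 27, 64, 125]   # grows with x, but not linearly
z = [3, 1, 4, 2, 5]       # only partly follows the order of x

print("rho(x, y):", spearman_rho(x, y))  # 1.0
print("rho(x, z):", spearman_rho(x, z))  # 0.5
```

The pair `x`, `y` is not linear, yet rho is 1 because the order of the values agrees perfectly. This is the difference between a monotonic and a linear relationship.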


|Pearson Moment Correlation | Spearman Rank-order Correlation |
|:--|:--|
|The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable. | The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data. | 
|For example, you might use a Pearson correlation to evaluate whether increases in temperature  at your production facility are associated with decreasing thickness of your chocolate coating.| Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a Spearman correlation to evaluate whether the order in  which employees complete a test exercise is related to the number of months they  have been employed.|

In [None]:
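f,ax=plt.subplots(figsize = (18,18))
# Keep only numeric columns: `diagnosis` holds strings and cannot be correlated
sns.heatmap(data.select_dtypes(include='number').corr(method='spearman'),annot= True,linewidths=0.6,fmt = ".1f",cmap="YlGnBu_r", ax=ax)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Spearman Correlation Map')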
plt.savefig('graph.png')
plt.show()

## <font color='blue'> Mean VS Median </font>
* Sometimes we need to use the median instead of the mean. I am going to explain why with an example.

* Let's assume that 10 people work in a company. The boss will give them a raise if their mean salary is smaller than 5.



In [None]:
salary = [1,4,3,2,5,4,2,3,1,500]
print("Mean of salary: ",np.mean(salary))

* The mean salary is 52.5, so the boss thinks "I already pay my employees a lot" and does not give them a raise.
* However, this is not fair, because 500 (salary) is an outlier in this salary list.
The median is not distorted by outliers.

In [None]:
print("Median of salary: ",np.median(salary))

<a id="14"></a> <br>
## Normal(Gaussian) Distribution and z-score
* Also called the bell-shaped distribution
* Instead of giving a formal definition of the Gaussian distribution, I want to explain it with an example.
* The classic example of a Gaussian distribution is the IQ score.
    * Let's say the average IQ in the world is 110.
    * A few people are super intelligent and their IQs are higher than 110. It can be 140 or 150, but that is rare.
    * A few people have low intelligence and their IQ is lower than 110. It can be 40 or 50, but that is rare.
    * From this information we can say that the mean IQ is 110. Let's also say the standard deviation is 20.
    * The mean and standard deviation are the parameters of the normal distribution.
    * Let's create 100000 samples and visualize them with a histogram.

In [None]:
# parameters of normal distribution
mu, sigma = 110, 20  # mean and standard deviation
s = np.random.normal(mu, sigma, 100000)
print("mean: ", np.mean(s))
print("standard deviation: ", np.std(s))
# visualize with histogram
plt.figure(figsize = (10,7))
plt.hist(s, bins=100)
plt.ylabel("frequency")
plt.xlabel("IQ")
plt.title("Histogram of IQ")
plt.show()

* As can be seen from the histogram, most people are concentrated near 110, which is the mean of our normal distribution.
* But what does "most" mean in the previous sentence? What if I want to know what percentage of people have an IQ score between 80 and 140?
* We will use the z-score to answer this question. 
      * z = (x - mean)/std 
      * z1 = (80-110)/20 = -1.5
      * z2 = (140-110)/20 = 1.5
      * The distance between the mean and 80 is 1.5 std, and the distance between the mean and 140 is also 1.5 std.
      * If you look at the z table, you will see that 1.5 std corresponds to 0.4332
 <a href="https://ibb.co/hys6OT"><img src="https://preview.ibb.co/fYzWq8/123.png" alt="123" border="0"></a>
      * Multiply it by 2, because one part runs from 80 to the mean and the other from the mean to 140
      * 0.4332 * 2 = 0.8664
      * 86.64% of people have an IQ between 80 and 140.
  <a href="https://ibb.co/fhc6OT"><img src="https://preview.ibb.co/bUi2xo/hist.png" alt="hist" border="0"></a>

  * What percentage of people have an IQ score less than 80?
* z = (80-110)/20 = -1.5
* The normal distribution is symmetric, so we can look up 1.5 in the z table, which gives 0.4332. 43.32% of people have an IQ between 80 and the mean (110).
* If we subtract 43.32% from 50%, we find the percentage of people who have an IQ score less than 80.
* 50 - 43.32 = 6.68. As a result, 6.68% of people have an IQ score less than 80.
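
The z table lookups above can also be checked in code. This is a small sketch using only the standard library: the helper `normal_cdf` is a hypothetical name, and it relies on the identity $\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$ for the normal CDF.

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma  # z-score
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 110, 20  # same parameters as the IQ example

# Probability of an IQ between 80 and 140
p_between = normal_cdf(140, mu, sigma) - normal_cdf(80, mu, sigma)
# Probability of an IQ below 80
p_below_80 = normal_cdf(80, mu, sigma)

print(f"P(80 < IQ < 140) = {p_between:.4f}")  # 0.8664
print(f"P(IQ < 80)       = {p_below_80:.4f}")  # 0.0668
```

Both results agree with the values read from the z table.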