# Overview

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

It can also be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
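The field layout described above can be sketched in a few lines. This is only an illustration: the label strings (for example `radius (mean)`) are made up here and are not the column names of any particular CSV file.

```python
# Base feature names, in the order listed above (a through j).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# The three statistics, in the order their blocks of fields appear.
statistics = ["mean", "SE", "worst"]

# Fields 1 and 2 are the ID and the diagnosis, so features start at field 3.
field_names = {}
field = 3
for stat in statistics:
    for feature in base_features:
        field_names[field] = f"{feature} ({stat})"
        field += 1

print(field_names[3], field_names[13], field_names[23], sep=" | ")
```

Ten features times three statistics gives the 30 feature fields, numbered 3 through 32.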

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

# Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from pathlib import Path
from scipy.stats import skew, kurtosis, pearsonr, spearmanr, ttest_rel

%matplotlib inline

# Importing Data

In [None]:
path = Path('../input/breast-cancer-wisconsin-data/data.csv')
data = pd.read_csv(path)

In [None]:
data.head()

In [None]:
# Let's have a quick look at the shape of the data
data.shape

In [None]:
# Drop the unneeded 'Unnamed: 32' column from the data
data = data.drop(columns=['Unnamed: 32'])

In [None]:
data.columns

## Histogram

A histogram shows how frequently values in each range appear in the dataset, here the frequency of area values for malignant and benign tumors.
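As a rough sketch of what a histogram computes, the snippet below bins a synthetic sample with `np.histogram`. The values are randomly generated stand-ins, not the real data. The point to notice is that the returned counts are frequencies, so their maximum is the height of the tallest bar rather than the largest observed value.

```python
import numpy as np

# Synthetic values standing in for a feature column (not the real data).
rng = np.random.default_rng(0)
values = rng.normal(loc=500.0, scale=100.0, size=200)

# counts[i] is the number of values falling between edges[i] and edges[i + 1].
counts, edges = np.histogram(values, bins=30)

tallest_bin_count = counts.max()  # highest frequency: a count of samples
largest_value = values.max()      # largest observation: in data units

print("Tallest bin holds", tallest_bin_count, "values")
print("Largest value is", round(largest_value, 1))
```

`plt.hist` returns the same kind of counts and bin edges as its first two return values, followed by the drawn patches.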

In [None]:
plt.figure(figsize=(10, 6))
M = plt.hist(data[data.diagnosis == 'M'].area_mean, bins=30, label='Malignant', alpha=0.5, color='#b967ff')
B = plt.hist(data[data.diagnosis == 'B'].area_mean, bins=30, label='Benign', alpha=0.6, color='#ff6f69')
plt.legend()
plt.xlabel('Mean Area Values', fontsize=13)
plt.ylabel('Frequency', fontsize=13)
plt.title('Histogram of Frequency of Mean Area of Tumors', fontsize=16)
plt.show()

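# Maximum mean area of malignant and benign tumors, taken from the column
# values (M[0] and B[0] above hold bin counts, not areas)

print('The Maximum Mean Area for Malignant Tumor is', data[data.diagnosis == 'M'].area_mean.max())
print('The Maximum Mean Area for Benign Tumor is', data[data.diagnosis == 'B'].area_mean.max())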

In [None]:
plt.figure(figsize=(10, 6))
M = plt.hist(data[data.diagnosis == 'M'].area_worst, bins=30, label='Malignant', alpha=0.4, color='#4682b4')
B = plt.hist(data[data.diagnosis == 'B'].area_worst, bins=30, label='Benign', alpha=0.4, color='#cc0000')
plt.legend()
plt.xlabel('Worst Area Values', fontsize=13)
plt.ylabel('Frequency', fontsize=13)
plt.title('Histogram of Frequency of Worst Area of Tumors', fontsize=16)
plt.show()

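# Maximum worst area of malignant and benign tumors, taken from the column
# values (M[0] and B[0] above hold bin counts, not areas)

print('The Maximum Worst Area for Malignant Tumor is', data[data.diagnosis == 'M'].area_worst.max())
print('The Maximum Worst Area for Benign Tumor is', data[data.diagnosis == 'B'].area_worst.max())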

#### From the above graphs we can observe that the mean area and worst area of benign tumors are roughly normally (Gaussian) distributed.

# Exploring the Outliers in the Data 

In [None]:
sns.set_style(style='whitegrid')

melted_data = pd.melt(data, id_vars='diagnosis', value_vars=['texture_mean', 'radius_mean'])
plt.figure(figsize=(15, 8))
sns.boxplot(x='variable', y='value', hue='diagnosis', data=melted_data,palette='plasma')
plt.show()

### This plot shows the outliers in texture_mean and radius_mean for malignant and benign tumors. These outliers can be rare events or errors.
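The points drawn individually in a box plot are, by default, those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule follows, assuming a small made-up sample; the helper name `iqr_outliers` is hypothetical and not part of any library.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return the values lying more than whis * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Made-up sample with one value far from the rest.
sample = [10, 11, 12, 12, 13, 13, 14, 15, 40]
print(iqr_outliers(sample))
```

Whether a flagged point is a rare event or a recording error cannot be decided from the rule alone; it only marks candidates for a closer look.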

# Descriptive Statistics

In [None]:
# Using data.describe() method we can look at all the summary statistics of the data.
data.describe()

#### Skewness and Kurtosis (Fisher's)

##### Skewness is a measure of the asymmetry of a distribution. This value can be positive or negative.

- A negative skew indicates that the tail is on the left side of the distribution, which extends towards more negative values.
- A positive skew indicates that the tail is on the right side of the distribution, which extends towards more positive values.
- A value of zero indicates that there is no skewness in the distribution at all, meaning the distribution is perfectly symmetrical.

##### Kurtosis is a measure of whether or not a distribution is heavy-tailed or light-tailed relative to a normal distribution.

The kurtosis of a normal distribution is 3.
If a given distribution has a kurtosis less than 3, it is said to be platykurtic, which means it tends to produce fewer and less extreme outliers than the normal distribution.
If a given distribution has a kurtosis greater than 3, it is said to be leptokurtic, which means it tends to produce more outliers than the normal distribution.
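The two kurtosis conventions can be compared directly. The sketch below assumes synthetic samples (a normal one and an exponential one) rather than the tumor data. `scipy.stats.kurtosis` returns Fisher's (excess) kurtosis by default, and passing `fisher=False` gives the Pearson definition, under which a normal distribution scores about 3.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Synthetic samples, used only to illustrate the definitions.
rng = np.random.default_rng(42)
normal_sample = rng.normal(size=10_000)
right_skewed = rng.exponential(size=10_000)

excess = kurtosis(normal_sample)                 # Fisher: normal is about 0
pearson = kurtosis(normal_sample, fisher=False)  # Pearson: normal is about 3

print("Fisher kurtosis of normal sample:", round(excess, 2))
print("Pearson kurtosis of normal sample:", round(pearson, 2))
print("Skewness of normal sample:", round(skew(normal_sample), 2))
print("Skewness of exponential sample:", round(skew(right_skewed), 2))
```

The two kurtosis values differ by exactly 3, so the thresholds above apply to the Pearson value, while the Fisher value is compared against 0.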

In [None]:
radius_mean_skew_malign = skew(data[data.diagnosis == "M"].radius_mean) 
print('Skewness of Malignant Tumor Radius Mean is', round(radius_mean_skew_malign, 2))

radius_mean_skew_benign = skew(data[data.diagnosis == "B"].radius_mean) 
print('Skewness of Benign Tumor Radius Mean is', round(radius_mean_skew_benign, 2))

In [None]:
radius_mean_kurt_malign = kurtosis(data[data.diagnosis == "M"].radius_mean) 
print('Kurtosis of Malignant Radius Mean is', round(radius_mean_kurt_malign, 2))

radius_mean_kurt_benign = kurtosis(data[data.diagnosis == "B"].radius_mean)
print('Kurtosis of Benign Radius Mean is', round(radius_mean_kurt_benign, 2))

# Note: the kurtosis calculated here is Fisher's (excess) kurtosis.
# A normal distribution has a kurtosis of 3, so Fisher's definition subtracts 3 to make a normal distribution score 0.

# Cumulative Distribution Function

Every cumulative distribution function F(x) is non-decreasing.

F(x) = 1 for every x at or above the largest value the variable can take.

The CDF ranges from 0 to 1.
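These properties can be checked on an empirical CDF. The sketch below assumes a synthetic normal sample (not the tumor data) and builds the CDF from the sorted values, without binning.

```python
import numpy as np

# Synthetic sample, used only to illustrate the CDF properties.
rng = np.random.default_rng(1)
sample = rng.normal(loc=15.0, scale=3.0, size=100)

# Empirical CDF: the fraction of observations less than or equal to each x.
x = np.sort(sample)
ecdf = np.arange(1, len(x) + 1) / len(x)

print("Non-decreasing:", bool(np.all(np.diff(ecdf) >= 0)))
print("Value at the largest observation:", ecdf[-1])
```

Plotting `ecdf` against `x` as a step function gives the same overall shape as a cumulative sum over histogram bins, at the resolution of individual observations.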

In [None]:
  
# getting data of the histogram
count, bins_count = np.histogram(data[data.diagnosis == 'M'].radius_mean, bins=10)
  
# finding the PDF of the histogram using count values
pdf = count / sum(count)
  
# using numpy np.cumsum to calculate the CDF
cdf = np.cumsum(pdf)
  
# plotting PDF and CDF
plt.figure(figsize=(10, 6))
plt.plot(bins_count[1:], pdf, color="red", label="PDF")
plt.plot(bins_count[1:], cdf, label="CDF")
plt.legend()
plt.title("PDF & CDF of Malignant Tumor Radius Mean", fontsize=16)
plt.xlabel(xlabel='Radius Mean(Malignant Tumor)', fontsize=13)
plt.ylabel(ylabel='Probability', fontsize=13)

plt.show()

In [None]:
  
# getting data of the histogram
count, bins_count = np.histogram(data[data.diagnosis == 'B'].radius_mean, bins=10)
  
# finding the PDF of the histogram using count values
pdf = count / sum(count)
  
# using numpy np.cumsum to calculate the CDF
cdf = np.cumsum(pdf)
  
# plotting PDF and CDF
plt.figure(figsize=(10, 6))
plt.plot(bins_count[1:], pdf, color="red", label="PDF")
plt.plot(bins_count[1:], cdf, label="CDF")
plt.legend()
plt.title("PDF & CDF of Benign Tumor Radius Mean", fontsize=16)
plt.xlabel(xlabel='Radius Mean(Benign Tumor)', fontsize=13)
plt.ylabel(ylabel='Probability', fontsize=13)

plt.show()

# Effect Size

### Difference Effect size (Cohen d)

- Effect size quantifies the size of an effect, i.e. the difference between the two groups.
- Cohen suggests that d = 0.2 is a small effect size, d = 0.5 a medium effect size, and d = 0.8 a large effect size.
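As a small worked example of these thresholds, assume two made-up groups of three values each (not taken from the dataset). Cohen's d is the difference in means divided by the pooled standard deviation, which can be computed step by step:

```python
import numpy as np

# Two made-up groups, chosen so the arithmetic is easy to follow.
group_a = np.array([2.0, 4.0, 6.0])  # mean 4, sample variance 4
group_b = np.array([1.0, 3.0, 5.0])  # mean 3, sample variance 4

n_a, n_b = len(group_a), len(group_b)

# Pooled variance: each sample variance weighted by its degrees of freedom.
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)

d = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var)
print("Cohen's d:", d)
```

Here the means differ by 1 and the pooled standard deviation is 2, so d lands on the medium threshold of 0.5.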

In [None]:
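def cohen_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    nx = len(x)
    ny = len(y)
    dof = nx + ny - 2
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / dof
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)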


In [None]:
print('Effect Size of Radius Mean between Malignant and Benign Tumor is', cohen_d(x=data[data.diagnosis == 'M'].radius_mean, y=data[data.diagnosis == "B"].radius_mean))

#### The effect size is about 2.2, well above Cohen's 0.8 threshold for a large effect, which suggests that the two groups differ substantially in mean radius.

In [None]:
print('Effect Size of Perimeter Mean between Malignant and Benign Tumor is', cohen_d(x=data[data.diagnosis == 'M'].perimeter_mean, y=data[data.diagnosis == "B"].perimeter_mean))

### Association effect Size (r - Strength of Association)

- The correlation value ranges between -1 and 1: -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.

In [None]:
plt.figure(figsize=(18,18))
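# Correlate the numeric columns only; 'diagnosis' holds string labels
dataplot = sns.heatmap(data.select_dtypes(include='number').corr(), cmap="YlGnBu", annot=True)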
plt.title('Correlation Matrix', fontsize=16)
plt.show()

### Pearson's Correlation Coefficient

- An absolute value of r around 0.1 is considered a low effect size.
- An absolute value of r around 0.3 is considered a medium effect size.
- An absolute value of r greater than 0.5 is considered a large effect size.


In [None]:
corr, p = pearsonr(data.radius_mean,data.area_mean)
print('Pearson\'s Correlation coefficient is: ',round(corr,2))
print('Pearson\'s Correlation p value is: ',p)


# The Pearson correlation is > 0.5, indicating a large effect size (radius mean and area mean are highly correlated).
# A p value < 0.05 indicates that the correlation is statistically significant.
# Note: a very small p value can underflow in floating point and be reported as 0.0

### Spearman's Rank Correlation

In [None]:
corr, p = spearmanr(data.radius_mean, data.area_mean)
print('Spearman\'s Correlation coefficient is: ',round(corr,2))
print('Spearman\'s Correlation p value is: ',p)

# The Spearman correlation is > 0.5, indicating a large effect size (radius mean and area mean are highly correlated).
# A p value < 0.05 indicates that the correlation is statistically significant.
# Note: a very small p value can underflow in floating point and be reported as 0.0
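The difference between the two coefficients shows up on a relationship that is monotonic but not linear. The sketch below assumes idealized, made-up values in which area is exactly pi times the radius squared; real measurements would not follow this perfectly.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Idealized values (not the dataset): area of a circle for each radius.
radius = np.linspace(1.0, 10.0, 50)
area = np.pi * radius ** 2

r, _ = pearsonr(radius, area)     # measures linear association
rho, _ = spearmanr(radius, area)  # measures monotonic (rank) association

print("Pearson r:", round(r, 3))
print("Spearman rho:", round(rho, 3))
```

Because the ranks of the two variables agree exactly, Spearman's coefficient reaches 1, while Pearson's stays slightly below 1 because the relationship curves.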

## Pairplots

In [None]:
data_pairplot = data[['diagnosis','radius_mean', 'area_mean', 'texture_mean', 'perimeter_mean', 'compactness_mean', 'symmetry_mean']]

In [None]:
data_pairplot.head()

In [None]:
sns.pairplot(data_pairplot, hue='diagnosis', markers=['o', 's'], corner=True, palette='plasma')
plt.show()

- This pairplot shows the positive correlation between area, perimeter and radius.
- It also suggests that area, perimeter, symmetry, compactness and radius are approximately normally distributed for benign tumors and close to normally distributed for malignant tumors.

# Covariance
- Covariance is a measure of the tendency of two variables to vary together.
- It is positive if they tend to move in the same direction.
- It is zero if they are uncorrelated (orthogonal).
- It is negative if they tend to move in opposite directions.
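Covariance depends on the units of the variables, which is why it is often rescaled into a correlation. A minimal sketch, assuming two synthetic variables with a built-in positive relationship (not columns of the dataset):

```python
import numpy as np

# Synthetic variables: y rises with x, plus some noise.
rng = np.random.default_rng(7)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)

# np.cov returns the 2x2 covariance matrix; [0, 1] is cov(x, y).
cov_xy = np.cov(x, y)[0, 1]

# Dividing by both standard deviations gives the Pearson correlation.
corr_from_cov = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
corr_direct = np.corrcoef(x, y)[0, 1]

# Rescaling y (for example, a change of units) rescales the covariance.
cov_scaled = np.cov(x, 100.0 * y)[0, 1]

print("cov(x, y):", round(cov_xy, 3))
print("correlation:", round(corr_from_cov, 3), "vs", round(corr_direct, 3))
print("cov(x, 100 * y):", round(cov_scaled, 3))
```

This is why entries of a covariance matrix for features on very different scales are hard to compare with one another, while correlations are not.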

In [None]:
# Covariance is defined for numeric columns only, so drop the 'diagnosis' labels
cov_matrix = data_pairplot.drop(columns='diagnosis').cov()
plt.figure(figsize=(10, 8))
sns.heatmap(cov_matrix, annot=True, fmt='g')
plt.show()

# Hypothesis Testing

##### Hypothesis testing is a way to draw statistical conclusions about a population from a sample that is much smaller than the population. A hypothesis is a statement about a parameter that we want to support or reject, hence the names:

- Null hypothesis (H0): the status quo [For example: applying a particular sunscreen does not change the rate of getting burnt]
- Alternative hypothesis (Ha, also written H1): the reason the data is being collected [For example: applying a particular sunscreen does change the rate of getting burnt]

The hypothesis I want to test, at an alpha level of 0.05, is whether mean radius and mean area differ from each other within a tumor class.

H0: There is no difference between mean radius and mean area of benign tumors.

H1: There is a significant difference between mean radius and mean area of benign tumors.

The same pair of hypotheses is then tested for malignant tumors.
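A paired t-test such as `ttest_rel` works on the per-row differences between its two inputs. The sketch below illustrates this, assuming synthetic paired measurements (`before` and `after` are made-up names and values). It shows that the paired test gives the same result as a one-sample t-test of the differences against zero, followed by the alpha decision rule.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Synthetic paired measurements: each 'after' value belongs to one 'before'.
rng = np.random.default_rng(3)
before = rng.normal(loc=10.0, scale=2.0, size=40)
after = before + rng.normal(loc=0.5, scale=1.0, size=40)

# Paired test on the two columns.
t_paired, p_paired = ttest_rel(before, after)

# Equivalent one-sample test on the row-wise differences.
t_diff, p_diff = ttest_1samp(before - after, popmean=0.0)

alpha = 0.05
decision = "reject H0" if p_paired <= alpha else "fail to reject H0"

print("paired:     t =", round(t_paired, 3), " p =", p_paired)
print("difference: t =", round(t_diff, 3), " p =", p_diff)
print("Decision at alpha = 0.05:", decision)
```

Because the test is built on row-wise differences, it requires both inputs to have the same length and to be matched row by row.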

In [None]:
alpha = 0.05
t_value , p_value = ttest_rel(data[data.diagnosis == "B"].radius_mean, data[data.diagnosis == "B"].area_mean)
print('p value', p_value)

In [None]:
if p_value <= alpha:
    print("We reject the null hypothesis.")
else:
    print("There is not enough evidence to reject the null hypothesis.")
    

In [None]:
alpha = 0.05
t_value , p_value = ttest_rel(data[data.diagnosis == "M"].radius_mean, data[data.diagnosis == "M"].area_mean)
print('p value', p_value)

In [None]:
if p_value <= alpha:
    print("We reject the null hypothesis.")
else:
    print("There is not enough evidence to reject the null hypothesis.")
    