# Disclaimer

With the range of opportunities that the dataset offers, it is important to take full advantage of it. In this regard, I present a short but illustrative analysis using two variables (columns): "radius_mean" and "texture_mean".

From the description of the data, I learned that 'radius_mean' is the "radius (mean of distances from center to points on the perimeter)", whereas 'texture_mean' is the "texture (standard deviation of gray-scale values)" (gray-scale values are pixel intensities, often normalized to the range 0 to 1).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

font = {'family' : 'sans-serif',  # 'normal' is not a font family name and triggers findfont warnings
        'weight' : 'bold',
        'size'   : 22}
plt.rc('font', **font)
plt.rcParams['figure.figsize'] = (15, 8)

from scipy import stats

In [None]:
df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isna().sum().sum()

From the above analysis, it's clear that the dataset is free from null values except for the 'Unnamed: 32' column, and that all of the measurement columns are floating-point numbers (the 'diagnosis' variable, of course, is not). We won't need the 'Unnamed: 32' column, so I remove it.

In [None]:
df.drop(columns = 'Unnamed: 32', inplace = True)  # the positional axis argument was removed in pandas 2.0
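
As a side note, the total from 'isna().sum().sum()' does not say which columns hold the nulls. The sketch below, on a small hypothetical frame ('toy' and its values are made up for illustration, not taken from the dataset), shows one way to locate columns that are entirely empty and drop them without hard-coding their names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are illustrative only
toy = pd.DataFrame({
    'radius_mean': [10.0, 15.0, 20.0],
    'diagnosis': ['B', 'M', 'M'],
    'Unnamed: 32': [np.nan, np.nan, np.nan],
})

# Null count per column rather than one grand total
null_counts = toy.isna().sum()

# Columns in which every single value is missing
empty_columns = null_counts[null_counts == len(toy)].index.tolist()

# Drop every all-null column in one call
cleaned = toy.dropna(axis = 1, how = 'all')

print(null_counts)
print(empty_columns)
print(cleaned.columns.tolist())
```

Dropping by condition instead of by name is only a convenience; for a single known column, the explicit 'drop' call works just as well.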

# 1. Analyzing the Radius Mean variable

To begin the univariate analysis, the best way in my view is to plot the variable's histogram, which helps us visualize its distribution.

In [None]:
df['radius_mean'].hist(figsize = (15, 8))

The distribution, though resembling the normal distribution, is skewed to the right. We'll return to this later. But how does the distribution vary between the 'Malignant' and the 'Benign' categories? Let's have a look.

In [None]:
m = df[df['diagnosis'] == 'M']
b = df[df['diagnosis'] == 'B']

In [None]:
m['radius_mean'].hist(figsize = (15, 8), alpha = 0.4, label = 'Malignant')
b['radius_mean'].hist(figsize = (15, 8), alpha = 0.4, label = 'Benign')
plt.legend()

It is quite clear that the distribution for the 'Malignant' cases sits further to the right than that for the 'Benign' cases. The mean of the radius mean for the 'Malignant' cases is therefore, understandably, greater than that of its 'Benign' counterpart.

## Maximum Likelihood Estimator 

The data that we have is a sample taken from a larger population. Therefore, it is not wise to assume that the sample mean equals the population mean. This is partly because our distribution contains outliers, which might distort the sample mean. In that case, the sample median might be a better estimate of the population mean. But how do we decide whether the mean or the median is better suited? For that I use the method of ***estimation***.

In [None]:
def Estimate(df, column, n = 7, m = 1000):
    mu = df[column].mean()
    
    means = []
    medians = []
    
    for _ in range(m):
        sample = df[column].sample(n, replace = True)
        mean = sample.mean()
        median = sample.median()
        
        means.append(mean)
        medians.append(median)
        
    print('RMSE of sample means: ', RMSE(means, mu))
    print('RMSE of sample medians: ', RMSE(medians, mu))
    
def RMSE(estimates, actual):
    error2 = [(estimate - actual) ** 2 for estimate in estimates]
    mse = np.mean(error2)
    return np.sqrt(mse)

Briefly speaking, estimation here means running the experiment again and again: each time we draw a sample from the data, compute its mean and its median, and store both values. Afterwards, we compute the RMSE (root mean squared error) of each set of estimates against the mean of the full data.
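
The same procedure can be sketched in vectorized form. This is a minimal illustration on synthetic data, not the notebook's dataset: the log-normal 'population' is a hypothetical stand-in for a right-skewed column, and the helper name 'rmse' is my own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a right-skewed column such as radius_mean
population = rng.lognormal(mean = 2.6, sigma = 0.25, size = 569)
mu = population.mean()

def rmse(estimates, actual):
    """Root mean squared error of a set of estimates around a target value."""
    estimates = np.asarray(estimates)
    return np.sqrt(np.mean((estimates - actual) ** 2))

# Draw m samples of size n (with replacement) in a single call: shape (m, n)
n, m = 7, 1000
samples = rng.choice(population, size = (m, n), replace = True)

# One mean and one median per simulated sample
rmse_mean = rmse(samples.mean(axis = 1), mu)
rmse_median = rmse(np.median(samples, axis = 1), mu)

print('RMSE of sample means:  ', rmse_mean)
print('RMSE of sample medians:', rmse_median)
```

Whichever estimator yields the lower RMSE is, on average, closer to the target under this simulation.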

In [None]:
Estimate(df, 'radius_mean')

For radius mean, the RMSE of the sample mean (xbar) is lower than that of the sample median. Therefore, the sample mean is the better estimator of the population mean (mu) in the minimum-RMSE sense. For normally distributed data, xbar is also the ***'maximum likelihood estimator'*** (MLE) of the population mean.

In [None]:
Estimate(m, 'radius_mean')
print('\n')
Estimate(df[df['diagnosis'] == 'B'], 'radius_mean')

For both the 'Malignant' and 'Benign' categories, the sample mean again has the lower RMSE, so it remains the preferred estimator.

## Sampling Distributions

Generally, whenever we estimate a statistic, we report two other quantities with it:

1. ***Standard Error:*** If we run the experiment again and again, or in this case, if we compute the mean of the radius mean while surveying a different set of patients (sample) every time, how much do we expect the mean to deviate? This deviation is called the standard error.

2. ***Confidence Interval:*** If we run the experiment again and again, what range do we expect the estimator to fall in? This range is known as the confidence interval.
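
For the sample mean, the simulated standard error can be checked against the textbook formula sigma / sqrt(n). The sketch below does this on synthetic data; the normal 'data' array is hypothetical, and it samples with replacement so that the formula applies exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data, not the notebook's dataset
data = rng.normal(loc = 14.0, scale = 3.5, size = 500)
n, m = 11, 2000

# Simulate: draw m samples of size n with replacement and record each mean
means = rng.choice(data, size = (m, n), replace = True).mean(axis = 1)

# Standard error: spread of the simulated means
simulated_se = means.std()

# Analytic counterpart: sigma / sqrt(n)
analytic_se = data.std() / np.sqrt(n)

# 90% confidence interval: 5th and 95th percentiles of the simulated means
ci_low, ci_high = np.percentile(means, [5, 95])

print('simulated SE:', simulated_se)
print('analytic SE: ', analytic_se)
print('90% CI:      ', (ci_low, ci_high))
```

The two standard errors should agree closely, and both shrink as n grows, which is why a larger sample gives a narrower interval.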

In [None]:
def SimulatedSample(df, column, n = 11, m = 1000):
    mu = df[column].mean()
    
    means = []
    
    for _ in range(m):
        sample = df[column].sample(n)
        xbar = sample.mean()
        
        means.append(xbar)
    
    ci = [np.percentile(means, 5), np.percentile(means, 95)]
    cdf = [Cdf(means, x) for x in sorted(means)]
    stderr = RMSE(means, mu)
    
    return stderr, ci, cdf, means

def Cdf(sample, x):
    count = 0
    for i in sample:
        if i <= x:
            count += 1
    return count / len(sample)

In [None]:
def PlotSimulated(df, column):
    stderr, ci, cdf, means = SimulatedSample(df, column)

    print("Std Error: ", stderr)
    print("90% Confidence Interval: ", ci)

    plt.plot(sorted(means), cdf, ds = 'steps', label = 'CDF of sample means after simulation')
    plt.axvline(ci[0], ls = ':', color = 'black', label = 'Confidence Interval')
    plt.axvline(ci[1], ls = ':', color = 'black')
    plt.axvline(df[column].mean(), ls = '-', color = 'maroon', label = 'population mean')
    plt.legend()

In [None]:
PlotSimulated(m, 'radius_mean')

In [None]:
PlotSimulated(df[df['diagnosis'] == 'B'], 'radius_mean')

There is a difference of almost 2 between the low and high ends of the CI for the 'Benign' cases, whereas for the 'Malignant' cases the difference is almost 4. Therefore, I conclude that the sample mean of the radius mean is a less precise estimate for the 'Malignant' cases than for the 'Benign' cases.

## Hypothesis Testing

At the beginning of the analysis, I concluded that the mean of the radius mean for the 'Malignant' cases is greater than that of its 'Benign' counterpart. However, there's a possibility that this effect occurred by chance. To investigate whether the effect holds for the population or has arisen by chance, we use ***'Hypothesis Testing'.***

I use the independent two-sample t-test from scipy's stats package ('ttest_ind'). Passing 'equal_var = False' selects Welch's variant, which does not assume that the two groups have equal variances.

Its documentation reads, **"This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values".**

In [None]:
rmm = df[df['diagnosis'] == 'M']['radius_mean'].values
rmb = df[df['diagnosis'] == 'B']['radius_mean'].values

test_stat, pval = stats.ttest_ind(rmm, rmb, equal_var = False)
print(f'{pval : .4f}')
print(test_stat)

The p-value prints as 0.0000 (that is, it is smaller than 0.00005), which means that we can reject the null hypothesis and conclude that the means of the radius mean of the two groups are not equal. The positive test statistic shows that the mean of the 'Malignant' distribution is greater than that of its 'Benign' counterpart.
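
To make the test statistic less of a black box, the sketch below computes Welch's t statistic by hand and compares it with scipy's result. The two groups are synthetic and hypothetical (they are not the notebook's data), and 'welch_t' is a helper name of my own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical groups with different means and variances
group_a = rng.normal(loc = 17.5, scale = 3.2, size = 212)
group_b = rng.normal(loc = 12.1, scale = 1.8, size = 357)

def welch_t(x, y):
    """Welch's t: difference in means divided by its estimated standard error."""
    vx = np.var(x, ddof = 1) / len(x)
    vy = np.var(y, ddof = 1) / len(y)
    return (np.mean(x) - np.mean(y)) / np.sqrt(vx + vy)

t_manual = welch_t(group_a, group_b)
t_scipy, pval = stats.ttest_ind(group_a, group_b, equal_var = False)

print(t_manual, t_scipy)
# Scientific notation reveals tiny p-values that a '.4f' format rounds to 0.0000
print(f'{pval:.3e}')
```

Printing the p-value in scientific notation is a small design choice that avoids the impression that it is exactly zero.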

# 2. Analyzing the Texture Mean variable

I won't write out the conclusions here, as the reader should be able to draw them by now.

In [None]:
df['texture_mean'].hist(figsize = (15, 8))

In [None]:
m['texture_mean'].hist(figsize = (15, 8), alpha = 0.4, label = 'Malignant')
df[df['diagnosis'] == 'B']['texture_mean'].hist(figsize = (15, 8), alpha = 0.4, label = 'Benign')
plt.legend()

## MLE

In [None]:
Estimate(df, 'texture_mean')

In [None]:
Estimate(m, 'texture_mean')
print('\n')
Estimate(df[df['diagnosis'] == 'B'], 'texture_mean')

## Sampling Distributions

In [None]:
PlotSimulated(m, 'texture_mean')

In [None]:
PlotSimulated(df[df['diagnosis'] == 'B'], 'texture_mean')

## Hypothesis Testing

In [None]:
tmm = df[df['diagnosis'] == 'M']['texture_mean'].values
tmb = df[df['diagnosis'] == 'B']['texture_mean'].values

test_stat, pval = stats.ttest_ind(tmm, tmb, equal_var = False)
print(f'{pval : .4f}')
print(test_stat)

# 3. Bivariate Analysis

Now that we have analyzed both variables individually, it's time for bivariate analysis, that is, comparing the variables and finding relationships between them.

To visualize relationships, scatter plots are a good way to begin.

In [None]:
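# Named radius/texture rather than a/b so the Benign DataFrame 'b' defined earlier is not overwritten
radius, texture = df['radius_mean'].values, df['texture_mean'].values

sns.scatterplot(x = radius, y = texture)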
plt.xlabel('Radius Mean')
plt.ylabel('Texture Mean')

From the scatter plot, it appears that there is little correlation between the two variables.

## Correlation

Another way to evaluate the relationship between the two variables is correlation. Since there seems to be a linear relationship, however small, between the variables, I employ Pearson's correlation. Covariance is affected by the scale of the data, which is why Pearson's correlation divides it by the two standard deviations; standardizing the variables beforehand makes this explicit.

In [None]:
def correlation(x, y):
    return covariance(x, y) / (np.std(x) * np.std(y))

def covariance(x, y):
    xbar = np.mean(x)
    ybar = np.mean(y)
    n = len(x)
    
    xs = [x_i - xbar for x_i in x]
    ys = [y_i - ybar for y_i in y]
    
    return np.dot(xs, ys) / n

In [None]:
rm = df['radius_mean'].values
tm = df['texture_mean'].values

rm_scaled = [(rm_i - np.mean(rm)) / np.std(rm) for rm_i in rm]
tm_scaled = [(tm_i - np.mean(tm)) / np.std(tm) for tm_i in tm]

print(correlation(rm_scaled, tm_scaled))
print(correlation(rm, tm))

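As is evident, there's no real difference between the correlation of the scaled and the unscaled data. This is because Pearson's correlation is invariant to shifting and positive rescaling of either variable: dividing by the standard deviations already removes the units. Using different units, say cm for radius mean and m for texture mean, would change the covariance drastically but leave the correlation unchanged.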
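
A quick check of what a change of units does to covariance and to correlation is sketched below. The data are synthetic and hypothetical, and the two helper functions simply mirror the definitions used in this notebook in vectorized form.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical, weakly related variables
x = rng.normal(size = 300)
y = 0.3 * x + rng.normal(size = 300)

def covariance(x, y):
    """Population covariance: mean product of deviations from the means."""
    return np.mean((x - np.mean(x)) * (y - np.mean(y)))

def correlation(x, y):
    """Pearson's r: covariance divided by both standard deviations."""
    return covariance(x, y) / (np.std(x) * np.std(y))

# Same quantity in a different unit, e.g. metres expressed as centimetres
x_rescaled = 100 * x

print('covariance: ', covariance(x, y), covariance(x_rescaled, y))
print('correlation:', correlation(x, y), correlation(x_rescaled, y))
```

The covariance grows by the rescaling factor, while the correlation stays where it was.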

Pearson's correlation is also most trustworthy when the variables are roughly normally distributed, since it is sensitive to skew and outliers. This is not entirely true for either variable. Let's measure the skewness of both. For that I use Pearson's median skewness.

In [None]:
def PearsonMedianSkewness(x):
    median = np.median(x)
    mean = np.mean(x)
    std = np.std(x)
    
    gp = 3 * (mean - median) / std
    
    return gp

In [None]:
print(PearsonMedianSkewness(rm))
print(PearsonMedianSkewness(tm))

It seems that both variables are somewhat skewed to the right. In that case, we compute the correlation on 'percentiles' (percentile ranks) instead of the scaled data, which is essentially Spearman's rank correlation.

In [None]:
def Percentile(sample, x):
    count = 0
    
    for i in sample:
        if i <= x:
            count += 1
    
    return 100 * count / len(sample)

In [None]:
rm_percentile = [Percentile(rm, x) for x in rm]
tm_percentile = [Percentile(tm, x) for x in tm]

print(correlation(rm_percentile, tm_percentile))

The correlation is still almost the same, and still small, which means that the two variables are only weakly correlated.
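
Correlating percentile ranks is the same calculation as Spearman's rank correlation, which scipy provides as 'spearmanr'. The sketch below illustrates the equivalence on synthetic, hypothetical data; 'rankdata' with 'method = "max"' counts the values less than or equal to each point, which is what the Percentile function above does.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical right-skewed, weakly related variables
x = rng.lognormal(size = 300)
y = 0.3 * x + rng.lognormal(size = 300)

# Percentile rank of each value within its own sample
x_pct = 100 * stats.rankdata(x, method = 'max') / len(x)
y_pct = 100 * stats.rankdata(y, method = 'max') / len(y)

# Pearson's r on the percentile ranks
r_percentiles = np.corrcoef(x_pct, y_pct)[0, 1]

# Spearman's rho on the raw values
rho, pval = stats.spearmanr(x, y)

print(r_percentiles, rho)
```

With tied values the two numbers can differ slightly, because 'spearmanr' assigns average ranks to ties.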

But is this effect representative of the population, or is it merely due to chance? I use scipy's 'pearsonr' function to test the hypothesis.

In [None]:
test_stat, pval = stats.pearsonr(rm_percentile, tm_percentile)

print(test_stat)
print(f'{pval: .4f}')

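The p-value prints as 0.0000, so we reject the null hypothesis that the two variables are uncorrelated: the observed correlation is unlikely to be a chance effect. Statistical significance says nothing about strength, though. The coefficient itself is still small, so we can conclude that the relationship between the two variables is real but weak.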
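
Whether a correlation could have arisen by chance can also be probed with a permutation test. This is a swapped-in technique shown for intuition, not what 'pearsonr' computes internally, and the data below are synthetic and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical, weakly related variables
x = rng.normal(size = 200)
y = 0.3 * x + rng.normal(size = 200)

observed = np.corrcoef(x, y)[0, 1]

# Shuffling y breaks any real association, which simulates the null hypothesis
n_iter = 2000
null_rs = np.array([
    np.corrcoef(x, rng.permutation(y))[0, 1]
    for _ in range(n_iter)
])

# Two-sided p-value: share of shuffled correlations at least as extreme as observed
pval = np.mean(np.abs(null_rs) >= abs(observed))

print('observed r:', observed)
print('permutation p-value:', pval)
```

A small p-value here means that shuffled data rarely produce a correlation as large as the observed one, regardless of how weak that correlation is in absolute terms.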

## Tests for normality

Even though it is now established that neither 'radius mean' nor 'texture mean' is perfectly normally distributed, we can perform some tests to verify this claim. I'll use the following:

1. KS test
2. Anderson Darling test
3. Normal probability plot

In [None]:
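## KS test, Null Hypothesis: the sample is drawn from the reference distribution
## 'norm' alone means the standard normal N(0, 1), so pass the sample mean and std as its parameters
## (estimating them from the same data makes the p value approximate)

ts_rm, pval_rm = stats.kstest(rm, 'norm', args = (np.mean(rm), np.std(rm)))
ts_tm, pval_tm = stats.kstest(tm, 'norm', args = (np.mean(tm), np.std(tm)))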

print('KS for radius mean (p value): ' f'{pval_rm: .4f}')
print('KS for texture mean (p value): ' f'{pval_tm: .4f}')

In [None]:
## Anderson-Darling test, Null Hypothesis: the sample is drawn from a population that follows a particular distribution

print('AD for radius mean: ', stats.anderson(rm, 'norm'), '\n')
print('AD for texture mean: ', stats.anderson(tm, 'norm'))

In [None]:
stats.probplot(rm, plot = plt)

In [None]:
stats.probplot(tm, plot = plt)

# 4. Linear Regression

Now I'll employ linear regression for model explanation.

In [None]:
from sklearn.linear_model import LinearRegression

X = df[df['diagnosis'] == 'B']['radius_mean'].values
y = df[df['diagnosis'] == 'B']['texture_mean'].values

lr = LinearRegression().fit(X.reshape(-1, 1), 
                            y.reshape(-1, 1))
print(lr.coef_, lr.intercept_)

The coefficient and intercept correspond to the slope and the intercept of the fitted line. A coefficient of "-0.0833" means that a unit increase in 'radius mean' (the feature) is associated with a decrease of about 0.0833 units in 'texture mean' (the target). In ML, a feature variable is generally denoted by 'X', whereas the target variable is denoted by 'y'.

The 'Malignant' cases can be interpreted in the same way.
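
For a single feature, the least-squares slope has a closed form: the covariance of X and y divided by the variance of X. The sketch below checks this on synthetic, hypothetical data against 'np.polyfit', which is used here as a stand-in least-squares fitter in place of 'LinearRegression'.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical feature and target with a small negative slope
X = rng.normal(loc = 12.0, scale = 1.8, size = 300)
y = 19.0 - 0.08 * X + rng.normal(scale = 4.0, size = 300)

# Closed form for one feature: slope = cov(X, y) / var(X)
slope = np.cov(X, y, ddof = 0)[0, 1] / np.var(X)

# The fitted line always passes through the point of means
intercept = y.mean() - slope * X.mean()

# Degree-1 least-squares fit; coefficients come back highest degree first
slope_fit, intercept_fit = np.polyfit(X, y, deg = 1)

print(slope, slope_fit)
print(intercept, intercept_fit)
```

Because the slope is a ratio of covariance to variance, it carries units (target units per feature unit), unlike the correlation coefficient.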

In [None]:
X = df[df['diagnosis'] == 'M']['radius_mean'].values
y = df[df['diagnosis'] == 'M']['texture_mean'].values

lr = LinearRegression().fit(X.reshape(-1, 1), 
                            y.reshape(-1, 1))
print(lr.coef_, lr.intercept_)

# 5. Reporting Final Results

Lastly, I summarize some of the most relevant facts (for a wider audience) using a text graph.

In [None]:
plt.text(0.1, 0.8, 'RADIUS MEAN', color = 'maroon')
plt.text(0.7, 0.8, 'TEXTURE MEAN', color = 'maroon')
plt.text(0.015, 0.7, 'MALIGNANT', color = 'indigo')
plt.text(0.3, 0.7, 'BENIGN', color = 'indigo')
plt.text(0.6, 0.7, 'MALIGNANT', color = 'indigo')
plt.text(0.9, 0.7, 'BENIGN', color = 'indigo')

plt.text(0.0, 0.6, 'POPULATION MEAN', color = 'green', fontsize = 'x-small')
plt.text(0.25, 0.6, 'POPULATION MEAN', color = 'green', fontsize = 'x-small')
plt.text(0.58, 0.6, 'POPULATION MEAN', color = 'green', fontsize = 'x-small')
plt.text(0.85, 0.6, 'POPULATION MEAN', color = 'green', fontsize = 'x-small')

plt.text(0.1, 0.55, 17.46, color = 'grey', fontsize = 'x-small', ha = 'center')
plt.text(0.35, 0.55, 12.14, color = 'grey', fontsize = 'x-small', ha = 'center')
plt.text(0.68, 0.55, 21.60, color = 'grey', fontsize = 'x-small', ha = 'center')
plt.text(0.95, 0.55, 17.91, color = 'grey', fontsize = 'x-small', ha = 'center')

plt.text(0.0, 0.47, 'STANDARD ERROR', color = 'green', fontsize = 'x-small')
plt.text(0.25, 0.47, 'STANDARD ERROR', color = 'green', fontsize = 'x-small')
plt.text(0.58, 0.47, 'STANDARD ERROR', color = 'green', fontsize = 'x-small')
plt.text(0.85, 0.47, 'STANDARD ERROR', color = 'green', fontsize = 'x-small')

plt.text(0.1, 0.42, 0.97, color = 'grey', fontsize = 'x-small', ha = 'center')
plt.text(0.35, 0.42, 0.54, color = 'grey', fontsize = 'x-small', ha = 'center')
plt.text(0.68, 0.42, 0.14, color = 'grey', fontsize = 'x-small', ha = 'center')
plt.text(0.95, 0.42, 1.19, color = 'grey', fontsize = 'x-small', ha = 'center')


plt.text(0.0, 0.34, 'CONFIDENCE INTERVAL', color = 'green', fontsize = 'x-small')
plt.text(0.25, 0.34, 'CONFIDENCE INTERVAL', color = 'green', fontsize = 'x-small')
plt.text(0.58, 0.34, 'CONFIDENCE INTERVAL', color = 'green', fontsize = 'x-small')
plt.text(0.85, 0.34, 'CONFIDENCE INTERVAL', color = 'green', fontsize = 'x-small')

plt.text(0.1, 0.29, '15.88 - 19.01', color = 'grey', fontsize = 'x-small', ha = 'center')
plt.text(0.35, 0.29, '11.18 - 13.03', color = 'grey', fontsize = 'x-small', ha = 'center')
plt.text(0.68, 0.29, '19.74 - 23.51', color = 'grey', fontsize = 'x-small', ha = 'center')
plt.text(0.95, 0.29, '16.05 - 20.06', color = 'grey', fontsize = 'x-small', ha = 'center')

#plt.vlines(x = (0.25, 0.85), ymin = 0.2, ymax = 0.7)

plt.title('BREAST CANCER FACT CHECK', family = 'monospace', fontsize = 'xx-large')
plt.axis('off')

# 6. Epilogue

Even though I have tried to describe most of what I have done, the descriptions do not cover the basics of these concepts. For some concepts, I didn't even write a description (which I might do in the future). Therefore, it would be wise for the young and inexperienced reader to look these concepts up on the internet. Moreover, whatever I have done is subject to improvement (as more experienced people might say). I have tried to strike a balance between very basic and slightly advanced analysis, partly because when you have been expounding on these topics for quite a while, you at some point get bored.

I am also open to constructive criticism of my notebooks.

I remain