### Dataset Description

___

1. ID number
2. Outcome (R = recur, N = nonrecur) 
3. Time (recurrence time if field 2 = R, disease-free time if field 2 = N) 
4. (fields 4-33) Ten real-valued features are computed for each cell nucleus:

    * radius (mean of distances from center to points on the perimeter)
    * texture (standard deviation of gray-scale values)
    * perimeter
    * area
    * smoothness (local variation in radius lengths)
    * compactness (perimeter^2 / area - 1.0)
    * concavity (severity of concave portions of the contour)
    * concave points (number of concave portions of the contour)
    * symmetry
    * fractal dimension ("coastline approximation" - 1)

  Several of the papers listed above contain detailed descriptions of how these features are computed.

  The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 4 is Mean Radius, field 14 is Radius SE, field 24 is Worst Radius.

> Values for features 4-33 are recoded with four significant digits.

34. Tumor size - diameter of the excised tumor in centimeters
35. Lymph node status - number of positive axillary lymph nodes observed at time of surgery

In [80]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [81]:
dataset = pd.read_csv('wisconsin_prognostic_breasts_cancer.csv')
dataset.head()

Unnamed: 0.1,Unnamed: 0,id,outcome,time,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,...,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,tumor_size,lymph_node_status
0,33,855563,N,99,10.95,21.35,71.9,371.1,0.1227,0.1218,...,87.22,514.0,0.1909,0.2698,0.4023,0.1424,0.2964,0.09606,2.7,0
1,137,9013838,N,7,11.08,18.83,73.3,361.6,0.1216,0.2154,...,91.76,508.1,0.2184,0.9379,0.8402,0.2524,0.4154,0.1403,2.0,0
2,119,892189,N,1,11.76,18.14,75.0,431.1,0.09968,0.05914,...,85.1,553.6,0.1137,0.07974,0.0612,0.0716,0.1978,0.06915,10.0,18
3,3,843483,N,123,11.42,20.38,77.58,386.1,0.1425,0.2839,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,2.0,0
4,24,853612,N,116,11.84,18.7,77.93,440.6,0.1109,0.1516,...,119.4,888.7,0.1637,0.5775,0.6956,0.1546,0.4761,0.1402,3.0,2


In [82]:
dataset.describe()

Unnamed: 0.1,Unnamed: 0,id,time,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,tumor_size
count,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,...,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0
mean,98.5,1990469.0,46.732323,17.412323,22.27601,114.856566,970.040909,0.102681,0.142648,0.156243,...,30.139091,140.347778,1404.958586,0.143921,0.365102,0.436685,0.178778,0.323404,0.090828,2.847475
std,57.301832,2889025.0,34.46287,3.161676,4.29829,21.383402,352.149215,0.012522,0.049898,0.070572,...,6.017777,28.892279,586.006972,0.022004,0.163965,0.173625,0.045181,0.075161,0.021172,1.937964
min,0.0,8423.0,1.0,10.95,10.38,71.9,361.6,0.07497,0.04605,0.02398,...,16.67,85.1,508.1,0.08191,0.05131,0.02398,0.02899,0.1565,0.05504,0.4
25%,49.25,855745.2,14.0,15.0525,19.4125,98.16,702.525,0.0939,0.1102,0.10685,...,26.21,118.075,947.275,0.129325,0.2487,0.32215,0.15265,0.27595,0.076578,1.5
50%,98.5,886339.0,39.5,17.29,21.75,113.7,929.1,0.1019,0.13175,0.15135,...,30.135,136.5,1295.0,0.14185,0.3513,0.40235,0.17925,0.3103,0.08689,2.5
75%,147.75,927995.8,72.75,19.58,24.655,129.65,1193.5,0.110975,0.1722,0.2005,...,33.555,159.875,1694.25,0.154875,0.423675,0.54105,0.207125,0.3588,0.101375,3.5
max,197.0,9411300.0,125.0,27.22,39.28,182.1,2250.0,0.1447,0.3114,0.4268,...,49.54,232.2,3903.0,0.2226,1.058,1.17,0.2903,0.6638,0.2075,10.0


#### Calculate expected value, variance and standard deviation for `mean_perimeter`
___

In [83]:
# expected value
np.mean(dataset.mean_perimeter)

114.85656565656565

In [84]:
# variance (np.var defaults to ddof=0, i.e. the population variance)
np.var(dataset.mean_perimeter)

454.94051951841647

In [85]:
# standard deviation (np.std also defaults to ddof=0, so it differs slightly from describe())
mean_perimeter = dataset.mean_perimeter
np.std(mean_perimeter, axis=0)

21.329334718139158
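
The value above (21.329) differs from the `std` reported by `describe()` (21.383) because NumPy divides by n by default while pandas divides by n - 1. A minimal sketch of the conversion, using only the numbers printed in this notebook plus a synthetic array (the random data is an assumption for illustration, not part of the dataset):

```python
import numpy as np

n = 198                                # number of rows in the dataset
var_population = 454.94051951841647    # np.var output above (ddof=0)

# Rescale the population variance to the unbiased sample variance (ddof=1).
var_sample = var_population * n / (n - 1)
std_sample = np.sqrt(var_sample)
print(var_sample, std_sample)

# The same relationship checked on synthetic data.
rng = np.random.default_rng(0)
values = rng.normal(size=n)
same = np.isclose(np.var(values, ddof=1), np.var(values) * n / (n - 1))
print(same)
```

Passing `ddof=1` to `np.var` or `np.std` gives the sample estimates directly.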

#### Calculate expected value, variance and standard deviation for `mean_area`
___

In [86]:
# expected_value
np.mean(dataset.mean_area)

970.0409090909092

In [87]:
# variance (population variance, ddof=0)
np.var(dataset.mean_area)

123382.76130624428

In [88]:
# standard deviation (population standard deviation, ddof=0)
mean_area = dataset.mean_area
np.std(mean_area, axis=0)

351.25882381264717

#### Calculate confidence intervals for both `expected value` and `variance`
___

In [89]:
# confidence interval for the mean
# we compute the CI for the mean of mean_perimeter among recurrent patients
dataset.groupby('outcome').agg({'mean_perimeter': [np.mean, np.std, np.size]})

,mean_perimeter,mean_perimeter,mean_perimeter
outcome,mean,std,size
N,112.756424,20.509148,151.0
R,121.60383,22.926511,47.0


![alt text](CI_mean.png "CI for the mean formula")

In [90]:
# extract the necessary parameters for recurrent patients
mean_ = 121.60    # sample mean of mean_perimeter
std_ = 22.93      # sample standard deviation
n = 47            # number of recurrent patients
z = 1.96          # z-score for a 95% confidence interval, valid for a large enough sample
# calculate standard error using the formula of standard error of the mean
se = std_ / np.sqrt(n)
# now construct the CI
lcb = mean_ - z * se  # lower limit of the CI
ucb = mean_ + z * se  # upper limit of the CI
(lcb, ucb)

(115.04441886010842, 128.15558113989158)
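
The hardcoded `z = 1.96` can also be obtained from SciPy, and with n = 47 a Student's t quantile is a reasonable alternative. The sketch below reuses the rounded summary statistics from the cell above; the helper `mean_confidence_interval` and its `use_t` flag are hypothetical names introduced for this example.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(mean, std, n, confidence=0.95, use_t=False):
    """Two-sided CI for a mean from summary statistics."""
    se = std / np.sqrt(n)                  # standard error of the mean
    tail = 1 - (1 - confidence) / 2        # 0.975 for a 95% interval
    if use_t:
        q = stats.t.ppf(tail, df=n - 1)    # Student's t quantile
    else:
        q = stats.norm.ppf(tail)           # standard normal quantile
    return mean - q * se, mean + q * se


z_ci = mean_confidence_interval(121.60, 22.93, 47)
t_ci = mean_confidence_interval(121.60, 22.93, 47, use_t=True)
print(z_ci)
print(t_ci)
```

The t-based interval is slightly wider than the z-based one, which reflects the extra uncertainty from estimating the standard deviation on a moderate sample.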

![alt text](CI_variance.PNG "CI for the variance formula")

In [91]:
# calculate CI for variance
# get sample variance
var = np.var(dataset.mean_perimeter, ddof=1)
degrees_of_freedom = 197
# chi-square critical values for 197 degrees of freedom at the 0.025 and 0.975 levels,
# taken from the table (https://www.medcalc.org/manual/chi-square-table.php)
left = 160.023
right = 237.763
ucb = degrees_of_freedom * var / left
lcb = degrees_of_freedom * var / right
(lcb, ucb)

(378.8571933591285, 562.907974882651)
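
Instead of reading the chi-square critical values from a table, they can be computed with `scipy.stats.chi2.ppf`. This sketch rebuilds the sample variance from the population variance printed earlier in the notebook, so it runs without the CSV file; variable names such as `chi2_lower` are chosen for this example only.

```python
from scipy import stats

n = 198
# Unbiased sample variance, rescaled from the np.var (ddof=0) output above.
sample_var = 454.94051951841647 * n / (n - 1)
dof = n - 1
alpha = 0.05

# Quantiles of the chi-square distribution with n - 1 degrees of freedom.
chi2_lower = stats.chi2.ppf(alpha / 2, dof)
chi2_upper = stats.chi2.ppf(1 - alpha / 2, dof)

# The larger quantile gives the lower bound of the interval and vice versa.
var_ci = (dof * sample_var / chi2_upper, dof * sample_var / chi2_lower)
print(chi2_lower, chi2_upper)
print(var_ci)
```
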

In [92]:
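# cross-check with SciPy; note that bayes_mvs defaults to alpha=0.9,
# so these are 90% Bayesian credible intervals rather than 95% confidence intervals
mean, var, _ = stats.bayes_mvs(dataset.mean_perimeter)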
(mean, var)

(Mean(statistic=114.85656565656565, minmax=(112.34515014455606, 117.36798116857524)),
 Variance(statistic=461.93960443408446, minmax=(390.3777526674414, 544.1921481230138)))

#### Hypothesis Tests
___

##### Let's check the equality of the means of `mean_perimeter` for recurrent and nonrecurrent patients with known variances
____

In [93]:
dataset.groupby('outcome').agg({'mean_perimeter': [np.mean, np.var, np.std, np.size]})

,mean_perimeter,mean_perimeter,mean_perimeter,mean_perimeter
outcome,mean,var,std,size
N,112.756424,420.625163,20.509148,151.0
R,121.60383,525.624924,22.926511,47.0


We test the null hypothesis $H_{0}: M(X) = M(Y)$
against the competing hypothesis $H_{1}: M(X) \neq M(Y)$

In [94]:
# calculate observed criterion value
x = 112.76
y = 121.60
n = 151
m = 47
var_x = 420.63
var_y = 525.62
z = (x - y) / np.sqrt((var_x / n) + (var_y / m))

In [95]:
# value of the Laplace function at the critical point (table: https://100task.ru/sample/119.aspx)
fun_z_tp = (1 - 0.05) / 2
fun_z_tp

0.475

In [96]:
# the Laplace function equals 0.475 at z = 1.96
z_tp = 1.96

In [97]:
abs(z) < z_tp 
# False: |z| exceeds the critical value, so we reject the null hypothesis

False
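
As a cross-check that does not depend on a printed table, the critical value and a two-sided p-value can be computed with `scipy.stats.norm`. The sketch uses the rounded group statistics from the cells above; the function `two_sample_z_test` is a hypothetical helper written for this example.

```python
import numpy as np
from scipy import stats


def two_sample_z_test(mean_x, var_x, n, mean_y, var_y, m, alpha=0.05):
    """Two-sided z-test for equal means, treating the variances as known."""
    z = (mean_x - mean_y) / np.sqrt(var_x / n + var_y / m)
    z_crit = stats.norm.ppf(1 - alpha / 2)    # two-sided critical value
    p_value = 2 * stats.norm.sf(abs(z))       # two-sided p-value
    return z, z_crit, p_value


# X: nonrecurrent patients, Y: recurrent patients
z, z_crit, p_value = two_sample_z_test(112.76, 420.63, 151, 121.60, 525.62, 47)
reject_null = abs(z) > z_crit
print(z, z_crit, p_value, reject_null)
```
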

##### Let's check the equality of the means of `mean_perimeter` for recurrent and nonrecurrent patients with unknown variances
____

In [98]:
recurrent_df = dataset.loc[dataset['outcome'] == 'R'][:14]
recurrent_df.describe()

Unnamed: 0.1,Unnamed: 0,id,time,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,tumor_size
count,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,...,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0
mean,66.142857,1392104.0,39.928571,14.375,21.345,94.662857,650.478571,0.107609,0.138577,0.126641,...,29.591429,119.935714,1008.95,0.162671,0.396464,0.432729,0.166036,0.319036,0.105994,2.528571
std,52.536957,2202418.0,27.437942,1.24471,3.8117,8.40303,114.913661,0.009837,0.037982,0.044713,...,5.848944,14.053065,260.262722,0.013803,0.126489,0.142909,0.024529,0.044503,0.014913,1.408452
min,5.0,89539.0,5.0,12.34,15.29,81.15,477.4,0.08876,0.07081,0.05253,...,20.24,101.7,733.2,0.1389,0.2057,0.2678,0.1252,0.2589,0.07873,0.4
25%,21.0,848147.5,17.5,13.7325,19.885,90.2575,579.45,0.10265,0.116525,0.09737,...,27.7725,110.9,874.05,0.153525,0.338375,0.33945,0.1527,0.288375,0.0988,1.5
50%,44.5,859104.5,35.0,14.395,20.955,96.575,652.25,0.109,0.1336,0.11815,...,30.59,117.7,951.35,0.1647,0.3842,0.4049,0.1646,0.31875,0.10515,2.5
75%,115.75,878458.0,67.75,15.115,23.4175,98.585,716.6,0.11615,0.1589,0.160975,...,32.8375,123.375,1045.0,0.1699,0.422375,0.469425,0.176775,0.3407,0.114875,3.0
max,147.0,9010018.0,78.0,16.27,27.54,108.1,813.7,0.1189,0.2022,0.2135,...,39.34,157.1,1748.0,0.1851,0.6577,0.7026,0.2134,0.4218,0.1341,6.0


In [99]:
nonrecured_df = dataset.loc[dataset['outcome'] != 'R'][:20]
nonrecured_df.describe()

Unnamed: 0.1,Unnamed: 0,id,time,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,tumor_size
count,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,...,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0
mean,70.1,1201572.0,66.25,12.6505,21.1245,82.958,495.635,0.109349,0.137292,0.121006,...,30.2435,101.766,716.395,0.162345,0.420552,0.4616,0.167975,0.36312,0.11025,2.645
std,59.646238,1854835.0,42.884024,0.881192,3.336751,5.493129,70.048346,0.016092,0.064866,0.064232,...,4.821423,9.251143,129.198409,0.028181,0.274136,0.253667,0.0531,0.115016,0.036379,2.034304
min,3.0,85715.0,1.0,10.95,15.56,71.9,361.6,0.08162,0.05761,0.02685,...,22.4,85.1,508.1,0.1094,0.07974,0.0612,0.0716,0.1978,0.06057,1.0
25%,31.75,851461.5,23.75,11.83,18.7975,78.725,438.45,0.099045,0.078105,0.07291,...,27.5875,95.5725,638.125,0.1455,0.191875,0.2592,0.123425,0.287025,0.090713,1.65
50%,45.5,856245.5,76.0,12.915,21.085,85.34,513.05,0.10775,0.12835,0.1112,...,29.945,102.65,721.4,0.1585,0.3998,0.42635,0.1648,0.33495,0.10285,2.0
75%,101.75,884443.0,103.5,13.3175,22.365,87.2375,548.7,0.11935,0.17735,0.16905,...,32.175,106.625,766.525,0.180275,0.53575,0.63055,0.2067,0.4207,0.128275,2.775
max,188.0,9013838.0,123.0,13.86,30.98,89.65,595.8,0.1425,0.2839,0.2414,...,41.05,119.4,993.6,0.2184,1.058,1.105,0.2575,0.6638,0.2075,10.0


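In [100]:
x = 94.66   # sample mean of mean_perimeter, recurrent subsample
y = 82.96   # sample mean of mean_perimeter, nonrecurrent subsample
n = 14
m = 20
# unbiased sample variances (ddof=1), as the pooled-variance formula requires
s_x = np.var(recurrent_df.mean_perimeter, ddof=1)
s_y = np.var(nonrecured_df.mean_perimeter, ddof=1)
# the pooled sum of squares sits under a square root in the denominator
criteria_observed = (x - y) / np.sqrt((n - 1) * s_x + (m - 1) * s_y) * np.sqrt(m * n * (n + m - 2) / (n + m))
round(criteria_observed, 2)

4.92

**Degrees of freedom:** $k = n + m - 2 = 14 + 20 - 2 = 32$<br>  
[According to the table of critical values of Student's t-distribution:](https://100task.ru/sample/120.aspx) $t_{(0.05;32)} \approx 2.04$<br>
_____
$|t_{obsv}| > t_{(0.05;32)}$ - _we reject the null hypothesis_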
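
As a cross-check, the pooled two-sample t statistic can be computed directly from the summary statistics that `describe()` printed for the two subsamples, and compared against `scipy.stats.ttest_ind_from_stats`. This is a sketch under the assumption of equal variances; the means and standard deviations are copied from the `describe()` output above, and the variable names are chosen for this example.

```python
import numpy as np
from scipy import stats

# Summary statistics of mean_perimeter from the describe() outputs above.
mean_x, std_x, n = 94.662857, 8.40303, 14     # recurrent subsample
mean_y, std_y, m = 82.958, 5.493129, 20       # nonrecurrent subsample

# Pooled variance and the t statistic computed by hand.
pooled_var = ((n - 1) * std_x**2 + (m - 1) * std_y**2) / (n + m - 2)
t_manual = (mean_x - mean_y) / np.sqrt(pooled_var * (1 / n + 1 / m))

# The same test from SciPy, assuming equal variances (pooled t-test).
result = stats.ttest_ind_from_stats(mean_x, std_x, n, mean_y, std_y, m,
                                    equal_var=True)

# Two-sided critical value at the 5% level with n + m - 2 degrees of freedom.
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n + m - 2)
print(t_manual, result.statistic, result.pvalue, t_crit)
```

If the variances cannot be assumed equal, passing `equal_var=False` switches to Welch's t-test.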