# The overall workflow in 'Breast_Cancer_Feature_Selection'
## 1. Feature Selection - Statistical Hypothesis testing
## 2. Next steps

# Packages

In [1]:
import pandas as pd
import numpy as np
from Breast_Cancer_Functions import preprocess_dataset, f_test_two_p_variance
from scipy.stats import ttest_ind

# 1. Feature Selection - Statistical Hypothesis testing

- **Before performing a two-sample t-test, I need to determine whether two samples have the same variance by conducting an F-test.**

- **F-test, null ($H_0$) and alternative ($H_1$) hypotheses:**

$$H_0: \sigma_1^2 = \sigma_2^2$$

$$H_1: \sigma_1^2 \neq \sigma_2^2$$

- **F-test, test statistic**: 

$$F = \frac {s_1^2} {s_2^2}$$

- **F-test, 95% confidence interval for the variance ratio (if the C.I. contains 1, the two variances are treated as equal):**

$$\text{C.I.} = \left(\frac {1}{F_{\frac{\alpha}{2}}} \frac {s_1^2}{s_2^2}, \frac {1}{F_{1-\frac{\alpha}{2}}} \frac {s_1^2}{s_2^2}\right)$$

- **Reject $H_0$**:

    - when the 95% C.I. does not contain 1
    - when the p-value is less than $\alpha$
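The helper `f_test_two_p_variance` is imported from `Breast_Cancer_Functions`, and its implementation is not shown in this notebook. The following is only a minimal sketch of how such a two-sided F-test could be written with `scipy.stats.f`. The function name `f_test_variances` and its return layout (F ratio, p-value, confidence interval) are assumptions made for illustration, not the author's actual code.

```python
from scipy.stats import f


def f_test_variances(var1, var2, n1, n2, alpha_level=0.05):
    """Two-sided F-test for equality of two variances (illustrative sketch).

    var1, var2: sample variances (computed with ddof=1)
    n1, n2:     sample sizes
    Returns (F ratio, two-sided p-value, confidence interval for var1/var2).
    """
    df1, df2 = n1 - 1, n2 - 1
    f_ratio = var1 / var2

    # Two-sided p-value: twice the smaller tail probability.
    cdf = f.cdf(f_ratio, df1, df2)
    p_value = 2 * min(cdf, 1 - cdf)

    # Confidence interval for the variance ratio: divide the observed
    # ratio by the upper and lower critical values of F(df1, df2).
    lower = f_ratio / f.ppf(1 - alpha_level / 2, df1, df2)
    upper = f_ratio / f.ppf(alpha_level / 2, df1, df2)
    return f_ratio, p_value, (lower, upper)


# Hypothetical inputs: equal variances, then clearly unequal variances.
same = f_test_variances(var1=2.0, var2=2.0, n1=50, n2=50)
different = f_test_variances(var1=9.0, var2=1.0, n1=100, n2=100)
```

In this sketch the two rejection criteria agree: the interval excludes 1 exactly when the two-sided p-value falls below `alpha_level`.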

## 1.1 Hypothesis testing - two-sample F-test for comparing variances

In [2]:
data = pd.read_csv('data.csv')
data = preprocess_dataset(data=data)

In [3]:
# Initialize two lists to collect column names.
unequal_col = []
equal_col = []

# Loop through numerical columns.
for num_col in data.columns:
    if data[num_col].dtype != 'object':
        f_results = f_test_two_p_variance(var1=np.var(data[data.diagnosis == 'M'][num_col],ddof=1), 
                                          var2=np.var(data[data.diagnosis == 'B'][num_col],ddof=1), 
                                          n1=data[data.diagnosis == 'M'][num_col].count(), 
                                          n2=data[data.diagnosis == 'B'][num_col].count(), 
                                          alpha_level=0.05)
        print('Column name: {}.'.format(num_col))
        print('F ratio: {}.'.format(round(f_results[0],4)))
        print('p-value: {}.'.format(round(f_results[1],4)))
        print('95 C.I. {}.'.format(f_results[2]))
        if f_results[1] <= 0.05:
            print('Decision: reject H0, two variances are unequal.')
            unequal_col.append(num_col)
        else: 
            print('Decision: do not reject H0, two variances are equal.')
            equal_col.append(num_col)
        print('----------------------------------------------------')

Column name: radius_mean.
F ratio: 3.2381.
p-value: 0.0.
95 C.I. (2.5533, 4.1376).
Decision: reject H0, two variances are unequal.
----------------------------------------------------
Column name: texture_mean.
F ratio: 0.895.
p-value: 0.3748.
95 C.I. (0.7057, 1.1436).
Decision: do not reject H0, two variances are equal.
----------------------------------------------------
Column name: perimeter_mean.
F ratio: 3.4259.
p-value: 0.0.
95 C.I. (2.7014, 4.3776).
Decision: reject H0, two variances are unequal.
----------------------------------------------------
Column name: area_mean.
F ratio: 7.5072.
p-value: 0.0.
95 C.I. (5.9197, 9.5927).
Decision: reject H0, two variances are unequal.
----------------------------------------------------
Column name: smoothness_mean.
F ratio: 0.8793.
p-value: 0.3031.
95 C.I. (0.6933, 1.1235).
Decision: do not reject H0, two variances are equal.
----------------------------------------------------
Column name: compactness_mean.
F ratio: 2.5588.
p-value: 0.

In [4]:
# Check whether I have collected all the results.
# The total number of columns (predictors) must be 30.
len(unequal_col) + len(equal_col)

30

## 1.2 Hypothesis testing - two-sample t-test for comparing means

- **Two-sample t-test, null ($H_0$) and alternative ($H_1$) hypotheses:**

$$H_0: \mu_1 = \mu_2$$

$$H_1: \mu_1 \neq \mu_2$$

- **Two-sample t-test, test statistic:**
    - If unequal variances are assumed, then the formula is:
$$ T = \frac {\bar {x_1} - \bar {x_2}} {\sqrt {\frac {s_1^2} {n_1} + \frac {s_2^2} {n_2}}} $$

    - If equal variances are assumed, then the formula reduces to:
$$ T = \frac {\bar {x_1} - \bar {x_2}} {s_p \sqrt {\frac {1} {n_1} + \frac {1} {n_2}}} $$
$$ s_p^2 = \frac {(n_1-1) \cdot s_1^2 + (n_2-1) \cdot s_2^2} {n_1+n_2-2} $$

- **Reject $H_0$**:

    - when p-value is less than $\alpha$
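As a sanity check on the two formulas above, the statistics can be computed by hand and compared with `scipy.stats.ttest_ind`. This is a minimal sketch on synthetic data. The helper name `t_statistic` and the random samples are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
from scipy.stats import ttest_ind


def t_statistic(x1, x2, equal_var):
    """Two-sample t statistic, pooled (equal_var=True) or Welch (equal_var=False)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    s1_sq, s2_sq = x1.var(ddof=1), x2.var(ddof=1)
    mean_diff = x1.mean() - x2.mean()
    if equal_var:
        # Pooled variance s_p^2, weighted by degrees of freedom.
        sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
        return mean_diff / np.sqrt(sp_sq * (1 / n1 + 1 / n2))
    # Unequal variances: each group contributes its own s^2 / n.
    return mean_diff / np.sqrt(s1_sq / n1 + s2_sq / n2)


# Synthetic groups with different means, spreads, and sizes.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1.0, scale=2.0, size=40)
group_b = rng.normal(loc=0.0, scale=1.0, size=60)

manual_pooled = t_statistic(group_a, group_b, equal_var=True)
manual_welch = t_statistic(group_a, group_b, equal_var=False)
scipy_pooled = ttest_ind(group_a, group_b, equal_var=True).statistic
scipy_welch = ttest_ind(group_a, group_b, equal_var=False).statistic
```

The manual values should match the SciPy statistics up to floating-point error. The two variants differ only in how the standard error in the denominator is estimated.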

In [5]:
# Perform a two-sample t-test to check whether the two diagnosis groups have different means.
# If the group means differ for a given predictor,
# that predictor may have some predictive power
# for the response variable (diagnosis).

# Collecting results.
t_test_stats = np.array([])
t_test_p_values = np.array([])
num_col_names = np.array([])

# Loop into numerical columns (unequal variance case).
for num_col in unequal_col:
    t_test_stat, t_test_p_value = ttest_ind(a=data[data.diagnosis == 'M'][num_col], 
                                            b=data[data.diagnosis == 'B'][num_col], 
                                            equal_var=False)
    t_test_stats = np.append(t_test_stats, t_test_stat)
    t_test_p_values = np.append(t_test_p_values, t_test_p_value)
    num_col_names = np.append(num_col_names, num_col)
    # Decision (reject or not reject H0)
    if t_test_p_value <= 0.05:
        print('Column name: {}.'.format(num_col))
        print('Decision: reject H0, two means are unequal.')
        print('p-value: {}.'.format(t_test_p_value))
    else:
        print('Column name: {}.'.format(num_col))
        print('Decision: do not reject H0, two means are equal.')
        print('p-value: {}.'.format(t_test_p_value))
    print('-------------------------------------------')

# Loop into numerical columns (equal variance case).
for num_col in equal_col:
    t_test_stat, t_test_p_value = ttest_ind(a=data[data.diagnosis == 'M'][num_col], 
                                            b=data[data.diagnosis == 'B'][num_col], 
                                            equal_var=True)
    t_test_stats = np.append(t_test_stats, t_test_stat)
    t_test_p_values = np.append(t_test_p_values, t_test_p_value)
    num_col_names = np.append(num_col_names, num_col)
    # Decision (reject or not reject H0)
    if t_test_p_value <= 0.05:
        print('Column name: {}.'.format(num_col))
        print('Decision: reject H0, two means are unequal.')
        print('p-value: {}.'.format(t_test_p_value))
    else:
        print('Column name: {}.'.format(num_col))
        print('Decision: do not reject H0, two means are equal.')
        print('p-value: {}.'.format(t_test_p_value))
    print('-------------------------------------------')

Column name: radius_mean.
Decision: reject H0, two means are unequal.
p-value: 1.6844591259582747e-64.
-------------------------------------------
Column name: perimeter_mean.
Decision: reject H0, two means are unequal.
p-value: 1.0231409970104587e-66.
-------------------------------------------
Column name: area_mean.
Decision: reject H0, two means are unequal.
p-value: 3.284366459573323e-52.
-------------------------------------------
Column name: compactness_mean.
Decision: reject H0, two means are unequal.
p-value: 9.607863145123788e-42.
-------------------------------------------
Column name: concavity_mean.
Decision: reject H0, two means are unequal.
p-value: 3.742120672313664e-58.
-------------------------------------------
Column name: concave points_mean.
Decision: reject H0, two means are unequal.
p-value: 3.1273162856782697e-71.
-------------------------------------------
Column name: radius_se.
Decision: reject H0, two means are unequal.
p-value: 1.4911328540231125e-30.
---

In [6]:
# Sort the t-test statistics by absolute value (ascending) and return the corresponding indexes.
sort_index = abs(t_test_stats).argsort()

In [7]:
# Column names: From the least to the most important numerical feature.
num_col_names[sort_index]

array(['symmetry_se', 'texture_se', 'fractal_dimension_mean',
       'smoothness_se', 'fractal_dimension_se', 'concavity_se',
       'compactness_se', 'fractal_dimension_worst', 'symmetry_mean',
       'smoothness_mean', 'symmetry_worst', 'concave points_se',
       'texture_mean', 'smoothness_worst', 'area_se', 'texture_worst',
       'perimeter_se', 'radius_se', 'compactness_worst',
       'compactness_mean', 'concavity_worst', 'area_mean',
       'concavity_mean', 'area_worst', 'radius_mean', 'perimeter_mean',
       'radius_worst', 'concave points_mean', 'perimeter_worst',
       'concave points_worst'], dtype='<U32')
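The ranking above comes from `argsort` on the absolute t statistics. A possible alternative, shown here only as a sketch, is to collect the names, statistics, and p-values in a pandas DataFrame so that each feature's rank and p-value stay side by side. The feature names and numbers below are hypothetical placeholders, not results from this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder values standing in for num_col_names,
# t_test_stats, and t_test_p_values from the cells above.
names = np.array(['feature_a', 'feature_b', 'feature_c'])
t_stats = np.array([-12.5, 0.3, 4.1])
p_values = np.array([1e-30, 0.76, 1e-4])

ranking = (
    pd.DataFrame({'feature': names, 't_stat': t_stats, 'p_value': p_values})
    .assign(abs_t=lambda df: df['t_stat'].abs())  # rank on |t|, ignoring sign
    .sort_values('abs_t', ascending=False)        # most important first
    .reset_index(drop=True)
)
ranking['rank'] = ranking.index + 1
```

Sorting in descending order puts the strongest separator first, which is the reverse of the ascending `argsort` order used above.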

## 1.3 Overall feature selection summary

### Numerical predictors
- **Predictors that did not pass the test (p-value larger than 0.05):** 
    - These four features are therefore not good predictors for differentiating the 'diagnosis' groups.
    - Comparing this hypothesis testing result with my earlier assumptions in 'Breast_Cancer_EDA', **'symmetry_se', 'texture_se', and 'smoothness_se'** were picked correctly!

| Column name | p-value |
| -- | -- |
| symmetry_se | 0.8871 |
| texture_se | 0.8354 |
| fractal_dimension_mean | 0.7599 |
| smoothness_se | 0.1103 |

- **Predictors that passed the test (p-value less than 0.05):**
    - **fractal_dimension_se** passed the test, but note that its p-value is very close to 0.05.
    - **concave points_worst** appears to be the most important predictor among all features, judging by the test results.
    - **The perimeter and concave points groups** both have their mean and worst predictors ranked in the top five.
    - Comparing this hypothesis testing result with my earlier assumptions in 'Breast_Cancer_EDA', **'radius_worst', 'perimeter_mean', and 'perimeter_worst'** were picked correctly!
    
| Column name | Rank |
| -- | -- |
| perimeter_mean | 5 |
| radius_worst | 4 |
| concave points_mean | 3 |
| perimeter_worst | 2 |
| **concave points_worst** | 1 |

# 2. Next steps
- At this point, I have a better understanding of my dataset and have reached well-supported conclusions about feature selection by checking my visualization-based assumptions against the statistical hypothesis testing results.
- Next, I will build some baseline models and see what scores (such as AUC, recall, and precision) I can achieve. Then I will try to boost performance by tuning the hyperparameters of the top three baseline models and compare the final model performance.