# **Breast Cancer Diagnosis Study**

## Objectives

* Answer business requirement 1: pattern identification of features
  - Identify the most critical features (e.g., radius, perimeter of lobes, concavity) correlated with malignant tumors.
  - Use visual analysis to guide early diagnosis.

## Inputs

* Generate Dataset: outputs/datasets/collection/breast-cancer.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App

---

# Change working directory

* We assume the notebooks are stored in a subfolder. When running a notebook in the editor, you therefore need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\stazr\\OneDrive\\Documents\\vscode-projects\\breast-cancer-diagnosis-PP5\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\stazr\\OneDrive\\Documents\\vscode-projects\\breast-cancer-diagnosis-PP5'
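Note that running cell `In [2]` a second time would move up one more level. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the path above; the helper name `project_root` is hypothetical and not part of the original notebook:

```python
from pathlib import Path


def project_root(current: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent folder only when `current` is the notebooks folder."""
    # If we are already at the project root, leave the path unchanged,
    # so running the cell twice does not climb a second level.
    return current.parent if current.name == notebooks_dir else current


# Usage inside the notebook would look like:
#   import os
#   os.chdir(project_root(Path.cwd()))
```

The function only inspects the folder name, so it stays safe to call repeatedly.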

## Load Data


In [4]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/breast-cancer.csv')
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | 1 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | 1 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | 1 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | 1 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


---

## Data Exploration

* With the data loaded, we want to get familiar with the dataset: check variable types and distributions, missing values, and what these variables mean in a business context.

In [None]:
def df_describe(df):
    # Basic stats
    print("Shape:", df.shape)
    print("\nData Types:\n", df.dtypes)
    print("\nMissing Values:\n", df.isna().sum())
    print("\nDescriptive Stats:\n", df.describe(include='all'))
    print("\nSample:\n", df.head())

df_describe(df)

Shape: (569, 32)

Data Types:
 id                           int64
diagnosis                    int64
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst       
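The checks printed by `df_describe` can also be collected into a single table, which may be easier to scan across 32 columns. A minimal sketch, assuming a hypothetical helper `summarise_columns` and a small toy frame (not the breast-cancer data):

```python
import pandas as pd


def summarise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Toy frame for illustration only; not the breast-cancer data.
toy = pd.DataFrame({"diagnosis": [1, 0, 1], "radius_mean": [17.99, None, 11.42]})
summary = summarise_columns(toy)
print(summary)
```

In the notebook, the same helper could be called as `summarise_columns(df)`.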

## Correlation Study

In [6]:
df['diagnosis'].unique()

array([1, 0])

We will drop the `id` column since it's irrelevant to our correlation study

In [None]:
df = df.drop(columns=['id'])
df.tail()

We use `.corr()` with the Pearson and Spearman methods and inspect the ten strongest correlations with `diagnosis`.

* This command returns a pandas Series. After sorting, the first item is the correlation of `diagnosis` with itself, which is 1, so we exclude it with `[1:]`
* We sort by absolute value by setting `key=abs`, so that strong negative correlations also rank near the top

In [None]:
corr_pearson = df.corr(method='pearson')['diagnosis'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

We do the same for Spearman

In [None]:
corr_spearman = df.corr(method='spearman')['diagnosis'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman
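The ranking logic in the two cells above can be sketched on toy data. This is an illustration under stated assumptions: the column names `feat_pos`, `feat_neg`, `feat_noise` and the helper `top_correlations` are hypothetical, and the values are made up. The sketch drops the target by label rather than with `[1:]`, which avoids relying on the target sorting first.

```python
import pandas as pd

# Toy data for illustration only; not the breast-cancer data.
toy = pd.DataFrame({
    "diagnosis": [0, 0, 0, 1, 1, 1],
    "feat_pos": [1.0, 2.0, 3.0, 4.0, 5.0, 60.0],   # rises with the target, one outlier
    "feat_neg": [9.0, 8.0, 7.0, 3.0, 2.0, 1.0],    # falls as the target rises
    "feat_noise": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],  # weak relation to the target
})


def top_correlations(df, target, method, n=10):
    """Rank features by absolute correlation with `target`."""
    corr = df.corr(method=method)[target]
    # Drop the target by label instead of by position, then sort by magnitude
    # so that strong negative correlations are not pushed to the bottom.
    return corr.drop(target).sort_values(key=abs, ascending=False).head(n)


top_pearson = top_correlations(toy, "diagnosis", "pearson")
top_spearman = top_correlations(toy, "diagnosis", "spearman")
print(top_pearson)
print(top_spearman)
```

On this toy frame the negatively correlated feature ranks first under Pearson, and the outlier in `feat_pos` lowers its Pearson value more than its rank-based Spearman value, which is one reason to look at both methods.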

From the above methods:

* `perimeter_worst`, `concave points_worst` and `radius_worst` are highly correlated with our target variable `diagnosis`.
* The correlations between the top variables and the target appear to be moderately to strongly positive.

Ideally, we pursue strong correlation levels. However, this is not always possible.

We will take the six strongest correlations from each method and study the associated variables

In [None]:
top_n = 6
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

In [None]:
vars_to_study = ['area_worst', 'concave points_mean', 'concave points_worst', 'perimeter_mean', 'perimeter_worst', 'radius_worst']
vars_to_study
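The hard-coded list above could also be derived from the two rankings, so it stays in sync if the data changes. A minimal sketch with hypothetical correlation values and feature names (`x1` to `x5`), not the notebook's actual results:

```python
import pandas as pd

# Hypothetical correlation values, for illustration only.
corr_a = pd.Series({"x1": 0.9, "x2": 0.8, "x3": 0.7, "x4": 0.6})
corr_b = pd.Series({"x1": 0.9, "x3": 0.85, "x5": 0.8, "x2": 0.7})

top_n = 3
# Union of the top-n names from both rankings; sorted() gives a
# reproducible order, which a plain set does not guarantee.
vars_union = sorted(set(corr_a.index[:top_n]) | set(corr_b.index[:top_n]))
print(vars_union)
```

In the notebook, `corr_a` and `corr_b` would correspond to `corr_pearson` and `corr_spearman`.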

## EDA on selected variables

Filter the dataset to the six most correlated variables and include the target `diagnosis`.

In [None]:
df_eda = df.filter(vars_to_study + ['diagnosis'])
df_eda.head()

### Visualize variable correlation to Diagnosis:

Plot the distribution of each selected variable, split by diagnosis:


In [None]:
%matplotlib inline
# This line is used to display plots inline in Jupyter notebooks

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()

target_var = 'diagnosis'
for col in vars_to_study:
    plot_numerical(df_eda, col, target_var)
    print("\n\n")

### Multivariate analysis

Multivariate analysis (MVA) is a set of statistical methods used to analyze data sets with multiple variables, examining relationships and patterns among them. We will visualize the MVA among the variables, all in one go, with a pairplot figure.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df_eda, hue='diagnosis', corner=True, diag_kind='kde')
plt.suptitle('Pairplot of Selected Variables', y=1.02, fontsize=20)
plt.show()
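Several of the selected variables describe related geometry (radius, perimeter, area), so they are likely to be strongly correlated with each other as well as with the target. A feature-to-feature correlation matrix can complement the pairplot. The sketch below uses synthetic data built from a made-up radius column; it is not the breast-cancer data.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: perimeter and area are derived
# from a made-up radius column using circle geometry.
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=50)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "area": np.pi * radius ** 2,
})

# Feature-to-feature correlation matrix. Spearman is rank-based, so it
# captures the non-linear but monotonic radius-to-area relation.
feature_corr = toy.corr(method="spearman")
print(feature_corr.round(3))
```

In the notebook, the equivalent call would be `df_eda.drop(columns=['diagnosis']).corr(method='spearman')`.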

---

# Conclusions and Next steps

### The correlations and plots interpretation converge

* A higher worst area value (`area_worst`) might point to a Malignant diagnosis.
* A mean concave points value (`concave points_mean`) of >0.05 might point to a Malignant diagnosis.
* A worst concave points value (`concave points_worst`) of >0.14 might point to a Malignant diagnosis.
* A mean tumor boundary (perimeter) value (`perimeter_mean`) of >85 might point to a Malignant diagnosis.
* A worst perimeter value (`perimeter_worst`, the outer perimeter of lobes) of >100 might point to a Malignant diagnosis.
* A higher worst radius value (`radius_worst`) might point to a Malignant diagnosis.

Next we will work on the Data Cleaning process.

---