# **Breast Cancer Diagnosis Study**

## Objectives

* Answer business requirement 1: identify patterns in the features
  - Identify the features (e.g., radius, perimeter of lobes, concavity) most strongly correlated with malignant tumors.
  - Use visual analysis to guide early diagnosis.

## Inputs

* Dataset: outputs/datasets/collection/breast-cancer.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App

---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
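
Re-running the `os.chdir()` cell moves up one more level each time. A minimal sketch of a guard, assuming the project root is the folder that contains `outputs` (the helper name `move_to_parent_once` and the `marker` argument are hypothetical, not part of the notebook):

```python
import os

def move_to_parent_once(marker="outputs"):
    """Move to the parent folder only if `marker` is not visible from here.

    Assumption: the project root is the folder that contains `marker`.
    """
    if not os.path.exists(marker):
        os.chdir(os.path.dirname(os.getcwd()))
    return os.getcwd()
```

Calling `move_to_parent_once()` a second time leaves the working directory unchanged once the `outputs` folder is visible.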

## Load Data


In [None]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/breast-cancer.csv')
df.head()

---

## Data Exploration

* Now that the data is loaded, we want to get familiar with the dataset: check variable types and distributions, missing-data levels, and what each variable means in a business context.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

## Correlation Study

In [None]:
df['diagnosis'].unique()

We will drop the `id` column since it's irrelevant to our correlation study

In [None]:
df = df.drop(columns=['id'])
df.tail()
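
The correlation step needs a numeric `diagnosis` column. The output of `df['diagnosis'].unique()` is not shown here, so the following is only a sketch under the assumption that the labels are the strings `'M'` and `'B'`; the helper name `encode_diagnosis` and the mapping are hypothetical. If the column is already numeric, the helper leaves it untouched.

```python
import pandas as pd

def encode_diagnosis(frame, mapping=None):
    """Return a copy with `diagnosis` mapped to numbers if it is not numeric yet.

    Assumption: labels are 'M' (malignant) and 'B' (benign).
    """
    mapping = mapping or {"M": 1, "B": 0}
    out = frame.copy()
    if not pd.api.types.is_numeric_dtype(out["diagnosis"]):
        out["diagnosis"] = out["diagnosis"].map(mapping)
    return out

# Toy rows for illustration only, not taken from the dataset
demo = pd.DataFrame({"diagnosis": ["M", "B", "M"],
                     "radius_worst": [20.1, 12.3, 18.7]})
encoded = encode_diagnosis(demo)
print(encoded["diagnosis"].tolist())  # [1, 0, 1]
```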

We use .corr() for spearman and pearson methods, and investigate the top 10 correlations

* We know this command returns a pandas series and the first item is the correlation between `diagnosis` and `diagnosis`, which happens to be 1, so we exclude that with [1:]
* We sort values considering the absolute value, by setting key=abs

In [None]:
corr_pearson = df.corr(method='pearson')['diagnosis'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson
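
To see why `key=abs` and the positional slice matter, here is a small illustration on made-up correlation values (the feature names `feat_a`, `feat_b`, `feat_c` are hypothetical). Sorting by absolute value keeps a strong negative correlation near the top, and dropping the first item removes the target's correlation with itself.

```python
import pandas as pd

# Made-up correlations with the target, for illustration only
toy = pd.Series({"diagnosis": 1.0, "feat_a": -0.8, "feat_b": 0.3, "feat_c": 0.6})

# Sort by magnitude, then drop the first entry (the target itself);
# .iloc[1:] is the same positional slice as [1:] in the notebook cell
ranked = toy.sort_values(key=abs, ascending=False).iloc[1:]
print(ranked.index.to_list())  # ['feat_a', 'feat_c', 'feat_b']
```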

We do the same for spearman

In [None]:
corr_spearman = df.corr(method='spearman')['diagnosis'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman
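
The two methods can disagree: Pearson measures linear association, while Spearman works on ranks and so captures any monotonic relationship. A minimal sketch on toy data (the columns `x` and `y` are made up) shows the difference:

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3  # monotonic, but not linear

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman is 1 because the ranks match perfectly; Pearson stays below 1
print(round(pearson, 3), round(spearman, 3))
```

This is one reason to compare both rankings before choosing which variables to study.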

From the above methods:

* `perimeter_worst`, `concave points_worst` and `radius_worst` are highly correlated to our target variable `diagnosis`.
* The correlations between the top variables and the target appear moderately to strongly positive.

Ideally, we pursue strong correlation levels. However, this is not always possible.

We will take the six strongest correlations from each method and study the associated variables

In [None]:
top_n = 6
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

In [None]:
vars_to_study = ['area_worst', 'concave points_mean', 'concave points_worst', 'perimeter_mean', 'perimeter_worst', 'radius_worst']
vars_to_study
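
The list above is typed by hand from the set printed in the previous cell. As a sketch of an alternative, the union can be built directly from the two rankings so it stays in sync if the data changes; the helper name `union_of_top` and the toy series are hypothetical.

```python
import pandas as pd

def union_of_top(corr_a, corr_b, top_n=6):
    """Sorted union of the first `top_n` index labels of two ranked series."""
    return sorted(set(corr_a.index[:top_n]) | set(corr_b.index[:top_n]))

# Made-up rankings, already sorted by strength, for illustration only
toy_pearson = pd.Series({"a": 0.9, "b": 0.8, "c": 0.7})
toy_spearman = pd.Series({"a": 0.9, "c": 0.85, "d": 0.6})
print(union_of_top(toy_pearson, toy_spearman, top_n=2))  # ['a', 'b', 'c']
```

In the notebook this would be called as `union_of_top(corr_pearson, corr_spearman, top_n)`.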

## EDA on selected variables

Filter the dataset to the six selected variables plus `diagnosis`.

In [None]:
df_eda = df.filter(vars_to_study + ['diagnosis'])
df_eda.head()

### Visualize variable correlation to Diagnosis:

Plot the distribution of each selected variable, split by diagnosis:


In [None]:
%matplotlib inline
# This line is used to display plots inline in Jupyter notebooks

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()

target_var = 'diagnosis'
for col in vars_to_study:
    plot_numerical(df_eda, col, target_var)
    print("\n\n")

### Multivariate analysis

Multivariate analysis (MVA) is a set of statistical methods used to analyze data sets with multiple variables, examining relationships and patterns among them. We will visualize the MVA among the variables, all in one go, with a pairplot figure.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df_eda, hue='diagnosis', corner=True, diag_kind='kde')
plt.suptitle('Pairplot of Selected Variables', y=1.02, fontsize=20)
plt.show()

---

# Conclusions and Next steps

### The correlation and plot interpretations converge

* A higher worst area (`area_worst`) value might point to a Malignant diagnosis.
* A mean concave points (`concave points_mean`) value of >0.05 might point to a Malignant diagnosis.
* A worst concave points (`concave points_worst`) value of >0.14 might point to a Malignant diagnosis.
* A mean tumor boundary (`perimeter_mean`) value of >85 might point to a Malignant diagnosis.
* A worst perimeter (`perimeter_worst`) value of >100 might point to a Malignant diagnosis.
* A higher worst radius (`radius_worst`) value might point to a Malignant diagnosis.
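
The thresholds above are read off the plots. One way to sanity-check a threshold is to compute the share of malignant cases among the rows above it. The sketch below uses made-up rows and assumes malignant is coded as `1`; the helper name `malignant_share_above` is hypothetical.

```python
import pandas as pd

def malignant_share_above(frame, col, threshold, target="diagnosis", malignant=1):
    """Share of rows labelled malignant among rows where `col` > `threshold`.

    Assumption: `malignant` is the value that encodes a malignant diagnosis.
    """
    above = frame[frame[col] > threshold]
    if above.empty:
        return float("nan")
    return (above[target] == malignant).mean()

# Made-up rows for illustration only, not taken from the dataset
demo = pd.DataFrame({"perimeter_worst": [80, 95, 110, 130, 150],
                     "diagnosis": [0, 0, 1, 1, 1]})
print(malignant_share_above(demo, "perimeter_worst", 100))  # 1.0
print(malignant_share_above(demo, "perimeter_worst", 90))   # 0.75
```

A share close to 1 on the real data would support the threshold; a share near the overall malignant rate would suggest the cut-off carries little information.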

Next we will work on the Data Cleaning process.

---