### Prototyping the Analysis Pipeline for Predicting 5-Year Survivability of Colorectal Cancer Patients

_Write in this notebook all the stages required to prototype your data analysis pipeline according to the project instructions._

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import sklearn
import shap

ModuleNotFoundError: No module named 'plotly'
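The error above stops the import cell before `sklearn` and `shap` are loaded. Installing the package (`pip install plotly`) resolves it. Alternatively, a minimal sketch of guarding the optional import is shown below. The `plotly_available` flag is a hypothetical name introduced here for illustration, not part of the original notebook.

```python
import importlib.util

# Check whether the optional plotting library is installed before importing it.
if importlib.util.find_spec("plotly") is not None:
    import plotly.express as px
else:
    px = None  # Fall back to matplotlib/seaborn for the plots.

# Hypothetical flag that later cells could check before calling px.
plotly_available = px is not None
print(f"plotly available: {plotly_available}")
```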

# Do your analysis from here...

# Take advantage of jupyter notebooks from previous courses! (DSHI?😉)

## Colorectal Cancer Dataset

### Dataset

The *Colorectal Cancer Global Dataset & Predictions* is a comprehensive dataset containing demographic, clinical, lifestyle, and treatment-related features relevant to colorectal cancer (CRC). It consists of 167,497 rows and combines 6 numerical features with 22 categorical features, making it suitable for descriptive, diagnostic, and predictive analysis.

Colorectal cancer is one of the leading causes of cancer-related deaths worldwide. Since survival rates vary drastically depending on stage at diagnosis and other factors, clinicians face challenges in providing accurate, data-driven prognosis. The dataset addresses this gap by supporting the development of predictive models that estimate 5-year survival probabilities, thereby contributing to personalized medicine and improved risk stratification.

The intended target users are clinicians and healthcare providers, who can use the survival predictions to support clinical decision-making, tailor treatment strategies, and allocate resources effectively.

By integrating predictive analytics into an interactive dashboard, the project aims to bridge clinical needs with patient-centered care, offering an intuitive, transparent, and supportive tool.

### Exploratory Data Analysis

In [None]:
df = pd.read_csv("colorectal_cancer_dataset.csv", delimiter=";")

#### Data Inspection

- Inspecting the first 5 rows.

In [None]:
df.head()

|  | Patient_ID | Country | Age | Gender | Cancer_Stage | Tumor_Size_mm | Family_History | Smoking_History | Alcohol_Consumption | Obesity_BMI | ... | Survival_5_years | Mortality | Healthcare_Costs | Incidence_Rate_per_100K | Mortality_Rate_per_100K | Urban_or_Rural | Economic_Classification | Healthcare_Access | Insurance_Status | Survival_Prediction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | UK | 77 | M | Localized | 69 | No | No | Yes | Overweight | ... | Yes | No | 54413 | 50 | 5 | Urban | Developed | Moderate | Insured | Yes |
| 1 | 2 | UK | 59 | M | Localized | 33 | No | No | No | Overweight | ... | Yes | No | 76553 | 37 | 25 | Urban | Developing | High | Uninsured | Yes |
| 2 | 3 | Japan | 66 | M | Regional | 17 | No | Yes | No | Normal | ... | Yes | No | 62805 | 54 | 27 | Urban | Developed | Moderate | Uninsured | No |
| 3 | 4 | USA | 83 | M | Regional | 14 | No | No | No | Obese | ... | Yes | No | 89393 | 45 | 11 | Urban | Developed | Moderate | Insured | Yes |
| 4 | 5 | France | 66 | M | Localized | 34 | No | Yes | No | Normal | ... | Yes | No | 66425 | 15 | 27 | Urban | Developing | High | Insured | Yes |


- Inspecting the number of rows and columns.

In [None]:
df.shape

(167497, 28)

- Inspecting the datatype of each feature.

In [None]:
df.dtypes

Patient_ID                     int64
Country                       object
Age                            int64
Gender                        object
Cancer_Stage                  object
Tumor_Size_mm                  int64
Family_History                object
Smoking_History               object
Alcohol_Consumption           object
Obesity_BMI                   object
Diet_Risk                     object
Physical_Activity             object
Diabetes                      object
Inflammatory_Bowel_Disease    object
Genetic_Mutation              object
Screening_History             object
Early_Detection               object
Treatment_Type                object
Survival_5_years              object
Mortality                     object
Healthcare_Costs               int64
Incidence_Rate_per_100K        int64
Mortality_Rate_per_100K        int64
Urban_or_Rural                object
Economic_Classification       object
Healthcare_Access             object
Insurance_Status              object
Survival_Prediction           object
dtype: object

#### Data Processing

- Removing features that are not relevant to the analysis or model training. The project does not intend to use patients' insurance status or healthcare costs as predictors of survival. The Healthcare_Access feature is also removed, because only patients in ongoing contact with a healthcare provider are relevant to the project. Country and area of residence are not considered either.

In [None]:
df.drop(columns=["Country", "Insurance_Status", "Healthcare_Costs", "Urban_or_Rural", "Economic_Classification", "Healthcare_Access"], inplace=True)

- Converting the object datatypes to categorical and displaying the newly assigned datatypes.

In [None]:
# Columns stored as strings that represent categories.
categorical_columns = [
    "Gender", "Cancer_Stage", "Family_History", "Smoking_History",
    "Alcohol_Consumption", "Obesity_BMI", "Diet_Risk", "Physical_Activity",
    "Diabetes", "Inflammatory_Bowel_Disease", "Genetic_Mutation",
    "Screening_History", "Early_Detection", "Treatment_Type",
    "Survival_5_years", "Mortality", "Survival_Prediction",
]

# Convert all of them to the pandas categorical dtype in one step.
df = df.astype({column: "category" for column in categorical_columns})
df.dtypes

Patient_ID                       int64
Age                              int64
Gender                        category
Cancer_Stage                  category
Tumor_Size_mm                    int64
Family_History                category
Smoking_History               category
Alcohol_Consumption           category
Obesity_BMI                   category
Diet_Risk                     category
Physical_Activity             category
Diabetes                      category
Inflammatory_Bowel_Disease    category
Genetic_Mutation              category
Screening_History             category
Early_Detection               category
Treatment_Type                category
Survival_5_years              category
Mortality                     category
Incidence_Rate_per_100K          int64
Mortality_Rate_per_100K          int64
Survival_Prediction           category
dtype: object

- Checking for null values. The dataset is complete.

In [None]:
df.isnull().sum()

Patient_ID                    0
Age                           0
Gender                        0
Cancer_Stage                  0
Tumor_Size_mm                 0
Family_History                0
Smoking_History               0
Alcohol_Consumption           0
Obesity_BMI                   0
Diet_Risk                     0
Physical_Activity             0
Diabetes                      0
Inflammatory_Bowel_Disease    0
Genetic_Mutation              0
Screening_History             0
Early_Detection               0
Treatment_Type                0
Survival_5_years              0
Mortality                     0
Incidence_Rate_per_100K       0
Mortality_Rate_per_100K       0
Survival_Prediction           0
dtype: int64

- Inspecting the first five rows of the processed dataset.

In [None]:
df.head()

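|  | Patient_ID | Age | Gender | Cancer_Stage | Tumor_Size_mm | Family_History | Smoking_History | Alcohol_Consumption | Obesity_BMI | Diet_Risk | ... | Inflammatory_Bowel_Disease | Genetic_Mutation | Screening_History | Early_Detection | Treatment_Type | Survival_5_years | Mortality | Incidence_Rate_per_100K | Mortality_Rate_per_100K | Survival_Prediction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 77 | M | Localized | 69 | No | No | Yes | Overweight | Low | ... | No | No | Regular | Yes | Combination | Yes | No | 50 | 5 | Yes |
| 1 | 2 | 59 | M | Localized | 33 | No | No | No | Overweight | Moderate | ... | No | No | Regular | No | Chemotherapy | Yes | No | 37 | 25 | Yes |
| 2 | 3 | 66 | M | Regional | 17 | No | Yes | No | Normal | Low | ... | Yes | No | Irregular | No | Chemotherapy | Yes | No | 54 | 27 | No |
| 3 | 4 | 83 | M | Regional | 14 | No | No | No | Obese | High | ... | No | No | Regular | No | Surgery | Yes | No | 45 | 11 | Yes |
| 4 | 5 | 66 | M | Localized | 34 | No | Yes | No | Normal | Low | ... | Yes | No | Never | Yes | Surgery | Yes | No | 15 | 27 | Yes |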


### Descriptive Analytics

Visualized in the dashboard:
- Patient
- Patient summary (family history, smoking history, BMI, alcohol consumption)
- To discuss in the group: the dashboard "tabs"

Diagnostic:
- Smoking and survival rate
- Comorbidities
- Regression between
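
The "smoking and survival rate" item could be prototyped roughly as follows. This is a minimal sketch on a small made-up frame (`toy_df` and its values are invented for illustration). It reuses only the `Smoking_History` and `Survival_5_years` column names from the dataset and assumes both hold "Yes"/"No" labels, as in the rows shown earlier.

```python
import pandas as pd

# Toy stand-in for the processed dataframe; the values are made up.
toy_df = pd.DataFrame({
    "Smoking_History": ["No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"],
    "Survival_5_years": ["Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "No"],
})

# Boolean flag: True where the patient survived five years.
survived = toy_df["Survival_5_years"].eq("Yes")

# The mean of a boolean series per group is the share of survivors in that group.
survival_rate_by_smoking = survived.groupby(toy_df["Smoking_History"]).mean()
print(survival_rate_by_smoking)
```

On the real data, the same grouping could be repeated for comorbidity columns such as `Diabetes` or `Inflammatory_Bowel_Disease`.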

---

### Example of a simple pre-trained model from Scikit-learn

Training a logistic regression model here that can be loaded to make predictions on user input in the web dashboard.

Source: https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html


In [None]:
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

from sklearn import datasets
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.linear_model import LogisticRegression

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

# Create an instance of Logistic Regression Classifier and fit the data.
logreg = LogisticRegression(C=1e5)
logreg.fit(X, Y)

_, ax = plt.subplots(figsize=(4, 3))
DecisionBoundaryDisplay.from_estimator(
    logreg,
    X,
    cmap=plt.cm.Paired,
    ax=ax,
    response_method="predict",
    plot_method="pcolormesh",
    shading="auto",
    xlabel="Sepal length",
    ylabel="Sepal width",
    eps=0.5,
)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors="k", cmap=plt.cm.Paired)
plt.show()
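
The iris example above could be adapted to the colorectal data along the following lines. This is only a sketch under stated assumptions: `toy_df` holds invented values, the four features are an arbitrary illustrative selection, and the one-hot encoding plus scaling pipeline is one possible design rather than the project's chosen model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the processed dataframe; the values are made up.
toy_df = pd.DataFrame({
    "Age": [77, 59, 66, 83, 66, 45, 52, 71],
    "Tumor_Size_mm": [69, 33, 17, 14, 34, 50, 22, 61],
    "Cancer_Stage": ["Localized", "Localized", "Regional", "Regional",
                     "Localized", "Metastatic", "Regional", "Metastatic"],
    "Smoking_History": ["No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"],
    "Survival_5_years": ["Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No"],
})

numeric_features = ["Age", "Tumor_Size_mm"]
categorical_features = ["Cancer_Stage", "Smoking_History"]

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# Keeping preprocessing and classifier in one pipeline means a single
# object can be pickled and reused in the dashboard.
crc_model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])
crc_model.fit(toy_df[numeric_features + categorical_features],
              toy_df["Survival_5_years"])

# Hypothetical new patient entered in the dashboard.
new_patient = pd.DataFrame([{
    "Age": 60, "Tumor_Size_mm": 30,
    "Cancer_Stage": "Localized", "Smoking_History": "No",
}])
classes = list(crc_model.named_steps["classifier"].classes_)
survival_probability = float(
    crc_model.predict_proba(new_patient)[0, classes.index("Yes")]
)
print(f"Estimated 5-year survival probability: {survival_probability:.2f}")
```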

## Exporting a pretrained model

In the dashboard, you should load a pre-trained model that was designed and evaluated in the Jupyter notebook. You can do this with `pickle` or an equivalent serialization library.

In [None]:
import pickle

In [None]:
# Save in the `assets` folder so that it is accessible from the web dashboard
file_path = "../assets/trained_model.pickle"
data_to_save = logreg

# Creates a binary object and writes the indicated variables
with open(file_path, "wb") as writeFile:
    pickle.dump(data_to_save, writeFile)

In [None]:
# Load model
pre_trained_model_path = "../assets/trained_model.pickle"
loaded_model = None # This will be replaced by the trained model in the pickle 

with open(pre_trained_model_path, "rb") as readFile:
    loaded_model = pickle.load(readFile)
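
As one alternative to `pickle`, `joblib` (installed as a dependency of scikit-learn) offers a similar save/load interface. The sketch below is self-contained: it retrains a small iris model and writes to a temporary directory, so `demo_model`, `demo_path`, and the `.joblib` file name are illustrative assumptions rather than the project's actual paths.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small stand-in model on the first two iris features.
iris = load_iris()
X_demo, y_demo = iris.data[:, :2], iris.target
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save and reload the model; a temporary directory keeps the sketch side-effect free.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "trained_model.joblib")
    joblib.dump(demo_model, demo_path)
    restored_model = joblib.load(demo_path)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(
    demo_model.predict(X_demo), restored_model.predict(X_demo)
)
print(f"Round trip preserved predictions: {same_predictions}")
```

As with `pickle`, files should only be loaded from trusted sources, since loading can execute arbitrary code.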

In [None]:
loaded_model

In [None]:
# Sepal [length, width]
user_data = [[5, 4]] # Must be 2D array
prediction = loaded_model.predict(user_data)

print(f"The predicted value for data {user_data} is {prediction}")
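
In the dashboard, user input will typically arrive as a flat list of values, while `predict` expects a 2D array. A minimal sketch of reshaping the input and reporting class probabilities alongside the predicted label is shown below. It retrains a small iris model so that it runs on its own. `raw_input` and `demo_model` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in for the loaded model: logistic regression on two iris features.
iris = load_iris()
demo_model = LogisticRegression(max_iter=1000).fit(iris.data[:, :2], iris.target)

# Hypothetical flat input from a dashboard form: sepal [length, width].
raw_input = [5, 4]

# Reshape to one row with as many columns as there are values.
user_array = np.asarray(raw_input, dtype=float).reshape(1, -1)

# Class probabilities for the single row, plus the predicted class name.
probabilities = demo_model.predict_proba(user_array)[0]
predicted_species = iris.target_names[demo_model.predict(user_array)[0]]

for name, probability in zip(iris.target_names, probabilities):
    print(f"{name}: {probability:.2f}")
print(f"Predicted class: {predicted_species}")
```

Showing probabilities rather than only the predicted label may suit a survival dashboard better, since clinicians can see how confident the model is.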