# **Differentiated Thyroid Cancer Recurrence**

## Group 2
### Abogado, Marxel S.
### Surban, Alyssa Nicole J.
### Tan, Jamilene Arianna L.
---

### I. **Pre-processing**

In [3]:
# pre-processing
import pandas as pd

def check_missing_values(df):
    """Check for missing values in the DataFrame."""
    missing = df.isnull().sum()
    print("Missing values per column:\n", missing)
    return missing

def check_duplicates(df):
    """Check for duplicate rows in the DataFrame."""
    duplicates = df.duplicated().sum()
    print(f"Number of duplicate rows: {duplicates}")
    return duplicates

def check_outliers(df):
    """Print value counts for every column to spot rare or unexpected values."""
    for col in df.columns:
        print(f"\nValue counts for {col}:")
        print(df[col].value_counts())

def create_dummy_variables(df):
    """Create dummy variables for all categorical columns."""
    df_dummies = pd.get_dummies(df, drop_first=True)
    print("Dummy variables created. Shape:", df_dummies.shape)
    return df_dummies


In [4]:
# Load the data
file_path = 'Thyroid_Diff.csv'
df = pd.read_csv(file_path)

In [7]:
# Check for missing values, duplicates, and outliers
df_missing = check_missing_values(df)
df_duplicates = check_duplicates(df)
check_outliers(df)

# Remove rows with missing values
df = df.dropna()

# Remove duplicate rows
df = df.drop_duplicates()

Missing values per column:
 Age                     0
Gender                  0
Smoking                 0
Hx Smoking              0
Hx Radiothreapy         0
Thyroid Function        0
Physical Examination    0
Adenopathy              0
Pathology               0
Focality                0
Risk                    0
T                       0
N                       0
M                       0
Stage                   0
Response                0
Recurred                0
dtype: int64
Number of duplicate rows: 19

Value counts for Age:
Age
31    22
27    13
40    12
26    12
28    12
      ..
79     1
18     1
69     1
76     1
78     1
Name: count, Length: 65, dtype: int64

Value counts for Gender:
Gender
F    312
M     71
Name: count, dtype: int64

Value counts for Smoking:
Smoking
No     334
Yes     49
Name: count, dtype: int64

Value counts for Hx Smoking:
Hx Smoking
No     355
Yes     28
Name: count, dtype: int64

Value counts for Hx Radiothreapy:
Hx Radiothreapy
No     376
Yes      7
Name: count, dtype: int64
[output truncated]

In [9]:

# Create dummy variables for categorical variables
df_dummies = create_dummy_variables(df)

df_dummies.head()

Dummy variables created. Shape: (364, 41)


| | Age | Gender_M | Smoking_Yes | Hx Smoking_Yes | Hx Radiothreapy_Yes | Thyroid Function_Clinical Hypothyroidism | Thyroid Function_Euthyroid | Thyroid Function_Subclinical Hyperthyroidism | Thyroid Function_Subclinical Hypothyroidism | Physical Examination_Multinodular goiter | ... | N_N1b | M_M1 | Stage_II | Stage_III | Stage_IVA | Stage_IVB | Response_Excellent | Response_Indeterminate | Response_Structural Incomplete | Recurred_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27 | False | False | False | False | False | True | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 1 | 34 | False | False | True | False | False | True | False | False | True | ... | False | False | False | False | False | False | True | False | False | False |
| 2 | 30 | False | False | False | False | False | True | False | False | False | ... | False | False | False | False | False | False | True | False | False | False |
| 3 | 62 | False | False | False | False | False | True | False | False | False | ... | False | False | False | False | False | False | True | False | False | False |
| 4 | 62 | False | False | False | False | False | True | False | False | True | ... | False | False | False | False | False | False | True | False | False | False |
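
Value counts work well for the categorical columns, but `Age` is numeric with 65 distinct values, so its counts say little about extreme values. A minimal sketch of an interquartile-range (IQR) check is given below as one possible complement. The helper name `iqr_outliers`, the multiplier `k=1.5`, and the toy ages (including the deliberately extreme 120) are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy ages for illustration only (not taken from Thyroid_Diff.csv).
ages = pd.Series([27, 34, 30, 62, 62, 31, 40, 26, 28, 120], name="Age")
outliers = iqr_outliers(ages)
print(outliers)
```

On the real data the call would be `iqr_outliers(df["Age"])`. Whether a flagged age is an error or simply an older patient is a judgment call that the rule itself cannot make.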


### II. **Feature selection**

In [None]:
# feature selection
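
One possible way to fill this step is a univariate filter over the dummy-encoded frame. The sketch below ranks features by chi-squared score against `Recurred_Yes` using scikit-learn's `SelectKBest`. The function name `select_top_features`, the choice of chi-squared, and the value of `k` are assumptions rather than the group's chosen method. The toy frame reuses a few column names from `df_dummies`, but its values are invented.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

def select_top_features(df_dummies, target_col="Recurred_Yes", k=10):
    """Keep the k features with the highest chi-squared score against the target."""
    X = df_dummies.drop(columns=[target_col]).astype(float)  # chi2 needs non-negative numbers
    y = df_dummies[target_col].astype(int)
    k = min(k, X.shape[1])
    selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
    scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
    selected = list(X.columns[selector.get_support()])
    return selected, scores

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "Age": [27, 34, 30, 62, 62, 52, 41, 46],
    "Gender_M": [False, False, False, False, False, True, False, True],
    "Response_Structural Incomplete": [False, False, False, False, True, True, True, True],
    "Recurred_Yes": [False, False, False, False, True, True, True, True],
})
selected, scores = select_top_features(toy, k=2)
print(selected)
print(scores)
```

The chi-squared score of a raw numeric column such as `Age` depends on its scale, so it is not directly comparable with the scores of the 0/1 dummy columns. Binning or scaling `Age` first, or switching to `mutual_info_classif`, would be reasonable alternatives.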

### III. **Model fitting**

In [None]:
# model fitting

'''
Candidate machine learning models:
- Random forest
- XGBoost
- AdaBoost
- other gradient boosting methods
- Naive Bayes
- Logistic regression
- other classification models as needed
'''
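
As a starting point, several of the candidate models listed above can be fitted in one loop with scikit-learn. The sketch below is an assumption about how this cell might look, not the group's final pipeline. XGBoost is omitted because it lives in a separate package, and `GradientBoostingClassifier` stands in for "other gradient boosting". The helper name `fit_candidate_models` is hypothetical, and the data comes from `make_classification`, so it is synthetic rather than the thyroid dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def fit_candidate_models(X_train, y_train, random_state=42):
    """Fit each candidate classifier on the same training split."""
    models = {
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=random_state),
        "adaboost": AdaBoostClassifier(random_state=random_state),
        "gradient_boosting": GradientBoostingClassifier(random_state=random_state),
        "naive_bayes": GaussianNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    for model in models.values():
        model.fit(X_train, y_train)
    return models

# Synthetic stand-in data; replace with the features and Recurred_Yes target.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
models = fit_candidate_models(X_train, y_train)
for name, model in models.items():
    print(name, round(model.score(X_test, y_test), 3))
```

With the real data, `X` would be `df_dummies.drop(columns=["Recurred_Yes"])` and `y` would be `df_dummies["Recurred_Yes"]`. Stratifying the split keeps the proportion of recurrences similar in the training and test sets.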

### IV. **Model evaluation**

In [None]:
# model evaluation

'''
Classification metrics:
- accuracy
- recall
- specificity
- precision
- F1 score
'''
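
All five metrics listed above can be derived from the binary confusion matrix. scikit-learn has no dedicated specificity function, so the sketch below computes everything from the four cell counts. The helper name `evaluate_predictions` and the toy labels are assumptions for illustration, and the positive class (1) is taken to mean "recurred".

```python
from sklearn.metrics import confusion_matrix

def evaluate_predictions(y_true, y_pred):
    """Compute accuracy, recall, specificity, precision and F1 for 0/1 labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    def safe_div(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return float(num) / float(den) if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = evaluate_predictions(y_true, y_pred)
print(metrics)
```

Boolean targets such as `Recurred_Yes` should be cast with `.astype(int)` before calling the helper so that they match `labels=[0, 1]`.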

### V. **Explainable AI**

In [None]:
# use SHAP and LIME

# feature importance plot
# prediction probabilities
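
SHAP and LIME are separate packages, and the sketch below does not use either of them. As a stand-in that needs only scikit-learn, it produces the two quantities named in the comments above: a feature importance ranking (impurity-based, from a random forest) and per-sample prediction probabilities. The synthetic data and the feature names `feature_0` to `feature_5` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; replace with the selected features and target.
X_arr, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances: a global ranking, not a SHAP explanation.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Predicted class probabilities for the first five samples.
proba = model.predict_proba(X.iloc[:5])
print(proba)
```

Calling `importances.plot.barh()` would turn the ranking into a bar chart if matplotlib is installed. Unlike SHAP or LIME, this ranking is global and does not explain individual predictions, so it is best treated as a first check rather than a replacement.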