# Wassim Mecheri Lab 6
# **Prediction with Neural Network and Logistic Regression**

Objective:  
In this lab, you will train a neural network and a logistic regression model to predict the status of breast cancer patients (alive or dead) using the Breast Cancer Dataset available on Kaggle, then compare the two models. This exercise will help you understand how to assess model performance on binary classification tasks.

Outcome Variable:  
Status: Label "Alive" as 0 and "Dead" as 1.

Models to Compare:
- Neural Network (using MLPClassifier)
- Logistic Regression

# **0 Loading Libraries**

In [1]:
from IPython.display import display, Markdown

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score

# **1 Data Preparation** 

In [2]:
display(Markdown('## **1.1 Load Dataset**'))
df = pd.read_csv('./Breast_Cancer.csv')
display(Markdown('**Sample of df**'))
display(df.head(3))
display(Markdown(f'**Initial shape**: {df.shape}'))

display(Markdown('## **1.2 Handling missing values**'))
clean_df = df.dropna().copy()  # explicit copy so later .loc assignments modify clean_df itself
display(Markdown(f'**Missing values removed**: {clean_df.shape}'))

display(Markdown('## **1.3 Encoding the Outcome Variable**'))
clean_df.loc[clean_df['Status']=='Alive','Status']=0
clean_df.loc[clean_df['Status']=='Dead','Status']=1
display(clean_df.head(3))

display(Markdown('## **1.4 Encoding Categorical Variables**'))
clean_df.loc[clean_df['A Stage']=='Distant','A Stage']=0
clean_df.loc[clean_df['A Stage']=='Regional','A Stage']=1
clean_df.loc[clean_df['Estrogen Status']=='Negative','Estrogen Status']=0
clean_df.loc[clean_df['Estrogen Status']=='Positive','Estrogen Status']=1
clean_df.loc[clean_df['Progesterone Status']=='Negative','Progesterone Status']=0
clean_df.loc[clean_df['Progesterone Status']=='Positive','Progesterone Status']=1
encoded_df = pd.get_dummies(clean_df, columns=['Race', 'Marital Status', 'T Stage ', 'N Stage', '6th Stage', 'differentiate', 'Grade'])
display(encoded_df.head(3))

display(Markdown('## **1.5 Data Splitting**'))
y = np.array(encoded_df['Status']).astype(int)
X = np.array(encoded_df.drop(columns='Status'))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
display(Markdown('**Shapes after splitting:**'))
display(Markdown(f'- X_train shape: {X_train.shape}'))
display(Markdown(f'- X_test shape: {X_test.shape}'))
display(Markdown(f'- y_train shape: {y_train.shape}'))
display(Markdown(f'- y_test shape: {y_test.shape}'))

## **1.1 Load Dataset**

**Sample of df**

| | Age | Race | Marital Status | T Stage | N Stage | 6th Stage | differentiate | Grade | A Stage | Tumor Size | Estrogen Status | Progesterone Status | Regional Node Examined | Reginol Node Positive | Survival Months | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | 4 | Positive | Positive | 24 | 1 | 60 | Alive |
| 1 | 50 | White | Married | T2 | N2 | IIIA | Moderately differentiated | 2 | Regional | 35 | Positive | Positive | 14 | 5 | 62 | Alive |
| 2 | 58 | White | Divorced | T3 | N3 | IIIC | Moderately differentiated | 2 | Regional | 63 | Positive | Positive | 14 | 7 | 75 | Alive |


**Initial shape**: (4024, 16)

## **1.2 Handling missing values**

**Missing values removed**: (4024, 16)

## **1.3 Encoding the Outcome Variable**

| | Age | Race | Marital Status | T Stage | N Stage | 6th Stage | differentiate | Grade | A Stage | Tumor Size | Estrogen Status | Progesterone Status | Regional Node Examined | Reginol Node Positive | Survival Months | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | 4 | Positive | Positive | 24 | 1 | 60 | 0 |
| 1 | 50 | White | Married | T2 | N2 | IIIA | Moderately differentiated | 2 | Regional | 35 | Positive | Positive | 14 | 5 | 62 | 0 |
| 2 | 58 | White | Divorced | T3 | N3 | IIIC | Moderately differentiated | 2 | Regional | 63 | Positive | Positive | 14 | 7 | 75 | 0 |


## **1.4 Encoding Categorical Variables**

| | Age | A Stage | Tumor Size | Estrogen Status | Progesterone Status | Regional Node Examined | Reginol Node Positive | Survival Months | Status | Race_Black | ... | 6th Stage_IIIB | 6th Stage_IIIC | differentiate_Moderately differentiated | differentiate_Poorly differentiated | differentiate_Undifferentiated | differentiate_Well differentiated | Grade_ anaplastic; Grade IV | Grade_1 | Grade_2 | Grade_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | 1 | 4 | 1 | 1 | 24 | 1 | 60 | 0 | False | ... | False | False | False | True | False | False | False | False | False | True |
| 1 | 50 | 1 | 35 | 1 | 1 | 14 | 5 | 62 | 0 | False | ... | False | False | True | False | False | False | False | False | True | False |
| 2 | 58 | 1 | 63 | 1 | 1 | 14 | 7 | 75 | 0 | False | ... | False | True | True | False | False | False | False | False | True | False |
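
The two encoding steps used above (0/1 labels for two-valued columns, one-hot columns for the rest) can also be written more compactly. The following is a minimal sketch on a toy frame, not the lab's actual code: the `toy` data and the `binary_maps` dictionary are made up for illustration, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for a few columns of the lab data (values invented for illustration).
toy = pd.DataFrame({
    "A Stage": ["Regional", "Distant", "Regional"],
    "Estrogen Status": ["Positive", "Negative", "Positive"],
    "Race": ["White", "Black", "Other"],
})

# Hypothetical helper mapping: one dictionary per two-valued column.
binary_maps = {
    "A Stage": {"Distant": 0, "Regional": 1},
    "Estrogen Status": {"Negative": 0, "Positive": 1},
}
for column, mapping in binary_maps.items():
    # Series.map replaces each label with its integer code.
    toy[column] = toy[column].map(mapping)

# Multi-valued columns become one indicator column per category.
toy_encoded = pd.get_dummies(toy, columns=["Race"])
print(toy_encoded)
```

One design note: `Series.map` returns `NaN` for any label missing from the mapping, so an unexpected category shows up as a missing value instead of silently staying a string.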


## **1.5 Data Splitting**

**Shapes after splitting:**

- X_train shape: (3219, 36)

- X_test shape: (805, 36)

- y_train shape: (3219,)

- y_test shape: (805,)
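
The split above uses `test_size=0.2` and `random_state=0` without stratification. If the two outcome classes are imbalanced, passing `stratify` keeps the class proportions nearly identical in the training and test sets. The sketch below shows this on synthetic data; the 15% positive rate and the names `X_demo` and `y_demo` are assumptions for illustration, not properties of the lab dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows with roughly 15% positives.
# The 15% rate is an assumption, not the real class ratio of the lab data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.15).astype(int)

# stratify=y_demo preserves the positive rate in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(f"positive rate overall: {y_demo.mean():.3f}")
print(f"positive rate train:   {y_tr.mean():.3f}")
print(f"positive rate test:    {y_te.mean():.3f}")
```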

# **2 Fitting the Models**

In [3]:
logistic = LogisticRegression(max_iter=1102) # increase max iter to converge
logistic.fit(X_train, y_train)
y_pred_prob_logistic = logistic.predict_proba(X_test)[:, 1]
y_pred_logistic = logistic.predict(X_test)

mlp_clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000, random_state=0)
mlp_clf.fit(X_train, y_train)
y_pred_prob_mlp = mlp_clf.predict_proba(X_test)[:, 1]
y_pred_mlp = mlp_clf.predict(X_test)
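
Both models above are fitted on unscaled features, which include columns on quite different scales (for example, age and tumor size next to 0/1 indicators). Gradient-based solvers often converge faster on standardized inputs, so one option is to wrap each model in a pipeline with `StandardScaler`. This is a minimal sketch on synthetic data, assuming standardization suits these features; it is not the configuration that produced the results in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the breast cancer dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Inside a pipeline the scaler is fitted on the training data only,
# then reused unchanged when predicting on the test data.
models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "mlp": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000, random_state=0),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # mean accuracy on the test set
    print(f"{name}: {scores[name]:.2f}")
```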

# **3 Evaluation Metrics**

In [4]:
display(Markdown('**Accuracy results**'))
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
accuracy_mlp = accuracy_score(y_test, y_pred_mlp)
display(Markdown(f'- **Logistic:** {accuracy_logistic:.2f}'))
display(Markdown(f'- **MLP:** {accuracy_mlp:.2f}'))

display(Markdown('**Recall results**'))
recall_logistic = recall_score(y_test, y_pred_logistic)
recall_mlp = recall_score(y_test, y_pred_mlp)
display(Markdown(f'- **Logistic:** {recall_logistic:.2f}'))
display(Markdown(f'- **MLP:** {recall_mlp:.2f}'))

display(Markdown('**Precision results**'))
precision_logistic = precision_score(y_test, y_pred_logistic)
precision_mlp = precision_score(y_test, y_pred_mlp)
display(Markdown(f'- **Logistic:** {precision_logistic:.2f}'))
display(Markdown(f'- **MLP:** {precision_mlp:.2f}'))

display(Markdown('**F1 Score results**'))
f1_score_logistic = f1_score(y_test, y_pred_logistic)
f1_score_mlp = f1_score(y_test, y_pred_mlp)
display(Markdown(f'- **Logistic:** {f1_score_logistic:.2f}'))
display(Markdown(f'- **MLP:** {f1_score_mlp:.2f}'))

**Accuracy results**

- **Logistic:** 0.89

- **MLP:** 0.90

**Recall results**

- **Logistic:** 0.40

- **MLP:** 0.44

**Precision results**

- **Logistic:** 0.80

- **MLP:** 0.85

**F1 Score results**

- **Logistic:** 0.53

- **MLP:** 0.58
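
To make the relationship between these four metrics concrete, the sketch below derives each one from the counts of a confusion matrix and checks the result against scikit-learn. The ten labels are a toy example invented for illustration, not predictions from the models above.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Toy labels (invented): 1 = "Dead", 0 = "Alive".
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 5 1 2 2

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The hand-computed values agree with scikit-learn's functions.
assert np.isclose(accuracy, accuracy_score(y_true, y_hat))
assert np.isclose(precision, precision_score(y_true, y_hat))
assert np.isclose(recall, recall_score(y_true, y_hat))
assert np.isclose(f1, f1_score(y_true, y_hat))
```

In this toy case accuracy is 0.70 while recall is only 0.50, which illustrates how accuracy can stay high while many positive cases are missed.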

**Comparison and Analysis**  
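The MLP neural network scored higher than logistic regression on every metric: accuracy (0.90 vs. 0.89), recall (0.44 vs. 0.40), precision (0.85 vs. 0.80), and F1 score (0.58 vs. 0.53). This suggests that the MLP captured nonlinear patterns that logistic regression could not, although the margins are small and come from a single train/test split. Given the objective of predicting patient status reliably, the MLP's higher recall and F1 score make it the better choice for this classification task. Both models still have recall below 0.50, meaning each misses more than half of the patients labeled "Dead".  
To further improve results, we could obtain more data or tune the MLP classifier's hyperparameters, such as the hidden layer sizes.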
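
One way to optimize the MLP's parameters is a cross-validated grid search. The sketch below runs on synthetic data with a deliberately small grid; the grid values, the choice of F1 as the selection metric, and the `X_demo`/`y_demo` names are all assumptions for illustration, not settings that were tested on the lab dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the breast cancer dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(max_iter=500, random_state=0)),
])

# Hypothetical grid: two architectures and two L2 penalty strengths.
param_grid = {
    "mlp__hidden_layer_sizes": [(10,), (10, 10)],
    "mlp__alpha": [1e-4, 1e-2],
}

# scoring="f1" selects the combination with the best mean F1 across 3 folds.
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.2f}")
```

Selecting on F1 (or recall) instead of accuracy matches the analysis above, where the ability to detect the "Dead" class matters more than overall accuracy.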