<a href="https://colab.research.google.com/github/talw98/Breast-Cancer-Survival-Prediction-Model/blob/main/Copy_of_Breast_Cancer_Survival_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BREAST CANCER SURVIVAL PREDICTION 

We have a dataset of 400 patients who have breast cancer and underwent surgery. The dataset was obtained from [Kaggle](https://kaggle.com/amandam1/breastcancerdataset). It is in CSV format and has 16 columns, listed below with the four protein columns grouped into one entry:


1.   Patient_ID
2.   Age
3.   Gender: Male, Female
4.   Protein1, Protein2, Protein3, Protein4 : expression levels
5.   Tumour_Stage
6.   Histology: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma,
     Mucinous Carcinoma
7.   ER status: Positive, Negative
8.   PR status: Positive, Negative
9.   HER2 status: Positive, Negative
10.  Surgery_type: Lumpectomy, Simple Mastectomy, Modified Radical Mastectomy,
     Other
11.  Date_of_Surgery
12.  Date_of_Last_Visit
13.  Patient_Status

The task is to predict whether a breast cancer patient will survive after surgery.


I will import the required libraries and the dataset.

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

from google.colab import files
upload = files.upload()

data = pd.read_csv("BRCA.csv")
data.head()

Unnamed: 0,Patient_ID,Age,Gender,Protein1,Protein2,Protein3,Protein4,Tumour_Stage,Histology,ER status,PR status,HER2 status,Surgery_type,Date_of_Surgery,Date_of_Last_Visit,Patient_Status
0,TCGA-D8-A1XD,36.0,FEMALE,0.080353,0.42638,0.54715,0.27368,III,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Modified Radical Mastectomy,15-Jan-17,19-Jun-17,Alive
1,TCGA-EW-A1OX,43.0,FEMALE,-0.42032,0.57807,0.61447,-0.031505,II,Mucinous Carcinoma,Positive,Positive,Negative,Lumpectomy,26-Apr-17,09-Nov-18,Dead
2,TCGA-A8-A079,69.0,FEMALE,0.21398,1.3114,-0.32747,-0.23426,III,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,08-Sep-17,09-Jun-18,Alive
3,TCGA-D8-A1XR,56.0,FEMALE,0.34509,-0.21147,-0.19304,0.12427,II,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Modified Radical Mastectomy,25-Jan-17,12-Jul-17,Alive
4,TCGA-BH-A0BF,56.0,FEMALE,0.22155,1.9068,0.52045,-0.31199,II,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,06-May-17,27-Jun-19,Dead


Now I will check whether the columns contain any null values, because null values would distort the results.

In [None]:
data.isnull().sum()


Patient_ID             7
Age                    7
Gender                 7
Protein1               7
Protein2               7
Protein3               7
Protein4               7
Tumour_Stage           7
Histology              7
ER status              7
PR status              7
HER2 status            7
Surgery_type           7
Date_of_Surgery        7
Date_of_Last_Visit    24
Patient_Status        20
dtype: int64


Every column contains null values. Rows with at least one null value will be dropped.

In [None]:
data = data.dropna()
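
To make the effect of `dropna()` concrete, here is a minimal sketch on a small made-up frame. The `toy` frame and its values are assumptions for illustration and are not taken from BRCA.csv. By default `dropna()` removes every row that has at least one missing value, so a few gaps in a single column are enough to shrink the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for BRCA.csv (hypothetical values)
toy = pd.DataFrame({
    "Age": [36.0, 43.0, np.nan, 56.0],
    "Patient_Status": ["Alive", "Dead", "Alive", np.nan],
})

# Count the missing values per column, as done on the real data
missing_per_column = toy.isnull().sum()

# Drop every row that has at least one missing value
cleaned = toy.dropna()

print(missing_per_column)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

If only some columns matter for the model, `dropna(subset=[...])` would keep more rows, at the cost of leaving nulls in the other columns.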

Here is a summary of the columns and their data types:

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 317 entries, 0 to 333
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Patient_ID          317 non-null    object 
 1   Age                 317 non-null    float64
 2   Gender              317 non-null    object 
 3   Protein1            317 non-null    float64
 4   Protein2            317 non-null    float64
 5   Protein3            317 non-null    float64
 6   Protein4            317 non-null    float64
 7   Tumour_Stage        317 non-null    object 
 8   Histology           317 non-null    object 
 9   ER status           317 non-null    object 
 10  PR status           317 non-null    object 
 11  HER2 status         317 non-null    object 
 12  Surgery_type        317 non-null    object 
 13  Date_of_Surgery     317 non-null    object 
 14  Date_of_Last_Visit  317 non-null    object 
 15  Patient_Status      317 non-null    object 
dtypes: float

Now I will check how many males and females are in the dataset.

In [None]:
data.Gender.value_counts()

FEMALE    313
MALE        4
Name: Gender, dtype: int64


Let's have a look at the tumour stage of the patients

In [None]:
# Tumour stage: count the patients in each stage and draw a donut chart
stage = data["Tumour_Stage"].value_counts()
labels = stage.index
counts = stage.values
figure = px.pie(data,
                values=counts,
                names=labels, hole=0.5,
                title="Tumour stages of patients")

figure.show()

Now let's have a look at the histology of the breast cancer patients

In [None]:
# Histology: count the patients per histology type and draw a donut chart
histology = data["Histology"].value_counts()
labels = histology.index
counts = histology.values
figure = px.pie(data,
                values=counts,
                names=labels, hole=0.5,
                title="Histology of patients")

figure.show()


Let's have a look at the ER, PR and HER2 status of the patients

In [None]:
# ER status
print(data["ER status"].value_counts())
# PR status
print(data["PR status"].value_counts())
# HER2 status
print(data["HER2 status"].value_counts())


Positive    317
Name: ER status, dtype: int64
Positive    317
Name: PR status, dtype: int64
Negative    288
Positive     29
Name: HER2 status, dtype: int64
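
The counts above show that ER status and PR status are Positive for all 317 remaining patients, so these two columns are constant. A constant column cannot help a model separate the two outcomes. The sketch below shows one way to detect such columns with `nunique()`. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with the same status columns (hypothetical values)
toy = pd.DataFrame({
    "ER status": ["Positive"] * 4,
    "PR status": ["Positive"] * 4,
    "HER2 status": ["Negative", "Negative", "Positive", "Negative"],
})

# A column with a single distinct value carries no information for the model
n_unique = toy.nunique()
constant_columns = n_unique[n_unique == 1].index.tolist()

print(constant_columns)
```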


Now let's have a look at the different Surgery types

In [None]:
# Surgery type: count the patients per surgery type and draw a donut chart
surgery = data["Surgery_type"].value_counts()
labels = surgery.index
counts = surgery.values
figure = px.pie(data,
                values=counts,
                names=labels, hole=0.5,
                title="Surgery types of patients")

figure.show()

This dataset has many categorical columns. They need to be converted to numerical values before we can train a machine learning model, so I will assign a number to each category.

In [None]:
data["Tumour_Stage"] = data["Tumour_Stage"].map({"I" : 1, "II" : 2, "III"  : 3})
data["Histology"] = data["Histology"].map({"Infiltrating Ductal Carcinoma" : 1, "Infiltrating Lobular Carcinoma" : 2, "Mucinous Carcinoma" : 3})
data["ER status"] = data["ER status"].map({"Positive" : 1})
data["PR status"] = data["PR status"].map({"Positive" : 1})
data["HER2 status"] = data["HER2 status"].map({"Positive" : 1, "Negative" : 2})
data["Gender"] = data["Gender"].map({"MALE" : 0, "FEMALE" : 1})
data["Surgery_type"] = data["Surgery_type"].map({"Other" : 1, "Modified Radical Mastectomy" : 2,
                                                 "Lumpectomy" : 3, "Simple Mastectomy" : 4})

data.head()

     Patient_ID   Age  Gender  Protein1  Protein2  Protein3  Protein4  \
0  TCGA-D8-A1XD  36.0       1  0.080353   0.42638   0.54715  0.273680   
1  TCGA-EW-A1OX  43.0       1 -0.420320   0.57807   0.61447 -0.031505   
2  TCGA-A8-A079  69.0       1  0.213980   1.31140  -0.32747 -0.234260   
3  TCGA-D8-A1XR  56.0       1  0.345090  -0.21147  -0.19304  0.124270   
4  TCGA-BH-A0BF  56.0       1  0.221550   1.90680   0.52045 -0.311990   

   Tumour_Stage  Histology  ER status  PR status  HER2 status  Surgery_type  \
0             3          1          1          1            2             2   
1             2          3          1          1            2             3   
2             3          1          1          1            2             1   
3             2          1          1          1            2             2   
4             2          1          1          1            2             1   

  Date_of_Surgery Date_of_Last_Visit Patient_Status  
0       15-Jan-17          19-Ju
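
One thing to watch with this kind of encoding: `Series.map` with a dictionary returns `NaN` for any value that is missing from the dictionary, so a misspelled or unexpected category silently becomes a null. The sketch below shows a quick check for that. The sample series and the deliberately misspelled label are assumptions for illustration only.

```python
import pandas as pd

# Sample values; the last label is misspelled on purpose (hypothetical data)
histology = pd.Series([
    "Infiltrating Ductal Carcinoma",
    "Mucinous Carcinoma",
    "Infiltration Lobular Carcinoma",
])

mapping = {
    "Infiltrating Ductal Carcinoma": 1,
    "Infiltrating Lobular Carcinoma": 2,
    "Mucinous Carcinoma": 3,
}

# Values that are not keys of the dictionary are mapped to NaN
encoded = histology.map(mapping)

# List the original labels that failed to map
unmapped = histology[encoded.isna()].unique().tolist()
print(unmapped)
```

Running a check like this after each `map` call would show whether any category was left without a number.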

Before moving on to the machine learning model, we need to split the data into training and test sets.

In [None]:
# Select the feature columns and the target, then split them into training and test sets
x = np.array(data[['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3', 'Protein4',
                   'Tumour_Stage', 'Histology', 'ER status', 'PR status',
                   'HER2 status', 'Surgery_type']])

y = np.array(data[['Patient_Status']])

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.10, random_state = 42)
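
With a test set of only 10%, a plain random split can leave very few examples of the rarer outcome in the test set. The sketch below shows the optional `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The synthetic `x` and `y` and the 80/20 class ratio are assumptions for illustration, not the real data. The sketch also keeps `y` one-dimensional.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels (hypothetical data)
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = np.array(["Alive"] * 80 + ["Dead"] * 20)  # 1-D label array

# stratify=y preserves the Alive/Dead ratio in the training and test sets
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.10, random_state=42, stratify=y
)

print(xtrain.shape, xtest.shape)
print(np.unique(ytest, return_counts=True))
```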

Now I will train the machine learning model.

In [None]:
model = SVC()
# ravel() flattens the (n_samples, 1) target into the 1-D shape that scikit-learn expects
model.fit(xtrain, ytrain.ravel())


SVC()
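
Once a model is fitted, the held-out test set can be used to estimate how well it generalises. The sketch below is a minimal, self-contained example of that step using `accuracy_score`. The synthetic data and the rule that generates the labels are assumptions for illustration, so the printed number says nothing about the real BRCA data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real features and labels (hypothetical data)
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
y = np.where(x[:, 0] + x[:, 1] > 0, "Alive", "Dead")

xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.10, random_state=42
)

# Fit on the training part only
model = SVC()
model.fit(xtrain, ytrain)

# Compare the predictions on unseen rows with the true labels
predictions = model.predict(xtest)
accuracy = accuracy_score(ytest, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

On the real data the same two lines, `model.predict(xtest)` followed by `accuracy_score(ytest, predictions)`, would apply to the `xtest` and `ytest` created earlier.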

I will now enter the feature values of one patient and predict whether that patient will survive breast cancer surgery. The data of the patient at index 0 (the first row) is used here.

In [None]:
# Predict whether the patient is alive or dead
# The data of patient 0 is used here to predict the outcome
# features = [['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3', 'Protein4', 'Tumour_Stage', 'Histology', 'ER status', 'PR status', 'HER2 status', 'Surgery_type']]

features = np.array([[36.0, 1, 0.080353, 0.42638, 0.54715, 0.273680, 3, 1, 1, 1, 2, 2]])
model.predict(features)


['Alive']
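
A hand-typed feature vector must list the values in exactly the same column order that was used for training. As a hedged alternative, the sketch below reads the row from a frame by column name, which avoids ordering mistakes. The one-row `toy` frame is an assumption that reuses the values of patient 0 shown above. Because patient 0 may also be part of the training data, a prediction for it is a sanity check rather than an evaluation.

```python
import pandas as pd

# Column order used when the model was trained
feature_columns = ["Age", "Gender", "Protein1", "Protein2", "Protein3",
                   "Protein4", "Tumour_Stage", "Histology", "ER status",
                   "PR status", "HER2 status", "Surgery_type"]

# One-row stand-in for the encoded dataset (values of patient 0)
toy = pd.DataFrame(
    [[36.0, 1, 0.080353, 0.42638, 0.54715, 0.273680, 3, 1, 1, 1, 2, 2]],
    columns=feature_columns,
)

# Selecting with a list of row labels keeps the result 2-D: (1, n_features)
features = toy.loc[[0], feature_columns].to_numpy()
print(features.shape)
```

The resulting array has the shape that `model.predict` expects for a single patient.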
