# Breast Cancer Data Analysis Project

This Jupyter Notebook contains data science exercises on the Breast Cancer data set, aimed at finding trends and learning more about the data.

In [1]:
# Packages used in this exercise.
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

## Exploratory Data Analysis

The CSV file is read into the notebook using the `pandas` package. The first 5 rows are then shown using the `.head()` method.

In [5]:
df = pd.read_csv('BRCA.csv')
print("The total number of patients in the dataset is {}.".format(len(df)))
df.head()


The total number of patients in the dataset is 341.


Unnamed: 0,Patient_ID,Age,Gender,Protein1,Protein2,Protein3,Protein4,Tumour_Stage,Histology,ER status,PR status,HER2 status,Surgery_type,Date_of_Surgery,Date_of_Last_Visit,Patient_Status
0,TCGA-D8-A1XD,36.0,FEMALE,0.080353,0.42638,0.54715,0.27368,III,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Modified Radical Mastectomy,15-Jan-17,19-Jun-17,Alive
1,TCGA-EW-A1OX,43.0,FEMALE,-0.42032,0.57807,0.61447,-0.031505,II,Mucinous Carcinoma,Positive,Positive,Negative,Lumpectomy,26-Apr-17,09-Nov-18,Dead
2,TCGA-A8-A079,69.0,FEMALE,0.21398,1.3114,-0.32747,-0.23426,III,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,08-Sep-17,09-Jun-18,Alive
3,TCGA-D8-A1XR,56.0,FEMALE,0.34509,-0.21147,-0.19304,0.12427,II,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Modified Radical Mastectomy,25-Jan-17,12-Jul-17,Alive
4,TCGA-BH-A0BF,56.0,FEMALE,0.22155,1.9068,0.52045,-0.31199,II,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,06-May-17,27-Jun-19,Dead


In [9]:
df.tail()

Unnamed: 0,Patient_ID,Age,Gender,Protein1,Protein2,Protein3,Protein4,Tumour_Stage,Histology,ER status,PR status,HER2 status,Surgery_type,Date_of_Surgery,Date_of_Last_Visit,Patient_Status
336,,,,,,,,,,,,,,,,
337,,,,,,,,,,,,,,,,
338,,,,,,,,,,,,,,,,
339,,,,,,,,,,,,,,,,
340,,,,,,,,,,,,,,,,


### Data Cleaning

As shown above, some rows contain `NaN` values; the last five rows are entirely empty. We therefore clean the data with `dropna()`, which by default removes every row that has at least one `NaN` value and leaves the columns untouched.

In [10]:
df_cleaned = df.dropna()
print("Actually, there are {} patients in the dataset after data cleaning".format(len(df_cleaned)))

Actually, there are 317 patients in the dataset after data cleaning
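
Before dropping anything, it can help to see where the missing values sit. The following is a minimal sketch on a small made-up frame (the `toy` data and its values are illustrative assumptions, not rows from `BRCA.csv`). It contrasts the default `dropna()` with `dropna(how="all")`, which removes only rows that are empty in every column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for BRCA.csv: one complete row,
# one row with a single missing value, and one entirely empty row.
toy = pd.DataFrame({
    "Patient_ID": ["P1", "P2", np.nan],
    "Age": [36.0, np.nan, np.nan],
    "Gender": ["FEMALE", "FEMALE", np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Default behaviour: drop every row that has at least one NaN.
any_dropped = toy.dropna()

# Alternative: drop only the rows in which every value is NaN.
all_dropped = toy.dropna(how="all")

print(len(toy), len(any_dropped), len(all_dropped))
```

On the real data, comparing the two row counts would show how many of the removed rows were partially filled rather than entirely empty.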


## Data Analysis

**Question: What is the minimum age, maximum age, median age, mean age, and mode age of the dataset?**

In [12]:
print("Patient Age Statistics")
print("Min: {}".format(np.min(df_cleaned['Age'])))
print("Max: {}".format(np.max(df_cleaned['Age'])))
print("Median: {}".format(np.median(df_cleaned['Age'])))
print("Mean: {:.2f}".format(np.average(df_cleaned['Age']))) # :.2f rounds the number to exactly 2 decimal places
print("Mode: {}".format(df_cleaned['Age'].mode()[0])) # Series.mode() returns all modes in ascending order; [0] takes the smallest

Patient Age Statistics
Min: 29.0
Max: 90.0
Median: 58.0
Mean: 58.73
Mode: 59.0


**Question: What is the time difference between the surgery date and the last visit date? Create a new column that calculates the time difference in days, and find the longest time difference for a patient who is currently alive.**

In [None]:
## Code Here
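
One possible approach to this exercise, sketched on the five rows printed by `df.head()` rather than on the full file: parse both date columns with `pd.to_datetime`, subtract them, and keep the result in days. The column name `Days_Between` and the variable names are assumptions made for illustration; the date format string is inferred from values such as `15-Jan-17`.

```python
import pandas as pd

# The first five rows shown by df.head(), reduced to the columns needed here.
sample = pd.DataFrame({
    "Patient_ID": ["TCGA-D8-A1XD", "TCGA-EW-A1OX", "TCGA-A8-A079",
                   "TCGA-D8-A1XR", "TCGA-BH-A0BF"],
    "Date_of_Surgery": ["15-Jan-17", "26-Apr-17", "08-Sep-17",
                        "25-Jan-17", "06-May-17"],
    "Date_of_Last_Visit": ["19-Jun-17", "09-Nov-18", "09-Jun-18",
                           "12-Jul-17", "27-Jun-19"],
    "Patient_Status": ["Alive", "Dead", "Alive", "Alive", "Dead"],
})

# Parse the day-month-year strings (e.g. "15-Jan-17") into datetimes.
date_format = "%d-%b-%y"
surgery = pd.to_datetime(sample["Date_of_Surgery"], format=date_format)
last_visit = pd.to_datetime(sample["Date_of_Last_Visit"], format=date_format)

# New column (hypothetical name): days between surgery and last visit.
sample["Days_Between"] = (last_visit - surgery).dt.days

# Restrict to living patients and pick the row with the largest gap.
alive = sample[sample["Patient_Status"] == "Alive"]
longest = alive.loc[alive["Days_Between"].idxmax()]
print(longest["Patient_ID"], longest["Days_Between"])
```

The same steps should carry over to `df_cleaned`. Because `df_cleaned` is produced by `dropna()`, adding a column to it may trigger a pandas `SettingWithCopyWarning`; working on `df_cleaned.copy()` avoids that.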

**Discovery:**

**Question: How many males and females are in the dataset?**

In [14]:
num_males = len(df_cleaned[df_cleaned['Gender'] == "MALE"])
num_females = len(df_cleaned[df_cleaned['Gender'] == "FEMALE"])
print("There are {} males and {} females in the dataset.".format(num_males, num_females))

There are 4 males and 313 females in the dataset.


**Discovery: There are also males in the dataset that have breast cancer.**

**Question: What are the different histologies of the patients and how many of each are there in the dataset?**

In [18]:
histology_counts = df_cleaned['Histology'].value_counts()
print(histology_counts)

Infiltrating Ductal Carcinoma     224
Infiltrating Lobular Carcinoma     81
Mucinous Carcinoma                 12
Name: Histology, dtype: int64


**Discovery: The most common histology seen in this dataset for breast cancer is *Infiltrating Ductal Carcinoma*.**

**Question: What are the highest and lowest protein expression levels for each protein?**

In [None]:
## Code Here
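
A minimal sketch of one way to answer this, using only the protein values from the five rows printed by `df.head()`: `DataFrame.agg` with `["min", "max"]` yields one row per statistic and one column per protein. The names `sample`, `protein_cols`, and `protein_range` are illustrative assumptions.

```python
import pandas as pd

# Protein columns from the first five rows shown by df.head().
sample = pd.DataFrame({
    "Protein1": [0.080353, -0.42032, 0.21398, 0.34509, 0.22155],
    "Protein2": [0.42638, 0.57807, 1.3114, -0.21147, 1.9068],
    "Protein3": [0.54715, 0.61447, -0.32747, -0.19304, 0.52045],
    "Protein4": [0.27368, -0.031505, -0.23426, 0.12427, -0.31199],
})

protein_cols = ["Protein1", "Protein2", "Protein3", "Protein4"]

# One row per statistic ("min", "max"), one column per protein.
protein_range = sample[protein_cols].agg(["min", "max"])
print(protein_range)
```

Applied to `df_cleaned[protein_cols]`, the same call would report the extremes over all patients; the five sample rows are too few to say anything about the full data set.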

**Discovery**

**Question: For each tumour stage, what is the average expression level of each protein? What trends appear in the data? Present this data visually.**

In [None]:
## Code Here
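
One way this could be approached, again sketched on the five rows from `df.head()`: group by `Tumour_Stage`, average the protein columns, and reshape the result to long format for plotting. The helper `plot_stage_means` and the column names `Protein` and `Mean_Expression` are hypothetical; the helper assumes `seaborn` is installed and is only defined here, not called.

```python
import pandas as pd

# Tumour stage and protein values from the first five rows of df.head().
sample = pd.DataFrame({
    "Tumour_Stage": ["III", "II", "III", "II", "II"],
    "Protein1": [0.080353, -0.42032, 0.21398, 0.34509, 0.22155],
    "Protein2": [0.42638, 0.57807, 1.3114, -0.21147, 1.9068],
    "Protein3": [0.54715, 0.61447, -0.32747, -0.19304, 0.52045],
    "Protein4": [0.27368, -0.031505, -0.23426, 0.12427, -0.31199],
})
protein_cols = ["Protein1", "Protein2", "Protein3", "Protein4"]

# Mean expression of each protein within each tumour stage.
stage_means = sample.groupby("Tumour_Stage")[protein_cols].mean()
print(stage_means)

# Long format (one row per stage/protein pair) is convenient for plotting.
long_means = stage_means.reset_index().melt(
    id_vars="Tumour_Stage", var_name="Protein", value_name="Mean_Expression"
)


def plot_stage_means(long_df):
    """Draw a grouped bar chart of mean expression per stage and protein."""
    # Imported here so the table code above runs without a plotting library.
    import seaborn as sns
    return sns.barplot(data=long_df, x="Tumour_Stage",
                       y="Mean_Expression", hue="Protein")
```

In a notebook, calling `plot_stage_means(long_means)` would draw one group of bars per stage. Any trend should be read from the full `df_cleaned`, not from this five-row sample.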

**Discovery**

**Question: For each tumour stage, how many patients are alive and how many are dead?**

In [None]:
## Code Here
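
A possible sketch for this count, using the stage and status values from the five rows of `df.head()`: `pd.crosstab` tabulates every stage/status combination and fills combinations that never occur with 0. The name `status_by_stage` is an illustrative assumption.

```python
import pandas as pd

# Tumour stage and patient status from the first five rows of df.head().
sample = pd.DataFrame({
    "Tumour_Stage": ["III", "II", "III", "II", "II"],
    "Patient_Status": ["Alive", "Dead", "Alive", "Alive", "Dead"],
})

# Rows are tumour stages, columns are statuses, cells are patient counts.
status_by_stage = pd.crosstab(sample["Tumour_Stage"], sample["Patient_Status"])
print(status_by_stage)
```

An equivalent alternative is `sample.groupby(["Tumour_Stage", "Patient_Status"]).size()`, which returns the same counts as a long Series but omits combinations with no patients.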

**Discovery**