This dataset of breast cancer patients was obtained from the November 2017 update of the SEER Program of the NCI, which provides population-based cancer statistics. It covers female patients with infiltrating duct and lobular carcinoma of the breast (SEER primary sites recode NOS histology codes 8522/3) diagnosed in 2006-2010. Patients with unknown tumour size, an unknown number of examined or positive regional lymph nodes (LNs), or survival of less than 1 month were excluded; thus, 4024 patients were ultimately included.
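The exclusion rules above can be pictured as a row filter. The following is only a sketch on made-up records: it borrows the column names used later in this notebook, and it assumes that "unknown" is stored as NaN, which may not match how the original SEER extract encodes missing values.

```python
import numpy as np
import pandas as pd

# Toy records (made up). NaN stands in for "unknown"; this encoding is an assumption.
raw = pd.DataFrame({
    "Tumor Size": [35, np.nan, 18, 22],
    "Regional Node Examined": [14, 10, np.nan, 11],
    "Reginol Node Positive": [5, 2, 1, 3],
    "Survival Months": [62, 40, 84, 0],
})

# Keep rows where tumour size and both lymph-node counts are known ...
required = ["Tumor Size", "Regional Node Examined", "Reginol Node Positive"]
known = raw[required].notna().all(axis=1)

# ... and where the patient survived at least one month.
survived_a_month = raw["Survival Months"] >= 1

cohort = raw[known & survived_a_month]
print(len(raw), "records ->", len(cohort), "after exclusions")
```

The file loaded below already has these exclusions applied, so this filter is not needed to reproduce the notebook.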

# Importing dataset

In [2]:
import pandas as pd

In [4]:
df=pd.read_csv("Breast_Cancer.csv")
df

,Age,Race,Marital Status,T Stage,N Stage,6th Stage,differentiate,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,Survival Months,Status
0,68,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,4,Positive,Positive,24,1,60,Alive
1,50,White,Married,T2,N2,IIIA,Moderately differentiated,2,Regional,35,Positive,Positive,14,5,62,Alive
2,58,White,Divorced,T3,N3,IIIC,Moderately differentiated,2,Regional,63,Positive,Positive,14,7,75,Alive
3,58,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,18,Positive,Positive,2,1,84,Alive
4,47,White,Married,T2,N1,IIB,Poorly differentiated,3,Regional,41,Positive,Positive,3,1,50,Alive
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4019,62,Other,Married,T1,N1,IIA,Moderately differentiated,2,Regional,9,Positive,Positive,1,1,49,Alive
4020,56,White,Divorced,T2,N2,IIIA,Moderately differentiated,2,Regional,46,Positive,Positive,14,8,69,Alive
4021,68,White,Married,T2,N1,IIB,Moderately differentiated,2,Regional,22,Positive,Negative,11,3,69,Alive
4022,58,Black,Divorced,T2,N1,IIB,Moderately differentiated,2,Regional,44,Positive,Positive,11,1,72,Alive


In [5]:
df.shape

(4024, 16)

In [6]:
df.columns

Index(['Age', 'Race', 'Marital Status', 'T Stage ', 'N Stage', '6th Stage',
       'differentiate', 'Grade', 'A Stage', 'Tumor Size', 'Estrogen Status',
       'Progesterone Status', 'Regional Node Examined',
       'Reginol Node Positive', 'Survival Months', 'Status'],
      dtype='object')
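The column index above shows that 'T Stage ' carries a trailing space, which is easy to miss when selecting columns by name. A minimal sketch of how such names could be detected and normalised, using a made-up toy frame; the later cells of this notebook keep the original names, so the cleaned copy here is illustrative only.

```python
import pandas as pd

# Toy frame reusing three of the notebook's column names (values are made up).
toy = pd.DataFrame({"Age": [68, 50], "T Stage ": ["T1", "T2"], "N Stage": ["N1", "N2"]})

# Column names that change when surrounding whitespace is removed.
padded = [c for c in toy.columns if c != c.strip()]
print(padded)

# Rename on a copy so the original frame and its names stay untouched.
clean = toy.rename(columns=str.strip)
print(list(clean.columns))
```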

In [None]:
## This dataset contains 4024 rows (records) and 16 columns:
### Age: Age of the patient
### Race: Race of the patient (White, Black, or Other)
### Marital Status: Marital status of the patient
### T Stage: In the TNM system, “T” plus a letter or number (0 to 4) describes the size and location of the tumor. In the file, this column name carries a trailing space ('T Stage ').
### N Stage: The “N” in the TNM staging system stands for lymph nodes, the small, bean-shaped organs that help fight infection.
### 6th Stage: Overall stage of the cancer. Doctors may refer to stage I to stage IIA cancer as "early stage" and stage IIB to stage III as "locally advanced."
### differentiate: How closely the cancer cells resemble normal cells. Well-differentiated cancer cells look more like normal cells and tend to grow and spread more slowly than poorly differentiated or undifferentiated cancer cells.
### Grade: Tumor grade, a coded form of the degree of differentiation
### A Stage: Extent of spread of the cancer in the body (for example, Regional)
### Tumor Size: Size of the tumor
### Estrogen Status: Estrogen receptor status (Positive or Negative)
### Progesterone Status: Progesterone receptor status (Positive or Negative)
### Regional Node Examined: Total number of regional lymph nodes that were removed and examined by the pathologist.
### Reginol Node Positive: Number of examined regional lymph nodes found to contain metastases. The spelling "Reginol" comes from the file itself.
### Survival Months: Survival time of the patient in months
### Status: Vital status of the patient (Alive or Dead)
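
The "early stage" versus "locally advanced" grouping described for 6th Stage could be turned into a derived column. A minimal sketch, assuming only the five stage labels that occur in this dataset; the toy series and the name `stage_group` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy series with the five stage labels that occur in this dataset.
stages = pd.Series(["IIA", "IIB", "IIIA", "IIIC", "IIIB"], name="6th Stage")

# Stage I to IIA -> early stage; stage IIB to III -> locally advanced.
stage_to_group = {
    "IIA": "early stage",
    "IIB": "locally advanced",
    "IIIA": "locally advanced",
    "IIIB": "locally advanced",
    "IIIC": "locally advanced",
}

# Labels missing from the mapping would become NaN rather than being guessed.
stage_group = stages.map(stage_to_group)
print(stage_group.value_counts())
```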
       

In [7]:
for col in df.columns:
    print(col,"---------->" ,df[col].dtypes)

Age ----------> int64
Race ----------> object
Marital Status ----------> object
T Stage  ----------> object
N Stage ----------> object
6th Stage ----------> object
differentiate ----------> object
Grade ----------> object
A Stage ----------> object
Tumor Size ----------> int64
Estrogen Status ----------> object
Progesterone Status ----------> object
Regional Node Examined ----------> int64
Reginol Node Positive ----------> int64
Survival Months ----------> int64
Status ----------> object
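
Grade is reported as object even though most of its values look numeric, which usually means at least one entry is not a plain number. One way to find such entries is to attempt a numeric conversion and inspect what fails. This is a sketch on toy values; the label "not numeric" is a hypothetical placeholder, not a value taken from the file.

```python
import pandas as pd

# Toy values: three numeric-looking strings and one hypothetical text label.
grade = pd.Series(["3", "2", "1", "not numeric"], name="Grade")

# errors="coerce" turns anything that cannot be parsed into NaN.
as_number = pd.to_numeric(grade, errors="coerce")

# The original entries behind those NaNs explain the object dtype.
non_numeric = grade[as_number.isna()].unique().tolist()
print(non_numeric)
```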


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Age                     4024 non-null   int64 
 1   Race                    4024 non-null   object
 2   Marital Status          4024 non-null   object
 3   T Stage                 4024 non-null   object
 4   N Stage                 4024 non-null   object
 5   6th Stage               4024 non-null   object
 6   differentiate           4024 non-null   object
 7   Grade                   4024 non-null   object
 8   A Stage                 4024 non-null   object
 9   Tumor Size              4024 non-null   int64 
 10  Estrogen Status         4024 non-null   object
 11  Progesterone Status     4024 non-null   object
 12  Regional Node Examined  4024 non-null   int64 
 13  Reginol Node Positive   4024 non-null   int64 
 14  Survival Months         4024 non-null   int64 
 15  Status                  4024 non-null   object

In [9]:
for col in df.columns:
    print(col,"------->",df[col].isnull().sum())

Age -------> 0
Race -------> 0
Marital Status -------> 0
T Stage  -------> 0
N Stage -------> 0
6th Stage -------> 0
differentiate -------> 0
Grade -------> 0
A Stage -------> 0
Tumor Size -------> 0
Estrogen Status -------> 0
Progesterone Status -------> 0
Regional Node Examined -------> 0
Reginol Node Positive -------> 0
Survival Months -------> 0
Status -------> 0
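
The same per-column null counts can be obtained without an explicit loop, since `isnull().sum()` works on the whole frame at once. A short sketch on a made-up toy frame that, unlike this dataset, does contain missing values:

```python
import pandas as pd

# Toy frame with one missing value per column (values are made up).
toy = pd.DataFrame({"Age": [68, None, 58], "Status": ["Alive", "Dead", None]})

# One count per column, returned as a Series indexed by column name.
null_counts = toy.isnull().sum()

# Show only the columns that actually have missing values.
print(null_counts[null_counts > 0])
```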


In [13]:
df.duplicated().sum()

1

In [17]:
df.drop_duplicates(inplace=True)
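
Before dropping duplicates it can help to look at the rows involved. The sketch below uses a made-up toy frame; `keep=False` marks every copy of a duplicated row, and `reset_index(drop=True)` is an optional extra step, not used in this notebook, that renumbers the rows after the drop.

```python
import pandas as pd

# Toy frame in which the first and last rows are identical (values are made up).
toy = pd.DataFrame({"Age": [68, 50, 68], "Status": ["Alive", "Dead", "Alive"]})

# duplicated() flags only the later copies; keep=False flags every copy.
n_extra = int(toy.duplicated().sum())
dupes = toy[toy.duplicated(keep=False)]
print(n_extra, "extra row(s)")
print(dupes)

# Drop the extra copies and renumber the index from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

In this notebook the single duplicate leaves 4023 rows, which matches the totals of the value counts printed further down.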

In [26]:
# Categorical (object-dtype) columns; 'T Stage ' keeps the trailing space used in the file.
col = ['Race', 'Marital Status', 'T Stage ', 'N Stage', '6th Stage', 'differentiate', 'Grade', 'A Stage', 'Estrogen Status', 'Progesterone Status', 'Status']

In [27]:
for i in col:
    print(i,"------->",df[i].value_counts())
    print("______________________________")

Race -------> White    3412
Other     320
Black     291
Name: Race, dtype: int64
______________________________
Marital Status -------> Married      2642
Single        615
Divorced      486
Widowed       235
Separated      45
Name: Marital Status, dtype: int64
______________________________
T Stage  -------> T2    1786
T1    1602
T3     533
T4     102
Name: T Stage , dtype: int64
______________________________
N Stage -------> N1    2731
N2     820
N3     472
Name: N Stage, dtype: int64
______________________________
6th Stage -------> IIA     1304
IIB     1130
IIIA    1050
IIIC     472
IIIB      67
Name: 6th Stage, dtype: int64
______________________________
differentiate -------> Moderately differentiated    2350
Poorly differentiated        1111
Well differentiated           543
Undifferentiated               19
Name: differentiate, dtype: int64
______________________________
Grade -------> 2                        2350
3                        1111
1                         543
 an