# Title: Haberman's Survival Data

1. Sources: (a) Donor: Tjen-Sien Lim (limt@stat.wisc.edu) (b) Date: March 4, 1999

2. Past Usage:

    a. Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models, Proceedings of the 9th International Biometrics Conference, Boston, pp. 104-122.
    
    b. Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984), Graphical Models for Assessing Logistic Regression Models (with discussion), Journal of the American Statistical Association 79: 61-83.
    
    c. Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.
    
3. Relevant Information: The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer (12 years of data).

4. Number of Instances: 306

5. Number of Attributes: 4 (including the class attribute)

6. Attribute Information:

    a. Age of patient at time of operation (numerical)
    
    b. Patient's year of operation (year - 1900, numerical)
    
    c. Number of positive axillary nodes detected (numerical)
    
    (Positive axillary lymph node. A positive axillary lymph node is a lymph node in the area of the armpit (axilla) to which cancer has spread. This spread is determined by surgically removing some of the lymph nodes and examining them under a microscope to see whether cancer cells are present.)
    
    d. Survival status (class attribute)
    
    1 = the patient survived 5 years or longer
                  
    2 = the patient died within 5 years
    
7. Missing Attribute Values: None


__The objective of this analysis is to classify whether a patient will survive 5 years or longer after surgery, given the patient's age, year of operation and the number of positive axillary nodes detected__

In [None]:
# Reading the haberman csv data set

import pandas 
import seaborn
import matplotlib.pyplot as plt
import numpy 
# Loading haberman.csv into a pandas DataFrame.
cancer = pandas.read_csv("../input/haberman.csv")

In [None]:
# (Q) how many data-points and features?
# (or)size of the matrix of dataset
print(cancer.shape)

In [None]:
#(Q) What are the column names in our dataset?
print(cancer.columns)

In [None]:
#(Q) How many data points for each class are present? 
# or how many people survived for 5 or more than 5 years and how many died
cancer["Surv_status"].value_counts()


In [None]:
# Printing the top 5 rows for initial analysis of the dataset and for verification

cancer.head()

In [None]:
# total number of observations
print(cancer.info())

In [None]:
# unique values of the class attribute
list(cancer['Surv_status'].unique())

##### Observation

1. 4 columns: 

    a. Age
    
    b. Op_Year
    
    c. axial_nodes_det
    
    d. Surv_status
    
2. In Surv_status: __Class Attribute__
    
    a. 1  (__224__) - lived for 5 years or longer after surgery
       
    b. 2  (__81__) - died within 5 years
 
3. Size of table:
    
    305(rows) x 4 (columns)   

4. All columns in data set are non null

5. All values in data set are numerical

6. The class attribute takes two unique values:

    1 and 2

## Modifications

Since Surv_status is categorical, its numeric codes can be mapped to labels:
    
    a. 1 - 'survived'
    
    b. 2 - 'died'

In [None]:
# replacing 1 - survived and 2 - died
#cancer["Surv_status"].replace(to_replace = {
#   1: "Survived",
#    2: "Died"
#}, inplace = True)

cancer['Surv_status'] = cancer['Surv_status'].apply({1: 'survived', 2: 'died'}.get)
cancer



In [None]:
# verification of the values whether assigned or not

cancer["Surv_status"].value_counts()

In [None]:
# listing the unique values after the mapping

list(cancer['Surv_status'].unique())

In [None]:
# checking the overview of data set description

cancer.describe()

##### Observation

1. The loaded data covers 305 patients operated on between 1958 and 1969 (the dataset description lists 306 instances)

2. __Analysis of age__

    a. **mean** of age = __52__
    
    b. __range__ varies between __(30 to 83)__
    
3. __Analysis of axillary nodes (axial_nodes_det)__

    a. __mean__ number of axillary nodes = __4__
    
    b. __range__ varies between __(0 to 52)__
    
    c. 25% of patients have 0 axillary nodes
    
    d. 75% of patients have fewer than 5 axillary nodes
    
4. __Analysis of Op_Year__
    
    a. __mean__ operation year = __62__
    
    b. __range__ varies between __(58 to 69)__
    
    c. 25% of operations took place from __1958 to 1960__, i.e., within about 2 years
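
The percentile readings above come from `describe()`. As a rough illustration of how such figures are obtained, and of why a long right tail pulls the mean well above the median, here is a minimal sketch on made-up node counts (the array `nodes` is an assumption, not the real column):

```python
import numpy

# Made-up node counts with a long right tail, not the real data
nodes = numpy.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 13, 22, 52])

# 25th, 50th and 75th percentiles (linear interpolation, NumPy's default)
p25, p50, p75 = numpy.percentile(nodes, [25, 50, 75])
print(p25, p50, p75)

# The mean sits far above the median because of the few large values
print(nodes.mean(), nodes.min(), nodes.max())
```

On the real data the same calls would be applied to `cancer["axial_nodes_det"]`.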
    

In [None]:
cancer["Surv_status"].value_counts()

In [None]:
print("\n" + str(cancer["Surv_status"].value_counts(normalize = True)))

##### Observations
  The classes are imbalanced: **about 73% survived** and **about 27% died**
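
The class shares follow directly from the counts reported earlier (224 survived, 81 died). A minimal sketch of that arithmetic, using only those two counts; the variable names are illustrative:

```python
# Class counts taken from the value_counts() observation earlier in the notebook
counts = {"survived": 224, "died": 81}
total = sum(counts.values())

# Share of each class as a percentage, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total, shares)
```

The same figures are what `value_counts(normalize = True)` reports as fractions.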

In [None]:
# checking if any null values
cancer.isnull().any()

In [None]:
cancer.count()

##### Observation
The data set does not have any null values; every column has 305 non-null observations

# Three types of analysis:

1. Univariate : CDF, PDF, BOXPLOT, VIOLIN PLOT
2. Bivariate   : PAIR PLOT, SCATTER PLOT
3. Multivariate: 3D SCATTER PLOT

# UNIVARIATE ANALYSIS

## **Basic plots**

In [None]:
status1 = cancer.loc[cancer["Surv_status"] == "survived"]

plt.plot(status1["Age"], numpy.zeros_like(status1["Age"]), '2')#blue
plt.plot(status1["Op_Year"], numpy.zeros_like(status1["Op_Year"]), '*')#orange
plt.plot(status1["axial_nodes_det"], numpy.zeros_like(status1["axial_nodes_det"]), '.')#green
plt.show()






In [None]:
status2 = cancer.loc[cancer["Surv_status"] == "died"]

plt.plot(status2["Age"], numpy.zeros_like(status2["Age"]), '2')#blue
plt.plot(status2["Op_Year"], numpy.zeros_like(status2["Op_Year"]), '*')#orange
plt.plot(status2["axial_nodes_det"], numpy.zeros_like(status2["axial_nodes_det"]), '.')#green
plt.show()




In [None]:
seaborn.FacetGrid(cancer, hue = "Surv_status", height = 5)\
        .map(seaborn.distplot, "Age")\
        .add_legend()
plt.show()


In [None]:
seaborn.FacetGrid(cancer, hue = "Surv_status", height = 5)\
        .map(seaborn.distplot, "Op_Year")\
        .add_legend()
plt.show()

In [None]:
seaborn.FacetGrid(cancer, hue = "Surv_status", height = 5)\
        .map(seaborn.distplot, "axial_nodes_det")\
        .add_legend()
plt.show()

##### Observations

Every plot shows heavily overlapping distributions, so the next step is to construct CDFs.

Because the points overlap, we count the number of points that fall in each bin along the x axis and plot that count (as a fraction of the total) on the y axis. The running sum of these fractions gives the CDF.
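
The counting procedure described above can be sketched as a small helper. This is only an illustration: the function name `pdf_and_cdf` and the sample ages are assumptions, and the helper uses raw counts rather than `density = True`, which gives the same fractions when the bins have equal width.

```python
import numpy

def pdf_and_cdf(values, bins=10):
    """Return bin edges, the fraction of points per bin, and its running sum."""
    counts, bin_edges = numpy.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of points in each bin
    cdf = numpy.cumsum(pdf)       # fraction of points up to each bin's right edge
    return bin_edges, pdf, cdf

# Made-up ages, only to exercise the helper
ages = numpy.array([30, 34, 38, 41, 45, 47, 52, 52, 57, 60, 63, 70])
bin_edges, pdf, cdf = pdf_and_cdf(ages, bins=4)
print(bin_edges)
print(pdf)
print(cdf)
```

Plotting `pdf` and `cdf` against `bin_edges[1:]` reproduces the kind of curves drawn in the cells that follow.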


## **CDF & PDF**

In [None]:
# Histogram of survivors' ages with 10 bins
counts, bin_edges = numpy.histogram(status1['Age'], bins=10, 
                                 density = True)
# Normalise so the bin values sum to 1 (valid because the bins have equal width)
pdf = counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf = numpy.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)

# Same PDF with 20 bins, for a finer-grained view
counts, bin_edges = numpy.histogram(status1['Age'], bins=20, 
                                 density = True)
pdf = counts/(sum(counts))
plt.plot(bin_edges[1:], pdf)

plt.show()


In [None]:
counts, bin_edges = numpy.histogram(status1['Age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)

#compute CDF
cdf = numpy.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)



plt.show();

# blue - pdf
# orange - cdf


##### Observation

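These readings are for surviving patients only, since the plot uses status1.

1) cdf - about 10% of patients are below age 35

2) pdf - the largest share of patients (about 16%) falls in the 53 - 58 age bin

3) pdf - very few patients are above age 77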

In [None]:
counts, bin_edges = numpy.histogram(status1['Op_Year'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)

#compute CDF
cdf = numpy.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)



plt.show();



##### Observation 

1) The largest share of operations (about 18%) was performed in 1958

2) About 30% of operations were performed between 1958 and 1960

3) The number of patients drops sharply in 1961 - 62

In [None]:
counts, bin_edges = numpy.histogram(status1['axial_nodes_det'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)

#compute CDF
cdf = numpy.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)



plt.show();



##### Observation

1) 88% of patients have fewer than 10 axillary nodes

2) 98% of patients have fewer than 28 axillary nodes
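
Statements of the form "x% of patients have fewer than k nodes" can also be checked directly, without reading them off a binned CDF plot. A minimal sketch on made-up node counts; the helper `fraction_below` and the array are illustrative, not part of the notebook:

```python
import numpy

# Made-up node counts, not the real data
nodes = numpy.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 13, 22, 30])

def fraction_below(values, threshold):
    """Share of observations strictly below the threshold."""
    return float(numpy.mean(numpy.asarray(values) < threshold))

print(fraction_below(nodes, 10))
print(fraction_below(nodes, 28))
```

On the real data the first argument would be `status1['axial_nodes_det']`.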

In [None]:
print("age")
counts, bin_edges = numpy.histogram(status1['Age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = numpy.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)

print("Op_Year")
counts, bin_edges = numpy.histogram(status1['Op_Year'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = numpy.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)

print("axial_nodes_det")

counts, bin_edges = numpy.histogram(status1['axial_nodes_det'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = numpy.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)


plt.show();


In [None]:
plt.close()
# Histogram of every numeric column; figsize is passed to hist() so it applies to this figure
cancer.hist(figsize=(18, 8))
plt.show()


## **BOX PLOTS**

In [None]:
seaborn.boxplot(x='Surv_status',y='Age', data= cancer)
plt.show()



##### Observation
1) patients aged 30 - 34 all survived

2) patients older than 77 all died

3) patients aged 71 to 77 have a better chance of survival

In [None]:
seaborn.boxplot(x='Surv_status',y='Op_Year', data= cancer)
plt.show()



##### Observation 
1) patients who underwent operations in 1960-66 have a higher chance of survival


In [None]:
seaborn.boxplot(x='Surv_status',y='axial_nodes_det', data= cancer)
plt.show()


##### Observation
 1) patients with fewer than 3 axillary nodes had a higher chance of survival
 
 2) patients with more than 10 axillary nodes had a very low chance of survival
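
The thresholds read off a box plot correspond to per-class quartiles, which can be computed numerically as well. A minimal sketch, assuming a made-up frame `toy` that reuses the notebook's column names:

```python
import pandas

# Made-up rows using the notebook's column names
toy = pandas.DataFrame({
    "axial_nodes_det": [0, 1, 2, 3, 0, 4, 9, 15, 23, 2],
    "Surv_status": ["survived"] * 5 + ["died"] * 5,
})

# 25th, 50th and 75th percentiles of the node count for each class
quartiles = (
    toy.groupby("Surv_status")["axial_nodes_det"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

Each row of `quartiles` gives the lower edge, median line and upper edge of the corresponding box.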

## **VIOLIN PLOT**

In [None]:
seaborn.violinplot(x="Surv_status", y="Age", data=cancer)
plt.show()

##### Observation

1) most of the patients who died were older than 40


In [None]:
seaborn.violinplot(x="Surv_status", y="Op_Year", data=cancer)
plt.show()

##### Observation

1) survival is most frequent among patients operated on in 1958 to 1962

2) deaths are most frequent among patients operated on in 1963 to 1965

3) year of operation is not very useful for classification, as its distribution is similar in both classes

In [None]:
seaborn.violinplot(x="Surv_status", y="axial_nodes_det", data=cancer)
plt.show()

##### Observation

1) most patients with 0 axillary nodes survived

2) patients with fewer than 5 axillary nodes have a higher chance of surviving

3) patients with more than 50 axillary nodes died

# BIVARIATE PLOTS

## **SCATTER PLOTS**

In [None]:
print(cancer.columns)
cancer.plot(kind='scatter', x='Age', y='Op_Year') ;
plt.show()

In [None]:
seaborn.set_style("whitegrid");
seaborn.FacetGrid(cancer, hue="Surv_status", height=4) \
   .map(plt.scatter, "Age", "Op_Year") \
   .add_legend();
plt.show();

In [None]:
seaborn.set_style("whitegrid");
seaborn.FacetGrid(cancer, hue="Surv_status", height=4) \
   .map(plt.scatter, "Age", "axial_nodes_det") \
   .add_legend();
plt.show();

In [None]:
seaborn.set_style("whitegrid");
seaborn.FacetGrid(cancer, hue="Surv_status", height=4) \
   .map(plt.scatter, "Op_Year", "axial_nodes_det") \
   .add_legend();
plt.show();

In [None]:
seaborn.set_style("whitegrid");
# The third column is passed to plt.scatter positionally, so it sets the marker size (s)
seaborn.FacetGrid(cancer, hue="Surv_status", height=4) \
   .map(plt.scatter, "Age", "Op_Year", "axial_nodes_det") \
   .add_legend();
plt.show();

##### Observation

1) the two classes are not linearly separable in any of these feature pairs

2) So the classes cannot be distinguished from these scatter plots alone

## **PAIR PLOT**

In [None]:
plt.close();
seaborn.set_style("whitegrid");
seaborn.pairplot(cancer, hue="Surv_status", height=3);
plt.show()

##### Observation

1) Year of operation has little effect on the classification

2) Patient age and number of axillary nodes have some effect on the classification

# MULTIVARIATE

## **3D SCATTER PLOT**

In [None]:
cancer['Surv_status'] = cancer['Surv_status'].apply({'survived': 0, 'died': 1}.get)
cancer

In [None]:
import mpl_toolkits.mplot3d  # needed by older matplotlib releases to register the '3d' projection
fig = plt.figure(figsize = (12, 10))
ax = fig.add_subplot(111, projection = '3d')
# Surv_status is 0 (survived) or 1 (died) here; 1 is drawn green and 0 red
ax.scatter(cancer['Age'], cancer['Op_Year'], cancer['axial_nodes_det'], 
           c = cancer['Surv_status'].map({1: 'green', 0: 'red'}), marker = 'o')
ax.set_xlabel('Age')
ax.set_ylabel('Op_Year')
ax.set_zlabel('axial_nodes_det')
plt.show()

In [None]:
cancer['Surv_status'] = cancer['Surv_status'].apply({0: 'survived', 1: 'died'}.get)
cancer

## **PROBABILITY DENSITY OR CONTOUR PLOT**

In [None]:
seaborn.jointplot(x="Age", y="axial_nodes_det", data=cancer, kind="kde");
plt.show();

# CONCLUSION

1) The two classes overlap heavily in every feature

2) Year of operation does not help predict survival, but age and the number of axillary nodes are somewhat informative

3) Even with axillary node count and age, the classes cannot be cleanly separated

4) The number of axillary nodes is the most informative feature in this analysis

* Based on the observations above, the data set can be divided into 4 regions by age
 
| Region |       Condition     | total     | survived  |survival%  |
|--------|---------------------|-----------|-----------|-----------|
|    1   |       age <= 40     |     42    |     38    |   90%     |
|    2   | age > 40 & age <= 70|     234   |     166   |   70%     |
|    3   | age > 70 & age <= 77|     11    |     9     |   82%     |
|    4   | age > 77 & age <= 83|     2     |     0     |    0%     |
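
A table of this shape can be produced by binning age and aggregating per bin. The sketch below shows one way to do that with `pandas.cut`; the rows in `toy` are made up, and only the column names and bin boundaries follow the notebook.

```python
import pandas

# Made-up rows; only the column names follow the notebook
toy = pandas.DataFrame({
    "Age": [33, 38, 45, 52, 61, 66, 72, 75, 78, 83],
    "Surv_status": ["survived", "survived", "survived", "died", "survived",
                    "died", "survived", "survived", "died", "died"],
})

# Right-inclusive bins matching the table: <= 40, (40, 70], (70, 77], (77, 83]
toy["age_region"] = pandas.cut(toy["Age"], bins=[0, 40, 70, 77, 83],
                               labels=[1, 2, 3, 4])

# Count rows and survivors in each region, then derive the survival percentage
summary = (
    toy.assign(survived=toy["Surv_status"].eq("survived"))
       .groupby("age_region", observed=False)["survived"]
       .agg(total="size", survived="sum")
)
summary["survival_pct"] = 100 * summary["survived"] / summary["total"]
print(summary)
```

Running the same steps on `cancer` would regenerate the counts and percentages for the real data.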

* The data set can also be split by positive axillary nodes; the 2 candidate cut-offs below overlap

| Region |       Condition     | total     | survived  |survival%  |
|--------|---------------------|-----------|-----------|-----------|
|    1   |       node <= 3     |     216   |     177   |   81.9%   |
|    2   |       node <= 4     |     229   |     187   |   81.6%   |
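
Comparing cut-offs on the node count amounts to filtering rows at or below each candidate threshold and measuring the survival share. A minimal sketch, assuming a made-up frame `toy` and a hypothetical helper `survival_at_or_below`:

```python
import pandas

# Made-up rows using the notebook's column names
toy = pandas.DataFrame({
    "axial_nodes_det": [0, 0, 1, 2, 3, 4, 4, 7, 12, 25],
    "Surv_status": ["survived", "survived", "survived", "died", "survived",
                    "survived", "died", "died", "survived", "died"],
})

def survival_at_or_below(df, threshold):
    """Return (total, survived, survival %) for rows with node count <= threshold."""
    subset = df[df["axial_nodes_det"] <= threshold]
    survived = int((subset["Surv_status"] == "survived").sum())
    return len(subset), survived, 100 * survived / len(subset)

for threshold in (3, 4):
    print(threshold, survival_at_or_below(toy, threshold))
```

Each printed tuple corresponds to one row of a table like the one above.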

In [None]:
def CancerAnalysis(age, opyear, node):
    
    '''Rule-of-thumb prediction based on the regions above.

        True: if the patient is predicted to survive 5 years or longer
        False: otherwise

    opyear is accepted but not used, since the year showed little effect.
    '''
    
    if age <= 40:
        return True
    elif age > 77:
        return False
    elif 70 < age <= 77:  # chained comparison: older than 70 and at most 77
        return True
    elif node <= 4:
        return True
    else:
        return False  # the rule is least reliable in this branch
    
    

1) The above function is not 100% accurate.

2) A more accurate analysis would likely need a proper classification algorithm, because the classes overlap heavily
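
How far such a rule is from 100% can be measured by comparing its predictions with known outcomes. The sketch below is only an illustration: `rule_predicts_survival` is a hypothetical stand-in that restates the age and node thresholds, and the records are made up.

```python
def rule_predicts_survival(age, node):
    """Hypothetical stand-in for the rule above: True means predicted to survive."""
    if age <= 40:
        return True
    if age > 77:
        return False
    if 70 < age <= 77:
        return True
    return node <= 4

# Made-up (age, nodes, survived) records, not taken from the data set
records = [
    (35, 10, True),
    (80, 0, False),
    (73, 6, True),
    (55, 2, True),
    (60, 1, False),
    (50, 12, False),
    (65, 9, True),
]

# Count how many predictions match the recorded outcome
correct = sum(rule_predicts_survival(age, node) == survived
              for age, node, survived in records)
accuracy = correct / len(records)
print(f"accuracy = {accuracy:.2f}")
```

Applying the same comparison row by row to `cancer` would give the rule's accuracy on the real data.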