**Haberman's Survival**

**Data Description** The Haberman's survival dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

**Attribute Information:**

**1.** Age of patient at time of operation (numerical)

**2.** Patient's year of operation (year - 1900, numerical)

**3.** Number of positive axillary nodes detected (numerical)

**4.** Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years

In [None]:
import os

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style
from colorama import Fore, Style  # colored console output

# Input data files are available in the "../input/" directory.
# Listing it shows which files are present.
print(os.listdir("../input"))

# The CSV has no header row, so column names are assigned manually.
df = pd.read_csv("../input/haberman.csv", header=None)
df.columns = ['age', 'year', '#auxNodes', 'status']
# Map the numeric class label to a readable string (1 = survived, 2 = died).
df['status'] = df['status'].apply(lambda y: 'survived' if y == 1 else 'died')

print("*" * 15 + " Basic Information Of DataSet " + "*" * 15)
df.info()  # prints column dtypes and non-null counts itself; returns None
print("*" * 15 + " Description Of DataSet " + "*" * 15)
print(df.describe())  # count, mean, std, min, quartiles, max
# Number of patients who survived 5 years or longer
print("No. of people survived", (df['status'] == 'survived').sum())
print(df['status'].value_counts(normalize=True))  # class proportions

# Plot styling; later calls override earlier ones where they conflict.
sns.set()
style.use("ggplot")
sns.set_style("whitegrid")


**Observations**

**1.** There are 4 columns and 306 records.

**2.** There are no missing values.

**3.** Patient age ranges from 30 to 83, with a mean of 52.457 and a median of 52.

**4.** A total of 225 patients survived, which is about 73% of all patients.

**Objective**

To predict whether a patient will survive 5 years or longer after surgery, based on age, year of operation, and number of positive axillary nodes.

In [None]:
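# Univariate analysis
# Histograms with a density estimate, one plot per feature, colored by status
for idx, features in enumerate(df.columns[:-1]):
    # `height` replaces the `size` argument, which was renamed in seaborn 0.9.
    # Note: sns.distplot is deprecated since seaborn 0.11 (sns.histplot is its successor).
    sns.FacetGrid(df, hue='status', height=5).map(sns.distplot, features).add_legend()
    plt.title("Histogram Plot using " + features.upper())
    plt.show()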





**Observation**:

No clear information can be obtained from the above **histograms**; in all 3 plots the two classes overlap heavily.
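
The overlap seen in the histograms can also be expressed as a number. The sketch below uses the overlap coefficient (the sum of bin-wise minima of two normalized histograms on shared bin edges), which is a technique brought in here for illustration and not something the notebook computes. The helper name `overlap_coefficient` and the toy age values are assumptions; on the real data one would pass `df['age'][df['status'] == 'survived']` and its `'died'` counterpart.

```python
import numpy as np

def overlap_coefficient(a, b, bins=10):
    """Sum of bin-wise minima of two normalized histograms on shared edges.

    Returns a value in [0, 1]: 0 means the classes share no bins,
    1 means the binned distributions are identical.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shared edges make the two histograms directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    pa, _ = np.histogram(a, bins=edges)
    pb, _ = np.histogram(b, bins=edges)
    pa = pa / pa.sum()
    pb = pb / pb.sum()
    return float(np.minimum(pa, pb).sum())

# Toy values for illustration only (not taken from haberman.csv).
ages_survived = [30, 34, 38, 42, 47, 52, 55, 60, 65, 70]
ages_died = [34, 41, 44, 46, 50, 53, 57, 61, 66, 72]
print(overlap_coefficient(ages_survived, ages_died, bins=5))
```

A value close to 1 would support the observation that a feature alone cannot separate the two classes.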

In [None]:
# Probability Density Function (PDF) and Cumulative Distribution Function (CDF)
plt.figure(figsize = (20,8))
for idx,features in enumerate(df.columns[:-1]):
    plt.subplot(1,3,idx+1)
    #Survived People Probability Distribution 
    counts_survived,bins_edges_survived = np.histogram(df[features][df['status'] == 'survived'],bins = 10, density = True)
    pdf_survived = counts_survived/sum(counts_survived)
    cdf_survived = np.cumsum(pdf_survived)
    #Died People Probability Distribution 
    counts_died, bins_edges_died = np.histogram(df[features][df['status'] == 'died'], bins = 10, density = True)
    pdf_died = counts_died/sum(counts_died)
    cdf_died = np.cumsum(pdf_died)
    
    print(Fore.GREEN + "*"*20 + Style.RESET_ALL + "  "+features.upper()+"  "+Fore.GREEN + "*"*20 + Style.RESET_ALL)
    print (Fore.RED +"Probability Density of People Survived"+Style.RESET_ALL, pdf_survived)
    print (Fore.RED + 'Probability Density of People Died '+ Style.RESET_ALL , pdf_died)
    print(Fore.RED + 'Cumulative Distribution of People Survived  ' + Style.RESET_ALL, cdf_survived)
    print(Fore.RED + 'Cumulative Distribution of People Died  ' + Style.RESET_ALL, cdf_died)
    
    #Graph Plotting.
    plt.title("PDF and CDF of "+features)
    plt.plot(bins_edges_survived[1:],pdf_survived, color = 'black',label = 'pdf of survived patient')
    plt.plot(bins_edges_survived[1:],cdf_survived, color = 'blue',label = 'cdf of survived patient')
    plt.plot(bins_edges_died[1:],pdf_died, color = 'red', label = 'pdf of dead patient')
    plt.plot(bins_edges_died[1:],cdf_died, color = 'green', label = 'cdf of dead patient')
    plt.xlabel(features)
    plt.legend()
plt.show()


**Observation**

As with the **histograms**, the **PDF and CDF** plots do not separate survived and dead patients well. The only pattern that stands out is that patients aged 30-35 mostly survived, which is also visible in the histograms.
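
The per-class computation in the cell above can be factored into one helper. This is a minimal sketch: with equal-width bins, dividing the `density=True` output by its sum gives the same per-bin fractions as dividing the raw counts by their total, so the helper uses raw counts. The name `binned_pmf_cdf` and the toy node counts are assumptions for illustration.

```python
import numpy as np

def binned_pmf_cdf(values, bins=10):
    """Return (right_edges, pmf, cdf) for equal-width bins.

    pmf[i] is the fraction of samples in bin i;
    cdf[i] is the fraction of samples in bins 0..i.
    """
    counts, edges = np.histogram(values, bins=bins)
    pmf = counts / counts.sum()
    cdf = np.cumsum(pmf)
    return edges[1:], pmf, cdf

# Toy node counts for illustration only (not taken from haberman.csv).
nodes = [0, 0, 0, 1, 1, 2, 3, 4, 8, 20]
right_edges, pmf, cdf = binned_pmf_cdf(nodes, bins=4)
print(right_edges)  # right edge of each bin
print(pmf)          # fraction of samples per bin
print(cdf)          # running total of the fractions
```

Calling the helper once per class and per feature would replace the duplicated `np.histogram` blocks in the loop.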

In [None]:
# Box plot with whiskers
fig,axes = plt.subplots(1,3,figsize = (15,5))
for idx, features in enumerate(df.columns[:-1]):
    sns.boxplot(x = 'status', y = features, data = df, ax = axes[idx]).set_title("Box plot with "+features)
plt.show()

#Violin Plot
fig,axes = plt.subplots(1,3,figsize = (15,5))
for idx, features in enumerate(df.columns[:-1]):
    sns.violinplot(x = 'status', y = features, data = df, ax = axes[idx]).set_title("violin plot using " +features)

plt.show()



**Observation**

1. **Box Plot:** among the three graphs, only *'Number of positive axillary nodes detected'* lets us distinguish the classes, using the rule

       if #auxNodes > 0.4
           patient died
       else
           survived

   Since node counts are whole numbers, this amounts to classifying any patient with at least one positive node as died.

2. **Violin Plot:** as with the box plot, the information from 'Number of positive axillary nodes detected' is much clearer than that from the other features, and it can be used to classify the data as stated above.
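
The threshold rule read off the box plot can be checked against the labels rather than judged by eye. The following is a rough sketch under stated assumptions: `predict_by_nodes` and `accuracy` are hypothetical helper names, and the toy rows are made up for illustration. On the real data one would pass `df['#auxNodes']` and `df['status']`.

```python
import numpy as np

def predict_by_nodes(nodes, threshold=0.4):
    """Label a patient 'died' if the node count exceeds the threshold."""
    nodes = np.asarray(nodes)
    return np.where(nodes > threshold, "died", "survived")

def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy labelled rows for illustration only (not taken from haberman.csv).
toy_nodes = [0, 0, 1, 3, 0, 7, 0, 2]
toy_status = ["survived", "survived", "survived", "died",
              "survived", "died", "died", "survived"]

pred = predict_by_nodes(toy_nodes)
print(pred)
print("accuracy on toy rows:", accuracy(toy_status, pred))
```

Comparing this accuracy with the share of the majority class (about 73% survived) would show whether the rule improves on always predicting "survived".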
                                                                                                         
 

In [None]:
#Bivariate Analysis

#Pair Plot
sns.pairplot(df, hue="status", height=3)  # `height` replaces the renamed `size` argument
plt.show()



**Observation**

**1.** From the **pair plot** of AGE against YEAR, patients younger than 40 mostly survived: there are more red dots for age <= 40.

**2.** The remaining plots are not very useful for classification.

In [None]:

'''#Graph representing the density estimate (Contour Graph)
sns.jointplot(x = "age", y = "year", data = df, kind = 'kde')
plt.show()
sns.jointplot(x = "age", y = "#auxNodes", data = df, kind = 'kde')
plt.show()
sns.jointplot(x= "year", y = "#auxNodes", data = df, kind = 'kde')
plt.show()'''


# Scatter plot with a density estimate (contour graph) overlaid.
# Separate contours for died and survived to better visualize both classes.
# x and y are passed by keyword, which newer seaborn versions require.
g = sns.jointplot(x="age", y="year", data=df[df['status'] == 'survived'], color="black").plot_joint(sns.kdeplot, zorder=0, n_levels=6)
g = sns.jointplot(x="age", y="year", data=df[df['status'] == 'died'], color="red").plot_joint(sns.kdeplot, zorder=0, n_levels=6)
plt.show()

g = sns.jointplot(x="age", y="#auxNodes", data=df[df['status'] == 'survived'], color="black").plot_joint(sns.kdeplot, zorder=0, n_levels=6)
g = sns.jointplot(x="age", y="#auxNodes", data=df[df['status'] == 'died'], color="red").plot_joint(sns.kdeplot, zorder=0, n_levels=6)
plt.show()

g = sns.jointplot(x="year", y="#auxNodes", data=df[df['status'] == 'survived'], color="black").plot_joint(sns.kdeplot, zorder=0, n_levels=6)
g = sns.jointplot(x="year", y="#auxNodes", data=df[df['status'] == 'died'], color="red").plot_joint(sns.kdeplot, zorder=0, n_levels=6)
plt.show()

**Observation**

From the **contour graphs** it can be observed that most of the patients who underwent surgery were aged 45-55 and were operated on in 1959-1964.

**Final Observations**

**1.** A patient with more than 0.4 positive axillary nodes (that is, at least one) can be classified as died.

**2.** A patient who underwent surgery between the ages of 45 and 55 in 1959-1964 can be classified as survived.

**3.** A patient aged 30-40 can be classified as survived.
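
The three final observations can disagree for some patients, for example a 35-year-old with several positive nodes. The sketch below combines them into one function to make that explicit. The order of precedence (age rules first, node rule as the fallback) is an assumption made purely for illustration, since the notebook does not say which rule wins, and `classify` is a hypothetical name. The year is expected in the dataset's encoding (year - 1900).

```python
def classify(age, year, nodes):
    """Apply the three final observations in an ASSUMED order of precedence.

    age   -- age at time of operation
    year  -- year of operation minus 1900 (e.g. 62 for 1962)
    nodes -- number of positive axillary nodes
    """
    # Observation 2: aged 45-55 and operated on in 1959-1964.
    if 45 <= age <= 55 and 59 <= year <= 64:
        return "survived"
    # Observation 3: aged 30-40.
    if 30 <= age <= 40:
        return "survived"
    # Observation 1: any positive node (count > 0.4) means 'died'.
    if nodes > 0.4:
        return "died"
    # No rule predicts 'died', so fall back to 'survived'.
    return "survived"

print(classify(age=35, year=60, nodes=5))  # age rule fires before the node rule
print(classify(age=60, year=62, nodes=3))  # only the node rule applies
```

Reversing the order would label the first patient as died instead, so the precedence should be chosen by checking each ordering against the labelled data.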