# Relevant Information:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

1. Number of instances: 306

2. Number of attributes: 4 (including the class attribute)

Attribute information:

2.1. Age of patient at time of operation (numerical)

2.2. Patient's year of operation (year - 1900, numerical)

2.3. Number of positive axillary nodes detected (numerical)

2.4. Survival status (class attribute):

- 1 = the patient survived 5 years or longer
- 2 = the patient died within 5 years

Source: Kaggle dataset.


# Objective


To predict whether a patient who underwent surgery for breast cancer survived 5 years or longer, based on the patient's age, the year of the operation, and the number of positive axillary nodes detected.




In [None]:



# Import necessary packages
import pandas as pd  # Data analysis and manipulation
import numpy as np  # Numerical operations
import seaborn as sns  # Data visualization
import matplotlib.pyplot as plt  # Data visualization


In [None]:
# Download haberman.csv from https://www.kaggle.com/gilsousa/habermans-survival-data-set/version/1
# Load haberman.csv into a pandas DataFrame; header=None because column names are supplied here.
colnames = ['age', 'year', 'nodes', 'status']
hdf = pd.read_csv('../input/haberman.csv', header=None, names=colnames)
hdf.head()

In [None]:
#checking for Null Values
hdf.isnull().sum()

# Observations:

    There are no missing values in this dataset, so no data imputation is needed.

In [None]:
#This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage
hdf.info()

In [None]:
#Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values
#To know statistical summary of data
hdf.describe()

# Observations:

   1) The patients' ages range from 30 to 83, with a mean of about 52.
    
   2) 75% of the patients have 4 or fewer positive lymph nodes, and at least 25% of the patients have no positive lymph nodes.
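
The quartile statements above can be read off `describe()`, or checked directly with `quantile`. The following is a minimal sketch, assuming a DataFrame with a `nodes` column like the one loaded earlier. The helper name `node_quartiles` and the small `toy` frame are hypothetical and contain made-up values, not rows from the dataset.

```python
import pandas as pd


def node_quartiles(df, column="nodes"):
    """Return the 25th, 50th and 75th percentiles of one column."""
    return df[column].quantile([0.25, 0.5, 0.75])


# Made-up values, used only to show the call; pass `hdf` instead of `toy`.
toy = pd.DataFrame({"nodes": [0, 0, 1, 2, 4, 4, 8, 20]})

q = node_quartiles(toy)
print(q)

# Share of rows with no positive nodes at all.
zero_share = (toy["nodes"] == 0).mean()
print(zero_share)
```

If the 25th percentile is 0, at least a quarter of the rows have no positive nodes, which is the reasoning behind the second observation.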
    
    



In [None]:
# To know how many data points are present for each class
hdf["status"].value_counts()

# Observation:
    The target column is imbalanced: about 73% of the values are '1' (survived 5 years or longer).
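
The percentage quoted above can be obtained without manual arithmetic by asking `value_counts` for proportions instead of counts. This is a small sketch, assuming a `status` column as in this notebook; `class_proportions` and the `toy` frame are illustrative names with made-up labels.

```python
import pandas as pd


def class_proportions(df, target="status"):
    """Return the fraction of rows in each class, most frequent first."""
    return df[target].value_counts(normalize=True)


# Made-up labels, only to demonstrate the call; pass `hdf` in the notebook.
toy = pd.DataFrame({"status": [1, 1, 1, 2]})

props = class_proportions(toy)
print(props)
```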

# Univariate analysis 

Univariate analysis is the simplest form of data analysis. "Uni" means "one", so the data has only one variable. It does not deal with causes or relationships (unlike regression); its main purpose is to describe: it takes data, summarizes it, and finds patterns in it. (Source: wiki)



Distribution plots show how frequently the data points fall in each range of values.


In [None]:
# The PDF shows what share of the points lies in each interval (a smoothed form of the histogram).
# `size` was renamed to `height` in seaborn 0.9.0; `distplot` is deprecated since seaborn 0.11.
sns.FacetGrid(hdf, hue="status", height=5).map(sns.distplot, "age").add_legend(); plt.ylabel("Density"); plt.title("Distribution of age")
sns.FacetGrid(hdf, hue="status", height=5).map(sns.distplot, "year").add_legend(); plt.ylabel("Density"); plt.title("Distribution of year of operation")
sns.FacetGrid(hdf, hue="status", height=5).map(sns.distplot, "nodes").add_legend(); plt.ylabel("Density"); plt.title("Distribution of positive axillary nodes detected")
plt.show()

In [None]:
# CDF: the area under the probability density function from minus infinity to x.
one = hdf.loc[hdf["status"] == 1]
two = hdf.loc[hdf["status"] == 2]


counts, bin_edges = np.histogram(one['age'], bins=10,density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)

counts, bin_edges = np.histogram(two['age'], bins=10,density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)

plt.title("PDF and CDF of age")
plt.xlabel("age")
plt.ylabel("fraction of patients")
label =['PDF of Status One','CDF of Status One','PDF of Status Two','CDF of Status Two']
plt.legend(label)

In [None]:
counts, bin_edges = np.histogram(one['year'], bins=10,density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)

counts, bin_edges = np.histogram(two['year'], bins=10,density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)

plt.title("PDF and CDF of year of operation")
plt.xlabel("year of operation (year - 1900)")
plt.ylabel("fraction of patients")
label =['PDF of Status One','CDF of Status One','PDF of Status Two','CDF of Status Two']
plt.legend(label)

In [None]:
counts, bin_edges = np.histogram(one['nodes'], bins=10,density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)

counts, bin_edges = np.histogram(two['nodes'], bins=10,density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)

plt.title("PDF and CDF of positive axillary nodes detected")
plt.xlabel("number of positive axillary nodes")
plt.ylabel("fraction of patients")
label =['PDF of Status One','CDF of Status One','PDF of Status Two','CDF of Status Two']
plt.legend(label)
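
The cells above approximate each CDF from a 10-bin histogram, so the curve depends on the choice of bins. An alternative, which is not what this notebook does, is the empirical CDF built directly from the sorted values. The sketch below uses a hypothetical helper `empirical_cdf` and made-up ages.

```python
import numpy as np


def empirical_cdf(values):
    """Return the sorted values and, for each, the fraction of points <= it."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y


# Made-up ages, only to demonstrate the call.
x, y = empirical_cdf([52, 30, 41, 65])
print(x)
print(y)
```

In the notebook this could be called as `empirical_cdf(one["age"])` and drawn with `plt.step(x, y, where="post")`.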

# Observations:

Because most of the distributions overlap, only approximate inferences can be drawn:

1) Patients younger than about 35 at the time of surgery survived.

2) Patients with more than about 45 positive axillary nodes did not survive.
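
Thresholds read from binned plots are approximate, so it can help to count the classes on each side of a cut-off directly. This is a minimal sketch under the assumption that the columns are named as in this notebook; `status_counts_where` is a hypothetical helper and the `toy` rows are made up.

```python
import pandas as pd


def status_counts_where(df, mask, target="status"):
    """Count rows per class among the rows selected by a boolean mask."""
    return df.loc[mask, target].value_counts().sort_index()


# Made-up rows, only to demonstrate the call; pass `hdf` in the notebook.
toy = pd.DataFrame({
    "age":    [31, 33, 40, 58, 62],
    "nodes":  [0, 2, 1, 50, 3],
    "status": [1, 1, 2, 2, 1],
})

young = status_counts_where(toy, toy["age"] < 35)
many_nodes = status_counts_where(toy, toy["nodes"] > 45)
print(young)
print(many_nodes)
```

If a class is missing from the result, no row of that class satisfies the condition, which is what each observation claims.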



In [None]:
#Box Plot :In descriptive statistics, a boxplot is a method for graphically depicting groups of numerical data through their quartiles
sns.boxplot(x='status',y='age',data=hdf).set_title("Survival_status based on Age");plt.show()
sns.boxplot(x='status',y='year',data=hdf).set_title("Survival_status based on year of operation");plt.show()
sns.boxplot(x='status',y='nodes',data=hdf).set_title("Survival_status based on positive axillary nodes detected");plt.show()


In [None]:
# Violin plot
# It combines a box plot with a smoothed density estimate of the data
sns.violinplot(x = "status", y = "age",  data = hdf).set_title("Survival_status based on Age");plt.show()
sns.violinplot(x = "status", y = "year",  data = hdf).set_title("Survival_status based on year of operation");plt.show()
sns.violinplot(x = "status", y = "nodes",  data = hdf).set_title("Survival_status based on positive axillary nodes detected");plt.show()


# Observation


Almost 80% of the patients with status 1 (survived) have between 0 and 5 positive lymph nodes; the distribution is densest in that range.
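
A share like this, estimated by eye from the violin plot, can be computed exactly. The sketch below assumes the `nodes` and `status` columns used in this notebook; `fraction_in_range` is a hypothetical helper and the `toy` values are made up.

```python
import pandas as pd


def fraction_in_range(df, column, low, high, status, target="status"):
    """Fraction of rows of one class whose `column` lies in [low, high]."""
    group = df.loc[df[target] == status, column]
    # Series.between includes both endpoints by default.
    return group.between(low, high).mean()


# Made-up rows, only to demonstrate the call; pass `hdf` in the notebook.
toy = pd.DataFrame({
    "nodes":  [0, 1, 3, 5, 12, 7, 2],
    "status": [1, 1, 1, 1, 1, 2, 2],
})

share = fraction_in_range(toy, "nodes", 0, 5, status=1)
print(share)
```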

# Bivariate Analysis
Bivariate analysis is one of the simplest forms of quantitative analysis. It involves the analysis of two variables, for the purpose of determining the empirical relationship between them. Bivariate analysis can be helpful in testing simple hypotheses of association. (Source: wiki)

# Scatter Plot

A scatter plot is a useful visual representation of the relationship between two numerical variables (attributes) and is usually drawn before working out a linear correlation or fitting a regression line. The resulting pattern indicates the type (linear or non-linear) and strength of the relationship between two variables. (Source: wiki)
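
The linear correlation mentioned above is usually measured with the Pearson coefficient. As a rough sketch of what that number is, the hypothetical helper `pearson_r` below computes it from its definition on made-up values.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Made-up values: a perfectly increasing and a perfectly decreasing pair.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

In the notebook, `hdf[["age", "year", "nodes"]].corr()` returns the same coefficient for every pair of features in one table.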



In [None]:
#Pair plot :Plot pairwise relationships in a dataset.
sns.set_style("whitegrid");
sns.pairplot(hdf,hue = "status", vars = ["age", "year", "nodes"],) #code source seaborn 0.9.0 documentation.
plt.show()
# NOTE: the diagonal elements are PDFs for each feature. PDFs are explained in the univariate analysis above.

# Observation



    No single feature or pair of features stands out as the most useful, because the two classes overlap heavily in every plot.

# Conclusion



1) The target column is imbalanced: it does not contain an equal number of data points for each class (about 73% of the values are '1').

2) In all the plots the class distributions overlap, so it is difficult to draw exact inferences or to form exact criteria for building a model from this dataset.
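
One practical consequence of the imbalance in point 1: a rule that always predicts the majority class is already right for about 73% of the rows, so any model built on this data should be compared against that baseline rather than judged on raw accuracy alone. The sketch below is illustrative; `majority_baseline_accuracy` is a hypothetical helper and the labels are made up.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Made-up labels, only to demonstrate the call.
label, acc = majority_baseline_accuracy([1, 1, 1, 2])
print(label, acc)
```

In the notebook this could be called as `majority_baseline_accuracy(hdf["status"].tolist())`.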