# **Exploratory Data Analysis :**

**Data Description :**
The Haberman's survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

**Attribute Information:**
* Age of patient at time of operation (numerical)
* Patient's year of operation (year - 1900, numerical)
* Number of positive axillary nodes detected (numerical)
* Survival status (class attribute):
    * 1 = the patient survived 5 years or longer
    * 2 = the patient died within 5 years

# **Data Preparation**

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Reading data
df = pd.read_csv("../input/habermans-survival-data-set/haberman.csv",names=['age', 'year', 'nodes', 'status'])

In [None]:
# (Q) how many data-points and features?
print (df.shape)

In [None]:
#(Q) What are the column names in our dataset?
print (df.columns)

In [None]:
df["status"].value_counts()

**Observations :**

* Number of data points: 306.
* Number of columns: 4.
* There are 3 features; the fourth column, `status`, is the class label.
* 225 patients survived for 5 years or longer, and 81 died within 5 years.
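The counts above can also be expressed as class shares, which makes the imbalance easier to see. The following is a minimal sketch on a toy frame; the rows in `toy` are hypothetical stand-ins with the same column names and are not taken from `haberman.csv`.

```python
import pandas as pd

# Toy stand-in for the real data: hypothetical rows, NOT from haberman.csv.
toy = pd.DataFrame({
    "age":    [30, 34, 41, 52, 60, 65, 70, 83],
    "year":   [64, 59, 60, 66, 61, 58, 67, 58],
    "nodes":  [1, 0, 23, 0, 4, 0, 9, 2],
    "status": [1, 1, 2, 1, 1, 1, 2, 1],
})

n_rows, n_cols = toy.shape
n_features = n_cols - 1  # every column except the class label "status"

# Absolute and relative class frequencies, indexed by status code.
class_counts = toy["status"].value_counts().sort_index()
class_share = toy["status"].value_counts(normalize=True).sort_index()

print(n_rows, n_features)
print(class_counts)
print(class_share.round(3))
```

Running the same three lines on `df` instead of `toy` should reproduce the 225 / 81 split reported above.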


# **Objective**

The goal is to predict whether a patient will survive 5 years or longer after surgery, based on the patient's age, the year of operation, and the number of positive lymph nodes.

In [None]:
print(df.describe())

**Observations:**

* The age of the patients varies from 30 to 83, with a median of 52.
* Although the maximum number of positive lymph nodes observed is 52, nearly 75% of the patients have fewer than 5 positive lymph nodes and nearly 25% of the patients have none.
* The dataset contains only a small number of records (306).
* About 73% of the values in the target column are '1' (survived 5 years or longer), so the classes are imbalanced.
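The quartile statements above are read from the `25%`, `50%` and `75%` rows of `df.describe()`. As a rough sketch of what those rows mean, the snippet below computes the same quantities directly with NumPy; the `nodes` array is a hypothetical example, not the real column.

```python
import numpy as np

# Hypothetical node counts for illustration only (not the real "nodes" column).
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 52])

# Quartiles, using NumPy's default linear interpolation between sorted values.
q25, q50, q75 = np.percentile(nodes, [25, 50, 75])

# Shares of patients with no positive nodes and with fewer than 5.
share_zero = np.mean(nodes == 0)
share_under_5 = np.mean(nodes < 5)

print(q25, q50, q75)
print(share_zero, share_under_5)
```

A single large value such as 52 moves the mean and the maximum a lot but barely affects the quartiles, which is why the quartiles are the more informative summary for this column.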

# **Pair plots**

In [None]:
plt.close()
sns.set_style("whitegrid")
sns.pairplot(df, hue='status', height=4)
plt.show()

**Observations :**

* Shown above is a pair plot of all possible combinations of the features, with "status" as the hue.
* "age" and "nodes" are the most useful features in determining the survival status.
* Points shown in blue represent the patients that survived.
* Points shown in orange represent the patients that died within 5 years of the treatment.
* There is not much separation between the two classes in any of the pair plots.

# **Histograms**

In [None]:
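# sns.distplot is deprecated; histplot (seaborn >= 0.11) with kde=True gives a comparable plot
sns.FacetGrid(df, hue="status", height=5) \
   .map(sns.histplot, "age", kde=True, stat="density") \
   .add_legend()
plt.show()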

In [None]:
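# sns.distplot is deprecated; histplot (seaborn >= 0.11) with kde=True gives a comparable plot
sns.FacetGrid(df, hue="status", height=5) \
   .map(sns.histplot, "year", kde=True, stat="density") \
   .add_legend()
plt.show()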

In [None]:
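# sns.distplot is deprecated; histplot (seaborn >= 0.11) with kde=True gives a comparable plot
sns.FacetGrid(df, hue="status", height=5) \
   .map(sns.histplot, "nodes", kde=True, stat="density") \
   .add_legend()
plt.show()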

# **PDF and CDF**

In [None]:
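# Histogram of age; dividing by the total turns the bin values into per-bin probabilities
counts, bin_edges = np.histogram(df['age'], bins=10, density=True)
pdf = counts / counts.sum()
print(pdf)
print(bin_edges)

# Compute the CDF as the running sum of the per-bin probabilities
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('age')
plt.legend()
plt.show()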


In [None]:
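# Histogram of year; dividing by the total turns the bin values into per-bin probabilities
counts, bin_edges = np.histogram(df['year'], bins=10, density=True)
pdf = counts / counts.sum()
print(pdf)
print(bin_edges)

# Compute the CDF as the running sum of the per-bin probabilities
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('year')
plt.legend()
plt.show()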

In [None]:
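# Histogram of nodes; dividing by the total turns the bin values into per-bin probabilities
counts, bin_edges = np.histogram(df['nodes'], bins=10, density=True)
pdf = counts / counts.sum()
print(pdf)
print(bin_edges)

# Compute the CDF as the running sum of the per-bin probabilities
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('nodes')
plt.legend()
plt.show()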

**Observations :**
    
The PDFs and CDFs of the survived and deceased patients overlap too much to support any significant observation.
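The cells above compute one PDF and one CDF over all patients, so a per-class comparison needs the data split by `status` first. A minimal sketch of that split is shown below; the helper `pdf_cdf`, the two arrays and the bin edges are all hypothetical choices made for illustration, not values from the dataset.

```python
import numpy as np

def pdf_cdf(values, bin_edges):
    """Return the per-bin probability mass and its running sum for `values`."""
    counts, _ = np.histogram(values, bins=bin_edges)
    pdf = counts / counts.sum()
    return pdf, np.cumsum(pdf)

# Hypothetical node counts per class, for illustration only.
nodes_survived = np.array([0, 0, 0, 1, 1, 2, 3, 4])
nodes_died = np.array([0, 2, 5, 9, 13, 23])

# Shared bin edges make the two sets of curves directly comparable.
edges = np.array([0, 5, 10, 15, 20, 25])

pdf_s, cdf_s = pdf_cdf(nodes_survived, edges)
pdf_d, cdf_d = pdf_cdf(nodes_died, edges)

print(pdf_s, cdf_s)
print(pdf_d, cdf_d)
```

With the real data, the two arrays would be `df.loc[df['status'] == 1, 'nodes']` and `df.loc[df['status'] == 2, 'nodes']`, and the curves could be drawn with `plt.plot(edges[1:], cdf_s)` and `plt.plot(edges[1:], cdf_d)` as in the cells above.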

# **Box Plot**

In [None]:
sns.boxplot(x='status',y='age', data=df)
plt.show()

In [None]:
sns.boxplot(x='status',y='year', data=df)
plt.show()

In [None]:
sns.boxplot(x='status',y='nodes', data=df)
plt.show()

**Observations**

* The box plots suggest that, in this sample, the patients aged 30 to approximately 34 all survived for 5 years or longer after the treatment.
* Likewise, in this sample no patient older than about 73 survived for 5 years.
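Age thresholds like these are hard to read precisely off a box plot, so it may help to compute them. The sketch below does this on a toy frame; the rows are hypothetical and chosen only to show the calculation, so the printed ages say nothing about the real data.

```python
import pandas as pd

# Hypothetical rows for illustration only, NOT from haberman.csv.
toy = pd.DataFrame({
    "age":    [31, 36, 38, 45, 52, 61, 70, 72, 75, 80],
    "status": [1, 1, 2, 1, 2, 1, 1, 1, 2, 2],
})

# Youngest and oldest patient within each class.
age_range = toy.groupby("status")["age"].agg(["min", "max"])

# Below the youngest death everyone survived; above the oldest survivor nobody did.
youngest_death = age_range.loc[2, "min"]
oldest_survivor = age_range.loc[1, "max"]

print(age_range)
print(youngest_death, oldest_survivor)
```

Applying the same `groupby` to `df` gives the exact cut-offs for the real dataset. With only 306 records, very few patients fall into either tail, so such thresholds should be treated with caution.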


# **Violin Plot**

In [None]:
sns.violinplot(x="status", y="age", data=df)
plt.show()

In [None]:
sns.violinplot(x="status", y="year", data=df)
plt.show()

In [None]:
sns.violinplot(x="status", y="nodes", data=df)
plt.show()

**Observations**

* For the survivors, the number of positive lymph nodes is densely concentrated between 0 and 5.
* Almost 80% of the patients have 5 or fewer positive lymph nodes.
* The patients treated after 1966 have a slightly higher chance of surviving than the rest, while the patients treated before 1959 have a slightly lower chance.
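The percentages and year effects read from the violin plots can be checked numerically. The following is a sketch under stated assumptions: the rows in `toy` are hypothetical, and the period boundaries (before 1959, 1959 to 1966, after 1966) are simply taken from the observation above, with `year` stored as year minus 1900.

```python
import pandas as pd

# Hypothetical rows for illustration only, NOT from haberman.csv.
toy = pd.DataFrame({
    "year":   [58, 58, 59, 61, 63, 65, 66, 67, 68, 69],
    "nodes":  [0, 12, 3, 0, 7, 1, 0, 2, 0, 25],
    "status": [2, 2, 1, 1, 2, 1, 1, 1, 1, 2],
})

# Share of patients with 5 or fewer positive nodes, per class.
share_low_nodes = toy["nodes"].le(5).groupby(toy["status"]).mean()

# Survival rate per operation period; intervals are (57, 58], (58, 66], (66, 69].
period = pd.cut(
    toy["year"],
    bins=[57, 58, 66, 69],
    labels=["before 1959", "1959-1966", "after 1966"],
)
survived = toy["status"].eq(1)
rate_by_period = survived.groupby(period, observed=True).mean()

print(share_low_nodes)
print(rate_by_period)
```

On the real `df`, differences between periods of a few percentage points would rest on a small number of patients each, so they are suggestive at best.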

# **Multivariate probability density, contour plot**


In [None]:
sns.jointplot(x="status", y="nodes", data=df, kind="kde");
plt.show();

# **Conclusion**
Plotting the PDFs, CDFs, box plots, pair plots, etc. leads to two conclusions:
* Patients with fewer positive axillary nodes are more likely to survive.
* More features are needed to reach a reliable conclusion.
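One way to put a number on the first conclusion is to compare survival rates across node-count groups. The sketch below shows the calculation only: the rows are hypothetical and the three groups ("0", "1-5", ">5") are an arbitrary binning chosen for illustration, so the printed rates are not results for the Haberman data.

```python
import pandas as pd

# Hypothetical rows for illustration only, NOT from haberman.csv.
toy = pd.DataFrame({
    "nodes":  [0, 0, 0, 1, 2, 4, 5, 8, 13, 30],
    "status": [1, 1, 1, 1, 1, 2, 2, 2, 1, 2],
})

# Group patients by node count; intervals are (-1, 0], (0, 5], (5, max].
node_group = pd.cut(
    toy["nodes"],
    bins=[-1, 0, 5, toy["nodes"].max()],
    labels=["0", "1-5", ">5"],
)

# Fraction of patients with status 1 (survived 5 years or longer) per group.
survival_rate = toy["status"].eq(1).groupby(node_group, observed=True).mean()

print(survival_rate)
```

Replacing `toy` with `df` gives the corresponding rates for the real dataset, which can then be compared against the trend seen in the plots.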