In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Dataset description.

* Age of patient at time of operation (numerical)
* Patient's year of operation (year - 1900, numerical)
* Number of positive axillary nodes detected (numerical)
* Survival status (class attribute)
     * 1 = the patient survived 5 years or longer
     * 2 = the patient died within 5 years



In [7]:
col = ['Patient_age', 'Year_of_operation', 'pos_axillary_nodes', 'status']
df = pd.read_csv('../input/haberman.csv', names = col)

In [8]:
df.head()

In [9]:
df.shape

In [10]:
df['status'].value_counts()

**Observations:**
  * The **Year_of_operation** column gives the last two digits of the operation year for each patient.
  * There are **306** observations in the dataset.
  * The dataset is divided into two classes:
      * **225** patients in class **1**, who survived, and
      * **81** patients in class **2**, who did not survive.
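
The class counts above imply an imbalanced dataset. As a minimal sketch, the class shares can be computed with `value_counts(normalize=True)`; the `status` series below is rebuilt from the reported counts rather than read from the CSV, and the name `share` is only illustrative.

```python
import pandas as pd

# Rebuild the class column from the counts reported above (225 survived, 81 did not).
status = pd.Series([1] * 225 + [2] * 81, name="status")

# normalize=True returns fractions of the total instead of raw counts.
share = status.value_counts(normalize=True).sort_index()
print(share.round(3))
```

On the loaded data the equivalent call would be `df['status'].value_counts(normalize=True)`. Roughly three quarters of the patients fall in class 1, so plain accuracy can look good even for a model that always predicts survival.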
      




In [11]:
sns.lmplot(fit_reg = False, data = df, y = 'pos_axillary_nodes', x = 'Patient_age')

**Observations:**
* Most patients have zero **pos_axillary_nodes**.

In [12]:
sns.pairplot(df, hue = 'status')

The pair plot shows the relationship between ***pos_axillary_nodes***, ***Patient_age*** and ***status***.

In [13]:
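# `height` replaces the older `size` argument; `distplot` is deprecated in newer seaborn in favor of `histplot`/`displot`.
sns.FacetGrid(df, hue = "status", height = 5).map(sns.distplot, "Patient_age").add_legend()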
plt.show()

Since the age distributions look approximately normal for both the **survived** and **not survived** classes, the mean is a reasonable measure of central tendency.

In [14]:
print("Mean age of patients survived:", round(np.mean(df[df['status'] == 1]['Patient_age'])))
print("Mean age of patients not survived:", round(np.mean(df[df['status'] == 2]['Patient_age'])))
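
The per-class means can also be computed in a single pass with `groupby`, and placing the median next to the mean gives a quick check on how far skew pulls the mean. This is a minimal sketch: the `toy` frame holds hypothetical rows, not values from `haberman.csv`, and only the column names match the notebook.

```python
import pandas as pd

# Hypothetical toy rows for illustration; not taken from haberman.csv.
toy = pd.DataFrame({
    "Patient_age": [30, 45, 60, 40, 50, 72],
    "status":      [1, 1, 1, 2, 2, 2],
})

# One row per class, with mean and median side by side.
summary = toy.groupby("status")["Patient_age"].agg(["mean", "median"])
print(summary)
```

On the real data the same call would be `df.groupby('status')['Patient_age'].agg(['mean', 'median'])`.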

In [15]:
sns.FacetGrid(df, hue = "status", height = 5).map(sns.distplot, "pos_axillary_nodes").add_legend()
plt.show()

In [16]:
sns.FacetGrid(df, hue = "status", height = 5).map(sns.distplot, "Year_of_operation").add_legend()
plt.show()

**Observations:**

* **pos_axillary_nodes** is a useful feature for identifying the survival status, since the two class distributions differ clearly from each other.
* Patients who **survived** mostly have **zero** **pos_axillary_nodes**.
* The mean age of patients who did **not survive** is 54 years, versus 52 years for those who **survived**.
* In the **not survived** class, more patients had their operation around **1965** than around **1958**, and the distribution is bimodal.

### Summary statistics

In [23]:
sur = df[df['status'] == 1]
sur.describe()

In [24]:
not_sur = df[df['status'] == 2]
not_sur.describe()

**Observations:**

* Patients who did **not survive** have a higher average number of **pos_axillary_nodes**, and their distribution is **more spread out** than that of the survivors.

In [25]:
sns.violinplot(x='status', y='pos_axillary_nodes', data=df)

In [26]:
from statsmodels import robust
print("\n Median Absolute Deviation")
print(robust.mad(sur['pos_axillary_nodes']))
print(robust.mad(not_sur['pos_axillary_nodes']))
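
For reference, the median absolute deviation can be written out in plain NumPy. The sketch below assumes that `robust.mad` centers on the median and divides by about 0.6745 (the 0.75 quantile of the standard normal) by default; `scaled_mad` and `toy_nodes` are hypothetical names and values used only for illustration.

```python
import numpy as np

def scaled_mad(values, c=0.6744897501960817):
    """Median absolute deviation around the median, divided by c.

    c defaults to the 0.75 quantile of N(0, 1), which makes the result
    comparable to a standard deviation for normally distributed data.
    """
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    return np.median(np.abs(values - center)) / c

toy_nodes = [0, 0, 1, 2, 4, 10]      # hypothetical node counts
print(scaled_mad(toy_nodes, c=1.0))  # unscaled MAD
print(scaled_mad(toy_nodes))         # MAD scaled for normal consistency
```

Unlike the standard deviation, this statistic is barely moved by a few extreme node counts, which is why it suits a heavily right-skewed feature.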

In [27]:
# Not-survived class: normalize the histogram into a PDF, then accumulate it into a CDF
counts, bin_edges = np.histogram(not_sur['pos_axillary_nodes'], bins=30,
                                 density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
# Survived class: same steps
counts, bin_edges = np.histogram(sur['pos_axillary_nodes'], bins=30,
                                 density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
# Labels follow the order in which the four lines were plotted
plt.legend(['Not_sur_pdf', 'Not_sur_cdf', 'Sur_pdf', 'Sur_cdf'])
plt.xticks(np.linspace(0, 50, 12), rotation=-45)
plt.xlabel("pos_axillary_nodes")
plt.show()

In [28]:
# Not-survived class: normalize the histogram into a PDF, then accumulate it into a CDF
counts, bin_edges = np.histogram(not_sur['Year_of_operation'], bins=30,
                                 density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
# Survived class: same steps
counts, bin_edges = np.histogram(sur['Year_of_operation'], bins=30,
                                 density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
# Labels follow the order in which the four lines were plotted
plt.legend(['Not_sur_pdf', 'Not_sur_cdf', 'Sur_pdf', 'Sur_cdf'])
plt.xlabel("Year_of_operation")
plt.show()
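
The histogram-to-CDF steps repeated in the cells above could be wrapped in a small helper. This is only a sketch: `pdf_cdf` is a hypothetical name, and `toy_nodes` is made-up data rather than values from the dataset.

```python
import numpy as np

def pdf_cdf(values, bins=30):
    """Return right bin edges, per-bin probability mass, and cumulative mass."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations in each bin
    cdf = np.cumsum(pdf)          # running total, ends at 1.0
    return bin_edges[1:], pdf, cdf

toy_nodes = np.array([0, 0, 0, 1, 1, 2, 3, 5, 8, 13])  # hypothetical values
edges, pdf, cdf = pdf_cdf(toy_nodes, bins=5)
print(pdf)
print(cdf)
```

With such a helper, each class needs one call, for example `pdf_cdf(sur['pos_axillary_nodes'])`, followed by the two `plt.plot` calls.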

**Observations:**
* Based on the CDFs, a simple threshold classifier could label patients with more than 45.5 **pos_axillary_nodes** as **not survived**.
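
The threshold rule described above could be sketched as follows. The function name `classify_by_nodes` and the sample inputs are hypothetical; only the 45.5 cutoff and the class labels 1 and 2 come from the notebook.

```python
import numpy as np

def classify_by_nodes(nodes, threshold=45.5):
    """Label 2 (not survived) above the threshold, otherwise 1 (survived)."""
    nodes = np.asarray(nodes)
    return np.where(nodes > threshold, 2, 1)

toy_nodes = [0, 3, 46, 52]  # hypothetical node counts
labels = classify_by_nodes(toy_nodes)
print(labels)
```

A rule like this labels every patient at or below the cutoff as survived, so it only separates the extreme tail of the node distribution and should be read as a baseline rather than a usable model.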

In [29]:
sns.violinplot(x='status', y='Patient_age', data=df)

In [30]:
sns.violinplot(x='status', y='Year_of_operation', data = df)

In [31]:
sns.jointplot(x= 'Patient_age',kind = 'kde', y='Year_of_operation', data = df)
plt.show()

* The density is highest for operations performed during **1959 - 1964** on patients aged **42 - 60**.