# EXPLORATORY DATA ANALYSIS ON HABERMAN'S DATASET

In [None]:
# Importing all the important packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
haberman = pd.read_csv("../input/haberman.csv")
haberman.head()

In [None]:
# Renaming the columns for better understanding
haberman.columns = ["Age", "Op_year", "axil_nodes_det", "Survived_morethan_5years"]
haberman.head(7)

Replace 1 with "yes" and 2 with "no" in the column "Survived_morethan_5years" so that the class labels are easier to read.

In [None]:
haberman["Survived_morethan_5years"] = haberman["Survived_morethan_5years"].map({1:"yes", 2:"no"})
haberman.head()

In [None]:
haberman.info()

In [None]:
haberman.describe()

In [None]:
haberman["Survived_morethan_5years"].value_counts()

Observations -
1. There are 305 data points in total.
2. Of these 305 data points, 224 correspond to patients who survived more than 5 years, while 81 patients did not survive 5 years.
3. The data set is imbalanced: the "yes" class is almost three times as large as the "no" class.
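The imbalance can be put into numbers. The sketch below is only an illustration: it rebuilds the class counts from the figures quoted in the observations (224 and 81) rather than re-reading the CSV, and the names `class_counts`, `class_share` and `imbalance_ratio` are introduced here for the example.

```python
import pandas as pd

# Class counts copied from the observations above (not re-read from the CSV).
class_counts = pd.Series({"yes": 224, "no": 81}, name="Survived_morethan_5years")

# Share of each class; on the real column this corresponds to
# haberman["Survived_morethan_5years"].value_counts(normalize=True).
class_share = class_counts / class_counts.sum()
print(class_share.round(3))

# Ratio of the majority class to the minority class.
imbalance_ratio = class_counts.max() / class_counts.min()
print(round(imbalance_ratio, 2))
```

With these counts, about 73% of the patients fall in the "yes" class, so a model that always predicts "yes" would already look reasonably accurate. This is why plain accuracy is a weak measure on this data set.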

In [None]:
haberman.plot(kind="scatter", x='axil_nodes_det', y='Age')
plt.grid()
plt.show()

Patients with more than 20 positive axillary nodes detected all fall in the 35-65 age group.

In [None]:
sns.set_style("whitegrid")
# Note: `size` was renamed to `height` in seaborn 0.9
sns.FacetGrid(haberman, hue="Survived_morethan_5years", height=4)\
    .map(plt.scatter, "Age", "axil_nodes_det")\
    .add_legend()
plt.show()

The data points are concentrated where the number of axillary nodes is between 0 and 10.
The plot also suggests that patients with 0 axillary nodes are more likely to survive more than 5 years.
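A scatter plot only hints at this, so it helps to compare survival rates directly. The sketch below shows one way to do that with a `groupby`. The `toy` DataFrame is made-up data that only reuses the notebook's column names; on the real data, `toy` would be replaced by `haberman`.

```python
import pandas as pd

# Made-up rows for illustration only; NOT taken from the Haberman data.
toy = pd.DataFrame({
    "axil_nodes_det": [0, 0, 0, 0, 1, 3, 7, 12, 20, 25],
    "Survived_morethan_5years": ["yes", "yes", "yes", "no", "yes",
                                 "no", "yes", "no", "no", "no"],
})

# True where the patient survived more than 5 years.
survived = toy["Survived_morethan_5years"].eq("yes")

# Label each patient by whether any positive nodes were detected.
node_group = toy["axil_nodes_det"].eq(0).map({True: "0 nodes", False: "1+ nodes"})

# The mean of a boolean column is the share of True values, i.e. the survival rate.
survival_rate = survived.groupby(node_group).mean()
print(survival_rate)
```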

In [None]:
plt.close()
sns.set_style("whitegrid")
sns.pairplot(haberman, hue="Survived_morethan_5years", height=4)
plt.show()

In [None]:
sns.FacetGrid(haberman, hue="Survived_morethan_5years", height=5)\
    .map(sns.distplot, "axil_nodes_det")\
    .add_legend()
plt.show()

The above histogram shows that, although the two classes overlap heavily, patients with 0 to 7 detected axillary nodes have the highest rate of survival.

In [None]:
sns.FacetGrid(haberman, hue="Survived_morethan_5years", height=5)\
    .map(sns.distplot, "Op_year")\
    .add_legend()
plt.show()

In [None]:
sns.FacetGrid(haberman, hue="Survived_morethan_5years", height=5)\
    .map(sns.distplot, "Age")\
    .add_legend()
plt.show()

Observations -

From the above histograms, the number of positive axillary nodes detected appears to be the most useful feature for telling whether a patient will survive more than 5 years.

In [None]:
haberman1 = haberman.loc[haberman["Survived_morethan_5years"] == "yes"]
haberman2 = haberman.loc[haberman["Survived_morethan_5years"] == "no"]

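# 1-D scatter of the node counts for each class, all drawn at y = 0
plt.plot(haberman1["axil_nodes_det"], np.zeros_like(haberman1["axil_nodes_det"]), 'o', label="yes")
plt.plot(haberman2["axil_nodes_det"], np.zeros_like(haberman2["axil_nodes_det"]), 'o', label="no")
plt.xlabel("axil_nodes_det")
plt.legend()
plt.show()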

There is a heavy concentration of points between 0 and 15.

In [None]:
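# Patients who survived more than 5 years
counts, bin_edges = np.histogram(haberman1['axil_nodes_det'], bins=10, density=True)

pdf = counts/sum(counts)  # share of patients in each bin
cdf = np.cumsum(pdf)      # running total of those shares
plt.plot(bin_edges[1:], cdf, label="CDF (yes)")
plt.plot(bin_edges[1:], pdf, label="PDF (yes)")

# Patients who did not survive 5 years
counts, bin_edges = np.histogram(haberman2['axil_nodes_det'], bins=10, density=True)

pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label="CDF (no)")
plt.plot(bin_edges[1:], pdf, label="PDF (no)")

plt.xlabel("axil_nodes_det")
plt.legend()
plt.show()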

The above graph depicts the cumulative distribution function (CDF) and the probability density function (PDF) of the number of axillary nodes for each class.
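To make the two curves concrete, here is a minimal sketch of the same histogram-based calculation on a small made-up array. The helper `empirical_pdf_cdf` and the array `toy_nodes` are names introduced for this example only. Strictly speaking, the per-bin values are probabilities per bin that sum to 1, not densities.

```python
import numpy as np

def empirical_pdf_cdf(values, bins=10):
    """Return (bin_edges, pdf, cdf), where pdf holds the share of values in each bin."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # per-bin shares, summing to 1
    cdf = np.cumsum(pdf)          # share of values up to the right edge of each bin
    return bin_edges, pdf, cdf

# Made-up node counts for illustration only; NOT taken from the Haberman data.
toy_nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 20])

edges, pdf, cdf = empirical_pdf_cdf(toy_nodes, bins=4)
print(edges)  # bin edges 0, 5, 10, 15, 20
print(pdf)    # shares 0.8, 0.1, 0.0, 0.1
print(cdf)    # running totals 0.8, 0.9, 0.9, 1.0
```

Reading the CDF at a bin's right edge gives the share of patients with at most that many nodes, which is how a statement such as "most survivors have fewer than N nodes" can be checked.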

In [None]:
sns.boxplot(x='Survived_morethan_5years', y='axil_nodes_det', data=haberman)
plt.show()

This box plot shows that the dataset contains many outliers, which makes it difficult to draw insights from the data.
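The points a box plot draws as outliers are, with the default settings, those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule by hand, assuming the default whisker factor of 1.5. The helper `iqr_outliers` and the array `toy_nodes` are introduced here for illustration.

```python
import numpy as np

def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]

# Made-up node counts for illustration only; NOT taken from the Haberman data.
toy_nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 20])

outliers = iqr_outliers(toy_nodes)
print(outliers)  # only the value 20 lies beyond the upper fence of 9.0
```

For a heavily right-skewed feature such as the node count, many large values end up beyond the upper fence, so a long tail of "outliers" is expected rather than a sign of bad data.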

In [None]:
sns.violinplot(x='Survived_morethan_5years', y='axil_nodes_det', data=haberman)
plt.show()

In [None]:
print(np.percentile(haberman1["axil_nodes_det"],90))
print(np.percentile(haberman2["axil_nodes_det"],90))

This shows that 90% of the patients who survived more than 5 years had 8 or fewer axillary nodes,
and 90% of the patients who died within 5 years had 20 or fewer axillary nodes.
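The 90th percentile is the value at or below which 90% of the observations lie. The sketch below checks this on a small made-up array. The names `toy_nodes`, `p90` and `share_at_or_below` are assumptions for the example, and `np.percentile` is used with its default linear interpolation.

```python
import numpy as np

# Made-up node counts for illustration only; NOT taken from the Haberman data.
toy_nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 20])

# 90th percentile with NumPy's default linear interpolation.
p90 = np.percentile(toy_nodes, 90)

# Share of observations at or below that value.
share_at_or_below = np.mean(toy_nodes <= p90)

print(p90)                # 9.2 (interpolated between 8 and 20)
print(share_at_or_below)  # 0.9
```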

In [None]:
print("Mean age: ")
print(np.mean(haberman["Age"]))

The mean age of the breast cancer patients in this dataset is about 52 years.