### Hello, this is my first notebook on EDA. I tried various methods I recently learned on Haberman's survival dataset. I know this work is not that great, and I would love to get some input on what I did right and what I didn't. I'd really appreciate it. Thank you!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/habermans-survival-data-set/haberman.csv')

### Let's view the first few records of our dataset and get an overview of the data we are dealing with

In [None]:
df.head()

In [None]:
df.columns = ['Age','Op_year','Axil_nodes','Survival_status']

In [None]:
df.head()
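One thing worth checking here: `pd.read_csv` treats the first row of a file as column names by default. If `haberman.csv` has no header row (an assumption, I have not verified this for the file), the first record would be silently lost when the columns are renamed. Below is a minimal sketch of the difference, using a made-up three-record sample rather than the real file:

```python
import io
import pandas as pd

# Hypothetical three-record sample standing in for a headerless CSV file
raw = "30,64,1,1\n30,62,3,1\n34,59,0,2\n"
cols = ['Age', 'Op_year', 'Axil_nodes', 'Survival_status']

# Default behaviour: the first record is promoted to the header row,
# so renaming the columns afterwards overwrites (loses) that record
with_default = pd.read_csv(io.StringIO(raw))
with_default.columns = cols

# header=None keeps every record; names= assigns the column labels directly
with_names = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(len(with_default), len(with_names))
```

Comparing `df.shape` against the record count in the dataset description would show whether this applies to the real file.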

### Understanding the attributes:
1. Age: Age of the patient at the time of the operation
2. Op_year: The year in which the operation was performed
3. Axil_nodes: Number of positive axillary nodes detected (the number of lymph nodes to which the cancer has spread)
4. Survival_status: 1 if the patient survived 5 years or longer, 2 if the patient died within 5 years.

#### Before we proceed further, let's change 1 and 2 to terms which can readily convey meaning

In [None]:
df.Survival_status = df.Survival_status.replace({1:'Survived',2:'Died'})

### Let's look at the dataset now

In [None]:
df.head()

In [None]:
df.tail()

### Now that the Survival_status values are labelled, the data is easier to read.
### Let's get into the data

In [None]:
df.Survival_status.value_counts()

This disparity between the two classes has to be kept in mind before training a classifier, as class imbalance can lead to a model that is biased toward the majority class.
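To put a number on the imbalance, the class shares can be computed directly. This is a minimal sketch on a hypothetical label series (the values are made up for illustration and are not the real class counts):

```python
import pandas as pd

# Hypothetical labels for illustration only, not the real class counts
labels = pd.Series(['Survived'] * 6 + ['Died'] * 2, name='Survival_status')

# normalize=True returns fractions instead of raw counts
proportions = labels.value_counts(normalize=True)

# Ratio of the largest class to the smallest class
counts = labels.value_counts()
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print(imbalance_ratio)
```

On the real data the same two calls would be applied to `df.Survival_status`.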

In [None]:
df.shape

The dataset has only 3 attributes besides the class label.
Let's draw a pair plot to find out whether there are any meaningful relations between them.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
sns.set_style('whitegrid')
sns.pairplot(df,hue='Survival_status')
plt.show()

This pairplot doesn't seem to give a lot of information.
Let's draw distribution plots to understand the probability distribution of each attribute.

In [None]:
sns.FacetGrid(df,hue="Survival_status")\
    .map(sns.distplot,'Age').add_legend();

In [None]:
sns.FacetGrid(df,hue='Survival_status').map(sns.distplot,'Axil_nodes').add_legend()

In [None]:
sns.FacetGrid(df,hue='Survival_status').map(sns.distplot,'Op_year').add_legend()

In [None]:
sns.FacetGrid(df,hue='Survival_status',height=5).map(plt.scatter,'Age','Axil_nodes').add_legend()

### Class imbalance is hindering our EDA. Let's balance the dataset by undersampling the majority class

In [None]:
df.shape

In [None]:
df.Survival_status.value_counts()

In [None]:
died_df = df.loc[df['Survival_status'] == 'Died']
died_df.shape

In [None]:
shuffled_df = df.sample(frac = 1,random_state=4)
shuffled_df.shape

In [None]:
died_df = shuffled_df.loc[shuffled_df['Survival_status'] ==  'Died']
died_df.shape

In [None]:
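# Sample as many survivors as there are deaths; random_state keeps the draw reproducible
survived_df = shuffled_df.loc[shuffled_df['Survival_status'] == 'Survived'].sample(n=len(died_df), random_state=4)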


In [None]:
survived_df.shape

In [None]:
normalized_df = pd.concat([died_df,survived_df])

In [None]:
normalized_df.shape

Now that we have a balanced (undersampled) dataset, let's go ahead with our analysis.
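The same undersampling can also be written in one step with `groupby(...).sample(...)` (available in pandas 1.1 and later). This is a minimal sketch on a hypothetical toy frame, not on the real data, and it is only one of several ways to balance classes:

```python
import pandas as pd

# Hypothetical toy frame: 6 'Survived' rows and 2 'Died' rows
toy = pd.DataFrame({
    'Axil_nodes': [0, 1, 0, 2, 0, 3, 9, 12],
    'Survival_status': ['Survived'] * 6 + ['Died'] * 2,
})

# The size of the smallest class decides how many rows each class keeps
n_min = int(toy['Survival_status'].value_counts().min())

# Draw n_min rows from every class; random_state makes the draw repeatable
balanced = toy.groupby('Survival_status').sample(n=n_min, random_state=4)

print(balanced['Survival_status'].value_counts())
```

This avoids hard-coding the minority class size, at the cost of discarding majority-class rows just like the step-by-step version above.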

In [None]:
sns.FacetGrid(normalized_df,height=6,hue='Survival_status').map(plt.scatter,'Axil_nodes','Age').add_legend()

In [None]:
sns.FacetGrid(normalized_df,hue='Survival_status',height=6).map(sns.distplot,'Axil_nodes').add_legend()

In [None]:
sns.FacetGrid(normalized_df,hue='Survival_status',height=6).map(sns.distplot,'Age').add_legend()

In [None]:
sns.FacetGrid(normalized_df,hue='Survival_status',height=6).map(sns.distplot,'Op_year').add_legend()

In [None]:
sns.pairplot(normalized_df,hue='Survival_status',height=8,markers=["o", "s"])

### Observations
1. The lower the number of Axil_nodes, the higher the chances of survival.
2. The distributions of Op_year and Age for the Survived and Died classes overlap heavily, so Op_year and Age are not much help when considered alone, whereas Axil_nodes can be useful.


### Now let's plot the PDF and CDF for our attributes

In [None]:
# For Age
import numpy as np
counts, bin_edges = np.histogram(survived_df.Age,bins=10,density=True)
pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.subplot(2,2,1)
plt.plot(bin_edges[1:],pdf,label = 'Age-PDF-Survived')
plt.legend()
plt.subplot(2,2,2)
plt.plot(bin_edges[1:],cdf,label = 'Age-CDF-Survived')
plt.legend()
counts, bin_edges = np.histogram(died_df.Age,bins=10,density=True)
pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.subplot(2,2,3)
plt.plot(bin_edges[1:],pdf,label = 'Age-PDF-Died')
plt.legend()
plt.subplot(2,2,4)
plt.plot(bin_edges[1:],cdf,label = 'Age-CDF-Died')
plt.legend()
plt.show()

In [None]:
counts, bin_edges = np.histogram(survived_df.Axil_nodes,bins=10,density=True)
pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.subplot(2,2,1)
plt.plot(bin_edges[1:],pdf,label = 'Axil-PDF-Survived')
plt.legend()
plt.subplot(2,2,2)
plt.plot(bin_edges[1:],cdf,label = 'Axil-CDF-Survived')
plt.legend()
counts, bin_edges = np.histogram(died_df.Axil_nodes,bins=10,density=True)
pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.subplot(2,2,3)
plt.plot(bin_edges[1:],pdf,label = 'Axil-PDF-Died')
plt.legend()
plt.subplot(2,2,4)
plt.plot(bin_edges[1:],cdf,label = 'Axil-CDF-Died')
plt.legend()
plt.show()

#### These PDF and CDF graphs also make it quite clear that the number of Axil_nodes is the most informative of the three attributes for survival
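Since the same histogram, PDF, and CDF steps are repeated for every attribute and class, they could be wrapped in a small helper. This is a minimal sketch, assuming a hypothetical function name `pdf_cdf` and made-up sample values. Strictly speaking, the "PDF" here is the fraction of observations per bin rather than a continuous density:

```python
import numpy as np

def pdf_cdf(values, bins=10):
    """Return bin edges, the per-bin fraction of observations, and its running sum."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations in each bin
    cdf = np.cumsum(pdf)          # fraction of observations up to each bin's right edge
    return bin_edges, pdf, cdf

# Hypothetical node counts for illustration only
sample_nodes = [0, 0, 0, 1, 1, 2, 3, 5, 8, 20]
edges, pdf, cdf = pdf_cdf(sample_nodes, bins=4)

print(pdf)
print(cdf)
```

With the real data the call would look like `pdf_cdf(survived_df.Axil_nodes)`, and the returned arrays can be plotted against `edges[1:]` as in the cells above.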

### Let's get some box plots and violin plots to see if they give any insights

In [None]:
sns.boxplot(data=normalized_df,x='Survival_status',y='Age')

In [None]:
sns.boxplot(data = normalized_df,x='Survival_status',y='Op_year')

### Observation:
1. It is once again clear that Age and Op_year are not good predictors for Survival_status

In [None]:
sns.boxplot(data = normalized_df, x = 'Survival_status', y = 'Axil_nodes')

### Observations:
1. The box plot of Axil_nodes for the Survived class shows that, although there are some outliers, most survivors had about 5 Axil nodes or fewer.
2. The 25th and 50th percentiles are both almost zero.
3. The 75th percentile is about 3 Axil nodes.
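The percentiles read off the box plot can be cross-checked numerically. This is a minimal sketch on a hypothetical toy frame (the values are made up and do not reproduce the numbers above):

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'Axil_nodes': [0, 0, 1, 3, 4, 2, 6, 10, 14],
    'Survival_status': ['Survived'] * 5 + ['Died'] * 4,
})

# 25th, 50th and 75th percentile of Axil_nodes for each class,
# reshaped so that each class is a row and each percentile a column
quartiles = (
    toy.groupby('Survival_status')['Axil_nodes']
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)

print(quartiles)
```

Running the same `groupby` on `normalized_df` would give exact values to replace the "about" estimates.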

In [None]:
sns.violinplot(data = normalized_df, x = 'Survival_status', y = 'Axil_nodes')

### Observations:
1. The highest number of survival cases occurs when the number of Axil_nodes is close to zero.
2. However, a fair number of patients did not survive even though their number of Axil_nodes was close to zero.

## Let's draw some bivariate probability density plots

In [None]:
sns.jointplot(data = survived_df, x = 'Age', y = 'Axil_nodes')

In [None]:
sns.jointplot(data = survived_df, x = 'Age', y = 'Axil_nodes',kind = 'hex')

In [None]:
sns.jointplot(data = survived_df, x = 'Age', y = 'Axil_nodes',kind='kde')

#### Observation:
1. Patients aged 40-70 with fewer than 2 Axil nodes make up the largest share of survival cases.

In [None]:
sns.jointplot(data = died_df, x = 'Age', y = 'Axil_nodes',kind='kde')

#### Observation:
1. Among deaths, patients aged around 50 make up the largest share, even though they have a low number of Axil nodes.

## Summary
### 1. Most survival cases are recorded when the number of Axil nodes is zero or close to zero.
### 2. Age and Op_year don't seem to convey much information when examined alone.
### 3. Among survivors, irrespective of age and operation year, most had close to zero Axil nodes.
### 4. Among deaths, most were recorded at ages between 40 and 60, even though the number of Axil nodes was near zero.
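Claims about age groups like the ones above can be checked with a cross-tabulation of age bands against survival status. This is a minimal sketch on hypothetical toy values; the band edges are an arbitrary choice for illustration:

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'Age': [34, 38, 45, 52, 55, 61, 67, 72],
    'Survival_status': ['Survived', 'Survived', 'Died', 'Died',
                        'Survived', 'Died', 'Survived', 'Survived'],
})

# Half-open age bands: [30, 40), [40, 60), [60, 80)
toy['Age_band'] = pd.cut(toy['Age'], bins=[30, 40, 60, 80], right=False)

# normalize='index' turns each row into shares that sum to 1 per age band
share = pd.crosstab(toy['Age_band'], toy['Survival_status'], normalize='index')

print(share)
```

Applied to `df` (the full data rather than the undersampled frame), this would show the share of deaths in each age band directly.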

## Conclusion
### Although we could point out the importance of Axil nodes for survival chances, it is also clear that this attribute alone is not a sufficient predictor, and there must be several other factors (such as previous illnesses, immunity, etc.) that contribute to a person's survival.

Question: Was I right to undersample the data?