## **1. Description of the dataset:**

The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 1.1 Attributes given: 
+ Age of patient at time of operation (numerical)
+ Patient’s year of operation (year minus 1900, numerical)
+ Number of positive axillary nodes detected (numerical)
+ Survival status (class attribute):
    + 1 = the patient survived 5 years or longer
    + 2 = the patient died within 5 years

In [None]:
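# Note: read_csv treats the first line of the file as a header row by default.
# If the CSV has no header line, pass header=None (with names=[...]) instead,
# otherwise the first record is silently consumed as column names.
df = pd.read_csv('../input/haberman.csv')
# Assign readable column names
df.columns = ['age', 'op_year', 'axil_nodes', 'survived']
print(df.head())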


**Observations:**
+ We have 3 feature attributes: the patient's age at the time of operation, the year of operation, and the number of positive axillary nodes detected.
+ Data points are labeled with a binary class: survival status after 5 years.
+ As survival status is the class attribute, we should convert it to a categorical data type (1 -> Yes | Survived, 2 -> No | Died).


In [None]:
df['survived'] = df['survived'].map({1:'Yes', 2: 'No'})
df['survived'] = df['survived'].astype('category')


### 1.2 Statistics:  

In [None]:
print(df.describe())

**Observations:**
+ Number of patients : 305
+ Time period : 1958-1969
+ Patient's age ranges from 30 to 83, having a mean value of 52. 



In [None]:
print(df['survived'].value_counts())


**Observations:**
+ Data is imbalanced as we have 224 datapoints in 'Yes' class and 81 datapoints in 'No' class. 
+ It can be seen that 73% of patients survived the 5 year time period. 
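
The 73% figure follows directly from the two class counts. As a minimal sketch, the snippet below rebuilds a label column from those counts (the Series `labels` is a stand-in for `df['survived']`, not the loaded data) and uses `value_counts(normalize=True)` to get the class proportions:

```python
import pandas as pd

# Stand-in labels rebuilt from the counts reported above (224 'Yes', 81 'No');
# in the notebook this would be df['survived'].
labels = pd.Series(['Yes'] * 224 + ['No'] * 81, name='survived')

# normalize=True returns fractions instead of raw counts.
proportions = labels.value_counts(normalize=True)
print((proportions * 100).round(1))

# Ratio of majority to minority class, a quick measure of imbalance.
counts = labels.value_counts()
imbalance_ratio = counts.max() / counts.min()
print("Majority/minority ratio: {:.2f}".format(imbalance_ratio))
```

A ratio well above 1 is a reminder that plain accuracy can look good for a model that simply predicts the majority class.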

In [None]:
df.groupby(['survived']).mean()

In [None]:
df_yes= df.loc[df.survived=='Yes']
df_no= df.loc[df.survived=='No']

In [None]:
print("Average age of survivors: ", np.mean(df_yes.age))
print("Average age of Non survivors: ", np.mean(df_no.age))

print("Median age of survivors: ", np.median(df_yes.age))
print("Median age of non survivors: ", np.median(df_no.age))

print("Percentile age of survivors: ", np.percentile(df_yes.age, np.arange(0,100,25)))
print("Percentile age of non survivors: ", np.percentile(df_no.age,  np.arange(0,100,25)))

**Observations:**
+ There is no significant difference in the ages of survivors and non survivors. 

In [None]:
print("Average #nodes of survivors: ", np.mean(df_yes.axil_nodes))
print("Average #nodes of Non survivors: ", np.mean(df_no.axil_nodes))

print("Median #nodes of survivors: ", np.median(df_yes.axil_nodes))
print("Median #nodes of non survivors: ", np.median(df_no.axil_nodes))

print("Percentile #nodes of survivors: ", np.percentile(df_yes.axil_nodes, np.arange(0,100,25)))
print("Percentile #nodes of non survivors: ", np.percentile(df_no.axil_nodes,  np.arange(0,100,25)))

print("90th percentile #nodes of survivors: ", np.percentile(df_yes.axil_nodes, 90))
print("90th percentile #nodes of non survivors: ", np.percentile(df_no.axil_nodes, 90))

**Observations:**
+ Survivors have 3 nodes on average, while non survivors have 8 nodes on average.
+ The mean is sensitive to outliers: the median is only 0 for survivors and 4 for non survivors, well below the respective means.
+ 75% of survivors have at most 3 nodes.
+ 90% of survivors have at most 8 nodes.
+ 90% of non survivors have at most 20 nodes.
+ So the number of nodes clearly differs between the two classes and is a significant factor in the analysis.
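
The gap between mean and median noted above is typical of skewed counts. A small sketch with hypothetical node counts (chosen only to illustrate the effect, not taken from the dataset) shows how a single large value pulls the mean while leaving the median in place:

```python
import numpy as np

# Hypothetical node counts, not from the dataset:
# most values are small, one is very large.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 46])

mean_nodes = np.mean(nodes)      # pulled upward by the single large value
median_nodes = np.median(nodes)  # middle of the sorted values
print("Mean:  ", mean_nodes)
print("Median:", median_nodes)

# Dropping the extreme value moves the mean a lot; the median stays where it was.
trimmed = nodes[nodes < 46]
print("Mean without outlier:  ", np.mean(trimmed))
print("Median without outlier:", np.median(trimmed))
```

This is why medians and percentiles are reported alongside the mean for `axil_nodes`.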

## 2. Univariate Analysis

### 2.1 Probability Distribution
Each feature is plotted as a histogram, split by survival status, to visualise the range of its values and how spread out they are.

In [None]:
for feature in df.columns[:-1]:
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.FacetGrid(df, hue='survived', height=5).map(sns.histplot, feature, kde=True, stat='density').add_legend()
    plt.ylabel('Density')
    plt.title('Probability density function for {}'.format(feature))
    plt.show()


**Observations:**
+ The distributions of age and year of operation overlap for most cases, so neither is a good feature for classification on its own.
+ From the axil_nodes histogram, we can see that the spread of #nodes for survivors is much smaller than that for non survivors.

### 2.2 CDFs and PDFs

In [None]:
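df.hist()  # quick overview histograms of all numeric columns
plt.figure(figsize=(15, 5))
for idx, feature in enumerate(df.columns[:-1]):
    # Bin the feature, then normalise the counts so they sum to 1
    counts, bins = np.histogram(df[feature], bins=10)
    pdf = counts / counts.sum()
    cdf = np.cumsum(pdf)
    plt.subplot(1, 3, idx + 1)  # one panel per feature
    plt.plot(bins[1:], pdf, label="PDF")
    plt.plot(bins[1:], cdf, label="CDF")
    plt.xlabel(feature)
    plt.legend()  # legend on every panel, not just the last one
plt.show()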
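
The histogram-based CDF in this section depends on the choice of bins. As an alternative sketch (not the approach used in this notebook), an empirical CDF can be computed directly from the sorted values. The helper name `empirical_cdf` and the sample values are hypothetical:

```python
import numpy as np

def empirical_cdf(values):
    """Return the sorted values and the cumulative fractions i/n for i = 1..n."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical sample standing in for a column such as df['axil_nodes'].
sample = [0, 0, 1, 2, 4, 8, 13, 20]
x, y = empirical_cdf(sample)

# Fraction of points with value <= 4: side='right' finds the position just past
# the last occurrence of 4, so tied values are counted correctly.
frac_le_4 = y[np.searchsorted(x, 4, side='right') - 1]
print("Fraction with value <= 4:", frac_le_4)
```

Plotting `x` against `y` with `plt.step(x, y, where='post')` would give a bin-free CDF curve for the same feature.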
    

### 2.3 Box plots: 

In [None]:
fig, axes = plt.subplots(1,3, figsize= (15,5))
for idx, feature in enumerate(df.columns[:-1]):
    sns.boxplot(x= 'survived', y= feature, data= df, ax= axes[idx])
plt.show()    

### 2.4 Violin Plots: 

In [None]:
fig, axes = plt.subplots(1,3, figsize= (15,5))
for idx, feature in enumerate(df.columns[:-1]):
    sns.violinplot(x= 'survived', y= feature, data= df, ax= axes[idx])
plt.show()    

## 3. Bivariate Analysis: 

### 3.1 Pair Plots: 
Plotting each pair of features to understand the mutual relationships between them.

In [None]:
sns.set_style('whitegrid')
# The dataframe has no 'Id' column, so it can be passed to pairplot directly
sns.pairplot(df, hue='survived', height=5)
plt.show()

**Observations:**
+ Year of operation doesn't seem to have any effect on the classification.
+ Age of the patient seems to have a small effect on the classification.
+ Number of axillary nodes is a major feature in this dataset and can play a vital role in classification.

### CONCLUSION:
+ Plots of Haberman's data overlap heavily, so drawing inferences from such plots is difficult.
+ The data is not linearly separable on any single feature, so a more complex model would be needed to classify it.
+ The dataset is imbalanced, with more survivors than non survivors, so we can't say with certainty which features play an important role.
+ Even so, the patient's age and the number of axillary nodes seem to be significant for building a classification model.