## EDA

## Haberman's Survival Data Set
<h3 style="line-height:25px;font-family:calibri">The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.</h3>


In [None]:
# Import the required packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load the data set into a pandas DataFrame named hab
hab = pd.read_csv('../input/haberman.csv')

In [None]:
print(hab.shape)  # (number of rows, number of columns), i.e. data points and features

In [None]:
print(hab.head())  # show the first five rows to inspect the structure

<p style="font-family:calibri; font-size:16px">The output shows that the file has no header row, so pandas has used the first data row as the column names.</p>

In [None]:
colnames = ["age", "year", "nof_nodes", "status"]  # column names to assign
# Passing names= makes pandas treat the file as having no header row
hab = pd.read_csv('../input/haberman.csv', names=colnames)
print(hab.head())  # first five rows, now with named columns

<p style="font-family:calibri; font-size:16px">Each column is described as follows:</p>
<ol>
 <li><b>age:</b> Age of patient at time of operation (numerical)</li>
 <li><b>year:</b> Patient's year of operation (year - 1900, numerical)</li>
 <li><b>nof_nodes:</b> Number of positive axillary nodes detected (numerical)</li>
 <li><b>status:</b> Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years</li>
</ol>

In [None]:
print(hab.shape)    # (number of rows, number of columns), i.e. data points and features
print(hab.columns)  # all column names in the data set

## Objective: Based on the available features, predict whether a patient will survive 5 years or longer after surgery

In [None]:
# Count the data points in each class, i.e. how many patients
# survived 5 years or longer (1) and how many died within 5 years (2)
hab['status'].value_counts()

<p style="font-family:calibri; font-size:16px">The counts show that this is an imbalanced data set: the two classes do not contain the same number of data points.</p>
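
<p style="font-family:calibri; font-size:16px">To put a number on the imbalance, it can help to look at class proportions rather than raw counts. The following is a minimal sketch on a small made-up data frame (the name <code>toy</code> and its values are assumptions for illustration, not the Haberman data); the same calls could be applied to <code>hab['status']</code>.</p>

```python
import pandas as pd

# Hypothetical toy labels, NOT the real Haberman counts:
# 1 = survived 5 years or longer, 2 = died within 5 years
toy = pd.DataFrame({"status": [1, 1, 1, 2, 1, 2, 1, 1]})

# Absolute number of data points per class
counts = toy["status"].value_counts()

# Relative frequency of each class (the values sum to 1.0)
proportions = toy["status"].value_counts(normalize=True)

# Ratio of the majority class size to the minority class size
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(proportions)
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

<p style="font-family:calibri; font-size:16px">A ratio close to 1 would indicate balanced classes; the further it is from 1, the more a naive classifier can appear accurate simply by always predicting the majority class.</p>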

## Scatter Plot

In [None]:
# Scatter plot with age on the x-axis and nof_nodes on the y-axis
hab.plot(kind="scatter", x="age", y="nof_nodes")
plt.show()

<p style="font-family:calibri; font-size:18px; line-height:22px">In the above scatter plot, it is difficult to tell which data points belong to which class, so it makes more sense to color the points by class.</p>

In [None]:
sns.set_style("whitegrid")
# Color the points by class; seaborn 0.9+ uses height= (formerly size=)
sns.FacetGrid(hab, hue="status", height=5) \
   .map(plt.scatter, "age", "nof_nodes") \
   .add_legend()
plt.show()

<p style="font-family:calibri; font-size:18px; line-height:22px">Here, we can observe that most of the patients have fewer than 10 positive axillary nodes.</p>
<p style="font-family:calibri; font-size:18px; line-height:22px">Because the data points of the two classes overlap heavily in the above scatter plot, it is difficult to classify them based on these two features alone. Let us look at all the pairwise plots.</p>

## Pair Plots

In [None]:
sns.set_style("whitegrid")
# seaborn 0.9+ uses height= (formerly size=) for the size of each facet
sns.pairplot(hab, hue="status", height=3)
plt.show()

<p style="font-family:calibri; font-size:18px; line-height:22px">In every one of the pair plots above, the data points of the two classes overlap heavily. Let us move on to univariate analysis.</p>

## Histogram, PDF, CDF

In [None]:
# Distribution plot: histogram plus estimated PDF (KDE) for each class.
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot with kde=True is its successor.
sns.FacetGrid(hab, hue="status", height=5) \
   .map(sns.distplot, "age") \
   .add_legend()
plt.show()

<p style="font-family:calibri; font-size:18px; line-height:22px">The histogram shows the distribution of patient ages, which range from roughly 30 to 82 years. The two classes overlap heavily here as well, and only a small region belongs clearly to one class. Let us look at the other features.</p>

In [None]:
# Distribution plot: histogram plus estimated PDF (KDE) for each class.
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot with kde=True is its successor.
sns.FacetGrid(hab, hue="status", height=8) \
   .map(sns.distplot, "nof_nodes") \
   .add_legend()
plt.show()

<p style="font-family:calibri; font-size:18px; line-height:22px">The histogram shows the distribution of the number of positive axillary nodes per patient. Far more patients with zero nodes survived than died, and the gap between the two classes at zero is large.</p>

In [None]:
# Distribution plot: histogram plus estimated PDF (KDE) for each class.
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot with kde=True is its successor.
sns.FacetGrid(hab, hue="status", height=5) \
   .map(sns.distplot, "year") \
   .add_legend()
plt.show()

<p style="font-family:calibri; font-size:18px; line-height:22px">The histogram shows the year of operation, and the distributions for patients who survived and patients who died overlap almost completely. Survival therefore appears to depend very little on the year of operation, so this feature is unlikely to be useful for prediction.</p>

<p style="font-family:calibri; font-size:18px; line-height:22px">Among the three plots above, only the one for nof_nodes separates at least a few data points by a clear gap. So, let us take this feature and analyze it further.</p>
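
<p style="font-family:calibri; font-size:18px; line-height:22px">The distribution plots above show the histogram and PDF; a CDF can be derived from the same histogram counts. The following is a minimal sketch using NumPy on a small made-up array (the name <code>nodes</code>, its values, and the choice of 4 bins are assumptions for illustration only). The same steps could be applied to the nof_nodes column of each class.</p>

```python
import numpy as np

# Hypothetical node counts for illustration only (not the real data)
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 12])

# Histogram with 4 equal-width bins over the observed range
counts, bin_edges = np.histogram(nodes, bins=4)

# Here "PDF" means the fraction of data points falling in each bin
pdf = counts / counts.sum()

# The CDF is the running total of the PDF; its last value is 1.0
cdf = np.cumsum(pdf)

for right_edge, p, c in zip(bin_edges[1:], pdf, cdf):
    print(f"up to {right_edge:5.1f} nodes: pdf={p:.2f}, cdf={c:.2f}")
```

<p style="font-family:calibri; font-size:18px; line-height:22px">Each CDF value reads as "the fraction of patients with at most this many nodes", which makes it easy to compare the two classes at a given threshold. The arrays can be drawn with <code>plt.plot(bin_edges[1:], cdf)</code>.</p>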

In [None]:
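# Split the data by class: 1 = survived, 2 = died
hab_status1 = hab.loc[hab["status"] == 1]
hab_status2 = hab.loc[hab["status"] == 2]
# Number of non-missing values per column in each subset
print(hab_status1.count())
print(hab_status2.count())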

In [None]:
hab_status1.describe()

In [None]:
hab_status2.describe()

<p style="font-family:calibri; font-size:18px; line-height:22px">Here too, the mean and standard deviation of age and year are almost the same for patients who survived and patients who died, but for nof_nodes they differ considerably. This supports the view that nof_nodes is more useful than the other available features.</p>
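
<p style="font-family:calibri; font-size:18px; line-height:22px">The per-class comparison of means and standard deviations can also be produced in one table with a group-by. This is a minimal sketch on a small made-up data frame (the name <code>toy</code> and all of its values are assumptions for illustration, not the Haberman data); replacing <code>toy</code> with <code>hab</code> would give the real figures.</p>

```python
import pandas as pd

# Hypothetical toy rows for illustration only (not the real data)
toy = pd.DataFrame({
    "age":       [34, 45, 52, 61, 43, 50, 58, 66],
    "nof_nodes": [0, 1, 0, 3, 4, 9, 13, 22],
    "status":    [1, 1, 1, 1, 2, 2, 2, 2],
})

# Mean and standard deviation of each feature, one row per class
summary = toy.groupby("status")[["age", "nof_nodes"]].agg(["mean", "std"])
print(summary)

# Difference between the class means for a single feature
node_means = toy.groupby("status")["nof_nodes"].mean()
mean_gap = node_means.loc[2] - node_means.loc[1]
print(f"gap in mean nof_nodes (class 2 - class 1): {mean_gap}")
```

<p style="font-family:calibri; font-size:18px; line-height:22px">Putting both classes side by side in one table makes it easier to spot which feature has class means that differ by more than its spread.</p>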

## Box Plots

In [None]:
sns.boxplot(x='status',y='nof_nodes', data=hab)
plt.show()

## Violin Plots

In [None]:
# violinplot is an axes-level function and has no size parameter
sns.violinplot(x="status", y="nof_nodes", data=hab)
plt.show()

<p style="font-family:calibri; font-size:18px; line-height:22px">From the above plots we can observe that, in this data set, the patients with more than 50 positive nodes died, although such high counts are rare, so this should not be read as a certainty. A patient with zero positive nodes is more likely to survive than to die.</p>
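
<p style="font-family:calibri; font-size:18px; line-height:22px">To read a box plot precisely, it helps to know the numbers behind it. The sketch below computes the quartiles, the interquartile range, and the conventional 1.5 × IQR whisker limits with NumPy. The array <code>nodes</code> and its values are made up for illustration, and the 1.5 factor is the usual default whisker rule, assumed here rather than taken from the plots above.</p>

```python
import numpy as np

# Hypothetical node counts for one class (illustration only)
nodes = np.array([0, 0, 1, 2, 3, 4, 6, 9, 15, 40])

# The box spans the 25th to 75th percentile; the line inside is the median
q1, median, q3 = np.percentile(nodes, [25, 50, 75])
iqr = q3 - q1

# Conventional whisker limits: 1.5 * IQR beyond each end of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are typically drawn as individual outliers
outliers = nodes[(nodes < lower_fence) | (nodes > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print("outliers:", outliers)
```

<p style="font-family:calibri; font-size:18px; line-height:22px">A single very large value shows up as an outlier without moving the box, which is why a long upper tail in nof_nodes should be interpreted with care.</p>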

In [None]:
print("\nMedians:")
print(np.median(hab_status1["nof_nodes"]))
print(np.median(hab_status2["nof_nodes"]))

# 0th, 25th, 50th and 75th percentiles (minimum and quartiles)
print("\nQuantiles:")
print(np.percentile(hab_status1["nof_nodes"], np.arange(0, 100, 25)))
print(np.percentile(hab_status2["nof_nodes"], np.arange(0, 100, 25)))

print("\n90th Percentiles:")
print(np.percentile(hab_status1["nof_nodes"], 90))
print(np.percentile(hab_status2["nof_nodes"], 90))



<p style="font-family:calibri; font-size:18px; line-height:22px">Here, we can say that the 90th percentile of nof_nodes for patients who died is 20, i.e. about 90% of them had 20 or fewer positive nodes.</p>
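
<p style="font-family:calibri; font-size:18px; line-height:22px">The meaning of a percentile can be checked directly: the share of values at or below the 90th percentile should be close to 0.9. The following is a minimal sketch with a made-up array (the name <code>nodes</code> and its values are assumptions for illustration, not the real data).</p>

```python
import numpy as np

# Hypothetical node counts for illustration only (not the real data)
nodes = np.array([0, 1, 1, 2, 3, 5, 8, 11, 14, 20, 35])

# Value below which roughly 90% of the data points fall
p90 = np.percentile(nodes, 90)

# Fraction of data points that are at or below that value
share_at_or_below = np.mean(nodes <= p90)

print(f"90th percentile: {p90}")
print(f"share of values <= 90th percentile: {share_at_or_below:.3f}")
```

<p style="font-family:calibri; font-size:18px; line-height:22px">With small samples the share is only approximately 0.9, because the percentile is interpolated between neighboring data points.</p>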

## Conclusion

<ul>
 <li>Among the available features, only nof_nodes (number of positive axillary nodes) is useful, to some extent, in predicting the survival status</li>
 <li>More than 50% of the patients who survived have zero positive axillary nodes</li>
 <li>We can also observe that patients with fewer positive axillary nodes survived more often</li>
</ul>