Data Description: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#importing packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
import warnings 
warnings.filterwarnings("ignore")

In [None]:
#importing dataset
data = pd.read_csv("/kaggle/input/habermans-survival-data-set/haberman.csv",names= ["Age","YearOfOperation","AxilNodes","SurvivalStatus"])

Description:
1. Age- Age of the patient at the time of operation.

2. YearOfOperation- The year in which operation was done.

3. AxilNodes- The axillary lymph nodes or armpit lymph nodes are lymph nodes in the human armpit. Between 20 and 49 in number, they drain lymph vessels from the lateral quadrants of the breast, the superficial lymph vessels from thin walls of the chest and the abdomen above the level of the navel, and the vessels from the upper limb. They are divided in several groups according to their location in the armpit. These lymph nodes are clinically significant in breast cancer, and metastases from the breast to the axillary lymph nodes are considered in the staging of the disease.
About 75% of lymph from the breasts drains into the axillary lymph nodes, making them important in the diagnosis and staging of breast cancer. 

4. SurvivalStatus- The class label, with values 1 and 2: 1 means the patient survived for more than 5 years after the operation, and 2 means the patient died within 5 years.
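The numeric codes 1 and 2 are easy to mix up in plots and tables, so it can help to map them to readable labels. The cell below is only a sketch: the `toy` frame holds made-up rows (not values from the real dataset), and the label strings and the `SurvivalLabel` column name are my own assumptions.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({"SurvivalStatus": [1, 2, 1, 1]})

# Map the numeric class codes to readable labels (label names are assumptions).
status_labels = {1: "survived_5y", 2: "died_within_5y"}
toy["SurvivalLabel"] = toy["SurvivalStatus"].map(status_labels)

print(toy)
```

The same `.map(status_labels)` call should work on the `data` frame loaded above, since it uses the same column name.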

In [None]:
data.head()

In [None]:
data.columns

In [None]:
data.isnull().sum()

In [None]:
data.shape

In [None]:
data["SurvivalStatus"].value_counts()

The class counts show that this is an imbalanced dataset: class 1 (survived) occurs much more often than class 2.
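To put a number on the imbalance, the counts can be turned into proportions. This is a minimal sketch on a made-up frame; `class_proportions` is a hypothetical helper name, and the toy rows say nothing about the real class balance.

```python
import pandas as pd

def class_proportions(df, label_col="SurvivalStatus"):
    """Return the share of rows in each class, largest first."""
    return df[label_col].value_counts(normalize=True)

# Hypothetical rows for illustration only.
toy = pd.DataFrame({"SurvivalStatus": [1, 1, 1, 2]})
props = class_proportions(toy)
print(props)
```

Calling `class_proportions(data)` on the real frame would give the actual split between the two classes.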

In [None]:
data.describe()

The mean age is about 52 and the mean number of axillary nodes is about 4. The number of axillary nodes ranges from 0 to 52.

In [None]:
data.info()

OBSERVATION: All columns are of type int64. There are 3 independent variables (Age, YearOfOperation and AxilNodes), and the dependent variable is SurvivalStatus.

**OBJECTIVE:** We have to classify whether a patient will survive for more than 5 years after surgery or not.

**Analysis:**
1. Univariate Analysis: PDF, CDF, boxplot, violin plots
2. Bivariate Analysis: Scatter plot, pairplot
3. Multivariate Analysis: Contours
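The list above mentions PDF and CDF plots. One possible way to compute them is from a histogram: the per-bin fractions act as a binned PDF, and their running sum is the CDF. The sketch below uses NumPy only; `empirical_pdf_cdf` is a hypothetical helper and `toy_nodes` is made-up data, not values from the dataset.

```python
import numpy as np

def empirical_pdf_cdf(values, bins=10):
    """Return per-bin fractions (binned PDF), their cumulative sum (CDF), and bin edges."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations falling in each bin
    cdf = np.cumsum(pdf)          # fraction of observations up to the end of each bin
    return pdf, cdf, bin_edges

# Hypothetical node counts for illustration only.
toy_nodes = np.array([0, 0, 1, 2, 3, 5, 8, 13, 21, 40])
pdf, cdf, edges = empirical_pdf_cdf(toy_nodes, bins=4)

print("bin edges:", edges)
print("pdf:", pdf)
print("cdf:", cdf)
```

To plot the result, `plt.plot(edges[1:], pdf)` and `plt.plot(edges[1:], cdf)` would draw both curves against the right edge of each bin, once per survival class.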

In [None]:
sns.set_style("whitegrid")
# `size` was renamed to `height` in newer seaborn versions
sns.FacetGrid(data, hue="SurvivalStatus", height=4) \
   .map(plt.scatter, "Age", "YearOfOperation") \
   .add_legend()
plt.show()

Observations:

* 30-40 age interval: Survival is more likely.
* 40-70 age interval: Survival and non-survival are almost equally likely.
* 70-80 age interval: Survival is more likely.
* Above 80: Non-survival appears more likely.
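These age-interval observations come from reading a scatter plot by eye, so it may be worth checking them numerically. A minimal sketch, assuming the same age intervals as above: `pd.cut` assigns each patient to a band, and a group-by gives the survival rate and the number of patients per band. The `toy` rows are made up and carry no information about the real dataset; `AgeBand` and `Survived` are hypothetical column names.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Age": [32, 38, 45, 55, 62, 68, 74, 83],
    "SurvivalStatus": [1, 1, 1, 2, 1, 2, 1, 2],
})

# Left-closed bands: [30, 40), [40, 70), [70, 80), [80, 90)
age_bins = [30, 40, 70, 80, 90]
toy["AgeBand"] = pd.cut(toy["Age"], bins=age_bins, right=False)

# Survival rate (mean of a boolean column) and group size per band
summary = (
    toy.assign(Survived=toy["SurvivalStatus"].eq(1))
       .groupby("AgeBand", observed=False)["Survived"]
       .agg(["mean", "count"])
)
print(summary)
```

The `count` column matters as much as the rate: a band with only one or two patients cannot support a strong conclusion.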

In [None]:
sns.set_style("whitegrid")
# `size` was renamed to `height` in newer seaborn versions
sns.FacetGrid(data, hue="SurvivalStatus", height=4) \
   .map(plt.scatter, "Age", "AxilNodes") \
   .add_legend()
plt.show()

Observations:

* When AxilNodes < 20 and 30 < Age <= 40, survival is much more likely.
* When AxilNodes < 10 and 40 < Age <= 70, survival and non-survival are almost equally likely.
* When AxilNodes < 10 and 60 < Age <= 70, non-survival is more likely.
* When 10 < AxilNodes < 20 and 30 < Age <= 50, survival is more likely.
* When 10 < AxilNodes < 20 and 50 < Age <= 70, non-survival is more likely.

In [None]:
sns.set_style("whitegrid")
# `size` was renamed to `height` in newer seaborn versions
sns.FacetGrid(data, hue="SurvivalStatus", height=4) \
   .map(plt.scatter, "YearOfOperation", "AxilNodes") \
   .add_legend()
plt.show()

Not much can be inferred from this plot, except that for operations in 1960-61 survival appears more likely when AxilNodes < 20.

In [None]:
sns.set_style("whitegrid")
# `size` was renamed to `height` in newer seaborn versions
sns.pairplot(data, hue="SurvivalStatus", palette="Dark2", height=3)
plt.show()

Observations:

1. From the plot in (row 1, col 3), it is clear that when the number of axillary nodes is very low (roughly 0-5), survival is more likely.
2. Age and AxilNodes turn out to be the more important factors in determining survival.

In [None]:
# `size` was renamed to `height` in newer seaborn versions
sns.FacetGrid(data, hue="SurvivalStatus", height=4)\
.map(plt.scatter, "AxilNodes", "Age")\
.add_legend()
plt.show()

In [None]:
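# distplot is deprecated in recent seaborn; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(data, hue="SurvivalStatus", height=4)\
.map(sns.histplot, "Age", kde=True, stat="density")\
.add_legend()
plt.show()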

The age distributions of patients who survived and patients who died overlap heavily, so not much can be concluded from Age alone.

In [None]:
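# distplot is deprecated in recent seaborn; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(data, hue="SurvivalStatus", height=4)\
.map(sns.histplot, "AxilNodes", kde=True, stat="density")\
.add_legend()
plt.show()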

Survival is most frequent when the number of axillary nodes is 0-1, and the survival rate declines gradually as the count rises.
When AxilNodes > 20, death is more likely.

In [None]:
People_survived= data.loc[data["SurvivalStatus"]==1]
People_not_survived= data.loc[data["SurvivalStatus"]==2]

In [None]:
People_survived.shape

In [None]:
People_not_survived.shape

In [None]:
print(np.mean(People_survived['Age']))
print(np.mean(People_not_survived['Age']))

In [None]:
print(np.mean(People_survived['AxilNodes']))
print(np.mean(People_not_survived['AxilNodes']))

In [None]:
print(np.median(People_survived['Age']))
print(np.median(People_not_survived['Age']))

In [None]:
print(np.median(People_survived['AxilNodes']))
print(np.median(People_not_survived['AxilNodes']))

Observations:

1. AxilNodes is more informative than the other two features.
2. The mean and median of AxilNodes differ considerably, which indicates a right-skewed distribution with some outliers.
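The gap between mean and median is easy to reproduce on a small example. The sketch below uses only the standard library, and `toy_nodes` is a made-up list with one large value, not data from the dataset. The single outlier pulls the mean up while the median barely moves.

```python
import statistics

# Hypothetical node counts for illustration only; 30 plays the role of an outlier.
toy_nodes = [0, 0, 1, 1, 2, 3, 4, 30]

mean_nodes = statistics.mean(toy_nodes)      # sensitive to the large value
median_nodes = statistics.median(toy_nodes)  # robust to the large value

print("mean:", mean_nodes)
print("median:", median_nodes)
```

This is why the median is the safer summary for a skewed feature such as AxilNodes.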

In [None]:
sns.boxplot(x='SurvivalStatus',y='Age', data=data)
plt.show()

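Observations:

1. There are no outliers, but not much can be derived from this plot because the two boxes overlap heavily.
2. For survivors, the middle 50% of ages (the box) lies roughly between 42 and 60.
3. For non-survivors, the middle 50% of ages lies roughly between 45 and 61.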

In [None]:
sns.boxplot(x='SurvivalStatus',y='AxilNodes', data=data)
plt.show()

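Observations:

1. There are a lot of outliers, so the median is preferred over the mean.
2. For survivors, the middle 50% of AxilNodes values (the box) lies roughly between 0 and 4.
3. For non-survivors, the middle 50% of AxilNodes values lies roughly between 2 and 11.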
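The points a box plot draws as outliers are, by default, those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule is below, so the outliers can be counted rather than eyeballed. `iqr_outlier_mask` is a hypothetical helper and `toy_nodes` is made-up data.

```python
import numpy as np

def iqr_outlier_mask(values, whis=1.5):
    """Flag values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return (values < lower) | (values > upper)

# Hypothetical node counts for illustration only.
toy_nodes = [0, 0, 1, 1, 2, 3, 4, 30]
mask = iqr_outlier_mask(toy_nodes)

print("outliers:", np.asarray(toy_nodes)[mask])
```

Applying the helper separately to `People_survived['AxilNodes']` and `People_not_survived['AxilNodes']` would show how many outliers each class has.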

**Final Conclusions:**
1. The important features to study are Age and AxilNodes.
2. If Age lies in the interval (30-40) and AxilNodes lies in (0-10), survival is more likely.
3. If Age lies in the interval (30-48) and AxilNodes lies in (10-20), survival is more likely.
4. If Age lies in the interval (50-70) and AxilNodes lies in (10-20), non-survival is more likely.
5. If Age lies in the interval (70-80), survival is more likely.
6. If Age is greater than 80, non-survival is more likely.
7. If AxilNodes lies in the interval (0-1), the chance of survival is highest.
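Conclusions that combine Age and AxilNodes could be checked with a two-way table of survival rates. The cell below is a sketch under assumptions: the band edges and labels are my own choice (they only loosely follow the intervals above), the `toy` rows are made up, and node counts of 20 or more would fall outside the bands and become NaN.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Age": [34, 36, 45, 47, 55, 62, 66, 75],
    "AxilNodes": [0, 12, 3, 15, 11, 1, 18, 2],
    "SurvivalStatus": [1, 1, 2, 1, 2, 1, 2, 1],
})

toy["Survived"] = toy["SurvivalStatus"].eq(1).astype(float)
toy["AgeBand"] = pd.cut(toy["Age"], bins=[30, 50, 70, 90], right=False,
                        labels=["30-49", "50-69", "70-89"])
toy["NodeBand"] = pd.cut(toy["AxilNodes"], bins=[0, 10, 20], right=False,
                         labels=["0-9", "10-19"])

# Rows are age bands, columns are node bands, cells are survival rates.
# Combinations that never occur show up as NaN.
rate_table = (
    toy.groupby(["AgeBand", "NodeBand"], observed=True)["Survived"]
       .mean()
       .unstack()
)
print(rate_table)
```

Replacing `.mean()` with `.size()` gives the number of patients per cell, which shows how much evidence each rate rests on.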

Suggestions are always welcome!

As a beginner, I would appreciate it if you could read and review my notebook. Please upvote if you like my work. It will boost my confidence.

Thank You.