**Context about the dataset**


The Haberman's survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Attribute Information:
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years
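
The attribute layout above can be mirrored directly when reading the raw file. The following is a minimal sketch, assuming a headerless CSV in the attribute order listed; `load_haberman`, `op_year_full`, and the three sample rows are illustrative names and values, not part of the original notebook or dataset documentation.

```python
import io

import pandas as pd

# Illustrative rows only, in the attribute order listed above:
# age, year of operation minus 1900, positive axillary nodes, survival status.
SAMPLE_CSV = "30,64,1,1\n34,59,0,2\n52,69,3,2\n"

COLUMNS = ["age", "op_year", "axil_nodes", "survived"]


def load_haberman(source):
    """Read a headerless Haberman-style CSV and decode its fields (hypothetical helper)."""
    df = pd.read_csv(source, header=None, names=COLUMNS)
    # The year is stored as an offset from 1900, so 64 means 1964.
    df["op_year_full"] = df["op_year"] + 1900
    # 1 = survived 5 years or longer, 2 = died within 5 years.
    df["survived"] = df["survived"].map({1: "yes", 2: "no"}).astype("category")
    return df


sample = load_haberman(io.StringIO(SAMPLE_CSV))
print(sample)
```

Passing `header=None` matters only when the file really has no header row; if it does have one, the first record would otherwise be read as data.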

**Domain Knowledge**

As axillary lymph nodes are near the breasts, they are often the first location to which breast cancer spreads if it moves beyond the breast tissue.

After a breast cancer diagnosis, a doctor will often check whether cancer cells have spread to the axillary lymph nodes. This can help confirm the diagnosis and staging of the cancer.
If the lymph nodes feel enlarged, it’s likely the cancer has spread. However, if the lymph nodes don’t feel enlarged, it doesn’t mean the nodes are negative (cancer-free).
The pathologist checks the nodes under a microscope. Nearly one-third of women with negative lymph nodes based on a physical exam have nodes with cancer found during the pathology exam. Conversely, some women with enlarged nodes during a physical exam have cancer-free nodes.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
os.chdir("/kaggle/input/habermans-survival-data-set/")

In [None]:
df = pd.read_csv("haberman.csv")
df.columns=['age', 'op_year', 'axil_nodes', 'survived']

In [None]:
import numpy as np
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
df.head()

In [None]:
df['survived'] = df['survived'].map({1:"yes", 2:"no"})
df['survived'] = df['survived'].astype('category')

In [None]:
profile = df.profile_report(title='Pandas Profiling before Data Preprocessing', style={'full_width':True})
#profile.to_file(output_file="profiling_before_preprocessing.html")

In [None]:
profile

In [None]:
df.info()

In [None]:
df.describe()

**Preliminary Observations:**


1. There are 306 data points.
2. Patient age ranges from 30 to 83 years; the mean and median are both about 52.
3. The year of operation ranges from 1958 to 1969.
4. The maximum number of positive axillary nodes (`axil_nodes`) is 52, but 136 cases have zero positive nodes.
5. 225 of the 306 patients survived for 5 years or longer.
6. There are 17 duplicate rows. It is hard to tell whether they are duplicate entries or distinct patients with identical values.
7. The features are only weakly correlated with one another.
8. There is no missing data, so no column needs imputation.
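
The figures in this list can be recomputed with a few pandas calls rather than read off the profiling report. Below is a rough sketch, assuming the column names used in this notebook and a `survived` column already mapped to "yes"/"no"; the `summarize` helper and the five-row `toy` frame are made up for illustration and are not the Haberman data.

```python
import numpy as np
import pandas as pd


def summarize(df):
    """Collect the headline numbers from the observations into a dict (hypothetical helper)."""
    numeric = df[["age", "op_year", "axil_nodes"]]
    # Absolute pairwise correlations with the diagonal zeroed out,
    # so the maximum reflects the strongest link between two different features.
    corr = numeric.corr().abs().to_numpy().copy()
    np.fill_diagonal(corr, 0.0)
    return {
        "n_rows": len(df),
        "age_min": int(df["age"].min()),
        "age_max": int(df["age"].max()),
        "age_mean": float(df["age"].mean()),
        "age_median": float(df["age"].median()),
        "zero_node_cases": int((df["axil_nodes"] == 0).sum()),
        "survivors": int((df["survived"] == "yes").sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isna().sum().sum()),
        "max_abs_corr": float(corr.max()),
    }


# Toy data for demonstration only; rows 1 and 2 are deliberate duplicates.
toy = pd.DataFrame(
    {
        "age": [30, 45, 45, 60, 70],
        "op_year": [58, 60, 60, 65, 69],
        "axil_nodes": [0, 2, 2, 0, 10],
        "survived": ["yes", "yes", "yes", "no", "no"],
    }
)

stats = summarize(toy)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Calling `summarize(df)` on the real frame should reproduce the counts listed above if the observations are accurate. Note that `df.duplicated()` counts only the repeated occurrences, not the first copy of each row.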

In [None]:
sns.set_style('dark')
sns.pairplot(df, hue='survived', height=4, diag_kind='kde', palette="cubehelix")

In [None]:
sns.set(style="darkgrid")
ax = sns.countplot(x="op_year", hue="survived", data=df)

**Observations from the pairplot and countplot**

1. Most patients have 0-10 positive axillary nodes (`axil_nodes`).
2. Patients with 0-6 positive axillary nodes are more likely to survive.
3. Most patients operated on in 1961 survived, while 1965 appears to be the worst year.
4. Very few patients had more than 30 positive axillary nodes.
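
Reading "best" and "worst" years off a countplot is easier to check against an explicit survival rate per year. The sketch below assumes `survived` holds "yes"/"no" as in this notebook; `survival_rate_by` is a hypothetical helper and the `toy` rows are invented, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd


def survival_rate_by(df, column):
    """Patient count and share of survivors for each value of `column` (hypothetical helper)."""
    survived = df["survived"] == "yes"
    return (
        survived.groupby(df[column])
        .agg(patients="count", survival_rate="mean")
        .sort_index()
    )


# Toy data for demonstration only.
toy = pd.DataFrame(
    {
        "op_year": [61, 61, 61, 65, 65, 65, 65],
        "survived": ["yes", "yes", "no", "no", "no", "yes", "no"],
    }
)

rates = survival_rate_by(toy, "op_year")
print(rates)
```

Keeping the `patients` column next to the rate helps when a year has few operations, since a rate computed from a handful of cases can swing widely.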

In [None]:
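# Plot the per-class distribution of each feature (every column except the label).
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot with kde=True is the usual replacement.
for feature in df.columns[:-1]:
    sns.FacetGrid(df, hue="survived", height=5, palette="cubehelix").map(sns.distplot, feature).add_legend()
    plt.show()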

In [None]:
fig, axarr = plt.subplots(1, 3, figsize = (20, 5))
for (idx, feature) in enumerate(df.columns[:-1]):
    sns.boxplot(
        x = 'survived',
        y = feature,
        palette = "cubehelix",
        data = df,
        ax = axarr[idx]
    )    
plt.show()

**Observations from both plots:**


1. Patients below 40 years of age are more likely to survive, whereas patients above 60 are more likely to die within 5 years.
2. Patients treated after 1966 have a slightly higher chance of survival than the rest, and patients treated before 1959 a slightly lower chance.
3. For survivors, the number of positive lymph nodes is densely concentrated between 0 and 5. Most patients who died had at least one positive axillary node.
4. Some patients with zero positive axillary nodes died within 5 years, whereas some with more than 7-8 positive nodes lived beyond 5 years. The number of positive axillary nodes is therefore indicative but cannot be the only yardstick for predicting survival status.
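
The point about node counts being indicative but not decisive can be made concrete by tabulating outcomes per node bucket. This is a sketch under stated assumptions: the bucket edges (0, 1-5, 6-10, more than 10) are an arbitrary choice for illustration, `outcome_by_node_bucket` is a hypothetical helper, and the `toy` rows are invented rather than taken from the dataset.

```python
import pandas as pd


def outcome_by_node_bucket(df):
    """Cross-tabulate survival status against bucketed node counts (hypothetical helper)."""
    # Right-closed intervals: (-1, 0], (0, 5], (5, 10], (10, inf).
    bins = [-1, 0, 5, 10, float("inf")]
    labels = ["0", "1-5", "6-10", ">10"]
    bucket = pd.cut(df["axil_nodes"], bins=bins, labels=labels)
    return pd.crosstab(bucket, df["survived"])


# Toy data for demonstration only.
toy = pd.DataFrame(
    {
        "axil_nodes": [0, 0, 0, 3, 4, 8, 12, 20],
        "survived": ["yes", "yes", "no", "yes", "no", "yes", "no", "no"],
    }
)

table = outcome_by_node_bucket(toy)
print(table)
```

A nonzero "no" count in the zero-node row, or a nonzero "yes" count in a high-node row, is exactly the kind of overlap described in the last observation.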

**Conclusion**


1. The number of positive axillary nodes is an important feature for predicting survival status. However, as the domain knowledge note explains, physical-exam and pathology results can disagree, so data from additional tests would probably have helped predict the outcome more accurately.
2. Although the age distributions of the two classes overlap heavily, younger patients appear somewhat more likely to survive. This fits the intuition that early diagnosis improves the chances of survival, because the cancer is still at an early stage and can be contained before it spreads.
3. Year of operation also overlaps heavily between the classes. Some years look better than others, but it is difficult to say whether the year has any bearing on the chances of survival.