## EDA on the Haberman Dataset for Breast Cancer Patient Survival

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np


haberman = pd.read_csv('../input/haberman.csv')

haberman.columns = ["Age", "Op_Year", "Axil_nodes", "Surv_Status"]
haberman.head(20)

1. Age: Age of the patient at the time of operation
2. Op_Year: Year of operation
3. Axil_nodes: Number of positive axillary nodes detected
4. Surv_Status:
                1 - patient lived more than 5 years.
                2 - patient died within 5 years.

In [None]:
print(haberman.shape)


In [None]:
haberman["Surv_Status"].value_counts()

###### The number of patients who survived more than 5 years is 224, whereas the number of patients who died within 5 years is 81.
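The class imbalance is easier to compare as proportions than as raw counts. A minimal sketch, assuming a small made-up `toy` frame in place of `haberman` (the values are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the haberman DataFrame.
toy = pd.DataFrame({"Surv_Status": [1, 1, 1, 2, 1, 2, 1, 1]})

# normalize=True returns class proportions instead of raw counts.
proportions = toy["Surv_Status"].value_counts(normalize=True)
print(proportions)
```

Applied to the real frame, the same call would be `haberman["Surv_Status"].value_counts(normalize=True)`.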

### Scatter Plots

In [None]:
# Set the title size before plotting; it is read when the axes are created.
plt.rcParams['axes.titlesize'] = 15
haberman.plot(kind='scatter', x='Surv_Status', y='Age')
plt.title("Scatter Plot")
plt.show()

Observations:
1. All patients aged close to 30 survived more than 5 years.
2. Patients aged close to 80 did not survive more than 5 years.

In [None]:
sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="Surv_Status", height=4) \
   .map(plt.scatter, "Age", "Op_Year") \
   .add_legend();
plt.title("Scatter Hue Plot")
plt.show()

##### The observations made above can be seen more clearly in this colour-coded plot.

### Pair Plot

In [None]:
plt.close();
sns.set_style("whitegrid");
sns.pairplot(haberman, hue="Surv_Status", vars=["Age", "Op_Year", "Axil_nodes"], height=3);

plt.show()

### Observation:
1. Most of the patients operated on in 1958 and 1965 died within 5 years.

### PDF, CDF, Histogram

In [None]:
# Split the data by survival status
haberman_1 = haberman.loc[haberman["Surv_Status"] == 1]
haberman_2 = haberman.loc[haberman["Surv_Status"] == 2]
# 1-D scatter plot: ages on the x-axis, all points at y = 0
plt.plot(haberman_1["Age"], np.zeros_like(haberman_1["Age"]), 'o')
plt.plot(haberman_2["Age"], np.zeros_like(haberman_2["Age"]), '*')
plt.legend(["1", "2"])
plt.xlabel("AGE")
plt.show()

### Observation:
The 1-D scatter plot is not informative because the points of the two classes overlap heavily.

In [None]:
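# sns.distplot is deprecated; histplot with kde=True (seaborn 0.11+) draws a
# comparable histogram with a density curve.
sns.FacetGrid(haberman, hue='Surv_Status', height=5) \
   .map(sns.histplot, "Age", kde=True, stat="density") \
   .add_legend()
plt.title("Histogram")
plt.show()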

### Observation:
1. Patients who did not survive more than 5 years are concentrated roughly between 40 and 60 years of age.
2. Between roughly 30 and 40 years of age, patients who survived more than 5 years are relatively more frequent.
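Age-related observations like these can also be checked numerically by computing the survival rate per age band. A minimal sketch, assuming a made-up `toy` frame with the notebook's column names (the values are illustrative only, and the `Age_Band` and `Survived` columns are introduced here for the example):

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's.
toy = pd.DataFrame({
    "Age": [31, 34, 38, 42, 47, 53, 58, 61, 66, 72],
    "Surv_Status": [1, 1, 1, 2, 1, 2, 1, 1, 2, 2],
})

# Bin ages into decades: [30, 40), [40, 50), ...
bins = [30, 40, 50, 60, 70, 80]
toy["Age_Band"] = pd.cut(toy["Age"], bins=bins, right=False)

# Share of patients in each band with Surv_Status == 1.
survival_rate = (
    toy.assign(Survived=toy["Surv_Status"].eq(1))
       .groupby("Age_Band", observed=False)["Survived"]
       .mean()
)
print(survival_rate)
```

A table like this gives a rate per band, which is harder to misread than overlapping histograms when the two classes differ in size.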

### CDF

In [None]:
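# Axil_nodes for patients who survived more than 5 years.
# With density=True, "counts" holds per-bin densities rather than raw counts.
counts, bin_edges = np.histogram(haberman_1["Axil_nodes"], bins=10, density=True)
print("Counts:", counts)
print("Bin edges:", bin_edges)
# Normalise so the per-bin values sum to 1
pdf = counts / counts.sum()
print("PDF:", pdf)

# The cumulative sum gives the CDF at each right bin edge
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(["PDF", "CDF"])
plt.xlabel("Axil_nodes")
plt.show()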

### Observation:
(For patients surviving more than 5 years.)
1. About 93% of patients have 10 or fewer positive axillary nodes.
2. Only about 7% of patients have more than 10 positive axillary nodes.
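Percentages read off a binned CDF are limited by where the bin edges fall, since `np.histogram` spaces its edges evenly between the minimum and maximum and they need not land on a round threshold such as 10. A minimal sketch of computing the fraction directly from the values, assuming a hypothetical helper `fraction_at_most` and made-up node counts:

```python
import numpy as np

def fraction_at_most(values, threshold):
    """Empirical CDF at `threshold`: share of values <= threshold."""
    values = np.asarray(values)
    return float(np.mean(values <= threshold))

# Hypothetical node counts, for illustration only.
nodes = [0, 0, 1, 2, 3, 4, 8, 10, 13, 22]
print(fraction_at_most(nodes, 10))
```

On the real data the call would be `fraction_at_most(haberman_1["Axil_nodes"], 10)`.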


In [None]:
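# Op_Year for patients who survived more than 5 years.
# With density=True, "counts" holds per-bin densities rather than raw counts.
counts, bin_edges = np.histogram(haberman_1["Op_Year"], bins=10, density=True)
print("Counts:", counts)
print("Bin edges:", bin_edges)
# Normalise so the per-bin values sum to 1
pdf = counts / counts.sum()
print("PDF:", pdf)

# The cumulative sum gives the CDF at each right bin edge
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(["PDF", "CDF"])
plt.xlabel("Op_Year")
plt.show()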

### Observation:
(For patients surviving more than 5 years.)
1. The roughly diagonal CDF suggests that a nearly equal number of these patients were operated on each year.

In [None]:
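# Axil_nodes for patients who died within 5 years.
# With density=True, "counts" holds per-bin densities rather than raw counts.
counts, bin_edges = np.histogram(haberman_2["Axil_nodes"], bins=10, density=True)
print("Counts:", counts)
print("Bin edges:", bin_edges)
# Normalise so the per-bin values sum to 1
pdf = counts / counts.sum()
print("PDF:", pdf)

# The cumulative sum gives the CDF at each right bin edge
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(["PDF", "CDF"])
plt.xlabel("Axil_nodes")
plt.show()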

### Observation:
(For patients who did not survive more than 5 years.)
1. About 70% of patients have fewer than 10 positive axillary nodes.
2. Only about 2% of patients have more than 25 positive axillary nodes.
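The histogram-based PDF/CDF computation is repeated for each column and class above, so it could be wrapped in one function. A minimal sketch, assuming a hypothetical helper named `binned_pdf_cdf` and made-up node counts; it normalises raw bin counts, which gives the same per-bin probabilities as the `density=True` route when the bins have equal width:

```python
import numpy as np

def binned_pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probabilities, and their cumulative sum."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # probabilities per bin, summing to 1
    cdf = np.cumsum(pdf)          # CDF evaluated at each right bin edge
    return bin_edges, pdf, cdf

# Hypothetical node counts, for illustration only.
nodes = [0, 0, 1, 2, 3, 4, 8, 10, 13, 22]
bin_edges, pdf, cdf = binned_pdf_cdf(nodes, bins=4)
print("Bin edges:", bin_edges)
print("PDF:", pdf)
print("CDF:", cdf)
```

The returned arrays can be plotted as before with `plt.plot(bin_edges[1:], pdf)` and `plt.plot(bin_edges[1:], cdf)`.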

## Box Plot and Violin Plot

In [None]:
sns.boxplot(x='Surv_Status', y='Axil_nodes', data=haberman)
plt.title("Box Plot")
plt.show()

### Observation:
1. About 75% of patients who survived more than 5 years (status 1) have fewer than 5 positive axillary nodes, whereas about 50% of patients who died within 5 years (status 2) have fewer than 5. Hence, most patients have fewer than 5 positive axillary nodes.
2. Patients who did not survive more than 5 years tend to have more positive axillary nodes.
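Quartiles read from a box plot can be confirmed numerically with a grouped quantile. A minimal sketch, assuming a made-up `toy` frame with the notebook's column names (the values are illustrative only):

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's.
toy = pd.DataFrame({
    "Axil_nodes": [0, 0, 1, 2, 4, 1, 3, 9, 13, 23],
    "Surv_Status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

# 25th, 50th, and 75th percentiles of node counts per survival class;
# these correspond to the box edges and median line of a box plot.
quartiles = (
    toy.groupby("Surv_Status")["Axil_nodes"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

Each row is a survival class and each column a quantile, so the two classes can be compared side by side.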

In [None]:
# size is not a violinplot parameter, so it is dropped here
sns.violinplot(x="Surv_Status", y="Axil_nodes", data=haberman)
plt.title("Violin Plot")
plt.show()

### Observation:
1. Most patients have fewer than 5 positive axillary nodes.

### Conclusion
Overall, the analysis shows that patients who died within 5 years tended to have more positive axillary nodes, although most patients had fewer than 5. The data also suggest that a nearly equal number of patients were operated on each year. Younger patients tended to survive longer than older patients.

**END OF ASSIGNMENT-1**