# Relevant Information: 
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
### Attribute Information
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
    * 1 = the patient survived 5 years or longer
    * 2 = the patient died within 5 years

### My assumptions
1. Younger patients have a higher probability of survival.
2. Patients with fewer positive nodes have a higher probability of survival.
3. Patients who are both younger and have fewer nodes are the most likely to survive.


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings 

warnings.filterwarnings("ignore") 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
haberman = pd.read_csv("/kaggle/input/habermans-survival-data-set/haberman.csv")
haberman.columns =  ['age', 'year', 'nodes', 'status']
haberman.tail()

### Common details

In [None]:
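def details(dataset):
    """Print the shape, column names and dtypes of a DataFrame."""
    # Use the function argument rather than the global 'haberman' frame
    print("Shape of data set:", dataset.shape)
    print(f"Number of rows: {dataset.shape[0]}\nNumber of columns: {dataset.shape[1]}")
    print("Columns:", list(dataset.columns))
    print("Data types:\n", dataset.dtypes)

details(haberman)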

### Observations
1. It has only numerical variables.
2. The 'status' column is our target variable.

In [None]:
haberman['status'].value_counts()

For convenience, recode the target variable to 1 and 0 by replacing every 2 with 0:
* 1 = the patient survived 5 years or longer
* 0 = the patient died within 5 years


* The data is imbalanced: more patients survived than died.
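
To put a number on the imbalance, the class shares can be computed with `value_counts(normalize=True)`. The sketch below runs on a small made-up `status` column in the original 1/2 coding, not on the real data; the name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for haberman['status'] (made-up values, original 1/2 coding)
toy = pd.DataFrame({"status": [1, 1, 1, 2, 1, 2, 1, 1]})

# Recode: 1 stays 1 (survived 5+ years), 2 becomes 0 (died within 5 years)
toy["status"] = toy["status"].map({1: 1, 2: 0})

# Share of each class as a percentage of all rows
class_share = toy["status"].value_counts(normalize=True) * 100
print(class_share.round(2))
```

On the real frame the same two lines would apply to `haberman['status']`; a strongly unequal split is a hint that plain accuracy may be a misleading metric later on.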

In [None]:
haberman['status'] = haberman['status'].apply(lambda x : 0 if x== 2 else 1)
haberman.head()

# Exploratory data analysis and visualization

* Uni-variate Analysis

### Questions
1. How is the age distributed?
2. How many nodes are there?

In [None]:

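sns.set_style('darkgrid')
# histplot replaces the deprecated distplot; stat='density' keeps the bars on the same scale as the KDE
sns.histplot(haberman['age'], kde=True, stat='density', color='g')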
plt.show()


### Observation
1. Age is approximately normally distributed.

In [None]:
# Pass the column by name; mixing a positional Series with data= fails on recent seaborn versions
sns.boxplot(y='age', data=haberman)
plt.title("Box plot for age")
plt.show()

### Observations
* The typical age is around 50 (the line inside the box is the median).

* There appear to be no outliers; verify this with the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR.

In [None]:
print("Mean age is:",haberman['age'].mean())

In [None]:
import numpy as np

# Tukey fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers
Q1 = np.percentile(haberman['age'], q=25)
print(Q1)
Q3 = np.percentile(haberman['age'], q=75)
print(Q3)
IQR = Q3 - Q1
print(IQR)
low = Q1 - 1.5 * IQR
high = Q3 + 1.5 * IQR
print(low, high)

# Strict inequalities on both sides, so a value sitting exactly on a fence is not flagged
print(haberman['age'].loc[(haberman['age'] < low) | (haberman['age'] > high)])

No such values are found, which confirms the observation.
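
Since the same fence check is repeated for other columns, it can be wrapped in a small helper. This is a minimal sketch: the function name `iqr_outliers` and the toy ages are hypothetical, and the value 120 is planted on purpose so the helper has something to find.

```python
import numpy as np
import pandas as pd

def iqr_outliers(values):
    """Return the values lying outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    values = pd.Series(values)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < low) | (values > high)]

# Made-up ages, not the real data; 120 is a planted outlier
toy_ages = [30, 42, 45, 50, 52, 55, 60, 63, 120]
print(iqr_outliers(toy_ages).tolist())
```

With a helper like this, `iqr_outliers(haberman['age'])` and `iqr_outliers(haberman['nodes'])` would apply exactly the same rule to both columns.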

In [None]:
import matplotlib.pyplot as plt
# 'size' was renamed to 'height' in seaborn 0.9; no legend is needed without a hue
sns.FacetGrid(haberman, height=5) \
   .map(sns.histplot, "age", kde=True, stat="density")
plt.show()


* The age distribution is approximately normal.
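
A plot alone can be deceptive, so a quick numeric check of the shape may help. The sketch below computes skewness and excess kurtosis with pandas; both are close to 0 for a normal distribution. It runs on synthetic ages drawn from a normal distribution (an assumption made only so the example is self-contained), not on the real column.

```python
import numpy as np
import pandas as pd

# Synthetic ages drawn from a normal distribution (not the real data)
rng = np.random.default_rng(0)
toy_age = pd.Series(rng.normal(loc=52, scale=10, size=1000))

skew = toy_age.skew()             # about 0 for a symmetric distribution
excess_kurtosis = toy_age.kurt()  # about 0 for normal-like tails
print(f"skewness: {skew:.3f}, excess kurtosis: {excess_kurtosis:.3f}")
```

Values far from 0 on `haberman['age']` would suggest that "normal" is only a rough description.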

# Tabular method

In [None]:
haberman.describe()

### Observations
* Mean age is 52 
* No missing values

In [None]:
haberman.isnull().sum()

### Check the distribution of 'nodes'

In [None]:
print("nodes and their counts:", haberman['nodes'].value_counts().to_dict())
# Keep the node values as keys so each percentage can be matched to its node count
print("percentage of each node count:", (haberman['nodes'].value_counts(normalize=True) * 100).round(2).to_dict())

* 75.17% of patients had nodes <= 4, i.e. at or below the mean (about 4).
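
A cumulative percentage table makes a statement like "x% of patients had at most k nodes" easy to read off for any k. The following is a sketch on a made-up node column; `toy_nodes` and its values are hypothetical.

```python
import pandas as pd

# Made-up node counts, not the real data
toy_nodes = pd.Series([0, 0, 0, 0, 1, 1, 2, 4, 7, 13])

# Percentage per node count, ordered by node count, plus the running total
share = toy_nodes.value_counts(normalize=True).sort_index() * 100
summary = pd.DataFrame({"percent": share, "cumulative_percent": share.cumsum()})
print(summary)

# Direct check of one threshold: share of rows with nodes <= 4
pct_at_most_4 = (toy_nodes <= 4).mean() * 100
print(pct_at_most_4)
```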

In [None]:
sns.set_style('darkgrid')
# histplot replaces the deprecated distplot; bins are 5 nodes wide, from 0 to 60
sns.histplot(haberman['nodes'], bins=list(range(0, 65, 5)), color='g')
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.lineplot(x = haberman['nodes'],y = haberman['age'])
plt.title("age vs nodes lineplot")

In [None]:
from collections import Counter
print(Counter(haberman.loc[haberman['nodes'] <=4]['age'].tolist()))
print(haberman.loc[haberman['nodes'] <=4]['age'].tolist())
print(len(haberman.loc[haberman['nodes'] <=4]['age'])/len(haberman)* 100)

* Age and nodes are not correlated, but one thing stands out: many of the patients with nodes <= 4 are between 48 and 55 years old.
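
The "not correlated" reading of the line plot can be backed by a number. The sketch below computes the Pearson correlation coefficient with `Series.corr` on made-up values (the frame `toy` is hypothetical); a coefficient near 0 indicates no linear relationship.

```python
import pandas as pd

# Made-up (age, nodes) pairs, not the real data
toy = pd.DataFrame({
    "age":   [34, 41, 47, 50, 52, 58, 63, 70],
    "nodes": [3, 0, 11, 1, 0, 4, 0, 2],
})

# Pearson correlation coefficient, always between -1 and 1
r = toy["age"].corr(toy["nodes"])
print(round(r, 3))
```

Pearson only captures linear association, so a value near 0 does not rule out other kinds of dependence.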

### Observations
1. 44% of patients had 0 nodes and 13% had 1 node, so 57% in total had either 0 or 1 node.

**The node count looks like a useful feature for classification**

In [None]:
plt.figure(figsize=(10,6))
# Pass the column by name; mixing a positional Series with data= fails on recent seaborn versions
sns.boxplot(y='nodes', data=haberman)
plt.title("Box plot for nodes")
plt.show()

* The average number of nodes is about 4.

In [None]:
haberman['nodes'].mean()

In [None]:
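# Same fence rule as for age: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers
Q1 = np.percentile(haberman['nodes'], q=25)
print(Q1)
Q3 = np.percentile(haberman['nodes'], q=75)
print(Q3)
IQR = Q3 - Q1
print(IQR)
low = Q1 - 1.5 * IQR
high = Q3 + 1.5 * IQR
print(low, high)

# Number of outlying node counts (strict inequalities on both sides)
print(len(haberman['nodes'].loc[(haberman['nodes'] < low) | (haberman['nodes'] > high)]))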

### Check year
* Assumptions:
1. Year is unlikely to be a useful classifier because it is just the patient's year of operation.

2. But we can observe the trend of survival by operation year.

In [None]:
haberman['year'].value_counts()

### Observations
1. An almost equal number of operations was performed in every year except 68 and 69.

# Bi-variate analysis
* We will look at 2 variables at a time.

### Age vs nodes, by status

In [None]:
sns.set_style("darkgrid")
import matplotlib.pyplot as plt
# 'size' was renamed to 'height' in seaborn 0.9
sns.FacetGrid(haberman, hue='status', height=5).map(sns.scatterplot, 'age', 'nodes').add_legend()
plt.show()

### Observations
1. Most patients with **0** nodes survived, irrespective of their **age**.
2. Most patients with 1 node also survived (roughly 90%, judging by the plot).
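
Points in a scatter plot overlap, so percentages read off it are only rough. A `groupby` on the node count gives the exact survival rate per group. This is a sketch on made-up rows (the frame `toy` is hypothetical) that assumes `status` is already coded as 1 = survived, 0 = died.

```python
import pandas as pd

# Made-up rows, not the real data; status: 1 = survived, 0 = died
toy = pd.DataFrame({
    "nodes":  [0, 0, 0, 0, 1, 1, 1, 5, 5, 9],
    "status": [1, 1, 1, 0, 1, 1, 0, 1, 0, 0],
})

# With a 0/1 target, the group mean is the share of survivors
by_nodes = toy.groupby("nodes")["status"].agg(patients="count", survival_rate="mean")
by_nodes["survival_rate"] = by_nodes["survival_rate"] * 100
print(by_nodes)
```

The `patients` column matters as much as the rate: a rate computed from a handful of patients is far less reliable than one computed from a large group.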

In [None]:
sns.FacetGrid(haberman,hue = 'status',height = 5).map(sns.scatterplot,'year','age').add_legend()

### Observation
1. Every year shows a similar survival pattern, so year is not a good classifier.
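
Whether the years really behave alike can be checked by comparing the yearly survival rates and their spread. A minimal sketch on made-up rows (the frame `toy` is hypothetical), again assuming `status` is coded as 1 = survived, 0 = died:

```python
import pandas as pd

# Made-up rows, not the real data; year is stored as year - 1900
toy = pd.DataFrame({
    "year":   [58, 58, 59, 59, 60, 60],
    "status": [1, 0, 1, 1, 1, 0],
})

# Survival rate (%) per operation year
by_year = toy.groupby("year")["status"].mean() * 100
print(by_year)

# Gap between the best and the worst year, in percentage points
spread = by_year.max() - by_year.min()
print(spread)
```

A small spread on the real data would support dropping `year` as a feature; a large one would call for a closer look.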

In [None]:
sns.FacetGrid(haberman,hue = 'status',height = 5).map(sns.scatterplot,'year','nodes').add_legend()

### Observations
1. The nodes vs year plot also shows that fewer nodes go with better survival.

# Distribution plots
* Distribution plots are used to visually assess how the data points are distributed with respect to their frequency.

In [None]:
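# One overlaid histogram + KDE per feature, split by survival status
for feature in haberman.columns[:-1]:
    fg = sns.FacetGrid(haberman, hue='status', height=5)
    # histplot replaces the deprecated distplot; stat='density' normalises each class separately
    fg.map(sns.histplot, feature, kde=True, stat='density').add_legend()
    plt.show()
    print("*" * 50)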

### Observations
1. Age is approximately normally distributed in both survival classes.
2. 81% of patients with nodes <= 4 survived and 19% did not (verified by the code below).

In [None]:
haberman.loc[haberman['nodes'] <=4]['status'].value_counts()/len(haberman.loc[haberman['nodes'] <=4]['status']) * 100 

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    sns.boxplot( x='status', y=feature, data=haberman, ax=axes[idx])
plt.show()  

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    sns.violinplot( x='status', y=feature, data=haberman, ax=axes[idx])
plt.show() 

# Density Functions
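* A Probability Density Function (PDF) describes how likely a variable is to take values near x. For a continuous variable, the probability of falling in an interval is the area under the PDF over that interval. It can be seen as a smoothed version of the histogram.
* A Kernel Density Estimate (KDE) is one way to estimate the PDF. The area under the KDE curve is 1.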
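
The "area is 1" property also holds for a density-normalised histogram, which is a useful sanity check. The sketch below uses `np.histogram(..., density=True)` on a synthetic sample (an assumption made so the example is self-contained) and shows that bar heights are densities, not probabilities: each height has to be multiplied by its bin width before summing.

```python
import numpy as np

# Synthetic sample, not the real data
rng = np.random.default_rng(42)
sample = rng.normal(loc=52, scale=10, size=500)

# density=True returns bar heights such that the total bar area is 1
density, bin_edges = np.histogram(sample, bins=10, density=True)
bin_widths = np.diff(bin_edges)

# Height times width is the probability mass of each bin
bin_prob = density * bin_widths
area = bin_prob.sum()

# The running sum of the bin probabilities is the empirical CDF at the right bin edges
cdf = np.cumsum(bin_prob)
print(area, cdf[-1])
```

Because `np.histogram` produces equal-width bins when `bins` is an integer, dividing the heights by their sum yields the same per-bin probabilities as `bin_prob`.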

In [None]:
plt.figure(figsize=(20,5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    plt.subplot(1, 3, idx+1)
    counts, bin_edges = np.histogram(haberman[feature], bins=10, density=True)
    pdf = counts/sum(counts)
    cdf = np.cumsum(pdf)
    plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)
    plt.legend(['pdf','cdf'])
    plt.xlabel(feature)

* More than 80% of the patients who had 0-2 nodes survived.

# Multivariate Analysis

1. Pairplot

In [None]:
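g = sns.pairplot(haberman, hue='status', corner=True)
# plt.title would only label the last subplot, so title the whole figure instead
g.figure.suptitle("pairplot", y=1.02)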

In [None]:
sns.heatmap(data= haberman.corr()[['age','nodes','status']],annot = True)

In [None]:
haberman.corr()[['age','nodes','status']]

* There is no strong correlation between any pair of variables.

# All observations
1. Age is approximately normally distributed.
2. Most patients had nodes <= 4.
3. Most patients with nodes <= 4 survived.
4. Age and survival show no clear relation; year only records the year of operation.
5. An almost equal number of operations was performed in every year.