# Analyzing Healthcare data (Exploratory Data Analysis 101)

Tired of Titanic Exploration and Iris Exploration, I thought it would be refreshing to explore a new real-world dataset! This kernel is aimed primarily at beginners in Exploratory Data Analysis, although anybody can enjoy crunching the stats.

> "I'm tired of Titanic Exploration and Iris Exploration" - Abraham Lincoln (2005)

## Introduction 

NTR Vaidya Seva (or Arogya Seva) is the flagship healthcare scheme of the Government of Andhra Pradesh, India, under which lower-middle class and low-income citizens of the state can obtain free healthcare for many major diseases and ailments. A similar program exists in the neighboring state of Telangana as well.

## Let's Code!

We will start by importing the requisite libraries:
* *Pandas* for Data Loading and Exploration
* *Matplotlib, Seaborn* for Visualization.

In [None]:
# import requisite libraries 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.Series.__unicode__ = pd.Series.to_string

Let's read the dataset file into the kernel using the Pandas *read_csv* function. *read_csv* comfortably reads Comma Separated Values (csv) files, while *read_table* reads general delimited text files (tab-separated by default). Excel files (like xlsx) need *read_excel* instead.

There are no restrictions on naming the data variable; *df* and *data* are the most common generic names. We will go with *data*.

In [None]:
# read dataset into kernel
data = pd.read_csv("../input/ntrarogyaseva.csv")
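
As a minimal sketch of how the delimiter is handled, the snippet below reads two tiny in-memory "files" instead of the real dataset. The contents of *csv_text* and *tsv_text* are made up for illustration; only *read_csv* and its *sep* parameter come from Pandas.

```python
import io
import pandas as pd

# Hypothetical in-memory "files" so the example runs without the real dataset
csv_text = "AGE,SEX\n45,Male\n30,Female\n"
tsv_text = "AGE\tSEX\n45\tMale\n30\tFemale\n"

# read_csv splits on commas by default
csv_frame = pd.read_csv(io.StringIO(csv_text))

# the same function reads tab-separated text when given sep="\t"
tsv_frame = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# both calls should produce the same two-row dataframe
print(csv_frame.equals(tsv_frame))
```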

Have a look at the data using the *head()* function, which displays the top 5 rows by default.

In [None]:
# display top rows using head 
data.head()
# data.head(10) for top 10 rows

Let's print summary statistics (descriptive statistics) of the numeric columns in the data. We will use the *describe* function of the dataframe for this.

In [None]:
# print summary statistics
data.describe()
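
To see what *describe* returns without loading the real file, here is a small sketch on a toy dataframe. The column names mirror the dataset but the values are made up. Passing *include="all"* is an optional extra that also summarizes text columns.

```python
import pandas as pd

# Hypothetical toy records; values are invented for illustration
toy = pd.DataFrame({
    "AGE": [45, 30, 62, 8],
    "SEX": ["Male", "Female", "Female", "Boy"],
})

# default behaviour: count, mean, std, min, quartiles, max for numeric columns only
numeric_summary = toy.describe()

# include="all" adds unique/top/freq rows for the text columns
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```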

Let's have a look at all the column names of the data.

In [None]:
# display all the column names in the data
data.columns

## Diving deeper into the data

### What is the gender distribution of the data?

*value_counts()* is a Pandas function that can be used to print the distributions of data (in the specified column). Let's begin by checking the gender stats of the data.

In [None]:
# Display the counts of each value in the SEX column
data['SEX'].value_counts()

Oops! It looks like there are duplicate values in this column. *Male* and *MALE* are not two different genders! 

We can *replace* the column values to fix this issue. I will also rename Male (Child) -> Boy and Female (Child) -> Girl for convenience.

In [None]:
# mappings to standardize and clean the values
mappings = {'MALE' : 'Male', 'FEMALE' : 'Female', 'Male(Child)' : 'Boy', 'Female(Child)' : 'Girl'}

In [None]:
# replace values using the defined mappings
data['SEX'] = data['SEX'].replace(mappings)
data['SEX'].value_counts()
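
A small sketch of why *replace* is a comfortable choice here, using a made-up Series (*raw_sex* is hypothetical). *replace* leaves values that are not in the mapping untouched, whereas *map* would turn them into missing values.

```python
import pandas as pd

# Hypothetical raw labels with inconsistent capitalisation
raw_sex = pd.Series(["Male", "MALE", "Female", "FEMALE",
                     "Male(Child)", "Female(Child)", "Male"])

mappings = {"MALE": "Male", "FEMALE": "Female",
            "Male(Child)": "Boy", "Female(Child)": "Girl"}

# replace: only the listed values change, everything else is kept as-is
clean_sex = raw_sex.replace(mappings)

# map: values missing from the dictionary become NaN
mapped_sex = raw_sex.map(mappings)

print(clean_sex.value_counts())
print("Values lost by map:", mapped_sex.isna().sum())
```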

Visualizing this distribution can be done easily using the built-in plot function of Pandas.

In [None]:
# plot the value counts of sex 
data['SEX'].value_counts().plot.bar()

### What is the age distribution of the data?

Mean, Median and Mode of the data.

In [None]:
# print the mean, median and mode of the age distribution
print("Mean: {}".format(data['AGE'].mean()))
print("Median: {}".format(data['AGE'].median()))
print("Mode: {}".format(data['AGE'].mode()))
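
One detail worth knowing: *mode()* returns a Series rather than a single number, because several values can be tied for most frequent. A tiny sketch with made-up ages:

```python
import pandas as pd

# Hypothetical ages in which 30 and 45 are tied for most frequent
ages = pd.Series([30, 45, 30, 45, 62])

# mode() keeps every tied value, sorted in ascending order
modes = ages.mode()

print("Mean: {}".format(ages.mean()))
print("Median: {}".format(ages.median()))
print("Mode(s): {}".format(modes.tolist()))
```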

Top 10 common ages of the data. Feel free to play around by replacing 10 with the number of your choice.

In [None]:
# print the top 10 ages
data['AGE'].value_counts().head(10)

Box plots are commonly used for visualizing a distribution when bar plots or scatter plots are too overwhelming to understand.

In [None]:
# boxplot for age variable
data['AGE'].plot.box()
# sns.boxplot(x=data['AGE'])

In the above diagram, the box represents the **[Interquartile Range (IQR)](https://en.wikipedia.org/wiki/Interquartile_range)** of the data.

The interquartile range is the region where the middle 50% of the data lies, i.e. between the 25th and 75th percentiles.
Any point more than 1.5 times the IQR beyond the edges of the box is generally considered an anomaly.

The little circles in the above figure are considered outliers.
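
The 1.5 times IQR rule can be worked out by hand. The sketch below uses made-up ages (the *ages* Series and the fence variable names are hypothetical) and Pandas' default quantile interpolation, so the exact cut-offs may differ slightly from what a plotting library draws.

```python
import pandas as pd

# Hypothetical ages; 95 is deliberately far from the rest
ages = pd.Series([20, 25, 30, 35, 40, 45, 50, 95])

q1 = ages.quantile(0.25)   # 25th percentile (bottom edge of the box)
q3 = ages.quantile(0.75)   # 75th percentile (top edge of the box)
iqr = q3 - q1

# points beyond these fences are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print("IQR:", iqr)
print("Outliers:", outliers.tolist())
```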

The seaborn library draws a much better-looking boxplot.

In [None]:
# better looking boxplot (using seaborn) for age variable
sns.boxplot(x=data['AGE'])

### Answering some questions

Now that we have a hold on the data being explored, let's jump into some questions to better understand the data!

**What if I wanted to analyze only the records pertaining to the district of Krishna?**

I would have to select a subset of the data to proceed. Thankfully, Pandas can help us do that too, in two steps:

1. Condition to satisfy: data['DISTRICT_NAME']=='Krishna'
2. Inserting the condition into the dataframe: data[data['DISTRICT_NAME']=='Krishna']

In [None]:
# subset involving only records of Krishna district
data[data['DISTRICT_NAME']=='Krishna'].head()
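
The two steps can be seen separately on a toy dataframe (the rows below are made up). The last line is an optional extra showing how two conditions might be combined.

```python
import pandas as pd

# Hypothetical records; only the column names mirror the dataset
toy = pd.DataFrame({
    "DISTRICT_NAME": ["Krishna", "Guntur", "Krishna", "Kurnool"],
    "AGE": [45, 30, 62, 8],
})

# Step 1: the condition yields one True/False value per row
is_krishna = toy["DISTRICT_NAME"] == "Krishna"

# Step 2: indexing with that mask keeps only the True rows
krishna_rows = toy[is_krishna]

# conditions can be combined with & (and) or | (or); each needs its own parentheses
older_krishna = toy[is_krishna & (toy["AGE"] > 50)]

print(is_krishna.tolist())
print(krishna_rows)
print(older_krishna)
```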

**Most prevalent surgery, district wise**

This can be done by iterating through all the district names and selecting the subset of data for each district.

In [None]:
# Most common surgery by district
for i in data['DISTRICT_NAME'].unique():
    print("District: {}\nDisease and Count: {}".format(i,data[data['DISTRICT_NAME']==i]['SURGERY'].value_counts().head(1)))

We can observe that only two surgeries top all the districts:
* Dialysis (7 districts)
* Longbone Fracture (6 districts)
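
As an alternative to the loop, the same question could be answered in one expression with *groupby*. This is only a sketch on made-up rows, not the approach used in the cell above. Ties between surgeries are not handled specially.

```python
import pandas as pd

# Hypothetical records; the surgery names are shortened for illustration
toy = pd.DataFrame({
    "DISTRICT_NAME": ["Krishna", "Krishna", "Krishna", "Guntur", "Guntur", "Guntur"],
    "SURGERY": ["Dialysis", "Dialysis", "Fracture", "Fracture", "Fracture", "Dialysis"],
})

# for each district, count the surgeries and keep the label with the highest count
top_surgery = toy.groupby("DISTRICT_NAME")["SURGERY"].agg(
    lambda s: s.value_counts().idxmax()
)

print(top_surgery)
```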

**Average claim amount, district wise**

In [None]:
# Average claim amount for surgery by district
for i in data['DISTRICT_NAME'].unique():
    print("District: {}\nAverage Claim Amount: ₹{}".format(i,data[data['DISTRICT_NAME']==i]['CLAIM_AMOUNT'].mean()))

The average claim amount does not seem to vary much across districts. Guntur district leads the pack with ₹31048, while Vizianagaram comes last with ₹25097.
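
A compact way to rank districts and measure how far apart their averages are is sketched below. The amounts are invented, so the numbers printed here have nothing to do with the real dataset.

```python
import pandas as pd

# Hypothetical claim amounts in rupees
toy = pd.DataFrame({
    "DISTRICT_NAME": ["Krishna", "Krishna", "Guntur", "Guntur", "Kurnool"],
    "CLAIM_AMOUNT": [20000, 30000, 40000, 20000, 10000],
})

# average claim per district, highest first
avg_claim = (
    toy.groupby("DISTRICT_NAME")["CLAIM_AMOUNT"]
    .mean()
    .sort_values(ascending=False)
)

# standard deviation of the district averages (how spread out they are)
spread = avg_claim.std()

print(avg_claim)
print("Spread of district averages: {:.2f}".format(spread))
```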

**Statistics by surgery category**

We will be using the Pandas GroupBy construct to gather statistics by grouping the data by surgery category. The Pandas groupby works similarly to the SQL GROUP BY clause.

In [None]:
# group by surgery category to get mean statistics
data.groupby('CATEGORY_NAME').mean(numeric_only=True)

Cochlear Implant Surgery seems to be the costliest category, costing ₹520000 on average. Prostheses, at ₹1200, is the cheapest. Cochlear Implant Surgery also has the lowest average patient age (1.58 years), while Neurology has an average patient age of ~56.
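
Instead of reading the extremes off the table by eye, *idxmax* and *idxmin* can pick them out. The sketch below uses invented rows and a hypothetical *HOSPITAL* text column. *numeric_only=True* keeps text columns out of the mean, which recent Pandas versions would otherwise refuse to average.

```python
import pandas as pd

# Hypothetical records; amounts and ages are made up
toy = pd.DataFrame({
    "CATEGORY_NAME": ["Neurology", "Neurology", "Cochlear Implant Surgery", "Prostheses"],
    "HOSPITAL": ["A", "B", "C", "D"],
    "AGE": [60, 52, 2, 40],
    "CLAIM_AMOUNT": [30000, 50000, 520000, 1200],
})

# mean of the numeric columns for each category
category_means = toy.groupby("CATEGORY_NAME").mean(numeric_only=True)

# index labels of the largest and smallest averages
costliest = category_means["CLAIM_AMOUNT"].idxmax()
cheapest = category_means["CLAIM_AMOUNT"].idxmin()
youngest = category_means["AGE"].idxmin()

print(category_means)
print(costliest, cheapest, youngest)
```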

**Most common surgery by age group**

To find the most common surgery by age group, let's round off the ages to the nearest ten. Make a copy of the dataframe for this operation as we would not want to tinker with the original dataframe.

In [None]:
# create a new memory copy of data to manipulate age 
dataround = data.copy()

We will use the Pandas round function to round off the Age. *-1* specifies that we round to one digit to the left of the decimal point, i.e. to the nearest ten.

In [None]:
# round the age variable to the nearest ten
dataround['AGE'] = dataround['AGE'].round(-1)
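
Here is what rounding with *-1* does to a few made-up ages. For comparison, the sketch also shows a floor-based variant (not used in this notebook) that assigns each age to the decade it falls in, so 17 lands in 10 rather than 20.

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([4, 17, 23, 36, 58])

# round(-1) moves each age to the nearest multiple of ten
nearest_ten = ages.round(-1)

# alternative: integer-divide by ten to get the decade the age belongs to
decade = (ages // 10) * 10

print(nearest_ten.tolist())
print(decade.tolist())
```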

Let's visualize the age groups using seaborn's countplot function.

In [None]:
# a frequency plot for each age group
sns.countplot(x=dataround['AGE'])

**Most common surgery per age group**

In [None]:
# Most common surgery and count per age group
for i in sorted(dataround['AGE'].unique()):
    print("Age Group: {}\nMost Common Surgery and Count: {}".format(i,dataround[dataround['AGE']==i]['CATEGORY_NAME'].value_counts().head(1)))

## Practice Exercises

Feeling adventurous? Fork this notebook and solve the following challenges to get some practice!

**Value counts of districts**

**Average claim amount for male patients**

**Most common hospital names for treatment**

**Most common age groups by district** (hint: use dataround)

**Add your own questions here!**

## The End

You've reached the end of the notebook. Congratulations!

I really hope you learnt something and enjoyed going through this notebook. If yes, please upvote and share the notebook!

Feedback? Corrections? Applause? Please comment below! 

This is my first public educative kernel. I hope my performance improves over epochs!
