**This notebook was used as a demonstration for the Machine Learning workshop held at VVCE, Mysore on Oct 9 and Oct 10, 2019.**
 
Problem: Given a set of patient parameters like **gender, BMI, number of children, smoking habits, region**, the goal is to predict **individual medical costs billed by health insurance.**


In [None]:
#Load libraries and the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('../input/insurance.csv')
df.head()

In [None]:
df.dtypes #check datatype

# Data Cleaning



Let's check whether there are any null values in the dataset.

In [None]:
df.isnull().sum(axis=0) #check null
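
To see what this check reports when values really are missing, here is a minimal sketch on a small made-up frame. The `toy` data and its values are hypothetical and are not taken from insurance.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from insurance.csv
toy = pd.DataFrame({
    "age": [19, 33, np.nan, 45],
    "bmi": [27.9, np.nan, 30.1, np.nan],
    "smoker": ["yes", "no", "no", "yes"],
})

# Count the missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# One simple way to clean up: keep only the rows with no missing values
complete_rows = toy.dropna()
print(len(complete_rows))
```

Dropping rows is only one option; whether to drop or fill missing values depends on the dataset.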

# Exploratory Data Analysis

### Crosstab 

* Let's use crosstab to compute a simple cross tabulation of gender in different regions

In [None]:
#Crosstab - Compute a simple cross tabulation of two (or more) factors. 
pd.crosstab(df["sex"],df["region"],margins=True)
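
Raw counts can be hard to compare when the groups differ in size. As a sketch, `pd.crosstab` also accepts `normalize="index"` to turn each row into proportions. The `toy` frame below is made up for illustration and is not the insurance data.

```python
import pandas as pd

# Hypothetical toy data, not taken from insurance.csv
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "region": ["southwest", "southwest", "southeast",
               "southeast", "southwest", "southeast"],
})

# Counts with row and column totals, as in the cell above
counts = pd.crosstab(toy["sex"], toy["region"], margins=True)
print(counts)

# Share of each region within each sex (every row sums to 1)
shares = pd.crosstab(toy["sex"], toy["region"], normalize="index")
print(shares)
```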

### Boolean Indexing

Boolean indexing filters the rows of a dataframe based on conditions on one or more columns. For instance, suppose we want a list of all males who smoke and are from the region 'southwest'.

In [None]:
df.loc[(df["sex"]=="male") & (df["smoker"]=="yes") & (df["region"]=="southwest"), ["sex","smoker","region"]].head(10)
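
The same kind of filter can also be written with `DataFrame.query`, which some people find easier to read. Below is a minimal sketch on a made-up `toy` frame (hypothetical values, not the insurance data) showing that both forms select the same rows.

```python
import pandas as pd

# Hypothetical toy data, not taken from insurance.csv
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "male"],
    "smoker": ["yes", "yes", "no", "yes"],
    "region": ["southwest", "southwest", "southwest", "northeast"],
})

# Boolean mask: each condition needs its own parentheses,
# because & binds more tightly than ==
mask = (toy["sex"] == "male") & (toy["smoker"] == "yes") & (toy["region"] == "southwest")
by_mask = toy.loc[mask]

# The same filter written as a query string
by_query = toy.query('sex == "male" and smoker == "yes" and region == "southwest"')

print(by_mask)
print(by_mask.equals(by_query))
```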

###  Sorting dataframes

Pandas allows easy sorting based on multiple columns. Here we sort by 'smoker' and then by 'region', both in descending order:

In [None]:
df_sorted = df.sort_values(['smoker','region'], ascending=False)
df_sorted[['smoker','region']].head(10)
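
A single `ascending=False` applies to every column in the list. If each column needs its own direction, `sort_values` also accepts a list of booleans, one per column. Here is a small sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not taken from insurance.csv
toy = pd.DataFrame({
    "smoker": ["no", "yes", "yes", "no"],
    "charges": [3000.0, 20000.0, 15000.0, 1000.0],
})

# Smokers first (descending), then cheapest charges first (ascending)
toy_sorted = toy.sort_values(["smoker", "charges"], ascending=[False, True])
print(toy_sorted)
```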

### Pivot Table

Pandas can be used to create MS Excel-style pivot tables. In this case the key column is 'charges'. As an example, we can compute the mean of 'charges' for each combination of 'sex' and 'smoker'.

In [None]:
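# Pivot: mean charges for each (sex, smoker) group
# aggfunc="mean" avoids the FutureWarning that newer pandas raises for np.mean
mean_charges = df.pivot_table(values=["charges"], index=["sex","smoker"], aggfunc="mean")
print(mean_charges)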
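
A pivot table like this is closely related to a `groupby`. As a sketch on made-up data (the `toy` frame and its values are hypothetical), the two approaches give the same group means:

```python
import pandas as pd

# Hypothetical toy data, not taken from insurance.csv
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male"],
    "smoker": ["yes", "no", "yes", "no", "yes"],
    "charges": [30000.0, 5000.0, 25000.0, 4000.0, 32000.0],
})

# Pivot table: mean charges per (sex, smoker) group
pivot = toy.pivot_table(values="charges", index=["sex", "smoker"], aggfunc="mean")

# Equivalent groupby: returns a Series instead of a DataFrame
grouped = toy.groupby(["sex", "smoker"])["charges"].mean()

print(pivot)
print(grouped)
```

The pivot table is convenient when you also want to spread a second variable across the columns; `groupby` is often handier for chaining further calculations.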

# Data Visualization





In [None]:
#box plot
df.boxplot(column="charges",by="region", figsize=(18, 8))

In [None]:
#Scatter 
df.plot.scatter(x='charges', y='bmi', figsize=(18, 8))

In [None]:
# Area Plot
df_new = df.drop(columns = 'charges') #dropping charges for the plot
df_new.plot.area(figsize=(18, 8))

In [None]:
# Kernel Density Estimation plot (KDE)
df['charges'].plot(kind='kde')

In [None]:
#Heat Map

#Correlation is only defined for numeric columns, so select those first
corr = df.select_dtypes(include='number').corr()
#Plot figsize
fig, ax = plt.subplots(figsize=(10, 8))
#Generate Heat Map, allow annotations and place floats in map
#(seaborn labels the ticks with the column names automatically)
sns.heatmap(corr, cmap='coolwarm', annot=True, fmt=".2f", ax=ax)
#show plot
plt.show()

In [None]:
#violin plot

plt.figure(figsize=(10, 8)) #violinplot has no figsize argument, so size the figure first
sns.violinplot(x = 'smoker', y = 'charges', data = df, hue = 'smoker')
plt.show()

In [None]:
#swarm plot

sns.swarmplot(x = 'region', y = 'charges', data = df, hue = 'region')
plt.show()