
### Data Set Information

The insurance.csv dataset contains 1338 observations and 7 attributes.

##### Context: 
The data contains medical costs of people characterized by certain attributes. Let’s see if we can dive deep into this data to find some valuable insights.

Attributes:-

##### age: 
age of primary beneficiary
##### sex: 
Gender of the insurance contractor (female, male)
##### bmi: 
Body mass index, an objective measure of whether body weight is high or low relative to height: weight divided by height squared (kg / m ^ 2). The ideal range is 18.5 to 24.9.
##### children: 
Number of children covered by health insurance / Number of dependents
##### smoker: 
Whether the beneficiary smokes (yes, no)
##### region: 
the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.
##### charges: 
Individual medical costs billed by health insurance.
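
The `bmi` attribute follows the usual definition: weight in kilograms divided by the square of height in metres. A minimal sketch of that formula, using a hypothetical helper `bmi` and made-up inputs that are not taken from the dataset:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Illustrative values only (assumption): 70 kg and 1.75 m
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 2))
```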

# 1. Import the necessary libraries

In [0]:
# To enable plotting graphs in Jupyter notebook
%matplotlib inline

In [0]:
# Numerical libraries
import numpy as np

# to handle data in form of rows and columns 
import pandas as pd    

# importing plotting libraries
import matplotlib.pyplot as plt   

#importing seaborn for statistical plots
import seaborn as sns

#importing scipy stats 
import scipy
import scipy.stats as st

from statsmodels.formula.api import ols      # For n-way ANOVA
from statsmodels.stats.anova import anova_lm # For n-way ANOVA (the unused private _get_covariance import is dropped)

#for 2 sample T-test
from scipy.stats import ttest_ind

# 2. Read the data as a data frame

In [3]:
# reading the CSV file into pandas dataframe
ins_df = pd.read_csv("insurance.csv")


In [0]:
# Check top few records to get a feel of the data structure
ins_df.head(10)

In [0]:
ins_df.count()

# 3. Perform basic EDA which should include the following and print out your insights at every step.

## a. Shape of the data

In [0]:
ins_df.shape

#### Observations :: All 1338 rows and 7 columns seem to match the input data

## b. Data type of each attribute

In [0]:
ins_df.dtypes

#### Observations ::  We have age, bmi, children, charges as numbers. Sex, Smoker, Region are strings / categorical

## Categorical column Analysis

In [0]:
ins_df.sex.value_counts()

#### Observations :: Almost equally distributed male and female

In [0]:
ins_df.smoker.value_counts()

#### Observations :: Smokers are a clear minority, roughly one in five beneficiaries.

In [0]:
ins_df.region.value_counts()

#### Observations :: The four regions are almost balanced, with slightly more people in the southeast.

## c. Checking the presence of missing values

In [0]:
ins_df.isnull().values.any()

In [0]:
ins_df.isna().values.any()

#### Observations :: No null / missing values were found. Cross-checked by applying a filter in Excel, which also showed no nulls.
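
`isnull().values.any()` only answers yes or no for the whole frame. A per-column count makes the same check easier to read. A minimal sketch on a hypothetical toy frame (`toy_df` is made up for illustration and is not the insurance data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; one bmi value is missing on purpose
toy_df = pd.DataFrame({
    "age": [19, 33, 45],
    "bmi": [27.9, np.nan, 30.1],
    "smoker": ["yes", "no", "no"],
})

# Number of missing values in each column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Single yes/no flag, equivalent to the check used above
has_missing = bool(toy_df.isnull().values.any())
print(has_missing)
```

On the real data, the same call would be `ins_df.isnull().sum()`.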

## d. 5 point summary of numerical attributes 

In [0]:
ins_df.describe().transpose()

#### Observations :: 
Standard deviation for age and charges seems large
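
`describe()` reports count, mean and standard deviation in addition to the five-point summary (min, Q1, median, Q3, max). If only the five points are wanted, `quantile` can pull them out directly. A small sketch on a hypothetical series (`toy_charges` is invented for illustration):

```python
import pandas as pd

# Hypothetical values with one very large entry, not taken from insurance.csv
toy_charges = pd.Series([1000.0, 2000.0, 3000.0, 4000.0, 50000.0], name="charges")

# Five-point summary: min, Q1, median, Q3, max
five_point = toy_charges.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
five_point.index = ["min", "Q1", "median", "Q3", "max"]
print(five_point)

# A mean far above the median is a quick hint of a long right tail
print(toy_charges.mean(), toy_charges.median())
```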

## e. Distribution of ‘bmi’, ‘age’ and ‘charges’ columns. 

#### bmi

In [0]:
sns.set(color_codes=True)
sns.histplot(x=ins_df['bmi'])  # histplot replaces the deprecated distplot(kde=False)

#### Observations :: Roughly normal distribution. Looks like real-life data, peaking around 30, which is the threshold for obesity

#### age

In [0]:
sns.histplot(x=ins_df['age'])  # histogram (distplot is deprecated)
sns.rugplot(x=ins_df['age'])   # rug marks, as rug=True did before

#### Observations :: This is a bit weird. Looks like we have double the number of people in their 20s compared with other age ranges

#### charges

In [0]:
sns.histplot(x=ins_df['charges'])  # histogram (distplot is deprecated)
sns.rugplot(x=ins_df['charges'])   # rug marks, as rug=True did before

#### Observations :: Charges seem to be highly concentrated in the 1000 ~ 20000 range 

## f. Measure of skewness of ‘bmi’, ‘age’ and ‘charges’ columns 

In [0]:
ins_df.skew(numeric_only=True)  # skip the string columns; recent pandas raises an error without this

#### Observations ::
All numeric columns are positively skewed: most values sit at the low end and the longer tail extends to the right.

1. Age is the least skewed and is close to symmetric
2. BMI is moderately skewed
3. Children is also highly skewed
4. Charges is highly skewed, the highest of the lot
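
The sign of the skewness coefficient says which tail is longer: positive means a long right tail, negative a long left tail, and zero a symmetric sample. A short sketch with made-up arrays (not the insurance data), using `scipy.stats.skew`:

```python
import numpy as np
from scipy.stats import skew

# Made-up samples for illustration only
right_tailed = np.array([1, 2, 2, 3, 3, 3, 4, 20])  # one large value stretches the right tail
left_tailed = -right_tailed                          # mirror image: long left tail
symmetric = np.array([1, 2, 3, 4, 5])

skew_right = skew(right_tailed)
skew_left = skew(left_tailed)
skew_sym = skew(symmetric)
print(skew_right, skew_left, skew_sym)
```

Note that `scipy.stats.skew` defaults to the biased estimator while pandas applies a bias correction, so the two can differ slightly in magnitude on the same data; the sign is the same.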


## g. Checking the presence of outliers in ‘bmi’, ‘age’ and ‘charges' columns

#### age

In [0]:
sns.boxplot(x=ins_df['age'])  # pass the column by keyword; newer seaborn treats a positional argument as data

#### Observations :: No outliers in age

#### bmi

In [0]:
sns.boxplot(x=ins_df['bmi'])

#### Observations :: A few outliers are discovered beyond 46 or so. We may need to get rid of them if we build a model
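
Box plots flag points beyond 1.5 × IQR from the quartiles by default. The same rule can be applied numerically to list the flagged values. A minimal sketch on a hypothetical array (`toy_bmi` is invented for illustration):

```python
import numpy as np

# Hypothetical BMI-like values; the last one is deliberately extreme
toy_bmi = np.array([22.0, 25.0, 27.0, 29.0, 30.0, 31.0, 33.0, 35.0, 52.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(toy_bmi, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = toy_bmi[(toy_bmi < lower_fence) | (toy_bmi > upper_fence)]
print(lower_fence, upper_fence, outliers)
```

On the real data the array would be `ins_df['bmi'].to_numpy()`; whether to drop the flagged rows is a modelling decision, not something the rule decides.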

#### charges

In [0]:
sns.boxplot(x=ins_df['charges'])

#### Observations :: Large number of outliers on the high side. A minority of beneficiaries are billed far more than the rest

## h. Distribution of categorical columns (include children)

In [0]:
sns.countplot(x=ins_df['sex'])

In [0]:
sns.countplot(x=ins_df['children'])

In [0]:
sns.countplot(x=ins_df['region'])

In [0]:
sns.countplot(x=ins_df['smoker'])

#### Observations :: Almost balanced male and female populations, well distributed across regions too. The number of people without children is very high

## i. Pair plot that includes all the columns of the data frame

In [0]:
sns.set(style="ticks")
sns.pairplot(ins_df, hue="children")

In [0]:
sns.pairplot(ins_df, hue="sex")

In [0]:
sns.pairplot(ins_df, hue="region")

In [0]:
sns.pairplot(ins_df, hue="smoker")

#### Observations :: Many interesting things here.

1. Smokers seem to be charged more
2. As age increases, charges increase
3. BMI and children do not seem to directly impact charges

# 4. Answer the following questions with statistical evidence

## a. Do charges of people who smoke differ significantly from the people who don't?

### NULL HYPOTHESIS
Smoking has no impact on insurance charges (mean charges are the same for smokers and non-smokers)

### ALT HYPOTHESIS
Smoking has an impact on insurance charges (mean charges differ between smokers and non-smokers)

In [0]:
sns.lineplot(x="age", y="charges", hue="smoker", data=ins_df);

#### Observations :: Yes. From the graph we can see that both age and smoking have an incremental effect on charges

In [0]:
#APPLY ANOVA

#Check impact on Charges vs categorical var smoker
formula_smoker = 'charges ~ C(smoker)'
model_smoker = ols(formula_smoker, ins_df).fit()
aov_table_smoker = anova_lm(model_smoker)
print(aov_table_smoker)

In [0]:
#Trying Two Sample T-Test
# Tests whether the means of two independent samples are significantly different.

#Create 2 samples: 1 smoker, 1 non-smoker (boolean masks select the matching rows)
ins_df_smoker = ins_df[ins_df.smoker == 'yes']
ins_df_nsmoker = ins_df[ins_df.smoker == 'no']

print (ins_df_smoker.describe())
print (ins_df_nsmoker.describe())

In [0]:
stat, pvalue = ttest_ind(ins_df_smoker['charges'] , ins_df_nsmoker['charges'])
print("Compare means", ins_df_smoker['charges'].mean() , ins_df_nsmoker['charges'].mean())
print("Tstatistic for charges, Pvalue", stat, pvalue)

if pvalue > 0.05:
	print('Samples are likely drawn from the same distributions (fail to reject H0)')
else:
	print('Samples are likely drawn from different distributions (reject H0)')

# Conclusion 4a :: Since the p-value (PR) is far below 0.05, the null hypothesis is rejected. The very large F value in the ANOVA points the same way. Hence the data support the alternative hypothesis: smoking has an impact on insurance charges
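
`ttest_ind` pools the two group variances by default (`equal_var=True`). When the groups differ a lot in size and spread, as the `describe()` output for smokers and non-smokers may show, Welch's variant (`equal_var=False`) is a common, more cautious alternative. A sketch on synthetic data; the group sizes, means and standard deviations below are arbitrary assumptions, not estimates from insurance.csv:

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups with unequal sizes and unequal spreads (arbitrary parameters)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=30.0, scale=10.0, size=200)
group_b = rng.normal(loc=10.0, scale=5.0, size=800)

# Student's t-test: assumes equal variances (SciPy's default)
student = ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
welch = ttest_ind(group_a, group_b, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

With the notebook's own samples, the Welch call would be `ttest_ind(ins_df_smoker['charges'], ins_df_nsmoker['charges'], equal_var=False)`.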

## b. Does bmi of males differ significantly from that of females?

### NULL HYPOTHESIS
There is no difference in mean BMI between males and females.

### ALT HYPOTHESIS
There is a difference in mean BMI between males and females.

In [0]:
sns.catplot(x='sex', y='bmi', data=ins_df, kind='swarm');

#### Observations :: The graph does not suggest a difference. There are a few very high values for males.

In [0]:
#APPLY ANOVA

#Check impact on BMI vs categorical var sex
formula_bmi_sex = 'bmi ~ C(sex)'
model_bmi_sex = ols(formula_bmi_sex, ins_df).fit()
aov_table_bmi_sex = anova_lm(model_bmi_sex)
print(aov_table_bmi_sex)

In [0]:
#Trying Two Sample T-Test
# Tests whether the means of two independent samples are significantly different.

#Create 2 samples: 1 male, 1 female (boolean masks select the matching rows)
ins_df_m = ins_df[ins_df.sex == 'male']
ins_df_f = ins_df[ins_df.sex == 'female']

stat, pvalue = ttest_ind(ins_df_m['bmi'] , ins_df_f['bmi'])
print("Compare means", ins_df_m['bmi'].mean() , ins_df_f['bmi'].mean())
print("Tstatistic for bmi, Pvalue", stat, pvalue)

if pvalue > 0.05:
	print('Samples are likely drawn from the same distributions (fail to reject H0)')
else:
	print('Samples are likely drawn from different distributions (reject H0)')

# Conclusion 4b :: Since the p-value (PR) > 0.05, the null hypothesis cannot be rejected. Hence there is no statistically significant evidence that mean BMI differs by gender.
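
With exactly two groups, one-way ANOVA and the pooled-variance t-test are the same test: the F statistic equals the square of the t statistic and the p-values coincide. That is why the two approaches used above agree. A quick check on synthetic data (the samples below are random draws with arbitrary parameters, not the BMI columns):

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Two synthetic groups (arbitrary parameters, for illustration only)
rng = np.random.default_rng(1)
sample_a = rng.normal(30.0, 6.0, 50)
sample_b = rng.normal(30.5, 6.0, 60)

# Pooled-variance two-sample t-test and one-way ANOVA on the same data
t_stat, t_pvalue = ttest_ind(sample_a, sample_b)
f_stat, f_pvalue = f_oneway(sample_a, sample_b)

print("t^2 =", t_stat ** 2, " F =", f_stat)
print("p (t-test) =", t_pvalue, " p (ANOVA) =", f_pvalue)
```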

## c. Is the proportion of smokers significantly different in different genders?

### NULL HYPOTHESIS
There is no variation in proportion of smokers across genders.
H0 : Proportion of women smoking = Proportion of men smoking
### ALT HYPOTHESIS
There is significant variation in proportion of smokers across genders.
H1 : Proportion of women smoking <> Proportion of men smoking

In [0]:
sns.countplot(x="sex", hue="smoker", data=ins_df);

#### Observations :: The proportion of smokers is higher among males, but by eye it does not look significantly higher.

In [0]:
# H0 :: female_smoker_proportion = male_smoker_proportion (hence two-tailed), significance level alpha = 0.05
# H1 :: female_smoker_proportion != male_smoker_proportion

# REF :: https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportions_ztest.html

# we already have ins_df_m and ins_df_f - male and female subsets
# We will separate the smokers

ins_df_f_smoker = ins_df_f[ins_df_f.smoker == 'yes']
ins_df_m_smoker = ins_df_m[ins_df_m.smoker == 'yes']

#smoker_proportion_array = np.array([[ins_df_m.size,ins_df_m_smoker.size], [ins_df_f.size, ins_df_f_smoker.size]])
#summary = st.chi2_contingency (smoker_proportion_array)
#print("chi square summray for proportions of smokers across gender", summary)

from statsmodels.stats.proportion import proportions_ztest
count = np.array([ins_df_m_smoker.shape[0], ins_df_f_smoker.shape[0]])
print (count)
nobs = np.array([ins_df_m.shape[0], ins_df_f.shape[0]])
print (nobs)
stat, pvalue = proportions_ztest(count , nobs, alternative='two-sided')
#print("Compare means", ins_df_f_smoker.mean() , ins_df_f['bmi'].mean())
print("Z statistic for proportions of smokers across gender, Pvalue", stat, pvalue)

if pvalue > 0.05:
	print('Samples are likely drawn from the same distributions (fail to reject H0)')
else:
	print('Samples are likely drawn from different distributions (reject H0)')

# Conclusion 4c :: Since the p-value < 0.05, the null hypothesis can be rejected. Hence there is a significant difference in the proportion of smokers across genders.
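
A chi-square test on the 2×2 gender-by-smoker table is an equivalent way to compare two proportions: without continuity correction, its statistic is the square of the pooled two-proportion z statistic. A sketch using SciPy only, with hypothetical counts (the table below is invented, not the counts from insurance.csv):

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Hypothetical counts: rows = two genders, columns = [smokers, non-smokers]
table = np.array([[40, 160],
                  [25, 175]])

# Chi-square test of independence, continuity correction switched off
chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)

# Pooled two-proportion z-test computed by hand
smokers = table[:, 0]
totals = table.sum(axis=1)
p_pooled = smokers.sum() / totals.sum()
std_err = np.sqrt(p_pooled * (1 - p_pooled) * (1 / totals[0] + 1 / totals[1]))
z = (smokers[0] / totals[0] - smokers[1] / totals[1]) / std_err
p_z = 2 * norm.sf(abs(z))  # two-sided p-value

print("z^2 =", z ** 2, " chi2 =", chi2)
print("p (z-test) =", p_z, " p (chi-square) =", p_chi2)
```

On the real data, `pd.crosstab(ins_df['sex'], ins_df['smoker'])` would produce the contingency table to pass in.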

## d. Is the distribution of bmi across women with no children, one child and two children, the same ?

In [0]:
#we already have female set from above
ins_df_f.head()

In [0]:
sns.catplot(x='children', y='bmi', data=ins_df_f, kind='swarm');

In [0]:
sns.lineplot(x='children', y='bmi', hue='smoker', data=ins_df_f)

#### Observations :: Judging by eye alone, there does not seem to be any significant difference in BMI across females with 0, 1 or 2 children.

### NULL HYPOTHESIS
The number of children (none, one or two) has no impact on the distribution of BMI among women. That is, mean BMI is the same for women with 0, 1 and 2 children.

### ALT HYPOTHESIS
The number of children (none, one or two) has an impact on the distribution of BMI among women. That is, mean BMI differs for at least one of the groups of women with 0, 1 and 2 children.

In [0]:
#extract population of females with 0, 1 or 2 children.

ins_df_f_012Children = ins_df_f[ins_df_f.children <= 2]
ins_df_f_012Children.head(10)

In [0]:
#box plot of bmi for each number of children
sns.boxplot(x=ins_df_f_012Children['children'], y = ins_df_f_012Children['bmi'])

In [0]:
#APPLY ANOVA

#Check impact on BMI vs categorical var children
formula = 'bmi ~ C(children)'
model = ols(formula, ins_df_f_012Children).fit()
aov_table = anova_lm(model)
print(aov_table)

# Conclusion 4d :: Since the p-value (PR) > 0.05, the null hypothesis cannot be rejected. The F value is also very small. Hence there is no evidence that the number of children affects mean BMI.
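
One-way ANOVA compares means and assumes similar variances across groups. Two optional companions are Levene's test for equal variances and the Kruskal-Wallis test, a rank-based alternative that does not assume normality. A sketch on synthetic groups (random draws with arbitrary parameters standing in for the 0, 1 and 2 children groups):

```python
import numpy as np
from scipy.stats import kruskal, levene

# Three synthetic groups drawn from the same distribution (arbitrary parameters)
rng = np.random.default_rng(2)
group_0 = rng.normal(30.0, 6.0, 80)
group_1 = rng.normal(30.0, 6.0, 70)
group_2 = rng.normal(30.0, 6.0, 60)

# Levene's test: H0 = all groups have equal variance
lev_stat, lev_pvalue = levene(group_0, group_1, group_2)

# Kruskal-Wallis: H0 = all groups come from the same distribution
kw_stat, kw_pvalue = kruskal(group_0, group_1, group_2)

print("Levene:", lev_stat, lev_pvalue)
print("Kruskal-Wallis:", kw_stat, kw_pvalue)
```

On the notebook's data the groups would be the `bmi` values of `ins_df_f_012Children` split by `children`.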