## Statistical Analysis

## Attribute Information:

age: age of primary beneficiary

sex: insurance contractor gender, female, male

bmi: body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg / m^2); the ideal range is 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes (yes, no)

region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance.
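
As a small illustration of the bmi attribute described above, here is a minimal sketch of the kg / m^2 formula. The function name `bmi` and the example weight and height are made up for illustration and do not come from the dataset.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m tall
value = bmi(70, 1.75)
in_ideal_range = 18.5 <= value <= 24.9  # ideal range quoted in the attribute list

print(f"bmi = {value:.2f}, within ideal range: {in_ideal_range}")
```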

## Objective:
We want to explore this data in depth and test a few hypotheses about how charges, bmi, and smoking relate to the other attributes.

In [None]:
#importing the libraries that we use
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pp
from scipy import stats
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
sns.set(color_codes=True) # adds a nice background to the graphs
%matplotlib inline

In [None]:
#reading data from csv file as dataframe
df = pd.read_csv("/kaggle/input/insurance/insurance.csv")

In [None]:
#Checking the first 5 records of loaded data 
df.head()

## Shape of the dataset

In [None]:
df.shape #shape of the data

In [None]:
df.info() #Get the total info of the dataframe

In [None]:
df["children"].value_counts() # Categorical variable

### Datatypes of each column

In [None]:
#Datatypes of each variable
df.dtypes

In [None]:
df_num = df.loc[:,["age","bmi","charges"]]
df_cat = df.loc[:,["sex","children","smoker","region"]]
df_num.head()

In [None]:
df_cat.head()

## Checking the missing values in all the Columns

In [None]:
#Checking the presences of missing values
df.isnull().sum()

### Describe the five point summary

In [None]:
#5 point summary of numerical attributes
df_num.describe()

## Distribution of columns bmi, age, charges

In [None]:
#Distribution of 'bmi' column: histogram with KDE, violin plot, cumulative histogram.
#histplot replaces the deprecated distplot.
sns.histplot(df["bmi"], kde=True, stat="density", color="green")
plt.show()
sns.violinplot(x=df["bmi"], color="orange")
plt.show()
sns.histplot(df["bmi"], cumulative=True, stat="density", color="cyan")
plt.show()

In [None]:
#Distribution of 'age' column: histogram with KDE, violin plot, cumulative histogram.
#histplot replaces the deprecated distplot.
sns.histplot(df["age"], kde=True, stat="density", color="green")
plt.show()
sns.violinplot(x=df["age"], color="orange")
plt.show()
sns.histplot(df["age"], cumulative=True, stat="density", color="cyan")
plt.show()

In [None]:
#Distribution of 'charges' column: histogram with KDE, violin plot, cumulative histogram.
#histplot replaces the deprecated distplot.
sns.histplot(df["charges"], kde=True, stat="density", color="green")
plt.show()
sns.violinplot(x=df["charges"], color="orange")
plt.show()
sns.histplot(df["charges"], cumulative=True, stat="density", color="cyan")
plt.show()

## Checking the Skewness

**Positively skewed:** Most frequent values are low and tail is towards high values.

**Negatively skewed:** Most frequent values are high and tail is towards low values.

If **Mode < Median < Mean**, the distribution is typically positively skewed.

If **Mode > Median > Mean**, the distribution is typically negatively skewed.
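
To make the rule of thumb above concrete, here is a minimal sketch on synthetic data (an exponential sample, not the insurance data). It computes skewness as the third standardized moment without bias correction; pandas `skew()` applies a bias correction, so its values can differ slightly on small samples.

```python
import numpy as np

rng = np.random.default_rng(0)                    # fixed seed so the sketch is reproducible
sample = rng.exponential(scale=1.0, size=10_000)  # synthetic right-skewed data

mean = sample.mean()
median = np.median(sample)

# Skewness as the third standardized moment (no bias correction)
skewness = np.mean((sample - mean) ** 3) / sample.std() ** 3

print(f"mean={mean:.3f}, median={median:.3f}, skewness={skewness:.3f}")
```

For a right-skewed sample like this one, the mean sits above the median and the skewness is positive.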

In [None]:
df_num.skew() #measures the skewness of every numerical attribute

## Checking the presence of outliers

### Boxplots

This kind of plot shows the three quartile values of the distribution along with extreme values. The "whiskers" extend to the most extreme points that lie within 1.5 IQRs of the lower and upper quartiles; observations that fall outside this range are displayed individually as outliers.
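
The 1.5 IQR rule described above can be sketched directly. The helper name `iqr_outliers` and the toy values are assumptions for illustration; they are not part of the dataset or of seaborn.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the points outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return values[mask], (lower, upper)

# Toy bmi-like values with one unusually large observation
toy = [18, 22, 24, 25, 26, 27, 28, 30, 31, 53]
outliers, (lower, upper) = iqr_outliers(toy)

print(f"fences: [{lower}, {upper}], outliers: {outliers}")
```

Applied to a column such as `df["bmi"]`, the same helper would list the points that the boxplot draws beyond the whiskers.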

In [None]:
sns.boxplot(x=df["bmi"]);
plt.show()

The ***bmi*** plot above shows the presence of outliers.

In [None]:
sns.boxplot(x=df["age"]);
plt.show()

The ***age*** plot above shows no outliers.

In [None]:
sns.boxplot(x=df["charges"]);
plt.show()

The ***charges*** plot above shows the presence of outliers.

# Distribution of categorical columns (including children)

In [None]:
df_cat.columns

### Count plot of Sex column

In [None]:
sns.countplot(x=df_cat["sex"]);

### Count plot of Children column

In [None]:
sns.countplot(x=df_cat["children"]);
plt.show()

### Count plot of Smoker column

In [None]:
sns.countplot(x=df_cat["smoker"]);

### Count plot of region column

In [None]:
sns.countplot(x=df_cat["region"]);

In [None]:
# Let us analyze categorical variables "smoker" and "sex" (as hue) with respect to "charges" as continuous variable
sns.catplot(x="smoker", y="charges", hue="sex", kind="box", data=df);

In [None]:
# Let us analyze categorical variables "region" and "smoker" (as hue) with respect to "charges" as continuous variable
sns.catplot(x="region", y="charges", hue="smoker", kind="box", data=df);

In [None]:
# Let us analyze categorical variables "region" and "sex" (as hue) with respect to charges as continuous variable
sns.catplot(x="region", y="charges", hue="sex", kind="box", data=df);

## Pair plot of all the columns

In [None]:
sns.pairplot(df);

In [None]:
# Pairplot with smoker as hue
sns.pairplot(data=df, hue='smoker');

In [None]:
# Pairplot with sex as hue
sns.pairplot(data=df, hue='sex');

In [None]:
# Pairplot with region as hue
sns.pairplot(data=df, hue='region');

## Profile report of the complete dataframe

In [None]:
pp.ProfileReport(df)

### Do charges of people who smoke differ significantly from the people who don't?


In [None]:
df_cat["smoker"].value_counts()

In [None]:
ppl_smoke_chrge = df[df["smoker"]=="yes"]["charges"]

In [None]:
ppl_nosmoke_chrge = df[df["smoker"]=="no"]["charges"]

In [None]:
print(f"Mean of smoker charges: {np.mean(ppl_smoke_chrge)}")
print(f"Mean of non-smoker charges: {np.mean(ppl_nosmoke_chrge)}")

In [None]:
print(f"Std of smoker charges {np.std(ppl_smoke_chrge)}")
print(f"Std of non-smoker charges {np.std(ppl_nosmoke_chrge)}")

###  Null and alternative hypothesis


* $H_0$: $\mu_{SMC} - \mu_{NSMC} = 0$
* $H_A$: $\mu_{SMC} - \mu_{NSMC} \neq 0$

### Significance Level 

Here we select $\alpha$ = 0.05

**Since the population standard deviation is not known, we carry out an independent two-sample t-test.**

In [None]:
t_statistic,pval = ttest_ind(ppl_smoke_chrge,ppl_nosmoke_chrge)

In [None]:
print('P Value ',pval)    

In [None]:
if pval < 0.05:
    print("Since the p-value is below the significance level, we reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
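
By default `ttest_ind` pools the variance of both groups. When the two groups have clearly different spreads and sizes, Welch's variant (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic samples; the group sizes, means, and standard deviations are made up for illustration and are not taken from the insurance data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Synthetic groups with unequal sizes and unequal variances (assumed values)
group_a = rng.normal(loc=30.0, scale=12.0, size=200)
group_b = rng.normal(loc=10.0, scale=6.0, size=800)

student = ttest_ind(group_a, group_b)                 # pooled-variance t-test
welch = ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

print(f"Student: t={student.statistic:.2f}, p={student.pvalue:.3g}")
print(f"Welch:   t={welch.statistic:.2f}, p={welch.pvalue:.3g}")
```

On the notebook's data the same call would be `ttest_ind(ppl_smoke_chrge, ppl_nosmoke_chrge, equal_var=False)`.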

##### Conclusion

**Yes, the charges of people who smoke differ significantly from those of people who don't.**


## Does bmi of males differ significantly from that of females?

In [None]:
df_cat["sex"].value_counts()

In [None]:
bmi_male = df[df["sex"]=="male"]["bmi"]
bmi_female = df[df["sex"]=="female"]["bmi"]

In [None]:
print(f"Mean of bmi male: {np.mean(bmi_male)}")
print(f"Mean of bmi female: {np.mean(bmi_female)}")
print(f"Std of bmi male: {np.std(bmi_male)}")
print(f"Std of bmi female: {np.std(bmi_female)}")

###  Null and alternative hypothesis


* $H_0$: $\mu_{BmiMale} - \mu_{BmiFemale} = 0$
* $H_A$: $\mu_{BmiMale} - \mu_{BmiFemale} \neq 0$

### Significance Level 
Here we select $\alpha$ = 0.05

In [None]:
t_statistic,pval = ttest_ind(bmi_male,bmi_female)

In [None]:
print('P Value ',pval)

In [None]:
if pval < 0.05:
    print("Since the p-value is below the significance level, we reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")

##### Conclusion

**No, the bmi of males does not differ significantly from that of females.**

**We fail to reject the hypothesis that the mean bmi of males and females is equal.**


### Is the proportion of smokers significantly different in different genders?

#### Ho = The proportions are equal
#### Ha = The two proportions are not equal

### Significance Level
Here we select $\alpha$= 0.05

In [None]:
# Index value_counts() by label rather than by position, so the result does not depend on sort order
female_smokers = df[df['sex'] == 'female'].smoker.value_counts()['yes']  # number of female smokers
male_smokers = df[df['sex'] == 'male'].smoker.value_counts()['yes']  # number of male smokers
n_females = df.sex.value_counts()['female']  # number of females in the data
n_males = df.sex.value_counts()['male']  # number of males in the data

In [None]:
print([female_smokers, male_smokers] , [n_females, n_males])
print(f'Proportion of smokers in females, males = {female_smokers / n_females:.2%}, {male_smokers / n_males:.2%} respectively')

The proportions are different, but is the difference statistically significant?

In [None]:
stat, pval = proportions_ztest([female_smokers, male_smokers] , [n_females, n_males])

if pval < 0.05:
    print(f'With a p-value of {round(pval,4)} the difference is significant. aka |We reject the null|')
else:
    print(f'With a p-value of {round(pval,4)} the difference is not significant. aka |We fail to reject the null|')
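
The z-test can also be worked out by hand as a check. The sketch below assumes the pooled-variance form of the two-proportion z-test and uses the counts printed earlier in this section (115 of 662 females and 159 of 676 males are smokers). The helper name `two_proportion_ztest` is hypothetical.

```python
import math

def two_proportion_ztest(successes, totals):
    """Two-sided z-test for equal proportions, using the pooled estimate."""
    (x1, x2), (n1, n2) = successes, totals
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = two_proportion_ztest([115, 159], [662, 676])
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```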

#### Conclusion

***Yes, the proportion of smokers differs significantly between genders.***


### Is the distribution of bmi across women with no children, one child and two children, the same?

In [None]:
df['sex'].value_counts()

In [None]:
bmi_w_noc = df[(df['sex']=='female') & (df['children']==0)].bmi
bmi_w_1c = df[(df['sex']=='female') & (df['children']==1)].bmi
bmi_w_2c = df[(df['sex']=='female') & (df['children']==2)].bmi

In [None]:
sns.histplot(bmi_w_noc, kde=True, stat="density");
plt.title("Distribution of BMI of women with no children")
plt.show()

In [None]:
sns.histplot(bmi_w_1c, kde=True, stat="density");
plt.title("Distribution of BMI of women with one child")
plt.show()

In [None]:
sns.histplot(bmi_w_2c, kde=True, stat="density");
plt.title("Distribution of BMI of women with two children")
plt.show()

In [None]:
df1 = pd.DataFrame({'women': 'NoChild', 'bmi':bmi_w_noc})
df2 = pd.DataFrame({'women': 'OneChild', 'bmi':bmi_w_1c})
df3 = pd.DataFrame({'women': 'TwoChild', 'bmi':bmi_w_2c})

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three groups the same way
bmi_df = pd.concat([df1, df2, df3])

In [None]:
sns.boxplot(x = "women", y = "bmi", data = bmi_df)
plt.title('BMI of women with children')
plt.show()

#### The boxplot shows nearly the same median for each group, with a slight difference for women with two children. Let's check it statistically.

* $H_0$: $\mu_1 = \mu_2 = \mu_3$
* $H_A$: At least one $\mu$ differs


Here $\mu_1$, $\mu_2$ and $\mu_3$ are the mean bmi of women with no children, one child and two children respectively.

#### Significance level 
$\alpha$ = 0.05 

Here we have three groups. Analysis of variance can determine whether the means of three or more groups are different. ANOVA uses F-tests to statistically test the equality of means.
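
To show what the F-test computes, here is a minimal sketch on three synthetic groups (made-up sizes, means, and spread, not the insurance data). It builds the F statistic from the between-group and within-group sums of squares and compares it with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Three synthetic groups drawn from the same distribution (assumed values)
groups = [rng.normal(loc=30.0, scale=6.0, size=n) for n in (120, 80, 60)]

grand_mean = np.concatenate(groups).mean()
k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_value = f_oneway(*groups)

print(f"F (manual) = {f_manual:.4f}, F (scipy) = {f_scipy:.4f}, p = {p_value:.4f}")
```

A large F means the group means vary more than the within-group scatter would explain; the p-value is the upper tail of the F distribution with (k - 1, n_total - k) degrees of freedom.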

### Calculate the p-value using ANOVA

In [None]:
mod = ols('bmi ~ women', data = bmi_df).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

##### The p-value of 0.7158 is greater than 0.05, so we are unable to reject the null hypothesis.
***Hence there is no evidence that the mean bmi differs across women with no children, one child and two children.***

In [None]:
print(pairwise_tukeyhsd(bmi_df['bmi'], bmi_df['women']))

***The reject column is False for every pair, so no pairwise difference in mean bmi across women with 0, 1 or 2 children is significant.***

### END