# Insurance - EDA & Hypothesis Testing

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


# Overview of the Notebook:

EDA

- Loading and inspecting the Dataset
- Checking the shape of the dataset and meaningful column names
- Validating Duplicate Records, Checking Missing values
- Unique values (counts & names) for each Feature
- Data & Datatype validation

Hypothesis testing: each hypothesis is tested by measuring and examining a random sample of the population under study. The hypotheses are as follows -
- Prove (or disprove) that the medical charges of people who smoke are greater than those of people who don't.
- Prove (or disprove) with statistical evidence that the bmi of females is different from that of males.
- Is the proportion of smokers significantly different across regions?
- Is the mean bmi of women with no children, 1 child and 2 children the same? Explain the answer with statistical evidence.

### Column Profiling:
- Age: This is an integer indicating the age of the primary beneficiary (excluding those above 64 years, since they are generally covered by the government).
- Sex: This is the policy holder's gender, either male or female
- bmi: The body mass index of the person.
- children: No. of children each person has.
- Smoker: This is yes or no depending on whether the insured regularly smokes tobacco.
- Region: This is the beneficiary's place of residence in Delhi, divided into four geographic regions - northeast, southeast, southwest, or northwest
- charges: Individual charges for health insurance

# Exploratory data analysis:

#### Importing required packages:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import nan  # only the lowercase name is kept; the np.NaN alias was removed in NumPy 2.0
from scipy import stats
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")

#### Loading data into Dataframe:

In [None]:
insurance_data = pd.read_csv("../input/insurance/insurance.csv")

In [None]:
insurance_data.head()

#### Identification of variables and data types:

In [None]:
insurance_data.shape

The insurance dataset has 1338 rows and 7 columns.

In [None]:
insurance_data.info()

__Summary__

- The data has the following dtypes: `object`, `int64`, `float64`

In [None]:
insurance_data['children']

In [None]:
# 'children' is an integer count of children, but it takes only a few distinct values.
# We therefore treat it as a categorical feature and convert it to the category datatype.

insurance_data['children'] = insurance_data['children'].astype('category')

# Also, converting other columns with object datatypes to category in order to reduce memory usage.

characteristics_catg = ['sex','smoker','region']
for i in characteristics_catg:
    insurance_data[i] = insurance_data[i].astype("category")
insurance_data.info()
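
The memory saving from the `category` conversion can be illustrated on a small synthetic column. This is a minimal sketch, not part of the original analysis: the `regions` series is made up for the example and only stands in for a low-cardinality text column such as `region`.

```python
import pandas as pd

# Hypothetical low-cardinality text column: four labels repeated 250 times each.
regions = pd.Series(["northeast", "southeast", "southwest", "northwest"] * 250)

# deep=True also counts the memory held by the Python string objects.
bytes_as_strings = regions.memory_usage(deep=True)
bytes_as_category = regions.astype("category").memory_usage(deep=True)

print(f"stored as strings : {bytes_as_strings} bytes")
print(f"stored as category: {bytes_as_category} bytes")
```

A category column stores each distinct label once plus a small integer code per row, which is why the saving grows with the number of repeated values.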

In [None]:
col = ['age','sex','smoker','region','children','bmi','charges']

print(f"Columns with category datatypes (Categorical Features) are : \
{list(insurance_data.select_dtypes('category').columns)}")
print(f"Columns with integer and float datatypes (Numerical Features) are: \
{list(insurance_data.select_dtypes(['int64','float64']).columns)}")

#### Analysing the basic metrics:

In [None]:
insurance_data.describe(include = np.number )

In [None]:
insurance_data.describe(include = 'category' )

__Summary__:
- The mean and median age are nearly the same, whereas the mean and median charges differ.
- A median close to the mean only indicates that the distribution is not heavily skewed. This is just one of the properties we look for before treating the data as normal, which is an assumption of the parametric hypothesis tests used later.
- Of the four regions, the southeast region is the most frequent (364). The majority of persons are male (676), non-smokers are more numerous (1064 of 1338), and the most common number of children is 0 (574).
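
The link between skewness and the gap between mean and median can be sketched on synthetic data. The two arrays below are generated for illustration only (they are not the `age` and `charges` columns), and the scaled gap `(mean - median) / std` is just one simple way to express the difference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a roughly symmetric variable and a right-skewed one.
symmetric = rng.normal(loc=39, scale=14, size=1000)
right_skewed = rng.lognormal(mean=9, sigma=1, size=1000)

gaps = {}
for name, values in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    # Mean minus median, scaled by the standard deviation for comparability.
    gaps[name] = (np.mean(values) - np.median(values)) / np.std(values)
    print(f"{name}: scaled mean-median gap = {gaps[name]:.3f}")
```

A gap near zero is consistent with a symmetric distribution, while a clearly positive gap points to a long right tail.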

In [None]:
# Checking missing values

total_null = insurance_data.isnull().sum().sort_values(ascending = False)
percent = ((insurance_data.isnull().sum()/insurance_data.isnull().count())*100).sort_values(ascending = False)
print(f"Total records in (insurance_data) = {insurance_data.shape[0]} where missing values are as follows:")
missing_data = pd.concat([total_null,percent.round(2)],axis=1,keys=['Total Missing','In Percent'])
missing_data.head(10)

__Summary__:
- No missing values are present in the dataset

# Univariate Analysis:

In [None]:
# Univariate analysis for numerical/continuous variables

def num_feat(col_data):
    fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(12,5))
    sns.histplot(x = col_data, kde=True, ax=ax[0], color = 'purple')
    # Label the reference lines directly so the legend entries match the right artists
    ax[0].axvline(col_data.mean(), color='r', linestyle='--',linewidth=2, label='Mean')
    ax[0].axvline(col_data.median(), color='k', linestyle='dashed', linewidth=2, label='Median')
    ax[0].legend()
    sns.boxplot(x=col_data, showmeans=True, ax=ax[1])
    plt.xticks(rotation = 30)
    plt.tight_layout()
    plt.show()

In [None]:
numerical_cols = ['age', 'charges','bmi']

In [None]:
for i in numerical_cols:
    num_feat(insurance_data[i])

In [None]:
# EDA on univariate categorical variables

def cat_feat(col_data):
    fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(12,5))
    fig.suptitle(col_data.name+' distribution',fontsize=15)
    # Pass the series explicitly as x; in recent seaborn versions the first positional argument is `data`
    sns.countplot(x=col_data,ax=ax[0])
    col_data.value_counts().plot.pie(autopct='%1.1f%%',ax=ax[1], shadow = True)
    plt.tight_layout()

In [None]:
categorical_cols = ['sex', 'smoker', 'region', 'children']

In [None]:
for i in categorical_cols:
    cat_feat(insurance_data[i])

__Summary__:
- The percentages of males and females in the dataset are nearly the same.
- Although smokers are fewer in number than non-smokers, the charges they incur are higher, as the boxplots of charges that follow show.
- Persons come from all four regions in nearly equal proportion, with slightly more from the southeast region.
- Persons with 0 children are more numerous than persons with more children.

__Recommendations__:
- The counts above describe only the composition of the dataset. Recommendations on charges are better based on the comparison of charges across these categories, which follows in the boxplots.

In [None]:
# Checking each categorical feature against charges to detect if any outliers are present

plt.figure(figsize=(15,10))
for i,j in enumerate(categorical_cols):
    plt.subplot(2, 2, i+1)
    plt.subplots_adjust(hspace = 0.8)
    sns.boxplot(x = j, y = 'charges', data = insurance_data)
    plt.tight_layout(pad = 2)

__Summary__:
- The median charges for males and females, and across regions, look similar, so charges do not appear to depend on sex or region. This is a visual observation, not a statistical test.
- In contrast, charges differ markedly between smokers and non-smokers. Smokers incur much higher charges, so the insurer and the corporates it has tie-ups with should pay particular attention to this group.

__Recommendations__:
- Charges also vary with the number of children. Persons with more children appear to incur higher charges, although persons with no children also show high charges. The insurer may want to take the number of children into account when planning coverage.

# Outliers Detection and Removal:

In [None]:
# Creating a copy of our original data and will use this copy for further processing.

insurance_data_new = insurance_data.copy()

In [None]:
numerical_cols = ['age', 'charges','bmi']

In [None]:

q1 = insurance_data_new[numerical_cols].quantile(0.25)
q3 = insurance_data_new[numerical_cols].quantile(0.75)
iqr = q3 -q1

insurance_data_new = insurance_data_new[~((insurance_data_new[numerical_cols]<q1-1.5*iqr) | (insurance_data_new[numerical_cols]>q3+1.5*iqr)).any(axis = 1)]
insurance_data_new = insurance_data_new.reset_index(drop = True)

In [None]:
# Number of rows removed: original row count minus cleaned row count
insurance_data.shape[0] - insurance_data_new.shape[0]

A total of 145 rows are deleted by the outlier removal
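
The IQR rule used for the outlier removal can be traced on a handful of numbers. This is a minimal sketch on a made-up series (`values` is hypothetical); it applies the same fences, `q1 - 1.5 * iqr` and `q3 + 1.5 * iqr`, to a single column.

```python
import pandas as pd

# Made-up values; 95 is an obvious outlier.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the points that lie inside the fences (inclusive).
kept = values[(values >= lower) & (values <= upper)]

print(f"fences: [{lower}, {upper}]")
print(f"rows removed: {len(values) - len(kept)}")
```

In the notebook the same condition is evaluated per column and a row is dropped if it falls outside the fences in any of the numerical columns.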

In [None]:
# After outlier removal, Checking each categorical feature against charges.

plt.figure(figsize=(15,10))
for i,j in enumerate(categorical_cols):
    plt.subplot(2, 2, i+1)
    plt.subplots_adjust(hspace = 0.8)
    sns.boxplot(x = j, y = 'charges', data = insurance_data_new)
    plt.tight_layout(pad = 2)

In [None]:
# After outlier removal, Checking each categorical feature against bmi.

plt.figure(figsize=(15,10))
for i,j in enumerate(categorical_cols):
    plt.subplot(2, 2, i+1)
    plt.subplots_adjust(hspace = 0.8)
    sns.boxplot(x = j, y = 'bmi', data = insurance_data_new)
    plt.tight_layout(pad = 2)

In [None]:
plt.figure(figsize=(15,10))
for i,j in enumerate(categorical_cols):
    plt.subplot(2, 2, i+1)
    plt.subplots_adjust(hspace = 0.8)
    sns.boxplot(x = j, y = 'age', data = insurance_data_new)
    plt.tight_layout(pad = 2)

__Summary__:
- As we can see, hardly any outliers remain after the outlier treatment using the IQR method. We can now proceed with **Bivariate Analysis** and **Statistical Analysis** using this cleaned data.
- For finding insights and for EDA purposes, we will use our original dataset, as the cleaned data introduces some bias because 145 rows have been deleted.

# Bivariate Analysis:

In [None]:
insurance_data_new.info()

In [None]:
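# Correlation between numerical variables

plt.figure(figsize = (10, 5))
# Restrict to the numerical columns; category columns are not valid input for a correlation matrix
sns.heatmap(insurance_data_new[numerical_cols].corr(),annot = True)
plt.yticks(rotation = 0)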
plt.show()

__Summary__:
- As we can see, there is a noticeable positive correlation between age and charges. Hence we look more closely at age and charges in the bivariate analysis.

In [None]:
insurance_data['age'].unique(),insurance_data['age'].nunique()

__Summary__:
- There are 47 distinct ages in the dataset.
- It is useful to group these ages into age groups to get clearer insights.


### Analysis w.r.t 'Age' feature

In [None]:
# Creating 7 bins for age groups of persons.

bins = [0,17,27,37,47,57,67,100]
labels = ['0-17','17-27','27-37','37-47','47-57','57-67','67-100']
insurance_data['age groups'] = pd.cut(x = insurance_data['age'], bins = bins, labels = labels)
insurance_data.head()
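
How `pd.cut` assigns boundary ages to these bins can be checked on a few values. The sketch below reuses the same `bins` and `labels`; the `ages` series is made up for the example. By default `pd.cut` uses right-closed intervals, so an age equal to a bin edge falls into the lower group.

```python
import pandas as pd

bins = [0, 17, 27, 37, 47, 57, 67, 100]
labels = ['0-17', '17-27', '27-37', '37-47', '47-57', '57-67', '67-100']

# Hypothetical ages chosen to sit on or next to bin edges.
ages = pd.Series([17, 18, 27, 28, 64])

# Default right=True: intervals are (0, 17], (17, 27], (27, 37], ...
groups = pd.cut(ages, bins=bins, labels=labels)
print(list(groups))
```

With these settings an age of 27 lands in '17-27' and an age of 28 lands in '27-37'.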


In [None]:
#Quick overview of the data w.r.t sex

sns.set_style('white')
sns.pairplot(insurance_data,hue='sex')
plt.show()

In [None]:
plt.figure(figsize=(12,7))
sns.barplot(x = 'age groups', y = 'charges', data = insurance_data, hue = 'sex')
plt.show()

In [None]:
#Quick overview of the data w.r.t age groups

sns.set_style('white')
sns.pairplot(insurance_data,hue='age groups')
plt.show()

__Summary__:
- As the pairplot by sex shows, the number of males and females is similar. Charges are slightly higher for males than for females, which may be related to a slightly higher number of older males than older females. This can be seen in the barplot, which focuses on age because the heatmap showed a correlation between age and charges.
- As age increases, charges also increase, possibly because older people tend to have more health complications.

# Analysis w.r.t 'Smoker' feature

In [None]:
#Quick overview of the data w.r.t smoker

sns.set_style('white')
sns.pairplot(insurance_data,hue='smoker')
plt.show()

__Summary__:
- There are comparatively fewer smokers than non-smokers.
- The charges and the bmi for smokers are considerably higher than those of non-smokers.
- Smokers with bmi greater than or equal to 30 tend to spend more on charges and related expenditures, whereas non-smokers whose bmi is greater than 30 tend to spend less. This disparity may reflect the additional health complications that a high bmi brings when combined with smoking.


__Recommendations__:
- An awareness campaign against smoking would be worthwhile, as the data clearly shows that smokers spend more on health care. The gap is widest for smokers whose bmi is above 30, as the comparison of smokers and non-smokers with bmi greater than 30 in the summary above shows. These points should be the main focus of the campaign.

# Analysis w.r.t 'Region' feature

In [None]:
#Quick overview of the data w.r.t region

sns.set_style('white')
sns.pairplot(insurance_data,hue='region')
plt.show()

In [None]:
plt.figure(figsize=(12,7))
sns.barplot(x = 'age groups', y = 'charges', data = insurance_data, hue = 'region')
plt.show()

__Summary__:
- As the pairplot by region shows, persons from the southeast region have a slightly higher bmi than those from the other three regions. They also bear higher charges, and the region has a slightly larger share of older persons.

__Recommendations__:
- Since the southeast region shows slightly higher bmi and charges, the insurer and public health bodies could focus preventive health and wellness programmes on this region.

# Analysis w.r.t 'Children' feature

In [None]:
#Quick overview of the data w.r.t Children

sns.set_style('white')
sns.pairplot(insurance_data,hue='children')
plt.show()

In [None]:
plt.figure(figsize=(12,7))
sns.barplot(x = 'age groups', y = 'charges', data = insurance_data, hue = 'children')
plt.show()

__Summary__:
- In the 57-67 age group, charges are highest for persons with 4 children, which suggests that persons with more children tend to incur higher insurance charges

In [None]:
# Select 'charges' before aggregating so that non-numeric columns are not averaged
round(insurance_data.groupby(['region','sex','smoker','children'])['charges'].mean().unstack(),2)

In [None]:
insurance_data[insurance_data['sex'] == 'female'].groupby('children')['bmi'].describe()

In [None]:
insurance_data[insurance_data['sex'] == 'male'].groupby('children')['bmi'].describe()

__Summary__:
- The final conclusion from the EDA regarding charges can be seen in the grouped table of mean charges above: in every region and for both sexes, the charges for smokers are almost 5 times those of people who don't smoke at all.
- In all regions, female non-smokers have higher charges than male non-smokers. This might be due to other health factors that are not captured in the dataset.

__Recommendations:__
- Employers should encourage employees who smoke to take more health insurance coverage, or at least add a top-up to their existing coverage, so that they are not shocked by the charges and the final bill if they are hospitalized
- Similarly, employers should encourage female employees to keep their insurance cover renewed to protect themselves from financial stress in a medical emergency.

# Statistical Analysis:

# Problem statement 1: To prove or disprove that the charges of people who smoke are greater than those of people who don't smoke.

**Two-sample t-test assumptions**
- Data values must be independent. Measurements for one observation do not affect measurements for any other observation.
- Data in each group must be obtained via a random sample from the population.
- Data in each group are normally distributed.
- Data values are continuous.
- The variances for the two independent groups are equal.

- Setting up the null hypothesis (H0), the alternate hypothesis (Ha) and the significance level
    - **H0 : The average charges of smokers are less than or equal to those of non-smokers**
    - **Ha : The average charges of smokers are greater than those of non-smokers**
    - alpha = 0.05
- As the variance of the population is unknown, we will perform a right-tailed t-test.
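
For reference, the pooled two-sample t statistic behind a right-tailed test can be computed by hand and compared with `scipy.stats.ttest_ind`. This is a sketch on synthetic data: `group_a` and `group_b` are illustrative names, not notebook variables, and the `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=12.0, scale=3.0, size=120)  # synthetic "higher" group
group_b = rng.normal(loc=10.0, scale=3.0, size=120)  # synthetic "lower" group

n_a, n_b = len(group_a), len(group_b)
df = n_a + n_b - 2

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / df
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Right-tailed p-value: probability of a t value at least this large under H0.
p_manual = stats.t.sf(t_manual, df=df)

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=True, alternative='greater')
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.3g}")
print(f"scipy : t = {t_scipy:.4f}, p = {p_scipy:.3g}")
```

The right-tailed p-value uses only the upper tail of the t distribution, which is what `alternative='greater'` requests.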

In [None]:
insurance_data_new.groupby('smoker')['charges'].describe()

In [None]:
insurance_data_new[insurance_data_new['smoker'] == 'yes'].shape[0],insurance_data_new[insurance_data_new['smoker'] == 'no'].shape[0]

In [None]:
# Taking 120 observations per group, i.e. no more than the size of the smaller group.
smoker_charges = insurance_data_new[insurance_data_new['smoker'] == 'yes']['charges'].sample(120, replace = True)
nonsmoker_charges = insurance_data_new[insurance_data_new['smoker'] == 'no']['charges'].sample(120, replace = True)
#Checking Variance
round(smoker_charges.std()**2,2), round(nonsmoker_charges.std()**2 ,2)


#### Normality Test:
We will perform normality check using **Shapiro test.**

The hypotheses of this test are:
- Null Hypothesis Ho - series is normal
- Alternative Hypothesis Ha - series is not normal
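
How the Shapiro test reacts to different shapes can be sketched on synthetic data before applying it to the notebook samples. The two arrays below are generated for illustration only, and their names are hypothetical.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)

# Synthetic samples: one bell-shaped, one strongly right-skewed.
bell_shaped = rng.normal(loc=30, scale=6, size=200)
right_skewed = rng.exponential(scale=10000, size=200)

p_values = {}
for name, sample in [("bell_shaped", bell_shaped), ("right_skewed", right_skewed)]:
    statistic, p_value = shapiro(sample)
    p_values[name] = p_value
    print(f"{name}: W = {statistic:.3f}, p = {p_value:.4g}")
```

A very small p-value, as expected for the skewed sample, leads to rejecting the null hypothesis of normality.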

In [None]:
from scipy.stats import shapiro
def normality_check(series, alpha=0.05):
    _, p_value = shapiro(series)
    print(f'p value = {p_value}')
    if p_value >= alpha:
        print('We fail to reject the Null Hypothesis Ho')
    else:
        print('We reject the Null Hypothesis Ho')

In [None]:
normality_check(smoker_charges)

In [None]:
normality_check(nonsmoker_charges)

__Conclusions__
- Neither of the two samples passes the normality check

#### Equality of Variance Test:
We will check the equality of variances using Levene's test.

The hypotheses of this test are:
- Null Hypothesis Ho - Variances are equal
- Alternative Hypothesis Ha - Variances are not equal
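
The behaviour of Levene's test can be sketched on synthetic samples with known spreads. The arrays `narrow`, `wide` and `also_narrow` are made up for this example.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(2)

# Synthetic samples: two with the same spread, one with a much larger spread.
narrow = rng.normal(loc=0, scale=1, size=200)
also_narrow = rng.normal(loc=3, scale=1, size=200)
wide = rng.normal(loc=0, scale=5, size=200)

_, p_unequal = levene(narrow, wide)
_, p_similar = levene(narrow, also_narrow)

print(f"narrow vs wide       : p = {p_unequal:.4g}")
print(f"narrow vs also_narrow: p = {p_similar:.4g}")
```

Levene's test compares spreads only, so a difference in means (as between `narrow` and `also_narrow`) does not by itself make the test reject.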

In [None]:
from scipy.stats import levene
def variance_check(series1, series2, alpha=0.05):
    _, p_value = levene(series1, series2)
    print(f'p value = {p_value}')
    if p_value >= alpha:
        print('We fail to reject the Null Hypothesis Ho')
    else:
        print('We reject the Null Hypothesis Ho')

In [None]:
variance_check(smoker_charges,nonsmoker_charges)

__Conclusion__

The samples do not satisfy all the assumptions of the t-test.
Hence, we will use the non-parametric __Mann Whitney test__ to assess whether there is a statistically significant difference between the distributions of smoker_charges and nonsmoker_charges.

__Mann Whitney test__:

The hypotheses of this test are:
- Null Hypothesis Ho - the underlying distributions are the same
- Alternative Hypothesis Ha - the underlying distributions are not the same

We will use alpha = 0.05
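
The Mann Whitney test also has a one-sided form that matches a "greater than" question. The sketch below uses synthetic right-skewed samples (`higher` and `lower` are hypothetical names) and shows both the two-sided and the one-sided call of `scipy.stats.mannwhitneyu`.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)

# Synthetic right-skewed samples with clearly different typical values.
higher = rng.exponential(scale=30000, size=120)
lower = rng.exponential(scale=8000, size=120)

# Two-sided: are the distributions different in either direction?
_, p_two_sided = mannwhitneyu(higher, lower, alternative='two-sided')
# One-sided: do values in `higher` tend to be larger than values in `lower`?
_, p_greater = mannwhitneyu(higher, lower, alternative='greater')

print(f"two-sided p = {p_two_sided:.4g}")
print(f"greater   p = {p_greater:.4g}")
```

Stating `alternative` explicitly avoids relying on the library default, which has differed between SciPy versions.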

In [None]:
from scipy.stats import mannwhitneyu
test, p_val= mannwhitneyu(smoker_charges,nonsmoker_charges)

if p_val >= 0.05:
    print('We fail to reject the Null Hypothesis Ho')
else:
    print('We reject the Null Hypothesis Ho')

#### Although the normality test failed and the Mann Whitney test does not validate the t-test assumptions, the variances of the two samples are similar, though not exactly equal. Hence we proceed with the two-sample right-tailed t-test:

In [None]:
alpha = 0.05
t_stats, p_value = stats.ttest_ind(smoker_charges,nonsmoker_charges, alternative = 'greater',equal_var = True)
print(f"p-value is {p_value}, test statistics is {t_stats}")
if p_value < alpha:
    print(f"Since p value {p_value} is less than alpha {alpha}, we reject the null hypothesis and can say that the average charges of smokers are greater than those of non-smokers")
else:
    # Failing to reject H0 does not prove that the two means are equal
    print("We fail to reject H0: there is not enough evidence that the average charges of smokers are greater than those of non-smokers.")

__Conclusion:__
- **Since the p value is less than alpha (0.05)**, we reject the null hypothesis and can say that the **average charges of smokers are greater than those of non-smokers**
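
When the equal-variance assumption is doubtful, Welch's t-test is a common alternative to the pooled test. It is not the method used in this notebook; the sketch below only shows, on synthetic data with hypothetical names, how the two variants are requested through `equal_var`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Synthetic samples with different means and clearly different spreads.
high_spread = rng.normal(loc=30000, scale=10000, size=120)
low_spread = rng.normal(loc=8000, scale=3000, size=120)

# Pooled (Student) t-test assumes equal variances.
t_pooled, p_pooled = stats.ttest_ind(high_spread, low_spread, equal_var=True, alternative='greater')
# Welch's t-test drops that assumption and adjusts the degrees of freedom.
t_welch, p_welch = stats.ttest_ind(high_spread, low_spread, equal_var=False, alternative='greater')

print(f"pooled: t = {t_pooled:.3f}, p = {p_pooled:.3g}")
print(f"welch : t = {t_welch:.3f}, p = {p_welch:.3g}")
```

With equal sample sizes the two statistics coincide and only the degrees of freedom, and therefore the p-value, differ.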

# Problem statement 2: To Prove (or disprove) with statistical evidence that the bmi of females is different from that of males

**Two-sample t-test assumptions**
- Data values must be independent. Measurements for one observation do not affect measurements for any other observation.
- Data in each group must be obtained via a random sample from the population.
- Data in each group are normally distributed.
- Data values are continuous.
- The variances for the two independent groups are equal.

- Setting up the null hypothesis (H0), the alternate hypothesis (Ha) and the significance level
    - **H0 : The bmi of females is the same as that of males**
    - **Ha : The bmi of females is different from that of males**
    - alpha = 0.05
- As the variance of the population is unknown and the alternate hypothesis is two-sided, we will perform a two-tailed t-test.

In [None]:
insurance_data_new.groupby('sex')['bmi'].describe()

In [None]:
# Note: despite the *_viral_load variable names, these series hold bmi values
female_viral_load = insurance_data_new[insurance_data_new['sex'] == 'female']['bmi']
male_viral_load = insurance_data_new[insurance_data_new['sex'] == 'male']['bmi']

In [None]:
female_viral_load.shape[0],male_viral_load.shape[0]

We will take 500 as the sample size for the bmi of both males and females

In [None]:
female_viral_load_sample = insurance_data_new[insurance_data_new['sex'] == 'female']['bmi'].sample(500,replace = True)
male_viral_load_sample = insurance_data_new[insurance_data_new['sex'] == 'male']['bmi'].sample(500, replace = True)

In [None]:
#Checking Variance
round(female_viral_load_sample.std()**2,2), round(male_viral_load_sample.std()**2 ,2)

#### Normality Test:
We will perform normality check using **Shapiro test.**

The hypotheses of this test are:
- Null Hypothesis Ho - series is normal
- Alternative Hypothesis Ha - series is not normal

In [None]:
from scipy.stats import shapiro
def normality_check(series, alpha=0.05):
    _, p_value = shapiro(series)
    print(f'p value = {p_value}')
    if p_value >= alpha:
        print('We fail to reject the Null Hypothesis Ho')
    else:
        print('We reject the Null Hypothesis Ho')

In [None]:
normality_check(female_viral_load_sample)
print('-'*50)
normality_check(male_viral_load_sample)

In [None]:
# sns.kdeplot(female_viral_load_sample,color = 'green',shade='green')
# sns.kdeplot(male_viral_load_sample,color = 'blue',shade = 'blue')
# plt.show()

#### Equality of Variance Test:
We will check the equality of variances using Levene's test.

The hypotheses of this test are:
- Null Hypothesis Ho - Variances are equal
- Alternative Hypothesis Ha - Variances are not equal

In [None]:
from scipy.stats import levene
def variance_check(series1, series2, alpha=0.05):
    _, p_value = levene(series1, series2)
    print(f'p value = {p_value}')
    if p_value >= alpha:
        print('We fail to reject the Null Hypothesis Ho')
    else:
        print('We reject the Null Hypothesis Ho')

In [None]:
variance_check(female_viral_load_sample,male_viral_load_sample)

__Conclusion__

The samples do not satisfy all the assumptions of the t-test.
Hence, we will use the non-parametric __Mann Whitney test__ to assess whether there is a statistically significant difference between the distributions of the female and male bmi samples.

__Mann Whitney test__:

The hypotheses of this test are:
- Null Hypothesis Ho - the underlying distributions are the same
- Alternative Hypothesis Ha - the underlying distributions are not the same

We will use alpha = 0.05

In [None]:
from scipy.stats import mannwhitneyu
test, p_val= mannwhitneyu(female_viral_load_sample,male_viral_load_sample)

if p_val >= 0.05:
    print('We fail to reject the Null Hypothesis Ho')
else:
    print('We reject the Null Hypothesis Ho')

- **Normality test - Shapiro Wilk test -> Failed**
- **Equality of Variance Test - Levene's Test -> Pass**
- **Non-parametric Test for confirmation - Mann Whitney test -> Pass**

Hence we can proceed for 2 sample t test:

In [None]:
alpha = 0.05
t_stats, p_value = stats.ttest_ind(female_viral_load_sample,male_viral_load_sample,equal_var = True)
print(f"p-value is {p_value}, test statistics is {t_stats}")
if p_value < alpha:
    print(f"Since p value {p_value} is less than alpha {alpha}, we reject the null hypothesis and can say that the bmi of females is different from that of males")
else:
    # Failing to reject H0 does not prove that the two means are equal
    print("We fail to reject H0: there is not enough evidence that the bmi of females is different from that of males.")

**As concluded from the two-sample two-tailed t-test, there is no statistically significant difference between the bmi of females and that of males.**

# Problem statement 3: To check if the proportion of smoking is significantly different across different regions

- Setting up the null hypothesis (H0), the alternate hypothesis (Ha) and the significance level
    - **H0 : The proportion of smokers is not significantly different across regions**
    - **Ha : The proportion of smokers is different across regions**
    - alpha = 0.05
- Here we are comparing two categorical variables - smoker and region. So in this case we will perform the __Chi-Square test of Independence__

**Assumptions:**
- Assumption 1: Both variables are categorical.
- Assumption 2: All observations are independent.
- Assumption 3: Cells in the contingency table are mutually exclusive.
- Assumption 4: Expected value of cells should be 5 or greater in at least 80% of cells.
    - It’s assumed that the expected value of cells in the contingency table should be 5 or greater in at least 80% of cells and that no cell should have an expected value less than 1.
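
The expected-frequency assumption can be checked directly from the output of `scipy.stats.chi2_contingency`. The sketch below uses a hypothetical 4 x 2 table of counts (`observed` is made up and is not the notebook's contingency table).

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows are four regions, columns are (non-smoker, smoker).
observed = np.array([
    [250, 60],
    [260, 70],
    [255, 55],
    [265, 58],
])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Share of cells whose expected count is at least 5, and the smallest expected count.
share_at_least_5 = (expected >= 5).mean()
smallest_expected = expected.min()

print(f"degrees of freedom = {dof}")
print(f"share of cells with expected >= 5: {share_at_least_5:.0%}")
print(f"smallest expected count: {smallest_expected:.1f}")
```

For an r x c table the degrees of freedom are (r - 1)(c - 1), which gives 3 for four regions and two smoker categories.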

In [None]:
insurance_data_new['region'].count(),insurance_data_new['smoker'].count()

In [None]:
insurance_data_new.groupby(['region','smoker'])['age'].count().unstack()

In [None]:
pd.crosstab(insurance_data_new['region'],insurance_data_new['smoker'],margins = True)


In [None]:
contingency_table = pd.crosstab(insurance_data_new['region'],insurance_data_new['smoker'])
contingency_table

In [None]:
contingency_table.plot(kind = 'bar')
# sns.barplot(data= contingency_table,hue = 'smoker',x = contingency_table['region'])
plt.xticks(rotation = 45)
plt.show()

In [None]:
chi2_stat, p_value, dof, expected_frequencies  = stats.chi2_contingency(contingency_table)
# chi2_contingency returns: statistic, p-value, degrees of freedom, expected frequencies
print(f"Chi-square statistic = {chi2_stat}, p-value is {p_value}, degrees of freedom is {dof} and array of expected frequencies is {expected_frequencies}")

In [None]:
alpha = 0.05
if p_value >= alpha: 
    print('We fail to reject the Null Hypothesis Ho and thus we can conclude that the proportion of smokers is not significantly different across regions')
else:
    print('We reject the Null Hypothesis Ho')

**As concluded from the Chi-Square test the smokers proportion is not significantly different in different regions.**
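
The proportions being compared can also be read off a row-normalised crosstab. This is a minimal sketch on a tiny made-up frame (`toy` is hypothetical); `normalize="index"` makes each row sum to 1.

```python
import pandas as pd

# Tiny made-up data set with the same two column names as the notebook.
toy = pd.DataFrame({
    "region": ["northeast", "northeast", "southeast", "southeast",
               "southeast", "southwest", "southwest", "northwest"],
    "smoker": ["yes", "no", "yes", "yes", "no", "no", "no", "yes"],
})

# Each row shows the share of non-smokers and smokers within one region.
proportions = pd.crosstab(toy["region"], toy["smoker"], normalize="index")
print(proportions.round(2))
```

The chi-square test itself must still be run on the raw counts, not on these proportions.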

# Problem statement 4: To check if the mean bmi of women with 0 children, 1 child and 2 children is the same.

- Setting up the null hypothesis (H0), the alternate hypothesis (Ha) and the significance level
    - **H0 : The mean bmi of women with no children, 1 child and 2 children is the same**
    - **Ha : At least one of the mean bmi values is different**
    - alpha = 0.05
- Here we compare the means of more than two groups by analysing the variance between and within the samples. So in this case we will perform a __One Way ANOVA__

**Assumptions:**
- Normality – that each sample is taken from a normally distributed population
- Sample independence – that each sample has been drawn independently of the other samples
- Variance equality – that the variance of data in the different groups should be the same
- The dependent variable, here bmi, should be continuous, that is, measured on a scale which can be subdivided using increments
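
The F statistic of a one-way ANOVA is the ratio of the between-group to the within-group mean square. The sketch below computes it by hand on three synthetic groups (generated for illustration, not the bmi samples) and compares the result with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Three synthetic groups drawn from the same distribution.
groups = [rng.normal(loc=30, scale=6, size=100) for _ in range(3)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), len(all_values)

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy : F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```

Each group must be passed to `f_oneway` as a separate argument, one per level being compared.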

In [None]:
insurance_data_new[insurance_data_new['sex'] == 'female'].groupby('children')['bmi'].describe()

In [None]:
insurance_data_new[insurance_data_new['sex'] == 'female'].groupby('children')['bmi'].describe().head(3)

Here the limiting factor for sample selection is 106, so we take fewer than 106, i.e. 100 observations per group, for further testing using the one-way ANOVA

In [None]:
# 'children' was converted to the category datatype earlier because it takes only a few distinct values.
# To select the groups with 0, 1 and 2 children using a numeric comparison (<= 2),
# we convert it back to the integer (int64) datatype.

insurance_data_new['children'] = insurance_data_new['children'].astype('int64')
insurance_data_new.info()

In [None]:
# Keep only females with at most 2 children
is_female = insurance_data_new['sex'] == 'female'
female_Children_df = insurance_data_new[is_female & (insurance_data_new['children'] <= 2)]

In [None]:
female_Children_df

In [None]:
female_Children_df['children'].value_counts()

In [None]:
sns.boxplot(x = 'children', y = 'bmi', data =female_Children_df )
plt.show()

#### Normality Test:
We will perform normality check using **Shapiro test.**

The hypotheses of this test are:
- Null Hypothesis Ho - series is normal
- Alternative Hypothesis Ha - series is not normal

In [None]:
from scipy.stats import shapiro
def normality_check(series, alpha=0.05):
    _, p_value = shapiro(series)
    print(f'p value = {p_value}')
    if p_value >= alpha:
        print('We fail to reject the Null Hypothesis Ho')
    else:
        print('We reject the Null Hypothesis Ho')

In [None]:
normality_check(female_Children_df['bmi'].sample(100, replace = True))

#### Equality of Variance Test:
We will check the equality of variances using Levene's test.

The hypotheses of this test are:
- Null Hypothesis Ho - Variances are equal
- Alternative Hypothesis Ha - Variances are not equal


In [None]:
from scipy.stats import levene
def variance_check(series1, series2, series3, alpha=0.05):
    _, p_value = levene(series1, series2, series3)
    print(f'p value = {p_value}')
    if p_value >= alpha:
        print('We fail to reject the Null Hypothesis Ho')
    else:
        print('We reject the Null Hypothesis Ho')

In [None]:
female_Children_df.info()

In [None]:
sample_with_level_0 = female_Children_df[female_Children_df['children'] == 0 ]['bmi'].sample(100,replace = True)
sample_with_level_1 = female_Children_df[female_Children_df['children'] == 1 ]['bmi'].sample(100,replace = True)
sample_with_level_2 = female_Children_df[female_Children_df['children'] == 2 ]['bmi'].sample(100,replace = True)

In [None]:
sample_with_level_0

In [None]:
variance_check(sample_with_level_0,sample_with_level_1,sample_with_level_2)

In [None]:
# sns.kdeplot(sample_with_level_0,color = 'green',shade='green')
# sns.kdeplot(sample_with_level_1,color = 'blue',shade = 'blue')
# sns.kdeplot(sample_with_level_2,color = 'red',shade = 'blue')
# plt.show()

In [None]:
from scipy import stats
# One-way ANOVA across the three children groups (0, 1 and 2)
stat, p_value = stats.f_oneway(sample_with_level_0, sample_with_level_1, sample_with_level_2)
stat, p_value

In [None]:
alpha = 0.05
if p_value >= alpha: 
    print('We fail to reject the Null Hypothesis Ho: there is no significant evidence that the mean bmi of women with no children, 1 child and 2 children differs')
else:
    print('We reject the Null Hypothesis Ho')

**As concluded from the one-way ANOVA test, there is no statistically significant difference in the mean bmi of women with no children, 1 child and 2 children.**
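
The cells above draw 100 values with replacement from each group, so the p-value changes from run to run. A hedged alternative is to pass each full group to the test and, if normality is in doubt, to compare against the Kruskal-Wallis test, which is a different, rank-based technique and not the ANOVA used in this notebook. The function name `compare_groups` and the synthetic groups below are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import f_oneway, kruskal


def compare_groups(groups, alpha=0.05):
    """Run one-way ANOVA and the Kruskal-Wallis test on a list of 1-D samples."""
    f_stat, f_p = f_oneway(*groups)
    h_stat, h_p = kruskal(*groups)
    return {
        'anova_p': float(f_p),
        'anova_reject': bool(f_p < alpha),      # True = group means differ
        'kruskal_p': float(h_p),
        'kruskal_reject': bool(h_p < alpha),    # True = group distributions differ
    }


# Synthetic groups with clearly separated means (NOT the insurance data).
rng = np.random.default_rng(2)
toy_groups = [rng.normal(loc=mu, scale=2, size=60) for mu in (25, 30, 35)]

result = compare_groups(toy_groups)
print(result)
```

On the notebook's data, the groups would presumably be built without `.sample(...)`, for example `[female_Children_df.loc[female_Children_df['children'] == k, 'bmi'] for k in (0, 1, 2)]`, so the result is reproducible.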

# Final Observations, Inferences and Recommendations:

### Observations and Inferences:
- The mean and median age of all persons are nearly the same, whereas the mean and median charges differ. The median charges for males and females, and across regions, look similar visually, so charges appear to be largely independent of sex and region.
- In contrast, charges differ significantly between smokers and non-smokers.
- Of the four regions, the southeast has the most people (364). Males are in the majority (676), non-smokers are far more numerous (1064 of 1338), and the most common number of children is 0 (574), i.e. persons with no children outnumber those with any other number of children.
- Although smokers are fewer in number than non-smokers, the charges they incur are higher.
- Charges also differ with the number of children: persons with more children seem to incur higher charges than those with fewer children.
- As the population distribution plot for males and females shows, the two groups are similar in size. Charges are slightly higher for males than for females, which may be related to the slightly higher number of males being admitted. The same pattern can be seen in the barplot, where we focused on age because the heatmap showed a high correlation between age and charges.
- As age increases, charges also increase, possibly because of other complications that come with ageing.
- Smokers are comparatively fewer than non-smokers, whereas the charges and the bmi of smokers are much higher than those of non-smokers.
- Smokers with a bmi greater than or equal to 30 tend to spend more on charges and related expenditures, whereas non-smokers whose bmi is also greater than 30 tend to spend less. This disparity may reflect the additional health complications of a high bmi combined with the impact of smoking on the lungs.
- The maximum charges for non-smokers are around 75,000, whereas for smokers they reach about twice this amount, i.e. 150,000.
- As the pairplot by region shows, persons from the southeast region have a slightly higher bmi than those from the other three regions. They also bear higher charges and include a slightly larger share of aged persons.
- The final observation from the EDA regarding charges: in every region, and for both males and females, smokers incur almost 5 times the charges of people who do not smoke at all.
- In all regions, non-smoking females have higher charges than non-smoking males. This might be due to lower immunity or other health complications.
- As concluded from the two-sample right-tailed t-test, the average charges of smokers are greater than those of non-smokers.
- As concluded from the two-sample right-tailed t-test, there is no significant difference between the bmi of females and that of males.
- As concluded from the one-way ANOVA test, there is no significant difference in the mean bmi of women with no children, 1 child and 2 children.
- As concluded from the Chi-Square test, the proportion of smokers is not significantly different across regions.


### Recommendations:
- Since persons with no children also spend a lot on charges, the insurance management needs to allocate resources appropriately, giving priority to persons with more children over those with no or fewer children.

- Persons who smoke are in dire need of insurance coverage. insurance 24/7 can recommend to the corporates with whom it has tie-ups that more focus be placed on the smoker population, as smokers have a high chance of contracting disease. Employers should encourage employees who smoke to get more health insurance coverage, or at least to add a top-up to their existing coverage, so that they are not shocked by the charges and the final bill. In addition, for employees of employers that have a tie-up with insurance 24/7 hospitals, a targeted campaign should be initiated to enroll their parents and grandparents in the insurance schemes and policies. Similarly, employers should encourage female employees to renew their insurance cover to protect themselves from a financial crisis in case of medical emergencies such as COVID-19.

- There should be an awareness campaign to curb smoking habits, as the data clearly show that smokers have spent extra money on health, which can include surgery, lung transplants, prolonged COVID-19 infection, etc. The data also point to a reason: the bmi of these smokers is well above 30, along with the other complications noted in the summary points above that compare smokers and non-smokers with a bmi greater than 30. These points should be highlighted in the campaign.
