## <span style="color:blue"><u>Basic Understanding of the Data</u></span>

<span style="font-size:18px">As the data file (insurance.csv) is small (55 KB), it is worth glancing at the data in Excel first for a better understanding.</span>

### <span style="color:blue"><u>My Findings after going through the data in Excel :</u></span>

1. The data set has <b>1338 rows and 7 columns</b> (columns as given in the Attribute Information).

2. There are 3 categorical columns, namely <b>sex, smoker, region</b>, and 4 numerical columns: <b>age, bmi, children, charges</b>.

3. The age column has discrete numerical values. Its <b>minimum value is 18 and maximum value is 64</b>, giving a <b>range of (age maximum - age minimum) = (64 - 18) = 46 and a mean (average) of 39.20702541</b>.

4. The sex column has only two values, <b>male and female</b>. It can be <b>binary coded</b>, assigning female as 0 and male as 1.

5. The bmi column has continuous numerical values. Its <b>minimum value is 15.96 and maximum value is 53.13</b>, giving a <b>range of (bmi maximum - bmi minimum) = (53.13 - 15.96) = 37.17 and a mean of 30.66339686</b>.

6. The children column also has discrete numerical values. It <b>contains only the values 0, 1, 2, 3, 4 and 5</b>, with a minimum of 0 and a maximum of 5.

7. The smoker column has only two values, <b>yes and no</b>. Like the sex column, it can be <b>binary coded</b>, assigning no as 0 and yes as 1.

8. The region column has four values: <b>northeast, southeast, southwest, northwest</b>. We can use Label Encoder to code the values.

9. The charges column has continuous numerical values. Its <b>minimum value is 1121.8739 and maximum value is 63770.42801</b>, giving a <b>range of (charges maximum - charges minimum) = (63770.42801 - 1121.8739) = 62648.55411 and a mean of 13270.42227</b>.

10. <b>There are no missing values in the dataset,</b> but it does contain a <b>duplicate row</b>.
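
The spreadsheet checks above (shape, range, mean, missing values, duplicates) can also be reproduced in pandas. The following is only a minimal sketch: the frame `toy` holds made-up rows that imitate the insurance.csv layout, and the helper name `summarize` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical rows imitating the insurance.csv layout (not real data);
# the last row deliberately repeats the second one.
toy = pd.DataFrame({
    "age":     [18, 25, 40, 64, 25],
    "sex":     ["female", "male", "female", "male", "male"],
    "bmi":     [20.0, 30.5, 28.0, 35.5, 30.5],
    "smoker":  ["no", "no", "yes", "yes", "no"],
    "charges": [1500.0, 3000.0, 8000.0, 30000.0, 3000.0],
})

def summarize(df):
    """Return shape, missing-value count, duplicate count and per-column min/max/range/mean."""
    numeric = df.select_dtypes(include="number")
    ranges = pd.DataFrame({
        "min": numeric.min(),
        "max": numeric.max(),
        "range": numeric.max() - numeric.min(),
        "mean": numeric.mean(),
    })
    return {
        "shape": df.shape,
        "missing": int(df.isnull().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
        "ranges": ranges,
    }

result = summarize(toy)
print(result["shape"], result["missing"], result["duplicates"])
print(result["ranges"])
```

Calling `summarize(insData)` once the real file is loaded should give figures to compare against the list above.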

# <span style="color:blue">Tasks</span>

## <span style="color:blue">1. Import the necessary libraries (2 marks)</span>

In [None]:

# importing libraries
import copy                                        # required to create a real copy / clone
import numpy                 as np                 # support for large, multi-dimensional arrays and matrices
import pandas                as pd                 # data manipulation and analysis library for python
import matplotlib.pyplot     as plt                # matplotlib's MATLAB-like plotting interface
import seaborn               as sns                # data visualization library based on matplotlib
from   sklearn.preprocessing import LabelEncoder   # encode labels with values between 0 and n_classes-1
from   scipy                 import stats          # statistical tests and probability distributions
from   statsmodels.stats     import proportion     # statistical tests and confidence intervals for proportions

# for better background in seaborn graphs
sns.set(color_codes=True)

# show graphs inline in the jupyter notebook
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## <span style="color:blue">2. Read the data as a data frame (2 marks)</span>

In [None]:

# insurance csv data set is read from the Kaggle input directory and loaded into a dataframe named insData
print()
dataFile = "insurance.csv"       # assigning provided file name to a variable called 'dataFile'
insData = pd.read_csv("../input/insurance/"+dataFile)  # reading the given csv file using Pandas read_csv() function
print("~"*60)
print("\033[1mAnswer 2 :\033[0m Loaded {0} file into \033[1m\"insData\"\033[0m DataFrame!".format(dataFile))
print("~"*60)
# printing top 5 rows from insData dataframe
print()
print("~"*65)
print("Displaying \033[1mTop 5 \033[0mrows from DataFrame \033[1m\"insData\"\033[0m")
print("~"*65)
print("\033[1m",insData.head()) # head() returns the top 5 rows of the dataframe
print("\033[0m~"*65)
print()
print("~"*65)
print("Displaying \033[1mBottom 5 \033[0mrows from DataFrame \033[1m\"insData\"\033[0m")
print("~"*65)
print("\033[1m",insData.tail()) # tail() returns the bottom 5 rows of the dataframe
print("\033[0m~"*65)

## <span style="color:blue">3. Perform basic EDA which should include the following and print out your insights at every step. (28 marks)</span>

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">a. Shape of the data (2 marks)</span>

In [None]:

print()
print("~"*70)
print("\033[1mAnswer 3.a.\033[0m The Shape of Insurance data loaded in \033[1m\"insData\"\033[0m DataFrame :")
print("~"*70)
print("*"*30)
# getting shape of the data using shape
print("\033[1m",pd.DataFrame({'No_of_Rows':[insData.shape[0]],'No_of_Columns':[insData.shape[1]]}))
print("\033[0m*"*30)

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">b. Data type of each attribute (2 marks)</span>

In [None]:

'''
Checking the data type of each column and counting columns per data type
'''
# declaring counters for each data type
intData = 0
objData = 0
floatData = 0
othData = 0
print()
print("~"*50)
print("\033[1mAnswer 3.b. Checking Data Type of Each Attribute :\033[0m")
print("~"*50)
print()
print("*"*30)
print("\033[1mAttributes          DataType\033[0m")
print("*"*30)
# getting data type of all columns
for features in insData.columns:     # looping over each column of the dataset
    dt = insData[features].dtypes    # getting the data type of column into dt variable
    if(dt == "object"):              # checking the datatype if it is object or not
        objData = objData + 1        # if object then counting into objData variable 
    elif(dt == "int64"):             # checking the datatype if it is integer or not
        intData = intData + 1        # if integer then counting into intData variable
    elif(dt == "float64"):           # checking the datatype if it is float or not
        floatData = floatData + 1    # if float then counting into floatData variable
    else:                            # other than normal datatypes i.e object , integer and float
        othData = othData + 1        # if data type other than object, integer and float then counting into othData variable
    print("{0}{1}".format(features.ljust(20,' '),dt))
print("*"*30)
print()
print("~"*100)
print("\033[1mNote:\033[0m There are \033[1m{0} Integer Type, {1} Float Type, {2} Object Type and {3} Other Types\033[0m Columns in the Dataset."
      .format(intData,floatData,objData,othData)
     )
print("~"*100)
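
The per-type counting done by the loop above can also be obtained directly from the `dtypes` attribute. This is a minimal sketch on a hypothetical three-column frame (`toy` is an illustrative name, not part of the notebook); it is an alternative formulation, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical frame with one integer, one float and one text column
toy = pd.DataFrame({
    "age": [18, 25],
    "bmi": [20.0, 30.5],
    "sex": ["female", "male"],
})

# Series mapping each dtype name to the number of columns that have it
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Number of numeric columns versus all columns
n_numeric = toy.select_dtypes(include="number").shape[1]
print(n_numeric, "numeric columns out of", toy.shape[1])
```

On the real data, `insData.dtypes.astype(str).value_counts()` would give the same kind of summary in one line.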

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">c. Checking the presence of missing values (3 marks)</span>

In [None]:

# Checking the Number of Missing Values in each column
totCnt = 0
print()
print("~"*70)
print("\033[1mAnswer 3.c.\033[0m Missing Value for each Attribute in \033[1m\"insData\"\033[0m DataFrame  :")
print("~"*70)
print()
print("*"*32)
print("\033[1mColumnName    MissingValueCount\033[0m")
print("*"*32)
for features in insData.columns:             # looping over each column of the dataset
    cnt = insData[features].isnull().sum()   # sum of null values for the columns and assigned to 'cnt' variable 
    totCnt = totCnt + cnt                    # summing up all counted null values from each iteration and assigned to 'totCnt' variable
    print("{0}{1}".format(features.ljust(20,' '),cnt))
print("*"*32)
print()
print("~"*70)
if(totCnt > 0):
    print("\033[1mNote:\033[0m There are Total \033[1m{0} no of missing values\033[0m in the \033[1m\"insData\"\033[0m Dataset."
          .format(totCnt)
         )
else:
    print("\033[1m*** \033[0m There are \033[1m NO missing values\033[0m in the \033[1m\"insData\"\033[0m Dataset.")
print("~"*70)

### Note : The insurance dataset has a duplicate row in it.

In [None]:

print()
print("*"*55)
print("\033[1mThe below mentioned rows are duplicate in the dataset :\033[0m")
print("*"*55)
print("\033[1m")
print(insData[insData.duplicated()])   # printing duplicate row if any in the insData dataset
print("\033[0m")
print("*"*60)

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">d. 5 point summary of numerical attributes (3 marks)</span>

In [None]:

print()
print("~"*90)
print("\033[1mAnswer 3.d. 5 number / point summary\033[0m of numerical attributes of the \033[1m\"insData\"\033[0m DataFrame :")
print("~"*90)
print()
for features in (insData.select_dtypes(include=np.number).columns.tolist()):  # looping through all numeric datatype columns
    print("*"*40)
    print("\033[1m5 number / point summary of {0} feature :\033[0m".format(features))
    print("*"*40)
    print("\033[1m5_Number    Values\033[0m")
    print("*"*20)
    # printing 5 number summary of numerical columns of insData dataset
    print(insData[features].describe()[['min','25%','50%','75%','max']])
    print("*"*40)
    print()

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">e. Distribution of ‘bmi’, ‘age’ and ‘charges’ columns. (4 marks)</span>

In [None]:
print()
print("~"*90)
print("\033[1mAnswer 3.e.\033[0m Distribution of 'bmi', 'age' and 'charges' columns of the \033[1m\"insData\"\033[0m DataFrame :")
print("~"*90)
print()
row = 3
col = 1
plc = 1
plcHolder = int(str(row) + str(col) + str(plc)) # three-digit subplot position (311): three rows, one column, first slot
for features in ('bmi', 'age','charges'):       # looping with bmi, age and charges features
    plt.figure(figsize=(10,15))                 # setting figure size with width = 10 and height = 15
    plt.subplot(plcHolder)                      # plotting in the placeholder (expects a three-digit integer)
    print("*"*30)
    print("\033[1mDistribution of {0} column :\033[0m".format(features))
    print("*"*30)
    ax = sns.distplot(insData[features])        # seaborn distplot to examine distribution of the feature
    plt.show()                                  # plotting in the notebook
    plcHolder = plcHolder + 1                   # increasing placeholder to next row
    # printing mean, median and kurtosis of the feature
    print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Kurtosis = {3}\033[0m".
          format(features,round(insData[features].mean(),3),round(insData[features].median(),3),
                 round(insData[features].kurtosis(),3))
         )
    print()
    print("*"*80)
    print()

<span style="color:blue"><b> a. 'bmi' feature :</b></span> The histogram of the 'bmi' feature shows that the data points are approximately normally distributed and form a bell-shaped curve. The mean is slightly greater than the median <b>(Mean = 30.663 and Median = 30.4)</b>, which points to a slight right (positive) skew. With mean and median this close to each other, the feature is close to symmetric.

<span style="color:blue"><b> b. 'age' feature :</b></span> The histogram of the 'age' feature shows that the data points are distributed quite uniformly. The mean and median <b>(Mean = 39.207 and Median = 39.0)</b> are close to each other, with the mean marginally higher, so the feature is only very slightly right (positively) skewed.

<span style="color:blue"><b> c. 'charges' feature :</b></span> The histogram of the 'charges' feature shows that the data points are highly right (positively) skewed. The same can be inferred from the mean and median <b>(Mean = 13270.422 and Median = 9382.033)</b>: the mean is far greater than the median. Hence we can conclude that the 'charges' feature is highly right (positively) skewed.
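
The mean-versus-median reasoning used above can be illustrated on two tiny made-up samples. This is only a sketch: the values are hypothetical, and the gap between mean and median is a rough indicator of skew rather than a formal test.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])        # evenly spread around the centre
right_skewed = pd.Series([1, 2, 3, 4, 100])   # one large value stretches the right tail

for name, s in (("symmetric", symmetric), ("right_skewed", right_skewed)):
    # a mean well above the median suggests a long right tail
    print(name, "mean =", s.mean(), "median =", s.median())
```

The large value pulls the mean upwards while the median stays put, which is the same pattern seen for 'charges'.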

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">f. Measure of skewness of ‘bmi’, ‘age’ and ‘charges’ columns (2 marks)</span>

In [None]:
remarks = ""
print()
print("~"*95)
print("\033[1mAnswer 3.f.\033[0m Measuring Skewness of \033[1m'bmi', 'age' and 'charges'\033[0m columns from \033[1m\"insData\"\033[0m DataFrame :")
print("~"*95)
print()
print("*"*55)
print("\033[1mColumnName   SkewnessValues       SkewnessRoundValue\033[0m")
print("*"*55)
for features in ('bmi','age','charges'):           # looping with bmi, age and charges features
    if(round(insData[features].skew()) == 0):      # comparing if rounded value of skewness is zero
        remarks = "\033[1mBased on round value of skewness, {0} Column is Approximately Symmetric.\033[0m".format(features)
    elif(round(insData[features].skew()) > 0):     # comparing if rounded value of skewness is greater than zero
        remarks = "\033[1mBased on round value of skewness, {0} Column is Right-Skewed.\033[0m".format(features)
    else:                                          # if the above two conditions are false i.e. rounded value of skewness is less than zero
        remarks = "\033[1mBased on round value of skewness, {0} Column is Left-Skewed.\033[0m".format(features)
    # printing features with skewness value, rounded value of skewness and remarks from the above conditions
    print("\033[1m{0}\033[0m{1}{2}<= {3}"
          .format(features.ljust(13,' '),str(insData[features].skew()).ljust(25,' '),
                  str(round(insData[features].skew())).ljust(10,' '),remarks)
         )
    print("*"*55)
    print()

<span style="color:blue"><b> a. 'bmi' feature :</b></span> Skewness of this feature is small at approximately 0.284, i.e. slightly right (positively) skewed.

<span style="color:blue"><b> b. 'age' feature :</b></span> Skewness of this feature is negligible at approximately 0.056, i.e. very slightly right (positively) skewed.

<span style="color:blue"><b> c. 'charges' feature :</b></span> Skewness of this feature is high at approximately 1.516, i.e. highly right (positively) skewed.
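
As a cross-check of the skewness figures, the sketch below compares pandas' `skew()` with SciPy's bias-corrected estimator on a made-up sample and labels a skewness value with an explicit tolerance instead of rounding. The sample values, the helper name `describe_skew` and the tolerance of 0.5 are assumptions chosen for illustration (0.5 is a common rule of thumb, not a fixed standard).

```python
import numpy as np
import pandas as pd
from scipy import stats

values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0])  # hypothetical sample

pandas_skew = values.skew()                   # bias-corrected sample skewness
scipy_skew = stats.skew(values, bias=False)   # same estimator in SciPy
print(pandas_skew, scipy_skew)

def describe_skew(g1, tol=0.5):
    """Label a skewness value; tol is an assumed rule-of-thumb threshold."""
    if abs(g1) < tol:
        return "roughly symmetric"
    return "right-skewed" if g1 > 0 else "left-skewed"

# Skewness values quoted in the text above
for name, g1 in (("bmi", 0.284), ("age", 0.056), ("charges", 1.516)):
    print(name, describe_skew(g1))
```

An explicit tolerance makes the cut-off visible, whereas `round()` silently treats every value below 0.5 in magnitude as zero.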

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">g. Checking the presence of outliers in ‘bmi’, ‘age’ and ‘charges' columns (4 marks)</span>

In [None]:
print()
print("~"*110)
print("\033[1mAnswer 3.g.\033[0m Checking the presence of outliers in \033[1m'bmi', 'age' and 'charges' columns\033[0m of the \033[1m\"insData\"\033[0m DataFrame :")
print("~"*110)
print()
row = 3
col = 1
plc = 1
plcHolder = int(str(row) + str(col) + str(plc)) # three-digit subplot position (311): three rows, one column, first slot
for features in ( 'bmi','age','charges'):      # looping with bmi, age and charges features
    Q1 = insData[features].quantile(0.25)      # evaluating lower / first quartile
    Q3 = insData[features].quantile(0.75)      # evaluating upper / third quartile
    IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
    # flagging outliers with the 1.5 * IQR rule: values below the lower fence
    # (Q1 - 1.5 * IQR) or above the upper fence (Q3 + 1.5 * IQR)
    outliers = insData[((insData[features] < (Q1 - 1.5 * IQR)) |(insData[features] > (Q3 + 1.5 * IQR)))][features]
    plt.figure(figsize=(10,15))                # setting figure size with width = 10 and height = 15
    plt.subplot(plcHolder)                     # plotting in the placeholder (expects a three-digit integer)
    print("*"*30)
    print("\033[1mBoxplot of {0} column :\033[0m".format(features))
    print("*"*30)
    ax = sns.boxplot(x=insData[features])      # seaborn boxplot to examine outliers of the feature
    plt.show()                                 # plotting in the notebook
    plcHolder = plcHolder + 1                  # increasing placeholder to next row
    # printing mean, median and IQR for the feature
    print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
          .format(features,round(np.mean(insData[features]),3),round(np.median(insData[features]),3),round(IQR,3))
         )
    print()
    if(outliers.shape[0] == 0):                # comparing if number of outlier is zero
        print("There are \033[1mno outliers\033[0m in \033[1m{0}\033[0m feature.".format(features))
    else:                                      # if the above condition is false i.e number of outlier is not zero
        # printing number of outliers, the percentage of data points that are outliers and the outlier values
        print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
        .format(outliers.shape[0],round(((outliers.shape[0]/insData.shape[0])*100),3),features,outliers.tolist()))
    print("*"*125)
    print()
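
The 1.5 × IQR rule used in the cell above can be isolated into a small helper. This is a minimal sketch on a made-up series; the function name `iqr_outliers` and the parameter `k` are hypothetical names introduced here.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] together with the two fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = series[(series < lower) | (series > upper)]
    return flagged, (lower, upper)

sample = pd.Series([10, 12, 13, 14, 15, 16, 18, 50])   # hypothetical values
flagged, fences = iqr_outliers(sample)
print("fences:", fences)
print("outliers:", flagged.tolist())
```

The same helper could be applied to `insData['bmi']`, `insData['age']` and `insData['charges']` to obtain the values the boxplots draw as individual points.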

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">h. Distribution of categorical columns (include children) (4 marks)</span>

In [None]:
print()
print("~"*105)
print("\033[1mAnswer 3.h.\033[0m Distribution of 'sex', 'children', 'smoker' and 'region' columns of the \033[1m\"insData\"\033[0m DataFrame :")
print("~"*105)
print()
row = 4
col = 1
plc = 1
plcHolder = str(row) + str(col) + str(plc)                     # subplot placeholder with four rows and one column
for features in ('sex','children','smoker','region'):          # looping with sex, children, smoker and region features
    print("*"*30)
    print("\033[1mDistribution of {0} column :\033[0m".format(features))
    print("*"*30)
    ax = sns.catplot(x=features, kind="count", data=insData)   # seaborn count catplot to examine distribution of the feature
    y = []                                                     # creating a null or empty array
    for val in range(insData[features].nunique()):             # looping for number of unique values in the feature
        # appending count of each unique values from feature to array y
        y.append(insData.groupby(insData[features],sort=False)[features].count()[val])
    for i, v in enumerate(y):                                  # looping count of each unique value in the feature
        # including count of each unique values in the plot 
        plt.annotate(str(v), xy=(i,float(v)), xytext=(i-0.1, v+5), color='black', fontweight='bold')
    plt.show()                                                 # plotting in the notebook
    plcHolder = str(int(plcHolder) + 1)                        # increasing placeholder to next row
    #print("\033[1mFeature {0} : Mean = {1} and Median = {2}\033[0m".format(features,round(np.mean(insData[features]),3),round(np.median(insData[features]),3)))
    print()
    print("*"*80)
    print()

<span style="color:blue;font-size:15px"><b>a. sex column :</b></span> Data is almost equally distributed between female (662) and male (676).<br>
<span style="color:blue;font-size:15px"><b>b. children column :</b></span> The count decreases as the number of children increases. Rows with no (zero) children make up the largest portion of the data (574) and rows with 5 children the smallest (18).<br>
<span style="color:blue;font-size:15px"><b>c. smoker column :</b></span> The number of smokers (274) is much lower than the number of non-smokers (1064).<br>
<span style="color:blue;font-size:15px"><b>d. region column :</b></span> Data is almost equally distributed between northeast (324), southeast (364), southwest (325) and northwest (325), with southeast slightly larger than the rest.
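
The counts quoted above can be turned into shares with `value_counts(normalize=True)`. The sketch below rebuilds a smoker column from the two counts stated in the text (274 and 1064) purely for illustration; it does not read the data file.

```python
import pandas as pd

# Column rebuilt from the counts quoted above (274 smokers, 1064 non-smokers)
smoker = pd.Series(["yes"] * 274 + ["no"] * 1064, name="smoker")

counts = smoker.value_counts()                                  # absolute counts per category
share = smoker.value_counts(normalize=True).mul(100).round(1)   # percentage per category

print(pd.DataFrame({"count": counts, "percent": share}))
```

On the real frame the same two calls on `insData['smoker']` (or any other categorical column) give the figures annotated on the count plots.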

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">i. Pair plot that includes all the columns of the data frame (4 marks)</span>

In [None]:
print()
print("~"*90)
print("\033[1mAnswer 3.i.\033[0m Pair plot of all columns including categorical columns of \033[1m\"insData\"\033[0m DataFrame :")
print("~"*90)
print()
insDataEncoded = copy.deepcopy(insData)         # copying entire insData dataframe into new insDataEncoded dataframe
catCols = insData.select_dtypes(include="object").columns.tolist()   # names of the categorical (object) columns
# using Label Encoder to encode each categorical column of insDataEncoded with values between 0 and n_classes-1
insDataEncoded[catCols] = insDataEncoded[catCols].apply(LabelEncoder().fit_transform)
print("*"*110)
print("""\033[1mNote :\033[0m As \033[1mpairplot plots only numerical columns\033[0m, we have to \033[1mconvert categorical columns\033[0m to numerical columns.
In this case we are using \033[1mLabel Encoding\033[0m to encode the following categorical columns :
\033[1m{0}\033[0m.""".format(insData.select_dtypes(include="object").columns.tolist()))
print("*"*110)
print()
ax = sns.pairplot(insDataEncoded)               # seaborn pairplot to examine relationship between the features
plt.show()    

* **There's an interesting pattern between 'age' and 'charges'. This could be because, for the same ailment, older people are charged more than younger ones.**
* **The only obvious correlation of 'charges' is with 'smoker': it looks like smokers claimed more money than non-smokers.**
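
To make the label-encoding step behind the pair plot easier to inspect, the sketch below encodes a tiny made-up frame with pandas categorical codes. This is a pandas-only alternative to `LabelEncoder`, shown as an assumption for illustration; both assign integer codes in alphabetical order of the labels. The frame `toy` and the dictionary `mappings` are hypothetical names.

```python
import pandas as pd

# Hypothetical categorical columns imitating the insurance data
toy = pd.DataFrame({
    "sex":    ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

encoded = toy.copy()
mappings = {}
for col in toy.columns:
    cat = toy[col].astype("category")                     # categories are sorted alphabetically
    encoded[col] = cat.cat.codes                          # integer code for each row
    mappings[col] = dict(enumerate(cat.cat.categories))   # code -> original label

print(encoded)
print(mappings)
```

Keeping the `mappings` dictionary helps when reading the pair plot axes. Note that the codes for 'region' impose an order that has no real meaning, so trends along that axis should be read with care.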

## <span style="color:blue">4. Answer the following questions with statistical evidence (28 marks)</span>
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">a. Do charges of people who smoke differ significantly from the people who don't? (7 marks)</span>

In [None]:
print()
print("~"*90)
print("\033[1mAnswer 4.a.\033[0m Do charges of people who smoke differ significantly from the people who don't?")
print("~"*90)
print()
print("Let's check relationship of charges for smoker and non-smoker:")
print()
totCharges = insData.groupby('smoker')['charges'].sum()                  # Total charges per smoker category, indexed by 'no' / 'yes'
totCnt = insData.smoker.value_counts()                                   # Total count per smoker category, indexed by 'no' / 'yes'
# creating a pandas dataframe and populating the Total Charges, Total Count and Average Charges for smoker category
# (values are looked up by label so that each column really describes the group it is named after)
print("\033[1m{0}\033[0m".format(pd.DataFrame({'Smoker':[round(totCharges['yes'],2),totCnt['yes'],round(totCharges['yes']/totCnt['yes'],2)],
                                        'Non_Smoker':[round(totCharges['no'],2),totCnt['no'],round(totCharges['no']/totCnt['no'],2)]},
                            index=['Total Charges','Total Count','Average Charges']))) 
print()
print("*"*40)
print()
print("plotting \033[1mscatter plot\033[0m to visualize Charges for smoker and non-smoker :")
print("-"*70)
print()
plt.figure(figsize=(10,5))                                        # setting figure size with width = 10 and height = 5
# seaborn scatterplot to examine relationship between charges and smoker (x and y passed as keywords)
sns.scatterplot(x=insData.age,y=insData.charges,hue=insData.smoker,palette=['blue','red'])
plt.show()                                                        # plotting in the notebook
print()
print("*"*50)
print("\033[1mStating Null Hypothesis and Alternate Hypothesis.\033[0m")
print("*"*50)
print()
Ho = "Charges of Smoker and Non-Smoker are Same."                 # stating null hypothesis and assigning to Ho variable
Ha = "Charges of Smoker and Non-Smoker are not Same."             # stating alternate hypothesis and assigning to Ha variable
print("\033[1mHo :\033[0m {0} \033[1m(Null Hypothesis)\033[0m".format(Ho))
print("\033[1mHa :\033[0m {0} \033[1m(Alternate Hypothesis)\033[0m".format(Ha))
x = np.array(insData[insData.smoker == "yes"].charges)            # charges for smoker-yes assigned to x variable
y = np.array(insData[insData.smoker == "no"].charges)             # charges for smoker-no assigned to y variable
tStatistic, pValue = stats.ttest_ind(x, y, axis=0)                # two-sided, two-sample t-test using ttest_ind
print()
alpha = (0.1,0.05,0.01)                                           # assigning alpha for different level of confidence
confidence = (90,95,99)                                           # assigning level of confidence
j = 0                                                             
for i in (alpha):                                                 # looping with different values of alpha
    print("At \033[1m{0}% confidence Level\033[0m : Alpha = \033[1m{1} :\033[0m\n".format(confidence[j],i))
    if(pValue > i):                                               # comparing if p-value greater than alpha  
        print("The \033[1mp-Value is {0} and > {1}\033[0m, so we \033[1mfail to reject\033[0m the Null Hypothesis and conclude that \033[1m{2}\033[0m"
              .format(pValue,i,Ho)
             )
    else:                                                         # otherwise, i.e. p-value is less than or equal to alpha
        print("The \033[1mp-Value is {0} and <= {1}\033[0m, so we \033[1mreject\033[0m the Null Hypothesis and conclude that \033[1m{2}\033[0m"
              .format(pValue,i,Ha)
             )
    j = j + 1
    print()
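
By default `stats.ttest_ind` pools the variances of the two groups. When the groups may have different spreads, Welch's variant (`equal_var=False`) is a common alternative. The sketch below compares both on randomly generated, hypothetical values; the group sizes, means and spreads are assumptions and are not the notebook's results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical groups with clearly different means and spreads
group_a = rng.normal(loc=30000, scale=11000, size=200)
group_b = rng.normal(loc=8000, scale=6000, size=800)

t_pooled, p_pooled = stats.ttest_ind(group_a, group_b)                  # Student's t-test, pooled variance
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test, separate variances

print("pooled:", t_pooled, p_pooled)
print("welch :", t_welch, p_welch)
```

If both variants lead to the same decision at the chosen alpha, the conclusion does not hinge on the equal-variance assumption.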

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">b. Does bmi of males differ significantly from that of females? (7marks)</span>

In [None]:
print()
print("~"*72)
print("\033[1mAnswer 4.b.\033[0m Does bmi of males differ significantly from that of females?")
print("~"*72)
print()
print("Let's check relationship of bmi for male and female:")
print()
totBmi = insData.groupby('sex')['bmi'].sum()                  # Total bmi per sex category, indexed by 'female' / 'male'
totCnt = insData.sex.value_counts()                           # Total count per sex category, indexed by 'female' / 'male'
# creating a pandas dataframe and populating the Total bmi, Total Count and Average bmi for sex category
# (values are looked up by label so that totals and counts always refer to the same group)
print("\033[1m{0}\033[0m".format(pd.DataFrame({'Male':[round(totBmi['male'],2),totCnt['male'],round(totBmi['male']/totCnt['male'],2)],
                                        'Female':[round(totBmi['female'],2),totCnt['female'],round(totBmi['female']/totCnt['female'],2)]},
                            index=['Total BMI','Total Count','Average BMI']))) 
print()
print("*"*40)
print()
print("plotting \033[1mscatter plot\033[0m to visualize bmi for male and female :")
print("-"*60)
print()
plt.figure(figsize=(10,5))                                     # setting figure size with width = 10 and height = 5
# seaborn scatterplot to examine relationship between bmi and sex (x and y passed as keywords)
sns.scatterplot(x=insData.age,y=insData.bmi,hue=insData.sex,palette=['blue','red']) 
plt.show()                                                     # plotting in the notebook
print()
print("*"*50)
print("\033[1mStating Null Hypothesis and Alternate Hypothesis.\033[0m")
print("*"*50)
print()
Ho = "bmi of males does not differ from that of females."      # stating null hypothesis and assigning to Ho variable
Ha = "bmi of males differs significantly from that of females." # stating alternate hypothesis and assigning to Ha variable
print("\033[1mHo :\033[0m {0} \033[1m(Null Hypothesis)\033[0m".format(Ho))
print("\033[1mHa :\033[0m {0} \033[1m(Alternate Hypothesis)\033[0m".format(Ha))
x = np.array(insData[insData.sex == "male"].bmi)               # bmi for sex-male assigned to x variable
y = np.array(insData[insData.sex == "female"].bmi)             # bmi for sex-female assigned to y variable
tStatistic, pValue = stats.ttest_ind(x, y, axis=0)             # two-sided, two-sample t-test using ttest_ind
print()
alpha = (0.1,0.05,0.01)                                        # assigning alpha for different level of confidence
confidence = (90,95,99)                                        # assigning level of confidence
j = 0                                                             
for i in (alpha):                                              # looping with different values of alpha
    print("At \033[1m{0}% confidence Level\033[0m : Alpha = \033[1m{1} :\033[0m\n".format(confidence[j],i))
    if(pValue > i):                                            # comparing if p-value greater than alpha  
        print("The \033[1mp-Value is {0} and > {1}\033[0m, so we \033[1mfail to reject\033[0m the Null Hypothesis and conclude that \033[1m{2}\033[0m"
              .format(pValue,i,Ho)
             )
    else:                                                      # otherwise, i.e. p-value is less than or equal to alpha
        print("The \033[1mp-Value is {0} and <= {1}\033[0m, so we \033[1mreject\033[0m the Null Hypothesis and conclude that \033[1m{2}\033[0m"
              .format(pValue,i,Ha)
             )
    j = j + 1
    print()

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">c. Is the proportion of smokers significantly different in different genders? (7 marks)</span>

In [None]:
print()
print("~"*82)
print("\033[1mAnswer 4.c.\033[0m Is the proportion of smokers significantly different in different genders?")
print("~"*82)
print()
print("Let's check relationship of smoker and sex:")
print()
pivot = pd.crosstab(insData.sex,insData.smoker)                    # pivoting sex with smoker
print(pivot)
print()
print("*"*40)
print()
print("plotting \033[1mcount plot\033[0m to visualize smoker for male and female :")
print("-"*60)
print()
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
sns.countplot(x=insData.smoker,hue=insData.sex)                    # seaborn countplot to examine count of smoker in different sex
plt.show()                                                         # plotting in the notebook
print()
print("*"*50)
print("\033[1mStating Null Hypothesis and Alternate Hypothesis.\033[0m")
print("*"*50)
print()
Ho = "proportion of smokers does not differ between genders."              # stating null hypothesis and assigning to Ho variable
Ha = "proportion of smokers is significantly different between genders."   # stating alternate hypothesis and assigning to Ha variable
print("\033[1mHo :\033[0m {0} \033[1m(Null Hypothesis)\033[0m".format(Ho))
print("\033[1mHa :\033[0m {0} \033[1m(Alternate Hypothesis)\033[0m".format(Ha))
femaleSmokers = insData[(insData.sex == 'female') & (insData.smoker == 'yes')].smoker.count()  # number of female smokers
maleSmokers = insData[(insData.sex == 'male') & (insData.smoker == 'yes')].smoker.count()      # number of male smokers    
totFemale = insData[insData.sex == 'female'].sex.count()          # number of females in the data
totMales = insData[insData.sex == 'male'].sex.count()             # number of males in the data
# two-sample z-test for proportions: successes = smokers per gender, trials = people per gender
stat, pValue = proportion.proportions_ztest([femaleSmokers, maleSmokers] , [totFemale, totMales])
print()
alpha = (0.1,0.05,0.01)                                           # assigning alpha for different level of confidence
confidence = (90,95,99)                                           # assigning level of confidence
j = 0                                                             
for i in (alpha):                                                 # looping with different values of alpha
    print("At \033[1m{0}% confidence Level\033[0m : Alpha = \033[1m{1} :\033[0m\n".format(confidence[j],i))
    if(pValue > i):                                               # comparing if p-value greater than alpha  
        print("The \033[1mp-Value is {0} and > {1}\033[0m, so we \033[1mfail to reject\033[0m the Null Hypothesis and conclude that \033[1m{2}\033[0m"
              .format(pValue,i,Ho)
             )
    else:                                                         # otherwise, i.e. p-value is less than or equal to alpha
        print("The \033[1mp-Value is {0} and <= {1}\033[0m, so we \033[1mreject\033[0m the Null Hypothesis and conclude that \033[1m{2}\033[0m"
              .format(pValue,i,Ha)
             )
    j = j + 1
    print()
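
To show what a two-sample z-test for proportions measures, the sketch below computes the standard pooled form by hand and checks it against a chi-square test on the matching 2x2 table. The counts (30 of 100 versus 45 of 100) are made up, and `two_proportion_ztest` is a hypothetical helper name; this illustrates the textbook formula and is not a statement about the internals of statsmodels.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(count1, nobs1, count2, nobs2):
    """Pooled two-sided z-test for the difference between two proportions."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    pooled = (count1 + count2) / (nobs1 + nobs2)          # common proportion under the null
    se = np.sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / se
    p_value = 2 * stats.norm.sf(abs(z))                   # two-sided p-value
    return z, p_value

# Hypothetical counts: 30 smokers out of 100 versus 45 smokers out of 100
z, p = two_proportion_ztest(30, 100, 45, 100)

# The same comparison as a 2x2 table (smokers, non-smokers) per group
table = np.array([[30, 70], [45, 55]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)

print("z =", z, "p =", p)
print("chi2 =", chi2, "p =", p_chi2)
```

For a 2x2 table the squared z statistic equals the uncorrected chi-square statistic, so both tests give the same p-value.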

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color:blue">d. Is the distribution of bmi across women with no (zero) children, one child and two children, the same? (7 marks)</span>

In [None]:
# filtering data for sex-female and assigning to femaleInsData variable
femaleInsData = insData[insData['sex'] == "female"]
print()
print("~"*105)
print("\033[1mAnswer 4.d.\033[0m Is the distribution of bmi across women with no (zero) children, one child and two children, the same?")
print("~"*105)
print()
print("Let's check relationship of bmi across Women with No of children :")
print()
totBmi = femaleInsData.groupby('children')['bmi'].sum().tolist()             # Total bmi by No of children, ordered 0 to 5
totCnt = femaleInsData.children.value_counts().sort_index().tolist()         # Total count by No of children, ordered 0 to 5
# creating a pandas dataframe and populating the Total bmi, Total Count and Average bmi for No of Children
print("\033[1m{0}\033[0m".format(pd.DataFrame({'Zero Children':[round(totBmi[0],2),round(totCnt[0]),round(totBmi[0]/totCnt[0],2)],
                                        'One Child':[round(totBmi[1],2),round(totCnt[1]),round(totBmi[1]/totCnt[1],2)],
                                        'Two Children':[round(totBmi[2],2),round(totCnt[2]),round(totBmi[2]/totCnt[2],2)],
                                        'Three Children':[round(totBmi[3],2),round(totCnt[3]),round(totBmi[3]/totCnt[3],2)],
                                        'Four Children':[round(totBmi[4],2),round(totCnt[4]),round(totBmi[4]/totCnt[4],2)],
                                        'Five Children':[round(totBmi[5],2),round(totCnt[5]),round(totBmi[5]/totCnt[5],2)],
                                              },
                            index=['Total BMI','Total Count','Average BMI']).T)) 
print()
print("*"*50)
print()
print("plotting \033[1mbar plot\033[0m to visualize bmi across women for No of Children :")
print("-"*60)
print()
plt.figure(figsize=(10,5))                               # setting figure size with width = 10 and height = 5
sns.barplot(x=femaleInsData.children, y=femaleInsData.bmi)   # seaborn barplot to examine mean of bmi for No of Children
plt.show()                                               # plotting in the notebook
print()
print("*"*50)
print("\033[1mStating Null Hypothesis and Alternate Hypothesis.\033[0m")
print("*"*50)
print()
# stating null hypothesis and assigning to Ho variable
Ho = "the distribution of bmi across women with no (zero) children, one child and two children is the same."
# stating alternate hypothesis and assigning to Ha variable
Ha = "the distribution of bmi across women with no (zero) children, one child and two children is not the same."
print("\033[1mHo :\033[0m {0} \033[1m(Null Hypothesis)\033[0m".format(Ho))
print("\033[1mHa :\033[0m {0} \033[1m(Alternate Hypothesis)\033[0m".format(Ha))
zeroChildren = femaleInsData[femaleInsData['children']==0]['bmi']  # bmi of sex-female with no children
oneChildren = femaleInsData[femaleInsData['children']==1]['bmi']   # bmi of sex-female with one child
twoChildren = femaleInsData[femaleInsData['children']==2]['bmi']   # bmi of sex-female with two children
# One-way ANOVA test for the above three groups i.e. no children, one child and two children
fStatistic, pValue = stats.f_oneway( zeroChildren, oneChildren, twoChildren)
print()
alpha = (0.1,0.05,0.01)                                           # assigning alpha for different level of confidence
confidence = (90,95,99)                                           # assigning level of confidence
j = 0                                                             
for i in (alpha):                                                 # looping with different values of alpha
    print("At \033[1m{0}% confidence Level\033[0m : Alpha = \033[1m{1} :\033[0m\n".format(confidence[j],i))
    if(pValue > i):                                               # comparing if p-value greater than alpha  
        print("The \033[1mp-Value is {0} and > {1}\033[0m, so we \033[1mfail to reject\033[0m the Null Hypothesis and conclude that \033[1m{2}\033[0m"
              .format(pValue,i,Ho)
             )
    else:                                                         # otherwise, i.e. p-value is less than or equal to alpha
        print("The \033[1mp-Value is {0} and <= {1}\033[0m, so we \033[1mreject\033[0m the Null Hypothesis and conclude that \033[1m{2}\033[0m"
              .format(pValue,i,Ha)
             )
    j = j + 1
    print()
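
To clarify what `stats.f_oneway` measures, the sketch below computes the one-way ANOVA F statistic by hand on three made-up groups and compares it with SciPy's result. The group values and the helper name `one_way_f` are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical bmi-like values for three groups
groups = [
    np.array([30.1, 31.5, 29.8, 32.0]),
    np.array([30.9, 29.5, 31.2, 30.4]),
    np.array([31.0, 30.2, 32.5, 29.9]),
]

def one_way_f(groups):
    """F statistic and p-value of a one-way ANOVA, computed from sums of squares."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), all_values.size
    # variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    p_value = stats.f.sf(f_stat, k - 1, n - k)
    return f_stat, p_value

f_manual, p_manual = one_way_f(groups)
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, p_manual)
print(f_scipy, p_scipy)
```

The F statistic compares the spread between group means with the spread within groups, so strictly speaking the test addresses equality of the group means rather than equality of the full distributions.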