# Sarah A. Thomas  
# Project 2 - Axis Insurance

__Description__ (copied from project assignment) 

__Context__ - Leveraging customer information is of paramount importance for most businesses. In the case of an insurance company, customer attributes like the ones listed below can be crucial in making business decisions. Hence, knowing how to explore such data and generate value from it is an invaluable skill.

__Objective__ – Statistical Analysis of Business Data. Explore the dataset and extract insights from the data. The idea is for you to get comfortable with doing statistical analysis in Python.

You are expected to do the following:

1. Explore the dataset and extract insights using Exploratory Data Analysis.
2. Prove (or disprove) that the medical claims made by people who smoke are greater than those made by people who don't. (Hint: formulate a hypothesis and prove/disprove it.)
3. Prove (or disprove) with statistical evidence that the BMI of females is different from that of males.
4. Is the proportion of smokers significantly different across different regions? (Hint: create a contingency table/cross tab and use the function stats.chi2_contingency().)
5. Is the mean BMI of women with no children, one child, and two children the same? Explain your answer with statistical evidence.

*Consider a significance level of 0.05 for all tests.

__Data Dictionary –__

1. Age - This is an integer indicating the age of the primary beneficiary (excluding those above 64 years, since they are generally covered by the government).
2. Sex - This is the policy holder's gender, either male or female.
3. BMI - This is the body mass index (BMI), which provides a sense of how over or underweight a person is relative to their height. BMI is equal to weight (in kilograms) divided by height (in meters) squared. An ideal BMI is within the range of 18.5 to 24.9.
4. Children - This is an integer indicating the number of children/dependents covered by the insurance plan.
5. Smoker - This is yes or no depending on whether the insured regularly smokes tobacco.
6. Region - This is the beneficiary's place of residence in the U.S., divided into four geographic regions - northeast, southeast, southwest, or northwest.
7. Charges - Individual medical costs billed to health insurance.
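
As a quick illustration of the BMI definition in the data dictionary, the sketch below computes BMI from weight and height. The function name `bmi` and the sample person (70 kg, 1.75 m) are hypothetical and are not taken from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = bmi(70, 1.75)
print(round(example_bmi, 2))  # 22.86
```

A value of about 22.86 falls inside the 18.5 to 24.9 range that the data dictionary describes as ideal.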

# 1 - Load Packages and Read In the Dataset

In [1]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

%matplotlib inline
sns.set(style="darkgrid")
warnings.filterwarnings('ignore')  # suppress warnings to keep the notebook output readable

In [2]:
axis_ins = pd.read_csv('AxisInsurance.csv')

# 2 - Initial Exploration of the Dataset

## 2.1 - Check the first and last 10 rows of the dataset

In [4]:
axis_ins.head(10)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |
| 5 | 31 | female | 25.74 | 0 | no | southeast | 3756.6216 |
| 6 | 46 | female | 33.44 | 1 | no | southeast | 8240.5896 |
| 7 | 37 | female | 27.74 | 3 | no | northwest | 7281.5056 |
| 8 | 37 | male | 29.83 | 2 | no | northeast | 6406.4107 |
| 9 | 60 | female | 25.84 | 0 | no | northwest | 28923.13692 |


In [5]:
axis_ins.tail(10)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 1328 | 23 | female | 24.225 | 2 | no | northeast | 22395.74424 |
| 1329 | 52 | male | 38.6 | 2 | no | southwest | 10325.206 |
| 1330 | 57 | female | 25.74 | 2 | no | southeast | 12629.1656 |
| 1331 | 23 | female | 33.4 | 0 | no | southwest | 10795.93733 |
| 1332 | 52 | female | 44.7 | 3 | no | southwest | 11411.685 |
| 1333 | 50 | male | 30.97 | 3 | no | northwest | 10600.5483 |
| 1334 | 18 | female | 31.92 | 0 | no | northeast | 2205.9808 |
| 1335 | 18 | female | 36.85 | 0 | no | southeast | 1629.8335 |
| 1336 | 21 | female | 25.8 | 0 | no | southwest | 2007.945 |
| 1337 | 61 | female | 29.07 | 0 | yes | northwest | 29141.3603 |


__Observation:__ The first and last 10 rows look clean and are consistent with the data dictionary.
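
Eyeballing 20 rows only goes so far, so the same consistency check could also be done programmatically. The sketch below is one possible approach, not part of the original analysis: it uses a toy frame named `sample` built from the first three rows shown above, and the `allowed` and `checks` names are introduced here for illustration. On the real data the same checks would run against `axis_ins`.

```python
import pandas as pd

# Toy frame built from the first three rows of the head() output above
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Allowed category labels, taken from the data dictionary
allowed = {
    "sex": {"male", "female"},
    "smoker": {"yes", "no"},
    "region": {"northeast", "southeast", "southwest", "northwest"},
}

# One boolean per column: True when every value passes the check
checks = {col: bool(sample[col].isin(values).all()) for col, values in allowed.items()}
checks["age"] = bool((sample["age"] <= 64).all())        # beneficiaries above 64 are excluded
checks["children"] = bool((sample["children"] >= 0).all())
checks["bmi"] = bool((sample["bmi"] > 0).all())
checks["charges"] = bool((sample["charges"] > 0).all())

print(checks)
```

Any column reporting `False` would point to values that fall outside what the data dictionary describes.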

## 2.2 - Check the shape of the data

In [6]:
axis_ins.shape

(1338, 7)

__Observation:__ The dataset has 1338 rows and 7 columns.

## 2.3 - Check the datatype of the variables

In [7]:
axis_ins.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

__Observation:__ `sex`, `smoker`, and `region` are stored as `object` and should be converted to categorical variables.

In [8]:
axis_ins["sex"] = axis_ins["sex"].astype("category")
axis_ins["smoker"] = axis_ins["smoker"].astype("category")
axis_ins["region"] = axis_ins["region"].astype("category")

Re-checking the datatypes to confirm that the conversion took place:

In [9]:
axis_ins.dtypes

age            int64
sex         category
bmi          float64
children       int64
smoker      category
region      category
charges      float64
dtype: object

The conversion succeeded: `sex`, `smoker`, and `region` are now of type `category`.
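
The three conversions can also be written as a single `astype` call that takes a column-to-dtype mapping. The sketch below shows this alternative on a toy frame; `sample`, `categorical_cols`, and `converted` are names introduced here for illustration only.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values from the head() output)
sample = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "bmi": [27.9, 33.77, 33.0],
})

categorical_cols = ["sex", "smoker", "region"]

# astype accepts a {column: dtype} mapping, so all three columns convert in one call
sample = sample.astype({col: "category" for col in categorical_cols})

# Confirm the conversion without reading the dtypes table by eye
converted = all(
    isinstance(sample[col].dtype, pd.CategoricalDtype) for col in categorical_cols
)
print(sample.dtypes)
print(converted)
```

Columns that are not listed in the mapping, such as `bmi` here, keep their original dtype.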

## 2.4 - Check for null values

In [10]:
axis_ins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       1338 non-null   int64   
 1   sex       1338 non-null   category
 2   bmi       1338 non-null   float64 
 3   children  1338 non-null   int64   
 4   smoker    1338 non-null   category
 5   region    1338 non-null   category
 6   charges   1338 non-null   float64 
dtypes: category(3), float64(2), int64(2)
memory usage: 46.2 KB


__Observation:__ Every column has 1338 non-null entries, matching the row count, so there are no null values to contend with.
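
A more direct way to count missing values is `isnull().sum()`, which returns the number of nulls per column. The sketch below uses a toy frame with one value deliberately removed so the output is not all zeros; `sample` and `null_counts` are illustrative names, and the missing value is an assumption made for the example rather than a property of the dataset.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, np.nan, 33.0],  # one value removed on purpose for illustration
    "smoker": ["yes", "no", "no"],
})

# Number of null entries in each column
null_counts = sample.isnull().sum()
print(null_counts)

# Total number of null entries in the whole frame
print(int(null_counts.sum()))
```

On a frame with no missing data, every per-column count and the total would be 0.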

## 2.5 - Analyze quantitative variables

In [11]:
axis_ins.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


__Observations:__  
* Age ranges from 18 to 64, with mean and median very close in value (mean = 39.207, median = 39.000). This suggests little skew.
* BMI ranges from 15.96 to 53.13, with mean and median very close in value (mean = 30.663, median = 30.400). This suggests little skew. Both values sit above the ideal range of 18.5 to 24.9 given in the data dictionary.
* Number of children ranges from 0 to 5 (mean = 1.095, median = 1.000). Although the mean and median are close, this is a discrete count with a floor at 0, a 25th percentile of 0, and a maximum of 5, so the distribution is likely right-skewed and the mean-median comparison alone is not conclusive.
* Charges range from 1,121.87 to 63,770.43, a wide range. With the mean (13,270.42) well above the median (9,382.03), the data is right-skewed.
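
Comparing the mean with the median is only a rough indicator of skew. The sketch below shows how the sample skewness could be computed directly with `Series.skew()`. The `values` series is synthetic data chosen to have a long right tail; it is not taken from the dataset.

```python
import pandas as pd

# Synthetic right-skewed values (not from the dataset): one large value stretches the tail
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])

mean_val = values.mean()
median_val = values.median()
skew_val = values.skew()  # sample skewness; a positive value means a longer right tail

print(f"mean={mean_val:.2f}, median={median_val:.2f}, skew={skew_val:.2f}")
```

Applied to the real data, the same call on each numeric column (for example `axis_ins["charges"].skew()`) would put a number on the skew that the observations above infer from the summary statistics.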

## 2.6 - Analyze categorical variables

In [12]:
axis_ins.describe(include = ["category"])

| | sex | smoker | region |
|---|---|---|---|
| count | 1338 | 1338 | 1338 |
| unique | 2 | 2 | 4 |
| top | male | no | southeast |
| freq | 676 | 1064 | 364 |


__Observations:__  
* Slightly more policy holders are male (676) than female (662).
* Most policy holders do not smoke (1064 of 1338, about 79.5%).
* The southeast is the most common region (364 of 1338, about 27.2%), but no single region accounts for a majority of policy holders.
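
`describe()` reports only the most frequent category in each column. To see every category with its count and share, `value_counts()` can be used with and without `normalize=True`. The sketch below uses a small synthetic `region` series, not the real data, and the variable names are illustrative.

```python
import pandas as pd

# Synthetic region labels (not the real data)
region = pd.Series(["southeast", "southeast", "southwest", "northwest", "northeast"])

counts = region.value_counts()                 # absolute frequency per category
shares = region.value_counts(normalize=True)   # relative frequency; sums to 1

print(counts)
print(shares.round(3))
```

On the real data, `axis_ins["region"].value_counts(normalize=True)` would show how evenly the four regions are represented.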

# 3 - Univariate Analysis