# **U.S. Health Insurance - Project 1**
### Analyzing qualitative and quantitative variables.

# **Importing Necessary Python Modules**

Python supports a variety of open-source add-ons called **modules** that add extra features to the basic setup. The name of each module follows the `import` statement, and its purpose is given in a comment after the hash symbol (#).



In [1]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
import plotly.express as px         #Graphing
import matplotlib.pyplot as plt     #Graphing
from IPython.display import Image   #Display images
import warnings                     #Ignore version warnings
warnings.simplefilter('ignore', FutureWarning)


In [2]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://blog.amopportunities.org/wp-content/uploads/2019/07/Health-Insurance.jpg'

# Display the image
Image(url=image_url)

# **Context**

This dataset supports a simple yet illuminating study of risk underwriting in health insurance: how the various attributes of the insured interact, and how they affect the insurance premium.


# **About the Dataset**

This dataset contains 1338 rows of insured data, where the insurance charges are given against the following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker, and Region. There are no missing or undefined values in the dataset.

Body mass index (BMI) indicates whether body weight is high or low relative to height. It is an objective index of body weight, calculated as weight in kilograms divided by the square of height in meters (kg/m<sup>2</sup>), ideally 18.5 to 24.9.
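As a quick illustration of the kg/m<sup>2</sup> unit, BMI is conventionally computed as weight in kilograms divided by height in meters squared. The sketch below uses a hypothetical person (70 kg, 1.75 m) who is not in the dataset, and the helper name `bmi` is made up for this example.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Return body mass index in kg/m^2 (weight divided by height squared)."""
    return weight_kg / height_m ** 2

# Hypothetical person, not taken from the dataset: 70 kg and 1.75 m tall
example_bmi = round(bmi(70, 1.75), 2)
print(example_bmi)  # 22.86

# Compare against the 18.5 to 24.9 range quoted above
in_ideal_range = 18.5 <= example_bmi <= 24.9
print(in_ideal_range)  # True
```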

| **Variable**| **Description**                                      |
|:------------|:-----------------------------------------------------|
| Age         | The age of the primary beneficiary   |
| Sex         | Male or Female                       |
| BMI         | Body Mass Index (kg/m<sup>2</sup>)    |
| Number of children | Number of dependents covered by the insurance |
| Smoker      | Yes or No                                    |
| Region      | The beneficiary's residential area in the U.S.<br>northeast, southeast, southwest, and northwest                            |
| Charges     | Individual medical costs billed by health insurance  |



Let's take a look at the data. To do this, we first import it directly from the URL below.



# **A Snippet of the Data**

In [3]:
url='https://raw.githubusercontent.com/thamilton562/STAT108_Projects_Students/main/DataSets/US%20Health%20Insurance.csv'
df=pd.read_csv(url)

Next, we can display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we'll use the `pd.set_option` function.

In [4]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace each ellipsis (...) with the relevant names or code.  
* For problems that require a written response, double click the text box to start typing.
* Reference the 3 tutorials from activity for assistance.
* Attend office hours if you still need help.

## **QUESTION 1**
Determine whether the four variables below are qualitative or quantitative. If they are quantitative, specify whether they are continuous or discrete.

| Variable           | Classification            |
|:-------------------|:--------------------------|
| Age                | Quantitative, continuous  |
| Sex                | Qualitative               |
| BMI                | Quantitative, continuous  |
| Number of children | Quantitative, discrete    |
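One way to sanity-check a classification like the one above is to look at how each column is stored. This is only a rough sketch: the `sample` DataFrame holds the first five rows copied from the data snippet, the `kinds` dictionary is a name invented here, and the storage type is a hint rather than a rule (age is stored as whole numbers in this file but is still classified as continuous above).

```python
import pandas as pd

# First five rows copied from the data snippet above (illustration only)
sample = pd.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'sex': ['female', 'male', 'male', 'male', 'male'],
    'bmi': [27.900, 33.770, 33.000, 22.705, 28.880],
    'children': [0, 1, 3, 0, 0],
})

# Record whether each column holds text, whole numbers, or decimals
kinds = {}
for column in sample.columns:
    if not pd.api.types.is_numeric_dtype(sample[column]):
        kinds[column] = 'text (qualitative)'
    elif pd.api.types.is_integer_dtype(sample[column]):
        kinds[column] = 'whole numbers (quantitative)'
    else:
        kinds[column] = 'decimals (quantitative)'

for column, kind in kinds.items():
    print(f'{column}: {kind}')
```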

## **QUESTION 2**

Construct a frequency table, relative frequency table, and relative frequency bar chart to describe the distribution of region. State any fact that jumps out to you.

**2a)** Construct a table that contains the frequency and relative frequency distribution for region. Round relative frequency to 3 decimal places.

In [5]:
# Define the name of the variable to be analyzed
variable = df['region']                    #variable = df['...']

# Create the frequency table. By default, value_counts() sorts the
# categories from most frequent to least frequent.
# Rename "count" to "Frequency".
freq_table = variable.value_counts()
freq_table = freq_table.rename('Frequency')

# Create the relative frequency table, and rename the counts column to
#   Relative Frequency.
relative_freq_table = freq_table/len(df)                                            #relative_freq_table=freq_table/... HINT: look back at Project 0 or Tutorial 1.
relative_freq_table = relative_freq_table.rename('Relative Frequency').round(3)     # relative_freq_table = relative_freq_table.rename('...').round(...)

# Combine both tables
combined_table=pd.concat([freq_table, relative_freq_table], axis=1)                 # combined_table=pd.concat([..., ...], axis=1)

# Label the index (the category names) as 'Region'
combined_table.index.name = 'Region'

# Print the combined table.
combined_table                                                          # ...


Region,Frequency,Relative Frequency
southeast,364,0.272
southwest,325,0.243
northwest,325,0.243
northeast,324,0.242
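The relative frequencies above can be cross-checked with `value_counts(normalize=True)`, which divides each count by the total in one step. The sketch below does not load the real file: it rebuilds a stand-in `region` column from the four frequencies printed in the table, so the variable names here are for illustration only.

```python
import pandas as pd

# Frequencies copied from the table above
counts = {'southeast': 364, 'southwest': 325, 'northwest': 325, 'northeast': 324}

# Rebuild a stand-in 'region' column with one entry per beneficiary
region = pd.Series(
    [name for name, n in counts.items() for _ in range(n)],
    name='region',
)

# Frequency and relative frequency, each in a single call
freq = region.value_counts().rename('Frequency')
rel_freq = region.value_counts(normalize=True).round(3).rename('Relative Frequency')

check_table = pd.concat([freq, rel_freq], axis=1)
check_table.index.name = 'Region'
print(check_table)
```

The four relative frequencies should match the table above; any difference would point to a counting or rounding slip.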


**2b)** Construct a relative frequency bar chart to describe the distribution of region.

In [6]:
# The argument in (...) tells the system which table to use to make the
#      bar graph. Since we want a relative frequency bar chart, which is the
#      correct label to replace the ellipsis (...) in this code?
dfrf = pd.DataFrame(relative_freq_table)  #dfrf = pd.DataFrame(...)

# Create the bar graph
fig = px.bar(x=dfrf.index,y=dfrf['Relative Frequency'],
             title='Relative Frequency Distribution Bar Chart') #title=...,

# Update axis labels
fig.update_layout(xaxis_title="Region of Beneficiary's Residence")        #xaxis_title=...
fig.update_layout(yaxis_title="Relative Frequency")             #yaxis_title=...

# Show the bar graph
fig.show()

**2c)** Describe the distribution of region. State any fact that jumps out to you.

The patients are approximately evenly split between the four regions.

-OR-

The patients are more likely to live in the Southeast, and are about equally likely to live in the Southwest, Northwest, or Northeast.

-OR-

All four regions are represented approximately evenly in the dataset.

## **Question 3**

For question 3 you will analyze a quantitative variable. Find your variable based on your last name and use that variable when answering all parts of question 3.  

Once you find your variable description, scroll up to "About the Dataset" to find the variable name. Then look at the "Snippet of Data" to get the exact variable name, especially since variable names are case sensitive.

| **Last Name** | **Variable Description**               |
|:--------------|:---------------------------------------|
| A-L           | BMI                |
| M-Z           | Charges    |



**3a)** Construct a histogram for your variable. Use number of bins = 20.

# **A-L**

In [7]:
# Create the histogram, with the x-axis being the variable specified in the
#   table based on your last name.
fig = px.histogram(x=df['bmi'],nbins = 20,                           #fig = px.histogram(x=df['...'],nbins = ...,
             title='Histogram of BMI',                  #title='...',
             labels={'x':'BMI'})                        #labels={'x':'...')

# Update the vertical axis title.
fig.update_layout(yaxis_title='Frequency')                       #fig.update_layout(yaxis_title="...")

# Print the histogram
fig.show()

# **M-Z**

In [8]:
fig = px.histogram(x=df['charges'],nbins = 20,                                 #fig = px.histogram(x=df[...],nbins = ...,
             title='Histogram of Insurance Charges',                 #title=...,
             labels={'x':'Insurance Charges'})      #labels={'x':...,'y':...})
fig.update_layout(yaxis_title="Frequency")
fig.show()

**3b)** Construct a boxplot for your variable.  

# **A-L**

In [9]:
# Create the boxplot, with a title, and specify horizontal axis label.
px.box(x=df['bmi'],                                         #px.box(x=df[...],
       title='Boxplot of BMI',                 #title=...,
       labels={'x':'BMI'})                     #labels={'x':...})
#fig.update_layout(xaxis_title="BMI")

# **M-Z**

In [None]:
px.box(x=df['charges'],
       title='Boxplot of Insurance Charges',
       labels={'x':'Insurance Charges'})

**3c)** Calculate the following summary statistics for your variable: 5 number summary, mean, and standard deviation. Round to two decimal places.

In [None]:
# Calculate the numerical summaries
# Indicate your variable.
descriptive_stats = df[['bmi','charges']].describe().round(2)       #descriptive_stats = df[[...]].describe().round(...)

# Print the results.
descriptive_stats                                                                               #...


Unnamed: 0,bmi,charges
count,1338.0,1338.0
mean,30.66,13270.42
std,6.1,12110.01
min,15.96,1121.87
25%,26.3,4740.29
50%,30.4,9382.03
75%,34.69,16639.91
max,53.13,63770.43


**3d)** Use information from (3a), (3b), and (3c) to describe your variable in terms of shape, center, spread, and outliers.
* Use the correct center and the correct spread based on the shape of the distribution.
* Specify which center and which spread you are using. For ex: Say "The mean is ..." or "The median is ...", rather than "The center is ..."
* When addressing outliers, if any, state the number of outliers and provide a range. EX: There are 12 outliers ranging from 2.341 inches to 79.742 inches.
* If there are more than 15 outliers, use "more than 15" to describe how many, and then give the range.
* Include units, if any, for all numbers.
* For BMI units, (kg/m^2), you can type the superscript with '^': kg/m^2
* For charges units, `$`, you must use backticks. The backtick is located to the left of the 1 on a computer keyboard. Ex: `$243.21` (you will have to double click this text box to see what is typed).

**NOTE:** Students must use the center/spread that match their chosen shape. Skewed/outliers = median/IQR, Symm/no outliers = mean/sd

**A-L:** The distribution of BMI is approximately symmetric. There are 9 outliers ranging from 47.41 kg/m^2 to 53.13 kg/m^2. The median is 30.4 kg/m^2. The IQR is 8.39 kg/m^2.

**M-Z:** The distribution of charges is skewed right. There are more than 15 outliers ranging from `$34,617.84` to `$63,770.43`. The median is `$9,382.03`. The IQR is `$11,899.62`.
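The IQR values and outlier cutoffs behind these answers can be reproduced from the quartiles printed in (3c). The sketch below assumes the common 1.5 × IQR rule for flagging outliers; the helper name `iqr_fences` is made up for this example, and a plotting library may compute quartiles slightly differently than `describe()` does.

```python
def iqr_fences(q1: float, q3: float) -> tuple[float, float, float]:
    """Return (IQR, lower fence, upper fence) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Quartiles copied from the summary statistics in (3c)
bmi_iqr, bmi_lower, bmi_upper = iqr_fences(26.30, 34.69)
charges_iqr, charges_lower, charges_upper = iqr_fences(4740.29, 16639.91)

print(f'BMI:     IQR = {bmi_iqr:.2f}, fences = ({bmi_lower:.3f}, {bmi_upper:.3f})')
print(f'Charges: IQR = {charges_iqr:.2f}, fences = ({charges_lower:.2f}, {charges_upper:.2f})')
```

Under this rule, values above the upper fence (or below the lower fence) are the ones counted as outliers. The lower fence for charges is negative, so only high charges can be flagged.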

**3e)** Interpret the standard deviation in context.

**A-L:** The typical BMI falls within 6.1 kg/m^2 of the mean BMI.

**M-Z:** The typical insurance charge falls within `$12,110.01` of the mean insurance charge.
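For reference, the `std` row produced by `describe()` is the sample standard deviation, which divides by n - 1 rather than n. The toy numbers below are made up to show the difference and are not from the dataset.

```python
import pandas as pd

# Made-up values chosen so the arithmetic is easy to follow (mean = 5)
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_sd = values.std()            # pandas default: ddof=1, divides by n - 1
population_sd = values.std(ddof=0)  # divides by n

print(round(sample_sd, 3))      # 2.138
print(round(population_sd, 3))  # 2.0

# describe() reports the same sample standard deviation
describe_sd = values.describe()['std']
```

With 1338 rows the two versions differ very little, but it is worth knowing which one the output shows.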

**3f)** Interpret the IQR in context.

**A-L:** The range of the middle half (50%) of BMI values is 8.39 kg/m^2.

**M-Z:** The range of the middle half (50%) of insurance charges is `$11,899.62`.

## **QUESTION 4**

How do the BMI and insurance charges compare in the Southeast and Southwest?

Calculate the mean BMI and mean charges for both regions. Round to two decimal places. Compare the results and answer a question about the code.

**4a)** Calculate and state the mean BMI and mean insurance charges for the Southeast and Southwest regions. Round to two decimal places.

In [None]:
# Filter the rows for each region, keep the two quantitative columns,
# then take the mean of each column and round to two decimal places.
southeast_means = df[df['region'] == 'southeast'][['bmi', 'charges']].mean().round(2)      #southeast_means = df[df['...'] == '...'][['...', '...']].mean().round(...)
southwest_means = df[df['region'] == 'southwest'][['bmi', 'charges']].mean().round(2)      #southwest_means = df[df['...'] == '...'][['...', '...']].mean().round(...)

# Combine the two tables and specify labels.
combined_means = pd.DataFrame({'BMI': [southeast_means['bmi'], southwest_means['bmi']],
                         'Charges': [southeast_means['charges'], southwest_means['charges']]},
                        index=['Southeast', 'Southwest'])

# Print the table
combined_means # ...

Unnamed: 0,BMI,Charges
Southeast,33.36,14735.41
Southwest,30.6,12346.94
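An alternative to filtering each region separately is `groupby`, which computes the means for every region in one pass. The sketch below runs on a five-row stand-in built from the Southeast and Southwest rows visible in the data snippet, so its means will not match the full-data table above; `sample` and `region_means` are names chosen for this example.

```python
import pandas as pd

# Southeast and Southwest rows copied from the data snippet (illustration only)
sample = pd.DataFrame({
    'region':  ['southwest', 'southeast', 'southeast', 'southeast', 'southwest'],
    'bmi':     [27.900, 33.770, 33.000, 36.850, 25.800],
    'charges': [16884.92400, 1725.55230, 4449.46200, 1629.83350, 2007.94500],
})

# One row of means per region, rounded to two decimal places
region_means = sample.groupby('region')[['bmi', 'charges']].mean().round(2)
print(region_means)
```

On the full DataFrame, the same call would return all four regions, and the two rows of interest could then be selected with `.loc[['southeast', 'southwest']]`.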


**4b)** Compare the results.

Patients in the Southeast have a higher mean BMI and higher mean insurance charges.

**4c)** In the table below are some snippets of code.

| **Last Name Initial** | **Code**                     |
|:----------------------|:-----------------------------|
| A-L                    | df[df['region'] == 'southeast'] |
| M-Z                    | df['region'] == 'southwest'               |


Based on your last name, interpret the snippet of code.

**A-L:** Creates a DataFrame for only the Southeast region.

**M-Z:** Checks, row by row, whether the region equals southwest, returning True or False for each row.
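The difference between the two snippets can be seen on a small stand-in table: the inner comparison produces a True/False value for each row, and wrapping it in `df[...]` keeps only the rows marked True. The five rows below are copied from the top of the data snippet, and the names `sample`, `mask`, and `subset` are chosen for this example.

```python
import pandas as pd

# First five rows of the data snippet, two columns only (illustration only)
sample = pd.DataFrame({
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
    'bmi':    [27.900, 33.770, 33.000, 22.705, 28.880],
})

# The comparison alone gives one True/False value per row
mask = sample['region'] == 'southeast'
print(mask.tolist())  # [False, True, True, False, False]

# Using the comparison inside sample[...] keeps only the True rows
subset = sample[mask]
print(subset)
```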


## **QUESTION 5**

Generate a paragraph of at least 100 words to address one of the following questions. That is, answer only 5a or 5b, but not both.

**5a)** Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

...

--OR--

**5b)** Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.

...


<br><br>
### Once you are done and ready to submit, follow the instructions below to save as a PDF and submit to GradeScope.

### Save as PDF
1. Run all code one last time
2. File-> Print Preview opens in a new browser window
3. Verify you can see all graphs. If not, go back to step 1.
4. File -> Print (or Ctrl+P / Cmd+P)
5. Change destination to PDF (don't save, yet)
6. Scroll through preview to make sure you can see your graphs entirely. If not, click Cancel. Make the browser window narrower. Go back to step 4.
7. Repeat steps 4-6 until you can see your graphs completely. But do not make them too narrow.
8. Save the PDF, taking note of where it is saved.

### Submit to GradeScope
1. Login to the Canvas course
2. Click on GradeScope in the course navigation.
3. If you see multiple courses in GradeScope, click on the STAT 108 course
4. Click on the "Tutorial 2 Practice Upload" assignment
5. Click on "Submit Work", select PDF
6. Select the PDF you just created
7. You need to tell GradeScope which page each problem answer/output is on. You should see a list of problems on the left and a display of pages (thumbnails) on the right. Assign pages to questions by clicking on the question number on the left, then clicking on all pages that question is on.
8. After ALL questions have been assigned to their respective page(s), click "Submit"
