# **(ADD THE NOTEBOOK NAME HERE)**

## Objectives

* Identify which variables most affect insurance charges using exploratory data analysis and visualization.
* Compare insurance charges between smokers and non-smokers to determine the impact of smoking status.
* Analyse the relationship between BMI and insurance charges by grouping BMI and visualizing average charges.
* Utilise both interactive (Plotly) and static (Matplotlib) visualizations to effectively communicate findings.

## Inputs

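* The cleaned insurance data file: cleaned_insurance.csv (located in the data folder)
* Python libraries: pandas, NumPy, matplotlib, seaborn, and plotly
* Columns required: age, sex, bmi, children, smoker, charges, the one-hot encoded region columns (region_northwest, region_southeast, region_southwest), and bmi_category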

## Outputs

* Visualizations:
  - An interactive Plotly scatter plot (BMI vs Charges, colored by smoker status)
  - A Matplotlib boxplot (Charges by smoker status)
  - A Matplotlib bar plot (Average charges by BMI group)
* Cleaned data file: cleaned_insurance.csv
* Summary statistics and insights from the exploratory data analysis

---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory must be changed before running them in the editor.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Student\\OneDrive\\Documents\\VS code projects\\Individual P\\Project-1--Individual\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
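
Re-running the directory cell moves the working directory up one more level each time. The following is a minimal sketch of a guard against that, assuming the notebooks live in a folder named jupyter_notebooks (as in the path printed earlier). The helper name `ensure_project_root` is hypothetical.

```python
import os


def ensure_project_root(notebook_folder: str = "jupyter_notebooks") -> str:
    """Move to the parent directory only while still inside the notebook folder."""
    current = os.getcwd()
    # Assumption: the notebooks sit in a folder with exactly this name.
    if os.path.basename(current) == notebook_folder:
        os.chdir(os.path.dirname(current))
    return os.getcwd()


# Safe to call repeatedly: the second call leaves the directory unchanged.
print(ensure_project_root())
```

Note that once the working directory is the project root, relative paths such as the one passed to `pd.read_csv` are resolved from there rather than from the notebook folder.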

# Section 1 \- Extracting Data

Section 1 content
- Imported the necessary Python libraries for data analysis and visualization: pandas, NumPy, matplotlib, seaborn, and plotly.
- Loaded the insurance dataset from the data folder.
- Displayed the first few rows to get an initial sense of the dataset’s structure.
- Used .info() to review column data types and check for missing values.
- Applied .describe() to generate summary statistics for the numerical variables.


Import the packages

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Extract the data 

In [13]:
df = pd.read_csv('../data/cleaned_insurance.csv')
df.head()

| | age | sex | bmi | children | smoker | charges | region_northwest | region_southeast | region_southwest | bmi_category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 1 | 27.9 | 0 | 1 | 16884.924 | False | False | True | Overweight |
| 1 | 18 | 0 | 33.77 | 1 | 0 | 1725.5523 | False | True | False | Obese |
| 2 | 28 | 0 | 33.0 | 3 | 0 | 4449.462 | False | True | False | Obese |
| 3 | 33 | 0 | 22.705 | 0 | 0 | 21984.47061 | True | False | False | Normal |
| 4 | 32 | 0 | 28.88 | 0 | 0 | 3866.8552 | True | False | False | Overweight |
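
Before going further, it can help to confirm that the loaded file contains the columns the later cells rely on. The following is a minimal sketch that runs on a small in-memory sample mirroring two of the rows shown here rather than on the real file. The helper `missing_columns` and the `REQUIRED_COLUMNS` tuple are hypothetical names.

```python
import io

import pandas as pd

# Assumption: these are the columns used by the analysis in this notebook.
REQUIRED_COLUMNS = ("age", "sex", "bmi", "children", "smoker", "charges", "bmi_category")


def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]


sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,charges,bmi_category\n"
    "19,1,27.9,0,1,16884.924,Overweight\n"
    "18,0,33.77,1,0,1725.5523,Obese\n"
)
sample = pd.read_csv(sample_csv)

print(missing_columns(sample))                           # nothing missing
print(missing_columns(sample.drop(columns=["smoker"])))  # reports the dropped column
```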


Reviewed the dataset’s structure and key attributes to understand its overall composition.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1337 entries, 0 to 1336
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               1337 non-null   int64  
 1   sex               1337 non-null   int64  
 2   bmi               1337 non-null   float64
 3   children          1337 non-null   int64  
 4   smoker            1337 non-null   int64  
 5   charges           1337 non-null   float64
 6   region_northwest  1337 non-null   bool   
 7   region_southeast  1337 non-null   bool   
 8   region_southwest  1337 non-null   bool   
 9   bmi_category      1337 non-null   object 
dtypes: bool(3), float64(2), int64(4), object(1)
memory usage: 77.2+ KB


In [15]:
df.describe()

| | age | sex | bmi | children | smoker | charges |
|---|---|---|---|---|---|---|
| count | 1337.0 | 1337.0 | 1337.0 | 1337.0 | 1337.0 | 1337.0 |
| mean | 39.222139 | 0.495138 | 30.663452 | 1.095737 | 0.204936 | 13279.121487 |
| std | 14.044333 | 0.500163 | 6.100468 | 1.205571 | 0.403806 | 12110.359656 |
| min | 18.0 | 0.0 | 15.96 | 0.0 | 0.0 | 1121.8739 |
| 25% | 27.0 | 0.0 | 26.29 | 0.0 | 0.0 | 4746.344 |
| 50% | 39.0 | 0.0 | 30.4 | 1.0 | 0.0 | 9386.1613 |
| 75% | 51.0 | 1.0 | 34.7 | 2.0 | 0.0 | 16657.71745 |
| max | 64.0 | 1.0 | 53.13 | 5.0 | 1.0 | 63770.42801 |
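
Overall statistics hide the group differences that the objectives ask about. The following is a minimal sketch of grouped summaries on a small made-up sample that uses the same column names. The values are illustrative, not taken from the dataset, and smoker is assumed to be encoded as 1 for smokers and 0 for non-smokers.

```python
import pandas as pd

# Illustrative values only; column names match the cleaned dataset.
sample = pd.DataFrame({
    "smoker": [1, 0, 0, 1, 0, 0],
    "bmi_category": ["Overweight", "Obese", "Obese", "Normal", "Normal", "Overweight"],
    "charges": [17000.0, 1700.0, 4500.0, 22000.0, 3000.0, 3900.0],
})

# Count, mean, and median charges for each smoker code.
charges_by_smoker = sample.groupby("smoker")["charges"].agg(["count", "mean", "median"])

# Average charges per BMI group, highest first.
mean_by_bmi = sample.groupby("bmi_category")["charges"].mean().sort_values(ascending=False)

print(charges_by_smoker)
print(mean_by_bmi)
```

Running the same two `groupby` calls on `df` would give the per-group figures for the real data.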


# Section 2 \- Transforming Data

In this section, I carried out data cleaning and preparation steps:

- Verified the absence of missing values using .isnull().sum().
- Identified duplicate records so that they could be removed, if necessary, to ensure data quality.

Check for any missing data.

In [18]:
df.isnull().sum()

age                 0
sex                 0
bmi                 0
children            0
smoker              0
charges             0
region_northwest    0
region_southeast    0
region_southwest    0
bmi_category        0
dtype: int64

There are no missing values in the dataset.


Check for duplicate rows

In [19]:
df.duplicated().sum()

np.int64(0)

There are no duplicate rows in the dataset.
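
Had the check returned a non-zero count, the duplicates could be removed before the analysis. The following is a minimal sketch on a made-up frame (not the insurance data) in which the third row repeats the first.

```python
import pandas as pd

# Illustrative frame: row 2 is an exact copy of row 0.
sample = pd.DataFrame({
    "age": [19, 18, 19],
    "bmi": [27.9, 33.77, 27.9],
    "charges": [16884.924, 1725.5523, 16884.924],
})

# duplicated() flags rows that are identical to an earlier row.
n_duplicates = int(sample.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```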

---

# Section 3 \- Data Visualisation

In this section, I have explored the insurance dataset using various data visualisation techniques. The visualisations help identify patterns and relationships between key variables such as charges, BMI, age, and smoking status. By comparing groups such as smokers versus non-smokers and analysing how different factors influence insurance charges, I gain deeper insights into the data and highlight important trends.
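
As a static overview of the group comparisons described above, the following is a minimal Matplotlib sketch of a boxplot of charges by smoker status and a bar plot of average charges by BMI group. It uses a small made-up sample rather than the real data, assumes smoker is encoded as 0 (non-smoker) and 1 (smoker), and the function name `plot_static_summaries` is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_static_summaries(frame: pd.DataFrame):
    """Boxplot of charges by smoker status and bar plot of mean charges by BMI group."""
    fig, (ax_box, ax_bar) = plt.subplots(1, 2, figsize=(11, 4))

    # Assumption: smoker is encoded as 0 (non-smoker) and 1 (smoker).
    groups = [frame.loc[frame["smoker"] == code, "charges"].to_numpy() for code in (0, 1)]
    ax_box.boxplot(groups)
    ax_box.set_xticks([1, 2])
    ax_box.set_xticklabels(["Non-smoker", "Smoker"])
    ax_box.set_ylabel("Insurance Charges")
    ax_box.set_title("Charges by Smoker Status")

    # Average charges for each BMI group.
    mean_by_bmi = frame.groupby("bmi_category")["charges"].mean()
    ax_bar.bar(list(mean_by_bmi.index), mean_by_bmi.to_numpy())
    ax_bar.set_xlabel("BMI Group")
    ax_bar.set_ylabel("Average Insurance Charges")
    ax_bar.set_title("Average Charges by BMI Group")

    fig.tight_layout()
    return fig, ax_box, ax_bar


# Illustrative values only; column names match the cleaned dataset.
sample = pd.DataFrame({
    "smoker": [1, 0, 0, 1, 0, 0],
    "bmi_category": ["Overweight", "Obese", "Obese", "Normal", "Normal", "Overweight"],
    "charges": [17000.0, 1700.0, 4500.0, 22000.0, 3000.0, 3900.0],
})
fig, ax_box, ax_bar = plot_static_summaries(sample)
# In a notebook the figure renders when the cell finishes; call plt.show() in a script.
```

Passing `df` instead of `sample` would draw the same two plots for the full dataset.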

### 1. Does BMI relate to insurance charges, and does smoking status affect this relationship?

To investigate this, I used an interactive Plotly scatter plot to visualise the relationship between BMI and insurance charges, with points coloured by smoking status. This allowed me to observe how charges vary with BMI and whether there are noticeable differences between smokers and non-smokers.

In [25]:
fig = px.scatter(
    df, x='bmi', y='charges', color='smoker',
    title='BMI vs Charges (Colored by Smoker Status)',
    labels={'bmi': 'BMI', 'charges': 'Insurance Charges', 'smoker': 'Smoker'}
)
fig.show()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)
