# **Data cleaning**

### Objectives
- Clean the raw .csv data from Kaggle

### Inputs
- This file uses the `insurance.csv` data located in the `data/raw` folder.

### Outputs
- The file will save the cleaned data into the `data/cleaned` folder.


### Load the libraries and the data

First we need to load pandas for data manipulation:

In [1]:
import pandas as pd

The data is loaded into the variable `insurance`:

In [2]:
insurance = pd.read_csv("../data/raw/insurance.csv")
print(insurance.shape)
insurance.head()

(1338, 7)


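|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |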


### Check for null entries

In [3]:
insurance.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

As the output shows, every column is fully populated: there are no null values.
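
The same check can also be written as an explicit assertion, so that the notebook fails loudly if a later version of the raw file contains gaps. This is only a minimal sketch: it uses a small hand-made `sample` frame (rows copied from the head output above) in place of `insurance`, and the helper name `count_missing` is hypothetical.

```python
import pandas as pd

# Small stand-in for the real `insurance` DataFrame (first three rows only)
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})


def count_missing(df: pd.DataFrame) -> int:
    """Return the total number of null cells across all columns."""
    return int(df.isnull().sum().sum())


# Stop early if any cell is missing
missing = count_missing(sample)
assert missing == 0, f"Found {missing} null values"
```

In the notebook itself, `count_missing(insurance)` would play the same role.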

### Look into the categorical data
Here I look at the categorical variables to understand them better and to decide how to handle them during data cleaning.

In [4]:
insurance.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

We have 3 categorical variables: `sex`, `smoker` and `region`. Let's look at the unique values of each variable:

In [5]:
print(f"sex: {insurance['sex'].unique()}")
print(f"smoker: {insurance['smoker'].unique()}")
print(f"region: {insurance['region'].unique()}")

sex: ['female' 'male']
smoker: ['yes' 'no']
region: ['southwest' 'southeast' 'northwest' 'northeast']
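
The three `print` calls above could also be written as a loop that reports how often each value occurs, which makes rare or misspelled categories easier to spot. A minimal sketch, assuming a small hand-made `sample` frame in place of `insurance`; the names `categorical_cols` and `levels` are illustrative only.

```python
import pandas as pd

# Small stand-in for the categorical part of `insurance`
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

# The categorical columns identified from the dtypes output
categorical_cols = ["sex", "smoker", "region"]

# Collect the distinct values of each column in sorted order
levels = {col: sorted(sample[col].unique()) for col in categorical_cols}

for col in categorical_cols:
    print(f"{col}: {levels[col]}")
    # Frequency of each value, most common first
    print(sample[col].value_counts())
```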


All of the categorical variables have only a few distinct values and none of them look anomalous, so there is no need to combine any of them. However, we might convert `smoker` to boolean to reduce memory usage.
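
One possible way to do the boolean conversion and to measure its effect is sketched below. It assumes `smoker` contains only `'yes'` and `'no'`, as the unique values above suggest, and it runs on a small hand-made `sample` frame rather than on `insurance`. The names `before`, `after` and `cleaned` are illustrative.

```python
import pandas as pd

# Small stand-in for the `smoker` column of `insurance`
sample = pd.DataFrame({"smoker": ["yes", "no", "no"]})

# Guard against unexpected labels, which the comparison below
# would otherwise silently turn into False
assert set(sample["smoker"].unique()) <= {"yes", "no"}

# Memory used by the string column, in bytes (index excluded)
before = sample["smoker"].memory_usage(index=False, deep=True)

# Work on a copy so the raw data stays untouched
cleaned = sample.copy()
cleaned["smoker"] = cleaned["smoker"] == "yes"  # True for smokers

# Memory used by the boolean column, in bytes (index excluded)
after = cleaned["smoker"].memory_usage(index=False, deep=True)

print(f"smoker column: {before} bytes as strings, {after} bytes as bool")
```

A boolean column stores one byte per row, so the saving grows with the number of rows.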

## Data cleaning