## Get our environment set up

The first thing we'll need to do is load the libraries and dataset we'll be using. We'll be working with one dataset containing information on air quality conditions (humidity, temperature, pressure, CO2 levels) and the number of microparticles (between 0.1 and 5 μm in size, per m3 of air) in an operating room.

In [None]:
# Modules we'll use
import pandas as pd
import numpy as np

# Define the names of the columns we want treated as dates or datetimes
date_cols = ['ts']

# Read in our data as a Pandas DataFrame
df = pd.read_csv("/content/gams_indoor.csv", parse_dates=date_cols)

# set seed for reproducibility
np.random.seed(0)

## Get sample of our dataset

We simply display the first 5 rows of our data.

In [None]:
df.head()

Unnamed: 0,ts,co2,humidity,pm10,pm25,temperature,voc
0,2016-11-21 00:47:03,708.0,72.09,10.2,9.0,20.83,0.062
1,2016-11-21 00:48:03,694.0,70.95,10.9,10.1,21.01,0.062
2,2016-11-21 00:49:03,693.0,69.12,10.2,9.9,21.2,0.062
3,2016-11-21 00:50:03,692.0,68.83,9.6,9.6,21.37,0.062
4,2016-11-21 00:51:03,690.0,68.6,9.4,8.4,21.49,0.062


## Get the features of our dataset

As explained above, the following characteristics are measured to determine whether an operating room presents an infectious risk:

*   microparticles by size (< 10 μm and < 2.5 μm)
*   temperature
*   pressure
*   CO2 levels
*   humidity

### Categorical columns

`risk` is our categorical column. It defines the level of infectious risk based on the different features of a row (that is, at a given point in time).

### Date type columns

The `ts` column contains date values, so the first thing to do is make sure that the data type of this column is the right one.



In [None]:
# Print the different columns of our DataFrame
for col in df.columns:
    print(col)

ts
co2
humidity
pm10
pm25
temperature
voc


# Process for labeling raw data 

First and foremost, we need to "manually" label our data.
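
As a rough illustration of what such labeling could look like, the sketch below derives a `risk` column from `pm25` with `pd.cut`. The cut points and the label names (`low`, `medium`, `high`) are hypothetical placeholders chosen only to make the example run; they are not thresholds taken from this dataset or from any clinical guideline. The rows are copied from the sample displayed earlier.

```python
import pandas as pd

# Rows copied from the sample shown earlier in this notebook.
sample = pd.DataFrame({
    "pm25": [9.0, 10.1, 9.9, 9.6, 8.4],
    "co2": [708.0, 694.0, 693.0, 692.0, 690.0],
})

# HYPOTHETICAL cut points, for illustration only (not real risk thresholds).
# With the default right=True the bins are (-inf, 9], (9, 10], (10, inf).
PM25_BINS = [-float("inf"), 9.0, 10.0, float("inf")]
RISK_LABELS = ["low", "medium", "high"]

# pd.cut returns a categorical column, which suits a label such as `risk`.
sample["risk"] = pd.cut(sample["pm25"], bins=PM25_BINS, labels=RISK_LABELS)
print(sample)
```

A real labeling rule would most likely combine several features rather than a single column, and its thresholds would need to come from domain experts.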

# Data Cleaning

## Parsing Dates

### Check the dtype of our date column

Because we passed `parse_dates` to `read_csv`, the dtype of our column is already `datetime64` rather than `object`. Had it been `object`, we could tell that pandas doesn't know that this column contains dates.

In [None]:
print(df['ts'].head())

0   2016-11-21 00:47:03
1   2016-11-21 00:48:03
2   2016-11-21 00:49:03
3   2016-11-21 00:50:03
4   2016-11-21 00:51:03
Name: ts, dtype: datetime64[ns]
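
The effect of `parse_dates` on the dtype can be seen with a small comparison. This is a minimal sketch that reads a few rows copied from the sample above out of an in-memory string (so it runs without the CSV file) and checks the dtype with `pandas.api.types.is_datetime64_any_dtype`. The variable names are illustrative only.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# A few rows copied from the sample shown earlier, kept inline so the
# sketch does not depend on the CSV file.
csv_text = """ts,co2,humidity
2016-11-21 00:47:03,708.0,72.09
2016-11-21 00:48:03,694.0,70.95
2016-11-21 00:49:03,693.0,69.12
"""

# Without parse_dates, pandas keeps the timestamps as plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the same column is converted while the file is read.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["ts"])

raw_is_datetime = is_datetime64_any_dtype(raw["ts"])
parsed_is_datetime = is_datetime64_any_dtype(parsed["ts"])
print("without parse_dates:", raw["ts"].dtype, raw_is_datetime)
print("with parse_dates:   ", parsed["ts"].dtype, parsed_is_datetime)
```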


### Convert our date columns to datetime

We can use the ["strftime directives"](https://strftime.org/) guide to identify the format of our dates. For example:


*   The `ts` column has values with the format `%Y-%m-%d %H:%M:%S`
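
To see how the directives map onto an actual value, here is a small sketch using the standard library's `datetime.strptime`. It assumes the format `%Y-%m-%d %H:%M:%S`, which matches the timestamps printed above such as `2016-11-21 00:47:03`. The constant name `TS_FORMAT` is just an illustrative choice.

```python
from datetime import datetime

# %Y = 4-digit year, %m = month, %d = day of month,
# %H = hour (24-hour clock), %M = minute, %S = second
TS_FORMAT = "%Y-%m-%d %H:%M:%S"

sample = "2016-11-21 00:47:03"  # first timestamp shown in the data

# Parse the string with the format, then format it back to a string.
parsed = datetime.strptime(sample, TS_FORMAT)
round_trip = parsed.strftime(TS_FORMAT)

print(parsed.year, parsed.month, parsed.day, parsed.hour, parsed.minute, parsed.second)
print(round_trip == sample)
```

If the round trip reproduces the original string, the format describes the value completely.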



In [None]:
# update the "ts" column with the parsed dates
df['ts'] = pd.to_datetime(df['ts'], errors='coerce', format="%Y-%m-%d %H:%M:%S")

Now when we check the first few rows of the column, we can see that the dtype is `datetime64`.

In [None]:
df['ts'].head()

0   2016-11-21 00:47:03
1   2016-11-21 00:48:03
2   2016-11-21 00:49:03
3   2016-11-21 00:50:03
4   2016-11-21 00:51:03
Name: ts, dtype: datetime64[ns]
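
Since `errors='coerce'` silently turns any value that does not match the format into `NaT`, it can be worth counting missing values after the conversion. The sketch below uses two timestamps from the dataset plus one deliberately malformed string, which is a hypothetical value added only to trigger the coercion.

```python
import pandas as pd

# Two valid timestamps from the dataset plus one malformed value
# (hypothetical, included only to show what errors="coerce" does).
values = pd.Series(["2016-11-21 00:47:03", "2016-11-21 00:48:03", "not a date"])

# Values that do not match the format become NaT instead of raising an error.
converted = pd.to_datetime(values, errors="coerce", format="%Y-%m-%d %H:%M:%S")

# Count how many entries could not be parsed.
n_missing = int(converted.isna().sum())
print(converted)
print("unparseable values:", n_missing)
```

A non-zero count after converting the real column would suggest that the format string does not match some of the rows.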