# Data Preprocessing


Before applying interpolation and outlier removal, I'll clean the raw data so that only the relevant records remain, in a consistent format. The main steps are:
- Editing column names
- Eliminating features irrelevant to the project
- Replacing null-value identifiers with NaN
- Determining column data types
- Extracting relevant instances
- Reducing the number of instances


In [49]:
import pandas as pd
import numpy as np

## Edit column names and eliminate unnecessary features

The data was obtained from the official SEMADET page.

In [None]:
filename = "semadet-aire-2017"
filepath = f"datasets/{filename}.csv"

df = pd.read_csv(filepath, encoding="utf-8")

df.columns = df.columns.str.lower().str.strip()

df.rename({"pm2.5": "pm25", 
           "date_time": "date",
           "precipitacion": "pp", 
           "rad solar": "rs",
           "presion barometrica": "pba"},
          axis="columns",
          inplace=True)

df.drop(["rs", "nox", "no", "tmpi"], axis="columns", inplace=True)

In [37]:
df.head()

Unnamed: 0,date,hora,o3,no2,so2,co,pm10,pm25,tmp,rh,ws,wd,pp
0,1/1/2017,00:00:00,0.013,0.017,0.0038,0.79,48.2,,17.6,58.4,2.9,5.8,0.0
1,1/1/2017,01:00:00,0.01,0.023,0.0032,0.829,76.4,,16.8,63.4,1.3,4.8,0.0
2,1/1/2017,02:00:00,0.01,0.021,0.003,0.891,56.6,,16.0,67.1,0.4,4.9,0.0
3,1/1/2017,03:00:00,0.006,0.028,0.003,1.163,36.4,,15.2,71.1,0.1,2.4,0.0
4,1/1/2017,04:00:00,0.005,0.032,0.0034,1.935,50.9,,14.6,74.0,0.5,4.1,0.0


## Replace null values

The raw files mark null values with many different identifiers. I'll replace all of them with NaN so I can decide which columns to keep based on how many valid values each one has.

In [39]:
def replace_with_null(row):
    # Identifiers the raw files use for null values
    null_values = ["IO", "SE", "ND", "IF", "VE", "IR", "VZ", "IC", "IR 1000", "IR valor 1000", " ", "", "-", "SD"]
    # Identifier columns are left untouched
    exclude_columns = ["estacion", "date", "hora"]

    for column in row.index:
        if column not in exclude_columns and row[column] in null_values:
            row[column] = np.nan

    return row
    

In [40]:
df = df.apply(replace_with_null, axis="columns")
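The row-wise `apply` visits every cell in Python, which can be slow on a full year of hourly data. A vectorized alternative could look like the following minimal sketch. The toy frame and its values are made up for illustration, and the marker list is shortened.

```python
import pandas as pd

# Toy data (made up for illustration) using a few of the null identifiers
toy = pd.DataFrame({
    "estacion": ["Tlaquepaque", "Tlaquepaque", "Tlaquepaque"],
    "o3": ["0.013", "ND", " "],
    "pm10": ["48.2", "IR 1000", "56.6"],
})

null_values = ["ND", "IR 1000", " ", "", "-", "SD"]
exclude_columns = ["estacion", "date", "hora"]

# Only touch measurement columns; identifier columns stay as they are
value_cols = [c for c in toy.columns if c not in exclude_columns]

# mask() puts NaN wherever the condition is True
toy[value_cols] = toy[value_cols].mask(toy[value_cols].isin(null_values))
```

The result should match the row-wise function on these columns, since both only replace cells whose value is exactly one of the listed identifiers.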

Since the no2 and so2 columns have no useful data, they'll be dropped entirely.

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8746 entries, 0 to 8745
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    8746 non-null   object 
 1   hora    8696 non-null   object 
 2   o3      8636 non-null   float64
 3   no2     8226 non-null   float64
 4   so2     8660 non-null   float64
 5   co      8683 non-null   float64
 6   pm10    8655 non-null   float64
 7   pm25    3301 non-null   float64
 8   tmp     8696 non-null   float64
 9   rh      8696 non-null   float64
 10  ws      8696 non-null   float64
 11  wd      8495 non-null   float64
 12  pp      8696 non-null   float64
dtypes: float64(11), object(2)
memory usage: 888.4+ KB


In [42]:
df.drop(["no2", "so2"], axis="columns", inplace=True)

## Specify data types

Now that the null values have been replaced, I can specify the correct data type for each column.

In [44]:
float_cols = ["o3", "co", "pm10", "pm25", "tmp", "rh", "ws", "wd", "pp"]
df[float_cols] = df[float_cols].astype('float')
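`astype('float')` raises an error if any unexpected string is still present in a column. A more forgiving variant, assuming that any leftover non-numeric value should simply become NaN, is `pd.to_numeric` with `errors="coerce"`. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy data (made up): "XX" stands for an identifier the null list did not cover
toy = pd.DataFrame({
    "o3": ["0.013", "0.01", "XX"],
    "co": ["0.79", None, "0.891"],
})
float_cols = ["o3", "co"]

# Convert column by column; values that cannot be parsed become NaN
converted = toy[float_cols].apply(pd.to_numeric, errors="coerce")
```

The trade-off is that coercion hides unexpected values, so it is worth comparing the NaN counts before and after the conversion.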

## Extract relevant instances

The daily AQI forecast will only be done for the Tlaquepaque region, so I'll extract the rows for that station and then drop the estacion column. I'll also convert the date column into the index.

In [20]:
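# Copy the filtered rows so the in-place drop later does not act on a view of the original frame
df = df[df["estacion"] == "Tlaquepaque"].copy()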

In [21]:
df.head(3)

Unnamed: 0,estacion,date,hora,o3,co,pm10,pm25,tmp,rh,ws,wd,pp,pba
78840,Tlaquepaque,1/1/2019,12:00:00 AM,0.004,6.856,260.52,,9.8,74.5,0.33,186.11,0.0,
78841,Tlaquepaque,1/1/2019,1:00:00 AM,0.004,5.402,306.19,,9.2,76.3,0.34,216.73,0.0,
78842,Tlaquepaque,1/1/2019,2:00:00 AM,0.005,8.872,387.25,,8.7,78.3,0.47,122.66,0.0,


In [22]:
df.drop("estacion", axis="columns", inplace=True)

Some datasets have an incorrect date-time format. For those, the commented-out cells below rebuild the date column from scratch.

In [39]:
# df.drop("date", axis="columns", inplace=True)

In [40]:
# Create a date range for the year 2021
# dates_2021 = pd.date_range('2021-01-01', '2021-12-31', freq='D')

# Repeat each date 24 times (for 24 hours)
# repeated_dates = pd.Series(dates_2021.repeat(24))

# Create the DataFrame and assign the repeated dates to the 'date' column
# df ['date'] = repeated_dates

Use the date as the index.

In [45]:
df.index = pd.to_datetime(df['date'], format='%m/%d/%Y')
df.drop("date", axis="columns", inplace=True)
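To find out whether a given file has dates in an unexpected format before converting, one option is to parse with `errors="coerce"` and inspect what failed. This is a minimal sketch on made-up date strings, not part of the original pipeline.

```python
import pandas as pd

# Made-up date strings; the last one does not follow month/day/year
dates = pd.Series(["1/1/2017", "1/2/2017", "2017-01-03"])

# Values that do not match the format become NaT instead of raising
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# Original strings that could not be parsed
bad_rows = dates[parsed.isna()]
```

An empty `bad_rows` suggests the file can be converted with the strict format as is.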

In [46]:
df = df.sort_index()

In [47]:
df.head()

date,hora,o3,co,pm10,pm25,tmp,rh,ws,wd,pp
2017-01-01,00:00:00,0.013,0.79,48.2,,17.6,58.4,2.9,5.8,0.0
2017-01-01,01:00:00,0.01,0.829,76.4,,16.8,63.4,1.3,4.8,0.0
2017-01-01,02:00:00,0.01,0.891,56.6,,16.0,67.1,0.4,4.9,0.0
2017-01-01,03:00:00,0.006,1.163,36.4,,15.2,71.1,0.1,2.4,0.0
2017-01-01,04:00:00,0.005,1.935,50.9,,14.6,74.0,0.5,4.1,0.0


In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8746 entries, 2017-01-01 to 2018-01-01
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   hora    8696 non-null   object 
 1   o3      8636 non-null   float64
 2   co      8683 non-null   float64
 3   pm10    8655 non-null   float64
 4   pm25    3301 non-null   float64
 5   tmp     8696 non-null   float64
 6   rh      8696 non-null   float64
 7   ws      8696 non-null   float64
 8   wd      8495 non-null   float64
 9   pp      8696 non-null   float64
dtypes: float64(9), object(1)
memory usage: 751.6+ KB


## Reduce number of instances

Finally, since I want to forecast the daily AQI, I'll keep only one entry per day. Each entry is the average of every feature except wind direction, which uses the circular mean, a way of averaging angular data. The arithmetic mean does not work for angles: 350° and 10° both point roughly north, yet their arithmetic mean is 180°.

In [50]:
# Define a function for circular mean
def circular_mean(angles):
    angles_rad = np.deg2rad(angles)  # Convert degrees to radians
    mean_sin = np.mean(np.sin(angles_rad))
    mean_cos = np.mean(np.cos(angles_rad))
    mean_angle = np.arctan2(mean_sin, mean_cos)  # Compute mean angle
    return np.rad2deg(mean_angle) % 360  # Convert back to degrees and normalize
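As a quick illustration of why the circular mean is needed, the sketch below compares both averages for two made-up wind directions on either side of north. It repeats the same sine/cosine computation inline so that it runs on its own.

```python
import numpy as np

# Two made-up wind directions, 20 degrees either side of 10 degrees
angles = np.array([350.0, 30.0])

# The arithmetic mean lands on the opposite side of the compass
arithmetic = angles.mean()

# Circular mean: average the unit vectors, then convert back to an angle
rad = np.deg2rad(angles)
circular = np.rad2deg(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())) % 360

print(arithmetic)  # 190.0
print(circular)    # approximately 10.0, up to floating-point rounding
```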

In [51]:
aggregation_functions = {col: "mean" for col in df.columns if col not in ["wd", "hora"]}
aggregation_functions["wd"] = circular_mean

In [52]:
df_daily = df.groupby("date").agg(aggregation_functions)
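Note that the `"mean"` aggregation skips NaN values, so a day with only a few valid hourly readings still gets an average, while a day with no valid readings stays NaN. The sketch below shows this on made-up data; `hours_per_day` is a hypothetical helper frame for checking how many readings back each daily value.

```python
import numpy as np
import pandas as pd

# Made-up readings: three "hours" for each of two days
idx = pd.DatetimeIndex(["2017-01-01"] * 3 + ["2017-01-02"] * 3, name="date")
toy = pd.DataFrame(
    {
        "pm10": [40.0, np.nan, 60.0, 10.0, 20.0, 30.0],
        "pm25": [np.nan, np.nan, np.nan, 5.0, np.nan, 7.0],
    },
    index=idx,
)

# One row per day; NaN readings are ignored by "mean"
toy_daily = toy.groupby("date").agg({"pm10": "mean", "pm25": "mean"})

# Number of valid readings behind each daily value
hours_per_day = toy.groupby("date").count()
```

Days with very few valid hours may deserve a closer look before the later interpolation step.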

After grouping the values by date, the pba (barometric pressure) and co (carbon monoxide) columns have no relevant information, so I'll eliminate them.

In [53]:
df_daily.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 366 entries, 2017-01-01 to 2018-01-01
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   o3      366 non-null    float64
 1   co      366 non-null    float64
 2   pm10    366 non-null    float64
 3   pm25    139 non-null    float64
 4   tmp     366 non-null    float64
 5   rh      366 non-null    float64
 6   ws      366 non-null    float64
 7   pp      366 non-null    float64
 8   wd      366 non-null    float64
dtypes: float64(9)
memory usage: 28.6 KB


In [None]:
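# pba is not present in every file (see the info() output above), so skip labels that are missing
df_daily.drop(["pba", "co"], axis="columns", inplace=True, errors="ignore")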

In [55]:
df_daily.head()

date,o3,pm10,pm25,tmp,rh,ws,pp,wd
2017-01-01,0.014792,48.945833,,19.475,52.125,2.516667,0.0,9.648721
2017-01-02,0.014,55.2,,19.358333,49.691667,1.291667,0.0,9.957045
2017-01-03,0.017458,62.9375,,19.970833,43.0375,1.0625,0.0,11.695896
2017-01-04,0.0195,80.2125,,20.620833,38.220833,1.195833,0.0,12.786368
2017-01-05,0.013708,53.1375,,21.870833,31.508333,2.375,0.0,14.54004


In [56]:
df_daily.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 366 entries, 2017-01-01 to 2018-01-01
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   o3      366 non-null    float64
 1   pm10    366 non-null    float64
 2   pm25    139 non-null    float64
 3   tmp     366 non-null    float64
 4   rh      366 non-null    float64
 5   ws      366 non-null    float64
 6   pp      366 non-null    float64
 7   wd      366 non-null    float64
dtypes: float64(8)
memory usage: 25.7 KB


I'll verify that every month has a complete set of dates.

In [61]:
year = df_daily.index.year
month = df_daily.index.month
dates_per_month = df_daily.groupby([year, month]).size().unstack(fill_value=0)
dates_per_month.index.name = 'year'
dates_per_month.columns.name = 'month'
dates_per_month

year/month,1,2,3,4,5,6,7,8,9,10,11,12
2017,31,28,31,30,31,30,31,31,30,31,30,31
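Counting days per month shows whether something is missing, but not which days. A complementary check, sketched here on a made-up daily index, compares the index against a full daily range.

```python
import pandas as pd

# Made-up daily index with one day missing (2017-01-03)
toy_index = pd.DatetimeIndex(["2017-01-01", "2017-01-02", "2017-01-04"])

# Every calendar day between the first and last date
expected = pd.date_range(toy_index.min(), toy_index.max(), freq="D")

# Days that should be present but are not
missing = expected.difference(toy_index)
```

With the real data, `df_daily.index` would take the place of `toy_index`.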


## Save preprocessed data

Now that the data has been properly processed, I'll save it in a new file.

In [62]:
df_daily.to_csv(f"datasets/preprocess/{filename}-processed.csv")
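`to_csv` fails if the target folder does not exist yet. The sketch below creates the folder first and reads the file back with the date index restored. It uses a temporary directory and a made-up one-row frame as stand-ins for the project's folder and `df_daily`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up one-row frame standing in for df_daily
toy_daily = pd.DataFrame(
    {"o3": [0.014]},
    index=pd.DatetimeIndex(["2017-01-01"], name="date"),
)

# A temporary directory stands in for the project's datasets/ folder
base = Path(tempfile.mkdtemp())
out_dir = base / "preprocess"
out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

out_path = out_dir / "toy-processed.csv"
toy_daily.to_csv(out_path)

# Read it back, parsing the date column as the index
restored = pd.read_csv(out_path, index_col="date", parse_dates=True)
```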