# Data Preprocessing


Before applying interpolation and outlier removal, I'll make a few changes to keep only the relevant data and give it a consistent format. These changes are:
- Editing column names
- Eliminating features irrelevant to the project
- Replacing the null-value identifiers with NaN
- Determining column data types
- Extracting relevant instances
- Reducing the number of instances


In [49]:
import pandas as pd
import numpy as np

## Edit column names and eliminate unnecessary features

The data was obtained from the official SEMADET page.

In [67]:
filename = "semadet-aire-2023"
filepath = f"datasets/{filename}.csv"

df = pd.read_csv(filepath,
                     encoding='utf-8',
                     )

df.columns = df.columns.str.lower().str.strip()

df.rename({"pm2.5": "pm25", 
           "date_time": "date",
           "precipitacion": "pp", 
           "rad solar": "rs",
           "presion barometrica": "pba"},
          axis="columns",
          inplace=True)

df.drop(["rs", "nox", "no", "tmpi", "uvi"], axis="columns", inplace=True)

In [68]:
df.head(3)

Unnamed: 0,estacion,date,hora,o3,no2,so2,co,pm10,pm25,tmp,rh,ws,wd,pp,pba
0,Aguilas,1/1/2023,0,0.002,SE,SE,SE,61.8,58.1,12.6,88.7,0.38,190.77,0.25,SE
1,Aguilas,1/1/2023,1,0.002,SE,SE,SE,83.8,76.5,12.1,89.8,1.27,215.13,0.0,SE
2,Aguilas,1/1/2023,2,0.003,SE,SE,SE,98.2,95.0,11.8,89.5,2.44,240.82,0.0,SE


## Replace null values

The dataset marks missing values with many different identifiers. I'll replace all of them with NaN so that I can decide which columns to keep based on how many valid values each one has.

In [66]:
def replace_with_null(row):
    null_values = ["IO", "SE", "ND", "IF", "VE", "IR", "VZ", "IC", "IR 1000", "IR valor 1000", " ", "", "-", "SD"]
    exclude_columns = ["estacion", "date", "hora"]
    
    for column in row.index:
        if column not in exclude_columns and row[column] in null_values:
            row[column] = np.nan
            
    return row
    

In [69]:
df = df.apply(replace_with_null, axis="columns")
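The row-wise `apply` above calls the Python function once per row, which can be slow on a frame of this size. As a possible alternative, here is a minimal sketch of a column-wise version built on `DataFrame.isin` and `DataFrame.mask`. The names `NULL_MARKERS`, `KEY_COLUMNS`, `replace_markers` and the toy `sample` frame are made up for illustration; the marker list is copied from `replace_with_null`.

```python
import numpy as np
import pandas as pd

# Marker list copied from replace_with_null; the constant names are illustrative.
NULL_MARKERS = ["IO", "SE", "ND", "IF", "VE", "IR", "VZ", "IC",
                "IR 1000", "IR valor 1000", " ", "", "-", "SD"]
KEY_COLUMNS = ["estacion", "date", "hora"]


def replace_markers(frame):
    """Return a copy where marker strings outside the key columns become NaN."""
    out = frame.copy()
    value_cols = [c for c in out.columns if c not in KEY_COLUMNS]
    # isin builds a boolean frame; mask sets NaN wherever it is True.
    out[value_cols] = out[value_cols].mask(out[value_cols].isin(NULL_MARKERS))
    return out


# Toy data, not taken from the real file.
sample = pd.DataFrame({
    "estacion": ["Aguilas", "Aguilas", "Aguilas"],
    "date": ["1/1/2023", "1/1/2023", "1/1/2023"],
    "hora": [0, 1, 2],
    "o3": ["0.002", "SE", "0.003"],
    "pm10": [61.8, 83.8, 98.2],
})
cleaned = replace_markers(sample)
```

In this sketch only the `SE` entry in `o3` becomes NaN; the key columns and the numeric `pm10` column are left untouched.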

Since the no2 and so2 columns contain no non-null values (see the output below), they'll be dropped entirely.

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87600 entries, 0 to 87599
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   estacion  87600 non-null  object 
 1   date      87600 non-null  object 
 2   hora      87600 non-null  int64  
 3   o3        49551 non-null  object 
 4   no2       0 non-null      float64
 5   so2       0 non-null      float64
 6   co        28843 non-null  object 
 7   pm10      57768 non-null  object 
 8   pm25      50068 non-null  object 
 9   tmp       44625 non-null  object 
 10  rh        46416 non-null  object 
 11  ws        49970 non-null  object 
 12  wd        39858 non-null  object 
 13  pp        38714 non-null  object 
 14  pba       39048 non-null  object 
dtypes: float64(2), int64(1), object(12)
memory usage: 10.0+ MB


In [71]:
df.drop(["no2", "so2"], axis="columns", inplace=True)

## Specify data types

Now that the null values have been replaced, I can specify the correct data type for each column.

In [72]:
float_cols = ["o3", "co", "pm10", "pm25", "tmp", "rh", "ws", "wd", "pp"]
df[float_cols] = df[float_cols].astype('float')
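Note that `astype('float')` raises a `ValueError` if a marker string was missed in the previous step. A more forgiving option, assuming it is acceptable to turn every unparseable value into NaN, is `pd.to_numeric` with `errors="coerce"`. The `raw` series below is invented sample data.

```python
import pandas as pd

# Invented values: two numeric strings, one leftover marker, one missing entry.
raw = pd.Series(["0.002", "SE", "12.5", None])

# Anything that cannot be parsed as a number becomes NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")
```

The trade-off is that coercion hides unexpected strings, so it may be worth counting the NaN values before and after the conversion.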

## Extract relevant instances

The daily AQI forecast will only cover the Tlaquepaque region, so I'll keep only the rows from that station. Once they are extracted, the estacion column holds a single value and can be dropped. I'll also turn the date column into the index.

In [73]:
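# Take an explicit copy so the in-place drops that follow don't act on a slice of the original frame
df = df[df["estacion"] == "Tlaquepaque"].copy()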

In [74]:
df.head(3)

Unnamed: 0,estacion,date,hora,o3,co,pm10,pm25,tmp,rh,ws,wd,pp,pba
70080,Tlaquepaque,1/1/2023,0,0.002,,185.39,120.36,13.9,83.4,0.21,180.01,0.0,
70081,Tlaquepaque,1/1/2023,1,0.002,,244.2,164.48,13.4,84.2,0.09,196.35,0.0,
70082,Tlaquepaque,1/1/2023,2,0.003,,232.98,175.42,12.9,86.1,0.11,168.65,0.0,


In [75]:
df.drop("estacion", axis="columns", inplace=True)

To turn the data into a time series, the dates will be used as the index.

In [76]:
df.index = pd.to_datetime(df['date'], format='%m/%d/%Y')
df.drop("date", axis="columns", inplace=True)
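The explicit `format` matters because a string such as `1/2/2023` is ambiguous. The short sketch below, using an invented date string, shows how the same text parses under a month-first and a day-first format. The notebook assumes month-first.

```python
import pandas as pd

ambiguous = pd.Series(["1/2/2023"])  # invented example value

# Month-first, as used in this notebook: January 2nd.
month_first = pd.to_datetime(ambiguous, format="%m/%d/%Y")

# Day-first, shown only for contrast: February 1st.
day_first = pd.to_datetime(ambiguous, format="%d/%m/%Y")
```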

In [77]:
df = df.sort_index()

In [78]:
df.head(3)

date,hora,o3,co,pm10,pm25,tmp,rh,ws,wd,pp,pba
2023-01-01,0,0.002,,185.39,120.36,13.9,83.4,0.21,180.01,0.0,
2023-01-01,1,0.002,,244.2,164.48,13.4,84.2,0.09,196.35,0.0,
2023-01-01,2,0.003,,232.98,175.42,12.9,86.1,0.11,168.65,0.0,


## Reduce number of instances

Finally, since I want to forecast the daily AQI, I'll keep a single entry per day. Each entry is the mean of every feature over that day, except for wind direction, which uses the circular mean. This is the appropriate way to average angular data, because angles wrap around at 360°.

In [80]:
# Define a function for circular mean
def circular_mean(angles):
    angles_rad = np.deg2rad(angles)  # Convert degrees to radians
    mean_sin = np.mean(np.sin(angles_rad))
    mean_cos = np.mean(np.cos(angles_rad))
    mean_angle = np.arctan2(mean_sin, mean_cos)  # Compute mean angle
    return np.rad2deg(mean_angle) % 360  # Convert back to degrees and normalize
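To see why the ordinary mean is not suitable for angles, here is a small worked example with two invented wind directions on either side of north, 350° and 30°. The function is repeated so the snippet runs on its own.

```python
import numpy as np


def circular_mean(angles):
    # Same computation as above: average the unit vectors, then take the angle.
    angles_rad = np.deg2rad(angles)
    mean_sin = np.mean(np.sin(angles_rad))
    mean_cos = np.mean(np.cos(angles_rad))
    return np.rad2deg(np.arctan2(mean_sin, mean_cos)) % 360


directions = np.array([350.0, 30.0])  # invented values, both close to north

arithmetic = np.mean(directions)     # 190.0, which points roughly south
circular = circular_mean(directions)  # approximately 10 degrees, close to north
```

The arithmetic mean lands on the opposite side of the compass, while the circular mean stays between the two observations.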

In [81]:
aggregation_functions = {col: "mean" for col in df.columns if col not in ["wd", "hora"]}
aggregation_functions["wd"] = circular_mean

In [82]:
df_daily = df.groupby("date").agg(aggregation_functions)
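As a quick illustration of how this aggregation treats missing values, the sketch below groups a small invented hourly frame by date. The built-in `"mean"` skips NaN within each day, and a day whose values are all NaN yields NaN.

```python
import numpy as np
import pandas as pd

# Invented hourly readings for two days; co is fully missing on the second day.
hourly = pd.DataFrame(
    {
        "pm10": [10.0, 30.0, 50.0, 70.0],
        "co": [np.nan, 4.0, np.nan, np.nan],
    },
    index=pd.DatetimeIndex(
        ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"], name="date"
    ),
)

# One aggregation per column, keyed by column name.
daily = hourly.groupby("date").agg({"pm10": "mean", "co": "mean"})
```

Here `pm10` averages to 20.0 and 60.0, while `co` is 4.0 on the first day and NaN on the second.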

After grouping the values by date, the pba (barometric pressure) and co (carbon monoxide) columns contain no values at all, so I'll drop them.

In [83]:
df_daily.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 365 entries, 2023-01-01 to 2023-12-31
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   o3      180 non-null    float64
 1   co      0 non-null      float64
 2   pm10    365 non-null    float64
 3   pm25    365 non-null    float64
 4   tmp     131 non-null    float64
 5   rh      315 non-null    float64
 6   ws      319 non-null    float64
 7   pp      365 non-null    float64
 8   pba     0 non-null      object 
 9   wd      257 non-null    float64
dtypes: float64(9), object(1)
memory usage: 31.4+ KB


In [84]:
df_daily.drop(["pba", "co"], axis="columns", inplace=True)

I'll verify that every month contains all of its days.

In [86]:
year = df_daily.index.year
month = df_daily.index.month
dates_per_month = df_daily.groupby([year, month]).size().unstack(fill_value=0)
dates_per_month.index.name = 'year'
dates_per_month.columns.name = 'month'
dates_per_month

year \ month,1,2,3,4,5,6,7,8,9,10,11,12
2023,31,28,31,30,31,30,31,31,30,31,30,31
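A complementary check, sketched below under the assumption that the series should contain every calendar day between two dates, is to compare the index against a full `pd.date_range` and list whatever is missing. The helper name `missing_days` and the toy index are illustrative.

```python
import pandas as pd


def missing_days(index, start, end):
    """Return the calendar days between start and end that are absent from index."""
    expected = pd.date_range(start, end, freq="D")
    return expected.difference(index)


# Toy index with January 3rd left out on purpose.
toy_index = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04"])
gaps = missing_days(toy_index, "2023-01-01", "2023-01-04")
```

An empty result means there are no gaps, and unlike the per-month counts it names the exact dates that are missing.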


## Save preprocessed data

Now that the data has been properly processed, I'll save it in a new file.

In [87]:
df_daily.to_csv(f"datasets/preprocess/{filename}-processed.csv")
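When the file is loaded again, the date index comes back as plain strings unless it is parsed explicitly. The sketch below round-trips a small invented frame through an in-memory buffer (standing in for the file on disk) and restores the `DatetimeIndex` with `index_col` and `parse_dates`.

```python
import io

import pandas as pd

# Invented daily values with a date index, mimicking the shape of df_daily.
daily = pd.DataFrame(
    {"pm10": [185.4, 120.3]},
    index=pd.DatetimeIndex(["2023-01-01", "2023-01-02"], name="date"),
)

# Write to an in-memory buffer instead of a real path.
buffer = io.StringIO()
daily.to_csv(buffer)
buffer.seek(0)

# index_col selects the date column; parse_dates=True parses the index as dates.
restored = pd.read_csv(buffer, index_col="date", parse_dates=True)
```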