
# Correlating Specific Pollution Sources with Air Quality Level

Steffany Lara, Brisma Alvarez, Emiliano Ruiz and Daniel De Pool

*Instituto Tecnológico y de Estudios Superiores de Monterrey (ITESM)*

**Abstract**


The overall objective is to quantify and explain the relative contribution of specific pollution sources such as factories, airports, and vehicular traffic to criteria pollutant levels in Monterrey through multivariate time series analysis from monitoring stations, in order to prioritize mitigation and optimization actions for air quality management.


## Knowing the Business

### Exploratory Analysis

To understand the data well enough to support better predictions, we analyze the information in all of the datasets. In this approach, gas concentration data from 2020 to 2022 is used only for exploratory analysis, to understand how the quarantine, the reduced activity of many industries, and lower traffic congestion affected the concentration of polluting gases. Data from 2023 to 2024 is also included in this exploratory analysis. In addition, the "Padrón Medio Ambiente" dataset is reviewed to identify which places are most frequented and which models are most common across those places.


In [15]:
# Import libraries
import pandas as pd

In [16]:
# First, we load all the databases, which should share the same format: an Excel workbook with multiple sheets, where each sheet contains the data for one station.


# Years 2020-2021, 2022-2023, and 2023-2024

db_2020_2021 = pd.read_excel("Bases_Datos/DATOS HISTÓRICOS 2020_2021_TODAS ESTACIONES.xlsx",sheet_name=None) 
db_2022_2023 = pd.read_excel("Bases_Datos/DATOS HISTÓRICOS 2022_2023_TODAS ESTACIONES.xlsx",sheet_name=None)
db_2023_2024 = pd.read_excel("Bases_Datos/DATOS HISTÓRICOS 2023_2024_TODAS ESTACIONES_ITESM-2.xlsx",sheet_name=None)


In [17]:
print(db_2020_2021.keys())
print(db_2022_2023.keys())
print(db_2023_2024.keys())


dict_keys(['SURESTE', 'NORESTE', 'CENTRO', 'NOROESTE', 'SUROESTE', 'NOROESTE2', 'NORTE', 'SUROESTE2', 'SURESTE2', 'SURESTE3', 'SUR', 'NORTE2', 'NORESTE2', 'NORESTE3', 'NOROESTE3', 'CATÁLOGO'])
dict_keys(['SURESTE', 'NORESTE', 'CENTRO', 'NOROESTE', 'SUROESTE', 'NOROESTE2', 'NORTE', 'SUROESTE2', 'SURESTE2', 'SURESTE3', 'SUR', 'NORTE2', 'NORESTE2', 'NORESTE3', 'NOROESTE3', 'CATÁLOGO'])
dict_keys(['Param_horarios_Estaciones', 'Hoja2'])


In [66]:
print(db_2022_2023['NOROESTE2'])

                     date    CO    NO   NO2    NOX    O3   PM10   PM2.5  \
0     2022-01-01 00:00:00  2.22   4.7  25.3   30.0  22.0  273.0  223.32   
1     2022-01-01 01:00:00  2.54  12.0  31.3   43.1  12.0  169.0  114.31   
2     2022-01-01 02:00:00  4.30  72.0  39.1  110.9   6.0  215.0  143.92   
3     2022-01-01 03:00:00  2.07  12.9  17.8   30.6  15.0  291.0  218.59   
4     2022-01-01 04:00:00  1.08   3.0   3.8    6.7  24.0  134.0   71.94   
...                   ...   ...   ...   ...    ...   ...    ...     ...   
14250 2023-08-17 19:00:00  0.30   2.9   8.7   11.6  39.0  109.0     NaN   
14251 2023-08-17 20:00:00  0.33   2.9  10.5   13.4  32.0  102.0     NaN   
14252 2023-08-17 21:00:00  0.34   3.4  15.9   19.3  25.0   97.0     NaN   
14253 2023-08-17 22:00:00  0.43   4.8  23.0   27.8  15.0  103.0     NaN   
14254 2023-08-17 23:00:00  0.31   3.3  12.5   15.7  22.0  116.0     NaN   

         PRS  RAINF    RH  SO2     SR   TOUT   WSR    WDR  
0      694.4    0.0  52.0  2.8  0.000  

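The sheet names printed earlier show that the 2023-2024 workbook is not organized by station: it has a single data sheet, `Param_horarios_Estaciones`, plus `Hoja2`. We will therefore split it by station manually.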
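The difference in layout can also be checked programmatically. The following is a minimal sketch, assuming each workbook has been loaded as a dictionary keyed by sheet name; `compare_sheet_names` and the small stand-in dictionaries are hypothetical names used only for illustration.

```python
def compare_sheet_names(workbooks):
    """Return, per workbook, the sheet names it lacks relative to the first workbook."""
    names = list(workbooks)
    reference = set(workbooks[names[0]])
    return {name: sorted(reference - set(workbooks[name])) for name in names[1:]}


# Stand-ins for the dictionaries returned by pd.read_excel(..., sheet_name=None).
# Only the keys matter here; in practice the values would be DataFrames.
wb_2020_2021 = {"SURESTE": None, "NORESTE": None, "CATÁLOGO": None}
wb_2023_2024 = {"Param_horarios_Estaciones": None, "Hoja2": None}

missing = compare_sheet_names({"2020_2021": wb_2020_2021, "2023_2024": wb_2023_2024})
print(missing)
```

A non-empty list for a workbook signals that its sheets do not follow the per-station layout of the reference workbook.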


In [94]:
print(db_2023_2024['Hoja2'])  # 'Hoja2' is empty, so we will build a dictionary of DataFrames from the first sheet

print(db_2023_2024['Param_horarios_Estaciones'].columns)
df = db_2023_2024['Param_horarios_Estaciones']

# Drop the first two rows (parameter names and units)
df = df.drop(index=[0, 1]).reset_index(drop=True)
print(df)



Empty DataFrame
Columns: []
Index: []
Index([    nan,    'CO',    'NO',   'NO2',   'NOX',    'O3',  'PM10', 'PM2.5',
         'PRS', 'RAINF',
       ...
        'PM10', 'PM2.5',   'PRS', 'RAINF',    'RH',   'SO2',    'SR',  'TOUT',
         'WSR',   'WDR'],
      dtype='object', name=0, length=240)
0                      NaN    CO    NO   NO2   NOX   O3 PM10  PM2.5    PRS  \
0      2023-01-01 00:00:00  2.37  54.5  32.6  87.1    3  110     68  721.7   
1      2023-01-01 01:00:00  2.12  38.7  30.3  68.9    3  116  67.18  721.5   
2      2023-01-01 02:00:00  2.05  38.7  28.8  67.4    3  117  75.12  721.1   
3      2023-01-01 03:00:00   2.5  60.5  29.1  89.4    3  135  82.81  720.8   
4      2023-01-01 04:00:00  1.94  42.3  25.7  67.7  NaN  132  59.56  720.7   
...                    ...   ...   ...   ...   ...  ...  ...    ...    ...   
13865  2024-07-31 19:00:00  0.67   NaN   4.3   7.5   26   88    NaN  721.7   
13866  2024-07-31 20:00:00  0.66   2.9   4.5   7.4   24   94    NaN  721.8  

In [116]:
import pandas as pd

# Sheet that holds every station side by side
df = db_2023_2024['Param_horarios_Estaciones']

# Define lists of stations and starting column indices
estaciones_db_2023_2024 = [
    'SURESTE', 'NORESTE', 'CENTRO', 'NOROESTE', 'SUROESTE',
    'NOROESTE2', 'NORTE', 'NORESTE2', 'SURESTE2', 'SUROESTE2',
    'SURESTE 3', 'SUR', 'NORTE2', 'NORESTE3', 'NOROESTE3'
]
estaciones_db_2023_2024_inicio = [1, 17, 33, 49, 65, 81, 97, 113, 129, 145, 161, 177, 193, 209, 225]

# Create the dictionary
diccionario_db_2023_2024 = {}
for estacion, inicio in zip(estaciones_db_2023_2024, estaciones_db_2023_2024_inicio):
    # Select column 0 (date) and 15 parameter columns (inicio to inicio + 14, inclusive)
    cols_indices = [0] + list(range(inicio, inicio + 15))  # Adjusted to 15 to include all parameters
    # Print a separator with the station name
    print("-"*10, estacion)
    cols = df.iloc[:, cols_indices]
    # Use row 0 as column names (parameters: date, CO, NO, etc.)
    new_column_names = ['date'] + cols.iloc[0, 1:].fillna('unknown').tolist()  # Replace NaN in row 0, skip date column
    # Create new DataFrame, dropping rows 0 (parameters) and 1 (units)
    new_df = cols.drop(index=[0, 1]).reset_index(drop=True).copy()
    # Assign new column names
    new_df.columns = new_column_names
    diccionario_db_2023_2024[estacion] = new_df
    print(diccionario_db_2023_2024[estacion].head())
    
# Show the keys of the resulting dictionary
print("Dictionary keys:", diccionario_db_2023_2024.keys())

---------- SURESTE
                  date    CO    NO   NO2   NOX   O3 PM10  PM2.5    PRS RAINF  \
0  2023-01-01 00:00:00  2.37  54.5  32.6  87.1    3  110     68  721.7     0   
1  2023-01-01 01:00:00  2.12  38.7  30.3  68.9    3  116  67.18  721.5     0   
2  2023-01-01 02:00:00  2.05  38.7  28.8  67.4    3  117  75.12  721.1     0   
3  2023-01-01 03:00:00   2.5  60.5  29.1  89.4    3  135  82.81  720.8     0   
4  2023-01-01 04:00:00  1.94  42.3  25.7  67.7  NaN  132  59.56  720.7     0   

   RH  SO2 SR   TOUT  WSR  WDV  
0  68  3.5  0  16.39  3.2  257  
1  72  3.4  0  15.17  3.3  278  
2  71  3.6  0  14.82  3.7  278  
3  68  3.8  0  15.51  3.6  197  
4  73  3.6  0  13.81  4.9  271  
---------- NORESTE
                  date    CO    NO   NO2    NOX O3 PM10 PM2.5    PRS RAINF  \
0  2023-01-01 00:00:00   3.4  30.4    43   73.4  7  222   NaN  718.4     0   
1  2023-01-01 01:00:00   4.3  67.2  44.4  111.6  8  311   NaN  718.1     0   
2  2023-01-01 02:00:00  4.28  63.9  41.5  105.5  
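
The hard-coded start indices in the cell above follow a fixed stride of 16 columns, and the split frames keep their values as generic objects because the header rows were read as data. The sketch below shows, on a tiny synthetic sheet, how the start indices could be derived from the stride and how the columns could be converted to proper types. The station names, the parameter subset, and the spacer column between blocks are assumptions made for illustration, not the layout of the real workbook.

```python
import pandas as pd

STATIONS = ["A", "B"]            # hypothetical station names
PARAMS = ["CO", "NO"]            # hypothetical parameter subset
BLOCK_WIDTH = len(PARAMS) + 1    # parameters plus one spacer column (assumed)

# Synthetic wide sheet: row 0 = parameter names, row 1 = units, then data rows
header = [None] + (PARAMS + [None]) * len(STATIONS)
units = [None] + (["ppm", "ppb"] + [None]) * len(STATIONS)
rows = [
    ["2023-01-01 00:00:00", "2.37", "54.5", None, "3.4", "30.4", None],
    ["2023-01-01 01:00:00", "2.12", "38.7", None, "4.3", "67.2", None],
]
wide = pd.DataFrame([header, units] + rows)

# Derive the first column of each station block from the stride
starts = list(range(1, 1 + BLOCK_WIDTH * len(STATIONS), BLOCK_WIDTH))

per_station = {}
for station, start in zip(STATIONS, starts):
    block = wide.iloc[:, [0] + list(range(start, start + len(PARAMS)))]
    frame = block.drop(index=[0, 1]).reset_index(drop=True).copy()
    frame.columns = ["date"] + block.iloc[0, 1:].tolist()
    # Convert text to real datetimes and numbers; unparseable cells become NaT/NaN
    frame["date"] = pd.to_datetime(frame["date"], errors="coerce")
    for col in PARAMS:
        frame[col] = pd.to_numeric(frame[col], errors="coerce")
    per_station[station] = frame

# The same rule reproduces the 15 start indices used in the notebook (stride 16)
notebook_starts = list(range(1, 1 + 16 * 15, 16))
print(starts, notebook_starts)
```

Deriving the indices from the stride avoids typing errors in the list, and the numeric conversion makes later aggregation and plotting behave as expected.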

Now we overwrite the dictionary imported at the beginning with the per-station version and save it to a new Excel file.



In [122]:
db_2023_2024 = diccionario_db_2023_2024
print(db_2023_2024.keys())

with pd.ExcelWriter('Bases_Datos/DATOS HISTÓRICOS 2023_2024_TODAS ESTACIONES - corregido.xlsx', engine='openpyxl') as writer:
    for sheet_name, df in db_2023_2024.items():
        df.to_excel(writer, sheet_name=sheet_name, index=False)

print("Excel file created: DATOS HISTÓRICOS 2023_2024_TODAS ESTACIONES - corregido.xlsx")

dict_keys(['SURESTE', 'NORESTE', 'CENTRO', 'NOROESTE', 'SUROESTE', 'NOROESTE2', 'NORTE', 'NORESTE2', 'SURESTE2', 'SUROESTE2', 'SURESTE 3', 'SUR', 'NORTE2', 'NORESTE3', 'NOROESTE3'])
Excel file created: DATOS HISTÓRICOS 2023_2024_TODAS ESTACIONES - corregido.xlsx
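
Note that the key `'SURESTE 3'` in this dictionary contains a space, while the 2020-2021 and 2022-2023 workbooks use `'SURESTE3'`. If the periods are later matched by station name, the keys would need to agree. A minimal sketch of one way to normalize them follows; `normalize_station_keys` is a hypothetical helper, not part of the notebook.

```python
def normalize_station_keys(stations):
    """Return a new dict whose keys are upper-cased with all whitespace removed."""
    normalized = {}
    for name, frame in stations.items():
        key = "".join(name.split()).upper()
        if key in normalized:
            raise ValueError(f"duplicate station key after normalization: {key}")
        normalized[key] = frame
    return normalized


# Stand-in values; in practice these would be the per-station DataFrames
example = {"SURESTE 3": "frame_a", "SUR": "frame_b"}
normalized_keys = list(normalize_station_keys(example))
print(normalized_keys)
```

Raising on a collision is a deliberate choice here, so that two differently spelled sheets are never merged silently.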


Because 2023 appears in two datasets, its records are duplicated. The 2023 records in the dictionary `db_2022_2023` will therefore be removed. We drop them from this dataset rather than from the other one because the 2022-2023 database has no data for August through December 2023.

In [135]:
# db_2022_2023
db_2022={}

for estacion in db_2022_2023.keys():
    df = db_2022_2023[estacion]
    # Filter the DataFrame so that it excludes the records from 2023
    df_filtrado = df[~df.iloc[:, 0].astype(str).str.contains("2023", na=False)]

    db_2022[estacion] = df_filtrado

    print("-"*30,estacion)
    print(df_filtrado.tail(5))


------------------------------ SURESTE
                    date    CO    NO   NO2   NOX    O3   PM10  PM2.5    PRS  \
8755 2022-12-31 19:00:00  1.51  12.1  39.9  51.9  10.0   52.0  23.08  721.1   
8756 2022-12-31 20:00:00  1.93  28.7  41.1  69.7   6.0   60.0  38.44  721.4   
8757 2022-12-31 21:00:00  2.47  56.3  43.0  99.1   5.0   86.0  62.71  721.5   
8758 2022-12-31 22:00:00  2.45  52.8  39.2  91.8   4.0  135.0  72.58  721.7   
8759 2022-12-31 23:00:00  2.23  27.5  34.2  61.6   4.0  127.0  72.58  722.0   

      RAINF    RH  SO2   SR   TOUT  WSR    WDR  
8755    0.0  47.0  3.4  0.0  21.64  5.5  198.0  
8756    0.0  51.0  3.4  0.0  20.25  4.6  227.0  
8757    0.0  56.0  3.5  0.0  19.07  4.8  240.0  
8758    0.0  56.0  3.5  0.0  18.63  4.3  249.0  
8759    0.0  61.0  3.3  0.0  17.74  4.4  214.0  
------------------------------ NORESTE
                    date    CO    NO   NO2    NOX    O3   PM10  PM2.5    PRS  \
8755 2022-12-31 19:00:00  1.77  16.0  44.5   60.5  12.0   84.0   52.0  71
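
The filter above identifies 2023 rows by searching for the text "2023" in the first column. A hedged alternative is to compare the year of a parsed datetime, which states the intent more directly. The sketch below assumes the date column is named `date`, as in the station sheets; `drop_year` is a hypothetical helper.

```python
import pandas as pd


def drop_year(df, year, date_col="date"):
    """Return the rows of df whose date does not fall in the given year.

    Rows with unparseable dates are kept, matching the na=False behavior
    of the text-based filter.
    """
    dates = pd.to_datetime(df[date_col], errors="coerce")
    return df[dates.dt.year != year].reset_index(drop=True)


# Small synthetic sample around the year boundary
sample = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-12-31 23:00:00", "2023-01-01 00:00:00"]),
        "CO": [2.23, 2.37],
    }
)
filtered = drop_year(sample, 2023)
print(filtered)
```

This version only applies to sheets that actually contain a date column, so a metadata sheet such as `CATÁLOGO` would have to be skipped before calling it.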

In [140]:
# Now we will create two master datasets.

# 2020-2022 (joining two datasets)
print(db_2020_2021.keys())
print(db_2022.keys())

# Master dictionary for 2020-2022
db_2020_2022 = {}

# Iterate over the station sheets; the 'CATÁLOGO' metadata sheet is handled separately below
for station in db_2020_2021.keys():
    if station == 'CATÁLOGO':
        continue
    if station not in db_2022:
        print(f"Warning: station {station} is not in db_2022")
        continue

    df1 = db_2020_2021[station]
    df2 = db_2022[station]

    # Check that both DataFrames have the same columns
    cols1 = df1.columns.tolist()
    cols2 = df2.columns.tolist()
    if cols1 == cols2:
        # Identical columns: concatenate directly
        df_combined = pd.concat([df1, df2], ignore_index=True)
    else:
        # Report the differences and fall back to the shared columns
        print(f"Warning: the columns of {station} do not match.")
        print(f"db_2020_2021[{station}]: {cols1}")
        print(f"db_2022[{station}]: {cols2}")
        common_cols = [c for c in cols1 if c in cols2]
        if common_cols:
            print(f"Using common columns for {station}: {common_cols}")
            df_combined = pd.concat([df1[common_cols], df2[common_cols]], ignore_index=True)
        elif station == "NOROESTE3":
            # Exception: this station started in 2022, so it will not be in the other database
            df_combined = pd.concat([df1, df2], ignore_index=True)
            print(f"{station} combined by exception (Misión San Juan station).")
        else:
            print(f"No common columns for {station}. Skipping.")
            continue

    # Ensure datetime format and chronological order
    if 'date' in df_combined.columns:
        df_combined['date'] = pd.to_datetime(df_combined['date'], errors='coerce')
        df_combined = df_combined.sort_values('date').reset_index(drop=True)
    db_2020_2022[station] = df_combined

# Handle CATÁLOGO (metadata): copy it directly
if 'CATÁLOGO' in db_2020_2021:
    db_2020_2022['CATÁLOGO'] = db_2020_2021['CATÁLOGO']  # Or use db_2022_2023['CATÁLOGO']

# Check the keys of the resulting dictionary
print("Keys of db_2020_2022:", db_2020_2022.keys())

# 2023-2024 (already created and separated by station)
db_2023_2024
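
After two periods are concatenated for a station, it can help to confirm the date coverage and to check that no timestamp appears twice. The following is a minimal sketch on synthetic data, assuming a `date` column as in the station sheets; `summarize_station` is a hypothetical helper and its output here does not describe the real datasets.

```python
import pandas as pd


def summarize_station(df, date_col="date"):
    """Return row count, date range, and number of repeated timestamps."""
    dates = pd.to_datetime(df[date_col], errors="coerce")
    return {
        "rows": len(df),
        "first": dates.min(),
        "last": dates.max(),
        "duplicate_timestamps": int(dates.duplicated().sum()),
    }


# Two synthetic periods that overlap by one hour
part1 = pd.DataFrame({"date": ["2021-12-31 22:00:00", "2021-12-31 23:00:00"], "CO": [1.0, 1.1]})
part2 = pd.DataFrame({"date": ["2021-12-31 23:00:00", "2022-01-01 00:00:00"], "CO": [1.1, 1.2]})
combined = pd.concat([part1, part2], ignore_index=True)

summary = summarize_station(combined)
print(summary)
```

A non-zero `duplicate_timestamps` count would indicate that the periods overlap and that one copy of the repeated rows should be dropped before analysis.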