# Extracting the Data Files for Source Apportionment

This notebook extracts the data for source apportionment. The output is a set of CSV files containing the concentrations of metals and ions and the PM2.5 mass concentration at a specified site for all available years.

The following procedures use the CSV files prepared with [extract_PM25_data.ipynb](./extract_PM25_data.ipynb), so run that notebook before using this one.

The output CSV files will have the following names and will be saved in a directory named `{NAPS site ID}_for_PMF`.

| - | Metal data | Ion data | PM2.5 data |
| - | ---------- | -------- | ---------- |
| File Name | NT_{element full name}.csv | ion_{element full name}.csv | PM25_Sampler1.csv |

These files will have the following structure.

| sampling_date | {full name of element} | {abbreviation of element}-MDL |
| ------------- | ------- | --- |
| Date | element or ion's concentration | MDL |

All concentrations are provided in $ng/m^3$. Because the ion concentrations and the PM2.5 mass concentration are given in $\mu g/m^3$ in the original files, they are converted to $ng/m^3$ (multiplied by 1000).
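
As a minimal sketch of this unit conversion (not the project's own implementation), the snippet below multiplies selected columns by 1000 to go from µg/m³ to ng/m³. The helper name `ug_to_ng`, the column names, and the values are all invented for illustration, and converting the MDL column alongside the concentration is an assumption.

```python
import pandas as pd

UG_TO_NG = 1000  # 1 µg = 1000 ng


def ug_to_ng(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with the given columns converted from µg/m³ to ng/m³."""
    converted = df.copy()
    converted[columns] = converted[columns] * UG_TO_NG
    return converted


# Hypothetical sample; column names and values are illustrative only
sample = pd.DataFrame({
    "sampling_date": ["2010-01-01", "2010-01-04"],
    "sulphate": [1.5, 0.25],
    "SO4-MDL": [0.01, 0.01],
})

# Assumption: the MDL is reported in the same unit as the concentration,
# so both columns are converted together
sample_ng = ug_to_ng(sample, ["sulphate", "SO4-MDL"])
print(sample_ng)
```

Working on a copy leaves the original frame untouched, which makes it harder to apply the factor twice by accident when a cell is re-run.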

We start by importing the required libraries and setting the directory paths.

In [None]:
import sys
from pathlib import Path

project_root = Path.cwd().parents[0]
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

import numpy as np
import pandas as pd

from src.config import INDEX_CSV
from src.data.source_apportionment_extraction import (
    create_nt_element_files,
    create_PM25_file,
    create_ion_files,
)


For a feasibility test of source apportionment, we use the concentrations from a single site. The following code checks which site is suitable for this purpose; a sampling frequency of one in three days is preferable.

In [None]:
index_df = pd.read_csv(INDEX_CSV)

# filter for frequency == 3
freq3_df = index_df[index_df['frequency'] == 3]

# group by 'year', 'site_id', 'element_form' and count unique 'element'
element_counts = freq3_df.groupby(
    ['year', 'site_id', 'element_form'])['element'].nunique().reset_index(name='element_count')

# group by 'site_id' and sum the 'element_count' to get the total per site_id
total_elements_per_site = element_counts.groupby(
    'site_id')['element_count'].sum().reset_index(name='total_elements')

total_elements_per_site.sort_values('total_elements', ascending=False).head(5)
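
To see what the two-step grouping computes, here is a toy version run on a made-up index table. The column names match those used in the cell above, but the rows are invented and are not NAPS data.

```python
import pandas as pd

# Invented rows for illustration only (not real NAPS index data)
toy_index = pd.DataFrame({
    "year":         [2010, 2010, 2010, 2011, 2011, 2010],
    "site_id":      [1, 1, 1, 1, 2, 2],
    "element_form": ["NT", "NT", "ion", "NT", "NT", "NT"],
    "element":      ["Fe", "Zn", "sulphate", "Fe", "Fe", "Fe"],
    "frequency":    [3, 3, 3, 3, 3, 6],
})

# Keep only the rows sampled once every three days
toy_freq3 = toy_index[toy_index["frequency"] == 3]

# Step 1: number of unique elements per (year, site, form)
toy_counts = (
    toy_freq3.groupby(["year", "site_id", "element_form"])["element"]
    .nunique()
    .reset_index(name="element_count")
)

# Step 2: sum those counts over years and forms for each site
toy_totals = (
    toy_counts.groupby("site_id")["element_count"]
    .sum()
    .reset_index(name="total_elements")
    .sort_values("total_elements", ascending=False)
)
print(toy_totals)
```

In this toy table, site 1 scores 4 and site 2 scores 1. An element measured in two different years is counted twice, so the total reflects both the number of species and the length of the record.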

Among the sites sampled once every three days, NAPS site 60211 has the most data and seems suitable as the first dataset. The following code creates data files for each element measured at 60211.

In [None]:
target_site_id = 60211

create_nt_element_files(target_site_id)


In [None]:
create_PM25_file(target_site_id)

In [None]:
create_ion_files(target_site_id)
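
Once the files are written, a quick structural check can catch missing or malformed outputs before they go into PMF. The sketch below assumes every output file follows the three-column layout described at the top of this notebook. The helper `check_output_files` is hypothetical and not part of `src`, and the demonstration runs on a throwaway directory with made-up files because the real location of `60211_for_PMF` depends on project configuration not shown here.

```python
from pathlib import Path
import tempfile

import pandas as pd


def check_output_files(output_dir: Path) -> dict[str, bool]:
    """Map each CSV name in output_dir to whether it has the expected layout."""
    results = {}
    for csv_path in sorted(output_dir.glob("*.csv")):
        # nrows=0 reads only the header row
        columns = pd.read_csv(csv_path, nrows=0).columns.tolist()
        results[csv_path.name] = (
            len(columns) == 3
            and columns[0] == "sampling_date"
            and columns[2].endswith("-MDL")
        )
    return results


# Demonstration on a temporary directory with invented files
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "60211_for_PMF"
    demo_dir.mkdir()
    pd.DataFrame(
        {"sampling_date": ["2010-01-01"], "Iron": [12.3], "Fe-MDL": [0.5]}
    ).to_csv(demo_dir / "NT_Iron.csv", index=False)
    # A deliberately malformed file to show a failing check
    pd.DataFrame({"date": ["2010-01-01"], "Iron": [12.3]}).to_csv(
        demo_dir / "bad.csv", index=False
    )
    report = check_output_files(demo_dir)

print(report)
```

To use it on real output, pass the path of the generated `{NAPS site ID}_for_PMF` directory instead of `demo_dir`.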