# Extracting PM 2.5 data

## Table of Contents:

1. [Extraction of integrated data (pre 2010)](#extraction_pre_2010)
    1. [Correction for the data from site 100702 in 2006 and 2007](#correction_100702_files)
    2. [Extraction of metal and ion data (2003-2009)](#extraction_metal_ion_2003_2009)
2. [Extraction of integrated data (post 2010)](#extraction_post_2010)
3. [Extraction of continuous PM$_{2.5}$ data](#extraction_continuous_pm25)

This notebook calls functions that extract the Near Total and Water-soluble metal concentrations from the data set prepared by [index_PM25_data.ipynb](index_PM25_data.ipynb). The code assumes the following directory tree.

```
├── data/
│   ├── config/
│   ├── metadata/
│   │   │
│   │   ├── index.csv
│   │   └── stations_metadata.csv
│   │
│   └── raw/
│       ├── 2003/
│       │    └─SPECIATION/
│       │        ├─S50104_CARB.XLS
│       │        ├─S50104_IC.XLS
│       │        ├── ...
│       │        └─S101004_WICPMS.XLS
│       ├── ...
│       └─── 2019/
│
├── notebooks/
│   └── extract_PM25_data.ipynb (this notebook)
├── ...
```
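
As a rough illustration of the layout above, the sketch below splits a raw file path into its year, site ID, and analysis type. The naming pattern `S<site id>_<analysis type>.XLS` is an assumption inferred from the example file names in the tree, and `parse_raw_path` is a hypothetical helper that is not part of `src`; files from other years may follow a different layout.

```python
import re
from pathlib import PurePosixPath

# Assumed pattern "S<site id>_<analysis type>.XLS", inferred from the example names in the tree.
FILENAME_PATTERN = re.compile(
    r"^S(?P<site_id>\d+)_(?P<analysis>[A-Z0-9]+)\.XLS$", re.IGNORECASE
)


def parse_raw_path(path):
    """Return (year, site_id, analysis) for a path such as
    data/raw/2003/SPECIATION/S50104_IC.XLS, or None if the name does not match."""
    p = PurePosixPath(path)
    match = FILENAME_PATTERN.match(p.name)
    if match is None:
        return None
    # The year is the directory two levels above the file: <year>/SPECIATION/<file>
    year = int(p.parts[-3])
    return year, match["site_id"], match["analysis"].upper()


print(parse_raw_path("data/raw/2003/SPECIATION/S101004_WICPMS.XLS"))
```

The function returns `None` for names that do not match, so unexpected files can be skipped or reported rather than raising an error.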

The following code imports the required libraries and adds the project root to `sys.path` so that the `src` package can be imported.


In [None]:
import sys
from pathlib import Path

project_root = Path.cwd().parents[0]
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

import numpy as np
import pandas as pd
from src.data.extract_pre_2010_data import extract_pre_2010
from src.data.extract_post_2010_data import extract_post_2010
from src.data.extract_continuous_pm25_data import extract_continuous_pm25


The file formats and structures differ considerably between the pre and post 2010 data. For example, the Near Total and Water-soluble data are provided in two separate files through 2009, whereas both analyte types are in the same file from 2010 onward. Therefore, the extraction is carried out separately for each data set.

## 1. Extraction of integrated data (pre 2010)<a name='extraction_pre_2010'></a>

### 1.1. Correction for the data from site 100702 in 2006 and 2007<a name='correction_100702_files'></a>

<div class="alert alert-block alert-warning">
    The <b>S100702_ICPMS.XLS</b> files for <b>2006</b> and <b>2007</b> contain a critical format error. They must be corrected <b>manually</b> before running the following code (this operation has not been automated because of a problem with xlutils.copy); otherwise, you will encounter an InvalidIndexError with the message 'Reindexing only valid with uniquely valued Index objects.'
</div>

- Situation
    - The values of column AS (START_TIME) are stored in column BG. The values of column AT (END_TIME) are stored in column BH. This error also causes the column for NAPS ID to move to column BI. 
    - ![problem_S100702_ICPMS.png](images/problem_S100702_ICPMS.png)
- Solution
    - The last three columns in the files should be shifted to the left.
- Tips
    - The values between AV and BI in the second row should be eliminated. Even if the cells appear empty, invisible values may remain. Select the range and delete its contents with the keyboard; otherwise, the following code may raise an error.
    - The corrected files should be saved in the original XLS format. Note that Excel tries to save the file in xlsx format by default.
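
The shift described above can be pictured on a toy table. This is only a sketch of the idea, not a replacement for the manual correction of the XLS files: the `Pb` column, the `extra_*` column names, and all values are hypothetical, and the real files contain many more columns.

```python
import pandas as pd

# Toy frame mimicking the defect: the named columns are empty and their
# values sit in three stray trailing columns (hypothetical names and values).
broken = pd.DataFrame({
    "Pb": [0.5, 0.7],
    "START_TIME": [None, None],
    "END_TIME": [None, None],
    "NAPS ID": [None, None],
    "extra_1": ["00:00", "00:00"],
    "extra_2": ["23:59", "23:59"],
    "extra_3": [100702, 100702],
})


def shift_last_three_left(df, targets=("START_TIME", "END_TIME", "NAPS ID")):
    """Copy the last three columns into the target columns, then drop the strays."""
    fixed = df.copy()
    stray = list(fixed.columns[-3:])
    for target, source in zip(targets, stray):
        fixed[target] = fixed[source].to_numpy()
    return fixed.drop(columns=stray)


fixed = shift_last_three_left(broken)
print(fixed)
```

After the shift, each value sits under its own header and no stray trailing columns remain, which is the state the manually corrected files should be in.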

### 1.2. Extraction of metal and ion data (2003-2009)<a name='extraction_metal_ion_2003_2009'></a>

The following function extracts the pre 2010 data (i.e., 2003 to 2009) and stores CSV files under a new directory, `INTEGRATED_PM25_DIR`, whose default location is `data/processed/integrated_pm25`. The function also extracts ion data, when available, for each year in which ICP-MS data exist. The ion data are stored separately from the ICP-MS data, in files with the name suffix `_IC`.


In [None]:
extract_pre_2010()


## 2. Extraction of integrated data (post 2010)<a name='extraction_post_2010'></a>

The following function will extract the post 2010 data (i.e., 2010 to 2019) from the Excel files. The mass concentrations of PM2.5 are merged with the metals data during the extraction. The PM2.5 data from Sampler #1 will be combined with the Near Total metals data, whereas the PM2.5 data from Sampler #2 will be combined with the Water-soluble metals data.
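
The sampler pairing described above can be sketched with a small pandas merge. This is an illustration under stated assumptions, not the code inside `extract_post_2010`: the column names (`date`, `sampler`, `pm25`, `Pb`), the values, and the helper `attach_pm25` are all hypothetical.

```python
import pandas as pd

# Hypothetical PM2.5 mass records from two samplers on two sampling dates.
pm25 = pd.DataFrame({
    "date": ["2015-01-01", "2015-01-01", "2015-01-04", "2015-01-04"],
    "sampler": [1, 2, 1, 2],
    "pm25": [8.1, 7.9, 12.4, 12.0],
})
# Hypothetical metals tables, one per analyte type.
near_total = pd.DataFrame({"date": ["2015-01-01", "2015-01-04"], "Pb": [1.2, 2.3]})
water_soluble = pd.DataFrame({"date": ["2015-01-01", "2015-01-04"], "Pb": [0.6, 1.1]})


def attach_pm25(metals, pm25, sampler):
    """Left-join the PM2.5 mass from one sampler onto a metals table by date."""
    mass = pm25.loc[pm25["sampler"] == sampler, ["date", "pm25"]]
    return metals.merge(mass, on="date", how="left")


# Sampler #1 pairs with Near Total, Sampler #2 with Water-soluble.
near_total_merged = attach_pm25(near_total, pm25, sampler=1)
water_soluble_merged = attach_pm25(water_soluble, pm25, sampler=2)
print(near_total_merged)
print(water_soluble_merged)
```

A left join keeps every metals record even when the matching PM2.5 mass is missing, in which case the `pm25` value is `NaN`.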

In [None]:
extract_post_2010()


The Near-Total and Water-Soluble metals and ions data are now saved as CSV files in `data/processed`. The data are ready for QA/QC ahead of analysis, and the files can be processed further for source apportionment.
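
To check what was written, a minimal sketch such as the following lists the CSV files and separates the ion files from the rest. It assumes that the `_IC` suffix appears at the end of the file name, just before `.csv`; `split_outputs` is a hypothetical helper, and the relative path assumes the notebook runs from `notebooks/`.

```python
from pathlib import Path


def split_outputs(processed_dir):
    """Group CSV files under processed_dir into ion files (stem ends with '_IC')
    and all other files. Returns two sorted lists of paths."""
    root = Path(processed_dir)
    if not root.is_dir():
        # Nothing has been extracted yet, or the path is wrong.
        return [], []
    csv_files = sorted(root.rglob("*.csv"))
    ion_files = [f for f in csv_files if f.stem.endswith("_IC")]
    other_files = [f for f in csv_files if not f.stem.endswith("_IC")]
    return ion_files, other_files


# Relative to the notebooks/ directory (assumption).
ion_files, other_files = split_outputs(Path("..") / "data" / "processed")
print(f"{len(ion_files)} ion files, {len(other_files)} other CSV files")
```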

## 3. Extraction of continuous PM$_{2.5}$ data<a name='extraction_continuous_pm25'></a>

In [None]:
extract_continuous_pm25('all')
