# 1. Setup and Check

This notebook provides code to:

- obtain metadata for the NAPS dataset (e.g., station ID, location, reported species, etc.)
- verify that your environment is properly set up for subsequent data analysis
- demonstrate basic usage of pandas for working with tabular data

---

**💡 Tip for Beginners**

If nothing happens when you try to run a cell:

- Check if the **kernel is running** (top-right corner).
- If it's unresponsive, go to **Kernel > Restart Kernel** and try again.

---

The cell below imports the required packages. Press `Shift + Enter` to run it.

In [None]:
import pandas as pd
import sys
from pathlib import Path

# add the project root (the parent of the current folder) to the import path so that `src` can be imported
sys.path.insert(0, str(Path.cwd().parent))

from src.config import *
from src.download_data import *
from src.load_data import *
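
If the imports above fail, it can help to see which packages are missing. The sketch below is one possible check. The package list is an assumption (pandas is used throughout this notebook, and openpyxl is the engine pandas normally uses to read `.xlsx` files), and `missing_packages` is a hypothetical helper, not part of `src`.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list; adjust it to match your project's requirements.
missing = missing_packages(["pandas", "openpyxl"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages were found.")
```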

## 1.1. Download the Station Information

The following function downloads the NAPS station information file to your local machine.

<details>
  <summary><strong>Optional: manual download (click to expand)</strong></summary>

  If the automatic download fails, download the file manually:

  1. Open the [NAPS website](https://data-donnees.az.ec.gc.ca/data/air/monitor/national-air-pollution-surveillance-naps-program/)  
  2. Navigate to `ProgramInformation-InformationProgramme/` → `StationsNAPS-StationsSNPA.xlsx`  
  3. Save it to `../data/meta/`

</details>
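
After a manual download, you may want to confirm that the file is where you expect it. A minimal sketch, assuming the file was saved to `../data/meta/` as in the steps above; `find_station_file` is a hypothetical helper, not part of `src`.

```python
from pathlib import Path

STATION_FILE_NAME = "StationsNAPS-StationsSNPA.xlsx"


def find_station_file(folder):
    """Return the path to the station file inside `folder`, or None if it is absent."""
    candidate = Path(folder) / STATION_FILE_NAME
    return candidate if candidate.is_file() else None


# Folder taken from the manual-download steps; change it if your copy lives elsewhere.
found = find_station_file("../data/meta")
if found is None:
    print("Station file not found.")
else:
    print("Station file found at:", found)
```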

In [None]:
download_station_file()


## 1.2. Check the File and Column Definitions

If you're curious about what a column means (e.g., `Status`, `Carbonyl`, or `Core_Site`), you can look it up in the second worksheet of the Excel file.

To do this:

1. Open the file `../data/raw/StationsNAPS-StationsSNPA.xlsx` in Excel or another spreadsheet program.
2. Go to the **second worksheet**.
3. Look in the **first column** for the column name you're interested in (you can use Ctrl+F or ⌘+F to search).
4. The **definition** will be shown in the second column next to it.

This is a useful way to understand the contents of the station dataset before working with it.
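
The same lookup can also be done in pandas. The sketch below uses a toy table in place of the real worksheet, with made-up definitions, and assumes the layout described above (names in the first column, definitions in the second). `find_definition` is a hypothetical helper.

```python
import pandas as pd


def find_definition(definitions, column_name):
    """Return the definition for `column_name`, or None if it is not listed.

    Assumes the first column holds names and the second holds definitions.
    """
    names = definitions.iloc[:, 0].astype(str).str.strip()
    matches = definitions[names.str.lower() == column_name.lower()]
    return None if matches.empty else matches.iloc[0, 1]


# Toy stand-in for the second worksheet; the definitions are placeholders.
# With the real file, something like
#   definitions = pd.read_excel(path_to_file, sheet_name=1)
# would load that worksheet (sheet_name is zero-based).
definitions = pd.DataFrame({
    "Field": ["Status", "Core_Site"],
    "Description": ["placeholder definition A", "placeholder definition B"],
})

print(find_definition(definitions, "core_site"))
```

The match ignores case, so `"core_site"` and `"Core_Site"` return the same row.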


## 1.3. Load the Station Information

The following function will load the downloaded file into a DataFrame named `station_df`.

Once loaded, we’ll explore the dataset using a few common pandas commands:

- `.head()` — shows the first few rows
- `.tail()` — shows the last few rows
- `.info()` — gives a summary of the structure (columns, types, missing values, etc.)

These commands help you get a quick sense of what the data looks like before you start analyzing it.


In [None]:
station_df = load_station_data()

# show first three rows
display(station_df.head(3))


In [None]:
# show final three rows
display(station_df.tail(3))


In [None]:
# show dtypes and non-null counts
# (.info() prints its summary directly and returns None, so no print() is needed)
station_df.info()


## 1.4. Explore the Station Data Yourself

Now that we’ve loaded the station data into the `station_df` DataFrame, try exploring it on your own.

Here are some useful things you can try:

Count how many rows are in the dataset:

```python
len(station_df)
```

Look at all column names:

```python
station_df.columns
```

Count how many rows contain a specific word or phrase. For example, to find how many station names contain the word "Vancouver":

```python
station_df['Station_Name'].str.contains("Vancouver", case=False, na=False).sum()
```

View the matching rows:

```python
station_df[station_df['Station_Name'].str.contains("Vancouver", case=False, na=False)]
```
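
Another idea is to count how often each value appears in a column. The sketch below uses a small toy DataFrame so that it runs on its own. The `Status` values shown are made up and may not match the real file.

```python
import pandas as pd

# Toy data for illustration only; these are not real NAPS records.
toy_df = pd.DataFrame({
    "Station_Name": ["Site A", "Site B", "Site C", "Site D"],
    "Status": ["Open", "Closed", "Open", None],
})

# Count rows per category, keeping missing values as their own group.
status_counts = toy_df["Status"].value_counts(dropna=False)
print(status_counts)
```

To try this on the real data, replace `toy_df` with `station_df` and pick any column you like.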

In [None]:
# You can edit this code cell and/or add other code cells


Please share something interesting you found with your team—such as a station name, an unexpected value, or a summary count.

Feel free to try more than one column. Be curious!

## 1.5. Focus on Our Target Data

We will focus on **core sites**, which provide more comprehensive data than other stations.

> **Core sites** include a wide range of measurements at representative locations across Canada.  
> - **Tier 1 (T1)** sites include PM2.5 speciation data.  
> - **Tier 2 (T2)** sites include PM2.5 reference method (gravimetric) data.

The code below filters for Tier 1 and Tier 2 core sites.


In [None]:
# filter for Tier 1 and Tier 2 core sites
core_df = station_df[station_df['Core_Site'].isin([1, 2])]

# show the number of core sites (rows in core_df)
print('The number of core sites:', len(core_df))

# show first three rows of the core site data
display(core_df.head(3))


Now, `core_df` contains only core monitoring sites.
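
To see how this kind of filter works, here is a minimal sketch on toy data. It assumes, as in the cell above, that `Core_Site` holds `1` or `2` for core sites and is empty otherwise. The station names are placeholders.

```python
import pandas as pd

# Toy data for illustration only; these are not real NAPS records.
toy_df = pd.DataFrame({
    "Station_Name": ["Site A", "Site B", "Site C", "Site D"],
    "Core_Site": [1, 2, None, 1],
})

# .isin() returns one True/False value per row; missing values become False.
is_core = toy_df["Core_Site"].isin([1, 2])

# Indexing with the boolean mask keeps only the rows marked True.
toy_core_df = toy_df[is_core]

print(is_core.tolist())
print(toy_core_df["Core_Site"].value_counts())
```

Running `core_df['Core_Site'].value_counts()` on the real data is a quick way to see how many Tier 1 and Tier 2 sites passed the filter.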

Next, let’s check if any of them are located in Burnaby by searching for stations whose names contain "Burnaby":


In [None]:
core_df[core_df['Station_Name'].str.contains("Burnaby", case=False, na=False)]


The only match is the BURNABY SOUTH station (NAPS ID: 100119), which makes it the only core site among the stations near Burnaby. We will therefore use the data from this station.