# SmokeyNet Process

<b>Summary:</b><br>
The SmokeyNet data used here was generated and prepared by the San Diego Supercomputer Center and was provided manually in JSON format. It contains image predictions for the [FIgLib dataset](http://hpwren.ucsd.edu/HPWREN-FIgLib/HPWREN-FIgLib-Data/): both whole-image predictions and image-tile predictions, along with the actual ground-truth value. The data was originally split into 3 files: train, validation, and test. This notebook reads the 3 files and combines them into a single file, adding a new column that records the source file (train, valid, test).

- Read in the 3 SmokeyNet JSON files
- Join them into a single dataset with a new column to retain the source (train, valid, test)
- Write the new dataset to CSV

<b>Output:</b><br>
.<br>
└──data<br>
&emsp;&emsp;&emsp;└── processed<br>
&emsp;&emsp;&emsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp;└── smokeynet.csv<br>

<b>Areas for Improvement:</b><br>

In [1]:
import os

import pandas as pd

In [2]:
def smokeynet_parse(data_df):
    """
    Transpose the raw frame so each row is one image, rename camera_name to
    event_name, and add camera_name, date and year columns parsed from the
    event name for aggregation.
    """
    # transpose the data so that each row is one image (keyed by filepath)
    data_df_T = data_df.transpose().reset_index().rename(columns={"index": "filepath"})

    # the raw camera_name field actually holds the event name
    data_df_T = data_df_T.rename(columns={"camera_name": "event_name"})

    # extract values
    event_split_df = data_df_T["event_name"].str.split("_", n=2, expand=True)
    data_df_T["camera_name"] = event_split_df[2]
    data_df_T["date"] = event_split_df[0]
    data_df_T["year"] = data_df_T["date"].str[:4]
    # data_df_T["img_seq"] = data_df_T["filepath"].str.split("_", expand=True)[3]

    return data_df_T
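
The split step in `smokeynet_parse` can be hard to picture without the data at hand. The sketch below reproduces only that step on two made-up event names. It assumes event names follow a `<yyyymmdd>_<label>_<camera>` pattern, which is what the positions used in the function imply. The names themselves are hypothetical and are not taken from the SmokeyNet files.

```python
import pandas as pd

# Hypothetical event names, assumed to look like "<yyyymmdd>_<label>_<camera>".
events = pd.Series(["20190716_FIRE_bl-s-mobo-c", "20200806_FIRE_om-e-mobo-c"])

# n=2 stops after the first two underscores, so an underscore inside the
# camera part (if one ever occurs) stays in the last piece.
parts = events.str.split("_", n=2, expand=True)

parsed = pd.DataFrame(
    {
        "event_name": events,
        "date": parts[0],
        "year": parts[0].str[:4],  # first four characters of the date string
        "camera_name": parts[2],
    }
)
print(parsed)
```

Note that `date` and `year` are kept as strings here, as they are in the function above.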

## Get data + explore event counts by year across the 3 JSON files

### Train

In [3]:
train_data_df = pd.read_json("../../data/raw/smokeynet_train.json")
train_data_df = smokeynet_parse(train_data_df)
# camera_name is actually event_name
train_data_df[["event_name", "year"]].drop_duplicates()[
    "year"
].value_counts().sort_index()
# unique fire events - 143

2016    12
2017    38
2018    53
2019    29
2020    11
Name: year, dtype: int64

### Validation

In [4]:
valid_data_df = pd.read_json("../../data/raw/smokeynet_valid.json")
valid_data_df = smokeynet_parse(valid_data_df)
# camera_name is actually event_name
valid_data_df[["event_name", "year"]].drop_duplicates()[
    "year"
].value_counts().sort_index()
# unique fire events - 64

2018     6
2019    29
2020    27
2021     2
Name: year, dtype: int64

### Test

In [5]:
test_data_df = pd.read_json("../../data/raw/smokeynet_test.json")
test_data_df = smokeynet_parse(test_data_df)
# camera_name is actually event_name
test_data_df[["event_name", "year"]].drop_duplicates()[
    "year"
].value_counts().sort_index()
# unique fire events - 63

2016     2
2017     1
2018     4
2019    21
2020    29
2021     6
Name: year, dtype: int64
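
The same counting expression is repeated for each of the three splits. One possible way to factor it out is sketched below. `events_per_year` is a hypothetical helper name that does not appear in the notebook, and the toy frame is made up for illustration.

```python
import pandas as pd


def events_per_year(df):
    """Count unique event names per year (hypothetical helper)."""
    # Each event spans many image rows, so deduplicate before counting.
    unique_events = df[["event_name", "year"]].drop_duplicates()
    return unique_events["year"].value_counts().sort_index()


# Made-up rows: two images from one 2019 event and one image from a 2020 event.
toy_df = pd.DataFrame(
    {
        "event_name": ["20190716_FIRE_a", "20190716_FIRE_a", "20200806_FIRE_b"],
        "year": ["2019", "2019", "2020"],
    }
)
counts = events_per_year(toy_df)
print(counts)
```

With such a helper, each split's cell would reduce to a single call such as `events_per_year(train_data_df)`.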

## Concat + write to csv

In [6]:
train_data_df["file_source"] = "train"  # 10438 rows
valid_data_df["file_source"] = "valid"  # 4911 rows
test_data_df["file_source"] = "test"  # 4885 rows

In [7]:
all_smokeynet_data_df = pd.concat(
    [train_data_df, valid_data_df, test_data_df]
)  # 20234 rows
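
The row counts in the comments (10438 + 4911 + 4885 = 20234) can also be checked in code rather than by eye. The sketch below shows the idea on small made-up frames. It passes `ignore_index=True`, which the notebook does not use. Without it, `pd.concat` keeps each frame's own index labels, so labels repeat across splits. That is harmless here because the CSV is written with `index=False`.

```python
import pandas as pd

# Made-up stand-ins for the three parsed splits.
train_df = pd.DataFrame({"filepath": ["a", "b"], "file_source": "train"})
valid_df = pd.DataFrame({"filepath": ["c"], "file_source": "valid"})
test_df = pd.DataFrame({"filepath": ["d"], "file_source": "test"})

frames = [train_df, valid_df, test_df]
# ignore_index=True renumbers rows 0..n-1 instead of repeating 0, 1, ...
combined_df = pd.concat(frames, ignore_index=True)

# The combined frame should contain exactly the rows of its parts.
assert len(combined_df) == sum(len(frame) for frame in frames)
print(combined_df["file_source"].value_counts())
```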
# all_smokeynet_data_df

In [8]:
all_smokeynet_data_df.to_csv("../../data/processed/smokeynet.csv", index=False)
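
One thing to keep in mind when loading `smokeynet.csv` later: `date` and `year` are strings in this notebook, but `pd.read_csv` infers integer types for all-digit columns by default. The sketch below illustrates this with a made-up single-row frame and an in-memory buffer in place of the real file. Passing `dtype` is one option if downstream code expects strings.

```python
import io

import pandas as pd

# Made-up row with the same string-typed columns the notebook produces.
df = pd.DataFrame({"date": ["20160604"], "year": ["2016"], "file_source": ["train"]})

buffer = io.StringIO()  # stands in for the CSV file on disk
df.to_csv(buffer, index=False)

buffer.seek(0)
inferred_df = pd.read_csv(buffer)  # all-digit columns come back as integers

buffer.seek(0)
as_text_df = pd.read_csv(buffer, dtype={"date": str, "year": str})  # kept as text

print(inferred_df.dtypes)
print(as_text_df.dtypes)
```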