# Dataset Staging Lab
Stages datasets for preprocessing to avoid repeatedly sampling from the 22M-row raw dataset. It also draws the test environment from the development set population, so that every test observation has model inference data, such as perplexities and sentiments.

In [1]:
import os
import numpy as np
import pandas as pd
from genailabslm.flow.data_prep.preprocess.task import FilterTask
from genailabslm.infra.utils.file.io import IOService
from tqdm import tqdm


pd.set_option("display.max_columns", 999)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 999)

## Setup Configuration
Each configuration specifies the source and destination filepaths, the fraction of the source to sample, a random_state for reproducibility, and a force flag that restages the file even when the destination already exists.

In [None]:
configs = [
    {
        "source": "data/raw/reviews",
        "dest": "data/stage/prod/reviews",
        "frac": 1,
        "random_state": 65,
        "force": False,
    },
    {
        "source": "data/raw/reviews",
        "dest": "data/stage/dev/reviews",
        "frac": 0.01,
        "random_state": 51,
        "force": False,
    },
    {
        "source": "data/stage/dev/reviews",
        "dest": "data/stage/test/reviews",
        "frac": 0.1,
        "random_state": 51,
        "force": False,
    },
]
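The role of `frac` and `random_state` in these configs can be illustrated with plain pandas. This is a minimal sketch, not the project's code: the `stage` helper and the toy `raw` frame are hypothetical, and it assumes the staging step samples in the same way as `DataFrame.sample`. It shows that a fixed seed reproduces the same sample and that a test set drawn from the development set is always a subset of it.

```python
import pandas as pd

# Hypothetical stand-in for the raw reviews dataset (the real one has ~22M rows).
raw = pd.DataFrame({"id": range(10_000)})


def stage(df: pd.DataFrame, frac: float, random_state: int) -> pd.DataFrame:
    """Draw a reproducible sample; an assumed analogue of the staging step."""
    return df.sample(frac=frac, random_state=random_state)


# Two draws with the same fraction and seed select the same rows.
dev_a = stage(raw, frac=0.01, random_state=51)
dev_b = stage(raw, frac=0.01, random_state=51)

# The test set is sampled from the dev set, so every test id is also a dev id.
test = stage(dev_a, frac=0.1, random_state=51)

print(len(dev_a), len(test))
print(dev_a["id"].tolist() == dev_b["id"].tolist())
print(set(test["id"]) <= set(dev_a["id"]))
```

Sampling the test set from the staged development set, rather than from the raw data, is what keeps the test observations inside the population that already has inference data.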

## Stage Files
This cell iterates through the configs and stages the production, development, and test files. It uses the same seeds as the pipelines so that the staged data match the data in the workspace.

In [3]:
column = "date"
date = 2020
filepath = None
df = None
for config in tqdm(configs):
    if os.path.exists(config["dest"]) and not config["force"]:
        print(f"File {config['dest']} already exists. Skipping...")
    else:
        if filepath != config["source"]:
            print(f"Reading dataset from {config['source']}.")
            df = IOService.read(filepath=config["source"])
            filepath = config["source"]
        # Named filter_task to avoid shadowing the built-in filter().
        filter_task = FilterTask(
            column=column,
            frac=config["frac"],
            date=date,
            random_state=config["random_state"],
        )
        data = filter_task.run(data=df)
        IOService.write(filepath=config["dest"], data=data)
        print(
            f"Created dataset of {data.shape[0]} rows and persisted to {config['dest']}"
        )

  0%|          | 0/3 [00:00<?, ?it/s]

File data/stage/prod/reviews already exists. Skipping...
File data/stage/dev/reviews already exists. Skipping...
Reading dataset from data/stage/dev/reviews.


100%|██████████| 3/3 [00:00<00:00,  7.21it/s]



                                   FilterTask                                   
                                   ----------                                   
                          Start Datetime | Tue, 19 Nov 2024 21:13:56
                       Complete Datetime | Tue, 19 Nov 2024 21:13:56
                                 Runtime | 0.08 seconds
Created dataset of 8670 rows and persisted to data/stage/test/reviews
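For readers without access to `FilterTask`, the staging step above can be approximated in plain pandas. This is a sketch under stated assumptions, not the library's implementation: it assumes `date=2020` means "keep reviews dated in or after 2020" and that sampling follows `DataFrame.sample`. The function name `filter_and_sample` and the toy data are hypothetical.

```python
import pandas as pd


def filter_and_sample(
    data: pd.DataFrame, column: str, date: int, frac: float, random_state: int
) -> pd.DataFrame:
    """Keep rows from `date` (a year) onward, then draw a seeded sample.

    Assumption: this mirrors what FilterTask does; the real task may differ.
    """
    years = pd.to_datetime(data[column]).dt.year
    recent = data[years >= date]
    return recent.sample(frac=frac, random_state=random_state)


# Toy data: one review before the cutoff year and three on or after it.
reviews = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "date": [
            "2019-05-01 10:00:00",
            "2020-01-01 00:00:00",
            "2021-07-15 12:30:00",
            "2023-06-04 23:02:00",
        ],
    }
)

staged = filter_and_sample(reviews, column="date", date=2020, frac=1, random_state=65)
print(sorted(staged["id"]))
```

With `frac=1` the sample keeps every row that passes the date filter, only in shuffled order, which corresponds to the production config above.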





## Validate Results
We compare the IDs in the staged development set with those in the workspace development set. The two sets should contain exactly the same IDs.

In [4]:
# Compare dev set
fp1 = "data/stage/dev/reviews"
fp2 = "workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-01_preprocess-review-dataset.parquet"
df1 = IOService.read(fp1)
df2 = IOService.read(fp2)
id1 = df1["id"].sort_values().values
id2 = df2["id"].sort_values().values
assert len(id1) == len(id2)
assert np.array_equal(id1, id2)
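If either assertion above fails, it gives no hint about which IDs differ. A small diagnostic along the following lines could help; the helper `diff_ids` is hypothetical and uses only `numpy.setdiff1d`.

```python
import numpy as np


def diff_ids(left, right) -> dict:
    """Return the ids that appear in only one of the two collections."""
    left = np.asarray(left)
    right = np.asarray(right)
    return {
        "only_left": np.setdiff1d(left, right),   # sorted, unique ids missing from right
        "only_right": np.setdiff1d(right, left),  # sorted, unique ids missing from left
    }


# Toy example: id 1 is only on the left and id 4 is only on the right.
report = diff_ids([1, 2, 3], [2, 3, 4])
print(report["only_left"], report["only_right"])
```

In the notebook this would be called as `diff_ids(id1, id2)` before the assertions; two empty arrays indicate that the staged and workspace development sets contain the same IDs.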

In [5]:
df1.sort_values(by="id").tail()

| | id | app_id | app_name | category_id | author | rating | content | vote_sum | vote_count | date | category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2125594 | 9999888324 | 1400078107 | Punch Time Clock Hours Tracker | 6000 | d87f758f20fcc6f1aad2 | 5.0 | Once you learn the ins and outs and how this app works it’s one of the better ones to keep your time. I love it and I use it every day to keep track of my hours. | 0 | 0 | 2023-06-04 23:02:00 | Business |
| 18455769 | 9999916482 | 1433889544 | Cougar: Dating Mature Women | 6005 | c7111c6f4391fd3497e8 | 4.0 | There’s clearly some fake profiles but the app is very user friendly | 0 | 0 | 2023-06-04 23:15:55 | Social Networking |
| 4459444 | 9999941021 | 1446075923 | Disney+ | 6016 | 73d5ec46b0bd4fa586bd | 2.0 | I change it pick something new and they come back on. Disappointed and frustrated. Think a trillion dollar company can make it work. | 0 | 0 | 2023-06-04 23:27:41 | Entertainment |
| 8462032 | 9999952606 | 600446812 | Pacer Pedometer & Step Tracker | 6013 | e44aeb47a0400420b7e8 | 5.0 | Easy, Peasy… | 0 | 0 | 2023-06-04 23:33:29 | Health & Fitness |
| 2700280 | 9999986474 | 1196524622 | Minecraft Education | 6017 | a8d67dffe6b88c5e5ea2 | 4.0 | I’ve mainly played Minecraft education on the school laptop and decided to play it again over summer break. Sign in and the game is the same, but the controls are sort of weird. The WASD is replaced with arrows, which makes moving around take more effort than it should. I was thinking maybe replace the arrows with a joystick like Roblox mobile has. It also took me a few minutes to figure out how to stop flying (I imported a world where I saved while I was flying) since you have to be up in the air to stop and can’t deactivate it while you’re ~2 blocks from the ground. | 0 | 0 | 2023-06-04 23:49:58 | Education |


In [6]:
df2.sort_values(by="id").tail()

| | id | app_id | app_name | category_id | author | rating | content | vote_sum | vote_count | date | review_length | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7402 | 9999888324 | 1400078107 | Punch Time Clock Hours Tracker | 6000 | d87f758f20fcc6f1aad2 | 5 | Once you learn the ins and outs and how this app works it’s one of the better ones to keep your time. I love it and I use it every day to keep track of my hours. | 0 | 0 | 2023-06-04 23:02:00 | 37 | Business |
| 67740 | 9999916482 | 1433889544 | Cougar: Dating Mature Women | 6005 | c7111c6f4391fd3497e8 | 4 | There’s clearly some fake profiles but the app is very user friendly | 0 | 0 | 2023-06-04 23:15:55 | 12 | Social Networking |
| 15066 | 9999941021 | 1446075923 | Disney+ | 6016 | 73d5ec46b0bd4fa586bd | 2 | I change it pick something new and they come back on. Disappointed and frustrated. Think a trillion dollar company can make it work. | 0 | 0 | 2023-06-04 23:27:41 | 23 | Entertainment |
| 32313 | 9999952606 | 600446812 | Pacer Pedometer & Step Tracker | 6013 | e44aeb47a0400420b7e8 | 5 | Easy, Peasy… | 0 | 0 | 2023-06-04 23:33:29 | 2 | Health & Fitness |
| 9332 | 9999986474 | 1196524622 | Minecraft Education | 6017 | a8d67dffe6b88c5e5ea2 | 4 | I’ve mainly played Minecraft education on the school laptop and decided to play it again over summer break. Sign in and the game is the same, but the controls are sort of weird. The WASD is replaced with arrows, which makes moving around take more effort than it should. I was thinking maybe replace the arrows with a joystick like Roblox mobile has. It also took me a few minutes to figure out how to stop flying (I imported a world where I saved while I was flying) since you have to be up in the air to stop and can’t deactivate it while you’re ~2 blocks from the ground. | 0 | 0 | 2023-06-04 23:49:58 | 110 | Education |
