# Dataset Staging Lab
Stages datasets for ingestion so that the pipelines do not have to sample repeatedly from the 22-million-row raw dataset. The test set is drawn from the development set, which ensures that every test observation already has model inference data, such as perplexities and sentiments.

In [14]:
import os
import numpy as np
import pandas as pd
from discover.flow.data_prep.ingest.task import FilterTask
from discover.infra.utils.file.io import IOService
from tqdm import tqdm


pd.set_option("display.max_columns", 999)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 999)

## Setup Configuration
Each configuration specifies the source and destination filepaths, the fraction of the source to sample (`frac`), and a `random_state` seed for reproducibility.

In [None]:
configs = [
    {
        "source": "data/raw/reviews",
        "dest": "data/stage/prod/reviews",
        "frac": 1,
        "random_state": 65,
    },
    {
        "source": "data/raw/reviews",
        "dest": "data/stage/dev/reviews",
        "frac": 0.01,
        "random_state": 51,
    },
    {
        "source": "data/stage/dev/reviews",
        "dest": "data/stage/test/reviews",
        "frac": 0.1,
        "random_state": 51,
    },
]
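
Because the test config reads from the development config's destination, the order of the list matters. The following is a minimal sketch of a pre-flight check, not part of the `discover` package: the helper name `validate_configs`, the `staged_prefix` parameter, and the example list are all assumptions made for illustration. It also assumes that every staged source is produced within the same run.

```python
REQUIRED_KEYS = {"source", "dest", "frac", "random_state"}


def validate_configs(configs, staged_prefix="data/stage/"):
    """Hypothetical pre-flight check for a list of staging configs."""
    produced = set()
    for i, config in enumerate(configs):
        missing = REQUIRED_KEYS - config.keys()
        if missing:
            raise ValueError(f"config {i} is missing keys: {sorted(missing)}")
        if not 0 < config["frac"] <= 1:
            raise ValueError(f"config {i} has frac outside (0, 1]: {config['frac']}")
        # A source under the staging directory must be the dest of an earlier config.
        source = config["source"]
        if source.startswith(staged_prefix) and source not in produced:
            raise ValueError(f"config {i} reads {source} before it is staged")
        produced.add(config["dest"])
    return True


example_configs = [
    {"source": "data/raw/reviews", "dest": "data/stage/dev/reviews",
     "frac": 0.01, "random_state": 51},
    {"source": "data/stage/dev/reviews", "dest": "data/stage/test/reviews",
     "frac": 0.1, "random_state": 51},
]
validate_configs(example_configs)
```

Reversing `example_configs` would raise a `ValueError`, since the test config would then read the development file before it is written.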

## Stage Files
Iterating through the configs, this cell stages the production, development, and test files. It uses the same seeds as the pipelines, so the staged data match the data in the workspace.

In [None]:
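column = "date"  # column that FilterTask filters on
date = 2020  # date threshold passed to FilterTask
filepath = None  # source currently held in memory
df = None
for config in tqdm(configs):
    # Ask before overwriting an existing staged file; answering "n" stops the loop.
    if os.path.exists(config["dest"]):
        proceed = input(
            f"File already exists at {config['dest']}. Would you like to proceed? [y/n]: "
        )
        if proceed.strip().lower().startswith("n"):
            break
    # Re-read only when the source changes, so each source is loaded once.
    if filepath != config["source"]:
        print(f"Reading dataset from {config['source']}.")
        df = IOService.read(filepath=config["source"])
        filepath = config["source"]
    # Named filter_task to avoid shadowing the built-in filter().
    filter_task = FilterTask(
        column=column,
        frac=config["frac"],
        date=date,
        random_state=config["random_state"],
    )
    data = filter_task.run(data=df)
    IOService.write(filepath=config["dest"], data=data)
    print(f"Created dataset of {data.shape[0]} rows and persisted to {config['dest']}")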

  0%|          | 0/3 [00:00<?, ?it/s]

Reading dataset from data/raw/reviews.


                                   FilterTask                                   
                                   ----------                                   
                          Start Datetime | Tue, 19 Nov 2024 20:48:31
                       Complete Datetime | Tue, 19 Nov 2024 20:48:44
                                 Runtime | 12.87 seconds


 33%|███▎      | 1/3 [01:24<02:48, 84.24s/it]

Created dataset of 8670475 rows and persisted to data/stage/prod/reviews


                                   FilterTask                                   
                                   ----------                                   
                          Start Datetime | Tue, 19 Nov 2024 20:49:15
                       Complete Datetime | Tue, 19 Nov 2024 20:49:29
                                 Runtime | 14.62 seconds


 67%|██████▋   | 2/3 [01:40<00:44, 44.13s/it]

Created dataset of 86705 rows and persisted to data/stage/dev/reviews
Reading dataset from data/stage/dev/reviews.


                                   FilterTask                                   
                                   ----------                                   
                          Start Datetime | Tue, 19 Nov 2024 20:49:36
                       Complete Datetime | Tue, 19 Nov 2024 20:49:37
                                 Runtime | 0.53 seconds


100%|██████████| 3/3 [01:47<00:00, 35.69s/it]

Created dataset of 867 rows and persisted to data/stage/test/reviews
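
The internals of `FilterTask` are not shown in this notebook. As a rough sketch, assuming it keeps rows whose `column` year is at or after `date` and then draws a seeded random sample, the equivalent pandas logic might look like the following. The function name `filter_and_sample` and the toy data are illustrative only and are not taken from the `discover` package.

```python
import pandas as pd


def filter_and_sample(data, column, date, frac, random_state):
    """Assumed behaviour: keep rows from year `date` onward, then sample `frac` of them."""
    recent = data[pd.to_datetime(data[column]).dt.year >= date]
    return recent.sample(frac=frac, random_state=random_state)


# Toy data: one row per year from 2015 through 2024.
toy = pd.DataFrame(
    {
        "id": range(10),
        "date": pd.to_datetime([f"{year}-06-01" for year in range(2015, 2025)]),
    }
)

first = filter_and_sample(toy, column="date", date=2020, frac=0.4, random_state=51)
second = filter_and_sample(toy, column="date", date=2020, frac=0.4, random_state=51)
```

With a fixed `random_state`, `first` and `second` contain the same rows, which is the property the staging step relies on to reproduce the workspace data.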





## Validate Results
We compare the IDs in the newly staged development set with those in the workspace development set. The check passes only if both sets contain exactly the same IDs.

In [15]:
# Compare dev set
fp1 = "data/stage/dev/reviews"
fp2 = "workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-01_ingest-review-dataset.parquet"
df1 = IOService.read(fp1)
df2 = IOService.read(fp2)
id1 = df1["id"].sort_values().values
id2 = df2["id"].sort_values().values
assert len(id1) == len(id2)
assert np.array_equal(id1, id2)
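
If either assertion fails, it helps to see which IDs differ. Below is a minimal sketch using toy frames in place of `df1` and `df2`; the helper name `diff_ids` is hypothetical, and it assumes the IDs are unique within each frame.

```python
import pandas as pd


def diff_ids(left, right, key="id"):
    """Return the keys found only in `left` and only in `right`."""
    merged = left[[key]].merge(right[[key]], on=key, how="outer", indicator=True)
    only_left = merged.loc[merged["_merge"] == "left_only", key].tolist()
    only_right = merged.loc[merged["_merge"] == "right_only", key].tolist()
    return only_left, only_right


# Toy stand-ins for the staged and workspace development sets.
staged = pd.DataFrame({"id": [1, 2, 3, 4]})
workspace = pd.DataFrame({"id": [2, 3, 4, 5]})
only_staged, only_workspace = diff_ids(staged, workspace)
```

Here ID 1 appears only in the staged frame and ID 5 only in the workspace frame; two empty lists would indicate that the sets match.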

In [None]:
df1.sort_values(by="id").tail()

| | id | app_id | app_name | category_id | author | rating | content | vote_sum | vote_count | date | category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9340033 | 10000000715 | 442138791 | Fast Chest and Arms Workouts | 6013 | 0bccb91ceff07cd1b848 | 5.0 | Excellent | 0 | 0 | 2023-06-04 23:56:58 | Health & Fitness |
| 8434846 | 10000006274 | 600446812 | Pacer Pedometer & Step Tracker | 6013 | 9ad5d8c172370634d65d | 5.0 | It’s great so far. It is cool to know how many steps I’ve taken while walking. This will help me get to my goal weight | 0 | 0 | 2023-06-04 23:59:41 | Health & Fitness |
| 9619923 | 10000071511 | 1454778585 | Water tracker Waterllama | 6013 | 330959af7d0e278e5661 | 5.0 | Made me drink water | 0 | 0 | 2023-06-05 00:30:34 | Health & Fitness |
| 18116766 | 10000120462 | 985746746 | Discord - Chat, Talk & Hangout | 6005 | 0e44e615ce705c94d990 | 3.0 | It is useful with voice chats and things but some buttons should be simplified or have a thing to show you what things you can do and stuff. | 1 | 1 | 2023-06-05 00:53:28 | Social Networking |
| 417938 | 10000146481 | 1076402606 | Libby, by OverDrive | 6018 | 42d1e4fff46e4a190a8c | 5.0 | Libby is an easy to use app. I linked it to my library card and can easily borrow both audible and written books; I can download books to my kindle and/or iPad. What a delight! | 0 | 0 | 2023-06-05 01:05:00 | Book |


In [None]:
df2.sort_values(by="id").tail()

| | id | app_id | app_name | category_id | author | rating | content | vote_sum | vote_count | date | review_length | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33550 | 10000000715 | 442138791 | Fast Chest and Arms Workouts | 6013 | 0bccb91ceff07cd1b848 | 5 | Excellent | 0 | 0 | 2023-06-04 23:56:58 | 1 | Health & Fitness |
| 37675 | 10000006274 | 600446812 | Pacer Pedometer & Step Tracker | 6013 | 9ad5d8c172370634d65d | 5 | It’s great so far. It is cool to know how many steps I’ve taken while walking. This will help me get to my goal weight | 0 | 0 | 2023-06-04 23:59:41 | 25 | Health & Fitness |
| 28003 | 10000071511 | 1454778585 | Water tracker Waterllama | 6013 | 330959af7d0e278e5661 | 5 | Made me drink water | 0 | 0 | 2023-06-05 00:30:34 | 4 | Health & Fitness |
| 65304 | 10000120462 | 985746746 | Discord - Chat, Talk & Hangout | 6005 | 0e44e615ce705c94d990 | 3 | It is useful with voice chats and things but some buttons should be simplified or have a thing to show you what things you can do and stuff. | 1 | 1 | 2023-06-05 00:53:28 | 28 | Social Networking |
| 2491 | 10000146481 | 1076402606 | Libby, by OverDrive | 6018 | 42d1e4fff46e4a190a8c | 5 | Libby is an easy to use app. I linked it to my library card and can easily borrow both audible and written books; I can download books to my kindle and/or iPad. What a delight! | 0 | 0 | 2023-06-05 01:05:00 | 35 | Book |
