In [1]:
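import os
import warnings

# When run from inside the Jupyter Book directory, move up two directory levels.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath("../.."))

# Suppress warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")
# Force the stage to run even if its output already exists.
FORCE = True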

# AppVoCAI Dataset Ingestion
In this section, we unbox the dataset, survey its key characteristics, and profile its structure, format, and data types. We then register it as an asset for the downstream data quality assessment, cleaning, enrichment, and analysis activities.

In [2]:
from discover.analytics.base import Analysis
from discover.setup import auto_wire_container
from discover.infra.config.flow import FlowConfigReader
from discover.core.flow import DataPrepStageEnum, PhaseEnum
from discover.flow.stage.data_prep.ingest import IngestionStage

# Wire container
container = auto_wire_container()

## Ingest Data
The `IngestionStage` loads the raw data, verifies its encoding, casts data types, and removes newlines from the review text, so that the data is accessible for downstream processing and analysis.

The code below initializes and runs the ingestion pipeline as part of the broader data preparation workflow. It first uses the `FlowConfigReader` to retrieve the configuration for the `INGEST` stage of the `DataPrep` phase. This configuration defines the structure, settings, and tasks required to ingest the data.

Next, the `build` method of the `IngestionStage` class assembles the pipeline from that configuration. The optional `force` flag makes the pipeline execute even if the destination data already exists. Without it, the pipeline can skip execution when the data is up to date.

Finally, calling `run()` executes the pipeline's tasks in order. On success, it returns the registered dataset asset (assigned to `dataset` below), which serves as the reference to the processed data for subsequent stages and for tracking the pipeline's output.
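The build, skip, and force behavior described above can be sketched in a few lines. This is a minimal illustration, not the `discover` API: `SketchStage`, the dictionary-based `store`, the configuration keys (`source`, `destination`, `tasks`), and the `executed` attribute are all hypothetical names chosen for the example.

```python
class SketchStage:
    """Hypothetical stand-in for a pipeline stage (not the discover API)."""

    def __init__(self, source, destination, tasks, store, force=False):
        self.source = source            # input rows (assumed: a list of strings)
        self.destination = destination  # key under which the output is stored
        self.tasks = tasks              # callables applied in order
        self.store = store              # assumed: a dict acting as the asset store
        self.force = force
        self.executed = False           # records whether the tasks actually ran

    @classmethod
    def build(cls, stage_config, store, force=False):
        # Assemble the stage from a plain-dict configuration.
        return cls(
            source=stage_config["source"],
            destination=stage_config["destination"],
            tasks=stage_config["tasks"],
            store=store,
            force=force,
        )

    def run(self):
        # Skip the work if the output already exists and force is not set.
        if self.destination in self.store and not self.force:
            return self.store[self.destination]
        data = self.source
        for task in self.tasks:
            data = task(data)
        self.store[self.destination] = data
        self.executed = True
        return data


store = {}
config = {
    "source": ["line one\nline two", "single line"],
    "destination": "ingest",
    "tasks": [lambda rows: [r.replace("\n", " ") for r in rows]],
}

first = SketchStage.build(config, store)
first.run()    # output missing, so the tasks run

second = SketchStage.build(config, store)
second.run()   # output exists and force is False, so the run is skipped

forced = SketchStage.build(config, store, force=True)
forced.run()   # force=True reruns the tasks regardless
```

Setting `force=True` is useful while a stage is still changing. Leaving it off avoids repeating work whose output is already stored.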

In [3]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(phase=PhaseEnum.DATAPREP, stage=DataPrepStageEnum.INGEST)

# Build and run Data Ingestion Stage
stage = IngestionStage.build(stage_config=stage_config, force=FORCE)
dataset = stage.run()



#                              Data Ingestion Stage                              #


Task                                    Start       End         Runtime     
----------------------------------------------------------------------------
VerifyEncodingTask                      01:47:00    01:47:00    0.02 seconds
CastDataTypeTask                        01:47:00    01:47:00    0.02 seconds
RemoveNewlinesTask                      01:47:00    01:47:00    0.0 seconds 
____________________________________________________________________________
Data Ingestion Stage                    01:47:00    01:47:00    0.33 seconds
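
The three tasks timed above can be illustrated with a minimal pandas sketch. It is not the `discover` implementation: the function names, the column names (`content`, `rating`, `date`), and the sample rows are assumptions made for illustration only.

```python
import pandas as pd


def verify_encoding(df, column):
    """Round-trip text through UTF-8, replacing characters that cannot be encoded."""
    df = df.copy()
    df[column] = df[column].map(
        lambda s: s.encode("utf-8", errors="replace").decode("utf-8")
        if isinstance(s, str)
        else s
    )
    return df


def cast_dtypes(df, dtypes, datetime_columns=()):
    """Cast columns to the given dtypes and parse the listed datetime columns."""
    df = df.astype(dtypes)
    for col in datetime_columns:
        df[col] = pd.to_datetime(df[col])
    return df


def remove_newlines(df, column):
    """Replace runs of carriage returns and newlines with a single space."""
    df = df.copy()
    df[column] = df[column].str.replace(r"[\r\n]+", " ", regex=True)
    return df


# Hypothetical sample rows; the real dataset has 11 features.
raw = pd.DataFrame(
    {
        "content": ["Great app!\nLove it.", "Crashes\r\non launch"],
        "rating": ["5", "1"],
        "date": ["2020-01-01 11:06:50", "2023-08-29 13:25:46"],
    }
)

clean = verify_encoding(raw, "content")
clean = cast_dtypes(clean, {"rating": "int64"}, datetime_columns=["date"])
clean = remove_newlines(clean, "content")
```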





## AppVoCAI Dataset Summary
Let's load and summarize the data.

In [4]:
reviews = Analysis(df=dataset.to_pandas())
reviews.summary()



                            AppVoCAI Dataset Summary                            
                       Number of Reviews | 8,670
                     Number of Reviewers | 8,668
              Number of Repeat Reviewers | 2 (0.0%)
         Number of Influential Reviewers | 575 (6.6%)
                          Number of Apps | 2,636
                    Number of Categories | 14
                 Average Reviews per App | 3.3
                                Features | 11
                        Memory Size (Mb) | 6.74
                    Date of First Review | 2020-01-01 11:06:50
                     Date of Last Review | 2023-08-29 13:25:46


As the summary shows, this run ingested 8,670 reviews dated between January 2020 and August 2023. They were contributed by 8,668 unique reviewers, only 2 of whom are repeat reviewers. About 6.6% of reviewers (575) qualify as influential, defined as having at least one review with a non-zero vote count. The dataset covers 2,636 apps across 14 categories, averaging 3.3 reviews per app. With 11 features and a memory footprint of 6.74 MB, it fits comfortably in memory for the profiling and analysis that follow.
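
The headline figures in the summary are simple aggregates. For example, 8,670 reviews over 2,636 apps gives about 3.3 reviews per app, and the percentages are consistent with being shares of unique reviewers. The sketch below shows one way such figures could be computed with pandas. It does not reproduce `Analysis.summary()`, whose internals are not shown here, and the column names (`author`, `app_id`, `category`, `vote_count`) are assumptions.

```python
import pandas as pd


def summarize(df):
    """Compute reviewer and app counts from a reviews table (hypothetical columns)."""
    reviews_per_author = df.groupby("author").size()
    n_reviewers = int(reviews_per_author.size)
    # Repeat reviewers wrote more than one review.
    n_repeat = int((reviews_per_author > 1).sum())
    # Influential reviewers have at least one review with a non-zero vote count.
    n_influential = int(df.loc[df["vote_count"] > 0, "author"].nunique())
    n_apps = int(df["app_id"].nunique())
    return {
        "reviews": len(df),
        "reviewers": n_reviewers,
        "repeat_reviewers": n_repeat,
        "repeat_pct": round(100 * n_repeat / n_reviewers, 1),
        "influential_reviewers": n_influential,
        "influential_pct": round(100 * n_influential / n_reviewers, 1),
        "apps": n_apps,
        "categories": int(df["category"].nunique()),
        "avg_reviews_per_app": round(len(df) / n_apps, 1),
    }


# Hypothetical toy data: four reviews by three authors across two apps.
toy = pd.DataFrame(
    {
        "author": ["u1", "u1", "u2", "u3"],
        "app_id": ["a1", "a2", "a1", "a1"],
        "category": ["Health", "Games", "Health", "Health"],
        "vote_count": [0, 3, 0, 0],
    }
)

stats = summarize(toy)
```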

In the next stage, we assess data quality to identify the interventions needed for data cleaning.