In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
FORCE = False

# **AppVoCAI Dataset Enrichment**
To set the foundation for exploratory analysis, we add a few features to the dataset at the **review**, **app**, and **category** levels, providing a cross-dimensional view of the data.  

## **1. Review-Level Enrichments**  
At the most granular level, reviews are enhanced with temporal and contextual features:  
- **Temporal Features**: By decomposing timestamps, we derive attributes such as review age and submission details (e.g., month, day, and hour). These features allow us to identify temporal trends and patterns in user feedback.  
- **Rating, Review Age and Review Length Deviations**: Each review is compared against the average for its app's category, highlighting outliers and unique characteristics within individual reviews.  
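
The two review-level steps above can be sketched in plain Python. This is an illustrative sketch only, not the pipeline's Spark implementation: the field names, the sample records, the `as_of` reference date, and the choice to measure review length in words are all assumptions made for the example.

```python
from datetime import datetime
from statistics import mean

# Hypothetical sample reviews; field names are assumptions for illustration.
reviews = [
    {"id": 1, "category": "Games", "rating": 5,
     "content": "Great fun", "date": datetime(2023, 3, 14, 9, 30)},
    {"id": 2, "category": "Games", "rating": 1,
     "content": "Crashes constantly on launch", "date": datetime(2022, 11, 2, 22, 5)},
    {"id": 3, "category": "Health", "rating": 4,
     "content": "Helpful tracker", "date": datetime(2023, 6, 1, 7, 0)},
]


def add_temporal_features(review, as_of):
    """Decompose the timestamp and compute review age in days relative to `as_of`."""
    d = review["date"]
    return {
        **review,
        "month": d.month,
        "day_of_week": d.weekday(),  # Monday == 0
        "hour": d.hour,
        "review_age": (as_of - d).days,
        "review_length": len(review["content"].split()),  # length in words (assumed)
    }


def add_category_deviations(rows, metrics=("rating", "review_age", "review_length")):
    """Add `<metric>_deviation`: the value minus the mean for the review's category."""
    by_category = {}
    for row in rows:
        by_category.setdefault(row["category"], []).append(row)
    means = {
        cat: {m: mean(r[m] for r in grp) for m in metrics}
        for cat, grp in by_category.items()
    }
    return [
        {**row, **{f"{m}_deviation": row[m] - means[row["category"]][m] for m in metrics}}
        for row in rows
    ]


as_of = datetime(2023, 9, 1)  # assumed reference date for computing review age
enriched = add_category_deviations([add_temporal_features(r, as_of) for r in reviews])
```

A positive `rating_deviation` marks a review that is more favorable than is typical for its category; a review that is the only one in its category has a deviation of zero by construction.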

## **2. App-Level Enrichments**  
Aggregating data at the app level provides a broader perspective on app performance:  
- **Key Summaries**: Metrics such as the total number of reviews, median `vote_count` and `vote_sum`, `rating`, `perplexity`, and sentiment distribution summarize each app’s reception.  
- **Deviation Statistics**: Comparing app-level metrics against their category averages sheds light on how an app deviates from its peers, offering insights into competitive positioning and unique strengths or weaknesses.  
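
As a rough illustration of app-level aggregation, the sketch below groups reviews by app, computes a few summaries, and then compares each app's average rating with its category. The records and field names are hypothetical, and the category average here is taken as the unweighted mean of app-level averages, which is an assumption rather than the pipeline's documented method.

```python
from collections import Counter, defaultdict
from statistics import mean, median

# Hypothetical review records; field names are assumptions for illustration.
reviews = [
    {"app_id": "a1", "category": "Games", "rating": 5, "vote_count": 3, "sentiment": "positive"},
    {"app_id": "a1", "category": "Games", "rating": 3, "vote_count": 1, "sentiment": "neutral"},
    {"app_id": "a2", "category": "Games", "rating": 1, "vote_count": 0, "sentiment": "negative"},
]


def summarize_apps(rows):
    """Aggregate reviews into one summary record per app."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["app_id"], row["category"])].append(row)
    summaries = []
    for (app_id, category), grp in groups.items():
        sentiments = Counter(r["sentiment"] for r in grp)
        summaries.append({
            "app_id": app_id,
            "category": category,
            "review_count": len(grp),
            "average_rating": mean(r["rating"] for r in grp),
            "median_vote_count": median(r["vote_count"] for r in grp),
            # Share of reviews per sentiment label
            "sentiment_share": {s: n / len(grp) for s, n in sentiments.items()},
        })
    return summaries


def add_category_deviation(summaries, metric="average_rating"):
    """Add `<metric>_deviation`: the app's value minus the mean across apps in its category."""
    by_category = defaultdict(list)
    for s in summaries:
        by_category[s["category"]].append(s[metric])
    category_mean = {c: mean(values) for c, values in by_category.items()}
    return [
        {**s, f"{metric}_deviation": s[metric] - category_mean[s["category"]]}
        for s in summaries
    ]


apps = add_category_deviation(summarize_apps(reviews))
```

Weighting each app equally keeps a heavily reviewed app from dominating the category benchmark; weighting by review count would be the alternative if the benchmark should reflect the typical review rather than the typical app.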

## **3. Category-Level Enrichments**  
Zooming out further, category-level summaries offer a macro view of app trends within specific domains:  
- **Statistical Summaries**: Similar to the app level, category-level features include the total number of reviews, median `vote_count` and `vote_sum`, `rating`, `perplexity`, `review_age`, `review_length`, and sentiment distribution.  
- **Contextual Insights**: These summaries provide benchmarks for evaluating app performance within its category, helping to contextualize deviations and patterns observed at the app and review levels.  
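
To make the benchmarking idea concrete, here is a minimal sketch that builds per-category summaries and uses them as a lookup for comparison. The sample data, field names, and the `rating_vs_benchmark` helper are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter, defaultdict
from statistics import mean, median

# Hypothetical review records; field names are assumptions for illustration.
reviews = [
    {"category": "Games", "rating": 5, "review_length": 12, "sentiment": "positive"},
    {"category": "Games", "rating": 2, "review_length": 40, "sentiment": "negative"},
    {"category": "Health", "rating": 4, "review_length": 25, "sentiment": "positive"},
]


def summarize_categories(rows):
    """Aggregate reviews into one summary record per category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["category"]].append(row)
    return {
        category: {
            "review_count": len(grp),
            "average_rating": mean(r["rating"] for r in grp),
            "median_review_length": median(r["review_length"] for r in grp),
            "sentiment_counts": dict(Counter(r["sentiment"] for r in grp)),
        }
        for category, grp in groups.items()
    }


benchmarks = summarize_categories(reviews)


def rating_vs_benchmark(category, rating):
    """Hypothetical helper: how far a rating sits from its category's average."""
    return rating - benchmarks[category]["average_rating"]
```

For example, `rating_vs_benchmark("Games", 5)` compares a five-star rating against the average rating of the Games category in the sample data.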

This cross-layered enrichment process gives the exploratory analysis additional nuance and context about user feedback across individual reviews, apps, and broader categories, forming the backbone of the **AppVoCAI** discovery phase.  

Let's do it!


## Import Libraries

In [2]:
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.core.flow import PhaseDef, StageDef
from discover.flow.stage.data_prep.enrich import DataEnrichmentStage

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.stage.base",
    ],
)

## Review Enrichment

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.ENRICH_REVIEW
)
# Build and run the stage
stage = DataEnrichmentStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/26/2024 03:43:59 PM] [DEBUG] [discover.flow.stage.base.DataEnrichmentStage] [run] : Execution path: RETURN
[11/26/2024 03:43:59 PM] [DEBUG] [Stage.run] [wrapper] : Stage: Review Enrichment Stage
[11/26/2024 03:43:59 PM] [DEBUG] [Stage.run] [wrapper] : Stage Started: Tue, 26 Nov 2024 15:43:59
[11/26/2024 03:43:59 PM] [DEBUG] [Stage.run] [wrapper] : Stage Completed: Tue, 26 Nov 2024 15:43:59
[11/26/2024 03:43:59 PM] [DEBUG] [Stage.run] [wrapper] : Stage Runtime: 0.0 seconds
[11/26/2024 03:43:59 PM] [DEBUG] [Stage.run] [wrapper] : Cached Result: True




#                            Review Enrichment Stage                             #



                            Review Enrichment Stage                             
                           Stage Started | Tue, 26 Nov 2024 15:43:59
                         Stage Completed | Tue, 26 Nov 2024 15:43:59
                           Stage Runtime | 0.0 seconds
                           Cached Result | True





## App Enrichment

In [5]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.ENRICH_APP
)
# Build and run the stage
stage = DataEnrichmentStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/26/2024 03:43:59 PM] [DEBUG] [discover.flow.stage.base.DataEnrichmentStage] [run] : Execution path: RUN
[11/26/2024 03:43:59 PM] [DEBUG] [discover.infra.service.spark.session.SparkSessionPool] [_create_session] : Creating a spark session.
[11/26/2024 03:43:59 PM] [DEBUG] [discover.infra.service.spark.session.SparkSessionPool] [_create_session] : Creating an Spark session. log4j Configuration: file:/home/john/projects/appvocai-discover/log4j.properties




#                              App Enrichment Stage                              #



Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
                                                                                



                               AppAggregationTask                               
                               ------------------                               
                          Start Datetime | Tue, 26 Nov 2024 15:44:11


[11/26/2024 03:44:12 PM] [DEBUG] [AppAggregationTask.run] [wrapper] : Task: AppAggregationTask
[11/26/2024 03:44:12 PM] [DEBUG] [AppAggregationTask.run] [wrapper] : Started: Tue, 26 Nov 2024 15:44:11
[11/26/2024 03:44:12 PM] [DEBUG] [AppAggregationTask.run] [wrapper] : Completed: Tue, 26 Nov 2024 15:44:12
[11/26/2024 03:44:12 PM] [DEBUG] [AppAggregationTask.run] [wrapper] : Runtime: 0.92 seconds


                       Complete Datetime | Tue, 26 Nov 2024 15:44:12
                                 Runtime | 0.92 seconds


[11/26/2024 03:44:21 PM] [DEBUG] [Stage.run] [wrapper] : Stage: App Enrichment Stage
[11/26/2024 03:44:21 PM] [DEBUG] [Stage.run] [wrapper] : Stage Started: Tue, 26 Nov 2024 15:43:59
[11/26/2024 03:44:21 PM] [DEBUG] [Stage.run] [wrapper] : Stage Completed: Tue, 26 Nov 2024 15:44:21
[11/26/2024 03:44:21 PM] [DEBUG] [Stage.run] [wrapper] : Stage Runtime: 21.63 seconds




                              App Enrichment Stage                              
                           Stage Started | Tue, 26 Nov 2024 15:43:59
                         Stage Completed | Tue, 26 Nov 2024 15:44:21
                           Stage Runtime | 21.63 seconds





## Category Enrichment

In [None]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.ENRICH_CATEGORY
)
# Build and run the stage
stage = DataEnrichmentStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/26/2024 03:48:38 PM] [DEBUG] [discover.flow.stage.base.DataEnrichmentStage] [run] : Execution path: RUN
[11/26/2024 03:48:38 PM] [DEBUG] [discover.infra.service.spark.session.SparkSessionPool] [_create_session] : Creating a spark session.
[11/26/2024 03:48:38 PM] [DEBUG] [discover.infra.service.spark.session.SparkSessionPool] [_create_session] : Creating an Spark session. log4j Configuration: file:/home/john/projects/appvocai-discover/log4j.properties




#                           Category Enrichment Stage                            #



                            CategoryAggregationTask                             
                            -----------------------                             
                          Start Datetime | Tue, 26 Nov 2024 15:48:38


[11/26/2024 03:48:38 PM] [DEBUG] [CategoryAggregationTask.run] [wrapper] : Task: CategoryAggregationTask
[11/26/2024 03:48:38 PM] [DEBUG] [CategoryAggregationTask.run] [wrapper] : Started: Tue, 26 Nov 2024 15:48:38
[11/26/2024 03:48:38 PM] [DEBUG] [CategoryAggregationTask.run] [wrapper] : Completed: Tue, 26 Nov 2024 15:48:38
[11/26/2024 03:48:38 PM] [DEBUG] [CategoryAggregationTask.run] [wrapper] : Runtime: 0.27 seconds


                       Complete Datetime | Tue, 26 Nov 2024 15:48:38
                                 Runtime | 0.27 seconds


[11/26/2024 03:48:39 PM] [DEBUG] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/test/dataset/01_dataprep/appvocai_discover-01_dataprep-08_enrich-category-dataset.parquet from repository.
[11/26/2024 03:48:39 PM] [DEBUG] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-test-dataprep-enrich_category-category from the repository.
[11/26/2024 03:48:39 PM] [DEBUG] [Stage.run] [wrapper] : Stage: Category Enrichment Stage
[11/26/2024 03:48:39 PM] [DEBUG] [Stage.run] [wrapper] : Stage Started: Tue, 26 Nov 2024 15:48:38
[11/26/2024 03:48:39 PM] [DEBUG] [Stage.run] [wrapper] : Stage Completed: Tue, 26 Nov 2024 15:48:39
[11/26/2024 03:48:39 PM] [DEBUG] [Stage.run] [wrapper] : Stage Runtime: 1.75 seconds
[11/26/2024 03:48:39 PM] [DEBUG] [Stage.run] [wrapper] : Cached Result: True




                           Category Enrichment Stage                            
                           Stage Started | Tue, 26 Nov 2024 15:48:38
                         Stage Completed | Tue, 26 Nov 2024 15:48:39
                           Stage Runtime | 1.75 seconds
                           Cached Result | True





---

## Enrichment Stage Wrap-Up
The enrichment stage added features to the dataset, including review metadata (such as length, age, and temporal data), sentiment analysis, text quality scores, and comprehensive app- and category-level aggregations. In the upcoming EDA phase, we will use these enriched attributes to uncover patterns, relationships, and trends in user behavior and app performance.