In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

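# When run from inside the Jupyter Book folder, move up to the project root.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
# Passed to each stage's build() as force=FORCE in the cells below.
FORCE = False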

# AppVoCAI Dataset Enrichment and Aggregation
The Data Enrichment and Aggregation module transforms raw app review data into actionable insights by enhancing and summarizing key features. It begins by adding review metadata, such as review length and review age, to provide context on the scale and recency of user feedback. It then classifies the sentiment of each review and applies a review quality assessment to filter for high-value content. Deviation analysis follows, identifying outliers and measuring how individual reviews compare with broader patterns. Finally, the enriched data is aggregated at both the app and category levels, giving a view of performance and sentiment trends across the mobile app ecosystem.

## Import Libraries

In [2]:
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.flow.enrich.metadata.stage import MetadataStage
from discover.flow.enrich.sentiment.stage import SentimentClassificationStage
from discover.flow.enrich.quality.stage import QualityStage
from discover.flow.enrich.deviation.stage import DeviationStage
from discover.flow.aggregation.stage import AggregationStage
from discover.core.flow import PhaseDef, EnrichmentStageDef, AggregationStageDef

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.enrich.stage",
        "discover.flow.aggregation.stage",
    ],
)

## Metadata Pipeline
The Metadata Pipeline enriches the dataset by extracting and analyzing key features such as **review age**, **review length**, and three **temporal fields**—**month**, **day of the week**, and **hour of the day**. These features are designed to provide insights into user engagement patterns and app usage, supporting unmet needs discovery in the following ways:

1. **Review Age**: Understanding how review content changes over time can help identify shifting user expectations and long-standing issues that may require attention. Analyzing review age trends can inform decisions on product updates and feature prioritization.

2. **Review Length**: Longer reviews often contain richer, more detailed feedback. By evaluating review length, we can identify opportunities to address comprehensive user concerns or highlight areas where users have strong opinions or pain points.

3. **Month**: Monthly trends can reveal **seasonal usage patterns**, indicating how user needs change throughout the year. This can inform strategies like seasonal feature releases, marketing campaigns, or resource allocation.

4. **Day of the Week**: Analyzing reviews by the day of the week might uncover insights into **weekly routines** and how apps fit into users' schedules. For instance, productivity apps might show spikes on weekdays, while entertainment apps might peak on weekends, guiding feature development aligned with these usage patterns.

5. **Hour of the Day**: Understanding when users are most active can provide clues about **contextual needs**. Apps with high engagement at night might suggest opportunities for sleep-related features, while those used during commutes may benefit from quick, on-the-go functionalities.

Together, these metadata features enhance our understanding of app usage and user behavior, laying the foundation for identifying unmet needs and informing opportunity discovery.
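
As an illustration of the five features listed above, the following is a minimal sketch that derives them from a single review using only the standard library. The function name `extract_metadata`, the field names, the use of word count for length, and the use of whole days for age are assumptions made for this example; they are not taken from `MetadataStage`.

```python
from datetime import datetime


def extract_metadata(content: str, date: datetime, as_of: datetime) -> dict:
    """Derive review length, review age, and three temporal fields (illustrative only)."""
    return {
        "review_length": len(content.split()),  # assumption: length measured in words
        "review_age": (as_of - date).days,      # assumption: age measured in whole days
        "month": date.month,                    # 1 = January ... 12 = December
        "day_of_week": date.weekday(),          # 0 = Monday ... 6 = Sunday
        "hour": date.hour,                      # 0-23
    }


# Hypothetical review, aged relative to a hypothetical reference timestamp.
features = extract_metadata(
    content="Great app, but sync fails at night.",
    date=datetime(2023, 3, 14, 22, 5),
    as_of=datetime(2024, 11, 8, 22, 33),
)
print(features)
```

Passing the reference timestamp explicitly (`as_of`) keeps the age calculation reproducible, which matters when a cached result is reused on a later day.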

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.ENRICHMENT, stage=EnrichmentStageDef.METADATA
)

# Build and run the Metadata Stage
stage = MetadataStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()



#                             Review Metadata Stage                              #



                             Review Metadata Stage                              
                           Stage Started | Fri, 08 Nov 2024 22:33:19
                         Stage Completed | Fri, 08 Nov 2024 22:33:19
                           Stage Runtime | 0.0 seconds
                           Cached Result | True





## Sentiment Classification Pipeline  
The Review-Level Sentiment Classification Pipeline uses spaCy to analyze sentiment on a scale from -1 to 1. Reviews are then classified as negative, neutral, or positive by dividing this scale into three equal spans.
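
Dividing the interval from -1 to 1 into three equal spans places the cut points at -1/3 and 1/3. A minimal sketch of that mapping follows; the function name `classify_sentiment` and the choice to assign the boundary values themselves to the neutral class are assumptions, since the text does not say how ties at the boundaries are handled.

```python
def classify_sentiment(score: float) -> str:
    """Map a sentiment score in [-1, 1] to one of three equal-width classes."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [-1, 1], got {score}")
    if score < -1 / 3:
        return "negative"
    if score <= 1 / 3:  # assumption: boundary values count as neutral
        return "neutral"
    return "positive"


# Hypothetical scores, one per class.
labels = [classify_sentiment(s) for s in (-0.8, 0.0, 0.9)]
print(labels)
```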

In [5]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.ENRICHMENT, stage=EnrichmentStageDef.SENTIMENT
)

# Build and run the Sentiment Classification Stage
stage = SentimentClassificationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()



#                         Sentiment Classification Stage                         #



                         Sentiment Classification Stage                         
                           Stage Started | Fri, 08 Nov 2024 22:33:26
                         Stage Completed | Fri, 08 Nov 2024 22:33:26
                           Stage Runtime | 0.0 seconds
                           Cached Result | True





## Review Text Quality Analysis (TQA) Pipeline
Review text quality is an indicator of the content's richness, coherence, and informativeness. In this section, we integrate two complementary quality assessment measures—a lexical/syntactic score and a perplexity-based score—into a weighted sum. This approach provides a balanced evaluation, capturing both the structural diversity and natural language fluency of the reviews.

### Lexical and Syntactic Quality Assessment  
The lexical and syntactic quality assessment (TQA) evaluates review quality using a composite score derived from multiple syntactic and lexical measures. These measures are computed with specific weights:

1. **POS Count Score (40%)**: Reflects the richness of content using counts of nouns, verbs, adjectives, and adverbs.
2. **POS Diversity Score (20%)**: Captures variety in language using an entropy-based calculation.
3. **POS Intensity Score (10%)**: Assesses the density of key parts of speech relative to total word count.
4. **Structural Complexity Score (20%)**: Evaluates text complexity using unique word proportion, special character usage, and word length variation.
5. **TQA Check Score (10%)**: Incorporates quality signals such as limited digit use, minimal special characters, and proper terminal punctuation.

The lexical and syntactic quality score is a weighted sum of these components.
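
The weighted sum can be sketched as follows, using the five weights listed above. The dictionary keys, the function name `syntactic_quality_score`, and the assumption that each component has already been scaled to the range 0 to 1 are illustrative choices, not details of the project's implementation.

```python
# Weights taken from the list above; they sum to 1.0.
TQA_WEIGHTS = {
    "pos_count": 0.40,
    "pos_diversity": 0.20,
    "pos_intensity": 0.10,
    "structural_complexity": 0.20,
    "tqa_check": 0.10,
}


def syntactic_quality_score(components: dict) -> float:
    """Weighted sum of component scores (assumed to be pre-scaled to [0, 1])."""
    missing = set(TQA_WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(weight * components[name] for name, weight in TQA_WEIGHTS.items())


# Hypothetical component scores for one review.
score = syntactic_quality_score(
    {
        "pos_count": 0.5,
        "pos_diversity": 1.0,
        "pos_intensity": 0.0,
        "structural_complexity": 0.5,
        "tqa_check": 1.0,
    }
)
print(round(score, 2))
```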

### Perplexity-Based Quality Assessment  
This measure evaluates review quality by applying 13 linguistic and structural filters, each assigned a weight derived from relative perplexity differences between the full dataset and filtered subsets. The filters assess features like adjective presence, punctuation ratios, word repetition, and special character use. Weights are computed to emphasize filters that most reduce perplexity, thus enhancing text fluency and coherence. The final score is a weighted sum of these filter indicators. 
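
One way to read the weighting scheme described above is sketched below: each filter's weight is the relative perplexity reduction of its filtered subset against the full dataset, and the weights are then normalized to sum to 1. This is an interpretation under stated assumptions, not the project's formula. The filter names, the perplexity values, the clamping of negative gains to zero, and the normalization step are all hypothetical, and only three of the 13 filters are shown.

```python
def perplexity_weights(ppl_all: float, ppl_filtered: dict) -> dict:
    """Weight each filter by its relative perplexity reduction, normalized to sum to 1."""
    # Assumption: a filter that does not reduce perplexity receives zero weight.
    gains = {
        name: max(0.0, (ppl_all - ppl) / ppl_all)
        for name, ppl in ppl_filtered.items()
    }
    total = sum(gains.values())
    if total == 0:
        return {name: 0.0 for name in gains}
    return {name: gain / total for name, gain in gains.items()}


def perplexity_quality_score(indicators: dict, weights: dict) -> float:
    """Weighted sum of binary filter indicators for one review."""
    return sum(weights[name] for name, passed in indicators.items() if passed)


# Hypothetical perplexities: full dataset versus the subset passing each filter.
weights = perplexity_weights(
    ppl_all=100.0,
    ppl_filtered={"has_adjective": 80.0, "low_repetition": 90.0, "low_special_chars": 100.0},
)

# Hypothetical review that passes two of the three filters.
review_score = perplexity_quality_score(
    {"has_adjective": True, "low_repetition": False, "low_special_chars": True},
    weights,
)
print(weights, review_score)
```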

For detailed implementation and methodology, refer to the [Appendix](#appendix-link).

In [6]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.ENRICHMENT, stage=EnrichmentStageDef.QUALITY
)

# Build and run the Review Quality Stage
stage = QualityStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()



#                              Review Quality Stage                              #






                                    NLPTask                                     
                                    -------                                     
                          Start Datetime | Fri, 08 Nov 2024 22:33:48
pos_ud_ewt download started this may take some time.
Approximate size to download 2.2 MB
Download done! Loading the resource.
[OK!]
                       Complete Datetime | Fri, 08 Nov 2024 22:34:04
                                 Runtime | 16.28 seconds


                              ComputePOSStatsTask                               
                              -------------------                               
                          Start Datetime | Fri, 08 Nov 2024 22:34:04
                       Complete Datetime | Fri, 08 Nov 2024 22:34:05
                                 Runtime | 0.64 seconds


                             ComputeBasicStatsTask                              
                             ---------------------                              
                          Start Datetime | Fri, 08 Nov 2024 22:34:05
                       Complete Datetime | Fri, 08 Nov 2024 22:34:06
                                 Runtime | 0.67 seconds


                             ComputeTQAFiltersTask                              
                             ---------------------                   

                                                                                

                       Complete Datetime | Fri, 08 Nov 2024 22:35:57
                                 Runtime | 1.0 minutes and 50.57 seconds


                                    TQATask2                                    
                                    --------                                    
                          Start Datetime | Fri, 08 Nov 2024 22:35:57
                       Complete Datetime | Fri, 08 Nov 2024 22:35:57
                                 Runtime | 0.51 seconds


                                    TQATask3                                    
                                    --------                                    
                          Start Datetime | Fri, 08 Nov 2024 22:35:57


                                                                                

                       Complete Datetime | Fri, 08 Nov 2024 22:39:37
                                 Runtime | 3.0 minutes and 40.07 seconds






                              Review Quality Stage                              
                           Stage Started | Fri, 08 Nov 2024 22:33:27
                         Stage Completed | Fri, 08 Nov 2024 22:41:36
                           Stage Runtime | 8.0 minutes and 8.85 seconds





                                                                                

## Deviation Analysis Pipeline
Deviation analysis is crucial for understanding how user feedback varies around average app performance metrics. At the app level, we analyze deviations in rating, review length, and sentiment relative to their respective averages. These insights reveal patterns in user experience, highlighting anomalies or inconsistencies that may indicate areas for improvement or unexpected user behavior.
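
To make the app-level calculation concrete, here is a minimal sketch of a percent deviation from the app average, written with the standard library only. The task names in the stage output suggest a percent-deviation measure, but the exact formula is an assumption, as are the function name `percent_deviation`, the column names, and the use of the absolute mean in the denominator so that the sign stays meaningful for sentiment averages below zero.

```python
from collections import defaultdict
from statistics import mean


def percent_deviation(rows: list, column: str, group: str = "app_id") -> list:
    """Percent deviation of each row's value from the mean of its group."""
    values_by_group = defaultdict(list)
    for row in rows:
        values_by_group[row[group]].append(row[column])
    group_means = {key: mean(values) for key, values in values_by_group.items()}

    deviations = []
    for row in rows:
        avg = group_means[row[group]]
        # Assumption: report 0.0 when the group mean is zero to avoid dividing by zero.
        deviations.append((row[column] - avg) / abs(avg) * 100 if avg else 0.0)
    return deviations


# Hypothetical reviews for two apps.
rows = [
    {"app_id": "a", "rating": 5},
    {"app_id": "a", "rating": 3},
    {"app_id": "a", "rating": 4},
    {"app_id": "b", "rating": 2},
]
rating_deviation = percent_deviation(rows, "rating")
print(rating_deviation)
```

The same function could be called once per measure (rating, review length, sentiment), which would be consistent with the task running more than once in the stage output.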

In [7]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.ENRICHMENT, stage=EnrichmentStageDef.DEVIATION
)

# Build and run the Deviation Analysis Stage
stage = DeviationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/08/2024 10:41:36 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/02_enrichment/appvocai_discover-02_enrichment-03_deviation-review-dataset.parquet from repository.
[11/08/2024 10:41:36 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-enrichment-deviation-review from the repository.




#                        Review Deviation Analysis Stage                         #



                          ComputePercentDeviationTask                           
                          ---------------------------                           
                          Start Datetime | Fri, 08 Nov 2024 22:41:36
                       Complete Datetime | Fri, 08 Nov 2024 22:41:36
                                 Runtime | 0.27 seconds


                          ComputePercentDeviationTask                           
                          ---------------------------                           
                          Start Datetime | Fri, 08 Nov 2024 22:41:36
                       Complete Datetime | Fri, 08 Nov 2024 22:41:36
                                 Runtime | 0.21 seconds


                          ComputePercentDeviationTask                           
                          ---------------------------                           
                          Start Da





                        Review Deviation Analysis Stage                         
                           Stage Started | Fri, 08 Nov 2024 22:41:36
                         Stage Completed | Fri, 08 Nov 2024 22:41:39
                           Stage Runtime | 2.77 seconds





                                                                                

## App and Category Aggregation Pipelines
Aggregating data at the app and category levels provides a high-level view of review trends and user behavior, offering insights into user engagement and feedback patterns. At the app level, we consolidate key metrics, such as average ratings, review length, review count, and total vote sum, while identifying standout reviews based on highest vote counts, top TQA scores, and longest review lengths. 

A similar approach is used at the category level, aggregating metrics across all apps within a category to reveal trends that may indicate common strengths or pain points across similar apps. This two-tiered aggregation—app-level and category-level—allows for both detailed and broad insights into app performance, aiding in strategic decisions and market comparisons.
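
The two-tiered aggregation can be sketched with one function applied at either level. The example below is illustrative only: the function name `aggregate_reviews`, the field names, and the sample rows are hypothetical, and the project's `AggregationStage` runs on Spark rather than on plain Python lists.

```python
from collections import defaultdict
from statistics import mean


def aggregate_reviews(reviews: list, level: str) -> dict:
    """Summarize reviews grouped by `level` (e.g. "app_id" or "category")."""
    groups = defaultdict(list)
    for review in reviews:
        groups[review[level]].append(review)

    summary = {}
    for key, group in groups.items():
        summary[key] = {
            "review_count": len(group),
            "avg_rating": mean([r["rating"] for r in group]),
            "avg_review_length": mean([r["review_length"] for r in group]),
            "vote_sum": sum(r["vote_count"] for r in group),
            # Standout reviews: most votes, best TQA score, longest text.
            "top_voted_review": max(group, key=lambda r: r["vote_count"])["id"],
            "top_tqa_review": max(group, key=lambda r: r["tqa_score"])["id"],
            "longest_review": max(group, key=lambda r: r["review_length"])["id"],
        }
    return summary


# Hypothetical enriched reviews.
reviews = [
    {"id": 1, "app_id": "a", "category": "Health", "rating": 5,
     "review_length": 40, "vote_count": 3, "tqa_score": 0.7},
    {"id": 2, "app_id": "a", "category": "Health", "rating": 3,
     "review_length": 120, "vote_count": 9, "tqa_score": 0.9},
    {"id": 3, "app_id": "b", "category": "Health", "rating": 4,
     "review_length": 20, "vote_count": 1, "tqa_score": 0.4},
]
by_app = aggregate_reviews(reviews, "app_id")
by_category = aggregate_reviews(reviews, "category")
print(by_app["a"], by_category["Health"])
```

Reusing one grouping function for both levels keeps the app and category summaries directly comparable, since every metric is defined identically at each tier.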

### App Aggregation Pipeline

In [8]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.AGGREGATION, stage=AggregationStageDef.APP
)

# Build and run the App Aggregation Stage
stage = AggregationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/08/2024 10:41:39 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/03_aggregation/appvocai_discover-03_aggregation-00_app-app-dataset.parquet from repository.
[11/08/2024 10:41:39 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-aggregation-app-app from the repository.




#                             App Aggregation Stage                              #



                               AppAggregationTask                               
                               ------------------                               
                          Start Datetime | Fri, 08 Nov 2024 22:41:39
                       Complete Datetime | Fri, 08 Nov 2024 22:41:39
                                 Runtime | 0.38 seconds


                                                                                



                             App Aggregation Stage                              
                           Stage Started | Fri, 08 Nov 2024 22:41:39
                         Stage Completed | Fri, 08 Nov 2024 22:41:45
                           Stage Runtime | 6.25 seconds





### Category Aggregation Pipeline

In [9]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.AGGREGATION, stage=AggregationStageDef.CATEGORY
)

# Build and run the Category Aggregation Stage
stage = AggregationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/08/2024 10:41:45 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/03_aggregation/appvocai_discover-03_aggregation-01_category-category-dataset.parquet from repository.
[11/08/2024 10:41:45 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-aggregation-category-category from the repository.




#                           Category Aggregation Stage                           #



                            CategoryAggregationTask                             
                            -----------------------                             
                          Start Datetime | Fri, 08 Nov 2024 22:41:45
                       Complete Datetime | Fri, 08 Nov 2024 22:41:46
                                 Runtime | 0.36 seconds





                           Category Aggregation Stage                           
                           Stage Started | Fri, 08 Nov 2024 22:41:45
                         Stage Completed | Fri, 08 Nov 2024 22:41:48
                           Stage Runtime | 3.08 seconds





                                                                                

## Enrichment Stage Wrap-Up
The enrichment stage augmented the dataset with review metadata (length, age, and temporal fields), sentiment classification, text quality scores, and deviation measures, and the aggregation stage then summarized the results at the app and category levels. In the upcoming EDA phase, we will use these enriched attributes to uncover patterns, relationships, and trends that illuminate user behavior and app performance.