In [1]:
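import os
import warnings

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# When run from inside the Jupyter Book directory, move up to the project root
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Passed to each stage's build(); set to True to force a rerun instead of reusing a cached result
FORCE = False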

# AppVoCAI Dataset Enrichment and Aggregation
The Data Enrichment and Aggregation module transforms raw app review data into actionable insights by enriching and summarizing key features. It first adds review metadata, namely review length and review age, to give context on the depth and recency of user feedback. It then classifies the sentiment of each review and scores review quality so that high-value content can be filtered. Deviation analysis follows, identifying outliers and showing how individual reviews compare to broader patterns. Finally, the enriched data is aggregated at both the app and category levels, giving a view of performance and sentiment trends across the mobile app ecosystem.

## Import Libraries

In [2]:
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.flow.enrich.metadata.stage import MetadataStage
from discover.flow.enrich.sentiment.stage import SentimentClassificationStage
from discover.flow.enrich.quality.stage import QualityStage
from discover.flow.enrich.deviation.stage import DeviationStage
from discover.flow.aggregation.stage import AggregationStage
from discover.core.flow import PhaseDef, EnrichmentStageDef, AggregationStageDef

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.enrich.stage",
        "discover.flow.aggregation.stage",
    ],
)

## Metadata Pipeline
The Metadata Pipeline calculates the `review age` to understand the temporal relevance of each review and `review length` to gauge the depth of user feedback.
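
As a rough illustration of the two metadata features, the sketch below computes them with the standard library only. It is not the `MetadataStage` implementation: measuring length in whitespace-delimited words, measuring age in whole days, and the `reference_date` parameter are all assumptions made for the example.

```python
from datetime import datetime


def review_length(text: str) -> int:
    """Length of a review, taken here as the number of whitespace-delimited words (assumption)."""
    return len(text.split())


def review_age_days(review_date: datetime, reference_date: datetime) -> int:
    """Age of a review in whole days relative to a chosen reference date (assumption)."""
    return (reference_date - review_date).days


# Hypothetical review record; the field names are illustrative only.
review = {
    "content": "Great app, but it crashes on startup.",
    "date": datetime(2023, 11, 8),
}
reference = datetime(2024, 11, 8)

length = review_length(review["content"])
age = review_age_days(review["date"], reference)
print(f"length={length} words, age={age} days")
```

Fixing the reference date, rather than calling `datetime.now()`, keeps the age feature reproducible across runs.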

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.ENRICHMENT, stage=EnrichmentStageDef.METADATA
)

# Build and run the Metadata stage
stage = MetadataStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()






                             Review Metadata Stage                              
                           Stage Started | Fri, 08 Nov 2024 11:53:31
                         Stage Completed | Fri, 08 Nov 2024 11:53:31
                           Stage Runtime | 0.01 seconds
                           Cached Result | True





## Sentiment Classification Pipeline  
The Review-Level Sentiment Classification Pipeline uses spaCy to analyze sentiment on a scale from -1 to 1. Reviews are then classified as negative, neutral, or positive by dividing this scale into three equal spans.
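
Dividing the interval from -1 to 1 into three equal spans gives spans of width 2/3, so the cut points fall at -1/3 and 1/3. The sketch below shows that mapping; the function name and the choice to assign scores exactly on a cut point to the neutral class are assumptions, not details taken from the pipeline.

```python
def classify_sentiment(score: float) -> str:
    """Map a sentiment score in [-1, 1] to one of three equal-width classes."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [-1, 1], got {score}")
    if score < -1 / 3:
        return "negative"
    if score <= 1 / 3:  # boundary values are treated as neutral (assumption)
        return "neutral"
    return "positive"


for value in (-0.8, 0.0, 0.5):
    print(value, classify_sentiment(value))
```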

In [5]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.ENRICHMENT, stage=EnrichmentStageDef.SENTIMENT
)

# Build and run the Sentiment Classification stage
stage = SentimentClassificationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/08/2024 11:53:41 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/02_enrichment/appvocai_discover-02_enrichment-01_sentiment-review-dataset.parquet from repository.
[11/08/2024 11:53:41 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-enrichment-sentiment-review from the repository.







                           SpacySentimentAnalysisTask                           
                           --------------------------                           
                          Start Datetime | Fri, 08 Nov 2024 11:53:42



                       Complete Datetime | Fri, 08 Nov 2024 11:57:01
                                 Runtime | 3.0 minutes and 19.26 seconds


                         Sentiment Classification Stage                         
                           Stage Started | Fri, 08 Nov 2024 11:53:41
                         Stage Completed | Fri, 08 Nov 2024 11:57:01
                           Stage Runtime | 3.0 minutes and 19.99 seconds





## Review Text Quality Analysis (TQA) Pipeline
Review text quality is an indicator of the content's richness, coherence, and informativeness. In this section, we integrate two complementary quality assessment measures—a lexical/syntactic score and a perplexity-based score—into a weighted sum. This approach provides a balanced evaluation, capturing both the structural diversity and natural language fluency of the reviews.

### Lexical and Syntactic Quality Assessment  
The lexical and syntactic component of the TQA evaluates review quality using a composite score built from five lexical and syntactic measures, weighted as follows:

1. **POS Count Score (40%)**: Reflects the richness of content using counts of nouns, verbs, adjectives, and adverbs.
2. **POS Diversity Score (20%)**: Captures variety in language using an entropy-based calculation.
3. **POS Intensity Score (10%)**: Assesses the density of key parts of speech relative to total word count.
4. **Structural Complexity Score (20%)**: Evaluates text complexity using unique word proportion, special character usage, and word length variation.
5. **TQA Check Score (10%)**: Incorporates quality signals such as limited digit use, minimal special characters, and proper terminal punctuation.

The lexical and syntactic quality score is a weighted sum of these components.
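
The weighting above can be sketched as follows. The weights come from the list; everything else is an assumption for illustration: the dictionary keys, the premise that each component is already scaled to [0, 1], and the use of normalized Shannon entropy for the diversity component (the text only says the calculation is entropy-based).

```python
import math

# Component weights as listed above; the key names are hypothetical.
WEIGHTS = {
    "pos_count": 0.40,
    "pos_diversity": 0.20,
    "pos_intensity": 0.10,
    "structural_complexity": 0.20,
    "tqa_check": 0.10,
}


def pos_diversity(pos_counts: dict) -> float:
    """Normalized Shannon entropy of part-of-speech counts, in [0, 1] (assumed form)."""
    total = sum(pos_counts.values())
    if total == 0 or len(pos_counts) < 2:
        return 0.0
    probs = [count / total for count in pos_counts.values() if count > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(pos_counts))


def syntactic_quality_score(components: dict) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up component scores for a single review.
components = {
    "pos_count": 0.5,
    "pos_diversity": pos_diversity({"NOUN": 3, "VERB": 3, "ADJ": 3, "ADV": 3}),
    "pos_intensity": 0.3,
    "structural_complexity": 0.6,
    "tqa_check": 1.0,
}
score = syntactic_quality_score(components)
print(round(score, 2))
```

Because the weights sum to 1, the composite score stays in [0, 1] whenever every component does.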

### Perplexity-Based Quality Assessment  
This measure evaluates review quality by applying 13 linguistic and structural filters, each assigned a weight derived from relative perplexity differences between the full dataset and filtered subsets. The filters assess features like adjective presence, punctuation ratios, word repetition, and special character use. Weights are computed to emphasize filters that most reduce perplexity, thus enhancing text fluency and coherence. The final score is a weighted sum of these filter indicators. 

For detailed implementation and methodology, refer to the [Appendix](#appendix-link).
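
One plausible reading of the perplexity-based weighting is sketched below. It is an assumption-laden illustration, not the pipeline's method: the relative-reduction formula, the clipping of negative values to zero, the normalization so that weights sum to 1, the filter names, and all perplexity numbers are hypothetical.

```python
def filter_weights(perplexity_all: float, perplexity_filtered: dict) -> dict:
    """Weight each filter by the relative perplexity reduction it achieves (assumed formula)."""
    raw = {
        name: max(0.0, (perplexity_all - perplexity) / perplexity_all)
        for name, perplexity in perplexity_filtered.items()
    }
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in raw}
    return {name: value / total for name, value in raw.items()}


def perplexity_quality_score(indicators: dict, weights: dict) -> float:
    """Weighted sum of binary filter indicators for one review."""
    return sum(weights[name] * float(indicators.get(name, False)) for name in weights)


# Hypothetical perplexities: the full dataset versus the subset passing each filter.
weights = filter_weights(
    perplexity_all=100.0,
    perplexity_filtered={
        "has_adjective": 80.0,
        "low_punctuation_ratio": 90.0,
        "low_word_repetition": 100.0,
    },
)

# Hypothetical filter outcomes for a single review.
indicators = {
    "has_adjective": True,
    "low_punctuation_ratio": False,
    "low_word_repetition": True,
}
quality = perplexity_quality_score(indicators, weights)
print(weights, quality)
```

Under this reading, a filter that leaves perplexity unchanged, such as `low_word_repetition` here, receives zero weight and does not affect the score.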

In [6]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.ENRICHMENT, stage=EnrichmentStageDef.QUALITY
)

# Build and run the Quality stage
stage = QualityStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()






                              Review Quality Stage                              
                           Stage Started | Fri, 08 Nov 2024 11:57:02
                         Stage Completed | Fri, 08 Nov 2024 11:57:02
                           Stage Runtime | 0.0 seconds
                           Cached Result | True





## Deviation Analysis Pipeline
Deviation analysis shows how individual reviews vary around an app's average metrics. At the app level, we compute each review's deviation in rating, review length, and sentiment from the corresponding app average. These deviations reveal patterns in user experience and highlight anomalies or inconsistencies that may indicate areas for improvement or unexpected user behavior.
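
A minimal sketch of app-level deviations follows, using only the standard library. The record fields, the `_deviation` suffix, and the use of a simple difference from the app mean (rather than, say, a standardized score) are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

METRICS = ("rating", "review_length", "sentiment")


def add_deviations(reviews: list, metrics: tuple = METRICS) -> list:
    """Return copies of the reviews with each metric's deviation from its app average."""
    by_app = defaultdict(list)
    for review in reviews:
        by_app[review["app_id"]].append(review)

    # Average of every metric within each app.
    app_means = {
        app_id: {metric: mean([row[metric] for row in rows]) for metric in metrics}
        for app_id, rows in by_app.items()
    }

    enriched = []
    for review in reviews:
        row = dict(review)
        for metric in metrics:
            row[f"{metric}_deviation"] = review[metric] - app_means[review["app_id"]][metric]
        enriched.append(row)
    return enriched


# Made-up reviews for two hypothetical apps.
reviews = [
    {"app_id": "a", "rating": 5.0, "review_length": 10.0, "sentiment": 0.5},
    {"app_id": "a", "rating": 1.0, "review_length": 30.0, "sentiment": -0.5},
    {"app_id": "b", "rating": 4.0, "review_length": 20.0, "sentiment": 0.25},
]
result = add_deviations(reviews)
for row in result:
    print(row["app_id"], row["rating_deviation"], row["review_length_deviation"])
```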

In [7]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.ENRICHMENT, stage=EnrichmentStageDef.DEVIATION
)

# Build and run the Deviation stage
stage = DeviationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()






                        Review Deviation Analysis Stage                         
                           Stage Started | Fri, 08 Nov 2024 11:57:02
                         Stage Completed | Fri, 08 Nov 2024 11:57:02
                           Stage Runtime | 0.0 seconds
                           Cached Result | True





## App and Category Aggregation Pipelines
Aggregating data at the app and category levels provides a high-level view of review trends, user engagement, and feedback patterns. At the app level, we consolidate key metrics, such as average rating, average review length, review count, and total votes, and identify standout reviews: those with the highest vote counts, the top TQA scores, and the greatest lengths.

A similar approach is used at the category level, aggregating metrics across all apps within a category to reveal trends that may indicate common strengths or pain points across similar apps. This two-tiered aggregation—app-level and category-level—allows for both detailed and broad insights into app performance, aiding in strategic decisions and market comparisons.
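
The two-tier aggregation can be sketched with one grouping function applied to two different keys. This is an illustration under assumptions, not the `AggregationStage` implementation: the record fields and output names are hypothetical, and the data is made up.

```python
from collections import defaultdict
from statistics import mean


def aggregate(reviews: list, key: str) -> dict:
    """Summarize reviews grouped by `key` (for example 'app_id' or 'category')."""
    groups = defaultdict(list)
    for review in reviews:
        groups[review[key]].append(review)

    summary = {}
    for group_id, rows in groups.items():
        summary[group_id] = {
            "review_count": len(rows),
            "average_rating": mean([r["rating"] for r in rows]),
            "average_review_length": mean([r["review_length"] for r in rows]),
            "total_votes": sum(r["vote_count"] for r in rows),
            # Standout reviews, identified by review id.
            "most_voted_review": max(rows, key=lambda r: r["vote_count"])["id"],
            "top_tqa_review": max(rows, key=lambda r: r["tqa_score"])["id"],
            "longest_review": max(rows, key=lambda r: r["review_length"])["id"],
        }
    return summary


# Made-up reviews for two hypothetical apps in one category.
reviews = [
    {"id": 1, "app_id": "a", "category": "Health", "rating": 5.0,
     "review_length": 10, "vote_count": 3, "tqa_score": 0.4},
    {"id": 2, "app_id": "a", "category": "Health", "rating": 3.0,
     "review_length": 30, "vote_count": 1, "tqa_score": 0.9},
    {"id": 3, "app_id": "b", "category": "Health", "rating": 1.0,
     "review_length": 20, "vote_count": 7, "tqa_score": 0.2},
]
by_app = aggregate(reviews, "app_id")
by_category = aggregate(reviews, "category")
print(by_app["a"])
print(by_category["Health"])
```

Reusing one function for both levels mirrors the notebook, where the same `AggregationStage` class is built with either the app or the category configuration.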

### App Aggregation Pipeline

In [8]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.AGGREGATION, stage=AggregationStageDef.APP
)

# Build and run the App Aggregation stage
stage = AggregationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()






                             App Aggregation Stage                              
                           Stage Started | Fri, 08 Nov 2024 11:57:02
                         Stage Completed | Fri, 08 Nov 2024 11:57:02
                           Stage Runtime | 0.0 seconds
                           Cached Result | True





### Category Aggregation Pipeline

In [9]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.AGGREGATION, stage=AggregationStageDef.CATEGORY
)

# Build and run the Category Aggregation stage
stage = AggregationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()






                           Category Aggregation Stage                           
                           Stage Started | Fri, 08 Nov 2024 11:57:02
                         Stage Completed | Fri, 08 Nov 2024 11:57:02
                           Stage Runtime | 0.0 seconds
                           Cached Result | True





## Enrichment Stage Wrap-Up
The enrichment stage added review metadata (length and age), sentiment classification, text quality scores, and deviation measures to the dataset, and produced app- and category-level aggregations. These enriched features set the stage for an informed and focused Exploratory Data Analysis (EDA).

In the upcoming EDA phase, we will leverage these enriched attributes to uncover patterns, relationships, and trends that illuminate user behavior and app performance.