In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
FORCE = False

# AppVoCAI Dataset Enrichment
This data enrichment stage augments the dataset with quality signals, user engagement data, target class distributions, and aggregations, positioning us for a systematic data quality analysis and an insight-rich exploratory analysis. The stage unfolds in five progressive steps:

1. **Text Quality Detection**: We identify and address extraneous characters, non-standard symbols, and other noise elements that may distort analytical insights. This ensures our textual data maintains a high level of clarity and precision, which is crucial for accurate natural language processing.

2. **Text Quality Analysis**: We evaluate grammatical complexity, syntactic structure diversity, coherence, clarity, intensity, and overall linguistic elaborateness (see {ref}`appendix:tqs`). These factors significantly affect the performance of language models and, in turn, our ability to understand nuanced user sentiment and intent.

3. **Sentiment Classification**: Using spaCy’s rule-based sentiment classifier allows for a computationally efficient, high-level analysis of sentiment distribution and balance within the dataset. This provides an initial framework for identifying emotional trends and checking that the dataset represents a wide range of user experiences.

4. **Quantitative Enrichment**: Decomposing timestamps yields valuable temporal features, such as the relative age of reviews and submission details like month, day, and hour. This enables us to conduct temporal and longitudinal analyses, uncover cyclical trends in app usage, and observe variations in user behavior. Analyzing deviations from category-level and app-level themes may reveal unmet needs, feature gaps, and inconsistencies in user experiences.

5. **Aggregate Data Analysis**: By summarizing data at the app, author, and category levels, we expose overarching themes related to user engagement, satisfaction, and app performance. This macro-level analysis provides insight into the broader dynamics of user interactions, highlighting areas of strength and opportunities for improvement.

### Early Feature Engineering?
Not quite. Feature engineering derives new variables that are expected to influence model development and predictive performance. This data enrichment effort instead aims to facilitate rigorous data quality analysis and exploration, while minimizing bias and avoiding transformations that might distort or invalidate analytical interpretations.

Let's move forward!


## Import Libraries

In [None]:
from discover.container import DiscoverContainer
from discover.core.flow import PhaseDef, StageDef
from discover.flow.data_prep.agg.stage import AggregationStage
from discover.flow.data_prep.dqm.stage import TextQualityDetectionStage
from discover.flow.data_prep.edp.stage import QuantStage
from discover.flow.data_prep.sentiment.stage import SentimentClassificationStage
from discover.flow.data_prep.tqa.stage import TQAStage
from discover.infra.config.flow import FlowConfigReader

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.base.stage",
        "discover.flow.data_prep.aggregation.stage",
    ],
)

## Text Quality Detection (DQD) Pipeline

In [None]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.DQD
)
# Build and run the stage
stage = TextQualityDetectionStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()



#                          Text Quality Detection Stage                          #






                             DetectOrRepairURLTask                              
                             ---------------------                              
                          Start Datetime | Tue, 12 Nov 2024 05:02:42
                       Complete Datetime | Tue, 12 Nov 2024 05:02:42
                                 Runtime | 0.22 seconds


                         DetectOrRepairEmailAddressTask                         
                         ------------------------------                         
                          Start Datetime | Tue, 12 Nov 2024 05:02:42
                       Complete Datetime | Tue, 12 Nov 2024 05:02:42
                                 Runtime | 0.06 seconds


                         DetectOrRepairPhoneNumberTask                          
                         -----------------------------                          
                          Start Datetime | Tue, 12 Nov 2024 05:02:42
                       Complete Datetime | Tue, 12 

                                                                                

                       Complete Datetime | Tue, 12 Nov 2024 05:02:49
                                 Runtime | 6.0 seconds


                         DetectOrRemoveShortReviewsTask                         
                         ------------------------------                         
                          Start Datetime | Tue, 12 Nov 2024 05:02:49
                       Complete Datetime | Tue, 12 Nov 2024 05:02:49
                                 Runtime | 0.04 seconds
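
The tasks reported above detect noise such as URLs, email addresses, and very short reviews. The project's own detection rules are not shown in this notebook, so the following is only a minimal sketch of the idea using pandas. The function name `flag_noise`, the `content` column, the regular expressions, and the three-word threshold are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical, deliberately simple patterns; not the project's actual rules.
URL_REGEX = r"https?://\S+|www\.\S+"
EMAIL_REGEX = r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"
MIN_WORDS = 3  # assumed threshold below which a review counts as "short"


def flag_noise(reviews: pd.DataFrame, text_col: str = "content") -> pd.DataFrame:
    """Return a copy of `reviews` with boolean noise-indicator columns added."""
    flagged = reviews.copy()
    # Treat missing text as empty so the string methods return plain booleans.
    text = flagged[text_col].fillna("")
    flagged["contains_url"] = text.str.contains(URL_REGEX, case=False, regex=True)
    flagged["contains_email"] = text.str.contains(EMAIL_REGEX, regex=True)
    # Whitespace-delimited word count; empty text has zero words.
    flagged["is_short"] = text.str.split().str.len() < MIN_WORDS
    return flagged


sample = pd.DataFrame(
    {
        "content": [
            "Visit http://example.com now please",
            "mail me at a.b@example.org today",
            "ok",
            None,
        ]
    }
)
flags = flag_noise(sample)
print(flags[["contains_url", "contains_email", "is_short"]])
```

Flagging rather than removing rows keeps the original text available, so later analysis can decide how each kind of noise should be handled.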




## Text Quality Analysis (TQA) Pipeline 

In [None]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.TQA
)

# Build and run the TQA stage
stage = TQAStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

## Sentiment Classification Pipeline  
The Review-Level Sentiment Classification Pipeline uses spaCy to analyze sentiment on a scale from -1 to 1. Reviews are then classified as negative, neutral, or positive by dividing this scale into three equal spans.
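
As a rough illustration of the three equal spans, the sketch below splits the interval from -1 to 1 at -1/3 and 1/3. The function name `classify_sentiment` is hypothetical, and the treatment of scores that fall exactly on a boundary is an assumption; the stage's actual cut points may differ.

```python
def classify_sentiment(score: float) -> str:
    """Map a sentiment score in [-1, 1] to one of three equal-width classes."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [-1, 1], got {score}")
    # Each span is 2/3 wide: [-1, -1/3), [-1/3, 1/3], (1/3, 1].
    # Assigning the boundary values to "neutral" is an assumption.
    if score < -1.0 / 3.0:
        return "negative"
    if score <= 1.0 / 3.0:
        return "neutral"
    return "positive"


for value in (-0.8, 0.0, 0.9):
    print(value, classify_sentiment(value))
```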

In [None]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.SENTIMENT
)

# Build and run the sentiment classification stage
stage = SentimentClassificationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

---

## Quantitative Enrichment Pipeline

In [None]:
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.QUANT
)

# Build and run the quantitative enrichment stage
stage = QuantStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()
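
The temporal features described in the overview (review age, month, day, and hour) can be derived from a timestamp column along the following lines. This is a sketch in pandas, not the `QuantStage` implementation; the function name, the column names, and the choice to measure age relative to the most recent review are assumptions.

```python
from typing import Optional

import pandas as pd


def add_temporal_features(
    reviews: pd.DataFrame,
    date_col: str = "date",
    reference_date: Optional[str] = None,
) -> pd.DataFrame:
    """Return a copy of `reviews` with age and calendar features added."""
    enriched = reviews.copy()
    dates = pd.to_datetime(enriched[date_col])
    # Assumption: age is measured against the newest review unless a
    # reference date is supplied explicitly.
    reference = pd.Timestamp(reference_date) if reference_date else dates.max()
    enriched["review_age_days"] = (reference - dates).dt.days
    enriched["review_month"] = dates.dt.month
    enriched["review_day_of_week"] = dates.dt.dayofweek  # Monday is 0
    enriched["review_hour"] = dates.dt.hour
    return enriched


sample = pd.DataFrame({"date": ["2024-01-01 08:30:00", "2024-01-11 20:00:00"]})
features = add_temporal_features(sample)
print(features)
```

Passing an explicit `reference_date` keeps review ages stable across reruns when new reviews are added to the dataset.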

## Aggregation Pipelines
Aggregating data at the app and category levels provides a high-level view of review trends and user behavior, offering insights into user engagement and feedback patterns. At the app level, we consolidate key metrics, such as average ratings, review length, review count, and total vote sum, while identifying standout reviews based on highest vote counts, top TQA scores, and longest review lengths. 

A similar approach is used at the category level, aggregating metrics across all apps within a category to reveal trends that may indicate common strengths or pain points across similar apps. This two-tiered aggregation—app-level and category-level—allows for both detailed and broad insights into app performance, aiding in strategic decisions and market comparisons.
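
The two-tiered aggregation can be sketched with a pandas `groupby`, applying the same summary at the app level and at the category level. This is not the `AggregationStage` implementation; the function name `aggregate_by` and the column names (`app_id`, `category`, `rating`, `review_length`, `vote_sum`) are assumptions for illustration.

```python
import pandas as pd


def aggregate_by(reviews: pd.DataFrame, key: str) -> pd.DataFrame:
    """Summarize review metrics for each value of `key`."""
    return (
        reviews.groupby(key)
        .agg(
            review_count=("rating", "count"),
            average_rating=("rating", "mean"),
            average_review_length=("review_length", "mean"),
            total_votes=("vote_sum", "sum"),
            max_votes=("vote_sum", "max"),
        )
        .reset_index()
    )


# Toy data with assumed column names.
sample = pd.DataFrame(
    {
        "app_id": ["a", "a", "b"],
        "category": ["games", "games", "games"],
        "rating": [5, 3, 4],
        "review_length": [10, 20, 30],
        "vote_sum": [1, 2, 3],
    }
)
app_summary = aggregate_by(sample, "app_id")
category_summary = aggregate_by(sample, "category")
print(app_summary)
print(category_summary)
```

Using one function for both levels keeps the metrics directly comparable between an app and its category.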

### App Aggregation Pipeline

In [None]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.AGG
)

# Build and run the aggregation stage
stage = AggregationStage.build(stage_config=stage_config, force=FORCE)
asset_ids = stage.run()

In [None]:
asset_ids

## Enrichment Stage Wrap-Up
The enrichment stage augmented the dataset with review metadata (length, age, and temporal features), sentiment labels, text quality scores, and app- and category-level aggregations. In the upcoming EDA phase, we will use these enriched attributes to uncover patterns, relationships, and trends that illuminate user behavior and app performance.