In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
FORCE = False

# AppVoCAI Dataset Enrichment
This data enrichment effort equips the subsequent data quality and exploratory analyses with quality signals, user engagement data, target class distributions, and aggregations. Together, these position us for a systematic data quality analysis and an insight-rich exploratory effort. The enrichment stage unfolds in five progressive steps:

1. **Text Quality Detection**: We identify and address extraneous characters, non-standard symbols, and other noise elements that may distort analytical insights. This ensures our textual data maintains a high level of clarity and precision, which is crucial for accurate natural language processing.

2. **Text Quality Analysis**: We evaluate grammatical complexity, syntactic structure diversity, coherence, clarity, intensity, and overall linguistic elaborateness (see {ref}`appendix:tqs`). These factors significantly affect the performance of language models, and measuring them improves our understanding of nuanced user sentiment and intent.

3. **Sentiment Classification**: A rule-based sentiment classifier (VADER, per the `VaderSentimentAnalysisTask` named in the run output) allows for a computationally efficient, high-level analysis of sentiment distribution and balance within the dataset. This provides an initial framework to identify emotional trends and ensure the dataset is representative of a wide range of user experiences.

4. **Quantitative Enrichment**: Decomposing timestamps yields valuable temporal features, such as the relative age of reviews and submission details like month, day, and hour. This enables us to conduct temporal and longitudinal analyses, uncover cyclical trends in app usage, and observe variations in user behavior. Analyzing deviations from category-level and app-level themes may reveal unmet needs, feature gaps, and inconsistencies in user experiences.

5. **Aggregate Data Analysis**: By summarizing data at the app, author, and category levels, we expose overarching themes related to user engagement, satisfaction, and app performance. This macro-level analysis provides insight into broader dynamics of user interactions, highlighting areas of strength and opportunities for improvement.

### Early Feature Engineering?
Excellent question. Whereas feature engineering derives new variables that are expected to influence model development and predictive performance, this data enrichment effort aims to facilitate rigorous data quality analysis and exploration while minimizing bias and avoiding transformations that might distort or invalidate analytical interpretations.

Let's move forward!


## Import Libraries

In [2]:
from discover.container import DiscoverContainer
from discover.core.flow import DataPrepStageDef, PhaseDef
from discover.flow.data_prep.aggregation.stage import AggregationStage
from discover.flow.data_prep.dqm.stage import TextQualityDetectionStage
from discover.flow.data_prep.quant.stage import QuantStage
from discover.flow.data_prep.sentiment.stage import SentimentClassificationStage
from discover.flow.data_prep.tqa.stage import TQAStage
from discover.infra.config.flow import FlowConfigReader

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
        "discover.flow.data_prep.aggregation.stage",
    ],
)

## Text Quality Detection (TQD) Pipeline
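As a rough illustration of the kind of noise this stage looks for (extraneous characters, non-standard symbols, and similar elements), the sketch below flags a few common patterns with regular expressions. It is not the stage's implementation: the `detect_noise` function, the flag names, and the specific patterns and thresholds are all assumptions chosen for the example.

```python
import re

# Hypothetical noise patterns; the actual stage may apply different rules.
PATTERNS = {
    # ASCII control characters other than tab, newline, and carriage return
    "has_control_chars": re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]"),
    # Any character outside the 7-bit ASCII range
    "has_non_ascii": re.compile(r"[^\x00-\x7f]"),
    # Three or more consecutive whitespace characters
    "has_excess_whitespace": re.compile(r"\s{3,}"),
    # The same character repeated four or more times in a row
    "has_repeated_chars": re.compile(r"(.)\1{3,}"),
}


def detect_noise(text: str) -> dict:
    """Return one boolean flag per pattern indicating whether it occurs in text."""
    return {name: bool(pattern.search(text)) for name, pattern in PATTERNS.items()}


flags = detect_noise("Greeeeat app!!!   Love it \x07")
```

Returning flags rather than cleaning the text keeps detection separate from any later decision about how to treat the affected reviews.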

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.TQD
)
# Build and run the stage
stage = TextQualityDetectionStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()



#                          Text Quality Detection Stage                          #



                          Text Quality Detection Stage                          
                           Stage Started | Tue, 12 Nov 2024 02:09:15
                         Stage Completed | Tue, 12 Nov 2024 02:09:15
                           Stage Runtime | 0.0 seconds
                           Cached Result | True





## Text Quality Analysis (TQA) Pipeline 

In [5]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.TQA
)

# Build and run the Text Quality Analysis stage
stage = TQAStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()



#                          Text Quality Analysis Stage                           #



                          Text Quality Analysis Stage                           
                           Stage Started | Tue, 12 Nov 2024 02:09:17
                         Stage Completed | Tue, 12 Nov 2024 02:09:17
                           Stage Runtime | 0.0 seconds
                           Cached Result | True





## Sentiment Classification Pipeline  
The Review-Level Sentiment Classification Pipeline scores the sentiment of each review on a scale from -1 to 1 using a rule-based analyzer (the run output below names the task `VaderSentimentAnalysisTask`). Reviews are then classified as negative, neutral, or positive by dividing this scale into three equal spans.
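
The three-way split described above can be sketched as follows. This is a minimal illustration rather than the pipeline's code: the function name `classify_sentiment` is hypothetical, the cut points of -1/3 and 1/3 follow from dividing [-1, 1] into three equal spans, and assigning scores that fall exactly on a cut point to the neutral class is an assumption.

```python
def classify_sentiment(score: float) -> str:
    """Map a sentiment score in [-1, 1] to a label using three equal-width spans."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score must be in [-1, 1], got {score}")
    lower, upper = -1.0 / 3.0, 1.0 / 3.0  # assumed cut points
    if score < lower:
        return "negative"
    if score > upper:
        return "positive"
    return "neutral"  # boundary values are treated as neutral (assumption)


labels = [classify_sentiment(s) for s in (-0.8, 0.0, 0.6)]
```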

In [6]:
# Force this and the following stages to re-run rather than reuse cached results
FORCE = True
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.SENTIMENT
)

# Build and run the Sentiment Classification stage
stage = SentimentClassificationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/12/2024 02:09:24 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-04_sentiment-review-dataset.parquet from repository.
[11/12/2024 02:09:24 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-sentiment-review from the repository.




#                         Sentiment Classification Stage                         #



                           VaderSentimentAnalysisTask                           
                           --------------------------                           
                          Start Datetime | Tue, 12 Nov 2024 02:09:24



                       Complete Datetime | Tue, 12 Nov 2024 02:09:31
                                 Runtime | 7.4 seconds


                         Sentiment Classification Stage                         
                           Stage Started | Tue, 12 Nov 2024 02:09:24
                         Stage Completed | Tue, 12 Nov 2024 02:09:32
                           Stage Runtime | 8.38 seconds





---

## Quantitative Enrichment Pipeline
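This stage derives temporal features from each review's timestamp: review age, month, day of week, and hour. The sketch below shows one way such a decomposition could look in pandas. It is an illustration under stated assumptions, not the stage's code: the column names are hypothetical, pandas is used only for brevity, and measuring age in whole days relative to the most recent review in the sample is an assumed definition of "relative age".

```python
import pandas as pd

# Toy sample; the column names are hypothetical.
reviews = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "date": pd.to_datetime(
            ["2023-01-15 08:30:00", "2023-06-01 21:05:00", "2023-06-11 13:45:00"]
        ),
    }
)

# Age in whole days relative to the most recent review (assumed definition).
latest = reviews["date"].max()
reviews["review_age"] = (latest - reviews["date"]).dt.days

# Calendar components of the submission timestamp.
reviews["review_month"] = reviews["date"].dt.month
reviews["review_day_of_week"] = reviews["date"].dt.dayofweek  # Monday=0, Sunday=6
reviews["review_hour"] = reviews["date"].dt.hour
```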

In [7]:
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.QUANT
)

# Build and run the Quantitative Enrichment stage
stage = QuantStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/12/2024 02:09:34 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-05_quant-review-dataset.parquet from repository.
[11/12/2024 02:09:34 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-quant-review from the repository.




#                         Quantitative Enrichment Stage                          #



                                                                                



                              ComputeReviewAgeTask                              
                              --------------------                              
                          Start Datetime | Tue, 12 Nov 2024 02:09:47


                                                                                

                       Complete Datetime | Tue, 12 Nov 2024 02:09:51
                                 Runtime | 3.79 seconds


                             ComputeReviewMonthTask                             
                             ----------------------                             
                          Start Datetime | Tue, 12 Nov 2024 02:09:51
                       Complete Datetime | Tue, 12 Nov 2024 02:09:51
                                 Runtime | 0.09 seconds


                           ComputeReviewDayofWeekTask                           
                           --------------------------                           
                          Start Datetime | Tue, 12 Nov 2024 02:09:51
                       Complete Datetime | Tue, 12 Nov 2024 02:09:51
                                 Runtime | 0.07 seconds


                             ComputeReviewHourTask                              
                             ---------------------                          





                         Quantitative Enrichment Stage                          
                           Stage Started | Tue, 12 Nov 2024 02:09:34
                         Stage Completed | Tue, 12 Nov 2024 02:10:02
                           Stage Runtime | 28.58 seconds





                                                                                

## Aggregation Pipelines
Aggregating data at the app and category levels provides a high-level view of review trends and user behavior, offering insights into user engagement and feedback patterns. At the app level, we consolidate key metrics, such as average ratings, review length, review count, and total vote sum, while identifying standout reviews based on highest vote counts, top TQA scores, and longest review lengths. 

A similar approach is used at the category level, aggregating metrics across all apps within a category to reveal trends that may indicate common strengths or pain points across similar apps. Aggregating at both the app level and the category level allows for detailed as well as broad insights into app performance, aiding strategic decisions and market comparisons.
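
The aggregation described above can be sketched with a pandas `groupby`. This is a minimal illustration, not the `AggregationStage` implementation: the `aggregate` helper, the column names, and the toy data are assumptions, and only the vote-based standout review is shown.

```python
import pandas as pd

# Toy review-level data; all column names are hypothetical.
reviews = pd.DataFrame(
    {
        "app_id": ["a1", "a1", "a2", "a2", "a2"],
        "category": ["Health", "Health", "Games", "Games", "Games"],
        "rating": [5, 3, 4, 2, 3],
        "review_length": [120, 40, 15, 300, 75],
        "vote_sum": [2, 0, 1, 7, 0],
    }
)


def aggregate(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Summarize review metrics for each value of the grouping key."""
    return (
        df.groupby(key)
        .agg(
            review_count=("rating", "size"),
            average_rating=("rating", "mean"),
            average_review_length=("review_length", "mean"),
            total_vote_sum=("vote_sum", "sum"),
        )
        .reset_index()
    )


app_agg = aggregate(reviews, "app_id")
category_agg = aggregate(reviews, "category")

# Standout review per app: the row with the highest vote sum.
top_voted = reviews.loc[reviews.groupby("app_id")["vote_sum"].idxmax()]
```

Using one helper for both levels keeps the app-level and category-level summaries directly comparable, since they share the same metric definitions.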

### App and Category Aggregation Pipeline

In [8]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.AGG
)

# Build and run the Aggregation stage
stage = AggregationStage.build(stage_config=stage_config, force=FORCE)
asset_ids = stage.run()



#                               Aggregation Stage                                #



[11/12/2024 02:10:03 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-06_agg-app-dataset.parquet from repository.
[11/12/2024 02:10:03 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-agg-app from the repository.




                               AppAggregationTask                               
                               ------------------                               
                          Start Datetime | Tue, 12 Nov 2024 02:10:03
                       Complete Datetime | Tue, 12 Nov 2024 02:10:03
                                 Runtime | 0.53 seconds


[11/12/2024 02:10:13 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-06_agg-category-dataset.parquet from repository.
[11/12/2024 02:10:13 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-agg-category from the repository.




                            CategoryAggregationTask                             
                            -----------------------                             
                          Start Datetime | Tue, 12 Nov 2024 02:10:13
                       Complete Datetime | Tue, 12 Nov 2024 02:10:13
                                 Runtime | 0.38 seconds





                               Aggregation Stage                                
                           Stage Started | Tue, 12 Nov 2024 02:10:03
                         Stage Completed | Tue, 12 Nov 2024 02:10:16
                           Stage Runtime | 13.05 seconds





                                                                                

In [9]:
asset_ids

{'app': 'dataset-dev-dataprep-agg-app',
 'category': 'dataset-dev-dataprep-agg-category'}

## Enrichment Stage Wrap-Up
The enrichment stage added new features to the dataset: review metadata (such as length, age, and temporal attributes), sentiment classifications, text quality scores, and app- and category-level aggregations. In the upcoming EDA phase, we will use these enriched attributes to uncover patterns, relationships, and trends that illuminate user behavior and app performance.