In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
FORCE = False

# AppVoCAI Dataset Enrichment
This data enrichment stage adds quality signals, user engagement data, target class distributions, and aggregations that prepare the dataset for a systematic data quality analysis and an insight-rich exploratory analysis. The stage unfolds in four progressive steps:

1. **Sentiment Classification**: spaCy's rule-based sentiment classifier provides a computationally efficient, high-level view of sentiment distribution and class balance in the dataset.
2. **Text Quality Assessment**: We evaluate the grammatical sophistication, syntactic structure, diversity, coherence, clarity, intensity, and linguistic elaborateness of the user reviews, aspects known to affect LLM performance.
3. **Quantitative Enrichment**: Decomposing timestamps yields temporal features, such as the age of each review relative to the most recent one and the month, day, and hour of submission. These features support temporal and longitudinal analysis of user engagement, app usage, and cyclical trends in user behavior. Deviations from broader patterns at the category and app levels may reveal feature gaps, unmet needs, and notable variability in user experiences.
4. **Aggregate Data**: Summarizing information at the app, author, and category levels exposes overarching themes related to user engagement, satisfaction, and app performance, providing a macro-level perspective on the data.

Is this early feature engineering? Great question. It is not. Feature engineering derives new variables that are expected to influence model development and predictive performance. While there may be some overlap, our purpose here is to facilitate rigorous data quality analysis and exploration while minimizing bias and avoiding transformations that might distort or invalidate analytical interpretations.

Let's do this.


## Import Libraries

In [2]:
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.core.flow import PhaseDef, DataPrepStageDef
from discover.flow.data_prep.quant.stage import QuantStage
from discover.flow.data_prep.sentiment.stage import SentimentClassificationStage
from discover.flow.data_prep.tqa.stage import TQAStage
from discover.flow.data_prep.aggregation.stage import AggregationStage

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
        "discover.flow.data_prep.aggregation.stage",
    ],
)

## Sentiment Classification Pipeline  
The Review-Level Sentiment Classification Pipeline uses spaCy to analyze sentiment on a scale from -1 to 1. Reviews are then classified as negative, neutral, or positive by dividing this scale into three equal spans.
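
The three-way split can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the function name `classify_sentiment` and the handling of the boundary values at -1/3 and 1/3 (assigned to neutral here) are assumptions.

```python
def classify_sentiment(score: float) -> str:
    """Map a sentiment score in [-1, 1] to one of three equal-width classes."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score must be in [-1, 1], got {score}")
    # Three spans of width 2/3: [-1, -1/3), [-1/3, 1/3], (1/3, 1]
    if score < -1 / 3:
        return "negative"
    if score > 1 / 3:
        return "positive"
    return "neutral"


for s in (-0.8, 0.0, 0.5):
    print(s, classify_sentiment(s))
```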

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.SENTIMENT
)

# Build and run the Sentiment Classification Stage
stage = SentimentClassificationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()



#                         Sentiment Classification Stage                         #



                         Sentiment Classification Stage                         
                           Stage Started | Mon, 11 Nov 2024 04:24:55
                         Stage Completed | Mon, 11 Nov 2024 04:24:55
                           Stage Runtime | 0.01 seconds
                           Cached Result | True





---

## Review Text Quality Analysis (TQA) Pipeline
Review text quality is an indicator of the content's richness, coherence, and informativeness. In this section, we integrate two complementary quality assessment measures—a lexical/syntactic score and a perplexity-based score—into a weighted sum. This approach provides a balanced evaluation, capturing both the structural diversity and natural language fluency of the reviews.

### Lexical and Syntactic Complexity Assessment  
The lexical and syntactic complexity assessment scores each review with a composite of five syntactic and lexical measures, weighted as follows:

1. **POS Count Score (40%)**: Reflects the richness of content using counts of nouns, verbs, adjectives, and adverbs.
2. **POS Diversity Score (20%)**: Captures variety in language using an entropy-based calculation.
3. **POS Intensity Score (10%)**: Assesses the density of key parts of speech relative to total word count.
4. **Structural Complexity Score (20%)**: Evaluates text complexity using unique word proportion, special character usage, and word length variation.
5. **TQA Check Score (10%)**: Incorporates quality signals such as limited digit use, minimal special characters, and proper terminal punctuation.

A high Lexical and Syntactic Complexity Score typically indicates a text rich in linguistic features, with varied sentence structures and a well-balanced mix of nouns, verbs, and modifiers (like adjectives and adverbs). This variety is particularly valuable for tasks like Aspect-Based Sentiment Analysis (ABSA), where structural complexity can signal content with nuanced aspects and sentiments.
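
To make the weighting concrete, here is a minimal sketch of the composite. The weights come from the list above. Everything else is an assumption for illustration: the component names, the premise that each component is already scaled to [0, 1], and the use of normalized Shannon entropy over four POS categories as the "entropy-based calculation".

```python
import math
from collections import Counter

# Weights from the list above; the keys are illustrative names.
WEIGHTS = {
    "pos_count": 0.4,
    "pos_diversity": 0.2,
    "pos_intensity": 0.1,
    "structural_complexity": 0.2,
    "tqa_check": 0.1,
}


def pos_diversity(pos_tags: list[str], n_categories: int = 4) -> float:
    """Shannon entropy of the POS tag distribution, scaled by log2(n_categories)."""
    counts = Counter(pos_tags)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(n_categories)


def composite_score(components: dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical component values for a single review
example = {
    "pos_count": 0.5,
    "pos_diversity": pos_diversity(["NOUN", "VERB", "ADJ", "ADV"]),
    "pos_intensity": 0.6,
    "structural_complexity": 0.4,
    "tqa_check": 1.0,
}
print(round(composite_score(example), 3))
```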

### Perplexity-Based Quality Assessment  
This measure evaluates review quality by applying 13 linguistic and structural filters, each assigned a weight derived from the relative perplexity difference between the full dataset and the subset selected by that filter. The filters assess features like adjective presence, punctuation ratios, word repetition, and special character use. Filters whose subsets reduce perplexity the most receive the largest weights, because those subsets contain the most fluent and coherent text. The final score is a weighted sum of the filter indicators.

Lower perplexity implies higher fluency, coherence, and grammatical correctness, which are key indicators of text quality. This component is useful for flagging low-quality or noisy text that may be unpredictable or deviate significantly from standard linguistic norms.
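
A minimal sketch of how such weights might be derived and applied is shown below. The exact formula is an assumption: here each weight is the relative perplexity reduction `(PP_full - PP_subset) / PP_full`, clipped at zero and normalized to sum to one. The filter names and perplexity values are made up for illustration and are not results from this dataset.

```python
def filter_weights(
    full_perplexity: float, subset_perplexity: dict[str, float]
) -> dict[str, float]:
    """Weight each filter by the relative perplexity reduction of its subset."""
    # Filters that do not reduce perplexity get zero weight (an assumption).
    gains = {
        name: max(0.0, (full_perplexity - pp) / full_perplexity)
        for name, pp in subset_perplexity.items()
    }
    total = sum(gains.values())
    if total == 0:
        return {name: 0.0 for name in gains}
    return {name: gain / total for name, gain in gains.items()}


def perplexity_quality_score(
    indicators: dict[str, bool], weights: dict[str, float]
) -> float:
    """Weighted sum of the binary filter indicators for one review."""
    return sum(weights[name] for name, passed in indicators.items() if passed)


# Hypothetical perplexities: full dataset vs. three filtered subsets
weights = filter_weights(
    100.0,
    {"has_adjective": 80.0, "low_word_repetition": 90.0, "high_special_chars": 110.0},
)
review = {"has_adjective": True, "low_word_repetition": False, "high_special_chars": True}
print(weights)
print(perplexity_quality_score(review, weights))
```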

### Weighted Scoring Approach
To create a balanced quality score, the Syntactic Complexity Score and Perplexity-Based Score are combined with tailored weights that emphasize their respective strengths. 

- **Syntactic Complexity Weight**: Typically given more weight when the task demands detailed and linguistically rich text, such as ABSA, where richer syntactic content improves aspect and sentiment extraction.
- **Perplexity-Based Weight**: Often assigned a moderate weight to capture coherence and fluency, ensuring that only grammatically sound and predictable text is prioritized without sacrificing syntactic diversity.

The final **Text Quality Score** is a weighted average of these two components, providing a single score that balances syntactic richness and linguistic fluency. The rest of this section runs the text quality scoring pipeline, which computes the two quality measures and integrates them into a final text quality score. The specifics of the measures are provided in the {ref}`Text Quality Method` section of the appendix.
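
As a sketch, the combination reduces to a single convex weighting of the two component scores. The function name and the default weight of 0.6 are placeholders for illustration, not the values used by the pipeline.

```python
def text_quality_score(
    syntactic: float, perplexity: float, syntactic_weight: float = 0.6
) -> float:
    """Weighted average of the syntactic and perplexity-based scores."""
    if not 0.0 <= syntactic_weight <= 1.0:
        raise ValueError("syntactic_weight must be in [0, 1]")
    # The perplexity-based score receives the remaining weight
    return syntactic_weight * syntactic + (1.0 - syntactic_weight) * perplexity


print(text_quality_score(0.8, 0.5))
```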

In [None]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.TQA
)

# Build and run the Text Quality Analysis Stage
stage = TQAStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/11/2024 04:24:56 AM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-tqa-review from the repository.




#                               Text Quality Stage                               #






                                    NLPTask                                     
                                    -------                                     
                          Start Datetime | Mon, 11 Nov 2024 04:25:26
pos_ud_ewt download started this may take some time.
Approximate size to download 2.2 MB
Download done! Loading the resource.
[OK!]
                       Complete Datetime | Mon, 11 Nov 2024 04:26:07
                                 Runtime | 41.08 seconds


                              ComputePOSStatsTask                               
                              -------------------                               
                          Start Datetime | Mon, 11 Nov 2024 04:26:07
                       Complete Datetime | Mon, 11 Nov 2024 04:26:07
                                 Runtime | 0.71 seconds


                             ComputeBasicStatsTask                              
                             ---------------------                              
                          Start Datetime | Mon, 11 Nov 2024 04:26:07
                       Complete Datetime | Mon, 11 Nov 2024 04:26:08
                                 Runtime | 0.81 seconds


                             ComputeTQAFiltersTask                              
                             ---------------------                   

                                                                                

                       Complete Datetime | Mon, 11 Nov 2024 04:27:59
                                 Runtime | 1.0 minutes and 49.92 seconds


                                    TQATask2                                    
                                    --------                                    
                          Start Datetime | Mon, 11 Nov 2024 04:27:59
                       Complete Datetime | Mon, 11 Nov 2024 04:27:59
                                 Runtime | 0.71 seconds


                                    TQATask3                                    
                                    --------                                    
                          Start Datetime | Mon, 11 Nov 2024 04:27:59


                                                                                

                       Complete Datetime | Mon, 11 Nov 2024 04:35:15
                                 Runtime | 7.0 minutes and 15.08 seconds






                               Text Quality Stage                               
                           Stage Started | Mon, 11 Nov 2024 04:24:56
                         Stage Completed | Mon, 11 Nov 2024 04:37:03
                           Stage Runtime | 12.0 minutes and 6.57 seconds





                                                                                

## Quantitative Enrichment
This pipeline enriches the dataset by extracting and analyzing key features such as **review age**, **review length**, and three **temporal fields**—**month**, **day of the week**, and **hour of the day**. These features are designed to provide insights into user engagement patterns and app usage, supporting unmet needs discovery in the following ways:

1. **Review Age**: Understanding how review content changes over time can help identify shifting user expectations and long-standing issues that may require attention. Analyzing review age trends can inform decisions on product updates and feature prioritization.

2. **Review Length**: Longer reviews often contain richer, more detailed feedback. By evaluating review length, we can identify opportunities to address comprehensive user concerns or highlight areas where users have strong opinions or pain points.

3. **Month**: Monthly trends can reveal **seasonal usage patterns**, indicating how user needs change throughout the year. This can inform strategies like seasonal feature releases, marketing campaigns, or resource allocation.

4. **Day of the Week**: Analyzing reviews by the day of the week might uncover insights into **weekly routines** and how apps fit into users' schedules. For instance, productivity apps might show spikes on weekdays, while entertainment apps might peak on weekends, guiding feature development aligned with these usage patterns.

5. **Hour of the Day**: Understanding when users are most active can provide clues about **contextual needs**. Apps with high engagement at night might suggest opportunities for sleep-related features, while those used during commutes may benefit from quick, on-the-go functionalities.

Together, these metadata features enhance our understanding of app usage and user behavior, laying the foundation for identifying unmet needs and informing opportunity discovery.
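
The features above can be sketched for a single review with the standard library. The output keys mirror the column names visible in the pipeline's logs. The units and conventions are assumptions: age in whole days, length in whitespace-delimited words, and ISO weekday numbering.

```python
from datetime import datetime


def enrich(review_date: datetime, latest_date: datetime, content: str) -> dict:
    """Derive age, length, and temporal fields for a single review."""
    return {
        # Age in whole days relative to the most recent review in the dataset
        "enr_review_age": (latest_date - review_date).days,
        # Length in whitespace-delimited words (characters would also work)
        "review_length": len(content.split()),
        "enr_review_month": review_date.month,  # 1 = January
        "enr_review_day_of_week": review_date.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "enr_review_hour": review_date.hour,  # 0-23
    }


dates = [datetime(2023, 3, 4, 21, 15), datetime(2023, 9, 1, 8, 0)]
row = enrich(dates[0], max(dates), "Great app but crashes at night")
print(row)
```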

In [None]:
FORCE = True
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.QUANT
)

# Build and run the Quantitative Enrichment Stage
stage = QuantStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()



#                         Quantitative Enrichment Stage                          #



                              ComputeReviewAgeTask                              
                              --------------------                              
                          Start Datetime | Mon, 11 Nov 2024 04:37:47
                       Complete Datetime | Mon, 11 Nov 2024 04:37:48
                                 Runtime | 0.21 seconds


                             ComputeReviewMonthTask                             
                             ----------------------                             
                          Start Datetime | Mon, 11 Nov 2024 04:37:48
                       Complete Datetime | Mon, 11 Nov 2024 04:37:48
                                 Runtime | 0.02 seconds


                           ComputeReviewDayofWeekTask                           
                           --------------------------                           
                          Start Da

[11/11/2024 04:37:48 AM] [ERROR] [ComputePercentDeviationTask.run] [wrapper] : Exception occurred in ComputePercentDeviationTask called with data=DataFrame[category: string, id: string, app_id: string, app_name: string, category_id: string, author: string, rating: smallint, content: string, vote_sum: bigint, vote_count: bigint, date: timestamp_ntz, review_length: bigint, sentiment: double, sentiment_classification: string, pos_n_nouns: int, pos_n_verbs: int, pos_n_adjectives: int, pos_n_adverbs: int, pos_n_determiners: int, pos_p_nouns: double, pos_p_verbs: double, pos_p_adjectives: double, pos_p_adverbs: double, pos_p_determiners: double, stats_char_count: int, stats_digits_count: int, stats_digits_proportion: double, stats_special_chars_count: int, stats_special_chars_proportion: double, stats_punctuation_count: int, stats_punctuation_proportion: double, stats_word_count: int, stats_unique_word_count: int, stats_unique_word_proportion: double, stats_word_repetition_ratio: double, sta

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `enr_review_length` cannot be resolved. Did you mean one of the following? [`enr_review_month`, `review_length`, `enr_review_age`, `enr_review_hour`, `enr_review_day_of_week`].;
'Aggregate [category#3988], [category#3988, avg('enr_review_length) AS avg_value#4775]
+- Project [category#3988, id#3928, app_id#3929, app_name#3930, category_id#3931, author#3932, rating#3933, content#3934, vote_sum#3935L, vote_count#3936L, date#3937, review_length#3938L, sentiment#3939, sentiment_classification#3940, pos_n_nouns#3941, pos_n_verbs#3942, pos_n_adjectives#3943, pos_n_adverbs#3944, pos_n_determiners#3945, pos_p_nouns#3946, pos_p_verbs#3947, pos_p_adjectives#3948, pos_p_adverbs#3949, pos_p_determiners#3950, ... 42 more fields]
   +- Project [category#3988, id#3928, app_id#3929, app_name#3930, category_id#3931, author#3932, rating#3933, content#3934, vote_sum#3935L, vote_count#3936L, date#3937, review_length#3938L, sentiment#3939, sentiment_classification#3940, pos_n_nouns#3941, pos_n_verbs#3942, pos_n_adjectives#3943, pos_n_adverbs#3944, pos_n_determiners#3945, pos_p_nouns#3946, pos_p_verbs#3947, pos_p_adjectives#3948, pos_p_adverbs#3949, pos_p_determiners#3950, ... 43 more fields]
      +- Project [category#3988, id#3928, app_id#3929, app_name#3930, category_id#3931, author#3932, rating#3933, content#3934, vote_sum#3935L, vote_count#3936L, date#3937, review_length#3938L, sentiment#3939, sentiment_classification#3940, pos_n_nouns#3941, pos_n_verbs#3942, pos_n_adjectives#3943, pos_n_adverbs#3944, pos_n_determiners#3945, pos_p_nouns#3946, pos_p_verbs#3947, pos_p_adjectives#3948, pos_p_adverbs#3949, pos_p_determiners#3950, ... 42 more fields]
         +- Join LeftOuter, (category#3988 = category#4506)
            :- Project [id#3928, app_id#3929, app_name#3930, category_id#3931, author#3932, rating#3933, content#3934, vote_sum#3935L, vote_count#3936L, date#3937, review_length#3938L, sentiment#3939, sentiment_classification#3940, pos_n_nouns#3941, pos_n_verbs#3942, pos_n_adjectives#3943, pos_n_adverbs#3944, pos_n_determiners#3945, pos_p_nouns#3946, pos_p_verbs#3947, pos_p_adjectives#3948, pos_p_adverbs#3949, pos_p_determiners#3950, stats_char_count#3951, ... 41 more fields]
            :  +- Project [id#3928, app_id#3929, app_name#3930, category_id#3931, author#3932, rating#3933, content#3934, vote_sum#3935L, vote_count#3936L, date#3937, review_length#3938L, sentiment#3939, sentiment_classification#3940, pos_n_nouns#3941, pos_n_verbs#3942, pos_n_adjectives#3943, pos_n_adverbs#3944, pos_n_determiners#3945, pos_p_nouns#3946, pos_p_verbs#3947, pos_p_adjectives#3948, pos_p_adverbs#3949, pos_p_determiners#3950, stats_char_count#3951, ... 40 more fields]
            :     +- Project [id#3928, app_id#3929, app_name#3930, category_id#3931, author#3932, rating#3933, content#3934, vote_sum#3935L, vote_count#3936L, date#3937, review_length#3938L, sentiment#3939, sentiment_classification#3940, pos_n_nouns#3941, pos_n_verbs#3942, pos_n_adjectives#3943, pos_n_adverbs#3944, pos_n_determiners#3945, pos_p_nouns#3946, pos_p_verbs#3947, pos_p_adjectives#3948, pos_p_adverbs#3949, pos_p_determiners#3950, stats_char_count#3951, ... 39 more fields]
            :        +- Project [id#3928, app_id#3929, app_name#3930, category_id#3931, author#3932, rating#3933, content#3934, vote_sum#3935L, vote_count#3936L, date#3937, review_length#3938L, sentiment#3939, sentiment_classification#3940, pos_n_nouns#3941, pos_n_verbs#3942, pos_n_adjectives#3943, pos_n_adverbs#3944, pos_n_determiners#3945, pos_p_nouns#3946, pos_p_verbs#3947, pos_p_adjectives#3948, pos_p_adverbs#3949, pos_p_determiners#3950, stats_char_count#3951, ... 38 more fields]
            :           +- Relation [id#3928,app_id#3929,app_name#3930,category_id#3931,author#3932,rating#3933,content#3934,vote_sum#3935L,vote_count#3936L,date#3937,review_length#3938L,sentiment#3939,sentiment_classification#3940,pos_n_nouns#3941,pos_n_verbs#3942,pos_n_adjectives#3943,pos_n_adverbs#3944,pos_n_determiners#3945,pos_p_nouns#3946,pos_p_verbs#3947,pos_p_adjectives#3948,pos_p_adverbs#3949,pos_p_determiners#3950,stats_char_count#3951,... 37 more fields] parquet
            +- Aggregate [category#4506], [category#4506, avg(rating#4451) AS avg_value#4443]
               +- Project [id#4446, app_id#4447, app_name#4448, category_id#4449, author#4450, rating#4451, content#4452, vote_sum#4453L, vote_count#4454L, date#4455, review_length#4456L, sentiment#4457, sentiment_classification#4458, pos_n_nouns#4459, pos_n_verbs#4460, pos_n_adjectives#4461, pos_n_adverbs#4462, pos_n_determiners#4463, pos_p_nouns#4464, pos_p_verbs#4465, pos_p_adjectives#4466, pos_p_adverbs#4467, pos_p_determiners#4468, stats_char_count#4469, ... 41 more fields]
                  +- Project [id#4446, app_id#4447, app_name#4448, category_id#4449, author#4450, rating#4451, content#4452, vote_sum#4453L, vote_count#4454L, date#4455, review_length#4456L, sentiment#4457, sentiment_classification#4458, pos_n_nouns#4459, pos_n_verbs#4460, pos_n_adjectives#4461, pos_n_adverbs#4462, pos_n_determiners#4463, pos_p_nouns#4464, pos_p_verbs#4465, pos_p_adjectives#4466, pos_p_adverbs#4467, pos_p_determiners#4468, stats_char_count#4469, ... 40 more fields]
                     +- Project [id#4446, app_id#4447, app_name#4448, category_id#4449, author#4450, rating#4451, content#4452, vote_sum#4453L, vote_count#4454L, date#4455, review_length#4456L, sentiment#4457, sentiment_classification#4458, pos_n_nouns#4459, pos_n_verbs#4460, pos_n_adjectives#4461, pos_n_adverbs#4462, pos_n_determiners#4463, pos_p_nouns#4464, pos_p_verbs#4465, pos_p_adjectives#4466, pos_p_adverbs#4467, pos_p_determiners#4468, stats_char_count#4469, ... 39 more fields]
                        +- Project [id#4446, app_id#4447, app_name#4448, category_id#4449, author#4450, rating#4451, content#4452, vote_sum#4453L, vote_count#4454L, date#4455, review_length#4456L, sentiment#4457, sentiment_classification#4458, pos_n_nouns#4459, pos_n_verbs#4460, pos_n_adjectives#4461, pos_n_adverbs#4462, pos_n_determiners#4463, pos_p_nouns#4464, pos_p_verbs#4465, pos_p_adjectives#4466, pos_p_adverbs#4467, pos_p_determiners#4468, stats_char_count#4469, ... 38 more fields]
                           +- Relation [id#4446,app_id#4447,app_name#4448,category_id#4449,author#4450,rating#4451,content#4452,vote_sum#4453L,vote_count#4454L,date#4455,review_length#4456L,sentiment#4457,sentiment_classification#4458,pos_n_nouns#4459,pos_n_verbs#4460,pos_n_adjectives#4461,pos_n_adverbs#4462,pos_n_determiners#4463,pos_p_nouns#4464,pos_p_verbs#4465,pos_p_adjectives#4466,pos_p_adverbs#4467,pos_p_determiners#4468,stats_char_count#4469,... 37 more fields] parquet
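
The plan above shows the percent-deviation task averaging a column by category and left-joining the averages back onto the reviews. It fails because `enr_review_length` is not a column, while `review_length` is. Below is a plain-Python sketch of that computation with an up-front column check. The formula `100 * (value - group mean) / group mean` and the function name are assumptions, and this is not the Spark implementation used by the stage.

```python
from collections import defaultdict


def percent_deviation(
    rows: list[dict], column: str, group: str = "category"
) -> list[float]:
    """Percent deviation of each row's value from its group mean."""
    # Fail early with a readable message if the column name is wrong
    available = set().union(*(row.keys() for row in rows))
    if column not in available:
        raise KeyError(f"column {column!r} not found; available: {sorted(available)}")

    totals, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        totals[row[group]] += row[column]
        counts[row[group]] += 1
    means = {g: totals[g] / counts[g] for g in totals}

    # Rows whose group mean is zero are assigned a deviation of 0.0
    return [
        100.0 * (row[column] - means[row[group]]) / means[row[group]]
        if means[row[group]]
        else 0.0
        for row in rows
    ]


rows = [
    {"category": "Games", "review_length": 10},
    {"category": "Games", "review_length": 30},
    {"category": "Books", "review_length": 50},
]
print(percent_deviation(rows, "review_length"))
```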


## Aggregation Pipelines
Aggregating data at the app and category levels provides a high-level view of review trends and user behavior, offering insights into user engagement and feedback patterns. At the app level, we consolidate key metrics, such as average ratings, review length, review count, and total vote sum, while identifying standout reviews based on highest vote counts, top TQA scores, and longest review lengths. 

A similar approach is used at the category level, aggregating metrics across all apps within a category to reveal trends that may indicate common strengths or pain points across similar apps. This two-tiered aggregation—app-level and category-level—allows for both detailed and broad insights into app performance, aiding in strategic decisions and market comparisons.
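
The two tiers can share one grouping routine. The sketch below uses field names that appear in the dataset schema (`app_id`, `category`, `rating`, `review_length`, `vote_sum`, `vote_count`, `id`). The function name, the output keys, and the sample reviews are illustrative assumptions, not the pipeline's implementation or data.

```python
from collections import defaultdict


def aggregate(reviews: list[dict], key: str = "app_id") -> dict[str, dict]:
    """Summarize reviews per group: count, averages, vote total, standout review."""
    groups = defaultdict(list)
    for review in reviews:
        groups[review[key]].append(review)

    summary = {}
    for name, items in groups.items():
        n = len(items)
        summary[name] = {
            "review_count": n,
            "avg_rating": sum(r["rating"] for r in items) / n,
            "avg_review_length": sum(r["review_length"] for r in items) / n,
            "total_vote_sum": sum(r["vote_sum"] for r in items),
            # Standout review: the one with the highest vote count
            "top_voted_review_id": max(items, key=lambda r: r["vote_count"])["id"],
        }
    return summary


reviews = [
    {"id": "r1", "app_id": "a1", "category": "Games", "rating": 5,
     "review_length": 12, "vote_sum": 3, "vote_count": 4},
    {"id": "r2", "app_id": "a1", "category": "Games", "rating": 3,
     "review_length": 40, "vote_sum": 9, "vote_count": 10},
    {"id": "r3", "app_id": "a2", "category": "Games", "rating": 4,
     "review_length": 20, "vote_sum": 0, "vote_count": 0},
]
by_app = aggregate(reviews, key="app_id")
by_category = aggregate(reviews, key="category")
print(by_app["a1"])
print(by_category["Games"])
```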

### App Aggregation Pipeline

In [None]:
# Obtain the configuration
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.AGG
)

# Build and run the Aggregation Stage
stage = AggregationStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

## Enrichment Stage Wrap-Up
The enrichment stage augments the dataset with review metadata (length, age, and temporal fields), sentiment classifications, text quality scores, and app- and category-level aggregations. In the upcoming EDA phase, we will use these enriched attributes to uncover patterns, relationships, and trends that illuminate user behavior and app performance.