In [None]:
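import os
import warnings

warnings.filterwarnings("ignore")

# If running from inside the jbook directory tree, move up two levels.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Passed as `force` to each stage build below.
FORCE = True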

# **AppVoCAI Dataset Enrichment**
To set the foundation for exploratory analysis, we add a few features to the dataset at the **review**, **app**, and **category** levels, providing a cross-dimensional view of the data.  

## **1. Review-Level Enrichments**  
At the most granular level, reviews are enhanced with quality, temporal, and contextual features:  
- **Review Features**: Review length is updated to reflect the text as it stands after the data cleaning stage.
- **Quality Features**: Each review receives a text quality score based on the syntactic and lexical richness and the diversity of its text.
- **Temporal Features**: By decomposing timestamps, we derive review age and submission details (month, day of week, and hour). These features allow us to identify temporal trends and patterns in user feedback.  
- **Rating, Review Age and Review Length Deviations**: Each review's rating, age, and length are compared against the average for its app's category, which highlights reviews that are outliers within that category.  
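
As a rough illustration of the temporal decomposition and category deviations described above, here is a minimal pandas sketch. It is not the pipeline's implementation: the column names (`date`, `category_id`, `rating`), the helper functions, and the percent deviation formula `(value - category mean) / category mean * 100` are all assumptions made for the example.

```python
import pandas as pd

# Toy reviews; column names are illustrative, not the project's schema.
reviews = pd.DataFrame(
    {
        "category_id": ["A", "A", "B", "B"],
        "rating": [5, 3, 4, 2],
        "date": pd.to_datetime(
            [
                "2023-01-15 08:30",
                "2023-06-01 22:10",
                "2022-12-25 13:00",
                "2023-03-09 07:45",
            ]
        ),
    }
)


def add_temporal_features(df: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Hypothetical helper: derive review age and submission details."""
    out = df.copy()
    out["review_age"] = (as_of - out["date"]).dt.days  # whole days
    out["review_month"] = out["date"].dt.month
    out["review_day_of_week"] = out["date"].dt.dayofweek  # Monday = 0
    out["review_hour"] = out["date"].dt.hour
    return out


def add_pct_deviation(
    df: pd.DataFrame, column: str, by: str = "category_id"
) -> pd.DataFrame:
    """Hypothetical helper: percent deviation from the group mean (assumed formula)."""
    out = df.copy()
    group_mean = out.groupby(by)[column].transform("mean")
    out[f"{column}_pct_deviation"] = (out[column] - group_mean) / group_mean * 100
    return out


enriched = add_temporal_features(reviews, as_of=pd.Timestamp("2023-07-01"))
enriched = add_pct_deviation(enriched, "rating")
print(enriched)
```

Using `transform("mean")` keeps the result aligned row by row with the original reviews, so the deviation can be attached as a new column without a merge.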

## **2. App-Level Enrichments**  
Aggregating data at the app level provides a broader perspective on app performance:  
- **Key Summaries**: Metrics such as the total number of reviews, median `vote_count` and `vote_sum`, `rating`, `perplexity`, and sentiment distribution summarize each app’s reception.  
- **Deviation Statistics**: Comparing app-level metrics against their category averages shows how far an app departs from its peers, which points to its competitive position and its particular strengths or weaknesses.  
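
The app-level summaries and deviations above could be sketched with a pandas group-by along the following lines. The column names, the choice of statistics, and the definition of the category average as the mean of app-level means are assumptions for illustration only; the actual `AppAggregationTask` may compute these differently.

```python
import pandas as pd

# Toy review-level data; the schema is illustrative.
reviews = pd.DataFrame(
    {
        "app_id": ["a1", "a1", "a2", "a2", "a2", "a3"],
        "category_id": ["c1", "c1", "c1", "c1", "c1", "c2"],
        "rating": [5, 4, 2, 3, 1, 4],
        "vote_count": [0, 2, 1, 5, 3, 0],
    }
)

# One row per app: review count, mean rating, median vote count.
app_summary = reviews.groupby(["category_id", "app_id"], as_index=False).agg(
    review_count=("rating", "size"),
    rating_mean=("rating", "mean"),
    vote_count_median=("vote_count", "median"),
)

# Assumed definition: category average = mean of the app-level means.
category_mean = app_summary.groupby("category_id")["rating_mean"].transform("mean")
app_summary["rating_mean_pct_deviation"] = (
    (app_summary["rating_mean"] - category_mean) / category_mean * 100
)
print(app_summary)
```

An app that is alone in its category, like `a3` here, has a deviation of zero by construction, which is worth keeping in mind when reading deviations for sparsely populated categories.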

## **3. Category-Level Enrichments**  
Zooming out further, category-level summaries offer a macro view of app trends within specific domains:  
- **Statistical Summaries**: Similar to the app level, category-level features include the total number of reviews, median `vote_count` and `vote_sum`, `rating`, `perplexity`, `review_age`, `review_length`, and sentiment distribution.  
- **Contextual Insights**: These summaries provide benchmarks for evaluating app performance within its category, helping to contextualize deviations and patterns observed at the app and review levels.  
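
A category-level summary with a sentiment distribution might look like the sketch below. The `sentiment` column, its labels, and the output column names are hypothetical; the sketch only shows one way to combine numeric summaries with per-category sentiment shares in pandas.

```python
import pandas as pd

# Toy review-level data; the `sentiment` labels are assumed for illustration.
reviews = pd.DataFrame(
    {
        "category_id": ["c1", "c1", "c1", "c2", "c2"],
        "rating": [5, 3, 1, 4, 4],
        "review_length": [10, 40, 25, 8, 12],
        "sentiment": ["positive", "neutral", "negative", "positive", "positive"],
    }
)

# Numeric summaries per category.
category_stats = reviews.groupby("category_id").agg(
    review_count=("rating", "size"),
    rating_mean=("rating", "mean"),
    review_length_median=("review_length", "median"),
)

# Share of each sentiment label within a category (each row sums to 1).
sentiment_share = pd.crosstab(
    reviews["category_id"], reviews["sentiment"], normalize="index"
)

# Both frames are indexed by category_id, so they join directly.
category_summary = category_stats.join(sentiment_share.add_prefix("sentiment_share_"))
print(category_summary)
```

Normalizing by row (`normalize="index"`) makes the shares comparable across categories with very different review counts.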

This multi-level enrichment gives the exploratory analysis additional context on user feedback at the review, app, and category levels, and forms the backbone of the **AppVoCAI** discovery phase.  

Let's do it!


## Import Libraries

In [2]:
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.core.flow import PhaseDef, StageDef
from discover.flow.stage.data_prep.enrich import DataEnrichmentStage

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.stage.base",
    ],
)

## Review Enrichment
### Review Enrichment Pipeline

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.ENRICH_REVIEW
)
# Build and run the stage
stage = DataEnrichmentStage.build(stage_config=stage_config, force=FORCE)
review_enrichment_asset_id = stage.run()

[12/14/2024 11:35:01 PM] [DEBUG] [discover.flow.stage.base.DataEnrichmentStage] [run] : Execution path: RUN
[12/14/2024 11:35:01 PM] [DEBUG] [discover.infra.service.spark.session.SparkSessionPool] [_create_nlp_session] : Creating a spark nlp session.
[12/14/2024 11:35:01 PM] [DEBUG] [discover.infra.service.spark.session.S



#                            Review Enrichment Stage                             #






                                    NLPTask                                     
                                    -------                                     
                          Start Datetime | Sat, 14 Dec 2024 23:35:24
pos_ud_ewt download started this may take some time.
Approximate size to download 2.2 MB
Download done! Loading the resource.
[OK!]


[12/14/2024 11:35:45 PM] [DEBUG] [NLPTask.run] [wrapper] : Task: NLPTask
[12/14/2024 11:35:45 PM] [DEBUG] [NLPTask.run] [wrapper] : Started: Sat, 14 Dec 2024 23:35:24
[12/14/2024 11:35:45 PM] [DEBUG] [NLPTask.run] [wrapper] : Completed: Sat, 14 Dec 2024 23:35:45
[12/14/2024 11:35:45 PM] [DEBUG] [NLPTask.run] [wrapper] : Runtime: 20.76 seconds


                       Complete Datetime | Sat, 14 Dec 2024 23:35:45
                                 Runtime | 20.76 seconds


                             ComputeTextQualityTask                             
                             ----------------------                             
                          Start Datetime | Sat, 14 Dec 2024 23:35:45


[12/14/2024 11:35:45 PM] [DEBUG] [ComputeTextQualityTask.run] [wrapper] : Task: ComputeTextQualityTask
[12/14/2024 11:35:45 PM] [DEBUG] [ComputeTextQualityTask.run] [wrapper] : Started: Sat, 14 Dec 2024 23:35:45
[12/14/2024 11:35:45 PM] [DEBUG] [ComputeTextQualityTask.run] [wrapper] : Completed: Sat, 14 Dec 2024 23:35:45
[12/14/2024 11:35:45 PM] [DEBUG] [ComputeTextQualityTask.run] [wrapper] : Runtime: 0.66 seconds
[12/14/2024 11:35:46 PM] [DEBUG] [ComputeReviewLengthPS.run] [wrapper] : Task: ComputeReviewLengthPS
[12/14/2024 11:35:46 PM] [DEBUG] [ComputeReviewLengthPS.run] [wrapper] : Started: Sat, 14 Dec 2024 23:35:45
[12/14/2024 11:35:46 PM] [DEBUG] [ComputeReviewLengthPS.run] [wrapper] : Completed: Sat, 14 Dec 2024 23:35:46
[12/14/2024 11:35:46 PM] [DEBUG] [ComputeReviewLengthPS.run] [wrapper] : Runtime: 0.09 seconds


                       Complete Datetime | Sat, 14 Dec 2024 23:35:45
                                 Runtime | 0.66 seconds


                             ComputeReviewLengthPS                              
                             ---------------------                              
                          Start Datetime | Sat, 14 Dec 2024 23:35:45
                       Complete Datetime | Sat, 14 Dec 2024 23:35:46
                                 Runtime | 0.09 seconds


                              ComputeReviewAgeTask                              
                              --------------------                              
                          Start Datetime | Sat, 14 Dec 2024 23:35:46


[12/14/2024 11:35:49 PM] [DEBUG] [ComputeReviewAgeTask.run] [wrapper] : Task: ComputeReviewAgeTask
[12/14/2024 11:35:49 PM] [DEBUG] [ComputeReviewAgeTask.run] [wrapper] : Started: Sat, 14 Dec 2024 23:35:46
[12/14/2024 11:35:49 PM] [DEBUG] [ComputeReviewAgeTask.run] [wrapper] : Completed: Sat, 14 Dec 2024 23:35:49
[12/14/2024 11:35:49 PM] [DEBUG] [ComputeReviewAgeTask.run] [wrapper] : Runtime: 3.54 seconds
[12/14/2024 11:35:49 PM] [DEBUG] [ComputeReviewMonthTask.run] [wrapper] : Task: ComputeReviewMonthTask
[12/14/2024 11:35:49 PM] [DEBUG] [ComputeReviewMonthTask.run] [wrapper] : Started: Sat, 14 Dec 2024 23:35:49
[12/14/2024 11:35:49 PM] [DEBUG] [ComputeReviewMonthTask.run] [wrapper] : Completed: Sat, 14 Dec 2024 23:35:49
[12/14/2024 11:35:49 PM] [DEBUG] [ComputeReviewMonthTask.run] [wrapper] : Runtime: 0.04 seconds
[12/14/2024 11:35:49 PM] [DEBUG] [ComputeReviewDayofWeekTask.run] [wrapper] : Task: ComputeReviewDayofWeekTask
[12/14/2024 11:35:49 PM] [DEBUG] [ComputeReviewDayofWeekTask.

                       Complete Datetime | Sat, 14 Dec 2024 23:35:49
                                 Runtime | 3.54 seconds


                             ComputeReviewMonthTask                             
                             ----------------------                             
                          Start Datetime | Sat, 14 Dec 2024 23:35:49
                       Complete Datetime | Sat, 14 Dec 2024 23:35:49
                                 Runtime | 0.04 seconds


                           ComputeReviewDayofWeekTask                           
                           --------------------------                           
                          Start Datetime | Sat, 14 Dec 2024 23:35:49
                       Complete Datetime | Sat, 14 Dec 2024 23:35:49
                                 Runtime | 0.04 seconds


                             ComputeReviewHourTask                              
                             ---------------------                          

[12/14/2024 11:35:50 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Task: ComputePercentDeviationTask
[12/14/2024 11:35:50 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Started: Sat, 14 Dec 2024 23:35:49
[12/14/2024 11:35:50 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Completed: Sat, 14 Dec 2024 23:35:50
[12/14/2024 11:35:50 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Runtime: 0.29 seconds


                       Complete Datetime | Sat, 14 Dec 2024 23:35:50
                                 Runtime | 0.29 seconds


                          ComputePercentDeviationTask                           
                          ---------------------------                           
                          Start Datetime | Sat, 14 Dec 2024 23:35:50


[12/14/2024 11:35:50 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Task: ComputePercentDeviationTask
[12/14/2024 11:35:50 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Started: Sat, 14 Dec 2024 23:35:50
[12/14/2024 11:35:50 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Completed: Sat, 14 Dec 2024 23:35:50
[12/14/2024 11:35:50 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Runtime: 0.41 seconds


                       Complete Datetime | Sat, 14 Dec 2024 23:35:50
                                 Runtime | 0.41 seconds


                          ComputePercentDeviationTask                           
                          ---------------------------                           
                          Start Datetime | Sat, 14 Dec 2024 23:35:50


[12/14/2024 11:35:51 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Task: ComputePercentDeviationTask
[12/14/2024 11:35:51 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Started: Sat, 14 Dec 2024 23:35:50
[12/14/2024 11:35:51 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Completed: Sat, 14 Dec 2024 23:35:51
[12/14/2024 11:35:51 PM] [DEBUG] [ComputePercentDeviationTask.run] [wrapper] : Runtime: 0.69 seconds


                       Complete Datetime | Sat, 14 Dec 2024 23:35:51
                                 Runtime | 0.69 seconds


[12/14/2024 11:35:52 PM] [DEBUG] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/test/dataset/01_dataprep/appvocai_discover-01_dataprep-06_enrich-review-dataset.parquet from repository.
[12/14/2024 11:35:52 PM] [DEBUG] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-test-dataprep-enrich_review-review from the repository.
[12/14/2024 11:36:11 PM] [DEBUG] [Stage.run] [wrapper] : Stage: Review Enrichment Stage
[12/14/2024 11:36:11 PM] [DEBUG] [Stage.run] [wrapper] : Stage Started: Sat, 14 Dec 2024 23:35:01
[12/14/2024 11:36:11 PM] [DEBUG] [Stage.run] [wrapper] : Stage Completed: Sat, 14 Dec 2024 23:36:11
[12/14/2024 11:36:11 PM] [DEBUG] [Stage.run] [wrapper] : Stage Runtime: 1.0 minutes and 9.74 seconds




                            Review Enrichment Stage                             
                           Stage Started | Sat, 14 Dec 2024 23:35:01
                         Stage Completed | Sat, 14 Dec 2024 23:36:11
                           Stage Runtime | 1.0 minutes and 9.74 seconds





### Review Enrichment Data

In [None]:
repo = container.repo.dataset_repo()
df = repo.get(asset_id=review_enrichment_asset_id, distributed=False, nlp=False).content

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 53 columns):
 #   Column                                            Non-Null Count  Dtype         
---  ------                                            --------------  -----         
 0   id                                                7787 non-null   object        
 1   app_id                                            7787 non-null   object        
 2   app_name                                          7787 non-null   object        
 3   category_id                                       7787 non-null   object        
 4   author                                            7787 non-null   object        
 5   rating                                            7787 non-null   int16         
 6   content                                           7787 non-null   object        
 7   vote_sum                                          7787 non-null   int64         
 8   vote_count                  

## App Enrichment

In [7]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.ENRICH_APP
)
# Build and run the stage
stage = DataEnrichmentStage.build(stage_config=stage_config, force=FORCE)
app_enrichment_asset_id = stage.run()

[12/14/2024 11:36:12 PM] [DEBUG] [discover.flow.stage.base.DataEnrichmentStage] [run] : Execution path: RUN
[12/14/2024 11:36:12 PM] [DEBUG] [discover.infra.service.spark.session.SparkSessionPool] [_create_nlp_session] : Creating a spark nlp session.
[12/14/2024 11:36:12 PM] [DEBUG] [discover.infra.service.spark.session.SparkSessionPool] [_create_nlp_session] : Creating an SparkNLP session. log4j Configuration: file:/home/john/projects/appvocai-discover/log4j.properties




#                              App Enrichment Stage                              #



                               AppAggregationTask                               
                               ------------------                               
                          Start Datetime | Sat, 14 Dec 2024 23:36:12


[12/14/2024 11:36:12 PM] [DEBUG] [AppAggregationTask.run] [wrapper] : Task: AppAggregationTask
[12/14/2024 11:36:12 PM] [DEBUG] [AppAggregationTask.run] [wrapper] : Started: Sat, 14 Dec 2024 23:36:12
[12/14/2024 11:36:12 PM] [DEBUG] [AppAggregationTask.run] [wrapper] : Completed: Sat, 14 Dec 2024 23:36:12
[12/14/2024 11:36:12 PM] [DEBUG] [AppAggregationTask.run] [wrapper] : Runtime: 0.36 seconds


                       Complete Datetime | Sat, 14 Dec 2024 23:36:12
                                 Runtime | 0.36 seconds


[12/14/2024 11:36:14 PM] [DEBUG] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/test/dataset/01_dataprep/appvocai_discover-01_dataprep-07_enrich-app-dataset.parquet from repository.
[12/14/2024 11:36:14 PM] [DEBUG] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-test-dataprep-enrich_app-app from the repository.
[12/14/2024 11:36:16 PM] [DEBUG] [Stage.run] [wrapper] : Stage: App Enrichment Stage
[12/14/2024 11:36:16 PM] [DEBUG] [Stage.run] [wrapper] : Stage Started: Sat, 14 Dec 2024 23:36:12
[12/14/2024 11:36:16 PM] [DEBUG] [Stage.run] [wrapper] : Stage Completed: Sat, 14 Dec 2024 23:36:16
[12/14/2024 11:36:16 PM] [DEBUG] [Stage.run] [wrapper] : Stage Runtime: 4.33 seconds




                              App Enrichment Stage                              
                           Stage Started | Sat, 14 Dec 2024 23:36:12
                         Stage Completed | Sat, 14 Dec 2024 23:36:16
                           Stage Runtime | 4.33 seconds





## Category Enrichment

In [8]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=StageDef.ENRICH_CATEGORY
)
# Build and run the stage
stage = DataEnrichmentStage.build(stage_config=stage_config, force=FORCE)
category_enrichment_asset_id = stage.run()

[12/14/2024 11:36:16 PM] [DEBUG] [discover.flow.stage.base.DataEnrichmentStage] [run] : Execution path: RUN
[12/14/2024 11:36:16 PM] [DEBUG] [discover.infra.service.spark.session.SparkSessionPool] [_create_nlp_session] : Creating a spark nlp session.
[12/14/2024 11:36:16 PM] [DEBUG] [discover.infra.service.spark.session.SparkSessionPool] [_create_nlp_session] : Creating an SparkNLP session. log4j Configuration: file:/home/john/projects/appvocai-discover/log4j.properties




#                           Category Enrichment Stage                            #



                            CategoryAggregationTask                             
                            -----------------------                             
                          Start Datetime | Sat, 14 Dec 2024 23:36:16


[12/14/2024 11:36:17 PM] [DEBUG] [CategoryAggregationTask.run] [wrapper] : Task: CategoryAggregationTask
[12/14/2024 11:36:17 PM] [DEBUG] [CategoryAggregationTask.run] [wrapper] : Started: Sat, 14 Dec 2024 23:36:16
[12/14/2024 11:36:17 PM] [DEBUG] [CategoryAggregationTask.run] [wrapper] : Completed: Sat, 14 Dec 2024 23:36:17
[12/14/2024 11:36:17 PM] [DEBUG] [CategoryAggregationTask.run] [wrapper] : Runtime: 0.29 seconds


                       Complete Datetime | Sat, 14 Dec 2024 23:36:17
                                 Runtime | 0.29 seconds


[12/14/2024 11:36:17 PM] [DEBUG] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/test/dataset/01_dataprep/appvocai_discover-01_dataprep-08_enrich-category-dataset.parquet from repository.
[12/14/2024 11:36:17 PM] [DEBUG] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-test-dataprep-enrich_category-category from the repository.
[12/14/2024 11:36:19 PM] [DEBUG] [Stage.run] [wrapper] : Stage: Category Enrichment Stage
[12/14/2024 11:36:19 PM] [DEBUG] [Stage.run] [wrapper] : Stage Started: Sat, 14 Dec 2024 23:36:16
[12/14/2024 11:36:19 PM] [DEBUG] [Stage.run] [wrapper] : Stage Completed: Sat, 14 Dec 2024 23:36:19
[12/14/2024 11:36:19 PM] [DEBUG] [Stage.run] [wrapper] : Stage Runtime: 2.25 seconds




                           Category Enrichment Stage                            
                           Stage Started | Sat, 14 Dec 2024 23:36:16
                         Stage Completed | Sat, 14 Dec 2024 23:36:19
                           Stage Runtime | 2.25 seconds





---

## Enrichment Stage Wrap-Up
The enrichment stage added review-level features (review length, review age, and submission month, day of week, and hour), sentiment analysis, text quality scores, and app- and category-level aggregations. In the upcoming EDA phase, we will use these enriched attributes to uncover patterns, relationships, and trends in user behavior and app performance.