In [1]:
import warnings

warnings.filterwarnings("ignore")
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
FORCE = False

# **AppVoCAI Dataset Enrichment**
Before the exploratory data analysis, we enrich the dataset with derived features and aggregations that add depth and context. The enrichment phase involves:

**1. Sentiment Classification**: Each review is classified into one of five sentiment categories: `Very Negative`, `Negative`, `Neutral`, `Positive`, and `Very Positive`.
**2. Review Data Enrichment**: Reviews are enhanced with rating, review quality, and temporal features.
**3. App Data Enrichment**: Rating, review count, and vote data are aggregated at the app level.
**4. Category Data Enrichment**: Similarly, rating, review count, and vote data are aggregated at the category level.

This cross-layered enrichment gives the exploratory analysis additional nuance and context on user feedback across individual reviews, apps, and categories.  

## **1. Quantitative Enrichments**  

### **1. Review-Level Enrichments**  
At the most granular level, reviews are enhanced with quality, temporal and contextual features:  
- **Review Features**: Review length is recomputed after the data cleaning stage.
- **Quality Features**: Each review receives a text quality score based on its syntactic and lexical richness and diversity.
- **Temporal Features**: By decomposing timestamps, we derive attributes such as review age and submission details (e.g., month, day, and hour). These features allow us to identify temporal trends and patterns in user feedback.  
- **Rating, Review Age and Review Length Deviations**: Each review is compared against the average for its app's category, highlighting outliers and unique characteristics within individual reviews.  
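The temporal and deviation features described above can be sketched with pandas. This is a minimal illustration on toy data, not the pipeline's actual implementation: the column names (`category`, `rating`, `review_length`, `date`), the `add_review_features` helper, and the choice to measure review age against the most recent review in the data are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy reviews; column names are assumptions for illustration.
reviews = pd.DataFrame(
    {
        "category": ["Games", "Games", "Health", "Health"],
        "rating": [5, 3, 4, 2],
        "review_length": [10, 30, 20, 40],
        "date": pd.to_datetime(
            [
                "2023-01-15 08:30",
                "2023-06-01 21:05",
                "2022-12-25 12:00",
                "2023-03-10 17:45",
            ]
        ),
    }
)


def add_review_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add temporal features and category-relative deviations (illustrative)."""
    out = df.copy()

    # Decompose the timestamp into submission details.
    out["month"] = out["date"].dt.month
    out["day_of_week"] = out["date"].dt.dayofweek  # Monday = 0
    out["hour"] = out["date"].dt.hour

    # Review age in whole days, measured against the latest review in the data
    # (an assumption; a fixed reference date would work equally well).
    out["review_age"] = (out["date"].max() - out["date"]).dt.days

    # Deviation of each review from the mean of its category.
    for col in ["rating", "review_length", "review_age"]:
        category_mean = out.groupby("category")[col].transform("mean")
        out[f"{col}_deviation"] = out[col] - category_mean

    return out


enriched = add_review_features(reviews)
print(enriched[["category", "rating", "rating_deviation", "review_age", "hour"]])
```

Using `groupby(...).transform("mean")` keeps the result aligned with the original rows, so the deviation is a plain column subtraction with no merge step.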

## Import Libraries

In [2]:
from discover.setup import auto_wire_container
from discover.infra.config.flow import FlowConfigReader
from discover.core.flow import PhaseDef, DataPrepStageDef
from discover.flow.stage.enrich.base import DataEnrichment

# Wire container
container = auto_wire_container()

### Review Enrichment Pipeline

In [3]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.ENRICH_REVIEW
)
# Build and run the stage
stage = DataEnrichment.build(stage_config=stage_config, force=FORCE)
dataset = stage.run()



#                            Review Enrichment Stage                             #

____________________________________________________________________________
Review Enrichment Stage                 07:07:16    07:07:16    0.0 seconds 





### Review Enrichment Data

### **2. App-Level Enrichments**  
Aggregating data at the app level provides a broader perspective on app performance:  
- **Key Summaries**: Metrics such as the total number of reviews, median `vote_count` and `vote_sum`, `rating`, `perplexity`, and sentiment distribution summarize each app’s reception.  
- **Deviation Statistics**: Comparing app-level metrics against their category averages sheds light on how an app deviates from its peers, offering insights into competitive positioning and unique strengths or weaknesses.  
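As a rough sketch of the app-level aggregation and deviation step, the example below uses pandas named aggregation on toy data. The input column names and the `average_rating_deviation` column are assumptions for illustration, and the benchmark used here (the mean of app averages within a category) is one possible choice rather than the pipeline's confirmed definition.

```python
import pandas as pd

# Hypothetical review-level data; column names are assumptions.
reviews = pd.DataFrame(
    {
        "app_id": ["a1", "a1", "a2", "a3", "a3"],
        "category_id": ["c1", "c1", "c1", "c2", "c2"],
        "author": ["u1", "u2", "u1", "u3", "u3"],
        "rating": [5, 3, 2, 4, 4],
        "vote_sum": [2, 0, 1, 3, 1],
        "vote_count": [3, 1, 1, 4, 2],
    }
)

# One row per app, with counts, averages, and totals.
apps = reviews.groupby(["app_id", "category_id"], as_index=False).agg(
    review_count=("rating", "size"),
    author_count=("author", "nunique"),
    average_rating=("rating", "mean"),
    total_vote_sum=("vote_sum", "sum"),
    total_vote_count=("vote_count", "sum"),
)

# Deviation of each app's average rating from its category benchmark,
# taken here as the mean of app-level averages in the same category.
category_benchmark = apps.groupby("category_id")["average_rating"].transform("mean")
apps["average_rating_deviation"] = apps["average_rating"] - category_benchmark

print(apps)
```

A positive deviation marks an app rated above its category peers, and a negative one marks an app rated below them.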


In [4]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.ENRICH_APP
)
# Build and run the stage
stage = DataEnrichment.build(stage_config=stage_config, force=FORCE)
dataset = stage.run()



#                              App Enrichment Stage                              #

____________________________________________________________________________
App Enrichment Stage                    07:07:16    07:07:16    0.0 seconds 





In [6]:
df = dataset.to_pandas()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2325 entries, 0 to 2324
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   app_id                     2325 non-null   object        
 1   app_name                   2325 non-null   object        
 2   category_id                2325 non-null   object        
 3   review_count               2325 non-null   int64         
 4   author_count               2325 non-null   int64         
 5   average_rating             2325 non-null   float64       
 6   average_review_length      2325 non-null   float64       
 7   average_review_age         2325 non-null   float64       
 8   total_vote_sum             2325 non-null   int64         
 9   total_vote_count           2325 non-null   int64         
 10  first_review_date          2325 non-null   datetime64[us]
 11  avg_review_date            2325 non-null   float64       
 12  last_r

### **3. Category-Level Enrichments**  
Zooming out further, category-level summaries offer a macro view of app trends within specific domains:  
- **Statistical Summaries**: Similar to the app level, category-level features include the total number of reviews, median `vote_count` and `vote_sum`, `rating`, `perplexity`, `review_age`, `review_length`, and sentiment distribution.  
- **Contextual Insights**: These summaries provide benchmarks for evaluating app performance within its category, helping to contextualize deviations and patterns observed at the app and review levels.  
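The category-level summaries, including a sentiment distribution, might be computed along the following lines. This is an illustrative pandas sketch on toy data (the pipeline output suggests this stage actually runs on Spark), and the column names and the `categories` frame are hypothetical.

```python
import pandas as pd

# Hypothetical review-level data; column names are assumptions.
reviews = pd.DataFrame(
    {
        "category_id": ["c1", "c1", "c1", "c2", "c2"],
        "rating": [5, 3, 1, 4, 2],
        "review_length": [10, 20, 30, 40, 60],
        "sentiment": ["Positive", "Neutral", "Negative", "Positive", "Positive"],
    }
)

# Statistical summaries per category.
summary = reviews.groupby("category_id").agg(
    review_count=("rating", "size"),
    median_rating=("rating", "median"),
    median_review_length=("review_length", "median"),
)

# Sentiment distribution: the share of each sentiment label within a category.
sentiment_share = pd.crosstab(
    reviews["category_id"], reviews["sentiment"], normalize="index"
)

# One row per category: summaries plus sentiment shares.
categories = summary.join(sentiment_share)
print(categories)
```

Each row of `categories` then acts as a benchmark against which app-level and review-level values in the same category can be compared.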


#### Category Enrichment Pipeline

In [5]:
# Obtain the configuration
reader = FlowConfigReader()
stage_config = reader.get_stage_config(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.ENRICH_CATEGORY
)
# Build and run the stage
stage = DataEnrichment.build(stage_config=stage_config, force=FORCE)
dataset = stage.run()



#                           Category Enrichment Stage                            #





Task                                    Start       End         Runtime     
----------------------------------------------------------------------------
CategoryAggregationTask                 07:07:27    07:07:28    0.88 seconds



____________________________________________________________________________
Category Enrichment Stage               07:07:16    07:07:35    18.56 seconds





                                                                                

---

## **Qualitative Enrichment**  

## Enrichment Stage Wrap-Up
The enrichment stage added new features to the dataset, including review metadata (length, age, and temporal attributes), sentiment classifications, text quality scores, and app- and category-level aggregations. In the upcoming EDA phase, we will use these enriched attributes to uncover patterns, relationships, and trends in user behavior and app performance.