In [1]:
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings

warnings.filterwarnings("ignore")
FORCE = True

# AppVoCAI Dataset Preprocessing
---
In this section, we unbox and preprocess the AppVoCAI dataset, survey its key characteristics, and profile its structure, format, and data types, in advance of downstream data quality assessment, cleaning, analysis and feature engineering. 

In [2]:
from genailab.setup import auto_wire_container
from genailab.asset.dataset.config import DatasetConfig
from genailab.core.dtypes import DFType
from genailab.core.flow import PhaseDef, StageDef
from genailab.infra.utils.file.fileset import FileFormat
from genailab.flow.dataprep.preprocess.builder import PreprocessStageBuilder

container = auto_wire_container()

## Data Preprocessing Pipeline
---

The `PreprocessingStage` ensures that the data are in a structure and format suitable for downstream processing and analysis. This involves verifying UTF-8 encoding, casting data to appropriate types, converting datetimes to millisecond precision (for Spark) and removing any extraneous newlines from the review text. 

Next, we'll define the configurations for the raw and preprocessed datasets, construct the preprocessing stage pipeline, then run it.

In [3]:
# Raw Dataset Configuration
source = DatasetConfig(phase=PhaseDef.DATAPREP, stage=StageDef.RAW, name="review", file_format=FileFormat.PARQUET, dftype=DFType.PANDAS)
# Target Dataset Configuration
target = DatasetConfig(phase=PhaseDef.DATAPREP, stage=StageDef.PREPROCESS, name="review", file_format=FileFormat.PARQUET, dftype=DFType.PANDAS)

In [4]:

# Create the preprocess stage builder
builder = PreprocessStageBuilder()
# Add preprocess tasks to the builder and return the stage object.
stage = (builder
    .encoding()  # Verifies UTF-8 Encoding
    .datatypes()  # Casts appropriate datatypes, i.e. category, int, float, and datetime variables.
    .newlines()  # Removes newlines from text
    .datetime()  # Converts datetime to millisecond precision (for PySpark)
    .build(source_config=source, target_config=target)  # Constructs the pipeline and returns the stage
)
# Run the stage pipeline
dataset = stage.run(force=FORCE)



#               Data Preprocessing Stage Thu, 30 Jan 2025 15:45:43               #


Task                                    Start       End         Runtime     
----------------------------------------------------------------------------
VerifyEncodingTask                      15:45:43    15:45:43    0.01 seconds
CastDataTypeTask                        15:45:43    15:45:43    0.01 seconds
RemoveNewlinesTask                      15:45:43    15:45:43    0.0 seconds 
ConvertDateTimetoMS                     15:45:43    15:45:43    0.0 seconds 
____________________________________________________________________________
Data Preprocessing Stage                15:45:43    15:45:43    0.24 seconds
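
The internals of the `genailab` tasks are not shown in this notebook. As a rough, library-agnostic sketch of what three of the steps (encoding verification, newline removal, and millisecond truncation) might do to a single record, consider the following. The function names are hypothetical and are not part of `genailab`.

```python
import re
from datetime import datetime


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def remove_newlines(text: str) -> str:
    """Replace each run of CR/LF characters with one space and trim the ends."""
    return re.sub(r"[\r\n]+", " ", text).strip()


def truncate_to_ms(ts: datetime) -> datetime:
    """Drop sub-millisecond precision from a timestamp."""
    return ts.replace(microsecond=(ts.microsecond // 1000) * 1000)


# Hypothetical review record, for illustration only.
raw_content = "Great app\r\n\nbut crashes\n".encode("utf-8")
if is_valid_utf8(raw_content):
    print(remove_newlines(raw_content.decode("utf-8")))
print(truncate_to_ms(datetime(2023, 8, 30, 12, 49, 2, 123456)))
```

In the actual pipeline these operations run over whole DataFrame columns rather than one record at a time; the sketch only illustrates the per-value logic.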





## AppVoCAI Dataset Structure
Let's examine the dataset structure, data types, completeness, uniqueness, and size.

In [5]:
dataset.profile

| | Column | DataType | Complete | Null | Completeness | Unique | Duplicate | Uniqueness | Size (Bytes) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | id | string[python] | 5904 | 0 | 1.0 | 5904 | 0 | 1.0 | 396128 |
| 1 | app_id | string[python] | 5904 | 0 | 1.0 | 2157 | 3747 | 0.365346 | 392750 |
| 2 | app_name | string[python] | 5904 | 0 | 1.0 | 2157 | 3747 | 0.365346 | 478086 |
| 3 | category_id | category | 5904 | 0 | 1.0 | 14 | 5890 | 0.002371 | 7314 |
| 4 | author | string[python] | 5904 | 0 | 1.0 | 5902 | 2 | 0.999661 | 454608 |
| 5 | rating | Int16 | 5904 | 0 | 1.0 | 5 | 5899 | 0.000847 | 17712 |
| 6 | content | string[python] | 5904 | 0 | 1.0 | 5732 | 172 | 0.970867 | 2910420 |
| 7 | vote_sum | Int64 | 5904 | 0 | 1.0 | 14 | 5890 | 0.002371 | 53136 |
| 8 | vote_count | Int64 | 5904 | 0 | 1.0 | 18 | 5886 | 0.003049 | 53136 |
| 9 | date | datetime64[ms] | 5904 | 0 | 1.0 | 5904 | 0 | 1.0 | 47232 |
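
The profile columns appear to follow simple ratios: `Completeness` is the share of non-null values, `Uniqueness` is the number of distinct values divided by the row count, and `Duplicate` is the row count minus the distinct count. These definitions are inferred from the numbers in the table, not taken from the `genailab` source, and `profile_column` below is a hypothetical helper written only to illustrate them.

```python
def profile_column(values):
    """Compute profile-style metrics for one column (None marks a missing value)."""
    count = len(values)
    if count == 0:
        raise ValueError("cannot profile an empty column")
    complete = sum(v is not None for v in values)
    unique = len({v for v in values if v is not None})
    return {
        "Complete": complete,
        "Null": count - complete,
        "Completeness": round(complete / count, 6),
        "Unique": unique,
        # Assumption: duplicates are counted as rows minus distinct values.
        "Duplicate": count - unique,
        "Uniqueness": round(unique / count, 6),
    }


# Toy column with five ratings and one missing value.
print(profile_column([5, 4, 5, 3, 5, None]))

# Cross-check against the app_id row: 2,157 distinct values in 5,904 rows.
print(round(2157 / 5904, 6), 5904 - 2157)
```

Under these assumed definitions, the `app_id` row reproduces exactly: a uniqueness of 0.365346 and 3,747 duplicates.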


The dataset comprises 5,904 records, each fully complete with no missing values in any of the 10 profiled columns. Key interpretations include:

- **Data Types**: The dataset uses `string` for identifiers and text fields, `category` for `category_id`, nullable `Int16` for `rating`, nullable `Int64` for `vote_sum` and `vote_count`, and `datetime64[ms]` for `date`. Compact types on low-cardinality columns keep storage small, and millisecond precision keeps dates compatible with Spark.

- **Review IDs**: All 5,904 `id` values are unique, so no review appears twice under the same identifier. Duplicate reviews can still exist under distinct IDs, so deduplication on other fields, such as `content`, remains worth checking during data quality analysis.

- **Categorical Insights**: The dataset features 14 unique `category_id` values, reflecting the breadth of application categories, and 5 unique `rating` values, consistent with a standard 5-point rating scale. Both are central to categorical analyses and to aggregating review sentiment.

- **Duplicate Content**: The `content` column is highly unique overall, with 5,732 distinct texts among 5,904 entries, which leaves 172 repeats (2.9%). These could be common phrases or templated responses in short reviews, and may require special handling during text analysis to separate genuine user feedback from repetitive content.

- **High Uniqueness in Key Columns**: `id` and `date` are fully unique, and `author` (99.97%) and `content` (97.1%) are nearly so, which supports review-level analysis and time-series studies.

- **Memory Efficiency**: Compact data types keep the footprint of the categorical and numerical columns small; `category_id`, for example, occupies only 7,314 bytes. The text-heavy `content` field dominates memory usage at about 2.9 MB, roughly 60% of the total, but it is essential for in-depth textual analysis.
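
As a worked check on the memory figures, the per-column sizes reported in the profile can be summed and compared. The values below are copied from the `Size (Bytes)` column; nothing else is assumed.

```python
# Size (Bytes) per column, copied from the profile above.
size_bytes = {
    "id": 396128,
    "app_id": 392750,
    "app_name": 478086,
    "category_id": 7314,
    "author": 454608,
    "rating": 17712,
    "content": 2910420,
    "vote_sum": 53136,
    "vote_count": 53136,
    "date": 47232,
}

rows = 5904
total = sum(size_bytes.values())
total_mib = total / 2**20  # bytes to mebibytes
content_share = size_bytes["content"] / total

print(f"total: {total:,} bytes ({total_mib:.2f} MiB)")
print(f"content share: {content_share:.1%}")

# Average bytes per value for the fixed-width columns.
for column in ("rating", "vote_sum", "vote_count", "date"):
    print(column, size_bytes[column] / rows)
```

The total comes to 4,810,522 bytes, or about 4.59 MiB, which agrees with the memory size reported in the dataset summary. The fixed-width columns work out to 3 bytes per `rating`, 9 bytes per vote field, and 8 bytes per `date`, which is consistent with pandas nullable integers storing one mask byte alongside each value.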

Overall, the dataset is ready for a more robust quality analysis, with attention to duplication, relevance, validity, and privacy concerns. 
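
One way the duplication concern might be explored is to count review texts that recur after light normalization. The sketch below uses only the standard library; `duplicate_content` is a hypothetical helper, and the normalization (case folding and whitespace collapsing) is an assumption that is stricter than the exact-match counting the profile presumably uses.

```python
from collections import Counter


def duplicate_content(texts):
    """Return normalized texts that occur more than once, with their counts."""
    # Case-fold and collapse whitespace so trivial variants compare equal.
    normalized = [" ".join(t.casefold().split()) for t in texts]
    counts = Counter(normalized)
    return {text: n for text, n in counts.items() if n > 1}


# Toy reviews, for illustration only.
reviews = ["Great app", "great  app", "Love it", "Crashes on launch", "love it"]
dupes = duplicate_content(reviews)
redundant_rows = sum(n - 1 for n in dupes.values())

print(dupes)
print(redundant_rows)
```

Because normalization merges near-identical texts, a count produced this way would be at least as high as an exact-match count on the same data.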

---

## AppVoCAI Dataset Summary
Here, we summarize the dataset contents in terms of reviews, reviewer engagement and influence, and app and categorical breadth.

In [6]:
dataset.summary



                            AppVoCAI Dataset Summary                            
                             Data Preparation Phase                             
                            Data Preprocessing Stage                            
                       Number of Reviews | 5,904
                     Number of Reviewers | 5,902
              Number of Repeat Reviewers | 2 (0.0%)
         Number of Influential Reviewers | 333 (5.6%)
                          Number of Apps | 2,157
                 Average Reviews per App | 2.7
                    Number of Categories | 14
                                Features | 11
                       Min Review Length | 1
                       Max Review Length | 1,008
                   Average Review Length | 32.41
                        Memory Size (Mb) | 4.59
                    Date of First Review | 2021-01-01 02:20:30
                     Date of Last Review | 2023-08-30 12:49:02


### Key Observations

- **Volume and Scale**: The dataset processed here contains 5,904 reviews from 5,902 reviewers, so nearly every review comes from a distinct user.

- **Repeat Reviewers**: Only 2 reviewers (about 0.03%, displayed as 0.0%) have submitted more than one review. With so few repeat contributors, the data offer little basis for longitudinal analysis of individual users.

- **Influential Reviewers**: With 333 reviewers (5.6%) deemed influential (based on vote sum and counts), their contributions could play a pivotal role in shaping app perceptions and rankings.

- **App Diversity**: The dataset covers 2,157 unique apps across 14 categories. This diversity supports category-specific analyses and the identification of trends within app domains.

- **Review Distribution**: On average, each app has about 2.7 reviews. At this density, app-level performance and sentiment estimates rest on very few reviews, so aggregate analysis is better supported at the category level.

- **Temporal Range**: The reviews span about two years and eight months, from January 2021 to August 2023. This window allows analysis of how user feedback changes over time, though not of long-run historical trends.

- **Memory Usage**: The dataset has a memory footprint of approximately 4.59 MB, so it fits comfortably in memory on a single machine. Efficient handling and processing strategies become more important as the volume grows.

- **Feature Richness**: With 11 app, reviewer, and review features reported in the summary, the dataset enables both qualitative (review text) and quantitative (rating and vote metrics) analysis of app performance and user sentiment.
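
The ratios quoted in these observations can be reproduced from the counts and dates in the summary output. The figures below are copied from that output; only the arithmetic is new.

```python
from datetime import datetime

# Counts copied from the dataset summary.
reviews = 5904
reviewers = 5902
repeat_reviewers = 2
influential_reviewers = 333
apps = 2157

avg_reviews_per_app = reviews / apps
repeat_pct = 100 * repeat_reviewers / reviewers
influential_pct = 100 * influential_reviewers / reviewers

# First and last review timestamps from the summary.
first_review = datetime(2021, 1, 1, 2, 20, 30)
last_review = datetime(2023, 8, 30, 12, 49, 2)
span = last_review - first_review

print(f"average reviews per app: {avg_reviews_per_app:.1f}")
print(f"repeat reviewers: {repeat_pct:.2f}%")
print(f"influential reviewers: {influential_pct:.1f}%")
print(f"temporal span: {span.days} days (~{span.days / 365.25:.1f} years)")
```

The repeat-reviewer share is about 0.03%, which is why the summary rounds it to 0.0%.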

---

### Observations
This initial data profiling reveals a complete, well-typed, and diverse dataset with potential for evaluating the performance of LLMs and SLMs, particularly in the context of fine-tuning foundation models. Transformer models require large volumes of data, so the number of reviews (5,904 here), the temporal span (January 2021 to August 2023), and the categorical coverage (14 categories) all bear on how well the data can support model training and evaluation. The data quality analysis to follow will assess the dataset's validity, relevance, completeness, and uniqueness, providing a more nuanced understanding of its suitability for training and evaluating LLMs and SLMs on specific tasks, such as Aspect-Based Sentiment Analysis (ABSA).