In [1]:
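import os
import warnings

# When the notebook runs from inside the Jupyter Book directory, move to the project root.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath("../.."))

warnings.filterwarnings("ignore")  # Keep notebook output free of warnings
FORCE = False  # Passed to stage.run(force=...) below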

# AppVoCAI Dataset Preprocessing
---
In this section, we unbox and preprocess the AppVoCAI dataset, survey its key characteristics, and profile its structure, format, and data types ahead of downstream data quality assessment, cleaning, analysis, and feature engineering. The raw dataset was loaded into the workspace during project initialization.

In [2]:
from genailab.infra.config.flow import FlowConfigReader
from genailab.setup import auto_wire_container
from genailab.infra.utils.file.fileset import FileFormat
from genailab.core.flow import PhaseDef, StageDef
from genailab.asset.dataset.config import DatasetConfig
from genailab.flow.dataprep.preprocess.builder import PreprocessStageBuilder

# Wire container
container = auto_wire_container()

## Data Preprocessing Pipeline
---

The `PreprocessingStage` ensures that the data are in a structure and format suitable for downstream processing and analysis. This involves verifying UTF-8 encoding, casting columns to appropriate data types, converting datetimes to millisecond precision (for Spark), and removing extraneous newlines from the review text.

The next code cell creates and runs the PreprocessingStage pipeline.

In [4]:
# Create the preprocess stage builder
builder = PreprocessStageBuilder()
# Add preprocess tasks to the builder and return the stage object.
stage = (
    builder.encoding()  # Verifies UTF-8 encoding
    .datatypes()  # Casts appropriate datatypes, i.e. category, int, float, and datetime variables.
    .newlines()  # Removes newlines from text
    .datetime()  # Converts datetime to millisecond precision (for pyspark)
    .build()  # Constructs the pipeline and returns the stage
)
# Run the stage pipeline
dataset = stage.run(force=FORCE)

[01/19/2025 02:49:23 PM] [ERROR] [genailab.infra.persist.object.dao.ShelveDAO] [read] : Asset dataset_dataprep_preprocess_review_v0.1.1 was not found.




#                 Data Preprocessing Stage Sun, 19 Jan 2025 14:49:23                 #


Task                                    Start       End         Runtime     
----------------------------------------------------------------------------
VerifyEncodingTask                      14:49:23    14:49:23    0.01 seconds
CastDataTypeTask                        14:49:23    14:49:23    0.01 seconds
RemoveNewlinesTask                      14:49:23    14:49:23    0.0 seconds 
____________________________________________________________________________
Data Preprocessing Stage                    14:49:23    14:49:23    0.2 seconds 
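
The preprocessing steps described above can be illustrated on a single record with a minimal, dependency-free sketch. The function names below (`verify_utf8`, `remove_newlines`, `to_millisecond_precision`, `preprocess_record`) are hypothetical and are not part of `genailab`; the actual tasks may be implemented quite differently.

```python
import re
from datetime import datetime


def verify_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def remove_newlines(text: str) -> str:
    """Replace each run of line breaks with a single space and trim the ends."""
    return re.sub(r"[\r\n]+", " ", text).strip()


def to_millisecond_precision(ts: datetime) -> datetime:
    """Truncate (not round) a timestamp to millisecond precision."""
    return ts.replace(microsecond=(ts.microsecond // 1000) * 1000)


def preprocess_record(record: dict) -> dict:
    """Apply the casting, newline, and datetime steps to one raw review record."""
    return {
        "id": str(record["id"]),
        "rating": int(record["rating"]),
        "content": remove_newlines(record["content"]),
        "date": to_millisecond_precision(datetime.fromisoformat(record["date"])),
    }


# Illustrative record (made up for this sketch, not taken from the dataset).
raw = {
    "id": 101,
    "rating": "5",
    "content": "Great app!\n\nWould recommend.\r\n",
    "date": "2023-08-29T13:25:46.123456",
}
clean = preprocess_record(raw)
print(clean["content"])           # Great app! Would recommend.
print(clean["date"].isoformat())  # 2023-08-29T13:25:46.123000
```

Truncating rather than rounding is a choice made for this sketch; whether the stage truncates or rounds sub-millisecond values is not stated in this section.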





## AppVoCAI Dataset Structure
Let's examine the dataset structure, data types, completeness, uniqueness, and size.

In [5]:
dataset.info

| Column | DataType | Complete | Null | Completeness | Unique | Duplicate | Uniqueness | Size (Bytes) |
|---|---|---|---|---|---|---|---|---|
| id | string[python] | 8670 | 0 | 1.0 | 8670 | 0 | 1.0 | 581479 |
| app_id | string[python] | 8670 | 0 | 1.0 | 2636 | 6034 | 0.304037 | 576305 |
| app_name | string[python] | 8670 | 0 | 1.0 | 2634 | 6036 | 0.303806 | 703398 |
| category_id | category | 8670 | 0 | 1.0 | 14 | 8656 | 0.001615 | 10080 |
| author | string[python] | 8670 | 0 | 1.0 | 8668 | 2 | 0.999769 | 667590 |
| rating | int16 | 8670 | 0 | 1.0 | 5 | 8665 | 0.000577 | 17340 |
| content | string[python] | 8670 | 0 | 1.0 | 8368 | 302 | 0.965167 | 4192835 |
| vote_sum | int64 | 8670 | 0 | 1.0 | 17 | 8653 | 0.001961 | 69360 |
| vote_count | int64 | 8670 | 0 | 1.0 | 19 | 8651 | 0.002191 | 69360 |
| date | datetime64[ms] | 8670 | 0 | 1.0 | 8670 | 0 | 1.0 | 69360 |
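
The figures in the profile are consistent with simple ratios: Completeness equals Complete divided by the row count, Uniqueness equals Unique divided by the row count, and Duplicate equals the row count minus Unique (for `app_id`, 2636 / 8670 ≈ 0.304037 and 8670 − 2636 = 6034). The sketch below reproduces that arithmetic for a single column. `profile_column` is a hypothetical helper, not the `dataset.info` implementation, and its treatment of nulls is an assumption because the profiled data contain none.

```python
def profile_column(values: list) -> dict:
    """Compute completeness and uniqueness metrics for one non-empty column."""
    n = len(values)
    non_null = [v for v in values if v is not None]
    unique = len(set(non_null))
    return {
        "Complete": len(non_null),
        "Null": n - len(non_null),
        "Completeness": round(len(non_null) / n, 6),
        "Unique": unique,
        "Duplicate": n - unique,  # rows beyond the first occurrence of each value
        "Uniqueness": round(unique / n, 6),
    }


# Toy column: three distinct app ids spread over six rows.
app_ids = ["a1", "a1", "b2", "c3", "c3", "c3"]
profile = profile_column(app_ids)
print(profile["Unique"], profile["Duplicate"], profile["Uniqueness"])  # 3 3 0.5
```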


The dataset comprises 8,670 fully complete records, with no missing values, and a mix of data types suited to each column. Key interpretations include:

- **Data Types**: The dataset uses strings for identifiers and text fields, a `category` type for `category_id`, `int16` for `rating`, `int64` for `vote_sum` and `vote_count`, and `datetime64[ms]` for `date`. This combination keeps storage compact while preserving the precision each column needs.

- **Duplicate Review IDs**: All 8,670 `id` values are unique, so the profile shows no duplicate review IDs. Deduplication should therefore focus on repeated review content rather than repeated identifiers.

- **Categorical Insights**: The dataset features 14 unique `category_id` values, reflecting the breadth of application categories, and 5 unique `rating` values, consistent with a standard 5-point rating scale. These are critical for categorical analyses and aggregating review sentiment.

- **Duplicate Content**: The `content` column is highly unique overall (uniqueness of about 0.965), but 302 of the 8,670 entries repeat text already present elsewhere in the column. This could indicate commonly used phrases or templated responses in short reviews, which may require special handling during text analysis to differentiate between genuine user feedback and repetitive content.

- **High Uniqueness in Key Columns**: The `id` and `date` columns are fully unique and `content` is nearly so, which supports analysis of individual reviews and time-series studies.

- **Memory Efficiency**: Compact data types, particularly for the categorical and numerical fields, help manage the dataset's memory footprint. The text-heavy `content` field dominates memory usage at roughly 4.2 million bytes, but it is critical for in-depth textual analysis.
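
One way to surface repeated review text for inspection is to count reviews after light normalization. This is a minimal sketch: the `normalize` and `duplicate_contents` helpers are hypothetical, and the case and whitespace folding is an assumption made here. The Uniqueness figure reported above most likely reflects exact string matches.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase the text and collapse all whitespace runs to single spaces."""
    return " ".join(text.lower().split())


def duplicate_contents(reviews: list[str]) -> dict[str, int]:
    """Map each normalized review text that occurs more than once to its count."""
    counts = Counter(normalize(review) for review in reviews)
    return {text: n for text, n in counts.items() if n > 1}


# Toy reviews (made up for this sketch, not taken from the dataset).
reviews = ["Great app", "great  app", "Love it!", "Crashes on launch", "Love it!"]
dupes = duplicate_contents(reviews)
print(dupes)  # {'great app': 2, 'love it!': 2}
```

Short, generic phrases tend to dominate such a list, so reviewing the most frequent entries by hand is usually more informative than dropping them automatically.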

Overall, the dataset is ready for a more robust quality analysis, with attention to duplication, relevance, validity, and privacy concerns. 

---

## AppVoCAI Dataset Summary
Here, we summarize the dataset in terms of reviews, reviewers, reviewer engagement and influence, apps, and categorical breadth.

In [6]:
dataset.summary



                            AppVoCAI Dataset Summary                            
                       Number of Reviews | 8,670
                     Number of Reviewers | 8,668
              Number of Repeat Reviewers | 2 (0.0%)
         Number of Influential Reviewers | 575 (6.6%)
                          Number of Apps | 2,636
                    Number of Categories | 14
                 Average Reviews per App | 3.3
                                Features | 11
                        Memory Size (Mb) | 6.71
                    Date of First Review | 2020-01-01 11:06:50
                     Date of Last Review | 2023-08-29 13:25:46
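
Several of the summary figures follow from simple counts: 8,670 reviews over 2,636 apps gives about 3.3 reviews per app, and 575 of 8,668 reviewers is about 6.6%. The sketch below derives comparable statistics from a list of review records. `summarize` is a hypothetical helper rather than the `dataset.summary` implementation, and the influential-reviewer count is omitted because its exact definition is not given here.

```python
from collections import Counter
from datetime import datetime


def summarize(records: list[dict]) -> dict:
    """Derive review, reviewer, app, and date-range statistics from records."""
    reviews_per_author = Counter(r["author"] for r in records)
    apps = {r["app_id"] for r in records}
    dates = [r["date"] for r in records]
    n_reviewers = len(reviews_per_author)
    n_repeat = sum(1 for count in reviews_per_author.values() if count > 1)
    return {
        "reviews": len(records),
        "reviewers": n_reviewers,
        "repeat_reviewers": n_repeat,
        "repeat_pct": round(100 * n_repeat / n_reviewers, 1),
        "apps": len(apps),
        "avg_reviews_per_app": round(len(records) / len(apps), 1),
        "first_review": min(dates),
        "last_review": max(dates),
    }


# Toy records (made up for this sketch, not taken from the dataset).
records = [
    {"author": "ann", "app_id": "a1", "date": datetime(2020, 1, 1, 11, 6, 50)},
    {"author": "ann", "app_id": "b2", "date": datetime(2021, 5, 4, 9, 0, 0)},
    {"author": "bob", "app_id": "a1", "date": datetime(2022, 7, 9, 18, 30, 0)},
    {"author": "cy", "app_id": "a1", "date": datetime(2023, 8, 29, 13, 25, 46)},
]
summary = summarize(records)
print(summary["reviewers"], summary["repeat_reviewers"], summary["avg_reviews_per_app"])
```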


### Key Observations

- **Volume and Scale**: The dataset contains 8,670 reviews from 8,668 reviewers, covering a wide range of applications.

- **Repeat Reviewers**: Only 2 reviewers (0.0% after rounding) have submitted more than one review, so nearly every review comes from a distinct author. The dataset therefore offers little basis for reviewer-level longitudinal analysis.

- **Influential Reviewers**: With 575 reviewers (6.6%) deemed influential (based on vote sum and vote count), their contributions could play a pivotal role in shaping app perceptions and rankings.

- **App Diversity**: The dataset covers 2,636 unique apps across 14 categories, indicating a wide-ranging scope of applications. This diversity is beneficial for conducting category-specific analyses and identifying trends within various app domains.

- **Review Distribution**: On average, each app has approximately 3.3 reviews. With so few reviews per app, app-level sentiment estimates will often rest on a handful of observations, and category-level aggregation is better supported.

- **Temporal Range**: The dataset spans roughly three years and eight months, from January 2020 to August 2023. This timeframe allows for historical analysis of how user feedback evolves over time.

- **Memory Usage**: The dataset has a memory footprint of approximately 6.71 MB, small enough to process in memory on a single machine.

- **Feature Richness**: With 11 distinct app, reviewer, and review features, the dataset enables both qualitative (review) and quantitative (rating, review_count, vote metrics) analysis of app performance and user sentiment.

---

### Limitations of the Dataset

While the AppVoCAI dataset provides a wealth of user review and satisfaction information, there are notable limitations regarding missing data on app performance and financial metrics. Specifically, data on **downloads**, **price**, and **sales figures** are not included. These omissions limit the ability to perform comprehensive analyses that would require correlating user reviews with app popularity and financial performance. Future datasets that incorporate these factors would enable a more holistic view of the app ecosystem, enhancing the ability to draw connections between user sentiment, app success, and market performance.

In summary, the AppVoCAI dataset offers a rich and expansive resource for analyzing user reviews across a wide variety of applications and categories, with strong potential for deriving actionable insights from its longitudinal and categorical data.

---

In the next section, we will analyze the validity, relevance, uniqueness, privacy, and completeness of the dataset, providing a robust, multi-dimensional data quality analysis.