In [1]:
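import os
import warnings

# When run from inside the Jupyter Book tree, move two directory levels up
# so that relative paths resolve from there.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath("../.."))

# Suppress library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

# Passed to `stage.run(force=...)` below.
FORCE = False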

# AppVoCAI Dataset Ingestion
In this section, we ingest the AppVoCAI dataset into our workspace, survey its key characteristics, and profile its structure, format, and data types. This prepares the ground for downstream data quality assessment, cleaning, enrichment, and analysis.

In [2]:
from discover.setup import auto_wire_container
from discover.flow.dataprep.ingest.builder import IngestStageBuilder

# Wire container
container = auto_wire_container()

## Ingest Data

The `IngestionStage` is responsible for loading raw data while ensuring it is properly prepared for downstream processing and analysis. This stage includes verifying UTF-8 encoding, casting data to appropriate types, and removing any extraneous newlines from the review text.

The process begins by instantiating the `IngestStageBuilder`, which constructs the ingest pipeline. A chain of method calls adds the tasks to the pipeline: encoding verification, datatype casting, and newline removal. Calling `build()` constructs the pipeline, and the resulting `IngestionStage` object is then available through the builder's `stage` property.

Once the pipeline is fully constructed, the `run()` method is invoked. This executes the ingestion workflow, applying all defined tasks and transformations. Upon successful execution, the method returns the ingested dataset, ready for further data quality analysis and processing.

In [3]:
# Create the ingest stage builder
builder = IngestStageBuilder()
# Add ingest tasks to the builder and return the stage object.
stage = (
    builder.source_filepath(
        from_config=True
    )  # The source filepath is in the stage configuration YAML
    .encoding()  # Verifies UTF-8 Encoding
    .datatypes()  # Casts appropriate datatypes, i.e. category, int, float, and datetime variables.
    .newlines()  # Removes newlines from text
    .build()  # Constructs the pipeline
    .stage  # Return the stage property
)
# Run the stage pipeline
dataset = stage.run(force=FORCE)



#                 Data Ingestion Stage Thu, 16 Jan 2025 23:40:06                 #

____________________________________________________________________________
Data Ingestion Stage                    23:40:06    23:42:13    2.0 minutes and 7.64 seconds
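
The internals of the `discover` ingest tasks are not shown in this notebook. As a rough illustration only, the three tasks (encoding verification, datatype casting, newline removal) could be approximated in plain pandas as sketched below. The helper names `verify_utf8`, `remove_newlines`, and `ingest`, as well as the toy rows, are hypothetical. The column names and target dtypes are taken from the structure table later in this section.

```python
import pandas as pd


def verify_utf8(text: str) -> str:
    """Round-trip through UTF-8, replacing characters that cannot be encoded."""
    return text.encode("utf-8", errors="replace").decode("utf-8")


def remove_newlines(text: str) -> str:
    """Collapse every whitespace run, including line breaks, into one space."""
    return " ".join(text.split())


def ingest(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the ingest stage; not the discover implementation."""
    out = df.copy()
    # Review text: ensure it is valid UTF-8, then strip embedded newlines.
    out["content"] = out["content"].map(verify_utf8).map(remove_newlines)
    # Cast columns to the dtypes reported by `dataset.info`.
    out["category_id"] = out["category_id"].astype("category")
    out["rating"] = out["rating"].astype("int16")
    out["vote_sum"] = out["vote_sum"].astype("int64")
    out["vote_count"] = out["vote_count"].astype("int64")
    out["date"] = pd.to_datetime(out["date"])
    return out


# Toy rows, invented for illustration only.
raw = pd.DataFrame(
    {
        "content": ["Great app!\nLove it.", "Crashes\r\non launch"],
        "category_id": ["c1", "c2"],
        "rating": [5, 1],
        "vote_sum": [3, 0],
        "vote_count": [4, 0],
        "date": ["2023-09-01 10:00:00", "2023-09-02 11:30:00"],
    }
)
clean = ingest(raw)
print(clean.dtypes)
print(clean["content"].tolist())
```

Note that `remove_newlines` in this sketch also collapses tabs and repeated spaces. Whether the actual `newlines()` task does the same is not stated in the source.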





## AppVoCAI Dataset Structure
Let's examine the dataset structure, data types, completeness, uniqueness, and size.

In [4]:
dataset.info

| Column | DataType | Complete | Null | Completeness | Unique | Duplicate | Uniqueness | Size (Bytes) |
|---|---|---|---|---|---|---|---|---|
| id | string[python] | 22166591 | 0 | 1.0 | 22166474 | 117 | 0.9999947 | 1480962173 |
| app_id | string[python] | 22166591 | 0 | 1.0 | 36377 | 22130214 | 0.001641073 | 1468716232 |
| app_name | string[python] | 22166591 | 0 | 1.0 | 36363 | 22130228 | 0.001640442 | 1871227874 |
| category_id | category | 22166591 | 0 | 1.0 | 14 | 22166577 | 6.315811e-07 | 22168001 |
| author | string[python] | 22166591 | 0 | 1.0 | 15710479 | 6456112 | 0.7087458 | 1706827507 |
| rating | int16 | 22166591 | 0 | 1.0 | 5 | 22166586 | 2.255647e-07 | 44333182 |
| content | string[python] | 22166591 | 0 | 1.0 | 19078767 | 3087824 | 0.8606992 | 8070646310 |
| vote_sum | int64 | 22166591 | 0 | 1.0 | 504 | 22166087 | 2.273692e-05 | 177332728 |
| vote_count | int64 | 22166591 | 0 | 1.0 | 678 | 22165913 | 3.058657e-05 | 177332728 |
| date | datetime64[ns] | 22166591 | 0 | 1.0 | 20751108 | 1415483 | 0.9361434 | 177332728 |


The dataset comprises 22,166,591 records with no missing values in any column. Key observations include:

- **Data Types**: The dataset uses strings for identifiers and text fields, a `category` type for `category_id`, `int16` for `rating`, `int64` for `vote_sum` and `vote_count`, and `datetime64[ns]` for `date`. Matching each column to an appropriately sized type keeps storage compact without losing precision.

- **Duplicate Review IDs**: There are 117 duplicate `id` values, indicating potential duplicate reviews. Deduplication is needed so that repeated entries do not bias the analysis.

- **Categorical Insights**: The `category_id` column has 14 unique values, reflecting the breadth of application categories, and `rating` has 5 unique values, consistent with a standard 5-point rating scale. Both are central to categorical analyses and to aggregating review sentiment.

- **Duplicate Content**: The `content` column is about 86% unique, which still leaves 3,087,824 duplicate entries (roughly 14% of reviews). These may reflect commonly used phrases or templated responses in short reviews, and may require special handling during text analysis to separate genuine user feedback from repetitive content.

- **High Uniqueness in Key Columns**: The `id` (nearly 100%), `date` (93.6%), and `content` (86.1%) columns are highly unique, which supports review-level analysis and time-series studies.

- **Memory Efficiency**: Compact data types help contain the memory footprint: `category_id` occupies about 22 MB as a `category`, whereas each string column exceeds 1.4 GB. The `content` field dominates, at about 8.07 billion bytes (more than half of the total), but it is essential for textual analysis.
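
The metrics in the table above follow simple definitions that can be checked against its values: `Duplicate` equals the row count minus `Unique` (for `id`, 22,166,591 − 22,166,474 = 117), and `Uniqueness` equals `Unique` divided by the row count. A minimal pandas sketch of such a profile is shown below. The `profile` function and the toy frame are hypothetical and are not the implementation behind `dataset.info`.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column completeness, uniqueness, and size for a non-empty frame."""
    n = len(df)
    rows = []
    for col in df.columns:
        complete = int(df[col].notna().sum())
        unique = int(df[col].nunique(dropna=True))
        rows.append(
            {
                "Column": col,
                "DataType": str(df[col].dtype),
                "Complete": complete,
                "Null": n - complete,
                "Completeness": complete / n,
                "Unique": unique,
                # Rows beyond the first occurrence of each distinct value.
                "Duplicate": n - unique,
                "Uniqueness": unique / n,
                # deep=True measures the actual string payloads, not just pointers.
                "Size (Bytes)": int(df[col].memory_usage(deep=True, index=False)),
            }
        )
    return pd.DataFrame(rows)


# Toy data, invented for illustration only.
toy = pd.DataFrame({"id": ["a", "b", "b", "c"], "rating": [5, 5, 1, 3]})
report = profile(toy)
print(report)
```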

Overall, the dataset is ready for a more robust quality analysis, with attention to duplication, relevance, validity, and privacy concerns. 
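
As a preview of the duplication checks, the sketch below shows one way to inspect repeated `id` values and repeated review text with pandas. The toy rows are invented, and keeping the first occurrence of each `id` is an assumption made for illustration, not necessarily the policy applied later in the pipeline.

```python
import pandas as pd

# Toy data, invented for illustration only.
reviews = pd.DataFrame(
    {
        "id": ["r1", "r2", "r2", "r3", "r4"],
        "content": ["Good", "Love it", "Love it", "Good", "Crashes a lot"],
    }
)

# Every row whose id occurs more than once (all occurrences, for inspection).
dup_id_rows = reviews[reviews["id"].duplicated(keep=False)]

# ASSUMPTION: keep the first occurrence of each id and drop the rest.
deduped = reviews.drop_duplicates(subset="id", keep="first")

# Review texts that appear more than once, with their frequencies.
repeated_content = reviews["content"].value_counts().loc[lambda s: s > 1]

print(dup_id_rows)
print(len(deduped))
print(repeated_content)
```

Duplicate `content` under distinct `id` values (such as the two "Good" reviews here) is not necessarily an error, which is why the sketch only counts those rows rather than dropping them.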

---

## AppVoCAI Dataset Summary
Here, we summarize the dataset in terms of review volume, reviewer engagement and influence, and app and category breadth.

In [5]:
dataset.summary



                            AppVoCAI Dataset Summary                            
                       Number of Reviews | 22,166,591
                     Number of Reviewers | 15,710,479
              Number of Repeat Reviewers | 3,604,683 (22.9%)
         Number of Influential Reviewers | 1,033,636 (6.6%)
                          Number of Apps | 36,377
                    Number of Categories | 14
                 Average Reviews per App | 609.4
                                Features | 11
                        Memory Size (Mb) | 14,514.01
                    Date of First Review | 2008-07-10 10:15:37
                     Date of Last Review | 2023-09-03 02:14:35


### Key Observations

- **Volume and Scale**: The dataset contains a substantial number of reviews (22.17 million) and reviewers (15.71 million), indicating a broad and diverse user engagement across a wide range of applications.

- **Repeat Reviewers**: Approximately 22.9% of reviewers have submitted more than one review, suggesting a significant proportion of engaged users who consistently contribute feedback. This can provide valuable longitudinal insights into user experiences and loyalty.

- **Influential Reviewers**: With 6.6% of reviewers deemed influential (based on vote sum and counts), their contributions could play a pivotal role in shaping app perceptions and rankings.

- **App Diversity**: The dataset covers 36,377 unique apps across 14 categories, indicating a wide-ranging scope of applications. This diversity is beneficial for conducting category-specific analyses and identifying trends within various app domains.

- **Review Distribution**: On average, each app has approximately 609 reviews. This high level of engagement per app supports detailed app-level performance and sentiment analysis.

- **Temporal Range**: The dataset spans over 15 years, from July 2008 to September 2023. This extensive timeframe allows for robust historical analysis, capturing the evolution of user feedback and app development trends over time.

- **Memory Usage**: The dataset's size is significant, with a memory footprint of approximately 14.51 GB. This underscores the need for efficient data handling and processing strategies, particularly for large-scale analyses.

- **Feature Richness**: The summary reports 11 features spanning apps, reviewers, and reviews (the structure table above lists 10 columns). Together they support both qualitative analysis of the review text (`content`) and quantitative analysis of `rating`, `vote_sum`, and `vote_count`.
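
Most of the figures above are simple aggregates, for example 22,166,591 / 36,377 ≈ 609.4 reviews per app and 3,604,683 / 15,710,479 ≈ 22.9% repeat reviewers. The sketch below shows how such a summary might be computed with pandas. The `summarize` function and the toy rows are hypothetical. In particular, the source only says that influence is "based on vote sum and counts", so the rule used here (a reviewer whose reviews have a positive total `vote_sum`) is an assumption.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Hypothetical dataset summary; not the implementation of `dataset.summary`."""
    per_author = df.groupby("author").agg(
        n_reviews=("id", "count"),
        total_votes=("vote_sum", "sum"),
    )
    n_reviewers = len(per_author)
    n_repeat = int((per_author["n_reviews"] > 1).sum())
    # ASSUMPTION: "influential" means a positive total vote_sum across reviews.
    n_influential = int((per_author["total_votes"] > 0).sum())
    n_apps = int(df["app_id"].nunique())
    return {
        "reviews": len(df),
        "reviewers": n_reviewers,
        "repeat_reviewers": n_repeat,
        "repeat_share": n_repeat / n_reviewers,
        "influential_reviewers": n_influential,
        "apps": n_apps,
        "categories": int(df["category_id"].nunique()),
        "avg_reviews_per_app": len(df) / n_apps,
        "first_review": df["date"].min(),
        "last_review": df["date"].max(),
    }


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "id": ["r1", "r2", "r3", "r4"],
        "app_id": ["x", "x", "y", "y"],
        "category_id": ["c1", "c1", "c2", "c2"],
        "author": ["ann", "ann", "bob", "cy"],
        "vote_sum": [0, 2, 0, 0],
        "date": pd.to_datetime(
            ["2008-07-10", "2010-01-01", "2020-05-05", "2023-09-03"]
        ),
    }
)
summary = summarize(toy)
print(summary)
```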

---

### Limitations of the Dataset

While the AppVoCAI dataset provides a wealth of user review and satisfaction information, it lacks app performance and financial metrics. Specifically, data on **downloads**, **price**, and **sales figures** are not included. These omissions rule out analyses that correlate user reviews with app popularity or financial performance. Future datasets that incorporate these factors would enable a more holistic view of the app ecosystem, connecting user sentiment to app success and market performance.

In summary, the AppVoCAI dataset offers a rich and expansive resource for analyzing user reviews across a wide variety of applications and categories, with strong potential for deriving actionable insights from its longitudinal and categorical data.

---

In the next section, we will analyze the validity, relevance, uniqueness, privacy and completeness of the dataset, providing a robust, multi-dimensional data quality analysis.