# AppVoCAI Dataset: Provenance, Structure, Contents and Limitations
Curated in September 2023, the AppVoCAI Dataset is a collection of App Store reviews spanning 15 years, from the App Store's launch on July 10, 2008, through September 2023. This section outlines the dataset's provenance, key characteristics, motivation, and limitations.

## Motivation
Exploring the nexus of Generative AI, consumer behavior analysis, and the mobile app customer experience for opportunity discovery, research, marketing, and product design is inherently data-intensive. User feedback, manifested through ratings and reviews, is a consequential determinant of an app’s market and financial success. In App Store Optimization (ASO), ratings influence app discoverability, user engagement via word-of-mouth, and ultimately app downloads and financial outcomes for developers. However, despite the ubiquity and strategic importance of user reviews, there remains a dearth of large-scale datasets for research and analysis of the *mobile app* user experience.

In August of 2023, we launched an app store data acquisition effort to obtain app data and user reviews for analysis and modeling. This dataset contains anonymized reviews collected between August 29 and September 5, 2023, from publicly accessible sources in the App Store, ensuring transparency and adherence to ethical guidelines throughout the engagement.

## Literature Context

The motivation behind the AppVoCAI (App Voice of the Customer) project draws upon user experience research, sentiment analysis, and consumer behavior analysis within digital platforms. Studies by Sällberg, Wang, and Numminen (2023) emphasize the significance of user reviews in shaping perceptions and adoption of mobile apps, providing insights into sentiment analysis and thematic categorization {cite}`sallbergCombinatoryRoleOnline2023`. Additionally, theoretical frameworks such as the Technology Acceptance Model (TAM) and Uses and Gratifications Theory (UGT) offer lenses through which to interpret user behaviors and motivations in the context of app interactions {cite}`davisPerceivedUsefulnessPerceived1989` {cite}`katzUsesGratificationsResearch1973`.

Furthermore, the unified theoretical model proposed by Tas, Huseynov, and Kose (2024) integrates various theories and empirical data to comprehensively explain mobile application usage behavior, supported by real usage data. This model contributes significantly to understanding user motivations, preferences, and decision-making processes in digital environments {cite}`tasUnifiedTheoreticalModel2024`.

## Why YARD (Yet Another Review Dataset)?

Existing review datasets like Amazon, Yelp, and IMDB have been invaluable for research in consumer behavior. These datasets are accessible and widely adopted, providing a rich source of information that has contributed significantly to understanding purchasing decisions, service satisfaction, and entertainment preferences. They serve as benchmark datasets, helping researchers develop and validate models that predict consumer behavior, sentiment, and preferences. The insights derived from these reviews have driven advancements in recommendation systems, personalized marketing, and sentiment analysis, showcasing their utility and importance in the field.

Yet, these benchmark datasets primarily focus on product choices, service quality, and entertainment preferences. They provide valuable insights but fall short of capturing the profound impact of technology on consumer behavior, human interactions, and capabilities. The unique and transformative nature of mobile app usage, where technology seamlessly integrates into our daily routines through smartphones, wearables, and other devices, is best observed via a specialized dataset. Mobile app reviews serve as a historical record of this technological evolution, capturing user experiences, challenges, and adaptations in real time. These reviews document the gradual yet significant shifts in human behavior and interaction brought about by technological advancements, providing a source of data for understanding the ongoing evolution of the human/technological nexus.

## Key Features of AppVoCAI Dataset

- **Scale**: Provides an extensive and diverse collection of over 22 million reviews, making it one of the few large-scale mobile app review datasets available.
- **User Base**: Includes opinions from approximately 16 million unique users, reflecting a vast and diverse range of perspectives.
- **App Diversity**: Features reviews for over 36 thousand unique apps, showcasing a broad spectrum of app interests and needs.
- **Category Coverage**: Spans 14 of the most popular app categories, illuminating the diverse interests and needs of mobile users.
- **Voice-on-Voice**: Captures user engagement and the value placed on the sentiments and opinions of other users, with data on the number and value of user votes on reviews.
- **Temporal Coverage**: Reviews span from July 10, 2008, the day the App Store was launched, to September 3, 2023, providing a 15-year longitudinal view of app usage and user feedback trends.

## Dataset Variables
The dataset contains the following variables, which offer a multifaceted view of user interactions and app attributes.

- **id**: Unique identifier for each review.
- **app_id**: Unique identifier for the app being reviewed.
- **app_name**: Name of the mobile application being reviewed.
- **category_id**: Four-digit identifier representing the category or genre of the app.
- **category**: Category or genre name of the app.
- **author**: Name or identifier of the reviewer.
- **rating**: Numeric rating provided by the author for the app.
- **content**: Detailed content of the review provided by the author.
- **vote_sum**: Sum of votes on the usefulness of the review.
- **vote_count**: Number of votes on the usefulness of the review.
- **date**: Date when the review was written.

These variables collectively characterize the mobile app user experience and provide a basis for analyzing user feedback and extracting actionable intelligence about areas for improvement, feature requests, unmet needs, and user preferences.
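
As a rough illustration, the variables listed above could be modeled as a typed record with a few sanity checks. This is a minimal sketch under stated assumptions: the `Review` class, the `validation_errors` helper, the field types, the use of the App Store's 1-5 star scale for `rating`, and all sample values are illustrative and are not part of the released dataset. The date bounds come from the temporal coverage described under Key Features.

```python
import datetime as dt
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Review:
    """Hypothetical typed view of one review record (types are assumptions)."""
    id: str
    app_id: str
    app_name: str
    category_id: str   # four-digit identifier, kept as text
    category: str
    author: str
    rating: int        # assumed to follow the 1-5 star scale
    content: str
    vote_sum: int
    vote_count: int
    date: dt.date


def validation_errors(review):
    """Return a list of human-readable problems found in a single record."""
    errors = []
    if not (1 <= review.rating <= 5):
        errors.append("rating outside 1-5")
    if not (len(review.category_id) == 4 and review.category_id.isdigit()):
        errors.append("category_id is not four digits")
    if review.vote_count < 0:
        errors.append("negative vote_count")
    # Coverage window stated in the dataset description.
    if not (dt.date(2008, 7, 10) <= review.date <= dt.date(2023, 9, 3)):
        errors.append("date outside coverage window")
    return errors


# Made-up sample record, for illustration only.
sample = Review(
    id="r-001", app_id="a-001", app_name="Example App",
    category_id="6000", category="Business", author="anon-1",
    rating=4, content="Useful, but sync is slow.",
    vote_sum=3, vote_count=5, date=dt.date(2021, 6, 1),
)
bad = replace(sample, rating=7, category_id="60")

print(validation_errors(sample))
print(validation_errors(bad))
```

Checks of this kind are cheap to run over every record and give an early signal of structural problems before any modeling begins.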

## Sampling Strategy
To ensure a representative and insightful analysis of app reviews, we established specific inclusion criteria for selecting categories. The criteria aim to balance diversity, user engagement, and functional relevance across app types:

1. **High User Engagement and Relevance:** Categories selected should represent areas with high user engagement and relevance to daily activities, professional tasks, and personal interests. This ensures that the insights derived are impactful and reflect significant user interactions.
2. **Diverse Functional Coverage:** The chosen categories should cover a wide range of functions and utilities to provide a holistic view of app usage. This includes productivity, education, lifestyle, health, finance, entertainment, and information-related apps, ensuring a comprehensive understanding of user needs.
3. **Balanced Representation Across Major App Types:** The selection should include categories that collectively represent a balanced cross-section of the app ecosystem. This means including high-percentage categories like Business, Education, and Utilities, as well as lower-percentage but critical categories like Medical and Social Networking.
4. **Potential for Impactful Insights:** Categories that are likely to provide actionable insights for app development and improvement are prioritized. This includes areas where user feedback can directly inform enhancements in functionality, user experience, and engagement.
5. **Consistency in User Behavior and Engagement Patterns:** Selected categories should exhibit relatively consistent and comparable user behavior and engagement patterns. This facilitates more accurate and meaningful analysis, avoiding the variability and trends often seen in highly volatile categories like gaming.

## Caution on Data Representation
The AppVoCAI dataset represents the reviews retrieved during the data-gathering period. It does not capture every review available at that time, nor the complete distribution of reviews.

Although the AppVoCAI dataset covers 14 of the most popular and active categories in the App Store, several categories were omitted from the data acquisition effort, most notably the Games category. Gaming apps often exhibit user behavior and engagement patterns that differ from those of other app categories in frequency, duration, intensity, and context. Whereas mobile games are characterized by immersive gameplay and extended user sessions, we prioritized utilitarian, task-oriented applications of mobile technology.

## Data Preparation Approach
Our data preparation approach is structured to ensure the dataset is suitably robust for an insightful exploratory effort. This phase is divided into five distinct stages:

1. **Preprocessing (Unboxing) Stage**: This initial step involves an *unboxing* of the dataset. Here, we examine the data’s structure, data types, formats, and overall profile. The goal is to assess the quality of the dataset, identifying any irregularities or gaps, while also pinpointing initial data cleaning needs and opportunities for data enrichment.
2. **Perplexity Analysis Stage**: Perplexity, an information-theoretic measure of uncertainty, is computed for each review and normalized by review length. It serves as a proxy for review quality under the hypothesis that higher perplexity correlates with review complexity, depth, and nuance, whereas low perplexity might indicate repetition, spam, or other artifacts signaling low-quality reviews.
3. **Sentiment Analysis Stage**: Leveraging a pre-trained [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert) {cite}`sanhDistilBERTDistilledVersion2019` model, we gain a sense of class balance and of the alignment between sentiment and user ratings.
4. **Data Quality Analysis Stage**: We evaluate dataset validity, completeness, uniqueness, and relevance to the aspect-based sentiment analysis (ABSA) task. 
5. **Clean Stage**: In this stage, we implement selected data cleaning techniques to rectify noise, inconsistencies, and artifacts, ensuring that the dataset is free from distortions that could introduce bias or undermine the integrity of subsequent analyses or modeling efforts.
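
To make the perplexity stage concrete, the sketch below computes a length-normalized perplexity, the exponential of the mean negative log-probability per token, for two toy reviews. The language model used for AppVoCAI is not specified in this section, so an add-one-smoothed unigram model stands in purely as an assumption; the `fit_unigram` and `perplexity` helpers, the whitespace tokenization, and the sample reviews are likewise illustrative.

```python
import math
from collections import Counter


def fit_unigram(corpus):
    """Count token frequencies over a list of review strings."""
    counts = Counter(tok for text in corpus for tok in text.lower().split())
    return counts, sum(counts.values())


def perplexity(text, counts, total):
    """Length-normalized perplexity: exp(mean negative log-probability per token)."""
    tokens = text.lower().split()
    if not tokens:
        return float("nan")  # undefined for empty reviews
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens
    # Add-one smoothing keeps unseen tokens from receiving zero probability.
    log_probs = [math.log((counts[t] + 1) / (total + vocab)) for t in tokens]
    return math.exp(-sum(log_probs) / len(tokens))


# Made-up reviews: one repetitive, one more varied.
reviews = [
    "great app great app great app",
    "the sync feature fails after the latest update on my tablet",
]
counts, total = fit_unigram(reviews)
scores = {r: perplexity(r, counts, total) for r in reviews}

for review, score in scores.items():
    print(f"{score:6.2f}  {review}")
```

Under this toy model the repetitive review scores lower than the varied one, which is the pattern the stage's working hypothesis relies on. A neural language model would replace the unigram probabilities, but the normalization by token count stays the same.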

With that rather turgid introduction, we move next to the unboxing stage.
