In [1]:
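import os
import warnings

# When running from within the jbook directory, change to the directory two levels up.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Suppress warnings to keep the notebook output clean.
warnings.filterwarnings("ignore")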

# AppStore VOC 

## Motivation

At the 1983 International Design Conference, Steve Jobs described an ecosystem in which consumers would effortlessly access, purchase, and download software directly onto 'an incredibly great computer in a book that you can carry around with you' {cite}`stevejobsTalkStevenJobs1983`. Fast forward 25 years to July 10, 2008, and Apple realized Jobs' vision with the launch of the App Store, an event that revolutionized how technology is acquired, engaged with, and integrated into daily life. One of the world's first commercially successful mobile app marketplaces, the App Store stands as a digital leviathan, hosting 1,928,363 apps and serving over 1.3 billion iOS users worldwide {cite}`agIOSAppleApp`.

> "Because we offer nearly two million apps — and we want you to feel good about using every single one of them." - Apple

User feedback, expressed through ratings and reviews, is a critical component of App Store Optimization (ASO), influencing app discoverability, user engagement, and financial outcomes for developers. However, despite the ubiquity and strategic importance of user reviews, there remains a dearth of large-scale datasets dedicated to the comprehensive analysis of *mobile app* user experiences.

### Literature Context

The motivation behind the AppVoC (App Voice of the Customer) project draws upon foundational literature in user experience research, sentiment analysis, and consumer behavior within digital platforms. Sällberg, Wang, and Numminen (2023) emphasize the significance of user reviews in shaping perceptions and adoption of mobile apps, providing insights into sentiment analysis and thematic categorization {cite}`sallbergCombinatoryRoleOnline2023`. Additionally, theoretical frameworks such as the Technology Acceptance Model (TAM) and Uses and Gratifications Theory (UGT) offer lenses through which to interpret user behaviors and motivations in the context of app interactions {cite}`davisPerceivedUsefulnessPerceived1989` {cite}`katzUsesGratificationsResearch1973`.

Furthermore, the unified theoretical model proposed by Tas, Huseynov, and Kose (2024) integrates various theories and empirical data to comprehensively explain mobile application usage behavior, supported by real usage data. This model contributes significantly to understanding user motivations, preferences, and decision-making processes in digital environments {cite}`tasUnifiedTheoreticalModel2024`.

### Why YARD (Yet Another Review Dataset)?

Existing review datasets like Amazon, Yelp, and IMDB have been invaluable for research in consumer behavior. These datasets are accessible and widely adopted, providing a rich source of information that has contributed significantly to understanding purchasing decisions, service satisfaction, and entertainment preferences. They serve as benchmark datasets, helping researchers develop and validate models that predict consumer behavior, sentiment, and preferences. The insights derived from these reviews have driven advancements in recommendation systems, personalized marketing, and sentiment analysis, showcasing their utility and importance in the field.

Yet these benchmark datasets focus primarily on product choices, service quality, and entertainment preferences. They provide valuable insights but fall short of capturing the profound impact of technology on consumer behavior, human interactions, and capabilities. The unique and transformative nature of mobile app usage, where technology integrates seamlessly into our daily routines through smartphones, wearables, and other devices, necessitates a specialized dataset.

This is where the AppVoC dataset becomes essential. Mobile app reviews serve as a historical record of this technological evolution, capturing user experiences, challenges, and adaptations in real time. These reviews document the gradual yet significant shifts in human behavior and interaction brought about by technological advancements, providing a rich source of data for understanding this ongoing evolution. The AppVoC dataset is not just another collection of reviews; it is a critical resource for documenting and driving the next stages of the human/technological nexus.

### Key Features of AppVoC Dataset

- **Scale**: Contains approximately 22 million reviews, making it one of the most extensive collections of app store reviews publicly available.
- **User Base**: Includes opinions from approximately 16 million unique users, reflecting a vast and diverse range of perspectives.
- **App Diversity**: Features reviews for over 36 thousand unique apps, showcasing a broad spectrum of app interests and needs.
- **Category Coverage**: Spans 14 of the most popular app categories, illuminating the diverse interests and needs of mobile users.
- **Voice-on-Voice**: Captures user engagement and the value placed on the sentiments and opinions of other users, with data on the number and value of user votes on reviews.
- **Temporal Coverage**: Reviews span from July 10, 2008, the day the App Store was launched, to September 3, 2023, providing a 15-year longitudinal view of app usage and user feedback trends.
- **Rich Metadata**: Each review includes metadata such as app name, category, rating, vote count, and timestamp, enabling multifaceted analysis.

### Value of the Dataset

- **Comprehensive Analysis**: The large scale of the dataset allows for in-depth and statistically robust analyses of user reviews and app feedback.
- **Diverse Perspectives**: With opinions from nearly 16 million unique users, the dataset provides a broad and varied range of user insights, enhancing the robustness of research findings.
- **Wide Applicability**: Reviews for over 36 thousand unique apps across 14 categories enable research across different domains, making the dataset valuable for a wide range of applications and studies.
- **User Engagement Insights**: The Voice-on-Voice feature offers unique insights into user engagement dynamics, helping to identify influential reviews and understand collective user sentiment.
- **Longitudinal Trends**: The 15-year temporal coverage allows for the analysis of trends and changes in user sentiment over time, providing historical context and aiding in predictive analysis.
- **Multifaceted Analysis**: Rich metadata supports comprehensive analyses, enabling researchers to explore various dimensions of the data and uncover deeper insights.
- **Predictive Power**: Temporal data empowers stakeholders to anticipate trends, mitigate risks, and capitalize on emerging opportunities in the app ecosystem.
- **Enhanced Decision Making**: Insights derived from the dataset can inform app development, marketing strategies, and policy-making, leading to better user experiences and improved business outcomes.

The AppVoC dataset consists of approximately 22 million reviews spanning 14 categories, collected from the Apple App Store over a two-week period in late August 2023. The data was sourced from publicly accessible sections of the App Store, ensuring compliance with ethical standards and transparency throughout the process. For this project, which aims to discover unmet needs and opportunities, the analysis focuses specifically on reviews written between January 2021 and September 2023.
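
The analysis window described above can be sketched as a simple date filter. This is only an illustration: the source gives the window by month, so the exact bounds used here (January 1, 2021 inclusive to October 1, 2023 exclusive) are an assumption, as are the helper name `in_analysis_window` and the made-up sample records.

```python
from datetime import datetime

# Assumed bounds: the text states only "January 2021" and "September 2023",
# so the window is taken as a half-open interval covering both months fully.
WINDOW_START = datetime(2021, 1, 1)
WINDOW_END = datetime(2023, 10, 1)


def in_analysis_window(review_date: datetime) -> bool:
    """Return True if a review date falls inside the assumed analysis window."""
    return WINDOW_START <= review_date < WINDOW_END


# Invented sample records, for illustration only.
reviews = [
    {"id": "r-001", "date": datetime(2008, 7, 10)},
    {"id": "r-002", "date": datetime(2021, 1, 1)},
    {"id": "r-003", "date": datetime(2023, 9, 3)},
]

selected = [r for r in reviews if in_analysis_window(r["date"])]
print([r["id"] for r in selected])
```

A half-open interval avoids having to decide whether the last day of September is included at a particular time of day.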

### Dataset Variables
The dataset contains 12 variables, offering a multifaceted view of user interactions and app attributes. 

- **id**: Unique identifier for each review.
- **app_id**: Unique identifier for the app being reviewed.
- **app_name**: Name of the mobile application being reviewed.
- **category_id**: Four-digit identifier representing the category or genre of the app.
- **category**: Category or genre name of the app.
- **author**: Name or identifier of the reviewer.
- **rating**: Numeric rating provided by the author for the app.
- **content**: Detailed content of the review provided by the author.
- **vote_sum**: Sum of the votes on the usefulness of the review.
- **vote_count**: Number of votes on the usefulness of the review.
- **eda_review_length**: Number of words in the review.
- **date**: Date when the review was written.

These variables collectively characterize the mobile app user experience and provide a basis for analyzing user feedback and extracting actionable intelligence about areas for improvement, feature requests, unmet needs, and user preferences.
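
As a rough illustration of the schema above, the 12 variables could be modeled as a Python dataclass. The field types are assumptions inferred from the descriptions, not confirmed by the dataset itself. `word_count` is a hypothetical helper showing one plausible way `eda_review_length` might be derived (whitespace tokenization), and all sample values are invented.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass
class Review:
    """One review record; field types are assumed, not taken from the source."""
    id: str
    app_id: str
    app_name: str
    category_id: str  # four-digit identifier, kept as text (assumption)
    category: str
    author: str
    rating: int  # assumed to be an integer rating
    content: str
    vote_sum: int
    vote_count: int
    eda_review_length: int
    date: datetime


def word_count(text: str) -> int:
    """Hypothetical helper: count whitespace-separated tokens in a review."""
    return len(text.split())


# Invented sample record, for illustration only.
text = "Great app, but sync fails sometimes."
example = Review(
    id="r-001", app_id="a-001", app_name="Example App",
    category_id="0000", category="Example Category", author="user-123",
    rating=4, content=text, vote_sum=3, vote_count=5,
    eda_review_length=word_count(text), date=datetime(2023, 8, 1),
)

print(len(fields(Review)))
print(example.eda_review_length)
```

Typing the record explicitly makes it easier to spot mismatches, such as a rating stored as text, during later quality checks.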

### Caution on Data Representation

While the AppVoC dataset provides a substantial collection of approximately 22 million reviews, it is essential to recognize that it represents only the reviews collected during the data gathering period and may not capture every review or the complete distribution of reviews in the App Store. Nevertheless, the dataset offers valuable insights into user feedback and app interactions within the selected categories.

The dataset's extensive coverage across 14 categories, which represent 70.55% of active apps in the App Store, ensures that it provides a significant and meaningful snapshot of the mobile app ecosystem. While variations in the number of reviews across categories might be influenced by factors such as the popularity of specific app types and temporal fluctuations in user activity, the breadth and depth of the data allow for robust analyses and valuable inferences.

While the dataset may not capture the entirety of the App Store review landscape, it offers a representative and diverse sample that reflects significant aspects of user experiences and interactions. This allows for meaningful inferences about user preferences, emerging trends, and the performance of various app categories. Researchers and developers can leverage the dataset to generate insights that inform future developments in mobile technology and app design.

## What's Next?
The next steps will guide us through preparing the data for deeper analysis. Here’s the plan:

1. **Unboxing the Data:** Exploring the key characteristics of the dataset, including its size, structure, and main features.
2. **Data Quality Assessment:** Identifying any anomalies or irregularities that need attention, such as missing values, inconsistencies, or outliers.
3. **Data Cleaning:** Addressing the identified issues through cleaning processes to ensure data consistency and reliability.
4. **Data Preprocessing:** Preparing the data for analysis by transforming and encoding the necessary elements.
5. **Feature Engineering:** Creating new features from the data to enhance our ability to uncover insights and patterns.
6. **Review Text Quality Analysis and Pruning:** Analyzing the quality of the review text, pruning lower-quality reviews, and optimizing the dataset for high-value analysis.

Each step builds toward a refined, actionable dataset, ready for in-depth analysis and the discovery of valuable insights.