# Introduction

> "Because we offer nearly two million apps — and we want you to feel good about using every single one of them." - Apple

In 1983, Steve Jobs envisioned a future in which consumers could purchase and download software directly from their computers. Twenty-five years later, in 2008, Apple launched the App Store, realizing that vision and changing how people interact with technology. Following the success of the iTunes digital music store, the App Store became one of the world's first commercially successful mobile app marketplaces, and today it stands as a digital leviathan. As of May 29, 2024, the App Store offered 1,928,363 apps for download, serving over 1.3 billion iOS users worldwide.

Amid this flourishing ecosystem, understanding how users *actually* feel about the apps they download and use every day remains crucial. Ratings and reviews play a pivotal role in App Store Optimization (ASO) and app discoverability, which can be the difference between an app's success and its failure. Yet large-scale customer rating and review datasets suitable for in-depth analysis remain surprisingly scarce.

Frustrated by the scarcity of comprehensive App Store review datasets, we set out to build a collection of app reviews that could serve as a laboratory for generative methods, text synthesis, graph neural networks, next-generation NLP, advanced integration technologies, and other innovative techniques. The goal is to stretch these methods in revealing the nuances and contours of the customer experience, and in uncovering latent, perhaps still unrecognized, emerging market needs with AI.

Introducing the AppVoC (Voice of the Customer) dataset: a collection of Apple iOS app reviews believed to be one of the largest of its kind, second only to the App Store itself.

## Key Features of the AppVoC Dataset

- **Scale**: The dataset includes approximately 18 million reviews, making it one of the most extensive collections of app store reviews publicly available.
- **User Base**: It covers 13 million unique users, reflecting a vast and diverse range of perspectives.
- **App Diversity**: The dataset includes reviews for some 34,000 unique apps, showcasing a broad spectrum of app interests and needs.
- **Category Coverage**: The dataset spans ten of the most popular app categories, illuminating the diverse interests and needs of mobile users.
- **Voice-on-Voice**: The dataset captures how users engage with and value the opinions of other users. The number and value of votes on each review let researchers and analysts explore user engagement dynamics, identify influential reviews, and gauge the community's collective sentiment toward specific apps. These insights can help app developers, stakeholders, and decision-makers optimize app experiences, address user concerns, and improve engagement and satisfaction.
- **Temporal Coverage**: Spanning 2008 to 2023, the dataset provides a 15-year longitudinal view of app usage and user feedback. Timestamps trace the individual and collective progression of sentiment over time, so researchers and analysts can uncover trends, patterns, and fluctuations in user sentiment and examine the factors that influence app popularity and user satisfaction. The same temporal lens supports predictive analysis and forecasting, helping stakeholders anticipate trends, mitigate risks, and act on emerging opportunities.
- **Rich Metadata**: Each review is accompanied by metadata such as app name, category, rating, vote count, and timestamp, enabling multifaceted analysis.
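
To make the voice-on-voice and temporal features concrete, here is a minimal sketch in plain Python on a handful of invented records. It assumes that `vote_sum` divided by `vote_count` can be read as the share of favorable votes on a review; that reading is an assumption, not something the dataset documentation confirms. The helper names `vote_ratio` and `mean_rating_by_year` are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Toy records: field names follow the dataset's variables, values are invented.
reviews = [
    {"rating": 5, "vote_sum": 8, "vote_count": 10, "date": date(2021, 3, 14)},
    {"rating": 2, "vote_sum": 1, "vote_count": 4, "date": date(2021, 7, 2)},
    {"rating": 4, "vote_sum": 0, "vote_count": 0, "date": date(2022, 1, 9)},
    {"rating": 3, "vote_sum": 3, "vote_count": 3, "date": date(2022, 11, 30)},
]


def vote_ratio(review):
    """Return vote_sum / vote_count, or None when the review has no votes.

    Assumption: the ratio approximates how favorably other users voted.
    """
    if review["vote_count"] == 0:
        return None
    return review["vote_sum"] / review["vote_count"]


def mean_rating_by_year(records):
    """Average star rating per calendar year, ordered by year."""
    by_year = defaultdict(list)
    for record in records:
        by_year[record["date"].year].append(record["rating"])
    return {year: mean(values) for year, values in sorted(by_year.items())}


ratios = [vote_ratio(r) for r in reviews]
yearly = mean_rating_by_year(reviews)
print(ratios)
print(yearly)
```

On the real data, the same two ideas would typically be expressed as a column division and a group-by on the year of `date`.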

## Major Statistics of the AppVoC Dataset
With the overview of the AppVoC dataset established, let's turn to the major statistics that define its scope and depth. This section summarizes key metrics, including the number of reviews, users, apps, and categories, along with a breakdown of reviews by category.

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings
warnings.filterwarnings("ignore")

In [2]:
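# Hyphens are not valid in Python module names; the underscore form of the
# package name is assumed here.
from appvocai_discover.analysis.overview import DatasetOverview
from appvocai_discover.utils.repo import ReviewRepo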

In [3]:
review_repo = ReviewRepo()
df = review_repo.read(directory="01_normalized", filename="reviews.pkl")
ov = DatasetOverview(data=df)
ov.overview

| Characteristic | Total |
|---|---|
| Number of Reviews | 18306 |
| Number of Users | 18289 |
| Number of Apps | 4199 |
| Number of Categories | 10 |
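
The internals of `DatasetOverview` are not shown here, so the following is only a sketch of how totals like these could be derived: the number of rows plus the number of distinct values in the `author`, `app_id`, and `category` fields. The function name `summarize` and the toy records are hypothetical.

```python
# Toy records: field names follow the dataset's variables, values are invented.
reviews = [
    {"id": "r1", "app_id": "a1", "author": "u1", "category": "Utilities"},
    {"id": "r2", "app_id": "a1", "author": "u2", "category": "Utilities"},
    {"id": "r3", "app_id": "a2", "author": "u1", "category": "Medical"},
    {"id": "r4", "app_id": "a3", "author": "u3", "category": "Book"},
]


def summarize(records):
    """Count reviews and the distinct users, apps, and categories among them."""
    return {
        "Number of Reviews": len(records),
        "Number of Users": len({r["author"] for r in records}),
        "Number of Apps": len({r["app_id"] for r in records}),
        "Number of Categories": len({r["category"] for r in records}),
    }


totals = summarize(reviews)
for name, total in totals.items():
    print(f"{name}: {total}")
```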


The table below breaks down the reviews by app category, showing the count and percentage of reviews in each category.

In [4]:
ov.summary

| Category | Count | Percent |
|---|---|---|
| Book | 796 | 4.35 |
| Business | 1418 | 7.75 |
| Education | 1126 | 6.15 |
| Entertainment | 2004 | 10.95 |
| Health & Fitness | 4009 | 21.9 |
| Lifestyle | 1726 | 9.43 |
| Medical | 662 | 3.62 |
| Productivity | 794 | 4.34 |
| Social Networking | 2699 | 14.74 |
| Utilities | 3072 | 16.78 |
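
As a sanity check, the Percent column can be reproduced from the Count column. The sketch below copies the counts from the table above and assumes each percentage is the category's share of the total, rounded to two decimals; how `ov.summary` actually computes it is not shown, so treat this as an illustration rather than the library's method.

```python
# Review counts per category, copied from the table above.
counts = {
    "Book": 796,
    "Business": 1418,
    "Education": 1126,
    "Entertainment": 2004,
    "Health & Fitness": 4009,
    "Lifestyle": 1726,
    "Medical": 662,
    "Productivity": 794,
    "Social Networking": 2699,
    "Utilities": 3072,
}

total = sum(counts.values())

# Share of all reviews per category, as a percentage rounded to two decimals.
percents = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(f"Total reviews: {total}")
for name, pct in percents.items():
    print(f"{name}: {pct}")
```

The total matches the number of reviews reported in the overview, and Health & Fitness, Utilities, and Social Networking together account for more than half of the reviews.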


## AppVoC Dataset Variables
The dataset contains 13 variables, offering a multifaceted view of user interactions and app attributes. 

- **id**: Unique identifier for each review.
- **app_id**: Unique identifier for the app being reviewed.
- **app_name**: Name of the mobile application being reviewed.
- **category_id**: Four-digit identifier representing the category or genre of the app.
- **category**: Category or genre name of the app.
- **author**: Name or identifier of the reviewer.
- **rating**: Numeric rating provided by the author for the app.
- **title**: Title of the review.
- **content**: Detailed content of the review provided by the author.
- **review_length**: Length of the review in words.
- **vote_sum**: Total sum of votes on the usefulness of the review.
- **vote_count**: Number of votes on the usefulness of the review.
- **date**: Date when the review was written.

Together, these variables describe the mobile app user experience and provide a basis for analyzing user feedback and extracting actionable intelligence about areas for improvement, feature requests, unmet needs, and user preferences.
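
One way to picture a single record is as a typed structure. The sketch below is illustrative only: the Python types are assumptions (the dataset's actual dtypes are not listed here), the sample values are invented, and `review_length` is modeled as a whitespace-delimited word count of `content`, which may differ from how the dataset computes it.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass
class Review:
    """One review record; field types are assumed for illustration."""

    id: str
    app_id: str
    app_name: str
    category_id: str
    category: str
    author: str
    rating: int
    title: str
    content: str
    vote_sum: int
    vote_count: int
    date: datetime

    @property
    def review_length(self) -> int:
        """Length of the review in words (whitespace-delimited)."""
        return len(self.content.split())


# Invented example values.
example = Review(
    id="r-001",
    app_id="a-123",
    app_name="Example Notes",
    category_id="0000",
    category="Productivity",
    author="user-42",
    rating=3,
    title="Mixed feelings",
    content="Great app, but it crashes on launch.",
    vote_sum=2,
    vote_count=5,
    date=datetime(2022, 6, 1, 9, 30),
)

print([f.name for f in fields(Review)] + ["review_length"])
print(example.review_length)
```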

## AppVoC Preprocessing

Ahead of the exploratory and interactive analyses, the next few sections aim to ensure that the text data are clean, structured, and ready for in-depth examination and modeling. The six-stage data preprocessing effort unfolds as follows:

1. **Data Normalization**: Standardize data types and encoding to resolve technical anomalies before further processing and analysis.
2. **Data Quality Analysis**: Identify and rectify noise within the dataset, including profanity, excessive special characters, and identifiable patterns like emails and URLs.
3. **Cleaning**: Purge biased or distorted observations detected during the data quality analysis, ensuring data integrity.
4. **Feature Engineering**: Enhance data by transforming date fields into informative features such as month, day of the week, and day of the month. Additionally, anonymize author information to uphold privacy.
5. **Text Preprocessing**: Optimize textual data for downstream tasks such as word cloud generation and topic modeling, utilizing techniques like tokenization and stemming.
6. **Metrics**: Establish an Analytics Precomputation Layer (APL) to calculate aggregate statistical summaries, facilitating swift query responses and cost-effective analytics.
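
A few of these stages can be sketched with the standard library alone. The example below is a simplified illustration, not the project's pipeline: the regular expressions are deliberately loose, the salted SHA-256 hash is just one possible way to anonymize authors, and every function name and the salt value are hypothetical. Stemming is omitted because it would require an external NLP library.

```python
import hashlib
import re
from datetime import datetime

# Loose patterns for the data quality stage; real-world detection needs more care.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def quality_flags(text):
    """Flag reviews containing URLs or email addresses (stage 2)."""
    return {
        "has_url": bool(URL_RE.search(text)),
        "has_email": bool(EMAIL_RE.search(text)),
    }


def date_features(ts):
    """Derive month, day of week (Monday=0), and day of month (stage 4)."""
    return {"month": ts.month, "day_of_week": ts.weekday(), "day_of_month": ts.day}


def anonymize_author(author, salt="example-salt"):
    """Replace an author name with a truncated salted SHA-256 digest (stage 4)."""
    digest = hashlib.sha256((salt + author).encode("utf-8")).hexdigest()
    return digest[:16]


def tokenize(text):
    """Lowercase the text and split it into word tokens (stage 5)."""
    return re.findall(r"[a-z0-9']+", text.lower())


flags = quality_flags("Email me at a.b@example.com or see https://example.com")
features = date_features(datetime(2023, 5, 29))
tokens = tokenize("Love it, can't stop!")
print(flags)
print(features)
print(anonymize_author("alice"))
print(tokens)
```

Hashing keeps the mapping from author to pseudonym stable, so reviews by the same person can still be grouped after anonymization.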

These steps will leave the dataset clean, consistent, and well structured for exploratory data analysis. Next up: data quality analysis.