In [7]:
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings

warnings.filterwarnings("ignore")
FORCE = True

# AppVoCAI Dataset
The AppVoCAI Dataset was curated in September 2023 to provide a collection of app store reviews, encapsulating the voice of the mobile technology consumer for research, experimentation, and analysis. This section will:

1. **Introduce the dataset** – detailing its origins, intended applications, and any limitations that may impact analysis outcomes.
2. **Unbox the dataset** – examining its structure, format, and composition to gain a foundational understanding.
3. **Condition the data** – addressing encoding, structural, or formatting inconsistencies to ensure clean and reliable input/output operations.

To support a meaningful exploratory analysis, we will also enhance the dataset with variables such as *review length* and *review age* that foster insight or offer additional perspective on the customer experience over time.

## Motivation
Exploring the nexus of Generative AI, consumer behavior analysis, and the mobile app customer experience for opportunity discovery, research, marketing, and product design is inherently data intensive. User feedback, expressed through ratings and reviews, is a consequential determinant of an app's market and financial success. In App Store Optimization (ASO), ratings influence app discoverability, user engagement via word-of-mouth, and ultimately app downloads and financial outcomes for developers. However, despite the ubiquity and strategic importance of user reviews, there remains a dearth of large-scale datasets for research and analysis of the *mobile app* user experience.

In August of 2023, we launched an app store data acquisition effort to obtain app data and user reviews for analysis and modeling. This dataset contains anonymized reviews collected between August 29 and September 5, 2023, from publicly accessible sources in the App Store, ensuring transparency and adherence to ethical guidelines throughout the engagement.

### Literature Context

The motivation behind the AppVoCAI (App Voice of the Customer) project draws upon user experience research, sentiment analysis, and consumer behavior analysis within digital platforms. Studies by Sällberg, Wang, and Numminen (2023) emphasize the significance of user reviews in shaping perceptions and adoption of mobile apps, providing insights into sentiment analysis and thematic categorization {cite}`sallbergCombinatoryRoleOnline2023`. Additionally, theoretical frameworks such as the Technology Acceptance Model (TAM) and Uses and Gratifications Theory (UGT) offer lenses through which to interpret user behaviors and motivations in the context of app interactions {cite}`davisPerceivedUsefulnessPerceived1989` {cite}`katzUsesGratificationsResearch1973`.

Furthermore, the unified theoretical model proposed by Tas, Huseynov, and Kose (2024) integrates various theories and empirical data to comprehensively explain mobile application usage behavior, supported by real usage data. This model contributes significantly to understanding user motivations, preferences, and decision-making processes in digital environments {cite}`tasUnifiedTheoreticalModel2024`.

### Why YARD (Yet Another Review Dataset)?

Existing review datasets like Amazon, Yelp, and IMDB have been invaluable for research in consumer behavior. These datasets are accessible and widely adopted, providing a rich source of information that has contributed significantly to understanding purchasing decisions, service satisfaction, and entertainment preferences. They serve as benchmark datasets, helping researchers develop and validate models that predict consumer behavior, sentiment, and preferences. The insights derived from these reviews have driven advancements in recommendation systems, personalized marketing, and sentiment analysis, showcasing their utility and importance in the field.

Yet, these benchmark datasets primarily focus on product choices, service quality, and entertainment preferences. They provide valuable insights but fall short of capturing the profound impact of technology on consumer behavior, human interactions, and capabilities. The unique and transformative nature of mobile app usage, where technology seamlessly integrates into our daily routines through smartphones, wearables, and other devices, is best observed via a specialized dataset. Mobile app reviews serve as a historical record of this technological evolution, capturing user experiences, challenges, and adaptations in real time. These reviews document the gradual yet significant shifts in human behavior and interaction brought about by technological advancements, providing a source of data for understanding this ongoing evolution of the human/technological nexus.

### Key Features of AppVoCAI Dataset

- **Scale**: Provides an extensive and diverse collection of over 22 million reviews, making it one of the few large-scale mobile app review datasets available.
- **User Base**: Includes opinions from approximately 16 million unique users, reflecting a vast and diverse range of perspectives.
- **App Diversity**: Features reviews for over 36 thousand unique apps, showcasing a broad spectrum of app interests and needs.
- **Category Coverage**: Spans 14 of the most popular app categories, illuminating the diverse interests and needs of mobile users.
- **Voice-on-Voice**: Captures user engagement and the value placed on the sentiments and opinions of other users, with data on the number and value of user votes on reviews.
- **Temporal Coverage**: Reviews span from July 10, 2008, the day the App Store was launched, to September 3, 2023, providing a 15-year longitudinal view of app usage and user feedback trends.

### Dataset Variables
The dataset contains 12 variables, offering a multifaceted view of user interactions and app attributes. 

- **id**: Unique identifier for each review.
- **app_id**: Unique identifier for the app being reviewed.
- **app_name**: Name of the mobile application being reviewed.
- **category_id**: Four-digit identifier representing the category or genre of the app.
- **category**: Category or genre name of the app.
- **author**: Name or identifier of the reviewer.
- **rating**: Numeric rating provided by the author for the app.
- **content**: Detailed content of the review provided by the author.
- **vote_sum**: Total sum of votes on the usefulness of the rating.
- **vote_count**: Number of votes on the usefulness of the rating.
- **eda_review_length**: Number of words in review.
- **date**: Date when the review was written.

These variables collectively evince the mobile app user experience and provide a basis for analyzing user feedback and extracting actionable intelligence about areas for improvement, feature requests, unmet needs, and user preferences.

### Sampling Strategy
To ensure a representative and insightful analysis of app reviews, we established specific inclusion criteria for selecting categories. The criteria aim to balance diversity, user engagement, and functional relevance across various app types. Below are the articulated inclusion criteria based on the categories selected:

1. **High User Engagement and Relevance:** Categories selected should represent areas with high user engagement and relevance to daily activities, professional tasks, and personal interests. This ensures that the insights derived are impactful and reflect significant user interactions.
2. **Diverse Functional Coverage:** The chosen categories should cover a wide range of functions and utilities to provide a holistic view of app usage. This includes productivity, education, lifestyle, health, finance, entertainment, and information-related apps, ensuring a comprehensive understanding of user needs.
3. **Balanced Representation Across Major App Types:** The selection should include categories that collectively represent a balanced cross-section of the app ecosystem. This means including high-percentage categories like Business, Education, and Utilities, as well as lower-percentage but critical categories like Medical and Social Networking.
4. **Potential for Impactful Insights:** Categories that are likely to provide actionable insights for app development and improvement are prioritized. This includes areas where user feedback can directly inform enhancements in functionality, user experience, and engagement.
5. **Consistency in User Behavior and Engagement Patterns:** Selected categories should exhibit relatively consistent and comparable user behavior and engagement patterns. This facilitates more accurate and meaningful analysis, avoiding the variability and trends often seen in highly volatile categories like gaming.

### Caution on Data Representation
The AppVoCAI dataset represents the reviews collected during the data gathering period and does not capture every review or the complete distribution of reviews during the collection period. 

Although the AppVoCAI dataset covers 14 of the most popular and active categories in the App Store, several categories were omitted during the data acquisition effort, most notably the Games category. Gaming apps often exhibit user behavior and engagement patterns that differ from those of other app categories in frequency, duration, intensity, and context. Whereas mobile games are characterized by immersive gameplay and extended user sessions, we prioritized utilitarian, task-oriented applications of mobile technology.

With that, let's unbox the data.

In [8]:
from discover.assets.review import Review
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.infra.utils.file.io import IOService
from discover.flow.data_prep.ingest.stage import IngestStage
from discover.infra.utils.visual.print import Printer

In [9]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
        "discover.app.base",
    ],
)

## Unbox Data


In [10]:
reader = FlowConfigReader()
config = reader.get_config("phases", namespace=True)
filepath = config.dataprep.stages.ingest.source_config.filepath
df = IOService.read(filepath=filepath)

In [None]:
from discover.assets.dataset import Dataset
from discover.core.flow import DataPrepStageDef, PhaseDef


dataset = Dataset(
    phase=PhaseDef.DATAPREP, stage=DataPrepStageDef.RAW, content=df, name="reviews"
)

In [None]:
# Instantiate the repository
repo = container.repo.dataset_repo()
# Add the dataset to the repository
repo.add(dataset=dataset)
# Instantiate the Review object for analysis.
reviews = Review(df=df)

Next, we build and run the ingestion pipeline. 

In [None]:
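# Ingest stage configuration, taken from the flow config loaded above.
stage_config = config.dataprep.stages.ingest
stage = IngestStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()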



#                              Data Ingestion Stage                              #



Finally, we load the review dataset from the repository and instantiate a Review object.

## AppVoCAI Dataset Overview


In [None]:
reviews.overview()



                           AppVoCAI Dataset Overview                            
                       Number of Reviews | 22,166,591
                       Number of Authors | 15,710,479
                          Number of Apps | 36,377
                    Number of Categories | 14
                                Features | 11
                    Date of First Review | 2008-07-10 10:15:37
                     Date of Last Review | 2023-09-03 02:14:35


The AppVoCAI dataset captures over 22 million reviews from nearly 16 million users. These reviews span 36,377 apps across 14 categories, providing a diverse range of insights into user experiences. The overview reports 11 features, and the reviews run from July 10, 2008, the date the App Store launched, through early September 2023.

### AppVoCAI Dataset Structural Analysis
The dataset's variables, listed below, offer both qualitative and quantitative views of user opinion.

- **id**: Unique identifier for each review.
- **app_id**: Unique identifier for the app being reviewed.
- **app_name**: Name of the mobile application being reviewed.
- **category_id**: Four-digit identifier representing the category or genre of the app.
- **category**: Category or genre name of the app.
- **author**: Name or identifier of the reviewer.
- **rating**: Numeric rating provided by the author for the app.
- **content**: Detailed content of the review provided by the author.
- **vote_sum**: Total sum of votes on the usefulness of the rating.
- **vote_count**: Number of votes on the usefulness of the rating.
- **review_length**: Number of words in review.
- **review_age**: Age of review relative to the last review in the dataset.
- **date**: Date when the review was written.
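
The two derived variables, `review_length` and `review_age`, can be illustrated with a minimal sketch. It assumes a pandas DataFrame with `content` and `date` columns as described above. The helper name `add_review_features`, the whitespace tokenization, and the choice of days as the unit of age are assumptions made for illustration, not the project's actual implementation.

```python
import pandas as pd


def add_review_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with review_length and review_age columns added.

    Hypothetical helper: whitespace tokenization and age in days are assumptions.
    """
    out = df.copy()
    # Word count: split on runs of whitespace and count the resulting tokens.
    out["review_length"] = out["content"].str.split().str.len()
    # Age in whole days relative to the most recent review in the dataset.
    dates = pd.to_datetime(out["date"])
    out["review_age"] = (dates.max() - dates).dt.days
    return out


# Toy data, not drawn from the AppVoCAI dataset.
toy = pd.DataFrame(
    {
        "content": ["Great app", "Crashes every time I open it"],
        "date": ["2023-09-03", "2023-08-24"],
    }
)
enhanced = add_review_features(toy)
print(enhanced[["review_length", "review_age"]])
```

Measuring age against the latest review, rather than against today's date, keeps the variable stable no matter when the notebook is rerun.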

In [None]:
reviews.info()

| Column | DataType | Complete | Null | Completeness | Unique | Duplicate | Uniqueness | Size (Bytes) |
|---|---|---|---|---|---|---|---|---|
| id | string[python] | 22166591 | 0 | 1.0 | 22166474 | 117 | 0.9999947 | 1480962173 |
| app_id | string[python] | 22166591 | 0 | 1.0 | 36377 | 22130214 | 0.001641073 | 1468716232 |
| app_name | string[python] | 22166591 | 0 | 1.0 | 36363 | 22130228 | 0.001640442 | 1871227874 |
| category_id | object | 22166591 | 0 | 1.0 | 14 | 22166577 | 6.315811e-07 | 1352162051 |
| category | category | 22166591 | 0 | 1.0 | 14 | 22166577 | 6.315811e-07 | 22168090 |
| author | object | 22166591 | 0 | 1.0 | 15710479 | 6456112 | 0.7087458 | 1706827507 |
| rating | float64 | 22166591 | 0 | 1.0 | 5 | 22166586 | 2.255647e-07 | 177332728 |
| content | string[python] | 22166591 | 0 | 1.0 | 19079963 | 3086628 | 0.8607531 | 8070646310 |
| vote_sum | Int64 | 22166591 | 0 | 1.0 | 504 | 22166087 | 2.273692e-05 | 199499319 |
| vote_count | Int64 | 22166591 | 0 | 1.0 | 678 | 22165913 | 3.058657e-05 | 199499319 |


Here's a summary of key points based on the provided dataset overview:

1. **ID and Uniqueness**:
   - The `id` column shows some duplication (117 items), which will require treatment during the data cleaning stage.

2. **App-Specific Details (`app_id`, `app_name`)**:
   - Both `app_id` and `app_name` are highly duplicated, with only about 0.16% uniqueness, which reflects a large number of reviews per app.

3. **Category Distribution**:
   - `category_id` and `category` are consistent, each with only 14 unique categories, suggesting clear category classifications across apps.

4. **Authors and Reviews**:
   - The `author` column has a uniqueness of about 71%, indicating that roughly 29% of the reviews are additional contributions from authors already represented in the dataset. This may be valuable for user behavior analysis over time.

5. **Rating Consistency**:
   - The `rating` column has only 5 unique values, confirming that ratings follow a predefined scale, typical of app store ratings.

6. **Vote Metrics (`vote_count`, `vote_sum`)**:
   - Both `vote_count` and `vote_sum` have very low uniqueness (about 0.0031% and 0.0023%, respectively), suggesting that vote distributions are limited in range, possibly clustering around common values.
   
7. **Review Length**:
   - `review_length` has a moderate uniqueness of 0.18%, showing some diversity in review length.
   
8. **Data Completeness**:
   - All columns are 100% complete, with no missing values, ensuring the dataset’s integrity for analysis.

9. **Data Volume and Memory Considerations**:
   - The ten columns profiled above occupy about 16.5 GB of memory in total, which may require special consideration during data preprocessing and modeling stages.
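
The completeness and uniqueness columns in the table above follow simple definitions: `Duplicate` equals the row count minus `Unique`, and `Uniqueness` equals `Unique` divided by the row count. The sketch below reproduces those columns with plain pandas on a toy frame. The helper name `profile_columns` is hypothetical, and the code is not the implementation behind `reviews.info()`. It assumes a non-empty DataFrame.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize completeness and uniqueness per column (illustrative helper)."""
    n = len(df)  # assumes n > 0
    rows = []
    for col in df.columns:
        complete = int(df[col].notna().sum())
        unique = int(df[col].nunique())
        rows.append(
            {
                "Column": col,
                "Complete": complete,
                "Null": n - complete,
                "Completeness": complete / n,
                "Unique": unique,
                "Duplicate": n - unique,
                "Uniqueness": unique / n,
            }
        )
    return pd.DataFrame(rows)


# Toy data, not drawn from the AppVoCAI dataset.
toy = pd.DataFrame(
    {
        "id": ["r1", "r2", "r3", "r3"],
        "rating": [5.0, 1.0, 5.0, 5.0],
    }
)
profile = profile_columns(toy)
print(profile)
```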

Next, we will examine the distributions of ratings, votes, and review lengths, both overall and by category.

## App Store Categories

In [None]:
reviews.categories(y="category", title="App Store Categories")

In [None]:
samples = df.sample(n=5, random_state=27)
samples

The random sample of 5 observations includes a mix of high and low ratings and a range of review lengths, all of which reflect different user experiences. However, a common feature across the sample is the absence of votes (both vote count and vote sum are zero), suggesting either a low level of engagement with individual reviews or that voting is not a common interaction on these apps.

The reviews span a significant time range, from as early as 2011 to as recent as 2023, showcasing both historical and recent user sentiments. Review lengths vary from concise (26 words) to more detailed feedback (81 words), which highlights the difference in how users express their opinions depending on the app and personal style.

Overall, the sample reflects the broader patterns that may emerge in the dataset: varied user ratings, relatively short reviews, and minimal engagement in terms of votes. This suggests that future analysis may need to pay close attention to factors like review length, category-specific trends, and potential changes in engagement over time, especially for reviews that go unnoticed in terms of user votes.
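
Whether the absence of votes holds beyond a five-row sample can be checked directly. The sketch below computes the share of reviews with zero votes, overall and by category, on a toy frame. It assumes only the `category` and `vote_count` columns described earlier, and the toy values are invented for illustration.

```python
import pandas as pd

# Toy data, not drawn from the AppVoCAI dataset.
toy = pd.DataFrame(
    {
        "category": ["Health", "Health", "Social", "Social"],
        "vote_count": [0, 3, 0, 0],
    }
)

# Share of reviews that received no votes at all.
zero_vote_share = (toy["vote_count"] == 0).mean()

# The same share broken out by category.
by_category = toy.groupby("category")["vote_count"].apply(lambda s: (s == 0).mean())

print(zero_vote_share)
print(by_category)
```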

#### Spectrum TV Review Analysis

In [None]:
printer = Printer()
printer.print_dataframe_as_dict(
    df=samples, list_index=0, title=samples["app_name"].values[0], text_col="content"
)

This review provides insights into **qualitative** and **quantitative** aspects, highlighting user frustration with technical performance.

---

##### **Qualitative Analysis**:

- **Content and Context**:
  - The review describes two main issues:
    1. **Streaming Quality**: The user is dissatisfied with the app's performance, noting issues with video quality ("fuzzes out the pic") and buffering. They compare Spectrum TV unfavorably to other streaming apps, implying that their internet connection is not the problem, but rather the app itself.
    2. **Audio Compatibility**: The user reports a specific audio issue on Apple TV, where sound fails when switching to the Spectrum app, though it works fine on other apps.

- **Sentiment**:
  - The sentiment is negative, with a rating of 2 stars. The language reflects significant frustration, particularly with phrases like "terrible job" and "Do better spectrum."

- **Tone/Emotion**:
  - The tone conveys disappointment and frustration. Phrases like "Do better spectrum" emphasize the user's dissatisfaction and expectation for improvement.

- **Key Topics**:
  - Key topics include **streaming quality** (buffering and picture clarity) and **device compatibility** (sound issues on Apple TV). Both are critical technical aspects that impact user experience.

- **Clarity**:
  - The review is clear and detailed. At 81 words, it provides enough context to understand the user’s specific issues without ambiguity.

---

##### **Quantitative Analysis**:

- **Length**:
  - At 81 words, this review is moderately long, suggesting the user is invested enough to describe multiple issues rather than simply leaving a brief complaint.

- **Rating**:
  - The 2-star rating signals clear dissatisfaction, though it isn't the lowest rating, possibly indicating that the user sees potential in the app if improvements are made.

- **Vote Count/Engagement**:
  - The review has no votes (`vote_count = 0`, `vote_sum = 0`), which may suggest it hasn't gained visibility or that other users have not prioritized these issues.

- **Entropy/Complexity**:
  - The review includes specific technical details, which could contribute to moderate entropy. Words like "buffering," "streaming data," and "Apple TV" add to the complexity, likely making it more informative for developers seeking feedback on technical issues.

---

##### Summary:
This review highlights technical performance concerns with **Spectrum TV**, specifically around streaming quality and compatibility with Apple TV audio. The detailed feedback may reflect a user looking for constructive improvements rather than merely venting. The lack of engagement may indicate these issues are not widely experienced or that the review hasn't reached enough users. This type of review could be valuable for identifying recurring technical problems across similar reviews. Further analysis of streaming-related complaints could reveal if these are common issues with Spectrum TV.
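
The notebook does not state how entropy is measured. One common reading, assumed here, is the Shannon entropy of a review's word-frequency distribution, $H = \sum_w p_w \log_2 (1/p_w)$, where $p_w$ is the relative frequency of word $w$. The function name `word_entropy` and the lowercase, whitespace-based tokenization are illustrative choices, and the example strings are invented.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the word-frequency distribution of text."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    # Sum p * log2(1 / p) over distinct words, with p = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Toy strings, not taken from the dataset.
repetitive = "bad bad bad bad"
varied = "buffering ruins streaming on my television"

print(word_entropy(repetitive))
print(word_entropy(varied))
```

Under this definition, a review that repeats one word scores zero, while a review whose words are all distinct scores the base-2 logarithm of its word count. Longer reviews with varied vocabulary therefore tend to score higher.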

#### Instagram Review Analysis

In [None]:
df["date"].min()

In [None]:
printer.print_dataframe_as_dict(
    df=samples, list_index=1, title=samples["app_name"].values[1], text_col="content"
)

This review of **Instagram** indicates some frustration:

---

##### **Qualitative Analysis**:

- **Content and Context**:
  - The user expresses dissatisfaction with the presence of **suggested posts and ads** in their feed, which appears to be overwhelming their personal content. This is a common concern among users who prefer an unfiltered experience on social media.
  - The strong reaction ("On the verge of deleting account") indicates the user may abandon the app if these issues persist, which points to a significant risk for user retention.

- **Sentiment**:
  - The sentiment is highly negative, as reflected by the 1-star rating and phrases like "SICK of it!!!" and "Would not recommend!!!!"

- **Tone/Emotion**:
  - The tone conveys frustration and annoyance, with the use of capital letters and multiple exclamation marks intensifying the sense of anger. This suggests a user who is close to leaving the app over these issues.

- **Key Topics**:
  - The main topics are **advertising** and **content curation**. The user is specifically unhappy with the prevalence of ads and suggested posts, which they perceive as intrusive to their Instagram experience.

- **Clarity**:
  - The review is clear and direct. Despite its brevity (26 words), it conveys strong feelings and specific grievances, making it easy to understand the user's complaints.

---

##### **Quantitative Analysis**:

- **Length**:
  - With only 26 words, this is a short review, likely indicating that the user was focused on expressing dissatisfaction without detailing the underlying issues or offering solutions.

- **Rating**:
  - The 1-star rating underscores extreme dissatisfaction, aligning with the content of the review, which is highly critical of Instagram’s ad-heavy content.

- **Vote Count/Engagement**:
  - This review has no votes (`vote_count = 0`, `vote_sum = 0`), which may indicate limited visibility or that others haven’t prioritized this complaint, even if they share similar frustrations.

- **Entropy/Complexity**:
  - Given the short length and use of strong emotional language, the review likely has low entropy (high predictability), with common words reflecting a straightforward expression of frustration rather than a nuanced or complex opinion.

---

##### Summary:
This review highlights significant dissatisfaction with Instagram’s reliance on **ads and suggested posts**, which the user finds intrusive and overwhelming. The short, emotional tone suggests that the user is on the verge of abandoning the app, potentially signaling a retention risk for Instagram. Although this review hasn't gained engagement from other users, analyzing similar reviews could determine if this frustration is a widespread issue. Overall, this review points to key areas of user dissatisfaction, particularly for users seeking a more personalized and less commercialized experience on the platform.

#### Move! Coach Review Analysis

In [None]:
printer.print_dataframe_as_dict(
    df=samples, list_index=2, title=samples["app_name"].values[2], text_col="content"
)

This review of **MOVE! Coach** focuses on the app’s usability and functionality.

---

##### **Qualitative Analysis**:

- **Content and Context**:
  - The user is frustrated with the app's **usability** and **lack of functionality**, specifically regarding the inability to log meals and exercises. They describe spending considerable time navigating the app's manuals, videos, and sections, only to find limited functionality (specifically, the option to add weight but not other health metrics).
  - This suggests a potential mismatch between user expectations and the app's features, as well as possible shortcomings in the app’s user interface and navigation.

- **Sentiment**:
  - The sentiment is clearly negative, with a 1-star rating and language that reflects frustration. Phrases like "Super frustrating" emphasize the user’s disappointment and dissatisfaction with the app's usability.

- **Tone/Emotion**:
  - The tone conveys frustration and disappointment, as the user feels let down by the app's lack of key features. This sentiment is intensified by the fact that the user put in considerable effort to understand the app's functionality.

- **Key Topics**:
  - Key topics include **feature accessibility** (finding features to log meals and exercises) and **usability** (difficulty navigating the app). These are critical areas in health and fitness apps where users expect ease of use and comprehensive functionality.

- **Clarity**:
  - The review is relatively clear, though there are some minor grammatical errors ("I doesn’t an hour"), likely stemming from user frustration or haste. However, the core message of dissatisfaction with the app's lack of functionality is conveyed effectively.

---

##### **Quantitative Analysis**:

- **Length**:
  - With 37 words, this review is brief but provides enough context to understand the user’s main issues, especially around app functionality.

- **Rating**:
  - The 1-star rating reinforces the user's strong dissatisfaction with the app. The rating aligns with the critical nature of the review content.

- **Vote Count/Engagement**:
  - This review has no votes (`vote_count = 0`, `vote_sum = 0`), which may indicate it hasn’t gained visibility or resonance with other users. However, the issues raised may still be relevant for product improvement.

- **Entropy/Complexity**:
  - The review likely has low entropy, as it uses common words and phrases to convey frustration. The straightforward language indicates immediate concerns without nuanced discussion.

---

##### Summary:
This review highlights significant usability issues with **MOVE! Coach**, focusing on the user's inability to find essential features such as meal and exercise logging. The frustration expressed reflects unmet expectations for a health and fitness app, where users anticipate easy access to comprehensive tracking features. The lack of engagement from other users may suggest that this experience is either unique to this user or not widely shared. However, if similar feedback is found in other reviews, it could indicate broader usability issues that may warrant improvement.

#### Pump Log App Review

In [None]:
printer.print_dataframe_as_dict(
    df=samples, list_index=3, title=samples["app_name"].values[3], text_col="content"
)

This review of **Pump Log** provides insights into the app's effectiveness and emotional support for users, as well as value for cost.

---

##### **Qualitative Analysis**:

- **Content and Context**:
  - The review expresses strong satisfaction with the app’s role in supporting breastfeeding mothers. The user mentions that the app helped them persevere during challenging moments ("This app kept me going when I felt I couldn’t anymore"), indicating that the app provides not just functionality, but also emotional support and motivation.
  - The user also highlights the **value for cost**, noting that the extended options are "worth every bit" of the small charge. This speaks to the perceived value and effectiveness of the paid features.

- **Sentiment**:
  - The sentiment is highly positive, with a 5-star rating. The language used, such as "10/10 recommend" and "Worth every bit," conveys strong endorsement and satisfaction.

- **Tone/Emotion**:
  - The tone is enthusiastic and supportive. The user’s experience with the app is deeply personal, and they express gratitude, indicating that the app has had a meaningful impact on their breastfeeding journey.

- **Key Topics**:
  - Key topics include **recommendation** (highly recommending the app to other mothers) and **value for cost** (justifying the cost of extended options). The app’s supportive nature and functionality for "milk making mamas" are emphasized.

- **Clarity**:
  - The review is clear and concise. It provides specific praise for the app’s motivational support and value without any ambiguity.

---

##### **Quantitative Analysis**:

- **Length**:
  - At 33 words, the review is short but impactful, providing both a recommendation and specific reasons for the user’s satisfaction with the app.

- **Rating**:
  - The 5-star rating aligns with the review's positive tone and enthusiastic language, showing that the user finds the app exceptionally valuable.

- **Vote Count/Engagement**:
  - This review has no votes (`vote_count = 0`, `vote_sum = 0`), indicating that it hasn’t gained visibility or engagement from other users. However, its highly positive tone could resonate if promoted to a larger audience.

- **Entropy/Complexity**:
  - Given its brevity and straightforward language, the review likely has low entropy, using simple words to convey strong endorsement without much complexity.

---

##### Summary:
This review highlights the effectiveness of **Pump Log** in supporting breastfeeding mothers, both functionally and emotionally. The user’s enthusiastic recommendation and praise for the app’s value indicate a high level of satisfaction, particularly with the app’s role in motivating users during challenging times. Despite the lack of engagement, the review provides valuable insight into how this app meets specific user needs in a meaningful way. Analyzing similar reviews could reveal if this level of emotional impact is common among users of this app.

#### Pluto TV - Live TV and Movies Review Analysis

In [None]:
printer.print_dataframe_as_dict(
    df=samples, list_index=4, title=samples["app_name"].values[4], text_col="content"
)

This review of **Pluto TV - Live TV and Movies** offers insights into user discovery and initial impressions of app content.

---

##### **Qualitative Analysis**:

- **Content and Context**:
  - The review reflects a positive initial discovery experience. The user recently found **Pluto TV** and is pleasantly surprised by the wide range of available TV shows, expressing appreciation for the app’s content library.
  - The user’s phrase, "I am amazed that you have so many," highlights satisfaction with the app’s offerings, suggesting that Pluto TV provides a breadth of content that appeals to new users.

- **Sentiment**:
  - The sentiment is very positive, as indicated by the 5-star rating and the words "amazed" and "thank you," which show the user's satisfaction and gratitude.

- **Tone/Emotion**:
  - The tone is enthusiastic and appreciative. The user's surprise at the content selection conveys excitement and a positive first impression.

- **Key Topics**:
  - Key topics include **content variety** (large selection of TV shows) and **user discovery** (newly finding the app). This feedback highlights the importance of content diversity in attracting and retaining new users for entertainment apps.

- **Clarity**:
  - The review is clear and concise. Although brief, it communicates both the user’s initial impression and appreciation effectively.

---

##### **Quantitative Analysis**:

- **Length**:
  - At 29 words, the review is short, focusing on the user’s initial excitement about discovering the app’s content without providing additional detail.

- **Rating**:
  - The 5-star rating aligns with the enthusiastic tone, showing that the user’s initial experience with the app was very positive.

- **Vote Count/Engagement**:
  - This review has no votes (`vote_count = 0`, `vote_sum = 0`), suggesting that it hasn’t gained visibility or engagement from other users. However, the positive tone could be influential if shared with prospective users.

- **Entropy/Complexity**:
  - The review likely has low entropy due to its straightforward language and short length, focusing on simple expressions of amazement and gratitude.

---

##### Summary:
This review reflects a highly positive initial experience with **Pluto TV**, focusing on the breadth of available content. The user's excitement at discovering the variety of TV shows highlights an important value proposition for the app in attracting new users. While the review lacks engagement from others, it underscores the impact of a diverse content library on first-time users. Further analysis of similar reviews could help identify if content variety is a recurring theme that appeals to new Pluto TV users.

## Summary
The dataset consists of approximately 22.2 million observations, including approximately 15.7 million unique authors, 36,377 apps, and 14 categories, spanning from July 2008 to September 2023. Its features include anonymized author identifiers, app ratings, review vote engagement data, and review content. The `id` column contains 117 duplicate review identifiers, which will be addressed during data preparation.
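
Duplicate review identifiers of the kind noted above can be located and removed with standard pandas calls. The sketch below uses a toy frame. Keeping the first occurrence is only one possible treatment, assumed here for illustration, and the cleaning stage may apply a different rule.

```python
import pandas as pd

# Toy data, not drawn from the AppVoCAI dataset.
toy = pd.DataFrame(
    {
        "id": ["r1", "r2", "r2", "r3"],
        "rating": [5.0, 2.0, 2.0, 1.0],
    }
)

# Flag every row whose id appears more than once (keep=False marks all copies).
dupes = toy[toy["id"].duplicated(keep=False)]
print(dupes)

# One possible treatment: keep the first occurrence of each id.
deduped = toy.drop_duplicates(subset="id", keep="first")
print(len(toy) - len(deduped), "duplicate rows removed")
```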

Next, we conduct data cleaning to prepare the dataset for feature engineering and downstream analysis.