In [1]:
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings

warnings.filterwarnings("ignore")

# AppVoCAI Reviews Dataset Unboxed
In the prior section, we introduced the AppVoCAI Reviews Dataset: its motivation, provenance, features, and characteristics. Our aim here is to:
- **extract** reviews of interest from January 2021 through September 2023, the data that will form the basis for discovering unmet needs and opportunities,
- **transform** data types and validate encoding, and
- **load** the dataset into a repository for downstream preprocessing and analysis.
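
As a rough illustration of these three steps, the sketch below applies them to a toy pandas DataFrame. The column names, the `extract`/`transform`/`load` helpers, and the dictionary standing in for the repository are all hypothetical; the actual pipeline is driven by `IngestStage` and its configuration, and may work differently.

```python
import pandas as pd

# Toy raw data; column names are assumptions, not the dataset's actual schema.
raw = pd.DataFrame(
    {
        "id": ["1", "2", "3", "4"],
        "content": ["Great app", "Crashes a lot", "Love it", "Too many ads"],
        "category": ["Games", "Utilities", "Games", "Games"],
        "rating": ["5", "1", "5", "2"],
        "date": ["2020-12-31", "2021-01-01", "2023-09-03", "2023-10-01"],
    }
)


def extract(df: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Keep reviews whose date falls within [start, end], inclusive."""
    dates = pd.to_datetime(df["date"])
    mask = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    return df.loc[mask].copy()


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Cast columns to compact types and check that the text encodes as UTF-8."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out["rating"] = out["rating"].astype("int8")
    out["category"] = out["category"].astype("category")
    # Encoding raises UnicodeEncodeError for text that is not valid UTF-8
    # (for example, lone surrogate characters).
    out["content"].map(lambda text: text.encode("utf-8"))
    return out


def load(df: pd.DataFrame, repo: dict, asset_id: str) -> str:
    """Store the dataset under an identifier and return that identifier."""
    repo[asset_id] = df
    return asset_id


demo_repo = {}  # hypothetical in-memory stand-in for the dataset repository
selected = extract(raw, start="2021-01-01", end="2023-09-30")
demo_id = load(transform(selected), demo_repo, asset_id="reviews-demo")
print(demo_repo[demo_id].dtypes)
```

Keeping the steps as separate functions makes each one easy to test in isolation, which is the same separation the extract, transform, and load list describes.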

Then, we'll walk through the dataset's core attributes, including review lengths, engagement statistics, and category summaries, before inspecting a few sample observations to give a concrete sense of the data we’re working with. 

In [2]:
from discover.assets.review import Review
from discover.assets.idgen import AssetIDGen
from discover.container import DiscoverContainer
from discover.core.flow import DataPrepStageDef, PhaseDef
from discover.infra.config.flow import FlowConfigReader
from discover.flow.data_prep.ingest.stage import IngestStage

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
        "discover.flow.data_prep.dqa",
        "discover.analysis.base",
    ],
)

## AppVoCAI Reviews Dataset Selection and Ingestion
The following code loads the configuration for the selection and ingestion pipeline.

In [4]:
reader = FlowConfigReader()
config = reader.get_config("phases", namespace=False)
stage_config = config["dataprep"]["stages"]["ingest"]

Next, we build and run the ingestion pipeline. This will take a moment.

In [None]:
stage = IngestStage.build(stage_config=stage_config, force=False)
asset_id = stage.run()

Finally, we obtain the review dataset from the repository and instantiate the review object.

In [6]:
# Instantiate the repository
repo = container.repo.dataset_repo()
# Load the dataset from the repository
dataset = repo.get(asset_id)
# Instantiate the Review object for analysis.
reviews = Review(dataset=dataset)

## Dataset Overview


In [None]:
reviews.overview()

The AppVoCAI dataset offers a comprehensive view of user feedback in the mobile app ecosystem, capturing nearly 6 million reviews from over 5 million users. These reviews span 33,134 apps across 14 categories, providing a diverse range of insights into user experiences. With 12 different features to analyze, the dataset covers both qualitative and quantitative aspects of user interaction. The data spans a period from January 2021 to early September 2023, offering a rich temporal range to explore evolving trends and identify key areas for opportunity and product development. This dataset forms the foundation for uncovering unmet needs and understanding customer engagement across the app marketplace.
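
Counts like those quoted above can be derived with a few pandas aggregations. The following is a minimal sketch on toy data, assuming hypothetical `author`, `app_id`, `category`, and `date` columns; it is not the implementation of `reviews.overview()`.

```python
import pandas as pd

# Toy data standing in for the reviews; column names are assumptions.
df = pd.DataFrame(
    {
        "author": ["a1", "a2", "a1", "a3"],
        "app_id": ["x", "x", "y", "z"],
        "category": ["Games", "Games", "Health", "Health"],
        "date": pd.to_datetime(
            ["2021-01-05", "2022-06-30", "2023-09-01", "2021-03-15"]
        ),
    }
)

overview = {
    "reviews": len(df),                          # one row per review
    "users": df["author"].nunique(),             # distinct review authors
    "apps": df["app_id"].nunique(),              # distinct apps
    "categories": df["category"].nunique(),      # distinct app categories
    "features": df.shape[1],                     # number of columns
    "first_review": df["date"].min().date().isoformat(),
    "last_review": df["date"].max().date().isoformat(),
}
print(overview)
```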

### Dataset Structural Analysis

In [None]:
reviews.info()

The AppVoCAI dataset is structured into 12 columns, capturing essential information about the reviews and their context. Key columns include identifiers for the reviews and apps, along with app names and their respective categories.

The dataset contains detailed user-generated content such as the review text, author information, and engagement metrics like ratings, vote counts, and total votes received. It also includes a `review_length` column, which indicates the number of words in each review, and timestamps marking when the reviews were submitted. 
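
A word-count column such as `review_length` can be computed in one line. The sketch below assumes whitespace tokenization and a hypothetical `content` column holding the review text; the tokenization rule actually used for the dataset is not stated here.

```python
import pandas as pd

df = pd.DataFrame(
    {"content": ["Great app!", "Crashes every time I open it", "ok"]}
)

# Split each review on runs of whitespace and count the resulting tokens.
df["review_length"] = df["content"].str.split().str.len()
print(df)
```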

All columns are complete, with no missing values, and are stored efficiently using appropriate data types, such as the categorical type for app categories and integers for ratings and vote counts. The memory footprint is 472.8 MB, making the dataset manageable for analysis at scale.
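
To illustrate why the choice of data types matters for the memory footprint, the sketch below compares a toy DataFrame before and after casting. The data and column names are made up for the example, and the savings on the real dataset will differ.

```python
import pandas as pd

n = 100_000
df = pd.DataFrame(
    {
        "category": ["Games", "Health & Fitness", "Productivity", "Utilities"]
        * (n // 4),
        "rating": [5, 4, 1, 5] * (n // 4),
    }
)

# Confirm there are no missing values before casting.
missing = int(df.isna().sum().sum())

before = df.memory_usage(deep=True).sum()

compact = df.copy()
# Few distinct labels: store integer codes plus one copy of each label.
compact["category"] = compact["category"].astype("category")
# Ratings range from 1 to 5, so a single byte per value is enough.
compact["rating"] = compact["rating"].astype("int8")

after = compact.memory_usage(deep=True).sum()
print(f"missing values: {missing}")
print(f"{before / 1e6:.2f} MB -> {after / 1e6:.2f} MB")
```

Passing `deep=True` makes pandas account for the memory held by the string objects themselves rather than only the references to them.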

### Rating Analysis

In [None]:
reviews.distribution_plot(x="rating", title="Rating Distribution Analysis")

The distribution of ratings across the dataset is noticeably skewed toward higher ratings, and the standard deviation indicates considerable variability in user feedback. While the minimum rating is 1, most reviews are concentrated at the high end of the scale: at least half of the ratings are 5, since the 50th percentile, the 75th percentile, and the maximum are all 5. This suggests that users are more likely to give positive ratings, though a portion of the dataset carries lower ratings, as indicated by the 25th percentile.

The KDE and violin plots further emphasize this trend, showing the density of high ratings and the thinner distribution in the lower range. The violin plot also shows the shape of the distribution, reinforcing the concentration of reviews at the maximum rating.
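
To see how a median of 5 can coexist with a meaningful share of low ratings, the sketch below summarizes a synthetic rating sample that is skewed toward 5. The values are invented for illustration and are not drawn from the AppVoCAI dataset.

```python
import pandas as pd

# Synthetic ratings skewed toward 5; NOT taken from the AppVoCAI dataset.
ratings = pd.Series(
    [1] * 15 + [2] * 5 + [3] * 5 + [4] * 15 + [5] * 60, name="rating"
)

# describe() reports count, mean, std, min, 25%, 50%, 75%, and max.
summary = ratings.describe()
print(summary)
```

In this sample the 50th percentile, the 75th percentile, and the maximum all equal 5, while the 25th percentile falls below 5, the same pattern described for the real ratings.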

### Review Length Analysis

In [None]:
reviews.distribution_plot(
    x="review_length", title="Review Length Distribution Analysis"
)

### Vote Count Analysis

In [None]:
reviews.distribution_plot(x="vote_count", title="Vote Count Distribution Analysis")

In [None]:
reviews.frequency_distribution(x="vote_count", topn=10)

### Vote Sum Analysis

In [None]:
reviews.distribution_plot(x="vote_sum", title="Vote Sum Distribution Analysis")

In [None]:
reviews.frequency_distribution(x="vote_sum", topn=10)

### Review Date Analysis

In [None]:
reviews.distribution_plot(y="date", title="Review Date Analysis")

### App Review Frequency Analysis

In [None]:
reviews.frequency_distribution_plot(
    x="app_name", title="App Review Frequency Distribution Analysis"
)

In [None]:
reviews.frequency_plot(y="app_name", topn=20, title="Most Frequently Reviewed Apps")

### Review Author Frequency Analysis

In [None]:
reviews.frequency_distribution_plot(
    x="author", title="Review Author Frequency Analysis"
)

### Category Analysis

#### Category Rating Analysis

In [None]:
reviews.association_plot(x="rating", y="category", title="Category Rating Analysis")

#### Category Review Length Analysis

In [None]:
reviews.association_plot(
    x="review_length", y="category", title="Category Review Length Analysis"
)

#### Category Vote Count Analysis

In [None]:
reviews.association_plot(
    x="vote_count", y="category", title="Category Vote Count Analysis"
)

#### Category Vote Sum Analysis

In [None]:
reviews.association_plot(x="vote_sum", y="category", title="Category Vote Sum Analysis")

### Sample Observations

In [None]:
reviews.sample(n=5, random_state=55)

In [None]:
reviews.sample(n=1, random_state=22)