In [1]:
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings

warnings.filterwarnings("ignore")
FORCE = False

# AppVoCAI Dataset Overview
In this section, we provide an overview of the AppVoCAI dataset, beginning with key metrics, including the counts of reviews, apps, users, and categories, as well as the number of features and the dataset's temporal range. We then explore the dataset's structure, covering data types, measurement scales, and variable descriptions. Following this, we profile the dataset to assess its completeness, cardinality, uniqueness, and size. Next, we inspect a random sampling of observations to better understand its contents. Finally, we close with a special analysis of the App Store’s first-ever review from July 10, 2008, a pivotal moment in the history of mobile customer engagement.

## Imports

In [2]:
import pandas as pd
from discover.app.eda import EDA
from discover.assets.idgen import AssetIDGen
from discover.container import DiscoverContainer
from discover.core.flow import EnrichmentStageDef, PhaseDef
from discover.infra.utils.visual.print import Printer

pd.options.display.max_rows = 999

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
        "discover.app.base",
    ],
)

## Load Dataset
The enriched dataset asset is registered in the repository under the phase (Enrichment) and stage (Deviation Analysis) in which it was last created. We first obtain the asset ID, then retrieve the dataset from the repository and instantiate the EDA object.

In [4]:
idg = AssetIDGen()
asset_id = idg.get_asset_id(
    asset_type="dataset",
    phase=PhaseDef.ENRICHMENT,
    stage=EnrichmentStageDef.DEVIATION,
    name="review",
)

# Instantiate the repository
repo = container.repo.dataset_repo()
# Load the dataset from the repository
df = repo.get(asset_id, distributed=False).content
# Instantiate the EDA object for analysis.
eda = EDA(df=df)

In [5]:
eda.overview()



                           AppVoCAI Dataset Overview                            
                       Number of Reviews | 81,594
                      Number of Reviewrs | 81,399
              Number of Repeat Reviewers | 193
         Number of Influential Reviewers | 5,288
                          Number of Apps | 8,944
                    Number of Categories | 14
                                Features | 37
                        Memory Size (Mb) | 61.59
                    Date of First Review | 2020-01-01 00:25:16
                     Date of Last Review | 2023-09-03 01:46:54


The AppVoCAI dataset captures over 22 million reviews from nearly 16 million users. These reviews span some 36,377 apps across 14 categories. The enriched dataset contains 37 features and spans from July 10, 2008, the date the App Store launched, through early September 2023.
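As a rough illustration of how overview metrics of this kind can be derived, the sketch below computes them from a handful of toy records in plain Python. It is not the implementation behind `eda.overview()`; the record layout, the `overview` helper, and the definition of a repeat reviewer (an author with more than one review) are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Toy records standing in for the review table (values are illustrative only).
reviews = [
    {"id": "r1", "app_id": "a1", "category": "Games", "author": "u1",
     "date": datetime(2021, 3, 1, 9, 30)},
    {"id": "r2", "app_id": "a1", "category": "Games", "author": "u2",
     "date": datetime(2020, 6, 15, 18, 5)},
    {"id": "r3", "app_id": "a2", "category": "Health", "author": "u1",
     "date": datetime(2022, 11, 2, 7, 45)},
    {"id": "r4", "app_id": "a3", "category": "Health", "author": "u3",
     "date": datetime(2023, 1, 20, 23, 10)},
]


def overview(records):
    """Summarize counts and the temporal range of a list of review records."""
    reviews_per_author = Counter(r["author"] for r in records)
    dates = [r["date"] for r in records]
    return {
        "reviews": len(records),
        "reviewers": len(reviews_per_author),
        # Assumption: a repeat reviewer is an author with more than one review.
        "repeat_reviewers": sum(1 for n in reviews_per_author.values() if n > 1),
        "apps": len({r["app_id"] for r in records}),
        "categories": len({r["category"] for r in records}),
        "first_review": min(dates),
        "last_review": max(dates),
    }


summary = overview(reviews)
for key, value in summary.items():
    print(f"{key:>20} | {value}")
```

Counting distinct values with sets keeps each metric independent of row order, which is why the toy records are deliberately not sorted by date.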

## AppVoCAI Dataset Structure
The enriched AppVoCAI dataset consists of:

- **Core Variables**: The original set of 10 variables, including details on the app, category, author, rating, review content, vote counts, and date information.
- **Metadata**: Additional metadata features, such as review length and temporal attributes like review age, month, day, and hour.
- **Sentiment Analysis**: Sentiment scores ranging from -1 to 1, alongside classifications of 'negative,' 'neutral,' and 'positive,' reflecting the tone and satisfaction level expressed in the reviews.
- **Text Quality Analysis**: A composite text quality score based on lexical, syntactic, and perplexity measures, capturing coherence, richness, and complexity. Supplementary quality metrics include POS counts, diversity, intensity, and structural complexity to provide a balanced assessment.
- **Part-of-Speech (POS)**: POS tagging frequencies and proportions for parts of speech like nouns, verbs, adverbs, and adjectives, emphasizing patterns linked to high-quality reviews.
- **Statistics**: Indicators of review content quality, including character counts, unique word counts, and unique word proportions.
- **Deviation Analysis**: Measures of deviation for variables such as rating, review length, sentiment, and text quality, comparing them to category averages to highlight significant shifts or patterns that differ from overall trends.

In sum, the enriched AppVoCAI dataset contains the following 37 nominal, ordinal, interval, ratio, discrete, and continuous variables, representing multiple dimensions of the mobile app customer experience.

|    | Column                                   | DataType       | Measure    | Description                                                                                               |
|----|------------------------------------------|----------------|------------|-----------------------------------------------------------------------------------------------------------|
| 0  | id                                       | string[python] | Nominal    | Unique identifier for each review                                                                         |
| 1  | app_id                                   | string[python] | Nominal    | Unique identifier for each app                                                                            |
| 2  | app_name                                 | string[python] | Nominal    | The name of the app as listed in the App Store                                                            |
| 3  | category                                 | category       | Nominal    | The primary category to which the app is assigned                                                         |
| 4  | author                                   | string[python] | Nominal    | Anonymized author name                                                                                    |
| 5  | rating                                   | int16          | Interval   | Rating for the app in [1,5]                                                                               |
| 6  | content                                  | string[python] | Nominal    | The review content                                                                                        |
| 7  | vote_count                               | int64          | Discrete   | Number of votes, indicating a degree of review helpfulness                                                |
| 8  | vote_sum                                 | int64          | Discrete   | The sum of the votes                                                                                      |
| 9  | date                                     | datetime64[us] | Interval   | Date of review                                                                                            |
| 10 | enrichment_meta_review_age               | int32          | Ratio      | Review age relative to the latest review in the dataset.                                                  |
| 11 | enrichment_meta_review_length            | int32          | Discrete   | Number of words in the review                                                                             |
| 12 | enrichment_meta_review_month             | int32          | Ordinal    | The month the review was submitted                                                                        |
| 13 | enrichment_meta_review_day_of_week       | int32          | Nominal    | The day of week the review was submitted where 1=Monday and 7=Sunday                                      |
| 14 | enrichment_meta_review_hour              | int32          | Interval   | The hour the review was submitted in [0,23]                                                               |
| 15 | enrichment_sentiment                     | float64        | Continuous | The sentiment of the review in range -1 to 1.                                                             |
| 16 | enrichment_sentiment_classification      | string[python] | Discrete   | The review sentiment classification as 'negative', 'neutral', or 'positive'                               |
| 17 | enrichment_tqa_score_final               | float64        | Continuous | The review text quality analysis, combining lexical, syntactic, and perplexity-based measures.            |
| 18 | enrichment_pct_deviation_rating          | float64        | Ratio      | The deviation of the rating score from the average for the app                                            |
| 19 | enrichment_pct_deviation_review_length   | float64        | Ratio      | The deviation of review length, from the average for the app                                              |
| 20 | enrichment_pct_deviation_sentiment       | float64        | Ratio      | The deviation of review sentiment, from the average for the app                                           |
| 21 | enrichment_pct_deviation_tqa_score_final | float64        | Ratio      | The deviation of text quality score, from the average for the app                                         |
| 22 | pos_n_nouns                              | int32          | Discrete   | Number of nouns in the review                                                                             |
| 23 | pos_n_verbs                              | int32          | Discrete   | Number of verbs in the review                                                                             |
| 24 | pos_n_adjectives                         | int32          | Discrete   | Number of adjectives in the review                                                                        |
| 25 | pos_n_adverbs                            | int32          | Discrete   | Number of adverbs in the review                                                                           |
| 26 | pos_p_nouns                              | float64        | Ratio      | Proportion of nouns in the review                                                                         |
| 27 | pos_p_verbs                              | float64        | Ratio      | Proportion of verbs in the review                                                                         |
| 28 | pos_p_adjectives                         | float64        | Ratio      | Proportion of adjectives in the review                                                                    |
| 29 | pos_p_adverbs                            | float64        | Ratio      | Proportion of adverbs in the review                                                                       |
| 30 | stats_char_count                         | int32          | Discrete   | Number of characters in the review                                                                        |
| 31 | stats_unique_word_count                  | int32          | Discrete   | Number of unique words in the review                                                                      |
| 32 | stats_unique_word_proportion             | float64        | Ratio      | Proportion of unique words in the review                                                                  |
| 33 | tqm_pos_count_score                      | float64        | Continuous | The part-of-speech count score                                                                            |
| 34 | tqm_pos_diversity_score                  | float32        | Continuous | The part-of-speech diversity score                                                                        |
| 35 | tqm_structural_complexity_score          | float32        | Continuous | The structural complexity score                                                                           |
| 36 | tqm_pos_intensity_score                  | float32        | Continuous | The part-of-speech intensity score                                                                        |
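The `enrichment_pct_deviation_*` columns describe how far a value sits from its group average. The enrichment pipeline's exact formula is not shown here, so the sketch below assumes a simple percentage deviation, `(value - group_mean) / group_mean * 100`, grouped by `app_id`. The helper name `add_pct_deviation` and the toy rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Toy rows; only the grouping key and the measured value matter here.
rows = [
    {"app_id": "a1", "rating": 5},
    {"app_id": "a1", "rating": 3},
    {"app_id": "a1", "rating": 4},
    {"app_id": "a2", "rating": 2},
    {"app_id": "a2", "rating": 2},
]


def add_pct_deviation(records, value_key, group_key):
    """Return copies of the records with a percentage deviation from the group mean."""
    values_by_group = defaultdict(list)
    for record in records:
        values_by_group[record[group_key]].append(record[value_key])
    group_means = {group: mean(values) for group, values in values_by_group.items()}

    enriched = []
    for record in records:
        average = group_means[record[group_key]]
        # Assumed formula; a zero group mean is mapped to 0.0 to avoid division by zero.
        deviation = (record[value_key] - average) / average * 100 if average else 0.0
        enriched.append({**record, f"pct_deviation_{value_key}": deviation})
    return enriched


enriched_rows = add_pct_deviation(rows, value_key="rating", group_key="app_id")
for row in enriched_rows:
    print(row)
```

Passing a different `group_key` (for example a category column) would compute the deviation against category averages instead; which grouping the pipeline uses should be checked against its source.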



### Variable Types and Measurement
#### Numeric Types
The dataset features a range of numeric data and measurement types, each serving a distinct purpose in capturing user interactions and app performance:
1. **Discrete**: Variables such as `vote_count`, `vote_sum`, `pos_n_nouns`, and `enrichment_meta_review_length` are discrete numeric variables that take on integer values.
2. **Interval**: Interval data are ordered quantitative data measured on a scale with equal, meaningful intervals but no true zero, such as `rating` and `date`.
3. **Continuous**: Continuous data are measured rather than counted and can take on any value within a range, such as `enrichment_sentiment`, `enrichment_tqa_score_final`, and `tqm_structural_complexity_score`.
4. **Ratio**: Ratio data are quantitative data with equal distances between adjacent values and a true zero, such as `enrichment_meta_review_age`, `enrichment_pct_deviation_rating`, and `pos_p_nouns`, the proportion of nouns.

#### Categorical Types 
1. **Ordinal**: A qualitative categorical value, such as `enrichment_meta_review_month`, with a natural order; the distances between categories are neither known nor assumed to be equal.
2. **String (Nominal)**: Used for identifiers such as `id`, `app_id`, `app_name`, `author`, and `content`. These represent categorical data without any inherent order or numerical value.
3. **Category (Nominal)**: The app `category` variable uses the category data type, which stores a variable that takes on a limited, fixed number of possible values and is optimized for memory efficiency and processing speed.
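One lightweight way to keep these measurement scales available during analysis is a lookup table keyed by column name. The sketch below copies a subset of the column-to-measure assignments from the structure table; the helper names and the choice of which scales count as numeric are assumptions for illustration.

```python
from collections import defaultdict

# Measurement scale per column, copied from a subset of the structure table.
MEASURES = {
    "category": "Nominal",
    "rating": "Interval",
    "vote_count": "Discrete",
    "date": "Interval",
    "enrichment_meta_review_age": "Ratio",
    "enrichment_meta_review_month": "Ordinal",
    "enrichment_sentiment": "Continuous",
    "pos_p_nouns": "Ratio",
}

# Assumption: these scales are treated as numeric; the rest as categorical.
NUMERIC_SCALES = {"Discrete", "Interval", "Continuous", "Ratio"}


def columns_by_measure(measures):
    """Group column names by their measurement scale, preserving insertion order."""
    grouped = defaultdict(list)
    for column, measure in measures.items():
        grouped[measure].append(column)
    return dict(grouped)


def is_numeric(column, measures=MEASURES):
    """Return True when the column's scale is in the numeric group."""
    return measures[column] in NUMERIC_SCALES


grouped = columns_by_measure(MEASURES)
print(grouped["Interval"])
print(is_numeric("rating"), is_numeric("category"))
```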

#### Rating as Interval
Our decision to treat rating as an interval type may raise eyebrows among traditionalists in data analysis pedagogy and measurement theory. Since Harvard psychology professor Dr. Stanley Smith Stevens proposed his taxonomy in 1946 {cite}`stevensTheoryScalesMeasurement1946a`, the orthodoxy surrounding data types has been well established. In this framework, ratings are ordinal measurements: they have an inherent order but lack consistent intervals between categories. This view aligns with the theoretical justification that ordinal scales provide a ranking of values without implying specific quantitative differences between them. Under this view, constructs such as 'average rating' have no mathematical interpretation.

Our departure from this view leverages the inherent properties of interval scales, including equal intervals and meaningful arithmetic operations. Interval scales exhibit the property of equal intervals: the difference between any two consecutive points on the scale is constant. Mathematically, for the ordered scale points $r_1 < r_2 < \dots < r_5$, this implies $r_{i+1} - r_i = r_{j+1} - r_j$ for all $i, j \in \{1, \dots, 4\}$. Because of this, arithmetic operations such as addition and averaging are meaningful, and more sophisticated statistical techniques, such as regression and correlation analysis, can be applied to facilitate a deeper understanding of user feedback. Moreover, the widespread adoption of average ratings by industry-leading platforms (such as Amazon, IMDb, and Yelp) as a key metric for summarizing user feedback and comparing products, services, or categories underscores the practical acceptance of treating ratings as interval data. By treating ratings as interval data, we align with common industry practice and can use the full range of statistical tools available for numerical data.

Our departure from the conventional treatment of ratings as ordinal measurements is not without precedent. Inspired by the critiques of Velleman, Wilkinson, Rozeboom, and others, we have adopted the perspective that "the scale type of data may be determined in part by the questions we ask of the data or the purposes for which we intend it" {cite}`vellemanNominalOrdinalInterval1993a`. For those curious about the rationale behind our methodology, we invite you to explore {ref}`rating_as_interval`.
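To make the practical difference concrete, the short sketch below uses made-up ratings to check the equal-interval assumption on the 1 to 5 scale and to compare the mean (meaningful only under the interval treatment) with the median (available under an ordinal reading as well). The rating values are illustrative, not drawn from the dataset.

```python
from statistics import mean, median

# Illustrative ratings on the 1-5 star scale (not taken from the dataset).
ratings = [5, 4, 4, 1, 5, 3, 5, 2]
scale = [1, 2, 3, 4, 5]

# Equal-interval check: every step between consecutive scale points is the same.
steps = [upper - lower for lower, upper in zip(scale, scale[1:])]
equal_intervals = len(set(steps)) == 1

# The mean relies on the interval assumption; the median needs only the ordering.
average_rating = mean(ratings)
median_rating = median(ratings)

print(f"equal intervals: {equal_intervals}")
print(f"average rating:  {average_rating}")
print(f"median rating:   {median_rating}")
```

The numeric labels 1 through 5 are equally spaced by construction; the interval treatment is the further assumption that the perceived distance between adjacent star levels is equal as well.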

## AppVoCAI Dataset Profile
The profile summarizes the data and aspects of data quality in terms of:

- **Column**: Represents the column names in the dataset.
- **DataType**: Indicates the data type of each column.
- **Complete**: Displays the count of complete cases (non-null values) in each column.
- **Null**: Shows the count of null values in each column.
- **Completeness**: Represents the completeness of each column, calculated as the ratio of complete cases to the total number of cases.
- **Unique**: Indicates the count of unique values in each column.
- **Duplicate**: Displays the count of duplicate values in each column.
- **Uniqueness**: Represents the uniqueness of values within each column, calculated as the ratio of unique values to the total number of cases.
- **Size**: Represents the size of the column in bytes.
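The ratios defined above are simple to compute by hand. The following is a minimal plain-Python sketch for a single column, not the code behind `eda.info()`. It assumes `None` marks a missing value and counts duplicates among non-null values only; in pandas, `Series.notna().sum()` and `Series.nunique()` would give the complete and unique counts.

```python
def profile_column(name, values):
    """Compute completeness and uniqueness statistics for one column of values."""
    total = len(values)
    present = [value for value in values if value is not None]
    complete = len(present)
    unique = len(set(present))
    return {
        "Column": name,
        "Complete": complete,
        "Null": total - complete,
        "Completeness": complete / total if total else 0.0,
        "Unique": unique,
        # Assumption: duplicates are non-null values beyond the first occurrence.
        "Duplicate": complete - unique,
        "Uniqueness": unique / total if total else 0.0,
    }


# Toy column with one missing value.
category = ["Games", "Games", "Health", None, "Health", "Games"]
profile = profile_column("category", category)
for key, value in profile.items():
    print(f"{key:>12}: {value}")
```

When a column has no missing values, as in the profile reported in this section, `Duplicate` reduces to the total number of rows minus `Unique`.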

In [6]:
structure = eda.info()
structure

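|    | Column     | DataType       | Complete | Null | Completeness | Unique | Duplicate | Uniqueness | Size (Bytes) |
|----|------------|----------------|----------|------|--------------|--------|-----------|------------|--------------|
| 0  | id         | string[python] | 81594    | 0    | 1.0          | 81594  | 0         | 1.0        | 5472144      |
| 1  | app_id     | string[python] | 81594    | 0    | 1.0          | 8944   | 72650     | 0.109616   | 5423467      |
| 2  | app_name   | string[python] | 81594    | 0    | 1.0          | 8940   | 72654     | 0.109567   | 6609661      |
| 3  | category   | category       | 81594    | 0    | 1.0          | 14     | 81580     | 0.000172   | 83093        |
| 4  | author     | string[python] | 81594    | 0    | 1.0          | 81399  | 195       | 0.99761    | 6282738      |
| 5  | rating     | Int16          | 81594    | 0    | 1.0          | 5      | 81589     | 6.1e-05    | 244782       |
| 6  | content    | string[python] | 81594    | 0    | 1.0          | 76737  | 4857      | 0.940474   | 18655099     |
| 7  | vote_count | Int64          | 81594    | 0    | 1.0          | 73     | 81521     | 0.000895   | 734346       |
| 8  | vote_sum   | Int64          | 81594    | 0    | 1.0          | 59     | 81535     | 0.000723   | 734346       |
| 9  | date       | datetime64[us] | 81594    | 0    | 1.0          | 81533  | 61        | 0.999252   | 652752       |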


In [None]:
print(
    f"Size of dataset in memory: {round(structure['Size (Bytes)'].sum()/(1024*1024),2)} Mb"
)

This dataset presents several important characteristics that make it suitable for exploratory analysis:

1. **Completeness**: All columns have a completeness score of 1.0, indicating that there are no missing values throughout the dataset. This high data integrity provides a strong foundation for accurate and comprehensive analyses.

2. **Low-Cardinality Variables**: Several categorical variables have low cardinality, making them good candidates for segmentation and clustering. These include:
   - **Category**: With only 14 unique categories, this variable is well-suited for analyzing app market segments and identifying usage or sentiment trends across app types.
   - **Rating**: The presence of only 5 unique values allows for clear and interpretable distribution and sentiment analyses.
   - **Review Month, Day of Week, and Hour**: These temporal variables are also low in cardinality, which facilitates clustering analyses based on usage patterns, such as peak review times or seasonal trends.
   - **Sentiment Classification**: With only 3 unique classifications, sentiment analysis becomes streamlined, making it easier to identify general trends in user feedback.

3. **High-Cardinality Variables**: In contrast, columns like `id` and `author` have very high cardinality and uniqueness, which is expected, as they identify unique reviews and authors. While essential for tracking and linking reviews, these columns may not be the focus of deeper analytical modeling but are useful for ensuring data granularity.

4. **Textual and POS Features**: Variables related to text characteristics (e.g., `content`, `stats_char_count`, `stats_unique_word_count`) and Part-of-Speech (POS) counts offer rich analytical opportunities. These features can be instrumental in assessing text quality, complexity, and sentiment, which are crucial for understanding user feedback nuances.

5. **Enrichment Scores and Deviations**: The enrichment and text quality metric columns, such as `tqm_pos_count_score` and `enrichment_pct_deviation_*`, provide additional layers of analysis. These metrics could be valuable for exploring how deviations in ratings or review lengths correlate with sentiment or overall app performance.
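The low- versus high-cardinality distinction can be made explicit by banding columns on their uniqueness ratio. In the sketch below, the ratios are copied from the profile output in this section, while the cut-off values and the `cardinality_band` helper are hypothetical choices for illustration, not thresholds used by the project.

```python
# Uniqueness ratios copied from the profile output in this section.
uniqueness = {
    "id": 1.0,
    "app_id": 0.109616,
    "category": 0.000172,
    "author": 0.99761,
    "rating": 6.1e-05,
    "content": 0.940474,
}

# Hypothetical cut-offs; tune them to the analysis at hand.
LOW_MAX = 0.01
HIGH_MIN = 0.90


def cardinality_band(ratio, low_max=LOW_MAX, high_min=HIGH_MIN):
    """Label a uniqueness ratio as low, medium, or high cardinality."""
    if ratio <= low_max:
        return "low"
    if ratio >= high_min:
        return "high"
    return "medium"


bands = {column: cardinality_band(ratio) for column, ratio in uniqueness.items()}
for column, band in bands.items():
    print(f"{column:>10}: {band}")
```

Under these cut-offs `category` and `rating` fall in the low band, `id`, `author`, and `content` in the high band, and `app_id` in between.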

In essence, this dataset’s completeness and its mix of low- and high-cardinality features enable a diverse range of analyses, from straightforward sentiment distribution to more advanced clustering and trend detection. The well-structured categorical variables facilitate segmentation, while the detailed enrichment scores and text metrics provide a strong basis for in-depth quality and sentiment analysis.
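As a small example of the segmentation that a low-cardinality variable supports, the sketch below groups toy (category, rating) pairs and reports the review count and mean rating per category. The data and the `rating_by_segment` helper are illustrative assumptions; on the actual DataFrame the same summary would typically be a pandas `groupby` on `category`.

```python
from collections import defaultdict
from statistics import mean

# Toy (category, rating) pairs; values are illustrative only.
pairs = [
    ("Games", 5),
    ("Games", 3),
    ("Health", 4),
    ("Health", 2),
    ("Health", 3),
    ("Finance", 1),
]


def rating_by_segment(records):
    """Return the review count and mean rating for each segment, sorted by name."""
    ratings_by_segment = defaultdict(list)
    for segment, rating in records:
        ratings_by_segment[segment].append(rating)
    return {
        segment: {"n": len(ratings), "mean_rating": mean(ratings)}
        for segment, ratings in sorted(ratings_by_segment.items())
    }


segments = rating_by_segment(pairs)
for segment, stats in segments.items():
    print(f"{segment:>8}: n={stats['n']}, mean rating={stats['mean_rating']}")
```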

## AppVoCAI Data Sample