In [1]:
import os

if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings

warnings.filterwarnings("ignore")
FORCE = False

# AppVoCAI Dataset Overview
In this section, we provide an overview of the AppVoCAI dataset, beginning with key metrics: the counts of reviews, apps, users, and categories, the number of features, and the dataset's temporal range. We then explore the dataset's structure, covering data types, measurement scales, and variable descriptions. Following this, we profile the dataset to assess its completeness, cardinality, uniqueness, and size. Next, we inspect a random sample of observations to better understand its contents. Finally, we close by contrasting the first and last reviews in the dataset.

## Imports

In [None]:
import pandas as pd
from discover.app.eda import EDA
from discover.assets.idgen import AssetIDGen
from discover.container import DiscoverContainer
from discover.core.flow import EnrichmentStageDef, PhaseDef
from discover.infra.utils.visual.print import Printer

pd.options.display.max_rows = 999

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
        "discover.app.base",
    ],
)

## Load Dataset
The enriched dataset asset has been registered in the repository in the phase (Enrichment) and stage (Deviation Analysis) in which it was last created. Once we obtain the asset id, we extract the dataset from the repository and instantiate the EDA object.

In [4]:
idg = AssetIDGen()
asset_id = idg.get_asset_id(
    asset_type="dataset",
    phase=PhaseDef.ENRICHMENT,
    stage=EnrichmentStageDef.DEVIATION,
    name="review",
)

# Instantiate the repository
repo = container.repo.dataset_repo()
# Load the dataset from the repository
df = repo.get(asset_id, distributed=False).content
# Instantiate the EDA object for analysis.
eda = EDA(df=df)

## AppVoCAI Dataset Key Characteristics
The key characteristics include the number of reviews, reviewers, apps, and categories. We also indicate the number of features in the dataset, its size, and temporal range. 

In [5]:
eda.overview()



                 AppVoCAI Dataset Overview and Characteristics                  
                       Number of Reviews | 81,594
                      Number of Reviewrs | 81,399
              Number of Repeat Reviewers | 193
         Number of Influential Reviewers | 5,288
                          Number of Apps | 8,944
                    Number of Categories | 14
                                Features | 37
                        Memory Size (Mb) | 61.59
                    Date of First Review | 2020-01-01 00:25:16
                     Date of Last Review | 2023-09-03 01:46:54


The full AppVoCAI dataset captures over 22 million reviews from nearly 16 million users, spanning some 36,377 apps across 14 categories, from July 10, 2008, the date the App Store launched, through early September 2023. The overview above describes the data loaded in this notebook: 81,594 reviews from 81,399 reviewers, covering 8,944 apps across 14 categories, with 37 features and a date range of January 1, 2020 through September 3, 2023. A small percentage of reviewers (5,288) are considered influential, indicated by a non-zero review vote count, and fewer still (193) have written more than one review.
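
The key metrics above can be sketched in plain pandas. This is a minimal illustration, not the implementation behind `eda.overview()`: the function name `overview` and the toy data are hypothetical, and the column names are taken from the variable table in this section.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> dict:
    """Summarize a review table (hypothetical helper, not the EDA class)."""
    reviews_per_author = df["author"].value_counts()
    return {
        "reviews": len(df),
        "reviewers": df["author"].nunique(),
        # Repeat reviewers: authors with more than one review.
        "repeat_reviewers": int((reviews_per_author > 1).sum()),
        # Influential reviewers: authors with a non-zero vote count on a review.
        "influential_reviewers": df.loc[df["vote_count"] > 0, "author"].nunique(),
        "apps": df["app_id"].nunique(),
        "categories": df["category"].nunique(),
        "features": df.shape[1],
        "first_review": df["date"].min(),
        "last_review": df["date"].max(),
    }


# Toy data for illustration only; not drawn from AppVoCAI.
toy = pd.DataFrame(
    {
        "author": ["a", "a", "b", "c"],
        "app_id": ["1", "2", "1", "3"],
        "category": ["Utilities", "Business", "Utilities", "Lifestyle"],
        "vote_count": [0, 3, 0, 1],
        "date": pd.to_datetime(
            ["2020-01-01", "2021-06-07", "2022-03-05", "2023-09-03"]
        ),
    }
)
summary = overview(toy)
print(summary)
```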

## AppVoCAI Dataset Structure
The enriched AppVoCAI dataset consists of:

- **Core Variables**: The original set of 10 variables, including details on the app, category, author, rating, review content, vote counts, and date information.
- **Lexical Features**: Lexical features include review length in words, character count, and the number and proportion of unique words in the review text.
- **Temporal Features**: Temporal features include the review date, its age relative to the last review date in the dataset, and the month, day of week, and hour the review was submitted.
- **Sentiment Analysis**: Sentiment scores ranging from -1 to 1, alongside classifications of 'negative,' 'neutral,' and 'positive,' reflecting the tone and satisfaction level expressed in the reviews.
- **Text Quality Analysis**: A composite text quality score based on lexical, syntactic, and perplexity-based measures (see {ref}`appendix:tqs`), capturing coherence, richness, and complexity. Supplementary quality metrics include POS count, diversity, intensity, and structural complexity scores to provide a balanced assessment.
- **Syntactic Features**: Part-of-Speech (POS) tags indicate the frequencies and proportions of parts of speech like nouns, verbs, adverbs, and adjectives, emphasizing patterns linked to high-quality reviews.
- **Deviation Analysis**: Measures of deviation for variables such as rating, review length, sentiment, and text quality, comparing them to category averages to highlight significant shifts or patterns that differ from overall trends.

In sum, the enriched AppVoCAI dataset contains the following 37 variables, spanning nominal, ordinal, interval, ratio, discrete, and continuous measures and representing multiple dimensions of the mobile app customer experience.

|    | Column                                   | DataType       | Measure    | Description                                                                                               |
|----|------------------------------------------|----------------|------------|-----------------------------------------------------------------------------------------------------------|
| 0  | id                                       | string[python] | Nominal    | Unique identifier for each review                                                                         |
| 1  | app_id                                   | string[python] | Nominal    | Unique identifier for each app                                                                            |
| 2  | app_name                                 | string[python] | Nominal    | The name of the app as listed in the App Store                                                            |
| 3  | category                                 | category       | Nominal    | The primary category to which the app is assigned                                                         |
| 4  | author                                   | string[python] | Nominal    | Anonymized author name                                                                                    |
| 5  | rating                                   | int16          | Interval   | Rating for the app in [1,5]                                                                               |
| 6  | content                                  | string[python] | Nominal    | The review content                                                                                        |
| 7  | vote_count                               | int64          | Discrete   | Number of votes, indicating a degree of review helpfulness                                                |
| 8  | vote_sum                                 | int64          | Discrete   | The sum of the votes                                                                                      |
| 9  | date                                     | datetime64[us] | Interval   | Date of review                                                                                            |
| 10 | quant_review_age                         | int32          | Ratio      | Review age in days relative to the latest review in the dataset.                                          |
| 11 | quant_review_length                      | int32          | Discrete   | Number of words in the review                                                                             |
| 12 | quant_review_month                       | int32          | Ordinal    | The month the review was submitted                                                                        |
| 13 | quant_review_day_of_week                 | int32          | Nominal    | The day of week the review was submitted where 1=Sunday and 7=Saturday                                    |
| 14 | quant_review_hour                        | int32          | Interval   | The hour the review was submitted in [0,23]                                                               |
| 15 | sentiment                                | float64        | Continuous | The sentiment of the review in range -1 to 1.                                                             |
| 16 | sentiment_classification                 | string[python] | Discrete   | The review sentiment classification as 'negative', 'neutral', or 'positive'                               |
| 17 | enrichment_tqa_score_final               | float64        | Continuous | The review text quality analysis, combining lexical, syntactic, and perplexity-based measures.            |
| 18 | enrichment_pct_deviation_rating          | float64        | Ratio      | The deviation of the rating score from the average for the app                                            |
| 19 | enrichment_pct_deviation_review_length   | float64        | Ratio      | The deviation of review length from the average for the app                                               |
| 20 | enrichment_pct_deviation_sentiment       | float64        | Ratio      | The deviation of review sentiment from the average for the app                                            |
| 21 | enrichment_pct_deviation_tqa_score_final | float64        | Ratio      | The deviation of text quality score from the average for the app                                          |
| 22 | pos_n_nouns                              | int32          | Discrete   | Number of nouns in the review                                                                             |
| 23 | pos_n_verbs                              | int32          | Discrete   | Number of verbs in the review                                                                             |
| 24 | pos_n_adjectives                         | int32          | Discrete   | Number of adjectives in the review                                                                        |
| 25 | pos_n_adverbs                            | int32          | Discrete   | Number of adverbs in the review                                                                           |
| 26 | pos_p_nouns                              | float64        | Ratio      | Proportion of nouns in the review                                                                         |
| 27 | pos_p_verbs                              | float64        | Ratio      | Proportion of verbs in the review                                                                         |
| 28 | pos_p_adjectives                         | float64        | Ratio      | Proportion of adjectives in the review                                                                    |
| 29 | pos_p_adverbs                            | float64        | Ratio      | Proportion of adverbs in the review                                                                       |
| 30 | stats_char_count                         | int32          | Discrete   | Number of characters in the review                                                                        |
| 31 | stats_unique_word_count                  | int32          | Discrete   | Number of unique words in the review                                                                      |
| 32 | stats_unique_word_proportion             | float64        | Ratio      | Proportion of unique words in the review                                                                  |
| 33 | tqm_pos_count_score                      | float64        | Continuous | The part-of-speech count score                                                                            |
| 34 | tqm_pos_diversity_score                  | float32        | Continuous | The part-of-speech diversity score                                                                        |
| 35 | tqm_structural_complexity_score          | float32        | Continuous | The structural complexity score                                                                           |
| 36 | tqm_pos_intensity_score                  | float32        | Continuous | The part-of-speech intensity score                                                                        |
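
The `enrichment_pct_deviation_*` columns compare each review with a group average. The exact formula is not given in this section, so the sketch below assumes a relative deviation, `(value - group mean) / group mean`, and takes the grouping column as a parameter. The helper `pct_deviation` and the toy data are hypothetical.

```python
import pandas as pd


def pct_deviation(df: pd.DataFrame, value: str, group: str) -> pd.Series:
    """Relative deviation of `value` from its group mean (assumed formula)."""
    group_mean = df.groupby(group)[value].transform("mean")
    return (df[value] - group_mean) / group_mean


# Toy data for illustration only; not drawn from AppVoCAI.
toy = pd.DataFrame(
    {
        "app_id": ["1", "1", "2", "2"],
        "rating": [5, 3, 2, 2],
    }
)
toy["pct_deviation_rating"] = pct_deviation(toy, value="rating", group="app_id")
print(toy)
```

Using `transform("mean")` keeps the result aligned with the original rows, so the deviation can be assigned directly as a new column. Swapping `group="app_id"` for `group="category"` would compare each review with its category average instead.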



### Variable Types and Measurement
#### Numeric Types
The dataset features a range of numeric data and measurement types, each serving a distinct purpose in capturing user interactions and app performance:
1. **Discrete**: Variables such as `vote_count`, `vote_sum`, `pos_n_nouns`, and `quant_review_length` are discrete numeric variables that take on integer values.
2. **Interval**: Interval data are ordered quantitative data measured on a scale on which the distances between adjacent values are equal and meaningful. Interval variables include `rating` and `date`.
3. **Continuous**: Continuous data are measured rather than counted and can take on any value within a range. Examples include `sentiment` and quality measures such as `enrichment_tqa_score_final` and `tqm_structural_complexity_score`.
4. **Ratio**: Ratio data are quantitative and measured on a continuous scale with a true zero value and equal distances between adjacent values. Examples include `quant_review_age`, `enrichment_pct_deviation_rating`, and `pos_p_nouns`, the proportion of nouns.

#### Categorical Types 
1. **Ordinal**: A qualitative categorical value, such as `quant_review_month`, with a natural order; however, the distances between the categories are not known, nor are they assumed to be equal.
2. **String (Nominal)**: Used for identifiers such as `id`, `app_id`, `app_name`, `author`, and `content`. These represent categorical data without any inherent order or numerical value.
3. **Category (Nominal)**: The app `category` variable uses the pandas category data type, which suits variables that take on a limited, fixed number of possible values. pandas stores each value as a small integer code that points to a shared list of labels, which reduces memory use and speeds up operations such as grouping.
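
To illustrate the efficiency of the category data type, the sketch below compares the memory footprint of the same labels stored as Python objects and as a pandas categorical. The labels and the row count are made up for the example; actual savings depend on the number of distinct values and the length of the strings.

```python
import pandas as pd

# Four hypothetical category labels repeated to 100,000 rows.
labels = ["Utilities", "Business", "Lifestyle", "Food & Drink"]
as_object = pd.Series(labels * 25_000, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the bytes held by the Python string objects themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
print(f"distinct labels: {len(as_category.cat.categories)}")
```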

#### Rating as Interval
Our decision to treat rating as an interval type may raise eyebrows among traditionalists in data analysis pedagogy and measurement theory. Since Harvard psychology professor Dr. Stanley Smith Stevens proposed his taxonomy in 1946 {cite}`stevensTheoryScalesMeasurement1946a`, the orthodoxy surrounding data types has been well established. Under this framework, ratings are ordinal: they are characterized by their inherent order but lack consistent intervals between categories. This view aligns with the theoretical justification that ordinal scales provide a ranking of values without implying specific quantitative differences between them. Consequently, constructs such as 'average rating' have no mathematical interpretation.

Our departure from this view leverages the properties of interval scales, namely equal intervals and meaningful arithmetic operations. Interval scales exhibit the property of equal intervals, where the difference between any two consecutive points on the scale remains constant. Mathematically, for consecutive scale points $r_i$ and $r_{i+1}$, this implies that $r_{i+1} - r_i = r_{j+1} - r_j \; \forall \; i, j$. Because of this, arithmetic operations such as addition and averaging are meaningful, and more sophisticated statistical techniques, such as regression and correlation analysis, can be applied to facilitate a deeper understanding of user feedback. Moreover, the widespread adoption of average ratings by industry-leading platforms (such as Amazon, IMDb, and Yelp) as a key metric for summarizing user feedback and comparing different products, services, or categories underscores the practical acceptance of treating ratings as interval data. By treating ratings as interval data, we align with common industry practice and gain access to the full range of statistical tools available for numerical data.

Our departure from the conventional treatment of ratings as ordinal measurements is not without precedent. Inspired by the critiques of Velleman, Wilkinson, Rozeboom, and others, we have adopted a perspective that "the scale type of data may be determined in part by the questions we ask of the data or the purposes for which we intend it" {cite}`vellemanNominalOrdinalInterval1993a`. For those curious about the rationale behind our methodology, we invite you to explore {ref}`rating_as_interval`.

## AppVoCAI Dataset Profile
The profile summarizes the data and aspects of data quality in terms of:

- **Column**: Represents the column names in the dataset.
- **DataType**: Indicates the data type of each column.
- **Complete**: Displays the count of complete cases (non-null values) in each column.
- **Null**: Shows the count of null values in each column.
- **Completeness**: Represents the completeness of each column, calculated as the ratio of complete cases to the total number of cases.
- **Unique**: Indicates the count of unique values in each column.
- **Duplicate**: Displays the count of duplicate values in each column.
- **Uniqueness**: Represents the uniqueness of values within each column, calculated as the ratio of unique values to the total number of cases.
- **Size**: Represents the size of the column in bytes.
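
The profile metrics defined above can be approximated in a few lines of pandas. This is a sketch under assumptions, not the code behind `eda.info()`: the helper `profile` is hypothetical, and duplicates are counted as complete values minus unique values, which equals total rows minus unique values when a column has no nulls.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column completeness, uniqueness, and size (hypothetical helper)."""
    n = len(df)
    complete = df.notna().sum()  # non-null values per column
    unique = df.nunique()  # distinct non-null values per column
    report = pd.DataFrame(
        {
            "DataType": df.dtypes.astype(str),
            "Complete": complete,
            "Null": n - complete,
            "Completeness": complete / n,
            "Unique": unique,
            "Duplicate": complete - unique,
            "Uniqueness": unique / n,
            "Size (Bytes)": df.memory_usage(deep=True, index=False),
        }
    )
    return report.rename_axis("Column").reset_index()


# Toy data for illustration only; one rating is missing on purpose.
toy = pd.DataFrame(
    {
        "id": ["r1", "r2", "r3", "r4"],
        "rating": [5, 5, 1, None],
    }
)
report = profile(toy)
print(report)
```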

In [6]:
structure = eda.info()
structure

Unnamed: 0,Column,DataType,Complete,Null,Completeness,Unique,Duplicate,Uniqueness,Size (Bytes)
0,id,string[python],81594,0,1.0,81594,0,1.0,5472144
1,app_id,string[python],81594,0,1.0,8944,72650,0.109616,5423467
2,app_name,string[python],81594,0,1.0,8940,72654,0.109567,6609661
3,category,category,81594,0,1.0,14,81580,0.000172,83093
4,author,string[python],81594,0,1.0,81399,195,0.99761,6282738
5,rating,Int16,81594,0,1.0,5,81589,6.1e-05,244782
6,content,string[python],81594,0,1.0,76737,4857,0.940474,18655099
7,vote_count,Int64,81594,0,1.0,73,81521,0.000895,734346
8,vote_sum,Int64,81594,0,1.0,59,81535,0.000723,734346
9,date,datetime64[us],81594,0,1.0,81533,61,0.999252,652752


In [7]:
print(
    f"Size of dataset in memory: {round(structure['Size (Bytes)'].sum()/(1024*1024),2)} Mb"
)

Size of dataset in memory: 61.67 Mb


This dataset presents several important characteristics worth noting:

1. **Completeness**: All columns have a completeness score of 1.0, indicating that there are no missing values throughout the dataset. This high data integrity provides a strong foundation for accurate and comprehensive analyses.

2. **Low-Cardinality Variables**: Several categorical variables have low cardinality, making them excellent candidates for segmentation and clustering. These include:
   - **Category**: With only 14 unique categories, this variable is well-suited for analyzing app market segments and identifying usage or sentiment trends across app types.
   - **Rating**: The presence of only 5 unique values allows for clear and interpretable distribution and sentiment analyses.
   - **Review Month, Day of Week, and Hour**: These temporal variables are also low in cardinality, which facilitates clustering analyses based on usage patterns, such as peak review times or seasonal trends.
   - **Sentiment Classification**: With only 3 unique classifications, sentiment analysis becomes streamlined, making it easier to identify general trends in user feedback.

3. **High-Cardinality Variables**: In contrast, columns like `id` and `author` have very high cardinality and uniqueness, which is expected, as they identify unique reviews and authors. These columns are essential for tracking and linking reviews and for confirming that each row is a distinct observation, but they are unlikely to serve as modeling features.

4. **Textual and POS Features**: Variables related to text characteristics (e.g., `content`, `stats_char_count`, `stats_unique_word_count`) and Part-of-Speech (POS) counts offer rich analytical opportunities. These features can be instrumental in assessing text quality, complexity, and sentiment, which are crucial for understanding user feedback nuances.

5. **Enrichment Scores and Deviations**: The enrichment and text quality metric columns, such as `tqm_pos_count_score` and `enrichment_pct_deviation_*`, provide additional layers of analysis. These metrics could be valuable for exploring how deviations in ratings or review lengths correlate with sentiment or overall app performance.

Overall, this dataset's completeness and its mix of low- and high-cardinality features support a diverse range of analyses, from straightforward sentiment distribution to more advanced clustering and trend detection. The categorical variables facilitate segmentation, while the detailed enrichment scores and text metrics provide a strong basis for in-depth quality and sentiment analysis.

## AppVoCAI Data Sample
For a concrete sense of the data, let's examine a random sample from the dataset. For presentation purposes, we'll examine the features in parts, starting with the core variables, then moving on to the lexical and syntactic features that are used to measure text quality. Next, we'll inspect the sentiment and temporal features, and finally the text quality measures and scores.

### AppVoCAI Core Data
Core variables include `id`, `app_id`, `app_name`, `category`, `author`, `rating`, `content`, `vote_count`, `vote_sum`, and `date`.

In [8]:
eda.sample(n=5, random_state=4, column_subset=["core"])

Unnamed: 0,id,app_id,app_name,category,author,rating,content,vote_count,vote_sum,date
67204,7936651674,929775122,Rave - Watch Party,Social Networking,ff48a05384c673f0b3be,4,when using voice voice chat you cant hear the ...,0,0,2021-10-21 03:04:15
23633,8424577813,407517450,Papa Johns Pizza & Delivery,Food & Drink,5ad95cb5fbbb216e57a9,5,Hot and delicious,0,0,2022-03-05 23:23:04
39830,6273433270,1455822746,GCN,Lifestyle,c1297b2d3459905a7b8b,2,Was really excited to watch Strade and was sup...,0,0,2020-08-03 03:34:44
13583,5896826757,1436192460,PS Remote Play,Entertainment,81b6e304b7df6bd0ea65,3,Everything else works fine but when i turn on ...,0,0,2020-05-03 09:44:54
4680,7437322961,294934058,HotSchedules,Business,1901697c840a01a3ae2a,2,ive had the app for over 2 years now and i jus...,99,55,2021-06-07 13:51:00
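
The sample calls in this section all pass `random_state=4`, which is why the same five rows appear in every sample table. The sketch below shows that behavior with plain `DataFrame.sample` on toy data. It assumes `eda.sample` forwards `n` and `random_state` to pandas, which this section does not confirm.

```python
import pandas as pd

# Toy data for illustration only; not drawn from AppVoCAI.
toy = pd.DataFrame(
    {
        "id": range(100),
        "rating": [i % 5 + 1 for i in range(100)],
        "quant_review_length": [i * 2 for i in range(100)],
    }
)

# Sampling different column subsets with the same seed selects the same rows.
core = toy[["id", "rating"]].sample(n=5, random_state=4)
lexical = toy[["id", "quant_review_length"]].sample(n=5, random_state=4)

print(core)
print(lexical)
```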


### AppVoCAI Lexical Data
The lexical features in the dataset are `quant_review_length`, `stats_char_count`, `stats_unique_word_count`, and `stats_unique_word_proportion`.

In [9]:
eda.sample(n=5, random_state=4, column_subset=["lexical"])

Unnamed: 0,id,app_id,app_name,category,author,quant_review_length,stats_char_count,stats_unique_word_count,stats_unique_word_proportion
67204,7936651674,929775122,Rave - Watch Party,Social Networking,ff48a05384c673f0b3be,19,94,17,0.894737
23633,8424577813,407517450,Papa Johns Pizza & Delivery,Food & Drink,5ad95cb5fbbb216e57a9,3,17,3,1.0
39830,6273433270,1455822746,GCN,Lifestyle,c1297b2d3459905a7b8b,44,255,36,0.818182
13583,5896826757,1436192460,PS Remote Play,Entertainment,81b6e304b7df6bd0ea65,31,158,25,0.806452
4680,7437322961,294934058,HotSchedules,Business,1901697c840a01a3ae2a,252,1302,148,0.587302
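
As a rough sketch of how such lexical features could be computed, the function below splits on whitespace and lowercases words before counting unique ones. Both choices are assumptions; the tokenization used to build the dataset is not described here. The helper `lexical_features` is hypothetical. The example text is the "Hot and delicious" review from the sample above.

```python
def lexical_features(text: str) -> dict:
    """Word, character, and unique-word counts (assumed whitespace tokenization)."""
    words = text.split()
    unique_words = {word.lower() for word in words}
    n_words = len(words)
    return {
        "quant_review_length": n_words,
        "stats_char_count": len(text),
        "stats_unique_word_count": len(unique_words),
        # Guard against empty reviews to avoid division by zero.
        "stats_unique_word_proportion": len(unique_words) / n_words if n_words else 0.0,
    }


features = lexical_features("Hot and delicious")
print(features)
```

For this review the sketch reproduces the values in the sample row: 3 words, 17 characters, 3 unique words, and a unique word proportion of 1.0.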


### AppVoCAI Syntactic Data
Our syntactic features include POS tags, reflecting the absolute counts and relative proportions of the parts of speech associated with high-quality reviews for aspect-based sentiment analysis (ABSA).

In [10]:
eda.sample(n=5, random_state=4, column_subset=["syntactic"])

Unnamed: 0,id,app_id,app_name,category,author,pos_n_nouns,pos_n_verbs,pos_n_adjectives,pos_n_adverbs,pos_p_nouns,pos_p_verbs,pos_p_adjectives,pos_p_adverbs
67204,7936651674,929775122,Rave - Watch Party,Social Networking,ff48a05384c673f0b3be,5,3,2,1,0.263158,0.157895,0.105263,0.052632
23633,8424577813,407517450,Papa Johns Pizza & Delivery,Food & Drink,5ad95cb5fbbb216e57a9,0,0,1,0,0.0,0.0,0.333333,0.0
39830,6273433270,1455822746,GCN,Lifestyle,c1297b2d3459905a7b8b,7,11,3,5,0.142857,0.22449,0.061224,0.102041
13583,5896826757,1436192460,PS Remote Play,Entertainment,81b6e304b7df6bd0ea65,7,4,2,3,0.212121,0.121212,0.060606,0.090909
4680,7437322961,294934058,HotSchedules,Business,1901697c840a01a3ae2a,41,46,16,31,0.152416,0.171004,0.05948,0.115242


### AppVoCAI Sentiment Data
The dataset includes both a numeric `sentiment` score and a `sentiment_classification` for each review.

In [11]:
eda.sample(n=5, random_state=4, column_subset=["sentiment"])

Unnamed: 0,id,app_id,app_name,category,author,sentiment,sentiment_classification
67204,7936651674,929775122,Rave - Watch Party,Social Networking,ff48a05384c673f0b3be,-0.35,negative
23633,8424577813,407517450,Papa Johns Pizza & Delivery,Food & Drink,5ad95cb5fbbb216e57a9,0.625,positive
39830,6273433270,1455822746,GCN,Lifestyle,c1297b2d3459905a7b8b,0.225556,neutral
13583,5896826757,1436192460,PS Remote Play,Entertainment,81b6e304b7df6bd0ea65,0.158333,neutral
4680,7437322961,294934058,HotSchedules,Business,1901697c840a01a3ae2a,0.074102,neutral


### AppVoCAI Temporal Data
Temporal features are derived from the review `date` and include `quant_review_age`, `quant_review_month`, `quant_review_day_of_week`, and `quant_review_hour`.

In [12]:
eda.sample(n=5, random_state=4, column_subset=["temporal"])

Unnamed: 0,id,app_id,app_name,category,author,quant_review_age,quant_review_month,quant_review_day_of_week,quant_review_hour
67204,7936651674,929775122,Rave - Watch Party,Social Networking,ff48a05384c673f0b3be,682,10,5,3
23633,8424577813,407517450,Papa Johns Pizza & Delivery,Food & Drink,5ad95cb5fbbb216e57a9,547,3,7,23
39830,6273433270,1455822746,GCN,Lifestyle,c1297b2d3459905a7b8b,1126,8,2,3
13583,5896826757,1436192460,PS Remote Play,Entertainment,81b6e304b7df6bd0ea65,1218,5,1,9
4680,7437322961,294934058,HotSchedules,Business,1901697c840a01a3ae2a,818,6,2,13
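
The temporal features can be reconstructed from `date` with pandas datetime accessors. The sketch below is an assumption about the derivation, not the pipeline's code: the helper `temporal_features` is hypothetical. Age is taken as the difference in calendar days from the last review date, and day of week is numbered 1=Sunday through 7=Saturday, because those conventions reproduce the five sample rows above.

```python
import pandas as pd


def temporal_features(dates: pd.Series, latest: pd.Timestamp) -> pd.DataFrame:
    """Derive age, month, day of week, and hour from review timestamps."""
    return pd.DataFrame(
        {
            # Calendar-day difference, ignoring the time of day.
            "quant_review_age": (latest.normalize() - dates.dt.normalize()).dt.days,
            "quant_review_month": dates.dt.month,
            # pandas uses Monday=0..Sunday=6; shift to Sunday=1..Saturday=7.
            "quant_review_day_of_week": (dates.dt.dayofweek + 1) % 7 + 1,
            "quant_review_hour": dates.dt.hour,
        }
    )


# Dates of the five sampled reviews and the last review date in the dataset.
dates = pd.Series(
    pd.to_datetime(
        [
            "2021-10-21 03:04:15",
            "2022-03-05 23:23:04",
            "2020-08-03 03:34:44",
            "2020-05-03 09:44:54",
            "2021-06-07 13:51:00",
        ]
    )
)
features = temporal_features(dates, latest=pd.Timestamp("2023-09-03 01:46:54"))
print(features)
```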


### AppVoCAI Quality Data
The final text quality score, `enrichment_tqa_score_final`, together with the supplementary `tqm_*` measures, describes the overall quality of each review.

In [13]:
eda.sample(n=5, random_state=4, column_subset=["quality"])

Unnamed: 0,id,app_id,app_name,category,author,enrichment_tqa_score_final,tqm_pos_count_score,tqm_pos_diversity_score,tqm_structural_complexity_score,tqm_pos_intensity_score
67204,7936651674,929775122,Rave - Watch Party,Social Networking,ff48a05384c673f0b3be,0.478494,0.016129,0.206942,0.126639,0.057895
23633,8424577813,407517450,Papa Johns Pizza & Delivery,Food & Drink,5ad95cb5fbbb216e57a9,0.074665,0.001466,0.073241,0.249706,0.033333
39830,6273433270,1455822746,GCN,Lifestyle,c1297b2d3459905a7b8b,0.550486,0.038123,0.203453,0.211511,0.059091
13583,5896826757,1436192460,PS Remote Play,Entertainment,81b6e304b7df6bd0ea65,0.492941,0.02346,0.194518,0.197031,0.051613
4680,7437322961,294934058,HotSchedules,Business,1901697c840a01a3ae2a,0.621322,0.196481,0.201117,0.201041,0.053175


### Review Analysis
Before we close the overview section of our exploratory effort, we will analyze and contrast two reviews in the dataset: the first review, written on January 1, 2020, and the last review, written on September 3, 2023.

#### First Review

In [17]:
eda.get(sort_by="date", ascending=True, index=0, column_subset=["core", "sentiment"])



                         Record it! :: Screen Recorder                          
                                      id | 5344587979
                                  app_id | 1245356545
                                app_name | Record it! :: Screen Recorder
                                category | Utilities
                                  author | 13fe97f7d89a531f74dc
                                  rating | 4
                              vote_count | 0
                                vote_sum | 0
                                    date | 2020-01-01 00:25:16
                    sentiment | 0.07291666666666669
     sentiment_classification | neutral
I love this app! But I wish you could add in some music and edit how loud the
music will be. At least I can record me arresting high bounty bois  But
sometimes you press the record button and it wont freaking record!!! And why
cant I just type the name of the YouTube video instead of adding stupid numbers
and stuff? Developers 

#### Last Review

In [16]:
eda.get(sort_by="date", ascending=False, index=0, column_subset=["core", "sentiment"])



                         YouTube: Watch, Listen, Stream                         
                                      id | 10328197406
                                  app_id | 544007664
                                app_name | YouTube: Watch, Listen, Stream
                                category | Photo & Video
                                  author | 0964234d89c093f03012
                                  rating | 5
                              vote_count | 0
                                vote_sum | 0
                                    date | 2023-09-03 01:46:54
                    sentiment | 0.0
     sentiment_classification | neutral
I like the app
