In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings
warnings.filterwarnings("ignore")

# AppVoC Overview
In this section, we provide essential information about the AppVoC dataset, including review, app, user and category counts, the number of features in the dataset, and its temporal range.

In [2]:
import pandas as pd
from appvocai-discover.utils.repo import ReviewRepo
from appvocai-discover.app.overview import DatasetOverview
from appvocai-discover.utils.print import Printer

pd.set_option("max_colwidth", 200)


In [3]:
DIRECTORY = "04_features"
FILENAME = "reviews.pkl"

In [4]:
repo = ReviewRepo()
df = repo.read(directory=DIRECTORY, filename=FILENAME)

In [5]:
ov = DatasetOverview(data=df)
ov.reset_cache()
ov.overview



                                AppVoC Overview                                 
                       Number of Reviews | 17109
                         Number of Users | 17097
                          Number of Apps | 3997
                    Number of Categories | 10
                                Features | 18
                               Size (Mb) | 16.93
                    Date of First Review | 2008-09-04
                     Date of Last Review | 2023-08-17
                          Date Generated | 2023-12-02




Compiled in December of 2023, the preprocessed AppVoC dataset contains some 16.7 million reviews and ratings submitted by approximately 33,000 users for 12 million apps across 10 categories. The 17 features in the dataset represent user opinion and sentiment dating back to July 10, 2008, the day that Steve Job's launched the App Store. Quite naturally, we will reveal some of these early opinions in about a minute. 

In [6]:
ov.by_category

Unnamed: 0,Category,Reviews,Apps,Reviews/App,Authors
0,Book,748,124,6.03,748
1,Business,1320,451,2.93,1320
2,Education,1040,300,3.47,1040
3,Entertainment,1885,335,5.63,1885
4,Health & Fitness,3891,851,4.57,3891
5,Lifestyle,1612,389,4.14,1612
6,Medical,642,245,2.62,642
7,Productivity,754,194,3.89,754
8,Social Networking,2426,278,8.73,2425
9,Utilities,2791,830,3.36,2790


The table above summarizes the distribution of reviews and apps across different categories in the dataset. Each category is listed along with the total number of reviews, the total number of apps, the average number of reviews per app (Reviews/App), and the total number of authors contributing to the reviews in each category. 

## AppVoC Features

The table below lists the 17 features of the dataset, representing both the customer's experience and the influence of their opinions on customer behavior and engagement.


| #  | Name          | Data Type      | Measurement | Description                                                                                         |
|----|---------------|----------------|-------------|-----------------------------------------------------------------------------------------------------|
| 1  | id            | string         | nominal     | Unique identifier for each review.                                                                  |
| 2  | app_id        | string         | nominal     | Unique identifier for each app in the App Store.                                                    |
| 3  | app_name      | string         | nominal     | Name for each app in the App Store.                                                                 |
| 4  | category      | category       | nominal     | Name of the category or genre to which each app belongs.                                            |
| 5  | author        | string         | nominal     | 10-character hash sequence representing a unique review author.                                     |
| 6  | rating        | numeric        | interval    | Represents the author's level of satisfaction with the app on a   five-point scale from 1 to 5.     |
| 7  | title         | string         | nominal     | String that characterizes or summarizes a review.                                                   |
| 8  | content       | text           | nominal     | Strings that comprise an app review.                                                                |
| 9  | review_length | numeric        | discrete    | Number of words in a review.                                                                        |
| 10 | vote_count    | numeric        | discrete    | Number of users that cast a vote on the usefullness or value of a   review.                         |
| 11 | vote_sum      | numeric        | discrete    | Total of the value of the votes given by users indicating the value or   usefullness of a review.   |
| 12 | date          | datetime64[ms] | interval    | The date the review was submitted.                                                                  |
| 13 | year          | numeric        | interval    | The year the review was submitted.                                                                  |
| 14 | month         | numeric        | interval    | The month the review was submitted.                                                                 |
| 15 | day           | numeric        | interval    | The day the review was submitted.                                                                   |
| 16 | year_month    | numeric        | interval    | The year and month the review was submitted.                                                        |
| 17 | ymd           | numeric        | interval    | The year month a day the review was submitted.                                                      |
 

## Variable Types and Measurement

The dataset features a variety of data types and measurement types, each serving a distinct purpose in capturing and analyzing user interactions and app performance:

1. **String (Nominal)**: Used for identifiers such as `id`, `app_id`, `app_name`, `author`, `title`, and `content`. These represent categorical data without any inherent order or numerical value.
2. **Category (Nominal)**: The `category` field conveys a specific type of string data, indicating the genre or category to which each app belongs.

3. **Numeric (Interval/Discrete)**: The `rating`, `review_length`, `vote_count`, `vote_sum`, `year`, `month`, `day`, `year_month`, and `ymd` fields utilize numeric data types. 
    - **Interval**: `rating`, `date`, `year`, `month`, `day`, `year_month`, and `ymd` represent continuous numerical data with meaningful intervals between values.
    - **Discrete**: `review_length`, `vote_count`, and `vote_sum` represent countable quantities, typically whole numbers.

4. **Datetime (Interval)**: The `date` field employs datetime data type, providing precise temporal information about when each review was submitted.

These diverse data types and measurement types enable principled analysis of user engagement, sentiment, and behavior.

### Rating as Interval 

Our decision to treat rating as an interval type may raise eyebrows among traditionalists in data analysis pedagogy and measurement theory, as it challenges the long-standing orthodoxy surrounding data types. Since Harvard Psychology Professor, Dr.Stanley Smith Stevens proposed his taxonomy in the 1946 {cite}`stevensTheoryScalesMeasurement1946`, the orthodoxy surrounding data types has been well-established. Rating, he argued, takes on a distinct nature of ordinal measurement. In this framework, ratings are characterized by their inherent order but *lack* consistent intervals between categories. This view aligns with the theoretical justification that ordinal scales provide a *ranking* of values without implying specific quantitative differences between them. Therefore, certain constructs such as 'average rating', for instance, have no mathematically interpretation. 

Our departure from this view leverages the inherent properties of interval scales, including equal intervals and meaningful arithmetic operations. Interval scales exhibit the property of equal intervals, where the difference between any two consecutive points on the scale remains constant. Mathematically, this implies that for any ratings $i$ and $j$, that $|i-j| = |k-l| \space \forall \space i,j,k,l. Because of this, arithmetic operations such as addition and averaging are meaningful, and more sophisticated statistical techniques, such as regression and correlation analysis, can be applied to facilitate a deeper understanding of user feedback. Moreover, the widespread adoption of average ratings by industry-leading platforms (such as Amazon, IMDb, and Yelp) as key metric for summarizing user feedback and comparing different products, services, or categories, underscores the practical acceptance of treating ratings as interval data. By treating ratings as interval data, we align with common industry practices and leverage the full range of statistical tools available for numerical data. 

Our departure from the conventional treatment of ratings as ordinal measurements is not without precedent. Inspired by the critiques of Velleman, Wilkinson, Rozeboom, and others, we have adopted a perspective that "the scale type of data may be determined in part
by the questions we ask of the data or the purposes for which we intend it" {cite}`vellemanNominalOrdinalInterval1993`. 

## AppVoC Profile
The profile summarizes the data and aspects of data quality in terms of:

- **Column**: Represents the column names in the dataset.
- **DataType**: Indicates the data type of each column.
- **Complete**: Displays the count of complete cases (non-null values) in each column.
- **Null**: Shows the count of null values in each column.
- **Completeness**: Represents the completeness of each column, calculated as the ratio of complete cases to the total number of cases.
- **Unique**: Indicates the count of unique values in each column.
- **Duplicate**: Displays the count of duplicate values in each column.
- **Uniqueness**: Represents the uniqueness of values within each column, calculated as the ratio of unique values to the total number of cases.
- **Size**: Represents the size of the column in bytes.

In [7]:
ov.info

Unnamed: 0,Column,DataType,Complete,Null,Completeness,Unique,Duplicate,Uniqueness,Size (Bytes)
0,id,string,17109,0,1.0,17109,0,1.0,1142703
1,app_id,string,17109,0,1.0,3997,13112,0.23,1134012
2,app_name,string,17109,0,1.0,3997,13112,0.23,1458227
3,category_id,category,17109,0,1.0,10,17099,0.0,17780
4,category,category,17109,0,1.0,10,17099,0.0,18083
5,author,object,17109,0,1.0,17097,12,1.0,1317393
6,rating,int64,17109,0,1.0,5,17104,0.0,136872
7,title,string,17109,0,1.0,11601,5508,0.68,1311468
8,content,string,17109,0,1.0,16164,945,0.94,6243364
9,vote_sum,int64,17109,0,1.0,26,17083,0.0,136872


All columns in the dataset are complete with no null values. Key observations include zero duplication in review `id` and 74% uniqueness in the `author` variable indicates 26% of the authors have submitted more than one review.

## AppVoC Data Sample
Next, we inspect the data by examining the first and last five observations in the dataset.

In [8]:
ov.data.head()

Unnamed: 0,id,app_id,app_name,category_id,category,author,rating,title,content,vote_sum,vote_count,date,review_length,year,month,day,year_month,ymd
15906283,1160519484,454638411,Messenger,6005,Social Networking,a3485c0ad91b83c2a966,5,Good,100% good,0,0,2015-02-28 17:51:50,2,2015,February,Saturday,2015-02,2015-02-28
8328248,9036129510,1597150642,Reverse Health,6013,Health & Fitness,5b28931a5bc65792a7b8,1,It's a Scam,"This app/program is a joke. You don't receive coaching with a live person, the exercise program consists of three 20 minute videos w/two bonus exercise videos. Some of the recipes are decent, but ...",0,0,2022-08-31 02:31:08,191,2022,August,Wednesday,2022-08,2022-08-31
10123462,1685166136,946346179,Animation Desk® Draw & Animate,6016,Entertainment,66edc1765bb07328b9d4,3,Hard to use,Would be good if there was copy frame sorry I'm only going to give it 3 stars because it was not the worst sorry fix this and I'll give it 5🤓,0,0,2017-07-16 23:20:22,31,2017,July,Sunday,2017-07,2017-07-16
5884575,6836055253,1158555867,Replika - Virtual AI Companion,6013,Health & Fitness,9e88318918189b886c83,1,Peoples privacy,This app it's been telling me that people are watching me through the phone following me around everywhere I go I don't think this is a safe app that anyone should be on.,1,1,2021-01-06 03:10:31,33,2021,January,Wednesday,2021-01,2021-01-06
16095366,445648636,447119634,Currents,6005,Social Networking,09b586574a4c4d3e6259,5,I Love Google,Google > Apple,0,0,2011-07-19 16:27:53,3,2011,July,Tuesday,2011-07,2011-07-19


In [9]:
ov.data.tail()

Unnamed: 0,id,app_id,app_name,category_id,category,author,rating,title,content,vote_sum,vote_count,date,review_length,year,month,day,year_month,ymd
10544753,5008213353,376510438,Hulu: Watch TV shows & movies,6016,Entertainment,598519b19b40f326fe25,2,glitchy,"i don't normally submit reviews but this app has been bothering me more and more. it often makes you rewatch commercials if you accidentally rewind or fastforward too much, it freezes, it doesn't ...",0,0,2019-10-24 18:38:51,96,2019,October,Thursday,2019-10,2019-10-24
5338032,1605328721,330595774,Cyclemeter Bike Computer,6013,Health & Fitness,83bd65033507f43ac262,5,The perfect biking app,"The only thing I don't like about this app is that I didn't find it sooner! This app is immensely customizable, so you can set it up to display EXACTLY the information that you want to see. In ad...",1,1,2017-05-05 02:37:58,309,2017,May,Friday,2017-05,2017-05-05
2113413,9794096424,1533132556,Angi Services for Pros,6000,Business,436344ad7013a66c4626,5,New update killed the app,"It started out amazing, I was making great money on tons of different handyman jobs. Then they started assigning 2 people to some jobs. You have no idea who the other “pro” is, and it makes a job ...",0,0,2023-04-06 19:54:00,183,2023,April,Thursday,2023-04,2023-04-06
8395610,3288614151,314498713,BetterSleep: Relax and Sleep,6013,Health & Fitness,cf370b0f5e43a3d6030f,5,The baby was,"I got it for the baby and it's like magic. Asleep in little time, the best sound app",0,0,2018-10-11 08:51:55,18,2018,October,Thursday,2018-10,2018-10-11
4827725,690659843,381471023,Flashlight Ⓞ,6002,Utilities,2d54a30692141cb5e7a0,5,No Name,Supper app,0,0,2012-11-14 03:25:04,2,2012,November,Wednesday,2012-11,2012-11-14


When examining the first and last five reviews in the dataset, several aspects are worth commenting on:

1. **Consistency**: The dataset maintains a consistent format with clearly defined columns.
2. **Timestamp Granularity**: Each review has a detailed timestamp (`date`), providing the exact date and time down to seconds.
3. **Derived Columns**: Additional columns (`year`, `month`, `day`, `year_month`, `ymd`) are derived from the `date` column for easier analysis.
4. **Vote Metrics**: The earliest reviews reflect varying levels of user engagement with each review via `vote_sum` and `vote_count`.

### Example: Analysis
Here's an example analysis commentary based on the printed reviews:

- **Qudo - Find Snapchat Friends**: The review is short but positive, indicating a good user experience. However, the low rating contrasts with the review content, which could be an outlier or an error.
- **Flirtini - Match, Chat, Meet**: This review is critical of the app, highlighting an issue with fake accounts. This type of feedback is crucial for app developers to address user concerns.
- **Text Me - Phone Call + Texting**: A highly positive review praising the app's functionality, aligning well with the 5-star rating.
- **Boo — Dating. Friends. Chat.**: Another positive review with a humorous touch, reflecting user satisfaction.
- **Emoji Me Sticker Maker**: Positive feedback on the app's features, showing user appreciation for the provided content.

Next step? We analyze the distributions of the variables, to understand centrality, frequency and spread of variables in the dataset. But first...

## App Store's First Review
To be more precise, 'App Store's First Review' should say the App Store's First Review *in our dataset* from App Store launch on July 10, 2008; hereinafter referred to as the 'first app review'.

The first app review in the App Store holds historical and symbolic significance, marking the beginning of a new era in mobile applications and user-generated feedback. This review provides insights into early user experiences and expectations from apps when the App Store was launched. Here are several reasons why the first app review is significant:

### Historical Significance
1. **Milestone Event**: The first app review represents a milestone in the history of mobile technology and digital marketplaces. It signifies the launch of a platform that revolutionized how software is distributed and consumed.
2. **Early User Expectations**: It provides a glimpse into what early adopters of mobile technology valued and expected from mobile applications in 2008.
3. **App Store Evolution**: Comparing this review to contemporary reviews highlights how both user expectations and app functionalities have evolved.

### Analytical Significance
1. **Baseline for Analysis**: It serves as a baseline for understanding how user feedback and app quality have evolved. Analyzing this review can set the stage for a longitudinal study of app reviews.
2. **Sentiment Analysis**: Conducting sentiment analysis on the first review can reveal initial user sentiment towards the App Store and its apps.
3. **Trends and Patterns**: Observing the progression from this first review to subsequent reviews can reveal trends in user demands and app features.

### Cultural and Social Significance
1. **User Empowerment**: The ability for users to review and rate apps empowered consumers and provided developers with direct feedback. This review is a testament to that empowerment.
2. **Community Building**: It marks the beginning of an interactive community where users share their experiences and help guide others in the app selection process.

For a complete analysis, it's essential to recognize this review's historical, analytical, cultural, and social significance.

Voici, the 'first app review'.

In [10]:
Printer().print_dataframe_as_dict(ov.data.head(n=1), title='First App Review (in our dataset)')



                       First App Review (in our dataset)                        
                                      id | 1160519484
                                  app_id | 454638411
                                app_name | Messenger
                             category_id | 6005
                                category | Social Networking
                                  author | a3485c0ad91b83c2a966
                                  rating | 5
                                   title | Good
                                 content | 100% good
                                vote_sum | 0
                              vote_count | 0
                                    date | 2015-02-28 17:51:50
                           review_length | 2
                                    year | 2015
                                   month | February
                                     day | Saturday
                              year_month | 2015-02
                                     

### Analysis of the First App Review in Our Dataset

This review, dated July 10, 2008, is for the app **epocrates**, categorized under **Medical**. The review highlights several key aspects of early user interactions and expectations with mobile applications.

#### Review Details
- **App Information**
  - **App Name:** epocrates
  - **Category:** Medical
  - **App ID:** 281935788
- **Review Metadata**
  - **Review ID:** 5287030
  - **Author ID:** 50c38e8b41319e42951d
  - **Rating:** 5 stars
  - **Review Date:** July 10, 2008 (Thursday)
  - **Vote Sum:** 27
  - **Vote Count:** 37
  - **Review Length:** 34 words
- **Review Content:** 
  - **Title:** ePocrates
  - **Body:** "It's FREE!  Damn, you can't beat that for an outstanding app!  I was ready to throw down some cash for this.  Did I mention that this is a FREE app?  Go and get yours!!!"

#### Content Analysis
The reviewer's excitement is palpable, marked by the repeated emphasis on the app being free. This enthusiasm reflects early user sentiments where the cost of apps was a significant factor in user satisfaction. The use of informal language ("Damn," "throw down some cash") and multiple exclamation marks highlights the casual and enthusiastic tone common in early app reviews.

#### Sentiment Analysis
- **Tone:** Highly positive
- **Key Sentiment Words:** "FREE," "outstanding," "can't beat that"
- The review conveys strong satisfaction, primarily driven by the app's cost-free nature and perceived high value.

#### Contextual Insights
1. **Historical Context:** This review dates back to the very early stages of the App Store, which launched on July 10, 2008. The review's date indicates it was among the first user feedbacks, offering valuable historical context.
2. **User Expectations:** The emphasis on the app being free underscores early user expectations where free apps were a significant draw.
3. **App Popularity:** The relatively high vote count (37) suggests that even in the nascent stages of the App Store, user engagement and community validation (through votes) were important.

#### Reflection
Behold, the first app review. This piece of digital history offers a window into the dawn of the app ecosystem. The excitement and high rating reflect a time of discovery and novelty, where users were thrilled to find useful, free applications. This review not only serves as a benchmark for user satisfaction but also highlights the evolving nature of app development and user interaction over the years.

The first review in our dataset is a testament to the excitement and high expectations surrounding early mobile applications. It provides a rich, qualitative snapshot of user sentiment, engagement, and the nascent app culture. 

With that, we conclude the dataset overview section of the exploratory analysis. Next, we scrutinize the distributions of ratings, vote counts, and other key variables to unveil insights into their centrality, spread, and potential patterns.