In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings
warnings.filterwarnings("ignore")

## Preliminaries
Imports, let's get this out of the way.

In [2]:
import warnings
from appvocai-discover.app.overview import DatasetOverview
warnings.simplefilter("ignore")

In [3]:
DIRECTORY = "04_features"
FILENAME = "reviews.pkl"

## AppVoC Overview and Summary


In [4]:
ov = DatasetOverview(directory=DIRECTORY, filename=FILENAME)
ov.overview





                                AppVoC Overview                                 
                       Number of Reviews | 17109
                         Number of Users | 17097
                          Number of Apps | 3997
                    Number of Categories | 10
                    Date of First Review | 2008-09-04
                     Date of Last Review | 2023-08-17




The App Store review dataset was compiled in December of 2023 and captures user sentiment and opinion of mobile apps in the Apple App Store from July of 2008 through August of 2023. The preprocessed dataset comprises 17 variables and is structured as follows:

| #  | Name          | Data Type      | Measurement | Description                                                                                         |
|----|---------------|----------------|-------------|-----------------------------------------------------------------------------------------------------|
| 1  | id            | string         | nominal     | Unique identifier for each review.                                                                  |
| 2  | app_id        | string         | nominal     | Unique identifier for each app in the App Store.                                                    |
| 3  | app_name      | string         | nominal     | Name for each app in the App Store.                                                                 |
| 4  | category      | category       | nominal     | Name of the category or genre to which each app belongs.                                            |
| 5  | author        | string         | nominal     | 10-character hash sequence representing a unique review author.                                     |
| 6  | rating        | numeric        | interval    | Represents the author's level of satisfaction with the app on a five-point scale from 1 to 5.       |
| 7  | title         | string         | nominal     | String that characterizes or summarizes a review.                                                   |
| 8  | content       | text           | nominal     | Strings that comprise an app review.                                                                |
| 9  | review_length | numeric        | discrete    | Number of words in a review.                                                                        |
| 10 | vote_count    | numeric        | discrete    | Number of users who cast a vote on the usefulness or value of a review.                             |
| 11 | vote_sum      | numeric        | discrete    | Total value of the votes cast by users indicating the value or usefulness of a review.              |
| 12 | date          | datetime64[ms] | interval    | The date the review was submitted.                                                                  |
| 13 | year          | numeric        | interval    | The year the review was submitted.                                                                  |
| 14 | month         | numeric        | interval    | The month the review was submitted.                                                                 |
| 15 | day           | numeric        | interval    | The day the review was submitted.                                                                   |
| 16 | year_month    | numeric        | interval    | The year and month the review was submitted.                                                        |
| 17 | ymd           | numeric        | interval    | The year, month, and day the review was submitted.                                                  |
 

We begin by examining basic information about the dataset such as the number of observations, variables/features, and storage data types, but first...

### Assumption and Justification for Treating Ratings as Interval Data
For the purposes of this analysis, we assume that the ratings are measured on an interval scale. This implies that the differences between consecutive rating values are consistent and meaningful. Specifically, the difference in satisfaction between a rating of 1 and 2 is the same as between a rating of 4 and 5.

### Theoretical Justification:
1. **Interval Scale Properties**:
   - **Equal Intervals**: Interval scales have equal intervals between points, which means that the difference between any two consecutive points on the scale is the same throughout the scale.
   - **Arithmetic Operations**: Because interval scales have equal intervals, arithmetic operations such as addition and averaging are meaningful. This allows us to compute metrics like the average rating to summarize central tendency.

2. **Ordinal vs. Interval**:
   - **Ordinal Data**: Ratings can be considered ordinal, indicating a rank order without assuming equal intervals (e.g., 1-star is worse than 2-star, but the difference might not be the same as between 3-star and 4-star).
   - **Interval Assumption**: By assuming equal intervals, we can leverage more sophisticated statistical techniques and derive summary statistics that can be easily interpreted and compared.

### Precedent and Practical Use:
1. **Industry Standard**:
   - Many popular platforms and services (such as Amazon, IMDb, and Yelp) commonly calculate and display average ratings. This suggests a practical acceptance and precedent for treating ratings as interval data.
   - These platforms provide average ratings as a key metric for summarizing user feedback and comparing different products, services, or categories.

2. **Analytical Utility**:
   - **Average Rating**: The average rating is a widely used and easily understood metric that provides a quick summary of overall user satisfaction.
   - **Statistical Analysis**: Treating ratings as interval data allows us to perform more comprehensive statistical analyses, such as regression and correlation analysis, which require numerical data.

### Practical Considerations:
1. **Potential Bias**: It is important to recognize that the assumption of equal intervals might not perfectly reflect all users' perceptions. Some users may perceive the difference between certain ratings as larger or smaller than others.
2. **Communication**: When presenting the results, we will clarify that the average rating is calculated based on the assumption of equal intervals and discuss any potential implications.

By treating ratings as interval data, we align with common industry practices and leverage the full range of statistical tools available for numerical data. This approach allows us to compute average ratings and other summary statistics that provide valuable insights into user feedback and overall satisfaction.
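To make the distinction concrete, here is a minimal sketch using made-up ratings (not values from the dataset). The median relies only on rank order, so it is valid under the ordinal reading; the mean is meaningful only if we accept the equal-interval assumption described above.

```python
from statistics import mean, median

# Hypothetical ratings on the five-point scale (illustrative only).
ratings = [5, 5, 4, 1, 5, 3, 5, 2, 5, 4]

# Valid under the ordinal reading: only the rank order of ratings is used.
ordinal_summary = median(ratings)

# Requires the interval assumption: a step from 1 to 2 is treated as
# equal in size to a step from 4 to 5.
interval_summary = mean(ratings)

print(f"median={ordinal_summary}, mean={interval_summary}")
```

When the two summaries diverge noticeably, as they do here, it is worth reporting both so readers can judge how much the interval assumption drives the conclusion.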

Let's examine some basic characteristics of the dataset.

In [5]:
ov.structure

<function appvocai-discover.app.overview.cachenow.<locals>.decorator.<locals>.wrapper>

This confirms the number of observations and variables in the dataset. The memory footprint is non-trivial, suggesting that efficient data processing techniques will be necessary to handle the dataset effectively.

The overview table summarizes key metrics, including the number of reviews, users, apps, and categories. Let's examine the profile of the data.

## Data Profile
The profile summarizes the data and aspects of data quality in terms of:

- **Column**: Represents the column names in the dataset.
- **DataType**: Indicates the data type of each column.
- **Complete**: Displays the count of complete cases (non-null values) in each column.
- **Null**: Shows the count of null values in each column.
- **Completeness**: Represents the completeness of each column, calculated as the ratio of complete cases to the total number of cases.
- **Unique**: Indicates the count of unique values in each column.
- **Duplicate**: Displays the count of duplicate values in each column.
- **Uniqueness**: Represents the uniqueness of values within each column, calculated as the ratio of unique values to the total number of cases.
- **Size**: Represents the size of the column in bytes.
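The metrics listed above can be approximated with plain pandas. The sketch below is not the implementation behind `ov.info`; the `profile` function, the toy frame, and the choice to count every non-distinct row as a duplicate are assumptions made for illustration.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize completeness, uniqueness, and size for each column."""
    n = len(df)
    complete = df.notna().sum()  # non-null values per column
    unique = df.nunique()        # distinct non-null values per column
    return pd.DataFrame(
        {
            "DataType": df.dtypes.astype(str),
            "Complete": complete,
            "Null": n - complete,
            "Completeness": complete / n,
            "Unique": unique,
            # Assumption: every row beyond the distinct values is a duplicate.
            "Duplicate": n - unique,
            "Uniqueness": unique / n,
            "Size": df.memory_usage(deep=True, index=False),  # bytes
        }
    )


# Toy data, not drawn from the review dataset.
toy = pd.DataFrame({"id": ["a", "b", "c", "d"], "rating": [5, 5, 4, None]})
result = profile(toy)
print(result)
```

In the toy frame, `id` has a uniqueness of 1.0, while `rating` has one null value and a completeness of 0.75.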



In [6]:
ov.info

<function appvocai-discover.app.overview.cachenow.<locals>.decorator.<locals>.wrapper>

All columns in the dataset are complete, with no null values. Key observations include high uniqueness in identifiers like `id` and `author`, while columns such as `category`, `rating`, and `vote_sum` show very low uniqueness. Some columns, such as `app_id` and `app_name`, have significant duplication. Each column's size in bytes varies, with the `content` column being notably large because it holds the full review text.

Here is a breakdown of reviews by category, revealing the breadth of user feedback.

In [7]:
ov.summary

<function appvocai-discover.app.overview.cachenow.<locals>.decorator.<locals>.wrapper>

The dataset showcases a diverse range of categories, spanning from books and education to social networking and utilities. This diversity reflects the broad spectrum of apps and services available in the dataset.

Having introduced the dataset, let's outline how our exploratory data analysis will unfold over the next few sections.

## Data Distribution Exploration
In the data distribution exploration section, we examine the statistical characteristics and patterns inherent within the dataset.

### Numeric Variable Distributions
Starting with the numeric variables, we expose insights into their spread, central tendency, and shape.

#### Rating
Let's begin with descriptive statistics and visualization of ratings data. Apps are rated on a five-point scale.

In [8]:
reviews.describe(x=["rating"]).numeric.T

NameError: name 'reviews' is not defined

In [None]:
reviews.plot.countplot(x="rating", title="Distribution of Ratings")

The ratings are concentrated at the high end of the scale. A median rating of 5 means that at least half of all reviews gave the highest possible rating, indicating overall positive sentiment towards the mobile applications. The relatively low standard deviation indicates that ratings tend to fall close to the mean, albeit with some variability.

#### Vote Sum
Users are able to rate a review as being helpful on a scale of 1 to 5. Vote sum represents the sum of votes rendered on reviews. 

In [None]:
reviews.describe(x=["vote_sum"]).numeric.T

In [None]:

import matplotlib.pyplot as plt  # required for plt.subplots and plt.show below

# Stack a histogram and a violin plot of vote_sum in a single figure.
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(12,6))
histogram = reviews.plot.histogram(x="vote_sum", title="Vote Sum Histogram", ax=ax[0])
violinplot = reviews.plot.violinplot(x="vote_sum", title="Vote Sum Violin Plot", ax=ax[1])
_ = fig.suptitle("Distribution of Vote Sum")
fig.tight_layout()
plt.show()


The distribution of the vote_sum variable reveals that the majority of reviews have received a minimal number of votes, as indicated by the median (50th percentile) and 75th percentile values of 0.00. However, the presence of a non-zero mean value suggests the existence of a relatively small number of reviews with higher vote counts. Additionally, the wide standard deviation indicates significant variability in the number of votes received, with some reviews receiving a substantial number of votes, as evidenced by the maximum value of 4,433.
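The pattern described here, a median and 75th percentile of zero alongside a non-zero mean, is typical of zero-inflated, right-skewed data. The sketch below reproduces it with a hypothetical series; the values are invented and only illustrate the shape.

```python
import pandas as pd

# Hypothetical vote sums: most reviews have no votes, one has many.
vote_sum = pd.Series([0] * 8 + [2, 98])

summary = vote_sum.describe()
print(summary[["mean", "50%", "75%", "max"]])
```

A handful of heavily voted reviews is enough to pull the mean well above the median, which is why the quartiles are the more informative summary for this variable.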

#### Vote Count
Vote count represents the number of users who rendered a vote for a specific review.

In [None]:
reviews.describe(x=["vote_count"]).numeric.T

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(12,6))
histogram = reviews.plot.histogram(x="vote_count", title="Vote Count Histogram", ax=ax[0])
boxplot = reviews.plot.violinplot(x="vote_count", title="Vote Count Violin Plot", ax=ax[1])
_ = fig.suptitle("Distribution of Vote Count")
fig.tight_layout()
plt.show()

The statistics for the vote_count variable indicate that the majority of reviews have not received any votes, as demonstrated by the median (50th percentile) and 75th percentile values of 0.00. The mean value being greater than zero suggests the presence of a subset of reviews with non-zero vote counts, albeit with a wide standard deviation, indicating significant variability in the number of votes received. The maximum value of 8,494 highlights the existence of a small number of reviews with exceptionally high vote counts.

#### Review Length
Review length is measured in terms of the number of words in the review.

In [None]:
reviews.describe(x=["review_length"]).numeric.T

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(12,6))
histogram = reviews.plot.histogram(x="review_length", title="Review Length Histogram", ax=ax[0], kde=True)
boxplot = reviews.plot.violinplot(x="review_length", title="Review Length Violin Plot", ax=ax[1])
_ = fig.suptitle("Distribution of Review Length")
fig.tight_layout()
plt.show()

The distribution of review_length demonstrates considerable variability, with a mean of 28.31 and a standard deviation of 37.13. The majority of reviews have relatively short lengths, as evidenced by the median (50th percentile) and 75th percentile values of 17.00 and 36.00, respectively. However, there is notable variability, with some reviews being significantly longer, as indicated by the maximum length of 2,624 words. Additionally, the presence of reviews with a minimum length of 0 suggests the existence of empty or very short reviews in the dataset.
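For reference, a word count of this kind can be derived by splitting on whitespace. This is a sketch under the assumption that `review_length` counts whitespace-separated tokens; the actual preprocessing may tokenize differently. It also shows how an empty review yields a length of 0.

```python
import pandas as pd

# Toy review texts, not taken from the dataset.
content = pd.Series(["Great app, love it!", "", "Crashes   on launch"])

# Split on any run of whitespace and count the resulting tokens.
review_length = content.str.split().str.len()
print(review_length.tolist())
```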

### Categorical Frequency Distributions
Next, we explore the frequency distribution and composition of categorical variables.

#### App ID
The `app_id` uniquely identifies an app in the Apple App ecosystem.

In [None]:
reviews.describe(x=["app_id"]).categorical.T

The frequency distribution of app IDs reveals that there are 34,064 unique mobile applications represented in the dataset. The most common app ID, 341232718, occurs 338,097 times. This app accounts for a significant number of reviews within the dataset, suggesting a high level of user engagement or popularity for this particular application.

In [None]:
counts, stats = reviews.countstats(x='app_id')
stats

The distribution of `app_id` counts reveals considerable variability in the number of reviews associated with each unique app ID. The mean count of reviews per app ID is 528.14, with a standard deviation of 4,864.06, indicating significant dispersion around the mean. 

The range of counts per app ID is substantial, from a minimum of 1 to a maximum of 338,097. The gap between the mean (528.14) and the median (27) indicates strong right skew in the distribution. The quartile values further illustrate this variability, with 25% of app IDs having 9 or fewer reviews, 50% having 27 or fewer reviews, and 75% having 109 or fewer reviews.

The wide range and dispersion in app ID counts suggest that the dataset encompasses a diverse range of mobile applications, with some apps receiving relatively few reviews while others are associated with a much larger volume of reviews.
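Statistics like these can be obtained by counting rows per value and then describing the counts. The sketch below is a hypothetical stand-in for `reviews.countstats`, whose implementation is not shown in this notebook; the `count_stats` name and the toy data are assumptions.

```python
import pandas as pd


def count_stats(df: pd.DataFrame, column: str):
    """Return per-value counts and descriptive statistics of those counts."""
    counts = (
        df[column]
        .value_counts()            # sorted from most to least frequent
        .rename_axis(column)
        .reset_index(name="count")
    )
    stats = counts["count"].describe()
    return counts, stats


# Toy data: one heavily reviewed app and two lightly reviewed ones.
toy = pd.DataFrame({"app_id": ["a"] * 5 + ["b"] * 2 + ["c"]})
counts, stats = count_stats(toy, "app_id")
print(counts)
print(stats)
```

Even in this tiny example the mean count exceeds the median, the same signature of right skew seen in the app ID distribution.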

#### Author
The authors' names have been hashed; however, we can still evaluate the distribution of reviews per author within the dataset.

In [None]:
reviews.describe(x=['author']).categorical.T

The `author` column contains a total of 17,990,555 entries, of which 13,242,121 are unique. The most frequently occurring author ID in the dataset is "32322782b8377dd81989", which appears 126 times. This indicates that while there are a vast number of unique authors, some authors are significantly more prolific than others.

In [None]:
counts, stats = reviews.countstats(x='author')
stats

The dataset includes 13,242,121 unique authors. On average, each author has 1.36 entries in the dataset, with a standard deviation of 1.03. The number of entries per author ranges from a minimum of 1 to a maximum of 126. The 25th, 50th (median), and 75th percentiles are all 1, indicating that most authors have only a single entry. However, there are a few authors with significantly higher counts, as reflected by the maximum value.

#### App Name

In [None]:
reviews.describe(x=["app_name"]).categorical.T

The summary of the `app_name` variable reveals that there are 17,990,555 observations in the dataset, consisting of 34,051 unique mobile application names. The most frequently occurring app name is "MyFitnessPal: Calorie Counter," which appears 338,097 times in the dataset. This suggests that "MyFitnessPal: Calorie Counter" is the most commonly reviewed or referenced mobile application among the dataset's observations.

In [None]:
counts, stats = reviews.countstats(x='app_name')
stats

The summary statistics for the `app_name` variable indicate that there are 34,051 unique mobile application names in the dataset. The mean count of reviews per app name is 528.34, with a standard deviation of 4,864.99, suggesting significant variability in the number of reviews associated with each app name. The distribution ranges from a minimum of 1 review to a maximum of 338,097 reviews, with quartile values indicating that 25% of app names have 9 or fewer reviews, 50% have 27 or fewer reviews, and 75% have 109 or fewer reviews.

#### Category Id
The `category_id` numerically references a specific category or genre in the App Store.

In [None]:
reviews.describe(x=["category_id"]).categorical.T

The frequency distribution for the `category_id` variable reveals that there are 17,990,555 observations in the dataset, categorized into 11 unique category IDs. The most frequently occurring category ID is "6013," which appears 3,946,182 times in the dataset. This indicates that category ID "6013" is the most common category among the observations, suggesting that a significant proportion of reviews belong to this particular category.

In [None]:
counts, stats = reviews.countstats(x='category_id')
stats

In [None]:
reviews.plot.countplot(x="category_id", title="Distribution of Category IDs")

The summary statistics for the `category_id` variable indicate that there are 11 unique category IDs in the dataset. The mean category ID count is 1,635,505.00, with a standard deviation of 1,173,223.76, suggesting variability in the number of reviews across categories. The distribution ranges from a minimum category ID count of 9 to a maximum count of 3,946,182, with quartile values indicating that 25% of category IDs have a count of 807,466.50 or fewer, 50% have 1,405,019.00 or fewer, and 75% have 2,378,821.50 or fewer.

#### Category

In [None]:
reviews.describe(x=["category"]).categorical.T

The summary of the `category` variable indicates that there are 17,990,555 observations in the dataset, categorized into 11 unique categories. The most frequently occurring category is "Health & Fitness," which appears 3,946,182 times in the dataset. This suggests that "Health & Fitness" is the most common category among the observations, indicating a significant proportion of reviews are related to this category.

In [None]:
counts, stats = reviews.countstats(x='category')
stats

In [None]:
_ = reviews.plot.countplot(x="category", title="Distribution of Categories")

The `category` and `category_id` variables share precisely the same distribution. The summary statistics for the `category` variable show that there are 11 unique categories in the dataset. The mean count of observations per category is approximately 1,635,505, with a standard deviation of approximately 1,173,223.76. The distribution ranges from a minimum of 9 observations to a maximum of 3,946,182 observations. Quartile values suggest that 25% of categories have 807,466.50 or fewer observations, 50% have 1,405,019.00 or fewer observations, and 75% have 2,378,821.50 or fewer observations.
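Two columns share a distribution when each value of one maps to exactly one value of the other. A quick way to check this is sketched below with a toy frame; only the pairing of "6013" with "Health & Fitness" is suggested by the matching counts above, and the other rows are illustrative.

```python
import pandas as pd

# Toy frame; rows other than the 6013 pairing are illustrative.
df = pd.DataFrame(
    {
        "category_id": ["6013", "6013", "6002", "6005"],
        "category": [
            "Health & Fitness",
            "Health & Fitness",
            "Utilities",
            "Social Networking",
        ],
    }
)

# Each ID should map to exactly one name, and each name to exactly one ID.
id_to_name = df.groupby("category_id")["category"].nunique().eq(1).all()
name_to_id = df.groupby("category")["category_id"].nunique().eq(1).all()
one_to_one = bool(id_to_name and name_to_id)
print(one_to_one)
```

If the check passes on the full dataset, one of the two columns is redundant for analysis purposes.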

### Temporal Analysis
Finally, we have the datetime variable, `date`, which indicates the date the review was submitted.

In [None]:
reviews.describe(x=["date"], include="datetime").numeric

In [None]:
histogram = reviews.plot.histogram(x="date", title="Review Date Histogram", fill=True, kde=True)

The frequency distribution for the `date` variable indicates that there are 17,990,555 observations in the dataset. The mean date is approximately November 29, 2017, suggesting that the average review date falls around this time. 

The earliest review date in the dataset is July 10, 2008, while the latest review date is August 19, 2023, indicating a broad range of review dates over a span of several years.

The quartile values provide insight into the distribution of review dates:
- 25% of the reviews occurred on or before October 10, 2014.
- 50% of the reviews occurred on or before June 10, 2018.
- 75% of the reviews occurred on or before January 6, 2021.

Overall, the distribution of review dates spans a wide range, with reviews occurring over a period of more than a decade, from 2008 to 2023.
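Quartiles and means of dates can be computed directly on a datetime series in pandas. The sketch below uses invented dates and is not the code behind `reviews.describe`.

```python
import pandas as pd

# Hypothetical review dates, evenly spaced for easy checking.
dates = pd.Series(
    pd.to_datetime(
        ["2020-01-01", "2020-01-03", "2020-01-05", "2020-01-07", "2020-01-09"]
    )
)

quartiles = dates.quantile([0.25, 0.5, 0.75])  # 25th, 50th, 75th percentiles
mean_date = dates.mean()
print(quartiles)
print(mean_date)
```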

#### Reviews by Year
Our dataset contains reviews from 2008 through 2023. For which years were users most opinionated?

In [None]:
df_dates = IOService.read(FP_DATES)


In [None]:
counts, stats = reviews.countstats(x="year", df=df_dates)
counts.sort_values(by="year", inplace=True)

In [None]:
_ = reviews.plot.barplot(data=counts, x='year', y="count", title="Distribution of Reviews by Year")

Several observations are made:
- Review counts have generally increased over the years, with a noticeable spike in 2020. 
- There's a gradual upward trend from 2008 to 2019, followed by a significant jump in 2020, indicating a potential surge in activity or interest. 
- While there are fluctuations in review counts between years, the overall trend suggests a growing engagement or usage over time, with particularly high activity in recent years.

#### Reviews by Month
What is the distribution of reviews by month over the entire dataset period, and are there any notable trends or patterns in the monthly review counts?

In [None]:
counts, stats = reviews.countstats(x="month", df=df_dates)

In [None]:
_ = reviews.plot.barplot(data=counts, x='month', y="count", title="Distribution of Reviews by Month")

Here are key points from the review counts by month:
- January has the highest number of reviews, followed closely by March, May, and July.
- November has the lowest review count among the months.
- There's a relatively consistent level of activity from April to August, with slight variations.
- The months with the highest review counts (January, March, May, July) might correspond to periods of increased activity, possibly influenced by seasonal or external factors.

#### Reviews by Day of Week
How does the volume of opinion vary throughout the week?

In [None]:
counts, stats = reviews.countstats(x="day", df=df_dates)

In [None]:
_ = reviews.plot.barplot(data=counts, x='day', y="count", title="Distribution of Reviews by Day of Week")

Several points are illuminated here:
- Wednesday has the highest number of reviews, followed closely by Tuesday and Thursday.
- Saturday and Sunday have the lowest review counts among the days of the week.
- There's a relatively consistent level of activity from Monday to Friday, with slight variations.
- The higher review counts on weekdays compared to weekends suggest that more reviews are submitted during the workweek, possibly reflecting patterns of user engagement or behavior.
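Counts by year, month, and day of week can all be derived from a single datetime column. The sketch below uses invented dates and the pandas `dt` accessor; it assumes the `day` variable used above holds the day of the week, which the notebook does not state explicitly.

```python
import pandas as pd

# Hypothetical review dates.
dates = pd.Series(
    pd.to_datetime(["2020-03-13", "2020-03-14", "2020-04-01", "2021-01-04"])
)

by_year = dates.dt.year.value_counts().sort_index()
by_month = dates.dt.month.value_counts().sort_index()
by_weekday = dates.dt.day_name().value_counts()

print(by_year)
print(by_month)
print(by_weekday)
```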

#### Top Months and Days by Review Count
Which months and days had the most reviews over the 15-year span of the dataset?

In [None]:
counts, stats = reviews.countstats(x="year_month", df=df_dates)
counts[['year_month', 'count']][0:5]


Based on the provided table showing the top 5 months by review count, it can be inferred that:
- The months of April, March, and May 2020 experienced exceptionally high review counts, indicating a period of heightened user engagement or activity, possibly influenced by global events or seasonal trends.
- Additionally, the presence of January 2021 and July 2023 in the top 5 suggests sustained interest or increased activity across different time periods, highlighting the importance of monitoring trends over time to understand user behavior and engagement patterns.

In [None]:
counts, stats = reviews.countstats(x="ymd", df=df_dates)
counts[['ymd', "count"]][0:5]

Based on the provided table showing the top days by review count, it may be inferred that:
- March and April 2020 were particularly active periods, with multiple days featuring prominently in the top rankings. This suggests that these months may have been marked by significant events or developments that prompted increased user engagement.
- The top days, such as March 13th and May 15th, likely correspond to specific events or moments of heightened activity within those months, indicating the importance of monitoring and understanding temporal patterns to glean insights into user behavior and engagement trends.
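Rankings like these amount to grouping dates at month or day granularity and keeping the largest counts. A minimal sketch with invented dates follows; the variable names are hypothetical and do not reflect how `year_month` and `ymd` were built in the dataset.

```python
import pandas as pd

# Hypothetical review dates.
dates = pd.Series(
    pd.to_datetime(
        [
            "2020-04-01", "2020-04-01", "2020-04-15",
            "2020-03-13", "2020-03-20", "2021-01-04",
        ]
    )
)

# Top months: convert each date to a monthly period, then count.
top_months = dates.dt.to_period("M").value_counts().head(5)

# Top days: count occurrences of each calendar date.
top_days = dates.dt.strftime("%Y-%m-%d").value_counts().head(5)

print(top_months)
print(top_days)
```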

## Variable Relationships
In this bivariate analysis section of our exploratory data analysis (EDA), we investigate the relationships between pairs of variables within our dataset to understand how they interact and to identify potential correlations or associations. By examining combinations of categorical, ordinal, and numerical variables, we will uncover insights that are not apparent from univariate analysis alone.

### Categorical Analysis
In this section, we'll evaluate the relationships among categorical variables such as category and author in terms of review, app, and author counts. We will motivate this part of the analysis with a few guiding questions: 

1. What is the distribution of review counts by category?
2. What is the distribution of app counts by category?
3. What is the distribution of authors by category?

Such an examination will illuminate the level of engagement among the categories from app, review and author perspectives.

#### Distribution of Review Counts by Category

In [None]:
counts, stats = reviews.countstats(by=["category"], var="id", unique=False)
counts.set_index(keys=["category"], inplace=True)
counts.T

In [None]:
_ = reviews.plot.barplot(data=counts, x="category", y="count", title="Review Counts by Category")

The distribution of reviews across different categories highlights varying levels of user engagement and interest:

- **Health & Fitness**: Dominates with the highest number of reviews at 3,946,182.
- **Utilities**: Ranks second with 2,928,883 reviews.
- **Social Networking**: Also highly popular, gathering 2,735,869 reviews.
- **Entertainment**: Attracts substantial user interaction with 2,021,774 reviews.
- **Business, Education, and Lifestyle**: Each has over a million reviews, reflecting significant but slightly lower engagement compared to top categories.
- **Productivity and Medical**: Moderate interest with 822,674 and 621,340 reviews, respectively.
- **Book**: Moderate engagement with 792,259 reviews.
- **Shopping**: Exceptionally low engagement, with only 9 reviews, which is likely an outlier or underreporting.

This summary provides a snapshot of user review distribution across app categories.

##### Distribution of Author Counts by Category
Here, we examine the number of users who've written reviews by category to assess user engagement from this dimension.

In [None]:

df = reviews.subset
counts


### Summary of Distribution of Author Reviews by Category

The table summarizes the statistical distribution of the number of reviews written by authors across different app categories. Each category includes data on count, mean, standard deviation (std), and various percentiles.

- **Count (Authors)**: Each category has the same count of 13,242,121, which is the number of unique authors rather than the number of reviews.

- **Mean (Average Reviews per Author)**:
  - Highest: **Health & Fitness (0.30)**
  - High: Utilities (0.22), Social Networking (0.21)
  - Moderate: Entertainment (0.15), Lifestyle (0.12), Business (0.11)
  - Lower: Education (0.08), Book (0.06), Productivity (0.06), Medical (0.05)
  - Lowest: Shopping (0.00)

- **Standard Deviation (Variability in Reviews per Author)**:
  - Highest variability: **Health & Fitness (0.58)**
  - High variability: Utilities (0.49), Social Networking (0.45), Entertainment (0.40)
  - Moderate variability: Business (0.34), Lifestyle (0.35), Education (0.30), Productivity (0.27)
  - Lower variability: Medical (0.23), Book (0.25)
  - Lowest variability: Shopping (0.00)

- **Percentiles**:
  - **25th, 50th, and 75th Percentiles**: Most categories have 25th, 50th, and 75th percentiles of 0.00, indicating that at least three quarters of authors have written no reviews in those categories.
  - **75th Percentile**: Health & Fitness has a higher 75th percentile (1.00), indicating that at least 25% of authors have written one or more reviews in this category.

- **Maximum Reviews by a Single Author**:
  - Highest: **Business (75 reviews)**
  - Other high values: Health & Fitness (57), Utilities (41), Medical (39), Productivity (32)
  - Moderate values: Education (28), Book (28), Social Networking (24), Entertainment (23), Lifestyle (12)
  - Lowest: Shopping (1 review)

### Interpretation:
- **Health & Fitness** stands out with the highest average number of reviews per author and the highest variability, suggesting intense engagement by a smaller group of authors.
- **Utilities and Social Networking** also show high average reviews and variability, reflecting significant user interaction.
- **Shopping** has the lowest engagement, with minimal reviews per author.
- The majority of authors in most categories have written very few reviews, as indicated by the lower percentiles.
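One way to arrive at per-author statistics of this shape is to cross-tabulate authors against categories and describe each category column. Because every author gets a row in every category, including a zero where they wrote nothing, the count is the same across categories and the means fall below 1. The sketch below uses a toy frame; `reviews_df` and `per_author` are hypothetical names, not the notebook's own objects.

```python
import pandas as pd

# Toy reviews: three authors across two categories.
reviews_df = pd.DataFrame(
    {
        "author": ["a1", "a1", "a2", "a3", "a3", "a3"],
        "category": [
            "Health & Fitness", "Utilities", "Health & Fitness",
            "Utilities", "Utilities", "Utilities",
        ],
    }
)

# Rows are authors, columns are categories, cells are review counts.
per_author = pd.crosstab(reviews_df["author"], reviews_df["category"])

# Describe each category column across all authors.
stats = per_author.describe()
print(stats)
```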

The following bar plot illustrates the differences in the average number of reviews per author across categories.

In [None]:
stats

In [None]:
df = stats.loc[("count","mean")].reset_index()
df.columns = ["category", "count"]
_ = reviews.plot.barplot(data=df,x="category", y="count", title="Average Author Reviews by Category")


##### Reviews by Category