In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings
warnings.filterwarnings("ignore")

# Metrics

In this stage, we aggregate the review dataset and compute summary statistics along several dimensions. The Metrics Aggregation Stage groups the data at four levels: category, author, app, and the combination of category and author.

For each level of aggregation, we extract key metrics including counts of reviews, unique authors, and unique apps. Additionally, we compute descriptive statistics on important features such as rating, review length, vote count, vote sum, and date.

This approach allows us to gain insights into the distribution and characteristics of reviews across different categories, authors, and apps, providing valuable information for our subsequent analysis and decision-making processes.
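To make the aggregation concrete, here is a minimal sketch of a category-level rollup in plain Python. It is not the pipeline's implementation (the notebook runs on Spark); the toy rows, the `category_metrics` helper, and the choice of the first value among tied modes are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean, multimode, stdev

# Hypothetical toy reviews: (category, app, author, rating)
reviews = [
    ("Education", "app_a", "u1", 5),
    ("Education", "app_a", "u2", 3),
    ("Education", "app_b", "u1", 5),
    ("Games", "app_c", "u3", 1),
    ("Games", "app_c", "u4", 2),
]


def category_metrics(rows):
    """Aggregate review rows into one metrics dict per category."""
    groups = defaultdict(list)
    for category, app, author, rating in rows:
        groups[category].append((app, author, rating))

    result = {}
    for category, items in groups.items():
        ratings = [rating for _, _, rating in items]
        app_count = len({app for app, _, _ in items})
        author_count = len({author for _, author, _ in items})
        result[category] = {
            "app_count": app_count,
            "author_count": author_count,
            "review_count": len(items),
            "reviews_per_app": len(items) / app_count,
            "reviews_per_author": len(items) / author_count,
            "rating_min": min(ratings),
            "rating_max": max(ratings),
            "rating_avg": mean(ratings),
            # Assumption: report the first of any tied modes.
            "rating_mode": multimode(ratings)[0],
            # Sample standard deviation; undefined for a single review, so use 0.0.
            "rating_std": stdev(ratings) if len(ratings) > 1 else 0.0,
        }
    return result


metrics_by_category = category_metrics(reviews)
print(metrics_by_category["Education"])
```

The same pattern extends to the other levels by changing the grouping key, and to the other features (review length, vote count, vote sum, date) by adding further fields to each group's dictionary.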

In [2]:
# A hyphen is not valid in a Python module path; this assumes the package is importable as `appvocai_discover`.
from appvocai_discover.data.prep.metrics import Metrics, AppMetricsConfig, CategoryAuthorMetricsConfig, AuthorMetricsConfig, CategoryMetricsConfig, CategoryMetricsTask, AuthorMetricsTask, AppMetricsTask, CategoryAuthorMetricsTask
from appvocai_discover.infrastructure.spark.factory import SparkSessionPool
from appvocai_discover.utils.print import Printer

You know the drill. As in the earlier stages, we obtain a Spark session from the session pool. This stage needs no NLP support, so we build the session with `nlp=False`.

In [2]:
factory = SparkSessionPool()
spark = factory.build(nlp=False)


## Category Metrics
In this section, we review category-level metrics, providing insights into various aspects such as review counts, unique authors, unique apps, and descriptive statistics concerning ratings, review length, vote count, vote sum, and date. 

In [3]:
config = CategoryMetricsConfig()
metrics = Metrics(config=config, spark=spark, metrics_task_cls=CategoryMetricsTask)
category_metrics = metrics.execute()

Let's sample a category.

In [4]:
category = category_metrics.sample(n=1)
Printer().print_dataframe_as_dict(df=category, title="Category Metrics")



                                Category Metrics                                
                                category | Education
                               app_count | 300
                            author_count | 1040
                            review_count | 1040
                         reviews_per_app | 3.466666666666667
                      reviews_per_author | 1.0
                              rating_min | 1
                              rating_max | 5
                              rating_avg | 3.5913461538461537
                             rating_mode | 5
                              rating_std | 1.7432178810280476
                       review_length_min | 1
                       review_length_max | 363
                       review_length_avg | 28.556730769230768
                      review_length_mode | 2
                       review_length_std | 33.72065298611556
                          vote_count_min | 0
                          vote_count_max | 9
    

This sample summarizes the reviews in a single category, here Education. It covers 300 apps and 1,040 reviews written by 1,040 distinct authors, which works out to about 3.47 reviews per app and exactly one review per author.

Ratings span the full scale from 1 to 5, with a mean of 3.59 and a mode of 5. The standard deviation of 1.74 is large for a five-point scale, indicating that opinions in this category are widely spread rather than clustered around the mean.

Review lengths range from 1 to 363, with a mean of 28.6 and a standard deviation of 33.7. The most common length is just 2, so the distribution leans toward very short reviews with a long tail of detailed ones.

Vote counts in this sample range from 0 to 9. The metrics also include vote sums and the range of dates on which the reviews were submitted (not shown in the excerpt above), which provides a timeframe for the data.

Together, these statistics characterize user engagement and satisfaction within a category, as well as the distribution and variability of its reviews over time.

## Author Metrics
Building on the category-level metrics, author-level metrics provide a more granular view of individual reviewers' behaviors and contributions. These metrics include the total number of reviews each author has submitted, the distribution of their ratings, and the lengths of their reviews. Additionally, author-level metrics capture the number of votes their reviews receive, both individually and in aggregate. This helps in understanding patterns in reviewer activity, the consistency of their feedback, and the reception of their reviews by the community, complementing the broader insights gained from the category-level analysis.

In [5]:
config = AuthorMetricsConfig()
metrics = Metrics(config=config, spark=spark, metrics_task_cls=AuthorMetricsTask)
author_metrics = metrics.execute()

#                            AuthorMetrics Pipeline                            #

Task Reader completed successfully.
Task ConvertTask completed successfully.
Task AuthorMetricsTask completed successfully.
Task ConvertTask completed successfully.
Task Writer completed successfully.

                                 AuthorMetrics                                  
                          Pipeline Start | 2024-06-03 13:40:11.227942
                           Pipeline Stop | 2024-06-03 13:40:27.873968
                        Pipeline Runtime | 00 Minutes 16.646026 Seconds

Let's pull a sample from authors that have 2 or more reviews.

In [6]:
# Keep only authors with at least two reviews, then draw one at random.
author = author_metrics.loc[author_metrics["review_count"] > 1]
author = author.sample(n=1)
Printer().print_dataframe_as_dict(df=author, title="Author Metrics")



                                 Author Metrics                                 
                                  author | 010551c7268306b0e116
                                category | 2
                               app_count | 2
                            review_count | 2
                         reviews_per_app | 1.0
                    reviews_per_category | 1.0
                              rating_min | 5
                              rating_max | 5
                              rating_avg | 5.0
                             rating_mode | 5
                              rating_std | 0.0
                       review_length_min | 4
                       review_length_max | 34
                       review_length_avg | 19.0
                      review_length_mode | 34
                       review_length_std | 21.213203435596427
                          vote_count_min | 0
                          vote_count_max | 0
                          vote_count_avg | 0.0
           

This sample shows the metrics for a single author. The author wrote two reviews for two different apps. The `category` field appears to be a count of distinct categories rather than a category name, which is consistent with `reviews_per_category` being 1.0. Both reviews carry a 5-star rating, so the rating standard deviation is 0. The two review lengths, 4 and 34, differ considerably, giving a mean of 19 and a standard deviation of 21.2. Neither review received any votes. Across all authors, these metrics reveal patterns in reviewer activity, the consistency of their feedback, and how their reviews are received by the community, complementing the broader trends observed at the category level.
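The review length figures in the author sample also tell us which standard deviation is being reported. The short check below uses Python's `statistics` module on the two lengths implied by the output (minimum 4, maximum 34, two reviews). It is a sanity check on the printed numbers, not the pipeline's own code.

```python
from statistics import multimode, pstdev, stdev

# Review lengths implied by the author sample above
# (review_count = 2, review_length_min = 4, review_length_max = 34).
review_lengths = [4, 34]

sample_std = stdev(review_lengths)       # divides by n - 1
population_std = pstdev(review_lengths)  # divides by n
modes = multimode(review_lengths)        # every value tied for most frequent

print(sample_std, population_std, modes)
```

The sample standard deviation matches the reported `review_length_std` of 21.2132..., whereas the population version would give 15.0. Note also that with two distinct values both are modes, so the reported `review_length_mode` of 34 reflects a tie-breaking choice rather than a genuinely most frequent length.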

## App Metrics
Following the author metrics, we now turn to individual apps. App-level metrics describe how each application is rated, how many authors have reviewed it, and how much engagement its reviews receive.

In [7]:
config = AppMetricsConfig()
metrics = Metrics(config=config, spark=spark, metrics_task_cls=AppMetricsTask)
app_metrics = metrics.execute()

#                             AppMetrics Pipeline                              #

Task Reader completed successfully.
Task ConvertTask completed successfully.
Task AppMetricsTask completed successfully.
Task ConvertTask completed successfully.
Task Writer completed successfully.

                                   AppMetrics                                   
                          Pipeline Start | 2024-06-03 13:40:27.935439
                           Pipeline Stop | 2024-06-03 13:40:32.010320
                        Pipeline Runtime | 00 Minutes 04.074881 Seconds

Let's take a sample of an app with two or more reviews.

In [8]:
# Keep only apps with at least two reviews, then draw one at random.
app = app_metrics.loc[app_metrics["review_count"] > 1]
app = app.sample(n=1)
Printer().print_dataframe_as_dict(df=app, title="App Metrics")



                                  App Metrics                                   
                                app_name | ShopShop - Shopping List
                            author_count | 4
                            review_count | 4
                      reviews_per_author | 1.0
                              rating_min | 3
                              rating_max | 5
                              rating_avg | 4.0
                             rating_mode | 4
                              rating_std | 0.816496580927726
                       review_length_min | 11
                       review_length_max | 67
                       review_length_avg | 32.0
                      review_length_mode | 11
                       review_length_std | 25.85858980429263
                          vote_count_min | 0
                          vote_count_max | 0
                          vote_count_avg | 0.0
                         vote_count_mode | 0
                          vote_count_std

This sample shows the metrics for a single app, ShopShop - Shopping List. Four authors each wrote one review of it. Ratings range from 3 to 5, with a mean and mode of 4 and a standard deviation of 0.82. Review lengths range from 11 to 67 with a mean of 32, and none of the reviews received any votes. In general, app-level metrics cover the number of authors and reviews per app, the reviews per author, rating and review length statistics, vote statistics, and the timeframe during which the reviews were submitted.
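With only four reviews, the summary statistics in the app sample pin down the individual ratings. The sketch below enumerates every combination of four star ratings and keeps those that reproduce the reported values. It assumes ratings are whole stars from 1 to 5 and that the reported mode is the unique most frequent value; both are assumptions, not facts stated by the pipeline.

```python
from itertools import combinations_with_replacement
from statistics import mean, multimode, stdev

# Summary values copied from the App Metrics sample above.
target = {"count": 4, "min": 3, "max": 5, "avg": 4.0, "mode": 4, "std": 0.816496580927726}


def matches(ratings):
    """Return True if the ratings reproduce every reported summary value."""
    return (
        min(ratings) == target["min"]
        and max(ratings) == target["max"]
        and abs(mean(ratings) - target["avg"]) < 1e-9
        and multimode(ratings) == [target["mode"]]
        and abs(stdev(ratings) - target["std"]) < 1e-9  # sample standard deviation
    )


# Enumerate every multiset of four star ratings (1 to 5), ignoring order.
candidates = [
    combo
    for combo in combinations_with_replacement(range(1, 6), target["count"])
    if matches(combo)
]
print(candidates)  # [(3, 4, 4, 5)]
```

Only one combination survives, so under these assumptions the four reviews were rated 3, 4, 4, and 5.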

## Category and Author Metrics
Now, let's wrap up our analysis by aggregating on the combination of category and author. Each row describes one author's reviews within one category, which shows how individual reviewers behave in a specific part of the app landscape rather than across the whole dataset.

In [9]:
config = CategoryAuthorMetricsConfig()
metrics = Metrics(config=config, spark=spark, metrics_task_cls=CategoryAuthorMetricsTask)
category_author_metrics = metrics.execute()

As before, let's inspect a sample of an author who has two or more reviews within a category.

In [10]:
# Keep only category/author pairs with at least two reviews, then draw one at random.
category_author = category_author_metrics.loc[category_author_metrics["review_count"] > 1]
category_author = category_author.sample(n=1)
Printer().print_dataframe_as_dict(df=category_author, title="Category/Author Metrics")



                            Category/Author Metrics                             
                                category | Social Networking
                                  author | 7402e6f4591082ddf333
                               app_count | 2
                            review_count | 2
                         reviews_per_app | 1.0
                              rating_min | 1
                              rating_max | 5
                              rating_avg | 3.0
                             rating_mode | 1
                              rating_std | 2.8284271247461903
                       review_length_min | 4
                       review_length_max | 23
                       review_length_avg | 13.5
                      review_length_mode | 4
                       review_length_std | 13.435028842544403
                          vote_count_min | 0
                          vote_count_max | 0
                          vote_count_avg | 0.0
                         vot

This sample shows one author's activity within the Social Networking category. Each row of the category/author metrics represents a distinct category-author pair. Here, the author reviewed two apps in the category, one review each, and rated them 1 and 5, giving a mean of 3.0 and a standard deviation of 2.83. The review lengths were 4 and 23, and neither review received any votes. Analyzing these metrics at the category and author level offers insights into individual user behavior and contributions within each category.
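Grouping on a pair of columns works the same way as grouping on one, with a tuple as the key. The following is a minimal plain-Python sketch under stated assumptions: the toy rows and the `pair_metrics` name are hypothetical, and the real stage runs on Spark.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy reviews: (category, author, app, rating)
reviews = [
    ("Social Networking", "u1", "app_a", 1),
    ("Social Networking", "u1", "app_b", 5),
    ("Social Networking", "u2", "app_a", 4),
    ("Education", "u1", "app_c", 3),
]

# Group rows by the composite (category, author) key.
groups = defaultdict(list)
for category, author, app, rating in reviews:
    groups[(category, author)].append((app, rating))

pair_metrics = {}
for key, items in groups.items():
    ratings = [rating for _, rating in items]
    app_count = len({app for app, _ in items})
    pair_metrics[key] = {
        "app_count": app_count,
        "review_count": len(items),
        "reviews_per_app": len(items) / app_count,
        "rating_avg": mean(ratings),
        # Sample standard deviation; undefined for a single review, so use 0.0.
        "rating_std": stdev(ratings) if len(ratings) > 1 else 0.0,
    }

# Keep only pairs with two or more reviews, mirroring the notebook's filter.
repeat_pairs = {k: v for k, v in pair_metrics.items() if v["review_count"] > 1}
print(repeat_pairs)
```

In this toy data, author `u1` appears under two keys, once per category, so their Education review does not affect their Social Networking statistics.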