# Exploratory Data Analysis

## Stage I: App Performance Overview

Stage one of our exploratory data analysis aims to expose patterns and yield insight into the nature and intensity of the customer experience within the iOS app user community.

### The Dataset

The dataset contains the following product descriptive, rating, price, and developer data for some 475,132 apps from the App Store.

| #  | Variable     | Data Type  | Description                              |
| -- | ------------ | ---------- | ---------------------------------------- |
| 1  | id           | Nominal    | App Id from the App Store                |
| 2  | name         | Nominal    | App Name                                 |
| 3  | description  | Nominal    | App Description                          |
| 4  | category_id  | Nominal    | Numeric category identifier              |
| 5  | category     | Nominal    | Category name                            |
| 6  | price        | Continuous | App Price                                |
| 7  | developer_id | Nominal    | Identifier for the developer             |
| 8  | developer    | Nominal    | Name of the developer                    |
| 9  | rating       | Ordinal    | Average user rating since first released |
| 10 | ratings      | Discrete   | Number of ratings since first release    |
| 11 | released     | Continuous | Datetime of first release                |

### EDA Approach

Our exploration will comprise the following five analyses, followed by a summary of conclusions.

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions of rating count, average rating, categories, and price.
4. Bivariate Analysis: Evaluate relationships between pairs of variables, such as ratings, rating count, and reviews, including correlation analysis.
5. Multivariate Analysis: Cluster, factor, and correspondence analysis of three or more variables simultaneously.
6. Conclusions, insights and questions for stage two.

**Import Python Libraries and Provision Dependencies**

In [1]:
import dependency
import sys
print(sys.path)
import numpy as np
import pandas as pd
from IPython.display import HTML, display_html
import seaborn as sns

import d8analysis as eda

from appstore.container import AppstoreContainer
from appstore.data.dataset.appdata import AppDataDataset

Added /home/john/projects/appstore to sys.path


/home/john/projects/appstore/config/logging.yml
['/home/john/projects/appstore/jbook/3_data', '/home/john/anaconda3/envs/appstore/lib/python310.zip', '/home/john/anaconda3/envs/appstore/lib/python3.10', '/home/john/anaconda3/envs/appstore/lib/python3.10/lib-dynload', '', '/home/john/anaconda3/envs/appstore/lib/python3.10/site-packages', '/home/john/anaconda3/envs/appstore/lib/python3.10/site-packages/PyQt5_sip-12.11.0-py3.10-linux-x86_64.egg', '/home/john/projects/appstore']


**Obtain the Dataset**

In [2]:
container = AppstoreContainer()
repo = container.data.appdata_repo()
dataset = repo.get_dataset()

### Structural Analysis

The structure and characteristics of the AppData dataset are as follows:

In [3]:
df1 = dataset.overview
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

| Characteristic         | Total     |
| ---------------------- | --------- |
| Number of Observations | 475132    |
| Number of Variables    | 11        |
| Number of Cells        | 5226452   |
| Size (Bytes)           | 962579922 |

| Data Type      | Count |
| -------------- | ----- |
| string         | 5     |
| float64        | 2     |
| category       | 1     |
| category       | 1     |
| int64          | 1     |
| datetime64[ns] | 1     |
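
For readers without access to the project's `AppDataDataset` class, the same structural summary can be approximated with plain pandas. This is a minimal sketch under stated assumptions: `overview` and `dtype_counts` are hypothetical helpers, not the implementation behind `dataset.overview` or `dataset.dtypes`, and the toy frame stands in for the real data.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize shape and memory footprint (hypothetical helper)."""
    rows, cols = df.shape
    # deep=True also counts the memory held by string/object values.
    size_bytes = int(df.memory_usage(index=True, deep=True).sum())
    return pd.DataFrame(
        {
            "Characteristic": [
                "Number of Observations",
                "Number of Variables",
                "Number of Cells",
                "Size (Bytes)",
            ],
            "Total": [rows, cols, rows * cols, size_bytes],
        }
    )


def dtype_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count the columns of each data type (hypothetical helper)."""
    counts = df.dtypes.astype(str).value_counts()
    return counts.rename_axis("Data Type").reset_index(name="Count")


# Toy stand-in for the app data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "id": pd.array(["1", "2", "3"], dtype="string"),
        "category": pd.Categorical(["Games", "Finance", "Games"]),
        "price": [0.0, 1.99, 0.0],
        "ratings": [10, 0, 205],
    }
)

structure = overview(toy)
types = dtype_counts(toy)
print(structure)
print(types)
```

The number of cells is simply observations times variables, which matches the table above: 475,132 × 11 = 5,226,452.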


As indicated above, we have approximately 475,000 apps in our dataset, described by 11 features. Let's take a quick look.

In [4]:
dataset.sample().style.hide(axis="index")

Id,Name,Description,Category_id,Category,Price,Developer_id,Developer,Rating,Ratings,Released
1598380690,EZ Tolls DE,"Manage your Delaware E-ZPass account through an easy-to-use app! * Add money to your account * Check your balance * Edit registered vehicles * View your toll history Disclaimer: The developers of this app and Pragmistic are not associated with DelDOT, E-ZPass Delaware or any other electronic toll-collection system. Every attempt has been made to ensure that this application is accurate and reliable, however human and/or mechanical errors are possible. Accordingly, Pragmistic makes no representation as to the accuracy or completeness of the information provided in this app and denies any express or implied warranty of the same.",6002,Utilities,0.0,993081828,Pragmistic LLC,1.0,1,2021-12-06 08:00:00
963249200,Farmers State Bank & Trust Co.,"Start banking wherever you are with Farmers State Bank & Trust Co. app. Available to all Farmers State Bank & Trust Company online banking customers. Available features include: Accounts: •Check your latest account balance and search recent transactions by date, amount, or check number. Transfers: •Easily transfer cash between your accounts. Bill Pay: •Make payments and view recent and scheduled payments. Deposits: •Submit check deposits using your device’s camera. Locations: •Find nearby branches and ATMs using your device’s built-in GPS. Touch ID/Face ID: •Touch ID/Face ID allows you to use a secure and more efficient sign-on experience using your fingerprint or facial recognition.",6015,Finance,0.0,963249199,Farmers State Bank & Trust Co.,4.86341,205,2015-05-25 07:11:00
1565018123,Ledes Serviços Contábeis,"Software destinado ao gerenciamento de entregas e comunicação entre os Clientes dos Escritórios de Contabilidade com o Escritório de Contabilidade, o mesmo possui as seguintes funcionalidades: 1 - Gerenciamento de Solicitações tanto pelo cliente como pelo escritório contábil; 2 - Gerenciamento de Documentos Eletrônicos; 3 - Gerenciamento e Visualização de Comunicados;",6015,Finance,0.0,1491321740,Abner Roberto Santiago Da Silva,0.0,0,2021-04-27 07:00:00
739285899,Big Cartel,"Over a million creators use Big Cartel to run their business and sell all sorts of weird and wonderful stuff. Use the app to take your Big Cartel shop with you anywhere you go! Easily manage products, orders, discounts, and account settings to sell your work whenever, wherever. - View sales stats in a glance with the Dashboard - Create, edit, and rearrange products - Upload product images from your photo library - Track and update order status - Accept in-person payments using cash, or credit cards via Stripe - Add, edit, or remove discounts - Search orders, view order details, and print packing slips - Send receipts instantly - Adjust your account settings, including your email and shop description You can do all that and more while you’re away from the computer, making it perfect for selling at craft fairs, concerts, and other live events. We hope you love this app, and appreciate you leaving a rating or review. If you have questions, we're here to help! Just email support@bigcartel.com.",6000,Business,0.0,739285902,Big Cartel,4.76097,6794,2013-11-13 00:32:00
1478967788,Schoolrunner 2.0,"** Please note: you must be a staff member, student, or guardian at a Schoolrunner partner school to access the Schoolrunner mobile app. ** The Schoolrunner mobile app is a convenient companion to our full-featured web app. Our mobile app has these features to make teachers' days easier: Behaviors – Quickly give 1 or more behaviors to 1 or more students. See your most relevant groups right away or easily search for anyone. Save favorites and see your most recently given behaviors for even quicker tracking. View and edit behaviors you've given from the app today. Choose whether to play sounds when giving behaviors. Class Attendance – See your class sections or search for any other section. Log attendance with a few taps and see your class attendance percentage update in real time. Communication – Automatically log calls and emails placed from the Schoolrunner mobile app with details such as topic, mood, and comments. Manually log communications without having to make a call. View all communications for a student whether logged by you or someone else. In-App Messaging: send messages to guardians of students in a group/homeroom/section, or to all guardians of a particular student. Choose to accept replies from guardians or restrict to no-reply announcements. Change your language to automatically translate incoming messages into your home language. Student Profile - A work in progress. View and update student photos with your device's camera or photo library to add faces to names. View pertinent information like homeroom, birthday, and alerts. Search - Quickly find any student group, section, or student at your school. Star the most important groups for quick access and manage your notification preferences for them. Go directly to the Attendance, Behaviors, Communications, and Student pages from search results.",6017,Education,0.0,815111331,Schoolrunner,2.11628,43,2019-09-10 07:00:00


Identity variables, specifically (app) id and developer_id, will be retained for data processing purposes but have no other analytical value and will be largely ignored during this analysis.

### Data Quality Analysis

Data type, cardinality, validity, duplication, and size data are summarized at the variable level.

In [5]:
dataset.info.style.hide(axis="index")

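| Column       | Datatype | Valid  | Null | Validity | Cardinality | Percent unique | Size       |
| ------------ | -------- | ------ | ---- | -------- | ----------- | -------------- | ---------- |
| id           | string   | 475132 | 0    | 1.0      | 475132      | 1.0            | 31748470   |
| name         | string   | 475132 | 0    | 1.0      | 474250      | 1.0            | 40449624   |
| description  | string   | 475132 | 0    | 1.0      | 463635      | 0.98           | 1186227495 |
| category_id  | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477790     |
| category     | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477944     |
| price        | float64  | 475132 | 0    | 1.0      | 125         | 0.0            | 3801056    |
| developer_id | string   | 475132 | 0    | 1.0      | 265367      | 0.56           | 31666555   |
| developer    | string   | 475132 | 0    | 1.0      | 264402      | 0.56           | 37093493   |
| rating       | float64  | 475132 | 0    | 1.0      | 44083       | 0.09           | 3801056    |
| ratings      | int64    | 475132 | 0    | 1.0      | 14531       | 0.03           | 3801056    |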


**Observations**

- With the exception of released (date), we have no missing values.
- Ids are unique, and name, description, and developer information are all high-cardinality.
- Category id and label are low-cardinality, with 26 unique values each.
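
The per-variable quality measures summarized above (valid and null counts, validity, cardinality, percent unique) can be sketched with plain pandas. `quality_summary` is a hypothetical helper and only an approximation of what `dataset.info` reports; the toy data, including its one missing release date, is made up.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column data quality summary (hypothetical helper)."""
    n = len(df)
    summary = pd.DataFrame(
        {
            "Datatype": df.dtypes.astype(str),
            "Valid": df.notna().sum(),      # non-missing values per column
            "Null": df.isna().sum(),        # missing values per column
            "Cardinality": df.nunique(dropna=True),
        }
    )
    # Validity: share of non-missing values; percent unique: distinct / total.
    summary["Validity"] = (summary["Valid"] / n).round(2)
    summary["Percent unique"] = (summary["Cardinality"] / n).round(2)
    return summary.rename_axis("Column").reset_index()


# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "category": ["Games", "Games", "Finance", "Games"],
        "released": pd.to_datetime(
            ["2015-05-25", None, "2021-12-06", "2019-09-10"]
        ),
    }
)

quality = quality_summary(toy)
print(quality)
```

A column whose percent unique is close to 1.0 behaves like an identifier, while a value near 0.0 signals a low-cardinality categorical variable.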

#### Numeric Variable Data Quality

Each feature has been cast to an appropriate data type, and the summary above shows no missing values for the listed variables. Valid values for the numeric variables are:

| Variable | Data Type  | Valid Values                                     |
| -------- | ---------- | ------------------------------------------------ |
| price    | Continuous | Non-negative values                              |
| rating   | Interval   | Real valued in [0,5]                             |
| ratings  | Discrete   | Discrete and non-negative                        |
| released | Continuous | Datetimes between June 10, 2008 and present day. |

Let's check the ranges for these variables.

In [6]:
stats = dataset.describe(include=[np.number, np.datetime64])
stats.numeric[['min','max']]

| Variable | min | max        |
| -------- | --- | ---------- |
| price    | 0.0 | 999.99     |
| rating   | 0.0 | 5.0        |
| ratings  | 0.0 | 30835421.0 |


All numeric and datetime values are within range.
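
The valid-value rules from the table above can also be checked programmatically rather than by inspecting minima and maxima. The following is a minimal sketch, assuming a DataFrame with the same column names; `out_of_range` is a hypothetical helper and the toy rows are made up to trigger violations.

```python
import pandas as pd

# Earliest valid release date, taken from the valid-values table.
EARLIEST_RELEASE = pd.Timestamp("2008-06-10")


def out_of_range(df: pd.DataFrame, latest: pd.Timestamp) -> pd.Series:
    """Count values violating each validity rule (hypothetical helper).

    Missing values fail `between`, so they are counted as violations
    for rating and released.
    """
    checks = pd.DataFrame(
        {
            "price": df["price"] < 0,
            "rating": ~df["rating"].between(0, 5),
            "ratings": df["ratings"] < 0,
            "released": ~df["released"].between(EARLIEST_RELEASE, latest),
        }
    )
    return checks.sum()


# Toy rows for illustration; the last row breaks three of the rules.
toy = pd.DataFrame(
    {
        "price": [0.0, 999.99, -1.0],
        "rating": [0.0, 5.0, 5.5],
        "ratings": [0, 30835421, 12],
        "released": pd.to_datetime(["2015-05-25", "2021-12-06", "2007-01-01"]),
    }
)

violations = out_of_range(toy, latest=pd.Timestamp("2023-12-31"))
print(violations)
```

In practice `latest` would be the current date, for example `pd.Timestamp.now()`; a fixed date is used here to keep the example deterministic.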

#### Categorical Variable Data Quality

The id, name, description, developer_id, and developer variables are nominal, high-cardinality strings. Category and category_id, in contrast, must contain one of the 26 category id / category values selected for this analysis.

In [7]:
columns = ['category_id', 'category']
dataset.unique(columns=columns).style.hide(axis="index")

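| Category_id | Category          |
| ----------- | ----------------- |
| 6013        | Health & Fitness  |
| 6017        | Education         |
| 6000        | Business          |
| 6012        | Lifestyle         |
| 6004        | Sports            |
| 6014        | Games             |
| 6007        | Productivity      |
| 6002        | Utilities         |
| 6027        | Graphics & Design |
| 6010        | Navigation        |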


Category and category_id values are as expected.
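
A visual check works for a short list, but the same test can be automated by comparing each category_id / category pair against a reference mapping. This is a sketch only: `EXPECTED` is a deliberately partial mapping built from pairs that appear in the outputs on this page, `invalid_categories` is a hypothetical helper, and the `9999` row is made up.

```python
import pandas as pd

# Partial reference mapping (assumption: ids compared as strings).
EXPECTED = {
    "6000": "Business",
    "6002": "Utilities",
    "6015": "Finance",
    "6017": "Education",
}


def invalid_categories(df: pd.DataFrame, expected: dict) -> pd.DataFrame:
    """Return distinct id/label pairs absent from the reference mapping."""
    pairs = df[["category_id", "category"]].astype(str).drop_duplicates()
    is_invalid = [
        expected.get(cid) != label
        for cid, label in zip(pairs["category_id"], pairs["category"])
    ]
    return pairs.loc[is_invalid]


# Toy data for illustration; the last row is intentionally invalid.
toy = pd.DataFrame(
    {
        "category_id": ["6000", "6015", "6015", "9999"],
        "category": ["Business", "Finance", "Finance", "Unknown"],
    }
)

invalid = invalid_categories(toy, EXPECTED)
print(invalid)
```

An empty result means every pair in the data matches the reference mapping.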

### Univariate Analysis

#### Quantitative Data

We'll begin the univariate analysis with an examination of the quantitative variables, namely:

- Average User Rating
- Rating Count
- Price
- Release Date

Using quantitative and qualitative methods, we'll discover the central tendency of the data (arithmetic mean, median, mode), its spread (variance, standard deviation, interquartile range, maximum and minimum value) and some features of its distribution (skewness, kurtosis).
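
The measures listed above map directly onto pandas `Series` methods. The following is a minimal sketch with made-up ratings; `describe_distribution` is a hypothetical helper, not the project's `describe` method. Note that pandas reports sample variance and standard deviation (`ddof=1`) and that `kurt()` returns excess kurtosis.

```python
import pandas as pd


def describe_distribution(x: pd.Series) -> pd.Series:
    """Central tendency, spread, and shape of a numeric series."""
    q1, q3 = x.quantile([0.25, 0.75])
    return pd.Series(
        {
            "mean": x.mean(),
            "median": x.median(),
            "mode": x.mode().iloc[0],  # first mode if there are ties
            "variance": x.var(),
            "std": x.std(),
            "iqr": q3 - q1,
            "min": x.min(),
            "max": x.max(),
            "skewness": x.skew(),      # negative values indicate left skew
            "kurtosis": x.kurt(),      # excess kurtosis (normal = 0)
        }
    )


# Made-up ratings concentrated at five stars, for illustration only.
toy_ratings = pd.Series([5, 5, 5, 5, 4, 4, 3, 1], dtype=float)
summary = describe_distribution(toy_ratings)
print(summary)
```

With a left-skewed sample like this one, the mean falls below the median, which is why the median is often the more robust summary for ratings.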

##### Average User Rating

In [8]:
dataset.plot.pdfcdfplot(x='rating', title='Average User Rating Distribution')



Since the rating scale is [1,5], it's clear that the probability density and histogram above include apps that have not been rated. To get a sense of the actual ratings, we'll create a new dataset that excludes the unrated apps.

In [None]:
df = dataset.as_df()
df = df.loc[df['rating'] != 0]
rated = AppDataDataset(df=df)

Ok, let's examine the frequency distribution of the ratings.

In [None]:
rated.frequency(x='rating', bins=4)
stats = rated.describe(x='rating')
stats.numeric
rated.plot.pdfcdfplot(x='rating', bins=4, title='Distribution of User Ratings')
rated.plot.histpdfplot(x='rating', title='Distribution of User Ratings')

**Key Observations:**

- The long left tail reveals a tendency towards ratings in the 4-5 star range.
- Five-star ratings make up 67% of all ratings.
- Additional peaks are observed at one-star and three-star ratings and, to a lesser degree, at two stars.
- Ratings up to one, two, and three stars correspond to approximately 8%, 20%, and 33% of the cumulative ratings, respectively.
- The distribution of ratings cannot be assumed to be normal.
- In short, five-star ratings dominate customer opinion at this level by a significant margin.
- Note: According to measurement theorists, most notably Harvard psychologist S. S. Stevens, who coined the terms *nominal*, *ordinal*, *interval*, and *ratio*, taking the average of ordinal values such as user ratings is not among the *permissible* statistical transformations whose meanings are preserved when applied to the data. Fortunately, permission is not required in data analysis.
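
The cumulative shares quoted in these observations come from a binned frequency distribution. A minimal sketch of such a table using `pandas.cut` follows; `frequency_table` is a hypothetical helper, the bin edges are an assumption, and the ratings are made up, so the percentages will not match the real data.

```python
import pandas as pd


def frequency_table(x: pd.Series, edges: list) -> pd.DataFrame:
    """Binned counts, shares, and cumulative shares for a numeric series."""
    # Bins are right-closed; include_lowest keeps the minimum edge inside
    # the first bin, so a rating of exactly 2.0 lands in the first bin.
    binned = pd.cut(x, bins=edges, include_lowest=True)
    counts = binned.value_counts(sort=False).sort_index()
    table = pd.DataFrame({"count": counts, "share": counts / len(x)})
    table["cumulative_share"] = table["share"].cumsum()
    return table


# Made-up average ratings for ten rated apps, for illustration only.
toy_ratings = pd.Series([1.0, 2.5, 3.5, 4.5, 4.8, 5.0, 5.0, 5.0, 5.0, 5.0])
table = frequency_table(toy_ratings, edges=[1, 2, 3, 4, 5])
print(table)
```

Reading the `cumulative_share` column top to bottom gives the "ratings up to n stars" figures directly.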

##### Rating Count

Rating count can be a proxy for the intensity of opinion. We'll use the same rated dataset as above.

In [None]:
stats = rated.describe(x='ratings')
stats.numeric
rated.plot.histogram(x='ratings',bins=5, title='Distribution of User Rating Count')

In [None]:
rated.top_n(x='ratings', n=10)

In [None]:
topn = np.array([10,20,35,50,75,100,200,500,1000])
rated.plot.topn_plot(x='ratings', n=topn)

**Key Observations:**

- The distribution of rating counts has a long right tail, ranging from 1 to nearly 31 million ratings.
- The median is 10 ratings per app, while the mean is pulled in the direction of the outliers.
- Giants of big tech, social media, and e-commerce, such as YouTube, TikTok, Spotify, WhatsApp, and DoorDash, are among the most-rated apps in the App Store.
- The top-10 most-rated apps account for nearly 14% of all ratings yet make up less than 1/10th of a percent of all apps. Moreover, the 1,000 most-rated apps, which represent 1/3rd of a percent of all apps, account for nearly 75% of all ratings.
- Takeaway: Rating counts are vastly disproportionate.
- Note: Apps with earlier release dates may have higher rating counts. Normalizing to ratings per day since release would remove this temporal effect from the rating counts.
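
Both the concentration figures and the ratings-per-day normalization mentioned above are straightforward to compute. The sketch below uses made-up counts and dates; `topn_share` and `ratings_per_day` are hypothetical helpers, not the code behind `rated.plot.topn_plot`.

```python
import pandas as pd


def topn_share(counts: pd.Series, n_values: list) -> pd.DataFrame:
    """Share of apps and of all ratings held by the n most-rated apps."""
    ranked = counts.sort_values(ascending=False).reset_index(drop=True)
    cumulative = ranked.cumsum()
    total = ranked.sum()
    rows = []
    for n in n_values:
        n = min(n, len(ranked))  # guard against n larger than the data
        rows.append(
            {
                "n": n,
                "pct_of_apps": n / len(ranked),
                "pct_of_ratings": cumulative.iloc[n - 1] / total,
            }
        )
    return pd.DataFrame(rows)


def ratings_per_day(df: pd.DataFrame, as_of: pd.Timestamp) -> pd.Series:
    """Rating count divided by days since release (minimum of one day)."""
    days = (as_of - df["released"]).dt.days.clip(lower=1)
    return df["ratings"] / days


# Made-up rating counts with a heavy right tail, for illustration only.
toy_counts = pd.Series([1000, 500, 10, 5, 5, 3, 2, 1, 1, 1])
shares = topn_share(toy_counts, n_values=[1, 2, 5])
print(shares)

toy_apps = pd.DataFrame(
    {"ratings": [100, 100], "released": pd.to_datetime(["2021-12-06", "2021-01-01"])}
)
rate = ratings_per_day(toy_apps, as_of=pd.Timestamp("2021-12-16"))
print(rate)
```

In the toy apps, both have 100 ratings, but the newer app accumulates them far faster, which is the effect the normalization is meant to expose.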

In [None]:
with pd.option_context('format.precision',2):
    df1_style = freq.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Average User Rating Frequency Distribution")
    df2_style = desc.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Average User Rating Descriptive Statistics")
    display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

**Key Observations**

- The five-star ratings represent 67% of nearly 291,000 ratings in the dataset.
- Ratings up to four stars comprise just 33% of all ratings. Three stars and below make up approximately 20% of the data, whereas one- and two-star ratings represent less than 10% of all ratings.
- The mean of average user ratings, 4.09 stars, is influenced by the significant left skew. The median of 4.53 is a more robust measure of centrality, given the long left-tail in the distribution.
- The assumption of a normal distribution is violated, as shown in the histogram and the probability density functions.

#### Ratings

Next, we explore the distribution of rating counts in the dataset.

In [None]:
p = eda.KDEPlot(data=df_ratings, x='ratings', title="Rating Count\nProbability Density Function")
h = eda.Histogram(data=df_ratings, x='ratings', title="Rating Count\nHistogram")
c = eda.ECDFPlot(data=df_ratings, x='ratings', title="Rating Count\nCumulative Distribution Function")
v = eda.ViolinPlot(data=df_ratings,x='ratings', title="Rating Count Distribution")
plots = [p]

visual = Visual()
for plot in plots:
    visual.add_plot(plot=plot)
visual.visualize()
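
For readers without the `d8analysis` package, the empirical cumulative distribution of a heavy-tailed count such as `ratings` can be computed directly with NumPy. This is a minimal sketch with made-up counts; `ecdf` is a hypothetical helper and is not necessarily how `eda.ECDFPlot` works internally.

```python
import numpy as np


def ecdf(values) -> tuple:
    """Return sorted values and the fraction of observations <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y


# Made-up rating counts with one extreme outlier, for illustration only.
toy_counts = [1, 10, 10, 100, 30835421]
x, y = ecdf(toy_counts)

# Fraction of apps with at most 10 ratings.
share_le_10 = np.searchsorted(x, 10, side="right") / x.size
print(share_le_10)

# A log10 transform compresses the long right tail for plotting.
log_x = np.log10(x)
print(log_x.round(2))
```

Plotting `y` against `log_x` rather than `x` usually makes the body of such a distribution visible, since a single outlier otherwise stretches the axis.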