# Exploratory Data Analysis

## Stage I: App Performance Overview

Stage one of our exploratory data analysis aims to expose patterns and yield insight into the nature and intensity of the customer experience within the iOS app user community.

### The Dataset

The dataset contains the following descriptive, rating, price, and developer data for 475,132 apps from the App Store.

| #  | Variable     | Data Type  | Description                              |
| -- | ------------ | ---------- | ---------------------------------------- |
| 1  | id           | Nominal    | App Id from the App Store                |
| 2  | name         | Nominal    | App Name                                 |
| 3  | description  | Nominal    | App Description                          |
| 4  | category_id  | Nominal    | Numeric category identifier              |
| 5  | category     | Nominal    | Category name                            |
| 6  | price        | Continuous | App Price                                |
| 7  | developer_id | Nominal    | Identifier for the developer             |
| 8  | developer    | Nominal    | Name of the developer                    |
| 9  | rating       | Ordinal    | Average user rating since first released |
| 10 | ratings      | Discrete   | Number of ratings since first release    |
| 11 | released     | Continuous | Datetime of first release                |

### EDA Approach

Our exploration will comprise the following five analyses, followed by a summary of conclusions.

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions of rating count, average rating, categories, and price.
4. Bivariate Analysis: Evaluate pairwise relationships among ratings, rating counts, and reviews, including correlation analysis.
5. Multivariate Analysis: Cluster, factor, and correspondence analysis of three or more variables simultaneously.
6. Conclusions, insights and questions for stage two.

**Import Python Libraries and Provision Dependencies**

In [1]:
import numpy as np
import pandas as pd
from IPython.display import HTML, display_html
import seaborn as sns

import studioai as eda

from appstore.container import AppstoreContainer
from appstore.data.dataset.appdata import AppDataDataset


**Obtain the Dataset**

In [2]:
container = AppstoreContainer()
repo = container.data.appdata_repo()
dataset = repo.get_dataset()

### Structural Analysis

The structure and characteristics of the AppData dataset are as follows:

In [3]:
df1 = dataset.overview
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

**Dataset Structure**

| Characteristic         | Total     |
| ---------------------- | --------- |
| Number of Observations | 475132    |
| Number of Variables    | 15        |
| Number of Cells        | 7126980   |
| Size (Bytes)           | 974458222 |

**Dataset Data Types**

| Data Type      | Count |
| -------------- | ----- |
| bool           | 1     |
| category       | 2     |
| datetime64[ns] | 2     |
| float64        | 4     |
| int64          | 1     |
| string         | 5     |
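The structural figures above can be approximated for any pandas DataFrame. The following is a minimal sketch, not the implementation behind `dataset.overview`; the function name `structure_overview` and the toy frame are hypothetical and serve only as illustration.

```python
import pandas as pd


def structure_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize the shape and memory footprint of a DataFrame."""
    n_obs, n_vars = df.shape
    return pd.DataFrame(
        {
            "Characteristic": [
                "Number of Observations",
                "Number of Variables",
                "Number of Cells",
                "Size (Bytes)",
            ],
            "Total": [
                n_obs,
                n_vars,
                n_obs * n_vars,
                # deep=True also counts the contents of object/string columns
                int(df.memory_usage(deep=True).sum()),
            ],
        }
    )


# Toy frame with made-up values (not taken from the App Store data)
toy = pd.DataFrame(
    {"id": ["1", "2", "3"], "price": [0.0, 1.99, 0.0], "ratings": [10, 0, 5]}
)
overview = structure_overview(toy)
print(overview)
```

The number of cells is simply observations times variables, which is how 475,132 rows and 15 columns yield 7,126,980 cells.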


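As indicated above, we have approximately 475,000 apps in our dataset, described by 15 variables: the 11 features listed earlier plus four derived fields (extracted, free, months_avail, and ratings_per_month). Let's take a quick look.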

In [4]:
dataset.sample().style.hide(axis="index")

id,name,description,category_id,category,price,developer_id,developer,rating,ratings,released,extracted,free,months_avail,ratings_per_month
1532588789,Photo Widget — The Best One,"Put photos on your home screen using widgets. Create as many photo widgets as you like with one or more photos per widget. If you choose more than one photo, it will change photos on a time interval (which can be customized in the settings). Each widget can have separate photos. You can also show photos from an album in your photo library. This app is forever free with no ads or in-app purchases. Consider leaving a nice review. ■ Features - Widgets with a single photo. - Widgets with multiple photos that change on an interval (slideshow). - Widgets showing photos from an album in your photo library (slideshow). - Three widget sizes. - Import photos from the photo library or the file system. - Tapping a widget can either show a larger preview of the photo, open a URL, or run a shortcut. - The “Album” widget has a setting to show a random photo only from the most recent `n` photos in the album, where `n` is the number you choose. ■ How it works First, decide whether you want to select the individual photos to show (1) or just choose an existing album from your photo library (2). 1. Add photos to the app which you can later choose from in a widget. Then add the “Photo Widget” widget with the “Photos” type and edit the widget to pick the photos to show. 2. Add the “Photo Widget” widget with the “Album” type and edit it to choose the album. ■ Limitation If you make two or more widgets of the same size where all of them are set to show all the photos or the same album, they will show photos in the same order. This is a limitation in the iOS widget system. You can work around this by giving the widgets unique names in the widget configuration. ■ Tip If you have a widget that cycles between photos and you want to force it to skip to the next photo, long-press the widget, select “Edit Widget”, and then close the edit view. The widget should now have a new photo. ■ FAQ 〉 How do I add a widget to the home screen? 
When on the home screen, long-press on the background (not on any icons), press the top-left “+” button, and select “Photo Widget”. 〉 How do I edit a widget? When on the home screen, long-press on the widget, and select “Edit Widget”. 〉 Why can I only add 100 photos to the app? This is because of a technical limitation in the iOS widget system. Hopefully, it can be increased in the future. It should be enough for most users though. If you use the “Album” widget, there’s no limit to the number of photos. 〉 How can I hide the name of the app shown below the widget? This is not possible. App developers have no way to hide it. 〉 How is this different from the built-in “Photos” widget? The built-in widget only shows photos from “Memories” and “Featured Photos” in your photo library. There is no way to customize it or pick the photos to be shown. 〉 Can I show an animated GIF in a widget? This is not possible.",6012,Lifestyle,0.0,328077650,Sindre Sorhus,4.65937,640,2020-09-21 07:00:00,2023-07-31 05:00:00,True,34.0,19.0236
1571075476,Lucy Doo Boutique,Welcome to the Lucy Doo Boutique App! The best way to shop with Lucy Doo Boutique on iOS! About Us: Features: - Browse all of our most recent arrivals and promotions - Easy ordering and checkout - Waitlist items and purchase them when they are back in stock - Email notification for order fulfillment and shipping,6024,Shopping,0.0,1571075478,Lucy Doo Boutique,5.0,1,2021-06-09 07:00:00,2023-07-31 05:00:00,True,25.0,0.039646
1491731188,onship,"onship - the Maritime Supeapp is a secure Crew Welfare and Collaboration app for maritime crew, seafarers, their shore teams, support providers, and for crew's friends and family. onship connects people, support services, and welfare services for a happier, healthier, and safer experience while connecting between shore and ship. -- Packed with FREE features with NO ADS whatsoever -- MUST-HAVE FREE WELFARE FEATURES Emma Virtual Assistant Your personalised, virtual assistant for updates, general knowledge, and assistance SeafarerHelp Easy & secure access to ISWAN's helpline via chat or call for a free confidential welfare service to discuss any problem onboard, for general information, health issues, or guidance Stella Maris Find a global network of chaplains and volunteers that support seafarers in times of need Seafarer's Centre Locator Search, discover and contact seafarer centres around the world for local chaplains and general support OceanVoice Your onboard church for creating a link for Christian fellowship between ship and shore -- ESSENTIAL FREE COLLABORATION FEATURES Chat, call, video Instant messaging, voice over IP/internet calling, video calling Loft video conferencing 15 minutes group video calls between shore and ship that uses ship's crew WiFi at under 256kbps speeds -- AFFORDABLE COMMUNICATION FEATURES Loft premium video conferencing Enterprise subscription for an unlimited duration, secured, video meetings and conferencing that uses any available corporate WiFi or special Inmarsat WiFi at under 256kbps speeds International calls Make international calls to landlines or mobile phones at lower rates. 
No expensive fees or subscription for each call Drop Local Global maritime directory to easily discover, locate service providers for free and discounted calling tariffs to pay-as-you-use call their service providers' landline numbers E-Wallet A built-in prepaid money account to top up, receive allowances, and manage balance for payments to -- New features are being added every month. Data charges may apply in this app. Contact your provider for details. Official website: https://onship.ai. Developed by https://frontm.com, a UK-based technology company obsessed with making maritime and remote communication easier.",6005,Social Networking,0.0,1287568717,Frontm Ltd,0.0,0,2020-01-23 08:00:00,2023-07-31 05:00:00,True,41.0,0.0
1453429375,iÖrgeli,Mit dieser App können Sie auf der Melodieseite eines Schwyzerörgeli üben.,6011,Music,0.0,1332018825,gwerder.digital KLG,0.0,0,2022-06-02 07:00:00,2023-07-31 05:00:00,True,14.0,0.0
1189834489,知识靑年,知识靑年是集名师/学友/企业/培训机构为一体的多维社交移动学习生态圈，提供外训+内训的整合式培训体系，助力企业大学建立，实现人才发展战略目标执行落地！,6017,Education,0.0,1189834488,Di Wang,0.0,0,2017-01-04 01:36:00,2023-07-31 05:00:00,True,77.0,0.0


Identity variables, specifically (app) id and developer_id, will be retained for data processing purposes, but have no other value and will be largely ignored during this analysis.

### Data Quality Analysis

Data type, cardinality, validity, duplication, and size data are summarized at the variable level.

In [5]:
dataset.info.style.hide(axis="index")

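| Column       | DataType | Valid  | Null | Validity | Cardinality | Percent Unique | Size       |
| ------------ | -------- | ------ | ---- | -------- | ----------- | -------------- | ---------- |
| id           | string   | 475132 | 0    | 1.0      | 475132      | 1.0            | 31748470   |
| name         | string   | 475132 | 0    | 1.0      | 474250      | 1.0            | 40449624   |
| description  | string   | 475132 | 0    | 1.0      | 463635      | 0.98           | 1186227495 |
| category_id  | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477790     |
| category     | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477944     |
| price        | float64  | 475132 | 0    | 1.0      | 125         | 0.0            | 3801056    |
| developer_id | string   | 475132 | 0    | 1.0      | 265367      | 0.56           | 31666555   |
| developer    | string   | 475132 | 0    | 1.0      | 264402      | 0.56           | 37093493   |
| rating       | float64  | 475132 | 0    | 1.0      | 44083       | 0.09           | 3801056    |
| ratings      | int64    | 475132 | 0    | 1.0      | 14531       | 0.03           | 3801056    |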


**Observations**

- With the exception of released (date), we have no missing values.
- Ids are unique, and name, description, and developer information are all high-cardinality.
- Category id and label are low-cardinality, with 26 unique values each.
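The validity and cardinality columns summarized above can be derived with a few pandas calls. This is a rough sketch under the assumption that validity means the share of non-null values and that percent unique means cardinality divided by row count; `quality_summary` is a hypothetical helper, not the code behind `dataset.info`.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column validity and cardinality summary."""
    n = len(df)
    valid = df.notna().sum()               # non-null values per column
    cardinality = df.nunique(dropna=True)  # distinct non-null values per column
    return (
        pd.DataFrame(
            {
                "DataType": df.dtypes.astype(str),
                "Valid": valid,
                "Null": n - valid,
                "Validity": (valid / n).round(2),
                "Cardinality": cardinality,
                "Percent Unique": (cardinality / n).round(2),
            }
        )
        .rename_axis("Column")
        .reset_index()
    )


# Toy frame with made-up values, including one missing release date
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "category": ["Games", "Games", "Music", "Music"],
        "released": pd.to_datetime(["2020-01-01", None, "2021-06-01", "2022-03-15"]),
    }
)
summary = quality_summary(toy).set_index("Column")
print(summary)
```

A percent-unique value near 1.0 flags identifier-like columns, while a value near 0.0 flags low-cardinality categoricals.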

#### Numeric Variable Data Quality

Each feature has been cast to an appropriate data type and, apart from released as noted above, no values are missing. Valid values for the numeric and datetime variables are:

| Variable | Data Type  | Valid Values                                     |
| -------- | ---------- | ------------------------------------------------ |
| price    | Continuous | Non negative values                              |
| rating   | Interval   | Real valued in [0,5]                             |
| ratings  | Discrete   | Discrete and non-negative                        |
| released | Continuous | Datetimes between June 10, 2008 and present day. |

Let's check the ranges for these variables.

In [6]:
# as_df() returns the underlying pandas DataFrame (it is used the same way later in this notebook).
df = dataset.as_df()
# Report the minimum and maximum of each numeric and datetime variable.
df[['price', 'rating', 'ratings', 'released']].agg(['min', 'max'])

All numeric and datetime values are within range.
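The valid-value rules in the table above can also be written as explicit boolean tests. The sketch below is illustrative only: `range_violations` and the toy data are hypothetical, and missing values are deliberately counted as violations.

```python
import pandas as pd

# Lower bound for release dates, taken from the valid-values table above
STORE_LAUNCH = pd.Timestamp("2008-06-10")


def range_violations(df: pd.DataFrame, now: pd.Timestamp) -> pd.Series:
    """Count rows violating each validity rule (0 means the column is clean)."""
    checks = {
        # Negating the positive condition makes NaN/NaT count as a violation
        "price": ~(df["price"] >= 0),
        "rating": ~df["rating"].between(0, 5),
        "ratings": ~(df["ratings"] >= 0),
        "released": ~df["released"].between(STORE_LAUNCH, now),
    }
    return pd.Series({name: int(mask.sum()) for name, mask in checks.items()})


# Toy frame with made-up values: one bad price, one bad rating, one bad date
toy = pd.DataFrame(
    {
        "price": [0.0, 2.99, -1.0],
        "rating": [4.5, 0.0, 5.2],
        "ratings": [120, 0, 3],
        "released": pd.to_datetime(["2015-03-01", "2007-01-01", "2020-09-21"]),
    }
)
violations = range_violations(toy, now=pd.Timestamp("2023-07-31"))
print(violations)
```

A result of all zeros would support the claim that every variable is within range.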

#### Categorical Variable Data Quality

The id, name, description, developer_id, and developer variables are nominal, high-cardinality strings. Category and category_id, in contrast, must contain one of the 26 category id / category values selected for this analysis.

In [None]:
columns = ['category_id', 'category']
dataset.unique(columns=columns).style.hide(axis="index")

Category and category_id values are as expected.

### Univariate Analysis

#### Quantitative Data

We'll begin the univariate analysis with an examination of the quantitative variables, namely:

- Average User Rating
- Rating Count
- Price
- Release Date

Using quantitative and qualitative methods, we'll discover the central tendency of the data (arithmetic mean, median, mode), its spread (variance, standard deviation, interquartile range, maximum and minimum value) and some features of its distribution (skewness, kurtosis).
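For reference, each of the statistics listed above is available directly on a pandas Series. The following sketch uses a hypothetical helper, `describe_numeric`, and a made-up sample; it is not the `describe` method used in this notebook.

```python
import pandas as pd


def describe_numeric(x: pd.Series) -> pd.Series:
    """Central tendency, spread, and shape statistics for one variable."""
    q1, q3 = x.quantile([0.25, 0.75])
    return pd.Series(
        {
            "mean": x.mean(),
            "median": x.median(),
            "mode": x.mode().iloc[0],  # first mode if there are ties
            "variance": x.var(),       # sample variance (ddof=1)
            "std": x.std(),            # sample standard deviation (ddof=1)
            "iqr": q3 - q1,
            "min": x.min(),
            "max": x.max(),
            "skewness": x.skew(),
            "kurtosis": x.kurt(),      # excess kurtosis
        }
    )


# Made-up, right-skewed sample: one large value pulls the mean above the median
sample = pd.Series([1, 1, 2, 3, 50], name="ratings")
sample_stats = describe_numeric(sample)
print(sample_stats)
```

In a right-skewed sample like this one, the mean sits well above the median and the skewness is positive.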

##### Average User Rating

In [None]:
dataset.plot.pdfcdfplot(x='rating', title='Average User Rating Distribution')


Since the star-rating scale runs from 1 to 5, it's clear that the zero values in the probability density and histogram above belong to apps that have not been rated. To get a sense of the actual ratings, we'll create a new dataset without the unrated apps.

In [None]:
df = dataset.as_df()
df = df.loc[df['rating'] != 0]
rated = AppDataDataset(df=df)

Ok, let's examine the frequency distribution of the ratings.

In [None]:
rated.frequency(x='rating', bins=4)
stats = rated.describe(x='rating')
stats.numeric
rated.plot.pdfcdfplot(x='rating', bins=4, title='Distribution of User Ratings')
rated.plot.histpdfplot(x='rating', title='Distribution of User Ratings')

**Key Observations:**

- The long left tail reveals a tendency towards ratings in the 4-5 star range.
- Five-star ratings make up 67% of all ratings.
- Smaller peaks are also observed at one and three stars and, to a lesser degree, at two stars.
- Ratings up to one, two, and three stars correspond to approximately 8%, 20%, and 33% of the cumulative ratings, respectively.
- No assumption of normality is made about the distribution of ratings.
- In short, five-star ratings dominate customer opinion at this level by a significant margin.
- Note: Taking the average of ordinal values, such as user ratings, is not among the *permissible* statistical transformations whose meanings are preserved when applied to the data, according to measurement theorists, most notably Harvard psychologist S. S. Stevens, who coined the terms *nominal*, *ordinal*, *interval*, and *ratio*. Fortunately, permission is not required in data analysis.
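The cumulative percentages quoted above come from a binned frequency table. One possible way to build such a table with pandas is sketched below; the bin edges, the helper name `binned_frequency`, and the sample ratings are assumptions for illustration and may differ from what `rated.frequency` does internally.

```python
import pandas as pd


def binned_frequency(x: pd.Series, edges: list) -> pd.DataFrame:
    """Counts, proportions, and cumulative proportions per bin."""
    # Bins are right-closed; include_lowest keeps the minimum edge in the first bin
    bins = pd.cut(x, bins=edges, include_lowest=True)
    counts = bins.value_counts().sort_index()  # restore bin order
    table = pd.DataFrame({"count": counts})
    table["proportion"] = table["count"] / table["count"].sum()
    table["cumulative"] = table["proportion"].cumsum()
    return table


# Made-up average ratings, skewed toward the top of the scale
ratings = pd.Series([1.0, 1.5, 2.5, 3.5, 4.2, 4.6, 4.8, 5.0, 5.0, 5.0])
freq = binned_frequency(ratings, edges=[1, 2, 3, 4, 5])
print(freq)
```

Reading the `cumulative` column row by row gives the share of apps rated at or below each bin's upper edge.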

##### Rating Count

Rating count can be an indicator of the intensity of opinion. We'll use the same rated dataset as above.

In [None]:
stats = rated.describe(x='ratings')
stats.numeric
rated.plot.histogram(x='ratings',bins=5, title='Distribution of User Rating Count')

In [None]:
rated.top_n(x='ratings', n=10)

In [None]:
topn = np.array([10,20,35,50,75,100,200,500,1000])
rated.plot.topn_plot(x='ratings', n=topn)

**Key Observations:**

- The distribution of rating counts has a long right tail, with a range from 1 to nearly 31 million ratings.
- The median is 10 ratings per app; the mean is pulled upward by the outliers.
- Giants of big tech, social media, and e-commerce, such as YouTube, TikTok, Spotify, WhatsApp, and DoorDash, are among the most-rated apps in the App Store.
- The top 10 most-rated apps account for nearly 14% of all ratings and less than 1/10th of a percent of all apps. Moreover, the 1,000 most-rated apps, which represent 1/3rd of a percent of all apps, account for nearly 75% of all ratings.
- Takeaway: Rating counts are vastly disproportionate.
- Note: Apps with earlier release dates may have higher rating counts. Normalizing to ratings per day since release would remove the temporal dimension from the rating counts.
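The concentration figures above are top-n shares: the fraction of all ratings held by the n most-rated apps. A minimal sketch of that calculation follows; `top_n_share` and the sample counts are hypothetical and are not the logic behind `topn_plot`.

```python
import numpy as np
import pandas as pd


def top_n_share(counts: pd.Series, n_values) -> pd.DataFrame:
    """Share of apps and share of ratings held by the n most-rated apps."""
    ordered = np.sort(counts.to_numpy())[::-1]  # descending rating counts
    cumulative = np.cumsum(ordered)
    total = cumulative[-1]
    rows = []
    for n in n_values:
        k = min(int(n), len(ordered))  # cap n at the number of apps; n must be >= 1
        rows.append(
            {
                "n": k,
                "share_of_apps": k / len(ordered),
                "share_of_ratings": cumulative[k - 1] / total,
            }
        )
    return pd.DataFrame(rows)


# Made-up rating counts for six apps (100 ratings in total)
counts = pd.Series([70, 20, 5, 3, 1, 1])
shares = top_n_share(counts, n_values=[1, 2, 6])
print(shares)
```

Comparing `share_of_apps` with `share_of_ratings` on the same row makes the disproportion explicit.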