# Exploratory Data Analysis

## Stage I: App Performance Overview

Stage one of our exploratory data analysis aims to expose patterns and yield insight into the nature and intensity of the customer experience within the iOS app user community.

### The Dataset

The dataset contains the following product description, rating, price, and developer data for 475,132 apps from the App Store.

| #  | Variable     | Data Type  | Description                              |
| -- | ------------ | ---------- | ---------------------------------------- |
| 1  | id           | Nominal    | App Id from the App Store                |
| 2  | name         | Nominal    | App Name                                 |
| 3  | description  | Nominal    | App Description                          |
| 4  | category_id  | Nominal    | Numeric category identifier              |
| 5  | category     | Nominal    | Category name                            |
| 6  | price        | Continuous | App Price                                |
| 7  | developer_id | Nominal    | Identifier for the developer             |
| 8  | developer    | Nominal    | Name of the developer                    |
| 9  | rating       | Ordinal    | Average user rating since first released |
| 10 | ratings      | Discrete   | Number of ratings since first release    |
| 11 | released     | Continuous | Datetime of first release                |

### EDA Approach

Our exploration will comprise the following five analyses, followed by a summary of conclusions.

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions of rating count, average rating, categories, and price.
4. Bivariate Analysis: Evaluate relationships between pairs of variables, such as ratings, rating count, and reviews, including correlation analysis.
5. Multivariate Analysis: Cluster, factor, and correspondence analysis of three or more variables simultaneously.
6. Conclusions, insights and questions for stage two.

**Import Python Libraries and Provision Dependencies**

In [1]:
import dependency
import sys
print(sys.path)
import numpy as np
import pandas as pd
from IPython.display import HTML, display_html
import seaborn as sns

import d8analysis as eda

from appstore.container import AppstoreContainer
from appstore.data.dataset.appdata import AppDataDataset



**Obtain the Dataset**

In [2]:
container = AppstoreContainer()
repo = container.data.appdata_repo()
dataset = repo.get_dataset()

### Structural Analysis

The structure and characteristics of the AppData dataset are as follows:

In [3]:
df1 = dataset.overview
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

| Characteristic         | Total     |
| ---------------------- | --------- |
| Number of Observations | 475132    |
| Number of Variables    | 11        |
| Number of Cells        | 5226452   |
| Size (Bytes)           | 962579922 |

| Data Type      | Count |
| -------------- | ----- |
| string         | 5     |
| float64        | 2     |
| category       | 1     |
| category       | 1     |
| int64          | 1     |
| datetime64[ns] | 1     |
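
The number of cells is simply observations times variables (475,132 × 11 = 5,226,452). As a rough illustration of how such a structural summary could be derived with plain pandas, here is a minimal sketch. The `overview` function and the toy frame are hypothetical and are not the implementation behind `dataset.overview`.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize the shape and memory footprint of a DataFrame."""
    n_obs, n_vars = df.shape
    rows = [
        ("Number of Observations", n_obs),
        ("Number of Variables", n_vars),
        ("Number of Cells", n_obs * n_vars),
        # deep=True also counts the contents of object/string columns
        ("Size (Bytes)", int(df.memory_usage(deep=True).sum())),
    ]
    return pd.DataFrame(rows, columns=["Characteristic", "Total"])


# Tiny illustrative frame, not the App Store data
toy = pd.DataFrame({"id": ["1", "2", "3"], "price": [0.0, 0.99, 4.99]})
summary = overview(toy)
print(summary)
```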


As indicated above, we have approximately 475,000 apps in our dataset, described by 11 features. Let's take a quick look.

In [4]:
dataset.sample().style.hide(axis="index")

Id,Name,Description,Category_id,Category,Price,Developer_id,Developer,Rating,Ratings,Released
1601555296,Zèra Jewels,"Introducing the Zèra Jewels app! Everything you love about the site is now at your fingertips! Whether you're looking for jewelry for you or for your loved one - use the Zera Jewels app to shop our entire site. ZÈRA Jewels is a fashion brand that offers affordable jewelry that allows you to look like a million bucks without spending a million bucks. We have collections for all budgets - our jewelry pieces are designed to help women honor their personal style. At times, it can be difficult to make time for ourselves and feel good in our own skin. Each one of our pieces are elegantly designed with a sophisticated flare, to make women feel confident, beautiful and authentically themselves. Whether you’re on a journey to find your style, elevating your outfit or buying a gift for a loved one, ZÈRA has something for you. Check prices and build up your list of favorites right here on the app. Create an amazing favorites list, and share with friends for immediate feedback.",6024,Shopping,0.0,1601555298,Zera Jewels,5.0,2,2021-12-23 08:00:00
1629032021,GlobalTips Guest,"Having a great time? Send your waiter some cashless appreciation with GlobalTips! There are only 3 simple steps to follow: 1. Whenever you are at a participating restaurant, hold your iPhone near the GlobalTips card or sticker or point your camera at the GlobalTips QR code. 2. Choose tip amount and payment method. In addition to Apple Pay, GlobalTips accepts a wide range of credit and debit cards. 3. You can also rate your dining experience and add a quick review if you wish. Tipping with GlobalTips means your waiter gets 100% of your tip. It has never been easier to tip cashless!",6015,Finance,0.0,1616293661,GLOBALTIPS EUROPE UAB,0.0,0,2022-07-29 07:00:00
983655856,Keyplan 3D Lite - Home design,"Keyplan 3D, our new home and interior designer is built on top of a unique technology unleashing features never seen before on the Appstore. It is a simple to use, useful and fun App to help you design, build, think and decorate your home or future home from the ground up. Whether you are looking to make alterations to your current home or plan on building your dream house, Keyplan 3D is there to turn this otherwise complicated process into child's play. Through our beautiful interface, designed with simplicity in mind, you can create amazing content without ever having to worry about complex menus and cryptic options. Create a wall or room by simply using our build button and our smart engine will take care of the rest. Forget about loading screens and unresponsive Apps : with Keyplan 3D you can visualise your project through our gorgeously rendered plans, which are both fully interactive and updated in real-time. Every aspect of Keyplan 3D has been designed to enable you to express your creativity like never before. Behind Keyplan 3D there is a powerful technology, allowing you to build any shape imagineable, paint, decorate and place more than 350 free unique objects. Feeling proud of your creations? Go ahead and share them with your friends and family on your favourite social media platform. Main features: - House builder : creating walls is as easy as drawing a line with a pencil. Any shape is possible with our unique 2D/3D editing features. - Interior design : Place furniture, windows, doors - edit, change, remove, in either the 2D or 3D view. - Decoration : A large selection of paints, brick, wood, ceramic, textile to be used on any object or surface. Expect new free objects on a regular basis. - Many customisation options such as wall width and height settings, inch/meter conversion. - iCloud synchronisation to enjoy your project on all your devices, iPhone and iPad. 
- Sharing : share 2D snapshots and 3D renders of your plan with your friends/family. Try it out and let us know what you think - we made Keyplan 3D for you and our only goal is to offer you the best experience out there. A nice video presentation: http://bit.ly/1DACRHo For more information, visit us on www.keyplan3d.com Contact us to: contact@keyplan3d.com",6007,Productivity,0.0,850534528,Quasarts LLC,4.51084,44172,2018-03-16 14:13:00
545213790,Happy Find Differences,"- Happy Spot, find the difference between two similar images in limited time. - It is a casual game with many beautiful pictures. - Just find the difference and then tap it!",6014,Games,0.0,487494069,Fino Soft Inc.,3.78571,98,2012-11-07 20:34:00
1438347785,Play3D Camera,"With super high-resolution photorealistic imagery that’s lightning fast to place, Play3D is more than just 3D photo stickers, it’s an Augmented Reality (AR) experience that is so lifelike it’s amazing. Place holograms anywhere on any surface and create your very own stories. Optimized for iOS 12 and powered by ARKit, Play3D brings you the next generation of Augmented Reality capture technology and offers a line-up of exciting, breathtaking holograms that you’re going to love. More will be added every week! Play3D is fun and easy to use. Look for the surface, choose a character and place. Then record and share your fun moment. It’s that simple! You can even share directly to Messages in your iPhone. Whether you’re looking for an Extended Reality, Mixed Reality or Augmented Reality experience, there’s nothing more real than Play3D. Bring the power of Play3D into your videos. Download today! Officially Licensed Product of National Football League Players Inc. © 2018",6008,Photo & Video,0.0,1195068564,Altered Reality Corporation,4.59999,10,2018-10-22 00:41:00


Identity variables, specifically (app) id and developer_id, will be retained for data processing purposes, but have no analytical value and will be largely ignored during this analysis.

### Data Quality Analysis

Data type, cardinality, validity, duplication, and size data are summarized at the variable level.

In [5]:
dataset.info.style.hide(axis="index")

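| Column       | Datatype | Valid  | Null | Validity | Cardinality | Percent unique | Size       |
| ------------ | -------- | ------ | ---- | -------- | ----------- | -------------- | ---------- |
| id           | string   | 475132 | 0    | 1.0      | 475132      | 1.0            | 31748470   |
| name         | string   | 475132 | 0    | 1.0      | 474250      | 1.0            | 40449624   |
| description  | string   | 475132 | 0    | 1.0      | 463635      | 0.98           | 1186227495 |
| category_id  | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477790     |
| category     | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477944     |
| price        | float64  | 475132 | 0    | 1.0      | 125         | 0.0            | 3801056    |
| developer_id | string   | 475132 | 0    | 1.0      | 265367      | 0.56           | 31666555   |
| developer    | string   | 475132 | 0    | 1.0      | 264402      | 0.56           | 37093493   |
| rating       | float64  | 475132 | 0    | 1.0      | 44083       | 0.09           | 3801056    |
| ratings      | int64    | 475132 | 0    | 1.0      | 14531       | 0.03           | 3801056    |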


**Observations**

- With the exception of released (date), we have no missing values.
- Ids are unique, and name, description, and developer information are all high-cardinality.
- Category id and label are low-cardinality, with 26 unique values each.
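
For readers who want to reproduce this kind of variable-level summary outside the project's `dataset.info` property, the following is a minimal pandas sketch. The `column_quality` helper and the toy data are assumptions made for illustration; validity is taken here to mean the share of non-null values, which may differ from the definition used in the project.

```python
import pandas as pd


def column_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column null counts, validity, and cardinality."""
    n = len(df)
    info = pd.DataFrame({
        "Datatype": df.dtypes.astype(str),
        "Valid": df.notna().sum(),
        "Null": df.isna().sum(),
        "Cardinality": df.nunique(dropna=True),
    })
    # Assumption: validity is the fraction of non-null values
    info["Validity"] = (info["Valid"] / n).round(2)
    info["Percent unique"] = (info["Cardinality"] / n).round(2)
    return info.rename_axis("Column").reset_index()


# Toy data, not the App Store dataset
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "category": ["Games", "Games", "Education", None],
})
quality = column_quality(toy)
print(quality)
```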

#### Numeric Variable Data Quality

Each feature has been cast to an appropriate data type, and the numeric variables contain no missing values. Valid values for the numeric and datetime variables are:

| Variable | Data Type  | Valid Values                                     |
| -------- | ---------- | ------------------------------------------------ |
| price    | Continuous | Non negative values                              |
| rating   | Interval   | Real valued in [0,5]                             |
| ratings  | Discrete   | Discrete and non-negative                        |
| released | Continuous | Datetimes between June 10, 2008 and present day. |

Let's check the ranges for these variables.

In [6]:
stats = dataset.describe(include=[np.number, np.datetime64])
stats.numeric[['min','max']]

| Variable | min | max        |
| -------- | --- | ---------- |
| price    | 0.0 | 999.99     |
| rating   | 0.0 | 5.0        |
| ratings  | 0.0 | 30835421.0 |


All three numeric variables fall within their valid ranges. Note that released is a datetime and does not appear in the numeric summary above.
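
The valid-value rules in the table can also be checked programmatically, including the datetime bounds for released. The sketch below is one way to do so with pandas; `out_of_range_counts` is a hypothetical helper, and the toy rows are invented to trigger each rule.

```python
import pandas as pd


def out_of_range_counts(df, earliest="2008-06-10", now=None):
    """Count values that violate the valid-value rules for each variable."""
    now = pd.Timestamp.now() if now is None else pd.Timestamp(now)
    checks = {
        "price": df["price"] >= 0,
        "rating": df["rating"].between(0, 5),
        "ratings": df["ratings"] >= 0,
        # Missing dates (NaT) fail this check and are counted as violations
        "released": df["released"].between(pd.Timestamp(earliest), now),
    }
    return pd.Series({name: int((~ok).sum()) for name, ok in checks.items()})


# Toy rows: the third row breaks the price, rating, and released rules
toy = pd.DataFrame({
    "price": [0.0, 999.99, -1.0],
    "rating": [0.0, 5.0, 5.5],
    "ratings": [0, 30835421, 10],
    "released": pd.to_datetime(["2008-07-11", "2021-12-23", "2007-01-01"]),
})
violations = out_of_range_counts(toy, now="2023-12-31")
print(violations)
```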

#### Categorical Variable Data Quality

The id, name, description, developer_id, and developer variables are nominal, high-cardinality strings. Category and category_id, in contrast, must contain one of the 26 category id / category values selected for this analysis.

In [7]:
columns = ['category_id', 'category']
dataset.unique(columns=columns).style.hide(axis="index")

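| Category_id | Category          |
| ----------- | ----------------- |
| 6013        | Health & Fitness  |
| 6017        | Education         |
| 6000        | Business          |
| 6012        | Lifestyle         |
| 6004        | Sports            |
| 6014        | Games             |
| 6007        | Productivity      |
| 6002        | Utilities         |
| 6027        | Graphics & Design |
| 6010        | Navigation        |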


Category and category_id values are as expected.
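
Beyond eyeballing the list, one may also want to confirm that each category_id maps to exactly one category name and vice versa. A minimal sketch of such a check follows; `is_one_to_one` is a hypothetical helper, not part of the project code, and the toy frames reuse a few id/name pairs from the output above.

```python
import pandas as pd


def is_one_to_one(df: pd.DataFrame, left: str, right: str) -> bool:
    """True if every value of `left` pairs with exactly one value of `right`, and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)


# Consistent mapping: each id has a single name
good = pd.DataFrame({
    "category_id": ["6014", "6017", "6014"],
    "category": ["Games", "Education", "Games"],
})
# Inconsistent mapping: id 6014 appears with two different names
bad = pd.DataFrame({
    "category_id": ["6014", "6014"],
    "category": ["Games", "Education"],
})
print(is_one_to_one(good, "category_id", "category"))
print(is_one_to_one(bad, "category_id", "category"))
```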

### Univariate Analysis

#### Quantitative Data

We'll begin the univariate analysis with an examination of the quantitative variables, namely:

- Average User Rating
- Rating Count
- Price
- Release Date

Using quantitative and qualitative methods, we'll discover the central tendency of the data (arithmetic mean, median, mode), its spread (variance, standard deviation, interquartile range, maximum and minimum value) and some features of its distribution (skewness, kurtosis).
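
The measures listed above can all be computed directly with pandas. The following sketch is illustrative only: `describe_numeric` is a hypothetical helper rather than the project's `describe` method, and the toy ratings are invented. Note that pandas reports sample variance and standard deviation (ddof=1) and excess kurtosis.

```python
import pandas as pd


def describe_numeric(s: pd.Series) -> pd.Series:
    """Central tendency, spread, and shape statistics for one numeric variable."""
    q1, q3 = s.quantile([0.25, 0.75])
    return pd.Series({
        "mean": s.mean(),
        "median": s.median(),
        "mode": s.mode().iloc[0],   # first mode if there are ties
        "variance": s.var(),        # sample variance (ddof=1)
        "std": s.std(),
        "iqr": q3 - q1,
        "min": s.min(),
        "max": s.max(),
        "skewness": s.skew(),
        "kurtosis": s.kurt(),       # excess kurtosis (normal = 0)
    })


# Toy left-skewed ratings, not the App Store data
toy = pd.Series([5.0, 5.0, 5.0, 4.5, 4.0, 3.0, 1.0])
summary = describe_numeric(toy)
print(summary.round(2))
```

On left-skewed data such as this toy sample, the mean falls below the median and the skewness is negative.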

##### Average User Rating

In [8]:
dataset.plot.pdfcdfplot(x='rating', title='Average User Rating Distribution')



Since the rating scale is in [1,5], it's clear that the probability density and histogram above include apps that have not been rated, recorded with a rating of 0. To get a sense of the actual ratings, we'll create a new dataset that excludes the unrated apps.

In [None]:
df = dataset.as_df()
df = df.loc[df['rating'] != 0]
rated = AppDataDataset(df=df)

Ok, let's examine the frequency distribution of the ratings.

In [None]:
rated.frequency(x='rating', bins=4)
stats = rated.describe(x='rating')
stats.numeric
rated.plot.pdfcdfplot(x='rating', bins=4, title='Distribution of User Ratings')
rated.plot.histpdfplot(x='rating', title='Distribution of User Ratings')

**Key Observations:**

- The long left tail reveals a tendency towards ratings in the 4-5 star range.
- Five star ratings make up 67% of all ratings.
- Multiple peaks are also observed at one star and three star ratings and to a lesser degree with two stars.
- Ratings up to one, two, and three stars, correspond to approximately 8%, 20% and 33% of the cumulative ratings respectively.
- The distribution of ratings is clearly not normal, so we make no assumption of normality.
- In short, five star ratings dominate customer opinion at this level by a significant margin.
- Note: Taking the average of ordinal values, such as user ratings, is not among the *permissible* statistical transformations whose meanings are preserved when applied to the data, according to measurement theorists, most notably Harvard psychologist S. S. Stevens, who coined the terms *nominal*, *ordinal*, *interval*, and *ratio*. Fortunately, permission is not required in data analysis.
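
To make the binned and cumulative percentages concrete, here is a minimal sketch of a four-bin frequency distribution over the [1,5] rating scale using `pandas.cut`. The `rating_frequency` function, its equal-width bin edges, and the toy data are assumptions for illustration; the project's `frequency` method may bin differently.

```python
import numpy as np
import pandas as pd


def rating_frequency(ratings: pd.Series, bins: int = 4) -> pd.DataFrame:
    """Counts, percentages, and cumulative percentages over equal-width rating bins."""
    edges = np.linspace(1, 5, bins + 1)  # 1, 2, 3, 4, 5 for bins=4
    # Intervals are right-closed, so a rating of exactly 5.0 lands in (4, 5]
    binned = pd.cut(ratings, bins=edges, include_lowest=True)
    counts = binned.value_counts(sort=False)  # keep bin order
    freq = pd.DataFrame({"count": counts})
    freq["percent"] = 100 * freq["count"] / freq["count"].sum()
    freq["cumulative_percent"] = freq["percent"].cumsum()
    return freq


# Toy average ratings, not the App Store data
toy = pd.Series([1.0, 2.5, 3.5, 4.2, 4.8, 5.0, 5.0, 5.0])
freq = rating_frequency(toy)
print(freq)
```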

##### Rating Count

Rating count can be a harbinger of the intensity of opinion. We'll use the same rated dataset as above.

In [None]:
stats = rated.describe(x='ratings')
stats.numeric
rated.plot.histogram(x='ratings', bins=5, title='Distribution of User Rating Count')

In [None]:
rated.top_n(x='ratings', n=10)

In [None]:
topn = np.array([10,20,35,50,75,100,200,500,1000])
rated.plot.topn_plot(x='ratings', n=topn)

**Key Observations:**

- The distribution of rating counts has a long right tail, with a range from 1 to nearly 31 million ratings.
- The median is 10 ratings per app. The mean is pulled upward, in the direction of the outliers.
- Giants of big tech, social media, and e-commerce, such as YouTube, TikTok, Spotify, WhatsApp and DoorDash, are among the most-rated apps in the App Store.
- The top-10 most-rated apps account for nearly 14% of all ratings and less than 1/10th of a percent of all apps. Moreover, the most-rated 1000 apps, which represent 1/3rd of a percent of all apps, account for nearly 75% of all ratings.
- Takeaway: Rating counts are vastly disproportionate.
- Note: Apps with earlier release dates may have higher rating counts. Normalizing to ratings per day since release would control for app age.
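
The concentration of ratings among the top-n apps, and the ratings-per-day normalization mentioned in the note, can both be sketched in a few lines of pandas. The `top_n_share` and `ratings_per_day` helpers below are hypothetical, as are the toy numbers and the choice of reference date; they are not the code behind `topn_plot`.

```python
import numpy as np
import pandas as pd


def top_n_share(counts: pd.Series, n_values) -> pd.DataFrame:
    """Share of apps and share of all ratings held by the n most-rated apps."""
    ordered = counts.sort_values(ascending=False).to_numpy()
    cumulative = np.cumsum(ordered)
    total = cumulative[-1]
    rows = []
    for n in n_values:
        n = min(int(n), len(ordered))  # cap n at the number of apps
        rows.append({
            "n": n,
            "share_of_apps": n / len(ordered),
            "share_of_ratings": cumulative[n - 1] / total,
        })
    return pd.DataFrame(rows)


def ratings_per_day(ratings: pd.Series, released: pd.Series, as_of) -> pd.Series:
    """Rating count divided by app age in days (minimum age of one day)."""
    age_days = (pd.Timestamp(as_of) - released).dt.days.clip(lower=1)
    return ratings / age_days


# Toy data, not the App Store dataset
counts = pd.Series([1000, 500, 10, 5, 3, 2])
shares = top_n_share(counts, [1, 2])
print(shares)

released = pd.Series(pd.to_datetime(["2020-01-01", "2022-01-01"]))
rate = ratings_per_day(pd.Series([100, 100]), released, as_of="2023-01-01")
print(rate)
```

In the second example both apps have 100 ratings, but the newer app accumulates them at a higher daily rate, which is the effect the normalization is meant to surface.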

In [None]:
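# Assumption: freq and desc are the frequency table and descriptive statistics
# for 'rating' computed earlier, and both are pandas DataFrames.
freq = rated.frequency(x='rating', bins=4)
desc = rated.describe(x='rating').numeric
with pd.option_context('styler.format.precision', 2):
    df1_style = freq.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Average User Rating Frequency Distribution")
    df2_style = desc.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Average User Rating Descriptive Statistics")
    display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)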

**Key Observations**

- Five-star ratings represent 67% of the nearly 291,000 average ratings in the dataset.
- Ratings up to four stars comprise just 33% of all ratings. Three stars and below make up approximately 20% of the data, whereas one- and two-star ratings represent less than 10% of all ratings.
- The mean of average user ratings, 4.09 stars, is pulled down by the significant left skew. The median of 4.53 is a more robust measure of centrality, given the long left tail in the distribution.
- The assumption of a normal distribution is violated, as shown in the histogram and the probability density function.

#### Ratings

Next, we explore the distribution of rating counts in the dataset.

In [None]:
# Assumption: df_ratings is the rated-apps DataFrame built above.
df_ratings = rated.as_df()
p = eda.KDEPlot(data=df_ratings, x='ratings', title="Rating Count\nProbability Density Function")
h = eda.Histogram(data=df_ratings, x='ratings', title="Rating Count\nHistogram")
c = eda.ECDFPlot(data=df_ratings, x='ratings', title="Rating Count\nCumulative Distribution Function")
v = eda.ViolinPlot(data=df_ratings, x='ratings', title="Rating Count Distribution")
plots = [p]

visual = Visual()
for plot in plots:
    visual.add_plot(plot=plot)
visual.visualize()