# Exploratory Data Analysis

## Stage I: App Performance Overview

Stage one of our exploratory data analysis aims to expose patterns and yield insight into the nature and intensity of the customer experience within the iOS app user community.

### The Dataset

The dataset contains the following descriptive, rating, price, and developer data for 475,132 apps from the App Store.

| #  | Variable     | Data Type  | Description                              |
| -- | ------------ | ---------- | ---------------------------------------- |
| 1  | id           | Nominal    | App Id from the App Store                |
| 2  | name         | Nominal    | App Name                                 |
| 3  | description  | Nominal    | App Description                          |
| 4  | category_id  | Nominal    | Numeric category identifier              |
| 5  | category     | Nominal    | Category name                            |
| 6  | price        | Continuous | App Price                                |
| 7  | developer_id | Nominal    | Identifier for the developer             |
| 8  | developer    | Nominal    | Name of the developer                    |
| 9  | rating       | Interval   | Average user rating since first release  |
| 10 | ratings      | Discrete   | Number of ratings since first release    |
| 11 | released     | Continuous | Datetime of first release                |

### EDA Approach

Our exploration comprises the following five analyses, followed by a summary of conclusions.

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions of rating count, average rating, categories, and price.
4. Bivariate Analysis: Evaluate relationships between pairs of variables, such as average rating, rating count, and reviews, including correlation analysis.
5. Multivariate Analysis: Cluster, factor, and correspondence analysis of three or more variables simultaneously.
6. Conclusions, insights and questions for stage two.

### Preliminaries

**Import Dependencies**

In [1]:
import numpy as np
import pandas as pd
from IPython.display import HTML, display_html
import seaborn as sns
import d8analysis as eda

from appstore.data.dataset.appdata import AppDataDataset
from appstore.container import AppstoreContainer

**Wire and Initialize Dependencies**

In [2]:
container = AppstoreContainer()
container.init_resources()
container.wire(packages=["appstore"])

**Obtain the Dataset**

In [3]:
repo = container.data.appdata_repo()
dataset = repo.get_dataset()

### Structural Analysis

The structure and characteristics of the AppData dataset are as follows:

In [4]:
df1 = dataset.overview
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

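**Dataset Structure**

| Characteristic         | Total     |
| ---------------------- | --------- |
| Number of Observations | 475132    |
| Number of Variables    | 11        |
| Number of Cells        | 5226452   |
| Size (Bytes)           | 962579922 |

**Dataset Data Types**

| Data Type      | Count |
| -------------- | ----- |
| string         | 5     |
| float64        | 2     |
| category       | 1     |
| category       | 1     |
| int64          | 1     |
| datetime64[ns] | 1     |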
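For readers without access to the `appstore` package, a structural summary of this kind can be approximated with plain pandas. The sketch below is illustrative only: `overview`, `dtype_counts`, and the `toy` frame are hypothetical names, not the project's own implementation, and the byte count pandas reports may be computed differently from the figure shown above.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize shape and memory footprint (hypothetical helper)."""
    rows, cols = df.shape
    return pd.DataFrame(
        {
            "Characteristic": [
                "Number of Observations",
                "Number of Variables",
                "Number of Cells",
                "Size (Bytes)",
            ],
            # Cells = observations x variables; deep=True also counts string payloads.
            "Total": [rows, cols, rows * cols, int(df.memory_usage(deep=True).sum())],
        }
    )


def dtype_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count columns per data type (hypothetical helper)."""
    counts = df.dtypes.astype(str).value_counts()
    return counts.rename_axis("Data Type").reset_index(name="Count")


# Toy stand-in for the real dataset (assumed values, for illustration only).
toy = pd.DataFrame(
    {
        "id": ["1", "2", "3"],
        "category": pd.Categorical(["Book", "Games", "Book"]),
        "price": [0.0, 0.99, 0.0],
        "ratings": [12, 0, 5],
    }
)

summary = overview(toy)
types = dtype_counts(toy)
print(summary)
print(types)
```

The cell count is simply observations times variables, which for the full dataset gives 475,132 × 11 = 5,226,452.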


As indicated above, we have approximately 475,000 apps in our dataset, described by 11 features. Let's take a quick look.

In [5]:
dataset.sample().style.hide(axis="index")

Id,Name,Description,Category_id,Category,Price,Developer_id,Developer,Rating,Ratings,Released
1630241692,Rabbit Comics,"Rabbit Comics is home to Comic Books, Exclusive Store Variants, CGC Slabs, and Original Art",6018,Book,0.0,1630241694,Rabbit Comics LLC,5.0,12,2022-06-17 07:00:00
1558951959,yuru Dog Photo,"～Explanation～ What is ""yuru"" Meaning slowly in Japanese It's difficult to communicate in words with people all over the world, No need for dog lovers or words Let's talk with a photo of a dog! I can't connect with people Beyond words New Type SNS！ Upload a photo of your dog whenever you like Click if you have a photo of your favorite dog",6005,Social Networking,0.0,1538370047,ikeda tomoki,0.0,0,NaT
1428694195,Food & Fire Rewards,"Frequent Fire Rewards Join our rewards club, its fun, easy, Free and available now. We have a great program that gives you more points the more you spend with us. Loyalty App Benefits • Earn points with every purchase • Redeem rewards for discounts • Receive up to date rewards and points • Specials & Offers • Events • Directions ABOUT THE PROGRAM Membership has its Rewards How your membership program works: Simply identify yourself at any of our participating locations during a visit and your server will associate your account with your guest check and you will start earning points toward future savings at any of our participating locations. To redeem a reward you must have enough points available. Only one reward redemption can be made per visit. Our program may from time to time have certain other benefits and or restrictions that apply. See our website for more program details. Offers and Discounts From time to time we may provide special discounts and or offers to certain members who qualify for these benefits. Offers are not transferrable and cannot be combined with rewards or gift card redemptions. Offers have a limited time in which they can be redeemed. Please check the offer for details and restrictions. If not specified otherwise all offers expire within 30 days of issuance. PROGRAM RULES •	You must be 18 years or older to join and no purchase is necessary. •	Your membership can be used to earn points at any of our participating locations. •	Points are not awarded on redeemed gift certificates, tax, gratuities or alcoholic beverages and will be issued on qualified purchases only on day of purchase. •	We reserve the right to change or discontinue this program at any time without notice. •	Employees are not eligible for our program. •	Loyalty points cannot be used to purchase gift cards. •	See our website for complete program rules and offering.",6023,Food & Drink,0.0,1428694194,Jose Moreira,1.0,5,2018-08-18 01:14:00
1247142099,The1Flower,Online store for flowers gifts,6024,Shopping,0.0,862721441,Mohammed Maraey Fathy Moustafa Agwa,0.0,0,2017-06-12 18:38:00
785661940,Acuity Insurance,"Have your insurance information close at hand with the Acuity app. Pay your bill, view your policy documents, report and track claims, and more all in one place! View your information and profile: Access your agency information effortlessly Conveniently add your vehicle ID cards to your Apple Wallet* Have digital copies of your certificates of insurance for any situation Lean on Acuity during tough times: Instantly connect with Emergency Roadside Assistance 24/7 Have personal guidance with the claim process step-by-step on your phone Smoothly search through our extensive Acuity pre-approved auto repair shops near you Make the payment and billing process quick and safe: Pay scheduled bills with debit/credit or checking account Be in the loop on new information by signing up for electronic mail or text message notifications *Vehicle ID card does not meet requirements for proof of insurance as required in some states.",6015,Finance,0.0,785661943,"Acuity, A Mutual Insurance Company",4.7522,3749,2013-12-30 23:46:00


Identity variables, specifically the app id and developer_id, will be retained for data processing purposes but have no other value and will be largely ignored during this analysis.

### Data Quality Analysis

Data type, cardinality, validity, duplication, and size data are summarized at the variable level.

In [6]:
dataset.info.style.hide(axis="index")

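| Column       | Datatype | Valid  | Null | Validity | Cardinality | Percent unique | Size       |
| ------------ | -------- | ------ | ---- | -------- | ----------- | -------------- | ---------- |
| id           | string   | 475132 | 0    | 1.0      | 475132      | 1.0            | 31748470   |
| name         | string   | 475132 | 0    | 1.0      | 474250      | 1.0            | 40449624   |
| description  | string   | 475132 | 0    | 1.0      | 463635      | 0.98           | 1186227495 |
| category_id  | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477790     |
| category     | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477944     |
| price        | float64  | 475132 | 0    | 1.0      | 125         | 0.0            | 3801056    |
| developer_id | string   | 475132 | 0    | 1.0      | 265367      | 0.56           | 31666555   |
| developer    | string   | 475132 | 0    | 1.0      | 264402      | 0.56           | 37093493   |
| rating       | float64  | 475132 | 0    | 1.0      | 44083       | 0.09           | 3801056    |
| ratings      | int64    | 475132 | 0    | 1.0      | 14531       | 0.03           | 3801056    |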


**Observations**

- With the exception of released (date), we have no missing values.
- Ids are unique, and name, description, and developer information are all high-cardinality.
- Category id and label are low-cardinality, with 26 unique values each.
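A variable-level quality summary like the one above can be sketched with plain pandas. This is an assumption-laden illustration: `quality_summary` and the `toy` frame are hypothetical, and validity is assumed here to mean the share of non-null values, which may differ from how the project defines it.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column completeness and cardinality (hypothetical helper)."""
    n = len(df)
    summary = pd.DataFrame(
        {
            "Datatype": df.dtypes.astype(str),
            "Valid": df.notna().sum(),
            "Null": df.isna().sum(),
            "Cardinality": df.nunique(),  # distinct non-null values
        }
    )
    # Assumption: validity is the fraction of non-null values.
    summary["Validity"] = (summary["Valid"] / n).round(2)
    summary["Percent unique"] = (summary["Cardinality"] / n).round(2)
    return summary.rename_axis("Column").reset_index()


# Toy stand-in for the real dataset (assumed values, for illustration only).
toy = pd.DataFrame(
    {
        "id": ["1", "2", "3", "4"],
        "category": ["Book", "Games", "Book", "Book"],
        "released": pd.to_datetime(["2022-06-17", None, "2018-08-18", "2017-06-12"]),
    }
)

report = quality_summary(toy)
print(report)
```

A low "Percent unique" marks a column as a candidate categorical variable, while a value near 1.0 suggests an identifier or free-text field.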

#### Numeric Variable Data Quality

Each feature has been cast to an appropriate data type and, apart from released, the dataset has no missing values. Valid values for the numeric and datetime variables are:

| Variable | Data Type  | Valid Values                                     |
| -------- | ---------- | ------------------------------------------------ |
| price    | Continuous | Non negative values                              |
| rating   | Interval   | Real valued in [0,5]                             |
| ratings  | Discrete   | Discrete and non-negative                        |
| released | Continuous | Datetimes between June 10, 2008 and present day. |

Let's check the ranges for these variables.

In [7]:
stats = dataset.describe(include=[np.number, np.datetime64])
stats.numeric[['min','max']]

| Variable | min | max        |
| -------- | --- | ---------- |
| price    | 0.0 | 999.99     |
| rating   | 0.0 | 5.0        |
| ratings  | 0.0 | 30835421.0 |


All numeric and datetime values are within range.
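The valid-value rules in the table above can also be checked row by row rather than by inspecting minima and maxima. The following is a minimal sketch in plain pandas, assuming the column names from the dataset table; `range_violations`, `LAUNCH`, and the `toy` frame are hypothetical and not part of the project's code.

```python
import pandas as pd

# Earliest valid release date, taken from the valid-values table.
LAUNCH = pd.Timestamp("2008-06-10")


def range_violations(df: pd.DataFrame, now=None) -> pd.Series:
    """Count rows that break each valid-value rule (hypothetical helper)."""
    now = pd.Timestamp.now() if now is None else now
    checks = {
        "price": df["price"] < 0,
        "rating": ~df["rating"].between(0, 5),  # bounds are inclusive
        "ratings": (df["ratings"] < 0) | (df["ratings"] % 1 != 0),
        # Missing release dates are a completeness issue, not a range issue.
        "released": df["released"].notna() & ~df["released"].between(LAUNCH, now),
    }
    return pd.Series({name: int(mask.sum()) for name, mask in checks.items()})


# Toy data with one deliberate violation in price, rating, and released.
toy = pd.DataFrame(
    {
        "price": [0.0, 999.99, -1.0],
        "rating": [0.0, 5.0, 5.5],
        "ratings": [0, 30835421, 12],
        "released": pd.to_datetime(["2013-12-30", None, "2007-01-01"]),
    }
)

violations = range_violations(toy)
print(violations)
```

A result of zero for every variable would correspond to the "within range" conclusion above.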

#### Categorical Variable Data Quality

The id, name, description, developer_id, and developer variables are nominal, high-cardinality strings. Category and category_id, in contrast, must contain one of the 26 category id / category values selected for this analysis.

In [8]:
columns = ['category_id', 'category']
dataset.unique(columns=columns).style.hide(axis="index")

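| Category_id | Category          |
| ----------- | ----------------- |
| 6013        | Health & Fitness  |
| 6017        | Education         |
| 6000        | Business          |
| 6012        | Lifestyle         |
| 6004        | Sports            |
| 6014        | Games             |
| 6007        | Productivity      |
| 6002        | Utilities         |
| 6027        | Graphics & Design |
| 6010        | Navigation        |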


Category and category_id values are as expected.
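Beyond inspecting the values by eye, one could also confirm that each category_id maps to exactly one category name and vice versa. The sketch below uses plain pandas; `category_pairs`, `is_one_to_one`, and the toy frames are hypothetical illustrations, not the project's API.

```python
import pandas as pd


def category_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Distinct (category_id, category) combinations (hypothetical helper)."""
    return df[["category_id", "category"]].drop_duplicates().reset_index(drop=True)


def is_one_to_one(pairs: pd.DataFrame) -> bool:
    """True when no id and no name appears in more than one distinct pair."""
    return bool(pairs["category_id"].is_unique and pairs["category"].is_unique)


# Toy data using id/name pairs that appear in the outputs above.
toy = pd.DataFrame(
    {
        "category_id": ["6014", "6014", "6017", "6018"],
        "category": ["Games", "Games", "Education", "Book"],
    }
)
pairs = category_pairs(toy)
consistent = is_one_to_one(pairs)

# A deliberately inconsistent frame: one id carrying two labels.
bad = pd.DataFrame(
    {
        "category_id": ["6018", "6018"],
        "category": ["Book", "Books"],
    }
)
inconsistent = is_one_to_one(category_pairs(bad))

print(pairs)
print(consistent, inconsistent)
```

On the full dataset, a consistent mapping would yield 26 distinct pairs, matching the cardinality reported for both variables.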