# Exploratory Data Analysis

## Stage I: App Performance Overview

Stage one of our exploratory data analysis aims to expose patterns and yield insight into the nature and intensity of the customer experience within the iOS app user community.

### The Dataset

The dataset contains the following descriptive, rating, price, and developer data for 475,132 apps from the App Store.

| #  | Variable     | Data Type  | Description                              |
| -- | ------------ | ---------- | ---------------------------------------- |
| 1  | id           | Nominal    | App Id from the App Store                |
| 2  | name         | Nominal    | App Name                                 |
| 3  | description  | Nominal    | App Description                          |
| 4  | category_id  | Nominal    | Numeric category identifier              |
| 5  | category     | Nominal    | Category name                            |
| 6  | price        | Continuous | App Price                                |
| 7  | developer_id | Nominal    | Identifier for the developer             |
| 8  | developer    | Nominal    | Name of the developer                    |
| 9  | rating       | Interval   | Average user rating since first release  |
| 10 | ratings      | Discrete   | Number of ratings since first release    |
| 11 | released     | Continuous | Datetime of first release                |
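The variable types in the table can be expressed as pandas dtypes. The following is a minimal sketch, assuming a dtype mapping inferred from the table and from the dtype counts reported in the structural analysis. The `APPDATA_DTYPES` dictionary and the `cast_appdata` helper are hypothetical names for illustration and are not part of the `appstore` package.

```python
import pandas as pd

# Assumed mapping from variable name to pandas dtype (hypothetical; inferred
# from the variable table, not taken from the project's code).
APPDATA_DTYPES = {
    "id": "string",
    "name": "string",
    "description": "string",
    "category_id": "category",
    "category": "category",
    "price": "float64",
    "developer_id": "string",
    "developer": "string",
    "rating": "float64",
    "ratings": "int64",
}


def cast_appdata(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the assumed dtypes applied."""
    typed = df.astype(APPDATA_DTYPES)
    # The release timestamp is parsed separately into a datetime column.
    typed["released"] = pd.to_datetime(typed["released"])
    return typed


# One record from the notebook's sample output, with the description truncated.
raw = pd.DataFrame(
    [
        {
            "id": "957162677",
            "name": "Neurology Exam Tools",
            "description": "With this app on your phone, you won't need to carry around a flashlight/pen light",
            "category_id": 6020,
            "category": "Medical",
            "price": 0.0,
            "developer_id": "955093777",
            "developer": "Healthcare Technologies LLC",
            "rating": 1.0,
            "ratings": 2,
            "released": "2015-01-21 02:51:00",
        }
    ]
)
typed = cast_appdata(raw)
print(typed.dtypes)
```

Storing the two category columns as `category` rather than `string` keeps memory use low for variables with only a handful of distinct values.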

### EDA Approach

Our exploration will comprise the following five analyses, followed by a summary of conclusions.

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions of rating count, average rating, categories, and price.
4. Bivariate Analysis: Examine pairwise relationships and correlations among ratings, rating count, and reviews.
5. Multivariate Analysis: Cluster, factor, and correspondence analysis of three or more variables simultaneously.
6. Conclusions, insights and questions for stage two.

### Preliminaries

**Import Dependencies**

In [1]:
import numpy as np
import pandas as pd
from IPython.display import HTML, display_html
import seaborn as sns
import d8analysis as eda

from appstore.data.dataset.appdata import AppDataDataset
from appstore.container import AppstoreContainer

**Wire and Initialize Dependencies**

In [2]:
container = AppstoreContainer()
container.init_resources()
container.wire(packages=["appstore"])

**Obtain the Dataset**

In [3]:
repo = container.data.appdata_repo()
dataset = repo.get_dataset()

### Structural Analysis

The structure and characteristics of the AppData dataset are as follows:

In [4]:
df1 = dataset.overview
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

Unnamed: 0,Characteristic,Total
0,Number of Observations,475132
1,Number of Variables,11
2,Number of Cells,5226452
3,Size (Bytes),962579922

Unnamed: 0,Data Type,Count
0,string,5
1,float64,2
2,category,1
3,category,1
4,int64,1
5,datetime64[ns],1


As indicated above, we have approximately 475,000 apps in our dataset, described by 11 features. Let's take a quick look.

In [5]:
dataset.sample().style.hide(axis="index")

Id,Name,Description,Category_id,Category,Price,Developer_id,Developer,Rating,Ratings,Released
1589674370,Level Shop,"Level Shop mobile application that delivers high end cosmetic brands. In simple terms it is a delivery service customized for savvy beauty buying customers. Level Shop delivers a big collection of Makeup, Perfumes, Footwear, Sportswear, Accessories and Sunglasses. As the name suggests levelstore.co is a pool of different brands in high-end category. All Welcome to the new age of digital shopping with level Shop. Indulge yourself with famous beauty brands, and have it delivered directly to your door. Download our new app today and choose from more than 10,000 products, hand-picked by our experts. Enjoy our new enhanced customer experience and a smoother shopping journey.",6000,Business,0.0,1530012698,Ahmad Hamza,4.0,4,2021-10-13 07:00:00
957162677,Neurology Exam Tools,"With this app on your phone, you won't need to carry around a flashlight/pen light, bell, or tuning fork for vibration - perfect for on-the-go neurological exams! Save space and weight in your white coat! Also included is a built-in checklist for the neurology physical exam!",6020,Medical,0.0,955093777,Healthcare Technologies LLC,1.0,2,2015-01-21 02:51:00
914948581,Pryvate Messenger,"New Pryvate Messenger. As a recipient of the Best Mobile App Awards' prestigious Best Business App award in 2015, Pryvate is recognised as the number one (#1) choice world-wide for all privacy-savvy business organisations and independent professionals who require the strongest encryption technology with unrivalled call quality and containerised IM. Pryvate features secured encrypted chat, email, browsing, voice and video calls. Want to take your privacy and professional practices to a new level of security? Pryvate enhances protection of your conversations via triple-encrypted chat, video, email, and browsing, so you can keep your movement private when conducting confidential business or having personal conversations. Pryvate offers world-class RSA 4096-bit, SHA-256 encryption technology that is reliant upon zero servers or middlemen and provides secured direct connections to co-workers, clients, and friends. Easily erase sensitive messages immediately or via call history with the auto-delete timer, and if you've sent something you shouldn't have, you can delete your conversations from the recipients phone with the delete for all function or delete all messages via Remote wipe function. With Pryvate's growing catalogue of utility-based features, you can rest assured knowing your communications are kept secure as you indulge in crystal-clear call quality and reap the benefits of using the most-secure communications protocol on the planet. What’s new in Pryvate? 
• Improved User Interface • Auto Delete function immediately (or set a timer to automatically delete messages) • Remote Wipe function (Delete all app content remotely) • Screenshot Spy function (get notified when a user screenshots the conversation) • Easy to set-up, secured, private agnostic email (integrated with Microsoft Exchange) • Anti-blocking tech, ensuring unrestricted global access • Now you can Forward, Reply and React to Messages • And so much more, we are frequently updating! Coming soon.. • TOR Mobile Web Browser (browse the internet via a proxy on your mobile) • Secured Cryptocurrency Wallet (used with the PryvateCoin 'PVC' token) • PryvateX – Cryptocurrency Exchange integration • Integrated Distributed Applications Portal (dapps) integration Download the secure Pryvate app, independently certified as the ""real deal"" with NO back doors! Ensure your communications are kept confidential today with Pryvate. Supported devices This version of app runs exclusively on iOS devices with 64-bit processors. For optimum app performance, it is recommended you update to iOS 14 plus iPhone • iPhone 5S • iPhone 6 • iPhone 6 Plus • iPhone 6S • iPhone 6S Plus • iPhone SE • iPhone 7 • iPhone 7 Plus • iPhone 8 • iPhone 8 Plus • iPhone X • iPhone XS • iPhone XS Max • iPhone XR • iPhone 11 • iPhone 11 Pro • iPhone 11 Pro Max • iPhone 12 • iPhone 12 Mini • iPhone 12 Pro • iPhone 12 Pro Max • iPhone 13 • iPhone 13 Mini • iPhone 13 Pro • iPhone 13 Pro Max iPod Touch • iPod Touch (6th generation) • iPod Touch (7th generation) iPad • iPad Air • iPad Air 2 • iPad Air 3 • iPad 5 • iPad 6 • iPad 7 • iPad Mini 2 • iPad Mini 3 • iPad Mini 4 • iPad Mini 5 • iPad Pro (12.9-inch) • iPad Pro (9.7-inch) • iPad Pro (10.5-inch) • iPad Pro (11-inch)",6000,Business,0.0,914948585,Criptyque,4.45833,72,2014-09-26 23:18:00
1450058114,DTH Channel Price & Selection,"Get DTH Channel & Packages Prices as per new TRAI Rule. Application gives you - - Get the detail of channel & its prices - Get the detail of packages of broadcaster & its price - Create Packages to easy to understand - set you favorite channel - get the network fee , channel price, & total channels - Share your selected channels & packages to your cable operator - Select free channels to get the minimum price of your packages - Select SD & HD channel - Provides Channels languages - Provides Channel Category Application is useful to GET , SELECT & SHARE Channel. ** NOT GOVERMENT APP **",6016,Entertainment,0.0,1407139621,Rajesh Sutariya,4.75,4,2019-01-22 02:48:00
1069841163,TUTEUR APP,"TUTEUR APP es una herramienta diseñada para ayudar a los profesionales de la salud. Contiene la descripción de los productos comercializados por el laboratorio Tuteur S.A.C.I.F.I.A., organizados por sus respectivas líneas: Hematología, Oncología, Bioterapéuticos y Drogas Huérfanas.  La búsqueda puede hacerse por droga o producto, con un rápido acceso a la información de los mismos. También encontrará la posología de cada droga y una función de gran utilidad para calcular la superficie corporal del paciente.  Absolutamente todas las drogas que encontrará en esta aplicación cuentan con la autorización del Ministerio de Salud para ser comercializadas en la Argentina. Su objetivo es facilitar el día a día de la práctica médica y debe ser utilizada como un complemento de su indelegable responsabilidad profesional, teniendo en cuenta que el lector posee los conocimientos necesarios para interpretar la información que encuentre en ella. Esta aplicación está disponible para smartphone o tablet y contiene: - Buscador de productos. - Información actualizada de cada uno de ellos. - Vademécum de drogas comercializadas por el Laboratorio TUTEUR S.A.C.I.F.I.A. - Dosificación de cada droga. - Calculadora de superficie corporal. - Dirección de las oficinas comerciales, teléfonos y mail de contacto del laboratorio.  No necesita acceso a Internet.",6020,Medical,0.0,1069841162,Tuteur,0.0,0,2016-01-06 00:49:00


Identity variables, specifically (app) id and developer_id, will be retained for data processing purposes but have no analytical value and will be largely ignored during this analysis.
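The structural summary reported in this section comes from the project's own `dataset.overview` and `dataset.dtypes` properties, whose implementation is not shown here. As a rough approximation, similar figures can be derived with plain pandas. The `overview` and `dtype_counts` functions below are hypothetical helpers, and the toy frame stands in for the real data.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize the shape and in-memory size of a DataFrame."""
    rows, cols = df.shape
    return pd.DataFrame(
        {
            "Characteristic": [
                "Number of Observations",
                "Number of Variables",
                "Number of Cells",
                "Size (Bytes)",
            ],
            # deep=True also measures the contents of object/string columns.
            "Total": [rows, cols, rows * cols, int(df.memory_usage(deep=True).sum())],
        }
    )


def dtype_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count how many columns share each dtype."""
    counts = df.dtypes.astype(str).value_counts()
    return counts.rename_axis("Data Type").reset_index(name="Count")


# Toy data standing in for the app dataset.
toy = pd.DataFrame({"price": [0.0, 1.99, 4.99], "ratings": [4, 2, 0]})
toy_overview = overview(toy)
toy_dtypes = dtype_counts(toy)
print(toy_overview)
print(toy_dtypes)
```

The number of cells is simply observations times variables, which for the full dataset is 475,132 × 11 = 5,226,452, matching the reported total.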

### Data Quality Analysis

Data type, cardinality, validity, duplication, and size data are summarized at the variable level.

In [6]:
dataset.info.style.hide(axis="index")

Column,Datatype,Valid,Null,Validity,Cardinality,Percent unique,Size
id,string,475132,0,1.0,475132,1.0,31748470
name,string,475132,0,1.0,474250,1.0,40449624
description,string,475132,0,1.0,463635,0.98,1186227495
category_id,category,475132,0,1.0,26,0.0,477790
category,category,475132,0,1.0,26,0.0,477944
price,float64,475132,0,1.0,125,0.0,3801056
developer_id,string,475132,0,1.0,265367,0.56,31666555
developer,string,475132,0,1.0,264402,0.56,37093493
rating,float64,475132,0,1.0,44083,0.09,3801056
ratings,int64,475132,0,1.0,14531,0.03,3801056


**Observations**

- With the exception of released (date), we have no missing values.
- Ids are unique; name, description, and developer information are all high-cardinality.
- Category id and label are low-cardinality, with 26 unique values each.
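For readers without access to the project's `dataset.info` property, a comparable per-variable summary can be sketched in plain pandas. This is an approximation under assumptions: the `column_info` function is a hypothetical helper, and Validity and Percent unique are taken to be the valid and unique counts divided by the number of rows, rounded to two decimals.

```python
import pandas as pd


def column_info(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column data type, completeness, cardinality, and memory size."""
    n = len(df)
    info = pd.DataFrame(
        {
            "Column": df.columns,
            "Datatype": df.dtypes.astype(str).to_numpy(),
            "Valid": df.notna().sum().to_numpy(),
            "Null": df.isna().sum().to_numpy(),
            # nunique() ignores missing values by default.
            "Cardinality": df.nunique().to_numpy(),
            "Size": df.memory_usage(deep=True, index=False).to_numpy(),
        }
    )
    # Assumed definitions: shares of the row count, rounded to two decimals.
    info["Validity"] = (info["Valid"] / n).round(2)
    info["Percent unique"] = (info["Cardinality"] / n).round(2)
    return info


# Toy data with one missing developer name.
toy = pd.DataFrame(
    {
        "developer": ["Criptyque", "Criptyque", "Tuteur", None],
        "ratings": [72, 4, 0, 2],
    }
)
toy_info = column_info(toy).set_index("Column")
print(toy_info)
```

With two-decimal rounding, a variable such as name (474,250 unique values in 475,132 rows) reports a Percent unique of 1.0 even though a small number of names repeat.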

#### Numeric Variable Data Quality

Each feature has been cast to an appropriate data type, and the variables summarized above contain no missing values. Valid values for the numeric and datetime variables are:

| Variable | Data Type  | Valid Values                                     |
| -------- | ---------- | ------------------------------------------------ |
| price    | Continuous | Non-negative values                              |
| rating   | Interval   | Real valued in [0,5]                             |
| ratings  | Discrete   | Discrete and non-negative                        |
| released | Continuous | Datetimes between June 10, 2008 and present day. |

Let's check the ranges for these variables.

In [7]:
stats = dataset.describe(include=[np.number, np.datetime64])
stats.numeric[['min','max']]

Unnamed: 0,min,max
price,0.0,999.99
rating,0.0,5.0
ratings,0.0,30835421.0


All numeric and datetime values are within range.
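The valid-value rules in the table above can also be checked explicitly rather than by inspecting minima and maxima. The following is a minimal sketch in plain pandas; the `range_violations` function is a hypothetical helper, the lower date bound is the one stated in the table, and the toy frame deliberately includes out-of-range values.

```python
import pandas as pd


def range_violations(df: pd.DataFrame, earliest: str = "2008-06-10") -> pd.Series:
    """Count the rows that break each valid-value rule."""
    now = pd.Timestamp.now()
    checks = {
        "price": df["price"] < 0,
        # between() is inclusive on both ends; missing values fail the check.
        "rating": ~df["rating"].between(0, 5),
        "ratings": df["ratings"] < 0,
        "released": ~df["released"].between(pd.Timestamp(earliest), now),
    }
    return pd.Series({name: int(mask.sum()) for name, mask in checks.items()})


# Toy data: the third row violates the price, rating, and released rules.
toy = pd.DataFrame(
    {
        "price": [0.0, 999.99, -1.0],
        "rating": [0.0, 5.0, 5.5],
        "ratings": [0, 30835421, 4],
        "released": pd.to_datetime(["2015-01-21", "2021-10-13", "2001-01-01"]),
    }
)
violations = range_violations(toy)
print(violations)
```

A result of all zeros would correspond to the conclusion that every value is within range.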

#### Categorical Variable Data Quality

The id, name, description, developer_id, and developer variables are nominal, high-cardinality strings. Category and category_id, in contrast, must contain one of the 26 category id / category values selected for this analysis.

In [8]:
columns = ['category_id', 'category']
dataset.unique(columns=columns).style.hide(axis="index")

Category_id,Category
6013,Health & Fitness
6017,Education
6000,Business
6012,Lifestyle
6004,Sports
6014,Games
6007,Productivity
6002,Utilities
6027,Graphics & Design
6010,Navigation


Category and category_id values are as expected.
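Beyond listing the unique pairs, one way to confirm that category_id and category agree is to test that each id maps to exactly one name and vice versa. The sketch below is an illustration in plain pandas, not the project's method; the `category_mapping_issues` function is a hypothetical helper, and the toy frames use category pairs that appear in the sample output.

```python
import pandas as pd


def category_mapping_issues(df: pd.DataFrame) -> dict:
    """Report category ids or names that are not in one-to-one correspondence."""
    pairs = df[["category_id", "category"]].drop_duplicates()
    names_per_id = pairs.groupby("category_id", observed=True)["category"].nunique()
    ids_per_name = pairs.groupby("category", observed=True)["category_id"].nunique()
    return {
        "n_pairs": len(pairs),
        "ambiguous_ids": names_per_id[names_per_id > 1].index.tolist(),
        "ambiguous_names": ids_per_name[ids_per_name > 1].index.tolist(),
    }


# Consistent toy data: three distinct (id, name) pairs with one repeat.
good = pd.DataFrame(
    {
        "category_id": [6000, 6020, 6016, 6000],
        "category": ["Business", "Medical", "Entertainment", "Business"],
    }
)
# Inconsistent toy data: id 6000 is attached to two different names.
bad = pd.DataFrame(
    {"category_id": [6000, 6000], "category": ["Business", "Medical"]}
)
good_report = category_mapping_issues(good)
bad_report = category_mapping_issues(bad)
print(good_report)
print(bad_report)
```

For the full dataset, a clean result would show 26 pairs and two empty lists.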