# Exploratory Data Analysis

## Stage I: App Performance Overview

Stage one of our exploratory data analysis aims to expose patterns and yield insight into the nature and intensity of the customer experience within the iOS app user community.

### The Dataset

The dataset contains the following descriptive, rating, price, and developer data for 475,132 apps from the App Store.

| #  | Variable     | Data Type  | Description                              |
| -- | ------------ | ---------- | ---------------------------------------- |
| 1  | id           | Nominal    | App Id from the App Store                |
| 2  | name         | Nominal    | App Name                                 |
| 3  | description  | Nominal    | App Description                          |
| 4  | category_id  | Nominal    | Numeric category identifier              |
| 5  | category     | Nominal    | Category name                            |
| 6  | price        | Continuous | App Price                                |
| 7  | developer_id | Nominal    | Identifier for the developer             |
| 8  | developer    | Nominal    | Name of the developer                    |
| 9  | rating       | Interval   | Average user rating since first release  |
| 10 | ratings      | Discrete   | Number of ratings since first release    |
| 11 | released     | Continuous | Datetime of first release                |

### EDA Approach

Our exploration comprises the following five analyses and a concluding summary.

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions of rating count, average rating, categories, and price.
4. Bivariate Analysis: Evaluate relationships between pairs of variables, such as ratings, rating count, and reviews, including correlation analysis.
5. Multivariate Analysis: Apply cluster, factor, and correspondence analysis to three or more variables simultaneously.
6. Conclusions, insights and questions for stage two.

### Preliminaries

**Import Dependencies**

In [1]:
import numpy as np
import pandas as pd
from IPython.display import HTML, display_html
import seaborn as sns
import d8analysis as eda

from appstore.data.dataset.appdata import AppDataDataset
from appstore.container import AppstoreContainer

**Wire and Initialize Dependencies**

In [2]:
container = AppstoreContainer()
container.init_resources()
container.wire(packages=["appstore"])

**Obtain the Dataset**

In [3]:
repo = container.data.appdata_repo()
dataset = repo.get_dataset()

### Structural Analysis

The structure and characteristics of the AppData dataset are as follows:

In [4]:
df1 = dataset.overview
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

| Characteristic         | Total     |
| ---------------------- | --------- |
| Number of Observations | 475132    |
| Number of Variables    | 11        |
| Number of Cells        | 5226452   |
| Size (Bytes)           | 962579922 |

| Data Type      | Count |
| -------------- | ----- |
| string         | 5     |
| float64        | 2     |
| category       | 1     |
| category       | 1     |
| int64          | 1     |
| datetime64[ns] | 1     |
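
For readers without access to the `AppDataDataset` wrapper, a similar structural summary can be approximated with plain pandas. The sketch below is not the implementation behind `dataset.overview` or `dataset.dtypes`; the helper names `structural_overview` and `dtype_counts` and the toy frame are hypothetical, chosen only for illustration.

```python
import pandas as pd


def structural_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize the shape and memory footprint of a DataFrame."""
    n_rows, n_cols = df.shape
    return pd.DataFrame(
        {
            "Characteristic": [
                "Number of Observations",
                "Number of Variables",
                "Number of Cells",
                "Size (Bytes)",
            ],
            "Total": [
                n_rows,
                n_cols,
                n_rows * n_cols,  # cells = observations x variables
                int(df.memory_usage(deep=True).sum()),  # deep=True counts string contents
            ],
        }
    )


def dtype_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count how many columns share each data type name."""
    counts = df.dtypes.astype(str).value_counts()
    return counts.rename_axis("Data Type").reset_index(name="Count")


# Toy stand-in for the app data (values are illustrative only).
toy = pd.DataFrame(
    {
        "id": ["1", "2", "3"],
        "category": pd.Categorical(["Games", "Education", "Games"]),
        "price": [0.0, 1.99, 0.0],
        "ratings": [10, 0, 3],
    }
)

overview = structural_overview(toy)
types = dtype_counts(toy)
print(overview)
print(types)
```

The cell count is simply the product of observations and variables, which is consistent with the figures reported above: 475,132 × 11 = 5,226,452. Because this sketch groups columns by dtype name, two distinct categorical dtypes would be merged into a single `category` row rather than listed separately.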


As indicated above, we have approximately 475,000 apps in our dataset, described by 11 features. Let's take a quick look.

In [5]:
dataset.sample().style.hide(axis="index")

Id,Name,Description,Category_id,Category,Price,Developer_id,Developer,Rating,Ratings,Released
1557210553,"MKMW - Mein Krebs, mein Weg","Mein Krebs, mein Weg Die Diagnose „Krebs“ trifft Betroffene und deren Angehörige meist aus heiterem Himmel und verändert das gesamte Leben. Die Krankheit steht fortan im Mittelpunkt und der gesamte Alltag muss neu organisiert werden. Dadurch wachsen nicht selten Unsicherheit und Ängste. Information und Orientierung für Krebspatienten und Angehörige „Mein Krebs, mein Weg“: Denn jede Krebserkrankung und jede Krebstherapie ist anders. Diese App möchte dazu beitragen, Patienten und Angehörigen die Angst vor der Krankheit und Therapie zu nehmen und als Begleiter und Navigator in dieser schweren Zeit zur Seite stehen. Die medizinische und fachliche Expertise ist dafür unverzichtbar: Ratgeber und Expertenbeiträge helfen Betroffenen, ihren Alltag mit der Krebserkrankung besser zu bewältigen und vermitteln das Wissen, wie sich während der Therapie eine möglichst hohe Lebensqualität bewahren lässt. In der App finden Sie einen leicht nachvollziehbaren Überblick zu den möglichen Stationen der Krebstherapie und den zahlreichen Beratungs- und Unterstützungsangeboten. Unter den einzelnen Wegweisern Diagnose, Behandlung, Soziales und Aktuelles finden sie konkrete Hilfsmittel: Professionelle Service-Materialien helfen kostenlos bei der Vorbereitung auf die Termine, während regelmäßige Artikel über aktuelle Themen und Entwicklungen aus der Krebsforschung informieren. „Mein Krebs, mein Weg“ richtet sich nicht an einzelne Krebsarten, sondern spricht jeden Krebspatienten und dessen Angehörige an. Die App dient ausschließlich als Informationsgrundlage zum Umgang mit einer Krebserkrankung. Jede Entscheidung für die Behandlung und das Therapiemanagement erfolgt durch den behandelnden Arzt.",6013,Health & Fitness,0.0,563909628,Esteve,0.0,0,2021-05-06 07:00:00
1169589510,Fusionetics Pro,"Fusionetics is developing a new class of products to support its expanding user base as it extends relationships across performance healthcare and the various settings and environments where services are delivered. The new products are intended to support providers and extend their reach and capabilities in delivering evidence-based solutions to help people move better, perform stronger, and recover faster. The Fusionetics Pro app is available for providers and Practitioners of elite sport, fitness and healthcare organizations who have an existing Fusionetics license. Now Practitioners can manage clients, collect data, monitor activity and stay connected in and out of their facility with all of their clients. The key tools, features and functions used in Fusionetics Sports Science platform are now available for use on a mobile device. This allows Practitioners to manage their accounts and provide testing, analytics and programming wherever they go. Data and information syncs between the mobile app and the Fusionetics Sports Science platform in real time regardless of which device is being used so you can provide the best in performance healthcare. Note: you must have an active Fusionetics Practitioner account to Sign In to the Fusionetics Pro App.  
Fusionetics Pro App Features include:  • Access accounts including Facilities, Teams and Clients • Add new Clients on the fly  • View Client Profiles & Dashboards  • Access Client Training Calendars for daily activities and monthly game plans  • Perform Tests for Movement Efficiency, Range of Motion, Performance, and Recovery  • Generate and view reports & analytics based on test data  • Generate and edit recommended Movement Efficiency Programs • Select, edit and assign Programs from the Program Catalog • Management of training calendars and client schedules • View program exercises and techniques with HD video instruction  • All data syncs in real-time with the Fusionetics Sports Science platform",6013,Health & Fitness,0.0,983035073,Fusionetics,4.0,4,2016-11-06 01:58:00
1441297895,VPN Ai + Private Browser,"VPN Ai is the latest and world’s most advanced VPN. Enjoy best-in-class VPN technology to encrypt your online information and stay completely secure and anonymous on the Internet. GET VPN Ai TO: - Access ANY website, video or game privately and securely from home, work, school and anywhere in the world. - Block tracking and logging of your IP address, location, identity, and stay completely private and anonymous. - Integrated Private Browser that block ads and scripts, to help you and browse the web seamlessly and securely. With just a tap, VPN Ai takes your web activity private and anonymous, with top performance in speed and security. Access the Internet while completely protecting your identity and without compromise. With VPN Ai, you are getting the safest and most secure browsing experience. VPN Ai features: ► Access websites, games and apps safely and anonymously VPN Ai lets you access websites and apps in school, at work, or at home. You can browse websites like Netflix, Facebook, Youtube, or use apps like Snapchat, and play games over restricted Wi-Fi networks ► Free and Unlimited VPN Ai is free to use. Enjoy our VPN service at no costs. You may even want to try out our Premium service under a Free Trial which you can cancel at any time. ► Anonymous and Untraceable Hide your Internet activity. Remain completely anonymous without being tracked or traced by ISPs, or websites or prying eyes. ► No Logs Strict no-log policy, forever. ► Worldwide VPN Coverage Choose from ultra-high speed servers from over 15 countries including US, UK, CN, HK, CA, IN, SG, JP, and many more. Ready to get started? Try VPN Ai for free now. About VPN Ai PREMIUM: - VPN Ai Premium is offered for 1 Week, 1 Month or 1 Year, with a free trial available. - Price of Subscription ranges from $1.99-$79.99 depending on the subscription period you choose. 
Additional Subscription Information: - Payment will be charged to iTunes Account at confirmation of purchase. - Offer limited to one 7-day trial per user. After the first 7 days, subscription renews automatically unless cancelled before the end of the 7-day trial. Subscription may be cancelled at any time within the iTunes and App Store Apple ID Settings. All prices include applicable local sales taxes. - Subscription automatically renews unless auto-renewal is turned off at least 24-hours before the end of the period. - Account will be charged for renewal within 24-hours prior to the end of the current period. - Subscriptions may be managed by the user and auto-renewal may be turned off by going to the user's Account Settings after purchase. - Subscription can be cancelled anytime. However, no cancellation of the current subscription is allowed during active subscription period. - Any unused portion of a free trial period, if offered, will be forfeited when the user purchases a subscription to the service. About Ai APPS: ""Ai"" stands for ""Alternate Identity"" - Our mission here at Ai Apps is to help each individual consumer take control of their personal privacy and safeguard their personal information from the prying eyes of the Internet. We work hard to build great products that help promote a safer and more secure internet for everyone. Follow us on.. Website: https://www.myaiapps.com/ Privacy: https://www.myaiapps.com/vpn/privacy.html Terms: https://www.myaiapps.com/vpn/terms.html Copyright © 2019 Ai Apps. All Rights Reserved",6007,Productivity,0.0,1441297894,AI APPS PTE LTD,4.67148,3117,2018-12-18 07:08:00
1489449344,كومنتاتك - Commentatk,تطبيق يخص موقع وصفحه كومنتاتك - Commentatk للمساعده علي ارسال الاسكرينات يقوم العضو بارسال الاسكرينات على صفحه كومنتاتك - Commentatk او مواقف كومنتاتك - Mwakef Commentatk او اسكرين كوميدي - Screen Comedy او كوميكس كومنتاتك - Commentatk Comics او اسلاميات كومنتاتك - Commentatk Islamic ويتم مراجعتها بواسطه ادمنز الصفحها وفى حاله كانت الصوره لا تخالف شروط التطبيق يتم ارسال SMS للعميل بان الصوره تمت الموافقه عليها وسيتم نشرها فى الوقت المحدد بعرض جميع الاسكرينات التي يتم ارسالها بواسطه التطبيق على صفحه صفحه كومنتاتك - Commentatk او مواقف كومنتاتك - Mwakef Commentatk او اسكرين كوميدي - Screen Comedy او كوميكس كومنتاتك - Commentatk Comics او اسلاميات كومنتاتك - Commentatk Islamic وعمل مسابقات يوميه بكروت شحن يقوم اداره التطبيق باضافه صوره كوميديه يومياً واحسن تعليق على المسابقه يفوز بكارت شحن مجاناً ويتم ارسال رقم الكارت وصورته على البريد الالكتروني ورقم الجوال واشعار على الموبايل .. صفحاتنا : - كومنتاتك - Commentatk مواقف كومنتاتك - Mwakef Commentatk كوميكس كومنتاتك - Commentatk Comics اسلاميات كومنتاتك - Commentatk Islamic مشاكل كومنتاتك - Commentatk Problems مواهب كومنتاتك - Commentatk Talent كمين كومنتاتك - Kamin Commentatk اسكرين كوميدي - Screen Comedy فى حاله وجود اى مشاكل برجاء التواصل معنا على الصفحه الخاصه بنا https://www.facebook.com/commentatkcom/ سعداء جداً بخدمتكم,6016,Entertainment,0.0,1489449330,MAHMOUD AHMED EL KOMY AND PARTNER,5.0,1,2019-12-27 08:00:00
1552595461,PF360,"Programa dirigido a los Titulares y su Familia, mediante el cual adquieres: - Asistencia Médica - Check Up - Hogar. - Asesoría Legal. - Beneficios de conexión a emergencia. - Asistencia Vial. - Club de Descuentos. - Llamadas ilimitadas. - Seguros de vida y accidentes.",6012,Lifestyle,0.0,1552595463,"Ein Asistencias, S. de R.L.",0.0,0,2021-02-23 08:00:00


The identity variables, specifically (app) id and developer_id, are retained for data processing purposes but have no analytical value and will be largely ignored during this analysis.

### Data Quality Analysis

Data type, cardinality, validity, duplication, and size data are summarized at the variable level.

In [6]:
dataset.info.style.hide(axis="index")

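| Column       | Datatype | Valid  | Null | Validity | Cardinality | Percent unique | Size       |
| ------------ | -------- | ------ | ---- | -------- | ----------- | -------------- | ---------- |
| id           | string   | 475132 | 0    | 1.0      | 475132      | 1.0            | 31748470   |
| name         | string   | 475132 | 0    | 1.0      | 474250      | 1.0            | 40449624   |
| description  | string   | 475132 | 0    | 1.0      | 463635      | 0.98           | 1186227495 |
| category_id  | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477790     |
| category     | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477944     |
| price        | float64  | 475132 | 0    | 1.0      | 125         | 0.0            | 3801056    |
| developer_id | string   | 475132 | 0    | 1.0      | 265367      | 0.56           | 31666555   |
| developer    | string   | 475132 | 0    | 1.0      | 264402      | 0.56           | 37093493   |
| rating       | float64  | 475132 | 0    | 1.0      | 44083       | 0.09           | 3801056    |
| ratings      | int64    | 475132 | 0    | 1.0      | 14531       | 0.03           | 3801056    |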


**Observations**

- With the exception of released (date), we have no missing values.
- Ids are unique; name, description, and developer information are all high-cardinality.
- Category id and label are low-cardinality, with 26 unique values each.
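
A variable-level summary of this kind can be reproduced approximately with pandas alone. The following is a minimal sketch under stated assumptions: `quality_summary` is a hypothetical helper, not the code behind `dataset.info`, and the toy frame only mimics a few of the columns.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column validity and cardinality, similar in spirit to the table above."""
    n = len(df)
    summary = pd.DataFrame(
        {
            "Column": list(df.columns),
            "Datatype": [str(t) for t in df.dtypes],
            "Valid": df.notna().sum().to_numpy(),      # non-null count
            "Null": df.isna().sum().to_numpy(),        # null count
            "Cardinality": df.nunique().to_numpy(),    # distinct non-null values
        }
    )
    summary["Validity"] = (summary["Valid"] / n).round(2)
    summary["Percent unique"] = (summary["Cardinality"] / n).round(2)
    return summary


# Toy data (illustrative only): one repeated developer, one missing release date.
toy = pd.DataFrame(
    {
        "id": ["1", "2", "3", "4"],
        "developer": ["A", "A", "B", "C"],
        "released": pd.to_datetime(["2016-11-06", None, "2019-12-27", "2021-02-23"]),
    }
)

summary = quality_summary(toy).set_index("Column")
print(summary)
```

A column whose cardinality equals the number of observations, such as `id` here, behaves as a unique key; a ratio well below 1 signals repeated values.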

#### Numeric Variable Data Quality

Each feature has been cast to an appropriate data type, and the summary above shows no null values for the variables it lists. Valid values for the numeric and datetime variables are:

| Variable | Data Type  | Valid Values                                     |
| -------- | ---------- | ------------------------------------------------ |
| price    | Continuous | Non negative values                              |
| rating   | Interval   | Real valued in [0,5]                             |
| ratings  | Discrete   | Discrete and non-negative                        |
| released | Continuous | Datetimes between June 10, 2008 and present day. |

Let's check the ranges for these variables.

In [7]:
stats = dataset.describe(include=[np.number, np.datetime64])
stats.numeric[['min','max']]

| Variable | min | max        |
| -------- | --- | ---------- |
| price    | 0.0 | 999.99     |
| rating   | 0.0 | 5.0        |
| ratings  | 0.0 | 30835421.0 |


All numeric and datetime values are within range.
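
The range rules in the table of valid values can also be checked programmatically rather than by reading off minima and maxima. The sketch below assumes a pandas DataFrame with `price`, `rating`, `ratings`, and `released` columns; `range_violations` and `EARLIEST_RELEASE` are hypothetical names, and the lower date bound is the one stated in that table.

```python
import pandas as pd

# Lower bound for release dates, as stated in the table of valid values.
EARLIEST_RELEASE = pd.Timestamp("2008-06-10")


def range_violations(df: pd.DataFrame, now=None) -> pd.Series:
    """Count rows that break each validity rule (missing values count as violations)."""
    now = pd.Timestamp.now() if now is None else now
    checks = {
        "price": df["price"] >= 0,                                    # non-negative
        "rating": df["rating"].between(0, 5),                         # real-valued in [0, 5]
        "ratings": df["ratings"] >= 0,                                # non-negative count
        "released": df["released"].between(EARLIEST_RELEASE, now),    # within the store's lifetime
    }
    return pd.Series(
        {name: int((~ok).sum()) for name, ok in checks.items()},
        name="violations",
    )


# Toy data (illustrative only); the last row deliberately breaks three rules.
toy = pd.DataFrame(
    {
        "price": [0.0, 999.99, -1.0],
        "rating": [0.0, 5.0, 5.5],
        "ratings": [0, 30835421, 4],
        "released": pd.to_datetime(["2016-11-06", "2021-05-06", "2007-01-01"]),
    }
)

violations = range_violations(toy, now=pd.Timestamp("2024-01-01"))
print(violations)
```

A result of zero for every variable would correspond to the "within range" conclusion above.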

#### Categorical Variable Data Quality

The id, name, description, developer_id, and developer variables are nominal, high-cardinality strings. Category and category_id, in contrast, must contain one of the 26 category id / category values selected for this analysis.

In [8]:
columns = ['category_id', 'category']
dataset.unique(columns=columns).style.hide(axis="index")

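| Category_id | Category          |
| ----------- | ----------------- |
| 6013        | Health & Fitness  |
| 6017        | Education         |
| 6000        | Business          |
| 6012        | Lifestyle         |
| 6004        | Sports            |
| 6014        | Games             |
| 6007        | Productivity      |
| 6002        | Utilities         |
| 6027        | Graphics & Design |
| 6010        | Navigation        |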


Category and category_id values are as expected.
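
Beyond inspecting the listed values, one way to confirm that ids and labels line up is to check that each `category_id` maps to exactly one `category` and vice versa. This is a minimal sketch, assuming a pandas DataFrame with those two columns; `check_category_mapping` is a hypothetical helper and the toy rows reuse a few of the pairs shown above.

```python
import pandas as pd


def check_category_mapping(df: pd.DataFrame, expected_n: int) -> dict:
    """Verify a one-to-one mapping between category_id and category."""
    pairs = df[["category_id", "category"]].drop_duplicates()
    return {
        "n_pairs": len(pairs),
        "ids_unique": bool(pairs["category_id"].is_unique),     # no id with two labels
        "labels_unique": bool(pairs["category"].is_unique),     # no label with two ids
        "count_matches": len(pairs) == expected_n,
    }


# Toy data (illustrative only) built from id/label pairs listed above.
toy = pd.DataFrame(
    {
        "category_id": [6013, 6013, 6014, 6007],
        "category": ["Health & Fitness", "Health & Fitness", "Games", "Productivity"],
    }
)

result = check_category_mapping(toy, expected_n=3)
print(result)
```

For the full dataset, `expected_n` would be 26, the number of categories selected for this analysis.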