# Exploratory Data Analysis

## Stage I: App Performance Overview

Stage one of our exploratory data analysis aims to expose patterns and yield insight into the nature and intensity of the customer experience within the iOS app user community.

### The Dataset

The dataset contains the following descriptive, rating, price, and developer data for 475,132 apps from the App Store.

| #  | Variable     | Data Type  | Description                              |
| -- | ------------ | ---------- | ---------------------------------------- |
| 1  | id           | Nominal    | App Id from the App Store                |
| 2  | name         | Nominal    | App Name                                 |
| 3  | description  | Nominal    | App Description                          |
| 4  | category_id  | Nominal    | Numeric category identifier              |
| 5  | category     | Nominal    | Category name                            |
| 6  | price        | Continuous | App Price                                |
| 7  | developer_id | Nominal    | Identifier for the developer             |
| 8  | developer    | Nominal    | Name of the developer                    |
| 9  | rating       | Interval   | Average user rating since first release  |
| 10 | ratings      | Discrete   | Number of ratings since first release    |
| 11 | released     | Continuous | Datetime of first release                |

### EDA Approach

Our exploration comprises the following five analyses, followed by a summary of conclusions.

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions of rating count, average rating, categories, and price.
4. Bivariate Analysis: Evaluate pairwise relationships among ratings, rating count, and reviews, including correlation analysis.
5. Multivariate Analysis: Cluster, factor, and correspondence analysis of three or more variables simultaneously.
6. Conclusions, insights and questions for stage two.

### Preliminaries

**Import Dependencies**

In [1]:
import numpy as np
import pandas as pd
from IPython.display import HTML, display_html
import seaborn as sns
import d8analysis as eda

from appstore.data.dataset.appdata import AppDataDataset
from appstore.container import AppstoreContainer

**Wire and Initialize Dependencies**

In [2]:
container = AppstoreContainer()
container.init_resources()
container.wire(packages=["appstore"])

**Obtain the Dataset**

In [3]:
repo = container.data.appdata_repo()
dataset = repo.get_dataset()

### Structural Analysis

The structure and characteristics of the AppData dataset are as follows:

In [4]:
df1 = dataset.overview
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

| Characteristic         | Total     |
| ---------------------- | --------- |
| Number of Observations | 475132    |
| Number of Variables    | 11        |
| Number of Cells        | 5226452   |
| Size (Bytes)           | 962579922 |

| Data Type      | Count |
| -------------- | ----- |
| string         | 5     |
| float64        | 2     |
| category       | 1     |
| category       | 1     |
| int64          | 1     |
| datetime64[ns] | 1     |
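
For readers without access to the `dataset.overview` and `dataset.dtypes` properties, a comparable summary can be derived from a plain pandas DataFrame. The sketch below is only an approximation of what those properties might compute: the helper names `structural_overview` and `dtype_counts` and the toy frame are hypothetical, and the byte count relies on `memory_usage(deep=True)`, which may differ from the size figure reported above.

```python
import pandas as pd


def structural_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize the shape and memory footprint of a DataFrame."""
    rows, cols = df.shape
    return pd.DataFrame(
        {
            "Characteristic": [
                "Number of Observations",
                "Number of Variables",
                "Number of Cells",
                "Size (Bytes)",
            ],
            "Total": [
                rows,
                cols,
                rows * cols,
                # deep=True also counts the contents of object/string columns
                int(df.memory_usage(deep=True).sum()),
            ],
        }
    )


def dtype_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count how many columns share each data type."""
    counts = df.dtypes.astype(str).value_counts()
    return counts.rename_axis("Data Type").reset_index(name="Count")


# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame(
    {
        "id": ["1", "2", "3"],
        "price": [0.0, 3.99, 0.0],
        "ratings": [0, 188, 15],
    }
)
overview = structural_overview(toy)
types = dtype_counts(toy)
```

The number of cells is simply observations times variables, which matches the figures above: 475,132 × 11 = 5,226,452.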


As indicated above, we have approximately 475,000 apps in our dataset, described by 11 features. Let's take a quick look.

In [5]:
dataset.sample().style.hide(axis="index")

Id,Name,Description,Category_id,Category,Price,Developer_id,Developer,Rating,Ratings,Released
1464388848,Anime Wallpaper 4K Premium,"Anime Wallpaper is the best app for fans (Otaku Wallpapers) of japan animated series, manga and movies you can discover amazing wallpapers of your favorite anime or manga, Anime Wallpapers it has a lot of wallpapers and background of all anime wallpapers, upload daily wallpapers, wallpapers of new anime and add new wallpapers anime series every day. Wall of the Day Come back every day for a new treat. This is where we showcase newly created backdrops, or just some of our favorites. Contact us if you want your own original work in the spotlight. Updated Daily We'll be constantly designing new backdrops for you. This means new high quality content within the app every day. We welcome your feedback and suggestions and if you like this app, please rate it Thank you and Happy surfing!",6008,Photo & Video,3.99,1536158029,Abdelkrim Mabkhout,4.45744,188,2019-06-18 16:30:00
1257781289,firstmoco,"firstmoco allows you to view your bank accounts, schedule transfers between them, and even deposit checks from anywhere!",6015,Finance,0.0,1257781288,First Community Bank of Moultrie County,0.0,0,2017-07-14 14:01:00
966582871,"כאן | דיגיטל, רדיו וטלוויזיה","כאן מתחדשים! שמחים להשיק מתיחת פנים חדשה ליישומון שמנגישה את כל התכנים של כאן בצורה נוחה וידידותית יותר - במקום אחד, 24 שעות ביממה, שבעה ימים בשבוע, 365 ימים בשנה. כשחשבנו על היישומון שלנו היה לנו חשוב לשים אתכם ואתכן במקום הראשון ולאפשר לכם לצרוך את התכנים שלנו ללא רעש מיותר, להתמצא בקלות, לשוטט ולגלות בכל פעם מחדש מגוון אדיר של תוכן - מחדשות ועד הסכתים, ממבוגרים ועד ילדים. אז מה התחדש? • ניראות חדשה לעמוד הבית – עיצוב חדש וניווט מהיר לכל התכנים הפופולאריים של כאן, שידורים חיים, חדשות, רדיו ו-VOD. • ניראות חדשה לעמוד החדשות – העדכונים השוטפים והפרשנויות מדסק כאן חדשות, מכאן רשת ב', ותוכניות האקטואליה של כאן 11. • איזור מועדפים – מהיום תוכלו לשמור את כל התכנים האהובים עליכם בעמוד אחד, ולחזור אליהם בלחיצת כפתור אחת. • עיצוב משודרג ואפשרויות פילטור וניווט חדשות - לבחירת ההעדפות האישיות שלכם בצורה נוחה יותר. תיהנו!",6009,News,0.0,1254969211,Kan - כאן,4.56935,2264,2015-03-09 13:42:00
1155026929,RTIconnect,"RTIconnect Back-Office simplifies the task of running a restaurant, helping managers consistently achieve operational and financial goals. And the powerful features of RTIconnect are now available to go. From a single restaurant to a complete view of your company’s sales activity, you can see it all with the comprehensive restaurant management app. THE NUMBERS YOU NEED, WHENEVER YOU NEED THEM With RTIconnect, you have complete access to the information your team needs—anytime and anywhere. With the daily demands of running restaurants, you and your team can’t be tied to one place. Now you can see your restaurants’ performance from anywhere, giving you the flexibility needed for your busy day. YOU CAN SEE IT ALL With RTIconnect, you get access to the information you need to ensure your operations are running smoothly. •	Sales +/- LW, LY •	Projections •	Summary of Store Activity •	Average Check •	Cash +/- •	Labor Variance •	Speed of Service You also gain more than just today’s numbers. You can easily see historical data by selecting a previous date. THE BIG PICTURE AND THE SMALL DETAILS District managers can see all their stores and drill-through to the same details their store managers see. This provides the visibility needed to coach teams to success. On the store level, managers can drill-through to details to quickly locate and correct issues. That way, small issues can be corrected before they become big problems. DO MORE THAN MAINTAIN—THRIVE With the RTIconnect restaurant management app, you get the most from your quick-service restaurant software. You gain more than access to numbers and the opportunity to keep costs down. You and your team gain the control needed to grow your company and run more profitable restaurants.",6000,Business,0.0,1455368071,Xenial,1.4,15,NaT
1607127104,Picked Cherries,"Picked Cherries is THE destination for the best podcast listening experience for all listeners and THE social podcasting app. We all love 'shared experiences' and, now with Picked Cherries, you can share podcasts like never before! On Picked Cherries, you can listen to full episodes of your favorite podcasts and then share 60 second clips of your favorite audio content (called 'picked cherries') with your friends and families via text, WhatsApp, messaging and post on your social media of choice. The best thing about the Picked Cherries 'sharing' technology is with just one click, you can share compelling, interesting, funny, newsworthy or thought-provoking picked cherries with your family, friends or work buddies even if they are NOT Picked Cherries users and then they can continue to share with their friends & family. With 2 million+ podcasts in the podcast ecosystem, the Picked Cherries app makes it easy and fun for you to discover new podcasts using our 'cherry stream' technology. As a discovery engine, listen to an endless stream of 60-second 'picked cherries' created by other Picked Cherries listeners in the categories that interest you most. This personalized audio stream will hopefully provide you with your new favorite podcasts! Welcome to Picked Cherries and welcome to the new world of Social Podcasting!  Picked Cherries is 100% free for all to use and is available in the iOS App Store.",6005,Social Networking,0.0,1607127106,PickedCherries,4.7037,27,2022-01-27 08:00:00


Identity variables, specifically the app id and developer_id, will be retained for data processing purposes but have no other value and will be largely ignored during this analysis.

### Data Quality Analysis

Data type, cardinality, validity, duplication, and size data are summarized at the variable level.

In [6]:
dataset.info.style.hide(axis="index")

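| Column       | Datatype | Valid  | Null | Validity | Cardinality | Percent unique | Size       |
| ------------ | -------- | ------ | ---- | -------- | ----------- | -------------- | ---------- |
| id           | string   | 475132 | 0    | 1.0      | 475132      | 1.0            | 31748470   |
| name         | string   | 475132 | 0    | 1.0      | 474250      | 1.0            | 40449624   |
| description  | string   | 475132 | 0    | 1.0      | 463635      | 0.98           | 1186227495 |
| category_id  | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477790     |
| category     | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477944     |
| price        | float64  | 475132 | 0    | 1.0      | 125         | 0.0            | 3801056    |
| developer_id | string   | 475132 | 0    | 1.0      | 265367      | 0.56           | 31666555   |
| developer    | string   | 475132 | 0    | 1.0      | 264402      | 0.56           | 37093493   |
| rating       | float64  | 475132 | 0    | 1.0      | 44083       | 0.09           | 3801056    |
| ratings      | int64    | 475132 | 0    | 1.0      | 14531       | 0.03           | 3801056    |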


**Observations**

- With the exception of released (date), we have no missing values.
- Ids are unique; name, description, and developer information are all high-cardinality.
- Category id and label are low-cardinality, with 26 unique values each.
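
The per-variable quality measures summarized above can be reproduced on any pandas DataFrame. The following is a minimal sketch, not the implementation behind `dataset.info`: the function name `quality_summary` and the toy data are hypothetical, validity is assumed to mean the fraction of non-null values, and "Percent unique" is expressed as a fraction to match the output above.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column null counts, validity, and cardinality."""
    n = len(df)
    valid = df.notna().sum()      # non-null values per column
    cardinality = df.nunique()    # distinct non-null values per column
    return (
        pd.DataFrame(
            {
                "Datatype": df.dtypes.astype(str),
                "Valid": valid,
                "Null": n - valid,
                "Validity": (valid / n).round(2),
                "Cardinality": cardinality,
                "Percent unique": (cardinality / n).round(2),
            }
        )
        .rename_axis("Column")
        .reset_index()
    )


# Toy frame with one missing release date (values are illustrative only)
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "category": ["Games", "Games", "News", "News"],
        "released": pd.to_datetime(["2019-06-18", None, "2015-03-09", "2022-01-27"]),
    }
)
summary = quality_summary(toy).set_index("Column")
```

In this toy example, `released` has one null and a validity of 0.75, while `id` is fully unique.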

#### Numeric Variable Data Quality

Each feature has been cast to an appropriate data type and, apart from released, no values are missing. Valid values for the numeric and datetime variables are:

| Variable | Data Type  | Valid Values                                     |
| -------- | ---------- | ------------------------------------------------ |
| price    | Continuous | Non negative values                              |
| rating   | Interval   | Real valued in [0,5]                             |
| ratings  | Discrete   | Discrete and non-negative                        |
| released | Continuous | Datetimes between June 10, 2008 and present day. |

Let's check the ranges for these variables.

In [7]:
stats = dataset.describe(include=[np.number, np.datetime64])
stats.numeric[['min','max']]

| Variable | min | max        |
| -------- | --- | ---------- |
| price    | 0.0 | 999.99     |
| rating   | 0.0 | 5.0        |
| ratings  | 0.0 | 30835421.0 |


All numeric and datetime values are within range.
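
Beyond inspecting minima and maxima, the validity rules in the table above can be checked row by row. The sketch below assumes a plain pandas DataFrame with the same column names; the function `range_violations` and the constant `EARLIEST_RELEASE` are hypothetical names, and the toy rows reuse three records from the sample shown earlier.

```python
import pandas as pd

# Earliest valid release date, taken from the validity table above
EARLIEST_RELEASE = pd.Timestamp("2008-06-10")


def range_violations(df: pd.DataFrame, now: pd.Timestamp) -> pd.Series:
    """Count values that break each validity rule."""
    released = df["released"].dropna()  # missing dates are a separate issue
    return pd.Series(
        {
            "price": int((df["price"] < 0).sum()),
            # between() is inclusive; a NaN rating would count as a violation
            "rating": int((~df["rating"].between(0, 5)).sum()),
            "ratings": int((df["ratings"] < 0).sum()),
            "released": int(
                ((released < EARLIEST_RELEASE) | (released > now)).sum()
            ),
        }
    )


# Three rows from the sample above, including one with a missing date
toy = pd.DataFrame(
    {
        "price": [3.99, 0.0, 0.0],
        "rating": [4.45744, 0.0, 1.4],
        "ratings": [188, 0, 15],
        "released": pd.to_datetime(
            ["2019-06-18 16:30:00", "2017-07-14 14:01:00", None]
        ),
    }
)
violations = range_violations(toy, now=pd.Timestamp.now())
```

A result of zero for every variable indicates that all non-missing values are within range.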

#### Categorical Variable Data Quality

The id, name, description, developer_id, and developer variables are nominal, high-cardinality strings. Category and category_id, in contrast, must each contain one of the 26 values selected for this analysis.

In [8]:
columns = ['category_id', 'category']
dataset.unique(columns=columns).style.hide(axis="index")

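| Category_id | Category          |
| ----------- | ----------------- |
| 6013        | Health & Fitness  |
| 6017        | Education         |
| 6000        | Business          |
| 6012        | Lifestyle         |
| 6004        | Sports            |
| 6014        | Games             |
| 6007        | Productivity      |
| 6002        | Utilities         |
| 6027        | Graphics & Design |
| 6010        | Navigation        |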


Category and category_id values are as expected.
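
A visual check like this can be backed by a programmatic one. The sketch below assumes a plain pandas DataFrame and an `expected` mapping of category ids to names supplied by the analyst; the function `check_category_mapping` is hypothetical, the ids are treated as integers for illustration, and only three of the pairs listed above are used.

```python
import pandas as pd


def check_category_mapping(df: pd.DataFrame, expected: dict) -> dict:
    """Compare observed (category_id, category) pairs with an expected mapping."""
    pairs = df[["category_id", "category"]].drop_duplicates()
    observed = dict(zip(pairs["category_id"], pairs["category"]))
    return {
        # Each id should map to exactly one name, and vice versa
        "one_to_one": bool(
            pairs["category_id"].is_unique and pairs["category"].is_unique
        ),
        # Observed pairs that are absent from, or differ from, the expected mapping
        "unexpected": {k: v for k, v in observed.items() if expected.get(k) != v},
        # Expected ids that never occur in the data
        "missing": sorted(set(expected) - set(observed)),
    }


# Subset of the id/name pairs listed above
expected = {6013: "Health & Fitness", 6014: "Games", 6000: "Business"}
toy = pd.DataFrame(
    {
        "category_id": [6013, 6014, 6014, 6000],
        "category": ["Health & Fitness", "Games", "Games", "Business"],
    }
)
result = check_category_mapping(toy, expected)
```

An empty `unexpected` dictionary and an empty `missing` list, together with `one_to_one` being true, correspond to the conclusion that the values are as expected.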