# Exploratory Data Analysis

## Stage I: App Performance Overview

Stage one of our exploratory data analysis aims to expose patterns and yield insight into the nature and intensity of the customer experience within the iOS app user community.

### The Dataset

The dataset contains the following descriptive, rating, price, and developer data for 475,132 apps from the App Store.

| #  | Variable     | Data Type  | Description                              |
| -- | ------------ | ---------- | ---------------------------------------- |
| 1  | id           | Nominal    | App Id from the App Store                |
| 2  | name         | Nominal    | App Name                                 |
| 3  | description  | Nominal    | App Description                          |
| 4  | category_id  | Nominal    | Numeric category identifier              |
| 5  | category     | Nominal    | Category name                            |
| 6  | price        | Continuous | App Price                                |
| 7  | developer_id | Nominal    | Identifier for the developer             |
| 8  | developer    | Nominal    | Name of the developer                    |
| 9  | rating       | Interval   | Average user rating since first release  |
| 10 | ratings      | Discrete   | Number of ratings since first release    |
| 11 | released     | Continuous | Datetime of first release                |

### EDA Approach

Our exploration will comprise the following five analyses, followed by a summary of conclusions.

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions of rating count, average rating, categories, and price.
4. Bivariate Analysis: Evaluate relationships between pairs of variables, such as ratings, rating count, and reviews, including correlation analysis.
5. Multivariate Analysis: Cluster, factor, and correspondence analysis of three or more variables simultaneously.
6. Conclusions, insights and questions for stage two.

### Preliminaries

**Import Dependencies**

In [1]:
import numpy as np
import pandas as pd
from IPython.display import HTML, display_html
import seaborn as sns
import studioai as eda

from appstore.data.dataset.appdata import AppDataDataset
from appstore.container import AppstoreContainer

**Wire and Initialize Dependencies**

In [2]:
container = AppstoreContainer()
container.init_resources()
container.wire(packages=["appstore"])

**Obtain the Dataset**

In [3]:
repo = container.data.appdata_repo()
dataset = repo.get_dataset()

### Structural Analysis

The structure and characteristics of the AppData dataset are as follows:

In [4]:
# Summary tables exposed by the dataset object.
df1 = dataset.overview
df2 = dataset.dtypes

# Render both tables inline so they appear side by side.
df1_style = df1.style.set_table_attributes(
    "style='display:inline; margin-right:220px;'"
).set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes(
    "style='display:inline; margin-right:120px;'"
).set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

Unnamed: 0,Characteristic,Total
0,Number of Observations,475132
1,Number of Variables,11
2,Number of Cells,5226452
3,Size (Bytes),962579922

Unnamed: 0,Data Type,Count
0,string,5
1,float64,2
2,category,1
3,category,1
4,int64,1
5,datetime64[ns],1
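
The structure table can be approximated with plain pandas when the project's `dataset.overview` property is not available. The sketch below is only an illustration: `summarize_structure` and the toy frame are hypothetical names, not part of the `appstore` package, and the byte count depends on how pandas stores each column.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return observation, variable, and cell counts plus memory footprint."""
    n_rows, n_cols = df.shape
    return pd.DataFrame(
        {
            "Characteristic": [
                "Number of Observations",
                "Number of Variables",
                "Number of Cells",
                "Size (Bytes)",
            ],
            "Total": [
                n_rows,
                n_cols,
                n_rows * n_cols,  # cells = observations x variables
                int(df.memory_usage(deep=True).sum()),  # deep=True counts string payloads
            ],
        }
    )


# Toy frame standing in for the real app data (hypothetical values).
toy = pd.DataFrame(
    {"id": ["1", "2", "3"], "price": [0.0, 0.99, 4.99], "ratings": [2, 17, 0]}
)
overview = summarize_structure(toy)
print(overview.to_string(index=False))
```

The cell count is simply observations times variables, which is consistent with the table above: 475,132 × 11 = 5,226,452.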


As indicated above, we have approximately 475,000 apps in our dataset, described by 11 features. Let's take a quick look.

In [5]:
dataset.sample().style.hide(axis="index")

Id,Name,Description,Category_id,Category,Price,Developer_id,Developer,Rating,Ratings,Released
1501310833,Decibel Sound Meter | dB Level,"Sound level meter(or SPL) app is shows a decibel values by measure the environmental noise, displays measured dB values in various forms. you'll experience tidy graphic design with high frame by this smart sound meter app. Decibel meters also are referred to as sound level meters (decibel meters), noise detection, sound meters, noise detectors, and noise detectors. Decibel meter may be a noise detection software that uses mobile microphone to live the ambient noise decibel (dB). The current decibel size and curve are displayed during the method . With the decibel meter, you'll measure the present ring The background level of the environment is straightforward and straightforward to use. Functions: - The dashboard displays the present noise decibel value - Display min/aver/max/current decibel value - Display decibel size change chart - Auto decibel value calibrate - Display test time 11 Levels of Noise: 120dB: thunder 110dB: rock 'n' roll 100dB: train 90dB: factory 80dB: busy street 70dB: busy traffic 60dB: conversation 50dB: quiet library 40dB: quiet park 30dB: whisper 20dB: leaves fall Decibels: dB, dB(A), dBA, dB(C), dBV, dBm and dBi? What are they all? The decibel (dB) may be a logarithmic unit wont to measure sound level. it's also widely utilized in electronics, signals and communication. The dB may be a logarithmic way of describing a ratio. The ratio could also be power, instantaneous sound pressure , voltage or intensity or several other things. afterward we relate dB to the phon and to the sone, which measures loudness. What is a Sound level meter? A sound level meter is employed for acoustic (sound that travels through air) measurements. it's commonly a hand-held instrument with a microphone. the simplest sort of microphone for sound level meters is that the condenser microphone[1], which mixes precision with stability and reliability. 
The diaphragm of the microphone responds to changes in atmospheric pressure caused by sound waves. that's why the instrument is usually mentioned as a instantaneous sound pressure Level (SPL) Meter. This movement of the diaphragm, i.e. the instantaneous sound pressure deviation (pascal Pa), is converted into an electrical signal (volts V). While describing sound in terms of instantaneous sound pressure (Pascal) is feasible , a logarithmic conversion is typically applied and therefore the instantaneous sound pressure level is stated instead, with 0 dB SPL adequate to 20 micropascals. A microphone is distinguishable by the voltage value produced when a known, constant instantaneous sound pressure is applied. this is often referred to as the microphone sensitivity. The instrument must know the sensitivity of the actual microphone getting used . Using this information, the instrument is in a position to accurately convert the electrical signal back to a instantaneous sound pressure , and display the resulting instantaneous sound pressure level (decibels dB SPL). Sound level meters are commonly utilized in sound pollution studies for the quantification of various sorts of noise, especially for industrial, environmental, mining[2] and aircraft noise. the present international standard that specifies sound level meter functionality and performances is that the IEC 61672-1:2013. However, the reading from a sound level meter doesn't correlate well to human-perceived loudness, which is best measured by a loudness meter. Specific loudness may be a compressive nonlinearity and varies at certain levels and at certain frequencies.These metrics also can be calculated during a number of various ways Note: Microphones in most IOS devices are aligned to human voice. the utmost values are limited by the device. Very loud sounds(over ~90 dB) might not be recognized in most device. So please use it as just an auxiliary tools. 
If you would like more accurate dB values, we recommend a actual sound level meter for that.",6007,Productivity,0.0,1489624695,Mattia La Spina,5.0,2,2020-03-08 08:00:00
1597540641,Noble Nutrition,"Kick start your fitness journey with Noble Nutrition. Together with Noble Nutrition, you'll have the best combination of nutrition & exercises available on a single app. With Noble Nutrition, you can begin your fitness journey in no time. Get a fully personalized workout and meal plan tailored to your fitness goals. Progress tracking gets easier when you log your daily workout, record meals, update your check-ins and connect your fitness band & health kit, and get real-time updates via advanced analytical tools. Everything that contributes to your fitness goals gets captured in one place. To top it all, use the inbuilt 1-1 chat feature to have all your queries addressed on the go. You deserve to be the best. That’s why Noble Nutrition have packed so many features in a single app to help you achieve your fitness goals. Start your journey today! Features that will help you achieve your fitness goals include: * Personalized Workout Plan: Get a fully personalized fitness plan tailored to your goals, whether it is to gain weight, lose weight, gain muscles, or simply wish to work on your general fitness.  * Nutrition, Hydration & Habits: Access meal plans assigned by your coach and log your food intake to keep a close track of your calorie intake and macros. You can also track your hydration, steps and calories burnt on the app.  * Instant messaging & video calls - Message your coach in real-time and schedule video sessions directly from the app. Stay connected with your coach to improve compliance and get better results. * Check-ins: Gain complete insight into your overall performance with easy check-ins and real-time updates.  * Progress: Stay on top of your progress with powerful analytics.  * Wearable integration: Get the bigger picture of your progress by connecting your fitness band & health kit thereby enabling real-time updates. 
Note regarding Apple Health:  The app integrates with Apple Health to show your daily activity - distance, steps, active energy, and flights to help you better achieve your goals.  App also uses Apple Health to track energy burned and heart rate during a workout session, if an Apple Watch is used.  Workout metrics are shared with the coach to better design your workout schedule.  DISCLAIMER: Users should seek a doctor’s advice before using this app and making any medical decisions.",6013,Health & Fitness,0.0,1597540643,"Noble Bodybuilding, LLC",4.76471,17,NaT
492190063,日本 旅遊會話一指搞定(Point and Learn Japan Travel Conversation),"到日本旅行時 如果能跟當地人們來幾句會話的話 一定會成為更開心美好的回憶。 「日本 旅遊會話一指搞定」是不論任何人都能輕鬆上手的旅遊會話應用程式。 特長 ■ 在旅途中能簡單使用! 啟動非常快 在旅途中可隨時迅速對應。 ■ 能輕巧迅速動作! 以簡單易懂為最優先設計考量 實現了輕巧順暢的使用感。 ■ 在旅途中好用的會話表現 能簡單地表達自己的意思 收錄了各種旅途中實用的會話表現。 ■ 搖一下就可播放語音! 搖一下就能播放語音。 ■ 圖解單字表(附語音) 將旅途中常用單字以圖解方式整理。 輕輕觸碰單字就可播放語音。 Description: Tongue-tied tourists? Get ""Point and Learn Japan Travel Conversation"" for your trip. It's an travel phrase application for Taiwanese Mandarin. Features: Easy-to-use on your trip: You can find expressions and words whenever you need. Just touch the icon. It's ready for use in a minute. Perform comfortably: You can flick and shake. See light and smooth move. It's user-friendly interface. Useful travel-related phrase: You can find exact expressions. Zooms each phrase and point the screen to show it. It supports landscape mode. Shake to play voice files: You can ""speak"" in Japanese by shaking or tapping the screen. *You can shake to play only phrase files. Picture wordbook (with voice): You can listen to the pronunciation of the word. Tap the word to play it. Enjoy scrolling and playing at the same time.",6003,Travel,0.0,320222259,Brain,1.0,1,2012-01-11 19:33:00
6443700387,JPG Media,"A simple & fast way to book an appointment, view and access client portals, learn about services and stay up to day with current promotions and product updates at the touch of a button. New Push Notifications will make sure you never miss important information about promotions, new products, etc. No more emails and spam.",6008,Photo & Video,0.0,1648473958,"Jane Photography Group, LLC (jpg, Llc)",0.0,0,2022-10-05 07:00:00
999197469,Pinch Pad: a simple sketchpad,"Draw simple sketches and post them instantly to Twitter, Tumblr, and more. Features: • Simple but powerful sketching tools • Works with Apple Pencil • Animation support: draw multiple frames, then post them all as one animated GIF • Seamless integration with Twitter and Tumblr, so you can post your sketches with a single tap • Offline support: try to post without a connection, and Pinch Pad will save your sketches until your internet comes back Pinch Pad was built specifically to help you post quick drawings at any time. It's a great app for doing hourly comics, or for sharing doodles with your followers or friends. Give it a try! :)",6016,Entertainment,0.0,690015717,Ryan Laughlin,4.0,4,2015-06-11 12:09:00


Identity variables, specifically the app id and developer_id, will be retained for data processing purposes but have no analytical value and will be largely ignored during this analysis.

### Data Quality Analysis

Data type, cardinality, validity, duplication, and size data are summarized at the variable level.

In [6]:
dataset.info.style.hide(axis="index")

Column,Datatype,Valid,Null,Validity,Cardinality,Percent unique,Size
id,string,475132,0,1.0,475132,1.0,31748470
name,string,475132,0,1.0,474250,1.0,40449624
description,string,475132,0,1.0,463635,0.98,1186227495
category_id,category,475132,0,1.0,26,0.0,477790
category,category,475132,0,1.0,26,0.0,477944
price,float64,475132,0,1.0,125,0.0,3801056
developer_id,string,475132,0,1.0,265367,0.56,31666555
developer,string,475132,0,1.0,264402,0.56,37093493
rating,float64,475132,0,1.0,44083,0.09,3801056
ratings,int64,475132,0,1.0,14531,0.03,3801056


**Observations**

- With the exception of released (date), we have no missing values.
- Ids are unique, and name, description, and developer information are all high-cardinality.
- Category id and label are low-cardinality, with 26 unique values each.
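
A per-variable quality summary of this kind can be sketched in plain pandas. This is a minimal illustration under assumptions: `summarize_quality` is a hypothetical helper, not the implementation behind `dataset.info`, and the toy frame is invented for the example.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize validity and cardinality for each column of df."""
    n = len(df)
    summary = pd.DataFrame(
        {
            "Datatype": df.dtypes.astype(str),
            "Valid": df.notna().sum(),      # non-missing values
            "Null": df.isna().sum(),        # missing values (NaN / NaT / None)
            "Cardinality": df.nunique(),    # distinct non-missing values
        }
    )
    summary["Validity"] = (summary["Valid"] / n).round(2)
    summary["Percent unique"] = (summary["Cardinality"] / n).round(2)
    return summary.rename_axis("Column").reset_index()


# Toy frame with one missing release date (hypothetical values).
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "category": ["Games", "Games", "Travel", "Games"],
        "released": pd.to_datetime(["2020-03-08", None, "2012-01-11", "2015-06-11"]),
    }
)
quality = summarize_quality(toy)
print(quality.to_string(index=False))
```

In this sketch a column with cardinality equal to the row count (like `id`) is a candidate identifier, while a low ratio of unique values (like `category`) marks a grouping variable.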

#### Numeric Variable Data Quality

Each feature has been cast to an appropriate data type and, apart from released, no data are missing. Valid values for the numeric and datetime variables are:

| Variable | Data Type  | Valid Values                                     |
| -------- | ---------- | ------------------------------------------------ |
| price    | Continuous | Non negative values                              |
| rating   | Interval   | Real valued in [0,5]                             |
| ratings  | Discrete   | Discrete and non-negative                        |
| released | Continuous | Datetimes between June 10, 2008 and present day. |

Let's check the ranges for these variables.

In [7]:
stats = dataset.describe(include=[np.number, np.datetime64])
stats.numeric[['min','max']]

Unnamed: 0,min,max
price,0.0,999.99
rating,0.0,5.0
ratings,0.0,30835421.0


All numeric and datetime values are within range.

#### Categorical Variable Data Quality

The id, name, description, developer_id, and developer variables are nominal, high-cardinality strings. Category and category_id, in contrast, must contain one of the 26 category id / category values selected for this analysis.

In [8]:
columns = ['category_id', 'category']
dataset.unique(columns=columns).style.hide(axis="index")

Category_id,Category
6013,Health & Fitness
6017,Education
6000,Business
6012,Lifestyle
6004,Sports
6014,Games
6007,Productivity
6002,Utilities
6027,Graphics & Design
6010,Navigation


Category and category_id values are as expected.
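
Beyond eyeballing the listing, one could verify that each category_id maps to exactly one category name and vice versa. The following is a hedged sketch in plain pandas; `is_one_to_one` is a hypothetical helper and the toy frames are invented for illustration.

```python
import pandas as pd


def is_one_to_one(df: pd.DataFrame, left: str, right: str) -> bool:
    """Return True if every left value pairs with exactly one right value and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    # After dropping duplicate pairs, a repeated value on either side
    # means that value is mapped to more than one partner.
    return bool(pairs[left].is_unique and pairs[right].is_unique)


# Consistent mapping (hypothetical rows).
good = pd.DataFrame(
    {
        "category_id": [6013, 6014, 6013],
        "category": ["Health & Fitness", "Games", "Health & Fitness"],
    }
)
# Inconsistent mapping: 6014 appears with two different labels.
bad = pd.DataFrame(
    {"category_id": [6014, 6014], "category": ["Games", "Game"]}
)

consistent = is_one_to_one(good, "category_id", "category")
inconsistent = is_one_to_one(bad, "category_id", "category")
print(consistent, inconsistent)
```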