# Exploratory Data Analysis

## Stage I: App Performance Overview

Stage one of our exploratory data analysis aims to expose patterns and yield insight into the nature and intensity of the customer experience within the iOS app user community.

### The Dataset

The dataset contains the following descriptive, rating, price, and developer data for 475,132 apps from the App Store.

| #  | Variable     | Data Type  | Description                              |
| -- | ------------ | ---------- | ---------------------------------------- |
| 1  | id           | Nominal    | App Id from the App Store                |
| 2  | name         | Nominal    | App Name                                 |
| 3  | description  | Nominal    | App Description                          |
| 4  | category_id  | Nominal    | Numeric category identifier              |
| 5  | category     | Nominal    | Category name                            |
| 6  | price        | Continuous | App Price                                |
| 7  | developer_id | Nominal    | Identifier for the developer             |
| 8  | developer    | Nominal    | Name of the developer                    |
| 9  | rating       | Interval   | Average user rating since first release  |
| 10 | ratings      | Discrete   | Number of ratings since first release    |
| 11 | released     | Continuous | Datetime of first release                |

### EDA Approach

Our exploration comprises the following five analyses, followed by a summary of conclusions.

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions of rating count, average rating, categories, and price.
4. Bivariate Analysis: Evaluate relationships between pairs of variables, such as ratings, rating count, and reviews, including correlation analysis.
5. Multivariate Analysis: Cluster, factor, and correspondence analysis of three or more variables simultaneously.
6. Conclusions, insights and questions for stage two.

### Preliminaries

**Import Dependencies**

In [1]:
import numpy as np
import pandas as pd
from IPython.display import HTML, display_html
import seaborn as sns
import studioai as eda

from appstore.data.dataset.appdata import AppDataDataset
from appstore.container import AppstoreContainer

**Wire and Initialize Dependencies**

In [2]:
container = AppstoreContainer()
container.init_resources()
container.wire(packages=["appstore"])

**Obtain the Dataset**

In [3]:
repo = container.data.appdata_repo()
dataset = repo.get_dataset()

### Structural Analysis

The structure and characteristics of the AppData dataset are as follows:

In [4]:
df1 = dataset.overview
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

**Dataset Structure**

| Characteristic         | Total     |
| ---------------------- | --------- |
| Number of Observations | 475132    |
| Number of Variables    | 15        |
| Number of Cells        | 7126980   |
| Size (Bytes)           | 974458222 |

**Dataset Data Types**

| Data Type      | Count |
| -------------- | ----- |
| bool           | 1     |
| category       | 2     |
| datetime64[ns] | 2     |
| float64        | 4     |
| int64          | 1     |
| string         | 5     |

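As indicated above, we have approximately 475,000 apps in our dataset, described by 15 variables: the 11 features listed in the dataset table plus four additional columns (extracted, free, months_avail, and ratings_per_month) visible in the sample below. Let's take a quick look.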

In [5]:
dataset.sample().style.hide(axis="index")

id,name,description,category_id,category,price,developer_id,developer,rating,ratings,released,extracted,free,months_avail,ratings_per_month
673020961,岡三アクティブFX for iPhone,「岡三アクティブFX for iPhone」は岡三証券が提供するiPhone専用のFXトレーディングアプリです。 どなたでも無料でダウンロードでき、ログインしなくてもリアルタイム為替レートやテクニカルチャートをご利用いただけます。 ※お取引には、岡三アクティブFX取引口座の開設、取引口座へのログインが必要です。 ■主な機能 【リアルタイム為替レート】 岡三アクティブFX取扱い２０通貨ペアのリアルタイムレートを配信。 レート一覧は、お好みに合わせて「リスト」「パネルＳ」「パネルＬ」の表示切替が可能です。 また、レート一覧をワンタップするだけで「チャート」や「クイック注文」「全決済注文」などに素早く移動できます。 【多彩なチャート機能】 TICKから月足まで１３種類の豊富な足種と多彩なテクニカル指標でチャート分析を強力にサポート。テクニカルのパラメータを変更してオリジナルのカスタマイズも可能です。 また、チャートを見ながらワンタップで発注する「クイック注文機能」を搭載。狙ったタイミングで素早く注文を出すことが可能です。 [テクニカル指標] 単純移動平均、指数平滑移動平均、ボリンジャーバンド、一目均衡表、パラボリック、ストキャスティクス、RSI、MACD、DMI、平均足、RCIを搭載 【豊富な注文機能】 豊富な注文方法をご用意。 成行や指値等のベーシックな注文から、トレール注文や時間指定注文などPCと同等の注文が可能です。 【クイック注文】 「クイック注文」は、新規注文、決済注文、同一通貨ペアの全決済注文などを全てワンタップで発注することができます。 保有ポジション、平均約定レートや評価損益などを１画面に集約しているので、スピーディーなお取引が可能です。 【充実の設定機能】 注文条件の初期設定や通貨ペア別の注文設定、起動時画面の設定など、様々な項目の設定が可能です。 お客様が一番使い易いオリジナルの設定に保存しておくことができます。 【投資情報】 アジア・オセアニアからニューヨーク市場まで、全世界の取引時間に対応したマーケット情報を、1日約200本のフルボリュームで配信。 岡三オンライン証券の公式ブログや動画コンテンツ等も閲覧でき、いつでもどこでも最新のニュースへアクセスできます。 ※ご利用には取引口座へのログインが必要です。 ■提供会社 岡三証券株式会社　岡三オンライン証券カンパニー https://www.okasan-online.co.jp/ 金融商品取引業者 関東財務局長(金商)第53号 加入協会: 日本証券業協会、 一般社団法人 日本投資顧問業協会、 一般社団法人 金融先物取引業協会、 一般社団法人 第二種金融商品取引業協会、 一般社団法人 日本暗号資産取引業協会,6015,Finance,0.0,395918157,岡三オンライン証券株式会社,0.0,0,2013-08-08 07:00:00,2023-07-31 05:00:00,True,118.0,0.0
1447839267,Workout Schedule,Workout Schedule app that gives you a 10 weeks training schedule for your whole body. You are free to determine your workout as you like or to follow the preset schedule. Important features: -10 weeks schedule with different exercises. -Click the picture to see how to perform the exercise. -Change sets and reps on each exercise. -Double-click each exercise to mark it as finished.,6013,Health & Fitness,0.0,1230753430,Nerje Salah,4.25,4,NaT,2023-07-31 05:00:00,True,,
1578880067,Pay'am,"Pay'am is an app that facilitates the payment of goods and services 7 times faster in Africa, without an internet connection. The goal of the app is to promote a digital economy for economic growth in Africa. NB: Pay'am is not a mobile wallet, it is a facilitator of financial transactions across network providers and banks. It uses your current mobile wallet or bank to facilitate the process of paying for goods and services and sending money * Send and request money: With Payam, you can send money to your peers across any network provider. You can do that either by selecting the person's contact in your contact list, scanning his or her QR Code, or inputting the receiver's number manually. The transaction happens within seconds. * Pay bills: Pay utility bills like Water, Electricity, and TV cable bills from the comfort of your home or office with the payamapp. * Deposit or withdraw money: You can easily make a deposit or withdrawal from your local agent faster and more securely. * Buy Airtime and internet data * Check balance: Check the balance of your mobile wallet, internet data, and airtime This app uses Accessibility Services to simplify and visualize complex, USSD-based processes for illiterate and innumerate users. ************************************************ Supported countries: Cameroon Rwanda ************************************************ Supported operators and service providers: Cameroon: -- Operators 	MTN Cameroon 	Orange Cameroon -- Service providers 	MTN Mobile Money Cameroon 	Orange Money Cameroon Rwanda: -- Operators 	MTN Rwanda -- Service providers 	MTN Mobile Money Rwanda ********************** Pay for goods and services with the payam app up to 7 times faster! 
Scan QR code to send money, withdraw, or deposit Use your preferred mobile wallet or bank Send money or pay merchants without internet Secured biometrically and with PIN code Multiple services available per country Send money or request payment from others Receive payments through QR code",6015,Finance,0.0,1157308873,GLOBEXCAM GROUP LIMITED,5.0,3,2022-01-04 08:00:00,2023-07-31 05:00:00,True,18.0,0.162339
1016278440,DoF Table,"DoF Table shows the ""depth of field"" (or focus range) in a table format calculated by the camera and the lens model you choose. With this app, you don't need to input the parameters (e.g. f-value) by hand because this app shows the result in a table format — you just select the camera and the lens. In addition to that, this app never bothers you by showing DoF values which cannot be achieved by the lens you chose. ■Features - Shows Depth of fields in a table format - The table hides settings which cannot be achieved with the equipments you chose. - Shipped with actual product database. - You can add cameras and lenses by yourself. Other useful information are also available: - Hyperfocal distance - Field of view [Pro Version only] - Angle of view [Pro Version only] - 35mm equivalent focal length of chosen combination of the camera/lens [Pro Version only] - Crop factor (focal length multiplier) of cameras - Native aspect ratio of cameras ■Pro Version If you liked this app, please consider purchasing the ""Pro version"" in the Settings menu. Highlights of Pro version are: - Ad-free - Additional features to calculate:  - Field of View  - Angle of View  - 35mm Equivalent Focal Length ■Notes - You can backup/restore user-defined camera/lens database using ""iTunes file sharing.""",6008,Photo & Video,0.0,1016278439,Suguru Yamamoto,4.38462,13,2015-08-26 01:36:00,2023-07-31 05:00:00,True,93.0,0.139151
1120910144,انشاء جي آي إف مع ملصقات,حان الوقت لانشاء وصنع صور متحركه وافلام فيديو قصيره ووضع اجمل الملصقات عليها مع برنامج انشاء جي اي اف مع ملصقات من خلال البرنامج تستطيع عمل التالي: التقاط صوره باستخدام الكاميرا او من خلال البوم الصور تركيب خلفيات ملونه اضافة نص على الصوره اضافة العديد من الملصقات المثيره والمميزه انتاج صور متحرمه او فيلم قصير من الصور التي قمت بتركيبها واخيرا يمكنك تصدير الفيديو اما على شكل صور متحرمه -جي اي اف- او على شكل فيديو وفلم قصير ومشاركته مع اصدقائك على شبكات التواصل الاجتماعي,6008,Photo & Video,0.0,1114392483,MOHAMMED ABURAS,0.0,0,2016-06-09 03:28:00,2023-07-31 05:00:00,True,84.0,0.0


Identity variables, specifically (app) id and developer_id, will be retained for data processing purposes, but they carry no analytical value and will be largely ignored during this analysis.
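For readers without access to the project's `AppDataDataset` wrapper, the structural summary shown earlier in this section can be approximated with plain pandas. The following is a minimal sketch, not the project's implementation: the function name `structural_overview` and the toy DataFrame are hypothetical, and the byte count depends on the pandas version and string storage.

```python
import pandas as pd


def structural_overview(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (structure, dtype_counts) summaries for a DataFrame."""
    n_rows, n_cols = df.shape
    structure = pd.DataFrame(
        {
            "Characteristic": [
                "Number of Observations",
                "Number of Variables",
                "Number of Cells",
                "Size (Bytes)",
            ],
            "Total": [
                n_rows,
                n_cols,
                n_rows * n_cols,
                # deep=True measures the actual string payloads, not just pointers.
                int(df.memory_usage(deep=True).sum()),
            ],
        }
    )
    # Count how many columns share each data type.
    dtype_counts = (
        df.dtypes.astype(str)
        .value_counts()
        .rename_axis("Data Type")
        .reset_index(name="Count")
        .sort_values("Data Type", ignore_index=True)
    )
    return structure, dtype_counts


# Hypothetical toy data standing in for the app dataset.
toy = pd.DataFrame(
    {
        "id": pd.array(["1", "2", "3"], dtype="string"),
        "price": [0.0, 1.99, 0.0],
        "ratings": [10, 0, 3],
        "free": [True, False, True],
    }
)
structure, dtype_counts = structural_overview(toy)
print(structure)
print(dtype_counts)
```

The number of cells is simply observations times variables, which matches the summary above: 475,132 × 15 = 7,126,980.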

### Data Quality Analysis

Data type, cardinality, validity, duplication, and size data are summarized at the variable level.

In [6]:
dataset.info.style.hide(axis="index")

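| Column       | DataType | Valid  | Null | Validity | Cardinality | Percent Unique | Size       |
| ------------ | -------- | ------ | ---- | -------- | ----------- | -------------- | ---------- |
| id           | string   | 475132 | 0    | 1.0      | 475132      | 1.0            | 31748470   |
| name         | string   | 475132 | 0    | 1.0      | 474250      | 1.0            | 40449624   |
| description  | string   | 475132 | 0    | 1.0      | 463635      | 0.98           | 1186227495 |
| category_id  | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477790     |
| category     | category | 475132 | 0    | 1.0      | 26          | 0.0            | 477944     |
| price        | float64  | 475132 | 0    | 1.0      | 125         | 0.0            | 3801056    |
| developer_id | string   | 475132 | 0    | 1.0      | 265367      | 0.56           | 31666555   |
| developer    | string   | 475132 | 0    | 1.0      | 264402      | 0.56           | 37093493   |
| rating       | float64  | 475132 | 0    | 1.0      | 44083       | 0.09           | 3801056    |
| ratings      | int64    | 475132 | 0    | 1.0      | 14531       | 0.03           | 3801056    |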


**Observations**

- With the exception of released (date), we have no missing values.
- Ids are unique, and name, description, and developer information are all high-cardinality.
- Category id and label are low-cardinality, with 26 unique values each.
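The column-level summary above can be reproduced in spirit with a few pandas calls. This is a sketch under stated assumptions rather than the code behind `dataset.info`: the helper name `quality_summary` and the toy data are hypothetical, and validity is taken here to mean the share of non-null values.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column null counts, cardinality, and size, one row per column."""
    n = len(df)
    valid = df.notna().sum()               # non-null values per column
    cardinality = df.nunique(dropna=True)  # distinct non-null values per column
    return pd.DataFrame(
        {
            "Column": list(df.columns),
            "DataType": df.dtypes.astype(str).to_numpy(),
            "Valid": valid.to_numpy(),
            "Null": (n - valid).to_numpy(),
            "Validity": (valid / n).round(2).to_numpy(),
            "Cardinality": cardinality.to_numpy(),
            "Percent Unique": (cardinality / n).round(2).to_numpy(),
            "Size": df.memory_usage(deep=True, index=False).to_numpy(),
        }
    )


# Hypothetical toy data: one missing release date, one repeated category.
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "category": pd.Categorical(["Games", "Games", "Finance", "Games"]),
        "released": pd.to_datetime(["2015-08-26", None, "2022-01-04", "2016-06-09"]),
    }
)
summary = quality_summary(toy)
print(summary.to_string(index=False))
```

A column with `Percent Unique` near 1.0 behaves like an identifier, while a value near 0.0 marks a low-cardinality categorical.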

#### Numeric Variable Data Quality

Each feature has been cast to an appropriate data type and, apart from released, the dataset contains no missing values. Valid values for the numeric variables are:

| Variable | Data Type  | Valid Values                                     |
| -------- | ---------- | ------------------------------------------------ |
| price    | Continuous | Non negative values                              |
| rating   | Interval   | Real valued in [0,5]                             |
| ratings  | Discrete   | Discrete and non-negative                        |
| released | Continuous | Datetimes between June 10, 2008 and present day. |

Let's check the ranges for these variables.

In [7]:
stats = dataset.describe(include=[np.number, np.datetime64])
# Assumption: stats.numeric follows the pandas describe() layout, with statistics
# such as 'min' and 'max' as row labels, so they are selected with .loc, not [[...]].
stats.numeric.loc[['min', 'max']]

All numeric and datetime values are within range.
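The validity rules in the table above can also be checked directly, independent of the summary statistics. The sketch below counts rule violations with plain pandas; the function name `range_violations` and the toy rows are hypothetical, and the upper bound for released is passed in explicitly (here the extraction date seen in the sample) rather than taken from the system clock.

```python
import pandas as pd


def range_violations(
    df: pd.DataFrame, earliest: pd.Timestamp, latest: pd.Timestamp
) -> pd.Series:
    """Count rows that violate each numeric or datetime validity rule."""
    # Missing release dates are a completeness issue, not a range error.
    released = df["released"].dropna()
    checks = {
        "price < 0": (df["price"] < 0).sum(),
        # between() is inclusive on both ends; NaN ratings would also count here.
        "rating outside [0, 5]": (~df["rating"].between(0, 5)).sum(),
        "ratings < 0": (df["ratings"] < 0).sum(),
        "released out of range": (~released.between(earliest, latest)).sum(),
    }
    return pd.Series(checks, name="violations").astype(int)


# Hypothetical toy rows with one deliberate violation per rule except ratings.
toy = pd.DataFrame(
    {
        "price": [0.0, 1.99, -1.0],
        "rating": [4.25, 5.5, 0.0],
        "ratings": [4, 13, 0],
        "released": pd.to_datetime(["2013-08-08", "2007-01-01", None]),
    }
)
result = range_violations(
    toy,
    earliest=pd.Timestamp("2008-06-10"),
    latest=pd.Timestamp("2023-07-31"),
)
print(result)
```

A result of all zeros on the full dataset would support the statement that every numeric and datetime value is within range.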

#### Categorical Variable Data Quality

The id, name, description, developer_id, and developer variables are nominal, high-cardinality strings. Category and category_id, in contrast, must each contain one of the 26 category id / category values selected for this analysis.

In [None]:
columns = ['category_id', 'category']
dataset.unique(columns=columns).style.hide(axis="index")

Category and category_id values are as expected.
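Beyond listing the unique values, it can be useful to confirm that category_id and category map one-to-one and that the expected number of categories is present. The following is a minimal sketch using plain pandas; `check_category_mapping` is a hypothetical helper, and the toy rows reuse three id/label pairs from the sample shown earlier.

```python
import pandas as pd


def check_category_mapping(df: pd.DataFrame, expected_count: int) -> dict:
    """Check that category_id and category pair up one-to-one."""
    pairs = df[["category_id", "category"]].drop_duplicates()
    n_ids = pairs["category_id"].nunique()
    n_labels = pairs["category"].nunique()
    return {
        "n_ids": n_ids,
        "n_labels": n_labels,
        # One-to-one: every id has exactly one label and every label one id.
        "one_to_one": len(pairs) == n_ids == n_labels,
        "count_ok": len(pairs) == expected_count,
    }


# Toy rows; the full dataset would use expected_count=26.
toy = pd.DataFrame(
    {
        "category_id": [6015, 6013, 6015, 6008, 6008],
        "category": [
            "Finance",
            "Health & Fitness",
            "Finance",
            "Photo & Video",
            "Photo & Video",
        ],
    }
)
report = check_category_mapping(toy, expected_count=3)
print(report)
```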