# AppStore Rating Exploratory Data Analysis


| #  | Variable    | Description                      | Data Type   |
|----|-------------|----------------------------------|-------------|
| 1  | id          | App Identifier                   | Nominal     |
| 2  | name        | App Name                         | Nominal     |
| 3  | category_id | Four Digit Category Id           | Categorical |
| 4  | category    | Category Name                    | Categorical |
| 5  | rating      | Average Customer Rating          | Interval    |
| 6  | reviews     | Total Number Of Customer Reviews | Discrete    |
| 7  | ratings     | Rating Count                     | Discrete    |
| 8  | onestar     | One Star Rating Count            | Discrete    |
| 9  | twostar     | Two Star Rating Count            | Discrete    |
| 10 | threestar   | Three Star Rating Count          | Discrete    |
| 11 | fourstar    | Four Star Rating Count           | Discrete    |
| 12 | fivestar    | Five Star Rating Count           | Discrete    |
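
The data dictionary above can also be kept in code, which makes it easy to tally how many variables fall under each data type. The following is a minimal sketch. The name `MEASUREMENT_LEVELS` is hypothetical and not part of the `aimobile` package; the labels are copied from the table.

```python
from collections import Counter

# Variable -> data type, transcribed from the data dictionary table.
# MEASUREMENT_LEVELS is an illustrative name, not an aimobile API.
MEASUREMENT_LEVELS = {
    "id": "Nominal",
    "name": "Nominal",
    "category_id": "Categorical",
    "category": "Categorical",
    "rating": "Interval",
    "reviews": "Discrete",
    "ratings": "Discrete",
    "onestar": "Discrete",
    "twostar": "Discrete",
    "threestar": "Discrete",
    "fourstar": "Discrete",
    "fivestar": "Discrete",
}

# Count how many variables carry each data type label.
level_counts = Counter(MEASUREMENT_LEVELS.values())
for level, count in level_counts.items():
    print(f"{level}: {count}")
```
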


In [1]:
import os

import numpy as np
import pandas as pd
from IPython.display import HTML, display_html

from aimobile.data.dataset.rating import RatingDataset
from aimobile.container import AIMobileContainer

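# Build the application container, initialize its resources, and wire it
# into the dataset package before the RatingDataset is created.
container = AIMobileContainer()
container.init_resources()
container.wire(packages=["aimobile.data.dataset"])
dataset = RatingDataset()

# Shared display options for the styled tables in later cells.
formatting = {"thousands": ",", "precision": 2}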

### Structural Analysis
The structure and characteristics of the Rating dataset are as follows:

In [2]:
df1 = dataset.structure
df2 = dataset.dtypes

# Style both tables as inline elements so they render side by side.
df1_style = (
    df1.style.set_table_attributes("style='display:inline; margin-right:220px;'")
    .set_caption("Dataset Structure")
    .format(thousands=",")
)
df2_style = (
    df2.style.set_table_attributes("style='display:inline; margin-right:120px;'")
    .set_caption("Dataset Data Types")
    .format(thousands=",")
)

# display_html returns None, so no further styling can be chained onto it.
display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

| Characteristic         | Total    |
|------------------------|----------|
| Number of Observations | 153518   |
| Number of Variables    | 12       |
| Number of Cells        | 1842216  |
| Size (Bytes)           | 35081406 |

| Data Type                        | Number of Features |
|----------------------------------|--------------------|
| Number of Nominal Data Types     | 2                  |
| Number of Categorical Data Types | 2                  |
| Number of Discrete Data Types    | 7                  |
| Number of Interval Data Types    | 1                  |
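
For readers without access to the `aimobile` package, a structure summary of this kind can be approximated with plain pandas. This is a minimal sketch, not the implementation behind `dataset.structure`: the function name `summarize_structure` and the toy frame are hypothetical, and measuring size with `memory_usage(deep=True)` is an assumption about how the byte total is obtained.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return observation, variable, cell, and memory totals for a DataFrame."""
    n_rows, n_cols = df.shape
    return pd.DataFrame(
        {
            "Characteristic": [
                "Number of Observations",
                "Number of Variables",
                "Number of Cells",
                "Size (Bytes)",
            ],
            "Total": [
                n_rows,
                n_cols,
                n_rows * n_cols,  # cells = observations x variables
                # Assumption: size is the deep memory footprint, index included.
                int(df.memory_usage(deep=True).sum()),
            ],
        }
    )


# Toy data for illustration only; not drawn from the Rating dataset.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["a", "b", "c"],
        "rating": [4.5, 3.0, 5.0],
        "ratings": [10, 2, 7],
    }
)
structure = summarize_structure(toy)
print(structure)
```

The cell count is the product of observations and variables, which matches the table above: 153,518 × 12 = 1,842,216.
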


### Data Quality Analysis
Format, data type, validity, cardinality, uniqueness, and size (in bytes) are summarized at the variable level. Validity and Percent Unique are expressed as fractions of the 153,518 observations.

In [6]:
dataset.quality.style.format(**formatting)


| Column      | Format  | Data Type   | Valid  | Null | Validity | Cardinality | Percent Unique | Size     |
|-------------|---------|-------------|--------|------|----------|-------------|----------------|----------|
| id          | int64   | Nominal     | 153518 | 0    | 1.0      | 129579      | 0.84           | 1228144  |
| name        | object  | Nominal     | 153518 | 0    | 1.0      | 129552      | 0.84           | 12950810 |
| category_id | int64   | Categorical | 153518 | 0    | 1.0      | 16          | 0.0            | 1228144  |
| category    | object  | Categorical | 153518 | 0    | 1.0      | 16          | 0.0            | 10321037 |
| rating      | float64 | Interval    | 153518 | 0    | 1.0      | 9           | 0.0            | 1228144  |
| reviews     | int64   | Discrete    | 153518 | 0    | 1.0      | 4231        | 0.03           | 1228144  |
| ratings     | int64   | Discrete    | 153518 | 0    | 1.0      | 11191       | 0.07           | 1228144  |
| onestar     | int64   | Discrete    | 153518 | 0    | 1.0      | 3075        | 0.02           | 1228144  |
| twostar     | int64   | Discrete    | 153518 | 0    | 1.0      | 1915        | 0.01           | 1228144  |
| threestar   | int64   | Discrete    | 153518 | 0    | 1.0      | 2898        | 0.02           | 1228144  |

### Content Analysis

In [5]:
dataset.summary.style.format(**formatting)

| Category         | Examples | Apps  | Average Rating | Rating Count | Review Count |
|------------------|----------|-------|----------------|--------------|--------------|
| Health & Fitness | 14681    | 11721 | 4.29           | 69368208     | 6530265      |
| Education        | 13212    | 10931 | 4.24           | 75172396     | 3848424      |
| Utilities        | 12236    | 9701  | 4.07           | 131277902    | 6270973      |
| Shopping         | 12022    | 10329 | 4.49           | 331039135    | 9515002      |
| Lifestyle        | 10913    | 8531  | 4.25           | 143917913    | 8359071      |
| Photo & Video    | 10482    | 9462  | 4.02           | 259327405    | 17261525     |
| Music            | 9784     | 9494  | 4.24           | 123172220    | 5502001      |
| Business         | 9436     | 7728  | 3.99           | 78521689     | 2417416      |
| Food & Drink     | 8973     | 8688  | 4.37           | 122413043    | 3059905      |
| Productivity     | 8708     | 6400  | 4.19           | 126252116    | 6034024      |
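
A per-category summary like this one can be sketched with a pandas `groupby`. The aggregation choices here are assumptions, not the confirmed logic of `dataset.summary`: Examples is taken to be the row count, Apps the number of distinct `id` values, Average Rating the unweighted mean of `rating`, and the two counts the sums of `ratings` and `reviews`. The function name `summarize_by_category` and the toy rows are hypothetical.

```python
import pandas as pd


def summarize_by_category(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate rows per category and sort by row count, largest first."""
    summary = (
        df.groupby("category")
        .agg(
            examples=("id", "count"),          # rows per category
            apps=("id", "nunique"),            # distinct app ids per category
            average_rating=("rating", "mean"),
            rating_count=("ratings", "sum"),
            review_count=("reviews", "sum"),
        )
        .sort_values("examples", ascending=False)
        .reset_index()
    )
    # Use the display names shown in the summary table.
    return summary.rename(
        columns={
            "category": "Category",
            "examples": "Examples",
            "apps": "Apps",
            "average_rating": "Average Rating",
            "rating_count": "Rating Count",
            "review_count": "Review Count",
        }
    )


# Toy data for illustration only; not drawn from the Rating dataset.
toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 3],
        "category": ["Music", "Music", "Music", "Education"],
        "rating": [4.0, 4.0, 5.0, 3.0],
        "ratings": [10, 10, 20, 5],
        "reviews": [1, 1, 2, 0],
    }
)
summary = summarize_by_category(toy)
print(summary)
```

Separating Examples from Apps reflects the table above, where every category has more rows than distinct apps.
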

