# Ratings Exploratory Data Analysis

| #  | Variable    | Description                      | Data Type   |
|----|-------------|----------------------------------|-------------|
| 1  | id          | App Identifier                   | Nominal     |
| 2  | name        | App Name                         | Nominal     |
| 3  | category_id | Four Digit Category Id           | Categorical |
| 4  | category    | Category Name                    | Categorical |
| 5  | rating      | Average Customer Rating          | Interval    |
| 6  | reviews     | Total Number Of Customer Reviews | Discrete    |
| 7  | ratings     | Rating Count                     | Discrete    |
| 8  | onestar     | One Star Rating Count            | Discrete    |
| 9  | twostar     | Two Star Rating Count            | Discrete    |
| 10 | threestar   | Three Star Rating Count          | Discrete    |
| 11 | fourstar    | Four Star Rating Count           | Discrete    |
| 12 | fivestar    | Five Star Rating Count           | Discrete    |


In [1]:
import os

import numpy as np
import pandas as pd
from IPython.display import HTML, display_html

from aimobile.data.dataset.rating import RatingDataset
from aimobile.container import AIMobileContainer

# Initialize the application container and wire it into the dataset package
container = AIMobileContainer()
container.init_resources()
container.wire(packages=["aimobile.data.dataset"])

# Instantiate the rating dataset used throughout this analysis
dataset = RatingDataset()

### Structural Analysis
The structure and characteristics of the Rating dataset are as follows:

In [2]:
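# Retrieve the structure and data type summaries
df1 = dataset.structure
df2 = dataset.dtypes

# Style both tables as inline elements so that they render side by side
df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px;'").set_caption("Dataset Data Types")

# Concatenate the rendered HTML and show both tables in a single output
display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)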

|   | Characteristic         | Total    |
|---|------------------------|----------|
| 0 | Number of Observations | 141278   |
| 1 | Number of Variables    | 12       |
| 2 | Number of Cells        | 1695336  |
| 3 | Size (Bytes)           | 32287351 |

|   | Data Type                        | Number of Features |
|---|----------------------------------|--------------------|
| 0 | Number of Nominal Data Types     | 2                  |
| 1 | Number of Categorical Data Types | 2                  |
| 2 | Number of Discrete Data Types    | 7                  |
| 3 | Number of Interval Data Types    | 1                  |
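
The structural figures above can be approximated directly with pandas. The following is a minimal sketch, not the `RatingDataset.structure` implementation: the helper `summarize_structure` and the toy frame are hypothetical, and it assumes that the size is measured as the in-memory footprint of the DataFrame.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return observation, variable, and cell counts plus the memory footprint."""
    n_rows, n_cols = df.shape
    return pd.DataFrame(
        {
            "Characteristic": [
                "Number of Observations",
                "Number of Variables",
                "Number of Cells",
                "Size (Bytes)",
            ],
            "Total": [
                n_rows,
                n_cols,
                n_rows * n_cols,
                # deep=True counts the bytes actually held by object (string) columns;
                # the index is included by default (assumption: the source may differ)
                int(df.memory_usage(deep=True).sum()),
            ],
        }
    )


# Hypothetical toy frame, not the real rating data
toy = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["a", "b", "c"], "rating": [4.5, 3.0, 5.0]}
)
structure = summarize_structure(toy)
print(structure)
```

The number of cells is simply observations times variables, which matches the table above: 141278 × 12 = 1695336.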


### Data Quality Analysis
Format, data type, validity, cardinality, uniqueness, and size are summarized for each variable.

In [3]:
dataset.quality


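|   | Column      | Format  | Data Type   | Valid  | Null | Validity | Cardinality | Percent Unique | Size     |
|---|-------------|---------|-------------|--------|------|----------|-------------|----------------|----------|
| 0 | id          | int64   | Nominal     | 141278 | 0    | 1.0      | 119875      | 0.85           | 1130224  |
| 1 | name        | object  | Nominal     | 141278 | 0    | 1.0      | 119862      | 0.85           | 11907298 |
| 2 | category_id | int64   | Categorical | 141278 | 0    | 1.0      | 15          | 0.0            | 1130224  |
| 3 | category    | object  | Categorical | 141278 | 0    | 1.0      | 15          | 0.0            | 9513209  |
| 4 | rating      | float64 | Interval    | 141278 | 0    | 1.0      | 9           | 0.0            | 1130224  |
| 5 | reviews     | int64   | Discrete    | 141278 | 0    | 1.0      | 4032        | 0.03           | 1130224  |
| 6 | ratings     | int64   | Discrete    | 141278 | 0    | 1.0      | 10495       | 0.07           | 1130224  |
| 7 | onestar     | int64   | Discrete    | 141278 | 0    | 1.0      | 2887        | 0.02           | 1130224  |
| 8 | twostar     | int64   | Discrete    | 141278 | 0    | 1.0      | 1821        | 0.01           | 1130224  |
| 9 | threestar   | int64   | Discrete    | 141278 | 0    | 1.0      | 2740        | 0.02           | 1130224  |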
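
Most columns of the quality table can be reproduced with a few pandas calls. The sketch below is illustrative only and is not the `RatingDataset.quality` implementation: the helper `summarize_quality` and the toy frame are hypothetical, and it assumes that validity is the share of non-null values and that "Percent Unique" is cardinality divided by the number of observations, expressed as a fraction. The semantic "Data Type" column (Nominal, Categorical, and so on) is omitted because it cannot be inferred from the storage format alone.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize format, validity, cardinality, and size for each column."""
    n = len(df)
    valid = df.notna().sum()      # non-null values per column
    cardinality = df.nunique()    # distinct non-null values per column
    return pd.DataFrame(
        {
            "Column": df.columns,
            "Format": df.dtypes.astype(str).values,
            "Valid": valid.values,
            "Null": (n - valid).values,
            "Validity": (valid / n).round(2).values,
            "Cardinality": cardinality.values,
            "Percent Unique": (cardinality / n).round(2).values,
            # Per-column memory footprint in bytes, excluding the index
            "Size": df.memory_usage(deep=True, index=False).values,
        }
    )


# Hypothetical toy frame with one duplicated id and one missing category
toy = pd.DataFrame(
    {"id": [1, 2, 2, 3], "category": ["Music", "Music", None, "Games"]}
)
quality = summarize_quality(toy)
print(quality)
```

Under these assumptions, the 1130224 bytes reported for each `int64` and `float64` column corresponds to 141278 observations at 8 bytes each.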


### Content Analysis

In [4]:
dataset.summary

|   | Category         | Examples | Apps  | Average Rating | Rating Count | Review Count |
|---|------------------|----------|-------|----------------|--------------|--------------|
| 0 | Health & Fitness | 14681    | 11721 | 4.29           | 69368208     | 6530265      |
| 1 | Education        | 13212    | 10931 | 4.24           | 75172396     | 3848424      |
| 2 | Shopping         | 12022    | 10329 | 4.49           | 331039135    | 9515002      |
| 3 | Lifestyle        | 10913    | 8531  | 4.25           | 143917913    | 8359071      |
| 4 | Photo & Video    | 10482    | 9462  | 4.02           | 259327405    | 17261525     |
| 5 | Music            | 9784     | 9494  | 4.24           | 123172220    | 5502001      |
| 6 | Business         | 9436     | 7728  | 3.99           | 78521689     | 2417416      |
| 7 | Food & Drink     | 8973     | 8688  | 4.37           | 122413043    | 3059905      |
| 8 | Productivity     | 8708     | 6400  | 4.19           | 126252116    | 6034024      |
| 9 | Entertainment    | 8507     | 7297  | 4.11           | 162296982    | 11392628     |
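
A per-category summary of this shape can be sketched with a pandas `groupby`. This is not the `RatingDataset.summary` implementation: the helper `summarize_by_category` and the toy frame are hypothetical, and it assumes that "Examples" counts rows, "Apps" counts distinct `id` values, "Average Rating" is the unweighted mean of `rating`, and the two count columns are sums of `ratings` and `reviews`.

```python
import pandas as pd


def summarize_by_category(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate rows, distinct apps, mean rating, and totals per category."""
    summary = (
        df.groupby("category")
        .agg(
            # Dictionary unpacking allows output names that contain spaces
            **{
                "Examples": ("id", "count"),
                "Apps": ("id", "nunique"),
                "Average Rating": ("rating", "mean"),
                "Rating Count": ("ratings", "sum"),
                "Review Count": ("reviews", "sum"),
            }
        )
        .sort_values("Examples", ascending=False)
        .reset_index()
        .rename(columns={"category": "Category"})
    )
    summary["Average Rating"] = summary["Average Rating"].round(2)
    return summary


# Hypothetical toy frame in which app 1 appears twice within the same category
toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 3],
        "category": ["Music", "Music", "Music", "Games"],
        "rating": [4.0, 4.0, 5.0, 3.0],
        "ratings": [10, 10, 20, 5],
        "reviews": [1, 1, 2, 0],
    }
)
summary = summarize_by_category(toy)
print(summary)
```

Separating "Examples" from "Apps" matters here because the same app can occur in more than one row, which is consistent with the `id` cardinality of 119875 against 141278 observations.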
