# AppVoC Rating Exploratory Data Analysis


| #  | Variable    | Description                      | Data Type   |
|----|-------------|----------------------------------|-------------|
| 1  | id          | App Identifier                   | Nominal     |
| 2  | name        | App Name                         | Nominal     |
| 3  | category_id | Four Digit Category Id           | Categorical |
| 4  | category    | Category Name                    | Categorical |
| 5  | rating      | Average Customer Rating          | Interval    |
| 6  | reviews     | Total Number Of Customer Reviews | Discrete    |
| 7  | ratings     | Rating Count                     | Discrete    |
| 8  | onestar     | One Star Rating Count            | Discrete    |
| 9  | twostar     | Two Star Rating Count            | Discrete    |
| 10 | threestar   | Three Star Rating Count          | Discrete    |
| 11 | fourstar    | Four Star Rating Count           | Discrete    |
| 12 | fivestar    | Five Star Rating Count           | Discrete    |


In [1]:
import os

import numpy as np
import pandas as pd
from IPython.display import HTML, display_html

from appvoc.data.dataset.rating import RatingDataset
from appvoc.container import AppVoCContainer

container = AppVoCContainer()
container.init_resources()
container.wire(packages=["appvoc.data.dataset"])
dataset = RatingDataset()
formatting = {"thousands": ",", "precision": 2}

TypeError: RatingDataset.__init__() missing 1 required positional argument: 'df'

### Structural Analysis
The structure and characteristics of the Rating dataset are as follows:

In [None]:
df1 = dataset.structure
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure").format(thousands=",")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types").format(thousands=",")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

| Characteristic         | Total    |
|------------------------|----------|
| Number of Observations | 177100   |
| Number of Variables    | 13       |
| Number of Cells        | 2302300  |
| Size (Bytes)           | 39125826 |

| Data Type                        | Number of Features |
|----------------------------------|--------------------|
| Number of Nominal Data Types     | 2                  |
| Number of Categorical Data Types | 2                  |
| Number of Discrete Data Types    | 7                  |
| Number of Interval Data Types    | 1                  |
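
The internals of `dataset.structure` are not shown in this notebook. As a rough sketch only, the same four characteristics could be derived from a plain pandas DataFrame as follows. The function name `structure_summary` and the toy data are hypothetical, and measuring size with `memory_usage(deep=True)` is an assumption about how the reported byte count was obtained.

```python
import pandas as pd


def structure_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: shape and memory footprint of a DataFrame."""
    n_rows, n_cols = df.shape
    return pd.DataFrame(
        {
            "Characteristic": [
                "Number of Observations",
                "Number of Variables",
                "Number of Cells",
                "Size (Bytes)",
            ],
            "Total": [
                n_rows,
                n_cols,
                n_rows * n_cols,  # cells = observations x variables
                # Assumption: deep memory usage, index included
                int(df.memory_usage(deep=True).sum()),
            ],
        }
    )


# Toy data for illustration only; not the Rating dataset
toy = pd.DataFrame({"id": ["a", "b", "c"], "rating": [4.5, 3.0, 5.0]})
structure = structure_summary(toy)
print(structure)
```

The cell count in the output above is consistent with this definition: 177,100 observations × 13 variables = 2,302,300 cells.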


### Data Quality Analysis
Data type, validity, cardinality, uniqueness, and size are summarized for each variable.

In [None]:
dataset.quality.style.format(**formatting)


| Column      | Format   | Data Type   | Valid  | Null | Validity | Cardinality | Percent Unique | Size     |
|-------------|----------|-------------|--------|------|----------|-------------|----------------|----------|
| id          | string   | Nominal     | 177100 | 0    | 1.0      | 177100      | 1.0            | 11839060 |
| name        | string   | Nominal     | 177100 | 0    | 1.0      | 176890      | 1.0            | 14932818 |
| category_id | category | Categorical | 177100 | 0    | 1.0      | 7           | 0.0            | 177827   |
| category    | category | Categorical | 177100 | 0    | 1.0      | 7           | 0.0            | 177875   |
| rating      | float64  | Continuous  | 177100 | 0    | 1.0      | 10          | 0.0            | 1416800  |
| reviews     | int64    | Discrete    | 177100 | 0    | 1.0      | 2494        | 0.01           | 1416800  |
| ratings     | int64    | Discrete    | 177100 | 0    | 1.0      | 6430        | 0.04           | 1416800  |
| onestar     | int64    | Discrete    | 177100 | 0    | 1.0      | 1763        | 0.01           | 1416800  |
| twostar     | int64    | Discrete    | 177100 | 0    | 1.0      | 1086        | 0.01           | 1416800  |
| threestar   | int64    | Discrete    | 177100 | 0    | 1.0      | 1656        | 0.01           | 1416800  |
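
How `dataset.quality` computes these columns is not shown here. The sketch below is one plausible reading using plain pandas: validity as the share of non-null values, and percent unique as distinct values divided by row count. Both definitions, the helper name `quality_summary`, and the toy data are assumptions for illustration.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column quality metrics for a DataFrame."""
    n = len(df)
    valid = df.notna().sum()
    unique = df.nunique()  # distinct non-null values per column
    return pd.DataFrame(
        {
            "Column": df.columns,
            "Format": [str(dtype) for dtype in df.dtypes],
            "Valid": valid.to_numpy(),
            "Null": df.isna().sum().to_numpy(),
            # Assumption: validity = fraction of non-null values
            "Validity": (valid / n).to_numpy(),
            "Cardinality": unique.to_numpy(),
            # Assumption: a fraction in [0, 1], despite the column name
            "Percent Unique": (unique / n).to_numpy(),
            "Size": df.memory_usage(deep=True, index=False).to_numpy(),
        }
    )


# Toy data for illustration only; not the Rating dataset
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "category": pd.Categorical(["Finance", "Finance", "Medical", "Medical"]),
        "rating": [4.5, None, 3.0, 4.5],
    }
)
quality = quality_summary(toy)
print(quality)
```

Under these definitions an `int64` or `float64` column takes 8 bytes per row, which matches the 1,416,800 bytes reported above for each numeric column (177,100 × 8).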


### Content Analysis

In [None]:
dataset.summary.style.format(**formatting)

| Category         | Examples | Apps  | Average Rating | Rating Count | Review Count |
|------------------|----------|-------|----------------|--------------|--------------|
| Finance          | 58300    | 58300 | 1.21           | 132980495    | 3846716      |
| Medical          | 31800    | 31800 | 0.68           | 9060144      | 589224       |
| Health & Fitness | 28800    | 28800 | 2.14           | 37529874     | 3392969      |
| Lifestyle        | 20500    | 20500 | 1.95           | 59400346     | 3714440      |
| Food & Drink     | 17900    | 17900 | 2.26           | 87449927     | 2324313      |
| Photo & Video    | 17900    | 17900 | 2.14           | 100711313    | 7472538      |
| Productivity     | 1900     | 1900  | 1.16           | 31737        | 6660         |
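
A per-category summary of this shape can be approximated with a pandas `groupby`. This is a minimal sketch on toy data, not the `dataset.summary` implementation: it assumes the average rating is an unweighted mean across apps and that the rating and review counts are simple sums, neither of which is confirmed by this notebook.

```python
import pandas as pd

# Toy data for illustration only; not the Rating dataset
toy = pd.DataFrame(
    {
        "id": ["1", "2", "3", "4", "5"],
        "category": ["Finance", "Finance", "Finance", "Medical", "Medical"],
        "rating": [4.0, 2.0, 3.0, 0.0, 5.0],
        "ratings": [100, 50, 30, 0, 10],
        "reviews": [10, 5, 3, 0, 1],
    }
)

category_summary = (
    toy.groupby("category", as_index=False)
    .agg(
        **{
            "Examples": ("id", "count"),            # rows per category
            "Apps": ("id", "nunique"),              # distinct app ids
            "Average Rating": ("rating", "mean"),   # assumption: unweighted mean
            "Rating Count": ("ratings", "sum"),
            "Review Count": ("reviews", "sum"),
        }
    )
    .rename(columns={"category": "Category"})
    .sort_values("Examples", ascending=False, ignore_index=True)
)
print(category_summary)
```

Sorting by `Examples` in descending order mirrors the row order of the table above.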
