## AppData Dataset

In [1]:
import numpy as np
import pandas as pd
from IPython.display import HTML, display_html
import warnings
import seaborn as sns
warnings.filterwarnings(action='ignore', category=UserWarning)

from appvoc.container import AppVoCContainer
from appvoc.data.dataset.app import AppDataDataset
pd.set_option('display.max_colwidth', 200)

In [2]:
container = AppVoCContainer()
container.init_resources()
container.wire(packages=['appvoc'])

In [3]:
repo = container.data.app_repo()
dataset = repo.get_dataset()

ProgrammingError: (pymysql.err.ProgrammingError) (1146, "Table 'appvoc_raw.app' doesn't exist")
[SQL: SELECT * FROM app;]
(Background on this error at: https://sqlalche.me/e/20/f405)

The dataset contains descriptive, rating, price, and developer data for 475,132 apps from the App Store.

| #  | Variable     | Data Type  | Description                              |
|----|--------------|------------|------------------------------------------|
| 1  | id           | Nominal    | App Id from the App Store                |
| 2  | name         | Nominal    | App Name                                 |
| 3  | description  | Nominal    | App Description                          |
| 4  | category_id  | Nominal    | Numeric category identifier              |
| 5  | category     | Nominal    | Category name                            |
| 6  | price        | Continuous | App Price                                |
| 7  | developer_id | Nominal    | Identifier for the developer             |
| 8  | developer    | Nominal    | Name of the developer                    |
| 9  | rating       | Ordinal    | Average user rating since first released |
| 10 | ratings      | Discrete   | Number of ratings since first release    |
| 11 | released     | Continuous | Datetime of first release                |

The structure is summarized as follows.

In [None]:
df1 = dataset.overview
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.T.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

Summarizing the structure of the app dataset, we have:
- 475,132 observations described by 11 variables, or approximately 5.2 million cells (475,132 × 11).
- Id, name, description, developer id, and developer are nominal string variables. 
- Category id and category are both category data types.
- Price and rating are floats, and ratings is an integer.  
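
As a rough illustration of how such a structural summary can be derived, the sketch below computes row, column, and cell counts along with data types using plain pandas. The `toy` DataFrame and its values are invented for illustration; it does not use the `AppDataDataset` API or the actual data.

```python
import pandas as pd

# Toy stand-in for the app dataset; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "id": ["1", "2", "3"],
        "category": pd.Categorical(["Games", "Health", "Games"]),
        "price": [0.0, 1.99, 4.99],
        "ratings": [10, 0, 250],
    }
)

n_rows, n_cols = toy.shape
n_cells = toy.size  # number of cells is rows * columns

summary = pd.Series({"rows": n_rows, "columns": n_cols, "cells": n_cells})
print(summary)
print(toy.dtypes)
```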

Let's examine completeness.

### Completeness

In [None]:
dataset.info.style.hide(axis="index")

With the exception of released (date), all variables are non-null. Approximately 93% of the observations are complete; the remaining 7% lack a release date.

Rating count is a facet that will be examined during exploratory data analysis. All apps are presumed to be active; therefore, apps with earlier release dates have had more time to attract ratings. To remove this temporal dimension from the rating count analysis, we'll create a new variable that normalizes the rating count (ratings) by the number of months the app has been available on the market. Apps without release dates must therefore be excluded from that analysis. This does not justify removing them from the dataset, however, as these observations carry other information, such as average rating, rating count, and developer.

Completeness is estimated at 93%, and no observations will be removed at this stage.
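
The completeness estimate and the planned normalization might look something like the following sketch. It uses a small invented DataFrame, and the reference date `as_of`, the average month length, the one-month floor, and the column name `ratings_per_month` are all assumptions made for illustration, not choices taken from the analysis itself.

```python
import pandas as pd

# Toy data; the third app has no release date.
toy = pd.DataFrame(
    {
        "ratings": [1200, 300, 50],
        "released": pd.to_datetime(["2020-01-15", "2022-07-01", None]),
    }
)

# Assumed reference date standing in for the date the data were collected.
as_of = pd.Timestamp("2023-07-01")

# Completeness: share of rows with no missing values in any column.
completeness = toy.notna().all(axis=1).mean()

# Months on the market, using an average month length of 30.4375 days.
months = (as_of - toy["released"]).dt.days / 30.4375

# Floor at one month so very recent apps do not produce inflated rates.
months = months.clip(lower=1)

# Rows without a release date yield NaN and drop out of this analysis only.
toy["ratings_per_month"] = toy["ratings"] / months
print(f"Completeness: {completeness:.0%}")
print(toy)
```

Keeping the rows with missing dates and letting the derived column be NaN matches the decision above to retain those observations for other analyses.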

Next, we'll check validity.

### Validity

Our ability to assess the validity of the id, name, description, developer_id, and developer variables is limited. Although we cannot test this assumption, we treat any non-null value in those variables as valid. For the remaining variables, we define validity as follows:

| # | name        | validity                                                                                     |
|---|-------------|----------------------------------------------------------------------------------------------|
| 1 | category_id | One of the 26 four-digit category ids published by Apple.                                    |
| 2 | category    | One of the 26 categories published by Apple.                                                 |
| 3 | price       | Any non-negative real value.                                                                 |
| 4 | rating      | Any real value in [0,5].                                                                     |
| 5 | ratings     | Any non-negative integer value.                                                              |
| 6 | released    | A datetime on or after July 10, 2008. Some apps are recorded as having future release dates. |
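
These rules can be expressed as boolean masks, one per variable. The following is a minimal sketch on invented data, not the checks performed by `AppDataDataset`. The set `valid_category_ids` is a hypothetical two-element placeholder for the 26 published ids, and storing the ids as strings is an assumption.

```python
import pandas as pd

# Toy data with deliberate rule violations; values are invented.
toy = pd.DataFrame(
    {
        "category_id": ["6014", "6013", "9999"],
        "price": [0.0, -1.0, 2.99],
        "rating": [4.5, 5.2, 0.0],
        "ratings": [10, 3, 0],
        "released": pd.to_datetime(["2008-07-10", "2015-03-01", "2007-01-01"]),
    }
)

# Hypothetical placeholder; the real list would contain all 26 category ids.
valid_category_ids = {"6013", "6014"}
earliest_release = pd.Timestamp("2008-07-10")

valid = pd.DataFrame(
    {
        "category_id": toy["category_id"].isin(valid_category_ids),
        "price": toy["price"] >= 0,
        "rating": toy["rating"].between(0, 5),  # inclusive on both ends
        "ratings": toy["ratings"] >= 0,
        # Missing dates are a completeness issue, so they pass here.
        # No upper bound, since future release dates are permitted.
        "released": toy["released"].isna() | (toy["released"] >= earliest_release),
    }
)

invalid_counts = (~valid).sum()
print(invalid_counts)
```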

Let's check category and category id.

In [None]:
columns = ['category_id', 'category']
dataset.unique(columns=columns).style.hide(axis="index")
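
Beyond membership in the published list, one might also confirm that each category id maps to exactly one category name and vice versa. The sketch below shows one way to do that with plain pandas; the id and name pairs are invented placeholders rather than values read from the dataset.

```python
import pandas as pd

# Placeholder id/name pairs, invented for illustration.
toy = pd.DataFrame(
    {
        "category_id": ["6013", "6014", "6014", "6013"],
        "category": ["Health & Fitness", "Games", "Games", "Health & Fitness"],
    }
)

# Distinct (id, name) combinations observed in the data.
pairs = toy[["category_id", "category"]].drop_duplicates()

# A one-to-one mapping has exactly one name per id and one id per name.
names_per_id = pairs.groupby("category_id")["category"].nunique()
ids_per_name = pairs.groupby("category")["category_id"].nunique()
one_to_one = bool((names_per_id == 1).all() and (ids_per_name == 1).all())
print(one_to_one)
```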

All category values are valid. Next, price, rating and ratings.

In [None]:
stats = dataset.describe(include=[np.number, "datetime64[ns]"])
stats.numeric.T[['min', 'max']]

Ratings and price are non-negative, and rating values are between zero and five. Finally, let's check the release date.

In [None]:
# Earliest release date in the dataset
condition = lambda df: df["released"] == df["released"].min()
first_released = dataset.subset(condition=condition)
first_date = first_released["released"].values[0].astype(str)

# Latest release date in the dataset
condition = lambda df: df["released"] == df["released"].max()
last_released = dataset.subset(condition=condition)
last_date = last_released["released"].values[0].astype(str)

# Display both dates; previously the string conversions were discarded
first_date, last_date