## AppData Exploratory Data Analysis
Section one of our exploratory data analysis is a *quantitative* exploration of the following 15 variables for over 300,000 apps, across 26 categories. 

| #  | Variable                | Data Type   | Description                                |
|----|-------------------------|-------------|--------------------------------------------|
| 1  | id                      | Nominal     | App Id from the App Store                  |
| 2  | name                    | Nominal     | App Name                                   |
| 3  | description             | Nominal     | App Description                            |
| 4  | category_id             | Categorical | Numeric category identifier                |
| 5  | category                | Categorical | Category name                              |
| 6  | price                   | Continuous  | App Price                                  |
| 7  | developer_id            | Nominal     | Identifier for the developer               |
| 8  | developer               | Nominal     | Name of the developer                      |
| 9  | rating                  | Interval    | Average user rating since first release    |
| 10 | ratings                 | Discrete    | Number of ratings since first release      |
| 11 | rating_current_version  | Interval    | Average user rating of current release     |
| 12 | ratings_current_version | Discrete    | Number of user ratings for current release |
| 13 | released                | Continuous  | Datetime of first release                  |
| 14 | released_current        | Continuous  | Datetime of current release                |
| 15 | version                 | Nominal     | Current version of app                     |


Our investigation will center on the following five analyses:

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions and frequencies of each variable individually.
4. Bivariate Analysis: Association and correlation analysis between two variables.
5. Multivariate Analysis: Cluster, factor, and correspondence analysis of three or more variables simultaneously.


**Dependencies** 

In [1]:
import os

import numpy as np
import pandas as pd
from IPython.display import HTML, display_html

from aimobile.data.dataset.appdata import AppDataDataset
from aimobile.container import AIMobileContainer

container = AIMobileContainer()
container.init_resources()
container.wire(packages=["aimobile.data.dataset"])
dataset = AppDataDataset()

### Structural Analysis
The structure and characteristics of the AppData dataset are as follows:

In [2]:
df1 = dataset.structure
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

| Characteristic         | Total     |
|------------------------|-----------|
| Number of Observations | 334821    |
| Number of Variables    | 15        |
| Number of Cells        | 5022315   |
| Size (Bytes)           | 816572328 |

| Data Type                        | Number of Features |
|----------------------------------|--------------------|
| Number of Nominal Data Types     | 6                  |
| Number of Categorical Data Types | 2                  |
| Number of Continuous Data Types  | 3                  |
| Number of Discrete Data Types    | 2                  |
| Number of Interval Data Types    | 2                  |


The structure, in terms of shape, size, and type, comports with expectations: the 15 variables and their data types match the variable table above.
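The `dataset.structure` property belongs to the project's own `AppDataDataset` class, whose implementation is not shown here. As a minimal sketch of how comparable figures could be derived in plain pandas, the helper below (`summarize_structure`, a hypothetical name) reports observations, variables, cells, and in-memory size for a small toy frame with made-up values.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return observations, variables, cells, and in-memory size of a DataFrame."""
    n_rows, n_cols = df.shape
    summary = {
        "Number of Observations": n_rows,
        "Number of Variables": n_cols,
        "Number of Cells": n_rows * n_cols,
        # deep=True also counts the bytes held by Python string objects
        "Size (Bytes)": int(df.memory_usage(deep=True).sum()),
    }
    return pd.DataFrame(
        {"Characteristic": list(summary.keys()), "Total": list(summary.values())}
    )


# Toy data (hypothetical values) standing in for the real dataset
toy = pd.DataFrame(
    {"id": ["1", "2", "3"], "price": [0.0, 0.99, 4.99], "ratings": [10, 0, 250]}
)
structure = summarize_structure(toy)
print(structure)
```

The number of cells is simply observations times variables, which is consistent with the figures reported above (334821 × 15 = 5022315).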

### Data Quality Analysis
Data type, cardinality, validity, duplication, and size data are summarized at the variable level. 

In [3]:
dataset.quality


| Column       | Format         | Data Type   | Valid  | Null | Validity | Cardinality | Percent Unique | Size      |
|--------------|----------------|-------------|--------|------|----------|-------------|----------------|-----------|
| id           | string[python] | Nominal     | 334821 | 0    | 1.0      | 302760      | 0.9            | 22354642  |
| name         | string[python] | Nominal     | 334821 | 0    | 1.0      | 302434      | 0.9            | 28546848  |
| description  | string[python] | Nominal     | 334821 | 0    | 1.0      | 297963      | 0.89           | 987808693 |
| category_id  | category       | Categorical | 334821 | 0    | 1.0      | 26          | 0.0            | 336101    |
| category     | category       | Categorical | 334821 | 0    | 1.0      | 26          | 0.0            | 337633    |
| price        | float64        | Continuous  | 334821 | 0    | 1.0      | 116         | 0.0            | 2678568   |
| developer_id | string[python] | Nominal     | 334821 | 0    | 1.0      | 168841      | 0.5            | 22297313  |
| developer    | string[python] | Nominal     | 334821 | 0    | 1.0      | 168335      | 0.5            | 25693867  |
| rating       | float64        | Interval    | 334821 | 0    | 1.0      | 52988       | 0.16           | 2678568   |
| ratings      | int64          | Discrete    | 334821 | 0    | 1.0      | 23167       | 0.07           | 2678568   |


These variable-level metadata indicate no missing data. The data formats align with, and are appropriate for, the data types. However, the cardinality of the id, name, and description variables reveals about 10% partial duplication in each. Some duplication of name and description is not ideal, but it is tolerable. Duplication of the id variable, which identifies an app, is not. Let's take a closer look.
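The null counts, cardinality, and percent-unique figures discussed above come from the project's `dataset.quality` property. As a rough sketch of how such a summary could be assembled in plain pandas, the hypothetical helper `summarize_quality` below computes the same kinds of columns on toy data. It assumes that "Validity" means the share of non-null values, which may differ from the project's definition.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column null counts, cardinality, and share of unique values."""
    n = len(df)
    valid = df.notna().sum()
    cardinality = df.nunique()  # nulls are excluded from the count
    return pd.DataFrame(
        {
            "Valid": valid,
            "Null": df.isna().sum(),
            "Validity": (valid / n).round(2),  # assumed: share of non-null values
            "Cardinality": cardinality,
            "Percent Unique": (cardinality / n).round(2),
        }
    )


# Toy data (hypothetical values): one repeated id and one missing name
toy = pd.DataFrame(
    {
        "id": ["a", "a", "b", "c"],
        "name": ["x", "x", "y", None],
        "price": [0.0, 0.0, 1.99, 0.0],
    }
)
quality = summarize_quality(toy)
print(quality)
```

A "Percent Unique" value below 1.0 for an identifier column, as with `id` here, is the signal that prompts the duplication analysis.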

#### Partial Duplication Analysis
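One way to approach this, sketched here under stated assumptions rather than as the project's actual method, is to compare duplication on a single key with duplication of entire rows. The toy frame and the helper `duplicate_report` are hypothetical; only the pandas `DataFrame.duplicated` call is standard.

```python
import pandas as pd

# Toy data (hypothetical values): id "100" appears twice with different categories
toy = pd.DataFrame(
    {
        "id": ["100", "100", "200", "300"],
        "name": ["Alpha", "Alpha", "Beta", "Beta"],
        "category": ["Games", "Education", "News", "News"],
    }
)


def duplicate_report(df: pd.DataFrame, key: str) -> dict:
    """Count rows that share a value of `key` with at least one other row."""
    # keep=False flags every member of a duplicated group
    in_group = df.duplicated(subset=[key], keep=False)
    # keep="first" flags only the surplus rows beyond the first occurrence
    surplus = df.duplicated(subset=[key], keep="first")
    return {
        "rows_in_duplicate_groups": int(in_group.sum()),
        "surplus_rows": int(surplus.sum()),
        "share_surplus": float(surplus.mean()),
    }


id_report = duplicate_report(toy, "id")
name_report = duplicate_report(toy, "name")

# Rows identical in every column; zero here, so the id duplication is partial
exact_duplicates = int(toy.duplicated(keep="first").sum())

print(id_report, name_report, exact_duplicates)
```

When a key is duplicated but no full row is, the repeated rows differ in at least one other column, and inspecting those columns shows where the partial duplication originates.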


#### Numeric Data Quality
Valid values for the numeric variables are:

| Variable                | Data Type  | Valid Values                                    |
|-------------------------|------------|-------------------------------------------------|
| price                   | Continuous | Non negative values                             |
| rating                  | Interval   | Real valued in [0,5]                            |
| ratings                 | Discrete   | Discrete and non-negative                       |
| rating_current_version  | Interval   | Real valued in [0,5]                            |
| ratings_current_version | Discrete   | Discrete and non-negative                       |
| released                | Continuous | Datetimes between June 10, 2008 and present day. |
| released_current        | Continuous | Datetimes between June 10, 2008 and present day. |

Let's examine the descriptive statistics for these variables.

In [6]:
dataset.describe(include=[np.number, np.datetime64])[["min","max"]]

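| Variable                | min                 | max                 |
|-------------------------|---------------------|---------------------|
| price                   | 0.00                | 999.99              |
| rating                  | 0.00                | 5.00                |
| ratings                 | 0.00                | 30835421.00         |
| rating_current_version  | 0.00                | 5.00                |
| ratings_current_version | 0.00                | 30835421.00         |
| released                | 2008-09-27 04:53:00 | 2023-11-15 08:00:00 |
| released_current        | 2008-11-05 05:15:00 | 2023-11-15 08:00:00 |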


All numeric and datetime values are within range.
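Reading minima and maxima off a table works for a one-time check. The same rules can also be expressed as boolean masks, which makes any violations countable. The following is a minimal sketch in plain pandas on toy data with made-up values; the `checks` dictionary and the variable names are illustrative and not part of the project's API.

```python
import pandas as pd

# Toy data (hypothetical values) containing one violation each
# in price, rating, and released
toy = pd.DataFrame(
    {
        "price": [0.0, 4.99, -1.0],
        "rating": [4.5, 5.2, 3.0],
        "ratings": [10, 0, 7],
        "released": pd.to_datetime(["2009-01-01", "2008-01-01", "2020-05-05"]),
    }
)

earliest = pd.Timestamp("2008-06-10")  # lower bound from the valid-values table
latest = pd.Timestamp.now()            # "present day"

# Each entry is a boolean Series: True where the value satisfies its rule
checks = {
    "price": toy["price"] >= 0,
    "rating": toy["rating"].between(0, 5),  # bounds are inclusive by default
    "ratings": toy["ratings"] >= 0,
    "released": toy["released"].between(earliest, latest),
}

# Number of rows that break each rule
violations = {name: int((~ok).sum()) for name, ok in checks.items()}

# Discrete counts should also be stored as integers
ratings_is_integer = pd.api.types.is_integer_dtype(toy["ratings"])

print(violations, ratings_is_integer)
```

A dataset that is fully within range would yield zero violations for every variable.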

#### Non-Numeric Data Quality

The id, name, description, developer_id, and developer variables are nominal, high-cardinality strings with no restrictions on the values they may contain. The category and category_id variables, in contrast, must each contain one of 26 valid values.

In [2]:
columns = ['category_id', 'category']
dataset.unique(columns=columns)

| index | category_id | category         |
|-------|-------------|------------------|
| 0     | 6013        | Health & Fitness |
| 3     | 6024        | Shopping         |
| 9     | 6020        | Medical          |
| 15    | 6009        | News             |
| 22    | 6014        | Games            |
| 36    | 6012        | Lifestyle        |
| 39    | 6017        | Education        |
| 42    | 6004        | Sports           |
| 49    | 6007        | Productivity     |
| 53    | 6003        | Travel           |


Category and category_id values are within spec.
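A check of this kind can be sketched as a lookup against a reference mapping of category ids to names. The `valid_pairs` dictionary below holds only a few of the pairs shown in the output above, not the full set of 26, and the toy rows are hypothetical.

```python
import pandas as pd

# Partial reference mapping taken from the output above (not all 26 categories)
valid_pairs = {
    6013: "Health & Fitness",
    6024: "Shopping",
    6020: "Medical",
    6009: "News",
    6014: "Games",
}

# Toy data (hypothetical values): one mismatched pair and one unknown id
toy = pd.DataFrame(
    {
        "category_id": [6013, 6024, 6014, 6014, 9999],
        "category": ["Health & Fitness", "Shopping", "Games", "News", "Unknown"],
    }
)

# Look up the expected name for each id; ids missing from the mapping give NaN
expected = toy["category_id"].map(valid_pairs)

# A row is valid only if its id is known and its name matches the expected one
is_valid = expected.eq(toy["category"])
invalid_rows = toy[~is_valid]

# Each id should map to exactly one name; values above 1 indicate a conflict
names_per_id = toy.groupby("category_id")["category"].nunique()

print(invalid_rows)
print(names_per_id[names_per_id > 1])
```

The mapping catches both values outside the specification and ids paired with the wrong name, while the group count flags ids that carry more than one name.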

The dataset evinces no missing data, and variable values align with the data types and specifications of the data. There is, however, some partial duplication within the id, name, and description variables.