## Exploratory Data Analysis I: App Ratings 
Section one of our exploratory data analysis is a brief *quantitative* exploration of the app ratings and related data obtained from the Apple App Store. This initial analysis aims to expose patterns and stimulate insight into the nature and intensity of customer satisfaction within the iOS app user community.

### The Dataset
The dataset contains iOS app descriptive data, ratings, rating counts, and developer and pricing information, as indicated below.

| #  | Variable                | Data Type  | Description                                |
|----|-------------------------|------------|--------------------------------------------|
| 1  | id                      | Nominal    | App Id from the App Store                  |
| 2  | name                    | Nominal    | App Name                                   |
| 3  | description             | Nominal    | App Description                            |
| 4  | category_id             | Nominal    | Numeric category identifier                |
| 5  | category                | Nominal    | Category name                              |
| 6  | price                   | Continuous | App Price                                  |
| 7  | developer_id            | Nominal    | Identifier for the developer               |
| 8  | developer               | Nominal    | Name of the developer                      |
| 9  | rating                  | Ordinal    | Average user rating since first release    |
| 10 | ratings                 | Discrete   | Number of ratings since first release      |
| 11 | released                | Continuous | Datetime of first release                  |

### EDA Approach
Our investigation will comprise the following five analyses.

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions of rating count, average rating, categories, and price.
4. Bivariate Analysis: Evaluate relationships between pairs of variables, such as rating, rating count, and reviews, including correlation analysis.
5. Multivariate Analysis: Cluster, factor, and correspondence analysis of three or more variables simultaneously.


**Dependencies** 

In [1]:
import os

import numpy as np
import pandas as pd
from IPython.display import HTML, display_html

from appstore.data.analysis.appdata import AppDataDataset
from appstore.container import AppstoreContainer

container = AppstoreContainer()
container.init_resources()
container.wire(packages=["appstore.data.analysis"])
dataset = AppDataDataset()

### Structural Analysis
The structure and characteristics of the AppData dataset are as follows:

In [2]:
df1 = dataset.structure
df2 = dataset.dtypes

df1_style = df1.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Dataset Structure")
df2_style = df2.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Dataset Data Types")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

Unnamed: 0,Characteristic,Total
0,Number of Observations,475132
1,Number of Variables,11
2,Number of Cells,5226452
3,Size (Bytes),974952215

Unnamed: 0,Data Type,Number of Features
0,Number of Nominal Data Types,7
1,Number of Continuous Data Types,2
2,Number of Discrete Data Types,1
3,Number of Ordinal Data Types,1


The structure, in terms of shape, size, and type, comports with expectations.
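
The totals reported above can be reproduced with plain pandas. The following is a minimal sketch, not the implementation behind `dataset.structure`; the helper `summarize_structure` and the toy DataFrame are hypothetical and serve only to illustrate the calculation.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return observation, variable, cell, and memory totals for a DataFrame."""
    n_rows, n_cols = df.shape
    summary = {
        "Number of Observations": n_rows,
        "Number of Variables": n_cols,
        "Number of Cells": n_rows * n_cols,
        # deep=True also counts the contents of object (string) columns
        "Size (Bytes)": int(df.memory_usage(deep=True).sum()),
    }
    return pd.DataFrame(
        {"Characteristic": list(summary.keys()), "Total": list(summary.values())}
    )


# Hypothetical toy data standing in for the app dataset
toy = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["a", "b", "c"], "price": [0.0, 0.99, 4.99]}
)
structure = summarize_structure(toy)
print(structure)
```

The cell count is simply observations times variables, which matches the figures in the output above (475132 × 11 = 5226452).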

### Data Quality Analysis
Data type, cardinality, validity, duplication, and size data are summarized at the variable level. 

In [3]:
dataset.quality


Unnamed: 0,Column,Format,Data Type,Valid,Null,Validity,Cardinality,Percent Unique,Size
0,id,int64,Nominal,475132,0,1.0,475132,1.0,3801056
1,name,object,Nominal,475132,0,1.0,474250,1.0,40449624
2,description,object,Nominal,475132,0,1.0,463635,0.98,1186227495
3,category_id,int64,Nominal,475132,0,1.0,26,0.0,3801056
4,category,object,Nominal,475132,0,1.0,26,0.0,31516841
5,price,float64,Continuous,475132,0,1.0,125,0.0,3801056
6,developer_id,int64,Nominal,475132,0,1.0,265367,0.56,3801056
7,developer,object,Nominal,475132,0,1.0,264402,0.56,37093493
8,rating,float64,Ordinal,475132,0,1.0,44083,0.09,3801056
9,ratings,int64,Discrete,475132,0,1.0,14531,0.03,3801056


These variable-level metadata indicate no missing data: every variable has a validity of 1.0 and a null count of 0. The data formats align with, and are appropriate for, the data types.
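
For readers without access to the `AppDataDataset` class, a comparable variable-level summary can be sketched with pandas alone. This is an illustration under assumptions: the helper `summarize_quality` and the toy data are hypothetical, and the actual `dataset.quality` property may compute its columns differently.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize validity, cardinality, and size for each column."""
    n = len(df)
    quality = pd.DataFrame(
        {
            "Column": list(df.columns),
            "Format": [str(t) for t in df.dtypes],
            "Valid": df.notna().sum().to_numpy(),
            "Null": df.isna().sum().to_numpy(),
            # nunique() ignores missing values by default
            "Cardinality": df.nunique().to_numpy(),
            "Size": df.memory_usage(deep=True, index=False).to_numpy(),
        }
    )
    quality["Validity"] = quality["Valid"] / n
    quality["Percent Unique"] = (quality["Cardinality"] / n).round(2)
    return quality


# Hypothetical toy data with one missing name and one repeated name
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "name": ["a", "a", None, "b"],
        "price": [0.0, 0.0, 1.99, 0.0],
    }
)
quality = summarize_quality(toy)
print(quality)
```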

#### Partial Duplication Analysis
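
The cardinalities above show that `name` and `description` have fewer unique values than there are observations, so some records share these fields. One way to surface such records is sketched below, assuming the data are available as a pandas DataFrame; the helper `partial_duplicates` and the toy data are hypothetical and not part of the `appstore` package.

```python
import pandas as pd


def partial_duplicates(df: pd.DataFrame, subset: list) -> pd.DataFrame:
    """Return every row that shares its values in `subset` with another row."""
    # keep=False flags all members of a duplicate group, not just the repeats
    mask = df.duplicated(subset=subset, keep=False)
    return df.loc[mask].sort_values(subset, kind="stable")


# Hypothetical toy data: unique ids, repeated names and descriptions
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "name": ["Chess", "Chess", "Notes", "Maps"],
        "description": ["Play chess", "Play chess online", "Take notes", "Take notes"],
    }
)

dup_names = partial_duplicates(toy, ["name"])
dup_descriptions = partial_duplicates(toy, ["description"])
full_duplicates = int(toy.duplicated().sum())  # rows identical in every column

print(dup_names)
print(dup_descriptions)
print(full_duplicates)
```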


#### Numeric Data Quality
Valid values for the numeric variables are:

| Variable                | Data Type  | Valid Values                                    |
|-------------------------|------------|-------------------------------------------------|
| price                   | Continuous | Non negative values                             |
| rating                  | Interval   | Real valued in [0,5]                            |
| ratings                 | Discrete   | Discrete and non-negative                       |
| rating_current_version  | Interval   | Real valued in [0,5]                            |
| ratings_current_version | Discrete   | Discrete and non-negative                       |
| released                | Continuous | Datetimes between June 10, 2008 and present day. |
| released_current        | Continuous | Datetimes between June 10, 2008 and present day. |

Let's examine the descriptive statistics for these variables.

In [4]:
dataset.describe(include=[np.number, np.datetime64])[["min","max"]]

Unnamed: 0,min,max
id,281736535.0,6449369729.0
category_id,6000.0,6027.0
price,0.0,999.99
developer_id,281656478.0,1688884845.0
rating,0.0,5.0
ratings,0.0,30835421.0


All numeric and datetime values are within range.
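
The range rules in the valid-values table can also be checked explicitly rather than by inspecting the minimum and maximum. The sketch below is a hypothetical illustration, assuming a pandas DataFrame with `price`, `rating`, `ratings`, and a datetime `released` column; `check_ranges` is not a method of the dataset class.

```python
import pandas as pd

# Earliest valid release date, taken from the valid-values table
FIRST_RELEASE = pd.Timestamp("2008-06-10")


def check_ranges(df: pd.DataFrame, now=None) -> pd.Series:
    """Count out-of-range values per variable (missing values count as invalid)."""
    now = pd.Timestamp.now() if now is None else now
    checks = {
        "price": df["price"] >= 0,
        "rating": df["rating"].between(0, 5),
        "ratings": df["ratings"] >= 0,
        "released": df["released"].between(FIRST_RELEASE, now),
    }
    return pd.Series(
        {name: int((~ok).sum()) for name, ok in checks.items()},
        name="out_of_range",
    )


# Hypothetical toy data with one violation each in price, rating, and released
toy = pd.DataFrame(
    {
        "price": [0.0, 0.99, -1.0],
        "rating": [4.5, 5.5, 0.0],
        "ratings": [10, 0, 3],
        "released": pd.to_datetime(["2010-01-01", "2007-01-01", "2015-05-05"]),
    }
)
violations = check_ranges(toy)
print(violations)
```

A result of zero for every variable would correspond to the conclusion stated above.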

#### Non-Numeric Data Quality

The id, name, description, developer_id, and developer variables are nominal, high-cardinality variables with no restrictions on the values they may contain. Category and category_id, in contrast, must each contain one of 26 valid values.

In [5]:
columns = ['category_id', 'category']
dataset.unique(columns=columns)

Unnamed: 0,index,category_id,category
0,0,6013,Health & Fitness
1,13,6017,Education
2,17,6000,Business
3,24,6012,Lifestyle
4,25,6004,Sports
5,88,6014,Games
6,109,6007,Productivity
7,122,6002,Utilities
8,141,6027,Graphics & Design
9,164,6010,Navigation


Category and category_id values are within spec.
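
Beyond listing the unique values, it can be useful to confirm that each category_id corresponds to exactly one category name and vice versa. A minimal sketch follows, assuming a pandas DataFrame with `category_id` and `category` columns; the helper `check_category_mapping` is hypothetical.

```python
import pandas as pd


def check_category_mapping(df: pd.DataFrame) -> bool:
    """Return True if category_id and category map one-to-one."""
    pairs = df[["category_id", "category"]].drop_duplicates()
    ids_per_name = pairs.groupby("category")["category_id"].nunique()
    names_per_id = pairs.groupby("category_id")["category"].nunique()
    return bool((ids_per_name == 1).all() and (names_per_id == 1).all())


# Toy data using a few id/name pairs from the output above
consistent = pd.DataFrame(
    {
        "category_id": [6013, 6017, 6014, 6014],
        "category": ["Health & Fitness", "Education", "Games", "Games"],
    }
)
# Hypothetical inconsistent data: id 6014 appears with two different names
inconsistent = pd.DataFrame(
    {
        "category_id": [6014, 6014],
        "category": ["Games", "Education"],
    }
)

mapping_ok = check_category_mapping(consistent)
mapping_bad = check_category_mapping(inconsistent)
print(mapping_ok, mapping_bad)
```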

The dataset evinces no missing data, and variable values are appropriate for their data types and conform to the specifications of the data. Some partial duplication exists within the name and description variables, whose cardinalities fall below the number of observations.