# AppData Cleaning
The 'AppData' dataset encapsulates the core descriptive and rating data obtained from the App Store and contains the following variables.

| #  | attribute     | type  | description                                  | API Field         |
|----|---------------|-------|----------------------------------------------|-------------------|
| 1  | id            | int   | Unique Apple app identifier                  | trackId           |
| 2  | name          | str   | Name of the app                              | trackName         |
| 3  | description   | str   | Description of the app                       | description       |
| 4  | category_id   | int   | Four-digit category identifier               | primaryGenreId    |
| 5  | category      | str   | Category name                                | primaryGenreName  |
| 6  | price         | float | Cost of the app                              | price             |
| 7  | rating        | float | Average user rating                          | averageUserRating |
| 8  | ratings       | int   | Number of user ratings                       | userRatingCount   |
| 9  | developer_id  | int   | App developer identifier                     | artistId          |
| 10 | developer     | str   | App developer name                           | artistName        |
| 11 | released      | str   | Date of initial release                      | releaseDate       |
| 12 | source        | str   | Host from which the data were obtained       | itunes.apple.com  |

Our aim here is to prepare these raw data for exploratory data analysis. After some dependency housekeeping, we begin with an overview and profile of the data.

In [1]:
import os

import pandas as pd
from IPython.display import HTML

from aimobile.container import AIMobileContainer
from aimobile.data.analysis.profile import Profiler

container = AIMobileContainer()
container.init_resources()
container.wire(packages=["aimobile.data.acquisition.appstore"])

## AppData Overview

In [2]:
uow = container.data.uow()
appdata = uow.appdata_repo.getall()
appdata_profiler = Profiler(data=appdata)
appdata_profiler.overview

| Metric                 | Value      |
|------------------------|------------|
| Number of Variables    | 12         |
| Number of Observations | 513183     |
| Number of Cells        | 6158196    |
| Missing Cells          | 0          |
| Missing Cells (%)      | 0.0        |
| Duplicate Rows         | 0          |
| Duplicate Rows (%)     | 0.0        |
| Size (Bytes)           | 1576810257 |


From this overview, we have:
1. Over 513,000 observations across 12 variables,
2. Validity of 100%, indicating no missing values, and
3. Row cardinality of 100%, indicating no fully duplicated rows.
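
For readers without access to the `Profiler` class, the overview metrics can be approximated with plain pandas. This is a minimal sketch, not the `aimobile` implementation: the `overview` function and the toy frame are hypothetical, and `Size (Bytes)` is assumed to correspond to deep memory usage.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.Series:
    """Dataset-level profile: shape, missing cells, duplicate rows, size."""
    n_rows, n_cols = df.shape
    n_cells = n_rows * n_cols
    missing = int(df.isna().sum().sum())      # NaN/None cells across all columns
    duplicates = int(df.duplicated().sum())   # rows identical to an earlier row
    return pd.Series(
        {
            "Number of Variables": n_cols,
            "Number of Observations": n_rows,
            "Number of Cells": n_cells,
            "Missing Cells": missing,
            "Missing Cells (%)": 100 * missing / n_cells if n_cells else 0.0,
            "Duplicate Rows": duplicates,
            "Duplicate Rows (%)": 100 * duplicates / n_rows if n_rows else 0.0,
            # Assumption: size means deep memory usage, including string contents.
            "Size (Bytes)": int(df.memory_usage(deep=True).sum()),
        }
    )


# Hypothetical toy data, only to exercise the function.
toy = pd.DataFrame(
    {"id": [1, 2, 2], "name": ["a", "b", "b"], "price": [0.0, 1.99, 1.99]}
)
result = overview(toy)
```

Note that `DataFrame.duplicated` only flags rows that match on every column, which is why a dataset can report zero duplicate rows and still contain repeated identifiers.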

## AppData Summary
Let's summarize the dataset at the variable level.

In [3]:
appdata_profiler.summary

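| # | Column       | Dtype   | Valid  | Missing | Validity | Unique | Cardinality | Size       |
|---|--------------|---------|--------|---------|----------|--------|-------------|------------|
| 0 | id           | int64   | 513183 | 0       | 1.0      | 461878 | 0.9         | 4105464    |
| 1 | name         | object  | 513183 | 0       | 1.0      | 461358 | 0.9         | 43714521   |
| 2 | description  | object  | 513183 | 0       | 1.0      | 451349 | 0.88        | 1356704636 |
| 3 | category_id  | int64   | 513183 | 0       | 1.0      | 26     | 0.0         | 4105464    |
| 4 | category     | object  | 513183 | 0       | 1.0      | 26     | 0.0         | 34058664   |
| 5 | price        | float64 | 513183 | 0       | 1.0      | 116    | 0.0         | 4105464    |
| 6 | developer_id | int64   | 513183 | 0       | 1.0      | 258212 | 0.5         | 4105464    |
| 7 | developer    | object  | 513183 | 0       | 1.0      | 257297 | 0.5         | 39955896   |
| 8 | rating       | float64 | 513183 | 0       | 1.0      | 52917  | 0.1         | 4105464    |
| 9 | ratings      | int64   | 513183 | 0       | 1.0      | 20026  | 0.04        | 4105464    |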
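
The figures in the summary are consistent with validity being the share of non-missing values and cardinality being the ratio of unique to valid values (for example, 461878 / 513183 ≈ 0.9 for id). Under that assumption, a comparable per-column summary could be sketched in plain pandas as follows; `summarize` and the toy frame are hypothetical and do not reproduce the `Profiler` internals.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column validity and cardinality, indexed by column name."""
    n = len(df)
    valid = df.notna().sum()   # non-missing values per column
    unique = df.nunique()      # distinct non-missing values per column
    return pd.DataFrame(
        {
            "Dtype": df.dtypes.astype(str),
            "Valid": valid,
            "Missing": n - valid,
            "Validity": (valid / n).round(2),
            # Assumption: cardinality = unique values / valid values.
            "Cardinality": (unique / valid).round(2),
            "Unique": unique,
        }
    )


# Hypothetical toy data: one id appears twice.
toy = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]})
summary = summarize(toy)
```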


At the variable level, two primary issues stand out. First, the cardinality of the id, name, and developer variables suggests some partial duplication in the dataset. Second, the data types should reflect the meaning of the data: nominal variables should be encoded as such before the exploratory data analysis begins.
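
As an illustration of the second point, nominal variables can be cast to the pandas `category` dtype. This is a sketch under assumptions: the choice of `category_id` and `category` as the nominal columns, the `encode_nominal` helper, and the toy frame are all hypothetical and are not taken from the project code.

```python
import pandas as pd

# Assumption: these columns are treated as nominal; adjust to the actual schema.
NOMINAL = ["category_id", "category"]


def encode_nominal(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with the given columns cast to the category dtype."""
    out = df.copy()
    for col in columns:
        if col in out.columns:  # skip names that are absent from the frame
            out[col] = out[col].astype("category")
    return out


# Hypothetical toy data, only to exercise the function.
toy = pd.DataFrame(
    {
        "category_id": [6014, 6014, 6005],
        "category": ["Games", "Games", "Social Networking"],
        "price": [0.0, 0.99, 0.0],
    }
)
typed = encode_nominal(toy, NOMINAL)
```

Besides documenting intent, the `category` dtype usually reduces memory for low-cardinality string columns, since each distinct label is stored only once.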

## AppData Partial Duplication
Our strategy for addressing the partial duplication rests on the supposition that the id variable does not uniquely identify each observation in our data. Indeed, we have multiple releases of apps, and the id/release duality is not captured in the data in its current form. To test this hypothesis, let's examine how often the same app record recurs. The following returns a summary of value counts of value counts for the combination of id, name, category, developer_id, ratings, price, and rating; released is not part of the key, so records that share these values but differ in release date fall into the same group.

In [10]:
# Count rows per key combination, then tally how often each group size occurs.
appdata_profiler.value_counts(
    x=["id", "name", "category", "developer_id", "ratings", "price", "rating"],
    threshold=2,
)["count"].value_counts()

2    21257
3        3
Name: count, dtype: int64
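
The "value counts of value counts" idea can also be expressed in plain pandas. The sketch below assumes that `threshold=2` keeps only key combinations occurring at least twice; that reading of the `Profiler.value_counts` parameter is an assumption, and `duplicate_group_sizes` and the toy frame are hypothetical.

```python
import pandas as pd


def duplicate_group_sizes(
    df: pd.DataFrame, subset: list[str], threshold: int = 2
) -> pd.Series:
    """Tally how many key combinations occur 2, 3, ... times."""
    # Size of each group of rows sharing the same values in `subset`.
    counts = df.groupby(subset, dropna=False).size()
    # Assumption: the threshold keeps groups with at least that many rows.
    counts = counts[counts >= threshold]
    # Index = group size, value = number of groups of that size.
    return counts.value_counts().sort_index()


# Hypothetical toy data: ids 1 and 4 appear twice, id 2 three times, id 3 once.
toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 2, 3, 4, 4],
        "name": ["a", "a", "b", "b", "b", "c", "d", "d"],
        "released": [
            "2020-01-01", "2021-01-01", "2019-05-01", "2020-05-01",
            "2021-05-01", "2018-03-01", "2020-07-01", "2022-07-01",
        ],
    }
)
sizes = duplicate_group_sizes(toy, subset=["id", "name"])
```

Read the same way, the output above would indicate 21,257 key combinations that occur twice and 3 that occur three times.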