## AppData Exploratory Data Analysis
Section one of our exploratory data analysis is a *quantitative* exploration of the following 15 variables for more than 300,000 apps across 26 categories.

| #  | Variable                | Data Type   | Description                                |
|----|-------------------------|-------------|--------------------------------------------|
| 1  | id                      | Nominal     | App ID from the App Store                  |
| 2  | name                    | Nominal     | App Name                                   |
| 3  | description             | Nominal     | App Description                            |
| 4  | category_id             | Categorical | Numeric category identifier                |
| 5  | category                | Categorical | Category name                              |
| 6  | price                   | Continuous  | App Price                                  |
| 7  | developer_id            | Nominal     | Identifier for the developer               |
| 8  | developer               | Nominal     | Name of the developer                      |
| 9  | rating                  | Interval    | Average user rating since first released   |
| 10 | ratings                 | Discrete    | Number of ratings since first release      |
| 11 | rating_current_version  | Interval    | Average customer rating of current release |
| 12 | ratings_current_version | Discrete    | Number of user ratings for current release |
| 13 | released                | Continuous  | Datetime of first release                  |
| 14 | released_current        | Continuous  | Datetime of current release                |
| 15 | version                 | Nominal     | Current version of app                     |
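
The table describes `category` as categorical and `released` as a datetime, while pandas typically loads such columns as `object`. The following is a minimal sketch of how the stored types could be aligned with the table. The toy values and the ISO 8601 timestamp format are assumptions for illustration only, not taken from the dataset.

```python
import pandas as pd

# Toy frame reusing a few column names from the table; all values are made up.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "category": ["Games", "Games", "Books"],
        "price": [0.0, 1.99, 4.99],
        # Assumed timestamp format (ISO 8601, UTC); the real format may differ.
        "released": [
            "2020-01-15T08:00:00Z",
            "2021-06-01T12:30:00Z",
            "2019-11-20T00:00:00Z",
        ],
    }
)

# Cast the string columns to the types the table describes.
typed = toy.assign(
    category=toy["category"].astype("category"),
    released=pd.to_datetime(toy["released"], utc=True),
)

print(typed.dtypes)
```

Casting `category` to a categorical dtype can also reduce memory use when a column has few distinct values relative to its length.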


Our investigation will center around the following five analyses.

1. Structural Analysis: Examine the overall shape, structure, and type of the data.
2. Data Quality Analysis: Assess quality and suitability of the data in terms of missing values, outliers, duplication, cardinality, and feature values.
3. Univariate Analysis: Explore the distributions of rating count, average rating, categories, and price.
4. Bivariate Analysis: Evaluate relationships between pairs of variables, such as ratings, rating count, and reviews, including correlation analysis.
5. Multivariate Analysis: Cluster, factor, and correspondence analysis of three or more variables simultaneously.
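
As an illustration of the data quality step, the checks for missing values, duplication, and cardinality could be sketched as below. This is a hedged example on made-up data; `quality_report` is a hypothetical helper introduced here, not part of the `appstore` package.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing-value count, missing share (%), and cardinality."""
    return pd.DataFrame(
        {
            "missing": df.isna().sum(),
            "missing_pct": df.isna().mean() * 100,
            "unique": df.nunique(),  # distinct non-null values per column
        }
    )


# Toy data: one fully duplicated row and one missing category.
toy = pd.DataFrame(
    {
        "id": [1, 2, 2, 3],
        "category": ["Games", "Books", "Books", None],
        "price": [0.0, 1.99, 1.99, 4.99],
    }
)

report = quality_report(toy)
n_duplicate_rows = int(toy.duplicated().sum())  # rows identical to an earlier row

print(report)
print("duplicate rows:", n_duplicate_rows)
```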


**Dependencies** 

In [1]:
from appstore.infrastructure.io.local import IOService

In [2]:
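# Archived snapshot and current copy of the app data.
fp1 = "data/appstore/archive/appdata_05-20-2023_22-24-04.pkl"
fp2 = "data/appdata/appdata.pkl"
df1 = IOService.read(fp1)
df2 = IOService.read(fp2)
# Inspect shape, column dtypes, and non-null counts of both datasets.
df1.info()
df2.info()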


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 678137 entries, 0 to 678136
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   id                       678137 non-null  int64  
 1   name                     678137 non-null  object 
 2   description              678137 non-null  object 
 3   category_id              678137 non-null  int64  
 4   category                 678137 non-null  object 
 5   price                    678137 non-null  float64
 6   developer_id             678137 non-null  int64  
 7   developer                678137 non-null  object 
 8   rating                   678137 non-null  float64
 9   ratings                  678137 non-null  int64  
 10  rating_current_version   678137 non-null  float64
 11  ratings_current_version  678137 non-null  int64  
 12  released                 678137 non-null  object 
 13  released_current         678137 non-null  object 
 14  vers

In [3]:
def summarize(df):
    """Summarize rows, unique apps, mean rating, and total rating count by category."""
    # Number of rows (examples) per category.
    summary = df["category"].value_counts().reset_index()
    summary.columns = ["category", "Examples"]
    # Per-category aggregates; local names avoid shadowing the module-level df2.
    apps = df.groupby(by="category")["id"].nunique().to_frame()
    avg_rating = df.groupby(by="category")["rating"].mean().to_frame()
    rating_count = df.groupby(by="category")["ratings"].sum().to_frame()

    summary = summary.join(apps, on="category")
    summary = summary.join(avg_rating, on="category")
    summary = summary.join(rating_count, on="category")
    summary.columns = ["Category", "Examples", "Apps", "Average Rating", "Rating Count"]
    return summary
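
The same per-category summary could plausibly be computed in a single pass with pandas named aggregation. The sketch below is an alternative, not the notebook's method. `summarize_agg` is a hypothetical name and the toy data is made up. Unlike `value_counts`, `groupby` sorts the result by category name rather than by row count.

```python
import pandas as pd


def summarize_agg(df: pd.DataFrame) -> pd.DataFrame:
    """Rows, unique apps, mean rating, and total rating count per category."""
    return (
        df.groupby("category")
        .agg(
            Examples=("id", "size"),  # rows per category
            Apps=("id", "nunique"),  # distinct app ids per category
            # Column names containing spaces are passed via dict unpacking.
            **{
                "Average Rating": ("rating", "mean"),
                "Rating Count": ("ratings", "sum"),
            },
        )
        .reset_index()
        .rename(columns={"category": "Category"})
    )


# Toy data: app 1 appears twice in Games, so Examples (3) exceeds Apps (2).
toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 3],
        "category": ["Games", "Games", "Games", "Books"],
        "rating": [4.0, 4.0, 2.0, 5.0],
        "ratings": [10, 10, 5, 1],
    }
)

result = summarize_agg(toy)
print(result)
```

In the toy data the duplicated row also inflates `Rating Count`, because the sum runs over rows rather than over distinct apps.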


In [6]:
sum(summarize(df1)["Apps"])
sum(summarize(df2)["Apps"])

302765

475132
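
One caveat when reading these totals: summing per-category unique counts counts an app once for each category it appears in, so the total can exceed the number of distinct apps whenever an id is listed under several categories. Whether that happens in this dataset is not established here. The toy example below, with made-up values, only illustrates the difference.

```python
import pandas as pd

# Toy data: app 1 is listed under two categories.
toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 3],
        "category": ["Games", "Education", "Games", "Books"],
    }
)

# Sum of distinct ids within each category (app 1 is counted twice).
per_category_total = int(toy.groupby("category")["id"].nunique().sum())

# Distinct ids across the whole frame.
overall_unique = int(toy["id"].nunique())

print(per_category_total, overall_unique)
```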

In [5]:
fp = "data/appstore/archive/appdata_05-20-2023_22-24-04.pkl"
df = IOService.read(fp)