# Android App Market Analysis - What Makes an App Popular?
## Project Proposal

*95-885 Data Science and Big Data - Project 1: Exploratory Data Analysis*

**Team 15<br>
Daniel You (sangwony)<br>
Jiaxuan Zhang (jiaxuanz)<br>
Yuran Zhu (yuranz)**

## I. Motivation

### 1. Key Motivation

As science and technology advance rapidly, app markets are full of opportunities as well as challenges. While many public datasets cover the Apple App Store, few comparable datasets or visualizations are available for Google Play Store apps. Through visualization of Google Play Store app data, we aim to give app developers actionable insights into the Android market.

### 2. Significance & Potential Value

With over 2.7 billion smartphone users across the world, it’s no surprise that the mobile app industry is thriving. App usage and smartphone penetration are still growing at a steady rate, without any signs of slowing down in the foreseeable future. 

According to some [statistics](https://buildfire.com/app-statistics/), the Apple App Store has 2.2 million apps available for download, while the Google Play Store has 2.8 million. In addition, developers pay Apple \\$99/year to keep their apps online, but only [$25 one time for Google Play Store](https://yalantis.com/blog/apple-app-store-and-google-play-store/). Given the stricter and slower app release process in the Apple App Store, many developers are more satisfied with the policies for Android apps. Therefore, a thorough analysis of Google Play Store data is valuable for developers who want to put more effort into the Android market. In the long run, it may create enormous value for both the public and private sectors as they keep up with the era of mobile internet.


## II. Related Work

This problem originated from some [articles](https://www.theverge.com/2020/8/7/21358355/facebook-apple-app-store-policies-comments-facebook-gaming-ios) we read about Apple App Store policy conflicts. A question then occurred to us: what if developers want to switch from iOS to Android and need to understand the state of app development in the Android market? While the Android market is as important as the iOS market in many countries, few public datasets and analyses are available for it. Before going further into this problem, we want to question, wrangle, and explore the datasets we found on [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps) using the visualization and storytelling skills we discussed in class.

## III. Data

### 1. Nature of Data

The data is scraped from the Google Play Store and is available at [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps). The collection contains **two datasets**, and we plan to analyze both to develop a comprehensive data exploration. The data we use is therefore spread across two tables, which we will join in further analysis.

One dataset contains app information, with values for category, rating, size, download counts, etc. The other dataset contains app reviews: each row refers to a single review of an app and its corresponding sentiment analysis.

#### Apps Information Data

The dataset consists of **13 columns** and **10,842 rows**. Each row refers to a single Android app, and each column refers to one app characteristic, such as category, overall rating, size, or price. All columns are of object type except for the `Rating` column, which is of float type. From initial data exploration, we find:
- The `Installs` column can serve as one of the dependent variables; it can be categorized into 22 bins by number of installs.
- The `Rating` column is missing 1,474 values; its mean is 4.19 and its standard deviation is 0.54.
- The `Type` column is missing 1 value and has 3 unique values.
- The `Content Rating` column is missing 1 value and has 6 unique values.
- The `Current Ver` column is missing 8 values and has 10,833 unique values.
- The `Android Ver` column is missing 3 values and has 33 unique values.
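
The per-column figures above (data type, missing values, unique values) can be produced with a small helper. The following is a minimal sketch in pandas, not the exact code used for the numbers above; the helper name `summarize_columns` and the tiny sample table are made up for illustration.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count, and unique-value count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(dropna=True),
    })

# Tiny made-up sample standing in for the real apps table (assumption).
sample = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, None, 3.9, 4.1],
    "Type": ["Free", "Paid", None, "Free"],
})

summary = summarize_columns(sample)
print(summary)

# Mean and standard deviation of the numeric rating column.
rating_stats = sample["Rating"].describe()[["mean", "std"]]
print(rating_stats)
```

Running the same helper on the full apps table would give one row per column, which makes it easy to spot columns that need cleaning.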


#### Users Review Data

This dataset consists of **5 columns** and **64,295 rows**. Each row refers to a single user review for an app, including the sentiment type (positive/negative) and scores for sentiment polarity and sentiment subjectivity. **26,862 rows** contain missing values in the `Translated_Review`, `Sentiment`, `Sentiment_Polarity`, and `Sentiment_Subjectivity` columns. Therefore, we need to clean the data first.

To conclude, the data is not clean, so we need to perform data preprocessing before starting to analyze and visualize.
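
As a rough illustration of what this preprocessing might involve, the sketch below converts text columns to numbers and drops reviews without sentiment values. The raw formats it handles (`10,000+` for installs, `$4.99` for price, `19M` / `512k` / `Varies with device` for size) are assumptions about how the scraped values look, and the parser names are hypothetical; the sample rows are made up.

```python
import numpy as np
import pandas as pd

def parse_installs(value):
    """Turn '10,000+' into 10000.0; return NaN when the text is not numeric."""
    text = str(value).replace(",", "").replace("+", "").strip()
    return float(text) if text.isdigit() else np.nan

def parse_price(value):
    """Turn '$4.99' into 4.99 and '0' into 0.0; return NaN otherwise."""
    text = str(value).replace("$", "").strip()
    try:
        return float(text)
    except ValueError:
        return np.nan

def parse_size_mb(value):
    """Turn '19M' into 19.0 and '512k' into 0.5 (in MB); NaN for anything else."""
    text = str(value).strip()
    try:
        if text.endswith("M"):
            return float(text[:-1])
        if text.endswith("k"):
            return float(text[:-1]) / 1024
    except ValueError:
        pass
    return np.nan

# Made-up rows, including one malformed row, standing in for the apps table.
apps = pd.DataFrame({
    "Installs": ["10,000+", "1,000,000+", "Free"],
    "Price": ["0", "$4.99", "Everyone"],
    "Size": ["19M", "512k", "Varies with device"],
})
apps["Installs"] = apps["Installs"].map(parse_installs)
apps["Price"] = apps["Price"].map(parse_price)
apps["Size"] = apps["Size"].map(parse_size_mb)

# Made-up reviews; rows without sentiment values are dropped.
reviews = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Sentiment": ["Positive", None, "Negative"],
    "Sentiment_Polarity": [0.6, None, -0.4],
})
reviews_clean = reviews.dropna(subset=["Sentiment", "Sentiment_Polarity"])
```

Values that cannot be parsed become `NaN` rather than raising an error, so malformed rows can be inspected and handled in a separate step.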

### 2. Other Data Use

To our knowledge, this data has been used in other projects, especially on Kaggle, where many participants have used it to obtain interesting findings for their own purposes. For example:

- Tinotenda Mhalanga used the dataset to find the most popular category that has the largest number of installs. 
- Shoumik performed feature engineering to come up with novel features based on a functional understanding of the dataset.
- Anita Soni used the Google Play Store apps dataset to find apps that require GPS permission.

## IV. Question

### 1. Interesting Problems

With the dataset, we’re able to analyze the Google Play Store market and seek answers to questions related to app variety, pricing strategy, app size, and content design. More specifically, we expect to answer the following questions, which are of interest to both app developers and users.

1. What categories of apps are most available/popular?
2. Do paid apps perform better than free apps? For paid apps, do those with higher ratings commonly have a price in a reasonable range? What categories tend to set higher prices?
3. Do users prefer smaller apps? That is to say, for each category, do top-rated apps tend to be smaller? Can we find a reasonable range for app size?
4. Among current apps, which categories receive the most positive reviews? What factors make an app positively/negatively evaluated? 

### 2. How to Address Questions

Using data analysis skills and visualization techniques, we can address each of the above questions from different perspectives.

- For Question 1, we will compare availability using the total number of apps in each category, and analyze popularity in terms of installs, ratings, and reviews.
- For Question 2, we will mainly evaluate performance through ratings. Grouping by category, we can explore the distributions of app ratings and prices to reveal the relationship between them and detect a reasonable price range.
- For Question 3, we plan to visualize how ratings and download counts vary with app size for different categories. We expect to find a reasonable range of app size from the visualization.
- For Question 4, we’ll focus on the reviews dataset. Using the sentiment polarity data, we can analyze differences in review sentiment between categories. By generating word clouds for positive and negative reviews, we can find the most common words and, in particular, identify important factors that lead users to like or dislike an app.
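
The category-level comparisons described above could be sketched with pandas `groupby` and `merge` as follows. This is only an illustration on made-up rows: the `App` join key and the cleaned numeric `Installs` values are assumptions, and the real analysis may aggregate differently.

```python
import pandas as pd

# Made-up, already-cleaned sample tables (assumption).
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Type": ["Free", "Paid", "Free", "Free"],
    "Rating": [4.5, 4.0, 3.8, 4.2],
    "Installs": [1_000_000, 10_000, 500_000, 100_000],
})
reviews = pd.DataFrame({
    "App": ["A", "A", "C", "D"],
    "Sentiment_Polarity": [0.8, -0.2, 0.5, 0.1],
})

# Question 1: availability (app count) and popularity (installs, rating) per category.
by_category = apps.groupby("Category").agg(
    app_count=("App", "count"),
    total_installs=("Installs", "sum"),
    mean_rating=("Rating", "mean"),
)

# Question 2: average rating of free vs. paid apps.
rating_by_type = apps.groupby("Type")["Rating"].mean()

# Question 4: join reviews to apps, then average sentiment polarity per category.
merged = reviews.merge(apps[["App", "Category"]], on="App", how="inner")
polarity_by_category = merged.groupby("Category")["Sentiment_Polarity"].mean()

print(by_category)
print(rating_by_type)
print(polarity_by_category)
```

If the apps table contains repeated app names, it should be deduplicated before the merge; otherwise each review would be counted once per duplicate row.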

### 3. Workflow
1. Data preprocessing
<br>
2. Category by category analysis
    - App availability and popularity
    - Share of free and paid apps
    - For paid apps: how ratings/download counts vary by prices
    - How ratings/download counts vary by size
    - Find a reasonable range for pricing strategy and app size design
<br>
3. Reviews analysis
    - Sentiment analysis by categories
    - Word clouds for positive and negative reviews
    - Explore the factors users care about in an app
<br>
4. Summarize findings and develop conclusions
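
For the word-cloud step of the reviews analysis, the underlying word counts could be computed along these lines. This is a minimal sketch using only the standard library; the stop-word list, the `top_words` helper, and the sample reviews are all made up for illustration, and a word-cloud library would normally take such counts (or the raw text) as input.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "to", "this", "i", "of"}

def top_words(texts, n=3):
    """Count lowercase words across texts, skipping stop words; return the n most common."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Made-up review texts standing in for positive and negative reviews.
positive = ["Great app, easy to use", "Great design and easy setup"]
negative = ["Crashes a lot", "The app crashes and ads everywhere"]

top_positive = top_words(positive, n=2)
top_negative = top_words(negative, n=1)
print(top_positive)
print(top_negative)
```

Comparing the most frequent words in the two groups gives a first indication of what users praise or complain about.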

## V. Possible Findings and Implications

At the current stage, we aim to answer the questions we brought up above, which can further provide insights for app developers and reveal some facts about user preferences. We expect to have the following findings:

- From analyzing the availability and popularity of different categories, we can detect which kinds of apps may face greater market demand and development opportunities.
- From analyzing how ratings and download counts vary by price, we aim to find a reasonable price range for paid apps. With this insight, developers can optimize their pricing strategy to attract users and make profits.
- From analyzing how ratings and download counts vary by size, we can provide suggestions for developers on how to optimize their apps.
- From analyzing app reviews, we can find common factors that matter to users when evaluating apps, which is helpful for developers in the design phase.

Overall, we hope to reveal some facts about user preferences in choosing and evaluating apps, and to provide suggestions for developers on finding Android market opportunities and improving pricing strategies and app design.
