# Profitable App Profiles for the App Store and Google Play Markets

This project is undertaken from the perspective of a data analyst at a company specializing in the development of free-to-download mobile applications for Google Play and the App Store. The company's primary revenue model relies on in-app advertisements; consequently, user engagement is a critical determinant of financial performance.

Through the analysis of mobile application data, this project aims to identify application categories that attract a substantial user base and foster high engagement. The objective is to furnish the development team with data-driven recommendations regarding application types that exhibit strong potential for user engagement. Identifying the key attributes of successful applications will enable the company to align its development strategy more effectively with market demand, thereby enhancing revenue generation via optimized ad exposure.


## Opening and Exploring the Data


To optimize time and resources, this analysis utilizes publicly available datasets from Kaggle, negating the need for extensive primary data collection.

The two datasets used are:

-   **[Google Play Store Apps](https://www.kaggle.com/datasets/lava18/google-play-store-apps):** Contains data on approximately ten thousand Android apps from Google Play.

-   **[Mobile App Store](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps):** Contains data on approximately seven thousand iOS apps from the App Store.

The first step is to load each dataset into a pandas DataFrame and take an initial look at its contents.


### Import the pandas library


The `pandas` library is imported to facilitate data manipulation and analysis:


In [1]:
import pandas as pd

### Open the Google Play dataset


Utilizing the imported `pandas` library, the `read_csv()` function is employed to load the Google Play dataset into a `DataFrame` object named `android`.


In [2]:
# Open the Google Play Store dataset as a DataFrame object, setting the header
android = pd.read_csv("../data/googleplaystore.csv", header=0)

### Open the App Store dataset


A similar procedure is followed for the App Store dataset, loading it into a `DataFrame` named `ios`.


In [3]:
# Open the iOS App Store dataset as a DataFrame object, setting the header and index column
ios = pd.read_csv("../data/AppleStore.csv", header=0, index_col=0)

### Create a function to help explore the datasets


To streamline the initial data exploration process, a helper function, `explore_data()`, is defined. This function displays a specified number of rows from a dataset in a clear format and can optionally report the dataset's dimensions.


In [4]:
def explore_data(dataset, row_count=5, rows_and_columns=False):
    """Displays a subset of rows from a given dataset and optionally prints its dimensions (number of rows and columns)."""
    # Display the first `row_count` rows of the dataset
    display(dataset.head(row_count))

    # Prints the number of rows and columns if rows_and_columns is True
    if rows_and_columns:
        print("Number of rows: ", dataset.shape[0])
        print("Number of columns: ", dataset.shape[1])

### Use `explore_data()` to explore the datasets


#### Google Play Store


The `explore_data()` function is utilized to display the first three rows and the dimensions of the Google Play Store dataset:


In [5]:
explore_data(android, 3, True)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


Number of rows:  10841
Number of columns:  13


The output presents the column headers and the initial three records of the Google Play Store dataset. The dataset comprises 10,841 entries (apps) and 13 features (columns).

A preliminary assessment indicates that the columns `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'` are potentially relevant for this analysis.


#### iOS App Store


The same exploratory steps are performed for the iOS App Store dataset:


In [6]:
explore_data(ios, 3, True)

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1


Number of rows:  7197
Number of columns:  16


This dataset contains 7,197 application records and 16 features.

For the App Store dataset, the columns `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'` appear most pertinent to the analytical objectives.

More details about each column can be found in the App Store dataset [documentation](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).


## Data Cleaning


With the datasets loaded and initially explored, the subsequent critical phase is data cleaning.

The data cleaning protocol encompasses these key operations:

-   Identifying and rectifying or removing inaccurate data entries.
-   Detecting and eliminating duplicate records.
-   Filtering out non-English applications, as the company's development focuses on English-speaking audiences.
-   Excluding paid applications, consistent with the company's strategy of developing free-to-download apps.


### Detecting and Deleting Inaccurate Data


The [discussion forum](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion) for the Google Play Store dataset [indicates an erroneous entry at row 10472](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015). This error is attributed to a missing `Category` value, which shifted every subsequent value in that row into the wrong column.

To examine this anomaly, the first row of the dataset is printed alongside the problematic row 10472.


In [7]:
# Print the first row of the dataset and row 10472
android.iloc[[0, 10472]]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


The output confirms that row 10472 pertains to the application `Life Made WI-Fi Touchscreen Photo Frame`. The data for this entry incorrectly lists its `Category` as `1.9` and its `Rating` as `19`. This rating is invalid, as the maximum permissible rating on the Play Store is 5; the first row, for example, has a rating of 4.1.

A comparison between row 10472 and a correctly formatted row (e.g., row 0) shows that the `Category` value is missing for row 10472 and that every subsequent value has shifted one column to the left, an observation consistent with the [dataset's discussion forum](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015).
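
A check along the following lines could flag such a row without relying on the forum post. This is only a minimal sketch on a made-up two-row table: the `toy` DataFrame and the `out_of_range` name are illustrative and not part of the project code, and the only rule it encodes is the 5-star maximum mentioned above.

```python
import pandas as pd

# Toy stand-in for the Google Play data; the second row mimics the shifted row 10472.
toy = pd.DataFrame(
    {
        "App": ["Photo Editor", "Life Made WI-Fi Touchscreen Photo Frame"],
        "Category": ["ART_AND_DESIGN", "1.9"],
        "Rating": [4.1, 19.0],
    },
    index=[0, 10472],
)

# Coerce to numbers so that any text value becomes NaN instead of raising an error.
ratings = pd.to_numeric(toy["Rating"], errors="coerce")

# A rating above the 5-star maximum points to a misaligned or corrupted row.
out_of_range = toy[ratings > 5]
print(out_of_range.index.tolist())
```

Any index labels printed by this check would then be candidates for manual inspection before deletion.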

The identified erroneous row is subsequently removed from the dataset:


In [8]:
# Print the number of rows in the dataset before row deletion
print("Number of Rows Before Row Deletion: ", android.shape[0])
# Drop the misaligned row (its Category value is missing)
android.drop(10472, inplace=True)
# Print the number of rows in the dataset after row deletion
print("Number of Rows After Row Deletion: ", android.shape[0])

Number of Rows Before Row Deletion:  10841
Number of Rows After Row Deletion:  10840


### Detecting and Removing Duplicate Data


#### Google Play Store


Further exploration of the Google Play Store dataset reveals the presence of duplicate application entries. For example, the application "Instagram" is listed multiple times:


In [9]:
# Find all instances of "Instagram" in the dataset
instagram_android = android.loc[android["App"] == "Instagram"]
instagram_android

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2545,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2604,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2611,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
3909,Instagram,SOCIAL,4.5,66509917,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


The output displays four entries for "Instagram." These entries possess nearly identical values across most columns, with a notable exception in the `Reviews` column. This variance in review counts likely indicates that the data for these duplicate entries were collected at different times.

The `find_duplicates` function is defined below to quantify the extent of such duplicate entries based on a specified subset of columns:


In [10]:
def find_duplicates(dataset, subset):
    """Identifies and counts duplicate rows based on specified columns."""
    # Create a boolean array of apps that have duplicate names
    duplicates_bool = dataset.duplicated(subset=subset)

    # Find all duplicates within the dataset
    duplicates = dataset[duplicates_bool]

    # Find number of duplicate apps
    num_duplicate_apps = duplicates.shape[0]
    print("Number of duplicate apps:", num_duplicate_apps, "\n")

Applying the `find_duplicates()` function to the `App` column of the Google Play Store dataset quantifies the number of duplicate application names.


In [11]:
# Find all duplicate App names in the dataset
find_duplicates(android, "App")

Number of duplicate apps: 1181 



The result indicates 1,181 instances where application names are repeated. A sample of these duplicates is displayed below, showing multiple entries for applications such as `Box` and `Google My Business`.


In [12]:
# Display the first 15 rows in the dataframe of duplicate applications
android_duplicates = android[android.duplicated(subset="App")]
android_duplicates.head(15)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
229,Quick PDF Scanner + OCR FREE,BUSINESS,4.2,80805,Varies with device,"5,000,000+",Free,0,Everyone,Business,"February 26, 2018",Varies with device,4.0.3 and up
236,Box,BUSINESS,4.2,159872,Varies with device,"10,000,000+",Free,0,Everyone,Business,"July 31, 2018",Varies with device,Varies with device
239,Google My Business,BUSINESS,4.4,70991,Varies with device,"5,000,000+",Free,0,Everyone,Business,"July 24, 2018",2.19.0.204537701,4.4 and up
256,ZOOM Cloud Meetings,BUSINESS,4.4,31614,37M,"10,000,000+",Free,0,Everyone,Business,"July 20, 2018",4.1.28165.0716,4.0 and up
261,join.me - Simple Meetings,BUSINESS,4.0,6989,Varies with device,"1,000,000+",Free,0,Everyone,Business,"July 16, 2018",4.3.0.508,4.4 and up
265,Box,BUSINESS,4.2,159872,Varies with device,"10,000,000+",Free,0,Everyone,Business,"July 31, 2018",Varies with device,Varies with device
266,Zenefits,BUSINESS,4.2,296,14M,"50,000+",Free,0,Everyone,Business,"June 15, 2018",3.2.1,4.1 and up
267,Google Ads,BUSINESS,4.3,29313,20M,"5,000,000+",Free,0,Everyone,Business,"July 30, 2018",1.12.0,4.0.3 and up
268,Google My Business,BUSINESS,4.4,70991,Varies with device,"5,000,000+",Free,0,Everyone,Business,"July 24, 2018",2.19.0.204537701,4.4 and up
269,Slack,BUSINESS,4.4,51507,Varies with device,"5,000,000+",Free,0,Everyone,Business,"August 2, 2018",Varies with device,Varies with device
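
The count reported by `find_duplicates()` follows from the default behaviour of `DataFrame.duplicated()`, which flags every repeat of a value but not its first occurrence. The sketch below illustrates the difference between that default and `keep=False` on a made-up five-row table; the `toy` data and variable names are hypothetical.

```python
import pandas as pd

# Made-up app names: "Box" appears three times, the others once.
toy = pd.DataFrame({"App": ["Box", "Slack", "Box", "Box", "Zenefits"]})

# Default (keep="first"): the first "Box" is not flagged, the two repeats are.
extra_copies = toy.duplicated(subset="App")

# keep=False: every row whose name occurs more than once is flagged.
all_copies = toy.duplicated(subset="App", keep=False)

print("Repeats after the first occurrence:", int(extra_copies.sum()))
print("All rows sharing a name:", int(all_copies.sum()))
```

With the default setting, the number of flagged rows equals the number of rows that deduplication should remove, which makes it a convenient figure for later verification.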


The removal of duplicate data is essential to prevent bias and ensure the accuracy of subsequent analyses. For this analysis, the strategy for deduplication involves retaining the entry with the highest number of reviews. This approach assumes that a higher review count correlates with the most recent data capture for that application.

The "Instagram" entries are used as a reference to illustrate this deduplication logic:


In [13]:
# Show the rows with "Instagram" as the App name
instagram_android

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2545,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2604,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2611,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
3909,Instagram,SOCIAL,4.5,66509917,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


Entry with index `2604` exhibits the highest review count among the "Instagram" duplicates. The objective is to retain this specific entry while removing the others.

To implement this, the dataset is first sorted by `App` name and then by `Reviews` in descending order. Subsequently, `duplicated(subset='App')` identifies all but the first occurrence (which, due to sorting, is the one with the highest reviews) of each app as a duplicate.


In [14]:
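# Sort the Google Play Store dataset by App name and number of reviews in descending order, then mark every row after the first for each app as a duplicate (True)
# Reviews is compared as a number here; as text, "9" would sort above "10"
android_sorted_by_reviews = (
    android.assign(Reviews=pd.to_numeric(android["Reviews"], errors="coerce"))
    .sort_values(by=["App", "Reviews"], ascending=False)
    .duplicated(subset="App")
)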

The identified duplicates (those not having the highest review count) are then removed. The DataFrame is subsequently re-sorted by its original index.


In [15]:
# Remove all duplicate apps except the one with the highest number of reviews, and sort the DataFrame by the row index
android_clean = android.loc[~android_sorted_by_reviews].sort_index()
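
The same "keep the row with the most reviews" rule can also be expressed with `sort_values()` followed by `drop_duplicates()`. The sketch below is an illustration on a made-up five-row table, not the code used in this project. It converts `Reviews` to integers first, because review counts stored as text would be sorted character by character.

```python
import pandas as pd

# Made-up data; the review counts are deliberately stored as text.
toy = pd.DataFrame(
    {
        "App": ["Instagram", "Instagram", "Box", "Instagram", "Box"],
        "Reviews": ["66577313", "66577446", "159872", "9", "159870"],
    }
)

# Convert the text counts to integers; as strings, "9" would sort above "66577446".
toy["Reviews"] = toy["Reviews"].astype(int)

deduplicated = (
    toy.sort_values("Reviews", ascending=False)   # highest review count first
    .drop_duplicates(subset="App", keep="first")  # keep that top row for each app
    .sort_index()                                 # restore the original row order
)
print(deduplicated)
```

Either formulation leaves one row per app; the choice between a boolean mask and `drop_duplicates()` is mostly a matter of readability.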

To verify the deduplication for "Instagram," the cleaned dataset is filtered for this application:


In [16]:
# Find all rows in android_clean with "Instagram" as the value in the "App" column
android_clean[android_clean["App"] == "Instagram"]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2604,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


As expected, only the entry with index `2604` for "Instagram" remains. To confirm that the deduplication worked for every app and not just for "Instagram", the row counts are compared: `find_duplicates()` flagged 1,181 repeat entries (every occurrence of an app name after the first), so the cleaned dataset should contain exactly 1,181 fewer rows than `android`.

The cell below prints this expected row count next to the actual number of rows in `android_clean`:


In [17]:
# Expected number of rows in the dataset after duplicates have been removed
print("Expected number of rows:", android.shape[0] - android_duplicates.shape[0])

# Number of rows in the android_clean dataset
print("Actual number of rows: ", android_clean.shape[0])

Expected number of rows: 9659
Actual number of rows:  9659


Based on the output, the DataFrame now contains the anticipated number of applications, confirming the removal of the 1,181 duplicate entries from the original count (after the initial erroneous row deletion).

Calling the `explore_data` function on the `android_clean` DataFrame should now reflect this updated row count of 9,659.


In [18]:
explore_data(android_clean, 3, True)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up


Number of rows:  9659
Number of columns:  13


The duplicates have successfully been removed from the Google Play Store dataset.


#### iOS App Store


For the iOS App Store dataset, duplicate entries are checked using the `id` column, which is expected to be a unique identifier for each application:


In [19]:
# Find all duplicates IDs in the dataset
find_duplicates(ios, "id")

Number of duplicate apps: 0 



The output indicates zero duplicate `id` values in the App Store dataset; therefore, no deduplication steps are required for this dataset based on application ID.


### Removing non-English Apps


As the company's application development is targeted towards an English-speaking audience, this analysis will be restricted to English-language applications.

Both datasets contain applications with names that suggest a primary language other than English:


In [20]:
display(ios.iloc[[814, 1094], 1])
display(android_clean.iloc[[4412, 7940], 0])

926                                     搜狐新闻—新闻热点资讯掌上阅读软件
1279    Dictionary ( قاموس عربي / انجليزي + ودجيت التر...
Name: track_name, dtype: object

5509    Wowkwis aq Ka'qaquj
9116       PHARMAGUIDE (DZ)
Name: App, dtype: object

This section details the methodology for identifying and removing such non-English applications.

The [American Standard Code for Information Interchange (ASCII)](https://en.wikipedia.org/wiki/ASCII) defines standard English characters within the ordinal range of 0 to 127.

A function, `is_English()`, is defined below. This function iterates through each character of an input string, evaluating its Unicode code point to determine if it falls outside the standard ASCII range (0-127).


In [21]:
def is_English(string):
    """Assesses if a string is predominantly English by counting non-ASCII characters. Returns `False` if the count of non-ASCII characters exceeds a threshold of three, `True` otherwise."""
    non_eng_chars = 0
    # Iterate through each character in the string
    for char in string:
        # Store the character's Unicode code point in a variable
        char_unicode = ord(char)
        # Count the character as non-English if it falls outside the ASCII range
        if char_unicode > 127:
            non_eng_chars += 1
            # Stop early once more than three non-English characters are found
            if non_eng_chars > 3:
                return False
    # Return True if there are three or fewer non-English characters in the string
    return True

To accommodate application names containing a limited number of special symbols (e.g., emojis, ™), which may still be targeted at English audiences, the function classifies an application as non-English only if its name contains more than three characters outside the ASCII range.

The `is_English()` function is tested with sample inputs:


In [22]:
# Function test
print(is_English("Instagram"))
print(is_English("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(is_English("Docs To Go™ Free Office Suite"))
print(is_English("Instachat 😜"))

True
False
True
True
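
To see why the threshold produces these results, it can help to count the non-ASCII characters in each sample name directly. The helper below, `count_non_ascii()`, is a hypothetical name introduced only for this illustration; it applies the same `ord(char) > 127` test as `is_English()` but returns the count instead of a boolean.

```python
def count_non_ascii(name):
    """Return how many characters in `name` have a code point above 127."""
    return sum(ord(char) > 127 for char in name)


samples = [
    "Instagram",
    "爱奇艺PPS -《欢乐颂2》电视剧热播",
    "Docs To Go™ Free Office Suite",
    "Instachat 😜",
]

# Print the count next to each name; only counts above 3 are treated as non-English.
for name in samples:
    print(count_non_ascii(name), name)
```

The trademark sign and the emoji each contribute a single non-ASCII character, so those names stay under the threshold, whereas the Chinese title exceeds it many times over.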


The `is_English()` function is now applied to filter the `App` names in the Android dataset and `track_name` in the iOS dataset. Rows where the function returns `False` (indicating likely non-English app names) are removed.


In [23]:
# Keep only the apps with English names in the Google Play Store dataset
android_eng = android_clean[[is_English(app) for app in android_clean["App"]]]

# Keep only the apps with English names in the iOS App Store dataset
ios_eng = ios[[is_English(app) for app in ios["track_name"]]]

print("Google Play Store - English Apps:")
explore_data(android_eng, 3, True)
print("\nApp Store - English Apps:")
explore_data(ios_eng, 3, True)

Google Play Store - English Apps:


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up


Number of rows:  9614
Number of columns:  13

App Store - English Apps:


Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1


Number of rows:  6183
Number of columns:  16


The Google Play Store dataset now has 9,614 apps, and the iOS App Store dataset has 6,183.


### Removing non-Free Apps


The concluding data cleaning procedure is the removal of paid applications, aligning with the company's exclusive focus on developing free-to-download mobile apps.


#### Google Play Store


The Google Play Store dataset has two columns that are useful here: `Price` and `Type`. `Price` is the price of the app at the time the data was scraped, and `Type` labels each app as either `Paid` or `Free`.

Although the `Type` column categorizes applications as `Paid` or `Free`, an inspection reveals instances where an application has a `Price` of `0` but is not designated as `Free` in the `Type` column:


In [24]:
# Show apps from the Google Play Store dataset that have '0' as a value for 'Price' and do not have 'Free' as a value for 'Type'
android_eng[~(android_eng["Type"] == "Free") & (android_eng["Price"] == "0")]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
9148,Command & Conquer: Rivals,FAMILY,,0,Varies with device,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device,Varies with device


To ensure only genuinely free applications are retained and to handle such discrepancies, the Android dataset is filtered to include only those applications where the `Type` is `Free` and the `Price` is `'0'`.


In [25]:
# Keep only the apps from the Google Play Store dataset that have 'Free' and '0' values for 'Type' and 'Price'
android_final = android_eng[
    (android_eng["Type"] == "Free") & (android_eng["Price"] == "0")
]
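
Because `Price` is compared with the string `'0'`, the column is evidently stored as text. If a numeric comparison were preferred, the conversion might look like the sketch below. It runs on a made-up three-row table, and the `$` prefix on the paid price is an assumption about how the raw data is formatted, not something shown in the outputs above.

```python
import pandas as pd

# Made-up rows: a free app, a paid app, and an app with a missing Type.
toy = pd.DataFrame(
    {
        "App": ["Free app", "Paid app", "Untyped app"],
        "Type": ["Free", "Paid", None],
        "Price": ["0", "$4.99", "0"],  # the "$" prefix is assumed for illustration
    }
)

# Strip the assumed currency symbol, then convert the text to numbers.
price_usd = pd.to_numeric(toy["Price"].str.lstrip("$"), errors="coerce")

# Require both conditions, mirroring the filter above: Type is Free and price is zero.
is_free = (toy["Type"] == "Free") & (price_usd == 0)
free_apps = toy[is_free]
print(free_apps["App"].tolist())
```

Requiring both conditions excludes the row whose `Type` is missing even though its price is zero, which matches how the discrepancy above is handled.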

#### iOS App Store


For the iOS App Store dataset, the `price` column, indicating the application's price at the time of data collection, is used to filter out non-free applications. Applications with a `price` of `0.00` are retained.


In [26]:
# Keep only the apps from the iOS App Store dataset that have a 'price' of 0.00
ios_final = ios_eng[ios_eng["price"] == 0.00]

### Final Datasets


Following the comprehensive data cleaning procedures, the final, refined datasets for the Google Play Store and iOS App Store are presented below, along with their respective row counts.


In [27]:
# Print the number of rows in the Google Play Store dataset and the first three rows
print("Google Play Store Free Apps:", android_final.shape[0])
explore_data(android_final, 3)

# Print the number of rows in the iOS App Store dataset and the first three rows
print("\niOS App Store Free Apps:", ios_final.shape[0])
explore_data(ios_final, 3)

Google Play Store Free Apps: 8861


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up



iOS App Store Free Apps: 3222


Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
4,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0.0,262241,649,4.0,4.5,5.10.0,12+,Shopping,37,5,9,1


The preceding data cleaning operations have removed the identified inaccurate entry and the duplicate entries, and have filtered the datasets to English-language, free-to-download applications in line with the project's focus. The resulting datasets, `android_final` with 8,861 apps and `ios_final` with 3,222 apps, provide a refined foundation for the analysis. The next step is to select the columns from these datasets that the analysis will use.


### Selecting Columns for Analysis


To streamline the datasets for the upcoming analysis and focus only on relevant features, a subset of columns is selected from both the `android_final` and `ios_final` DataFrames. Working with fewer columns can also reduce memory usage and processing time.


Based on the project objectives, the following columns are deemed essential for the Google Play dataset: `['App', 'Category', 'Reviews', 'Installs', 'Genres']`.


In [28]:
# Select columns for android_final
android_analytical_df = android_final[
    ["App", "Category", "Reviews", "Installs", "Genres"]
].copy()
explore_data(android_analytical_df, 3, True)

Unnamed: 0,App,Category,Reviews,Installs,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,159,"10,000+",Art & Design
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,87510,"5,000,000+",Art & Design
3,Sketch - Draw & Paint,ART_AND_DESIGN,215644,"50,000,000+",Art & Design


Number of rows:  8861
Number of columns:  5


Similarly, for the iOS App Store dataset, the essential columns are: `['track_name', 'rating_count_tot', 'rating_count_ver', 'prime_genre']`.


In [29]:
# Select columns for ios_final
ios_analytical_df = ios_final[
    ["track_name", "rating_count_tot", "rating_count_ver", "prime_genre"]
].copy()
explore_data(ios_analytical_df, 3, True)

Unnamed: 0,track_name,rating_count_tot,rating_count_ver,prime_genre
2,Evernote - stay organized,161065,26,Productivity
3,"WeatherBug - Local Weather, Radar, Maps, Alerts",188583,2822,Weather
4,"eBay: Best App to Buy, Sell, Save! Online Shop...",262241,649,Shopping


Number of rows:  3222
Number of columns:  4


With the selection of key features now complete, the `android_analytical_df` dataset consists of 8,861 applications and 5 columns, while the `ios_analytical_df` dataset contains 3,222 applications and 4 columns. These streamlined and analysis-focused DataFrames are now fully prepared for the subsequent phase of identifying profitable app profiles.


## Data Analysis


With the datasets now cleaned and prepared, the data analysis phase commences to identify potentially profitable application profiles.

The company's primary revenue stream is derived from in-app advertisements within its portfolio of free-to-download mobile applications. Consequently, user engagement and the active user base size are direct drivers of revenue.

This analysis aims to identify promising application categories by examining the prevalence of different app genres within the Google Play Store and iOS App Store datasets.


### Most Common Apps by Genre


Given the company's objective to launch a new application on both the Google Play Store and the iOS App Store, this initial analysis focuses on identifying app genres that demonstrate high prevalence across both platforms.

To achieve this, frequency tables will be generated to ascertain the most common genres within each market. To confirm the relevant column names for this task, the headers of the analytical datasets are displayed:


In [30]:
display("Google Play Store", android_analytical_df.head(0))
display("iOS App Store", ios_analytical_df.head(0))

'Google Play Store'

Unnamed: 0,App,Category,Reviews,Installs,Genres


'iOS App Store'

Unnamed: 0,track_name,rating_count_tot,rating_count_ver,prime_genre


The analysis will utilize the `Genres` and `Category` columns from the Google Play Store dataset and the `prime_genre` column from the iOS App Store dataset.

To facilitate the generation and presentation of these frequency distributions, a helper function, `freq_table()`, is defined. This function calculates normalized frequencies (percentages) for a specified column, sorts them by frequency in descending order, and formats them for clear presentation.

The `freq_table()` function accepts a dataset and a column name (subset) as input and returns a formatted series representing the frequency table.


In [31]:
def freq_table(dataset, subset):
    """Calculates and formats a sorted frequency table (percentages) for a specified column in a dataset. Frequencies are normalized, converted to percentages, displayed in descending order, and rounded to two decimal places."""
    return (dataset.value_counts(subset=subset, normalize=True) * 100).map(
        "{:,.2f}%".format
    )
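
The effect of normalizing and formatting can be checked on a small example. The sketch below uses a made-up four-row table and calls `Series.value_counts()` on a single column rather than `DataFrame.value_counts(subset=...)`; it is meant only to illustrate the calculation, and the `toy`, `shares`, and `formatted` names are hypothetical.

```python
import pandas as pd

# Made-up data: FAMILY appears twice, GAME and TOOLS once each.
toy = pd.DataFrame({"Category": ["FAMILY", "GAME", "FAMILY", "TOOLS"]})

# normalize=True returns proportions; multiplying by 100 turns them into percentages.
# value_counts() sorts from most to least frequent by default.
shares = toy["Category"].value_counts(normalize=True) * 100

# Format each percentage as text with two decimal places and a percent sign.
formatted = shares.map("{:,.2f}%".format)
print(formatted)
```

Note that the formatted values are strings, so they are suitable for display but not for further arithmetic.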

The `freq_table()` function is first applied to the `Category` column of the Google Play Store dataset:


In [32]:
# Display the frequency table for the Google Play Store's Category column
freq_table(android_analytical_df, "Category")

Category
FAMILY                 18.93%
GAME                    9.69%
TOOLS                   8.45%
BUSINESS                4.59%
LIFESTYLE               3.90%
PRODUCTIVITY            3.89%
FINANCE                 3.70%
MEDICAL                 3.52%
SPORTS                  3.40%
PERSONALIZATION         3.32%
COMMUNICATION           3.24%
HEALTH_AND_FITNESS      3.08%
PHOTOGRAPHY             2.95%
NEWS_AND_MAGAZINES      2.80%
SOCIAL                  2.66%
TRAVEL_AND_LOCAL        2.34%
SHOPPING                2.25%
BOOKS_AND_REFERENCE     2.14%
DATING                  1.86%
VIDEO_PLAYERS           1.79%
MAPS_AND_NAVIGATION     1.40%
FOOD_AND_DRINK          1.24%
EDUCATION               1.17%
ENTERTAINMENT           0.96%
LIBRARIES_AND_DEMO      0.94%
AUTO_AND_VEHICLES       0.93%
HOUSE_AND_HOME          0.82%
WEATHER                 0.80%
EVENTS                  0.71%
PARENTING               0.65%
ART_AND_DESIGN          0.64%
COMICS                  0.62%
BEAUTY                  0.60%
Name: proportion, dtype: object

Next, the frequency distribution for the `Genres` column in the Google Play Store dataset is generated:


In [33]:
# Display the frequency table for the Genres column from the Google Play Store dataset
freq_table(android_analytical_df, "Genres")

Genres
Tools                                 8.44%
Entertainment                         6.07%
Education                             5.35%
Business                              4.59%
Lifestyle                             3.89%
                                      ...  
Strategy;Creativity                   0.01%
Tools;Education                       0.01%
Travel & Local;Action & Adventure     0.01%
Trivia;Education                      0.01%
Video Players & Editors;Creativity    0.01%
Name: proportion, Length: 115, dtype: object

Finally, the frequency distribution for the `prime_genre` column in the iOS App Store dataset is displayed:


In [34]:
freq_table(ios_analytical_df, "prime_genre")

prime_genre
Games                58.16%
Entertainment         7.88%
Photo & Video         4.97%
Education             3.66%
Social Networking     3.29%
Shopping              2.61%
Utilities             2.51%
Sports                2.14%
Music                 2.05%
Health & Fitness      2.02%
Productivity          1.74%
Lifestyle             1.58%
News                  1.33%
Travel                1.24%
Finance               1.12%
Weather               0.87%
Food & Drink          0.81%
Reference             0.56%
Business              0.53%
Book                  0.43%
Navigation            0.19%
Medical               0.19%
Catalogs              0.12%
Name: proportion, dtype: object

#### Google Play Store


##### Category


The `FAMILY` category exhibits the highest prevalence in the Google Play Store dataset, accounting for 18.93% of the listed applications. It is followed by the `GAME` category at 9.69% and `TOOLS` at 8.45%.

These findings suggest that developing an application within the `FAMILY` category could be a viable strategy. Furthermore, an application that spans multiple categories (e.g., a family-oriented game or a utility tool for families) might possess greater market appeal or differentiation compared to an application targeting a single category, potentially enhancing its visibility among numerous competitors.


##### Genre


`Tools` is the most frequent `Genre` in the Google Play Store at 8.44%, followed by `Entertainment` (6.07%) and `Education` (5.35%).

Consistent with observations from the `Category` analysis, applications within the `Tools` genre are highly prevalent. The `Entertainment` and `Education` genres also represent significant portions of the dataset. These genres could potentially complement applications in the prominent `GAME` and `FAMILY` categories, indicating a broad appeal for such integrated functionalities in free applications.
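
Several `Genres` values in the output combine two labels separated by a semicolon (for example `Tools;Education`). One way to see how often each individual label occurs, sketched here on made-up rows rather than the real dataset, is to split on the semicolon and count the pieces. The `toy_genres` series is a hypothetical stand-in for `android_analytical_df["Genres"]`.

```python
import pandas as pd

# Hypothetical sample of Genres values, not taken from the real dataset
toy_genres = pd.Series(
    ["Tools", "Tools;Education", "Education", "Strategy;Creativity", "Tools"]
)

# Split combined labels, give each label its own row, then compute shares
label_share = (
    toy_genres.str.split(";").explode().value_counts(normalize=True) * 100
).round(2)
print(label_share)
```

Counting labels this way treats a `Tools;Education` app as belonging to both genres, so the shares describe labels rather than apps.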


#### iOS App Store


##### prime_genre


According to the frequency analysis, `Games` is by far the most common `prime_genre` in the iOS App Store dataset, constituting 58.16% of the free applications. This high concentration suggests `Games` as a significant genre to consider, particularly for the iOS market.

The next most common genres are `Entertainment` (7.88%) and `Photo & Video` (4.97%).


#### Recommended App Profile


The high prevalence of gaming applications on both platforms, particularly on the iOS App Store, suggests that developing a game could be a promising avenue. Game applications also offer the potential for integration with other categories or genres, potentially broadening their target audience.

It is important to note, however, that a high genre frequency (such as the 58.16% for games on iOS) does not inherently guarantee a larger accessible audience or indicate lower risk. This prevalence could reflect lower barriers to entry for game development or approval, or it might signify intense market competition.

Considering the company's objectives, a gaming application might represent a strategically sound initial venture. I recommend a phased approach: first, develop and launch a game for the Android platform via the Google Play Store, where games constitute 9.69% of the analyzed free applications. User response and engagement metrics should be closely monitored. If the application demonstrates positive traction and achieves profitability within an approximate six-month timeframe, developing an iOS version for the App Store would then be advisable.


### Most Popular Apps by Genre


As mentioned above, the frequency tables show how many apps each genre contains; they do not imply a higher number of users. The next part of this analysis therefore looks at which genres have the most users.


#### iOS App Store


Unlike the Google Play Store dataset, the iOS App Store dataset does not have an `Installs` column that can be used for this analysis, so the total number of user ratings will be used instead.

To calculate the average number of user ratings per app genre, the `rating_count_tot` column will be used.


The code block below groups `ios_analytical_df` by the `prime_genre` column, calculates the mean of `rating_count_tot` for each genre, then sorts and formats the result.


In [35]:
# Display the mean number of user ratings for each genre
ios_analytical_df.groupby("prime_genre")["rating_count_tot"].mean().sort_values(
    ascending=False
).map("{:,.2f}".format)

prime_genre
Navigation           86,090.33
Reference            74,942.11
Social Networking    71,548.35
Music                57,326.53
Weather              52,279.89
Book                 39,758.50
Food & Drink         33,333.92
Finance              31,467.94
Photo & Video        28,441.54
Travel               28,243.80
Shopping             26,919.69
Health & Fitness     23,298.02
Sports               23,008.90
Games                22,788.67
News                 21,248.02
Productivity         21,028.41
Utilities            18,684.46
Lifestyle            16,485.76
Entertainment        14,029.83
Business              7,491.12
Education             7,003.98
Catalogs              4,004.00
Medical                 612.00
Name: rating_count_tot, dtype: object

The table above shows that `Navigation`, `Reference`, `Social Networking`, and `Music` have the highest average number of user ratings per app among all genres in the iOS App Store dataset.

Further analysis shows that, within each of these genres, the average is skewed by a few very popular applications.


The function below displays every application in a specified genre together with its number of user ratings, sorted from most to fewest ratings:


In [36]:
def display_ios_ratings(genre_name):
    display(
        ios_analytical_df.loc[
            (ios_analytical_df["prime_genre"] == genre_name),
            ["track_name", "rating_count_tot"],
        ].sort_values("rating_count_tot", ascending=False)
    )

The `Navigation` genre shows a skew toward two very popular applications:


In [37]:
display_ios_ratings("Navigation")

Unnamed: 0,track_name,rating_count_tot
197,"Waze - GPS Navigation, Maps & Real-time Traffic",345046
1968,Google Maps - Navigation & Transit,154911
228,Geocaching®,12811
1402,CoPilot GPS – Car Navigation & Offline Maps,3582
316,ImmobilienScout24: Real Estate Search in Germany,187
1102,Railway Route Search,5


The output above shows that Waze and Google Maps have 499,957 user ratings combined. That is almost half a million user ratings for only two mobile apps!
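
To make the skew concrete, the six `rating_count_tot` values from the `Navigation` output above can be compared by mean and median. This is a minimal sketch using Python's standard `statistics` module; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# rating_count_tot values copied from the Navigation output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

mean_ratings = mean(navigation_ratings)
median_ratings = median(navigation_ratings)

# Share of all Navigation ratings held by the two most-rated apps
top_two = sorted(navigation_ratings, reverse=True)[:2]
top_two_share = sum(top_two) / sum(navigation_ratings)

print(f"mean:          {mean_ratings:,.2f}")
print(f"median:        {median_ratings:,.2f}")
print(f"top-two share: {top_two_share:.1%}")
```

The mean reproduces the 86,090.33 shown in the genre table, while the median is only 8,196.50, so the median is arguably the better description of a typical `Navigation` app.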


A similar pattern applies to `Social Networking` and `Music` apps:


In [38]:
display_ios_ratings("Social Networking")

Unnamed: 0,track_name,rating_count_tot
17,Facebook,2974676
863,Pinterest,1061624
104,Skype for iPhone,373519
1041,Messenger,351466
105,Tumblr,334293
...,...,...
4893,BestieBox,0
6558,bit-tube - Live Stream Video Chat,0
5988,MATCH ON LINE chat,0
10101,niconico ch,0


In [39]:
display_ios_ratings("Music")

Unnamed: 0,track_name,rating_count_tot
8,Pandora - Music & Radio,1126879
202,Spotify Music,878563
20,"Shazam - Discover music, artists, videos & lyrics",402925
40,iHeartRadio – Free Music & Radio Stations,293228
270,SoundCloud - Music & Audio,135744
...,...,...
6193,"Free Music - Player & Streamer for Dropbox, ...",46
75,NRJ Radio,38
3169,Smart Music: Streaming Videos and Radio,17
8633,BOSS Tuner,13


The `Social Networking` genre's output is heavily skewed by popular apps such as Facebook, Pinterest, Skype, etc.

The `Music` genre is heavily influenced by the ratings of apps such as Pandora, Spotify, Shazam, and iHeartRadio.


What can be inferred from the output above is that apps in the `Navigation`, `Social Networking`, and `Music` genres may not be as popular as the averages suggest. The few dominant apps make up a small percentage of the total number of apps in each genre, yet they have significantly more user ratings than most other apps within the same genre.
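
One way to quantify this kind of concentration is to report, for each genre, the median alongside the mean and the share of ratings held by the top few apps. The sketch below uses made-up numbers; `toy_ios` is a hypothetical stand-in for `ios_analytical_df`, and `top_n_share()` is a helper introduced here, not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows standing in for ios_analytical_df
toy_ios = pd.DataFrame(
    {
        "prime_genre": ["Music"] * 4 + ["Weather"] * 4,
        "rating_count_tot": [1000, 800, 150, 50, 300, 250, 250, 200],
    }
)


def top_n_share(ratings, n=2):
    """Fraction of a genre's ratings that belong to its n most-rated apps."""
    return ratings.nlargest(n).sum() / ratings.sum()


# Named aggregation: one output column per keyword
genre_summary = toy_ios.groupby("prime_genre")["rating_count_tot"].agg(
    mean="mean", median="median", top_2_share=top_n_share
)
print(genre_summary.round(2))
```

In this toy data the `Music` genre has a top-two share of 0.9, the signature of a genre carried by a couple of giants, whereas `Weather` is spread far more evenly.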


As for `Reference` apps, which have the second-highest average number of user ratings, two apps skew the average: Bible and Dictionary.com.


In [40]:
display_ios_ratings("Reference")

Unnamed: 0,track_name,rating_count_tot
5,Bible,985920
130,Dictionary.com Dictionary & Thesaurus,200047
424,Dictionary.com Dictionary & Thesaurus for iPad,54175
779,Google Translate,26786
574,"Muslim Pro: Ramadan 2017 Prayer Times, Azan, Q...",18418
9399,New Furniture Mods - Pocket Wiki & Game Tools ...,17588
669,Merriam-Webster Dictionary,16849
1193,Night Sky,12122
9625,City Maps for Minecraft PE - The Best Maps for...,8535
9485,LUCKY BLOCK MOD ™ for Minecraft PC Edition - T...,4693


This information could be useful for the app development company. Another possible app profile to consider is an app that takes a popular book and turns it into an app with unique features besides the book itself, such as daily quotes from the book or quizzes about it. In addition, building a dictionary into the app would let users look up definitions without having to download or open an external dictionary app.

Earlier, it was determined that over half of the free iOS App Store applications in the dataset are classified as games, which suggests some oversaturation. A more practical app may therefore be more likely to stand out from the plethora of entertainment apps.


#### Google Play Store


To determine which apps have the most users, I will calculate the average number of installs for each app category using the values in the `Installs` column of the Google Play Store dataset.

It's important to note that the `Installs` column is not precise. The output below shows that the values in the `Installs` column are open-ended (e.g., 5,000+, 1,000+):


In [41]:
freq_table(android_analytical_df, "Installs")

Installs
1,000,000+        15.74%
100,000+          11.56%
10,000,000+       10.52%
10,000+           10.20%
1,000+             8.40%
100+               6.92%
5,000,000+         6.84%
500,000+           5.57%
50,000+            4.77%
5,000+             4.51%
10+                3.54%
500+               3.25%
50,000,000+        2.29%
100,000,000+       2.12%
50+                1.92%
5+                 0.79%
1+                 0.51%
500,000,000+       0.27%
1,000,000,000+     0.23%
0+                 0.05%
Name: proportion, dtype: object

For this analysis, the values will be left as they are, so if an app has 10,000+ installs, it will be assumed that the number of installs is 10,000.


The code block below removes the non-digit characters (commas and plus signs) from the `Installs` column and converts the values to integers, so that the average number of installs can be calculated for each category:


In [42]:
# Remove non-digit characters and convert values to int64 dtype
android_analytical_df["Installs"] = (
    android_analytical_df["Installs"].str.replace(r"\D", "", regex=True).astype("int64")
)

The `Installs` column is now free of non-digit characters and has been converted to integer values, which allows the mean to be calculated for each category.
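
As a small illustration of the conversion, the sketch below applies the same regular expression to a hypothetical `toy_installs` series. It also adds a dtype check, which is an assumption of this sketch rather than part of the original cell, so that running the step again on an already-converted column returns it unchanged instead of raising an error from the `.str` accessor.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def installs_to_int(installs):
    """Strip non-digit characters and return int64 values; skip if already integer."""
    if is_integer_dtype(installs):
        return installs
    return installs.str.replace(r"\D", "", regex=True).astype("int64")


# Hypothetical values in the same format as the Installs column
toy_installs = pd.Series(["1,000,000+", "5,000+", "0+"])
converted = installs_to_int(toy_installs)
print(converted.tolist())  # [1000000, 5000, 0]

# A second call leaves the integers untouched
converted_again = installs_to_int(converted)
print(converted_again.equals(converted))  # True
```

Dropping the plus sign means every value is a lower bound, which is consistent with the assumption stated earlier that `10,000+` is treated as 10,000.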

In [44]:
# Display the mean of number of installs for each category
android_analytical_df.groupby("Category")["Installs"].mean().sort_values(
    ascending=False
).map("{:,.2f}".format)

Category
COMMUNICATION          38,456,119.17
VIDEO_PLAYERS          24,727,872.45
SOCIAL                 23,253,652.13
PHOTOGRAPHY            17,805,627.64
PRODUCTIVITY           16,787,331.34
GAME                   15,560,965.60
TRAVEL_AND_LOCAL       13,984,077.71
ENTERTAINMENT          11,640,705.88
TOOLS                  10,682,301.03
NEWS_AND_MAGAZINES      9,549,178.47
BOOKS_AND_REFERENCE     8,767,811.89
SHOPPING                7,036,877.31
PERSONALIZATION         5,201,482.61
WEATHER                 5,074,486.20
HEALTH_AND_FITNESS      4,188,821.99
MAPS_AND_NAVIGATION     4,056,941.77
FAMILY                  3,696,479.24
SPORTS                  3,638,640.14
ART_AND_DESIGN          1,986,335.09
FOOD_AND_DRINK          1,924,897.74
EDUCATION               1,820,673.08
BUSINESS                1,712,290.15
LIFESTYLE               1,437,816.27
FINANCE                 1,387,692.48
HOUSE_AND_HOME          1,331,540.56
DATING                    854,028.83
COMICS                    817

On average, `COMMUNICATION` apps have the most installs, with a mean of 38,456,119.17 installs per app.

Before making any decisions based on this information, it's important to take a closer look at the apps within the `COMMUNICATION` category:


In [None]:
android_analytical_df.loc[
    android_analytical_df["Category"] == "COMMUNICATION", ["App", "Installs"]
].sort_values("Installs", ascending=False)

Unnamed: 0,App,Installs
391,Skype - free IM & video calls,1000000000
382,Messenger – Text and Video Chat for Free,1000000000
464,Hangouts,1000000000
336,WhatsApp Messenger,1000000000
411,Google Chrome: Fast & Secure,1000000000
...,...,...
9648,EO Mumbai,10
10748,FP Live,10
10169,Test Server SMS FA,5
6399,Of the wall Arapaho bk,5


The output above shows that the average number of installs for `COMMUNICATION` apps is heavily skewed by a few apps with one billion or more installs (e.g., Skype, Facebook Messenger, Google Hangouts, and WhatsApp Messenger), as well as several apps with 100 million or 500 million installs.

The function `print_apps_with_100m_plus_installs()` below singles out these apps:


In [55]:
def print_apps_with_100m_plus_installs(cat_name):
    apps_with_100m_plus_installs = android_analytical_df.loc[(android_analytical_df["Category"] == cat_name) & (android_analytical_df["Installs"] >= 100000000), ["App", "Installs"]].sort_values("Installs", ascending=False)
    display(apps_with_100m_plus_installs)
    print("Number of apps with 100,000,000 or more installs:", apps_with_100m_plus_installs.shape[0])

Running the above function for all the `COMMUNICATION` apps produces the following:


In [56]:
print_apps_with_100m_plus_installs("COMMUNICATION")

Unnamed: 0,App,Installs
336,WhatsApp Messenger,1000000000
391,Skype - free IM & video calls,1000000000
382,Messenger – Text and Video Chat for Free,1000000000
411,Google Chrome: Fast & Secure,1000000000
464,Hangouts,1000000000
451,Gmail,1000000000
403,LINE: Free Calls & Messages,500000000
4676,Viber Messenger,500000000
420,UC Browser - Fast Download Private & Secure,500000000
383,imo free video calls and chat,500000000


Number of apps with 100,000,000 or more installs: 27


A similar pattern can be seen for `VIDEO_PLAYERS` (which has the second-highest average number of installs at 24,727,872.45), `SOCIAL` (23,253,652.13), `PHOTOGRAPHY` (17,805,627.64), and `PRODUCTIVITY` (16,787,331.34):


In [None]:
print_apps_with_100m_plus_installs("VIDEO_PLAYERS")

YouTube : 1,000,000,000+
Motorola Gallery : 100,000,000+
VLC for Android : 100,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+
Dubsmash : 100,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+
Motorola FM Radio : 100,000,000+

Number of apps with 100,000,000 or more installs: 9


In [None]:
print_apps_with_100m_plus_installs("SOCIAL")

Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Pinterest : 100,000,000+
Google+ : 1,000,000,000+
Badoo - Free Chat & Dating App : 100,000,000+
Tango - Live Video Broadcast : 100,000,000+
Instagram : 1,000,000,000+
Snapchat : 500,000,000+
LinkedIn : 100,000,000+
Tik Tok - including musical.ly : 100,000,000+
BIGO LIVE - Live Stream : 100,000,000+
VK : 100,000,000+

Number of apps with 100,000,000 or more installs: 13


In [None]:
print_apps_with_100m_plus_installs("PHOTOGRAPHY")

B612 - Beauty & Filter Camera : 100,000,000+
YouCam Makeup - Magic Selfie Makeovers : 100,000,000+
Sweet Selfie - selfie camera, beauty cam, photo edit : 100,000,000+
Google Photos : 1,000,000,000+
Retrica : 100,000,000+
Photo Editor Pro : 100,000,000+
BeautyPlus - Easy Photo Editor & Selfie Camera : 100,000,000+
PicsArt Photo Studio: Collage Maker & Pic Editor : 100,000,000+
Photo Collage Editor : 100,000,000+
Z Camera - Photo Editor, Beauty Selfie, Collage : 100,000,000+
PhotoGrid: Video & Pic Collage Maker, Photo Editor : 100,000,000+
Candy Camera - selfie, beauty camera, photo editor : 100,000,000+
YouCam Perfect - Selfie Photo Editor : 100,000,000+
Camera360: Selfie Photo Editor with Funny Sticker : 100,000,000+
S Photo Editor - Collage Maker , Photo Collage : 100,000,000+
AR effect : 100,000,000+
Cymera Camera- Photo Editor, Filter,Collage,Layout : 100,000,000+
LINE Camera - Photo editor : 100,000,000+
Photo Editor Collage Maker Pro : 100,000,000+

Number of apps with 100,000,000 or more installs: 19

In [None]:
print_apps_with_100m_plus_installs("PRODUCTIVITY")

Microsoft Word : 500,000,000+
Microsoft Outlook : 100,000,000+
Microsoft OneDrive : 100,000,000+
Microsoft OneNote : 100,000,000+
Google Keep : 100,000,000+
ES File Explorer File Manager : 100,000,000+
Dropbox : 500,000,000+
Google Docs : 100,000,000+
Microsoft PowerPoint : 100,000,000+
Samsung Notes : 100,000,000+
SwiftKey Keyboard : 100,000,000+
Google Drive : 1,000,000,000+
Adobe Acrobat Reader : 100,000,000+
Google Sheets : 100,000,000+
Microsoft Excel : 100,000,000+
WPS Office - Word, Docs, PDF, Note, Slide & Sheet : 100,000,000+
Google Slides : 100,000,000+
ColorNote Notepad Notes : 100,000,000+
Evernote – Organizer, Planner for Notes & Memos : 100,000,000+
Google Calendar : 500,000,000+
Cloud Print : 500,000,000+
CamScanner - Phone PDF Creator : 100,000,000+

Number of apps with 100,000,000 or more installs: 22


Nine apps (including major apps like YouTube, Google Play Movies & TV, and MX Player) dominate the `VIDEO_PLAYERS` category, each with at least 100 million installs.

Instagram, Facebook, Google+, and ten other apps have at least 100 million installs in the `SOCIAL` category.

The `PHOTOGRAPHY` category has 19 apps with at least 100 million installs, including Google Photos.

Lastly, the `PRODUCTIVITY` category has 22 apps with at least 100 million installs, including major applications such as Dropbox, Microsoft Word, Google Calendar, and Evernote.


What this shows is that these categories are dominated by a few giant apps that are difficult to compete with.
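
A rough way to gauge how much the giants inflate a category's average is to compute the mean twice, once over all apps and once with the apps at or above 100 million installs removed. The sketch below runs on made-up rows; `toy_android` is a hypothetical stand-in for `android_analytical_df`, and `Installs` is assumed to have been converted to integers already.

```python
import pandas as pd

# Hypothetical rows standing in for android_analytical_df
toy_android = pd.DataFrame(
    {
        "Category": ["COMMUNICATION"] * 4 + ["WEATHER"] * 3,
        "Installs": [
            1_000_000_000, 500_000_000, 1_000_000, 10_000,
            10_000_000, 1_000_000, 100_000,
        ],
    }
)

# Same cutoff as print_apps_with_100m_plus_installs()
GIANT_THRESHOLD = 100_000_000

mean_all = toy_android.groupby("Category")["Installs"].mean()
mean_without_giants = (
    toy_android[toy_android["Installs"] < GIANT_THRESHOLD]
    .groupby("Category")["Installs"]
    .mean()
)

comparison = pd.DataFrame(
    {"mean_all": mean_all, "mean_without_giants": mean_without_giants}
)
print(comparison)
```

In this toy data the `COMMUNICATION` mean drops from 375,252,500 to 505,000 once the two giants are excluded, while `WEATHER`, which has no giants, is unchanged.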


Although the `GAME` category appears to be quite popular, my earlier analysis of the iOS App Store indicated that this segment of the market is relatively saturated. For this reason, I sought to identify an alternative app category that may offer greater opportunity.

The `BOOKS_AND_REFERENCE` category also demonstrates notable popularity, with an average install count of 8,767,811.89. This category warrants further exploration, as my earlier findings suggest it holds potential for success on the iOS App Store. Since my objective is to recommend an app genre with profitability prospects across both the iOS App Store and Google Play Store, this category emerged as a promising candidate.


The apps in the `BOOKS_AND_REFERENCE` category and their install counts are listed below:


In [None]:
# This empty dictionary will store the name and number of installs for each app in the BOOKS_AND_REFERENCE genre
book_ref_dict = {}

# Loop through the Google Play Store dataset
for app in android_final:
    # For each app in the books and reference genre, remove any plus-signs or commas, convert the number to a float data type, and add it to the book_ref_dict
    if app[1] == "BOOKS_AND_REFERENCE":
        num_of_installs = app[5]
        num_of_installs = num_of_installs.replace("+", "")
        num_of_installs = num_of_installs.replace(",", "")
        book_ref_dict[app[0]] = float(num_of_installs)

# Sort and print the book_ref_dict
sort_output(book_ref_dict)

Google Play Books: 1000000000.0
Wattpad 📖 Free Books: 100000000.0
Bible: 100000000.0
Audiobooks from Audible: 100000000.0
Amazon Kindle: 100000000.0
Wikipedia: 10000000.0
Spanish English Translator: 10000000.0
Quran for Android: 10000000.0
Oxford Dictionary of English : Free: 10000000.0
NOOK: Read eBooks & Magazines: 10000000.0
Moon+ Reader: 10000000.0
JW Library: 10000000.0
HTC Help: 10000000.0
FBReader: Favorite Book Reader: 10000000.0
English Hindi Dictionary: 10000000.0
English Dictionary - Offline: 10000000.0
Dictionary.com: Find Definitions for English Words: 10000000.0
Dictionary - Merriam-Webster: 10000000.0
Dictionary: 10000000.0
Cool Reader: 10000000.0
Aldiko Book Reader: 10000000.0
Al-Quran (Free): 10000000.0
Al'Quran Bahasa Indonesia: 10000000.0
Al Quran Indonesia: 10000000.0
Read books online: 5000000.0
English to Hindi Dictionary: 5000000.0
Ebook Reader: 5000000.0
Dictionary - WordWeb: 5000000.0
Bible KJV: 5000000.0
Ancestry: 5000000.0
AlReader -any text book reader: 5000
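
The loop above works on `android_final`, a list of rows, together with a `sort_output()` helper. For readers working with the DataFrame used in the rest of this section, a roughly equivalent selection can be sketched as follows. The `toy_android` frame and its app names are made up for illustration, and `Installs` is assumed to be an integer column already.

```python
import pandas as pd

# Hypothetical rows; Installs is assumed to be integer already
toy_android = pd.DataFrame(
    {
        "App": ["Reader A", "Dictionary B", "Chat C", "Library D"],
        "Category": [
            "BOOKS_AND_REFERENCE",
            "BOOKS_AND_REFERENCE",
            "COMMUNICATION",
            "BOOKS_AND_REFERENCE",
        ],
        "Installs": [5_000_000, 100_000_000, 1_000_000_000, 10_000],
    }
)

# Keep only BOOKS_AND_REFERENCE apps and sort by installs, largest first
books_ref = toy_android.loc[
    toy_android["Category"] == "BOOKS_AND_REFERENCE", ["App", "Installs"]
].sort_values("Installs", ascending=False)
print(books_ref.to_string(index=False))
```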

The books and reference genre encompasses a wide range of applications, including ebook readers, digital library collections, dictionaries, and educational resources such as programming or language tutorials. Upon closer examination, I observed that a small number of highly popular apps appear to disproportionately influence the average install figures:


In [None]:
print_apps_with_100m_plus_installs("BOOKS_AND_REFERENCE")

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+

Number of apps with 100,000,000 or more installs: 5


Although the genre contains only a limited number of exceptionally popular apps, the market still appears to hold potential. To generate viable app ideas, I focused on examining applications with moderate levels of popularity, specifically those with download counts between one million and one hundred million.


In [None]:
# Loop through the Google Play Store dataset
for app in android_final:
    # Find and print the names of each app in the book and reference genre with their number of installs if the number of installs is between one million and one hundred million
    if app[1] == "BOOKS_AND_REFERENCE" and app[5] in [
        "1,000,000+",
        "5,000,000+",
        "10,000,000+",
        "50,000,000+",
    ]:
        print(app[0], ":", app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H
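
The same range filter can also be expressed on a DataFrame with boolean masks. This is a sketch on made-up rows (`toy_android` and its app names are hypothetical), assuming `Installs` holds integers; the bounds mirror the four install tiers listed in the loop above, from 1,000,000 up to but not including 100,000,000.

```python
import pandas as pd

# Hypothetical rows; Installs is assumed to be integer already
toy_android = pd.DataFrame(
    {
        "App": ["Reader A", "Dictionary B", "Store C", "Notes D"],
        "Category": [
            "BOOKS_AND_REFERENCE",
            "BOOKS_AND_REFERENCE",
            "BOOKS_AND_REFERENCE",
            "PRODUCTIVITY",
        ],
        "Installs": [5_000_000, 100_000_000, 500_000, 10_000_000],
    }
)

is_books_ref = toy_android["Category"] == "BOOKS_AND_REFERENCE"
# Tiers 1,000,000+ through 50,000,000+: at least 1M and below 100M
is_mid_popularity = (toy_android["Installs"] >= 1_000_000) & (
    toy_android["Installs"] < 100_000_000
)

mid_popularity_apps = toy_android.loc[
    is_books_ref & is_mid_popularity, ["App", "Installs"]
]
print(mid_popularity_apps.to_string(index=False))
```

Comparing integers avoids having to enumerate every install tier as a string, at the cost of requiring the column to be converted first.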

This niche appears to be largely dominated by ebook readers, library collections, and dictionary applications. Given the level of existing competition, developing a similar app may not be the most strategic approach.

I also observed a significant number of apps centered around the Quran, which suggests that creating an app based on a well-known book can be a viable strategy for attracting users. Adapting a popular or recently published book into an app may therefore hold potential in both the Google Play and iOS App Store markets.

However, since the market is already saturated with basic library apps, it would be important to offer additional features that enhance the user experience. These could include daily excerpts or quotes, audio narration, interactive quizzes, or a discussion forum to foster engagement around the content.


## Conclusion


In this project, I conducted an analysis of mobile app data from both the iOS App Store and Google Play Store with the objective of identifying an app profile that has the potential to generate profit across both platforms.

Based on the findings, I concluded that developing an app based on a popular book, particularly a recent publication, could be a viable strategy for profitability in both markets. Given the existing saturation of library-style apps, it would be necessary to differentiate the product by incorporating additional features. These may include daily quotes, an audio narration of the book, interactive quizzes, or a community forum for discussion.
