# Analyzing Profitable App Profiles for the App Store and Google Play Markets
#### Project Description:
Data analysis is the process of exploring, cleaning, and transforming data into a usable format. In this analysis, we explore the apps on both the App Store and the Google Play Store to see which types of apps currently attract users. Most of us use a smartphone every day, and we use different apps for different purposes.

#### Project Scope
The scope of this analysis is limited to apps that target English-speaking users and that are free to download and install.

#### Project Goal
The goal of this analysis is to recommend the types of apps that developers should work on. To arrive at a recommendation, we explore, clean, transform, and analyze the data.

---

## Table of Contents

[Dataset](#Dataset)<br>
[Data Cleaning](#Data-Cleaning)<br>
[Data Analysis](#Data-Analysis)<br>
[Conclusion and Recommendations](#Conclusion-and-Recommendations)

---

## Dataset

The first thing we have to do is import the library we are going to use and load the datasets into variables. We are going to use the [App Store](https://www.kaggle.com/datasets/lava18/google-play-store-apps) and [Google Play Store](https://www.kaggle.com/datasets/lava18/google-play-store-apps) CSV files that are available for download on [Kaggle](kaggle.com).

In [1]:
#We are going to need the reader method in the csv library to read the imported csvs.
import csv

Since we are working on two datasets, we are going to create a function that will open, store in a variable and explore the dataset to avoid typing the code twice. This also allows our code to be reusable on other datasets.

In [2]:
#creating a function to open our dataset
def open_dataset(filename, header = True):
    '''
    Open and read a CSV file, then return its contents as a list of rows
    
    Args
    ---
    filename (str): Name/path of the file
    header (bool): If True, return one list that keeps the header as its first row.
                   If False, return the header and the data rows separately.
    '''
    with open(filename, encoding = 'utf-8') as dataset:
        read_dataset = csv.reader(dataset)
        data_list = list(read_dataset)
    if header:
        return data_list
    return data_list[0], data_list[1:]

#creating a function to explore our dataset
def explore_data(dataset, start, end, rows_and_columns = False, header = True):
    '''
    Print the rows in dataset[start:end] and, optionally, the number of rows and columns
    
    Args
    ---
    dataset (list): the dataset to be explored
    start (int): starting index number
    end (int): ending index number plus one.
    rows_and_columns (bool): Print the number of rows and columns
    header (bool): Whether the first row of the dataset is a header (it is excluded from the row count).
    '''
    dataslice = dataset[start:end]
    for row in dataslice:
        print(row)
    if rows_and_columns:
        #the header row, if present, is not counted as an observation
        n_rows = len(dataset) - 1 if header else len(dataset)
        print('Number of rows:', n_rows)
        print('Number of columns:', len(dataset[0]))

Once the dataset is stored in a variable, we explore the first four rows (including the header) to get some context on what is inside our dataset.

In [3]:
#Storing the dataset using the function defined above
apls_dataset = open_dataset('AppleStore.csv')
gpls_dataset = open_dataset('googleplaystore.csv')
#Exploring the dataset using the function defined above
print('For the apple store dataset:')
explore_data(apls_dataset,0,4,True)
print('\nFor the google play store dataset')
explore_data(gpls_dataset,0,4,True)

For the apple store dataset:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Number of rows: 7197
Number of columns: 16

For the google play store dataset
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', 

---

## Data Cleaning

Before we perform the analysis, we have to make sure the data is clean and appropriate for analysis. Since our application will target English speakers, we exclude applications whose names contain non-English characters. We also exclude apps that are not free. To properly clean our data, we have to do the following:
- [Detect inaccurate data, and correct or remove it.](#Detecting-and-removing-inaccurate-data)
- [Detect duplicate data, and remove the duplicates.](#Detecting-and-removing-duplicate-data)
- [Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.](#Removing-non-english-apps)
- [Remove apps that aren't free.](#Removing-apps-that-are-not-free)

### Detecting and removing inaccurate data

To start things off, one of the [discussions](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) regarding our dataset states that the `Rating` column of entry number **10473** has a missing value. Note that the `Rating` column has an index value of [2]. We are going to verify whether this is true. Again, since we are working with two datasets, we define a function that detects whether an observation has missing data.

In [4]:
def row_verifier(dataset):
    '''
    Verifies that each observation has the same number of attributes as the header
    
    Args
    ---
    dataset (list): The dataset to be verified
    '''
    correct_columns = 0
    #enumerate gives the true position of each row; list.index() would return
    #the position of the first equal row, which is wrong for duplicated rows
    for index, row in enumerate(dataset):
        if len(row) != len(dataset[0]):
            print('Index:', index)
            print(row)
        else:
            correct_columns += 1
    if correct_columns == len(dataset):
        print('All observations in the dataset have a correct number of attributes.')

In [5]:
row_verifier(gpls_dataset)

Index: 10473
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


As we can observe, the row at index 10473 has only 12 values while the header has 13, which confirms the information from the discussion. The `Category` value appears to be missing, so every later value, including `Rating`, is shifted one column to the left. We could repair the entry by investigating what belongs in the missing field, but for simplicity we are just going to remove it so that every observation in our dataset has the same number of values.

In [6]:
#using the del statement we are going to delete the entry 10473
del gpls_dataset[10473]
#checking if it fixes the dataset
row_verifier(gpls_dataset)

All observations in the dataset have a correct number of attributes.


<br>We are going to apply the same function to our Apple Store dataset to verify if there's any missing data.

In [7]:
#We are going to apply the same approach on apple store dataset
row_verifier(apls_dataset)

All observations in the dataset have a correct number of attributes.


### Detecting and removing duplicate data
Now that we have removed the inaccurate data, we are going to focus on duplicated data. 
According to the Hevo article [10 Reasons How Duplicate Data Harms Your Business](https://hevodata.com/learn/duplicate-data/#:~:text=for%20better%20services.-,Inaccurate%20Reporting,should%20do%20for%20future%20growth.):

>Good reporting requires accurate data that is free of duplicates. Duplicate data inhibits this. Reports generated from duplicate records are less reliable and cannot be used to make informed decisions. The business will also find it difficult to forecast what it should do for future growth.

We want to avoid misleading results, so we are going to remove the duplicated data. To start, we investigate which apps have duplicate entries.

In [8]:
unique_data = list()
duplicated_data = list()

for row in gpls_dataset[1:]:
    name = row[0]
    if name not in unique_data:
        unique_data.append(name)
    else:
        duplicated_data.append(name)
print('Number of unique apps:',len(unique_data))
print('Number of duplicated apps:',len(duplicated_data))
#this will print some of the name of duplicated app
print('\nSome names of duplicated app:')
print(duplicated_data[:10])
#this will print the app that has the same name as Quick PDF Scanner + OCR FREE
print('\nSample of duplicated apps:')
for row in gpls_dataset[1:]:
    if row[0] == 'Quick PDF Scanner + OCR FREE':
        print(row)

Number of unique apps: 9659
Number of duplicated apps: 1181

Some names of duplicated app:
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']

Sample of duplicated apps:
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


As we can observe, the duplicated entries differ only in the value at index [3]. Looking at our column names, index [3] is `Reviews`, the number of reviews. To remove the duplicates, we need a criterion for choosing which row to keep. The data was probably collected at different times, so we can assume that the row with the highest number of reviews is the most recent entry. This is the criterion we use to remove the duplicates.
<br><br>In the following steps, we create a dictionary where each **key:value** pair is the name of an application and its maximum review count.

In [9]:
print('Number of applications excluding duplicates:',len(gpls_dataset[1:]) - len(duplicated_data))

Number of applications excluding duplicates: 9659


In [10]:
reviews_max = dict() #this is where we are going to store our name:max review (key:value pair)

#loop through the gpls dataset and record each app's review count,
#replacing the stored value whenever a higher review count is found
for row in gpls_dataset[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if name not in reviews_max or n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
        
#to verify that we have the correct length of dict, we are going to get the length and it should be 9659 according to the apps without duplicates
print('reviews_max dictionary length:',len(reviews_max))

reviews_max dictionary length: 9659


<br>Now that we have the **key:value** pairs we want, we create two empty lists. `android_clean` stores, for each application, the row with its maximum review count, and `already_added` stores the names we have already added to the first list. The second list is needed because an application may have several rows that share the maximum review count. We can verify that both lists have the expected length of *9,659 rows*.

In [11]:
android_clean = list() #new list which the duplicate was removed
already_added = list() #names of the apps that we already added in the new list

#this loop takes the row with the maximum number of rating count for that app and removes the duplicates
for row in gpls_dataset[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

print('android_clean length:', len(android_clean))
print('already_added length:', len(already_added))

android_clean length: 9659
already_added length: 9659
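
The two-pass approach above (first build `reviews_max`, then filter the rows) could also be sketched as a single pass that stores the best row per app directly. This is only an illustrative alternative: `dedupe_by_max_reviews` and `sample_rows` are hypothetical names and data, not part of the notebook, and the order of the resulting rows may differ from `android_clean`.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

# Hypothetical rows, truncated to the first four Google Play columns.
sample_rows = [
    ['App A', 'BUSINESS', '4.2', '80804'],
    ['App A', 'BUSINESS', '4.2', '80805'],
    ['App A', 'BUSINESS', '4.2', '80805'],
    ['App B', 'TOOLS', '4.0', '10'],
]
deduped = dedupe_by_max_reviews(sample_rows)
print('Rows after removing duplicates:', len(deduped))
```

Because the dictionary is keyed by app name, no separate `already_added` list is needed.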


<br>Now that `android_clean` has the same length as the number of applications excluding duplicates, we explore the first 5 rows of our dataset using the `explore_data` function that we defined above.

In [12]:
explore_data(android_clean,0,5)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


<br>Now that we have removed the duplicates from the google play store dataset, we check whether there are any duplicates in the apple store dataset. We look for duplicates using the id (index 0) of each row.

In [13]:
unique_id = list()
duplicated_id = list()

for row in apls_dataset[1:]:
    id_data = row[0]
    
    if id_data not in unique_id:
        unique_id.append(id_data)
    elif id_data in unique_id:
        duplicated_id.append(id_data)
        
print('Number of unique data:', len(unique_id))
print('Number of duplicates:', len(duplicated_id))

Number of unique data: 7197
Number of duplicates: 0


Since the number of unique ids is the same as the number of rows in the apple store dataset, we can confirm that there are no duplicates in the apple store dataset.
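
The same check can be sketched more compactly with a `set`, which keeps only distinct values. The helper name `has_duplicates` is an assumption made for illustration; the ids are the three shown in the dataset preview earlier.

```python
def has_duplicates(values):
    """Return True if any value occurs more than once."""
    values = list(values)
    # A set drops repeated values, so a shorter set means something was repeated.
    return len(set(values)) != len(values)

sample_ids = ['284882215', '389801252', '529479190']
print(has_duplicates(sample_ids))
print(has_duplicates(sample_ids + ['284882215']))
```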

### Removing non-english apps

The target audience of our application is English-speaking users, so we are going to remove any non-English apps from our dataset. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Each character has a numeric code, and the characters commonly used in English text are all in the [ASCII](https://www.britannica.com/topic/ASCII) range of 0 to 127. To get this value, we use Python's built-in `ord()` function. With this information, we define our own function that takes the name of an application and checks for non-English characters (in our case, any character whose value is greater than 127). Since characters such as "™" or emojis fall outside the 0 to 127 range, we only exclude an application if its name contains more than 3 such characters.
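
To make the threshold concrete, here is a small sketch that prints a few code points and counts the characters above 127 in some names. `count_non_ascii` is a hypothetical helper and `'Example App™'` is a made-up name used only for illustration.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for character in text if ord(character) > 127)

# Code points of an English letter, the trademark sign, and a Chinese character.
print(ord('a'), ord('™'), ord('爱'))

names = ['Instagram', 'Example App™', '爱奇艺PPS -《欢乐颂2》电视剧热播']
for name in names:
    print(name, '->', count_non_ascii(name))
```

A name with a single "™" stays under the limit of 3, while the Chinese-language name is well above it.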

In [14]:
def english_characters(string):
    '''
    loop through the string and checks if there is any non english character in a string
    
    Args
    ---
    string (str): string that needs to be checked
    '''

    non_english = 0
    for character in string:
        if ord(character) > 127:
            non_english += 1
    if non_english > 3:
        return False
    return True

Now that we defined our function, we are going to loop in the apple store dataset and google playstore dataset and identify if there are non-english characters in the application's name. We can create a new list with a clean data by excluding any non-english applications.

In [15]:
apls_dataset_english = list()
android_clean_english = list()

for row in apls_dataset[1:]:
    name = row[1]
    if english_characters(name):
        apls_dataset_english.append(row)
        
for row in android_clean:
    name = row[0]
    if english_characters(name):
        android_clean_english.append(row)
        
print('Number of english apps in apple store dataset:', len(apls_dataset_english))
print('Number of english apps in google playstore dataset:', len(android_clean_english))

Number of english apps in apple store dataset: 6183
Number of english apps in google playstore dataset: 9614


Comparing these counts with the earlier ones, 1,014 apps were removed from the apple store dataset (7,197 down to 6,183) and 45 from the google playstore dataset (9,659 down to 9,614).

### Removing apps that are not free

As we mentioned in our scope, we are only going to analyze the applications that are free to download and install. So in this section of our data cleaning process, we are going to exclude applications that are not free.

In the apple store dataset, the fifth column (index 4) is the price. Using this information, we keep every application priced at *0* and append it to our free app list.

In [16]:
apls_free_english_app = list()

for row in apls_dataset_english:
    price = float(row[4])
    if price == 0:
        apls_free_english_app.append(row)
        
print('Number of free english applications:', len(apls_free_english_app))

Number of free english applications: 3222


Now we are going to filter the free apps in the google playstore dataset. Upon inspecting the price values (index 7), we find that some of them include a "$" sign, so they cannot be converted directly to a float or an int. Knowing this, we instead compare against the string '0' in our conditional statement to keep only the free applications.

In [17]:
gpls_free_english_app = list()

for row in android_clean_english:
    price = row[7]
    if price == '0':
        gpls_free_english_app.append(row)
        
print('Number of free english applications:', len(gpls_free_english_app))

Number of free english applications: 8864
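
As an alternative to comparing against the string '0', the price could be converted to a number after stripping the "$" sign. This is a minimal sketch under the assumption that every price is either '0' or a number with a leading "$"; `parse_price` and the sample prices are hypothetical.

```python
def parse_price(price_text):
    """Convert a price string such as '0' or '$4.99' to a float."""
    # Assumes the only non-numeric character is a leading dollar sign.
    return float(price_text.strip().lstrip('$'))

for raw in ['0', '$4.99', '$0.99']:
    price = parse_price(raw)
    label = 'free' if price == 0 else 'paid'
    print(raw, '->', price, label)
```

A numeric price also makes later comparisons possible, such as filtering apps below a given price.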


Now that we've cleaned the data by doing the following:
- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.
- Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
- Remove apps that aren't free.

<br>We can say that our data is now appropriate for analysis. We will do that in the next section.

---

## Data Analysis

As we mentioned in the introduction, our goal is to determine the kinds of apps that are likely to attract more users, because our revenue will come from in-app ads and therefore depends on the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. 
<br><br>First, we look at the genre column of both datasets and the category column of the google playstore dataset to determine which applications are the most common on each platform. We start our analysis by defining functions that build and display a frequency table.

In [18]:
def freq_table(dataset, index, percentage = False):
    '''
    Takes in a dataset and index and return a frequency table
    
    Args
    ---
    dataset (list): the dataset
    index (int): index of the column the frequency table is based on
    percentage (bool): if True, return percentages instead of counts
    '''
    
    freq_dict = dict()
    total = 0
    for row in dataset:
        column = row[index]
        total += 1
        #start from 0 for a new value, then add one for the current row
        freq_dict[column] = freq_dict.get(column, 0) + 1
    if percentage:
        for element in freq_dict:
            freq_dict[element] = round(((freq_dict[element] / total) * 100),2)
    return freq_dict

def display_table(dataset, index, top = 5, percentage = False):
    '''
    Displays the table using the frequency table sorted in descending order
    
    Args
    ---
    dataset (list): the dataset
    index (int): index of the column the frequency table is based on
    top (int): how many rows to display (default 5)
    percentage (bool): if True, display percentages instead of counts
    '''
    
    table = freq_table(dataset, index, percentage)
    empt_list = list()
    
    for key in table:
        val_key = [table[key], key]
        empt_list.append(val_key)
        
    table_sorted = sorted(empt_list, reverse = True)
    
    #slicing avoids an IndexError when top is larger than the number of distinct values
    for value, key in table_sorted[:top]:
        print(key, ':', value)
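
For comparison, a frequency table of this kind could also be sketched with `collections.Counter` from the standard library. `percentage_table` and `sample_rows` are hypothetical names and data for illustration; note that `most_common` orders ties by first appearance, which may differ from the sorting used above.

```python
from collections import Counter

def percentage_table(rows, index, top=5):
    """Return the `top` most common values in column `index` as (value, percentage) pairs."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common(top)]

# Hypothetical rows: an app name followed by its genre.
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for value, share in percentage_table(sample_rows, 1):
    print(value, ':', share)
```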

In [19]:
display_table(apls_free_english_app, -5, 10, True)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02


As we can observe from the apple store dataset, most of the applications are entertainment-oriented: **Games, Entertainment and Photo & Video** are the most common genres. This might imply that a large share of users download applications in these genres; we will investigate this hypothesis later.

In [20]:
display_table(gpls_free_english_app, -4, 10, True)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32


We can identify that there isn't much of a trend in the genres of the google playstore dataset, as the apps are spread across many different genres. But in common with the apple store dataset, **Entertainment** is one of the most common genres.

In [21]:
display_table(gpls_free_english_app, 1, 10, True)

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32


Determining the most common applications is easier with the category column of the google playstore dataset. As we can see, **Family, Game and Tools** are at the top, with Family in first place at almost double the share of Game in second. We can also see that **Game** is common to both markets.

<br>One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

In [22]:
apls_genre = list(freq_table(apls_free_english_app, -5))
apls_most_download = list()

for key in apls_genre:
    genre_key = key
    len_genre = 0
    rating_total = 0
    
    for row in apls_free_english_app:
        genre = row[-5]
        count_rating_tot = float(row[5])
        
        if genre_key == genre:
            rating_total += count_rating_tot
            len_genre += 1
        
    avg_ratings = round((rating_total / len_genre), 2)
    rating_genre = avg_ratings, genre_key
    apls_most_download.append(rating_genre)

sorted_list = sorted(apls_most_download, reverse = True)

for genre in sorted_list:
    print(genre[1], ':', genre[0])

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8
Shopping : 26919.69
Health & Fitness : 23298.02
Sports : 23008.9
Games : 22788.67
News : 21248.02
Productivity : 21028.41
Utilities : 18684.46
Lifestyle : 16485.76
Entertainment : 14029.83
Business : 7491.12
Education : 7003.98
Catalogs : 4004.0
Medical : 612.0
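
The nested loop above scans the whole dataset once per genre. A single-pass version could be sketched as follows; `average_by_group` and `sample_rows` are hypothetical names and data, not the notebook's own.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column}, computed in one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

# Hypothetical rows: a genre followed by a rating count.
sample_rows = [['Games', '100'], ['Games', '300'], ['Music', '50']]
averages = average_by_group(sample_rows, 0, 1)
for group, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(group, ':', avg)
```

The same helper would work for the Google Play categories once the install strings are converted to numbers.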


We can see that **Navigation** applications have the most ratings on average in this market. We hypothesized earlier that the **Games** genre might have more users, but **Games** sits in the middle, with almost four times fewer ratings on average than the leading genre. This suggests that there are more active users per app in the **Navigation** genre.
<br>We now move on to our google playstore dataset. Since it has the `Installs` attribute, it is easier to see how many users each genre has. We inspect the first five rows of our dataset to see how we are going to analyze it.

In [23]:
explore_data(gpls_free_english_app, 0, 5)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


<br>Upon further inspection we can see that the `Installs` values don't seem precise enough: most values are open-ended (10,000+, 100,000+, 5,000,000+ etc.). With this information, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes, because we only want to find out which app genres attract the most users. 
<br><br>We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from a string to a float. This means we need to remove the commas and the plus characters, or the conversion will fail and cause an error.
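
A minimal sketch of that conversion, using install strings that appear in the dataset preview. `parse_installs` is a hypothetical helper name; the cleaning itself is the same pair of `str.replace` calls described above.

```python
def parse_installs(installs_text):
    """Convert an install string such as '10,000+' to a float lower bound."""
    return float(installs_text.replace('+', '').replace(',', ''))

# Converting the raw string directly fails because of the comma and the plus sign.
try:
    float('10,000+')
except ValueError as error:
    print('Direct conversion fails:', error)

for raw in ['10,000+', '5,000,000+', '50,000,000+']:
    print(raw, '->', parse_installs(raw))
```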

In [26]:
gpls_category = freq_table(gpls_free_english_app, 1)
gpls_category_most_download = list()

for category in gpls_category:
    len_category = 0
    total_category = 0
    
    for row in gpls_free_english_app:
        installs = float((row[5].replace('+', '')).replace(',',''))
        
        if category == row[1]:
            total_category += installs
            len_category += 1
    avg_category = round((total_category / len_category),2)
    tot_category = avg_category, category
    gpls_category_most_download.append(tot_category)

sorted_list = sorted(gpls_category_most_download, reverse = True)

for category in sorted_list:
    print(category[1], ':', category[0])     

COMMUNICATION : 38456119.17
VIDEO_PLAYERS : 24727872.45
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16787331.34
GAME : 15588015.6
TRAVEL_AND_LOCAL : 13984077.71
ENTERTAINMENT : 11640705.88
TOOLS : 10801391.3
NEWS_AND_MAGAZINES : 9549178.47
BOOKS_AND_REFERENCE : 8767811.89
SHOPPING : 7036877.31
PERSONALIZATION : 5201482.61
WEATHER : 5074486.2
HEALTH_AND_FITNESS : 4188821.99
MAPS_AND_NAVIGATION : 4056941.77
FAMILY : 3695641.82
SPORTS : 3638640.14
ART_AND_DESIGN : 1986335.09
FOOD_AND_DRINK : 1924897.74
EDUCATION : 1833495.15
BUSINESS : 1712290.15
LIFESTYLE : 1437816.27
FINANCE : 1387692.48
HOUSE_AND_HOME : 1331540.56
DATING : 854028.83
COMICS : 817657.27
AUTO_AND_VEHICLES : 647317.82
LIBRARIES_AND_DEMO : 638503.73
PARENTING : 542603.62
BEAUTY : 513151.89
EVENTS : 253542.22
MEDICAL : 120550.62


On average, communication apps have the most installs: 38,456,119. This number is heavily skewed up by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

In [27]:
for app in gpls_free_english_app:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

---

## Conclusion and Recommendations

As we observed in our analysis, **Navigation** and **Social Networking** applications are among the most popular. However, these markets will be very hard to enter because they are dominated by a few very popular applications. We can instead focus on **Entertainment** applications, since the genre is trending and many new applications in it have become popular, such as *TikTok*.

Since we are going to rely on in-app ads for our revenue, we can combine features from other genres, for example a chat option borrowed from communication apps and a navigation-style feature that lets users locate nearby users of the application. One idea is an application where users upload photos and other users rate them. With location enabled, users can see where each photo was taken. By combining multiple genres, we might be able to attract more users to our application. To implement in-app ads, users would watch an advertisement before they can upload photos. We could also offer a paid pro version that removes the in-app ads.