# An analysis of app stores: most profitable apps available in March 2019

In this project we present some insights into the most downloaded apps on mobile app stores, on the assumption that the most downloaded apps also tend to be the most profitable.

Our main objective is to help our team understand what kinds of apps are likely to attract more users on the Google Play Store and the Apple App Store, so that we can make data-driven decisions about which types of application are likely to generate more revenue once developed and distributed.

Both platforms offer a huge collection of apps: about 2 million on the App Store and 2.1 million on the Google Play Store. Such a large universe would be impractical to analyse, since it would require too much time spent on data cleansing and extra costs to acquire such a dataset.

Thus, bearing in mind the statistical principle of sampling, we opted for two free datasets, one for each platform, each still containing a good number of apps, large enough to serve as a sample.

The App Store figures come from a free dataset, which is available [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

That dataset is organized in columns, and described as follows:

| Column | Description |
|---|---|
| "id" | App ID |
| "track_name" | App name |
| "size_bytes" | Size (in bytes) |
| "currency" | Currency type |
| "price" | Price amount |
| "rating_count_tot" | User rating count (all versions) |
| "rating_count_ver" | User rating count (current version) |
| "user_rating" | Average user rating (all versions) |
| "user_rating_ver" | Average user rating (current version) |
| "ver" | Latest version code |
| "cont_rating" | Content rating |
| "prime_genre" | Primary genre |
| "sup_devices.num" | Number of supported devices |
| "ipadSc_urls.num" | Number of screenshots shown for display |
| "lang.num" | Number of supported languages |
| "vpp_lic" | VPP device-based licensing enabled |


For Android the columns are quite self-explanatory:

| Column |
|---|
|App|
|Category|
|Rating|
|Reviews|
|Size|
|Installs|
|Type|
|Price|
|Content Rating|
|Genres|
|Last Updated|
|Current Ver|
|Android Ver|


**We start exploring the datasets by importing a function called *reader* from Python's *csv* module. CSV stands for *comma-separated values*, a general-purpose format for tables and lists organized in columns and rows. We then open each file and load it as a list of rows that can be manipulated with Python commands.**

In [None]:
from csv import reader

# Read each CSV into a list of rows; the first row of each list is the header.
# The "with" blocks close the files automatically once they have been read.
with open('../input/google-and-apple-store/AppleStore.csv', encoding='utf8') as opened_file_apple:
    apple_apps_data = list(reader(opened_file_apple))

with open('../input/google-and-apple-store/googleplaystore.csv', encoding='utf8') as opened_file_android:
    android_apps_data = list(reader(opened_file_android))

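**Now we import *OrderedDict*, a dictionary type (a collection of key-value pairs) that remembers the order in which its keys were inserted. It does not sort anything by itself; ordering entries by value still requires a function such as *sorted*. We will come back to it much later in this work.**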

In [None]:
from collections import OrderedDict

**With the datasets loaded, we print the first row of each one, which is the header row, to show what information is available in those tables. For convenience, we also store the two header rows in variables that we will use many times for reference.**

In [None]:
android_header = (android_apps_data[0])
print (android_header)

In [None]:
apple_header = (apple_apps_data[0])
print (apple_header)

**We've created a simple function for exploring the datasets, which can be seen below:**

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

**And here we use that function with the two datasets.**

In [None]:
explore_data(apple_apps_data, 1, 4, rows_and_columns=True)

In [None]:
explore_data(android_apps_data, 1, 4, rows_and_columns=True)

**Further analysis of these datasets revealed apps that lacked information, as well as duplicated apps. If not corrected, those occurrences would create statistical distortions and make this work less reliable.**

**So we had to cleanse the sets, removing duplicates and items that lacked information.**

**Below, we start the data cleansing by removing from the Android list an app that lacked an important column for our analysis.**

In [None]:
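# enumerate(..., start=1) keeps the index aligned with android_apps_data,
# whose position 0 holds the header row.
for index, row in enumerate(android_apps_data[1:], start=1):
    app_name = row[0]
    if app_name == 'Life Made WI-Fi Touchscreen Photo Frame':
        print(row)
        print(index)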

In [None]:
# Run this cell only once: a second run would delete a valid row.
del android_apps_data[10473]
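
A more general way to spot rows like this one is to compare each row's column count with the header's. The sketch below assumes that the faulty row is shorter than the header because it is missing a column, as described above. The function `find_malformed_rows` and the toy data are hypothetical names introduced only for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Hypothetical toy data: the last entry is missing its category column.
toy = [
    ['App', 'Category', 'Rating'],
    ['A', 'TOOLS', '4.1'],
    ['B', '1.9'],
]
malformed = find_malformed_rows(toy)
print(malformed)
```

Calling `find_malformed_rows(android_apps_data)` before deleting anything would list every candidate row together with its index.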

**Now we check for duplicates in the Android dataset, creating two lists: one for unique records and one for duplicated ones.**

In [None]:
unique_android = []
duplicated_android = []

for row in android_apps_data[1:]:
    name = row[0]
    if name in unique_android:
        duplicated_android.append(name)
    else:
        unique_android.append(name)

**We'll do the same for the iOS dataset.**

In [None]:
unique_ios = []
duplicated_ios = []

for row in apple_apps_data[1:]:
    name = row[1]
    if name in unique_ios:
        duplicated_ios.append(name)
    else:
        unique_ios.append(name)
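
The membership test `name in unique_android` scans a list, which gets slower as the list grows. A possible alternative, sketched below on hypothetical toy names, tracks the names already seen in a set while producing the same two lists. The helper `split_unique_duplicates` is an illustrative name, not part of the original notebook.

```python
def split_unique_duplicates(names):
    """Split names into first occurrences and repeated occurrences."""
    seen = set()
    unique, duplicated = [], []
    for name in names:
        if name in seen:
            duplicated.append(name)
        else:
            seen.add(name)
            unique.append(name)
    return unique, duplicated


# Hypothetical toy data, not taken from the datasets.
toy_names = ['Instagram', 'Slack', 'Instagram', 'Box', 'Slack', 'Instagram']
unique, duplicated = split_unique_duplicates(toy_names)
print(unique)
print(duplicated)
```

With the real data, the input would be `[row[0] for row in android_apps_data[1:]]` for Android and `[row[1] for row in apple_apps_data[1:]]` for iOS.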

**This check found 1,181 duplicated items in the Android dataset, out of a total of 10,840. These duplicates represent more than 10% of the records and, given their weight, could lead to wrong conclusions later on.**

**For the Apple dataset we found 2 duplicates, which could be ignored without affecting our final results.**

In [None]:
and_dupl_num = len (duplicated_android)
and_uniq_num = len (unique_android)
ios_dupl_num = len (duplicated_ios)
ios_uniq_num = len (unique_ios)

print (and_dupl_num)
print (and_uniq_num)
print (ios_dupl_num)
print (ios_uniq_num)

**Here are some of the duplicates we found in the Android set:**

In [None]:
print (duplicated_android[:10])

**Many of these duplicates have columns with different values, so we cannot just remove entries at random. The duplicated Instagram records, for example, have different values in the 4th column, which holds the total number of reviews for a given app:**

In [None]:
for row in android_apps_data[1:]:
    name = row[0]
    if name == 'Instagram':
        print (row)

**We concluded that the numbers in that column are related: they grow from one entry to the next, which reflects the growing number of reviews for the app over time. So we keep only the entry with the largest number (as it should be the most recent) and remove the smaller ones.**

In [None]:
reviews_max = {}

# For each app name, keep the highest review count seen across its entries.
for row in android_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name not in reviews_max or n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews

**Now we can check the length of the "reviews_max" dictionary, which contains no duplicated items. We expect it to match the number of unique apps, 9,659 (10,840 entries minus 1,181 duplicates), as seen in the first check for duplicates.**

In [None]:
len (reviews_max)

**It matches the number we found in the first check. Now we're going to clean our main list.**

In [None]:
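android_clean = []
apple_clean = apple_apps_data[1:]  # drop the header row
already_added = []

for row in android_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    # Keep only the entry with the highest review count; already_added guards
    # against duplicates that share that same maximum.
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)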
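
The two passes above (find the maximum, then keep the matching row) can also be folded into a single pass that stores the best row per app name. This is a minimal sketch on hypothetical toy rows, assuming the same column layout as the Android dataset (name at index 0, reviews at index 3). `keep_most_reviewed` is an illustrative helper, not part of the original notebook.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Hypothetical toy rows: app name, category, rating, reviews.
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Slack', 'BUSINESS', '4.4', '50'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
]
deduped = keep_most_reviewed(toy_rows)
print(deduped)
```

Because the comparison is strict, two entries with the same maximum keep only the first one, which mirrors the role of `already_added` in the loop above.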
            

**Let's check whether the length of the new list matches the number of unique apps, and then explore it a bit.**

In [None]:
len (android_clean)

**The list length is OK. Now let's explore it:**

In [None]:
explore_data(android_clean, 1, 7)

**Now we check the list of app names added to the clean list, to make sure its length corresponds to the number of unique apps, and we explore it a little to confirm its content.**

In [None]:
# A notebook cell only displays its last bare expression, so print the length explicitly.
print(len(already_added))
print(already_added[1:6])

**Everything seems OK. Now it's time to remove apps with non-Latin names, since our research is about apps that could make a profit in Western countries.**

**This can be done with the ASCII character encoding, in which every character has a corresponding number and the characters commonly used in English text (letters, digits, punctuation) fall in the range 0 to 127.**

**In Python, the built-in function "ord" returns the number that corresponds to a character. We use it to count the ASCII and non-ASCII characters in each app name, and we keep an app only when the ASCII characters are the majority; otherwise we remove it from our list, as follows.**

In [None]:
def check_latin(app_name):
    """Return True when ASCII characters outnumber non-ASCII ones in the name."""
    len_latin = 0
    len_non_latin = 0
    for char in app_name:
        if ord(char) < 128:
            len_latin += 1
        else:
            len_non_latin += 1
    return len_latin > len_non_latin

# android_clean and apple_clean no longer contain the header row,
# so we iterate over every entry instead of skipping the first one.
real_clean_googleapps = []
for row in android_clean:
    if check_latin(row[0]):
        real_clean_googleapps.append(row)

real_clean_appleapps = []
for row in apple_clean:
    if check_latin(row[1]):
        real_clean_appleapps.append(row)

**Now we check the lengths of the final clean lists, just to make sure everything looks right.**

In [None]:
goog_len = len (real_clean_googleapps)
appl_len = len (real_clean_appleapps)
print (goog_len)
print (appl_len)

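**A brief look at these datasets suggests that free apps, which rely on ads for revenue, are the most numerous and the most downloaded. Since the App Store dataset has no download count, the simple program below uses the total number of user ratings ("rating_count_tot", index 5) as a stand-in for downloads.**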

In [None]:
apple_apps_prices = {'Free': 0, 'Non_free': 0}
apple_apps_downloads = {'Free': 0, 'Non_free': 0}
for row in real_clean_appleapps:
    price = float(row[4])
    downloaded_times = float(row[5])  # rating_count_tot, used as a proxy for downloads
    if price == 0.0:
        apple_apps_prices['Free'] += 1
        apple_apps_downloads['Free'] += downloaded_times
    else:
        apple_apps_prices['Non_free'] += 1
        apple_apps_downloads['Non_free'] += downloaded_times

total_apps = apple_apps_prices['Free'] + apple_apps_prices['Non_free']
total_downloads = apple_apps_downloads['Free'] + apple_apps_downloads['Non_free']

**Below you can see the numbers. In this sample there are 4,056 free apps, downloaded 80,105,208 times, and 3,141 paid apps, downloaded 12,685,045 times (with downloads estimated from the total rating count in column 5).**

In [None]:
print (apple_apps_prices)

In [None]:
print (apple_apps_downloads)

**We get a better picture in percentage terms, as we see below:**

In [None]:
apple_free_apps_propor = (apple_apps_prices['Free']/total_apps)*100
apple_free_downloads = (apple_apps_downloads['Free']/total_downloads)*100

In [None]:
proporcao_apps = round(apple_free_apps_propor)
proporcao_downloads = round (apple_free_downloads)
print (proporcao_apps)
print (proporcao_downloads)

**As we can see, free apps represent 56% of all apps and 86% of all downloads in the iOS App Store sample.**
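
As a quick check of the arithmetic, the two shares can be recomputed directly from the totals reported above. The variable names below are illustrative; only the four figures come from the text.

```python
# Totals reported in the text for the iOS sample.
free_apps, paid_apps = 4056, 3141
free_total, paid_total = 80105208, 12685045

# Share of free apps among all apps, and of their tally among all tallies.
share_apps = free_apps / (free_apps + paid_apps) * 100
share_total = free_total / (free_total + paid_total) * 100

print(round(share_apps), round(share_total))  # prints: 56 86
```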

**Now it's time to isolate the free apps in the two app stores. Let's check the header rows again to get the indexes of the price columns.**

In [None]:
print (android_apps_data[0])
print (apple_apps_data[0])

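**As we see, the price indexes are [7] for Android apps and [4] for iOS apps. The Android dataset also has a "Type" column at index [6], which marks each app as free or paid and is the one we use to split the Android apps. Let's check the final clean lists to confirm.**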

In [None]:
print (real_clean_googleapps[0])
print (real_clean_appleapps[0])

**Everything looks fine. Now we're going to isolate the free apps on the two datasets.**

In [None]:
android_free = []
android_non_free = []
ios_free = []
ios_non_free = []

for row in real_clean_googleapps:
    app_type = row[6]  # "Type" column, which marks free apps as 'Free'
    if app_type == 'Free':
        android_free.append(row)
    else:
        android_non_free.append(row)

for row in real_clean_appleapps:
    price = float(row[4])  # "price" column, stored as text in the CSV
    if price == 0.0:
        ios_free.append(row)
    else:
        ios_non_free.append(row)

**Let's explore the free datasets, starting with Android.**

In [None]:
explore_data(android_free, 0, 5)

**Now the iOS dataset.**

In [None]:
explore_data(ios_free, 0, 5)

**Now we check the lengths of the free and non-free lists for each platform.**

In [None]:
len_free_ios = len (ios_free)
print (len_free_ios)

len_non_free_ios = len (ios_non_free)
print (len_non_free_ios)

In [None]:
len_free_and = len (android_free)
print (len_free_and)

len_non_free_and = len (android_non_free)
print (len_non_free_and)

**Everything seems fine so far.**

**To minimize the risks of app development, we need a validation strategy. We suggest the following steps:**

1. Build a minimal version of the app for Android only and publish it on the Google Play Store, since the cost of publishing there is minimal: a one-time payment of around US$ 25.

2. Monitor how that version performs among users. If it performs well, develop the app further.

3. After six months, if the app proves profitable, build an iOS version. We leave iOS for last because publishing an app there involves higher costs.

**We'll now create a function that builds a frequency table, where we can look up how many times each genre occurs.**

In [None]:
def freq_table(dataset, index, frequency=True):
    """Count the values in one column; return counts, or percentages if frequency is False."""
    table = {}
    total = 0

    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1  # first occurrence of this value

    if frequency:
        return table

    # Convert the counts to percentages of the total number of rows.
    return {key: (count / total) * 100 for key, count in table.items()}

**Below, we create another function that orders the genres by their number of occurrences in descending order, from the largest to the smallest.**

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted (table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
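
The same counting and ordering can also be expressed with `collections.Counter` from the standard library, whose `most_common()` method already returns the entries from the most to the least frequent. This is an alternative sketch on hypothetical toy rows, not the approach used in the notebook; `freq_table_counter` is an illustrative name.

```python
from collections import Counter


def freq_table_counter(dataset, index, as_percent=False):
    """Count the values in one column, optionally as percentages of all rows."""
    counts = Counter(row[index] for row in dataset)
    if not as_percent:
        return dict(counts)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


# Hypothetical toy rows: app name, category.
toy = [['a', 'GAME'], ['b', 'TOOLS'], ['c', 'GAME'], ['d', 'GAME']]
counts = freq_table_counter(toy, 1)
percents = freq_table_counter(toy, 1, as_percent=True)
ranking = Counter(row[1] for row in toy).most_common()
print(counts)
print(percents)
print(ranking)
```

A useful sanity check for any frequency table is that the counts add up to the number of rows and the percentages add up to 100.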

**Now we use the functions to check the frequency of genres on the two datasets**

In [None]:
display_table(android_free, 1)

In [None]:
# display_table prints its result and returns None, so there is nothing to assign.
display_table(android_free, 9)

In [None]:
display_table(ios_free, 11)

**As we can see, the "Games", "Entertainment", "Education", "Family" and "Tools" categories are the ones with the most apps in the two datasets. But which are the most downloaded, proportionally? The average number of installs per genre can show us this.**

**The Android apps list has an "installs" column that we can use for our analysis. The Apple apps list has no similar field, which leaves us no choice but to use "rating_count_tot" as a measure of how many times an app has been used, and therefore rated, by its users.**

**Now we're going to explore the Android installs column. First, its values need some cleansing, since many of them contain commas, plus symbols and other characters, as we can see below. The "installs" value is at index 5.**

In [None]:
explore_data(android_free, 0, 5)

In [None]:
android_installs_genre = {}

for row in android_free:
    genre = row[9]

    # Strip the thousands separators and the trailing plus sign before converting.
    installs = row[5]
    installs = installs.replace(',', '')
    installs = installs.replace('+', '')
    installs = float(installs)

    # Add each app's installs exactly once to its genre total.
    if genre in android_installs_genre:
        android_installs_genre[genre] += installs
    else:
        android_installs_genre[genre] = installs
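
The cleaning step can be isolated in a small helper, which makes it easy to test on its own. The sketch below assumes that install values look like `'1,000,000+'`, as described above; `parse_installs` is a hypothetical helper name.

```python
def parse_installs(text):
    """Turn an installs label such as '1,000,000+' into an integer."""
    return int(text.replace(',', '').replace('+', ''))


print(parse_installs('1,000,000+'))  # prints: 1000000
print(parse_installs('50+'))         # prints: 50
```

Returning an integer rather than a float keeps the totals exact, although either works for the averages computed later.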

**Now we have the number of times each genre was downloaded. With the "sorted" function, let's check the entries of the new dictionary and find the greatest values.**

In [None]:
sorted(android_installs_genre.items(), key=lambda x: (-x[1], x[0]))

**It looks like Communication apps are the most downloaded and installed ones, if we consider the "genres" column. But since the Android dataset seemingly has two columns for genre (genre itself and category), we have to check the other column as well.**

In [None]:
android_category_installs = {}

for row in android_free:
    category = row[1]

    # Strip the plus sign and the thousands separators before converting.
    installs = row[5]
    installs = installs.replace('+', '')
    installs = installs.replace(',', '')
    installs = float(installs)

    # Add each app's installs exactly once to its category total.
    if category in android_category_installs:
        android_category_installs[category] += installs
    else:
        android_category_installs[category] = installs

In [None]:
sorted(android_category_installs.items(), key=lambda x: (-x[1], x[0]))

**Comparing the two dictionaries we have just created, the values for the main categories and genres are exactly the same. The only exception is the "Games" category, which is absent from the "genres" column yet has the largest install volume across the two columns, a fact we cannot ignore.**

**So we decided to keep the values from the category column, which match the genres column but are more complete.**

**Now let's divide these totals by the number of apps per category to get the average number of installs per category.**

In [None]:
android_freq_gen = freq_table(android_free, 1)
sorted(android_freq_gen.items(), key=lambda x: (-x[1], x[0]))

In [None]:
# Both dictionaries are built from the same rows, so every category key exists in both.
avg_inst_gen = {k: v1 / android_freq_gen[k] for k, v1 in android_category_installs.items()}

In [None]:
sorted(avg_inst_gen.items(), key=lambda x: (-x[1], x[0]))
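
The average per category is the total installs of the category divided by its number of apps. The sketch below works this out on hypothetical toy rows, computing both quantities in the same loop; `average_by_key` is an illustrative helper, not part of the original notebook.

```python
def average_by_key(rows, key_idx, value_idx):
    """Average the values at value_idx, grouped by the key at key_idx."""
    totals, counts = {}, {}
    for row in rows:
        key = row[key_idx]
        totals[key] = totals.get(key, 0) + row[value_idx]
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical toy rows: category, installs.
toy = [['GAME', 1000], ['GAME', 3000], ['TOOLS', 500]]
averages = average_by_key(toy, 0, 1)
ranked = sorted(averages.items(), key=lambda item: -item[1])
print(ranked)  # prints: [('GAME', 2000.0), ('TOOLS', 500.0)]
```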

**As we can see, the Android apps set is dominated by gaming and communication apps. Let's take a closer look at the apps in the top categories, starting with "Communication".**

In [None]:
communication_apps = []

for row in android_free:
    genre = row[1]
    if genre == 'COMMUNICATION':
        communication_apps.append(row)

In [None]:
sorted (communication_apps)

**As we can see, apps like Skype, Google Chrome, WhatsApp and AT&T Messenger fall into the "Communication" category. Let's check other categories, starting with the second in average installs: "Video Players".**

In [None]:
video_players_apps = []

for row in android_free:
    genre = row[1]
    if genre == 'VIDEO_PLAYERS':
        video_players_apps.append(row)

In [None]:
sorted (video_players_apps)

**The "Video Players" category shows an interesting mix, including video capture tools, players and editors. Now the third place, "Social".**

In [None]:
social_apps = []

for row in android_free:
    genre = row[1]
    if genre == 'SOCIAL':
        social_apps.append(row)

In [None]:
sorted (social_apps)

**The "Social" category includes all the big social media apps, like Facebook, Twitter and Instagram, along with tools that help people make better use of them.**
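
Calling `sorted` on a list of rows orders them alphabetically by app name. To see which apps lead a category, one option is to rank the rows by installs instead. The sketch below assumes the Android column layout used throughout (name at index 0, category at index 1, installs at index 5) and runs on hypothetical toy rows; `top_apps` is an illustrative helper.

```python
def top_apps(rows, category, n=5, name_idx=0, category_idx=1, installs_idx=5):
    """Return the n most installed apps of a category as (name, installs) pairs."""
    def installs(row):
        return int(row[installs_idx].replace(',', '').replace('+', ''))

    matching = [row for row in rows if row[category_idx] == category]
    ranked = sorted(matching, key=installs, reverse=True)
    return [(row[name_idx], installs(row)) for row in ranked[:n]]


# Hypothetical toy rows with the installs label at index 5.
toy = [
    ['App A', 'SOCIAL', '', '', '', '1,000+'],
    ['App B', 'SOCIAL', '', '', '', '5,000,000+'],
    ['App C', 'TOOLS', '', '', '', '10,000+'],
]
top_social = top_apps(toy, 'SOCIAL')
print(top_social)
```

With the real data, `top_apps(android_free, 'VIDEO_PLAYERS')` would list the leading apps of that category.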

**Let's check how much competition there is in these three categories.**

In [None]:
comlen = len (communication_apps)
vilen = len (video_players_apps)
soclen = len (social_apps)

print (comlen)
print (vilen)
print (soclen)

**With 287 entries, "Communication" is the category with the most apps competing for user visibility. Besides the competition, this category also has some big, well-established players, so we concluded that we should look elsewhere for better opportunities.**

**The "Social" category is the second most competitive. We found some big players there too.**

**By elimination, we recommend the "Video Players" category as the best opportunity for developing an app.**