# An Important Note
I worked on this project during my studies for the Dataquest online Data Science Bootcamp. It belongs to the "Python for Data Science: Fundamentals" part of the bootcamp.

# Profitable Apps for the App Store and Google Play Store


My aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play Store. I am assuming that I work as a data analyst for a company that builds Android and iOS mobile apps, and my job is to enable the team of developers to make data-driven decisions about the kind of apps they build.

Our company only builds apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. My goal for this project is to analyze data to help the developers understand what kinds of apps are likely to attract more users.

# Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play Store.

Collecting data for over four million apps requires a significant amount of time and money, so I'll analyze a sample of the data instead. To avoid spending resources on collecting new data myself, I should first see whether I can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for my purpose:

- A data set containing data about approximately ten thousand Android apps from Google Play Store.
- A data set containing data about approximately seven thousand iOS apps from the App Store.

Let's start by opening the two data sets and then continue with exploring the data.

In [1]:
# The App Store data set
from csv import reader

open_file_1 = open('AppleStore.csv', encoding='utf8')
read_file_1 = reader(open_file_1)
app_store = list(read_file_1)
open_file_1.close()  # the rows are now in memory, so the file can be closed
number_of_rows_app = len(app_store)
number_of_columns_app = len(app_store[0])

In [2]:
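# The Google Play Store data set (reader was already imported in the previous cell)
open_file_2 = open('googleplaystore.csv', encoding='utf8')
read_file_2 = reader(open_file_2)
play_store = list(read_file_2)
open_file_2.close()  # the rows are now in memory, so the file can be closed
number_of_rows_play = len(play_store)
number_of_columns_play = len(play_store[0])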
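
Both cells above follow the same open, read and convert steps, so they could be folded into one helper. The sketch below is one possible way to do that. The name `load_csv` and the `encoding='utf8'` default are my assumptions rather than part of the original notebook, and the `with` block simply makes sure the file is closed after reading.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return it as a list of rows (header included)."""
    # newline='' is what the csv module documentation recommends for files given to reader()
    with open(path, encoding=encoding, newline='') as opened_file:
        return list(reader(opened_file))
```

With this helper, each data set could be loaded in one line, for example `app_store = load_csv('AppleStore.csv')`.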

To make exploring the two data sets easier, I'll first write a function named explore_data() which I can use repeatedly to explore rows in a more readable way. I'll also add an option for my function to show the number of rows and columns for any data set.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
explore_data(app_store, 1, 4, rows_and_columns=True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


The App Store data set contains 7197 iOS apps (7198 rows including the header) and 16 columns.

In [5]:
explore_data(play_store, 1, 4, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


The Google Play Store data set contains 10841 apps (10842 rows including the header) and 13 columns.

In [6]:
print(app_store[0]) # To be able to see the column names in App Store data set
print('\n')
print(play_store[0])  # To be able to see the column names in Play Store data set

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The Google Play Store data set has a dedicated discussion section, and one of the discussions outlines an error for row 10472. Because my list still includes the header row, that row sits at index 10473. Let me print it and see what the error is.

In [7]:
print(play_store[0])  # The header
print('\n')
print(play_store[10473]) # The row which has an error

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Comparing the header with this row shows that the row has only 12 values instead of 13: the Category value is missing, so every later value is shifted one column to the left. As a result, the Rating column reads "19", which is not possible because the maximum rating for apps in Google Play Store is 5. Because this row would distort my analysis, I decided to delete it.

In [8]:
del play_store[10473]
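
A faulty row like this one can also be spotted without reading the discussion section, because it has fewer values than the header. The sketch below shows one way to scan for such rows. `find_malformed_rows` is a hypothetical helper name, and the tiny data set is made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return the indexes of rows whose length differs from the header's."""
    header_length = len(dataset[0])
    malformed = []
    for index, row in enumerate(dataset[1:], start=1):
        if len(row) != header_length:
            malformed.append(index)
    return malformed

# Made-up sample: the second data row is missing its 'Category' value.
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
]
print(find_malformed_rows(sample))  # prints [2]
```

Run on `play_store` before the `del` statement, this check would be expected to include index 10473, since that row has 12 values against 13 header columns.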

# Removing The Duplicate Entries

In [9]:
# Count the duplicate entries in the App Store data set.
# Column 0 holds the app 'id', so duplicates are detected by id here.
duplicate_apps_app = []
unique_apps_app = []

for app in app_store[1:]:
    app_id = app[0]
    if app_id in unique_apps_app:
        duplicate_apps_app.append(app_id)
    else:
        unique_apps_app.append(app_id)
        
print('Number of Duplicate Apps :', len(duplicate_apps_app))
print('\n')
print('Number of Unique Apps :', len(unique_apps_app))

Number of Duplicate Apps : 0


Number of Unique Apps : 7197


In [10]:
# To be able to find the number of duplicate entries in Google Play Store data set
duplicate_apps_play = []
unique_apps_play = []

for app in play_store[1:]:
    name = app[0]
    if name in unique_apps_play:
        duplicate_apps_play.append(name)
    else:
        unique_apps_play.append(name)
        
print('Number of Duplicate Apps :', len(duplicate_apps_play))
print('\n')
print('Number of Unique Apps :', len(unique_apps_play))

Number of Duplicate Apps : 1181


Number of Unique Apps : 9659


The App Store data set has no duplicate entries (the check above compares the values in column 0, the app id). In the Google Play Store data set, however, there are 1181 duplicate entries. I need to remove them and keep only one entry for each app. Rather than removing the duplicates at random, I will keep only the entry with the highest number of reviews, because the more reviews an entry has, the more reliable its rating should be.

In [11]:
# Record the highest number of reviews found for each app name
reviews_max = {}
for app in play_store[1:]:
    name = app[0]
    num_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < num_reviews:
        reviews_max[name] = num_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = num_reviews

I now have a dictionary that maps each app name to its highest number of reviews; the duplicates themselves have not been removed yet. Before removing them, let me check that the dictionary has the expected number of entries, 9659, which is the number of unique app names.

In [12]:
expected_length = len(unique_apps_play)
actual_length = len(reviews_max)
print('Expected Length is :' + ' ', expected_length)
print('Actual Length is :' + ' ', actual_length)

Expected Length is :  9659
Actual Length is :  9659


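Now, I will use this dictionary to build two new lists. play_clean will store the full row of each app after removing the duplicate entries, while play_added will store only the names of the apps already kept. The second list is needed because duplicates can share the same highest number of reviews, and only one of them should be kept.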

In [13]:
play_clean = []
play_added = []
for app in play_store[1:]:
    name = app[0]
    num_reviews = float(app[3])
    if (reviews_max[name] == num_reviews) and (name not in play_added):
        play_clean.append(app)
        play_added.append(name)
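
The two-step logic (first record the highest review count per name, then keep one row per name) is easier to follow on a toy data set. The rows below are invented for illustration, and the `toy_` names are hypothetical. The third row shows why the "name not already added" check matters when two duplicates share the same maximum.

```python
# Invented rows in the form [name, category, rating, reviews]
toy_rows = [
    ['Chat', 'SOCIAL', '4.0', '100'],
    ['Chat', 'SOCIAL', '4.1', '250'],
    ['Chat', 'SOCIAL', '4.1', '250'],  # same maximum as the row above
    ['Maps', 'TRAVEL', '4.5', '80'],
]

# Step 1: highest review count seen for each name
toy_reviews_max = {}
for row in toy_rows:
    name = row[0]
    num_reviews = float(row[3])
    if name not in toy_reviews_max or toy_reviews_max[name] < num_reviews:
        toy_reviews_max[name] = num_reviews

# Step 2: keep the first row that matches the maximum for each name
toy_clean = []
toy_added = set()  # a set keeps the membership test fast
for row in toy_rows:
    name = row[0]
    num_reviews = float(row[3])
    if toy_reviews_max[name] == num_reviews and name not in toy_added:
        toy_clean.append(row)
        toy_added.add(name)

print(len(toy_clean))  # prints 2
```

Using a set instead of a list for the names already added is a small design choice of this sketch: the result is the same, but the lookup does not slow down as the collection grows.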

In [14]:
explore_data(play_clean, 1, 5, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13


# Removing Non-English Apps

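Since the company only works on apps with English names, I should remove the apps with non-English names. To do this, I first need to check whether a name is in English, using the function below. It counts the characters that fall outside the ASCII range (code points above 127) and treats a name as non-English when it contains more than three of them, so that English names containing an emoji or a symbol such as ™ are not discarded.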

In [15]:
def english_names(string):
    num_non_eng = 0
    for character in string:
        if ord(character) > 127 :
            num_non_eng += 1
        if num_non_eng > 3:
            return False

    return True

In [16]:
english_names('Instagram')

True

In [17]:
english_names('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [18]:
english_names('Docs To Go™ Free Office Suite')

True

In [19]:
english_names('Instachat 😜')

True
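
The threshold of three exists because some English names contain a trademark sign or an emoji, which fall outside the ASCII range (0 to 127). The sketch below makes the counting step visible on its own. `count_non_ascii` is a hypothetical helper, not part of the original notebook.

```python
def count_non_ascii(string):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for character in string if ord(character) > 127)

print(ord('a'))  # prints 97
print(ord('™'))  # prints 8482
print(count_non_ascii('Instagram'))                      # prints 0
print(count_non_ascii('Docs To Go™ Free Office Suite'))  # prints 1
```

A name is kept as long as this count is at most three, which is the rule that english_names() applies.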

Now, it is time to keep only the apps with English names in both the Google Play Store and App Store data sets.

In [20]:
play_store_eng = []
app_store_eng = []

for app in play_clean:
    name = app[0]
    if english_names(name):
        play_store_eng.append(app)
    
for app in app_store[1:]:
    name = app[1]
    if english_names(name):
        app_store_eng.append(app)
        
# explore_data() prints its output itself and returns None,
# so it is called directly rather than wrapped in print()
explore_data(play_store_eng, 0, 5, True)
print('\n')
explore_data(app_store_eng, 1, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9614
Number of columns: 13


['389801252', 'Instagram', '113954816', '

So far, I have found that there are 9614 apps with English names in the Google Play Store data set and 6183 apps with English names in the App Store data set.

# Isolating the Free Apps
As I mentioned in the introduction, the company builds apps that are free to download and install, and our main source of revenue consists of in-app ads. The data sets contain both free and non-free apps, and I'll need to isolate only the free apps for our analysis. Below, I isolate the free apps for both of the data sets.

In [21]:
play_store_free = []
app_store_free = []

for app in play_store_eng:
    price = app[7]
    if price == '0':
        play_store_free.append(app)

for app in app_store_eng:
    price = app[4]
    if price == '0.0':
        app_store_free.append(app)
        
# explore_data() prints its output itself and returns None,
# so it is called directly rather than wrapped in print()
explore_data(play_store_free, 0, 5, True)
print('\n')
explore_data(app_store_free, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'U

After this step, the Google Play Store data set has 8864 free apps with English names.

The App Store data set has 3222 free apps with English names.
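
The two string comparisons used for filtering depend on each store's exact price formatting ('0' in Google Play Store, '0.0' in the App Store). A more tolerant check could convert the price to a number first. This is only a sketch: `is_free` is a hypothetical helper, and the assumption that paid Google Play prices carry a leading dollar sign (for example '$4.99') is mine and should be checked against the data.

```python
def is_free(price):
    """Return True when a price string represents zero."""
    cleaned = price.replace('$', '').strip()  # assumed format for paid apps: '$4.99'
    return float(cleaned) == 0.0

print(is_free('0'))      # prints True
print(is_free('0.0'))    # prints True
print(is_free('$4.99'))  # prints False
```

The same function could then be applied to index 7 of the Google Play rows and index 4 of the App Store rows.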

# Most Common Apps by Genre

As you can see, these numbers are still very high, so I need to narrow the apps down further. Since I am looking for a way to earn income through in-app advertising, it makes sense to focus on the most popular kinds of apps.

To identify them, I will use the prime_genre column in the App Store data set and the Genres and Category columns in the Google Play Store data set. Using these columns, I will build frequency tables for the apps.

In [22]:
def freq_table(dataset, index):
    f_table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in f_table:
            f_table[value] += 1
        else:
            f_table[value] = 1
    
    p_table = {}
    for key in f_table:
        percentage = (f_table[key] / total) * 100
        p_table[key] = percentage
    return p_table
    

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
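
To see what these two functions produce, it helps to run the same idea on a tiny data set. The sketch below uses `collections.Counter` from the standard library as an alternative way of counting. The sample rows and the name `freq_table_counter` are invented for illustration.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: (count / total) * 100 for value, count in counts.items()}

# Invented rows in the form [name, genre]
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]
table = freq_table_counter(sample, 1)
# Sort by percentage, highest first, as display_table() does
for genre, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
# prints:
# Games : 75.0
# Music : 25.0
```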

Now, I will start by examining the frequency table for the prime_genre column of the App Store data set.

In [24]:
display_table(app_store_free, 11) # Index 11 is for prime_genre column

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Among the free English apps in the App Store data set, the most common genre is Games, with about 58% of the apps. In general, most apps are designed for fun and entertainment, while apps with practical purposes are much rarer. However, the fact that fun apps are the most numerous does not imply that they also have the greatest number of users; the demand might not match the offer.

Now, I will continue by examining the Genres and Category columns of the Google Play Store data set (two columns which seem to be related).

In [27]:
display_table(play_store_free, 1) # Index 1 is for Category column

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The results seem significantly different on Google Play Store: there are not that many apps designed for fun, and a good number of apps are designed for practical purposes. The most common Category in the Google Play Store data set is Family, at about 19% of the apps.

In [29]:
display_table(play_store_free, 9) # Index 9 is for Genres column

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The Category and Genres columns do not differ much, except that the Genres column is much more granular (it has more categories). I am only looking for the bigger picture at the moment, so I will only work with the Category column moving forward.

# Most Popular Apps by Genre on the App Store
One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [30]:
genres_in_app_store = freq_table(app_store_free, 11)
for genre in genres_in_app_store:
    total = 0
    len_genre = 0
    for app in app_store_free:
        genre_app = app[11]
        if genre_app == genre:
            num_ratings = float(app[5])
            total = total + num_ratings
            len_genre += 1
    avg_num_rat = total / len_genre
    print(genre, ':' , avg_num_rat)

Navigation : 86090.33333333333
Education : 7003.983050847458
Utilities : 18684.456790123455
Entertainment : 14029.830708661417
Travel : 28243.8
Social Networking : 71548.34905660378
Health & Fitness : 23298.015384615384
Finance : 31467.944444444445
Weather : 52279.892857142855
Photo & Video : 28441.54375
Reference : 74942.11111111111
News : 21248.023255813954
Catalogs : 4004.0
Business : 7491.117647058823
Sports : 23008.898550724636
Music : 57326.530303030304
Book : 39758.5
Medical : 612.0
Shopping : 26919.690476190477
Food & Drink : 33333.92307692308
Lifestyle : 16485.764705882353
Productivity : 21028.410714285714
Games : 22788.6696905016


On average, navigation apps have the highest number of user ratings.
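
The nested loop used for this calculation scans the whole data set once per genre. A single pass that keeps a running sum and count per genre gives the same averages. The sketch below shows the idea on invented rows, with `average_by_column` as a hypothetical helper name.

```python
def average_by_column(dataset, group_index, value_index):
    """Average the numeric column `value_index` for each value in `group_index`."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Invented rows in the form [name, genre, rating_count]
sample = [
    ['App A', 'Navigation', '100'],
    ['App B', 'Navigation', '300'],
    ['App C', 'Games', '50'],
]
print(average_by_column(sample, 1, 2))
# prints {'Navigation': 200.0, 'Games': 50.0}
```

For the App Store data this would be called as `average_by_column(app_store_free, 11, 5)`, using the prime_genre and rating_count_tot column indexes.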

In [35]:
category_in_play_store = freq_table(play_store_free, 1)
for category in category_in_play_store:
    total = 0
    len_category = 0
    for app in play_store_free:
        category_app = app[1]
        if category_app == category:
            num_inst = app[5]
            num_inst = num_inst.replace('+', '')
            num_inst = num_inst.replace(',', '')
            
            total += float(num_inst)
            len_category += 1
    avg_num_inst = total / len_category
    print(category, ':', avg_num_inst)

EDUCATION : 1833495.145631068
BEAUTY : 513151.88679245283
TRAVEL_AND_LOCAL : 13984077.710144928
MEDICAL : 120550.61980830671
PARENTING : 542603.6206896552
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
FOOD_AND_DRINK : 1924897.7363636363
WEATHER : 5074486.197183099
PRODUCTIVITY : 16787331.344927534
NEWS_AND_MAGAZINES : 9549178.467741935
BUSINESS : 1712290.1474201474
AUTO_AND_VEHICLES : 647317.8170731707
VIDEO_PLAYERS : 24727872.452830188
ART_AND_DESIGN : 1986335.0877192982
GAME : 15588015.603248259
MAPS_AND_NAVIGATION : 4056941.7741935486
PHOTOGRAPHY : 17840110.40229885
EVENTS : 253542.22222222222
HOUSE_AND_HOME : 1331540.5616438356
BOOKS_AND_REFERENCE : 8767811.894736841
SPORTS : 3638640.1428571427
HEALTH_AND_FITNESS : 4188821.9853479853
SOCIAL : 23253652.127118643
ENTERTAINMENT : 11640705.88235294
PERSONALIZATION : 5201482.6122448975
FINANCE : 1387692.475609756
FAMILY : 3695

On average, communication apps have the most installs.
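
The Installs values are open-ended strings such as '10,000+', so they need cleaning before any arithmetic. The sketch below isolates that step. `parse_installs` is a hypothetical helper. Note that the result is a lower bound, since '10,000+' only says that the app has at least ten thousand installs.

```python
def parse_installs(installs):
    """Convert a string like '5,000,000+' to a float (a lower bound on installs)."""
    cleaned = installs.replace('+', '').replace(',', '')
    return float(cleaned)

print(parse_installs('10,000+'))     # prints 10000.0
print(parse_installs('5,000,000+'))  # prints 5000000.0
```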

# Conclusions
In this project, I analyzed data about the App Store and Google Play Store mobile apps with the goal of recommending an app profile that can be profitable for both markets.

I concluded that creating a navigation app for the App Store and a communication app for Google Play Store could help the company earn more income through ads.