# Profitable App Profiles for the App Store and Google Play Markets

The goal of this project is to analyze data to help our developers understand which types of apps are likely to attract more users on both Android and iOS. We will examine which genres are most common and which are most popular. For the scope of this project, we focus only on free apps aimed at an English-speaking audience.

In [105]:
# helper for exploring a slice of any data set
def explore(dataset, start, end, display_rows_columns=False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')
        # note: this check sits inside the loop, so the counts are printed after every row
        if display_rows_columns:
            print('No. of rows: ', len(dataset))
            print('No. of columns:', len(dataset[0]))

In [106]:
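from csv import reader

# load the App Store data; the encoding is set explicitly because app names contain non-ASCII characters
with open('AppleStore.csv', encoding='utf8') as open_file:
    ios = list(reader(open_file))
ios_header = ios[0]
ios = ios[1:]

# load the Google Play data
with open('googleplaystore.csv', encoding='utf8') as open_file:
    google = list(reader(open_file))
google_header = google[0]
google = google[1:]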


In [110]:
print(google_header)  # explore Google Play data
explore(google,1,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


No. of rows:  10841
No. of columns: 13
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


No. of rows:  10841
No. of columns: 13


For our analysis of the Google Play market, the relevant columns are App, Category, Rating, Reviews, and Price.

In [111]:
print(ios_header)  #explore ios data
explore(ios,1,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


No. of rows:  7197
No. of columns: 16
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


No. of rows:  7197
No. of columns: 16


For our analysis of the iOS market, the relevant columns are track_name, price, user_rating, and prime_genre.

Data Cleaning Process:
1. Remove incorrect data
2. Remove duplicate data
3. Remove non-English apps
4. Remove apps that are not free

Step 1: Remove incorrect data

In [113]:
# Identify rows whose length differs from the header
for index, row in enumerate(google):
    if len(row) != len(google_header):
        print(row)
        print(len(row))
        print(index)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12
10472


In [114]:
# Row 10472 has fewer columns than the header (the category is missing), so we remove it.
# Run this cell only once; running it again would delete a valid row.
del google[10472]

In [115]:
for index, row in enumerate(ios):
    if len(row) != len(ios_header):
        print(row)
        print(len(row))
        print(index)

The cell prints nothing, so all rows in the iOS data have the same length as the header.

Step 2: Remove duplicate data

If we look at the data for some apps, we see multiple rows for the same app, for reasons such as different versions or data collected at different times. Rather than removing duplicates at random, we can use the Reviews column: the row with the highest number of reviews is likely the most recent, so we keep that row and retain exactly one row per app.

In [123]:
# first we will check how many duplicate rows there are in the Google Play data
duplicate=[]
unique=[]

for row in google:
    name=row[0]
    if name in unique:
        duplicate.append(name)
    else:
        unique.append(name)

print('No. of duplicate rows:', len(duplicate))
print('\n')
print('Sample duplicate rows:',duplicate[:5])
print(google_header)
for row in google:
    name=row[0]
    if name=='ZOOM Cloud Meetings':
        print('\n')
        print(row)      


No. of duplicate rows: 1181


Sample duplicate rows: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614', '37M', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 20, 2018', '4.1.28165.0716', '4.0 and up']


['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614', '37M', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 20, 2018', '4.1.28165.0716', '4.0 and up']


In [125]:
# now for each app we will keep only the rows with the maximum number of reviews

reviews_max={}
for row in google:
    name=row[0]
    reviews=float(row[3])
    if name in reviews_max and reviews_max[name]<reviews:
        reviews_max[name]=reviews
    elif name not in reviews_max:
        reviews_max[name]=reviews

print(len(reviews_max)) # number of unique rows
print(len(google)-1181) # expected unique rows

9659
9659


In [126]:
# now we have the name of the app and the value for reviews. We will use this to remove rows which do not match this criteria

google_clean=[]
already_added=[]
for row in google:
    name=row[0]
    reviews=float(row[3])
    if name not in already_added and reviews_max[name]==reviews:
        google_clean.append(row)
        already_added.append(name) # once an entry is there in clean data we do not want to append any other row


In [127]:
print(len(google_clean)) # expected 9659 rows

9659


Step 1 and 2 are complete

Step 3: Remove non-English apps

We will do this by checking character codes. Every character has a numeric code, which Python returns via ord(). Characters commonly used in English text fall in the ASCII range, with codes from 0 to 127. Some English app names contain symbols or emojis with codes above 127, so rejecting a name because of a single such character would discard too many apps. Instead, we check every character in each app name, and if more than 3 characters have a code above 127 we assume the name is non-English. This is a reasonable estimate for now.

In [131]:
# we will create a function for identification of english and non english app names

def is_english(string):
    n=0
    for char in string:
        if ord(char)>127:
            n+=1
    if n>3:
        return False
    else:
        return True
#Checking function
is_english('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
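
To see why a threshold of three is used rather than rejecting any non-ASCII character, the sketch below restates the function and runs it on a few names. Apart from the name checked above, the sample names are illustrative assumptions and were not verified against the data sets; `max_non_ascii` is a hypothetical parameter added only for this comparison.

```python
def is_english(string, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters fall outside ASCII.

    `max_non_ascii` is a hypothetical parameter added for illustration; the
    default of 3 matches the rule described above.
    """
    non_ascii = sum(1 for char in string if ord(char) > 127)
    return non_ascii <= max_non_ascii


# illustrative names (assumed examples, not verified against the data sets)
samples = [
    'Instagram',                      # plain ASCII
    'Docs To Go™ Free Office Suite',  # one trademark symbol
    'Instachat 😜',                   # one emoji
    '爱奇艺PPS -《欢乐颂2》电视剧热播',  # mostly non-ASCII
]

for name in samples:
    print(name, ':', is_english(name))

# a stricter rule (no non-ASCII characters allowed) would reject the emoji name
print(is_english('Instachat 😜', max_non_ascii=0))
```

With the default threshold only the last sample is rejected, while the stricter setting would also drop names that merely contain a symbol or an emoji.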

In [136]:
google_eng=[]
ios_eng=[]
for row in google_clean:
    name=row[0]
    if is_english(name):
        google_eng.append(row)

for row in ios:  # the iOS data went through no duplicate-removal step, so we start from ios
    name=row[1] #track name index 1
    if is_english(name):
        ios_eng.append(row)

print(len(google_eng))
print(len(ios_eng))

print(google_header)
print(ios_header)

9614
6183
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Step 3 is complete.

Step 4: Remove apps that are not free

The price is stored as a string in both data sets. In the Google Play data it is at index 7 and free apps have the value '0'; in the iOS data it is at index 4 and free apps have the value '0.0'.

In [150]:
google_final=[]
ios_final=[]

for row in google_eng:
    price = row[7]
    if price == '0':
        google_final.append(row)

for row in ios_eng:
    price = row[4]
    if price == '0.0':
        ios_final.append(row)

print(len(google_final))
print(len(ios_final))

8864
3222
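
The string comparisons above depend on the exact text of the price field. A slightly more defensive variant, sketched below, converts the price to a number first. The helper name `is_free` is hypothetical, and the assumption that paid Google Play prices carry a leading `$` (for example `'$4.99'`) should be checked against the data before relying on it.

```python
def is_free(price):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0' or '$4.99'; anything that cannot be
    parsed as a number is treated as not free.
    """
    cleaned = price.strip().replace('$', '')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False


# small made-up rows: [name, price]
sample = [['App A', '0'], ['App B', '0.0'], ['App C', '$4.99'], ['App D', 'Everyone']]
free_apps = [row for row in sample if is_free(row[1])]
print([row[0] for row in free_apps])
```

Treating unparseable values as not free is a design choice: it keeps malformed rows out of the final data instead of raising an error halfway through the loop.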


All 4 steps of data cleaning are now complete

Part C: Data Analysis

In this section we need to identify the kind of apps that are likely to attract more users and hence more revenue

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Since the end goal is to add the app to the App Store as well, we need to find app profiles that are successful in both markets.

In [139]:
# understand which are the most common genres in each market
# Google Play: Genres is at index -4; iOS: prime_genre is at index -5
print(google_header)
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [157]:
# function to build a frequency table (as percentages) for any column
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages



In [158]:
# function that takes the frequency table (dictionary) and prints its entries in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
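
The same percentage table can also be built with `collections.Counter` from the standard library. This is a sketch of an alternative, not the approach used in this notebook; `freq_table_counter` and the sample rows are hypothetical.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# made-up rows: [name, category]
sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
for value, percentage in freq_table_counter(sample, 1):
    print(value, ':', percentage)
```

`Counter.most_common()` already returns entries in descending order of count, so no separate sorting step is needed.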

In [163]:
#check function
#display_table(google_final,-4) # genre
display_table(google_final,1) # category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [161]:
#check function for ios
display_table(ios_final,-5) # prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


From the frequency table for the iOS market, we can see that among free English apps the most common genre is Games, which makes up about 58% of the apps, followed by Entertainment at about 7.9%. Most apps are built for fun (games, photo and video, social networking), while practical apps are comparatively rare. Entertainment-oriented apps therefore dominate the iOS market in terms of the number of apps.
The Android market is quite different. On Google Play, practical categories such as Family, Tools, and Books and Reference are about as well represented as fun categories such as Games, Photography, and Social.
However, a larger number of apps does not necessarily mean more users or a higher average rating.
Next, we will investigate which kinds of apps (genres) have the most users and the highest average rating.

For the Google Play data we will use the number of installs (index 5). The iOS data has no install counts, so we will use the total number of user ratings, rating_count_tot (also index 5), as a proxy.
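
For the iOS side, the calculation could look like the sketch below. It assumes rows shaped like the iOS data shown earlier, with rating_count_tot at index 5 and prime_genre at index -5. The function name, the `make_row` helper, and the sample rows are hypothetical, and no results for the real data are implied.

```python
def avg_ratings_per_genre(dataset, genre_index=-5, ratings_index=5):
    """Return {genre: average number of user ratings per app}."""
    totals = {}
    counts = {}
    for row in dataset:
        genre = row[genre_index]
        totals[genre] = totals.get(genre, 0.0) + float(row[ratings_index])
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


def make_row(name, ratings, genre):
    """Build a made-up row with the same 16-column layout as the iOS data."""
    row = [''] * 16
    row[1], row[5], row[-5] = name, ratings, genre
    return row


sample = [
    make_row('App A', '100', 'Games'),
    make_row('App B', '300', 'Games'),
    make_row('App C', '50', 'Reference'),
]
for genre, average in avg_ratings_per_genre(sample).items():
    print(genre, ':', average)
```

On the real data this would be called as `avg_ratings_per_genre(ios_final)`, which makes a single pass over the rows instead of one pass per genre.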


In [170]:
# calculate the average number of installs per category
google_category = freq_table(google_final, 1)
for category in google_category:  # one pass per unique category
    total_users = 0
    total_apps = 0  # total_users / total_apps gives the average installs per app
    for app in google_final:  # scan every row
        category_app = app[1]
        if category_app == category:
            # strip '+' and ',' so that e.g. '1,000,000+' becomes '1000000'
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total_users += float(installs)  # add installs only for rows in this category
            total_apps += 1
    avg_installs = total_users / total_apps
    print(category, ':', avg_installs)

PARENTING : 542603.6206896552
SHOPPING : 7036877.311557789
HOUSE_AND_HOME : 1331540.5616438356
ENTERTAINMENT : 11640705.88235294
BEAUTY : 513151.88679245283
GAME : 15588015.603248259
BUSINESS : 1712290.1474201474
FINANCE : 1387692.475609756
SPORTS : 3638640.1428571427
LIFESTYLE : 1437816.2687861272
NEWS_AND_MAGAZINES : 9549178.467741935
WEATHER : 5074486.197183099
SOCIAL : 23253652.127118643
BOOKS_AND_REFERENCE : 8767811.894736841
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
HEALTH_AND_FITNESS : 4188821.9853479853
PHOTOGRAPHY : 17840110.40229885
EDUCATION : 1833495.145631068
COMMUNICATION : 38456119.167247385
EVENTS : 253542.22222222222
DATING : 854028.8303030303
MAPS_AND_NAVIGATION : 4056941.7741935486
PERSONALIZATION : 5201482.6122448975
TRAVEL_AND_LOCAL : 13984077.710144928
FOOD_AND_DRINK : 1924897.7363636363
ART_AND_DESIGN : 1986335.0877192982
COMICS : 817657.2727272727
MEDICAL : 120550.61980830671
FAMILY : 3695641.8198090694
PRODUCTIVITY : 16787331.3

Based on the above output, the COMMUNICATION category has the highest average number of installs. Let's take a closer look.

In [172]:

for app in google_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

We can see that the category is heavily dominated by a few apps, which skews the overall average number of installs.
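
One way to see the effect of a few giants is to compare the mean with the median, or to recompute the mean without the very large apps. The sketch below uses made-up install strings; `parse_installs` is a hypothetical helper and the 100,000,000 cut-off is an assumption chosen for illustration.

```python
from statistics import mean, median


def parse_installs(installs):
    """Convert a string such as '1,000,000+' to the float 1000000.0."""
    return float(installs.replace('+', '').replace(',', ''))


# made-up install counts for one category
sample = ['1,000,000,000+', '500,000,000+', '1,000,000+', '100,000+', '50,000+']
values = [parse_installs(item) for item in sample]

print('mean  :', mean(values))
print('median:', median(values))

# mean after leaving out apps with 100,000,000 or more installs
under_100m = [value for value in values if value < 100_000_000]
print('mean under 100M:', mean(under_100m))
```

In this made-up sample the mean is hundreds of times larger than the median, which is the pattern a category dominated by a handful of apps would show.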

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs

In [177]:
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

The books and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. There still seems to be a small number of extremely popular apps that skew the average.

In [179]:
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [181]:
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H
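
The chain of string comparisons above can also be expressed as a numeric range check. The sketch below is one way to do that, assuming install strings of the form '1,000,000+'; `installs_between` and the sample rows are hypothetical.

```python
def installs_between(dataset, low, high, installs_index=5):
    """Return rows whose install count n satisfies low <= n < high."""
    selected = []
    for row in dataset:
        installs = float(row[installs_index].replace('+', '').replace(',', ''))
        if low <= installs < high:
            selected.append(row)
    return selected


# made-up rows: only the name (index 0) and installs (index 5) are filled in
sample = [
    ['Reader A', '', '', '', '', '10,000,000+'],
    ['Reader B', '', '', '', '', '100,000+'],
    ['Reader C', '', '', '', '', '1,000,000,000+'],
    ['Reader D', '', '', '', '', '1,000,000+'],
]
for app in installs_between(sample, 1_000_000, 100_000_000):
    print(app[0], ':', app[5])
```

A range check avoids having to list every install bucket by hand and keeps working if a bucket that was not anticipated appears in the data.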

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

Conclusions
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.