# Profitable App Profiles for the App Store and Google Play Markets

The aim of this project is to analyze data to help developers understand what kinds of apps are likely to attract more users. We will pretend to be working as data analysts for a company that builds Android and iOS mobile apps and makes them available on Google Play and the App Store.
We will assume the following:
* The apps we build are free to download and install
* Our main source of revenue consists of in-app ads
* Our revenue for any given app is mostly influenced by the number of users who use our app, so the more users the better
* Our apps are directed to an English-speaking audience



In [2]:
from csv import reader

opened_file1=open('AppleStore.csv',encoding='utf8')
read1=reader(opened_file1)
ios=list(read1)
ios_header=ios[0]
ios=ios[1:]


opened_file2=open('googleplaystore.csv',encoding='utf8')
read2=reader(opened_file2)
android=list(read2)
android_header=android[0]
android=android[1:]
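
The loading cell above leaves both files open. As a minimal sketch (the helper name `load_dataset` and the small stand-in file are assumptions made for illustration, not part of the original notebook), the same header/rows split can be done with a `with` block so the file is closed automatically:

```python
import csv
import os
import tempfile


def load_dataset(path, encoding="utf8"):
    """Return (header, rows) for a CSV file and close the file afterwards."""
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Tiny stand-in file so the sketch runs without the Kaggle data.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", encoding="utf8", newline="") as f:
        csv.writer(f).writerows([["App", "Price"], ["Demo", "0"]])
    sample_header, sample_rows = load_dataset(sample_path)

print(sample_header)
print(sample_rows)
```

With the real files, the call would presumably look like `ios_header, ios = load_dataset('AppleStore.csv')`.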

We will now write a function that prints the data in a more readable way:

In [3]:
def explore_data(dataset,start,end,row_and_columns=False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        

    if row_and_columns:
        print ('Number of rows:' ,len(dataset))
        print('Number of columns:', len(dataset[0]))

## App Store dataset
Click [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) to access its documentation.

In [4]:
print(ios_header)
print('\n')
explore_data(ios,0,5,True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37'

## Google Play Store dataset 
Click [here](https://www.kaggle.com/lava18/google-play-store-apps/home) to access its documentation

In [5]:
print(android_header)
print('\n')
explore_data(android,0,5,True)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

# Data cleaning

## Step 1: 
According to the Google Play Store dataset [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), row 10472 (or 10473 counting the header) contains an error.

In [6]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


We observe that there is no category (2nd column) but a rating in its place, so every value in this row is shifted one column to the left. We therefore delete the row:

In [7]:
del android[10472]
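
The faulty row has one field fewer than the header, so rows of this kind can also be found without relying on the discussion forum. The sketch below is an illustration under that assumption; `find_malformed_rows` and the demo rows are hypothetical and not part of the original notebook:

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Made-up rows for illustration; the second one is missing its category.
demo_header = ["App", "Category", "Rating"]
demo_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "1.9"],
    ["App C", "GAME", "4.5"],
]
malformed = find_malformed_rows(demo_header, demo_rows)
print(malformed)
```

This check only catches rows with a wrong number of fields; a row with the right length but wrong values would need a different test.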

## Step 2: 
We will check the datasets for duplicate entries (an app appearing more than once):

In [8]:
def check_for_duplicates(dataset):
    duplicates=[]
    unique=[]

    # Index 0 holds the app name in the Android data;
    # in the iOS data it holds the row number
    for row in dataset:
        name=row[0]
        if name in unique:
            duplicates.append(name)
        else:
            unique.append(name)
    
    print('Number of duplicate apps:',len(duplicates))
    print('Number of unique apps:',len(unique))
    print('\n')
    if len(duplicates)!=0:
        print('Some of the duplicates are:')
        print(duplicates[:25])
        

Duplicates in the App Store dataset?

In [9]:
check_for_duplicates(ios)

Number of duplicate apps: 0
Number of unique apps: 7197




Duplicates in the Google Play Store dataset?

In [10]:
check_for_duplicates(android)

Number of duplicate apps: 1181
Number of unique apps: 9659


Some of the duplicates are:
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express', 'Accounting App - Zoho Books', 'Invoice & Time Tracking - Zoho', 'join.me - Simple Meetings', 'Invoice 2go — Professional Invoices and Estimates', 'SignEasy | Sign and Fill PDF and other Documents']


Going back to the dataset's discussions, we find that Instagram is among the duplicates, so we will display its rows to find out more about this issue:

In [11]:
for row in android:
    if row[0]=='Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We notice that the only difference is in the fourth position of each row (index 3), which holds the number of reviews.
This suggests that the rows were collected at different times, with a new row recorded when the information about an app changed.
So instead of deleting duplicates at random, we will keep the most recent row, which we take to be the one with the highest number of reviews.

To proceed with what we mentioned above, we will use a dictionary:

In [12]:
reviews_max={}

for row in android:
    name=row[0]
    n_reviews=float(row[3])
    if (name in reviews_max):
        if (n_reviews>reviews_max[name]):
            reviews_max[name]=n_reviews
    else:
        reviews_max[name]=n_reviews
    

We will inspect the dictionary to make sure everything went as expected.
We expect to have:

In [13]:
print(len(android)-1181, "entries")

9659 entries


And in the dictionary there are:

In [14]:
print(len(reviews_max), "entries")

9659 entries


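We will now clean the Android dataset.
We create a new list (of lists) for the clean data and use a second list to keep track of the apps that we have already added to it. The second list is needed because several rows of the same app can share the maximum number of reviews: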

In [15]:
android_clean=[]
already_added=[]

for row in android:
    name=row[0]
    n_reviews=float(row[3])
    if name not in already_added and n_reviews==reviews_max[name]:
        already_added.append(name)
        android_clean.append(row)
        

In [16]:
print("The android_clean dataset contains" ,len(android_clean), "entries")

The android_clean dataset contains 9659 entries
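
The two passes above (one to find the maximum, one to select the rows) can also be folded into a single pass that stores the best row per name. This is only a sketch of an alternative, not the notebook's method; the function name `keep_most_reviewed` and the demo rows are made up for illustration:

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when two rows tie on reviews.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows in the Android layout (name at index 0, reviews at index 3).
demo_rows = [
    ["App A", "SOCIAL", "4.5", "10"],
    ["App B", "TOOLS", "4.0", "5"],
    ["App A", "SOCIAL", "4.5", "12"],
    ["App A", "SOCIAL", "4.5", "12"],
]
deduplicated = keep_most_reviewed(demo_rows)
print(deduplicated)
```

One difference worth noting: the result is ordered by the first appearance of each name, whereas `android_clean` is ordered by the position of the row that was kept.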


## Step 3:
If we take a look at the last rows of the App Store dataset, we find non-English apps:


In [17]:
print("Apple Store dataset:")
print("\n")
explore_data(ios,len(ios)-10,len(ios)-5)

Apple Store dataset:


['11051', '1186126548', 'Escape Game: illumination', '52342784', 'USD', '0', '23', '23', '4.5', '4.5', '1.0', '4+', 'Games', '37', '5', '2', '1']


['11060', '1186384912', 'Demolition Derby Virtual Reality (VR) Racing', '168774656', 'USD', '0', '18', '18', '4', '4', '1.0.0', '12+', 'Games', '38', '4', '1', '1']


['11074', '1187128255', '飞刀传奇-动作武侠热血江湖即时PK传奇（登录爆金装）', '537462784', 'USD', '0.99', '0', '0', '0', '0', '2.1.0', '9+', 'Games', '38', '5', '1', '1']


['11077', '1187279979', 'Add-Ons Studio for Minecraft', '22999040', 'USD', '2.99', '97', '97', '3', '3', '1.0', '4+', 'Games', '37', '5', '3', '1']


['11079', '1187282363', 'Plead the Fifth - The Game', '27853824', 'USD', '2.99', '11', '0', '4', '0', '1.1.1', '17+', 'Games', '37', '0', '1', '1']




Since the apps we develop at our company are in English, we only want to analyze apps that are directed towards an English-speaking audience. 
So let us remove the non-English entries:

In [18]:
## Function that checks whether a string looks like English
## It classifies a string as "non-English" if it has more than 3 
## characters outside the ASCII range (code points above 127); the tolerance
## lets names with a few emoji or symbols such as ™ through
def english_string_checker(string):
    count=0
    for character in string:
        if (ord(character)>127):
            count+=1

    return count<=3

In [19]:
print(english_string_checker('Instagram'))
print(english_string_checker('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_string_checker('Docs To Go™ Free Office Suite'))
print(english_string_checker('Instachat 😜'))

True
False
True
True


In [20]:
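## This function returns a dataset containing only the English apps
## dataset_num: ios->0    android->1, because the name of the app is not 
## stored at the same index in both datasets (index 2 for ios, index 0 for android)
## Note: the outputs recorded in this notebook come from a run that used
## index 0 for both datasets, so no ios rows were filtered out in that run

def only_english(dataset,dataset_num):
    new_dataset=[]
    if dataset_num:
        i=0
    else:
        i=2
        
    for row in dataset:
        name=row[i]
        if english_string_checker(name):
            new_dataset.append(row)
            
    return new_dataset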
            

In [21]:
ios_english=only_english(ios,0)
android_english=only_english(android_clean,1)

In [22]:
print("Remaining rows for IOS:", len(ios_english))
print("Remaining rows for Android:" ,len(android_english))

Remaining rows for IOS: 7197
Remaining rows for Android: 9614
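
Which column is tested matters a great deal here. As a hedged sketch (the names `is_english` and `filter_english` and the demo rows are assumptions for illustration), passing the name column index directly avoids the numeric dataset flag and makes the choice explicit at the call site:

```python
def is_english(text, max_non_ascii=3):
    """True when at most max_non_ascii characters fall outside ASCII."""
    return sum(1 for ch in text if ord(ch) > 127) <= max_non_ascii


def filter_english(rows, name_index):
    """Keep the rows whose name column passes is_english."""
    return [row for row in rows if is_english(row[name_index])]


# Made-up rows in an iOS-like layout: row number, id, name.
demo_ios = [
    ["1", "100", "Instagram"],
    ["2", "101", "爱奇艺PPS -《欢乐颂2》电视剧热播"],
]
by_name = filter_english(demo_ios, name_index=2)
by_row_number = filter_english(demo_ios, name_index=0)
print(len(by_name), len(by_row_number))
```

Filtering on index 0 of iOS-shaped rows tests the row number, which is always plain ASCII, so nothing is removed; filtering on index 2 tests the app name.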


## Step 4:
As we mentioned earlier, our main source of income is from in-app ads, so our apps are free. Therefore, we must focus our analysis on free apps.
This will be the final step of the data cleaning process.

In [23]:
ios_data=[]
android_data=[]

for app in android_english:
    price=app[7]
    if price=='0':
        android_data.append(app)
        
for app in ios_english:
    price=app[5]
    if price=='0':
        ios_data.append(app)
    
    

In [24]:
print("Remaining rows for IOS:", len(ios_data))
print("Remaining rows for Android:" ,len(android_data))

Remaining rows for IOS: 4056
Remaining rows for Android: 8864
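
The comparison `price=='0'` relies on free apps being stored exactly as the string `'0'` in both files. If a copy of the data used another spelling, a more tolerant check could parse the price instead. The helper below is a sketch under that assumption; the name `is_free` and the extra formats (`'0.0'`, `'$0'`) are hypothetical and were not observed in the rows shown above:

```python
def is_free(price_text):
    """True when a price string such as '0', '0.0' or '$0' means free."""
    cleaned = price_text.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable prices are treated as not free rather than guessed at.
        return False


for example in ["0", "0.0", "$4.99", "3.99", "Everyone"]:
    print(example, is_free(example))
```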


# Analysis

## Most common apps by genre

As we mentioned earlier, our end goal is to create a profitable app for the App Store and Google Play. 
To minimize the risks we will:
* Build a minimal Android version of an app and add it to Google Play
* Track the users' response to the app and develop it further if it receives positive feedback
* If it is profitable after six months, build an iOS version of the app and add it to the App Store

It is therefore essential that we find an app profile that fits both stores.

As a reminder, these are the columns of each data set:

In [25]:
print(android_header)
print('\n','\n')
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

 

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


To determine the most common genres in each market we will use:
* Android: `Category` (column 1) and `Genres` (column 9)
* iOS: `prime_genre` (column 12)

We will now build frequency tables and sort the genres from most to least common.

In [26]:
def freq_table(dataset,index):
    dictionary={}
    dict_percentage={}
    total=0
    # Count how many rows carry each value of the chosen column
    for row in dataset:
        field=row[index]
        total+=1
        if field in dictionary:
            dictionary[field]+=1
        else:
            dictionary[field]=1

    # Convert the counts to percentages once all rows have been counted
    for key in dictionary:
        dict_percentage[key]=(dictionary[key]/total)*100
            
    return dict_percentage


In [27]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0], '%')
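
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library, which also takes care of the sorting. The function name `freq_table_counter` and the demo rows are assumptions for illustration:

```python
from collections import Counter


def freq_table_counter(rows, index):
    """Map each value of a column to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() yields the values from most to least frequent.
    return {key: count / total * 100 for key, count in counts.most_common()}


# Made-up rows: app name, category.
demo_rows = [["a", "GAME"], ["b", "GAME"], ["c", "TOOLS"], ["d", "GAME"]]
demo_table = freq_table_counter(demo_rows, 1)
for key, share in demo_table.items():
    print(key, ":", share, "%")
```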

In [28]:
print('Google Play by category:')
print('\n')
display_table(android_data,1)

Google Play by category:


FAMILY : 18.907942238267147 %
GAME : 9.724729241877256 %
TOOLS : 8.461191335740072 %
BUSINESS : 4.591606498194946 %
LIFESTYLE : 3.9034296028880866 %
PRODUCTIVITY : 3.892148014440433 %
FINANCE : 3.7003610108303246 %
MEDICAL : 3.531137184115524 %
SPORTS : 3.395758122743682 %
PERSONALIZATION : 3.3167870036101084 %
COMMUNICATION : 3.2378158844765346 %
HEALTH_AND_FITNESS : 3.0798736462093865 %
PHOTOGRAPHY : 2.944494584837545 %
NEWS_AND_MAGAZINES : 2.7978339350180503 %
SOCIAL : 2.6624548736462095 %
TRAVEL_AND_LOCAL : 2.33528880866426 %
SHOPPING : 2.2450361010830324 %
BOOKS_AND_REFERENCE : 2.1435018050541514 %
DATING : 1.861462093862816 %
VIDEO_PLAYERS : 1.7937725631768955 %
MAPS_AND_NAVIGATION : 1.3989169675090252 %
FOOD_AND_DRINK : 1.2409747292418771 %
EDUCATION : 1.1620036101083033 %
ENTERTAINMENT : 0.9589350180505415 %
LIBRARIES_AND_DEMO : 0.9363718411552346 %
AUTO_AND_VEHICLES : 0.9250902527075812 %
HOUSE_AND_HOME : 0.8235559566787004 %
WEATHER : 0.800992779783

In [29]:
print('Google Play by genre:')
print('\n')
display_table(android_data,9)

Google Play by genre:


Tools : 8.449909747292418 %
Entertainment : 6.069494584837545 %
Education : 5.347472924187725 %
Business : 4.591606498194946 %
Productivity : 3.892148014440433 %
Lifestyle : 3.892148014440433 %
Finance : 3.7003610108303246 %
Medical : 3.531137184115524 %
Sports : 3.463447653429603 %
Personalization : 3.3167870036101084 %
Communication : 3.2378158844765346 %
Action : 3.1024368231046933 %
Health & Fitness : 3.0798736462093865 %
Photography : 2.944494584837545 %
News & Magazines : 2.7978339350180503 %
Social : 2.6624548736462095 %
Travel & Local : 2.3240072202166067 %
Shopping : 2.2450361010830324 %
Books & Reference : 2.1435018050541514 %
Simulation : 2.0419675090252705 %
Dating : 1.861462093862816 %
Arcade : 1.8501805054151623 %
Video Players & Editors : 1.7712093862815883 %
Casual : 1.7599277978339352 %
Maps & Navigation : 1.3989169675090252 %
Food & Drink : 1.2409747292418771 %
Puzzle : 1.128158844765343 %
Racing : 0.9927797833935018 %
Role Playing : 0.93637184

* The most common category is `FAMILY`, followed by `GAME`, which is about half as frequent. 
* The most common genre is `Tools`, followed by `Entertainment`.

In [30]:
print('App Store by Prime genre:')
print('\n')
display_table(ios_data,12)

App Store by Prime genre:


Games : 55.64595660749507 %
Entertainment : 8.234714003944774 %
Photo & Video : 4.117357001972387 %
Social Networking : 3.5256410256410255 %
Education : 3.2544378698224854 %
Shopping : 2.983234714003945 %
Utilities : 2.687376725838264 %
Lifestyle : 2.3175542406311638 %
Finance : 2.0710059171597637 %
Sports : 1.947731755424063 %
Health & Fitness : 1.8737672583826428 %
Music : 1.6518737672583828 %
Book : 1.6272189349112427 %
Productivity : 1.5285996055226825 %
News : 1.4299802761341223 %
Travel : 1.3806706114398422 %
Food & Drink : 1.0601577909270217 %
Weather : 0.7642998027613412 %
Reference : 0.4930966469428008 %
Navigation : 0.4930966469428008 %
Business : 0.4930966469428008 %
Catalogs : 0.22189349112426035 %
Medical : 0.19723865877712032 %


* We can see that the most common genre in the App Store is `Games`: more than half of the apps are games. `Entertainment` comes second, far behind at about 8%. 
* Other common genres are `Photo & Video`, `Social Networking` and `Education`, which together make up about 11% of the apps.
* The general impression is that most of the apps are designed for fun, entertainment and socializing.
<br>
<br>
<br>

Comparing the results of both stores, we notice that in the App Store a striking portion of the apps are games. In the Google Play Store the apps are more equally distributed among categories.

One similarity between the two stores is the presence of "fun" categories at the top of the ranking. This could mean two things:
1. The demand for these apps is high, which encourages developers to release new apps
2. The competition is much higher when it comes to entertainment and games. 
<br>
<br>
<br>

**However**, these numbers only represent the number of apps within each category, not the popularity of the category. So we cannot yet draw any conclusion regarding the genre of the app we want to build.
 


## Most Popular Apps by Genre
<br>

###  App Store
Unlike the Google Play dataset, the App Store dataset does not contain an `Installs` column. As a workaround we will use the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.
<br>

We will start by calculating the average number of ratings per app genre on the App Store:



In [69]:
unique_genre=freq_table(ios_data,12)
ios_avg_rating={}

for genre in unique_genre:
    ios_avg_rating[genre]=0
    total=0
    len_genre=0
    for row in ios_data:
        if (row[12]==genre):
            len_genre+=1
            total+=float(row[6])
    ios_avg_rating[genre]=total/len_genre
ios_avg_rating_list=[]    
for genre in ios_avg_rating:
    ios_avg_rating_list.append((genre, ios_avg_rating[genre]))

In [70]:
print("Average number of ratings per genre: ", '\n')

#print sorted
for genre in sorted(ios_avg_rating_list, key= lambda x :x[1],reverse=True): 
    print(genre[0],":", genre[1])
    


Average number of ratings per genre:  

Reference : 67447.9
Music : 56482.02985074627
Social Networking : 53078.195804195806
Weather : 47220.93548387097
Photo & Video : 27249.892215568863
Navigation : 25972.05
Travel : 20216.01785714286
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Health & Fitness : 19952.315789473683
Productivity : 19053.887096774193
Games : 18924.68896765618
Shopping : 18746.677685950413
News : 15892.724137931034
Utilities : 14010.100917431193
Finance : 13522.261904761905
Entertainment : 10822.961077844311
Lifestyle : 8978.308510638299
Book : 8498.333333333334
Business : 6367.8
Education : 6266.333333333333
Catalogs : 1779.5555555555557
Medical : 459.75


The most popular genres based on the rating count are `Reference`, `Music`, `Social Networking` and `Weather`, with fairly close averages ranging from about 47k to 67k ratings.
<br>

The Music and Social Networking genres are dominated by very popular apps such as **Spotify**, **Instagram**, **Twitter** or **Facebook**, so it would be extremely difficult to compete in them, especially since our company is not focused on a single app.
<br>
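
Whether a genre's average is driven by a handful of giants can be checked by listing its most-rated apps. The helper below is a sketch only: the name `top_apps` is hypothetical, its default indices follow the iOS header shown earlier (`track_name` at 2, `rating_count_tot` at 6, `prime_genre` at 12), and the demo rows are made up:

```python
def top_apps(rows, genre, genre_index=12, name_index=2, count_index=6, n=5):
    """Return the n most-rated (name, rating count) pairs within a genre."""
    in_genre = [row for row in rows if row[genre_index] == genre]
    in_genre.sort(key=lambda row: float(row[count_index]), reverse=True)
    return [(row[name_index], int(float(row[count_index]))) for row in in_genre[:n]]


# Made-up rows in a compact layout: name, genre, rating count.
demo_rows = [
    ["Big Dictionary", "Reference", "900000"],
    ["Small Guide", "Reference", "120"],
    ["Some Game", "Games", "5000"],
    ["Pocket Atlas", "Reference", "80"],
]
demo_top = top_apps(demo_rows, "Reference", genre_index=1, name_index=0,
                    count_index=2, n=2)
print(demo_top)
```

On the notebook's data the call would presumably be `top_apps(ios_data, 'Reference')`; if one or two apps account for most of the ratings, the genre average overstates what a typical app can expect.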

Based on these observations, we would recommend building an app that falls into the `Reference` genre for the App Store. These apps are usually used on a daily basis and could potentially attract more users than games would. 


### Google Play 

On Google Play the number of installs for each app is given as an open-ended range: 100,000+, 500,000+, ...
Although these numbers are not precise, they should be enough for us to get a general idea of the most popular categories of apps.
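
To average these values, each string first has to be turned into a number. A minimal sketch of that conversion follows; the helper name `parse_installs` is an assumption, and the result should be read as a lower bound, since `100,000+` means at least 100,000 installs:

```python
def parse_installs(text):
    """Convert an install range such as '10,000+' to its lower bound."""
    return int(text.replace(",", "").replace("+", ""))


for example in ["10,000+", "500,000+", "1,000,000,000+"]:
    print(example, "->", parse_installs(example))
```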

In [74]:
unique_category=freq_table(android_data,1)
install_list=[]

for category in unique_category:
    total=0
    len_category=0
    for row in android_data:
        category_app=row[1]
        if category_app==category:
            cleaning_install=row[5].replace(',','')   # remove the thousands separators
            cleaning_install=cleaning_install.replace('+','')   # remove the + character
            total+=float(cleaning_install)
            len_category+=1
            
    install_list.append((category,total/len_category))
        
        
    

In [77]:
print('Average number of installs per category:','\n')
for category in sorted(install_list, key= lambda x :x[1],reverse=True):
    print(category[0],':', category[1])

Average number of installs per category: 

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
P

The most installed apps on Google Play belong to the categories `COMMUNICATION`, `VIDEO_PLAYERS`, `SOCIAL`, `PHOTOGRAPHY`, `PRODUCTIVITY`, `GAME` and `TRAVEL_AND_LOCAL`, with averages ranging from about 14 million to 38 million installs per app.
<br>

As we mentioned earlier, `COMMUNICATION` and `SOCIAL` are very popular but difficult categories to compete in, and the `VIDEO_PLAYERS` category is most likely dominated by **YouTube**. The remaining categories are `PHOTOGRAPHY`, `PRODUCTIVITY`, `GAME` and `TRAVEL_AND_LOCAL`, which, with the exception of games, contain apps that are used daily for practical purposes. 
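
One way to see how much a few giants inflate a category average is to recompute it while leaving out apps above some install threshold. The sketch below illustrates the idea; the function name `average_installs`, the threshold of 100 million and the demo rows are assumptions, not values taken from the notebook:

```python
def average_installs(rows, max_installs=None, installs_index=5):
    """Average installs, optionally ignoring apps at or above max_installs."""
    values = []
    for row in rows:
        installs = int(row[installs_index].replace(",", "").replace("+", ""))
        if max_installs is None or installs < max_installs:
            values.append(installs)
    return sum(values) / len(values) if values else 0.0


# Made-up rows in a compact layout: name, installs.
demo_rows = [
    ["Giant Tube", "1,000,000,000+"],
    ["Player A", "1,000,000+"],
    ["Player B", "500,000+"],
]
with_giants = average_installs(demo_rows, installs_index=1)
without_giants = average_installs(demo_rows, max_installs=100_000_000,
                                  installs_index=1)
print(with_giants, without_giants)
```

In the demo a single very large app lifts the average from 750,000 to over 300 million, which is the kind of distortion the paragraph above is concerned with.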


# Conclusion
<br>

* The stores are clearly flooded with games, so it would be risky to get into game development. However, this trend of "fun" apps indicates that there are a lot of users seeking entertainment in both the App Store and Google Play. 
* Social networking and communication apps are very popular but the competition is very high. These categories are led by giants such as **WhatsApp** and **Facebook**, so getting into this field would be risky and difficult. 
* Apps that are used daily for practical purposes are also very popular.

The ideal app to develop would have a fun and entertaining side and would be used daily as a reference or tool.
An app that matches this profile is a book app on self-help, comedy or technical knowledge. But since both app stores are already flooded with libraries, we could add features to the book: interactive reading, adding notes, sharing paragraphs with friends, listing underlined sentences in a document... 
