# Profitable App Profiles for the App Store and Google Play Markets

In this project we focus on apps that are free to download and install, where the main source of revenue is in-app ads. This means our revenue for any given app depends mostly on the number of people who use it: the more users who see and engage with the ads, the better. Our goal is to analyze data that helps our developers understand which types of apps are likely to attract more users.

The Google Play Store data set, googleplaystore.csv, is available [here](https://www.kaggle.com/lava18/google-play-store-apps).

The iOS App Store data set, AppleStore.csv, is available [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [2]:
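def explore_data(dataset):
    """Print the first five rows of dataset, then its row and column counts."""
    # the first row is the header, so it appears in the preview and in the row count
    for row in dataset[:5]:
        print(row)
        print()
    print("Number of rows = ",len(dataset))
    print("Number of columns = ",len(dataset[0]))
    print()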

In [3]:
#Opening and displaying a few rows of AppleStore.csv
from csv import reader

# assumes the file is UTF-8 encoded; the with-block closes it after reading
with open("AppleStore.csv", encoding="utf8") as fhand_ios:
    ios = list(reader(fhand_ios))
print("Applestore.csv ---->")
explore_data(ios)

#Opening and displaying a few rows of googleplaystore.csv
with open("googleplaystore.csv", encoding="utf8") as fhand_google:
    ps = list(reader(fhand_google))
print("googleplaystore.csv ---->")
explore_data(ps)


Applestore.csv ---->
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']

['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']

Number of rows =  7198
Number of columns =  16

googleplaystore.csv ---->
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updat

In [4]:
# checking for duplicates in google play store dataset
unique_google = []
duplicate_google = []
for rows in ps:
    row = rows[0]
    if row in unique_google:
        duplicate_google.append(row)
    else:
        unique_google.append(row)
print("Number of duplicate records : ",len(duplicate_google))
print("Some of the duplicate records are : ",duplicate_google[:20])

Number of duplicate records :  1181
Some of the duplicate records are :  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


In [5]:
# checking for duplicates in apple store dataset
unique_ios = []
duplicate_ios = []
for rows in ios:
    row = rows[1]
    if row in unique_ios:
        duplicate_ios.append(row)
    else:
        unique_ios.append(row)
print("Number of duplicate records : ",len(duplicate_ios))
print("Some of the duplicate records are : ",duplicate_ios[:20])

Number of duplicate records :  2
Some of the duplicate records are :  ['Mannequin Challenge', 'VR Roller Coaster']
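The two duplicate checks above test membership in a list, which rescans the list for every row. As a hedged alternative, the sketch below counts names with `collections.Counter`. The helper name `find_duplicates` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def find_duplicates(rows, name_index):
    """Return {name: number_of_extra_copies} for names that occur more than once."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n - 1 for name, n in counts.items() if n > 1}

# Toy rows standing in for the real data (the app name is column 0 here).
toy_rows = [
    ["Box", "BUSINESS"],
    ["Slack", "BUSINESS"],
    ["Box", "BUSINESS"],
    ["Box", "BUSINESS"],
]
dupes = find_duplicates(toy_rows, 0)
print(dupes)                # {'Box': 2}
print(sum(dupes.values()))  # 2 duplicate records in total
```

Summing the values gives the same kind of total as `len(duplicate_google)`, because each extra copy of a name is counted once.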


We saw that there are 1181 duplicate records in the googleplaystore.csv dataset. Rather than removing duplicates at random, we keep one record per app, the one with the highest number of reviews, and remove the rest. The higher the review count, the more recent the record should be.

To find the entry with the most reviews for each app, we build a dictionary that maps each app name to its highest review count.

In [6]:
del ps[10473]   # row 10473 has a column shift, so we remove it
review_max = {}
for row in ps[1:]:
    name = row[0]
    n_reviews = float(row[3])
    # keep the largest review count seen so far for each app name
    if name not in review_max or review_max[name] < n_reviews:
        review_max[name] = n_reviews
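The `del ps[10473]` line in the cell above relies on already knowing which row is malformed. One possible way to locate such rows, assuming a column shift shows up as a row whose length differs from the header's, is sketched below. The helper `find_malformed_rows` and the toy data are hypothetical.

```python
def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header row's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Toy data: header plus two rows, the second of which is missing a column.
toy = [
    ["App", "Category", "Rating"],
    ["Good App", "TOOLS", "4.1"],
    ["Shifted App", "1.9"],
]
bad = find_malformed_rows(toy)
print(bad)  # [2]
```

A length check only catches rows with a missing or extra field. A row with the right number of fields but misplaced values would need a separate check.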

With the dictionary in place, we build a clean dataset that keeps, for each app, only the row whose review count matches the maximum:

In [7]:
#creating our fresh dataset for project with no duplicate records
ps_clean = []
already_added = []
for row in ps[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if review_max[name] == n_reviews and name not in already_added:
        ps_clean.append(row)
        already_added.append(name)
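The two cells above can also be folded into a single pass that stores the best row per app directly. This is a sketch on toy rows, not the notebook's method. The function name `dedupe_by_max_reviews` and the sample values are assumptions made for illustration.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # a strict '>' keeps the first row seen when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows shaped like [App, Category, Rating, Reviews].
toy_rows = [
    ["Box", "BUSINESS", "4.2", "159872"],
    ["Slack", "BUSINESS", "4.4", "51507"],
    ["Box", "BUSINESS", "4.2", "159900"],
]
clean = dedupe_by_max_reviews(toy_rows)
for row in clean:
    print(row)
```

The result is ordered by each app's first appearance, which may differ from the order of `ps_clean`, but the set of kept rows follows the same rule.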

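We want to analyze only apps aimed at an English-speaking audience. To build a data set of English apps we wrote a filter function, english_check(), which treats a name as English when it contains at most three characters outside the ASCII range (code points 0 to 127). Allowing up to three such characters keeps English names that include an emoji or a symbol such as ™.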

In [8]:
# function to check whether the string is english or not

def english_check(string):
    c = 0   # number of non-ASCII characters found
    for i in string:
        if ord(i) > 127:
            c+=1
    return c <= 3

We now filter both data sets by app name to form new lists of English apps for Google Play and iOS.

In [30]:
english_app_google=[]
english_app_ios=[]
for row in ps_clean:
    name = row[0]
    res = english_check(name)
    if res == True:
        english_app_google.append(row)

In [31]:
len(english_app_google)

9614

In [32]:
for row in ios[1:]:
    name = row[1]
    res = english_check(name)
    if res == True:
        english_app_ios.append(row)

In [33]:
len(english_app_ios)

6183

So far we have kept only the English apps. Our other criterion is that the apps must be free, so we now filter the English apps down to the free ones.

In [34]:
free_english_apps_google=[]
free_english_apps_ios=[]
for row in english_app_google:
    price = row[7]
    if price == '0':
        free_english_apps_google.append(row)

In [35]:
len(free_english_apps_google)

8864

In [36]:
for row in english_app_ios:
    price = row[4]
    if price == '0.0':
        free_english_apps_ios.append(row)

In [37]:
len(free_english_apps_ios)

3222
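The two filters above compare the price against the exact strings '0' and '0.0'. If price formats varied, a numeric comparison would be more tolerant. The sketch below assumes prices may carry a leading dollar sign. The helper `is_free` and the sample strings are hypothetical.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' parses to zero."""
    cleaned = price.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # values that are not numbers are treated as not free

for p in ["0", "0.0", "$0.99", "Everyone"]:
    print(p, is_free(p))
```

Treating values that fail to parse as not free is a design choice. It keeps malformed rows out of the free-app lists rather than raising an error.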

Because our end goal is to publish the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.

2. If the app has a good response from users, we develop it further.

3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

In [38]:
#function to find frequency table
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        table[value] = table.get(value,0)+1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

In [39]:
# function to build the frequency table and display it in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
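As a quick check of what the two functions above compute, the sketch below builds a percentage table for a tiny toy data set and prints it in descending order. It uses `collections.Counter` instead of the manual dictionary. The name `freq_percentages` and the toy rows are assumptions for illustration.

```python
from collections import Counter

def freq_percentages(dataset, index):
    """Map each value found in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

# Toy rows shaped like [name, genre].
toy = [["a", "Games"], ["b", "Games"], ["c", "Education"], ["d", "Games"]]
table = freq_percentages(toy, 1)
for genre, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{genre} : {pct:.2f}")
# Games : 75.00
# Education : 25.00
```

Three of the four toy rows are Games, so that genre accounts for 75% of the table.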

In [40]:
#category

display_table(ps_clean,1)

FAMILY : 19.40159436794699
GAME : 9.793974531525002
TOOLS : 8.582669013355419
BUSINESS : 4.348276219070297
MEDICAL : 4.089450253649446
PERSONALIZATION : 3.892742519929599
PRODUCTIVITY : 3.8720364426959315
LIFESTYLE : 3.820271249611761
FINANCE : 3.571798322807744
SPORTS : 3.3647375504710633
COMMUNICATION : 3.2612071643027227
HEALTH_AND_FITNESS : 2.9816751216482036
PHOTOGRAPHY : 2.9092038513303655
NEWS_AND_MAGAZINES : 2.6296718086758464
SOCIAL : 2.474376229423336
BOOKS_AND_REFERENCE : 2.298374572937157
TRAVEL_AND_LOCAL : 2.2673154570866547
SHOPPING : 2.0913138006004766
DATING : 1.760016564861787
VIDEO_PLAYERS : 1.6978983331607829
MAPS_AND_NAVIGATION : 1.3562480588052592
FOOD_AND_DRINK : 1.1595403250854126
EDUCATION : 1.1077751320012423
ENTERTAINMENT : 0.9007143596645615
AUTO_AND_VEHICLES : 0.8800082824308935
LIBRARIES_AND_DEMO : 0.8696552438140595
WEATHER : 0.8178900507298892
HOUSE_AND_HOME : 0.7557718190288849
EVENTS : 0.6625944714773786
ART_AND_DESIGN : 0.6315353556268765
PARENTING : 0

In [41]:
#Genres

display_table(ps_clean,-4)

Tools : 8.572315974738585
Entertainment : 5.808054664043897
Education : 5.280049694585361
Business : 4.348276219070297
Medical : 4.089450253649446
Personalization : 3.892742519929599
Productivity : 3.8720364426959315
Lifestyle : 3.8099182109949266
Finance : 3.571798322807744
Sports : 3.4268557821720678
Communication : 3.2612071643027227
Action : 3.095558546433378
Health & Fitness : 2.9816751216482036
Photography : 2.9092038513303655
News & Magazines : 2.6296718086758464
Social : 2.474376229423336
Books & Reference : 2.298374572937157
Travel & Local : 2.2569624184698207
Shopping : 2.0913138006004766
Simulation : 1.9981364530489698
Arcade : 1.9049591054974633
Dating : 1.760016564861787
Casual : 1.7082513717776169
Video Players & Editors : 1.6771922559271144
Maps & Navigation : 1.3562480588052592
Puzzle : 1.232011595403251
Food & Drink : 1.1595403250854126
Role Playing : 1.0870690547675743
Strategy : 0.9835386685992339
Racing : 0.9421265141318977
Auto & Vehicles : 0.8800082824308935
Libra

In [42]:
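#prime_genre
# note: ios still contains the header row here, so 'prime_genre' is counted as a genre
# in the output; passing ios[1:] would exclude it

display_table(ios,-5)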

Games : 53.65379272020006
Entertainment : 7.4326201722700755
Education : 6.293414837454849
Photo & Video : 4.848569046957488
Utilities : 3.445401500416782
Health & Fitness : 2.5006946373992776
Productivity : 2.4729091414281745
Social Networking : 2.3200889135871074
Lifestyle : 2.0005557099194218
Music : 1.9171992220061127
Shopping : 1.694915254237288
Sports : 1.5837732703528755
Book : 1.5559877743817727
Finance : 1.4448457904973604
Travel : 1.125312586829675
News : 1.0419560989163656
Weather : 1.0002778549597109
Reference : 0.8891358710752986
Food & Drink : 0.875243123089747
Business : 0.7918866351764378
Navigation : 0.6390664073353709
Medical : 0.31953320366768545
Catalogs : 0.13892747985551543
prime_genre : 0.01389274798555154


# Most Common Genres of AppleStore.csv Dataset

From the output of `display_table(ios,-5)` above, we can see that Games is the most common genre in the App Store, with about 53.65% of apps.

The runners-up are:

Entertainment : 7.43 %

Education : 6.29 %

Photo & Video : 4.85 %

followed by many smaller genres.

# What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?

From the above analysis, most apps are games or entertainment apps, although Education also holds about 6.3% of apps.

Overall, the App Store is dominated by apps designed for entertainment rather than for practical purposes.

# Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?

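According to the frequency table, Games has the largest number of apps. However, the table only counts apps and says nothing about how many users each genre has, so a large number of apps in a genre does not imply a large number of users. We cannot recommend an app profile from this table alone.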

# Analyze the frequency table you generated for the Category and Genres column of the Google Play data set.

# Most common genres are:

Tools : 8.572315974738585

Entertainment : 5.808054664043897

Education : 5.280049694585361

Business : 4.348276219070297

Medical : 4.089450253649446

Personalization : 3.892742519929599

In [46]:
genre_apple = freq_table(ios[1:],-5)
genre_apple

{'Book': 1.5562039738780047,
 'Business': 0.7919966652771988,
 'Catalogs': 0.1389467833819647,
 'Education': 6.294289287203002,
 'Entertainment': 7.433652910935113,
 'Finance': 1.445046547172433,
 'Food & Drink': 0.8753647353063776,
 'Games': 53.66124774211477,
 'Health & Fitness': 2.501042100875365,
 'Lifestyle': 2.0008336807002918,
 'Medical': 0.31957760177851885,
 'Music': 1.9174656106711132,
 'Navigation': 0.6391552035570377,
 'News': 1.0421008753647354,
 'Photo & Video': 4.849242740030569,
 'Productivity': 2.473252744198972,
 'Reference': 0.8892594136445742,
 'Shopping': 1.6951507572599693,
 'Social Networking': 2.3204112824788106,
 'Sports': 1.5839933305543976,
 'Travel': 1.1254689453939142,
 'Utilities': 3.4458802278727245,
 'Weather': 1.0004168403501459}

In [47]:
for genre in genre_apple:
    total=0
    len_genre = 0
    for row in ios[1:]:
        genre_app = row[-5]
        if genre_app == genre:
            total +=float(row[5])
            len_genre+=1
    print(genre, " : ",total/len_genre)

Health & Fitness  :  9913.172222222222
Shopping  :  18615.32786885246
Business  :  4788.087719298245
Lifestyle  :  6161.763888888889
Navigation  :  11853.95652173913
Utilities  :  6863.822580645161
News  :  13015.066666666668
Photo & Video  :  14352.280802292264
Games  :  13691.996633868463
Travel  :  14129.444444444445
Food & Drink  :  13938.619047619048
Social Networking  :  45498.89820359281
Finance  :  11047.653846153846
Catalogs  :  1732.5
Entertainment  :  7533.678504672897
Sports  :  14026.929824561403
Reference  :  22410.84375
Weather  :  22181.027777777777
Productivity  :  8051.3258426966295
Education  :  2239.2295805739514
Book  :  5125.4375
Music  :  28842.021739130436
Medical  :  592.7826086956521


## Recommendation for Apple Store

<h6>App profile: <b>Social Networking</b> (highest average number of user ratings per app, about 45,499)</h6>
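The per-genre averages behind this recommendation were computed with a nested loop that rescans the data set once per genre. A single-pass alternative, sketched here on toy rows, accumulates totals and counts together and then ranks the genres. The helper `average_by_key` and the toy values are assumptions, not the notebook's code.

```python
from collections import defaultdict

def average_by_key(rows, key_index, value_index):
    """Average the numeric column `value_index` for each value in column `key_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += float(row[value_index])
        counts[row[key_index]] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Toy rows shaped like [track_name, rating_count_tot, prime_genre].
toy = [
    ["App A", "100", "Games"],
    ["App B", "300", "Games"],
    ["App C", "50", "Weather"],
]
averages = average_by_key(toy, key_index=2, value_index=1)
ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('Games', 200.0), ('Weather', 50.0)]
```

Sorting the result puts the genre with the highest average first, which is easier to read than the unordered printout.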

In [51]:
genre_google = freq_table(ps_clean,1)
genre_google

{'ART_AND_DESIGN': 0.6315353556268765,
 'AUTO_AND_VEHICLES': 0.8800082824308935,
 'BEAUTY': 0.5487110466922042,
 'BOOKS_AND_REFERENCE': 2.298374572937157,
 'BUSINESS': 4.348276219070297,
 'COMICS': 0.5797701625427063,
 'COMMUNICATION': 3.2612071643027227,
 'DATING': 1.760016564861787,
 'EDUCATION': 1.1077751320012423,
 'ENTERTAINMENT': 0.9007143596645615,
 'EVENTS': 0.6625944714773786,
 'FAMILY': 19.40159436794699,
 'FINANCE': 3.571798322807744,
 'FOOD_AND_DRINK': 1.1595403250854126,
 'GAME': 9.793974531525002,
 'HEALTH_AND_FITNESS': 2.9816751216482036,
 'HOUSE_AND_HOME': 0.7557718190288849,
 'LIBRARIES_AND_DEMO': 0.8696552438140595,
 'LIFESTYLE': 3.820271249611761,
 'MAPS_AND_NAVIGATION': 1.3562480588052592,
 'MEDICAL': 4.089450253649446,
 'NEWS_AND_MAGAZINES': 2.6296718086758464,
 'PARENTING': 0.6211823170100425,
 'PERSONALIZATION': 3.892742519929599,
 'PHOTOGRAPHY': 2.9092038513303655,
 'PRODUCTIVITY': 3.8720364426959315,
 'SHOPPING': 2.0913138006004766,
 'SOCIAL': 2.474376229423336

In [60]:
for genre in genre_google:
    total = 0
    len_category = 0
    for row in ps_clean:
        category_app = row[1]
        if category_app == genre:
            noi=row[5]
            noi = noi.replace('+','')
            noi = noi.replace(',','')
            installs=float(noi)
            total+=installs
            len_category+=1
    print(genre," : ",total/len_category)

TOOLS  :  9774151.887816647
BUSINESS  :  1659916.3452380951
EVENTS  :  249580.640625
COMICS  :  803234.8214285715
VIDEO_PLAYERS  :  23975016.585365854
BEAUTY  :  513151.88679245283
WEATHER  :  4570892.658227848
LIBRARIES_AND_DEMO  :  630903.6904761905
DATING  :  828971.2176470588
BOOKS_AND_REFERENCE  :  7504367.459459459
HEALTH_AND_FITNESS  :  3972300.388888889
FAMILY  :  3319926.0965848453
SHOPPING  :  6932419.727722772
TRAVEL_AND_LOCAL  :  13218662.767123288
FOOD_AND_DRINK  :  1891060.2767857143
COMMUNICATION  :  35042146.82857143
HOUSE_AND_HOME  :  1331540.5616438356
SPORTS  :  3373767.6861538463
FINANCE  :  1319851.4028985507
NEWS_AND_MAGAZINES  :  9327628.976377953
GAME  :  14226135.745243128
ART_AND_DESIGN  :  1856362.2950819673
SOCIAL  :  22961790.384937238
LIFESTYLE  :  1365375.4444444445
PERSONALIZATION  :  4075783.994680851
ENTERTAINMENT  :  11375402.298850575
PARENTING  :  525351.8333333334
MEDICAL  :  96944.49873417722
AUTO_AND_VEHICLES  :  625061.305882353
EDUCATION  :  17

## Recommendation for Google Play Store

<h6>App category with the highest average number of installs per app: <b>Communication</b></h6>
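The averages behind this recommendation depend on turning Installs strings into numbers by stripping the plus sign and the commas. That step is isolated in the sketch below. The helper name `parse_installs` is hypothetical, and the sample strings assume the '1,000,000+' style that the cleaning code above implies.

```python
def parse_installs(text):
    """Convert an Installs string such as '1,000,000+' to a float."""
    return float(text.replace("+", "").replace(",", ""))

print(parse_installs("1,000,000+"))  # 1000000.0
print(parse_installs("50+"))         # 50.0
```

Because the trailing plus sign is simply dropped, each parsed value is the number written in the string rather than an exact install count.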