# Project: Profitable App Profiles for App Store and Google Play Store Markets

For this project, I'll pretend that I'm working as a data analyst for a company that builds Android and iOS mobile apps. The goal is to analyze app data to identify which types of apps are likely to attract more users.

I will be analyzing a data set containing roughly 10,000 Android apps from the Google Play store (collected in Aug 2018) and roughly 7,000 iOS apps from the App Store (collected in July 2017).

### Defining a Function to Easily Explore the Data

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### Reading in the Data

I found two datasets on Kaggle to use for the analysis. One contains data on apps from the Google Play store and the other contains data on apps from the App Store.

In [20]:
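from csv import reader

# Google Play data set: read every row, then split off the header
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]
android = android[1:]

# App Store data set: same steps
opened_file = open('Applestore.csv', encoding='utf8')
read_file = reader(opened_file)
apple = list(read_file)
opened_file.close()
apple_header = apple[0]
apple = apple[1:]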
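
The two loading steps above repeat the same pattern, so they could be folded into one helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that does not appear in the notebook, and UTF-8 encoding is an assumption about the files.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file.

    Hypothetical helper; the `with` block closes the file automatically.
    """
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Possible usage, assuming the CSV files sit next to the notebook:
# android_header, android = load_csv('googleplaystore.csv')
```

Returning the header separately keeps the data rows free of column names, which matches how `android` and `apple` are used in the rest of the analysis.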

### Exploring the Data

In [21]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [18]:
print(apple_header)
print('\n')
explore_data(apple, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


## Cleaning up the Data: Deleting Incorrect Values

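Looking at the discussion section on Kaggle for the Google Play dataset revealed that there is a row with a missing value. The output below shows that row 10472 has no category, so its remaining values are shifted one column to the left (the rating lands in the category position). I will delete that row below.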

In [22]:
android[10472]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [23]:
del android[10472]

In [27]:
print(len(android))

10840
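
Instead of relying on a known row index, rows like this can also be found by comparing each row's length with the header's length. The following is a minimal sketch on made-up sample rows; `find_malformed_rows` is a hypothetical helper, not part of the notebook.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Made-up sample data for illustration only
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # category missing, values shifted left
    ['App C', 'GAME', '4.5'],
]

bad_indexes = find_malformed_rows(sample_rows, sample_header)
print(bad_indexes)  # [1]

# Delete from the end so the earlier indexes stay valid
for i in reversed(bad_indexes):
    del sample_rows[i]
```

This check only catches rows with the wrong number of columns; a row with the right length but a wrong value would need a different test.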


## Removing Duplicate Entries

Given the high number of entries in the Google Play dataset compared to the App Store dataset, I suspected that the Google Play dataset contained some duplicate entries. The code below confirms this.

In [29]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
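
Checking `name in unique_apps` against a list rescans the list for every app. As an alternative sketch, `collections.Counter` from the standard library counts every name in one pass. The helper name `count_duplicates` and the sample rows are assumptions for illustration.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Return ({name: occurrences} for repeated names, number of extra rows)."""
    counts = Counter(row[name_index] for row in dataset)
    duplicates = {name: n for name, n in counts.items() if n > 1}
    # Every occurrence after the first one is an extra (duplicate) row
    n_extra = sum(n - 1 for n in duplicates.values())
    return duplicates, n_extra

# Made-up sample rows containing only the name column
sample = [['Slack'], ['Box'], ['Slack'], ['Slack'], ['Zoom']]
duplicates, n_extra = count_duplicates(sample)
print(duplicates)  # {'Slack': 3}
print(n_extra)     # 2
```

Here `n_extra` plays the same role as `len(duplicate_apps)` in the cell above: it counts the rows that would have to be removed.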


In [32]:
for app in android:
    name = app[0]
    if name == 'Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


In [33]:
print('Expected length:', len(android)- 1181)

Expected length: 9659


In [44]:
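reviews_max = {}

# Record the highest review count seen for each app name.
# Caution: android[1:] skips the first app, because the header row was already
# removed when the file was read; iterating over android would include it.
for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews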

In [45]:
len(reviews_max)

9658

In [46]:
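android_clean = []
already_added = []
# Keep one row per app: the row whose review count equals the recorded maximum.
# The same [1:] slice is used here so that every name has an entry in reviews_max.
for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    # already_added guards against two rows sharing the same maximum
    if n_reviews == reviews_max[name] and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)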

In [47]:
explore_data(android_clean, 0, 3, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9658
Number of columns: 13
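
The two loops above (find the maximum, then filter) can also be written as one pass that stores the best row per name in a dictionary. This is a sketch under the assumption that the name is at index 0 and the review count at index 3, as in the Google Play header; `keep_most_reviewed` is a hypothetical name.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    # Dictionaries keep insertion order, so apps stay in first-seen order
    return list(best.values())

# Made-up sample rows, trimmed to four columns
sample = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

Because the dictionary lookup replaces the `name not in already_added` list scan, this version also avoids rescanning a growing list for every row.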


### Removing Non-English Apps

In [51]:
print(apple[813][1])
print(apple[6731][1])
print('\n')
print(android_clean[4411][0])
print(android_clean[7939][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


In [52]:
print(ord('a'))
print(ord('A'))
print(ord('5'))
print(ord('+'))

97
65
53
43


### Defining a Function to Test Whether an App Name Is in English

In [57]:
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


In the previous code, our function correctly detected the non-English app name, but it also flagged English app names as non-English because they contain special characters such as ™ or emoji, as seen below.

In [59]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(ord('™'))
print(ord('😜'))

False
False
8482
128540


Therefore, I will edit the function to allow up to three non-ASCII characters. This buffer prevents most English app names from being flagged as non-English.

In [75]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    
    return True

In [76]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


In [74]:
android_english = []
apple_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

for app in apple:
    # Caution: index 0 is the numeric 'id' column, so this check never removes
    # an iOS row; the app name ('track_name') is at index 1.
    name = app[0]
    if is_english(name):
        apple_english.append(app)

# explore_data prints directly and returns None, so call it without print()
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9613
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+
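
The two data sets keep the app name in different columns: index 0 (`App`) for Google Play and index 1 (`track_name`) for the App Store. One way to make that explicit is to pass the name column as a parameter. The sketch below is self-contained, so it restates the three-character rule in a compact form; `filter_english` and the second sample id are hypothetical.

```python
def is_english(string, max_non_ascii=3):
    """Return True if the string has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

def filter_english(dataset, name_index):
    """Keep the rows whose name column passes is_english."""
    return [row for row in dataset if is_english(row[name_index])]

# Sample rows trimmed to (id, track_name); the second id is made up
sample_ios = [
    ['284882215', 'Facebook'],
    ['000000000', '爱奇艺PPS -《欢乐颂2》电视剧热播'],
]

print(len(filter_english(sample_ios, 1)))  # 1: the name column is checked
print(len(filter_english(sample_ios, 0)))  # 2: ids are plain digits, nothing is removed
```

The second call shows why the column matters: checking the id column lets every row through, whatever language the name is in.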

In [92]:
android_free = []
apple_free = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)
        
for app in apple_english:
    price = app[4]
    if price == '0.0':
        apple_free.append(app)

In [93]:
print(len(android_free))
print(len(apple_free))

8863
4056
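
The cell above isolates free apps by comparing price strings, and the two stores write a zero price differently (`'0'` versus `'0.0'`). A numeric comparison handles both spellings with one check. This is only a sketch: `is_free` is a hypothetical helper, and the `$` prefix on paid prices is an assumption about the Google Play file, since no paid row is shown here.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$4.99' is zero."""
    # Assumption: paid prices may carry a leading '$'
    return float(price.replace('$', '')) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```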


In [94]:
a_list = [50,20,100]
print(sorted(a_list))
print(sorted(a_list, reverse=True))

[20, 50, 100]
[100, 50, 20]


### Create Frequency Tables to Visualize Popular App Genres

In [98]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
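
The same percentage table can be built with `collections.Counter` from the standard library, which also returns the values from most to least common. The sketch below uses a made-up sample; `freq_table_counter` is a hypothetical name chosen to avoid clashing with `freq_table` above.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Made-up sample rows of (name, genre)
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(sample, 1))  # {'Games': 75.0, 'Music': 25.0}
```

Since the dictionary is already ordered by count, no separate list of tuples is needed for sorting before display.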

In [99]:
display_table(apple_free, -5)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


**The table shows that over 55% of the free iOS apps are in the 'Games' genre, with 'Entertainment' a distant second at just over 8%.**

In [100]:
display_table(android_free, 1)

FAMILY : 18.910075595170937
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

**Compared to the App Store, the Google Play store has a more even spread of apps across categories: Family comes first with 18.9%, and Game is second with 9.7%.**

### Finding Average Number of Ratings for Each App Category

In [118]:
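# Assumption: genres_ios is not defined in the cells shown; a frequency table
# of the genre column (index -5) supplies the unique genres to loop over.
genres_ios = freq_table(apple_english, -5)

for genre in genres_ios: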
    total = 0
    len_genre = 0
    for app in apple_english:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total/len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 45498.89820359281
Photo & Video : 14352.280802292264
Games : 13691.996633868463
Music : 28842.021739130436
Reference : 22410.84375
Health & Fitness : 9913.172222222222
Weather : 22181.027777777777
Utilities : 6863.822580645161
Travel : 14129.444444444445
Shopping : 18615.32786885246
News : 13015.066666666668
Navigation : 11853.95652173913
Lifestyle : 6161.763888888889
Entertainment : 7533.678504672897
Food & Drink : 13938.619047619048
Sports : 14026.929824561403
Book : 5125.4375
Finance : 11047.653846153846
Education : 2239.2295805739514
Productivity : 8051.3258426966295
Business : 4788.087719298245
Catalogs : 1732.5
Medical : 592.7826086956521


**From the table above, we can see that the "Social Networking" genre has the highest average number of user ratings (about 45,500), followed by "Music" (about 28,800) and "Reference" (about 22,400). Based on the number of ratings and the prevalence of the genre, I recommend a Social Networking genre for free iOS apps.**
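
The nested loop above rescans the whole data set once per genre, and its output is not sorted, which makes the ranking hard to read. A single pass with running totals, followed by a sort, gives the ranking directly. This is a sketch on made-up rows; `average_by_group` is a hypothetical helper.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return [(group, average value)] sorted from highest to lowest average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Made-up sample rows of (genre, rating count)
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
for genre, avg in average_by_group(sample, 0, 1):
    print(genre, ':', avg)
```

For the iOS data the call would use the genre column (index -5) as the group and `rating_count_tot` (index 5) as the value.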

In [120]:
display_table(android_english, 5)

1,000,000+ : 14.709247893477581
100,000+ : 11.5052533028191
10,000+ : 10.610631436596275
10,000,000+ : 9.747217309892854
1,000+ : 9.154270259024239
100+ : 7.323416207219391
5,000,000+ : 6.293560803079163
500,000+ : 5.2429002392593365
5,000+ : 4.837199625507125
50,000+ : 4.816394465827526
10+ : 3.994590658483304
500+ : 3.4120461874544885
50,000,000+ : 2.1221262873192552
50+ : 2.1221262873192552
100,000,000+ : 1.9660875897222514
5+ : 0.8530115468636221
1+ : 0.6865702694268178
500,000,000+ : 0.2496619161552065
1,000,000,000+ : 0.20805159679600538
0+ : 0.1352335379174035
0 : 0.010402579839800271


In [121]:
n_installs = '100,000+'
n_installs = n_installs.replace('+', '')
print(n_installs)
n_installs = n_installs.replace(',', '')
print(n_installs)

100,000
100000
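
The two `replace` calls above can be wrapped in a small function so the cleaning step has a name and can be tested on its own. A minimal sketch, assuming every value in the `Installs` column looks like the ones in the frequency table (digits, commas, and an optional trailing `+`); `parse_installs` is a hypothetical helper.

```python
def parse_installs(text):
    """Convert an install bucket such as '100,000+' to the integer 100000."""
    return int(text.replace('+', '').replace(',', ''))

print(parse_installs('100,000+'))        # 100000
print(parse_installs('1,000,000,000+'))  # 1000000000
print(parse_installs('0'))               # 0
```

The result is the lower bound of each bucket, not an exact install count, so averages built from it are rough estimates.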


In [122]:
category_android = freq_table(android_english, 1)

for category in category_android:
    total = 0
    len_category = 0
    for app in android_english:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_n_installs = total/len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1919103.3898305085
AUTO_AND_VEHICLES : 632501.3214285715
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 7641777.871559633
BUSINESS : 1663758.627684964
COMICS : 817657.2727272727
COMMUNICATION : 35153714.17515924
DATING : 828971.2176470588
EDUCATION : 1782566.0377358492
ENTERTAINMENT : 11375402.298850575
EVENTS : 249580.640625
FINANCE : 1319851.4028985507
FOOD_AND_DRINK : 1891060.2767857143
HEALTH_AND_FITNESS : 3972300.388888889
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 630903.6904761905
LIFESTYLE : 1369954.7774725275
GAME : 14256217.600635594
FAMILY : 3345018.516684607
MEDICAL : 96944.49873417722
SOCIAL : 22961790.384937238
SHOPPING : 6966908.880597015
PHOTOGRAPHY : 16636241.267857144
SPORTS : 3373767.6861538463
TRAVEL_AND_LOCAL : 13218662.767123288
TOOLS : 9785955.211352658
PERSONALIZATION : 4086652.4853333333
PRODUCTIVITY : 15530942.008042896
PARENTING : 525351.8333333334
WEATHER : 4570892.658227848
VIDEO_PLAYERS : 24121489.079754602
NEWS_AND_MAGAZI

**The table shows that, among the categories listed, "Communication" has the highest average number of installs (about 35 million), followed by "Video Players" (about 24 million) and "Social" (about 23 million). The app profile I recommend for free Android apps is "Communication".**