# App Profiles for the App Store and Google Play Markets

*What is this project about?
The purpose of this project is to analyze data to help our developers understand what type of apps are likely to attract more users. This project looks at two separate files on Android and iOS apps. *

*Import 'reader' from 'csv' module. Open both files and save them in 2 variables without their header rows.*

In [3]:
from csv import reader

opened_file_google = open('googleplaystore.csv', encoding="utf-8")
read_file_google = reader(opened_file_google)
google = list(read_file_google)
google_header = google[0]
google = google[1:]

opened_file_apple = open('AppleStore.csv', encoding="utf-8")
read_file_apple = reader(opened_file_apple)
apple = list(read_file_apple)
apple_header = apple[0]
apple = apple[1:]

*'explore_data' function to help us choose parts of the file we want to print. This also shows the number of rows and columns.*

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

*Let's test our function with both of our files and print the first 5 rows for each of the files. Android apps file has 10841 rows  and 13 columns, while iOS apps file has 7197 rows and 17 columns.*

In [5]:
explore_data(google, 0, 5, True)
explore_data(apple, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13
['1', '281656475', 'PAC-MAN Premium

*Here we remove a row that has a missing value.*

In [7]:
del google[10472]


In [8]:
google[10471:10473]


[['Xposed Wi-Fi-Pwd',
  'PERSONALIZATION',
  '3.5',
  '1042',
  '404k',
  '100,000+',
  'Free',
  '0',
  'Everyone',
  'Personalization',
  'August 5, 2014',
  '3.0.0',
  '4.0.3 and up'],
 ['osmino Wi-Fi: free WiFi',
  'TOOLS',
  '4.2',
  '134203',
  '4.1M',
  '10,000,000+',
  'Free',
  '0',
  'Everyone',
  'Tools',
  'August 7, 2018',
  '6.06.14',
  '4.4 and up']]

*Find duplicate values in the Android apps files. The code below demonstrates 3 instances of Twitter, so we will have to remove all but one. Based on the highest rating, which is at index 3 (or based on the most recent date at index 10), we will keep only one unique row for each app.*

In [9]:
for app in google:
    name = app[0]
    if name == "Twitter":
        print(app)

['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11657972', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'July 30, 2018', 'Varies with device', 'Varies with device']


*Delete duplicate rows in the data. Android apps file has several duplicated. In the code below I will sort unique apps into one list, and duplicate apps into another list.*

In [10]:
duplicate_apps = []
unique_apps = []
for app in google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Expected length of unique apps: ', len(google) - 1181)
print('Number of duplicate apps: ', len(duplicate_apps))
print(duplicate_apps[:20])


Expected length of unique apps:  9659
Number of duplicate apps:  1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


*Below we create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.*

In [11]:
reviews_max = {}
for app in google:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659


In [12]:
android_clean = []
already_added = []

*Let's loop through Android dataset. We will get the unique rows into a list here by checking whether rows from Android dataset correspond to the dictionary 'reviews_max'. We also take the app names and store them in a separate file to keep track of them.*

In [13]:
for app in google:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13


*Find non-English characters.*

In [14]:
def characters(string):
    for letter in string:
        if ord(letter) > 127:
            return False
            
         
    return True
    
characters('Instagram')

True

In [15]:
characters('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [16]:
characters('Docs To Go™ Free Office Suite')

False

In [17]:
characters('Instachat 😜')

False

*We re-write our function so as not to lose important rows where there could be emoji characters that our original function would count as non-English. So we will discard the rows if the app name has more than 3 non-ascii characters.*

In [18]:
def characters(string):
    non_american = 0
    for letter in string:
        if ord(letter) > 127:
            non_american += 1
    if non_american > 3:
            return False
    else:                     
        return True

In [19]:
characters('Docs To Go™ Free Office Suite')

True

In [20]:
characters('Instachat 😜')

True

*The following 2 lists will contain only apps directed at English-speaking audience.*

In [21]:
google_english = []
apple_english = []

for app in android_clean:
    name = app[0]
    if characters(name):
        google_english.append(app)
        
for app in apple:
    name = app[2]
    if characters(name):
        apple_english.append(app)
        
explore_data(google_english, 0, 5, True)
print('\n')
explore_data(apple_english, 0, 5, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9614
Number of columns: 13


['1', '281656475', 'PAC-MAN Premium', '1007882

*Let's create the final two lists for both datasets. These datasets will contain only free applications.*

In [22]:
print(google_header, apple_header)
#7 --> 'Free', 5 --> '0'
print(google_header.index('Genres'))
print(apple_header.index('prime_genre'))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] ['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
9
12


In [26]:
google_free = []
apple_free = []

for app in google_english:
    price = app[7]
    if price == '0':
        google_free.append(app)
        
for app in apple_english:
    price = app[5]
    if price == '0':
        apple_free.append(app)
        
print(len(google_free))
print(len(apple_free))
print(google_free[:3])

8864
3222
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


*To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:*

-Build a minimal Android version of the app, and add it to Google Play.   
-If the app has a good response from users, we develop it further.  
-If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

In [37]:

def freq_table(dataset, index):
    percent = {}
    total = 0
    for app in dataset:
        total += 1
        column = app[index]
        
        if column in percent:
            percent[column] += 1
        else:
            percent[column] = 1
    frequency = {}
    for key in percent:
        percentage = (percent[key] / total) * 100
        frequency[key] = percentage
    return frequency
        





In [55]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


In [56]:

print(display_table(google_free,1))
print('\n')
print(display_table(google_free,9))
print('\n')
print(display_table(apple_free,12))

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

*'Family' category comes at the top of the list for Android apps. In its 'Genres' column, education and entertainment is almost even.
Among ios apps, games take up more than half of the number of app. Entertainment comes second with almost 8% of frequency among other all other apps. Most of the apps are designed for entertainment purposes, while educational applications only make up 3.6% of the overall pool.*

In [71]:
apple_genres = freq_table(apple_free, 12)


for genre in apple_genres:
    total = 0
    len_genre = 0
    for app in apple_free:
        genre_app = app[12]
        if genre_app == genre:
            ratings = float(app[6])
            total+=ratings
            len_genre+=1
    average = total/len_genre
    print(genre, average)

Productivity 21028.410714285714
Weather 52279.892857142855
Shopping 26919.690476190477
Reference 74942.11111111111
Finance 31467.944444444445
Music 57326.530303030304
Utilities 18684.456790123455
Travel 28243.8
Social Networking 71548.34905660378
Sports 23008.898550724636
Health & Fitness 23298.015384615384
Games 22788.6696905016
Food & Drink 33333.92307692308
News 21248.023255813954
Book 39758.5
Photo & Video 28441.54375
Entertainment 14029.830708661417
Business 7491.117647058823
Lifestyle 16485.764705882353
Education 7003.983050847458
Navigation 86090.33333333333
Medical 612.0
Catalogs 4004.0


*Navigation apps, by far, have the most reviews. However, this does not mean we need more navigation app. In every day life Google Maps, for instance, is essential to getting around and other apps do not get used as often because they are not seen as necessarily 'essential' to our every day life. What is worth looking into are apps that are rated fairly often and still have room for development. For example, take 'Reference' genre from iOS file. The number of ratings vary immensely between apps labeled as 'Reference'. **Dictionary** apps have a lot more reviews that **Real Bike Traffice Rider VR Glasses** app, it would help to add apps for testing vocabulary to help students get ready for standardized tests, or just help individuals improve their vocabulary.*

In [84]:
for app in apple_free:
    if app[12] == 'Reference':
        print(app[12], app[2], app[6])

Reference Bible 985920
Reference Dictionary.com Dictionary & Thesaurus 200047
Reference Dictionary.com Dictionary & Thesaurus for iPad 54175
Reference Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran 18418
Reference Merriam-Webster Dictionary 16849
Reference Google Translate 26786
Reference Night Sky 12122
Reference WWDC 762
Reference Jishokun-Japanese English Dictionary & Translator 0
Reference 教えて!goo 0
Reference VPN Express 14
Reference New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition 17588
Reference LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools 4693
Reference Guides for Pokémon GO - Pokemon GO News and Cheats 826
Reference Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free 718
Reference City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) 8535
Reference GUNS MODS for Minecraft PC Edition - Mods Tools 1497
Reference Real Bike Traffic Rider Virt

In [86]:
google_categories = display_table(google_free, 1)
print(google_categories)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [78]:
for category in google_categories:
    total = 0
    len_category = 0
    for app in google_free:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total += float(installs)
            len_category += 1
    average = total/len_category
    print(category, average)
    


ART_AND_DESIGN 1986335.0877192982
AUTO_AND_VEHICLES 647317.8170731707
BEAUTY 513151.88679245283
BOOKS_AND_REFERENCE 8767811.894736841
BUSINESS 1712290.1474201474
COMICS 817657.2727272727
COMMUNICATION 38456119.167247385
DATING 854028.8303030303
EDUCATION 1833495.145631068
ENTERTAINMENT 11640705.88235294
EVENTS 253542.22222222222
FINANCE 1387692.475609756
FOOD_AND_DRINK 1924897.7363636363
HEALTH_AND_FITNESS 4188821.9853479853
HOUSE_AND_HOME 1331540.5616438356
LIBRARIES_AND_DEMO 638503.734939759
LIFESTYLE 1437816.2687861272
GAME 15588015.603248259
FAMILY 3695641.8198090694
MEDICAL 120550.61980830671
SOCIAL 23253652.127118643
SHOPPING 7036877.311557789
PHOTOGRAPHY 17840110.40229885
SPORTS 3638640.1428571427
TRAVEL_AND_LOCAL 13984077.710144928
TOOLS 10801391.298666667
PERSONALIZATION 5201482.6122448975
PRODUCTIVITY 16787331.344927534
PARENTING 542603.6206896552
WEATHER 5074486.197183099
VIDEO_PLAYERS 24727872.452830188
NEWS_AND_MAGAZINES 9549178.467741935
MAPS_AND_NAVIGATION 4056941.774193

In [85]:
google_genres = display_table(google_free, 9)
print(google_genres)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [81]:
for genre in google_genres:
    total = 0
    len_category = 0
    for app in google_free:
        genre_app = app[9]
        if genre_app == genre:
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total += float(installs)
            len_genre += 1
    average = total/len_genre
    print(genre, average)

Art & Design 1973878.9473684211
Art & Design;Creativity 27142.85714285714
Auto & Vehicles 366069.38620689657
Beauty 137358.8383838384
Books & Reference 4293516.134020618
Business 876606.4025157233
Comics 52910.65959952886
Comics;Creativity 58.8235294117647
Communication 9707041.513632366
Dating 108229.46006144393
Education 146840.0337837838
Education;Creativity 6460.6741573033705
Education;Education 78887.02209944751
Education;Pretend Play 4958.677685950413
Education;Brain Games 8800.8800880088
Entertainment 1279415.328098472
Entertainment;Brain Games 9818.0279305967
Entertainment;Creativity 5071.85122569738
Entertainment;Music & Video 40403.191936161274
Events 6535.662847790507
Finance 164200.26406926406
Food & Drink 73469.37925052048
Health & Fitness 362455.9118858954
House & Home 30112.286555142502
Libraries & Demo 16005.98308668076
Lifestyle 133338.19173960612
Lifestyle;Pretend Play 2734.4818156959254
Card 41281.71490397619
Arcade 972207.1846671847
Puzzle 209615.29689472355
Racing 

# Conclusions