# Analyzing App Data from Google and Apple Store 

* This project aims to analyze app information provided by
the Apple store and the Google store.

* The goal is to extract insights about the profitability of free apps on both the Apple and Google ecosystems, which should help us make decisions about the apps we develop.


In [1]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader
opened_apple = open('AppleStore.csv', encoding='utf8')
opened_google = open('googleplaystore.csv', encoding='utf8')
read_apple = reader(opened_apple)
read_google = reader(opened_google)
apple_data = list(read_apple)
google_data = list(read_google)

explore_data(apple_data, 1, 2, True)
print('\nApple Columns:')
print(apple_data[0])
print('\n')
explore_data(google_data, 1, 2, True)
print('\nGoogle Columns:')
print(google_data[0])
print('\n')
print('------------------------------')

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16

Apple Columns:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13

Google Columns:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


------------------------------
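
The loading step above leaves both files open and, unless an encoding is given, relies on the platform's default. A minimal sketch of a helper that avoids both, assuming the files are UTF-8 encoded (load_csv is a hypothetical name, not part of the original notebook):

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    # newline='' is what the csv module documentation recommends
    # for file objects that are passed to reader()
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]
```

With this helper the header is returned separately, so later loops would not need to skip the first row with [1:].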


In [3]:
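#Row 10473 is missing its Category value, so it has 12 values instead of 13
#and every value after the app name sits one column too far to the left.
#We print it for inspection and then remove it (run this cell only once).
print(google_data[10473])
print('\n')
del google_data[10473]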

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
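
The row printed above has 12 values where the header has 13. Rather than hard-coding the index, such rows could be located by comparing each row's length with the header's. This is a sketch on a small made-up sample; find_malformed_rows is a hypothetical helper.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Made-up sample: a header and two rows, the second one missing a value
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
]
print(find_malformed_rows(sample))
```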




# 4 - Examining Duplicate Entries in the Google Android Dataset

* It has been discovered that there are duplicate entries in the Google dataset. Several apps are listed more than once.

* All duplicates will be examined and, for each app, the entry with the largest number of reviews will be kept, on the assumption that it is the most recently collected data.

In [4]:
unique_entries = []
duplicate_entries = []

for row in google_data[1:]:
    app_name = row[0]
    if app_name in unique_entries:
        duplicate_entries.append(app_name)
    else:
        unique_entries.append(app_name)
        
print('Number of duplicate entries:' + str(len(duplicate_entries)))

Number of duplicate entries:1181
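
The test app_name in unique_entries scans a list on every iteration, which gets slower as the list grows. A sketch of the same count using collections.Counter, shown on a made-up sample (count_duplicates is a hypothetical name):

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Count rows whose app name has already appeared earlier in rows."""
    counts = Counter(row[name_index] for row in rows)
    # An app listed n times contributes n - 1 duplicate entries
    return sum(n - 1 for n in counts.values())

sample = [['Slack'], ['Instagram'], ['Slack'], ['Slack']]
print(count_duplicates(sample))
```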


# 5 - Creating a list of entries we wish to keep and then confirming its length and inspecting it

* As can be seen from the code above, we have 1181 duplicate entries in our Google Playstore data set.

* We must remove all duplicate entries and keep, for each app, only the entry with the highest number of reviews.

* We will then check the length of the newly formed dictionary, and after that we shall inspect the dictionary entries to make sure we have the information we need.

In [5]:
reviews_max = {}

for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

#checking dictionary contents
for entry in reviews_max:
    print(entry +" : " + str(reviews_max[entry]))    

Chick-fil-A : 28009.0
Visualmed : 0.0
Downvids Helper - One touch DW : 197.0
Mini Motor Racing WRT : 107497.0
Truck Chat & CB for Truckers : 89.0
Multi Surgery ER Emergency Hospital : Doctor Game : 227.0
Guns of Glory : 120592.0
EI Mobile : 4231.0
Udacity - Lifelong Learning : 22384.0
Picture Grid Builder : 33439.0
TN e Sevai TN EB Bill Patta Citta EC Birth All Hub : 23.0
Autool BT-BOX : 4.0
Microsoft Power BI–Business data analytics : 1819.0
Charlotte County, FL : 10.0
EO App. SelfCompassion to you : 1.0
dB: Sound Meter Pro : 31.0
CI Time : 4.0
HDFC Bank MobileBanking : 208463.0
St. Petersburg, FL - weather and more : 0.0
BW-Go : 265.0
FUN Keyboard – Emoji Keyboard, Sticker,Theme & GIF : 12089.0
BN DB1 App : 2.0
BH Cosmetics : 2.0
OnePlus Icon Pack - Square : 229.0
Heart Surgery Emergency Doctor : 2248.0
CM Launcher Default Theme : 3989.0
Power Widget : 928.0
EasyNote Notepad | To Do List : 15618.0
Билеты ПДД CD 2019 PRO : 21.0
Public Service CU Mobile : 314.0
autoricardo.ch – vehicle

# 5 - 2 Creating a clean data set with no duplicate entries

* We use the dictionary we created in the previous cell to create a new data set which has no duplicates. 

* After we create the new dataset, we check its length and then we print some rows of our new dataset to confirm.

In [6]:
google_clean = []
already_added = []

for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_clean.append(row)
        already_added.append(name)
        
print(len(google_clean))

for app in google_clean[:5]:
    print(app)
    print("\n")


9659
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']
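
The two passes above (build reviews_max, then filter the rows) can also be combined into one. A minimal sketch, assuming the app name is in column 0 and the review count in column 3 as in the Google header; dedupe_keep_max_reviews is a hypothetical helper and the sample rows are made up:

```python
def dedupe_keep_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['App A', 'TOOLS', '4.0', '10'],
    ['App B', 'GAME', '4.5', '7'],
    ['App A', 'TOOLS', '4.1', '25'],
]
for row in dedupe_keep_max_reviews(sample):
    print(row)
```

The result is ordered by the first appearance of each app name, which may differ from the row order produced by the cell above.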




# 6 & 7 - Checking for non English Apps

* We are interested in apps that are developed for an English speaking audience, so we will base our study on Apps that are in English.

* We will define a function that checks a string for non-ASCII characters (code points above 127), which we use as a proxy for non-English text.

* An app is rejected only if its name contains more than 3 such characters, which leaves room for special characters such as emoji and the copyright and trademark symbols.

In [7]:
def check_english(str_name):
    counter = 0
    for character in str_name:
        if ord(character) > 127:
            counter += 1
            if counter > 3:
                return False
    return True

print(check_english('Instagram'))
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

True
False
True
True
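
To see why a threshold is needed, it can help to count the characters involved. The sketch below counts characters outside the ASCII range (code points above 127), which is the same test check_english relies on; count_non_ascii is a hypothetical helper.

```python
def count_non_ascii(text):
    """Number of characters whose code point is above 127."""
    return sum(1 for ch in text if ord(ch) > 127)

for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, ':', count_non_ascii(name))
```

Note that this is an ASCII test rather than a language test: an English name containing more than three accented or typographic characters would also be rejected.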


* We will now use the above defined function to filter apps from both Apple and Google app store.

* Apps in English will be added to a new list.

In [8]:
apple_english = []
google_clean_english = []

for row in apple_data[1:]:
    name = row[1]
    if check_english(name):
        apple_english.append(row)
        
for row in google_clean:
    name = row[0]
    if check_english(name):
        google_clean_english.append(row)
        
print("Number of English apps on Apple store: " + str(len(apple_english)) + "\n")
print("Number of English apps on google store: " + str(len(google_clean_english)) + "\n")


print('Apple Store sample:\n')
for row in apple_english[:2]:
    print(row)
    print('\n')

#for row in apple_english:

print('Google Store sample:\n')
for row in google_clean_english[:2]:
    print(row)
    print('\n')


    

Number of English apps on Apple store: 6183

Number of English apps on google store: 9614

Apple Store sample:

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Google Store sample:

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']




# 8 - Isolating the free Apps

* We loop through the data sets for both the Apple and Google stores and isolate the free apps in new lists.

* We then check the length of the remaining data sets before proceeding to analyze them.

* We name our data sets to reflect the operations performed on them. We currently have apple_english and google_clean_english, which we filter in the next step to produce apple_english_free and google_clean_english_free:


In [9]:
#creating the empty lists that will house the final data sets
apple_english_free = []
google_clean_english_free = []

#Examining the header rows to determine price indices for both Apple and Google
apple_header = apple_data[0]
google_header = google_data[0]

print('Apple store header for inspection:\n')
print(apple_header)
print('\n')
print('Google store header for inspection:\n')
print(google_header)
print('\n')

#Creating the final data sets for analysis and checking their size
for row in apple_english:
    if row[4] == '0.0':
        apple_english_free.append(row)
print('Apple Store final dataset size: ' + str(len(apple_english_free)) + "\n")
        
for row in google_clean_english:
    if row[7] == "0":
        google_clean_english_free.append(row)
print('Google Store final dataset size: ' + str(len(google_clean_english_free)) + '\n')
        



Apple store header for inspection:

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Google store header for inspection:

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Apple Store final dataset size: 3222

Google Store final dataset size: 8864
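
Comparing price strings works here because free apps appear as '0.0' in the Apple rows and '0' in the Google rows shown earlier. A sketch of a numeric comparison that does not depend on the exact string, assuming paid prices may carry a leading currency symbol such as '$' (an assumption, not something shown in the output above); is_free is a hypothetical helper:

```python
def is_free(price):
    """True when a price string such as '0', '0.0' or '$0.00' represents zero."""
    # Drop surrounding whitespace and an assumed leading '$' before converting
    cleaned = price.strip().lstrip('$')
    return float(cleaned) == 0.0

print(is_free('0.0'), is_free('0'), is_free('$4.99'))
```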



# 9 & 10 - Most common apps - Inspecting Final Data Sets

* We are looking to build an app for both Apple iOS and Google Android. As such, we will look at the genre of the apps and analyze the information provided in the data sets, such as number of downloads, number of reviews, average customer ratings and target audience age. We want to find app profiles that are highly successful on both platforms.

* After inspecting the headers in the cell above:

  from the Apple Store data set, we can use: prime_genre
  
  from the Google Store data set, we can use: Genres and Category

In [10]:
def freq_table(dataset, index):
    table = {}
    total_entries = 0
    for row in dataset:
        if row[index] in table:
            table[row[index]] += 1
        else:
            table[row[index]] = 1
        total_entries += 1
    
    for ky in table:
        table[ky] = (table[ky] * 100) / total_entries
        
    return table
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

# 10 - 2 - Using our new functions to describe the datasets

In [11]:
#Examining the Apple Store dataset:
display_table(apple_english_free, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.6623215394165114
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.017380509000621
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665
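
The percentage table printed above can also be sketched with collections.Counter from the standard library. The function names and the sample rows below are made up for illustration; the original freq_table and display_table are left as they are.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Map each value in column index to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count * 100 / total for key, count in counts.items()}

def sorted_table(dataset, index):
    """(value, percentage) pairs, largest percentage first."""
    table = freq_table_pct(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, pct in sorted_table(sample, 1):
    print(genre, ':', pct)
```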


# 11 - 1 - Analyzing The Apple dataset
* We can see that the majority of free English apps on the Apple store are games, with 58% of the total entries. Entertainment follows far behind at 7.88%, and then "Photo & Video" apps at almost 5%.

* We can conclude that apps for fun and entertainment are the most common apps on the App Store. However, the fact that games are the most abundant genre does not imply that they have the highest number of users.

* We will need to further analyze the results before recommending an app profile.

Next we proceed to examine the Google store data:

We start with the Category column:

In [12]:
#Examining Google store dataset - Category column
display_table(google_clean_english_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.700361010830325
MEDICAL : 3.5311371841155235
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.237815884476534
HEALTH_AND_FITNESS : 3.079873646209386
PHOTOGRAPHY : 2.9444945848375452
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768953
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418774
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075813
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0

# 11 - 2 - Analyzing the Google Dataset
* We can see that the categories of apps on the Google Store are much more balanced, with almost 19% being family applications and 9.7% being games. Tools make up 8.46% of the apps, and the rest of the categories represent progressively smaller portions.


Next we examine another column of the Google store dataset: Genres

In [13]:
#Examining Genres column of the Google dataset

display_table(google_clean_english_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.700361010830325
Medical : 3.5311371841155235
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.237815884476534
Action : 3.1024368231046933
Health & Fitness : 3.079873646209386
Photography : 2.9444945848375452
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.041967509025271
Dating : 1.861462093862816
Arcade : 1.8501805054151625
Video Players & Editors : 1.7712093862815885
Casual : 1.759927797833935
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418774
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075813

* We can see that the Genres column is much more granular, having a larger number of values, and confirms that the app landscape on Google is much more balanced compared to the Apple store.
* The Tools genre has the highest share of apps at 8.4%, Entertainment is second at 6.1%, followed closely by Education at 5.3%.
* We cannot yet tell how many users each category or genre represents at this point in the analysis.

# 12 Most popular Apps by Genre on the App store

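* We start by generating a frequency table for the prime_genre column in order to get the main app genres.

* The Apple data set has no install counts, so we use the total number of user ratings (rating_count_tot) as a proxy for popularity and compute its average per genre.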

In [14]:
apple_genre_frequency = freq_table(apple_english_free, 11)
apple_avg_installs_per_genre = []
for genre in apple_genre_frequency:
    total = 0
    len_genre = 0
    for row in apple_english_free:
        genre_app = row[11]
        if genre_app == genre:
            user_ratings = row[5]
            total += int(user_ratings)
            len_genre += 1
    average_nratings = total / len_genre
    apple_avg_installs_per_genre.append((average_nratings, genre))

apple_avg_installs_per_genre = sorted(apple_avg_installs_per_genre, reverse = True)
for tpl in apple_avg_installs_per_genre:
    print(tpl[1] + " : " + str(tpl[0]))

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


* We can see from the information above that Navigation is the genre with the highest average number of user ratings per app at more than 86K, followed by Reference at almost 75K, Social Networking at 71.5K and then Music at 57K.

* The information shows us that the most used and most rated apps are not games, but rather Navigation, Reference, Social Networking, Music and Weather apps.

* The recommended app profile can be reference and book apps, because these genres may not have very large players that skew their review numbers. It will also be more difficult to be profitable in a genre that is dominated by a few very large players.
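
The per-genre average computed in In [14] loops over the whole data set once per genre. A sketch of the same calculation in a single pass, shown on made-up rows (average_by_group is a hypothetical helper, not part of the original notebook):

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Average of the numeric column value_index for each value of group_index."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    return {group: totals[group] / counts[group] for group in totals}

sample = [['Maps', '100'], ['Maps', '300'], ['Books', '50']]
averages = average_by_group(sample, 0, 1)
for genre, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', avg)
```

For the Apple data this would be called as average_by_group(apple_english_free, 11, 5), using the prime_genre and rating_count_tot columns.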


# 13 - most popular apps by genre - Google play

In [24]:
google_cat_freq = freq_table(google_clean_english_free, 1)

google_genre_average_installs = []

for category in google_cat_freq:
    total = 0
    len_category = 0
    for category_app in google_clean_english_free:
        if category_app[1] == category:
            installs = category_app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total += installs
            len_category += 1
    average_installs = total / len_category
    google_genre_average_installs.append((average_installs, category))

google_genre_average_installs = sorted(google_genre_average_installs, reverse = True)

for tpl in google_genre_average_installs:
    print(tpl[1] + " : " + str(tpl[0]))
    

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315
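
The install counts are stored as text such as '10,000+', so they have to be cleaned before any arithmetic. The cleaning done inline in the cell above can be sketched as a small helper (parse_installs is a hypothetical name):

```python
def parse_installs(installs):
    """Convert an install bucket such as '1,000,000+' to the number 1000000."""
    return int(installs.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))
```

Because each value appears to be an open-ended bucket ('10,000+' reads as at least 10,000 installs), the averages above are probably best treated as rough lower bounds rather than exact figures.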

# 13++ Looking through individual genres on both Apple and Google datasets for additional insights

In [33]:
#looking through individual apps in certain genres on the apple store
for row in apple_english_free:
    if row[11] == 'Reference':
        print(row[1] + ' : ' + row[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0
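
One way to check how much a few large apps pull up a genre's average is to compare the mean with the median. This sketch uses the rating counts listed above for the Reference genre; it is an illustration under that assumption, not part of the original analysis.

```python
from statistics import mean, median

# Rating counts copied from the Reference listing above
reference_ratings = [985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122,
                     8535, 4693, 1497, 826, 762, 718, 14, 8, 0, 0]

print('mean  :', mean(reference_ratings))
print('median:', median(reference_ratings))
```

A median far below the mean indicates that a small number of apps account for most of the ratings in that genre.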


In [34]:
for row in google_clean_english_free:
    if row[1] == 'BOOKS_AND_REFERENCE':
        print(row[0] + " : " + row[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

# Conclusion:

* We looked through the most used genres of English apps on both the Apple Store and Google Play store. 

* Our objective was to find an app profile that could be profitable on both Apple and Android.

* Our logic and reasoning: The majority of the popular genres had their numbers skewed by major players in the field. Those genres are dominated by large players that are hard to compete against.

* The Reference genre on the Apple store and the Books and Reference category on the Google Play store are among the popular genres on both platforms, and yet are not dominated by a single large player.

* The recommended app profile for both platforms is therefore the Reference genre on Apple and the Books and Reference category on Google.