# Analysis of Data Obtained from Free iOS and Android Apps 


In this project, descriptive data on free apps on Google Play and the App Store is analyzed to find out user preferences. These apps generate revenue via in-app ads, so it is crucial to attract a large number of users.

The goal of this project is to analyze existing datasets to better understand user demands and help the developers take them into account. 

# Opening Datasets

First, the datasets are opened, read, and turned into lists for further analysis.

In [1]:
from csv import reader

# Apple App Store data (the file is assumed to be UTF-8 encoded)
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))  # read and turn dataset into a list
ios_header = ios[0]  # separate the header
ios = ios[1:]

# Google Play data (the file is assumed to be UTF-8 encoded)
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))  # read and turn dataset into a list
android_header = android[0]  # separate the header
android = android[1:]

# Exploring the Datasets

The first few rows of both datasets are printed to get a feel for the data. Then the numbers of rows and columns are printed.

First, the Android apps dataset is explored.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print('Exploring Android Dataset\n', android_header)
print('\n')
explore_data(android,0,2,True)



Exploring Android Dataset
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Here, we can assume that the **App**, **Rating**, **Genres**, **Installs** and **Type** columns could be useful.

Next, we explore the iOS apps dataset:

In [29]:
print('Exploring iOS Dataset\n', ios_header)
print('\n')
explore_data(ios,0,2,True)

Exploring iOS Dataset
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


Here, the **track_name**, **price**, **rating_count_tot**, **user_rating**, and **prime_genre** columns could be useful.

For a detailed description of what each column means, refer to the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

### Deleting Faulty Data

From [this discussion entry](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) we can see a claim that row 10472 in the Google Play dataset has faulty data: the value under the **Category** column is missing, and that position holds the **Rating** value instead.


In [30]:
print(android_header[1], '\n')
print(android[10472][1])


Category 

1.9


The claim is true. We will now delete this row as part of data cleaning.

In [31]:
del android[10472]
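
Deleting by a hard-coded index is fragile: running the cell a second time removes whatever row now sits at index 10472. Below is a minimal sketch of a more defensive approach, assuming a malformed row can be recognized by having a different number of columns than the header. The helper names `find_malformed_rows` and `drop_rows` are hypothetical, and the toy rows stand in for the real dataset.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose column count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


def drop_rows(rows, indices):
    """Return a new list without the rows at the given indices (input is not modified)."""
    skip = set(indices)
    return [row for i, row in enumerate(rows) if i not in skip]


# Toy data standing in for the real dataset: the second row is missing a value.
header = ['App', 'Category', 'Rating']
rows = [
    ['Box', 'BUSINESS', '4.2'],
    ['Broken app', '1.9'],
    ['Slack', 'BUSINESS', '4.4'],
]

bad = find_malformed_rows(header, rows)
clean = drop_rows(rows, bad)
print(bad)         # indices of the malformed rows
print(len(clean))  # number of rows kept
```

Because `drop_rows` builds a new list instead of deleting in place, re-running the cell gives the same result each time.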

### Duplicates!

Let's look for duplicate entries. We will keep the entry with the highest number of reviews because it is likely to be the most recent. The newer the data, the better the insight.

In [5]:
def duplicates(data):
    duplicate_apps = []
    unique_apps = []
    
    for app in data:
        name = app[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
            
    print('Number of duplicate apps:', len(duplicate_apps), '\n')
    print('Some examples of duplicate apps:', duplicate_apps[:10])
    
print('Google Play Store apps')
duplicates(android)
print('\n')
print('iOS apps')
duplicates(ios)

Google Play Store apps
Number of duplicate apps: 1181 

Some examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


iOS apps
Number of duplicate apps: 0 

Some examples of duplicate apps: []


In [10]:
for app in android:
    name = app[0]
    if name == 'Box':
        print(app)


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


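We can see that the Play Store data contains 1181 duplicate entries; `Box`, for example, appears three times with identical rows. No duplicates are reported for the iOS data, but note that `duplicates` reads column 0, which is the **id** column in the iOS dataset, so that check compares IDs rather than app names.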
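
The same count can be obtained with `collections.Counter`, which avoids the repeated list lookups in `duplicates`. This is a sketch rather than the notebook's method: `count_duplicates` and its `name_index` parameter are hypothetical names, with `name_index` added so the name column can be chosen per dataset (column 0 is `App` in the Android header, column 1 is `track_name` in the iOS header). The rows are toy data.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of extra entries, {name: extra entries}) for repeated names."""
    counts = Counter(row[name_index] for row in rows)
    extras = {name: n - 1 for name, n in counts.items() if n > 1}
    return sum(extras.values()), extras


# Toy rows: 'Box' appears three times, so two of its entries are extras.
rows = [['Box'], ['Slack'], ['Box'], ['Box']]
n_extra, extras = count_duplicates(rows)
print(n_extra)  # 2
print(extras)   # {'Box': 2}
```

For the iOS list, the call would be `count_duplicates(ios, name_index=1)` under the column layout shown in the header above.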

Now to remove the extras. 

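The Life Made WI-Fi Touchscreen Photo Frame entry has '3.0M' in the reviews position (index 3), which cannot be converted to a float. As the output below shows, this is the shifted row at index 10472 discussed above, so the cell only finds it if it runs before that row is deleted. Here the value is replaced with 3,000,000.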

In [24]:
for idx, app in enumerate(android):
    row = app[3]
    if row=='3.0M':
        print(idx,app)
        app[3] = 3000000
        print(idx,app)

10472 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', 3000000, '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
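
Rather than searching for one known bad value, a more general check is to list every row whose review count cannot be parsed as a number. This is a minimal sketch, assuming the review count sits at a given index; the helper `non_numeric_rows` is hypothetical and the rows are toy data.

```python
def non_numeric_rows(rows, index):
    """Return (row index, raw value) pairs for values that float() cannot parse."""
    bad = []
    for i, row in enumerate(rows):
        try:
            float(row[index])
        except (TypeError, ValueError):
            bad.append((i, row[index]))
    return bad


# Toy rows: only the second one has a non-numeric value in column 1.
rows = [
    ['Box', '159872'],
    ['Shifted row', '3.0M'],
    ['Slack', '51507'],
]
print(non_numeric_rows(rows, 1))  # [(1, '3.0M')]
```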


In [26]:
print('Expected length:', len(android) - 1181)

Expected length: 9660


In [51]:
def remove_dups(data):
    reviews_max = {}
    for app in data:
        name = app[0]
        n_reviews = float(app[3])
        
        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
        elif name not in reviews_max:
            reviews_max[name] = n_reviews
        # a plain else is not used here: it would overwrite the stored maximum with a smaller review count
    print('Number of reviews left after finding duplicates:', len(reviews_max))
    
    data_clean = []
    already_added = []
    
    for app in data:
        name = app[0]
        n_reviews = float(app[3])
        
        if n_reviews == reviews_max[name] and name not in already_added:
            data_clean.append(app)
            already_added.append(name) 
            # accounts for the highest review count occurring more than once among the entries
            
    print('Number of reviews left after cleaning:', len(data_clean))
    return data_clean

android1 = remove_dups(android)

Number of reviews left after finding duplicates: 9660
Number of reviews left after cleaning: 9660


In [53]:
len(android1)

9660
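
The two loops in `remove_dups` can also be collapsed into a single pass by storing the best row per name in a dictionary, which avoids the repeated `name not in already_added` list lookups. This is a sketch under assumptions, not a replacement for the cell above: `remove_dups_one_pass` and its index parameters are hypothetical names, the rows are toy data, and the result is ordered by the first appearance of each name, which may differ from the order `remove_dups` produces.

```python
def remove_dups_one_pass(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row that has the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows: 'Box' appears three times, twice with the same highest count.
rows = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Slack', 'BUSINESS', '4.4', '50'],
    ['Box', 'BUSINESS', '4.2', '300'],
    ['Box', 'BUSINESS', '4.2', '300'],
]
for row in remove_dups_one_pass(rows):
    print(row[0], row[3])
```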

### Filtering non-English app names

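An app is treated as non-English if its name contains more than three characters outside the standard ASCII range (code point above 127), such as emojis or non-Latin letters. Allowing up to three such characters keeps English names that include a symbol like ™ or a single emoji.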

In [81]:
def isEng(string):
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
    if count > 3:
        return False
    else:
        return True
        
print(isEng('Instagram'))
print(isEng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEng('Docs To Go™ Free Office Suite'))
print(isEng('Instachat 😜'))

True
False
True
True
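
To make the threshold concrete, the sketch below counts the non-ASCII characters directly and exposes the limit as a parameter. The names `count_non_ascii`, `is_english`, and `max_non_ascii` are hypothetical; the logic is meant to mirror `isEng` above.

```python
def count_non_ascii(text):
    """Count characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for ch in text if ord(ch) > 127)


def is_english(text, max_non_ascii=3):
    """Treat a name as English if it has at most `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii


print(ord('a'), ord('™'))                                # 97 8482
print(count_non_ascii('Docs To Go™ Free Office Suite'))  # 1
print(is_english('Instachat 😜'))                        # True
print(is_english('😜😜😜😜'))                            # False: four non-ASCII characters
```

The last call shows a limitation of the heuristic: an English name with four or more emojis would still be filtered out.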


In [93]:
def remove_noneng(data, index):
    new_list = []
    for app in data:
        name = app[index]
        if isEng(name):
            new_list.append(app)
    return new_list

android2 = remove_noneng(android1,0)
ios1 = remove_noneng(ios,1)

print('Android rows left:', len(android2))
print('iOS rows left:', len(ios1))

Android rows left: 9615
iOS rows left: 6183
iOS rows left: 6183


### Isolating Free Apps

We are only interested in free apps, whose revenue comes from in-app ads, so we isolate the free apps in both datasets.

In [94]:
free_android = []
free_ios = []

for app in android2:
    prc = app[7]
    if prc == '0':
        free_android.append(app)
        
for app in ios1:
    prc = app[4]
    if prc == '0.0':
        free_ios.append(app)
        
print('Free Android apps:', len(free_android))
print('Free iOS apps:', len(free_ios))      
    

Free Android apps: 8864
Free iOS apps: 3222
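
The two loops above compare the price against a different string for each store ('0' and '0.0'). A single helper that parses the price instead could serve both datasets. This is a sketch under the assumption that paid prices may carry a leading `$` (the sample rows shown earlier only include free apps, so that format is not confirmed here); `is_free` is a hypothetical name.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$0.00' represents zero."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable values (for example from a shifted row) are treated as not free.
        return False


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

With this helper, both filters would read `[app for app in data if is_free(app[price_index])]`, with `price_index` being 7 for Android and 4 for iOS as in the loops above.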


And with that, we are done cleaning the data we have.

# Most Common Apps by Genre

We want to build an app that's popular because revenue is highly influenced by the number of users. Our plan is as follows:
- Build an Android app and add it to Google Play
- If user response is good, we develop it further
- If it's profitable after six months, we build an iOS version and add it to the App Store

Our end goal is to have an app on both Google Play and App Store. Let's find out common genres for each market.

In [95]:
print('Android app columns:\n', android_header)
print('\nApple app columns:\n', ios_header)

Android app columns:
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Apple app columns:
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


### Frequency Tables

For Android apps, the **Category** and **Genres** columns will be assessed; for iOS apps, the **prime_genre** column.

In [96]:
def freq_table(dataset, index):
    table = {}
    count = 0
    
    for row in dataset:
        count += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / count) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

In [97]:
def display_table(dataset,index):
    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key],key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
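
For comparison, the same percentage table can be built with `collections.Counter` and sorted with a key function instead of swapping each pair into a `(value, key)` tuple. This is a sketch, not the notebook's implementation; `freq_table_pct` and `sorted_table` are hypothetical names and the rows are toy data.

```python
from collections import Counter


def freq_table_pct(rows, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {key: n / total * 100 for key, n in counts.items()}


def sorted_table(rows, index):
    """Return (value, percentage) pairs sorted by percentage, largest first."""
    table = freq_table_pct(rows, index)
    return sorted(table.items(), key=lambda kv: kv[1], reverse=True)


# Toy rows: three games and one music app.
rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for value, pct in sorted_table(rows, 1):
    print(value, ':', pct)
```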
        

In [98]:
freq_prime_genre = display_table(free_ios, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


About 58% of the free English apps in the App Store dataset are games (58.16%). Entertainment apps are a distant second at about 7.88%. However, the fact that fun apps are the most numerous does not imply that they also have the greatest number of users: the demand might not be the same as the offer.

Let's continue by examining the Genres and Category columns of the Google Play dataset.

In [99]:
freq_genres = display_table(free_android, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The Genres column shows no clearly dominant genre: the largest share (Tools) is only about 8.45%. The column is also more granular than Category, so the shares are spread across many small genres.

In [100]:
freq_categories = display_table(free_android, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

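The landscape on Google Play is more balanced: FAMILY (about 18.9%) and GAME (about 9.7%) lead, but practical-purpose apps such as tools, business, and productivity also hold sizable shares, and no single type dominates. We will continue with the Category column.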

# Most Popular Genres

For the Google Play Store, the **Installs** column can be used to find the most popular genre. The iOS dataset has no such column, so we use the total number of user ratings (the **rating_count_tot** column) as a proxy. Let's start by calculating the average number of user ratings per app genre on the App Store.

In [101]:
genres_ios = freq_table(free_ios, 11)

for genre in genres_ios:
    count = 0
    len_genre = 0
    
    for app in free_ios:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            count += n_ratings
            len_genre += 1
    avg_n_ratings = count / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
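
The nested loop above scans the whole dataset once per genre. A single pass that accumulates a total and a count per group yields the same averages. This is a minimal sketch with toy rows; `average_by_group` and its index parameters are hypothetical names.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} computed in one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows: (name, genre, number of ratings).
rows = [
    ['A', 'Navigation', '100'],
    ['B', 'Navigation', '300'],
    ['C', 'Medical', '612'],
]
print(average_by_group(rows, 1, 2))  # {'Navigation': 200.0, 'Medical': 612.0}
```

For the iOS data, the equivalent call would be `average_by_group(free_ios, 11, 5)`, using the `prime_genre` and `rating_count_tot` positions from the header.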


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [102]:
for app in free_ios:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Categories such as Navigation, Social Networking, and Reference are skewed by a few very popular apps, so their averages are biased.

Now to the Android apps.

In [104]:
display_table(free_android, 5) # the Installs column

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


In [106]:
categories_android = freq_table(free_android, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in free_android:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_
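
The install counts are open-ended buckets, so stripping the comma and plus characters treats a value like '1,000,000+' as exactly 1,000,000, which is a lower bound rather than a true count. The sketch below isolates that conversion in a helper and ranks a few averages copied from the output above. `parse_installs` is a hypothetical name, and the `sample` dictionary is only a subset of the categories.

```python
def parse_installs(text):
    """Convert an Installs string such as '1,000,000+' to a float lower bound."""
    return float(text.replace(',', '').replace('+', ''))


print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('0'))           # 0.0

# A few category averages copied from the output above (not the full table).
sample = {
    'MEDICAL': 120550.61980830671,
    'COMMUNICATION': 38456119.167247385,
    'SOCIAL': 23253652.127118643,
    'VIDEO_PLAYERS': 24727872.452830188,
}
for category, avg in sorted(sample.items(), key=lambda kv: kv[1], reverse=True):
    print(category, ':', round(avg))
```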

# Conclusions
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.