# My first project
This project analyses which kinds of apps are most downloaded from the App Store and the Google Play Store, in order to improve the revenue stream of an app-making company. It aims to give developers suggestions on which kinds of free apps are likely to be the most profitable.

Data Sources:
- [Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps/home)
- [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

Credit to DataQuest for the project guidelines.

In [1]:
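from csv import reader

# Read both data sets into lists of rows (the first row is the header)
open_file_gp = open('googleplaystore.csv')
open_file_as = open('AppleStore.csv')

read_file_gp = reader(open_file_gp)
read_file_as = reader(open_file_as)

google_play = list(read_file_gp)
app_store = list(read_file_as)

# Close the files once their contents are in memory
open_file_gp.close()
open_file_as.close()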
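
The loading step could also be wrapped in a small helper. This is only a sketch: `load_csv` is a hypothetical name, and the `utf8` encoding is an assumption about how the Kaggle files are stored, not something stated in the data set description.

```python
from csv import reader


def load_csv(path, encoding="utf8"):
    """Read a CSV file and return (header, rows) as lists of strings."""
    # The context manager closes the file even if parsing raises an error.
    # newline="" is what the csv module recommends for files passed to reader().
    with open(path, encoding=encoding, newline="") as csv_file:
        rows = list(reader(csv_file))
    # The first row is the header, the rest is the data.
    return rows[0], rows[1:]
```

With this helper, `gp_header, google_play_nh = load_csv('googleplaystore.csv')` would give the header and the header-free rows in one call.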

In [2]:
#DataQuest
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(google_play, 1,3)
explore_data(app_store, 1,3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']




In [4]:
#number of rows
rows_as = len(app_store)
rows_gp = len(google_play)

print("Apple Store rows: " + str(rows_as))
print("Google Play rows: " + str(rows_gp))

#number of columns including header
as_columns = len(app_store[0])
gp_columns = len(google_play[0])

print("Apple Store columns: " + str(as_columns))
print("Google Play columns: " + str(gp_columns))

#header info
as_header = app_store[0]
gp_header = google_play[0]
print("Apple Store header: " + str(as_header))
print("Google Play header: " + str(gp_header))

Apple Store rows: 7198
Google Play rows: 10842
Apple Store columns: 16
Google Play columns: 13
Apple Store header: ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
Google Play header: ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [5]:
#no header data sets
app_store_nh = app_store[1:]
google_play_nh = google_play[1:]

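Data cleaning process:
- Remove incorrect rows and duplicate entries
- Remove non-English apps
- Remove apps that aren't free

Google Play: the row at index 10472 (header excluded) is missing its 'Category' value, so the remaining values are shifted one column to the left.
[error reference link](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)

# Deleting incorrect data and duplicates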

In [6]:
print(google_play_nh[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [7]:
del google_play_nh[10472]
print(google_play_nh[10472]) #verifying that the app number has now changed

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
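
Instead of relying on a known row index, rows like this one could be found by comparing each row's length with the header's. The sketch below uses a hypothetical helper, `find_malformed_rows`, and a made-up three-column sample rather than the real data.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Tiny illustrative data set (made up, not taken from the Kaggle files).
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["App A", "TOOLS", "4.2"],
    ["App B", "1.9"],            # 'Category' is missing, so the row is short
    ["App C", "SOCIAL", "4.5"],
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)
```

When deleting several rows found this way, it is safer to delete from the highest index down, since each `del` shifts the rows that follow it.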


The Google Play data set contains duplicate entries. For example, Instagram appears four times:

In [8]:
for app in google_play_nh:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Duplicates are removed based on the number of reviews: for each app, only the entry with the highest number of reviews is kept.

In [9]:
gp_duplicates = []
gp_unique = []

for app in google_play_nh:
    name = app[0]
    if name in gp_unique:
        gp_duplicates.append(name)
    else:
        gp_unique.append(name)
        
gp_duplicates_number = len(gp_duplicates)
gp_unique_number = len(gp_unique)
print("GP duplicates: " + str(gp_duplicates_number))
print("GP unique: " + str(gp_unique_number))

GP duplicates: 1181
GP unique: 9659
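
The same two numbers can be obtained with `collections.Counter`, which avoids the repeated `name in list` lookups. This is a sketch, with `count_duplicates` as a hypothetical helper name and a made-up sample.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of duplicate rows, number of unique names)."""
    name_counts = Counter(row[name_index] for row in rows)
    n_unique = len(name_counts)
    # Every occurrence beyond the first one per name counts as a duplicate.
    n_duplicates = sum(name_counts.values()) - n_unique
    return n_duplicates, n_unique


# Made-up sample: 'Instagram' appears three times, 'Facebook' once.
sample = [["Instagram"], ["Instagram"], ["Facebook"], ["Instagram"]]
print(count_duplicates(sample))
```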


Checking whether the App Store data set contains duplicates (by app id):

In [10]:
as_duplicates = []
as_unique = []

for app in app_store_nh:
    app_id = app[0]  # column 0 is 'id'; the app name is in column 1 ('track_name')
    if app_id in as_unique:
        as_duplicates.append(app_id)
    else:
        as_unique.append(app_id)
        
as_duplicates_number = len(as_duplicates)
as_unique_number = len(as_unique)
print("AS duplicates: " + str(as_duplicates_number))
print("AS unique: " + str(as_unique_number))

AS duplicates: 0
AS unique: 7197


The App Store data set has no duplicate apps.

Finding the highest number of reviews for each app in the Google Play data set:

In [11]:
reviews_max = {}

for app in google_play_nh:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


Removal of duplicate rows in Google Play:

In [12]:
android_clean = []
already_added = []

for app in google_play_nh:
    name = app[0]
    n_reviews = float(app[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print("Number of android clean items: " + str(len(android_clean)))

Number of android clean items: 9659
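
The two steps above (finding the maximum, then filtering) could also be merged into a single pass. The following is a sketch under that assumption; `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened versions of the Instagram entries shown earlier.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when two rows tie on reviews.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
    ["Coloring book moana", "ART_AND_DESIGN", "3.9", "967"],
]
cleaned = keep_most_reviewed(sample)
print(cleaned)
```

One difference from the two-step version: the result is ordered by the first appearance of each name, not by the position of the kept row.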


# Filtering out non-English apps

The function checks whether a string is in English by detecting whether it contains at least one character outside the ASCII range (0-127)

In [13]:
def is_in_english(a_string):
    for character in a_string:
        if ord(character) > 127:
            return False
    return True
    
#testing 
print("Instagram :" + str(is_in_english("Instagram")))
print("爱奇艺PPS -《欢乐颂2》电视剧热播 :" + str(is_in_english("爱奇艺PPS -《欢乐颂2》电视剧热播")))
print("Docs To Go™ Free Office Suite :" + str(is_in_english("Docs To Go™ Free Office Suite")))
print("Instachat 😜 :" + str(is_in_english("Instachat 😜")))

Instagram :True
爱奇艺PPS -《欢乐颂2》电视剧热播 :False
Docs To Go™ Free Office Suite :False
Instachat 😜 :False


The first version misclassifies names that contain special characters or emoji. The function is therefore modified to treat a string as non-English only when more than three of its characters fall outside the ASCII range (0-127)

In [14]:
def is_in_english(a_string):
    nonASCII = 0
    for character in a_string:
        if ord(character) > 127:
            nonASCII += 1
    if nonASCII > 3:
        return False
    return True
    
#testing 
print("Instagram :" + str(is_in_english("Instagram")))
print("爱奇艺PPS -《欢乐颂2》电视剧热播 :" + str(is_in_english("爱奇艺PPS -《欢乐颂2》电视剧热播")))
print("Docs To Go™ Free Office Suite :" + str(is_in_english("Docs To Go™ Free Office Suite")))
print("Instachat 😜 :" + str(is_in_english("Instachat 😜")))

Instagram :True
爱奇艺PPS -《欢乐颂2》电视剧热播 :False
Docs To Go™ Free Office Suite :True
Instachat 😜 :True
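
Both versions of the check can be expressed as one function with a configurable threshold. This is a sketch: the `max_non_ascii` parameter is an addition for illustration, with a default of 3 to match the second version.

```python
def is_in_english(a_string, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters are outside ASCII (0-127)."""
    non_ascii = sum(1 for character in a_string if ord(character) > 127)
    return non_ascii <= max_non_ascii


for name in ["Instagram", "爱奇艺PPS -《欢乐颂2》电视剧热播",
             "Docs To Go™ Free Office Suite", "Instachat 😜"]:
    print(name, ":", is_in_english(name))
```

Calling it with `max_non_ascii=0` reproduces the stricter first version. The heuristic is still imperfect: a non-English name with three or fewer non-ASCII characters passes the check.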


Filtering out non-English apps from both data sets

In [15]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_in_english(name):
        android_english.append(app)

for app in app_store_nh:
    name = app[1]
    if is_in_english(name):
        ios_english.append(app)

print("Number of unique apps in English (Google Play) " + str(len(android_english)))
print("Number of unique apps in English (Apple Store) " + str(len(ios_english)))


Number of unique apps in English (Google Play) 9614
Number of unique apps in English (Apple Store) 6183


# Isolating the free apps

In [16]:
android_free = []
ios_free = []

for app in android_english:
    price = app[7]
    if price == "0":
        android_free.append(app)

for app in ios_english:
    price = app[4]
    if price == "0.0":
        ios_free.append(app)

print("Number of free apps in English (Google Play) " + str(len(android_free)))
print("Number of free apps in English (Apple Store) " + str(len(ios_free)))


Number of free apps in English (Google Play) 8864
Number of free apps in English (Apple Store) 3222
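
The two stores format a zero price differently ("0" and "0.0"), so a numeric comparison may be more robust than comparing strings. The sketch below uses a hypothetical helper, `is_free`. It assumes that paid prices may carry a leading `$` (for example `'$4.99'`), which is not shown in the rows printed above.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' equals zero."""
    # Assumption: paid prices may look like '$4.99', so a leading '$' is stripped.
    cleaned = price.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Non-numeric values (for example from a shifted column) count as not free.
        return False


print(is_free("0"), is_free("0.0"), is_free("$4.99"))
```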


Strategy for developers:
- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.


Most useful columns for this analysis, per store:
- Apple Store header: ['prime_genre']
- Google Play header: ['Category', 'Genres', 'Last Updated']

In [17]:
def freq_table(data_set, index):
    frequency_table = {}
    total = 0
    
    for row in data_set:
        value = row[index]
        total += 1
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
    
    table_percentages = {}
    
    for key in frequency_table:
        percentage = (frequency_table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

#Dataquest
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
print("'Prime genre' ios: ")
display_table(ios_free, 11)
print("\n'Genre' android")
display_table(android_free, 9)
print("\n'Category' android ")
display_table(android_free, 1)

'Prime genre' ios: 
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665

'Genre' android
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653

# Analysis of data:
Most frequent prime genres on iOS:
- Games (58.2%)
- Entertainment (7.9%)
- Photo & Video (5.0%)

Most frequent genres on Android:
- Tools (8.4%)
- Entertainment (6.1%)
- Education (5.3%)

Most frequent categories on Android:
- Family
- Game
- Tools
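
The percentage tables summarised here can also be built with `collections.Counter`, which counts and sorts in one step. This is a sketch: `freq_table_sorted` is a hypothetical helper, and the sample rows are made up.

```python
from collections import Counter


def freq_table_sorted(data_set, index):
    """Return (value, percentage) pairs, most frequent value first."""
    counts = Counter(row[index] for row in data_set)
    total = sum(counts.values())
    # most_common() sorts by count, descending.
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Made-up sample: three 'Games' rows and one 'Education' row.
sample = [["a", "Games"], ["b", "Games"], ["c", "Education"], ["d", "Games"]]
for value, percentage in freq_table_sorted(sample, 1):
    print(value, ':', percentage)
```

Values with equal counts keep their first-seen order here, whereas `display_table` breaks ties by sorting the names in reverse order.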

In [18]:
#Frequency table for prime genre (iOS); index -5 is the 'prime_genre' column (same as index 11)
prime_genre = freq_table(ios_free, -5)

for genre in prime_genre:
    total = 0 #sum of rating counts ('rating_count_tot') for the genre, not the rating values
    len_genre = 0 #number of apps in the genre
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:
            user_ratings = float(app[5])
            total += user_ratings
            len_genre += 1
            
    avg_user_ratings = total / len_genre
    print(genre + ": " + str(avg_user_ratings))


Entertainment: 14029.830708661417
Finance: 31467.944444444445
Music: 57326.530303030304
Catalogs: 4004.0
Games: 22788.6696905016
Education: 7003.983050847458
Weather: 52279.892857142855
Reference: 74942.11111111111
Social Networking: 71548.34905660378
News: 21248.023255813954
Productivity: 21028.410714285714
Navigation: 86090.33333333333
Shopping: 26919.690476190477
Sports: 23008.898550724636
Photo & Video: 28441.54375
Food & Drink: 33333.92307692308
Health & Fitness: 23298.015384615384
Medical: 612.0
Utilities: 18684.456790123455
Book: 39758.5
Travel: 28243.8
Business: 7491.117647058823
Lifestyle: 16485.764705882353
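
The nested loop above scans the whole data set once per genre. The averages could instead be accumulated in a single pass, as in this sketch. `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample rows of the form [genre, rating count].
sample = [["Games", "100"], ["Games", "300"], ["Medical", "612"]]
averages = average_by_group(sample, 0, 1)
print(averages)
```

For the iOS data this would be called as `average_by_group(ios_free, -5, 5)`, using the same column indices as the cell above.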


In [21]:
#Average number of installs per category (Android); 'Category' is column 1, 'Installs' is column 5
categories_android = freq_table(android_free, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

VIDEO_PLAYERS : 24727872.452830188
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
MAPS_AND_NAVIGATION : 4056941.7741935486
ART_AND_DESIGN : 1986335.0877192982
SOCIAL : 23253652.127118643
COMMUNICATION : 38456119.167247385
TOOLS : 10801391.298666667
LIBRARIES_AND_DEMO : 638503.734939759
DATING : 854028.8303030303
EVENTS : 253542.22222222222
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
HOUSE_AND_HOME : 1331540.5616438356
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
PARENTING : 542603.6206896552
PERSONALIZATION : 5201482.6122448975
GAME : 15588015.603248259
NEWS_AND_MAGAZINES : 9549178.467741935
WEATHER : 5074486.197183099
PRODUCTIVITY : 16787331.344927534
BEAUTY : 513151.88679245283
EDUCATION : 1833495.145631068
FOOD_AND_DRINK : 1924897.7363636363
COMICS : 817657.2727272727
HEALTH_AND_FITNESS : 4188821.9853479853
TRAVEL_AND_LOCAL : 13984077.710144928
A