# Profitable App Profiles for the App Store and Google Play Markets
Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

![img](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png) Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)

Collecting data for over four million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for our purpose:

- A data set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from this link.
- A data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this link.

Let's start by opening the two data sets and then continue with exploring the data.

In [1]:
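# READ DATASET
from csv import reader
def read_dataset(dataset, header=True):
    # Use a context manager so the file is closed after reading
    with open(dataset, encoding="utf8") as f:
        readfile = list(reader(f))
    if header:
        # Return the header row, the data rows, and the full file contents
        return readfile[0], readfile[1:], readfile
    # Without a header row, every row is data
    return None, readfile, readfile
appstore_header, appstore_data, appstore_all = read_dataset('AppleStore.csv')
ggplay_header, ggplay_data, ggplay_all = read_dataset('googleplaystore.csv')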

In [2]:
# EXPLORE DATASET & DELETE WRONG DATA
def explore_data(dataset, start, end, rows_and_columns=False):
    for row in dataset[start:end]:
        print(row)
        print('\n')
    if rows_and_columns:
        print('number of rows:',len(dataset))
        print('number of columns:', len(dataset[0]))

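# Delete the row identified as wrong. Run this cell only once:
# running it again would delete a valid row at the same index.
del ggplay_data[10472]
# Show the row that now sits at index 10472, plus the dataset dimensions
explore_data(ggplay_data,10472,10473, True)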

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


number of rows: 10840
number of columns: 13
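
The cell above deletes a row by a hard-coded index. As a minimal sketch of how such a row could be located programmatically, the snippet below flags rows whose column count differs from the header. It assumes the wrong row is malformed in length, which the notebook does not state; `find_malformed_rows` and the toy data are hypothetical and not taken from the real files.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Toy data for illustration only (hypothetical, not from the real data sets)
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],          # one column is missing, so the values are shifted
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)
```

Checking by length only catches rows with missing or extra columns; a row with the right number of columns but an invalid value would need a separate check.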


In [3]:
# REMOVE DUPLICATE DATA
def find_duplicate(dataset):
    duplicate_data, unique_data = [],[]
    for row in dataset:
        name = row[0]
        if name in unique_data:
            duplicate_data.append(row)
        else:
            unique_data.append(name)
    return duplicate_data, unique_data
# Call the function once and unpack both returned lists
duplicate_gg, unique_gg = find_duplicate(ggplay_data)
# print(duplicate_gg)
# print('Instagram' in unique_gg)

In [4]:
# BUILDING A DICTIONARY STORING APPS AND THEIR MAXIMUM REVIEWS
gg_dic = {}
for app in ggplay_data:
    name = app[0]
    review = float(app[3])
    if name in gg_dic and gg_dic[name] < review:
        gg_dic[name] = review
    elif name not in gg_dic:
        gg_dic[name] = review
# print(gg_dic['Quick PDF Scanner + OCR FREE'])
print(len(gg_dic))

9659
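
The dictionary above stores only the highest review count per app, and the next cell makes a second pass to recover the full rows. As a sketch of an alternative, the hypothetical helper `keep_most_reviewed` below stores the whole row in the dictionary, so one pass is enough. The toy rows are made up; the default indexes assume the name is in column 0 and the review count in column 3, as in the Google Play data.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count.

    On ties, the first row encountered is kept.
    """
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows for illustration only (hypothetical values)
sample_rows = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App B', 'TOOLS', '4.0', '50'],
    ['App A', 'SOCIAL', '4.5', '120'],  # duplicate with more reviews
    ['App B', 'TOOLS', '4.0', '50'],    # exact duplicate (tie)
]

deduplicated = keep_most_reviewed(sample_rows)
print(deduplicated)
```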


In [5]:
# BUILDING A LIST STORING ALL DATA OF APPS WITH NO DUPLICATES 
gg_clean, gg_added = [], []
for apps in ggplay_data:
    if apps[0] not in gg_added and float(apps[3]) == float(gg_dic[apps[0]]):
        gg_clean.append(apps)
        gg_added.append(apps[0]) # duplicate rows can have equal apps[3] values, so track names already added
print(len(gg_clean))

9659


In [6]:
# Write a function that takes in a string and returns False if there's any character in the string that doesn't belong 
# to the set of common English characters; otherwise, the function returns True.
# def detect_eng(string):
#     for i in string:
#         if ord(i) > 127:
#             return False
#     return True # a function returns only once, so if False was already returned above (a character > 127 
#                 # was found), this line is never reached
# print(detect_eng('Docs To Go™ Free Office Suite'))
# print(ord('ê'))

In [7]:
# Change the function you created on the previous screen. If the input string has more than three characters that fall 
# outside the ASCII range (0 - 127), then the function should return False (identify the string as non-English), otherwise 
# it should return True.

# APPROACH 1:
# def detect_eng(string):
#     ascii_list = []
#     for i in string:
#         if ord(i) > 127:
#             ascii_list.append(False)
#         else: 
#             ascii_list.append(True)
#     if ascii_list.count(False) > 3:
#         return False
#     return True

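# APPROACH 2:
def detect_eng(string):
    # Count the characters that fall outside the ASCII range (0 - 127)
    non_ascii_count = 0
    for i in string:
        if ord(i) > 127:
            non_ascii_count += 1
    # More than three non-ASCII characters: treat the name as non-English
    if non_ascii_count > 3:
        return False
    return True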
print(detect_eng('Instachat 😜'))

True


In [8]:
# CREATE ENGLISH-ONLY APPS LIST
def english_app(dataset, name_index):
    english_only = []
    for item in dataset:
        if detect_eng(item[name_index]):
            english_only.append(item)
    return english_only
english_gg = english_app(gg_clean, 0)
english_appstore = english_app(appstore_data, 1)
# print(len(english_appstore))
# print(explore_data(english_appstore,0,3))
explore_data(english_gg,0,3)
print(ggplay_header)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [9]:
def free_app(dataset, price_index, price_measure):
    free = []
    for app in dataset:
        if app[price_index] == price_measure:
            free.append(app)
    return free
free_gg = free_app(english_gg, 7, '0')
free_ios = free_app(english_appstore, 4, '0.0')
print(appstore_header)
# print(explore_data(free_gg, 0, 50))
# print(len(free_gg))

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [10]:
def freq_table(dataset, column):
    app_dic = {}
    for app in dataset:
        if app[column] in app_dic:
            app_dic[app[column]] += 1
        else:
            app_dic[app[column]] = 1
    for key in app_dic:
        app_dic[key] /= len(dataset)
        app_dic[key] *= 100
    return app_dic

def display_table(dataset, column):
    table = freq_table(dataset, column)
    table_dis = []
    for key in table:
        table_dis.append((table[key], key))
    table_sorted = sorted(table_dis, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
# display_table(free_ios, -5)
# print('\n')
# display_table(free_gg, 1)
# print('\n')
# display_table(free_gg, -4)
print(freq_table(free_ios, -5))

{'Social Networking': 3.2898820608317814, 'Photo & Video': 4.9658597144630665, 'Games': 58.16263190564867, 'Music': 2.0484171322160147, 'Reference': 0.5586592178770949, 'Health & Fitness': 2.0173805090006205, 'Weather': 0.8690254500310366, 'Utilities': 2.5139664804469275, 'Travel': 1.2414649286157666, 'Shopping': 2.60707635009311, 'News': 1.3345747982619491, 'Navigation': 0.186219739292365, 'Lifestyle': 1.5828677839851024, 'Entertainment': 7.883302296710118, 'Food & Drink': 0.8069522036002483, 'Sports': 2.1415270018621975, 'Book': 0.4345127250155183, 'Finance': 1.1173184357541899, 'Education': 3.662321539416512, 'Productivity': 1.7380509000620732, 'Business': 0.5276225946617008, 'Catalogs': 0.12414649286157665, 'Medical': 0.186219739292365}
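
The percentage table above is built with a hand-written counting loop. As a sketch of the same idea using the standard library, `collections.Counter` can do the counting and `most_common()` returns the entries sorted by count. `percentage_table` and the toy rows are hypothetical and only illustrate the technique.

```python
from collections import Counter

def percentage_table(rows, column):
    """Map each distinct value in a column to its share of rows, in percent.

    Entries are ordered from most to least frequent.
    """
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Toy rows for illustration only (hypothetical values)
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]

table = percentage_table(sample_rows, 1)
for genre, share in table.items():
    print(genre, ':', share)
```

Because the dictionary is already ordered by frequency, no separate sorting step like the one in `display_table` is needed.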


In [11]:
print(appstore_header)
explore_data(free_ios, 1,6)
# print('\n')
# print(ggplay_header)
# explore_data(free_gg, 1,2)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']




In [13]:
# rating = []
# for app in freq_table(free_ios, -5):
#     genre_rate = []
#     for i in free_ios:
#         if i[-5] == app:
#             genre_rate.append(float(i[7]))
#     avg = sum(genre_rate)/len(genre_rate)
#     rating.append(avg)

# for key in freq_table(free_ios, -5):
#     app = freq_table(free_ios, -5)
#     for rate in rating:
#         app[key]=rate
# print(freq_table(free_ios, -5))
# # genre_count

In [38]:
# APP STORE ANALYSIS: average number of user ratings (rating_count_tot, index 5) per genre (prime_genre, index -5)
rating = []
for app in freq_table(free_ios, -5):
    genre_rate = []
    for i in free_ios:
        if i[-5] == app:
            genre_rate.append(float(i[5]))
    avg = sum(genre_rate)/len(genre_rate)
    print(app, ":", avg)
#     rating.append(avg)

# MAKE A DICTIONARY SHOWING THE NUMBER OF USER RATINGS OF EACH GENRE
# app = freq_table(free_ios, -5)
# genres = [key for key in app]
# rating_dic_ios = {genres[i]:rating[i] for i in range(len(rating))}
# print(rating_dic_ios)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
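
The cell above scans the whole data set once per genre. A possible single-pass variant is sketched below: it accumulates a running total and a count for each group, then divides. `average_by_group` and the toy rows are hypothetical; on the real data the group column would be the genre and the value column the total rating count.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Average a numeric column per group in a single pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows for illustration only (hypothetical values)
sample_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Music', '50'],
]

averages = average_by_group(sample_rows, 0, 1)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

Sorting the result by average, as in the usage lines, makes the highest-rated genres easier to spot than the unsorted output above. A plain mean can also be dominated by a few very popular apps, so it is worth inspecting the individual apps in a genre before drawing conclusions.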


In [58]:
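# GOOGLE PLAY ANALYSIS: average number of installs (Installs, index 5) per category (Category, index 1)
# First strip '+' and ',' so install counts such as '10,000,000+' can be converted to numbers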
for app in free_gg:
    app[5] = app[5].replace('+', '')
    app[5] = app[5].replace(',', '')

gg_app = freq_table(free_gg, 1)
for key in gg_app:
    genre_n_installs = []
    for app in free_gg:
        if app[1] == key:
            genre_n_installs.append(float(app[5]))
    print(key, ":",sum(genre_n_installs)/len(genre_n_installs))

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5000000', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


None
ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
T