# Profitable App Profiles for the App Store and Google Play Markets
***
**Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.**

**At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.**

# Opening and Exploring the Data
***

**As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.**

**Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources with collecting new data ourselves, we should first try to see whether we can find any relevant existing data at no cost. Luckily, these are two data sets that seem suitable for our purpose:**

**[A data](https://www.kaggle.com/lava18/google-play-store-apps) set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).**

**[A data](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).**

**Let's start by opening the two data sets and then continue with exploring the data.**

In [83]:
from csv import reader

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
#AppleData
openApple = open('AppleStore.csv')
readApple = reader(openApple)
appleData = list(readApple)
headerApple = appleData[0]
print(headerApple)
print('\n')
explore_data(appleData,1,5, True)
print('\n')

#GoogleData
openGoogle = open('googleplaystore.csv')
readGoogle = reader(openGoogle)
googleData = list(readGoogle)
headerGoogle = googleData[0]        
print(headerGoogle)
print('\n')
explore_data(googleData,1,5,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo 

In [84]:
del googleData[10473]
print(googleData[10473])


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


*Some of the data is duplicate.The following code shows one such scenario.*

In [85]:
for data in googleData[1:] :
        if data[0] == 'Instagram':
            print(data)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [86]:
#Count the number of duplicates:
count_duplicate = 0 
unique_list = []
for data in googleData[1:] :
    if data[0] in unique_list:
        count_duplicate += 1
    else:
        unique_list.append(data[0])
print(count_duplicate)

1181


*I will remove the duplicate apps in the dataset. In order to do that, I will only keep the data with most number of rating by the users and delete the rest.*

In [87]:
review_max = {}
for data in googleData[1:] :
    name = data[0]
    n_reviews = float(data[3])
    if name in review_max and review_max[name] < n_reviews:
        review_max[name] = n_reviews
    elif name not in review_max :
        review_max[name] = n_reviews

android_clean = []
already_added = []
already_added.append(googleData[0])
for data in googleData[1:] :
    name = data[0]
    n_review = float(data[3])
    if n_review == review_max[name] and name not in already_added :
        android_clean.append(data)
        already_added.append(name)
print(len(android_clean))
    
        

9659


*In the following step i removed non english apps from the dataset.*

In [93]:
def checkString(input) :
    count = 0
    for character in input:
        if ord(character) > 127:
            count += 1
    if count > 3:
        return False
    return True

newGoogleData = []
newGoogleData.append(android_clean[0])
for data in android_clean[1:]:
    if checkString(data[0]):
        newGoogleData.append(data)
        
newAppleData = []
newAppleData.append(appleData[0])
for data in appleData[1:]:
    if checkString(data[0]):
        newAppleData.append(data) 

7198


*In this step I will remove apps which are not free.*

In [94]:
freeGoogleData = []
for data in newGoogleData[1:]:
    if data[7] == '0':
        freeGoogleData.append(data)

freeAppleData = []
for data in newAppleData[1:]:
    if float(data[4]) == 0.0:
        freeAppleData.append(data)

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

* Build a minimal Android version of the app, and add it to Google Play.
* If the app has a good response from users, we then develop it further.
* If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

In [107]:
def freq_table(dataset, index):
    freq = {}
    total = 0
    percentage_freq = {}
    for data in dataset[1:]:
        total+= 1
        if data[index] in freq:
            freq[data[index]] += 1
        else:
            freq[data[index]] = 1
            
    for data in freq:
        percentage = (freq[data] / total) * 100
        percentage_freq[data] = percentage
    
    return percentage_freq
            
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

display_table(freeGoogleData,9)

Tools : 8.451816745655607
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5319341006544795
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7603249830737984
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

In [127]:
genre = freq_table(freeAppleData,11)
for data1 in genre :
    total = 0
    len_genre = 0
    for data2 in freeAppleData[1:]:
        genre_app = data2[11]
        if genre_app == data1 :
           total +=  float(data2[5])
           len_genre += 1
    average = total / len_genre
    print(data1,average)

Photo & Video 27249.892215568863
Games 18924.68896765618
Music 56482.02985074627
Social Networking 32503.563380281692
Reference 67447.9
Health & Fitness 19952.315789473683
Weather 47220.93548387097
Utilities 14010.100917431193
Travel 20216.01785714286
Shopping 18746.677685950413
News 15892.724137931034
Navigation 25972.05
Lifestyle 8978.308510638299
Entertainment 10822.961077844311
Food & Drink 20179.093023255813
Sports 20128.974683544304
Book 8498.333333333334
Finance 13522.261904761905
Education 6266.333333333333
Productivity 19053.887096774193
Business 6367.8
Catalogs 1779.5555555555557
Medical 459.75


In [126]:
category = freq_table(freeGoogleData,1)
for data1 in category :
    total = 0
    len_category = 0
    installs
    for data2 in freeGoogleData[1:]:
        category_app = data2[1]
        installs = data2[5]
        if category_app == data1 :
           total +=  float((data2[5].replace("+","")).replace(",",""))
           len_category += 1
    average = total / len_category
    print(data1,average)

ART_AND_DESIGN 1967474.5454545454
AUTO_AND_VEHICLES 647317.8170731707
BEAUTY 513151.88679245283
BOOKS_AND_REFERENCE 8767811.894736841
BUSINESS 1712290.1474201474
COMICS 817657.2727272727
COMMUNICATION 38456119.167247385
DATING 854028.8303030303
EDUCATION 1833495.145631068
ENTERTAINMENT 11640705.88235294
EVENTS 253542.22222222222
FINANCE 1387692.475609756
FOOD_AND_DRINK 1924897.7363636363
HEALTH_AND_FITNESS 4188821.9853479853
HOUSE_AND_HOME 1331540.5616438356
LIBRARIES_AND_DEMO 638503.734939759
LIFESTYLE 1437816.2687861272
GAME 15588015.603248259
FAMILY 3695641.8198090694
MEDICAL 120550.61980830671
SOCIAL 23253652.127118643
SHOPPING 7036877.311557789
PHOTOGRAPHY 17840110.40229885
SPORTS 3638640.1428571427
TRAVEL_AND_LOCAL 13984077.710144928
TOOLS 10801391.298666667
PERSONALIZATION 5201482.6122448975
PRODUCTIVITY 16787331.344927534
PARENTING 542603.6206896552
WEATHER 5074486.197183099
VIDEO_PLAYERS 24727872.452830188
NEWS_AND_MAGAZINES 9549178.467741935
MAPS_AND_NAVIGATION 4056941.774193