# Profitable App Profiles for the App Store and Google Play Markets
An introductory exercise in data analysis.

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

We analyze the following two datasets:

* A [dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* A [dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).


In [1]:
import csv
import os
  
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


In [2]:
gps_filename = "googleplaystore.csv"
ios_filename = "AppleStore.csv"
    
with open(gps_filename, "r", encoding="utf8") as googlePS:
    googlePS_reader = list(csv.reader(googlePS))
with open(ios_filename, "r", encoding="utf8") as ios:
    ios_reader = list(csv.reader(ios))
    
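# Remove two malformed rows. The indices count the header row as 0, and the
# higher index is deleted first so the lower one does not shift.
del googlePS_reader[10473]
del googlePS_reader[9149]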



Unfortunately, the Google Play dataset not only has broken data entries (the two deleted above); it also has duplicate entries that we must filter out before performing any analysis.

In [3]:
for app in googlePS_reader:
    name = app[0]
    if name == "Instagram":
        print(app)

print()

for app in googlePS_reader:
    name = app[0]
    if name == "Facebook":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 

In [4]:
# Counting the Number of Duplicates in a Dataset

duplicates = []
uniques = []

for app in googlePS_reader:
    name = app[0]
    if name in uniques:
        duplicates.append(name)
    else:
        uniques.append(name)

print("Number of duplicate apps:", len(duplicates))

Number of duplicate apps: 1181
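
The loop above tests membership in a list, which rescans that list for every row. A minimal alternative sketch, assuming only that the app name sits at index 0, counts names with `collections.Counter` instead. The names `count_duplicates` and `sample_rows` are hypothetical and used for illustration only; the toy rows are not taken from the dataset.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return how many rows repeat a name that already appeared earlier."""
    counts = Counter(row[name_index] for row in rows)
    # Each name contributes (occurrences - 1) duplicate rows.
    return sum(n - 1 for n in counts.values())


# Hypothetical toy rows: "Alpha" appears three times, so two rows are duplicates.
sample_rows = [["Alpha", "10"], ["Beta", "5"], ["Alpha", "12"], ["Alpha", "11"]]
print("Number of duplicate apps:", count_duplicates(sample_rows))
```

When applied to the real data, the header row should be skipped or left in; it occurs only once, so it does not change the count either way.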


We will filter out the duplicate apps based on their number of reviews. We assume that, among entries for the same app, the one with the most reviews is the most recent, so we keep only that entry.

In [5]:
print("Expected length:", len(googlePS_reader) - 1 - len(duplicates))

# Data Filtering
reviews_max = {}

# Generate a dictionary of app names and their max number of reviews found in the dataset
for app in googlePS_reader[1:]:  # skip the header row
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print("Number of unique entries:", len(reviews_max))

googlePS_clean = []
already_added = []

# Generate a new list with only the entries corresponding to the maximum reviews for each app
for app in googlePS_reader[1:]:  # skip the header row
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        googlePS_clean.append(app)
        already_added.append(name)

print("Confirmed number of unique entries:", len(googlePS_clean))


Expected length: 9658
Number of unique entries: 9658
Confirmed number of unique entries: 9658
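
The two passes above can also be folded into one. The following is a sketch under the same assumptions as the cell above (name at index 0, review count at index 3); `keep_most_reviewed` and `sample_rows` are hypothetical names, and the toy rows are invented for illustration.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Hypothetical toy rows: [name, category, rating, reviews]
sample_rows = [
    ["Alpha", "CAT", "4.5", "100"],
    ["Beta", "CAT", "4.0", "7"],
    ["Alpha", "CAT", "4.5", "250"],
    ["Alpha", "CAT", "4.5", "250"],
]
cleaned = keep_most_reviewed(sample_rows)
print("Number of unique entries:", len(cleaned))
```

One difference from the two-pass version: the result is ordered by each name's first appearance rather than by the position of its most-reviewed row. The set of rows kept is the same.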


The dataset also includes non-English apps that are not of interest for our analysis, so they also have to be filtered out. To do this, we first define a function that identifies whether an app's name has non-English characters.

We will filter out all app names with more than three characters falling outside the standard ASCII range (0-127).

In [6]:
def is_english(inputstr):
    count = 0
    
    for char in inputstr:
        if ord(char) > 127:
            count += 1
    
    return count <= 3

android_clean = []
ios_clean = []

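for app in googlePS_clean:
    name = app[0]
    
    if is_english(name):
        android_clean.append(app)

# Assumes the app name is in column 0 of AppleStore.csv. If column 0 holds the
# numeric id instead (with the name in column 1), every row passes this check.
for app in ios_reader[1:]:  # skip the header row
    name = app[0]
    
    if is_english(name):
        ios_clean.append(app)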

print("Number of English Android Apps:", len(android_clean))
print("Number of English iOS Apps:", len(ios_clean))

Number of English Android Apps: 9613
Number of English iOS Apps: 7197


To complete our data filtering, we keep only the free apps in each dataset.

In [7]:
android_allclean = []
ios_allclean = []

for app in android_clean:
    if app[6] == "Free":
        android_allclean.append(app)

for app in ios_clean:
    if app[4] == "0.0":
        ios_allclean.append(app)

print("Number of Free English Android Apps:", len(android_allclean))
print("Number of Free English iOS Apps:", len(ios_allclean))

Number of Free English Android Apps: 8863
Number of Free English iOS Apps: 4056


Our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

For our specific datasets, the analysis requires building frequency tables for the *prime_genre* column of the App Store dataset, and for the *Genres* and *Category* columns of the Google Play dataset.

In [10]:
# Builds a frequency table of counts and displays it in descending order of count

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


def freq_table(dataset, index):
    frequency_table = {}
    
    for app in dataset:
        key = app[index]
        if key in frequency_table:
            frequency_table[key] += 1
        else:
            frequency_table[key] = 1
    
    return frequency_table
            

display_table(ios_allclean, 11)

Games : 2257
Entertainment : 334
Photo & Video : 167
Social Networking : 143
Education : 132
Shopping : 121
Utilities : 109
Lifestyle : 94
Finance : 84
Sports : 79
Health & Fitness : 76
Music : 67
Book : 66
Productivity : 62
News : 58
Travel : 56
Food & Drink : 43
Weather : 31
Reference : 20
Navigation : 20
Business : 20
Catalogs : 9
Medical : 8
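
Raw counts are hard to compare across two datasets of different sizes, so it can help to express a frequency table as percentages. The sketch below assumes a `{key: count}` dictionary like the one `freq_table` returns; `to_percentages` and `toy_counts` are hypothetical names, and the counts are invented for illustration.

```python
def to_percentages(counts):
    """Convert a {key: count} frequency table into {key: percentage of total}."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: count * 100 / total for key, count in counts.items()}


# Hypothetical toy counts, not taken from the dataset
toy_counts = {"Games": 6, "Education": 3, "Weather": 1}
percentages = to_percentages(toy_counts)

# Display in descending order of share
for key, share in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(key, ":", round(share, 2))
```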


# Analyzing the App Store Data

* The two most common genres are "Games" and "Entertainment".
* The two least common genres are "Medical" and "Catalogs".
* Genres that merely provide information (such as "Weather", "Reference", and "Navigation") have few apps, probably because a handful of apps already covers the need and more would be redundant. Genres tied to personal expression and individuality (games, social media, art) have more apps.
* Overall, entertainment-oriented genres have more apps than practical ones.
* Note that a large number of apps does not necessarily entail a large userbase for the genre as a whole. Users might be concentrated on a very small subset of apps in a given genre.

In [11]:
display_table(android_allclean, 9)
print()
display_table(android_allclean, 1)

Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

# Analyzing the Google Play Store Data

* The *Genres* column (index 9) is a poor reference: some genres are too general ("Tools" could cover almost anything), and some are redundant ("Education", "Educational", and "Educational;Education", for example). The tags that app makers assign to their own apps are not consistent.
* The *Category* column (index 1) appears to be more useful, and suggests a more balanced mix of entertainment and practical apps. "FAMILY" is by far the most common category.

# The Next Step
Counting the apps in each genre is not enough: we also need to determine which kinds of apps have the most users.

The Google Play dataset provides the number of installs per app, but the App Store dataset does not. As a workaround, we'll use the total number of user ratings, found in the *rating_count_tot* field, as a proxy and average it per genre. For Google Play, we'll use the average number of installs per category.

In [12]:
ios_genre_table = freq_table(ios_allclean, 11)

for genre in ios_genre_table:
    total = 0
    len_genre = 0
    
    for app in ios_allclean:
        genre_app = app[11]
        if genre_app == genre:
            ratings = float(app[5])
            total += ratings
            len_genre += 1
    
    avg_ratings = total / len_genre
    print("Average User Ratings for", genre, "=", avg_ratings)

Average User Ratings for Social Networking = 53078.195804195806
Average User Ratings for Photo & Video = 27249.892215568863
Average User Ratings for Games = 18924.68896765618
Average User Ratings for Music = 56482.02985074627
Average User Ratings for Reference = 67447.9
Average User Ratings for Health & Fitness = 19952.315789473683
Average User Ratings for Weather = 47220.93548387097
Average User Ratings for Utilities = 14010.100917431193
Average User Ratings for Travel = 20216.01785714286
Average User Ratings for Shopping = 18746.677685950413
Average User Ratings for News = 15892.724137931034
Average User Ratings for Navigation = 25972.05
Average User Ratings for Lifestyle = 8978.308510638299
Average User Ratings for Entertainment = 10822.961077844311
Average User Ratings for Food & Drink = 20179.093023255813
Average User Ratings for Sports = 20128.974683544304
Average User Ratings for Book = 8498.333333333334
Average User Ratings for Finance = 13522.261904761905
Average User Ratings 

In [13]:
android_category_table = freq_table(android_allclean, 1)

for category in android_category_table:
    total = 0
    len_category = 0
    
    for app in android_allclean:
        category_app = app[1]
        
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total += float(installs)
            len_category += 1
    
    avg_installs = total / len_category
    print("Average Installs for", category, "=", avg_installs)  
    

Average Installs for ART_AND_DESIGN = 1986335.0877192982
Average Installs for AUTO_AND_VEHICLES = 647317.8170731707
Average Installs for BEAUTY = 513151.88679245283
Average Installs for BOOKS_AND_REFERENCE = 8767811.894736841
Average Installs for BUSINESS = 1712290.1474201474
Average Installs for COMICS = 817657.2727272727
Average Installs for COMMUNICATION = 38456119.167247385
Average Installs for DATING = 854028.8303030303
Average Installs for EDUCATION = 1833495.145631068
Average Installs for ENTERTAINMENT = 11640705.88235294
Average Installs for EVENTS = 253542.22222222222
Average Installs for FINANCE = 1387692.475609756
Average Installs for FOOD_AND_DRINK = 1924897.7363636363
Average Installs for HEALTH_AND_FITNESS = 4188821.9853479853
Average Installs for HOUSE_AND_HOME = 1331540.5616438356
Average Installs for LIBRARIES_AND_DEMO = 638503.734939759
Average Installs for LIFESTYLE = 1437816.2687861272
Average Installs for GAME = 15588015.603248259
Average Installs for FAMILY = 3697
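
The per-genre and per-category loops above rescan the whole dataset once for every group. A single-pass alternative is sketched below, assuming install strings shaped like `'1,000,000+'` as in the Google Play rows shown earlier. The helpers `parse_installs` and `average_by_key` and the toy rows are hypothetical and used for illustration only.

```python
from collections import defaultdict


def parse_installs(text):
    """Turn an install string such as '1,000,000+' into a float."""
    return float(text.replace("+", "").replace(",", ""))


def average_by_key(rows, key_index, value_index, parse=float):
    """Average parse(row[value_index]) for each distinct row[key_index] in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += parse(row[value_index])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical toy rows: [name, category, installs]
sample_rows = [
    ["Alpha", "GAME", "1,000+"],
    ["Beta", "GAME", "3,000+"],
    ["Gamma", "TOOLS", "500+"],
]
averages = average_by_key(sample_rows, key_index=1, value_index=2, parse=parse_installs)

# Display in descending order of average installs
for category, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print("Average Installs for", category, "=", avg)
```

The same helper would cover the App Store case by passing the genre and rating-count indices and leaving `parse` at its default of `float`.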