# Profitable App Profiles for the App Store & Google Play Markets

### Background
Our company builds only apps that are free to download and install, and our main source of revenue is in-app ads. This means that the number of users determines the revenue for any given app: the more users who see and engage with the ads, the better.

### Goal
The goal of this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

In [1]:
from csv import reader

# Opening the Google Play dataset
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]
android = android[1:]

# Opening the App Store dataset
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
ios_header = ios[0]
ios = ios[1:]
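
The two loading steps above follow the same pattern, so they could be wrapped in one helper. The following is a minimal sketch, not part of the original notebook: the name `load_csv` and its `encoding` default are assumptions made for illustration.

```python
import csv

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The `with` block closes the file even if reading fails.
    # newline="" is the setting the csv module documentation recommends.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Intended usage with the files named in this notebook:
# android_header, android = load_csv('googleplaystore.csv')
# ios_header, ios = load_csv('AppleStore.csv')
```

Returning the header and the data rows separately mirrors the `android_header` / `android` split used in the rest of the analysis.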

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(ios,0,3,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
explore_data(android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [6]:
print(android_header)
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [7]:
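# Row 10472 has only 12 values instead of 13: 'Category' is missing,
# so every later column is shifted left by one.
print(android[10472])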

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [8]:
del android[10472]  # remove the malformed row; run this cell only once
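
Rather than hard-coding the index, rows like this one can be found by comparing each row's length with the header's. This is a hedged sketch: `find_malformed_rows` is a hypothetical helper, and the sample rows are made up for illustration, not taken from the full dataset.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Made-up sample: the second row lacks its 'Category' value.
header = ["App", "Category", "Rating"]
rows = [
    ["Slack", "BUSINESS", "4.4"],
    ["Life Made WI-Fi Touchscreen Photo Frame", "1.9"],
    ["Example App", "TOOLS", "4.0"],
]

bad = find_malformed_rows(header, rows)
print(bad)

# Delete from the end so that earlier indexes stay valid.
for i in reversed(bad):
    del rows[i]
```

Because the deletion is driven by the check itself, re-running it on already clean data removes nothing.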

The Google Play dataset contains duplicate entries. To confirm this, let's look at the entries for one app:

In [9]:
for entry in android:
    name = entry[0]
    if name == "Slack":
        print(entry)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


As the output shows, "Slack" has three entries.

In [10]:
duplicate = []
unique = []

for entry in android:
    name = entry[0]
    if name in unique:
        duplicate.append(name)
    else:
        unique.append(name)
print("The number of duplicate entries is:", len(duplicate))
print("A few examples are: ", duplicate[:10])

The number of duplicate entries is: 1181
A few examples are:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Looking more closely at the duplicate entries, we notice that the main difference between them is the number of reviews.

We will not remove the duplicates randomly. Instead, for each app we keep the entry with the highest number of reviews, since a higher review count suggests the data was collected more recently, and we remove the others.

In [11]:
# Building a dictionary that maps each app name to its highest review count.
reviews_max = {}

for entry in android:
    name = entry[0]
    n_reviews = float(entry[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

# Removing duplicate entries by creating a new dataset with only unique values.
android_clean = []
already_added = []

for entry in android:
    name = entry[0]
    n_reviews = float(entry[3])
    # The already_added check keeps a single entry when several
    # duplicates share the same (highest) review count.
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(entry)
        already_added.append(name)
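
The same keep-the-highest-review-count rule can also be sketched in a single pass with one dictionary. This is an alternative under stated assumptions, not the notebook's method: `dedupe_by_max_reviews` is a hypothetical name, and the "Example App" row is made up. Rows come back in order of each name's first appearance, which can differ from the order produced by the two-pass version.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

sample = [
    ["Slack", "BUSINESS", "4.4", "51507"],
    ["Example App", "TOOLS", "4.0", "120"],   # made-up row
    ["Slack", "BUSINESS", "4.4", "51510"],
]

result = dedupe_by_max_reviews(sample)
for row in result:
    print(row)
```

Dictionary lookups take constant time on average, so this avoids the repeated list scans of `name not in already_added` on a dataset of roughly ten thousand rows.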

        

The App Store dataset contains no duplicate entries, so we leave it as it is.

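The next cleaning step is to remove apps aimed at non-English audiences. We identify them by name: English text mostly uses ASCII characters (code points 0 to 127), so a name with many characters outside that range is probably not English. To avoid discarding English apps whose names include an emoji or a symbol such as ™, we allow up to three non-ASCII characters.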

In [12]:
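def nonEnglish(name):
    # Despite its name, this returns True when the name looks English:
    # at most three characters have a code point above 127 (outside ASCII).
    count = 0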
    for character in name:
        if ord(character) > 127:
            count += 1
    if count > 3:
        return False
    else:
        return True

In [13]:
android_english = []
ios_english = []

# Going through the Google Play Store dataset
for app in android_clean:
    name = app[0]
    English_status = nonEnglish(name)
    if English_status == True:
        android_english.append(app)

# Going through the App Store dataset
for app in ios:
    name = app[1]
    English_status = nonEnglish(name)
    if English_status == True:
        ios_english.append(app)


In [14]:
android_free = []
ios_free = []

# Removing non-free apps in the Google Play Store dataset.
for app in android_english:
    app_type = app[6]  # index 6 is the 'Type' column, e.g. 'Free'
    if app_type == "Free":
        android_free.append(app)
        
# Removing non-free apps in the AppStore dataset.
for app in ios_english:
    price = float(app[4])
    if price == 0.0:
        ios_free.append(app)

print(len(android_free))
print(len(ios_free))

8863
3222


Our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by determining the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our datasets.

In [15]:
def freq_table(dataset, index):
    dict_freq = {}
    
    total = len(dataset)
    for element in dataset:
        item = element[index]
        if element[index] in dict_freq:
            dict_freq[item] += 1
        else:
            dict_freq[item] = 1
    for values in dict_freq:
        dict_freq[values] = round(((dict_freq[values]/total)*100), 2)

    return dict_freq

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
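
The counting and sorting above can also be sketched with `collections.Counter` from the standard library. This is an alternative for comparison, not the notebook's implementation: `freq_table_pct` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Return {value: percentage} for one column, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs sorted by descending count.
    return {key: round(count / total * 100, 2)
            for key, count in counts.most_common()}

sample = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
table = freq_table_pct(sample, 1)

for key, pct in table.items():
    print(key, ':', pct)
```

Because dictionaries preserve insertion order, the result is already sorted, so a separate display step that builds and sorts tuples is not needed.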

In [16]:
display_table(ios_free, 11)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


As the output shows:
1. Games is by far the most common genre, accounting for about 58% of the free English apps on the App Store.
2. Most apps are designed for recreation (Games, Entertainment, Photo & Video, and so on).

In [17]:
display_table(android_free, 9)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.9
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;B

In [20]:
prime_genre = freq_table(ios_free, 11)
   
for genre in prime_genre:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    average = total/len_genre
    print(genre,": ",total)
    

Social Networking :  7584125.0
Photo & Video :  4550647.0
Games :  42705967.0
Music :  3783551.0
Reference :  1348958.0
Health & Fitness :  1514371.0
Weather :  1463837.0
Utilities :  1513441.0
Travel :  1129752.0
Shopping :  2261254.0
News :  913665.0
Navigation :  516542.0
Lifestyle :  840774.0
Entertainment :  3563577.0
Food & Drink :  866682.0
Sports :  1587614.0
Book :  556619.0
Finance :  1132846.0
Education :  826470.0
Productivity :  1177591.0
Business :  127349.0
Catalogs :  16016.0
Medical :  3672.0


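The output above lists the total number of user ratings per genre (the loop computes `average` but prints `total`). By this measure, Games receive more ratings on the App Store than any other genre, although Games also account for about 58% of the free English apps, so part of that lead reflects the sheer number of Games apps.

Based on these totals, our company should consider building free Games for the App Store.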
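
Totals favor genres that contain many apps, so a per-app average is a useful cross-check. The sketch below computes group averages in a single pass. `average_by_group` is a hypothetical helper and the sample values are made up; no results for the real dataset are implied.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Return {group: mean of the numeric column} in one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    # Every group in `totals` has at least one row, so the division is safe.
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample: (genre, rating count)
sample = [["Games", "100"], ["Games", "300"], ["Medical", "50"]]
averages = average_by_group(sample, 0, 1)

for genre, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ":", avg)
```

With the notebook's data, the call would be `average_by_group(ios_free, 11, 5)`, where index 11 is `prime_genre` and index 5 is `rating_count_tot`.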

In [21]:
# Grouping by the 'Category' column (index 1), the same column compared in the loop below.
categories = freq_table(android_free, 1)

for category in categories:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            # Install counts look like '10,000+': strip both '+' and ',' before converting.
            installs = float(app[5].replace('+', '').replace(',', ''))
            total += installs
            len_category += 1
    average = total/len_category
    print(category, ": ", average)
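
Install counts in the Google Play data are strings such as '10,000+', so they need cleaning before they can be averaged. The following sketch separates the parsing from the grouping. `parse_installs` and `average_installs` are hypothetical helpers, and the sample reuses the category and install values of the first three rows printed earlier, trimmed to the columns that matter here.

```python
def parse_installs(text):
    """Convert an install string such as '10,000+' to a float."""
    return float(text.replace(",", "").replace("+", ""))

def average_installs(dataset, category_idx=1, installs_idx=5):
    """Return {category: average installs}, grouping on the rows themselves."""
    totals, counts = {}, {}
    for row in dataset:
        category = row[category_idx]
        totals[category] = totals.get(category, 0.0) + parse_installs(row[installs_idx])
        counts[category] = counts.get(category, 0) + 1
    # Groups are built from existing rows, so no count can be zero.
    return {c: totals[c] / counts[c] for c in totals}

# Only columns 1 (Category) and 5 (Installs) are filled in; the rest are placeholders.
sample = [
    ["", "ART_AND_DESIGN", "", "", "", "10,000+"],
    ["", "ART_AND_DESIGN", "", "", "", "500,000+"],
    ["", "ART_AND_DESIGN", "", "", "", "5,000,000+"],
]

result = average_installs(sample)
print(result)
```

Values like '10,000+' are lower bounds rather than exact counts, so averages computed this way are rough estimates.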