# Profitable App Profiles for the App Store and Google Play Markets

**Learning Goals**
- Exploring data
- Data cleaning

**Project Purpose**
- Determine kinds of apps that are likely to attract more users
- Business goal: create apps that attract the most users
- Project outcome: find app profiles that are successful in Google and Apple app stores

## Exploring Data

### Google Apps Dataset Columns
  1. **App** - Application Name
  2. **Category** - Category the app belongs to
  3. **Rating** - Overall user rating of the app
  4. **Reviews** - Number of user reviews for the app
  5. **Size** - Size of the app
  6. **Installs** - Number of user downloads/installs for the app
  7. **Type** - Paid or Free
  8. **Price** - Price of the app
  9. **Content Rating** - Age group the app is targeted at - Children / Mature 21+ / Adult
  10. **Genres** - An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
  11. **Last Updated** - Date when the app was last updated on Play Store
  12. **Current Ver** - Current version of the app available on Play Store 
  13. **Android Ver** - Min required Android version
  
  
  
### Apple Apps Dataset Columns
  1. **id** - App ID
  2. **track_name** - App name
  3. **size_bytes** - Size (in bytes)
  4. **currency** - Currency type
  5. **price** - Price amount
  6. **rating_count_tot** - User Rating counts (for all versions)
  7. **rating_count_ver** - User Rating counts (for current version)
  8. **user_rating** - Average User Rating value (for all versions)
  9. **user_rating_ver** - Average User Rating value (for current version)
  10. **ver** - Latest version code
  11. **cont_rating** - Content/ Age rating
  12. **prime_genre** - Primary Genre
  13. **sup_devices.num** - Number of supporting devices
  14. **ipadSc_urls.num** - Number of screenshots shown for display
  15. **lang.num** - Number of supported languages
  16. **vpp_lic** - VPP Device Based Licensing Enabled (0-1)

In [1]:
# Reads a CSV file into a list of rows (a list of lists)
import csv

def extract_data(filename):
    
    with open(filename, encoding="utf8") as file:
        return list(csv.reader(file))
    
# Prints the number of rows and columns, then the first `num` rows of the dataset
def inspect_data(data, num):
    print("Number of rows:", len(data))
    print("Number of columns:", len(data[0]))
    for d in data[0:num]:
        print(d)

In [2]:
google_dataset = extract_data("./data/googleplaystore.csv")
google_header = google_dataset[0]
google_apps = google_dataset[1:]

print(google_header)
print(len(google_apps))
inspect_data(google_dataset, 4)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
10841
Number of rows: 10842
Number of columns: 13
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


In [3]:
apple_dataset = extract_data("./data/AppleStore.csv")
apple_header = apple_dataset[0]
apple_apps = apple_dataset[1:]

print(apple_header)
print(len(apple_apps))
inspect_data(apple_dataset,4)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
7197
Number of rows: 7198
Number of columns: 16
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


## Data Cleaning

### Deleting Inaccurate data

In [4]:
# Removing entry 10472, which has a wrong rating and a missing column
# - see https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015

# Verifying that the row is incorrect/incomplete
print(len(google_apps[10472])) # Has only 12 columns
print(len(google_header))      # Each row should have 13 columns

12
13
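
Rather than relying on a known index, a short scan can flag every row whose length differs from the header. The sketch below is one possible way to do that; `find_malformed_rows` and the sample rows are hypothetical and used only for illustration.

```python
def find_malformed_rows(rows, expected_len):
    """Return (index, length) pairs for rows whose length differs from expected_len."""
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected_len]


# Hypothetical sample: a 3-column header and one short row.
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "4.5"],          # missing the Category value
    ["App C", "GAME", "3.9"],
]

malformed = find_malformed_rows(sample_rows, len(sample_header))
print(malformed)
```

In this notebook the equivalent call would be `find_malformed_rows(google_apps, len(google_header))`.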


In [5]:
# Deleting the data
print(len(google_apps))
del google_apps[10472]
print(len(google_apps))

10841
10840


### Removing Duplicates

In [6]:
# Counts the number of duplicate rows
# - Duplicate condition: same app name as an earlier row
def count_duplicates(data, name_index):
    seen = set()  # set membership checks are O(1), unlike a list
    duplicate_count = 0

    for d in data:
        name = d[name_index]
        if name in seen:
            duplicate_count += 1
        else:
            seen.add(name)

    return duplicate_count

In [7]:
google_duplicates_count = count_duplicates(google_apps, 0)
apple_duplicates_count = count_duplicates(apple_apps, 1)

print(google_duplicates_count) # 1181
print(apple_duplicates_count)  # 2

1181
2


In [8]:
# Inspecting Duplicates

# Extracts duplicates 
def extract_duplicates(data, name_index):
    uniques = set()
    duplicates_raw = {}
    
    for d in data:
        name = d[name_index]
        if name not in uniques:
            uniques.add(name)
            duplicates_raw[name] = []
        
        duplicates_raw[name].append(d)
            
    duplicates = {}
    
    for k,v in duplicates_raw.items():
        if len(v) > 1:
            duplicates[k] = v
            
    return duplicates

In [9]:
google_duplicates = extract_duplicates(google_apps, 0)

In [10]:
# Determine which duplicate to keep
#  - Inspect duplicates

instagram_duplicate = google_duplicates["Instagram"]
# Here we see that only the 4th column is different, which is the number of reviews
# The different numbers show data was collected at different times
# We can keep the data with the highest number of reviews, and remove the rest
for i in instagram_duplicate:
    print(i)
    print()

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']



In [11]:
# Apple duplicates
apple_duplicates = extract_duplicates(apple_apps, 1)
print(apple_duplicates)
    
# Apple shows 4 apps that share 2 names (each name appears twice);
#   However, these are not duplicates, as each has its own unique Apple ID

{'Mannequin Challenge': [['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1'], ['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']], 'VR Roller Coaster': [['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1'], ['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']]}


In [12]:
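# Scratch cell: compares two ways of counting duplicates on a small sample
# - count_duplicates_1 counts every row whose name occurs more than once
# - count_duplicates_2 counts only the extra rows after the first occurrence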

temp_data = []
for d in list(google_duplicates.values())[0:5]:
    for d2 in d:
        temp_data.append(d2[0:4])

temp_data.append(['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967'])
temp_data.append(['Coloring book lilly', 'ART_AND_DESIGN', '3.9', '967'])
temp_data.append(['Textgram - write on photos', 'ART_AND_DESIGN', '4.4', '295237'])
temp_data.append(['Wattpad 📖 Free Books', 'BOOKS_AND_REFERENCE', '4.6', '2915189'])
temp_data.append(['Mattpad 📖 Free Books', 'BOOKS_AND_REFERENCE', '4.6', '2915189'])
        
for d in temp_data:
    print(d)

def count_duplicates_1(data):
    count = {}
    
    for d in data:
        name = d[0]
        if name in count:
            count[name] += 1
        else:
            count[name] = 1
            
    dups = {}
    
    # Extract duplicates 
    for key,value in count.items():
        if value > 1:
            dups[key] = value
            
    return dups

def count_duplicates_2(data):
    duplicates = []
    uniques = []
    
    for d in data:
        name = d[0]
        if name in uniques:
            duplicates.append(name)
        else:
            uniques.append(name)
            
    return len(duplicates)



count_1 = count_duplicates_1(temp_data)
print(sum(count_1.values()))

count_2 = count_duplicates_2(temp_data)
print(count_2)

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967']
['Coloring book moana', 'FAMILY', '3.9', '974']
['Mcqueen Coloring pages', 'ART_AND_DESIGN', 'NaN', '61']
['Mcqueen Coloring pages', 'FAMILY', 'NaN', '65']
['UNICORN - Color By Number & Pixel Art Coloring', 'ART_AND_DESIGN', '4.7', '8145']
['UNICORN - Color By Number & Pixel Art Coloring', 'FAMILY', '4.7', '8264']
['Textgram - write on photos', 'ART_AND_DESIGN', '4.4', '295221']
['Textgram - write on photos', 'ART_AND_DESIGN', '4.4', '295237']
['Wattpad 📖 Free Books', 'BOOKS_AND_REFERENCE', '4.6', '2914724']
['Wattpad 📖 Free Books', 'BOOKS_AND_REFERENCE', '4.6', '2915189']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967']
['Coloring book lilly', 'ART_AND_DESIGN', '3.9', '967']
['Textgram - write on photos', 'ART_AND_DESIGN', '4.4', '295237']
['Wattpad 📖 Free Books', 'BOOKS_AND_REFERENCE', '4.6', '2915189']
['Mattpad 📖 Free Books', 'BOOKS_AND_REFERENCE', '4.6', '2915189']
13
8


In [13]:
# Remove duplicates
# - We will be removing duplicates from the Google app dataset
# - Once removed, should be left with 9659 rows

# For each app name, record the highest review count seen across all of its rows
reviews_max = {}
for app in google_apps:
    name = app[0]
    reviews = float(app[3])
    if name in reviews_max:
        if reviews_max[name] < reviews:
            reviews_max[name] = reviews
    else:
        reviews_max[name] = reviews
        
print(len(reviews_max)) # should be 9659 rows

9659


In [14]:
temp_google_apps = []
temp_google_already_added = [] # need this because there are duplicates with same number of reviews

for app in google_apps:
    name = app[0]
    reviews = float(app[3])
    if reviews_max[name] == reviews and name not in temp_google_already_added:
        temp_google_apps.append(app)
        temp_google_already_added.append(name)
        
print(len(temp_google_apps)) # 9659

9659


In [15]:
# replace google_apps with temp_google_apps
google_apps = temp_google_apps
len(google_apps)

9659
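
The two passes above can also be collapsed into one by storing the best row per name in a dictionary. This is a sketch under the same rule (keep the row with the most reviews, and the first one on ties); `dedupe_by_max_reviews` and the sample rows are illustrative and not part of the original notebook.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count.

    On ties the first row encountered is kept, because only a strictly
    greater review count replaces the stored row.
    """
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    # Dicts preserve insertion order (Python 3.7+), so apps come out in
    # order of first appearance.
    return list(best.values())


# Shortened sample rows: name, category, rating, reviews.
sample = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Sketch", "ART_AND_DESIGN", "4.5", "215644"],
]
deduped = dedupe_by_max_reviews(sample)
for row in deduped:
    print(row)
```

One difference from the two-pass version: the output is ordered by each name's first appearance rather than by the position of the retained row, which may matter if later cells depend on row order.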

### Removing Non-English Apps

- Characters commonly used in English text fall in the ASCII range, with Unicode code points from 0 to 127
  - We can use this fact to remove apps whose names contain characters with code points greater than 127
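
As a quick illustration of the code point rule, the built-in `ord` returns a character's Unicode code point. The characters in `samples` are arbitrary picks for this sketch.

```python
samples = ["a", "Z", "7", "™", "😜", "爱"]

for ch in samples:
    code_point = ord(ch)
    # Code points 0-127 are the ASCII range.
    print(repr(ch), code_point, "ASCII" if code_point <= 127 else "non-ASCII")

# Characters that a strict "code point > 127" rule would flag.
non_ascii = [ch for ch in samples if ord(ch) > 127]
```

Symbols such as `™` and emoji also fall outside the ASCII range, so a strict cutoff would reject some English app names as well.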

In [16]:
# Returns False if more than three characters in the string fall outside the ASCII range (code point > 127), True otherwise
def is_english(word):
    count = 0
    for letter in word:
        if ord(letter) > 127:
            count += 1

    if count > 3:
        return False

    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


In [17]:
# Use the previous function to remove non-English apps from both datasets

In [18]:
temp_google_apps = []
for app in google_apps:
    name = app[0]
    if is_english(name):
        temp_google_apps.append(app)
        
google_apps = temp_google_apps

temp_apple_apps = []
for app in apple_apps:
    name = app[1]
    if is_english(name):
        temp_apple_apps.append(app)
        
apple_apps = temp_apple_apps

inspect_data(google_apps, 3)
print()
inspect_data(apple_apps, 3)

Number of rows: 9614
Number of columns: 13
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']

Number of rows: 6183
Number of columns: 16
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579',

### Isolating Free Apps
- Since this project's goal is to build apps that are free to download, only free apps are needed from the datasets for analysis

In [19]:
# Excludes non-free apps from dataset

temp_google_apps = []
for app in google_apps:
    price = app[7]
    if price == '0':
        temp_google_apps.append(app)
        
google_apps = temp_google_apps

temp_apple_apps = []
for app in apple_apps:
    price = app[4]
    if price == '0.0':
        temp_apple_apps.append(app)
        
apple_apps = temp_apple_apps

inspect_data(google_apps, 3)
print()
inspect_data(apple_apps, 3)

Number of rows: 8864
Number of columns: 13
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']

Number of rows: 3222
Number of columns: 16
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579',

## Analysis

### Determine most common genres for each market
- Can be determined via frequency tables
- The following columns will be used to generate the frequency tables
  - `Genres` and `Category` (Google)
  - `prime_genre` (Apple)
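
A frequency table of this kind can be sketched with `collections.Counter` from the standard library. This is an alternative to a hand-written counting loop, not the notebook's own implementation; `freq_percentages` and the sample rows are hypothetical.

```python
from collections import Counter


def freq_percentages(rows, index):
    """Return {value: percentage of rows} for one column, most common first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}


# Hypothetical two-column sample: name, genre.
sample = [
    ["App A", "Games"],
    ["App B", "Games"],
    ["App C", "Education"],
    ["App D", "Games"],
]
table = freq_percentages(sample, 1)
for genre, pct in table.items():
    print(genre, ":", pct)
```

Because `most_common()` already orders values by count, no separate sorting step is needed before display.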

In [20]:
# Returns a frequency table as a dictionary mapping each value in a column (index) to its share of rows, as a percentage
def freq_table(data, index):
    table = {}
    total = 0
    
    for d in data:
        total += 1
        column = d[index]
        if column in table:
            table[column] += 1
        else:
            table[column] = 1
        
    table_percentages = {}
    
    for k,v in table.items():
        table_percentages[k] = (v/total) * 100
            
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [21]:
print("Apple Prime Genre Frequency")
display_table(apple_apps, 11) 
print()
print("Google Genres Frequency")
display_table(google_apps, 9) 
print()
print("Google Category Frequency")
display_table(google_apps, 1) 
print()

Apple Prime Genre Frequency
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665

Google Genres Frequency
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Spor

**Frequency Table Results**
- For Apple, the Games genre is by far the most common (about 58% of the free English apps)
- For Google, practical apps (Tools, Business, Productivity) and entertainment apps appear to be the most common

### Determine average number of installs for each app genre
- Apple: the dataset has no install counts, so the `rating_count_tot` column will be used as a proxy
- Google: `Installs` column will be used
- How:
    - Isolate apps of each genre
    - Get the total user ratings (Apple) or installs (Google) for the apps of that genre
    - Divide that total by the number of apps in that genre
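
The steps above can be sketched as a single pass that accumulates a running total and a count per genre. `average_by_group` and the sample values are hypothetical; the nested-loop approach gives the same averages.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column}, computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample: name, genre, rating count.
sample = [
    ["App A", "Navigation", "300"],
    ["App B", "Navigation", "100"],
    ["App C", "Games", "50"],
]
averages = average_by_group(sample, 1, 2)
print(averages)
```

A single pass avoids rescanning the whole dataset once per genre, which is a modest saving at this data size but keeps the logic in one place.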

In [22]:
print(google_apps[1])

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


In [25]:
# Apple: average number of user ratings per genre (prime_genre), used as a proxy for installs
apple_prime_genre_frequency = freq_table(apple_apps, 11) # Apple app genres and their frequencies (occurrences)
# Looping over unique genres
for genre in apple_prime_genre_frequency.keys():
    apple_total_ratings = 0
    apple_total_apps_per_genre = 0
    for app in apple_apps:
        app_genre = app[11]
        if app_genre == genre:
            ratings = float(app[5])
            apple_total_ratings += ratings
            apple_total_apps_per_genre += 1
    print(genre + ' - ' + str(apple_total_ratings/apple_total_apps_per_genre)) 


Social Networking - 71548.34905660378
Photo & Video - 28441.54375
Games - 22788.6696905016
Music - 57326.530303030304
Reference - 74942.11111111111
Health & Fitness - 23298.015384615384
Weather - 52279.892857142855
Utilities - 18684.456790123455
Travel - 28243.8
Shopping - 26919.690476190477
News - 21248.023255813954
Navigation - 86090.33333333333
Lifestyle - 16485.764705882353
Entertainment - 14029.830708661417
Food & Drink - 33333.92307692308
Sports - 23008.898550724636
Book - 39758.5
Finance - 31467.944444444445
Education - 7003.983050847458
Productivity - 21028.410714285714
Business - 7491.117647058823
Catalogs - 4004.0
Medical - 612.0


Here we see that navigation apps have the highest number of ratings on average. However, this is misleading because a few very popular apps, such as Waze and Google Maps, account for most of the ratings and skew the average.

In [26]:
for app in apple_apps:
    if app[11] == 'Navigation':
        print(app[1], ":", app[5]) # Prints name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
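
One way to see the skew is to compare the mean with the median, which is much less sensitive to a few very large values. The median is not used elsewhere in this notebook; it is introduced here only as an illustration, using the rating counts printed above.

```python
from statistics import mean, median

# Rating counts copied from the Navigation listing above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print("mean:  ", mean(navigation_ratings))
print("median:", median(navigation_ratings))
```

The mean is about 86,090 while the median is 8,196.5, which shows how strongly the two largest apps pull the average up.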


In [28]:
# Google: frequency table of the Installs column
display_table(google_apps, 5) # the Installs columns

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to a float. This means removing the commas and the plus characters; otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).
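
The conversion described here can be isolated in a small helper. `parse_installs` is a hypothetical name; the sketch only strips the comma and plus characters before calling `float`.

```python
def parse_installs(raw):
    """Convert an Installs string such as '1,000,000+' to a float lower bound."""
    return float(raw.replace(",", "").replace("+", ""))


# Sample values taken from the Installs frequency table.
for raw in ["0", "0+", "10,000+", "1,000,000,000+"]:
    print(raw, "->", parse_installs(raw))
```

Any other non-numeric text in the column would still raise a `ValueError`, which is useful here because it surfaces unexpected values instead of hiding them.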

In [30]:
categories_android = freq_table(google_apps, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in google_apps:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

On average, communication apps have the most installs: 38,456,119. This number is skewed heavily upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 million or 500 million installs:

In [32]:
for app in google_apps:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we removed all the communication apps that have 100 million or more installs, the average would drop roughly tenfold:

In [33]:
under_100_m = []

for app in google_apps:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386
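
The same calculation can be wrapped in a helper so that different categories and cutoffs are easy to compare. `average_installs_below` and the sample rows are hypothetical; the column positions default to those of the Google dataset (category at index 1, installs at index 5).

```python
def average_installs_below(rows, category, cap, category_index=1, installs_index=5):
    """Mean installs for one category, ignoring apps at or above `cap`.

    Returns None when no app in the category falls below the cap.
    """
    values = []
    for row in rows:
        if row[category_index] != category:
            continue
        installs = float(row[installs_index].replace(",", "").replace("+", ""))
        if installs < cap:
            values.append(installs)
    return sum(values) / len(values) if values else None


# Hypothetical rows; only the category and installs columns are filled in.
sample = [
    ["App A", "COMMUNICATION", "", "", "", "1,000,000,000+"],
    ["App B", "COMMUNICATION", "", "", "", "1,000,000+"],
    ["App C", "COMMUNICATION", "", "", "", "3,000,000+"],
    ["App D", "TOOLS", "", "", "", "500,000+"],
]
print(average_installs_below(sample, "COMMUNICATION", 100_000_000))
```

Returning `None` for an empty selection avoids a division by zero when a category has no apps under the chosen cap.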

We see the same pattern for the video players category, which is the runner-up with 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), and productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [34]:
for app in google_apps:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [36]:
for app in google_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [37]:
for app in google_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusions
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.