# Profitable app profiles for the App Store and Google Play markets

The aim of this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. Assuming that developers want to maximize ad revenue, we need to understand what kinds of free apps attract the most users.

Datasets used:

   + [Google Play](https://www.kaggle.com/lava18/google-play-store-apps)
   + [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

In [1]:
from csv import reader

#opening App Store data

opened_ios = open('datasets/AppleStore.csv')
read_ios = reader(opened_ios)
data_ios = list(read_ios)

header_ios = data_ios[0]
content_ios = data_ios[1:]


#opening Google Play data

opened_android = open('datasets/googleplaystore.csv')
read_android = reader(opened_android)
data_android = list(read_android)

header_android = data_android[0]
content_android = data_android[1:]
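
The loading steps above leave both files open. As a minimal sketch (the helper name `load_csv` and the `utf8` encoding are assumptions, not part of the original notebook), the same header/content split can be wrapped in a function that closes the file automatically:

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Read a CSV file and return (header, rows).

    `load_csv` is a hypothetical helper; the encoding is an assumption
    and may need adjusting for a different copy of the data.
    """
    # The `with` block closes the file even if reading fails.
    with open(path, encoding=encoding, newline="") as f:
        data = list(reader(f))
    return data[0], data[1:]

# Possible usage with the paths from the cell above:
# header_ios, content_ios = load_csv('datasets/AppleStore.csv')
# header_android, content_android = load_csv('datasets/googleplaystore.csv')
```

Setting the encoding explicitly keeps the behaviour the same across platforms whose default text encoding differs.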

## Exploring the data

Let's look at column names for both datasets:

In [2]:
header_ios

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [3]:
header_android

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

For clarity, let's look at what each column means:

**Google Play**

|    Column name |        Description                                               |
|----------------|------------------------------------------------------------------|
| App            | App name                                                 |
| Category       | Category the app belongs to                                      |
| Rating         | Overall user rating of the app                                   |
| Reviews        | Number of user reviews                                           |
| Size           | Size of the app                                                  |
| Installs       | Number of user downloads                                         |
| Type           | Paid or Free                                                     |
| Price          | Price of the app                                                 |
| Content Rating | Age group the app is targeted at - Children / Mature 21+ / Adult |
| Genres         | Multiple genres apart from the main category                     |
| Last Updated   | Date when the app was last updated                               |
| Current Ver    | Current version of the app                                       |
| Android Ver    | Min required Android version                                     |

**App Store**

|Column name       |Description                                      |
|------------------|-------------------------------------------------|
| id               | App ID                                          |
| track_name       | App Name                                        |
| size_bytes       | Size (in Bytes)                                 |
| currency         | Currency Type                                   |
| price            | Price amount                                    |
| rating_count_tot | User Rating counts (for all versions)           |
| rating_count_ver | User Rating counts (for current version)        |
| user_rating      | Average User Rating value (for all versions)    |
| user_rating_ver  | Average User Rating value (for current version) |
| ver              | Latest version code                             |
| cont_rating      | Content Rating                                  |
| prime_genre      | Primary Genre                                   |
| sup_devices.num  | Number of supporting devices                    |
| ipadSc_urls.num  | Number of screenshots shown for display         |
| lang.num         | Number of supported languages                   |
| vpp_lic          | Vpp Device Based Licensing Enabled              |



To make it easier to explore the two datasets, we'll use the function **explore_data()**, which prints a slice of rows in a readable way and shows the number of rows and columns.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=True):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [5]:
explore_data(content_ios, start=0, end=1)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7197
Number of columns: 16


In [6]:
explore_data(content_android, start=0, end=1)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


## Deleting wrong data

One of the discussions of the Google Play dataset reports an error in row 10472: the number of elements in the row does not match the number of column names (the row has 12 elements instead of 13).

In [7]:
print(content_android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


To make sure this is the only row with this problem, let's print every row whose length doesn't match the number of column names, along with its index:

In [8]:
for row in content_android:
    if len(row) != len(header_android):
        print(row)
        print(content_android.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


We can see that we need to delete only this one row.

In [9]:
del content_android[10472]
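
One caveat: running `del content_android[10472]` a second time would silently delete a valid row, and `list.index(row)` returns the first matching row, which may not be the intended one if identical rows exist. A minimal sketch of an index-based alternative, using toy data and hypothetical helper names (`find_malformed_rows`, `drop_rows`):

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

def drop_rows(rows, indices):
    """Return a new list without the rows at the given indices."""
    to_drop = set(indices)
    return [row for i, row in enumerate(rows) if i not in to_drop]

# Toy data standing in for header_android / content_android.
header = ['App', 'Category', 'Rating']
rows = [['A', 'TOOLS', '4.1'], ['B', '1.9'], ['C', 'GAME', '3.0']]

bad = find_malformed_rows(header, rows)
cleaned = drop_rows(rows, bad)
print(bad)      # [1]
print(cleaned)  # [['A', 'TOOLS', '4.1'], ['C', 'GAME', '3.0']]
```

Because `drop_rows` builds a new list, re-running the cell gives the same result instead of removing another row.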

## Dealing with duplicate entries

While exploring the Google Play dataset, we discovered that some apps have duplicate entries. Let's collect the names of unique and duplicate apps and count the duplicates:

In [10]:
duplicates = []
uniques = []

for row in content_android:
    name = row[0]
    if name in uniques:
        duplicates.append(name)
    else:
        uniques.append(name)

In [11]:
print('The number of duplicate apps: ', len(duplicates))

The number of duplicate apps:  1181


Let's see a few examples:

In [12]:
duplicates[:10]

['Quick PDF Scanner + OCR FREE',
 'Box',
 'Google My Business',
 'ZOOM Cloud Meetings',
 'join.me - Simple Meetings',
 'Box',
 'Zenefits',
 'Google Ads',
 'Google My Business',
 'Slack']

In [13]:
for app in content_android:
    name = app[0]
    if name == 'Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


As we can see above, there are three rows for the Slack app. The only value that differs is in the 4th column, the number of reviews. We can conclude that the rows represent the same app, with the data extracted at different times.

For each app, we'll keep the row with the highest number of reviews, since it is presumably the most recent one.

The function **max_reviews()** takes the app name and returns the highest number of reviews for it present in the Google Play dataset:

In [14]:
def max_reviews(app_name):
    
    reviews_list = []
    for row in content_android:
        if row[0] == app_name:
            reviews_list.append(int(row[3]))  
            
    return max(reviews_list)

In [15]:
max_reviews('Slack')

51510

Let's use the function to create a dictionary that maps each app name to its maximum number of reviews:

In [16]:
names = []
reviews_values = []

for row in content_android:
    app = row[0]
    names.append(app)
    max_review = max_reviews(app)
    reviews_values.append(max_review)
    
max_reviews_dict = dict(zip(names, reviews_values))


Now we can create a new dataset **clean_data_android** without the duplicate app rows:

In [17]:
clean_data_android = []
already_added = []

for row in content_android:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == max_reviews_dict[name] and name not in already_added:
        clean_data_android.append(row)
        already_added.append(name)

In [18]:
len(clean_data_android)

9659
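
The approach above calls **max_reviews()** once per row, so it rescans the whole dataset for every app. As a sketch under the same rule (keep the first row that reaches the highest review count), the deduplication can also be done in a single pass with a dictionary. The function name `dedupe_by_max_reviews` and the toy rows are illustrative assumptions:

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the most reviews."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_idx]
        n_reviews = int(row[reviews_idx])
        # Strict '>' means later ties never replace an earlier row.
        if name not in best or n_reviews > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows with the same column positions as the Google Play data.
toy_rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]
deduped = dedupe_by_max_reviews(toy_rows)
print(len(deduped))  # 2
print(deduped[0])    # ['Slack', 'BUSINESS', '4.4', '51510']
```

The kept rows are the same as in the two-step version, but the output is ordered by the first appearance of each app name, so the row order can differ.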

## Removing Non-English Apps

The names of some of the apps suggest they are not directed toward an English-speaking audience. Let's see a couple of examples from both data sets:

In [19]:
print(content_ios[813][1], '\n')
print(content_ios[6731][1], '\n')
print(clean_data_android[4412][0], '\n')
print(clean_data_android[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播 

【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜 

中国語 AQリスニング 

لعبة تقدر تربح DZ


We're not interested in keeping these kinds of apps, so we'll remove them. We can remove each app whose name contains symbols that are not commonly used in English text, based on the ASCII standard.

The function **is_in_english()** uses the built-in function ord(), which returns the Unicode code point of a character, to determine whether most symbols of a string are common in English text (that is, fall in the ASCII range of 0-127). A name with three or more symbols outside that range gives a False outcome, so names with one or two emoji or special symbols still pass:

In [20]:
def is_in_english(a_string):
    non_eng = []
    for symbol in a_string:
        if ord(symbol) > 127:
            non_eng.append(symbol)
    if len(non_eng) >= 3:
        return False
    
    return True


In [21]:
is_in_english('Instagram')

True

In [22]:
is_in_english('Instachat 😜')

True

In [23]:
is_in_english('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
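
To make the threshold easier to follow, here is a small sketch that prints a few code points and counts the non-ASCII symbols in the names tested above. The helper `count_non_ascii` is hypothetical and only mirrors the counting step inside **is_in_english()**:

```python
def count_non_ascii(text):
    """Count the characters whose code point is outside the ASCII range."""
    return sum(1 for symbol in text if ord(symbol) > 127)

print(ord('a'))   # 97
print(ord('爱'))  # 29233

for name in ['Instagram', 'Instachat 😜', '爱奇艺PPS -《欢乐颂2》电视剧热播']:
    # A name is rejected once this count reaches 3.
    print(name, '->', count_non_ascii(name))
```

'Instachat 😜' contains a single non-ASCII symbol, which is why it is still treated as English.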

Using our new function **is_in_english()**, we'll make two new datasets, one for Google Play and one for the App Store, that include English apps only:

In [24]:
eng_android_data = []

for row in clean_data_android:
    name = row[0]
    if is_in_english(name):
        eng_android_data.append(row)

In [25]:
len(eng_android_data)

9597

In [26]:
eng_ios_data = []

for row in content_ios:
    name = row[1]
    if is_in_english(name):
        eng_ios_data.append(row)

In [27]:
len(eng_ios_data)

6155

## Removing non-free apps

As we've established in the introduction, we're only interested in free apps. Let's remove the rows with paid apps from both datasets.

In [28]:
free_android_apps = []

for row in eng_android_data:
    price = row[7]
    if price == '0':
        free_android_apps.append(row)

In [29]:
len(free_android_apps)

8848

In [30]:
free_ios_apps = []

for row in eng_ios_data:
    price = row[4]
    if price == '0.0':
        free_ios_apps.append(row)

In [31]:
len(free_ios_apps)

3203

We're left with 8848 Android apps and 3203 iOS apps, which should be enough for the analysis.



## Finding most popular apps by amount

Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

Let's begin the analysis by getting a sense of what the most common genres are for each market. For this, we'll build frequency tables for a few columns in both datasets.

The function **freq_table()** generates frequency tables in percentages for a chosen column in a dataset:

In [32]:
# in percentages

def freq_table(dataset, index):
    frequency_table = {}
    
    for row in dataset:
        data_point = row[index]
        if data_point in frequency_table:
            frequency_table[data_point] += 1
        else:
            frequency_table[data_point] = 1
            
    for element in frequency_table:
        frequency_table[element] /= len(dataset)
        frequency_table[element] *= 100
        frequency_table[element] = round(frequency_table[element], 2)
        
        
    return frequency_table

The function **display_table()** presents the frequency table in a descending order of percentages:

In [33]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
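
For comparison, the two functions above can be condensed with `collections.Counter` from the standard library. This is only a sketch on toy data, and the name `sorted_freq_table` is an assumption:

```python
from collections import Counter

def sorted_freq_table(dataset, index):
    """Return (value, percentage) pairs sorted by descending percentage."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    table = [(value, round(count / total * 100, 2))
             for value, count in counts.items()]
    return sorted(table, key=lambda pair: pair[1], reverse=True)

# Toy dataset: the category sits at index 1.
toy = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
result = sorted_freq_table(toy, 1)
for value, percentage in result:
    print(value, ':', percentage)
# GAME : 75.0
# TOOLS : 25.0
```

Unlike **display_table()**, which breaks ties by sorting the keys in reverse order, this version leaves tied values in the order they were first seen.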

The **Category** frequency table for Google Play:

In [34]:
display_table(free_android_apps, 1)

FAMILY : 18.94
GAME : 9.7
TOOLS : 8.45
BUSINESS : 4.6
PRODUCTIVITY : 3.9
LIFESTYLE : 3.89
FINANCE : 3.71
MEDICAL : 3.54
SPORTS : 3.39
PERSONALIZATION : 3.32
COMMUNICATION : 3.23
HEALTH_AND_FITNESS : 3.09
PHOTOGRAPHY : 2.95
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.67
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.8
MAPS_AND_NAVIGATION : 1.39
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.8
WEATHER : 0.79
EVENTS : 0.71
PARENTING : 0.66
ART_AND_DESIGN : 0.64
COMICS : 0.61
BEAUTY : 0.6


As we can see above, the most common category in Google Play among free English apps is Family, followed by Game and Tools. Since it is not obvious what exactly "Family" means in this context, let's take a look at the actual website:

![family](images/family.png)

We can see that "Family" actually means "Games for children", so it is safe to say that gaming apps are the most represented in Google Play.

Another relevant column to explore is Genres, which shows more specific genres for apps than Category does.

The **Genres** frequency table for Google Play:

In [35]:
display_table(free_android_apps, 9)

Tools : 8.44
Entertainment : 6.08
Education : 5.36
Business : 4.6
Productivity : 3.9
Lifestyle : 3.88
Finance : 3.71
Medical : 3.54
Sports : 3.46
Personalization : 3.32
Communication : 3.23
Action : 3.1
Health & Fitness : 3.09
Photography : 2.95
News & Magazines : 2.8
Social : 2.67
Travel & Local : 2.33
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.05
Dating : 1.86
Arcade : 1.84
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.39
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.92
House & Home : 0.8
Weather : 0.79
Events : 0.71
Adventure : 0.67
Comics : 0.6
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Trivia : 0.42
Casino : 0.42
Educational;Education : 0.4
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;Br

This frequency table presents a completely different picture of the most common genres. It shows us that practical apps actually play a larger role than we discovered earlier. The reason is probably that games are split into many highly specific sub-genres, unlike any other category.
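
As a rough check of this explanation, we can add up the shares of the genres that look like game sub-genres. The percentages below are copied from the table above; which genres count as games is an assumption made for this sketch (ambiguous ones such as Sports, Music and Educational are left out):

```python
# Shares (in %) copied from the Genres frequency table above.
# Treating these genres as game sub-genres is an assumption.
game_genre_share = {
    'Action': 3.1, 'Simulation': 2.05, 'Arcade': 1.84, 'Casual': 1.76,
    'Puzzle': 1.13, 'Racing': 0.99, 'Role Playing': 0.94, 'Strategy': 0.92,
    'Adventure': 0.67, 'Card': 0.45, 'Trivia': 0.42, 'Casino': 0.42,
    'Board': 0.38, 'Word': 0.26,
}
tools_share = 8.44  # the top entry of the same table

games_share = round(sum(game_genre_share.values()), 2)
print('Game-like genres combined:', games_share)
print('Tools:', tools_share)
```

Under this grouping, the game-like genres together come to roughly 15%, well above Tools, which is consistent with the idea that the fine-grained sub-genres hide the overall weight of games.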

The **prime_genre** frequency table for App Store:

In [36]:
display_table(free_ios_apps, 11)

Games : 58.26
Entertainment : 7.84
Photo & Video : 5.0
Education : 3.68
Social Networking : 3.31
Shopping : 2.59
Utilities : 2.47
Sports : 2.15
Music : 2.06
Health & Fitness : 2.03
Productivity : 1.75
Lifestyle : 1.56
News : 1.34
Travel : 1.25
Finance : 1.09
Weather : 0.87
Food & Drink : 0.81
Reference : 0.53
Business : 0.53
Book : 0.37
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


We can see that among the free English apps, more than half are Games. The next two genres are Entertainment and Photo & Video.
The clear impression is that the App Store is dominated by apps that are designed for fun. Here, practical tools don't make the top of the list at all.

Up to this point, we discovered that the App Store is dominated by gaming and entertainment, while Google Play shows a more balanced landscape of both practical and for-fun apps.

However, a large number of apps in the catalog **does not necessarily mean** they're the most popular among users.

Let's actually look through the most popular genres among *users*.

## Finding most popular apps by user reviews and downloads

The App Store dataset does not provide the number of downloads. As a workaround, we can compare the number of user ratings (the **rating_count_tot** column).

Let's save a frequency table for the **prime_genre** column of the App Store dataset:

In [37]:
freq_table_genres_ios = freq_table(free_ios_apps, 11)

Now we can use it as a list of unique genres to compute the average number of user ratings per app for each genre:

In [38]:
genres = []
aver_ratings = []

for genre in freq_table_genres_ios:
    total = 0
    len_genre = 0
    
    for row in free_ios_apps:
        genre_app = row[11]
        num_reviews = float(row[5])
        if genre_app == genre:
            total += num_reviews
            len_genre += 1
            
    aver_ratings_amount = total / len_genre
    
    genres.append(genre)
    aver_ratings.append(aver_ratings_amount)
    
    
    
    print(genre)
    print(aver_ratings_amount)
            
            

Social Networking
71548.34905660378
Photo & Video
28441.54375
Games
22886.36709539121
Music
57326.530303030304
Reference
79350.4705882353
Health & Fitness
23298.015384615384
Weather
52279.892857142855
Utilities
19156.493670886077
Travel
28243.8
Shopping
27230.734939759037
News
21248.023255813954
Navigation
86090.33333333333
Lifestyle
16815.48
Entertainment
14195.358565737051
Food & Drink
33333.92307692308
Sports
23008.898550724636
Book
46384.916666666664
Finance
32367.02857142857
Education
7003.983050847458
Productivity
21028.410714285714
Business
7491.117647058823
Catalogs
4004.0
Medical
612.0


To draw any conclusions, we need to sort this data in descending order. To do that, we'll make a dictionary and use a slightly modified version of the **display_table()** function from earlier:

In [39]:
aver_ratings_dict = dict(zip(genres, aver_ratings))

In [40]:
def display_dict(table):

    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

The sorted data:

In [41]:
display_dict(aver_ratings_dict)

Navigation : 86090.33333333333
Reference : 79350.4705882353
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 46384.916666666664
Food & Drink : 33333.92307692308
Finance : 32367.02857142857
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 27230.734939759037
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22886.36709539121
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 19156.493670886077
Lifestyle : 16815.48
Entertainment : 14195.358565737051
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


As we can see, the genres with the most user ratings per app in the App Store are actually Navigation, Reference and Social Networking. Games appear only in the lower half of the list.
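
The per-genre averages above are computed by looping over the whole dataset once for every genre. A minimal single-pass sketch of the same calculation, with a hypothetical helper `average_by_group` and toy rows:

```python
from collections import defaultdict

def average_by_group(rows, group_idx, value_idx):
    """Return {group: average of the value column} in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows: genre at index 0, number of ratings at index 1.
toy_rows = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
averages = average_by_group(toy_rows, 0, 1)
print(averages)  # {'Games': 200.0, 'Weather': 50.0}
```

With the real data, a call such as `average_by_group(free_ios_apps, 11, 5)` should produce the same dictionary as `aver_ratings_dict`, up to floating-point rounding.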

The Google Play dataset does provide the number of installs. The data itself is not precise, though:

In [42]:
display_table(free_android_apps, 5) # Installs

1,000,000+ : 15.75
100,000+ : 11.54
10,000,000+ : 10.57
10,000+ : 10.19
1,000+ : 8.4
100+ : 6.93
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.49
10+ : 3.54
500+ : 3.24
50,000,000+ : 2.28
100,000,000+ : 2.14
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
0 : 0.01


To deal with this, we're going to assume that 100+ means exactly 100 installs rather than a whole range of numbers, because we don't need that much precision for this analysis. This means the data needs some formatting: the commas and plus signs must be removed before the values are converted to numbers.

Let's save a frequency table for the **Category** column of the Google Play dataset:
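
The formatting step can be sketched as a small helper (the name `parse_installs` is an assumption) that turns the strings from the Installs column into integers:

```python
def parse_installs(text):
    """Convert an Installs value such as '1,000,000+' to an integer.

    The '+' is simply dropped, so '100+' is treated as exactly 100.
    """
    return int(text.replace(',', '').replace('+', ''))

for value in ['1,000,000+', '500+', '0+', '0']:
    print(value, '->', parse_installs(value))
# 1,000,000+ -> 1000000
# 500+ -> 500
# 0+ -> 0
# 0 -> 0
```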

In [43]:
freq_table_cats_android = freq_table(free_android_apps, 1)

Now we can use it as a list of unique categories to compute the average number of installs per app for each category. Note that the install values are cleaned and converted to numbers while counting:

In [44]:
categories = []
aver_installs = []

for category in freq_table_cats_android:
    total = 0
    len_category = 0
    
    for row in free_android_apps:
        installs = row[5]
        category_app = row[1]
        if category_app == category:
            
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            installs = float(installs)

            total += installs
            len_category += 1
        
    aver_installs_amount = total / len_category
    
    categories.append(category)
    aver_installs.append(aver_installs_amount)
    


In [45]:
aver_installs_dict = dict(zip(categories, aver_installs))

The sorted data:

In [46]:
display_dict(aver_installs_dict)

COMMUNICATION : 38590581.08741259
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15544014.51048951
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10830251.970588235
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8814199.78835979
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5145550.285714285
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4049274.6341463416
FAMILY : 3695641.8198090694
SPORTS : 3650602.276666667
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1446158.2238372094
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1360598.042253521
DATING : 854028.8303030303
COMICS : 832613.8888888889
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 513151.886

The most installed app categories on Android are messengers, video players and social media platforms. These niches are dominated by large corporations, such as Google (YouTube) or Facebook, which are extremely hard to compete against.

The Games category is highly popular in both Google Play and the App Store, and it is much more feasible to develop a marketable gaming app than a new video hosting tool or a social media platform. Although the market is a bit saturated, a great product has the potential to be profitable on both platforms.

## Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that the Games category is the safest choice for generating as much income as possible from Android and iOS applications. The application really needs to stand out, though. Perhaps this can be achieved with an unexpected and interesting new game mechanic. Another approach is to combine fun with practicality by gamifying mundane but useful apps, since the Tools category is extremely popular as well.