# Got Profit? Examining the profiles of the best apps on the App Store and Google Play Store

The purpose of this project is to examine free apps on the Apple App Store and Google Play Store to help developers understand what kinds of apps are likely to attract more users.  This is crucial since free apps rely only on in-app advertisements for revenue, so we want to focus on maximizing appeal for these types of apps.

Personal goals of this project include using Python to build functions that open and read data, clean up incorrect data, and extract insights from raw data relevant to the project's objective.

## Open and Retrieve Datasets

The ```get_data``` function reads a CSV file and returns its contents as a list of lists (one inner list per row), which is much easier to examine (human-wise and code-wise) than raw text. The ```explore_data``` function prints any slice of a dataset, given the dataset and the start and end indices of the slice; it can optionally print the number of rows and columns as well.

In [1]:
from csv import reader

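def get_data(data_file):
    # The with block closes the file once it has been read.
    # encoding='utf8' is an assumption; app names include non-ASCII text.
    with open(data_file, encoding='utf8') as file_open:
        file_read = reader(file_open)
        file_data = list(file_read) # a list of lists
    return file_data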

apple_data = get_data('AppleStore.csv')
google_data= get_data('googleplaystore.csv')

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
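
To see the list-of-lists shape without the CSV files on disk, here is a minimal sketch that feeds ```csv.reader``` an in-memory string. The sample text and the names ```sample_text``` and ```sample_rows``` are made up for illustration.

```python
from csv import reader
from io import StringIO

# Hypothetical two-line CSV standing in for a real data file.
sample_text = "App,Rating\nExample App,4.1\n"

# reader() yields one list of strings per CSV line.
sample_rows = list(reader(StringIO(sample_text)))
print(sample_rows)
```

Every value comes back as a string, which is why later cells convert columns with ```float``` before doing arithmetic.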



Let's look at a couple of rows from each dataset and see which ones may be of interest for the purpose of this project.

Here's info from the App Store:

In [2]:
explore_data(apple_data, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


And here's info from the Play Store:

In [3]:
explore_data(google_data, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


The App Store data contains about 7,200 rows and 16 columns, while the data from the Play Store contains about 10,800 rows and 13 columns (both counts include the header row).

At a glance, the most relevant columns for the App Store seem to be ```track_name```, ```size_bytes```, ```user_rating```, ```prime_genre```, ```rating_count_tot```, and ```cont_rating```.

For the Play Store, the most relevant columns seem to be ```App```, ```Category```, ```Rating```, ```Size```, ```Price```, ```Installs```, and ```Genres```.

The column names in the Play Store dataset are fairly self-explanatory, but those in the App Store dataset are not.  For more info on its columns, here's where you can find the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

## Finding and Removing Incorrect Data

This [link](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) has a discussion that flagged an error in the Play Store dataset: row 10473 is malformed.  As an aside, the author of the post says 10472, but that is the index once the header row is removed.  The header row is not removed here.

Here's a look:

In [4]:
explore_data(google_data, 10473, 10474)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The row has only 12 fields instead of 13.  The ```Category``` value is missing, so every later value sits one column to the left: ```1.9``` is in the ```Category``` position and ```19``` is in the ```Rating``` position.  The empty string right before ```February 11, 2018``` is an empty ```Genres``` value.  This row will be removed.

In [5]:
del google_data[10473] # don't execute more than once
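
The bad row printed above has fewer fields than the header, so one way to catch rows like it without relying on a forum post is to compare each row's length against the header's. This is a minimal sketch on made-up data; ```find_malformed_rows``` and ```sample``` are hypothetical names, not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Hypothetical miniature dataset; the last row is one field short.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.2'],
    ['Shifted App', '1.9'],
]
print(find_malformed_rows(sample))
```

A length check only catches rows with missing or extra fields; a row with the right number of fields but a wrong value would still need a separate check.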

## Finding Duplicate Data

There's another problem in the Play Store dataset: some apps appear more than once.  We can count how many duplicate entries there are.

The loop below adds each app name to ```unique_apps``` the first time it is encountered.  If the name is encountered again while iterating over ```google_data```, it is added to ```duplicate_apps``` instead.


In [6]:
duplicate_apps = []
unique_apps = []

for app in google_data[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print(len(duplicate_apps))
print(duplicate_apps[:8])    

1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads']
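
As an aside, the same count can be obtained with ```collections.Counter``` from the standard library, which avoids repeatedly searching a list. The sketch below uses a made-up list of names (```sample_names```) rather than the real dataset.

```python
from collections import Counter

# Hypothetical app names; 'Box' appears three times, 'Zenefits' twice.
sample_names = ['Box', 'Zenefits', 'Box', 'Slack', 'Box', 'Zenefits']

name_counts = Counter(sample_names)

# Every occurrence after the first counts as a duplicate entry.
n_duplicates = sum(count - 1 for count in name_counts.values())
repeated = {name: count for name, count in name_counts.items() if count > 1}

print(n_duplicates)
print(repeated)
```

On this sample, ```n_duplicates``` is 3, which is what the loop above would report for the same names.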


Let's look at one example, the app *Zenefits*.  It appears twice, and both entries are identical.

In [7]:
for app in google_data[1:]:
    name = app[0]
    if name == 'Zenefits':
        print(app)

['Zenefits', 'BUSINESS', '4.2', '296', '14M', '50,000+', 'Free', '0', 'Everyone', 'Business', 'June 15, 2018', '3.2.1', '4.1 and up']
['Zenefits', 'BUSINESS', '4.2', '296', '14M', '50,000+', 'Free', '0', 'Everyone', 'Business', 'June 15, 2018', '3.2.1', '4.1 and up']


## Removing Duplicate Data

How should duplicate data be removed?  We can specify a criterion, such as keeping only the entry with the greatest number of reviews.  Alternatively, we can keep the most recent version.  Either is fine, but we'll use the first criterion.

The code below will search the Play Store data to find the greatest number of reviews for each app, storing the results in a dictionary.

In [8]:
reviews_max = {}

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

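Now we'll remove the duplicate rows.  We loop over the Play Store dataset and keep a row only if its number of reviews matches the maximum stored for that app in the ```reviews_max``` dictionary defined previously.  The ```already_added``` list ensures that only one row is kept when several duplicates share the same maximum (as with *Zenefits*).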

In [9]:
android_clean = [] # the clean data set
already_added = [] # the app names

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
len(android_clean)

9659
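
The two passes above can also be folded into one by storing the best row per app name in a dictionary. This is only a sketch on made-up rows; ```dedupe_by_reviews``` and its parameters are hypothetical names, and the column positions are assumptions matching the Play Store layout (name at index 0, reviews at index 3).

```python
def dedupe_by_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        # Replace the stored row only when this one has strictly more reviews.
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Hypothetical rows: two 'Box' entries with different counts, two identical 'Zenefits'.
sample = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Box', 'BUSINESS', '4.2', '160'],
    ['Zenefits', 'BUSINESS', '4.2', '296'],
    ['Zenefits', 'BUSINESS', '4.2', '296'],
]
print(dedupe_by_reviews(sample))
```

One difference from the two-pass version: the result here is ordered by each name's first appearance, so the row order may differ even though the same rows are kept.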

## Removing Non-English Apps

Some apps in both datasets have non-English names, but we are only interested in English apps because our developers work only in English.

This means we'll have to filter out the non-English apps from the datasets.  

One way to accomplish this is to examine the numeric code of each character in an app name.  The characters commonly used in English text (the English alphabet, digits 0-9, punctuation marks, and other symbols such as +, *, /) all belong to ASCII.

The ```ord``` built-in Python function returns a character's Unicode code point, which is the same as its ASCII code for ASCII characters.

In [10]:
print(ord('r'))

114


The ASCII equivalents for characters used in English text are all in the range of 0 to 127.

The first step is to define a function that iterates over an app name and checks whether any character falls outside the range 0-127 (that is, has a code of 128 or above).

In [11]:
def is_English(name):
    for char in name:
        if ord(char) > 127:
            return False
    return True

Let's check it out.

In [12]:
print(is_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_English('Instagram'))

False
True


The function seems OK, except that it flags certain English apps as non-English. For example:

In [13]:
print(is_English('Docs To Go™ Free Office Suite'))
print(is_English('Instachat 😜'))

False
False


As it happens, ™ and 😜 have code points above 127, so they fall outside ASCII.

In its current form, the function would discard many English apps if we used it on our datasets. To amend this, we'll only remove an app if its name has more than three characters with codes outside the range 0-127.

We'll revise the ```is_English``` function to reflect that condition.

In [14]:
def is_English(name):
    ascii_over_127 = []
    for char in name:
        if ord(char) > 127:
            ascii_over_127.append(char)

    if len(ascii_over_127) > 3:
        return False
    return True

Let's verify it.

In [15]:
print(is_English('Instachat 😜'))
print(is_English('Docs To Go™ Free Office Suite'))

True
True
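
The same rule can be written more compactly by counting the out-of-range characters directly. This is a sketch, not a replacement for the function above; ```count_non_ascii```, ```is_english_sketch```, and the ```max_non_ascii``` parameter are hypothetical names introduced here.

```python
def count_non_ascii(name):
    """Count characters whose code point is above 127."""
    # Each True counts as 1 when summed.
    return sum(ord(char) > 127 for char in name)

def is_english_sketch(name, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(name) <= max_non_ascii

print(count_non_ascii('Docs To Go™ Free Office Suite'))
print(is_english_sketch('Instachat 😜'))
print(is_english_sketch('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

Making the threshold a parameter is a design choice that lets you see how many apps survive under a stricter or looser limit. The heuristic remains imperfect either way: a short non-English name with three or fewer non-ASCII characters still passes.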


Now that ```is_English``` satisfies our condition, we'll clean both datasets accordingly. 

In [16]:
android_final = []
apple_final = []

for app in android_clean:
    name = app[0]
    if is_English(name):
        android_final.append(app)

for app in apple_data[1:]:
    name = app[1]
    if is_English(name):
        apple_final.append(app)

We can now see how many rows are left in the revised datasets.

In [17]:
explore_data(android_final, 0,2, True)
print("\n")
explore_data(apple_final, 0,2, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 6183
Number of columns: 16


## Retrieving Free Apps

At the beginning of the project, we mentioned that only free apps were under consideration.  This is what we now filter for.

In [18]:
android_free = []
apple_free = []

for app in apple_final:
    price = float(app[4])
    if price == 0.0:
        apple_free.append(app)

for app in android_final:
    price = float(app[7].replace('$',''))
    if price == 0:
        android_free.append(app)
        
print(len(android_free))
print(len(apple_free))

print(android_free[0:2])

8864
3222
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]


We go from 9614 to 8864 Play Store apps, and from 6183 to 3222 App Store apps.

## Most Common Apps by Genre

### Part 1

To maximize revenue (remember these apps are free and revenue derives from advertising), an app needs to be attractive to a wide demographic and available on both the Google and Apple app stores.  For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

To minimize risk and overhead, our validation strategy for an app idea consists of three steps:

1) Build a minimal Android version of the app, and add it to Google Play.

2) If the app has a good response from users, we then develop it further.

3) If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

We can begin our analysis by finding the most common genres for both app stores.

We'll build a frequency table for the ```prime_genre``` column of the App Store data set, and the ```Genres``` and ```Category``` columns of the Google Play data set.


### Part 2

Let's create a function that generates a frequency table, expressed as percentages, for any given column.

In [19]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        output = row[index]
        if output in table:
            table[output] +=1
        else:
            table[output] = 1
        total += 1

    for key in table:
        table[key] = round((table[key] / total) * 100, 3)
    return table    

print(freq_table(apple_free,-5))

{'Social Networking': 3.29, 'Photo & Video': 4.966, 'Games': 58.163, 'Music': 2.048, 'Reference': 0.559, 'Health & Fitness': 2.017, 'Weather': 0.869, 'Utilities': 2.514, 'Travel': 1.241, 'Shopping': 2.607, 'News': 1.335, 'Navigation': 0.186, 'Lifestyle': 1.583, 'Entertainment': 7.883, 'Food & Drink': 0.807, 'Sports': 2.142, 'Book': 0.435, 'Finance': 1.117, 'Education': 3.662, 'Productivity': 1.738, 'Business': 0.528, 'Catalogs': 0.124, 'Medical': 0.186}


The ```display_table``` function prints the table in descending order of percentage, which makes it more human-readable.

In [20]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

These two functions can be used to display a frequency table for the columns ```prime_genre```, ```Genres```, and ```Category```.

In [21]:
display_table(apple_free, 11) # prime_genre

Games : 58.163
Entertainment : 7.883
Photo & Video : 4.966
Education : 3.662
Social Networking : 3.29
Shopping : 2.607
Utilities : 2.514
Sports : 2.142
Music : 2.048
Health & Fitness : 2.017
Productivity : 1.738
Lifestyle : 1.583
News : 1.335
Travel : 1.241
Finance : 1.117
Weather : 0.869
Food & Drink : 0.807
Reference : 0.559
Business : 0.528
Book : 0.435
Navigation : 0.186
Medical : 0.186
Catalogs : 0.124
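
As an aside, ```collections.Counter``` from the standard library can build a comparable percentage table, already sorted by frequency, in a few lines. The sketch below runs on made-up rows; ```freq_table_sketch``` and ```sample``` are hypothetical names.

```python
from collections import Counter

def freq_table_sketch(dataset, index):
    """Return {value: percentage}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending order of count.
    return {key: round(count / total * 100, 3) for key, count in counts.most_common()}

# Hypothetical rows with the genre in column 1.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_sketch(sample, 1))
```

Because dictionaries preserve insertion order in current Python versions, iterating over the result gives the values from most to least common.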


In [22]:
display_table(android_free, 9)

Tools : 8.45
Entertainment : 6.069
Education : 5.347
Business : 4.592
Productivity : 3.892
Lifestyle : 3.892
Finance : 3.7
Medical : 3.531
Sports : 3.463
Personalization : 3.317
Communication : 3.238
Action : 3.102
Health & Fitness : 3.08
Photography : 2.944
News & Magazines : 2.798
Social : 2.662
Travel & Local : 2.324
Shopping : 2.245
Books & Reference : 2.144
Simulation : 2.042
Dating : 1.861
Arcade : 1.85
Video Players & Editors : 1.771
Casual : 1.76
Maps & Navigation : 1.399
Food & Drink : 1.241
Puzzle : 1.128
Racing : 0.993
Role Playing : 0.936
Libraries & Demo : 0.936
Auto & Vehicles : 0.925
Strategy : 0.914
House & Home : 0.824
Weather : 0.801
Events : 0.711
Adventure : 0.677
Comics : 0.609
Beauty : 0.598
Art & Design : 0.598
Parenting : 0.496
Card : 0.451
Casino : 0.429
Trivia : 0.417
Educational;Education : 0.395
Board : 0.384
Educational : 0.372
Education;Education : 0.338
Word : 0.259
Casual;Pretend Play : 0.237
Music : 0.203
Racing;Action & Adventure : 0.169
Puzzle;Brain G

In [48]:
display_table(android_free, 1) # Category

FAMILY : 18.908
GAME : 9.725
TOOLS : 8.461
BUSINESS : 4.592
LIFESTYLE : 3.903
PRODUCTIVITY : 3.892
FINANCE : 3.7
MEDICAL : 3.531
SPORTS : 3.396
PERSONALIZATION : 3.317
COMMUNICATION : 3.238
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.944
NEWS_AND_MAGAZINES : 2.798
SOCIAL : 2.662
TRAVEL_AND_LOCAL : 2.335
SHOPPING : 2.245
BOOKS_AND_REFERENCE : 2.144
DATING : 1.861
VIDEO_PLAYERS : 1.794
MAPS_AND_NAVIGATION : 1.399
FOOD_AND_DRINK : 1.241
EDUCATION : 1.162
ENTERTAINMENT : 0.959
LIBRARIES_AND_DEMO : 0.936
AUTO_AND_VEHICLES : 0.925
HOUSE_AND_HOME : 0.824
WEATHER : 0.801
EVENTS : 0.711
PARENTING : 0.654
ART_AND_DESIGN : 0.643
COMICS : 0.62
BEAUTY : 0.598


### Part 3: Analysis

Games make up the majority of the free English apps on the App Store (> 58%), but that doesn't mean Games is the most downloaded genre. 

For the Play Store, we'll stick with the ```Category``` column since the ```Genres``` column is too fine-grained.  Interestingly, the most common category is Family rather than Games.  Again, this doesn't necessarily imply that Family apps are the most downloaded apps.  

Determining the most downloaded app category for each app store is next.

## Most Popular Apps on the App Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the ```Installs``` column, but this information is missing from the App Store dataset. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the ```rating_count_tot``` column.

We calculate the average number of user ratings for each app category below.

In [23]:
apple_genre_freq = freq_table(apple_free, 11)

apple_popularity = {}
for genre in apple_genre_freq:
    total = 0 # stores sum of the number of ratings for each genre
    len_genre = 0 # stores number of apps specific to each genre
    
    for app in apple_free:
        genre_app = app[11] # prime_genre column (same as app[-5])
        if genre_app == genre:
            ratings = float(app[5])
            total += ratings
            len_genre += 1

    avg_user_ratings = total/len_genre
    apple_popularity[genre] = avg_user_ratings

sorted(apple_popularity.items(), key = lambda k: k[1], reverse = True)

[('Navigation', 86090.33333333333),
 ('Reference', 74942.11111111111),
 ('Social Networking', 71548.34905660378),
 ('Music', 57326.530303030304),
 ('Weather', 52279.892857142855),
 ('Book', 39758.5),
 ('Food & Drink', 33333.92307692308),
 ('Finance', 31467.944444444445),
 ('Photo & Video', 28441.54375),
 ('Travel', 28243.8),
 ('Shopping', 26919.690476190477),
 ('Health & Fitness', 23298.015384615384),
 ('Sports', 23008.898550724636),
 ('Games', 22788.6696905016),
 ('News', 21248.023255813954),
 ('Productivity', 21028.410714285714),
 ('Utilities', 18684.456790123455),
 ('Lifestyle', 16485.764705882353),
 ('Entertainment', 14029.830708661417),
 ('Business', 7491.117647058823),
 ('Education', 7003.983050847458),
 ('Catalogs', 4004.0),
 ('Medical', 612.0)]
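
The nested loop above rescans the whole dataset once per genre. A single pass with running totals gives the same averages; the sketch below shows the idea on made-up rows. ```average_by_group``` and ```sample``` are hypothetical names, and the column indices are passed in rather than assumed.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Average the numeric column value_idx for each distinct value in group_idx."""
    totals = defaultdict(float)  # running sum per group
    counts = defaultdict(int)    # number of rows per group
    for row in dataset:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical rows: name, genre, rating count.
sample = [
    ['Maps A', 'Navigation', '100'],
    ['Maps B', 'Navigation', '300'],
    ['Chat A', 'Social Networking', '50'],
]
print(average_by_group(sample, 1, 2))
```

Keep in mind that an average over few apps can be dominated by one or two very large ones, so a high average does not by itself mean a typical app in that genre is popular.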

Navigation apps have the highest average number of user ratings on the App Store, followed by Reference and then Social Networking apps.  Perhaps a good idea is to create a dating app with free messaging (a feature that requires a subscription in most apps).  Of course, this app would be supported by advertising, but it could be a game changer in the dating world. 

## Most Popular Apps on the Play Store 

We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.)

We don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes — we only want to find out which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.
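
The conversion described above amounts to stripping the ```+``` and ```,``` characters and reading what is left as a number. A minimal sketch, with ```parse_installs``` as a hypothetical helper name:

```python
def parse_installs(installs):
    """Convert a Play Store install bucket such as '10,000+' to its lower bound."""
    return int(installs.replace('+', '').replace(',', ''))

for raw in ['100+', '10,000+', '1,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Since each bucket is read as its lower bound, the resulting averages underestimate the true install counts, but they still allow categories to be compared with one another.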

The analysis process is much the same as that for the App Store.  The only extra step is removing the ```+``` and ```,``` characters from each ```Installs``` value before converting it to a number.

In [24]:
android_freq = freq_table(android_free, 1)

android_popularity = {} # stores final results
for category in android_freq:
    total = 0 # sum of installs for a category
    len_category = 0 # number of apps in a category
    
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+', '').replace(',', '')
            installs = float(installs)
            total += installs
            len_category += 1 
    
    avg_installs = total/len_category
    android_popularity[category] = avg_installs
    
sorted(android_popularity.items(), key = lambda k: k[1], reverse = True) 

[('COMMUNICATION', 38456119.167247385),
 ('VIDEO_PLAYERS', 24727872.452830188),
 ('SOCIAL', 23253652.127118643),
 ('PHOTOGRAPHY', 17840110.40229885),
 ('PRODUCTIVITY', 16787331.344927534),
 ('GAME', 15588015.603248259),
 ('TRAVEL_AND_LOCAL', 13984077.710144928),
 ('ENTERTAINMENT', 11640705.88235294),
 ('TOOLS', 10801391.298666667),
 ('NEWS_AND_MAGAZINES', 9549178.467741935),
 ('BOOKS_AND_REFERENCE', 8767811.894736841),
 ('SHOPPING', 7036877.311557789),
 ('PERSONALIZATION', 5201482.6122448975),
 ('WEATHER', 5074486.197183099),
 ('HEALTH_AND_FITNESS', 4188821.9853479853),
 ('MAPS_AND_NAVIGATION', 4056941.7741935486),
 ('FAMILY', 3695641.8198090694),
 ('SPORTS', 3638640.1428571427),
 ('ART_AND_DESIGN', 1986335.0877192982),
 ('FOOD_AND_DRINK', 1924897.7363636363),
 ('EDUCATION', 1833495.145631068),
 ('BUSINESS', 1712290.1474201474),
 ('LIFESTYLE', 1437816.2687861272),
 ('FINANCE', 1387692.475609756),
 ('HOUSE_AND_HOME', 1331540.5616438356),
 ('DATING', 854028.8303030303),
 ('COMICS', 81765

Communication apps have the highest average number of installs on the Play Store (over 38 million per app).  In a similar vein, a dating app with free messaging capabilities (supported by advertising) may be an interesting starting point, though such an app may fall under the Social category instead.

In [74]:
dating_app = 'Tinder'
for app in google_data:
    name = app[0]
    if name == dating_app:
        print(app[1])

LIFESTYLE


Actually, Tinder is listed under Lifestyle, which is farther down the list (it's not even in the Dating category!).  Nevertheless, given the current frustration with dating apps, free messaging may come as a relief to the people who use them.

## Conclusions

This project explored data cleaning and analysis of the Google Play and Apple App Store datasets with the purpose of recommending profitable app profiles.  While a number of recommendations can be made, one promising option is a dating app with free messaging capabilities.  A more refined look at the categories in the future may reveal even more opportunities.