*Yahya Robert Scerbo*
# *Generating a Profitable App Profile* 

## *For the 'App Store' and 'Google Play' Markets*

This project aims to identify which app profiles are the most profitable on Google Play and the App Store. We are working as data analysts for a company that builds Android and iOS mobile apps, and our goal is to give our developers information that lets them make data-driven decisions about the apps they build.

The company only makes apps that are free to download and install, and our main source of revenue is in-app advertising. The money made on any given app is therefore influenced by the number of people using it. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

## Taking our first look at the data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Exploring an inventory of apps this large is currently beyond our budget, so we will analyze a sample of the data instead. Two data sets are used in this project:

[The first](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) is a data set containing about 7,000 iOS apps from the App Store; the data was collected in July 2017.

[The second](https://www.kaggle.com/lava18/google-play-store-apps?select=googleplaystore.csv) is a data set containing about 10,000 Android apps from Google Play; the data was collected in August 2018.

The first thing we should do is open the data sets and take a glance at some of their main features.

To make this first step easier, we will define a function `explore_data()` that takes four parameters: the data set, the start and end indices of the slice of rows we want to print, and a boolean that controls whether the total number of rows and columns is also printed.

In [1]:
from csv import reader

# Opening the data sets
open_file = open('/Users/Bud/project_data_sets/AppleStore.csv') 
read_file = reader(open_file)
applestore = list(read_file)
apple_header = applestore[0] 
apple_data = applestore[1:] #iOS Apps

open_file = open('/Users/Bud/project_data_sets/googleplaystore.csv')
read_file = reader(open_file)
googlestore = list(read_file)
google_header = googlestore[0]
google_data = googlestore[1:] # Android Apps

# Function that allows data exploration
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') #adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows:',len(dataset))
        print('Number of columns:',len(dataset[0]))
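
The cell above opens both files without closing them or naming an encoding. As a minimal sketch of an alternative, the loading step could be wrapped in a small helper that uses a context manager. The name `load_csv` and the `utf8` default are assumptions for illustration, not part of the original notebook.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, rows).

    Hypothetical helper: the file is closed automatically when the
    `with` block ends, and the encoding is stated explicitly.
    """
    with open(path, encoding=encoding, newline='') as csv_file:
        rows = list(reader(csv_file))
    return rows[0], rows[1:]

# Usage with the notebook's own paths would look like:
# apple_header, apple_data = load_csv('/Users/Bud/project_data_sets/AppleStore.csv')
```

The helper returns the header and the data rows separately, which matches how `apple_header` and `apple_data` are used in the rest of the notebook.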

In [2]:
# Exploring the data sets at a glance, apple first

print(apple_header) # Inspecting column titles
print('\n')
explore_data(apple_data,0,3,True) # Inspecting 1st 3 rows

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0', '2974676', '212', '3.5', '3.5', '95', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0', '2161558', '1289', '4.5', '4', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The Apple Store data set contains 7,197 rows and 16 columns. We've also printed the header row; the columns that seem most useful for our goal are: 'track_name', 'price', 'currency', 'rating_count_tot', 'rating_count_ver', 'prime_genre'.

We'll do the same with the Google Play data.

In [3]:
# The Google Play data

print(google_header) # Inspecting column titles
print('\n')
explore_data(google_data,0,3,True) # Inspecting 1st 3 rows

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The Google Play data set contains 10,841 rows and 13 columns. The useful columns in this data set seem to be: 'App', 'Category', 'Reviews', 'Installs', 'Price', 'Genres', 'Type'.

The column titles can be somewhat cryptic, particularly in the Apple Store data set. Clearer descriptions are provided for [the Apple Store data](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) and [the Google Play data](https://www.kaggle.com/lava18/google-play-store-apps?select=googleplaystore.csv).

## Beginning Data Cleaning

We will clean our data by removing or amending the following:

- Inaccurate or incorrect data
- Duplicate data
- Non-English apps
- Apps that are not free

### Detecting inaccurate data, correcting or removing it

Before doing a thorough analysis, it is important to ensure there are no incorrect entries in the data. Skipping this step could make our analysis inaccurate.

There is a discussion forum for the Google Play data set [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion). One entry, with quite a few comments, mentions a missing value in row 10,472. If we print this row, we can see what is wrong with it.

In [4]:
print(google_data[10472])
print('\n')
print(google_header) # The header of each column
print('\n')
print(google_data[0]) # this is a row we can compare with

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up', '']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


The app with the error is *'Life Made WI-Fi Touchscreen Photo Frame'*. The row is missing its 'Category' entry, which has shifted every later value one column to the left. A quick search on the Google Play Store shows that the correct category for this app is 'Lifestyle'. Adding the missing category should fix the problem.

In [5]:
print(google_data[10472])
print('\n')
google_data[10472] = ['Life Made WI-Fi Touchscreen Photo Frame','Lifestyle','1.9','19','3.0M','1000+','Free','0','Everyone','','February 11 2018', '1.0.19','4.0 and up']
print(google_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up', '']


['Life Made WI-Fi Touchscreen Photo Frame', 'Lifestyle', '1.9', '19', '3.0M', '1000+', 'Free', '0', 'Everyone', '', 'February 11 2018', '1.0.19', '4.0 and up']


That entry should no longer cause a problem. 
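
Retyping the whole row by hand makes it easy to introduce small formatting differences: the replacement above stores `'1000+'` and `'February 11 2018'` where the original row had `'1,000+'` and `'February 11, 2018'`. As an alternative sketch, the missing value could be inserted in place. `repair_shifted_row` is a hypothetical helper, not part of the original notebook, and it assumes the shifted row ends with a single empty padding field, as in the printout above.

```python
def repair_shifted_row(row, category, category_index=1):
    """Return a copy of `row` with the missing category inserted.

    Assumes the row was shifted left by one column and padded with a
    trailing empty string, as in the row printed above.
    """
    fixed = list(row)                  # copy so the original row is untouched
    fixed.insert(category_index, category)
    if fixed and fixed[-1] == '':      # drop the padding field left by the shift
        fixed.pop()
    return fixed


broken = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
          '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
          '1.0.19', '4.0 and up', '']
fixed = repair_shifted_row(broken, 'Lifestyle')
print(len(fixed))   # 13, the same length as the header
print(fixed[:6])
```

Because every other field is carried over unchanged, the install count keeps the same `'1,000+'` format as the rest of the `Installs` column.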

There is also a discussion forum for the Apple App Store data [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion), but the comments do not mention any obvious errors like the one we've just amended.

### Correcting duplicate entries

The discussion forum mentioned previously also reveals that there are duplicate entries in the Play Store data. An example is printed below:

In [6]:
for app in google_data:
    app_name = app[0]
    if app_name == 'Snapchat': # showing Snapchat duplicates
        print(app)

['Snapchat', 'SOCIAL', '4', '17014787', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4', '17014705', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4', '17015352', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4', '17000166', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']


The app "Snapchat" has been entered into the data set multiple times. We need to find out exactly how many entries are duplicates, and make a new data set without these duplicates. 

In [7]:
duplicates = []
unique_entries = []

for app in google_data:
    app_name = app[0]
    if app_name in unique_entries:
        duplicates.append(app_name)
    else:
        unique_entries.append(app_name)
        
print('Number of duplicates:', len(duplicates))
print('\n')
print('First 10 duplicates:',duplicates[:10])

Number of duplicates: 1181


First 10 duplicates: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Of the 10,841 entries in the Play Store data set, 1,181 are duplicates.

To gather accurate information from the data, we should remove the duplicate entries, since they could throw off our conclusions.

Instead of deleting rows randomly, we can take a more systematic approach. If you look at the 'Snapchat' duplicates printed earlier, you'll notice that the only difference between them is the number in the fourth column, which is the number of reviews. We're going to keep the entry with the most reviews, since it is likely to be the most recent and therefore the most accurate. We will delete the remaining duplicates.
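
The loop above checks each name against a growing list, which means the list is scanned once per row. As a hedged alternative sketch, `collections.Counter` from the standard library can produce the same count in one pass. The helper name `count_duplicates` and the sample names are illustrative only.

```python
from collections import Counter

def count_duplicates(names):
    """Return (number_of_duplicate_entries, {name: extra_copies})."""
    counts = Counter(names)
    # An app seen n times contributes n - 1 duplicate entries.
    extras = {name: n - 1 for name, n in counts.items() if n > 1}
    return sum(extras.values()), extras

sample = ['Box', 'Slack', 'Box', 'Box', 'Zenefits', 'Slack']
n_dupes, extras = count_duplicates(sample)
print(n_dupes)   # 3
print(extras)    # {'Box': 2, 'Slack': 1}
```

With the notebook's data, the call would be `count_duplicates(row[0] for row in google_data)`.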

To do this we will:

- Create a dictionary where each key is a unique app name, and the corresponding value is the highest number of reviews.
- Use our dictionary to make a new data set, which will only have one entry per app, the entry with the highest number of reviews.

Building the dictionary:

In [8]:
reviews_max = {}

for app in google_data:
    app_name = app[0]
    reviews = float(app[3])
    
    if app_name in reviews_max and reviews_max[app_name] < reviews:
        reviews_max[app_name] = reviews
        
    elif app_name not in reviews_max:
        reviews_max[app_name] = reviews
        
print('Expected length:',(len(google_data)-1181))
print('Length:',len(reviews_max))

Expected length: 9660
Length: 9660


The length of our dictionary is in line with our expectations, as it should simply be the total number of entries minus the duplicate entries. 

Now we can use our dictionary to remove the duplicate entries. We'll do this by creating two empty lists `google_clean` (which stores our new cleaned data) and `already_added` (which will just store app names). 

In [9]:
google_clean = []
already_added = []

for app in google_data:
    app_name = app[0]
    reviews = float(app[3])
    if (reviews_max[app_name] == reviews) and (app_name not in already_added):
        google_clean.append(app)
        already_added.append(app_name)

In the code above, we loop through the Play Store data set and extract the name and number of reviews of each app. A row is added to `google_clean` only if two conditions hold: the number of reviews matches the maximum stored for that app in the `reviews_max` dictionary, and the app name is not already in the `already_added` list. The second condition handles duplicate apps where more than one entry shares the highest number of reviews.
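
The same rule (keep the row with the most reviews, and keep only one row on a tie) can also be sketched as a single pass over the data. This is an alternative illustration rather than the notebook's method. `keep_most_reviewed` is a hypothetical name, the Snapchat review counts are taken from the printout earlier, and the 'Box' row is made up for the example.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the most reviews.

    Ties keep the first row seen. Rows are returned in the order in
    which each name first appears.
    """
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Snapchat', 'SOCIAL', '4', '17014787'],
    ['Box', 'BUSINESS', '4.2', '159872'],      # made-up row for illustration
    ['Snapchat', 'SOCIAL', '4', '17015352'],
    ['Snapchat', 'SOCIAL', '4', '17000166'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

One difference from the two-step version: here each app keeps the position of its first appearance, whereas the notebook's loop places it at the position of the winning row.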

In [10]:
print(len(google_clean))

google_data = google_clean

9660


We now have a new "cleaned" Play Store data set in which every app is unique. We've also assigned the cleaned data back to the original name, `google_data`, so that later cells cannot accidentally use the uncleaned version.

In [11]:
unique_app = []
duplicate_entry = []

for app in apple_data:
    app_id = app[0] # column 0 holds the app ID, not the app name
    if app_id in unique_app:
        duplicate_entry.append(app_id)
        
    else:
        unique_app.append(app_id)

print(len(apple_data))
print(len(unique_app))

7197
7197


There were no comments in the Apple App Store discussion forum about duplicate entries, but we checked anyway. The code above shows that every entry in `apple_data` has a unique app ID.

### Removing Non-English Apps

Our company only develops apps in English, so we only need to analyse apps that are directed toward an English speaking demographic. Exploring the data sets shows that they both contain apps with names that suggest they are not intended for an English speaking audience. 

One way to remove these apps is to drop every app whose name contains a symbol that isn't commonly used in English text. English text usually consists of letters from the Latin alphabet, Arabic numerals (the digits 0 to 9), punctuation marks (., ?, !, etc.) and other symbols (+, *, etc.).

Behind the scenes, each character used in a string has a corresponding number associated with it. The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system.

We can write code that loops over each app name and looks up the number corresponding to each character. If any character's number is greater than 127, the app probably has a non-English name.
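
As a quick illustration of the idea, Python's built-in `ord()` returns the number behind a character. The characters below are chosen only as examples, and the dictionary name `code_points` is an assumption for this sketch.

```python
# Map a few sample characters to their numeric code points.
code_points = {character: ord(character)
               for character in ['a', 'A', '9', '+', '™', '爱', '😜']}

for character, number in code_points.items():
    # The last column shows whether the character is inside the ASCII range.
    print(repr(character), number, number <= 127)
```

The first four characters fall inside the 0 to 127 range (for example, `ord('a')` is 97 and `ord('A')` is 65), while '™' (8482), '爱' and '😜' fall outside it.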

In [12]:
def app_detector(string):
    for character in string:
        if (ord(character)) > 127:
            return False
    return True
        
print(app_detector('爱奇艺PPS -《欢乐颂2》电视剧热播'))

print(app_detector('Instagram'))

print(app_detector('Docs To Go™ Free Office Suite'))

print(app_detector('Instachat 😜'))

False
True
False
False


In the previous cell we created a function `app_detector` that loops over a string and returns `False` if the string contains a character with a value greater than 127.

The issue is that some apps meant for English speaking audiences have names containing characters outside the ASCII range (such as emojis and trademark symbols).

The function therefore behaved as written when it returned `False` for *'Docs To Go™ Free Office Suite'*, but this app is very likely meant for an English speaking audience.

If we use the above function, we'll end up throwing away perfectly good data, since many apps will be incorrectly labelled as non-English. To minimize data loss, we'll only remove an app if its name has more than three characters outside the ASCII range. The function won't be perfect, but it will be good enough for our purposes.

In [13]:
def app_detector(string):
    too_many = []
    for character in string:
        if (ord(character)) > 127:
            too_many.append(character)
        if len(too_many) == 4:
            return False
    return True

print(app_detector('爱奇艺PPS -《欢乐颂2》电视剧热播'))

print(app_detector('Instagram'))

print(app_detector('Docs To Go™ Free Office Suite'))

print(app_detector('Instachat 😜'))

False
True
True
True


The function now behaves as intended: names with up to three non-ASCII characters are kept, and names with four or more are rejected.
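
The same threshold rule can be written more compactly by counting the non-ASCII characters directly. This is a sketch of an equivalent formulation, not a replacement for the cell above. The name `is_probably_english` and the `max_non_ascii` parameter are assumptions introduced for illustration.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters
    outside the ASCII range (code points above 127)."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
print(is_probably_english('Instagram'))                      # True
print(is_probably_english('Docs To Go™ Free Office Suite'))  # True
print(is_probably_english('Instachat 😜'))                    # True
```

Making the threshold a parameter would allow a stricter or looser cut-off to be tried without editing the function body.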

In [14]:
english_google = []
english_apple = []

for row in google_data:
    app_name = row[0]
    if app_detector(app_name):
        english_google.append(row)
        
google_data = english_google
explore_data(google_data,0,3,True)

for row in apple_data:
    app_name = row[1]
    if app_detector(app_name):
        english_apple.append(row)
print('\n')        
apple_data = english_apple
explore_data(apple_data,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9615
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0', '2974676', '212', '3.5', '3.5', '95', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0', '2161558', '1289', '4.5', '4', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']

All apps with non-English names have been removed from both data sets. 

### Removing Non-Free Apps

Our final data cleaning step will be to remove all apps from both sets of data that are not free.

In [15]:
google_free = []
apple_free = []

for row in google_data:
    price = row[7]
    if price == '0':
        google_free.append(row)
        
print(len(google_free))

for row in apple_data:
    price = row[4]
    if price == '0':
        apple_free.append(row)
        
print(len(apple_free))

google_data = google_free
apple_data = apple_free

8865
3222


We're now left with 8,865 Android apps and 3,222 iOS Apps, which should be enough for our analysis. 
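
The filter above compares the price to the exact string `'0'`, which works for the rows shown in this notebook. If a file stored a free price in another form, such as `'0.0'` or `'$0.00'`, a numeric comparison would be more forgiving. The sketch below assumes prices are strings with an optional leading `$`. `is_free` is a hypothetical helper, not part of the original analysis.

```python
def is_free(price):
    """Return True if a price string represents zero.

    Assumes an optional leading '$'. Anything that cannot be parsed
    as a number is treated as not free.
    """
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

print(is_free('0'))       # True
print(is_free('$4.99'))   # False
print(is_free('0.0'))     # True
```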

## Most Common App Genres

As stated earlier, we want to find out what kinds of apps are likely to attract the most users, since our revenue depends heavily on the number of people using our apps.

To minimize risk and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further. 
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store. 

Since our end goal is to have an app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. We'll begin this part of the analysis by finding the most common app genres in each market, which we can do by building frequency tables for a few columns in our data sets.

In [16]:
print(apple_header)
print('\n')
print(google_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The columns most useful to us in this task will be `prime_genre` from the Apple Store data, as well as `Genres` and `Category` from the Google Play data. 

In [17]:
def freq_table(dataset,index): # The dataset is a list of lists
    table = {}
    total = 0
    
    for app in dataset: # looping over each row in dataset
        total +=1 # Total number of apps, for finding percentage
        entry = app[index] # assigning sought after column as 'entry'
        if entry in table: # entry is added as key to dictionary
            table[entry] +=1
        else:
            table[entry] =1
    
    table_percentages = {}
    for genre in table: 
        percentage = (table[genre]/total)*100
        table_percentages[genre] = percentage
        
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

We have just defined two functions. The first, `freq_table`, takes two inputs, `dataset` and `index`, and returns a frequency table for the chosen column of the data set, with the frequencies expressed as percentages.

The second function, `display_table`, takes the same inputs and generates a frequency table using `freq_table`. It then turns the frequency table into a list of tuples, sorts them in descending order and prints the result.
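
For comparison, the same kind of percentage table can be sketched with `collections.Counter`, whose `most_common()` method returns values sorted from most to least frequent. The function name `freq_table_pct` and the sample rows are illustrative assumptions.

```python
from collections import Counter

def freq_table_pct(rows, index):
    """Return [(value, percentage), ...] for one column, most common first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100)
            for value, count in counts.most_common()]

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for value, pct in freq_table_pct(sample, 1):
    print(value, ':', pct)
# Games : 75.0
# Music : 25.0
```

One behavioral difference: `display_table` breaks ties by sorting on the key as well, while `most_common()` keeps tied values in the order they were first seen.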

Let's inspect the columns of interest.

It is important to note that our data sets now contain only free English apps, so our conclusions may not reflect Google Play and the App Store as a whole.

In [18]:
display_table(apple_data,11) # prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The frequency table above shows the most and least common genres among the free English apps in our Apple Store data set. The most common genre is 'Games': over half (58.16%) of the apps fall under it. 'Entertainment' is second at close to 8%, followed by 'Photo & Video' at close to 5%, 'Education' at 3.66% and 'Social Networking' at 3.29%.

It seems that the vast majority of the free English apps are designed for recreation or entertainment (games, photo and video, social networking, sports, music). By contrast, there seem to be fewer apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle).

We cannot recommend an app profile for the App Store from this frequency table alone; a genre such as 'Social Networking' may contain a small number of apps but have a large number of users.

For now, we will move on to the Google Play data set.

In [19]:
display_table(google_data,1) # category

FAMILY : 18.905809362662154
GAME : 9.723632261703328
TOOLS : 8.460236886632826
BUSINESS : 4.591088550479413
LIFESTYLE : 3.902989283699944
PRODUCTIVITY : 3.8917089678511
FINANCE : 3.699943598420756
MEDICAL : 3.5307388606880994
SPORTS : 3.395375070501974
PERSONALIZATION : 3.3164128595600673
COMMUNICATION : 3.2374506486181613
HEALTH_AND_FITNESS : 3.0795262267343486
PHOTOGRAPHY : 2.9441624365482233
NEWS_AND_MAGAZINES : 2.7975183305132543
SOCIAL : 2.662154540327129
TRAVEL_AND_LOCAL : 2.33502538071066
SHOPPING : 2.2447828539199097
BOOKS_AND_REFERENCE : 2.143260011280316
DATING : 1.8612521150592216
VIDEO_PLAYERS : 1.793570219966159
MAPS_AND_NAVIGATION : 1.3987591652566271
FOOD_AND_DRINK : 1.2408347433728144
EDUCATION : 1.161872532430908
ENTERTAINMENT : 0.9588268471517203
LIBRARIES_AND_DEMO : 0.9362662154540328
AUTO_AND_VEHICLES : 0.924985899605189
HOUSE_AND_HOME : 0.8234630569655951
WEATHER : 0.8009024252679076
EVENTS : 0.7106598984771574
PARENTING : 0.6542583192329385
ART_AND_DESIGN : 0.6429

Google Play seems to have a different mix of apps. Although `FAMILY` and `GAME` are the two largest categories, the `Category` column suggests there are proportionally fewer apps for fun and gaming and more apps with practical uses: tools, business, productivity, finance and medical apps all rank highly in this frequency table.

We can check this with the `Genres` column also from the Google Play data set.

In [20]:
display_table(google_data,9) # genres

Tools : 8.44895657078398
Entertainment : 6.068809926677947
Education : 5.346869712351946
Business : 4.591088550479413
Productivity : 3.8917089678511
Lifestyle : 3.8917089678511
Finance : 3.699943598420756
Medical : 3.5307388606880994
Sports : 3.4630569655950363
Personalization : 3.3164128595600673
Communication : 3.2374506486181613
Action : 3.102086858432036
Health & Fitness : 3.0795262267343486
Photography : 2.9441624365482233
News & Magazines : 2.7975183305132543
Social : 2.662154540327129
Travel & Local : 2.323745064861816
Shopping : 2.2447828539199097
Books & Reference : 2.143260011280316
Simulation : 2.0417371686407217
Dating : 1.8612521150592216
Arcade : 1.849971799210378
Video Players & Editors : 1.7710095882684715
Casual : 1.7597292724196276
Maps & Navigation : 1.3987591652566271
Food & Drink : 1.2408347433728144
Puzzle : 1.1280315848843767
Racing : 0.9926677946982515
Role Playing : 0.9362662154540328
Libraries & Demo : 0.9362662154540328
Auto & Vehicles : 0.924985899605189
Str

The difference between the `Genres` and `Category` columns is not completely obvious, but the `Genres` column appears to be more granular, with a much greater variety of values. As we are looking for the bigger picture at the moment, we will use the `Category` column going forward.

Nonetheless, the `Genres` column seems to confirm that Google Play has a higher proportion of practical apps: tools, education, business and productivity are all within the top five genres of the frequency table.

So far, it seems that our App Store data set is made up mostly of fun and recreational apps, whereas the Google Play data set contains more apps intended for practical or educational use. Drawing our conclusion will be easier once we find which app genres have the most users.

## Most Popular Apps by Genre 

One way to find out which genres have the most users is to calculate the average number of installs for each app genre. For the Google Play data set we can use the `Installs` column. Unfortunately, there is no equivalent column in the App Store data set, so we will use the total number of user ratings as a proxy; this information is in the `rating_count_tot` column.

### App Store

First we will generate a frequency table for the `prime_genre` column to get the unique app genres.

In [21]:
prime_genres = freq_table(apple_data,11)

for genre in prime_genres:
    total = 0
    len_genre = 0
    for app in apple_data:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    avg_rating = total/len_genre
    print(genre,avg_rating)

Social Networking 71548.34905660378
Photo & Video 28441.54375
Games 22788.6696905016
Music 57326.530303030304
Reference 74942.11111111111
Health & Fitness 23298.015384615384
Weather 52279.892857142855
Utilities 18684.456790123455
Travel 28243.8
Shopping 26919.690476190477
News 21248.023255813954
Navigation 86090.33333333333
Lifestyle 16485.764705882353
Entertainment 14029.830708661417
Food & Drink 33333.92307692308
Sports 23008.898550724636
Book 39758.5
Finance 31467.944444444445
Education 7003.983050847458
Productivity 21028.410714285714
Business 7491.117647058823
Catalogs 4004.0
Medical 612.0


Above we have printed the average number of user ratings for each genre in our App Store data set, which we are using as a proxy for the number of installs.
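
The cell above loops over the whole data set once per genre. As a sketch of an alternative, the per-genre averages can be accumulated in a single pass with two dictionaries. `average_by_key` is a hypothetical helper and the sample rows are made up for illustration.

```python
from collections import defaultdict

def average_by_key(rows, key_index, value_index):
    """Return {key: mean of the numeric column} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += float(row[value_index])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
print(average_by_key(sample, 0, 1))   # {'Games': 200.0, 'Music': 50.0}
```

With the notebook's data, the equivalent call would be `average_by_key(apple_data, 11, 5)`.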

We can begin by disregarding some genres that we definitely will not have an interest in, these include:

- Social Networking apps - This genre tends to be dominated by very successful companies such as Facebook and Twitter.
- Music apps - Again this app genre is dominated by successful giants like Spotify and Shazam.
- Weather apps - As we generate our revenue through advertisements, this app type would not work well, since users tend to spend only a very short time in these apps.
- Navigation apps - These will also be overwhelmingly dominated by apps such as Google Maps.
- Food and Drink apps - These apps generally allow users to order food for pick-up or delivery, and such services are outside the scope of our company.

Once we ignore the genres above, those with the highest average number of user ratings are 'Reference' and 'Book'. We can take a closer look at these.

In [22]:
for app in apple_data:
    if app[11] == 'Reference':
        print(app[1],'-',app[5])  
print('\n')
for app in apple_data:
    if app[11] == 'Book':
        print(app[1],'-',app[5])  

Bible - 985920
Dictionary.com Dictionary & Thesaurus - 200047
Dictionary.com Dictionary & Thesaurus for iPad - 54175
Google Translate - 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran - 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition - 17588
Merriam-Webster Dictionary - 16849
Night Sky - 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) - 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools - 4693
GUNS MODS for Minecraft PC Edition - Mods Tools - 1497
Guides for Pokémon GO - Pokemon GO News and Cheats - 826
WWDC - 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free - 718
VPN Express - 14
Real Bike Traffic Rider Virtual Reality Glasses - 8
教えて!goo - 0
Jishokun-Japanese English Dictionary & Translator - 0


Kindle – Read eBooks, Magazines & Textbooks - 252076
Audible – audio books, original series & podcasts - 105274
Color Th

The 'Book' genre seems to contain apps that allow users to access a library of various books. The 'Reference' genre seems to allow users to access and analyse a single text. 

We may not be prepared to offer a variety or selection of books, but we definitely could create an in-app version of a popular book. This would potentially attract users who frequently access both 'Reference' and 'Book' apps. We could incentivize users to stay on our app as long as possible by offering extra features, such as:

- Fun and interesting facts about the book, the author, their writing process. We could put these at the end of each chapter. 
- We could allow users to set a reading goal, a certain number of pages in a certain time frame for example.
- A dictionary could be built into the app, this would mean users have less need to exit the app. 
- We could include pages for discussion forums, to allow fans of the book to have discussions and ask each other questions.

I would recommend the app profile described above for the App Store. Such an app would have the potential to stand out and to educate users, whilst also giving them the chance to enjoy one of their favourite texts.

Now that we have a good idea of an app profile for the App Store, let's move on to the Google Play data.

### Google Play

The `Installs` column of the Google Play data set tells us the number of installs of each app, which should give a clear picture of genre popularity. Unfortunately, the install numbers are not exact; they are given as open-ended thresholds (100+, 1,000+, 5,000+, etc.):

In [23]:
display_table(google_data,5)

1,000,000+ : 15.724760293288211
100,000+ : 11.551043429216017
10,000,000+ : 10.547095318668923
10,000+ : 10.197405527354766
1,000+ : 8.392554991539763
100+ : 6.91483361534123
5,000,000+ : 6.824591088550479
500,000+ : 5.561195713479977
50,000+ : 4.771573604060913
5,000+ : 4.512126339537507
10+ : 3.542019176536943
500+ : 3.248730964467005
50,000,000+ : 2.3011844331641287
100,000,000+ : 2.131979695431472
50+ : 1.9176536943034406
5+ : 0.7896221094190639
1+ : 0.5076142131979695
500,000,000+ : 0.2707275803722504
1,000,000,000+ : 0.2256063169768754
0+ : 0.04512126339537507
1000+ : 0.011280315848843767
0 : 0.011280315848843767


We do not know whether an app with 100,000+ installs has 100,000 installs or 400,000 installs. Thankfully, we do not need to be overly precise. For our analysis we will assume that an app with 100,000+ installs has exactly 100,000 installs, and likewise for every other threshold.

To perform calculations with the install data we need to convert these values from strings to floats, which means first removing the commas and plus signs.
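
As a minimal sketch of that cleaning step on its own, the helper below strips the plus sign and commas before converting. The name `parse_installs` is hypothetical and is not part of the notebook; the sample strings are thresholds that appear in the table above.

```python
def parse_installs(raw):
    """Convert an install threshold such as '1,000,000+' to a float."""
    # Remove the trailing plus sign and the thousands separators.
    cleaned = raw.replace('+', '').replace(',', '')
    return float(cleaned)


# Try the helper on a few thresholds, including the bare '0' value.
for raw in ['1,000,000+', '500+', '0']:
    print(raw, '->', parse_installs(raw))
```

Note that `str.replace` leaves a string unchanged when the character is absent, so values without a plus sign or commas (such as `'0'`) convert without special handling.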

In [24]:
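# Frequency table of the Category column (index 1); we loop over its categories
Category = freq_table(google_data,1)

for category in Category:
    total = 0           # running sum of installs for this category
    len_category = 0    # number of apps in this category
    for app in google_data:
        category_app = app[1]
        if category_app == category:
            # Installs (index 5) is a string such as '1,000,000+'
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            total += float(n_installs)
            len_category += 1
    # Average installs per app in this category
    avg_installs = total/len_category
    print(category, avg_installs)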

ART_AND_DESIGN 1986335.0877192982
AUTO_AND_VEHICLES 647317.8170731707
BEAUTY 513151.88679245283
BOOKS_AND_REFERENCE 8767811.894736841
BUSINESS 1712290.1474201474
COMICS 817657.2727272727
COMMUNICATION 38456119.167247385
DATING 854028.8303030303
EDUCATION 1833495.145631068
ENTERTAINMENT 11640705.88235294
EVENTS 253542.22222222222
FINANCE 1387692.475609756
FOOD_AND_DRINK 1924897.7363636363
HEALTH_AND_FITNESS 4188821.9853479853
HOUSE_AND_HOME 1331540.5616438356
LIBRARIES_AND_DEMO 638503.734939759
LIFESTYLE 1437816.2687861272
GAME 15588015.603248259
FAMILY 3695641.8198090694
MEDICAL 120550.61980830671
SOCIAL 23253652.127118643
SHOPPING 7036877.311557789
PHOTOGRAPHY 17840110.40229885
SPORTS 3638640.1428571427
TRAVEL_AND_LOCAL 13984077.710144928
TOOLS 10801391.298666667
PERSONALIZATION 5201482.6122448975
PRODUCTIVITY 16787331.344927534
PARENTING 542603.6206896552
WEATHER 5074486.197183099
VIDEO_PLAYERS 24727872.452830188
NEWS_AND_MAGAZINES 9549178.467741935
MAPS_AND_NAVIGATION 4056941.774193
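
The averages above are printed in no particular order, which makes them hard to compare. As a rough sketch, assuming the averages were collected into a dictionary (the name `avg_installs_by_category` is hypothetical, and only a handful of values are copied here from the output above, rounded to whole installs), they could be ranked like this:

```python
# A few averages copied from the output above, rounded to whole installs.
# The dictionary itself is an assumption; the notebook only prints the values.
avg_installs_by_category = {
    'COMMUNICATION': 38456119,
    'VIDEO_PLAYERS': 24727872,
    'SOCIAL': 23253652,
    'BOOKS_AND_REFERENCE': 8767812,
    'PRODUCTIVITY': 16787331,
    'ENTERTAINMENT': 11640706,
    'MEDICAL': 120551,
}

# Sort the (category, average) pairs from highest to lowest average.
ranked = sorted(avg_installs_by_category.items(),
                key=lambda item: item[1],
                reverse=True)

for category, avg in ranked:
    print(f'{category:<22}{avg:>12,}')
```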

Again, we should begin by disregarding some categories that are dominated by a few giant companies:

- Communication apps - This category will most likely be dominated by WhatsApp and Facebook Messenger.
- Video players - Again, this category will most likely be taken up by apps like YouTube or Netflix.
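
One way to check how much a few giant apps inflate a category average is to recompute it while leaving out apps at or above some install cap. The sketch below is an illustration only: the function `average_installs`, the `cap` parameter, and the three-field row layout are assumptions, and the sample rows are made up rather than taken from the data set.

```python
def average_installs(rows, category, cap=None):
    """Average installs for `category`; skip apps at or above `cap` if given."""
    values = []
    for name, app_category, installs in rows:
        if app_category != category:
            continue
        n = float(installs.replace('+', '').replace(',', ''))
        if cap is None or n < cap:
            values.append(n)
    # Avoid dividing by zero when no app passes the filter.
    return sum(values) / len(values) if values else 0.0


# Made-up rows for illustration only; these are NOT from the Google Play data.
sample_rows = [
    ('Giant Messenger', 'COMMUNICATION', '1,000,000,000+'),
    ('Small Chat A', 'COMMUNICATION', '100,000+'),
    ('Small Chat B', 'COMMUNICATION', '500,000+'),
    ('Small Chat C', 'COMMUNICATION', '1,000,000+'),
]

# Average with every app included, then with the giant app left out.
print(average_installs(sample_rows, 'COMMUNICATION'))
print(average_installs(sample_rows, 'COMMUNICATION', cap=100_000_000))
```

If the capped average is far below the uncapped one, the category is less popular for a typical app than the raw average suggests.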

I believe the app profile we came up with for the App Store would also work in the Google Play market. The 'Books and Reference' category has a significant average number of installs, about 9 million. The 'Productivity' category has an average of about 17 million installs, and the 'Entertainment' category has an average of about 12 million installs.
With the app profile we previously suggested, our app could appeal to users of these categories too. Letting users set reading goals is a way of helping them increase their productivity, and they are likely to be entertained by reading and learning more about the book.

To help us decide what kind of book would be successful in app form, we can take a closer look at which apps are most popular in the 'Books and Reference' category.

In [25]:
for app in google_data:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5]=='5,000,000+'
                                             or app[5]=='10,000,000+'
                                             or app[5]=='50,000,000+'):
        print(app[0],'-',app[5])

Wikipedia - 10,000,000+
Cool Reader - 10,000,000+
FBReader: Favorite Book Reader - 10,000,000+
AlReader -any text book reader - 5,000,000+
Ebook Reader - 5,000,000+
Read books online - 5,000,000+
Ancestry - 5,000,000+
HTC Help - 10,000,000+
Moon+ Reader - 10,000,000+
Aldiko Book Reader - 10,000,000+
Dictionary - WordWeb - 5,000,000+
50000 Free eBooks & Free AudioBooks - 5,000,000+
Al-Quran (Free) - 10,000,000+
Al Quran Indonesia - 10,000,000+
Al'Quran Bahasa Indonesia - 10,000,000+
Al Quran : EAlim - Translations & MP3 Offline - 5,000,000+
Quran for Android - 10,000,000+
Dictionary.com: Find Definitions for English Words - 10,000,000+
English Dictionary - Offline - 10,000,000+
Bible KJV - 5,000,000+
NOOK: Read eBooks & Magazines - 10,000,000+
Dictionary - 10,000,000+
Spanish English Translator - 10,000,000+
Dictionary - Merriam-Webster - 10,000,000+
JW Library - 10,000,000+
Oxford Dictionary of English : Free - 10,000,000+
English Hindi Dictionary - 10,000,000+
English to Hindi Diction

This niche seems to be dominated mainly by dictionaries, book libraries and online book reading software. We may not see success making apps similar to these.

Fortunately, there are successful apps built around a single book, such as the Bible or the Quran, so taking a popular book and putting it in app form looks as though it can be profitable!

Perhaps choosing a very popular, modern book and turning it into an app could be profitable on both the Google Play and App Store markets.

As stated previously, there are already plenty of libraries available on both markets, so we would need to find extra features to help our app stand out.

## Conclusions and Recommendation

In this project, we analyzed two data sets, one with information on a selection of apps from the App Store and the other with similar information on Google Play mobile apps. We analyzed this data hoping to make a recommendation on an app profile that could be profitable on both markets.

The conclusion we came to is that making an app version of a popular book (perhaps a more recent book) could be profitable on both markets. We stated earlier that both markets are already full of libraries, so it would be wise to add extra features to make our app stand out. Some of the features we have come up with are:

- Having a built-in dictionary.
- Offering quizzes on the book.
- Giving interesting facts about the book and author at the end of each chapter.
- Allowing users to set themselves time-bound reading goals.
- Providing a forum where users can ask each other questions and have discussions.
- Perhaps offering an audio version of the book.