# Apps Analysis

The aim of this project is to find the kinds of mobile apps that attract our users. Because our company only builds free-to-download apps, our main source of revenue is in-app ads. Our goal is to gain insight into the kinds of apps that are most popular with users.

## Opening and Exploring the Data


Since there are over a million apps on the market for iOS and Android, analyzing all of them is not practical, so we'll use a sample instead. Kaggle hosts two downloadable data sets that we can use for the analysis: one for iOS (approx. 7,000 apps) and one for Android (approx. 10,000 apps).

[iOS](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps#AppleStore.csv)

[Android](https://www.kaggle.com/lava18/google-play-store-apps#googleplaystore.csv)



Let's start by opening the files and exploring the data.


In [1]:
# Opening the AppleStore.csv file
from csv import reader
# utf8 is given explicitly because some app names contain non-ASCII characters
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
# separate the header from the rest
ios_header = ios[0]
ios = ios[1:]

# Opening the googleplaystore.csv file
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
# separate the header from the rest
android_header = android[0]
android = android[1:]
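
The two loading steps above follow the same pattern, so they could be wrapped in one helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that does not appear in the notebook, and the demonstration uses a small temporary file with made-up contents rather than the Kaggle files.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file, closing the file afterwards."""
    # newline="" is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Self-contained demonstration on a temporary file (made-up data)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        csv.writer(f).writerows([["App", "Price"], ["Demo app", "0"]])
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files, the call would presumably look like `ios_header, ios = load_csv('AppleStore.csv')`.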


We will use a generic function to make the list more readable. We will start with ios:

In [2]:
# Generic function to display the few rows of the dataset
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
# iOS
# print the header list
print(ios_header)
# enter a new line
print('\n')
# apply the function to the dataset and print the first 4 rows
# assigning the argument for rows_and_columns = True to print the number of rows and columns
explore_data(ios, 0 , 4, True)
        


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows: 7197
Number of columns: 17


We can see that the first column doesn't have a name and the values are just indexes for each row. Since we don't need this, we will remove the entire column from the data set.

In [3]:
# Create a loop to go through each row and delete the first element
for app in ios:
    del app[0]
    
del ios_header[0]

print(ios_header)
print('\n')
explore_data(ios, 0 , 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows: 7197
Number of columns: 16


We can see that there are 7197 rows which represent the number of apps and 16 columns of information which can be used for our analysis. At first glance, the rating_count_tot, prime_genre, user_rating, track_name and price might be useful for our analysis. Details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps#AppleStore.csv).

Now we do the same for Android apps.

In [4]:
# Android
# print the header list
print(android_header)
# enter a new line
print('\n')
# apply the function to the dataset and print the first 4 rows
# assigning the argument for rows_and_columns = True to print the number of rows and columns
explore_data(android, 0 , 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


We can see there are 10841 rows (one per app) and 13 columns of information for Android apps. Similarly, it seems like 'App', 'Category', 'Rating', 'Reviews', 'Price', 'Content Rating' and 'Genres' might be useful for our analysis.


## Removing wrong data

There's a dedicated [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) for the Google Play data set, which points out an error in row 10472. Let's print this row and compare it against the header and a few other rows that are correct.

In [5]:
print(android_header)
print('\n')
print(android[10471])
print('\n')
print(android[10472])
print('\n')
print(android[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


We can see that row 10472 does indeed have an error. The category value is missing, which shifted every later value in the row one column to the left. Since we are not sure what the actual value is, we will delete the entire row 10472.

In [6]:
# Before we delete the row, lets check the number of rows
print(len(android))
# Delete row 10472, make sure this code is run only ONCE
del android[10472]  
# Check whether the row has been deleted successfully
print(len(android))

10841
10840
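
The faulty row printed earlier has 12 fields while the header has 13, so rows like it can also be found by comparing lengths instead of relying on a known index. The following is a minimal sketch on made-up rows; `find_malformed_rows` is a hypothetical helper, not part of the notebook.

```python
def find_malformed_rows(rows, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


# Made-up sample: the second row is missing its category value
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["A", "TOOLS", "4.2"],
    ["B", "1.9"],
    ["C", "SOCIAL", "4.5"],
]

bad_indexes = find_malformed_rows(sample_rows, sample_header)
print(bad_indexes)

# Delete from the end so that earlier indexes stay valid
for i in reversed(bad_indexes):
    del sample_rows[i]
print(len(sample_rows))
```

Unlike `del android[10472]`, this kind of check can be run more than once without removing a correct row.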


## Deleting Duplicate Entries

Duplicate entries are very common in data sets, so it is no surprise that the data sets we have for Android and iOS also contain duplicate entries.

For example, the Instagram app has several entries in the Google Play data set:

In [7]:
# create a loop to go through for each row in the android data set
for app in android:
    # column index 0 represents the name of the app
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We can see there are 4 entries for the Instagram app. We suspect other apps have duplicate entries too, so let's check. We will loop over the data set and collect the names that appear more than once.


In [8]:
# Create an empty list for both duplicate and unique apps
duplicate_apps =[]
unique_apps = []

# Create a loop to go through each row and put the names in the unique_apps list
for app in android:
    name = app[0]
    # Check if name is already in unique_apps, if true then append to duplicate_apps
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
# Calculate the length of the duplicate_apps list and print out the number of apps        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
# print the first 15 examples of duplicate apps
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
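
The same count can be obtained with `collections.Counter`, which avoids the repeated `name in unique_apps` list lookups. This is a sketch on made-up rows, and `find_duplicates` is a hypothetical helper name.

```python
from collections import Counter


def find_duplicates(rows, name_index=0):
    """Map each app name that occurs more than once to its number of rows."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample rows: 'Box' appears three times
sample_rows = [
    ["Box", "BUSINESS"],
    ["Slack", "BUSINESS"],
    ["Box", "BUSINESS"],
    ["Box", "BUSINESS"],
]

duplicates = find_duplicates(sample_rows)
print(duplicates)

# Rows beyond the first occurrence of each name; this is the quantity
# that len(duplicate_apps) measures in the cell above
extra_rows = sum(n - 1 for n in duplicates.values())
print(extra_rows)
```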


We don't want to count any app more than once when we analyse the data, so we need to remove the duplicate entries and keep only one entry per app. We could remove the duplicate rows randomly, but we can probably find a better way.

If we look at the rows we printed for the Instagram app, we can see that the fourth position of each row corresponds to the number of reviews. The different numbers suggest the data was collected at different times. The higher the number of reviews, the more recent the data should be, so we will keep the row with the highest number of reviews and remove the other rows.

In [9]:
print(android_header)
print('\n')
for app in android:
    # column index 0 represents the name of the app
    name = app[0]
    if name == 'Instagram':
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


To remove the duplicates, we will:

- Create a dictionary, where each key is a unique app name and the corresponding value is the highest number of reviews of that app.

- Create a new data set using the information stored in the dictionary. The new data set will have only one entry per app: for each app, we'll select only the entry with the highest number of reviews.


#### First objective: 
Create a dictionary that contains each unique app name and its highest number of reviews

In [10]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    # if name is in the dictionary and n_reviews is larger than the current corresponding value for the key
    if name in reviews_max and reviews_max[name] < n_reviews:
        # if true, update the corresponding value for the dictionary key
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        # Create a new entry in the dictionary
        reviews_max[name] = n_reviews
        
# Check that the dictionary has one key per unique app
# expected = number of rows - number of duplicate rows
expected = len(android)-len(duplicate_apps)
print(expected)
print(len(reviews_max))

9659
9659


The dictionary has 9659 keys, which matches the expected number of unique apps. The `android` list itself still contains the duplicate rows; we remove them in the next step.

#### Second objective: 
Create a new data set with only one entry per app

In [11]:
# Remove the duplicate rows by using two lists: android_clean and already_added
# To store new cleaned data set
android_clean = []
# To store app names
already_added = []

# create a loop with conditions to add the apps from 'android' data set to 'android_clean' list 
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    # if n_reviews is the same as the corresponding value in reviews_max dictionary
    # and app name is not in 'already_added' list to avoid adding duplicate apps with the same number of reviews.
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        

Now let's look at the new data set, and make sure that the number of rows is 9,659.

In [12]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


Indeed, we have 9659 rows. Our new data set is free of duplicate entries.
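
As an alternative to the two passes above, the row with the most reviews can be kept directly in a dictionary. This is a sketch under stated assumptions rather than the method used in this notebook: `dedupe_by_max_reviews` is a hypothetical name, the sample rows are shortened, and the result is ordered by the first appearance of each app name, which can differ from the row order in `android_clean`.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # The strict '>' keeps the earlier row when two rows tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened sample rows: name, category, rating, reviews
sample_rows = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Box", "BUSINESS", "4.2", "159"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
]

deduped = dedupe_by_max_reviews(sample_rows)
for row in deduped:
    print(row)
```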

## Removing Non-English Apps

Since the apps we develop at our company are in English, we'd like to analyze only the apps that are directed toward an English-speaking audience. Both of our data sets contain apps with non-English names, so we need to remove them. Below are some examples:

In [13]:
print(ios[7157][1])
print(ios[7164][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

机で卓球
ＣＲスーパー海物語ＩＮ沖縄４


中国語 AQリスニング
لعبة تقدر تربح DZ


The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.

If an app name contains a character whose number is greater than 127, then the app probably has a non-English name. We will create a function that checks whether every character in a string falls within the ASCII range.

In [14]:
def english_words(string):
    for char in string:
        if ord(char) > 127:
            return False
    
    return True

print(english_words('机で卓球'))
print(english_words('لعبة تقدر تربح'))
print(english_words('Instagram'))

False
False
True


The function seems to work fine, but we need to take symbols and emojis into account as some app names contain emojis and other symbols.

In [15]:
print(english_words('Docs To Go™ Free Office Suite'))
print(english_words('Instachat 😜'))
print(ord('™'))
print(ord('😜'))

False
False
8482
128540


Emojis and symbols fall outside the ASCII range, so the function wrongly rejects these names. To minimise data loss, we will remove an app only if its name contains more than three non-ASCII characters, because it is rare for an English app name to have more than three symbols or emojis.

In [16]:
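def english_words(string):
    # count the non-ASCII characters
    # (named non_ascii rather than sum to avoid shadowing the built-in)
    non_ascii = 0
    for char in string:
        if ord(char) > 127:
            non_ascii += 1

    # names with at most three non-ASCII characters are treated as English
    return non_ascii <= 3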
    
print(english_words('Docs To Go™ Free Office Suite'))
print(english_words('Instachat 😜'))

True
True


The function is now better, but it is still not perfect, as some non-English names might still be classified as English by our function. However, we shouldn't spend too much time on refining the function, as it is good enough for our analysis.
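
To make the remaining weakness concrete, the sketch below re-implements the same threshold rule under a hypothetical name, `is_probably_english`, and applies it to a short made-up name consisting of three non-ASCII characters, which slips through the filter.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True when the name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


# One symbol only: accepted, as intended
print(is_probably_english("Docs To Go™ Free Office Suite"))

# Four non-ASCII characters: rejected, as intended
print(is_probably_english("机で卓球"))

# Only three non-ASCII characters: accepted although the name is not English
print(is_probably_english("爱奇艺"))
```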

Now we will use this new function to filter out non-English apps from both data sets.

In [17]:
# Android apps
android_english = []
for app in android:
    name = app[0]
    # if the function 'english_words' return True
    if english_words(name):
        android_english.append(app)

# iOS apps
ios_english = []
for app in ios:
    name = app[0]
    # if the function 'english_words' return True
    if english_words(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 4, True)
print('\n')
explore_data(ios_english, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10795
Number of columns: 13


['281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26',

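Now we're left with 10795 Android apps and 7197 iOS apps. These counts come with a caveat: the Android loop above iterates over `android` rather than `android_clean`, so the duplicate entries are still included, and the iOS loop reads `app[0]`, which is the `id` column now that the index column has been removed (the app name is at index 1), so no iOS app is filtered out.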

## Isolating the Free Apps

Since we only build apps that are free to download and install, our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, so we need to isolate the free apps for our analysis.

In [18]:
# Android
android_final = []
for app in android_english:
    price = app[7]
    if price == '0' or price == '0.0':
        android_final.append(app)
        
# iOS
ios_final = []
for app in ios_english:
    price = app[4]
    if price == '0' or price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))


9999
4056


We're now left with 9999 Android apps and 4056 iOS apps in our final data set before analysis.
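
The cell above compares the price text against '0' and '0.0'. A slightly more tolerant check is to convert the text to a number first. The sketch below is an illustration only: `is_free` is a hypothetical helper, and the possibility that some paid prices carry a leading `$` is an assumption, not something shown in the rows printed in this notebook.

```python
def is_free(price_text):
    """Return True when the price text represents zero."""
    # Assumption: a price may be written with a leading '$'
    cleaned = price_text.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number is treated as not free
        return False


for text in ["0", "0.0", "0.00", "3.99", "$4.99", "Everyone"]:
    print(repr(text), is_free(text))
```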

## Most common genres

Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

First, let's create a frequency table for each of our data sets to get a sense of the most common genres in each market. We'll use the prime_genre column for the iOS data set, and the Genres and Category columns for the Android data set.

We'll build two functions we can use to analyze the frequency tables:
- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in a descending order



In [40]:
def freq_table(dataset, index):
    frequency_table = {}
    for app in dataset:
        column = app[index]
        if column in frequency_table:
            frequency_table[column] += 1
        else:
            frequency_table[column] = 1
    

    frequency_percent ={}
    for row in frequency_table:
        proportion = frequency_table[row] / len(dataset)
        percentage = round((proportion*100),2)
        frequency_percent[row] = percentage
    return frequency_percent

# Create an empty list and store the frequency table with percentage so it can be sorted in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        # Reverse the order of dictionary key and corresponding value
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    # reverse = True so it will be in descending order
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
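
For comparison, the same two steps can be written with `collections.Counter` and a sort key, which avoids reversing the key and value into tuples. This is a sketch on made-up rows; `freq_table_percent` and `sorted_table` are hypothetical names and are not used elsewhere in the notebook.

```python
from collections import Counter


def freq_table_percent(rows, index):
    """Map each distinct value in a column to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    return {value: round(count / total * 100, 2) for value, count in counts.items()}


def sorted_table(rows, index):
    """Return (value, percentage) pairs sorted by descending percentage."""
    table = freq_table_percent(rows, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Made-up sample rows: name, genre
sample_rows = [["a", "Games"], ["b", "Games"], ["c", "Weather"], ["d", "Games"]]

ranking = sorted_table(sample_rows, 1)
for value, percentage in ranking:
    print(value, ':', percentage)
```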

In [47]:
print('     iOS  -----------  prime_genre (%)')  
print('\n')
display_table(ios, 11)

     iOS  -----------  prime_genre (%)


Games : 53.66
Entertainment : 7.43
Education : 6.29
Photo & Video : 4.85
Utilities : 3.45
Health & Fitness : 2.5
Productivity : 2.47
Social Networking : 2.32
Lifestyle : 2.0
Music : 1.92
Shopping : 1.7
Sports : 1.58
Book : 1.56
Finance : 1.45
Travel : 1.13
News : 1.04
Weather : 1.0
Reference : 0.89
Food & Drink : 0.88
Business : 0.79
Navigation : 0.64
Medical : 0.32
Catalogs : 0.14


In [48]:
print('      Android ----------  Category  (%)')  
print('\n')
display_table(android, 1)

      Android ----------  Category  (%)


FAMILY : 18.19
GAME : 10.55
TOOLS : 7.78
MEDICAL : 4.27
BUSINESS : 4.24
PRODUCTIVITY : 3.91
PERSONALIZATION : 3.62
COMMUNICATION : 3.57
SPORTS : 3.54
LIFESTYLE : 3.52
FINANCE : 3.38
HEALTH_AND_FITNESS : 3.15
PHOTOGRAPHY : 3.09
SOCIAL : 2.72
NEWS_AND_MAGAZINES : 2.61
SHOPPING : 2.4
TRAVEL_AND_LOCAL : 2.38
DATING : 2.16
BOOKS_AND_REFERENCE : 2.13
VIDEO_PLAYERS : 1.61
EDUCATION : 1.44
ENTERTAINMENT : 1.37
MAPS_AND_NAVIGATION : 1.26
FOOD_AND_DRINK : 1.17
HOUSE_AND_HOME : 0.81
LIBRARIES_AND_DEMO : 0.78
AUTO_AND_VEHICLES : 0.78
WEATHER : 0.76
ART_AND_DESIGN : 0.6
EVENTS : 0.59
PARENTING : 0.55
COMICS : 0.55
BEAUTY : 0.49


In [49]:
print('    Android ---------  Genres  (%)')  
print('\n')
display_table(android, 9)

    Android ---------  Genres  (%)


Tools : 7.77
Entertainment : 5.75
Education : 5.06
Medical : 4.27
Business : 4.24
Productivity : 3.91
Sports : 3.67
Personalization : 3.62
Communication : 3.57
Lifestyle : 3.51
Finance : 3.38
Action : 3.37
Health & Fitness : 3.15
Photography : 3.09
Social : 2.72
News & Magazines : 2.61
Shopping : 2.4
Travel & Local : 2.37
Dating : 2.16
Books & Reference : 2.13
Arcade : 2.03
Simulation : 1.85
Casual : 1.78
Video Players & Editors : 1.6
Puzzle : 1.29
Maps & Navigation : 1.26
Food & Drink : 1.17
Role Playing : 1.01
Strategy : 0.99
Racing : 0.9
House & Home : 0.81
Libraries & Demo : 0.78
Auto & Vehicles : 0.78
Weather : 0.76
Adventure : 0.69
Events : 0.59
Comics : 0.54
Art & Design : 0.54
Beauty : 0.49
Education;Education : 0.46
Card : 0.44
Parenting : 0.42
Board : 0.41
Educational;Education : 0.38
Casino : 0.36
Trivia : 0.35
Educational : 0.34
Casual;Pretend Play : 0.29
Word : 0.27
Entertainment;Music & Video : 0.25
Education;Pretend Play : 0.21
Music 

At first glance, games are very common on both platforms and are overwhelmingly the most common genre (54%) on iOS. On Android, 'GAME' is the second largest category (10.55%) after 'FAMILY' (18.19%).

'Entertainment' is the second most common genre in both the iOS table and the Android Genres table, which supports the idea that apps designed for fun are the most attractive style of app for our users.

Interestingly, 'Tools' apps seem to be quite popular on Android: they come out on top in the Genres table, whereas the closest iOS genres, 'Utilities' and 'Productivity', rank lower. 'Education' is another common genre on both platforms, ranking third in both the iOS table and the Android Genres table.

Generally, the App Store is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. Google Play, by contrast, has a larger share of tools and family apps. Note that the tables above were built from the `ios` and `android` lists rather than `ios_final` and `android_final`, so the percentages also include paid and non-English apps and, for Android, duplicate entries.



## Kind of Apps with Most Users

Now let's get an idea about the kind of apps with the most users.

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre.

For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. To overcome the problem, we'll take the total number of user ratings instead, which we can find in the rating_count_tot column.
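
The Installs values in the Android rows printed earlier are strings such as '10,000+', so they need to be converted to numbers before they can be averaged. The sketch below shows one way to do this; `parse_installs` is a hypothetical helper, and treating '10,000+' as exactly 10,000 is an assumption that gives a lower bound rather than the true number of installs.

```python
def parse_installs(text):
    """Convert an Installs string such as '1,000,000+' to a float."""
    # Remove the thousands separators and the trailing plus sign
    return float(text.replace(",", "").replace("+", ""))


for text in ["10,000+", "500,000+", "1,000,000,000+"]:
    print(text, '->', parse_installs(text))
```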




### App Store

We'll start by calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:
- Isolate the apps of each genre.
- Sum up the user ratings for the apps of that genre.
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

In [60]:
# Generate a frequency table for the prime_genre column to get the unique app genres
freq_table(ios, 11)


{'Games': 53.66,
 'Productivity': 2.47,
 'Weather': 1.0,
 'Shopping': 1.7,
 'Reference': 0.89,
 'Finance': 1.45,
 'Music': 1.92,
 'Utilities': 3.45,
 'Travel': 1.13,
 'Social Networking': 2.32,
 'Sports': 1.58,
 'Business': 0.79,
 'Health & Fitness': 2.5,
 'Entertainment': 7.43,
 'Photo & Video': 4.85,
 'Navigation': 0.64,
 'Education': 6.29,
 'Lifestyle': 2.0,
 'Food & Drink': 0.88,
 'News': 1.04,
 'Book': 1.56,
 'Medical': 0.32,
 'Catalogs': 0.14}

In [70]:
prime_genre = freq_table(ios, 11)

# Create a loop over each unique genres of the iOS data set
for genre in prime_genre:
    # Store the sum of user ratings(the number of ratings, not the actual ratings) specific to each genre
    total = 0
    # Store the number of apps specific to each genre
    len_genre = 0
    # Create a nested loop that runs once for each genre of the outer loop
    for app in ios_final:
        genre_app = app[11]
        # if the genre from iOS data set is the same as the genre in the frequency table
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', round(avg_n_ratings))

Games : 18925
Productivity : 19054
Weather : 47221
Shopping : 18747
Reference : 67448
Finance : 13522
Music : 56482
Utilities : 14010
Travel : 20216
Social Networking : 53078
Sports : 20129
Business : 6368
Health & Fitness : 19952
Entertainment : 10823
Photo & Video : 27250
Navigation : 25972
Education : 6266
Lifestyle : 8978
Food & Drink : 20179
News : 15893
Book : 8498
Medical : 460
Catalogs : 1780
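
The nested loop above scans the whole data set once per genre. The same averages can be computed in a single pass by accumulating a total and a count for each genre. This is a sketch on made-up rows, with `average_by_group` as a hypothetical helper name. Because it only creates groups for values that actually occur in the rows, it also cannot divide by zero when a genre from the frequency table has no apps in the filtered data set.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return the mean of a numeric column for each distinct group value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample rows: genre, number of ratings
sample_rows = [["Games", "10"], ["Games", "30"], ["Weather", "5"]]

averages = average_by_group(sample_rows, 0, 1)
for genre, avg in averages.items():
    print(genre, ':', round(avg))
```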


Reference apps have the highest average number of user ratings. Let's have a look at the Reference apps with over 1000 total ratings.

In [95]:
for app in ios_final:
    if app[11] == 'Reference' and (float(app[5]) > 1000):
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497


We can see that a religious app (Bible) comes out on top. Since we are trying to build apps for a large variety of users, we need to look at other genres before deciding on our future apps, although we should bear in mind that dictionaries are quite popular too.

Next, we will look at Social Networking apps with over 100000 total ratings.

In [86]:
for app in ios_final:
    if app[11] == 'Social Networking' and (float(app[5]) > 100000):
        print(app[1], ':', app[5])

Facebook : 2974676
Skype for iPhone : 373519
Tumblr : 334293
WhatsApp Messenger : 287589
TextNow - Unlimited Text + Calls : 164963
Kik : 260965
Viber Messenger – Text & Call : 164249
ooVoo – Free Video Call, Text and Voice : 177501
Pinterest : 1061624
Messenger : 351466
Followers - Social Analytics For Instagram : 112778


It's clear that Facebook is the most popular app while apps like Skype, Tumblr and WhatsApp are also quite popular with our iOS users.

In [88]:
for app in ios_final:
    if app[11] == 'Music' and (float(app[5]) > 10000):
        print(app[1], ':', app[5])

Pandora - Music & Radio : 1126879
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
Sonos Controller : 48905
Spotify Music : 878563
SoundCloud - Music & Audio : 135744
Sing Karaoke Songs Unlimited with StarMaker : 26227
SoundHound Song Search & Music Player : 82602
Ringtones for iPhone & Ringtone Maker : 25403
TuneIn Radio - MLB NBA Audiobooks Podcasts Music : 110420
Magic Piano by Smule : 131695
Bandsintown Concerts : 30845
edjing Mix:DJ turntable to remix and scratch music : 13580
Smule Sing! : 119316
Amazon Music : 106235
AutoRap by Smule : 18202
My Mixtapez Music : 26286
Karaoke - Sing Karaoke, Unlimited Songs! : 28606
Napster - Top Music & Radio : 14268
Musi - Unlimited Music For YouTube : 25193
Spinrilla - Mixtapes For Free : 15053
Google Play Music : 10118
Free Piano app by Yokee : 13016
Free Music - MP3 Streamer & Playlist Manager Pro : 13443


Looking at the spread of ratings across the different music apps, it seems that users are willing to try a variety of music apps, so this is a genre our company could focus on. Social networking, by contrast, is dominated by a few big names, so a new app would have little impact there and might take a long time to establish itself in the market.

In [98]:
for app in ios_final:
    if app[11] == 'Productivity' and (float(app[5]) > 10000):
        print(app[1], ':', app[5])

Evernote - stay organized : 161065
iTranslate - Language Translator & Dictionary : 123215
Dropbox : 49578
Documents 6 - File manager, PDF reader and browser : 29110
Microsoft OneNote : 39638
Gmail - email by Google: secure, fast & organized : 135962
Hotspot Shield Free VPN Proxy & Wi-Fi Privacy : 32499
Microsoft OneDrive – File & photo cloud storage : 12797
Paper by FiftyThree - Sketch, Diagram, Take Notes : 18219
Google Drive - free online storage : 59255
T-Mobile : 19977
Yahoo Mail - Keeps You Organized! : 113709
MyScript Calculator - Handwriting calculator : 16555
Microsoft Word : 47999
Microsoft PowerPoint : 10939
Microsoft Excel : 24430
Drawing Desk - Draw, Paint, Doodle & Sketch board : 11040
Tayasui Sketches : 11505
Ever - Capture Your Memories : 12755
Speak & Translate － Voice and Text Translator : 12062
Google Docs : 64259
Google Sheets : 24602
Inbox by Gmail : 21561
Email - Fast & Secure mail for Gmail iCloud Yahoo : 10778
Microsoft Outlook - email and calendar : 32807
VPN Pr

Apps for processing and reading ebooks seem quite popular, as do various translators and dictionaries, so it's probably a good idea to build similar apps.
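
The listing above prints matching apps in file order, which makes the most-rated ones hard to spot. As a minimal sketch, assuming the same column layout used in the cell (name at index 1, rating count at index 5, genre at index 11), the matches could be sorted by rating count. The function name `top_apps_by_ratings`, the helper `make_row`, and the sample rows are hypothetical and only serve the illustration.

```python
def top_apps_by_ratings(dataset, genre, min_ratings,
                        name_index=1, ratings_index=5, genre_index=11):
    """Return (name, rating_count) pairs for one genre, most-rated first."""
    matches = []
    for app in dataset:
        rating_count = float(app[ratings_index])
        if app[genre_index] == genre and rating_count > min_ratings:
            matches.append((app[name_index], rating_count))
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


def make_row(name, rating_count, genre):
    """Build a hypothetical 12-column row; only indexes 1, 5 and 11 are filled."""
    row = [''] * 12
    row[1], row[5], row[11] = name, rating_count, genre
    return row


# Hypothetical sample data, not taken from AppleStore.csv
sample = [
    make_row('Notes Example', '12000', 'Productivity'),
    make_row('Mail Example', '45000', 'Productivity'),
    make_row('Tiny Example', '300', 'Productivity'),
    make_row('Song Example', '90000', 'Music'),
]

for name, rating_count in top_apps_by_ratings(sample, 'Productivity', 10000):
    print(name, ':', round(rating_count))
```

With the real data, the call would presumably be `top_apps_by_ratings(ios_final, 'Productivity', 10000)`.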

### Google Play Store

We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture of genre popularity. Let's display the frequency table for the install counts (column index 5) of the Android data set.

In [64]:
display_table(android_final, 5) 

1,000,000+ : 15.52
10,000,000+ : 12.49
100,000+ : 10.72
10,000+ : 9.16
1,000+ : 7.52
5,000,000+ : 7.5
100+ : 6.2
500,000+ : 5.27
50,000+ : 4.3
100,000,000+ : 4.09
5,000+ : 4.07
10+ : 3.15
500+ : 2.9
50,000,000+ : 2.89
50+ : 1.71
500,000,000+ : 0.72
5+ : 0.7
1,000,000,000+ : 0.58
1+ : 0.45
0+ : 0.04
0 : 0.01


The install numbers aren't very precise: most values are open-ended (100+, 1,000+, 5,000+, etc.). That is fine for our purposes, though. We only want to find out which app genres attract the most users, so we don't need perfect precision with respect to the number of users.

We'll treat an app with 100,000+ installs as having 100,000 installs, an app with 1,000,000+ installs as having 1,000,000 installs, and so on. For the calculation, we first need to remove the commas and plus characters from each install number, and then convert the cleaned string to a float.
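
As a small illustration of that cleaning step on its own, the sketch below wraps it in a helper. The name `parse_installs` and the sample strings are hypothetical; the strings simply follow the format seen in the frequency table above.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    # Remove thousands separators and the trailing plus sign first,
    # because float() cannot parse them.
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)


# Hypothetical sample values in the same format as the install column
for raw in ['100,000+', '1,000,000+', '0']:
    print(raw, '->', parse_installs(raw))
```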

In [71]:
# Build the frequency table once to get the list of categories
category_android = freq_table(android_final, 1)

for category in category_android:
    total = 0           # sum of installs in this category
    len_category = 0    # number of apps in this category
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            # Strip ',' and '+' so the string can be converted to a float
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_install = total / len_category
    print(category, ' : ', round(avg_install))

ART_AND_DESIGN  :  2038051
AUTO_AND_VEHICLES  :  647318
BEAUTY  :  513152
BOOKS_AND_REFERENCE  :  9655197
BUSINESS  :  2250454
COMICS  :  950443
COMMUNICATION  :  90935672
DATING  :  1164271
EDUCATION  :  5760596
ENTERTAINMENT  :  19516735
EVENTS  :  253542
FINANCE  :  2511356
FOOD_AND_DRINK  :  2190710
HEALTH_AND_FITNESS  :  4869226
HOUSE_AND_HOME  :  1917187
LIBRARIES_AND_DEMO  :  749950
LIFESTYLE  :  1479957
GAME  :  33111303
FAMILY  :  5784095
MEDICAL  :  147563
SOCIAL  :  48184459
SHOPPING  :  12637504
PHOTOGRAPHY  :  32321374
SPORTS  :  4860919
TRAVEL_AND_LOCAL  :  27921561
TOOLS  :  14988277
PERSONALIZATION  :  7533233
PRODUCTIVITY  :  35885138
PARENTING  :  542604
WEATHER  :  5747142
VIDEO_PLAYERS  :  36599010
NEWS_AND_MAGAZINES  :  27058831
MAPS_AND_NAVIGATION  :  5569698
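
The cell above walks through the whole data set once per category. As an alternative sketch, the same averages could be collected in a single pass with two dictionaries and then sorted so the largest categories come first. The function name `average_installs_by_category` and the sample rows are hypothetical; the column positions (category at index 1, installs at index 5) are taken from the cell above.

```python
from collections import defaultdict


def average_installs_by_category(dataset, category_index=1, installs_index=5):
    """Return {category: average installs}, reading the data set only once."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for app in dataset:
        installs = app[installs_index].replace(',', '').replace('+', '')
        totals[app[category_index]] += float(installs)
        counts[app[category_index]] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Hypothetical rows: only indexes 1 and 5 matter here
sample = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '3,000+'],
    ['App C', 'GAME', '', '', '', '500+'],
]

averages = average_installs_by_category(sample)
# Print categories from highest to lowest average
for category, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(category, ':', round(avg))
```

Sorting the result makes the top categories easier to read off than the unsorted listing.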


For Android users, communication apps have the highest average number of installs: 90,935,672.

In [74]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'):
        print(app[0], ':', app[5])

Messenger – Text and Video Chat for Free : 1,000,000,000+
WhatsApp Messenger : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+
imo free video calls and chat : 500,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
WhatsApp Messenger : 1,000,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Viber Messenger : 500,000,000+
Hangouts : 1,000,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Viber Messenger : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Viber Messenger : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages :

As we can see, a few communication apps have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others have over 500 million installs.
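
Several app names appear more than once in the printed list. A minimal sketch for listing each name only once, while keeping the original order, is shown below. The function name `unique_names` and the sample rows are hypothetical, and the sketch only tidies the display; it does not change the data set itself.

```python
def unique_names(dataset, name_index=0):
    """Return app names in first-seen order, skipping repeats."""
    seen = set()
    names = []
    for app in dataset:
        name = app[name_index]
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names


# Hypothetical rows with a repeated name
sample = [
    ['Chat Example', 'COMMUNICATION'],
    ['Mail Example', 'COMMUNICATION'],
    ['Chat Example', 'COMMUNICATION'],
]

for name in unique_names(sample):
    print(name)
```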

Across both platforms, social networking and communication apps are clearly the most popular with our users. Apps like Facebook, Viber, WhatsApp, and Gmail are the kind of apps that our company should focus on building in the future.

Now let's look at entertainment apps:

In [93]:
# Install buckets of 5,000,000+ and above
popular_installs = ('1,000,000,000+', '500,000,000+', '100,000,000+',
                    '50,000,000+', '10,000,000+', '5,000,000+')

for app in android_final:
    if app[1] == 'ENTERTAINMENT' and app[5] in popular_installs:
        print(app[0], ':', app[5])

Netflix : 100,000,000+
Tubi TV - Free Movies & TV : 10,000,000+
YouTube Kids : 50,000,000+
Mobile TV : 10,000,000+
TV+ : 5,000,000+
Digital TV : 5,000,000+
Motorola Spotlight Player™ : 10,000,000+
Vigo Lite : 5,000,000+
Google Play Games : 1,000,000,000+
Hotstar : 100,000,000+
Peers.TV: broadcast TV channels First, Match TV, TNT ... : 5,000,000+
Spectrum TV : 5,000,000+
H TV : 5,000,000+
MEGOGO - Cinema and TV : 10,000,000+
Talking Angela : 100,000,000+
DStv Now : 5,000,000+
ivi - movies and TV shows in HD : 10,000,000+
Viki: Asian TV Dramas & Movies : 10,000,000+
Talking Ginger 2 : 50,000,000+
Girly Lock Screen Wallpaper with Quotes : 5,000,000+
No.Draw - Colors by Number 2018 : 10,000,000+
Movies by Flixster, with Rotten Tomatoes : 10,000,000+
BBC Media Player : 10,000,000+
Amazon Prime Video : 50,000,000+
IMDb Movies & TV : 100,000,000+
Twitch: Livestream Multiplayer Games & Esports : 50,000,000+
YouTube Gaming : 5,000,000+
PlayStation App : 50,000,000+
Talking Ben the Dog : 100,000
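
The cell above selects apps by naming each install bucket explicitly. Since every bucket from 5,000,000+ upward is included, the same selection could be sketched as a numeric threshold on the cleaned install count. The function name `apps_with_min_installs` and the sample rows are hypothetical; the column positions (name at index 0, category at index 1, installs at index 5) follow the cell above.

```python
def apps_with_min_installs(dataset, category, min_installs,
                           name_index=0, category_index=1, installs_index=5):
    """Return (name, installs_string) pairs at or above a numeric threshold."""
    matches = []
    for app in dataset:
        if app[category_index] != category:
            continue
        # Clean the install string before comparing it as a number
        installs = float(app[installs_index].replace(',', '').replace('+', ''))
        if installs >= min_installs:
            matches.append((app[name_index], app[installs_index]))
    return matches


# Hypothetical rows: only indexes 0, 1 and 5 matter here
sample = [
    ['Video Example', 'ENTERTAINMENT', '', '', '', '10,000,000+'],
    ['Small Example', 'ENTERTAINMENT', '', '', '', '1,000,000+'],
    ['Game Example', 'GAME', '', '', '', '50,000,000+'],
]

for name, installs in apps_with_min_installs(sample, 'ENTERTAINMENT', 5_000_000):
    print(name, ':', installs)
```

A threshold is shorter to write and easier to adjust than a list of bucket strings, at the cost of parsing each install value.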

Let's look at books and reference apps.

In [94]:
# Install buckets of 5,000,000+ and above
popular_installs = ('1,000,000,000+', '500,000,000+', '100,000,000+',
                    '50,000,000+', '10,000,000+', '5,000,000+')

for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and app[5] in popular_installs:
        print(app[0], ':', app[5])

Wattpad 📖 Free Books : 100,000,000+
Wikipedia : 10,000,000+
Amazon Kindle : 100,000,000+
Cool Reader : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
NOOK: Read eBooks & Magazines : 10,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Oxford Dictionary of English : Free : 10,000,000+
Spanish English Translator : 10,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English Dictionary - Offline : 10,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Ebook Reader : 5,000,000+
Aldiko Book Reader : 10,000,000+
Wattpad 📖 Free Books : 100,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Quran for Android : 1

We can see that dictionaries, ebook readers, and translators are very popular with our users.

## Conclusions

In this project, we analysed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets. 


We concluded that making a dictionary or translation app could be profitable for both the Google Play and the App Store markets. Another option is to take a recent popular book and turn it into an app. Although we suggested making a music app for iOS, books and reference apps are the better choice for both platforms. The markets are already full of libraries, so we need to add some special features beyond the raw version of the book to make the app more appealing to our users.