## Profitable App Profiles for the App Store and Google Play Markets

### About: 

We're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better.

### Goal: 
Our goal is to analyze data to help our developers understand which types of apps are likely to attract more users on Google Play and the App Store.

#### Exploring Data

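As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. Rather than collecting data on more than four million apps, we'll analyze a sample, using two data sets: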

- A <a href = 'https://www.kaggle.com/lava18/google-play-store-apps'>data set</a> containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from <a href = 'https://dq-content.s3.amazonaws.com/350/googleplaystore.csv'> this link</a>.
- A <a href = 'https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps'>data set</a> containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from <a href='https://dq-content.s3.amazonaws.com/350/AppleStore.csv'> this link</a>.

Let's start by opening the two data sets and then continue with exploring the data.

In [2]:
from csv import reader

# App Store (iOS) apps data
with open('data/AppleStore.csv', encoding='utf8') as opened_file:
    ios_all_data = list(reader(opened_file))
ios_header = ios_all_data[0]
ios_data = ios_all_data[1:]

# Google Play Store (Android) apps data
with open('data/googleplaystore.csv', encoding='utf8') as opened_file:
    android_all_data = list(reader(opened_file))
android_header = android_all_data[0]
android_data = android_all_data[1:]

To make the data easier to explore, we created a function named explore_data() that we can call repeatedly to print rows in a readable way.

In [3]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

__Let's explore the iOS dataset now__

In [4]:
print(ios_header)
print('\n')
explore_data(ios_data, 2, 6, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']


Number of rows:  7197
Number of columns:  16


#### **_The App Store data set contains 7,197 apps, and each app has 16 columns of information associated with it._**

The columns that could help in our analysis are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'prime_genre'

Not all of the column names are self-explanatory. To learn more about each column, see the data set documentation at <a href= 'https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps'> this link</a>.

__Let's explore the Android dataset now__

In [5]:
print(android_header)
print('\n')
explore_data(android_data, 2, 6, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows:  10841
Number of columns:  13


#### **_The Google Play data set contains 10,841 apps, and each app has 13 columns of information associated with it._**

The columns of interest are: 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', 'Genres'

#### Detecting Inaccurate Data, and Correcting or Removing It

Recall that at our company, we only build apps that are free to download and install, and that are directed toward an English-speaking audience.

In other words, we need to:

- Remove non-English apps
- Remove apps that aren't free

This process of preparing our data for analysis is called __data cleaning__. Data cleaning is done before the analysis; it includes removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis.

The Google Play data set has a dedicated <a href='https://www.kaggle.com/lava18/google-play-store-apps/discussion'> discussion section</a>, and <a href= 'https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015'> one of the discussions</a> describes an error in row 10472.

Let's print this row and compare it with the header:

In [6]:
print(android_data[10472])
print('\n')
print(android_header)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


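Row 10472 is missing its __'Category'__ entry, which shifts every later value one column to the left: the rating of 1.9 sits where the category should be, and the row has only 12 entries instead of 13. We'll remove this row from our dataset.

__Note: DO NOT RUN the _del_ command MORE THAN ONCE. Each run deletes whatever row is currently at index 10472, so a second run removes a valid row and you'll end up LOSING DATA.__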

In [7]:
del android_data[10472]
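One way to make the deletion safe to re-run is to delete the row only while it still shows the symptom (a length that differs from the header's). The sketch below uses toy data, and the helper name delete_if_malformed is hypothetical, not part of the original notebook.

```python
def delete_if_malformed(rows, header, index):
    """Delete rows[index] only if it exists and its length differs from the header's.

    Returns True if a row was deleted, False otherwise.
    """
    if index < len(rows) and len(rows[index]) != len(header):
        del rows[index]
        return True
    return False

# Toy stand-ins for android_header and android_data (not the real data set).
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # 'Category' is missing, so the row is short
    ['App C', 'GAME', '4.5'],
]

first_run = delete_if_malformed(toy_rows, toy_header, 1)
second_run = delete_if_malformed(toy_rows, toy_header, 1)  # no-op on the second run
print(first_run, second_run, len(toy_rows))
```

On the second call, the row now at index 1 is well-formed, so nothing is deleted.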

Read the <a href='https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion'> discussion section</a> for the App Store data set, and see whether you can find any reports of wrong data.

To look for the same type of error as in the Google Play data set, use the code below. It checks for missing values by comparing the length of each row to the length of the header.

In [8]:
for row in ios_data:
    if len(row) != len(ios_header):
        print(ios_data.index(row), row)
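The loop above prints nothing for the App Store data, which means every row has the same length as the header. As a small variation, the sketch below gets each row's position from enumerate() instead of list.index(), because index() always returns the position of the first matching row, which can mislead when two malformed rows are identical. The helper name find_malformed_rows and the toy data are assumptions for illustration.

```python
def find_malformed_rows(rows, header):
    """Return (position, row) pairs for rows whose length differs from the header's."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != len(header)]

# Toy data (not the real data set): two identical short rows at positions 1 and 3.
toy_header = ['id', 'track_name', 'price']
toy_rows = [
    ['1', 'App A', '0.0'],
    ['2', 'App B'],
    ['3', 'App C', '0.0'],
    ['2', 'App B'],
]

malformed = find_malformed_rows(toy_rows, toy_header)
print(malformed)

# list.index() reports the first match only, so it would print 1 for both rows.
print(toy_rows.index(['2', 'App B']))
```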

#### Removing Duplicate rows: Part One

If you explore the Google Play data set long enough or look at the <a href= 'https://www.kaggle.com/lava18/google-play-store-apps/discussion'> discussions</a> section, you'll notice some apps have duplicate entries. For instance, Instagram has four entries:

In [9]:
for app in android_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [10]:
duplicate_apps = []
unique_apps = []

for app in android_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Number of unique apps: ', len(unique_apps))

Number of duplicate apps:  1181


Number of unique apps:  9659


In total, there are 1,181 cases where an app occurs more than once.

##### A few of the duplicate entries

In [11]:
print('List of a few duplicate apps:', duplicate_apps[:10])

List of a few duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


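**_Note_** that we don't want to remove duplicates at random. Looking at the Instagram entries, the only field that differs is the number of reviews (the fourth column), which suggests the data was collected at different times.

The higher the number of reviews, the more recent the data should be, so we'll keep the entry with the highest number of reviews and use this criterion to filter duplicates.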

To remove the duplicates, we will:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
- Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

#### Part Two

1. Create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

    - Start by creating an empty dictionary named reviews_max.
    - Loop through the Google Play data set (make sure you don't include the header row). For each iteration:
        - Assign the app name to a variable named name.
        - Convert the number of reviews to float. Assign it to a variable named n_reviews.
        - If name already exists as a key in the reviews_max dictionary and reviews_max[name] < n_reviews, update the number of reviews for that entry in the reviews_max dictionary.
        - If name is not in the reviews_max dictionary as a key, create a new entry in the dictionary where the key is the app name, and the value is the number of reviews. Make sure you don't use an else clause here, otherwise the number of reviews will be incorrectly updated whenever reviews_max[name] < n_reviews evaluates to False.
    - Inspect the dictionary to make sure everything went as expected. Measure the length of the dictionary — remember that the expected length is 9,659 entries.

In [12]:
reviews_max={}
for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659


2. Use the dictionary you created above to remove the duplicate rows:

    - Start by creating two empty lists: android_clean (which will store our new cleaned data set) and already_added (which will just store app names).
    - Loop through the Google Play data set (make sure you don't include the header row), and for each iteration:
        - Assign the app name to a variable named name.
        - Convert the number of reviews to float, and assign it to a variable named n_reviews.
        - If n_reviews is the same as the maximum number of reviews for that app (found in the reviews_max dictionary) and name is not already in the already_added list (without this second check, duplicate rows that tie at the maximum would each be added):
            - Append the entire row to the android_clean list (which will eventually be a list of lists and store our cleaned data set).
            - Append the app name to the already_added list; this helps us keep track of the apps we have already added.

In [13]:
android_clean = []
already_added = []

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
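To see why the name check matters, here is a minimal sketch on toy rows (hypothetical app names and review counts, with only two columns instead of the real thirteen). It also stores the names in a set rather than a list, which is a swapped-in choice to make membership tests faster.

```python
# Toy rows: [name, reviews]. 'App A' has two identical entries.
toy_rows = [
    ['App A', '100'],
    ['App A', '100'],   # exact duplicate: ties with the row above at the maximum
    ['App B', '40'],
    ['App B', '55'],
]

# Highest number of reviews per app name.
reviews_max = {}
for name, reviews in toy_rows:
    n_reviews = float(reviews)
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

# Without the name check, both tied 'App A' rows pass the filter.
without_check = [row for row in toy_rows if float(row[1]) == reviews_max[row[0]]]

# With the name check, only the first row at the maximum is kept.
with_check = []
already_added = set()
for row in toy_rows:
    name, n_reviews = row[0], float(row[1])
    if n_reviews == reviews_max[name] and name not in already_added:
        with_check.append(row)
        already_added.add(name)

print(len(without_check))  # 3 rows: 'App A' appears twice
print(len(with_check))     # 2 rows: one per app
```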

Let's now explore the '__android_clean__' dataset to ensure everything went as expected. The dataset should have 9,659 rows.

In [14]:
explore_data(android_clean, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows:  9659
Number of columns:  13


Great! We have 9,659 rows, as expected.

#### Removing Non-English Apps: Part One

If we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience.

In [15]:
print(ios_data[813][1])
print(ios_data[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


We're not interested in keeping these apps, so we'll remove them. One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Under the ASCII (American Standard Code for Information Interchange) system, each of these characters corresponds to a number in the range 0 to 127. So we can build a function that checks whether every character in an app name falls within this range.

If an app name contains a character whose number is greater than 127, the app probably has a non-English name. Our app names, however, are stored as strings, so how can we take each individual character of a string and check its corresponding number?

We can get the corresponding number of each character using the ord() built-in function.
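As a quick illustration, the sketch below calls ord() on a few sample characters (chosen here for illustration) and checks each result against the 0 to 127 range.

```python
# Characters common in English text map to numbers from 0 to 127.
samples = ['a', 'A', '5', '+', '™']
for char in samples:
    code = ord(char)
    print(repr(char), code, code <= 127)

codes = [ord(char) for char in samples]
```

The letters, the digit, and the plus sign all fall inside the range, while the trademark symbol does not.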

1. Write a function that takes in a string and returns False if there's any character in the string that doesn't belong to the set of common English characters, otherwise it returns True.

    - Inside the function, iterate over the input string. For each iteration check whether the number associated with the character is greater than 127. When a character is greater than 127, the function should immediately return False — the app name is probably non-English since it contains a character that doesn't belong to the set of common English characters.
    - If the loop finishes without executing the return statement, no character had a corresponding number over 127; the app name is probably English, so the function should return True.

In [16]:
def is_english(a_string):
    for char in a_string:
        if ord(char) > 127:
            return False
    return True 

2. Use your function to check whether these app names are detected as English or non-English:

- 'Instagram'
- '爱奇艺PPS -《欢乐颂2》电视剧热播'
- 'Docs To Go™ Free Office Suite'
- 'Instachat 😜'

In [17]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


The function misclassifies English names that contain emoji or other symbols, such as ™, em dashes, and en dashes, because these characters fall outside the ASCII range.

So, our function is not ready to run on the actual dataset. Let's look at ways to modify it so we do not lose useful data.

In [18]:
print(ord('™'))
print(ord('😜'))

8482
128540


#### Part Two

To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

1. If the input string has more than three characters that fall outside the ASCII range (0 - 127), then the function should return False (identify the string as non-English), otherwise it should return True.

In [19]:
def is_english(a_string):
    non_ascii_count = 0
    for char in a_string:
        if ord(char) > 127:
            non_ascii_count += 1  
    if non_ascii_count > 3:
        return False
    else:
        return True

2. Use the new function to check whether these app names are detected as English or non-English:

- 'Docs To Go™ Free Office Suite'
- 'Instachat 😜'
- '爱奇艺PPS -《欢乐颂2》电视剧热播'

In [20]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
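The limit of three is a judgment call, so it can help to make it a parameter and see how the results change. The sketch below is a compact variant of the same rule; the function name is_mostly_ascii and the max_non_ascii parameter are hypothetical and do not appear in the original notebook.

```python
def is_mostly_ascii(a_string, max_non_ascii=3):
    """Return True if a_string has at most max_non_ascii characters above 127."""
    non_ascii_count = sum(1 for char in a_string if ord(char) > 127)
    return non_ascii_count <= max_non_ascii

names = [
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in names:
    # Compare the default limit of three with a strict limit of zero.
    print(name, is_mostly_ascii(name), is_mostly_ascii(name, max_non_ascii=0))
```

With the default limit the results match the function above; with a limit of zero the behavior falls back to the first, stricter version.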


3. Use the new function to filter out non-English apps from both data sets. Loop through each data set. If an app name is identified as English, append the whole row to a separate list.

- Our current iOS dataset:  __ios_data__
- Our current Android dataset:  __android_clean__

In [21]:
english_apps_ios = []
english_apps_android = []

# For the IOS data set
for row in ios_data:
    name = row[1]
    if is_english(name):
        english_apps_ios.append(row)

# For the Android data set
for row in android_clean:
    name = row[0]
    if is_english(name):
        english_apps_android.append(row)

4. Explore the data sets and see how many rows you have remaining for each data set.

In [22]:
explore_data(english_apps_ios, 0, 3, True) # IOS Dataset

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  6183
Number of columns:  16


Our initial iOS dataset had 7,197 rows. After removing non-English apps, we are left with 6,183 apps.

In [23]:
# The Android data
explore_data(english_apps_android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13


Our android_clean dataset had 9659 rows. Our new English-only dataset consists of 9614 rows.

#### Isolating Free Apps

So far in the data cleaning process, we:

1. Removed inaccurate data
2. Removed duplicate app entries
3. Removed non-English apps

We only build apps that are free to download and install, and our main source of revenue is in-app ads. So, we need to isolate the free apps for our analysis.

Isolating the free apps will be our last step in the data cleaning process.

- Our current iOS dataset:  __english_apps_ios__
- Our current Android dataset:  __english_apps_android__

Note: In the Android dataset, the '__Price__' column stores paid prices with a leading '$', in the form '$4.99'. We can either use the '__Type__' column (index 6) or strip the '$' from the price before converting it to a number. Either approach works.

In [24]:
free_apps_android = []
free_apps_ios = []

# The IOS dataset
for row in english_apps_ios:
    price = float(row[4])
    if price == 0:
        free_apps_ios.append(row)
        
# The Android dataset
for row in english_apps_android:
    price = float(row[7].strip('$')) # Alternatively, use the 'Type' column (index 6), whose values are 'Free' or 'Paid'
    if price == 0:
        free_apps_android.append(row)
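Since the Android data offers two ways to identify free apps, it can be worth checking that the 'Type' and 'Price' columns agree before relying on either one. The sketch below lists rows where they disagree; the helper names and the toy rows (including one deliberately inconsistent row) are assumptions for illustration.

```python
def price_to_float(price_text):
    """Convert a price string such as '$4.99' or '0' to a float."""
    return float(price_text.strip('$'))

def find_type_price_mismatches(rows, type_index=6, price_index=7):
    """Return rows where 'Type' and 'Price' disagree about whether the app is free."""
    mismatches = []
    for row in rows:
        free_by_type = row[type_index] == 'Free'
        free_by_price = price_to_float(row[price_index]) == 0
        if free_by_type != free_by_price:
            mismatches.append(row)
    return mismatches

# Toy rows holding only the first eight Android columns (not the real data set).
toy_rows = [
    ['App A', 'TOOLS', '4.1', '159', '19M', '10,000+', 'Free', '0'],
    ['App B', 'GAME', '4.5', '967', '2.8M', '100,000+', 'Paid', '$4.99'],
    ['App C', 'GAME', '4.0', '12', '5M', '1,000+', 'Paid', '0'],  # inconsistent on purpose
]

mismatches = find_type_price_mismatches(toy_rows)
print([row[0] for row in mismatches])
```

An empty result on the real data would suggest the two approaches select the same apps.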

Let's explore the datasets now to check how many rows we are left with.

In [25]:
# The IOS dataset
explore_data(free_apps_ios, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  3222
Number of columns:  16


In [26]:
# The Android dataset
explore_data(free_apps_android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  8864
Number of columns:  13


We are now left with __3222__ iOS apps and __8864__ Android apps.

#### Most Common Apps by Genre: Part One

So far, we spent a good amount of time on cleaning data, and:

1. Removed inaccurate data
2. Removed duplicate app entries
3. Removed non-English apps
4. Isolated the free apps

Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

We need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres in each market. For this, we'll need to build frequency tables for a few columns in our data sets.

- For the __iOS__ dataset, we will use the '__prime_genre__' column.
- For the __Android__ dataset, we will use the '__Genres__' and '__Category__' columns.

#### Part Two

We'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in a descending order

In [27]:
def freq_table(dataset, index):
    # Count how many rows hold each distinct value in the given column
    counts = {}
    total_apps = 0
    for row in dataset:
        total_apps += 1
        value = row[index]
        if value in counts:
            counts[value] += 1
        else:
            counts[value] = 1

    # Convert the raw counts to percentages of all rows
    freq_table_percentages = {}
    for value in counts:
        percentage = (counts[value] / total_apps) * 100
        freq_table_percentages[value] = percentage

    return freq_table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
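For comparison, the same percentages can be sketched with collections.Counter from the standard library, which handles both the counting and the descending sort. The function name freq_table_counter and the toy rows are hypothetical; this is an alternative sketch, not the approach used in this project.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by count, highest first.
    return {value: count / total * 100 for value, count in counts.most_common()}

# Toy rows: [name, genre] (not the real data set).
toy_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]

table = freq_table_counter(toy_rows, 1)
for genre, percentage in table.items():
    print(genre, ':', percentage)
```

One difference to keep in mind: display_table() breaks ties by sorting on the key as well, while most_common() keeps tied values in the order they were first seen.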

#### Part Three

Display the frequency table of the columns prime_genre, Genres, and Category.

- Our Current IOS dataset:  __free_apps_ios__
- Our Current Android dataset:  __free_apps_android__

In [28]:
# The IOS dataset with the prime_genre column (index 11)
display_table(free_apps_ios, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In the iOS dataset, 58.2% of the apps are games, which makes Games the most common genre and more than half of the dataset. Entertainment is the runner-up with 7.9%, followed by Photo & Video with 5.0%. Only 3.7% of the apps are designed for Education, and Social Networking apps make up 3.3% of the dataset.

The general impression is that free, English-language App Store apps lean heavily toward entertainment (Games, Entertainment, Photo & Video, Social Networking, Sports, Music), while practical apps (Education, Shopping, Utilities, Productivity, Lifestyle) are less common.

However, we can't rely solely on this frequency table to recommend an app profile. A genre having numerous apps doesn't imply that those apps have a large number of users.

In [29]:
# The Android dataset with the Category column (index 1)
display_table(free_apps_android, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The Google Play Store seems to have a fairly balanced app collection, with apps for fun (GAME) as well as for practical purposes (TOOLS, LIFESTYLE, BUSINESS). Even though FAMILY is the most common category, when we checked it, we observed that it consists mostly of games for kids.

In [30]:
# The Android dataset with the genres column (index 9)
display_table(free_apps_android, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The App Store is more entertainment-oriented, while the Google Play Store has a more balanced set of apps. The frequency table for Genres is more granular than the one for Category, and it also shows a more balanced distribution than the App Store's.

Frequency tables alone still can't tell us which apps have the most users. So, let's check the number of users for each category or genre of apps to see if we can find a better app profile.

#### Most Popular Apps by Genre on the App Store

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre.

For the __Google Play__ data set, we can find this information in the **_Installs_** column, but this information is missing for the __App Store__ data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the **_rating_count_tot_** column.

- Our Current IOS dataset:  __free_apps_ios__
- Our Current Android dataset:  __free_apps_android__
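One practical detail about the **_Installs_** column: its values are strings such as '5,000,000+', so they need cleaning before they can be averaged. The sketch below shows one possible conversion; the helper name installs_to_float is hypothetical. Because each value is an open-ended bucket (for example, '100,000+' covers anything from 100,000 upward), averages built this way are rough estimates.

```python
def installs_to_float(installs_text):
    """Convert an 'Installs' string such as '1,000,000+' to a float."""
    return float(installs_text.replace(',', '').replace('+', ''))

# Sample values taken from the rows printed earlier in this notebook.
samples = ['5,000,000+', '1,000+', '1,000,000,000+']
converted = [installs_to_float(text) for text in samples]
print(converted)
```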

__Let's start with calculating the average number of user ratings per app genre on the App Store__

In [31]:
ios_prime_genre_ft = freq_table(free_apps_ios, 11)

for genre in ios_prime_genre_ft:
    total = 0 # This variable will store the sum of user ratings (the number of ratings, not the actual ratings) specific to each genre.
    len_genre = 0 # This variable will store the number of apps specific to each genre.
    for app in free_apps_ios:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre, ':', avg_ratings)

Medical : 612.0
Travel : 28243.8
Book : 39758.5
Sports : 23008.898550724636
Shopping : 26919.690476190477
Photo & Video : 28441.54375
Games : 22788.6696905016
Navigation : 86090.33333333333
Reference : 74942.11111111111
Entertainment : 14029.830708661417
Music : 57326.530303030304
Utilities : 18684.456790123455
Finance : 31467.944444444445
News : 21248.023255813954
Education : 7003.983050847458
Business : 7491.117647058823
Catalogs : 4004.0
Health & Fitness : 23298.015384615384
Productivity : 21028.410714285714
Weather : 52279.892857142855
Lifestyle : 16485.764705882353
Food & Drink : 33333.92307692308
Social Networking : 71548.34905660378


The Navigation genre has the highest average number of user ratings per app, at 86,090.33. Reference comes second, and Social Networking is also popular, with the third-highest average.
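The nested loop used for these averages scans the whole dataset once per genre. As an alternative sketch, the averages can be accumulated in a single pass and sorted from highest to lowest, which makes the ranking easier to read. The function name average_by_group and the toy rows are assumptions for illustration.

```python
def average_by_group(dataset, group_index, value_index):
    """Return (group, average) pairs sorted by average, highest first."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Toy rows: [name, rating count, genre] (not the real data set).
toy_rows = [
    ['App A', '100', 'Navigation'],
    ['App B', '50', 'Weather'],
    ['App C', '300', 'Navigation'],
]

ranking = average_by_group(toy_rows, group_index=2, value_index=1)
for genre, avg_ratings in ranking:
    print(genre, ':', avg_ratings)
```

For the real data, the equivalent call would be average_by_group(free_apps_ios, 11, 5).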

In [32]:
for app in free_apps_ios:
    if app[11] == 'Navigation':
        print(app[1], ': ', app[5]) # Name and total ratings

Waze - GPS Navigation, Maps & Real-time Traffic :  345046
Google Maps - Navigation & Transit :  154911
Geocaching® :  12811
CoPilot GPS – Car Navigation & Offline Maps :  3582
ImmobilienScout24: Real Estate Search in Germany :  187
Railway Route Search :  5


Even though Navigation is at the top, most of its ratings come from just two apps, Waze and Google Maps, which together have almost half a million ratings.

Now let's take a look at the Social Networking apps:

In [33]:
for app in free_apps_ios:
    if app[11] == 'Social Networking':
        print(app[1], ': ', app[5]) # Name and total ratings

Facebook :  2974676
Pinterest :  1061624
Skype for iPhone :  373519
Messenger :  351466
Tumblr :  334293
WhatsApp Messenger :  287589
Kik :  260965
ooVoo – Free Video Call, Text and Voice :  177501
TextNow - Unlimited Text + Calls :  164963
Viber Messenger – Text & Call :  164249
Followers - Social Analytics For Instagram :  112778
MeetMe - Chat and Meet New People :  97072
We Heart It - Fashion, wallpapers, quotes, tattoos :  90414
InsTrack for Instagram - Analytics Plus More :  85535
Tango - Free Video Call, Voice and Chat :  75412
LinkedIn :  71856
Match™ - #1 Dating App. :  60659
Skype for iPad :  60163
POF - Best Dating App for Conversations :  52642
Timehop :  49510
Find My Family, Friends & iPhone - Life360 Locator :  43877
Whisper - Share, Express, Meet :  39819
Hangouts :  36404
LINE PLAY - Your Avatar World :  34677
WeChat :  34584
Badoo - Meet New People, Chat, Socialize. :  34428
Followers + for Instagram - Follower Analytics :  28633
GroupMe :  28260
Marco Polo Video Walki

Similar to Navigation, Social Networking is heavily dominated by a few popular apps such as Facebook, Pinterest, Skype for iPhone, Messenger (owned by Facebook), Tumblr and WhatsApp Messenger (also owned by Facebook).

Even though we have these top genres, that is still not enough to build a proper app profile. When a few popular apps dominate an entire genre, it is harder for the other apps to reach the 10,000-ratings threshold. We could remove these very popular apps from each genre and recalculate the average.
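
As a rough illustration of what removing the most popular apps does to an average, the sketch below uses the six rating counts from the Navigation listing above. The helper `average_without_top` and its `n_top` parameter are hypothetical names introduced only for this example.

```python
# Rating counts copied from the Navigation listing above.
navigation_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Google Maps - Navigation & Transit': 154911,
    'Geocaching®': 12811,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
}


def average_without_top(ratings, n_top):
    """Average the rating counts after dropping the n_top largest values."""
    remaining = sorted(ratings.values(), reverse=True)[n_top:]
    return sum(remaining) / len(remaining)


avg_all = average_without_top(navigation_ratings, 0)       # all six apps
avg_trimmed = average_without_top(navigation_ratings, 2)   # without Waze and Google Maps

print('All apps        :', round(avg_all, 2))
print('Without top two :', avg_trimmed)
```

Without Waze and Google Maps, the Navigation average falls from about 86,090 to 4,146.25, which places the genre near the bottom of the table rather than at the top.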

Let's look at the Reference list as well:

In [34]:
for app in free_apps_ios:
    if app[11] == 'Reference':
        print(app[1], ': ', app[5]) # Name and total ratings

Bible :  985920
Dictionary.com Dictionary & Thesaurus :  200047
Dictionary.com Dictionary & Thesaurus for iPad :  54175
Google Translate :  26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran :  18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition :  17588
Merriam-Webster Dictionary :  16849
Night Sky :  12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) :  8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools :  4693
GUNS MODS for Minecraft PC Edition - Mods Tools :  1497
Guides for Pokémon GO - Pokemon GO News and Cheats :  826
WWDC :  762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free :  718
VPN Express :  14
Real Bike Traffic Rider Virtual Reality Glasses :  8
教えて!goo :  0
Jishokun-Japanese English Dictionary & Translator :  0


Reference ranks second in the list, with 74,942.11 user ratings on average. However, the result is skewed mostly by two apps, Bible and Dictionary.com. Still, this genre is more practical than the for-fun apps that seem to dominate the App Store. An app built around a book, with extra features such as a dictionary, could stand a chance.
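
One way to see how strongly a couple of apps skew the Reference average is to compare the mean with the median. This is a supplementary check, not part of the original analysis; it uses the 18 rating counts printed in the Reference listing above and the standard-library `statistics` module.

```python
from statistics import mean, median

# Rating counts copied from the Reference listing above (18 apps).
reference_ratings = [
    985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122, 8535,
    4693, 1497, 826, 762, 718, 14, 8, 0, 0,
]

mean_ratings = mean(reference_ratings)
median_ratings = median(reference_ratings)

print('mean  :', round(mean_ratings, 2))
print('median:', median_ratings)
```

The mean is about 74,942 while the median is 6,614, so a typical Reference app receives far fewer ratings than the genre average suggests.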

Other genres that are popular are as follows:
1. Music : 57326.53
2. Weather : 52279.89
3. Book : 39758.5
4. Finance : 31467.94

The Book genre could be interesting for us; the others don't add much to our app profile recommendation.

__Let's take a look at the Google Play Store Apps:__

We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.)

In [35]:
display_table(free_apps_android, 5) # The Installs column

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


As we can see, the data is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. Since we only want to find out which app genres attract the most users, we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.
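
The conversion described above can be wrapped in a small helper. This is a minimal sketch; `parse_installs` is a hypothetical name, and it assumes every value is a digit string that may contain commas and a trailing plus sign, as in the Installs column shown earlier.

```python
def parse_installs(raw):
    """Convert an install bucket such as '1,000,000+' to its lower bound as a float."""
    return float(raw.replace(',', '').replace('+', ''))


for raw in ['100,000+', '1,000,000+', '0']:
    print(raw, '->', parse_installs(raw))

# Without the cleanup, float() rejects the string.
try:
    float('1,000,000+')
    conversion_failed = False
except ValueError as error:
    conversion_failed = True
    print('Unconverted string fails:', error)
```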

- Our Current Android dataset: __free_apps_android__

In [36]:
android_category_ft = freq_table(free_apps_android, 1)

for category in android_category_ft:
    total = 0         # This variable will store the sum of installs specific to each genre.
    len_category = 0  # This variable will store the number of apps specific to each genre.
    for app in free_apps_android:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_installs = total / len_category
    print(category, ':', avg_installs)

PERSONALIZATION : 5201482.6122448975
COMMUNICATION : 38456119.167247385
COMICS : 817657.2727272727
ART_AND_DESIGN : 1986335.0877192982
EDUCATION : 1833495.145631068
EVENTS : 253542.22222222222
TRAVEL_AND_LOCAL : 13984077.710144928
FAMILY : 3695641.8198090694
FOOD_AND_DRINK : 1924897.7363636363
BOOKS_AND_REFERENCE : 8767811.894736841
VIDEO_PLAYERS : 24727872.452830188
GAME : 15588015.603248259
LIFESTYLE : 1437816.2687861272
NEWS_AND_MAGAZINES : 9549178.467741935
MEDICAL : 120550.61980830671
PRODUCTIVITY : 16787331.344927534
WEATHER : 5074486.197183099
BUSINESS : 1712290.1474201474
FINANCE : 1387692.475609756
HEALTH_AND_FITNESS : 4188821.9853479853
SPORTS : 3638640.1428571427
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
TOOLS : 10801391.298666667
ENTERTAINMENT : 11640705.88235294
PARENTING : 542603.6206896552
AUTO_AND_VEHICLES : 647317.8170731707
MAPS_AND_NAVIGATION : 4056941.7741935486
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
PHOTOGRAPHY : 17

On average, Communication apps have the highest number of installs (38,456,119). A closer look shows that a few very popular apps skew this result, just as we observed in the App Store data set. WhatsApp, Messenger, Skype and several Google apps (Google Chrome, Gmail, Hangouts) each have over a billion installs.

In [40]:
# Communication Category
for app in free_apps_android:
    if app[1] == 'COMMUNICATION' and (app[5] == '100,000,000+' or app[5] == '500,000,000+' or app[5] == '1,000,000,000+'):
        print(app[0], ':', app[5]) # Name and Number of Installs

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we remove the apps with 100 million or more installs, the average for the category drops by more than a factor of ten:

In [41]:
under_100_mil = []

for app in free_apps_android:
    n_installs = app[5]
    n_installs = n_installs.replace(',','')
    n_installs = n_installs.replace('+','')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_mil.append(float(n_installs))

sum(under_100_mil) / len(under_100_mil)

3603485.3884615386
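
To make the size of the drop explicit, the two averages reported above can be divided directly. The variable names are introduced here only for illustration; the two numbers are copied from the outputs above.

```python
avg_all = 38456119.167247385          # COMMUNICATION average over all free apps
avg_under_100_mil = 3603485.3884615386  # average after excluding apps with 100,000,000+ installs

ratio = avg_all / avg_under_100_mil
print('The average shrinks by a factor of about', round(ratio, 1))
```

The ratio comes out at a little under 11, so the handful of giant apps accounts for roughly 90% of the category average.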

The Video Players category has 24,727,872 installs on average. We see the same pattern here: a few apps such as YouTube, Google Play Movies & TV and MX Player dominate the market. The same holds for Social apps (Facebook, Google+, Instagram, etc.) and Photography apps (Google Photos, B612, Sweet Selfie, etc.).

In [42]:
# Video Players Category
for app in free_apps_android:
    if app[1] == 'VIDEO_PLAYERS' and (app[5] == '100,000,000+' or app[5] == '500,000,000+' or app[5] == '1,000,000,000+'):
        print(app[0], ':', app[5]) # Name and Number of Installs

YouTube : 1,000,000,000+
Motorola Gallery : 100,000,000+
VLC for Android : 100,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+
Dubsmash : 100,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+
Motorola FM Radio : 100,000,000+


In [43]:
# Social App Category
for app in free_apps_android:
    if app[1] == 'SOCIAL' and (app[5] == '100,000,000+' or app[5] == '500,000,000+' or app[5] == '1,000,000,000+'):
        print(app[0], ':', app[5]) # Name and Number of Installs

Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Pinterest : 100,000,000+
Google+ : 1,000,000,000+
Badoo - Free Chat & Dating App : 100,000,000+
Tango - Live Video Broadcast : 100,000,000+
Instagram : 1,000,000,000+
Snapchat : 500,000,000+
LinkedIn : 100,000,000+
Tik Tok - including musical.ly : 100,000,000+
BIGO LIVE - Live Stream : 100,000,000+
VK : 100,000,000+


In [44]:
# Photography Category
for app in free_apps_android:
    if app[1] == 'PHOTOGRAPHY' and (app[5] == '100,000,000+' or app[5] == '500,000,000+' or app[5] == '1,000,000,000+'):
        print(app[0], ':', app[5]) # Name and Number of Installs

B612 - Beauty & Filter Camera : 100,000,000+
YouCam Makeup - Magic Selfie Makeovers : 100,000,000+
Sweet Selfie - selfie camera, beauty cam, photo edit : 100,000,000+
Google Photos : 1,000,000,000+
Retrica : 100,000,000+
Photo Editor Pro : 100,000,000+
BeautyPlus - Easy Photo Editor & Selfie Camera : 100,000,000+
PicsArt Photo Studio: Collage Maker & Pic Editor : 100,000,000+
Photo Collage Editor : 100,000,000+
Z Camera - Photo Editor, Beauty Selfie, Collage : 100,000,000+
PhotoGrid: Video & Pic Collage Maker, Photo Editor : 100,000,000+
Candy Camera - selfie, beauty camera, photo editor : 100,000,000+
YouCam Perfect - Selfie Photo Editor : 100,000,000+
Camera360: Selfie Photo Editor with Funny Sticker : 100,000,000+
S Photo Editor - Collage Maker , Photo Collage : 100,000,000+
AR effect : 100,000,000+
Cymera Camera- Photo Editor, Filter,Collage,Layout : 100,000,000+
LINE Camera - Photo editor : 100,000,000+
Photo Editor Collage Maker Pro : 100,000,000+


Our main goal is to build an app that can do well in both markets. For the App Store we recommended a book app, and the BOOKS_AND_REFERENCE category is also quite popular on Google Play, with 8,767,811 installs on average. Let's explore this category a little deeper.

In [45]:
# Book & Reference category
for app in free_apps_android:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

Let's filter them to get a list of the most installed Apps.

In [46]:
# Book & Reference Category
for app in free_apps_android:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '100,000,000+' or app[5] == '500,000,000+' or app[5] == '1,000,000,000+'):
        print(app[0], ':', app[5]) # Name and Number of Installs

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


This list is short, so let's look at the next tier down: apps with between 1,000,000+ and 50,000,000+ installs.

In [47]:
# Book & Reference Category
for app in free_apps_android:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+' or app[5] == '5,000,000+' or app[5] == '10,000,000+' or app[5] == '50,000,000+'):
        print(app[0], ':', app[5]) # Name and Number of Installs

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

There are a lot of apps for processing and reading ebooks, as well as dictionaries and various translation libraries. Building a similar app would put us up against a lot of competition, which may not work in our favor.

We also see a lot of apps about the Quran, which suggests that an app built around a popular book could be profitable on both the App Store and Google Play.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

### Conclusions

In this project, we analyzed two app data sets, one from Google Play and one from the App Store. We cleaned the data to remove duplicate and incorrect entries and focused on apps that are free to download and target an English-speaking audience.

We identified the most popular genres based on user ratings (App Store) or installs (Google Play) and found that most of them are dominated by one or two highly popular apps, which skew the results significantly. So we focused on selecting an app that is more practical and not just for fun.

We concluded that taking a very popular book and turning it into an app could be profitable for both the Google Play and the App Store market. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.