# Guided Project: Profitable App Profiles for the App Store and Google Play Markets

For this project, I'll pretend that I'm working as a data analyst for a company that builds Android and iOS mobile apps. The company makes the apps available on Google Play and in the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better.

My goal for this project is to analyse data to help our developers understand what type of apps are likely to attract more users.

To do this, I'll need to collect and analyse data about mobile apps available on Google Play and the App Store.

The two datasets that I will focus on in this project are the following:
- A [dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018.
- A [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.

## Opening and Exploring the Data

To begin, I will open and explore the two datasets. To make them easier to explore, I will use a function named explore_data() that I can call repeatedly to print rows in a readable way.
First, here is the code for the explore_data() function:

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Next, I will import the two datasets, and save the headers to separate variables.

In [2]:
from csv import reader

### The Google Play data set
opened_file = open('googleplaystore.csv', encoding='utf8')  # explicit encoding: app names contain non-ASCII characters
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set
opened_file = open('AppleStore.csv', encoding='utf8')  # explicit encoding: app names contain non-ASCII characters
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
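
The loading steps above can also be wrapped in a small helper so that each file is closed automatically after reading. This is only a sketch: the name `load_csv` is hypothetical, and the UTF-8 encoding is an assumption about how the two CSV files were saved.

```python
import csv

def load_csv(path, encoding="utf8"):
    """Read a CSV file and return (header, rows) as lists of strings."""
    # The `with` block closes the file even if reading fails.
    # newline="" is the setting the csv module documentation recommends.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]
```

With this helper, the two datasets could be loaded with `android_header, android = load_csv('googleplaystore.csv')` and `ios_header, ios = load_csv('AppleStore.csv')`.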

Now I will use the explore_data() function to print the first few rows of each dataset.

In [3]:
# Exploring Android dataset
explore_data(android,0,4,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
# Exploring IOS dataset
explore_data(ios,0,4,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


I will now print the headers for both datasets, to view the column names. I will then try and pick out a few columns that would be useful for my analysis.

In [5]:
print("Android columns:","\n",android_header,"\n")
print("IOS columns:","\n",ios_header)

Android columns: 
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

IOS columns: 
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Out of the Android columns, useful fields for my analysis are:
- 'App'
- 'Category'
- 'Reviews'
- 'Installs'
- 'Type'
- 'Price'
- 'Genres'

And in the iOS columns, useful fields include:
- 'track_name'
- 'currency'
- 'price'
- 'rating_count_tot'
- 'rating_count_ver'
- 'prime_genre'

## Deleting Wrong Data

Before beginning my analysis, I need to make sure the data I analyse is accurate, or the results of my analysis will be wrong. This means that I need to do the following:

- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.

Since our company only builds apps that are free to download and install, and designs them for an English-speaking audience, I'll also need to:

- Remove non-English apps.
- Remove apps that aren't free.

The Google Play dataset has a dedicated discussion section, and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error in a certain row (row 10472). I will check this row, and delete it if I can confirm that it contains an error.

In [6]:
# Printing row 10472
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


At first glance I can't really see where the problem is. I will print this row and the header in the same cell to compare them and better see what the issue is.

In [7]:
print(android[10472])
print('\n')
print(android_header)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


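I can now see that the "Category" value is "1.9", which does not make sense. The row has only 12 values against 13 column names, so the category appears to be missing and every later value has shifted one column to the left. This confirms that the row has an error, so I will now delete it.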

In [8]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10841
10840


By checking the number of rows in the dataset before and after the del statement, I can see the row was successfully removed.
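
The manual comparison against the header can be generalised: a row whose number of values differs from the number of column names is suspect. The sketch below uses a hypothetical helper, `find_malformed_rows`, and made-up toy rows. It only catches length mismatches, not wrong values in otherwise complete rows, and it would need to run before the faulty row is deleted.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data (made up for illustration): the second row is missing a value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
]
print(find_malformed_rows(sample_rows, sample_header))
```

On the real data, the call would be `find_malformed_rows(android, android_header)`.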

## Removing Duplicate Entries

If I explore the Google Play data set long enough, or look at the discussions section, I can see that some apps have duplicate entries. For instance, Instagram has four entries:

In [9]:
for app in android:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are 1,181 cases where an app occurs more than once:

In [10]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print("Number of duplicate apps: ", len(duplicate_apps))
print("\n")
print("Examples of duplicate apps: ", duplicate_apps[:15])

Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
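
As an alternative to checking membership in a list, which becomes slow as the list grows, the duplicate count can be sketched with `collections.Counter`. The helper name `count_duplicates` and the toy rows are made up for illustration; each app contributes its number of occurrences minus one, which matches how the loop above counts duplicates.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Count the extra rows per app name (occurrences beyond the first)."""
    counts = Counter(row[name_index] for row in dataset)
    return sum(n - 1 for n in counts.values())

# Toy rows (made up for illustration)
sample = [['App A'], ['App A'], ['App B'], ['App A']]
print(count_duplicates(sample))
```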


I don't want to count certain apps more than once when I analyse data, so I need to remove the duplicate entries and keep only one entry per app. One thing I could do is remove the duplicate rows randomly, but I could probably find a better way.

Examining the rows I printed above for the Instagram app, I can see that the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers suggest that the data was collected at different times.

The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, **I'll only keep the row with the highest number of reviews and remove the other entries for any given app**.

To do this, I will first create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app. Then I will use the dictionary to remove the duplicate rows.

In [11]:
# Initialising empty dictionary
reviews_max = {}

In [12]:
# Building dictionary with app name and highest number of reviews
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

I will need to check that the length of the dictionary is correct. From above, I can see that the number of duplicate Android apps was 1,181, so I can subtract this from the total number of rows to get the expected length of the dictionary.

In [13]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


The expected and actual lengths match, therefore the dictionary is the correct length. Now I can use this dictionary to remove the duplicate rows from the dataset. To do this, I will do the following:
- Create two empty lists: "android_clean" (which will store the new cleaned data set) and "already_added" (which will just store app names).
- Loop through the Google Play dataset, and for each iteration, assign the app name to a variable named "name", convert the number of reviews to float, and assign it to a variable named "n_reviews".
- If n_reviews is the same as the number of maximum reviews of the app name (the number can be found in the reviews_max dictionary) and name is not already in the list already_added:
    - Append the entire row to the android_clean list
    - Append the app name to the already_added list; this helps to keep track of apps that I already added.



In [14]:
# Initialising empty lists
android_clean = []
already_added = []

In [15]:
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

Now I will explore the android_clean set to ensure the deletions went as expected. It should contain 9,659 rows.

In [16]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


In [17]:
for app in android_clean:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


A quick look with the explore_data() function, together with a check of the Instagram entries, indicates that the de-duplication has succeeded.
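
The two-step approach above (build the dictionary, then filter) could also be collapsed into a single pass that stores the best row per app directly. This is a sketch with a hypothetical helper name and made-up toy rows. Note one difference: the result is ordered by the first appearance of each app name rather than by the position of the kept row.

```python
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when two rows tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows (made up for illustration): name, category, rating, reviews
sample = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App B', 'BUSINESS', '4.2', '50'],
    ['App A', 'SOCIAL', '4.5', '120'],
    ['App A', 'SOCIAL', '4.5', '90'],
]
for row in dedupe_by_max_reviews(sample):
    print(row)
```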

## Removing Non-English Apps

The next step is to remove non-English apps from the dataset. I will define a function that iterates over an input string and checks whether the code point returned by ord() for each character is greater than 127, the upper end of the ASCII range. If any character is above 127, the function will return False.

In [18]:
def eng_check(string):
    number = 0
    
    for letter in string:
        if ord(letter) > 127:
            number += 1
    
    if number >= 1:
        return False
    else:
        return True

Testing the above function with example strings:

In [19]:
print(eng_check('Instagram'))
print(eng_check('Docs To Go™ Free Office Suite'))
print(eng_check('Instachat 😜'))
print(eng_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False
False
False


The problem with the above function is that it treats any name containing symbols such as emojis or ™ as non-English. To help minimise data loss, I will modify the function so that it allows up to three characters outside the ASCII range (0-127).

In [20]:
def eng_check(string):
    number = 0
    
    for letter in string:
        if ord(letter) > 127:
            number += 1
    
    if number > 3:
        return False
    else:
        return True

Testing the newly modified function:

In [21]:
print(eng_check('Docs To Go™ Free Office Suite'))
print(eng_check('Instachat 😜'))
print(eng_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


The new function works as expected. I will now use it to filter out non-English apps from both datasets. If an app name is identified as English, I will append the whole row to a separate list.

In [22]:
# English Android apps
eng_android = []
for app in android_clean:
    status = eng_check(app[0])
    if status is True:
        eng_android.append(app)

explore_data(eng_android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


In [23]:
# English IOS apps
eng_ios = []
for app in ios:
    status = eng_check(app[1])
    if status is True:
        eng_ios.append(app)

explore_data(eng_ios, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16
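
The character-counting logic used for the English check can also be written more compactly, with the threshold exposed as a parameter so it is easy to experiment with. The name `is_probably_english` and the `max_non_ascii` parameter are hypothetical; the heuristic itself is the same one used above and is still imperfect.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

A list comprehension such as `[app for app in android_clean if is_probably_english(app[0])]` would then do the filtering in one line.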


## Isolating the Free Apps

The last step in the data cleaning process is isolating the free apps. I will loop through each dataset to isolate the free apps in separate lists.

In [24]:
# Isolating free Android apps
free_android = []
for app in eng_android:
    if app[6] == "Free":
        free_android.append(app)

explore_data(free_android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8863
Number of columns: 13


In [25]:
# Isolating free IOS apps
free_ios = []

for app in eng_ios:
    price = app[4]
    if price == '0.0':
        free_ios.append(app)

explore_data(free_ios, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


I will now compare the length of the filtered datasets, to get an idea of how many apps were filtered out.

In [26]:
print("Number of Android apps: ",len(eng_android))
print("Number of free Android apps: ",len(free_android))
print("\n")
print("Number of IOS apps: ",len(eng_ios))
print("Number of free IOS apps: ",len(free_ios))

Number of Android apps:  9614
Number of free Android apps:  8863


Number of IOS apps:  6183
Number of free IOS apps:  3222
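
Comparing prices as strings works for the values seen here ('0.0' in the App Store data, 'Free' in the Type column of the Google Play data), but it would miss a free app whose price is written differently, for example '0' or '0.00'. A more tolerant check is sketched below. The helper name `is_free` is hypothetical, and stripping a leading '$' is an assumption about how paid prices might be formatted.

```python
def is_free(price):
    """Return True if a price string represents zero."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (e.g. a shifted column): treat as not free
        return False

print(is_free('0.0'), is_free('0'), is_free('$4.99'))
```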


## Most Common Apps by Genre

My goal is to determine the kinds of apps that are likely to attract more users because the number of people using our apps affects our revenue.

To minimise risks and overhead, my validation strategy for an app idea has three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because the end goal is to add the app on both Google Play and the App Store, I need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

I'll begin the analysis by getting a sense of the most common genres for each market.
For this, I'll build a frequency table for the "prime_genre" column of the App Store data set, and the "Genres" and "Category" columns of the Google Play data set.

I'll build two functions I can use to analyse the frequency tables:
- One function to generate frequency tables that show percentages
- Another function that I can use to display the percentages in a descending order

In [27]:
# Frequency table with percentages function
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

# Display descending table function
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
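
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library, which handles both the counting and the descending sort. The function name `freq_table_counter` and the toy rows are made up for illustration.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by count, descending
    return {key: count / total * 100 for key, count in counts.most_common()}

# Toy rows (made up for illustration)
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, pct in freq_table_counter(sample, 1).items():
    print(genre, ':', pct)
```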

Using the display_table function to display the frequency table of the columns "prime_genre", "Genres", and "Category":

In [28]:
# prime_genre (IOS)
display_table(free_ios,-5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [29]:
# Category (Android)
display_table(free_android,1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

Genres (Google Play):

In [30]:
# Genres (Android)
display_table(free_android,-4)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

In the prime_genre column of the App store dataset, I can see the following:
- The most common genre is Games (58.2%), and the next most common is Entertainment (7.9%)
- The general impression is that the free English section of the App Store is dominated by apps geared towards leisure and entertainment, such as games, multimedia, social networking, sports, and music. In contrast, apps designed for practical purposes like education, shopping, utilities, productivity, and lifestyle are relatively scarce. However, the sheer number of fun-oriented apps doesn't necessarily mean that they also have the most users.

Now I will analyse the Category and Genres columns of the Google Play dataset:
- The Google Play categories offer a different picture from the App Store. Here, the emphasis leans more towards utilitarian apps across categories like family, tools, business, lifestyle, and productivity, with comparatively fewer apps designed solely for entertainment.
- Practical applications appear to enjoy a more significant presence on Google Play when compared to the App Store. This observation aligns with the insights derived from the frequency table in the Genres column.
- Although the distinction between the Genres and Category columns is not entirely clear, it is apparent that the Genres column shows a deeper level of granularity, featuring a wider array of categories. However, for my current analytical purposes, I will focus exclusively on the Category column, as I seek to grasp the broader trends in app distribution.

Now I'm going to investigate which types of apps have the most users.

## Most Popular Apps by Genre on the App Store

To figure out the most popular app genres, I can calculate the average number of installations per genre. For the Google Play data, this info is in the "Installs" column. But for the App Store data, this column doesn't exist. So, as a substitute, I will use the total number of user ratings as an indicator of popularity. 

To start, I will calculate the average number of user ratings per app genre on the App Store, by following these steps:
- Generating a frequency table for prime_genre to get the unique app genres
- Looping over the unique genres and computing the average number of ratings

In [31]:
# Frequency table for prime_genre
ios_freq_table = freq_table(free_ios,-5)

In [32]:
# Looping over genres and calculating averages
for genre in ios_freq_table:
    total = 0
    len_genre = 0
    for app in free_ios:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Navigation apps have the highest average number of ratings, followed by Reference and Social Networking apps.
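
The nested loop above rescans the whole dataset once per genre. The same averages can be computed in a single pass by keeping a running total and a count per group. The helper name `average_by_group` and the toy rows are made up for illustration.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (made up for illustration): genre, number of ratings
sample = [['Games', '10'], ['Games', '30'], ['Music', '5']]
print(average_by_group(sample, 0, 1))
```

For the App Store data, the equivalent call would be `average_by_group(free_ios, -5, 5)`.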

## Most Popular Apps by Genre on Google Play

Now I'll calculate the average number of installs per app genre for the Google Play dataset. I'll use a nested loop like I did previously.

In [33]:
# Frequency table for the installs column
display_table(free_android, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


One problem with this data is that it is not precise. For instance, I can't tell whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, I don't need very precise data for my purposes: I only want to get an idea of which app genres attract the most users, and I don't need perfect precision with respect to the number of users.

I'll leave the numbers as they are, which means that I'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, I'll need to convert each install number to float — this means that I need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. I'll do this directly in the loop below, where I also compute the average number of installs for each genre (category):

In [34]:
categories_android = freq_table(free_android, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in free_android:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

On average, **communication** apps have the most installs: 38,456,119. This is most likely skewed by big players like WhatsApp. Further investigation into the makeup of this category would help enhance this analysis.
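
One way to start that further investigation is to list the apps in a category that sit above some install threshold, so it is visible how much a few very large apps contribute to the average. The sketch below uses hypothetical helper names and made-up toy rows; the column positions follow the Google Play layout shown earlier.

```python
def parse_installs(text):
    """Turn an install string such as '1,000,000+' into a float."""
    return float(text.replace(',', '').replace('+', ''))

def apps_in_category(dataset, category, min_installs=0,
                     name_index=0, category_index=1, installs_index=5):
    """Return (installs, name) pairs for one category, largest first."""
    matches = []
    for row in dataset:
        if row[category_index] != category:
            continue
        installs = parse_installs(row[installs_index])
        if installs >= min_installs:
            matches.append((installs, row[name_index]))
    return sorted(matches, reverse=True)

# Toy rows (made up for illustration) laid out like the Google Play data
sample = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '50,000+'],
    ['App C', 'TOOLS', '', '', '', '10,000,000+'],
]
for installs, name in apps_in_category(sample, 'COMMUNICATION', min_installs=100_000):
    print(name, ':', installs)
```

On the real data, a call such as `apps_in_category(free_android, 'COMMUNICATION', min_installs=100_000_000)` would show which apps dominate the category.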