# 1. Analyzing Mobile App Data

This short project analyzes user numbers for free mobile apps (taken from Dataquest).

The ultimate goal is to help our developers understand what types of apps are likely to attract more users. The CSV files can be found on [Kaggle](https://www.kaggle.com/datasets). Because the file locations change frequently, please use the search bar there to find them.

# 2. Opening and Exploring the Data

In the previous step, we outlined that our aim is to help our developers understand what type of apps are likely to attract more users on Google Play and the App Store. To do this, we'll need to collect and analyze data about mobile apps available on Google Play and the App Store.

In [4]:
from csv import reader

# function for opening csv files

def open_csvFile(file): # `file` is the path to the csv file (String)
    with open(file, encoding="utf8") as opened_file: # open the file (assumed to be UTF-8 encoded) and close it automatically when done
        csv_rows = reader(opened_file) # parse the opened file with the imported `reader`
        csv_list = list(csv_rows) # store every row as a list inside `csv_list`
    return csv_list # return the dataset as a list of lists
        
    
ios = open_csvFile("AppleStore.csv") # App store dataset
android = open_csvFile("googleplaystore.csv") # Google Play dataset

In [5]:
# predefined function from "dataquest.io", takes in the following parameters:
# - dataset: the opened csv from the above function "open_csvFile()"
# - start: the index (Integer) of the first row to print
# - stop: the index (Integer) at which to stop; this row itself is not printed
# - rows_n_cols: if set to True, prints the number of rows and columns of the whole dataset

def explore_data(dataset, start, stop, rows_n_cols=False):
    ds_slice = dataset[start:stop] # slice the dataset from "start" up to (not including) "stop"; the result is a list of lists
    for row in ds_slice: # loop through "ds_slice" and for each row:
        print(row) # print the row
        print("\n") # followed by two empty lines as a separator
    if rows_n_cols: # if the parameter "rows_n_cols" was set to True:
        print("Number of rows: ",len(dataset)) # print the total number of rows
        print("Number of cols: ",len(dataset[0])) # print the total number of columns
        
        


In [6]:
explore_data(ios,0,5, True) # testing the "explore_data()" function with the iOS dataset (header plus the first 4 rows) and printing the total no. of rows and columns
explore_data(android,0,5) # testing the "explore_data()" function with the android dataset (header plus the first 4 rows)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows:  7198
Number of cols:  17
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Pr

# 3. Deleting Wrong Data

The Google Play data set has a dedicated discussion section, and we can see that one of the discussions describes an error for a certain row.

In [14]:
# function named "find_error()" that takes the desired "dataset" as input parameter:

def find_error(dataset):
    headerlength = len(dataset[0]) # number of columns in the header (which is at index 0)
    for index, row in enumerate(dataset): # loop through the rows together with their index
        rowlength = len(row) # number of columns in the current row
        if rowlength != headerlength: # a row with a different number of columns than the header indicates a column shift
            print(row) # print that row
            print(index) # print the index of that row

In [15]:
find_error(android) # find the corrupted row(s) for the android dataset
# del android[10473] # deletes the corrupted row at the identified index; commented out so that re-running the cell does not delete a different row
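
Deleting by a fixed index is fragile: every later row moves up by one, so running the `del` a second time would remove a valid row. A minimal sketch of a variant that can be re-run safely is shown below. It filters by row length instead of by index; `drop_malformed_rows` and the toy data are hypothetical and not part of the original notebook.

```python
def drop_malformed_rows(dataset):
    """Return a new list with only the rows whose length matches the header."""
    header_length = len(dataset[0])
    return [row for row in dataset if len(row) == header_length]


# Toy data (invented for illustration): the third row is missing one column.
sample = [
    ["App", "Category", "Rating"],
    ["App A", "TOOLS", "4.1"],
    ["App B", "19"],
    ["App C", "GAME", "3.9"],
]

cleaned = drop_malformed_rows(sample)
print(len(cleaned))  # 3

# Running the filter again changes nothing, unlike a repeated `del`.
print(drop_malformed_rows(cleaned) == cleaned)  # True
```

The trade-off is that this removes every row with the wrong length, which may be more than the single row reported in the discussion section.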

# 4. Removing Duplicate Entries: Part One

The function below splits the values of one column into two lists:

- dups (for duplicates) and
- uni (for uniques)

It prints the length of each list and returns both lists.

In [16]:
# function named "find_dups" to identify duplicate rows in each dataset. Takes the following parameters:
# dataset: the dataset to inspect (list of lists)
# col: the index of the column to inspect (Integer)

def find_dups(dataset, col):
    dups = [] # create an empty list to store duplicates
    uni = [] # create an empty list to store unique names
    for row in dataset: # for each row (the header row is included, so it counts as one unique entry):
        name = row[col] # get the value (in this case the app name) from the desired column (provided by "col") and store it in a variable called "name"
        if name in uni: # if the name already occurs in the "uni" list:
            dups.append(name) # store that name in the "dups" list
        else: # if the name does NOT occur in the "uni" list yet, i.e. it is new:
            uni.append(name) # store that name in the "uni" list. If the same name occurs again, it will be stored in the "dups" list
    print("Number of duplicates: ",len(dups)) # print the number of entries in the populated "dups" list
    print("Number of uniques: ",len(uni)) # print the number of entries in the populated "uni" list
    return dups, uni # return both lists as a Tuple

# please specify the column index of the "name" column of each dataset in the parameter (since that differs from csv to csv)
# the function returns a tuple --> if you wish to display only the duplicates please store the result in a variable and call print:
# e.g. dups_ios = find_dups(ios, 2)
# print(dups_ios[0])

# if you wish to display the unique app-names use:
# print(dups_ios[1])
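
Checking `name in uni` against a list scans the whole list on every row, which gets slow for large datasets. The sketch below shows one possible alternative that tracks seen names in a set while keeping the same two result lists. `count_duplicates` and the toy rows are hypothetical names used only for illustration.

```python
def count_duplicates(dataset, col):
    """Split the values of one column into first occurrences and repeats."""
    seen = set()       # names encountered so far (fast membership test)
    uniques = []       # first occurrence of each name, in original order
    duplicates = []    # every repeated occurrence
    for row in dataset:
        name = row[col]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
            uniques.append(name)
    return duplicates, uniques


# Toy rows (invented for illustration); the name sits in column 0.
rows = [["Alpha", "10"], ["Beta", "20"], ["Alpha", "30"], ["Alpha", "5"]]
dups, uni = count_duplicates(rows, 0)
print(dups)  # ['Alpha', 'Alpha']
print(uni)   # ['Alpha', 'Beta']
```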

In [17]:
dups_android = find_dups(android, 0) # find duplicates in the android dataset --> name column index is 0
dups_ios = find_dups(ios, 2) # find the duplicates in the iOS dataset --> name column index is 2
print(dups_ios[0]) # print the names of the iOS duplicates, since the function returns a Tuple, we need the Tuple at index 0 (index 1 would be the unique names)

Number of duplicates:  1181
Number of uniques:  9660
Number of duplicates:  2
Number of uniques:  7196
['VR Roller Coaster', 'Mannequin Challenge']


In [18]:
for row in ios: # quick check if the above function works, iterate over each row in the ios dataset and:
    if row[2] == "VR Roller Coaster": # filter for names "VR Roller Coaster"
        print(row) # print these rows

['4000', '952877179', 'VR Roller Coaster', '169523200', 'USD', '0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['7579', '1089824278', 'VR Roller Coaster', '240964608', 'USD', '0', '67', '44', '3.5', '4', '0.81', '4+', 'Games', '38', '0', '1', '1']


### How to decide which record to delete?

As the output above shows (example: "VR Roller Coaster"), the two records differ in their number of reviews (column 6). We can use this to decide which entry to keep: we assume that the entry with more reviews is the newer one.
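
As a small illustration of that rule, the sketch below applies it to the two "VR Roller Coaster" rows printed above. Using `max()` with a key function is just one way to express the idea; the constant name `REVIEW_INDEX` is introduced here for readability.

```python
# The two "VR Roller Coaster" rows from the output above.
rows = [
    ['4000', '952877179', 'VR Roller Coaster', '169523200', 'USD', '0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1'],
    ['7579', '1089824278', 'VR Roller Coaster', '240964608', 'USD', '0', '67', '44', '3.5', '4', '0.81', '4+', 'Games', '38', '0', '1', '1'],
]

REVIEW_INDEX = 6  # column with the total number of reviews

# Keep the row whose review count is largest.
keep = max(rows, key=lambda row: float(row[REVIEW_INDEX]))
print(keep[1], keep[REVIEW_INDEX])  # 952877179 107
```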

# 5. Removing Duplicate Entries: Part Two

To remove the duplicates, we will:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
- Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [19]:
# function that creates a Dictionary which stores each app name with its highest number of reviews
# Parameters are:
# - dataset: the dataset to extract the unique names from
# - name_index: the index of the name column
# - review_index: the index of the review-counter column

def create_rev_Tuple(dataset, name_index, review_index):
    reviews_max = {} # creates an empty Dictionary named "reviews_max"
    for row in dataset[1:]: # loop through the dataset (leaving out the header row, so starting from row 1)
        name = row[name_index] # get the app name of the current row and store it in a variable called "name"
        n_reviews = float(row[review_index]) # get the review count of the current row, convert it to a float and store it in a variable called "n_reviews"
        if name in reviews_max and reviews_max[name] < n_reviews: # check if the name already occurs in the "reviews_max" dictionary AND if its stored review count is smaller than "n_reviews"
            reviews_max[name] = n_reviews # update the stored app name in "reviews_max" with the higher number of reviews "n_reviews"
        elif name not in reviews_max: # if the name is NOT yet present in "reviews_max",
            reviews_max[name] = n_reviews # simply add it to the dictionary with its corresponding review count
    return reviews_max # return "reviews_max" as Dictionary

In [20]:
andTuple = create_rev_Tuple(android, 0, 3) # using the above function, create an Android Dictionary called "andTuple" which stores each app name with its highest review count
print(len(andTuple)) # how many entries does the dictionary have?
print(type(andTuple)) # what type is the new dataset?
iosTuple = create_rev_Tuple(ios, 2,6) # do the same for the iOS dataset
print(len(iosTuple)) # how many entries does the iOS dictionary have?



9659
<class 'dict'>
7195


In [21]:
# function for creating a new dataset, which removes all duplicates and only leaves us with unique app names
# Parameters needed are:
# dataset: the dataset to clean
# name_index: index number of the name column
# review_index: index number of the review column
# dictname: the dictionary created by "create_rev_Tuple()" that maps each app name to its highest review count

def save_uniques(dataset, name_index, review_index, dictname):
    cleanlist = [] # empty list that stores the new cleaned dataset
    already_added = [] # empty list that stores just the app names (helper list)

    for row in dataset[1:]: # loop through the dataset, leaving out the header, hence starting at row 1
        name = row[name_index] # get the app name of the current row and store it in a variable called "name"
        n_reviews = float(row[review_index]) # get the review count of the current row, convert it to float and store it in a variable called "n_reviews"
        if (n_reviews == dictname[name]) and (name not in already_added): # check if "n_reviews" equals the highest review count stored in "dictname" AND if the name is not in the "already_added" list yet (this handles duplicates with identical review counts)
            cleanlist.append(row) # add the whole row to the "cleanlist"
            already_added.append(name) # and also add its name to the "already_added" list (helper list)
    return cleanlist # return only the "cleanlist"

### Unique cleaned Datasets:

In [22]:
android_clean = save_uniques(android, 0, 3, andTuple) # create clean android dataset using the above function
ios_clean = save_uniques(ios, 2, 6, iosTuple) # same for iOS dataset

# 6. Removing Non-English Apps: Part One

This section identifies apps that are not aimed at an English-speaking audience. To do so, I search the app names in the cleaned datasets for characters whose code point is greater than 127, since the characters commonly used in English text all fall within the ASCII range (0 to 127).
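
To make the code point idea concrete, the short sketch below prints the value that the built-in `ord()` returns for a few sample characters and counts the characters above 127 in two of the app names used for testing in this section. `count_non_ascii` is a hypothetical helper added only for illustration.

```python
# ord() returns the Unicode code point of a single character.
for char in ["a", "Z", "5", "+", "™", "爱", "😜"]:
    print(repr(char), ord(char), ord(char) <= 127)


def count_non_ascii(text):
    """Count the characters in `text` whose code point is above 127."""
    return sum(1 for char in text if ord(char) > 127)


print(count_non_ascii("Docs To Go™ Free Office Suite"))  # 1
print(count_non_ascii("Instachat 😜😜😜😜"))              # 4
```

Letters, digits and common punctuation stay at or below 127, while the trademark sign, Chinese characters and emoji are all above it.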

In [23]:
# function that checks if an app name is English or not. Parameter is "appName" which is a String:

def englisch(appName):
    counter = 0 # create a counter and set it to 0
    for char in appName: # loop through each character of the input String "appName" and
        if ord(char) > 127: # using the built-in "ord()" function, check if the character's code point is above 127 (the upper limit of the ASCII range)
            counter += 1 # increase the above counter by 1
            if counter > 3: # check if there are more than 3 non-ASCII characters in the String (3 is an arbitrarily chosen limit)
                return False # if so, return False
    return True # otherwise the String is identified as English, return True

In [24]:
# Testing the above function

print(englisch("Instagram"))
print(englisch("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(englisch("Docs To Go™ Free Office Suite"))
print(englisch("Instachat 😜😜😜😜"))
print(englisch("😜😜😜😜"))

True
False
True
False
False


# 7. Removing Non-English Apps: Part Two

The `englisch()` function from the previous section already follows the rule for this step: if the input string has more than three characters that fall outside the ASCII range (0 - 127), it returns `False` (the string is identified as non-English); otherwise it returns `True`. The tolerance of three characters keeps English app names that contain a few emoji or symbols such as ™.

In [25]:
# this function uses the above function "englisch()", which returns a Boolean, to create a new dataset with only English app names:
# Parameters are:
# - dataset: the dataset to clean
# - name_index: the index of the name column
# - output: the list to which the English rows are appended (it is also returned)

ios_english = [] # create empty list to store the English iOS apps
android_english = [] # create empty list to store the English android apps
def only_eng(dataset, name_index, output):
    for row in dataset: # for each row in the dataset:
        name = row[name_index] # get the app name and store it in a variable called "name"
        if englisch(name): # use the above function and check if the name qualifies as "English"
            output.append(row) # if so, store the whole row in "output"
    return output # return the output

In [26]:
eng_android = only_eng(android_clean, 0, android_english) # using the function above, create a list of only english android apps
explore_data(eng_android,0,3,True) # explore the new dataset
eng_ios = only_eng(ios_clean, 2 , ios_english) # using the function above, create a list of only english iOS apps
explore_data(eng_ios, 0,3,True) # explore the new dataset

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of cols:  13
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583

### Current Dataset to use:

From this step on, please use the following cleaned datasets for all function calls:

- ios_english
- android_english

# 8. Isolating the Free Apps

- Loop through each data set to isolate the free apps in separate lists. Make sure you identify the columns describing the app price correctly.
- After you isolate the free apps, check the length of each data set to see how many apps you have remaining.
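
The `is_Free()` function in this section compares the price with the string `"0"`, which matches the rows shown in the outputs. As a hedged alternative, the sketch below converts the price to a number first, so that a value such as `"0.0"` would also count as free. It assumes that prices look like `'0'`, `'0.0'`, `'3.99'` or `'$4.99'`; `is_free` is a hypothetical helper, not part of the original notebook.

```python
def is_free(price_text):
    """Return True if a price string represents zero.

    Assumes prices look like '0', '0.0', '3.99' or '$4.99';
    other formats would need extra handling.
    """
    cleaned = price_text.strip().lstrip("$")
    return float(cleaned) == 0.0


for price in ["0", "0.0", "$4.99", "3.99"]:
    print(price, is_free(price))
```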

In [27]:
# ios price = col 5
# android price = col 7
# function to select only the "free" apps in both datasets with the following parameters:
# - dataset: the dataset to filter
# - price_index: the index of the price column
# - output: the list to which the free apps are appended (it is also returned)

ios_free = [] # create an empty list to store the free ios apps
android_free = [] # create an empty list to store the free android apps
def is_Free(dataset, price_index, output):
    for row in dataset: # for each row in the dataset:
        price = row[price_index] # get the price and store it in a variable called "price"
        if price == "0": # if the price equals the String "0":
            output.append(row) # store the row in the output list
    return output # return the output

In [28]:
free_ios = is_Free(ios_english, 5, ios_free) # create a new dataset with only free iOS apps
free_android = is_Free(android_english, 7, android_free) # do the same for android apps
explore_data(free_ios,0,3,True) # explore the new dataset
explore_data(free_android,0,3,True) # explore the new dataset


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows:  3220
Number of cols:  17
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '

# 9. Most Common Apps by Genre: Part One

## Further goals:

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1) Build a minimal Android version of the app, and add it to Google Play.
2) If the app has a good response from users, we develop it further.
3) If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets.

### Current datasets to use:

From this step on, please use the following cleaned datasets for all function calls:

- free_ios
- free_android


In [31]:
def display_table(dataset, index):
    table = freq_table(dataset, index) # get a new frequency table with percentages and store it in a variable called "table"
    table_display = [] # create a new empty list called "table_display"
    for key in table: # for each key in the above created frequency table:
        key_val_as_tuple = (table[key], key) # build a (value, key) Tuple called "key_val_as_tuple"; the value comes first so that sorting orders by percentage
        table_display.append(key_val_as_tuple) # add that newly created Tuple to the "table_display" list
    table_sorted = sorted(table_display, reverse = True) # sort the "table_display" list in descending order and store it in a variable called "table_sorted"
    for entry in table_sorted: # loop through the newly created sorted table and for each iteration:
        print(entry[1], ":", entry[0]) # print out the key-value-pair in reverse order

The `display_table()` function above

- Takes in two parameters: `dataset` and `index`. `dataset` is expected to be a list of lists, and `index` is expected to be an integer.
- Generates a frequency table using the `freq_table()` function (defined in the next section).
- Transforms the frequency table into a list of tuples, then sorts the list in a descending order.
- Prints the entries of the frequency table in descending order.
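
The sorting step can be tried in isolation. The sketch below uses a small made-up frequency table to show why the percentage is placed first in each tuple, and adds an alternative that sorts the dictionary items directly with a key function. The alternative is not used in the notebook and is included only for comparison.

```python
# Toy frequency table (percentages invented for illustration).
table = {"Games": 58.1, "Education": 3.7, "Entertainment": 7.9}

# Putting the percentage first makes sorted() order by percentage;
# ties would fall back to comparing the genre names.
as_tuples = [(percentage, genre) for genre, percentage in table.items()]
for percentage, genre in sorted(as_tuples, reverse=True):
    print(genre, ":", percentage)

# An alternative that sorts the (genre, percentage) items directly.
by_value = sorted(table.items(), key=lambda item: item[1], reverse=True)
print(by_value[0])  # ('Games', 58.1)
```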

# 10. Most Common Apps by Genre: Part Two

We'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in a descending order

In [30]:
# function takes in the following parameters:
# - dataset: the dataset to work with
# - index: the index of the desired column to inspect

def freq_table(dataset, index):
    f_dict = {} # create an empty dictionary called "f_dict"
    totalApps = len(dataset) # get the number of rows for the dataset and store it in a variable called "totalApps"
    for row in dataset: # loop through the dataset and for each iteration:
        val = row[index] # get the row from the desired column (defined by the index parameter) and store it in a variable called "val"
        if val in f_dict: # if the "val" is already present in the "f_dict" dictionary:
            f_dict[val] += 1 # increase the value of the f_dict dictionary by 1
        else: # otherwise:
            f_dict[val] = 1 # just set the value of the f_dict to 1
    for key in f_dict: # loop through each key from the newly created f_dict dictionary and for each iteration:
        perc = (f_dict[key]/totalApps)*100 # calculate the percentage and store it in a variable called "perc"
        f_dict[key] = perc # replace the values in f_dict with the newly calculated percentages
    return f_dict # return the new f_dict as Dictionary
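
For comparison, the same table can be built with `collections.Counter` from the standard library. This is only a sketch of an alternative; `freq_table_counter` and the toy rows are hypothetical and do not replace the function above.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, using collections.Counter."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


# Toy rows (invented for illustration); the genre sits in column 1.
rows = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
print(freq_table_counter(rows, 1))  # {'Games': 75.0, 'Music': 25.0}
```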

### Test if the above function worked as expected

by calling it and checking whether the sum of all percentages equals 100

In [32]:
#ios: prime_genre col: 12
#android: genres col: 9, category col: 1
test = freq_table(free_ios, 12)
print(test)
summe = 0
for key in test:
    summe += test[key]
print(summe)

{'Productivity': 1.7391304347826086, 'Weather': 0.8695652173913043, 'Shopping': 2.608695652173913, 'Reference': 0.5590062111801243, 'Finance': 1.1180124223602486, 'Music': 2.049689440993789, 'Utilities': 2.515527950310559, 'Travel': 1.2422360248447204, 'Social Networking': 3.291925465838509, 'Sports': 2.142857142857143, 'Health & Fitness': 2.018633540372671, 'Games': 58.13664596273293, 'Food & Drink': 0.8074534161490683, 'News': 1.3354037267080745, 'Book': 0.43478260869565216, 'Photo & Video': 4.968944099378882, 'Entertainment': 7.888198757763975, 'Business': 0.5279503105590062, 'Lifestyle': 1.5838509316770186, 'Education': 3.6645962732919255, 'Navigation': 0.18633540372670807, 'Medical': 0.18633540372670807, 'Catalogs': 0.12422360248447205}
100.00000000000001
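
The printed sum is 100.00000000000001 rather than exactly 100 because the individual percentages are floating-point numbers and carry small rounding errors. A check that tolerates this, sketched here with `math.isclose` from the standard library, could look as follows:

```python
import math

total = 100.00000000000001  # the sum printed above

print(total == 100)              # False: exact comparison is too strict
print(math.isclose(total, 100))  # True: equal within a relative tolerance
```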


In [33]:
print("iOS Apps sorted by Genre descending:")
print("\n")
display_table(free_ios, 12) # prints the table directly; the function returns None, so there is nothing to store
print("\n")

iOS Apps sorted by Genre descending:


Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205




# 11. Most Common Apps by Genre: Part Three

## iOS Analysis

- What is the most common genre? What is the runner-up?
  > "Games" is by far the most common genre (about 58%) and "Entertainment" is the runner-up (about 8%)
- What other patterns do you see?
  > free iOS apps for the English market seem to be mostly focused on entertainment
- What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)? 
  > see answer above
- Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?
  > Definitely go for Games. A large number of apps does not guarantee a large user base, but by sheer number Games outweighs all other genres!

In [34]:
print("Android Apps sorted by Genre descending:")
print("\n")
display_table(free_android, 9) # prints the table directly; the function returns None, so there is nothing to store
print("\n")


Android Apps sorted by Genre descending:


Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718

In [35]:
print("Android Apps sorted by Category descending:")
print("\n")
display_table(free_android, 1) # prints the table directly; the function returns None, so there is nothing to store

Android Apps sorted by Category descending:


FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PA

## Android Analysis

- What are the most common genres?
  > Family & Games, Tools & Entertainment
- What other patterns do you see?
  > unlike iOS Apps these top categories seem to be aimed more towards practical purposes
- Compare the patterns you see for the Google Play market with those you saw for the App Store market.
  > see above, but android is split up into more categories which spreads out the frequencies a bit further.
- Can you recommend an app profile based on what you found so far? Do the frequency tables you generated reveal the most frequent app genres or what genres have the most users?
  > I think the frequency tables alone don't reveal anything about the number of users. However, the Games genre accounts for more than half of the free iOS apps!


# 12. Most Common Apps by Genre on the App Store

The frequency tables we analyzed on the previous screen showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we'd like to get an idea about the kind of apps with the most users.

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:

- Isolate the apps of each genre.
- Sum up the user ratings for the apps of that genre.
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps).
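
The three steps above can also be done in a single pass over the data by keeping one running total and one counter per genre. The sketch below illustrates this with made-up rows; `average_by_genre` is a hypothetical helper and the column layout is an assumption for the example only.

```python
def average_by_genre(dataset, genre_index, value_index):
    """Return {genre: average of the numeric column} in a single pass."""
    totals = {}  # genre -> sum of values
    counts = {}  # genre -> number of apps
    for row in dataset:
        genre = row[genre_index]
        totals[genre] = totals.get(genre, 0.0) + float(row[value_index])
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


# Toy rows (invented): [name, genre, rating count].
rows = [
    ["App A", "Games", "100"],
    ["App B", "Games", "300"],
    ["App C", "Weather", "50"],
]
print(average_by_genre(rows, 1, 2))  # {'Games': 200.0, 'Weather': 50.0}
```

Compared with looping over the whole dataset once per genre, this reads every row only once, which matters little here but scales better for larger files.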

In [36]:
# ios_table
# android_g_table
# android_c_table

#free_ios: prime_genre col: 12, ratings col: 6
#free_android: genres col: 9, category col: 1

ios_genre = freq_table(free_ios, 12)
#android_genre = freq_table(free_android, 9)
#print(ios_genre)
summe = 0
for key in ios_genre:
    summe += ios_genre[key]
print(summe)

100.00000000000001


In [37]:
# function that calculates the average rating-count for each genre using the following parameters:
# - uni_genres: the above created "ios_genre" frequency table
# - dataset: the latest cleaned dataset to work with
# - genre_index: the index of the genre column
# - rating_index: the index of the rating-count column

def get_Avg(uni_genres, dataset, genre_index, rating_index):
    averages = {} # create an empty dictionary to collect the average of every genre
    for genre in uni_genres: # loop through the frequency table and for each genre:
        total = 0 # sum of the rating counts of this genre
        len_genre = 0 # number of apps in this genre
        for row in dataset: # loop through each row in the cleaned dataset:
            genre_app = row[genre_index] # get the genre of the current app
            if genre_app == genre: # check if it matches the genre of the outer loop
                rating = float(row[rating_index]) # if so, convert the rating count to a float
                total += rating # add the rating count to "total"
                len_genre += 1 # increase the "len_genre" counter by 1
        average = total/len_genre # calculate the average rating count per genre
        averages[genre] = average # store the average under its genre
        print(genre) # print the genre, the total, the number of apps and the average:
        print(total)
        print(len_genre)
        print(average)
    return averages # return all averages as Dictionary (genre -> average)
        

In [38]:
avg_ios_genre = get_Avg(ios_genre, free_ios, 12,6)
# avg_android_genre = get_Avg(android_genre, free_android, 1,5) <--- the Installs column contains Strings such as "1,000,000+", which float() cannot convert. This will be dealt with in the next step

Productivity
1177591.0
56
21028.410714285714
Weather
1463837.0
28
52279.892857142855
Shopping
2261254.0
84
26919.690476190477
Reference
1348958.0
18
74942.11111111111
Finance
1132846.0
36
31467.944444444445
Music
3783551.0
66
57326.530303030304
Utilities
1513441.0
81
18684.456790123455
Travel
1129752.0
40
28243.8
Social Networking
7584125.0
106
71548.34905660378
Sports
1587614.0
69
23008.898550724636
Health & Fitness
1514371.0
65
23298.015384615384
Games
42705795.0
1872
22812.92467948718
Food & Drink
866682.0
26
33333.92307692308
News
913665.0
43
21248.023255813954
Book
556619.0
14
39758.5
Photo & Video
4550647.0
160
28441.54375
Entertainment
3563577.0
254
14029.830708661417
Business
127349.0
17
7491.117647058823
Lifestyle
840774.0
51
16485.764705882353
Education
826470.0
118
7003.983050847458
Navigation
516542.0
6
86090.33333333333
Medical
3672.0
6
612.0
Catalogs
16016.0
4
4004.0
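
Note that `get_Avg()` prints the figures for every genre but hands back only the last average it computed. A minimal sketch of a variant that keeps all averages, assuming the same row layout as above (the name `avg_by_genre` and the sample rows are hypothetical and not part of the original notebook):

```python
def avg_by_genre(dataset, genre_index, value_index):
    """Return a dict mapping each genre to the mean of the value column."""
    totals = {}  # genre -> running sum of the value column
    counts = {}  # genre -> number of rows seen for that genre
    for row in dataset:
        genre = row[genre_index]
        value = float(row[value_index])
        totals[genre] = totals.get(genre, 0.0) + value
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


# Hypothetical sample rows: [genre, rating_count]
sample_rows = [
    ["Games", "100"],
    ["Games", "300"],
    ["Weather", "50"],
]
sample_averages = avg_by_genre(sample_rows, 0, 1)
print(sample_averages)
```

This version walks the dataset once instead of once per genre, and the returned dictionary can be sorted or compared later without re-reading the printed output.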


# 13. Most Popular Apps by Genre on Google Play

In the previous step, we came up with an app profile recommendation for the App Store based on the number of user ratings. For the Google Play market we have data about the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise: most values are open-ended (100+, 1,000+, 5,000+, etc.).

We're going to leave the numbers as they are, which means that we'll treat an app with 100,000+ installs as having 100,000 installs, an app with 1,000,000+ installs as having 1,000,000 installs, and so on. To perform computations, however, we need to convert each install number from a string to a float. This means we have to remove the commas and the plus characters first; otherwise `float()` raises a `ValueError`.
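
The cleaning step described above can be sketched as a small helper. The function name `parse_installs` is hypothetical and only serves as an illustration; the notebook itself does the same replacements inline further down.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    cleaned = raw.replace("+", "").replace(",", "")  # drop "+" and thousands separators
    return float(cleaned)


print(parse_installs("1,000,000+"))  # 1000000.0
print(parse_installs("0"))           # 0.0

# Without cleaning, the conversion fails:
try:
    float("1,000+")
except ValueError as error:
    print("conversion failed:", error)
```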

In [39]:
display_table(free_android, 5) # show the share (in percent) of free Android apps per install bracket, sorted in descending order

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


In [40]:
# android_g_table
# android_c_table

android_cats = freq_table(free_android, 1)
#free_android: Installs col: 5, category col: 1

# function that takes the number of installs (String) and converts it to a float. The rest does the same as the above function "get_Avg()"
# Parameters are:
# - uni_genres: the frequency table for each genre
# - dataset: the latest cleaned dataset
# - genre_index: the index number of the genre column
# - install_index: the index number of the "Install" column

def get_Avg_andro(uni_genres, dataset, genre_index, install_index):
    for genre in uni_genres: # loop through the frequency table and for each iteration:
        total = 0 # create a variable called "total" and set it to 0
        len_genre = 0 # create a variable called "len_genre" and set it to 0
        for row in dataset: # loop through each row in the cleaned dataset for each iteration:
            genre_app = row[genre_index] # get the genre column and store it in a variable named "genre_app"
            if genre_app == genre: # if that genre is the same as the one in the frequency table:
                installs = (row[install_index]).replace("+", "") # remove the "+" from the String and store it in "installs"
                installs = installs.replace(",","") # then also remove the "," and overwrite "installs"
                installs = float(installs) # finally convert "installs" to a float
                total += installs # and add the no. of installs to "total"
                len_genre += 1 # also increase the no. of len_genre by 1
        average = total/len_genre # calculate the average no. of Installs and store it in a variable called "average"
        print(genre) # print all the stuff:
        print(total)
        print(len_genre)
        print(average)    
    return average # note: this returns only the average of the last category in the loop (as float); the per-category values are only printed


In [41]:
avg_andro = get_Avg_andro(android_cats, free_android, 1, 5)
type(avg_andro)

ART_AND_DESIGN
113221100.0
57
1986335.0877192982
AUTO_AND_VEHICLES
53080061.0
82
647317.8170731707
BEAUTY
27197050.0
53
513151.88679245283
BOOKS_AND_REFERENCE
1665884260.0
190
8767811.894736841
BUSINESS
696902090.0
407
1712290.1474201474
COMICS
44971150.0
55
817657.2727272727
COMMUNICATION
11036906201.0
287
38456119.167247385
DATING
140914757.0
165
854028.8303030303
EDUCATION
188850000.0
103
1833495.145631068
ENTERTAINMENT
989460000.0
85
11640705.88235294
EVENTS
15973160.0
63
253542.22222222222
FINANCE
455163132.0
328
1387692.475609756
FOOD_AND_DRINK
211738751.0
110
1924897.7363636363
HEALTH_AND_FITNESS
1143548402.0
273
4188821.9853479853
HOUSE_AND_HOME
97202461.0
73
1331540.5616438356
LIBRARIES_AND_DEMO
52995810.0
83
638503.734939759
LIFESTYLE
497484429.0
346
1437816.2687861272
GAME
13436869450.0
862
15588015.603248259
FAMILY
6193895690.0
1676
3695641.8198090694
MEDICAL
37732344.0
313
120550.61980830671
SOCIAL
5487861902.0
236
23253652.127118643
SHOPPING
1400338585.0
199
7036877.31155

float

## My Personal Results

Ranked by average number of installs, the Android categories differ greatly from what we've seen on the iOS side. The most downloaded categories among free apps are:

1. Communication
2. Video_Players
3. Social

iOS top three (these are the three genres with the most free apps in the output above):

1. Games
2. Entertainment
3. Photo & Video

Games ranks sixth on Android, whereas it is No. 1 on iOS. The two rankings are hard to compare, since the iOS data only provides the number of user ratings while the Android data provides open-ended install counts. Still, a good compromise seems to be to focus on Games and on utilities such as Photo/Video apps.
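
Which genre comes out on top depends on the metric used for ranking. The sketch below uses six of the iOS genres, with totals and app counts copied from the `get_Avg()` output above, and ranks them once by number of apps and once by average ratings per app. The names `ios_stats` and `top_n` are hypothetical and not part of the original notebook.

```python
# (total_ratings, app_count) copied from the get_Avg() output above
ios_stats = {
    "Games": (42705795.0, 1872),
    "Entertainment": (3563577.0, 254),
    "Photo & Video": (4550647.0, 160),
    "Social Networking": (7584125.0, 106),
    "Reference": (1348958.0, 18),
    "Navigation": (516542.0, 6),
}


def top_n(stats, key_func, n=3):
    """Return the n genre names with the largest key_func(value)."""
    ranked = sorted(stats, key=lambda genre: key_func(stats[genre]), reverse=True)
    return ranked[:n]


by_app_count = top_n(ios_stats, lambda s: s[1])           # rank by number of apps
by_avg_ratings = top_n(ios_stats, lambda s: s[0] / s[1])  # rank by ratings per app
print(by_app_count)
print(by_avg_ratings)
```

For this subset, ranking by app count gives Games, Entertainment, Photo & Video, while ranking by average ratings per app gives Navigation, Reference, Social Networking. It is therefore worth stating the metric next to any top-three list.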
