# Most Popular iOS and Android Apps

In order to determine which types of apps are most likely to attract users, I will analyze a dataset of apps from the Apple App Store and the Google Play Store. By examining what attributes make an app popular, hopefully I can aide aspiring app developers as they decide which types of apps they are going to develop.

There are approximately 2 million iOS apps available on the App Store and 2.1 million Andriod apps available on the Google Play Store. Instead of analyzing all apps, I will examine one sample from the Apple App Store and Another from the Google Play Store, which are both taken from Kaggle:

The [dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing the approximately 7,000 iOS apps from the App Store can be downloaded by clicking [here](https://www.kaggle.com/lava18/google-play-store-apps/download) and the [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing the approximately 10,000 Android apps from the Play Store can be downloaded by clicking [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/download).

Both the App Store and Play Store have around the same number of apps (close to 2M), meaning there are likely around the same number of app developers publishing apps on each store. Thus, 
young developers are unlikely to obtain any advantage from learning Xcode (for iOS) or Android Studio (for Android) first.

Note that Xcode and Android Studio support different programming languages, so if you are already familiar with a few programming languages, take your experience into consideration when choosing an IDE. Circa August 2020, the IDE's support the following languages:
- Xcode (iOS) = C, C++, Objective-C, Objective-C++, Java, AppleScript, Python, Ruby, ResEdit, and Swift
- Android Developer (Android) = Java, C++, and Kotlin    

Also note that Android Studio and Xcode have several options for phone emulators, so if a programmer doesn't have an Android/Apple phone or doesn't want to run unfinished apps on his or her personal device, he or she can still test for bugs.

## 1. Understanding the App and Play Store Datasets

In this section, I try to better understand the datasets containing iOS and Android apps in order to more effectively clean and analyze them. First, I will take a look at the datasets in order to better understand the way the data is organized.

In [1]:
from csv import reader

apple_open = open("apple_store.csv")
google_open = open("play_store.csv")

apple_read = reader(apple_open)
google_read = reader(google_open)

apple_data = list(apple_read)
google_data = list(google_read)

Next I will create a function in order to make this data more readable. Note that I am not using pandas on these datasets because I'm practicing my basic python without having to rely on the pandas library. 

### a) Increasing Readability

Both of these datasets are formated as lists of lists, which is very difficult to read. In order to make the data more readable, I will create a function that converts each row of the dataset into a more readable format.
- dataset = accepts a dataset as a list of lists
- start/end = accept the first and last rows (starting from one, not zero, since the header is row 0) in the range of rows the user wants to examine
- header = if set to True (defaulted to False), the function will print out the header
- rows_and_columns = if set to True, will print out the number of rows and columns in the entire dataset

In [4]:
def explore_data(dataset, start, end, header=False, rows_and_columns=False):
    if header:
        dataset_slice = dataset[start-1:end] # will start at index 0, which is the header
    else:
        dataset_slice = dataset[start:end] # will start at index 1, which is the first data entry
    for row in dataset_slice:
        print(row)
        print('\n') # adds an empty line after each data entry for readability

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### b) Structure of the App and Play Store Datasets

Just to get an idea for the structure of each dataset, here are the top five apps in the App Store and Play Store respectively, along with their headers.

In [5]:
explore_data(apple_data, 1, 6, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']




In [6]:
explore_data(google_data, 1, 6, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

### c) App and Play Store Column Names

It appears as though the data displayed in both the App and Play Store datasets is relatively similar. Most of the column in both datasets will not be useful to my analysis. The table below shows which columns in each dataset will be examined.

| App Store        | Play Store      | Purpose                                                                      |
|:----------------:|:---------------:|:-----------------------------------------------------------------------------|
|*track_name*      |*App*            |If the app is well known, additional info could be used to determine outliers.|
|                  |*Category*       |Could be used to organize the type of app.                                    |
|*prime_genre*     |*Genres*         |Could also be used to organize the type of app.                               |
|                  |*Installs*       |Would be an excellent indicator of the app's popularity.                      |
|*rating_count_tot*|                 |Would be a great indicator of an app's popularity.                            |
|*price*           |*Price*          |Useful if we wanted to analyze free vs. paid apps separate                    |
|*count_ratings*   |                 |Could determine which age groups a developer should target their audience.    |

## 2. Cleaning the Datasets

A disorganized dataset can greatly impede the ease by which one is able to analyze the data. In this section, I will clean up the data and get rid of unnecessary data entries.

### a) Detecting and Handling Missing Values

Data entries with missing values can make analyzing the data much more difficult. In order to minimize difficulty, I will create a function that determines which rows in the dataset have missing values and returns them. Depending on what is missing, I might choose to delete the entire data entry, ignore the error, or fill in the missing value with a different value. The function below will find and return all missing values in a dataset, as well as the row index so it can be manipulated if necessary.

In [7]:
def data_with_missing_values(dataset):
    missing_values = {}
    row_index = 1 # start iterating at the first data entry, which is index 1 (index 0 is the header)
    for row in dataset[1:]: # iterating through all data entry rows in the dataset
        temp_dict = {} # will contain index of data entry
        temp_list = []
        # will determine what the name of the app with the missing values is (later used as dictionary key)
        if dataset[0][1] == "track_name":
            app_name = row[1]
        else:
            app_name = row[0]
        column_index = 0
        for entry in row:
            if entry == "": # will append only missing values
                temp_list.append(dataset[0][column_index]) # will append the column name of the missing value
            column_index += 1
        if len(temp_list) > 0: # won't add a data entry unless there is at least one data entry missing
            temp_dict["missing"] = temp_list # includes all missing values
            temp_dict["index"] = row_index # includes index of data entry with missing values
            missing_values[app_name] = temp_dict
        row_index += 1
    return missing_values

In [8]:
data_with_missing_values(apple_data)

{}

In [9]:
data_with_missing_values(google_data)

{'Market Update Helper': {'missing': ['Current Ver'], 'index': 1554},
 'Life Made WI-Fi Touchscreen Photo Frame': {'missing': ['Content Rating'],
  'index': 10473}}

All though none of the data entries in the App Store dataset have missing entries, there are two data entries with missing values in the Play Store dataset. Let's take a look at both of these data entries to determine our next steps.

In [10]:
print(google_data[0])
print("")
print(google_data[1554])
print("")
print(google_data[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Market Update Helper', 'LIBRARIES_AND_DEMO', '4.1', '20145', '11k', '1,000,000+', 'Free', '0', 'Everyone', 'Libraries & Demo', 'February 12, 2013', '', '1.5 and up']

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


It looks like the "Current Ver" is missing from index 1554 and the "Content Rating" is missing from index 10473. The first entry seems to be a rather popular app with over a million downloads, but the other entry has much less downloads. Although neither of these missing values are likely to effect our investigation, I would rather eliminate the second data entry as it is unlikely to affect the analysis due to its unpopularity.

In [11]:
print(len(google_data))
del google_data[10473]
print(len(google_data))

10842
10841


Now that we have successfully deleted this data entry from our Google Play dataset, I will check to determine whether the dataset contains any entries that were recorded multiple times.

### b) Detecting and Dealing with Duplicate Entries

I only want to analyze unique apps, as multiple instances of an app could warp my analysis, especially since popular apps are more likely to be recorded multiple times. Using the for loops below, I will iterate through every row in both databases and add all of the unique apps in a dictionary.

In [12]:
unique_iOS = {}

for data in apple_data[1:]:
    app_name = data[1].lower().capitalize() # accounts for differences in capitalization in app name
    num_reviews = float(data[7]) # converts num_reviews from a string to a float
    
    if app_name in unique_iOS and unique_iOS[app_name] < num_reviews:
        unique_iOS[app_name] = num_reviews # duplicate with most reviews will be added to the dictionary
    elif app_name not in unique_iOS:
        unique_iOS[app_name] = num_reviews # adds app name if not already included in the unique_iOS dictionary 
        
print("There are " + str(len(apple_data)) + " apps in our App Store dataset.")
print("There are " + str(len(unique_iOS)) + " unique apps in our App Store dataset.")

There are 7198 apps in our App Store dataset.
There are 7195 unique apps in our App Store dataset.


In [13]:
unique_Android = {}

for data in google_data[1:]:
    app_name = data[0].lower().capitalize() # accounts for differences in capitalization in app name
    num_reviews = float(data[3]) # converts num_reviews from a string to a float
    
    if app_name in unique_Android and unique_Android[app_name] < num_reviews:
        unique_Android[app_name] = num_reviews # duplicate with most reviews will be added to the dictionary
    elif app_name not in unique_Android:
        unique_Android[app_name] = num_reviews # adds app name if not already included in the unique_Android dictionary

print("There are " + str(len(google_data)) + " apps in our Play dataset.")
print("There are " + str(len(unique_Android)) + " unique apps in our Play dataset.")

There are 10841 apps in our Play dataset.
There are 9638 unique apps in our Play dataset.


It looks like there are only 3 duplicate entries in the App Store dataset, but over 1200 duplicate entries in the Play Store dataset. We will need to remove these duplicate entries.

### c) Eliminating Duplicate Entries from the Datasets

Using the two dictionaries that contain only unique apps created above, I can create lists of all the unique apps in both the App store and the Play store.

In [14]:
apple_clean = [] # will be a list of data entries, which will be lists themselves
apple_added = []

for app in apple_data[1:]:
    app_name = app[1].lower().capitalize() # accounts for differences in capitalization in app name
    num_reviews = float(app[7]) # converts num_reviews from a string to a float
    
    if (unique_iOS[app_name] == num_reviews) and (app_name not in apple_added):
        apple_clean.append(app) # only adds data entries that aren't duplicates
        apple_added.append(app_name) # list of duplicate entries, to prevent duplicates from being added in apple_clean 
        
explore_data(apple_clean, 1, 3, True, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7195
Number of columns: 16


In [15]:
google_clean = [] # will be a list of data entries, which will be lists themselves
google_added = []

for app in google_data[1:]:
    app_name = app[0].lower().capitalize() # accounts for differences in capitalization in app name
    num_reviews = float(app[3]) # converts num_reviews from a string to a float
    
    if (unique_Android[app_name] == num_reviews) and (app_name not in google_added):
        google_clean.append(app) # only adds data entries that aren't duplicates
        google_added.append(app_name) # list of duplicate entries, to prevent duplicates from being added in apple_clean 
        
explore_data(google_clean, 0, 3, False, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9638
Number of columns: 13


The number of rows in both the apple_clean and google_clean matches the number of unique apps determined by the unique_iOS and unique_Android dictionaries. I will use the cleaned versions of the datasets for the remainder of my analysis.

### d) Removing Entries Consisting of Primarily Non-Roman Characters

Many of the apps on both stores are written in langauges with non-Roman characters, especially apps that are designed for Asian markets, which do not use Roman characters. These will be difficult to analyze since the data is likely to be in languages I can't understand, so any dataset with non-Roman characters should be removed from the dataset. Roman chracters are denoted by characters in the order range 0-127 (we can convert the character to a order range using the ord function), so any app with characters outside of that range is likely to be a foreign apps. 

However, it is important to note that many apps have emojis in their names, so even if an app name has a few non-Roman characters in the name, it should not be removed from the dataset. The function below will return False if any string has more than 3 non-Roman characters.

In [16]:
def roman_only(string):
    bad_chars = 0
    for char in string: # checks each char in the string to see if it is a Roman character
        if ord(char) > 127:
            bad_chars += 1
    if bad_chars > 3: # returns False if there are more than 3 non-Roman characters
        return False
    return True

In [17]:
# Testing Function
print(roman_only('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(roman_only('Instagram'))

False
True


Using the roman_only function, I will parse through the apps within both of the cleaned datasets to remove all the apps containing primarily non-Roman characters from the selected sample size.

In [18]:
apple_roman = []

for app in apple_clean:
    app_name = app[1]
    
    if roman_only(app_name):
        apple_roman.append(app) # will add only apps with Roman characters to the new dataset
        
num_roman_apple = len(apple_roman)
num_clean_apple = len(apple_clean)
num_non_roman_apple = num_clean_apple - num_roman_apple

print("There are " + str(num_roman_apple) + " apps ({:.2%}) consisting of Roman characters in our new App store dataset.".format(num_roman_apple/num_clean_apple))
print("There are " + str(num_non_roman_apple) + " apps ({:.2%}) consisting of non-Roman characters in our new App store dataset.".format(num_non_roman_apple/num_clean_apple))


There are 6181 apps (85.91%) consisting of Roman characters in our new App store dataset.
There are 1014 apps (14.09%) consisting of non-Roman characters in our new App store dataset.


In [19]:
google_roman = []

for app in google_clean:
    app_name = app[0]
    
    if roman_only(app_name) == True:
        google_roman.append(app) # will add only apps with Roman characters to the new dataset
        
num_roman_google = len(google_roman)
num_clean_google = len(google_clean)
num_non_roman_google = num_clean_google - num_roman_google
        
print("There are " + str(num_roman_google) + " apps ({:.2%}) consisting of Roman characters in our new Play store dataset.".format(num_roman_google/num_clean_google))
print("There are " + str(num_non_roman_google) + " apps ({:.2%}) consisting of non-Roman characters in our new Play store dataset.".format(num_non_roman_google/num_clean_google))


There are 9593 apps (99.53%) consisting of Roman characters in our new Play store dataset.
There are 45 apps (0.47%) consisting of non-Roman characters in our new Play store dataset.


There are a lot more apps with non-Roman names in the App store than there are in the Play store. Perhaps the App Store contains many more apps that are designed for countries that don't use the Western style alphabet. Or perhaps it is related to the manner in which the data was collected. Regardless, now that I have created subsets containing only apps consisting of primarily non-Roman characters, I will use these subsets for the remainder of my analysis.

### e) Removing Non-Free Apps

Since our main indicator of a successful app is popularity, rather than profitability, I will only be looking at free apps and will remove all of the paid apps from our dataset. Because paid apps cost money, they are less likely to be downloaded and thus certain kinds of apps appear to be less popular than they would be if they were free. Thus, we will only analyze free apps and assume that revenue is generated via other means, such as advertisements, in-app purchases, or data harvesting.

In [20]:
apple_free = []

for app in apple_roman:
    price = float(app[4])
    
    if price == 0:
        apple_free.append(app) # will add only free apps to the new dataset

num_free_apple = len(apple_free)
num_paid_apple = num_roman_apple - num_free_apple

print("There are " + str(num_free_apple) + " free apps ({:.2%}) in our new Apple database.".format(num_free_apple/num_roman_apple))
print("There are " + str(num_paid_apple) + " paid apps ({:.2%}) in our new Apple database.".format(num_paid_apple/num_roman_apple))


There are 3220 free apps (52.10%) in our new Apple database.
There are 2961 paid apps (47.90%) in our new Apple database.


In [21]:
google_free = []

for app in google_roman:
    price = app[6]
    
    if price == "Free":
        google_free.append(app) # will add only free apps to the new dataset

num_free_google = len(google_free)
num_paid_google = num_roman_google - num_free_google
        
print("There are " + str(num_free_google) + " free apps ({:.2%}) in our new Google Play database.".format(num_free_google/num_roman_google))
print("There are " + str(num_paid_google) + " free apps ({:.2%}) in our new Google Play database.".format(num_paid_google/num_roman_google))


There are 8845 free apps (92.20%) in our new Google Play database.
There are 748 free apps (7.80%) in our new Google Play database.


Note that paid apps are much more common in the App Store (around 50% of the total apps in our subset) than they are in the Play Store (around 8% of total apps in our subset). Understanding this distinction might be useful when strategizing how to generate revenue. Perhaps App store customers are more willing to pay for apps than Play store customers are.

## 3. Functions for Analyzing the Datasets

Now that I have cleaned the datasets, I will start comparing the attributes within the datasets to help determine which apps a developer should focus his or her time creating. First, I will create two functions that I will be able to use in order to analyze the columns in both datasets.

### a) Frequency Table Function

The function below creates a frequency table (as a dictionary) for a chosen column within a given dataset. If you select a column index within a given dataset, the function will return  every possible variation within that index with it's given frequency with the data stored as a dictionary.

In [22]:
def freq_table(dataset, index):
    
    frequency_table = {}
    
    for app in dataset:
        target_index = app[index]
        
        if target_index in frequency_table: # adds 1 to the total of the variation count if the value has already been logged
            frequency_table[target_index] += 1
        else:
            frequency_table[target_index] = 1 # sets the total of the variation count to one if it hasn't been logged
            
    for variation in frequency_table:
        percentage = frequency_table[variation]
        percentage /= len(dataset)
        percentage *= 100
        percentage = round(percentage, 2) # converts variation total from a count to a percentage
        frequency_table[variation] = percentage
        
    return frequency_table

In [23]:
freq_table(google_free, 5)

{'10,000+': 10.19,
 '5,000,000+': 6.84,
 '50,000,000+': 2.31,
 '100,000+': 11.55,
 '50,000+': 4.76,
 '1,000,000+': 15.75,
 '10,000,000+': 10.57,
 '5,000+': 4.51,
 '500,000+': 5.55,
 '1,000,000,000+': 0.23,
 '100,000,000+': 2.13,
 '1,000+': 8.39,
 '500,000,000+': 0.27,
 '500+': 3.26,
 '100+': 6.9,
 '50+': 1.92,
 '10+': 3.54,
 '1+': 0.51,
 '5+': 0.79,
 '0+': 0.05}

### b) Display Table Function

Next I will create a function that uses the freq_table function created and formats the possible variations of the selected index in a manner that is much more readable.

In [24]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = [] # list enables variations to be sorted by percentage of total (not possible in dictionary)
    
    for key, value in table.items():
        value_key_switch = (value, key) # switchs the order of key and value to enable sorting process
        table_display.append(value_key_switch)

    table_sorted = sorted(table_display, reverse = True) # sorting the value by percentage of total
    for entry in table_sorted: # returns key-value pair in their original orientation and makes it more readable
        print(entry[1], ':', entry[0])

## 4. Comparing the Genres Amongst Apple App Store Apps

In order to decides which type of app an iOS developer should develop, a developer should consider the following: 
- The average number of users per type
- The overall popularity of the type
- The distribution of app popularity within a type
- Which revenue streams are possible with the type
- The cost associated with designing the app

The best indicator of app type in this dataset is the prime_genre column, which is index 11. Let's take a closer looks at the variation within this column using the display_table function we created earlier.

In [25]:
display_table(apple_free, 11)

Games : 58.14
Entertainment : 7.89
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.52
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.34
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


Game apps are extremely popular in the App Store, consisting of over 58% of the free apps in our dataset. It appears as though the next two most popular categories (Entertainment and Photo & Video), suggesting that iOS developers primarily create entertainment-driven apps rather than apps for more practical purposes.

### a) Average Number of Ratings of Apps By Genre

The for loop below will be used to sort each of the genres based on their average number of ratings, which is the best indicator of popularity amongst Apple Store apps. It will also list the frequency of each genre. Then, I will create another for loop that will print the items within the newly-created list will be used to make the information more readable.

In [26]:
users_per_genre = []
genres = freq_table(apple_free, 11) # will use the frequency table created earlier to assist the iteration process

for gen in list(genres):
    count = 0 # will count the total number of review of an app within each genre
    len_gen = 0 # will count the total number of apps within each genre
        
    for app in apple_free:
        genre = app[11]
        if genre == gen:
            num_rating = float(app[5])
            count += num_rating
            len_gen += 1
    avg_ratings = count / len_gen # contains the average amount of ratings per genre
    
    users_per_genre.append((avg_ratings, gen)) # appends tuple (reversed order) for later sorting
    
users_per_genre = sorted(users_per_genre, reverse = True) # sorting genres based on average number of ratings
frequency = freq_table(apple_free, 11)

for genre in users_per_genre[:5]:
    print(genre[1] + ":", "\n  Average Number of Ratings:", round(genre[0], 2), "\n  Frequency:", frequency[genre[1]], "\n")
    

Navigation: 
  Average Number of Ratings: 86090.33 
  Frequency: 0.19 

Reference: 
  Average Number of Ratings: 74942.11 
  Frequency: 0.56 

Social Networking: 
  Average Number of Ratings: 71548.35 
  Frequency: 3.29 

Music: 
  Average Number of Ratings: 57326.53 
  Frequency: 2.05 

Weather: 
  Average Number of Ratings: 52279.89 
  Frequency: 0.87 



Note that the two highest genres, Navigation and Reference, have a very low frequency. Since a few extremely popular apps could be skewing the average to make it appear as though creating an app in that genre will ensure it's success, I will need to take a closer look at the apps within these categories.

### b) App Store Genres with High Average Number of Ratings

Lets take a look at the four genres that contain the highest average number of ratings. All four categories have a fairly low frequency, so it is possible that a few incredibly popular apps are inflating the average. Let's take a look at the ten most popular apps for each of these four genres.

In [27]:
navigation_apps = []
reference_apps = []
social_apps = []
music_apps = []

for app in apple_free:
    if app[11] == "Navigation":
        navigation_apps.append(str(app[1]) + " : " + str(app[5]))
    elif app[11] == "Reference":
        reference_apps.append(str(app[1]) + " : " + str(app[5]))
    elif app[11] == "Social Networking":
        social_apps.append(str(app[1]) + " : " + str(app[5]))
    elif app[11] == "Music":
        music_apps.append(str(app[1]) + " : " + str(app[5]))

print(navigation_apps[:10], "\n\n", reference_apps[:10], "\n\n", social_apps[:10], "\n\n", music_apps[:10])

['Waze - GPS Navigation, Maps & Real-time Traffic : 345046', 'Google Maps - Navigation & Transit : 154911', 'Geocaching® : 12811', 'CoPilot GPS – Car Navigation & Offline Maps : 3582', 'ImmobilienScout24: Real Estate Search in Germany : 187', 'Railway Route Search : 5'] 

 ['Bible : 985920', 'Dictionary.com Dictionary & Thesaurus : 200047', 'Dictionary.com Dictionary & Thesaurus for iPad : 54175', 'Google Translate : 26786', 'Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418', 'New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588', 'Merriam-Webster Dictionary : 16849', 'Night Sky : 12122', 'City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535', 'LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693'] 

 ['Facebook : 2974676', 'Pinterest : 1061624', 'Skype for iPhone : 373519', 'Messenger : 351466', 'Tumblr : 334293', 'WhatsApp Messenger : 287589', 'Kik : 260965', 'ooVoo – Free 

As expected, the most popular apps in these four genres significantly inflate the average. Note that the extremely popular apps in the genres "Navigation", "Social Networking", and "Music" were created by large corporations and thus likely require a lot of capital to design. Thus, although the apps are incredibly popular and likely profitable, a sole app developer would not be capable of creating an app such as these by himself.

While the most popular apps in the "Reference" genre don't appear to have been created by large companies, most of them look as though they require large databases (Bible, Dictionaries, Google Translate, etc.) in order to create, so they would also likely be capital intensive.

### c) Sorting Genres in the Apple App Store by Highest Median

Since the four genres above have an inflated average (in terms of number of reviews) due to several extraordinarily popular apps, the vast majority of app developers are not going to have the capital, resources, or ability to develop one of these apps. Thus, I will resort the genre categories by their median rather than by their average to give a more accurate representation as to how many downloads an app developer could expect to earn by creating an app in one of these genre categories. Lets parse through all of the categories once again and sort them by their median number of ratings rather than by their average number of ratings.

In [28]:
users_per_genre = []

for genre in list(genres):
    count = 0
    len_gen = 0
    temp_gen = []
        
    for app in apple_free: # counts total ratings
        genre_app = app[11]
        if genre_app == genre:
            num_rating = float(app[5])
            count += num_rating
            len_gen += 1
            temp_gen.append(num_rating) # places the number of ratings for each app into a dictionary to be sorted for median
            
    avg_ratings = count / len_gen # calculates the average number of ratings per genre
    temp_gen = sorted(temp_gen, reverse=True)
    
    if len_gen % 2 == 0: # calculates the median number of ratings
        med_ratings = (temp_gen[int(len_gen / 2)] + temp_gen[int(len_gen / 2) - 1]) / 2
    else:
        med_ratings = temp_gen[int(len_gen / 2)]
    
    
    users_per_genre.append((med_ratings, avg_ratings, genre))
    
users_per_genre = sorted(users_per_genre, reverse = True)
frequency = freq_table(apple_free, 11)

for genre in users_per_genre[:7]:
    print(genre[2] + ":", "\n  Median Number of Ratings:", round(genre[0], 2), "\n  Average Number of Ratings:", round(genre[1], 2), "\n  Frequency:", frequency[genre[2]], "\n")
    

Productivity: 
  Median Number of Ratings: 8737.5 
  Average Number of Ratings: 21028.41 
  Frequency: 1.74 

Navigation: 
  Median Number of Ratings: 8196.5 
  Average Number of Ratings: 86090.33 
  Frequency: 0.19 

Reference: 
  Median Number of Ratings: 6614.0 
  Average Number of Ratings: 74942.11 
  Frequency: 0.56 

Shopping: 
  Median Number of Ratings: 5936.0 
  Average Number of Ratings: 26919.69 
  Frequency: 2.61 

Social Networking: 
  Median Number of Ratings: 4199.0 
  Average Number of Ratings: 71548.35 
  Frequency: 3.29 

Music: 
  Median Number of Ratings: 3850.0 
  Average Number of Ratings: 57326.53 
  Frequency: 2.05 

Health & Fitness: 
  Median Number of Ratings: 2459.0 
  Average Number of Ratings: 23298.02 
  Frequency: 2.02 



### d) Best Genres to Design Apple Apps

After considering the median number of ratings, the average number of ratings, and the frequency of the apps, it seems as though the best choices for what type of app to develop are "Productivity", "Social Networking", and "Reference". 

Even though "Navigation" had the second highest median (8,196.50) and the highest average (86,090.33) of all the app categories in the Apple store, there were only 6 apps listed within that category, and the largest two (Waze and Google Maps) drastically inflated the average. The fourth largest app in the genre only had 3,582 ratings, which is lower than the medians for our three selected categories. 

Lets examine the "Productivity", "Social Networking", and "Reference" genres in the table below:

|                           | Productivity | Social Networking | Reference |
|:--------------|:--------------:|:--------------:|:--------------:|
| Median Number of Ratings  | 8,737.5        | 4,199.0             | 6,614.0     |
| Average Number of Ratings | 21,028.41      | 71,548.35           | 74,942.11   |
| Frequency                 | 1.74          | 3.29                | 0.56        |

Apps in the "Productive" category have the highest median of any category. They also have a fairly high average compared to most genres, a fairly high frequency, and likely don't require lots of capital or skill to make. An app developer could potentially make some money creating a high-quality product and later monetize the app (if raising revenue is a desire) via ads once the app has gained some popularity. 

Apps in the "Social Networking" category have a average number of ratings that is around three times as large as the average for "Productivity" apps, but their median is less than half the median of "Productivity" apps. Thus, although there are a good number of "Social Networking" apps on the app store, the vast majority of them do not make very much money. However, if you have a team of talented programmers and some decent marketing skills, you might be able to make a reasonably successful app.

Determining whether creating an app in the "Reference" category is worthwhile is difficult. There are not many apps in the app store under this genre, however they have a high median and an extremely high average. If you have specialized knowledge in a field in which you could create such an app, perhaps this is the right category for you.

If you have an app idea that you desire to pursue, go for it. However, you should at least consider that no matter what app you make, it is highly unlikely to be popular unless you are on a team and are making an app for a company. Nevertheless, it's good experience and a fun activity, so consider the above information when determining which type of app you want to design.

## 5. Comparing the Categories Amongst Google Play Store Apps

I will repeat the same process as we did for the iOS apps to determine which types of apps an Android developer should work on. Again, we will focus on the following:
- The average number of users per type
- The overall popularity of the type
- The distribution of app popularity within a type
- Which revenue streams are possible with the type
- The cost associated with designing the app

Let's take a look at the different types of genres.

### a) Choosing Between the Category and Genre Columns

There are two possible columns we could analyze in order to differentiate between types of Android apps: Categories and Genres. Lets take a look at the display table for both of these columns to determine which one to use for our analysis.

In [29]:
display_table(google_free, 1)

FAMILY : 18.89
GAME : 9.72
TOOLS : 8.45
BUSINESS : 4.59
LIFESTYLE : 3.91
PRODUCTIVITY : 3.9
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.31
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.93
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.15
DATING : 1.87
VIDEO_PLAYERS : 1.8
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.23
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.83
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.66
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


In [30]:
display_table(google_free, 9)

Tools : 8.43
Entertainment : 6.05
Education : 5.35
Business : 4.59
Productivity : 3.9
Lifestyle : 3.9
Finance : 3.7
Medical : 3.53
Sports : 3.47
Personalization : 3.31
Communication : 3.24
Action : 3.11
Health & Fitness : 3.08
Photography : 2.93
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.33
Shopping : 2.25
Books & Reference : 2.15
Simulation : 2.05
Dating : 1.87
Arcade : 1.84
Video Players & Editors : 1.78
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.23
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.9
House & Home : 0.83
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.4
Educational : 0.37
Board : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;Bra

There are far too many categories in the Genres column to effectively analyze as it appears as though the scope is too narrow for our purposes. Thus, I will analyze the Categories column instead.

Because of my experience analyzing the App Store dataset, I will skip a few steps and order the categories based on median number of installs, which appears to be the best indicator of popularity amongst Play Store apps. I will also list the average number of installs as well as the frequency of the installs as both proved to be useful when examing which genre of app an iOS developer should decide to focus on designing.

Let's take another look at the breakdown of the Category column.

In [31]:
display_table(google_free, 1)

FAMILY : 18.89
GAME : 9.72
TOOLS : 8.45
BUSINESS : 4.59
LIFESTYLE : 3.91
PRODUCTIVITY : 3.9
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.31
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.93
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.15
DATING : 1.87
VIDEO_PLAYERS : 1.8
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.23
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.83
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.66
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


Practical apps are much more prevalent in the Play Store than they are in the App Store as seen by the breakdown of the Category column. Although "Games" is the second most popular category, only two out of ten of the top categories are apps designed for entertainment purposes.
- Entertainment = "Games" and "Sports"
- Practical = "Tools", "Business", "Productivity", "Finance", and "Medical"
- Hard to Tell = "Family", "Lifestyle", and "Personalization"

This distinction between iOS and Android apps is incredibly important when deciding whether to design the iOS or Android version of your app first. Based on frequecy of the category alone (we will analyze other factors later), if a developer is designing an entertainment-based app, they should create the iOS version first and if they are designing a productivity-based app, they should create the Android version first.

### b) Examining the Number of Installs

I will order the categories by their median number of installs, which is the best indicator of popularity amongst Play Store apps. We will list the average number of installs as well as the frequency of the installs as both proved pertinent when examing the App Store data. The number of installs is stored in discrete bins, so I will use the display_table function to get a better look at these bins.

In [32]:
display_table(google_free, 5)

1,000,000+ : 15.75
100,000+ : 11.55
10,000,000+ : 10.57
10,000+ : 10.19
1,000+ : 8.39
100+ : 6.9
5,000,000+ : 6.84
500,000+ : 5.55
50,000+ : 4.76
5,000+ : 4.51
10+ : 3.54
500+ : 3.26
50,000,000+ : 2.31
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05


Organizing data based on install bins is going to result in a very inacurate median and average as the range of possible values within a bin is very large. For example, an app with 499,000 would be in the same bin as an app with 100,000, making analysis fairly inacurate. However, we could use the Reviews value, which consists of continuous values, to get a more accurate median and average. Although not as good of an indicator of popularity as a number of installs column would be (assuming it contained continuous values), the continuous nature of the reviews column makes it a better value to analyze.

### c) Examining the Median Number of Reviews

I will use a for loop in order to examine the reviews column and create a sorted list of genres based on the median number of review. This list will also include the category's average number of reviews, frequency of apps, and the total number of apps. Then I will use another for loop in order to display the values in an easier-to-read format.

In [33]:
users_per_category = []
categories = freq_table(google_free, 1)

for cat in list(categories):
    count = 0
    len_cat = 0
    temp_cat = []
        
    for app in google_free: # counts total ratings
        cat_app = app[1]
        if cat_app.capitalize() == cat.capitalize():
            reviews = float(app[3])
            count += reviews
            len_cat += 1
            temp_cat.append(reviews) # places the number of ratings for each app into a dictionary to be sorted for median
            
    avg_reviews = count / len_cat # calculates the average number of ratings per genre
    temp_cat = sorted(temp_cat, reverse=True)
    
    if len_cat % 2 == 0 & len_cat > 0: # calculates the median number of reviews
        med_reviews = (temp_cat[int(len_cat / 2)] + temp_gen[int(len_cat / 2) - 1]) / 2
    else:
        med_reviews = temp_cat[int(len_cat / 2)]
     
    users_per_category.append((med_reviews, avg_reviews, cat, len_cat))
    
users_per_category = sorted(users_per_category, reverse = True)

for category in users_per_category[:7]:
    print(category[2].capitalize() + ":", "\n  Median Number of Reviews:", round(category[0], 2), "\n  Average Number of Reviews:", round(category[1], 2), "\n  Frequency:", categories[category[2]], "\n  Total Number of Apps:", category[3], "\n")


Game: 
  Median Number of Reviews: 35572.0 
  Average Number of Reviews: 685108.14 
  Frequency: 9.72 
  Total Number of Apps: 860 

Entertainment: 
  Median Number of Reviews: 35279.0 
  Average Number of Reviews: 301752.25 
  Frequency: 0.96 
  Total Number of Apps: 85 

Photography: 
  Median Number of Reviews: 32398.0 
  Average Number of Reviews: 407197.92 
  Frequency: 2.93 
  Total Number of Apps: 259 

Education: 
  Median Number of Reviews: 13612.0 
  Average Number of Reviews: 56293.1 
  Frequency: 1.16 
  Total Number of Apps: 103 

Shopping: 
  Median Number of Reviews: 13085.0 
  Average Number of Reviews: 223887.35 
  Frequency: 2.25 
  Total Number of Apps: 199 

Weather: 
  Median Number of Reviews: 11297.0 
  Average Number of Reviews: 171250.77 
  Frequency: 0.8 
  Total Number of Apps: 71 

Communication: 
  Median Number of Reviews: 6454.0 
  Average Number of Reviews: 995608.46 
  Frequency: 3.24 
  Total Number of Apps: 287 



### d) Best Categories to Design Play Store Apps

The data listed above is extremely confusing, as the median number of reviews is almost always far smaller than the average number of reviews. This suggests the Play Store is dominated by extraordinarily popular apps and the vast majority of apps have very few reviews. 

Regardless, the best genre to design an app is clearly the "Game" genre. It has the highest median number of views, the second highest frequency, and the third highest average number of views. Other decent categories to design would be "Entertainment", "Photography", and "Communication". 

I've created a table for all four of the categories below:

|                           | Game       | Entertainment | Photography | Communication |
|:--------------------------|:----------:|:-------------:|:-----------:|:-------------:|
| Median Number of Reviews  | 35,572.0   | 35,279.0      | 32,398.0    | 6,454.0       |
| Average Number of Reviews | 685,108.14 | 301,752.25    | 407,197.92  | 995,608.46    |
| Frequency                 | 9.72       | 0.96          | 2.93        | 3.24          |

## 6. Results

There are several key differences between apps in the Apple App Store and the Google Play Store that are of note.

There were very few duplicates in the App Store (only 3) whereas there were many in the Play Store (over 1200). Although the duplicates were removed before I began my analysis, this suggests that the App Store dataset was more significantly cleaned than that of the Play Store and thus contains more reliable data.

A much higher percentage of the iOS apps (approximately 14%) contained many non-Roman characters than did Android apps (approximately 0.5%). Therefore, it is reasonable to hypothesize that more iOS apps target non-Western markets than do Android apps. Thus, if you intend to target your app for one of these markets, choosing to create an iOS app might be a wise decision. 

Nearly half of the apps in the App Store cost money (approximately 48%) whereas less than 8% of apps in the Play Store cost money. Thus I hypothesize that App Store customers are more willing to pay for apps than Play Store customers, so if you are creating an App Store app, perhaps charging for the app is a feasible way to monetize it.

The most frequent genres in the App Store are primarily entertainment-based:
1. Games, 58 percent
2. Entertainment, 8 percent
3. Photo/Video, 5 percent

However, the most frequent categories in the Play Store are primarily practical-based:
1. Family, 19 percent
2. Games, 10 percent
3. Tools, 8.5 percent
4. Business, 4.5 percent
5. Lifestyle, 4 percent
  
Although the most frequent apps in the App Store are entertainment-based, the best in terms of popularity where primarily practical-based ("Productivity" and "Reference") whereas the best apps in terms of popularity on the Play Store were entertainment-based ("Game", "Entertainment", "Photography"). This suggests there is an weak inverse relationship between frequency of the genre and that genre's overall success (determined by multiple measures, including median reviews, average reviews, and frequency), which could result from the fact that the more apps that are designed in a given category, the more competition each app has and thus is less likely to be downloaded.

Ultimately, a new developer should consider these factors when deciding which types of apps he or she should develop. Understanding the App Store and Play Store marketplaces can be incredibly beneficial when designing your app, especially if you care about profitability. Hopefully this project will help new iOS and Android developers better understand the challenges they will face when developing a new app.