_This is a guided project from DataQuest, and my first attempt at any data science project. DataQuest walked me through most of the code seen below, however I feel that I've learned a lot about the process of data cleaning. I relied less on the guides during the analysis section, and I feel that I've provided my own unique analysis of the data. Thanks for checking it out! &mdash; Zeth De Luna_

# Profitable App Profiles for the App Store and Google Play Markets

The goal of this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We are a team of data analysts working for a company that builds Android and iOS mobile apps.

At our company, we only build apps that are free to download and install, meaning that our main source of revenue consists of in-app ads. So, our revenue is directly related to the number of users who use our app &mdash; the more users that see and engage with ads, the better. In this project, we will analyze data to help our developers understand what type of apps are likely to attract more users.

## Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play. Collecting data for over 4 million apps requires a significant amount of time and money, so we'll analyze a sample of data instead.

These two data sets will be suitable for our purpose:

* [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play. A direct download of the data set is available [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. A direct download of the data set is available [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

Let's start by opening the data sets.

In [2]:
from csv import reader

# Google Play data set
opened_gp = open('googleplaystore.csv')
read_gp_data = reader(opened_gp)
googleplay = list(read_gp_data)
googleplay_header = googleplay[0]
googleplay = googleplay[1:]

# App Store data set
opened_ap = open('AppleStore.csv')
read_ap_data = reader(opened_ap)
appstore = list(read_ap_data)
appstore_header = appstore[0]
appstore = appstore[1:]

To make exploring the data sets easier, we created a function called `explore_data()` which allows us to repeatedly print rows in a readable way. With the function, we can also choose to view the number of rows and columns of a dataset.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row, '\n')
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

Let's explore the first few rows of each data set. Let's check the size of each data set as well.

In [4]:
print('Apple Store Data Set \n')
print(appstore_header, '\n')
explore_data(appstore, 0, 4, rows_and_columns=True)

Apple Store Data Set 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] 

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'] 

['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'] 

Number of rows:  7197
Number of columns:  16


We can see that the Apple Store data set has 7,197 apps and 16 columns. From the header, we can see that the columns that might be useful for our analysis are `track_name`, `price`, `rating_count_tot`, `rating_count_ver`, `user_rating`, `user_rating_ver`, `cont_rating`, and `prime_genre`. Details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

Now let's look at the Google Play data set.

In [5]:
print('\n Google Play Data Set \n')
print('\n', googleplay_header, '\n')
explore_data(googleplay, 0, 4, rows_and_columns=True)


 Google Play Data Set 


 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'] 

['U Launcher Lite ‚Äì FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

Number of rows:  10841
Number of columns:  13


We see that the Google Play data set has 10,841 apps and 13 columns. The columns that might be useful for our analysis are `App`, `Category`, `Rating`, `Reviews`, `Installs`, `Type`, `Price`, `Content Rating`, and `Genres`.

## Cleaning Our Data

Before beginning our analysis, we need to make sure the data we analyze is accurate, otherwise our results will be wrong. So, we need to:

* Detect inaccurate data, and correct or remove it.
* Detect duplicate data, and remove the duplicates.

We also know that our company ony builds apps that are _free_ to download and install, and that are directed toward an _English-speaking_ audience. So, we'll need to:

* Remove non-English apps.
* Remove apps that aren't free.

### Detecting and Deleting Wrong Data

On the [documentation](https://www.kaggle.com/lava18/google-play-store-apps) website for the Google Play data set, there is a dedicated discussion section, and we see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for the app with row index 10472.

Let's take a closer look by comparing this row with the header and another row that is correct.

In [6]:
print(googleplay_header, '\n')
print('Correct Row')
print(googleplay[0], '\n')
print('Row with Error')
print(googleplay[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

Correct Row
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

Row with Error
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


We can see that the row with the error has a `Category` of `1.9` and a `Rating` of `19`. This is clearly wrong because `Category` is described using words in the Correct Row and `Rating` only ranges between 0 and 5. So, we see that for this app, `Category` is missing and the columns to right of `App` are shifted to the left, resulting in incorrect data. 

Since the Google Play data set contains a decently large amount of data, it would be much more efficient to remove the inaccurate data point rather than trying to find the information to correct it.

In [7]:
del googleplay[10472]

Now, let's check if the Google Play data set contains any duplicate rows.

We've created a function called `check_duplicates()` which gives the app name and row index of any duplicate rows. By creating a function, we can easily check both data sets for duplicates.

In [8]:
def check_duplicates(dataset, show_amount=False, show_ex=False):
    unique_app = []
    duplicate_app = []
    for row in dataset:
        app = row[0]
        if app in unique_app:
            duplicate_app.append(app)
        else:
            unique_app.append(app)
    if show_amount:
        print('Number of duplicate apps: ', len(duplicate_app), '\n')
    if show_ex:
        print('Examples of duplicate apps: ', duplicate_app[:8])
    else:
        return duplicate_app

In [9]:
check_duplicates(googleplay, True, True)

Number of duplicate apps:  1181 

Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads']


We found that the Google Play data set has 1,181 duplicate entries. Rather than removing the duplicates at random, we'll remove them according to the date `Last Updated`. We will get more accurate data by using the most recent release of the duplicate app because it has the most recent information about the app.

First, we need to convert the date string in the `Last Updated` column into a date object that we can make comparisons with.

In [10]:
from datetime import datetime

for row in googleplay:
    date = row[10]
    date_obj = datetime.strptime(date, "%B %d, %Y")
    row[10] = date_obj

Next, we created a dictionary where each key is a unique app name and the value is the latest date for which that app was updated.

In [11]:
latest_date = {}

for app in googleplay:
    name = app[0]
    last_update = app[10]
    if name in latest_date and latest_date[name] > last_update:
        latest_date[name] = last_update
    elif name not in latest_date:
        latest_date[name] = last_update

We found earlier that there are 1,181 duplicates in `googleplay`, so the length of our dictionary should be the length of the data set minus 1,181. To check that our dictionary is correct, we'll compare the length of the dictionary to the calculate length of the data set without duplicates.

In [12]:
print('Expected length: ', len(googleplay) - 1181)
print('Actual length: ', len(latest_date))

Expected length:  9659
Actual length:  9659


Now, we can use the `latest_date` dictionary to remove the duplicates, only keeping the app with the most recent date.

We'll start by creating two new, empty lists: `googleplay_clean` and `already_added`. Then, we'll iterate through the `googleplay` data set. The app will be added into `googleplay_clean` ___if___ the app's `Last Updated` date matches the app's date in the `latest_date` dictionary ___and___ the app name isn't in `already_added`.

In [13]:
googleplay_clean = []
already_added = []

for app in googleplay:
    name = app[0]
    last_update = app[10]
    if (latest_date[name] == last_update) and (name not in already_added):
        googleplay_clean.append(app)
        already_added.append(name)

To double check if the duplicate deletion was successful, we can check the length of `googleplay_clean` and run `googleplay_clean` through our `check_duplicates()` function.

In [14]:
print(len(googleplay_clean))
check_duplicates(googleplay_clean, True, True)

9659
Number of duplicate apps:  0 

Examples of duplicate apps:  []


We have successfully removed the duplicates from the `googleplay` data set! Before we move on, let's do the same for the `appstore` data set.

On the [discussion section](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) for the Apple Store data set, the only error that was discussed was that the data set contained duplicates. So, we'll run the `appstore` data set through our `check_duplicates()` function.

In [15]:
check_duplicates(appstore, True, True)

Number of duplicate apps:  0 

Examples of duplicate apps:  []


The Apple Store data set seems to be clean, no errors and no duplicates.

Now that we've removed errors and duplicates from our data sets, let's remove any apps that are ___not___ in English.

<br>

### Removing Non-English Apps

Recall, our company develops apps catered toward an English-speaking audience. So, we're not interested in keeping any apps that are not in English. A good way to determine if an app is intended for an English-speaking audience is to check the name of the app. Usually, an app with a non-English name suggests that they are not directed toward an English-speaking audience, so we'll remove those apps.

One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text. Each character we use in a string has a corresponding number associated with it, which we can get using the `ord()` built-in function. 

The numbers corresponding to the characters used in English text are only in the range 0 to 127, according to the [ASCII](https://en.wikipedia.org/wiki/ASCII) system. If a character in the name of an app has a corresponding number greater than 127, then we can assume that the app has a non-English name.

Let's create a function called `is_english()`that iterates over the characters in an input string and determines if any of the characters doesn't belong to the set of common English characters.

In [16]:
def is_english(string):
    for char in string:
        if ord(char) > 127:
            return False
    return True

In [17]:
# testing is_english() function
print(is_english('Instagram'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Instachat üòú'))

True
False
False
False


We can see that this `is_english()` function cannot accurately determine if the app's name is in English due to the ‚Ñ¢ and üòú being outside the range of common English characters.

In [18]:
print(ord('‚Ñ¢'))
print(ord('üòú'))

8482
128540


To prevent losing too much data, we'll only remove an app if its name has more than three characters outside of the common English characters range.

In [19]:
def is_english(string):
    not_english = 0
    for char in string:
        if ord(char) > 127:
            not_english += 1
    if not_english > 3:
        return False
    return True

In [20]:
# testing modified is_english() function
print(is_english('Instagram'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Instachat üòú'))

True
False
True
True


Now, we'll use the `is_english()` function to filter out any non-English apps from both data sets.

Let's start with the `googleplay_clean` data set. Apps with English names will be appended to a new list, `googleplay_eng`.

In [21]:
googleplay_eng = []

for app in googleplay_clean:
    name = app[0]
    if is_english(name) == True:
        googleplay_eng.append(app)

In [22]:
explore_data(googleplay_eng, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', datetime.datetime(2018, 1, 7, 0, 0), '1.0.0', '4.0.3 and up'] 

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', datetime.datetime(2018, 1, 15, 0, 0), '2.0.0', '4.0.3 and up'] 

['U Launcher Lite ‚Äì FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', datetime.datetime(2018, 8, 1, 0, 0), '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', datetime.datetime(2018, 6, 8, 0, 0), 'Varies with device', '4.2 and up'] 

Number of rows:  9614
Number of columns:  13


We see that we removed 45 (9,659 - 9,614) apps from `googleplay_clean`.

Let's do the same for the Apple Store data set. (Keep in mind that the name of each app in the Apple Store data set is the second element in each app list, under the column `track_name`).

In [23]:
appstore_eng = []

for app in appstore:
    name = app[1]
    if is_english(name) == True:
        appstore_eng.append(app)

In [24]:
explore_data(appstore_eng, 0, 4, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] 

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'] 

['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'] 

Number of rows:  6183
Number of columns:  16


We see that 1,014 (7,197 - 6,183) apps were removed from `appstore`.

<br>

### Removing Apps That Are Not Free

Our last step before beginning analysis is to remove any app that is ___not___ free. Remember, our company only builds apps that are free to download and install, and our main source of revenue consists of in-app ads. So, we'll need to remove any non-free apps from our data before our analysis.

To do this, we can simply iterate through the apps and check if `price` is zero. We'll create an empty list for both data sets, `googleplay_final` and `appstore_final`. If the price of an app is zero, we'll append that app into its corresponding list, otherwise the app will be left out. (Notice that the `price` columns in `googleplay_eng` and `appstore_eng` have different locations).

In [25]:
googleplay_final = []
appstore_final = []

# generate final cleaned Google Play data set
for app in googleplay_eng:
    price = app[7]
    if price == '0':
        googleplay_final.append(app)
        
# generate final cleaned Apple Store data set
for app in appstore_eng:
    price = float(app[4])
    if price == 0:
        appstore_final.append(app)

In [26]:
print('Android Apps: ', len(googleplay_final))
print('iOS Apps: ', len(appstore_final))

Android Apps:  8862
iOS Apps:  3222


After cleaning the data according to our needs, we're left with 8,862 Android apps and 3,222 iOS apps, which should be enough for our analysis.

## Analysis

Again, the goal of our company is to develop apps that are likely to attract more users because our revenue is highly dependent on the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Ultimately, we want our app to be available on both Google Play and the App Store. So, we need to find app profiles that are successful on both markets.

## Most Common Apps by Genre

We can begin our analysis by getting a sense of what the most common genres for each market are.

To do this, we'll need to build frequency tables for the `prime_genre` column of the App Store data set, and the `Category` and `Genres` columns of the Google Play data set.

We'll build two functions we can use to analyze the frequency tables:

* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in a descending order

DataQuest has provided a function called `display_table()`, which:
* Takes in two parameters `dataset` (expected to be a list)  and `index` (expected to be an integer).
* Generates a frequency table using the `freq_table()` function (which I will write)
* Transforms the frequency table into a list of tuples, then sorts the list in a descending order.
* Prints the entries of the frequency table in descending order.

First, we'll write the `freq_table()` function, which will then be used inside the `display_table()` function.

In [27]:
def freq_table(dataset, index):
    # creates a frequency table
    f_table = {}
    for row in dataset:
        data = row[index]
        if data in f_table:
            f_table[data] += 1
        else:
            f_table[data] = 1
    # converts each value into percentage
    total_points = len(dataset)
    for key in f_table:
        f_table[key] = (f_table[key] / total_points) * 100
    
    return f_table

Now, we'll write down the `display_table()` function and apply it to each data set, then run it.

In [28]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

#### Apple Store data set: `prime_genre` frequency table

In [29]:
display_table(appstore_final, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see from our Apple Store frequency table that the most frequent app genre is Games (58.16%), followed by Entertainment (7.88%), and Photo & Video (4.97%). Only 3.66% of apps are made for Education and 3.29% are Social Networking. The remainder of the apps have some type of practical purpose (Shopping, Utilities, Health & Fitness, News, Weather, etc.) and each make up only a very small portion of the Apple Store.

The data makes it clear that most of the apps available in the Apple Store are geared towards fun (games, entertainment, photo & video, etc.), while the apps with more practical purposes are more rare (education, utilities, health & fitness, etc.). This could indicate that Apple Store market has a higher demand for apps that are "fun" because developers seem to produce Games and Entertainment apps more than anything else. 

However, it is important to keep in mind that the frequency of the genre does not necessarily reflect the amount of downloads and installs per app. Producing an app for a densely populated genre might even make it less likely that a user will come across it.

#### Google Play data set: `Genres` frequency table

In [30]:
display_table(googleplay_final, 9)

Tools : 8.429248476641842
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5319341006544795
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.0692845858722633
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8618821936357481
Video Players & Editors : 1.782893252087565
Casual : 1.7490408485669149
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

At a quick glance of the Google Play Genres frequency table, you could jump to the conclusion that Tools apps make up the highest percentage of the Google Play apps. However, after a closer look it seems that what would be the "Games Genre" has been split into many different sub-genres (Racing, Strategy, Casino, Trivia, Puzzle, etc.). The same situation can be seen for other genres like Education (Education;Music & Video, Education;Brain Games, Education;Creativity, etc.) and Entertainment (Creativity, Action & Adventure, etc.). Many of the genres even share some sub-genres. Because of this, it would be impractical to use this data as it may lead to confusion or inaccurate percentages.

Luckily, the Google Play data set has another column that describes app genres that may be more useful to us.

#### Google Play data set: `Category` frequency table

In [31]:
display_table(googleplay_final, 1)

FAMILY : 18.43827578424735
GAME : 9.873617693522906
TOOLS : 8.440532611148726
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5319341006544795
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.0692845858722633
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.7941773865944481
MAPS_AND_NAVIGATION : 1.399232678853532
EDUCATION : 1.2863913337846988
FOOD_AND_DRINK : 1.2412547957571656
ENTERTAINMENT : 1.128413450688332
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8350259535093659
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
ART_AND_DESIGN : 0.6770480704129994
PARENTING : 0.6

The `Category` frequency table above shows a much more accurate representation of the population of apps in the Google Play store. We can clearly see that "FAMILY" apps are the most present at 18.44%, followed by "GAME" at 9.87% and "TOOLS" at 8.44%.

Unlike what we saw in the Apple Store data, no app category in the Google Play store holds a dominating majority. This data shows that apps geared toward family friendly are most prevalent in the Google Play store. This may indicate that family friendly apps are in the highest demand. To investigate further, we checked Google Play and it seems that the "Family" category no longer exists (date checked: July 5, 2020).

![title](gp_categories.png)

If we think about the "family genre" in other markets (i.e. movies, TV, activities, etc.), the one factor that they all have in common is that they are geared towards the kids. Parents may enjoy the experience as well, however it is important for the products to be designed so that young children would be able to comprehend and enjoy the experience. Pixar, for example, produces animated movies with complex underlying themes and scenes that portray serious topics, like mental illness or the passing of loved ones. This allows for the parents to watch the movie for themselves, to be moved emotionally rather than just playing it for the kids. However, at surface level, they are made for the enjoyment of a younger audience&mdash;fun adventures, action, comedy, friendly looking characters, story lines that are easy enough to understand, etc.

Now, bringing this back to our analysis, we can probably assume that the "family" category in Google Play consisted of apps made for kids. And if we look at the rest of the categories, it is most likely that these kids apps are actually games (I highly doubt that developers would make tools, lifestyle, or business apps for kids). So, we can estimate that the "Game" category actually represents about 25 - 30% of the Google Play market. Even with this increase in the percentage held by the Game category, we can still see that apps in the Google Play market are more evenly balanced between apps that are fun and apps that are practical.

Up to this point, we found that the Apple Store market is dominated by apps that are geared towards fun, and the Google Play market has a good balance between apps geared towards fun and apps geared towards practical uses. The analysis, so far, has given us the most popular genres based on the number of apps available per genre. To get an idea about the kind of apps with the most users, we must do some further analysis.

## Most Popular Apps by Genre

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. This information is available in the Google Play data set in the `Installs` column, however it's not included in the Apple Store data set. As a workaround, we can observe the total number of user ratings, which we can find in the `rating_count_tot` column. Because a written review implies that the user has experienced the app, it will be safe to assume that most of an app's reviews have been written by users that have downloaded and installed the app.

### Apple Store's Most Popular by Genre

Let's start with calculating the average number of user ratings per app genre on the Apple Store. To do that, we'll need to:

* Isolate the apps of each genre.
* Sum up the user ratings for the apps of that genre
* Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

First, we'll create a frequency table for the `prime_genre` column using our `freq_table()` function (remember that the values are percentages, but for our purposes now only the keys are needed).

Then, we'll loop through the unique genres. For each genre, we will loop through the Apple Store data set and count the number of apps that match each genre and the sum of user ratings.

Finally, we can calculate the average number of user ratings by dividing the number of user ratings by the number of apps per genre.


In [32]:
genre_appstore = freq_table(appstore_final, 11)

for genre in genre_appstore:
    total_reviews = 0 # total number of reviews per genre
    genre_length = 0 # amount of apps per genre
    for app in appstore_final: # loops through each app
        app_genre = app[11]
        num_user_ratings = app[5] # looks at genre and num of ratings
        # separates number of reviews and number of apps into unique genres
        if app_genre == genre: 
            total_reviews += float(num_user_ratings)
            genre_length += 1
    # calculate average number of reviews per genre
    avg_num_reviews = total_reviews / genre_length
    print(genre, ':', avg_num_reviews, ' (Apps in Genre:', genre_length, ')')

Social Networking : 71548.34905660378  (Apps in Genre: 106 )
Photo & Video : 28441.54375  (Apps in Genre: 160 )
Games : 22788.6696905016  (Apps in Genre: 1874 )
Music : 57326.530303030304  (Apps in Genre: 66 )
Reference : 74942.11111111111  (Apps in Genre: 18 )
Health & Fitness : 23298.015384615384  (Apps in Genre: 65 )
Weather : 52279.892857142855  (Apps in Genre: 28 )
Utilities : 18684.456790123455  (Apps in Genre: 81 )
Travel : 28243.8  (Apps in Genre: 40 )
Shopping : 26919.690476190477  (Apps in Genre: 84 )
News : 21248.023255813954  (Apps in Genre: 43 )
Navigation : 86090.33333333333  (Apps in Genre: 6 )
Lifestyle : 16485.764705882353  (Apps in Genre: 51 )
Entertainment : 14029.830708661417  (Apps in Genre: 254 )
Food & Drink : 33333.92307692308  (Apps in Genre: 26 )
Sports : 23008.898550724636  (Apps in Genre: 69 )
Book : 39758.5  (Apps in Genre: 14 )
Finance : 31467.944444444445  (Apps in Genre: 36 )
Education : 7003.983050847458  (Apps in Genre: 118 )
Productivity : 21028.41071

We can see from our calculations that the top three genres containing the highest average number of reviews (number of reviews per number of apps) are Navigation (86090 reviews), Reference (74942 reviews), and Social Networking (71548 reviews). Based on these numbers, we would recommend for the company to develop apps within these three genres.

It is important to keep in mind that the high number of reviews per genre might be dominated by only a few apps. In Social Networking, for example, Facebook and Twitter might provide most of the review counts while the remaining 104 apps only account for a small percentage of review counts. The same idea may be applicable to the other genres. 

### Google Play's Most Popular by Genre

Now, let's take a look at a few rows of the `Installs` column in the Google Play data set.

In [33]:
display_table(googleplay_final, 5) # left: value, right: percentage of apps with that value

1,000,000+ : 15.741367637102236
100,000+ : 11.554953735048521
10,000,000+ : 10.52809749492214
10,000+ : 10.212141728729407
1,000+ : 8.406680207628076
100+ : 6.917174452719477
5,000,000+ : 6.8156172421575265
500,000+ : 5.574362446400361
50,000+ : 4.773188896411646
5,000+ : 4.513653802753328
10+ : 3.5432182351613632
500+ : 3.2498307379823967
50,000,000+ : 2.279395170390431
100,000,000+ : 2.1214172872940646
50+ : 1.9183028661701647
5+ : 0.7898894154818324
1+ : 0.5077860528097494
500,000,000+ : 0.2708192281651997
1,000,000,000+ : 0.22568269013766643
0+ : 0.045136538027533285
0 : 0.011284134506883321


We can see that the number of installs is a bit ambiguous. For apps with 500,000+ installs, we don't know if an app has 500,000 installs, 678,999 installs, or 821,200 installs. The same goes for the other values. For simplicity, we're just going to use the values provided by the table. So, apps with 1,000,000+ installs will just have 1,000,000 installs, apps with 50+ installs will just have 50 installs, and so on.

Now, let's calculate the average number of installs per app genre for the Google Play data set.

We'll start by generating a frequency table for the `Category` column. Then, we'll loop over the genres and calculate the average number of installs per genre.

In [34]:
category_googleplay = freq_table(googleplay_final, 1)

for category in category_googleplay:
    total_reviews = 0 # sum of installs per genre
    genre_length = 0 # num of apps per genre
    for app in googleplay_final:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total_reviews += float(installs)
            genre_length += 1
    avg_num_reviews = total_reviews / genre_length
    print(category, ':', avg_num_reviews, '(Apps in Genre:', genre_length, ')')

ART_AND_DESIGN : 1905351.6666666667 (Apps in Genre: 60 )
AUTO_AND_VEHICLES : 647317.8170731707 (Apps in Genre: 82 )
BEAUTY : 513151.88679245283 (Apps in Genre: 53 )
BOOKS_AND_REFERENCE : 8767811.894736841 (Apps in Genre: 190 )
BUSINESS : 1712290.1474201474 (Apps in Genre: 407 )
COMICS : 817657.2727272727 (Apps in Genre: 55 )
COMMUNICATION : 38456119.167247385 (Apps in Genre: 287 )
DATING : 854028.8303030303 (Apps in Genre: 165 )
EDUCATION : 3082017.543859649 (Apps in Genre: 114 )
ENTERTAINMENT : 21134600.0 (Apps in Genre: 100 )
EVENTS : 253542.22222222222 (Apps in Genre: 63 )
FINANCE : 1387692.475609756 (Apps in Genre: 328 )
FOOD_AND_DRINK : 1924897.7363636363 (Apps in Genre: 110 )
HEALTH_AND_FITNESS : 4167457.3602941176 (Apps in Genre: 272 )
HOUSE_AND_HOME : 1313681.9054054054 (Apps in Genre: 74 )
LIBRARIES_AND_DEMO : 638503.734939759 (Apps in Genre: 83 )
LIFESTYLE : 1437816.2687861272 (Apps in Genre: 346 )
GAME : 15831850.8 (Apps in Genre: 875 )
FAMILY : 2690205.440636475 (Apps in Ge

We can see that the top three genres with the most installs are Communication ($\approx$38.5M reviews), Video Players ($\approx$24.7M reviews), and Social ($\approx$23.3M reviews). When comparing these top three genres to those of the Apple Store, the only common genre is Social Media. 

## Conclusions

In this project, we analyzed data about the Apple Store and Google Play mobile apps with the goal of recommending an app profile that would be profitable for both markets.

We saw that in the Apple Store, the games genre accounted for a majority of the apps that were available for download. However, the amount of apps alone could not provide enough information about how many users actually downloaded and installed the apps. So, we took a look at the number of installs per genre for both the Apple Store data set and the Google Play data set. This data provided accurate information about how many users were installing apps in each genre. 

Since our goal was to develop an app that would be successful in both markets, we would recommend spending the company's time and energy into developing a social media app. The social media genre ranked at #3 in the Apple Store data set and the Google Play data set in terms of average number of installs. Navigation and Reference from the Apple Store, and Communication and Video Players from Google Play had higher averages than Social Media, but, again, our goal is to develop an app that would succeed in _both_ markets.