# What Apps Attract Users? 
## Google Play and Apple Store Analysis

In this scenario, we are data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

The company only builds English-language apps that are free to download and install; the main source of revenue is in-app ads. This means our **revenue for any given app is mostly influenced by the number of users who use our app**. We want users who see and engage with the ads.

The goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

The data sets used are:

1. A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018.
2. A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.

After exploring the data, both data sets will be checked for rows with incomplete data, duplicate rows, and non-English apps. From there, free apps will be isolated.

To minimize risks and overhead, our validation strategy for an app idea comprises three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

The end goal is to add the app on both Google Play and the App Store, so it is important to determine app profiles that are successful on both markets. 

In [1]:
## The explore data function allows us to print the rows in a 
## readable way and is provided by DataQuest

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
## Open and read a dataset into a list

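def prep_file(filename):
    from csv import reader
    ## Use a context manager so the file is closed even if reading fails;
    ## the explicit encoding avoids platform-dependent defaults
    with open(filename, encoding="utf8", newline="") as opened_file:
        output = list(reader(opened_file))
    return output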

In [2]:
## Open and explore the Apple Store dataset

apple_data = prep_file("AppleStore.csv")
explore_data(apple_data, 0, 3, rows_and_columns=True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


### Initial Thoughts: Apple Store Data Analysis

We are given the following information:

"id" : App ID

"track_name": App Name

"size_bytes": Size (in Bytes)

"currency": Currency Type

"price": Price amount

"rating_count_tot": User Rating counts (for all version)

"rating_count_ver": User Rating counts (for current version)

"user_rating" : Average User Rating value (for all version)

"user_rating_ver": Average User Rating value (for current version)

"ver" : Latest version code

"cont_rating": Content Rating

"prime_genre": Primary Genre

"sup_devices.num": Number of supporting devices

"ipadSc_urls.num": Number of screenshots showed for display

"lang.num": Number of supported languages

"vpp_lic": Vpp Device Based Licensing Enabled

Based on the information provided, we will need to do some data cleaning to eliminate paid apps, followed by a closer look at the data to see if any rows might be irrelevant (i.e., non-English apps). From there, I might get more information on the rating count totals, eliminate outliers, and then break the remaining data down by genre to make a recommendation.

In [3]:
## Open and explore the Google Play dataset

google_data = prep_file("googleplaystore.csv")
explore_data(google_data, 0, 3, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


### Initial Thoughts: Google Play Store Data Analysis

We are given the following:

App: Application name

Category: Category the app belongs to

Rating: Overall user rating of the app (as when scraped)

Reviews: Number of user reviews for the app (as when scraped)

Size: Size of the app (as when scraped)

Installs: Number of user downloads/installs for the app (as when scraped)

Type: Paid or Free

Price: Price of the app (as when scraped)

Content Rating: Age group the app is targeted at - Children / Mature 21+ / Adult

Genres: An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.

Last Updated: Date when the app was last updated on Play Store (as when scraped)

Current Ver: Current version of the app available on Play Store (as when scraped)

Android Ver: Min required Android version (as when scraped)

Based on the information provided, we will need to do some data cleaning to eliminate paid apps, followed by taking a closer look at the data to see if any rows might be irrelevant (similar to the App Store data).  From there, I might get more information on the number of installs, eliminate outliers, and then break the remaining data down by genre to make a recommendation.

### Data Cleaning: Google Play

From the Google Play [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), one of the rows has missing data and should be removed.  We also know from the [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section that there is duplicate data.  We need to remove non-English apps.  Finally, we will need to get rid of paid apps.

#### First, we will get rid of any rows with missing data:


In [4]:
## Find the row with missing data by comparing row lengths

header_length = len(google_data[0])
for row in google_data[1:]:
    rowlength = len(row) 
    if rowlength != header_length:
        print(row)
        print(google_data.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10473


In [5]:
## Remove the row 

del google_data[10473]

## Confirm removal by running code from previous cell

header_length = len(google_data[0])
for row in google_data[1:]:
    rowlength = len(row) 
    if rowlength != header_length:
        print(row)
        print(google_data.index(row))

#### Next, we will remove duplicate rows (when an app occurs more than once):

In [6]:
## Use modified code from DataQuest to show the number of 
## duplicate rows based on app name as well as a few examples
## of duplicate rows

duplicate_apps = []
unique_apps = []

for row in google_data:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else: 
        unique_apps.append(name)
        
print("Number of duplicate rows:", len(duplicate_apps))
print("\n")
print("Examples:", duplicate_apps[:5])


Number of duplicate rows: 1181


Examples: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


In [7]:
## Next, let's take a closer look at an example to see what 
## (if any) differences exist for the duplicate rows.  We will
## use the "Box" app from the examples listed above - this was
## chosen randomly

for row in google_data:
    if row[0] == "Box":
        print(row)



['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [8]:
## It appears that these rows are identical.  From DataQuest,
## we know that there are differences in the rows for Instagram;
## let's take a closer look

for row in google_data:
    if row[0] == "Instagram":
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Based on the Instagram data, it appears that for at least some of the duplicate entries, the number of reviews differs. As DataQuest points out, a higher number of reviews represents more recent data, so it makes sense to keep the row with the highest number of reviews. Now that we have a criterion for keeping one copy of each app, we can move forward with removing duplicate entries.

In [9]:
## First, let's get the expected length of the revised data set
## without the header row

print("Expected length:", len(google_data[1:]) - len(duplicate_apps))



Expected length: 9659


In [10]:
## Next, create a dictionary with unique app names as the key 
## with the value being the highest number of reviews.  This will give us
## the criteria for creating the cleaned list

reviews_max = {}

for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
## Check to see if name is in the dictionary; if not, add it    
    if name not in reviews_max:
        reviews_max[name] = n_reviews
## Next, if name is in dict, check to see if it has more reviews;
## if so, add it
    elif (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews

## Print dictionary length; should match expected length above
print(len(reviews_max))

9659


In [11]:
## Next, use the dictionary above to create a new list of 
## Google Play apps that does not contain duplicates and has the
## highest number of reviews for that app.  Because some entries have the 
## same number of reviews, we also need to keep track of apps that have 
## been added to the cleaned list so that duplicate entries do not occur 

googleplay_clean = []
already_added = []

for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if (name not in already_added) and (n_reviews == reviews_max[name]):
        googleplay_clean.append(row)
        already_added.append(name)

## Check length; should be expected length
print(len(googleplay_clean))

9659
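The two passes above can also be collapsed into one. The following is a minimal sketch, not the DataQuest approach: for each app name it keeps the first row that has the highest review count. The function name `dedupe_by_max_reviews` and the small `sample` list are hypothetical and exist only for illustration.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict ">" means that ties keep the earlier row
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Hypothetical sample rows: [name, category, rating, reviews]
sample = [
    ["Box", "BUSINESS", "4.2", "159872"],
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Box", "BUSINESS", "4.2", "159872"],
]

deduped = dedupe_by_max_reviews(sample)
for row in deduped:
    print(row)
```

Called as `dedupe_by_max_reviews(google_data[1:])`, this should produce one row per unique app name, although the rows come out ordered by each app's first appearance rather than by the position of the kept row. Looking names up in a dictionary also avoids scanning the `already_added` list for every row.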


#### Next, we will remove non-English apps 

Using the fact that English characters are represented by 0-127 in ASCII, we can use this as a basis for finding non-ASCII characters in the app names.  If there are more than three non-ASCII characters, we will assume the app is directed toward a non-English speaking audience and remove it from the cleaned data list.

In [12]:
## First, let's find apps with more than three non-English characters 
## in a string

def is_English(app_name):
    count = 0
    for letter in app_name:
        if ord(letter) > 127:
            count += 1
            if count > 3:
                return False
    return True

## Test the function

print(is_English("Instagram"))
print(is_English("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(is_English("Docs To Go™ Free Office Suite"))
print(is_English("Instachat 😜"))

True
False
True
True
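The same heuristic can be written more compactly with the threshold exposed as a parameter, which makes it easier to experiment with stricter or looser cutoffs. This is only a sketch; `count_non_ascii` and `looks_english` are hypothetical names, and the default of three simply mirrors the rule described above.

```python
def count_non_ascii(text):
    """Return how many characters fall outside the ASCII range (code point > 127)."""
    return sum(1 for ch in text if ord(ch) > 127)


def looks_english(app_name, max_non_ascii=3):
    """Treat a name as English if it has at most `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(app_name) <= max_non_ascii


names = [
    "Instagram",
    "Docs To Go™ Free Office Suite",
    "Instachat 😜",
    "爱奇艺PPS -《欢乐颂2》电视剧热播",
]
for name in names:
    print(name, "|", count_non_ascii(name), "|", looks_english(name))
```

Keep in mind that this remains a rough filter: an English app whose name contains several emoji could be dropped, and a non-English app with an all-ASCII name would be kept.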


In [13]:
## Remove non-English apps from Google Play data

googleplay_clean_no_eng = []

for row in googleplay_clean:
    name = row[0]
    if is_English(name):
        googleplay_clean_no_eng.append(row)
        
print(len(googleplay_clean_no_eng))

9614


#### Finally, isolate free apps to create final Google Play app list

In [14]:
## List for cleaned data
googleplay_final = []

for row in googleplay_clean_no_eng:
    is_free = row[6]
    if is_free == "Free":
        googleplay_final.append(row)

## Confirm lists are different sizes
print(len(googleplay_clean_no_eng))
print(len(googleplay_final))
        

9614
8863


### Data Cleaning: Apple Store

In the guided project, DataQuest did not have us check for rows with missing data or duplicate rows.  I think that ideally functions could be written to take in a data set and check for this, but I am going to reuse existing code to show these cases do not exist for the Apple Store data.  

However, we do need to remove non-English apps.  Finally, we will need to get rid of paid apps.
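As a rough sketch of what such reusable checks might look like, the two helpers below take a data set (a list of lists with a header row) and report malformed rows and repeated values. The names `find_malformed_rows` and `find_duplicate_values` are hypothetical, and the sample data is made up.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


def find_duplicate_values(dataset, column_index, has_header=True):
    """Return values in the given column that already appeared in an earlier row."""
    rows = dataset[1:] if has_header else dataset
    seen = set()
    duplicates = []
    for row in rows:
        value = row[column_index]
        if value in seen:
            duplicates.append(value)
        else:
            seen.add(value)
    return duplicates


# Made-up data: a header, two good rows, one short row, one repeated name
sample = [
    ["App", "Category", "Rating"],
    ["Box", "BUSINESS", "4.2"],
    ["Broken row", "1.9"],
    ["Box", "BUSINESS", "4.2"],
]

print(find_malformed_rows(sample))
print(find_duplicate_values(sample, 0))
```

Using `enumerate` gives the position of each malformed row directly, whereas `dataset.index(row)` returns the position of the first row equal to it, which can differ when identical rows exist.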

#### First, we will check for rows with missing data:


In [15]:
## Find rows with missing data by comparing row lengths

header_length = len(apple_data[0])
for row in apple_data[1:]:
    rowlength = len(row) 
    if rowlength != header_length:
        print(row)
        print(apple_data.index(row))

#### Next, we will check for duplicate rows (when an app occurs more than once):

In [16]:
## Use modified code from DataQuest to show the number of 
## duplicate rows as well as a few examples. Note that column 0
## of the Apple Store data is the app id (the name is in column 1)

duplicate_apps = []
unique_apps = []

for row in apple_data:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else: 
        unique_apps.append(name)
        
print("Number of duplicate rows:", len(duplicate_apps))
print("Examples:", duplicate_apps[:5])


Number of duplicate rows: 0
Examples: []


#### Next, we will remove non-English apps 

In [17]:
## Remove non-English apps from Apple Store data

apple_no_eng = []

## Remove header row at this point
for row in apple_data[1:]:

## Note that app name is in different column

    name = row[1]
    if is_English(name):
        apple_no_eng.append(row)
        
print(len(apple_data))
print(len(apple_no_eng))

7198
6183


#### Finally, isolate free apps to create final Apple Store app list

In [18]:
## List for cleaned data
apple_final = []

for row in apple_no_eng:
    is_free = float(row[4])
    if is_free == 0:
        apple_final.append(row)

## Confirm lists are different sizes
print(len(apple_no_eng))
print(len(apple_final))

6183
3222


### Data Analysis

We now have two cleaned datasets, googleplay_final and apple_final. Because the end goal is to add the app to both Google Play and the App Store, it is important to determine app profiles that are successful in both markets.

For the Apple Store, we are provided information on primary genre (prime_genre column). For the Google Play Store, we are given both a category and a genre (the Category and Genres columns, respectively).

First we will write a function that generates a frequency table for any column in a dataset. Next we will use a function from Dataquest to display the entries of the frequency table in descending order.

In [19]:
## Create frequency table (in percentages) for any column 
## of interest in a data set.

def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        column = row[index]
        if column in table:
            table[column] += 1
        else:
            table[column] = 1
    
    for item in table:
        table[item] = (table[item] / total) * 100
        
    return table

## The display_table() function takes in two parameters: dataset
## and index. Dataset is expected to be a list of lists, and 
## index is expected to be an integer. Generates a frequency 
## table using the freq_table() function above and transforms 
## the frequency table into a list of tuples, then sorts and 
## prints the list in a descending order.
## Provided by Dataquest

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
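For comparison, the standard library's `collections.Counter` can do the counting step. The sketch below is an alternative to `freq_table()`, not a replacement used elsewhere in this notebook; `freq_table_counter` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for one column of a list of lists."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up rows: [name, genre]
sample = [
    ["A", "Games"],
    ["B", "Games"],
    ["C", "Education"],
    ["D", "Games"],
]

table = freq_table_counter(sample, 1)
# Sort by percentage, largest first
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ":", pct)
```

Sorting with a `key` function also removes the need to build `(value, key)` tuples just to get the ordering right.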
        

In [20]:
## Display table for prime_genre in Apple Store

display_table(apple_final, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The Apple Store, in terms of free apps, is dominated by the Games category.  It makes up almost 60% of the market.  This is followed by Entertainment, Photo & Video, and Education.

In [21]:
## Display table for Genres in Google Play Store

display_table(googleplay_final, 9)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

The Genres column in the Google Play Store does not have a dominant group like the Apple Store. The top genres are Tools, Entertainment, Education, and Business.

There does not appear to be a single games group among the genres, but a closer look at the list suggests that games are split across multiple genres. Adding together genres such as Simulation, Arcade, Puzzle, Racing, and Role Playing may bring a "games" genre closer to the top of the list. Also, Education and Entertainment are both high on the list, which is consistent with the Apple Store.
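As a quick check of that idea, the percentages printed above for the five genres named here can be summed. This covers only those five; whether other genres (Action or Casual, for example) should also count as games is a judgment call that the data does not settle.

```python
# Percentages copied from the Genres table printed above
game_genres = {
    "Simulation": 2.042197901387792,
    "Arcade": 1.8503892587160102,
    "Puzzle": 1.128286133363421,
    "Racing": 0.9928917973598104,
    "Role Playing": 0.9364774906916393,
}

games_share = sum(game_genres.values())
print("Combined share of these game genres:", round(games_share, 2))
```

The combined share is about 6.95%, which would place this group second in the Genres table, behind Tools (8.45%) and ahead of Entertainment (6.07%).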

In [22]:
## Display table for Category in Google Play Store

display_table(googleplay_final, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

The Category column results are not consistent with the Genres column results from the Google Play Store. Here, Education and Entertainment are not very popular, but a new category, "Family", tops the list. Games is second. DataQuest notes that the "Family" category has many games for kids.

#### Initial Data Analysis Results

Based on initial results, I would recommend taking a closer look at Games.  Even though they are not as dominant in the Genre group of the Google Play Store, it may be possible that this is due to games being broken down into multiple genres.

With the "Family" and "Games" categories being the highest in the Categories, and with the knowledge that the Family category contains many apps for kids, my initial recommendation would be to develop a kids game.  This would also allow for portability to the Apple Store, as we know that Games are the dominant category.   

This initial recommendation, however, is based only on which categories/genres are most popular.  Next, we are going to take a look at the number of users for each category.  The Apple Store does not have a total number of installs, but it does have a rating count total (rating_count_tot) column.  The Google Play Store has an Installs column, as well as the number of user reviews (Reviews column).

I think that comparing rating_count_tot from the Apple Store and the Reviews column from the Google Play Store will be the most effective comparison. I think that if a user takes the time to rate an app, they are more likely to use it frequently, which would be of interest to a business trying to get users to click on ads.

DataQuest also looks at the Installs column, so I would like to add this to the analysis as well.  

In [23]:
## Starting with the Apple Store, we will get a frequency table 
## for prime_genre

apple_genre = freq_table(apple_final, 11)

## Next, calculate the average number of user ratings
table_display = []
for genre in apple_genre:
    total = 0
    len_genre = 0
    for row in apple_final:
        genre_app = row[11]
        if genre_app == genre:
            total += float(row[5])
            len_genre +=1
    ## Convert to int to make results more readable
    average = int(total/len_genre) 
    
    
## Repurpose code from DataQuest to create sorted list by average number 
## of user ratings, as well as the number of apps in that genre
    key_val_as_tuple = (average,genre, len_genre)
    table_display.append(key_val_as_tuple)


table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0], "Number of apps:", entry[2])
    

Navigation : 86090 Number of apps: 6
Reference : 74942 Number of apps: 18
Social Networking : 71548 Number of apps: 106
Music : 57326 Number of apps: 66
Weather : 52279 Number of apps: 28
Book : 39758 Number of apps: 14
Food & Drink : 33333 Number of apps: 26
Finance : 31467 Number of apps: 36
Photo & Video : 28441 Number of apps: 160
Travel : 28243 Number of apps: 40
Shopping : 26919 Number of apps: 84
Health & Fitness : 23298 Number of apps: 65
Sports : 23008 Number of apps: 69
Games : 22788 Number of apps: 1874
News : 21248 Number of apps: 43
Productivity : 21028 Number of apps: 56
Utilities : 18684 Number of apps: 81
Lifestyle : 16485 Number of apps: 51
Entertainment : 14029 Number of apps: 254
Business : 7491 Number of apps: 17
Education : 7003 Number of apps: 118
Catalogs : 4004 Number of apps: 4
Medical : 612 Number of apps: 6
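The loop above rescans `apple_final` once for every genre. A single pass that accumulates totals and counts per genre gives the same averages with less work. The sketch below assumes the same list-of-lists layout; `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: (average value, number of rows)} in a single pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: (totals[group] / counts[group], counts[group]) for group in counts}


# Made-up rows: [name, rating count, genre]
sample = [
    ["A", "100", "Games"],
    ["B", "300", "Games"],
    ["C", "50", "Weather"],
]

averages = average_by_group(sample, group_index=2, value_index=1)
for genre, (avg, n_apps) in sorted(averages.items(), key=lambda kv: kv[1][0], reverse=True):
    print(genre, ":", int(avg), "Number of apps:", n_apps)
```

For the Apple Store data the call would be `average_by_group(apple_final, 11, 5)`. Because a group only appears in the result once it has at least one row, the division can never be by zero.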


After taking a closer look, I can't help but wonder if there are some outliers (i.e., really popular apps, or maybe apps with no ratings) skewing the averages.

In [24]:
## Let's take a look at overall statistics
totals_table = []
for row in apple_final:
    total = int(row[5])
    totals_table.append(total)
    
print("Max: ", max(totals_table))
print("Min: ", min(totals_table))
print("Average: ", int(sum(totals_table)/len(totals_table)))

Max:  2974676
Min:  0
Average:  24824


It does appear that at least one app has almost 3 million ratings, at least one app has 0 ratings, and the average is about 25K ratings.  Let's create a frequency table to take a closer look:

In [25]:
apps_freq_table = {"Less than 100": 0, "More than 500,000": 0}
for total in totals_table:
    if total < 100:
        apps_freq_table["Less than 100"] += 1
    elif total > 500000:
        apps_freq_table["More than 500,000"] += 1
        
print(apps_freq_table)
    

{'More than 500,000': 19, 'Less than 100': 669}
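Another way to gauge how strongly a handful of very popular apps pull the average upward is to compare the mean with the median, which is far less sensitive to extreme values. The sketch below uses made-up rating counts rather than the real data, purely to illustrate the effect.

```python
from statistics import mean, median

# Made-up rating counts: most apps are small, one is extremely popular
rating_counts = [0, 50, 120, 300, 800, 2_000_000]

print("Mean:  ", mean(rating_counts))
print("Median:", median(rating_counts))
```

In this notebook the equivalent check would be `median(totals_table)`. A median far below the mean of about 25K would support the suspicion that a few apps dominate the averages.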


I think these criteria are reasonable cutoffs. An app with fewer than 100 ratings is probably not of interest to a company that depends on high user numbers. Cutting off the 19 apps with more than 500,000 ratings is also fairly reasonable, given that these are probably the most popular apps that nearly everyone already has on their phone. Let's recalculate the average number of user ratings for each genre after removing apps with fewer than 100 ratings and apps with more than 500,000 ratings.

In [26]:
## Starting with the Apple Store, we will get a frequency table 
## for prime_genre

apple_genre = freq_table(apple_final, 11)

## Next, calculate the average number of user ratings
table_display = []
for genre in apple_genre:
    total = 0
    len_genre = 0
    for row in apple_final:
        genre_app = row[11]
        user_ratings = float(row[5])
        if genre_app == genre:
            if (user_ratings > 100) and (user_ratings < 500000):
                total += user_ratings
                len_genre +=1
    ## Convert to int to make results more readable
    average = int(total/len_genre) 
    
    
## Repurpose code from DataQuest to create sorted list by average number 
## of user ratings, as well as the number of apps in that genre
    key_val_as_tuple = (average,genre, len_genre)
    table_display.append(key_val_as_tuple)


table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0], "Number of apps:", entry[2])
    

Navigation : 103307 Number of apps: 5
Weather : 91468 Number of apps: 16
Book : 61845 Number of apps: 9
Food & Drink : 45608 Number of apps: 19
Finance : 43561 Number of apps: 26
Social Networking : 37741 Number of apps: 94
Travel : 33224 Number of apps: 34
Shopping : 30970 Number of apps: 73
Music : 30653 Number of apps: 58
Sports : 29391 Number of apps: 54
Reference : 27924 Number of apps: 13
News : 27679 Number of apps: 33
Utilities : 25639 Number of apps: 59
Productivity : 24028 Number of apps: 49
Games : 22080 Number of apps: 1455
Health & Fitness : 20537 Number of apps: 49
Lifestyle : 20012 Number of apps: 42
Photo & Video : 18372 Number of apps: 130
Entertainment : 17043 Number of apps: 209
Education : 9944 Number of apps: 83
Business : 9088 Number of apps: 14
Catalogs : 4004 Number of apps: 4
Medical : 1214 Number of apps: 3


#### Data Analysis: Apple Store

I think these results are a little less skewed. For the Apple Store, some genres may be good targets. I would suggest looking at genres with a high average number of ratings but a lower number of apps. A lower number of apps suggests the market is less likely to be saturated, and some of these genres might lend themselves to targeted advertising, such as "Book", "Food & Drink", and "Travel".

Next, we will look at the number of installs for the Google Play Store per the DataQuest guided project.

In [27]:
## Get the display table for the Installs column

display_table(googleplay_final, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


Because we cannot determine the exact number of installs, we are just going to remove the commas and plus signs so that we can convert the strings to numbers:

In [28]:
## Get a frequency table for Category column

googleplay_category = freq_table(googleplay_final, 1)

## Get estimate of installs for each category
table_display = []
for category in googleplay_category:
    total = 0
    len_category = 0
    for row in googleplay_final:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace("+", "")
            installs = installs.replace(",", "")
            installs = float(installs)
            total += installs
            len_category += 1

    ## Convert to int to make results more readable
    average = int(total/len_category) 
    
    
## Repurpose code from DataQuest to create sorted list by average number 
## of installs, as well as the number of apps in that category
    key_val_as_tuple = (average, category, len_category)
    table_display.append(key_val_as_tuple)


table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0], "Number of apps:", entry[2])
    

COMMUNICATION : 38456119 Number of apps: 287
VIDEO_PLAYERS : 24727872 Number of apps: 159
SOCIAL : 23253652 Number of apps: 236
PHOTOGRAPHY : 17840110 Number of apps: 261
PRODUCTIVITY : 16787331 Number of apps: 345
GAME : 15588015 Number of apps: 862
TRAVEL_AND_LOCAL : 13984077 Number of apps: 207
ENTERTAINMENT : 11640705 Number of apps: 85
TOOLS : 10801391 Number of apps: 750
NEWS_AND_MAGAZINES : 9549178 Number of apps: 248
BOOKS_AND_REFERENCE : 8767811 Number of apps: 190
SHOPPING : 7036877 Number of apps: 199
PERSONALIZATION : 5201482 Number of apps: 294
WEATHER : 5074486 Number of apps: 71
HEALTH_AND_FITNESS : 4188821 Number of apps: 273
MAPS_AND_NAVIGATION : 4056941 Number of apps: 124
FAMILY : 3697848 Number of apps: 1675
SPORTS : 3638640 Number of apps: 301
ART_AND_DESIGN : 1986335 Number of apps: 57
FOOD_AND_DRINK : 1924897 Number of apps: 110
EDUCATION : 1833495 Number of apps: 103
BUSINESS : 1712290 Number of apps: 407
LIFESTYLE : 1437816 Number of apps: 346
FINANCE : 1387692

Similar to the Apple Store data, I think it is worth examining potential outliers:

In [29]:
## Let's take a look at overall statistics
google_totals_table = []
for row in googleplay_final:
    installs = row[5]
    installs = installs.replace("+", "")
    installs = installs.replace(",", "")
    installs = float(installs)
    google_totals_table.append(installs)
    
print("Max: ", max(google_totals_table))
print("Min: ", min(google_totals_table))
print("Average: ", int(sum(google_totals_table)/len(google_totals_table)))

Max:  1000000000.0
Min:  0.0
Average:  8490471
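The same string clean-up steps appear in several cells, so they could live in one small helper. This is a sketch only; `parse_installs` is a hypothetical name that is not used elsewhere in this notebook.

```python
def parse_installs(installs):
    """Convert a string such as '1,000,000+' to the number 1000000.0.

    The result is a lower bound: '1,000,000+' means at least one million.
    """
    return float(installs.replace("+", "").replace(",", ""))


for raw in ["0+", "500+", "10,000+", "1,000,000,000+"]:
    print(raw, "->", parse_installs(raw))
```

Because every value is a lower bound on a bucket, averages built from these numbers are estimates of relative popularity rather than exact install counts.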


We confirmed from the previous display table that there are apps with over a billion installs. Let's take a closer look and create a frequency table similar to the Apple Store: 

In [30]:
google_apps_freq_table = {"Less than 100": 0, "More than 500,000,000": 0}
for total in google_totals_table:
    if total < 100:
        google_apps_freq_table["Less than 100"] += 1
    elif total > 500000000:
        google_apps_freq_table["More than 500,000,000"] += 1
        
print(google_apps_freq_table)
    

{'More than 500,000,000': 20, 'Less than 100': 603}


Similar to the Apple Store, I think these criteria are reasonable cutoffs.  If an app has fewer than 100 installs, it is probably not of interest to a company that depends on high user numbers.  Cutting off the top 20 apps is also fairly reasonable, given that these are probably the most popular apps that nearly everyone already has on their phone.  There is also value in using consistent criteria.  Let's recalculate the average number of installs for each category, keeping only apps with more than 100 and fewer than 500,000,000 installs.  Because the comparisons in the cell are strict, apps listed at exactly 100 or exactly 500,000,000 installs are dropped as well.
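
The parsing and cutoff steps can be sketched as two small helpers. `parse_installs` and `within_cutoffs` are hypothetical names that do not appear in the notebook, and the default bounds are exclusive, mirroring the `installs > 100 and installs < 500000000` test used in the next cell.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(raw.replace("+", "").replace(",", ""))


def within_cutoffs(installs, low=100, high=500000000):
    """Return True when low < installs < high (both bounds exclusive)."""
    return low < installs < high


print(parse_installs("1,000,000+"))              # 1000000.0
print(within_cutoffs(parse_installs("100+")))    # False: the boundary value is excluded
print(within_cutoffs(parse_installs("1,000+")))  # True
```

Switching to `low <= installs <= high` would keep the boundary buckets, but the averages printed in this notebook were produced with the strict form.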

In [31]:
## Get a frequency table for Category column

googleplay_category = freq_table(googleplay_final, 1)

## Get estimate of installs for each category
table_display = []
for category in googleplay_category:
    total = 0
    len_category = 0
    for row in googleplay_final:
        category_app = row[1]
        installs = row[5]
        installs = installs.replace("+", "")
        installs = installs.replace(",", "")
        installs = float(installs)
        if category_app == category:
            if installs > 100 and installs < 500000000:
                total += installs
                len_category += 1

    ## Convert to int to make results more readable
    average = int(total/len_category) 
    
    ## Repurpose code from DataQuest to create sorted list by average number 
    ## of installs, as well as the number of apps in that category
    key_val_as_tuple = (average, category, len_category)
    table_display.append(key_val_as_tuple)


table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0], "Number of apps:", entry[2])
    

PHOTOGRAPHY : 14984702 Number of apps: 244
GAME : 12837474 Number of apps: 813
ENTERTAINMENT : 11640705 Number of apps: 85
COMMUNICATION : 11225234 Number of apps: 226
PRODUCTIVITY : 10225734 Number of apps: 273
VIDEO_PLAYERS : 9739666 Number of apps: 147
SHOPPING : 7779652 Number of apps: 180
SOCIAL : 7552588 Number of apps: 197
TOOLS : 6939726 Number of apps: 663
PERSONALIZATION : 6293137 Number of apps: 243
WEATHER : 5221572 Number of apps: 69
TRAVEL_AND_LOCAL : 4889084 Number of apps: 183
SPORTS : 4507111 Number of apps: 243
MAPS_AND_NAVIGATION : 4451858 Number of apps: 113
BOOKS_AND_REFERENCE : 3849037 Number of apps: 173
FAMILY : 3464899 Number of apps: 1499
HEALTH_AND_FITNESS : 2898853 Number of apps: 222
BUSINESS : 2832898 Number of apps: 246
FOOD_AND_DRINK : 2252531 Number of apps: 94
ART_AND_DESIGN : 2058563 Number of apps: 55
EDUCATION : 1833495 Number of apps: 103
NEWS_AND_MAGAZINES : 1822740 Number of apps: 202
LIFESTYLE : 1802463 Number of apps: 276
FINANCE : 1548164 Numb

It is interesting to note that some categories do not have a large number of apps but still have a high average number of installs, which may be of interest to the company.  "Food and Drink" shows this pattern, consistent with the Apple Store.  In the Google Play Store, books fall under the broader "Books and Reference" category, which might still be worth looking into, and travel falls under "Travel and Local".
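
One way to make this observation concrete is to filter the sorted table for categories with few apps but a high average. The sketch below copies a handful of rows from the trimmed output above; `shortlist`, `max_apps`, and `min_average` are hypothetical names, and the thresholds (fewer than 100 apps, at least 2,000,000 average installs) are illustrative assumptions rather than values chosen in this analysis.

```python
# (average installs, category, number of apps), copied from the trimmed output
rows = [
    (14984702, "PHOTOGRAPHY", 244),
    (12837474, "GAME", 813),
    (11640705, "ENTERTAINMENT", 85),
    (5221572, "WEATHER", 69),
    (4889084, "TRAVEL_AND_LOCAL", 183),
    (3849037, "BOOKS_AND_REFERENCE", 173),
    (3464899, "FAMILY", 1499),
    (2252531, "FOOD_AND_DRINK", 94),
    (2058563, "ART_AND_DESIGN", 55),
]


def shortlist(table, max_apps=100, min_average=2000000):
    """Keep categories with fewer than max_apps apps and at least min_average installs."""
    return [
        (category, average, count)
        for average, category, count in table
        if count < max_apps and average >= min_average
    ]


for category, average, count in shortlist(rows):
    print(category, ":", average, "Number of apps:", count)
```

With these assumed thresholds the shortlist is Entertainment, Weather, Food and Drink, and Art and Design. Raising `max_apps` to 200 would also admit Travel and Local and Books and Reference, so the result depends heavily on where the cutoffs are placed.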

Next, let's take a closer look at the Reviews column from the Google Play Store:

In [32]:
## Get a frequency table for Category column

googleplay_category = freq_table(googleplay_final, 1)

## Get number of reviews for each category
table_display = []
for category in googleplay_category:
    total = 0
    len_category = 0
    for row in googleplay_final:
        category_app = row[1]
        if category_app == category:
            reviews = int(row[3])
            total += reviews
            len_category += 1

    ## Convert to int to make results more readable
    average = int(total/len_category) 
    
    
    ## Repurpose code from DataQuest to create sorted list by average number 
    ## of reviews, as well as the number of apps in that category
    key_val_as_tuple = (average, category, len_category)
    table_display.append(key_val_as_tuple)


table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0], "Number of apps:", entry[2])
    

COMMUNICATION : 995608 Number of apps: 287
SOCIAL : 965830 Number of apps: 236
GAME : 683523 Number of apps: 862
VIDEO_PLAYERS : 425350 Number of apps: 159
PHOTOGRAPHY : 404081 Number of apps: 261
TOOLS : 305732 Number of apps: 750
ENTERTAINMENT : 301752 Number of apps: 85
SHOPPING : 223887 Number of apps: 199
PERSONALIZATION : 181122 Number of apps: 294
WEATHER : 171250 Number of apps: 71
PRODUCTIVITY : 160634 Number of apps: 345
MAPS_AND_NAVIGATION : 142860 Number of apps: 124
TRAVEL_AND_LOCAL : 129484 Number of apps: 207
SPORTS : 116938 Number of apps: 301
FAMILY : 113210 Number of apps: 1675
NEWS_AND_MAGAZINES : 93088 Number of apps: 248
BOOKS_AND_REFERENCE : 87995 Number of apps: 190
HEALTH_AND_FITNESS : 78094 Number of apps: 273
FOOD_AND_DRINK : 57478 Number of apps: 110
EDUCATION : 56293 Number of apps: 103
COMICS : 42585 Number of apps: 55
FINANCE : 38535 Number of apps: 328
LIFESTYLE : 33921 Number of apps: 346
HOUSE_AND_HOME : 26435 Number of apps: 73
ART_AND_DESIGN : 24699 N

In [33]:
## Similar to previous analyses, let's look for potential 
## outliers

## Let's take a look at overall statistics
google_totals_table = []
for row in googleplay_final:
    reviews = int(row[3])
    google_totals_table.append(reviews)
    
print("Max: ", max(google_totals_table))
print("Min: ", min(google_totals_table))
print("Average: ", int(sum(google_totals_table)/len(google_totals_table)))

Max:  78158306
Min:  0
Average:  235492


In [34]:
## Similar to before, let's create a frequency table

google_apps_freq_table = {"Less than 100": 0, "More than 12,500,000": 0}
for total in google_totals_table:
    if total < 100:
        google_apps_freq_table["Less than 100"] += 1
    elif total > 12500000:
        google_apps_freq_table["More than 12,500,000"] += 1
        
print(google_apps_freq_table)
    

{'More than 12,500,000': 19, 'Less than 100': 2925}


Let's recalculate the average number of reviews for each category, keeping only apps with more than 100 and fewer than 12,500,000 reviews.  Because the comparisons in the cell are strict, apps with exactly 100 reviews are dropped as well.
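
The same trimmed average is computed for installs and for reviews, so the loop could be factored into one function that makes a single pass over the rows. This is a sketch under assumptions: `trimmed_average_by_category` and its parameters are hypothetical, the sample rows are made up for illustration, and the column positions (category at index 1, reviews at index 3) follow the indices used in the notebook cells.

```python
def trimmed_average_by_category(rows, value_of, low, high, category_index=1):
    """Average value_of(row) per category, keeping only low < value < high."""
    totals = {}  # category -> [running total, number of apps kept]
    for row in rows:
        value = value_of(row)
        if low < value < high:
            entry = totals.setdefault(row[category_index], [0, 0])
            entry[0] += value
            entry[1] += 1
    # Categories with no qualifying apps never enter totals,
    # so the division below cannot hit a zero count.
    table = [(int(total / count), category, count)
             for category, (total, count) in totals.items()]
    return sorted(table, reverse=True)


# Made-up rows with the category at index 1 and the review count at index 3
sample = [
    ["App A", "GAME", "4.5", "2000"],
    ["App B", "GAME", "4.0", "50"],      # dropped: not more than 100 reviews
    ["App C", "WEATHER", "4.2", "1000"],
    ["App D", "GAME", "4.1", "4000"],
]

result = trimmed_average_by_category(sample, lambda row: int(row[3]), 100, 12500000)
for average, category, count in result:
    print(category, ":", average, "Number of apps:", count)
```

Unlike the nested loops in the cells, this version skips categories in which every app is filtered out instead of dividing by zero.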

In [35]:
## Get a frequency table for Category column

googleplay_category = freq_table(googleplay_final, 1)

## Get number of reviews for each category
table_display = []
for category in googleplay_category:
    total = 0
    len_category = 0
    for row in googleplay_final:
        category_app = row[1]
        if category_app == category:
            reviews = int(row[3])
            if reviews > 100 and reviews < 12500000:
                total += reviews
                len_category += 1

    ## Convert to int to make results more readable
    average = int(total/len_category) 
    
    
    ## Repurpose code from DataQuest to create sorted list by average number 
    ## of reviews, as well as the number of apps in that category
    key_val_as_tuple = (average, category, len_category)
    table_display.append(key_val_as_tuple)


table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0], "Number of apps:", entry[2])
    

COMMUNICATION : 707179 Number of apps: 183
GAME : 586094 Number of apps: 754
PHOTOGRAPHY : 490531 Number of apps: 215
SOCIAL : 421551 Number of apps: 157
VIDEO_PLAYERS : 368193 Number of apps: 114
ENTERTAINMENT : 301752 Number of apps: 85
PERSONALIZATION : 290972 Number of apps: 183
SHOPPING : 287437 Number of apps: 155
PRODUCTIVITY : 254206 Number of apps: 218
TOOLS : 245777 Number of apps: 482
MAPS_AND_NAVIGATION : 210874 Number of apps: 84
TRAVEL_AND_LOCAL : 197068 Number of apps: 136
WEATHER : 189980 Number of apps: 64
FAMILY : 169603 Number of apps: 1118
SPORTS : 166812 Number of apps: 211
NEWS_AND_MAGAZINES : 150873 Number of apps: 153
BOOKS_AND_REFERENCE : 142884 Number of apps: 117
HEALTH_AND_FITNESS : 111036 Number of apps: 192
FOOD_AND_DRINK : 78049 Number of apps: 81
BUSINESS : 71464 Number of apps: 138
LIFESTYLE : 63426 Number of apps: 185
FINANCE : 57968 Number of apps: 218
EDUCATION : 56844 Number of apps: 102
COMICS : 49828 Number of apps: 47
DATING : 35855 Number of app

## Final Analysis and Recommendations

Initially, based on the popularity of categories, the recommendation was to pursue a games app, perhaps for kids.  However, further analysis of user reviews and installs suggests that those categories are saturated compared to others.  Even if the company creates a great game, the sheer number of apps in the category would probably make it harder to get noticed, both in the Google Play Store and the Apple Store.

Because the main source of revenue for this company is from in-app ads, I would recommend taking a look at categories where:

1.  There are not a large number of existing apps
2.  There are still a relatively high number of installs or reviews
3.  The category lends itself to targeted advertising

My first recommendation would be "Food and Drink".  This category does not have a large number of apps in either store, but has a relatively high number of installs and reviews.  Based on user responses in the app, there is the potential for targeted advertising through recommendations.

"Books and Reference" is also of interest, although I would recommend further analysis on the Google Play Store to examine the potential for targeted advertising.

I would also recommend further analysis of "Travel and Local" based on the same reasoning as "Books and Reference".