# What types of apps are likely to attract more users?

### Profiling apps to find profitable ideas for the App Store and Google Play markets

For this project, my aim is to help a company that builds Android and iOS mobile apps, which are available on Google Play and the App Store.

This company only builds Android and iOS mobile apps that are free to download and install. Its main source of revenue consists of in-app ads, which means that the revenue for any given app depends largely on its number of users on Google Play and the App Store.

My **goal** for this project is to analyze data to help developers understand what types of apps are likely to attract more users.

Rather than collecting data for every app, I'll analyze a sample of the apps available on Google Play and the App Store. Since there are over 4 million apps on the market, it wouldn't be cost-efficient to collect data for all of them.

I found some relevant existing data at no cost:

[Google Play data set](https://www.kaggle.com/lava18/google-play-store-apps/home): contains approximately 10,000 Android apps from Google Play.

[App Store data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home): contains approximately 7,000 iOS apps from the App Store.


In [1]:
#I create an explore data function to make the data sets more readable
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader

# The Google Play data set
# The encoding is given explicitly so that non-ASCII app names decode the same way on every platform
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]   # first row holds the column names
android = android[1:]         # remaining rows hold the apps

# The App Store data set
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

In [3]:
#I explore both data sets 
explore_data(android, 0, 3, rows_and_columns=True)
print('\n')
explore_data(ios, 0, 3, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+

In [4]:
#printing the column names
#here is a link to the column descriptions 
print(android_header)
print('\n')
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Here is the documentation that offers a more detailed description of each column: [Android](https://www.kaggle.com/lava18/google-play-store-apps) and [iOS](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

## Data Cleaning
To make sure that the data I analyze is accurate, I will:

1. Detect inaccurate data, and correct it or remove it
2. Detect duplicate data, and remove the duplicates
3. Remove data that won't help to answer my research question:
    - Remove non-English apps
    - Remove paid apps

I looked in the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) and found that there was an [error](https://www.kaggle.com/lava18/google-play-store-apps/discussion/81460#latest-518319) in one of the rows.

### Deleting Wrong Data

In [5]:
print(android[10472])
print(android_header[2])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Rating


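The value in the Rating column is unusual: it reads 19, while the rest of the apps are rated up to 5. The row appears to be missing its Category value, so every later value is shifted one column to the left. The 1.9 sitting in the Category column is most likely the real rating, and 19 is most likely the number of reviews.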
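Rows like this one have fewer values than the header has column names (the row printed above has 12 values against 13 column names), so comparing lengths is one quick way to surface them. The sketch below runs on a tiny made-up sample rather than on the real file, and `find_malformed_rows` is a hypothetical helper name, not part of the original notebook.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Illustrative sample (not real data): the second row is missing its Category value
header = ['App', 'Category', 'Rating']
rows = [
    ['Some App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]

for index, row in find_malformed_rows(rows, header):
    print(index, row)
```

A length check only catches rows with missing or extra values; a row with the right number of columns but a wrong value would still need a separate check.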

In [6]:
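#The del statement below is commented out, so the row is not actually removed
#and both printed lengths are the same. Uncomment it (and run it only once) to delete the row.
print(len(android))
#del android[10472]
print(len(android))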

10841
10841


In the discussion, there were also comments about duplicates. I'll explore the data for duplicates below:

### Removing Duplicate Entries
#### **Part One:**

In [7]:
#I'll check an app, let's say instagram, to see if it has duplicates
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [8]:
#I'll go through each app name and add it to unique_apps the first time I see it;
#if I come across that name again, I'll append it to duplicate_apps
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


#### **Part Two:**

There are 1181 cases where an app occurs more than once. I will create a dictionary and use it to remove the duplicates.

In [9]:
print('Expected length', len(android) - 1181)

reviews_max = {}

for app in android:
    name = app[0]      #the app name
    n_reviews = app[3] #the number of reviews (4th column), still a string here
    
    #if the name is already a key in reviews_max and this row's review count
    #is greater than the stored one (note: both values are strings, so this
    #comparison is lexicographic rather than numeric)
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews  #update the number of reviews for that entry
    #if the name hasn't been added to the dictionary yet
    elif name not in reviews_max:      
        reviews_max[name] = n_reviews #add the name with its number of reviews
        
print('Actual length:', len(reviews_max))
print('Expected length:', len(android) - 1181)

Expected length 9660
Actual length: 9660
Expected length: 9660


If I delete the duplicates, there will be 9660 apps in total.
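One caveat about the loop above: `app[3]` is a string, so `<` compares the review counts character by character rather than by value. Below is a minimal sketch of the same idea with a numeric comparison, run on made-up rows. `max_reviews_by_name` is a hypothetical helper, and it assumes every review count is a plain integer string, which does not hold for the malformed row discussed earlier.

```python
def max_reviews_by_name(rows, name_idx=0, reviews_idx=3):
    """Map each app name to its highest review count, compared numerically."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = int(row[reviews_idx])  # assumes a plain integer string
        if name not in best or n_reviews > best[name]:
            best[name] = n_reviews
    return best

# Illustrative rows (not real data): name, category, rating, reviews
sample = [
    ['Demo', 'TOOLS', '4.0', '9'],
    ['Demo', 'TOOLS', '4.0', '10'],
    ['Other', 'GAME', '4.5', '250'],
]

print('9' < '10')                   # string comparison: False
print(max_reviews_by_name(sample))  # numeric comparison keeps 10 for 'Demo'
```

The number of unique names is the same either way, but the row kept for each duplicated app can differ between the two comparisons.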

In [10]:
#I'll keep the entries with the highest number of reviews, since they are 
#probably the most recent -- assuming that the # of users grows over time

android_clean = []
already_added = []

for app in android: #loop through the android data set
    name = app[0] #isolate the name of the app 
    n_reviews = (app[3]) #isolate the number of reviews
    
    #checking for the highest # of reviews that hasn't already been added
    #in order to account for duplicates that share the same highest number of reviews
    #if the # of reviews matches the max previously recorded in the reviews_max dictionary
    if reviews_max[name] == n_reviews and (name not in already_added):
        android_clean.append(app) #add the current row to this list
        already_added.append(name) #add the current name to this list

Next, I'll explore the cleaned data set and confirm that it has 9660 rows.

In [11]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9660
Number of columns: 13


### Removing Non-English Apps
Since the company I'm analyzing data for targets an English-speaking audience, I'll remove apps whose names suggest that they are not directed toward an English-speaking audience.

1. Remove each app whose name contains characters that are not commonly used in English text. Characters commonly used in English text include the letters, the digits 0-9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).
    a. Iterate through the characters of the app name's string
    b. If the app name contains a character whose code number is greater than 127 (the ASCII range is 0-127), then the app probably has a non-English name.

#### **Part One:**

In [12]:
#For example:
print(ios[813][1])
print(ios[6731][1], '\n')
print(android_clean[4412][0])
print(android_clean[7940][0], '\n')

#get the corresponding ASCII number of each character(English range 0 to 127)
print(ord('a'))
print(ord('A'))
print(ord('爱'))
print(ord('5'))
print(ord('+'))

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜 

Wowkwis aq Ka'qaquj
PHARMAGUIDE (DZ) 

97
65
29233
53
43


In [13]:
#for app in android_clean:
#        name = app[0]
def is_english(name):
    
    for character in name:   #for each character of the string
        if ord(character) > 127:   #if it's not in the English ASCII
            return False   #the app name is probably non-English

    return True    #the app name is probably English
is_english('Instagram')
is_english('爱奇艺PPS -《欢乐颂2》电视剧热播')


False

In [14]:
is_english('Docs To Go™ Free Office Suite')
is_english('Instachat 😜')

False

Some English app names contain miscellaneous characters that fall outside the ASCII range of 0-127, like the trademark symbol (™) and this emoji (😜).
My function evaluated them to False.

In [15]:
print(ord('™'))
print(ord('😜'))

8482
128540


**Part Two:**

To avoid losing too much valuable data, I'll create a new function that classifies an app name as English if at most 3 of its characters are above the ASCII threshold of 127.

In [16]:
def is_english_forthemostpart(name):
    non_english = 0
    for character in name:   #for each character of the string
        if ord(character) > 127:   #if it's outside the ASCII range
            non_english += 1       #count it as a non-English character
        
    if non_english > 3:
        return False    #the app name has too many non-English characters to be safely considered English
    else:
        return True     #the app name is probably English with a few miscellaneous characters

print(is_english_forthemostpart('Instagram'))
print(is_english_forthemostpart('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_forthemostpart('Docs To Go™ Free Office Suite'))
print(is_english_forthemostpart('Instachat 😜'))

True
False
True
True


Great! I'll use this function to filter out non-English apps from both data sets:

1. Loop through each data set
    a. If an app name is identified as English, append the whole row to a separate list

In [17]:
android_clean_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english_forthemostpart(name):
        android_clean_english.append(app)

for app in ios:
    name = app[1]   #the app name is in the second column of the iOS data set
    if is_english_forthemostpart(name):
        ios_english.append(app)
     
print('Android', '\n')
print('Number of rows before English filtering:', len(android_clean))
print('Number of rows after English filtering', len(android_clean_english))
print('Number of rows filtered:', len(android_clean) - len(android_clean_english), '\n')

explore_data(android_clean_english, 0, 3, True)
print('\n')

print('iOS', '\n')
print('Number of rows before English filtering:', len(ios))
print('Number of rows after English filtering:', len(ios_english))
print('Number of rows filtered:', len(ios) - len(ios_english), '\n')
explore_data(ios_english, 0, 3, True)

Android 

Number of rows before English filtering: 9660
Number of rows after English filtering 9615
Number of rows filtered: 45 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9615
Number of columns: 13


iOS 

Number of rows before English filtering: 7197
Number of rows after English filtering: 6183
Number of rows filtered: 1014 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', 

I removed 45 non-English apps from the Android data set and 1014 non-English apps from the iOS data set, while limiting the loss of data by making exceptions for app names that contain only a few non-ASCII characters.

I'm left with 9615 Android apps and 6183 iOS apps.
(In these samples, about 22 times as many iOS apps as Android apps were flagged as non-English!)

### Isolating the Free Apps

As I mentioned in the introduction, the company only wants to build apps that are free to download and install, and its main source of revenue consists of in-app ads. The data sets contain both free and non-free apps, so I'll need to isolate the free apps for my analysis.

In [18]:
#loop through each data set to isolate the free apps in separate lists
android_final = []
ios_final = []

for app in android_clean_english:
    price = app[7]
    if price == '0':
        android_final.append(app)

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)

#check the length of each data set to see the number of remaining apps
print('Android')
print(len(android))
print(len(android_clean_english))
print(len(android_final))
print('Removed', len(android_clean_english) - len(android_final), 'paid apps')
print('\n')
print('iOS')
print(len(ios))
print(len(ios_english))
print(len(ios_final))
print('Removed', len(ios_english) - len(ios_final), 'paid apps')

def percent_change(old, new):
    difference = old - new
    return difference/old * 100

print('\nWhat\'s the difference?\n', 
percent_change(len(android_clean_english), len(android_final)),
percent_change(len(android), len(android_final)),
percent_change(len(ios_english), len(ios_final)),
percent_change(len(ios), len(ios_final))
)


Android
10841
9615
8862
Removed 753 paid apps


iOS
7197
6183
3222
Removed 2961 paid apps

What's the difference?
 7.831513260530422 18.254773544875935 47.88937409024746 55.23134639433097


The Android data set shrank by 7.83% when I isolated the free apps. Compared with the original Android data set, it has shrunk by 18.25% since I began the cleaning process.

The iOS data set has 47.89% fewer apps without the paid apps, and the overall cleaning process reduced the iOS data by 55.23%!

The App Store data set had a much larger share of paid apps than the Google Play data set!

Up to this point, I have tried to minimize the loss of valuable data while cleaning the data according to my research goals. I removed inaccurate and duplicate app entries and non-English apps, and isolated the free apps.

## Analysis
Now, I will analyze the data to determine the kinds of apps that are likely to attract more users because the company's revenue is highly influenced by the number of people using their apps.

In order to minimize risk and overhead, the company's validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play
2. If the app has a good response from users, develop it further
3. If the app is profitable after six months, build an iOS version of the app and add it to the App Store

Their end goal is to add the app to both Google Play and the App Store, so they need to find app profiles that are successful in both markets. For example, an app that might work well in both markets is a productivity app that uses gamification to improve utility.


### Most Common App by Genre
I'll begin the analysis by getting a sense of what the most common genres are for each market.
#### **Part One:**
Build frequency tables for a few columns in the data sets.

In [19]:
#find appropriate columns
print(android_header,'\n', android_header[1], android_header[-4])
print('\n', ios_header, '\n', ios_header[-5])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 
 Category Genres

 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 
 prime_genre


I've chosen the 'Category' and 'Genres' columns in the Android data set, and the 'prime_genre' column in the iOS data set.

#### **Part Two**
I'll generate frequency tables that show percentages

In [20]:
#build frequency function
def freq_table(dataset, index):    #take a list of lists and an integer
    total = 0                      #initialize the total count to 0
    table = {}                     #declare the dictionary
    for app in dataset:
        total += 1                 #increase count by 1
        genre = app[index]         #extract app's genre at a predetermined index
        if genre in table:         #if the genre is in the table
            table[genre] += 1      #increase genre's count in the table
        else:                      #otherwise
            table[genre] = 1       #add genre to the table
       
    table_percentages = {}         #declare a new dictionary for genre's percentages
    for key in table:              
        percentage = (table[key] / total) * 100  #calculate the percentage for each frequency based on the total number of rows
        table_percentages[key] = percentage      #add result to new dictionary
    
    return table_percentages
    
#build display function to show table in descending order
def display(dataset, index):        
    table = freq_table(dataset, index)   #generates a freq table 
    table_display = []                   
    for key in table:                    
        key_val_as_tuple = (table[key], key) #transforms dictionary into a list of tuples to make use of sorted function
        table_display.append(key_val_as_tuple)  #stores tuples
        
    table_sorted = sorted(table_display, reverse = True) #sorts tuples in descending order
    for entry in table_sorted:
        print(entry[1], ':', entry[0])  #prints sorted tuples in proper order

#### **Part Three**
I'll analyze the frequency table for each column.

In [21]:
display(ios, -5) #iOS prime_genre column (note: this uses the full ios list, not the filtered ios_final)

Games : 53.66124774211477
Entertainment : 7.433652910935113
Education : 6.294289287203002
Photo & Video : 4.849242740030569
Utilities : 3.4458802278727245
Health & Fitness : 2.501042100875365
Productivity : 2.473252744198972
Social Networking : 2.3204112824788106
Lifestyle : 2.0008336807002918
Music : 1.9174656106711132
Shopping : 1.6951507572599693
Sports : 1.5839933305543976
Book : 1.5562039738780047
Finance : 1.445046547172433
Travel : 1.1254689453939142
News : 1.0421008753647354
Weather : 1.0004168403501459
Reference : 0.8892594136445742
Food & Drink : 0.8753647353063776
Business : 0.7919966652771988
Navigation : 0.6391552035570377
Medical : 0.31957760177851885
Catalogs : 0.1389467833819647


In the iOS table above, which is computed on the full ios list rather than on the filtered ios_final list, the most common genre is Games, followed by Entertainment.
Games accounts for more than 53% of the apps.
Other common genres are Education, Photo & Video, Utilities, Health & Fitness, and Productivity.

Most of the apps are designed for fun (games, entertainment, photo and video editing, social networking, sports, music), and fewer for practical purposes (education, shopping, utilities, productivity, lifestyle).

I wouldn't be quick to recommend an app profile for the App Store market based on this frequency table alone, because even though there is a large number of apps for a particular genre, that does not necessarily imply that they have a large number of users.

In [22]:
display(android_final, 1) #Android's Category column

FAMILY : 18.934777702550214
GAME : 9.693071541412774
TOOLS : 8.451816745655607
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.7941773865944481
MAPS_AND_NAVIGATION : 1.399232678853532
FOOD_AND_DRINK : 1.2412547957571656
EDUCATION : 1.1735499887158656
ENTERTAINMENT : 0.9591514330850823
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8237418190024826
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
PARENTING : 0.6544798013992327
ART_AND_DESIGN : 0.

The most common Android app categories are Family, Game, and Tools.

The landscape is different for apps on Google Play. Most apps are for practical purposes, rather than for fun.

When taking a closer look at the family apps, I found that most of them were actually gaming apps for kids. 

Even so, practical apps have a greater representation on Google Play than on the App Store.

This is confirmed by the Genres column on Google Play.

In [23]:
display(android_final, -4) #Android's Genres column

Tools : 8.440532611148726
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7490408485669149
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

The most common genre is Tools, which accounts for 8.44%, followed by Entertainment (6.07%), Education (5.35%), Business (4.59%), Productivity (3.89%), Lifestyle (3.89%), and Finance (3.7%).

Although Entertainment is the runner-up, it only accounts for 6%, while practical apps have a larger representation.

Overall, the frequency tables show that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps.

However, knowing the most common app genres doesn't give me enough information to recommend the kind of app with the most users. 

I have to look at the number of users to find the most popular apps by genre.

### Most Popular Apps by Genre
I'll calculate the average number of installs for each app genre.
The Google Play dataset has an 'Installs' column, but the App Store dataset has no equivalent column. As a proxy, I'll take the total number of user ratings, which is in the 'rating_count_tot' column.

To calculate the average number of user ratings per app genre on the App Store, I will:

1. Isolate the apps of each genre
2. Sum up the user ratings for the apps of that genre
3. Divide the sum by the number of apps belonging to that genre (not by the total number of apps)
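The three steps above can also be done in a single pass over the data by keeping a running total and a running count per genre. This is only a sketch on made-up rows; `average_by_group` and its parameters are hypothetical names, not part of the original notebook.

```python
def average_by_group(rows, group_idx, value_idx):
    """Return {group: mean of the numeric values in that group} in one pass."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_idx]
        totals[group] = totals.get(group, 0.0) + float(row[value_idx])
        counts[group] = counts.get(group, 0) + 1
    # Divide each group's sum by that group's own count, not by the total row count
    return {group: totals[group] / counts[group] for group in totals}

# Illustrative rows (not real data): name, genre, rating count
sample = [
    ['A', 'Games', '100'],
    ['B', 'Games', '300'],
    ['C', 'Weather', '50'],
]
print(average_by_group(sample, group_idx=1, value_idx=2))
```

A nested loop, one pass over all apps per genre, gives the same averages; the single-pass version just avoids rescanning the data set for every genre.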

### On the App Store

In [24]:
#looking at the rating_count_tot column
ios_freq_table = freq_table(ios_final, -5)

#create nested loop
for genre in ios_freq_table:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            num_user_ratings = float(app[5])   # index of 'rating_count_tot'
            total += num_user_ratings
            len_genre += 1
    avg_num_user_ratings = total/len_genre
    print(genre, ':', avg_num_user_ratings)

Weather : 52279.892857142855
Utilities : 18684.456790123455
Medical : 612.0
Travel : 28243.8
News : 21248.023255813954
Navigation : 86090.33333333333
Sports : 23008.898550724636
Shopping : 26919.690476190477
Education : 7003.983050847458
Finance : 31467.944444444445
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Social Networking : 71548.34905660378
Music : 57326.530303030304
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Productivity : 21028.410714285714
Lifestyle : 16485.764705882353
Business : 7491.117647058823
Reference : 74942.11111111111
Catalogs : 4004.0
Food & Drink : 33333.92307692308
Book : 39758.5


Navigation apps appear to be the most popular with an average of 86,090 user ratings.

In [25]:
navigation_apps = {}

for app in ios_final:
    genre = app[-5]
    name = app[1]
    if genre == 'Navigation':
        navigation_apps[name] = app[5]
        print(app[1], ':', app[5]) # print name and number of ratings

print('\n')
total_of_values = 0
for value in navigation_apps.values():
    total_of_values += int(value)

values = sorted(navigation_apps.values(), key=int, reverse=True)
top_values = values[:2]
sum_top_values = int(top_values[0]) + int(top_values[1])

print('User counts:', values)
print('Top user counts:', top_values)
print('Total users:', total_of_values)

top_values_share = (sum_top_values/total_of_values) * 100
print('Top user counts share:', top_values_share)

#sorted(navigation_apps.items(), key=lambda kv: kv[1], reverse=True)
#print(navigation_apps, '\n')

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


User counts: ['345046', '154911', '12811', '3582', '187', '5']
Top user counts: ['345046', '154911']
Total users: 516542
Top user counts share: 96.78922527112994


Waze and Google Maps have the highest rating counts (my proxy for installs) among Navigation apps, with 345,046 and 154,911 ratings respectively.

Together they account for about 96.8% of all ratings for Navigation apps, and have close to half a million user ratings between them.

In this case, I would recommend creating apps whose profiles consist of GPS Navigation, Maps, Real-time Traffic, and Transit, which are key elements of Waze and Google Maps.

However, the high average number of ratings for Navigation apps is accounted for by these two apps, which dominate the space. Waze has about 345,000 ratings and Google Maps about 155,000, while none of the other Navigation apps reaches 13,000.

This is similar for the Social Networking genre (Facebook, Pinterest, and Skype) and the Music genre (Pandora, Spotify, and Shazam), where a few large apps influence the average.

If my goal is to find the most popular genre based on the average number of users (i.e. installs or rating counts), then I'll need to consider that these huge apps have skewed the average, making the genre seem more popular than it actually is.

Since these genres have such huge players, it might be less risky and more cost-efficient to profile a genre with smaller competitors, and a user average that isn't as skewed by a small number of extremely popular apps. 

For further analysis, I could remove the extremely popular apps, and recalculate the averages for each genre. 
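As a rough illustration of how much the giants move the average, the sketch below compares the plain mean with the median and with the mean after dropping the two largest apps. It uses the six Navigation rating counts printed earlier; `mean_without_top` is a hypothetical helper, and dropping exactly two apps is an arbitrary choice for this example.

```python
from statistics import mean, median

# Rating counts for the Navigation apps, copied from the output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

def mean_without_top(values, n_top):
    """Mean of the values after dropping the n_top largest ones."""
    trimmed = sorted(values)[:len(values) - n_top]
    return mean(trimmed)

print('Mean:', mean(navigation_ratings))
print('Median:', median(navigation_ratings))
print('Mean without top 2:', mean_without_top(navigation_ratings, 2))
```

For these six apps the median is 8,196.5 and the mean without the top two is 4,146.25, against a plain mean of about 86,090, which shows how strongly Waze and Google Maps pull the genre average up.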

### On Google Play


In [26]:
#looking at the installs column
display(android_final, 5)

1,000,000+ : 15.741367637102236
100,000+ : 11.554953735048521
10,000,000+ : 10.516813360415256
10,000+ : 10.200857594222523
1,000+ : 8.395396073121193
100+ : 6.917174452719477
5,000,000+ : 6.838185511171294
500,000+ : 5.574362446400361
50,000+ : 4.773188896411646
5,000+ : 4.513653802753328
10+ : 3.5432182351613632
500+ : 3.2498307379823967
50,000,000+ : 2.2906793048973144
100,000,000+ : 2.1214172872940646
50+ : 1.9183028661701647
5+ : 0.7898894154818324
1+ : 0.5077860528097494
500,000,000+ : 0.2708192281651997
1,000,000,000+ : 0.22568269013766643
0+ : 0.045136538027533285
0 : 0.011284134506883321


The install numbers don't seem precise enough: most values are open-ended (100+, 1,000+, 5,000+, etc.).

It isn't obvious whether an app with 100,000+ installs has 100,000 installs, 200,000, or 500,000. For the goal of finding out which app genres attract the most users, however, the data on the number of users doesn't have to be precise.

I'll take the numbers as they are, meaning that I will treat an app with 100,000+ installs as having 100,000 installs. To perform computations, I'll need to convert each install number from a string to a float, so I need to remove the commas and the plus characters first to prevent an error.
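The conversion can be sketched as a small helper. `parse_installs` is a hypothetical name, and it assumes that commas and a trailing plus sign are the only non-numeric characters in the label, as in the values listed above.

```python
def parse_installs(text):
    """Convert an install label such as '1,000,000+' to a float (its lower bound)."""
    cleaned = text.replace(',', '').replace('+', '')
    return float(cleaned)

for label in ['100,000+', '1,000,000,000+', '0']:
    print(label, '->', parse_installs(label))
```

Without the two `replace` calls, `float('100,000+')` raises a `ValueError`, which is the error the cleaning step is meant to prevent.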

In [27]:
#calculating the average number of installs per app genre for the Google Play dataset

android_freq_table = freq_table(android_final, 1)
all_installs = []

#create nested loop
for category in android_freq_table:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]   # index of 'installs'
            n_installs = n_installs.replace('+', '')   #remove plus characters, replace with empty string
            n_installs = n_installs.replace(',', '')   #remove commas, replace with empty string
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total/len_category
    print(category, ':', avg_n_installs)
    all_installs.append(avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
HEALTH_AND_FITNESS : 4188821.9853479853
FINANCE : 1387692.475609756
BUSINESS : 1712290.1474201474
PRODUCTIVITY : 16787331.344927534
FOOD_AND_DRINK : 1924897.7363636363
VIDEO_PLAYERS : 24727872.452830188
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
SPORTS : 3638640.1428571427
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
MEDICAL : 120616.48717948717
HOUSE_AND_HOME : 1331540.5616438356
SOCIAL : 23253652.127118643
PERSONALIZATION : 5201482.6122448975
TRAVEL_AND_LOCAL : 13984077.710144928
EDUCATION : 1820673.076923077
TOOLS : 10682301.033377837
BEAUTY : 513151.88679245283
LIFESTYLE : 1437816.2687861272
NEWS_AND_MAGAZINES : 9549178.467741935
SHOPPING : 7036877.311557789
EVENTS : 253542.22222222222
ENTERTAINMENT : 11640705.88235294
FAMILY : 3694276.334922527
WEATHER : 5074486.197183099
PHOTOGRAPHY : 17805627.643678162
GAME : 15560965.599534342
BOOKS_AND_REFERENCE : 8767811.8947368

In [78]:
all_installs_sorted = sorted(all_installs, reverse=True)

top_installs = all_installs_sorted[:5]
sum_top_installs = int(top_installs[0]) + int(top_installs[1])
total_of_installs = len(all_installs_sorted)

#print('User counts:', all_installs_sorted)
print('Top user counts:', top_installs)
#print('Total users:', total_of_installs)
most_installs = top_installs[0]  # average number of installs for Communication apps


Top user counts: [38456119.167247385, 24727872.452830188, 23253652.127118643, 17805627.643678162, 16787331.344927534]


The most popular app categories on Google Play, based on the average number of installs per app, are Communication, Video Players, Social, Photography, and Productivity.
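The sorted list above holds only the averages, so the category names have to be matched back by eye. One way to keep the two together, sketched here under the assumption that the averages are stored in a dictionary, is to sort the (category, average) pairs. The name `avg_installs_by_category` is hypothetical, and its values are a rounded subset of the averages printed earlier.

```python
# Rounded subset of the averages printed above, for illustration only
avg_installs_by_category = {
    'COMMUNICATION': 38456119.17,
    'VIDEO_PLAYERS': 24727872.45,
    'SOCIAL': 23253652.13,
    'PHOTOGRAPHY': 17805627.64,
    'PRODUCTIVITY': 16787331.34,
    'GAME': 15560965.60,
    'MEDICAL': 120616.49,
}

# Sort the (category, average) pairs by average, largest first
ranked = sorted(avg_installs_by_category.items(),
                key=lambda pair: pair[1],
                reverse=True)
top_five = ranked[:5]

for category, avg in top_five:
    print(category, ':', avg)
```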

These categories show potential for being profitable on the App Store and Google Play, especially Communication apps, which have the highest average number of installs (38,456,119). A closer look at Communication apps shows that, as on the App Store, a few giants account for the majority of installs, while the rest lag far behind.

In [71]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        name = app[0]
        print(name, ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

WhatsApp Messenger, Messenger, Skype, Google Chrome, Gmail, and Hangouts have the greatest number of installs (over 1,000,000,000).

How do the other apps measure up?
Let's see what the average number of installs would be if I removed the apps that have 100 million or more installs.

In [75]:
under_100 = []
for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = n_installs.replace(',', '')
    n_installs = float(n_installs)
    if app[1] == 'COMMUNICATION' and n_installs < 100000000:
        under_100.append(n_installs)

avg_under_100 = sum(under_100)/len(under_100)

difference_in_average = most_installs/avg_under_100

print(most_installs, avg_under_100, difference_in_average)

38456119.167247385 3603485.3884615386 10.6719231581693


After removing the Communication apps with 100,000,000 or more installs, I found that the new average was roughly 10 times smaller!

The Communication genre's average number of installs is skewed by a few giants that dominate the market. It is not as popular as it seems, and it would be difficult to enter a market with such huge competitors.
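Comparing the mean with the median is another quick way to spot this kind of skew. The sketch below uses made-up install counts rather than the real data set, just to show how a single giant pulls the mean far above the median.

```python
from statistics import mean, median

# Made-up install counts: five ordinary apps plus one giant
sample_installs = [10_000.0, 50_000.0, 100_000.0,
                   500_000.0, 1_000_000.0, 1_000_000_000.0]

sample_mean = mean(sample_installs)
sample_median = median(sample_installs)  # midpoint of the two middle values

print('mean  :', sample_mean)
print('median:', sample_median)
```

When the mean is many times larger than the median, as it is here, a few very large values are dominating the average.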

There is a similar pattern for the runners-up:

- Video Players: YouTube, Google Play Movies & TV, or MX Player
- Social: Facebook, Instagram, Google+
- Photography: Google Photos and other popular photo editors
- Productivity: Microsoft Word, Dropbox, Google Calendar, Evernote
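To repeat the check for any of these categories, the filtering step can be generalized. This is a sketch only: `avg_installs` and `sample_apps` are hypothetical names, and the rows are made up, assuming the same layout as `android_final` (name at index 0, category at index 1, installs at index 5).

```python
def avg_installs(apps, category, cap=None):
    """Average installs for a category, optionally ignoring apps at or above `cap`."""
    values = []
    for app in apps:
        if app[1] != category:
            continue
        n_installs = float(app[5].replace('+', '').replace(',', ''))
        if cap is None or n_installs < cap:
            values.append(n_installs)
    # Avoid dividing by zero when no app matches
    return sum(values) / len(values) if values else 0.0

# Made-up rows for illustration; only indexes 0, 1, and 5 are used
sample_apps = [
    ['Giant Social', 'SOCIAL', '', '', '', '1,000,000,000+'],
    ['Small Social A', 'SOCIAL', '', '', '', '1,000,000+'],
    ['Small Social B', 'SOCIAL', '', '', '', '500,000+'],
]

print(avg_installs(sample_apps, 'SOCIAL'))                   # all apps
print(avg_installs(sample_apps, 'SOCIAL', cap=100_000_000))  # giants excluded
```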

The game genre is popular, but the market is saturated, so it's best to look elsewhere for recommendations.

The Books and Reference genre looks popular on both Google Play and the App Store. On Google Play, its average number of installs is 8,767,811.

On the App Store, the numbers of user ratings for the corresponding genres are 74,942.11 (Reference) and 39,758.5 (Book).

Since this genre has demonstrated potential on the App Store, it fulfills the goal of recommending a profile that can be successful on both platforms.

Here are the apps in Books and Reference with their number of installs:

In [81]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        name = app[0]
        print(name, ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

The Books and Reference genre includes a variety of apps: software for processing and reading books, various collections of libraries, dictionaries, tutorials on programming and languages, and religious texts.

Here are the popular apps with 100,000,000 or more installs.

In [82]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        name = app[0]
        print(name, ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


In comparison to the other seemingly more popular categories, there aren't as many apps that skew the average.

This market has potential because there are only a few popular apps. 
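One way to make this comparison concrete is to count, per category, how many apps sit at or above the 100,000,000 mark. The sketch below uses made-up rows in the same layout as `android_final` (category at index 1, installs at index 5); `count_giants` is a hypothetical helper, not something defined earlier in the notebook.

```python
def count_giants(apps, threshold=100_000_000):
    """Count apps per category whose installs are at or above `threshold`."""
    counts = {}
    for app in apps:
        n_installs = float(app[5].replace('+', '').replace(',', ''))
        if n_installs >= threshold:
            counts[app[1]] = counts.get(app[1], 0) + 1
    return counts

# Made-up rows for illustration; only indexes 1 and 5 are used
sample_apps = [
    ['Big Chat', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Mid Chat', 'COMMUNICATION', '', '', '', '500,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '10,000+'],
    ['Big Reader', 'BOOKS_AND_REFERENCE', '', '', '', '100,000,000+'],
    ['Small Reader', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
]

giant_counts = count_giants(sample_apps)
print(giant_counts)
```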

I'll look at the midrange apps to get some ideas for app recommendations.

In [85]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

The midrange apps are oversaturated with software for processing and reading ebooks, collections of libraries, and dictionaries.
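A rough way to back up this impression is to tally how often certain keywords appear in the app names. The sketch below runs on a handful of names copied from the output above, and the keyword list is my own assumption about what signals an ebook reader, a dictionary, or a sacred text.

```python
# A few app names copied from the midrange output above
sample_names = [
    'Wikipedia',
    'Cool Reader',
    'FBReader: Favorite Book Reader',
    'AlReader -any text book reader',
    'Ebook Reader',
    'Moon+ Reader',
    'Aldiko Book Reader',
    'Dictionary - WordWeb',
    'English-Myanmar Dictionary',
    'Golden Dictionary (EN-AR)',
    'Al-Quran (Free)',
    'Al Quran Indonesia',
    'All Maths Formulas',
    'HTC Help',
]

# Assumed keywords; matching is case-insensitive and counts each name once
keywords = ['reader', 'dictionary', 'quran']

keyword_counts = {}
for keyword in keywords:
    keyword_counts[keyword] = sum(
        1 for name in sample_names if keyword in name.lower()
    )

print(keyword_counts)
```

Substring matching like this is crude (it would miss a reader app with no "reader" in its name), so the counts are a sanity check rather than a precise measure.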

However, some successful apps are built around a single book, a popular academic subject (e.g. All Maths Formulas), a sacred text (e.g. Quran, Bible), or even a help manual (e.g. HTC Help), which suggests that such apps can be both profitable and feasible.

The most popular software for reading ebooks includes features for highlighting and note-taking.

The app "All Maths Formulas", which has over 1 million installs, pairs its math formulas with calculation tools, visualizations, and quizzes that make the content practical to learn.

The app Brilliant Quotes: Life, Love, Family & Motivation, which also has over 1,000,000 installs, includes morning and evening notifications with quotes, a community of people who like quotes, a feature for saving and adding new quotes to make the content more personal, a feature for learning more about the authors of the quotes, an option to display content outside of the app with a widget, and a search feature for navigating all the content.

It seems that taking an in-demand academic subject or a popular book and turning it into an app can be profitable for both the App Store and Google Play.

Since there are many ebook readers, I wouldn't advise simply creating an app with the raw version of the book. I would recommend adding features that leverage the book's content to give the user more convenience and utility. When people download books to their phones, tablets, and computers, it is usually for the convenience and ease of accessing the book's content.

### Conclusions

In this project, I analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets. I concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

I would recommend creating an app based on a popular book and adding features such as background on the book's critical and interesting points drawn from popular sources across the web, daily quotes from the book, an audio version of the book, quizzes on the book, or a forum where people can discuss the book.