# Profitable App Profiles for the App Store and Google Play Markets

In this project, we will explore what applications will best engage users to increase revenue for developers on Google Play and the App Store, using two datasets. We will pretend we are working as data analysts for a company that builds Android and iOS mobile apps, available in their respective stores. Our company only builds apps that are free to download and install, with our main source of revenue comprising in-app ads. The more users who see and engage with the ads, the higher our revenue. Our goal is to provide insight for developers on what kind of apps are likely to attract more users. 

Documentation for the Google Play Store apps dataset can be found [here](https://www.kaggle.com/lava18/google-play-store-apps), and documentation for the App Store apps dataset can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [1]:
from csv import reader

# Importing the App Store data set
opened_file_1 = open('AppleStore.csv', encoding = "utf8")
read_file_1 = reader(opened_file_1)
app_Store = list(read_file_1)

# Importing the Google Play data set
opened_file_2 = open('googleplaystore.csv', encoding = "utf8")
read_file_2 = reader(opened_file_2)
gp_Store = list(read_file_2)
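
The cell above leaves both files open. As a minimal sketch of the same loading step, assuming a hypothetical helper named `load_csv` (not part of the original notebook), a `with` block closes the file automatically. A tiny stand-in file is written first so the example runs without the Kaggle downloads:

```python
import csv
import os
import tempfile

def load_csv(path):
    """Return (header, rows) for a CSV file, closing the file afterwards."""
    # newline='' is the setting the csv module documentation recommends.
    with open(path, encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Stand-in data (made up for illustration), written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.csv')
    with open(sample_path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows([['App', 'Rating'], ['Demo App', '4.1']])
    header, rows = load_csv(sample_path)

print(header)
print(rows)
```

Splitting the header from the data rows up front would also avoid having to remember the `[1:]` slice in later cells.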

To ease the process of exploring these data sets, we'll write a function called `explore_Data()`. We'll also add an option for `explore_Data()` to show the number of rows and columns for any data set.

In [2]:
def explore_Data(dataset, start, end, rows_And_Columns=False):
    dataset_Slice = dataset[start:end]    
    for row in dataset_Slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_And_Columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_Data(app_Store, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


Of all the columns in the app_Store data set, `track_name`, `size_bytes`, `price`, `user_rating`, `cont_rating`, and `prime_genre` are the variables of interest. We can also see that our data set has 7,198 rows (including the header row) and 16 columns.

In [4]:
explore_Data(gp_Store, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


Of all the columns in the gp_Store data set, `App`, `Category`, `Rating`, `Size`, `Installs`, `Type`, `Price`, `Content Rating`, and `Genres` are the variables of interest. We can also see that our data set has 10,842 rows (including the header row) and 13 columns.

## Removing Incomplete Data

According to an [online discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) for the Google Play data set, there is [an error for row 10472](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015). 

In [5]:
explore_Data(gp_Store, 10473, 10474) # incorrect row, note the header row is still in the data set thus we look at row 10473.

explore_Data(gp_Store, 0, 1)         # header

explore_Data(gp_Store, 1, 2)         # instance of correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']




We see the incorrect row corresponds to the app *Life Made WI-Fi Touchscreen Photo Frame*, with a rating of 19. The maximum rating for apps in the Google Play store is 5. According to its [discussions section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), the mistake appears to be caused by a missing value in the 'Category' column: the row has 12 values instead of 13, so every later value is shifted one column to the left. We will remove this row from the dataset.

In [6]:
del gp_Store[10473]
explore_Data(gp_Store, 10473, 10474) # It has been removed.

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




To be sure there are no more rows with missing values, let's create a function that compares the length of a row with the length of the header row. If there is a discrepancy between the two, we will know the row has an improper number of values.

In [7]:
# This function checks for rows that are missing values. 
def check_Rows(dataset):
    incomp_Rows = []
    length_DS = len(dataset[0]) # Obtains row length of header, which will be used to check the rest of the rows.
    for row in dataset:
        if len(row) != length_DS:
            incomp_Rows.append(row)
    return incomp_Rows

app_Incomp = check_Rows(app_Store) # list of incomplete rows in app_Store
print(app_Incomp)
gp_Incomp = check_Rows(gp_Store)   # list of incomplete rows in gp_Store
print(gp_Incomp)

[]
[]


It turns out that aside from that one row we removed, there are no other rows with missing values.

## Deleting Duplicate Entries

To demonstrate the presence of duplicate entries, we will verify such an occurrence happening with one of the apps brought up in the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/136133) of the Google Play data set, namely 'Instagram'.

In [8]:
for app in gp_Store:
    name = app[0]
    if name == 'Instagram':
        print(app)

# Note that the fourth column has varying values. This is the Reviews column.
# We can use this column to keep the entry with the most reviews and remove the duplicates.
# This way, we focus on the most up-to-date entry.

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Note that the fourth column has varying values. This is the `Reviews` column. We can use it to keep the entry with the most reviews and remove the other duplicates; the higher the review count, the more recent the entry should be. Let's create a function to determine the number of duplicate entries.

In [9]:
# This function checks for duplicate values in the first column of a data set:
# the app name ('App') for Google Play and the 'id' for the App Store.

def check_Dup(dataset, ret_Dup = True):
    unique_Apps = []
    duplicate_Apps = []
    for row in dataset:
        name = row[0]
        if name in unique_Apps:
            duplicate_Apps.append(name)
        else:
            unique_Apps.append(name)
    if ret_Dup:
        return duplicate_Apps
    else:
        return unique_Apps

gp_Dup = check_Dup(gp_Store) # list of duplicated values in gp_Store
app_Dup = check_Dup(app_Store) # list of duplicated values in app_Store
print('Number of Duplicate GP Apps: ' + str(len(gp_Dup)))
print('Number of Duplicate Apple Apps: ' + str(len(app_Dup)))

Number of Duplicate GP Apps: 1181
Number of Duplicate Apple Apps: 0


There are 1181 duplicate entries in the Google Play dataset, while there are zero in the App Store dataset.
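
Because `check_Dup()` always reads column 0, it compares app names for Google Play but `id` values for the App Store. A hedged sketch of a variant that takes the column as a parameter and uses a `set` for faster membership checks is shown below; `count_duplicates` and `name_index` are illustrative names, and the sample rows are made up:

```python
def count_duplicates(rows, name_index=0):
    """Count rows whose value in column name_index was already seen."""
    seen = set()
    duplicates = 0
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates += 1
        else:
            seen.add(name)
    return duplicates

# Made-up rows in the App Store layout: id first, then track_name.
sample = [
    ['1', 'Demo App'],
    ['2', 'Demo App'],
    ['3', 'Other App'],
]

by_id = count_duplicates(sample, name_index=0)
by_name = count_duplicates(sample, name_index=1)
print(by_id, by_name)
```

With `name_index=1`, the same check would look at `track_name` in the App Store data rather than `id`.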

In [10]:
print('Without duplicates, the Google Play dataset should have ' + str(len(gp_Store[1:]) - 1181) + ' rows.')

Without duplicates, the Google Play dataset should have 9659 rows.


In [11]:
unique_GP = {} # Maps each app name to its highest review count, so it has no duplicates.
for row in gp_Store[1:]:
    review_Num = float(row[3]) # review_Num is the number of reviews an app has.
    name = row[0]
    if name in unique_GP:
        dict_Var = unique_GP[name] # Highest review count stored so far for this app.
        if review_Num > dict_Var:
            unique_GP[name] = review_Num # Updates unique_GP to hold the highest number of reviews.
    else:
        unique_GP[name] = review_Num
        
print(len(unique_GP))

# Time to clean the Google Play dataset for our purposes. 

clean_GP = []
dup_GP = []

for row in gp_Store[1:]:
    name = row[0]
    num_Rev = float(row[3])
    if num_Rev == unique_GP[name] and name not in dup_GP:
        clean_GP.append(row)
        dup_GP.append(name)
print(len(clean_GP))

9659
9659
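
The two loops above can also be sketched as a single pass that stores the whole row instead of just the review count. This is an alternative under stated assumptions, not the notebook's method: `keep_most_reviewed` is a hypothetical name, and the result is ordered by the first appearance of each app name, which may differ from the order of `clean_GP`. The `Demo App` row is made up; the Instagram rows are shortened from the output shown earlier.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # dicts preserve insertion order in Python 3.7+
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Demo App', 'TOOLS', '4.0', '120'],  # made-up row
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

deduplicated = keep_most_reviewed(sample)
for row in deduplicated:
    print(row)
```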


We successfully removed the duplicate rows in the Google Play dataset. Now let's check for non-English apps. Our company uses English for the apps we develop, thus we don't need apps not designed for an English-speaking audience. 

In [12]:
# Checking for non-English characters in the names of the apps.

def non_Eng(string):
    # Flags a name as non-English once it contains 3 or more non-ASCII characters (including emojis).
    # This way, we will not remove apps that have one or two outlier characters like ™...
    # but remove most non-English apps such as 爱奇艺PPS -《欢乐颂2》电视剧热播 from our data set.
    flags = 0
    for char in string:
        if ord(char) > 127:
            flags += 1
            if flags == 3:
                return True
    return False
    
cleaner_GP = []
cleaner_App = []

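# Note: clean_GP has no header row (it was built from gp_Store[1:]), so this slice skips its first app.
for row in clean_GP[1:]: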
    name = row[0]
    if not non_Eng(name):
        cleaner_GP.append(row)
        
for row in app_Store[1:]:
    name = row[1]
    if not non_Eng(name):
        cleaner_App.append(row)
print('The length of the newly cleaned GP dataset is: ' + str(len(cleaner_GP)) + ' compared to our beginning ' + str(len(gp_Store)) + '.')
print('The length of the newly cleaned App Store dataset is: ' + str(len(cleaner_App)) + ' compared to our beginning ' + str(len(app_Store)) + '.')

The length of the newly cleaned GP dataset is: 9596 compared to our beginning 10841.
The length of the newly cleaned App Store dataset is: 6155 compared to our beginning 7198.
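
To see how the threshold behaves, here is a small self-contained sketch of the same rule with the limit exposed as a parameter. `is_non_english` and `limit` are illustrative names, and the rule is only a heuristic based on ASCII code points, not a real language detector:

```python
def is_non_english(name, limit=3):
    """Return True when name contains at least `limit` non-ASCII characters."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii >= limit

examples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in examples:
    print(name, ':', is_non_english(name))
```

Only the last name is flagged; the trademark symbol and the single emoji stay under the limit.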


## Most Popular Apps in the App Store and Google Play Store

Now that the data is all clean, it is time to figure out what applications are the most popular among the App Store and Google Play users to determine the types of applications that will best generate revenue for developers. We'll start by finding the most common genres.

In [13]:
def freq_Table(dataset, index): # Creates a frequency table of any column we want.
    freq_T = {}
    for row in dataset:
        item = row[index]
        if item in freq_T:
            freq_T[item] += 1
        else:
            freq_T[item] = 1
    return freq_T

def display_Table(dataset, index): # Sorts the dictionary by descending order.
    table = freq_Table(dataset, index)
    table_Display = []
    for key in table:
        key_Val_As_Tuple = (table[key], key)
        table_Display.append(key_Val_As_Tuple)

    table_Sorted = sorted(table_Display, reverse = True)
    for entry in table_Sorted:
        print(entry[1], ':', entry[0])
        
display_Table(cleaner_App, 11) # prime_genre column
print('\n')
display_Table(cleaner_GP, 9) # Genres column
print('\n')
display_Table(cleaner_GP, 1) # Category column

Games : 3380
Entertainment : 446
Education : 409
Photo & Video : 341
Utilities : 211
Productivity : 168
Health & Fitness : 164
Music : 137
Social Networking : 126
Sports : 104
Lifestyle : 98
Shopping : 84
Weather : 69
Travel : 59
News : 56
Business : 53
Book : 53
Reference : 51
Finance : 48
Food & Drink : 44
Navigation : 28
Medical : 21
Catalogs : 5


Tools : 825
Entertainment : 557
Education : 503
Business : 419
Medical : 395
Personalization : 375
Productivity : 373
Lifestyle : 361
Finance : 345
Sports : 330
Communication : 313
Action : 298
Health & Fitness : 288
Photography : 280
News & Magazines : 249
Social : 239
Travel & Local : 218
Books & Reference : 217
Shopping : 201
Simulation : 190
Arcade : 183
Dating : 170
Casual : 165
Video Players & Editors : 161
Maps & Navigation : 128
Puzzle : 119
Food & Drink : 112
Role Playing : 104
Strategy : 94
Racing : 91
Libraries & Demo : 84
Auto & Vehicles : 84
Weather : 78
House & Home : 71
Adventure : 71
Events : 64
Art & Design : 55
Comics : 

For the `prime_genre` column of the cleaned App Store dataset, the most common genre is `Games` (3380 apps), with `Entertainment` (446) and `Education` (409) as the runners-up. The abundance of games may be due to several factors: the appeal of games to people of all ages, the relative ease of making a game feel unique compared to apps in other genres, and the number of subgenres within games themselves. Overall, it seems the App Store is filled with apps designed for fun. However, this is not enough to make a recommendation to developers.

In the `Genres` column of the Google Play data set, `Tools` is the most common genre (825 apps). Unlike `prime_genre`, this column has no single games genre; games are split into subgenres such as action, arcade, and casual. Summed together, the game genres add up to 940 apps, which surpasses tools, but family is by far the most common. Many applications from every category are listed under the family genre. Altogether, the Google Play Store appears to offer a mix of practical and entertaining applications.
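
Raw counts are hard to compare across two stores of different sizes. As a hedged sketch, the same frequency table can be expressed as percentages with `collections.Counter`; `percentage_table` is a hypothetical helper and the sample rows are made up:

```python
from collections import Counter

def percentage_table(rows, index):
    """Return (value, percentage) pairs for one column, most common first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]

# Made-up rows: app name, then genre.
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Education'],
    ['App D', 'Games'],
]

shares = percentage_table(sample, 1)
for genre, share in shares:
    print(genre, ':', share)
```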

To make a valid recommendation, more needs to be explored. Average ratings per genre, average number of installations per genre, and the like are good places to start.

### Most Popular iOS Apps by Genre

In [14]:
prime_Genre_App = freq_Table(cleaner_App, 11)

pg_App_List = []

for genre in prime_Genre_App:
    total = 0    # sum of all ratings
    len_Genre = 0 # number of apps in genre
    for row in cleaner_App:
        genre_App = row[11]
        if genre_App == genre:
            rating = row[7]
            total += float(rating)
            len_Genre += 1
    avg_Rat = round(total / len_Genre, 2)
    tuple_Append = (genre, avg_Rat)
    pg_App_List.append(tuple_Append)

pg_App_List = sorted(pg_App_List, key = lambda ratings: ratings[1], reverse = True)
print(pg_App_List)

[('Catalogs', 4.2), ('Games', 4.06), ('Productivity', 4.03), ('Music', 3.98), ('Reference', 3.98), ('Shopping', 3.98), ('Business', 3.98), ('Health & Fitness', 3.87), ('Book', 3.87), ('Photo & Video', 3.85), ('Weather', 3.7), ('Navigation', 3.66), ('Food & Drink', 3.65), ('Education', 3.59), ('Travel', 3.57), ('Social Networking', 3.51), ('Entertainment', 3.5), ('Finance', 3.47), ('Medical', 3.45), ('Lifestyle', 3.41), ('Utilities', 3.39), ('News', 3.36), ('Sports', 3.09)]


Assuming a reasonable threshold is an average rating above 3.8, we would recommend catalogs, games, productivity, music, reference, shopping, business, health & fitness, book, and photo & video apps. However, this is still not enough information: average ratings do not take into account the number of installs, which would help us home in on what to recommend.

While we do not have the information on how many people downloaded each app for the App Store dataset, we can substitute the total number of ratings instead.

In [15]:
tot_Rating_List = []

for genre in prime_Genre_App:
    total = 0 # sum of the rating counts (rating_count_tot)
    len_Genre = 0 # number of apps in genre
    for row in cleaner_App:
        genre_App = row[11]
        if genre_App == genre:
            tot_Rating = row[5]
            total += float(tot_Rating)
            len_Genre += 1
    avg_Rat = round(total / len_Genre, 2)
    tuple_Append = (genre, avg_Rat)
    tot_Rating_List.append(tuple_Append)
    
tot_Rating_List = sorted(tot_Rating_List, key = lambda num_Rat: num_Rat[1], reverse = True)
print(tot_Rating_List)

[('Social Networking', 60253.85), ('Music', 29047.11), ('Reference', 28096.22), ('Shopping', 26938.96), ('Finance', 23840.06), ('Weather', 23145.25), ('Food & Drink', 19934.39), ('Navigation', 19370.82), ('Travel', 19351.44), ('News', 17283.54), ('Games', 15641.67), ('Sports', 15350.91), ('Photo & Video', 14688.72), ('Health & Fitness', 10868.02), ('Book', 10750.11), ('Lifestyle', 9021.5), ('Entertainment', 8920.81), ('Productivity', 8508.09), ('Utilities', 8002.3), ('Business', 5149.32), ('Catalogs', 3465.0), ('Education', 2478.21), ('Medical', 648.95)]
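
Both averaging cells rescan the whole dataset once per genre. A minimal sketch of a single-pass version follows, assuming a hypothetical helper `average_by_genre` that takes the genre column and the value column as parameters; the sample rows are made up:

```python
from collections import defaultdict

def average_by_genre(rows, genre_index, value_index):
    """Return (genre, average) pairs sorted by average, highest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        genre = row[genre_index]
        totals[genre] += float(row[value_index])
        counts[genre] += 1
    averages = [(genre, round(totals[genre] / counts[genre], 2))
                for genre in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)

# Made-up rows: app name, genre, rating count.
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Music', '500'],
]

result = average_by_genre(sample, 1, 2)
print(result)
```

For the cleaned App Store rows, `genre_index=11` with `value_index=7` would correspond to the average rating, and `value_index=5` to the average rating count.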


Social networking apps have the most user ratings per app on average (about 60,254), which we use here as a proxy for installs. Although the dataset contains 3380 games and only 126 social networking apps, the average social networking app has collected nearly four times as many ratings as the average game (60,253.85 versus 15,641.67). With giants such as Snapchat and Facebook, this may come as no surprise. Although apps in the social networking genre have an average rating of only 3.51, it may be prudent to make an exception to the 3.8 rating threshold, as the sheer number of ratings is worth noting.

The music genre is next in number, and with an average rating of 3.98, the apps within the music genre will definitely be recommended. The same will occur with apps within the reference genre (average rating of 3.98). Shopping (average rating of 3.98), food & drink (average rating of 3.65), and games (average rating of 4.06) will also be recommended.

Now let's turn our attention to Google Play apps.

### Most Popular Google Play Apps by Genre

We conveniently have the number of installs for each app, but the numbers are not precise enough--most values are open-ended (50+, 100+, 500+, etc.):

In [16]:
freq_Table(cleaner_GP, 5) # Installs column

{'5,000,000+': 604,
 '50,000,000+': 202,
 '100,000+': 1103,
 '50,000+': 462,
 '1,000,000+': 1414,
 '10,000+': 1018,
 '10,000,000+': 937,
 '5,000+': 462,
 '500,000+': 503,
 '1,000,000,000+': 20,
 '100,000,000+': 189,
 '1,000+': 879,
 '500,000,000+': 24,
 '50+': 204,
 '100+': 704,
 '500+': 326,
 '10+': 383,
 '1+': 66,
 '5+': 82,
 '0+': 13,
 '0': 1}

The issue with this is we don't know whether an app with 50,000+ installs has 99,999, 87,654, or 65,432 installs. However, we don't need the data to be this precise; we just need an idea of how popular each app genre is. Therefore, we will assume that an app with 5,000+ installs has 5,000 installs, an app with 100,000,000+ installs has 100,000,000 installs, and so on. We'll need to convert the values to `float`, removing the commas and plus characters in the process (done below in the loop).
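
The conversion described above, shown in isolation as a small sketch (`parse_installs` is an illustrative name, not part of the notebook):

```python
def parse_installs(text):
    """Turn an Installs value such as '1,000,000+' into a float lower bound."""
    return float(text.replace('+', '').replace(',', ''))

for value in ['0', '50+', '1,000,000+', '1,000,000,000+']:
    print(value, '->', parse_installs(value))
```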

In [17]:
categ_GP = freq_Table(cleaner_GP, 1)
categ_GP_List = []

for category in categ_GP:
    total = 0
    len_Category = 0
    for row in cleaner_GP:
        category_App = row[1]
        if category_App == category:
            num_Inst = row[5]
            num_Inst = num_Inst.replace('+', '') # Remove the plus characters
            num_Inst = num_Inst.replace(',', '') # Remove the commas
            total += float(num_Inst)
            len_Category += 1
    avg_Inst = round(total / len_Category, 2) 
    tuple_Append = (category, avg_Inst) # A tuple of the category and the average number of installs
    categ_GP_List.append(tuple_Append)
    
categ_GP_List = sorted(categ_GP_List, key = lambda inst: inst[1], reverse = True) 
print(categ_GP_List)

[('COMMUNICATION', 35266026.33), ('VIDEO_PLAYERS', 24121489.08), ('SOCIAL', 22961790.38), ('PHOTOGRAPHY', 16636241.27), ('PRODUCTIVITY', 15530942.01), ('GAME', 14210387.68), ('TRAVEL_AND_LOCAL', 13218662.77), ('ENTERTAINMENT', 11375402.3), ('TOOLS', 9809631.86), ('NEWS_AND_MAGAZINES', 9510848.43), ('BOOKS_AND_REFERENCE', 7676991.13), ('SHOPPING', 6966908.88), ('WEATHER', 4628211.79), ('PERSONALIZATION', 4086652.49), ('HEALTH_AND_FITNESS', 3972300.39), ('MAPS_AND_NAVIGATION', 3892045.94), ('SPORTS', 3384026.23), ('FAMILY', 3345018.52), ('ART_AND_DESIGN', 1919103.39), ('FOOD_AND_DRINK', 1891060.28), ('EDUCATION', 1782566.04), ('BUSINESS', 1663758.63), ('LIFESTYLE', 1377507.01), ('HOUSE_AND_HOME', 1360598.04), ('FINANCE', 1319851.4), ('COMICS', 832613.89), ('DATING', 828971.22), ('AUTO_AND_VEHICLES', 632501.32), ('LIBRARIES_AND_DEMO', 630903.69), ('PARENTING', 525351.83), ('BEAUTY', 513151.89), ('EVENTS', 249580.64), ('MEDICAL', 96944.5)]


It appears that communication apps have the most installs on average (35,266,026). However, apps like WhatsApp, Facebook Messenger, and Skype heavily skew this number:

In [18]:
for app in cleaner_GP:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

We see the same thing happening in the video players genre with giants like YouTube, Google Play Movies & TV, and MX Player:

In [19]:
for app in cleaner_GP:
    if app[1] == 'VIDEO_PLAYERS' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

YouTube : 1,000,000,000+
Motorola Gallery : 100,000,000+
VLC for Android : 100,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+
Dubsmash : 100,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+
Motorola FM Radio : 100,000,000+


It also occurs in the social genre with apps like Facebook, Instagram, and Google+:

In [20]:
for app in cleaner_GP:
    if app[1] == 'SOCIAL' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Pinterest : 100,000,000+
Google+ : 1,000,000,000+
Badoo - Free Chat & Dating App : 100,000,000+
Tango - Live Video Broadcast : 100,000,000+
Instagram : 1,000,000,000+
Snapchat : 500,000,000+
LinkedIn : 100,000,000+
Tik Tok - including musical.ly : 100,000,000+
BIGO LIVE - Live Stream : 100,000,000+
VK : 100,000,000+
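
One hedged way to see how much a few giants can pull an average is to compare the mean with the median. The install counts below are illustrative only and are not taken from the data set:

```python
from statistics import mean, median

# Illustrative install counts: one giant and four small apps.
installs = [1_000_000_000.0, 10_000.0, 5_000.0, 1_000.0, 100.0]

mean_installs = mean(installs)
median_installs = median(installs)

print('Mean:  ', mean_installs)
print('Median:', median_installs)
```

The single giant lifts the mean to roughly 200 million, while the median stays at 5,000. Reporting a median per category alongside the average would be one possible way to reduce this effect.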


Let's create a function to check which of the next 10 genres (following communication, video players, and social) have 5 or fewer "giants" (defined here as apps in the 100,000,000+ installs bucket or higher):

In [21]:
def num_Check(dataset, genre_List):
    potential_Genres = []
    for genre in genre_List:
        counter = 0 # number of "giants"
        for app in dataset:
            if app[1] == genre and (app[5] == '1,000,000,000+'
                                              or app[5] == '500,000,000+'
                                              or app[5] == '100,000,000+'):
                counter += 1
            if counter > 5:
                break
        if counter <= 5:
            potential_Genres.append(genre)
    return potential_Genres

list_Genre = ['PHOTOGRAPHY', 'PRODUCTIVITY', 'GAME', 'TRAVEL_AND_LOCAL',
              'ENTERTAINMENT', 'TOOLS', 'NEWS_AND_MAGAZINES', 'BOOKS_AND_REFERENCE',
              'SHOPPING', 'WEATHER']

print(num_Check(cleaner_GP, list_Genre))

['TRAVEL_AND_LOCAL', 'ENTERTAINMENT', 'NEWS_AND_MAGAZINES', 'BOOKS_AND_REFERENCE', 'SHOPPING', 'WEATHER']
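
The same check can be sketched more compactly by counting the giants per category once and then filtering. `GIANT_BUCKETS`, `categories_with_few_giants`, and `make_row` are hypothetical names, and the sample rows are made up to mimic the Google Play column layout:

```python
from collections import Counter

# Install buckets treated as "giants", matching the three values used above.
GIANT_BUCKETS = {'100,000,000+', '500,000,000+', '1,000,000,000+'}

def categories_with_few_giants(rows, categories, max_giants=5,
                               category_index=1, installs_index=5):
    """Return the categories that contain at most max_giants giant apps."""
    giants = Counter(row[category_index] for row in rows
                     if row[installs_index] in GIANT_BUCKETS)
    return [category for category in categories
            if giants[category] <= max_giants]

def make_row(category, installs):
    """Build a made-up row with Category at index 1 and Installs at index 5."""
    return ['Demo App', category, '', '', '', installs]

sample = [make_row('GAME', '100,000,000+') for _ in range(6)]
sample.append(make_row('WEATHER', '500,000,000+'))
sample.append(make_row('SHOPPING', '10,000+'))

few_giants = categories_with_few_giants(sample, ['GAME', 'WEATHER', 'SHOPPING'])
print(few_giants)
```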


The following genres share potential within both the App Store and Google Play store: books and reference, shopping, weather, travel, and news.

Seeing as the travel genre has the highest average number of installs per app among these genres (about 13,218,663), let's take a look at some apps in the travel genre:

In [22]:
for app in cleaner_GP:
    if app[1] == 'TRAVEL_AND_LOCAL':
        print(app[0], ':', app[5])

trivago: Hotels & Travel : 50,000,000+
Hopper - Watch & Book Flights : 5,000,000+
TripIt: Travel Organizer : 1,000,000+
Trip by Skyscanner - City & Travel Guide : 500,000+
CityMaps2Go Plan Trips Travel Guide Offline Maps : 1,000,000+
KAYAK Flights, Hotels & Cars : 10,000,000+
World Travel Guide by Triposo : 500,000+
Booking.com Travel Deals : 100,000,000+
Hostelworld: Hostels & Cheap Hotels Travel App : 1,000,000+
Google Trips - Travel Planner : 5,000,000+
GPS Map Free : 5,000,000+
GasBuddy: Find Cheap Gas : 10,000,000+
Southwest Airlines : 5,000,000+
AT&T Navigator: Maps, Traffic : 10,000,000+
VZ Navigator : 50,000,000+
KakaoMap - Map / Navigation : 10,000,000+
AirAsia : 10,000,000+
Expedia Hotels, Flights & Car Rental Travel Deals : 10,000,000+
Goibibo - Flight Hotel Bus Car IRCTC Booking App : 10,000,000+
Allegiant : 1,000,000+
Amtrak : 1,000,000+
JAL (Domestic and international flights) : 1,000,000+
Flight & Hotel Booking App - ixigo : 5,000,000+
VZ Navigator for Tablets : 500,000+

The travel genre contains a variety of apps: flight and hotel assistance, navigation, restaurant location, and hotspot interests. Let's see what popular apps there are:

In [23]:
for app in cleaner_GP:
    if app[1] == 'TRAVEL_AND_LOCAL' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Booking.com Travel Deals : 100,000,000+
TripAdvisor Hotels Flights Restaurants Attractions : 100,000,000+
Maps - Navigate & Explore : 1,000,000,000+
Google Street View : 1,000,000,000+
Google Earth : 100,000,000+


Three of these apps focus on navigation, while the other two focus on hotel and flight assistance. That leaves hotspot interest as a potential market to tap into. Let's see what less popular apps there are:

In [24]:
for app in cleaner_GP:
    if app[1] == 'TRAVEL_AND_LOCAL' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

trivago: Hotels & Travel : 50,000,000+
Hopper - Watch & Book Flights : 5,000,000+
TripIt: Travel Organizer : 1,000,000+
CityMaps2Go Plan Trips Travel Guide Offline Maps : 1,000,000+
KAYAK Flights, Hotels & Cars : 10,000,000+
Hostelworld: Hostels & Cheap Hotels Travel App : 1,000,000+
Google Trips - Travel Planner : 5,000,000+
GPS Map Free : 5,000,000+
GasBuddy: Find Cheap Gas : 10,000,000+
Southwest Airlines : 5,000,000+
AT&T Navigator: Maps, Traffic : 10,000,000+
VZ Navigator : 50,000,000+
KakaoMap - Map / Navigation : 10,000,000+
AirAsia : 10,000,000+
Expedia Hotels, Flights & Car Rental Travel Deals : 10,000,000+
Goibibo - Flight Hotel Bus Car IRCTC Booking App : 10,000,000+
Allegiant : 1,000,000+
Amtrak : 1,000,000+
JAL (Domestic and international flights) : 1,000,000+
Flight & Hotel Booking App - ixigo : 5,000,000+
Wisepilot for XPERIA™ : 5,000,000+
VZ Navigator for Galaxy S4 : 5,000,000+
MAIN : 1,000,000+
Yoriza Pension - travel, lodging, pension, camping, caravan, pool villas ac

Yelp appears to be among the few apps that focus on hotspot interest, and upon further investigation, Yelp only focuses on the goods and services aspect of hotspot interests, such as restaurants, salons, and dentists. It could be profitable to make an app that focuses more on the sightseeing part of travel, including deals on experiences like skydiving, rafting, etc. It would need a similar review feature and a forum to discuss said experiences, and it could provide recommendations of where to go based on the time of year and prices.

## Conclusions

In this project, we analyzed Google Play and App Store mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that expanding upon the sightseeing part of travel could have untapped potential, focusing on popular points of interest and lesser-known places. It would need a similar review feature and a forum to discuss said experiences, and it could provide recommendations of where to go based on the time of year and prices.