# Profitable App Profiles for the App Store and Google Play Markets

**Objective:** to find mobile app profiles that are profitable for the App Store and Google Play markets.

In [94]:
from csv import reader
# Google Play
opened_file = open("/Users/Taylor/OneDrive/Documents/DC/Dataquest/App Store Project/googleplaystore.csv", encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

# App Store
opened_file = open("/Users/Taylor/OneDrive/Documents/DC/Dataquest/App Store Project/AppleStore.csv", encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
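
The two loading steps above leave both files open. As a minimal sketch, the same pattern can be wrapped in a helper that closes the file automatically. The name `load_csv` is hypothetical and not part of the original notebook.

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The with-block closes the file even if parsing raises an error.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(reader(f))
    return rows[0], rows[1:]
```

With this helper, each data set could be loaded in one line, for example `android_header, android = load_csv(path_to_googleplaystore_csv)`, where the path variable is whatever location the file is stored at.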


In [95]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Exploring the data: the Google Play columns are described below, followed by the first five rows of each data set.
* App: Application name
* Category: Category the app belongs to
* Rating: Overall user rating of the app (when scraped)
* Reviews: Number of user reviews for the app (when scraped)
* Size: Size of the app (when scraped)
* Installs: Number of user downloads/installs for the app (when scraped)
* Type: Paid or Free
* Price: Price of the app (when scraped)
* Content Rating: Age group the app is targeted at (Children / Mature 21+ / Adult)
* Genres: An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
* Last Updated: Date when the app was last updated on the Play Store (when scraped)
* Current Ver: Current version of the app available on the Play Store (when scraped)
* Android Ver: Minimum required Android version (when scraped)

Data sources:
* [iOS data source](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
* [Android data source](https://www.kaggle.com/lava18/google-play-store-apps)

In [96]:
explore_data(android, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


In [97]:
android_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [98]:
explore_data(ios, 0, 5, True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


Number of rows: 7197
Number of columns: 17


In [99]:
ios_header

['',
 'id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

# Data cleaning

**1) Deleting an incorrect row**

In [100]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10841
10840
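
The notebook does not show how the incorrect row was identified. One common check, assuming the problem is a missing or extra field, is to compare each row's length with the header's. The helper name and the toy rows below are illustrative only.

```python
def find_malformed_rows(header, rows):
    """Return indices of rows whose field count differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Toy data, not taken from the real data set.
toy_header = ["App", "Category", "Rating"]
toy_rows = [
    ["App A", "GAME", "4.1"],
    ["App B", "4.5"],            # one field is missing
    ["App C", "TOOLS", "3.9"],
]
print(find_malformed_rows(toy_header, toy_rows))  # [1]
```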


**2) Removing duplicate entries**

First, for Android:

In [101]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of Android duplicate apps', len(duplicate_apps))
print('\n')
print(duplicate_apps[:10])

Number of Android duplicate apps 1181


['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


For each duplicated app, we keep the entry with the highest number of reviews, on the assumption that it is the most recent one.

In [102]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print('Expected Android Length', len(android) - len(duplicate_apps))
print('Actual Android Length', len(reviews_max))

Expected Android Length 9659
Actual Android Length 9659


In [103]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
print('Expected Android Length', len(android) - len(duplicate_apps))
print('Actual Android Length', len(android_clean))

Expected Android Length 9659
Actual Android Length 9659
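
The two passes above can also be folded into one. The following is a sketch under the same assumption (the row with the most reviews is the one to keep); `keep_most_reviewed` is a hypothetical name, and the toy rows are made up. Note that the result is ordered by each app's first appearance, which can differ from the order of `android_clean`.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when two rows tie on reviews.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows: name, category, rating, reviews
toy = [
    ["Box", "BUSINESS", "4.2", "100"],
    ["Slack", "BUSINESS", "4.4", "50"],
    ["Box", "BUSINESS", "4.2", "120"],
]
deduped = keep_most_reviewed(toy)
print(deduped)
```

Using a dictionary keyed by app name also avoids the repeated `name not in already_added` list lookups, which grow slower as the list gets longer.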


Second, for the App Store:

In [132]:
#explore_data(android_clean, 0, 3, True)

In [105]:
duplicate_apps = []
unique_apps = []

for app in ios:
    name = app[2]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of IOS duplicate apps', len(duplicate_apps))
print('\n')
print(duplicate_apps[:10])

Number of IOS duplicate apps 2


['VR Roller Coaster', 'Mannequin Challenge']


There are only two duplicates, so we'll leave them in.

**3) Removing non-English apps**

In [106]:
def is_english(string):
    # Allow up to three non-ASCII characters (e.g. ™ or an emoji) before rejecting the name
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
        if non_ascii > 3:
            return False
    return True

Let's make sure it works:

In [107]:
print(is_english('Instagram'))
print(is_english('中国語 AQリスニング'))
print(is_english('Docs To Go™ Free Office Suite'))

True
False
True
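
To see how the tolerance for non-ASCII characters affects the result, here is a small variant with the threshold exposed as a parameter. The name `is_probably_english` and the `max_non_ascii` parameter are assumptions for illustration; the default of 3 mirrors the function above.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))                # True
print(is_probably_english('中国語 AQリスニング'))                            # False
print(is_probably_english('Docs To Go™ Free Office Suite', max_non_ascii=0))  # False
```

With a threshold of 0, any name containing a symbol such as ™ would be discarded, which is why some tolerance is useful. The check is still only a heuristic: a name written in another language using plain ASCII letters passes it.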


In [108]:
android_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

print('Android English app #: ', len(android_english))

Android English app #:  9614


In [131]:
#explore_data(android_english, 0, 3, True)

In [110]:
ios_english = []

for app in ios:
    name = app[2]
    if is_english(name):
        ios_english.append(app)

print('IOS English app #: ', len(ios_english))

IOS English app #:  6183


Now both the Android and iOS data sets contain only apps with English names.

In [130]:
#explore_data(ios_english, 0, 3, True)

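**4) Isolating the free apps**

In [113]:
android_free = []

for app in android_english:
    app_type = app[6]  # 'Type' column: 'Free' or 'Paid'
    if app_type == 'Free':
        android_free.append(app)

ios_free = []

for app in ios_english:
    price = app[5]  # 'price' column: '0' means the app is free
    if price == '0':
        ios_free.append(app)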
        
print('Android Free app #: ', len(android_free))
print('IOS Free app #: ', len(ios_free))

Android Free app #:  8863
IOS Free app #:  3222


# Data analysis

Now that the data is clean, let's look at which apps perform best. First, we create a function that generates frequency tables, since we will need one three times.

In [114]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = round(percentage, 2)
    return table_percentages
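
For comparison, the same percentages can be computed with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's method; `freq_table_counter` is a hypothetical name and the toy rows are made up.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: round(count / total * 100, 2) for key, count in counts.items()}

toy = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
print(freq_table_counter(toy, 1))  # {'Games': 75.0, 'Music': 25.0}
```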

We need a second function to sort the genres in descending order. Calling sorted() directly on a dictionary is not enough, because it only considers and returns the dictionary keys. However, sorted() works well if we transform the dictionary into a list of tuples, where each tuple contains a dictionary value along with its corresponding key.

In [115]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
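
The behavior of sorted() on dictionaries described above can be seen in a small example. The three percentages are taken from the iOS table in this notebook. The second call shows an alternative to building (value, key) tuples by hand: sorting `table.items()` with a `key` function.

```python
table = {"Games": 58.16, "Entertainment": 7.88, "Photo & Video": 4.97}

# Sorting the dictionary itself only looks at (and returns) the keys.
print(sorted(table))  # ['Entertainment', 'Games', 'Photo & Video']

# Sorting the (key, value) pairs by value gives the ranking we want.
by_value = sorted(table.items(), key=lambda item: item[1], reverse=True)
for genre, percentage in by_value:
    print(genre, ':', percentage)
```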

In [116]:
display_table(ios_free, -5)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


In [117]:
display_table(android_free, 1) #Category

FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


In [118]:
display_table(android_free, -4) #Genre

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.9
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;B

As a next step, let's consider the popularity of these genres. The App Store data has no install count, so we use the total number of user ratings as a proxy. As a reminder, the relevant columns of the iOS header are:
* track_name (index 2): app name
* rating_count_tot (index 6): total number of user ratings
* prime_genre (index -5): primary genre

In [147]:
ios_genres = freq_table(ios_free, -5)

#Initialize variables
genres = []
genre1 = 0
genre2 = 0
genre3 = 0
genre4 = 0
genre5 = 0
genre1n = 'A'
genre2n = 'B'
genre3n = 'C'
genre4n = 'C'
genre5n = 'C'

for genre in ios_genres:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:
            n_rating = float(app[6])
            total += n_rating
            len_genre += 1
    avg_rating = round(total / len_genre,0)
    #print(genre, ": ", avg_rating)
    
    # Keeping track of the top 5 genres
    if avg_rating > genre1:
        genre1 = avg_rating
        genre1n = genre
    elif avg_rating > genre2:
        genre2 = avg_rating
        genre2n = genre
    elif avg_rating > genre3:
        genre3 = avg_rating
        genre3n = genre
    elif avg_rating > genre4:
        genre4 = avg_rating
        genre4n = genre
    elif avg_rating > genre5:
        genre5 = avg_rating
        genre5n = genre 

print("#1 genre: ", genre1n, " with" , genre1, "average user ratings per app")
print("#2 genre: ", genre2n, " with" , genre2, "average user ratings per app")
print("#3 genre: ", genre3n, " with" , genre3, "average user ratings per app")
print("#4 genre: ", genre4n, " with" , genre4, "average user ratings per app")
print("#5 genre: ", genre5n, " with" , genre5, "average user ratings per app")

#1 genre:  Navigation  with 86090.0 average user ratings per app
#2 genre:  Social Networking  with 71548.0 average user ratings per app
#3 genre:  Book  with 39758.0 average user ratings per app
#4 genre:  Photo & Video  with 28442.0 average user ratings per app
#5 genre:  Games  with 22789.0 average user ratings per app
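
One thing to be aware of with the `if`/`elif` chain above: when a new value takes a slot, the previous holder of that slot is overwritten rather than shifted down, so a genre that was ranked earlier in the loop can drop out of the list. A sketch of an alternative that computes every average first and then sorts is shown below. The function name and the toy rows are hypothetical, and the real data was not re-run here.

```python
def top_genres_by_average(rows, genre_index, value_index, n=5):
    """Return the n (genre, average) pairs with the highest averages."""
    totals = {}
    counts = {}
    for row in rows:
        genre = row[genre_index]
        totals[genre] = totals.get(genre, 0.0) + float(row[value_index])
        counts[genre] = counts.get(genre, 0) + 1
    averages = {genre: totals[genre] / counts[genre] for genre in totals}
    # Sorting all averages guarantees that no genre is skipped.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]

# Toy rows: name, genre, rating count
toy = [
    ["A", "Games", "100"],
    ["B", "Games", "300"],
    ["C", "Weather", "500"],
    ["D", "Music", "50"],
]
top = top_genres_by_average(toy, 1, 2, n=2)
print(top)  # [('Weather', 500.0), ('Games', 200.0)]
```

For the iOS data in this notebook the call would look like `top_genres_by_average(ios_free, -5, 6)`. It also makes a single pass over the data instead of one pass per genre.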


In [120]:
for app in ios_free:
    if app[-5] == 'Navigation':
        print(app[2], ": ", app[6])

Waze - GPS Navigation, Maps & Real-time Traffic :  345046
Geocaching® :  12811
ImmobilienScout24: Real Estate Search in Germany :  187
Railway Route Search :  5
CoPilot GPS – Car Navigation & Offline Maps :  3582
Google Maps - Navigation & Transit :  154911


In [146]:
android_genres = freq_table(android_free, 1)
#Initialize variables
genres = []
genre1 = 0
genre2 = 0
genre3 = 0
genre4 = 0
genre5 = 0
genre1n = 'A'
genre2n = 'B'
genre3n = 'C'
genre4n = 'C'
genre5n = 'C'

for genre in android_genres:
    total = 0
    len_genre = 0
    for app in android_free:
        genre_app = app[1]
        if genre_app == genre:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_genre += 1
    avg_rating = round(total / len_genre,0)
    #print(genre, ": ", avg_rating)
#Keeping track of top 5 genres
    
    if avg_rating > genre1:
        genre1 = avg_rating
        genre1n = genre
    elif avg_rating > genre2:
        genre2 = avg_rating
        genre2n = genre
    elif avg_rating > genre3:
        genre3 = avg_rating
        genre3n = genre
    elif avg_rating > genre4:
        genre4 = avg_rating
        genre4n = genre
    elif avg_rating > genre5:
        genre5 = avg_rating
        genre5n = genre    

#print("\n")    
print("#1 genre: ", genre1n, " with" , genre1, "average installs per app")
print("#2 genre: ", genre2n, " with" , genre2, "average installs per app")
print("#3 genre: ", genre3n, " with" , genre3, "average installs per app")
print("#4 genre: ", genre4n, " with" , genre4, "average installs per app")
print("#5 genre: ", genre5n, " with" , genre5, "average installs per app")

#1 genre:  COMMUNICATION  with 38456119.0 average installs per app
#2 genre:  VIDEO_PLAYERS  with 24727872.0 average installs per app
#3 genre:  PHOTOGRAPHY  with 17840110.0 average installs per app
#4 genre:  PRODUCTIVITY  with 16787331.0 average installs per app
#5 genre:  TOOLS  with 10801391.0 average installs per app
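
The Installs column stores open-ended buckets such as '10,000+', so the cleaned number is a lower bound rather than an exact count. The cleaning step used in the loop above can be isolated in a small helper; the name `parse_installs` is hypothetical.

```python
def parse_installs(value):
    """Convert a Google Play installs string such as '10,000+' to an int."""
    return int(value.replace(",", "").replace("+", ""))

print(parse_installs("10,000+"))      # 10000
print(parse_installs("5,000,000+"))   # 5000000
```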


Based on the above analysis, the genres with the highest average number of user ratings per app (iOS) and installs per app (Android) are:

**iOS**
* Navigation
* Social Networking
* Book
* Photo & Video
* Games

**Android**
* Communication
* Video Players
* Photography
* Productivity
* Tools

With the data clean and the tools in place, which genre should we recommend building an app in to maximize revenue? We should consider:

* How difficult an app in the genre is to build (e.g. Navigation would be difficult)
* The average number of installs per genre

Based on the top 5 genres for both Android and iOS, we can conclude the following:
* Navigation apps like Google Maps and Waze would be too hard to build from scratch
* Social networking apps like Facebook and communication apps like WhatsApp benefit from network effects, which makes it too hard for a newcomer to gain market share
* Photo and video genres appear in the top 5 on both platforms: Photo & Video on iOS, and Photography and Video Players on Android

Therefore, we should try to build a photo or video app: these genres are very popular, and such an app is feasible to build. Let's take a look at the apps in these genres.

In [180]:
#Create a function to display the top 5 apps in a genre
def display_genre(dataset, name_index, genre_index, target_genre, install_index, android):
    # Running top-5 counts and names (the letters are placeholders for unfilled slots)
    app1 = 0
    app2 = 0
    app3 = 0
    app4 = 0
    app5 = 0
    app1n = 'A'
    app2n = 'B'
    app3n = 'C'
    app4n = 'C'
    app5n = 'C'
    for app in dataset:
        genre = app[genre_index]
        if genre == target_genre:
            installs = app[install_index]
            if android:
                # Google Play stores installs as strings such as '10,000+'
                installs = installs.replace(',', '')
                installs = installs.replace('+', '')
            installs = int(installs)
            # Keeping track of top 5 apps. A new value overwrites its slot;
            # the previous holder of that slot is not shifted down.
            if installs > app1:
                app1 = installs
                app1n = app[name_index]
            elif installs > app2:
                app2 = installs
                app2n = app[name_index]
            elif installs > app3:
                app3 = installs
                app3n = app[name_index]
            elif installs > app4:
                app4 = installs
                app4n = app[name_index]
            elif installs > app5:
                app5 = installs
                app5n = app[name_index]

    print("#1 app: ", app1n, " with" , app1, "installs")
    print("#2 app: ", app2n, " with" , app2, "installs")
    print("#3 app: ", app3n, " with" , app3, "installs")
    print("#4 app: ", app4n, " with" , app4, "installs")
    print("#5 app: ", app5n, " with" , app5, "installs")
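
As an alternative sketch for picking the top apps in a genre, the standard library's `heapq.nlargest` can do the ranking. The function name `top_apps` and the toy rows are hypothetical. Stripping ',' and '+' is harmless for the iOS rating counts, which contain neither.

```python
import heapq

def top_apps(rows, name_index, genre_index, target_genre, count_index, n=5):
    """Return the n (name, count) pairs with the highest counts in a genre."""
    in_genre = [
        (row[name_index], int(row[count_index].replace(",", "").replace("+", "")))
        for row in rows
        if row[genre_index] == target_genre
    ]
    return heapq.nlargest(n, in_genre, key=lambda pair: pair[1])

# Toy rows: name, category, installs
toy = [
    ["Cam A", "PHOTOGRAPHY", "1,000+"],
    ["Cam B", "PHOTOGRAPHY", "5,000,000+"],
    ["Tool", "TOOLS", "10,000+"],
    ["Cam C", "PHOTOGRAPHY", "100,000+"],
]
best = top_apps(toy, 0, 1, "PHOTOGRAPHY", 2, n=2)
print(best)  # [('Cam B', 5000000), ('Cam C', 100000)]
```

Because the ranking is computed from the full list of apps in the genre, no app is dropped when a later app takes its place, and `n` can be changed without adding more variables.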

In [183]:
# On iOS, index 6 is rating_count_tot, used here as a proxy for installs
display_genre(ios_free, 2, -5, "Photo & Video", 6, android = False)

#1 app:  Instagram  with 2161558 installs
#2 app:  Snapchat  with 323905 installs
#3 app:  YouTube - Watch Videos, Music, and Live Streams  with 278166 installs
#4 app:  Funimate video editor: add cool effects to videos  with 123268 installs
#5 app:  Vine Camera  with 90355 installs


In [184]:
display_genre(android_free, 0, 1, "VIDEO_PLAYERS", 5, android = True)

#1 app:  YouTube  with 1000000000 installs
#2 app:  Google Play Movies & TV  with 1000000000 installs
#3 app:  MX Player  with 500000000 installs
#4 app:  Dubsmash  with 100000000 installs
#5 app:  VivaVideo - Video Editor & Photo Movie  with 100000000 installs


In [185]:
display_genre(android_free, 0, 1, "PHOTOGRAPHY", 5, android = True)

#1 app:  Google Photos  with 1000000000 installs
#2 app:  YouCam Makeup - Magic Selfie Makeovers  with 100000000 installs
#3 app:  Sweet Selfie - selfie camera, beauty cam, photo edit  with 100000000 installs
#4 app:  Retrica  with 100000000 installs
#5 app:  Photo Editor Pro  with 100000000 installs


After looking at the competition, it would be difficult to beat Instagram, YouTube, and Snapchat. The other apps, however, are niche photo and video editing apps, such as YouCam Makeup - Magic Selfie Makeovers. We just need to find one aspect of photos that people don't like and change it. Adding a simple filter or superimposing something onto the picture could do the trick. One option is to target a dedicated fan base, such as Pokemon fans, and let people superimpose Pokemon onto their selfies. That seems relatively easy to build and could be very successful.

# Conclusion 

The objective of this project was to analyze the App Store and Google Play data to find trends and recommend an app idea capable of generating revenue through ads. The app had to be free and in English. I conclude that the development team should focus on photo and video apps, find one aspect users are unhappy with, and address it, for example by adding a filter or superimposing images onto photos.