# Profitable App Profiles for the App Store and Google Play Markets

## In this project we seek to do two things:
1. Explore the data set with basic data analysis: look for patterns and compute summary statistics to get a big-picture understanding of the data.
2. Use the results of that analysis to make informed decisions for a small company that builds free apps. We inspect the current app market and decide which kind of app is best for us to make.

In [8]:
from csv import reader

# Apple Store data set
# encoding='utf8' avoids decode errors where the platform default is not UTF-8
with open('AppleStore.csv', encoding='utf8') as data:
    list_data = list(reader(data))
header_apple = list_data[0]
apple = list_data[1:]

# Google Play Store data set
with open('googleplaystore.csv', encoding='utf8') as data2:
    list_data2 = list(reader(data2))
header_google = list_data2[0]
google = list_data2[1:]

## Creating a quick view of the data:
To inspect the data easily, we create an `explore_data()` function that prints a slice of rows and, optionally, the number of rows and columns.

In [9]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

With the function in place, we look at the first few rows of each data set, along with its dimensions and column names.

In [10]:
print(header_google)
print('\n')
explore_data(google, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [11]:
print(header_apple)
print('\n')
explore_data(apple, 0, 3, True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The columns include price, content rating, category (or genre), and app name, all of which will be useful in our analysis. We take note of their indices and names so that we can refer to them from now on.
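One way to keep track of column positions is to build a name-to-index lookup from the header row. The sketch below copies the App Store header from the output above; the name `apple_cols` is introduced here only for illustration.

```python
# Header copied from the App Store output above.
header_apple = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

# Map each column name to its position so later code can avoid magic numbers.
apple_cols = {name: i for i, name in enumerate(header_apple)}

print(apple_cols['track_name'])   # 1
print(apple_cols['price'])        # 4
print(apple_cols['prime_genre'])  # 11
```

Index 11 in a 16-column row is the same position as index -5, which is how the genre column is addressed later in this notebook.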

## Error Data
We now begin to clean and filter our data. One of the rows in the Google Play data set produces an error: according to the discussion [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), row 10472 is the cause. We choose to simply delete this row rather than repair it.

In [13]:
print(google[10472])
print('\n')
print(header_google)
print('\n')
print(google[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


The row is missing its category, so every later column is shifted one position to the left. The simplest fix is to delete the row.
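A shifted row like this one has fewer fields than the header, so it can be detected by comparing lengths. The following is a minimal sketch on toy rows; `find_malformed_rows` and the toy data are hypothetical and not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Toy data (hypothetical): the second row is missing one field.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]
print(find_malformed_rows(toy_rows, toy_header))  # [1]
```

A length check only catches rows with missing or extra fields; it would not catch a row whose values are wrong but correctly counted.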

In [14]:
print(len(google))
del google[10472]
print(len(google))
# Check the pre and post-deletion to see if a change was made.

10841
10840


## Deleting Duplicate Data
Some apps appear more than once in the data. We need to remove these duplicate entries so that no app is counted more than once in our computations.

### Part 1:
Let us look at some examples in order to better illustrate this.

In [15]:
for app in google:
    name = app[0]
    if name == 'Instagram':
        print(app)
        

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We want to count the total number of duplicate entries in our data set, so we build two lists: one for app names seen for the first time and one for names seen again.

In [16]:
dupe_apps = []
uniq_apps = []
for app in google:
    name = app[0]
    if name in uniq_apps:
        dupe_apps.append(name)
    else:
        uniq_apps.append(name)
print('Number of duplicate apps:', len(dupe_apps))
print('\n')
print('Example of dupe apps:', dupe_apps[:15])

Number of duplicate apps: 1181


Example of dupe apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


So there are 1181 duplicate entries in total. Some duplicates simply hold older information, and we want the most recent.
Because review counts grow over time, the row with the most reviews is most likely the latest one. We keep that row for each app and discard the older, less informative rows.

To do this we will follow a few steps:
* Create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews recorded for that app.
* Use this dictionary to build a new data set that contains only one row per app: the row with the highest number of reviews.


In [17]:
reviews_max = {}
for app in google:
    name = app[0]
    num_revs = float(app[3])
    if name in reviews_max and reviews_max[name] < num_revs:
        reviews_max[name] = num_revs
    elif name not in reviews_max:
        reviews_max[name] = num_revs

We previously found 1181 duplicates, so we expect the new dictionary to have as many entries as the data set has rows, minus 1181.

In [18]:
print('Expected Length: ', len(google) -1181)
print('Actual Length: ', len(reviews_max))

Expected Length:  9659
Actual Length:  9659


To generate the data set without duplicates, we follow a few steps:
 * Create two empty lists, `google_clean` and `already_added`, which track the cleaned rows and the names of the apps already in the cleaned data, respectively.
 * Iterate through the original data set, storing each app's name and number of reviews. A row is added to `google_clean`, and its name to `already_added`, only if its review count equals the maximum for that app and the name has not been added yet. The second condition guards against duplicate rows that tie on the maximum review count.
 * Check the length of the result to confirm it matches the expected number of rows.
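The two steps above can be tried on toy data first, to see why the already-added check is needed when two rows tie on the maximum. This is a sketch with hypothetical rows and names (`toy`, `toy_max`, `toy_clean`, `seen`); it uses a set instead of a list for the membership check, which is a swapped-in choice and not what the notebook's own cell does.

```python
# Toy rows (hypothetical names and counts): [name, category, rating, reviews].
toy = [
    ['Chat', 'SOCIAL', '4.5', '100'],
    ['Chat', 'SOCIAL', '4.5', '250'],
    ['Chat', 'SOCIAL', '4.5', '250'],   # exact tie with the row above
    ['Notes', 'TOOLS', '4.0', '30'],
]

# Step 1: highest review count seen for each name.
toy_max = {}
for row in toy:
    name, n_reviews = row[0], float(row[3])
    if name not in toy_max or toy_max[name] < n_reviews:
        toy_max[name] = n_reviews

# Step 2: keep the first row that matches the maximum for its name.
toy_clean = []
seen = set()   # a set gives constant-time membership checks
for row in toy:
    name, n_reviews = row[0], float(row[3])
    if n_reviews == toy_max[name] and name not in seen:
        toy_clean.append(row)
        seen.add(name)

print(len(toy_clean))  # 2
```

Without the `name not in seen` condition, both tied 'Chat' rows would be kept and the result would have three rows instead of two.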

In [19]:
google_clean = []
already_added = []
for app in google:
    name = app[0]
    num_revs = float(app[3])
    if num_revs == reviews_max[name] and name not in already_added:
        google_clean.append(app)
        already_added.append(name)
print(len(google_clean))

9659


In [21]:
explore_data(google_clean, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13


## Removing non-English Apps
### Part 1:
We only want to analyze apps aimed at an English-speaking audience. Here are some examples of non-English apps from both data sets.

In [23]:
print(apple[813][1])
print(apple[6731][1])
print(google_clean[4412][0])
print(google_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


As we can see, these app names contain non-English characters. The characters commonly used in English text (letters, digits, punctuation) belong to the ASCII set, whose code points range from 0 to 127, so we can flag a name as non-English when it contains characters outside that range.
We use the built-in function `ord()` to get each character's code point and check whether it is greater than 127.

In [24]:
def is_english(string):
    for char in string:
        if ord(char) > 127:
            return False
    return True
# here are some examples
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


The function works as defined: it returns False as soon as it meets a non-ASCII character. However, if we remove apps with this rule, we will also remove English apps whose names merely contain a few non-ASCII characters (such as emojis or trademark symbols). We revise the function to flag a name as non-English only if it has more than 3 non-ASCII characters.

In [25]:
def is_english(string):
    non_ascii = 0
    for char in string:
        if ord(char) > 127:
            non_ascii+=1
    if non_ascii > 3:
        return False
    else:
        return True

We will now use this function to filter both data sets. It is not perfect, but it is good enough that we will not lose a significant amount of data.
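As a quick sanity check of the threshold, the sketch below restates the same rule in a compact form and runs it on a few names. The two English names with a trademark symbol and an emoji are illustrative inputs chosen for this example, not rows confirmed to be in the data sets.

```python
def is_english(string):
    # Same rule as above: allow at most 3 characters outside the ASCII range.
    non_ascii = sum(1 for char in string if ord(char) > 127)
    return non_ascii <= 3

print(ord('™'))   # 8482, well above 127
print(is_english('Docs To Go™ Free Office Suite'))  # True
print(is_english('Instachat 😜'))                    # True
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))   # False
```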

In [29]:
google_english = []
apple_english = []
for app in google_clean:
    name = app[0]
    if is_english(name):
        google_english.append(app)
for app in apple:
    name = app[1]
    if is_english(name):
        apple_english.append(app)
# Explore some of the cleaned data.
explore_data(google_english, 0, 5, True)
print('\n')
explore_data(apple_english, 0, 5, True)

        

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', 

After this cleaning, there are 9614 English Google Play apps and 6183 English App Store apps.

## Finding the Free Apps
Our analysis focuses on the free app market, so we next filter both data sets down to the free apps.

In [31]:
google_final = []
apple_final = []
for app in google_english:
    price = app[7]
    if price == '0':
        google_final.append(app)
for app in apple_english:
    price = app[4]
    if price == '0.0':
        apple_final.append(app)
print(len(google_final))
print(len(apple_final))

8864
3222


Having removed the non-free apps, we see that there are 8864 free Android apps and 3222 free iOS apps.
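The filter above compares price strings exactly (`'0'` for Google Play, `'0.0'` for the App Store). A slightly more tolerant check converts the string to a number first. This is a sketch only: `is_free` is a hypothetical helper, and the assumption that paid prices may carry a leading `$` is not shown in the output above.

```python
def is_free(price):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0', or '$4.99'; other formats
    would need extra handling.
    """
    return float(price.strip().lstrip('$')) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```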

## Common Apps by Genre
Our strategy for validating an app idea comprises three steps:
1. Build a minimal Android app and add it to the Google Play Store.
2. If users respond well to the app, we develop it further.
3. If it is profitable after six months, we build an iOS version and add it to the App Store.

Given this strategy, we want to find app profiles that are successful on both platforms. We start by looking at the most common genres in each market, and to see them clearly we build a frequency table.

In [54]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
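
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. The function name `freq_table_counter` and the toy rows are hypothetical; the notebook itself uses the hand-written `freq_table()` above.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Percentage frequency table for one column, built with Counter."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

# Toy rows (hypothetical): the genre is in column 1.
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Music']]
for genre, pct in sorted(freq_table_counter(toy, 1).items(),
                         key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```

Sorting on the percentage with `key=` avoids building the intermediate list of tuples that `display_table()` uses.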

Having created the methods, we look at the `prime_genre` column of the App Store dataset.

In [55]:
display_table(apple_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We see that over 50% of the free English apps on the App Store are games. Approximately 8% are entertainment apps and about 5% are photo & video apps. Education is about 3.7%, Social Networking about 3.3%, and every other genre is below 3%.

Games clearly dominate the number of apps, which is interesting because the apps with the most users are usually in other genres, such as social media. A large number of apps in a genre may reflect what developers publish rather than what users demand, so this table alone does not tell us that demand is higher for games than for social networking.

We now look at the Google Play Store to compare. In this case there are 2 relevant columns: Category, and Genres.


In [56]:
display_table(google_final, 1) 
# category for Google Play Store

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

There is no clear majority here, but the largest category is FAMILY. However, family apps are generally games for kids.

In [57]:
display_table(google_final, -4)
# genre for Google Play Store

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Tools are also more common on the Google Play Store, and overall the set of apps is much more diverse than on the App Store.

The Genres column has many more categories than the Category column, so it gives a less clear picture of how apps are distributed.
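Part of the reason is that a single Genres value can hold more than one tag, as in `'Art & Design;Pretend Play'` from the rows printed earlier. The sketch below counts each tag separately; treating `;` as the separator is an assumption inferred from those values, and `tag_counts` is a name introduced here.

```python
from collections import Counter

# Genre strings copied from rows shown earlier in this notebook.
genres = ['Art & Design', 'Art & Design;Pretend Play',
          'Art & Design', 'Art & Design;Creativity']

# Count each ';'-separated tag on its own.
tag_counts = Counter(tag for g in genres for tag in g.split(';'))
print(tag_counts.most_common())
```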
## Popular Apps By Genre

We now want to see which kinds of apps are the most popular, so we group them by genre and look at how many users each genre attracts. The App Store data has no install counts, so we use the total number of ratings (`rating_count_tot`) in their place.

In [59]:
genres_apple = freq_table(apple_final, -5)
for genre in genres_apple:
    total = 0
    len_genre = 0
    for app in apple_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total+=n_ratings
            len_genre+=1
    avg_n_ratings = total/len_genre
    print(genre, ':', avg_n_ratings)

Utilities : 18684.456790123455
Productivity : 21028.410714285714
Social Networking : 71548.34905660378
Business : 7491.117647058823
News : 21248.023255813954
Catalogs : 4004.0
Navigation : 86090.33333333333
Games : 22788.6696905016
Sports : 23008.898550724636
Medical : 612.0
Shopping : 26919.690476190477
Food & Drink : 33333.92307692308
Entertainment : 14029.830708661417
Book : 39758.5
Weather : 52279.892857142855
Health & Fitness : 23298.015384615384
Lifestyle : 16485.764705882353
Travel : 28243.8
Photo & Video : 28441.54375
Finance : 31467.944444444445
Reference : 74942.11111111111
Music : 57326.530303030304
Education : 7003.983050847458
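
The averages above come from a nested loop that rescans the whole data set once per genre. A single-pass version can be sketched with `collections.defaultdict`; `avg_by_group` and the toy rows are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

def avg_by_group(dataset, group_index, value_index):
    """Average a numeric column per group in a single pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (hypothetical): [name, rating count, genre].
toy = [['a', '10', 'Games'], ['b', '30', 'Games'], ['c', '5', 'Music']]
print(avg_by_group(toy, 2, 1))  # {'Games': 20.0, 'Music': 5.0}
```

On the real data this would be called with the genre index (-5) and the rating count index (5) used in the cell above.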


Navigation apps have the highest average number of ratings, which seems reasonable given the popularity of apps like Google Maps. Navigation apps are generally hard to build from the ground up, and most of them rely on an API that pulls from an existing mapping system, so it may not be wise to create one. Next is Reference; it is not obvious what this genre covers, but it is probably informational apps of some kind. Reference is followed by Social Networking, which makes sense since Facebook, Instagram, Twitter, etc. are extremely popular. This genre is worth considering, since such apps are not as hard to build as a navigation app.

In [61]:
for app in apple_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])
# Print the name and number of ratings of each Navigation app.
# There are only 6, but the average is high because of the large
# rating counts of Waze and Google Maps.

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
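
The skew can be made concrete by comparing the mean with the median for these six apps. This sketch uses the standard `statistics` module and the rating counts copied from the output above; `navigation_ratings` is a name introduced here.

```python
from statistics import mean, median

# Rating counts copied from the Navigation output above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(mean(navigation_ratings))    # about 86090.33, matching the genre average
print(median(navigation_ratings))  # 8196.5
```

The median is roughly a tenth of the mean, which shows how strongly the two largest apps pull the average up.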


As previously mentioned, the same happens for social networking apps. Additionally, the Music genre shows the same pattern, because apps like Spotify and Pandora dominate the market with very large numbers of users.

In [63]:
for app in apple_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


We were also interested in the Reference genre, where the Bible and Dictionary.com apps heavily skew the average number of ratings. Even so, the genre suggests an opportunity: we could pick a topic that is currently popular and build an app that serves as an informational database for it. This is even simpler than a social networking app, and it can be as specific or as general as current content trends call for, working as a sort of dynamic database.