# Researching profitable apps profiles

## Step 1: What is this project about?

This project aims to discover what are the most profitable types of free apps for iOs and android. In order to do that, we'll explorer two datasets with data from each app store, and try to discover what sort of free apps are more popular than the others.

## Step 2: Reading the files

First we will need to open the files with the data to be evaluated, and convert it to a format that is easier to be analysed. Both dataset, named "AppleStore" for iOs apps and "googleplaystore" for android apps, are in Comma-separated values (CSV) format.

In [1]:
from csv import reader

file_raw = open('AppleStore.csv', encoding='utf8')
file_csv = reader(file_raw, delimiter=',')
file_apple = list(file_csv)
header_apple = file_apple[0]

file_raw = open('googleplaystore.csv', encoding='utf8')
file_csv = reader(file_raw, delimiter=',')
file_android = list(file_csv)
header_android = file_android[0]

Now we have the converted file (it has beeen transformed into a list of lists) and the header (the first row of the file, which contains the name of the columns that we'll analyse), for each app store. After that, it's necessary to create a function to print the data contained in the files.

In [2]:
def explore_data(dataset, start, end, rows_and_columns = False):
    
    dataset_slice = dataset[start:end]    
    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
    print('\n')

We can test the ouput of the "explore_data" function using the code bellow, which aims to show the first 4 lines of data within each data set, and also show the number of rows and columns for each file. This is important because the number of rows will determine how many apps we will be evaluating (including 1 row representing the header) and the columns will show what type of information we have about each app.

In [3]:
explore_data(file_apple,0,4,True)
explore_data(file_android,0,4,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0', '2974676', '212', '3.5', '3.5', '95', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0', '2161558', '1289', '4.5', '4', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18'

To understand better the data, we might also analyse only the header of each file, because it describes what do the values inside the data file correspond to. It also shows that the header is different for each app store, so we'll need to develop specfici code to explore each app store.

In [4]:
explore_data(file_apple,0,1,True)
explore_data(file_android,0,1,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows: 7198
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows: 10842
Number of columns: 13




In the iOs file, track_name, price, prime_genre and the columns that have rating data maybe the most important for this analysis.
In the android file, app, category, rating, price and genre might be the columns that have the most relevant data for this project.
The [documentation of these datasets](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) contain more information about what each column represents.

## Step 3: Finding the incorrect row

The discussion about the data set reveals that there is at [least one row that has incorrect data](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) in the android dataset, so it needs to be deleted. In order to do that, we must first confirm if row in question has the problem that's referenced to in the documentation.

In [5]:
print(header_android)
print(file_android[10473])
print('Total rows of data for android file: ', (len(file_android)))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', '11-Feb-18', '1.0.19', '4.0 and up', '']
Total rows of data for android file:  10842


As described in the documentation, the row number 10473 really lacks the data for one column, so it can be analysed by our program. To fix data, we are going to remove the row from our dataset.

In [6]:
del file_android[10473]
print(file_android[10473])
print('Total rows of data for android file after correction: ', (len(file_android)))

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', '7-Aug-18', '6.06.14', '4.4 and up']
Total rows of data for android file after correction:  10841


The [documentation for the iOs app store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) doesn't have any information about rows that might have imprecise data, so we'll not explore that dataset at this moment.

## Step 4: Finding duplicate apps

Continuing our research of the datasets, now it's time to search for duplicate entries in the datasets. So in this step, we'll be focusing on finding apps that might have the same name in our dataset).

In [7]:
for app in file_android[1:]:
    name = app[0]
    if name == 'Instagram':
        print(app)
        
print('\n')

for app in file_apple[1:]:
    name = app[1]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']


['389801252', 'Instagram', '113954816', 'USD', '0', '2161558', '1289', '4.5', '4', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


As shown by this snippet of code, "Instagram" app has 4 entries on the Android dataset, and only one entrie on the iOs dataset. That might indicate that only the android dataset has duplicates, but we'll confirm that latter.
In order to have the best dataset, we now need to identify all the cases in order to eliminate them.

In [8]:
duplicate_apps_apple = []
unique_apps_apple = []

for app in file_apple[1:]:
    name = app[1]
    if name in unique_apps_apple:
        duplicate_apps_apple.append(name)
    else:
        unique_apps_apple.append(name)
    
print('Number of duplicate apps: ', len(duplicate_apps_apple))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps_apple)

Number of duplicate apps:  2


Examples of duplicate apps:  ['Mannequin Challenge', 'VR Roller Coaster']


In [9]:
duplicate_apps = []
unique_apps = []

for app in file_android[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:15])

Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


From this results, we might think that iOs has only two duplicates, but if we search the iOs app store, we we'll discover that [Mannequin Challenge](https://www.apple.com/br/search/Mannequin-Challenge?src=serp) is actually two different apps with the same name, and that is also the case with VR Roller Coaster. So we'll not look further into duplicate apps on the iOs dataset.

## Step 5: Removing duplicate apps

Now that it's known the number of duplicate apps, and that only android dataset have duplicates, the next step is to remove these duplicate entries.
In the chunck of code bellow we are going to identify and remove the duplicates, leaving only the version of the app with the highest number of reviews. We could use other criterias to remove the duplicate apps, but as we're trying to understand what are the most popular and porfitable free apps, the number of reviews might be the best indicator of the better version of the app to be evaluated this time.
To do the removel, first it's needed to create a dictionary with the names of the all apps, in pair with the highest review count for that app.

In [10]:
reviews_max = {}

for app in file_android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    else:
        reviews_max[name] = n_reviews

print('Expected length without duplicates: ', len(reviews_max))
print('Actual length at this moment: ', len(file_android[1:]))
print('Number of apps to be removed: ', (len(file_android[1:]) - len(reviews_max)))

Expected length without duplicates:  9659
Actual length at this moment:  10840
Number of apps to be removed:  1181


Now that the dictionary has been created, we need to compare it to the original list of android apps, and create a new list as a result. This new list will only have unique apps, so in case of duplicate apps, only the version with the highest number of reviews will be there.

In [11]:
android_clean = []
already_added = []

for app in file_android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
            android_clean.append(app)
            already_added.append(name)

explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13




## Step 6: Identifying non-English apps

In this step, we'll try to identify which of the apps might not have english names. Our research focus only in apps for English-speaking audience, so we need to identify if there are any foreign characters in the name of the app.

In [12]:
def english_characters(string):
        
    for i in string:
        if ord(i) > 127:
            return False
    return True
    
print(english_characters('Instagram'))
print(english_characters('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_characters('Docs To Go™ Free Office Suite'))
print(english_characters('Instachat 😜'))

True
False
False
False


## Step 7: Removing non-English apps

The function that we built identifies an app as english if of it's all the characters name were in the ASCII range. This is usefull, but it could lead to identify some apps as foreing for using some characters ouside of this range, like emojis (😜) or special characters (™). So we'll develop a new function that is more accurate than that.

In [13]:
def english_name(string):
    count = 0
    for i in string:
        if ord(i) > 127:
            count += 1
    if count > 3:
        return False
    else:
        return True

print(english_name('Docs To Go™ Free Office Suite'))
print(english_name('Instachat 😜'))
print(english_name('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


Now that we can identify foreign apps with more precision, we'll to search for then in our databases, and create a new output without those apps.

In [14]:
english_android = []
english_apple = []

for app in android_clean:
    name = app[0]
    if english_name(name):
        english_android.append(app)

for app in file_apple[1:]:
    name = app[1]
    if english_name(name):
        english_apple.append(app)  

explore_data(english_android,0,4,True)
explore_data(english_apple,0,4,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', '20-Jun-18', '1.1', '4.4 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0', '2974676', '212', '3.5', '3.5', '95', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0', '2161558', '1289', '4.5', '4', '10.23',

## Step 8: Selecting only free apps

In this step, we'll keep cleaning our datasets to obtain the best results at the end. This time, we'll keep only the free apps, which is the focus of our investigation. In order to do that, we'll create a snippet of code similar of the ones that we already used when removing non-Enligsh apps and duplicate apps, but this time we'll need to search for the price of the app instead of it's name.

In [15]:
free_android = []
free_apple = []

for app in english_android:
    price = app[7]
    if price == '0':
        free_android.append(app)

for app in english_apple:
    price = app[4]
    if price == '0':
        free_apple.append(app)

explore_data(free_android,0,4,True)
explore_data(free_apple,0,4,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', '20-Jun-18', '1.1', '4.4 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0', '2974676', '212', '3.5', '3.5', '95', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0', '2161558', '1289', '4.5', '4', '10.23',

## Step 9: Creating and showing the frequencies tables

Now that we have cleaned our datasets, we need to explore it. As our objective is to compreehend what are the most popular apps whithin the datasets, in order to create profile of popular apps based in that information, we need to first understand what are the most popular genres.
To do that we'll create a frequency table, that will contain the genres and percentage of apps in that genre. The genre might be extracted from many columns. In the android dataset we have the second column as category, and the tenth column being genres. In the iOs dataset, we have the twelfth column as the prime genre.

In [16]:
def freq_table(dataset,index):
    
    genre = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in genre:
            genre[value] += 1
        else:
            genre[value] = 1
            
    genre_per = {}
    
    for app in genre:
        per = (genre[app] / total)*100
        genre_per[app] = per
    
    return genre_per

Now that we have the frequency table, we need to be able to show it on the screen, in order to understand what are the most popular genres and deepen our research.

In [17]:
def display_table(dataset, index):

    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [18]:
display_table(free_android,1)

FAMILY : 19.223826714801444
GAME : 9.510379061371841
TOOLS : 8.461191335740072
BUSINESS : 4.580324909747293
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.5424187725631766
SPORTS : 3.4183212996389893
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2490974729241873
HEALTH_AND_FITNESS : 3.068592057761733
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.782490974729242
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.128158844765343
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
ENTERTAINMENT : 0.8799638989169676
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0

In [19]:
display_table(free_android,9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.580324909747293
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.5424187725631766
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2490974729241873
Action : 3.1024368231046933
Health & Fitness : 3.068592057761733
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.861462093862816
Video Players & Editors : 1.782490974729242
Casual : 1.7486462093862816
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.925090252707581

In [20]:
display_table(free_apple,11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


## Step 11: Analyzing the frequencies tables

In the last step we were able to explore our data and extract information of which type of free app is more prevalent in Android and iOs.
If go deeper in our data, we can see that in iOs, games are clear winner, with 58% of the free apps being of this genre, with entartainment being the runner up, with almost 8%. In Android we have the category family on the top, with 19%, and game and tools almost tied for the second place with 9% and 8%. If we look into the genres category, we mmight see a diferent result for android, with tools being in first with 8% and and entertainment and education as runner ups, with 6% and 5%.
This shows that this data alone might not be the best indicator of what type of app we should build. We can come to this conclusion if we analyse that, for example, in iOs we don't have social network in the top or as a runner up, although it's a known fact that apps like facebook, instagram and other social networks are installed in almost every smartphone of the planet. So in order to discover what are the most profitable app profile, we need to dive further into the datasets.

## Step 12: Extracting more precise data from the iOs apps

In this step we'll try to extract more relevant data in order to come to a more precise conclusion. We already have information about the how many apps are in each genre, but we don't have any idea of how many users each genre might have. In order to do that, we'll try to relate the number of installs or users with the genres or categories of apps.

In [21]:
genres_ios = freq_table(free_apple, 11)

for genre in genres_ios:
    
    total = 0
    total_user_rat = 0
    len_genre = 0
    
    for app in free_apple:
        genre_app = app[11]
        if genre_app == genre:            
            n_ratings = float(app[5])
            user_rating = float(app[7])
            total += n_ratings
            total_user_rat += user_rating
            len_genre += 1
            
    avg_n_ratings = total / len_genre
    avg_user_ratings = total_user_rat / len_genre
    
    print(genre, 'average number of ratings per app:', avg_n_ratings)
    print(genre, 'total number of ratings:', total)
    print(genre, 'average rating per app:', avg_user_ratings)

Social Networking average number of ratings per app: 71548.34905660378
Social Networking total number of ratings: 7584125.0
Social Networking average rating per app: 3.5943396226415096
Photo & Video average number of ratings per app: 28441.54375
Photo & Video total number of ratings: 4550647.0
Photo & Video average rating per app: 3.903125
Games average number of ratings per app: 22788.6696905016
Games total number of ratings: 42705967.0
Games average rating per app: 4.037086446104589
Music average number of ratings per app: 57326.530303030304
Music total number of ratings: 3783551.0
Music average rating per app: 3.946969696969697
Reference average number of ratings per app: 74942.11111111111
Reference total number of ratings: 1348958.0
Reference average rating per app: 3.6666666666666665
Health & Fitness average number of ratings per app: 23298.015384615384
Health & Fitness total number of ratings: 1514371.0
Health & Fitness average rating per app: 3.769230769230769
Weather average nu

Using as base all the data extracted until this point, we can say that Games might be the most popular apps in iOs. We're using the number of ratings that an app might have, which might not reflect the number of installations of that app or it's quality, but we can certainly see that games have good strategy to make people rate their app. But the main problem with developing a game is the budget for developing a successful game, that might be [above](https://www.imaginovation.net/blog/mobile-game-app-development-cost/) than the [average cost](https://www.newgenapps.com/blog/cost-to-develop-mobile-game-development-cost/) to develop an app.
Other numbers are also interesting, as the high number of ratings per app in the navigation, social network and weather category. But with we dwelve further into it, we can see that this might happen because this categories have some apps that are much more popular than others like facebook or instagram, so that analysis might not be precise at all.

In [22]:
for app in free_apple:
    if app[1] == 'Facebook' or app[1] == 'Instagram' or app[1] == 'Pinterest' :
        print('Number of ratings for', app[1], ':', app[5])

Number of ratings for Facebook : 2974676
Number of ratings for Instagram : 2161558
Number of ratings for Pinterest : 1061624


Another interesting conclusion is that there are some categories with a lot of low rating apps, which should make easier to compete with. We can see that phenomenon with Book, News and Entertainment genres. These are 3 genres with a good number of ratings per app, which show that they might be popular, but with low ratings per app, so it would be easier to introduce a better app than those that we'll be competing with.

## Step 13: Extracting more precise data from the android apps

Now it's time to extract data from the android dataset. This snippet of code will be very similar to the one related to extracting information about rating on iOs, the big difference is that in Android we have the number of total installations per app, which might help to build a more precise hypotesis about this dataset. We also need to watch out for special characters in the numbers, such as "Nan", "+" and ",". We'll analyse the data using the "Category" column, as it's more general than the "Genres" column in this dataset.

In [23]:
categories_android = freq_table(free_android,1)

for category in categories_android:
    
    total = 0
    len_category = 0
    total_user_rat = 0
    
    for app in free_android:
        category_app = app[1]
        
        if category_app == category:
            
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = float(n_installs.replace(',',''))
            total += n_installs
            user_rating = app[2]
            if user_rating == 'NaN':
                user_rating = 0.0
            total_user_rat += float(user_rating)
            len_category += 1
    
    avg_n_installs = total / len_category
    avg_user_ratings = total_user_rat / len_category
    
    print(category, 'average installs per app:', avg_n_installs)       
    print(category, 'average rating per app:', avg_user_ratings)

ART_AND_DESIGN average installs per app: 1986335.0877192982
ART_AND_DESIGN average rating per app: 4.185964912280701
AUTO_AND_VEHICLES average installs per app: 647317.8170731707
AUTO_AND_VEHICLES average rating per app: 3.674390243902439
BEAUTY average installs per app: 513151.88679245283
BEAUTY average rating per app: 3.3905660377358484
BOOKS_AND_REFERENCE average installs per app: 8767811.894736841
BOOKS_AND_REFERENCE average rating per app: 3.638421052631579
BUSINESS average installs per app: 1704192.3399014778
BUSINESS average rating per app: 2.546059113300492
COMICS average installs per app: 817657.2727272727
COMICS average rating per app: 4.025454545454546
COMMUNICATION average installs per app: 38326063.197916664
COMMUNICATION average rating per app: 3.36736111111111
DATING average installs per app: 854028.8303030303
DATING average rating per app: 3.161818181818181
EDUCATION average installs per app: 1768500.0
EDUCATION average rating per app: 4.2989999999999995
ENTERTAINMENT a

Now that we have all the data, we can see some similarity with the landscape already seen on iOs. First of all, it's important to realize that some of pre-installed apps or hlighly popular apps on android might inflate the numbers of some categories.

In [24]:
for app in free_android:
    if app[0] == 'Hangouts' or app[0] == 'Facebook' or app[0] == 'Google News' or app[0] == 'Instagram':
        print('Number of installs for', app[0], ':', app[5])

Number of installs for Instagram : 1,000,000,000+
Number of installs for Facebook : 1,000,000,000+
Number of installs for Hangouts : 1,000,000,000+
Number of installs for Google News : 1,000,000,000+


The output of that last chuck of code also shows another problem with the android dataset, the imprecise numbers. Although the number installs might be more relevant data than the number of ratings per app, we only have the approximate number of installs, as shown below.

In [25]:
for app in free_android:
    if app[5] == '1,000,000,000+':
        print('Number of installs for', app[0], ':', app[5])
        
print('\n')

for app in free_android:
    if app[5] == '500,000,000+':
        print('Number of installs for', app[0], ':', app[5])

Number of installs for Google Play Books : 1,000,000,000+
Number of installs for Messenger – Text and Video Chat for Free : 1,000,000,000+
Number of installs for Gmail : 1,000,000,000+
Number of installs for Skype - free IM & video calls : 1,000,000,000+
Number of installs for Google Street View : 1,000,000,000+
Number of installs for Google Play Movies & TV : 1,000,000,000+
Number of installs for Subway Surfers : 1,000,000,000+
Number of installs for WhatsApp Messenger : 1,000,000,000+
Number of installs for Instagram : 1,000,000,000+
Number of installs for YouTube : 1,000,000,000+
Number of installs for Facebook : 1,000,000,000+
Number of installs for Google Chrome: Fast & Secure : 1,000,000,000+
Number of installs for Maps - Navigate & Explore : 1,000,000,000+
Number of installs for Google+ : 1,000,000,000+
Number of installs for Google : 1,000,000,000+
Number of installs for Hangouts : 1,000,000,000+
Number of installs for Google Drive : 1,000,000,000+
Number of installs for Google

Due to this fact, the number of installs by itself, or even the average of installs per app might not give us all the important information. So we can use that information with the average rating per app, so we can find categories that might have a good public, but not a lot of good competition. It's also interesting to choose a category that don't require a large user base to become interesting, such as Waze.
Taking into consideration all these facts, "Books and reference", "News and Magazine" and also "Productivity" might be the three most interesting categories to creat an app

## Conclusion

In this project we were able to see many data relevant to the popularity of an app, and how hard it might be to decide what data you should consider more relevant. It was possible to see that there are many genres or categories of apps that are dominated by "champion" apps, which are way more popular than most of their competitors, and that this might lead to a biased conclusion.
Although the datasets were very different, we ended up with similar conclusions, as books and news ended up being popular categories with not a lot of good competition, so they are the best categories to create a profitable free app. 