
# Finding the next Big Idea - Analyzing App Profiles for IOS & Google Store 

Our main aim is to find profitable app profiles for both the App Store and the Google Play Store. We work as data analysts for a company, alongside its developers, so that they can make data-driven decisions about the apps they build.

Our company builds only free apps, because our main source of revenue is in-app ads. Therefore, this analysis considers only free applications.

## Exploring Data

We have two separate files containing data from the Google Play Store and the IOS App Store. The data was compiled in 2018, when there were 2 million IOS apps on the App Store and 2.1 million Android apps on the Google Play Store. Rather than analyzing datasets of that size, we will analyze a subset of each, both available on Kaggle.

<ul>
    <li> Link to <a href="https://www.kaggle.com/lava18/google-play-store-apps">Android Dataset</a> containing approximately ten thousand apps</li>
    <li>Link to <a href="https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps">IOS Dataset</a> containing approximately seven thousand apps</li>
</ul>

In [1]:
from csv import reader
#Opening Google Data Set
open_file=open('googleplaystore.csv',encoding='utf-8')
read_data=reader(open_file)
android=list(read_data)
android_headers=android[0]
android_wout_headers=android[1:]

#Opening Apple Data Set
open_file=open('AppleStore.csv',encoding='utf-8')
read_data=reader(open_file)
ios=list(read_data)
ios_headers=ios[0]
ios_wout_headers=ios[1:]

print('Android Headers')
print(android_headers)
print('\n')
print('IOS Headers')
print(ios_headers)

Android Headers
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


IOS Headers
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
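
The loading cell above leaves both files open after reading them. A minimal sketch of the same step with a context manager, which closes the file as soon as the rows are read, could look like this. The helper name `load_dataset` and the tiny sample file are hypothetical and only stand in for the Kaggle CSVs.

```python
import csv
import os
import tempfile


def load_dataset(path):
    """Return (header, rows) for a CSV file and close the file afterwards."""
    with open(path, encoding="utf-8", newline="") as handle:
        rows = list(csv.reader(handle))
    return rows[0], rows[1:]


# Tiny stand-in file so the sketch runs without the Kaggle downloads.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", encoding="utf-8", newline="") as handle:
        handle.write("App,Category\nSketch,ART_AND_DESIGN\n")
    sample_header, sample_rows = load_dataset(sample_path)

print(sample_header)
print(sample_rows)
```

With the real files, the call would presumably be `load_dataset('googleplaystore.csv')` and `load_dataset('AppleStore.csv')`.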


I have created a function, explore_data, that prints a slice of rows from a dataset. If row_and_cols is True, it also prints the dataset's number of rows and columns.

In [2]:
def explore_data(dataset,start,end,row_and_cols=False):
    data_slice=dataset[start:end]
    for i in range(len(data_slice)):
        print(data_slice[i])
        print('\n')
    
    if row_and_cols:
        print("Number of rows: ",len(dataset))
        print("Number of columns: ",len(dataset[0]))

In [3]:
#Explore Google Dataset
print('Google Dataset Exploration\n')
explore_data(android_wout_headers,2,5,True)

#Explore IOS Dataset
print('IOS Dataset Exploration\n')
explore_data(ios_wout_headers,2,5,True)

Google Dataset Exploration

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows:  10841
Number of columns:  13
IOS Dataset Exploration

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '

The Google Play Store dataset contains 10,841 records and 13 columns. At a glance, the column names are self-explanatory, and columns such as Category, Reviews, Rating, Installs, Genres, and Price will be important in the analysis.


The App Store dataset contains 7,197 records and 16 columns. At a glance, it would be best to refer to the dataset's documentation for column descriptions. Columns such as track_name, user_rating, and prime_genre will be important in the analysis.

## Removing rows with missing values

After reading the discussion section of both datasets and exploring the data myself, I came across an entry with a missing value in the Google Play Store dataset. I use my explore_data function to print rows 10471 through 10477 of the android list, which still includes the header row.

In [4]:
explore_data(android,10471,10478)

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


['Wi-Fi Visualizer', 'TOOLS', '3.9', '132', '2.6M', '50,000+', 'Free', '0', 'Everyone', 'Tools', 'May 17, 2017', '0.0.9', '2.3 and up']


['Lennox iComfort Wi-Fi', 'LIFESTYLE', '3.0', '552', '7.6M', '50,000+', 'Free', '0', 'E

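The Life Made WI-Fi Touchscreen Photo Frame row has only 12 values instead of 13. Its Category value is missing, so every later value is shifted one column to the left: the rating '1.9' sits in the Category position, and its genre value is an empty string. It is best to delete this entry from our dataset.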

In [5]:
del android[10473]
del android_wout_headers[10472]
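
The malformed row above was found by hand. As a hedged alternative, rows whose column count differs from the header can be located programmatically. The sketch below uses a hypothetical helper, `find_malformed_rows`, and a three-column sample instead of the real dataset.

```python
def find_malformed_rows(rows, expected_len):
    """Return (index, row) pairs whose column count differs from expected_len."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected_len]


# Shortened sample rows; the real dataset has 13 columns.
header = ["App", "Category", "Rating"]
sample_rows = [
    ["Jazz Wi-Fi", "COMMUNICATION", "3.4"],
    ["Life Made WI-Fi Touchscreen Photo Frame", "1.9"],  # Category missing
    ["Wi-Fi Visualizer", "TOOLS", "3.9"],
]

bad = find_malformed_rows(sample_rows, len(header))
for index, row in bad:
    print(index, row)

# Building a new list instead of using `del` keeps the cell safe to re-run.
cleaned = [row for row in sample_rows if len(row) == len(header)]
```

Unlike `del android[10473]`, which removes a different row every time the cell is executed again, filtering into a new list gives the same result on every run.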

## Removing Duplicate Rows



In the IOS App Store dataset every application has a unique ID in the id column. The Google Play Store dataset has no ID column, so applications can only be distinguished by name. This means the dataset may contain duplicate rows for the same application, and it is best to remove them.

I use a function I created, remove_repetative_row, to identify duplicates. It takes the dataset and returns two lists of application names: the names seen for the first time, and the names of rows that repeat an earlier name.

In [6]:
def remove_repetative_row(dataset):
    unique_dataset=[]
    duplicate_dataset=[]
    for row in dataset:
        name=row[0]
        if name in unique_dataset:
            duplicate_dataset.append(name)
        else:
            unique_dataset.append(name)
    
    return unique_dataset,duplicate_dataset
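
The function above checks membership in a list, which rescans the list for every row. A minimal sketch of the same idea with a set for the membership test is shown below. The name `split_unique_and_duplicate_names` and the sample rows are hypothetical.

```python
def split_unique_and_duplicate_names(rows, name_index=0):
    """Return (unique_names, duplicate_names), using a set for fast lookups."""
    seen = set()
    unique_names = []
    duplicate_names = []
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicate_names.append(name)
        else:
            seen.add(name)
            unique_names.append(name)
    return unique_names, duplicate_names


sample = [["Instagram"], ["Slack"], ["Instagram"], ["Instagram"]]
unique_names, duplicate_names = split_unique_and_duplicate_names(sample)
print(unique_names)
print(duplicate_names)
```

The returned lists should match those of remove_repetative_row; only the lookup structure differs.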

In [7]:
android_unique,android_duplicate=remove_repetative_row(android_wout_headers)
print(len(android_unique))
print(len(android_duplicate))

print(android_duplicate[10:20])
print("\n")
print("Printing duplicate entry.\n")
for row in android_wout_headers:
    name=row[0]
    if name == "Instagram":
        print(row)

9659
1181
['FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


Printing duplicate entry.

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', '

The Google Play Store dataset has 9,659 unique application names and 1,181 rows that repeat an earlier name. As an example, the names of 10 applications with duplicate entries are printed above, followed by the duplicate entries for Instagram.

Removing duplicate entries is necessary because they would distort our analysis. Duplicates can be removed either at random or according to a criterion. In the Instagram entries above, the only difference is in the Reviews column (the 4th value). A higher review count presumably means the entry was collected more recently, so for each application we keep the entry with the most reviews.

The approach is to build a dictionary in which the key is the application name and the value is the highest review count seen for that name. Once the dictionary is built, we create a new list containing one full row per application.

In [8]:
reviews_max={}
#Duplicate Entry Resolving Criteria
for row in android_wout_headers:
    name=row[0]
    n_reviews=float(row[3])
    
    if name in reviews_max and reviews_max[name]<n_reviews:
        reviews_max[name]=n_reviews
    elif name not in reviews_max:
        reviews_max[name]=n_reviews

        
print("Length of dictionary: ",len(reviews_max)) 
#Preparing Clean android data
android_clean=[]
already_added=[]

for row in android_wout_headers:
    name=row[0]
    n_reviews=float(row[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)

print("Length of cleaned dataset: ",len(android_clean))

Length of dictionary:  9659
Length of cleaned dataset:  9659


The length of our dictionary and our cleaned dataset is the same and we finally have a cleaned dataset containing 9,659 records.
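
The two loops above can also be folded into one pass by storing the best row itself rather than only its review count. The sketch below is an illustration under that assumption, not the notebook's method. `keep_most_reviewed` is a hypothetical name, and the sample rows are shortened to four columns.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # A strictly greater count replaces the stored row, so ties keep the first one.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
    ["Sketch - Draw & Paint", "ART_AND_DESIGN", "4.5", "215644"],
]
deduplicated = keep_most_reviewed(sample)
for row in deduplicated:
    print(row)
```

The output is ordered by each name's first appearance, which may differ from the order of android_clean even though the same rows are kept.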

## Remove Non English Applications

Our company is concerned only with English applications, so I will remove applications with non-English names. I have created a function, english_character, that counts the characters in a name that fall outside the ASCII range (code points above 127) and treats the name as non-English when there are more than three of them.

In [9]:
def english_character(word):
    non_ascii = 0
    
    for character in word:
        #ord() is never negative, so only the upper bound needs checking
        if ord(character) > 127:
            non_ascii += 1
    
    #Allow up to three non-ASCII characters (emoji, trademark signs, etc.)
    return non_ascii <= 3


print(english_character('Instagram'))
print(english_character('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_character('Docs To Go™ Free Office Suite'))
print(english_character('Instachat 😜'))

True
False
True
True


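The results above show that the function behaves as intended on these examples: names with a few non-ASCII characters, such as ™ or an emoji, still count as English, while a name written mostly in Chinese characters does not. The check is a heuristic, so non-English names written entirely in ASCII characters will still pass.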
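
To see why each test name passes or fails, it can help to print the number of non-ASCII characters that the check counts. The helper `count_non_ascii` below is a hypothetical sketch that mirrors the counting step only.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for character in text if ord(character) > 127)


names = [
    "Instagram",
    "Docs To Go™ Free Office Suite",
    "Instachat 😜",
    "爱奇艺PPS -《欢乐颂2》电视剧热播",
]
for name in names:
    # A name is kept as English when this count is at most three.
    print(name, ":", count_non_ascii(name))
```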

In [10]:
#Removing non English Apps from the Data

#Android Dataset
android_cl_eng=[]#android clean english apps
for row in android_clean:
    name=row[0]
    if english_character(name):
        android_cl_eng.append(row)

print('Total Number of Apps: ',len(android_clean))
print('Number of English Apps: ',len(android_cl_eng))

#IOS Dataset
ios_cl_eng=[]#ios clean english apps
for record in ios_wout_headers:
    name=record[1]
    if english_character(name):
        ios_cl_eng.append(record)

print('Total Number of Apps: ',len(ios_wout_headers))
print('Number of English Apps: ',len(ios_cl_eng))

explore_data(android_cl_eng, 0, 3, True)
explore_data(ios_cl_eng, 0, 3, True)



Total Number of Apps:  9659
Number of English Apps:  9614
Total Number of Apps:  7197
Number of English Apps:  6183
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29',

Using english_character, we have extracted all applications with English names. The English applications from the Google Play Store dataset are in the android_cl_eng list, and those from the App Store dataset are in the ios_cl_eng list.

The Android dataset has 9,659 records, of which 9,614 are English. The IOS dataset has 7,197 records, of which 6,183 are English.

## Removing Applications that are Paid

Since our company is concerned only with free applications, we remove all paid applications from the filtered datasets. I have created a function, get_free_apps, that takes a dataset and the index of its price column and returns the rows for free applications. A row counts as free when the value at that index is "0", "0.0", or "Free".

In [11]:
#A function that extracts data of Apps that are free
def get_free_apps(dataset,index):
    dataset_free=[]
    for row in dataset:
        check=row[index]
        if check=="0" or check=='Free' or check=="0.0":
            dataset_free.append(row)
    return dataset_free

In [12]:
#Android Dataset

android_free=get_free_apps(android_cl_eng,7)
print("Number of Free Apps in Android Dataset: ",len(android_free))


#IOS Dataset
ios_free=get_free_apps(ios_cl_eng,4)
print("Number of free Apps in IOS Dataset: ",len(ios_free))


Number of Free Apps in Android Dataset:  8864
Number of free Apps in IOS Dataset:  3222


After running get_free_apps, the Android dataset has 8,864 free apps and the IOS dataset has 3,222.
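
The string comparison in get_free_apps depends on the exact spelling of the price value. A slightly more tolerant sketch parses the value as a number instead. The helper `is_free` is hypothetical, and the "$4.99" style for paid prices is an assumption about the data, not something shown in the outputs above.

```python
def is_free(price_text):
    """Return True when a price string represents zero cost."""
    cleaned = price_text.strip().replace("$", "")
    if cleaned.lower() == "free":
        return True
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number (for example a shifted column) is not free.
        return False


for value in ["0", "0.0", "Free", "$0.00", "$4.99", "Everyone"]:
    print(repr(value), is_free(value))
```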

## Most Common Apps By Genre

### Part one

Because our revenue depends on in-app ads, we need to identify the kind of application that attracts the most users. As a first step, we find the most common genre or category of apps in both the Google Play Store and the App Store. For the Google Play Store dataset we compute frequencies for both the Category and Genres columns and then decide which column to use for further analysis. For the App Store dataset we use the prime_genre column.

I have created two functions below for this analysis. The freq_table function takes the dataset, the index of the column to count, and a boolean, choice_percent, that specifies whether frequencies are returned as percentages or as raw counts. It returns a dictionary in which each key is a distinct value from that column (for example, a genre) and each value is its frequency.

The second function, display_table, prints a frequency table in descending order. It either builds the table with freq_table (when choice_freq is True) or uses the dictionary passed in as freq. It then converts the dictionary into a list of (value, key) tuples, sorts that list with the built-in sorted function, prints it, and returns it.

In [13]:
#Creating Frequency tables
def freq_table(dataset,index,choice_percent):
    freq_dict={}
    total=0
    for row in dataset:
        key=row[index]
        total+=1
        if key in freq_dict:
            freq_dict[key]+=1
            
        else:
            freq_dict[key]=1
    if choice_percent==True:
        freq_dict_percent={}
        for key in freq_dict:
            percentage=(freq_dict[key]/total)*100
            freq_dict_percent[key]=percentage
        return freq_dict_percent
    else:
        return freq_dict

#Displaying Frequency tables in descending order
def display_table(dataset,index,choice_percent,choice_freq,freq):
    if choice_freq==True:
        freq_dict=freq_table(dataset,index,choice_percent)
    else:
        freq_dict=freq
    table_display=[]
    for key in freq_dict:
        key_val_tuple=(freq_dict[key],key)
        table_display.append(key_val_tuple)
    table_sorted=sorted(table_display,reverse=True)
    for entry in table_sorted:
        print(entry[1]," : ",entry[0])
    return table_sorted
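
The same frequency tables can be sketched with `collections.Counter` from the standard library. The names `freq_table_counter` and `sorted_table` are hypothetical, and the result is a list of (key, value) pairs rather than the (value, key) tuples returned by display_table.

```python
from collections import Counter


def freq_table_counter(rows, index, as_percent=False):
    """Count the values in one column, optionally as percentages of all rows."""
    counts = Counter(row[index] for row in rows)
    if not as_percent:
        return dict(counts)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


def sorted_table(freq_dict):
    """Return (key, value) pairs sorted by value, largest first."""
    return sorted(freq_dict.items(), key=lambda item: item[1], reverse=True)


sample = [["a", "GAME"], ["b", "GAME"], ["c", "TOOLS"], ["d", "FAMILY"]]
for key, value in sorted_table(freq_table_counter(sample, 1, as_percent=True)):
    print(key, " : ", value)
```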

In [14]:
#Google Play Store Category Frequency
print("Google Play Store Category Frequency Table\n")
android_category=display_table(android_free,1,True,True,0)
print("\n")
#Google Play Store Genres Frequency
print("Google Play Store Genres Frequency Table\n")
android_genres=display_table(android_free,9,True,True,0)
print("\n")
#IOS Store Prime_Genres Frequency Table
print("IOS Store Prime_Genres Frequency Table\n")
ios_prime_genres=display_table(ios_free,11,True,True,0)
print("\n")

Google Play Store Category Frequency Table

FAMILY  :  18.907942238267147
GAME  :  9.724729241877256
TOOLS  :  8.461191335740072
BUSINESS  :  4.591606498194946
LIFESTYLE  :  3.9034296028880866
PRODUCTIVITY  :  3.892148014440433
FINANCE  :  3.7003610108303246
MEDICAL  :  3.531137184115524
SPORTS  :  3.395758122743682
PERSONALIZATION  :  3.3167870036101084
COMMUNICATION  :  3.2378158844765346
HEALTH_AND_FITNESS  :  3.0798736462093865
PHOTOGRAPHY  :  2.944494584837545
NEWS_AND_MAGAZINES  :  2.7978339350180503
SOCIAL  :  2.6624548736462095
TRAVEL_AND_LOCAL  :  2.33528880866426
SHOPPING  :  2.2450361010830324
BOOKS_AND_REFERENCE  :  2.1435018050541514
DATING  :  1.861462093862816
VIDEO_PLAYERS  :  1.7937725631768955
MAPS_AND_NAVIGATION  :  1.3989169675090252
FOOD_AND_DRINK  :  1.2409747292418771
EDUCATION  :  1.1620036101083033
ENTERTAINMENT  :  0.9589350180505415
LIBRARIES_AND_DEMO  :  0.9363718411552346
AUTO_AND_VEHICLES  :  0.9250902527075812
HOUSE_AND_HOME  :  0.8235559566787004
WEATHER

### Analyzing Prime_genres Frequency table to get insight into IOS Store

The most common genre is <b>Games</b> with <b>58.16%</b>, and the runner-up is <b>Entertainment</b> with <b>7.88%</b>. Most free apps on the IOS store appear to be designed for entertainment rather than for practical purposes. From the frequency table alone it might seem that we should focus only on games, but a large number of apps in a genre does not mean that those apps have the largest number of users. The most popular genre by number of users can be found by comparing the number of installs or user ratings per genre.

### Analyzing Google Play Store Genre and Category frequency table

In the Genres frequency table the most common genre is <b>Tools (8.45%)</b>, and in the Category frequency table the most common category is <b>FAMILY (18.91%)</b>. The Google Play Store frequency tables suggest that there are more productivity apps on the store than games. Comparing the two tables, the Genres column is more granular than the Category column in how it divides apps. Since we are looking at the bigger picture, we will use the Category column for further analysis. We may refer to the Genres column once we are done analyzing the Category column.

### Comparison of Google and App store

On the IOS store we found more entertainment apps than productivity apps, whereas on the Google Play Store it was the other way around. According to our validation strategy, this suggests that a game has a higher chance than a productivity app of ending up on both app stores. It may also indicate that the productivity apps on the IOS store have a permanent audience that is satisfied with them. Since productivity apps are more common on the Google Play Store than on the IOS store, it is possible that few productivity apps earn enough to make it to the IOS store.

Based on the frequency tables, I would recommend an app in the entertainment industry, such as a game. However, the frequency tables tell us which genre has the most apps on the store, not which genre has the most users. After analyzing the genres by number of users, we can reach a conclusion about the best genre for the company.



## Most Common Genre based on the number of users in App Store

The App Store dataset has no column that tells us how many users have downloaded an application. Therefore, we use the total number of user ratings (the rating_count_tot column) as a proxy for the number of users.

### Part one

I first obtain the frequency table of the prime_genre column as raw counts. I then build a dictionary in which the key is the genre and the value is the total number of user ratings for that genre: every time a genre appears, the app's rating count is added to the genre's entry. The average number of ratings per app in a genre is that total divided by the genre's frequency.

In [15]:

prime_genre_freq=freq_table(ios_free,11,False)

#Finding most popular App by genre on App Store
#Doing this by looking at the number of users using the App

prime_genre_rating={}
for row in ios_free:
    rating=float(row[5])
    genre=row[11]
    if genre in prime_genre_rating:
        prime_genre_rating[genre]+=rating
    else:
        prime_genre_rating[genre]=rating
        

prime_genre_avg_rating={}
for key in prime_genre_freq:
    prime_genre_avg_rating[key]=prime_genre_rating[key]/prime_genre_freq[key]

prime_genre_avg_table=display_table(0,0,False,False,prime_genre_avg_rating)



Navigation  :  86090.33333333333
Reference  :  74942.11111111111
Social Networking  :  71548.34905660378
Music  :  57326.530303030304
Weather  :  52279.892857142855
Book  :  39758.5
Food & Drink  :  33333.92307692308
Finance  :  31467.944444444445
Photo & Video  :  28441.54375
Travel  :  28243.8
Shopping  :  26919.690476190477
Health & Fitness  :  23298.015384615384
Sports  :  23008.898550724636
Games  :  22788.6696905016
News  :  21248.023255813954
Productivity  :  21028.410714285714
Utilities  :  18684.456790123455
Lifestyle  :  16485.764705882353
Entertainment  :  14029.830708661417
Business  :  7491.117647058823
Education  :  7003.983050847458
Catalogs  :  4004.0
Medical  :  612.0


From the data above, Navigation apps appear to have the most users. It is best to look at some app profiles before jumping to conclusions.

### Part Two

I have created a function, print_app_genre, for this purpose. It takes the dataset, the index of the genre column, the index of the column that identifies the app (its name), the index of the value column to print (here the rating count), the genre to filter on, and the maximum number of apps to print.

In [16]:
#Going forward with Navigation Apps as the ones having most users

#Printing App profiles of some Navigation apps

def print_app_genre(dataset,index_cond,index_id,index_val,genre,count):
    num=0
    for row in dataset:
        genre_app=row[index_cond]
        if genre_app==genre:
            if num==count:
                break
            else:
                print(row[index_id],":",row[index_val])
                num+=1
    return

print_app_genre(ios_free,11,1,5,"Navigation",10)



Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


After looking at the Navigation app profiles, it seems that most of the ratings in this genre come from heavyweights like <b>Waze - GPS Navigation</b> and <b>Google Maps - Navigation</b>. Similarly, in other front-running genres like Social Networking and Music, most of the ratings are likely due to heavyweights such as Facebook and WhatsApp, and Spotify and SoundCloud, respectively. These heavyweights influence our choice, because some other genre may be more popular overall even though its apps find it difficult to cross a certain number of user ratings. Before making a final decision, it is best to explore other genres.
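
One hedged way to reduce the pull of a few heavyweights is to compare the median rating count per genre with the mean. The sketch below is not part of the original analysis. `median_by_group` is a hypothetical helper, and the sample rows reuse the six Navigation rating counts printed above.

```python
from statistics import mean, median


def median_by_group(rows, group_index, value_index):
    """Return the median of a numeric column for each distinct group value."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[group_index], []).append(float(row[value_index]))
    return {group: median(values) for group, values in grouped.items()}


# [genre, rating count] pairs taken from the Navigation output above.
sample = [
    ["Navigation", "345046"],
    ["Navigation", "154911"],
    ["Navigation", "12811"],
    ["Navigation", "3582"],
    ["Navigation", "187"],
    ["Navigation", "5"],
]
navigation_median = median_by_group(sample, 0, 1)["Navigation"]
navigation_mean = mean(float(row[1]) for row in sample)
print("median:", navigation_median)
print("mean:", navigation_mean)
```

For these six apps the mean is far above the median, which is consistent with two very large apps dominating the genre average.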

In [17]:
#Now looking at the runner up that is Reference section
print_app_genre(ios_free,11,1,5,"Reference",15)

#This genre shows better results than Navigation because even if we
#eliminate the giants we still have apps with a good number of users who downloaded and reviewed them.
#So if we can create an app that offers a couple of popular book series as text as well as audio,
#we might get a good start.

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14


The <b>Reference</b> genre shows better results than the <b>Navigation</b> genre, because even if we eliminate the giants we still have apps with a good number of users who downloaded and reviewed them. So if we can create an app that offers a couple of popular book series as text as well as audio, we might get a good number of users to see the in-app ads.

I also analyzed the Weather, Food & Drink, and Finance genres and reached the following conclusions.
<ul>
    <li>Weather apps: people generally don't spend much time in-app, so the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.</li>
    <li>Food & Drink examples include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.</li>
    <li>Finance apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.</li>
</ul>

However, I have decided to take a look into Photo & Video apps before coming to a conclusion.

In [18]:
#Now looking at Photo & Video apps
print_app_genre(ios_free,11,1,5,"Photo & Video",30)


Instagram : 2161558
Snapchat : 323905
YouTube - Watch Videos, Music, and Live Streams : 278166
Pic Collage - Picture Editor & Photo Collage Maker : 123433
Funimate video editor: add cool effects to videos : 123268
musical.ly - your video social network : 105429
Photo Collage Maker & Photo Editor - Live Collage : 93781
Vine Camera : 90355
Google Photos - unlimited photo and video storage : 88742
Flipagram : 79905
Mixgram - Picture Collage Maker - Pic Photo Editor : 54282
Shutterfly: Prints, Photo Books, Cards Made Easy : 51427
Pic Jointer – Photo Collage, Camera Effects Editor : 51330
Color Pop Effects - Photo Editor & Picture Editing : 45320
Photo Grid - photo collage maker & photo editor : 40531
iSwap Faces LITE : 39722
MOLDIV - Photo Editor, Collage & Beauty Camera : 39501
Photo Editor by Aviary : 39501
Photo Lab: Picture Editor, effects & fun face app : 34585
Rookie Cam - Photo Editor & Filter Camera : 33921
FotoRus -Camera & Photo Editor & Pic Collage Maker : 32558
PicsArt Photo St

Photo & Video apps generally have a good number of engaged users, so I would recommend either a book app or a Photo & Video app: not an editor, but something like TikTok, which is quite trendy these days. However, no decision can be made until we look at the Google Play Store dataset.

## Most Common Category in Google Play store on the basis of Number of Users

### Part one

In [19]:
explore_data(android_free, 0, 5, False)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']




We will use the 6th column (Installs, index 5) for this analysis. It tells us how many users have downloaded and installed an application. The values are not exact: as the examples above show, a value can be 10,000+ or 50,000+, so we only know a lower bound. To use these values, we remove the commas and + signs and convert them to numbers. We then sum them per category in a dictionary, and compute the average number of installs by dividing each category's total by the frequency of that category.
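
The cleaning step described above can be isolated in a small helper, sketched below. The name `parse_installs` is hypothetical. The returned number is the lower bound stated in the dataset, not an exact install count.

```python
def parse_installs(text):
    """Convert a value such as '10,000+' into the float 10000.0."""
    return float(text.replace(",", "").replace("+", ""))


for raw in ["10,000+", "50,000+", "1,000,000,000+"]:
    print(raw, "->", parse_installs(raw))
```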

In [20]:
#Google Play Store most popular Apps by Genre analysis

category_freq=freq_table(android_free,1,False)

category_freq_install={}
for row in android_free:
    no_install=row[5]
    key=row[1]
    no_install=no_install.replace(',','')
    no_install=float(no_install.replace('+',''))
    if key in category_freq_install:
        category_freq_install[key]+=no_install
    else:
        category_freq_install[key]=no_install

category_avg_install={}        
for key in category_freq_install:
    category_avg_install[key]=category_freq_install[key]/category_freq[key]

category_avg_table=display_table(0,0,False,False,category_avg_install)
    

COMMUNICATION  :  38456119.167247385
VIDEO_PLAYERS  :  24727872.452830188
SOCIAL  :  23253652.127118643
PHOTOGRAPHY  :  17840110.40229885
PRODUCTIVITY  :  16787331.344927534
GAME  :  15588015.603248259
TRAVEL_AND_LOCAL  :  13984077.710144928
ENTERTAINMENT  :  11640705.88235294
TOOLS  :  10801391.298666667
NEWS_AND_MAGAZINES  :  9549178.467741935
BOOKS_AND_REFERENCE  :  8767811.894736841
SHOPPING  :  7036877.311557789
PERSONALIZATION  :  5201482.6122448975
WEATHER  :  5074486.197183099
HEALTH_AND_FITNESS  :  4188821.9853479853
MAPS_AND_NAVIGATION  :  4056941.7741935486
FAMILY  :  3695641.8198090694
SPORTS  :  3638640.1428571427
ART_AND_DESIGN  :  1986335.0877192982
FOOD_AND_DRINK  :  1924897.7363636363
EDUCATION  :  1833495.145631068
BUSINESS  :  1712290.1474201474
LIFESTYLE  :  1437816.2687861272
FINANCE  :  1387692.475609756
HOUSE_AND_HOME  :  1331540.5616438356
DATING  :  854028.8303030303
COMICS  :  817657.2727272727
AUTO_AND_VEHICLES  :  647317.8170731707
LIBRARIES_AND_DEMO  :  638

The table above shows that COMMUNICATION is the most popular category by average installs, but this average is very likely influenced by big names such as WhatsApp, Gmail, Facebook, Skype, etc.

In [21]:
print_app_genre(android_free,1,0,5,"COMMUNICATION",10)

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+


This shows that the Communication category is highly skewed by the big players. Therefore, it would not be an ideal choice.
Next we look at video players, where the results are also skewed by big players such as YouTube and Vimeo. However, with some effort on the marketing side, for example by hiring stars from other platforms, we may be able to bring a considerable amount of traffic to our platform. Video apps can keep an audience engaged for several hours, so this may be an ideal category.

Other categories such as social, photography, productivity, and game are also highly skewed by the big players, and the game category is highly saturated: a small number of games gather a very large audience. The books and reference category is still fairly popular, with an average of about 8,767,811 installs per app. It would be best to explore this category in detail before jumping to conclusions.
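Because a mean can be pulled up by one or two very large apps, comparing it with the median gives a rough sense of how skewed a category is. The following is a minimal sketch under stated assumptions: it uses Python's standard `statistics` module, a hypothetical `parse_installs` helper, and a tiny hand-picked sample of Books and Reference apps from the dataset, so the numbers it prints describe the sample only and not the category as a whole.

```python
from statistics import mean, median

def parse_installs(installs):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(installs.replace(',', '').replace('+', ''))

# Small illustrative sample, NOT the full category
sample = {
    'Google Play Books': '1,000,000,000+',
    'Wikipedia': '10,000,000+',
    'Book store': '1,000,000+',
    'English Grammar Complete Handbook': '500,000+',
    'E-Book Read - Read Book for free': '50,000+',
}

values = [parse_installs(v) for v in sample.values()]
sample_mean = mean(values)
sample_median = median(values)

print('Mean installs   :', sample_mean)
print('Median installs :', sample_median)
```

A mean far above the median suggests that a few apps account for most of the installs.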

In [22]:
print_app_genre(android_free,1,0,5,"BOOKS_AND_REFERENCE",50)


E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

The books and reference category is also skewed by big players, but if we take a look at the first 50 apps in this category, the number of big players appears to be smaller than in other categories. It would be best to verify this before analysing it further.
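A more direct way to compare categories on this point is to compute, for each category, the fraction of apps at or above a "big player" threshold. The sketch below assumes a threshold of 100,000,000 installs and uses a hypothetical helper, `big_player_share`. The sample rows are a few apps from the dataset chosen for illustration, so the resulting fractions say nothing about the real proportions.

```python
from collections import defaultdict

def big_player_share(rows, threshold=100_000_000):
    """Return {category: fraction of apps with installs >= threshold}.

    rows is an iterable of (category, installs) pairs, where installs
    is a string such as '1,000,000+'.
    """
    totals = defaultdict(int)
    big = defaultdict(int)
    for category, installs in rows:
        totals[category] += 1
        if float(installs.replace(',', '').replace('+', '')) >= threshold:
            big[category] += 1
    return {category: big[category] / totals[category] for category in totals}

# Small illustrative sample, NOT the full dataset
sample = [
    ('COMMUNICATION', '1,000,000,000+'),        # WhatsApp Messenger
    ('COMMUNICATION', '100,000,000+'),          # imo beta free calls and text
    ('COMMUNICATION', '10,000,000+'),           # Messenger for SMS
    ('COMMUNICATION', '5,000,000+'),            # My Tele2
    ('BOOKS_AND_REFERENCE', '1,000,000,000+'),  # Google Play Books
    ('BOOKS_AND_REFERENCE', '10,000,000+'),     # Wikipedia
    ('BOOKS_AND_REFERENCE', '1,000,000+'),      # Book store
    ('BOOKS_AND_REFERENCE', '50,000+'),         # E-Book Read - Read Book for free
]

shares = big_player_share(sample)
for category, share in shares.items():
    print(category, ':', share)
```

With the notebook's data, the pairs could be built as `(row[1], row[5])` for each row in `android_free`.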

In [23]:
for row in android_free:
    if row[1]=="BOOKS_AND_REFERENCE" and (row[5]=='1,000,000,000+' or row[5]=='500,000,000+' or row[5]=='100,000,000+'):
        print(row[0]," : ",row[5])

Google Play Books  :  1,000,000,000+
Bible  :  100,000,000+
Amazon Kindle  :  100,000,000+
Wattpad 📖 Free Books  :  100,000,000+
Audiobooks from Audible  :  100,000,000+


There are a few big players in this category. However, in order to come up with an idea, it would be best to look at the apps that are somewhere in the middle in terms of popularity.
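Selecting the "middle" of a category can also be expressed as a numeric range instead of a list of exact install strings, which avoids missing a bracket by accident. The sketch below is one possible way to do it: `apps_in_range` is a hypothetical helper, the bounds of 1,000,000 and 50,000,000 are assumptions for illustration, and the rows are a small sample written as `(name, category, installs)` tuples rather than the full rows of `android_free`.

```python
def apps_in_range(rows, category, low, high):
    """Return (name, installs) pairs for apps in `category`
    whose install count satisfies low <= installs <= high."""
    selected = []
    for name, app_category, installs in rows:
        count = float(installs.replace(',', '').replace('+', ''))
        if app_category == category and low <= count <= high:
            selected.append((name, installs))
    return selected

# Small illustrative sample, NOT the full dataset
sample = [
    ('Wikipedia', 'BOOKS_AND_REFERENCE', '10,000,000+'),
    ('Google Play Books', 'BOOKS_AND_REFERENCE', '1,000,000,000+'),
    ('E-Book Read - Read Book for free', 'BOOKS_AND_REFERENCE', '50,000+'),
    ('Book store', 'BOOKS_AND_REFERENCE', '1,000,000+'),
    ('Ancestry', 'BOOKS_AND_REFERENCE', '5,000,000+'),
    ('WhatsApp Messenger', 'COMMUNICATION', '1,000,000,000+'),
]

mid_popularity = apps_in_range(sample, 'BOOKS_AND_REFERENCE', 1_000_000, 50_000_000)
for name, installs in mid_popularity:
    print(name, ' : ', installs)
```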

In [24]:
for row in android_free:
    if row[1]=="BOOKS_AND_REFERENCE" and (row[5]=='1,000,000+' or row[5]=='5,000,000+' or row[5]=='10,000,000+'or row[5]=='50,000,000+'):
        print(row[0]," : ",row[5])

Wikipedia  :  10,000,000+
Cool Reader  :  10,000,000+
Book store  :  1,000,000+
FBReader: Favorite Book Reader  :  10,000,000+
Free Books - Spirit Fanfiction and Stories  :  1,000,000+
AlReader -any text book reader  :  5,000,000+
FamilySearch Tree  :  1,000,000+
Cloud of Books  :  1,000,000+
ReadEra – free ebook reader  :  1,000,000+
Ebook Reader  :  5,000,000+
Read books online  :  5,000,000+
eBoox: book reader fb2 epub zip  :  1,000,000+
All Maths Formulas  :  1,000,000+
Ancestry  :  5,000,000+
HTC Help  :  10,000,000+
Moon+ Reader  :  10,000,000+
English-Myanmar Dictionary  :  1,000,000+
Golden Dictionary (EN-AR)  :  1,000,000+
All Language Translator Free  :  1,000,000+
Aldiko Book Reader  :  10,000,000+
Dictionary - WordWeb  :  5,000,000+
50000 Free eBooks & Free AudioBooks  :  5,000,000+
Al-Quran (Free)  :  10,000,000+
Al Quran Indonesia  :  10,000,000+
Al'Quran Bahasa Indonesia  :  10,000,000+
Al Quran Al karim  :  1,000,000+
Al Quran : EAlim - Translations & MP3 Offline  :  5,

The entries above consist of religious books, dictionaries, e-books, and e-libraries. Working on a religious book would require hiring religious experts, while working on a dictionary would require a language expert. Moreover, people do not spend much time in these applications, so they are not a good fit given that our source of revenue is in-app ads. The best option would be to work on e-books or e-libraries. We can include several books in our application, available as both text and audio.

## Conclusion

After analyzing the data above, I have settled on two categories: Books and Reference, and Photo and Video. For the Photo and Video category we could build an application like TikTok or Musical.ly, but it involves a significant investment, as we would need to hire influencers to market the app and bring people over to our platform. For an e-book or audiobook app, however, even a simple advertisement would help bring our application to book lovers. After a risk assessment, it would be best to go with the Books and Reference idea, as the risk and investment involved in this project are low while the potential returns are high.
