<h1>Profitable App Profiles for the App Store and Google Play Markets</h1>

**The premise of this project is one works as a data analyst at a company that builds Android and iOS mobile apps that are then made avaialble on their respective app stores.**

Because we only build free apps our primary source of revenue is through an ad-based model, meaning, in-app adds. This implies that our most important determinant of revenue is the volume of users. In order to find out how to generate the most volume of users, we need to find out which apps on the stores engage the most amount of users. The more engagement, the more ad views, the better. To find that, we will analyze all of the apps on the respective stores and determine what typs of apps generate the most engagement with our in-app ads.

We will need to analyze two separate datasets (the Google Play and Apple Store sets), filter by price and language, then scan the categories. We will then reconcile business needs with the results of the project.

**Task List**
- [X] Write detailed, clear project description
<br>
- [ ] Import and open Google Play and Apple Store data separately
<br>
- [ ] Explore both datasets
<br>
- [ ] Data Cleaning
<br>
- [ ] Analysis of Data

In [1]:
from csv import reader
# This is Google Play dataset
opened_file= open('Googleplaystore.csv', encoding ='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

#This is Apple Store dataset
opened_file = open('AppleStore.csv', encoding ='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

**Task List**
- [X] Write detailed, clear project description
<br>
- [X] Import and open Google Play and Apple Store data separately
<br>
- [ ] Explore both datasets
<br>
- [ ] Data Cleaning
<br>
- [ ] Analysis of Data

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


**Task List**
- [X] Write detailed, clear project description
<br>
- [X] Import and open Google Play and Apple Store data separately
<br>
- [X] Explore both datasets
<br>
- [ ] Data Cleaning
<br>
- [ ] Analysis of Data

**At this point, what columns will help identify useful information to use in our analysis and business-decision making?**

The columns that I picked are: price, total rating count, primary genre, and language.
Why?
<br>**Price**: we only care about free apps because we make free apps
<br>**Total rating count**: we care about the volume of users engaging with the app
<br> **Primary Genre** : we care about the genre because of the outsized effects of heavyweights in the genre
<br>**Language**: Our adds are in English so we are targetting the English speaking market.

<h1>Begin Data Cleaning</h1>
There is an error in row 10472 where there is a missing rating. To clean data, delete the row at index 10472 (because the header has been removed in the android variable)

In [5]:
explore_data(android, 10472, 10473, True)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10841
Number of columns: 13


In [6]:
del android[10472]

<h3> Remove Duplicate Entries </h3>
As an example, there are multiple instances of the Instagram app. This needs to be removed for all apps using for loops and lists.

In [7]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps', len(duplicate_apps))
print('\n')
print('Example of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps 1181


Example of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


**How should I know which duplicate entries to delete?**
One possible criterion is to pick the duplicate entry with the highest amount of reviews. This suggests the most recent data. *It doesn't make sense to remove duplicate rows arbitrarily because it may leave out current and accurate information*


**Solution**: Use key-value pair for (app name , amount of reviews) pair. Each app name will be a unique key and the value will be selected to be the highest number of reviews.

In [8]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [9]:
print('Expected length', len(android) - 1181)
print('Actual length', len(reviews_max))

Expected length 9659
Actual length 9659


Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

We start by initializing two empty lists, android_clean and already_added.

We loop through the android data set, and for every iteration:

We isolate the name of the app and the number of reviews.

We add the current row (app) to the android_clean list, and the app name (name) to the already_cleaned list if:

The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and

The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.

In [10]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


**Next, we want to check the entire datasets for apps that are in English only**
<br>I will create a function that will then loop through the datasets and filter out non-English apps

In [11]:
def check_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True
        
print(check_english('Instagram'))
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

True
False
False
False


Notice above that two English apps were removed because the TM and emojii are ASCII characters that have an ord > 127. To minimize data loss, we'll remove an app if it has more than three chracters with an ord > 127

In [12]:
def check_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii +=1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

True
True


In [13]:
android_english = []
ios_english = []

for app in android_clean:
    name= app[0]
    if check_english:
        android_english.append(app)
    
for app in ios:
    name = app[1]
    if check_english(name):
        ios_english.append(app)
    
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188

<h1> Now, isolate the free apps </h1> Use for loops and conditional statements

In [14]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)

print(len(android_final))
print(len(ios_final))

8905
0


We're left with 8864 Android apps and 3222 iOS apps, which should be enough for our analysis.

**Task List**
- [X] Write detailed, clear project description
<br>
- [X] Import and open Google Play and Apple Store data separately
<br>
- [X] Explore both datasets
<br>
- [X] Data Cleaning
<br>
- [ ] Analysis of Data



<h1> Most Common Apps by Genre </h1>
<h3> Part One </h3>

As part of the original plan, we need to find out what type of app is most liklely to attract users because the revenue generated by the app is highly dependent on volume of users looking at the in-app ads.

As part of a business strategy to reduce risk and overhead expenses:
<br>1) We start with a mininal Android version.
   **Why?** Because Android OS has a larger market share (about 75%). Meaning, more    volume of users
<br>
2) If the app gets good responses from users, develop it further
<br>
3) After six months, if the Android version of the app is profitable, then develop the iOS version of the app and add it to App Store.
<br>
<br>Ultimately, the end goal is an app that's both profitable on both the Google Play and App Store. So, we need to analyze both data sets.

<h3> Part Two </h3>
We'll build two functions we can use to analyze the frequency tables:
<br>
1) One function to generate frequency tables that show percentages
<br>
2) Another function that we can use to display the percentages in a descending order

In [15]:
def freq_table(dataset, index):
    table={}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] +=1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key]/total)*100
        table_percentages[key]= percentage
        
    return table_percentages
    




def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

<h3>Part Three</h3>
We start by examining the frequency table for the prime_genre column of the App Store data set.

In [16]:
display_table(ios_final, -5)

<h2>Interpretation (iOS store):</h2> 
**Observations:**What we are seeing here is that more than 58% of the free, English-based apps on the iOS App Store are games. This is followed by 7% in the Entertainment category, ~5% in the Photo & Video category, 3.6% in the Education category, and 3.3% in the Social Networking category. 
<br><br>
**Analysis:** What this suggests is that in the catgories of free, English-based apps, the Apple App Store is dominated by apps designed to entertain and take up time. Meanwhile, utility apps take a lower priority. *This does not in any way imply, at the moment, that fun-based apps would see large volume of users for our revenue generation scheme. We need to dig further to find out*

In [17]:
display_table(android_final, 1)

FAMILY : 18.97810218978102
GAME : 9.70241437394722
TOOLS : 8.433464345873105
BUSINESS : 4.581695676586187
LIFESTYLE : 3.9303761931499155
PRODUCTIVITY : 3.885457608085345
FINANCE : 3.6833239752947784
MEDICAL : 3.5148792813026386
SPORTS : 3.3801235261089273
PERSONALIZATION : 3.312745648512072
COMMUNICATION : 3.2341381246490735
HEALTH_AND_FITNESS : 3.065693430656934
PHOTOGRAPHY : 2.9421673217293653
NEWS_AND_MAGAZINES : 2.829870859067939
SOCIAL : 2.6501965188096577
TRAVEL_AND_LOCAL : 2.3245367770915215
SHOPPING : 2.2459292532285233
BOOKS_AND_REFERENCE : 2.1785513756316677
DATING : 1.8528916339135317
VIDEO_PLAYERS : 1.7967434025828188
MAPS_AND_NAVIGATION : 1.4149354295339696
FOOD_AND_DRINK : 1.235261089275688
EDUCATION : 1.167883211678832
ENTERTAINMENT : 0.9545199326221224
LIBRARIES_AND_DEMO : 0.9320606400898372
AUTO_AND_VEHICLES : 0.9208309938236946
HOUSE_AND_HOME : 0.8197641774284109
WEATHER : 0.7973048848961257
EVENTS : 0.7074677147669848
PARENTING : 0.6513194834362718
ART_AND_DESIGN : 0

<h1> Interpretation (Google Play Store): </h1>
**Observations:** To contrast, the most frequent free, English-based app in the Google Play Store is actually in the Family category at 18%. This doubles the next category of Games at 9%. In third, fourth, and fifth place are the Tools, Business, and Lifestyle category apps, respectively. 
<br><br>
**Analysis:** This suggests that a free, English-based app in the *utility* category would perform much better on the Play Store than on the App Store. It is of interest that the Games category came in second place, suggesting that perrhaps a Game-type of app would fair sufficiently well on the Google Play Store because of the 2nd place popularity. Then, if the business strategy continues, we would expect our free, English-based Game-category app to recieve a boost in users on the Apple App Store.

In [18]:
display_table(android_final, -4)

Tools : 8.422234699606962
Entertainment : 6.086468276249298
Education : 5.390230207748456
Business : 4.581695676586187
Lifestyle : 3.9191465468837734
Productivity : 3.885457608085345
Finance : 3.6833239752947784
Medical : 3.5148792813026386
Sports : 3.4475014037057834
Personalization : 3.312745648512072
Communication : 3.2341381246490735
Action : 3.0881527231892196
Health & Fitness : 3.065693430656934
Photography : 2.9421673217293653
News & Magazines : 2.829870859067939
Social : 2.6501965188096577
Travel & Local : 2.313307130825379
Shopping : 2.2459292532285233
Books & Reference : 2.1785513756316677
Simulation : 2.0662549129702414
Dating : 1.8528916339135317
Arcade : 1.8416619876473892
Video Players & Editors : 1.7742841100505335
Casual : 1.7518248175182483
Maps & Navigation : 1.4149354295339696
Food & Drink : 1.235261089275688
Puzzle : 1.1229646266142617
Racing : 0.9882088714205502
Role Playing : 0.9320606400898372
Libraries & Demo : 0.9320606400898372
Strategy : 0.9208309938236946
Au

<h1>Interpretation (Google Play Store):</h1>
**Observations:** Tools category is at 8.4%, Entertainment category is at 6%, Education category is at 5.3%, Business is at 4.5%, and Lifestyle is at 3.9%.
<br><br>
**Analysis:** The Genres column (index -4) is much more granular than the Category column (index 1) and doesn't help paint a clear big picture. For now, suggest use the big picture approach and focus on Category column.

<h1> Most Popular Apps by Category on the Play and App Stores</h1>
<br>
One way to determine which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. The Google Play dataset has this information in its Install column, but the App Store data does not. So, we will us rating_count_tot as a proxy.

In [19]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre +=1
    avg_n_ratings = (total)/(len_genre)
    print(genre, ':', avg_n_ratings)
    

**The list above is not in order** Picking through the haystack, the category with the highest number of user reviews is actually in Navigation. Suspicion tells me that the Navigation category is heavily dominated by Google Maps and Waze. 
<br> To check, let's run some code 

In [20]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

** The pattern above seems to be that a few heavyweights dominate the app category. Is the same pattern true for the social networking category? **

In [21]:
for app in ios_final:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])

**Similarly, the social networking category is dominated by some heavyweights like Facebook, Pinterest, Skype, Messenger, and Tumblr**
<br>
<br>
What about the music category?

In [22]:
for app in ios_final:
    if app[-5] == 'Music':
        print(app[1],':', app[5])

**Looks like the pattern is the same. A few heavyweights e.g. Pandora, Shazam, and Spotify dominate the category**

<h3> Extended Analysis </h3>
At this point, we are looking for popular genres in which our free app would do well in. It seems that we need to exclude the Social Networking, the Navigation, and the Music genres because the heavyweight incumbents tend to recieve a majority of ratings. Meanwhile, other apps in the same category struggle to break 10000+ ratings.
<br>
<br>
*What I am going to do next is look at the Reference category and start removing the extremely popular apps to rework the averages*

In [23]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1],':',app[5])

Subtracting the popular apps, this category shows some promise. A prospective idea is to take a popular book and convert it into an app with extra features, making the app more practical. This could possibly help stand out on the Apple App Store because that store is saturated with Game Apps.
<br>
<br>
Other categories to analyze:
<br><br>
Weather - People don't spend enough time on these types of apps for us to make any profit
<br><br>
Food & Drink - The actual heavyweights, e.g. Starbucks, Dunkin' Donuts, and McDonald's have an actual business and service behind the app. This is outside of scope of the company.
<br><br>
Finance - While an interesting category, it would require the company to hire a finance expert just to build an app. Not worth the time

<h1> Most Populars Apps by Category on Google Play Store </h1>
For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [24]:
display_table(android_final, 5)

1,000,000+ : 15.687815833801237
100,000+ : 11.577765300393038
10,000,000+ : 10.499719258843346
10,000+ : 10.252667040988209
1,000+ : 8.422234699606962
100+ : 6.917462099943853
5,000,000+ : 6.816395283548568
500,000+ : 5.53621560920831
50,000+ : 4.817518248175182
5,000+ : 4.525547445255475
10+ : 3.537338573834924
500+ : 3.2341381246490735
50,000,000+ : 2.2908478382930935
100,000,000+ : 2.1224031443009546
50+ : 1.9090398652442448
5+ : 0.7860752386299831
1+ : 0.5165637282425604
500,000,000+ : 0.26951151038742277
1,000,000,000+ : 0.22459292532285235
0+ : 0.044918585064570464
0 : 0.011229646266142616


While the above data is imprecise because of the large categories, we actually don't need the most precise data. We just want to understand what app genre attracts the most users
<br><br>
To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).

In [25]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)


ART_AND_DESIGN : 1952105.1724137932
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8587351.855670104
BUSINESS : 1708215.906862745
COMICS : 803234.8214285715
COMMUNICATION : 38322625.697916664
DATING : 854028.8303030303
EDUCATION : 1825480.7692307692
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1436126.94
GAME : 15551995.891203703
FAMILY : 3668870.823076923
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7001693.425
PHOTOGRAPHY : 17772018.759541985
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10787009.952063914
PERSONALIZATION : 5183850.806779661
PRODUCTIVITY : 16738957.554913295
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24573948.25
NEWS_AND_MAGAZINES : 9401635.95

On average, communication apps have the most installs: 38 million. This number is heavily skewed up by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

In [26]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we removed all the communication apps that have over 100 million installs, the average would be reduced roughly ten times:

In [27]:
under_100_m =[]

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace('+','')
    n_installs = n_installs.replace(',','')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m)/ len(under_100_m)

3589717.245210728

**So, from an average of 38M we went to an average of 3.5M after removing all communication apps that have over 100 million installs** It seems these niches are dominated by a few giants.
<br> <br>
Perhaps the original idea of making a practical book & reference app makes most sense right now

In [28]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0],':',app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

**Looks decent so far. Let's check the heavy weights that skew the average**

In [29]:
for app in android_final:
    if app[1]=='BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+' 
                                           or app[5]=='500,000,000+' 
                                           or app[5]=='100,000,000+'):
        print(app[0],':',app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


**There may be potential in this category because there are only a few really popular apps** 
<br>
To get a better idea, I will  check the somewhat middle intervals of popularity for this category. Intervals will be from 1M to 100M downloads.

In [30]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5]== '1,000,000+'
                                           or app[5]== '5,000,000+'
                                           or app[5]=='10,000,000+'
                                           or app[5]=='50,000,000+'):
        print(app[0],':',app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the book Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

**Task List**
- [X] Write detailed, clear project description
<br>
- [X] Import and open Google Play and Apple Store data separately
<br>
- [X] Explore both datasets
<br>
- [X] Data Cleaning
<br>
- [X] Analysis of Data

<h1> CONCLUSION </h1>
This project took datasets from the Google Play and Apple App Stores to analyze and conclude a potentially profitable free, English-based app.
<br><br>
The current conclusion suggests that a practical, utlity-based app around a popular, recent book with loads of extra features will make the most business sense. There needs to be extra features besides the raw version of the book, because the market is already full of libraries e.g. Kindle.
<br><br>
Extra features to include would be:
<br>
1)Daily reminder to read the book
<br>
2)Daily quotes from the book
<br>
3)Quizes on the book
<br>
4)An audio version of the book
<br>
5)A forum to discuss the book
<br>
6)The ability to highlight and save notes in the actual text of the book