# Finding Opportunities for New Apps in the App Store and Google Play

Our goal for this project is to analyze data to understand what types of apps are likely to attract more users. We are going to look for a profitable mobile app profile for the App Store and Google Play markets.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. 

To do this, we'll need to analyze data about mobile apps available on Google Play and the App Store. Rather than the full catalogs, we'll analyze a sample of the data.

## Exploring the Data

We'll examine two data sets:
* [data](https://www.kaggle.com/lava18/google-play-store-apps/home) about approximately ten thousand Android apps from Google Play
* [data](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) about approximately seven thousand iOS apps from the App Store

The function **open_dataset** opens a CSV file, reads it with the `reader()` function from the `csv` module, and returns the header row and the data rows (a list of lists) separately.

The function **print_info_dataset** explores a data set: it prints the header, the first few rows, and the number of rows and columns.

In [1]:
# Open a CSV file and return the header and the data rows (list of lists) separately

from csv import reader

def open_dataset(file_name, key = 'r', header_ = True):
    # key is the file mode (read mode by default)
    with open(file_name, key, encoding='utf8') as opened_file:   # the file is closed on exit
        data = list(reader(opened_file))   # list of lists, one list per row
    if header_:
        return data[0], data[1:]      # split off the header row
    return [], data                   # the file has no header row


In [2]:
# Function to print info about Dataset

def print_info_dataset(data, header, num_rows = 2):
    if header:
        print('Header - Names of columns ','\n','\n', header)    
    print('\n', 'First', num_rows, 'rows of dataset')
    for item in data[:num_rows]:
        print('\n', item) 
    
    print('\n', len(data), '- Number of rows - total data rows, excluding the header row')
    print(len(data[0]), '- Number of columns - total data columns')


> Let's open the two data sets and explore the data.

In [3]:

header_ios, data_ios = open_dataset ('AppleStore.csv')    
print('### IOS apps ###', '\n')
print_info_dataset(data_ios, header_ios)
print('-----------------------------')
print('\n')

header_goo, data_goo = open_dataset ('googleplaystore.csv')    
print('### Android apps ###', '\n')
print_info_dataset(data_goo, header_goo)

### IOS apps ### 

Header - Names of columns  
 
 ['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

 First 2 rows of dataset

 ['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']

 ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']

 7197 - Number of rows - total data rows, excluding the header row
17 - Number of columns - total data columns
-----------------------------


### Android apps ### 

Header - Names of columns  
 
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

 First 2 rows of dataset

 ['Photo 

---


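* The first data set contains information about 7197 Apple iOS mobile applications. The data was extracted from the iTunes Search API on the Apple Inc. website.
* It has 7197 rows and 17 columns: 16 documented columns plus an unnamed column at index 0 that appears to hold a row number.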

**Columns in the data set of iOS apps from the App Store:**


| Index | Column name | Description |
| ----- | ----------- | ----------- |
| 1 | "id" | App ID |
| 2 | "track_name" | App name |
| 3 | "size_bytes" | Size (in bytes) |
| 4 | "currency" | Currency type |
| 5 | "price" | Price amount |
| 6 | "rating_count_tot" | User rating count (for all versions) |
| 7 | "rating_count_ver" | User rating count (for the current version) |
| 8 | "user_rating" | Average user rating (for all versions) |
| 9 | "user_rating_ver" | Average user rating (for the current version) |
| 10 | "ver" | Latest version code |
| 11 | "cont_rating" | Content rating |
| 12 | "prime_genre" | Primary genre |
| 13 | "sup_devices.num" | Number of supported devices |
| 14 | "ipadSc_urls.num" | Number of screenshots shown for display |
| 15 | "lang.num" | Number of supported languages |
| 16 | "vpp_lic" | VPP device-based licensing enabled |

* The second data set contains information scraped from the Google Play Store.
* It has 10841 rows and 13 columns.

**Columns in the data set of Android apps from Google Play:**

| Index | Column name | Description |
| ----- | ----------- | ----------- |
| 0 | "App" | App name |
| 1 | "Category" | Category |
| 2 | "Rating" | Rating |
| 3 | "Reviews" | Number of reviews |
| 4 | "Size" | Size |
| 5 | "Installs" | Number of installs |
| 6 | "Type" | Free or paid |
| 7 | "Price" | Price amount |
| 8 | "Content Rating" | Content rating |
| 9 | "Genres" | Genre(s) |
| 10 | "Last Updated" | Date of the last update |
| 11 | "Current Ver" | Current version |
| 12 | "Android Ver" | Required Android version |
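The column positions in the two tables are used as hard-coded indexes throughout the analysis (for example `6` for `rating_count_tot` or `-5` for `prime_genre`). As a small sketch, the positions can be checked against the header lists printed earlier. The helper name `col_index` is hypothetical and not part of the notebook.

```python
# Header lists copied from the notebook output above.
header_ios = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
header_goo = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
              'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
              'Current Ver', 'Android Ver']


def col_index(header, name):
    """Return the position of the column called `name` (hypothetical helper)."""
    return header.index(name)


# prime_genre sits at index 12, which is the same column as index -5.
print(col_index(header_ios, 'prime_genre'), len(header_ios) - 5)
# Genres sits at index 9, which is the same column as index -4.
print(col_index(header_goo, 'Genres'), len(header_goo) - 4)
```

Looking columns up by name makes it harder to read the wrong column by accident when a data set carries an extra leading column, as the iOS file does.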


## Data cleaning and formatting

Before the analysis, let's clean the data: this includes removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis.

We need to:
* Remove or correct inaccurate data
* Detect duplicate data, and remove the duplicates. 

In our analysis we are only interested in **free** apps aimed at an English-speaking audience. We need to:
* Remove non-English apps
* Remove apps that aren't free.

### Correct data

> The Google Play data set (data_goo) has a discussion section on Kaggle. [One of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) reports a wrong entry in the data set.

> The problem is in row 10472 (index 10472 of data_goo, with the header row excluded).


In [4]:
print(header_goo, '\n')  # Header
print(data_goo[10472:10474])   # 2 rows

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

[['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'], ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']]


> The printed row has only 12 values instead of 13: its **Category** value is missing, so every later value is shifted one column to the left. As a result the Rating column holds '19', although the maximum rating for a Google Play app is 5. We'll take the missing category from the app's [Google Play page](https://play.google.com/store/apps/details?id=com.lifemade.internetPhotoframe) and correct the row.

In [5]:

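# Correcting data for row 10472
for item in data_goo:
    # The second condition keeps this cell safe to run more than once:
    # before the fix, index 2 holds '19' (the review count shifted into the Rating column)
    if item[0] == 'Life Made WI-Fi Touchscreen Photo Frame' and float(item[2]) == 19:  
        item.insert(1, 'LIFESTYLE')  # list.insert(i, x) puts the category back at index 1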

print(data_goo[10472])

['Life Made WI-Fi Touchscreen Photo Frame', 'LIFESTYLE', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
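The broken row was found through the Kaggle discussion, but the same kind of error can also be detected from the data itself, because a row with a missing value is shorter than the header. The following is a minimal sketch of that check; the helper name `find_malformed_rows` is hypothetical, and the two sample rows are copied from the output of cell 4.

```python
header_goo = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
              'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
              'Current Ver', 'Android Ver']

# Two rows copied from the notebook output: the first one lacks its Category.
sample = [
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M',
     '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018',
     '6.06.14', '4.4 and up'],
]


def find_malformed_rows(data, header):
    """Return (index, row) pairs for rows whose length differs from the header."""
    return [(i, row) for i, row in enumerate(data) if len(row) != len(header)]


for index, row in find_malformed_rows(sample, header_goo):
    print(index, len(row), row[0])
```

Applied to the full `data_goo` list before the correction, a check like this should flag row 10472; this is an expectation based on the row shown above, not a result reproduced here.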


### Removing Duplicate Entries

> * We don't want to count any app more than once when we analyze the data, so we need to remove the duplicate entries.
> * Let's find all duplicate rows in our data sets with the function **duplicate**.

In [6]:
def duplicate(data):
    duplicate_rows = []
    unique_rows = []
    for item in data:
        if item in unique_rows:
            duplicate_rows.append(item)
        else:
            unique_rows.append(item)
    return duplicate_rows, unique_rows

duplicate_goo, unique_goo = duplicate(data_goo)
duplicate_ios, unique_ios = duplicate(data_ios)

print('duplicate rows in data_goo - ', len(duplicate_goo), '; unique rows in data_goo - unique_goo:', len(unique_goo),  '\n')
print('duplicate rows in data_ios - ', len(duplicate_ios), '; unique rows in data_ios - ', len(unique_ios))
print('\n')
print(unique_goo[:3])

duplicate rows in data_goo -  483 ; unique rows in data_goo - unique_goo: 10358 

duplicate rows in data_ios -  0 ; unique rows in data_ios -  7197


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]
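The `duplicate` function compares every row against a growing list, which gets slow as the data grows. As an alternative sketch (not the notebook's method), rows can be converted to tuples and tracked in a set, which makes each membership test fast. The function name `split_duplicates` and the sample rows are made up for illustration.

```python
def split_duplicates(data):
    """Split rows into exact duplicates and first occurrences, keeping order."""
    seen = set()              # tuples of the rows already encountered
    duplicate_rows = []
    unique_rows = []
    for row in data:
        key = tuple(row)      # lists are not hashable, tuples are
        if key in seen:
            duplicate_rows.append(row)
        else:
            seen.add(key)
            unique_rows.append(row)
    return duplicate_rows, unique_rows


# Made-up rows: the first and third are identical.
rows = [['App A', '10'], ['App B', '20'], ['App A', '10'], ['App A', '15']]
dups, uniq = split_duplicates(rows)
print(len(dups), len(uniq))
```

The return order matches `duplicate` (duplicates first, unique rows second), so the sketch could be swapped in without changing the calling code.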


> * The Google Play data set (data_goo) has duplicate entries. We create the data set **unique_goo** without the exactly duplicated rows.

> * [This discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/82616) describes another type of duplicate entry.
> * Some rows for the same app differ only at index 3 (the number of reviews), which means the data was collected at different times. When removing these duplicates, we'll keep the row with the highest number of reviews (the higher the number of reviews, the more reliable the ratings).

> * We'll create a dictionary **reviews_max**, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

In [7]:
reviews_max = {}
for item in unique_goo:
    name = item[0]
    n_reviews = float(item[3])
    # keep the highest number of reviews seen so far for each app name
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

print('rows in dictionary reviews_max --', len(reviews_max))

rows in dictionary reviews_max -- 9660


> * The dictionary has 9660 entries, so we expect 9660 unique apps.
> * We'll use the dictionary to remove the duplicates and create a new data set **goo_clean** with only one entry per app, keeping the entry with the highest number of reviews.

In [8]:
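goo_clean = []
goo_added = []   # names of the apps already added

for item in unique_goo:
    name = item[0]
    # name not in goo_added guards against adding an app twice when several
    # of its rows share the maximum number of reviews
    if float(item[3]) == reviews_max[name] and name not in goo_added:
        goo_clean.append(item)
        goo_added.append(name)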

print(len(goo_clean))          

9660
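The two steps above (build the dictionary of maximum review counts, then filter the rows) can also be folded into a single pass that stores the best row per app name. This is a sketch of an alternative, not the notebook's code; `keep_most_reviewed` is a hypothetical name and the sample rows are made up.

```python
def keep_most_reviewed(data, name_col=0, reviews_col=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}   # app name -> best row seen so far
    for row in data:
        name = row[name_col]
        if (name not in best
                or float(row[reviews_col]) > float(best[name][reviews_col])):
            best[name] = row
    return list(best.values())


# Made-up rows in the Google Play layout: name, category, rating, reviews.
rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.0', '120'],
    ['App A', 'TOOLS', '4.1', '120'],   # tie on reviews: the earlier row wins
]
print(keep_most_reviewed(rows))
```

One difference to be aware of: the result is ordered by the first appearance of each app name, whereas the two-step version orders apps by the position of the row it keeps.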


> Let's explore the new data set **goo_clean** and confirm that the number of rows is 9,660.

In [9]:
print('### Unique Android apps ###')
print_info_dataset(goo_clean, header_goo)

### Unique Android apps ###
Header - Names of columns  
 
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

 First 2 rows of dataset

 ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

 ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']

 9660 - Number of rows - total data rows, excluding the header row
13 - Number of columns - total data columns


### Removing Non-English Apps

> We'd like to analyze only the apps that are directed toward an English-speaking audience

> **English name of apps** — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /)

> The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII standard. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If an app name contains a character whose number is greater than 127, then it probably means that the app has a non-English name.

> However, emojis and characters like ™ also fall outside the ASCII range and have corresponding numbers over 127.

> If we removed every app whose name contains such a character, we'd lose useful data, since many English apps would be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

> This means all English apps with up to three emoji or other special characters will still be labeled as English. *Our filter function is still not perfect, but it should be fairly effective*.

In [10]:
def eng(data, col_num):  # col_num - index of the column with the app name
    eng_data = []
    for item in data:
        name = item[col_num]
        count = 0           # number of characters outside the ASCII range
        for letter in name:
            if ord(letter) > 127:
                count = count + 1
        if count <= 3:      # keep apps with at most three non-ASCII characters
            eng_data.append(item)
    return eng_data

print('### Unique Android apps with English names ###')
goo_clean_eng = eng(goo_clean, 0)
print('Unique Android apps with English names- ', len(goo_clean_eng), 'from', len(goo_clean))

print('\n')

print('### Unique Apple IOS apps with English names ###')
ios_clean_eng = eng(data_ios,2)
print('Unique Apple IOS apps with English names- ', len(ios_clean_eng), 'from', len(data_ios))

### Unique Android apps with English names ###
Unique Android apps with English names-  9615 from 9660


### Unique Apple IOS apps with English names ###
Unique Apple IOS apps with English names-  6183 from 7197


> We have 9615 Android apps and 6183 iOS apps.
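To make the three-character threshold concrete, here is a small sketch that applies the same rule to individual names. The helpers `non_ascii_count` and `looks_english` are hypothetical names, and the sample app names are chosen for illustration only.

```python
def non_ascii_count(name):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for ch in name if ord(ch) > 127)


def looks_english(name, max_non_ascii=3):
    """Apply the notebook's rule: allow at most three non-ASCII characters."""
    return non_ascii_count(name) <= max_non_ascii


# Illustrative names: plain ASCII, a trademark sign, an emoji, a Chinese title.
samples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in samples:
    print(looks_english(name), non_ascii_count(name), name)
```

The first three names pass the filter because each has at most one character above 127, while the last one is rejected. A name written in another Latin-alphabet language without accents would still pass, which is one reason the filter is only approximate.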

### Isolating the Free Apps

In [11]:
free_goo_apps = []    
for item in goo_clean_eng:
    if item[7] == '0':
        free_goo_apps.append(item)

free_ios_apps = []    
for item in ios_clean_eng:
    if float(item[5]) == 0:
        free_ios_apps.append(item)


print('Free Applications')
print('\n')

print('Free Android Applications', len(free_goo_apps))

print('Free IOS Applications', len(free_ios_apps))
       

Free Applications


Free Android Applications 8865
Free IOS Applications 3222
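The two loops test the price in different ways: a string comparison with `'0'` for Google Play and a numeric comparison for the App Store. A single helper could cover both. The sketch below assumes that paid Google Play prices may be written with a leading `$` (such as `'$4.99'`); that format is an assumption, since only the value `'0'` is visible in the output above, and `is_free` is a hypothetical name.

```python
def is_free(price):
    """Return True when a price string represents zero.

    Assumption: a price may carry a leading '$'; anything that cannot be
    parsed as a number is treated as not free.
    """
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0
    except ValueError:
        return False


for value in ['0', '0.0', '3.99', '$4.99', 'Everyone']:
    print(repr(value), is_free(value))
```

Treating unparseable values as not free is a conservative choice: a malformed row is dropped from the free-app sample instead of raising an error in the middle of the loop.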


## Most Common Apps by Genre

> Our aim is to determine the kinds of apps that are likely to attract more users. Because our end goal is to publish the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets.


> The function **frequency** generates **frequency tables** that show percentages in descending order for any column (*col_num*) of a data set.

In [12]:
# Frequency table for the values of any column (col_num),
# as percentages sorted in descending order.
# col_num = -5 by default - prime_genre in the Apple IOS data set.

def frequency(data, col_num = -5):
    counts = {}
    for item in data:
        value = item[col_num]
        if value in counts:
            counts[value] += 1
        else:
            counts[value] = 1
    
    # convert the counts to percentages of all rows
    percentages = {}
    for key in counts:
        percentages[key] = round((counts[key] / len(data)) * 100, 3)
        
    # list of (value, percentage) tuples, highest percentage first
    return sorted(percentages.items(), key=lambda x: x[1], reverse=True)
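For comparison, the same kind of table can be built with `collections.Counter` from the standard library, whose `most_common()` method already returns the values sorted by count. This is a sketch of an alternative and not the notebook's function; `frequency_table` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter


def frequency_table(data, col_num):
    """Return (value, percentage) pairs sorted by percentage, highest first."""
    counts = Counter(row[col_num] for row in data)
    total = len(data)
    return [(value, round(count / total * 100, 3))
            for value, count in counts.most_common()]


# Made-up rows: app name and genre.
rows = [['a', 'Games'], ['b', 'Music'], ['c', 'Games'], ['d', 'Weather']]
print(frequency_table(rows, 1))
```

Values with equal counts keep the order in which they were first encountered, so the output is deterministic for a given input order.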



> Let's begin the analysis by getting a sense of the most common genres in each market.

* First, we build a frequency table for the **prime_genre** column of the App Store data set.


In [13]:
import pprint  # pretty-printer for lists and dictionaries

genre_ios_frequency = frequency(free_ios_apps)
print('\n', 'IOS Apps by prime_genre column ')
pprint.pprint(genre_ios_frequency)



 IOS Apps by prime_genre column 
[('Games', 58.163),
 ('Entertainment', 7.883),
 ('Photo & Video', 4.966),
 ('Education', 3.662),
 ('Social Networking', 3.29),
 ('Shopping', 2.607),
 ('Utilities', 2.514),
 ('Sports', 2.142),
 ('Music', 2.048),
 ('Health & Fitness', 2.017),
 ('Productivity', 1.738),
 ('Lifestyle', 1.583),
 ('News', 1.335),
 ('Travel', 1.241),
 ('Finance', 1.117),
 ('Weather', 0.869),
 ('Food & Drink', 0.807),
 ('Reference', 0.559),
 ('Business', 0.528),
 ('Book', 0.435),
 ('Navigation', 0.186),
 ('Medical', 0.186),
 ('Catalogs', 0.124)]


Among the free English apps, **more than half (58.16%) are games**.
Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education.

The App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users.

* Next, we build a frequency table for the **Category** column of the Google Play data set.

In [14]:
category_goo_frequency = frequency(free_goo_apps, 1)
print('\n', 'Android Apps by category column ')
pprint.pprint(category_goo_frequency)


 Android Apps by category column 
[('FAMILY', 18.906),
 ('GAME', 9.724),
 ('TOOLS', 8.46),
 ('BUSINESS', 4.591),
 ('LIFESTYLE', 3.914),
 ('PRODUCTIVITY', 3.892),
 ('FINANCE', 3.7),
 ('MEDICAL', 3.531),
 ('SPORTS', 3.395),
 ('PERSONALIZATION', 3.316),
 ('COMMUNICATION', 3.237),
 ('HEALTH_AND_FITNESS', 3.08),
 ('PHOTOGRAPHY', 2.944),
 ('NEWS_AND_MAGAZINES', 2.798),
 ('SOCIAL', 2.662),
 ('TRAVEL_AND_LOCAL', 2.335),
 ('SHOPPING', 2.245),
 ('BOOKS_AND_REFERENCE', 2.143),
 ('DATING', 1.861),
 ('VIDEO_PLAYERS', 1.794),
 ('MAPS_AND_NAVIGATION', 1.399),
 ('FOOD_AND_DRINK', 1.241),
 ('EDUCATION', 1.162),
 ('ENTERTAINMENT', 0.959),
 ('LIBRARIES_AND_DEMO', 0.936),
 ('AUTO_AND_VEHICLES', 0.925),
 ('HOUSE_AND_HOME', 0.823),
 ('WEATHER', 0.801),
 ('EVENTS', 0.711),
 ('PARENTING', 0.654),
 ('ART_AND_DESIGN', 0.643),
 ('COMICS', 0.62),
 ('BEAUTY', 0.598)]


The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes.

However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) consists mostly of games for kids.

Even so, practical apps seem to have a better representation on Google Play. This picture is also confirmed by the frequency table we see for the **Genres** column:

In [15]:
genre_goo_frequency = frequency(free_goo_apps, -4)
print('\n', 'Android Apps by genres column')
pprint.pprint(genre_goo_frequency)


 Android Apps by genres column
[('Tools', 8.449),
 ('Entertainment', 6.069),
 ('Education', 5.347),
 ('Business', 4.591),
 ('Lifestyle', 3.892),
 ('Productivity', 3.892),
 ('Finance', 3.7),
 ('Medical', 3.531),
 ('Sports', 3.463),
 ('Personalization', 3.316),
 ('Communication', 3.237),
 ('Action', 3.102),
 ('Health & Fitness', 3.08),
 ('Photography', 2.944),
 ('News & Magazines', 2.798),
 ('Social', 2.662),
 ('Travel & Local', 2.324),
 ('Shopping', 2.245),
 ('Books & Reference', 2.143),
 ('Simulation', 2.042),
 ('Dating', 1.861),
 ('Arcade', 1.85),
 ('Video Players & Editors', 1.771),
 ('Casual', 1.76),
 ('Maps & Navigation', 1.399),
 ('Food & Drink', 1.241),
 ('Puzzle', 1.128),
 ('Racing', 0.993),
 ('Libraries & Demo', 0.936),
 ('Role Playing', 0.936),
 ('Auto & Vehicles', 0.925),
 ('Strategy', 0.914),
 ('House & Home', 0.823),
 ('Weather', 0.801),
 ('Events', 0.711),
 ('Adventure', 0.677),
 ('Comics', 0.609),
 ('Art & Design', 0.598),
 ('Beauty', 0.598),
 ('Parenting', 0.496),
 ('Ca


The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so *we'll only work with the Category column moving forward*.

So, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. 

**Now we'd like to get an idea about the kind of apps that have a large number of users**.

## Most Popular Apps by Genre on the App Store

> One way to find out what genres are the most popular is to calculate the average number of installs for each app genre. 

> For the Google Play data set, we can find this information in the **Installs** column (index 5).

> This information is missing for the App Store data set. As a workaround, we'll use the total number of user ratings, the **rating_count_tot** column (index 6), as a proxy.

In [16]:
# average number of installs for each genre
# (rating_count_tot as a proxy for Apple apps, Installs for Android)
# data - data set
# genre_frequency - frequency table whose keys are the genres to process
# col_num - index of the column to average (6 by default, rating_count_tot for Apple apps)
# g_num - index of the column with the genre (-5 by default for Apple apps)

def raiting(data, genre_frequency, col_num = 6, g_num = -5 ):
    
    genre = dict(genre_frequency)
    genre_rate = []
    for g in genre: 
        total = 0
        len_rate = 0    # number of apps in genre g
        for item in data:
            if g == item[g_num]:
                total = total + float(item[col_num])
                len_rate = len_rate + 1
        rating = total / len_rate   # average per app in genre g
        genre_rate.append([ g, round(rating,2), len_rate, total ])

    # sort by the average, highest first
    return sorted(genre_rate, key=lambda x: x[1], reverse=True)
        
rate = raiting(free_ios_apps, genre_ios_frequency)
print('Apps genre--', 'rating--', 'number of apps--', 'total rating')
pprint.pprint(rate)

Apps genre-- rating-- number of apps-- total rating
[['Navigation', 86090.33, 6, 516542.0],
 ['Reference', 74942.11, 18, 1348958.0],
 ['Social Networking', 71548.35, 106, 7584125.0],
 ['Music', 57326.53, 66, 3783551.0],
 ['Weather', 52279.89, 28, 1463837.0],
 ['Book', 39758.5, 14, 556619.0],
 ['Food & Drink', 33333.92, 26, 866682.0],
 ['Finance', 31467.94, 36, 1132846.0],
 ['Photo & Video', 28441.54, 160, 4550647.0],
 ['Travel', 28243.8, 40, 1129752.0],
 ['Shopping', 26919.69, 84, 2261254.0],
 ['Health & Fitness', 23298.02, 65, 1514371.0],
 ['Sports', 23008.9, 69, 1587614.0],
 ['Games', 22788.67, 1874, 42705967.0],
 ['News', 21248.02, 43, 913665.0],
 ['Productivity', 21028.41, 56, 1177591.0],
 ['Utilities', 18684.46, 81, 1513441.0],
 ['Lifestyle', 16485.76, 51, 840774.0],
 ['Entertainment', 14029.83, 254, 3563577.0],
 ['Business', 7491.12, 17, 127349.0],
 ['Education', 7003.98, 118, 826470.0],
 ['Catalogs', 4004.0, 4, 16016.0],
 ['Medical', 612.0, 6, 3672.0]]
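The `raiting` function scans the whole data set once per genre. As a sketch of an alternative (not the notebook's code), the totals and counts for all genres can be accumulated in a single pass with `collections.defaultdict`. The name `average_by_genre` is hypothetical and the sample rows are made up.

```python
from collections import defaultdict


def average_by_genre(data, value_col, genre_col):
    """Return [genre, average, number of apps, total] rows, highest average first."""
    totals = defaultdict(float)   # genre -> sum of the values
    counts = defaultdict(int)     # genre -> number of apps
    for row in data:
        genre = row[genre_col]
        totals[genre] += float(row[value_col])
        counts[genre] += 1
    result = [[g, round(totals[g] / counts[g], 2), counts[g], totals[g]]
              for g in totals]
    return sorted(result, key=lambda r: r[1], reverse=True)


# Made-up rows: app name, number of ratings, genre.
rows = [['a', '10', 'Games'], ['b', '30', 'Games'], ['c', '5', 'Music']]
print(average_by_genre(rows, 1, 2))
```

Because only genres that actually occur in the data are ever added to the dictionaries, this version cannot divide by zero for a genre with no apps.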


> **Navigation** apps have the highest average number of user ratings, but there are only 6 apps in this genre and the average is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings.

> **Reference** is in second place with 18 apps and 1,348,958 ratings in total, most of which (985,920) belong to the Bible app.

> For **Social Networking** apps, the average is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc.

In [17]:
# Function to print data by Category in convenient way
def print_top10_category(data, category, col_name, col_num, col_category):
    print('\n', category)
    res_category = []
    for item in data:
        if item[col_category] == category:
            res_category.append([item[col_name], float(item[col_num])])
    res_category = sorted(res_category, key=lambda num: num[1], reverse=True)
    pprint.pprint(res_category[:10]) 

print_top10_category(free_ios_apps, 'Navigation', 2, 6, -5)
print_top10_category(free_ios_apps, 'Reference', 2, 6, -5)
print_top10_category(free_ios_apps, 'Social Networking', 2, 6, -5)


 Navigation
[['Waze - GPS Navigation, Maps & Real-time Traffic', 345046.0],
 ['Google Maps - Navigation & Transit', 154911.0],
 ['Geocaching®', 12811.0],
 ['CoPilot GPS – Car Navigation & Offline Maps', 3582.0],
 ['ImmobilienScout24: Real Estate Search in Germany', 187.0],
 ['Railway Route Search', 5.0]]

 Reference
[['Bible', 985920.0],
 ['Dictionary.com Dictionary & Thesaurus', 200047.0],
 ['Dictionary.com Dictionary & Thesaurus for iPad', 54175.0],
 ['Google Translate', 26786.0],
 ['Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran', 18418.0],
 ['New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition',
  17588.0],
 ['Merriam-Webster Dictionary', 16849.0],
 ['Night Sky', 12122.0],
 ['City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition '
  '(MCPE)',
  8535.0],
 ['LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods '
  'Installer Tools',
  4693.0]]

 Social Networking
[['Facebook', 2974676.0],
 ['Pinterest', 1061624.0],
 ['Skyp

> Our aim is to find popular genres, but Navigation, Social Networking, or Reference apps might seem more popular than they really are. The average number of ratings seems to be skewed by a very small number of apps that have hundreds of thousands of user ratings.
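One way to see the skew is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses the six rating counts listed for the Navigation genre in the output above; using the median here is an illustration and not a step the notebook performs.

```python
from statistics import mean, median

# Rating counts of the six free Navigation apps, copied from the output above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(round(mean(navigation_ratings), 2))   # 86090.33, the genre average reported above
print(median(navigation_ratings))           # 8196.5
```

The mean is roughly ten times the median, which shows how much Waze and Google Maps pull the Navigation average upward.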

> Other genres that seem popular include weather, book, food and drink, or finance. 

* **Food and drink**: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc.
* **Weather** apps: people generally don't spend much time in-app, so the chances of making a profit from in-app ads are low.
* **Finance** apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge.

> The market might be a bit saturated with for-fun apps, which means a practical app might have a better chance to stand out among the huge number of apps on the App Store. One candidate is **Health & Fitness**:

* genre-- rating-- number of apps-- total rating
* ['Health & Fitness', 23298.02, 65, 1514371.0]


In [18]:
print_top10_category(free_ios_apps, 'Health & Fitness', 2, 6, -5)


 Health & Fitness
[['Calorie Counter & Diet Tracker by MyFitnessPal', 507706.0],
 ['Lose It! – Weight Loss Program and Calorie Counter', 373835.0],
 ['Weight Watchers', 136833.0],
 ['Sleep Cycle alarm clock', 104539.0],
 ['Fitbit', 90496.0],
 ['Period Tracker Lite', 53620.0],
 ['Nike+ Training Club - Workouts & Fitness Plans', 33969.0],
 ['Plant Nanny - Water Reminder with Cute Plants', 27421.0],
 ['Sworkit - Custom Workouts for Exercise & Fitness', 16819.0],
 ['Clue Period Tracker: Period & Ovulation Tracker', 13436.0]]


## Most Popular Apps by Genre on Google Play

> For the Google Play market, we have data about the number of installs. But we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.).

> We don't need very precise data for our purposes — we only want to get an idea which app genres attract the most users. We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

> To perform computations, we'll need to convert each number from a string to a float, so we need to **remove the commas and the plus** characters.
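The conversion described above can be sketched as a standalone helper that leaves the original rows untouched. `parse_installs` is a hypothetical name and not part of the notebook.

```python
def parse_installs(text):
    """Convert an install count such as '1,000,000+' to a float lower bound."""
    return float(text.replace(',', '').replace('+', ''))


for value in ['100+', '1,000+', '1,000,000+', '0']:
    print(value, '->', parse_installs(value))
```

The result is a lower bound: an app listed as `1,000,000+` is counted as having exactly 1,000,000 installs, as explained above.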

In [19]:
for item in free_goo_apps:
    item[5] = item[5].replace(',', '')
    item[5] = item[5].replace('+', '')
    
rate_goo = raiting(free_goo_apps, category_goo_frequency, 5, 1)
print('Apps genre--', 'rating--', 'number of apps--', 'total rating')
pprint.pprint(rate_goo)

Apps genre-- rating-- number of apps-- total rating
[['COMMUNICATION', 38456119.17, 287, 11036906201.0],
 ['VIDEO_PLAYERS', 24727872.45, 159, 3931731720.0],
 ['SOCIAL', 23253652.13, 236, 5487861902.0],
 ['PHOTOGRAPHY', 17840110.4, 261, 4656268815.0],
 ['PRODUCTIVITY', 16787331.34, 345, 5791629314.0],
 ['GAME', 15588015.6, 862, 13436869450.0],
 ['TRAVEL_AND_LOCAL', 13984077.71, 207, 2894704086.0],
 ['ENTERTAINMENT', 11640705.88, 85, 989460000.0],
 ['TOOLS', 10801391.3, 750, 8101043474.0],
 ['NEWS_AND_MAGAZINES', 9549178.47, 248, 2368196260.0],
 ['BOOKS_AND_REFERENCE', 8767811.89, 190, 1665884260.0],
 ['SHOPPING', 7036877.31, 199, 1400338585.0],
 ['PERSONALIZATION', 5201482.61, 294, 1529235888.0],
 ['WEATHER', 5074486.2, 71, 360288520.0],
 ['HEALTH_AND_FITNESS', 4188821.99, 273, 1143548402.0],
 ['MAPS_AND_NAVIGATION', 4056941.77, 124, 503060780.0],
 ['FAMILY', 3695641.82, 1676, 6193895690.0],
 ['SPORTS', 3638640.14, 301, 1095230683.0],
 ['ART_AND_DESIGN', 1986335.09, 57, 113221100.0],


In [20]:
print_top10_category(free_goo_apps, 'COMMUNICATION', 0, 5, 1)
print_top10_category(free_goo_apps, 'SOCIAL', 0, 5, 1)
print_top10_category(free_goo_apps, 'TRAVEL_AND_LOCAL', 0, 5, 1)
print_top10_category(free_goo_apps, 'SHOPPING', 0, 5, 1)
print_top10_category(free_goo_apps, 'PRODUCTIVITY', 0, 5, 1)


 COMMUNICATION
[['WhatsApp Messenger', 1000000000.0],
 ['Messenger – Text and Video Chat for Free', 1000000000.0],
 ['Skype - free IM & video calls', 1000000000.0],
 ['Google Chrome: Fast & Secure', 1000000000.0],
 ['Gmail', 1000000000.0],
 ['Hangouts', 1000000000.0],
 ['Google Duo - High Quality Video Calls', 500000000.0],
 ['imo free video calls and chat', 500000000.0],
 ['LINE: Free Calls & Messages', 500000000.0],
 ['UC Browser - Fast Download Private & Secure', 500000000.0]]

 SOCIAL
[['Facebook', 1000000000.0],
 ['Google+', 1000000000.0],
 ['Instagram', 1000000000.0],
 ['Facebook Lite', 500000000.0],
 ['Snapchat', 500000000.0],
 ['Tumblr', 100000000.0],
 ['Pinterest', 100000000.0],
 ['Badoo - Free Chat & Dating App', 100000000.0],
 ['Tango - Live Video Broadcast', 100000000.0],
 ['LinkedIn', 100000000.0]]

 TRAVEL_AND_LOCAL
[['Maps - Navigate & Explore', 1000000000.0],
 ['Google Street View', 1000000000.0],
 ['Booking.com Travel Deals', 100000000.0],
 ['TripAdvisor Hotels Flight

> **Communication** apps have the most installs on average: 38,456,119. This number is heavily skewed by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts).

> If we remove all the communication apps that have 100 million or more installs, the average drops roughly ten times:


In [21]:
goo_communication_100 = []
for item in free_goo_apps:
    if item[1] == 'COMMUNICATION' and float(item[5]) < 100000000:
        goo_communication_100.append(item)

category_goo_frequency_communication =[]
for item in category_goo_frequency:
    if 'COMMUNICATION' in item:
        category_goo_frequency_communication.append(item)

rate_goo_comm = raiting(goo_communication_100, category_goo_frequency_communication, 5, 1)
print('   Apps genre---  ', 'rating--  ', 'number of apps--', 'total rating')
pprint.pprint(rate_goo_comm)

   Apps genre---   rating--   number of apps-- total rating
[['COMMUNICATION', 3603485.39, 260, 936906201.0]]
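The "roughly ten times" claim can be checked with a little arithmetic on the two COMMUNICATION rows printed above (all apps versus apps below 100 million installs). The variable names in this sketch are made up; the figures are copied from the notebook output.

```python
# Figures copied from the two COMMUNICATION rows in the output above.
avg_all, apps_all, total_all = 38456119.17, 287, 11036906201
avg_capped, apps_capped, total_capped = 3603485.39, 260, 936906201

ratio = avg_all / avg_capped                     # how many times the average drops
removed_apps = apps_all - apps_capped            # apps with 100 million or more installs
removed_installs = total_all - total_capped      # installs contributed by those apps
share = removed_installs / total_all             # their share of all installs

print(round(ratio, 1))        # 10.7
print(removed_apps)           # 27
print(removed_installs)       # 10100000000
print(round(share * 100, 1))  # 91.5
```

In other words, 27 of the 287 apps account for about 91.5% of all installs in the category, which is why the average falls so sharply once they are excluded.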



> The same situation applies to the categories SOCIAL, TRAVEL_AND_LOCAL, and SHOPPING.

> Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

> The GAME genre seems pretty popular, but previously we found out this part of the market seems a bit saturated (862 apps on Google Play and 1874 apps on the App Store).

> The **HEALTH_AND_FITNESS** genre looks fairly popular as well, with an average of 4,188,821 installs (273 apps in total on Google Play).

> Let's explore this genre in more depth, since we found it also has some potential to work well on the App Store (65 apps with an average of 23,298.02 user ratings), and we are looking for a genre that shows potential for being profitable on both the App Store and Google Play.

> Let's take a look at the apps with over 50 million installs and their number of installs:

In [22]:
goo_health_50 = []
for item in free_goo_apps:
    if item[1] == 'HEALTH_AND_FITNESS' and float(item[5]) > 50000000:
        goo_health_50.append(item)

print_top10_category(goo_health_50, 'HEALTH_AND_FITNESS', 0, 5, 1)


 HEALTH_AND_FITNESS
[['Samsung Health', 500000000.0],
 ['Period Tracker - Period Calendar Ovulation Tracker', 100000000.0]]


> There are only two apps with more than 50 million installs. Since only a few apps are very popular, this market still shows potential.

> Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (more than 100,000 and fewer than 10,000,000 installs):

In [25]:
goo_health = []
print('HEALTH_AND_FITNESS')
for item in free_goo_apps:
    if item[1] == 'HEALTH_AND_FITNESS' and float(item[5]) < 10000000  and float(item[5]) > 100000:
        goo_health.append(item)
        print(item[0], ':', item[5])

HEALTH_AND_FITNESS
Step Counter - Calorie Counter : 500000
Lose Belly Fat in 30 Days - Flat Stomach : 5000000
Pedometer - Step Counter Free & Calorie Burner : 1000000
Sportractive GPS Running Cycling Distance Tracker : 1000000
Home Workout for Men - Bodybuilding : 1000000
Buttocks and Abdomen : 500000
Running & Jogging : 500000
Sleep Sounds : 1000000
Cycling - Bike Tracker : 500000
Calorie Counter - EasyFit free : 1000000
Aunjai i lert u : 500000
BetterMe: Weight Loss Workouts : 5000000
Bike Computer - GPS Cycling Tracker : 1000000
Running Distance Tracker + : 1000000
Walking: Pedometer diet : 1000000
Recipes for hair and face tried : 500000
Keep Trainer - Workout Trainer & Fitness Coach : 1000000
Couch to 10K Running Trainer : 500000
PumpUp — Fitness Community : 1000000
Home workouts - fat burning, abs, legs, arms,chest : 1000000
Running Weight Loss Walking Jogging Hiking FITAPP : 1000000
Fabulous: Motivate Me! Meditate, Relax, Sleep : 5000000
StrongLifts 5x5 Workout Gym Log & Persona

> This genre includes a variety of apps: calorie counters, exercise programs, diet diaries, timers, and recipes.

## Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that making apps in the HEALTH_AND_FITNESS category could be profitable for both the Google Play and the App Store markets. The markets are already full of such apps, so we need to add some special features or combine several different functions in one application.