# Free App Profiles on Google Play and App Store
My goal in this project is to analyze free apps that earn their revenue from in-app advertisements. The analysis uses app data from both the Google Play and Apple App Store markets.

Revenue for a free app is driven largely by its number of active users. I want to determine which types of apps attract the most users, to help developers understand which app categories are likely to be profitable.

## The Data
As of 2018 there were 2 million iOS apps on the App Store and 2.1 million on the Google Play Store. To avoid the time and cost of collecting data on 4 million+ apps, we will use sample data from 2 separate data sets:

* A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play.
* A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store.

I will start by opening the 2 data sets to prepare for exploration:

In [1]:
from csv import reader

# Google Play
# A with block closes the file automatically; the explicit encoding
# avoids decode errors on titles that contain non-ASCII characters.
with open('../input/google-play-store-apps/googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

# App Store
with open('../input/app-store-apple-data-set-10k-apps/AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

I'll write a function, ```explore_data()```, to explore the 2 data sets. It will be reused throughout the analysis.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

The Google Play data set has 10,841 rows and 13 columns. Browsing the column names, the ones that appear most useful for our analysis are ```App```, ```Category```, ```Reviews```, ```Installs```, ```Type```, ```Price```, and ```Genres```.

Let's examine the App Store data set:

In [4]:
del(ios_header[0])
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

There are 7,197 iOS apps in this data set. The column names that seem useful for analysis are ```track_name```, ```currency```, ```price```, ```rating_count_tot```, ```rating_count_ver```, and ```prime_genre```.

## Data Cleaning

The data set for the Google Play apps has a discussion section. A user posted that the 'Category' entry is missing on row 10,472. We will print that row below, along with the header, to see if the user is correct.

In [5]:
print(android[10472])
print(android_header)

There is no category for row 10,472. We will delete this row and then print the length of the data set to confirm that the row is gone. This cell should only be run once, because running it again would delete a different, valid row.

In [6]:
del android[10472]
print(len(android))
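
Rather than relying on a forum post, a row like this can also be found programmatically. The sketch below is only an illustration on made-up rows: the helper name `find_malformed_rows` and the sample data are assumptions, not part of the original notebook. It flags any row whose length differs from the header's, which is what happens when a field such as 'Category' is missing.

```python
# Hypothetical toy data standing in for the real header and rows.
header = ['App', 'Category', 'Rating']
rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Broken Row App', '1.9'],                  # 'Category' is missing, so the row is short
    ['Coloring Book', 'ART_AND_DESIGN', '3.9'],
]

def find_malformed_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header)]

bad_indices = find_malformed_rows(rows, header)
print('Malformed rows:', bad_indices)

# Delete from the highest index down so the remaining indices stay valid.
for i in sorted(bad_indices, reverse=True):
    del rows[i]
print('Rows left:', len(rows))
```

Because the deletion is driven by the check itself, rerunning this kind of cell does not remove additional valid rows.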

In the iOS data set, the first column is the index for the row. This is not useful to our analysis. The matching header entry was already removed above, so I will now remove the index column from each row:

In [7]:
for app in ios:
    del(app[0])
explore_data(ios, 0, 3, True)

## Duplicate Entries

Some apps in the Google Play data set have more than one entry. For example, if we search for Instagram there are 4 rows for a single app:

In [8]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

There are actually 1,181 duplicate apps in the data set:

In [9]:
duplicates = []
unique = []

for app in android:
    name = app[0]
    if name in unique:
        duplicates.append(name)
    else:
        unique.append(name)
print('Number of duplicates: ', len(duplicates))
print('\n')
print('Examples of duplicates: ', duplicates[:10])
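
The same count can be cross-checked with ```collections.Counter``` from the standard library. This is a minimal sketch on hypothetical rows (the `sample` list is invented for illustration); on the real data you would pass `android` instead.

```python
from collections import Counter

# Hypothetical rows; only the app name (index 0) matters here.
sample = [['Instagram'], ['Slack'], ['Instagram'], ['Instagram'], ['Slack'], ['Maps']]

name_counts = Counter(row[0] for row in sample)

# A name that appears n times contributes n - 1 duplicate rows.
n_duplicates = sum(count - 1 for count in name_counts.values())
duplicated_names = [name for name, count in name_counts.items() if count > 1]

print('Number of duplicates:', n_duplicates)
print('Duplicated names:', duplicated_names)
```

A `Counter` does its lookups in a dictionary, so it avoids the repeated list scans that `name in unique` performs on a list.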

Examining the duplicate entries for Instagram above shows that the main difference between them is in the 4th column, the number of reviews. Rather than removing entries at random, we'll keep the row with the highest number of reviews, since a higher review count suggests the row was collected more recently and should therefore be more reliable.

To achieve this we will create a dictionary consisting of the app name and the max reviews of that app.

In [10]:
max_reviews = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in max_reviews and max_reviews[name] < n_reviews:
        max_reviews[name] = n_reviews
    elif name not in max_reviews:
        max_reviews[name] = n_reviews

Previously we found that there are 1,181 duplicate rows. The length of the ```max_reviews``` dictionary should therefore equal the length of the data set minus 1,181.

In [11]:
print('Expected length: ', len(android) - 1181)
print('Actual length: ', len(max_reviews))

We can now use the ```max_reviews``` dictionary to remove the duplicate rows from the data set, keeping only the entry with the highest number of reviews for each app:
* Create 2 lists, ```android_clean``` and ```already_added```
* Loop through the app rows
* For each row, assign the name of the app to ```name```
* Convert the number of reviews to a float and assign it to ```n_reviews```
* If ```n_reviews``` equals the value stored in ```max_reviews``` for that app and ```name``` is not already in the ```already_added``` list, append the row to ```android_clean```
* Append the app name to ```already_added``` so that a second row with the same maximum review count is not added again

In [12]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == max_reviews[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
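
To see why the ```already_added``` check matters, here is a small worked example on hypothetical rows in which two entries tie for the maximum review count. The row layout and the names `sample`, `max_reviews_sample`, `clean`, and `added` are assumptions made for illustration, and a set is used in place of a list for the membership test.

```python
# Hypothetical rows: [name, category, rating, reviews]
sample = [
    ['Chat App', 'SOCIAL', '4.5', '100'],
    ['Chat App', 'SOCIAL', '4.5', '250'],
    ['Chat App', 'SOCIAL', '4.5', '250'],   # ties the maximum
    ['Notes', 'PRODUCTIVITY', '4.0', '40'],
]

# Step 1: record the highest review count seen for each app name.
max_reviews_sample = {}
for row in sample:
    name, n_reviews = row[0], float(row[3])
    if name not in max_reviews_sample or max_reviews_sample[name] < n_reviews:
        max_reviews_sample[name] = n_reviews

# Step 2: keep the first row per app that matches the maximum.
clean = []
added = set()   # a set gives fast membership checks
for row in sample:
    name, n_reviews = row[0], float(row[3])
    if n_reviews == max_reviews_sample[name] and name not in added:
        clean.append(row)
        added.add(name)

print(len(clean))
```

Without the `name not in added` condition, both 250-review rows for 'Chat App' would be kept and the result would still contain a duplicate.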

We will quickly explore our ```android_clean``` list to confirm that the number of rows is 9,659 (the 10,840 rows left after the deletion above, minus 1,181 duplicates).

In [13]:
explore_data(android_clean, 0, 3, True)

## Removing Non-English Apps

We'd like to analyze only apps developed for English-speaking audiences, but both data sets contain apps aimed at other audiences. We will remove the apps whose titles are not in English. A simple heuristic is to check each character of the app name against the ASCII range: the characters commonly used in English text have code points from 0 to 127, so a character with a code point above 127 suggests a non-English title.

We can use the built-in Python function ```ord()``` to get the Unicode code point of a character. We will loop through each title and check the code point of each character.

In [14]:
def check_english(string):  # avoid shadowing the built-in str
    for char in string:
        if ord(char) > 127:
            return False
    return True

print(check_english('Instagram'))
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
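
For reference, the short sketch below prints the code points that ```ord()``` returns for a few sample characters, chosen here purely for illustration. Letters and punctuation used in English fall at or below 127, while the trademark sign, a Chinese character, and an emoji all fall above it.

```python
sample_chars = ['a', 'Z', '~', '™', '爱', '😜']

# Map each character to its Unicode code point.
code_points = {char: ord(char) for char in sample_chars}

for char, code_point in code_points.items():
    in_ascii_range = code_point <= 127
    print(repr(char), code_point, in_ascii_range)
```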
    

Some apps in the data set have English titles that still contain non-ASCII characters, such as emojis and symbols like ™. To minimize data loss, we will only remove an app from the data set if its name contains more than 3 non-ASCII characters:

In [15]:
def check_english(string):  # avoid shadowing the built-in str
    special_chars = 0
    for char in string:
        if ord(char) > 127:
            special_chars += 1
    # Tolerate up to 3 non-ASCII characters
    return special_chars <= 3

print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))
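
The function above tolerates up to 3 non-ASCII characters and rejects a title at the 4th. The sketch below makes that boundary explicit; the helper names `count_non_ascii` and `is_probably_english`, the `max_non_ascii` parameter, and the 'Fun' titles are hypothetical and only serve to illustrate the cutoff.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for char in text if ord(char) > 127)

def is_probably_english(text, max_non_ascii=3):
    """Heuristic: accept titles with at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

titles = [
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    'Fun 😜😜😜',       # exactly 3 non-ASCII characters: kept
    'Fun 😜😜😜😜',     # 4 non-ASCII characters: removed
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for title in titles:
    print(title, ':', count_non_ascii(title), is_probably_english(title))
```

This remains a heuristic: a short non-English title with 3 or fewer non-ASCII characters still passes, and an English title with many emojis is dropped.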

Below we will use the ```check_english()``` function to filter out non-English apps for both Google and iOS data sets:

In [16]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if check_english(name):
        android_english.append(app)
for app in ios:
    name = app[1]
    if check_english(name):
        ios_english.append(app)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

## Isolating Free Apps

Our goal with this analysis is to identify trends among free apps that have in-app ads. We need to remove any apps in the data sets that cost money to download:

In [17]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
for app in ios_english:
    price = app[4]
    if price == '0':
        ios_final.append(app)
      
explore_data(android_final, 0, 3, True)
print('\n')
explore_data(ios_final, 0, 3, True)

There are 8,864 Android apps and 3,222 iOS apps left to analyze. These are the cleaned data sets, consisting of apps that are non-duplicate, English-language, and free.
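
The comparison `price == '0'` depends on the exact string stored in the file. As a hedged alternative, the sketch below converts the price to a number before comparing. The helper `is_free` is hypothetical, and the assumption that prices may appear as '0', '0.0', or with a leading '$' is mine rather than something shown in the notebook output.

```python
def is_free(price):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0', '4.99' or '$4.99'.
    """
    cleaned = price.strip().replace('$', '')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable values (for example an empty string) are treated as not free.
        return False

for price in ['0', '0.0', '$4.99', '1.99', '']:
    print(repr(price), is_free(price))
```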

## Common Apps by Genre
My aim is to find the kinds of apps that attract the most users, because free apps with in-app advertisements depend on a large user base to be profitable. My validation strategy for an app idea will have 3 steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

We need to find apps that are successful in both Android and Apple markets. I'll begin the analysis by generating a frequency table for the ```prime_genre``` column in the iOS data and ```Genres``` and ```Category``` columns of the Google Play data.

I'll build 2 functions:
* One to generate frequency tables that show percentages
* One to display the tables in descending order

In [18]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    table_percentage = {}
    for key in table:
        percent = (table[key] / total) * 100
        table_percentage[key] = percent
    return table_percentage

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

display_table(ios_final, 11)  # prime_genre; display_table prints its own output and returns None
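
For comparison, the same kind of table can be built with ```collections.Counter```, whose `most_common()` method already returns entries sorted by count. This is a sketch under assumptions: `freq_table_counter` is a hypothetical name and `sample` is invented data, not the real data set.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Hypothetical rows: [name, genre]
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Music']]

table = freq_table_counter(sample, 1)
for value, percent in table:
    print(value, ':', percent)
```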

Of the free English apps in the iOS data, more than 58% are ```Games```. The next closest genre is ```Entertainment``` with close to 8%, followed by ```Photo & Video``` with almost 5%. 

The App Store is dominated by apps developed for fun, with fewer built for practical use. But the fact that these genres contain the most apps does not imply that they also have the most users.

We'll continue our analysis with the Google Play data set:

In [19]:
display_table(android_final, 1)  # Category

Practical applications appear to be better represented in the Google Play store. We can check this further by examining the frequency table for the ```Genres``` column:

In [20]:
display_table(android_final, 9)  # Genres

The ```Genres``` column is much more granular than the ```Category``` column. We don't need the granular genres for our analysis right now, so we will stick with ```Category``` going forward. The App Store is dominated by apps designed for fun and games, while the Google Play store has a blend of for-fun and practical apps. The next step in our analysis is to determine which types of apps have the most users.

## Most Popular Apps by Genre

The easiest way to determine the popularity of a category is to calculate its average number of installs per app. We can use the ```Installs``` column for the Google Play data, but unfortunately the iOS data set does not have a similar column. Instead, we can use the number of user ratings as a proxy. This number can be found in the ```rating_count_tot``` column.

Below I'll calculate the average number of user ratings per app for each genre on the App Store:

In [21]:
ios_genres = freq_table(ios_final, 11)

for genre in ios_genres:
    total = 0
    len_genre = 0
    
    for app in ios_final:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    avg_n_ratings = total / len_genre  # average number of ratings, not average rating score
    print(genre, ':', avg_n_ratings)
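
The nested loops above scan the whole data set once per genre. As an alternative sketch, the averages can be accumulated in a single pass with two dictionaries. The three-column row layout and the `sample` data here are hypothetical; on the real data the genre and rating count would come from indexes 11 and 5.

```python
from collections import defaultdict

# Hypothetical rows: [name, genre, rating_count]
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Music', '50'],
]

totals = defaultdict(float)   # sum of rating counts per genre
counts = defaultdict(int)     # number of apps per genre

for row in sample:
    genre = row[1]
    totals[genre] += float(row[2])
    counts[genre] += 1

averages = {genre: totals[genre] / counts[genre] for genre in totals}

# Print genres from highest to lowest average.
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```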

Navigation has the highest average number of ratings, but that figure is skewed by a few very large apps such as Google Maps and Waze:

In [22]:
for app in ios_final:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5])

Our goal is to find popular app genres, but some genres may seem more popular than they are. Many of them, such as Navigation, Social Networking, and Music, are dominated by a few highly influential apps (Facebook, Twitter, Spotify, etc.).

Reference apps have 74,942 ratings on average. Most of these come from 2 large apps, the Bible and Dictionary.com:
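
One way to see how much a single giant app distorts an average is to compare the mean with the median, which is far less sensitive to outliers. The numbers below are invented for illustration and are not taken from the data set.

```python
from statistics import mean, median

# Hypothetical rating counts for one genre: one giant app and several small ones.
rating_counts = [1_000_000, 5_000, 3_000, 1_200, 800]

genre_mean = mean(rating_counts)
genre_median = median(rating_counts)

print('Mean:  ', genre_mean)
print('Median:', genre_median)
```

Here the mean is 202,000 while the median is 3,000, so the typical app in this made-up genre is much smaller than the average suggests.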

In [23]:
for app in ios_final:
    if app[11] == 'Reference':
        print(app[1], ':', app[5])

In [32]:
for app in ios_final:
    if app[11] == 'Productivity':
        print(app[1], ':', app[5])

Let's review the number of installs by category on the Google Play Store. Note that the install numbers in this data set are not exact. They are reported in tiers such as 100,000+ or 1,000,000+, as shown below:

In [24]:
display_table(android_final, 5) #installs column

To perform calculations, we will need to transform the installs column for each row into a float data type. Doing so will require removing commas and the '+' characters. We'll calculate the install averages in the same for loop:

In [26]:
android_categories = freq_table(android_final, 1)

for category in android_categories:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
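
The string clean-up inside the loop can be isolated in a small helper, which makes it easy to check on its own. The name `parse_installs` is hypothetical. Note also that a tier such as '1,000,000+' is only a lower bound, so the parsed value (and any average built from it) understates the true number of installs.

```python
def parse_installs(installs):
    """Convert a tier label such as '1,000,000+' to a float (1000000.0)."""
    return float(installs.replace(',', '').replace('+', ''))

for label in ['0', '10+', '1,000+', '1,000,000+', '1,000,000,000+']:
    print(label, '->', parse_installs(label))
```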

The Books and Reference category is quite popular as well in the Google Play Store. If you recall, the Reference genre had a high number of reviews/installs in the iOS data set as well.

In [31]:
# Install tiers of one million or more
large_tiers = {'1,000,000+', '5,000,000+', '10,000,000+', '50,000,000+',
               '100,000,000+', '500,000,000+', '1,000,000,000+'}

for app in android_final:
    if app[1] == 'PRODUCTIVITY' and app[5] in large_tiers:
        print(app[0], ':', app[5])

## Conclusion
The productivity genre appears to be a promising market, as this category has high install numbers on Google Play and high rating counts (our proxy for installs) on the App Store.

The market is quite saturated, so an app would need distinguishing features to set it apart. Based on this analysis, some type of note-taking, calendar management, or file management application could be a promising candidate.