# Profitable App Profiles for the App Store and Google Play Markets
The aim is to find the app profiles that are profitable for the App Store and Google Play markets. 

Scenario:

We're working as data analysts for a company that builds Android and iOS mobile apps, and our task is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.
At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kind of apps are likely to attract more users.

## Data sets used:
* [Google Play store apps data](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data for approximately ten thousand Android apps from Google Play
* [iOS Apps Store apps data](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data for approximately seven thousand iOS apps from the App Store

In [None]:
from csv import reader

# The Google Play Store data set
# (explicit encoding avoids platform-dependent decoding errors; the file is closed automatically)
with open("datasets/googleplaystore.csv", encoding="utf8") as opened_file:
    android_apps_data = list(reader(opened_file))
android_header = android_apps_data[0]
android = android_apps_data[1:]

# The App Store data set
with open("datasets/AppleStore.csv", encoding="utf8") as opened_file:
    ios_apps_data = list(reader(opened_file))
ios_header = ios_apps_data[0]
ios = ios_apps_data[1:]

Next, we define a helper function, `explore_data()`, that we can use repeatedly to print rows in a more readable way. It also has an option to show the number of rows and columns of any data set.

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n")  # adds blank space between rows
        
    if rows_and_columns:
        print("Number of rows:", len(dataset))
        print("Number of columns:", len(dataset[0]))

In [None]:
print(android_header)
print("\n")
explore_data(android, 0, 5, True)

We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for our analysis are `'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price'` and `'Genres'`.

The description of these column names can be found in the Google Play data set [documentation](https://www.kaggle.com/lava18/google-play-store-apps/home).

Exploring the App Store data set:

In [None]:
print(ios_header)
print("\n")
explore_data(ios, 0, 5, True)

We have 7197 iOS apps in this data set, and the columns that seem interesting are `'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver'` and `'prime_genre'`.

The description of these column names can be found in the App Store data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).


## Deleting Wrong Data

In the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play data set, one [thread](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) reports an error in the row at index 10472.

Printing this row and comparing it to the header and another row that's correct:

In [None]:
print(android[10472])

In [None]:
print(android_header)

In [None]:
print(android[5])

Observation: The rating for the app at index 10472 is 19, which is not possible (the maximum rating is 5). We delete this row:

In [None]:
print(len(android))

In [None]:
del android[10472]

In [None]:
print(len(android))
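
The row above was located through the discussion forum. As a minimal sketch of a more general check, the function below flags any row whose column count differs from the header, which would catch a row like this one if its problem is a missing value that shifts the remaining columns. The name `find_malformed_rows` and the sample `header` and `rows` are hypothetical and are not taken from the data set.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose column count differs from the header."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


# Hypothetical sample data, for illustration only
header = ["App", "Category", "Rating"]
rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "1.9"],  # a missing column shifts the remaining values left
    ["App C", "GAME", "4.5"],
]

malformed = find_malformed_rows(header, rows)
print(malformed)
```

On the real data, the call would presumably be `find_malformed_rows(android_header, android)`.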

## Removing Duplicate Entries

Next, we check whether the Google Play data set has duplicate entries.

In [None]:
unique_apps = []
duplicate_apps = []
for row in android:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print("No. of duplicate Google Play apps:", len(duplicate_apps))
print("\n")
print("Examples of duplicate apps:\n", duplicate_apps[:15])
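
Checking membership in a list scans the whole list each time, which gets slow as the list grows. The sketch below shows the same duplicate count using a set for the membership test. The function name `count_duplicates` and the sample names are hypothetical.

```python
def count_duplicates(names):
    """Return every repeated occurrence of a name, in order of appearance."""
    seen = set()  # set membership checks take roughly constant time
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


# Hypothetical sample, for illustration only
sample_names = ["Instagram", "Slack", "Instagram", "Box", "Slack", "Instagram"]
dups = count_duplicates(sample_names)
print(len(dups), dups)
```

For the data set above, the input would be the app names, for example `[row[0] for row in android]`.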

Duplicate entries need to be removed, as we don't want to count any app more than once when we analyze the data. We could pick the entries to remove at random, but a better solution is to keep the entry with the highest number of reviews for each app and remove the rest, since a higher review count suggests more recent data.

We observed 1181 duplicate app entries in the Google Play data set. After removing the duplicates, the expected length of the `android` data set is:

In [None]:
print('Expected length of Google play data set after removing duplicate entries:', len(android) - 1181)

In [None]:
reviews_max = {}
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

The expected length of the `reviews_max` dictionary is the total number of entries minus the number of duplicate entries in the data set (1181, as computed previously).

In [None]:
print("Expected length of reviews_max = ", len(android) - 1181)
print("Actual length of reviews_max = ", len(reviews_max))

Now we need to remove the duplicates using the `reviews_max` dictionary.

Procedure:
* Create an empty list `android_clean` that will hold our data set after removing all the duplicates.
* Create a list `already_added` to keep track of the apps that we have already added to `android_clean`.
* Loop through the Google Play data set `android`, and for each row:
    * Check that the number of reviews for that row equals the maximum for that app in `reviews_max`, and that the app is not yet in `already_added`. We need the second check because an app can have several entries that share the maximum number of reviews; without it, we would still end up with duplicate entries for some apps.
    * If both conditions are satisfied, append the row to `android_clean` and the app name to `already_added`.

In [None]:
android_clean = []
already_added = []

In [None]:
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)

We explore `android_clean` and confirm that the number of entries is indeed 9659:

In [None]:
explore_data(android_clean, 0, 5, True)

We have 9659 rows, just as expected. 
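
As an alternative sketch, the two passes (building the dictionary of maximum reviews, then filtering) can be folded into a single pass that stores the best row seen so far for each app. The function name `dedupe_keep_max` and the sample rows are hypothetical. It should keep the same rows as the approach above, although the output order follows the first appearance of each app name rather than the position of the kept row.

```python
def dedupe_keep_max(rows, name_index, reviews_index):
    """Keep, for each name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict ">" means the earliest row wins when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Hypothetical sample rows: name, category, rating, reviews
sample_rows = [
    ["Box", "TOOLS", "4.0", "100"],
    ["Slack", "BUSINESS", "4.4", "50"],
    ["Box", "TOOLS", "4.0", "250"],
    ["Slack", "BUSINESS", "4.4", "50"],
]
deduped = dedupe_keep_max(sample_rows, 0, 3)
for row in deduped:
    print(row)
```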

## Removing non-English apps

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.

We use the built-in `ord` function to find the code point number of each character.
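
As a quick illustration of what `ord` returns, the sketch below prints the number for a few sample characters, along with whether it falls in the 0 to 127 range. The sample characters are chosen only for illustration.

```python
samples = ["a", "Z", "™", "爱", "😜"]

# Map each character to its code point number
code_points = {ch: ord(ch) for ch in samples}

for ch, number in code_points.items():
    # The last column shows whether the character is within the ASCII range
    print(repr(ch), number, number <= 127)
```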

In [None]:
def is_english(string):
    for ch in string:
        if ord(ch) > 127:
            return False
    return True

In [None]:
print(is_english("Instagram"))
print(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))

Some English app names contain characters such as ™, — (em dash), and – (en dash), which fall outside the ASCII range. If we use the function above as it is, these apps would be misjudged as non-English.
To minimize data loss, we'll only remove an app if its name has more than three non-ASCII characters.

In [None]:
def is_english(string):
    non_ascii = 0
    for ch in string:
        if ord(ch) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    return True

In [None]:
print(is_english("Docs To Go™ Free Office Suite"))
print(is_english('Instachat 😜'))

The function is still not perfect, and a few non-English apps might get past our filter, but this seems good enough at this point in our analysis; we shouldn't spend too much time on optimization yet.

Below, we use the `is_english()` function to filter out the non-English apps for both data sets:

In [None]:
android_english = []
ios_english = []

for row in android_clean:
    name = row[0]
    if is_english(name):
        android_english.append(row)
        
for row in ios:
    name = row[1]
    if is_english(name):
        ios_english.append(row)

In [None]:
explore_data(android_english, 0, 3, True)
print("\n")
explore_data(ios_english, 0, 3, True)

We observe that we are left with 9614 Android apps and 6183 iOS apps.

## Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [None]:
android_final = [row for row in android_english if row[7] == "0"]

In [None]:
ios_final = [row for row in ios_english if row[4] == "0.0"]

In [None]:
print(len(android_final))
print(len(ios_final))

Thus, we are left with 8864 Android apps and 3222 iOS apps for analysis.
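
The two filters above depend on the exact strings `"0"` and `"0.0"`. A slightly more forgiving sketch parses the price as a number instead. The helper name `is_free` is hypothetical, and the handling of a leading `$` is an assumption about how paid prices may be written.

```python
def is_free(price):
    """Return True if a price string such as "0", "0.0" or "$0.00" is zero."""
    cleaned = price.strip().lstrip("$")  # assumed: prices may carry a "$" prefix
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that cannot be parsed as a number is treated as not free
        return False


print(is_free("0"), is_free("0.0"), is_free("$4.99"))
```

With this helper, both data sets could be filtered the same way, for example `[row for row in android_english if is_free(row[7])]`.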

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the `prime_genre` column of the App Store data set, and the `Genres` and `Category` columns of the Google Play data set.

We create two functions that we can use to analyze the frequency tables:
1. `freq_table()` to generate frequency tables that show percentages.
2. `display_table()` to display the percentages in descending order.

In [None]:
def freq_table(dataset, index):
    table = {}
    for row in dataset:
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1 
            
    table_percentages = {}
    total = len(dataset)
    
    for key in table:
        table_percentages[key] = (table[key] / total) * 100
        
    return table_percentages
    

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_content = []
    for key in table:
        table_content.append((table[key], key))
        
    table_content.sort(reverse=True)
    for item in table_content:
        print(item[1], ": ", item[0])

In [None]:
# just trying an alternate way to sort a dict based on values
def display_table_aliter(dataset, index):
    table = freq_table(dataset, index)
    table_sorted = sorted(table.items(), key=lambda kv: kv[1], reverse=True)
    for item in table_sorted:
        print("{}: {}".format(item[0], item[1]))
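
For comparison, the standard library's `collections.Counter` can do the counting and the descending sort in one step. This is only a sketch of a third option, not a replacement for the functions above. The name `freq_table_counter` and the sample rows are hypothetical.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs, most frequent value first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Hypothetical sample rows: app name, genre
sample = [["a", "Games"], ["b", "Games"], ["c", "Education"], ["d", "Games"]]
table = freq_table_counter(sample, 1)
for value, percentage in table:
    print("{}: {}".format(value, percentage))
```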

Frequency table for the `prime_genre` column of the App Store data set:

In [None]:
display_table_aliter(ios_final, -5)

In [None]:
display_table(ios_final, -5)

We can see that among the free English apps, more than half (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users; the demand might not match the supply.

Frequency tables for the `Genres` and `Category` columns of the Google Play data set:

In [None]:
display_table(android_final, -4) # Genre

In [None]:
display_table(android_final, 1) # Category

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

The `Genres` column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the `Category` column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea of the kinds of apps that have the most users.

## Most Popular Apps by Genre on the App Store

In [None]:
freq_table_genre = freq_table(ios_final, -5)

In [None]:
for genre in freq_table_genre:
    len_genre = 0
    total = 0  # total no. ratings for apps of this genre  
    for row in ios_final:
        genre_app = row[-5]
        if genre_app == genre:
            len_genre += 1
            total += float(row[5])
    avg_no_of_user_ratings = total / len_genre
    print(genre, ": ", avg_no_of_user_ratings)

It seems that apps in the Navigation genre are the most popular among users, but this figure is skewed by apps like Waze and Google Maps, which have far more user ratings than the other apps in this genre:

In [None]:
for row in ios_final:
    genre_app = row[-5]
    app_name = row[1]
    if genre_app == "Navigation":
        no_of_user_ratings = row[5]
        print(app_name, ": ", no_of_user_ratings)

The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. The same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold.
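
One way to see how much a few giants distort an average is to compare the mean with the median, which is far less sensitive to extreme values. The sketch below uses made-up rating counts for a single hypothetical genre; none of these numbers come from the data set.

```python
from statistics import mean, median

# Hypothetical rating counts: one giant app and several small ones
rating_counts = [300000, 12000, 3000, 2000, 200, 50]

genre_mean = mean(rating_counts)
genre_median = median(rating_counts)

print("mean:  ", genre_mean)
print("median:", genre_median)
```

When the mean is many times larger than the median, as it is here, the average mostly reflects the largest app rather than a typical one.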

Apps in the Reference genre have around 74942 user ratings on average. Let's have a look at the distribution of the number of ratings for apps in this genre:

In [None]:
for row in ios_final:
    if row[-5] == "Reference":
        print(row[1], ": ", row[5])

Again, the average number of user ratings for apps in the 'Reference' genre is skewed by apps like 'Bible' and 'Dictionary.com Dictionary & Thesaurus', but at least other apps are able to reach 10,000 ratings. This genre looks promising.

While we observed that the App Store is highly saturated with for-fun apps, making an app in the Reference genre makes sense. A practical app might have a better chance of standing out among the huge number of apps on the App Store.

## Most Popular Apps by Genre on Google Play 

In the Google Play data set, we have the number of installs for each app. We will use this column to determine the most popular app category.

In [None]:
display_table(android_final, 5)

The values in the installs column are not precise: they are open-ended ranges such as 1,000+ or 1,000,000,000+. However, for our purposes we only need an overall picture of the popularity of the various app categories.
So we will take these values at face value; for example, 1,000+ will be treated as 1,000.
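
As a small sketch of that conversion, the helper below strips the comma and plus characters and turns the remaining digits into a number. The name `parse_installs` is hypothetical.

```python
def parse_installs(value):
    """Convert an installs string such as "1,000+" to a float."""
    return float(value.replace(",", "").replace("+", ""))


for sample in ["1,000+", "500,000,000+", "0"]:
    print(sample, "->", parse_installs(sample))
```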

In [None]:
freq_table_android = freq_table(android_final, 1)

In [None]:
for category in freq_table_android:
    len_category = 0
    total = 0
    for row in android_final:
        category_app = row[1]
        if category_app == category:
            len_category += 1
            # Removing comma and plus characters from no. of installs value
            installs = float(row[5].replace('+', '').replace(',', ""))
            total += installs
    avg_installs = total / len_category
    # Printing app category and avg no. of installs for apps in that category
    print(category, ": ", avg_installs)  

It seems that apps in the 'COMMUNICATION' category are the most popular.
Looking closely at the number of installs for these apps:

In [None]:
for row in android_final:
    if row[1] == "COMMUNICATION":
        print(row[0], ": ", row[5])

Again, we observe that the average number of installs for apps in the "COMMUNICATION" category is skewed by apps like WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [None]:
for row in android_final:
    if row[1] == "BOOKS_AND_REFERENCE":
        print(row[0], ": ", row[5])

Still, there are a few apps that seem to skew the average number of installs:

In [None]:
top_install_ranges = ('1,000,000,000+', '500,000,000+', '100,000,000+')
for row in android_final:
    if row[1] == "BOOKS_AND_REFERENCE" and row[5] in top_install_ranges:
        print(row[0], ": ", row[5])

As these are very few, developing apps in this category can be a good decision.

However, this niche seems to be dominated by apps for reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps, since there would be significant competition.