# Project 1 - Profitable App Profiles for the App Store and Google Play Markets


***Guided Project under Dataquest Data Analysis in Python career track***

**Author: Yiyan Kang**

**Date: January 18th, 2019**

## Introduction

This project analyzes app profiles for the App Store and Google Play to help developers understand what kinds of free apps are likely to attract more users. A secondary goal is to get familiar with data analysis in Python.

We found two data sets that can help in this project:

* [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately seven thousand iOS apps from the App Store, which was collected in July 2017

* [A data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately ten thousand Android apps from Google Play, which was collected in August 2018



To make exploring the data easier, I created a function `explore_data()` that I can call repeatedly to print rows in a more readable way. It also takes an optional parameter, `rows_and_columns`, that prints the number of rows and columns.

In [None]:
# explore_data() Function
# dataset should not contain header row
def explore_data(dataset,start,end,rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # add a new empty line after each row
    
    if rows_and_columns:
        print("Number of rows: ", len(dataset))
        print("Number of columns: ",len(dataset[0]))

## Opening and Exploring Datasets

We will start by opening and exploring these two data sets. 

In [None]:
from csv import reader

# The App Store data set #
opened_file1 = open("AppleStore.csv", encoding="utf8") # app names contain non-ASCII characters
read_file1 = reader(opened_file1)
ios = list(read_file1)
ios_header = ios[0]
ios_data = ios[1:]

print(ios_header)
print('\n')
explore_data(ios_data, 0, 5, True)



From the result above, we can see that the App Store data set has 7197 rows and 16 columns. For our project, the columns that might be useful for our analysis are 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'cont_rating' and 'prime_genre'. The explanation of each column can be found in the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).
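
Since the rest of the notebook refers to columns by numeric index, a small lookup from column name to index can make those references easier to check. The sketch below is only an illustration: `column_indexes()`, `sample_header` and `sample_row` are hypothetical names, and the shortened header is made up for the example rather than copied from the file.

```python
def column_indexes(header):
    """Map each column name to its position in a row."""
    return {name: position for position, name in enumerate(header)}

# Shortened, illustrative header and row (not the full App Store header)
sample_header = ['id', 'track_name', 'size_bytes', 'currency', 'price']
sample_row = ['1', 'Example App', '1024', 'USD', '0.0']

col = column_indexes(sample_header)
print(sample_row[col['track_name']])     # Example App
print(float(sample_row[col['price']]))   # 0.0
```

With the real data, the same helper could be called as `column_indexes(ios_header)`.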

In [None]:
# The Google Play Store data set #
opened_file2 = open("googleplaystore.csv", encoding="utf8") # app names contain non-ASCII characters
read_file2 = reader(opened_file2)
gps = list(read_file2)
gps_header = gps[0]
gps_data = gps[1:]

print(gps_header)
print("\n")
explore_data(gps_data, 0, 5, True)

From the result above, we can see that the Google Play Store data set has 10841 rows and 13 columns. For our project, the columns that might be useful for our analysis are 'App', 'Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price', 'Content Rating' and 'Genres'. The explanation of each column can be found in the [documentation](https://www.kaggle.com/lava18/google-play-store-apps/home).

## Deleting Wrong Data

According to the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) for the Google Play Store data set, entry 10472 contains an error: its rating is 19, which exceeds the maximum rating of 5. This happens because the category value is missing from that row. Therefore, we need to delete that entry.

In [None]:
print(gps_data[10472]) # incorrect row
print('\n') 
print(gps_header) # print header

In [None]:
print(len(gps_data))
del gps_data[10472] # run this line only once; running it again deletes a different, valid row
print(len(gps_data))
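
As an alternative to hard-coding the index, a malformed row like the one above could be located by comparing each row's length with the header's length. This is a minimal sketch on made-up rows, assuming the missing category leaves the row with fewer fields than the header; `find_malformed_rows()` and the sample data are hypothetical.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose field count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Made-up sample data for illustration
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '19'],             # category missing, so the row is too short
    ['App C', 'GAME', '3.9'],
]

bad = find_malformed_rows(sample_header, sample_rows)
print(bad)  # [1]

# Building a new list instead of using `del` is safe to run more than once
bad_set = set(bad)
cleaned = [row for i, row in enumerate(sample_rows) if i not in bad_set]
print(len(cleaned))  # 2
```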

## Removing Duplicate Entries

When we explore the Google Play data set, we find that some of the entries are duplicated. For example, Instagram has four entries.

##### Google Play Store data set

In [None]:
for row in gps_data:
    name = row[0]
    if name == 'Instagram':
        print(row)

We should find the number of cases like this.

In [None]:
duplicate_name = []
unique_name = []

for row in gps_data:
    name = row[0]
    if name in unique_name:
        duplicate_name.append(name)
    else:
        unique_name.append(name)
print(len(duplicate_name))
print(duplicate_name[:20])

There are 1181 cases where the same name occurs more than once in the data set. We don't want to count an app more than once when we analyze the data, so we need to delete the extra rows. However, deleting rows at random is not a good way to deal with this situation. When we look at the duplicated Instagram rows, the main difference we care about is the number of reviews. In general, the more reviews an app has, the more recent and reliable the data should be. Therefore, for any given app, we will keep the row with the highest number of reviews and remove the other entries.

In [None]:
print(len(gps_data)-1181)

After keeping one row per unique name, the data set should have 9659 rows.

In [None]:
reviews_max = {}
for row in gps_data:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

The length of the `reviews_max` dictionary matches the number we computed previously, so the dictionary looks correct. Next, we need to remove the duplicate rows from the Google Play Store data set.

In [None]:
gps_clean = [] # will store our new cleaned data set
already_added = [] # will store only the names to make sure we don't have duplicate names

for row in gps_data:
    name = row[0]
    n_reviews = float(row[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        gps_clean.append(row)
        already_added.append(name)

explore_data(gps_clean,0,5,True)
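
The two passes above can also be combined into one by keeping the best row per name in a dictionary, which avoids the repeated `name not in already_added` list lookups. This is a sketch under the same column assumptions as the cells above (name in column 0, review count in column 3); `keep_most_reviewed()` and the sample rows are hypothetical. The rows come out ordered by the first appearance of each name, which can differ from the order produced by the two-pass version.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample rows for illustration
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Other App', 'TOOLS', '4.0', '5'],
    ['Instagram', 'SOCIAL', '4.5', '250'],
    ['Instagram', 'SOCIAL', '4.5', '250'],
]

deduplicated = keep_most_reviewed(sample_rows)
print(len(deduplicated))    # 2
print(deduplicated[0][3])   # 250
```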

##### App Store data set

In [None]:
duplicate_id = []
unique_id = []

for row in ios_data:
    app_id = row[0] # column 0 holds the app id; the app name is in column 1
    if app_id in unique_id:
        duplicate_id.append(app_id)
    else:
        unique_id.append(app_id)
print(len(duplicate_id))

In conclusion, we removed the duplicate entries from the Google Play data set. We also checked whether the App Store data set has the same issue; fortunately, the check above finds no duplicate entries in it.

## Removing Non-English Apps

This project focuses on apps aimed at an English-speaking audience. Both data sets contain many apps whose names suggest that they are not directed toward English speakers.

In [None]:
print("Some non-English apps in App Store:")
print(ios_data[813][1])
print(ios_data[6731][1])
print("\nSome non-English apps in Google Play Store:")
print(gps_clean[4412][0])
print(gps_clean[7940][0])

We are going to remove these rows because we are not interested in them. One way to achieve this is to use the ASCII encoding: English letters and the most commonly used symbols all have code points in the range from 0 to 127. Based on this, we can use the built-in function `ord()` to get the code point of each character in an app name. We will write a function that uses this to test the rows.

In [None]:
def testEnglish(string):
    count = 0 # number of characters outside the ASCII range (0 to 127)
    for a in string:
        if ord(a) > 127:
            count += 1
            if count > 3:
                return False
    return True

print(testEnglish("Instagram"))
print(testEnglish("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(testEnglish("Docs To Go™ Free Office Suite"))
print(testEnglish("Instachat 😜"))

The function returns `False` only if more than three characters fall outside the 0 to 127 range. This makes more sense than returning `False` as soon as a single special character appears. For example, 'Docs To Go™ Free Office Suite' and 'Instachat 😜' should be considered English apps.
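
To see why a tolerance of a few characters is useful, it can help to list which characters in a name fall outside the ASCII range. The helper below, `non_ascii_chars()`, is a hypothetical name introduced only for this illustration.

```python
def non_ascii_chars(text):
    """Return (character, code point) pairs for characters above 127."""
    return [(ch, ord(ch)) for ch in text if ord(ch) > 127]

print(non_ascii_chars("Instagram"))                        # []
print(non_ascii_chars("Docs To Go™ Free Office Suite"))    # [('™', 8482)]

# Names written mostly in another script contain many such characters
chinese_name = "爱奇艺PPS -《欢乐颂2》电视剧热播"
print(len(non_ascii_chars(chinese_name)) > 3)              # True
```

A single trademark sign or emoji stays under the threshold, while a name written mostly in another script exceeds it quickly.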

Now we will clean both data sets by using `testEnglish()` function.

##### App Store data set

In [None]:
ios_eng = []
for row in ios_data:
    name = row[1]
    if testEnglish(name):
        ios_eng.append(row)

explore_data(ios_eng,0,3,True)    

After removing the apps with non-English names, the App Store data set contains 6183 rows and 16 columns.

##### Google Play Store data set

In [None]:
gps_eng = []
for row in gps_clean:
    name = row[0]
    if testEnglish(name):
        gps_eng.append(row)

explore_data(gps_eng,0,3,True) 

After removing the apps with non-English names, the Google Play Store data set contains 9614 rows and 13 columns.

## Isolating the Free Apps

As mentioned in the introduction, we focus on apps that are free to download and install, so we need to isolate the free apps in each data set.

##### App Store data set

In [None]:
ios_final = []
for row in ios_eng:
    price = float(row[4])
    if price == 0:
        ios_final.append(row)
explore_data(ios_final,0,3,True)

##### Google Play Store data set

In [None]:
gps_final = []
for row in gps_eng:
    price = row[7]
    if price == "0":
        gps_final.append(row)
explore_data(gps_final,0,3,True)

Finally, we have 3222 rows for the App Store data set and 8864 rows for the Google Play Store data set, which should be enough for the analysis.

To summarize the data cleaning part, we:

* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps
* Isolated the free apps

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people who use our apps and see the in-app ads.

To minimize risks and overhead, our validation strategy for an app idea comprises three steps:

1. Build a minimal Android version of the app, and add it to the Google Play Store.
2. If the app has a good response from users, we will develop it further.
3. If the app is profitable after six months, we will add an iOS version of the app to the App Store.

Since our end goal is to add the app to both the App Store and Google Play while keeping costs low, we need to find app profiles that are successful in both markets. For instance, a profile that might work well for both markets is a productivity app.

We will start by building frequency tables for some columns in our data sets.

## Most Common Apps by Genre

First, we'll build two functions: one to generate frequency tables as percentages, and one to display those percentages in descending order.

In [None]:
def freq_table(dataset, index):
    table = {}
    result = {}
    total = 0
    for a in dataset:
        total += 1
        value = a[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    for a in table:
        result[a] = table[a]/total*100
    return result

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
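
For comparison, the same percentage table can be built with `collections.Counter` from the standard library, whose `most_common()` method already returns values sorted by count. This is an alternative sketch, not the approach used in this notebook; `freq_table_counter()` and the sample rows are hypothetical.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage}, ordered from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Made-up sample rows for illustration
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]

for genre, percentage in freq_table_counter(sample, 1).items():
    print(genre, ':', percentage)
# Games : 75.0
# Education : 25.0
```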

Now, we will use these two functions to analyze the two data sets.

##### App Store data set

In [None]:
display_table(ios_final, 11)
print("Number of Genres: ", len(freq_table(ios_final, 11)))

From the frequency table, we can see that there are 23 genres in total, and over 50% of the apps fall under the `Games` genre. Another 8% of the apps are in the `Entertainment` genre, followed by 5% in `Photo & Video`. Only 3.7% of the apps are designed for `Education`, which is slightly more than `Social Networking`.

In general, we can conclude that App Store apps are mostly designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with more practical uses (education, shopping, utilities, health and fitness, lifestyle, etc.) are less common. However, we cannot rely on the frequency table alone, because the number of apps in a genre does not tell us how popular that genre is; in some genres, supply may exceed demand.

##### Google Play Store data set

In [None]:
display_table(gps_final, 1)
print("Number of categories: ", len(freq_table(gps_final, 1)))

The picture is quite different on the Google Play Store. Several categories with practical purposes have relatively high frequencies, such as family, tools, business, and lifestyle.

In [None]:
display_table(gps_final, 9)
print("Number of Genres: ", len(freq_table(gps_final,9)))

The `Genres` column is another column in the data set, and it helps confirm the larger proportion of practical apps.

In general, the App Store is dominated by apps designed for fun, while Google Play shows a more balanced mix of fun and practical apps.

## Most Popular Apps by Genre

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:

* Isolate the apps of each genre
* Sum up the user ratings for the apps of that genre
* Divide the sum by the number of apps belonging to that genre (not by the total number of apps)

In [None]:
ios_pop_table = freq_table(ios_final, 11)
ios_avg_n_rating = {}

for unique_genres in ios_pop_table:
    total = 0 # store the sum of user ratings specific to each genre
    len_genre = 0 # store the number of apps specific to each genre
    for row in ios_final:
        genre_app = row[11]
        if genre_app == unique_genres:
            user_rating = float(row[5])
            total += user_rating
            len_genre += 1
    avg_n_rating = total/len_genre
    ios_avg_n_rating[unique_genres] = avg_n_rating

table_display = []

for key in ios_avg_n_rating:
    key_val_as_tuple = (ios_avg_n_rating[key], key)
    table_display.append(key_val_as_tuple)
        
ios_sorted_avg_n_rating = sorted(table_display, reverse = True)
for entry in ios_sorted_avg_n_rating:
    print(entry[1], ':', entry[0])
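
The three steps listed above (isolate, sum, divide) can also be done in a single pass over the data by accumulating a running total and a count per genre, instead of looping over the whole data set once per genre. The sketch below uses made-up rows; `average_by_group()` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` for each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    # Divide by the number of rows in each group, not by the total row count
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample rows: [genre, number of ratings]
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]

averages = average_by_group(sample, 0, 1)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
# Games : 200.0
# Music : 50.0
```

With the real data, the call would look like `average_by_group(ios_final, 11, 5)`.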

From the result, we can conclude that the `Navigation` genre has the highest average number of user ratings.

In [None]:
for app in ios_final:
    if app[11] == 'Navigation':
        print(app[1], ": ", app[5])

That is mostly due to the very large number of user ratings for Waze and Google Maps.

In [None]:
for app in ios_final:
    if app[11] == 'Social Networking':
        print(app[1], ": ", app[5])
print("\n")
        
for app in ios_final:
    if app[11] == 'Reference':
        print(app[1], ": ", app[5])
print("\n")
        
for app in ios_final:
    if app[11] == 'Music':
        print(app[1], ": ", app[5])
        

The same situation occurs in other genres: for example, `Facebook` and `Pinterest` in the `Social Networking` genre, `Bible` in the `Reference` genre, and `Pandora`, `Spotify` and `Shazam` in the `Music` genre.
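
To illustrate how a few giant apps can pull a genre's average upward, the sketch below compares the mean with the median, which is much less sensitive to outliers. The numbers are made up for the example and are not taken from either data set.

```python
from statistics import mean, median

# Made-up rating counts for one genre: two giants and four small apps
ratings = [1_000_000, 150_000, 900, 400, 120, 30]

print(mean(ratings))     # pulled far upward by the two largest apps
print(median(ratings))   # 650.0
```

Here the mean is dominated by the two largest values, while the median stays close to what a typical app in the list receives.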

##### Google Play Store data set

In [None]:
display_table(gps_final, 5)

To keep things simple, even though it is not precise, we will treat open-ended values such as `100,000+` as `100,000`.
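
As a small worked example of that conversion, the hypothetical helper below strips the commas and the plus sign before converting the text to a number. The input strings follow the format shown in the `Installs` frequency table.

```python
def parse_installs(text):
    """Convert an install bucket such as '100,000+' to the integer 100000."""
    return int(text.replace(',', '').replace('+', ''))

print(parse_installs('100,000+'))         # 100000
print(parse_installs('1,000,000,000+'))   # 1000000000
print(parse_installs('0'))                # 0
```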

In [None]:
gps_pop_table = freq_table(gps_final, 1)
gps_avg_install = {}

for unique_category in gps_pop_table:
    total = 0 # store the sum of installs specific to each category
    len_category = 0 # store the number of apps specific to each category
    
    for row in gps_final:
        category_app = row[1]
        if category_app == unique_category:
            n_install = row[5]
            n_install = n_install.replace('+','')
            n_install = n_install.replace(',','')
            total += float(n_install)
            len_category += 1
                  
    avg_install = total/len_category
    gps_avg_install[unique_category] = avg_install

table_display = []

for key in gps_avg_install:
    key_val_as_tuple = (gps_avg_install[key], key)
    table_display.append(key_val_as_tuple)
        
gps_sorted_avg_install = sorted(table_display, reverse = True)
for entry in gps_sorted_avg_install:
    print(entry[1], ':', entry[0])

`COMMUNICATION` apps have the most installs on average, followed by `VIDEO_PLAYERS` and `SOCIAL`.

In [None]:
for app in gps_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

These categories contain many giant apps that are hard to compete against, so we can start thinking about genres or categories that are less popular but still have potential to grow.

In [None]:
for app in gps_final:
    if app[1] == 'BEAUTY' and (app[5] == '500,000+'
                                      or app[5] == '5,000,000+'
                                      or app[5] == '1,000,000+'):
        print(app[0], ':', app[5])

For example, the `BEAUTY` category includes a variety of apps for makeup, selfies, and hairstyles. Even here, a small number of extremely popular apps seems to dominate the installs.

## Conclusions
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that it is worth entering a category that is less popular but has the potential to gain users, such as beauty camera or makeup tips apps. Besides building on ideas already in the market, we need to add compelling new features to attract users rather than simply copying features from other apps.