# Profitable App Profiles for the App Store and Google Play Markets
**Project:** This project applies practical data analysis in Python to make a data-driven decision about which kind of app to build for profit.  
**Goal:** The goal of this project is to help developers understand what types of apps are likely to attract more users (on the App Store and Google Play).  This is done by collecting and analyzing mobile app data.

**Summary**  
1. Google and Apple datasets are opened, read, and explored.
    * Field names, number of rows, number of columns, and examples of some rows are looked at.  The data headers (i.e. field names) are also separated from the data.    
2. Data cleaning is performed, and only the data of interest for this project is isolated.    
    * Incorrect data and duplicates are removed (based on the forum discussion thread).  A criterion for which duplicate entry to keep is determined and implemented.
    * The impact of removing incorrect data and duplicates is also examined in terms of the number of rows remaining in the datasets.
    * Non-English apps and paid apps are then removed. Non-English apps are identified by applying the `ord()` function to each character in the app name.
3. Analysis is performed on the cleaned data to determine which kind of app to build and eventually profit from.
    * A frequency table function called `freq_table()` and a frequency table display function called `display_table()` are created.  These are then used to analyze the percentage of apps within each genre of both the Google and Apple datasets and make an informed decision.

## Opening and Exploring the Data

Open and read two datasets, one for Google Play and one for Apple App Store, then store them as lists.

In [None]:
from csv import reader

## App Store dataset ##
opened_file = open('_data/AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
apple = list(read_file)
apple_head = apple[0]  # header row (field names)
apple = apple[1:]      # data rows only

## Google Play dataset ##
opened_file = open('_data/googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
google = list(read_file)
google_head = google[0]  # header row (field names)
google = google[1:]      # data rows only
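
As a minimal sketch, the open/read/split steps above can be wrapped in one helper that also closes the file when it is done. The name `load_csv()` and the temporary demo file are hypothetical and not part of the original notebook.

```python
import csv
import os
import tempfile

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The `with` block closes the file automatically; newline="" is what the
    # csv module documentation recommends for files passed to csv.reader.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Demo on a small temporary file so the sketch runs without the real datasets.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf8", newline="") as tmp:
    tmp.write("App,Category\nInstagram,SOCIAL\n")
    demo_path = tmp.name

demo_head, demo_rows = load_csv(demo_path)
os.remove(demo_path)
print(demo_head)
print(demo_rows)
```

With the real files, a call such as `load_csv('_data/AppleStore.csv')` would return the header and the data rows in one step.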

Define and use a function `explore_data()` to get a preview, row count, and column count of the two datasets.

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(google, 0, 3, True)

In [None]:
explore_data(apple, 0, 3, True)

View field names of each dataset to see which fields are likely to be useful and which ones may not.  
Here is the [documentation](https://www.kaggle.com/datasets/lava18/google-play-store-apps) for the Google dataset.  And here is the [documentation](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) for the Apple dataset.

In [None]:
print(google_head)

In the Google dataset, the columns that could be useful are `'App'`, `'Category'`, `'Rating'`, `'Reviews'`, `'Genres'`, `'Type'`, and `'Price'`.

In [None]:
print(apple_head)

In the Apple dataset, the columns that could be useful are `'user_rating'`, `'rating_count_tot'`, `'track_name'`, `'currency'`, `'price'`, and `'prime_genre'`.

## Deleting Incorrect Data

In [None]:
print(google[10471:10473])

As shown above, row 10472 has a `'Rating'` of 19, but the maximum possible rating is 5.  That row is therefore deleted from the dataset below.

In [None]:
del google[10472]

print(google[10471:10473])
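
Rather than relying on a known row number, out-of-range ratings can also be searched for programmatically. The following is a sketch only: `find_bad_ratings()` is a hypothetical helper, and it assumes the rating sits at index 2 of each row and that non-numeric values should simply be skipped.

```python
def find_bad_ratings(dataset, rating_index=2, max_rating=5.0):
    """Return the indexes of rows whose rating exceeds `max_rating`."""
    bad_rows = []
    for i, row in enumerate(dataset):
        try:
            rating = float(row[rating_index])
        except ValueError:
            continue  # skip values that cannot be parsed as a number
        # A 'NaN' rating parses to nan, and nan > max_rating is False,
        # so missing ratings are not flagged.
        if rating > max_rating:
            bad_rows.append(i)
    return bad_rows

# Made-up sample rows: name, category, rating.
sample = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "TOOLS", "19"],
    ["App C", "TOOLS", "NaN"],
]
print(find_bad_ratings(sample))
```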

## Removing Duplicates

Per this [discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion) section, the Google dataset contains duplicates. For example, `'Instagram'` shown below has four rows.

In [None]:
for app in google:
    app_name = app[0]
    if app_name == 'Instagram':
        print(app)

Next, the number of duplicates is counted and the unique rows are stored in a new list.

In [None]:
duplicate_apps = []
unique_apps = []

for app in google:
    app_name = app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print("Number of duplicate apps in Google dataset: ", len(duplicate_apps))
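
The same count can be obtained with a `set` for the names already seen, which avoids scanning a growing list on every membership test. This is a sketch under that one change; `count_duplicates()` is a hypothetical name.

```python
def count_duplicates(dataset, name_index=0):
    """Return (duplicate_names, unique_names) for the given rows."""
    seen = set()      # membership tests on a set take constant time on average
    duplicates = []
    for row in dataset:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates, seen

# Made-up sample rows containing three entries for one app.
sample = [["Instagram"], ["Facebook"], ["Instagram"], ["Instagram"]]
dups, uniques = count_duplicates(sample)
print("Number of duplicate apps:", len(dups))
```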

Duplicates will not be removed randomly, but instead based on the number of reviews.  For example, although there are four rows for `'Instagram'`, they have differing numbers of reviews.  The row with the most reviews will be kept and the rest will be removed.

In [None]:
reviews_max = {}

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

Check that the `reviews_max` dictionary has the expected number of entries (i.e. one per unique app, with no duplicates).  The difference below should be 0.

In [None]:
len(google) - len(duplicate_apps) - len(reviews_max)

Create a new Google dataset (list of lists) that contains neither duplicates nor incorrect data.

In [None]:
google_clean = []
already_added = []

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        google_clean.append(app)
        already_added.append(name)
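
The two passes above (build `reviews_max`, then filter) can be sketched as a single pass that stores the best row per app directly. `keep_most_reviewed()` is a hypothetical helper; it assumes the name is at index 0 and the review count at index 3, as in the code above. On ties it keeps the first row seen, and the output is ordered by each app's first appearance, which may differ from the order produced above.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means the first row wins when review counts are equal.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample rows: name, category, rating, reviews.
sample = [
    ["Instagram", "SOCIAL", "4.5", "100"],
    ["Instagram", "SOCIAL", "4.5", "300"],
    ["Maps", "TRAVEL", "4.3", "50"],
    ["Instagram", "SOCIAL", "4.5", "300"],
]
deduped = keep_most_reviewed(sample)
print(len(deduped))
```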

Explore the Google dataset that excludes duplicates and incorrect data to see if the code we wrote performs as expected.

In [None]:
explore_data(google_clean, 0, 3, True)

## Removing Non-English Apps

Define a function that counts the non-English characters in a string (i.e. those whose code point, as returned by `ord()`, is greater than 127).  It returns `False` if the string contains more than 3 such characters and `True` otherwise.  Then test it.

In [None]:
def eng_str(string):
    num_sp_char = 0

    for char in string:
        if ord(char) > 127:
            num_sp_char += 1
        
    if num_sp_char > 3:
        return False
    else:
        return True

print(eng_str('Instagram'))
print(eng_str('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_str('Docs To Go™ Free Office Suite'))
print(eng_str('Instachat 😜'))
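
For comparison, the same rule can be written more compactly with `sum()` over a generator. This is only a sketch: `is_english()` and its `max_non_ascii` parameter are hypothetical names, with the threshold of 3 taken from the function above.

```python
def is_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
```

Allowing up to three such characters keeps names that contain a trademark symbol or an emoji, at the cost of letting a few non-English names through.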

Use the `eng_str()` function to remove apps with non-English names from both the Google and the Apple datasets.  Then explore the results to see how many rows remain.

In [None]:
google_eng = []
apple_eng = []

for app in google_clean:  # start from the de-duplicated Google data
    if eng_str(app[0]):
        google_eng.append(app)
    
for app in apple:
    # Index 1 is assumed to hold 'track_name'; index 0 is assumed to be the app id.
    if eng_str(app[1]):
        apple_eng.append(app)
        
explore_data(google_eng, 0, 3, True)
explore_data(apple_eng, 0, 3, True)

## Isolate Free Apps Only

Keep only the free apps, since they are the focus of this analysis.  New Google and Apple lists are created for them.  The prices are compared as strings and are not converted to `float` or `int`.

In [None]:
google_eng_free = []
apple_eng_free = []

for app in google_eng:
    if app[7] == '0':
        google_eng_free.append(app)
        
for app in apple_eng:
    if app[4] == '0.0':
        apple_eng_free.append(app)
        
print(len(google_eng_free))
print(len(apple_eng_free))
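
Comparing against the exact strings `'0'` and `'0.0'` works only while each file writes a free price in exactly that form. As a hedged alternative, the sketch below parses the price instead. `is_free()` is a hypothetical helper, and the leading `$` on paid prices is an assumption about the Google data, not something shown above.

```python
def is_free(price):
    """Return True if the price string parses to zero (hypothetical helper)."""
    cleaned = price.strip().lstrip("$")  # assumes paid prices may look like '$4.99'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # anything that is not a number is treated as not free

print(is_free('0'))
print(is_free('0.0'))
print(is_free('$4.99'))
```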

The end goal of the company is to add an app to both the Google Play and the App Store markets, so we should look for an app profile that fits both markets.  
To minimize risks and overhead, our validation strategy is as follows:  
1. Build a minimal Android version of the app and add it to Google Play  
2. If the app has a good response from users then develop it further  
3. If the app is profitable after six months then build an iOS version of it and add it to the App Store

First, to determine the most common genres in each market, frequency tables of the `'Genres'`, `'Category'`, and `'prime_genre'` columns can be used.  The `freq_table()` function takes a dataset and an index as parameters and returns a frequency table (a dictionary) whose values are percentages.  The `display_table()` function sorts that table in descending order and prints the results.

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
def freq_table(dataset, index):
    d = {}
    for row in dataset:
        if row[index] in d:
            d[row[index]] += 1
        else:
            d[row[index]] = 1
            
    d_percent = {}
    for key in d:
        percent = (d[key] / len(dataset)) * 100
        d_percent[key] = percent
        
    return d_percent
        
display_table(apple_eng_free, 11)
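
The standard library's `collections.Counter` can do the counting and the descending sort in one step. The sketch below is an alternative formulation, not the notebook's method; `freq_table_counter()` is a hypothetical name, and it assumes a non-empty dataset.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields the entries ordered by descending count.
    return [(key, count / total * 100) for key, count in counts.most_common()]

# Made-up sample rows: name, genre.
sample = [["a", "Games"], ["b", "Games"], ["c", "Tools"], ["d", "Games"]]
for genre, percent in freq_table_counter(sample, 1):
    print(genre, ':', percent)
```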

In [None]:
display_table(google_eng_free, 9)

In [None]:
display_table(google_eng_free, 1)

## Analysis of Apple Apps
* The most common genre of free English apps on the Apple App Store is Games, followed by Entertainment.
* Games make up the majority of free English apps at 56%, Entertainment is at 8%, and each of the remaining genres has a low percentage.
* Most of the apps are for entertainment (gaming, entertainment, photo & video, social networking) and fewer are for practical purposes (education, shopping, utilities).
* A gaming, entertainment, or social networking app may be a good candidate for the App Store.  However, the very large share of games means there is already substantial competition.  A large number of gaming apps also does not imply a large number of users: there could be many games with few or no users at all.  Social networking apps, by contrast, are few in number, but their user bases could be very large.

## Analysis of Google Apps
* The most common genre of free English apps at the Google Play Store is Tools (followed by Entertainment).
* Unlike Apple apps where games dominate the market, Google apps are spread out much more among different genres.  Also unlike Apple apps where most are for entertainment purposes, Google apps are more for practical purposes.  For example, tools, education, business, and productivity are abundant in the Google Play Store.
* The frequency tables reveal the most frequent app genres, not necessarily the genres with the most users.

The average number of user ratings per genre of apps on the Apple App Store is calculated and shown below.

In [None]:
uniq_prime_genre = freq_table(apple_eng_free, 11)

for genre in uniq_prime_genre:
    # Reset the accumulators for each genre so totals do not carry over.
    total = 0
    len_genre = 0

    for app in apple_eng_free:
        genre_app = app[11]
    
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
            
    avg_ratings_num = total / len_genre
    print(genre + ": " + str(avg_ratings_num))
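
A per-group average can also be computed in a single pass over the data, instead of rescanning the whole dataset once per genre. The following is a sketch; `average_by_group()` is a hypothetical helper that takes the group column and the value column as indexes.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the value column} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample rows: genre, number of ratings.
sample = [["Games", "100"], ["Games", "300"], ["Social Networking", "1000"]]
averages = average_by_group(sample, 0, 1)
for genre, avg in averages.items():
    print(genre + ": " + str(avg))
```

With the notebook's data, the equivalent call would be `average_by_group(apple_eng_free, 11, 5)`.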

In [None]:
uniq_category = freq_table(google_eng_free, 1)

for category in uniq_category:
    total = 0
    len_category = 0
    
    for app in google_eng_free:
        category_app = app[1]
        
        if category_app == category:
            installs_0 = app[5].replace("+", "")
            installs_1 = installs_0.replace(",", "")
            total += float(installs_1)
            len_category += 1
            
    avg_installs = total / len_category
    
    print(category + ": " + str(avg_installs))
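
The install-count cleanup used above can be isolated into a small function, which makes it easy to check on its own. `parse_installs()` is a hypothetical name. Because the values carry a trailing `+`, the parsed number is best read as a lower bound rather than an exact count.

```python
def parse_installs(text):
    """Convert an installs string such as '1,000,000+' to an integer."""
    return int(text.replace(",", "").replace("+", ""))

print(parse_installs("1,000,000+"))
print(parse_installs("500+"))
```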