# Profitable App Profiles for the App Store and Google Play Markets
An introductory practice project in data analysis.

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

We only build apps that are free to download and install, and our main source of revenue is in-app ads. This means that our revenue for any given app is determined by its number of users: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

We analyze the following two datasets:

* A [dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the dataset directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* A [dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the dataset directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).


In [None]:
import csv
import os
  
def explore_data(dataset, start, end, rows_and_columns=False):
    """Print the rows in dataset[start:end]; optionally print the dataset's size."""
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds blank space after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


In [None]:
gps_filename = "googleplaystore.csv"
ios_filename = "AppleStore.csv"
    
with open(gps_filename, "r", encoding="utf8") as googlePS:
    googlePS_reader = list(csv.reader(googlePS))
with open(ios_filename, "r", encoding="utf8") as ios:
    ios_reader = list(csv.reader(ios))
    
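# Remove the two broken entries. The higher index is deleted first
# so that the lower index is not shifted by the first deletion.
del googlePS_reader[10473]
del googlePS_reader[9149]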

Unfortunately, the Google Play dataset not only has broken data entries (the two deleted above); it also has duplicate entries that we must filter out before performing any analysis.
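
One common sign of a broken CSV row is a field count that differs from the header's. The sketch below uses made-up rows rather than the real dataset to show how such rows could be located before deleting them. The helper name `find_malformed_rows` is hypothetical, and the source does not state that the two rows deleted above fail this particular check.

```python
def find_malformed_rows(rows):
    """Return indices of rows whose length differs from the header row (index 0)."""
    expected = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Made-up sample: a header plus three rows, one of which is missing a field.
sample = [
    ["App", "Category", "Reviews"],
    ["Alpha", "TOOLS", "120"],
    ["Beta", "45"],  # missing the Category field
    ["Gamma", "GAME", "980"],
]

malformed = find_malformed_rows(sample)
print(malformed)

# Delete from the highest index down so earlier deletions do not shift later ones.
for index in sorted(malformed, reverse=True):
    del sample[index]
```

A length check only catches rows with missing or extra fields; a row with the right number of fields but wrong values would need a separate check.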

In [None]:
for app in googlePS_reader:
    name = app[0]
    if name == "Instagram":
        print(app)

print()

for app in googlePS_reader:
    name = app[0]
    if name == "Facebook":
        print(app)

In [None]:
# Counting the Number of Duplicates in a Dataset

duplicates = []
uniques = []

for app in googlePS_reader:
    name = app[0]
    if name in uniques:
        duplicates.append(name)
    else:
        uniques.append(name)

print("Number of duplicate apps:", len(duplicates))

We will filter out the duplicate apps based on their number of reviews. We assume that, among entries for the same app, the one with the highest number of reviews is the most recent, so we keep only that entry.
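
The rule can be illustrated on a few made-up rows. This sketch is a compact single-pass variant, not the approach used in the cell that follows: it stores the whole best row per app name in a dictionary. The sample names and review counts are invented.

```python
# Made-up rows in the form [name, category, rating, reviews]; reviews sit at
# index 3, matching the Google Play layout used in this notebook.
rows = [
    ["PhotoApp", "SOCIAL", "4.5", "1000"],
    ["PhotoApp", "SOCIAL", "4.5", "1500"],
    ["ChatApp", "SOCIAL", "4.1", "300"],
    ["PhotoApp", "SOCIAL", "4.5", "1200"],
]

best_row = {}
for row in rows:
    name = row[0]
    n_reviews = float(row[3])
    # Keep a row only if it is the first for this name or has strictly more reviews.
    if name not in best_row or n_reviews > float(best_row[name][3]):
        best_row[name] = row

deduplicated = list(best_row.values())
print(deduplicated)
```

Using a strict comparison means that when two entries tie on reviews, the first one encountered is kept.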

In [None]:
print("Expected length:", len(googlePS_reader) - 1 - len(duplicates))

# Data Filtering
reviews_max = {}

# Generate a dictionary of app names and their max number of reviews found in the dataset
for app in googlePS_reader[1:]:
    name = app[0]
    n_reviews = float(app[3])

    # Record the count for a new name, or update it when a larger count is found
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

print("Number of unique entries:", len(reviews_max))

googlePS_clean = []
already_added = []

# Generate a new list with only the entries corresponding to the maximum reviews for each app
for app in googlePS_reader[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        googlePS_clean.append(app)
        already_added.append(name)

print("Confirmed number of unique entries:", len(googlePS_clean))


The dataset also includes non-English apps that are not of interest for our analysis, so they also have to be filtered out. To do this, we first define a function that identifies whether an app's name has non-English characters.

We will filter out all apps whose names contain more than three characters outside the ASCII range (0 to 127).
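
To make the threshold concrete, the sketch below counts the characters above code point 127 in a few sample names and applies the rule. The helper `count_non_ascii` is hypothetical and the sample names were chosen for illustration.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range (0 to 127)."""
    return sum(1 for char in text if ord(char) > 127)

sample_names = [
    "Instagram",
    "Docs To Go™ Free Office Suite",
    "日本語の天気予報アプリ",
]

for name in sample_names:
    non_ascii = count_non_ascii(name)
    # A name is kept when it has at most three non-ASCII characters.
    print(name, non_ascii, non_ascii <= 3)
```

The tolerance of three lets names containing a trademark symbol or similar character through, while still rejecting names written mostly in another script.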

In [None]:
def is_english(inputstr):
    count = 0
    
    for char in inputstr:
        if ord(char) > 127:
            count += 1
    
    return count <= 3

android_clean = []
ios_clean = []

for app in googlePS_clean:
    name = app[0]

    if is_english(name):
        android_clean.append(app)

for app in ios_reader[1:]:
    name = app[1]  # track_name; column 0 of AppleStore.csv holds the numeric app id

    if is_english(name):
        ios_clean.append(app)

print("Number of English Android Apps:", len(android_clean))
print("Number of English iOS Apps:", len(ios_clean))

To complete our data filtering, we keep only the free apps in each dataset.

In [None]:
android_allclean = []
ios_allclean = []

for app in android_clean:
    if app[6] == "Free":  # column 6 is Type ("Free" or "Paid")
        android_allclean.append(app)

for app in ios_clean:
    if app[4] == "0.0":  # column 4 is price, stored as a string
        ios_allclean.append(app)

print("Number of Free English Android Apps:", len(android_allclean))
print("Number of Free English iOS Apps:", len(ios_allclean))

Our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

For our specific datasets, the analysis requires building frequency tables for the *prime_genre* column of the App Store dataset and for the *Genres* and *Category* columns of the Google Play dataset.
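
The `display_table` function in the next cell calls a `freq_table` helper that is not defined in this section. A minimal sketch of what such a helper might look like is given here, assuming it returns a dictionary mapping each value in the chosen column to its percentage of rows. If the notebook defines `freq_table` elsewhere, that definition is the one to use. The sample rows are made up.

```python
def freq_table(dataset, index):
    """Return {value: percentage of rows} for the column at position `index`.

    Assumes `dataset` is a non-empty list of rows without a header row.
    """
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1

    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

# Made-up rows in the form [name, genre]
sample = [
    ["App A", "Games"],
    ["App B", "Games"],
    ["App C", "Productivity"],
    ["App D", "Games"],
]

print(freq_table(sample, 1))
```

Returning percentages rather than raw counts makes the two stores comparable even though the datasets differ in size.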

In [None]:
# Generates frequency tables that show percentages, and displays them in descending order

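def display_table(dataset, index):
    """Print each value in column `index` with its frequency, highest first."""
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        # Put the frequency first so that sorting the tuples orders by frequency
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])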

