# Profitable App Profiles for the App Store and Google Play Markets

For this project, we are data analysts for a company that builds mobile applications for both Android and iOS, available on Google Play and the App Store, respectively.

All of the apps that our company builds are free to download and install. The main source of revenue from these apps is in-app ads.

This means that the revenue for any one app depends largely on its number of users: the more users who see and engage with the ads, the more revenue the app brings in.

**The goal for this project is to analyze data to help our development team understand what type of apps are likely to attract more users.**

***
The first step in this project will be to collect and analyze data about all mobile apps available on Google Play and the App Store.

Since there are over 4 million mobile apps, we will focus on a smaller sample of data for the purposes of this project.

There are two datasets we will be using:
- [A dataset for approximately 10k Android apps from Google Play, collected in August 2018](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv). Saved as `googleplay.csv` within the project folder.
- [A dataset for approximately 7k iOS apps from the App Store, collected in July 2017](https://dq-content.s3.amazonaws.com/350/AppleStore.csv). Saved as `appstore.csv` within the project folder.

***
## Opening and Exploring Data

We will extract the datasets from the CSV files into a list of lists.

In [443]:
from csv import reader

def csv_to_list(file):
    # Open with an explicit encoding and close the file once it has been read
    with open(file, encoding='utf8') as opened_file:
        read_file = reader(opened_file)
        return list(read_file)

app_store = csv_to_list('appstore.csv')
google_play = csv_to_list('googleplay.csv')

We will define a function to explore these two datasets.

In [444]:
def explore_dataset(dataset, start_index, end_index, print_rows_and_columns = False):
    dataset_sliced = dataset[start_index: end_index]
    for row in dataset_sliced:
        print(row)
        print('\n') # print() adds its own newline, so this leaves two blank lines after each row
        
    if print_rows_and_columns:
        print("Number of rows: ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

In [445]:
explore_dataset(app_store, 0, 5, print_rows_and_columns = True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows:  7198
Number of columns:  16


In [446]:
explore_dataset(google_play, 0, 5, print_rows_and_columns = True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  10842
Number of columns:  13


***
## Data Cleaning
### Identify Key Data Points

Next, we will view only the column names of each dataset in order to identify which data points will be most helpful for our analysis.

Detailed descriptions of the columns can be found on the original pages of the datasets, for [Google Play](https://www.kaggle.com/datasets/lava18/google-play-store-apps) and the [App Store](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

In [447]:
print(google_play[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Of the columns available in the Google Play data, the ones most relevant to our analysis appear to be:
- App
- Category
- Rating
- Reviews
- Installs
- Type
- Price
- Content Rating
- Genres

In [448]:
print(app_store[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Of the columns available in the App Store data, the ones most relevant to our analysis appear to be:
- track_name
- currency
- price
- rating_count_tot
- user_rating
- cont_rating
- prime_genre
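
The column names listed above are later addressed by position (for example index `3` for `Reviews`). As a minimal sketch, assuming the header rows printed above, a small lookup can derive those positions from the names instead of counting by hand. The helper `column_indices` is hypothetical and not part of the notebook.

```python
# Header rows copied from the outputs above.
google_play_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                      'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                      'Current Ver', 'Android Ver']
app_store_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                    'rating_count_tot', 'rating_count_ver', 'user_rating',
                    'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                    'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

def column_indices(header, wanted):
    """Hypothetical helper: map each wanted column name to its position."""
    return {name: header.index(name) for name in wanted}

google_play_columns = column_indices(google_play_header, ['App', 'Reviews', 'Price'])
app_store_columns = column_indices(app_store_header,
                                   ['track_name', 'price', 'rating_count_tot'])

print(google_play_columns)  # {'App': 0, 'Reviews': 3, 'Price': 7}
print(app_store_columns)    # {'track_name': 1, 'price': 4, 'rating_count_tot': 5}
```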

***
### Deleting Wrong or Inaccurate Data
*First, there is a [discussion topic](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) about an error in the Google Play data. We will fix this by removing the entry in question from our local dataset.*

In [449]:
explore_dataset(google_play, start_index = 10472, end_index = 10474)

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




We can see that the row at `google_play[10473]` is missing its `Category` value, so it has 12 values instead of 13 and every later value is shifted one column to the left. Since we can't readily assume what category it originally belonged to, we will remove it from the dataset.

In [450]:
del google_play[10473]

# Confirm that the data is no longer in the dataset
explore_dataset(google_play, start_index = 10472, end_index = 10474)

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
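
The faulty row was located through a discussion topic, but the same kind of problem can also be found programmatically. The following is a minimal sketch, assuming a row is malformed when its length differs from the header's length. The helper `find_malformed_rows` and the three-column `sample` are illustrative stand-ins, not the real dataset.

```python
def find_malformed_rows(dataset):
    """Hypothetical helper: return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Tiny stand-in dataset; the second data row is missing a column.
sample = [
    ['App', 'Category', 'Rating'],
    ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]

malformed = find_malformed_rows(sample)
for index, row in malformed:
    print(index, row)
```

A length check only catches rows with missing or extra values; a row with the right number of columns but wrong content would still pass.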




***
### Removing Duplicate Entries
Reading through more discussion topics about the Google Play data, it appears there are duplicate entries for many apps. Since this would skew the analysis, we will identify and remove any duplicate entries.

In [451]:
def identify_duplicate_entries(dataset, index, remove_header = False):
    unique_entries = []
    duplicate_entries = []
    
    if remove_header:
        dataset = dataset[1:]
    
    for row in dataset:
        value = row[index]
        if value in unique_entries:
            duplicate_entries.append(value)
        else:
            unique_entries.append(value)
            
    print("Number of duplicate entries: ", len(duplicate_entries))
    print("\n")
    print("Examples of duplicate entries: ", duplicate_entries[:10])
    
    return duplicate_entries

In [452]:
google_play_duplicates = identify_duplicate_entries(google_play, 0, remove_header = True)

Number of duplicate entries:  1181


Examples of duplicate entries:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


In [453]:
app_store_duplicates = identify_duplicate_entries(app_store, 1, remove_header = True)

Number of duplicate entries:  2


Examples of duplicate entries:  ['Mannequin Challenge', 'VR Roller Coaster']
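
Checking membership in a list scans the whole list each time, which gets slow as the dataset grows. As a hedged alternative, the sketch below tracks seen names in a `set` and otherwise follows the same logic. The name `find_duplicates` and the sample rows are assumptions for illustration only.

```python
def find_duplicates(rows, index):
    """Hypothetical helper: names seen more than once, one entry per repeat occurrence."""
    seen = set()
    duplicates = []
    for row in rows:
        name = row[index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates

sample_rows = [['Box'], ['Slack'], ['Box'], ['Zenefits'], ['Box'], ['Slack']]
duplicates = find_duplicates(sample_rows, 0)
unique_names = {row[0] for row in sample_rows}

print(duplicates)  # ['Box', 'Box', 'Slack']
# Total rows minus repeat occurrences equals the number of distinct names.
print(len(sample_rows) - len(duplicates) == len(unique_names))  # True
```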


***
While it would be easy to just remove any duplicate entries, it is important that we first identify why there are duplicate entries.

We can work on the first duplicate entry for each dataset to try to identify any factors that we can use to ensure we are working with the best data for a particular entry.

In [454]:
def compare_duplicates(original_dataset, duplicate_dataset, index):
    for data in original_dataset:
        value = data[index]
        if value == duplicate_dataset[0]:
            print(data)

In [455]:
compare_duplicates(google_play, google_play_duplicates, 0)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


In [456]:
compare_duplicates(app_store, app_store_duplicates, 1)

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']


In both datasets, the field that differs between duplicate entries is the number of ratings: index `3` (`Reviews`) in the Google Play data and index `5` (`rating_count_tot`) in the App Store data.

A higher count most likely reflects a more recent snapshot of the same app, so we will keep only the entry with the *highest* number of ratings.

To do this, we will create a new dataset that keeps only one entry for each app: the one with the highest number of ratings.

In [457]:
def create_dataset_unique_entries(dataset, index_name, index_unique, remove_header = True):
    if remove_header:
        dataset = dataset[1:]
        
    unique_dict = {}
    for row in dataset:
        name = row[index_name]
        value = float(row[index_unique])
        if name in unique_dict and unique_dict[name] < value:
            unique_dict[name] = value
        elif name not in unique_dict:
            unique_dict[name] = value
            
    return unique_dict

We can confirm that this process was done successfully because the resulting dataset should have a number of entries equal to `Original Entries` - `Duplicate Entries`.

For the Google Play data, this should be:

In [458]:
print(len(google_play[1:]) - len(google_play_duplicates))

9659


In [459]:
google_play_unique_dict = create_dataset_unique_entries(google_play, 0, 3)
print(len(google_play_unique_dict))

9659


This checks out!

For the App Store data, this should be:

In [460]:
print(len(app_store[1:]) - len(app_store_duplicates))

7195


In [461]:
app_store_unique_dict = create_dataset_unique_entries(app_store, 1, 5)
print(len(app_store_unique_dict))

7195


This also checks out!

The next step is to use the dictionaries, which record the highest number of ratings for each app, to create a new *cleaned* dataset as a list of lists.

In [462]:
def create_clean_dataset(dataset, clean_dict, index_name, index_unique, remove_headers = True):
    cleaned_list = [] # Store new cleaned data
    already_added = [] # List of data that has already been added (non-unique)
    
    if remove_headers:
        cleaned_list.append(dataset[0])
        dataset = dataset[1:]
    
    for row in dataset:
        name = row[index_name]
        value = float(row[index_unique])
        if name not in already_added and value == clean_dict[name]:
            cleaned_list.append(row)
            already_added.append(name)
            
    return cleaned_list

In [463]:
google_play_cleaned = create_clean_dataset(google_play, google_play_unique_dict, 0, 3)
print(len(google_play_cleaned[1:]))
print(google_play_cleaned[:5])

9659
[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']]


In [464]:
app_store_cleaned = create_clean_dataset(app_store, app_store_unique_dict, 1, 5)
print(len(app_store_cleaned[1:]))
print(app_store_cleaned[:5])

7195
[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']]


The lengths of both of the cleaned datasets match what we expected, so we can validate that the process was a success!
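
The two-step approach above (build a dictionary of maximum counts, then filter the rows) can also be collapsed into a single pass. This is a sketch under stated assumptions, not the method used in this notebook: `keep_most_reviewed` is a hypothetical name, and `'Example App'` with its count is a made-up row added for contrast.

```python
def keep_most_reviewed(rows, index_name, index_count):
    """Hypothetical one-pass variant: keep, per name, the first row with the highest count."""
    best = {}
    for row in rows:
        name = row[index_name]
        count = float(row[index_count])
        # Replace the stored row only when a strictly higher count appears.
        if name not in best or count > float(best[name][index_count]):
            best[name] = row
    return list(best.values())

sample_rows = [
    ['Quick PDF Scanner + OCR FREE', '80805'],
    ['Example App', '10'],  # made-up row for illustration
    ['Quick PDF Scanner + OCR FREE', '80805'],
    ['Quick PDF Scanner + OCR FREE', '80804'],
]
deduplicated = keep_most_reviewed(sample_rows, 0, 1)
print(deduplicated)
```

One difference to be aware of: rows come back in order of each name's first appearance, which may not match the row order produced by `create_clean_dataset`.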

***
### Removing Other Unwanted Data
Recall that our company only develops mobile apps that are free to download and install. We also design our apps for an English-speaking audience.

So we will need to clean up the current datasets to remove all apps that are not free or are not primarily in English.

**Removing Non-free Apps**

For this step, we just need to loop through each entry in our datasets and keep the entries that have a price of $0.

However, the prices in each dataset are stored as strings, and some include the `$` character, so we need to account for this before comparing them with 0.

We can use the `ord()` function, which returns the integer Unicode code point of a character, and drop any character that is not a `.` or a digit. The code point for `.` is **46**, and the digits `0-9` are **48-57**, according to this [ASCII table](https://www.asciitable.com/).

In [465]:
def string_to_float(string):
    cleaned_string = ""
    for character in string:
        char_unicode = ord(character)
        if char_unicode == 46 or (char_unicode >= 48 and char_unicode <= 57):
            cleaned_string += character
    return float(cleaned_string)
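
As a hedged alternative to filtering by code point, the sketch below removes the `$` sign with `str.replace` and raises a clear error when nothing numeric is left, a case where `float("")` would otherwise fail with a less helpful message. `parse_price` is a hypothetical name, and the sample strings are assumed to be representative of the price formats in the two datasets.

```python
def parse_price(text):
    """Hypothetical alternative: drop a '$' sign, then convert to float."""
    cleaned = text.replace('$', '').strip()
    if not cleaned:
        raise ValueError(f"no numeric content in price {text!r}")
    return float(cleaned)

# Assumed sample price strings.
for raw in ['0', '0.0', '$4.99']:
    print(raw, '->', parse_price(raw))
```

Unlike the code-point filter, this version fails loudly on unexpected text instead of silently dropping characters, which can make bad rows easier to spot.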

In [466]:
def create_dataset_free_only(dataset, index_name, index_price, remove_headers = True):
    dataset_free = []
    
    if remove_headers:
        dataset_free.append(dataset[0])
        dataset = dataset[1:]
        
    for row in dataset:
        name = row[index_name]
        price = string_to_float(row[index_price])
        if price == 0:
            dataset_free.append(row)
    
    return dataset_free

In [467]:
google_play_free = create_dataset_free_only(google_play_cleaned, 0, 7)

In [468]:
print(google_play_free[:10])

[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone'

In [469]:
app_store_free = create_dataset_free_only(app_store_cleaned, 1, 4)

In [470]:
print(app_store_free[:10])

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1'], ['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624',

Great! Now we have cleaned datasets that contain only free apps.

**Removing Non-English Apps**

The next step is to remove all apps that are not aimed at an English-speaking audience.

The process is similar to the one above. We will check the Unicode code point of each character in the app name and drop any app whose name contains a character outside the ASCII range of 0-127.
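
To make the rule concrete, here is a minimal sketch of the per-name check on a few sample names. The helper `is_ascii_only` is hypothetical. Note that symbols such as `™` and emoji also have code points above 127, so a strict rule like this one rejects some names that are otherwise in English.

```python
def is_ascii_only(name):
    """Hypothetical helper: True when every character's code point is in 0-127."""
    return all(ord(character) <= 127 for character in name)

names = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播',
         'Docs To Go™ Free Office Suite', 'Instachat 😜']
results = {name: is_ascii_only(name) for name in names}
for name, ok in results.items():
    print(ok, name)
```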

In [471]:
def create_dataset_english(dataset, index_name, remove_headers = True):
    dataset_english = []
    if remove_headers:
        dataset_english.append(dataset[0])
        dataset = dataset[1:]
    
    for row in dataset:
        isEnglish = True
        name = row[index_name]
        for character in name:
            char_unicode = ord(character)
            if char_unicode > 127:
                isEnglish = False
        if isEnglish:
            dataset_english.append(row)
    
    return dataset_english

Now, we will test this function to ensure it is properly filtering out non-English apps:

In [472]:
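# Each test row is a one-element list, so row[0] is the full app name
test_list = [['Instagram'], ['爱奇艺PPS -《欢乐颂2》电视剧热播'], ['Docs To Go™ Free Office Suite'], ['Instachat 😜']]

test_list_english = create_dataset_english(test_list, 0, remove_headers = False)
print(test_list_english)

[['Instagram']]

Only `Instagram` passes: the `™` symbol and the emoji also fall outside the ASCII range, so those otherwise English names are filtered out as well.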


In [473]:
google_play_english = create_dataset_english(google_play_free, 0)
print(google_play_english[:10])

[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'Ap

In [474]:
app_store_english = create_dataset_english(app_store_free, 1)
print(app_store_english[:10])

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1'], ['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624',

**That wraps up the data cleaning for this initial analysis!**

***
# Data Analysis