# Guided Project: Profitable App Profiles for the App Store and Google Play Markets

This is a Jupyter Notebook for a DataQuest guided project. The goal of this project is to analyse datasets relating to iOS and Android apps, to answer the following question:

* What types of apps are more likely to attract users?

## Datasets we are using

We will be using two datasets for this project:

* The first is [a dataset](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv) containing data from about 10,000 Android apps, collected in 2018. Documentation is available [here](https://www.kaggle.com/lava18/google-play-store-apps).
* The second is [a similar dataset](https://dq-content.s3.amazonaws.com/350/AppleStore.csv) containing data from about 7,000 iOS apps, collected in 2017. Documentation is available [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

## Importing data

We will start by importing the datasets into two lists, one called `ios_apps` and one called `android_apps`, and store the headers separately in `ios_apps_header` and `android_apps_header`.

In [4]:
import csv # import csv functionality to assist with reading the lists

# Importing iOS dataset into ios_apps and ios_apps_header
opened_ios_file = open(r"C:\Users\youss\Downloads\AppleStore.csv", encoding='unicode_escape')
read_ios_file = csv.reader(opened_ios_file)
list_ios_file = list(read_ios_file)
ios_apps = list_ios_file[1:]
ios_apps_header = list_ios_file[0]

# Importing Android dataset into android_apps and android_apps_header
opened_android_file = open(r"C:\Users\youss\Downloads\googleplaystore.csv", encoding='unicode_escape')
read_android_file = csv.reader(opened_android_file)
list_android_file = list(read_android_file)
android_apps = list_android_file[1:]
android_apps_header = list_android_file[0]

# Print sample rows to check list extracted correctly
print(ios_apps_header, ios_apps[1:5])
print(android_apps_header, android_apps[1:5]) 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] [['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']]
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] [['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967'
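A minimal sketch of the same import step, assuming the CSV files are UTF-8 encoded (an assumption; the encoding is not stated in the dataset documentation linked above). `read_csv_rows` is a hypothetical helper, and the in-memory sample stands in for a real file. The last lines illustrate one likely reason some app names in this notebook's output contain sequences such as `â\x80\x93`: that is what a single en dash looks like when its UTF-8 bytes are decoded with `unicode_escape`.

```python
import csv
import io

def read_csv_rows(file_obj):
    """Split an open CSV text file into (header, data_rows)."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# In-memory stand-in for: with open(path, encoding="utf8", newline="") as f:
sample = io.StringIO("App,Category\nU Launcher Lite \u2013 FREE,ART_AND_DESIGN\n")
header, apps = read_csv_rows(sample)
print(header)
print(apps[0])

# What an en dash (U+2013) becomes when its UTF-8 bytes are decoded
# as 'unicode_escape' instead of 'utf8'
garbled = "\u2013".encode("utf8").decode("unicode_escape")
print(repr(garbled), len(garbled))
```

Because the garbled form is three characters long, it can also affect any later step that counts non-ASCII characters in an app name.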

We can define a function `explore_data(dataset, start, end, rows_and_columns=False)` to print a slice of a dataset in a readable format. The function takes four parameters:

* `dataset` (list of lists), which is the data to explore; it should not include the header row
* `start` (integer), which is the index of the first row in the slice
* `end` (integer), which is the index at which the slice stops (this row is not included)
* `rows_and_columns` (Boolean, default `False`), which, when `True`, also prints the number of rows and columns in the whole dataset

In [5]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

For example, to see the first ten rows of data in `android_apps`, and show how many rows and columns are in the dataset, we can write the following:

In [6]:
# Show first ten rows of data in Android dataset, and show how many rows and columns are in the dataset.
# explore_data prints its results itself and returns None, so it does not need wrapping in print()
explore_data(android_apps, 0, 10, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite â\x80\x93 FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', 

## Data Cleaning Part 1: Removing Incorrect Data

If we look at [this discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) it becomes clear that one of the rows (at index 10472) has a missing value: the **Category** is absent, so the row has 12 values instead of 13 and every later value sits one column to the left of where it belongs.

In [8]:
print(android_apps_header)

# This row has 12 values instead of 13: Category is missing, so the rating-like value '1.9' appears at index 1
print(android_apps[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
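Rows like this can also be found without knowing the index in advance, by comparing each row's length with the header's. The following is a minimal sketch on made-up rows; `find_malformed_rows` is a hypothetical helper, not part of the guided project.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    return [(index, row) for index, row in enumerate(rows)
            if len(row) != len(header)]

# Made-up sample in a simplified Android layout
header = ['App', 'Category', 'Rating']
rows = [
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
    ['Photo Frame', '1.9'],            # Category missing, values shifted left
    ['Box', 'BUSINESS', '4.2'],
]

for index, row in find_malformed_rows(header, rows):
    print(index, row)
```

Applied to the real data, the call would be `find_malformed_rows(android_apps_header, android_apps)`.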


We can remove it with the `del` statement, and confirm the deletion by printing the same index and checking that a different app is displayed.

In [9]:
del android_apps[10472]
print(android_apps[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


## Data Cleaning Part 2: Removing Duplicate Data

Discussion on [this page](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps/discussion/106176) regarding the iOS Apps Dataset also suggests there is duplicate data in this dataset.  We can write a function that iterates over the iOS dataset and isolates any duplicate apps and their index, and then remove them if necessary.

In [10]:
unique_app_names = []
duplicate_app_names = []  
    
def duplicate_apps_detector(dataset, index_of_app_name):

# Iterate over the dataset and build up two lists for unique and duplicate names

    for row in dataset:
        app_name = row[index_of_app_name] # index is 1 for iOS
        if app_name not in unique_app_names:
            unique_app_names.append(app_name)
        else:
            duplicate_app_names.append(app_name)
    
    return duplicate_app_names

# Print results
print("iOS Duplicate Apps:") # 'Mannequin Challenge', and 'VR Roller Coaster'
print(duplicate_apps_detector(ios_apps, 1))
print('\n')
print("Number of iOS Duplicate Apps:") # 2
print(len(duplicate_app_names))

iOS Duplicate Apps:
['Mannequin Challenge', 'VR Roller Coaster']


Number of iOS Duplicate Apps:
2
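A repeated name does not by itself prove that two rows describe the same app, so it helps to list an identifying column next to each repeated name. The sketch below does this on made-up rows; `ids_for_repeated_names` is a hypothetical helper and the ids are invented for illustration.

```python
def ids_for_repeated_names(rows, name_index, id_index):
    """Map each app name that occurs more than once to the list of its ids."""
    ids_by_name = {}
    for row in rows:
        ids_by_name.setdefault(row[name_index], []).append(row[id_index])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}

# Made-up rows in the iOS layout: id at index 0, track_name at index 1
sample_rows = [
    ['1001', 'App A'],
    ['1002', 'App B'],
    ['1003', 'App A'],
]

repeated = ids_for_repeated_names(sample_rows, name_index=1, id_index=0)
print(repeated)
```

If the ids listed for a name differ, the rows are probably distinct apps that happen to share a name.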


However, inspecting the `id` for the duplicate apps shows these are in fact different apps, as they have different ids, so no action is required.

Repeating this for Android apps reveals a much larger problem though.

In [11]:
unique_app_names = []
duplicate_app_names = []  
    
def duplicate_apps_detector(dataset, index_of_app_name):

# Iterate over the dataset and build up two lists for unique and duplicate names

    for row in dataset:
        app_name = row[index_of_app_name] # index is 0 for android_apps
        if app_name not in unique_app_names:
            unique_app_names.append(app_name)
        else:
            duplicate_app_names.append(app_name)
    
    return duplicate_app_names

# Print results
print("Android Duplicate Apps:") # Too many to mention
print(duplicate_apps_detector(android_apps, 0))
print('\n')
print("Number of Android Duplicate Apps:") # 1,181
print(len(duplicate_app_names))

Android Duplicate Apps:
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express', 'Accounting App - Zoho Books', 'Invoice & Time Tracking - Zoho', 'join.me - Simple Meetings', 'Invoice 2go â\x80\x94 Professional Invoices and Estimates', 'SignEasy | Sign and Fill PDF and other Documents', 'Quick PDF Scanner + OCR FREE', 'Genius Scan - PDF Scanner', 'Tiny Scanner - PDF Scanner App', 'Fast Scanner : Free PDF Scan', 'Mobile Doc Scanner (MDScan) Lite', 'TurboScan: scan documents and receipts in PDF', 'Tiny Scanner Pro: PDF Doc Scan', 'Docs To Goâ\x84¢ Free Office Suite', 'Off

This is covered in [this thread](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/67894) and we will need to find a way to de-duplicate these whilst maintaining the usefulness of the data.  

One approach would be to keep the entry with the highest number of reviews, as this is likely to be the most recently collected row for that app.

To do this, we can build a dictionary that contains app names, and the *highest* number of reviews for that app.

In [12]:
reviews_max = {}

for row in android_apps:
    app_name = row[0]
    reviews = float(row[3])
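    if app_name in reviews_max and reviews_max[app_name] < reviews:
        reviews_max[app_name] = reviews # Higher count found for an app we have already seen
    elif app_name not in reviews_max:
        reviews_max[app_name] = reviews # First time we see this app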
    
print("Total length of android_apps = " + str(len(android_apps)))
print("Number of duplicate apps = 1181")
print("Number of unique rows = " + str(len(reviews_max))) # Expecting this to be 9659

Total length of android_apps = 10840
Number of duplicate apps = 1181
Number of unique rows = 9659
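The "keep the highest value per key" pattern can be checked on a small made-up sample where the largest count is not the last one seen. This is a sketch only; `max_reviews_by_app` is a hypothetical helper and the review counts are invented.

```python
def max_reviews_by_app(rows, name_index, reviews_index):
    """Return a dict mapping each app name to its highest review count."""
    reviews_max = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # -1.0 is a sentinel below any real count, so a new name is always stored
        if reviews > reviews_max.get(name, -1.0):
            reviews_max[name] = reviews
    return reviews_max

# Made-up rows: name at index 0, review count at index 1
sample_rows = [
    ['Box', '159'],
    ['Slack', '51507'],
    ['Box', '161'],
    ['Box', '160'],   # lower than an earlier row, so it must not win
]

print(max_reviews_by_app(sample_rows, 0, 1))
```

Note that the number of keys in the dictionary is the same whether it stores the highest or the most recently seen value, so a matching row count alone does not confirm that the maximum was kept.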


We can now use `reviews_max` to produce a new version of `android_apps` that contains only de-duplicated data. The code below loops through the whole dataset and checks whether each row's review count matches the maximum for that app. If it does, the row is added to the cleaned data; otherwise it is skipped. An auxiliary list, `android_apps_already_added`, keeps track of which apps have already been added, so that an app with several rows sharing the same maximum review count is only included once.

In [13]:
android_apps_clean = []
android_apps_already_added = []

for row in android_apps:
    app_name = row[0]
    reviews = float(row[3])
    if reviews == reviews_max[app_name] and app_name not in android_apps_already_added:
        android_apps_clean.append(row)
        android_apps_already_added.append(app_name)

# Checking output is correct
print("Expected length of android_apps_clean: 9659")
print("Actual length of android_apps_clean: " + str(len(android_apps_clean)))

if len(android_apps_clean) == 9659:
    print("Check complete!")

Expected length of android_apps_clean: 9659
Actual length of android_apps_clean: 9659
Check complete!
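Membership tests on a list get slower as the list grows, so one possible variation is to track the already-added names in a `set`. The sketch below assumes a `reviews_max`-style dictionary is passed in; `keep_one_row_per_app` is a hypothetical helper and the sample rows are made up.

```python
def keep_one_row_per_app(rows, reviews_max, name_index=0, reviews_index=3):
    """Keep the first row per app whose review count equals that app's maximum."""
    kept = []
    already_added = set()   # set membership checks are O(1) on average
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if reviews == reviews_max[name] and name not in already_added:
            kept.append(row)
            already_added.add(name)
    return kept

# Made-up rows: name at index 0, review count at index 1
sample_rows = [
    ['Box', '159'],
    ['Box', '161'],
    ['Slack', '10'],
    ['Box', '161'],   # same maximum again, must be skipped
]
sample_max = {'Box': 161.0, 'Slack': 10.0}

for row in keep_one_row_per_app(sample_rows, sample_max, reviews_index=1):
    print(row)
```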


## Data Cleaning Part 3: Removing Apps not intended for English-speaking users

The project's goal is to identify trends for apps intended for English-speaking users. Whilst the data does not have a clear flag for language, we can remove apps that are clearly designed for non-English speakers, for instance those where the name of the app is written in characters outside the English alphabet.

We can identify such app names by iterating over the characters in the name and counting how many have a code point (as returned by `ord()`) greater than 127, since the [ASCII](https://en.wikipedia.org/wiki/ASCII) range 0-127 covers the characters commonly used in English text. If **three or more characters** fall outside that range, we treat the app as non-English.

*Note: the three-character threshold ensures that an app name containing an emoji or a trademark symbol is not incorrectly labelled as non-English.*
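To make the threshold concrete, the sketch below prints the code points of a few characters and counts the non-ASCII characters in some sample names. `count_non_ascii` is a hypothetical helper used only for illustration; the characters are written as escapes so the snippet does not depend on the file's encoding.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range 0-127."""
    return sum(1 for character in text if ord(character) > 127)

# 'a', 'Z', trademark sign, an emoji, and a Chinese character
for character in ['a', 'Z', '\u2122', '\U0001F61C', '\u7231']:
    print(repr(character), ord(character))

print(count_non_ascii('Docs To Go\u2122 Free Office Suite'))  # one symbol
print(count_non_ascii('Instachat \U0001F61C'))                 # one emoji
print(count_non_ascii('\u7231\u5947\u827aPPS'))                # three Chinese characters
```

A name with a single symbol or emoji stays under the threshold, while a name written mostly in a non-Latin script reaches it quickly.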

In [14]:
def is_English_detector(app_name): # This function will return False if the app is not intended for English speakers
    
    number_non_English_characters = 0
    
    for character in app_name:
        if ord(character) > 127:
            number_non_English_characters += 1
    
    if number_non_English_characters >= 3:
        return False
    else:
        return True

# Examples

print(is_English_detector('Instagram')) # Expecting True
print(is_English_detector('爱奇艺PPS -《欢乐颂2》电视剧热播')) # Expecting False
print(is_English_detector('Docs To Go™ Free Office Suite')) # Expecting True
print(is_English_detector('Instachat 😜')) # Expecting True

True
False
True
True


We can now use this helper function to loop through the iOS dataset and the cleaned Android data, and exclude apps that do not appear to be intended for English-speaking audiences. Doing so removes 1,403 iOS apps and 391 Android apps.

In [15]:
cleaned_English_ios_apps = []
cleaned_English_Android_apps = []

for row in ios_apps:
    app_name = row[1]
    if is_English_detector(app_name):
        cleaned_English_ios_apps.append(row)

for row in android_apps_clean:
    app_name = row[0]
    if is_English_detector(app_name):
        cleaned_English_Android_apps.append(row)

# Checking if this has removed any apps:

print("Original number rows in iOS Dataset: " + str(len(ios_apps)))
print("Length of new iOS Dataset: " + str(len(cleaned_English_ios_apps)))
print("Number of apps removed: " + str((len(ios_apps)) - len(cleaned_English_ios_apps)))

print("Original number rows in Android Dataset: " + str(len(android_apps_clean)))
print("Length of new Android Dataset: " + str(len(cleaned_English_Android_apps)))
print("Number of apps removed: " + str((len(android_apps_clean)) - len(cleaned_English_Android_apps)))

Original number rows in iOS Dataset: 7197
Length of new iOS Dataset: 5794
Number of apps removed: 1403
Original number rows in Android Dataset: 9659
Length of new Android Dataset: 9268
Number of apps removed: 391


## Data Cleaning Part 4: Removing Paid Apps

The goal of our analysis is to understand trends in free-to-download apps rather than paid apps, so we need to filter out the paid apps. This is straightforward: the iOS dataset stores the price as a number in `'price'` (index 4), and the Android dataset marks free apps with the string `'Free'` in `'Type'` (index 6).

In [16]:
cleaned_English_free_ios_apps = []
cleaned_English_free_Android_apps = []

for row in cleaned_English_ios_apps:
    price = float(row[4])
    if price == 0.00: # iOS uses floats to store price
        cleaned_English_free_ios_apps.append(row)
        
for row in cleaned_English_Android_apps:
    app_type = row[6] # 'Type' column
    if app_type == 'Free': # Android dataset uses this string to denote a free app
        cleaned_English_free_Android_apps.append(row)
        
# Checking if this has removed any apps:

print("Original number rows in iOS Dataset: " + str(len(cleaned_English_ios_apps)))
print("Length of new iOS Dataset: " + str(len(cleaned_English_free_ios_apps)))
print("Number of apps removed: " + str((len(cleaned_English_ios_apps)) - len(cleaned_English_free_ios_apps)))
print('\n')
print("Original number rows in Android Dataset: " + str(len(cleaned_English_Android_apps)))
print("Length of new Android Dataset: " + str(len(cleaned_English_free_Android_apps)))
print("Number of apps removed: " + str((len(cleaned_English_Android_apps)) - len(cleaned_English_free_Android_apps)))

Original number rows in iOS Dataset: 5794
Length of new iOS Dataset: 2970
Number of apps removed: 2824


Original number rows in Android Dataset: 9268
Length of new Android Dataset: 8542
Number of apps removed: 726


## Data Analysis Part 1: Most Common Apps By Genre

The findings of this project will help us validate what strategy we might want to take for building and validating an app.  We are concerned with finding apps that work well in both iOS and Android stores, which are free-to-download.  A good place to start will be with genre, and we can build frequency tables for both datasets to determine the most common genre.  Whilst knowing that a genre is popular is not necessarily a route to success, it will at least identify which apps are most common, before we look at other aspects such as ratings and reviews.

The columns to use will be `'prime_genre'` (index 11) for iOS, and `'Genres'` (index 9) and `'Category'` (index 1) for Android.

In [17]:
# Function that takes any dataset and any column index, and builds a frequency table dictionary
def freq_table(dataset, index):
    frequency_table = {}
    for row in dataset:
        value = row[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
     
    return frequency_table

# Example for iOS prime_genre
print(freq_table(cleaned_English_free_ios_apps, 11))

{'Social Networking': 92, 'Photo & Video': 151, 'Games': 1759, 'Music': 63, 'Reference': 16, 'Health & Fitness': 58, 'Weather': 26, 'Travel': 33, 'Shopping': 74, 'News': 40, 'Navigation': 5, 'Lifestyle': 43, 'Entertainment': 224, 'Food & Drink': 26, 'Sports': 64, 'Finance': 33, 'Education': 113, 'Productivity': 50, 'Utilities': 67, 'Book': 8, 'Business': 16, 'Catalogs': 3, 'Medical': 6}


With a second helper function, `display_table()`, we can convert this frequency table into a list of tuples sorted in descending order, which shows the most common genres more clearly.

In [18]:
# Function that takes any dataset and any column index, and uses the existing freq_table function to print a table sorted in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

# Display tables (display_table prints its results itself and returns None, so it is not wrapped in print())
print("iOS Apps - frequency by genre:")
print("\n")
display_table(cleaned_English_free_ios_apps, 11)
print("\n")
print("Android Apps - frequency by genre:")
print("\n")
display_table(cleaned_English_free_Android_apps, 9)
print("\n")
print("Android Apps - frequency by category:")
print("\n")
display_table(cleaned_English_free_Android_apps, 1)

iOS Apps - frequency by genre:


Games : 1759
Entertainment : 224
Photo & Video : 151
Education : 113
Social Networking : 92
Shopping : 74
Utilities : 67
Sports : 64
Music : 63
Health & Fitness : 58
Productivity : 50
Lifestyle : 43
News : 40
Travel : 33
Finance : 33
Weather : 26
Food & Drink : 26
Reference : 16
Business : 16
Book : 8
Medical : 6
Navigation : 5
Catalogs : 3


Android Apps - frequency by genre:


Tools : 727
Entertainment : 520
Education : 465
Business : 398
Productivity : 337
Lifestyle : 336
Finance : 324
Medical : 310
Sports : 296
Personalization : 278
Communication : 277
Health & Fitness : 265
Action : 264
Photography : 254
News & Magazines : 243
Social : 228
Travel & Local : 196
Shopping : 190
Books & Reference : 186
Simulation : 175
Arcade : 158
Dating : 154
Video Players & Editors : 149
Casual : 148
Maps & Navigation : 118
Food & Drink : 108
Puzzle : 98
Racing : 86
Role Playing : 79
Auto & Vehicles : 79
Libraries & Demo : 78
Strategy : 75
House & Home : 69
Wea

### Findings

For iOS:

* The most common genre is Games (by some way), followed by Entertainment
* It is fair to say that the majority of apps are designed for entertainment, rather than having a practical purpose, although Education is fourth-highest

For Android:

* Family is the largest category (again by some way), followed by Game and Tools

This suggests that whilst Game is not the top category on Google Play, games are certainly prevalent on both platforms.
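To put "by some way" into numbers, counts can be converted to percentages. The sketch below uses three iOS genre counts copied from the frequency table above and lumps the remaining free English iOS apps into a made-up `'Other'` bucket (2,970 minus the three counts); `to_percentages` is a hypothetical helper.

```python
def to_percentages(frequency_table):
    """Convert a dict of counts into a dict of percentages of the total."""
    total = sum(frequency_table.values())
    return {key: count / total * 100 for key, count in frequency_table.items()}

# Counts from the iOS frequency table; 'Other' = 2970 - 1759 - 224 - 151
ios_counts = {'Games': 1759, 'Entertainment': 224, 'Photo & Video': 151, 'Other': 836}

ios_percentages = to_percentages(ios_counts)
for genre, percentage in ios_percentages.items():
    print(genre, ':', round(percentage, 2))
```

On these figures Games accounts for well over half of the free English iOS apps.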

## Data Analysis Part 2: Most Popular Apps by Genre

Whilst knowing what the most common apps are is useful, knowing the most popular genres is crucial to understanding what will be received well in the market.

We can estimate the popularity of a genre from how many users its apps have on average. Both datasets provide user ratings, but only the Google Play dataset provides `'Installs'`, so for iOS apps we will use the total number of ratings (`'rating_count_tot'`, index 5) as a proxy for installs, and average it per genre.

In [19]:
# Generate frequency table using helper function created above
ios_genre_freq_table = freq_table(cleaned_English_free_ios_apps, 11)

# Loop over genre in the frequency table, search App Store Dataset for that genre, and keep count of user ratings and number of ratings
for genre in ios_genre_freq_table:
    total = 0 # Counter for number of ratings
    len_genre = 0 # Counter for number times that genre appears (to calculate average)
    for row in cleaned_English_free_ios_apps:
        genre_app = row[11]
        if genre_app == genre:
            number_ratings = float(row[5])
            total += number_ratings
            len_genre += 1
    avg_rating = total / len_genre
    print(genre, avg_rating)

Social Networking 77734.16304347826
Photo & Video 29056.139072847684
Games 21970.818646958498
Music 55396.01587301587
Reference 84016.5625
Health & Fitness 19418.620689655174
Weather 48275.57692307692
Travel 34115.57575757576
Shopping 28517.72972972973
News 22800.425
Navigation 102592.0
Lifestyle 17260.53488372093
Entertainment 14782.25
Food & Drink 33333.92307692308
Sports 24458.953125
Finance 26729.090909090908
Education 6086.513274336283
Productivity 22842.22
Utilities 11402.611940298508
Book 16671.0
Business 6412.8125
Catalogs 5195.0
Medical 612.0


Perhaps surprisingly, Navigation apps have the highest average number of ratings, with Reference and Social Networking also very high. Equally surprisingly, although Games is by far the most common genre, its apps receive fewer ratings on average than those in many other genres.
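The per-genre average can also be computed in a single pass over the rows and then sorted, which makes the ranking easier to read. This is a sketch on made-up rows, not the method used in the cell above; `average_by_group` is a hypothetical helper.

```python
def average_by_group(rows, group_index, value_index):
    """Return a dict mapping each group to the mean of its numeric values."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows: genre at index 0, number of ratings at index 1
sample_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Navigation', '1000'],
]

averages = average_by_group(sample_rows, 0, 1)
# Sort by average, highest first
for group, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(group, average)
```

For the iOS data the equivalent call would be `average_by_group(cleaned_English_free_ios_apps, 11, 5)`.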

Let's repeat the process for the Play Store, looking at the Category of the app.  We can use the installs column for this rather than use number of ratings as a proxy, though as these are in bands rather than absolute numbers, we will need to assume that `'1,000,000+'` is 1,000,000, `'10,000+'` is 10,000, and so on.
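The band-to-number assumption described above can be sketched as a small helper that strips the `+` and the comma separators before converting to a number. `parse_installs` is a hypothetical name used for illustration.

```python
def parse_installs(installs):
    """Convert an install band such as '1,000,000+' to its lower bound as a float."""
    return float(installs.replace('+', '').replace(',', ''))

for band in ['10,000+', '1,000,000+', '0']:
    print(band, '->', parse_installs(band))
```

Since each band is treated as its lower bound, the resulting averages are best read as conservative estimates rather than exact install counts.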

In [20]:
# Generate frequency table using helper function created above
android_category_freq_table = freq_table(cleaned_English_free_Android_apps, 1)
print(android_category_freq_table)

# Loop over each category in the frequency table, search the Google Play dataset for that category, and keep a running total of installs
for category in android_category_freq_table:
    total = 0 # Running total of installs
    len_category = 0 # Counter for number of apps in that category (to calculate average)
    for row in cleaned_English_free_Android_apps:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+', '') # Remove the + at the end of the string
            installs = installs.replace(',', '') # Remove comma separators
            total += float(installs)
            len_category += 1
    avg_installs = total / len_category
    print(category, avg_installs)

{'ART_AND_DESIGN': 56, 'AUTO_AND_VEHICLES': 79, 'BEAUTY': 53, 'BOOKS_AND_REFERENCE': 186, 'BUSINESS': 398, 'COMICS': 49, 'COMMUNICATION': 277, 'DATING': 154, 'EDUCATION': 98, 'ENTERTAINMENT': 72, 'EVENTS': 60, 'FINANCE': 324, 'FOOD_AND_DRINK': 108, 'HEALTH_AND_FITNESS': 265, 'HOUSE_AND_HOME': 69, 'LIBRARIES_AND_DEMO': 78, 'LIFESTYLE': 337, 'GAME': 799, 'FAMILY': 1634, 'MEDICAL': 310, 'SOCIAL': 228, 'SHOPPING': 190, 'PHOTOGRAPHY': 254, 'SPORTS': 292, 'TRAVEL_AND_LOCAL': 197, 'TOOLS': 728, 'PERSONALIZATION': 278, 'PRODUCTIVITY': 337, 'PARENTING': 55, 'WEATHER': 67, 'VIDEO_PLAYERS': 149, 'NEWS_AND_MAGAZINES': 243, 'MAPS_AND_NAVIGATION': 118}