# Profitable apps within the Google Play store and the Apple iOS app store

- I aim to identify which types of apps attract the most users
    - Mobile app analytics allows for development of apps which will attract high user engagement, increasing app revenue

- There are approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on the Google Play store (Sept. 2018)
    - Android holds about 53.2% of the smartphone market, while iOS holds about 43%
    
- To understand which apps attract users in both markets I will analyze two freely available data sets:
    - Google Play Store Apps
        - This data set contains details of more than 10,000 Google Play mobile applications
        - Web scraping tools were used to extract data from the Google Play store
        - You can download this data set [here](https://www.kaggle.com/lava18/google-play-store-apps)
        
    - Mobile App Statistics (Apple iOS app store)
        - This data set contains details of more than 7,000 Apple iOS mobile applications
        - R and Linux web scraping tools were used to extract data from the iTunes Search API on the Apple Inc. website
        - You can download this data set [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

-----

I will begin by opening each data set, followed by data exploration, cleaning, and analysis.

In [25]:
from csv import reader

## Apple iOS app store dataset
# encoding='utf8' assumes the CSV files are UTF-8 encoded
with open('AppleStore.csv', encoding='utf8', newline='') as open_apple:
    apple_list = list(reader(open_apple))
apple_list_header = apple_list[0]
apple_list = apple_list[1:] # store the data set without the header

## Google Play store dataset
with open('googleplaystore.csv', encoding='utf8', newline='') as open_google:
    google_list = list(reader(open_google))
google_list_header = google_list[0]
google_list = google_list[1:] # store the data set without the header

The `explore_data()` function breaks up the rows of each data set to print them in a more readable way. This function can also be used to print the number of rows and columns of each data set.

In [26]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [27]:
print('Apple iOS app store:')
print('\n')
print(apple_list_header)
print('\n')
explore_data(apple_list,0,3, True)

Apple iOS app store:


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We see that the Apple iOS app store data set contains 16 columns and 7,197 apps.

I expect to predominantly use the `track_name`, `price`, `rating_count_tot`, `user_rating`, `cont_rating`, and `prime_genre` columns in this analysis. Details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [28]:
print('Google Play Store:')
print('\n')
print(google_list_header)
print('\n')
explore_data(google_list,0,3, True)

Google Play Store:


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We see that the Google Play store data set contains 13 columns and 10,841 apps.

I expect to predominantly use the `App`, `Category`, `Rating`, `Reviews`, `Type`, `Price`, `Content Rating`, and `Genres` columns in this analysis. Details about each column can be found in the data set [documentation](https://www.kaggle.com/lava18/google-play-store-apps).

## Deleting Wrong Data

- Both data sets have dedicated discussion sections
- The Google Play store data set discussions have identified an error for row 10472 

Below I will print this row and compare it to the header and another row that is correct.

In [29]:
print(google_list[10472]) # incorrect row
print('\n')
print(google_list_header)  # header
print('\n')
print(google_list[0]) # correct row as example

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Row 10472 contains the app "Life Made WI-Fi Touchscreen Photo Frame". The row has only 12 values against the 13 header columns: the `Category` value is missing, so every later value is shifted one column to the left and the `Rating` column reads 19. The maximum rating for a Google Play app is 5, so this row is incorrect and I will remove it. I will print the number of rows (`len()`) of the Google Play store data set before and after deleting this row to confirm that only one row was removed.

In [30]:
print(len(google_list))
del google_list[10472] # only run this code once; re-running it would delete a valid row
print(len(google_list))
print('\n')
print(google_list[10472])

10841
10840


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
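The faulty row above was found through the data set discussions, but the same kind of error can also be detected programmatically, because a row with a missing value has fewer columns than the header. The following is a minimal sketch of that check on toy data; `find_malformed_rows`, `toy_header`, and `toy_rows` are hypothetical names introduced for illustration and are not part of the original analysis.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data standing in for google_list_header and google_list (assumption).
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Good App', 'TOOLS', '4.2'],
    ['Shifted App', '1.9'],            # category missing, so the row is one column short
    ['Another App', 'SOCIAL', '4.5'],
]

malformed = find_malformed_rows(toy_header, toy_rows)
for index, row in malformed:
    print(index, row)
```

Applied to the real data before the deletion, a check like this would be called as `find_malformed_rows(google_list_header, google_list)`. It only catches rows with a wrong number of values, not rows with wrong values in the right number of columns.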


## Removing Duplicate Entries

The Google Play store data set discussions also indicate that some apps have more than one entry. I will check this using Twitter and Instagram, two apps that are updated regularly and were first added to the store approximately 10 years ago.

In [35]:
for app in google_list:
    name = app[0]
    if name == 'Twitter':
        print(app)
print('\n')
for app in google_list:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11657972', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'July 30, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varie

As hypothesized, some apps have more than one entry within the Google Play store data set. Before removing these entries, I count the duplicates and list some examples below. Although the Apple iOS app store data set discussions did not mention any duplicates, I will run the same check on that data set just in case.

In [10]:
def duplicate_search(app_list):
    duplicate_apps = []
    unique_apps = []

    for each_line in app_list:
        name = each_line[0] # first column: the app name in the Google Play data set
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    return duplicate_apps # one entry per repeated occurrence

In [11]:
google_duplicates = duplicate_search(google_list)

print('Number of duplicate apps in Google Play store:', len(google_duplicates))
print('\n')
print('Examples of duplicate apps in Google Play store:', google_duplicates[:15])

Number of duplicate apps in Google Play store: 1181


Examples of duplicate apps in Google Play store: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In total, there are 1,181 cases where an app occurs more than once within the Google Play store data set.

In [12]:
apple_duplicates = duplicate_search(apple_list)
print('Number of duplicate apps in Apple iOS store:', len(apple_duplicates))
print('\n')
print('Examples of duplicate apps in Apple iOS store:', apple_duplicates[:15])

Number of duplicate apps in Apple iOS store: 0


Examples of duplicate apps in Apple iOS store: []


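No duplicate entries were found in the Apple iOS app store data set. Note that `duplicate_search()` compares the first column of each row, which in this data set is the app `id` rather than `track_name`, so this result confirms unique ids only.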
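Because the app name sits in a different column in each data set (index 0 in the Google Play header, index 1 in the Apple header), a duplicate check that takes the column index as a parameter can be pointed at either one. The following is a sketch under that assumption, using `collections.Counter` from the standard library; `count_duplicates`, `name_index`, and the toy rows are hypothetical and are not taken from the original notebook.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Map each value that occurs more than once to its number of extra entries."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n - 1 for name, n in counts.items() if n > 1}


# Toy rows shaped like the Apple data: id first, then track_name (assumption).
toy_apple_rows = [
    ['1', 'Alpha'],
    ['2', 'Beta'],
    ['3', 'Alpha'],
]

by_id = count_duplicates(toy_apple_rows, name_index=0)
by_name = count_duplicates(toy_apple_rows, name_index=1)
print('Duplicates by id:', by_id)
print('Duplicates by name:', by_name)
```

In this toy example the ids are all unique while one name repeats, which shows why the choice of column matters. The sum of the returned values equals the length of the list that `duplicate_search()` returns for the same column.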

To avoid counting some apps more than once during analysis, I will remove the duplicate entries found in the Google Play store data set, keeping only one entry per app. Looking at the Twitter entries printed above, the rows differ only in column 4 (index 3, `Reviews`) and column 11 (index 10, `Last Updated`), which hold the number of reviews and the date of the last update respectively. As column 4 differs more often, I will use it to decide which duplicate to keep and which to remove. The entry with the highest number of reviews will be kept, as it is likely the most recent entry, and a rating backed by more reviews is more reliable.


To facilitate this I will create a dictionary `reviews_max` where each key is a unique app name, and the value is the highest number of reviews of that app. I will then use this dictionary to create a new data set, containing only one entry per app.

In [36]:
reviews_max = {}
for each_line in google_list:
    name = each_line[0]
    n_reviews = float(each_line[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

I found previously that there are 1,181 app duplications, so the length of `reviews_max` should equal the difference between the length of the Google Play app store data set, and 1,181.

In [37]:
print('Expected length of Google Play store list:', len(google_list) - 1181)
print('Actual length of Google Play store list:', len(reviews_max))

Expected length of Google Play store list: 9659
Actual length of Google Play store list: 9659
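The expected length can also be cross-checked without relying on the earlier duplicate count, since the number of distinct app names is simply the size of a set built from the name column. The snippet below is a small sketch of that arithmetic on toy data; `toy_rows` is a hypothetical stand-in for `google_list`.

```python
# Toy rows standing in for google_list; only the name column (index 0) matters here.
toy_rows = [['Twitter'], ['Box'], ['Twitter'], ['Slack'], ['Twitter']]

unique_names = {row[0] for row in toy_rows}        # a set keeps one copy of each name
n_duplicates = len(toy_rows) - len(unique_names)   # extra entries beyond the first

print('Unique apps:', len(unique_names))
print('Duplicate entries:', n_duplicates)
```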


I will use `reviews_max` to remove the Google Play store data set duplicates, keeping only the entries with the highest number of reviews.

To do so I will initialize two empty lists, `google_clean` and `already_added`, loop through the Google Play data set, and for every entry isolate the name of the app and the number of reviews.

Each entry will be added to the `google_clean` list, and the corresponding app name to the `already_added` list, if:
- The number of reviews of the current app matches the number of reviews of that app within the `reviews_max` dictionary and
- The name of the app is not already in the `already_added` list
    - This supplementary condition accounts for cases where the highest number of reviews of a duplicate app is the same for more than one entry

In [40]:
google_clean = []
already_added = []
for each_entry in google_list:
    name = each_entry[0]
    n_reviews = float(each_entry[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_clean.append(each_entry)
        already_added.append(name)

The google_clean list is expected to now have 9659 entries. 

In [41]:
explore_data(google_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
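The two-step approach above (build `reviews_max`, then filter) can also be written as a single pass that stores the best row per app directly in a dictionary. This is an alternative sketch, not the method used in this notebook; `keep_most_reviewed` and the toy rows are hypothetical, and the column positions are assumed to match the Google Play header (name at index 0, reviews at index 3).

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        # Strict '>' means ties keep the earlier row, like the already_added check.
        if name not in best or n_reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows shaped like the Google Play data (assumption).
toy_rows = [
    ['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '100'],
    ['Box', 'BUSINESS', '4.2', '50'],
    ['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '120'],
    ['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '120'],
]

toy_clean = keep_most_reviewed(toy_rows)
for row in toy_clean:
    print(row)
```

One difference from the loop above: the result is ordered by the first appearance of each app name rather than by the position of the row that was kept. Dictionary lookups also avoid the repeated list scans of `name not in already_added`.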


## Removing Non-English Apps
