# Most Profitable App Store and Google Play Apps

This project sets out to analyze app data obtained from the App Store and Google Play. It is meant to give developers a better understanding of what users want and to help them build more attractive apps.

As of 2018 there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. To analyze them we will use two samples of relevant data:

1. A data set of approximately 10000 Android apps from Google Play:

https://dq-content.s3.amazonaws.com/350/googleplaystore.csv

2. A data set of approximately 7000 iOS apps from the App Store:

https://dq-content.s3.amazonaws.com/350/AppleStore.csv

To explore these data sets we will use the following **function**:



In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader

opened_file = open(r'C:\Nauka\Python\Guided Project_ Profitable App Profiles for the App Store and Google Play Markets\googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
google_data = list(read_file)
google_header = google_data[0]
google_no_header = google_data[1:]
    

In [3]:
from csv import reader

opened_file = open(r'C:\Nauka\Python\Guided Project_ Profitable App Profiles for the App Store and Google Play Markets\AppleStore.csv',encoding='utf8')
read_file = reader(opened_file)
apple_data = list(read_file)
apple_header = apple_data[0]
apple_no_header = apple_data[1:]
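
The two loading cells above repeat the same steps and leave the files open. As a minimal sketch, the pattern could be wrapped in one helper; `load_csv` is a hypothetical name, not part of the original notebook, and the in-memory sample stands in for the real CSV files:

```python
import csv
import io


def load_csv(file_obj):
    """Return (header, rows) read from an already opened text file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# Toy in-memory stand-in for a file on disk. With a real file one would write:
#     with open(path, encoding='utf8', newline='') as f:
#         header, rows = load_csv(f)
# so that the file is closed automatically when the block ends.
sample = io.StringIO("App,Category\nFoo,TOOLS\nBar,GAME\n")
header, rows = load_csv(sample)

print(header)      # ['App', 'Category']
print(len(rows))   # 2
```

Passing a file object instead of a path keeps the helper easy to try out without touching the file system.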

## Data Sets

The Google Play data set contains **10841** records (header row excluded), split into **13** columns with the following headers:

In [4]:
explore_data(google_data,0, 1, rows_and_columns = True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows: 10842
Number of columns: 13


If you're having trouble understanding what a particular column describes, please follow the link to the original documentation for the data set:

https://www.kaggle.com/lava18/google-play-store-apps

The App Store data set contains **7197** records (header row excluded), split into **16** columns with the following headers:

In [5]:
explore_data(apple_data,0, 1, rows_and_columns = True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows: 7198
Number of columns: 16


If you're having trouble understanding what a particular column describes, please follow the link to the original documentation for the data set:

https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps


## Data Cleaning Process

#### 1. Removing entries with missing data

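According to this discussion on the Google Play data set (https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), one of the entries has a missing value that causes the following columns to shift. The row printed below has no "Category" value, so the rating '1.9' sits in the Category position and the row has only 12 fields instead of 13. The entry will be deleted.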

In [6]:
explore_data(google_no_header, 10472, 10473)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




In [7]:
del google_no_header[10472]

In [8]:
explore_data(google_no_header, 10472, 10473)

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
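
The faulty row above was located from a forum discussion. As a hedged sketch, rows of this kind could also be found programmatically by comparing each row's length to the header's; `find_malformed_rows` and the toy data are illustrative assumptions, not part of the original notebook:

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose field count differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]


toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Foo', 'TOOLS', '4.2'],
    ['Bar', '1.9'],           # 'Category' is missing, so the values shift left
    ['Baz', 'GAME', '3.8'],
]

bad = find_malformed_rows(toy_header, toy_rows)
print(bad)  # [1]
```

This check only catches rows where a field is absent altogether; a field that is present but empty (such as `''`) leaves the row length unchanged and would need a separate test.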




#### 2. Removing duplicate entries

The Google Play data set contains more than one entry for a number of apps:

In [9]:
duplicate_apps = []
unique_apps = []

for row in google_no_header:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print("Number of duplicate apps:", len(duplicate_apps))
print('\n')
print("Examples of duplicate apps:", duplicate_apps[:10])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
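
The loop above checks membership in a list, which gets slower as the list grows. A possible alternative, sketched here on toy data, counts names with `collections.Counter`; `count_duplicates` is a hypothetical helper name:

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of surplus rows, {name: occurrences}) for repeated names."""
    counts = Counter(row[name_index] for row in rows)
    repeated = {name: n for name, n in counts.items() if n > 1}
    # Every occurrence beyond the first one counts as a duplicate row.
    surplus = sum(n - 1 for n in repeated.values())
    return surplus, repeated


toy = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zoom']]
surplus, repeated = count_duplicates(toy)

print(surplus)   # 2
print(repeated)  # {'Box': 3}
```

The surplus count follows the same convention as `len(duplicate_apps)` in the cell above: the first occurrence of a name is not counted, every later one is.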


Below you can see the examples of multiple entries for the same application:

In [10]:
for row in google_no_header:
    app_name = row[0]
    if app_name == "Facebook":
        print(google_header)
        print('\n')
        print(row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


In [11]:
for row in google_no_header:
    app_name = row[0]
    if app_name == "Twitter":
        print(google_header)
        print('\n')
        print(row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11657972', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 

In [12]:
for row in google_no_header:
    app_name = row[0]
    if app_name == "Instagram":
        print(google_header)
        print('\n')
        print(row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['App', 'Cate

The duplicates will be removed by taking into account the **number of reviews**. The rationale is that the entry with the highest number of reviews is most likely the most recent one, while entries with fewer reviews were probably collected earlier. For each app, only the entry with the highest number of reviews will be kept.

To remove the duplicates, we first build a **dictionary** in which each key is a unique app name and the corresponding value is the highest number of reviews recorded for that app:

In [15]:
max_reviews = {}

for row in google_no_header:
    name = row[0]
    no_of_reviews = float(row[3])
    if name in max_reviews and max_reviews[name] < no_of_reviews:
        max_reviews[name] = no_of_reviews
    elif name not in max_reviews:
        max_reviews[name] = no_of_reviews
        
print(max_reviews['Instagram'])
print(len(max_reviews))
        

66577446.0
9659
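
The same dictionary can be built a little more compactly with `dict.get` and `max`. This is only a sketch on toy rows; `build_max_reviews` and its index parameters are illustrative names, with the defaults matching the 'App' and 'Reviews' column positions shown in the header above:

```python
def build_max_reviews(rows, name_index=0, reviews_index=3):
    """Map each app name to the highest review count found among its rows."""
    max_reviews = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # For a name not seen before, get() falls back to the current value.
        max_reviews[name] = max(reviews, max_reviews.get(name, reviews))
    return max_reviews


toy = [
    ['A', '', '', '10'],
    ['B', '', '', '5'],
    ['A', '', '', '12'],
    ['A', '', '', '11'],
]
toy_max = build_max_reviews(toy)
print(toy_max)  # {'A': 12.0, 'B': 5.0}
```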


By comparing the original data set to the dictionary max_reviews created above, a new cleaned up list of Google Play apps named **android_clean** will be created:

In [None]:
android_clean = []   # one row per app: the row with the most reviews
already_added = []   # names already copied, guards against ties

for row in google_no_header:
    name = row[0]
    no_of_reviews = float(row[3])
    if no_of_reviews == max_reviews[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
print(len(android_clean))
print(len(already_added))
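
The `already_added` check matters when two rows for the same app share the maximum review count, as the two identical Twitter rows shown earlier do. The sketch below shows the tie case on toy data and uses a set for faster membership tests; `dedupe_by_max_reviews` is a hypothetical name:

```python
def dedupe_by_max_reviews(rows, max_reviews, name_index=0, reviews_index=3):
    """Keep one row per app: the first row whose review count equals the maximum."""
    clean = []
    seen = set()  # names already copied into `clean`
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if reviews == max_reviews[name] and name not in seen:
            clean.append(row)
            seen.add(name)
    return clean


toy = [
    ['A', '', '', '12'],
    ['A', '', '', '12'],  # tie: same maximum, must not be added twice
    ['A', '', '', '11'],
    ['B', '', '', '5'],
]
toy_max = {'A': 12.0, 'B': 5.0}

clean = dedupe_by_max_reviews(toy, toy_max)
print(len(clean))  # 2
```

Without the `seen` check, both rows for `'A'` with 12 reviews would pass the first condition and the duplicate would survive.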

#### 3. Removing entries for apps not in English

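App names containing symbols that are not commonly used in English will be removed from the data set. To determine whether an app name contains a character outside the English alphabet, the function below iterates over the name and checks whether any character has a code point (as returned by `ord()`) higher than 127, the upper bound of the ASCII range. Despite its name, the function returns `True` when the name looks English and `False` otherwise: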

In [62]:
def non_English(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True
        
    
print(non_English('Instachat 😜'))

False


The function above, however, also rejects names containing special symbols that do appear in English app names even though they are not part of the English alphabet, such as "™" or emojis. To compensate, only records whose app name contains more than 3 non-ASCII characters will be removed. While not ideal, this should make the data significantly cleaner:

In [68]:
def non_English(string):
    if_false = []
    for character in string:
        if ord(character) > 127:
            if_false.append(character)
        if len(if_false) > 3:
            return False
    return True

print(non_English('爱奇艺艺'))

False
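
To see how the threshold behaves, the check can be restated as a simple count and tried on a few names. This is a sketch with a hypothetical function name and a configurable limit, not a replacement for the function above:

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Instachat 😜'))                   # True: one emoji
print(is_probably_english('Docs To Go™ Free Office Suite'))  # True: one symbol
print(is_probably_english('爱奇艺艺'))                        # False: four characters
```

A name with exactly three non-ASCII characters still passes, which is the price of tolerating emojis and trademark signs.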


Using the revised function, new data sets were created:

In [72]:
android_english = []

for row in android_clean:
    name = row[0]
    if non_English(name) == True:
        android_english.append(row)
        
print(len(android_english))

9614


In [81]:
apple_english = []

for row in apple_no_header:
    name = row[1]
    if non_English(name) == True:
        apple_english.append(row)

print(google_header)
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+

#### 4. Isolating free apps

Because the client is mainly concerned with developing free apps and generating profit from in-app ads, the last part of the cleaning process is isolating free apps from each data set:

In [86]:
android_final = []

for row in android_english:
    price = row[7]
    if price == '0':
        android_final.append(row)
        
print(len(android_final))

8864


In [88]:
ios_final = []

for row in apple_english:
    price = float(row[4])
    if price == 0.0:
        ios_final.append(row)
        
print(len(ios_final))

3222
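
The two cells above test the price in different ways: a string comparison for Google Play and a numeric one for the App Store. A single helper could cover both, as sketched below. `is_free` is a hypothetical name, and the handling of a leading `$` is an assumption about how paid prices may be written, not something shown in the output above:

```python
def is_free(price_text):
    """Return True if a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')  # assumption: optional '$' prefix
    return float(cleaned) == 0.0


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

With such a helper, both loops reduce to the same condition, for example `if is_free(row[7])` for the Google Play rows and `if is_free(row[4])` for the App Store rows.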
