# Profitable free mobile applications for the App Store (iOS) and Google Play Markets (Android)
## 1. Background
This project analyses data on free mobile applications to determine which types of apps are likely to attract the most users.  
All data used in this project comes from two separate datasets.
   
Data from the iOS store is collected from: ...   
Data from the Android store is collected from: ...

## 2. Data Exploration
Before analysing the entire dataset, we first explore a sample of it. To do this we define a function named `explore_data` that prints a slice of the dataset and, optionally, its header and its number of rows and columns.


In [294]:
def explore_data(dataset, start = 0, end = 5, rows_and_columns = True, header = False):
    if header:
        print('The header is: ')
        print(dataset[start])
        print('\n')
        start += 1

    dataset_slice = dataset[start:end]

    print('The dataset is constructed as follows: ')
    
    for row in dataset_slice:
        print(row)
        print('\n') # this adds a new (empty) line after each row
    
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))
        print('\n')

Next we use the `explore_data` function to take a look at the `ios` and `android` datasets.

In [221]:
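from csv import reader

# Read each file inside a `with` block so the file handle is closed afterwards.
# The encoding is an assumption: both files are expected to be UTF-8.
with open('AppleStore.csv', encoding='utf8') as ios_file:
    ios = list(reader(ios_file))

with open('googleplaystore.csv', encoding='utf8') as android_file:
    android = list(reader(android_file))

explore_data(ios, 0, 5, rows_and_columns = True, header = True)
explore_data(android, 0, 5, rows_and_columns = True, header = True)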

The header is: 
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The dataset is constructed as follows: 
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows:  7198
Number of columns:  16


The header is: 
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content R

## 3. Data Cleaning
Next the data is cleaned up:
- entry 10472 needs to be removed as it's incorrect according to the [forum](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)
- duplicate apps need to be removed
- non-English apps need to be removed
- non-free apps need to be removed

### 3.1 Remove incorrect data

In [243]:
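# The header is row 0, so entry 10472 of the data sits at index 10473.
# pop() removes the row and returns it, so removed_apps holds the row that was actually deleted.
removed_apps = [android.pop(10473)]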
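
A row like this can also be spotted without relying on the forum by comparing each row's length with the header's. The sketch below uses a small made-up dataset, and `find_malformed_rows` is a hypothetical helper, not part of the original notebook. It assumes the error shows up as a missing or extra column, which will not catch every kind of bad row.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(dataset[0])  # dataset[0] is assumed to be the header row
    return [(i, row) for i, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]

# Made-up example data, not taken from the real files
toy = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # category missing, so the columns are shifted
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(toy)
print(malformed)
```

The returned indices count the header as row 0, so they can be passed directly to `del` or `pop` on the same list.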

### 3.2 Remove duplicate apps

In [244]:
duplicate_apps = []
unique_apps = []

for app in android[1:]:  # skip the header row so 'App' is not counted as an app name
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of unique apps:' , len(unique_apps))
print('Number of duplicates', len(duplicate_apps))
print('Number of removed apps', len(removed_apps))

Number of unique apps: 9659
Number of duplicates 1181
Number of removed apps 1
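
Checking `name in unique_apps` against a list gets slow as the list grows, because every lookup scans the whole list. As a possible alternative, the same two counts can be obtained with `collections.Counter`, which uses hashing. This is a minimal sketch on made-up rows; `toy_rows`, `name_counts`, `n_unique`, and `n_duplicates` are illustrative names only.

```python
from collections import Counter

# Made-up rows; the app name is in column 0, as in the Android dataset
toy_rows = [
    ['App A', '10'],
    ['App B', '25'],
    ['App A', '12'],
    ['App A', '12'],
    ['App C', '7'],
]

# Count how often each app name occurs
name_counts = Counter(row[0] for row in toy_rows)

n_unique = len(name_counts)
# Every occurrence beyond the first counts as one duplicate,
# which matches how the loop above fills duplicate_apps
n_duplicates = sum(count - 1 for count in name_counts.values())

print('Number of unique apps:', n_unique)
print('Number of duplicates:', n_duplicates)
```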


To figure out why some apps are listed more than once, we take the first name in `duplicate_apps` and print every row in `android` with that name.

In [239]:
for app in android[1:]:
    name = app[0]
    if name == duplicate_apps[0]:
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


The duplicate rows are identical except for the `Reviews` column. The data was probably scraped at different points in time, and a row was registered as a new entry whenever the review count had changed. As the entry with the highest number of reviews is likely the most recent, only that entry will be kept.

In [245]:
# The incorrect row is already deleted from `android`, so only the header and the duplicates are excluded
print('Expected length: ', len(android[1:]) - len(duplicate_apps))

Expected length:  9659


In [267]:
# A more efficient method than the one below is used in the following cells.

# def app_selector(apps_list):
#     output = []
#     reviews = -1 # this number is set to -1 in case there would be zero reviews for the app

#     for app in apps_list:
#         if int(app[3]) > reviews:
#             output = app
#             reviews = int(app[3])
#     return output

# android_unique = []

# for item in unique_apps:
#     app_selection = []

#     for app in android[1:]:
#         name = app[0]
#         if name == item:
#             app_selection.append(app)

#     if app_selection != []:
#         selection = app_selector(app_selection)
#         android_unique.append(selection)

# explore_data(android_unique, 0, 3, True, False)

First we construct a dictionary named `reviews_max` that maps each unique app name (key) to the highest number of reviews found for that app (value).

In [262]:
reviews_max = {}

for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

Next we use `reviews_max` to construct a new dataset, `android_clean`, that contains one row per app: the row with the maximum number of reviews.

In [268]:
android_clean = []
already_added = []  # some apps have duplicates with the same number of reviews;
                    # this list makes sure each name is added only once

for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean, 0, 3, True, False)

The dataset is constructed as follows: 
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9659
Number of columns:  13
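
The two passes above can also be folded into one by storing the best row per name directly. The following is a sketch of that variant on made-up rows, not the method used in this notebook; `keep_most_reviewed` and its parameters are hypothetical names. It relies on dictionaries preserving insertion order (Python 3.7+), so apps come out in the order their names first appear, which may differ from the row order of `android_clean`.

```python
def keep_most_reviewed(rows, name_col=0, reviews_col=3):
    """Keep one row per app name: the row with the highest review count.

    When several rows tie on the review count, the first one seen is kept.
    """
    best = {}
    for row in rows:
        name = row[name_col]
        n_reviews = float(row[reviews_col])
        if name not in best or n_reviews > float(best[name][reviews_col]):
            best[name] = row
    return list(best.values())

# Made-up rows with the same column positions as the Android data
# (name in column 0, review count in column 3)
toy = [
    ['App A', 'TOOLS', '4.0', '10'],
    ['App B', 'GAME', '4.5', '25'],
    ['App A', 'TOOLS', '4.0', '12'],
    ['App A', 'TOOLS', '4.0', '12'],
]

deduplicated = keep_most_reviewed(toy)
for row in deduplicated:
    print(row)
```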




### 3.3 Remove non-English apps
All characters commonly used in English text fall in the range 0 to 127 of the [ASCII system](https://en.wikipedia.org/wiki/ASCII), so we can define a function that detects whether an app title contains non-English characters. To keep emojis and other special characters from triggering the function on their own, a title is only flagged as non-English when it contains more than 3 characters with a code point above 127.

In [298]:
def is_english(string):
    trigger = 0
    for character in string:
        if ord(character) > 127:
            trigger += 1
    if trigger > 3:
        return False
    return True

In [301]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

for app in ios[1:]:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
    
explore_data(android_english)
explore_data(ios_english)

The dataset is constructed as follows: 
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows:  9614
Number of columns:  13


The 