# Profitability potential of free iOS and Android Mobile Apps

The objective of this project is to create mobile app profiles for the Apple App Store and Google Play Store.

We want to enable app developpers to make data-driven decisions with respect to the kind of apps they should focus on, based on which types of apps are likely to attacted more users.

*As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.*

Collecting data ourselves for all these apps is not feasible within our time and budget constraints. However, we've identified two suitable data sets for our goal:

* [A data set](https://www.kaggle.com/lava18/google-play-store-apps/home) collected in August 2018, containing data about approximately ten thousand Android apps from Google Play. 

* [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) collected in July 2017, containing data about approximately seven thousand iOS apps from the App Store.

In [57]:
# Let's open the respective app stores
import csv

# Opening and reading the data sets

#iOS
with open('Data/AppleStore.csv', 'r') as ios:
    ios_read = csv.reader(ios, delimiter=",")
    ios_header = next(ios_read)
    ios_apps = list(ios_read)

#Google
with open('Data/googleplaystore.csv', 'r') as google:
    google_read = csv.reader(google, delimiter=",")
    google_header = next(google_read)
    google_apps = list(google_read)

In [58]:
# To make it easier to read, we'll use the following function:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [77]:
# For example, let's open the header and the first 4 rows for each app store

# iOS
print(ios_header)
print('\n')
explore_data(ios_apps, 0, 4, True)

print('\n')

# Google
print(google_header)
print('\n')
explore_data(google_apps, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic', 'game_enab']


['281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1', '0']


['281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1', '0']


['281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1', '0']


['282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1', '0']


Number of rows: 11100
Number of columns: 17


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs'

#### Summary

The iOS App Store data set has 7197 apps and 16 columns. 

The columns of interest are: 
'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. 

*Note: Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).*


The Google Play data set has 10841 apps and 13 columns. 

The columns of interest are:
'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.



## Data Cleaning

We're only interested in free apps -- remove all non-free apps from both data sets.

The target market is english speakers -- remove all non-english apps from both data sets.

We want to remove/correct inaccurate data and remove duplicate entries.

### Deleting wrong data

*Note: The Google Play data set has a discussion section, which outlines an error for row 10472.*

*Let's print that and the next row and compare them against the header.*

In [78]:
print(google_header)
print('\n')
explore_data(google_apps, 10472, 10474)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']




In [79]:
problem_line = dict(zip(google_header, google_apps[10472])) # This line will be correct after deleting the problematic entries
correct_line = dict(zip(google_header, google_apps[10473]))
print(problem_line)
print('\n')
print(correct_line)

{'App': 'osmino Wi-Fi: free WiFi', 'Category': 'TOOLS', 'Rating': '4.2', 'Reviews': '134203', 'Size': '4.1M', 'Installs': '10,000,000+', 'Type': 'Free', 'Price': '0', 'Content Rating': 'Everyone', 'Genres': 'Tools', 'Last Updated': 'August 7, 2018', 'Current Ver': '6.06.14', 'Android Ver': '4.4 and up'}


{'App': 'Sat-Fi Voice', 'Category': 'COMMUNICATION', 'Rating': '3.4', 'Reviews': '37', 'Size': '14M', 'Installs': '1,000+', 'Type': 'Free', 'Price': '0', 'Content Rating': 'Everyone', 'Genres': 'Communication', 'Last Updated': 'November 21, 2014', 'Current Ver': '2.2.1.5', 'Android Ver': '2.2 and up'}


Row 10472 corresponds to the app "Life Made WI-Fi Touchscreen Photo Frame"
We can see several problems:
1. The category is missing (Assigned valus is 1.9)
2. The maximum rating for a Google Play app is 5
3. "Installs" number is assigned a "Free" label
4. "Price" is set to "Everyone"
5. "Content Rating" is empty

For all these reasons, we'll delete this apps from the list.

In [80]:
# To make sure we deleted the row, we'll look at row 10472 before and after
len(google_apps)
print(google_apps[10472])
#del(google_apps[10472]) # IMPORTANT: RUN THIS ONLY ONCE!
#print(google_apps[10472])
#len(google_apps)

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


### Deleting duplicate entrie

Let's see if we can find duplicate app entries in each data set

In [109]:
#To facilitate the task, we'll create an function to locate duplicate apps
duplicate_apps = []
unique_apps = []

def duplicate_finder(dataset):
    for name in dataset:
        name = dataset[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:                
            unique_apps.append(name)
    print(len(duplicate_apps))
    print(len(unique_apps))
    del duplicate_apps[:]
    del unique_apps[:]

In [110]:
# Let's first look at ios apps
duplicate_finder(ios_apps)
print(len(duplicate_apps))

11099


In [111]:
# Let's now look at google apps
duplicate_finder(google_apps)
print(len(duplicate_apps))

21938
