# Analyzing Mobile App Data

My goal in this project is to find app profiles that are profitable on Google Play and the App Store. I will focus on free apps, where the main source of revenue is in-app ads. I will analyze the data to understand what kinds of free apps attract the most users and are therefore likely to generate the most ad revenue.

I work with two data sources:

1.) A dataset about approx. 10,000 Android apps from Google Play: [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)

2.) A dataset about approx. 7,000 iOS apps from the App Store: [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

# Opening and Exploring Data

I start by opening the data sources:

In [1]:
from csv import reader

## Open iOS data (the with-block closes the file once it has been read)
with open('AppleStore.csv', encoding='utf8', newline='') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]  # first row holds the column names
ios = ios[1:]        # remaining rows hold the data

## Open Android data
with open('googleplaystore.csv', encoding='utf8', newline='') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

To make the data easier to explore, I define a function named `explore_data()` that prints a slice of rows in a readable way and, optionally, reports the number of rows and columns.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


We have 7197 rows in the iOS dataset, and the columns that seem useful are 'track_name', 'price', 'rating_count_tot' and 'prime_genre'. Not all the column names are self-explanatory; the documentation can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [4]:
print(android_header)
print('\n')
explore_data(android, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

There are 10,841 rows in the Android dataset, of which the 'App', 'Category', 'Reviews' and 'Installs' columns seem useful for our analysis. The documentation for the column names can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Deleting Wrong Data

The Google Play data set has a dedicated [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) section, and we can see that one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [5]:
print(android[10472]) # incorrect row
print('\n')
print(android_header) # header
print('\n')
print(android[0]) # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


After printing these 3 rows, we can see that the 'Category' value is missing from row 10472, so every later value is shifted one column to the left. As a result, the 'Rating' column holds 19, even though the maximum rating on Google Play is 5. Let's remove the row.

In [6]:
print(len(android))
del android[10472]
print(len(android))

10841
10840
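
The faulty row above was found through the discussion section, but a row with a missing value can also be detected by comparing its length to the length of the header. The following is a minimal sketch of that idea; the function name `find_malformed_rows` and the toy data are made up for illustration and are not part of the original analysis.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row_length) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data (hypothetical), shaped like a simplified version of the Android rows
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
    ['Broken Frame', '1.9'],           # the 'Category' value is missing
    ['Instagram', 'SOCIAL', '4.5'],
]

print(find_malformed_rows(toy_rows, toy_header))  # prints [(1, 2)]
```

Called as `find_malformed_rows(android, android_header)`, the same function would list every row in the real dataset that has too few or too many values. Run it before deleting anything, because each deletion shifts the indexes of the rows that follow.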


# Removing Duplicates

If we explore the Google Play dataset long enough, we can notice that there are some duplicates. For example, Instagram has 4 entries:

In [7]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


To find all the duplicate entries, I created two lists: one for storing the names of the duplicate apps and one for storing the names of the unique apps.

In [10]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps:', duplicate_apps[:10])

Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


There are 1181 cases where an app occurs more than once. We don't want to count the same app more than once when analyzing the data, so we need to remove the duplicate entries.
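
As a cross-check, the same count can be obtained with `collections.Counter` from the standard library, which also shows how many extra entries each app has. This is only a sketch on a made-up list of names; the helper `count_extra_entries` is hypothetical and not part of the original notebook.

```python
from collections import Counter

def count_extra_entries(names):
    """Map each repeated name to the number of entries beyond its first one."""
    counts = Counter(names)
    return {name: n - 1 for name, n in counts.items() if n > 1}

# Toy list of app names (hypothetical)
toy_names = ['Instagram', 'Box', 'Instagram', 'Slack', 'Box', 'Instagram']

extras = count_extra_entries(toy_names)
print(extras)                # prints {'Instagram': 2, 'Box': 1}
print(sum(extras.values()))  # prints 3
```

The sum of the values corresponds to the length of `duplicate_apps` in the loop-based approach: every occurrence after the first one counts as a duplicate.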

If we examine the Instagram entries above, we notice that the main difference between them is the number of reviews in the 'Reviews' column. A higher number of reviews suggests that the data was collected more recently, so by keeping the entry with the highest number of reviews we most likely keep the most recent data.

I will create an empty dictionary where the keys are the app names and the values are the numbers of reviews. I will loop through every row and check whether the app name already exists in the dictionary. If it exists and the number of reviews in the current row is higher than the stored value, I will update the dictionary with the number of reviews for that row. If the app name is not yet in the dictionary, I will create a new entry.

In [11]:
reviews_max = {} # Create an empty dictionary

for app in android:
    name = app[0] # assign the app name to a variable
    n_reviews = float(app[3]) # convert the number of reviews to float and assign it to a variable
    

    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

To make sure the code worked as expected, let's check the length of the dictionary. It should be 1181 less than the length of the dataset after the faulty row was removed.

In [13]:
print('Expected length:', len(android) - 1181)
print('Dictionary length:', len(reviews_max))

Expected length: 9659
Dictionary length: 9659

Next, I use the `reviews_max` dictionary to remove the duplicates. For each app, I keep only the row whose number of reviews matches the maximum, and I track the names already added so that an app with several rows sharing the same maximum is added only once.


In [None]:
android_clean = [] # list to store the new cleaned dataset
already_added = [] # list to store app names

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
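
The `name not in already_added` check matters when several rows of the same app share the highest number of reviews. Here is a small sketch of the two-pass logic on toy rows. The function `remove_duplicates` and its parameters are hypothetical, the 'Box' values are made up, and a set is used instead of a list for the names already added, which is an assumption made here for faster membership checks.

```python
def remove_duplicates(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row that has the highest review count."""
    # Pass 1: record the highest number of reviews seen for each app name
    reviews_max = {}
    for row in rows:
        name, n_reviews = row[name_idx], float(row[reviews_idx])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # Pass 2: keep a row only if it holds the maximum and its name was not added yet
    clean, seen = [], set()
    for row in rows:
        name, n_reviews = row[name_idx], float(row[reviews_idx])
        if n_reviews == reviews_max[name] and name not in seen:
            clean.append(row)
            seen.add(name)
    return clean

# Toy rows (the 'Box' values are hypothetical): App, Category, Rating, Reviews
toy = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Box', 'BUSINESS', '4.2', '159872'],        # tie: same review count
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],  # highest review count
]

print(remove_duplicates(toy))
# prints [['Box', 'BUSINESS', '4.2', '159872'], ['Instagram', 'SOCIAL', '4.5', '66577446']]
```

Without the second condition, both 'Box' rows would be kept, because each of them matches the maximum stored in `reviews_max`.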