## Profitable App Profiles for the App Store and Google Play Markets

### About: 

We're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better.

### Goal: 
To analyze data to help developers understand what types of apps are likely to attract more users on Google Play and the App Store.

#### Exploring Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. Rather than working with all of them, we'll analyze two sample data sets:

- A <a href = 'https://www.kaggle.com/lava18/google-play-store-apps'>data set</a> containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from <a href = 'https://dq-content.s3.amazonaws.com/350/googleplaystore.csv'> this link</a>.
- A <a href = 'https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps'>data set</a> containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from <a href='https://dq-content.s3.amazonaws.com/350/AppleStore.csv'> this link</a>.

Let's start by opening the two data sets and then continue with exploring the data.

In [16]:
from csv import reader

# App Store (iOS) apps data
with open('data/AppleStore.csv', encoding='utf8') as ios_file:
    ios_all_data = list(reader(ios_file))
ios_header = ios_all_data[0]
ios_data = ios_all_data[1:]

# Google Play (Android) apps data
with open('data/googleplaystore.csv', encoding='utf8') as android_file:
    android_all_data = list(reader(android_file))
android_header = android_all_data[0]
android_data = android_all_data[1:]
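
Both files are loaded with the same three steps, so the pattern could be wrapped in a small helper. The sketch below is one possible way to do that; the name `load_csv` and the temporary sample file are illustrative assumptions, not part of the original notebook.

```python
import csv
import os
import tempfile


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The with-block closes the file even if reading fails.
    with open(path, encoding=encoding, newline='') as f:
        all_rows = list(csv.reader(f))
    return all_rows[0], all_rows[1:]


# Demonstration on a tiny temporary file (made-up rows, not the real data).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.csv')
    with open(sample_path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows([['App', 'Price'], ['Demo App', '0']])
    sample_header, sample_rows = load_csv(sample_path)

print(sample_header)
print(sample_rows)
```

With the real files, the same helper would be called once per path, for example `load_csv('data/AppleStore.csv')`.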

To make things easier to explore, we created a function named explore_data() that can be used repeatedly to print rows in a readable way.

In [17]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

__Let's explore the iOS data set now__

In [20]:
print(ios_header)
print('\n')
explore_data(ios_data, 2, 6, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']


Number of rows:  7197
Number of columns:  16


#### **_The App Store data set contains 7,197 apps, and every app has 16 columns of information associated with it._**

The columns that could help in our analysis are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'prime_genre'

Not all of the column names are self-explanatory. To learn more about this data set, check out <a href= 'https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps'> this link</a>.

__Let's explore the Android dataset now__

In [21]:
print(android_header)
print('\n')
explore_data(android_data, 2, 6, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows:  10841
Number of columns:  13


#### **_The Google Play data set contains 10,841 apps, and every app has 13 columns of information associated with it._**

The columns of interest are: 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', 'Genres'

#### Detecting Inaccurate Data and Correcting or Removing It

Recall that at our company, we only build apps that are free to download and install, and that are directed toward an English-speaking audience.

In other words, we need to:

- Remove non-English apps
- Remove apps that aren't free

This process of preparing our data for analysis is called __data cleaning__. Data cleaning is done before the analysis; it includes removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis.

The Google Play data set has a dedicated <a href='https://www.kaggle.com/lava18/google-play-store-apps/discussion'> discussion section</a>, and we can see that <a href= 'https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015'> one of the discussions</a> describes an error in row 10472.

Let's print this row and compare it with the header.

In [28]:
print(android_data[10472])
print('\n')
print(android_header)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Row 10472 is missing its entry for the __'Category'__ column, which shifted the values of all the following columns one position to the left. We will therefore remove this row from our data set.

__Note: Do not run the _del_ command more than once. Each run deletes whatever row is currently at index 10472, so repeated runs remove valid data.__

In [29]:
del android_data[10472]
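
One way to avoid the risk described in the note is to filter rows by length instead of deleting by index, since the faulty row has fewer fields than the header. The sketch below illustrates the idea on made-up data; the function name `drop_malformed_rows` is a hypothetical helper, not part of the original notebook.

```python
def drop_malformed_rows(rows, header):
    """Return a new list containing only rows as long as the header."""
    return [row for row in rows if len(row) == len(header)]


# Made-up miniature data: the second row is missing one field.
demo_header = ['App', 'Category', 'Rating']
demo_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

# Running the filter twice gives the same result as running it once.
cleaned_once = drop_malformed_rows(demo_rows, demo_header)
cleaned_twice = drop_malformed_rows(cleaned_once, demo_header)

print(len(cleaned_once))              # 2
print(cleaned_once == cleaned_twice)  # True
```

Because the filter builds a new list and leaves the input untouched, re-running the cell cannot remove additional rows.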

Read the <a href='https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion'> discussion section</a> for the App Store data set, and see whether you can find any reports of wrong data.

To look for the same type of error as in the Google Play data set, use the code below. It checks for missing values by comparing the length of each row to the length of the header.

In [32]:
for index, row in enumerate(ios_data):
    if len(row) != len(ios_header):
        print(index, row)

#### Removing Duplicate Rows: Part One

If you explore the Google Play data set long enough or look at the <a href= 'https://www.kaggle.com/lava18/google-play-store-apps/discussion'> discussions</a> section, you'll notice some apps have duplicate entries. For instance, Instagram has four entries:

In [33]:
for app in android_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [34]:
duplicate_apps = []
unique_apps = []

for app in android_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Number of unique apps: ', len(unique_apps))

Number of duplicate apps:  1181


Number of unique apps:  9659


In total, there are 1,181 cases where an app occurs more than once.

##### A few duplicate entries

In [35]:
print('List of a few duplicate apps:', duplicate_apps[:10])

List of a few duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
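
The same counts can also be obtained with `collections.Counter` from the standard library, which additionally shows how many times each name occurs. This is a minimal sketch on made-up names, offered as an alternative to the list-based loop rather than a replacement for it.

```python
from collections import Counter

# Made-up app names standing in for the first column of the data set.
demo_names = ['Box', 'Slack', 'Box', 'Zenefits', 'Slack', 'Box']

name_counts = Counter(demo_names)

# Names that occur more than once, with how many times each occurs.
repeated = {name: n for name, n in name_counts.items() if n > 1}

# Extra rows beyond the first occurrence of each name.
n_duplicates = sum(n - 1 for n in name_counts.values())
n_unique = len(name_counts)

print(repeated)                # {'Box': 3, 'Slack': 2}
print(n_duplicates, n_unique)  # 3 3
```

On the real data, `demo_names` would be replaced by the app names taken from index 0 of each row.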


**_Note_** that we don't want to remove duplicates at random. In the Instagram example, the only field that differs between the entries is the number of reviews (index 3).

The higher the number of reviews, the more recent the data is likely to be, so we will keep the row with the highest number of reviews for each app. We will use this criterion to filter duplicates.

To remove the duplicates, we will:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
- Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

#### Removing Duplicate Rows: Part Two

1. Create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

    - Start by creating an empty dictionary named reviews_max.
    - Loop through the Google Play data set (make sure you don't include the header row). For each iteration:
        - Assign the app name to a variable named name.
        - Convert the number of reviews to float. Assign it to a variable named n_reviews.
        - If name already exists as a key in the reviews_max dictionary and reviews_max[name] < n_reviews, update the number of reviews for that entry in the reviews_max dictionary.
        - If name is not in the reviews_max dictionary as a key, create a new entry in the dictionary where the key is the app name, and the value is the number of reviews. Make sure you don't use an else clause here, otherwise the number of reviews will be incorrectly updated whenever reviews_max[name] < n_reviews evaluates to False.
    - Inspect the dictionary to make sure everything went as expected. Measure the length of the dictionary — remember that the expected length is 9,659 entries.

In [36]:
reviews_max={}
for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659
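
To see why the instructions warn against an else clause, the sketch below runs both variants on two made-up rows for the same app, with the higher review count appearing first. The data and variable names are illustrative only.

```python
# Made-up (name, review count) pairs; the highest count comes first.
demo_reviews = [('Demo App', 10.0), ('Demo App', 5.0)]

# Variant with an else clause: the else branch also runs when the name
# is already present but the new count is lower, overwriting the maximum.
with_else = {}
for name, n_reviews in demo_reviews:
    if name in with_else and with_else[name] < n_reviews:
        with_else[name] = n_reviews
    else:
        with_else[name] = n_reviews

# Variant with two separate if statements, as in the cell above.
two_ifs = {}
for name, n_reviews in demo_reviews:
    if name in two_ifs and two_ifs[name] < n_reviews:
        two_ifs[name] = n_reviews
    if name not in two_ifs:
        two_ifs[name] = n_reviews

print(with_else)  # {'Demo App': 5.0}
print(two_ifs)    # {'Demo App': 10.0}
```

An `elif name not in reviews_max:` branch would behave like the two separate if statements, because it only runs for names that have not been seen yet.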


2. Use the dictionary you created above to remove the duplicate rows:

    - Start by creating two empty lists: android_clean (which will store our new cleaned data set) and already_added (which will just store app names).
    - Loop through the Google Play data set (make sure you don't include the header row), and for each iteration:
        - Assign the app name to a variable named name.
        - Convert the number of reviews to float, and assign it to a variable named n_reviews.
        - If n_reviews is the same as the maximum number of reviews for that app name (found in the reviews_max dictionary) and name is not already in the already_added list (the second condition is needed because duplicate rows can share the same maximum number of reviews, and without it each of them would be added):
            - Append the entire row to the android_clean list (which will eventually be a list of lists and store our cleaned data set).
            - Append the app name to the already_added list, which helps us keep track of the apps we have already added.

In [38]:
android_clean = []
already_added = []

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
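
The role of the `name not in already_added` check can be illustrated with a small example in which two rows for the same app tie at the maximum review count. The rows and counts below are made up for illustration, and the sketch uses a set (here called `seen_names`, a hypothetical name) for the membership test, which is faster than a list on large data.

```python
# Made-up rows: both entries for 'Box' share the same (maximum) review count.
demo_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
]
demo_max = {'Box': 159872.0, 'Slack': 51507.0}

# Without the membership check, both 'Box' rows pass the test.
without_check = [row for row in demo_rows
                 if float(row[3]) == demo_max[row[0]]]

# With the membership check, only the first matching row per app is kept.
with_check = []
seen_names = set()
for row in demo_rows:
    name = row[0]
    if float(row[3]) == demo_max[name] and name not in seen_names:
        with_check.append(row)
        seen_names.add(name)

print(len(without_check))  # 3
print(len(with_check))     # 2
```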

Let's now explore the '__android_clean__' data set to ensure everything went as expected. The data set should have 9,659 rows.

In [40]:
explore_data(android_clean, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows:  9659
Number of columns:  13


Great! We have 9,659 rows, as expected.

#### Removing Non-English Apps: Part One

If we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience.

In [41]:
print(ios_data[813][1])
print(ios_data[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


We're not interested in keeping these apps, so we'll remove them. One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

According to the ASCII (American Standard Code for Information Interchange) system, the characters commonly used in English text all have corresponding numbers in the range 0 to 127. So, we can build a function that checks whether every character of an app name falls within this range.

If an app name contains a character whose number is greater than 127, then it probably means that the app has a non-English name. Our app names, however, are stored as strings, so how could we take each individual character of a string and check its corresponding number?

We can get the corresponding number of each character using the ord() built-in function.
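
As a quick illustration of `ord()`, the snippet below prints the numbers for a few common English characters and compares them with the 127 limit. The characters are chosen only as examples.

```python
# ord() maps a one-character string to its Unicode code point.
for char in ['a', 'A', '5', '+']:
    print(char, ord(char))

# ASCII characters stay at or below 127; other characters exceed it.
print(ord('a') <= 127)  # True
print(ord('爱') > 127)  # True

# chr() is the inverse of ord().
print(chr(97))          # a
```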

1. Write a function that takes in a string and returns False if there's any character in the string that doesn't belong to the set of common English characters, otherwise it returns True.

    - Inside the function, iterate over the input string. For each iteration check whether the number associated with the character is greater than 127. When a character is greater than 127, the function should immediately return False — the app name is probably non-English since it contains a character that doesn't belong to the set of common English characters.
    - If the loop finishes running without the return statement being executed, then it means no character had a corresponding number over 127. The app name is probably English, so the function should return True.

In [42]:
def is_english(a_string):
    for char in a_string:
        if ord(char) > 127:
            return False
    return True 

2. Use your function to check whether these app names are detected as English or non-English:

- 'Instagram'
- '爱奇艺PPS -《欢乐颂2》电视剧热播'
- 'Docs To Go™ Free Office Suite'
- 'Instachat 😜'

In [43]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


The function misclassifies English app names that contain emoji or other symbols, such as ™, the em dash (U+2014), or the en dash (U+2013), because these characters fall outside the ASCII range.

So, our function is not fully ready to run on the actual dataset. Let's look at ways to modify it so we do not lose useful information.

In [44]:
print(ord('™'))
print(ord('😜'))

8482
128540


#### Removing Non-English Apps: Part Two

To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

1. If the input string has more than three characters that fall outside the ASCII range (0 - 127), then the function should return False (identify the string as non-English), otherwise it should return True.

In [47]:
def is_english(a_string):
    non_ascii_count = 0
    for char in a_string:
        if ord(char) > 127:
            non_ascii_count += 1  
    if non_ascii_count > 3:
        return False
    else:
        return True

2. Use the new function to check whether these app names are detected as English or non-English:

- 'Docs To Go™ Free Office Suite'
- 'Instachat 😜'
- '爱奇艺PPS -《欢乐颂2》电视剧热播'

In [48]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
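
The cutoff of three characters is a judgment call, so it may be convenient to make it a parameter. The sketch below is one possible variant; the function name `is_probably_english` and the parameter `max_non_ascii` are assumptions introduced here for illustration.

```python
def is_probably_english(a_string, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters are non-ASCII."""
    non_ascii_count = sum(1 for char in a_string if ord(char) > 127)
    return non_ascii_count <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))      # True
print(is_probably_english('Instachat 😜'))                        # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))      # False

# With a threshold of 0, the function behaves like the first, stricter version.
print(is_probably_english('Instachat 😜', max_non_ascii=0))       # False
```

Trying a few thresholds on a sample of names is a simple way to judge how many English apps each setting would discard.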


3. Use the new function to filter out non-English apps from both data sets. Loop through each data set. If an app name is identified as English, append the whole row to a separate list.

- Our current iOS data set:  __ios_data__
- Our current Android data set:  __android_clean__

In [49]:
english_apps_ios = []
english_apps_android = []

# For the iOS data set
for row in ios_data:
    name = row[1]
    if is_english(name):
        english_apps_ios.append(row)

# For the Android data set
for row in android_clean:
    name = row[0]
    if is_english(name):
        english_apps_android.append(row)

4. Explore the data sets and see how many rows you have remaining for each data set.

In [50]:
explore_data(english_apps_ios, 0, 3, True)  # iOS data set

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  6183
Number of columns:  16


Our initial iOS data set had 7,197 rows. After removing non-English apps, we are left with 6,183 apps.

In [51]:
# The Android data
explore_data(english_apps_android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13


Our android_clean dataset had 9659 rows. Our new English-only dataset consists of 9614 rows.

#### Isolating Free Apps

So far in the data cleaning process, we:

1. Removed inaccurate data
2. Removed duplicate app entries
3. Removed non-English apps

Our goal is to build apps that are free to download and install, and our main source of revenue comes from in-app ads. So, we need to isolate the free apps for our analysis.

Isolating the free apps will be our last step in the data cleaning process.

- Our current iOS data set:  __english_apps_ios__
- Our current Android data set:  __english_apps_android__

Note: In the Android data set, values in the '__Price__' column have a '$' attached to the number, in the form '$4.99'. In this case, we can either use the '__Type__' column (index 6) or remove the '$' from the price. Either way works.

In [54]:
free_apps_android = []
free_apps_ios = []

# The iOS data set
for row in english_apps_ios:
    price = float(row[4])
    if price == 0:
        free_apps_ios.append(row)
        
# The Android dataset
for row in english_apps_android:
    price = float(row[7].strip('$'))  # Alternatively, use the 'Type' column (index 6), whose values are 'Paid' or 'Free'
    if price == 0:
        free_apps_android.append(row)
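
The two options mentioned in the note can be compared side by side. The sketch below uses made-up (Type, Price) pairs in the format shown above; the helper names `is_free_by_price` and `is_free_by_type` are hypothetical. On the real data, the two checks could disagree for rows whose 'Type' value is missing or unexpected, so that would need to be verified separately.

```python
def is_free_by_price(price_text):
    """Treat a price string such as '0' or '$4.99' as free when it parses to zero."""
    return float(price_text.strip('$')) == 0


def is_free_by_type(type_text):
    """Use the 'Type' column directly."""
    return type_text == 'Free'


# Made-up (Type, Price) pairs in the format the Android data uses.
demo_pairs = [('Free', '0'), ('Paid', '$4.99'), ('Paid', '$0.99')]

for type_text, price_text in demo_pairs:
    print(type_text, price_text,
          is_free_by_type(type_text), is_free_by_price(price_text))
```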

Let's explore the data sets now to check how many rows we are left with.

In [57]:
# The iOS data set
explore_data(free_apps_ios, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  3222
Number of columns:  16


In [56]:
# The Android dataset
explore_data(free_apps_android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  8864
Number of columns:  13


We are now left with __3,222__ iOS apps and __8,864__ Android apps.
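
To summarize how much data each cleaning step removed, the row counts reported in the outputs above can be turned into percentages of the starting size. This is a small arithmetic sketch; the function name `retained_share` is an assumption, and the Android count after removing the faulty row (10,840) is derived here as 10,841 minus one.

```python
# Row counts reported in the outputs above, in cleaning order.
ios_counts = [
    ('raw', 7197),
    ('English only', 6183),
    ('free only', 3222),
]
android_counts = [
    ('raw', 10841),
    ('faulty row removed', 10840),  # derived: 10841 - 1
    ('deduplicated', 9659),
    ('English only', 9614),
    ('free only', 8864),
]


def retained_share(counts):
    """Return (label, rows, percent of the starting rows) for each step."""
    start = counts[0][1]
    return [(label, n, round(100 * n / start, 1)) for label, n in counts]


for label, n, percent in retained_share(ios_counts):
    print('iOS', label, n, percent)

for label, n, percent in retained_share(android_counts):
    print('Android', label, n, percent)
```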