# Profitable App Profiles for the App Store and Google Play Markets
The aim of this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. In other words, we want to help our developers understand what types of apps are likely to attract more users on Google Play and the App Store.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.


## Opening and Exploring the Data
Collecting data for over 4 million apps requires a significant amount of time and money, so we'll try to analyze a sample of the data instead.
We use two samples: a [dataset of approximately 10,000 Android apps from Google Play](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv), collected in August 2018, and a [dataset of approximately 7,000 iOS apps from the App Store](https://dq-content.s3.amazonaws.com/350/AppleStore.csv), collected in July 2017.


The code below opens each dataset and stores it as a list of lists, which makes the data easy to work with in Python.

In [1]:
from csv import reader

# App Store sample file
# encoding='utf8' avoids decoding errors where the system default is not UTF-8
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
ios_header = ios[0]
ios_content = ios[1:]

# Google Play sample file
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]
android_content = android[1:]
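
The two loading steps above follow the same pattern, so they could be wrapped in a helper. The following is a minimal sketch and not part of the original notebook: `load_csv` is a hypothetical name, and UTF-8 encoding is an assumption about the files. It runs on a tiny stand-in file so that it works without the real datasets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The `with` block closes the file even if reading raises an error.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demonstration on a tiny stand-in file (not the real dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", encoding="utf8", newline="") as f:
        csv.writer(f).writerows([
            ["App", "Category"],
            ["Example App", "TOOLS"],
        ])
    sample_header, sample_rows = load_csv(sample_path)

print(sample_header)  # ['App', 'Category']
print(sample_rows)    # [['Example App', 'TOOLS']]
```

With the real files, the call would look like `ios_header, ios_content = load_csv('AppleStore.csv')`.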

We define a function named `explore_data()` that we can use repeatedly to print rows in a readable way.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(android_header)
print('\n')
# Print the first few rows to see what the data looks like
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [4]:
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


While exploring the data, we discovered that one row is shorter than the header: it has 12 fields instead of 13, and one of those fields is an empty string. The code below finds the row and prints its index.


In [5]:
header_length = len(android_header)
for index, row in enumerate(android):
    if len(row) != header_length:
        print(row)
        print('\n', index)


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

 10473


In [6]:
print(android[10473])  # incorrect row
print('\n')
print(android_header)  # header


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Row 10473 corresponds to the app `Life Made WI-Fi Touchscreen Photo Frame`. Its value in the rating position is 19, which is clearly off because the maximum rating for a Google Play app is 5. The cause is a missing value in the 'Category' column, which shifts every later value one column to the left. As a consequence, we'll delete this row.
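
To see the misalignment directly, we can pair each column name with the value that sits in its position. This is a small illustrative sketch that reuses the header and the faulty row exactly as printed above. The names `header`, `bad_row`, and `paired` are introduced here for illustration only.

```python
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']
bad_row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
           '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
           '1.0.19', '4.0 and up']

# zip() stops at the shorter sequence, so the last column gets no value.
paired = dict(zip(header, bad_row))
for column, value in paired.items():
    print(f'{column:>15}: {value!r}')

print('Header length:', len(header))  # 13
print('Row length:', len(bad_row))    # 12
```

The 'Category' position holds '1.9' (the actual rating) and the 'Rating' position holds '19' (the actual review count), which is consistent with a missing category value.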

In [7]:
# to delete the Life ... row without deleting other rows
if android[10473][0] == 'Life Made WI-Fi Touchscreen Photo Frame':
    del android[10473]

In [8]:
# Confirm that the deletion actually took place

print(android[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


## Removing Duplicate Entries: Part One
This part shows that duplicate entries exist in the Google Play dataset.

In [9]:
duplicates = []
unique = []
for app in android[1:]:
    name = app[0]
    if name in unique:
        duplicates.append(name)
    else:
        unique.append(name)
print('Number of duplicates found : ', len(duplicates))
print('List of duplicated apps ', duplicates[:5], '\n')
print(android[:10])

Number of duplicates found :  1181
List of duplicated apps  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings'] 

[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & D

In total, there are 1,181 cases where an app occurs more than once.
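
The same count can also be obtained with `collections.Counter` from the standard library, which avoids the repeated list membership checks. The sketch below is an alternative, not the notebook's method, and it runs on toy rows (made-up values) standing in for `android[1:]`.

```python
from collections import Counter

# Toy rows standing in for android[1:]; only the name column matters here.
sample_rows = [
    ['Box', 'BUSINESS'],
    ['Box', 'BUSINESS'],
    ['Box', 'BUSINESS'],
    ['Instagram', 'SOCIAL'],
    ['Instagram', 'SOCIAL'],
    ['Slack', 'BUSINESS'],
]

name_counts = Counter(row[0] for row in sample_rows)

# Every occurrence after the first one counts as a duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())
repeated_names = [name for name, count in name_counts.items() if count > 1]

print(n_duplicates)    # 3
print(repeated_names)  # ['Box', 'Instagram']
```

As in the loop above, an app that appears three times contributes two duplicates.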

To remove the duplicates, we:

1. Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
2. Use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [10]:
reviews_max = {}
for item in android[1:]:
    name = item[0]
    n_reviews = float(item[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))
        


9659


In [15]:
android_clean = []
already_added = []
for item in android[1:]:
    name = item[0]
    n_reviews = float(item[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(item)
        already_added.append(name) # inside the if block, so entries tied on review count are added only once
print(len(android_clean))

9659
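
The two steps above can also be collapsed into a single pass by storing the whole row, rather than only the review count, in the dictionary. This is a sketch of an alternative under stated assumptions, not the notebook's method: `best_row` is a hypothetical name, and the rows are toy data with made-up values, with the review count at index 3 as in the Google Play dataset.

```python
# Toy rows: name, category, rating, reviews (made-up values).
sample_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '3.5', '50'],
    ['App A', 'TOOLS', '4.1', '120'],
    ['App A', 'TOOLS', '4.1', '120'],
]

best_row = {}
for row in sample_rows:
    name = row[0]
    n_reviews = float(row[3])
    # Strict '>' keeps the earlier row when review counts tie.
    if name not in best_row or n_reviews > float(best_row[name][3]):
        best_row[name] = row

deduplicated = list(best_row.values())
print(len(deduplicated))  # 2
print(deduplicated[0])    # ['App A', 'TOOLS', '4.1', '120']
```

One difference to keep in mind: the result here is ordered by the first appearance of each app name, whereas the two-step version orders rows by where the highest-review entry appears.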


We'd like to analyze only the apps that are designed for an English-speaking audience. However, if we explore the data long enough, we find that both datasets contain apps whose names suggest they are not designed for an English-speaking audience.

We're not interested in keeping these kinds of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

All of these characters are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.
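
As a quick illustration of that range, the built-in `ord()` function returns the number associated with a character. The sample characters below are chosen for illustration, and `code_points` is a name introduced here.

```python
sample_characters = ['a', 'A', '5', '+', '™', '爱', '😜']

# Map each character to its code point.
code_points = {character: ord(character) for character in sample_characters}

for character, code_point in code_points.items():
    # Characters with a code point of 127 or lower belong to ASCII.
    print(repr(character), code_point, code_point <= 127)
```

The letters, the digit, and the plus sign all fall within 0 to 127, while '™', '爱', and '😜' have much larger numbers.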

In [18]:
print(android_clean[4412], '\n', android_clean[7940])

['中国語 AQリスニング', 'FAMILY', 'NaN', '21', '17M', '5,000+', 'Free', '0', 'Everyone', 'Education', 'June 22, 2016', '2.4.0', '4.0 and up'] 
 ['لعبة تقدر تربح DZ', 'FAMILY', '4.2', '238', '6.8M', '10,000+', 'Free', '0', 'Everyone', 'Education', 'November 18, 2016', '6.0.0.0', '4.1 and up']


In [19]:
# First attempt: reject any name that contains a non-ASCII character
def is_english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

In [22]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
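False

This first version is too strict: it rejects 'Docs To Go™ Free Office Suite' and 'Instachat 😜' because '™' and the emoji fall outside the ASCII range, even though both names are clearly English. To lose fewer useful apps, the next version rejects a name only if it contains more than three non-ASCII characters.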


In [24]:
# Second attempt: allow up to three non-ASCII characters per name
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    return non_ascii <= 3

In [27]:

print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

False
True
True
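
With the threshold version in place, filtering a dataset is a matter of keeping the rows whose name passes the check. The sketch below is an assumption about how that step could look, not code from the notebook: `keep_english`, `name_index`, and `max_non_ascii` are hypothetical names, and the category values in the toy rows are placeholders.

```python
def is_english(string, max_non_ascii=3):
    """Return True if `string` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii


def keep_english(rows, name_index):
    """Keep only the rows whose app name passes is_english()."""
    return [row for row in rows if is_english(row[name_index])]


# Toy rows: app name first, placeholder category second.
sample_rows = [
    ['Instagram', 'SOCIAL'],
    ['爱奇艺PPS -《欢乐颂2》电视剧热播', 'VIDEO_PLAYERS'],
    ['Docs To Go™ Free Office Suite', 'PRODUCTIVITY'],
    ['Instachat 😜', 'SOCIAL'],
]

english_rows = keep_english(sample_rows, name_index=0)
for row in english_rows:
    print(row[0])
```

In the Google Play data the app name is at index 0 ('App'), while in the App Store data it is at index 1 ('track_name'), so `name_index` would differ between the two datasets.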
