# Profitable App Profiles for the App Store and Google Play Markets

* What the project is about
* The goal of this project
Our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

* Filter out apps that are NOT free to download and install.
* Filter out non-English apps.

In [86]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row) 
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

---
**In the code below I open the two datasets that I'm going to work with:**
* AppleStore.csv
* googleplaystore.csv
---
1. I open both datasets with the open() function.
2. To read the datasets I use the reader() function from the csv module.
3. Then I convert them into lists, so I can work with them.


In [87]:
from csv import reader

# Read each CSV inside a with-block so the file is closed once it is loaded.
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple_store = list(reader(opened_file))
ios_header = apple_store[0]
ios = apple_store[1:]

with open('googleplaystore.csv', encoding='utf8') as opened_file:
    google_store = list(reader(opened_file))
android_header = google_store[0]
android = google_store[1:]

---
**Deleting incorrect data.**

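Entry number 10472 is missing its 'Category' value, so every later column has shifted one position to the left: the row has 12 fields instead of 13, and '19' ends up in the 'Rating' column.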

In [88]:
print(android_header)
print("\n")
print(android[10472])



['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


---
**Deleting entry number 10472.**

In [89]:
del android[10472]
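
A shifted row like the one above can also be found without inspecting rows by hand, by comparing each row's field count with the header's. The sketch below runs on toy data; `find_malformed_rows`, `sample_header`, and `sample_rows` are assumed names that are not part of the original notebook.

```python
# Hypothetical helper: report (index, field count) for rows whose length
# differs from the header's length.
def find_malformed_rows(header, rows):
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]

# Toy data standing in for android_header / android.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # 'Category' is missing, so the row is one field short
    ['App C', 'SOCIAL', '4.5'],
]

print(find_malformed_rows(sample_header, sample_rows))  # [(1, 2)]
```

Called with `android_header` and `android` before the deletion, a check like this would report index 10472, because that row has 12 fields against a 13-column header.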

---
**Now I start the data cleaning process. As the output below shows, the number of duplicate apps is over 1,000.**

In [90]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name not in unique_apps:
        unique_apps.append(name)
    else:
        duplicate_apps.append(name)

print("Number of duplicate apps:", len(duplicate_apps))
print("Examples of duplicate apps:", duplicate_apps[:11])

Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic']
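
As a possible alternative to the loop above, `collections.Counter` can count how often each name occurs in a single pass, which also shows how many copies each app has. This is a sketch on toy rows; `sample_rows`, `name_counts`, and `repeated` are illustrative names only.

```python
from collections import Counter

# Toy rows standing in for the android list; only the name column (index 0) matters.
sample_rows = [['Box'], ['Slack'], ['Box'], ['Instagram'], ['Box'], ['Slack']]

name_counts = Counter(row[0] for row in sample_rows)

# Each name contributes (count - 1) duplicates, which matches the loop above:
# every occurrence after the first is counted as a duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())
repeated = {name: count for name, count in name_counts.items() if count > 1}

print('Number of duplicate apps:', n_duplicates)  # 3
print(repeated)                                   # {'Box': 3, 'Slack': 2}
```

Because `Counter` uses hashing, this avoids the repeated list lookups of `name not in unique_apps`, which may matter on larger datasets.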


---
I am going to remove the duplicate entries, but NOT randomly.
There are 4 entries for the app 'Instagram', and the only field that differs between them is 'Reviews'.
I am going to keep the entry with the highest number of reviews, because a higher count suggests more recent data.

In [91]:
print(android_header)
print("\n")
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


---
Creating a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

In [93]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print("Number of entries:", len(reviews_max))
print(reviews_max["Instagram"])
        

Number of entries: 9659
66577446.0


---
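I use the reviews_max dictionary created above to remove the duplicate entries: an app is kept only if its number of reviews equals the maximum for its name and its name has not been added yet (the second check handles entries tied at the maximum).
The new, cleaned dataset will be stored in the 'android_clean' list.
I don't need to do the same for the App Store data because there are no duplicates.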

In [95]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
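
The two passes above (build `reviews_max`, then filter) could also be collapsed into one pass that remembers the best row seen so far for each name. The following is only a sketch of that idea; `keep_most_reviewed` and its parameters are hypothetical, and the 'Box' row uses made-up toy values.

```python
# Hypothetical single-pass variant: keep, per app name, the row with the most reviews.
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    best = {}
    for row in rows:
        name = row[name_idx]
        # Strictly greater, so the first row wins a tie, as in the loop above.
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.0', '100'],           # toy values
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]

for row in keep_most_reviewed(sample_rows):
    print(row)
```

One difference to keep in mind: the result is ordered by the first appearance of each name, so the row order may differ from `android_clean` even though the same rows are kept.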

---
**Identifying Non-English Apps.**

Each character in a string has a corresponding number (its code point), which Python's built-in ord() function returns.

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system.

If the number is equal to or less than 127, then the character belongs to the set of common English characters.
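
To make the range concrete, the short sketch below prints the number that Python's built-in `ord()` reports for a few characters. The characters were picked only for illustration.

```python
# ord() returns the Unicode code point of a single character.
for character in ['a', 'A', '5', '+', '爱', '™']:
    code_point = ord(character)
    label = 'ASCII' if code_point <= 127 else 'non-ASCII'
    print(repr(character), code_point, label)
```

The first four characters fall inside the 0 to 127 range, while '爱' and '™' fall well outside it.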

If an app name contains a character whose number is greater than 127, then it probably means that the app has a non-English name.

To minimize the impact of data loss, I'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English.

* If the is_english function returns True, the app is treated as English; if it returns False, it is not.

In [106]:
def is_english(text):
    count = 0
    for letter in text:
        if ord(letter) > 127:
            count += 1
    # Up to three non-ASCII characters are tolerated.
    return count <= 3
    
print(is_english('Instagram'))   
print(is_english('爱奇艺PPS爱奇'))

True
False
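
To see what the three-character threshold tolerates, the sketch below counts the non-ASCII characters in a few names. `non_ascii_count` is a hypothetical helper, and the first two names are made-up examples rather than rows from the datasets.

```python
# Hypothetical helper: count characters whose code point falls outside ASCII.
def non_ascii_count(text):
    return sum(1 for character in text if ord(character) > 127)

for name in ['Photo Editor™ Pro', 'Chat 😜', '爱奇艺PPS爱奇']:
    count = non_ascii_count(name)
    verdict = 'kept' if count <= 3 else 'removed'
    print(name, '->', count, 'non-ASCII character(s),', verdict)
```

The first two names contain a single non-ASCII character each and would be kept, while the last one contains five and would be removed.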


---
I use the is_english() function defined above to filter out non-English apps from both datasets (AppleStore and googleplaystore).

In [118]:
ios_english = []
android_english = []

for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
print("Number of English apps in AppleStore:", len(ios_english))
print("Number of English apps in Google Play Store:", len(android_english))

Number of English apps in AppleStore: 6183
Number of English apps in Google Play Store: 9614


**For the last step of the data cleaning process, I will isolate the free apps from both datasets, which are by now free of inaccurate data, duplicate entries, and non-English apps.**

In [119]:
ios_free = []
android_free = []

for app in ios_english:
    price = float(app[4])
    if price == 0.0:
        ios_free.append(app)
        
for app in android_english:
    if app[7][0] == '$':
        price = float(app[7][1:])
    else:
        price = float(app[7])
    if price == 0.0:
        android_free.append(app)

print("Number of free English apps in AppleStore:", len(ios_free))
print("Number of free English apps in Google Play Store:", len(android_free))

Number of free English apps in AppleStore: 3222
Number of free English apps in Google Play Store: 8864


In [None]:
def freq_table(dataset, index):
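
The cell above is left unfinished in the source, so the intended behavior of `freq_table` is not known. As a sketch only, and assuming the goal is the percentage of rows for each distinct value in one column, a function with the same parameters might look like the following. `freq_table_sketch` and `sample_rows` are illustrative names.

```python
# Sketch only: the original cell leaves freq_table unfinished, so the behavior
# here (percentage of rows per distinct value in one column) is an assumption.
def freq_table_sketch(dataset, index):
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)
    # Convert raw counts to percentages of all rows.
    return {value: count / total * 100 for value, count in counts.items()}

# Toy rows: column 1 plays the role of a genre or category column.
sample_rows = [['A', 'Games'], ['B', 'Games'], ['C', 'Social'], ['D', 'Games']]
print(freq_table_sketch(sample_rows, 1))  # {'Games': 75.0, 'Social': 25.0}
```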
    