# Profitable Google Play and iOS Apps

We examine a sample of the apps currently available on Google Play and the App Store to find information that helps answer the question: what would the next profitable app be? Along the way, we explore at a high level what makes an app popular.

In [1]:
from csv import reader

def explore_data(dataset, start, end, rows_and_columns=False):
    # When start is 0, skip the header row at index 0 and return `end` data rows
    dataset_slice = dataset[start:end] if start > 0 else dataset[1:end + 1]
    
    for data in dataset_slice:
        print(data)
        print()
        
    if rows_and_columns:
        print(f'ROWS: {len(dataset)}')
        print(f'COLUMNS: {len(dataset[0])}')
    
    print('-------------------------')

def list_dataset(dataset):
    with open(dataset, encoding='utf-8') as dataset_file:
        read_file = reader(dataset_file)
        list_file = list(read_file)
        return list_file

DATASETS & DOCUMENTATION:
- [Apple Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)
- [Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps/home)
    
The following cell loads the two datasets and converts each into a list of lists using the previously defined function `list_dataset`. The first three data rows of each dataset are then printed using the `explore_data` function, followed by the headers (column names) for analysis.

VARIABLES:
- Apple Store dataset -> `apple_dataset`
- Google Play dataset -> `googleplay_dataset`
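
The comment in the next cell notes that keeping the header inside the dataset forces `dataset[1:]` slicing everywhere. A minimal sketch of a loader that returns the header and the data rows separately is shown below. `load_with_header` is a hypothetical helper, not part of this notebook, and it is run on an in-memory sample with made-up values rather than on the real CSV files.

```python
from csv import reader
from io import StringIO

def load_with_header(file_obj):
    """Hypothetical helper: return (header, rows) from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]

# Tiny in-memory sample standing in for a real CSV file (made-up values).
sample = StringIO("App,Category,Reviews\nSample App,TOOLS,10\nOther App,GAME,25\n")
header, data = load_with_header(sample)

print(header)     # column names only
print(len(data))  # number of data rows, header excluded
```

With this split, later cells could iterate over `data` directly instead of slicing off the first row each time.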
    

In [2]:
# should've separated the header from the dataset so I don't have to type dataset[1:] 
# in order to access the data without the headers

apple_dataset = list_dataset('AppleStore.csv')
googleplay_dataset = list_dataset('googleplaystore.csv')

explore_data(apple_dataset, 0, 3)
explore_data(googleplay_dataset, 0, 3)

print('\n')
print('---HEADER APPLE---')
print(apple_dataset[0])
print('\n')
print('---HEADER GOOGLE---')
print(googleplay_dataset[0])

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']

['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']

-------------------------
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art &

In [3]:
unique_apps = []
duplicated_apps = []
for data in googleplay_dataset[1:]:
    app_name = data[0]
    if app_name in unique_apps:
        duplicated_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print(f'# DUPLICATED APPS: {len(duplicated_apps)}')
print(f'# UNIQUE APPS: {len(unique_apps)}')
print()

print('---DUPLICATE SAMPLE ---')
for duplicate in duplicated_apps[0:5]:
    print(duplicate)


# DUPLICATED APPS: 1181
# UNIQUE APPS: 9660

---DUPLICATE SAMPLE ---
Quick PDF Scanner + OCR FREE
Box
Google My Business
ZOOM Cloud Meetings
join.me - Simple Meetings


The Google Play dataset is separated into two lists.
- `unique_apps` stores the name of an app the first time it appears in the Google Play dataset.
- `duplicated_apps` stores the name of every entry whose app has already appeared in the dataset.
<br/>
<br/>
The app's name (column `[0]`) is used to determine whether an app has appeared before.<br/>
As a result, of the 10841 entries in total, 1181 are duplicates, which leaves 9660 unique apps.
<br/>
<br/>
To remove duplicates, only the entry with the highest number of reviews is kept for each app. This rests on the assumption that, for the same app, an entry with more reviews was collected more recently.
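
As a cross-check on the counts above, the same tally can be sketched with `collections.Counter`. The rows below are made-up stand-ins for the Google Play data; only the name column (index 0) matters here.

```python
from collections import Counter

# Made-up rows: [name, category, rating, reviews]
rows = [
    ['App A', 'BUSINESS', '4.2', '100'],
    ['App A', 'BUSINESS', '4.2', '100'],
    ['App B', 'TOOLS', '4.0', '12'],
    ['App A', 'BUSINESS', '4.2', '250'],
]

name_counts = Counter(row[0] for row in rows)

unique_count = len(name_counts)             # distinct app names
duplicate_count = len(rows) - unique_count  # entries beyond the first per name

print(f'# DUPLICATED APPS: {duplicate_count}')
print(f'# UNIQUE APPS: {unique_count}')
```

A dictionary-based count also avoids the repeated `in` lookups on a list, which grow slower as the list gets longer.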


In [4]:
del googleplay_dataset[10473] # incorrect datapoint (column shift), total unique app - 9659
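
The hard-coded index above was presumably found by inspection. One way to locate such rows programmatically, assuming a column shift shows up as a row whose length differs from the header's, is sketched below on made-up data. `find_malformed_rows` is a hypothetical helper.

```python
def find_malformed_rows(dataset):
    """Hypothetical helper: indices of rows whose length differs from the header row."""
    expected = len(dataset[0])
    return [index for index, row in enumerate(dataset) if len(row) != expected]

# Made-up sample: the row at index 2 is missing a column.
sample = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['App A', 'TOOLS', '4.1', '100'],
    ['App B', '4.5', '20'],
    ['App C', 'GAME', '3.9', '7'],
]

bad_indices = find_malformed_rows(sample)
print(bad_indices)

# Delete from the end so earlier indices stay valid.
for index in reversed(bad_indices):
    del sample[index]
```

Note that running a `del` cell like the one above twice would remove a second, valid row, so a length check of this kind is safer to re-run.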

In [5]:
reviews_max = {}
for index, app in enumerate(googleplay_dataset[1:], start=1):
    app_name = app[0]
    try:
        review_count = float(app[3])
    except (ValueError, IndexError):
        # Report the dataset index of any row whose review count cannot be read
        print(index)
        continue
    # Keep the highest review count seen so far for this app name
    if app_name not in reviews_max or review_count > reviews_max[app_name]:
        reviews_max[app_name] = review_count

In [6]:
cleaned_google_dataset = []
already_added = []

for app in googleplay_dataset[1:]:
    name = app[0]
    ratings = float(app[3])
    if name in already_added:
        pass
    elif ratings == reviews_max[name]:
        cleaned_google_dataset.append(app)
        already_added.append(name)

print(len(cleaned_google_dataset))

9659


To remove the duplicates, we first build the `reviews_max` dictionary. For each entry we check whether the app's name is already a key in the dictionary; if not, we add it with the app's name as the key and the entry's number of reviews as the value. As we continue to loop through the dataset, the number of reviews of each duplicated entry is compared with the value already in the dictionary, and if it is greater, the old value is replaced by the new one. <br> <br>
We then filter the dataset so that it contains only one entry per app, the one with the most reviews, which we take to be the most recent. To get this filtered dataset: <br>

- Loop through the initial dataset
- For every data point, extract the name and the number of reviews into the variables `name` and `ratings` respectively
- If the app is already in the `already_added` list, nothing happens; we only want a single copy of each app, and this check also handles duplicates that tie on review count
- If the number of reviews is equal to `reviews_max[name]`, append the entire data point to the cleaned dataset list and append the app's name to the `already_added` list.
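
The two passes described above can also be collapsed into one by storing the whole row, rather than only the review count, in the dictionary. This is a sketch under the same assumption (the entry with the most reviews is the one to keep), run on made-up rows. `keep_most_reviewed` is a hypothetical name.

```python
def keep_most_reviewed(rows, name_col=0, reviews_col=3):
    """Hypothetical helper: one row per app name, keeping the highest review count."""
    best = {}
    for row in rows:
        name = row[name_col]
        reviews = float(row[reviews_col])
        # Strict '>' keeps the first row when review counts tie.
        if name not in best or reviews > float(best[name][reviews_col]):
            best[name] = row
    return list(best.values())

# Made-up rows: [name, category, rating, reviews]
rows = [
    ['App A', 'TOOLS', '4.1', '100'],
    ['App B', 'GAME', '4.5', '20'],
    ['App A', 'TOOLS', '4.1', '250'],
    ['App A', 'TOOLS', '4.1', '250'],
]

cleaned = keep_most_reviewed(rows)
print(len(cleaned))
for row in cleaned:
    print(row)
```

Rows come out in the order in which each name first appears, which can differ from the order produced by the two-pass version.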



In [7]:
def english_checker(string):
    """
    Return False if the string contains three or more non-ASCII characters
    (code point above 127); otherwise return True.
    """
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
            if count >= 3:
                return False
            
    return True

def filtered_data(dataset, value, filterer, filtered_list=None):
    """
    Keep the rows of `dataset` for which `filterer(row[value])` is true.
    Returns a new list, or appends to `filtered_list` in place when one is given.
    """
    if filtered_list is None:
        return [datum for datum in dataset if filterer(datum[value])]

    for datum in dataset:
        if filterer(datum[value]):
            filtered_list.append(datum)

Filtering the datasets for English and free apps:
    - filtered_apple -> free_filtered_apple
    - filtered_google -> free_filtered_google
    
English filter - When an app's name contains 3 or more non-ASCII characters (`english_checker` stops counting at 3 and returns `False`), that app is removed from the list.

Free filter - Checks whether the app is free, i.e. whether its price column equals the string `'0'`.

In [8]:
filtered_apple = []
filtered_google = []

filtered_data(cleaned_google_dataset, 0, english_checker, filtered_google)
filtered_data(apple_dataset, 2, english_checker, filtered_apple)

print(len(filtered_apple))
print(len(filtered_google))


6156
9597


In [9]:
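# filtered_apple still contains the header row at index 0, so skip it here
free_filtered_apple = filtered_data(filtered_apple[1:], 5, lambda x: x == '0')
# cleaned_google_dataset was built from googleplay_dataset[1:], so filtered_google has no header
free_filtered_google = filtered_data(filtered_google, 7, lambda x: x == '0')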

print(len(free_filtered_apple))
print(len(free_filtered_google))

3203
8848


~ Frequency table functions

In [67]:
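def freq_table(dataset, column, table):
    """Fill `table` with the percentage of rows holding each value in `column`.
    Expects `table` to be empty; existing counts would skew the percentages."""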
    for datum in dataset:
        value = datum[column]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    for value in table:
        table[value] = (table[value] / len(dataset)) * 100
    
    return table

def display_ft(dataset, column):
    ft = freq_table(dataset, column, {})
    table_display = [(ft[key], key) for key in ft]
    
    sorted_ft = sorted(table_display, reverse=True)
    
    for rank, (percentage, key) in enumerate(sorted_ft, start=1):
        print(f'{rank}: {key} | {percentage}')
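
For comparison, the same percentage table can be sketched with `collections.Counter`, which handles both the counting and the descending sort. `percentage_table` is a hypothetical helper and the rows are made up.

```python
from collections import Counter

def percentage_table(dataset, column):
    """Hypothetical helper: (value, percentage) pairs sorted by descending percentage."""
    counts = Counter(row[column] for row in dataset)
    total = len(dataset)
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up rows: [name, genre]
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Education'],
    ['App D', 'Games'],
]

for rank, (value, percentage) in enumerate(percentage_table(sample, 1), start=1):
    print(f'{rank}: {value} | {percentage:.2f}')
```

Rounding in the format string (`:.2f`) keeps the printed table readable without changing the underlying values.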

~ Frequency Tables Below

In [68]:
# Prime Genre Frequency Table for iOS
display_ft(free_filtered_apple, 12)

1: Games | 58.25788323446769
2: Entertainment | 7.836403371838902
3: Photo & Video | 4.995316890415236
4: Education | 3.6840462066812365
5: Social Networking | 3.3093974399000934
6: Shopping | 2.5913206369029034
7: Utilities | 2.466437714642523
8: Sports | 2.1542304089915705
9: Music | 2.0605682172962845
10: Health & Fitness | 2.0293474867311896
11: Productivity | 1.7483609116453322
12: Lifestyle | 1.5610365282547611
13: News | 1.3424914142990947
14: Travel | 1.248829222603809
15: Finance | 1.0927255697783327
16: Weather | 0.8741804558226661
17: Food & Drink | 0.8117389946924758
18: Reference | 0.5307524196066188
19: Business | 0.5307524196066188
20: Book | 0.3746487667811427
21: Navigation | 0.18732438339057134
22: Medical | 0.18732438339057134
23: Catalogs | 0.1248829222603809


In [69]:
# Content Rating (Age Restriction) (iOS)
display_ft(free_filtered_apple, -6)

1: 4+ | 65.90696222291602
2: 12+ | 17.015298157976897
3: 9+ | 10.708710583827662
4: 17+ | 6.369029035279425


In [70]:
# Genres (Google Play)
display_ft(free_filtered_google, 9)

1: Tools | 8.44258589511754
2: Entertainment | 6.080470162748644
3: Education | 5.357142857142857
4: Business | 4.599909584086799
5: Productivity | 3.899186256781193
6: Lifestyle | 3.8765822784810124
7: Finance | 3.7070524412296564
8: Medical | 3.5375226039783
9: Sports | 3.4584086799276674
10: Personalization | 3.322784810126582
11: Communication | 3.2323688969258586
12: Action | 3.096745027124774
13: Health & Fitness | 3.0854430379746836
14: Photography | 2.949819168173599
15: News & Magazines | 2.802893309222423
16: Social | 2.667269439421338
17: Travel & Local | 2.328209764918626
18: Shopping | 2.2490958408679926
19: Books & Reference | 2.1360759493670884
20: Simulation | 2.0456600361663653
21: Dating | 1.8648282097649187
22: Arcade | 1.842224231464738
23: Video Players & Editors | 1.7744122965641953
24: Casual | 1.763110307414105
25: Maps & Navigation | 1.3901446654611211
26: Food & Drink | 1.2432188065099457
27: Puzzle | 1.1301989150090417
28: Racing | 0.9945750452079566
29: Role

EX