# Profitable App Profiles for the App Store and Google Play Markets

Let's suppose that we are data analysts for a company that develops only free apps, which generate revenue from in-app ads. The more users see and engage with the ads, the more ad revenue we earn. Therefore, the goal of this project is to examine data on the apps available on the App Store and Google Play to discover which types of apps attract the most users.

Edit: This project is incomplete and unrefined. There is still more that can be done. I would like to return to this project eventually to add additional observations and insights, improve the structure and flow of the markdown, etc. 

Edit(2): This project does not use NumPy or Pandas at all, just plain old Python. The datasets used here are also treated as lists of lists rather than DataFrames.

### Opening the Data Sets

Here, we will import the data sets. These sets are smaller samples of much larger sets, collected in 2017 (App Store) and 2018 (Google Play).

- A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

- A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

In [1]:
from csv import reader  # imported once, before first use

# App Store data set
apple_file = open('AppleStore.csv')
read_file = reader(apple_file)
apple_data = list(read_file)
apple_file.close()  # all rows are now in memory
apple_header = apple_data[0]  # first row holds the column names
apple = apple_data[1:]

# Google Play data set
google_file = open('googleplaystore.csv')
read_file = reader(google_file)
google_data = list(read_file)
google_file.close()
google_header = google_data[0]
google = google_data[1:]
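
The loading steps above could also be wrapped in a small helper. The sketch below is one possible version, not part of the original notebook: the name `load_csv` is hypothetical, and the `utf8` default is an assumption based on the non-ASCII app names that appear in the data.

```python
import csv

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file.

    The 'utf8' default is an assumption; change it if the file
    was saved with a different encoding.
    """
    # The with-block closes the file even if reading fails
    with open(path, encoding=encoding, newline='') as csv_file:
        rows = list(csv.reader(csv_file))
    return rows[0], rows[1:]

# Possible usage with the files from this project:
# apple_header, apple = load_csv('AppleStore.csv')
# google_header, google = load_csv('googleplaystore.csv')
```

Returning the header and the data rows as a pair mirrors the `apple_header` / `apple` split used in the cell above.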

### Exploring the Data Sets

Next, we'll begin exploring the data sets. First, we must write a function that will allow us to slice the rows of data that we want to see. It will also format and print out the slice into a more readable form. Lastly, it will print out the number of rows and columns for each data set.

In [2]:
# Function for printing section of dataset
# Accepts the dataset, starting row, end row, and boolean as parameters
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row     
    
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

First, we will explore the App Store data.

In [3]:
print(apple_header)
print("\n")
explore_data(apple, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


Next, we will explore the Google Play data.

In [4]:
print(google_header)
print("\n")
explore_data(google, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


## Data Cleaning

Now that we've explored the data a bit, we can begin cleaning erroneous data. According to a [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) in the discussion section of the Google Play data set, there is an error in entry 10472. We will print this row to check.

In [5]:
print(google[10472])
print(len(google))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10841


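Entry 10472 has only 12 values instead of 13: the category is missing, so every later value is shifted one column to the left. As a result, the rating reads '19'. The maximum Google Play app rating is 5, so the rating is invalid. Therefore, we will use the del statement to remove the row.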

In [6]:
del google[10472]
print(google[10472])
print(len(google))

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
10840
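
The bad row printed above has 12 values while the header has 13, which suggests a general check: flag any row whose length differs from the header's. The sketch below runs on toy data, and the function name `find_malformed_rows` is hypothetical.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Toy data standing in for the real file
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],             # Category missing, values shifted left
    ['App C', 'SOCIAL', '4.5'],
]

print(find_malformed_rows(toy_header, toy_rows))
# [(1, ['App B', '1.9'])]
```

Running a check like this before deleting anything makes it possible to confirm that only one row is affected.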


### Deleting Duplicate Entries

The Google Play data set also seems to contain duplicate entries. Let's check whether any such duplicates exist, using Instagram as an example.

In [7]:
for app in google:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


From the cell above, we can see that there are four instances of Instagram. The duplicate rows are identical except, in some cases, for the number of reviews. Out of the duplicates, we want to keep the entry that has the highest number of reviews and discard the others. We'll start by counting how many duplicates there are, using two separate lists for duplicate and unique app names.

In [8]:
duplicate_apps = []
unique_apps = []

# Loop to count duplicates
for app in google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print(len(duplicate_apps))
print(len(unique_apps))

1181
9659
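
The same two counts can be obtained with `collections.Counter` from the standard library, which avoids searching a list for every row. This is a sketch on toy names rather than the real data.

```python
from collections import Counter

# Toy app names; 'Instagram' appears three times, 'Slack' twice
toy_names = ['Instagram', 'Slack', 'Instagram', 'Instagram', 'Slack', 'Maps']

name_counts = Counter(toy_names)

# One unique entry per distinct name
n_unique = len(name_counts)
# Every occurrence beyond the first counts as a duplicate
n_duplicates = sum(count - 1 for count in name_counts.values())

print(n_duplicates)
print(n_unique)
# 3
# 3
```

On the real data the names would come from `app[0]` for each row in `google`.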


Next, we'll create a dictionary that stores each app name as a key and the highest number of reviews recorded for that app as the value. In the cell after that, we'll loop through the Google Play data and build a cleaned list containing only the entries whose review count matches the dictionary.

In [9]:
# Dictionary for duplicate data
reviews_max = {}
for app in google:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659


Finally, using the dictionary 'reviews_max' created in the last cell, we'll run a loop to remove the duplicate rows. We'll create two empty lists: 'android_clean' and 'already_added'. If an app's review count matches its entry in the dictionary, and the app's name is not already in the list 'already_added', then the app gets added to 'android_clean'. At the same time, the app's name gets added to 'already_added', so that rows tied for the highest review count are only kept once.

In [10]:
# Lists for storing cleaned data set
android_clean = []
already_added = []

for app in google:
    name = app[0]
    n_reviews = float(app[3]) # Number of reviews
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
print(len(android_clean))

9659
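
The two cells above can also be collapsed into a single pass that stores the best row per app directly. The sketch below is an alternative, not the notebook's method; the name `keep_most_reviewed` is hypothetical and the second app in the toy data is made up.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the most reviews.

    On a tie, the first row encountered is kept.
    """
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows: only the first four columns are shown
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Toy App', 'TOOLS', '4.0', '120'],          # made-up entry
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]

for row in keep_most_reviewed(toy_rows):
    print(row)
# ['Instagram', 'SOCIAL', '4.5', '66577446']
# ['Toy App', 'TOOLS', '4.0', '120']
```

One difference from the two-step version: the result is ordered by each app's first appearance rather than by the position of its most-reviewed row.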


### Removing Non-English Apps

Our company only targets an English-speaking audience, so apps that are not in English are not useful for our analysis. We will write a function that loops through the characters of an app's name. If the name contains more than three characters outside the ASCII range (code points greater than 127), the function returns False; otherwise it returns True. Allowing up to three such characters keeps English names that include a symbol or an emoji. We'll print four examples to test the function.

In [11]:
def is_english(string):
    counter = 0
    for character in string:
        if ord(character) > 127:
            counter += 1  # Count characters with code points higher than 127
        if counter > 3:
            return False
    
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
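
To see why the threshold is three rather than zero, it can help to count the non-ASCII characters in the English test names. The helper name `count_non_ascii` below is hypothetical and is used only for this illustration.

```python
def count_non_ascii(string):
    """Count characters whose code point is above the ASCII range (0-127)."""
    return sum(1 for character in string if ord(character) > 127)

for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, '->', count_non_ascii(name))
# Instagram -> 0
# Docs To Go™ Free Office Suite -> 1
# Instachat 😜 -> 1
```

A zero-tolerance rule would have discarded the last two names even though they are clearly English.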


We will now apply the function written above to both data sets. The apps that pass the check are stored in the lists 'android_english' and 'ios_english'. Then, we'll use the explore function to see how many apps we have left in each data set.

In [12]:
android_english = []
ios_english = []

for app in android_clean:
    app_name = app[0]
    if is_english(app_name):
        android_english.append(app)
        
for app in apple:
    app_name = app[1]
    if is_english(app_name):
        ios_english.append(app)
        
explore_data(android_english, 0, 4, True)
print('\n')
explore_data(ios_english, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '

### Isolating the Free Apps

This process is straightforward: loop through the lists android_english and ios_english, check the price of each app, and if the price is 0, append the app to the list 'free_android' or 'free_ios', respectively. Then, check the lengths of the two lists.

In [13]:
free_android = []
free_ios = []

for app in android_english:
    price = app[7]
    if price == '0':
        free_android.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        free_ios.append(app)
        
print(len(free_android))
print(len(free_ios))

8864
3222
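
The cell above compares price strings directly, which works because free apps are written as '0' in one file and '0.0' in the other. A numeric comparison handles both spellings with one test. The sketch below assumes paid prices may carry a leading dollar sign; the function name `is_free` is hypothetical.

```python
def is_free(price_string):
    """Return True if a price string represents zero.

    Assumes the only non-numeric character is an optional '$'.
    """
    cleaned = price_string.replace('$', '').strip()
    return float(cleaned) == 0.0

print(is_free('0'), is_free('0.0'), is_free('$4.99'))
# True True False
```

With this helper, both loops could share the same condition and differ only in the column index.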


## Most Common Apps by Genre

After spending much time cleaning the data by removing the erroneous entry, deleting duplicate entries, removing non-English apps, and isolating the free apps, we can finally start the analysis. As mentioned in the introduction, we want to discover which types of apps attract the most users in both the iOS and Android markets, because our company's revenue is mostly determined by the number of people using our apps.

We can start by finding the most common genre in each market, by building frequency tables for the 'prime_genre' column of the App Store data set and the 'Category' and 'Genres' columns of the Google Play data set.

In [14]:
# Function to generate frequency table for data sets
# Android Category = [1]
# Android Genres = [-4]
# iOS prime_genre = [11]

def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
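
For comparison, the same percentage table can be built with `collections.Counter`. This is a sketch of an equivalent approach on toy data, not a replacement for the functions above; the name `freq_table_counter` is hypothetical.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

# Toy data: app name and genre
toy_apps = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = freq_table_counter(toy_apps, 1)
# Sort by percentage, largest first, as display_table does
for genre, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
# Games : 75.0
# Music : 25.0
```

Sorting with a `key` function removes the need to build the intermediate list of `(percentage, name)` tuples.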

In [15]:
print('IOS PRIME GENRE BY PERCENT')
display_table(free_ios, 11)

IOS PRIME GENRE BY PERCENT
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Of the App Store genres listed in the table above, games make up a clear majority of the free English apps (about 58 percent), followed by entertainment in a distant second place (7.88 percent). Overall, the most common genres tend to be entertainment-oriented (games, entertainment, photo and video, and social networking) rather than practical. Although it's tempting to assume that the gaming genre attracts the most users, we can't jump to that conclusion without knowing how many people actually use the apps in that genre (which we will estimate from the rating_count_tot column).

In [16]:
print('ANDROID CATEGORY BY PERCENT')
display_table(free_android, 1)

ANDROID CATEGORY BY PERCENT
FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.654332129

In [17]:
print('ANDROID GENRES BY PERCENT')
display_table(free_android, -4)

ANDROID GENRES BY PERCENT
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto 

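For Android, the distribution is more even. In the Category column, FAMILY (about 19 percent) and GAME (about 10 percent) lead, closely followed by TOOLS (about 8 percent) and BUSINESS (about 5 percent). In the Genres column, Tools leads (about 8 percent), followed by Entertainment and Education. With practical and entertainment apps both well represented, recommending an app profile from these findings alone is more ambiguous.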

## Most Popular Apps by Genre - App Store

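We will now estimate how popular each genre is. Google Play provides an Installs column, so there we can average the number of installs per category. The App Store data has no install count, so we will use the total number of user ratings (the rating_count_tot column) as a proxy.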

In [18]:
#rating_count_tot = [5]
#Installs = [5]

ios_genre = freq_table(free_ios, 11)
for genre in ios_genre:
    total = 0    # sum of user ratings
    len_genre = 0    #no. of apps in genre
    for app in free_ios:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre,':', avg_ratings)

Utilities : 18684.456790123455
Catalogs : 4004.0
Music : 57326.530303030304
Reference : 74942.11111111111
Medical : 612.0
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Photo & Video : 28441.54375
Book : 39758.5
Travel : 28243.8
Finance : 31467.944444444445
Education : 7003.983050847458
News : 21248.023255813954
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Games : 22788.6696905016
Business : 7491.117647058823
Sports : 23008.898550724636
Productivity : 21028.410714285714
Social Networking : 71548.34905660378
Lifestyle : 16485.764705882353
Weather : 52279.892857142855
Navigation : 86090.33333333333


Of the genres listed, navigation apps have the highest average number of user ratings. This is one type of app that we could consider developing, but we need to see how well such apps perform in the Android market first.
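
The cell above rescans every app once per genre. A per-group average can also be computed in a single pass by accumulating a running sum and count for each group. The sketch below uses toy rows, and the function name `average_by_group` is hypothetical.

```python
from collections import defaultdict

def average_by_group(rows, group_idx, value_idx):
    """Return {group: mean of the numeric column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows: genre and number of ratings (values are illustrative)
toy_rows = [['Games', '100'], ['Games', '300'], ['Navigation', '500']]

print(average_by_group(toy_rows, 0, 1))
# {'Games': 200.0, 'Navigation': 500.0}
```

For the App Store data the call would use the genre column (index 11) and the rating_count_tot column (index 5).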

## Most Popular Apps by Genre - Google Play

We will apply the same steps to the Google Play data set. First, though, we need to remove the non-numeric characters (commas and plus signs) from the values in the Installs column and convert the resulting strings to floats.

In [19]:
android_category = freq_table(free_android, 1)
for category in android_category:
    total = 0
    len_category = 0
    for app in free_android:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category  # average installs per app in this category
    print(category, ':', avg_n_installs)

DATING : 854028.8303030303
EVENTS : 253542.22222222222
AUTO_AND_VEHICLES : 647317.8170731707
TOOLS : 10801391.298666667
COMICS : 817657.2727272727
EDUCATION : 1833495.145631068
MAPS_AND_NAVIGATION : 4056941.7741935486
SOCIAL : 23253652.127118643
FINANCE : 1387692.475609756
HEALTH_AND_FITNESS : 4188821.9853479853
BEAUTY : 513151.88679245283
COMMUNICATION : 38456119.167247385
PARENTING : 542603.6206896552
SHOPPING : 7036877.311557789
LIBRARIES_AND_DEMO : 638503.734939759
WEATHER : 5074486.197183099
MEDICAL : 120550.61980830671
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
LIFESTYLE : 1437816.2687861272
HOUSE_AND_HOME : 1331540.5616438356
ENTERTAINMENT : 11640705.88235294
NEWS_AND_MAGAZINES : 9549178.467741935
GAME : 15588015.603248259
ART_AND_DESIGN : 1986335.0877192982
TRAVEL_AND_LOCAL : 13984077.710144928
PERSONALIZATION : 5201482.6122448975
BUSINESS : 1712290.1474201474
FAMILY : 3695641.8198090694
FOOD_AND_DRINK : 1924897.7363636363
VIDEO_PLAYERS : 24727872.452830188
PRO

Of the categories shown, communication apps have the highest average number of installs, at roughly 38 million per app.
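
The install-count cleanup used in the last cell can be isolated in a small helper, which makes it easy to test on its own. This is a sketch; the function name `parse_installs` is hypothetical.

```python
def parse_installs(installs):
    """Convert an Installs string such as '10,000+' to a float.

    The '+' marks an open-ended bucket, so the result is a lower
    bound on the true install count, not an exact figure.
    """
    return float(installs.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))
print(parse_installs('1,000,000,000+'))
# 10000.0
# 1000000000.0
```

Because every value is a lower bound, the category averages are best read as rough comparisons rather than precise install counts.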