# Profitable App Profiles for the App Store and Google Play Markets

* Our company builds only *free* apps aimed at an *English-speaking* audience.

* About: Analyze data to help our developers understand what types of apps are likely to attract more users.

* Goal: To learn and do my first project


First, we define a function that helps us explore our datasets to see what's happening.
This function takes 4 parameters: the first is the dataset, the second and third are the start and end indices of the slice to print, and the fourth, when set to True, prints the number of rows and columns.

In [209]:
def explore_data (dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice: 
        print (row)
        print ("\n")
        
    if rows_and_columns: 
        print ('Number of rows: ', len(dataset))
        print ('Number of columns: ', len(dataset[0]))

* We need to import our datasets into Python to be able to use them. To do that:
1. from csv import reader: imports reader from the csv module
2. opened_file = open(file): opens the file we want so that Python can read from it
3. read_file = reader(opened_file): wraps the opened file in a reader that parses each line into a list of column values
4. a_list = list(read_file): converts the reader into a list of rows so we can index and slice it
5. a_list[1:]: skips the header row so that only the data rows remain

In [210]:
from csv import reader 
opened_file = open ('AppleStore.csv', encoding = "utf8")
read_file = reader (opened_file)
a_list = list (read_file)
apple_list = a_list [1:]

opened_file = open ('googleplaystore.csv', encoding = 'utf8')
read_file = reader (opened_file)
g_list = list (read_file)
google_list = g_list [1:]
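
The cell above leaves both files open after reading. A minimal sketch of the same loading steps wrapped in a helper that closes the file automatically is shown here; the name `load_csv` and the `has_header` parameter are my own illustrative choices, not something the notebook defines:

```python
from csv import reader


def load_csv(path, has_header=True):
    """Read a CSV file and return (header, rows).

    `load_csv` is a hypothetical helper, not part of the notebook above.
    """
    # The with-block closes the file even if reading fails part-way.
    # newline='' is what the csv module recommends for files it reads.
    with open(path, encoding='utf8', newline='') as opened_file:
        all_rows = list(reader(opened_file))

    if has_header and all_rows:
        return all_rows[0], all_rows[1:]
    return [], all_rows
```

With this helper, something like `apple_header, apple_list = load_csv('AppleStore.csv')` would give the header and the data rows in one step.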

* The cell below is an exploratory check that the file-loading code above worked. It also helps us see which columns look important.
* Since we want only free apps with the most user reviews, we need the app name, price, and number of reviews. Together these give us a criterion for selecting which apps are performing best.

In [211]:
explore_data (google_list, 0, 3, True)
print ("\n")
explore_data (apple_list, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '

* From reading the discussion about the Google dataset, we learned that the row at index 10472 has an error: its category value is missing, so every later value is shifted one column to the left. This skews the data and could break our functions.
* To clean this data, we need to delete that row using the del statement with its index.
* But first, we check that the error is actually there by comparing the row with a normal one.

In [212]:
print ("A normal google row looks like: ", google_list[0])
print ("\n")
print ("The error row looks like: ", google_list[10472])

A normal google row looks like:  ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


The error row looks like:  ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [213]:
# now we use the del statement to delete that row
# only run this once, otherwise we delete a valid row that we actually need
del google_list[10472]
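
Instead of relying on a known index, rows like this can also be found by checking the number of columns. This is only a sketch on made-up sample rows; `find_malformed_rows` is a hypothetical helper name, and it only catches rows whose length is wrong, not rows with bad values:

```python
def find_malformed_rows(rows, expected_columns):
    """Return (index, row) pairs whose length differs from expected_columns."""
    return [
        (index, row)
        for index, row in enumerate(rows)
        if len(row) != expected_columns
    ]


# Tiny made-up sample: the second row is missing its category value.
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

for index, row in find_malformed_rows(sample_rows, expected_columns=3):
    print(index, row)  # prints: 1 ['App B', '1.9']
```

On the real data the call would look like `find_malformed_rows(google_list, 13)`, since a normal Google row has 13 columns.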

* Another data cleaning step is removing duplicate apps.
* To test whether we have duplicate apps, we create one empty list for original apps and one empty list for duplicate apps.
* We loop through google_list, and for every row in that list we:
1. assign row[0] to the name variable, so that name holds the app name from that row
2. if the name is already in the original list, append the name to the duplicate list
3. if not (else), append the name to the original list

In [214]:
original_apps = []
duplicate_apps = []

for every_row in google_list: 
    name = every_row [0]
    if name in original_apps: 
        duplicate_apps.append(name)
    else:
        original_apps.append(name)
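
Checking `name in original_apps` scans the whole list each time, which gets slow on large datasets. A possible alternative, sketched here on made-up rows, counts every name once with `collections.Counter`; the helper name `count_duplicates` is an assumption of mine:

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return {name: occurrences} for names that appear more than once."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


sample_rows = [['Box'], ['Slack'], ['Box'], ['Zoom'], ['Box'], ['Slack']]
duplicates = count_duplicates(sample_rows)
print(duplicates)  # {'Box': 3, 'Slack': 2}

# Extra rows beyond the first copy of each name; this corresponds to
# len(duplicate_apps) in the loop above.
print(sum(n - 1 for n in duplicates.values()))  # 3
```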

* Now we check the number of duplicate apps. The expected number of unique apps is the total number of rows in google_list minus the number of duplicates.
* We also list some examples of duplicate apps and print every row for one of them (Instagram) to see how many times it occurs.

In [215]:
print ("Number of duplicate apps: ", len(duplicate_apps))
print ("\n")
print ("Some examples of duplicate apps: ", duplicate_apps[0:10])
print ("\n")
for element in google_list: 
    name = element [0]
    if name == "Instagram":
        print ("Instagram is listed this many times: ", "\n", element)

print ('\n')
print ('Expected length: ', len(google_list) - len(duplicate_apps))

Number of duplicate apps:  1181


Some examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Instagram is listed this many times:  
 ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
Instagram is listed this many times:  
 ['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
Instagram is listed this many times:  
 ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
Instagram is listed this many times:  
 ['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1

* Now we have to remove the duplicate rows.
* We shouldn't delete rows at random; that wouldn't make sense.
* Instead we set a criterion: the row with the highest number of reviews is probably the one that was collected most recently.
* So we filter based on the number of reviews.
* To do so, we create an empty dictionary keyed by app name. For each app we store its number of reviews, and we overwrite the stored value only when we find a row with more reviews.
* Then we check that the length of our dictionary matches our expected length.

In [216]:
reviews_max = {}

for every_row in google_list:
    name = every_row[0]
    num_reviews = float(every_row[3])
    
    if name in reviews_max and reviews_max[name] < num_reviews: 
        reviews_max[name] = num_reviews 
        
    elif name not in reviews_max: 
        reviews_max[name] = num_reviews
        
len (reviews_max)

9659

* Now we need to use the dictionary we created to remove the duplicate apps.
* To do so we create two lists: one that holds the 'cleaned' data, i.e. without duplicates, and one that tracks the names already added, so that we don't add the same app twice.
* We loop through the Google list, and for every iteration:
1. assign the name and num_reviews variables from their respective indices
2. if the number of reviews for that row equals the maximum stored in the reviews_max dictionary and the name is not yet in already_added, append the row to the clean list and the name to already_added
3. e

In [217]:
google_clean = []
already_added = []

for every_row in google_list: 
    name = every_row[0]
    num_reviews = float(every_row[3])
    
    if (num_reviews == reviews_max[name]) and (name not in already_added):
        google_clean.append(every_row)
        already_added.append(name)
        

In [218]:
print ("Expected and Actual length: ", len(google_clean))

Expected and Actual length:  9659
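
The two passes above (build reviews_max, then filter) could also be done in one pass by storing the best row per name directly. This is a sketch under that assumption, with a hypothetical helper name `keep_most_reviewed` and made-up sample rows; note that the result comes out in the order names were first seen, which may differ from the order of google_clean:

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the most reviews.

    If several rows tie on the maximum, the first one seen is kept,
    which matches the effect of the already_added check above.
    """
    best_row = {}
    for row in rows:
        name = row[name_index]
        num_reviews = float(row[reviews_index])
        current = best_row.get(name)
        if current is None or num_reviews > float(current[reviews_index]):
            best_row[name] = row
    # Dicts preserve insertion order, so names come out in first-seen order.
    return list(best_row.values())


sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]
for row in keep_most_reviewed(sample_rows):
    print(row)
```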


Now that we've cleaned the Google data of errors and duplicates (we already checked the Apple list and it is clean), we turn to language. Since our company only builds English apps, we need to filter out non-English apps. To do so, we use the ASCII range: the characters commonly used in English text have code numbers from 0 to 127.
We create a function that takes a string (the app name) and returns:
1. False: if any character in the string has an ord (code number) larger than 127.
2. True otherwise

* This created a problem: some English apps have emojis and other symbols whose code numbers are beyond 127, even though the app is still English. So we need to change our criterion.
1. Instead of rejecting a name because of a single character, we return False only if more than 3 of the characters in the name are larger than 127
2. True otherwise

* We modified the function because the original version marked an app as non-English if ANY of its characters was non-English.
* For the modified function, we:
1. create a counter variable with value 0
2. loop through the name and, for every character that is non-English, i.e. >127, add 1 to the counter
3. after the loop, if the counter is >3, the name is non-English
* This isn't a perfect system, but it lets us keep English apps whose names include a few emojis or other non-English characters.

In [219]:
def english_apps(name):
    non_english = 0
    
    for character in name:
        if ord(character) > 127: 
            non_english += 1
                
    if non_english > 3:
        return False        
    else: 
        return True

In [220]:
print (english_apps('instagram'))
print ('\n')
print (english_apps(google_clean[2344][0]))
print ('\n')
print (english_apps('に最後後後後'))

True


True


False
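
The same rule can be written more compactly, with the threshold exposed as a parameter so it is easy to experiment with. This is just a sketch; `is_english` and `max_non_ascii` are names I made up, and the example strings are illustrative rather than taken from the outputs above:

```python
def is_english(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


# One non-ASCII character (the trademark sign), so still counted as English.
print(is_english('Docs To Go™ Free Office Suite'))  # True
# One emoji, so still counted as English.
print(is_english('Instachat 😜'))  # True
# Many non-ASCII characters, so counted as non-English.
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
```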


Now we need to apply the function to our datasets, google_clean and apple_list.

In [221]:
google_english = []
apple_english = []

for every_row in apple_list: 
    name = every_row[1]
    if english_apps(name):
        apple_english.append(every_row)

        
for every_row in google_clean:
    name = every_row[0]
    if english_apps(name):
        google_english.append(every_row)
        
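# Note: this call inspects google_clean (before the English filter), so the
# row count printed below is 9659; pass google_english to inspect the filtered list
explore_data (google_clean, 0, 3, True)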
print ("\n")
explore_data (apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9659
Number of columns:  13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'

So far we've cleaned our data: we removed the error row from the Google list, then removed duplicate apps, then removed non-English apps.
Our working lists are now google_english and apple_english.
However, we still need to isolate the free apps, since our company only deals with free apps.

In [222]:
free_google = []
for every_row in google_english: 
    price = every_row[7]
    if price == '0':
        free_google.append(every_row)
        
        
free_apple = []
for every_row in apple_english: 
    price = every_row[4]
    if price == '0.0':
        free_apple.append(every_row)
        
explore_data (free_google, 0, 3, True)
print ('\n')
explore_data (free_apple, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  8864
Number of columns:  13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'
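
Comparing price strings exactly ('0' for Google, '0.0' for Apple) works for the rows shown, but it depends on the exact text. A more forgiving check could convert the price to a number first. This is a sketch only: `is_free` is a hypothetical helper, and the idea that paid prices may carry a leading '$' is my assumption, since only free rows appear in the outputs above:

```python
def is_free(price_text):
    """Return True if a price string such as '0', '0.0' or '$0.00' is zero."""
    # Assumption: a price may carry a leading '$' that must be stripped.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable prices are treated as not free rather than crashing.
        return False


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

With this, both loops could share one test, e.g. `if is_free(every_row[7])` for Google and `if is_free(every_row[4])` for Apple.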

Now that our data is cleaned the way we want it, we want to see which app categories are likely to be the most profitable.
The reason is our release strategy: we see which apps are doing well, build an Android version of that kind of app, and release it on Google Play. If the app does well after 6 months, we release it on the App Store too. So we first look at which kinds of apps occur most frequently, because that may give us an idea of which are the most popular.

In [223]:
explore_data(free_google, 0, 3, True)
print ('\n')
explore_data(free_apple, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  8864
Number of columns:  13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'

Below we create a function that generates a frequency table. We define the freq_table function, which takes two parameters: the dataset and the column index.
1. create an empty dictionary (ft)
2. loop through the dataset and for every row:
3. take the value at the given index and assign it to a variable
4. if the value is already a key in the ft dictionary, add 1 to the current count at that key
5. else, add the key with a count of 1
6. finally, convert each count to a percentage of the total number of rows

In [224]:
def freq_table(dataset, index):
    ft = {}
    
    
    for every_row in dataset:
        category = every_row[index]
        if category in ft: 
            ft[category] += 1
        else: 
            ft[category] = 1
            
    ft_percentage = {}       
    for key in ft: 
        ft_percentage[key] = ((ft[key] / len(dataset)) * 100)
        
    return ft_percentage
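
For comparison, the counting part can be handed to `collections.Counter` from the standard library. This is a sketch of an equivalent function on made-up rows; the name `freq_table_pct` is hypothetical and is not used elsewhere in this notebook:

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'FAMILY']]
print(freq_table_pct(sample, 1))
# {'GAME': 50.0, 'TOOLS': 25.0, 'FAMILY': 25.0}
```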


Now we need a function that displays the table in descending order (THIS WAS GIVEN).
* The function takes two parameters, a dataset and the index.
* It generates a frequency table using the function we defined above.
* Calling sorted directly on a dictionary only sorts its keys, so we convert the frequency table (which is a dictionary) to a list of tuples.
* We create an empty list.
* We loop through the frequency table and, for every key, build a (value, key) tuple and append it to the list. Putting the value first means the sort is based on percentages.
* Then we create the sorted variable by sorting the display list in reverse (i.e. from highest to lowest) order.

In [225]:
def display_table(dataset,index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table: 
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted (table_display, reverse = True)
    return table_sorted
#     for entry in table_sorted: # This is useless
#         print (entry[1], ':', entry[0]) # I would've just used return
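
Another way to get the same ordering, without building (value, key) tuples by hand, is to sort the dictionary's items with a key function. A small sketch with a made-up table follows; `sort_by_percentage` is a hypothetical name. One difference from display_table: ties keep their original dictionary order here instead of being broken by the key name.

```python
def sort_by_percentage(table):
    """Return (key, percentage) pairs, highest percentage first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


sample_table = {'TOOLS': 25.0, 'GAME': 50.0, 'FAMILY': 25.0}
for key, percentage in sort_by_percentage(sample_table):
    print(key, ':', percentage)
# GAME : 50.0
# TOOLS : 25.0
# FAMILY : 25.0
```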

In [226]:
google_category = display_table(free_google, 1)
print (google_category)
print ('\n')
apple_category = display_table(free_apple, 11)
print ('\n')
google_genre = display_table(free_google, 9)

[(18.907942238267147, 'FAMILY'), (9.724729241877256, 'GAME'), (8.461191335740072, 'TOOLS'), (4.591606498194946, 'BUSINESS'), (3.9034296028880866, 'LIFESTYLE'), (3.892148014440433, 'PRODUCTIVITY'), (3.7003610108303246, 'FINANCE'), (3.531137184115524, 'MEDICAL'), (3.395758122743682, 'SPORTS'), (3.3167870036101084, 'PERSONALIZATION'), (3.2378158844765346, 'COMMUNICATION'), (3.0798736462093865, 'HEALTH_AND_FITNESS'), (2.944494584837545, 'PHOTOGRAPHY'), (2.7978339350180503, 'NEWS_AND_MAGAZINES'), (2.6624548736462095, 'SOCIAL'), (2.33528880866426, 'TRAVEL_AND_LOCAL'), (2.2450361010830324, 'SHOPPING'), (2.1435018050541514, 'BOOKS_AND_REFERENCE'), (1.861462093862816, 'DATING'), (1.7937725631768955, 'VIDEO_PLAYERS'), (1.3989169675090252, 'MAPS_AND_NAVIGATION'), (1.2409747292418771, 'FOOD_AND_DRINK'), (1.1620036101083033, 'EDUCATION'), (0.9589350180505415, 'ENTERTAINMENT'), (0.9363718411552346, 'LIBRARIES_AND_DEMO'), (0.9250902527075812, 'AUTO_AND_VEHICLES'), (0.8235559566787004, 'HOUSE_AND_HOME

Now we can analyze the data above. The table shows, for the free English apps, what percentage of apps falls into each category.
* What is the most common genre? Family (Google Category), Games (Apple genre) and Tools (Google Genres)
* What is the runner-up? Game (Google Category), Entertainment (Apple genre) and Entertainment (Google Genres)
* What other patterns do I see? Family, game, tools, entertainment and business apps have the highest frequency, which may suggest that they're the most popular kinds of apps with users
* What are most apps designed for: practical or entertainment purposes? (We use freq_table to test.)
* Based on the test below, I think it is too close to call: the practical categories and the entertainment categories add up to about the same share (19.47% vs 19.44%). But overall games, family and tools dominate, so I would recommend we make an entertainment app such as a game or a social app, or something along those lines.


In [227]:
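# Note: this table is built from google_clean (before the English and free filters), not free_google
google_test = freq_table(google_clean, 1)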
print (google_test['EDUCATION'] + google_test['TOOLS'] 
       + google_test['SHOPPING'] + google_test['PRODUCTIVITY'] 
      + google_test['LIFESTYLE'])
print ('\n')
print (google_test['GAME'] + google_test['PHOTOGRAPHY'] 
       + google_test['SOCIAL'] + google_test['ENTERTAINMENT'] 
      + google_test['SPORTS'])

19.47406563826483


19.44300652241433


Now we analyze the Apple list:
* What are the most common genres? Games (58.2%), Entertainment (7.9%) and Photo & Video (5.0%)
* What are the runner-ups? Education (3.7%), Social Networking (3.3%) and Shopping (2.6%)
* What other patterns? I notice that entertainment-type genres have way more apps than practical ones, 70% vs 12%, which suggests that the App Store's free English apps are dominated by entertainment apps
* This is odd because our whole strategy is to release the app on the Google market first to see if it performs well. But if we compare, entertainment apps make up about 50 percentage points more of the Apple market than of the Google market (70% vs 19%), so success on Google Play may not be a good indication that the app will work well on both markets


In [228]:
apple_test = freq_table(free_apple, -5)
print (apple_test)
print (apple_test['Education'] + apple_test['Utilities'] 
       + apple_test['Shopping'] + apple_test['Productivity'] 
      + apple_test['Lifestyle'])
print ('\n')
print (apple_test['Games'] + apple_test['Photo & Video'] 
       + apple_test['Social Networking'] + apple_test['Music'] 
      + apple_test['Sports'])

{'Social Networking': 3.2898820608317814, 'Photo & Video': 4.9658597144630665, 'Games': 58.16263190564867, 'Music': 2.0484171322160147, 'Reference': 0.5586592178770949, 'Health & Fitness': 2.0173805090006205, 'Weather': 0.8690254500310366, 'Utilities': 2.5139664804469275, 'Travel': 1.2414649286157666, 'Shopping': 2.60707635009311, 'News': 1.3345747982619491, 'Navigation': 0.186219739292365, 'Lifestyle': 1.5828677839851024, 'Entertainment': 7.883302296710118, 'Food & Drink': 0.8069522036002483, 'Sports': 2.1415270018621975, 'Book': 0.4345127250155183, 'Finance': 1.1173184357541899, 'Education': 3.662321539416512, 'Productivity': 1.7380509000620732, 'Business': 0.5276225946617008, 'Catalogs': 0.12414649286157665, 'Medical': 0.186219739292365}
12.104283054003725


70.60831781502173


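Next we estimate how popular each genre is. The Apple data has no download counts, so we use the number of user ratings per app (column index 5) as a stand-in and average it per genre.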

In [229]:
# APPLE
# Column -5 (index 11) is the genre; column 5 is the number of user ratings,
# which we use as a stand-in for downloads

unique_genres = freq_table(free_apple, -5)

for genre in unique_genres:
    total = 0
    len_genre = 0
    for every_row in free_apple: 
        genre_app = every_row[-5]
        if genre_app == genre:
            num_ratings = float(every_row[5])
            total += num_ratings
            len_genre += 1
    avg_num_user = total / len_genre
    print (genre, ':', round(avg_num_user,2))



Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22788.67
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.8
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.9
Book : 39758.5
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.0
Medical : 612.0
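
The loop above re-scans the whole dataset once per genre. As a sketch of an alternative, the totals and counts for every genre can be collected in a single pass; `average_by_group` is a hypothetical helper and the sample rows are made up:

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


sample = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
print(average_by_group(sample, group_index=0, value_index=1))
# {'Games': 200.0, 'Weather': 50.0}
```

For the Apple data the call would be along the lines of `average_by_group(free_apple, -5, 5)`.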


We can see that Navigation apps have the highest average number of ratings, followed by Reference and Social Networking. Our measure is not ideal, because it doesn't actually take the number of downloads into account. Based on these averages, and on what we saw before about entertainment apps being the most common, I would suggest making either a Social Networking app, a Navigation app, or a Reference app (whatever that may be).

In [230]:
print(g_list[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [231]:
# GOOGLE
# The Installs column contains non-numeric characters like + and ,
# which we need to remove before converting to a number

unique_g_apps = freq_table (free_google, 1)

for category in unique_g_apps: 
    total = 0
    len_cat = 0
    for every_row in free_google: 
        category_g = every_row[1]
        if category_g == category:
            avg_install = (every_row[5])
            avg_install = avg_install.replace('+', '')
            avg_install = avg_install.replace(',', '')
            avg_install = float(avg_install)
            total += avg_install
            len_cat += 1
    avg_install_tot = total / len_cat
    print (category, ";", round(avg_install_tot, 2))

ART_AND_DESIGN ; 1986335.09
AUTO_AND_VEHICLES ; 647317.82
BEAUTY ; 513151.89
BOOKS_AND_REFERENCE ; 8767811.89
BUSINESS ; 1712290.15
COMICS ; 817657.27
COMMUNICATION ; 38456119.17
DATING ; 854028.83
EDUCATION ; 1833495.15
ENTERTAINMENT ; 11640705.88
EVENTS ; 253542.22
FINANCE ; 1387692.48
FOOD_AND_DRINK ; 1924897.74
HEALTH_AND_FITNESS ; 4188821.99
HOUSE_AND_HOME ; 1331540.56
LIBRARIES_AND_DEMO ; 638503.73
LIFESTYLE ; 1437816.27
GAME ; 15588015.6
FAMILY ; 3695641.82
MEDICAL ; 120550.62
SOCIAL ; 23253652.13
SHOPPING ; 7036877.31
PHOTOGRAPHY ; 17840110.4
SPORTS ; 3638640.14
TRAVEL_AND_LOCAL ; 13984077.71
TOOLS ; 10801391.3
PERSONALIZATION ; 5201482.61
PRODUCTIVITY ; 16787331.34
PARENTING ; 542603.62
WEATHER ; 5074486.2
VIDEO_PLAYERS ; 24727872.45
NEWS_AND_MAGAZINES ; 9549178.47
MAPS_AND_NAVIGATION ; 4056941.77
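
The string clean-up for the Installs column can be pulled out into a small function so it is easy to test on its own. A sketch, with `parse_installs` as a hypothetical name:

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to the integer 1000000."""
    return int(text.replace(',', '').replace('+', ''))


print(parse_installs('10,000+'))         # 10000
print(parse_installs('1,000,000,000+'))  # 1000000000
```

Keep in mind that a value like '10,000+' is a lower bound rather than an exact count, so averages built from these numbers are rough estimates.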


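From the averages above, Communication (about 38.5 million), Video Players (about 24.7 million) and Social (about 23.3 million) have the highest average number of installs per app, and Game is also high at about 15.6 million. I would suggest game or social apps as the best options for us: the more people install the app, the more people see the ads and the more money we make.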