# Apps that attract the most users

In this project we look at the current market for free apps on the iOS App Store and Google Play to see what kinds of apps attract the most users.

The goal is to find out which apps are the most successful, measured by how many users they have.

The dataset for the Android apps can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).

The dataset for the Apple apps can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [2]:
from csv import reader

# Read each CSV into a list of lists; row 0 of each list is the header.
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple_apps_data = list(reader(opened_file))

with open('googleplaystore.csv', encoding='utf8') as opened_file:
    goog_apps_data = list(reader(opened_file))


def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
explore_data(apple_apps_data, 0, 3, True)
explore_data(goog_apps_data, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone

This is how we read in each data set and store it as a list of lists. The explore_data function prints a chosen slice of rows and, optionally, the total number of rows and columns.
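One way to avoid repeating the read logic, and to keep the header apart from the data rows, is a small helper. This is only a sketch: `load_dataset` is a hypothetical name that does not appear in the notebook, and the demo file is a throwaway stand-in for the real CSV files.

```python
import csv
import os
import tempfile

def load_dataset(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`."""
    with open(path, encoding=encoding, newline='') as f:
        data = list(csv.reader(f))
    return data[0], data[1:]

# Demo on a tiny throwaway file (hypothetical stand-in for AppleStore.csv)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w', encoding='utf8', newline='') as f:
        f.write('id,track_name\n1,Facebook\n2,Instagram\n')
    header, rows = load_dataset(demo_path)

print(header)     # ['id', 'track_name']
print(len(rows))  # 2
```

Keeping the header separate means later loops can iterate over `rows` directly, without slicing off the first element each time.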

In [3]:
explore_data(goog_apps_data, 10471, 10473, True)

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [4]:
del goog_apps_data[10471]

In [5]:
explore_data(goog_apps_data, 10471, 10473, True)

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10841
Number of columns: 13


Next we clean the data so that errors in the raw files do not distort the analysis. After reading the discussion forum for the Android dataset on Kaggle, I found that there might be something wrong with row 10471, so I took a look at the rows around it. The output shows a row with only 12 values where the header has 13, which means its values are shifted into the wrong columns. I deleted row 10471 to clean the data.
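Because the forum's row number may or may not count the header, it can help to locate short rows programmatically before deleting anything. In the output above, the 12-value row ('Life Made WI-Fi Touchscreen Photo Frame') is still printed after the deletion, so a check along these lines may be worth running. The sketch below uses made-up three-column toy data, and `find_malformed_rows` is a hypothetical helper name.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data: a header plus three rows, one of which is missing a value
toy = [
    ['App', 'Category', 'Rating'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4'],
    ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]

for index, row in find_malformed_rows(toy):
    print(index, len(row), row[0])  # 3 2 Life Made WI-Fi Touchscreen Photo Frame
```

Running the same function on the full dataset would report the exact index to pass to `del`.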

I also looked at the Apple dataset discussion forum. At first someone mentioned that there might be a possible duplicate row, but it was determined that there were two different apps with the same name merely by chance.

In [6]:
duplicates = []
unique_apps = []

for app in goog_apps_data:
    name = app[0]
    if name in unique_apps:
        duplicates.append(name)
    else: 
        unique_apps.append(name)
print('Number of duplicate apps: ', len(duplicates))
print('\n')
print('Examples of duplicate apps:', duplicates[:10])

Number of duplicate apps:  1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


In [7]:
for app in goog_apps_data:
    name = app[0]
    if name == 'Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


My own search confirmed that some apps appear more than once in the Android dataset, Slack among them. I looked at each app name and added it to a list called unique_apps; if the name was already in unique_apps, I added it to a second list called duplicates. This found 1181 duplicate entries.

To clean the data, I will remove the duplicates that appear to be older snapshots of the same app. The criterion is the total number of reviews, which should never go down over time, so the row with the most reviews should be the most recent. In the Slack example above, the review count (column 4) is the only value that differs between rows.

In [8]:
reviews_max = {}

for row in goog_apps_data[1:]:
    name = row[0]
    n_reviews = (row[3]) 
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] =  n_reviews 
    if (name not in reviews_max):
        reviews_max[name] = n_reviews
    
print(len(reviews_max))       


9659


Here I created a dictionary called reviews_max that maps each app name to the highest review count seen for it. Looping over the rows of the Google apps dataset, I add a new entry if the app is not in the dictionary yet; if it is, I check whether this row's review count is greater than the stored one and replace it if so. Then I printed the length to check that it has the expected number of entries: 10841 rows minus the header and the 1181 duplicates gives 9659.
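One detail worth checking: `row[3]` is read from the CSV as a string, and Python compares strings character by character, so `'967' < '1042'` is false. Below is a minimal sketch of the same dictionary built on integer counts. It assumes every review count is a plain integer string; the function name `max_reviews_by_name` and the 'Demo App' rows are made up for illustration.

```python
def max_reviews_by_name(rows, name_index=0, reviews_index=3):
    """Map each app name to its highest review count, compared as integers."""
    reviews_max = {}
    for row in rows:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
    return reviews_max

# Only the name (index 0) and review count (index 3) matter here
toy_rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Demo App', 'TOOLS', '4.0', '967'],
    ['Demo App', 'TOOLS', '4.0', '1042'],
]

print('967' < '1042')                 # False: strings compare character by character
print(max_reviews_by_name(toy_rows))  # {'Slack': 51510, 'Demo App': 1042}
```

The number of dictionary entries is the same either way, since it depends only on the distinct names; what can differ is which review count is stored as the maximum.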

In [9]:
android_clean = []
already_added = []

for row in goog_apps_data[1:]:
    name = row[0]
    n_reviews = (row[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
print(len(android_clean))


9659


This time we create two lists: android_clean for the cleaned dataset, and already_added for the names of apps that have already been added. If a row's review count matches the maximum stored in the reviews_max dictionary, and the name is not already in already_added, we append the row to android_clean and record the name in already_added, so that each app is added only once.
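As an alternative to the two-pass approach above, the same result can be sketched in a single pass by storing the best row per name in a dictionary. This is a different technique from the notebook's, shown on made-up rows; `keep_most_reviewed` is a hypothetical name, and review counts are assumed to be plain integer strings.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        # Strict '>' means the first row wins when review counts tie
        if name not in best or int(row[reviews_index]) > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy_rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Demo App', 'TOOLS', '4.0', '967'],
]

clean = keep_most_reviewed(toy_rows)
print(len(clean))  # 2
print(clean[0])    # ['Slack', 'BUSINESS', '4.4', '51510']
```

A dictionary lookup is constant time on average, whereas `name not in already_added` scans a list, so this version should scale better on the full dataset.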

In [10]:
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False
        else:
            return True
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))


True
False
True
True
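Note that the function above returns inside the first loop iteration, so it only inspects the first character of the name, which is why 'Instachat 😜' passes. A possible variant checks every character and tolerates a few non-ASCII symbols such as ™ or an emoji. The threshold of 3 and the name `is_probably_english` are assumptions for illustration, not part of the original notebook.

```python
def is_probably_english(string, max_non_ascii=3):
    """True if the string has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Instagram'))                      # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))      # False
print(is_probably_english('Docs To Go™ Free Office Suite'))  # True
print(is_probably_english('Instachat 😜'))                    # True
print(is_probably_english('PPS 爱奇艺电视剧热播'))               # False
```

The last call shows the difference: a name that starts with ASCII letters but is mostly non-ASCII is rejected here, while a first-character check would accept it.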


In [11]:
android_cleanest = []
for row in android_clean[1:]:
    name = row[0]
    if is_english(name):
        android_cleanest.append(row)
        
print(len(android_cleanest))

apple_cleanest = []
for row in apple_apps_data[1:]:
    name = row[0]
    if is_english(name):
        apple_cleanest.append(row)
        
print(len(apple_cleanest))

9622
7197


In [12]:
free_android = []
for row in android_cleanest[1:]:
    price = row[6]
    if price == 'Free':
        free_android.append(row)
print(len(free_android))

free_apple = []
for row in apple_cleanest[1:]:
    price = row[4]
    if price == '0.0':
        free_apple.append(row)
print(len(free_apple))

8867
4055


We want an app profile that fits both the App Store and Google Play, because neglecting either one means missing out on roughly 50% of the market. It therefore makes sense to look at both stores to see what kinds of apps do well there. Column 10 (index 9) in the Android data set holds the genre (Genres), and column 12 (index 11) in the Apple data set holds the corresponding prime_genre.
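Hard-coded indices are easy to get wrong. In the Apple header, for example, index 0 is `id` and the app name (`track_name`) is at index 1. A small sketch that looks indices up by column name, using the two header rows printed earlier; `column_index` is a hypothetical helper name.

```python
APPLE_HEADER = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
ANDROID_HEADER = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

def column_index(header, column_name):
    """Return the position of `column_name`; raises ValueError if it is absent."""
    return header.index(column_name)

print(column_index(APPLE_HEADER, 'track_name'))   # 1
print(column_index(APPLE_HEADER, 'prime_genre'))  # 11
print(column_index(ANDROID_HEADER, 'Genres'))     # 9
print(column_index(ANDROID_HEADER, 'Type'))       # 6
```

With this, a filter can read `row[column_index(header, 'track_name')]` and fail loudly if a column is renamed or missing.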

In [13]:
def freq_table(dataset, index):
    frequency_table = {}
    count = 0
    for row in dataset[1:]:
        count +=1
        genre = row[index]
        if genre in frequency_table:
            frequency_table[genre] = (1 + frequency_table[genre]) 
        else:
            frequency_table[genre] = 1 
    for genre in frequency_table:
        frequency_table[genre] /= count
        frequency_table[genre] *= 100
        
    return frequency_table

In [14]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
print(freq_table(free_android, 9))
print('\n')
print(freq_table(free_apple, 11))

{'Art & Design;Creativity': 0.06767426122264832, 'Art & Design': 0.5639521768554027, 'Auto & Vehicles': 0.9248815700428604, 'Beauty': 0.5977893074667269, 'Books & Reference': 2.1768554026618543, 'Business': 4.601849763140086, 'Comics': 0.5865102639296188, 'Comics;Creativity': 0.011279043537108053, 'Communication': 3.225806451612903, 'Dating': 1.8610421836228286, 'Education': 5.380103767200541, 'Education;Creativity': 0.04511617414843221, 'Education;Education': 0.34965034965034963, 'Education;Pretend Play': 0.05639521768554027, 'Education;Brain Games': 0.03383713061132416, 'Entertainment': 6.068125422964132, 'Entertainment;Brain Games': 0.07895330475975637, 'Entertainment;Creativity': 0.03383713061132416, 'Entertainment;Music & Video': 0.1691856530566208, 'Events': 0.7105797428378074, 'Finance': 3.6882472366343335, 'Food & Drink': 1.2294157455447778, 'Health & Fitness': 3.0791788856304985, 'House & Home': 0.8120911346717798, 'Libraries & Demo': 0.9361606135799684, 'Lifestyle': 3.9251071

In [15]:
explore_data(free_android, 0, 3, True)

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8867
Number of columns: 13


Also, I had to change the original column I was going to use because column 2 in the Android dataset switches between numbers and strings to classify genres.

The most common genre on the App Store is Games by far, at 55%, whereas the most common genres on Google Play are Tools and Education. Games are also fairly common on Google Play, but overall that store is much more balanced in the kinds of apps it offers. However, these tables only tell us how many apps of each kind are in each store, not how popular they are. For instance, even though social networking apps make up only 3.5% of the apps on the App Store, it seems unlikely that only 3.5% of people use them.
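The raw dictionaries printed above are hard to scan for the largest genres. One way to get a sorted view is sketched below with `collections.Counter` from the standard library. The rows are toy data, and `sorted_percentages` is a hypothetical name.

```python
from collections import Counter

def sorted_percentages(rows, index):
    """Return (value, percentage) pairs for column `index`, largest first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    table = [(value, 100 * count / total) for value, count in counts.items()]
    return sorted(table, key=lambda pair: pair[1], reverse=True)

# Toy rows: (app name, genre)
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

for genre, pct in sorted_percentages(toy_rows, 1):
    print(f'{genre}: {pct:.1f}%')
# Games: 75.0%
# Music: 25.0%
```

Unlike the loop in freq_table, this counts every row it is given, so it expects data rows without a header.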

In [18]:
# Average number of ratings per app for each genre
for genre in freq_table(free_apple, 11):
    total = 0
    len_genre = 0
    for row in free_apple[1:]:
        genre_app = row[11]
        if genre_app == genre:
            ratings = float(row[5])  # rating_count_tot
            total += ratings
            len_genre += 1
    avg_rating = total / len_genre
    print(genre, avg_rating)
                

Games 18924.68896765618
Music 56482.02985074627
Social Networking 32503.563380281692
Reference 67447.9
Health & Fitness 19952.315789473683
Weather 47220.93548387097
Utilities 14010.100917431193
Travel 20216.01785714286
Shopping 18746.677685950413
News 15892.724137931034
Navigation 25972.05
Lifestyle 8978.308510638299
Photo & Video 14392.614457831325
Entertainment 10822.961077844311
Food & Drink 20179.093023255813
Sports 20128.974683544304
Book 8498.333333333334
Finance 13522.261904761905
Education 6266.333333333333
Productivity 19053.887096774193
Business 6367.8
Catalogs 1779.5555555555557
Medical 459.75


Based on the average number of ratings per app, I would recommend that we make a reference app, or possibly a music app. These two genres have the highest average number of ratings, which suggests there will be plenty of users to view ads and increase our revenue.
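The per-genre averages above rescan the whole dataset once per genre. A single-pass version could look like the sketch below, which accumulates totals and counts per group and then sorts the result so that the top genres come first. The rows are toy data, and `average_by_group` is a hypothetical name.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` for each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows: (genre, rating count)
toy_rows = [['Reference', '100'], ['Reference', '300'], ['Music', '50']]

averages = average_by_group(toy_rows, 0, 1)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, avg)
# Reference 200.0
# Music 50.0
```

An average can be dominated by a few very large apps, so it may be worth looking at the individual apps in a genre before committing to it.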

In [21]:
for category in freq_table(free_android, 9):
    total = 0
    len_category = 0
    for row in free_android[1:]:
        category_app = row[9]
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total += installs
            len_category +=1
    avg_installs = total / len_category
    print(category, avg_installs)
    



Art & Design;Creativity 285000.0
Art & Design 1150022.0
Auto & Vehicles 647317.8170731707
Beauty 513151.88679245283
Books & Reference 8631794.093264248
Business 1708215.906862745
Comics 843675.9615384615
Comics;Creativity 50000.0
Communication 38590546.15734266
Dating 854028.8303030303
Education 539389.7295597484
Education;Creativity 2875000.0
Education;Education 4606306.774193549
Education;Pretend Play 1800000.0
Education;Brain Games 5333333.333333333
Entertainment 5599186.827137547
Entertainment;Brain Games 3314285.714285714
Entertainment;Creativity 4000000.0
Entertainment;Music & Video 6413333.333333333
Events 253542.22222222222
Finance 1361355.1437308867
Food & Drink 1942465.605504587
Health & Fitness 4188821.9853479853
House & Home 1348645.2916666667
Libraries & Demo 638503.734939759
Lifestyle 1415357.5545977012
Lifestyle;Pretend Play 10000000.0
Arcade 23028723.558282208
Puzzle 8302861.91
Racing 15910645.681818182
Sports 4596842.615635179
Casual 19630958.51612903
Simulation 343859

Books & Reference also appears to be one of the larger app markets on Android, with about 8.6 million installs per app on average. Therefore I suggest we make a reference app.
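The install figures behind these averages are open-ended buckets such as '10,000+', so each parsed number is a lower bound and the averages are approximate. The conversion used in the loop above can be isolated in a small helper; `parse_installs` is a hypothetical name.

```python
def parse_installs(text):
    """Turn an install bucket such as '10,000+' into the integer 10000."""
    return int(text.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))      # 10000
print(parse_installs('50,000,000+'))  # 50000000
```

An app listed as '10,000+' may have anywhere from 10,000 to just under the next bucket, which is worth keeping in mind when comparing genres with similar averages.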