# Analysis of Mobile Application Data

## Purpose
As of mid-2018, there were over 2 million mobile applications available in either the iOS or Android systems. As makers of free apps, we would like to analyze existing app data to identify avenues for growth. 

The goal of this analysis is to improve understanding of what types of applications are likely to attract more users on the app stores, and select a genre of app that will be successful across platforms.  

This project uses data sets collected from the Google Play store ([Android](https://www.kaggle.com/datasets/lava18/google-play-store-apps)) and the App Store ([iOS](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps?select=appleStore_description.csv)) containing sample data from approximately 10,000 and 7,000 apps, respectively.

### Data 

In [1]:
from csv import reader

# Android Google Play store data
# (explicit encoding and a context manager so the file is read consistently and closed)
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

# Apple App Store data
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple = list(reader(opened_file))
apple_header = apple[0]
apple = apple[1:]

In [2]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice: 
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
print(apple_header)
print('\n')
explore_data(apple, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


### Data Cleaning
As this aims to be an analysis of free applications for an English-speaking audience, we remove apps that aren't free or aren't in English, as well as any duplicate or erroneous data.

#### Erroneous Data

In [5]:
print(android[10472]) #flagged as an incorrect row
print('\n')
print(android_header)
print('\n')
print(android[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In [6]:
# Row 10472 is missing its Category value, so every later column is shifted; we'll delete that row
print(len(android))
del android[10472]
print(len(android))

10841
10840


#### Duplicate data

As you can see below, the Instagram app appears several times in the Android data set.

In [7]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We want to make sure we're not including duplicate data in the analysis, so we'll check for duplicate entries, building separate lists of unique and duplicate app names.

In [8]:
duplicate_apps = []
unique_apps = []

for app in android: 
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else: 
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:10])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


We want to make sure we use the most up-to-date information, so for duplicates with different numbers of reviews, we'll keep the one with the most reviews (presumably the latest information). To accomplish this, we'll build a dictionary in which each key is a unique app name and the corresponding value is the highest number of reviews recorded for that app.

In [9]:
reviews_max = {}

for app in android: 
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

We found earlier that there are 1181 duplicate entries in the Android data set, which means that the deduplicated data set should contain 10840 - 1181 = *9659* rows.

In [10]:
print('Expected length:', len(android)-1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


Now we'll use our dictionary of reviews to select only the unique data we want in our set

In [11]:
android_clean = []
already_added = []

for app in android: 
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
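
As an alternative to the two-pass approach above, the same selection can be sketched in a single pass by storing the whole row (rather than only the review count) in a dictionary. This is a minimal sketch, not the method used in this analysis: `dedupe_by_max_reviews` is a hypothetical helper, the Instagram rows are shortened from the output shown earlier, and the `Box` row uses made-up values for illustration.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened sample rows: name, category, rating, reviews
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '1000'],  # made-up values for illustration
]

deduped = dedupe_by_max_reviews(sample)
print(len(deduped))      # number of unique apps
print(deduped[0])        # the Instagram row with the most reviews
```

Because dictionary lookups take constant time on average, this avoids the repeated list scans of `name not in already_added`. Note that the rows come back ordered by each name's first appearance, which can differ from the order produced by the cell above.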

In [12]:
explore_data(android_clean,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


Now we'll check for duplicate apps in the Apple store data, using each app's unique `id` (column 0).

In [13]:
duplicate_apps = []
unique_apps = []

for app in apple: 
    app_id = app[0]  # column 0 is the App Store `id`, not the app name
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else: 
        unique_apps.append(app_id)
        
print('Number of duplicate apps:', len(duplicate_apps))

Number of duplicate apps: 0


#### Keeping only apps in English

We'll start by creating a function to find apps with names that are not in English. Characters commonly used in English text (letters, digits, and punctuation) fall in the ASCII range, for which the `ord()` function returns a number from 0 to 127. We allow up to three characters outside that range, to accommodate emojis and other symbols that occur in the names of English-language apps. 

In [14]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else: 
        return True

Now we'll run our app data through the function and keep only the records for English language apps. 

In [15]:
android_english = []
apple_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in apple:
    name = app[1]
    if is_english(name):
        apple_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']

#### Selecting only free apps

As we're interested in researching free apps, we're restricting our analysis to exclude paid apps. We'll cycle through the data and keep only the apps whose price is listed as zero. 

In [16]:
android_final = []
apple_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in apple_english:
    price = app[4]
    if price == '0.0':
        apple_final.append(app)
        
print(len(android_final))
print(len(apple_final))
    

8864
3222
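
The cell above compares price strings literally (`'0'` for Google Play, `'0.0'` for the App Store). A slightly more defensive check, sketched below, parses the price as a number first. `is_free` is a hypothetical helper, and the leading `$` on paid prices is an assumption about how the Google Play data formats them.

```python
def is_free(price):
    """Treat a price string as free when it parses to zero.

    Assumes prices are plain numbers with an optional leading '$'.
    """
    return float(price.strip().lstrip('$')) == 0.0

for price in ['0', '0.0', '$4.99', '2.99']:
    print(repr(price), '->', is_free(price))
```

With a helper like this, both stores could share one filter instead of two format-specific string comparisons.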


### Analysis of Application Data

The goal of this analysis is to identify applications with broad appeal. To understand and reach the widest audience, we're including data from both Android and Apple app stores, and we'll identify the most commonly accessed and best reviewed applications.

Once we've identified the direction we'd like to pursue for our application, testing will unfold in three parts: 
1. We plan to test the app with a minimal build in Google Play. 
2. If the app generates a positive response from users, we will develop the app further. 
3. If, after six months, the app is profitable, we will build an iOS version and add it to the App Store.

#### Frequency of Apps by Genre

We will begin the analysis by identifying the most common app genres for each market.

In [25]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset: 
        total += 1
        value = row[index]
        if value in table: 
            table[value] += 1
        else: 
            table[value] = 1
        
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', '{:.2f} %'.format(entry[0]))
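
For comparison, a percentage frequency table can also be sketched with `collections.Counter` from the standard library. `freq_table_counter` is a hypothetical name, and the sample rows are made up for illustration; this is an alternative sketch, not the function used in the rest of this notebook.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Made-up rows: name, genre
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]

for genre, percentage in freq_table_counter(sample, 1).items():
    print(genre, ':', '{:.2f} %'.format(percentage))
```

`Counter.most_common()` already returns entries sorted by count, so no separate tuple-sorting step is needed for display.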


Now we can look at the categories/genre of apps in both App Store and Play Store data sets

In [26]:
display_table(apple_final,-5)

Games : 58.16 %
Entertainment : 7.88 %
Photo & Video : 4.97 %
Education : 3.66 %
Social Networking : 3.29 %
Shopping : 2.61 %
Utilities : 2.51 %
Sports : 2.14 %
Music : 2.05 %
Health & Fitness : 2.02 %
Productivity : 1.74 %
Lifestyle : 1.58 %
News : 1.33 %
Travel : 1.24 %
Finance : 1.12 %
Weather : 0.87 %
Food & Drink : 0.81 %
Reference : 0.56 %
Business : 0.53 %
Book : 0.43 %
Navigation : 0.19 %
Medical : 0.19 %
Catalogs : 0.12 %


From the percentages listed above, it's clear that games and entertainment are most heavily represented in the App Store (together accounting for ~66% of apps), while more practical apps (education, health, finance, etc.) comprise a small portion of the apps available. This does not necessarily imply that these apps have the greatest number of users: Social Networking apps are only 3.29% of the total despite being very popular. 

Next we'll look at the applications in the Google Play data

In [27]:
display_table(android_final, 1) #Category

FAMILY : 18.91 %
GAME : 9.72 %
TOOLS : 8.46 %
BUSINESS : 4.59 %
LIFESTYLE : 3.90 %
PRODUCTIVITY : 3.89 %
FINANCE : 3.70 %
MEDICAL : 3.53 %
SPORTS : 3.40 %
PERSONALIZATION : 3.32 %
COMMUNICATION : 3.24 %
HEALTH_AND_FITNESS : 3.08 %
PHOTOGRAPHY : 2.94 %
NEWS_AND_MAGAZINES : 2.80 %
SOCIAL : 2.66 %
TRAVEL_AND_LOCAL : 2.34 %
SHOPPING : 2.25 %
BOOKS_AND_REFERENCE : 2.14 %
DATING : 1.86 %
VIDEO_PLAYERS : 1.79 %
MAPS_AND_NAVIGATION : 1.40 %
FOOD_AND_DRINK : 1.24 %
EDUCATION : 1.16 %
ENTERTAINMENT : 0.96 %
LIBRARIES_AND_DEMO : 0.94 %
AUTO_AND_VEHICLES : 0.93 %
HOUSE_AND_HOME : 0.82 %
WEATHER : 0.80 %
EVENTS : 0.71 %
PARENTING : 0.65 %
ART_AND_DESIGN : 0.64 %
COMICS : 0.62 %
BEAUTY : 0.60 %


The Google Play landscape seems a bit different from the App Store. In the Android store, apps targeting lifestyle, productivity, and other practical needs seem to represent a larger share. Let's look at the frequency table for the `Genres` column to confirm. 

In [28]:
display_table(android_final, -4) #Genres

Tools : 8.45 %
Entertainment : 6.07 %
Education : 5.35 %
Business : 4.59 %
Productivity : 3.89 %
Lifestyle : 3.89 %
Finance : 3.70 %
Medical : 3.53 %
Sports : 3.46 %
Personalization : 3.32 %
Communication : 3.24 %
Action : 3.10 %
Health & Fitness : 3.08 %
Photography : 2.94 %
News & Magazines : 2.80 %
Social : 2.66 %
Travel & Local : 2.32 %
Shopping : 2.25 %
Books & Reference : 2.14 %
Simulation : 2.04 %
Dating : 1.86 %
Arcade : 1.85 %
Video Players & Editors : 1.77 %
Casual : 1.76 %
Maps & Navigation : 1.40 %
Food & Drink : 1.24 %
Puzzle : 1.13 %
Racing : 0.99 %
Role Playing : 0.94 %
Libraries & Demo : 0.94 %
Auto & Vehicles : 0.93 %
Strategy : 0.91 %
House & Home : 0.82 %
Weather : 0.80 %
Events : 0.71 %
Adventure : 0.68 %
Comics : 0.61 %
Beauty : 0.60 %
Art & Design : 0.60 %
Parenting : 0.50 %
Card : 0.45 %
Casino : 0.43 %
Trivia : 0.42 %
Educational;Education : 0.39 %
Board : 0.38 %
Educational : 0.37 %
Education;Education : 0.34 %
Word : 0.26 %
Casual;Pretend Play : 0.24 %
Music :

While there is some overlap between the `Category` and `Genres` columns in the Google Play data set, `Genres` is much more granular (many more distinct values). Since this analysis is concerned with the big picture, we'll work with the `Category` column going forward. 

The frequency tables above give us an idea of the distribution of offerings, but do not provide data on the distribution of users: which apps are the most popular? 

#### Most Popular Apps by Genre 

In the App Store data, there is no field indicating the number of installs per app, but we can use the total number of user ratings, `rating_count_tot`, as an approximation of users per app in a particular genre. 

In [29]:
genres_apple = freq_table(apple_final, -5)

for genre in genres_apple: 
    total = 0
    len_genre = 0
    
    for app in apple_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
            
    avg_n_ratings = total / len_genre
    print(genre, ':', '{:.2f}'.format(avg_n_ratings))

Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22788.67
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.80
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.90
Book : 39758.50
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.00
Medical : 612.00
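
The averages above are printed in dictionary order, which makes them hard to rank by eye. A single-pass version that also sorts the result could be sketched as follows. `average_by_group` is a hypothetical helper, and the sample rows are shortened (name, genre, rating count) from rows shown elsewhere in this notebook.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return (group, average) pairs sorted from highest to lowest average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Shortened sample rows: name, genre, total rating count
sample = [
    ['Waze', 'Navigation', '345046'],
    ['Google Maps', 'Navigation', '154911'],
    ['Facebook', 'Social Networking', '2974676'],
]

for genre, avg_n_ratings in average_by_group(sample, 1, 2):
    print(genre, ':', '{:.2f}'.format(avg_n_ratings))
```

This reads the data once rather than once per genre, and the sorted output puts the most-rated genres first.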


We can perform a similar analysis with the Google Play data, but instead of user ratings, we can use the `Installs` variable to gauge use. 

In [30]:
display_table(android_final, 5) #Installs column

1,000,000+ : 15.73 %
100,000+ : 11.55 %
10,000,000+ : 10.55 %
10,000+ : 10.20 %
1,000+ : 8.39 %
100+ : 6.92 %
5,000,000+ : 6.83 %
500,000+ : 5.56 %
50,000+ : 4.77 %
5,000+ : 4.51 %
10+ : 3.54 %
500+ : 3.25 %
50,000,000+ : 2.30 %
100,000,000+ : 2.13 %
50+ : 1.92 %
5+ : 0.79 %
1+ : 0.51 %
500,000,000+ : 0.27 %
1,000,000,000+ : 0.23 %
0+ : 0.05 %
0 : 0.01 %


One issue with this data is that the numbers are imprecise (e.g., 1,000+ rather than a specific number between 1,000 and 5,000, the next bucket). For our purposes, however, this is okay as we want to gauge general interest, and we will just use the base number given. That is to say, an app with 1,000,000+ installs will be considered to have 1,000,000 installs, and so on.  

To use the given data, we will have to convert the install numbers to `float` by removing the commas and '+' characters before performing any calculations. We will do this at the same time as we average the number of installs per genre/category. 
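
The conversion described above can be sketched on individual values as follows. `parse_installs` is a hypothetical helper name; the input strings are bucket labels that appear in the `Installs` frequency table.

```python
def parse_installs(raw):
    """Convert a string such as '1,000,000+' to its lower-bound value as a float."""
    return float(raw.replace(',', '').replace('+', ''))

for raw in ['1,000,000+', '500+', '0']:
    print(raw, '->', parse_installs(raw))
```

Because each value is the lower bound of its bucket, averages computed from these numbers understate the true install counts.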

In [38]:
categories_android = freq_table(android_final, 1)

for category in categories_android: 
    total  = 0
    len_category = 0
    
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',','')
            n_installs = n_installs.replace('+','')
            total += float(n_installs)
            len_category += 1
            
    avg_n_installs = total / len_category
    print(category, ':', '{:.2f}'.format(avg_n_installs))

ART_AND_DESIGN : 1986335.09
AUTO_AND_VEHICLES : 647317.82
BEAUTY : 513151.89
BOOKS_AND_REFERENCE : 8767811.89
BUSINESS : 1712290.15
COMICS : 817657.27
COMMUNICATION : 38456119.17
DATING : 854028.83
EDUCATION : 1833495.15
ENTERTAINMENT : 11640705.88
EVENTS : 253542.22
FINANCE : 1387692.48
FOOD_AND_DRINK : 1924897.74
HEALTH_AND_FITNESS : 4188821.99
HOUSE_AND_HOME : 1331540.56
LIBRARIES_AND_DEMO : 638503.73
LIFESTYLE : 1437816.27
GAME : 15588015.60
FAMILY : 3695641.82
MEDICAL : 120550.62
SOCIAL : 23253652.13
SHOPPING : 7036877.31
PHOTOGRAPHY : 17840110.40
SPORTS : 3638640.14
TRAVEL_AND_LOCAL : 13984077.71
TOOLS : 10801391.30
PERSONALIZATION : 5201482.61
PRODUCTIVITY : 16787331.34
PARENTING : 542603.62
WEATHER : 5074486.20
VIDEO_PLAYERS : 24727872.45
NEWS_AND_MAGAZINES : 9549178.47
MAPS_AND_NAVIGATION : 4056941.77


### Discussion and Recommendations

Some of the numbers above could be misleading. At first, one might be tempted to read the Navigation genre's high average number of ratings (86,090 per app in the App Store) and small share of the market (0.19 %) as a sign of a large audience clamoring for few apps, and therefore an opportunity for growth. A careful examination of the genre, however, reveals that the average is skewed by a very few apps garnering a large number of ratings, while the rest of the genre struggles:

In [40]:
for app in apple_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


A few other genres, including weather, reference, and social networking suffer from this same problem - a few very popular apps dominating the field. 
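
One way to quantify this skew is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses the six Navigation rating counts printed above; the choice of median and top-two share as summary measures is an illustration, not part of the original analysis.

```python
from statistics import mean, median

# Rating counts for the six free Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('Mean:', round(mean(navigation_ratings), 2))
print('Median:', median(navigation_ratings))

# Share of all Navigation ratings that belong to the two largest apps
top_two_share = sum(navigation_ratings[:2]) / sum(navigation_ratings) * 100
print('Share held by top two apps: {:.1f} %'.format(top_two_share))
```

A median far below the mean, together with a large share held by the top two apps, indicates that a typical app in the genre attracts far fewer ratings than the average suggests.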

Other genres, such as games and entertainment, are heavily downloaded but also make up a large portion of the offerings, so a new app risks being overlooked as one of many in a saturated field. 

One potentially fruitful direction would be to appeal to the seemingly general interest in books, social networking, and games by creating an app that presents popular books, generates quizzes or trivia games based on those books, and connects readers who enjoy the same books. One potential avenue for social networking could be reading groups where people can read the book "together" and discuss, or compete in book knowledge games. This sort of app could appeal to those wishing to read more, and to connect with other readers. 