# Mobile Apps Analysis

The goal of this project is to help developers understand what attracts users to particular free apps to aid in developing more appealing apps. We will use a dataset of more than 7,000 apps from the Apple App Store in 2017 (iOS apps) and more than 10,000 apps from the Google Play Store in 2018 (Android apps).

In [43]:
#Function to read a csv and make it into a list of lists
from csv import reader

def open_file(file_name):
    #Using a with block so the file is closed once it has been read
    with open(file_name, encoding = 'utf8') as opened_file:
        read_file = reader(opened_file)
        data = list(read_file)
    return data

#Function to neatly print a few rows of a dataset to examine it
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
#Reading in the Apple Store and Google Play datasets
apple = open_file("AppleStore.csv")
google = open_file("googleplaystore.csv")
#Removing header rows
apple_header = apple[0]
apple = apple[1:]
google_header = google[0]
google = google[1:]

#Exploring the two datasets
print("Apple Store Dataset")
print(apple_header)
print("\n")
explore_data(apple, 0, 3, rows_and_columns = True)
print("\n")
print("Google Play Dataset")
print(google_header)
print("\n")
explore_data(google, 0, 3, rows_and_columns = True)

Apple Store Dataset
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


Google Play Dataset
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'AR

## Documentation
The code below prints out the column names for both of the datasets. For detailed documentation for the Apple Store dataset, see [this link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps); for the Google Play dataset, see [this link](https://www.kaggle.com/lava18/google-play-store-apps).

In [44]:
#Printing the column names
print("Apple Store dataset columns")
print(apple_header)
print("\n")
print("Google Play dataset columns")
print(google_header)

Apple Store dataset columns
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Google Play dataset columns
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


## Data Cleaning and Preparation

### Removing incorrect row

In [45]:
#Removing row with missing column
print(google[10472])
print(len(google[10472]))
#Length of correct rows
print(len(google[10471]))
del google[10472]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12
13
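
Rather than relying on a hard-coded index, a malformed row can also be located by comparing each row's length with the header's. The following is a minimal sketch on a tiny made-up dataset; `find_bad_rows` and the sample rows are hypothetical and not part of the original notebook.

```python
def find_bad_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Tiny made-up sample: the second row is missing its 'Category' value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

bad = find_bad_rows(sample_rows, sample_header)
print(bad)

# Delete from the end so that earlier indices stay valid
for i, _ in reversed(bad):
    del sample_rows[i]
print(len(sample_rows))
```

Applied to the real data, the call would be `find_bad_rows(google, google_header)`, assuming the header has already been split off as above.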


### Removing duplicate rows
The Google Play dataset contains duplicate entries for some apps, as can be seen below. For each app name, we will keep only the row with the highest number of reviews, on the assumption that this row was collected most recently.

In [46]:
#Displaying the duplicate rows for Instagram
for row in google:
    if row[0] == 'Instagram':
        print(row)
        
#Finding the number of duplicate apps and storing the duplicate app names
duplicate_apps = []
unique_apps = []

for row in google:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('\n')
print("Number of duplicate apps:", len(duplicate_apps))
print('\n')
print("Examples of duplicate apps:", duplicate_apps[:10])

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


In [47]:
#Creating a dictionary of app names with the maximum number of reviews of all
#rows with that name
reviews_max = {}
for row in google:
    name = row[0]
    n_reviews = float(row[3])
    
    #If the app name has already been observed and the current number of reviews
    #is greater than what is stored, then update the value in reviews_max to be
    #that number of reviews, the new maximum
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

#Initialize the new clean dataset
google_clean = []
#Storing just app names
already_added = []

for row in google:
    name = row[0]
    n_reviews = float(row[3])
    
    #If the number of reviews is the max for the app name and that app has
    #not already been added to the dataset (because two rows could have the
    #same number of reviews), then add the row to the dataset and the name to
    #the already_added list
    if n_reviews == reviews_max[name] and name not in already_added:
        google_clean.append(row)
        already_added.append(name)
        
#Checking length and a few rows
explore_data(google_clean, 0, 3, rows_and_columns = True)  

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
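
The same de-duplication rule can be written as a single pass with one dictionary. This is a sketch of an alternative, not the method used above: `dedupe_by_max_reviews` is a hypothetical helper, the sample rows are shortened, and "Example App" is made up. One difference to be aware of is that the output keeps the order in which each app name first appears, which may not match the row order produced by the two-loop version.

```python
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when two rows tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened sample rows: name, category, rating, reviews
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Example App', 'TOOLS', '4.0', '120'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]

deduped = dedupe_by_max_reviews(sample)
for row in deduped:
    print(row)
```

Because membership tests on a dictionary do not scan every stored name the way `name not in already_added` does on a list, this version should also scale better on larger datasets.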


### Removing non-English apps
The target audience for the company we are working for is English-speaking. Therefore, we want to remove all apps that are targeted at a non-English-speaking audience.

In [48]:
def detect_english(string):
    
    #Initialize number of non-english characters
    n_non_english = 0
    
    #Iterate over the characters in the string and check if the number associated is greater than 127
    #If it is, increase the number of non english characters in the string by 1
    for c in string:
        if ord(c) > 127:
            n_non_english +=1
            
        #If at any point in the loop, the number of non english characters is greater than 3, stop and
        #return false (could do this at the end, but prevents loop running more than necessary)
        if n_non_english > 3:
            return False
        
    #If the whole loop runs without returning False, return True (app is for an English speaking audience)
    return True

#Use these strings to test the function
print(detect_english('Instagram'))
print(detect_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(detect_english('Docs To Go™ Free Office Suite'))
print(detect_english('Instachat 😜'))

True
False
True
True
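
The check above relies on the fact that standard English text falls in the ASCII range (code points 0 to 127), while allowing up to three characters outside it so that names containing symbols such as ™ or an emoji are not discarded. A compact equivalent is sketched below; `is_english` and its `max_non_ascii` parameter are hypothetical names introduced for illustration.

```python
def is_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for c in name if ord(c) > 127)
    return non_ascii <= max_non_ascii

# Code points of an ASCII letter, a trademark sign, and an emoji
print(ord('a'), ord('™'), ord('😜'))

# The same test strings used for detect_english
for name in ['Instagram',
             '爱奇艺PPS -《欢乐颂2》电视剧热播',
             'Docs To Go™ Free Office Suite',
             'Instachat 😜']:
    print(name, '->', is_english(name))
```

Unlike `detect_english`, this version always scans the whole string, which is a negligible cost for short app names.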


In [49]:
#Creating separate datasets with just English apps

#Initializing apple and google datasets
apple_english = []
google_english = []

for row in apple:
    #Check if the name is in English ("track_name" is index 2 in the apple dataset)
    english = detect_english(row[2])
    #If the name is english, add it to the dataset
    if english:
        apple_english.append(row)
        
for row in google_clean:
    #Check if the name is in English ("app" is index 0 in the google dataset)
    english = detect_english(row[0])
    #If the name is english, add it to the dataset
    if english:
        google_english.append(row)

#Exploring the two datasets
print("Apple Store Dataset English Only")
print(apple_header)
print("\n")
explore_data(apple_english, 0, 1, rows_and_columns = True)
print("\n")
print("Google Play Dataset English Only")
print(google_header)
print("\n")
explore_data(google_english, 0, 1, rows_and_columns = True)

Apple Store Dataset English Only
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


Number of rows: 6183
Number of columns: 17


Google Play Dataset English Only
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


### Removing apps that are not free
The company we are reporting to is interested in free apps with in-app ads so we want to restrict our analysis to free apps.

In [51]:
#Restricting to free apps

#Initializing apple and google datasets
apple_final = []
google_final = []

#Checking if price is "0" in apple dataset
for row in apple_english:
    price = row[5]
    if price == "0":
        apple_final.append(row)

#Checking if price is "0" in google dataset ("Price" is index 7)
for row in google_english:
    price = row[7]
    if price == "0":
        google_final.append(row)
        
#Exploring the two datasets
print("Apple Store Dataset Free English Only")
print(apple_header)
print("\n")
explore_data(apple_final, 0, 1, rows_and_columns = True)
print("\n")
print("Google Play Dataset Free English Only")
print(google_header)
print("\n")
explore_data(google_final, 0, 1, rows_and_columns = True)

Apple Store Dataset Free English Only
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Number of rows: 3222
Number of columns: 17


Google Play Dataset Free English Only
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 8864
Number of columns: 13


After our data cleaning, we are left with 8864 Android apps and 3222 iOS apps.
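
The filter above compares the price to the exact string "0". If the price column were formatted differently (for example "0.0", or with a currency symbol such as "$4.99"), a numeric comparison would be more forgiving. The sketch below shows one way to do that; `is_free` is a hypothetical helper, and the alternative formats are assumptions rather than something shown in the output above.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0', or '$0.00' equals zero."""
    cleaned = price.replace('$', '').strip()
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Prices that cannot be parsed as numbers are treated as not free
        return False

for price in ['0', '0.0', '$0.00', '3.99', '$4.99', 'Everyone']:
    print(repr(price), '->', is_free(price))
```

With this helper, the two loops would test `is_free(row[5])` for the Apple data and `is_free(row[7])` for the Google Play data.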

## Analysis

The company's goal is to identify apps that attract the most users. Its development strategy is as follows:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, develop it further.
3. If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

Because they aim to have the app in both the Google Play store and the Apple Store, they want to make sure it appeals to both markets.

### Finding most common genres
We are going to use the "prime_genre" column in the Apple dataset and the "Genres" and "Category" columns in the Google Play dataset to find the most common genres of apps in each market.

In [65]:
#Function to create a frequency table and display the percentages
def freq_table(dataset, index):
    table = {}
    total = 0
    
    #Building the dictionary
    for row in dataset:
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
        #Finding the total number of apps
        total += 1

    #Transforming to percentages
    for i in table:
        percentages = round(100 * (table[i] / total), 2)
        table[i] = percentages
        
    return table

#Function to display the frequency table on separate lines sorted by percentage
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
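
For comparison, the same frequency table can be built with `collections.Counter` from the standard library. This is a sketch of an alternative, not a replacement for the functions above; `freq_table_counter`, `display_sorted`, and the sample rows are hypothetical. Entries with equal percentages may print in a different order than with `display_table`.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, rounded to 2 decimals."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: round(100 * n / total, 2) for value, n in counts.items()}

def display_sorted(table):
    """Print entries from the highest percentage to the lowest."""
    for value, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
        print(value, ':', pct)

# Made-up sample: app name, genre
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
table = freq_table_counter(sample, 1)
display_sorted(table)
```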

#### Apple genres

In [66]:
display_table(apple_final, 12)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


#### Google genres

In [67]:
display_table(google_final, 1)
print("\n")
display_table(google_final, 9)

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Ar

Of the free English apps in the Apple store, more than half are games. The second most popular category is Entertainment. In the Google Play store, the most popular category of free English apps is "FAMILY", which is primarily kids' games, followed by games and tools with very similar values. In general, the Apple store seems to have a greater proportion of games, and the Google Play store a greater proportion of practical apps. However, the number of apps available does not tell us which apps are the most popular. There could be many apps with few users.

### Finding the most popular apps
We are going to use the "Installs" column in the Google Play dataset to find the genres that have the most installs. In the Apple dataset, we will use the "rating_count_tot" column to find the genres that have the most ratings (as a proxy for installs, since this information is not available).

#### Apple dataset

In [78]:
#Finding the different genres in the apple dataset (the keys in the frequency table dictionary)
genre_dict = freq_table(apple_final, 12)

for genre in genre_dict:
    #Initiate the total number of apps in the genre
    total_apps = 0
    #Initiate the number of ratings
    total_ratings = 0
    
    for row in apple_final:
        app_genre = row[12]
        #If the genre of this app matches the genre of our outer loop then update values
        if app_genre == genre:
            n_ratings = float(row[6])
            total_ratings += n_ratings
            total_apps += 1

    #Finding the average number of ratings per app in the genre
    print(genre)
    print(round(total_ratings / total_apps, 2))
    
#Exploring the reference category

print("\n")
for app in apple_final:
    if app[12] == 'Reference':
        print(app[2], ':', app[6])

Productivity
21028.41
Weather
52279.89
Shopping
26919.69
Reference
74942.11
Finance
31467.94
Music
57326.53
Utilities
18684.46
Travel
28243.8
Social Networking
71548.35
Sports
23008.9
Health & Fitness
23298.02
Games
22788.67
Food & Drink
33333.92
News
21248.02
Book
39758.5
Photo & Video
28441.54
Entertainment
14029.83
Business
7491.12
Lifestyle
16485.76
Education
7003.98
Navigation
86090.33
Medical
612.0
Catalogs
4004.0


Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
WWDC : 762
Jishokun-Japanese English Dictionary & Translator : 0
教えて!goo : 0
VPN Express : 14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Guides for Pokémon GO - Pokemon GO News and Cheats 

In the Apple store, the genres that have the most ratings per app are "Navigation", "Reference", and "Social Networking". "Music" also has a high average number of ratings per app. However, I think that these results are skewed by the fact that these categories contain extraordinarily popular apps that likely bias the averages (Facebook, Instagram, Google Maps, Bing, Apple Music, etc.). For someone building a brand new app, the popularity of these name-brand apps is not useful. Therefore, this analysis would be better served by removing these outliers. Excluding those categories, "Weather" and "Food & Drink" are popular options that are less skewed by the popular outliers. The "Reference" category is also skewed high, by the Bible app and Dictionary.com, but it could still have potential.
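
One simple way to reduce the influence of such outliers is to compare the median number of ratings per genre with the mean. The sketch below is only an illustration of that idea, not part of the analysis above: `median_by_group` is a hypothetical helper, and the sample uses five of the "Reference" rating counts printed earlier, with shortened rows (name, rating count, genre).

```python
from statistics import mean, median

def median_by_group(dataset, group_index, value_index):
    """Return {group: median of the numeric column} for each group."""
    groups = {}
    for row in dataset:
        groups.setdefault(row[group_index], []).append(float(row[value_index]))
    return {group: median(values) for group, values in groups.items()}

# Shortened rows built from the Reference output above
sample = [
    ['Bible', '985920', 'Reference'],
    ['Dictionary.com Dictionary & Thesaurus', '200047', 'Reference'],
    ['Merriam-Webster Dictionary', '16849', 'Reference'],
    ['Google Translate', '26786', 'Reference'],
    ['Night Sky', '12122', 'Reference'],
]

sample_mean = mean(float(row[1]) for row in sample)
sample_medians = median_by_group(sample, 2, 1)
print('mean:', sample_mean)
print('median:', sample_medians['Reference'])
```

On these five apps the mean is pulled far above the median by the Bible app alone, which is the kind of skew described above. On the full dataset the call would be `median_by_group(apple_final, 12, 6)`.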

#### Google Play Dataset

In [111]:
#Finding the different genres in the google dataset (the keys in the frequency table dictionary)
category_dict = freq_table(google_final, 1)

for category in category_dict:
    #Initiate the total number of apps in the category
    total_apps = 0
    #Initiate the number of installs
    total_installs = 0
    
    for row in google_final:
        app_category = row[1]
        #If the category of this app matches the category of our outer loop then update values
        if app_category == category:
            n_installs = row[5]
            #Removing "+" and "," so that the string can be converted to a number
            n_installs = float(n_installs.replace("+", "").replace(",", ""))
            total_installs += n_installs
            total_apps += 1

    #Finding the average number of installs per app in the category
    print(category)
    print(round(total_installs / total_apps, 2))

ART_AND_DESIGN
1986335.09
AUTO_AND_VEHICLES
647317.82
BEAUTY
513151.89
BOOKS_AND_REFERENCE
8767811.89
BUSINESS
1712290.15
COMICS
817657.27
COMMUNICATION
38456119.17
DATING
854028.83
EDUCATION
1833495.15
ENTERTAINMENT
11640705.88
EVENTS
253542.22
FINANCE
1387692.48
FOOD_AND_DRINK
1924897.74
HEALTH_AND_FITNESS
4188821.99
HOUSE_AND_HOME
1331540.56
LIBRARIES_AND_DEMO
638503.73
LIFESTYLE
1437816.27
GAME
15588015.6
FAMILY
3695641.82
MEDICAL
120550.62
SOCIAL
23253652.13
SHOPPING
7036877.31
PHOTOGRAPHY
17840110.4
SPORTS
3638640.14
TRAVEL_AND_LOCAL
13984077.71
TOOLS
10801391.3
PERSONALIZATION
5201482.61
PRODUCTIVITY
16787331.34
PARENTING
542603.62
WEATHER
5074486.2
VIDEO_PLAYERS
24727872.45
NEWS_AND_MAGAZINES
9549178.47
MAPS_AND_NAVIGATION
4056941.77


We see a similar pattern here, where the most popular categories are dominated by a few very popular apps. Again, the reference genre (here "BOOKS_AND_REFERENCE") could be promising.
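
To see how much a few very popular apps affect a category, the average can be recomputed while ignoring apps above some install threshold. The following is a rough sketch under stated assumptions: `parse_installs` and `average_installs` are hypothetical helpers, the cap of 100 million installs is an arbitrary choice, and the sample rows are made up.

```python
def parse_installs(text):
    """Convert an 'Installs' string such as '1,000,000+' to a float."""
    return float(text.replace('+', '').replace(',', ''))

def average_installs(dataset, category, max_installs=None,
                     category_index=1, installs_index=5):
    """Average installs for one category, optionally ignoring apps above a cap."""
    values = []
    for row in dataset:
        if row[category_index] != category:
            continue
        installs = parse_installs(row[installs_index])
        if max_installs is None or installs <= max_installs:
            values.append(installs)
    # Return 0.0 when no app in the category passes the filter
    return sum(values) / len(values) if values else 0.0

# Made-up rows: name, category, rating, reviews, size, installs
sample = [
    ['App A', 'SOCIAL', '4.5', '10', '1M', '1,000,000,000+'],
    ['App B', 'SOCIAL', '4.0', '10', '1M', '10,000+'],
    ['App C', 'SOCIAL', '4.0', '10', '1M', '50,000+'],
]

avg_all = average_installs(sample, 'SOCIAL')
avg_capped = average_installs(sample, 'SOCIAL', max_installs=100_000_000)
print(round(avg_all, 2))
print(avg_capped)
```

On the real data, the equivalent call would be `average_installs(google_final, category, max_installs=100_000_000)` inside the loop over `category_dict`; a different cap may well be more appropriate.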