# Finding Promising App Niches - Analyzing App Store and Google Play Data

The goal of this project is to analyze app data and determine which kinds of apps are most likely to attract users on both iOS and Android.

We will look at two datasets, for iOS and Android apps:

- A dataset containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018:

https://dq-content.s3.amazonaws.com/350/googleplaystore.csv

- A dataset containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017:

https://dq-content.s3.amazonaws.com/350/AppleStore.csv

We will analyze the data from the point of view of an app development company that builds free apps and earns its revenue from in-app ads. We therefore want to know which kinds of apps are the most popular.

In [35]:
# Create a helper function to better visualize the data

def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [36]:
# Read the datasets and transform them into lists of lists
from csv import reader

dataset1 = open('AppleStore.csv', encoding='utf8')
ios = list(reader(dataset1))
dataset1.close()

In [37]:
dataset2 = open('googleplaystore.csv', encoding='utf8')
android = list(reader(dataset2))
dataset2.close()

In [38]:
# Explore the data
explore_data(android, 0, 4, rows_and_columns = True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [39]:
explore_data(ios, 0, 4, rows_and_columns = True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7198
Number of columns: 17


## Cleaning the data

**First we check for and remove any rows with missing values, which would shift the columns and corrupt later processing**

In [40]:
android_header = android[0] # We will use the header as a criterion
android_clean0 = []
for index, row in enumerate(android[1:], start=1): # index matches the row's position in android
    if len(row) != len(android_header): # Check if any row has a different length from the header
        print(row)
        print("\n")
        print("Index position is:", index)
    else:
        android_clean0.append(row)
print(len(android))
print(len(android_clean0))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Index position is: 10473
10842
10840


In [41]:
ios_header = ios[0]

for index, row in enumerate(ios[1:], start=1): # index matches the row's position in ios
    if len(row) != len(ios_header):
        print(row)
        print("\n")
        print("Index position is:", index)

**Then we check for and remove duplicate entries**

In [42]:
unique_apps_ios = []
duplicate_apps_ios = []
for app in ios:
    if app[1] in unique_apps_ios:
        duplicate_apps_ios.append(app[1])
    else:
        unique_apps_ios.append(app[1])
print('Duplicates: ', len(duplicate_apps_ios))
print('\n')
print('Some examples of duplicates:', duplicate_apps_ios[:10])

Duplicates:  0


Some examples of duplicates: []


In [43]:
unique_apps_android = []
duplicate_apps_android = []
for app in android_clean0:
    if app[0] in unique_apps_android:
        duplicate_apps_android.append(app[0])
    else:
        unique_apps_android.append(app[0])
print('Duplicates: ', len(duplicate_apps_android))
print('\n')
print('Some examples of duplicates:', duplicate_apps_android[:10])

Duplicates:  1181


Some examples of duplicates: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
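
The membership test above scans a list, which becomes slow as the dataset grows. A minimal sketch of the same check using a set, assuming rows are lists with the app name in a known column; `find_duplicates` and the sample rows are hypothetical names introduced for illustration:

```python
def find_duplicates(rows, name_index):
    """Return one entry per repeated occurrence of a name, in row order."""
    seen = set()        # names encountered so far (constant-time membership test)
    duplicates = []     # repeated names, one entry for each extra occurrence
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


# Hypothetical sample rows: [name, reviews]
sample_rows = [['Slack', '51507'], ['Box', '159'], ['Slack', '51510'], ['Slack', '51507']]
print(find_duplicates(sample_rows, 0))
```

As in the loop above, a name that occurs three times contributes two entries to the result.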


There seem to be a lot of duplicates in the Android data. Let's investigate further.

In [44]:
for app in android_clean0:
    if app[0] == 'Slack': # Examine the duplicates
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


The number of reviews seems to be the only difference among the duplicates, which suggests that the data was collected at different times. We will therefore use it as the criterion for removal and keep only the row with the highest number of reviews, which should be the most recent.

In [45]:
# Create a dictionary to store rows with unique names and max reviews
reviews_max = {} 
for app in android_clean0:
    name = app[0]
    reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name] = reviews
    if name not in reviews_max:
        reviews_max[name] = reviews
print(len(reviews_max))

9659


In [46]:
# Clean Android data
android_clean = []
already_added = []
for app in android_clean0:
    name = app[0]
    reviews = float(app[3])
    if reviews == reviews_max[name] and name not in already_added: # We only need one (unique) row with max reviews
        android_clean.append(app)
        already_added.append(name)
print(len(android_clean)) # Check if we stored the necessary number of unique apps

9659
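
The two loops above can also be folded into a single pass. The following is a sketch under the same assumption (the row with the most reviews is the most recent); `keep_max_reviews` and the sample rows are illustrative names, not part of the analysis above:

```python
def keep_max_reviews(rows, name_index, reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when review counts tie
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Hypothetical sample rows: [name, category, rating, reviews]
sample_rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]
deduplicated = keep_max_reviews(sample_rows, 0, 3)
print(deduplicated)
```

Note that the result is ordered by the first appearance of each name, which can differ from the row order produced by the two-loop version.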


**Now we remove any non-English apps**

Assuming our company works solely in English-speaking markets, we will exclude non-English apps.

According to the ASCII standard, the characters commonly used in English text all have code points in the range 0 to 127. A character with a code point above 127 is therefore outside the common English set, and an app name containing such characters is probably not English. To allow for emojis and other symbols that may occur in English app names, we only reject a name when it contains more than 3 characters with a code point above 127.

In [47]:
# Create a function that detects non-English symbols
def eng_detector(string):
    i = 0
    for letter in string:
        if ord(letter) > 127:
            i += 1
            if i > 3: # Check if our criterion for the number of non-English symbols holds
                return False
    return True

In [48]:
# Clean iOS data
ios_clean = []
for app in ios[1:]:
    name = app[2]
    if eng_detector(name):
        ios_clean.append(app)
explore_data(ios_clean, 0, 2, rows_and_columns = True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Number of rows: 6183
Number of columns: 17


In [49]:
# Clean Android data
android_clean2 = []
for app in android_clean:
    name = app[0]
    if eng_detector(name):
        android_clean2.append(app)
explore_data(android_clean2, 0, 2, rows_and_columns = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


**Finally we remove non-free apps, as they are out of the scope of this project**

In [50]:
ios_clean3 = []
for app in ios_clean:
    price = float(app[5])
    if price == 0.0:
        ios_clean3.append(app)
explore_data(ios_clean3, 0, 2, rows_and_columns = True)

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 3222
Number of columns: 17


In [51]:
android_clean3 = []
for app in android_clean2:
    price = app[7] # Android prices contain symbols, so we cannot use float()
    if price == '0.0' or price == '0':
        android_clean3.append(app)
explore_data(android_clean3, 0, 2, rows_and_columns = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 8864
Number of columns: 13


Though it is not our concern right now, the proportion of paid apps is much larger on the App Store (about 48% of the English apps) than on Google Play (about 8%).
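
These proportions can be read off the row counts printed above. A small sketch of the arithmetic, using the number of English apps before and after the price filter (`paid_share` is a helper introduced here for illustration):

```python
def paid_share(total_apps, free_apps):
    """Percentage of apps that are not free."""
    return (total_apps - free_apps) / total_apps * 100


# Row counts taken from the outputs above (English apps, then free English apps)
ios_paid = paid_share(6183, 3222)
android_paid = paid_share(9614, 8864)
print(f'App Store paid share: {ios_paid:.1f}%')
print(f'Google Play paid share: {android_paid:.1f}%')
```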

## Analyzing the Data

Our goal is to determine what kind of app is more likely to attract users and thus generate more revenue. Suppose we have the following validation strategy for an app idea:
- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. Thus we begin the analysis by determining the most common genres for each market.

In [52]:
# Create a function that makes a frequency table for any given column of the dataset
def freq_table(dataset, index):
    fr_table = {}
    total = len(dataset)
    for row in dataset:
        key = row[index]
        if key in fr_table:
            fr_table[key] += 1
        else:
            fr_table[key] = 1
    for key in fr_table:
        fr_table[key] = round(fr_table[key]/total*100, 2) # Show distribution as percentage of total
    return fr_table

In [53]:
# Create a function to sort the frequency table
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
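
The same table can also be built with the standard library's `collections.Counter`, which handles both the counting and the ordering by frequency. A sketch, with `freq_table_counter` and the sample rows as illustrative names:

```python
from collections import Counter


def freq_table_counter(rows, index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]


# Hypothetical sample rows: [name, genre]
sample_rows = [['A', 'Games'], ['B', 'Games'], ['C', 'Weather'], ['D', 'Games']]
for genre, percentage in freq_table_counter(sample_rows, 1):
    print(genre, ':', percentage)
```

One difference from the version above: values with equal counts are listed in order of first appearance rather than in reverse alphabetical order.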

In [54]:
# Use new functions to look at most popular genres for both datasets
display_table(ios_clean3, 12)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


In [55]:
display_table(android_clean3, 1)

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


In [56]:
display_table(android_clean3, 9)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

The frequency tables show that apps designed for fun dominate the App Store, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we're going to determine the kind of apps with the most users.

One way to find out which genres are the most popular (i.e. have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can take this from the 'Installs' column. The App Store data set has no such column, so we'll take the total number of user ratings ('rating_count_tot') as a proxy.

In [57]:
# Compute the average number of user ratings per genre (iOS dataset)
table = freq_table(ios_clean3, 12)
for genre in table:
    total = 0
    len_genre = 0
    for row in ios_clean3:
        genre_app = row[12]
        if genre_app == genre:
            users = float(row[6]) # Take the total number of user ratings as a proxy for popularity
            total += users
            len_genre += 1
    aver_users = round(total / len_genre) # Show average number of user ratings per genre
    print(genre, ' : ', aver_users)

Productivity  :  21028
Weather  :  52280
Shopping  :  26920
Reference  :  74942
Finance  :  31468
Music  :  57327
Utilities  :  18684
Travel  :  28244
Social Networking  :  71548
Sports  :  23009
Health & Fitness  :  23298
Games  :  22789
Food & Drink  :  33334
News  :  21248
Book  :  39758
Photo & Video  :  28442
Entertainment  :  14030
Business  :  7491
Lifestyle  :  16486
Education  :  7004
Navigation  :  86090
Medical  :  612
Catalogs  :  4004


By average number of user ratings, the most popular app genres on the App Store are Navigation, Reference (apps that assist the user in accessing or retrieving information), and Social Networking.

In [58]:
# Compute the average number of installs per category (Android dataset)
table = freq_table(android_clean3, 1)
for category in table:
    total = 0
    len_category = 0
    for row in android_clean3:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace(',', '') # Remove ','
            installs = installs.replace('+', '') # Remove '+'
            installs = float(installs) / 1000000 # Convert to float and show in millions
            total += installs
            len_category += 1
    aver_installs = round(total / len_category, 2) # Show average number of installs per category
    print(category, ' : ', aver_installs)

ART_AND_DESIGN  :  1.99
AUTO_AND_VEHICLES  :  0.65
BEAUTY  :  0.51
BOOKS_AND_REFERENCE  :  8.77
BUSINESS  :  1.71
COMICS  :  0.82
COMMUNICATION  :  38.46
DATING  :  0.85
EDUCATION  :  1.83
ENTERTAINMENT  :  11.64
EVENTS  :  0.25
FINANCE  :  1.39
FOOD_AND_DRINK  :  1.92
HEALTH_AND_FITNESS  :  4.19
HOUSE_AND_HOME  :  1.33
LIBRARIES_AND_DEMO  :  0.64
LIFESTYLE  :  1.44
GAME  :  15.59
FAMILY  :  3.7
MEDICAL  :  0.12
SOCIAL  :  23.25
SHOPPING  :  7.04
PHOTOGRAPHY  :  17.84
SPORTS  :  3.64
TRAVEL_AND_LOCAL  :  13.98
TOOLS  :  10.8
PERSONALIZATION  :  5.2
PRODUCTIVITY  :  16.79
PARENTING  :  0.54
WEATHER  :  5.07
VIDEO_PLAYERS  :  24.73
NEWS_AND_MAGAZINES  :  9.55
MAPS_AND_NAVIGATION  :  4.06


The situation is quite different on Google Play, where the categories with the highest average installs are Communication, Video Players, Social, Photography, Productivity, and Game.

## Conclusion

We analyzed App Store and Google Play data to understand which apps appeal to users and which ones we should focus on. To do so, we removed apps irrelevant to our project's objectives and investigated the most popular genres.

Looking at both the App Store and Google Play results, we can conclude that our developers should focus on new social networking apps, one of the most popular categories on both marketplaces. Their success may, however, depend on what new features we can offer users in comparison to established social networks (Facebook, Telegram, WhatsApp, etc.).

Alternatively, we could explore opportunities in a less saturated segment by creating a dictionary or some other reference app, using cutting-edge technology such as ChatGPT or Midjourney to give us a competitive edge.