# Analyzing Mobile Apps to identify potential factors that can be used to increase subscriber growth

* Dataset description
* The dataset is a collection of information on applications ('apps') from the Apple App Store and the Google Play Store, for iPhones and Android phones respectively.
* Detailed information on the individual datasets and download links can be found at the following links.
* Apple App Store dataset: [Link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)
* Google Play Store dataset: [Link](https://www.kaggle.com/lava18/google-play-store-apps/home)

Apple App Store Variable description:

| Column Name      | Description |
| ----------- | -----------      |
| "id"        | App ID           |
|"track_name" | App Name         |
|"size_bytes" | Size (in Bytes)  |
|"currency"   | Currency Type    |
|"price"      | Price amount     |
|"rating_count_tot"  | User rating count (for all versions) |
|"rating_count_ver"  | User rating count (for the current version) |
|"user_rating"       | Average user rating value (for all versions) |
|"user_rating_ver"   | Average user rating value (for the current version) |
|"ver"          | Latest version code | 
|"cont_rating"  | Content Rating |
|"prime_genre"  | Primary Genre |
|"sup_devices.num"   | Number of supporting devices |
|"ipadSc_urls.num"   | Number of screenshots shown for display |
|"lang.num"     | Number of supported languages |
|"vpp_lic"      | VPP device-based licensing enabled |

Google Play Store Variable description:

| Column Name      | Description |
| ----------- | -----------      |
|"App" | Application name |
|"Category" | Category the app belongs to |
|"Rating" | Overall user rating of the app (as when scraped) |
|"Reviews" | Number of user reviews for the app (as when scraped) |
|"Size" | Size of the app (as when scraped) |
|"Installs" | Number of user downloads/installs for the app (as when scraped) |
|"Type" | Paid or Free |
|"Price" | Price of the app (as when scraped) |
|"Content Rating" | Age group the app is targeted at - Children / Mature 21+ / Adult |
|"Genres" | An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game and Family genres. |
|"Last Updated" | Date when the app was last updated on Play Store (as when scraped) |
|"Current Ver" | Current version of the app available on Play Store (as when scraped) |
|"Android Ver" | Min required Android version (as when scraped) |

# Exploration of both datasets

* Using a user-defined function, we explore both datasets and take a peek at the structure of the data.

In [1]:
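from csv import reader

# Read each CSV into a list of rows; the first row of each list is the header.
# The utf8 encoding is an assumption, chosen because app names contain non-ASCII characters.
with open('AppleStore.csv', encoding='utf8') as file1:
    ios_store = list(reader(file1))
with open('googleplaystore.csv', encoding='utf8') as file2:
    play_store = list(reader(file2))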

In [2]:
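def explore_data(dataset, start, end, header=True, rows_and_columns=False):
    # Skip the header row when present so that only data rows are sliced.
    # (Previously data_set was undefined when header=False.)
    data_set = dataset[1:] if header else dataset

    # start and end are 1-based and inclusive.
    dataset_slice = data_set[(start-1):end]

    for row in dataset_slice:
        print(row)
        print('\n') # adds blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(data_set))
        print('Number of columns:', len(dataset[0]))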

In [3]:
print('Apple Store App details: \n')
explore_data(ios_store, 1, 5, True, True)
print('\n')
print('Google Play Store App details: \n')
explore_data(play_store, 1, 5, True, True)
print('\n')

print(play_store[0])

Apple Store App details: 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


Google Play Store App details: 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up

# Removing erroneous data and identifying duplicate data entries 

* Data that has been scraped from multiple sources tends to contain some mistakes.
* These mistakes can be erroneous data entries or duplicates.
* Purging the dataset of such entries is a crucial part of the process, because it makes the later analysis more reliable.
* Duplicate entries cannot be deleted at random. Ideally, we retain the most recent version of the app in our dataset, so we need a rule that tells us which entry that is.

## Part 1 - Removing incorrect entries

In [4]:
# The Google Play Store data contains one malformed entry at index 10473 (counting the header row as index 0).
# The row is missing its Category value, so every later field is shifted one column to the left.
print(play_store[10473])
print(len(play_store),'\n')
# Delete the malformed row
del play_store[10473]
# Check the length to confirm that the row has been deleted
print(len(play_store))


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10842 

10841
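
The cell above uses index 10473 without showing how it was found. One possible way to locate such rows, sketched here on toy data, is to compare the length of every row with the length of the header. The helper name `find_malformed_rows` and the sample rows are hypothetical and not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's.

    Indices count the header as row 0, so they can be used directly
    with `del dataset[index]`.
    """
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]

# Toy data standing in for play_store: the second data row lacks a column.
sample = [
    ['App', 'Category', 'Rating'],
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Photo Frame', '1.9'],
]
malformed = find_malformed_rows(sample)
print(malformed)
```

Running the same helper on `play_store` before the deletion should flag any row with a missing or extra field, although it cannot catch rows that have the right length but wrong values.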


## Part 2 - Checking for duplicate entries

In [5]:
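# This user-defined function counts the number of duplicate entries
def duplicate(dataset, header=True):
    # Skip the header row when present.
    # (Previously data_set was undefined when header=False.)
    data_set = dataset[1:] if header else dataset
    
    duplicate_apps = []
    unique_apps = []
    
    for row in data_set:
        # Column 0 is the app name in the Play Store data
        # and the app ID in the App Store data.
        name = row[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    
    return unique_apps, duplicate_apps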

In [6]:
ios_unique, ios_duplicate = duplicate(ios_store, True)
play_unique, play_duplicate = duplicate(play_store, True)

print('Number of unique apps in the Apple Store:', len(ios_unique), '\n')
print('Number of duplicate apps in the Apple Store:', len(ios_duplicate), '\n')

print('Number of unique apps in the Google Play Store:', len(play_unique), '\n')
print('Number of duplicate apps in the Google Play Store:', len(play_duplicate), '\n')

Number of unique apps in the Apple Store: 7197 

Number of duplicate apps in the Apple Store: 0 

Number of unique apps in the Google Play Store: 9659 

Number of duplicate apps in the Google Play Store: 1181 



## Part 3 - Removing duplicate entries

* To decide which duplicate entries of an app must be deleted, we use the "number of reviews" variable.
* The number of reviews received by an app distinguishes between multiple entries for the same app: the entry with the most reviews is likely to be the most recent one.
* For each app, we keep only the entry with the highest number of user reviews.
* The following code blocks produce a cleaned version of our original dataset.

In [7]:
reviews_max = {}

# Record the highest review count seen for each app name.
for row in play_store[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif (name not in reviews_max):
        reviews_max[name] = n_reviews
        
#print('\n', len(reviews_max))        

In [8]:
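android_clean = []
already_added = []

# Keep the first row of each app whose review count equals the maximum.
# already_added guards against duplicates that tie on the review count.
for row in play_store[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)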
        
#print('\n', len(android_clean))
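
The two passes above can also be folded into one. The following is a minimal sketch on toy rows, assuming the same column layout (name in column 0, review count in column 3); `keep_most_reviewed` is a hypothetical helper, not part of the original notebook.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the row with the highest review count.

    On ties the first row seen is kept, which matches the behaviour of
    the two-pass version that tracks an already_added list.
    """
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    # Dicts preserve insertion order, so apps stay in first-seen order.
    return list(best.values())

# Toy rows: App A appears twice with different review counts.
sample_rows = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App A', 'SOCIAL', '4.5', '250'],
    ['App B', 'PRODUCTIVITY', '4.0', '120'],
]
deduped = keep_most_reviewed(sample_rows)
print(deduped)
```

Storing whole rows in the dictionary avoids the `name not in already_added` list lookup, which becomes slow as the list grows.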

# Removing Non-English Apps from the cleaned dataset

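* Some of the apps in the cleaned dataset target a non-English-speaking audience and are therefore not of interest to our analysis.
* We can filter these apps out using the code points of the characters in their names. Standard English text uses ASCII characters, whose code points fall in the range 0 to 127, so a name with more than three characters outside that range is treated as non-English. The allowance of three tolerates the occasional emoji or symbol in an otherwise English name.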

In [9]:
def english_android(dataset, header=True):
    if header:
        data_set = dataset[1:]
    else:
        data_set = dataset
    english_apps = []
    
    for row in data_set:
        name = row[0]
        flag = 0
        for char in name:
            if(ord(char) > 127):
                flag += 1
        if(flag<=3):
            english_apps.append(row)
    
    return english_apps   

def english_ios(dataset, header=True):
    if header:
        data_set = dataset[1:]
    else:
        data_set = dataset
    english_apps = []
    
    for row in data_set:
        name = row[1]
        flag = 0
        for char in name:
            if(ord(char) > 127):
                flag += 1
        if(flag<=3):
            english_apps.append(row)
    
    return english_apps   

In [10]:
ios_english = english_ios(ios_store)
play_english = english_android(android_clean, False)

#print('\n',len(ios_english))
#print('\n',len(play_english))
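
To illustrate how the three-character allowance behaves, here is a small sketch that applies the same `ord(char) > 127` test to a few made-up names. The helpers `count_non_ascii` and `looks_english` and the example names are hypothetical.

```python
def count_non_ascii(name):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for char in name if ord(char) > 127)

def looks_english(name, max_non_ascii=3):
    """Apply the same rule as the functions above: at most three non-ASCII characters."""
    return count_non_ascii(name) <= max_non_ascii

# Made-up names: plain ASCII, ASCII with one symbol, and fully non-ASCII.
examples = ['Instagram', 'Docs To Go™', '日本語のアプリ']
for name in examples:
    print(name, '|', count_non_ascii(name), '|', looks_english(name))
```

The rule is only a heuristic: an English app whose name contains several emoji would be dropped, and a non-English app with a short ASCII name would be kept.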

# Selecting only the free to use/download apps from the dataset

* Not all apps available in the Apple App Store and the Google Play Store are free.
* We can eliminate paid apps from our dataset using the price variable, which is stored as text in both datasets.

In [11]:
def free_android(dataset):
    free_apps = []
    
    for row in dataset:
        price = row[7]
        if(price == '0') or (price == 'Free'):
            free_apps.append(row)
    
    return free_apps   

def free_ios(dataset):
    free_apps = []
    
    for row in dataset:
        price = row[4]
        if(price == '0.0') or (price == 'Free'):
            free_apps.append(row)
    
    return free_apps


In [12]:
ios_free = free_ios(ios_english)
play_free = free_android(play_english)

#print('\n',len(ios_free))
#print('\n',len(play_free))  
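
The filters above compare price strings literally. A slightly more tolerant variant, sketched below, converts the price to a number first. It assumes that paid prices appear as strings such as `'$4.99'`, which is not shown in the output above, and `parse_price` and `is_free` are hypothetical helpers.

```python
def parse_price(price):
    """Convert a price string such as '0', '0.0', 'Free' or '$4.99' to a float."""
    cleaned = price.strip().lstrip('$')
    if cleaned == 'Free':
        return 0.0
    return float(cleaned)

def is_free(price):
    """Return True when the parsed price is zero."""
    return parse_price(price) == 0.0

for value in ['0', '0.0', 'Free', '$4.99']:
    print(value, '->', parse_price(value), '| free:', is_free(value))
```

A single numeric check like this would let both stores share one filter function that takes the price column index as a parameter.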

# Identifying apps that can be released in iOS and Android environments to maximize revenue

* The aim is to determine the kinds of apps that are likely to attract more users, because app-based revenue is highly influenced by the number of people using the app.

* Typically, the life cycle of an app under development is:

    * Build a minimal Android version of the app, and add it to Google Play.
    * If the app has a good response from users, it is further developed.
    * If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

* The end goal is to add the app to both Google Play and the App Store, so we need to find app profiles that are, or can be, successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

In [13]:
def freq_table(dataset, index):
    freq = {}
    table = {}
    for row in dataset:
        genre = row[index]
        if genre in freq:
            freq[genre] += 1
        else:
            freq[genre] = 1
    
    for key in freq:
        table[key] = (freq[key]/len(dataset))*100
        
    return table
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [14]:
print('iOS app genres:\n')
display_table(ios_free, -5)
print('\n')
print('Android app categories: \n')
display_table(play_free, 1)
print('\n')
print('Android app genres: \n')
display_table(play_free, -4)
print('\n')

iOS app genres:

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Android app categories: 

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.39575812274
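
The percentage table built by `freq_table` and `display_table` can also be produced with the standard library's `collections.Counter`. This is a minimal sketch on toy rows; `freq_table_counter` is a hypothetical name, not part of the original notebook.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most frequent value first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs sorted by descending count.
    return {key: count / total * 100 for key, count in counts.most_common()}

# Toy rows with the genre in column 1.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample, 1)
for genre, pct in table.items():
    print(genre, ':', pct)
```

Because `most_common()` already sorts by count, the separate step of building and sorting tuples is not needed.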

## Part 1 - Apple App Store
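Looking at the genres of the free English apps in the Apple App Store, we can observe that most of them are designed to entertain the customer: Games alone accounts for about 58% of the apps. Note that these percentages describe how many apps exist in each genre, not how many users they have.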

To further our analysis we look at the most popular apps in each genre. This should give us a more detailed idea about the preferences of users.

In [15]:
ios_genre_table = freq_table(ios_free, -5)
app_genre_max_avg = 0
app_gen = ""
for genre in ios_genre_table:
    user_total = 0   # sum of rating counts for this genre
    genre_total = 0  # number of apps in this genre
    for row in ios_free:
        app_genre = row[-5]
        if app_genre == genre:
            user = float(row[5])
            user_total += user
            genre_total += 1
    
    avg_user_total = user_total/genre_total
    
    # Track the genre with the highest average seen so far.
    if(avg_user_total > app_genre_max_avg):
        app_genre_max_avg = avg_user_total
        app_gen = genre
        
    print('App Genre: ',genre, ' | Average no. of user ratings: ',avg_user_total)
    
print("\nGenre with the highest average no. of user ratings: ",app_gen)
print('Average no. of user ratings: ',app_genre_max_avg)

App Genre:  Photo & Video  | Average no. of user ratings:  28441.54375
App Genre:  Music  | Average no. of user ratings:  57326.530303030304
App Genre:  Health & Fitness  | Average no. of user ratings:  23298.015384615384
App Genre:  Business  | Average no. of user ratings:  7491.117647058823
App Genre:  Lifestyle  | Average no. of user ratings:  16485.764705882353
App Genre:  Social Networking  | Average no. of user ratings:  71548.34905660378
App Genre:  News  | Average no. of user ratings:  21248.023255813954
App Genre:  Travel  | Average no. of user ratings:  28243.8
App Genre:  Utilities  | Average no. of user ratings:  18684.456790123455
App Genre:  Reference  | Average no. of user ratings:  74942.11111111111
App Genre:  Productivity  | Average no. of user ratings:  21028.410714285714
App Genre:  Shopping  | Average no. of user ratings:  26919.690476190477
App Genre:  Food & Drink  | Average no. of user ratings:  33333.92307692308
App Genre:  Navigation  | Average no. of user rat

## Part 2 - Google Play Store 
Google Play Store apps are more evenly distributed between apps designed for fun and practical apps that aid in day-to-day life.

To further our analysis, we look at how popular the apps in each category are, measured by the average number of installs (we use Category rather than Genres because it is the more concise variable). This should give us a more detailed idea about the preferences of users.

In [20]:
play_genre_table = freq_table(play_free, 1)

app_genre_play_avg = 0
app_gen_play = ""
for genre in play_genre_table:
    user_total = 0   # sum of install counts for this category
    genre_total = 0  # number of apps in this category
    for row in play_free:
        app_genre = row[1]
        if app_genre == genre:
            # Installs are open-ended strings such as '10,000+',
            # so the parsed value is a lower bound.
            user = row[5]
            user = user.replace(',', '')
            user = user.replace('+', '')
            user_float = float(user)
            user_total += user_float
            genre_total += 1
    
    avg_user_total = user_total/genre_total
    
    # Track the category with the highest average seen so far.
    if(avg_user_total > app_genre_play_avg):
        app_genre_play_avg = avg_user_total
        app_gen_play = genre
        
    print('App Category: ',genre, ' | Average no. of user installs: ',avg_user_total)
    
print("\nCategory with the highest average user installs: ",app_gen_play)
print('Average no. of user installs: ',app_genre_play_avg)

App Category:  ENTERTAINMENT  | Average no. of user installs:  11640705.88235294
App Category:  BOOKS_AND_REFERENCE  | Average no. of user installs:  8767811.894736841
App Category:  WEATHER  | Average no. of user installs:  5074486.197183099
App Category:  PHOTOGRAPHY  | Average no. of user installs:  17840110.40229885
App Category:  FOOD_AND_DRINK  | Average no. of user installs:  1924897.7363636363
App Category:  HOUSE_AND_HOME  | Average no. of user installs:  1331540.5616438356
App Category:  HEALTH_AND_FITNESS  | Average no. of user installs:  4188821.9853479853
App Category:  MEDICAL  | Average no. of user installs:  120550.61980830671
App Category:  EDUCATION  | Average no. of user installs:  1833495.145631068
App Category:  FINANCE  | Average no. of user installs:  1387692.475609756
App Category:  SPORTS  | Average no. of user installs:  3638640.1428571427
App Category:  SHOPPING  | Average no. of user installs:  7036877.311557789
App Category:  COMICS  | Average no. of user i

# Post result analysis

* With a more detailed look at the free, English-only app market for both iOS and Android devices, we can see that certain genres/categories dominate the market.
* Some of the rating and install numbers are skewed by the presence of highly popular apps such as WhatsApp, Messenger, Gmail, Google Maps and Instagram.
* Since there is no single right answer to the question of which direction a new app should take, we can choose a genre/category that holds promise in both app markets.
* We could even consider blending two genres/categories to design an app that entices new users to download and use it. Since games and entertainment were popular across both markets, an appropriate example would be a productivity app that uses gamification principles to improve the user's lifestyle. We could also design an app aimed at children that uses gamification to encourage them to read more books.
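
The skew mentioned above can be made visible by comparing the mean with the median, which is far less sensitive to a few very large values. This is a minimal sketch with made-up install counts, not figures from the datasets; `summarize` is a hypothetical helper.

```python
from statistics import mean, median

def summarize(values):
    """Return the mean and the median of a list of numbers."""
    return {'mean': mean(values), 'median': median(values)}

# Made-up install counts for one category: four modest apps and one giant.
installs = [1000, 5000, 10000, 50000, 1000000000]
summary = summarize(installs)
print('Mean:  ', summary['mean'])
print('Median:', summary['median'])
```

In this toy example the single giant app pulls the mean far above what a typical app achieves, while the median stays at the middle value. Reporting the median per genre or category alongside the averages computed earlier would give a fairer picture of a typical app.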