# Profitable App profile for App Store and Google Play Markets

The goal of the project is to analyse the application profiles that are more likely to generate revenue for the company. The company only build apps that are free to download and install, and the main source of revenue consists of in-app ads. This means the revenue for any given app is mostly influenced by the number of users who use our app.


## Opening and exploring the data. 

In [1]:
from csv import reader

def read_file(dataset):
    open_file = open(dataset)
    read_file = reader(open_file)
    data = list(read_file)
    return data

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

android_data = read_file("googleplaystore.csv")
print(android_data[0:20])
android_header = android_data[0]
android_data = android_data[1:]
print("Wrong data",android_data[10472])


[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyon

In [2]:
ios_data = read_file("AppleStore.csv")
print(ios_data[0:20])
ios_header = ios_data[0]
ios_data = ios_data[1:]
print(len(ios_data))

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1'], ['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624',

## Data Cleaning

### Deleting bad data

One of the rows in the android app dataset is missing value for the category column. We need to remove the column at that index so that it does not affect out result.

In [3]:
 del(android_data[10472])

### Removing duplicate data

Duplicate data needs to be analysed and removed so that it does not influence the results.

In [4]:

def duplicate_apps(dataset):
    unique_data = []
    duplicate_data =[]
    for app in dataset:
        name = app[0]
        if name in unique_data:
            duplicate_data.append(name)
        else:
            unique_data.append(name)
    return unique_data,duplicate_data   
    
unique_data_android,duplicate_data_android = duplicate_apps(android_data)
print("Number of unique android apps = ", len(unique_data_android))
print("Number of duplicate android apps = ", len(duplicate_data_android)) 


Number of unique android apps =  9659
Number of duplicate android apps =  1181


In [5]:
unique_data_ios,duplicate_data_ios = duplicate_apps(ios_data)
print("Number of unique ios apps = ", len(unique_data_ios))
print("Number of duplicate ios apps = ", len(duplicate_data_ios))  

Number of unique ios apps =  7197
Number of duplicate ios apps =  0


Only the Android dataset has duplicates. We need to elimniate duplicate entries but at the same time want to keep the entry with the most number of user reviews. One way of doing this is to create a dictionary with keys as app name and the value as the highest number of reviews associated with it. Also, the size of the dictionary should be equal to the number of unique android apps. 

In [6]:
reviews_max = {}
for app in android_data:
    name = app[0]
    n_reviews = app[3]
    if name in reviews_max and n_reviews>reviews_max[name]:
        reviews_max[name]=n_reviews
    if name not in reviews_max:
        reviews_max[name]=n_reviews
        
print(len(reviews_max))
print(len(unique_data_android))
print(len(android_data))

9659
9659
10840


We eliminate the redundant entries by matching them with the entries in the review_max dictionary. The size of the dictionary should be equal to the number of apps in android_clean list. 

In [7]:
android_clean = []
ios_clean = ios_data
already_added = []


for app in android_data:
    name = app[0]
    n_reviews = app[3]
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
print("number of android apps",len(android_clean))
print("size of reviews dictionary",len(reviews_max))


number of android apps 9659
size of reviews dictionary 9659


### Removing non-english apps

We are analysing the data only for English Apps. We need to remove all the apps that are not in English. The logic to identify if an app is English is to scan it's characters and if any character's ASCII value is greater 127 then mark it as non-english and return false. Some corner cases with special characters (i.e - "", whitespace) need to be accounted for. So the max number of character with ASCII value beyond the specified range is set at three. 

In [8]:
def is_English(name):
    count = 0
    for char in name:
        if ord(char)>127:
            if count==3:
                return False
            else:
                count+=1
    return True

res = is_English("Docs To Go™ Free Office Suite")#test changes
print(res)    

True


After applying the filter we observe that only the android dataset has non-english apps which are eliminated but the ios dataset has no change.

In [9]:
english_android_apps = []
english_ios_apps = []
        
def remove_non_english(dataset):
    english_apps = []
    for app in dataset:
        name = app[0]
        if is_English(name) == True:
            english_apps.append(app)
    return english_apps       


english_android_apps = remove_non_english(android_clean)
english_ios_apps = remove_non_english(ios_clean)

print(len(english_android_apps))
print(len(english_ios_apps))

9614
7197


Isolating the free apps from non-free from the dataset

In [10]:
android_free_apps = []
ios_free_apps = []
for app in english_android_apps:
    price = app[7]
    if price == '0':
        android_free_apps.append(app)

for app in english_ios_apps:
    price = app[4]
    if price == '0.0':
        ios_free_apps.append(app)
        
        
print("Number of free android apps",len(android_free_apps))   
print("Number of free ios apps",len(ios_free_apps))   

Number of free android apps 8862
Number of free ios apps 4056


## Data Analysis

The validation strategy of the app idea is as follows:

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add    it to the App Store.

In order to do this, we need to analyse the data to find the most common genres and categories of apps in the dataset. We need to create a frequency table to assess percentange of apps in each those categories.

In [11]:
def freq_table(dataset,index):
    freq = {}
    total = 0
    for app in dataset:
        val = app[index]
        if val in freq:
            freq[val]+=1
        else:
            freq[val]=1
        total+=1
     
    table_percentages = {}
    for key in freq:
        percentage = (freq[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

Computing the frequency table and sorting it in descending order.

In [12]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
print("\n\nandroid_free_apps list in category:\n")
display_table(android_free_apps,1) #category 



android_free_apps list in category:

FAMILY : 18.934777702550214
GAME : 9.693071541412774
TOOLS : 8.451816745655607
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.7941773865944481
MAPS_AND_NAVIGATION : 1.399232678853532
FOOD_AND_DRINK : 1.2412547957571656
EDUCATION : 1.1735499887158656
ENTERTAINMENT : 0.9591514330850823
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8237418190024826
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
PARENTING :

As we can see, most number of apps in Android dataset belong to the family category

In [13]:
print("ios_free_apps list in prime_genre:\n")
display_table(ios_free_apps,-5) #Genre

ios_free_apps list in prime_genre:

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


The results are significantly different in the ios app dataset. Most number of games seem to belong to the Games category.


We need to probe a little furthur. Just because a certain cetegory has more number of apps does not mean they are also very popular. Number of installs and user ratings can be a good indicator of popularity. The install columnn is only present in the android dataset and not in ios dataset. We can use rating_count_tot column in ios dataset as a proxy for that.


### Popular IOS apps by genre

In [14]:
genres_ios = freq_table(ios_free_apps, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free_apps:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
    

Games : 18924.68896765618
Book : 8498.333333333334
Food & Drink : 20179.093023255813
Health & Fitness : 19952.315789473683
Music : 56482.02985074627
Business : 6367.8
Navigation : 25972.05
Sports : 20128.974683544304
Shopping : 18746.677685950413
Education : 6266.333333333333
Lifestyle : 8978.308510638299
Productivity : 19053.887096774193
Finance : 13522.261904761905
Reference : 67447.9
Social Networking : 53078.195804195806
Medical : 459.75
News : 15892.724137931034
Weather : 47220.93548387097
Catalogs : 1779.5555555555557
Utilities : 14010.100917431193
Travel : 20216.01785714286
Entertainment : 10822.961077844311
Photo & Video : 27249.892215568863


The most number of user ratings for app per genre is for social networking and Music category. This space is mainly dominated by giants like Facebook, Twitter, Google and Spotify.  


### Popular Android apps by genre

In [15]:
genre_android = freq_table(android_free_apps, -4)

for genre in genre_android:
    total_installs = 0
    genre_len = 0
    for app in android_free_apps:
        n_installs = app[5]
        if genre == app[-4]:
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            n_installs = float(n_installs)
            total_installs+=n_installs
            genre_len+=1
    average = (total_installs/genre_len)
    print(genre," : ",average)   

Action  :  12603588.872727273
Casual;Action & Adventure  :  12916666.666666666
Role Playing;Action & Adventure  :  7000000.0
Sports  :  4596842.615635179
Arcade;Action & Adventure  :  3190909.1818181816
Word  :  9094458.695652174
Board;Action & Adventure  :  3000000.0
Casual;Creativity  :  5333333.333333333
Entertainment;Brain Games  :  3314285.714285714
Adventure  :  4922785.333333333
Lifestyle  :  1412998.3449275363
Strategy;Action & Adventure  :  1000000.0
Casual;Pretend Play  :  6957142.857142857
Music  :  9445583.333333334
Video Players & Editors  :  24947335.796178345
Productivity  :  16787331.344927534
Casual  :  19630958.51612903
Racing;Pretend Play  :  1000000.0
Social  :  23253652.127118643
Travel & Local  :  14051476.145631067
Art & Design;Pretend Play  :  500000.0
Entertainment;Creativity  :  4000000.0
Casual;Brain Games  :  1425916.6666666667
Puzzle;Action & Adventure  :  18366666.666666668
Auto & Vehicles  :  647317.8170731707
Food & Drink  :  1924897.7363636363
Lifestyle

So on an average communication genre has the most number of installs. 

## Conclusion

We tried to assess the two datasets by cleaning and analysing them. We could see that for the ios dataset, apps in social networking and music category are more popular whereas for the android dataset, apps in the communication is greatly popular. The metrics used for the analysis are the genre, category, user ratings and the number of installs. These meaningful insights can be used to make data driven decision while creating new applications for the contemporary mobile appplication markets.