__Profitable App Profiles for the App Store and Google Play Markets__

This project simulates a real-world scenario: __data analysis__ of __Android__ and __iOS__ mobile apps.

The __aim__ of this project is to find profitable _mobile app profiles_ for the App Store and Google Play markets. It simulates the task of a __data analyst__ at a company that builds Android and iOS mobile apps, whose work enables the team of __developers__ to make data-driven decisions about the kind of apps they build.

The company is assumed to build only apps that are free to download and install, and its main source of revenue is in-app ads. This means that the revenue for any given app is mostly influenced by the __number of users__ that use the app.
The __goal__ of this project is to analyze data to help developers understand what kinds of apps are likely to attract more users.

I start by opening the two data sets and then continue with exploring the data.

In [20]:
from csv import reader

### The Google Play data set ###
opened_file=open('googleplaystore.csv')
read_file=reader(opened_file)
android=list(read_file)
android_header=android[0]
android=android[1:]

### The App Store data set ###
opened_file=open('AppleStore.csv')
read_file=reader(opened_file)
ios=list(read_file)
ios_header=ios[0]
ios=ios[1:]
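
The loading steps above can also be wrapped in a small helper. This is only a sketch: the name `load_csv` and the `utf8` default encoding are assumptions, not part of the original notebook. It uses a `with` block so that the file is closed once the rows have been read.

```python
import csv

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    # newline="" is what the csv module recommends when opening files
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Hypothetical usage with the two files opened above:
# android_header, android = load_csv('googleplaystore.csv')
# ios_header, ios = load_csv('AppleStore.csv')
```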


Below is a function named __explore_data()__. It will be invoked repeatedly to print rows in a more readable way.

In [21]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice=dataset[start:end]
    count=start
    for row in dataset_slice:
        count+=1
        print('row : ' + str(count))
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


print('## android dataset header: ')
print('\n')
print(android_header)
print('\n')
print('## android dataset first two rows')
print('\n')
print(explore_data(android, 0, 2, True))

print('\n')

print('## iOS appstore dataset header')
print('\n')
print(ios_header)
print('\n')
print('## iOS dataset first row')
print('\n')
print(explore_data(ios, 0, 1, True))

## android dataset header: 


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


## android dataset first two rows


row : 1
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


row : 2
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13
None


## iOS appstore dataset header


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


## iOS dataset first row


row : 1
['284882215', 'Facebook', '389879808', 'USD', '0.0'

__googleplaystore__ dataset documentation : https://www.kaggle.com/lava18/google-play-store-apps<br/>
__applestore__ dataset documentation : https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

__Deleting wrong data__

So far, I have opened the datasets and explored the data. Before analysis, I need to make sure that the data is __accurate__; otherwise the results of the analysis might be __wrong__. To ensure this, the following needs to be done:

1. Detect inaccurate data, and correct or remove it.
2. Detect duplicate data, and remove duplicates. 

As this project focuses on apps that are free to download and install and that target an English-speaking audience, I also need to:

1. Remove apps that are not in English.
2. Remove apps that are not free.

The above mentioned process of data preparation is called __data cleaning__.

The discussions section of the Google Play dataset mentions that the rating at entry 10473 is wrong.
I will print the row at that index to check whether it is incorrect.

In [22]:
print(android_header)
print(explore_data(android, 10471, 10473, True))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
row : 10472
['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


row : 10473
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10841
Number of columns: 13
None


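As is evident from the output above, row 10473 is missing its __Category__ value: it has only 12 values against 13 header columns, so every later value is shifted one column to the left (the __Rating__ column, for example, would hold '19').
I will delete the faulty row to clean the dataset.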
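
A shifted row like this one can also be found without relying on the discussions section, by comparing each row's length with the header's. The sketch below is an assumption-laden illustration: `find_malformed_rows` is a hypothetical helper, and the toy rows stand in for the real data.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data standing in for the real data set (hypothetical values)
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],            # 'Category' is missing, so the row is short
    ['App C', 'SOCIAL', '4.5'],
]
malformed = find_malformed_rows(toy_rows, toy_header)
print(malformed)
```

On the real data the call would be `find_malformed_rows(android, android_header)`. A length check only catches rows with missing or extra values; it does not catch a wrong value in the right column.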


In [23]:
del android[10472] # deleting the incorrect row
print(explore_data(android, 10471, 10474, True))

row : 10472
['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


row : 10473
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


row : 10474
['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


Number of rows: 10840
Number of columns: 13
None
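
The trailing `None` in the outputs above appears because explore_data() prints its results and has no return statement, so wrapping the call in print() also prints the implicit return value. A minimal sketch of the same effect, using a hypothetical stand-in function `show`:

```python
def show(rows):
    """Print each row; like explore_data(), this returns nothing (None)."""
    for row in rows:
        print(row)

result = show([['a', 'b']])   # prints the row itself
print(result)                 # prints None, the implicit return value
```

Calling `explore_data(android, 0, 2, True)` on its own, without print(), would produce the same rows without the extra `None` line.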


__Removing duplicate entries__

The discussions section of the dataset mentions redundant entries. For instance, Instagram has four entries.


In [24]:
for row in android:  # android no longer contains the header row, so no slicing is needed
    name=row[0]
    if name=='Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [25]:
# Segregating the duplicate apps and printing names of some duplicate apps

duplicate_apps=[]
unique_apps=[]
for app in android:
    name=app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps: ', len(duplicate_apps))
print('Names of duplicate apps', duplicate_apps[:5])
print('\n')
print('Number of unique apps: ', len(unique_apps))

Number of duplicate apps:  1181
Names of duplicate apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Number of unique apps:  9659
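
The same two counts can be obtained in a single pass with `collections.Counter`, which avoids searching a growing list for every row. This is a sketch, not the notebook's method; `count_duplicates` is a hypothetical helper and the toy rows are made up.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Return (number of unique names, number of surplus duplicate rows)."""
    counts = Counter(row[name_index] for row in dataset)
    n_unique = len(counts)
    # Every row beyond the first occurrence of a name counts as a duplicate
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates

toy = [['Instagram'], ['Box'], ['Instagram'], ['Instagram']]
summary = count_duplicates(toy)
print(summary)
```

With this definition the two numbers always add up to the total row count, which matches the output above (9659 + 1181 = 10840).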


__Removing the duplicate rows__

Above I iterated through the _Google Play_ data set and identified the duplicate apps.<br />
Next I need a criterion for deciding which of the duplicate rows to keep, so that the rest can be removed.

In [26]:
# Examining instagram data to identify a condition based on which duplicate rows can be removed

print('google play store dataset header: ')
print(android_header)
print('\n')
for row in android:
    name='Instagram'
    if name in row:
        print(row)


google play store dataset header: 
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


From the above result, we can see that the Instagram rows are identical in every column except __Reviews__. The differing review counts suggest that the rows were collected at different times. Rather than removing duplicates at random, I will keep only the row with the highest number of reviews for each app, on the assumption that it is the most recent one.

In [27]:
# Removing duplicate rows

reviews_max={}

for row in android:
    name=row[0]
    n_reviews=float(row[3])
    
    if name in reviews_max and reviews_max[name]<n_reviews:
        reviews_max[name]=n_reviews
    elif name not in reviews_max:
        reviews_max[name]=n_reviews
        
#print(reviews_max)
print(len(reviews_max))

9659


In [28]:
# Using reviews_max to remove duplicate rows
    
android_clean=[]
already_added=[]

for row in android:
    name=row[0]
    n_reviews=float(row[3])
    
    #1 Keep the row only if its review count equals the maximum recorded
    # for this app in the reviews_max dictionary and the app has not been
    # added yet; the second condition handles duplicate rows that share
    # the same maximum review count.
    
    #2 Record the app name in already_added to keep track of the apps
    # already added.
    
    if (reviews_max[name]==n_reviews) and (name not in already_added):
        android_clean.append(row)  #1
        already_added.append(name) #2

# The length of android_clean confirms that 9659 unique rows remain
print(len(android_clean))

9659
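
The two passes above can be condensed into one by storing the best row seen so far for each app name. This is a sketch under stated assumptions: `keep_max_reviews` is a hypothetical helper, the toy rows are made up, and the result is ordered by the first appearance of each name, which may differ from the order of android_clean.

```python
def keep_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when two rows tie on reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Box', 'BUSINESS', '4.0', '5'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
    ['Instagram', 'SOCIAL', '4.5', '100'],
]
deduplicated = keep_max_reviews(toy)
print(deduplicated)
```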


As mentioned earlier, this project aims to derive results from apps aimed at an English-speaking audience.<br />
This means that apps aimed at a non-English-speaking audience need to be removed from the data sets.<br />
Each character commonly used in English text has a number mapped to it in the ASCII system, in the range 0-127.<br />
I will define a function that iterates over the characters of a string and checks whether each one falls within that range.
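
As a quick illustration of the mapping, the built-in `ord()` returns the code point of a single character. The sketch below prints it for three sample characters chosen for this example; `str.isascii()` (available from Python 3.7) is shown as a built-in way to test a whole string.

```python
# Code points of three sample characters; only the first two fall in 0-127
codes = {ch: ord(ch) for ch in ['A', 'z', '™']}
for ch, code in codes.items():
    print(ch, code, code <= 127)

# str.isascii() is True only if every character is in the ASCII range
print('Instagram'.isascii())
print('Docs To Go™'.isascii())
```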

In [29]:
def test_string(string):
    for character in string:
        ascii_val=ord(character)
        print(character, ascii_val)
        if ascii_val>127:
            return False
    return True
        
# Test test_string() function

#test_string('Instagram')
#test_string('爱奇艺PPS -《欢乐颂2》电视剧热播')
test_string('Docs To Go™ Free Office Suite')
#test_string('Instachat 😜')

D 68
o 111
c 99
s 115
  32
T 84
o 111
  32
G 71
o 111
™ 8482


False

The _test_string()_ function rejects any name that contains even a single character outside the ASCII range, such as an _emoji_ or the ™ symbol in the example above.
Apps whose names contain such characters but are still aimed at an English-speaking audience would be filtered out, which is not desired.<br />
To overcome this problem I will modify _test_string()_ so that it only filters out strings that have __more than three__ characters outside the ASCII range (0-127).

In [30]:
def test_string(string):
    ascii_exceed=0
    for character in string:
        ascii_val=ord(character)
        if ascii_val>127:
            ascii_exceed+=1
    if ascii_exceed>3:
        return False
    return True

    
#test_string('Docs To Go™ Free Office Suite')
#test_string('Instachat 😜')
test_string('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

Now the _test_string()_ function only filters out strings that contain more than three non-ASCII characters (_emojis_ or other symbols).<br />
I will use _test_string()_ to filter out non-English apps from both datasets: loop through each dataset and, if an app name is identified as English, append the whole row to a separate list.
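
To see how the threshold behaves on the three sample names used earlier, the count of non-ASCII characters can be separated from the decision. This is a sketch: `count_non_ascii`, `is_english` and the `max_non_ascii` parameter are hypothetical names, and the rule remains a heuristic that can misclassify some names.

```python
def count_non_ascii(string):
    """Count the characters whose code point lies outside 0-127."""
    return sum(1 for ch in string if ord(ch) > 127)

def is_english(string, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii such characters."""
    return count_non_ascii(string) <= max_non_ascii

samples = [
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in samples:
    print(name, count_non_ascii(name), is_english(name))
```

Making the threshold a parameter keeps the choice of three visible and easy to adjust.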


In [31]:
playstore_cleaned=[]
ios_cleaned=[]

for row in android_clean:
    name=row[0]
    if test_string(name):
        playstore_cleaned.append(row)
        
for row in ios:
    name=row[0]
    if test_string(name):
        ios_cleaned.append(row)

# Initial playstore dataset
print('google playstore dataset: ', len(android))

# playstore dataset after cleaning duplicate rows
print('playstore dataset without duplicates: ', len(android_clean))

#playstore dataset after removing non-English apps
print('playstore dataset English apps: ', len(playstore_cleaned))

# ios dataset after removing non-English apps
print('ios dataset English apps ', len(ios_cleaned))

google playstore dataset:  10840
playstore dataset without duplicates:  9659
playstore dataset English apps:  9614
ios dataset English apps  7197


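Out of the 10840 rows in the playstore dataset, 9614 remain after removing duplicates and non-English apps. The iOS count stays at 7197, the full size of the App Store data set, because the iOS loop reads `row[0]`, which is the numeric `id` column; the app name (`track_name`) is at index 1, so the filter as written leaves the iOS data unchanged.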

This project analyzes only apps that are free to download and install, so the non-free apps must be removed from both datasets.
This is the last step in the __data cleaning__ process.
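
Because the two data sets store a free price as different strings ('0' in Google Play, '0.0' in the App Store), a comparison on the numeric value is one way to treat both alike. The sketch below is an assumption: `is_free` is a hypothetical helper, and it assumes that paid prices are plain numbers optionally prefixed with a dollar sign.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$4.99' is zero."""
    # Assumption: the only non-numeric character in a price is a '$'
    return float(price.replace('$', '').strip()) == 0.0

checks = {p: is_free(p) for p in ['0', '0.0', '$4.99', '1.99']}
print(checks)
```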


In [32]:
playstore_final=[]
ios_final=[]

for row in playstore_cleaned:
    price=row[7]
    if price=='0':
        playstore_final.append(row)
        
for row in ios_cleaned:
    price=row[4]
    if price=='0.0':
        ios_final.append(row)
        
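## Checking that the final datasets (iOS and Google playstore) contain no non-free apps ##

# Report how many non-free apps, if any, are left in the playstore final dataset
playstore_non_free=[row for row in playstore_final if row[7]!='0']
if playstore_non_free:
    print('playstore_final: ', len(playstore_non_free), 'non-free apps')
else:
    print('playstore_final: ', 'no non-free apps')

# Report how many non-free apps, if any, are left in the ios final dataset
ios_non_free=[row for row in ios_final if row[4]!='0.0']
if ios_non_free:
    print('ios_final: ', len(ios_non_free), 'non-free apps')
else:
    print('ios_final: ', 'no non-free apps')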

playstore_final:  no non-free apps
ios_final:  no non-free apps


The playstore and ios datasets are now cleaned and ready for __analysis__.

Most Common Apps By Genre : 

Part One:

So far I have completed the following steps to clean the data:
    
    -> Removing inaccurate data
    -> Removing duplicate app entries
    -> Removing non-English apps
    -> Isolating the free apps
    
    
My goal is to determine the kind of apps that are likely to attract more users, because the number of people using our apps
affects our revenue.

To minimize risk and overhead, our validation strategy for an app idea has three steps:

    1. Build a minimal Android version of the app, and add it to Google Play.
    
    2. If the app has a good response from users, we develop it further.
    
    3. If the app is profitable after six months, we build an iOS version of the app and add it
        to the App Store.
        
Because the end goal is to add the app to both Google Play and the App Store, I need to find app profiles that are
successful in both markets. For instance, a profile that works well for both markets might be a productivity app
that makes use of gamification.

I will begin the analysis by finding the most common genres in each market. For this I will build frequency tables for the prime_genre column of the App Store data set, and for the Genres and Category columns of the Google Play data set.

Part Two:

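1. I will define two functions: freq_table(), which takes a dataset and a column index and returns the percentage of rows for each value in that column, and display_table(), which prints those percentages in descending order.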
    

In [None]:
def freq_table(dataset, index):
    table = {}
    total = 0 
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
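
For comparison, the same percentages can be computed with `collections.Counter`. This is an alternative sketch rather than the notebook's method; `freq_percentages` is a hypothetical name and the toy rows are made up so that the arithmetic is easy to check by hand (3 of 4 rows is 75 percent).

```python
from collections import Counter

def freq_percentages(dataset, index):
    """Return {value: percentage of rows} for the column at the given index."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Music']]
table = freq_percentages(toy, 1)
# Sort by percentage, largest first, as display_table() does
for genre, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', pct)
```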
            

Part Three:

I will start by examining the frequency table for the prime_genre column of the App Store data set.

In [34]:
display_table(ios_final, -5)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032
