## First Data Science Project: Mobile Apps Analysis
The company builds apps that are free to download and install, and the main source of revenue consists of in-app ads.
Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

We will use two datasets from Kaggle: [GooglePlayStore](https://www.kaggle.com/lava18/google-play-store-apps) and [AppleStore](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [1]:
#function to print a slice of rows and, optionally, the dataset dimensions

def explore_data(dataset, start, end, rows_and_columns=False):
    data_slice= dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')
    
    if rows_and_columns==True:
        print("Number of rows: ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

In [2]:
#function to read a CSV file into a list of rows
from csv import reader

def openfile(file="C:/Users/User/Datasets/AppleStore.csv"):
    #the with block closes the file as soon as it has been read
    with open(file, encoding='utf8') as opened_file:
        apps_data = list(reader(opened_file))
    return apps_data

In [3]:
googleplaystore= openfile("C:/Users/User/Datasets/googleplaystore.csv")
applestore=openfile()
explore_data(googleplaystore,0 , 5, True)
print('---------------------------------------------------------------------------------------------------------------------\n')
explore_data(applestore,0,5,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  10842
Number of columns:  13
-----------------------------------------------------------------

In [4]:
googleplaystore= openfile("C:/Users/User/Datasets/googleplaystore.csv")
applestore=openfile()
print("Columns in GooglePlayStore dataset:")
explore_data(googleplaystore,0 , 1, True)
print('\n---------------------------------------------------------------------------------------------------------------------\n')
print("Columns in AppleStore dataset:")
explore_data(applestore,0,1,True)

Columns in GooglePlayStore dataset:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows:  10842
Number of columns:  13

---------------------------------------------------------------------------------------------------------------------

Columns in AppleStore dataset:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows:  7198
Number of columns:  16


In [5]:
#this row has only 12 values: the Category is missing, so every later column is shifted left
print(googleplaystore[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
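
Rather than spotting a malformed row by eye, we could compare the length of every row with the length of the header. The snippet below is only a sketch of that idea: `find_bad_rows` is a hypothetical helper that is not part of the original notebook, and `sample` is a small made-up list standing in for `googleplaystore`.

```python
def find_bad_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    header_len = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != header_len]

# Toy data standing in for googleplaystore: the last row is missing one value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]

bad_rows = find_bad_rows(sample)
for index, row in bad_rows:
    print(index, row)
```

Running the same helper on the real dataset should point at any row that has too few or too many columns, although it cannot catch a row that has the right length but wrong values.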


In [6]:
#remove the erroneous row (run this cell only once, otherwise a valid row is deleted as well)
del googleplaystore[10473]

In [7]:
#verify that the erroneous row has been removed
print(googleplaystore[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


## Checking for duplicate entries
Now we need to check the Google Play data set for duplicate entries.

In [8]:
duplicate_apps=[]
unique_apps=[]
for row in googleplaystore[1:]:
    if row[0] in unique_apps:
        duplicate_apps.append(row[0])
    else:
        unique_apps.append(row[0])

print(len(duplicate_apps))
print(len(unique_apps))

1181
9659
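
Checking `row[0] in unique_apps` scans a list on every iteration, which gets slow as the list grows. A possible alternative, sketched here on made-up rows rather than the real dataset, is to count the names in one pass with `collections.Counter`; the variable names are illustrative only.

```python
from collections import Counter

# Toy rows standing in for googleplaystore[1:]; only the name column matters here.
sample_rows = [['ROBLOX'], ['ESPN'], ['ROBLOX'], ['Nick'], ['ROBLOX'], ['ESPN']]

# Count how many times each app name appears.
name_counts = Counter(row[0] for row in sample_rows)

n_unique = len(name_counts)
# Every appearance after the first one counts as a duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())
# Names that appear more than once, most frequent first.
repeated = [(name, count) for name, count in name_counts.most_common() if count > 1]

print(n_duplicates, n_unique)
print(repeated)
```

The same counter gives both totals and the number of entries per app, so no second loop is needed.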


As seen above, the dataset contains 1181 duplicate rows on top of 9659 unique app names. Let us take a closer look at `duplicate_apps`.

In [10]:
occurences={}
for name in duplicate_apps:
    if name in occurences:
        occurences[name]+=1
    else:
        #start at 2: a name's first appearance in duplicate_apps means the dataset already holds 2 entries for it
        occurences[name]=2

sorted(occurences.items(), key=lambda x:x[1], reverse=True)[:100]
#ROBLOX appears 9 times in the dataset (1 original entry plus 8 duplicates)

[('ROBLOX', 9),
 ('CBS Sports App - Scores, News, Stats & Watch Live', 8),
 ('Duolingo: Learn Languages Free', 7),
 ('8 Ball Pool', 7),
 ('Candy Crush Saga', 7),
 ('ESPN', 7),
 ('Nick', 6),
 ('Subway Surfers', 6),
 ('Bubble Shooter', 6),
 ('Sniper 3D Gun Shooter: Free Shooting Games - FPS', 6),
 ('Zombie Catchers', 6),
 ('Temple Run 2', 6),
 ('slither.io', 6),
 ('Helix Jump', 6),
 ('Bowmasters', 6),
 ('Bleacher Report: sports news, scores, & highlights', 6),
 ('Viber Messenger', 5),
 ('Netflix', 5),
 ('Calorie Counter - MyFitnessPal', 5),
 ('Plants vs. Zombies FREE', 5),
 ('Granny', 5),
 ('Zombie Tsunami', 5),
 ('Farm Heroes Saga', 5),
 ('Angry Birds Classic', 5),
 ('Flow Free', 5),
 ('MeetMe: Chat & Meet New People', 5),
 ('Wish - Shopping Made Fun', 5),
 ('eBay: Buy & Sell this Summer - Discover Deals Now!', 5),
 ('BeautyPlus - Easy Photo Editor & Selfie Camera', 5),
 ('Yahoo Fantasy Sports - #1 Rated Fantasy App', 5),
 ('theScore: Live Sports Scores, News, Stats & Videos', 5),
 ('ML

In [None]:
for app in googleplaystore:
    if app[0]=="ROBLOX":
        print(app)
#The rows differ only in the fourth column (index 3, the number of reviews). It is reasonable to assume that
#the number of reviews grew over time, so we should keep the entry with the highest value: 4450890.

We will filter the data set by keeping, for each app, only the entry with the highest review count.

In [None]:
reviews_max={}
for row in googleplaystore[1:]:
    name=row[0]
    reviews=float(row[3])
    if name in reviews_max and reviews>reviews_max[name]:
        reviews_max[name]=reviews
    elif name not in reviews_max:
        reviews_max[name]=reviews
        
print(len(reviews_max))

Now we have a dictionary that maps each app name to its highest review count. All that remains is to identify the matching rows and add them to a new list.

In the step below, we create a new list (`android_clean`) to hold the clean data and append a row only if its review count equals the highest count for that app. Because some duplicates share the same highest review count, a second list (`already_added`) keeps track of the apps we have already added.

In [None]:
android_clean=[]
already_added=[]
for row in googleplaystore[1:]:
    name = row[0]
    n_reviews= float(row[3])
    if reviews_max[name]== n_reviews and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
print(len(android_clean))

#let us verify that the ROBLOX app has the correct review count of 4450890 in the new dataset
for app in android_clean:
    if app[0]=="ROBLOX":
        print(app)
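
The two passes above can also be folded into one by storing the best row itself instead of only its review count. This is a sketch under assumptions: `keep_max_reviews` is a hypothetical helper, and the rows below are shortened toy rows with the review count at index 3, as in the Google Play data.

```python
def keep_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Strictly greater: when two rows tie, the first one seen is kept.
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows: three ROBLOX entries, two of them tied on the highest count.
sample_rows = [
    ['ROBLOX', 'GAME', '4.5', '4447388'],
    ['ESPN', 'SPORTS', '4.2', '521138'],
    ['ROBLOX', 'GAME', '4.5', '4450890'],
    ['ROBLOX', 'FAMILY', '4.5', '4450890'],
]

deduped = keep_max_reviews(sample_rows)
for row in deduped:
    print(row)
```

One difference to keep in mind: the result is ordered by each app's first appearance, not by the position of its highest-review row, so the row order may differ from `android_clean`.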

## Removing non-English apps
As the company mainly builds English-language apps, we need to remove non-English apps from the data.

In [None]:
#function that checks whether a name consists only of ASCII characters (code points 0-127)
def language_check(name):
    for x in name:
        if ord(x)>127:
            return False
    return True

In [None]:
print(language_check('Instagram'))
print(language_check('Docs To Go™ Free Office Suite'))
print(language_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(language_check('Instachat 😜'))

In [None]:
#function that returns False if a name contains more than 3 non-ASCII characters (tolerates a few symbols such as ™ or emoji)
def language_checker(name):
    count=0
    for x in name:
        if ord(x)>127:
            count+=1
    if count>3:
        return False
    return True

In [None]:
print(language_checker('Instagram'))
print(language_checker('Docs To Go™ Free Office Suite'))
print(language_checker('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(language_checker('Instachat 😜'))
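
To make the difference between the two checks explicit, the threshold can be turned into a parameter. The sketch below is an illustrative rewrite, not the notebook's own code: `count_non_ascii` and `is_probably_english` are hypothetical names, and the test names are the four used above.

```python
def count_non_ascii(name):
    """Count characters whose code point is outside the ASCII range (0-127)."""
    return sum(1 for ch in name if ord(ch) > 127)

def is_probably_english(name, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(name) <= max_non_ascii

names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Instachat 😜',
]

# max_non_ascii=0 reproduces the strict check; the default reproduces the lenient one.
strict = [is_probably_english(name, max_non_ascii=0) for name in names]
lenient = [is_probably_english(name) for name in names]

print(strict)   # [True, False, False, False]
print(lenient)  # [True, True, False, True]
```

The strict check rejects English names that merely contain ™ or an emoji, which is why the lenient version is used for filtering. It is still a heuristic: a non-English name written with three or fewer non-ASCII characters will slip through.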

Now let's filter out non-English apps from both datasets!

In [None]:
android_clean_english=[]
applestore_english=[]
for row in android_clean:
    name=row[0]
    if language_checker(name)==True:
        android_clean_english.append(row)
print(len(android_clean_english))

for row in applestore[1:]:
    name=row[1]
    if language_checker(name)==True:
        applestore_english.append(row)
print(len(applestore_english))

At this point, we have removed the erroneous row and the duplicate app entries from the Google Play data, and the non-English apps from both datasets!
The cleaned datasets are `android_clean_english` and `applestore_english`.

## Isolating only free apps
Since the company is interested only in free apps, let us clean the data further and isolate the free apps in each dataset.

In [None]:
free_android=[]
free_apple=[]
for row in android_clean_english:
    price=row[7]
    if price=='0':
        free_android.append(row)
    
for row in applestore_english:
    price=float(row[4])
    if price==0.0:
        free_apple.append(row)
        
print(len(free_android))
print(len(free_apple))

We are left with 8864 Android apps and 3222 iOS apps.

## Identifying free apps with a large user base
Because the company focuses on free apps that earn money from in-app ads, a large user base is extremely important: it directly influences the company's revenue.

Now that we have isolated the free apps, we need to find a type of app that attracts large numbers of both iOS and Android users, so that the company can target both markets at the same time.

In [None]:
#let us first generate a frequency table to find out the most common genres in each market
#define a function that creates a frequency table based on a dataset and index
def freq_table(dataset,index):
    freq_tbl={}
    total=0
    for row in dataset:
        total+=1
        column=row[index]
        if column in freq_tbl:
            freq_tbl[column]+=1
        else:
            freq_tbl[column]=1
    table_percentages={}
    for key in freq_tbl:
        percentage=(freq_tbl[key]/total)*100
        table_percentages[key]=percentage
    return table_percentages

With `freq_table` we can build a frequency table (as percentages) for any column of either dataset. To find the most common genres, we also need to display each table sorted in descending order.

In [None]:
def display_table(dataset,index):
    table = freq_table(dataset,index)
    #list of tuples
    table_display=[]
    for key in table:
        key_val_as_tuple= (table[key],key)
        table_display.append(key_val_as_tuple)
    #sort the table in descending order
    table_sorted= sorted(table_display,reverse=True)
    for entry in table_sorted:
        print(entry[1],':',entry[0])
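
`display_table` swaps each key and value into a tuple so that `sorted` orders by percentage. An alternative that skips the swap is to sort the dictionary items with a `key` function. The sketch below assumes a made-up dataset and a hypothetical helper name, `sorted_percentages`.

```python
def sorted_percentages(dataset, index):
    """Return (value, percentage) pairs for one column, highest percentage first."""
    counts = {}
    for row in dataset:
        counts[row[index]] = counts.get(row[index], 0) + 1
    total = len(dataset)
    percentages = {key: count / total * 100 for key, count in counts.items()}
    # Sort by the percentage (second element of each item) in descending order.
    return sorted(percentages.items(), key=lambda item: item[1], reverse=True)

# Toy rows: the genre sits at index 1.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]

ranking = sorted_percentages(sample, 1)
for genre, percentage in ranking:
    print(genre, ':', percentage)
```

Returning the sorted pairs instead of printing them inside the function also makes the result reusable in later cells.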

## Most Popular Apps by Genre & Category in PlayStore

In [None]:
#sorted freq_table for android genres
display_table(free_android,9)

In [None]:
#sorted freq_table for android categories
display_table(free_android,1)

In the Android genres table, the most common genre is `Tools`, followed by `Entertainment`. In the Android categories table, however, `Family` and `Game` take the top two spots, followed by `Tools`. This is likely because categories are a coarser classification, so many fine-grained genres end up grouped under the `Family` and `Game` categories.

## Most Popular Apps by Genre on AppStore

Let us now take a look at the AppleStore dataset.
Since it contains no data on the number of installations, we will later use the `rating_count_tot` column (the total number of user ratings) as a proxy instead.

In [None]:
#sorted freq_table for iOS
display_table(free_apple,11)

In the Apple `prime_genre` table, `Games` ranks first, followed by `Entertainment`. This is broadly in line with the Android market, since games rank in the top two in both markets. However, this only tells us which genres have the most apps, not which have the most users, so it is not enough for a conclusive insight.

Let us dig further by computing the average number of user ratings per app for each genre.

In [None]:
for genre in freq_table(free_apple,11):
    total=0
    len_genre=0
    for row in free_apple:
        genre_app=row[11]
        if genre_app== genre:
            total+=float(row[5])
            len_genre+=1
    avg = total/len_genre
    print(genre,':',avg)

From the above, `Social Networking` appears to be one of the genres with the most user ratings per app on iOS.
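
The loop above walks through the whole dataset once per genre. The averages can also be collected in a single pass by keeping a running total and a running count per genre. This is a minimal sketch: `average_by_group` is a hypothetical helper and the rows are toy data, not values from the App Store dataset.

```python
from collections import defaultdict

def average_by_group(rows, group_idx, value_idx):
    """Average the numeric column value_idx for each distinct value of group_idx."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows: genre at index 0, rating count at index 1.
sample = [
    ['Games', '100'],
    ['Games', '300'],
    ['Music', '50'],
]

averages = average_by_group(sample, 0, 1)
# Print the genres from highest to lowest average.
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

On the real data the call would presumably be `average_by_group(free_apple, 11, 5)`, since `prime_genre` is at index 11 and `rating_count_tot` at index 5.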

## Category with most number of installations in PlayStore
After inspecting our Android data set, we realise that the install counts are imprecise: they are open-ended buckets such as `10,000+` (shown below). However, we will still use them to approximate which category of app has the most users.

In [None]:
display_table(free_android,5)

In [None]:
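install_tbl={}
for category in freq_table(free_android,1):
    #reset the running totals for every category; otherwise each average would include all earlier categories
    total=0
    len_category=0
    for row in free_android:
        category_app=row[1]
        if category_app==category:
            #'1,000,000+' -> 1000000.0 (a lower bound on the real install count)
            total+=float(row[5].replace('+','').replace(',',''))
            len_category+=1
    avg_installs=total/len_category
    print(category,':',avg_installs)
    install_tbl[category]=avg_installs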
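
The install counts are stored as strings, so they have to be cleaned before any arithmetic. The small sketch below isolates that conversion; `parse_installs` is a hypothetical helper name, and the example strings are buckets that appear in the Google Play data shown earlier.

```python
def parse_installs(text):
    """Convert a bucket such as '10,000+' to the integer 10000."""
    return int(text.replace(',', '').replace('+', ''))

examples = ['10,000+', '500,000+', '5,000,000+']
parsed = [parse_installs(text) for text in examples]

for text, value in zip(examples, parsed):
    print(text, '->', value)
```

Each parsed value is only a lower bound: an app listed as `10,000+` may have anywhere from 10,000 installs up to the next bucket, so averages built on these numbers are rough estimates.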

In [None]:
#let's sort the results for easier reading
table_display=[]
for key in install_tbl:
    key_val_as_tuple= (install_tbl[key],key)
    table_display.append(key_val_as_tuple)
#sort the table in descending order
table_sorted= sorted(table_display,reverse=True)
for entry in table_sorted:
    print(entry[1],':',entry[0])

From the results shown, the `Communication` category has the highest average number of installations per app. Remember that the install counts are bucketed and only serve as an estimate. The result is also likely to be skewed by a few apps with a huge number of installations (such as WhatsApp, LINE, Facebook Messenger, Gmail and Telegram).

In [None]:
for app in free_android:
    if app[1] =='COMMUNICATION' and ( app[5] == '1,000,000,000+' or app[5] =='500,000,000+' or app[5] == '100,000,000+'):
        print(app[0],':',app[5])

This pattern, where a few top apps account for the majority of installs, seems to be common in categories that people use on a daily basis. Rather than look at categories that already have many dominant apps, perhaps the company should look at categories that have fewer of them. In such categories, users face less resistance to moving from existing apps to new ones, which gives a new app a better chance of gaining users.

Let us take a look at several other categories before deciding.

In [None]:
for app in free_android:
    if app[1] =='DATING' and ( app[5] == '1,000,000,000+' or app[5] =='500,000,000+' or app[5] == '100,000,000+'
                             or app[5]=='50,000,000+' or app[5]=='10,000,000+' or app[5] =='5,000,000+'):
        print(app[0],':',app[5])

In the `Dating` category, the dominant apps clearly have fewer installs than those in the `Communication` category. This could be a category for the company to target, but let us look at other categories first.

In [None]:
for app in free_android:
    if app[1] =='BEAUTY' and ( app[5] == '1,000,000,000+' or app[5] =='500,000,000+' or app[5] == '100,000,000+'
                             or app[5]=='50,000,000+' or app[5]=='10,000,000+' or app[5] =='5,000,000+' or app[5] == '1,000,000+'):
        print(app[0],':',app[5])

In the `Beauty` category, the dominant apps have even fewer installs than those in the `Dating` category.
However, it is worth noting that the average number of installs per app is also lower than in the `Dating` category. This is likely because `Beauty` is a more niche category than `Dating`.
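
Listing every install bucket by hand makes it easy to leave one out. A possible alternative is to convert the bucket to a number and compare it with a threshold. This is only a sketch: `parse_installs` and `apps_with_min_installs` are hypothetical helpers, and the rows are made-up placeholders that keep the category at index 1 and the installs at index 5.

```python
def parse_installs(text):
    """Convert a bucket such as '10,000+' to the integer 10000."""
    return int(text.replace(',', '').replace('+', ''))

def apps_with_min_installs(rows, category, min_installs, cat_idx=1, installs_idx=5):
    """Return (name, installs) pairs for apps in a category at or above a threshold."""
    return [
        (row[0], row[installs_idx])
        for row in rows
        if row[cat_idx] == category
        and parse_installs(row[installs_idx]) >= min_installs
    ]

# Placeholder rows; unused columns are left empty.
sample = [
    ['App A', 'DATING', '', '', '', '10,000,000+'],
    ['App B', 'DATING', '', '', '', '500,000+'],
    ['App C', 'BEAUTY', '', '', '', '5,000,000+'],
]

popular_dating = apps_with_min_installs(sample, 'DATING', 5_000_000)
print(popular_dating)
```

With this approach, changing the cut-off for another category means changing one number instead of editing a chain of string comparisons.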

In [None]:
count=0
for row in free_android:
    if row[1]=='BEAUTY':
        count+=1
print(count)

In [None]:
count=0
for row in free_android:
    if row[1]=='DATING':
        count+=1
print(count)

The number of apps in each category also tells us how many competitors a new app would face.

I would recommend a category that is neither too niche nor too crowded. Judging by the averages above, `Dating` has a considerably larger user base than `Beauty`, yet it does not have as many dominant apps as the `Communication` category. This would give an app in the `Dating` category a fair chance to succeed in the Android market.

Let us take a look at the iOS market too.

In [None]:
dating_dictionary={}

#as there is no genre for 'dating', we need to list keywords and sieve out the relevant data
for row in free_apple:
    if ('date' in row[1] or 'Date' in row[1] or 'dating' in row[1] or 'Dating' in row[1] or 'meet' in row[1] or 'Meet' in row[1]):
        dating_dictionary[row[1]]=float(row[5])

temp_list=[]
for key in dating_dictionary:
    key_as_val=(dating_dictionary[key],key)
    temp_list.append(key_as_val)
sorted_list=sorted(temp_list,reverse=True)

for entry in sorted_list:
    print(entry[1],':',entry[0])


From this, we can see that the number of dating apps in the iOS market is quite low, which could mean a higher chance of success in that market.
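
Plain substring checks are case-sensitive and also match inside longer words, so `'date'` would match a name containing "Update". A stricter variant, sketched below with the `re` module, matches the keywords as whole words regardless of case. The helper name and most of the sample names are invented for illustration.

```python
import re

KEYWORDS = ('date', 'dating', 'meet')
# \b marks a word boundary, so 'date' does not match inside 'Update'.
pattern = re.compile(r'\b(?:' + '|'.join(KEYWORDS) + r')\b', re.IGNORECASE)

def looks_like_dating_app(name):
    """Return True if the name contains one of the keywords as a whole word."""
    return pattern.search(name) is not None

names = [
    'MeetMe: Chat & Meet New People',
    'Software Update Tracker',
    'Speed Dating Night',
    'DATE planner',
]

matches = [name for name in names if looks_like_dating_app(name)]
for name in matches:
    print(name)
```

The trade-off is that whole-word matching misses variants such as "Dates" or a name that only contains "MeetMe", so the keyword list would need tuning. Either way, a keyword search is a rough proxy, and the matched names should still be reviewed by hand.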

## Conclusion
In this project, we analyzed data from both the AppStore and PlayStore with the goal of recommending an app profile that is likely to be profitable in both markets.
We concluded that dating apps are not yet saturated in either market, so there is still room for growth and a high user adoption rate. With some outstanding features such as group dates, planned activity dates, restaurant collaborations and other special features, the company would have a good chance of success.