# Profitable Free Application Recommendation: Analysis of the Apple App Store and Google Play Store.
The goal of this guided project is to analyse applications on the Apple App Store and the Google Play Store to find out which types of application attract the most users, and to use that information to make a data-driven recommendation to developers on the type of application to build. (For the purposes of this project, I pretend to work as a data analyst at a company that creates free Android and iOS apps.)

In [None]:
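from csv import reader

# Open both files, parse them with csv.reader and store every row (header included) in a list.
opened_1 = open('../input/google-and-apple-store/AppleStore.csv', encoding='utf8')
opened_2 = open('../input/google-and-apple-store/googleplaystore.csv', encoding='utf8')
read_file1 = reader(opened_1)
read_file2 = reader(opened_2)
ios = list(read_file1)
google_play = list(read_file2)
# Close the file handles now that all rows are in memory.
opened_1.close()
opened_2.close()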
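
> A minimal sketch of an alternative way to load the files, assuming a hypothetical helper named `load_csv()` (it is not part of the cell above): a `with` block closes each file automatically once its rows have been read.

```python
from csv import reader


def load_csv(path):
    """Return every row of the CSV file at `path` as a list of lists."""
    # The `with` block closes the file even if reading fails part-way.
    with open(path, encoding='utf8', newline='') as f:
        return list(reader(f))


# Hypothetical usage with the same paths as the cell above:
# ios = load_csv('../input/google-and-apple-store/AppleStore.csv')
# google_play = load_csv('../input/google-and-apple-store/googleplaystore.csv')
```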

>I created a function `explore_data()` that prints a chosen slice of rows from any dataset, followed by the number of rows and columns when the `rows_and_columns` argument is set to `True` (it defaults to `False`).

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('----------------------------------------------------------------------------')
        
        
print('For Apple Store: ')
explore_data(ios[1:], 0, 2, rows_and_columns=True)
print('For Google Play: ')
explore_data(google_play[1:], 0, 2, rows_and_columns=True)

> Here I print the column names so that we know the index of each column (data point) within a row.

In [None]:
ios_column_names = ios[0]
google_play_column_names = google_play[0]

print(ios_column_names)
print('\n')
print(google_play_column_names)

>Printing the entry at index 10473 of the `google_play` list shows that it contains faulty data, so I delete it in the following cell; I want to work with correct data in order to draw accurate conclusions.

In [None]:
print(google_play[10473])

In [None]:
del google_play[10473] #DO NOT RUN DEL STATEMENT MORE THAN ONCE!
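
> The warning above exists because running `del` twice would remove a valid row. A possible safeguard, sketched here with a hypothetical helper `drop_malformed_rows()`, is to filter by a condition instead of a position. It assumes the faulty row can be recognised by having a different number of columns than the header row, which should be checked against the printed entry first.

```python
def drop_malformed_rows(dataset):
    """Return the header plus every row that has as many columns as the header."""
    header = dataset[0]
    return [header] + [row for row in dataset[1:] if len(row) == len(header)]


# Toy data (not from the real file): the second data row is missing a column.
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'GAME', '4.5'],
    ['App B', '1.9'],
    ['App C', 'TOOLS', '4.0'],
]
cleaned = drop_malformed_rows(sample)
# Running the filter again changes nothing, unlike a repeated `del`.
assert drop_malformed_rows(cleaned) == cleaned
print(len(cleaned))
```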

## Checking for duplicate entries in Apple and Google Stores.
**I will print the number of duplicate entries, the number of unique apps and a few examples of duplicates.**
> I created a function `duplicate_and_unique_entries()` that collects app names into two lists, `unique_apps` and `duplicate_apps`, then prints the length of both lists and a few examples of duplicate apps.

In [None]:
def duplicate_and_unique_entries(dataset, index):
    unique_apps = []
    duplicate_apps = []
    
    for apps in dataset:
        name = apps[index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
            
    print('Number of duplicate apps: ', len(duplicate_apps)) 
    print('Number of unique apps: ', len(unique_apps))
    print('\n')
    print('Examples of duplicate apps: ', duplicate_apps[:3])
   
 
print('For Google Play: ')
duplicate_and_unique_entries(google_play[1:], 0)
print('--------------------------')
print('For ios:')
duplicate_and_unique_entries(ios[1:], 1)
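
> The `name in unique_apps` test scans a list, so the function above slows down as the list grows. As a hedged alternative, the sketch below counts names with `collections.Counter` from the standard library; `count_duplicates()` is a hypothetical name and the rows are toy data.

```python
from collections import Counter


def count_duplicates(dataset, index):
    """Return (number of unique names, number of duplicate rows, duplicated names)."""
    counts = Counter(row[index] for row in dataset)
    n_unique = len(counts)
    # Every row beyond the first occurrence of a name counts as a duplicate.
    n_duplicates = len(dataset) - n_unique
    duplicated = [name for name, count in counts.items() if count > 1]
    return n_unique, n_duplicates, duplicated


sample = [['Slack'], ['Box'], ['Slack'], ['Slack'], ['Zoom']]
print(count_duplicates(sample, 0))
```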

>I printed every row in which the duplicate app examples above occur, to confirm that they really are duplicate entries.

In [None]:
print('For Google Play: ')
for app in google_play[1:]: 
    name = app[0]
    if name == 'Quick PDF Scanner + OCR FREE':
        print(app)
        
        
print('\n')
print('For ios: ')
for app in ios[1:]:
    name = app[1]
    if name == 'Mannequin Challenge':
        print(app)

>From the output of `duplicate_and_unique_entries()` above, we can see that the number of unique apps in each store equals the expected length of its cleaned dataset: the original length (not counting the first row, which contains the column names) minus the number of duplicates. The cell below confirms this.


In [None]:
print('Expected length: ', len(google_play[1:]) - 1181)
print('Expected length for ios: ', len(ios[1:]) - 2)

## Getting the cleaned datasets of Google and Apple Store.
> By clean I mean that the datasets no longer contain duplicate entries: for each app, only the entry with the highest number of reviews (Google Play) or ratings (iOS) is kept, on the assumption that it is the most recent one.

For that I used dictionaries, lists, `if`/`elif` statements and multiple conditions:
* First I created a dictionary that maps each app name to the highest number of reviews recorded for it.
* Then I looped through the original dataset and, for each row, added it to the cleaned list only if its number of reviews equals the maximum stored in the dictionary and its name is not yet in a tracking list (which records the apps that have already been added).
* I repeated the same process for the iOS apps, using different variables and the rating count instead of the review count.
* Finally, I printed the lengths of the cleaned datasets to make sure they match the expected lengths after removing the duplicates.

Now we have datasets that we can work with going forward.


In [None]:
reviews_max = {}
for apps in google_play[1:]:
    name = apps[0]
    n_reviews = float(apps[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews



google_play_cleaned = []
already_added = []

for app in google_play[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_play_cleaned.append(app)
        already_added.append(name)
print('Length of cleaned google dataset is: ', len(google_play_cleaned))

print('-----------------------------------------------------------------------------------------------')
ios_rating_max = {}
for apps in ios[1:]:
    name = apps[1]
    n_ratings = float(apps[5])
    if name in ios_rating_max and ios_rating_max[name] < n_ratings:
        ios_rating_max[name] = n_ratings
    elif name not in ios_rating_max:
        ios_rating_max[name] = n_ratings


ios_cleaned = []
ios_already_added = []
for app in ios[1:]:
    app_name = app[1]
    n_ratings = float(app[5])
    if n_ratings == ios_rating_max[app_name] and app_name not in ios_already_added:
        ios_cleaned.append(app)
        ios_already_added.append(app_name)
print('Length of cleaned ios dataset is: ', len(ios_cleaned))
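
> Both stores go through the same two steps, so the logic can be folded into one function. This is only a sketch with a hypothetical name, `keep_most_reviewed()`, and toy rows. It keeps the same rows as the cell above, but returns them in the order in which each app name first appears, which may differ from the order produced above.

```python
def keep_most_reviewed(rows, name_index, count_index):
    """Keep one row per app name: the row with the highest review or rating count."""
    best = {}
    for row in rows:
        name = row[name_index]
        count = float(row[count_index])
        # Strictly greater, so the first row wins when two rows tie.
        if name not in best or count > float(best[name][count_index]):
            best[name] = row
    return list(best.values())


sample = [
    ['Slack', '100'],
    ['Box', '40'],
    ['Slack', '250'],
    ['Slack', '250'],
]
print(keep_most_reviewed(sample, 0, 1))
```

> With the real data the calls would be `keep_most_reviewed(google_play[1:], 0, 3)` and `keep_most_reviewed(ios[1:], 1, 5)`, using the same column indices as the cell above.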

## Removing non-English apps: apps targeted at an English-speaking audience.
> Since my recommendation is aimed at an English-speaking audience only, the analysis has to exclude non-English apps.
Hence the `is_english()` function, which checks whether an app name is English: a name counts as English unless it contains more than three characters outside the ASCII range (code points above 127).

In [None]:
def is_english(name):
    count = 0
    for char in name:
        if ord(char) > 127:
            count += 1
            if count > 3:
                return False
        
    return True
    
is_english('爱奇艺PPS -《欢乐颂2》电视剧热播')
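
> A few hand-picked names (assumed examples, not taken from the output above) show how the three-character threshold behaves. The function is repeated in a condensed form so that the snippet runs on its own.

```python
def is_english(name):
    """Same rule as above: more than three non-ASCII characters means non-English."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= 3


examples = [
    'Instagram',                      # plain ASCII
    'Docs To Go™ Free Office Suite',  # one non-ASCII symbol
    'Instachat 😜',                   # one emoji
    '爱奇艺PPS -《欢乐颂2》电视剧热播',  # mostly non-ASCII
    'Más Fútbol',                     # Spanish, but only two non-ASCII characters
]
for name in examples:
    print(name, '->', is_english(name))
```

> The last example passes the check even though it is not English, so the function is a rough filter rather than real language detection.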

> * First I created two lists, one for Google Play and one for iOS (`google_play_eng_apps` and `app_store_eng_apps`).
* Then I looped through the cleaned datasets and, for each row, used the `is_english()` function to check whether the app name is English; if it is, the row is added to the respective list.
* Finally, I printed the lengths of the lists to see how many apps are left to work with.

In [None]:
google_play_eng_apps = []
app_store_eng_apps = []

for app in ios_cleaned:
    name = app[1]
    if is_english(name):
        app_store_eng_apps.append(app)
        
for app in google_play_cleaned:
    name = app[0]
    if is_english(name):
        google_play_eng_apps.append(app)
        
print('The number of apps left in the google_play dataset after removing duplicates and non-English apps is: ', len(google_play_eng_apps))
print('The number of apps left in the ios dataset after removing duplicates and non-English apps is: ', len(app_store_eng_apps))

## Is it free? Taking away paid apps in Apple and Google Stores.
> Since the analysis is specifically about free applications, it would make no sense to include paid apps.
So again we create new datasets, `free_google_apps` and `free_ios_apps`, which contain only the apps whose price is zero (stored as `'0'` in the Google Play data and `'0.0'` in the iOS data).

In [None]:
free_google_apps = []
free_ios_apps = []

for apps in google_play_eng_apps:
    price = apps[7]
    if price == '0':
        free_google_apps.append(apps)

for apps in app_store_eng_apps:
    price = apps[4]
    if price == '0.0':
        free_ios_apps.append(apps)

print('Google apps left: ', len(free_google_apps))
print('ios apps left: ', len(free_ios_apps))
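
> The two loops compare price strings exactly (`'0'` and `'0.0'`), which depends on how each file happens to write a zero price. A slightly more tolerant check, sketched below with a hypothetical helper `is_free()`, converts the price to a number first. It assumes that paid prices may carry a leading `$` and that anything unparseable should count as not free.

```python
def is_free(price):
    """Return True when a price string represents zero, e.g. '0' or '0.0'."""
    cleaned = price.replace('$', '').strip()
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Assumption: a value that is not a number is treated as not free.
        return False


for raw in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(raw), '->', is_free(raw))
```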


## If it's popular then there is potential for success.
### Finding popular applications
> In the cell below I created two functions, `freq_table()` and `display_table()`; both take a dataset and a column index as arguments. `freq_table()` returns an unsorted frequency table that maps each value in that column (for example a genre) to the percentage of apps that have it. `display_table()` prints that frequency table in descending order of percentage.

In [None]:
def freq_table(dataset, index):
    freq_dict = {}
    for apps in dataset:
        genre = apps[index]
        if genre not in freq_dict:
            freq_dict[genre] = 1
        elif genre in freq_dict:
            freq_dict[genre] += 1

    for genre in freq_dict:
        freq_dict[genre] /= len(dataset)
        freq_dict[genre] *= 100
    return freq_dict
        

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_value_as_tuple = (table[key], key)
        table_display.append(key_value_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':',entry[0])
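
> To see what the two functions produce, here is a condensed, self-contained sketch that builds the same kind of percentage table with `collections.Counter` and prints it in descending order. `freq_table_pct()` is a hypothetical name and the four rows are toy data.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Map each value found in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {key: count / total * 100 for key, count in counts.items()}


sample = [['A', 'Games'], ['B', 'Games'], ['C', 'Education'], ['D', 'Games']]
table = freq_table_pct(sample, 1)
# Sort by percentage, largest first, as display_table() does.
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```

> On the real data, the equivalent calls would presumably be `display_table(free_ios_apps, 11)` for iOS genres and `display_table(free_google_apps, 1)` for Google Play categories, since those are the column indices used in the later cells.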



> The code below prints the average number of user ratings per app for each genre in the Apple Store, from highest to lowest.
To achieve this:
* I created a list and a dictionary (`genres` and `genre_avg_user_ratings` respectively).
* Looped through the dictionary returned by the `freq_table()` function created earlier.
* Extracted the keys (which are the genres in the iOS store) and added them to the list.
* Looped through the `genres` list and, for each genre, created two variables, `total` and `len_genre`, to store the sum of the rating counts and the number of apps in that genre. (This is so that the average for each genre can be calculated.)
* Used a nested loop over the `free_ios_apps` list to accumulate those two values, then stored the genre as a key and `average_user_rating` as its value in the `genre_avg_user_ratings` dictionary.
* Since a sorted result makes for quicker analysis, I converted each dictionary entry into an `(average, genre)` tuple, collected the tuples in a list, sorted the list in descending order and displayed its contents.

In [None]:
genres = []
genre_avg_user_ratings = {}

for key in freq_table(free_ios_apps, 11):
    genres.append(key)

for genre in genres:
    total = 0 #stores the sum of the number of ratings specific to each genre
    len_genre = 0 #stores the number of apps specific to each genre
    for app in free_ios_apps:
        genre_app = app[11]
        if genre_app == genre:
            no_of_user_rating = float(app[5])
            total += no_of_user_rating
            len_genre += 1
    average_user_rating = total / len_genre
    genre_avg_user_ratings[genre] = average_user_rating

genre_avg_user_ratings_display = []
for key in genre_avg_user_ratings:
    key_value_as_tuple = (genre_avg_user_ratings[key], key)
    genre_avg_user_ratings_display.append(key_value_as_tuple)
    
genre_avg_user_ratings_display = sorted(genre_avg_user_ratings_display, reverse=True)
print('The average number of user ratings per genre in the ios store is: ')
for entry in genre_avg_user_ratings_display:
    print(entry[1], ': ', entry[0])
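
> The nested loop above rescans every app once per genre. A single pass is enough if running totals are kept per genre; the sketch below assumes a hypothetical helper `average_by_group()` and uses toy rows with the genre in column 0 and the rating count in column 1.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` for each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


sample = [['Games', '100'], ['Games', '300'], ['Education', '50']]
averages = average_by_group(sample, 0, 1)
for genre, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', average)
```

> For the cell above, the call would be `average_by_group(free_ios_apps, 11, 5)`.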

>The code below prints the average number of installs per app for each category in the Google Play Store, in decreasing order.

In [None]:
google_app_categories = []
google_app_avg_no_installs = {}

for key in freq_table(free_google_apps, 1):
    google_app_categories.append(key)


for category in google_app_categories:
    total = 0
    len_category = 0
    for app in free_google_apps:
        category_app = app[1]
        if category_app == category:
            no_of_installs = app[5]
            no_of_installs = no_of_installs.replace('+', '')
            no_of_installs = no_of_installs.replace(',', '')
            no_of_installs = float(no_of_installs)
            total += no_of_installs
            len_category += 1
    avg_no_installs = total / len_category
    google_app_avg_no_installs[category] = avg_no_installs

    
Google_avg_installs = []
for key in google_app_avg_no_installs:
    keyValue_as_tuple = (google_app_avg_no_installs[key], key)
    Google_avg_installs.append(keyValue_as_tuple)

Google_avg_installs = sorted(Google_avg_installs, reverse=True)
print("The average number of installs for each app category in Google Play Store is: ")
for entry in Google_avg_installs:
    print(entry[1], ': ', entry[0])
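
> The install counts are stored as text such as `'1,000,000+'`, which is why the `+` and `,` characters are stripped before converting. Pulling that step into a small function makes it easy to check in isolation; `parse_installs()` is a hypothetical name, and the sample strings only assume that format.

```python
def parse_installs(text):
    """Convert an install figure such as '1,000,000+' to a float."""
    return float(text.replace('+', '').replace(',', ''))


for raw in ['0', '500+', '10,000+', '1,000,000+']:
    print(raw, '->', parse_installs(raw))
```

> Because a trailing `+` suggests that each value is a bucket ("at least this many"), the averages are best read as rough lower-bound estimates rather than exact install counts.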

From the results of the analysis, both markets seem to be dominated by communication and social applications, so I recommend a free, productivity-oriented social application with video communication capabilities.

Since there are already a lot of social apps, I suggest building a free learning platform where teachers and students can meet for the purposes of learning; it should also have video and audio capabilities to facilitate the learning experience.

# Conclusions
In this guided project I analysed data from the Google Play and Apple App Store markets with the goal of recommending a profitable free app profile that can be successful in both markets.

I concluded that it is worth building a social productivity application with video and voice communication capabilities, in the form of a learning platform where teachers and students meet. Since both markets tend towards that area, combining the functionality of social networking apps with that of productivity apps is an area worth looking into.