# Determining the characteristics of profitable mobile apps.

This project aims to identify the main characteristics of mobile applications that lead to a growing number of users.
The results can be used for marketing purposes.

The method is an analysis of statistical data from Google Play (approximately 10,000 applications) and the App Store (approximately 7,000 applications).

## Introduction.

Our first step is to open the files containing the data and load each of them into a list of rows.

In [43]:
from csv import reader

In [44]:
# Extracting Android data (the file is assumed to be UTF-8 encoded)
with open('googleplaystore.csv', encoding='utf8') as android_file_open:
    android_file_read = reader(android_file_open)
    android_dataset = list(android_file_read)

In [45]:
# Extracting Apple data (the file is assumed to be UTF-8 encoded)
with open('AppleStore.csv', encoding='utf8') as apple_file_open:
    apple_file_read = reader(apple_file_open)
    apple_dataset = list(apple_file_read)

### Initial look at the datasets. Android.

In this section we take an initial look at the Android dataset to determine the structure of the data.

In [46]:
print("The number of columns: ", len(android_dataset[0]))

The number of columns:  13


In [47]:
print("The dataset columns' names: " , android_dataset[0])

The dataset columns' names:  ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [48]:
print("The number of entries: ", len(android_dataset) - 1) #the first row contains the columns' names

The number of entries:  10841


In [49]:
print("Example of an entry: ", android_dataset[1])

Example of an entry:  ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


### Initial look at the datasets. Apple.

In this section we will determine the structure of the Apple dataset.

In [50]:
print("The number of columns: ", len(apple_dataset[0]))

The number of columns:  16


In [51]:
print("The dataset columns' names: " , apple_dataset[0])

The dataset columns' names:  ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [52]:
print("The number of entries: ", len(apple_dataset) - 1) #the first row contains the columns' names

The number of entries:  7197


In [53]:
print("Example of an entry: ", apple_dataset[1])

Example of an entry:  ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


## Data cleaning.

In this section we will prepare the data for analysis.
Preparation will consist of:
* Removing duplicates
* Removing invalid entries
* Removing non-free applications
* Removing non-English apps.

### Removing duplicates

Several entries for the same application may distort the statistics, so we check the datasets for duplicates and remove them.

We start by counting the duplicates. We consider two entries *duplicates* if they represent the same application, that is, if both entries have the same application name.

In [75]:
# Takes the dataset in a form of list and the index of the column that contains the name.
# Returns the number of duplicates

def number_of_duplicates(dataset: list, name_index: int) -> int:
    uniques = set()
    duplicates = 0
    
    for row in dataset[1:]:
        app_name = row[name_index]
        if app_name in uniques:
            duplicates += 1
        else:
            uniques.add(app_name)
        
    return duplicates

#### Technique

When an application has several entries, we keep the one with the highest number of reviews: review counts only grow over time, so that entry is most likely the most recent.
To achieve that, we create a dictionary called ```unique_entries``` whose keys are the unique application names. Each value is a dictionary holding the number of reviews and the original row.

We go through the dataset row by row. When we meet an application for the first time, we save its row and its number of reviews in ```unique_entries```.
When we meet a duplicate, we check whether its number of reviews is larger than the saved one and, if so, replace the saved data.

At the end we go through the dictionary and add all its entries to the list of unique rows, which completes the removal of duplicates.
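
One detail worth keeping in mind when comparing review counts: `csv.reader` returns every cell as a string, and strings are compared character by character rather than by numeric value. The short illustration below uses made-up values, not rows from the datasets.

```python
# Cells read with csv.reader are strings, so '<' compares them character by character.
reviews_a = '9'
reviews_b = '10'

string_comparison = reviews_a < reviews_b             # False: '9' sorts after '1'
numeric_comparison = int(reviews_a) < int(reviews_b)  # True: 9 is less than 10

print(string_comparison, numeric_comparison)
```

Converting both values with `int()` before the comparison avoids this pitfall, provided the cells really contain whole numbers.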

In [73]:
# Takes the dataset in the form of a list, the index of the column that contains the name
# and the index of the column that contains the number of reviews.
# Returns the initial dataset without the duplicates. 
# The order of rows is not guaranteed.


def remove_duplicates(dataset: list, name_index: int, reviews_number_index: int) -> list:
    unique_entries = dict()

    for i in range(1, len(dataset)):
        if dataset[i][name_index] not in unique_entries:
            unique_entries[dataset[i][name_index]] = {'reviews': dataset[i][reviews_number_index], 'data': dataset[i]}
        else:
            # This branch belongs to the `if` above, so it runs for every duplicate row.
            # Review counts are strings; convert them so that '9' does not rank above '10'.
            # Assumes the review cells of duplicated apps hold whole numbers.
            if int(unique_entries[dataset[i][name_index]]['reviews']) < int(dataset[i][reviews_number_index]):
                unique_entries[dataset[i][name_index]] = {'reviews': dataset[i][reviews_number_index], 'data': dataset[i]}
            
    unique_list = list()
    unique_list.append(dataset[0])

    for app in unique_entries:
        unique_list.append(unique_entries[app]['data'])
        
    return unique_list

#### Android dataset 

Check how many duplicates the Android dataset contains.

In [76]:
duplicates = number_of_duplicates(android_dataset, 0)
print('The number of duplicates in Android dataset is: ', duplicates)

The number of duplicates in Android dataset is:  1181
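
Before removing anything, it can help to look at what the duplicate entries for a single application actually contain. The sketch below is one possible way to do that; `rows_for_app` is a hypothetical helper and the rows in `toy_dataset` are made up for illustration, not taken from the Android dataset.

```python
def rows_for_app(dataset: list, name_index: int, app_name: str) -> list:
    """Return every data row (header excluded) whose name equals app_name."""
    return [row for row in dataset[1:] if row[name_index] == app_name]


# Made-up rows: the name is in column 0, the review count in column 1.
toy_dataset = [
    ['App', 'Reviews'],
    ['Example App', '100'],
    ['Other App', '5'],
    ['Example App', '120'],
]

for row in rows_for_app(toy_dataset, 0, 'Example App'):
    print(row)
```

On the real data the call would presumably look like `rows_for_app(android_dataset, 0, <some app name>)`, with a name that is known to be duplicated.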


Remove the duplicates of the Android dataset.
The name index is 0, the number of reviews column's index is 3.

In [77]:
unique_list_android = remove_duplicates(android_dataset, 0, 3)

In [78]:
# Number of unique items of the DS.
resulting_length = len(unique_list_android) - 1
print('The number of unique entries: ', resulting_length)


The number of unique entries:  9660


#### Apple dataset

Check if the Apple dataset contains duplicates. 

As before, two entries are duplicates if they have the same application name.

In [80]:
duplicates = number_of_duplicates(apple_dataset, 1)
print('The number of duplicates in Apple dataset is: ', duplicates)

The number of duplicates in Apple dataset is:  2


We use the same technique to delete duplicate rows from the Apple dataset.
The index of the name column is 1, and the index of the column with the number of reviews is 5.


In [81]:
unique_list_apple = remove_duplicates(apple_dataset, 1, 5)

In [82]:
# Number of unique items of the Apple DS.
resulting_length = len(unique_list_apple) - 1
print('The number of unique entries in Apple dataset: ', resulting_length)

The number of unique entries in Apple dataset:  7195


### Removing incorrect values.

#### Rows with wrong number of cells.

One example of invalid rows is rows with the wrong number of cells.
The Android dataset contains 13 columns, so each row must have 13 cells.
The Apple dataset contains 16 columns, so each row must have 16 cells.

In each dataset without duplicates, we check that every row contains the correct number of cells and remove the rows that do not.

In [84]:
# Takes a dataset with heading row.
# Returns a dataset without rows with incorrect number of cells.

def check_rows_length(dataset: list) -> list:
    correct_rows = list()
    correct_rows.append(dataset[0])
    
    number_of_cols = len(dataset[0])
    
    for row in dataset[1:]:
        if len(row) == number_of_cols:
            correct_rows.append(row)
    
    return correct_rows

In [85]:
# Android
correct_rows_android = check_rows_length(unique_list_android)

In [86]:
removed_rows = len(unique_list_android) - len(correct_rows_android)
print('The number of removed entries: ', removed_rows)

The number of removed entries:  1


In [87]:
# Apple
correct_rows_apple = check_rows_length(unique_list_apple)

In [88]:
removed_rows = len(unique_list_apple) - len(correct_rows_apple)
print('The number of removed entries: ', removed_rows)

The number of removed entries:  0
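
Counting the removed rows does not show which rows were malformed. A minimal sketch of one way to locate them is given below; `find_malformed_rows` is a hypothetical helper and `toy_dataset` contains made-up rows, so the result says nothing about the real datasets.

```python
def find_malformed_rows(dataset: list) -> list:
    """Return (row_index, cell_count) pairs for rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up rows with a three-column header.
toy_dataset = [
    ['App', 'Category', 'Rating'],
    ['Example App', 'TOOLS', '4.1'],
    ['Broken App', '4.0'],  # one cell is missing
]

malformed = find_malformed_rows(toy_dataset)
print(malformed)
```

The returned index refers to the position in the list that was passed in, so it can be used to print the offending row and compare it with the header.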


### Removing non-English apps.

Our analysis is concerned with applications for an English-speaking audience, so we need to remove applications that are not designed for it.

To do that, we remove every application whose name contains characters that are not English letters, digits or punctuation marks, that is, characters outside the ASCII range (codes 0 to 127).

We start with a function that detects whether a given name belongs to a foreign app.
It checks whether the string contains more than three non-ASCII characters, which points to a foreign app. The threshold keeps English names that contain a few emoji or symbols such as ™.

In [94]:
def contains_foreign(text: str) -> bool:
    count = 0
    
    for char in text:
        if ord(char) > 127:
            count += 1
    
    return count > 3

In [95]:
print(contains_foreign('Instagram'))
print(contains_foreign('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(contains_foreign('Docs To Go™ Free Office Suite'))
print(contains_foreign('Instachat 😜'))

False
True
False
False
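
With the check in place, filtering a dataset could look like the sketch below. `keep_english` is a hypothetical helper name, `contains_foreign` is repeated in compact form so the block runs on its own, and `toy_dataset` reuses three of the names tested above rather than real rows.

```python
def contains_foreign(text: str) -> bool:
    # Same rule as above: more than three characters outside the ASCII range.
    return sum(1 for char in text if ord(char) > 127) > 3


def keep_english(dataset: list, name_index: int) -> list:
    """Return the header plus the rows whose name passes the check."""
    english_rows = [dataset[0]]
    for row in dataset[1:]:
        if not contains_foreign(row[name_index]):
            english_rows.append(row)
    return english_rows


# Toy dataset with a single column holding the name.
toy_dataset = [
    ['App'],
    ['Instagram'],
    ['爱奇艺PPS -《欢乐颂2》电视剧热播'],
    ['Instachat 😜'],
]

english_only = keep_english(toy_dataset, 0)
print(english_only)
```

For the real data the name index would be 0 for the Android dataset and 1 for the Apple dataset, as in the duplicate-removal step.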
