# Apple Store and Google Play

The data sets cover [Google Play](https://www.kaggle.com/lava18/google-play-store-apps/home) and [iOS](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) apps. The columns are app-related variables. In the Google Play data set they are organized in the following order:



| App                                            | Category       | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres       | Last Updated    | Current Ver | Android Ver  |
|------------------------------------------------|----------------|--------|---------|------|----------|------|-------|----------------|--------------|-----------------|-------------|--------------|
| Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1    | 159     | 19M  | 10,000+  | Free | 0     | Everyone       | Art & Design | January 7, 2018 | 1.0.0       | 4.0.3 and up |

This project analyses app-related variables for _iOS_ and _Android_ apps from public data sets in order to rank the characteristics that can increase the **profitability** of free apps in both markets.

The number of users has a great influence on the profitability of free apps. This work therefore aims to find out which characteristics attract users and are more likely to make a free app profitable in both markets (iOS and Android).

## 1. Exploring the Apple Store and Google Play data sets


In this section I explore the Apple Store and Google Play data sets in the following steps:

   1. Open the csv files;
   2. Use the `explore_data()` function to print a few rows;
   3. Print the column names to gain insight for future analysis.

## 1.1 Opening the csv files

In [1]:
from csv import reader

# Read each csv file into a list of rows
opened_data_ios = open('AppleStore.csv', encoding='utf8')
opened_data_andr = open('googleplaystore.csv', encoding='utf8')
read_ios = reader(opened_data_ios)
read_andr = reader(opened_data_andr)
ios_data = list(read_ios)
andr_data = list(read_andr)

# Drop the header row so that only app rows remain
ios_data = ios_data[1:]
andr_data = andr_data[1:]
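
A variant of the loading step, sketched under the assumption that the header row should be kept for later inspection. The helper `load_csv()` is hypothetical (it is not part of the original notebook); it closes the file automatically and returns the header and the data rows separately.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the csv file at `path`.

    `load_csv` is a hypothetical helper, not part of the original notebook.
    """
    # `with` closes the file even if reading fails;
    # newline='' is what the csv module recommends for file objects
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    # The first row holds the column names, the rest are app rows
    return rows[0], rows[1:]
```

For example, `ios_header, ios_data = load_csv('AppleStore.csv')` would leave the column names in `ios_header`, ready to be printed.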

## 1.2 Defining `explore_data()` and exploring the data sets

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(ios_data,0,3,True)
explore_data(andr_data,0,3,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2

## 2. Cleaning Data

In this section, I remove the data that are not useful for my objectives: **non-free apps**, **repeated apps**, **apps with missing information** and, finally, apps that are not built in the **USA**.

## 2.1 Removing row with error in Google Play Data Set

In [3]:
print(andr_data[10472])
print(len(andr_data[10472]))
print(len(andr_data[10471]))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12
13


Row 10472 has only **12** columns, whereas row 10471 has 13: the category value appears to be missing, so every later value is shifted one column to the left. This error was reported in the [Kaggle discussion of the Google Play data set](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015). To handle it, I removed row 10472 with the `del` statement.

In [4]:
del andr_data[10472]

In [5]:
print(andr_data[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
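
Rather than relying on a known row index, malformed rows can also be located by comparing each row's length with the expected number of columns. The following is a minimal sketch on toy data; `find_malformed_rows()` and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset, expected_columns):
    """Return (index, row) pairs whose length differs from `expected_columns`."""
    malformed = []
    for index, row in enumerate(dataset):
        if len(row) != expected_columns:
            malformed.append((index, row))
    return malformed

# Toy data (not taken from the real data set): the second row misses one value
sample = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]
print(find_malformed_rows(sample, expected_columns=3))
```

When deleting several rows by index, deleting from the highest index down avoids shifting the positions of rows that still have to be removed.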


## 2.2 Removing duplicated rows

Exploring the Google Play data set shows that some apps have **duplicated entries**. In the following, I track all apps that have repeated entries and print some of them.

In [6]:
unique_apps = []
duplicate_apps = []
for i in andr_data:
    name_app = i[0]
    if name_app in unique_apps:
        duplicate_apps.append(name_app)
    else:
        unique_apps.append(name_app)
print(duplicate_apps[0:4])
number_duplicates = len(duplicate_apps)
print("The number of duplicate apps:", number_duplicates)

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings']
The number of duplicate apps: 1181
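
The same count can be obtained without repeated list lookups, which become slow on large data sets. The sketch below uses `collections.Counter` on toy data; `count_duplicates()` and the sample rows are illustrative names, not part of the original notebook.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Return {app name: number of extra entries} for names seen more than once."""
    counts = Counter(row[name_index] for row in dataset)
    return {name: n - 1 for name, n in counts.items() if n > 1}

# Toy rows holding only the app name
sample = [['Box'], ['Zoom'], ['Box'], ['Box'], ['Maps']]
extra = count_duplicates(sample)
print(extra)                # extra entries per app
print(sum(extra.values()))  # total number of duplicate rows
```

The total counts every entry beyond the first one of each app, which is the same quantity that `len(duplicate_apps)` measures in the cell above.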


   
Next, the duplicates are removed. First, let's check which variables differ between the repeated entries of an app.

In [7]:
for i in andr_data:
    app_name = i[0]
    if app_name == "Quick PDF Scanner + OCR FREE":
        print(i)
        

    

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


The criterion for choosing among the repeated entries of an app is the number of reviews: the higher the number, the more recent the entry should be, and the better it is for our analysis. The entry with the most reviews is kept and the remaining entries are removed from the data set.

1. First, we create an empty dictionary `reviews_max`.
2. Then we loop over the Google Play data set and, at each iteration, assign the app name to a variable `name` and the number of reviews to a variable `n_reviews`.
3. The first `if` tests two conditions and updates the stored value when both hold:
    * whether the current app is already in the dictionary;
    * whether the value stored for it is less than the current number of reviews.
4. The second `if` checks whether the current app name is missing from the dictionary. If so, it is added with its number of reviews.

In [8]:
reviews_max = {}
for i in andr_data:
    name = i[0]
    n_reviews = float(i[3])
    # Update the stored value when this entry has more reviews
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    # First time this app is seen: store its number of reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print('The number of unique apps is equal to', len(reviews_max))



The number of unique apps is equal to 9659


The next cell creates two initially empty lists: `android_clean`, which will hold the non-repeated entries, and `already_added`, which tracks the app names already kept.
Then we loop through the Google Play data set and add a row only when two conditions hold:
1. the current number of reviews equals the maximum for that app, *AND*
2. the app name has not been added yet.

**Condition 2 is important because the Google Play data set contains repeated entries that share the same, maximum number of reviews.**

In [9]:
android_clean = []
already_added = []
for i in andr_data:
    name = i[0]
    n_reviews = float(i[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(i)
        already_added.append(name)
print('New Data set size is:',len(android_clean))
        

New Data set size is: 9659


The resulting data set size agrees with the number of entries in the `reviews_max` dictionary.
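
The two passes above can also be folded into one. The sketch below is an alternative under the same selection rule (keep the first entry with the highest number of reviews); `keep_most_reviewed()` and the sample rows are hypothetical and use the Google Play column layout only as an assumption.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows: name, category, rating, reviews
sample = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Box', 'BUSINESS', '4.2', '160'],
    ['Zoom', 'BUSINESS', '4.4', '31614'],
    ['Box', 'BUSINESS', '4.2', '160'],
]
deduplicated = keep_most_reviewed(sample)
print(len(deduplicated))
```

Storing whole rows in the dictionary removes the need for the `already_added` list. The rows come back ordered by the first appearance of each app name, which may differ from the order produced by the two-pass version.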

## 2.3 Removing apps with titles not in English Language

Since we are only interested in apps written in English, apps with non-English titles will be removed from the analysis. To do that, I'm going to make use of the way the computer stores strings: it associates a number with each character. We can obtain this number with the built-in function `ord()`. The characters commonly used in English text belong to the ASCII set, whose numbers range from 0 to 127. We can therefore check whether any character in the title exceeds this range and, if so, exclude the corresponding app from the data set.

The following function flags a title as non-English when it contains any such character.

In [10]:
def word_class(string):
    for letter in string:
        if ord(letter) > 127:
            return False
    return True  # no character outside the ASCII range was found

word_class('Docs To Go™ Free Office Suite') #testing function

        

False

The previous function can misclassify English titles that contain special characters such as ™ or emoji. To reduce the possibility of misclassification, a new function with a more tolerant check is defined: it returns False only if **FOUR** or more characters have a number greater than 127.

In [11]:
def word_class2(string):
    non_ascii_count = 0
    for letter in string:
        letter_indx = ord(letter)
        if letter_indx > 127:
            non_ascii_count += 1
    if non_ascii_count > 3:
        return False
    return True  # three or fewer non-ASCII characters


word_class2('Instachat 😜') #testing function
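
To show how such a check can be used as a filter, here is a minimal sketch on a handful of titles. `is_english()` is a hypothetical compact variant of the threshold rule described above, and the third title is a made-up non-English example.

```python
def is_english(title, max_non_ascii=3):
    """Return True if `title` has at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for char in title if ord(char) > 127)
    return non_ascii <= max_non_ascii

# Two titles from this notebook plus one made-up non-English title
titles = ['Instachat 😜', 'Docs To Go™ Free Office Suite', '中文应用商店']
english_titles = [t for t in titles if is_english(t)]
print(english_titles)
```

Applied to the notebook's data, the filter would look like `[row for row in android_clean if is_english(row[0])]`. Note that the cells shown here pass `android_clean` directly to the next step, so adding this filter would change the row counts reported below.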

## 2.4 Removing non-free apps

In [12]:
android_finalclean = []
for i in android_clean:
    label = i[6]
    if label == 'Free':
        android_finalclean.append(i)

print('The final length is',len(android_finalclean))
        
    


The final length is 8904
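
The same filtering idea could be extended to the App Store data. The sketch below is an assumption-laden illustration: `keep_free()` is a hypothetical helper, the rows are toy data, and the App Store price is assumed to sit at index 4 as a numeric string (as the sample rows printed in section 1.2 suggest).

```python
def keep_free(dataset, is_free):
    """Return the rows for which the predicate `is_free(row)` is true."""
    return [row for row in dataset if is_free(row)]

# Toy Google Play rows: index 6 holds 'Free' for free apps
andr_sample = [
    ['App A', 'TOOLS', '4.2', '10', '4.1M', '1,000+', 'Free', '0'],
    ['App B', 'GAME', '4.5', '20', '8.7M', '1,000+', 'Paid', '$0.99'],
]
# Toy App Store rows: price assumed at index 4, stored as a string
ios_sample = [
    ['1', 'App C', '100', 'USD', '0.0'],
    ['2', 'App D', '100', 'USD', '1.99'],
]
free_andr = keep_free(andr_sample, lambda row: row[6] == 'Free')
free_ios = keep_free(ios_sample, lambda row: float(row[4]) == 0.0)
print(len(free_andr), len(free_ios))
```

Passing the test as a function keeps one helper usable for both stores, whose columns mark free apps in different ways.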


As stated in the introduction, the objective of the present analysis is to find the characteristics that make free apps profitable in both the iOS and Android markets. A strong indication of profitability is the number of users, so it is used here as the measure of success.

## 3. Data Set Analysis

## 3.1 Finding out proportions of apps by genre

In [13]:
def frequence_table(dataset, index):
    # Count how many rows hold each value of the column at `index`
    dictionary = {}
    for row in dataset:
        genre = row[index]
        if genre in dictionary:
            dictionary[genre] += 1
        else:
            dictionary[genre] = 1
    return dictionary


ANDROID_frequency = frequence_table(android_clean, 1)

print('This is the unsorted frequency table (app counts)')
print('\n')
print(ANDROID_frequency)




This is the unsorted frequency table (app counts)


{'GAME': 946, 'FINANCE': 345, 'SOCIAL': 239, 'TOOLS': 829, 'ENTERTAINMENT': 87, 'WEATHER': 79, 'FOOD_AND_DRINK': 112, 'COMMUNICATION': 315, 'DATING': 170, 'NEWS_AND_MAGAZINES': 254, 'COMICS': 56, 'MAPS_AND_NAVIGATION': 131, 'PERSONALIZATION': 376, 'HEALTH_AND_FITNESS': 288, 'BUSINESS': 420, 'EVENTS': 64, 'PRODUCTIVITY': 374, 'VIDEO_PLAYERS': 164, 'TRAVEL_AND_LOCAL': 219, 'PHOTOGRAPHY': 281, 'ART_AND_DESIGN': 61, 'MEDICAL': 395, 'FAMILY': 1874, 'PARENTING': 60, 'LIBRARIES_AND_DEMO': 84, 'LIFESTYLE': 369, 'BEAUTY': 53, 'HOUSE_AND_HOME': 73, 'SPORTS': 325, 'BOOKS_AND_REFERENCE': 222, 'EDUCATION': 107, 'AUTO_AND_VEHICLES': 85, 'SHOPPING': 202}
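
The table above holds absolute counts. A relative version, with each count expressed as a percentage of all rows, could be sketched as follows; `percentage_table()` and the sample rows are illustrative and not part of the original notebook.

```python
def percentage_table(dataset, index):
    """Return {value: percentage of rows}, sorted from most to least frequent."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)
    percentages = {value: count / total * 100 for value, count in counts.items()}
    # Sort by percentage, highest first
    return dict(sorted(percentages.items(), key=lambda item: item[1], reverse=True))

# Toy rows: name, category
sample = [['a', 'GAME'], ['b', 'TOOLS'], ['c', 'GAME'], ['d', 'FAMILY']]
table = percentage_table(sample, 1)
for category, share in table.items():
    print(category, ':', round(share, 2))
```

Dividing once after counting, rather than inside the loop, keeps the percentages consistent with the final counts.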


In [14]:
def display_table(dataset, index):
    table = frequence_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print('Category frequency (app counts)')
display_table(android_clean, 1)
print('\n')
print('Genre frequency (app counts)')
display_table(android_clean, 9)


Category frequency (app counts)
FAMILY : 1874
GAME : 946
TOOLS : 829
BUSINESS : 420
MEDICAL : 395
PERSONALIZATION : 376
PRODUCTIVITY : 374
LIFESTYLE : 369
FINANCE : 345
SPORTS : 325
COMMUNICATION : 315
HEALTH_AND_FITNESS : 288
PHOTOGRAPHY : 281
NEWS_AND_MAGAZINES : 254
SOCIAL : 239
BOOKS_AND_REFERENCE : 222
TRAVEL_AND_LOCAL : 219
SHOPPING : 202
DATING : 170
VIDEO_PLAYERS : 164
MAPS_AND_NAVIGATION : 131
FOOD_AND_DRINK : 112
EDUCATION : 107
ENTERTAINMENT : 87
AUTO_AND_VEHICLES : 85
LIBRARIES_AND_DEMO : 84
WEATHER : 79
HOUSE_AND_HOME : 73
EVENTS : 64
ART_AND_DESIGN : 61
PARENTING : 60
COMICS : 56
BEAUTY : 53


Genre frequency (app counts)
Tools : 828
Entertainment : 561
Education : 510
Business : 420
Medical : 395
Personalization : 376
Productivity : 374
Lifestyle : 368
Finance : 345
Sports : 331
Communication : 315
Action : 299
Health & Fitness : 288
Photography : 281
News & Magazines : 254
Social : 239
Books & Reference : 222
Travel & Local : 218
Shopping : 202
Simulation : 193
Arcade : 184
Dat