# Most Popular Apps for the App Store and Google Play Market

The goal of this project is to find the most popular and profitable free apps on the App Store and Google Play Market.
Since these apps are free to download and install, their revenue comes from in-app ads, so a higher number of users means higher revenue. This project aims to help developers understand which types of Android and iOS mobile apps attract more users and are therefore more profitable.

## Opening the Collected Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money. Luckily, I found two data sets that seem suitable for the purpose:
- [GooglePlayData](https://www.kaggle.com/lava18/google-play-store-apps): This data set contains information about approximately 10,000 Android apps from Google Play. The data was collected in August 2018. To download this data, click [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- [AppleStoreData](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps): This data set contains information about more than 7,000 iOS apps from the App Store. It was collected in July 2017. You can click [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv) to download the data set.


### Let's open the data sets

In [2]:
from csv import reader

# open the Google Play Store (Android) apps dataset
open_file = open('googleplaystore.csv')
read_file = reader(open_file)
androids_data = list(read_file)
a_header = androids_data[0]
android = androids_data[1:]

# open the App Store (iOS) apps dataset
open_file = open('AppleStore.csv')
read_file = reader(open_file)
ios_data = list(read_file)
ios_header = ios_data[0]
ios = ios_data[1:]
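
The cell above leaves both files open after reading them. As a minimal sketch (the helper name `load_csv` and the `utf8` encoding are assumptions, not part of the original notebook), the same loading step could be wrapped in a function that closes the file automatically:

```python
import csv

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The `with` block closes the file even if reading fails.
    # newline="" is the setting the csv module documentation recommends.
    with open(path, newline="", encoding=encoding) as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]
```

With this helper, `a_header, android = load_csv('googleplaystore.csv')` would produce the same two variables as the cell above.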

Now let's explore the data. We are going to use a function called explore_data(), which prints rows in a more readable way. We will use this function repeatedly. It also has an option to show the number of rows and columns of any data set.

In [3]:
# explore function to make our data readable or easier to understand 

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### Now let's explore the Android data set:

In [4]:
print(a_header) #printing the first row of the dataset, which is the header. It has the names of columns
print("\n")    # adds a line between rows 
explore_data(android, 1, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver', '']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design', 'Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up', '']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up', '']


Number of rows: 10841
Number of columns: 14


We can see that the Android data set has 10841 apps and 14 columns. 

The columns that can help us for the purpose of our analysis are: 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

### Now let's explore the iOS data set

In [5]:
print (ios_header)
print ("\n")
explore_data(ios, 1, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic', '']


['389801252', 'Instagram', '113954816', 'USD', '0', '2161558', '1289', '4.5', '4', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1', '']


['529479190', 'Clash of Clans', '116476928', 'USD', '0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1', '']


['420009108', 'Temple Run', '65921024', 'USD', '0', '1724546', '3842', '4.5', '4', '1.6.2', '9+', 'Games', '40', '5', '1', '1', '']


Number of rows: 7197
Number of columns: 17


The iOS data set has 7197 apps and 17 columns.

The columns that can help us for the purpose of our analysis are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.
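
Since the later cells refer to columns by position (for example index 4 for the price), a small lookup from column name to index can make those positions easier to check. This is only a sketch: `column_indexes` is a hypothetical helper, and the sample header is a shortened copy of the first five iOS column names printed above.

```python
def column_indexes(header):
    """Map each column name to its position in a row (hypothetical helper)."""
    return {name: i for i, name in enumerate(header)}

# First five column names of the iOS header shown above
ios_header_sample = ['id', 'track_name', 'size_bytes', 'currency', 'price']
cols = column_indexes(ios_header_sample)
print(cols['track_name'])  # 1
print(cols['price'])       # 4
```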

## Data Cleaning 

For this project we need to do the following to clean our data:
  - Find inaccurate data, and correct or remove it.
  - Find duplicate data, and remove the duplicates.
  - Remove non-English apps.
  - Isolate the free apps.

### Deleting Inaccurate Data

The Google Play data has a [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion). According to [this](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) section of the discussion, there is an error in row 10472. To check this error, we will print this row and compare it to the header and to an accurate row.

In [6]:
print (android[10472])   
print ('\n')
print (a_header)
print ('\n')
print (android[1]) # an accurate row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up', '', '']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver', '']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design', 'Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


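Row 10472 belongs to an app called 'Life Made WI-Fi Touchscreen Photo Frame'. Compared with the header, its Category value is missing, so every later value sits one column to the left: the Rating column holds '19' (ratings cannot exceed 5) and the Reviews column holds '3.0M' instead of an integer. Therefore, we are going to delete this row.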
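
Instead of relying only on the discussion page, a shifted row like this one could also be found programmatically. The sketch below is an assumption about how such a check might look (the function name `rows_with_bad_rating` and the shortened sample rows are illustrative): it flags rows whose Rating value parses as a number above 5.

```python
def rows_with_bad_rating(rows, rating_index=2, max_rating=5.0):
    """Return the indexes of rows whose rating is a number above max_rating."""
    bad = []
    for i, row in enumerate(rows):
        try:
            rating = float(row[rating_index])
        except (ValueError, IndexError):
            continue  # unparseable or missing ratings are ignored in this sketch
        # 'NaN' parses to nan, which never compares greater, so it is not flagged
        if rating > max_rating:
            bad.append(i)
    return bad

# Shortened sample rows: App, Category, Rating
sample = [
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19'],  # Category missing
    ['Illustrative app', 'TOOLS', 'NaN'],
]
print(rows_with_bad_rating(sample))  # [1]
```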

In [7]:
print (len(android))
del (android[10472]) # Don't run this cell more than one time, because more rows will be deleted
print (len(android)) 

10841
10840


Now we have 10840 apps for Google Play, because we have deleted one app.
According to the App Store data [discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion), there is no inaccurate data for the iOS apps.
We can move to the second step of the data cleaning, which is removing duplicate data.

### Find Duplicate Data

By exploring the Google Play data set and looking carefully at the names of the apps, you will notice that some apps have duplicate entries. For example, Facebook has two entries:

In [8]:
for app in android:
    name = app[0]
    if name == 'Facebook':
        print(app)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device', '']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device', '']


If you examine the two rows printed for the Facebook app, you will notice that the main difference is in the fourth position of each row, which holds the number of reviews. A different number of reviews means that the data was collected at different times.
We should keep the entry with the highest number of reviews, since the highest number of reviews indicates the most recent data.

Before we delete any apps, we need to write some code that gives us information about the duplicate data. We are dealing with a large number of apps, so it is not easy to find all the duplicates by looking at the full data set. The following code prints the number of duplicate apps and the names of some of them as examples.

In [9]:
duplicate_apps = []
unique_apps = []
for apps in android:
    name = apps[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print ('There are '+str(len(duplicate_apps))+' duplicate apps.')
print('\n')
print ('Examples of duplicated apps: ',duplicate_apps[:20])
print ('\n')
print ('The expected number of apps after deleting the duplicates: ' + str(len(android)-len(duplicate_apps)))

There are 1181 duplicate apps.


Examples of duplicated apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


The expected number of apps after deleting the duplicates: 9659


There are 1181 duplicate apps in the Google Play data set.
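
The loop above tests `name in unique_apps` against a list, which gets slower as the list grows. As an alternative sketch (the helper name `duplicate_names` and the sample rows are illustrative, not from the notebook), `collections.Counter` from the standard library can count every name in one pass:

```python
from collections import Counter

def duplicate_names(rows, name_index=0):
    """Return {name: occurrences} for every name that appears more than once."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}

sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
dupes = duplicate_names(sample)
print(dupes)                               # {'Box': 3}
# Number of extra rows, which is what the loop above counts as "duplicate apps"
print(sum(n - 1 for n in dupes.values()))  # 2
```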

Let's check the App Store data set to see if there are any duplicates.

In [10]:
duplicate_apps = []
unique_apps = []
for apps in ios:
    name = apps[0]  # index 0 is the 'id' column; the app name ('track_name') is at index 1
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print ('There are '+str(len(duplicate_apps))+' duplicate apps.')

There are 0 duplicate apps.


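No value repeats in the column checked above, so we treat the App Store data set as free of duplicates. Note that the check compares index 0, which is the id column, rather than the app name at index 1.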


### Removing Duplicate Data

Now it's time to delete the duplicated data from the Google Play data set.

To delete the duplicated apps, we should do the following:
- Create a dictionary, where each dictionary key is a unique app name and its value 
  will be the highest number of reviews for that app.
- Use the information stored in the dictionary and create a new data set, which will 
  have only one entry per app. For each app, we will only select the entry with the 
  highest number of reviews.

In [11]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659



The next step is to use the reviews_max dictionary to remove the duplicate apps. For each duplicated app, we keep only the entry with the highest number of reviews.

To do so, we will:
- Start by creating two empty lists: 
  - android_clean, which will store our new cleaned data set.
  - already_added, which will just store app names.
- Loop through the Google Play data set, and for each iteration:
  - Assign the app name to a variable named name.
  - Assign the number of reviews to a variable named n_reviews.
  - If n_reviews is the same as the maximum number of reviews for that app (stored in the reviews_max dictionary) and name is not already in the already_added list:
    - Append the entire row to the android_clean list.
    - Append the app name to the already_added list. This keeps track of the apps we have already added, so that an app whose duplicate entries share the same maximum number of reviews is added only once.

In [12]:
android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
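
The two-step approach above (build reviews_max, then filter) can also be collapsed into a single pass. The following is a sketch of that alternative, not the method used in this notebook; `keep_most_reviewed` is a hypothetical name, and the second sample app is made up for illustration. One difference to be aware of: rows come out in the order each app name first appears, which may differ from the order in android_clean.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts are tied
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Facebook', 'SOCIAL', '4.1', '78128208'],
    ['Illustrative app', 'TOOLS', '4.0', '10'],   # made-up row
    ['Facebook', 'SOCIAL', '4.1', '78158306'],
]
for row in keep_most_reviewed(sample):
    print(row)
# ['Facebook', 'SOCIAL', '4.1', '78158306']
# ['Illustrative app', 'TOOLS', '4.0', '10']
```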

Now let's use the explore_data() function to make sure that everything went well.

In [13]:
explore_data(android_clean, 4, 7, True)

['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1', '2.3 and up', '']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up', '']


['Infinite Painter', 'ART_AND_DESIGN', '4.1', '36815', '29M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'June 14, 2018', '6.1.61.1', '4.2 and up', '']


Number of rows: 9659
Number of columns: 14


After deleting the duplicated apps, we have 9659 apps. 

The next step of this project is removing apps that are not for English speakers.

### Deleting Non-English Apps

When we explore the data sets, we can see that some apps are not aimed at English speakers. We do not want these apps, because we are interested only in English apps.

Both the Android and iOS data sets contain non-English apps. Let's look at some examples:

In [14]:
print(android_clean[4412][0])
print(android_clean[7940][0]) # Example of non-English android apps
print('\n')
print(ios[813][1]) #example of non-English IOS apps
print(ios[6731][1])


中国語 AQリスニング
لعبة تقدر تربح DZ


爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).
Each character we use in a string has a corresponding number associated with it. 
We can get the corresponding number of each character using the [built-in function](https://docs.python.org/3/library/functions.html#ord) ord().

In [15]:
print(ord('g'))
print(ord('='))
print(ord('t'))
print(ord('!'))

103
61
116
33


According to the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system, the numbers corresponding to the characters we commonly use in English text are all in the range 0 to 127.
We can build a function that detects whether a character belongs to the set of common English characters. If the number is less than or equal to 127, the character belongs to that set. If an app name contains a character whose number is greater than 127, the app probably has a non-English name.
Let's detect app names that have non-English characters, so we can remove these apps later.

In [16]:
def English(String):
    for char in String:
        if ord(char) > 127:
            return False
    return True  # reached only if no character is above 127
print(English('Facebook')) #should return True
print(English('لعبة تقدر تربح DZ')) #should return False
print(English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(English('Docs To Go™ Free Office Suite'))
print(English('Instachat 😜'))

True
False
False
False
False


The previous function works, but there is a problem: it cannot correctly identify certain English app names like 'Docs To Go™ Free Office Suite' and 'Instachat 😜', because emojis and characters like '™' fall outside the ASCII range and have corresponding numbers greater than 127.

In [17]:
print(ord('😜'))
print(ord('™'))

128540
8482


If we're going to use the function we've created, we'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

Let's edit the function we created above, and then use it to filter out the non-English apps.

In [18]:
def English(String):
    not_ASCII = 0
    for char in String:
        if ord(char) > 127:
            not_ASCII += 1
    if not_ASCII > 3:
        return False
    else:
        return True
    
print(English('Docs To Go™ Free Office Suite'))
print(English('Instachat 😜'))
print(English('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
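
The same threshold rule can be written more compactly and with the cutoff exposed as a parameter. This is only a sketch: the name `is_probably_english` and the `max_non_ascii` parameter are assumptions for illustration, and the default of 3 mirrors the rule chosen above.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters above 127."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))  # True
print(is_probably_english('Instachat 😜'))                    # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))    # False
```

Exposing the threshold makes it easy to see how strict the filter is: with `max_non_ascii=0` the function behaves like a plain ASCII-only check.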


Now we are going to use the new English() function to filter out non-English apps from both data sets.

We are going to loop through each data set. If an app name is identified as English, we append the whole row to a separate list.

Remember, we are going to loop through the android_clean data set, not the android data set, because android_clean does not contain duplicated apps. For iOS we use the original ios data set, because we did not find any duplicate data in it.

In [19]:
android_English = []
ios_English = []

for app in android_clean:
    name = app[0]
    if English(name):
        android_English.append(app)

for app in ios:
    name = app[0]  # index 0 is the 'id' column; the app name ('track_name') is at index 1
    if English(name):
        ios_English.append(app)
        
explore_data(android_English, 1, 4, True)
print('\n')
explore_data(ios_English, 1, 4, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up', '']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up', '']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design', 'Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9614
Number of columns: 14


['389801252', 'Instagram', '113954816', 'USD', '0', '2161558', '1289', '4.5', '4', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1', '']


['529479190', 'Clash of Clans', '116476928', 'USD', '0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1', '']


['420009108', 'Temple Run', '65921024', 'USD', '0', '1724546', '3842', '4.5', '4', '1.6.2', '9+', 'Games', '40', '5', '1', '1', '']


Number of rows: 7197
Number of columns: 17

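Now, after removing duplicated and non-English apps, we have 9614 Android apps and 7197 iOS apps left. The iOS count is unchanged because the loop above passes the value at index 0, the numeric id, to English(), so every iOS row passes the check; testing track_name at index 1 instead would remove the non-English iOS apps.

As we mentioned in the introduction, we are only interested in apps that are free to download and install, since our main source of revenue is in-app ads. The last step of the [data cleaning](https://www.sisense.com/glossary/data-cleaning/) process is isolating the free apps.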


### Isolating the Free Apps

Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis. After we isolate the free apps, we will start analyzing the data.

In [20]:
android_cleaned = []
ios_cleaned = []

for app in android_English:
    price = app[7]
    if price == '0':
        android_cleaned.append(app)
        
for app in ios_English:
    price = app[4]
    if price == '0':
        ios_cleaned.append(app)
        
        
print(len(android_cleaned))
print('\n')
print(len(ios_cleaned))

8864


4056


We are done with the data cleaning process. The remaining apps are 8864 Android apps and 4056 iOS apps. Now we can start analyzing the data.
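
The free-app check above compares the price text with the exact string '0'. If the same logic were reused on a copy of the data where prices are written differently (for example '0.0' or '$4.99', which is an assumption about other versions of these files), a numeric comparison would be more tolerant. A minimal sketch, with `is_free` as a hypothetical helper:

```python
def is_free(price_text):
    """Return True if the price text represents zero (hypothetical helper)."""
    cleaned = price_text.strip().lstrip('$')  # drop spaces and a leading '$'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # text that is not a number is treated as not free

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```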

## Most Popular Apps by Genre

Our revenue is based on the number of people using our apps. For that reason, we want to determine what kinds of apps attract more users.

We'll need to build a frequency table for the prime_genre column of the App Store data set, and for the Genres and Category columns of the Google Play data set.

We'll build two functions we can use to analyze the frequency tables:
- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in a descending order


In [21]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    t_percentages = {}
    for key in table:
        percentage = (table[key]/total)*100
        t_percentages[key] = percentage
    return t_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
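
For comparison, the two functions above can be combined using `collections.Counter` from the standard library. This is a sketch of an alternative, not the notebook's method: `freq_table_sorted` is a hypothetical name and the sample rows are made up. Ties may be ordered differently than in display_table(), because `most_common()` keeps tied values in first-seen order.

```python
from collections import Counter

def freq_table_sorted(rows, index):
    """Return [(value, percentage), ...] sorted from most to least frequent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up rows: app name, genre
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Music']]
for genre, pct in freq_table_sorted(sample, 1):
    print(genre, ':', pct)
# Games : 75.0
# Music : 25.0
```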

Now we are going to analyze the frequency table that we have generated for the 'prime_genre' column of the App Store dataset. 

In [22]:
display_table(ios_cleaned, 11) # index 11 (the same column as -6) is prime_genre

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


From the output, we can see that about 55.65% of the apps are Games, followed by Entertainment (8.23%), Photo & Video (4.12%), and Social Networking (3.53%).

This is surprising! The data tells us that most of the free English apps in the App Store were designed for fun: games, entertainment, photo and video. Only 3.25% of the apps were designed for education, and each practical genre (for example Utilities, Productivity, or Finance) accounts for less than 3% of the apps.


The Google Play dataset has two columns which seem to be related. These columns are named Genres and Category.

We are going to analyze the Category column first.

In [23]:
display_table(android_cleaned, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Now let's analyze the Genres column.

In [24]:
display_table(android_cleaned, 9)

Tools : 8.461191335740072
Entertainment : 6.419223826714801
Education : 5.88898916967509
Business : 4.591606498194946
Lifestyle : 3.91471119133574
Productivity : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.48601083032491
Personalization : 3.3167870036101084
Communication : 3.2490974729241873
Action : 3.203971119133574
Health & Fitness : 3.1024368231046933
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Casual : 2.36913357400722
Travel & Local : 2.33528880866426
Shopping : 2.2450361010830324
Simulation : 2.154783393501805
Books & Reference : 2.154783393501805
Arcade : 1.9855595667870036
Dating : 1.861462093862816
Video Players & Editors : 1.8050541516245486
Maps & Navigation : 1.3989169675090252
Puzzle : 1.3650722021660648
Food & Drink : 1.2409747292418771
Racing : 1.1732851985559567
Role Playing : 1.026624548736462
Educational : 0.9927797833935018
Strategy : 0.947653429602888
Libraries & Demo : 

## Conclusion

The difference between the Category and Genres columns is not clear, but the Genres column appears to be more granular: it lists labels such as Action, Casual, and Puzzle, where Category has the single label GAME.

From the output, we can see that a larger share of the free English Google Play apps were designed for practical purposes (tools, business, productivity, finance) than in the App Store. Google Play shows more of a balance between apps designed for fun and apps designed for practical purposes.