# Profitable App Profiles for the App Store and Google Play Markets 

The goal of this project is to analyze data to understand what kinds of apps are likely to attract more users on the iOS and Android platforms.

This [data set](https://www.kaggle.com/lava18/google-play-store-apps) contains data about approximately ten thousand Android apps from Google Play.

This [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) contains data about approximately seven thousand iOS apps from the App Store.

Opening the Google Play data set

In [1]:
from csv import reader

# Read the whole file into a list of rows; row 0 is the header
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()

Opening the App Store data set

In [2]:
from csv import reader

# Read the whole file into a list of rows; row 0 is the header
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
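
Both cells repeat the same open-and-read steps, so they could be wrapped in one helper. The sketch below is a possible refactoring, not the notebook's original code: `load_csv` is a hypothetical name, and unlike the cells above it returns the header separately from the data rows. The `with` block closes the file automatically once the rows are read.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file. Hypothetical helper, not part of the notebook."""
    # newline='' is the setting the csv module documentation recommends for reader()
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Assumed usage with the file names used in the cells above:
# android_header, android_rows = load_csv('googleplaystore.csv')
# ios_header, ios_rows = load_csv('AppleStore.csv')
```

Keeping the header out of the data rows would remove the need for slices such as `android[1:]`, at the cost of shifting every row index by one relative to the rest of this notebook.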


In [3]:
def explore_data(dataset, start, end, rows_column=False):
    # Print the rows in dataset[start:end], each followed by a blank line
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row, "\n")

    # Optionally report the size of the data set
    if rows_column:
        print("Number of rows :", len(dataset))
        print("Number of column :", len(dataset[0]))

print("Preview of the GooglePlay Dataset\n")
explore_data(android, 0, 2)  # explore_data only prints and returns None, so nothing is assigned

print("\nPreview of the AppleStore Dataset\n")
explore_data(ios, 0, 2)



Preview of the GooglePlay Dataset

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 


Preview of the AppleStore Dataset

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 



# Data Cleanup Process
### Part 1 : Deleting the wrong data
The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101), and one of the discussions outlines an error in row `10473`. Let's print this row and compare it against the header.

In [4]:
print(android[0],"\n\n",android[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


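Row `10473` corresponds to the app `Life Made WI-Fi Touchscreen Photo Frame`. It has only 12 values against the 13 columns of the header: the `Category` value is missing, so every later value is shifted one column to the left and the `Rating` column ends up holding `19`. This is clearly off because the maximum rating for a Google Play app is 5. As a result, we'll delete this row.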
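
A row like this can also be found programmatically: the printed row has 12 values while the header has 13, so comparing each row's length with the header's length flags it. The following is a minimal sketch on made-up data; `find_malformed_rows` is a hypothetical helper, not something the notebook defines.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's (hypothetical helper)."""
    expected = len(dataset[0])  # dataset[0] is the header row
    return [(i, row) for i, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]

# Made-up miniature data set: the third data row is missing its Category value
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Other App', 'GAME', '3.9'],
    ['Shifted App', '1.9'],
]
print(find_malformed_rows(sample))  # [(3, ['Shifted App', '1.9'])]
```

The returned indexes refer to positions in the full list, header included, so they can be passed directly to `del dataset[index]`.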

In [5]:
print("Length of the original GooglePlay Dataset : ",len(android))
del android[10473]
print("Length of the updated GooglePlay Dataset :" , len(android))

Length of the original GooglePlay Dataset :  10842
Length of the updated GooglePlay Dataset : 10841


### Part 2 : Removing duplicate entries
If we explore the Google Play data, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:

In [6]:
for x in android[1:]:
    name = x[0]
    if name == 'Instagram':
        print(x)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [7]:
def duplicate_values(dataset):
    # Column 0 holds the app name in the Google Play data and the app id in the App Store data
    duplicate_apps = []
    unique_apps = set()  # a set keeps the membership test fast
    for x in dataset[1:]:
        name = x[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.add(name)
    return duplicate_apps

print("The number of duplicate row data in GooglePlayStore : " , len(duplicate_values(android)),"\n")
print("The number of duplicate row data in ApplePlayStore : " , len(duplicate_values(ios)),"\n")


The number of duplicate row data in GooglePlayStore :  1181 

The number of duplicate row data in ApplePlayStore :  0 



Instead of removing duplicates at random, we'll keep, for each app, the row that has the highest number of reviews, because <i><u>the higher the number of reviews, the more reliable the ratings</u></i>.

In [8]:
reviews_max = {} 

for x in android[1:]:
    name = x[0]
    reviews = float(x[3])
    
    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name] = reviews
        
    elif name not in reviews_max:
        reviews_max[name] = reviews

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [9]:
print('Expected length:', len(android[1:]) - len(duplicate_values(android)))
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


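We start by initializing two empty lists, `android_clean` and `already_added`.

We then loop through the data set and, for each app, append the row whose review count matches the maximum stored in the dictionary to <b>android_clean</b>, and the app's name to <b>already_added</b>. The `already_added` check ensures that an app is kept only once, even when several of its rows share the same maximum number of reviews.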

In [10]:
android_clean = []
already_added = []

for x in android[1:]:
    name = x[0]
    review = float(x[3])
    # Keep the first row whose review count equals the maximum for that app
    if reviews_max[name] == review and name not in already_added:
        android_clean.append(x)
        already_added.append(name)

print("Number of records :", len(android_clean))  # Cross-check against the expected length of 9659

Number of records : 9659
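
The two passes above (first building `reviews_max`, then filtering) could also be folded into a single pass by storing the best row per app directly in a dictionary. This is a sketch of an alternative, not the approach the notebook uses; `dedupe_by_max_reviews` and its parameters are hypothetical names, and the sample rows are shortened for illustration.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Replace the stored row only on a strictly higher count, so ties keep the first row seen
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Shortened, illustrative rows (header already removed)
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Other App', 'TOOLS', '4.0', '10'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]
for row in dedupe_by_max_reviews(sample):
    print(row)
```

One difference to be aware of: the result is ordered by the first appearance of each app name, whereas the two-pass version orders apps by the position of the row that was kept. The set of rows is the same.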


In [11]:
explore_data(android_clean, 0, 3, True) # Inspect the first rows and confirm the row and column counts

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

Number of rows : 9659
Number of column : 13


### Part 3 : Removing Non-English Apps

Because the business problem targets an English-speaking audience, we remove any apps whose names suggest they are <i><u>not in English</u></i>.

In the ASCII system, the numbers corresponding to the characters commonly used in English text are all in the range 0 to 127. Since some English apps have emoji or other special symbols in their names, we allow up to **3** characters outside this range and remove any app whose name has more.

In [12]:
def is_eng(my_string):
    # Count the characters that fall outside the ASCII range (0 to 127)
    non_ascii = 0
    for x in my_string:
        if ord(x) > 127:
            non_ascii += 1
        if non_ascii > 3:
            return False

    return True

print(is_eng('Instachat 😜'))

True


In [13]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_eng(name):
        android_english.append(app)
        
for app in ios[1:]:
    name = app[1]
    if is_eng(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

Number of rows : 9614
Number of column : 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] 

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+',

### Part 4: Isolating the Free Apps

In [14]:
# Isolating the iOS apps
ios_free = []

for x in ios_english:
    if x[4] == '0.0':
        ios_free.append(x)
        
# Isolating the android apps   
android_free = []
for x in android_english:
    if x[7] == '0':
        android_free.append(x)
        
print("Number of iOS apps for analysis :",len(ios_free))
print("Number of android apps for analysis :",len(android_free))
    

Number of iOS apps for analysis : 3222
Number of android apps for analysis : 8864
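
The two comparisons above depend on the exact strings `'0.0'` and `'0'`. A more tolerant check would convert the price to a number first. The sketch below is an alternative under stated assumptions, not the notebook's method: `parse_price` and `is_free` are hypothetical names, and it assumes that paid prices may carry a leading `$`, which should be verified against the actual data before relying on it.

```python
def parse_price(price):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float (assumed formats)."""
    return float(price.strip().lstrip('$'))

def is_free(price):
    """Return True when the parsed price is zero."""
    return parse_price(price) == 0.0

for raw in ['0', '0.0', '$4.99', '2.99']:
    print(repr(raw), '->', is_free(raw))
```

With such a helper, the same condition could be used for both data sets, for example `is_free(x[4])` for the App Store rows and `is_free(x[7])` for the Google Play rows.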


## Data Analysis

### Part 1 : Data Validation Strategy 

Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both the App Store and Google Play, we need to find app profiles that are successful in both markets. For instance, a profile that might work well in both markets is a productivity app that makes use of gamification.

For this, we'll build a frequency table for the `prime_genre` column of the App Store data set, and for the `Genres` and `Category` columns of the Google Play data set.
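
A frequency table simply counts how often each value occurs in a column. As a minimal illustration on made-up rows, the sketch below uses `collections.Counter` from the standard library and also converts the counts to percentages, which makes tables built from data sets of different sizes easier to compare. `freq_percentages` is a hypothetical name, not a function from this notebook.

```python
from collections import Counter

def freq_percentages(rows, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]

# Made-up rows: app name and category
sample = [
    ['App A', 'GAME'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'GAME'],
]
for value, pct in freq_percentages(sample, 1):
    print(value, ':', pct)
# GAME : 75.0
# TOOLS : 25.0
```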

In [64]:
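def freq_table(dataset, index):
    # Count how many times each value occurs in the column at position `index`
    table = {}
    for row in dataset:
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1

    return table

# prime_genre is the fifth column from the end of the App Store rows
ios_genre_ft = freq_table(ios_free, -5)
print(ios_genre_ft, "\n")

# Category is column 1 of the Google Play rows
g_ft = freq_table(android_free, 1)
print(g_ft)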



{'Social Networking': 106, 'Photo & Video': 160, 'Games': 1874, 'Music': 66, 'Reference': 18, 'Health & Fitness': 65, 'Weather': 28, 'Utilities': 81, 'Travel': 40, 'Shopping': 84, 'News': 43, 'Navigation': 6, 'Lifestyle': 51, 'Entertainment': 254, 'Food & Drink': 26, 'Sports': 69, 'Book': 14, 'Finance': 36, 'Education': 118, 'Productivity': 56, 'Business': 17, 'Catalogs': 4, 'Medical': 6} 

{'ART_AND_DESIGN': 57, 'AUTO_AND_VEHICLES': 82, 'BEAUTY': 53, 'BOOKS_AND_REFERENCE': 190, 'BUSINESS': 407, 'COMICS': 55, 'COMMUNICATION': 287, 'DATING': 165, 'EDUCATION': 103, 'ENTERTAINMENT': 85, 'EVENTS': 63, 'FINANCE': 328, 'FOOD_AND_DRINK': 110, 'HEALTH_AND_FITNESS': 273, 'HOUSE_AND_HOME': 73, 'LIBRARIES_AND_DEMO': 83, 'LIFESTYLE': 346, 'GAME': 862, 'FAMILY': 1676, 'MEDICAL': 313, 'SOCIAL': 236, 'SHOPPING': 199, 'PHOTOGRAPHY': 261, 'SPORTS': 301, 'TRAVEL_AND_LOCAL': 207, 'TOOLS': 750, 'PERSONALIZATION': 294, 'PRODUCTIVITY': 345, 'PARENTING': 58, 'WEATHER': 71, 'VIDEO_PLAYERS': 159, 'NEWS_AND_MAG