# Profitable App Profiles for the App Store and Google Play Markets

## What is this project?
This project analyzes data from the App Store and the Google Play Store to identify the most profitable types of mobile apps, in order to support data-driven decisions about which features and products should be implemented.

## What is its goal?
Develop essential skills for data analysis in Python.

## Next steps
- Import pandas and matplotlib
- Data visualization
- More conclusions

## Resources
- Dataquest.io: https://app.dataquest.io/m/350/guided-project%3A-profitable-app-profiles-for-the-app-store-and-google-play-markets
- App Store data set: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home
- Google Play Store data set: https://www.kaggle.com/lava18/google-play-store-apps/home



In [6]:
from csv import reader

# Google Play Store data set
# `with` closes the file once it has been read
with open('googleplaystore.csv', encoding='utf8', newline='') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

# App Store data set
with open('AppleStore.csv', encoding='utf8', newline='') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

In [7]:
def explore_data(dataset, start, end, apps_and_columns=False):
    dataset_slice = dataset[start:end]    
    for app in dataset_slice:
        print(app)
        print('\n') # adds a new (empty) line between apps
        
    if apps_and_columns:
        print('Number of apps:', len(dataset))
        print('Number of columns:', len(dataset[0]))

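The `explore_data` function prints a slice of a data set with an empty line between rows, which makes the data easier to read. It can also print the number of rows and columns.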

In [8]:
# Print the header, the first two rows, and the number of apps and columns for Google Play Store
print(android_header)
print('\n')
explore_data(android, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of apps: 10841
Number of columns: 13


The Google Play data set has 10841 rows (excluding the header) and 13 columns (`'App'`, `'Category'`, `'Rating'`, `'Reviews'`, `'Size'`, `'Installs'`, `'Type'`, `'Price'`, `'Content Rating'`, `'Genres'`, `'Last Updated'`, `'Current Ver'`, `'Android Ver'`).

In [12]:
#Print the header, the first four rows, and the number of apps and columns for Apple Store
print(ios_header)
print('\n')
explore_data(ios, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of apps: 7197
Number of columns: 16


The Apple Store data set has 7197 rows (excluding the header) and 16 columns (`'id'`, `'track_name'`, `'size_bytes'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, `'user_rating'`, `'user_rating_ver'`, `'ver'`, `'cont_rating'`, `'prime_genre'`, `'sup_devices.num'`, `'ipadSc_urls.num'`, `'lang.num'`, `'vpp_lic'`).
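
Later cells address columns by position (for example `-5` for the App Store genre). A minimal sketch of looking positions up by name instead, so the intent stays readable. `column_indexes` is a hypothetical helper that is not part of the original notebook, and the two header lists are copied from the outputs above.

```python
def column_indexes(header):
    """Map each column name to its position in the header row."""
    return {name: position for position, name in enumerate(header)}


ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

ios_cols = column_indexes(ios_header)
android_cols = column_indexes(android_header)

print(ios_cols['prime_genre'])   # 11, the same column as index -5
print(android_cols['Genres'])    # 9, the same column as index -4
```

With such a mapping, `row[ios_cols['prime_genre']]` could stand in for `row[-5]` wherever a named column is clearer.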

## Data cleaning

- Removing the row with n/a value
- Removing duplicate entries
- Removing non-English apps
- Isolating the free apps

### Removing the row with n/a value

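The Google Play data set has a dedicated discussion section, which reports an error in one of the rows (index 10472). As the output below shows, that row is missing its `'Category'` value, so it has only 12 values and every later value is shifted one column to the left. The row needs to be deleted.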

In [13]:
#Checking for the error

print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [14]:
#Delete this row

del android[10472]

In [15]:
#Check if deletion was done properly

explore_data(android, 0, 0, True)

Number of apps: 10840
Number of columns: 13


The deletion worked: there were 10841 apps before and there are 10840 now.
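
The faulty row above was found through the discussion section, but a row with a missing value can also be detected by comparing its length to the header's. The following is only a sketch of that idea: `find_malformed_rows` is a hypothetical helper and the rows are made-up toy data, not taken from the data set.

```python
def find_malformed_rows(rows, header):
    """Return the positions of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


# Toy data: the second row is missing its category value.
header = ['App', 'Category', 'Rating']
rows = [
    ['Toy App A', 'TOOLS', '4.5'],
    ['Toy App B', '1.9'],
    ['Toy App C', 'GAME', '4.1'],
]

bad = find_malformed_rows(rows, header)
print(bad)  # [1]

# Delete from the end so that earlier positions stay valid.
for i in reversed(bad):
    del rows[i]

print(len(rows))  # 2
```

Note that running a `del` cell twice removes a second, valid row, so checking the row count after deletion (as done above) remains useful.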

### Removing duplicate entries

In [16]:
#Check for duplicates (count if there are some)

import random
secure_random = random.SystemRandom()

duplicate_app_names = []
unique_app_names = []

for app in android:
    app_name = app[0]
    if app_name in unique_app_names:
        duplicate_app_names.append(app_name)
    else:
        unique_app_names.append(app_name)
        
        
print('Number of duplicates: ', len(duplicate_app_names))
print('An example of duplicates: ', secure_random.choice(duplicate_app_names))

Number of duplicates:  1181
An example of duplicates:  YouTube Kids
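
Checking `app_name in unique_app_names` scans a list on every iteration, which becomes slow as the data grows. A possible alternative, sketched here with `collections.Counter` from the standard library, counts every name in one pass. `count_duplicates` is a hypothetical helper and the rows are toy data.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of surplus rows, sorted list of repeated names)."""
    counts = Counter(row[name_index] for row in rows)
    # Each name that occurs n times contributes n - 1 surplus rows.
    surplus = sum(n - 1 for n in counts.values())
    repeated = sorted(name for name, n in counts.items() if n > 1)
    return surplus, repeated


toy_rows = [['A'], ['B'], ['A'], ['A'], ['C'], ['B']]
surplus, repeated = count_duplicates(toy_rows)
print(surplus)   # 3
print(repeated)  # ['A', 'B']
```

The surplus count corresponds to `len(duplicate_app_names)` in the cell above, because that list also receives one entry for every occurrence after the first.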


In [17]:
#Check duplicates
for app in android:
    name = app[0]
    if name == 'Subway Surfers':
        print(app)

['Subway Surfers', 'GAME', '4.5', '27722264', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27723193', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27724094', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27725352', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27725352', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27711703', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']


The rows for `'Subway Surfers'` (one example of an app with duplicates) are printed above. The rows differ mainly in the number of reviews (the fourth value).

Based on this, the next steps keep, for each app, the row with the highest number of reviews and remove the rows with lower counts.

In [19]:
#How many apps will we have after deleting duplicates?
print('Expected length: ', len(android) - len(duplicate_app_names))

Expected length:  9659


In [20]:
#Record the highest number of reviews for each app (duplicates are removed in a later cell)

reviews_max = {}

for row in android:
    app_name = row[0]
    n_reviews = float(row[3])
    
    if app_name in reviews_max and reviews_max[app_name] < n_reviews:
        reviews_max[app_name] = n_reviews
        
    elif app_name not in reviews_max:
        reviews_max[app_name] = n_reviews

In [21]:
# Is the length as expected?
if len(reviews_max) == 9659:
    print(True)

True


`len(reviews_max) == 9659` evaluates to `True`, so the dictionary holds exactly one entry per unique app name, as expected.

In [22]:
android_clean = []
already_added = []

for row in android:
    app_name = row[0]
    n_reviews = float(row[3])
    if (reviews_max[app_name] == n_reviews) and (app_name not in already_added):
        android_clean.append(row)
        already_added.append(app_name)

print('Number of rows after cleaning: ', len(android_clean))

Number of rows after cleaning:  9659
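
The two passes above (building `reviews_max`, then filtering) can also be folded into a single pass that stores the best row for each name directly. This is a sketch under stated assumptions: `keep_most_reviewed` is a hypothetical helper, the rows are toy data, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts are tied.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


toy_rows = [
    ['App A', 'X', '4.5', '100'],
    ['App B', 'X', '4.0', '50'],
    ['App A', 'X', '4.5', '300'],
    ['App A', 'X', '4.5', '300'],
]
cleaned = keep_most_reviewed(toy_rows)
print(len(cleaned))  # 2
print(cleaned[0])    # ['App A', 'X', '4.5', '300']
```

The result is ordered by the first appearance of each name, which can differ slightly from the row order produced by the two-pass version.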


### Removing non-English apps

In [23]:
#Check for non-English apps

def is_english(app_name):
    
    for character in app_name:
        if ord(character) > 127:
            return False
    
    return True

print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
print(is_english('Instagram'))
print(is_english('Instachat üòú'))

False
False
True
False


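Every character has a numeric code point, which `ord()` returns. The letters of the standard English alphabet, like all other ASCII characters, have code points in the range 0 to 127.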
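
A short check of that claim, using only the standard library. `string.ascii_letters` contains the 52 unaccented English letters, and `str.isascii()` (available from Python 3.7) performs the same "all code points at most 127" test as the first version of `is_english`.

```python
import string

# Highest code point among a-z and A-Z.
highest = max(ord(letter) for letter in string.ascii_letters)
print(highest)              # 122, the code point of 'z'
print(ord('A'), ord('z'))   # 65 122

# Built-in equivalent of checking every character against 127.
print('Instagram'.isascii())  # True
```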

In [24]:
# Check for special characters in English app names

print(ord('™'))
print(ord('😜'))

8482
128540


The characters above (`'™'` and `'😜'`) have code points above 127, which means they fall outside the ASCII range.

However, such characters do not mean that an app's name is in a language other than English. These entries should be kept rather than deleted, so the function below tolerates up to three non-ASCII characters per name.

In [25]:
# Change `is_english` function to include special characters

def is_english(app_name):
    n_of_special_characters = 0
    
    for character in app_name:
        if ord(character) > 127:
            n_of_special_characters += 1
    if n_of_special_characters > 3:
            return False
    else:
        return True

print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Instachat üòú'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))

True
True
False


In [27]:
# Update `android_clean` and `ios` with `is_english`

#Google Play Store
android_clean_english = []

for app in android_clean:
    app_name = app[0]
    if is_english(app_name):
        android_clean_english.append(app)
        

#Apple Store
ios_english = []

for app in ios:
    app_name = app[1]
    if is_english(app_name):
        ios_english.append(app)
        
        
        
explore_data(android_clean_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)
  

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of apps: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'

The Google Play Store and App Store data sets now include only apps whose names pass the `is_english` check.

### Isolating the free apps

In [28]:
#Number of the free and non-free apps 

#Google Play Store

free_apps = []

for row in android_clean_english:
    app_name = row[0]
    rate = row[6]
    if rate == 'Free':
        free_apps.append(app_name)
    
print('Number of free apps (Google): ', len(free_apps))
print('Number of paid apps (Google): ', len(android_clean_english) - len(free_apps))
        
    
#Apple App Store

free_apps = []

for row in ios_english:
    app_name = row[1]  # 'track_name' (index 0 is the numeric 'id')
    rate = row[4]
    if rate == '0.0':
        free_apps.append(app_name)
    
print('Number of free apps (Apple): ', len(free_apps))
print('Number of paid apps (Apple): ', len(ios_english) - len(free_apps))
        

Number of free apps (Google):  8863
Number of paid apps (Google):  751
Number of free apps (Apple):  3222
Number of paid apps (Apple):  2961
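
The cell above counts the free apps but stores only their names. If the full rows are needed for later analysis, a data set could be split into free and paid rows as sketched below. `parse_price` and `split_free_paid` are hypothetical helpers, the rows are toy data, and stripping a leading `$` is an assumption about how paid prices are written, since only `'0'` and `'0.0'` appear in the outputs shown here.

```python
def parse_price(text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(text.replace('$', '').strip())


def split_free_paid(rows, price_index):
    """Return (free_rows, paid_rows) based on the price column."""
    free, paid = [], []
    for row in rows:
        if parse_price(row[price_index]) == 0:
            free.append(row)
        else:
            paid.append(row)
    return free, paid


toy_rows = [['App A', '0'], ['App B', '$4.99'], ['App C', '0.0']]
free_rows, paid_rows = split_free_paid(toy_rows, price_index=1)
print(len(free_rows), len(paid_rows))  # 2 1
```

For the real data the price column would be index 7 (`'Price'`) for Google Play and index 4 (`'price'`) for the App Store, according to the headers printed earlier.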


In [29]:
#Most common genres in each market
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [30]:
display_table(ios_english, -5)

Games : 54.860100274947435
Entertainment : 7.261846999838266
Education : 6.6310852337053205
Photo & Video : 5.515122109008572
Utilities : 3.4449296458030085
Productivity : 2.7171276079573023
Health & Fitness : 2.6686074721009216
Music : 2.215752870774705
Social Networking : 2.037845705967977
Sports : 1.6820313763545207
Lifestyle : 1.6011644832605532
Shopping : 1.3747371825974446
Weather : 1.1159631246967492
Travel : 0.9704027171276078
News : 0.9218825812712276
Book : 0.8895358240336406
Reference : 0.8571890667960537
Business : 0.8571890667960537
Finance : 0.7924955523208799
Food & Drink : 0.7116286592269124
Navigation : 0.452854601326217
Medical : 0.3396409509946628
Catalogs : 0.08086689309396733


The output shows the percentage distribution of app genres (`prime_genre`) in the App Store, sorted from most to least common.
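
The same kind of sorted percentage table can be produced more compactly with `collections.Counter`, whose `most_common()` method already returns entries in descending order of count. This is a sketch of an alternative, not the notebook's own implementation. `freq_table_percent` is a hypothetical name and the rows are toy data.

```python
from collections import Counter


def freq_table_percent(rows, index):
    """Return (value, percentage) pairs sorted by descending frequency."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100)
            for value, count in counts.most_common()]


toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, percentage in freq_table_percent(toy_rows, 1):
    print(genre, ':', percentage)
# Games : 75.0
# Music : 25.0
```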

In [57]:
display_table(android_clean_english, -4)

Tools : 8.602038693571874
Entertainment : 5.793634283336801
Education : 5.231953401289786
Business : 4.358227584772207
Medical : 4.108591637195756
Personalization : 3.900561680882047
Productivity : 3.879758685250676
Lifestyle : 3.775743707093822
Finance : 3.588516746411483
Sports : 3.442895776991887
Communication : 3.2660703141252343
Action : 3.110047846889952
Health & Fitness : 2.995631370917412
Photography : 2.9124193883919283
News & Magazines : 2.600374453921365
Social : 2.485957977948825
Travel & Local : 2.26752652381943
Books & Reference : 2.26752652381943
Shopping : 2.090701060952777
Simulation : 1.9762845849802373
Arcade : 1.9138755980861244
Dating : 1.768254628666528
Casual : 1.7162471395881007
Video Players & Editors : 1.674641148325359
Maps & Navigation : 1.3417932182234242
Puzzle : 1.2377782400665696
Food & Drink : 1.1649677553567712
Role Playing : 1.0817557728312877
Strategy : 0.9777407946744331
Racing : 0.9465363012273768
Libraries & Demo : 0.8737258165175785
Auto & Vehicl

In [58]:
display_table(android_clean_english, 1) # Category

FAMILY : 19.325982941543582
GAME : 9.819013938007073
TOOLS : 8.61244019138756
BUSINESS : 4.358227584772207
MEDICAL : 4.108591637195756
PERSONALIZATION : 3.900561680882047
PRODUCTIVITY : 3.879758685250676
LIFESTYLE : 3.786145204909507
FINANCE : 3.588516746411483
SPORTS : 3.3804867900977738
COMMUNICATION : 3.2660703141252343
HEALTH_AND_FITNESS : 2.995631370917412
PHOTOGRAPHY : 2.9124193883919283
NEWS_AND_MAGAZINES : 2.600374453921365
SOCIAL : 2.485957977948825
TRAVEL_AND_LOCAL : 2.2779280216351157
BOOKS_AND_REFERENCE : 2.26752652381943
SHOPPING : 2.090701060952777
DATING : 1.768254628666528
VIDEO_PLAYERS : 1.6954441439567296
MAPS_AND_NAVIGATION : 1.3417932182234242
FOOD_AND_DRINK : 1.1649677553567712
EDUCATION : 1.1025587684626585
ENTERTAINMENT : 0.9049303099646349
LIBRARIES_AND_DEMO : 0.8737258165175785
AUTO_AND_VEHICLES : 0.8737258165175785
WEATHER : 0.8217183274391513
HOUSE_AND_HOME : 0.7593093405450385
EVENTS : 0.6656958602038693
PARENTING : 0.6240898689411275
ART_AND_DESIGN : 0.6240

The two outputs above show the percentage distribution of app genres (`Genres`) and categories (`Category`) in the Google Play Store.

In [32]:
#The average number of user ratings per app genre on the App Store

genres_ios = freq_table(ios_english, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_english:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 60253.84920634921
Photo & Video : 14688.715542521993
Games : 15586.759433962265
Music : 29047.109489051094
Reference : 27037.188679245282
Health & Fitness : 10802.157575757576
Weather : 23145.246376811596
Utilities : 7927.525821596244
Travel : 19030.183333333334
Shopping : 26635.011764705883
News : 16980.315789473683
Navigation : 19370.821428571428
Lifestyle : 8930.373737373737
Entertainment : 8862.409799554565
Food & Drink : 19934.386363636364
Sports : 15350.913461538461
Book : 10359.2
Finance : 23353.530612244896
Education : 2472.278048780488
Productivity : 8508.089285714286
Business : 5149.320754716981
Catalogs : 3465.0
Medical : 648.952380952381


#### Conclusions

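On the App Store, Social Networking apps have by far the highest average number of user ratings per app (about 60,254), followed by Music (about 29,047), Reference (about 27,037) and Shopping (about 26,635). Games (about 15,587) and Photo & Video (about 14,689) sit in the middle of the range.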
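
The nested loops used for these averages scan the whole data set once per genre. A single pass that accumulates a running total and a count for each genre gives the same averages. The sketch below assumes the value column holds numeric strings. `average_by_group` is a hypothetical helper and the rows are toy data.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the value column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


toy_rows = [['Games', '100'], ['Games', '300'], ['Music', '50']]
averages = average_by_group(toy_rows, group_index=0, value_index=1)
print(averages)  # {'Games': 200.0, 'Music': 50.0}
```

For the App Store data the call would use `group_index=-5` (`prime_genre`) and `value_index=5` (`rating_count_tot`), matching the indexes in the cell above.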

In [61]:
#The average number of installs per app category on the Google Play Store

category_android = freq_table(android_clean_english, 1)

for category in category_android:
    total = 0
    len_category = 0
    for app in android_clean_english:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

AUTO_AND_VEHICLES : 632501.3214285715
FOOD_AND_DRINK : 1891060.2767857143
SPORTS : 3373767.6861538463
ART_AND_DESIGN : 1887285.0
HOUSE_AND_HOME : 1331540.5616438356
DATING : 828971.2176470588
TRAVEL_AND_LOCAL : 13218662.767123288
WEATHER : 4570892.658227848
BUSINESS : 1663758.627684964
LIFESTYLE : 1369954.7774725275
HEALTH_AND_FITNESS : 3972300.388888889
MAPS_AND_NAVIGATION : 3900634.7286821706
SOCIAL : 22961790.384937238
MEDICAL : 96944.49873417722
SHOPPING : 6966908.880597015
PERSONALIZATION : 4086652.4853333333
COMICS : 817657.2727272727
EDUCATION : 1782566.0377358492
BOOKS_AND_REFERENCE : 7641777.871559633
LIBRARIES_AND_DEMO : 630903.6904761905
BEAUTY : 513151.88679245283
FAMILY : 3345018.516684607
TOOLS : 9785955.211352658
PARENTING : 525351.8333333334
VIDEO_PLAYERS : 24121489.079754602
PRODUCTIVITY : 15530942.008042896
PHOTOGRAPHY : 16636241.267857144
COMMUNICATION : 35153714.17515924
NEWS_AND_MAGAZINES : 9472807.04
ENTERTAINMENT : 11375402.298850575
EVENTS : 249580.640625
FINANC

#### Conclusions

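Among the categories shown in the output above, the highest average number of installs per app on Google Play belongs to COMMUNICATION (about 35.2 million), followed by VIDEO_PLAYERS (about 24.1 million) and SOCIAL (about 23.0 million). BOOKS_AND_REFERENCE averages about 7.6 million installs per app and EDUCATION about 1.8 million. Because `Installs` values are open-ended brackets such as `'10,000+'`, these averages are lower bounds rather than exact figures.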
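
Two small steps make results like these easier to read: converting an `Installs` string to a number in one place, and sorting the averages before printing. The sketch below illustrates both. `parse_installs` is a hypothetical helper, and the four averages are copied (rounded to two decimals) from the output above.

```python
def parse_installs(text):
    """Convert an install bracket such as '1,000,000+' to an integer."""
    return int(text.replace(',', '').replace('+', ''))


print(parse_installs('1,000,000,000+'))  # 1000000000
print(parse_installs('10,000+'))         # 10000

# Averages copied from the output above, rounded to two decimals.
averages = {
    'MEDICAL': 96944.50,
    'SOCIAL': 22961790.38,
    'COMMUNICATION': 35153714.18,
    'VIDEO_PLAYERS': 24121489.08,
}

# Sort by average, highest first.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for category, average in ranked:
    print(category, ':', average)
```

Sorting puts COMMUNICATION first and MEDICAL last, which is hard to see in the unsorted listing.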