# Project 1: Profitable App Profiles for the App Store and Google Play Markets

Scope: find mobile app profiles that are profitable for the App Store and Google Play markets

Goal: Analyze data and understand what kind of apps are likely to attract more users

# The Data

- Both datasets are existing data from Kaggle; see each link for documentation

- [Google Play Android apps](https://www.kaggle.com/lava18/google-play-store-apps/home) 

- [App Store iOS apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

# Open and Explore Data

In [1]:
from csv import reader

#Google Play
open_google = open("googleplaystore.csv", encoding = "utf8")
read_google = reader(open_google)
andriod_app_data = list(read_google) #list of list
andriod_header = andriod_app_data[0]
andriod_data = andriod_app_data[1:]

#App Store
open_app = open("AppleStore.csv", encoding = "utf8")
read_app = reader(open_app)
ios_app_data = list(read_app)
ios_header = ios_app_data[0]
ios_data = ios_app_data[1:]

print("Opened datasets!!!")

Opened datasets!!!
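
The loading cell above leaves both file handles open. A minimal sketch of the same step with a context manager, which closes the file automatically. `load_csv` is a hypothetical helper name that is not part of this notebook, and the demo writes a tiny temporary file with made-up content so it runs without the Kaggle CSVs:

```python
import csv
import os
import tempfile


def load_csv(path):
    """Return (header, rows) from a CSV file and close the file afterwards."""
    # newline="" is what the csv module documentation recommends for file objects
    with open(path, encoding="utf8", newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demo on a small temporary file (made-up content, not the real dataset)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf8", newline="") as tmp:
    tmp.write("App,Category\nDemo App,TOOLS\n")
    demo_path = tmp.name

demo_header, demo_rows = load_csv(demo_path)
os.remove(demo_path)  # clean up the temporary file
print(demo_header)
print(demo_rows)
```

With the real files, the call would look like `load_csv("googleplaystore.csv")`.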


In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
print("Exploring datasets . . . ")

Exploring datasets . . . 


In [3]:
print("Google headers:")
print(andriod_header)
print("\nRows 1 - 3")
explore_data(andriod_data, 0, 3, True)
print("\n\nApp headers:")
print(ios_header)
print("\nRows 1 - 3")
explore_data(ios_data, 0, 3, True)

Google headers:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Rows 1 - 3
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


App headers:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.n

- Note the number of rows and columns reported for both datasets.

# Clean Data
- Detect inaccurate data and correct (or remove) it
- Detect duplicate data and remove them
- Remove non-English apps

[Google Play discussion for errors](https://www.kaggle.com/lava18/google-play-store-apps/discussion)

Incorrect row found.

In [4]:
#check found information
print(andriod_header)
print(andriod_data[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


We must delete this row: its Category value is missing, so every later value is shifted one column to the left and the Rating column reads 19, which is above the maximum rating of 5 in Google Play.

In [5]:
print(len(andriod_data))
del andriod_data[10472]
print(len(andriod_data))

10841
10840
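
The bad row above was found through the Kaggle discussion page. As a rough cross-check, rows like it can also be detected by comparing each row's length with the header's. This is only a sketch under the assumption that a malformed row has the wrong number of columns; `find_malformed_rows` and the sample rows are hypothetical:

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]


# Made-up sample data shaped like a few columns of the Google Play file
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # category missing, so values shift left
    ['App C', 'GAME', '3.9'],
]

bad = find_malformed_rows(sample_header, sample_rows)
print(bad)

# Delete from the end so the earlier indices stay valid
for i in reversed(bad):
    del sample_rows[i]
print(len(sample_rows))
```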


Duplicate rows found.

In [6]:
#Check found information
duplicate_apps = []
unique_apps = []

for row in andriod_data:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
print("Number of duplicate apps/rows:", len(duplicate_apps))

Number of duplicate apps/rows: 1181
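
Checking `app_name in unique_apps` against a list rescans the list for every row. A minimal sketch of the same count using a set, where membership checks are much cheaper; `count_duplicates` and the sample rows are hypothetical:

```python
def count_duplicates(rows, name_index=0):
    """Return (number of duplicate rows, number of unique names)."""
    seen = set()
    duplicates = 0
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates += 1
        else:
            seen.add(name)
    return duplicates, len(seen)


# Made-up sample: three copies of one app and one copy of another
sample = [['Instagram'], ['Instagram'], ['Slack'], ['Instagram']]
n_dup, n_unique = count_duplicates(sample)
print("Duplicates:", n_dup, "Unique:", n_unique)
```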


The extra rows must be deleted, but which copy should be kept?

In [7]:
print(andriod_header, "\n")
for row in andriod_data:
    app_name = row[0]
    if app_name == "Instagram":
        print(row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


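Notice: the copies differ mainly in the number of reviews. We can keep the row with the highest number of reviews, since it is probably the most recent record and so should have the most reliable rating.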

In [8]:
print("Expected length of dataset:", len(andriod_data) - len(duplicate_apps))

Expected length of dataset: 9659


In [9]:
new_data_dict = {}

for row in andriod_data:
    app_name = row[0]
    reviews = row[3]
    
    if app_name in new_data_dict and float(new_data_dict[app_name]) < float(reviews):  # compare as numbers, not strings
        new_data_dict[app_name] = reviews
    elif app_name not in new_data_dict:
        new_data_dict[app_name] = reviews
        
print("Actual length of dataset:", len(new_data_dict))

Actual length of dataset: 9659


The expected and actual lengths match!!!
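
One detail worth keeping in mind for this step: the review counts are read from the CSV as strings, and strings compare character by character rather than by numeric value. The sketch below shows the pitfall and a one-pass variant that keeps the whole row with the most reviews. `keep_most_reviewed` and the sample rows are hypothetical, not part of the notebook:

```python
# Strings compare lexicographically, so '9' is "greater" than '10'
print('9' < '10')                  # False
print(float('9') < float('10'))    # True


def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up sample rows: name, category, rating, reviews
sample = [
    ['Demo', 'SOCIAL', '4.5', '9'],
    ['Demo', 'SOCIAL', '4.5', '10'],
    ['Other', 'TOOLS', '4.0', '3'],
]
deduped = keep_most_reviewed(sample)
print(deduped)
```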

Processing clean dataset for Android apps . . .

In [10]:
andriod_data_clean = []
temp_app_names = []
for row in andriod_data:
    app_name = row[0]
    reviews = row[3]
    if (new_data_dict[app_name] == reviews) and (app_name not in temp_app_names):
        andriod_data_clean.append(row)
        temp_app_names.append(app_name)

In [11]:
#check by explore
print(andriod_header, "\n")
explore_data(andriod_data_clean, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


We have the expected rows!!!

Next, we must check whether each app name is in English.

In [12]:
def is_english(string):
    non_ascii = 0
    for i in string:
        if ord(i) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True

Filtering apps for both data sets . . .

In [13]:
andriod_updated = []
ios_updated = []

#For Google Play apps
for row in andriod_data_clean:
    name = row[0]
    if is_english(name):
        andriod_updated.append(row)

#For App Store apps
for row in ios_data:
    name = row[1]
    if is_english(name):
        ios_updated.append(row)

Check for total apps left . . .

In [14]:
print("Google Play\n")
explore_data(andriod_updated, 0, 3, True)

print("\n\nApp Store\n")
explore_data(ios_updated, 0, 3, True)

Google Play

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


App Store

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', 

Total:
- 9614 Android apps
- 6183 iOS apps

Finding the free apps!

In [16]:
android_final = []
ios_final = []

#Google Play
for row in andriod_updated:
    price = row[7]
    if price == '0':
        android_final.append(row)
        
#App Store        
for row in ios_updated:
    price = row[4]
    if price =='0.0':
        ios_final.append(row)
        
print("Total Google Play apps:", len(android_final))
print("Total App Store apps:", len(ios_final))

Total Google Play apps: 8862
Total App Store apps: 3222
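
The filter above relies on the exact strings `'0'` and `'0.0'`. A minimal sketch of a numeric check that works for both stores; it assumes paid prices may carry a leading `$`, and `is_free` is a hypothetical helper:

```python
def is_free(price):
    """Return True when a price string represents zero."""
    # Assumption: a price is a plain number, optionally prefixed with '$'
    return float(price.replace('$', '').strip()) == 0.0


for price in ['0', '0.0', '$4.99', '2.99']:
    print(repr(price), '->', is_free(price))
```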


# Analyze Data
- Find the most common app genres
- Find the most popular app genres on Google Play
- Find the most popular app genres on the App Store

Building a frequency table to analyze data . . . 

In [17]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages
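
For comparison, the same percentages can be computed with `collections.Counter` from the standard library. This is an alternative sketch, not the function used in this notebook; `freq_table_counter` and the sample rows are hypothetical:

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


# Made-up sample: three games and one tool
sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
table = freq_table_counter(sample, 1)
print(table)  # {'GAME': 75.0, 'TOOLS': 25.0}
```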

Building a display to examine frequency of a dataset . . .

In [18]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
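
The tuple swap in `display_table` exists so that `sorted` orders entries by percentage. A sketch of an equivalent approach that sorts the dictionary items directly with a `key` function; `sorted_percentages` and the sample table are hypothetical:

```python
def sorted_percentages(table):
    """Return (name, percentage) pairs, largest percentage first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Made-up percentages for illustration
sample_table = {'TOOLS': 25.0, 'GAME': 60.0, 'FAMILY': 15.0}
for name, pct in sorted_percentages(sample_table):
    print(f"{name} : {pct:.2f}")
```

Rounding in the format string also keeps the printed table easier to scan than the full float output.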

Explore/examine . . .

In [20]:

print("Andriod\n\n", andriod_header, "\n")
display_table(android_final, 1) #Category column
print("-------------------------------\n")
print("iOS\n\n", ios_header, "\n")
display_table(ios_final, -5) #prime_genre column

Andriod

 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

FAMILY : 18.934777702550214
GAME : 9.693071541412774
TOOLS : 8.451816745655607
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.7941773865944481
MAPS_AND_NAVIGATION : 1.399232678853532
FOOD_AND_DRINK : 1.2412547957571656
EDUCATION : 1.1735499887158656
ENTERTAINMENT : 0.9591514330850823
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES 

Findings for Android:
- More categories than the iOS dataset has genres
- A more balanced landscape of practical and fun apps 

Findings for iOS:
- About 58% of the apps are games, with Entertainment as the runner-up
- Fun apps dominate compared to practical apps

For Android: calculate the average number of installs per category . . .

In [21]:
print("\nFor Andriod apps . . . \n\n")
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)


For Andriod apps . . . 


TRAVEL_AND_LOCAL : 13984077.710144928
GAME : 15560965.599534342
PARENTING : 542603.6206896552
EDUCATION : 1820673.076923077
WEATHER : 5074486.197183099
BUSINESS : 1712290.1474201474
AUTO_AND_VEHICLES : 647317.8170731707
HOUSE_AND_HOME : 1331540.5616438356
SHOPPING : 7036877.311557789
SOCIAL : 23253652.127118643
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
TOOLS : 10682301.033377837
NEWS_AND_MAGAZINES : 9549178.467741935
MEDICAL : 120616.48717948717
COMICS : 817657.2727272727
VIDEO_PLAYERS : 24727872.452830188
FOOD_AND_DRINK : 1924897.7363636363
MAPS_AND_NAVIGATION : 4056941.7741935486
DATING : 854028.8303030303
HEALTH_AND_FITNESS : 4188821.9853479853
PERSONALIZATION : 5201482.6122448975
COMMUNICATION : 38456119.167247385
LIFESTYLE : 1437816.2687861272
PHOTOGRAPHY : 17805627.643678162
FINANCE : 1387692.475609756
ART_AND_DESIGN : 1986335.0877192982
PRODUCTIVITY : 16787331.344927534
BEAUTY : 513151.88679245283
SPORTS : 3638640.1428571427
LIBRARI

Popular app findings for Android:
- Communication apps have the highest average number of installs (about 38 million), but this average may be skewed by a few very popular apps
- Social and Productivity apps also seem very popular, but these categories are mostly dominated by giant companies
- The Books and Reference category looks promising; it includes software for processing and reading ebooks, dictionaries, etc.
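
One way to check whether an average is skewed is to compare it with the median of the same category. The sketch below groups installs per category in a single pass and reports both values. `parse_installs`, `installs_summary`, and the sample rows are hypothetical; only the column positions (category at index 1, installs at index 5) follow the Google Play header:

```python
from collections import defaultdict
from statistics import median


def parse_installs(text):
    """Convert a string such as '1,000,000+' to a float."""
    return float(text.replace(',', '').replace('+', ''))


def installs_summary(rows, category_index=1, installs_index=5):
    """Return {category: (mean installs, median installs)}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[category_index]].append(parse_installs(row[installs_index]))
    return {cat: (sum(vals) / len(vals), median(vals))
            for cat, vals in grouped.items()}


# Made-up rows: one giant app and two small ones in the same category
sample = [
    ['A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['B', 'COMMUNICATION', '', '', '', '1,000+'],
    ['C', 'COMMUNICATION', '', '', '', '10,000+'],
]
mean_installs, median_installs = installs_summary(sample)['COMMUNICATION']
print("mean:", mean_installs)
print("median:", median_installs)
```

A mean far above the median, as in this made-up sample, suggests that a handful of very large apps is driving the category average.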

For iOS: Calculate the average number of user ratings per app genre  . . .

In [22]:
print("\nFor iOS apps . . . \n\n")
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)


For iOS apps . . . 


Medical : 612.0
Shopping : 26919.690476190477
Lifestyle : 16485.764705882353
Sports : 23008.898550724636
Social Networking : 71548.34905660378
Productivity : 21028.410714285714
Travel : 28243.8
Utilities : 18684.456790123455
Book : 39758.5
Reference : 74942.11111111111
Education : 7003.983050847458
Navigation : 86090.33333333333
Music : 57326.530303030304
Business : 7491.117647058823
Games : 22788.6696905016
Entertainment : 14029.830708661417
Catalogs : 4004.0
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Weather : 52279.892857142855
Health & Fitness : 23298.015384615384
News : 21248.023255813954
Photo & Video : 28441.54375


Popular app findings for iOS:
- Navigation apps have the highest average number of user ratings (about 86,000 per app)
- Reference, Social Networking, and Music apps also seem popular 
- Several genres are more popular than their share of the store suggests: Games is the most common genre but has only a middling average number of ratings
- The market may be saturated with fun apps
- Practical apps may have a better chance of standing out and attracting users
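
A genre average can be driven by one or two very large apps, so it may help to list the top apps of a genre and see how much of the total they account for. A minimal sketch with made-up rows; `make_row` and `top_apps` are hypothetical helpers, and only the column positions (`track_name` at index 1, `rating_count_tot` at index 5, `prime_genre` at index -5) follow the App Store data used above:

```python
def make_row(name, ratings, genre):
    """Build a 16-column row shaped like the App Store data (values made up)."""
    row = [''] * 16
    row[1] = name       # track_name
    row[5] = ratings    # rating_count_tot
    row[-5] = genre     # prime_genre
    return row


def top_apps(rows, genre, n=3):
    """Return the n apps of a genre with the most ratings as (name, ratings)."""
    in_genre = [r for r in rows if r[-5] == genre]
    ranked = sorted(in_genre, key=lambda r: float(r[5]), reverse=True)
    return [(r[1], float(r[5])) for r in ranked[:n]]


sample = [
    make_row('Big Maps', '300000', 'Navigation'),
    make_row('Small Maps', '500', 'Navigation'),
    make_row('Tiny Maps', '100', 'Navigation'),
    make_row('Some Game', '9000', 'Games'),
]
top = top_apps(sample, 'Navigation', n=2)
total = sum(float(r[5]) for r in sample if r[-5] == 'Navigation')
share = top[0][1] / total * 100
print(top)
print(f"Top app share of ratings: {share:.1f}%")
```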

# Conclusion
- This notebook cleaned and analyzed data on mobile apps from Google Play and the App Store
- The goal was to understand what kinds of apps are likely to attract the most users