# Dataquest.io Project 1

This is the first project in the Dataquest.io Data Scientist in Python path. In this project we work as analysts for a company that builds Android and iOS apps and earns all of its revenue from in-app ads (it sells no paid apps). We use data analysis to help our team understand which types of apps are likely to attract more users. 

## Pull the data
First we pull and inspect the data to make sure it's in a format that we can work with. We can check a sample of the data to see what it looks like. 

The datasets are described here: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

https://www.kaggle.com/lava18/google-play-store-apps

In [1]:
#Get the apple store and google play store sample data
from csv import reader

#Assumption: both CSV files are UTF-8 encoded (app names contain non-ASCII characters)
openedFile = open('AppleStore.csv', encoding='utf8')
read_file = reader(openedFile)
appleStore_Data = list(read_file)
openedFile.close()

openedFile = open('googleplaystore.csv', encoding='utf8')
read_file = reader(openedFile)
googlePlayStore_Data = list(read_file)
openedFile.close()

#Print the first few lines of each to confirm they're there
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
        
explore_data(appleStore_Data, 0, 5)
explore_data(googlePlayStore_Data, 0,5)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'A

In [2]:
#Check how many columns are in each
print("Apple Store Data:")
explore_data(appleStore_Data, 0, 0, rows_and_columns = True)
print("Google Play Store Data:")
explore_data(googlePlayStore_Data, 0, 0, rows_and_columns = True)

Apple Store Data:
Number of rows: 7198
Number of columns: 16
Google Play Store Data:
Number of rows: 10842
Number of columns: 13


## Begin cleaning the data:

One row in the Google Play data is malformed. It is reported in the Kaggle discussion linked below, so we begin cleaning by locating and removing that row.

https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015

In [3]:
#It seems like the error might be around row 10472 (from the link). Print the rows around there to check
print(googlePlayStore_Data[0])
explore_data(googlePlayStore_Data, 10470, 10475)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['TownWiFi | Wi-Fi Everywhere', 'COMMUNICATION', '3.9', '2372', '58M', '500,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', '4.2.1', '4.2 and up']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




In [4]:
#It looks like row 10473 is missing the CATEGORY field. Let's confirm that
print(googlePlayStore_Data[10473])
print(len(googlePlayStore_Data[10473]))
print(len(googlePlayStore_Data[0]))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12
13


In [5]:
#There is a missing field in that row, so let's just delete it
del googlePlayStore_Data[10473]
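
The manual check above relies on knowing the row number from the Kaggle discussion. A more general check, sketched here under the assumption that a malformed row is one whose length differs from the header's, would be to scan every row. The helper name `find_malformed_rows` and the tiny sample data are illustrative only and are not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row length) for every row whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Made-up sample: a header, one complete row, and one row missing a field
sample = [
    ['App', 'Category', 'Rating'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]
print(find_malformed_rows(sample))
```

On the real data, the same call on `googlePlayStore_Data` before the `del` statement would list the index of every row that needs attention, not only the one already reported.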

## Removing duplicate entries
### Identify duplicates
In this section we identify duplicate entries in the data. Rather than deleting duplicates at random, we keep the entry with the highest number of reviews for further analysis.



In [6]:
#Create lists to hold duplicates
androidDuplicates = []
androidUnique = []
appleDuplicates = []
appleUnique = []

for app in googlePlayStore_Data:
    appName = app[0]
    if appName in androidUnique:
        androidDuplicates.append(appName)
    else :
        androidUnique.append(appName)

for app in appleStore_Data:
    appName = app[1]
    if appName in appleUnique :
        appleDuplicates.append(appName)
    else: 
        appleUnique.append(appName)

print("Number of duplicate Android Apps: " + str(len(androidDuplicates)))
print("Number of duplicate Apple Apps: " + str(len(appleDuplicates)))

Number of duplicate Android Apps: 1181
Number of duplicate Apple Apps: 2


In [7]:
#Create a dictionary of the app name & number of reviews that row has
reviews_max = {}
for app in googlePlayStore_Data[1:] :
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews :
        reviews_max[name] = n_reviews
    if name not in reviews_max :
        reviews_max[name] = n_reviews
        
#Subtract 1 for the header row
print('The expected length of the dictionary is : ', (len(googlePlayStore_Data) - 1181)-1)
print('The actual length of the dictionary is : ', len(reviews_max))
    

The expected length of the dictionary is :  9659
The actual length of the dictionary is :  9659


In [8]:
#Now that I have the highest review count for each app name, I can build the cleaned data from it
android_clean = []
already_added = []

for row in googlePlayStore_Data[1:] :
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added :
        android_clean.append(row)
        already_added.append(name)

explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
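
The two-step approach above (build `reviews_max`, then filter) could also be sketched as a single pass that keeps, for each app name, the row with the most reviews. This is an alternative sketch, not the method used in this notebook; `dedupe_by_max_reviews` and the sample rows are made up for illustration. Like the notebook code, it keeps the first row seen when two rows tie on review count.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample rows: name, category, rating, reviews
sample_rows = [
    ['App A', 'TOOLS', '4.0', '10'],
    ['App B', 'TOOLS', '4.1', '5'],
    ['App A', 'TOOLS', '4.0', '25'],
]
for row in dedupe_by_max_reviews(sample_rows):
    print(row)
```

A dictionary lookup is also faster than the `name not in already_added` list check, which rescans the list for every row. Note that the output order follows the first appearance of each name, which can differ from the order produced by the notebook code.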


## Removing non-English entries
The Apple Store data contains non-English entries that need to be removed.

### Identifying non-English characters using ASCII codes


In [9]:
#Write a function to check a string for character codes > 127. If there are more than 3 such characters, treat the name as non-English
#Test it with some strings
def isStringEnglish(inputString) :
    nonEnglishCharacterCounter = 0
    for character in inputString :
        if(ord(character) > 127) :
            nonEnglishCharacterCounter += 1
            if(nonEnglishCharacterCounter > 3) :
                return False
    
    return True


print(isStringEnglish('Instagram'))
print(isStringEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isStringEnglish('Docs To Go™ Free Office Suite'))
print(isStringEnglish('Instachat 😜'))

True
False
True
True
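
To see why a threshold is used instead of rejecting any non-ASCII character, it can help to count the characters above code 127 in each test string. The helper `count_non_ascii` below is a hypothetical name introduced for illustration and is not part of the notebook.

```python
def count_non_ascii(text):
    """Count the characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for character in text if ord(character) > 127)

test_names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
]
for name in test_names:
    print(name, ':', count_non_ascii(name))
```

A name with a single trademark sign stays under the threshold, while a name written mostly in Chinese characters exceeds it, which matches the True/False results above.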


In [10]:
#Remove non-english apps from the apple dataset. 
apple_clean = []

for row in appleStore_Data[1:] :
    if isStringEnglish(row[1]) :
        apple_clean.append(row)
        
explore_data(apple_clean, 0, 1, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 6183
Number of columns: 16


## Finish cleaning the data
### Isolating ONLY free apps, remove paid apps from both Android and Apple Clean lists

Since our team produces free-to-play apps, we only want to compare against other free-to-play apps. We remove anything that's paid from each list.

In [11]:
android_clean_free = []
apple_clean_free = []

def isAppFree(dataRow, android = True) :
    #Android rows store the price at index 7; free apps have the value '0'
    if android :
        return dataRow[7] == '0'
    #Apple Store rows store the price at index 4; free apps have the value '0.0'
    return dataRow[4] == '0.0'
    
for app in android_clean :
    if isAppFree(app, True) :
        android_clean_free.append(app)

for app in apple_clean : 
    if isAppFree(app, False) :
        apple_clean_free.append(app)
        
explore_data(android_clean_free, 0, 5, True)
explore_data(apple_clean_free, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8905
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0
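
The price check above compares exact strings (`'0'` for Android, `'0.0'` for Apple), so it depends on how each file happens to write a zero price. A slightly more tolerant sketch is to parse the price as a number first. This assumes prices are plain numbers that may carry a leading `$`; the helper name `is_free` is hypothetical.

```python
def is_free(price_text):
    """Return True when the price string parses to zero."""
    # Assumption: the only non-numeric decoration is whitespace or a leading '$'
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0

for price in ['0', '0.0', '$4.99']:
    print(price, ':', is_free(price))
```

With this helper the same function could serve both stores. It raises a `ValueError` on a value that is not a number, which may be preferable to silently treating a malformed row as paid.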

# End Data Cleaning -----------------------


# Analysis 
Now that the data is clean, we can begin to analyse it. 

#### We want to identify games that are likely to attract users to our freemium apps. 
The strategy is to: 
1 - Build an MVP Android app and post it to Google Play
2 - If the response is good, build out more features
3 - If it is profitable after 6 months, build it for iOS

Therefore we need to identify app types that are successful on both Android AND iOS

In [12]:
#Create frequency tables of app types for both android and iOS
android_freq = {}
apple_freq = {}

#Column indexes I might need
#11 (prime_genre) - apple
#1 (Category), 9 (Genres) - android

#A function that takes in a dataset and an index value, returns a frequency table (as percentages) of that index
def freq_table(inData, inIndex) :
    numItems = 0
    outDict = {}
    for row in inData :
        numItems += 1
        if row[inIndex] in outDict :
            outDict[row[inIndex]] += 1
        else :
            outDict[row[inIndex]] = 1
    #convert the dict values to percentage
    for item in outDict :
        outDict[item] = round((outDict[item]/numItems) * 100,2)
    return outDict

#A function to display tables
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
#display frequency tables for prime_genre (apple), then Category and Genres (android)
print("******Apple")
display_table(apple_clean_free, 11)
print("******Android")
display_table(android_clean_free, 1)
print("******Android")
display_table(android_clean_free, 9)

******Apple
Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12
******Android
FAMILY : 18.98
GAME : 9.7
TOOLS : 8.43
BUSINESS : 4.58
LIFESTYLE : 3.93
PRODUCTIVITY : 3.89
FINANCE : 3.68
MEDICAL : 3.51
SPORTS : 3.38
PERSONALIZATION : 3.31
COMMUNICATION : 3.23
HEALTH_AND_FITNESS : 3.07
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.83
SOCIAL : 2.65
TRAVEL_AND_LOCAL : 2.32
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.18
DATING : 1.85
VIDEO_PLAYERS : 1.8
MAPS_AND_NAVIGATION : 1.41
FOOD_AND_DRINK : 1.24
EDUCATION : 1.17
ENTERTAINMENT : 0.95
LIBRARIES_AND_DEMO : 0.93
AUTO_AND_VEHICLES : 0.92
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN

## At a glance analysis
By checking the frequency tables above, we can observe some trends

### Apple Store trends
 - The most common iOS app type by far is Games (58%), followed by Entertainment (about 8%). 
 - Games and Entertainment combined make up nearly 2/3 of iOS apps

We can surmise that iOS apps are geared towards entertainment and gaming rather than lifestyle. 
Therefore, I would recommend that any iOS app our team develops fit the game or entertainment profile.
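
The "nearly 2/3" figure can be checked directly from the frequency table. The sketch below copies the two relevant percentages from the Apple table above; `combined_share` is a hypothetical helper and works on any dictionary shaped like the output of `freq_table`.

```python
def combined_share(table, keys):
    """Add up the percentage shares of the given keys, rounded to 2 decimals."""
    return round(sum(table[key] for key in keys), 2)

# Values copied from the Apple frequency table above
apple_shares = {'Games': 58.16, 'Entertainment': 7.88}
print(combined_share(apple_shares, ['Games', 'Entertainment']))
```

The two shares add up to 66.04%, just under two thirds (about 66.67%).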

### Google Store trends
 - The most common Google Play categories are FAMILY (19%), GAME (10%) and TOOLS (8%), however the gap between the top category and the rest is not nearly as pronounced as in the iOS apps. 
 - Categories for Google Play apps are more evenly distributed, though this might be partly because the Google data splits entertainment/games into more specific genres (e.g. strategy, role playing). 
 
##### Summary:
The iOS store is loaded with apps designed for fun, while the Google Play Store has a more even mix of fun and practical apps. 
 
Note that these tables show which types of apps are most NUMEROUS, not necessarily which have the most users. Therefore, we should find a way to estimate the number of USERS an app has. 

Google Play data includes an "Installs" field. 
Apple Store data does not, so we will use the total ratings count (`rating_count_tot`) as a proxy for installs. 

In [14]:
#Get a freq table of apple prime genres
applePrimeGenres = freq_table(apple_clean_free, 11)

#for each genre in the apple prime genres
for item in applePrimeGenres :
    total = 0
    len_genre = 0
    for app in apple_clean_free :
        genre_app = app[11]
        if genre_app == item :
            numRatings = float(app[5])
            total += numRatings
            len_genre += 1
    avgRatings = total/len_genre
    print(item, ':', avgRatings)


Health & Fitness : 23298.015384615384
Lifestyle : 16485.764705882353
Catalogs : 4004.0
Finance : 31467.944444444445
Travel : 28243.8
Utilities : 18684.456790123455
News : 21248.023255813954
Medical : 612.0
Photo & Video : 28441.54375
Education : 7003.983050847458
Book : 39758.5
Entertainment : 14029.830708661417
Business : 7491.117647058823
Music : 57326.530303030304
Social Networking : 71548.34905660378
Navigation : 86090.33333333333
Games : 22788.6696905016
Shopping : 26919.690476190477
Reference : 74942.11111111111
Productivity : 21028.410714285714
Weather : 52279.892857142855
Sports : 23008.898550724636
Food & Drink : 33333.92307692308


If we check the average number of ratings for Apple Store apps, we can see that Navigation apps have a high number of ratings per app. We therefore assume a high number of users (ratings as a proxy). The same is true for Social Networking. 

However, using our frequency table from the previous steps, we can see that Navigation and Social Networking apps together make up <4% of all iOS apps in our cleaned data. Therefore we can assume that a small number of apps here hold a large share of the ratings. 

For this reason, I might suggest that we go with Photo & Video or Entertainment apps, which together make up about 13% of apps in the store and average roughly 28,000 and 14,000 ratings per app respectively. 
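
The assumption that a few apps hold most of the ratings in a genre can be tested instead of inferred. One possible check, sketched below, is the percentage of a genre's total ratings that belongs to its top few apps. The function `top_share`, the helper `make_row`, and the sample numbers are all made up for illustration; the column indexes match the Apple data (11 for `prime_genre`, 5 for `rating_count_tot`).

```python
def top_share(rows, genre, genre_index=11, count_index=5, top_n=3):
    """Percentage of a genre's total ratings held by its top_n apps."""
    counts = sorted(
        (float(row[count_index]) for row in rows if row[genre_index] == genre),
        reverse=True,
    )
    total = sum(counts)
    if total == 0:
        return 0.0
    return round(sum(counts[:top_n]) / total * 100, 2)

def make_row(genre, rating_count):
    """Build a 16-column placeholder row with only the two fields we need filled in."""
    row = [''] * 16
    row[5] = str(rating_count)
    row[11] = genre
    return row

# Made-up sample: one dominant app and three small ones
sample = [make_row('Navigation', n) for n in (900, 50, 30, 20)]
print(top_share(sample, 'Navigation', top_n=1))
```

Calling `top_share(apple_clean_free, 'Navigation')` on the real data would show how much of the Navigation average comes from its largest apps.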

In [17]:
#Create a frequency table for Android Category column
android_category = freq_table(android_clean_free, 1)

#Find the average number of installs for each category
for category in android_category :
    total = 0
    len_category = 0
    for app in android_clean_free :
        app_category = app[1]
        if app_category == category :
            numInstalls = app[5]
            numInstalls = numInstalls.replace('+', '')
            numInstalls = numInstalls.replace(',', '')
            numInstalls = float(numInstalls)
            total += numInstalls
            len_category += 1
    avgInstalls = total/len_category
    print(category, ':', avgInstalls)

EDUCATION : 1825480.7692307692
MAPS_AND_NAVIGATION : 3993339.603174603
COMICS : 803234.8214285715
LIFESTYLE : 1436126.94
GAME : 15551995.891203703
HEALTH_AND_FITNESS : 4188821.9853479853
SPORTS : 3638640.1428571427
SOCIAL : 23253652.127118643
AUTO_AND_VEHICLES : 647317.8170731707
ENTERTAINMENT : 11640705.88235294
HOUSE_AND_HOME : 1331540.5616438356
PARENTING : 542603.6206896552
BOOKS_AND_REFERENCE : 8587351.855670104
NEWS_AND_MAGAZINES : 9401635.952380951
MEDICAL : 120550.61980830671
ART_AND_DESIGN : 1952105.1724137932
FOOD_AND_DRINK : 1924897.7363636363
PRODUCTIVITY : 16738957.554913295
BEAUTY : 513151.88679245283
LIBRARIES_AND_DEMO : 638503.734939759
PHOTOGRAPHY : 17772018.759541985
EVENTS : 253542.22222222222
FAMILY : 3668870.823076923
WEATHER : 5074486.197183099
DATING : 854028.8303030303
BUSINESS : 1708215.906862745
FINANCE : 1387692.475609756
COMMUNICATION : 38322625.697916664
TOOLS : 10787009.952063914
SHOPPING : 7001693.425
TRAVEL_AND_LOCAL : 13984077.710144928
VIDEO_PLAYERS : 

From this list we can see that COMMUNICATION apps have a very high average number of installs, but the frequency table shows that there are not many of these apps in the Google Play Store (only about 3% of the apps in our cleaned data). The average here is therefore likely skewed by a few apps with very high numbers of users. 

I would suggest that we investigate BOOKS_AND_REFERENCE apps, since they average more than 8 million installs per app.
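
Because a few very popular apps can pull an average upward, comparing the mean with the median for a category is one way to see how strong the skew is. The sketch below uses Python's `statistics` module; `parse_installs`, `installs_summary`, and the sample rows are hypothetical and only mirror the column layout of the Google Play data (1 for Category, 5 for Installs).

```python
from statistics import mean, median

def parse_installs(text):
    """Turn an Installs string such as '10,000+' into a float."""
    return float(text.replace('+', '').replace(',', ''))

def installs_summary(rows, category, category_index=1, installs_index=5):
    """Return (mean, median) installs for the apps in one category."""
    values = [
        parse_installs(row[installs_index])
        for row in rows
        if row[category_index] == category
    ]
    return mean(values), median(values)

# Made-up sample: one giant app and two small ones in the same category
sample = [
    ['Big Chat', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '10,000+'],
    ['Tiny Chat', 'COMMUNICATION', '', '', '', '5,000+'],
]
avg, mid = installs_summary(sample, 'COMMUNICATION')
print('mean:', avg, 'median:', mid)
```

A mean far above the median, as in this sample, suggests that a handful of apps dominate the category. Running the same comparison on `android_clean_free` for COMMUNICATION and BOOKS_AND_REFERENCE would show whether the recommendation holds once the largest apps are discounted.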