# Profitable App Profiles for the App Store and Google Play Markets

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

Open and read the files, and separate the headers from the data.

In [2]:
from csv import reader

# Android (Google Play) store stats
# encoding is assumed to be UTF-8; the with-block closes the file for us
with open("googleplaystore.csv", encoding="utf8") as openedFile:
    android = list(reader(openedFile))
androidHeader = android[0]
android = android[1:]
print("Android header:")
print(androidHeader)

# iOS (App Store) stats
with open("AppleStore.csv", encoding="utf8") as openedFile:
    ios = list(reader(openedFile))
iosHeader = ios[0]
ios = ios[1:]
print("Ios header:")
print(iosHeader)

Android header:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
Ios header:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Function to explore the data

In [3]:
def exploreData(dataset,start,end,rowsAndColumns=False):
    datasetSlice = dataset[start:end]
    for row in datasetSlice:
        print(row)
        print('\n')

    if rowsAndColumns:
        print("Number of rows: {0}".format(len(dataset)))
        print("Number of columns: {0}".format(len(dataset[0])))
        

In [4]:
print("Android data *******")
exploreData(android,2,8,True)

print("iOS data *******")
# exploreData prints its own output and returns None, so it is not wrapped in print()
exploreData(ios,2,8,True)

Android data *******
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']


['Infinite Painter', 'ART_AND_DESIGN', '4.1', '36815', '29M', '1,000,000+', 'Fre

Recall that at our company, we only build apps that are free to download and install, and we design them for an English-speaking audience. This means that we'll need to do the following:

- Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
- Remove apps that aren't free.

We call this process of preparing our data for analysis data cleaning. We do data cleaning before the analysis; it includes removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis.

It's often said that data scientists spend around 80% of their time cleaning data, and only about 20% actually analyzing (cleaned) data. In this project, we'll see that this is not far from the truth.

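One row in the Android dataset is missing a value: its category is absent, so the row has 12 fields instead of 13 and every later column is shifted left. We print it next to a normal row and then delete it from our list.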

In [5]:
exploreData(android,10471,10473)

#delete the problematic row that is missing data
del android[10472]

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
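
Rather than spotting such a row by eye, we could compare each row's length with the header's length. The following is a minimal sketch on made-up rows; the function name `find_malformed_rows` and the toy data are illustrative and not part of the original notebook.

```python
# Toy header and rows (hypothetical data, not taken from the real files).
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # 'Category' is missing, so the row is short
    ['App C', 'GAME', '4.5'],
]

def find_malformed_rows(dataset, expected_length):
    """Return the indexes of rows whose length differs from expected_length."""
    return [i for i, row in enumerate(dataset) if len(row) != expected_length]

bad_indexes = find_malformed_rows(rows, len(header))
print(bad_indexes)  # [1]

# Delete from the end so that the earlier indexes stay valid.
for i in reversed(bad_indexes):
    del rows[i]
print(len(rows))  # 2
```

Deleting by index is not idempotent: running a `del dataset[i]` cell twice removes a second, valid row, so it is worth re-checking the lengths before deleting.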




Both datasets also contain duplicate entries for some apps, so we count them next.

In [6]:
def findDuplicates(dataset, appNameIndex):
    uniqueApps = set()
    duplicateApps = []
    for row in dataset:
        app = row[appNameIndex]
        if app in uniqueApps:
            duplicateApps.append(app)
        else:
            uniqueApps.add(app)
    
    print("Unique apps: {}".format(len(uniqueApps)))
    print("Duplicate apps: {} ".format(len(duplicateApps)))
    print("*"*10)

findDuplicates(android,0)
findDuplicates(ios,1)


Unique apps: 9659
Duplicate apps: 1181 
**********
Unique apps: 7195
Duplicate apps: 2 
**********


If we print the duplicate rows for an app such as Instagram, the main difference appears in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times.

We can use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.



In [7]:
appNameAndReviews = {}

# map each app name to its highest review count
for row in android:
    appName = row[0]
    numOfReviews = float(row[3])  # convert so we compare numbers, not strings
    if appName not in appNameAndReviews:
        appNameAndReviews[appName] = numOfReviews
    elif numOfReviews > appNameAndReviews[appName]:
        appNameAndReviews[appName] = numOfReviews


# keep one row per app: the first row that carries the highest review count
uniqueAndroid = []
alreadySeen = set()
for row in android:
    appName = row[0]
    numOfReviews = float(row[3])
    maxNumberOfReviews = appNameAndReviews[appName]
    if appName not in alreadySeen and numOfReviews == maxNumberOfReviews:
        uniqueAndroid.append(row)
        alreadySeen.add(appName)

print(len(uniqueAndroid)) #should be 9659
    

9659
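
One detail worth checking when applying this criterion: the review counts are read from the CSV as text, and numeric strings compare character by character rather than by value. The sketch below illustrates the difference and then runs the same keep-the-highest logic on a few made-up rows (the review counts shown are invented for the example).

```python
# Numeric strings compare character by character, not by value.
print('967' > '87510')            # True, because '9' sorts after '8'
print(int('967') > int('87510'))  # False

# Toy rows: [app name, review count]; hypothetical data.
rows = [
    ['Instagram', '66577313'],
    ['Instagram', '66577446'],
    ['Other App', '967'],
    ['Instagram', '66509917'],
]

# First pass: highest review count per app name.
max_reviews = {}
for name, reviews in rows:
    n = int(reviews)
    if n > max_reviews.get(name, -1):
        max_reviews[name] = n

# Second pass: keep the first row that carries that highest count.
deduplicated = []
seen = set()
for name, reviews in rows:
    if name not in seen and int(reviews) == max_reviews[name]:
        deduplicated.append([name, reviews])
        seen.add(name)

print(deduplicated)
# [['Instagram', '66577446'], ['Other App', '967']]
```

The `seen` set matters when two duplicate rows share the same highest count; without it both would be kept.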


The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.

Since we only build apps for an English-speaking audience, we filter out apps whose names are not in English.
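
A strict check based on the ASCII range might look like the following sketch. The name `is_ascii_only` is ours and is only an illustration of the idea; `ord()` is the built-in that returns a character's code point.

```python
def is_ascii_only(text):
    """Return True if every character's code point is in the ASCII range 0 to 127."""
    for char in text:
        if ord(char) > 127:
            return False
    return True

print(is_ascii_only('Instagram'))                      # True
print(is_ascii_only('爱奇艺PPS -《欢乐颂2》电视剧热播'))   # False
print(is_ascii_only('Docs To Go™ Free Office Suite'))  # False, although the name is English
print(ord('a'), ord('™'))                              # 97 8482
```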

A first version of the function, one that rejects any name containing a character outside the ASCII range, can't correctly identify certain English app names like 'Docs To Go™ Free Office Suite' and 'Instachat 😜'. This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127.

If we're going to use the function we've created, we'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

In [8]:
def nameIsEnglish(appName):
    countOfNonEnglishChars = 0
    for char in appName:
        if ord(char) > 127:
            countOfNonEnglishChars +=1
            if countOfNonEnglishChars > 3:
                return False
    return True

engAndroid = [row for row in uniqueAndroid if nameIsEnglish(row[0])]
engIos = [row for row in ios if nameIsEnglish(row[1])]

print(len(uniqueAndroid))
print(len(engAndroid))
print('*' * 50)
print(len(ios))
print(len(engIos))

print(nameIsEnglish('Docs To Go™ Free Office Suite'))
print(nameIsEnglish('Instachat 😜'))
print(nameIsEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))

9659
9614
**************************************************
7197
6183
True
True
False


So far in the data cleaning process, we've done the following:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps

Now let's keep only the free apps. Looking at the headers and sample rows above, the price is at index 7 in the Android data, where free apps have the value '0', and at index 4 in the iOS data, where free apps have the value '0.0'.
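
The column positions can also be looked up from the headers instead of counted by hand. A small sketch, using the two header lists as printed earlier in this notebook; the `sample_rows` list at the end is made up to show how to inspect the distinct values in a column.

```python
# Headers copied from the output printed earlier in this notebook.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

print(android_header.index('Price'))  # 7
print(ios_header.index('price'))      # 4

# Hypothetical rows: list the distinct values in a column to see how "free" is encoded.
sample_rows = [['a', '0'], ['b', '$4.99'], ['c', '0']]
print(sorted({row[1] for row in sample_rows}))  # ['$4.99', '0']
```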



In [9]:

engUniqueAndroidFree = [row for row in engAndroid if row[7] == '0']
engUniqueIosFree = [row for row in engIos if row[4] == '0.0']

print(len(engUniqueAndroidFree))
print(len(engUniqueIosFree))

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

def freq_table(dataset,index):
    freq_table = {}
    totalNumberOfRows = 0
    for row in dataset:
        if row[index] not in freq_table:
            freq_table[row[index]] = 1
        else:
            freq_table[row[index]] +=1
        totalNumberOfRows+=1
    
    for key,_ in freq_table.items():
        freq_table[key] = round((freq_table[key] /totalNumberOfRows) * 100,1)
    return freq_table

print('*'*100)
display_table(engUniqueIosFree,11)

# print each label once, followed by a separator line of asterisks
print('Android - category ' + '*'*100)
display_table(engUniqueAndroidFree,1)

print('Android - Genres ' + '*'*100)
display_table(engUniqueAndroidFree,9)



8862
3222
****************************************************************************************************
Games : 58.2
Entertainment : 7.9
Photo & Video : 5.0
Education : 3.7
Social Networking : 3.3
Shopping : 2.6
Utilities : 2.5
Sports : 2.1
Music : 2.0
Health & Fitness : 2.0
Productivity : 1.7
Lifestyle : 1.6
News : 1.3
Travel : 1.2
Finance : 1.1
Weather : 0.9
Food & Drink : 0.8
Reference : 0.6
Business : 0.5
Book : 0.4
Navigation : 0.2
Medical : 0.2
Catalogs : 0.1
Android - category ****************************************************************************************************
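
As an aside, a percentage frequency table like the ones above could also be built with `collections.Counter` from the standard library. This is only a sketch of an alternative on made-up values, not the approach the notebook uses.

```python
from collections import Counter

# Toy genre column; hypothetical data.
genres = ['Games', 'Games', 'Music', 'Games', 'Education']

counts = Counter(genres)
total = sum(counts.values())

# most_common() yields (value, count) pairs, highest count first.
percentages = [(genre, round(count / total * 100, 1))
               for genre, count in counts.most_common()]

for genre, pct in percentages:
    print(genre, ':', pct)
```

Here `percentages[0]` is `('Games', 60.0)`, since three of the five toy entries are games.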

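Let's finish by calculating how popular each app genre is on both stores. For the App Store we use the average number of user ratings per genre (the rating_count_tot column). The Google Play data has an Installs column, so for Android we use the average number of installs per category instead. To do that, we'll need to do the following:

- Isolate the apps of each genre
- Add up the user ratings (or installs) for the apps of that genre
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps)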
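
The three steps above can also be done in a single pass over the data by keeping a running sum and a running count per genre. A minimal sketch on made-up rows; the tuples and the names `totals`, `counts` and `averages` are illustrative and do not come from the datasets.

```python
from collections import defaultdict

# Toy rows: (genre, number of user ratings as text); hypothetical data.
apps = [
    ('Games', '100'),
    ('Games', '300'),
    ('Music', '50'),
]

totals = defaultdict(float)  # genre -> sum of ratings
counts = defaultdict(int)    # genre -> number of apps in that genre

for genre, n_ratings in apps:
    totals[genre] += float(n_ratings)
    counts[genre] += 1

# Divide each genre's sum by that genre's own app count.
averages = {genre: totals[genre] / counts[genre] for genre in totals}
print(averages)  # {'Games': 200.0, 'Music': 50.0}
```

The cell below takes the more explicit route of one inner loop per genre, which is easier to follow but rescans the dataset for every genre.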


In [16]:
primeGenreAppStoreTable = freq_table(engUniqueIosFree,11)

# average number of user ratings per iOS genre
totalRatingIndex = 5   # rating_count_tot
genreIndex = 11        # prime_genre
for genre in primeGenreAppStoreTable:
    totalRating = 0
    genreLength = 0
    for record in engUniqueIosFree:
        if record[genreIndex] == genre:
            totalRating += float(record[totalRatingIndex])
            genreLength +=1
    averageRating = totalRating / genreLength
    print("IOS Genre: {} Average no. of ratings {:_}".format(genre,round(averageRating,1)))

# distribution of the Android install brackets (percentages)
display_table(engUniqueAndroidFree,5)

# average number of installs per Android category
aInstallsIndex = 5     # Installs, stored as text such as '1,000,000+'
aGenreIndex = 1        # Category
primeGenreTable = freq_table(engUniqueAndroidFree,aGenreIndex)

for genre in primeGenreTable:
    totalInstalls = 0
    genreLength = 0
    for record in engUniqueAndroidFree:
        if record[aGenreIndex] == genre:
            # strip ',' and '+' so the bracket can be converted to a number
            numInstalls = record[aInstallsIndex].replace(",",'').replace('+','')
            totalInstalls += float(numInstalls)
            genreLength += 1
    averageInstalls = round(totalInstalls / genreLength, 1)
    print("Android Genre: {} Average no. of installs {:_}".format(genre,averageInstalls))



IOS Genre: Social Networking Average no. of ratings 71_548.3
IOS Genre: Photo & Video Average no. of ratings 28_441.5
IOS Genre: Games Average no. of ratings 22_788.7
IOS Genre: Music Average no. of ratings 57_326.5
IOS Genre: Reference Average no. of ratings 74_942.1
IOS Genre: Health & Fitness Average no. of ratings 23_298.0
IOS Genre: Weather Average no. of ratings 52_279.9
IOS Genre: Utilities Average no. of ratings 18_684.5
IOS Genre: Travel Average no. of ratings 28_243.8
IOS Genre: Shopping Average no. of ratings 26_919.7
IOS Genre: News Average no. of ratings 21_248.0
IOS Genre: Navigation Average no. of ratings 86_090.3
IOS Genre: Lifestyle Average no. of ratings 16_485.8
IOS Genre: Entertainment Average no. of ratings 14_029.8
IOS Genre: Food & Drink Average no. of ratings 33_333.9
IOS Genre: Sports Average no. of ratings 23_008.9
IOS Genre: Book Average no. of ratings 39_758.5
IOS Genre: Finance Average no. of ratings 31_467.9
IOS Genre: Education Average no. of ratings 7_00

In [15]:


for app in engUniqueAndroidFree:
    if app[1] == 'PHOTOGRAPHY':
        if 'Beauty' in app[0] or 'Makeup' in app[0]:
            print(app[0],':',app[5])



B612 - Beauty & Filter Camera : 100,000,000+
Makeup Editor -Beauty Photo Editor & Selfie Camera : 1,000,000+
Makeup Photo Editor: Makeup Camera & Makeup Editor : 1,000,000+
InstaBeauty -Makeup Selfie Cam : 50,000,000+
Selfie Camera: Beauty Camera, Photo Editor,Collage : 1,000,000+
YouCam Makeup - Magic Selfie Makeovers : 100,000,000+
Pretty Makeup, Beauty Photo Editor & Snappy Camera : 5,000,000+
Sweet Camera - Selfie Filters, Beauty Camera : 10,000,000+
Beauty Makeup – Photo Makeover : 1,000,000+
Beauty Makeup Snappy Collage Photo Editor - Lidow : 10,000,000+
PhotoWonder: Pro Beauty Photo Editor Collage Maker : 50,000,000+
BeautyPlus - Easy Photo Editor & Selfie Camera : 100,000,000+
Z Camera - Photo Editor, Beauty Selfie, Collage : 100,000,000+
Meitu – Beauty Cam, Easy Photo Editor : 10,000,000+
MakeupPlus - Your Own Virtual Makeup Artist : 50,000,000+
