### Profitable App Profiles for the Google Play Market
1.What kinds of apps are likely to attract more users on Google Play Store.<br/>
2.To do this, we'll need to collect and analyze data about mobile apps available on Google Play.

In [96]:
from csv import reader
opened_file=open('Data\google-play-store-apps\googleplaystore.csv',encoding="utf8")
read_file=reader(opened_file)
google_list=list(read_file)

### Dataset Documentation
1.App-Application name <br/>
2.Category-Category the app belongs to <br/>
3.Rating-Overall user rating of the app (as when scraped) <br/>
4.Reviews-Number of user reviews for the app (as when scraped) <br/>
5.Size-Size of the app (as when scraped) <br/>
6.Installs-Number of user downloads/installs for the app (as when scraped) <br/>
7.Type-Paid or Free <br/>
8.Price-Price of the app (as when scraped) <br/>
9.Content Rating-Age group the app is targeted at - Children / Mature 21+ / Adult <br/>
10.Genres-An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to Music, Game, Family genres. <br/>
11.Last Updated-Date when the app was last updated on Play Store (as when scraped) <br/>
12.Current Ver-Current version of the app available on Play Store (as when scraped) <br/>
13.Android Ver-Min required Android version (as when scraped) <br/>

In [97]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Data set is open and performed a brief exploration of the data. Before beginning the analysis, we need to make sure the data we analyze is accurate, otherwise the results of our analysis will be wrong. This means that we'll need to:

Detect inaccurate data and correct (or remove) it
<br/>Detect duplicate data and remove the duplicates

In [98]:
del google_list[10473]
#Row deleted after comment from dataset provider
#DO NOT run this code twice

In [99]:
explore_data(google_list,10473,10475,True)
#checking how many rows left after the del function, should be 10841

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


Number of rows: 10841
Number of columns: 13


In the data set it is a possibility that the some of the data is duplicate.<br/>
We need to check and remove the duplicates.

In [100]:
duplicate_app=[]
unique_app=[]

for app in google_list:
    name=app[0]
    if name in unique_app:
        duplicate_app.append(name)
    else:
        unique_app.append(name)
print(len(duplicate_app))
print(duplicate_app[:5])

for app in google_list:
    name=app[0]
    if name=='Box':
        print(app)


1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


Returning to our discussion, we don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly.<br/>

If you examine the rows we printed above for the Box app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times.<br/>

We could use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.<br/>

In [101]:
print('Expected data' , len(google_list)-len(duplicate_app))
# Expected length including header

Expected data 9660


In [102]:
review_max={}
for app in google_list[1:]:
    name=app[0]
    n_reviews=float(app[3])
    if name in review_max and review_max[name] < n_reviews:
        review_max[name]=n_reviews
    review_max[name]=float(app[3])
print(len(review_max))
        
    

9659


In [103]:
google_clean=[]
name_already_added=[]

for app in google_list[1:]:
    name=app[0]
    n_reviews=float(app[3])
    if name not in name_already_added and n_reviews==review_max[name]:
        google_clean.append(app)
        name_already_added.append(name)
        
explore_data(google_clean,0,5,True)    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13


We managed to remove the duplicate app entries in the Google Play data set. We'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps whose name suggests that they are not direct toward an English-speaking audience. <br/>
We're not interested in keeping these kind of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).<br/>
The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters, otherwise it doesn't.

In [104]:
def english(app_name):
    for char in app_name:
        if ord(char) > 127: #We can get the corresponding number of each character using the ord() built-in function.
            return False
    return True
            
english('Instagram')
english('Instachat 😜')
#Above two are used just to check if function is correct

False

In the previous screen, we wrote a function that detects non-English app names, but we saw that the function couldn't identify correctly certain English app names like 'Instachat 😜'. This is because emojis and some characters like ™ fall outside the ASCII range and have corresponding numbers that are over 127.<br/><br/>

If we're going to use the function we've created, we'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

In [105]:
def english(app_name):
    non_ascii=0
    for char in app_name:
        if ord(char) > 127: #We can get the corresponding number of each character using the ord() built-in function.
            non_ascii+=1
        if non_ascii >3:
            return False
        else:
            return True

english('Instachat 😜')

True

In [106]:
english_app=[]

for app in google_clean:
    name=app[0]
    if english(name):
        english_app.append(app)
    else:
        print(name)
 
explore_data(english_app,0,5,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13


In [107]:
# Isolating free apps

app_final=[]

for app in english_app:
    price=app[7]
    if price == '0':
        app_final.append(app)
        
explore_data(app_final,0,5,True)        

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8905
Number of columns: 13


#### Most common apps by genre

We'll build two functions we can use to analyze the frequency tables:<br/><br/>

One function to generate frequency tables that show percentages<br/>
Another function we can use to display the percentages in a descending order<br/>

In [108]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [109]:
display_table(app_final,1)

FAMILY : 19.292532285233015
GAME : 9.48905109489051
TOOLS : 8.433464345873105
BUSINESS : 4.570466030320045
LIFESTYLE : 3.9303761931499155
PRODUCTIVITY : 3.885457608085345
FINANCE : 3.6833239752947784
MEDICAL : 3.5261089275687816
SPORTS : 3.402582818641213
PERSONALIZATION : 3.312745648512072
COMMUNICATION : 3.245367770915216
HEALTH_AND_FITNESS : 3.0544637843907916
PHOTOGRAPHY : 2.9421673217293653
NEWS_AND_MAGAZINES : 2.829870859067939
SOCIAL : 2.6501965188096577
TRAVEL_AND_LOCAL : 2.3245367770915215
SHOPPING : 2.2459292532285233
BOOKS_AND_REFERENCE : 2.1785513756316677
DATING : 1.8528916339135317
VIDEO_PLAYERS : 1.785513756316676
MAPS_AND_NAVIGATION : 1.4149354295339696
FOOD_AND_DRINK : 1.235261089275688
EDUCATION : 1.1341942728804042
LIBRARIES_AND_DEMO : 0.9320606400898372
AUTO_AND_VEHICLES : 0.9208309938236946
ENTERTAINMENT : 0.8759124087591241
HOUSE_AND_HOME : 0.8197641774284109
WEATHER : 0.7973048848961257
EVENTS : 0.7074677147669848
PARENTING : 0.6513194834362718
ART_AND_DESIGN : 0

In [110]:
display_table(app_final,9)

Tools : 8.422234699606962
Entertainment : 6.086468276249298
Education : 5.390230207748456
Business : 4.570466030320045
Lifestyle : 3.9191465468837734
Productivity : 3.885457608085345
Finance : 3.6833239752947784
Medical : 3.5261089275687816
Sports : 3.4475014037057834
Personalization : 3.312745648512072
Communication : 3.245367770915216
Action : 3.0881527231892196
Health & Fitness : 3.0544637843907916
Photography : 2.9421673217293653
News & Magazines : 2.829870859067939
Social : 2.6501965188096577
Travel & Local : 2.313307130825379
Shopping : 2.2459292532285233
Books & Reference : 2.1785513756316677
Simulation : 2.0662549129702414
Dating : 1.8528916339135317
Arcade : 1.8528916339135317
Video Players & Editors : 1.785513756316676
Casual : 1.7405951712521055
Maps & Navigation : 1.4149354295339696
Food & Drink : 1.235261089275688
Puzzle : 1.1229646266142617
Racing : 0.9882088714205502
Role Playing : 0.9320606400898372
Libraries & Demo : 0.9320606400898372
Strategy : 0.9208309938236946
Aut

### Most Popular Apps by Genre on Google Play

In [112]:
display_table(app_final, 5) # the Installs columns

1,000,000+ : 15.71027512633352
100,000+ : 11.588994946659179
10,000,000+ : 10.454800673778776
10,000+ : 10.263896687254352
1,000+ : 8.422234699606962
100+ : 6.917462099943853
5,000,000+ : 6.816395283548568
500,000+ : 5.53621560920831
50,000+ : 4.817518248175182
5,000+ : 4.525547445255475
10+ : 3.537338573834924
500+ : 3.2341381246490735
50,000,000+ : 2.2908478382930935
100,000,000+ : 2.1224031443009546
50+ : 1.9090398652442448
5+ : 0.7860752386299831
1+ : 0.5165637282425604
500,000,000+ : 0.26951151038742277
1,000,000,000+ : 0.22459292532285235
0+ : 0.044918585064570464
0 : 0.011229646266142616
