# Project 1 - App Store Data Analysis
For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue is in-app ads. This means our revenue for any given app is mostly driven by the number of users who use it: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

In [11]:
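from csv import reader

def file_opener(filename):
    # Read a CSV file and return its rows (header included) as a list of lists.
    # The UTF-8 encoding is an assumption about how the files were saved.
    with open(filename, encoding='utf8') as opened_file:
        read_file=reader(opened_file)
        data_list=list(read_file)
    return data_list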
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [21]:
Android_Data=file_opener('googleplaystore.csv')
Apple_Data=file_opener('AppleStore.csv')

In [22]:
explore_data(Android_Data,0,1,True)
explore_data(Apple_Data,0,1,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows: 10842
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows: 7198
Number of columns: 16


In [25]:
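# Remove the row at index 10473 (header counted): its 'Category' value is missing, which shifts its other columns. Run this cell only once.
del Android_Data[10473]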

We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. We could remove the duplicate rows at random, but there is a better way.

Duplicate rows for the same app can differ in their number of reviews (the Reviews column), which suggests the data was collected at different times. We can use this to build a criterion for removing the duplicates: the higher the number of reviews, the more recent the data should be. Rather than removing duplicates at random, we'll keep only the row with the highest number of reviews for each app and remove the other entries.
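
As a quick check of that criterion, the sketch below lists every row for one app so the review counts can be compared side by side. The helper name rows_for_app and the sample rows are hypothetical (the review counts are made up for illustration); the only assumption carried over from the notebook is that the app name sits in column 0 and the review count in column 3.

```python
def rows_for_app(dataset, app_name, name_index=0):
    """Return every row whose name column equals app_name."""
    return [row for row in dataset if row[name_index] == app_name]

# Hypothetical sample in the Google Play column order (name, category, rating, reviews)
sample = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['Example Chat', 'SOCIAL', '4.5', '1000'],
    ['Example Chat', 'SOCIAL', '4.5', '1250'],
    ['Other App', 'TOOLS', '4.0', '10'],
]

matches = rows_for_app(sample[1:], 'Example Chat')
for row in matches:
    print(row)

# The row with the most reviews is the one the criterion would keep
most_reviewed = max(matches, key=lambda row: float(row[3]))
print('Keep:', most_reviewed)
```

On the real data, the same call could be made with Android_Data[1:] and the name of any app reported by duplicate_finder.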

In [27]:
def duplicate_finder(dataset,index):
    # Return every repeated occurrence of the value found in column `index`
    duplicate_apps=[]
    unique_apps=set()  # a set keeps each membership test fast
    for each_app in dataset:
        name=each_app[index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.add(name)
    return duplicate_apps

In [30]:
len(duplicate_finder(Apple_Data,0))

0

In [31]:
reviews_max={}  # app name -> highest number of reviews seen so far
for each_app in Android_Data[1:]:  # skip the header row
    name=each_app[0]
    n_reviews=float(each_app[3])
    if name in reviews_max and n_reviews>reviews_max[name]:
        reviews_max[name]=n_reviews
    elif name not in reviews_max:
        reviews_max[name]=n_reviews


In [33]:
len(reviews_max)

9659

In [37]:
android_clean=[]
already_added=[]  # guards against duplicate rows that tie on the highest review count
for each_app in Android_Data[1:]:
    name=each_app[0]
    n_reviews=float(each_app[3])
    if n_reviews==reviews_max[name] and name not in already_added:
        android_clean.append(each_app)
        already_added.append(name)
print(len(android_clean))

9659


To remove the duplicates, we:

- Create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews for that app.
- Use the information stored in the dictionary to create a new data set with only one entry per app (for each app, we select only the entry with the highest number of reviews).
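
The two steps can also be folded into one pass by storing the whole row, rather than just the review count, as the dictionary value. This is a sketch of an alternative, not the approach used in the cells above; the function name keep_most_reviewed and the sample rows are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when two rows tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Hypothetical rows in the Google Play column order (name, category, rating, reviews)
sample = [
    ['Example Chat', 'SOCIAL', '4.5', '1000'],
    ['Other App', 'TOOLS', '4.0', '10'],
    ['Example Chat', 'SOCIAL', '4.5', '1250'],
    ['Example Chat', 'SOCIAL', '4.5', '1250'],
]
deduplicated = keep_most_reviewed(sample)
print(len(deduplicated))
```

One trade-off: the result is ordered by the first appearance of each name, which can differ from the row order produced by the two-step version.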

In [39]:
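def string_analyzer(a_string):
    # Return False if the string has more than three non-ASCII characters (code point > 127), True otherwise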
    counter=0
    for each_char in a_string:
        if ord(each_char) >127:
            counter+=1
            if counter >3:
                return False
    return True
string_analyzer('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [45]:
def dataset_cleaner(dataset,index):
    clean_data=[]
    for each_app in dataset:
        name=each_app[index]
        if string_analyzer(name):
            clean_data.append(each_app)
    return clean_data

In [46]:
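Android_NEN=dataset_cleaner(android_clean,0)
# Column 0 of the App Store data is 'id' (the name is 'track_name', column 1), so this call filters on the id and removes no rows
Apple_NEN=dataset_cleaner(Apple_Data[1:],0)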

In [47]:
print(len(Android_NEN))
print(len(Apple_NEN))

9614
7197


In [50]:
# Keep only the free Android apps (the price is the string '0')
Android_VF=[]
for each_app in Android_NEN:
    cost=each_app[7]
    if cost=='0':
        Android_VF.append(each_app)

# Keep only the free App Store apps (the price is the string '0.0')
Apple_VF=[]
for each_app in Apple_NEN:
    cost=each_app[4]
    if cost=='0.0':
        Apple_VF.append(each_app)

print(len(Android_VF))
print(len(Apple_VF))

8864
4056


To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

In [58]:
def freq_table(dataset,index):
    # Return a dictionary mapping each value in column `index` to its share of the rows, as a percentage
    counts={}
    percentage_table={}
    total=0
    for each_app in dataset:
        total+=1
        var=each_app[index]
        if var in counts:
            counts[var]+=1
        else:
            counts[var]=1
    for each_key in counts:
        percentage=(counts[each_key]/total)*100
        percentage_table[each_key]=percentage
    return percentage_table


In [61]:
freq_table(Apple_VF,11)

{'Book': 1.6272189349112427,
 'Business': 0.4930966469428008,
 'Catalogs': 0.22189349112426035,
 'Education': 3.2544378698224854,
 'Entertainment': 8.234714003944774,
 'Finance': 2.0710059171597637,
 'Food & Drink': 1.0601577909270217,
 'Games': 55.64595660749507,
 'Health & Fitness': 1.8737672583826428,
 'Lifestyle': 2.3175542406311638,
 'Medical': 0.19723865877712032,
 'Music': 1.6518737672583828,
 'Navigation': 0.4930966469428008,
 'News': 1.4299802761341223,
 'Photo & Video': 4.117357001972387,
 'Productivity': 1.5285996055226825,
 'Reference': 0.4930966469428008,
 'Shopping': 2.983234714003945,
 'Social Networking': 3.5256410256410255,
 'Sports': 1.947731755424063,
 'Travel': 1.3806706114398422,
 'Utilities': 2.687376725838264,
 'Weather': 0.7642998027613412}

Analyze the frequency table you generated for the prime_genre column of the App Store data set.

- What is the most common genre? What is the runner-up?
- What other patterns do you see?
- What is the general impression: are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?
- Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?

Analyze the frequency tables you generated for the Category and Genres columns of the Google Play data set.

- What are the most common genres?
- What other patterns do you see?
- Compare the patterns you see for the Google Play market with those you saw for the App Store market.
- Can you recommend an app profile based on what you found so far? Do the frequency tables you generated reveal the most frequent app genres or what genres have the most users?
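
These questions are easier to answer when the percentages are sorted. Below is a minimal sketch, assuming a dictionary shaped like the one freq_table returns; the helper name sorted_percentages is hypothetical, and the three sample values are rounded from the prime_genre table above.

```python
def sorted_percentages(table):
    """Return (key, percentage) pairs ordered from most to least frequent."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Three entries rounded from the prime_genre output, used only as a sample
sample_table = {'Education': 3.25, 'Games': 55.65, 'Entertainment': 8.23}

for genre, percentage in sorted_percentages(sample_table):
    print(genre, ':', round(percentage, 2))
```

With the real data, the call would be sorted_percentages(freq_table(Apple_VF,11)).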

In [69]:
prime_genre=freq_table(Apple_VF,11)
for genre in prime_genre:
    total=0      # sum of rating counts (rating_count_tot) for this genre
    len_genre=0  # number of apps in this genre
    for each_app in Apple_VF:
        genre_app=each_app[11]
        if genre_app==genre:
            ratings=float(each_app[5])
            total=total+ratings
            len_genre+=1
    avg_ratings=total/len_genre
    print(genre,avg_ratings)

Music 56482.02985074627
Travel 20216.01785714286
Food & Drink 20179.093023255813
Navigation 25972.05
News 15892.724137931034
Entertainment 10822.961077844311
Education 6266.333333333333
Weather 47220.93548387097
Productivity 19053.887096774193
Shopping 18746.677685950413
Business 6367.8
Games 18924.68896765618
Medical 459.75
Catalogs 1779.5555555555557
Photo & Video 27249.892215568863
Lifestyle 8978.308510638299
Social Networking 53078.195804195806
Finance 13522.261904761905
Book 8498.333333333334
Reference 67447.9
Health & Fitness 19952.315789473683
Sports 20128.974683544304
Utilities 14010.100917431193


In [78]:
for each_app in Apple_VF:
    genre=each_app[11]
    if genre=='Social Networking':
        print(each_app[1], ':', each_app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

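Social Networking has one of the highest average numbers of ratings (about 53,078), behind only Reference (about 67,448) and Music (about 56,482). Its average is heavily influenced by a few very large apps such as Facebook, Pinterest, Skype, and Messenger.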
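
One way to see how much a few giants pull the average is to compare the mean with the median, which is far less sensitive to extreme values. The sketch below uses six of the rating counts printed above, so its numbers describe only this sample and not the whole genre. It relies on the standard-library statistics module.

```python
from statistics import mean, median

# Rating counts copied from the Social Networking listing above (a partial sample)
rating_counts = {
    'Facebook': 2974676,
    'Pinterest': 1061624,
    'Skype for iPhone': 373519,
    'LinkedIn': 71856,
    'GroupMe': 28260,
    'Miitomo': 2,
}

counts = list(rating_counts.values())
print('Mean:  ', mean(counts))
print('Median:', median(counts))

# Dropping the two largest apps shows how strongly they drive the mean
without_giants = sorted(counts)[:-2]
print('Mean without the two largest:', mean(without_giants))
```

A mean far above the median is a sign that the genre's popularity is concentrated in a handful of apps.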

In [85]:
app_category=freq_table(Android_VF,1)
for category in app_category:
    total=0
    len_category=0
    for each_app in Android_VF:
        app_genre=each_app[1]
        if app_genre==category:
            installs=each_app[5]
            installs=installs.replace('+','')
            installs=installs.replace(',','')
            installs=float(installs)
            total=total+installs
            len_category+=1
    avg_installs=total/len_category
    print(category,':',avg_installs)

COMMUNICATION : 38456119.167247385
SPORTS : 3638640.1428571427
LIFESTYLE : 1437816.2687861272
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BUSINESS : 1712290.1474201474
MAPS_AND_NAVIGATION : 4056941.7741935486
PRODUCTIVITY : 16787331.344927534
FAMILY : 3695641.8198090694
COMICS : 817657.2727272727
EVENTS : 253542.22222222222
WEATHER : 5074486.197183099
PERSONALIZATION : 5201482.6122448975
HEALTH_AND_FITNESS : 4188821.9853479853
TRAVEL_AND_LOCAL : 13984077.710144928
SOCIAL : 23253652.127118643
MEDICAL : 120550.61980830671
EDUCATION : 1833495.145631068
GAME : 15588015.603248259
HOUSE_AND_HOME : 1331540.5616438356
FOOD_AND_DRINK : 1924897.7363636363
PHOTOGRAPHY : 17840110.40229885
FINANCE : 1387692.475609756
ART_AND_DESIGN : 1986335.0877192982
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
DATING : 854028.8303030303
BOOKS_AND_REFERENCE : 8767811.894

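Among the categories listed above, COMMUNICATION has the highest average number of installs (about 38.5 million), followed by SOCIAL (about 23.3 million), PHOTOGRAPHY (about 17.8 million), PRODUCTIVITY (about 16.8 million), and GAME (about 15.6 million).
By average installs, the free Android apps in this data set are led by communication and social apps, with photography, productivity, and games next.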
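
Sorting the averages makes this kind of ranking easier to read than the unordered printout. The sketch below computes the per-category average in a single pass and sorts the result. The function names parse_installs and average_installs_by_category and the sample rows are hypothetical; the column positions (category at index 1, installs at index 5) follow the Google Play header shown earlier.

```python
def parse_installs(text):
    """Convert an install label such as '1,000,000+' to a float."""
    return float(text.replace('+', '').replace(',', ''))

def average_installs_by_category(rows, category_index=1, installs_index=5):
    """Return (category, average installs) pairs, largest average first."""
    totals = {}  # category -> sum of installs
    counts = {}  # category -> number of apps
    for row in rows:
        category = row[category_index]
        totals[category] = totals.get(category, 0.0) + parse_installs(row[installs_index])
        counts[category] = counts.get(category, 0) + 1
    averages = {category: totals[category] / counts[category] for category in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rows; only columns 1 and 5 matter here
sample = [
    ['A', 'TOOLS', '', '', '', '1,000+'],
    ['B', 'TOOLS', '', '', '', '3,000+'],
    ['C', 'SOCIAL', '', '', '', '1,000,000+'],
]
ranking = average_installs_by_category(sample)
for category, average in ranking:
    print(category, ':', average)
```

Because labels such as '1,000,000+' are open-ended lower bounds, averages computed this way are rough estimates rather than exact install counts.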