# Analyzing Mobile App Data
* Analyze the apps data to find characteristics that popular apps have in common.
* The goal is to give developers findings that help them attract users.

In [1]:
from csv import reader

# App Store data set (the file is assumed to be UTF-8 encoded)
with open('AppleStore.csv', encoding='utf8') as open_file:
    ios = list(reader(open_file))
ios_header = ios[0]
ios = ios[1:]

# Google Play data set (the file is assumed to be UTF-8 encoded)
with open('googleplaystore.csv', encoding='utf8') as open_file:
    android = list(reader(open_file))
android_header = android[0]
android = android[1:]

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(ios_header)
print('\n')
explore_data(ios, 1, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


This data set contains 7197 iOS apps and 16 columns. Of these, columns such as `'track_name'`, `'size_bytes'`, `'currency'`, `'price'`, `'rating_count_tot'`, and `'rating_count_ver'` are likely to be useful for this analysis.

In [4]:
print(android_header)
print('\n')
explore_data(android, 1, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


This data set contains 10841 Android apps and 13 columns. Of these, columns such as `'App'`, `'Category'`, `'Rating'`, `'Reviews'`, `'Size'`, `'Installs'`, and `'Price'` are likely to be useful for this analysis.
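
The later cells address columns by position (for example `app[1]` or `app[-5]`). As a hedged sketch, positions could instead be looked up by column name. `column_index` is a hypothetical helper that is not part of the notebook, and the two header lists are copied from the printed output above.

```python
# Header rows copied from the printed output above.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

def column_index(header, name):
    """Return the position of column `name`; raises ValueError if it is absent."""
    return header.index(name)

genre_idx = column_index(ios_header, 'prime_genre')
installs_idx = column_index(android_header, 'Installs')
print(genre_idx, installs_idx)
```

With 16 iOS columns, the position of `'prime_genre'` is the same column that the notebook reaches with the negative index `-5`.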

# Delete Wrong Data

In [5]:
#android apps
print(android[10472])
print('\n')
print(android_header)
print('\n')

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']




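The output above shows that the `'Rating'` of `'Life Made WI-Fi Touchscreen Photo Frame'` is 19. Ratings on the Google Play Store go up to 5 at most, so this entry is wrong. The row has only 12 values against 13 header columns: the `'Category'` value appears to be missing, so every later value is shifted one column to the left. The row is therefore removed from the data set.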
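
The printed row holds 12 values while the header has 13, so a length check is one way to locate rows like this without knowing their index in advance. The following is a minimal sketch, assuming the data is a list of lists as in this notebook. `find_malformed_rows` and the toy rows are hypothetical and only illustrate the idea.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose field count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data for illustration only (not taken from the real CSV files).
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # 'Category' is missing, so values shift left
    ['App C', 'GAME', '4.5'],
]

for index, row in find_malformed_rows(toy_rows, toy_header):
    print(index, row)
```

A check like this only catches rows with a wrong field count. A row with the right number of fields but an implausible value would need a separate range check.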

In [6]:
print(len(android))
del android[10472]
print(len(android))

10841
10840


In [7]:
#iOS apps
for row in ios:
    if len(row) != len(ios_header):  # find rows with the wrong number of columns
        print(row)

No row in the iOS data set has a column count other than 16, the length of the header.

# Removing Duplicate Entries

The Google Play data set contains some duplicate entries. For example, Instagram has four entries:

In [8]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [9]:
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))    

Number of duplicate apps: 1181


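The Google Play data set contains 1181 duplicate entries.
Picking one of the duplicate rows at random would be one option, but there is a better way. To explain it, take Instagram as an example again. The four rows printed earlier differ mainly in their fourth value, the number of reviews (two of the rows share the same count). This suggests that the data was collected at different times, so the row with the highest review count is most likely the most recent one.
For this reason, instead of picking at random, the row with the highest number of reviews is kept and the other duplicate rows are removed.

The first step is to build a dictionary.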

In [10]:
reviews_max = {}

for row in android:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


A loop builds the dictionary `reviews_max`, whose keys are app names and whose values are the highest review count seen for each app.
As expected, `len(reviews_max)` equals `len(android) - 1181`.

In [11]:
android_clean = []
already_added = []

for row in android:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


This produces a Google Play data set without duplicates (`android_clean`), together with a list of the 9659 app names (`already_added`).
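
The two cells above first record the maximum review count per app and then filter the rows in a second loop. As an alternative sketch (not the approach used above), the same selection could be done in one pass by keeping the best row per name in a dictionary. `dedupe_by_max_reviews` is a hypothetical helper, and the sample rows are shortened to four fields, with the review counts taken from the Instagram output earlier.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Shortened sample rows: name, category, rating and review count only.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644'],
]

deduped = dedupe_by_max_reviews(sample)
for row in deduped:
    print(row)
```

Dictionary lookups avoid the repeated list membership tests of `name not in already_added`. One difference to keep in mind is that the rows come back ordered by the first appearance of each name, which may not match the row order of `android_clean`.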

# Remove Non-English Apps

In [12]:
def check_string(a_string):
    a_number_of_non_English = 0
    for character in a_string:
        
        if ord(character) > 127:
            a_number_of_non_English += 1
    if a_number_of_non_English > 3:
        return False
    return True

print(check_string('Instagram'))
print(check_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_string('Docs To Go™ Free Office Suite'))
print(check_string('Instachat 😜'))

True
False
True
True


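Some useful English apps have titles containing characters such as 😜 or ™, whose code points are above 127. To keep them, an app is treated as an English app if its title contains at most three characters with a code point above 127. This function is not perfect, but it works well enough as a filter.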
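
The limit of three characters is a judgment call. As a small sketch, the same rule can be written with the limit as a parameter, which makes it easy to see how a stricter or looser setting changes the result. `is_probably_english` and `max_non_ascii` are hypothetical names, not part of the notebook.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

titles = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播',
          'Docs To Go™ Free Office Suite', 'Instachat 😜']

for title in titles:
    # Compare the notebook's limit (3) with a strict ASCII-only rule (0).
    print(title, is_probably_english(title), is_probably_english(title, max_non_ascii=0))
```

With a limit of 0, titles such as `'Instachat 😜'` would be dropped, which is the loss the three-character tolerance is meant to avoid.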

In [13]:
android_English = []
ios_English = []

for app in android_clean:
    if check_string(app[0]):
        android_English.append(app)
        
for app in ios:
    if check_string(app[1]):
        ios_English.append(app)
        
explore_data(android_English, 0, 3, True)
print('\n')
explore_data(ios_English, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


After filtering, 9614 Android apps and 6183 iOS apps are counted as English.

# Isolating the Free Apps

In [14]:
android_final = []
ios_final = []

for app in android_English:
    app_type = app[6]  # 'Type' column
    if app_type == 'Free':
        android_final.append(app)
        
for app in ios_English:
    price = float(app[4])
    if price == 0:
        ios_final.append(app)
        
explore_data(android_final, 0, 3, True)
print('\n')
explore_data(ios_final, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8863
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


After cleaning, 8863 free English Android apps and 3222 free English iOS apps remain.

In [15]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        if row[index] in table:
            table[row[index]] += 1
        else:
            table[row[index]] = 1
            
    table_percentage = {}
    for key in table:
        percentage = table[key] / total * 100
        table_percentage[key] = percentage
        
    return table_percentage

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True) # descending order
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
    
display_table(ios_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


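Among the free English apps, `Games` account for more than half (about 58%), and `Entertainment` comes second at roughly 8%. Every other genre is below 5%. Broadly speaking, most apps are meant for entertainment, while practical apps are far less common. However, the fact that entertainment apps are the most numerous does not necessarily mean that this genre also has the most users.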
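
The percentages above come from `freq_table`, which counts values with a plain dictionary. As a sketch of the same calculation using the standard library, `collections.Counter` can do the counting. `freq_table_counter` and the toy rows are hypothetical and only meant to illustrate the technique.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

# Toy rows for illustration only; the genre is the last field here.
toy_apps = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Games'],
    ['App D', 'Education'],
]

table = freq_table_counter(toy_apps, -1)
# Sort by percentage, highest first, as display_table does.
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```

Sorting on the percentage with a `key` function avoids building the intermediate list of `(value, key)` tuples.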

In [16]:
display_table(android_final, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0.6431230960171499
COMICS : 0.6205573733498815
BEAUTY : 0.5979916506826132

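On the Google Play Store, `'FAMILY'` is the most common category. `'GAME'` is far less dominant than `Games` is on the App Store. Practical categories (TOOLS, BUSINESS, LIFESTYLE, PRODUCTIVITY, FINANCE, etc.), on the other hand, make up a larger share than they do on the App Store. Entertainment apps and practical apps are represented to a similar degree.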

In [17]:
display_table(android_final, -4)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

The `'Genres'` column is split far more finely than `'Category'`.

# Most Popular Apps by Genre on App Store

In [18]:
ios_genres_fb = freq_table(ios_final, -5)
print(ios_genres_fb)

{'Finance': 1.1173184357541899, 'Productivity': 1.7380509000620732, 'Catalogs': 0.12414649286157665, 'Book': 0.4345127250155183, 'Education': 3.662321539416512, 'Business': 0.5276225946617008, 'Shopping': 2.60707635009311, 'Navigation': 0.186219739292365, 'Food & Drink': 0.8069522036002483, 'Games': 58.16263190564867, 'Reference': 0.5586592178770949, 'Health & Fitness': 2.0173805090006205, 'Utilities': 2.5139664804469275, 'Sports': 2.1415270018621975, 'News': 1.3345747982619491, 'Entertainment': 7.883302296710118, 'Weather': 0.8690254500310366, 'Photo & Video': 4.9658597144630665, 'Lifestyle': 1.5828677839851024, 'Social Networking': 3.2898820608317814, 'Music': 2.0484171322160147, 'Travel': 1.2414649286157666, 'Medical': 0.186219739292365}


In [19]:
for genre in ios_genres_fb:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            a_number_of_user_rating = float(app[5])
            total += a_number_of_user_rating
            len_genre += 1
        
    average_num_user_rating = total / len_genre
    print(genre, ':', average_num_user_rating)
    

Finance : 31467.944444444445
Productivity : 21028.410714285714
Catalogs : 4004.0
Book : 39758.5
Education : 7003.983050847458
Business : 7491.117647058823
Shopping : 26919.690476190477
Navigation : 86090.33333333333
Food & Drink : 33333.92307692308
Games : 22788.6696905016
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Utilities : 18684.456790123455
Sports : 23008.898550724636
News : 21248.023255813954
Entertainment : 14029.830708661417
Weather : 52279.892857142855
Photo & Video : 28441.54375
Lifestyle : 16485.764705882353
Social Networking : 71548.34905660378
Music : 57326.530303030304
Travel : 28243.8
Medical : 612.0


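Averaging the total number of user ratings per genre, `'Navigation'` comes out highest. This is probably because a few apps such as Google Maps have a very large number of users: the genre accounts for only about 0.19% of the 3222 apps, that is, about 6 apps, so a handful of heavily rated apps dominate the average.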
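
The cell above scans the whole data set once per genre. As a hedged alternative sketch, the per-genre averages could be accumulated in a single pass with two dictionaries. `average_by_group` is a hypothetical helper and the toy numbers are made up for illustration.

```python
from collections import defaultdict

def average_by_group(rows, group_idx, value_idx):
    """Return {group: mean of the numeric column} computed in a single pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (name, rating count, genre); the numbers are made up.
toy_apps = [
    ['Maps X', '300000', 'Navigation'],
    ['Maps Y', '100', 'Navigation'],
    ['Game Z', '5000', 'Games'],
]

averages = average_by_group(toy_apps, group_idx=2, value_idx=1)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

The toy data also shows how one heavily rated app can lift the average of a genre that contains very few apps.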

# Most Popular Apps by Genre on Google Play

In [20]:
android_category_ft = freq_table(android_final, 1)
print(android_category_ft)

{'EDUCATION': 1.1621347173643235, 'HEALTH_AND_FITNESS': 3.0802211440821394, 'EVENTS': 0.7108202640189552, 'HOUSE_AND_HOME': 0.8236488773552973, 'BEAUTY': 0.5979916506826132, 'MEDICAL': 3.5315355974275078, 'MAPS_AND_NAVIGATION': 1.399074805370642, 'COMICS': 0.6205573733498815, 'SOCIAL': 2.6627552747376737, 'PARENTING': 0.6544059573507841, 'PRODUCTIVITY': 3.8925871601038025, 'VIDEO_PLAYERS': 1.7939749520478394, 'TRAVEL_AND_LOCAL': 2.335552296062281, 'LIFESTYLE': 3.9038700214374367, 'GAME': 9.725826469592688, 'FAMILY': 18.898792733837304, 'SPORTS': 3.396141261423897, 'AUTO_AND_VEHICLES': 0.9251946293580051, 'ART_AND_DESIGN': 0.6431230960171499, 'DATING': 1.8616721200496444, 'BOOKS_AND_REFERENCE': 2.1437436533904997, 'BUSINESS': 4.592124562789123, 'PHOTOGRAPHY': 2.944826808078529, 'WEATHER': 0.8010831546880289, 'FOOD_AND_DRINK': 1.241114746699763, 'TOOLS': 8.462146000225657, 'FINANCE': 3.7007785174320205, 'PERSONALIZATION': 3.317161232088458, 'COMMUNICATION': 3.2381812027530184, 'NEWS_AND_

In [22]:
for category in android_category_ft:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

EDUCATION : 1833495.145631068
HEALTH_AND_FITNESS : 4188821.9853479853
EVENTS : 253542.22222222222
HOUSE_AND_HOME : 1331540.5616438356
BEAUTY : 513151.88679245283
MEDICAL : 120550.61980830671
MAPS_AND_NAVIGATION : 4056941.7741935486
COMICS : 817657.2727272727
SOCIAL : 23253652.127118643
PARENTING : 542603.6206896552
PRODUCTIVITY : 16787331.344927534
VIDEO_PLAYERS : 24727872.452830188
TRAVEL_AND_LOCAL : 13984077.710144928
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
SPORTS : 3638640.1428571427
AUTO_AND_VEHICLES : 647317.8170731707
ART_AND_DESIGN : 1986335.0877192982
DATING : 854028.8303030303
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
PHOTOGRAPHY : 17840110.40229885
WEATHER : 5074486.197183099
FOOD_AND_DRINK : 1924897.7363636363
TOOLS : 10801391.298666667
FINANCE : 1387692.475609756
PERSONALIZATION : 5201482.6122448975
COMMUNICATION : 38456119.167247385
NEWS_AND_MAGAZINES : 9549178.467741935
SHOPPING : 7036877.311557789
L

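`'COMMUNICATION'` apps have the highest average number of installs. This is probably because a few apps with enormous user bases (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, etc.) have extremely large install counts, that is, outliers, which skew the average.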
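
One way to see how strongly a few outliers pull the average is to compare the mean with the median. The following is a minimal sketch using the standard `statistics` module. `parse_installs` is a hypothetical helper that mirrors the string cleaning in the cell above, and the install counts are made up for illustration.

```python
from statistics import mean, median

def parse_installs(text):
    """Convert a string such as '1,000,000+' to a float."""
    return float(text.replace('+', '').replace(',', ''))

# Made-up install counts for one category: one giant app and several small ones.
toy_installs = ['1,000,000,000+', '500,000+', '100,000+', '10,000+', '1,000+']
values = [parse_installs(text) for text in toy_installs]

print('mean  :', mean(values))
print('median:', median(values))
```

In this toy example the mean is about 200 million while the median is 100,000, so the median gives a more typical picture of a category dominated by one very large app.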