In [1]:
import os
import pandas as pd

In [2]:
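def explore_data(folder_path):
    data = []
    # Sort the file names: os.listdir() returns them in arbitrary order,
    # and the caller unpacks the result by position.
    for file_name in sorted(os.listdir(folder_path)):
        file_path = os.path.join(folder_path, file_name)
        df = pd.read_csv(file_path)
        data.append(df)
    return data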

In [3]:
folder_path = 'dataset'

ios, android = explore_data(folder_path=folder_path)

print(f'Ios dataset:\nTotal rows: {len(ios)}, Total columns: {len(ios.columns)}')
print(f'Android dataset:\nTotal rows: {len(android)}, Total columns: {len(android.columns)}')

Ios dataset:
Total rows: 7197, Total columns: 16
Android dataset:
Total rows: 10841, Total columns: 13
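
Unpacking the result as `ios, android` depends on the position of each file in the directory listing. A minimal sketch of an alternative, assuming it is acceptable to key each DataFrame by its file name; the helper name `load_csvs_by_name` is hypothetical and not part of the original notebook:

```python
import os
import pandas as pd


def load_csvs_by_name(folder_path):
    """Read every CSV in folder_path into a dict keyed by file name without extension."""
    frames = {}
    for file_name in sorted(os.listdir(folder_path)):
        stem, ext = os.path.splitext(file_name)
        if ext.lower() != '.csv':
            continue  # skip anything that is not a CSV file
        frames[stem] = pd.read_csv(os.path.join(folder_path, file_name))
    return frames
```

Each dataset can then be looked up by the name of its file instead of by position, so adding or renaming a file in the folder cannot silently swap the two datasets.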


## **Data Preprocessing**

### Removing inaccurate data

The Google Play data set has a dedicated discussion section, and one of the discussions outlines an error for row 10472.

In [4]:
print(android.iloc[10472])

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                               19.0
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                        NaN
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                           NaN
Name: 10472, dtype: object


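We can see that row 10472 has errors in at least two columns:
- `Category`: it must be a string, but it holds the number `1.9`.
- `Rating`: the value must lie within $[0, 5]$, so it cannot be $19.0$.

The values also appear shifted one column to the left (for example, `Installs` holds `Free` and `Price` holds `Everyone`), which suggests the `Category` value is missing from the source row.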

We don't know the correct information, so we'll delete this row.

In [5]:
android = android.drop(index=10472)
print(len(android))

10840
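
Rows like this can also be found without knowing their index in advance. The following is a small sketch on made-up data (the `sample` frame and its values are hypothetical; only the column names mirror the Google Play data). It flags every row whose `Rating` falls outside the valid range:

```python
import pandas as pd

# Hypothetical sample; only the column names mirror the Google Play data.
sample = pd.DataFrame({
    'App': ['App A', 'App B', 'App C'],
    'Rating': [4.1, 19.0, float('nan')],
})

# between() is inclusive on both ends by default and returns False for NaN,
# so missing ratings are excluded explicitly to avoid flagging them as invalid.
out_of_range = sample[~sample['Rating'].between(0, 5) & sample['Rating'].notna()]
print(out_of_range)
```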


### Removing duplicate app entries

Users in the dataset's [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion) have pointed out that some apps have duplicate entries.

For example, let's see how many entries the Instagram app has:

In [6]:
instagram_android = android[android.iloc[:,0] == 'Instagram']
instagram_android

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2545 | Instagram | SOCIAL | 4.5 | 66577313 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |
| 2604 | Instagram | SOCIAL | 4.5 | 66577446 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |
| 2611 | Instagram | SOCIAL | 4.5 | 66577313 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |
| 3909 | Instagram | SOCIAL | 4.5 | 66509917 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |


Check out the number of similar cases in the Android dataset:

In [7]:
duplicate_apps, unique_apps = [], set()

# Every occurrence of a name after the first counts as a duplicate.
for name in android.iloc[:, 0]:
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.add(name)

print('Number of duplicate apps:', len(duplicate_apps))

Number of duplicate apps: 1181
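
The same kind of count can be obtained with pandas directly. A minimal sketch on hypothetical app names: `Series.duplicated()` marks every occurrence of a value after the first, which is the definition of a duplicate used in the cell above.

```python
import pandas as pd

# Hypothetical app names standing in for the `App` column.
names = pd.Series(['Instagram', 'Slack', 'Instagram', 'Instagram', 'Slack', 'Zoom'])

# With the default keep='first', the first occurrence of each name is not flagged.
n_duplicates = int(names.duplicated().sum())
print('Number of duplicate apps:', n_duplicates)
```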


To build an Android dataset with one entry per app, we will create a dictionary where each key is a unique app name and the corresponding value is the **highest number** of reviews recorded for that app.

In [8]:
reviews_max = {}

for app in range(len(android)):

    name = android.iloc[app, 0]
    n_reviews = float(android.iloc[app, 3])

    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews

    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


Comparing our result with the expected length:

In [9]:
len(reviews_max) == (len(android) - 1181)

True

Now, let's use the `reviews_max` dictionary to remove the duplicates.

In [10]:
android_cleaned, already_added = [], []

for app in range(len(android)):
    name = android.iloc[app, 0]
    n_reviews = float(android.iloc[app, 3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_cleaned.append(android.iloc[app, :])
        already_added.append(name)

In [11]:
len(android_cleaned)

9659
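
For comparison, the two loops above can be condensed into a few pandas calls. This is a sketch on hypothetical data rather than the notebook's method, and it assumes `Reviews` is stored as text. It sorts by review count in descending order and keeps the first row per app name; the stable sort means ties resolve to the earliest row, as in the loop version.

```python
import pandas as pd

# Hypothetical rows; `Reviews` is assumed to be stored as text.
apps = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Slack', 'Instagram'],
    'Reviews': ['66577313', '66577446', '51510', '66509917'],
})

deduplicated = (
    apps.assign(n_reviews=apps['Reviews'].astype(float))
        .sort_values('n_reviews', ascending=False, kind='stable')
        .drop_duplicates(subset='App', keep='first')
        .drop(columns='n_reviews')
        .sort_index()  # restore the original row order
)
print(deduplicated)
```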

<br>

The App Store data has no duplicate rows, so we can move on to the next step.

In [12]:
print(ios.duplicated(keep=False).sum())

0


### Removing Non-English apps

To make the dataset more personalized and relevant to the needs of companies, we will build a dataset containing only apps with English names.

According to the ASCII (American Standard Code for Information Interchange) system, the characters commonly used in English text all have codes in the range 0 to 127.

However, there are exceptions. Some English app names contain characters outside this range, such as emoji or symbols like ™, so a check that rejects every non-ASCII character would also discard these apps.

Therefore, to limit data loss, we set a threshold before deleting any app. After reviewing a number of apps in both stores, we chose `threshold = 3`, which means we only delete an app if its name has more than 3 non-ASCII characters.

In [13]:
def is_english(string):
    non_ascii = 0

    for letters in string:
        if ord(letters) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True            

In [14]:
english_android, english_ios = [], []

for app in android_cleaned:
    name = app.iloc[0]  # positional access; integer keys on a labeled Series are deprecated
    if is_english(name):
        english_android.append(app)

for app in range(len(ios)):
    name = ios.iloc[app, 1]
    if is_english(name):
        english_ios.append(ios.iloc[app])     

print(len(english_ios), len(english_android))           

6183 9614


### Isolating the free apps

In [15]:
# Check data types of price column

print(f"ios: {pd.DataFrame(english_ios)['price'].dtype}")
print(f"android: {pd.DataFrame(english_android)['Price'].dtype}")

ios: float64
android: object


In [16]:
final_ios, final_android = [], []

for app in english_ios:
    price = app.iloc[4]  # `price` column, stored as float
    if price == 0.0:
        final_ios.append(app)

for app in english_android:
    price = app.iloc[7]  # `Price` column, stored as text
    if price == '0':
        final_android.append(app)

print(len(final_ios), len(final_android))

3222 8864
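
Because the Android prices are stored as text, comparing against the string `'0'` works only as long as free apps are written exactly that way. A hedged sketch of a numeric comparison, assuming paid prices look like `'$4.99'` (an assumption about the format; the sample rows are hypothetical):

```python
import pandas as pd

# Hypothetical rows illustrating the assumed text format of `Price`.
sample = pd.DataFrame({
    'App': ['Free App', 'Paid App', 'Another Free App'],
    'Price': ['0', '$4.99', '0'],
})

# Strip a leading dollar sign, then convert; values that cannot be parsed become NaN.
price_usd = pd.to_numeric(sample['Price'].str.lstrip('$'), errors='coerce')
free_apps = sample[price_usd == 0]
print(free_apps)
```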


## **Classify the most common apps by genre**

Our goal is to identify the type of app that is likely to attract more users, which will increase our revenue.

In my country, Android is more accessible than iOS to people across many different jobs. Therefore, our validation strategy for an app idea has three steps, building on Android first and then on iOS:
1. First, we build a minimal Android version of the app and add it to Google Play.
2. If we get a lot of positive feedback from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version and add it to the App Store.

### Build the functions

Now, we will build functions to analyze the frequency tables of our dataset:

In [17]:
def freq_table(dataset, index):
    '''
    Generate frequency tables that show percentages 
    '''
    table, total = {}, 0
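    for row in dataset:
        total += 1
        # Each row is a pandas Series; use positional access, since integer keys
        # on a labeled Series are deprecated.
        value = row.iloc[index]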
        if value in table:
            table[value] += 1
        else:
            table[value] = 1

    percentage_table = {}
    for key in table:
        percentage_table[key] = table[key] * 100 / total

    return percentage_table

In [18]:
def display_table(dataset, index):
    '''
    Display the percentages in a descending order
    '''
    table = freq_table(dataset, index)
    displayed_table = []
    for key in table:
        displayed_table.append((table[key], key))
    
    sorted_table = sorted(displayed_table, reverse=True)
    for entry in sorted_table:
        print(f'{entry[1]}: {entry[0]}')
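
When the data is held in a pandas Series or DataFrame column rather than a list of rows, the same percentages are available in one call. A small sketch with hypothetical genres: `value_counts(normalize=True)` returns relative frequencies sorted in descending order.

```python
import pandas as pd

# Hypothetical genre column.
genres = pd.Series(['Games', 'Games', 'Education', 'Games', 'Music'])

# normalize=True gives fractions that sum to 1; multiply by 100 for percentages.
percentages = genres.value_counts(normalize=True) * 100
print(percentages)
```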

### Examining

Let's start by examining the frequency table for the `prime_genre` column of the App Store:

In [19]:
display_table(final_ios, -5)

Games: 58.16263190564867
Entertainment: 7.883302296710118
Photo & Video: 4.9658597144630665
Education: 3.6623215394165114
Social Networking: 3.2898820608317814
Shopping: 2.60707635009311
Utilities: 2.5139664804469275
Sports: 2.1415270018621975
Music: 2.0484171322160147
Health & Fitness: 2.017380509000621
Productivity: 1.7380509000620732
Lifestyle: 1.5828677839851024
News: 1.3345747982619491
Travel: 1.2414649286157666
Finance: 1.1173184357541899
Weather: 0.8690254500310366
Food & Drink: 0.8069522036002483
Reference: 0.5586592178770949
Business: 0.5276225946617008
Book: 0.4345127250155183
Navigation: 0.186219739292365
Medical: 0.186219739292365
Catalogs: 0.12414649286157665


Looking at the data, we see that more than half are games (58.16%). Other apps only contribute a small percentage, such as Entertainment coming in second with around 8%, followed by Photo & Video apps with almost 5%. There is a sharp decrease with only 3.66% for educational apps, 3.29% for Social Networking apps, etc.

It is easy to see that many of the free English apps are designed for entertainment (games, entertainment, etc.), while practical apps are rarer. However, this does not mean that entertainment apps also have the largest number of users.

That is why we continue with the `Category` and `Genres` columns of the Google Play Store for more details.

In [20]:
display_table(final_android, 1)

FAMILY: 18.907942238267147
GAME: 9.724729241877256
TOOLS: 8.461191335740072
BUSINESS: 4.591606498194946
LIFESTYLE: 3.9034296028880866
PRODUCTIVITY: 3.892148014440433
FINANCE: 3.700361010830325
MEDICAL: 3.5311371841155235
SPORTS: 3.395758122743682
PERSONALIZATION: 3.3167870036101084
COMMUNICATION: 3.237815884476534
HEALTH_AND_FITNESS: 3.079873646209386
PHOTOGRAPHY: 2.9444945848375452
NEWS_AND_MAGAZINES: 2.7978339350180503
SOCIAL: 2.6624548736462095
TRAVEL_AND_LOCAL: 2.33528880866426
SHOPPING: 2.2450361010830324
BOOKS_AND_REFERENCE: 2.1435018050541514
DATING: 1.861462093862816
VIDEO_PLAYERS: 1.7937725631768953
MAPS_AND_NAVIGATION: 1.3989169675090252
FOOD_AND_DRINK: 1.2409747292418774
EDUCATION: 1.1620036101083033
ENTERTAINMENT: 0.9589350180505415
LIBRARIES_AND_DEMO: 0.9363718411552346
AUTO_AND_VEHICLES: 0.9250902527075813
HOUSE_AND_HOME: 0.8235559566787004
WEATHER: 0.8009927797833934
EVENTS: 0.7107400722021661
PARENTING: 0.6543321299638989
ART_AND_DESIGN: 0.6430505415162455
COMICS: 0.620

Surprisingly, unlike in the App Store, a good share of the apps in the Google Play Store are designed for practical purposes (Family, Tools, Business, etc.).

However, if we dig deeper, we can see that the family category (19%) is mainly made up of games for kids.
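
One way to check a claim like this is to list the genres that appear inside the category. A sketch on hypothetical rows (the apps and counts below are made up; only the column names `Category` and `Genres` come from the dataset):

```python
import pandas as pd

# Hypothetical rows; they do not reflect the real distribution.
sample = pd.DataFrame({
    'App': ['Kids Puzzle', 'Toddler Racing', 'Family Locator', 'Word Game'],
    'Category': ['FAMILY', 'FAMILY', 'FAMILY', 'GAME'],
    'Genres': ['Puzzle', 'Racing', 'Lifestyle', 'Word'],
})

# Count the genres of the apps filed under the FAMILY category.
family_genres = sample.loc[sample['Category'] == 'FAMILY', 'Genres'].value_counts()
print(family_genres)
```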

<br>

Now let's look at the `Genres` column:

In [21]:
display_table(final_android, -4)

Tools: 8.449909747292418
Entertainment: 6.069494584837545
Education: 5.347472924187725
Business: 4.591606498194946
Productivity: 3.892148014440433
Lifestyle: 3.892148014440433
Finance: 3.700361010830325
Medical: 3.5311371841155235
Sports: 3.463447653429603
Personalization: 3.3167870036101084
Communication: 3.237815884476534
Action: 3.1024368231046933
Health & Fitness: 3.079873646209386
Photography: 2.9444945848375452
News & Magazines: 2.7978339350180503
Social: 2.6624548736462095
Travel & Local: 2.3240072202166067
Shopping: 2.2450361010830324
Books & Reference: 2.1435018050541514
Simulation: 2.041967509025271
Dating: 1.861462093862816
Arcade: 1.8501805054151625
Video Players & Editors: 1.7712093862815885
Casual: 1.759927797833935
Maps & Navigation: 1.3989169675090252
Food & Drink: 1.2409747292418774
Puzzle: 1.128158844765343
Racing: 0.9927797833935018
Role Playing: 0.9363718411552346
Libraries & Demo: 0.9363718411552346
Auto & Vehicles: 0.9250902527075813
Strategy: 0.9138086642599278
H

#### **| Conclusion**

So far, we’ve seen that the App Store data is dominated by fun apps, while Google Play shows a more balanced mix of practical and entertaining apps.