In [1]:
import os
import pandas as pd

In [4]:
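def explore_data(folder_path):
    """Read every CSV file in folder_path into a list of DataFrames."""
    data = []
    # os.listdir() returns entries in arbitrary order; sorting makes the
    # order reproducible, which the tuple unpacking of the result relies on.
    for file_name in sorted(os.listdir(folder_path)):
        if not file_name.endswith('.csv'):
            continue  # skip anything that is not a CSV file
        file_path = os.path.join(folder_path, file_name)
        df = pd.read_csv(file_path)
        data.append(df)
    return data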

In [8]:
folder_path = 'dataset'

ios, android = explore_data(folder_path=folder_path)

print(f'iOS dataset:\nTotal rows: {len(ios)}, Total columns: {len(ios.columns)}')
print(f'Android dataset:\nTotal rows: {len(android)}, Total columns: {len(android.columns)}')

iOS dataset:
Total rows: 7197, Total columns: 16
Android dataset:
Total rows: 10841, Total columns: 13
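
Unpacking the loader's result into `ios, android` depends on the order in which the files are read. As a minimal sketch of a more explicit alternative, the frames could be stored in a dictionary keyed by file name. The helper `load_csvs` and the demo file names below are hypothetical and are not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csvs(folder_path):
    """Return {file stem: DataFrame} for every CSV file in folder_path."""
    folder = Path(folder_path)
    return {path.stem: pd.read_csv(path) for path in sorted(folder.glob("*.csv"))}


# Demo with two tiny, made-up CSV files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "ios_demo.csv").write_text("id,name\n1,A\n2,B\n")
    Path(tmp, "android_demo.csv").write_text("App,Rating\nX,4.5\n")
    frames = load_csvs(tmp)

# Each dataset is selected by name, so directory order no longer matters.
ios_demo = frames["ios_demo"]
android_demo = frames["android_demo"]
print(ios_demo.shape, android_demo.shape)
```

With this layout, a missing or renamed file raises a `KeyError` instead of silently swapping the two datasets.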


## **Data Preprocessing**

### **Deleting Incorrect Data**

The Google Play data set has a dedicated discussion section, and one of the discussions outlines an error for row 10472.

In [9]:
print(android.iloc[10472])

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                               19.0
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                        NaN
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                           NaN
Name: 10472, dtype: object


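We can see that row 10472 has mistakes in at least two columns:
- `Category`: it must be a category name (a string), not the number `1.9`.
- `Rating`: the value must lie within $[0, 5]$, so it cannot be $19.0$.

The remaining values also appear to be shifted one column to the left (for example, `Genres` holds a date), which suggests that the `Category` value is missing from this row.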

We don't know the correct information, so we'll delete this row.

In [60]:
android = android.drop(index=10472)
print(len(android))

10840
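
Rows like this can also be located by checking the rating range instead of relying on a known row number. The following is a minimal sketch on made-up data; the frame `demo` is illustrative only, and treating missing ratings as acceptable is an assumption.

```python
import pandas as pd

# Made-up sample: one valid rating, one out of range, one missing.
demo = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, 19.0, None],
})

# between() is inclusive on both ends and returns False for NaN,
# so missing ratings are excluded explicitly with notna().
out_of_range = demo[~demo["Rating"].between(0, 5) & demo["Rating"].notna()]

# Drop the offending rows by index label, as in the cell above.
cleaned = demo.drop(index=out_of_range.index)
print(out_of_range)
print(len(cleaned))
```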


### **Removing Duplicates**

Users in the data set's [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion) have pointed out that some apps have duplicate entries.

For example, let's see how many entries the Instagram app has:

In [19]:
instagram_android = android[android.iloc[:,0] == 'Instagram']
instagram_android

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2545 | Instagram | SOCIAL | 4.5 | 66577313 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |
| 2604 | Instagram | SOCIAL | 4.5 | 66577446 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |
| 2611 | Instagram | SOCIAL | 4.5 | 66577313 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |
| 3909 | Instagram | SOCIAL | 4.5 | 66509917 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |


Let's count how many duplicate entries the Android dataset contains:

In [26]:
duplicate_apps, unique_apps = [], []

# The first occurrence of a name is recorded as unique; any later one is a duplicate
for name in android.iloc[:, 0]:
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))

Number of duplicate apps: 1181
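
The same count can be obtained with pandas directly. This is a small sketch on made-up app names, not the notebook's data: `Series.duplicated()` flags every occurrence after the first, which matches the loop above.

```python
import pandas as pd

# Made-up sample: "Instagram" appears three times, "Maps" twice.
demo = pd.DataFrame(
    {"App": ["Instagram", "Maps", "Instagram", "Instagram", "Maps", "Notes"]}
)

# duplicated() marks every repeat after the first occurrence (keep='first').
n_duplicates = int(demo["App"].duplicated().sum())
n_unique = demo["App"].nunique()

print("Number of duplicate apps:", n_duplicates)
print("Number of unique apps:", n_unique)
```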


To build an Android dataset with one entry per app, we'll create a dictionary in which each key is a unique app name and the corresponding value is the **highest number** of reviews recorded for that app. We keep the entry with the most reviews on the assumption that it is the most recent one.

In [68]:
reviews_max = {}

for app in range(len(android)):
    name = android.iloc[app, 0]               # column 0 is App
    n_reviews = float(android.iloc[app, 3])   # column 3 is Reviews

    # Keep the largest review count seen so far for each app name
    if (name not in reviews_max) or (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


Let's compare the result with the expected length, that is, the number of rows minus the 1181 duplicates:

In [69]:
len(reviews_max) == (len(android) - 1181)

True
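
As a cross-check, the same mapping can be sketched with a `groupby`. The example below uses made-up rows, with review counts stored as strings (an assumption that mirrors the `float(...)` cast in the loop above); `demo` and `reviews_max_demo` are illustrative names.

```python
import pandas as pd

# Made-up sample with review counts stored as text.
demo = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Maps"],
    "Reviews": ["66577313", "66577446", "100"],
})

# Convert to numbers, then take the maximum review count per app name.
reviews_max_demo = (
    pd.to_numeric(demo["Reviews"])
    .groupby(demo["App"])
    .max()
    .to_dict()
)
print(reviews_max_demo)
```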

Now, let's use the `reviews_max` dictionary to remove the duplicates.

In [70]:
android_cleaned, already_added = [], []

for app in range(len(android)):
    name = android.iloc[app, 0]
    n_reviews = float(android.iloc[app, 3])
    
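    # The second condition handles apps whose maximum review count appears
    # in more than one row: only the first such row is kept.
    if (reviews_max[name] == n_reviews) and (name not in already_added):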
        android_cleaned.append(android.iloc[app, :])
        already_added.append(name)

In [72]:
len(android_cleaned)

9659
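
Note that `android_cleaned` is a list of rows rather than a DataFrame. A more compact alternative, sketched here on made-up data, sorts by review count and keeps the first row per app. The stable sort means ties resolve to the earliest row, as in the loop above. The column name `n_reviews` is a temporary helper introduced only for this illustration.

```python
import pandas as pd

# Made-up sample with review counts stored as text.
demo = pd.DataFrame({
    "App": ["Instagram", "Maps", "Instagram", "Instagram"],
    "Reviews": ["66577313", "100", "66577446", "66509917"],
})

deduped = (
    demo.assign(n_reviews=pd.to_numeric(demo["Reviews"]))
    # Highest review count first; a stable sort preserves row order among ties.
    .sort_values("n_reviews", ascending=False, kind="stable")
    # Keep only the first (highest-review) row for each app name.
    .drop_duplicates(subset="App", keep="first")
    .drop(columns="n_reviews")
    .sort_index()  # restore the original row order
)
print(deduped)
```

The result stays a DataFrame, so later steps can keep using column-based operations.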

## **Removing Non-English apps**