# App Store Analysis

## What is this project about?
This project analyzes data from the Apple App Store and the Google Play Store.

## Goal:
The goal of the project is to understand the nature of each app store and whether a potential app on each platform should be paid or free. We also want to understand which app genres are popular on each platform to help inform our development.



## Exploring Data

### Data Details:
Each of the two platforms hosts over 2 million apps. A dataset of that size is much larger than we need, so this project uses a subset of data that has already been collected.

The [Google Play Store data](https://www.kaggle.com/datasets/lava18/google-play-store-apps) is a subset of about 10,000 apps scraped in 2018.
The [Apple App Store data](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) is a subset of about 7,200 apps collected in July 2017.

Though they are a bit outdated and small, they demonstrate the use of Python to analyze data, answer questions, and potentially reach a conclusion.

### Opening the Data


We'll start exploring the data by creating a function that takes a dataset (a list of lists), a start index and an end index, and a flag for whether we want to print the total number of rows and columns (False by default).

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [22]:
from csv import reader

# Read the whole CSV into a list of rows; the with block closes the file afterwards
with open('AppleStore.csv', encoding='utf8') as opened_file:
    app_store_data = list(reader(opened_file))
explore_data(app_store_data, 0, 10, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061

We can see that the App Store data set has 7,198 rows (including the header row) and 16 columns. Some columns that might help our analysis are: 'track_name', 'price', 'rating_count_ver', 'rating_count_tot', 'prime_genre'.

Now let's look at the Play Store data set.

In [23]:
# reader was imported from csv in the previous cell
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    play_store_data = list(reader(opened_file))
explore_data(play_store_data, 0, 10, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

We can see that the Play Store data set has 10,842 rows (including the header row) and 13 columns. Some columns that might help our analysis are: 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', 'Genres'.
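
The later cells refer to columns by position (for example, app[7] for the Play Store price). As a small sketch, those positions can be looked up from the header row instead of being counted by hand. The helper name column_indices is hypothetical and not part of the original notebook; the header list is copied from the output above.

```python
def column_indices(header):
    """Map each column name to its position in a row."""
    return {name: index for index, name in enumerate(header)}


# Header row copied from the Play Store output above
play_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
               'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
               'Android Ver']

play_cols = column_indices(play_header)
print(play_cols['App'], play_cols['Reviews'], play_cols['Price'])  # 0 3 7
```

With such a mapping, app[play_cols['Price']] reads the same value as app[7] while naming the column it comes from.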

### Deleting wrong data

In the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) we can see that there is an error in the Play Store data set at row 10472. Because our list also holds the header at index 0, that row sits at index 10473. Let's take a quick look at this row and compare it to the header and another row that is correct.

In [32]:
print(play_store_data[0])
print(play_store_data[10473]) # incorrect row
print('\n', play_store_data[1]) # correct row

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

 ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


We can see that for the 'Life Made WI-Fi Touchscreen Photo Frame' app, the 'Rating' column reads 19, but the maximum rating allowed is 5. This is caused by a problem in the 'Category' column: its value is missing, so every later value is shifted one column to the left and the row has 12 values instead of 13. As a result, we'll delete this row.

In [33]:
print(len(play_store_data))
del play_store_data[10473]  # run this cell only once, or a valid row is deleted too
print(len(play_store_data))

10842
10841
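
Rows like this can also be found without relying on the forum post, by comparing the length of each row to the length of the header. The following is a minimal sketch on a made-up miniature data set; the function name find_malformed_rows is hypothetical.

```python
def find_malformed_rows(dataset):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Made-up data: the last row is missing its 'Category' value
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Broken App', '19'],
]

print(find_malformed_rows(sample))  # [2]
```

A length check only catches rows with missing or extra values; a row with the right number of values but a wrong entry would still need a separate check.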


### Identifying and deleting duplicates

Now that we have solved the issue of the bad row, we want to remove any duplicate entries in the data. This means searching for potential duplicates first (based on app name) and then deleting them.

In [35]:
duplicate_apps = []
unique_apps = []

for app in play_store_data[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('Number of unique apps:', len(unique_apps))

Number of duplicate apps: 1181
Number of unique apps: 9659


We found 1,181 duplicate entries in the Play Store data. Rather than deleting them at random, we'll use a dictionary to decide which row to keep for each app. First, let's take a look at an example app to see how its duplicate rows actually differ.
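
Checking `name in unique_apps` against a list scans the list on every lookup, which gets slow as the list grows. A possible alternative, sketched here on made-up names, is to track the names already seen in a set. The function name count_duplicates is hypothetical.

```python
def count_duplicates(names):
    """Split names into repeated occurrences and the set of distinct names."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)  # second or later occurrence
        else:
            seen.add(name)
    return duplicates, seen


sample_names = ['Instagram', 'Slack', 'Instagram', 'Instagram', 'Slack']
duplicates, unique = count_duplicates(sample_names)
print(len(duplicates), len(unique))  # 3 2
```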

In [38]:
for app in play_store_data[1:]:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


So for Instagram, we can see four entries, and the main difference between them is the 'Reviews' column. We can use this as the criterion for deciding which row to keep and which rows to delete. A higher review count most likely means the row was collected more recently, so we'll keep the one with the highest number of 'Reviews'.

In [44]:
reviews_max = {}

for app in play_store_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print('Expected rows:', len(play_store_data[1:])-1181)
print('Clean rows:', len(reviews_max))

Expected rows: 9659
Clean rows: 9659


We created a dictionary that maps each app name to its highest number of reviews. If the app is not in the dictionary yet, we add it with the number of reviews from the current row; if it is already there with a lower number, we update the stored value. This gives us a total of 9,659 unique apps.
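
The same idea can be traced on a handful of rows. This sketch uses the Instagram review counts shown above plus one made-up app, and reduces each row to a name and a review count for brevity. It uses dict.get with a default of -1 to combine the "new app" and "higher count" cases into one condition, which is a stylistic variation rather than the notebook's exact code.

```python
sample_rows = [
    ['Instagram', '66577313'],
    ['Instagram', '66577446'],
    ['Instagram', '66509917'],
    ['Made-Up App', '12'],  # hypothetical app, for illustration only
]

sample_max = {}
for name, reviews in sample_rows:
    n_reviews = float(reviews)
    # Store the count if the app is new or the current count is higher
    if n_reviews > sample_max.get(name, -1):
        sample_max[name] = n_reviews

print(sample_max)  # {'Instagram': 66577446.0, 'Made-Up App': 12.0}
```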

In [43]:
play_store_clean = []
already_added = []

for app in play_store_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        play_store_clean.append(app)
        already_added.append(name)

print('Expected rows:', len(play_store_data[1:])-1181)
print('Actual rows:', len(play_store_clean))

Expected rows: 9659
Actual rows: 9659


Here we create two lists: play_store_clean, which holds one row per app, and already_added, which tracks the names of apps already placed in the clean list. A row is added only if its review count equals the maximum for that app and the app's name is not yet in already_added; otherwise we skip it and go to the next row. The second condition is needed because several duplicate rows can share the same highest review count. Our expected and actual row counts match.
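
To see why the name check matters, consider an app whose duplicate rows tie on the highest review count. The sketch below runs both passes on made-up two-column rows; the function name keep_max_review_rows and its parameters are hypothetical, and a set is used for the names already added.

```python
def keep_max_review_rows(rows, name_index=0, reviews_index=1):
    """Keep one row per app name: the first row with the highest review count."""
    # First pass: highest review count per name
    best = {}
    for row in rows:
        n_reviews = float(row[reviews_index])
        if n_reviews > best.get(row[name_index], -1):
            best[row[name_index]] = n_reviews

    # Second pass: keep the first row that matches the highest count
    clean = []
    added = set()
    for row in rows:
        name = row[name_index]
        if float(row[reviews_index]) == best[name] and name not in added:
            clean.append(row)
            added.add(name)
    return clean


sample = [
    ['Tie App', '100'],
    ['Tie App', '100'],  # same maximum: would be kept twice without the name check
    ['Other App', '5'],
    ['Other App', '9'],
]

print(keep_max_review_rows(sample))  # [['Tie App', '100'], ['Other App', '9']]
```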

### Removing non-English apps

We know some apps exist that are not in English, and since we are primarily building for an English-language audience, we do not want to include those in our analysis. Below I've defined a function that checks whether an app's full name looks English by counting the characters that fall outside the ASCII range (0 to 127). Because English app names can still contain a few such characters (emoji, for example), I only disqualify an app if its name has more than 3 of them.

In [59]:
def is_english(string):
    char_count =  0
    for char in string:
        if ord(char) > 127:
            char_count +=1
        if char_count > 3:
            return False
    return True

In [60]:
play_store_eng = []
for app in play_store_clean:  # this list has no header row, so no [1:] slice
    name = app[0]
    if is_english(name):
        play_store_eng.append(app)

explore_data(play_store_eng, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


In [61]:
app_store_eng = []
for app in app_store_data[1:]:  # skip the header row
    name = app[1]
    if is_english(name):
        app_store_eng.append(app)

explore_data(app_store_eng, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


After this filter we are left with 9,614 Play Store apps and 6,183 iOS apps.

### Focusing on Free Apps

We only want to build apps that are free for users and monetized through in-app ads. Since the data contains both free and non-free apps, we need to isolate the free ones for our analysis.

In [63]:
play_store_final = []
app_store_final = []

for app in play_store_eng:
    price = app[7]
    if price == '0':
        play_store_final.append(app)

for app in app_store_eng:
    price = app[4]
    if price == '0.0':
        app_store_final.append(app)

print(len(play_store_final))
print(len(app_store_final))

8864
3222


We have 8,864 Android apps and 3,222 iOS apps left. This should be a large enough sample for our analysis.
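
The two stores write a free price differently ('0' in the Play Store data, '0.0' in the App Store data), so the string comparisons above are specific to each file. As a hedged sketch, a single numeric check could serve both. The helper name is_free is hypothetical, and the sketch assumes that paid Play Store prices carry a leading dollar sign (for example '$4.99').

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')  # assumption: '$' is the only currency symbol
    return float(cleaned) == 0.0


print(is_free('0'), is_free('0.0'), is_free('$4.99'))  # True True False
```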