## Analyzing Mobile App Data

This project analyzes mobile app data on the Google Play and App Store. 

Specifically, it focuses on apps that are free to download and whose main source of revenue is in-app ads.

The __goal__ of this project is to analyze the data to understand which types of apps are likely to attract more users.

In [1]:
import csv
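# for App Store (explicit encoding so the read does not depend on the platform default)
with open('AppleStore.csv', 'r', encoding='utf8') as apple: 
    reader = csv.reader(apple)
    df_apple = list(reader)
    
# for Google Play
with open('googleplaystore.csv', 'r', encoding='utf8') as google: 
    reader = csv.reader(google)
    df_google = list(reader)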

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# Print the first three rows, then the total number of
# rows and columns in the App Store dataset.
# explore_data() only prints and returns None, so the result is not assigned.
explore_data(df_apple, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


In [4]:
# Do the same for the Google Play dataset
explore_data(df_google, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


__Printing column names for both datasets:__

In [5]:
print('Apple:\n')
print(df_apple[0])
print('\n')
print('Google:\n')
print(df_google[0])

Apple:

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Google:

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


#### Apple columns that may be useful for analysis: 

- "id": App ID
- "track_name": App Name
- "size_bytes": Size (in Bytes)
- "currency": Currency Type
- "price": Price amount
- "rating_count_tot": User Rating counts (for all versions)
- "rating_count_ver": User Rating counts (for current version)
- "user_rating": Average User Rating value (for all versions)
- "user_rating_ver": Average User Rating value (for current version)
- "ver": Latest version code
- "cont_rating": Content Rating
- "prime_genre": Primary Genre
- "sup_devices.num": Number of supported devices
- "lang.num": Number of supported languages

#### Play Store columns that may be useful for analysis
- 'App': App name
- 'Category': Type of app
- 'Rating': Average Rating
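- 'Reviews': Number of reviews
- 'Size': App size
- 'Installs': Number of installs (given as a bracket, e.g. '10,000+')
- 'Type': Free or paid
- 'Price': Price
- 'Content Rating': Maturity level
- 'Genres': App genres
- 'Last Updated': Date of the most recent update
- 'Current Ver': Current app version
- 'Android Ver': Minimum required Android version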

[Click to view documentation](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps/data)
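
Because both datasets are plain lists of lists, every column is accessed by position. The following is a minimal sketch of a name-to-index lookup built from a header row, so that later code can refer to columns by name. The helper `column_index` is hypothetical and not part of the notebook; the header and the sample row are copied from the output above.

```python
def column_index(header):
    """Map each column name to its position in the header row."""
    return {name: position for position, name in enumerate(header)}

# Google Play header and one data row, copied from the output above.
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']
sample_row = ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M',
              '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play',
              'January 15, 2018', '2.0.0', '4.0.3 and up']

google_cols = column_index(google_header)
print(sample_row[google_cols['Reviews']])   # review count, still a string
print(sample_row[google_cols['Installs']])  # install bracket, still a string
```

Note that every value read by `csv.reader` is a string, so numeric columns need an explicit conversion before any arithmetic.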


## Data Cleaning

Checking the datasets for data that needs cleaning:

In [6]:
print(df_google[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [7]:
# This row has only 12 fields instead of 13: the 'Category' value is missing,
# so every later value is shifted one column to the left. We delete the row.
print("Length before deleting:")
print(len(df_google))
del df_google[10473]
print("Length after deleting:")
print(len(df_google))

Length before deleting:
10842
Length after deleting:
10841
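
The row deleted above has 12 fields while the header has 13. One way such rows could be located without knowing the index in advance is to compare each row's length with the header's. This is only a sketch: `find_malformed_rows` is a hypothetical helper, and the three-column sample data is made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose field count differs from the header's.

    Assumes dataset[0] is the header row; indices refer to positions
    in the full dataset, header included.
    """
    expected = len(dataset[0])
    return [(index, row)
            for index, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]

# Made-up sample: the last row is missing its 'Category' value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good app', 'TOOLS', '4.1'],
    ['Broken app', '1.9'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

If several rows are reported, deleting them from the highest index down keeps the remaining indices valid.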


#### The dataset contains duplicate entries which need to be deleted
For example, the Instagram app has duplicate entries, as can be seen here:

In [8]:
# skip the header row and loop through the remaining rows
for app in df_google[1:]: 
    name = app[0]
    if name == 'Instagram':
        print(app)
        print('\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




Now that we have seen an example of duplicates, let's count how many duplicate entries the Google Play dataset contains:

In [9]:
duplicate_apps = []
unique_apps = []

for app in df_google[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print(f'The number of duplicate apps are {len(duplicate_apps)}\n')
print(f'The number of unique apps are {len(unique_apps)}\n')

The number of duplicate apps are 1181

The number of unique apps are 9659
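
The loop above tests membership in a list, which scans the list on every iteration. As an alternative sketch, `collections.Counter` from the standard library gives the same two counts using hashing. The function `count_duplicates` and the sample names are hypothetical and only illustrate the idea.

```python
from collections import Counter

def count_duplicates(names):
    """Return (number of unique names, number of duplicate entries).

    A duplicate entry is every occurrence of a name after its first one,
    which matches how the loop above fills duplicate_apps.
    """
    counts = Counter(names)
    n_unique = len(counts)
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates

# Made-up sample: 'Instagram' appears three times, 'Slack' twice.
sample_names = ['Instagram', 'Instagram', 'Slack', 'Instagram', 'Slack', 'Box']
n_unique, n_duplicates = count_duplicates(sample_names)
print(f'unique: {n_unique}, duplicates: {n_duplicates}')
```

On the real data this would be called with the app names only, for example `[app[0] for app in df_google[1:]]`.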



#### Let's reassign df_google to store it without the header 

In [10]:
df_google = df_google[1:]

In [11]:

print('Expected length: ', len(df_google) - 1181)

Expected length:  9659


#### Build a dictionary of {'app_name': no_of_reviews} from the Google Play data
For each app name, the dictionary stores the highest number of reviews found among its entries. This lets us keep exactly one entry per app: the one with the most reviews.

In [12]:
reviews_max = {}

for app in df_google:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews :
        reviews_max[name] = n_reviews
        
    elif (name not in reviews_max):
        reviews_max[name] = n_reviews
len(reviews_max)

9659
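
To make the result concrete, here is the same idea applied to the four Instagram rows printed earlier, shortened to their first four columns. This sketch uses `dict.get` with `max` instead of the `if`/`elif` pair; the names `sample_rows` and `sample_max` are hypothetical.

```python
# First four columns of the Instagram rows shown earlier.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

sample_max = {}
for row in sample_rows:
    name = row[0]
    n_reviews = float(row[3])
    # Keep the larger of the stored count (if any) and the current one.
    sample_max[name] = max(sample_max.get(name, n_reviews), n_reviews)

print(sample_max)
```

The dictionary ends up with a single 'Instagram' key whose value is the largest of the four review counts.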

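Next, we create two lists: `android_clean`, which stores the cleaned dataset (no duplicates, with each app represented by its entry with the most reviews), and `already_added`, which keeps track of the apps that have already been added to `android_clean`. The second check is needed because several duplicate rows can share the same maximum number of reviews.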

In [13]:
android_clean = []
already_added = []

for app in df_google:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

In [14]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
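
The two passes above (first `reviews_max`, then `android_clean`) could also be folded into a single pass that stores the best row per app directly. This is a sketch under assumptions, not the notebook's method: `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened to four columns.

```python
def keep_most_reviewed(rows, name_col=0, reviews_col=3):
    """Keep one row per app name: the row with the most reviews.

    On ties the first row seen is kept, as in the two-pass version.
    """
    best = {}
    for row in rows:
        name = row[name_col]
        if (name not in best
                or float(row[reviews_col]) > float(best[name][reviews_col])):
            best[name] = row
    return list(best.values())

# Shortened sample rows (name, category, rating, reviews).
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

for row in keep_most_reviewed(sample_rows):
    print(row)
```

One difference from the two-pass version: the result is ordered by the first appearance of each app name (dictionaries preserve insertion order in Python 3.7+), not by the position of the row that is kept.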


### Removing Non-English Apps: Part One

Checking that there are no duplicate entries in the App Store dataset:

In [24]:
duplicate_apps = []
unique_apps = []

for app in df_apple[1:]:
    app_id = app[0]  # column 0 of the App Store data is 'id', not the app name
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)

print(f'The number of duplicate apps are {len(duplicate_apps)}\n')
print(f'The number of unique apps are {len(unique_apps)}\n')

The number of duplicate apps are 0

The number of unique apps are 7196

