# Analyzing Popular Android & iOS Apps

In this project, I analyze data on 17,000+ apps from the Google Play and Apple App Stores to understand what types of apps are likely to attract the most users.

Datasets used:
- Android app data from the Google Play store: [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
- iOS data from the Apple App Store: [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

## Setup

In [1]:
import numpy as np
import pandas as pd

## Load the data

First, I load the data. I create a function to load each dataset from its respective URL as a list of lists:

In [2]:
def load_data_from_url(url):
    df = pd.read_csv(url)
    header = df.columns.values.tolist()  # column names
    rows = df.values.tolist()  # body rows
    return [header] + rows  # header row first, then the body rows

I use this function to load each dataset:

In [3]:
data_android = load_data_from_url('https://dq-content.s3.amazonaws.com/350/googleplaystore.csv')
data_ios = load_data_from_url('https://dq-content.s3.amazonaws.com/350/AppleStore.csv')
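To illustrate the list-of-lists shape that `load_data_from_url` produces, here is a minimal sketch that parses a small in-memory CSV with the standard library's `csv` module. The sample text and the `load_rows` helper are hypothetical and not part of the project. Note one difference: `csv.reader` keeps every value as a string, whereas pandas infers numeric types for some columns.

```python
import csv
import io

# Hypothetical two-row CSV standing in for a downloaded file.
sample_csv = "App,Rating,Reviews\nExample App,4.1,159\nOther App,3.9,20\n"

def load_rows(text):
    """Parse CSV text into a list of lists, with the header row first."""
    return list(csv.reader(io.StringIO(text)))

rows = load_rows(sample_csv)
header, body = rows[0], rows[1:]  # same layout as data_android and data_ios
print(header)
print(body[0])
```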

## Explore the data

Next, I define a function to explore each dataset. It prints a slice of the given dataset and takes several parameters:

- `dataset`: dataset of interest, as a list of lists
- `start`, `end`: start and end integer indices of desired slice from the dataset
- `rows_and_columns`: boolean indicating whether to also print the number of rows and columns in the dataset

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    for row in dataset[start:end]:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

Now I use this function to preview the first 2 rows of each dataset:

In [5]:
# preview android app data
explore_data(data_android, 0, 2, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', 4.1, '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows:  10842
Number of columns:  13


In [6]:
# preview iOS app data
explore_data(data_ios, 0, 2, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


[284882215, 'Facebook', 389879808, 'USD', 0.0, 2974676, 212, 3.5, 3.5, '95.0', '4+', 'Social Networking', 37, 1, 29, 1]


Number of rows:  7198
Number of columns:  16


As expected, the first row of each dataset contains the column names. Together, the two datasets cover more than 17,000 apps, with metrics on app ratings, price, genre, and more.

## Clean the data

### Removing incorrect data

Next, we remove incorrect data. One row in the Google Play store data has an invalid entry in the `Reviews` column (see discussion [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)): its value can't be converted to a `float`. So we check each row for a failed conversion and delete the offending row:

In [32]:
for i, row in enumerate(data_android[1:]):
    try:
        float(row[3])
    except ValueError:
        # i is the position within data_android[1:], i.e. not counting the header row
        print('Could not convert string to float for row with index: ' + str(i))
        index_to_del = i

Could not convert string to float for row with index: 10472


In [33]:
# index_to_del was counted from data_android[1:], so add 1 to skip the header row
del data_android[index_to_del + 1]
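Because the loop enumerates `data_android[1:]`, the index it reports counts body rows only, so the matching position in the full list (header included) is one higher. The sketch below illustrates that mapping on a toy dataset. The `toy` rows and the `'3.0M'` value are made up for illustration.

```python
# Toy dataset: a header row plus three body rows; the last one has a bad value.
toy = [
    ['App', 'Reviews'],
    ['A', '10'],
    ['B', '25'],
    ['C', '3.0M'],
]

bad_positions = []
for i, row in enumerate(toy[1:]):
    try:
        float(row[1])
    except ValueError:
        # i counts from the first body row, so the position in `toy` is i + 1.
        bad_positions.append(i + 1)

# Delete from the end so that earlier positions stay valid.
for pos in reversed(bad_positions):
    del toy[pos]

print(bad_positions)
print(toy)
```

Collecting positions first and deleting afterwards also handles the case where more than one row fails the conversion.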

### Removing duplicates

We first identify cases where an app appears more than once:

In [35]:
duplicate_apps = []
unique_apps = []

# build list of unique and duplicate app names
for row in data_android[1:]:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)    

In [36]:
# print the number of duplicates
print('Identified ' + str(len(duplicate_apps)) + ' duplicate app names...')

Identified 1181 duplicate app names...
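As a cross-check on the loop above, the same count can be sketched with `collections.Counter`, which avoids repeated list membership tests. The toy rows and the `rows_for_app` helper below are hypothetical. The helper shows one way to pull up every row for a given app name when inspecting a duplicate.

```python
from collections import Counter

# Toy dataset: 'Alpha' appears three times, so it contributes two duplicates.
toy = [
    ['App', 'Reviews'],
    ['Alpha', '10'],
    ['Beta', '25'],
    ['Alpha', '12'],
    ['Alpha', '12'],
]

name_counts = Counter(row[0] for row in toy[1:])

# Every occurrence after the first counts as one duplicate, matching the loop above.
n_duplicates = sum(count - 1 for count in name_counts.values())
repeated_names = [name for name, count in name_counts.items() if count > 1]

def rows_for_app(dataset, name):
    """Return all body rows whose first column equals `name`."""
    return [row for row in dataset[1:] if row[0] == name]

print(n_duplicates)
print(repeated_names)
print(rows_for_app(toy, 'Alpha'))
```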


Taking a look at one of these duplicate cases, we can see that 

## Filter the data

We want to examine only apps that are free to install and designed for an English-speaking audience, so we filter the data accordingly:
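One possible way to do this is sketched below. It rests on assumptions that are not confirmed by the text above: an app name is treated as English when it contains at most three non-ASCII characters (a rough heuristic), and prices are either numbers or strings such as `'0'` or `'$4.99'`. The helper names `is_probably_english`, `is_free`, and `filter_apps`, and the toy rows, are hypothetical.

```python
def is_probably_english(name, max_non_ascii=3):
    """Heuristic (assumption): allow a few non-ASCII characters such as emoji or trademark signs."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

def is_free(price):
    """Treat the app as free when its price parses to zero.

    Assumes prices look like '0', '$4.99', or a plain number.
    """
    return float(str(price).replace('$', '')) == 0.0

def filter_apps(dataset, name_index, price_index):
    """Keep the header row plus body rows that are both free and probably English."""
    header, body = dataset[0], dataset[1:]
    kept = [
        row for row in body
        if is_probably_english(str(row[name_index])) and is_free(row[price_index])
    ]
    return [header] + kept

# Toy data for illustration only.
toy_android = [
    ['App', 'Price'],
    ['Docs To Go™ Free', '0'],
    ['中文应用程序', '0'],
    ['Paid Tool', '$4.99'],
]
filtered = filter_apps(toy_android, name_index=0, price_index=1)
print(filtered)
```

Going by the column headers printed earlier, the real datasets would use `filter_apps(data_android, 0, 7)` (`App`, `Price`) and `filter_apps(data_ios, 1, 4)` (`track_name`, `price`).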