# Analyzing profitable App profiles on Google Play Store and iOS App Store

We analyze app data from the Google Play Store and the iOS App Store to help developers
understand which app profiles are profitable.
For a company that builds only free apps, the main source of revenue is its users: the more
users who engage with the apps and their ads, the higher the revenue.

### Exploring the datasets
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for our purpose:

- A data set containing data about approximately ten thousand Android apps from Google Play. Available directly from [this link.](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
- A data set containing data about approximately seven thousand iOS apps from the App Store. Available directly from [this link.](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

In [10]:
from csv import reader

In [11]:
# ios AppStore
ios_opened_file = open('AppleStore.csv', encoding='utf8')
ios_read_file = reader(ios_opened_file)
ios = list(ios_read_file)
ios_header = ios[0]
ios = ios[1:]

# Google PlayStore
android_opened_file = open('googleplaystore.csv', encoding='utf8')
android_read_file = reader(android_opened_file)
android = list(android_read_file)
android_header = android[0]
android = android[1:]

In [14]:
# helper function to print rows in a readable way
# dataset is expected to be a list of lists
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

In [25]:
print(android_header)
print('\n')
explore_data(android, 100, 110, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Natural recipes for your beauty', 'BEAUTY', '4.7', '1150', '9.8M', '100,000+', 'Free', '0', 'Everyone', 'Beauty', 'May 15, 2018', '4.0', '4.1 and up']


['BestCam Selfie-selfie, beauty camera, photo editor', 'BEAUTY', '3.9', '1739', '21M', '500,000+', 'Free', '0', 'Everyone', 'Beauty', 'July 12, 2018', '1.0.6', '4.0.3 and up']


['Mirror - Zoom & Exposure -', 'BEAUTY', '3.9', '32090', 'Varies with device', '1,000,000+', 'Free', '0', 'Everyone', 'Beauty', 'October 24, 2016', 'Varies with device', 'Varies with device']


['Beauty Selfie Camera', 'BEAUTY', '4.2', '2225', '52M', '500,000+', 'Free', '0', 'Everyone', 'Beauty', 'February 28, 2018', '1.6', '4.1 and up']


['Hairstyles step by step', 'BEAUTY', '4.6', '4369', '14M', '100,000+', 'Free', '0', 'Everyone', 'Beauty', 'July 25, 2018', '1.9', '4.0.3 and up']


['Filters for Selfie',

We have data for 10,841 Android apps. Columns that might be interesting for our analysis are App, Category, Reviews, Installs, Type, Price and Genres.

In [17]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  7197
Number of columns:  16


We have data for 7,197 iOS apps. Columns that might be interesting for our analysis are track_name, currency, price, rating_count_tot,
rating_count_ver and prime_genre. More details about the columns are available [here.](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)
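
Because the cleaning steps in this notebook refer to columns by position (for example `row[3]` for the Android review count), it can help to look up a column's position from its name. The following is a minimal sketch using the header rows printed above; `column_index` is a hypothetical helper and is not part of the original analysis.

```python
# Header rows copied from the outputs printed above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

def column_index(header, name):
    """Return the position of column `name` in `header` (hypothetical helper).

    Raises ValueError if the column does not exist.
    """
    return header.index(name)

print(column_index(android_header, 'Reviews'))   # 3
print(column_index(ios_header, 'prime_genre'))   # 11
```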


### Looking for missing column entries in rows

In [21]:
header_len = len(android_header)
# enumerate gives the row index directly, without a second search through the list
for index, row in enumerate(android):
    if len(row) != header_len:
        print(row)
        print(index)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


The output above shows that row 10472 is missing a value for the 'Category' column, which shifts all subsequent values one column to the left.
We can either delete the row using "del android[10472]" or insert the missing Category value if we can find out what it is.
A quick search on the Play Store shows that this [app](https://play.google.com/store/apps/details?id=com.lifemade.internetPhotoframe)
belongs to the Lifestyle category, so we can insert that value (written as 'LIFESTYLE', matching the upper-case categories in the data set) into the row.

In [27]:
if len(android[10472]) < len(android_header):  # guard: re-running the cell must not insert twice
    android[10472].insert(1, "LIFESTYLE")

We did not find any rows with missing column values in the iOS data set.
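
The length check used above for the Android data can be wrapped in a small function and applied to either data set. This is a sketch only: `find_malformed_rows` is a hypothetical helper name, and the toy rows below stand in for the real `ios` and `ios_header` lists.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data for illustration only; with the real data the call would be
# find_malformed_rows(ios, ios_header).
toy_header = ['id', 'track_name', 'price']
toy_rows = [
    ['1', 'App A', '0.0'],
    ['2', 'App B'],           # one value missing
    ['3', 'App C', '1.99'],
]
print(find_malformed_rows(toy_rows, toy_header))  # [(1, ['2', 'App B'])]
```

An empty list means every row has as many values as the header.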

### Looking for duplicate entries

In [30]:
unique_apps = []
duplicate_apps = []

for row in android:
    if row[0] in unique_apps:
        duplicate_apps.append(row[0])
    else:
        unique_apps.append(row[0])

In [33]:
print(len(duplicate_apps))
print(len(unique_apps))

1181
9660


We can see that there are 1,181 duplicate app entries. Let's take a closer look at the duplicates.

In [35]:
for row in android:
    if row[0] == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


The rows differ only in the fourth column, the review count. This probably means that the entries were recorded
at different points in time. We keep only the entry with the highest review count, since it is most likely the latest one and so gives us the most
reliable average rating for the app.

In [47]:
reviews_max = {}
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    # if the app is not present, we add the key and review count to the dictionary
    if name not in reviews_max:
        reviews_max[name] = n_reviews
    # if the app is present and the review count in the dictionary is less than that of
    # the current entry, we replace it with the new, higher review count
    elif n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews

Previously we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) 
should be equal to the difference between the length of our data set and 1,181.

In [48]:
print('Expected Length ', len(android) - 1181)
print('Actual Length ', len(reviews_max))

Expected Length  9660
Actual Length  9660


In [49]:
android_clean = []
already_added = []
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)

print(len(android_clean))        

9660
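
The two steps above (building `reviews_max`, then filtering the rows) can also be done in a single pass by keeping, for each app name, the row with the highest review count. This is an alternative sketch, not the method used in this notebook; `keep_max_reviews` is a hypothetical name and the rows are shortened toy versions of the duplicates shown earlier.

```python
def keep_max_reviews(dataset, name_col=0, reviews_col=3):
    """Keep one row per app name: the row with the most reviews.

    On ties the first row seen wins, which matches the effect of the
    `already_added` check in the two-pass version.
    """
    best = {}
    for row in dataset:
        name = row[name_col]
        n_reviews = float(row[reviews_col])
        if name not in best or n_reviews > float(best[name][reviews_col]):
            best[name] = row
    return list(best.values())

# Toy rows for illustration only (columns: App, Category, Rating, Reviews).
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Hairstyles step by step', 'BEAUTY', '4.6', '4369'],
]
for row in keep_max_reviews(toy_rows):
    print(row)
```

One difference to be aware of: each app keeps the position of its first occurrence in the data, so the row order may differ from `android_clean` even though the same rows are kept.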


In [50]:
unique_apps_ios = []
duplicate_apps_ios = []

for row in ios:
    if row[0] in unique_apps_ios:
        duplicate_apps_ios.append(row[0])
    else:
        unique_apps_ios.append(row[0])
        
print(len(duplicate_apps_ios))
print(len(unique_apps_ios))

0
7197


So there are no duplicate entries in the iOS App Store data set.

### Removing non-English apps

We are focusing our analysis on English-language apps only, so we need to remove apps in other languages.
All the characters commonly used in English text fall in the ASCII range 0 to 127, so an app name containing characters outside that range is likely not in English.
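
The ASCII rule described above can be sketched as a small check built on Python's `ord()` function, which returns a character's code point. `is_english` is a hypothetical helper name, and the app names below are made-up examples. This strict version rejects any name containing even one character outside the 0 to 127 range, so English names that include symbols such as ™ or emoji would also be flagged.

```python
def is_english(text):
    """Return True if every character in `text` is in the ASCII range 0-127."""
    for character in text:
        if ord(character) > 127:
            return False
    return True

print(is_english('Instagram'))             # True
print(is_english('爱奇艺'))                 # False
print(is_english('Docs To Go™ Free'))      # False, because ™ is outside ASCII
```

Whether to tolerate a small number of non-ASCII characters per name is a design choice that depends on how many English apps the strict rule would discard.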