__Profitable App Profiles for the App Store and Google Play Markets__

This project simulates a real-world __data analysis__ scenario for __Android__ and __iOS__ mobile apps.

The __aim__ of this project is to find profitable _mobile app profiles_ for the App Store and Google Play markets. The project simulates the task of a __data analyst__ at a company that builds Android and iOS mobile apps; the analysis enables the team of __developers__ to make data-driven decisions about the kinds of apps they build.

Suppose the company builds only apps that are free to download and install, and its main source of revenue is in-app ads. This means that the revenue for any given app is mostly influenced by the __number of users__ who use the app.
The __goal__ of this project is to analyze data to help developers understand what kinds of apps are likely to attract more users.

I start by opening the two data sets and then continue with exploring the data.

In [1]:
from csv import reader

### The Google Play data set ###
opened_file=open('/media/sudipto/New Volume/690/Python/datasets/googleplaystore.csv', encoding='utf8')
read_file=reader(opened_file)
android=list(read_file)
android_header=android[0]
android=android[1:]

### The App Store data set ###
opened_file=open('/media/sudipto/New Volume/690/Python/datasets/AppleStore.csv', encoding='utf8')
read_file=reader(opened_file)
ios=list(read_file)
ios_header=ios[0]
ios=ios[1:]
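The two loading steps above repeat the same pattern, so they could be folded into one helper. The following is only a minimal sketch: the name `load_dataset` and the throwaway demo file are hypothetical, and it assumes both files are UTF-8 encoded. It uses a `with` block so the file is closed after reading.

```python
import os
import tempfile
from csv import reader


def load_dataset(path, encoding='utf8'):
    """Read a CSV file and return (header, rows) as lists of strings."""
    # newline='' is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]


# Demo on a throwaway two-line file so the sketch runs without the real data sets.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', encoding='utf8', newline='') as demo_file:
        demo_file.write('App,Category\n"Photo, Editor",ART_AND_DESIGN\n')
    demo_header, demo_rows = load_dataset(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files, the call would take the same paths used above, for example `android_header, android = load_dataset(...)`.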


Below is a function named __explore_data()__ that will be invoked repeatedly to print rows in a more readable way.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice=dataset[start:end]
    count=start
    for row in dataset_slice:
        count+=1
        print('row : ' + str(count))
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


print('## android dataset header: ')
print('\n')
print(android_header)
print('\n')
print('## android dataset first two rows')
print('\n')
explore_data(android, 0, 2, True)

print('\n')

print('## iOS appstore dataset header')
print('\n')
print(ios_header)
print('\n')
print('## iOS dataset first row')
print('\n')
explore_data(ios, 0, 1, True)

## android dataset header: 


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


## android dataset first two rows


row : 1
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


row : 2
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


## iOS appstore dataset header


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


## iOS dataset first row


row : 1
['284882215', 'Facebook', '389879808', 'USD', '0.0'

__googleplaystore__ dataset documentation : https://www.kaggle.com/lava18/google-play-store-apps<br/>
__applestore__ dataset documentation : https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

__Deleting wrong data__

Previously, I opened the data sets and explored the data. Before the analysis, I need to make sure that the data is __accurate__; otherwise the results of the analysis might be __wrong__. To ensure this, the following needs to be done:

1. Detect inaccurate data, and correct or remove it.
2. Detect duplicate data, and remove duplicates. 

Because this project focuses on apps that are free to download and install and that target an English-speaking audience, two more steps are needed:

1. Remove apps that are not in English.
2. Remove apps that are not free.

Together, these data preparation steps are called __data cleaning__.
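As an illustration of the "not free" step, the price column can be used to keep only free apps. This is a rough sketch under stated assumptions, not the method used later in the project: the names `is_free` and `keep_free` are hypothetical, the sample rows are made up, and it assumes that paid prices may carry a leading `$` sign. In the headers shown earlier, the price sits at index 7 in the Google Play data and at index 4 in the App Store data.

```python
def is_free(price_text):
    """Return True when a price string such as '0' or '0.0' represents zero."""
    # Assumption: paid prices may carry a leading currency symbol such as '$'.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable prices are treated as not free so they can be reviewed.
        return False


def keep_free(dataset, price_index):
    """Return only the rows whose price column parses to zero."""
    return [row for row in dataset if is_free(row[price_index])]


# Made-up rows that follow the first eight Google Play columns.
sample_android = [
    ['App A', 'TOOLS', '4.1', '159', '19M', '10,000+', 'Free', '0'],
    ['App B', 'TOOLS', '4.0', '20', '5M', '1,000+', 'Paid', '$4.99'],
]
free_android = keep_free(sample_android, 7)
print(len(free_android))
```

Parsing the price instead of comparing strings lets the same helper handle both `'0'` and `'0.0'`.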

The discussion section of the Google Play data set mentions that the rating at entry 10472 is wrong. I will print the rows around that index to check whether the entry is incorrect.

In [3]:
print(android_header)
explore_data(android, 10471, 10473, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
row : 10472
['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


row : 10473
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10841
Number of columns: 13


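As the output above shows, the row at index 10472 (printed as row 10473, because __explore_data()__ numbers rows starting from 1) has no value in the __Category__ column. Every later value is therefore shifted one column to the left, and the row has only 12 fields instead of 13.
I will delete this row to remove the error from the data set.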
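A row with a missing column can also be found automatically by comparing each row's length with the header's length. The sketch below is a hypothetical helper (`find_malformed_rows` is not part of the original notebook) shown on a small made-up sample.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(index, row) for index, row in enumerate(dataset) if len(row) != expected]


sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],  # Category missing, so the row is one column short
    ['App C', 'SOCIAL', '4.5'],
]

for index, row in find_malformed_rows(sample_rows, sample_header):
    print(index, len(row), row)
```

Running the same check as `find_malformed_rows(android, android_header)` should flag the shifted row, since it has 12 fields against a 13-column header.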


In [4]:
del android[10472] # deleting the incorrect row
explore_data(android, 10471, 10474, True)

row : 10472
['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


row : 10473
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


row : 10474
['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


Number of rows: 10840
Number of columns: 13


__Removing duplicate entries__

The discussion section of the data set mentions redundant entries. For instance, Instagram has four entries:


In [5]:
for row in android:  # the header was already removed, so iterate over every row
    name=row[0]
    if name=='Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [6]:
# Segregating the duplicate apps and printing names of some duplicate apps

duplicate_apps=[]
unique_apps=[]
for app in android:
    name=app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps: ', len(duplicate_apps))
print('Names of duplicate apps', duplicate_apps[:5])

Number of duplicate apps:  1181
Names of duplicate apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
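The same count can be cross-checked with `collections.Counter` from the standard library, which avoids repeated list lookups. This is only a sketch on made-up names; `count_duplicates` is a hypothetical helper, and it assumes the app name is in the first column, as in the Google Play data.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Return the repeated names with their counts, and the number of extra rows."""
    name_counts = Counter(row[name_index] for row in dataset)
    repeated = {name: n for name, n in name_counts.items() if n > 1}
    # Each repeated name contributes (count - 1) redundant rows.
    extra_rows = sum(n - 1 for n in repeated.values())
    return repeated, extra_rows


sample = [['Instagram'], ['Box'], ['Instagram'], ['Instagram'], ['Box'], ['Slack']]
repeated, extra_rows = count_duplicates(sample)
print(repeated)
print(extra_rows)
```

The `extra_rows` value counts the same thing as `len(duplicate_apps)` in the loop above: every occurrence of a name after its first.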
