## Analyse profitable app profiles for the App Store and Google Play markets
This is a guided Dataquest project for practising fundamental Python programming skills.

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app - the more users that see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

In [9]:
import pandas as pd

In [16]:
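def explore_data(dataset, start, end, rows_and_columns=False):
    """Print the rows dataset[start:end] and, optionally, the dataset's dimensions."""
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # blank lines between rows for readability
    if rows_and_columns:
        # The row count includes the header row if it is still in the dataset
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))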

In [17]:
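import csv

# Read each CSV file into a list of rows; the first row of each list is the header.
# The encoding is set explicitly so the result does not depend on the platform default.
with open('AppleStore.csv', encoding='utf8') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    applestore = list(reader)

with open('googleplaystore.csv', encoding='utf8') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    playstore = list(reader)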

In [18]:
explore_data(applestore, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


In [20]:
explore_data(playstore, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


- The App Store dataset has 7198 rows (including the header row) and 16 columns. Among the columns, `track_name`, `size_bytes`, `currency`, `price`, `rating_count_tot`, `rating_count_ver`, `user_rating` and `prime_genre` look the most useful for our analysis. The column descriptions can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).
- The Google Play dataset has 10842 rows (including the header row) and 13 columns. The columns `App`, `Category`, `Installs`, `Type`, `Price` and `Genres` look the most useful for our analysis.
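Since the later cells access fields by position (for example `app[0]`), it can help to look up each column's index by name. The following is a minimal sketch: the two header lists are copied from the output above, while `column_indexes`, `apple_cols` and `play_cols` are names introduced here for illustration only.

```python
# Header rows copied from the explore_data output above
apple_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
play_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
               'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
               'Current Ver', 'Android Ver']


def column_indexes(header):
    """Return a dict mapping each column name to its position in a row."""
    return {name: index for index, name in enumerate(header)}


apple_cols = column_indexes(apple_header)
play_cols = column_indexes(play_header)

# Look up a few of the columns mentioned above
print(play_cols['App'], play_cols['Reviews'], play_cols['Installs'])
print(apple_cols['track_name'], apple_cols['prime_genre'])
```

With the real data, `column_indexes(playstore[0])` would build the same mapping directly from the header row.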

### Data Cleaning

According to the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), the Google Play dataset contains a malformed row, row 10472. We need to remove this row from the dataset before continuing.
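Because `playstore` still holds the header row at index 0 here, the list index of the bad row may differ by one from the number quoted in the discussion, so it is worth confirming the index before deleting anything. A minimal sketch, assuming the malformed row has a different number of fields than the header; `find_malformed_rows` and the toy `sample` data are hypothetical and only stand in for the real dataset.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(dataset[0])  # the header row defines the expected width
    return [(index, row) for index, row in enumerate(dataset)
            if len(row) != expected]


# Toy data standing in for `playstore` (header row at index 0)
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # one field is missing in this row
    ['App C', 'GAME', '4.5'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

Running `find_malformed_rows(playstore)` on the real list and printing the result shows which index to pass to `del`.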

In [25]:
len(playstore)

10840

In [26]:
del playstore[10472]

In [27]:
len(playstore)

10839

#### Removing duplicate entries

Some apps appear more than once in the `App` column, which means the dataset contains duplicate entries. For example, the 'Gmail' app appears three times in the dataset.

In [57]:
for app in playstore:
    app_name = app[0]
    if app_name == 'Gmail':
        print(app)

['Gmail', 'COMMUNICATION', '4.3', '4604324', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Gmail', 'COMMUNICATION', '4.3', '4604483', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Gmail', 'COMMUNICATION', '4.3', '4604324', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', 'Varies with device', 'Varies with device']


In [61]:
playstore_app = []
duplicate_app = []

for app in playstore:
    app_name = app[0]
    if app_name in playstore_app:
        duplicate_app.append(app_name)
    playstore_app.append(app_name)
    
print(duplicate_app[:10])
print('\n')
print('Duplicate apps: ', len(duplicate_app))
print('Total apps: ', len(playstore_app))

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Duplicate apps:  1181
Total apps:  10839
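The same count can also be obtained with `collections.Counter`, which avoids the repeated `in` lookups on a growing list. This is only an alternative sketch on toy data; `count_duplicates` and `sample` are names made up for illustration.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Return the names that occur more than once and the number of extra rows."""
    counts = Counter(row[name_index] for row in dataset)
    duplicates = {name: n for name, n in counts.items() if n > 1}
    # Every occurrence after the first one counts as a duplicate row
    extra_rows = sum(n - 1 for n in duplicates.values())
    return duplicates, extra_rows


# Toy rows containing only an app name
sample = [['Gmail'], ['Box'], ['Gmail'], ['Slack'], ['Gmail'], ['Box']]

duplicates, extra_rows = count_duplicates(sample)
print(duplicates)
print('Duplicate rows:', extra_rows)
```

Here `extra_rows` corresponds to `len(duplicate_app)` in the cell above, since both count every repeat after an app's first occurrence.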


We need to keep only one entry per app. Rather than removing duplicates at random, we can keep the entry with the highest number of reviews. To do that, we can create a dictionary with the app name as the key and the highest review count as the value.
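The idea can be sketched on a few rows as follows. This is a variation that stores the whole row in the dictionary rather than only the review count, and it assumes the header row has been excluded (for example by passing `playstore[1:]`), since the header's `Reviews` field cannot be converted to a number. `keep_most_reviewed` is a hypothetical helper; the Gmail rows are shortened from the output above and the last row is made up.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])  # review counts are read as strings
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample = [
    ['Gmail', 'COMMUNICATION', '4.3', '4604324'],
    ['Gmail', 'COMMUNICATION', '4.3', '4604483'],
    ['Gmail', 'COMMUNICATION', '4.3', '4604324'],
    ['Example App', 'TOOLS', '4.0', '10'],  # made-up row
]

unique_rows = keep_most_reviewed(sample)
for row in unique_rows:
    print(row)
```

Converting the review count with `float` matters here: comparing the raw strings would order them alphabetically rather than numerically.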

In [None]:
app_max_reviews = {}

for column in playstore:
    app_name = column[0]
    reviews = column[3]
    if reviews >= 