# Profitable App Profiles for the App Store and Google Play Markets 

The goal of this project is to analyze data to understand what kinds of apps are likely to attract more users on the iOS and Android platforms.

The first [data set](https://www.kaggle.com/lava18/google-play-store-apps) contains data about approximately ten thousand Android apps from Google Play.

The second [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) contains data about approximately seven thousand iOS apps from the App Store.

Opening the Google Play data set

In [1]:
from csv import reader

# Read the Google Play data into a list of rows (the header is row 0)
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()

Opening the App Store data set

In [2]:
from csv import reader

# Read the App Store data into a list of rows (the header is row 0)
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
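
Both cells above follow the same open, read, convert-to-list pattern. As a minimal sketch, that pattern could be wrapped in a small helper that uses a `with` block so the file is closed automatically. The name `load_csv` is a hypothetical helper introduced here for illustration and is not part of the original notebook.

```python
from csv import reader


def load_csv(path, encoding="utf8"):
    """Read a CSV file into a list of rows (each row is a list of strings).

    Hypothetical helper, shown only as a sketch of the pattern used above.
    """
    # newline="" is what the csv module documentation recommends for files
    # passed to csv.reader; the with block closes the file on exit.
    with open(path, encoding=encoding, newline="") as opened_file:
        return list(reader(opened_file))
```

Assuming the two CSV files sit next to the notebook, the calls would be `android = load_csv('googleplaystore.csv')` and `ios = load_csv('AppleStore.csv')`.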


In [3]:
def explore_data(dataset, start, end, rows_column=False):
    # Print each row in dataset[start:end], separated by blank lines
    dataset_slice = dataset[start:end]
    for x in dataset_slice:
        print(x, "\n")

    # Optionally report the size of the whole data set
    if rows_column:
        print("Number of rows :", len(dataset))
        print("Number of columns :", len(dataset[0]))

# explore_data only prints and returns None, so its result is not stored
print("Preview of the GooglePlay Dataset\n")
explore_data(android, 0, 2)

print("\nPreview of the AppleStore Dataset\n")
explore_data(ios, 0, 2)



Preview of the GooglePlay Dataset

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 


Preview of the AppleStore Dataset

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 



# Data Cleanup Process
##### Part 1: Deleting the wrong data
The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101), and one of the discussions outlines an error in row `10473`. Let's print this row and compare it against the header.

In [4]:
print(android[0],"\n\n",android[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


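Row `10473` corresponds to the app `Life Made WI-Fi Touchscreen Photo Frame`. It has only 12 values while the header has 13: the `Category` value is missing, so every later value is shifted one column to the left. As a result, the `Rating` column reads `19`, which is clearly off because the maximum rating for a Google Play app is 5. We'll delete this row.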

In [5]:
print("Length of the original GooglePlay Dataset : ",len(android))
del android[10473]  # run this cell only once, or a valid row will be deleted too
print("Length of the updated GooglePlay Dataset :" , len(android))

Length of the original GooglePlay Dataset :  10842
Length of the updated GooglePlay Dataset : 10841
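
The Kaggle discussion pointed us to this row, but the same kind of problem can also be caught programmatically, because a row with a missing value has fewer fields than the header. The sketch below illustrates the idea on made-up sample rows. `find_malformed_rows` and the `sample` data are hypothetical and not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose field count differs from the header's.

    Assumes dataset[0] is the header row, as in the lists built above.
    """
    expected = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset[1:], start=1)
        if len(row) != expected
    ]


# Made-up rows for illustration only: the last one lacks its category
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]
print(find_malformed_rows(sample))
# [(2, ['Broken App', '1.9'])]
```

Run on the `android` list before the deletion, a check like this should flag index `10473`, since that row has 12 fields against a 13-field header.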


##### Part 2: Removing duplicate entries
If we explore the Google Play data, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:

In [11]:
for x in android[1:]:
    name = x[0]
    if name == 'Instagram':
        print(x)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [16]:
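def duplicate_values(dataset):
    # Compares rows on column 0: the app name in the Google Play data,
    # but the numeric id (not the name) in the App Store data.
    duplicate_apps = []
    unique_apps = set()  # set membership tests are faster than list lookups
    for x in dataset[1:]:  # skip the header row
        name = x[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.add(name)
    return duplicate_apps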
            
print("Number of duplicate rows in the Google Play data set : " , len(duplicate_values(android)),"\n")
print("Number of duplicate rows in the App Store data set : " , len(duplicate_values(ios)),"\n")


Number of duplicate rows in the Google Play data set :  1181 

Number of duplicate rows in the App Store data set :  0 
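
Counting duplicates does not yet remove them. One possible way to do so, sketched below, is to keep a single row per app name and prefer the row with the most reviews, on the assumption that a higher review count means the entry was collected more recently. This criterion is an assumption made for illustration, and `remove_duplicates` is a hypothetical helper. The column positions (name at index 0, reviews at index 3) match the Google Play header shown earlier, and the sketch assumes every review count parses as an integer, which requires that the malformed row was already deleted.

```python
def remove_duplicates(dataset, name_index=0, reviews_index=3):
    """Return a copy of dataset with one row per app name.

    For each name, the row with the highest review count is kept.
    Assumes dataset[0] is the header row.
    """
    header, rows = dataset[0], dataset[1:]

    # First pass: record the highest review count seen for each name
    reviews_max = {}
    for row in rows:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        if name not in reviews_max or n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews

    # Second pass: keep the first row that matches that maximum
    cleaned = [header]
    already_added = set()
    for row in rows:
        name = row[name_index]
        if name not in already_added and int(row[reviews_index]) == reviews_max[name]:
            cleaned.append(row)
            already_added.add(name)
    return cleaned


# Trimmed-down versions of the four Instagram rows printed above
sample = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
print(remove_duplicates(sample))
# [['App', 'Category', 'Rating', 'Reviews'], ['Instagram', 'SOCIAL', '4.5', '66577446']]
```

The `already_added` set matters when several rows tie on the maximum review count: without it, every tied row would be kept.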

