# Prj1 - Profitable App Profiles for the App Store and Google Play Markets

## Project background and goals:

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.
Collecting data for over 4 million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, there are two datasets that seem suitable for our goals:
* A dataset containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can find the dataset [here](https://www.kaggle.com/datasets/lava18/google-play-store-apps).
* A dataset containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.

## Instructions for the assignment:

**1. Open the two datasets we mentioned above, and save both as lists of lists.**


* The App Store dataset is in a CSV file named AppleStore.csv, and the Google Play dataset is in a CSV file named googleplaystore.csv.

* You can open both CSV files directly in the Jupyter Notebook interface you see on the right of the screen.

**2. Explore both datasets using the explore_data() function.**

* Print the first few rows of each dataset.
* Find the number of rows and columns of each dataset (recall that the function assumes the argument for the dataset parameter doesn't have a header row).

**3. Print the column names, and try to identify the columns that could help us with our analysis. Use the documentation for the datasets if you're having trouble understanding what a column describes. Add a link to the documentation for readers if you think the column names aren't descriptive enough.**
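
Printing each column name next to its list index can make it easier to refer to columns later by position. The sketch below is only an illustration: `print_columns` is a hypothetical helper, not part of the assignment, and `sample_header` is the Google Play header row copied in by hand.

```python
def print_columns(header):
    """Print each column name with its index and return a name -> index dict."""
    index_by_name = {}
    for index, name in enumerate(header):
        print(index, name)
        index_by_name[name] = index
    return index_by_name


# Header row of the Google Play dataset, copied in for illustration
sample_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']

columns = print_columns(sample_header)
```

With the dictionary in hand, a later step can write `row[columns['Price']]` instead of a bare `row[7]`.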

Data cleaning:

* Remove rows with errors.
* Remove duplicates.
* Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
* Remove apps that aren't free.
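
The last two cleaning steps can be sketched as small predicate functions. This is a minimal sketch under stated assumptions, not a prescribed method: `is_english` and `is_free` are hypothetical helpers, the tolerance of three non-ASCII characters is an arbitrary choice (it lets names with a stray dash or symbol through), and the handling of a leading `$` assumes paid prices may be written that way.

```python
def is_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters.

    ASCII characters have code points 0-127; the threshold is an assumption.
    """
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


def is_free(price):
    """Return True if a price string such as '0' or '0.0' represents zero.

    A leading '$' is stripped (assumed format for paid apps). Strings that
    cannot be parsed as a number are treated as not free.
    """
    try:
        return float(price.replace('$', '')) == 0.0
    except ValueError:
        return False


# App names taken from the text and the sample rows of the datasets
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('U Launcher Lite – FREE Live Cool Themes, Hide Apps'))
print(is_free('0'), is_free('0.0'))
```

A dataset can then be filtered with a list comprehension, using index 0 for the Google Play app name and index 7 for its price, or index 1 (`track_name`) and index 4 (`price`) for the App Store.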


In [5]:
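from csv import reader

# Google Play: read every row, then split the header from the data.
# The with-block closes the file; the explicit encoding avoids
# platform-dependent decoding of non-English app names.
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
g_data_header = android[0]
g_data = android[1:]

# App Store: same pattern
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
a_data_header = ios[0]
a_data = ios[1:]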

In [6]:
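def explore_data(dataset, start, end, rows_and_columns=False):
    """Print rows start..end-1 of a header-less dataset, optionally with its shape."""
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # print() appends its own newline, so this leaves two blank lines after each row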

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Explore the App Store dataset
print(a_data_header)
print('\n')
explore_data(a_data, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [7]:
# Explore the Google Play dataset
print(g_data_header)
print('\n')
explore_data(g_data, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [14]:
# Data cleaning: the forum discussion reports an incorrect row at index 10472
# of g_data (row 10473 if the header row is counted).
# Print it to double-check that it is indeed malformed before deleting it.

print(g_data_header)
print('\n')
explore_data(g_data, 10472, 10473, True)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10841
Number of columns: 13


In [15]:
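# Delete the malformed row, which sits at index 10472 of the header-less g_data.
# Run this cell only once; re-running it would delete a valid row.
del g_data[10472]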
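
Hard-coded row indexes are easy to get off by one, because the header row shifts every position. As a hedged alternative, malformed rows can be located by comparing each row's length with the header's. The helpers `find_malformed_rows` and `drop_rows` are hypothetical names, and the sample rows are made up for illustration; they only mimic a row that is one column short because its category is missing.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [index for index, row in enumerate(dataset) if len(row) != expected]


def drop_rows(dataset, indexes):
    """Return a new list without the rows at the given indexes."""
    to_drop = set(indexes)
    return [row for index, row in enumerate(dataset) if index not in to_drop]


# Made-up sample, not real data
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],          # category missing, so the row is one column short
    ['App C', 'GAME', '4.5'],
]

bad_indexes = find_malformed_rows(sample_rows, sample_header)
clean_rows = drop_rows(sample_rows, bad_indexes)
print(bad_indexes)
print(len(clean_rows))
```

Building a new list instead of calling `del` also makes the step safe to re-run, since running it twice cannot remove a valid row.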

In [26]:
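# Count duplicate and unique entries in a dataset.

def duplicate_check(ls, key_index=0):
    """Return (number of duplicate entries, number of unique keys).

    The key is the value in column key_index. Column 0 is the app name
    in the Google Play data but the numeric id in the App Store data.
    """
    duplicate_apps = []
    unique_apps = set()  # set membership tests are much faster than list lookups
    for app in ls:
        name = app[key_index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.add(name)

    return len(duplicate_apps), len(unique_apps)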

In [34]:
# Check
print('The number of duplicate and unique apps, respectively, in the Apple list:', duplicate_check(a_data))

print('The number of duplicate and unique apps, respectively, in the Google list:', duplicate_check(g_data))

The number of duplicate and unique apps, respectively, in the Apple list: (0, 7197)
The number of duplicate and unique apps, respectively, in the Google list: (1180, 9660)
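
Counting duplicates is only the first half of the "remove duplicates" step. One possible way to drop them is sketched below. The criterion of keeping the entry with the most reviews is an assumption, not something the assignment text specifies, and `remove_duplicates` is a hypothetical helper. The default indexes match the Google Play columns `App` (0) and `Reviews` (3); the sample rows are made up.

```python
def remove_duplicates(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count.

    Relies on dicts preserving insertion order (Python 3.7+), so apps stay
    in the order of their first appearance.
    """
    best_row = {}
    for row in dataset:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if name not in best_row or reviews > float(best_row[name][reviews_index]):
            best_row[name] = row
    return list(best_row.values())


# Made-up sample rows: name, category, rating, reviews
sample_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.1', '250'],
]

deduplicated = remove_duplicates(sample_rows)
print(len(deduplicated))
print(deduplicated[0])
```

Keeping a deliberate criterion, rather than simply the first occurrence, makes the choice of surviving row reproducible and easy to explain to readers.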
