# App Profile Recommendation

The goal of this project is to find mobile app profiles that are profitable on Google Play and the Apple App Store. As a thought experiment, we play the role of data analysts for a company that builds these apps, and our focus is to give software engineers the information they need to make data-driven decisions about how to build popular, profitable apps.

Our apps are free to download and install, and our main source of revenue is in-app ads. This means the revenue for any given app depends mostly on the number of people who use it: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data that helps our developers understand what types of apps are likely to attract more users.

In [1]:
from csv import reader  # import the CSV reader from the standard library

# Open each file, parse it, and store it as a list of lists (header row included).
# The `with` blocks close the files once they have been read.
with open('AppleStore.csv', encoding='utf8') as apple_file:
    apps_apple = list(reader(apple_file))

with open('googleplaystore.csv', encoding='utf8') as google_file:
    apps_google = list(reader(google_file))

In [2]:
# creating an explore dataset function
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # leaves two blank lines after each row ('\n' plus print's own newline)

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Exploring the data

Below we print the header and the first data row of each dataset and determine the number of rows and columns in each. The row counts include the header row.

In [3]:
explore_data(apps_apple, 0, 2, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


In [4]:
explore_data(apps_google, 0, 2, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


# Isolating appropriate variables

For more information on the variables listed in the Apple file, [click here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

For more information on the variables listed in the Google file, [click here](https://www.kaggle.com/lava18/google-play-store-apps).

In [6]:
print("Apple File Column Names:")
print(apps_apple[0])
print("\n")
print("Google File Column Names:") 
print(apps_google[0])
print("\n")

Apple File Column Names:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Google File Column Names:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
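
To refer to a column by name rather than by position, one option is to build a name-to-index lookup from the header row. The following is a minimal sketch: `column_indexes` is a hypothetical helper that is not part of the original notebook, and the header is copied from the Google output above.

```python
def column_indexes(header):
    """Map each column name to its position in the header row."""
    return {name: index for index, name in enumerate(header)}

# Header copied from the Google Play output above
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']

google_cols = column_indexes(google_header)
print(google_cols['Price'])     # position of the price column
print(google_cols['Installs'])  # position of the install-count column
```

With the real data, the same call would be `column_indexes(apps_google[0])`, which avoids hard-coding positions in later cells.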




# Cleaning the data

As you may recall, our company only builds apps that are free to download and install and that are aimed at English-speaking audiences. To focus the analysis, we therefore remove apps that are not in English and apps that are not free.

First, we need to correct the error reported in [this discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) for the Google dataset: one row does not have the same number of columns as the header, so we locate it and delete it.

In [53]:
# Check that every row has the same number of columns as the header
nvar_google = len(apps_google[0])
missing_index = None
for index, app in enumerate(apps_google):
    if len(app) != nvar_google:
        missing_index = index
        print(missing_index)
# Delete the incomplete row, but only if one was found
# (this keeps the cell safe to run more than once)
if missing_index is not None:
    missing_var = apps_google[missing_index]
    del apps_google[missing_index]

10473
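
The same check can be generalized to report every malformed row rather than a single one. This is a minimal sketch on a toy dataset; `find_malformed_rows` is a hypothetical helper, not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return the indexes of rows whose length differs from the header row."""
    expected = len(dataset[0])
    return [index for index, row in enumerate(dataset) if len(row) != expected]


# Toy dataset: the row for app 'B' is missing its category
toy = [
    ['App', 'Category', 'Rating'],
    ['A', 'GAME', '4.1'],
    ['B', '1.9'],
    ['C', 'TOOLS', '3.8'],
]

bad = find_malformed_rows(toy)
print(bad)

# Delete from the end so that earlier indexes stay valid
for index in reversed(bad):
    del toy[index]
```

Collecting the indexes first and deleting afterwards avoids modifying the list while iterating over it.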
