# Profitable App Profiles for the App Store and Google Play Markets

## Project 1: Which Apps attract users?

### What is the project about?
In this project we look at how many users each app attracts. The apps with the highest user numbers will be the focus of a more detailed exploration.

### What is the goal of the project?
The goal of this project is to analyze the data to help developers understand what types of apps are likely to attract more users.

+ The Apple Store data and the Google Play data use different formats (different columns and column names)

In [42]:
# Open the Apple Store and Google Play Store data
from csv import reader

# The Apple data set #
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
apple = list(read_file)
apple_header = apple[0]
apple_data = apple[1:] 

# The Google data set #
opened_file = open('googleplaystore.csv', encoding='utf8') # explicit encoding avoids garbled characters in app names
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android_data = android[1:]
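
The two loading steps above repeat the same pattern and leave the files open. A minimal sketch of a reusable loader follows. `load_csv` is a hypothetical helper name, not part of the original notebook, and it assumes both files are UTF-8 encoded.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file. Hypothetical helper."""
    # The with-block closes the file even if reading fails
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]
```

With this helper, `apple_header, apple_data = load_csv('AppleStore.csv')` would replace the five Apple lines above, and the same call with `'googleplaystore.csv'` would cover the Google data.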


In [43]:
# Helper that prints a slice of a dataset and, optionally, its number of rows and columns
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start: end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows', len(dataset))
        print('Number of columns', len(dataset[0]))
        
print(android_header)
#print(android_data)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [44]:
# Print the first five rows of the Apple data set using the explore_data function
# explore_data only prints and returns None, so there is nothing to assign
explore_data(apple_data, 0, 5, False)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']




In [45]:
# Print a few rows (indexes 1 to 4) of the Android data set using the explore_data function
explore_data(android_data, 1, 5, False)

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']




In [50]:
# Find the number of rows and columns of each dataset
explore_data(apple_data, 0, 1, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows 7197
Number of columns 16


In [47]:
explore_data(android_data, 0, 1, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows 10841
Number of columns 13


In [58]:
# Print the column names
apple_columns = apple_header
print('The apple columns are:', apple_columns)
print('\n')
google_columns = android_header
print('The google columns are:', google_columns)

The apple columns are: ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The google columns are: ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


### Understanding the columns
+ Apple Store column descriptions: [Apple Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
+ Google Play column descriptions: [Google Play](https://www.kaggle.com/lava18/google-play-store-apps)

These columns could help with the analysis:
+ Apple:
    + price: distinguishes free apps from paid apps
    + rating_count_tot: total number of user ratings, useful for computing percentages
    + user_rating: average user rating, a measure of how much users liked the app
    + prime_genre: the primary genre of the app
+ Google: 
    + Rating: overall user rating of the app
    + Reviews: number of user reviews for the app
    + Installs: number of user downloads for the app
    + Type: paid or free
    + Price: price of the app
    + Genres: genres that the app belongs to
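
Because each row is a plain list, these columns have to be addressed by position. The sketch below looks the positions up from the header rows printed earlier instead of hard-coding them. `column_indexes` is a hypothetical helper name introduced here for illustration.

```python
# Header rows copied from the output printed earlier in this notebook
apple_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

def column_indexes(header, names):
    """Map each column name to its position in the header (hypothetical helper)."""
    return {name: header.index(name) for name in names}

apple_cols = column_indexes(
    apple_header, ['price', 'rating_count_tot', 'user_rating', 'prime_genre'])
android_cols = column_indexes(
    android_header, ['Rating', 'Reviews', 'Installs', 'Type', 'Price', 'Genres'])

print(apple_cols)
print(android_cols)
```

A row value can then be read as `row[apple_cols['prime_genre']]`, which keeps working if the column order ever changes.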

## Data Cleaning
The company builds only *free* apps aimed at an *English-speaking* audience, so we need to:
+ Remove non-English apps
+ Remove apps that aren't free
The Google Play data contains an error that is described in the Kaggle discussion:
+ According to the discussion, the entry at index 10472 is malformed: [Google Play discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)

In [71]:
# Print the rows around index 10472 to locate the malformed entry
# `android` includes the header row, so its indexes may be offset by one
print(android[10470])
print('\n')
print(android[10471])
print('\n')
print(android[10472])
print('\n')
print(android[10473])
print('\n')
print(android[10474])
print('\n')
print(android[10475])

['TownWiFi | Wi-Fi Everywhere', 'COMMUNICATION', '3.9', '2372', '58M', '500,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', '4.2.1', '4.2 and up']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


In [70]:
# Check the row lengths around the index mentioned in the discussion
# The error appears to affect only one row
print(len(android[10470]))
print(len(android[10471]))
print(len(android[10472]))
print(len(android[10473])) # the malformed row: index 10472 once the header is excluded
print(len(android[10474]))
print(len(android[10475]))

13
13
13
12
13
13


In [78]:
# Check every row, including the header, against the header length
headerlength = len(android_header) # defined earlier; computed once instead of per row
for index, row in enumerate(android): # these indexes include the header row
    if len(row) != headerlength:
        print(row)
        print(index)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10473


In [80]:
# Check the rows without the header
# android_data doesn't include the header, so the index matches 10472
headerlength = len(android_header)
for index, row in enumerate(android_data):
    if len(row) != headerlength:
        print(row)
        print(index)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


In [81]:
# Delete the malformed row from the data without the header
# Index 10472 refers to android_data, not to android
del android_data[10472]
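
Deleting by a fixed index removes a different, valid row if the cell is run a second time. A sketch of an alternative that can be re-run safely is shown below. `drop_malformed_rows` is a hypothetical helper, and the rows are toy data for illustration, not taken from the real files.

```python
def drop_malformed_rows(dataset, expected_length):
    """Return a new list containing only the rows with the expected length."""
    return [row for row in dataset if len(row) == expected_length]

# Toy data for illustration only
header = ['App', 'Category', 'Rating']
data = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],                   # malformed: the category is missing
    ['App C', 'COMMUNICATION', '3.4'],
]

cleaned = drop_malformed_rows(data, len(header))
cleaned_again = drop_malformed_rows(cleaned, len(header))  # running it twice changes nothing
print(len(data), len(cleaned), len(cleaned_again))  # prints: 3 2 2
```

Applied to this notebook, the call would be `android_data = drop_malformed_rows(android_data, len(android_header))`.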

In [92]:
# Check the data again and report the result once, after all rows are checked
headerlength = len(android_header)
bad_rows_found = False
for index, row in enumerate(android_data):
    if len(row) != headerlength:
        bad_rows_found = True
        print(row)
        print(index)

# An else inside the loop would print for every good row;
# checking the flag after the loop prints the message only once
if not bad_rows_found:
    print('Data is good')

According to the user community discussion, the Apple Store data contains duplicate entries: [Apple Store discrepancies](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/106176)
+ Check the Apple Store data for duplicates
+ Remove the duplicate entries

In [106]:
# Find duplicate entries in the Apple data
# App names should be unique, so a repeated name signals a duplicate
# If the app name is not yet in apple_app_names, add it there;
# if it is already in the list, add it to apple_name_duplicates
apple_app_names = []
apple_name_duplicates = []
for row in apple_data: # apple_data does not include the header
    app_name = row[1] # track_name is the column at index 1
    if app_name not in apple_app_names:
        apple_app_names.append(app_name)
    else:
        apple_name_duplicates.append(app_name)
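
Checking membership in a list rescans the list for every row. The sketch below is an alternative that counts names in a single pass with `collections.Counter`. `find_duplicate_names` is a hypothetical helper, and the sample rows are toy data for illustration, not rows from the real file.

```python
from collections import Counter

def find_duplicate_names(rows, name_index):
    """Return {name: count} for every name that appears more than once."""
    counts = Counter(row[name_index] for row in rows)
    return {name: count for name, count in counts.items() if count > 1}

# Toy rows for illustration only: [id, track_name]
sample_rows = [
    ['1', 'Facebook'],
    ['2', 'Instagram'],
    ['3', 'Facebook'],
]

print(find_duplicate_names(sample_rows, 1))  # prints: {'Facebook': 2}
```

For the Apple data, the call would be `find_duplicate_names(apple_data, 1)`, since `track_name` is the column at index 1.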

