## Profitable App Profiles
* We play the role of analysts at a company that builds mobile apps. The company only makes free apps, and all of its revenue comes from in-app ads, so revenue grows with the number of users who view and engage with those ads.

* Our goal is to analyze which kinds of apps attract the most users, so that the developers can decide what to build.

[Documentation for Apple Store DataSet](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

[Documentation for Google Play Store DataSet](https://www.kaggle.com/lava18/google-play-store-apps/home)

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
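# Read each CSV file as a list of raw text lines (header line first);
# the with-blocks close the files once they have been read
with open('AppleStore.csv',encoding='utf8') as apple_open:
    apple_list=apple_open.readlines()
with open('googleplaystore.csv',encoding='utf8') as google_open:
    google_list=google_open.readlines()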
# print('Printing column names from dataset of Apple Store...\n')
# print(apple_list[0],'\n')
# print('Printing Apple Store dataset...\n')
# explore_data(apple_list,1,7,True)
# print('\n\nPrinting column names from dataset of Google Play Store...\n')
# print(google_list[0],'\n')
# print('\n\nPrinting Google Play Store dataset...\n')
# explore_data(google_list,1,7,True)

#### Identifying and Removing Duplicate Apps
* We identify the duplicate apps in the Google Play Store data set and display a few examples.
* If an app named ABC appears N times, we keep the entry with the highest number of reviews, since it is most likely the most recent and most reliable record, and remove the other rows with the same app name.

In [3]:
duplicate_google_apps=[]
unique_google_apps=[]

for app in google_list[1:]:
    each_app=app.split(",")
    name=each_app[0]
    if name in unique_google_apps:
        duplicate_google_apps.append(name)
    else:
        unique_google_apps.append(name)

# print('\nNumber of duplicate apps in Google Store dataset : ',len(duplicate_google_apps))
# print('Examples of some duplicate apps : ',duplicate_google_apps[:10])        

In [4]:
def uniqueMaker(received_list):
    #best_rows maps app name -> (reviews, raw row) of its most-reviewed entry
    best_rows={}
    for app in received_list[1:]:
        each_app=app.split(",")
        name=each_app[0]
        try:
            #assigning reviews to n_reviews
            n_reviews=float(each_app[3])
        except ValueError:
            #skip rows whose fourth field is not numeric
            continue
        #keep the row itself, not just the count, so that the entry
        #with the most reviews is the one that is returned
        if (name not in best_rows) or (best_rows[name][0]<n_reviews):
            best_rows[name]=(n_reviews,app)
    return [row for _,row in best_rows.values()]

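#unique_google_list stores entire records, after cleaning
unique_google_list=uniqueMaker(google_list) 
#NOTE: uniqueMaker assumes the Google Play layout (name in column 0,
#Reviews in column 3); AppleStore.csv is laid out differently, so the
#Apple result should not be relied on until the indices are adapted
unique_apple_list=uniqueMaker(apple_list) 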
print("Records in Google Play data set : ",len(google_list))
print("Unique records after cleaning : ",len(unique_google_list))

Records in Google Play data set :  10842
Unique records after cleaning :  9471
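
As a quick sanity check of the "keep the entry with the most reviews" rule, here is a minimal sketch on toy data. The helper name `keep_most_reviewed` and the sample tuples are hypothetical and are not taken from the dataset; the rows are assumed to be already split into `(name, reviews)` pairs.

```python
def keep_most_reviewed(rows):
    """Return one (name, reviews) pair per app name, keeping the highest review count."""
    best = {}
    for name, reviews in rows:
        # Replace the stored count only when this entry has more reviews
        if name not in best or reviews > best[name]:
            best[name] = reviews
    return list(best.items())

# Hypothetical sample data: "ABC" appears three times
sample = [("ABC", 120), ("XYZ", 40), ("ABC", 310), ("ABC", 95)]
deduplicated = keep_most_reviewed(sample)
print(deduplicated)  # [('ABC', 310), ('XYZ', 40)]
```

Because a dictionary keeps one value per key, the number of unique apps equals the number of keys, and each key ends up holding its largest review count.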


#### Removing Apps which have a Non-English Name
* We check each app name for characters whose code point is greater than 127, i.e. outside the ASCII range.
* Apps with 3 or more such characters in their name are most likely non-English and are removed. This heuristic is not perfect, but allowing up to two such characters keeps most English apps whose names contain an occasional special character or emoji.

In [5]:
def englishCheck(received_list):
    sending_list=[]
    for row in received_list:
        parsed_row=row.split(",")
        appname=parsed_row[0]
        i=0
        for ch in appname:
            if i==3:
                break
            if ord(ch)>127:
                i+=1        
        if i!=3:
            sending_list.append(row)
    return sending_list


#calling englishCheck on unique app lists
unq_eng_google_list=englishCheck(unique_google_list)

#for apple_list, as the fields are different, we will have to write
#separate code that matches its column layout

print("Total Apps in Google Unique Dataset : ",len(unique_google_list))
print("Total English Apps in Google Unique Dataset : ",len(unq_eng_google_list))

Total Apps in Google Unique Dataset :  9471
Total English Apps in Google Unique Dataset :  9413


#### Segregating Free Apps from the cleaned list above
* Below we copy the free apps into a separate list, starting from the unique, English-named app list returned by the englishCheck() function.

In [6]:
def freeApps(received_list):
    sending_list=[]
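    for row in received_list:
        #NOTE: split(",") also splits inside quoted fields such as "10,000+",
        #so index 7 holds the Type value ('Free'/'Paid') only for rows with
        #exactly one extra comma before it; other rows are not matched
        parsed_row=row.split(",")
        price=parsed_row[7]
        if price=='Free':
            sending_list.append(row)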
    return sending_list
unq_eng_free_google_list=freeApps(unq_eng_google_list)
print("Total English based apps in Google dataset : ",len(unq_eng_google_list))
print("Total Free English based apps in Google dataset : ",len(unq_eng_free_google_list))

Total English based apps in Google dataset :  9413
Total Free English based apps in Google dataset :  3848
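
The filter above splits each row with `split(",")`, which also cuts through quoted fields that contain commas. The sketch below contrasts that with Python's standard `csv` module on a single line. The sample line is made up for illustration and only follows the Google Play column order (`App, Category, Rating, Reviews, Size, Installs, Type, ...`).

```python
import csv

# Illustrative row in the Google Play column layout (values are made up)
sample_line = ('Example App,TOOLS,4.2,1500,12M,"1,000,000+",Free,0,'
               'Everyone,Tools,"January 7, 2018",1.0,4.0 and up')

naive = sample_line.split(",")             # splits inside the quoted fields too
parsed = next(csv.reader([sample_line]))   # respects the quotes

print(len(naive), naive[7])    # 16 000+"
print(len(parsed), parsed[6])  # 13 Free
print(parsed[5])               # 1,000,000+
```

With `csv.reader`, the `Type` column is always at index 6 no matter how many commas appear in `Installs` or in the app name, so a free-app check could compare `parsed[6]` with `'Free'`. Applying this to the full data set would change the counts reported above.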


#### What have we done so far?
* We have removed inaccurate data from both lists
* We used uniqueMaker() to remove duplicate app entries based on app name; when duplicates were found, we kept only the entry with the most reviews
* We used englishCheck() to remove non-English apps
* We used freeApps() to remove the paid apps

#### Plan for profitable apps
* As we only build free apps and earn through ad revenue, we will first build a minimal app for the Google Play Store and enhance it if it gets a good response from users
* If the app is profitable within the first 6 months, we will build an iOS version of it

In [7]:
#freq_table creates a frequency table i.e. a dictionary
def freq_table(dataset,index):
    send_back={}
    total=0
    for raw_row in dataset:
        row=raw_row.split(",")
        total+=1
        val=row[index]
        if val in send_back:
            send_back[val]+=1
        else:
            send_back[val]=1
    table_percentages = {}
    for key in send_back:
        percentage = (send_back[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages
    
def display_table(dataset, index):
    #dataset=dataset_in.split(",")
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

#display_table(unq_eng_free_google_list,-5)       
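
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is a minimal sketch under the assumption that the column values have already been extracted into a plain list; `percentage_table` and `sample_genres` are hypothetical names with toy data.

```python
from collections import Counter

def percentage_table(values):
    """Map each distinct value to its share of the list, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

# Toy input standing in for one column of the dataset
sample_genres = ["Tools", "Tools", "Finance", "Arcade"]
table = percentage_table(sample_genres)

# Print the entries from most to least frequent
for genre, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', pct)
```

Sorting on the percentage with a `key` function avoids building the intermediate list of `(percentage, key)` tuples.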

In [8]:
#print(google_list[0])


#works like freq_table(), but returns the raw counts instead of percentages
def freq_table_values(dataset,index):
    send_back={}
    for raw_row in dataset:
        row=raw_row.split(",")
        val=row[index]
        if val in send_back:
            send_back[val]+=1
        else:
            send_back[val]=1
    return send_back


unique_google_genres=freq_table_values(unq_eng_free_google_list,10)

#print(unique_google_genres)
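for genre in unique_google_genres:
    total=0
    len_category=0
    for raw_row in unq_eng_free_google_list:
        row=raw_row.split(",")
        #storing the genre for the row (index 10 after the naive split)
        genre_app=row[10]
        if genre_app!=genre:
            continue
        try:
            #NOTE: after split(","), row[5] is only a fragment of a quoted
            #value such as "10,000+", so the averages are not full install counts
            installs=row[5].replace('+','')
            installs=installs.replace(',','')
            installs=installs.replace('"','')
            installs=installs.replace('M','')
            total+=float(installs)
            len_category+=1
        except ValueError:
            #skip rows whose fragment is not numeric
            continue
    #guard against genres for which no row could be parsed
    if len_category:
        avg_installs=total/len_category
        print('avg installs for ',genre,' :',avg_installs)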


avg installs for  Art & Design;Action & Adventure  : 100.0
avg installs for  Word  : 28.666666666666668
avg installs for  Beauty  : 99.25
avg installs for  Maps & Navigation  : 52.89090909090909
avg installs for  Music  : 127.5
avg installs for  Entertainment;Music & Video  : 300.0
avg installs for  Finance  : 62.21666666666667
avg installs for  Tools  : 70.12968299711815
avg installs for  Education;Music & Video  : 100.0
avg installs for  Personalization  : 90.54838709677419
avg installs for  Arcade  : 178.1081081081081
avg installs for  Art & Design  : 90.02564102564102
avg installs for  Parenting;Education  : 42.5
avg installs for  Puzzle;Action & Adventure  : 100.0
avg installs for  Books & Reference;Education  : 1.0
avg installs for  Education;Creativity  : 500.0
avg installs for  News & Magazines  : 57.8
avg installs for  Art & Design;Creativity  : 285.0
avg installs for  Photography  : 111.0
avg installs for  Card  : 119.0
avg installs for  Educational;Brain Games  : 300.0
avg i
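
Install counts in the Google Play data are stored as text such as `10,000+`. The sketch below shows one way to turn such a value into an integer, assuming the complete field has already been isolated (for example with the `csv` module). `parse_installs` is a hypothetical helper, not part of the notebook.

```python
def parse_installs(text):
    """Convert an install label such as '"1,000,000+"' to an integer."""
    # Drop surrounding whitespace and quotes, then the separators and the plus sign
    cleaned = text.strip().strip('"').replace(',', '').replace('+', '')
    return int(cleaned)

print(parse_installs('"1,000,000+"'))  # 1000000
print(parse_installs('500+'))          # 500
```

A non-numeric value raises `ValueError`, which the caller can catch to skip malformed rows.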

In [7]:
print(google_list[0])

App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver

