## Develop Relevant Apps 
### Analyse Android and iOS mobile apps data and explore opportunities to guide app development 
___


### Introduction

The goal of this project is to use fundamental data analysis techniques to derive insights from existing Android and iOS app data. These insights help a company that builds mobile apps focus its development on the types of apps that are most likely to be profitable.

To achieve our goal, we identify the **kinds of apps that attract the most users and are therefore more profitable for the company.** The data comes from the **Google Play Store [Link](https://www.kaggle.com/lava18/google-play-store-apps) and the Apple App Store [Link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).**

We found that .....

___

Let's start by exploring the dataset:

In [4]:
#open each data set, read it and store it as a list of lists
from csv import reader

#'with' closes each file once it has been read; utf8 is assumed because app names contain non-ASCII characters
with open('AppleStore.csv', encoding='utf8', newline='') as apps_file:
    apps_data = list(reader(apps_file))

with open('googleplaystore.csv', encoding='utf8', newline='') as play_file:
    play_data = list(reader(play_file))

We will define a function `explore_data` that takes the dataset, a start index, an end index, and an optional flag that controls whether the number of rows and columns is shown.

The function prints the rows from the start index up to (but not including) the end index and, if requested, the number of rows and columns in the dataset.

In [5]:
#explore datasets
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Explore iOS apps data:

In [6]:
explore_data(apps_data, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


Explore PlayStore apps data:

In [7]:
explore_data(play_data, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


Print Column names of both datasets

In [8]:
#Column Names 
print("Play Store columns:")
explore_data(play_data, 0, 1)

print("Apple Store columns:")
explore_data(apps_data, 0, 1)

Play Store columns:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Apple Store columns:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




### Data Cleaning

Below we perform data pre-processing tasks.
___

##### Remove Missing Data
From the Kaggle discussion [Link](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), we learned that one row is missing a value in one of its columns. In our list this row sits at index `10473` (not `10472`, because our data includes the header row). We remove it so that it does not affect the analysis.

In [9]:
#Check if there is missing data - row 10473 has missing data
print(len(play_data[0]))
print(len(play_data[10473]))

13
12


Row `10473` is shorter than the header row, which confirms that it is missing a value, so we delete it below:

In [10]:
del play_data[10473]
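
The length check above targets one known row. As a minimal sketch, the same idea can be applied to every row to list all rows whose length differs from the header. The helper name `find_malformed_rows` and the toy rows are assumptions made for illustration; they are not part of the original notebook or dataset.

```python
# Toy rows standing in for play_data; the last row is deliberately one field short.
sample_rows = [
    ['App', 'Category', 'Rating'],
    ['Box', 'BUSINESS', '4.2'],
    ['Instagram', 'SOCIAL', '4.5'],
    ['Broken row', '1.9'],
]

def find_malformed_rows(dataset):
    """Return (index, length) for every row whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

malformed = find_malformed_rows(sample_rows)
print(malformed)
```

If several rows were reported, deleting them from the highest index downwards would keep the remaining indices valid.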

### Handle duplicate data in PlayStore app
The data contains multiple rows for the same app. To handle this we will remove duplicate rows by creating a filtered list with unique apps data.

In [11]:
#Check if duplicate rows exist
duplicate_apps = []
unique_apps = []

for app in play_data[1:]: #start with index 1 to not include header at index 0
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print("Number of duplicate apps: ", len(duplicate_apps))
print("Example of duplicate apps: ", duplicate_apps[:3])

print("Duplicate rows with same app name 'Box' :")
for app in play_data[1:]:
    #printing duplicate rows with same app name Box
    if app[0] == 'Box':
        print(app)
print("\n")
print("Duplicate rows with same app name 'Instagram' :")
for app in play_data[1:]:
    #printing duplicate rows with same app name Instagram
    if app[0] == 'Instagram':
        print(app)

Number of duplicate apps:  1181
Example of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']
Duplicate rows with same app name 'Box' :
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


Duplicate rows with same app name 'Instagram' :
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0',

From the above, we can see that 1181 rows are duplicates.

**Criteria to remove duplicates**:
1. For apps such as Instagram, the duplicate rows differ in their 'Reviews' value. The row with the highest review count is most likely the most recent one, so we keep it and delete the others.
2. For apps whose duplicate rows are identical, such as 'Box' printed above, we keep one row and remove the rest.

The code below handles both cases by keeping the row with the maximum review count and deleting the rest.

In [12]:
#find the entry with max reviews and store in dictionary {name:review}
reviews_max = {}
for app in play_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

9659


In [13]:
# remove duplicates using reviews_max dictionary
android_clean = []
already_added = []
for app in play_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name not in already_added and n_reviews == reviews_max[name]:
        already_added.append(name)
        android_clean.append(app)
        
print(len(android_clean))

9659
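
The two cells above check membership in a list (`already_added`), which gets slower as the list grows. A possible single-pass alternative, sketched here under the assumption that keeping the first row that reaches the maximum review count is acceptable, stores the best row per app name in a dictionary. The function name `dedupe_by_max_reviews` and the toy rows are illustrative only.

```python
# Toy rows in the Play Store column order (name at index 0, review count at index 3).
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
]

def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row that has the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the earlier row when review counts tie (the 'Box' case).
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

deduped = dedupe_by_max_reviews(sample_rows)
for row in deduped:
    print(row)
```

The rows kept should match the two-step approach, although their order follows the first appearance of each app name rather than the position of the retained row.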


### Remove non-English apps
As our company develops English-language apps, analysing non-English apps is of no use, so we keep only the English apps and store them in a list. We will define a function that returns False if more than 3 characters in the app name fall outside the ASCII range (code points 0 to 127), which covers the characters commonly used in English text.

In [17]:
def check_english(name):
    count = 0
    for char in name:
        if ord(char) > 127: #code points above 127 fall outside ASCII, so they are not standard English characters
            count+=1
            if count > 3:
                return False
    return True

In [18]:
#filter non-english apps from both datasets
play_data_eng = []
for app in android_clean:
    if check_english(app[0]):
        play_data_eng.append(app)
        
app_data_eng = []
for app in apps_data[1:]:
    if check_english(app[1]):
        app_data_eng.append(app)
        
print("Rows remaining in playstore data:", len(play_data_eng))
print("Rows remaining in app store data:", len(app_data_eng))

Rows remaining in playstore data: 9614
Rows remaining in app store data: 6183


### Filter free apps 
As the company only builds free apps, its main source of income is in-app ads. The dataset currently contains both free and paid apps, so we keep only the free ones.

In [19]:
free_apps_play = []

for app in play_data_eng:
    app_type = app[6] #Using the Type column having 2 values - 'Free' and 'Paid'
    if app_type == 'Free':
        free_apps_play.append(app)

free_apps_apple = []

for app in app_data_eng:
    price = float(app[4])
    if price == 0.0:
        free_apps_apple.append(app)
        
print(len(free_apps_play))
print(len(free_apps_apple))
    

8863
3222


After the cleaning process, we are left with 8863 Android apps and 3222 iOS apps. Let's go ahead and analyse them.
___

## Analysis
**Validation strategy for an App :**
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Our end goal is to add the app on both Google Play and the App Store, therefore we will find app profiles that are successful on both markets.

We will start by seeing the most common genres that exist in each market.


In [20]:
# freq_table returns the frequency distribution (as percentages) of a given column in the dataset
def freq_table(dataset, index):
    freq = {}
    for data in dataset:
        param = data[index]
        if param in freq:
            freq[param] += 1
        else:
            freq[param] = 1
    
    for f in freq:
        freq[f] = (freq[f]*100)/len(dataset)
    return freq
        
#display_table calls freq_table and prints the frequencies in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
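
For comparison, the counting step can also be written with `collections.Counter` from the standard library. This is a minimal sketch, not the notebook's method; `freq_table_counter` and the toy rows are names and data made up for the example.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {key: count * 100 / total for key, count in counts.items()}

# Toy rows: the genre sits at index 1.
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]

table = freq_table_counter(sample_rows, 1)
# Sort by percentage, highest first, as display_table does.
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```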

We will display the frequency distribution of genres in the Play Store and iOS apps with the help of the above functions.
For Play Store apps, two columns matter: `Category` and `Genres`.

According to the dataset page [Link](https://www.kaggle.com/lava18/google-play-store-apps), the difference is that an app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
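
In the rows printed earlier, an app with several genres stores them in one string separated by `;` (for example `'Art & Design;Pretend Play'`). A frequency table on the raw column treats each combination as its own genre. The sketch below shows one possible way to count individual genre labels instead; `genre_label_counts` is a hypothetical helper and the sample values are a small hand-picked list.

```python
from collections import Counter

def genre_label_counts(genre_values, sep=';'):
    """Count each individual genre label, splitting multi-genre strings on `sep`."""
    counts = Counter()
    for value in genre_values:
        for label in value.split(sep):
            counts[label.strip()] += 1
    return counts

sample_genres = ['Art & Design', 'Art & Design;Pretend Play', 'Art & Design', 'Tools']

label_counts = genre_label_counts(sample_genres)
print(label_counts.most_common())
```

With this counting, the labels no longer sum to the number of apps, so percentages would need to be read as "share of apps carrying this label".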

In [21]:
print("Category in Play dataset:")
display_table(free_apps_play, 1)

Category in Play dataset:
FAMILY : 18.8987927338373
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.700778517432021
MEDICAL : 3.5315355974275078
SPORTS : 3.3961412614238973
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376733
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.2452894053932075
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496447
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916394
AUTO_AND_VEHICLES : 0.9251946293580052
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189553
PARENTING : 0.6544059573

In [22]:
print("Genres in Play dataset:")
display_table(free_apps_play, 9)

Genres in Play dataset:
Tools : 8.450863138892023
Entertainment : 6.070179397495205
Education : 5.348076272142615
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.700778517432021
Medical : 3.5315355974275078
Sports : 3.4638384294257025
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376733
Travel & Local : 2.324269434728647
Shopping : 2.2452894053932075
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496447
Arcade : 1.8503892587160105
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.1282861333634209
Racing : 0.9928917973598105
Role Playing : 0.9364774906916394
Libraries & Demo : 0.9364774906916394
Auto & Vehic

In [24]:
print("Genres in Apple dataset:")
display_table(free_apps_apple, 11)

Genres in Apple dataset:
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.6623215394165114
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.017380509000621
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


**Important Observations from the Frequency Distribution of the Apple Store Apps**:
* The most common genre among App Store apps is *Games* (58.16%), followed by *Entertainment* (7.88%). The gap between the two is huge, which shows that `Games` dominate the iOS App Store market.
* Fun-oriented genres such as `Games`, `Entertainment`, and `Photo & Video` have higher frequencies than practical genres such as `Weather`, `Medical`, and `Finance`.

**Important Observations from the Google Play dataset**:
* The `Family` category and the `Tools` genre have the highest frequencies, at 18.90% and 8.45% respectively.
* `Tools`, `Entertainment`, and `Education` are the top 3 genres and lie close together (8.45%, 6.07%, and 5.35%), unlike iOS, where Games is the clearly dominant genre.
* The `Game` category has the second-highest frequency (9.73%).
* Another difference from the iOS apps is that the Play Store apps show a better balance between practical and entertainment apps.
 
___
Now we will investigate how many users the apps in these genres have. For Google Play, the `Installs` column gives the number of users who installed each app. The iOS data has no such column, so we use `rating_count_tot`, the total number of user ratings an app has received in the App Store dataset, as a proxy.
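
Per-genre averages of this kind can be computed in a single pass over the rows by accumulating a total and a count per genre. The following is a sketch under the assumption that each row carries the grouping key and a numeric string; `average_by_key` and the toy rows are illustrative and not part of the notebook.

```python
from collections import defaultdict

def average_by_key(rows, key_idx, value_idx):
    """Return {key: mean of the numeric column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_idx]
        totals[key] += float(row[value_idx])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Toy rows: genre at index 0, rating count (as a string) at index 1.
sample_apps = [['Games', '100'], ['Games', '300'], ['Weather', '50']]

averages = average_by_key(sample_apps, 0, 1)
# Print genres from highest to lowest average.
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```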

### Most Popular Apps by Genre on the App Store

In [25]:
#This cell finds the average number of user ratings (our proxy for users) per genre for iOS apps
unique_genres = [genre for genre in freq_table(free_apps_apple, 11)]
for genre in unique_genres:
    total = 0 #the sum of user ratings
    len_genre = 0 #number of apps specific to each genre
    for app in free_apps_apple:
        genre_app = app[11]
        #print(app)
        if genre_app == genre:
            total = total + float(app[5])
            len_genre += 1
    
    average_users = total/len_genre
    print("Genre :", genre)
    print("Average Ratings", average_users)
    print("\n")

Genre : Education
Average Ratings 7003.983050847458


Genre : Catalogs
Average Ratings 4004.0


Genre : Business
Average Ratings 7491.117647058823


Genre : Productivity
Average Ratings 21028.410714285714


Genre : Book
Average Ratings 39758.5


Genre : Weather
Average Ratings 52279.892857142855


Genre : News
Average Ratings 21248.023255813954


Genre : Sports
Average Ratings 23008.898550724636


Genre : Navigation
Average Ratings 86090.33333333333


Genre : Entertainment
Average Ratings 14029.830708661417


Genre : Medical
Average Ratings 612.0


Genre : Finance
Average Ratings 31467.944444444445


Genre : Shopping
Average Ratings 26919.690476190477


Genre : Travel
Average Ratings 28243.8


Genre : Reference
Average Ratings 74942.11111111111


Genre : Utilities
Average Ratings 18684.456790123455


Genre : Music
Average Ratings 57326.530303030304


Genre : Games
Average Ratings 22788.6696905016


Genre : Lifestyle
Average Ratings 16485.764705882353


Genre : Health & Fitness
Average 

`Navigation` apps have the highest average number of user ratings (86090). However, this can be attributed to Google Maps, which dominates this category. Similarly, `Social Networking` has a high number of ratings, but this market is also dominated by players such as WhatsApp, Facebook, and Instagram.

Making apps in these domains is therefore risky despite the high user traction, as very strong and stable players already exist. Alternatively, `Reference` and `Book` also have a good number of ratings (about 75,000 and 40,000 respectively). Although there are a few strong players like Kindle, this part of the app market can be explored further, as users are shifting from traditional bookstores towards online books.
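
One way to check whether a genre's average is driven by a single dominant app is to compute the share of the genre's total ratings held by its largest app. The sketch below uses made-up app names and numbers purely for illustration; `top_app_share` is a hypothetical helper, and none of the figures come from the dataset.

```python
# Hypothetical (name, genre, rating_count) tuples, not taken from the dataset.
sample_apps = [
    ('MapGiant', 'Navigation', 900),
    ('SmallMaps', 'Navigation', 60),
    ('TinyNav', 'Navigation', 40),
    ('ReaderA', 'Book', 300),
    ('ReaderB', 'Book', 250),
    ('ReaderC', 'Book', 250),
]

def top_app_share(apps, genre):
    """Fraction of a genre's total rating count that belongs to its largest app."""
    counts = [count for _, app_genre, count in apps if app_genre == genre]
    total = sum(counts)
    if total == 0:
        return 0.0
    return max(counts) / total

nav_share = top_app_share(sample_apps, 'Navigation')
book_share = top_app_share(sample_apps, 'Book')
print('Navigation:', nav_share)
print('Book:', book_share)
```

A share close to 1 suggests the average says more about one app than about the genre as a whole.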

### Most Popular Apps by Genre on Google Play
We use the `Installs` column, which is currently stored as a string. We convert it to an integer so that we can compute the average number of installs.
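
As a small worked example of that conversion, the helper below strips the `+` and `,` characters and converts the result to an integer. `parse_installs` is a name introduced here for illustration. Note that a value such as `'10,000+'` means at least 10,000 installs, so any average built on these numbers is a lower-bound estimate.

```python
def parse_installs(value):
    """Convert an Installs string such as '10,000+' to the integer 10000."""
    return int(value.replace(',', '').replace('+', ''))

# Values copied from the Installs buckets that appear in the dataset.
raw_values = ['10,000+', '500,000+', '1,000,000,000+', '0+']

parsed = [parse_installs(value) for value in raw_values]
print(parsed)
```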

In [26]:
#show the freq dist of installs column
display_table(free_apps_play, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.553650005641432
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605325
1,000+ : 8.394448832223851
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.77265034412727
5,000+ : 4.5131445334536835
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543947
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.04513144533453684


In [28]:
#installs are currently strings: remove '+' and ',' and convert to int
unique_category = [cat for cat in freq_table(free_apps_play, 1)]
for cat in unique_category:
    total = 0
    len_cat = 0
    for app in free_apps_play:
        category_app = app[1]
        if category_app == cat:
            num_installs = app[5]
            num_installs = num_installs.replace('+','')
            num_installs = num_installs.replace(',','')
            num_installs = int(num_installs)
            total = total + num_installs
            len_cat += 1
    average = total/len_cat
    print("App Genre:", cat)
    print("Average installs", average)
    print("\n")

App Genre: PARENTING
Average installs 542603.6206896552


App Genre: TOOLS
Average installs 10801391.298666667


App Genre: COMICS
Average installs 817657.2727272727


App Genre: FINANCE
Average installs 1387692.475609756


App Genre: BEAUTY
Average installs 513151.88679245283


App Genre: AUTO_AND_VEHICLES
Average installs 647317.8170731707


App Genre: MAPS_AND_NAVIGATION
Average installs 4056941.7741935486


App Genre: VIDEO_PLAYERS
Average installs 24727872.452830188


App Genre: GAME
Average installs 15588015.603248259


App Genre: SOCIAL
Average installs 23253652.127118643


App Genre: ENTERTAINMENT
Average installs 11640705.88235294


App Genre: BUSINESS
Average installs 1712290.1474201474


App Genre: FAMILY
Average installs 3697848.1731343283


App Genre: SHOPPING
Average installs 7036877.311557789


App Genre: HEALTH_AND_FITNESS
Average installs 4188821.9853479853


App Genre: HOUSE_AND_HOME
Average installs 1331540.5616438356


App Genre: EVENTS
Average installs 253542.22222

From the above, `Communication` apps have the highest average number of installs (38456119), with `Video Players`, `Social`, and `Game` following. However, the high figure for `Video Players` can mainly be attributed to YouTube.
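
When a few giant apps skew a category's average like this, two simple checks are to compare the mean with the median and to recompute the mean without the largest apps. The sketch below uses hypothetical install counts and an assumed cut-off of 100 million installs; both are illustrative choices, not values from the analysis.

```python
from statistics import mean, median

# Hypothetical install counts for one category; one very large app skews the mean.
installs = [1_000_000_000, 5_000_000, 1_000_000, 500_000, 100_000]

cap = 100_000_000  # assumed cut-off above which an app counts as a "giant"
trimmed = [n for n in installs if n < cap]

mean_all = mean(installs)
median_all = median(installs)
mean_trimmed = mean(trimmed)

print('Mean (all apps):', mean_all)
print('Median (all apps):', median_all)
print('Mean (giants removed):', mean_trimmed)
```

A large gap between the mean and the median, or between the two means, indicates that the category average mostly reflects its biggest apps.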

___

### Conclusion

I would recommend `Books and Reference` as a potential app market. Its average number of installs (8767811) is lower than in the leading categories; however, the average number of user ratings for `Reference` on iOS is quite high (74942).

Only 0.56% of the iOS apps belong to the Reference genre and 0.43% to the Book genre. In the Play Store, too, the share is just 2.14%. It seems that few such apps are available while demand is high, as many users install and rate them. Thus, the company should develop apps that bring more books to users' devices.