# Profitable App Profiles for the App Store and Google Play Markets

- We only build free apps, and our revenue comes from in-app ads. Therefore, the revenue of any given app is mostly determined by its number of users.
- The goal of this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

### Dataset
- A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. 
- A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. 

### Opening and Exploring Datasets



In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

        
# dataset: list of lists
# start, end: integers, the starting and ending indices of a slice from dataset


In [2]:
file = open('/Users/kitaeklee/Desktop/Data/AppStore/AppleStore.csv')

from csv import reader
read_file = reader(file)
data_list = list(read_file)


In [3]:
file2 = open('/Users/kitaeklee/Desktop/Data/AppStore/googleplaystore.csv')

from csv import reader
read_file2 = reader(file2)
data_list2 = list(read_file2)
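
The cells above leave both file handles open. A minimal sketch of a loader that closes the file automatically is shown below. `load_csv` is a hypothetical helper name, the `utf8` encoding is an assumption about the CSV files, and the demonstration uses a small temporary file rather than the real datasets.

```python
import csv
import os
import tempfile

def load_csv(path, encoding='utf8'):
    """Read a CSV file into a list of lists (hypothetical helper)."""
    # newline='' is what the csv module recommends when opening files
    with open(path, newline='', encoding=encoding) as f:
        return list(csv.reader(f))

# Demonstration on a small temporary file (made-up content)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf8') as tmp:
    tmp.write('App,Price\nDemo App,0\n')
    demo_path = tmp.name

demo_rows = load_csv(demo_path)
os.remove(demo_path)
print(demo_rows)
```

The result has the same list-of-lists shape that `explore_data` expects, with the header as the first row.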

In [4]:
explore_data(data_list, 1, 3)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']




In [5]:
explore_data(data_list2, 1, 3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




In [6]:
# column names of first dataset
explore_data(data_list, 0, 1)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




In [7]:
# column names of second dataset
explore_data(data_list2, 0, 1)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']




# Data Cleaning

### Apple Store Dataset

In [8]:
import pandas as pd

# Create dataframe using list of lists of the first dataset
df = pd.DataFrame(data_list)
header = ['','id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
df = df[1:]
df.columns = header
df = df.drop(df.columns[0], axis=1)

df

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4,4.5,6.3.5,4+,Games,38,5,10,1
2,281796108,Evernote - stay organized,158578688,USD,0,161065,26,4,3.5,8.2.2,4+,Productivity,37,5,23,1
3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
4,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0,262241,649,4,4.5,5.10.0,12+,Shopping,37,5,9,1
5,282935706,Bible,92774400,USD,0,985920,5320,4.5,5,7.5.1,4+,Reference,37,5,45,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7193,1187617475,Kubik,126644224,USD,0,142,75,4.5,4.5,1.3,4+,Games,38,5,1,1
7194,1187682390,VR Roller-Coaster,120760320,USD,0,30,30,4.5,4.5,0.9,4+,Games,38,0,1,1
7195,1187779532,Bret Michaels Emojis + Lyric Keyboard,111322112,USD,1.99,15,0,4.5,0,1.0.2,9+,Utilities,37,1,1,1
7196,1187838770,VR Roller Coaster World - Virtual Reality,97235968,USD,0,85,32,4.5,4.5,1.0.15,12+,Games,38,0,2,1


Check the number of duplicates, i.e. rows that share an application name with an earlier row.

In [9]:
# Number of duplicates in the AppStore dataset
# duplicated() flags every repeat of a track_name after its first occurrence
num = int(df.duplicated(subset='track_name').sum())
print("Number of duplicates:", num)


Number of duplicates: 2


There are 2 duplicates. Remove them based on the application name, keeping the first occurrence.
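
To make the effect of `keep='first'` concrete, here is a small sketch on made-up rows (the `toy` dataframe is not part of the project data). It counts the duplicates and then removes them the same way as the notebook does.

```python
import pandas as pd

# Toy data (made up for illustration), with one repeated app name
toy = pd.DataFrame({
    'track_name': ['Alpha', 'Beta', 'Alpha', 'Gamma'],
    'rating_count_tot': ['10', '20', '30', '40'],
})

# duplicated() marks every repeat after the first occurrence as True
is_duplicate = toy.duplicated(subset='track_name')
n_duplicates = int(is_duplicate.sum())
print('Number of duplicates:', n_duplicates)

# keep='first' retains the first row of each group of duplicates
deduplicated = toy.drop_duplicates(subset='track_name', keep='first')
print(deduplicated)
```

Keeping the first occurrence is a simple rule but an arbitrary one. Another option, not used in this notebook, would be to keep the row with the highest number of ratings.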

In [10]:
# Drop duplicates
df.drop_duplicates(subset='track_name', keep='first', inplace=True)

df

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4,4.5,6.3.5,4+,Games,38,5,10,1
2,281796108,Evernote - stay organized,158578688,USD,0,161065,26,4,3.5,8.2.2,4+,Productivity,37,5,23,1
3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
4,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0,262241,649,4,4.5,5.10.0,12+,Shopping,37,5,9,1
5,282935706,Bible,92774400,USD,0,985920,5320,4.5,5,7.5.1,4+,Reference,37,5,45,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7193,1187617475,Kubik,126644224,USD,0,142,75,4.5,4.5,1.3,4+,Games,38,5,1,1
7194,1187682390,VR Roller-Coaster,120760320,USD,0,30,30,4.5,4.5,0.9,4+,Games,38,0,1,1
7195,1187779532,Bret Michaels Emojis + Lyric Keyboard,111322112,USD,1.99,15,0,4.5,0,1.0.2,9+,Utilities,37,1,1,1
7196,1187838770,VR Roller Coaster World - Virtual Reality,97235968,USD,0,85,32,4.5,4.5,1.0.15,12+,Games,38,0,2,1


Our firm only builds apps that are free to download and install. Therefore, keep only the free apps in the dataframe.

In [11]:
# Keep only the free apps since we only deal with free apps.
# .copy() makes df an independent dataframe rather than a slice of the original.
df = df[df['price'] == '0'].copy()

In [12]:
# Drop rows with missing values just in case
df.dropna(axis=0, inplace=True)
  


In [13]:
# Reset the index
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,281796108,Evernote - stay organized,158578688,USD,0,161065,26,4,3.5,8.2.2,4+,Productivity,37,5,23,1
1,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
2,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0,262241,649,4,4.5,5.10.0,12+,Shopping,37,5,9,1
3,282935706,Bible,92774400,USD,0,985920,5320,4.5,5,7.5.1,4+,Reference,37,5,45,1
4,283646709,PayPal - Send and request money safely,227795968,USD,0,119487,879,4,4.5,6.12.0,4+,Finance,37,0,19,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4049,1186384912,Demolition Derby Virtual Reality (VR) Racing,168774656,USD,0,18,18,4,4,1.0.0,12+,Games,38,4,1,1
4050,1187617475,Kubik,126644224,USD,0,142,75,4.5,4.5,1.3,4+,Games,38,5,1,1
4051,1187682390,VR Roller-Coaster,120760320,USD,0,30,30,4.5,4.5,0.9,4+,Games,38,0,1,1
4052,1187838770,VR Roller Coaster World - Virtual Reality,97235968,USD,0,85,32,4.5,4.5,1.0.15,12+,Games,38,0,2,1


### Google Playstore Dataset

In [14]:
# Create a dataframe using the list of lists of the second dataset
df2 = pd.DataFrame(data_list2)
df2 = df2[1:]
header2 = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
df2.columns = header2


df2

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10837,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10839,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10840,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


According to a discussion on Kaggle, one row has a missing value.

In [15]:
# Inspect the row reported on Kaggle as having a missing value.
df2.loc[10473, :]

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                                 19
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                           
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                          None
Name: 10473, dtype: object

The Category value is missing from this row, so every later value is shifted one column to the left and Android Ver ends up as None. Therefore, drop that row.
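
A row like this can also be spotted before the dataframe is built. The sketch below assumes that a malformed row has fewer fields than the header, and it uses made-up rows in the list-of-lists form produced by `csv.reader`.

```python
# Toy rows (made up) in the list-of-lists form produced by csv.reader
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # Category is missing, so the row is short
    ['App C', 'GAME', '4.5'],
]

# Collect the positions of rows whose length differs from the header's
bad_positions = [i for i, row in enumerate(rows) if len(row) != len(header)]
print(bad_positions)

# Keep only the well-formed rows
clean_rows = [row for row in rows if len(row) == len(header)]
print(len(clean_rows))
```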

In [16]:
# Drop rows with missing values
df2.dropna(axis=0, inplace=True)
df2

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10837,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10839,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10840,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


Check the number of duplicates, i.e. rows that share an application name with an earlier row.

In [17]:
# Number of duplicates in the Googlestore dataset
# duplicated() flags every repeat of an App name after its first occurrence
num2 = int(df2.duplicated(subset='App').sum())
print("Number of duplicates:", num2)

Number of duplicates: 1181


There are 1181 duplicates. Drop them based on the application name, keeping the first occurrence.

In [18]:
# Drop duplicates
df2.drop_duplicates(subset='App', keep='first', inplace=True)

df2

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10837,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10839,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10840,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


Our firm only builds apps that are free to download and install. Therefore, keep only the free apps in the dataframe.

In [19]:
# Drop the rows with non-free apps since we only deal with free apps.
df2 = df2[df2['Price']=='0']

In [20]:
# Reset the index
df2.reset_index(drop=True, inplace=True)
df2

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8898,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
8899,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
8900,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
8901,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


# Removing non-English applications

Since our firm only uses English in the apps we develop, and we'd like to analyze only the apps directed toward English-speaking users, we have to remove the non-English applications.

We would like to remove each app whose name contains symbols that are not commonly used in English text. English text usually includes letters from the English alphabet, the digits 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Behind the scenes, each character we use in a string has a corresponding number associated with it.

We can get the corresponding number of each character using the built-in ord() function.

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.
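
As a small illustration of this rule, the sketch below checks single characters against the ASCII range. `is_ascii` is a hypothetical helper name used only for this example.

```python
def is_ascii(character):
    """Return True if the character's code point is in the ASCII range 0-127."""
    return ord(character) <= 127

print(ord('a'))       # 97
print(ord('™'))       # 8482
print(is_ascii('a'))  # True
print(is_ascii('™'))  # False
```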

We can iterate over a string with a for loop to examine its characters one at a time.

However, some names include emojis or symbols such as ™ while the rest of the characters are English. Therefore, to minimize data loss, only drop the rows whose name contains more than 3 non-English characters.

Define a function that iterates over a given string and returns True if more than 3 characters in the string are non-English.

In [21]:
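def iseng(string):
    # Despite its name, this returns True when the string is NOT English,
    # i.e. when it contains more than 3 characters outside the ASCII range.
    num = 0
    for char in string:
        if ord(char) > 127:
            num += 1
        if num > 3:
            return True
    return False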
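
The same threshold can also be applied to a whole column at once with a boolean mask. The following is a sketch on a toy dataframe, not the notebook's own cell. `count_non_ascii` is a hypothetical helper, and the three names are the examples checked later in this notebook.

```python
import pandas as pd

def count_non_ascii(text):
    """Count the characters whose code point is above 127."""
    return sum(1 for character in text if ord(character) > 127)

toy = pd.DataFrame({'App': [
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]})

# True for names with more than 3 non-ASCII characters
too_many = toy['App'].map(count_non_ascii) > 3

# Keep the rows that are at or below the threshold
kept = toy[~too_many].reset_index(drop=True)
print(kept['App'].tolist())
```

Building a new dataframe from a mask avoids dropping rows one by one while iterating.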


Drop the rows with more than 3 non-English characters in the application name from the Apple Store dataset.

In [22]:
# Flag the names that iseng() reports as having more than 3 non-ASCII characters
non_english = df['track_name'].astype(str).map(lambda name: bool(iseng(name)))
# Keep the remaining rows and renumber the index
df = df[~non_english].reset_index(drop=True)


In [23]:
df

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,281796108,Evernote - stay organized,158578688,USD,0,161065,26,4,3.5,8.2.2,4+,Productivity,37,5,23,1
1,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
2,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0,262241,649,4,4.5,5.10.0,12+,Shopping,37,5,9,1
3,282935706,Bible,92774400,USD,0,985920,5320,4.5,5,7.5.1,4+,Reference,37,5,45,1
4,283646709,PayPal - Send and request money safely,227795968,USD,0,119487,879,4,4.5,6.12.0,4+,Finance,37,0,19,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3215,1186384912,Demolition Derby Virtual Reality (VR) Racing,168774656,USD,0,18,18,4,4,1.0.0,12+,Games,38,4,1,1
3216,1187617475,Kubik,126644224,USD,0,142,75,4.5,4.5,1.3,4+,Games,38,5,1,1
3217,1187682390,VR Roller-Coaster,120760320,USD,0,30,30,4.5,4.5,0.9,4+,Games,38,0,1,1
3218,1187838770,VR Roller Coaster World - Virtual Reality,97235968,USD,0,85,32,4.5,4.5,1.0.15,12+,Games,38,0,2,1


Drop the rows with more than 3 non-English characters in the application name from the Google Play dataset.

In [24]:
# Flag the names that iseng() reports as having more than 3 non-ASCII characters
non_english2 = df2['App'].astype(str).map(lambda name: bool(iseng(name)))
# Keep the remaining rows and renumber the index
df2 = df2[~non_english2].reset_index(drop=True)

In [25]:
df2

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8857,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
8858,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
8859,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
8860,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


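Check whether applications like the ones below are still in the dataset. Only the last name has more than 3 non-ASCII characters, so it is the only one of the three that the filter would remove.
- 'Docs To Go™ Free Office Suite'
- 'Instachat 😜'
- '爱奇艺PPS -《欢乐颂2》电视剧热播'


In [26]:
for row in df2.itertuples():
    # row[0] is the index and row[1] is the App column
    if str(row[1]) == 'Docs To Go™ Free Office Suite':
        print("yes")
    elif str(row[1]) == 'Instachat 😜':
        print("yes2")
    elif str(row[1]) == '爱奇艺PPS -《欢乐颂2》电视剧热播':
        print("yes3")


Each line printed by the code above corresponds to one of these applications that is still present in the Google Play dataset.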

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets.

# Most common apps by genre

First, create a dictionary of counts for each genre column in each dataset.

- Dictionary of genres of Apple Store dataset.

In [27]:
dict1={}
for genre in df['prime_genre']:
    if genre in dict1:
        dict1[genre] += 1
    else:
        dict1[genre] = 1
dict1

{'Productivity': 56,
 'Weather': 28,
 'Shopping': 84,
 'Reference': 18,
 'Finance': 36,
 'Music': 66,
 'Utilities': 81,
 'Travel': 40,
 'Social Networking': 106,
 'Sports': 69,
 'Health & Fitness': 65,
 'Games': 1872,
 'Food & Drink': 26,
 'News': 43,
 'Book': 14,
 'Photo & Video': 160,
 'Entertainment': 254,
 'Business': 17,
 'Lifestyle': 51,
 'Education': 118,
 'Navigation': 6,
 'Medical': 6,
 'Catalogs': 4}
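
The counting loop can also be written with `collections.Counter` from the standard library. This is an alternative sketch on a made-up list of genres, not the approach used in the cells of this notebook.

```python
from collections import Counter

# Toy genre column (made up for illustration)
genres = ['Games', 'Weather', 'Games', 'Music', 'Games']

# Counter tallies how many times each value occurs
genre_counts = Counter(genres)
print(genre_counts['Games'])

# most_common() returns (value, count) pairs sorted by count, descending
print(genre_counts.most_common(1))
```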

- Two dictionaries of two genre columns of Google Playstore Dataset. 

In [28]:
dict2={}
for genre in df2['Category']:
    if genre in dict2:
        dict2[genre] += 1
    else:
        dict2[genre] = 1

dict2

{'ART_AND_DESIGN': 60,
 'AUTO_AND_VEHICLES': 82,
 'BEAUTY': 53,
 'BOOKS_AND_REFERENCE': 190,
 'BUSINESS': 407,
 'COMICS': 55,
 'COMMUNICATION': 287,
 'DATING': 165,
 'EDUCATION': 114,
 'ENTERTAINMENT': 100,
 'EVENTS': 63,
 'FINANCE': 328,
 'FOOD_AND_DRINK': 110,
 'HEALTH_AND_FITNESS': 273,
 'HOUSE_AND_HOME': 74,
 'LIBRARIES_AND_DEMO': 83,
 'LIFESTYLE': 346,
 'GAME': 875,
 'FAMILY': 1635,
 'MEDICAL': 312,
 'SOCIAL': 236,
 'SHOPPING': 199,
 'PHOTOGRAPHY': 261,
 'SPORTS': 301,
 'TRAVEL_AND_LOCAL': 207,
 'TOOLS': 748,
 'PERSONALIZATION': 294,
 'PRODUCTIVITY': 345,
 'PARENTING': 58,
 'WEATHER': 71,
 'VIDEO_PLAYERS': 158,
 'NEWS_AND_MAGAZINES': 248,
 'MAPS_AND_NAVIGATION': 124}

In [29]:
dict3={}

for genre in df2['Genres']:
    if genre in dict3:
        dict3[genre] += 1
    else:
        dict3[genre] = 1

dict3

{'Art & Design': 53,
 'Art & Design;Pretend Play': 1,
 'Art & Design;Creativity': 6,
 'Art & Design;Action & Adventure': 1,
 'Auto & Vehicles': 82,
 'Beauty': 53,
 'Books & Reference': 190,
 'Business': 407,
 'Comics': 54,
 'Comics;Creativity': 1,
 'Communication': 287,
 'Dating': 165,
 'Education;Education': 30,
 'Education': 474,
 'Education;Creativity': 4,
 'Education;Music & Video': 3,
 'Education;Action & Adventure': 3,
 'Education;Pretend Play': 5,
 'Education;Brain Games': 3,
 'Entertainment': 538,
 'Entertainment;Music & Video': 15,
 'Entertainment;Brain Games': 7,
 'Entertainment;Creativity': 3,
 'Events': 63,
 'Finance': 328,
 'Food & Drink': 110,
 'Health & Fitness': 273,
 'House & Home': 74,
 'Libraries & Demo': 83,
 'Lifestyle': 345,
 'Lifestyle;Pretend Play': 1,
 'Adventure;Action & Adventure': 3,
 'Arcade': 164,
 'Casual': 156,
 'Card': 40,
 'Casual;Pretend Play': 21,
 'Action': 275,
 'Strategy': 81,
 'Puzzle': 100,
 'Sports': 307,
 'Music': 18,
 'Word': 23,
 'Racing': 8

- A function that returns relative frequencies, and a function that displays them as percentages (%) in descending order.

In [30]:
def freq_table(dataset):
    # dataset: dictionary that maps each genre to its number of apps
    total = sum(dataset.values())
    # Divide every count by the total to get its share
    return {key: count / total for key, count in dataset.items()}


In [31]:
def display_table(dataset):
    dataset = freq_table(dataset)
    # Create a list of tuples sorted by index 1 i.e. value field     
    listofTuples = sorted(dataset.items() , reverse=True, key=lambda x: x[1])
    # Iterate over the sorted sequence
    for elem in listofTuples :
        print(elem[0] , " : " , '{:.2%}'.format(elem[1]))   
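
Since the data already lives in pandas, the same kind of table can be obtained directly from a column. The sketch below is an alternative shown on a made-up Series. `value_counts(normalize=True)` returns shares sorted in descending order.

```python
import pandas as pd

# Toy genre column (made up for illustration)
toy_genres = pd.Series(['Games', 'Weather', 'Games', 'Music'])

# normalize=True returns proportions instead of raw counts
shares = toy_genres.value_counts(normalize=True)

# Print in the same "name : percentage" layout as display_table
for genre, share in shares.items():
    print(genre, ' : ', '{:.2%}'.format(share))
```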

- Frequency of app genres in the Apple Store

In [32]:
display_table(dict1)

Games  :  58.14%
Entertainment  :  7.89%
Photo & Video  :  4.97%
Education  :  3.66%
Social Networking  :  3.29%
Shopping  :  2.61%
Utilities  :  2.52%
Sports  :  2.14%
Music  :  2.05%
Health & Fitness  :  2.02%
Productivity  :  1.74%
Lifestyle  :  1.58%
News  :  1.34%
Travel  :  1.24%
Finance  :  1.12%
Weather  :  0.87%
Food & Drink  :  0.81%
Reference  :  0.56%
Business  :  0.53%
Book  :  0.43%
Navigation  :  0.19%
Medical  :  0.19%
Catalogs  :  0.12%


In the Apple Store, the **Games genre** dominates all other genres with 58.14% of the apps, followed by Entertainment with 7.89%, Photo & Video with 4.97%, Education with 3.66% and Social Networking with 3.29%.

Although Games make up the majority of the apps, this does not imply that Games are the most installed genre, which is what would reflect the actual number of users.

Judging by the most common genres, most applications are for entertainment (Games, Entertainment, Photo & Video, Social Networking, Shopping, etc.). Apart from Education, the top 5 genres presumably target relatively young people, so I assume most users are relatively young. I believe Education is the fourth most common genre because younger people tend to be more in need of education, or because many of these apps are presumably school-related.

- Frequency of app genres in the Google Playstore

In [33]:
display_table(dict2)

FAMILY  :  18.45%
GAME  :  9.87%
TOOLS  :  8.44%
BUSINESS  :  4.59%
LIFESTYLE  :  3.90%
PRODUCTIVITY  :  3.89%
FINANCE  :  3.70%
MEDICAL  :  3.52%
SPORTS  :  3.40%
PERSONALIZATION  :  3.32%
COMMUNICATION  :  3.24%
HEALTH_AND_FITNESS  :  3.08%
PHOTOGRAPHY  :  2.95%
NEWS_AND_MAGAZINES  :  2.80%
SOCIAL  :  2.66%
TRAVEL_AND_LOCAL  :  2.34%
SHOPPING  :  2.25%
BOOKS_AND_REFERENCE  :  2.14%
DATING  :  1.86%
VIDEO_PLAYERS  :  1.78%
MAPS_AND_NAVIGATION  :  1.40%
EDUCATION  :  1.29%
FOOD_AND_DRINK  :  1.24%
ENTERTAINMENT  :  1.13%
LIBRARIES_AND_DEMO  :  0.94%
AUTO_AND_VEHICLES  :  0.93%
HOUSE_AND_HOME  :  0.84%
WEATHER  :  0.80%
EVENTS  :  0.71%
ART_AND_DESIGN  :  0.68%
PARENTING  :  0.65%
COMICS  :  0.62%
BEAUTY  :  0.60%


The most common categories in the Category column are Family with 18.45%, followed by Game with 9.87%, Tools with 8.44%, Business with 4.59% and Lifestyle with 3.90%. The first two categories seem to be for entertainment, while the next three (Tools, Business, Lifestyle) are for practical purposes.

In [34]:
display_table(dict3)

Tools  :  8.43%
Entertainment  :  6.07%
Education  :  5.35%
Business  :  4.59%
Lifestyle  :  3.89%
Productivity  :  3.89%
Finance  :  3.70%
Medical  :  3.52%
Sports  :  3.46%
Personalization  :  3.32%
Communication  :  3.24%
Action  :  3.10%
Health & Fitness  :  3.08%
Photography  :  2.95%
News & Magazines  :  2.80%
Social  :  2.66%
Travel & Local  :  2.32%
Shopping  :  2.25%
Books & Reference  :  2.14%
Simulation  :  2.04%
Dating  :  1.86%
Arcade  :  1.85%
Video Players & Editors  :  1.77%
Casual  :  1.76%
Maps & Navigation  :  1.40%
Food & Drink  :  1.24%
Puzzle  :  1.13%
Racing  :  0.99%
Libraries & Demo  :  0.94%
Role Playing  :  0.94%
Auto & Vehicles  :  0.93%
Strategy  :  0.91%
House & Home  :  0.84%
Weather  :  0.80%
Events  :  0.71%
Adventure  :  0.68%
Comics  :  0.61%
Art & Design  :  0.60%
Beauty  :  0.60%
Parenting  :  0.50%
Card  :  0.45%
Casino  :  0.43%
Trivia  :  0.42%
Educational;Education  :  0.39%
Board  :  0.37%
Educational  :  0.37%
Education;Education  :  0.34%
Wor

The most common genres in the Genres column are Tools with 8.43%, followed by Entertainment with 6.07%, Education with 5.35%, Business with 4.59% and Lifestyle with 3.89%.

While the two most common categories were for entertainment, most of the top genres are designed for practical purposes (Tools, Business, Education, Productivity, Lifestyle).

The Google Playstore data classifies applications in two different columns, "Category" and "Genres".

In [35]:
df2[df2['Genres']=='Entertainment']['Category'].unique()

array(['ENTERTAINMENT', 'FAMILY'], dtype=object)

In [36]:
df2[df2['Category']=='FAMILY']['Genres'].unique()

array(['Casual;Brain Games', 'Educational;Creativity',
       'Puzzle;Brain Games', 'Educational;Education',
       'Education;Creativity', 'Educational;Brain Games',
       'Educational;Pretend Play', 'Education;Education',
       'Casual;Action & Adventure', 'Entertainment;Education',
       'Entertainment;Brain Games', 'Casual;Education',
       'Casual;Pretend Play', 'Casual;Creativity', 'Music;Music & Video',
       'Simulation;Action & Adventure', 'Racing;Action & Adventure',
       'Entertainment;Music & Video', 'Arcade;Pretend Play',
       'Action;Action & Adventure', 'Education;Pretend Play',
       'Adventure;Action & Adventure', 'Role Playing;Action & Adventure',
       'Simulation;Pretend Play', 'Puzzle;Creativity',
       'Sports;Action & Adventure', 'Educational;Action & Adventure',
       'Arcade;Action & Adventure', 'Entertainment;Action & Adventure',
       'Puzzle;Action & Adventure', 'Education;Action & Adventure',
       'Strategy;Action & Adventure', 'Music & Audi

In [37]:
df2['Genres'].unique()

array(['Art & Design', 'Art & Design;Pretend Play',
       'Art & Design;Creativity', 'Art & Design;Action & Adventure',
       'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business',
       'Comics', 'Comics;Creativity', 'Communication', 'Dating',
       'Education;Education', 'Education', 'Education;Creativity',
       'Education;Music & Video', 'Education;Action & Adventure',
       'Education;Pretend Play', 'Education;Brain Games', 'Entertainment',
       'Entertainment;Music & Video', 'Entertainment;Brain Games',
       'Entertainment;Creativity', 'Events', 'Finance', 'Food & Drink',
       'Health & Fitness', 'House & Home', 'Libraries & Demo',
       'Lifestyle', 'Lifestyle;Pretend Play',
       'Adventure;Action & Adventure', 'Arcade', 'Casual', 'Card',
       'Casual;Pretend Play', 'Action', 'Strategy', 'Puzzle', 'Sports',
       'Music', 'Word', 'Racing', 'Casual;Creativity',
       'Casual;Action & Adventure', 'Simulation', 'Adventure', 'Board',
       'Trivia', 'Role 

In [38]:
df2['Category'].unique()

array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION'],
      dtype=object)

The two columns differ: as the output above shows, apps that share a genre do not necessarily share a category. For example, apps with the 'Entertainment' genre fall under either the 'ENTERTAINMENT' or the 'FAMILY' category.

'Genres' is the more detailed classification, while 'Category' groups applications more broadly.

A frequency table shows which genres developers publish most often, but it does not tell us which genres attract the most users or the most downloads.
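
The relationship between 'Genres' and 'Category' described above can also be checked for every genre at once. The following is a minimal sketch on a toy frame (the rows are hypothetical stand-ins for `df2`, not taken from the dataset) that lists genres filed under more than one category.

```python
import pandas as pd

# Toy stand-in for df2 (hypothetical rows, not taken from the dataset)
toy = pd.DataFrame({
    'Category': ['ENTERTAINMENT', 'FAMILY', 'FAMILY', 'GAME'],
    'Genres': ['Entertainment', 'Entertainment', 'Education', 'Arcade'],
})

# Number of distinct categories each genre appears under
categories_per_genre = toy.groupby('Genres')['Category'].nunique()

# Keep only the genres that are filed under more than one category
multi_category = categories_per_genre[categories_per_genre > 1]
print(multi_category)
```

On the real data, the same two lines applied to `df2` should list 'Entertainment' among the results, given the output of the cell above.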

# Most popular apps by genre

To find out what genres are the most popular (have the most users), we should calculate the average number of installs for each app genre. 

For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

### App Store

The values in the 'rating_count_tot' column are strings ('str'), so convert them to 'float'.

In [39]:

m=0
for val in df.loc[:, 'rating_count_tot']:
    df.loc[m, 'rating_count_tot'] = float(val)
    m+=1


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
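
The warning above is pandas' SettingWithCopyWarning, which usually appears when the dataframe being modified was itself obtained by slicing another dataframe. A possible alternative, sketched here on a toy frame (the frame `raw`, the filter, and the third row are assumptions for illustration), is to take an explicit copy and convert the whole column in one step instead of looping over rows.

```python
import pandas as pd

# Toy stand-in for the App Store data; values are strings, as read by csv.reader
raw = pd.DataFrame({
    'prime_genre': ['Games', 'Productivity', 'Games'],
    'rating_count_tot': ['21292', '161065', '0'],
})

# Work on an explicit copy so later assignments do not target a slice of raw
apps = raw[raw['prime_genre'] == 'Games'].copy()

# Convert the whole column at once instead of assigning row by row
apps['rating_count_tot'] = apps['rating_count_tot'].astype(float)
print(apps.dtypes)
```

The column-wise conversion also does not depend on the index running from 0 to n-1, which the row-by-row loop with a manual counter assumes.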


Create a new dataframe with the 'prime_genre' and 'rating_count_tot' columns, grouped by 'prime_genre', with 'rating_count_tot' summed within each genre.

In [40]:

df_pop_apple = df.loc[:, ['prime_genre', 'rating_count_tot']].groupby('prime_genre').sum()
df_pop_apple.reset_index(inplace=True)
    

In [41]:
df_pop_apple

Unnamed: 0,prime_genre,rating_count_tot
0,Book,556619.0
1,Business,127349.0
2,Catalogs,16016.0
3,Education,826470.0
4,Entertainment,3563577.0
5,Finance,1132846.0
6,Food & Drink,866682.0
7,Games,42705795.0
8,Health & Fitness,1514371.0
9,Lifestyle,840774.0


Add a new column, 'num_of_apps', holding the number of applications in each genre.

In [42]:
# Add new column "number of applications"
df_pop_apple['num_of_apps'] = 0
k=0

for value1 in df_pop_apple['prime_genre']:
    temp = 0
    for value2 in df['prime_genre']:
        if value1 == value2:
            temp+=1
        elif value1 != value2:
            continue
    df_pop_apple.loc[k, 'num_of_apps'] = temp
    k+=1
            



In [43]:
df_pop_apple

Unnamed: 0,prime_genre,rating_count_tot,num_of_apps
0,Book,556619.0,14
1,Business,127349.0,17
2,Catalogs,16016.0,4
3,Education,826470.0,118
4,Entertainment,3563577.0,254
5,Finance,1132846.0,36
6,Food & Drink,866682.0,26
7,Games,42705795.0,1872
8,Health & Fitness,1514371.0,65
9,Lifestyle,840774.0,51
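
The nested loop above compares every genre against every app. As a sketch of a shorter route, `value_counts` can produce the per-genre counts once, and `map` can attach them to the summary frame. The frames `apps` and `totals` below are toy stand-ins for `df` and `df_pop_apple`.

```python
import pandas as pd

# Toy stand-ins for df and df_pop_apple (hypothetical rows)
apps = pd.DataFrame({'prime_genre': ['Games', 'Book', 'Games', 'Games', 'Book']})
totals = pd.DataFrame({'prime_genre': ['Book', 'Games']})

# Count how many apps fall under each genre
counts = apps['prime_genre'].value_counts()

# Look up each genre's count and store it in the new column
totals['num_of_apps'] = totals['prime_genre'].map(counts)
print(totals)
```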


Get the average number of ratings by dividing 'rating_count_tot' by 'num_of_apps', and round the result to 2 decimal places.

In [44]:

df_pop_apple['avg_num_rating'] = 0.0

# Average ratings per app = total ratings in the genre / number of apps in the genre
k=0
for total, count in zip(df_pop_apple['rating_count_tot'], df_pop_apple['num_of_apps']):
    df_pop_apple.loc[k, 'avg_num_rating'] = round(total / count, 2)
    k+=1


In [45]:
df_pop_apple

Unnamed: 0,prime_genre,rating_count_tot,num_of_apps,avg_num_rating
0,Book,556619.0,14,39758.5
1,Business,127349.0,17,7491.12
2,Catalogs,16016.0,4,4004.0
3,Education,826470.0,118,7003.98
4,Entertainment,3563577.0,254,14029.83
5,Finance,1132846.0,36,31467.94
6,Food & Drink,866682.0,26,33333.92
7,Games,42705795.0,1872,22812.92
8,Health & Fitness,1514371.0,65,23298.02
9,Lifestyle,840774.0,51,16485.76
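
The sum, count, and average built in the last few cells can also be obtained in a single `groupby` pass using named aggregation. This is only a sketch on toy data; the frame `apps` and its values are hypothetical, and the output column names are chosen to mirror the ones used above.

```python
import pandas as pd

# Toy stand-in for df with already-numeric rating counts (hypothetical values)
apps = pd.DataFrame({
    'prime_genre': ['Book', 'Book', 'Games', 'Games', 'Games'],
    'rating_count_tot': [10.0, 30.0, 5.0, 10.0, 15.0],
})

# One pass: total ratings, number of apps, and average ratings per genre
summary = (
    apps.groupby('prime_genre')['rating_count_tot']
        .agg(rating_count_tot='sum', num_of_apps='count', avg_num_rating='mean')
        .round({'avg_num_rating': 2})
        .reset_index()
)
print(summary)
```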


The most popular App Store genres, in descending order of average number of ratings, are:

In [46]:
df_pop_apple.sort_values(by='avg_num_rating', ascending=False).reset_index(inplace=False, drop=True)

Unnamed: 0,prime_genre,rating_count_tot,num_of_apps,avg_num_rating
0,Navigation,516542.0,6,86090.33
1,Reference,1348958.0,18,74942.11
2,Social Networking,7584125.0,106,71548.35
3,Music,3783551.0,66,57326.53
4,Weather,1463837.0,28,52279.89
5,Book,556619.0,14,39758.5
6,Food & Drink,866682.0,26,33333.92
7,Finance,1132846.0,36,31467.94
8,Photo & Video,4550647.0,160,28441.54
9,Travel,1129752.0,40,28243.8


### Google Play

For the Google Play dataset, the 'Installs' column gives the approximate number of installations.

In [47]:
df2

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8857,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
8858,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
8859,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
8860,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [48]:
df2['Installs'].unique()

array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '500+', '100+', '50+',
       '10+', '1+', '5+', '0+', '0'], dtype=object)

The 'Installs' values are approximate. For example, '10,000+' means the number of installations is somewhere between 10,000 and 49,999. However, since we only want to see which kinds of applications are most popular, this rough estimate is good enough.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To compute the total number of installations, I will convert the values to 'float'. To do that, I first need to remove the "," and "+" characters from each string.

First, create a new dataframe with necessary columns.

In [139]:
df_pop_goog = df2.loc[:, ['Installs', 'Category']]
df_pop_goog

Unnamed: 0,Installs,Category
0,"10,000+",ART_AND_DESIGN
1,"500,000+",ART_AND_DESIGN
2,"5,000,000+",ART_AND_DESIGN
3,"50,000,000+",ART_AND_DESIGN
4,"100,000+",ART_AND_DESIGN
...,...,...
8857,"5,000+",FAMILY
8858,100+,FAMILY
8859,"1,000+",MEDICAL
8860,"1,000+",BOOKS_AND_REFERENCE


The values in the 'Installs' column are strings. Replace the "," and "+" characters with "" and then convert each value from string to float.

In [140]:
ty=0
for value in df_pop_goog.loc[:, 'Installs']:
    value = value.replace(",", "")
    value = value.replace("+", "")
    value = float(value)
    df_pop_goog.loc[ty, 'Installs'] = value
    ty+=1

In [141]:
df_pop_goog

Unnamed: 0,Installs,Category
0,10000,ART_AND_DESIGN
1,500000,ART_AND_DESIGN
2,5e+06,ART_AND_DESIGN
3,5e+07,ART_AND_DESIGN
4,100000,ART_AND_DESIGN
...,...,...
8857,5000,FAMILY
8858,100,FAMILY
8859,1000,MEDICAL
8860,1000,BOOKS_AND_REFERENCE
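
The same cleaning can be sketched without an explicit loop by using the pandas string methods on the whole column. The series below holds a few of the 'Installs' labels listed earlier; the name `installs` is only an illustration.

```python
import pandas as pd

# A few of the 'Installs' labels seen in the dataset
installs = pd.Series(['10,000+', '500,000+', '100+', '0'])

# Remove every "," and "+" in one regex pass, then convert to float
cleaned = installs.str.replace(r'[,+]', '', regex=True).astype(float)
print(cleaned.tolist())
```

Applied to the real frame, this would be an assignment of the cleaned series back to `df_pop_goog['Installs']`, which also leaves the column with a numeric dtype rather than a column of Python objects.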


Group by 'Category' and sum 'Installs', then add a new placeholder column 'num_of_apps' initialized to 0.

In [142]:
df_pop_goog2 = df_pop_goog.groupby('Category').sum().reset_index()


In [143]:
df_pop_goog2['num_of_apps'] = 0

In [144]:
df_pop_goog2

Unnamed: 0,Category,Installs,num_of_apps
0,ART_AND_DESIGN,114321100.0,0
1,AUTO_AND_VEHICLES,53080060.0,0
2,BEAUTY,27197050.0,0
3,BOOKS_AND_REFERENCE,1665884000.0,0
4,BUSINESS,696902100.0,0
5,COMICS,44971150.0,0
6,COMMUNICATION,11036910000.0,0
7,DATING,140914800.0,0
8,EDUCATION,351350000.0,0
9,ENTERTAINMENT,2113460000.0,0


Count the occurrences of each category and store them in the 'num_of_apps' column.

In [145]:
for val1 in df_pop_goog2['Category']:
    temp = 0
    for val2 in df_pop_goog.loc[:, 'Category']:
        if val1 == val2:
            temp += 1
    # Assign through a single .loc call; chained indexing (df[col][mask] = ...)
    # may write to a temporary copy and triggers SettingWithCopyWarning
    df_pop_goog2.loc[df_pop_goog2['Category'] == val1, 'num_of_apps'] = temp


In [146]:
df_pop_goog2

Unnamed: 0,Category,Installs,num_of_apps
0,ART_AND_DESIGN,114321100.0,60
1,AUTO_AND_VEHICLES,53080060.0,82
2,BEAUTY,27197050.0,53
3,BOOKS_AND_REFERENCE,1665884000.0,190
4,BUSINESS,696902100.0,407
5,COMICS,44971150.0,55
6,COMMUNICATION,11036910000.0,287
7,DATING,140914800.0,165
8,EDUCATION,351350000.0,114
9,ENTERTAINMENT,2113460000.0,100


Create a new column 'avg_num_of_installs', then divide the total number of installations by the number of apps in each category to get the average number of installations per category.

In [147]:
df_pop_goog2['avg_num_of_installs'] = 0

In [148]:
c=0
for val1, val2 in zip(df_pop_goog2['Installs'], df_pop_goog2['num_of_apps']):
    df_pop_goog2.loc[c, 'avg_num_of_installs'] = (val1 / val2)
    c+=1


In [149]:
df_pop_goog2

Unnamed: 0,Category,Installs,num_of_apps,avg_num_of_installs
0,ART_AND_DESIGN,114321100.0,60,1905352.0
1,AUTO_AND_VEHICLES,53080060.0,82,647317.8
2,BEAUTY,27197050.0,53,513151.9
3,BOOKS_AND_REFERENCE,1665884000.0,190,8767812.0
4,BUSINESS,696902100.0,407,1712290.0
5,COMICS,44971150.0,55,817657.3
6,COMMUNICATION,11036910000.0,287,38456120.0
7,DATING,140914800.0,165,854028.8
8,EDUCATION,351350000.0,114,3082018.0
9,ENTERTAINMENT,2113460000.0,100,21134600.0


Round the values in the 'avg_num_of_installs' column to whole numbers so they are not displayed in scientific notation, then sort them in descending order.

In [151]:
e=0
for val in df_pop_goog2['avg_num_of_installs']:
    val = round(val)
    df_pop_goog2.loc[e, 'avg_num_of_installs'] = val
    e+=1

In [168]:
df_pop_goog2.sort_values(by='avg_num_of_installs', ascending=False).reset_index(drop=True)


Unnamed: 0,Category,Installs,num_of_apps,avg_num_of_installs
0,COMMUNICATION,11036910000.0,287,38456119.0
1,VIDEO_PLAYERS,3926732000.0,158,24852732.0
2,SOCIAL,5487862000.0,236,23253652.0
3,ENTERTAINMENT,2113460000.0,100,21134600.0
4,PHOTOGRAPHY,4647269000.0,261,17805628.0
5,PRODUCTIVITY,5791629000.0,345,16787331.0
6,GAME,13857870000.0,875,15837565.0
7,TRAVEL_AND_LOCAL,2894704000.0,207,13984078.0
8,TOOLS,8000043000.0,748,10695245.0
9,NEWS_AND_MAGAZINES,2368196000.0,248,9549178.0
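
If the goal is only to make large averages easier to read, two alternatives to rounding in a loop are sketched below: casting the rounded column to an integer type, or formatting the values with thousands separators for display. The two rows reuse averages from the tables above; the frame name `avg` is an assumption.

```python
import pandas as pd

# Two averages taken from the tables above
avg = pd.DataFrame({
    'Category': ['COMMUNICATION', 'BEAUTY'],
    'avg_num_of_installs': [38456119.0, 513152.0],
})

# Option 1: round and store as integers, which are never shown in scientific notation
as_int = avg['avg_num_of_installs'].round().astype('int64')

# Option 2: keep the floats and build display strings with thousands separators
formatted = avg['avg_num_of_installs'].map('{:,.0f}'.format)

print(as_int.tolist())
print(formatted.tolist())
```

Option 2 produces strings, so it suits a final display step rather than further arithmetic.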


### Recommendation

Because our firm earns its revenue from in-app ads, we need applications that many users open often and spend time in. The ideal genre or category is one that is popular but not crowded, which raises the chance of attracting many users and frequent usage.

In the **App Store**, the most popular application genre is 'Navigation'. However, users mostly look at the map and directions and little else in such an application, so Navigation is not the most efficient genre for generating revenue from in-app ads. Among the top 5 most popular App Store genres, the most efficient for in-app ads are Music and Social Networking. However, most users already rely on well-known social networking and music applications, while the firm would be entering with a new free application.

The genre the firm should focus on for a new application is **"Reference"**: it is the second most popular genre, yet it is not very common.

In [162]:
df[df['prime_genre']=='Reference'].sort_values('rating_count_tot', ascending=False)

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
3,282935706,Bible,92774400,USD,0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1
50,308750436,Dictionary.com Dictionary & Thesaurus,111275008,USD,0,200047,177,4.0,4.0,7.1.3,4+,Reference,37,0,1,1
153,364740856,Dictionary.com Dictionary & Thesaurus for iPad,165748736,USD,0,54175,10176,4.5,4.5,4.0,4+,Reference,24,5,9,1
258,414706506,Google Translate,65281024,USD,0,26786,27,3.5,4.5,5.10.0,4+,Reference,37,5,59,1
204,388389451,"Muslim Pro: Ramadan 2017 Prayer Times, Azan, Q...",100551680,USD,0,18418,706,4.5,5.0,9.2.1,4+,Reference,37,5,16,1
2846,1130829481,New Furniture Mods - Pocket Wiki & Game Tools ...,52959232,USD,0,17588,17588,4.5,4.5,1.0,4+,Reference,38,3,2,1
232,399452287,Merriam-Webster Dictionary,155593728,USD,0,16849,1125,4.5,4.5,4.1,4+,Reference,38,1,12,1
386,475772902,Night Sky,596499456,USD,0,12122,60,4.5,4.5,4.4.1,4+,Reference,37,5,29,1
2899,1135575003,City Maps for Minecraft PE - The Best Maps for...,90124288,USD,0,8535,8535,4.0,4.0,1.0,4+,Reference,37,4,1,1
2863,1132715891,LUCKY BLOCK MOD ™ for Minecraft PC Edition - T...,86874112,USD,0,4693,4693,4.0,4.0,1.0,12+,Reference,37,4,1,1


"Reference" apps rank second in popularity, and the dataframe above suggests there are not many competitors in the market. The firm should focus on three types of Reference applications:

**1. Bible:** 
This is the application with the most users, and only one such application is available.

**2. Dictionary:** 
This is the second most used type of application.

**3. Game dictionary/wiki:**
There are many game-related dictionary or wiki applications. 
Despite their relatively low rating counts (our proxy for installations), these applications are likely to keep a steady base of users, whereas the top 5 Reference applications might be used only occasionally.



To find the categories that are most efficient for in-app advertising on **Google Play**, I calculated the ratio of "Average number of installations" to "Number of applications", that is, "Average number of installations" divided by "Number of applications".

In [185]:
# Copy so that the new column is not also added to df_pop_goog2
df_pop_goog3 = df_pop_goog2.copy()
df_pop_goog3['apps_installs_ratio'] = 0

Unnamed: 0,Category,Installs,num_of_apps,avg_num_of_installs,apps_installs_ratio
0,ART_AND_DESIGN,114321100.0,60,1905352.0,0
1,AUTO_AND_VEHICLES,53080060.0,82,647318.0,0
2,BEAUTY,27197050.0,53,513152.0,0
3,BOOKS_AND_REFERENCE,1665884000.0,190,8767812.0,0
4,BUSINESS,696902100.0,407,1712290.0,0
5,COMICS,44971150.0,55,817657.0,0
6,COMMUNICATION,11036910000.0,287,38456119.0,0
7,DATING,140914800.0,165,854029.0,0
8,EDUCATION,351350000.0,114,3082018.0,0
9,ENTERTAINMENT,2113460000.0,100,21134600.0,0


In [188]:
f=0
for val1, val2 in zip(df_pop_goog3['num_of_apps'], df_pop_goog3['avg_num_of_installs']):
    df_pop_goog3.loc[f, 'apps_installs_ratio'] = val2/val1
    f+=1

In [192]:
df_pop_goog3.sort_values(by='apps_installs_ratio', ascending=False).reset_index(drop=True)

Unnamed: 0,Category,Installs,num_of_apps,avg_num_of_installs,apps_installs_ratio
0,ENTERTAINMENT,2113460000.0,100,21134600.0,211346.0
1,VIDEO_PLAYERS,3926732000.0,158,24852732.0,157295.772152
2,COMMUNICATION,11036910000.0,287,38456119.0,133993.445993
3,SOCIAL,5487862000.0,236,23253652.0,98532.423729
4,WEATHER,360288500.0,71,5074486.0,71471.633803
5,PHOTOGRAPHY,4647269000.0,261,17805628.0,68220.796935
6,TRAVEL_AND_LOCAL,2894704000.0,207,13984078.0,67555.932367
7,PRODUCTIVITY,5791629000.0,345,16787331.0,48658.930435
8,BOOKS_AND_REFERENCE,1665884000.0,190,8767812.0,46146.378947
9,NEWS_AND_MAGAZINES,2368196000.0,248,9549178.0,38504.75
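
The ratio above can be reproduced with a single column-wise division. The sketch below uses two rows from the table; the frame name `cats` is hypothetical. Note that, because 'avg_num_of_installs' is already the total divided by 'num_of_apps', this ratio equals the total installs divided by the square of the number of apps, so it penalizes crowded categories twice.

```python
import pandas as pd

# Two rows taken from the table above
cats = pd.DataFrame({
    'Category': ['VIDEO_PLAYERS', 'ENTERTAINMENT'],
    'num_of_apps': [158, 100],
    'avg_num_of_installs': [24852732.0, 21134600.0],
})

# Average installs per app, divided again by the number of apps in the category
cats['apps_installs_ratio'] = cats['avg_num_of_installs'] / cats['num_of_apps']

# Rank categories from the highest ratio to the lowest
ranked = cats.sort_values('apps_installs_ratio', ascending=False).reset_index(drop=True)
print(ranked)
```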
