# <center>Google Play and Apple Music App Purchases - Dec 2022 Version</center>

This project analyzes free apps listed on Google Play and the Apple Store. The analysis answers the following question:

- Since we are developing a free app and will rely on advertising for income, which type of free app is most likely to attract users on Google Play and the Apple Store?

Import pandas and the sample app data sets from the Apple Store and Google Play. The files contain both free and paid apps; the paid apps are removed during data cleaning.

In [1]:
import pandas as pd
df_apple = pd.read_csv('AppleStore.csv')
df_google = pd.read_csv('googleplaystore.csv')

Summary view of the imported data sets

In [None]:
a = "APPLE"
print(a.center(80))
print('\n', df_apple.head(3))
print('\n',"rows and columns", df_apple.shape)
print(2*'\n')
g = "GOOGLE"
print(g.center(80))
print('\n',df_google.head(3))
print('\n',"rows and columns", df_google.shape)

Based on review of the documentation for the Apple file (https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) and the Google file (https://www.kaggle.com/datasets/lava18/google-play-store-apps) the following column headings may
be useful for the analysis:<br><br>
<b>APPLE</b><br>

- rating_count_ver: Number of user ratings for current version of app
- user_rating_ver: Average user rating for current version of app
- prime_genre: Primary genre of the app

<b>GOOGLE</b><br>

- Category
- Reviews
- Installs
- Genres


### <center>DATA CLEANING</center>

In [None]:
#Per documentation, google row id 10472 is incorrect.  Check this row
df_google.loc[10472]

In [None]:
#Per project instructions, delete row 10472. errors="ignore" makes this cell safe to re-run

df_google = df_google.drop(labels=10472, axis=0, errors="ignore")

#verified row index 10472 has been removed
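
A quick sanity check can also surface rows like this one without relying only on the documented index. The sketch below is illustrative: the frame is made up (these are not rows from googleplaystore.csv), and it assumes that a valid rating lies between 1 and 5, so anything outside that range is flagged for inspection.

```python
import pandas as pd

# Illustrative data only; not rows from googleplaystore.csv
toy = pd.DataFrame({
    "App": ["Good App", "Shifted Row App", "Another App"],
    "Rating": [4.1, 19.0, 3.8],
})

# Assumption: valid ratings fall between 1 and 5 inclusive
bad_rows = toy[(toy["Rating"] < 1) | (toy["Rating"] > 5)]
print(bad_rows.index.tolist())
```

Rows with a missing rating are not flagged by this check, because comparisons against NaN evaluate to False.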

Remove duplicate entries using the pandas code in the next two notebook cells. Duplicates are identified
by a repeated "track_name" in the Apple set and a repeated "App" in the Google set. We keep the entry with the
most ratings or reviews, on the assumption that the highest total belongs to the most recent record of the app.

In [None]:
#remove duplicate rows Apple
df_apple = df_apple.sort_values(by= "rating_count_tot", ascending=False)
df_apple = df_apple.drop_duplicates(subset = "track_name",keep = "first")
print(df_apple.shape)

In [None]:
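#remove duplicate rows Google
#Reviews is read in as text, so convert to int first; otherwise the sort is alphabetical ("99" > "1000")
df_google["Reviews"] = df_google["Reviews"].astype(int)
df_google = df_google.sort_values(by= "Reviews", ascending=False)
df_google = df_google.drop_duplicates(subset = "App",keep = "first")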
print(df_google.shape)
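
The sort-then-drop pattern used in the two cells above can be checked on a small example. This is a minimal sketch with made-up app names and review counts; it only demonstrates that the copy with the most reviews survives and that no name is left duplicated.

```python
import pandas as pd

# Illustrative data only; names and counts are made up
toy = pd.DataFrame({
    "App": ["Chat", "Chat", "Maps"],
    "Reviews": [120, 450, 80],
})

# Sort so the most-reviewed copy comes first, then keep that copy
deduped = (toy.sort_values(by="Reviews", ascending=False)
              .drop_duplicates(subset="App", keep="first"))

# No name should appear more than once after the cleanup
assert not deduped["App"].duplicated().any()
print(deduped)
```

The same `duplicated().any()` check could be run on `df_apple["track_name"]` and `df_google["App"]` after cleaning.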

Next we'll remove non-English apps. If the "track_name" (Apple) or "App" (Google) column has an entry
containing more than 3 characters with a code point value > 127 (that is, outside the ASCII range), we'll assume the entry is non-English and remove it.

In [8]:
#Keep Apple apps whose name has at most 3 characters above code point 127
df_apple = df_apple[df_apple["track_name"]
           .apply(lambda x: sum(ord(ch) > 127 for ch in x) <= 3)]

In [9]:
#Keep Google apps whose name has at most 3 characters above code point 127
df_google = df_google[df_google["App"]
            .apply(lambda x: sum(ord(ch) > 127 for ch in x) <= 3)]

Next, isolate the free apps. First get the unique values and data type of the price column in each data set, then use that information to drop the non-free apps.

In [None]:
#Apple prices and the column data type
print(df_apple["price"].unique())
print(df_apple["price"].dtypes)

In [None]:
#Google prices and the column data type
print(df_google["Price"].unique())
print(df_google["Price"].dtypes)

In [12]:
#Keep only the free Apple apps (price is numeric)
df_apple = df_apple[df_apple["price"] == 0]

In [13]:
#Keep only the free Google apps (Price is stored as text)
df_google = df_google[df_google["Price"] == '0']

In [None]:
#check that only free Google apps left
print(df_google["Price"].unique())

In [None]:
#check that only free Apple apps left
print(df_apple["price"].unique())

### <center>DATA ANALYSIS</center>

Determine the count and percentage of total for each type of app in the data sets.

In [None]:
#Apple app type count and % of total
#Name the counts explicitly so the column is "App_count" regardless of pandas version
apple_count = df_apple["prime_genre"].value_counts().rename("App_count").to_frame()
apple_count["%_of_total"] = (apple_count["App_count"]/apple_count["App_count"].sum()*100).round(decimals=1)
print(apple_count.head())

In [None]:
#Google app type count and % of total
#Name the counts explicitly so the column is "App_count" regardless of pandas version
google_count = df_google["Category"].value_counts().rename("App_count").to_frame()
google_count["%_of_total"] = (google_count["App_count"]/google_count["App_count"].sum()*100).round(decimals=1)
print(google_count.head())

For Apple apps the Games genre dominates, with 58% of the total apps.  Next closest is Family with 8% of the total.
For Google, Family is the largest, with 19% of the total apps.  Games comes next with 9.7%.
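
As a cross-check on the percentages, pandas can compute the shares directly with `value_counts(normalize=True)`. The sketch below uses a made-up list of genres, not the real distribution, and builds the same two columns as the cells above.

```python
import pandas as pd

# Illustrative genres only; not the real distribution
genres = pd.Series(["Games", "Games", "Games", "Social Networking"],
                   name="prime_genre")

summary = pd.DataFrame({
    "App_count": genres.value_counts(),
    "%_of_total": (genres.value_counts(normalize=True) * 100).round(1),
})
print(summary)
```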

This only gives the number of apps in each category; we still need an indication of how popular the apps are. Google provides install counts, but only as open-ended bins (100+, 10,000+, etc.), and Apple does not provide install counts at all. For Apple we'll use the number of reviews (rating_count_tot) as a proxy for the number of installs. For Google we'll strip the "+" and "," from each bin and treat the remaining number as a lower-bound estimate of installs.

In [None]:
apple_count["total # of reviews"] = df_apple.groupby(by="prime_genre")["rating_count_tot"].sum()
print(apple_count.sort_values(by="total # of reviews", ascending = False).head(10))

In [None]:
#regex=True is required so "[+,]" is treated as a character class, not literal text
df_google["Installs"]= df_google["Installs"].str.replace("[+,]","", regex=True).astype(int)
google_count["total # of installs"] = df_google.groupby(by="Category")["Installs"].sum()
print(google_count.sort_values(by="total # of installs", ascending = False).head(10))
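
The install-bin conversion can be verified on a few sample strings. This is a minimal sketch; the three values below follow the bin format described above, and each converted number is the lower edge of its bin.

```python
import pandas as pd

installs = pd.Series(["100+", "10,000+", "1,000,000+"])

# Remove "+" and "," with a regex character class, then convert to int
as_int = installs.str.replace(r"[+,]", "", regex=True).astype(int)
print(as_int.tolist())  # [100, 10000, 1000000]
```

Because each number is a floor rather than an exact count, the category totals understate actual installs.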

### <center>DISCUSSION & CONCLUSION</center>

Games dominates the app count and # of reviews for the Apple store.  GAME is the 2nd highest percentage of app total and has the highest total number of installs on Google. This is a popular but crowded market where it may be difficult to gain any market share.

Social networking has the 2nd highest number of reviews in the Apple store but is fourth in app count.  Social networking is popular but is much less crowded than Games.

COMMUNICATION is the second most installed app type on Google, ranking fifth in app count. SOCIAL ranks seventh in app count and sixth in installs.
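
The "popular but less crowded" comparison in the paragraphs above could be made explicit by dividing total installs by app count for each category. The sketch below uses made-up numbers, not results from the data sets, and the column name `installs_per_app` is a hypothetical one introduced for illustration.

```python
import pandas as pd

# Illustrative numbers only; not results from the data sets
toy = pd.DataFrame(
    {"App_count": [1000, 200, 150],
     "total_installs": [5_000_000, 4_000_000, 600_000]},
    index=["GAME", "COMMUNICATION", "SOCIAL"],
)

# Average installs per app: high values suggest demand spread over fewer apps
toy["installs_per_app"] = toy["total_installs"] / toy["App_count"]
print(toy.sort_values(by="installs_per_app", ascending=False))
```

An average like this can be dominated by a handful of very large apps, so it is best read alongside the raw counts rather than in place of them.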

Based on this analysis, a social networking app which allows communication between individuals and groups may be our best opportunity.  Further analysis of demographics being served by current social networking apps may guide us to an underserved market, increasing the probability of success.