user_reviews.csv: contains 100 reviews for each app.
The text in each review has been pre-processed and annotated with three new features:
Sentiment (Positive, Negative or Neutral),
Sentiment Polarity and Sentiment Subjectivity.


In [None]:
#apps.csv: contains all the details of the applications on Google Play. There are 13 features that describe a given app.

In [None]:
# Read in dataset
import pandas as pd
apps = pd.read_csv("../input/apps.csv")

# Column names to check for duplication
column_names = ['App']
duplicates = apps.duplicated(subset = column_names, keep = False)
# Output duplicate values
apps[duplicates].sort_values(by = 'App')

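# Drop duplicates, keeping the first occurrence of each app
apps = apps.drop_duplicates(subset = column_names)

# Print the total number of apps (computed rather than hard-coded)
print('Total number of apps in the dataset = ', len(apps))

# Print a concise summary of apps dataframe
# (info() prints directly and returns None, so display() is not needed)
apps.info()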

# Print the first five rows
display(apps.head())
# Have a look at a random sample of n rows
n = 5
apps.sample(n)


Data cleaning
The four features that we will work with most often are Installs, Size, Rating and Price.
The info() output from the previous cell shows that the Installs and Price columns
are of type object, not int64 or float64 as we would expect.
This is because these columns contain characters other than the digits 0-9.
Ideally, these columns should be numeric, as their names suggest.
Hence, we now clean the data so that it can be used in our analysis later.
Specifically, the special characters (, $ +) in the Installs and Price columns
make their conversion to a numeric data type difficult.
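
As a small illustration of the cleaning step described above, the sketch below strips the three special characters from a few made-up values and converts the result to numbers. The `toy` dataframe and its values are hypothetical, not taken from apps.csv. Passing `regex=False` is a deliberate choice here: `+` and `$` are regular-expression operators, so treating them as literal text avoids surprises.

```python
import pandas as pd

# Hypothetical rows imitating the raw 'Installs' and 'Price' strings
toy = pd.DataFrame({
    "Installs": ["10,000+", "500+", "1,000,000+"],
    "Price": ["0", "$4.99", "$0.99"],
})

for col in ["Installs", "Price"]:
    for char in ["+", ",", "$"]:
        # regex=False treats '+' and '$' as literal characters
        toy[col] = toy[col].str.replace(char, "", regex=False)
    # Convert the cleaned strings to a numeric dtype
    toy[col] = pd.to_numeric(toy[col])

print(toy.dtypes)
print(toy)
```

After this step both columns should report a numeric dtype instead of object.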


In [None]:
# List of characters to remove
chars_to_remove = ['+', ',', '$']
# List of column names to clean
cols_to_clean = ['Installs', 'Price']

for col in cols_to_clean:
    # Replace each character with an empty string
    # (regex=False: '+' and '$' are regex operators and must be matched literally)
    for char in chars_to_remove:
        apps[col] = apps[col].str.replace(char, '', regex=False)
    # Convert col to numeric
    apps[col] = pd.to_numeric(apps[col])
apps.info()

In [None]:
# Print the total number of unique categories
categories=apps['Category'].unique()
print(categories)
num_categories = len(apps['Category'].unique())
print('Number of categories = ', num_categories)

In [None]:
# Count the number of apps in each 'Category' and sort them in descending order
num_apps_in_categ = apps['Category'].value_counts().sort_values(ascending = False)
print(num_apps_in_categ)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 4))
num_apps_in_categ.plot(kind='bar')
plt.xlabel("Categories", fontsize=14)
plt.ylabel("Number of apps", fontsize=14)
plt.tick_params(labelsize=12)
plt.show()



Distribution of app ratings:
After having seen the market share of each category of apps,
let's see how all these apps perform on average. App ratings (on a scale of 1 to 5) affect the discoverability
and conversion of apps as well as the company's overall brand image. Ratings are a key performance indicator of an app.

From our research, we found that the average rating across all app categories is 4.17.
The histogram is skewed to the left, indicating that the majority of the apps are highly rated, with only
a few low-rated exceptions.
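
One way to check the direction of skew numerically, rather than by eye, is to compare the mean with the median and to compute the sample skewness. The sketch below uses a handful of made-up ratings (the `ratings` values are hypothetical, not drawn from apps.csv). A negative skewness means the long tail extends toward low ratings while most values sit at the high end.

```python
import pandas as pd

# Hypothetical ratings: most apps score well, a few score poorly
ratings = pd.Series([4.5, 4.4, 4.3, 4.6, 4.2, 4.7, 4.1, 4.0, 2.0, 1.0])

mean_rating = ratings.mean()
median_rating = ratings.median()
# Sample skewness: negative means a long tail on the low side
skewness = ratings.skew()

print(f"mean = {mean_rating:.2f}, median = {median_rating:.2f}, skew = {skewness:.2f}")
```

On the real data, the same three numbers could be computed from `apps['Rating']`; missing ratings are skipped by default.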
 

In [None]:
# Average rating of apps
avg_app_rating = apps['Rating'].mean()
print('Average app rating = ', avg_app_rating)

In [None]:

apps['Rating'].plot(kind='hist',bins=25)
plt.axvline(avg_app_rating, color='green', linewidth=2 , linestyle='--')
plt.xlabel("Rating")
plt.show()


Size and price of an app:
How can we effectively come up with strategies to size and price our app?

Does the size of an app affect its rating?
Do users really care about system-heavy apps or do they prefer lightweight apps?
Does the price of an app affect its rating?
Do users always prefer free apps over paid apps?
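
Alongside scatter plots, a rough numeric answer to these questions could come from a correlation coefficient. The sketch below is only illustrative: the `sample` dataframe is hypothetical, Size is assumed to be in megabytes, and Pearson correlation captures linear association only, so a value near zero does not rule out other kinds of relationship.

```python
import pandas as pd

# Hypothetical sample; Size assumed to be in MB, Price in USD
sample = pd.DataFrame({
    "Size":   [2.5, 10.0, 25.0, 48.0, 90.0, None],
    "Price":  [0.0, 0.0, 1.99, 4.99, 0.0, 2.99],
    "Rating": [4.1, 4.4, 4.3, 4.6, None, 3.9],
})

# Keep only rows where all three values are present
complete = sample.dropna(subset=["Size", "Price", "Rating"])

# Pearson correlation of Rating with each candidate driver
size_corr = complete["Size"].corr(complete["Rating"])
price_corr = complete["Price"].corr(complete["Rating"])

print(f"Size vs Rating:  {size_corr:.2f}")
print(f"Price vs Rating: {price_corr:.2f}")
```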


In [None]:
# Filter rows where both Rating and Size values are not null
apps_Non_Null = apps[(~apps['Rating'].isnull()) & (~apps['Size'].isnull())]

In [None]:
# Plot size vs. rating
sns.scatterplot(x ='Size', y ='Rating', data = apps_Non_Null,alpha=0.2)
plt.show()


In [None]:
# Subset apps whose 'Type' is 'Paid'
paid_apps = apps_Non_Null[apps_Non_Null['Type'] == 'Paid']

# Plot price vs. rating
sns.scatterplot(x = 'Price', y ='Rating', data = paid_apps)
plt.show()

6. Relation between app category and app price
So now comes the hard part. How are companies and developers supposed to make ends meet?
What monetization strategies can companies use to maximize profit? The costs of apps are largely based on features,
complexity, and platform.

There are many factors to consider when selecting the right pricing strategy for your mobile app.
It is important to consider the willingness of your customer to pay for your app.
A wrong price could break the deal before the download even happens.
Potential customers could be turned off by what they perceive to be a shocking cost,
or they might delete an app they’ve downloaded after receiving too many ads or simply not getting their money's worth.

Different categories demand different price ranges. Some apps that are simple and used daily,
like the calculator app, should probably be kept free. However,
it would make sense to charge for a highly-specialized medical app that diagnoses diabetic patients.
Below, we see that Medical and Family apps are the most expensive. Some medical apps extend even up to $80!
All game apps are reasonably priced below $20.
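
A per-category price summary can complement the scatter plot when comparing price ranges. The sketch below is a minimal example on made-up rows (the `toy` dataframe is hypothetical); it restricts to paid apps so that free apps do not pull every median toward zero.

```python
import pandas as pd

# Hypothetical rows with the same column names as apps.csv
toy = pd.DataFrame({
    "Category": ["MEDICAL", "MEDICAL", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Type": ["Paid", "Paid", "Paid", "Free", "Paid", "Free"],
    "Price": [79.99, 9.99, 4.99, 0.0, 2.99, 0.0],
})

# Only paid apps carry price information worth summarising
paid_only = toy[toy["Type"] == "Paid"]

# Count, median and maximum price per category, most expensive first
price_summary = (
    paid_only.groupby("Category")["Price"]
    .agg(["count", "median", "max"])
    .sort_values("max", ascending=False)
)
print(price_summary)
```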

In [None]:

# Select a few popular app categories
popular_app_cats = apps[apps.Category.isin(['GAME', 'FAMILY', 'PHOTOGRAPHY',
                                            'MEDICAL', 'TOOLS', 'FINANCE',
                                            'LIFESTYLE','BUSINESS'])]

sns.scatterplot(x="Price",y="Category",data= popular_app_cats)
plt.show()

In [None]:
# Apps whose Price is greater than 200
apps_above_200=popular_app_cats[popular_app_cats['Price']>200]
Category_App_Price = apps_above_200[['Category', 'App', 'Price']]
Category_App_Price

7. Filter out "junk" apps
It looks like a bunch of the really expensive apps are "junk" apps.
That is, apps that don't really have a purpose.
Some app developers may create an app called I Am Rich Premium or most expensive app (H)
just as a joke or to test their app development skills. Some developers even do this with malicious intent
and try to make money by hoping people accidentally click purchase on their app in the store.

Let's filter out these junk apps and re-do our visualization. The distribution of apps under $20 becomes clearer.

In [None]:
# Select apps priced below $100
apps_under_100 = popular_app_cats[popular_app_cats['Price']<100]

# Examine price vs category with the authentic apps
sns.scatterplot(x="Price",y="Category",data= apps_under_100)
plt.show()

8. Popularity of paid apps vs free apps
For apps in the Play Store today, there are five types of pricing strategies: free, freemium, paid, paymium, and subscription.
Let's focus on free and paid apps only. Some characteristics of free apps are:

Free to download.
Main source of income often comes from advertisements.
Often created by companies that have other products and the app serves as an extension of those products.
Can serve as a tool for customer retention, communication, and customer service.

Some characteristics of paid apps are:

Users are asked to pay once for the app to download and use it.
The user can't really get a feel for the app before buying it.
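
To compare the popularity of the two pricing types with a single number, the median install count per type is a reasonable starting point, since install counts span several orders of magnitude and the mean is dominated by a few very popular apps. The sketch below uses hypothetical values in `toy`.

```python
import pandas as pd

# Hypothetical install counts for free and paid apps
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid", "Paid"],
    "Installs": [1_000_000, 50_000, 10_000, 5_000, 1_000, 100],
})

# Median installs per pricing type; robust to a few very popular apps
installs_by_type = toy.groupby("Type")["Installs"].median()
print(installs_by_type)
```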

In [None]:
# Data for paid apps
paid=apps[apps['Type'] == 'Paid']
# Data for free apps
free=apps[apps['Type'] == 'Free']


fig, ax=plt.subplots()
ax.boxplot([paid['Installs'],free['Installs']])
ax.set_yscale("log")
ax.set_xticklabels(["paid", "free"])
ax.set_ylabel("Installs")
plt.show()

9. Sentiment analysis of user reviews
Mining user review data to determine how people feel about your product, brand,
or service can be done using a technique called sentiment analysis.
User reviews for apps can be analyzed to identify if the mood is positive, negative or neutral about that app.
For example, positive words in an app review might include words such as 'amazing', 'friendly', 'good', 'great', and 'love'. 
Negative words might be words like 'malware', 'hate', 'problem', 'refund', and 'incompetent'.

By plotting sentiment polarity scores of user reviews for paid and free apps,
we observe that free apps receive a lot of harsh comments, as indicated by the outliers on the negative y-axis.
Reviews for paid apps appear never to be extremely negative. This may indicate something about app quality, i.e.,
paid apps being of higher quality than free apps on average.
The median polarity score for paid apps is a little higher than that for free apps, which is consistent with our previous observation.
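
The comparison described above can also be read off a small summary table instead of a box plot. The sketch below merges two hypothetical dataframes (`apps_toy` and `reviews_toy`, with invented values) on the App column and reports the minimum, median and maximum polarity per type. Note that an inner join keeps only apps present in both tables.

```python
import pandas as pd

# Hypothetical app metadata and review scores
apps_toy = pd.DataFrame({"App": ["A", "B", "C"], "Type": ["Free", "Paid", "Free"]})
reviews_toy = pd.DataFrame({
    "App": ["A", "A", "B", "B", "C", "D"],
    "Sentiment_Polarity": [-0.8, 0.4, 0.5, 0.3, 0.1, 0.9],
})

# Inner join: the review for app "D" has no metadata and is dropped
merged_toy = pd.merge(apps_toy, reviews_toy, on="App", how="inner")

# Spread of polarity scores for each pricing type
polarity_by_type = (
    merged_toy.groupby("Type")["Sentiment_Polarity"].agg(["min", "median", "max"])
)
print(polarity_by_type)
```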

In this notebook, we analyzed over ten thousand apps from the Google Play Store.
We can use our findings to inform our decisions should we ever wish to create an app ourselves.

In [None]:
# Load user_reviews.csv
reviews_df = pd.read_csv('../input/user_reviews.csv')

# Merge the two dataframes on the App column
merged_df = pd.merge(apps, reviews_df, on = 'App', how = "inner")

# Drop NA values from Sentiment and Translated_Review columns
merged_df = merged_df.dropna(subset=['Sentiment', 'Translated_Review'])

merged_df.info()

In [None]:
# User review sentiment polarity for paid vs. free apps
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11, 8)
ax = sns.boxplot(x = 'Type', y = 'Sentiment_Polarity', data = merged_df)
ax.set_title('Sentiment Polarity Distribution')