## 1. Introduction
<p><img src="https://assets.datacamp.com/production/project_1197/img/google_play_store.png" alt="Google Play logo"></p>
<p>Mobile apps are everywhere. They are easy to create and can be very lucrative from a business standpoint. Android in particular keeps expanding as an operating system and has captured more than 74% of the total market<sup><a href="https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009">[1]</a></sup>. </p>
<p>The Google Play Store apps data has enormous potential to facilitate data-driven decisions and insights for businesses. In this notebook, we will analyze the Android app market by comparing ~10k apps in Google Play across different categories. We will also use the user reviews to draw a qualitative comparison between the apps.</p>
<p>The dataset you will use here was scraped from Google Play Store in September 2018 and was published on <a href="https://www.kaggle.com/lava18/google-play-store-apps">Kaggle</a>. Here are the details: <br>
<br></p>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/apps.csv</b></div>
This file contains all the details of the apps on Google Play. There are 9 features that describe a given app.
<ul>
    <li><b>App:</b> Name of the app</li>
    <li><b>Category:</b> Category of the app. Some examples are: ART_AND_DESIGN, FINANCE, COMICS, BEAUTY etc.</li>
    <li><b>Rating:</b> The current average rating (out of 5) of the app on Google Play</li>
    <li><b>Reviews:</b> Number of user reviews given on the app</li>
    <li><b>Size:</b> Size of the app in MB (megabytes)</li>
    <li><b>Installs:</b> Number of times the app was downloaded from Google Play</li>
    <li><b>Type:</b> Whether the app is paid or free</li>
    <li><b>Price:</b> Price of the app in US$</li>
    <li><b>Last Updated:</b> Date on which the app was last updated on Google Play </li>

</ul>
</div>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/user_reviews.csv</b></div>
This file contains a random sample of 100 <i><a href="https://www.androidpolice.com/2019/01/21/google-play-stores-redesigned-ratings-and-reviews-section-lets-you-easily-filter-by-star-rating/">most helpful first</a></i> user reviews for each app. The text in each review has been pre-processed and passed through a sentiment analyzer.
<ul>
    <li><b>App:</b> Name of the app on which the user review was provided. Matches the <code>App</code> column of the <code>apps.csv</code> file</li>
    <li><b>Review:</b> The pre-processed user review text</li>
    <li><b>Sentiment Category:</b> Sentiment category of the user review - Positive, Negative or Neutral</li>
    <li><b>Sentiment Score:</b> Sentiment score of the user review. It lies between [-1,1]. A higher score denotes a more positive sentiment.</li>

</ul>
</div>
<p>From here on, it will be your task to explore and manipulate the data until you are able to answer the three questions described in the instructions panel.<br></p>

## Find the following information:
1. Read the apps.csv file and clean the Installs column to convert it into an integer data type. Save your answer as a DataFrame named apps.
2. Find the number of apps in each category, the average price, and the average rating.
3. Find the top 10 free FINANCE apps having the highest average sentiment score. 

In [2]:
# Use this cell to begin your analysis, and add as many as you would like!
import pandas as pd

In [3]:
#load data
apps = pd.read_csv("datasets/apps.csv")
user_reviews = pd.read_csv("datasets/user_reviews.csv")

In [4]:
#get the 1st 5 rows from apps
apps.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Last Updated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19.0 | 10,000+ | Free | 0.0 | January 7, 2018 |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500,000+ | Free | 0.0 | January 15, 2018 |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7 | 5,000,000+ | Free | 0.0 | August 1, 2018 |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25.0 | 50,000,000+ | Free | 0.0 | June 8, 2018 |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8 | 100,000+ | Free | 0.0 | June 20, 2018 |


In [5]:
# get a general view of apps info
apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 9 columns):
App             9659 non-null object
Category        9659 non-null object
Rating          8196 non-null float64
Reviews         9659 non-null int64
Size            8432 non-null float64
Installs        9659 non-null object
Type            9659 non-null object
Price           9659 non-null float64
Last Updated    9659 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 679.2+ KB


In [6]:
# count the missing (NaN) values in each column
apps.isnull().sum()

App                0
Category           0
Rating          1463
Reviews            0
Size            1227
Installs           0
Type               0
Price              0
Last Updated       0
dtype: int64

# 1. Clean the Installs column of the apps DataFrame

In [7]:
# remove the "," and "+" characters from Installs, then convert the column to int
apps["Installs"] = apps.Installs.apply(lambda x: x.replace(",", "").replace("+", "")).astype("int")
# check that the conversion worked
apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 9 columns):
App             9659 non-null object
Category        9659 non-null object
Rating          8196 non-null float64
Reviews         9659 non-null int64
Size            8432 non-null float64
Installs        9659 non-null int64
Type            9659 non-null object
Price           9659 non-null float64
Last Updated    9659 non-null object
dtypes: float64(3), int64(2), object(4)
memory usage: 679.2+ KB
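
The same cleaning step can also be written with pandas' vectorized string methods instead of `apply`. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from `apps.csv`), assuming every raw value contains only digits, commas, and a trailing plus sign.

```python
import pandas as pd

# Hypothetical sample that mimics the raw "Installs" strings.
toy = pd.DataFrame({"Installs": ["10,000+", "500,000+", "5,000,000+", "0"]})

# Remove every comma and plus sign in one pass, then cast to a 64-bit integer.
toy["Installs"] = (
    toy["Installs"]
    .str.replace(r"[,+]", "", regex=True)
    .astype("int64")
)

print(toy["Installs"].tolist())  # [10000, 500000, 5000000, 0]
```

If the column held any value that is not purely numeric after stripping, `astype` would raise a `ValueError`, which is a useful early signal that the data needs more cleaning.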


# 2. Find the number of apps, average price, and average rating in each category

In [8]:
# # group category and count # of apps in each group, convert to dataframe
# app_group = apps.groupby('Category').size().to_frame()

# # average rating and price of each group
# app_group_info = apps.groupby('Category')[['Price','Rating']].agg('mean')

# #concat two data frame
# app_category_info = app_group.join(app_group_info)

# # reset the index 
# app_category_info.reset_index(inplace=True)

# #rename the columns
# app_category_info.columns = ['Category', 'Number of apps', 'Average price', 'Average rating']

# #print the 1st 5 rows
# app_category_info.head()

In [9]:
#group category and find number of apps, average price and average rating
app_category_info = apps.groupby('Category').agg({'Category': 'size', 'Price':'mean', 'Rating':'mean'})

#rename columns
app_category_info.columns = ['Number of apps', 'Average price', 'Average rating']

# reset the index 
app_category_info.reset_index(inplace=True)
app_category_info.head()

| | Category | Number of apps | Average price | Average rating |
|---|---|---|---|---|
| 0 | ART_AND_DESIGN | 64 | 0.093281 | 4.357377 |
| 1 | AUTO_AND_VEHICLES | 85 | 0.158471 | 4.190411 |
| 2 | BEAUTY | 53 | 0.0 | 4.278571 |
| 3 | BOOKS_AND_REFERENCE | 222 | 0.539505 | 4.34497 |
| 4 | BUSINESS | 420 | 0.417357 | 4.098479 |
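
A per-category summary like this can also be built with pandas named aggregation, which sets the output column names in the same call and avoids a separate renaming step. This is only a sketch on invented data (`toy` and its values are assumptions), and it requires pandas 0.25 or later. It counts apps via the non-null `App` column, which matches the group size only when `App` has no missing values.

```python
import pandas as pd

# Hypothetical miniature version of the apps table.
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Category": ["ART", "ART", "FINANCE", "FINANCE"],
    "Price": [0.0, 2.0, 0.0, 0.0],
    "Rating": [4.0, None, 3.5, 4.5],
})

# Each keyword maps an output column name to (input column, aggregation).
# The dict is unpacked with ** because the names contain spaces.
summary = (
    toy.groupby("Category")
    .agg(**{
        "Number of apps": ("App", "count"),
        "Average price": ("Price", "mean"),
        "Average rating": ("Rating", "mean"),
    })
    .reset_index()
)

print(summary)
```

Note that `mean` skips missing ratings, so the average rating of a category is computed only over the apps that actually have a rating.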


# 3. Find the top 10 free Finance apps by average sentiment score

In [10]:
# inspect the user reviews before finding the top 10 free FINANCE apps
user_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 4 columns):
App                   64295 non-null object
Review                37427 non-null object
Sentiment Category    37432 non-null object
Sentiment Score       37432 non-null float64
dtypes: float64(1), object(3)
memory usage: 2.0+ MB


In [11]:
user_reviews.isnull().sum()

App                       0
Review                26868
Sentiment Category    26863
Sentiment Score       26863
dtype: int64

In [12]:
user_reviews.head()

| | App | Review | Sentiment Category | Sentiment Score |
|---|---|---|---|---|
| 0 | 10 Best Foods for You | I like eat delicious food. That's I'm cooking ... | Positive | 1.0 |
| 1 | 10 Best Foods for You | This help eating healthy exercise regular basis | Positive | 0.25 |
| 2 | 10 Best Foods for You | NaN | NaN | NaN |
| 3 | 10 Best Foods for You | Works great especially going grocery store | Positive | 0.4 |
| 4 | 10 Best Foods for You | Best idea us | Positive | 1.0 |


In [13]:
# extract the free finance apps
free_finance_apps = apps[(apps.Category == 'FINANCE') & (apps.Type == 'Free')]

# keep only the user reviews that belong to the free finance apps
finance_app_reviews = user_reviews[user_reviews.App.isin(free_finance_apps.App)]
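
An alternative to filtering with `isin` is to merge the reviews with the app metadata first and then filter on the merged columns, which keeps `Category` and `Type` next to each review. The sketch below uses invented frames (`apps_toy`, `reviews_toy`, and all values are assumptions) and assumes app names are unique in the apps table. Duplicate names would multiply review rows in the merge.

```python
import pandas as pd

# Hypothetical app metadata and reviews.
apps_toy = pd.DataFrame({
    "App": ["Bank One", "Pay Pro", "Sketch It"],
    "Category": ["FINANCE", "FINANCE", "ART_AND_DESIGN"],
    "Type": ["Free", "Paid", "Free"],
})
reviews_toy = pd.DataFrame({
    "App": ["Bank One", "Bank One", "Pay Pro", "Sketch It"],
    "Sentiment Score": [0.4, None, 0.9, 0.1],
})

# Attach category and type to every review via the shared App column.
merged = reviews_toy.merge(apps_toy, on="App", how="inner")

# Keep reviews of free finance apps, then drop those without a score.
free_finance = merged[(merged["Category"] == "FINANCE") & (merged["Type"] == "Free")]
scored = free_finance.dropna(subset=["Sentiment Score"])

print(scored[["App", "Sentiment Score"]])
```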

In [14]:
# check finance_app_reviews missing value information
finance_app_reviews.isnull().sum()

App                     0
Review                765
Sentiment Category    765
Sentiment Score       765
dtype: int64

In [15]:
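# drop the rows with missing values found in the previous cell;
# reassign rather than use inplace=True, since finance_app_reviews is a filtered
# slice of user_reviews and in-place changes can trigger SettingWithCopyWarning
finance_app_reviews = finance_app_reviews.dropna()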

#check finance_app_reviews missing value information again
finance_app_reviews.isnull().sum()

App                   0
Review                0
Sentiment Category    0
Sentiment Score       0
dtype: int64

In [16]:
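# average the sentiment score per app and keep the 10 highest averages;
# select the numeric column explicitly, because taking the mean of the text
# columns raises a TypeError in recent pandas versions
top_10_user_feedback = finance_app_reviews.groupby('App')[['Sentiment Score']].mean().sort_values(by='Sentiment Score', ascending=False).reset_index()[:10]
top_10_user_feedback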

| | App | Sentiment Score |
|---|---|---|
| 0 | BBVA Spain | 0.515086 |
| 1 | Associated Credit Union Mobile | 0.388093 |
| 2 | BankMobile Vibe App | 0.353455 |
| 3 | A+ Mobile | 0.329592 |
| 4 | Current debit card and app made for teens | 0.327258 |
| 5 | BZWBK24 mobile | 0.326883 |
| 6 | Even - organize your money, get paid early | 0.283929 |
| 7 | Credit Karma | 0.270052 |
| 8 | Fortune City - A Finance App | 0.266966 |
| 9 | Branch | 0.26423 |
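
The ranking step can also be expressed with `Series.nlargest`, which selects the highest values directly instead of sorting the whole result and slicing it. A minimal sketch on made-up review scores follows (`reviews` and its values are assumptions, and it keeps the top 2 rather than the top 10 so the example stays small).

```python
import pandas as pd

# Hypothetical per-review sentiment scores for three apps.
reviews = pd.DataFrame({
    "App": ["A", "A", "B", "B", "C"],
    "Sentiment Score": [0.5, 0.1, 0.9, 0.7, -0.2],
})

# Average the score per app, then keep the n apps with the highest average.
top = (
    reviews.groupby("App")["Sentiment Score"]
    .mean()
    .nlargest(2)
    .reset_index()
)

print(top)
```

Here app B ranks first and app A second, while app C is dropped because its average is the lowest.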
