## 1. Introduction
<p><img src="https://assets.datacamp.com/production/project_1197/img/google_play_store.png" alt="Google Play logo"></p>
<p>Mobile apps are everywhere. They are easy to create and can be very lucrative from the business standpoint. Specifically, Android is expanding as an operating system and has captured more than 74% of the total market<sup><a href="https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009">[1]</a></sup>. </p>
<p>The Google Play Store apps data has enormous potential to facilitate data-driven decisions and insights for businesses. In this notebook, we will analyze the Android app market by comparing ~10k apps in Google Play across different categories. We will also use the user reviews to draw a qualitative comparison between the apps.</p>
<p>The dataset you will use here was scraped from Google Play Store in September 2018 and was published on <a href="https://www.kaggle.com/lava18/google-play-store-apps">Kaggle</a>. Here are the details: <br>
<br></p>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/apps.csv</b></div>
This file contains all the details of the apps on Google Play. There are 9 features that describe a given app.
<ul>
    <li><b>App:</b> Name of the app</li>
    <li><b>Category:</b> Category of the app. Some examples are: ART_AND_DESIGN, FINANCE, COMICS, BEAUTY etc.</li>
    <li><b>Rating:</b> The current average rating (out of 5) of the app on Google Play</li>
    <li><b>Reviews:</b> Number of user reviews given on the app</li>
    <li><b>Size:</b> Size of the app in MB (megabytes)</li>
    <li><b>Installs:</b> Number of times the app was downloaded from Google Play</li>
    <li><b>Type:</b> Whether the app is paid or free</li>
    <li><b>Price:</b> Price of the app in US$</li>
    <li><b>Last Updated:</b> Date on which the app was last updated on Google Play </li>

</ul>
</div>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/user_reviews.csv</b></div>
This file contains a random sample of 100 <i><a href="https://www.androidpolice.com/2019/01/21/google-play-stores-redesigned-ratings-and-reviews-section-lets-you-easily-filter-by-star-rating/">most helpful first</a></i> user reviews for each app. The text in each review has been pre-processed and passed through a sentiment analyzer.
<ul>
    <li><b>App:</b> Name of the app on which the user review was provided. Matches the <code>App</code> column of the <code>apps.csv</code> file</li>
    <li><b>Review:</b> The pre-processed user review text</li>
    <li><b>Sentiment Category:</b> Sentiment category of the user review - Positive, Negative or Neutral</li>
    <li><b>Sentiment Score:</b> Sentiment score of the user review. It lies between [-1,1]. A higher score denotes a more positive sentiment.</li>

</ul>
</div>
<p>From here on, it will be your task to explore and manipulate the data until you are able to answer the three questions described in the instructions panel.<br></p>

**CODE BEGINS HERE**

Import the relevant modules and load the two CSV files as DataFrames.

In [2]:
import pandas as pd

In [3]:
apps = pd.read_csv('datasets/apps.csv')
reviews = pd.read_csv('datasets/user_reviews.csv')

Let's get an overview of the two DataFrames using the .info() and .head() methods: column names, non-null counts, and data types.

In [4]:
apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   App           9659 non-null   object 
 1   Category      9659 non-null   object 
 2   Rating        8196 non-null   float64
 3   Reviews       9659 non-null   int64  
 4   Size          8432 non-null   float64
 5   Installs      9659 non-null   object 
 6   Type          9659 non-null   object 
 7   Price         9659 non-null   float64
 8   Last Updated  9659 non-null   object 
dtypes: float64(3), int64(1), object(5)
memory usage: 679.3+ KB


In [5]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   App                 64295 non-null  object 
 1   Review              37427 non-null  object 
 2   Sentiment Category  37432 non-null  object 
 3   Sentiment Score     37432 non-null  float64
dtypes: float64(1), object(3)
memory usage: 2.0+ MB


In [6]:
reviews.head()

Unnamed: 0,App,Review,Sentiment Category,Sentiment Score
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25
2,10 Best Foods for You,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4
4,10 Best Foods for You,Best idea us,Positive,1.0


The .sample() method returns a random sample of rows from the DataFrame. Unlike .head(), which always shows the first few rows, it lets us examine a more varied slice of the data.

In [7]:
apps.sample(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
3341,My Android Device S/W & H/W,FAMILY,4.5,681,2.8,"50,000+",Free,0.0,"November 20, 2015"
8719,Flag Of European Union LWP,PERSONALIZATION,4.3,88,4.2,"1,000+",Free,0.0,"January 24, 2018"
7525,Cursive Writing Wizard - Handwriting,FAMILY,4.0,3745,31.0,"1,000,000+",Free,0.0,"December 3, 2017"
6042,CA Mobile OTP,TOOLS,3.3,688,5.5,"100,000+",Free,0.0,"September 5, 2017"
8170,EF Forms,BUSINESS,5.0,2,23.0,50+,Free,0.0,"July 24, 2018"
498,"Chatting - Free chat, random chat, boyfriend, ...",DATING,4.2,2506,6.1,"500,000+",Free,0.0,"June 15, 2017"
7398,Salah Widget (DK+Malmo),LIFESTYLE,4.5,532,6.4,"10,000+",Free,0.0,"June 28, 2015"
5023,Upohar BD,BUSINESS,3.4,9,1.7,"1,000+",Free,0.0,"October 8, 2016"
7454,DM airdisk,TOOLS,2.8,11,11.0,"1,000+",Free,0.0,"August 5, 2016"
2571,Fraction Calculator Plus Free,TOOLS,4.5,148506,11.0,"5,000,000+",Free,0.0,"July 15, 2018"
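
Because .sample() draws different rows on every run, the output above will not repeat exactly. As a minimal sketch on a made-up toy DataFrame (the names `toy_apps`, `first_draw`, and `second_draw` are illustrative, not part of the project), passing `random_state` should make the draw reproducible:

```python
import pandas as pd

# Toy stand-in for the apps DataFrame (values are made up for illustration).
toy_apps = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Rating': [4.5, 3.9, None, 4.1, 4.8, 2.7],
})

# Fixing random_state makes the "random" sample reproducible across runs.
first_draw = toy_apps.sample(3, random_state=42)
second_draw = toy_apps.sample(3, random_state=42)

print(first_draw)
print(first_draw.equals(second_draw))  # same seed, same rows
```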


The first task is to convert the values in the 'Installs' column to integers. Originally, the 'Installs' column contains strings such as '50,000+'.

To convert the column, we first need to remove the ',' and '+' characters from its values. We do that with the .str.replace() method on the Series, which leaves strings containing only digits. We then use the to_numeric() function from pandas to convert the Series from strings (which pandas stores with the object dtype) to int64.

In [8]:
# Removing special characters from the 'Installs' column.

chars_to_replace = [',', '+']

for char in chars_to_replace:
    apps['Installs'] = apps['Installs'].str.replace(char, '', regex=False)
apps['Installs'] = pd.to_numeric(apps['Installs'], errors='raise')

We can see that the Installs column is now of data type int64.

In [9]:
apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   App           9659 non-null   object 
 1   Category      9659 non-null   object 
 2   Rating        8196 non-null   float64
 3   Reviews       9659 non-null   int64  
 4   Size          8432 non-null   float64
 5   Installs      9659 non-null   int64  
 6   Type          9659 non-null   object 
 7   Price         9659 non-null   float64
 8   Last Updated  9659 non-null   object 
dtypes: float64(3), int64(2), object(4)
memory usage: 679.3+ KB
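
As an alternative sketch of the same cleanup, both characters can be removed in a single regex pass instead of a loop. The toy Series below only mimics the raw 'Installs' strings; `raw_installs`, `cleaned`, and `installs` are illustrative names:

```python
import pandas as pd

# Toy Series mimicking the raw 'Installs' strings (values are illustrative).
raw_installs = pd.Series(['50,000+', '1,000+', '50+', '5,000,000+'])

# One regex pass removes both ',' and '+': the character class [,+] matches either.
cleaned = raw_installs.str.replace(r'[,+]', '', regex=True)

# errors='raise' (the default) stops on any value that is still not numeric.
installs = pd.to_numeric(cleaned, errors='raise')

print(installs.tolist())
print(installs.dtype)
```

Keeping `errors='raise'` is a deliberate choice here: an unexpected value surfaces immediately rather than silently turning into NaN.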


Again, let's have a look at the modified apps DataFrame.

In [10]:
apps.sample(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
5628,Theme for Android 7.0,PERSONALIZATION,,0,,5000,Free,0.0,"October 18, 2017"
6697,iReadMe,PRODUCTIVITY,5.0,8,22.0,100,Free,0.0,"March 6, 2018"
646,GRE Prep & Practice by Magoosh,EDUCATION,4.4,3963,,100000,Free,0.0,"June 6, 2018"
9485,FN FAL rifle explained,BOOKS_AND_REFERENCE,,1,7.3,10,Paid,6.49,"September 6, 2015"
6267,Best CG Photography,FAMILY,,1,2.5,500,Free,0.0,"June 24, 2015"
6310,DRAGON QUEST IV,FAMILY,4.7,1647,16.0,10000,Paid,14.99,"February 28, 2017"
9361,FK Liepaja,SPORTS,,4,26.0,100,Free,0.0,"March 17, 2018"
9400,Scratch-Off Guide for FL Lotto,FAMILY,4.6,25,5.6,5000,Free,0.0,"July 4, 2018"
3759,Grand Theft Auto V: The Manual,FAMILY,4.1,182103,6.9,5000000,Free,0.0,"July 31, 2018"
5040,BANGLA TV 3G/4G,SPORTS,4.1,142,5.3,100000,Free,0.0,"June 7, 2018"


We can also see that the 'Last Updated' column still holds strings (dtype object). pandas has a dedicated dtype, datetime64, for processing date/time values. So, let's convert the values in the 'Last Updated' column to that type.

pandas has a function called to_datetime(), which is similar to the to_numeric() function we used earlier. Here we also specify the format of the date/time strings in the column, using format codes. Format codes are placeholders such as %B (full month name), %d (day of the month), and %Y (four-digit year) that describe how each part of the date is written.

I looked up the format codes in the strftime docs at https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.

In [11]:
date = pd.to_datetime(apps['Last Updated'], format = '%B %d, %Y')
apps['Last Updated'] = date


In [12]:
apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   App           9659 non-null   object        
 1   Category      9659 non-null   object        
 2   Rating        8196 non-null   float64       
 3   Reviews       9659 non-null   int64         
 4   Size          8432 non-null   float64       
 5   Installs      9659 non-null   int64         
 6   Type          9659 non-null   object        
 7   Price         9659 non-null   float64       
 8   Last Updated  9659 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(2), object(3)
memory usage: 679.3+ KB
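
To see how the format string maps onto the date strings, here is a small sketch on a toy Series written in the same "Month day, Year" layout. The names `raw_dates` and `parsed` are illustrative only:

```python
import pandas as pd

# Toy strings in the same "Month day, Year" layout as 'Last Updated'.
raw_dates = pd.Series(['January 7, 2018', 'November 20, 2015', 'July 31, 2018'])

# %B = full month name, %d = day of the month, %Y = four-digit year.
parsed = pd.to_datetime(raw_dates, format='%B %d, %Y')

print(parsed.dt.year.tolist())
print(parsed.max() - parsed.min())  # datetime values support arithmetic
```

Once the column is a datetime type, the `.dt` accessor and date arithmetic become available, which plain strings do not offer.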


For the next task, we need to find the number of apps, the average price, and the average rating in each category. The answer should be saved as a DataFrame called app_category_info, with its four columns named: Category, Number of apps, Average price, Average rating.

In [13]:
no_apps_df = apps.groupby('Category')['App'].count()
avg_price_df = apps.groupby('Category')['Price'].agg('mean')
avg_rating_df = apps.groupby('Category')['Rating'].agg('mean')

In [14]:
app_category_info = pd.DataFrame(data = {
                     'Number of apps':no_apps_df, 
                     'Average price':avg_price_df, 
                     'Average rating':avg_rating_df})

In [15]:
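# reset_index() returns a new DataFrame, so assign it back to keep 'Category' as a column.
app_category_info = app_category_info.reset_index()
app_category_info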

Unnamed: 0,Category,Number of apps,Average price,Average rating
0,ART_AND_DESIGN,64,0.093281,4.357377
1,AUTO_AND_VEHICLES,85,0.158471,4.190411
2,BEAUTY,53,0.0,4.278571
3,BOOKS_AND_REFERENCE,222,0.539505,4.34497
4,BUSINESS,420,0.417357,4.098479
5,COMICS,56,0.0,4.181481
6,COMMUNICATION,315,0.263937,4.121484
7,DATING,171,0.160468,3.970149
8,EDUCATION,119,0.150924,4.364407
9,ENTERTAINMENT,102,0.078235,4.135294
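
The three separate groupby calls can also be expressed as one named aggregation. The following is a sketch on made-up toy data, not the project's required solution; `toy_apps` and `summary` are illustrative names:

```python
import pandas as pd

# Toy data: categories and numbers are made up for illustration.
toy_apps = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Category': ['FINANCE', 'FINANCE', 'TOOLS', 'TOOLS'],
    'Price': [0.0, 2.0, 0.0, 0.0],
    'Rating': [4.0, None, 3.0, 5.0],
})

# Named aggregation: output_column=(input_column, function).
# Dict unpacking is needed because the output names contain spaces.
summary = toy_apps.groupby('Category').agg(**{
    'Number of apps': ('App', 'count'),
    'Average price': ('Price', 'mean'),
    'Average rating': ('Rating', 'mean'),
}).reset_index()

print(summary)
```

Note that 'mean' skips missing ratings, so a category's average rating is computed only over the apps that actually have one.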


In [27]:
app_category_info['Number of apps'].sum()

9659

In [13]:
# DATAFRAME CREATED, #2 DONE

apps.sample(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
2263,HD Camera Ultra,PHOTOGRAPHY,4.3,462152,1.5,10000000,Free,0.0,"October 17, 2015"
7145,"10,000 Quotes DB (Premium)",BOOKS_AND_REFERENCE,4.1,70,3.5,500,Paid,0.99,"August 30, 2013"
2786,"Polaris Office - Word, Docs, Sheets, Slide, PDF",PRODUCTIVITY,4.3,549900,60.0,10000000,Free,0.0,"July 18, 2018"
5107,Sexy Hot Detector Prank,FAMILY,3.9,17067,2.7,5000000,Free,0.0,"February 13, 2018"
5976,SegPlay Mobile Paint by Number,FAMILY,3.7,1478,43.0,100000,Free,0.0,"April 22, 2017"
206,Call Blocker,BUSINESS,4.6,188841,3.2,5000000,Free,0.0,"June 21, 2018"
1291,Safeway,LIFESTYLE,4.3,33572,37.0,1000000,Free,0.0,"August 2, 2018"
6557,Pokémon TV,FAMILY,4.2,117461,,5000000,Free,0.0,"June 29, 2018"
2128,Rossmann PL,SHOPPING,4.0,15867,,5000000,Free,0.0,"August 6, 2018"
2855,Baby Care & Tracker,PARENTING,4.1,319,11.0,100000,Free,0.0,"July 12, 2018"



Of the 345 finance apps in the dataset, 328 are free.

In [14]:
score_apps_table = apps.merge(reviews, on='App', how='inner')

In [21]:
score_apps_table.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated,Review,Sentiment Category,Sentiment Score
55224,ColorNote Notepad Notes,PRODUCTIVITY,4.6,2401017,,100000000,Free,0.0,"June 27, 2018",A useful particularly forgetful people like me...,Positive,0.191667
56490,HTC Gallery,VIDEO_PLAYERS,4.1,45744,,10000000,Free,0.0,"June 3, 2016",,,
27706,Cooking Fever,GAME,4.5,3197865,82.0,100000000,Free,0.0,"July 12, 2018",Why drawback game??? I upgraded 100% upgrading...,Negative,-0.4
47818,Cheapflights – Flight Search,TRAVEL_AND_LOCAL,4.4,47780,19.0,5000000,Free,0.0,"July 31, 2018",Very nice,Positive,0.78
1452,Filters for Selfie,BEAUTY,4.3,8572,25.0,1000000,Free,0.0,"May 10, 2018",Disgusting.... I think good reviews paid.... !...,Negative,-0.0625
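
An inner merge keeps only apps that appear in both tables, so apps without any reviews (and reviews for unknown apps) drop out. A small sketch on toy data illustrates this; all names and values below are made up, and the outer merge with `indicator=True` is shown only as a way to inspect what the inner merge discards:

```python
import pandas as pd

toy_apps = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Category': ['FINANCE', 'TOOLS', 'GAME'],
})
toy_reviews = pd.DataFrame({
    'App': ['A', 'A', 'C', 'Z'],
    'Sentiment Score': [0.5, -0.2, None, 0.9],
})

# Inner merge: one output row per matching (app, review) pair.
inner = toy_apps.merge(toy_reviews, on='App', how='inner')

# indicator=True adds a '_merge' column showing where each row came from.
outer = toy_apps.merge(toy_reviews, on='App', how='outer', indicator=True)

print(inner)
print(outer['_merge'].value_counts())
```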


In [28]:
finance_scores_table = score_apps_table[(score_apps_table['Category'] == 'FINANCE') & (score_apps_table['Type'] == 'Free')]

In [29]:
finance_scores_table

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated,Review,Sentiment Category,Sentiment Score
14112,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018","Forget paying app, designed make fail payments...",Negative,-0.500000
14113,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018","It's working expected, talking best bank Mexic...",Positive,0.400000
14114,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018",It has many problems with Android 8.1. You can...,Positive,0.250000
14115,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018","I changed my phone to a Xiaomi Redmi Note 5, t...",Positive,0.175000
14116,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018",In her eagerness to make her look pretty with ...,Negative,-0.158333
14117,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018",I can not activate my mobile Netkey because it...,Positive,0.116667
14118,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018",The mobile netkey does not work on Android 8.1...,Neutral,0.000000
14119,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018",The new update is frozen on the home screen wi...,Positive,0.168182
14120,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018","Almost everything works well, only that the tr...",Neutral,0.000000
14121,Citibanamex Movil,FINANCE,3.6,52306,42.0,5000000,Free,0.0,"July 27, 2018","In my case after the update stopped working, t...",Positive,0.243750


In [33]:
top_10_user_feedback = finance_scores_table.groupby('App')['Sentiment Score'].agg('mean')

In [36]:
top_10_user_feedback.shape

(46,)

In [32]:
# Average sentiment score per free finance app, highest first.
finance_scores_table.groupby('App')['Sentiment Score'].agg('mean').sort_values(ascending = False)

In [246]:
# Keep the 10 free finance apps with the highest average sentiment score.
top_10_user_feedback = pd.DataFrame(finance_scores_table.groupby('App')['Sentiment Score'].agg('mean').sort_values(ascending=False)[:10])

In [None]:
top_10_user_feedback
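
As a compact sketch of the ranking step on toy data (names and scores are made up), `nlargest` can replace the sort-then-slice pattern, and `reset_index()` turns the app names back into a regular column:

```python
import pandas as pd

toy = pd.DataFrame({
    'App': ['A', 'A', 'B', 'B', 'C'],
    'Sentiment Score': [0.4, None, 0.1, 0.3, -0.5],
})

# mean() skips NaN scores by default, so reviews without a score
# do not affect an app's average.
avg_score = toy.groupby('App')['Sentiment Score'].mean()

# nlargest(n) sorts in descending order and keeps the first n entries.
top_2 = avg_score.nlargest(2).reset_index()
print(top_2)
```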