# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
#Import your libraries
import numpy as np
import pandas as pd

# Introduction

In this lab, we will use two datasets. Both datasets contain variables that describe apps from the Google Play Store. We will use our knowledge of feature extraction to process these datasets and prepare them for use with an ML algorithm.

## Data Sources
- googleplaystore.csv
- googleplaystore_user_reviews.csv

## Changes
- 11-09-2021 Updated the project
- 13-03-2021 Solved Challenge 4

# Challenge 1 - Loading and Extracting Features from the First Dataset

#### In this challenge, our goals are: 
* Explore the dataset.
* Identify the columns with missing values.
* Either replace the missing values in each column or drop the columns.
* Convert each column to the appropriate type.

#### The first dataset contains different information describing the apps. 

Load the dataset into the variable `google_play` in the cell below. The dataset is in the file `googleplaystore.csv`

In [2]:
# Your code here:
google_play = pd.read_csv("C:/Users/digit/Desktop/Ironhack/lab_work/lab-supervised-learning-feature-extraction/data//googleplaystore.csv")
google_play.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [3]:
# assign google_play to gp
gp = google_play
gp.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [4]:
gp.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

In [5]:
# clean the column names and convert into lowercase
gp.columns = gp.columns.str.lower()
gp.columns

Index(['app', 'category', 'rating', 'reviews', 'size', 'installs', 'type',
       'price', 'content rating', 'genres', 'last updated', 'current ver',
       'android ver'],
      dtype='object')

In [6]:
# replace space with an underscore
gp.columns = gp.columns.str.replace(" ","_")
gp.columns

Index(['app', 'category', 'rating', 'reviews', 'size', 'installs', 'type',
       'price', 'content_rating', 'genres', 'last_updated', 'current_ver',
       'android_ver'],
      dtype='object')

#### Examine all variables and their types in the following cell

In [7]:
# Your code here:
gp.info()
# only "rating" is numeric; all other columns have dtype object
# CTA (call to action): convert certain columns to the right dtypes (later)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app             10841 non-null  object 
 1   category        10841 non-null  object 
 2   rating          9367 non-null   float64
 3   reviews         10841 non-null  object 
 4   size            10841 non-null  object 
 5   installs        10841 non-null  object 
 6   type            10840 non-null  object 
 7   price           10841 non-null  object 
 8   content_rating  10840 non-null  object 
 9   genres          10841 non-null  object 
 10  last_updated    10841 non-null  object 
 11  current_ver     10833 non-null  object 
 12  android_ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


#### Since this dataset only contains one numeric column, let's skip the `describe()` function and look at the first 5 rows using the `head()` function

In [8]:
# Your code here:
gp.head(5)

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


#### We can see that there are a few columns that could be coerced to numeric.

Start with the reviews column. We can find out which value is causing this column to be of object type by finding the non-numeric values in it. To do this, we recall the `to_numeric()` function. With this function, we are able to coerce all non-numeric data to null. We can then use the `isnull()` function to subset our dataframe using the True/False column that this function generates.

In the cell below, transform the Reviews column to numeric and assign this new column to the variable `Reviews_numeric`. Make sure to coerce the errors.

In [9]:
# Your code here:
# "coerced" means forced (done under pressure)
# "subset" means a subgroup of the data
# convert the values in "reviews" to numeric and assign them to "reviews_numeric"
# with errors='coerce', values that cannot be parsed become null (NaN)
reviews = gp['reviews']
reviews_numeric = pd.to_numeric(reviews, errors= 'coerce')
reviews_numeric

0           159.0
1           967.0
2         87510.0
3        215644.0
4           967.0
           ...   
10836        38.0
10837         4.0
10838         3.0
10839       114.0
10840    398307.0
Name: reviews, Length: 10841, dtype: float64
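The coercion step can be sketched on a small made-up Series. The name `toy_reviews` and its values are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the `reviews` column (values invented for illustration).
toy_reviews = pd.Series(["159", "967", "3.0M", "87510"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
toy_numeric = pd.to_numeric(toy_reviews, errors="coerce")

# Count how many entries were coerced to NaN.
n_coerced = int(toy_numeric.isnull().sum())
print(toy_numeric)
print(n_coerced)
```

Because NaN is a float, the resulting Series has dtype `float64` even when every parsed value is a whole number.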

Next, create a column containing True/False values using the `isnull()` function. Assign this column to the `Reviews_isnull` variable.

In [10]:
# Your code here:
gp["reviews_isnull"] = reviews_numeric.isnull()
gp.head()

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver,reviews_isnull
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,False
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,False
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up,False
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up,False
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up,False


Finally, subset `google_play` with `Reviews_isnull`. This should give you all the rows that contain non-numeric characters.

Your output should look like:

![Reviews_bool.png](../images/reviews-bool.png)

In [11]:
# Your code here:
# the most straightforward approach for me is this:
gp.isnull().sample(1)

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver,reviews_isnull
3454,False,False,False,False,False,False,False,False,False,False,False,False,False,False
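Note that `gp.isnull().sample(1)` returns one random row of a True/False frame rather than the rows flagged by `reviews_isnull`. A minimal sketch of the subsetting the instructions describe, using a made-up frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame standing in for the Google Play data (values invented).
toy = pd.DataFrame({
    "app": ["A", "B", "C"],
    "reviews": ["159", "3.0M", "967"],
})

# Flag the rows whose `reviews` value cannot be parsed as a number.
toy["reviews_isnull"] = pd.to_numeric(toy["reviews"], errors="coerce").isnull()

# A boolean column can be used directly as a row mask.
non_numeric_rows = toy[toy["reviews_isnull"]]
print(non_numeric_rows)
```

Applied to the real data, the equivalent expression would presumably be `gp[gp["reviews_isnull"]]`.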


#### We see that Google Play is using a shorthand for millions. 

Let's write a function to transform this data.

Steps:

1. Create a function that returns the correct numeric values of *Reviews*.
1. Define a test string with `M` as the last character.
1. Test your function with the test string. Make sure your function works correctly. If not, modify your function and test again.

In [12]:
# Your code here
def convert_string_to_numeric(s):
    """
    Convert a string value to numeric. If the last character of the string is `M`, obtain the 
    numeric part of the string, multiply it with 1,000,000, then return the result. Otherwise, 
    convert the string to numeric value and return the result.
    
    Args:
        s: The Reviews score in string format.

    Returns:
        The correct numeric value of the Reviews score.
    """
    if s[-1] == "M":
        s = s.replace("M", "") 
        return float(s)*1000000
    else:
        return float(s)
test_string = '4.0M'

convert_string_to_numeric(test_string) == 4000000

True

The last step is to apply the function to the `Reviews` column in the following cell:

In [13]:
# Your code here:
gp['reviews'].apply(convert_string_to_numeric)

0           159.0
1           967.0
2         87510.0
3        215644.0
4           967.0
           ...   
10836        38.0
10837         4.0
10838         3.0
10839       114.0
10840    398307.0
Name: reviews, Length: 10841, dtype: float64

Check the non-numeric `Reviews` row again. It should have been fixed now and you should see:

![Reviews_bool_fixed.png](../images/reviews-bool-fixed.png)

In [14]:
gp.sample()

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver,reviews_isnull
5748,Digital Clock AW-7,TOOLS,4.1,484,2.9M,"100,000+",Free,0,Everyone,Tools,"March 6, 2018",2.0,4.3 and up,False


In [15]:
# Your code here
gp['reviews'] = gp['reviews'].apply(convert_string_to_numeric)
gp['reviews']

0           159.0
1           967.0
2         87510.0
3        215644.0
4           967.0
           ...   
10836        38.0
10837         4.0
10838         3.0
10839       114.0
10840    398307.0
Name: reviews, Length: 10841, dtype: float64

Also check the variable types of `google_play`. The `Reviews` column should be a `float64` type now.

In [16]:
# Your code here:
gp['reviews']

0           159.0
1           967.0
2         87510.0
3        215644.0
4           967.0
           ...   
10836        38.0
10837         4.0
10838         3.0
10839       114.0
10840    398307.0
Name: reviews, Length: 10841, dtype: float64

#### The next column we will look at is `Size`. We start by looking at all unique values in `Size`:

*Hint: use `unique()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html))*.

In [17]:
# Your code here:
gp['size'].unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
       '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
       '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
       '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
       '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
       '4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
       '23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
       '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
       '5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
     

You should have seen lots of unique values of the app sizes.

#### While we can convert most of the `Size` values to numeric in the same way we converted the `Reviews` values, there is one value that is impossible to convert.

What is that badass value? Enter it in the next cell and calculate the proportion of its occurrence to the total number of records of `google_play`.

In [18]:
# Your code here:
pd.to_numeric(gp['size'], errors='raise')
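# to_numeric raises on the first string it cannot parse, here "19M" at position 0
# note: "19M" can still be converted by handling the "M" suffix, as we did for reviews;
# the value that has no numeric meaning at all is "Varies with device"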

ValueError: Unable to parse string "19M" at position 0
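The error only names the first string that fails, so it does not tell us which values are truly unconvertible. One way to find out is to handle the `k` and `M` suffixes first and see what is left. The helper `parse_size` below is hypothetical, and the multipliers (1,000 for `k`, 1,000,000 for `M`) are an assumption about what the suffixes mean.

```python
import numpy as np
import pandas as pd

def parse_size(s):
    """Hypothetical helper: convert strings such as '19M' or '201k' to a
    number, and return NaN for anything that cannot be parsed."""
    multipliers = {"k": 1_000, "M": 1_000_000}  # assumed meaning of the suffixes
    suffix = s[-1]
    try:
        if suffix in multipliers:
            return float(s[:-1]) * multipliers[suffix]
        return float(s)
    except ValueError:
        return np.nan

# Toy values picked from the unique() output above.
toy_sizes = pd.Series(["19M", "201k", "8.7M", "Varies with device"])
parsed = toy_sizes.apply(parse_size)

# Whatever is still NaN after suffix handling cannot be converted.
unconvertible = toy_sizes[parsed.isna()].unique().tolist()
print(unconvertible)
```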

In [19]:
# string contains
gp[gp['size'].str.contains("19M")]
# this query returns the rows whose size contains the string "19M"
# there are 154 such rows

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver,reviews_isnull
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,False
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178.0,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up,False
94,"Used car is the first car - used car purchase,...",AUTO_AND_VEHICLES,4.6,5097.0,19M,"1,000,000+",Free,0,Everyone,Auto & Vehicles,"July 23, 2018",1.5.18,4.0.3 and up,False
159,Cloud of Books,BOOKS_AND_REFERENCE,3.3,1862.0,19M,"1,000,000+",Free,0,Everyone,Books & Reference,"April 27, 2018",2.2.5,4.1 and up,False
167,English to Urdu Dictionary,BOOKS_AND_REFERENCE,4.6,4620.0,19M,"500,000+",Free,0,Everyone,Books & Reference,"November 23, 2017",2.0,4.0.3 and up,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10646,Podcast App: Free & Offline Podcasts by Player FM,NEWS_AND_MAGAZINES,4.6,66407.0,19M,"1,000,000+",Free,0,Teen,News & Magazines,"July 25, 2018",4.1.0.72,4.0 and up,False
10651,FN pistol model 1903 explained,BOOKS_AND_REFERENCE,,1.0,19M,10+,Paid,$6.49,Everyone,Books & Reference,"September 5, 2015",Android 3.0 - 2015,1.6 and up,False
10699,FO BOULANGER,FINANCE,,10.0,19M,50+,Free,0,Everyone,Finance,"May 15, 2018",1.0,4.4 and up,False
10801,Fr Ignacio Outreach,FAMILY,4.9,52.0,19M,"1,000+",Free,0,Everyone,Education,"January 19, 2018",1.0,4.4 and up,False


In [20]:
gp.info()
# 10841 entries

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app             10841 non-null  object 
 1   category        10841 non-null  object 
 2   rating          9367 non-null   float64
 3   reviews         10841 non-null  float64
 4   size            10841 non-null  object 
 5   installs        10841 non-null  object 
 6   type            10840 non-null  object 
 7   price           10841 non-null  object 
 8   content_rating  10840 non-null  object 
 9   genres          10841 non-null  object 
 10  last_updated    10841 non-null  object 
 11  current_ver     10833 non-null  object 
 12  android_ver     10838 non-null  object 
 13  reviews_isnull  10841 non-null  bool   
dtypes: bool(1), float64(2), object(11)
memory usage: 1.1+ MB


In [21]:
# calculate the proportion of its occurrence to the total number of records of google_play (gp)
testsizeprop = 154 / 10841
testsizeprop
# the proportion of its occurrence to the total number of records is 
# 0.014205

0.014205331611474956
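The same kind of proportion can be computed without hard-coding the counts, and for any value of interest, such as `Varies with device`, which appears in the `unique()` output above. A sketch on a made-up Series (the values and the resulting share are illustrative only, not the real figures):

```python
import pandas as pd

# Toy stand-in for the `size` column (values invented for illustration).
toy_sizes = pd.Series(["19M", "Varies with device", "201k", "Varies with device"])

# The mean of a True/False mask is the fraction of True values.
share = (toy_sizes == "Varies with device").mean()
print(share)
```

On the real data, the analogous expression would be `(gp["size"] == "Varies with device").mean()`, evaluated before the column is dropped.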

#### While this column may be useful for other types of analysis, we opt to drop it from our dataset. 

There are two reasons. First, the majority of the data are ordinal, but a sizeable proportion are missing because we cannot convert them to numerical values. Ordinal data are both numerical and categorical, and they can usually be ranked (e.g. 82k is smaller than 91M). In contrast, non-ordinal categorical data such as blood type and eye color cannot be ranked. Second, as a categorical column, it has too many unique values to produce meaningful insights. Therefore, in our case, the simplest strategy is to drop the column.

Drop the column in the cell below (use `inplace=True`)

In [22]:
# Your code here:
gp.drop(columns=['size'], inplace = True)
# note: re-running this cell raises an error, because the first run
# already dropped the column, so 'size' can no longer be found on axis=1

In [23]:
# here we can check that the 'size' column has been dropped
gp.columns

Index(['app', 'category', 'rating', 'reviews', 'installs', 'type', 'price',
       'content_rating', 'genres', 'last_updated', 'current_ver',
       'android_ver', 'reviews_isnull'],
      dtype='object')

#### Now let's look at how many missing values are in each column. 

This will give us an idea of whether we should come up with a missing data strategy or give up on the column altogether. In the next cell, find the number of missing values in each column: 

*Hint: use the `isna()` and `sum()` functions.*

In [24]:
# Your code here:
gp.isna().sum()
# "rating" has 1474 missing values
# "type" has 1, "content_rating" 1, 
# "current_ver" 8, "android_ver" 3

app                  0
category             0
rating            1474
reviews              0
installs             0
type                 1
price                0
content_rating       1
genres               0
last_updated         0
current_ver          8
android_ver          3
reviews_isnull       0
dtype: int64

You should find the column with the most missing values is now `Rating`.

#### What is the proportion of the missing values in `Rating` to the total number of records?

Enter your answer in the cell below.

In [25]:
# Your code here:
gp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app             10841 non-null  object 
 1   category        10841 non-null  object 
 2   rating          9367 non-null   float64
 3   reviews         10841 non-null  float64
 4   installs        10841 non-null  object 
 5   type            10840 non-null  object 
 6   price           10841 non-null  object 
 7   content_rating  10840 non-null  object 
 8   genres          10841 non-null  object 
 9   last_updated    10841 non-null  object 
 10  current_ver     10833 non-null  object 
 11  android_ver     10838 non-null  object 
 12  reviews_isnull  10841 non-null  bool   
dtypes: bool(1), float64(2), object(10)
memory usage: 1.0+ MB


In [26]:
testratingprop = 9367 / 10841
testratingprop
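# 0.864 is the proportion of non-missing ratings (9367 of 10841)
# the proportion of missing ratings is 1 - 0.864 = 0.136 (1474 of 10841)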

0.8640346831473111
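The missing proportion can also be read off directly, since the mean of a True/False mask is the fraction of True values. A minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries (values invented).
toy = pd.DataFrame({
    "rating": [4.1, np.nan, 3.9, np.nan],
    "type": ["Free", "Paid", None, "Free"],
})

# isna() gives a True/False frame; mean() per column is the share of missing values.
missing_share = toy.isna().mean()
print(missing_share)
```

For the lab data, `gp["rating"].isna().mean()` would give the proportion of missing ratings in one step.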

A sizeable proportion of the `Rating` column is missing. A few other columns also contain several missing values.

#### We opt to preserve these columns and remove the rows containing missing data.

In particular, we don't want to drop the `Rating` column because:

* It is one of the most important columns in our dataset. 

* Since the dataset is not a time series, the loss of these rows will not have a negative impact on our ability to analyze the data. It will, however, cause us to lose some meaningful observations. But the loss is limited compared to the gain we receive by preserving these columns.

In the cell below, remove all rows containing at least one missing value. Use the `dropna()` function ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)). Assign the new dataframe to the variable `google_missing_removed`.

In [27]:
# Your code here:
google_missing_removed = gp.dropna()

From now on, we use the `google_missing_removed` variable instead of `google_play`.

#### Next, we look at the `Last Updated` column.

The `Last Updated` column seems to contain a date, though it is classified as an object type. Let's convert this column using the `pd.to_datetime` function ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)).

In [28]:
# Your code here:
google_missing_removed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9360 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app             9360 non-null   object 
 1   category        9360 non-null   object 
 2   rating          9360 non-null   float64
 3   reviews         9360 non-null   float64
 4   installs        9360 non-null   object 
 5   type            9360 non-null   object 
 6   price           9360 non-null   object 
 7   content_rating  9360 non-null   object 
 8   genres          9360 non-null   object 
 9   last_updated    9360 non-null   object 
 10  current_ver     9360 non-null   object 
 11  android_ver     9360 non-null   object 
 12  reviews_isnull  9360 non-null   bool   
dtypes: bool(1), float64(2), object(10)
memory usage: 959.8+ KB


In [29]:
google_missing_removed["last_updated"] = pd.to_datetime(google_missing_removed["last_updated"])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_missing_removed["last_updated"] = pd.to_datetime(google_missing_removed["last_updated"])
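This warning commonly appears when a column is assigned on a frame that pandas regards as derived from another frame, as can happen with the result of `dropna()`. One widely used way to avoid it is to take an explicit copy before modifying anything. A sketch on a made-up frame (`toy` and `cleaned` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row (values invented).
toy = pd.DataFrame({
    "rating": [4.1, np.nan],
    "last_updated": ["January 7, 2018", "June 8, 2018"],
})

# .copy() makes the cleaned frame independent of the original,
# so later column assignments are unambiguous.
cleaned = toy.dropna().copy()
cleaned["last_updated"] = pd.to_datetime(cleaned["last_updated"])
print(cleaned.dtypes)
```

In the lab, the corresponding change would be `google_missing_removed = gp.dropna().copy()`.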


#### The last column we will transform is `Price`. 

We start by looking at the unique values of this column.

In [30]:
# Your code here:
# remember the code ".to_frame()"
google_missing_removed["price"].value_counts().to_frame()

Unnamed: 0,price
0,8715
$2.99,114
$0.99,106
$4.99,70
$1.99,59
...,...
$3.88,1
$379.99,1
$4.84,1
$4.60,1


Since all prices are ordinal data without exceptions, we can transform this column by removing the dollar sign and converting to numeric. We can create a new column called `Price Numerical` and drop the original column.

We will achieve our goal in three steps. Follow the instructions of each step below.

#### First we remove the dollar sign. Do this in the next cell by applying the `str.replace` function to the column to replace `$` with an empty string (`''`).

In [31]:
# Your code here:
# replace "$" with an empty string; regex=False treats "$" as a literal character
google_missing_removed["price"] = google_missing_removed["price"].str.replace("$", "", regex=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_missing_removed["price"] = google_missing_removed["price"].str.replace("$", "", regex=False)


In [32]:
google_missing_removed.sample()

Unnamed: 0,app,category,rating,reviews,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver,reviews_isnull
4316,Anna.K Tarot,FAMILY,4.8,17.0,100+,Paid,3.99,Mature 17+,Entertainment,2017-01-16,1.4.4,4.0.3 and up,False


#### Second step, coerce the `Price Numerical` column to numeric.

In [33]:
# Your code here:
google_missing_removed["price_numerical"] = google_missing_removed["price"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_missing_removed["price_numerical"] = google_missing_removed["price"]


In [37]:
google_missing_removed["price_numerical"] = google_missing_removed["price"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_missing_removed["price_numerical"] = google_missing_removed["price"]


In [46]:
google_missing_removed["price_numerical"] = google_missing_removed["price_numerical"].astype(str).astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_missing_removed["price_numerical"] = google_missing_removed["price_numerical"].astype(str).astype(float)
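The price steps can be condensed into a single expression. The sketch below uses a made-up Series with values taken from the `value_counts()` output above. Passing `regex=False` is a deliberate choice here, because `$` is a special character in regular expressions.

```python
import pandas as pd

# Toy stand-in for the `price` column.
toy_price = pd.Series(["0", "$2.99", "$379.99"])

# Strip the literal dollar sign, then convert the remaining text to numbers.
price_numerical = pd.to_numeric(
    toy_price.str.replace("$", "", regex=False),
    errors="coerce",
)
print(price_numerical)
```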


**Finally, drop the original `Price` column.**

In [40]:
# Your code here:
google_missing_removed.drop(columns=["price"], inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Now check the variable types of `google_missing_removed`. Make sure:

* `Size` and `Price` columns have been removed.
* `Rating`, `Reviews`, and `Price Numerical` have the type of `float64`.
* `Last Updated` has the type of `datetime64`.

In [51]:
# Your code here
google_missing_removed.info()
# size and price are removed
# rating, reviews and price_numerical have the type of float64
# last_updated is datetime64

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9360 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   app              9360 non-null   object        
 1   category         9360 non-null   object        
 2   rating           9360 non-null   float64       
 3   reviews          9360 non-null   float64       
 4   installs         9360 non-null   object        
 5   type             9360 non-null   object        
 6   content_rating   9360 non-null   object        
 7   genres           9360 non-null   object        
 8   last_updated     9360 non-null   datetime64[ns]
 9   current_ver      9360 non-null   object        
 10  android_ver      9360 non-null   object        
 11  reviews_isnull   9360 non-null   bool          
 12  price_numerical  9360 non-null   float64       
dtypes: bool(1), datetime64[ns](1), float64(3), object(8)
memory usage: 1.2+ MB


# Challenge 2 - Loading and Extracting Features from the Second Dataset

Load the second dataset to the variable `google_reviews`. The data is in the file `googleplaystore_user_reviews.csv`.

In [53]:
# Your code here:
google_reviews = pd.read_csv("C:/Users/digit/Desktop/Ironhack/lab_work/lab-supervised-learning-feature-extraction/data//googleplaystore_user_reviews.csv")

#### This dataset contains the top 100 reviews for each app. 

Let's examine this dataset using the `head` function.

In [69]:
# Your code here
gw = google_reviews
gw.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3
5,10 Best Foods for You,Best way,Positive,1.0,0.3


#### The main piece of information we would like to extract from this dataset is the proportion of positive reviews of each app. 

Columns like `Sentiment_Polarity` and `Sentiment_Subjectivity` are not of interest to us because we do not yet know how to use them. We do not care about `Translated_Review` because natural language processing is too complex for us at present (in fact, the `Sentiment`, `Sentiment_Polarity`, and `Sentiment_Subjectivity` columns were derived from `Translated_Review` by the data scientists). 

What we care about in this challenge is `Sentiment`. To be more precise, we care about **what is the proportion of *Positive* sentiment of each app**. This will require us to aggregate the `Sentiment` data by `App` in order to calculate the proportions.

Now that you are clear about what we are trying to achieve, follow the steps below that will walk you through towards our goal.

#### Our first step will be to remove all rows with missing sentiment. 

In the next cell, drop all rows with missing data using the `dropna()` function and assign this new dataframe to `review_missing_removed`.

In [71]:
# Your code here:
gw.isnull().sum()

App                       0
Translated_Review         0
Sentiment                 0
Sentiment_Polarity        0
Sentiment_Subjectivity    0
dtype: int64

In [74]:
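# dropna() with the default how="any" drops every row that has at least one missing value
# (how="all" would only drop rows in which every column is missing)
gw = gw.dropna()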
review_missing_removed = gw
review_missing_removed.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3
5,10 Best Foods for You,Best way,Positive,1.0,0.3


#### Now, use the `value_counts()` function ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)) to get a sense of how many apps are in this dataset and their review counts.

In [77]:
# Your code here:
rmr = review_missing_removed
rmr.value_counts().to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,0
App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity,Unnamed: 5_level_1
"FastMeet: Chat, Dating, Love",Good,Positive,0.700000,0.600000,8
"BestCam Selfie-selfie, beauty camera, photo editor",Good,Positive,0.700000,0.600000,7
Bubble Shooter,Good,Positive,0.700000,0.600000,7
Candy Crush Saga,I love game TOO many pop ups. I want open game play. But 7 notices first every time open game. Once daily enough. I level 2475 NEVER hit jackpot what's purpose wheel. What perks committed player can't hit jackpot there. The move suggestion Too fast.,Positive,0.022727,0.430303,6
Duolingo: Learn Languages Free,"Duolingo deserves place number 1 education apps. The problem really teach much pronounciation, rules exceptions words 'The'. Duolingo, could please add in, I think would possible contest position number 1. Also, I think would awesome really helpful could choose accent pronounciation words could in. Please consider request, and, need money this, I'm sure everyone would ok amount ads increased slightly whilst work this. Thanks duo, saved sanity whilst trying learn German",Positive,0.263333,0.435556,6
...,...,...,...,...,...
Calorie Counter & Diet Tracker,I've using weeks now. I My Fitness Pal user years. But I found I like now. I accurate serving counts log food whereas MFP seemingly lost feature decimals. It's interactive social. I gave 4 stars instead 5 food database comprehensive accurate I'd like.,Positive,0.277778,0.444444,1
Calorie Counter & Diet Tracker,"I've using days like far. Easy navigate scanner really handy. Would like take supplements consideration daily nutrition totals daily tally number fruit, vegetable, meat, grain, dairy, fat servings would help diets require track food groups, i.e. DASH. Something similar water tracker. But overall, I like app.",Positive,0.161905,0.447619,1
Calorie Counter & Diet Tracker,"I've used Spark People web-based tool, back 2012 & 2013 gotten inconvenient foods still needed manually entered. With works phone, I've found food database become enormous; I enter single food, even brand names available!",Negative,-0.034286,0.502857,1
Calorie Counter & Diet Tracker,I've tried several good ones best best! It's actually fun notifying things I committed gives much needed push.,Positive,0.533333,0.266667,1


#### Now the tough part comes. Let's plan how we will achieve our goal:

1. We will count the number of reviews that contain *Positive* in the `Sentiment` column.

1. We will create a new dataframe to contain the `App` name, the number of positive reviews, and the total number of reviews of each app.

1. We will then loop over the new dataframe to calculate the proportion of positive reviews for each app.

#### Step 1: Count the number of positive reviews.

In the following cell, write a function that takes a column and returns the number of times *Positive* appears in the column. 

*Hint: One option is to use the `np.where()` function ([documentation](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html)).*

In [78]:
# Your code below

def positive_function(x):
    """
    Count how many times the string `Positive` appears in a column (exact string match).
    
    Args:
        x: data column
    
    Returns:
        The number of occurrences of `Positive` in the column data.
    """
    # np.where() marks exact matches as True; summing the booleans counts them
    return np.where(x == 'Positive', True, False).sum()

positive_function(rmr['Sentiment'])

23998
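As a quick sanity check, the same counting logic can be tried on a tiny hand-made column. This is only a sketch: `toy_sentiment` is a hypothetical Series made up for illustration and is not part of the lab data. It compares the `np.where()` approach with summing the boolean mask directly, and shows that the match is case-sensitive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, only for illustration (not from the lab data)
toy_sentiment = pd.Series(["Positive", "Negative", "Positive", "Neutral", "positive"])

# np.where() maps each element to 1 or 0; summing gives the count
count_where = int(np.where(toy_sentiment == "Positive", 1, 0).sum())

# The boolean mask can also be summed directly, since True counts as 1
count_mask = int((toy_sentiment == "Positive").sum())

# The lowercase "positive" is not counted because the comparison is exact
print(count_where, count_mask)  # 2 2
```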

#### Step 2: Create a new dataframe to contain the `App` name, the number of positive reviews, and the total number of reviews of each app

We will group `review_missing_removed` by the `App` column, then aggregate the grouped dataframe to get the number of positive reviews and the total number of reviews for each app. The result will be assigned to a new variable `google_agg`. See the [documentation on how to achieve it](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.core.groupby.DataFrameGroupBy.agg.html). Take a moment to read the documentation and search for examples, because `agg()` is fairly complex.

When you obtain `google_agg`, check its values to make sure it has an `App` column as its index as well as a `Positive` column and a `Total` column. Your output should look like:

![Positive Reviews Agg](../images/positive-review-agg.png)

*Hint: Use the `positive_function` you created earlier as part of the argument passed to `agg()` in order to count the positive reviews for each app.*

#### Bonus:

As of Pandas v0.23.4, you may supply either an array (a Python list) or an object (a dict) to `agg()`. If you pass a list, you will need to rename the resulting columns to `Positive` and `Total`. Passing a dict lets you create the aggregated columns with the desired names without renaming them. However, you will probably see a warning that passing a dict to `agg()` this way is deprecated. Try both approaches; either is fine as long as it works.
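The two options can be compared on a small example. The sketch below rests on a few assumptions: `toy` and `count_positive` are hypothetical names invented for illustration, and the second option uses named aggregation (keyword arguments to `agg()`), which is available in pandas 0.25 and later rather than in v0.23.4.

```python
import pandas as pd

# Hypothetical toy reviews, only for illustration (not from the lab data)
toy = pd.DataFrame({
    "App": ["A", "A", "A", "B", "B"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral", "Positive"],
})

def count_positive(col):
    """Number of exact 'Positive' matches in a column."""
    return int((col == "Positive").sum())

grouped = toy.groupby("App")["Sentiment"]

# Option 1: pass a list, then rename the resulting columns.
# A named function produces a column labelled with the function's name.
agg_list = grouped.agg([count_positive, "count"]).rename(
    columns={"count_positive": "Positive", "count": "Total"}
)

# Option 2: named aggregation (keyword = output column name), pandas >= 0.25
agg_named = grouped.agg(Positive=count_positive, Total="count")

print(agg_list)
print(agg_named)
```

Both results are indexed by `App` and carry a `Positive` and a `Total` column, so either can feed the ratio calculation that follows.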

In [81]:
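# Your code here:
# Per app: count the 'Positive' sentiments and the total reviews, then rename the columns.
# The result is stored as `g_agg` (referred to as `google_agg` in the text).
g_agg = review_missing_removed.groupby('App')['Sentiment'].agg([lambda x: positive_function(x), 'count']).rename({'<lambda_0>': 'Positive', 'count': 'Total'}, axis=1)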


Print the first 5 rows of `google_agg` to check it.

In [82]:
# Your code here
g_agg.head()

Unnamed: 0_level_0,Positive,Total
App,Unnamed: 1_level_1,Unnamed: 2_level_1
10 Best Foods for You,162,194
104 找工作 - 找工作 找打工 找兼職 履歷健檢 履歷診療室,31,40
11st,23,39
1800 Contacts - Lens Store,64,80
1LINE – One Line with One Touch,27,38


#### Add a derived column to `google_agg` that is the ratio of the `Positive` column to the `Total` column. Call this column `Positive Ratio`.

Make sure to account for the case where the denominator is zero using the `np.where()` function.

In [83]:
# Your code here:
# Ratio of positive to total reviews; np.where() returns 0 wherever Total is 0
g_agg['positive_ratio'] = np.where(g_agg['Total'] > 0, g_agg['Positive'] / g_agg['Total'], 0)
g_agg.head()

Unnamed: 0_level_0,Positive,Total,positive_ratio
App,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10 Best Foods for You,162,194,0.835052
104 找工作 - 找工作 找打工 找兼職 履歷健檢 履歷診療室,31,40,0.775
11st,23,39,0.589744
1800 Contacts - Lens Store,64,80,0.8
1LINE – One Line with One Touch,27,38,0.710526
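None of the rows shown has a zero total, so the effect of the guard is not visible in this output. The sketch below illustrates it on a hypothetical `toy_agg` dataframe (invented for illustration) that includes an app with zero reviews. It assumes the usual pandas behavior that dividing a Series by zero yields `NaN` or `inf` rather than raising an error, which is why `np.where()` is used to substitute 0.

```python
import numpy as np
import pandas as pd

# Hypothetical aggregated counts, only for illustration; app "C" has no reviews
toy_agg = pd.DataFrame(
    {"Positive": [3, 0, 0], "Total": [4, 2, 0]},
    index=pd.Index(["A", "B", "C"], name="App"),
)

# Where Total > 0 take the ratio, otherwise fall back to 0.
# The division is still evaluated for every row (0 / 0 gives NaN for "C"),
# but np.where() discards that value and picks 0 instead.
toy_agg["positive_ratio"] = np.where(
    toy_agg["Total"] > 0, toy_agg["Positive"] / toy_agg["Total"], 0
)

print(toy_agg)
```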


#### Now drop the `Positive` and `Total` columns. Do this with `inplace=True`.

In [84]:
# Your code here:
g_agg.drop(columns = ['Positive','Total'],inplace=True)

Print the first 5 rows of `google_agg`. Your output should look like:

![Positive Reviews Agg](../images/positive-review-ratio.png)

In [85]:
# Your code here:
g_agg.head()

Unnamed: 0_level_0,positive_ratio
App,Unnamed: 1_level_1
10 Best Foods for You,0.835052
104 找工作 - 找工作 找打工 找兼職 履歷健檢 履歷診療室,0.775
11st,0.589744
1800 Contacts - Lens Store,0.8
1LINE – One Line with One Touch,0.710526


# Challenge 3 - Join the Dataframes

In this part of the lab, we will join the two dataframes and obtain a dataframe that contains features we can use in our ML algorithm.

In the next cell, join the `google_missing_removed` dataframe with the `google_agg` dataframe on the `App` column. Assign this dataframe to the variable `google`.

In [87]:
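# Your code here:
# `g_agg` is indexed by app name, so match the `app` column against that index;
# the inner join keeps only the apps that also have reviews
google = google_missing_removed.join(g_agg, how='inner', on='app')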
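To see how a column-to-index join behaves, here is a minimal sketch on made-up data. The names `left` and `right` are hypothetical stand-ins for the app dataframe and the aggregated ratios. It shows that `join(..., on=...)` matches a column of the left dataframe against the index of the right one, that `merge()` with `right_index=True` gives the same rows, and that an app listed twice on the left receives the same ratio in both rows.

```python
import pandas as pd

# Hypothetical stand-in for the app dataframe: "A" appears twice, "C" has no reviews
left = pd.DataFrame({"app": ["A", "A", "B", "C"], "rating": [4.1, 4.1, 3.9, 4.5]})

# Hypothetical stand-in for the aggregated ratios, indexed by app name
right = pd.DataFrame(
    {"positive_ratio": [0.8, 0.5]}, index=pd.Index(["A", "B"], name="App")
)

# join(): the `app` column of `left` is matched against the index of `right`
joined = left.join(right, on="app", how="inner")

# merge(): the same operation written with explicit key arguments
merged = left.merge(right, left_on="app", right_index=True, how="inner")

print(joined)
```

With `how="inner"`, app "C" is dropped because it has no entry in `right`; a left join would keep it with a missing ratio instead.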

#### Let's look at the final result using the `head()` function. Your final product should look like:

![Final Product](../images/google-final-head.png)

In [88]:
# Your code here:
google.head()

Unnamed: 0,app,category,rating,reviews,installs,type,content_rating,genres,last_updated,current_ver,android_ver,reviews_isnull,price_numerical,positive_ratio
1,Coloring book moana,ART_AND_DESIGN,3.9,967.0,"500,000+",Free,Everyone,Art & Design;Pretend Play,2018-01-15,2.0.0,4.0.3 and up,False,0.0,0.590909
2033,Coloring book moana,FAMILY,3.9,974.0,"500,000+",Free,Everyone,Art & Design;Pretend Play,2018-01-15,2.0.0,4.0.3 and up,False,0.0,0.590909
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791.0,"1,000,000+",Free,Everyone,Art & Design,2017-09-20,2.9.2,3.0 and up,False,0.0,0.711111
18,FlipaClip - Cartoon animation,ART_AND_DESIGN,4.3,194216.0,"5,000,000+",Free,Everyone,Art & Design,2018-08-03,2.2.5,4.0.3 and up,False,0.0,1.0
21,Boys Photo Editor - Six Pack & Men's Suit,ART_AND_DESIGN,4.1,654.0,"100,000+",Free,Everyone,Art & Design,2018-03-20,1.1,4.0.3 and up,False,0.0,0.605263
