# Numeric Features from Reviews
Imagine you are in the shoes of a company offering a variety of products. You want to know which of your products are bestsellers and, above all, why. We take the first step toward understanding product reviews, using a dataset of Amazon product reviews. To that end, we transform the text into a numeric form and consider a few complexities in the process.

## Bag-of-words
<video controls src="video/video2_1.mp4" width=720></video>

## Which statement about BOW is true?
You were introduced to a bag-of-words (BOW) and some of its characteristics in the video. Which of the following statements about BOW **is** true?

**Possible Answers**
+ Bag-of-words preserves the word order and grammar rules. (Does it really preserve grammar rules and word order?)
+ Bag-of-words describes the order and frequency of words or tokens within a corpus of documents. (It describes the frequency, but do you think it says anything about the word order?)
+ **Bag-of-words is a simple but effective method to build a vocabulary of all the words occurring in a document.**
+ Bag-of-words can only be applied to a large document, not to shorter documents or single sentences. (No, BOW is not limited to the size of a document. It works both on short and long documents.)

That's correct! Next, you'll see how to apply this idea to sentiment analysis.
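
To see why the statements about word order are false, here is a minimal sketch that counts words with Python's `collections.Counter` instead of `CountVectorizer()`. The helper `bag_of_words` and both sentences are made up for illustration; the point is only that two sentences with opposite meanings can produce identical counts.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


# Two hypothetical sentences with the same words in a different order
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a == b)    # True: the two bags are identical
print(a["the"])  # 2: frequencies are kept, positions are not
```

This is the trade-off to keep in mind: a BOW is simple to build, but anything that depends on word order is lost.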

## Your first BOW
### Exercise
A bag-of-words is an approach to transform text to numeric form.

In this exercise, you will apply a BOW to the `annak` list before moving on to a larger dataset in the next exercise.

Your task will be to work with this list and apply a BOW using the `CountVectorizer()`. This transformation is your first step in being able to understand the sentiment of a text. Pay attention to words which might carry a strong sentiment.

Remember that the output of a `CountVectorizer()` is a sparse matrix, which stores only entries which are non-zero. To look at the actual content of this matrix, we convert it to a dense array using the `.toarray()` method.
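
As a rough illustration of the sparse versus dense distinction, the sketch below builds a BOW from two made-up documents (`docs` is not part of the exercise data) and inspects the result before and after calling `.toarray()`.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only for this illustration
docs = ["good good product", "bad product"]

bow = CountVectorizer().fit_transform(docs)

print(sparse.issparse(bow))  # True: the result is a SciPy sparse matrix
print(bow.shape)             # (2, 3): two documents, three vocabulary terms
print(bow.nnz)               # 4: only the non-zero counts are stored

# Convert to a dense array to see every cell, zeros included
dense = bow.toarray()
print(dense)
```

The columns follow the alphabetically sorted vocabulary (`bad`, `good`, `product`), so the first row reads as zero times `bad`, twice `good`, once `product`.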

Note that in this case you don't need to specify the `max_features` argument because the text is short.

### Instructions
+ Import the count vectorizer function from `sklearn.feature_extraction.text`.
+ Build and fit the vectorizer on the small dataset.
+ Create the BOW representation with name `anna_bow` by calling the `transform()` method.
+ Print the BOW result as a dense array.

In [None]:
# Import the required function
from sklearn.feature_extraction.text import CountVectorizer

annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)

# Create the bow representation
anna_bow = anna_vect.transform(annak)

# Print the bag-of-words result 
print(anna_bow.toarray())

Great job! You have transformed the first sentence of *Anna Karenina* to an array counting the frequencies of each word. However, the output is not very readable, is it? We are still missing the names of the features. And does the approach change when we apply it to a larger dataset? We explore these problems in the next exercise.

## BOW using product reviews
### Exercise
You practiced a BOW on a small dataset. Now you will apply it to a sample of Amazon product reviews. The data has been imported for you and is called `reviews`. It contains two columns. The first one is called `score` and it is `0` when the review is negative, and `1` when it is positive. The second column is called `review` and it contains the text of the review that a customer wrote. Feel free to explore the data in the IPython Shell.

Your task is to build a BOW vocabulary, using the `review` column.

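Remember that we can call the `.get_feature_names()` method on the vectorizer to obtain a list of all the vocabulary elements. In scikit-learn 1.0 and later this method is called `.get_feature_names_out()`, and the old name was removed in version 1.2.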
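
The sketch below shows one way the feature names can be lined up with the columns of the BOW matrix. The two documents are made up, and the `hasattr` check is just one possible way to cope with different scikit-learn versions, not something the exercise requires.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only for this illustration
docs = ["great phone", "terrible phone case"]

vect = CountVectorizer()
X = vect.fit_transform(docs)

# Use the newer method name when it exists, otherwise fall back to the old one
if hasattr(vect, "get_feature_names_out"):
    names = list(vect.get_feature_names_out())
else:
    names = vect.get_feature_names()

# Each column of the DataFrame is now labelled with its vocabulary term
X_df = pd.DataFrame(X.toarray(), columns=names)
print(X_df)
```

With the names attached, a row can be read directly: the second document contains `case`, `phone` and `terrible` once each, and `great` not at all.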

### Instructions
+ Create a CountVectorizer object, specifying the maximum number of features.
+ Fit the vectorizer.
+ Transform the `review` column using the fitted vectorizer.
+ Create a DataFrame where you transform the sparse matrix to a dense array and make sure to correctly specify the names of columns.

In [None]:
import pandas as pd
reviews = pd.read_csv("amazon_reviews_sample.csv")

from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify max features 
vect = CountVectorizer(max_features=100)
# Fit the vectorizer
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

# Create the bow representation
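# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())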
print(X_df.head())

Well done! You have successfully built your first BOW vocabulary and used it to transform the reviews into numeric features!

## Getting granular with n-grams
<video controls src="video/video2_2.mp4" width=720></video>

## Specify token sequence length with BOW
### Exercise
We saw in the video that by specifying token sequences of different lengths (what we called n-grams) we can better capture the context, which can be very important.
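
As a small sketch of what "capturing context" can mean in practice, the example below fits one vectorizer with unigrams only and one with uni- and bigrams on two made-up documents. Only the second one keeps `not good` as a feature of its own.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only for this illustration
docs = ["not good at all", "good value"]

# Unigrams only versus unigrams and bigrams together
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))

# The bigram vocabulary keeps the negation together with the word it modifies
print("not good" in unigrams.vocabulary_)  # False
print("not good" in uni_bi.vocabulary_)    # True
```

The price is a larger vocabulary: five terms grow to nine here, which is one reason for limiting the vocabulary size on real data.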

In this exercise, you will work with a sample of the Amazon product reviews. Your task is to build a BOW vocabulary, using the `review` column and specify the sequence length of tokens.

### Instructions
+ Build the vectorizer, specifying the token sequence length to be uni- and bigrams.
+ Fit the vectorizer.
+ Transform the `review` column using the fitted vectorizer.
+ In the DataFrame, make sure to correctly specify the column names.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify token sequence and fit
vect = CountVectorizer(ngram_range=(1, 2))
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

Excellent work! You have built a numeric representation of the review column using uni- and bigrams!

## Size of vocabulary of movie reviews
### Exercise
In this exercise, you will practice different ways to limit the size of the vocabulary using a sample of the `movies` reviews dataset. The first column is the `review`, which is of type `object`, and the second column is the `label`, which is `0` for a negative review and `1` for a positive one.

The three methods that you will use will transform the text column to new numeric columns, capturing the count of a word or a phrase in each review. Each method will ultimately result in building a different number of new features.
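
Before trying this on the movie reviews, it may help to see the three arguments side by side on a tiny made-up corpus. The helper `vocab` and the four documents are hypothetical, and the thresholds are scaled down to fit four documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for this illustration
docs = [
    "good movie",
    "good plot",
    "good acting bad plot",
    "good plot bad movie",
]


def vocab(**kwargs):
    """Fit a CountVectorizer with the given arguments and return its sorted vocabulary."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)


print(vocab())                # ['acting', 'bad', 'good', 'movie', 'plot']
print(vocab(max_features=2))  # ['good', 'plot']: the two most frequent terms
print(vocab(max_df=2))        # ['acting', 'bad', 'movie']: terms in at most 2 documents
print(vocab(min_df=2))        # ['bad', 'good', 'movie', 'plot']: terms in at least 2 documents
```

Note that an integer passed to `max_df` or `min_df` is read as a number of documents, while a float is read as a proportion of documents.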

### Instructions
1. Using the `movies` dataset, limit the size of the vocabulary to `100`.
2. Using the `movies` dataset, limit the size of the vocabulary to include terms which occur in no more than 200 documents.
3. Using the `movies` dataset, limit the size of the vocabulary to ignore terms which occur in fewer than 50 documents.

In [None]:
movies = pd.read_csv("IMDB_sample.csv")

#---------------
# Instruction 1
#---------------

from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify size of vocabulary and fit
vect = CountVectorizer(max_features=100)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print("INSTRUCTION 1:", X_df.head())

#---------------
# Instruction 2
#---------------
from sklearn.feature_extraction.text import CountVectorizer 

# Build and fit the vectorizer
vect = CountVectorizer(max_df=200)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print("INSTRUCTION 2:", X_df.head())

#---------------
# Instruction 3
#---------------
from sklearn.feature_extraction.text import CountVectorizer 

# Build and fit the vectorizer
vect = CountVectorizer(min_df=50)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print("INSTRUCTION 3:", X_df.head())

Great job! Any of the three methods you applied here can be used to limit the size of the vocabulary. Which of the three methods you used resulted in the lowest number of constructed features?

## BOW with n-grams and vocabulary size
### Exercise
In this exercise, you will practice building a bag-of-words once more, using the `reviews` dataset of Amazon product reviews. Your main task will be to limit the size of the vocabulary and specify the length of the token sequence.

### Instructions
+ Import the vectorizer from `sklearn`.
+ Build the vectorizer and make sure to specify the following parameters: the size of the vocabulary should be limited to 1000, include only bigrams, and ignore terms that appear in more than 500 documents.
+ Fit the vectorizer to the review column.
+ Create a DataFrame from the BOW representation.

In [None]:
#Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify max features and fit
vect = CountVectorizer(max_features=1000, ngram_range=(2, 2), max_df=500)
vect.fit(reviews.review)

# Transform the review
X_review = vect.transform(reviews.review)

# Create a DataFrame from the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

Wonderful job! You have successfully created a bag-of-words representation of the product reviews dataset, including more sophisticated sequences of tokens, while limiting the size of the vocabulary.

## Build new features from text
<video controls src="video/video2_3.mp4" width=720></video>

## Tokenize a string from GoT
### Exercise
A first standard step when working with text is to tokenize it, in other words, split a bigger string into individual strings, which are usually single words (tokens).
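
To make the idea of a token concrete, here is a rough stand-in written with a regular expression. It is not `nltk`'s `word_tokenize`, which handles cases such as contractions more carefully; the helper `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical sample sentence
tokens = simple_tokenize("Hello, world! It works.")
print(tokens)  # ['Hello', ',', 'world', '!', 'It', 'works', '.']
```

Notice that punctuation marks become tokens of their own, which is also what you will see in the output of the exercise below.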

A string `GoT` has been created for you and it contains a quote from George R.R. Martin's *Game of Thrones*. Your task is to split it into individual tokens.

### Instructions
+ Import the word tokenizing function from `nltk`.
+ Transform the `GoT` string to word tokens.

In [None]:
from variables import GoT

# Import the required function
from nltk import word_tokenize

# Transform the GoT string to word tokens
print(word_tokenize(GoT))

Great effort! You have successfully taken a string and split it up into word tokens.

## Word tokens from the Avengers
### Exercise
Now that you have tokenized your first string, it is time to iterate over items of a list and tokenize them as well. An easy way to do that with one line of code is with a list comprehension.

A list `avengers` has been created for you. It contains a few quotes from the *Avengers* movies. You can explore it in the IPython Shell.

### Instructions
+ Import the required function and package.
+ Apply the word tokenizing function on each item of our list.

In [None]:
from variables import avengers

# Import the word tokenizing function
from nltk import word_tokenize

# Tokenize each item in the avengers 
tokens_avengers = [word_tokenize(item) for item in avengers]

print(tokens_avengers)

Nice work! You have built on what you developed in the previous exercise and used a list comprehension to tokenize each quote from the Avengers movies in the list.

## A feature for the length of a review
### Exercise
You have now worked with a string and a list with string items. It is time to use a larger sample of data.

Your task in this exercise is to create a new feature for the length of a review, using the familiar `reviews` dataset.

### Instructions
1. 
	+ Import the word tokenizing function from the required package.
	+ Apply the function to the `review` column of the `reviews` dataset.
2. 
	+ Iterate over the created `word_tokens` list.
	+ As you iterate, find the length of each item in the list and append it to the empty `len_tokens` list.
	+ Create a new feature `n_words` in the `reviews` for the length of the reviews.

In [None]:
#---------------
# Instruction 1
#---------------
# Import the needed packages
from nltk import word_tokenize

# Tokenize each item in the review column 
word_tokens = [word_tokenize(review) for review in reviews.review]

# Print out the first item of the word_tokens list
print(word_tokens[0])

#---------------
# Instruction 2
#---------------
# Create an empty list to store the length of reviews
len_tokens = []

# Iterate over the word_tokens list and determine the length of each item
for tokens in word_tokens:
    len_tokens.append(len(tokens))

# Create a new feature for the length of each review
reviews['n_words'] = len_tokens

Great job! You have used a list comprehension and a for loop to iterate over the word tokens created from the review column. You can employ the same approach to create other features, such as one counting the number of sentences in each review. This knowledge will also help you understand the next chapter.
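
As a sketch of how the same pattern could extend to other features, the example below adds a word count and a sentence count to a small made-up DataFrame. To stay self-contained it splits on whitespace and on sentence-ending punctuation rather than using `nltk`, so the counts can differ from what `word_tokenize` would give; `reviews_demo` and `count_sentences` are hypothetical names.

```python
import re

import pandas as pd

# Hypothetical stand-in for the reviews dataset
reviews_demo = pd.DataFrame(
    {"review": ["Great phone. Battery lasts long!", "Broke after a week."]}
)


def count_sentences(text):
    """Count non-empty chunks between sentence-ending punctuation marks."""
    return len([chunk for chunk in re.split(r"[.!?]+", text) if chunk.strip()])


# One new numeric column per feature, built with list comprehensions
reviews_demo["n_words"] = [len(text.split()) for text in reviews_demo["review"]]
reviews_demo["n_sentences"] = [count_sentences(text) for text in reviews_demo["review"]]

print(reviews_demo)
```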

## Can you guess the language?
<video controls src="video/video2_4.mp4" width=720></video>

## Identify the language of a string
### Exercise
Sometimes you might need to analyze the sentiment of non-English text. Your first task in such a case will be to identify the foreign language.

In this exercise you will identify the language of a single string. A string called `foreign` has been created for you. Feel free to explore it in the IPython Shell.

### Instructions
+ Import the required function from the language detection package.
+ Detect the language of the `foreign` string.

In [None]:
from variables import foreign

# Import the language detection function and package
from langdetect import detect_langs

# Detect the language of the foreign string
print(detect_langs(foreign))

Great job! You have successfully identified the language of the string to be French!

## Detect language of a list of strings
### Exercise
Now you will detect the language of each item in a list. A list called `sentences` has been created for you and it contains 3 sentences, each in a different language. They have been randomly extracted from the product reviews dataset.

### Instructions
+ Iterate over the sentences in the list.
+ Detect the language of each sentence and append the detected language to the empty list `languages`.

In [None]:
from variables import sentences
from langdetect import detect_langs

languages = []

# Loop over the sentences in the list and detect their language
for sentence in sentences:
    languages.append(detect_langs(sentence))
    
print('The detected languages are: ', languages)

Great job! What languages did you detect?

## Language detection of product reviews
### Exercise
You will practice language detection on a small dataset called `non_english_reviews`. It is a sample of non-English reviews from the Amazon product reviews.

You will iterate over the rows of the dataset, detecting the language of each row and appending it to an empty list. The list needs to be cleaned so that it only contains the language of the review such as `'en'` for English instead of the regular output `en:0.9987654`. Remember that the language detection function might detect more than one language and the first item in the returned list is the most likely candidate. Finally, you will assign the list to a new column.

The logic is the same as used in the slides and the exercise before but instead of applying the function to a list, you work with a dataset.
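
The cleaning step can be sketched on plain strings. The two values in `raw_outputs` are made up to mimic what `str()` returns for the list that `detect_langs()` produces, including the surrounding square brackets; they are not real results from the dataset.

```python
# Hypothetical string forms of detect_langs() results
raw_outputs = [
    "[fr:0.9999960343]",
    "[es:0.8571414688, pt:0.1428566012]",
]

# Split on ':' to isolate the first candidate, then drop the leading '['
cleaned = [raw.split(":")[0][1:] for raw in raw_outputs]

print(cleaned)  # ['fr', 'es']
```

Because the first item is the most likely candidate, taking everything before the first `:` keeps only that language code, even when several languages were detected.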

### Instructions
+ Iterate over the rows of the `non_english_reviews` dataset.
+ Inside the loop, detect the language of the second column of the dataset.
+ Clean the string by splitting on a `:` inside the list comprehension expression.
+ Finally, assign the cleaned list to a new column.

In [None]:
from langdetect import detect_langs
languages = [] 

# Loop over the rows of the dataset and append  
for row in range(len(non_english_reviews)):
    languages.append(detect_langs(non_english_reviews.iloc[row, 1]))

# Clean the list: keep the text before the first ':' and drop the leading '['
languages = [str(lang).split(':')[0][1:] for lang in languages]

# Assign the list to a new feature 
non_english_reviews['language'] = languages

print(non_english_reviews.head())

Good job! You have successfully built a new column in the dataset, which tells you in which language the respective review is written. This can be a very useful feature!