# Session 14: Text as Data


# Required readings

- Gentzkow, M., Kelly, B.T. and Taddy, M., 2019. ["Text as data"](https://doi.org/10.1257/jel.20181020) *Journal of Economic Literature* 57(3).
  - Following sections:
    - 1. Introduction
    - 2. Representing Text as Data

- Chapter 2. Dan Jurafsky and James H. Martin: [Speech and Language Processing (3rd ed. draft)](https://web.stanford.edu/~jurafsky/slp3/)
  - Following sections:
    - 2.4 Text Normalization

- PML; Python Machine Learning, 3rd ed. (2019) by Sebastian Raschka & Vahid Mirjalili: following sections from chapter 8:
  - Introduction
  - Preparing the IMDb movie review data for text processing
  - Introducing the bag-of-words model
  - Training a logistic regression model for document classification
  - Topic modeling with Latent Dirichlet Allocation

# Overview of Session 14

1. **Intro to text as data**
2. **Examples of text as data for social scientists**
3. **What is a text?**
    - What do we mean by a "document"?
    - We need to represent the words of a text in a structured way!
4. **A text data analysis recipe**
    1. Specify your document
    2. Preprocess your text
    3. Apply
5. **Cleaning and preprocessing text**
    - Clean text: ignore/remove any unwanted characters: casing, HTML markup, non-words, etc. (maybe also emoticons?)
    - Tokenization and stop-words
    - Stemming and lemmatization

6. **Bag of Words model**
    - Term frequency
    - N-grams
    - Term frequency - Inverse Document Frequency
7. **Applications:**
    1. **Training a logistic model to classify whether a text is positive or negative**
        - IMDB reviews
    2. **Lexicons**
        - Is a word positive or negative?
    3. **Topic modelling**
        - Assign topics to text

# 1. Intro to text as data

Regard this session as an appetizer!
- Text as data can be a course in itself
- We cannot go into details, so don't worry if you do not understand everything!

- Use the session as an overview of what text analysis can do
    - What do you find interesting? Dive into the details yourself
    - Maybe already in the exam project

- Want to work with text as data?
    - Good starting point is PML chapter 8!
        - Nice and easily accessible introduction to text data analysis
        - Read it carefully
            - There are many steps in text data analysis
            - If you miss one step, the other steps might be hard to follow

# 2. Examples of text as data for social scientists

## Examples you have already seen in the course

- Newspaper articles
- Job posts
- Reviews on Trustpilot (quick example)

## Other examples

- Social media (tweets, Facebook posts etc.)
- Text from central bank reports: https://sekhansen.github.io/pdf_files/jme_2019.pdf 
- Text from annual reports: https://www.nationalbanken.dk/da/publikationer/Documents/2018/11/WP_130.pdf
- Congressional speeches and partisanship in the US: https://scholar.harvard.edu/files/shapiro/files/politext.pdf 
- Property descriptions on property portals
- AirBnB descriptions
- Can you find more examples?

## Project ideas

- Predicting election outcomes or market trends from sentiment
- Stance or sentiment towards political parties
- Hate speech detection
- Analysing the most important topics in a public debate

# 3. What is a text?

## A dataset of movie reviews and sentiment towards the movies

*(Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).*

*Data from http://ai.stanford.edu/~amaas/data/sentiment/)*

In [14]:
import pandas as pd
df = pd.read_csv('movie_data.csv', encoding='utf-8', sep=';')

In [15]:
df

Unnamed: 0,review,sentiment,set
0,I went and saw this movie last night after bei...,1,test
1,Actor turned director Bill Paxton follows up h...,1,test
2,As a recreational golfer with some knowledge o...,1,test
3,"I saw this film in a sneak preview, and it is ...",1,test
4,Bill Paxton has taken the true story of the 19...,1,test
...,...,...,...
49995,"Towards the end of the movie, I felt it was to...",0,train
49996,This is the kind of movie that my enemies cont...,0,train
49997,I saw 'Descent' last night at the Stockholm Fi...,0,train
49998,Some films that you pick up for a pound turn o...,0,train


## So what is a text?

In [16]:
review = df['review'][1]
review

'Actor turned director Bill Paxton follows up his promising debut, the Gothic-horror "Frailty", with this family friendly sports drama about the 1913 U.S. Open where a young American caddy rises from his humble background to play against his Bristish idol in what was dubbed as "The Greatest Game Ever Played." I\'m no fan of golf, and these scrappy underdog sports flicks are a dime a dozen (most recently done to grand effect with "Miracle" and "Cinderella Man"), but some how this film was enthralling all the same.<br /><br />The film starts with some creative opening credits (imagine a Disneyfied version of the animated opening credits of HBO\'s "Carnivale" and "Rome"), but lumbers along slowly for its first by-the-numbers hour. Once the action moves to the U.S. Open things pick up very well. Paxton does a nice job and shows a knack for effective directorial flourishes (I loved the rain-soaked montage of the action on day two of the open) that propel the plot further or add some unexpec

- We can also call our text a **document**
    - The document determines at which level we will analyse the text. For example, the text above can be analysed in different ways:
        - split each sentence to analyse them separately: *each sentence* is then defined as a document
        - analyse the whole text: *the whole text* is then defined as a document
        - analyse all the reviews that the author has written: *all reviews combined* are then defined as a document

- Which one is the right definition of the document?
    - It depends on the task you would like to solve
        - Are there any dependencies across the author's reviews? Then it might be a good idea to combine them all 
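The three document definitions above can be sketched in a few lines. This is only an illustration: the `reviews` list and the author names are made up (the IMDb data has no author column), and the sentence splitter is deliberately naive.

```python
import re
from collections import defaultdict

# Hypothetical toy data: (author, review text) pairs, not taken from the IMDb dataset
reviews = [
    ("anna", "Great film. I loved the ending."),
    ("ben", "Too slow for me."),
    ("anna", "The sequel was worse."),
]

# (a) Each sentence is a document: naive split after ., ! or ?
#     (this would wrongly split abbreviations such as "U.S. Open")
sentence_docs = [
    sentence
    for _, text in reviews
    for sentence in re.split(r"(?<=[.!?])\s+", text)
    if sentence
]

# (b) Each review is a document
review_docs = [text for _, text in reviews]

# (c) All reviews by the same author form one document
texts_by_author = defaultdict(list)
for author, text in reviews:
    texts_by_author[author].append(text)
author_docs = {author: " ".join(texts) for author, texts in texts_by_author.items()}

print(len(sentence_docs), len(review_docs), len(author_docs))  # prints: 4 3 2
```

The same raw text gives 4, 3 or 2 documents depending on the definition, and everything downstream (the bag of words, the model) is built on that choice.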

### What does a document consist of?

- WORDS!
- In the raw text, words are not structured in any way
    - We need structured data to analyse it!
    - --> Structure the words in a Bag of Words model (more about that later)

# 4. A 'text as data' recipe

## A. Specify what your document is

- Is it every single tweet?
- Daily tweets?
- Monthly tweets?
- Or all tweets a person has ever made?

## B. Preprocess the text: Reduce the number of language elements

- Clean text: ignore/remove any unwanted characters: casing, HTML markup, non-words, etc. (maybe also emoticons?)
- Tokenization and stop-words
- Stemming and lemmatization

## C. Apply: What question would you like to answer and what is the right tool?

- Machine learning model for sentiment analysis
- Lexicons
- Topic modelling

# Video 14.1: Preprocessing text data

# 5. Preprocessing text data (second step in our recipe)

## Different steps in preprocessing:

1. Clean text: ignore/remove any unwanted characters: casing, HTML markup, non-words, etc. (maybe also emoticons?)
2. Tokenization and stop-words
3. Stemming and lemmatization

Which preprocessing steps are important depends on the problem you want to solve

### 1. Clean text: ignore casing, HTML markup, non-words

- Casing: 
    - We want "Movie" and "movie" to be the same word, so we change all letters to lower case
- HTML markup:
    - In our review example we see there is some unwanted HTML markup left. We want to drop it
- Non-words: 
    - Characters other than letters and numbers (non-alphanumeric characters) are typically not important for text data analysis, so we may drop them
    - Exceptions:
        - Emoticons may very much give information about sentiment in a text
        - Dollar signs to indicate a price. Punctuation to indicate decimals in the price
    - It all depends on the problem you want to solve!
    - Careful: You might not want to remove any non-alphanumeric characters before you tokenize (next step)
- Other stuff?
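If emoticons or prices matter for your problem, one option is to save them before the non-alphanumeric characters are removed and add them back as tokens afterwards. The sketch below shows this idea. The two regex patterns and the function name `clean_keep_signals` are assumptions made for illustration, and they only cover a few simple emoticons and dollar prices.

```python
import re

# Assumed patterns for illustration; adjust them to your own data
EMOTICON = re.compile(r"(?::|;|=)-?(?:\)|\(|D|P)")   # e.g. :) ;-( =D :P
PRICE = re.compile(r"\$\d+(?:\.\d+)?")               # e.g. $5 or $9.99

def clean_keep_signals(text):
    """Clean a text but keep emoticons and prices as separate tokens."""
    text = re.sub(r"<[^>]*>", " ", text)         # remove HTML markup first
    emoticons = EMOTICON.findall(text)           # save the emoticons
    prices = PRICE.findall(text)                 # save the prices
    text = PRICE.sub(" ", EMOTICON.sub(" ", text))
    text = re.sub(r"[^\w\s]", "", text.lower())  # usual cleaning on the rest
    return text.split() + prices + emoticons

print(clean_keep_signals("Great movie :-)<br />Only $9.99!"))
# ['great', 'movie', 'only', '$9.99', ':-)']
```

Note that the saved tokens are appended at the end, so their position in the text is lost. That is fine for a bag of words, but not if word order matters.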

#### Change to lower case:

In [17]:
review_low = review.lower()
review_low

'actor turned director bill paxton follows up his promising debut, the gothic-horror "frailty", with this family friendly sports drama about the 1913 u.s. open where a young american caddy rises from his humble background to play against his bristish idol in what was dubbed as "the greatest game ever played." i\'m no fan of golf, and these scrappy underdog sports flicks are a dime a dozen (most recently done to grand effect with "miracle" and "cinderella man"), but some how this film was enthralling all the same.<br /><br />the film starts with some creative opening credits (imagine a disneyfied version of the animated opening credits of hbo\'s "carnivale" and "rome"), but lumbers along slowly for its first by-the-numbers hour. once the action moves to the u.s. open things pick up very well. paxton does a nice job and shows a knack for effective directorial flourishes (i loved the rain-soaked montage of the action on day two of the open) that propel the plot further or add some unexpec

#### Remove HTML markup:

In [18]:
import re
review_noHTML = re.sub(r'<[^>]*>', ' ', review_low) #The regex matches HTML tags (anything between "<" and ">"); sub() replaces each match with a space
review_noHTML

'actor turned director bill paxton follows up his promising debut, the gothic-horror "frailty", with this family friendly sports drama about the 1913 u.s. open where a young american caddy rises from his humble background to play against his bristish idol in what was dubbed as "the greatest game ever played." i\'m no fan of golf, and these scrappy underdog sports flicks are a dime a dozen (most recently done to grand effect with "miracle" and "cinderella man"), but some how this film was enthralling all the same.  the film starts with some creative opening credits (imagine a disneyfied version of the animated opening credits of hbo\'s "carnivale" and "rome"), but lumbers along slowly for its first by-the-numbers hour. once the action moves to the u.s. open things pick up very well. paxton does a nice job and shows a knack for effective directorial flourishes (i loved the rain-soaked montage of the action on day two of the open) that propel the plot further or add some unexpected psycho

#### Remove all characters that are not words or numbers:

In [19]:
review_cleaned = re.sub(r'[^\w\s]','',review_noHTML) #The regex matches every character that is neither a word character (letter, digit, underscore) nor whitespace; sub() deletes each match
review_cleaned

'actor turned director bill paxton follows up his promising debut the gothichorror frailty with this family friendly sports drama about the 1913 us open where a young american caddy rises from his humble background to play against his bristish idol in what was dubbed as the greatest game ever played im no fan of golf and these scrappy underdog sports flicks are a dime a dozen most recently done to grand effect with miracle and cinderella man but some how this film was enthralling all the same  the film starts with some creative opening credits imagine a disneyfied version of the animated opening credits of hbos carnivale and rome but lumbers along slowly for its first bythenumbers hour once the action moves to the us open things pick up very well paxton does a nice job and shows a knack for effective directorial flourishes i loved the rainsoaked montage of the action on day two of the open that propel the plot further or add some unexpected psychological depth to the proceedings theres

#### Other stuff?

- There may be other things you need to remove before you are ready to move on
- It depends on the texts you are dealing with and the problem you want to solve
    - Investigate the texts
    - Make sure that you keep all the important stuff and remove the rest

We now apply our cleaning process on all reviews in the dataset to work with it later:

In [20]:
def cleaner(document):
    document = document.lower() #To lower case
    document = re.sub(r'<[^>]*>', ' ', document) #Remove HTML
    document = re.sub(r'[^\w\s]','', document) #Remove non-alphanumeric characters
    return document

df['review'] = df['review'].apply(cleaner)

In [21]:
df['review']

0        i went and saw this movie last night after bei...
1        actor turned director bill paxton follows up h...
2        as a recreational golfer with some knowledge o...
3        i saw this film in a sneak preview and it is d...
4        bill paxton has taken the true story of the 19...
                               ...                        
49995    towards the end of the movie i felt it was too...
49996    this is the kind of movie that my enemies cont...
49997    i saw descent last night at the stockholm film...
49998    some films that you pick up for a pound turn o...
49999    this is one of the dumbest films ive ever seen...
Name: review, Length: 50000, dtype: object

### 2. Tokenization (I/II)

- Tokenization is about splitting the document into meaningful elements (*tokens*)
    - Tokens can be thought of as words in a sentence or sentences in a text
- Simplest tokenization: Split the cleaned document at its whitespaces:

In [22]:
# Split at whitespace with the split() method
review_tokens = review_cleaned.split()
review_tokens

['actor',
 'turned',
 'director',
 'bill',
 'paxton',
 'follows',
 'up',
 'his',
 'promising',
 'debut',
 'the',
 'gothichorror',
 'frailty',
 'with',
 'this',
 'family',
 'friendly',
 'sports',
 'drama',
 'about',
 'the',
 '1913',
 'us',
 'open',
 'where',
 'a',
 'young',
 'american',
 'caddy',
 'rises',
 'from',
 'his',
 'humble',
 'background',
 'to',
 'play',
 'against',
 'his',
 'bristish',
 'idol',
 'in',
 'what',
 'was',
 'dubbed',
 'as',
 'the',
 'greatest',
 'game',
 'ever',
 'played',
 'im',
 'no',
 'fan',
 'of',
 'golf',
 'and',
 'these',
 'scrappy',
 'underdog',
 'sports',
 'flicks',
 'are',
 'a',
 'dime',
 'a',
 'dozen',
 'most',
 'recently',
 'done',
 'to',
 'grand',
 'effect',
 'with',
 'miracle',
 'and',
 'cinderella',
 'man',
 'but',
 'some',
 'how',
 'this',
 'film',
 'was',
 'enthralling',
 'all',
 'the',
 'same',
 'the',
 'film',
 'starts',
 'with',
 'some',
 'creative',
 'opening',
 'credits',
 'imagine',
 'a',
 'disneyfied',
 'version',
 'of',
 'the',
 'animated',

### 2. Tokenization (II/II)

- The simple tokenization might not suffice in some cases:
    - How should we treat abbreviations like Ph.D.? And dollar signs before a price? And punctuation that indicates decimals?
- The NLTK library has some [tokenizer packages](https://www.nltk.org/api/nltk.tokenize.html) that can help you:
    - `word_tokenize()` splits a text into word and punctuation tokens
    - If you have Twitter data, then `TweetTokenizer()` will keep the hashtags intact
    - You can also define your own tokenization pattern using regex with `regexp_tokenize()`
- But in many cases it is just fine to use `split()`
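To illustrate what a custom tokenization pattern can look like, here is a minimal sketch that uses the standard library `re` module instead of NLTK's `regexp_tokenize()`. The pattern and the function name `regex_tokenize` are assumptions made for this example. It must be applied to the text *before* punctuation is removed, otherwise the periods and dollar signs are already gone.

```python
import re

# Assumed token pattern (alternatives are tried from top to bottom)
TOKEN_PATTERN = re.compile(r"""
    \$?\d+(?:\.\d+)?\b      # numbers and prices such as 1913 or $9.99
  | (?:[A-Za-z]+\.){2,}     # abbreviations with periods such as Ph.D. or U.S.
  | \#\w+                   # hashtags such as #golf
  | \w+(?:[-']\w+)*         # words, allowing inner hyphens and apostrophes
""", re.VERBOSE)

def regex_tokenize(text):
    """Return all tokens in text that match TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("I have a Ph.D. and paid $9.99 at the U.S. Open #golf"))
# ['I', 'have', 'a', 'Ph.D.', 'and', 'paid', '$9.99', 'at', 'the', 'U.S.', 'Open', '#golf']
```

With this pattern "by-the-numbers" and "i'm" also stay single tokens, whereas the cleaning step above turned them into "bythenumbers" and "im".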

In [23]:
import nltk
review_tokens = nltk.tokenize.word_tokenize(review_cleaned)
review_tokens

['actor',
 'turned',
 'director',
 'bill',
 'paxton',
 'follows',
 'up',
 'his',
 'promising',
 'debut',
 'the',
 'gothichorror',
 'frailty',
 'with',
 'this',
 'family',
 'friendly',
 'sports',
 'drama',
 'about',
 'the',
 '1913',
 'us',
 'open',
 'where',
 'a',
 'young',
 'american',
 'caddy',
 'rises',
 'from',
 'his',
 'humble',
 'background',
 'to',
 'play',
 'against',
 'his',
 'bristish',
 'idol',
 'in',
 'what',
 'was',
 'dubbed',
 'as',
 'the',
 'greatest',
 'game',
 'ever',
 'played',
 'im',
 'no',
 'fan',
 'of',
 'golf',
 'and',
 'these',
 'scrappy',
 'underdog',
 'sports',
 'flicks',
 'are',
 'a',
 'dime',
 'a',
 'dozen',
 'most',
 'recently',
 'done',
 'to',
 'grand',
 'effect',
 'with',
 'miracle',
 'and',
 'cinderella',
 'man',
 'but',
 'some',
 'how',
 'this',
 'film',
 'was',
 'enthralling',
 'all',
 'the',
 'same',
 'the',
 'film',
 'starts',
 'with',
 'some',
 'creative',
 'opening',
 'credits',
 'imagine',
 'a',
 'disneyfied',
 'version',
 'of',
 'the',
 'animated',

### Stop-words:

- Words that are extremely common in all texts
- Probably bear no useful information about the text --> we want to remove them
- Examples: *is, and, has, like...*

Use NLTK's list of English stop-words (the exact number of words in the list depends on the NLTK version)
- NLTK (Natural Language ToolKit) is a popular Python package for natural language processing

In [24]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop = stopwords.words('english')
review_nostop = [i for i in review_tokens if i not in stop]
review_nostop

[nltk_data] Downloading package stopwords to /Users/fch/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['actor',
 'turned',
 'director',
 'bill',
 'paxton',
 'follows',
 'promising',
 'debut',
 'gothichorror',
 'frailty',
 'family',
 'friendly',
 'sports',
 'drama',
 '1913',
 'us',
 'open',
 'young',
 'american',
 'caddy',
 'rises',
 'humble',
 'background',
 'play',
 'bristish',
 'idol',
 'dubbed',
 'greatest',
 'game',
 'ever',
 'played',
 'im',
 'fan',
 'golf',
 'scrappy',
 'underdog',
 'sports',
 'flicks',
 'dime',
 'dozen',
 'recently',
 'done',
 'grand',
 'effect',
 'miracle',
 'cinderella',
 'man',
 'film',
 'enthralling',
 'film',
 'starts',
 'creative',
 'opening',
 'credits',
 'imagine',
 'disneyfied',
 'version',
 'animated',
 'opening',
 'credits',
 'hbos',
 'carnivale',
 'rome',
 'lumbers',
 'along',
 'slowly',
 'first',
 'bythenumbers',
 'hour',
 'action',
 'moves',
 'us',
 'open',
 'things',
 'pick',
 'well',
 'paxton',
 'nice',
 'job',
 'shows',
 'knack',
 'effective',
 'directorial',
 'flourishes',
 'loved',
 'rainsoaked',
 'montage',
 'action',
 'day',
 'two',
 'open',

In [25]:
print(len(review_tokens))
print(len(review_nostop))

342
185
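One caveat: standard stop-word lists, including NLTK's English list, contain negations such as "no" and "not", which can matter a lot for sentiment. The sketch below shows one way to keep selected words while removing the rest. The mini stop list and the function name `remove_stop_words` are hypothetical and only serve the example.

```python
# Hypothetical mini stop list for illustration (NLTK's list is much longer)
SMALL_STOP = {"the", "a", "is", "and", "this", "was", "no", "not"}
NEGATIONS = {"no", "not"}

def remove_stop_words(tokens, stop_words, keep=()):
    """Drop stop-words from tokens, except the words listed in keep."""
    drop = set(stop_words) - set(keep)   # a set makes the lookup fast
    return [token for token in tokens if token not in drop]

tokens = "this movie was not good".split()
print(remove_stop_words(tokens, SMALL_STOP))                  # ['movie', 'good']
print(remove_stop_words(tokens, SMALL_STOP, keep=NEGATIONS))  # ['movie', 'not', 'good']
```

Without the exception, "not good" and "good" end up as the same token list, so it is worth checking what your stop list actually removes.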


### 3. Stemming and lemmatization

#### Stemming:
- The process of transforming a word into its root form
- Allows us to map related words to the same stem
- Examples: `'runners', 'run', 'running'` become `'runner', 'run', 'run'`. `'wonderful'` becomes `'wonder'`.
- You can use the Porter stemmer in the NLTK library to stem your words: `PorterStemmer()`
    - With stemming we generally just remove the suffix of the word: very simple method

In [26]:
# Stem the words
porter = nltk.PorterStemmer()
review_stemmed = [porter.stem(i) for i in review_nostop]
review_stemmed

['actor',
 'turn',
 'director',
 'bill',
 'paxton',
 'follow',
 'promis',
 'debut',
 'gothichorror',
 'frailti',
 'famili',
 'friendli',
 'sport',
 'drama',
 '1913',
 'us',
 'open',
 'young',
 'american',
 'caddi',
 'rise',
 'humbl',
 'background',
 'play',
 'bristish',
 'idol',
 'dub',
 'greatest',
 'game',
 'ever',
 'play',
 'im',
 'fan',
 'golf',
 'scrappi',
 'underdog',
 'sport',
 'flick',
 'dime',
 'dozen',
 'recent',
 'done',
 'grand',
 'effect',
 'miracl',
 'cinderella',
 'man',
 'film',
 'enthral',
 'film',
 'start',
 'creativ',
 'open',
 'credit',
 'imagin',
 'disneyfi',
 'version',
 'anim',
 'open',
 'credit',
 'hbo',
 'carnival',
 'rome',
 'lumber',
 'along',
 'slowli',
 'first',
 'bythenumb',
 'hour',
 'action',
 'move',
 'us',
 'open',
 'thing',
 'pick',
 'well',
 'paxton',
 'nice',
 'job',
 'show',
 'knack',
 'effect',
 'directori',
 'flourish',
 'love',
 'rainsoak',
 'montag',
 'action',
 'day',
 'two',
 'open',
 'propel',
 'plot',
 'add',
 'unexpect',
 'psycholog',
 'de
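To make the "remove the suffix" idea concrete, here is a toy suffix stripper. It is *not* the Porter algorithm, which has many more rules and conditions. The suffix list and the function name `toy_stem` are assumptions made for illustration.

```python
# Hypothetical suffix list, checked in this order; far from complete
SUFFIXES = ("ing", "ed", "ly", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if at least min_stem_len letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ["turned", "follows", "slowly", "running", "us"]])
# ['turn', 'follow', 'slow', 'runn', 'us']
```

The result 'runn' shows why real stemmers need extra rules (for example for doubled consonants), and why even those rules sometimes produce stems that are not real words.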

#### Lemmatization:

- Stemming can create non-real words in some cases (see above)
- Lemmatization is more advanced and seeks to find the grammatically correct form of the word (the lemma)
    - Example: `'coding', 'code', 'coded'` will all be lemmatized to `'code'`
- Lemmatization demands a lot of computer power --> it is slow
- In practice there is little difference between stemming and lemmatization in terms of text classification performance
    - [Influence of Word Normalization on Text Classification](https://www.researchgate.net/publication/250030718_Influence_of_Word_Normalization_on_Text_Classification)

You can use the [WordNet](https://wordnet.princeton.edu/) lemmatizer from NLTK
- WordNet is a large lexical database of English words

In [28]:
# Lemmatize the words with the WordNetLemmatizer
nltk.download('omw-1.4') #Download OpenMultilingualWordnet
nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()
review_lemma = [wnl.lemmatize(i) for i in review_nostop]
review_lemma

[nltk_data] Downloading package omw-1.4 to /Users/fch/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/fch/nltk_data...


['actor',
 'turned',
 'director',
 'bill',
 'paxton',
 'follows',
 'promising',
 'debut',
 'gothichorror',
 'frailty',
 'family',
 'friendly',
 'sport',
 'drama',
 '1913',
 'u',
 'open',
 'young',
 'american',
 'caddy',
 'rise',
 'humble',
 'background',
 'play',
 'bristish',
 'idol',
 'dubbed',
 'greatest',
 'game',
 'ever',
 'played',
 'im',
 'fan',
 'golf',
 'scrappy',
 'underdog',
 'sport',
 'flick',
 'dime',
 'dozen',
 'recently',
 'done',
 'grand',
 'effect',
 'miracle',
 'cinderella',
 'man',
 'film',
 'enthralling',
 'film',
 'start',
 'creative',
 'opening',
 'credit',
 'imagine',
 'disneyfied',
 'version',
 'animated',
 'opening',
 'credit',
 'hbos',
 'carnivale',
 'rome',
 'lumber',
 'along',
 'slowly',
 'first',
 'bythenumbers',
 'hour',
 'action',
 'move',
 'u',
 'open',
 'thing',
 'pick',
 'well',
 'paxton',
 'nice',
 'job',
 'show',
 'knack',
 'effective',
 'directorial',
 'flourish',
 'loved',
 'rainsoaked',
 'montage',
 'action',
 'day',
 'two',
 'open',
 'propel',
 'p
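In the output above, 'turned' and 'loved' are unchanged and 'us' becomes 'u'. A likely explanation is that `lemmatize()` takes a part-of-speech argument `pos` that defaults to noun, so verbs are only reduced when you pass `pos='v'`. The toy lookup below illustrates the principle of a dictionary-based lemmatizer with a part-of-speech key. The tiny `LEMMAS` dictionary and the function name `toy_lemmatize` are hypothetical and are not how WordNet is implemented.

```python
# Hypothetical mini dictionary: (word, part of speech) -> lemma
LEMMAS = {
    ("turned", "v"): "turn",
    ("loved", "v"): "love",
    ("sports", "n"): "sport",
    ("credits", "n"): "credit",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("sports"))       # sport
print(toy_lemmatize("turned"))       # turned  (treated as a noun: no entry)
print(toy_lemmatize("turned", "v"))  # turn    (treated as a verb)
```

In practice this means you would need a part-of-speech tag for each token to get the full benefit of lemmatization, which adds to the computational cost.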

# Video 14.2: The Bag of Words model and tf-idf

# The Bag of Words model

Read more about the bag of words model in this article: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
- It can be a good starting point to go into more details

In [29]:
review_cleaned

'actor turned director bill paxton follows up his promising debut the gothichorror frailty with this family friendly sports drama about the 1913 us open where a young american caddy rises from his humble background to play against his bristish idol in what was dubbed as the greatest game ever played im no fan of golf and these scrappy underdog sports flicks are a dime a dozen most recently done to grand effect with miracle and cinderella man but some how this film was enthralling all the same  the film starts with some creative opening credits imagine a disneyfied version of the animated opening credits of hbos carnivale and rome but lumbers along slowly for its first bythenumbers hour once the action moves to the us open things pick up very well paxton does a nice job and shows a knack for effective directorial flourishes i loved the rainsoaked montage of the action on day two of the open that propel the plot further or add some unexpected psychological depth to the proceedings theres

- To exploit the information in text data we need to structure it in some way
    - Raw text is not structured
- A simple way to structure the documents/texts is the Bag of Words model
    - The Bag of Words model simply counts the number of times each word occurs in a document
    - That way we can store all documents and word counts in one big matrix (a document-term matrix):

#### A Bag of Words model:
<img src="https://drive.google.com/uc?exportview&id=1-VxQqdWhzIVt5l_7W-WljUa_iY8euUFk"/>

- Each row represents a document, and each column represents a word
- The values in the matrix are the count of each word in the document
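Before turning to Scikit-learn, the matrix described above can be built by hand to see what is going on. This is a minimal sketch using only the standard library. The function name `bag_of_words` and the three example sentences are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    counts = [Counter(doc.split()) for doc in documents]   # word counts per document
    vocabulary = sorted(set().union(*counts))              # all distinct words, sorted
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

docs = [
    "the sun is shining",
    "the weather is sweet",
    "the sun is shining the weather is sweet",
]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['is', 'shining', 'sun', 'sweet', 'the', 'weather']
for row in matrix:
    print(row)
# [1, 1, 1, 0, 1, 0]
# [1, 0, 0, 1, 1, 1]
# [2, 1, 1, 1, 2, 1]
```

Most entries in a real document-term matrix are zero, which is why Scikit-learn stores the bag of words as a sparse matrix rather than a list of lists.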

We can construct a bag of words with our review data using the module [feature_extraction](https://scikit-learn.org/stable/modules/feature_extraction.html) from the Scikit-learn library
- The [CountVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class constructs the bag of words for us

Let us first do it for the first two reviews:
- The `fit_transform()` method in the CountVectorizer() class first finds all the words in the documents (learns the vocabulary), and then constructs the matrix (counts the words in each document):

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer() #Store the class in 'count' to ease coding

review_array = df['review'].values[0:2] #Take the first two reviews and store them in an array
bag = count.fit_transform(review_array) #fit_transform takes an array as input and outputs the bag of words

Let's see how the bag of words looks in the matrix:

In [31]:
count_array = bag.toarray() #Convert the sparse bag to a dense array
matrix = pd.DataFrame(data=count_array,columns = count.get_feature_names_out()) #Input the bag and the words into a dataframe
matrix

Unnamed: 0,1913,able,about,action,actor,add,admit,after,against,alive,...,when,where,which,while,who,with,women,wrong,you,young
0,0,1,0,0,0,0,1,1,0,0,...,0,0,1,1,0,2,1,1,2,0
1,1,0,1,2,1,1,0,0,1,1,...,1,1,0,0,1,3,0,0,1,1


- The number of times a word (or term) occurs in a document is also called the **term frequency**.

## N-grams:

- In our bag of words from above each term represents **one** word
    - It is a bag of words model with **1-grams**
- I.e., we pool all words from a document into one big bag
    - --> we lose all information that lies in the order of the words

- Instead we can specify for example 2-grams:
    - With 1-grams: "My name is Hjalte" will yield the terms: 'My', 'name', 'is', 'Hjalte'
    - With 2-grams: "My name is Hjalte" will yield the terms: 'My name', 'name is', 'is Hjalte'
- N-grams with N greater than 1 are a way to keep some of the information in the order of the words
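The N-gram construction itself is just a sliding window over the tokens. A minimal sketch, where the function name `ngrams` is an assumption made for this example:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "My name is Hjalte".split()
print(ngrams(tokens, 1))  # ['My', 'name', 'is', 'Hjalte']
print(ngrams(tokens, 2))  # ['My name', 'name is', 'is Hjalte']
print(ngrams(tokens, 3))  # ['My name is', 'name is Hjalte']
```

A document with k tokens has k - n + 1 N-grams, but the number of *distinct* terms across documents typically grows quickly with n, because longer word sequences are rarely repeated.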

Let us see how to do it in Python:
- You can choose the N-grams via the `ngram_range` parameter

In [32]:
count = CountVectorizer(ngram_range=(2,2)) #Choose only 2-grams

review_array = df['review'].values[0:2]
bag = count.fit_transform(review_array)

count_array = bag.toarray() #Convert the sparse bag to a dense array
matrix = pd.DataFrame(data=count_array,columns = count.get_feature_names_out()) #Input the bag and the words into a dataframe
matrix

Unnamed: 0,1913 us,able to,about the,action moves,action on,actor turned,add some,admit that,after being,against his,...,with our,with some,with such,with this,women in,wrong kutcher,you go,you judge,you know,young american
0,0,1,0,0,0,0,0,1,1,0,...,1,0,1,0,1,1,1,1,0,0
1,1,0,1,1,1,1,1,0,0,1,...,0,1,0,1,0,0,0,0,1,1


Note: We will get more terms with N-grams of higher degrees
- More terms make the bag of words model more computationally heavy to work with
- **Classic trade-off in text as data: Trade-off between information and computer power**

## Term frequency-inverse document frequency

- From the matrix above you can see that we quickly get a lot of terms, even with few documents
- That is a problem for computational efficiency

#### Is there a way to limit the terms that do not provide a lot of information?
- The technique called: Term frequency-inverse document frequency!

### Background:

- When analyzing text data we often have words that appear frequently across many documents
    - These words typically do not carry much information about each document --> they are simply in all documents
- Similarly there will be words that are very rare
    - These words can carry a lot of information
    - But the information they provide may not be enough to counteract the computational cost they carry
- --> We want to down-weight words that are common across documents
    - That is what the term frequency - inverse document frequency (tf-idf) technique does
    - Note that tf-idf does not by itself down-weight rare words. If you want to drop them, do it separately, for example with the `min_df` parameter of `CountVectorizer`

### Tf-idf:
The tf-idf is computed like this:

$\text{tf-idf}(t,d) = tf(t,d) \times idf(t,d)$

- $tf(t,d)$ is the term frequency and measures how many times a word/term $t$ occurs in a document $d$ (just as you have seen with the bag of words model)

- $idf(t,d)$ is computed like this: $idf(t,d) = \log \frac{n_d}{1+df(t,d)}$

    - $n_d$ is the total number of documents, and $df(t,d)$ is the number of documents $d$ that contain the term $t$.

- Very common words will have a low tf-idf score because $idf(t,d)$ will be low
- Very rare words have $tf(t,d) = 0$ in almost every document, so their tf-idf score is zero there. In the few documents that do contain them, $idf(t,d)$ is high, so their score is high

Common practice:
- Only keep the words in a document if they have a tf-idf score above some threshold
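As a worked example, the formula above can be computed by hand. The sketch below assumes the natural logarithm and uses a made-up four-document corpus. The function name `tf_idf` is hypothetical. Scikit-learn's `TfidfTransformer` will give different numbers, because by default it uses a smoothed idf, $\ln\frac{1+n_d}{1+df(t,d)} + 1$, and then scales each document vector to unit length.

```python
import math
from collections import Counter

def tf_idf(documents):
    """tf-idf(t,d) = tf(t,d) * log(n_d / (1 + df(t,d))) for every term in every document."""
    tokenized = [doc.split() for doc in documents]
    n_d = len(tokenized)                                              # number of documents
    df = Counter(t for tokens in tokenized for t in set(tokens))      # document frequency
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)                                          # term frequency
        scores.append({t: tf[t] * math.log(n_d / (1 + df[t])) for t in tf})
    return scores

docs = ["good movie", "bad movie", "good good plot", "slow plot"]
scores = tf_idf(docs)
for doc, score in zip(docs, scores):
    print(doc, {t: round(s, 3) for t, s in score.items()})
```

In the second document, 'bad' (found in one document) scores $\log(4/2) \approx 0.693$, while 'movie' (found in two documents) scores $\log(4/3) \approx 0.288$. With this unsmoothed formula, a term that occurs in every document gets a slightly negative idf, which is one reason libraries use smoothed variants.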

How do we compute the tf-idf score in Python?

In [33]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer() #Ease coding
bag_tfidf = tfidf.fit_transform(bag) #Compute the tf-idf score from the bag of words from before ('bag')

In [34]:
tfidf_array = bag_tfidf.toarray() #Convert the sparse tf-idf bag to a dense array
matrix_tfidf = pd.DataFrame(data=tfidf_array,columns = count.get_feature_names_out()) #Input the bag and the words into a dataframe
matrix_tfidf

Unnamed: 0,1913 us,able to,about the,action moves,action on,actor turned,add some,admit that,after being,against his,...,with our,with some,with such,with this,women in,wrong kutcher,you go,you judge,you know,young american
0,0.0,0.082333,0.0,0.0,0.0,0.0,0.0,0.082333,0.082333,0.0,...,0.082333,0.0,0.082333,0.0,0.082333,0.082333,0.082333,0.082333,0.0,0.0
1,0.053631,0.0,0.053631,0.053631,0.053631,0.053631,0.053631,0.0,0.0,0.053631,...,0.0,0.053631,0.0,0.053631,0.0,0.0,0.0,0.0,0.053631,0.053631


# Video 14.3: Text as data applications

# 7. Applications (third step in our recipe)

- Training a logistic model to classify whether a text is positive or negative
- Lexicons
- Topic modelling

# 7. Applications (I/III): Training a logistic model for text classification

Recall the structure of our movie review dataset:
- Variable containing the reviews ('review')
- Variable stating whether the person had a positive or negative sentiment towards the movie ('sentiment')
- Variable stating whether the review is in the test or train set ('set')

In [35]:
df

Unnamed: 0,review,sentiment,set
0,i went and saw this movie last night after bei...,1,test
1,actor turned director bill paxton follows up h...,1,test
2,as a recreational golfer with some knowledge o...,1,test
3,i saw this film in a sneak preview and it is d...,1,test
4,bill paxton has taken the true story of the 19...,1,test
...,...,...,...
49995,towards the end of the movie i felt it was too...,0,train
49996,this is the kind of movie that my enemies cont...,0,train
49997,i saw descent last night at the stockholm film...,0,train
49998,some films that you pick up for a pound turn o...,0,train


We have labelled each review with a sentiment

- --> We can train a machine learning model on our "train reviews" to predict the sentiment of our "test reviews"
    - I.e., the goal is to predict the sentiment (positive or negative) of the reviews just by inputting the words in the reviews
    
We will use a logistic regression model for this text classification

## How do we do it in practice?

- First, we load the train and test dataset into two different datasets:
    - Remember that we have already cleaned the data with our cleaner function

In [37]:
import numpy as np 

df_train = df[df['set']=="train"]
df_test = df[df['set']=="test"]

# Shuffle the rows to mix positive and negative reviews
np.random.seed(0)
df_train = df_train.reindex(np.random.permutation(df_train.index))
df_test = df_test.reindex(np.random.permutation(df_test.index))

# Take out X and Y variable
x_train = df_train['review'].values
x_test = df_test['review'].values
y_train = df_train['sentiment'].values
y_test = df_test['sentiment'].values

- Second, we need to make our bag of words and re-weight it with tf-idf
    - Remember we used `CountVectorizer` and `TfidfTransformer` to do this
    - `TfidfVectorizer` combines the two in one step

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
x_train_bag = tfidf.fit_transform(x_train)

- Third, we fit our logistic regression model on the training set's bag of words (x_train_bag) and the true sentiments (y_train)

In [39]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=0) #Text classifier
lr.fit(x_train_bag,y_train)

- Fourth, we can now test our fitted logistic regression model on both the train set and test set

In [40]:
# First we need to make a tf-idf bag of words for the test set as well.
# Use transform() here, NOT fit_transform() as on the train set: the vocabulary and
# idf weights must come from the train set only, since that is what the model was fitted on.
x_test_bag = tfidf.transform(x_test)
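The reason for calling `transform()` rather than `fit_transform()` on the test set can be sketched without scikit-learn. The toy class below is hypothetical (it is not the scikit-learn implementation): it learns its vocabulary in `fit` and only counts those words in `transform`, so words that appear only in the test reviews are ignored and the columns line up with what the model was trained on.

```python
class ToyCountVectorizer:
    """Hypothetical stand-in for a bag-of-words vectorizer (illustration only)."""

    def fit(self, docs):
        # Learn the vocabulary (one column per word) from the given documents
        words = sorted({w for d in docs for w in d.split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count only words that are already in the learned vocabulary
        rows = []
        for d in docs:
            row = [0] * len(self.vocabulary_)
            for w in d.split():
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

train = ["great movie", "bad movie"]
test = ["great great plot"]  # "plot" never appears in the training reviews

vec = ToyCountVectorizer().fit(train)
print(vec.vocabulary_)      # columns learned from the training set
print(vec.transform(test))  # "plot" is dropped; the column order is unchanged
```

Refitting on the test set would instead produce a different vocabulary, and the fitted coefficients would no longer refer to the right columns.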

In [41]:
# Then we predict the sentiment 
train_preds = lr.predict(x_train_bag)
test_preds = lr.predict(x_test_bag)

# And we compare the predicted sentiment with the actual sentiment
print("Training accuracy:", np.mean(train_preds == y_train))
print("Testing accuracy:", np.mean(test_preds == y_test))

Training accuracy: 0.9334
Testing accuracy: 0.8844


## We can use the coefficients from the fitted model to say something about the importance of words

In [42]:
# Get all the words (features)
features = ['_'.join(s.split()) for s in tfidf.get_feature_names_out()]

# Get the coefficients from the fitted model
coefficients = lr.coef_

# Present coefficients for each feature
coefs_df = pd.DataFrame.from_records(coefficients, columns=features)
coefs_df

Unnamed: 0,00,000,0000000000001,000001,00000110,0001,00015,001,0010,002,...,étcother,évery,êxtase,ís,ísnt,østbye,über,überannoying,überspy,üvegtigris
0,-0.008332,-0.002248,-0.036233,-0.033263,-0.006562,0.010348,-0.005333,-0.033112,-0.001661,-0.021993,...,0.034033,-0.072971,0.016622,0.002062,-0.02997,0.010549,-0.097527,-0.006549,0.015016,-0.058898


In [43]:
# Print the 20 words with the largest positive coefficients (most positive sentiment)
print(coefs_df.T.sort_values(by=[0], ascending=False).head(20))

                   0
great       7.554328
excellent   6.259992
best        5.158860
perfect     4.730005
wonderful   4.616599
amazing     4.135281
well        3.864223
favorite    3.847079
loved       3.829614
love        3.820325
fun         3.779435
enjoyed     3.569553
710         3.438833
highly      3.420503
today       3.401102
and         3.268132
brilliant   3.230066
superb      3.211875
definitely  3.089752
still       3.051435


In [44]:
# Print the 20 words with the most negative coefficients (most negative sentiment)
print(coefs_df.T.sort_values(by=[0], ascending=True).head(20))

                      0
worst         -9.214507
bad           -8.007373
awful         -6.389001
waste         -6.339590
boring        -5.937048
poor          -5.395301
terrible      -4.865746
nothing       -4.777008
worse         -4.635933
no            -4.488101
horrible      -4.200668
dull          -4.197008
poorly        -4.096072
unfortunately -3.962912
annoying      -3.936293
script        -3.799177
stupid        -3.766312
ridiculous    -3.647288
minutes       -3.608638
even          -3.538806
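To see how these coefficients drive a prediction, recall that logistic regression passes the weighted sum of the features through the sigmoid function. The sketch below uses three coefficients rounded from the output above, together with made-up tf-idf weights and an assumed intercept of 0, so the resulting probabilities are illustrative only.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients rounded from the fitted model's output above
coef = {"great": 7.55, "boring": -5.94, "worst": -9.21}
intercept = 0.0  # assumption: the real model also has a fitted intercept

def prob_positive(weights):
    """P(sentiment = 1) for one review, given its {word: tf-idf weight} dict."""
    z = intercept + sum(coef.get(w, 0.0) * x for w, x in weights.items())
    return sigmoid(z)

# Made-up tf-idf weights for two short reviews
print(prob_positive({"great": 0.4}))                # above 0.5: predicted positive
print(prob_positive({"great": 0.2, "worst": 0.4}))  # below 0.5: predicted negative
```

A word with a large positive coefficient pushes the probability towards 1, and a word with a large negative coefficient pushes it towards 0.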


# 6. Applications (II/III): Lexicons

Sometimes we do not have labelled data as in our IMDB reviews example

- I.e., we do not know in advance whether a review has a positive or negative sentiment towards a movie
    - Recall: For each review we had a variable called 'sentiment' which stated whether the person writing the review had a positive or negative sentiment towards the movie
- Then we cannot train a machine learning model to classify the sentiment

Instead, we can use predefined lexicons!

- A lexicon is a dictionary of words with predefined labels, such as:
    - polarity score: positive, negative or neutral sentiment
    - mood
    - and so on

We can use these predefined labels to score the sentiment of texts
- The more positive words in the text, the more positive the sentiment
- The more negative words in the text, the more negative the sentiment

You can read more about lexicons [here](https://medium.com/nerd-for-tech/sentiment-analysis-lexicon-models-vs-machine-learning-b6e3af8fe746) 

## Different lexicons:

- AFINN: https://github.com/fnielsen/afinn
- VADER: https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664

### AFINN:

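- Developed by the Danish researcher Finn Årup Nielsen; word lists exist for several languages, including English and Danish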
- Simple and popular lexicon
- Word-list based: Contains 3382 words that are scored for polarity

Positive score: Positive sentiment. Negative score: Negative sentiment.

### VADER:

- Specifically tuned to social media
- VADER scores both polarity and intensity of emotion
- Word-list based, like AFINN
- But also rule-based:
    - Example: It knows that "did not love" is negative because of the negation
    
Positive score: Positive sentiment. Negative score: Negative sentiment.
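The difference between a pure word list and a rule-based scorer can be sketched with a toy example. The lexicon, the negation list, and the flip-the-sign rule below are all made up for illustration; VADER's actual rules are more elaborate.

```python
# Hypothetical mini-lexicon and negation words (illustration only)
LEXICON = {"love": 3, "great": 3, "hate": -3, "boring": -2}
NEGATIONS = {"not", "never", "no"}

def word_list_score(text):
    """Sum the lexicon scores of all tokens, ignoring context."""
    return sum(LEXICON.get(tok, 0) for tok in text.lower().split())

def negation_aware_score(text):
    """Like word_list_score, but flip a word's score if the previous token negates it."""
    tokens = text.lower().split()
    total = 0
    for i, tok in enumerate(tokens):
        score = LEXICON.get(tok, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            score = -score
        total += score
    return total

sentence = "i did not love it"
print(word_list_score(sentence))       # 3: the negation is ignored
print(negation_aware_score(sentence))  # -3: "not love" counts as negative
```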

### How does it work in practice?

- The document is tokenized (as you now know how to do)
- Each token in the document is matched with the words in the lexicon: Are they positive, negative or neutral?
- All the token sentiment scores in the document are summed or averaged to predict the overall sentiment of the document
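The three steps above can be sketched in a few lines. The word scores below are a made-up mini-lexicon, not the actual AFINN or VADER entries, and splitting on whitespace is the simplest possible tokenizer.

```python
# Made-up mini-lexicon (not the real AFINN or VADER scores)
TOY_LEXICON = {"good": 3, "great": 3, "wonderful": 4,
               "bad": -3, "awful": -3, "boring": -2}

def lexicon_sentiment(document, lexicon=TOY_LEXICON):
    """Return (summed score, average score per token) for one document."""
    tokens = document.lower().split()              # 1. tokenize
    scores = [lexicon.get(t, 0) for t in tokens]   # 2. match tokens; unknown words count as neutral (0)
    total = sum(scores)                            # 3. aggregate by summing ...
    mean = total / len(tokens) if tokens else 0.0  #    ... or by averaging
    return total, mean

total, mean = lexicon_sentiment("a great cast but a boring and awful script")
print(total, round(mean, 3))
```

Summing favours long documents (more words, larger absolute scores), while averaging makes documents of different lengths comparable.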

### How does it work in Python?

### AFINN

In [46]:
from afinn import Afinn

afn = Afinn(emoticons=True) #Also use the emoticons in the lexicon
review_sample=df.loc[[0,1000,49000]] #Choose some reviews from the cleaned dataset
for i, row in review_sample.iterrows(): #Print the review, actual sentiment, and polarity score
  print("REVIEW: ", row.review)
  print("Actual Sentiment: ", row.sentiment)
  print('Predicted Sentiment polarity: ', afn.score(row.review)) #Get the AFINN polarity score

REVIEW:  i went and saw this movie last night after being coaxed to by a few friends of mine ill admit that i was reluctant to see it because from what i knew of ashton kutcher he was only able to do comedy i was wrong kutcher played the character of jake fischer very well and kevin costner played ben randall with such professionalism the sign of a good movie is that it can toy with our emotions this one did exactly that the entire theater which was sold out was overcome by laughter during the first half of the movie and were moved to tears during the second half while exiting the theater i not only saw many women in tears but many full grown men as well trying desperately not to let anyone see them crying this movie was great and i suggest that you go see it before you judge
Actual Sentiment:  1
Predicted Sentiment polarity:  -7.0
Actual Sentiment:  1
Predicted Sentiment polarity:  16.0
REVIEW:  christ oh christ one watches stunned incredulous and possibly deranged as this tawdry exer

Now let's see how well the AFINN lexicon predicts the actual sentiment of the reviews (it takes a while to run the code):

In [47]:
import numpy as np

preds = []
for i in df['review'].values: #For each review compute the polarity score, and classify it as positive or negative
    score = afn.score(i)
    if score<=0:
        preds.append(0)
    else:
        preds.append(1)

In [48]:
# Share of correctly predicted sentiments
print(np.mean(np.array(preds) == df.sentiment.values))

0.71312


### VADER

In [49]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

analyser = SentimentIntensityAnalyzer()
review_sample=df.loc[[0,1000,49000]] #Choose some reviews from the cleaned dataset
for i, row in review_sample.iterrows(): #Print the review, actual sentiment, and polarity score
  print("REVIEW: ", row.review)
  print("Actual Sentiment: ", row.sentiment)
  print('Predicted Sentiment polarity: ', analyser.polarity_scores(row.review)) #Get the VADER polarity score 

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/fch/nltk_data...


REVIEW:  i went and saw this movie last night after being coaxed to by a few friends of mine ill admit that i was reluctant to see it because from what i knew of ashton kutcher he was only able to do comedy i was wrong kutcher played the character of jake fischer very well and kevin costner played ben randall with such professionalism the sign of a good movie is that it can toy with our emotions this one did exactly that the entire theater which was sold out was overcome by laughter during the first half of the movie and were moved to tears during the second half while exiting the theater i not only saw many women in tears but many full grown men as well trying desperately not to let anyone see them crying this movie was great and i suggest that you go see it before you judge
Actual Sentiment:  1
Predicted Sentiment polarity:  {'neg': 0.096, 'neu': 0.765, 'pos': 0.139, 'compound': 0.734}
Actual Sentiment:  1
Predicted Sentiment polarity:  {'neg': 0.113, 'neu': 0.718, 'pos': 0.169, 'com

Now let's see how well the VADER lexicon predicts the actual sentiment of the reviews (it takes a while to run the code):

In [50]:
preds = []
for i in df['review'].values: #For each review compute the polarity score, and classify it as positive or negative
    score = analyser.polarity_scores(i)["compound"]
    if score<=0:
        preds.append(0)
    else:
        preds.append(1)

In [51]:
# Share of correctly predicted sentiments
import numpy as np
print(np.mean(np.array(preds) == df.sentiment.values))

0.69684


# 6. Applications (III/III): Topic modelling

Topic modelling is the task of assigning topics to unlabelled text documents

- Our movie review example:
    - Based on the review texts we can assign the movies to genres
    - We cluster all the reviews that contain similar words
        - For example, reviews that contain words like 'horror', 'scared', 'shock', 'blood' may be clustered into the same topic: horror movies

## Latent Dirichlet Allocation (LDA)

We can do topic modelling with [Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2)

- LDA is an unsupervised machine learning algorithm
- Finds groups of words that appear frequently together across several documents
    - The groups of words will then be our topics

### How does it work in practice?

- The LDA algorithm takes a bag of words model as input
- It then outputs two things:
    - a document-to-topic matrix (how strongly each document is associated with each topic)
    - a word-to-topic matrix (how important each word is for each topic)
- We need to define the number of topics beforehand (the number of topics is a hyperparameter)!
    - This choice is somewhat arbitrary
    - Try different numbers of topics and compare the results
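The two output matrices can be read as probability tables. The toy numbers below are invented (two topics, four words, two documents). They only illustrate that each row of the document-to-topic matrix sums to one, that each topic is a distribution over words, and that under the LDA model a document's expected word distribution is the topic-weighted mix of the topic rows. Note that in scikit-learn, `lda.components_` holds unnormalised weights, so each row would have to be divided by its sum to be read this way.

```python
words = ["blood", "scared", "funny", "laugh"]

# Word-to-topic matrix: one row per topic, each row sums to 1
topic_word = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0: looks like "horror"
    [0.05, 0.05, 0.45, 0.45],  # topic 1: looks like "comedy"
]

# Document-to-topic matrix: one row per document, each row sums to 1
doc_topic = [
    [0.9, 0.1],  # mostly horror
    [0.5, 0.5],  # a horror comedy
]

def expected_word_probs(doc_row):
    """Mix the topic rows using the document's topic shares."""
    return [
        sum(share * topic_word[k][j] for k, share in enumerate(doc_row))
        for j in range(len(words))
    ]

for d, row in enumerate(doc_topic):
    probs = expected_word_probs(row)
    print(d, dict(zip(words, (round(p, 3) for p in probs))))
```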

### Let's see how it works in Python

- First we need to make our bag of words:
    - For convenience we use the built-in stop-word library in scikit-learn
    - We set the maximum document frequency to 10 percent to exclude very common words
    - We limit the vocabulary to the 5000 most frequently occurring words
        - This limits the dimensionality of the dataset to ease computation
        
The maximum document frequency and number of words are hyperparameters that you can tune
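What the two hyperparameters do can be sketched in plain Python. The helper `build_vocabulary` is hypothetical and only mimics the idea: drop words that occur in more than a share `max_df` of the documents, then keep the `max_features` most frequent remaining words. The toy corpus uses a much higher threshold than 10 percent because it has only four documents.

```python
from collections import Counter

def build_vocabulary(docs, max_df=1.0, max_features=None):
    """Hypothetical helper: filter by document frequency, then cap the vocabulary size."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # In how many documents does each word occur, and how often in total?
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    term_freq = Counter(w for toks in tokenized for w in toks)
    # Keep words whose share of documents does not exceed max_df
    kept = [w for w in term_freq if doc_freq[w] / n_docs <= max_df]
    # Rank by total count (ties broken alphabetically) and keep the top ones
    kept.sort(key=lambda w: (-term_freq[w], w))
    return kept[:max_features]

docs = [
    "movie scary blood",
    "movie funny laugh laugh",
    "movie scary scary house",
    "movie war history",
]
# "movie" occurs in every document and is removed by max_df=0.5
print(build_vocabulary(docs, max_df=0.5, max_features=3))
```

A word like "movie" says nothing about the topic of a movie review, which is why very common words are excluded before fitting the topic model.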

In [52]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english', max_df=0.1, max_features=5000)
bag = count.fit_transform(df['review'].values)

- Second we fit our LDA estimator to the bag of words
    - We set the number of topics to 10
    - The code may take 5-10 minutes to run

In [53]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, random_state=123) # random_state takes an integer seed that makes the result reproducible
review_topics = lda.fit_transform(bag)

Let's now print the 5 most important words for each topic:

In [54]:
n_top_words = 5
word_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_): # lda.components_ stores a matrix containing the word importance for each topic
    print("Topic %d:" % (topic_idx + 1))
    top_idx = topic.argsort()[:-n_top_words - 1:-1] # indices of the n_top_words most important words, most important first
    print(" ".join(word_names[i] for i in top_idx))

Topic 1:
comedy black action police crime
Topic 2:
book version musical play role
Topic 3:
war american men history country
Topic 4:
role john performance plays actor
Topic 5:
dvd music video watched fun
Topic 6:
kids guy stupid girl school
Topic 7:
house horror woman dead wife
Topic 8:
worst minutes script awful boring
Topic 9:
family feel beautiful performance mother
Topic 10:
series original game effects action


Based on the 5 most important words we may identify the following topics:

1. Action and comedy movies
2. Musicals
3. War movies
4. Reviews somehow related to the quality of acting (not really a movie genre)
5. Movies watched at home (DVD/video)
6. Teen movies
7. Horror movies
8. Bad movies
9. Feel-good or family movies
10. Movies related to series
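To go from topics to individual reviews, one option is to label each review with its highest-weighted topic. The sketch below uses a made-up document-to-topic matrix with three topics. With the fitted model, the rows would come from `review_topics` and the labels from the manual interpretation above.

```python
# Made-up document-to-topic shares (rows: reviews, columns: topics)
toy_review_topics = [
    [0.05, 0.85, 0.10],
    [0.70, 0.20, 0.10],
    [0.30, 0.30, 0.40],
]
topic_labels = ["War movies", "Horror movies", "Bad movies"]

def dominant_topic(row):
    """Index of the topic with the largest share for one review."""
    return max(range(len(row)), key=lambda k: row[k])

for i, row in enumerate(toy_review_topics):
    k = dominant_topic(row)
    print(f"Review {i}: {topic_labels[k]} (share {row[k]:.2f})")
```

With NumPy, `review_topics.argmax(axis=1)` gives the same index for all reviews at once. Reviews like the third one, where no topic clearly dominates, are a reminder that LDA gives each document a mix of topics rather than a single label.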