<center><h1>BUSS6002 - Data Science in Business</h1></center>

#### Pre-Tutorial Checklist

1. Complete Task 1 and Task 2 from Week 10

# Tutorial 10 - Text Analytics

## Text Analytics

Often the data that we need to analyse is textual. Text data is incompatible with the models we have discussed so far because those models require numeric values. For example, we don't have a direct numeric distance between the words "hello" and "friend".

Moreover, a single data point such as a tweet or Facebook post can contain a large number of words, so we need a way to convert text data into a flexible numeric representation. This representation should tell us which words occurred and how many times.


## Bag-of-Words

The bag-of-words (BoW) model is a simple method of transforming strings into a numeric representation. BoW treats each word as a feature, and the value of the feature is the number of times the word occurs.


For example the string

    "The quick brown fox jumps over the lazy dog"
    
would be transformed into

| the | quick | brown | fox | jumps | over | lazy | dog |
|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Note that:
- the word "the" occurs twice so its count is 2
- other words are unique so they only occur once
- the number of features is the number of unique words
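
The counting step can be sketched by hand with the standard library. This is only an illustration of the idea, not how scikit-learn implements it; the tokenisation rule (lowercase, then split on runs of word characters) is an assumption made for this example.

```python
import re
from collections import Counter

text = "The quick brown fox jumps over the lazy dog"

# Assumed tokenisation: lowercase everything, then keep runs of word characters
tokens = re.findall(r"\w+", text.lower())

# Counter maps each unique word (feature) to its number of occurrences
bow = Counter(tokens)

print(bow["the"])  # count for "the"
print(len(bow))    # number of features = number of unique words
```

Lowercasing is what makes "The" and "the" count as the same feature.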

To create a BoW set of features, we can use scikit-learn's `CountVectorizer`.

<div style="margin-bottom: 0px;"><img width=20 style="display: block; float: left;  margin-right: 20px;" src="img/docs.png"> <h3 style="padding-top: 0px;">Documentation - CountVectorizer</h3></div>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()

corpus = ["The quick brown fox jumps over the lazy dog",
          "Another dog ran after the fox"]

X = count_vectorizer.fit_transform(corpus)

X is a sparse matrix. To view it, we need to convert it to a dense matrix:

In [2]:
print(X.todense())

[[0 0 1 1 1 1 1 1 1 0 2]
 [1 1 0 1 1 0 0 0 0 1 1]]


In [8]:
print(X)
X.shape

  (0, 10)	2
  (0, 8)	1
  (0, 2)	1
  (0, 4)	1
  (0, 5)	1
  (0, 7)	1
  (0, 6)	1
  (0, 3)	1
  (1, 10)	1
  (1, 4)	1
  (1, 3)	1
  (1, 1)	1
  (1, 9)	1
  (1, 0)	1


(2, 11)

The matrix isn't that helpful by itself. Let's look at the corresponding feature names (words):

In [3]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(count_vectorizer.get_feature_names_out().tolist())

['after', 'another', 'brown', 'dog', 'fox', 'jumps', 'lazy', 'over', 'quick', 'ran', 'the']


Let's combine everything into a DataFrame for clarity. You don't always have to do this.

In [7]:
import pandas as pd

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
features = pd.DataFrame(X.todense(), columns=count_vectorizer.get_feature_names_out())

features

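| | after | another | brown | dog | fox | jumps | lazy | over | quick | ran | the |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 2 |
| 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |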



## TF-IDF

In text data there will be lots of repeated words such as "a", "is" and "the" that aren't very useful for telling documents apart. We should down-weight them as much as possible.

Term Frequency–Inverse Document Frequency (TF-IDF) is a weighting procedure for BoW data. The TF-IDF weights boost the counts of uncommon words (which tend to be more useful) and shrink the magnitude of common words. There are two components to the TF-IDF weights, and each of these can be calculated in different ways:

- Term Frequency, often the _raw count_ of a term in a document $tf = f_D$. Other possibilities are boolean (1 if the term appears, otherwise 0), length adjusted ($tf = \frac{f_D}{n_{words}}$) or logarithmic ($tf = \log(1+f_D)$).

- Inverse Document Frequency, or a measure of the information contained in a word. This is a penalty for commonly used words like 'a' and 'the'. It's the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient). $idf = \log(\frac{N}{n_D})$ where $N$ is the number of documents and $n_D$ is the number of documents in which the word appears.
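
The four term-frequency variants listed above can be compared on a single word. This is a minimal sketch; the variable names are chosen for this example, and the document is split on whitespace after being lowercased by hand.

```python
import math
from collections import Counter

# One document, already lowercased and split into words
tokens = "the quick brown fox jumps over the lazy dog".split()
counts = Counter(tokens)
n_words = len(tokens)

f_D = counts["the"]  # raw count of the term in the document

tf_raw = f_D                      # raw count
tf_boolean = 1 if f_D > 0 else 0  # 1 if the term appears, otherwise 0
tf_length = f_D / n_words         # length adjusted
tf_log = math.log(1 + f_D)        # logarithmic

print(tf_raw, tf_boolean, round(tf_length, 4), round(tf_log, 4))
```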

The TF-IDF score is calculated as follows:

$$tfidf = tf \cdot idf $$
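
As a rough illustration of the formula, the sketch below computes $tf \cdot idf$ with raw counts for $tf$ and $idf = \log(\frac{N}{n_D})$, using the natural logarithm. The two-document corpus and the helper names (`tokenize`, `tf`, `idf`, `tfidf`) are assumptions made for this example only.

```python
import math
import re
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Another dog ran after the fox",
]

def tokenize(text):
    # Assumed tokenisation: lowercase, then keep runs of word characters
    return re.findall(r"\w+", text.lower())

doc_counts = [Counter(tokenize(doc)) for doc in corpus]
N = len(corpus)  # number of documents

def tf(word, doc_index):
    # Term frequency as the raw count of the word in one document
    return doc_counts[doc_index][word]

def idf(word):
    # n_D: number of documents in which the word appears
    # (the word must occur in at least one document, otherwise n_D is 0)
    n_D = sum(1 for counts in doc_counts if word in counts)
    return math.log(N / n_D)

def tfidf(word, doc_index):
    return tf(word, doc_index) * idf(word)

print(tfidf("the", 0))    # "the" is in every document, so idf = log(2/2) = 0
print(tfidf("quick", 0))  # "quick" is in one document, so idf = log(2/1)
```

A word that appears in every document gets a weight of zero under this definition, no matter how often it occurs.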

## TF-IDF in Sklearn

Let's vectorise a collection of documents. Notice that each line is treated as a document in this case, so our corpus is a total of 4 documents. 

<div style="margin-bottom: 0px;"><img width=20 style="display: block; float: left;  margin-right: 20px;" src="img/docs.png"> <h3 style="padding-top: 0px;">Documentation - TfidfVectorizer</h3></div>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [42]:
corpus = ["The quick brown fox jumps over the lazy dog",
          "Another dog ran after the fox",
          "The world is turning",
          "Hello world"
         ]

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectoriser = TfidfVectorizer()

X = tfidf_vectoriser.fit_transform(corpus)

In [12]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
features = pd.DataFrame(X.todense(), columns=tfidf_vectoriser.get_feature_names_out())

features

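| | after | another | brown | dog | fox | hello | is | jumps | lazy | over | quick | ran | the | turning | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.356398 | 0.280988 | 0.280988 | 0.0 | 0.0 | 0.356398 | 0.356398 | 0.356398 | 0.356398 | 0.0 | 0.454968 | 0.0 | 0.0 |
| 1 | 0.463709 | 0.463709 | 0.0 | 0.365594 | 0.365594 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.463709 | 0.29598 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57458 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.366747 | 0.57458 | 0.453005 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.785288 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.61913 |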
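
These values are not plain $tf \cdot \log(\frac{N}{n_D})$. With its default settings, `TfidfVectorizer` uses a smoothed inverse document frequency, $idf = \ln(\frac{1+N}{1+n_D}) + 1$, and then rescales each row to unit Euclidean length. The sketch below reproduces that calculation by hand, under those assumptions, using only the standard library. The helper names are made up for this example.

```python
import math
import re
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Another dog ran after the fox",
    "The world is turning",
    "Hello world",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(doc)) for doc in corpus]
n_docs = len(corpus)

# Document frequency: number of documents containing each word
doc_freq = Counter(word for c in counts for word in c)

def sklearn_style_tfidf(doc_counts):
    # Raw count times smoothed idf, then scale the row to unit Euclidean length
    raw = {
        word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, tf in doc_counts.items()
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {word: v / norm for word, v in raw.items()}

weights = [sklearn_style_tfidf(c) for c in counts]

# Compare with the last row of the table ("Hello world")
print(round(weights[3]["hello"], 6), round(weights[3]["world"], 6))
```

Because of the `+ 1` in the smoothed idf, a word that appears in every document still receives a non-zero weight.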


<div style="margin-bottom: 30px;"><img width=48 style="display: block; float: left;  margin-right: 20px;" src="img/question-mark-button.png"> <h3 style="padding-top: 15px;">Exercise 1 - TF-IDF weights calculation</h3></div>

In the corpus with two documents:

``On the 24th of February, 1815, the look–out at Notre–Dame de la Garde signalled the three–master, the Pharaon from Smyrna, Trieste, and Naples.``

``As usual, a pilot put off immediately, and rounding the Chateau d’If, got on board the vessel between Cape Morgion and Rion island.``

Find the TF-IDF weights for the words ``the`` and ``Trieste`` by code or by hand. For term frequency, try Boolean weights.