# Natural Language Processing


## What you'll learn in this course 🧐🧐

This course is an introduction to the statistical analysis of text and language. This discipline has recently made enormous scientific progress thanks to developments in deep learning and the exponential growth of computing power. In this course, we will start with some basic NLP principles. We will learn:

* Types of language encoding 
* Basic text preprocessing with N-Grams
* Basic analysis of frequency with TF-IDF


## What is NLP? 🦜🦜

Natural Language Processing, often abbreviated as *NLP*, is the study of how computers and human (natural) languages interact.

### Examples of applications ✅

NLP has grown enormously since its beginnings in the 1950s thanks to the rapid growth of computing power and progress in deep learning. Here are a few popular NLP applications:


* Spell checking, keyword searching, synonym searching 🎯

* Extraction of information from websites (e.g. product prices, dates, names of people, companies, etc.) ℹ️

* Classification:
    * Sentiment Analysis 💖
    * Topic Modeling 🗼

* Translation 🌍

* Speech-controlled systems 💬


### Why language is a very special type of data 🗼

Most of the data that Data Scientists work with is either quantitative (stock price, revenue, etc.) or qualitative (color, gender, etc.). With NLP, we deal with a whole corpus of text, and we need to somehow translate it into numbers so that computers can process it. This comes with a set of challenges to keep in mind:


  * Language is an ambiguous mode of expression. The same word, sentence or text can have completely different interpretations. Generations of researchers continue to study works, quotations and even single words, because language leaves so much room for interpretation.

  * Interpretation of language depends on situational context, the surrounding real world, common sense, and cultural and social norms.

All this makes NLP a fascinating subject in which data scientists have made tremendous progress. 🚀

## Preprocessing in NLP 🚧🚧


In order to analyze text with Python, we first need to preprocess the data so that it can be easily handled by a computer. Let's quickly go through the different text preprocessing techniques that can be used for Machine Learning:

<table>
<tr>
<th>
    Preprocessing type
</th>
<th>
    Description
</th>
</tr>

<tr>
<td>
    Lower case
</td>

<td>
Python, like many other computer languages, distinguishes between upper and lower case letters. Thus, *A* is not the same character as *a*. Similarly, the words *Learn* and *learn*, although understood the same way by our human brains, do not represent the same strings of characters in a language like Python. We will therefore generally replace all characters in a text with their lowercase equivalent before starting more advanced processing.
</td>
</tr>

<tr>
<td>
    Punctuation
</td>

<td>
It's rare to analyze punctuation when it comes to text mining. We will usually remove all punctuation so that we only have words to analyze.
</td>
</tr>

<tr>
<td>
    Stop words
</td>

<td>
<a href="https://en.wikipedia.org/wiki/Stop_word" target="_blank">Stop words</a> are the linking words, articles and quantifiers that are used a lot in a language but carry little meaning in themselves. For example, in English: *a*, *the*, *and*, *any*, etc.

<br/>

They are generally removed because they are so frequent in texts that they mask the patterns that could be used to detect the words that really characterize a text.
</td>
</tr>


<tr>
<td>
    Common words
</td>

<td>
For some analyses, we will remove common words. For example, if you want to perform topic modeling related to traveling in Italy, you will probably remove the vocabulary that commonly describes Italy so that your algorithm will focus on the real differences between topics.

</td>
</tr>


<tr>
<td>
    Rare words
</td>

<td>
Conversely, words that appear too rarely in texts may be useless: with so few occurrences, their connection with other words in the text is unreliable and mostly creates noise.

</td>
</tr>

<tr>
<td>
    Typos
</td>

<td>
In order to unify spelling, Python packages are often used to correct typos.

</td>
</tr>


<tr>
<td>
    Stemming (Rooting)
</td>

<td>
Many words in a language are simply variations of a common root. Such a variation is called an *inflection*. <a href="https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html" target="_blank">Stemming</a> chops off inflections in order to keep only the common root of the words, so that different forms of the same word can be counted together. For example:

<ul>
<li>Cats → Cat</li>
<li>Caresses → Caress</li>
<li>Ponies → Poni</li>
</ul>
</td>
</tr>

<tr>
<td>
    Lemmatization
</td>

<td>
<a href="https://queryunderstanding.com/stemming-and-lemmatization-6c086742fe45" target="_blank">Lemmatization</a> is a form of "clever stemming". Instead of simply chopping off inflections, it transforms words based on lexical knowledge. For example, an aggressive stemmer could perform these transformations:

<ul>
<li>Car → Car</li>
<li>Cars → Car</li>
<li>Caring → Car</li>
<li>Care → Car</li>
</ul>

Whereas Lemmatization would do something smarter: 

<ul>
<li>Car → Car</li>
<li>Cars → Car</li>
<li>Caring → Care</li>
<li>Care → Care</li>
</ul>
</td>
</tr>

</table>
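
The first steps of the table can be sketched in a few lines of plain Python. This is only a minimal illustration: the stop-word list `STOP_WORDS` is a tiny hypothetical sample (real lists are much longer), and `strip_plural` is a toy stemmer that applies only the plural rules of the first step of the Porter stemming algorithm, which is enough to reproduce the *Cats*, *Caresses* and *Ponies* examples above.

```python
import string

# Hypothetical, deliberately tiny stop-word list (real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "any", "but", "does", "not"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]


def strip_plural(word):
    """Toy stemmer: only the plural rules of Porter's step 1a."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


tokens = preprocess("The black cat eats fish, but does not eat other black cats.")
print(tokens)
print([strip_plural(token) for token in tokens])
```

In practice you would rather rely on a dedicated NLP library for stop words, stemming and lemmatization; the point here is only to show what each step does to the text.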

### N-grams 🏘️

When doing text mining, you will need to separate a document into groups of words. We therefore won't let the computer deal with a raw string of characters of the form:

"The cat eats the fish."

Instead, a document is usually represented as *N-grams*. N-grams are a way of breaking down the text into groups of "N" consecutive words. For example, the 1-grams (or unigrams) of the previous text, after lowercasing and removing punctuation, look like this:

<table>
  <tr>
    <td>“the”</td>
    <td>“cat”</td>
    <td>“eats”</td>
    <td>“the”</td>
    <td>“fish”</td>
  </tr>
</table>

2-grams would be:

<table>
  <tr>
    <td>“the cat”</td>
    <td>“cat eats”</td>
    <td>“eats the”</td>
    <td>“the fish”</td>
  </tr>
</table>


N-grams are a very useful tool for capturing the general idea of a text.
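
Here is a minimal sketch of how N-grams can be produced from a list of words. The function name `ngrams` is our own choice for this illustration, and the text is assumed to be already lowercased and stripped of punctuation.

```python
def ngrams(tokens, n):
    """Return all groups of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat eats the fish".split()
print(ngrams(tokens, 1))  # ['the', 'cat', 'eats', 'the', 'fish']
print(ngrams(tokens, 2))  # ['the cat', 'cat eats', 'eats the', 'the fish']
```

If `n` is larger than the number of tokens, the function simply returns an empty list.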


### Parse trees 🌳

Parse trees are tools that allow you to automatically break down simple sentences into their grammatical components.


![](https://drive.google.com/uc?export=view&id=15yqMknQsiaqnubxCQjN4AwnlmcDWRQh1)


To build sentences, we use grammatical rules. 

For example, a sentence can be composed of a noun phrase (*NP*) followed by a verb phrase (*VP*). These rules can be listed, and we can use Python to analyze simple sentences.

This kind of technology helps personal digital assistants such as Siri (Apple), Alexa (Amazon) or Cortana (Microsoft) quickly interpret voice instructions after converting them into text. 🤖

However, when sentences become too complex, this type of analysis does not allow the meaning to be extracted in a way that is intelligible to the computer. **It is better suited to understanding the meaning of short, precise and factual instructions.**
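
To make the idea of grammatical rules concrete, here is a toy sketch of a parser for three hand-written rules: a sentence is a noun phrase followed by a verb phrase, a noun phrase is an optional determiner followed by a noun, and a verb phrase is a verb optionally followed by a noun phrase. The lexicon `LEXICON` and the function names are hypothetical and only cover our example sentence; real parsers rely on much larger grammars or on statistical models.

```python
# Hypothetical toy lexicon: word -> part of speech.
LEXICON = {"the": "Det", "cat": "N", "fish": "N", "eats": "V"}


def parse_np(tokens, i):
    """NP -> Det N | N. Return (tree, next_index) or None."""
    if i < len(tokens) and LEXICON.get(tokens[i]) == "Det":
        if i + 1 < len(tokens) and LEXICON.get(tokens[i + 1]) == "N":
            return ("NP", ("Det", tokens[i]), ("N", tokens[i + 1])), i + 2
        return None
    if i < len(tokens) and LEXICON.get(tokens[i]) == "N":
        return ("NP", ("N", tokens[i])), i + 1
    return None


def parse_vp(tokens, i):
    """VP -> V NP | V. Return (tree, next_index) or None."""
    if i < len(tokens) and LEXICON.get(tokens[i]) == "V":
        verb = ("V", tokens[i])
        obj = parse_np(tokens, i + 1)
        if obj is not None:
            np_tree, j = obj
            return ("VP", verb, np_tree), j
        return ("VP", verb), i + 1
    return None


def parse_sentence(sentence):
    """S -> NP VP. Return the parse tree, or None if the rules do not match."""
    tokens = sentence.lower().rstrip(".").split()
    subject = parse_np(tokens, 0)
    if subject is None:
        return None
    np_tree, i = subject
    predicate = parse_vp(tokens, i)
    if predicate is None:
        return None
    vp_tree, j = predicate
    if j != len(tokens):  # leftover words: the sentence is not covered
        return None
    return ("S", np_tree, vp_tree)


tree = parse_sentence("The cat eats fish.")
print(tree)
```

The result is a nested tuple that mirrors the tree structure: the root `S` contains an `NP` (*the cat*) and a `VP` (*eats fish*). A sentence that does not follow the rules, such as "Eats the cat", returns `None`.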

## Text Mining ⛏️⛏️

Before moving on to advanced Machine Learning methods for language comprehension, let's start by introducing techniques that are very frequently used in linguistics and quantitative marketing to extract information from textual data.


### Term frequency 👨‍👩‍👧

*Term frequency* is simply the number of times a word occurs in a document, divided by the total number of words in that document:

$$
\text{Term frequency} = \frac{\text{occurrences of the word}}{\text{total word occurrences}}
$$
 
For example, consider the following sentence: *The black cat eats fish but does not eat other black cats*. After lemmatization and removal of stop words, the frequency table can be written as follows:


<table>
  <tr>
    <th>Term</th>
    <th>Occurrences</th>
    <th>Term Frequency</th>
  </tr>
  <tr>
    <td>“cat”</td>
    <td>2</td>
    <td>0.25</td>
  </tr>
  <tr>
    <td>“black”</td>
    <td>2</td>
    <td>0.25</td>
  </tr>
  <tr>
    <td>“eat”</td>
    <td>2</td>
    <td>0.25</td>
  </tr>
  <tr>
    <td>“fish”</td>
    <td>1</td>
    <td>0.125</td>
  </tr>
  <tr>
    <td>“other”</td>
    <td>1</td>
    <td>0.125</td>
  </tr>
  <tr>
    <td><strong>Total</strong></td>
    <td><strong>8</strong></td>
    <td><strong>1</strong></td>
  </tr>
</table>
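
The table above can be reproduced with a short sketch. We assume the sentence has already been lemmatized and cleaned of stop words, so the token list `doc1` is written by hand; `term_frequency` is a name chosen for this illustration.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: occurrences / total number of tokens}."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# "The black cat eats fish but does not eat other black cats",
# after lemmatization and stop-word removal (done by hand here).
doc1 = ["black", "cat", "eat", "fish", "eat", "other", "black", "cat"]

tf = term_frequency(doc1)
print(tf)
```

The frequencies of all terms in a document always sum to 1, which is a quick sanity check.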


### Inverse Document Frequency 🙃

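Term frequency alone is usually too *simple* when you need to perform classification, because it says nothing about how specific a word is to a document. We therefore also use the Inverse Document Frequency (IDF): a high IDF means that a word appears in few documents of the corpus and is therefore distinctive. Here is the formula: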


$$
IDF = \log_{10}\left(\frac{N}{n}\right)
$$


Where 

* $N$ is the total number of documents in the corpus.
* $n$ is the number of documents in which the term $t$ appears.

For example, if we consider the following two documents: 

* *the black cat eats fish but does not eat other black cats.* 
* *the giraffe eats leaves thanks to its long neck but does not eat fish.* 


If we calculate the IDF on these examples, we get the following table:


<table>
  <tr>
    <th>Term</th>
    <th>$N$</th>
    <th>$n$</th>
    <th>IDF</th>
  </tr>
  <tr>
    <td>"cat"</td>
    <td>
      <p>2</p>
    </td>
    <td>
      <p>1</p>
    </td>
    <td>0.30</td>
  </tr>
  <tr>
    <td>"black"</td>
    <td>
      <p>2</p>
    </td>
    <td>
      <p>1</p>
    </td>
    <td>0.30</td>
  </tr>
  <tr>
    <td>"eat"</td>
    <td>
      <p>2</p>
    </td>
    <td>
      <p>2</p>
    </td>
    <td>
      <p>0</p>
    </td>
  </tr>
  <tr>
    <td>"fish"</td>
    <td>
      <p>2</p>
    </td>
    <td>
      <p>2</p>
    </td>
    <td>
      <p>0</p>
    </td>
  </tr>
  <tr>
    <td>“other”</td>
    <td>
      <p>2</p>
    </td>
    <td>
      <p>1</p>
    </td>
    <td>0.30</td>
  </tr>
  <tr>
    <td>"giraffe"</td>
    <td>
      <p>2</p>
    </td>
    <td>
      <p>1</p>
    </td>
    <td>0.30</td>
  </tr>
  <tr>
    <td>"leaf"</td>
    <td>
      <p>2</p>
    </td>
    <td>
      <p>1</p>
    </td>
    <td>0.30</td>
  </tr>
  <tr>
    <td>"long"</td>
    <td>
      <p>2</p>
    </td>
    <td>
      <p>1</p>
    </td>
    <td>0.30</td>
  </tr>
  <tr>
    <td>"neck"</td>
    <td>
      <p>2</p>
    </td>
    <td>
      <p>1</p>
    </td>
    <td>0.30</td>
  </tr>
</table>
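
As a sketch, the IDF values of this table can be computed as follows. The two token lists are the documents above after lemmatization and stop-word removal, written by hand (we assume "thanks", "to" and "its" are treated as stop words, as in the table). We use a base-10 logarithm, which is what gives $\log_{10}(2/1) \approx 0.30$.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: log10(N / n)} for every term of the corpus."""
    n_docs = len(documents)
    vocabulary = set().union(*documents)
    idf = {}
    for term in vocabulary:
        # n: number of documents that contain the term at least once
        containing = sum(1 for doc in documents if term in doc)
        idf[term] = math.log10(n_docs / containing)
    return idf


doc1 = ["black", "cat", "eat", "fish", "eat", "other", "black", "cat"]
doc2 = ["giraffe", "eat", "leaf", "long", "neck", "eat", "fish"]

idf = inverse_document_frequency([doc1, doc2])
for term in sorted(idf):
    print(term, round(idf[term], 2))
```

Terms that appear in both documents ("eat", "fish") get an IDF of 0, while terms that appear in only one document get about 0.30.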

### Term frequency - Inverse document frequency 👨‍👩‍👧 🙃

*TF-IDF* is simply $TF \times IDF$. This metric combines $TF$ and $IDF$ and is widely used because it reflects both how frequent a word is in a document and how specific it is to that document within the corpus.

For example, a word that has a high frequency but appears in every document gets a low TF-IDF: its importance is relatively low.

Let's take our documents and apply TF-IDF:

<table>
  <tr>
    <th>Document</th>
    <th>Term</th>
    <th>TF</th>
    <th>IDF</th>
    <th>TF-IDF</th>
  </tr>
  <tr>
    <td>
      <p>1</p>
    </td>
    <td>"cat"</td>
    <td>
      <p>0.25</p>
    </td>
    <td>
      <p>0.30</p>
    </td>
    <td>
      <p>0.075</p>
    </td>
  </tr>
  <tr>
    <td>
      <p>1</p>
    </td>
    <td>“black”</td>
    <td>
      <p>0.25</p>
    </td>
    <td>
      <p>0.30</p>
    </td>
    <td>
      <p>0.075</p>
    </td>
  </tr>
  <tr>
    <td>
      <p>1</p>
    </td>
    <td>“eat”</td>
    <td>
      <p>0.25</p>
    </td>
    <td>
      <p>0</p>
    </td>
    <td>
      <p>0</p>
    </td>
  </tr>
  <tr>
    <td>
      <p>1</p>
    </td>
    <td>“fish”</td>
    <td>
      <p>0.125</p>
    </td>
    <td>
      <p>0</p>
    </td>
    <td>
      <p>0</p>
    </td>
  </tr>
  <tr>
    <td>
      <p>1</p>
    </td>
    <td>“other”</td>
    <td>
      <p>0.125</p>
    </td>
    <td>
      <p>0.30</p>
    </td>
    <td>
      <p>0.037</p>
    </td>
  </tr>
  <tr>
    <td>
      <p>2</p>
    </td>
    <td>"giraffe"</td>
    <td>
      <p>0.143</p>
    </td>
    <td>
      <p>0.30</p>
    </td>
    <td>
      <p>0.043</p>
    </td>
  </tr>
  <tr>
    <td>
      <p>2</p>
    </td>
    <td>"eat"</td>
    <td>
      <p>0.286</p>
    </td>
    <td>
      <p>0</p>
    </td>
    <td>
      <p>0</p>
    </td>
  </tr>
  <tr>
    <td>
      <p>2</p>
    </td>
    <td>"leaf"</td>
    <td>
      <p>0.143</p>
    </td>
    <td>
      <p>0.30</p>
    </td>
    <td>
      <p>0.043</p>
    </td>
  </tr>
  <tr>
    <td>
      <p>2</p>
    </td>
    <td>"long"</td>
    <td>
      <p>0.143</p>
    </td>
    <td>
      <p>0.30</p>
    </td>
    <td>
      <p>0.043</p>
    </td>
  </tr>
  <tr>
    <td>
      <p>2</p>
    </td>
    <td>"neck"</td>
    <td>
      <p>0.143</p>
    </td>
    <td>
      <p>0.30</p>
    </td>
    <td>
      <p>0.043</p>
    </td>
  </tr>
  <tr>
    <td>
      <p>2</p>
    </td>
    <td>"fish"</td>
    <td>
      <p>0.143</p>
    </td>
    <td>
      <p>0</p>
    </td>
    <td>
      <p>0</p>
    </td>
  </tr>
</table>

Given this table, we understand that the most important words in document 1 that distinguish it from the other documents of the corpus are:

* "cat" 
* "black"

In document 2 the most important terms are:

* "giraffe"
* "leaf" 
* "long"
* "neck" 

Conversely, the following words have a low TF-IDF in both documents:

* "eat"
* "fish"

We can therefore conclude that both documents talk about eating fish, but that document 1 focuses on cats, while document 2 focuses on giraffes.
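
The whole computation can be put together in a short sketch. As before, the token lists are assumed to be already lemmatized and cleaned of stop words (written by hand here), `tf_idf` is a name chosen for this illustration, and the logarithm is taken in base 10.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: TF-IDF score} dictionary per tokenized document."""
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log10(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


doc1 = ["black", "cat", "eat", "fish", "eat", "other", "black", "cat"]
doc2 = ["giraffe", "eat", "leaf", "long", "neck", "eat", "fish"]

scores = tf_idf([doc1, doc2])
for number, doc_scores in enumerate(scores, start=1):
    # Sort terms from most to least characteristic of the document.
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    print(number, [(term, round(score, 3)) for term, score in ranked])
```

In document 1, "cat" and "black" come out on top with a score of about 0.075, while "eat" and "fish" score 0 in both documents. Note that library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed, natural-logarithm variant of IDF by default, so their scores will not match this hand computation exactly.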

## Resources 📚📚

* <a href="https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html" target="_blank">Stemming</a>
* <a href="https://queryunderstanding.com/stemming-and-lemmatization-6c086742fe45" target="_blank">Lemmatization</a>
* <a href="https://en.wikipedia.org/wiki/Stop_word" target="_blank">Stop Words</a>
* <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf" target="_blank">TF-IDF</a>
* <a href="https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089" target="_blank">TF-IDF from scratch in python on real world dataset</a>