# A Brief History of Natural Language Processing (NLP)

### Symbolic NLP (1950s – early 1990s)
- Theory / rule based (e.g. Chomskyan theories on "transformational grammar")

### Statistical NLP (1990s–2010s)
- Move towards corpus linguistics, enabled by the availability of large multilingual text corpora

### Neural NLP (present)
- Large datasets + large deep neural networks + large compute power


#### A review of language models from an ethics point of view
  - [On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?](https://dl.acm.org/doi/10.1145/3442188.3445922)

## Use cases
- Text classification
  - Language detection
  - **Sentiment Analysis**
- Text generation 
  - Language translations
  - Chatbots 

## What's so special about text?
- Unlike images, *text* is inherently sequential (with a few language-specific exceptions, of course!). A sentence has a *beginning* and an *end*, with strong "recurring" relationships between each element and the preceding 1) letter(s), 2) word(s), and 3) sentence(s).
- Unlike images, text typically doesn't have a fixed length. The model therefore needs to be flexible enough to process variable-length input.

- **Recurrent neural networks** (RNNs) and **Transformer networks** are common and effective models for capturing this sequential structure of textual *information*.

![](https://github.com/rubrickyard/pytorch-sentiment-analysis/blob/master/assets/sentiment1.png?raw=1)
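
One way to see why recurrence suits variable-length text: the same update is applied token by token, so the loop simply runs for as many steps as there are tokens. The sketch below is a toy illustration, not a trained network. The function name `run_recurrence`, the scalar weights `w_in` and `w_rec`, and the token ids are all made up for the example.

```python
import math

def run_recurrence(token_ids, w_in=0.5, w_rec=0.8):
    """Fold a sequence of any length into one hidden state, one token at a time."""
    hidden = 0.0  # initial state before any token is read
    for token_id in token_ids:
        # The same update rule (same weights) is reused at every position.
        hidden = math.tanh(w_in * token_id + w_rec * hidden)
    return hidden

# Hypothetical token ids for a short and a long sentence.
short = run_recurrence([1, 2])
long = run_recurrence([1, 2, 3, 4, 5, 6])
print(short, long)
```

Both calls return a single number regardless of input length. A real RNN does the same with vectors and learned weight matrices.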

## A bit of AI jargon 
- **Dataset**: A set of files that contain **input** (e.g. sentences in French and English) and **output** (e.g. a human-assigned language label, French or English, for each sentence).

- **Model**: Computer code that takes the input and produces the expected output. This code implements an **algorithm** suited to the particular dataset (text, images, speech, etc.) and the particular task (e.g. language detection).

- **Training**: The phase in which we use the available dataset to try out a bunch of alternative algorithms (aka models) on a certain task. For instance, we provide a sentence and the model tries to predict whether it's in French or English. In theory, there are infinitely many ways to do this. However, some algorithms work better than others, and in this phase we try to find the best one as quickly as possible. This is a difficult and time-consuming process, but once we have an algorithm that performs well enough for our liking, we say we have a **trained model**.

- **Testing / inference**: This is when we (or anyone else) use a **trained model** on **new data**. Our personal uses of Google Translate, Siri, and Alexa are examples of **inference**.
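
To make the four terms concrete, here is a toy language detector. It is only a sketch: the four sentences are invented, and "training" here just counts words per label, which is far simpler than how neural models are trained. The names `train` and `predict` are hypothetical helpers for this illustration.

```python
from collections import Counter

# Dataset: inputs (sentences) paired with human-assigned outputs (labels).
dataset = [
    ("le chat est sur la table", "French"),
    ("je suis très content de ce film", "French"),
    ("the cat is on the table", "English"),
    ("i am very happy with this movie", "English"),
]

def train(examples):
    """Training: count how often each word appears under each label."""
    counts = {}
    for sentence, label in examples:
        counts.setdefault(label, Counter()).update(sentence.lower().split())
    return counts  # the "trained model" is just these counts

def predict(model, sentence):
    """Inference: pick the label whose training words overlap the sentence most."""
    words = sentence.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in model.items()}
    return max(scores, key=scores.get)

model = train(dataset)
# New data the model has not seen during training.
print(predict(model, "the movie is on"))
print(predict(model, "ce chat est content"))
```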


## A sneak peek behind the scenes

1. Parse words from sentences: this is called **tokenization**. A common way to do this is to use a *space* as the separator, but other delimiters can be used depending on the language!

2. Build vocabulary: languages are huge! We typically keep only the most common words. All the infrequent words get a single category of "unknown". 

3. Algorithms need numbers! Assign a number to each word.

![](https://github.com/rubrickyard/pytorch-sentiment-analysis/blob/master/assets/sentiment5.png?raw=1)

4. We do a similar thing for our **output** labels (e.g. French --> 0 vs English --> 1).

5. We use these numbers during our model training and inference.   
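
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the three sentences, the `<unk>` token name, and the `MAX_WORDS` cut-off are made up for the example, and real tokenizers (including the one used later in this notebook) are considerably more sophisticated than splitting on spaces.

```python
from collections import Counter

# Made-up example sentences.
sentences = ["le film est bon", "le film est long", "un film magnifique"]

# 1. Tokenization: split on whitespace.
tokenized = [s.split() for s in sentences]

# 2. Vocabulary: keep only the most common words; the rest map to "<unk>".
MAX_WORDS = 3  # hypothetical cut-off, chosen small for illustration
counts = Counter(word for tokens in tokenized for word in tokens)
vocab = {"<unk>": 0}
for word, _ in counts.most_common(MAX_WORDS):
    vocab[word] = len(vocab)

# 3. Numbers: replace each word with its index in the vocabulary.
def encode(tokens):
    return [vocab.get(word, vocab["<unk>"]) for word in tokens]

# 4. Labels get numbers too.
label_to_id = {"French": 0, "English": 1}

print(vocab)
print(encode("le film est magnifique".split()))
```

Here "magnifique" appears only once, so it falls outside the vocabulary and is encoded with the same index as every other unknown word.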

## Let's walk through a simple example
### Sentiment Analysis (adapted from [here](https://github.com/bentrevett/pytorch-sentiment-analysis) and [here](https://github.com/TheophileBlard/french-sentiment-analysis-with-bert))

#### Goal: Detect whether a review from the Allociné.fr user review database is positive or negative.

#### Dataset: 100k positive and 100k negative reviews divided into 3 balanced splits: train (160k reviews), val (20k) and test (20k).

![AlloCine](https://drive.google.com/uc?export=view&id=1K8g_CB-_P9Cet2PlEfR_nMZB9cd4GFBN)
<!-- ![AlloCine](https://drive.google.com/file/d/1K8g_CB-_P9Cet2PlEfR_nMZB9cd4GFBN/view?usp=sharing) -->

---



## Training review examples:

| Review                                                                              |  Polarity  |
| :---------------------------------------------                                      |----------|
| *Magnifique épopée, une belle histoire, touchante avec des acteurs qui interprètent très bien leur rôles (Mel Gibson, Heath Ledger, Jason Isaacs...), le genre de film qui se savoure en famille!*                                                   |  Positive  |
| *N'étant pas fan de SF, j'ai du mal à commenter ce film. Au moins, dirons nous, il n'y a pas d'effets spéciaux et le thème de ces 3 derniers survivants, un blanc, un maori, une blanche est assez bien traité. Mais c'est quand même bien longuet* !        |  Negative  |
| *Les scènes s'enchaînent de manière saccadée, les dialogues sont théâtraux, le jeu des acteurs ne transcende pas franchement le film. Seule la musique de Vivaldi sauve le tout. Belle déception.*                                                   |  Negative  |


## Review counts
### Actual reviews
![actual reviews](https://drive.google.com/uc?export=view&id=19wPCxMUaeti0hxovuZ6CXY_xpZFiv5KH)
<!-- ![actual_reviews](https://drive.google.com/file/d/19wPCxMUaeti0hxovuZ6CXY_xpZFiv5KH/view?usp=sharing) -->
### Review Binary
![review_binary](https://drive.google.com/uc?export=view&id=1ySQBdHax3cJSq1kuqighLGxacGXQymXU)
<!-- ![review_binary](https://drive.google.com/file/d/19wPCxMUaeti0hxovuZ6CXY_xpZFiv5KH/view?usp=sharing) -->

### Length of reviews
![length_of_reviews](https://drive.google.com/uc?export=view&id=1LhNq1j7f1WNpsqk46zccIFKpH7DwxY5X)
<!-- ![lenght_of_reviews](https://drive.google.com/file/d/1LhNq1j7f1WNpsqk46zccIFKpH7DwxY5X/view?usp=sharing) -->


## In this example, we will **not** train a new model. Instead, we will use a **pre-trained model**: **CamemBERT**, fine-tuned on the AlloCiné data



## Code setup

In [1]:

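# Quote the requirement so the shell does not treat ">" as a redirect.
!pip install "transformers>=4.0"
!pip install sentencepiece

import tensorflow as tf
# Compare the major version as an integer; comparing strings would misorder "10.0" and "2.0".
assert int(tf.__version__.split(".")[0]) >= 2
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from transformers import pipeline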


## Specify pre-trained model: "tblard/tf-allocine"

In [2]:
tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine", use_fast=True)
model = TFAutoModelForSequenceClassification.from_pretrained("tblard/tf-allocine")

nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)


All model checkpoint layers were used when initializing TFCamembertForSequenceClassification.

All the layers of TFCamembertForSequenceClassification were initialized from the model checkpoint at tblard/tf-allocine.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFCamembertForSequenceClassification for predictions without further training.
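
Calling `nlp(...)` on a sentence returns a list with one dictionary per input, each holding a `label` and a `score`. The sketch below shows one way to read that structure. To stay runnable without downloading the model, it uses a stand-in result with a made-up score rather than a real prediction, and `describe` is a hypothetical helper, not part of `transformers`.

```python
def describe(result):
    """Turn pipeline-style output into a short readable string."""
    top = result[0]  # one dict per input sentence; take the first
    return f'{top["label"]} ({top["score"]:.1%} confidence)'

# Stand-in for `nlp("Ce film est magnifique")`; the score here is made up.
fake_result = [{"label": "POSITIVE", "score": 0.98}]
print(describe(fake_result))
```

With the model loaded, `describe(nlp("..."))` should work the same way on real predictions.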


## Create a user interface to try new sentences!

In [12]:
#@title
import ipywidgets as widgets
from IPython.display import display

class Color:   
   GREEN = '\033[92m'
   RED = '\033[91m'
   BOLD = '\033[1m'   
   END = '\033[0m'

button = widgets.Button(
    description='CLASSIFY !',
    button_style='primary'
  )

text_area = widgets.Textarea(
    value='',
    placeholder='Type something',
    description='',
    disabled=False
)
output = widgets.Output()

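def on_button_clicked(b):
  # Run the sentiment pipeline on whatever is currently typed in the text area.
  text = text_area.value
  result = nlp(text)
  print(f'results: {result}')
  # The pipeline returns one dict per input; read the label of the first.
  prediction = result[0]["label"]

  # Green for positive reviews, red otherwise.
  if prediction == "POSITIVE":
    color = Color.GREEN
  else:
    color = Color.RED

  # Print the coloured label and the original text inside the output widget.
  with output:
    print(Color.BOLD + color + f'{prediction}: ' + Color.END + f'"{text}"')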

button.on_click(on_button_clicked)
display(text_area, button, output)


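
If the widgets are not available (for example in a plain terminal), the same colouring logic can be pulled into a small function. This is a sketch under that assumption: `format_prediction` is a hypothetical helper, the two sample reviews are made up, and the labels are passed in by hand instead of coming from the model.

```python
# ANSI escape codes, matching the ones used in the Color class above.
GREEN, RED, BOLD, END = "\033[92m", "\033[91m", "\033[1m", "\033[0m"

def format_prediction(label, text):
    """Return the review prefixed with its label, coloured green or red."""
    color = GREEN if label == "POSITIVE" else RED
    return f'{BOLD}{color}{label}: {END}"{text}"'

print(format_prediction("POSITIVE", "Un très beau film"))
print(format_prediction("NEGATIVE", "Bien longuet"))
```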