# Natural Language Processing
**Natural Language Processing (NLP)** lies at the intersection of Artificial Intelligence and Linguistics and aims to enable computers to understand natural language data such as text and speech. Tasks like [Speech Recognition](https://en.wikipedia.org/wiki/Speech_recognition), [Machine Translation](https://en.wikipedia.org/wiki/Machine_translation), [Text-to-speech](https://en.wikipedia.org/wiki/Speech_synthesis) and [Part-of-speech Tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging) are just some of NLP's branches.

<p align="center">
  <img src="https://miro.medium.com/max/866/1*Ldc8IUYbFiAU83HIfJbtNQ.png"/>
</p>

Historically, the Turing test can be considered a starting point in the realm of Natural Language Processing. Some early single-purpose systems like [SHRDLU](https://en.wikipedia.org/wiki/SHRDLU) and [PARRY](https://en.wikipedia.org/wiki/PARRY) were built with rule-based methods.

NLP has gone through two revolutions. The first happened in the late 1980s with the introduction of machine learning, which brought statistical models and led to remarkable successes, especially in machine translation. Deep learning methods, introduced in the 2010s, outperformed the previous methods and are therefore considered the second revolution in NLP.

In this notebook, we will study the main challenges and problem-solving approaches in NLP and introduce some related Python libraries.

##### Contents:
- [Challenges](#challenges)
    - [Similar Words and Homophones](#homophones)
    - [Sentence Boundary Detection](#sbd)
    - [Ambiguity](#ambiguity)
- [Approaches](#approaches)
    - [Rule-Based Methods](#rule-based)
    - [Machine Learning Methods](#machine-learning)
    - [Deep Learning Methods](#deep-learning)
- [Useful Links](#links)

<a id="challenges"></a>
## Challenges

There are a number of challenges and limitations in NLP that we should be aware of. In this section, we will study some of them; several are not completely solved yet.

<a id="homophones"></a>
### Similar Words and Homophones
The same word can have different meanings depending on the context of a sentence. For example, *apple* can refer to both the fruit and the company. Another example is "*He can can a can!*", which contains the same word "*can*" with three different meanings. Humans can infer the meaning from the context, but telling these meanings apart may be challenging for a computer.

As another case, consider [homophones](https://en.wikipedia.org/wiki/Homophone), which are words or phrases that share the same pronunciation while having different meanings: words like "*by*", "*bye*" and "*buy*" and phrases like "*some others*" and "*some mothers*". Distinguishing homophones is sometimes hard even for people.

<a id="sbd"></a>
### Sentence Boundary Detection
One of the challenges in NLP is deciding where sentences begin and end. This is mostly because punctuation marks can be ambiguous. For example, if we simply define a full stop as the end of a sentence, we run into counterexamples, since this character may also belong to an abbreviation or a decimal number. Rule-based and deep learning approaches are used to solve this problem.
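
To make the full-stop problem concrete, here is a minimal sketch that compares a naive splitter with a small rule-based one. The sample text, the abbreviation list and both function names are assumptions made for illustration; a real system would need far more rules.

```python
import re

# Hypothetical sample text; the abbreviation list is illustrative, not exhaustive.
TEXT = "Dr. Smith paid 3.5 dollars. He left at noon."
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc"}


def naive_split(text):
    """Treat every full stop as a sentence boundary."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def rule_based_split(text):
    """Split at '.', '!' or '?' followed by whitespace,
    unless the word before the mark is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Dr." does not end a sentence
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])  # keep the final sentence
    return sentences


print(naive_split(TEXT))
print(rule_based_split(TEXT))
```

The naive version breaks the text after "Dr" and inside "3.5", while the rule-based version returns the two intended sentences, because a full stop only counts as a boundary when whitespace follows it and the preceding word is not a listed abbreviation.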

<a id="ambiguity"></a>
### Ambiguity
Sometimes a group of words can have two or more interpretations. Consider the following statement:
**<center>I saw a man on a hill with a telescope.</center>**

which can mean "*There was a man on the hill and I saw him using my telescope*", but can also be interpreted as "*I saw a man on the hill and he had a telescope*". These ambiguities are sometimes hard to resolve since they have to be interpreted according to the context. Part-of-speech tagging is one NLP technique which can help with this problem.

The challenges above are just some examples of the existing challenges in NLP. Irony and sarcasm, colloquialisms and slang are other examples. For further information, you can check out the links provided in the [Useful Links](#links) section.

<a id="approaches"></a>
## Approaches

<a id="rule-based"></a>
### Rule-based Methods
[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and [context-free grammars](https://en.wikipedia.org/wiki/Context-free_grammar) are well-known rule-based methods which can be useful for tasks like [parsing](https://en.wikipedia.org/wiki/Parsing). Let's consider search queries for plane tickets. A suggested context-free grammar for parsing these queries is provided below:

<p align="center">
  <img src="https://cdn.builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/national/parsing-natural-language-processing.png"/>
</p>

**<center> S &#8594; SHOW FLIGHTS ORIGIN DESTINATION DEPARDATE | ... </center>**
**<center> SHOW &#8594; Show me | I want | Can I see | ... </center>**
**<center> FLIGHTS &#8594; (a) flight | flights </center>**
**<center> ORIGIN &#8594; from CITY </center>**
**<center> DESTINATION &#8594; to CITY </center>**
**<center> CITY &#8594; Boston | Denver | ... </center>**
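
Because this particular grammar has no recursion, its rules can be sketched as nested regular expressions. The snippet below is only an illustration under stated assumptions: it covers just the alternatives spelled out above, it leaves out DEPARDATE because the grammar does not define it, and `parse_query` is a hypothetical helper name.

```python
import re

# Each non-terminal becomes a regex fragment. Only the alternatives written out
# in the grammar are included; DEPARDATE is omitted because it is not defined there.
SHOW = r"(?:show me|i want|can i see)"
FLIGHTS = r"(?:(?:a )?flight|flights)"
CITY = r"(?:boston|denver)"
ORIGIN = rf"from (?P<origin>{CITY})"
DESTINATION = rf"to (?P<destination>{CITY})"
QUERY = re.compile(rf"{SHOW} {FLIGHTS} {ORIGIN} {DESTINATION}", re.IGNORECASE)


def parse_query(query):
    """Return origin and destination if the query matches the grammar, else None."""
    match = QUERY.fullmatch(query.strip())
    if match is None:
        return None
    return {
        "origin": match["origin"].title(),
        "destination": match["destination"].title(),
    }


print(parse_query("Show me flights from Boston to Denver"))
print(parse_query("Show me flights from Boston to Paris"))  # Paris is not in CITY
```

The second query is rejected only because "Paris" is missing from the CITY rule.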

There are some problems with rule-based methods. First, the rules must be written manually, and the person who defines them probably needs strong linguistic skills. Second, rule-based methods do not scale well: imagine how hard it would be to list every city name in the CITY rule of the example above. On the other hand, rule-based methods usually achieve high accuracy if the rules are defined precisely.


<a id="machine-learning"></a>
### Machine Learning Methods

This follows the same workflow you have seen in other machine learning tasks. First, we need a dataset, which is usually a corpus. Then we do some feature engineering to find features related to the task, for example *Does this word begin with a capital letter?* or *What words came before and after this word?*. Finally, a model such as a [naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) or a [random forest](https://en.wikipedia.org/wiki/Random_forest) is trained on these features.
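
As a rough sketch of what such hand-crafted features could look like, the two example questions above can be turned into a feature dictionary for one token. The function name `token_features`, the feature names and the `<START>`/`<END>` padding markers are assumptions made for this illustration.

```python
def token_features(tokens, index):
    """Build a feature dictionary for the token at position `index`."""
    word = tokens[index]
    return {
        # Does this word begin with a capital letter?
        "is_capitalized": word[:1].isupper(),
        # What words came before and after this word?
        "previous_word": tokens[index - 1].lower() if index > 0 else "<START>",
        "next_word": tokens[index + 1].lower() if index < len(tokens) - 1 else "<END>",
    }


tokens = "Yesterday Alice flew to Denver".split()
print(token_features(tokens, 4))
```

Dictionaries of this shape are the kind of input that NLTK classifiers are trained on.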

In the following cells we will build a [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) classifier using Python. Along the way, we will introduce the [**Natural Language Toolkit (NLTK)**](https://www.nltk.org/), which contains many useful classes and functions for NLP tasks.

In [None]:
# Loading dataset
import nltk
nltk.download('movie_reviews')  
from nltk.corpus import movie_reviews
from random import shuffle

movie_reviews.categories()
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
# Each document is now stored as a tuple: (list of words, label)
shuffle(documents)
documents[0]

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


(['you',
  'know',
  'the',
  'plot',
  ':',
  'a',
  'dimwit',
  'with',
  'a',
  'shady',
  'past',
  'is',
  'seduced',
  'into',
  'committing',
  'a',
  'crime',
  'only',
  'to',
  'be',
  'double',
  '-',
  'crossed',
  'by',
  'a',
  'fatal',
  'femme',
  '.',
  'in',
  '"',
  'palmetto',
  ',',
  '"',
  'the',
  'dimwit',
  'is',
  'harry',
  'barber',
  '(',
  'woody',
  'harrelson',
  ')',
  ',',
  'a',
  'reporter',
  'who',
  "'",
  's',
  'just',
  'been',
  'released',
  'from',
  'prison',
  '(',
  'he',
  'was',
  'framed',
  'by',
  'the',
  'gangsters',
  'and',
  'corrupt',
  'officials',
  'he',
  'was',
  'investigating',
  ')',
  '.',
  'enter',
  'la',
  'femme',
  ':',
  'rhea',
  'malroux',
  '(',
  'elisabeth',
  'shue',
  ')',
  ',',
  'the',
  'sexy',
  'young',
  'wife',
  'of',
  'the',
  'richest',
  'man',
  'in',
  'palmetto',
  ',',
  'florida',
  '(',
  'rolf',
  'hoppe',
  ')',
  '.',
  'she',
  'and',
  'her',
  'stepdaughter',
  'odette',
  '(',
 

In [None]:
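# Selecting 3000 words to be word features (removing punctuation and stopwords).
# Note: list(all_words) keeps first-seen order, so this takes the first 3000 distinct
# words, not the 3000 most frequent ones; all_words.most_common(3000) would rank by frequency.
nltk.download("stopwords")
stopwords = set(nltk.corpus.stopwords.words("english"))
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words() if w.isalpha() and w.lower() not in stopwords)
word_features = list(all_words)[:3000]
word_features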

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['plot',
 'two',
 'teen',
 'couples',
 'go',
 'church',
 'party',
 'drink',
 'drive',
 'get',
 'accident',
 'one',
 'guys',
 'dies',
 'girlfriend',
 'continues',
 'see',
 'life',
 'nightmares',
 'deal',
 'watch',
 'movie',
 'sorta',
 'find',
 'critique',
 'mind',
 'fuck',
 'generation',
 'touches',
 'cool',
 'idea',
 'presents',
 'bad',
 'package',
 'makes',
 'review',
 'even',
 'harder',
 'write',
 'since',
 'generally',
 'applaud',
 'films',
 'attempt',
 'break',
 'mold',
 'mess',
 'head',
 'lost',
 'highway',
 'memento',
 'good',
 'ways',
 'making',
 'types',
 'folks',
 'snag',
 'correctly',
 'seem',
 'taken',
 'pretty',
 'neat',
 'concept',
 'executed',
 'terribly',
 'problems',
 'well',
 'main',
 'problem',
 'simply',
 'jumbled',
 'starts',
 'normal',
 'downshifts',
 'fantasy',
 'world',
 'audience',
 'member',
 'going',
 'dreams',
 'characters',
 'coming',
 'back',
 'dead',
 'others',
 'look',
 'like',
 'strange',
 'apparitions',
 'disappearances',
 'looooot',
 'chase',
 'scenes'

In [None]:
# We simply define 3000 word features indicating whether document contains that word or not
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[f'contains ({word})'] = (word in document_words)
    return features

In [None]:
# Using naive Bayes classifier with a 90% / 10% train-test split
final_dataset = [(extract_features(d), c) for (d, c) in documents]
split = int(0.9 * len(final_dataset))
train_set, test_set = final_dataset[:split], final_dataset[split:]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(classifier, test_set)

0.84

As you can see, we achieved an accuracy of 84% using simple features and without any parameter tuning! Now let's see which words are the most informative features.

In [None]:
classifier.show_most_informative_features(10)

Most Informative Features
        contains (sucks) = True              neg : pos    =     10.0 : 1.0
contains (unimaginative) = True              neg : pos    =      8.5 : 1.0
       contains (annual) = True              pos : neg    =      8.2 : 1.0
      contains (frances) = True              pos : neg    =      7.5 : 1.0
  contains (silverstone) = True              neg : pos    =      7.1 : 1.0
   contains (schumacher) = True              neg : pos    =      7.1 : 1.0
    contains (atrocious) = True              neg : pos    =      6.7 : 1.0
     contains (chambers) = True              neg : pos    =      6.4 : 1.0
       contains (crappy) = True              neg : pos    =      6.4 : 1.0
       contains (turkey) = True              neg : pos    =      6.4 : 1.0


<a id="deep-learning"></a>
### Deep Learning Methods
Neural network architectures are now widely used in different NLP tasks. [Recurrent Neural Networks (RNNs)](https://en.wikipedia.org/wiki/Recurrent_neural_network) are able to process sequential information. Many-to-one RNNs can be used for text classification problems, one-to-many RNNs are suited to text generation tasks, and many-to-many RNNs are useful in machine translation.

<p align="center">
  <img src="https://miro.medium.com/max/875/1*rqPCvf3mRrl9pGKW76wcTw.jpeg"/>
</p>
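
To illustrate the many-to-one pattern, here is a deliberately tiny sketch of a recurrent layer with a single hidden unit, written in plain Python. The weights are arbitrary, untrained values chosen for illustration, and `rnn_many_to_one` is a hypothetical name; real models use vector-valued states and learned weights.

```python
import math


def rnn_many_to_one(inputs, w_x, w_h, w_y, h0=0.0):
    """Run a one-unit recurrent layer over the sequence and emit a single output."""
    h = h0
    for x in inputs:
        # The same weights are reused at every step; h carries information forward.
        h = math.tanh(w_x * x + w_h * h)
    return w_y * h  # only the final hidden state is used: many inputs, one output


# The same values in a different order give a different result,
# which is what "processing sequential information" means here.
print(rnn_many_to_one([1.0, 0.0], w_x=1.0, w_h=0.5, w_y=1.0))
print(rnn_many_to_one([0.0, 1.0], w_x=1.0, w_h=0.5, w_y=1.0))
```

A one-to-many or many-to-many variant would emit an output inside the loop at each step instead of only after the last one.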

Other approaches like [Long Short-term Memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory), the [Attention Mechanism](https://en.wikipedia.org/wiki/Attention_(machine_learning)) and [Deep Generative Models](https://towardsdatascience.com/deep-generative-models-25ab2821afd3) are also used in different NLP tasks.

Now it's time to see how deep models can be useful for representing words. In NLP tasks we usually need to represent words numerically, e.g., as vectors. The [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) approach, which doesn't use neural networks, measures the significance of each word in a document using its frequency in that document and in the whole corpus. However, it doesn't capture similarities between words. Furthermore, the vectors are high-dimensional since every word in the vocabulary is a feature.
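
The idea behind TF-IDF can be sketched in a few lines. The version below uses one common weighting variant (term frequency divided by document length, and the logarithm of the inverse document frequency); other variants exist. The toy corpus and the function name `tf_idf` are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
corpus = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]


def tf_idf(document, corpus):
    """Score each word of `document` (which must belong to `corpus`) with tf * idf,
    where tf = count / document length and idf = log(N / document frequency)."""
    counts = Counter(document)
    scores = {}
    for word, count in counts.items():
        tf = count / len(document)
        df = sum(1 for doc in corpus if word in doc)  # documents containing the word
        scores[word] = tf * math.log(len(corpus) / df)
    return scores


scores = tf_idf(corpus[0], corpus)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.3f}")
```

In the first document, "plot" gets the highest score even though "good" occurs more often, because "plot" appears in no other document. The scores say nothing about "good" and "bad" being related words, which is the limitation mentioned above.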

Word2Vec is an alternative approach which uses neural networks to learn word embeddings. It can discover similarities between words, such that semantically close words have similar embeddings. The vector size is much smaller than the vocabulary size and is usually chosen according to the corpus size.
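
"Similar embeddings" is usually measured with cosine similarity, which is also the score gensim's `most_similar` reports. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not trained embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for this example only.
embeddings = {
    "ship": [0.9, 0.1, 0.3],
    "boat": [0.8, 0.2, 0.4],
    "happy": [0.1, 0.9, 0.2],
}

print(cosine_similarity(embeddings["ship"], embeddings["boat"]))
print(cosine_similarity(embeddings["ship"], embeddings["happy"]))
```

With these vectors, the first similarity is much higher than the second, mirroring the idea that related words end up close together in the embedding space.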

Two well-known Word2Vec architectures are [continuous bag-of-words (CBOW)](https://en.wikipedia.org/wiki/Bag-of-words_model#CBOW) and [skip-gram](https://en.wikipedia.org/wiki/N-gram#Skip-gram). CBOW uses the surrounding words to predict the current word, while skip-gram aims to predict the surrounding words from the current word.

<p align="center">
  <img src="https://miro.medium.com/max/875/1*VTu7IlEOcoqs4B5Xs711Ag.png"/>
</p>
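
The difference between the two architectures is easiest to see in the training examples each one is built from. The following sketch only generates those examples and does not train any network; the helper names and the toy sentence are assumptions for illustration.

```python
def context_window(tokens, index, window):
    """Return the words within `window` positions of tokens[index], excluding it."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: (surrounding words) -> current word."""
    return [(context_window(tokens, i, window), word) for i, word in enumerate(tokens)]


def skip_gram_examples(tokens, window=2):
    """Skip-gram: current word -> each surrounding word."""
    return [(word, context)
            for i, word in enumerate(tokens)
            for context in context_window(tokens, i, window)]


tokens = ["the", "ship", "left", "the", "island"]
print(cbow_examples(tokens, window=1)[1])
print(skip_gram_examples(tokens, window=1)[:3])
```

For the word "ship" with a window of 1, CBOW produces one example with the context `['the', 'left']` as input, whereas skip-gram produces two examples, one per context word.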

In the following cell we will train a Word2Vec model on the *movie_reviews* dataset which was loaded in the previous section.

In [None]:
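from gensim.models import Word2Vec

# Each review (a list of words) is treated as one training "sentence"
documents_words = [doc[0] for doc in documents]
# vector_size is the embedding dimension (gensim >= 4.0; gensim 3.x named this parameter `size`)
model = Word2Vec(sentences=documents_words, vector_size=100, window=5, min_count=1, workers=4)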

Now let's see which words are most similar to the word *ship* according to Word2Vec.

In [None]:
sims = model.wv.most_similar('ship', topn=10)  # get other similar words
sims

[('island', 0.9006446003913879),
 ('plane', 0.8913903832435608),
 ('country', 0.886518120765686),
 ('land', 0.8806939125061035),
 ('room', 0.8713586926460266),
 ('planet', 0.8674205541610718),
 ('floor', 0.8587629199028015),
 ('government', 0.856345534324646),
 ('boat', 0.8548795580863953),
 ('fire', 0.8527746200561523)]

Interesting! As we expected, the list contains words which are semantically related to the word *ship*, such as *island*, *boat* and *plane*.

You can test other words using the same syntax.

<a id="links"></a>
## Useful Links
- [Major challenges of Natural Language Processing (NLP)](https://monkeylearn.com/blog/natural-language-processing-challenges/)

- [Machine Learning vs. Rule Based Systems in NLP](https://medium.com/friendly-data/machine-learning-vs-rule-based-systems-in-nlp-5476de53c3b8)

- [POS Tagging with NLTK and Chunking in NLP](https://www.guru99.com/pos-tagging-chunking-nltk.html)

- [Deep Learning for NLP: An Overview of Recent Trends](https://medium.com/dair-ai/deep-learning-for-nlp-an-overview-of-recent-trends-d0d8f40a776d)

- [Word Embedding Techniques: Word2Vec and TF-IDF Explained](https://towardsdatascience.com/word-embedding-techniques-word2vec-and-tf-idf-explained-c5d02e34d08)

- [RNN in NLP using Python (example)](https://www.codeastar.com/recurrent-neural-network-rnn-in-nlp-and-python-part-2/)