# Natural Language Processing

<img src="https://codesrevolvewordpress.s3.us-west-2.amazonaws.com/revolveai/2022/05/15110810/natural-language-processing-techniques.png" width=550>

### What is Natural Language Processing?

> #### Field of study focused on making sense of language using statistics and computers
> Natural language processing (NLP) is a subfield of linguistics, computer science (CS), and artificial intelligence (AI) concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
<br><br>
> Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing.
<br><br>
> **Methods**: Rules, statistics, neural networks

# Regular Expressions: Regexes in Python

### A (Very Brief) History of Regular Expressions

In 1951, mathematician Stephen Cole Kleene described the concept of a regular language, a language that is recognizable by a finite automaton and formally expressible using regular expressions. In the mid-1960s, computer science pioneer Ken Thompson, one of the original designers of Unix, implemented pattern matching in the QED text editor using Kleene’s notation.

Since then, regexes have appeared in many programming languages, editors, and other tools as a means of determining whether a string matches a specified pattern. Python, Java, and Perl all support regex functionality, as do most Unix tools and many text editors.

### The re Module

Regex functionality in Python resides in a module named re. The re module contains many useful functions and methods, most of which you’ll learn about in the next tutorial in this series.

For now, you’ll focus predominantly on one function, re.search().

re.search(pattern, string)

re.search(pattern, string) scans string looking for the first location where pattern matches. If a match is found, then re.search() returns a match object. Otherwise, it returns None.

re.search() takes an optional third argument that you’ll learn about at the end of this tutorial.
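
As a minimal sketch of the behavior described above, the snippet below calls re.search() once with a pattern that matches and once with one that does not. The sample sentence and both patterns are made up for illustration.

```python
import re

text = "The movie was released in 1979 and remastered in 2003."

# Scan the string for the first run of four digits.
match = re.search(r"\d{4}", text)
if match:
    print(match.group())  # the text of the first match
    print(match.span())   # (start, end) indices of that match

# When nothing matches, re.search() returns None instead of raising.
no_match = re.search(r"spaceship", text)
print(no_match)
```

Because a failed search returns None, it is usually worth checking the result before calling methods such as group() on it.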

# Introduction to tokenization

# nltk library

# Word counts with bag-of-words

### Simple text preprocessing

### Introduction to gensim

> ### What is gensim?
> Popular open-source NLP library
<br><br>
> Uses top academic models to perform complex tasks
<br><br>
> Building document or word vectors
<br><br>
> Performing topic identification and document comparison

In [3]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize

In [4]:
my_documents = [
    'The movie was about a spaceship and aliens.',
    'I really liked the movie!',
    'Awesome action scenes, but boring characters.',
    'The movie was awful! I hate alien films.',
    'Space is cool! I liked the movie.',
    'More space films, please!',
]

In [8]:
# Lowercase each document, then split it into word and punctuation tokens
tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]
tokenized_docs

[['the', 'movie', 'was', 'about', 'a', 'spaceship', 'and', 'aliens', '.'],
 ['i', 'really', 'liked', 'the', 'movie', '!'],
 ['awesome', 'action', 'scenes', ',', 'but', 'boring', 'characters', '.'],
 ['the', 'movie', 'was', 'awful', '!', 'i', 'hate', 'alien', 'films', '.'],
 ['space', 'is', 'cool', '!', 'i', 'liked', 'the', 'movie', '.'],
 ['more', 'space', 'films', ',', 'please', '!']]

In [9]:
dictionary = Dictionary(tokenized_docs)

In [10]:
dictionary.token2id

{'.': 0,
 'a': 1,
 'about': 2,
 'aliens': 3,
 'and': 4,
 'movie': 5,
 'spaceship': 6,
 'the': 7,
 'was': 8,
 '!': 9,
 'i': 10,
 'liked': 11,
 'really': 12,
 ',': 13,
 'action': 14,
 'awesome': 15,
 'boring': 16,
 'but': 17,
 'characters': 18,
 'scenes': 19,
 'alien': 20,
 'awful': 21,
 'films': 22,
 'hate': 23,
 'cool': 24,
 'is': 25,
 'space': 26,
 'more': 27,
 'please': 28}

In [17]:
# Create a gensim corpus: each document becomes a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]


In [24]:
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
 [(0, 1),
  (5, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1)],
 [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)],
 [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)]]

- gensim models can be easily saved, updated, and reused
- Our dictionary can also be updated
- This more advanced, feature-rich bag-of-words can be used in future exercises
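
To make the (token_id, count) format above less opaque, here is a plain-Python sketch of the same idea. It is not gensim's implementation: the helper doc_to_bow and the variable names are hypothetical, and the ids differ from the notebook output because only two of the tokenized documents are used.

```python
from collections import Counter

# Two of the tokenized documents from above, hard-coded so the sketch needs no NLTK.
docs = [
    ['i', 'really', 'liked', 'the', 'movie', '!'],
    ['more', 'space', 'films', ',', 'please', '!'],
]

token2id = {}


def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, adding unseen tokens to token2id."""
    counts = Counter(tokens)
    # Give each previously unseen token the next free id, in sorted token order.
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    return sorted((token2id[token], n) for token, n in counts.items())


bow_corpus = [doc_to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow_corpus)
```

The token '!' appears in both documents, so it keeps the id it was given for the first document and shows up under that same id in the second bag-of-words.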

### Tf-idf with gensim

> ### What is tf-idf?
> Term frequency - inverse document frequency
<br><br>
> Allows you to determine the most important words in each document
<br><br>
> Each corpus may have shared words beyond just stopwords
<br><br>
> These words should be down-weighted in importance
<br><br>
> Example from astronomy: "Sky"
<br><br>
> Ensures most common words don't show up as key words
<br><br>
> Keeps document-specific frequent words weighted high
<br><br>

In [15]:
from gensim.models.tfidfmodel import TfidfModel

In [18]:
tfidf = TfidfModel(corpus)

In [19]:
tfidf[corpus[1]]

[(5, 0.1746298276735174),
 (7, 0.1746298276735174),
 (9, 0.1746298276735174),
 (10, 0.29853166221463673),
 (11, 0.47316148988815415),
 (12, 0.7716931521027908)]
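
The weights above can be checked by hand. The sketch below assumes the weighting is count × log2(N / df), followed by scaling each document to unit length, and it drops zero-weight terms. Under that assumption it reproduces the numbers shown for this corpus. The names doc_freq and tfidf_weights are illustrative and are not part of gensim.

```python
import math

# The bag-of-words corpus printed earlier, as (token_id, count) pairs.
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
    [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
    [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
    [(0, 1), (5, 1), (7, 1), (8, 1), (9, 1), (10, 1), (20, 1), (21, 1), (22, 1), (23, 1)],
    [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)],
    [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)],
]

num_docs = len(corpus)

# Document frequency: the number of documents each token id occurs in.
doc_freq = {}
for doc in corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf_weights(doc):
    """Weight each term by count * log2(N / df), then scale to unit length."""
    raw = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in doc]
    norm = math.sqrt(sum(w * w for _, w in raw))
    if norm == 0:
        return []
    # Terms that occur in every document get weight 0 and are dropped.
    return [(tid, w / norm) for tid, w in raw if w > 0]


for token_id, weight in tfidf_weights(corpus[1]):
    print(token_id, round(weight, 4))
```

Token 12 ('really') occurs in only one of the six documents, so it receives the highest weight, while tokens 5, 7, and 9 ('movie', 'the', '!') each occur in four documents and are down-weighted.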

# Named Entity Recognition

### Using nltk for Named Entity Recognition

### Introduction to SpaCy

### Multilingual NER with polyglot