<h1>Natural Language Processing. Chapter I. Basics</h1>
<!--<h2>Introduction, tokens, etc</h2>-->

## Introduction to Natural Language Processing (NLP) ##

### How the text is processed ###

<p>Text can be processed by a machine just like other types of data (images, sound, video, etc.), because it can be represented with numbers.</p>
<p>What makes natural text particular is how it is written: by humans, for humans. Speakers of the same language share the code needed to read it, but a machine does not know that code unless it is taught.</p>
<p>Characters are represented in a machine through lookup tables called encodings. ASCII is the most basic one; UTF-8 and other encodings also handle pictograms, accented characters and characters specific to some languages, such as 'ç', 'ñ', etc.</p>
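
As a small illustration of the lookup-table idea, the sketch below uses only Python built-ins (`ord` and `str.encode`) to show the code point of a character and the bytes it takes in UTF-8. The helper name `describe` and the sample characters are assumptions made for this example.

```python
def describe(ch):
    """Return the Unicode code point and the UTF-8 bytes of one character."""
    return ord(ch), list(ch.encode('utf-8'))

for ch in ['a', 'ç', 'ñ', '€']:
    code_point, utf8_bytes = describe(ch)
    print(ch, code_point, utf8_bytes)

# ASCII only covers code points 0-127, so accented characters do not fit
try:
    'ç'.encode('ascii')
except UnicodeEncodeError as err:
    print('not representable in ASCII:', err.reason)
```

Plain letters take a single byte in UTF-8, while accented characters and symbols take two or more.
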
<p>Text written by humans brings additional difficulties: we make mistakes or write in non-standard ways. These can be treated as noise, or as deviations from the commonly used rules of the language, and they complicate the analysis. Moreover, not everyone speaks the same language, so translation may be needed to convey the same meaning to other readers.</p>
<p>Given all of this, and given that there are more nuances that complicate the understanding of language by a machine, how do we start processing text automatically?</p>

There are two main approaches to dividing written text for analysis:
* character analysis
* word analysis

Each of them has its pros and cons.
With per-character analysis the set of basic symbols to memorize is small, but the number of possible combinations is far larger and harder to handle. On the other hand, this approach does not restrict the possibility of producing new words if the application needs it.
With per-word analysis the set of symbols is as big as the number of words we want to consider. The number of combinations between symbols is also large, but not as large as with per-character combinations. Meaning is also closer at hand, because a word, or its main part, carries a basic meaning and is subject to more restrictions, so it is easier to handle.
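
The trade-off can be sketched with plain Python on a toy sentence. The sentence and the word `catalog` are made up for this example: the word is unseen at word level, yet it can be spelled entirely with characters that are already known.

```python
sample = "the cat sat on the mat and the dog sat on the log"

# Per-character view: the symbols are single letters
char_vocab = set(sample.replace(" ", ""))
# Per-word view: the symbols are whole words
word_vocab = set(sample.split())

print("distinct characters:", len(char_vocab))
print("distinct words:", len(word_vocab))

new_word = "catalog"
# Unknown at word level...
print("known word:", new_word in word_vocab)
# ...but every character of it has been seen before
print("known characters:", all(c in char_vocab for c in new_word))
```

A character-level model could therefore produce `catalog`, whereas a word-level model limited to this vocabulary could not.
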

### Preparing libraries: nltk / spacy ###

In [None]:
!python3 -m pip install --upgrade setuptools
# NLTK library
!python3 -m pip install -U nltk
# Spacy library 
!python3 -m pip install -U spacy
!python3 -m spacy download en_core_web_sm

In [3]:
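# now the imports:
import spacy
import nltk
# NLTK data used below; resource names may differ across NLTK versions
for resource in ('punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger'):
    nltk.download(resource)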

### Tokenization ###

<p>To process words, we first need a way to split the text into minimal units that carry meaning: tokens.</p>
<p>A first approach is to split the text on spaces. Punctuation characters such as ',' or '.' may or may not be treated as separate tokens, depending on the application.</p>
<p>Prefixes and suffixes can sometimes be considered tokens as well: they add meaning to the word they are attached to, but they can be decoupled from it.</p>
<p>Tokens are not limited to words: a text can also be divided into sentences.</p>
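
Before turning to the libraries, the first approach can be sketched with the standard library alone. The sentence and the regular expression below are assumptions chosen for this illustration, not the rules that nltk or spacy apply.

```python
import re

sentence = "Thrun said: it's a self-driving car, isn't it?"

# Splitting on whitespace keeps punctuation glued to the words
whitespace_tokens = sentence.split()
print(whitespace_tokens)

# Simple pattern: runs of word characters, optionally joined by hyphens
# or apostrophes, or else a single punctuation character
pattern = r"\w+(?:[-']\w+)*|[^\w\s]"
regex_tokens = re.findall(pattern, sentence)
print(regex_tokens)
```

With the second version, punctuation becomes a token of its own, while hyphenated words and contractions stay in one piece. Whether that is desirable depends on the application.
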

In [8]:
# use nltk / spacy to tokenize a sentence and see how the segmentation
# is performed
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
# for nltk
tokens = nltk.tokenize.word_tokenize(text)
for token in tokens:
    print(token)

sentences = nltk.tokenize.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print('\n')

# note: nltk also ships specialized tokenizers, e.g. TweetTokenizer,
# which handles the conventions of tweets: informal, short text, etc.

When
Sebastian
Thrun
started
working
on
self-driving
cars
at
Google
in
2007
,
few
people
outside
of
the
company
took
him
seriously
.
“
I
can
tell
you
very
senior
CEOs
of
major
American
car
companies
would
shake
my
hand
and
turn
away
because
I
wasn
’
t
worth
talking
to
,
”
said
Thrun
,
in
an
interview
with
Recode
earlier
this
week
.
When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously.


“I can tell you very senior CEOs of major American car companies would shake my hand and turn away because I wasn’t worth talking to,” said Thrun, in an interview with Recode earlier this week.




In [13]:
# for spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for token in doc:
    print(token.text)
for sent in doc.sents:
    print(sent.text)
    print('\n')

When
Sebastian
Thrun
started
working
on
self
-
driving
cars
at
Google
in
2007
,
few
people
outside
of
the
company
took
him
seriously
.
“
I
can
tell
you
very
senior
CEOs
of
major
American
car
companies
would
shake
my
hand
and
turn
away
because
I
was
n’t
worth
talking
to
,
”
said
Thrun
,
in
an
interview
with
Recode
earlier
this
week
.
When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously.


“I can tell you very senior CEOs of major American car companies would shake my hand and turn away because I wasn’t worth talking to,” said Thrun, in an interview with Recode earlier this week.




### Text cleaning, stop words removal, etc ###

<p>Some words add little meaning of their own and mainly serve to complete structures. These stop words include pronouns, possessives, demonstratives, reflexives, some verbs (to be), articles, particles, etc. Punctuation is often removed in the same step.</p>
<p>These words can carry more meaning than it may seem, but for a high-level analysis such as classification or sentiment analysis, most of the meaning lies in other words: nouns, adjectives, etc.</p>

In [None]:
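# use nltk: show the head of the stop word list, then apply it to a sentence
nltk_stopwords = nltk.corpus.stopwords.words('english')
print(nltk_stopwords[:10])

# the list is lowercase, so compare lowercased tokens ("When" vs "when")
stopword_set = set(nltk_stopwords)
filtered_nltk = [w for w in tokens if w.lower() not in stopword_set]
print(filtered_nltk)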

In [None]:
# use spacy to do the same
filtered_spacy = [w for w in doc if not w.is_stop]
print(filtered_spacy)

### Lemmatization / Stemming ###

These operations remove the inflections of a word (number, gender, tense, etc.) and keep only its most meaningful part.

Lemmatization relies on a dictionary that maps each inflected form to its lemma, which is given as output.

Stemming cuts the word according to a set of rules, so it can sometimes yield incorrect or imprecise reductions.

For instance, removing the number and gender of some words leaves the part that carries the core meaning:

|   word  | lemmatization | stemming |
|---------|---------------|----------|
| niñas   | niño          | niñ      |
| niñez   | niñez         | niñ      |
| studies | study         | studi    |
| study   | study         | study    |
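
The difference between the two techniques can be sketched in a few lines of plain Python. The lookup table and the suffix rules below are hypothetical and only cover the words of the table above; real lemmatizers and stemmers use far larger dictionaries and rule sets.

```python
# Toy lookup table (hypothetical, only covers the words of the table)
LEMMA_TABLE = {"niñas": "niño", "studies": "study"}
# Toy suffix rules (hypothetical), tried in order
SUFFIX_RULES = [("ies", "i"), ("as", ""), ("ez", ""), ("s", "")]

def toy_lemmatize(word):
    """Dictionary lookup: unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

def toy_stem(word):
    """Rule-based cut: apply the first suffix rule that matches."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["niñas", "niñez", "studies", "study"]:
    print(w, "->", toy_lemmatize(w), "/", toy_stem(w))
```

The stemmer maps both "niñas" and "niñez" to "niñ" although they are different words, which is the kind of imprecision that rule-based cutting can introduce.
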

In [None]:
# apply nltk to do some examples of both things: lemmatization, stemming
# (lemmatize() treats every word as a noun unless a pos argument is given)
lemmatizer = nltk.stem.WordNetLemmatizer()
stemmer = nltk.stem.PorterStemmer()
words = ['children', 'studies', 'study']
for w in words:
    print(w + ' -> ' + lemmatizer.lemmatize(w) + ' / ' + stemmer.stem(w))

print('\n')

for w in filtered_nltk:
    print(w + ' -> ' + lemmatizer.lemmatize(w) + ' / ' + stemmer.stem(w))

In [None]:
# do the same for spacy
for w in filtered_spacy:
    # in spacy there is no implementation of stemmers. Lemmatizers are presumed to be better than stemmers
    print(w.text + ' -> ' + w.lemma_)

### Regexp search ###

Tokens can be searched with regular expressions using plain Python, through the re package of the standard library.

In [10]:
import re
l = [w for w in tokens if re.search(".*ing", w)]
print(repr(l))

['working', 'self-driving', 'talking']


However, nltk and spacy also include some basic functionality for this.

In [15]:
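# nltk code (for token only)
# Text.findall expects each token pattern wrapped in <...>; it prints
# the matches itself and returns None, so no print() is needed
nltkText = nltk.Text(tokens)
nltkText.findall(r'<.*ing>')

working; self-driving; talking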


In [None]:
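#spacy code (for token only)
from spacy.matcher import Matcher
# Match token texts that contain "ing" preceded by any characters
pattern = [{"TEXT": {"REGEX": ".*ing"}}]
matcher = Matcher(nlp.vocab)
matcher.add("ING_WORDS", [pattern])  # spaCy v3 signature
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)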

<h4>etc...</h4>

### POS ###

POS (Part Of Speech) tagging is part of the syntactic analysis of a sentence. Its result is the grammatical category of each word: noun, verb, adjective, etc.

The tags differ between libraries.
For instance, a proper noun is NNP in nltk (Penn Treebank tags) and PROPN in the pos_ field of spacy (Universal POS tags), while a common noun is NN in nltk and NOUN in spacy.
The tagging is also not handled the same way: in nltk, POS tags are computed on demand, whereas in spacy they are computed when the text is processed and stored in a field of each token.
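
To make the difference between tag sets concrete, here is a minimal sketch of a translation table from fine-grained Penn Treebank tags to coarse Universal POS tags. The table is partial and hand-written for this example, the word/tag pairs are tagged by hand rather than by a tagger, and `to_universal` is a hypothetical helper, not a function of either library.

```python
# Partial, hand-written mapping from Penn Treebank tags to Universal POS tags
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET",
}

def to_universal(penn_tag):
    """Return the coarse tag, or 'X' (other) when the tag is not in the table."""
    return PENN_TO_UNIVERSAL.get(penn_tag, "X")

# Hand-tagged example pairs (not produced by a tagger)
tagged = [("Thrun", "NNP"), ("started", "VBD"), ("cars", "NNS")]
for word, tag in tagged:
    print(word, tag, "->", to_universal(tag))
```

Several fine-grained tags collapse into one coarse tag, so the conversion only works in this direction.
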

In [None]:
# for nltk the code is the following
pos_nltk = nltk.pos_tag(tokens)
for w, p in pos_nltk:
    print(w + ' -> ' + p)

In [None]:
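# for spacy the code is simpler: tags are already stored in each token
for w in doc:
    # pos_ is the part of speech, dep_ the syntactic dependency label
    print(w.text + ' -> ' + w.pos_ + ' ' + w.dep_)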