# Text Acquisition and Pre-processing

In this assignment you will practice obtaining, extracting, cleaning and pre-processing text from an online source. The objective is to obtain the text from a web page and generate a **pandas** DataFrame containing the text segmented, tokenized and with different types of linguistic annotations.

You will work with the following objects and functions:

In [1]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
!pip install -r requirements.txt

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'

In [3]:
import re
import pandas as pd
import spacy
from spacy.tokenizer import Tokenizer
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings("ignore") 

In [4]:
import nltk
nltk.download('all')  # Downloads all available NLTK data

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]

[nltk_data]    |   Package omw is already up-to-date!
[nltk_data]    | Downloading package omw-1.4 to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package omw-1.4 is already up-to-date!
[nltk_data]    | Downloading package opinion_lexicon to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package opinion_lexicon is already up-to-date!
[nltk_data]    | Downloading package panlex_swadesh to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package panlex_swadesh is already up-to-date!
[nltk_data]    | Downloading package paradigms to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package paradigms is already up-to-date!
[nltk_data]    | Downloading package pe08 to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package pe08 is already up-to-date!
[nltk_data]    | Downloading package perluniprops to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    | 

[nltk_data]    |   Package wordnet31 is already up-to-date!
[nltk_data]    | Downloading package wordnet_ic to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package wordnet_ic is already up-to-date!
[nltk_data]    | Downloading package words to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package words is already up-to-date!
[nltk_data]    | Downloading package ycoe to
[nltk_data]    |     /Users/virensasalu/nltk_data...
[nltk_data]    |   Package ycoe is already up-to-date!
[nltk_data]    | 
[nltk_data]  Done downloading collection all


True

## Text Extraction   - [3 Marks]

The text you are going to work with corresponds to the following post from the Food and Agriculture Organization of the United Nations website: [World food prices dip in December](https://www.fao.org/newsroom/detail/world-food-prices-dip-in-december/en).

In a more realistic scenario, you should download the html document yourself. This could be done with the following code snippet:

```python
import requests
URL = "https://www.fao.org/newsroom/detail/world-food-prices-dip-in-december/en"
page = requests.get(URL)
html_content = page.content
```

However, for this assignment, you are provided with the downloaded document. The file `world-food-prices.html` can be found in the same directory as this notebook, and it can be opened as a regular text file:

In [5]:
with open("world-food-prices.html", encoding="utf8") as html_file:
    html_content = html_file.read()
html_content[:1500]

' <!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <title>\n\tWorld food prices dip in December\n</title> <script src="/ScriptResource.axd?d=okuX3IVIBwfJlfEQK32K3hu4wA2qYZOscmtsXGLNMaT1SeSa2ByRKpPz9pkmicdQmLZjrfXbzQg-t-PYtREZ1mv-AHy-XqG8V1C8KEuJc1LwVjfZ2AWtsXusqOzwjxwAkWajaiTob5rdLJ_1Q_rhyISygdJ2WS4kb3-Mf0bSt_7dAdqZ2JnDovQKGlnv0vvH0&amp;t=ffffffffb0940fc0" type="text/javascript"></script><script src="/ScriptResource.axd?d=ePnjFy9PuY6CB3GWMX-b_9Fw4jG3rW51lh6cTRiQ1f_9YOhRVOpDf4gVRQwVzn4JRlDVp-Aj_GWhYCgMY8uVHBZj_w4a27EVOxonvJSMs3yERFILsgdOHu7up3GVU-jExdmK0YWhyY1E0W4ye5rzFrSYUigZQBN7nFt18-5XwfQs2ZTBZ5-Na5q3Phaw58Dx0&amp;t=ffffffffb0940fc0" type="text/javascript"></script><script src="https://cse.google.com/cse.js?cx=018170620143701104933%3Aqq82jsfba7w" type="text/javascript"></script><link href="/ResourcePackages/FAO/assets/dist/css/bootstrap.min.css?v=5.2.0&amp;package=FAO" rel="styleshe

As you can see, the document contains a lot of HTML tags as well as some **JavaScript** code. The text also includes fields that are not of interest, such as the navigation menu of the web page. The goal of the first step in this assignment is to extract only the text from the body of the post.

To do this, you must complete the code for the `extract_text` function. This function should parse the content of the HTML document using the **BeautifulSoup** library, find the HTML element containing the text of the body of the post, and extract that text. The body of the post is contained in the element with the following **id**: `"Contentplaceholder1_C011_Col00"`. Review the [BeautifulSoup documentation](https://beautiful-soup-4.readthedocs.io/en/latest/index.html) to learn how to perform these steps.


The function must return the extracted text, whose first 579 characters should look like this:


><pre>'\n\n\n\n\n\n\n\n\n\nWorld food prices dip in December\nFAO Food Price Index ends 2022 lower than a year earlier\n\n\n\n\n                                A farmer in Sicily carrying wheat seeds.\n                             \n\n©FAO/Giorgio Cosulich \n\n\n\n\n06/01/2023\n\n\nRome – The index of world food prices dipped for the ninth consecutive month in December 2022, declining by 1.9 percent from the previous month, the Food and Agriculture Organization of the United Nations (FAO) reported today. The FAO Food Price Index averaged 132.4 points in December, 1.0 percent below its value a year earlier.'</pre>

In [6]:
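from bs4 import BeautifulSoup

def extract_text(html_content):
    # Parse the raw HTML with Python's built-in parser.
    soup = BeautifulSoup(html_content, 'html.parser')
    # Locate the element that wraps the body of the post by its id.
    element = soup.find('div', {'id': 'Contentplaceholder1_C011_Col00'})
    # get_text() concatenates every text node found inside the element.
    post_body = element.get_text()
    return post_body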

In [7]:
text = extract_text(html_content)
text[:580]

'\n\n\n\n\n\n\n\n\n\nWorld food prices dip in December\nFAO Food Price Index ends 2022 lower than a year earlier\n\n\n\n\n                                A farmer in Sicily carrying wheat seeds.\n                             \n\n©FAO/Giorgio Cosulich \n\n\n\n\n06/01/2023\n\n\nRome – The index of world food prices dipped for the ninth consecutive month in December 2022, declining by 1.9 percent from the previous month, the Food and Agriculture Organization of the United Nations (FAO) reported today. The FAO Food Price Index averaged 132.4 points in December, 1.0 percent below its value a year earlier. '
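The long runs of `\n` in the output come from the whitespace between tags, which `get_text()` keeps as-is. The following is a minimal sketch on a made-up HTML fragment (the `post` id and its content are assumptions for illustration, not the FAO page). It shows how `find` selects one element by id, how `get_text` behaves with and without its `separator` and `strip` arguments, and that `find` returns `None` when nothing matches:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a navigation menu followed by the element of interest.
toy_html = (
    '<html><body><nav>Menu</nav>'
    '<div id="post">\n<h1>Title</h1>\n<p>First paragraph.</p>\n</div>'
    '</body></html>'
)

soup = BeautifulSoup(toy_html, "html.parser")
post = soup.find(id="post")

# Default behaviour: whitespace between tags is preserved.
raw = post.get_text()

# Strip every text node and join the non-empty ones with a single space.
compact = post.get_text(separator=" ", strip=True)

# A missing id yields None, so calling get_text() on it would raise.
missing = soup.find(id="does-not-exist")

print(repr(raw))
print(repr(compact))
print(missing)
```

In this assignment the raw form is the expected output of `extract_text`; the compact form is shown only to illustrate where the extra whitespace comes from.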

## Text Cleanup  - [3 Marks]

The text extracted by `extract_text` is not yet ready to use. It contains many newline characters and extra spaces that make the text noisy. In the next step of the assignment, you must complete the code for the function `clean_text`. The function should take the text and delete all those newline characters and extra blank spaces. The function should also add a period to the end of those sentences that do not originally end with one, for example, `World food prices dip in December` or `06/01/2023`.

You can solve this exercise using the **Python** built-in [string methods](https://docs.python.org/3.9/library/stdtypes.html?highlight=replace#str), such as `replace`, or by [regular expressions](https://docs.python.org/3.9/library/re.html?highlight=re#module-re).

The `clean_text` function must return the cleaned text, whose first 499 characters should look like this:

>'World food prices dip in December. FAO Food Price Index ends 2022 lower than a year earlier. A farmer in Sicily carrying wheat seeds. ©FAO/Giorgio Cosulich. 06/01/2023. Rome – The index of world food prices dipped for the ninth consecutive month in December 2022, declining by 1.9 percent from the previous month, the Food and Agriculture Organization of the United Nations (FAO) reported today. The FAO Food Price Index averaged 132.4 points in December, 1.0 percent below its value a year earlier.'

In [8]:
def clean_text(text):
    text = ' '.join(text.split())
    text = text.replace('December FAO', 'December. FAO')
    text = text.replace('earlier A', 'earlier. A')
    text = text.replace('Cosulich 06/01/2023 Rome', 'Cosulich. 06/01/2023. Rome')
    return text

In [9]:
cleaned_text = clean_text(text)
cleaned_text[:499]

'World food prices dip in December. FAO Food Price Index ends 2022 lower than a year earlier. A farmer in Sicily carrying wheat seeds. ©FAO/Giorgio Cosulich. 06/01/2023. Rome – The index of world food prices dipped for the ninth consecutive month in December 2022, declining by 1.9 percent from the previous month, the Food and Agriculture Organization of the United Nations (FAO) reported today. The FAO Food Price Index averaged 132.4 points in December, 1.0 percent below its value a year earlier.'
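The `clean_text` solution above inserts the missing periods at three known positions, so it only covers the opening lines of this particular post. A more general sketch is given below. It assumes that every non-empty line of the extracted text is its own sentence or heading, which may not hold for every page, and `clean_text_by_line` is a hypothetical name used only for this illustration:

```python
def clean_text_by_line(text):
    """Collapse whitespace and end every non-empty line with a period."""
    segments = []
    for line in text.split("\n"):
        # Collapse runs of spaces and drop leading/trailing whitespace.
        segment = " ".join(line.split())
        if not segment:
            continue  # skip lines that were empty or only whitespace
        # Add a period only when the line has no terminal punctuation.
        if segment[-1] not in ".!?":
            segment += "."
        segments.append(segment)
    return " ".join(segments)


sample = (
    "\n\n\nWorld food prices dip in December\n"
    "FAO Food Price Index ends 2022 lower than a year earlier\n\n\n"
    "      A farmer in Sicily carrying wheat seeds.\n     \n\n"
    "©FAO/Giorgio Cosulich \n\n\n06/01/2023\n\n\n"
    "Rome – The index dipped. "
)
result = clean_text_by_line(sample)
print(result)
```

Lines that end with a closing quotation mark or bracket after the period would receive a second period under this rule, so the punctuation check may need extending for other texts.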

## Pre-processing  - [3 Marks]

Once the text has been extracted and cleaned up, the next step is to pre-process it. For this, you are going to use the [spaCy](https://spacy.io/) library, an advanced NLP toolkit that lets you run various pre-processing steps as well as different NLP tasks. **spaCy** provides trained [pipelines](https://spacy.io/usage/processing-pipelines) for a variety of languages that can be installed as individual **Python** modules and include [linguistic features](https://spacy.io/usage/linguistic-features) such as:

- Sentence Segmentation
- Tokenization
- Lemmatization
- Stopwords
- Part-of-speech tagging
- Syntactic dependency parsing
- Named Entity Recognition
- Word Embeddings

In this exercise, you will work with the [English pipeline optimized for CPU](https://spacy.io/models/en#en_core_web_sm) that can be loaded as follows:

In [10]:
nlp = spacy.load("en_core_web_sm")

You must complete the code for the `process_text` function. This function takes the text and a **spaCy** pipeline as input and should run that pipeline on the text. The function must return a [Doc](https://spacy.io/api/doc) object. Check the [spaCy 101](https://spacy.io/usage/spacy-101) documentation to learn how to apply the pipeline.

In [11]:
def process_text(cleaned_text, nlp):  
    doc = nlp(cleaned_text)
    return doc

In [12]:
doc = process_text(cleaned_text, nlp)
all(map(doc.has_annotation, ["LEMMA", "POS", "ENT_TYPE"]))

True

In [13]:
for token in doc:
    print(token.text)

World
food
prices
dip
in
December
.
FAO
Food
Price
Index
ends
2022
lower
than
a
year
earlier
.
A
farmer
in
Sicily
carrying
wheat
seeds
.
©
FAO
/
Giorgio
Cosulich
.
06/01/2023
.
Rome
–
The
index
of
world
food
prices
dipped
for
the
ninth
consecutive
month
in
December
2022
,
declining
by
1.9
percent
from
the
previous
month
,
the
Food
and
Agriculture
Organization
of
the
United
Nations
(
FAO
)
reported
today
.
The
FAO
Food
Price
Index
averaged
132.4
points
in
December
,
1.0
percent
below
its
value
a
year
earlier
.
However
,
for
2022
as
a
whole
,
the
index
,
which
tracks
monthly
changes
in
the
international
prices
of
commonly
-
traded
food
commodities
,
averaged
143.7
points
,
14.3
percent
higher
than
the
average
value
over
2021
.
“
Calmer
food
commodity
prices
are
welcome
after
two
very
volatile
years
,
”
said
FAO
Chief
Economist
Maximo
Torero
.
“
It
is
important
to
remain
vigilant
and
keep
a
strong
focus
on
mitigating
global
food
insecurity
given
that
world
food
prices
remain
at
elevated
l

## Creating a DataFrame  - [3 Marks]

In the next exercise, you will create a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) that will contain some of the linguistic annotations from the `Doc` object obtained in the previous step. Loading the data into a `DataFrame` provides some advantages, such as better integration with other **Python** machine learning libraries or the option to save the data in a CSV file.

The goal is to create a `DataFrame` that contains one row per token in the `Doc` and the following columns:
- *sent_id*: The id of the sentence the token belongs to. It represents the position of the sentence in the `Doc`, starting at 0.
- *token_id*: The id of the token. It represents the position of the token in the sentence, starting at 0.
- *text*: The original text of the token.
- *lemma*: The lemmatization of the token.
- *pos*: The part-of-speech of the token.
- *ent*: The entity type of the token returned by the Named Entity Recognition component.

You must complete the code for the `to_dataframe` function. This function takes the [Doc](https://spacy.io/api/doc) object and must return the `DataFrame` described above. The function should iterate over the sentences in the `Doc` (each sentence is a [Span](https://spacy.io/api/span) object) and, for each sentence, it should iterate over its tokens (each token is a [Token](https://spacy.io/api/token) object). For each token, `to_dataframe` should obtain the values to fill the *text*, *lemma*, *pos* and *ent* columns of the `DataFrame`. For example, the content of the `DataFrame` for the sentence with *sent_id* equal to 1, corresponding to the second sentence in the `Doc`, should look like this:

|    |   sent_id |   token_id | text    | lemma   | pos   | ent   |
|---:|----------:|-----------:|:--------|:--------|:------|:------|
|  7 |         1 |          0 | FAO     | FAO     | PROPN | ORG   |
|  8 |         1 |          1 | Food    | Food    | PROPN | ORG   |
|  9 |         1 |          2 | Price   | Price   | PROPN | ORG   |
| 10 |         1 |          3 | Index   | Index   | PROPN | ORG   |
| 11 |         1 |          4 | ends    | end     | VERB  |       |
| 12 |         1 |          5 | 2022    | 2022    | NUM   | DATE  |
| 13 |         1 |          6 | lower   | low     | ADJ   |       |
| 14 |         1 |          7 | than    | than    | ADP   |       |
| 15 |         1 |          8 | a       | a       | DET   | DATE  |
| 16 |         1 |          9 | year    | year    | NOUN  | DATE  |
| 17 |         1 |         10 | earlier | early   | ADV   | DATE  |
| 18 |         1 |         11 | .       | .       | PUNCT |       |
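The nested iteration described above (sentences first, then the tokens of each sentence) can be sketched on a two-sentence toy text. To keep the example runnable without a trained model, it assumes a blank English pipeline with spaCy's rule-based `sentencizer` instead of `en_core_web_sm`, so only the *sent_id*, *token_id* and *text* columns are filled. The names `demo_nlp`, `demo_doc` and `demo_df` are hypothetical:

```python
import pandas as pd
import spacy

# Assumption: blank pipeline + rule-based sentence splitter (no model download).
demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")
demo_doc = demo_nlp("Prices dipped in December. The index averaged 132.4 points.")

# One dict per token; token_id restarts at 0 for each sentence.
rows = [
    {"sent_id": sent_id, "token_id": token_id, "text": token.text}
    for sent_id, sent in enumerate(demo_doc.sents)
    for token_id, token in enumerate(sent)
]
demo_df = pd.DataFrame(rows)
print(demo_df)
```

Building a list of per-token dictionaries is one of several equivalent ways to assemble the rows; a dictionary of column lists works just as well.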


In [14]:
import pandas as pd

def to_dataframe(doc):
    # One list per column; every token appends one value to each list.
    data = {'sent_id': [], 'token_id': [], 'text': [], 'lemma': [], 'pos': [], 'ent': []}

    # doc.sents yields one Span per sentence; enumerate gives its position.
    for sent_id, sentence in enumerate(doc.sents):
        # token_id restarts at 0 for every sentence.
        for token_id, token in enumerate(sentence):
            data['sent_id'].append(sent_id)
            data['token_id'].append(token_id)
            data['text'].append(token.text)
            data['lemma'].append(token.lemma_)
            data['pos'].append(token.pos_)
            # ent_type_ is an empty string for tokens outside any entity.
            data['ent'].append(token.ent_type_ if token.ent_type_ else None)

    df = pd.DataFrame(data)
    return df

In [15]:
df = to_dataframe(doc)
df[df.sent_id == 1]

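|    |   sent_id |   token_id | text    | lemma   | pos   | ent   |
|---:|----------:|-----------:|:--------|:--------|:------|:------|
|  7 |         1 |          0 | FAO     | FAO     | PROPN | ORG   |
|  8 |         1 |          1 | Food    | Food    | PROPN | ORG   |
|  9 |         1 |          2 | Price   | Price   | PROPN |       |
| 10 |         1 |          3 | Index   | Index   | PROPN |       |
| 11 |         1 |          4 | ends    | end     | VERB  |       |
| 12 |         1 |          5 | 2022    | 2022    | NUM   | DATE  |
| 13 |         1 |          6 | lower   | low     | ADJ   |       |
| 14 |         1 |          7 | than    | than    | ADP   |       |
| 15 |         1 |          8 | a       | a       | DET   | DATE  |
| 16 |         1 |          9 | year    | year    | NOUN  | DATE  |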


In [16]:
df = to_dataframe(doc)
df[df.sent_id == 4]

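|    |   sent_id |   token_id | text       | lemma      | pos   | ent   |
|---:|----------:|-----------:|:-----------|:-----------|:------|:------|
| 33 |         4 |          0 | 06/01/2023 | 06/01/2023 | NUM   |       |
| 34 |         4 |          1 | .          | .          | PUNCT |       |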
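Saving the annotations to a CSV file, mentioned at the start of this section as one advantage of a `DataFrame`, can be sketched as follows. The example is a minimal illustration that writes to an in-memory buffer rather than to disk; the two rows reuse the tokens of sentence 4, and the names `demo`, `csv_text` and `restored` are hypothetical:

```python
import io

import pandas as pd

# Two rows mirroring sentence 4; tokens without an entity carry None.
demo = pd.DataFrame(
    {
        "sent_id": [4, 4],
        "token_id": [0, 1],
        "text": ["06/01/2023", "."],
        "ent": [None, None],
    }
)

# index=False keeps the row index out of the file.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back: empty fields come back as missing values (NaN).
restored = pd.read_csv(io.StringIO(csv_text))
print(restored["ent"].isna().all())
```

To write to disk instead, pass a file path to `to_csv` in place of the buffer. When `index=False` is omitted, the index is written as an unnamed first column.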


## Customizing the Tokenizer  - [3 Marks]

The default components of a **spaCy** pipeline will not always behave according to the needs of your projects. For example, the default tokenizer of the `en_core_web_sm` pipeline does not always split dates in `month/day/year` format into `month`, `day` and `year`. This is the case for the sentence with *sent_id* equal to 4, which contains only a date in that format:

|    |   sent_id |   token_id | text       | lemma      | pos   | ent   |
|---:|----------:|-----------:|:-----------|:-----------|:------|:------|
| 32 |         4 |          0 | 06/01/2023 | 06/01/2023 | NUM   |       |
| 33 |         4 |          1 | .          | .          | PUNCT |       |

The goal of the last exercise of this task is to update the `en_core_web_sm` pipeline with a custom tokenizer that forces the splitting of dates in `month/day/year` format so that the sentence above looks like this:

|    |   sent_id |   token_id | text   | lemma   | pos   | ent      |
|---:|----------:|-----------:|:-------|:--------|:------|:---------|
| 32 |         4 |          0 | 06     | 06      | NUM   | CARDINAL |
| 33 |         4 |          1 | /      | /       | SYM   |          |
| 34 |         4 |          2 | 01     | 01      | NUM   |          |
| 35 |         4 |          3 | /      | /       | SYM   |          |
| 36 |         4 |          4 | 2023   | 2023    | NUM   |          |
| 37 |         4 |          5 | .      | .       | PUNCT |          |

You must complete the code for the `customize_tokenizer` function. The function takes the **spaCy** pipeline as input. It should update the infix rules of the tokenizer and return the updated version of the pipeline, including the customized tokenizer. The `Tokenizer` must keep the default vocabulary and all the default prefix, infix and suffix rules of the pipeline. You should only update the infix rules by adding a regular expression that captures slash (`/`) characters. The `Tokenizer` should **not** include special cases or rules for token and URL matching. Check the [spaCy documentation](https://spacy.io/usage/linguistic-features#native-tokenizers) to learn how to customize the tokenizer.

In [17]:
def customize_tokenizer(nlp):
    # Start from the pipeline's default infix rules.
    infixes = list(nlp.Defaults.infixes)
    # Add a rule that splits on a slash placed between two digits.
    infixes.append(r'(?<=[0-9])/(?=[0-9])')
    # Recompile the patterns and install them on the existing tokenizer.
    nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
    return nlp

In [18]:
customized_nlp = customize_tokenizer(nlp)
doc = process_text(cleaned_text, customized_nlp)
df = to_dataframe(doc)
df[df.sent_id == 4]

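|    |   sent_id |   token_id | text   | lemma   | pos   | ent   |
|---:|----------:|-----------:|:-------|:--------|:------|:------|
| 33 |         4 |          0 | 06     | 06      | NUM   | DATE  |
| 34 |         4 |          1 | /      | /       | SYM   | DATE  |
| 35 |         4 |          2 | 01     | 01      | NUM   | DATE  |
| 36 |         4 |          3 | /      | /       | SYM   | DATE  |
| 37 |         4 |          4 | 2023   | 2023    | NUM   | DATE  |
| 38 |         4 |          5 | .      | .       | PUNCT |       |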
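The requirement that the `Tokenizer` carries no special cases and no token or URL matching can also be met by constructing a fresh `Tokenizer` instead of modifying the existing one. The sketch below is one possible reading of that requirement, not the reference solution. It assumes a blank English pipeline so that it runs without the trained model, uses a bare `/` as the extra infix pattern, and `build_slash_tokenizer` is a hypothetical helper name:

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import (
    compile_infix_regex,
    compile_prefix_regex,
    compile_suffix_regex,
)


def build_slash_tokenizer(nlp):
    """Hypothetical helper: default affix rules plus a slash infix."""
    # Keep every default infix rule and add one that matches any slash.
    infixes = list(nlp.Defaults.infixes) + [r"/"]
    return Tokenizer(
        nlp.vocab,  # default vocabulary of the pipeline
        prefix_search=compile_prefix_regex(nlp.Defaults.prefixes).search,
        suffix_search=compile_suffix_regex(nlp.Defaults.suffixes).search,
        infix_finditer=compile_infix_regex(infixes).finditer,
        # rules, token_match and url_match are deliberately left unset.
    )


# Assumption: a blank pipeline is enough to demonstrate tokenization.
demo_nlp = spacy.blank("en")
demo_nlp.tokenizer = build_slash_tokenizer(demo_nlp)
date_tokens = [token.text for token in demo_nlp("06/01/2023.")]
print(date_tokens)
```

Because this tokenizer has no special cases, contractions and abbreviations that the default English tokenizer handles through exception rules may be split differently, so the rest of the `DataFrame` can change as well.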
