### Importing required modules

In [1]:
# Import requests
import requests

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Import NLTK module
import nltk

# Import word_tokenize 
from nltk.tokenize import word_tokenize

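# Import POS tagger
from nltk.tag import pos_tag

# Note: word_tokenize and pos_tag need NLTK data packages that are fetched
# once with nltk.download(); the package names depend on the NLTK version
# (for example "punkt" and "averaged_perceptron_tagger")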

### Obtaining text from a website

In [3]:
# Send a request to the website
page = requests.get("https://en.wikipedia.org/wiki/Natural_Language_Toolkit")

# Stop early if the request failed
page.raise_for_status()

# Parse the HTML with BeautifulSoup and the html5lib parser. It is slower
# than the other parsers but more lenient with malformed markup
page_content = BeautifulSoup(page.text, "html5lib")

# Collect the text of the first three paragraphs
textContent = [p.text for p in page_content.find_all("p")[:3]]

# Join the paragraphs with spaces and remove the `\n` characters
wiki_nltk = " ".join(textContent).replace("\n", "")

In [4]:
wiki_nltk

' The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.[5] NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit,[6] plus a cookbook.[7] NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning.[8]NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems. There are 32 universities in the US and 25 countries using NLTK in their courses. NLTK 
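The extracted text still carries Wikipedia's bracketed citation markers, such as `[5]`, which the tokenizer will treat as ordinary tokens. A minimal sketch of one way to strip them before tagging, assuming the markers are always digits in square brackets; `strip_citation_markers` is a hypothetical helper, not part of the original notebook:

```python
import re

def strip_citation_markers(text):
    """Remove bracketed numeric markers such as "[5]" (hypothetical helper)."""
    # Assumption: a citation marker is one or more digits inside square brackets
    return re.sub(r"\[\d+\]", "", text)

# A fragment copied from the output above
sample = "University of Pennsylvania.[5] NLTK includes graphical demonstrations and sample data."
print(strip_citation_markers(sample))
```

Applying it as `wiki_nltk = strip_citation_markers(wiki_nltk)` would change the tokens seen by the tagger, so the outputs further down would no longer match exactly.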

### NLTK in action

NLTK is one of the most widely used platforms for working with human language data in Python. It provides more than 50 corpora and lexical resources, along with libraries to classify, tokenize, and tag text, among other functions.
It can help with our task of classifying words as verbs, nouns, adjectives, and so on. These categories are called parts of speech, and the task of labeling words with them is called part-of-speech tagging, or POS tagging.
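NLTK's default English tagger returns Penn Treebank tags, which are finer-grained than "verb", "noun", and "adjective" (for example, `NN` is a singular noun and `NNS` a plural noun). A rough sketch of how such tags could be collapsed into coarse categories by prefix; `coarse_category` is a hypothetical helper and the sample pairs are written by hand for illustration:

```python
def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse part of speech (hypothetical helper)."""
    if tag.startswith("NN"):   # NN, NNS, NNP, NNPS
        return "noun"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "verb"
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "adjective"
    if tag.startswith("RB"):   # RB, RBR, RBS
        return "adverb"
    return "other"

# Hand-written (word, tag) pairs in the format a POS tagger returns
tagged = [("NLTK", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("suite", "NN"), ("of", "IN"), ("libraries", "NNS")]

for word, tag in tagged:
    print(word, tag, coarse_category(tag))
```

The prefix test is a simplification: it is enough for a quick overview, but it lumps determiners, prepositions, and punctuation together as "other".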

Let's define a function that takes a text, tokenizes it, tags each token, and returns a list of tuples. Each tuple consists of a word and the tag for its part of speech.

In [7]:
def preprocess_text(text):
    """
    Take a text, split it into tokens with word_tokenize, and tag
    the tokens with pos_tag from the NLTK module.
    Return a list of tuples. Each tuple consists of a word and the tag
    for its part of speech.
    """
    # Get the tokens
    tokens = word_tokenize(text)
    # Tag the tokens
    tagging = pos_tag(tokens)
    # Return the list of tuples
    return tagging

Now we apply the function to the text.

In [9]:
# Split and label the text
label_text = preprocess_text(wiki_nltk)

And we print the first 20 tuples, five per line.

In [19]:
for i in range(0, 20, 5):
    print(*label_text[i:i + 5])

('The', 'DT') ('Natural', 'NNP') ('Language', 'NNP') ('Toolkit', 'NNP') (',', ',')
('or', 'CC') ('more', 'JJR') ('commonly', 'RB') ('NLTK', 'NNP') (',', ',')
('is', 'VBZ') ('a', 'DT') ('suite', 'NN') ('of', 'IN') ('libraries', 'NNS')
('and', 'CC') ('programs', 'NNS') ('for', 'IN') ('symbolic', 'JJ') ('and', 'CC')
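Once the text is tagged, a natural next step is to see which parts of speech dominate. A small sketch using `collections.Counter` on the 20 tuples printed above, copied by hand so that the example runs without NLTK; the variable names `first_20` and `tag_counts` are illustrative:

```python
from collections import Counter

# The first 20 (word, tag) tuples, copied from the output above
first_20 = [
    ('The', 'DT'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Toolkit', 'NNP'), (',', ','),
    ('or', 'CC'), ('more', 'JJR'), ('commonly', 'RB'), ('NLTK', 'NNP'), (',', ','),
    ('is', 'VBZ'), ('a', 'DT'), ('suite', 'NN'), ('of', 'IN'), ('libraries', 'NNS'),
    ('and', 'CC'), ('programs', 'NNS'), ('for', 'IN'), ('symbolic', 'JJ'), ('and', 'CC'),
]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in first_20)

# The two most frequent tags
print(tag_counts.most_common(2))  # [('NNP', 4), ('CC', 3)]
```

With the full result, the same count would be `Counter(tag for _, tag in label_text)`.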
