**NLTK:** The Natural Language Toolkit (nltk) is a Python library for working with human language data (text), and it is widely used in natural language processing (NLP). It provides tools for tokenization, part-of-speech tagging, stemming, lemmatization, and parsing, along with interfaces to treebanks, WordNet, and other corpora. Its companion data collection, which is downloaded separately, includes sample texts and corpora such as the Brown Corpus and a sample of the Penn Treebank. Researchers and practitioners rely on it for many common NLP tasks.

To use nltk in your Python code, first install it with pip, the Python package manager. In a notebook cell, run the following command (the leading `!` passes the command to the shell):
!pip install nltk

Once you have installed nltk, you can import it into your code using the import statement:

In [1]:
import nltk

This gives you access to the functions and classes provided by the nltk library, which you can use for NLP tasks such as tokenization, part-of-speech tagging, and parsing. Most corpora and pretrained models are not bundled with the package and must be downloaded before they can be used.

The nltk.download() function downloads the data and resources that accompany the library. Called without arguments in an interactive environment, it typically opens a graphical user interface (GUI) in which you select what to install. The available items include sample data, corpora, and pretrained models, as well as resources such as WordNet, the Punkt tokenizer, and the averaged perceptron tagger. The items you select are downloaded and installed on your machine, where the rest of the nltk library can find and use them.

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
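The interactive downloader is awkward in scripts and on machines without a display. A minimal sketch of a non-interactive alternative follows: `nltk.download()` also accepts a resource identifier, and `nltk.data.find()` raises `LookupError` when a resource is not installed. The `REQUIRED` mapping and both helper functions are illustrative names, not part of nltk, and the resource list is an assumption about what this notebook needs (some recent NLTK releases look for `punkt_tab` rather than `punkt`).

```python
import nltk

# Hypothetical resource list for this notebook: identifier -> path inside nltk_data.
# Some recent NLTK releases expect "punkt_tab" / "tokenizers/punkt_tab" instead.
REQUIRED = {
    "punkt": "tokenizers/punkt",
}


def missing_resources(required):
    """Return the identifiers whose data cannot be found locally."""
    missing = []
    for resource_id, path in required.items():
        try:
            nltk.data.find(path)  # raises LookupError if the resource is absent
        except LookupError:
            missing.append(resource_id)
    return missing


def ensure_resources(required):
    """Download only the missing resources, without opening the GUI."""
    for resource_id in missing_resources(required):
        nltk.download(resource_id, quiet=True)


# Report what is not installed yet (no network access needed for this check).
print(missing_resources(REQUIRED))
```

Calling `ensure_resources(REQUIRED)` once at the top of a notebook should then fetch anything that is missing and do nothing on later runs.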

#### Tokenization of paragraphs/sentences

In [2]:
text = """Well, personally, if you ask me, I feel, you know, you have to be honest in life.
        You have to be honest to yourself. You have to be practical.
        You have to take risks in life, but at the same time, you have to be calculated.
        You can’t just say, okay, I took a risky option at some point of time, you have to be ready with,
        ready for the kind of talent that’s really needed to achieve what you want to achieve.
        But at the same, you know, you have to take risks in life. So, for me, being honest in life is very important,
        hard work that you have to put in, irrespective of what your profession is.
        The hard work, the honesty, respecting the elders which I feel is the key.
        You know, if you don’t respect the elders, be it your parents or be it anyone, you know,
        it becomes very difficult to be successful in life.
        Being humble, you know, try to be, when you enter let’s say any big building, you know,
        right from the first man you meet to maybe the managing director, you have to be the same to each and every one.
        So, that’s what life is all about. Go through the difficult periods, fight it out.
        But if you can do it with a smile, you know, you will become part of maybe the 5% persons,
        you know, who can actually do it. Because at times we crib about life, about the tough period,
        but what’s important is to go through the tough period that actually make you a better human being."""

In [10]:
len(text)

1501

**Tokenization:** Tokenization is the process of breaking a stream of text into individual words, phrases, symbols, or other meaningful elements, known as tokens. The tokens can then be processed further, for example for language modeling, machine translation, or information retrieval. Tokenization is a crucial preprocessing step in natural language processing (NLP), because later steps operate on tokens rather than on raw text. There are many approaches, and the right one depends on the desired characteristics of the tokens and the requirements of the task. Common approaches include using whitespace as a delimiter, using regular expressions to identify specific patterns, and matching against a predefined vocabulary of known tokens.
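The three approaches mentioned above can be sketched with the standard library alone. This is an illustration of the ideas, not how nltk implements its tokenizers; the sample sentence, the regular expression, the `vocab_tokenize` helper, and the toy vocabulary are all assumptions made for the example.

```python
import re

sample = "You can't just say, okay, I took a risky option."

# 1. Whitespace as the delimiter: punctuation stays attached to the words.
whitespace_tokens = sample.split()

# 2. Regular expression: a word (optionally with an internal apostrophe),
#    or any single character that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sample)


# 3. Predefined vocabulary: greedy longest-match segmentation of one word.
def vocab_tokenize(word, vocab, unknown="<unk>"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known token or becomes empty.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known token starts here: emit a placeholder, skip one character.
            tokens.append(unknown)
            start += 1
        else:
            tokens.append(word[start:end])
            start = end
    return tokens


print(whitespace_tokens)
print(regex_tokens)
print(vocab_tokenize("tokenization", {"token", "ization"}))
# → ['token', 'ization']
```

Note how the whitespace split keeps `say,` and `option.` as single tokens, while the regular expression separates the punctuation into tokens of its own.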

#### Tokenizing sentences

In [15]:
sentences = nltk.sent_tokenize(text)
print(sentences)

['Well, personally, if you ask me, I feel, you know, you have to be honest in life.', 'You have to be honest to yourself.', 'You have to be practical.', 'You have to take risks in life, but at the same time, you have to be calculated.', 'You can’t just say, okay, I took a risky option at some point of time, you have to be ready with,\n        ready for the kind of talent that’s really needed to achieve what you want to achieve.', 'But at the same, you know, you have to take risks in life.', 'So, for me, being honest in life is very important,\n        hard work that you have to put in, irrespective of what your profession is.', 'The hard work, the honesty, respecting the elders which I feel is the key.', 'You know, if you don’t respect the elders, be it your parents or be it anyone, you know,\n        it becomes very difficult to be successful in life.', 'Being humble, you know, try to be, when you enter let’s say any big building, you know,\n        right from the first man you meet t

In [11]:
len(sentences)

14
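Several of the sentences printed above still contain `\n` followed by a run of spaces, carried over from the line breaks and indentation of the triple-quoted string. If clean single-line sentences are wanted, one option is to collapse every run of whitespace into a single space. The helper name `collapse_whitespace` is made up for this sketch.

```python
def collapse_whitespace(sentence):
    """Replace each run of spaces, tabs, and newlines with one space."""
    # str.split() with no arguments splits on any whitespace run
    # and drops leading and trailing whitespace.
    return " ".join(sentence.split())


raw = ("So, for me, being honest in life is very important,\n"
       "        hard work that you have to put in, irrespective of what your profession is.")
print(collapse_whitespace(raw))
```

Applied to the tokenized sentences, this would be `[collapse_whitespace(s) for s in sentences]`.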

#### Tokenizing words

In [14]:
words = nltk.word_tokenize(text)
print(words)

['Well', ',', 'personally', ',', 'if', 'you', 'ask', 'me', ',', 'I', 'feel', ',', 'you', 'know', ',', 'you', 'have', 'to', 'be', 'honest', 'in', 'life', '.', 'You', 'have', 'to', 'be', 'honest', 'to', 'yourself', '.', 'You', 'have', 'to', 'be', 'practical', '.', 'You', 'have', 'to', 'take', 'risks', 'in', 'life', ',', 'but', 'at', 'the', 'same', 'time', ',', 'you', 'have', 'to', 'be', 'calculated', '.', 'You', 'can', '’', 't', 'just', 'say', ',', 'okay', ',', 'I', 'took', 'a', 'risky', 'option', 'at', 'some', 'point', 'of', 'time', ',', 'you', 'have', 'to', 'be', 'ready', 'with', ',', 'ready', 'for', 'the', 'kind', 'of', 'talent', 'that', '’', 's', 'really', 'needed', 'to', 'achieve', 'what', 'you', 'want', 'to', 'achieve', '.', 'But', 'at', 'the', 'same', ',', 'you', 'know', ',', 'you', 'have', 'to', 'take', 'risks', 'in', 'life', '.', 'So', ',', 'for', 'me', ',', 'being', 'honest', 'in', 'life', 'is', 'very', 'important', ',', 'hard', 'work', 'that', 'you', 'have', 'to', 'put', 'in',

In [12]:
len(words)

336
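In the word tokens above, `can’t` comes out as `'can'`, `'’'`, `'t'`. This appears to be because the transcript uses the typographic apostrophe (U+2019) rather than the ASCII `'` that the tokenizer's contraction rules look for. A small sketch of one way to normalize the text before tokenizing follows; `APOSTROPHE_MAP` and `normalize_apostrophes` are illustrative names.

```python
# Map typographic single quotes to the ASCII apostrophe.
APOSTROPHE_MAP = str.maketrans({"\u2019": "'", "\u2018": "'"})


def normalize_apostrophes(text):
    """Return text with curly single quotes replaced by the ASCII apostrophe."""
    return text.translate(APOSTROPHE_MAP)


print(normalize_apostrophes("You can’t just say, okay"))
# → You can't just say, okay
```

Passing `normalize_apostrophes(text)` to `nltk.word_tokenize` should then let the contraction rules apply, for example splitting `can't` into `ca` and `n't`. Because each replacement is one character for one character, `len(text)` stays the same, but the token count will differ from the 336 shown above.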