# NLP Assignment_2 solutions

Q.1. What are Corpora?

Ans: Corpora, in the context of linguistics and natural language processing (NLP), refer to large collections of texts or spoken language data that are used for linguistic analysis, language modeling, and various other language-related tasks. A corpus (plural: corpora) is essentially a structured and organized repository of linguistic data, typically gathered from different sources such as books, articles, websites, speeches, conversations, and more.

Corpora serve as valuable resources for studying language patterns, understanding linguistic phenomena, and developing computational models and algorithms for NLP tasks. Linguists and researchers analyze corpora to investigate aspects of language like syntax, semantics, pragmatics, discourse, and sociolinguistics. NLP practitioners and developers utilize corpora to train and evaluate machine learning models for tasks like text classification, machine translation, sentiment analysis, and speech recognition.

Corpora can be broadly categorized into two types:

1. Text Corpora: These corpora consist of written texts, such as books, articles, websites, newspapers, and social media posts. They can cover various domains and genres, allowing researchers to explore different styles and registers of language.

2. Spoken Corpora: These corpora contain transcribed or recorded speech data, capturing natural language usage in conversations, interviews, broadcasts, and other spoken contexts. Spoken corpora are valuable for studying phonetics, phonology, prosody, and discourse analysis.

Corpora can be manually created by collecting and annotating data specifically for a particular research project or task. Alternatively, they can be compiled from existing sources, such as publicly available texts or speech recordings. Corpora often undergo preprocessing steps, such as cleaning, tokenization, part-of-speech tagging, and syntactic parsing, to make the data suitable for analysis and modeling.

Examples of well-known corpora include the Brown Corpus, Penn Treebank, British National Corpus (BNC), Corpus of Contemporary American English (COCA), and many others, each tailored to specific research objectives or applications.

Q.2. What are Tokens?

Ans: In the context of natural language processing (NLP) and computational linguistics, tokens are the individual units into which text or speech data is divided. These units can be words, characters, or subword units, depending on the tokenization scheme used.

Tokenization is the process of breaking down a continuous stream of text or speech into discrete tokens. It serves as a fundamental step in NLP tasks, as it enables subsequent analysis, modeling, and processing of language data.

Here are a few key points about tokens:

1. Word Tokens: Word tokenization is the most common form of tokenization, where a text is divided into individual words. For example, the sentence "I love eating pizza" would be tokenized into four word tokens: ["I", "love", "eating", "pizza"].

2. Character Tokens: In some cases, the tokenization process may split the text into individual characters rather than words. This approach can be useful for certain NLP tasks, such as character-level language modeling or sentiment analysis at the character level.

3. Subword Tokens: Subword tokenization involves breaking down words into smaller units, which may or may not correspond to morphemes or syllables. This approach is particularly beneficial for languages with complex morphology or for handling out-of-vocabulary (OOV) words. Popular subword tokenization algorithms include Byte-Pair Encoding (BPE) and WordPiece; the SentencePiece library implements BPE and unigram language model tokenization.

4. Punctuation and Special Characters: Punctuation marks and special characters, like commas, periods, question marks, and hashtags, are typically treated as separate tokens in tokenization. They provide important contextual information for language understanding.

Tokenization not only breaks down text into meaningful units but also handles issues like contractions, hyphenated words, and punctuation marks. Tokenized data is often used as input for various NLP tasks, including part-of-speech tagging, named entity recognition, machine translation, sentiment analysis, and more.

It's worth noting that the specific tokenization approach chosen can significantly impact downstream NLP tasks and model performance. Therefore, tokenization should be carefully designed and tailored to suit the objectives of the task and the characteristics of the language being processed.
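
As a rough illustration of word-level and character-level tokenization, the sketch below uses only Python's standard `re` module. The regular expression is an assumption chosen for this example (it keeps contractions such as "don't" together and splits punctuation into separate tokens); real tokenizers handle many more cases.

```python
import re

text = "I love eating pizza, don't you?"

# Word-level tokens: a run of word characters (optionally followed by an
# apostrophe and more word characters), or a single punctuation mark.
word_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

# Character-level tokens: every character, including spaces.
char_tokens = list(text)

print(word_tokens)
# ['I', 'love', 'eating', 'pizza', ',', "don't", 'you', '?']
print(char_tokens[:6])
# ['I', ' ', 'l', 'o', 'v', 'e']
```

Changing the pattern changes the tokens, which is one concrete way the tokenization scheme affects downstream tasks.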

Q.3. What are Unigrams, Bigrams, Trigrams?

Ans: Unigrams, bigrams, and trigrams are terms used to describe different types of n-grams, which are contiguous sequences of n items (usually words) extracted from a text or corpus. N-grams are commonly used in natural language processing (NLP) and computational linguistics for various tasks such as language modeling, information retrieval, and text classification.

1. Unigrams: Unigrams are n-grams of size 1, meaning they consist of individual words in a text. Each word in a sentence is treated as a separate unigram. For example, the sentence "I love eating pizza" would have the following unigrams: ["I", "love", "eating", "pizza"]. Unigrams provide information about the occurrence and frequency of individual words in a text.

2. Bigrams: Bigrams are n-grams of size 2, where adjacent pairs of words are considered as a single unit. Each bigram represents a sequence of two consecutive words in the text. For example, using the same sentence as before, the bigrams would be: ["I love", "love eating", "eating pizza"]. Bigrams capture some level of word order information and can be useful for tasks like language modeling, text generation, and collocation analysis.

3. Trigrams: Trigrams are n-grams of size 3, meaning they consist of three consecutive words. Trigrams provide even more context and capture dependencies between three adjacent words. Using the same sentence, the trigrams would be: ["I love eating", "love eating pizza"]. Trigrams can be helpful in tasks like language modeling, machine translation, and text summarization.

N-grams of larger sizes (e.g., four-grams, five-grams) can also be used, but their practicality decreases as the size increases: the number of possible combinations grows exponentially with n, so most long n-grams occur rarely or never in a given corpus.

N-grams are often computed from a corpus or a specific text dataset to capture statistical patterns, frequency distributions, and relationships between words. They serve as features for various NLP applications, such as building language models, extracting features for text classification, sentiment analysis, and information retrieval.
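
To make the idea of frequency distributions concrete, here is a small sketch that counts unigrams and bigrams with `collections.Counter`. The sample sentence is made up for illustration, and the final ratio is just a relative-frequency estimate, not a smoothed language model.

```python
from collections import Counter

# Made-up sample text, already lowercased and split on whitespace.
tokens = "the cat sat on the mat and the cat slept".split()

unigram_counts = Counter(tokens)
# zip pairs each token with its successor, giving all adjacent pairs.
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(unigram_counts.most_common(2))
# [('the', 3), ('cat', 2)]
print(bigram_counts[("the", "cat")])
# 2

# Relative frequency of "cat" following "the" in this tiny sample.
p_cat_given_the = bigram_counts[("the", "cat")] / unigram_counts["the"]
print(round(p_cat_given_the, 3))
# 0.667
```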

Q.4. How to generate n-grams from text?

Ans: Generating n-grams from text involves the process of breaking down the text into contiguous sequences of n items, typically words. Here's a general approach to generate n-grams from a given text:

1. Text Preprocessing: Before generating n-grams, it's often helpful to perform some preprocessing steps to clean the text and remove noise. This can involve removing punctuation, converting text to lowercase, handling special characters, and eliminating stopwords (common words like "the," "and," "is," etc.) that do not carry significant meaning.

2. Tokenization: The next step is to tokenize the preprocessed text into individual words. You can use a tokenizer library or function in your programming language of choice to split the text into word tokens. The resulting tokens will form the basis for generating n-grams.

3. Generating n-grams: Once you have the word tokens, you can generate n-grams by iterating over the tokens and extracting contiguous sequences of n words. This can be done using a sliding window approach. For example, for bigrams (n=2), you would slide a window of size 2 over the tokens, capturing adjacent pairs of words.

   Here's an example Python code snippet that demonstrates how to generate bigrams from a given list of tokens:

```python
tokens = ["I", "love", "eating", "pizza"]

bigrams = []
for i in range(len(tokens) - 1):
    bigrams.append((tokens[i], tokens[i+1]))

print(bigrams)
```

This code will output: `[('I', 'love'), ('love', 'eating'), ('eating', 'pizza')]`.

You can extend this approach to generate trigrams or higher-order n-grams by adjusting the window size accordingly.
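
One way to generalize the window size is sketched below. The function name `ngrams` is chosen for this example and is not taken from any particular library.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples from `tokens`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The last valid window starts at index len(tokens) - n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "love", "eating", "pizza"]
print(ngrams(tokens, 3))
# [('I', 'love', 'eating'), ('love', 'eating', 'pizza')]
```

If `n` is larger than the number of tokens, the function returns an empty list.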

It's worth noting that the choice of tokenization approach and any additional preprocessing steps may vary depending on the specific requirements of your task and the characteristics of the text data. Additionally, libraries or NLP frameworks often provide built-in functions for generating n-grams, making the process more efficient and convenient.
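
The three steps (preprocessing, tokenization, n-gram generation) can be chained as in the following sketch. The stopword list, the helper names `preprocess` and `bigrams`, and the sample sentence are all assumptions made for illustration.

```python
import re

STOPWORDS = {"the", "and", "is", "a", "of"}  # tiny illustrative list

def preprocess(text):
    """Lowercase, replace punctuation with spaces, split, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return [token for token in text.split() if token not in STOPWORDS]

def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))

tokens = preprocess("The pizza is hot, and the cheese is melting!")
print(tokens)
# ['pizza', 'hot', 'cheese', 'melting']
print(bigrams(tokens))
# [('pizza', 'hot'), ('hot', 'cheese'), ('cheese', 'melting')]
```

Note that removing stopwords before building n-grams makes words adjacent that were not adjacent in the original text, so whether to do it depends on the task.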

Q.5. Explain Lemmatization

Ans: Lemmatization is a linguistic process used in natural language processing (NLP) to reduce words to their base or canonical form, known as the lemma. The lemma represents the dictionary form or the common root of a word, disregarding its inflections or variations.

The purpose of lemmatization is to normalize words, ensuring that different forms of the same word are treated as a single entity. For example, the lemmatization process would convert words like "walking," "walked," and "walks" to their lemma "walk."

Here are some key points about lemmatization:

1. Reducing Word Variations: Lemmatization aims to reduce words to their base forms, taking into account morphological analysis and grammatical context. It maps each inflected form to the lemma that represents the word's essential meaning, either by stripping inflectional endings or by looking up irregular forms (e.g., "better" to "good").

2. Linguistic Considerations: Unlike stemming, which uses heuristics to chop off word endings, lemmatization relies on dictionaries or morphological rules specific to a language. It considers factors such as part of speech (noun, verb, adjective, etc.) and grammatical features to generate accurate lemmas.

3. Improved Coherence: Lemmatization helps in creating more coherent and semantically meaningful representations of text. By reducing words to their base form, it brings together different surface forms of a word, making it easier to identify relationships between words and extract meaningful insights from the text.

4. Lemmatization vs. Stemming: Lemmatization is often considered more advanced and accurate than stemming. Stemming typically applies simple rules to remove word endings, resulting in the stem, which may not always be a valid word. In contrast, lemmatization produces lemmas that are actual words and preserve the core meaning of the word.

5. Part-of-Speech Tagging: In many lemmatization systems, part-of-speech (POS) tagging is performed alongside the lemmatization process. POS tagging helps disambiguate words with multiple meanings based on their role in the sentence. For instance, "running" could be a verb or an adjective, and POS tagging helps determine the correct lemma based on the context.

Lemmatization is commonly used in various NLP applications, including information retrieval, question-answering systems, machine translation, text classification, and sentiment analysis. It aids in improving the accuracy of these tasks by reducing word variations and aligning similar word forms under a common lemma.
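
The role of the part of speech can be illustrated with a toy dictionary-based lemmatizer. This is only a sketch: the lookup table `LEMMA_TABLE` and the function `lemmatize` are hypothetical, and a real lemmatizer would rely on a full lexical resource and morphological rules.

```python
# Toy lookup table keyed by (word, part of speech). The entries are
# hand-picked for illustration only.
LEMMA_TABLE = {
    ("walking", "VERB"): "walk",
    ("walked", "VERB"): "walk",
    ("walks", "VERB"): "walk",
    ("better", "ADJ"): "good",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def lemmatize(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("walked", "VERB"))   # walk
print(lemmatize("meeting", "VERB"))  # meet
print(lemmatize("meeting", "NOUN"))  # meeting
```

The same surface form "meeting" gets a different lemma depending on its tag, which is why POS information matters for lemmatization.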

Q.6. Explain Stemming

Ans: Stemming is a linguistic process used in natural language processing (NLP) to reduce words to their base or root form, known as the stem. The stem is obtained by removing prefixes, suffixes, and other affixes from a word while attempting to retain its core meaning. The purpose of stemming is to normalize words and reduce them to a common form, treating different variations of a word as the same entity.

Here are some key points about stemming:

1. Word Stemming: Stemming aims to simplify words by removing derivational and inflectional affixes. Derivational affixes change the word's part of speech or meaning (e.g., "happy" to "happiness"), while inflectional affixes alter the word's grammatical function or tense (e.g., "walk" to "walked"). The stemming process applies heuristics or rule-based methods to cut off these affixes and obtain the stem.

2. Heuristic Approach: Stemming algorithms follow pattern-based rules and heuristics to identify and remove affixes. They don't rely on linguistic knowledge or contextual analysis but rather use straightforward rules to trim word endings. This may result in stems that are not always actual words but instead represent a generalized form.

3. Language Variations: Stemming algorithms vary based on the language being processed since different languages have different morphological rules and affix patterns. Popular options include the Porter stemming algorithm for English and the Snowball framework, which provides stemmers for multiple languages.

4. Performance and Accuracy: Stemming algorithms are computationally efficient, making them suitable for large-scale text processing. However, stemming is a more aggressive normalization technique compared to lemmatization, as it may produce stems that are not actual words and can potentially introduce ambiguity.

5. Example: Consider the word "running." Applying a stemming algorithm to it would likely produce the stem "run." Similarly, words like "walked," "walks," and "walking" would all be stemmed to "walk." The goal is to bring together different surface forms of a word to simplify text analysis and improve information retrieval.

Stemming is commonly used in various NLP tasks such as information retrieval, search engines, and text mining, where the focus is on word frequency and basic analysis rather than preserving the semantic meaning of the words. However, in cases where the accurate identification of word forms and their semantic relationships is essential, lemmatization is often preferred over stemming.
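
The heuristic nature of stemming can be seen in a minimal suffix-stripping sketch. This is not the Porter algorithm; the suffix list, the length threshold, and the function name `simple_stem` are simplifying assumptions made for this example.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def simple_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant (except l and s),
            # e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

words = ["running", "walked", "walks", "walking", "studies"]
print([simple_stem(w) for w in words])
# ['run', 'walk', 'walk', 'walk', 'studi']
```

The stem "studi" is not an actual word, which shows how rule-based stripping differs from lemmatization.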

Q.7. Explain Part-of-speech (POS) tagging

Ans: Part-of-speech (POS) tagging, also known as POS labeling or grammatical tagging, is a process in natural language processing (NLP) that assigns grammatical labels or tags to each word in a sentence, indicating its syntactic category and role in the sentence. The assigned POS tags provide information about the word's part of speech, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, and more.

Here are some key points about POS tagging:

1. Grammatical Categorization: POS tagging involves classifying each word into specific grammatical categories based on its syntactic function in the sentence. These categories capture the word's role, behavior, and relationships with other words, which is crucial for understanding the structure and meaning of a sentence.

2. Disambiguation: POS tagging helps disambiguate words with multiple meanings. Many words in natural language have different grammatical roles depending on the context in which they appear. POS tagging considers the context of the word in the sentence and assigns the appropriate tag to disambiguate its meaning.

3. Context Dependency: POS tagging takes into account the surrounding words and the grammatical context to assign the appropriate tag. For instance, the word "run" can be a noun or a verb, and POS tagging helps differentiate between "run" as a noun (e.g., "a morning run") and "run" as a verb (e.g., "to run a marathon").

4. Tag Sets: POS tags are defined by specific tag sets, which may vary depending on the linguistic framework or language being analyzed. Popular tag sets include the Penn Treebank tag set for English and the Universal POS tag set used by Universal Dependencies, which aims for cross-linguistic consistency.

5. POS Tagging Techniques: POS tagging can be accomplished using rule-based approaches, statistical methods, or machine learning techniques. Rule-based taggers rely on handcrafted grammatical rules to assign tags, statistical taggers employ probabilistic models trained on annotated corpora, and machine learning taggers utilize supervised or unsupervised learning algorithms to learn patterns and make predictions.

6. Applications: POS tagging plays a crucial role in various NLP applications, such as part-of-speech-based parsing, named entity recognition, information retrieval, sentiment analysis, machine translation, text generation, and more. It provides valuable linguistic information that helps improve the accuracy and understanding of these tasks.

Overall, POS tagging is a fundamental step in NLP that assigns grammatical labels to words, aiding in syntactic analysis, semantic understanding, and subsequent language processing tasks.
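
The context dependency described above can be sketched with a toy rule-based tagger. The lexicon, the single disambiguation rule for "run", and the coarse tag names are all assumptions made for this example; practical taggers are trained on annotated corpora.

```python
# Tiny hand-written lexicon with coarse tags (illustrative only).
LEXICON = {
    "a": "DET", "the": "DET", "to": "PART", "i": "PRON",
    "for": "ADP", "went": "VERB", "want": "VERB",
    "morning": "NOUN", "marathon": "NOUN",
}
AMBIGUOUS = {"run"}  # may be a NOUN or a VERB

def tag(tokens):
    """Assign a tag to each token, using the previous tag for ambiguous words."""
    tags = []
    for token in tokens:
        word = token.lower()
        if word in AMBIGUOUS:
            previous = tags[-1] if tags else None
            # After a determiner, adjective, or noun, read it as a noun;
            # otherwise read it as a verb.
            tags.append("NOUN" if previous in ("DET", "ADJ", "NOUN") else "VERB")
        else:
            # Unknown words default to NOUN in this sketch.
            tags.append(LEXICON.get(word, "NOUN"))
    return list(zip(tokens, tags))

print(tag("I went for a morning run".split()))
print(tag("I want to run a marathon".split()))
```

In the first sentence "run" follows a noun and is tagged NOUN; in the second it follows "to" and is tagged VERB.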

Q.8. Explain Chunking or shallow parsing

Ans: Chunking, also known as shallow parsing or partial parsing, is a natural language processing (NLP) technique that involves grouping words together into meaningful syntactic units called chunks. Chunking aims to identify and extract phrases such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP) from a sentence, without producing the complete syntactic parse tree that full parsing provides.

Here are some key points about chunking:

1. Chunk Types: Chunking identifies and groups words based on their syntactic role and relationships within a sentence. Common chunk types include noun phrases (e.g., "the cat"), verb phrases (e.g., "ate the fish"), prepositional phrases (e.g., "on the table"), and others, depending on the specific chunking scheme or linguistic framework used.

2. Grammatical Patterns: Chunking relies on grammatical patterns and rules to identify and label chunks. These patterns are often defined using regular expressions or finite-state grammar rules that capture specific sequences of part-of-speech (POS) tags or lexical patterns indicative of particular chunk types.

3. Chunking Process: The chunking process typically involves three steps: tokenization, part-of-speech tagging, and chunk identification. First, the sentence is tokenized into words or tokens. Then, each token is assigned a part-of-speech tag, indicating its grammatical category. Finally, the POS-tagged sentence is processed to identify and group words into chunks based on the defined chunking rules.

4. Example: Consider the sentence "The black cat sat on the mat." A simple chunking scheme might identify the following chunks: [The black cat] [sat] [on the mat]. Here, the chunks are indicated by square brackets, and each chunk represents a meaningful syntactic unit.

5. Applications: Chunking has various applications in NLP. It can be used to extract noun phrases for information extraction tasks, identify verb phrases for semantic analysis, or extract prepositional phrases for understanding relationships between entities. Chunking serves as an intermediate step in more complex parsing tasks and can be useful for tasks like named entity recognition, text summarization, and information retrieval.

Chunking differs from full parsing in that it does not produce a complete parse tree or provide a detailed analysis of the grammatical structure. Instead, it focuses on identifying and extracting chunks of words that form meaningful units within a sentence, offering a less complex but still valuable representation of the syntactic information present in the text.
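
A minimal rule-based chunker for the example sentence could look like the sketch below. It assumes the sentence is already tokenized and tagged with Penn Treebank-style tags; the rules (NP as an optional determiner, any adjectives, then nouns; PP as a preposition plus an NP; VP as a run of verbs) and the function names are simplifications chosen for illustration.

```python
def scan_np(tagged, i):
    """Return the index just past a DT? JJ* NN+ sequence starting at i, or i if none."""
    j = i
    if j < len(tagged) and tagged[j][1] == "DT":
        j += 1
    while j < len(tagged) and tagged[j][1] == "JJ":
        j += 1
    k = j
    while k < len(tagged) and tagged[k][1].startswith("NN"):
        k += 1
    return k if k > j else i  # require at least one noun

def chunk(tagged):
    """Group (word, tag) pairs into labeled chunks; unmatched tokens get 'O'."""
    chunks, i = [], 0
    while i < len(tagged):
        pos = tagged[i][1]
        if pos == "IN":  # preposition followed by a noun phrase
            end = scan_np(tagged, i + 1)
            label = "PP" if end > i + 1 else "O"
        elif pos.startswith("VB"):  # run of verb tokens
            end = i + 1
            while end < len(tagged) and tagged[end][1].startswith("VB"):
                end += 1
            label = "VP"
        else:
            end = scan_np(tagged, i)
            label = "NP" if end > i else "O"
        end = max(end, i + 1)  # always consume at least one token
        chunks.append((label, [word for word, _ in tagged[i:end]]))
        i = end
    return chunks

tagged = [("The", "DT"), ("black", "JJ"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(chunk(tagged))
# [('NP', ['The', 'black', 'cat']), ('VP', ['sat']), ('PP', ['on', 'the', 'mat'])]
```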

Q.9. Explain Noun Phrase (NP) chunking

Ans: Noun Phrase (NP) chunking, also known as NP chunking or noun chunking, is a specific type of chunking in natural language processing (NLP) that focuses on identifying and extracting noun phrases from a sentence. Noun phrases are syntactic units that consist of a noun and any modifiers, determiners, or adjectives that accompany it.

Here are some key points about noun phrase chunking:

1. Definition of Noun Phrase: A noun phrase (NP) is a phrase that includes a noun as its head and may contain other words that modify or describe the noun. It typically consists of determiners (such as articles), adjectives, quantifiers, and other modifiers that provide additional information about the noun. For example, in the phrase "The big red apple," "apple" is the head noun, and "The," "big," and "red" are its determiner and modifiers.

2. Purpose of NP Chunking: NP chunking aims to identify and extract noun phrases from sentences. It is particularly useful in various NLP tasks, such as information extraction, named entity recognition, text summarization, and relation extraction. By identifying and extracting noun phrases, the structure and key elements of a sentence can be better understood and utilized in downstream applications.

3. Chunking Process: NP chunking typically involves a sequence of steps, including tokenization, part-of-speech (POS) tagging, and chunk identification. First, the sentence is tokenized into individual words or tokens. Then, each token is assigned a part-of-speech tag, which indicates its grammatical category. Finally, the POS-tagged sentence is processed to identify and extract noun phrases based on specific grammatical patterns and rules.

4. Example: Consider the sentence "John bought a beautiful house." NP chunking would identify two noun phrases: "John," a proper noun standing alone, and "a beautiful house," where "house" is the head noun and "a" and "beautiful" are modifiers. The chunking result would be [John] bought [a beautiful house], where each noun phrase is indicated by square brackets.

5. Noun Phrase Chunking Techniques: NP chunking can be performed using rule-based approaches or statistical methods. Rule-based techniques often rely on patterns defined using regular expressions or grammatical rules to identify noun phrases based on specific POS tag patterns. Statistical methods utilize machine learning algorithms trained on annotated data to predict noun phrases based on features such as POS tags, word context, and syntactic dependencies.

Noun phrase chunking is a valuable technique in NLP for extracting and understanding the key elements and structure of sentences. It helps in capturing important noun phrases that convey significant information and context, enabling more accurate and effective text processing and analysis.
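
The rule-based approach with regular expressions over POS tag patterns can be sketched as follows. The pattern (optional determiner, any adjectives, one or more nouns), the one-letter tag encoding, and the function name `np_chunks` are assumptions made for this example, and the input is assumed to be tagged already with Penn Treebank-style tags.

```python
import re

def np_chunks(tagged):
    """Find noun phrases matching DT? JJ* NN+ in a POS-tagged sentence."""
    def code(tag):
        # Encode each tag as one character so that regex match offsets
        # line up with token indices.
        if tag == "DT":
            return "D"
        if tag.startswith("JJ"):
            return "J"
        if tag.startswith("NN"):
            return "N"
        return "O"

    codes = "".join(code(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"D?J*N+", codes)
    ]

tagged = [("John", "NNP"), ("bought", "VBD"), ("a", "DT"),
          ("beautiful", "JJ"), ("house", "NN")]
print(np_chunks(tagged))
# ['John', 'a beautiful house']
```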

Q.10. Explain Named Entity Recognition

Ans: Named Entity Recognition (NER) is a natural language processing (NLP) task that involves identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, dates, numerical quantities, and more. Named entities are words or phrases, typically proper nouns, that refer to unique individuals, places, organizations, and other real-world entities.

Here are some key points about Named Entity Recognition:

1. Definition of Named Entities: Named entities are specific words or phrases that refer to unique entities in the real world, such as people, places, organizations, dates, products, and more. For example, in the sentence "Apple Inc. was founded by Steve Jobs in 1976," the named entities are "Apple Inc.," "Steve Jobs," and "1976."

2. Purpose of NER: The primary goal of NER is to identify and classify named entities in text, enabling the extraction of valuable information and facilitating various downstream NLP tasks. NER is essential in applications such as information extraction, question answering, chatbots, sentiment analysis, and text summarization.

3. Approaches to NER: NER can be performed using rule-based approaches, statistical models, or deep learning techniques. Rule-based approaches use handcrafted patterns or regular expressions to match named entities based on specific rules or patterns. Statistical models, such as Conditional Random Fields (CRF) or Hidden Markov Models (HMM), learn patterns from annotated training data to make predictions. Deep learning models, such as recurrent neural networks (RNNs) or transformers, can also be used for NER, leveraging large annotated datasets and powerful contextual representations.

4. Training Data and Annotation: NER models require annotated training data, where human annotators label the named entities in a corpus of text. The annotators mark the boundaries of the named entities and assign them to appropriate predefined categories. This annotated data is used to train the NER model, which can then generalize to recognize named entities in unseen text.

5. Evaluation: NER systems are evaluated based on metrics such as precision, recall, and F1 score. Precision measures the proportion of correctly identified named entities out of all the identified entities, while recall measures the proportion of correctly identified entities out of all the true entities. The F1 score is the harmonic mean of precision and recall, providing a balanced measure of performance.

6. Challenges in NER: NER can be challenging due to the ambiguity and variability of named entities in text. It requires handling entity mentions that can have multiple words, handling unseen entities, dealing with out-of-vocabulary words, and resolving entity boundaries in complex sentences. Additionally, multilingual NER introduces additional challenges, including language-specific characteristics and lack of annotated data.

Named Entity Recognition is a crucial NLP task that plays a vital role in information extraction and understanding unstructured text. By identifying and categorizing named entities, NER enables more advanced analysis, retrieval, and interpretation of textual data, contributing to a wide range of applications across industries.
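
The evaluation metrics mentioned above can be computed as in the sketch below. It makes a simplifying assumption: each entity is represented as a (text, label) pair and counted as correct only on an exact match, whereas real evaluations usually compare entity spans by position as well. The predicted set is a made-up system output for the example sentence.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 over sets of (text, label) pairs."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Gold annotations for "Apple Inc. was founded by Steve Jobs in 1976".
gold = {("Apple Inc.", "ORG"), ("Steve Jobs", "PERSON"), ("1976", "DATE")}
# Hypothetical system output: one correct entity, one wrong boundary.
predicted = {("Apple Inc.", "ORG"), ("Steve", "PERSON")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))
# 0.5 0.333 0.4
```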