# Implementation of POS-Tagging

**Part-of-speech (POS)** tagging is the process of assigning a grammatical tag or label to each word in a sentence to identify its syntactic function. Tags are typically assigned either by a set of hand-written linguistic rules or by a machine learning model trained on a labeled corpus.

For example, consider the sentence: `"The cat sat on the mat"`. A POS tagger would assign a tag to each word as follows:

* `"The"` is a determiner
* `"cat"` is a noun
* `"sat"` is a verb
* `"on"` is a preposition
* `"the"` is a determiner
* `"mat"` is a noun

The complete sequence of POS tags for this sentence would be: `DET NOUN VERB ADP DET NOUN`.
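As a rough illustration of the mapping above, the sketch below tags the example sentence with a hand-written lookup table. The `TOY_LEXICON` dictionary and the `toy_pos_tag` function are hypothetical names made up for this example. A real tagger uses a trained model and sentence context rather than a fixed dictionary.

```python
# Toy lookup-based tagger: illustrative only, not how a trained tagger works.
TOY_LEXICON = {
    "the": "DET",
    "cat": "NOUN",
    "sat": "VERB",
    "on": "ADP",
    "mat": "NOUN",
}

def toy_pos_tag(sentence):
    """Return (word, tag) pairs; unknown words get the placeholder tag 'X'."""
    return [(word, TOY_LEXICON.get(word.lower(), "X")) for word in sentence.split()]

tagged = toy_pos_tag("The cat sat on the mat")
print(" ".join(tag for _, tag in tagged))  # DET NOUN VERB ADP DET NOUN
```

A lookup table cannot handle ambiguous words (for example "change" as a noun or a verb), which is why practical taggers rely on context.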

POS tagging is useful in many natural language processing applications such as text-to-speech systems, machine translation, and text analysis. By identifying the parts of speech in a sentence, a computer program can better understand the meaning and structure of the text, and can use that information to perform various tasks.

## spaCy

**spaCy** is a popular open-source natural language processing (NLP) library that can be used for many NLP tasks, including **part-of-speech (POS)** tagging.

spaCy's POS tagging functionality is based on statistical models that have been trained on large annotated corpora. These models use contextual information to predict the most likely POS tag for each word in a given sentence or document. spaCy also provides pre-trained models for several languages, including English, Spanish, French, and German.

To perform POS tagging with spaCy, you typically load a language-specific model first and then pass the text you wish to tag through the model's pipeline.

#install spacy
!pip install spacy

In [1]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

import pandas as pd

# This will help view all the text
pd.set_option('display.max_colwidth', None)



If this cell fails with an error such as `Can't find model 'en_core_web_sm'`, install the `en_core_web_sm` model by running the following command in your terminal or command prompt:

**`python -m spacy download en_core_web_sm`**


In [2]:
# Load the sample corpus
df = pd.read_csv('tfidf.csv')
df

Unnamed: 0,documentId,text,category
0,1293,Climate change is a pressing issue that affects us all.,climate
1,1294,"The Earth's temperature is rising due to human activities, such as the burning of fossil fuels and deforestation.",climate
2,1295,"This has led to more extreme weather events, including hurricanes, floods, and droughts.",climate
3,1296,"The consequences of climate change are already being felt around the world, with vulnerable populations, such as the poor and marginalized, bearing the brunt of the impact.",climate
4,1297,It's crucial that we take immediate action to reduce our carbon footprint and mitigate the effects of climate change.,climate


In [3]:
# Define a function to apply POS tagging to each sentence
def pos_tagging(text):
    doc = nlp(text)
    return [(token.text, token.pos_) for token in doc]

# Apply the function to the 'text' column of the DataFrame
df['pos_tags'] = df['text'].apply(pos_tagging)

In [9]:
df

Unnamed: 0,documentId,text,category,pos_tags
0,1293,Climate change is a pressing issue that affects us all.,climate,"[(Climate, NOUN), (change, NOUN), (is, AUX), (a, DET), (pressing, VERB), (issue, NOUN), (that, PRON), (affects, VERB), (us, PRON), (all, PRON), (., PUNCT)]"
1,1294,"The Earth's temperature is rising due to human activities, such as the burning of fossil fuels and deforestation.",climate,"[(The, DET), (Earth, PROPN), ('s, PART), (temperature, NOUN), (is, AUX), (rising, VERB), (due, ADP), (to, ADP), (human, ADJ), (activities, NOUN), (,, PUNCT), (such, ADJ), (as, ADP), (the, DET), (burning, NOUN), (of, ADP), (fossil, ADJ), (fuels, NOUN), (and, CCONJ), (deforestation, NOUN), (., PUNCT)]"
2,1295,"This has led to more extreme weather events, including hurricanes, floods, and droughts.",climate,"[(This, PRON), (has, AUX), (led, VERB), (to, ADP), (more, ADV), (extreme, ADJ), (weather, NOUN), (events, NOUN), (,, PUNCT), (including, VERB), (hurricanes, NOUN), (,, PUNCT), (floods, NOUN), (,, PUNCT), (and, CCONJ), (droughts, NOUN), (., PUNCT)]"
3,1296,"The consequences of climate change are already being felt around the world, with vulnerable populations, such as the poor and marginalized, bearing the brunt of the impact.",climate,"[(The, DET), (consequences, NOUN), (of, ADP), (climate, NOUN), (change, NOUN), (are, AUX), (already, ADV), (being, AUX), (felt, VERB), (around, ADP), (the, DET), (world, NOUN), (,, PUNCT), (with, ADP), (vulnerable, ADJ), (populations, NOUN), (,, PUNCT), (such, ADJ), (as, ADP), (the, DET), (poor, ADJ), (and, CCONJ), (marginalized, VERB), (,, PUNCT), (bearing, VERB), (the, DET), (brunt, NOUN), (of, ADP), (the, DET), (impact, NOUN), (., PUNCT)]"
4,1297,It's crucial that we take immediate action to reduce our carbon footprint and mitigate the effects of climate change.,climate,"[(It, PRON), ('s, AUX), (crucial, ADJ), (that, SCONJ), (we, PRON), (take, VERB), (immediate, ADJ), (action, NOUN), (to, PART), (reduce, VERB), (our, PRON), (carbon, NOUN), (footprint, NOUN), (and, CCONJ), (mitigate, VERB), (the, DET), (effects, NOUN), (of, ADP), (climate, NOUN), (change, NOUN), (., PUNCT)]"


In [4]:
# Define a function that returns only the POS tag of each token
def pos_tags(text):
    doc = nlp(text)
    return [token.pos_ for token in doc]

# Apply the function to the 'text' column of the DataFrame
df['pos'] = df['text'].apply(pos_tags)

In [19]:
df

Unnamed: 0,documentId,text,category,pos_tags,pos
0,1293,Climate change is a pressing issue that affects us all.,climate,"[(Climate, NOUN), (change, NOUN), (is, AUX), (a, DET), (pressing, VERB), (issue, NOUN), (that, PRON), (affects, VERB), (us, PRON), (all, PRON), (., PUNCT)]","[NOUN, NOUN, AUX, DET, VERB, NOUN, PRON, VERB, PRON, PRON, PUNCT]"
1,1294,"The Earth's temperature is rising due to human activities, such as the burning of fossil fuels and deforestation.",climate,"[(The, DET), (Earth, PROPN), ('s, PART), (temperature, NOUN), (is, AUX), (rising, VERB), (due, ADP), (to, ADP), (human, ADJ), (activities, NOUN), (,, PUNCT), (such, ADJ), (as, ADP), (the, DET), (burning, NOUN), (of, ADP), (fossil, ADJ), (fuels, NOUN), (and, CCONJ), (deforestation, NOUN), (., PUNCT)]","[DET, PROPN, PART, NOUN, AUX, VERB, ADP, ADP, ADJ, NOUN, PUNCT, ADJ, ADP, DET, NOUN, ADP, ADJ, NOUN, CCONJ, NOUN, PUNCT]"
2,1295,"This has led to more extreme weather events, including hurricanes, floods, and droughts.",climate,"[(This, PRON), (has, AUX), (led, VERB), (to, ADP), (more, ADV), (extreme, ADJ), (weather, NOUN), (events, NOUN), (,, PUNCT), (including, VERB), (hurricanes, NOUN), (,, PUNCT), (floods, NOUN), (,, PUNCT), (and, CCONJ), (droughts, NOUN), (., PUNCT)]","[PRON, AUX, VERB, ADP, ADV, ADJ, NOUN, NOUN, PUNCT, VERB, NOUN, PUNCT, NOUN, PUNCT, CCONJ, NOUN, PUNCT]"
3,1296,"The consequences of climate change are already being felt around the world, with vulnerable populations, such as the poor and marginalized, bearing the brunt of the impact.",climate,"[(The, DET), (consequences, NOUN), (of, ADP), (climate, NOUN), (change, NOUN), (are, AUX), (already, ADV), (being, AUX), (felt, VERB), (around, ADP), (the, DET), (world, NOUN), (,, PUNCT), (with, ADP), (vulnerable, ADJ), (populations, NOUN), (,, PUNCT), (such, ADJ), (as, ADP), (the, DET), (poor, ADJ), (and, CCONJ), (marginalized, VERB), (,, PUNCT), (bearing, VERB), (the, DET), (brunt, NOUN), (of, ADP), (the, DET), (impact, NOUN), (., PUNCT)]","[DET, NOUN, ADP, NOUN, NOUN, AUX, ADV, AUX, VERB, ADP, DET, NOUN, PUNCT, ADP, ADJ, NOUN, PUNCT, ADJ, ADP, DET, ADJ, CCONJ, VERB, PUNCT, VERB, DET, NOUN, ADP, DET, NOUN, PUNCT]"
4,1297,It's crucial that we take immediate action to reduce our carbon footprint and mitigate the effects of climate change.,climate,"[(It, PRON), ('s, AUX), (crucial, ADJ), (that, SCONJ), (we, PRON), (take, VERB), (immediate, ADJ), (action, NOUN), (to, PART), (reduce, VERB), (our, PRON), (carbon, NOUN), (footprint, NOUN), (and, CCONJ), (mitigate, VERB), (the, DET), (effects, NOUN), (of, ADP), (climate, NOUN), (change, NOUN), (., PUNCT)]","[PRON, AUX, ADJ, SCONJ, PRON, VERB, ADJ, NOUN, PART, VERB, PRON, NOUN, NOUN, CCONJ, VERB, DET, NOUN, ADP, NOUN, NOUN, PUNCT]"


It's generally recommended to perform **part-of-speech (POS)** tagging on the original text, before text cleaning. This is because text cleaning can modify the original text and change the context of words, which can reduce the accuracy of POS tagging.

For example, if a text contains contractions like `"didn't"` or `"couldn't"`, removing the apostrophes during text cleaning produces `"didnt"` and `"couldnt"`, which the tokenizer no longer recognizes as contractions and treats as single unknown words. This can lead to incorrect POS tags being assigned to those words.

However, specific types of text cleaning, such as removing stop words or stemming, can still be useful for the downstream analysis. Removing stop words like "the" or "and" helps focus on the more informative words in a text, and stemming reduces the number of unique words. Because the tagger relies on function words and word endings as context, these steps are usually safer to apply after the tags have been assigned.

In summary, it's generally best to tag first and clean afterwards, but the specific type of text cleaning and the goals of the analysis should be taken into account to determine the best approach for a particular application.
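The effect of apostrophe removal can be sketched without any NLP library. The `toy_tokenize` helper below is a hypothetical stand-in that loosely imitates how English tokenizers split off clitics such as `n't` and `'s` (compare `('s, AUX)` in the output above). It is not spaCy's actual tokenizer.

```python
import re

# Toy rule: split a trailing "n't" or "'s" off a word.
# Hypothetical helper for illustration, not spaCy's tokenizer.
CLITIC = re.compile(r"^(\w+?)(n't|'s)$")

def toy_tokenize(text):
    tokens = []
    for word in text.split():
        match = CLITIC.match(word)
        if match:
            tokens.extend(match.groups())  # e.g. "didn't" -> "did", "n't"
        else:
            tokens.append(word)
    return tokens

raw = "She didn't go"
cleaned = raw.replace("'", "")  # a typical "remove punctuation" cleaning step

print(toy_tokenize(raw))      # ['She', 'did', "n't", 'go']
print(toy_tokenize(cleaned))  # ['She', 'didnt', 'go']
```

After cleaning, the auxiliary and the negation are fused into one out-of-vocabulary string, so a tagger has less to work with.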

## Model Training

In [7]:
X = df['text'].apply(pos_tags)
X

0                                                                                                                  [NOUN, NOUN, AUX, DET, VERB, NOUN, PRON, VERB, PRON, PRON, PUNCT]
1                                                           [DET, PROPN, PART, NOUN, AUX, VERB, ADP, ADP, ADJ, NOUN, PUNCT, ADJ, ADP, DET, NOUN, ADP, ADJ, NOUN, CCONJ, NOUN, PUNCT]
2                                                                            [PRON, AUX, VERB, ADP, ADV, ADJ, NOUN, NOUN, PUNCT, VERB, NOUN, PUNCT, NOUN, PUNCT, CCONJ, NOUN, PUNCT]
3    [DET, NOUN, ADP, NOUN, NOUN, AUX, ADV, AUX, VERB, ADP, DET, NOUN, PUNCT, ADP, ADJ, NOUN, PUNCT, ADJ, ADP, DET, ADJ, CCONJ, VERB, PUNCT, VERB, DET, NOUN, ADP, DET, NOUN, PUNCT]
4                                                       [PRON, AUX, ADJ, SCONJ, PRON, VERB, ADJ, NOUN, PART, VERB, PRON, NOUN, NOUN, CCONJ, VERB, DET, NOUN, ADP, NOUN, NOUN, PUNCT]
Name: text, dtype: object

In [8]:
# Collect the distinct POS tags and map each one to a vector index
all_tags = set(tag for tags in X for tag in tags)
tag_index = {tag: i for i, tag in enumerate(all_tags)}


In [9]:
all_tags

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'NOUN',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'VERB'}

In [10]:
tag_index

{'PART': 0,
 'AUX': 1,
 'ADV': 2,
 'PROPN': 3,
 'SCONJ': 4,
 'VERB': 5,
 'DET': 6,
 'NOUN': 7,
 'PRON': 8,
 'ADJ': 9,
 'PUNCT': 10,
 'ADP': 11,
 'CCONJ': 12}

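Convert each list of POS tags in `X` to a fixed-length binary vector using the `one_hot_encode` function. The function takes a list of POS tags as input and returns a vector whose i-th element is 1 if the tag with index i in `tag_index` occurs in the input, and 0 otherwise. Strictly speaking, this is a multi-hot (presence) encoding: several positions can be 1 at once, and the order and count of the tags are discarded.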



In [11]:
# Encode a list of tags as a binary presence vector
def one_hot_encode(tags):
    vec = [0] * len(all_tags)
    for tag in tags:
        vec[tag_index[tag]] = 1
    return vec

X = [one_hot_encode(tags) for tags in X]

In [14]:
X

[[0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0],
 [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
 [0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1],
 [0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1],
 [1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
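Because `all_tags` is a Python set, the order in which `enumerate` visits it, and therefore the column assigned to each tag, can differ between runs. One possible way to make the encoding reproducible is to sort the tags before building the index. In the minimal sketch below, `build_tag_index` and `encode_presence` are hypothetical names, and the two tag lists are copied from rows 0 and 2 of the output above.

```python
def build_tag_index(tag_lists):
    """Map each distinct tag to a column, in alphabetical order."""
    distinct = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: i for i, tag in enumerate(distinct)}

def encode_presence(tags, index):
    """Binary vector: 1 if the tag occurs in the sentence at least once."""
    vec = [0] * len(index)
    for tag in tags:
        vec[index[tag]] = 1
    return vec

row0 = ["NOUN", "NOUN", "AUX", "DET", "VERB", "NOUN",
        "PRON", "VERB", "PRON", "PRON", "PUNCT"]
row2 = ["PRON", "AUX", "VERB", "ADP", "ADV", "ADJ", "NOUN", "NOUN", "PUNCT",
        "VERB", "NOUN", "PUNCT", "NOUN", "PUNCT", "CCONJ", "NOUN", "PUNCT"]

sorted_index = build_tag_index([row0, row2])
print(list(sorted_index))                   # column order: ADJ, ADP, ADV, ...
print(encode_presence(row0, sorted_index))  # [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(encode_presence(row2, sorted_index))  # [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
```

With a sorted index, the same corpus always yields the same columns, which matters if the vectors or a trained model are saved and reused later.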

This sample corpus has no usable target variable, so the model cannot actually be trained in this example. The cell below shows how training would proceed once a target `Y` is available.

Note: `Y` must contain one label for each element of `X`, in the same order. Since `X` is a list, `Y` can be a list of the same length, or both can be converted to arrays.

In [None]:
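from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets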
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Train a logistic regression model
clf = LogisticRegression(random_state=42)
clf.fit(X_train, Y_train)

# Evaluate the model on the test set
Y_pred = clf.predict(X_test)
accuracy = accuracy_score(Y_test, Y_pred)
print("Accuracy:", accuracy)
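For intuition, the accuracy printed at the end of that cell is the fraction of test examples whose predicted label equals the true label. Here is a minimal pure-Python sketch of that calculation. The `manual_accuracy` name and the label values are made up for illustration and do not come from the dataset.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: three of the four predictions are correct.
y_true = ["climate", "sports", "climate", "sports"]
y_pred = ["climate", "climate", "climate", "sports"]
print(manual_accuracy(y_true, y_pred))  # 0.75
```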

## Applications

Once you have obtained the part-of-speech (POS) tags for a text, you can use them for various text analysis tasks. Here are some examples of how you can use POS tags:

* **Text classification:** POS tags can be used as features for text classification tasks. For example, you could use the frequency of certain POS tags in a text to classify the text into categories such as news, reviews, or social media posts.
* **Named entity recognition:** POS tags can help identify named entities in a text, such as people, places, and organizations. This is because proper nouns receive their own tag: `PROPN` in the coarse-grained tags returned by `token.pos_` (as in the output above), or `NNP` (singular proper noun) and `NNPS` (plural proper noun) in the fine-grained Penn Treebank tags returned by `token.tag_`.
* **Sentiment analysis:** POS tags can help identify the sentiment of a text, such as positive or negative. For example, words tagged as adjectives (`ADJ`, or `JJ` in the Penn Treebank scheme) or adverbs (`ADV`, or `RB`) can provide cues about the sentiment of a text.
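As a small sketch of the first and last ideas, the snippet below computes tag frequencies and pulls out the adjectives from a list of `(token, tag)` pairs shaped like the `pos_tags` column. The pairs are copied from the start of the last sample sentence. The variable names are illustrative, and treating adjectives as sentiment cues is only a heuristic.

```python
from collections import Counter

# (token, tag) pairs in the same shape as the `pos_tags` column,
# copied from the start of the last sample sentence.
pairs = [("It", "PRON"), ("'s", "AUX"), ("crucial", "ADJ"), ("that", "SCONJ"),
         ("we", "PRON"), ("take", "VERB"), ("immediate", "ADJ"), ("action", "NOUN")]

# Relative frequency of each tag: a possible feature vector for classification.
counts = Counter(tag for _, tag in pairs)
total = sum(counts.values())
tag_freq = {tag: n / total for tag, n in counts.items()}

# Adjectives are candidate sentiment cues.
adjectives = [tok for tok, tag in pairs if tag == "ADJ"]

print(tag_freq["ADJ"])  # 0.25
print(adjectives)       # ['crucial', 'immediate']
```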