<a href="https://colab.research.google.com/github/rupeshthapa123/NotebookProject/blob/main/RupeshThapa_Lab3_NLPTasks_EntityRecognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing Tasks

## Named Entity Recognition

Named entity recognition (NER) classifies tokens of interest (think words) in a
sequence of tokens (think sentence) into specific entity types, such as a
person, an organization, or a location. Statistical NLP models were once the
standard approach to NER, but today the best-performing NER models are
transformer-based.
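
To make the token-classification framing concrete, here is a minimal sketch that does not depend on any NER library. It assumes the common BIO tagging scheme (`B-` begins an entity, `I-` continues it, `O` marks a token outside any entity), which this notebook does not prescribe. The function name `bio_to_entities` and the hand-written tokens and tags are hypothetical, chosen only for illustration.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity that was in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            # "O" tag (or an inconsistent "I-" tag): close the open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hand-written example, not model output.
tokens = ["Carlyle", "Group", "is", "based", "in", "Washington"]
tags = ["B-ORG", "I-ORG", "O", "O", "O", "B-GPE"]
print(bio_to_entities(tokens, tags))
```

A trained model predicts the tags; the grouping step above only shows how per-token labels turn into multi-word entities such as "Carlyle Group".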

In [None]:
import pandas as pd  # Import the pandas library and alias it as pd
import os  # Import the os library

In [None]:
# Get the current working directory
pwd = os.getcwd()

## AG News Dataset (Kaggle Version)
News articles with a title and description, each classified into one of four classes (1-World, 2-Sports, 3-Business, and 4-Sci/Tech).

• 120,000 training samples and 7,600 testing samples
• 30,000 training samples and 1,900 testing samples per class
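
As a quick sanity check, the per-class and total counts quoted above should agree for a balanced four-class dataset. The variable names below are illustrative only.

```python
# Figures taken from the dataset description above.
num_classes = 4
train_per_class = 30_000
test_per_class = 1_900

train_total = num_classes * train_per_class
test_total = num_classes * test_per_class
print(train_total, test_total)
```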

In [None]:
from google.colab import files

In [None]:
uploaded = files.upload()

Saving train.csv to train (1).csv


In [None]:
# Read the csv file named 'train.csv' into a pandas dataframe named 'data'
data = pd.read_csv('train.csv')

In [None]:
# data = pd.read_csv('sample_data/train.csv')

In [None]:
# Wrap the data in a DataFrame (pd.read_csv already returns one, so this is a safeguard)
data = pd.DataFrame(data=data)

# Replace spaces in the column names with underscores
data.columns = data.columns.str.replace(" ", "_")

# Convert all column names to lowercase
data.columns = data.columns.str.lower()

# Create a new column 'class_name' based on the values in the 'class_index' column
data['class_name'] = data['class_index'].map({1:"World", 2:"Sports", 3:"Business", 4:"Sci_Tech"})

# Display the DataFrame
data

Unnamed: 0,class_index,title,description,class_name
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli...",Business
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...,Business
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...,Business
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...,Business
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco...",Business
...,...,...,...,...
119995,1,Pakistan's Musharraf Says Won't Quit as Army C...,KARACHI (Reuters) - Pakistani President Perve...,World
119996,2,Renteria signing a top-shelf deal,Red Sox general manager Theo Epstein acknowled...,Sports
119997,2,Saban not going to Dolphins yet,The Miami Dolphins will put their courtship of...,Sports
119998,2,Today's NFL games,PITTSBURGH at NY GIANTS Time: 1:30 p.m. Line: ...,Sports


## Data Review

Use `value_counts` to get the number of articles per class, then loop over the first ten rows to print article titles and descriptions.

In [None]:
# This line of code will return a series containing the count of each unique value in the 'class_name' column of the 'data' DataFrame
data.class_name.value_counts()

class_name
Business    30000
Sci_Tech    30000
Sports      30000
World       30000
Name: count, dtype: int64
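
Conceptually, `value_counts` is a tally of each distinct value. A rough standard-library analogue is sketched below on a small hand-made list (not the real `class_name` column), using `collections.Counter`; the `labels` list is hypothetical.

```python
from collections import Counter

# Hypothetical mini-sample of class names, for illustration only.
labels = ["Business", "Sports", "World", "Business", "Sci_Tech", "Sports", "Business"]

counts = Counter(labels)
# most_common() lists values from most to least frequent.
for name, count in counts.most_common():
    print(name, count)
```

On the real data every class has the same count, which is why the output above shows 30000 for each of the four classes.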

In [None]:
# loop over the first ten articles
for i in range(10):
  # print a header with the article index
  print("Title of Article", i)
  # print the title of the article from the data
  print(data.loc[i, "title"])
  # print a blank line between articles
  print("\n")

Title of Article 0
Wall St. Bears Claw Back Into the Black (Reuters)


Title of Article 1
Carlyle Looks Toward Commercial Aerospace (Reuters)


Title of Article 2
Oil and Economy Cloud Stocks' Outlook (Reuters)


Title of Article 3
Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)


Title of Article 4
Oil prices soar to all-time record, posing new menace to US economy (AFP)


Title of Article 5
Stocks End Up, But Near Year Lows (Reuters)


Title of Article 6
Money Funds Fell in Latest Week (AP)


Title of Article 7
Fed minutes show dissent over inflation (USATODAY.com)


Title of Article 8
Safety Net (Forbes.com)


Title of Article 9
Wall St. Bears Claw Back Into the Black




In [None]:
# loop over the first ten articles
for i in range(10):
  # print a header with the article index
  print("Description of Article",i)
  # print the description of the article from the data
  print(data.loc[i,"description"])
  # print a blank line between articles
  print("\n")

Description of Article 0
Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.


Description of Article 1
Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.


Description of Article 2
Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.


Description of Article 3
Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday.


Description of Article 4
AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.


Description o

## Preprocessing data

Clean the data by removing superfluous tokens and errors in the text, then store the cleaned dataset in a CSV file for future use.

In [None]:
# This code snippet is used to replace special characters in the 'title' and 'description' columns of the 'data' DataFrame
cols = ["title","description"]
# Replace '\\' with a space
data[cols] = data[cols].applymap(lambda x: x.replace("\\"," "))
# Replace '#36;' with a '$'
data[cols] = data[cols].applymap(lambda x: x.replace("#36;","$"))
# Collapse runs of whitespace into a single space
data[cols] = data[cols].applymap(lambda x: " ".join(x.split()))
# Remove any leading or trailing whitespace
data[cols] = data[cols].applymap(lambda x: x.strip())
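
The cleaning steps described in the comments above (backslash replacement, `#36;` replacement, whitespace collapsing, trimming) can also be sketched as one plain-Python function, which makes them easy to try on a single string before applying them to the whole DataFrame. The name `clean_text` is hypothetical, and the `raw` string is adapted from the first description shown earlier with extra spaces added for illustration.

```python
def clean_text(text):
    """Apply the notebook's cleaning steps to a single string."""
    # Backslashes appear in the raw data where line breaks used to be.
    text = text.replace("\\", " ")
    # '#36;' is a damaged HTML entity for the dollar sign.
    text = text.replace("#36;", "$")
    # split() with no arguments splits on any run of whitespace, so joining
    # with one space collapses repeats and trims both ends.
    return " ".join(text.split())


raw = "Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics,  are seeing green again. "
print(clean_text(raw))
```

Under the same assumptions, the function could be applied column-wise, for example with `data[col].map(clean_text)` for each text column.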

In [None]:
# Save the data to a CSV file
data.to_csv(os.path.join(pwd, 'sample_data', 'train_prepared.csv'), index=False)

## spaCy pretrained model

Install the necessary libraries and download the pretrained transformer model.

In [None]:
# ! pip install -U spacy[cuda110,transformers,lookups]
# ! pip install -U spacy-lookups-data==1.0.0
# ! pip install cupy-cuda110==8.5.0
! python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.7.3
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl (457.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 457.4/457.4 MB 1.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy
# Ask spaCy to use the GPU if one is available; returns True on success
gpu_enabled = spacy.prefer_gpu()
# Report whether GPU usage is enabled
print(gpu_enabled)
# Load the English transformer pipeline
nlp = spacy.load("en_core_web_trf")

False


In [None]:
# This command will output the version of cupy installed in the current environment
! pip freeze | grep cupy

cupy-cuda12x==12.2.0


### Printing the metadata of the model, which highlights the underlying components and the associated accuracy metrics.

Based on the metadata, the model has an NER component, which supports
various entity types, including the following:

• CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL,
ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, and WORK_OF_ART.

In [None]:
import pprint
# Create a PrettyPrinter object with 4 spaces for indentation
pp = pprint.PrettyPrinter(indent=4)
# Print the metadata of the loaded spaCy pipeline
pp.pprint(nlp.meta)

{   'author': 'Explosion',
    'components': [   'transformer',
                      'tagger',
                      'parser',
                      'attribute_ruler',
                      'lemmatizer',
                      'ner'],
    'description': 'English transformer pipeline '
                   "(Transformer(name='roberta-base', "
                   "piece_encoder='byte-bpe', stride=104, type='roberta', "
                   'width=768, window=144, vocab_size=50265)). Components: '
                   'transformer, tagger, parser, ner, attribute_ruler, '
                   'lemmatizer.',
    'disabled': [],
    'email': 'contact@explosion.ai',
    'labels': {   'attribute_ruler': [],
                  'lemmatizer': [],
                  'ner': [   'CARDINAL',
                             'DATE',
                             'EVENT',
                             'FAC',
                             'GPE',
                             'LANGUAGE',
                             'LAW',

## Using the Model

Apply the spaCy model to the AG News data and generate named entity
recognition results.

Review the output to get a sense of how well the model performs.
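
Each entity reported in the output carries a start and end character offset. As a sketch that does not call spaCy, the snippet below treats those offsets as ordinary Python slice indices into the description. The text and the `(text, start, end, label)` tuples are copied from the Article 3 output in this notebook; the variable names are illustrative.

```python
# Description of Article 3 after cleaning, as printed in this notebook.
text = ("Reuters - Authorities have halted oil export flows from the main "
        "pipeline in southern Iraq after intelligence showed a rebel militia "
        "could strike infrastructure, an oil official said on Saturday.")

# (entity text, start_char, end_char, label) as reported for Article 3.
entities = [
    ("Reuters", 0, 7, "ORG"),
    ("Iraq", 86, 90, "GPE"),
    ("Saturday", 186, 194, "DATE"),
]

for ent_text, start, end, label in entities:
    # start is inclusive and end is exclusive, matching Python slicing.
    assert text[start:end] == ent_text
    print(f"{label:<5} {start:>3}-{end:<3} {text[start:end]}")
```

Because the offsets index into the original string, they can be used to highlight or extract entities without re-tokenizing the text.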

In [None]:
# loop through the first nine articles
for i in range(9):
  # print the article number
  print("Article",i)
  # print the description of the article
  print(data.loc[i,"description"])
  # print the column header for the entity listing
  print("Text Start End Label")
  # create a doc object from the description
  doc = nlp(data.loc[i, "description"])
  # loop through each named entity found in the doc
  for ent in doc.ents:
    # print the entity's text, start char, end char, and label
    print(ent.text, ent.start_char,
          ent.end_char, ent.label_)
  # print a blank line between articles
  print("\n")

Article 0
Reuters - Short-sellers, Wall Street's dwindling band of ultra-cynics, are seeing green again.
Text Start End Label
Reuters 0 7 ORG


Article 1
Reuters - Private investment firm Carlyle Group, which has a reputation for making well-timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market.
Text Start End Label
Reuters 0 7 ORG
Carlyle Group 34 47 ORG


Article 2
Reuters - Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.
Text Start End Label
Reuters 0 7 ORG
next week 134 143 DATE
summer 168 174 DATE


Article 3
Reuters - Authorities have halted oil export flows from the main pipeline in southern Iraq after intelligence showed a rebel militia could strike infrastructure, an oil official said on Saturday.
Text Start End Label
Reuters 0 7 ORG
Iraq 86 90 GPE
Saturday 186 194 DATE


Article 4