# Task 4: PoS and NER

## Import spaCy

Read more at [spacy.io](https://spacy.io)

In [1]:
# install spacy libraries

!pip install spacy
import spacy
!python -m spacy download en_core_web_sm
from spacy.lang.en.examples import sentences

# import pandas library

import pandas as pd
import datetime

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
# Load the small English pipeline (en_core_web_sm), which is trained on written web text

nlp = spacy.load("en_core_web_sm")
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple is looking at buying U.K. startup for $1 billion
Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


## Word Tokenize
Tokenize the text, i.e. break the sentences into individual word and punctuation tokens.

In [3]:
# import csv dataset

df = pd.read_csv(r'/Users/sascha.schmid/Documents/University/Master/04 NLP/Project/20230427_Data & Results.xlsx - Data & Labels.csv')
print(df)

         id      date                                               text  \
0     10836  03.01.06  Survey measures of individuals' expectations o...   
1     10837  21.02.06  Although high profit margins could imply some ...   
2     10838  20.07.06  Consumer price inflation remained elevated in ...   
3     10839  11.10.06  Spending on cars and light trucks increased so...   
4     10840  15.11.06  The Manager also discussed with the Committee ...   
...     ...       ...                                                ...   
3060  13896  19.11.08  Participants noted that the financial turmoil ...   
3061  13897  08.04.09  Among the advanced foreign economies, headline...   
3062  13898  08.04.09  Consumer outlays showed some signs of stabiliz...   
3063  13899  20.05.09  The staff raised its near-term estimate of cor...   
3064  13900  15.07.09  Such spreads, however, remained somewhat wide ...   

     growth_sentiment_1 employment_sentimen_1 inflation_sentiment_1  
0               n

In [4]:
# convert the "text" column to a list and print it

texts = df['text'].tolist()
 
# printing list data

print('Text:', texts)

Text: ["Survey measures of individuals' expectations of future labor market conditions improved in November, largely reversing post-Katrina declines.", 'Although high profit margins could imply some existing pricing power, they might also provide a cushion to absorb some future cost increases.', 'Consumer price inflation remained elevated in April and May, reflecting sharp rises in energy prices and more rapid increases in core prices.', 'Spending on cars and light trucks increased somewhat in July after a lackluster pace in the second quarter but apparently weakened in August.', 'The Manager also discussed with the Committee the results of a recent review of the management of the domestic security holdings of the SOMA.', 'Higher imports of capital goods excluding aircraft, computers, and semiconductors and of oil also contributed to the overall gain in imports.', 'TIPS-based inflation compensation at the five-year horizon was about unchanged, while inflation compensation at longer hor

In [5]:
# concatenate all the list elements into one string for easier processing

oneStringTexts = ' '.join([str(item) for item in texts])
print(oneStringTexts)
    

Survey measures of individuals' expectations of future labor market conditions improved in November, largely reversing post-Katrina declines. Although high profit margins could imply some existing pricing power, they might also provide a cushion to absorb some future cost increases. Consumer price inflation remained elevated in April and May, reflecting sharp rises in energy prices and more rapid increases in core prices. Spending on cars and light trucks increased somewhat in July after a lackluster pace in the second quarter but apparently weakened in August. The Manager also discussed with the Committee the results of a recent review of the management of the domestic security holdings of the SOMA. Higher imports of capital goods excluding aircraft, computers, and semiconductors and of oil also contributed to the overall gain in imports. TIPS-based inflation compensation at the five-year horizon was about unchanged, while inflation compensation at longer horizons crept higher. Moreov

In [6]:
# tokenize the text and save the tokens to a list

doc = nlp(oneStringTexts)
words = [token.text for token in doc]
print(words)

['Survey', 'measures', 'of', 'individuals', "'", 'expectations', 'of', 'future', 'labor', 'market', 'conditions', 'improved', 'in', 'November', ',', 'largely', 'reversing', 'post', '-', 'Katrina', 'declines', '.', 'Although', 'high', 'profit', 'margins', 'could', 'imply', 'some', 'existing', 'pricing', 'power', ',', 'they', 'might', 'also', 'provide', 'a', 'cushion', 'to', 'absorb', 'some', 'future', 'cost', 'increases', '.', 'Consumer', 'price', 'inflation', 'remained', 'elevated', 'in', 'April', 'and', 'May', ',', 'reflecting', 'sharp', 'rises', 'in', 'energy', 'prices', 'and', 'more', 'rapid', 'increases', 'in', 'core', 'prices', '.', 'Spending', 'on', 'cars', 'and', 'light', 'trucks', 'increased', 'somewhat', 'in', 'July', 'after', 'a', 'lackluster', 'pace', 'in', 'the', 'second', 'quarter', 'but', 'apparently', 'weakened', 'in', 'August', '.', 'The', 'Manager', 'also', 'discussed', 'with', 'the', 'Committee', 'the', 'results', 'of', 'a', 'recent', 'review', 'of', 'the', 'managemen

## Stop-Word removal
Remove uninformative words such as "is", "the", and "a" with spaCy's built-in stop-word list (`token.is_stop`), along with punctuation (`token.is_punct`), since they carry little information for this analysis.

In [7]:
doc = nlp(oneStringTexts)

# remove stop words and punctuation
words_without_punctuation = [
    token.text
    for token in doc
    if not token.is_stop and not token.is_punct
]

print(words_without_punctuation)

['Survey', 'measures', 'individuals', 'expectations', 'future', 'labor', 'market', 'conditions', 'improved', 'November', 'largely', 'reversing', 'post', 'Katrina', 'declines', 'high', 'profit', 'margins', 'imply', 'existing', 'pricing', 'power', 'provide', 'cushion', 'absorb', 'future', 'cost', 'increases', 'Consumer', 'price', 'inflation', 'remained', 'elevated', 'April', 'reflecting', 'sharp', 'rises', 'energy', 'prices', 'rapid', 'increases', 'core', 'prices', 'Spending', 'cars', 'light', 'trucks', 'increased', 'somewhat', 'July', 'lackluster', 'pace', 'second', 'quarter', 'apparently', 'weakened', 'August', 'Manager', 'discussed', 'Committee', 'results', 'recent', 'review', 'management', 'domestic', 'security', 'holdings', 'SOMA', 'Higher', 'imports', 'capital', 'goods', 'excluding', 'aircraft', 'computers', 'semiconductors', 'oil', 'contributed', 'overall', 'gain', 'imports', 'TIPS', 'based', 'inflation', 'compensation', 'year', 'horizon', 'unchanged', 'inflation', 'compensation',
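
One side effect of this filter is worth checking: spaCy's `is_stop` flag is case-insensitive, so the month name "May" is dropped together with the modal verb "may" (the output above jumps from 'April' straight to 'reflecting'). The sketch below reproduces this on one sentence from the dataset. As an assumption to keep the example light, it uses `spacy.blank("en")`, a tokenizer-only pipeline; the stop-word flags come from the English language defaults, so the loaded `en_core_web_sm` pipeline should behave the same way. The names `nlp_blank`, `kept`, and `dropped` are illustrative.

```python
import spacy

# Tokenizer-only English pipeline; the lexical flags is_stop and
# is_punct are available without downloading a trained model.
nlp_blank = spacy.blank("en")

sentence = "Consumer price inflation remained elevated in April and May."
doc = nlp_blank(sentence)

# Same filter as in the cell above
kept = [t.text for t in doc if not t.is_stop and not t.is_punct]
# Tokens removed because they are flagged as stop words
dropped = [t.text for t in doc if t.is_stop]

print(kept)
print(dropped)
```

If the month matters for later steps, one option is to exempt the exact string "May" from the filter before counting.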

## Get word frequency
Count how often each word occurs with `collections.Counter`.


In [8]:
from collections import Counter

word_freq = Counter(words_without_punctuation)
common_words = word_freq.most_common()

print(common_words)

[('inflation', 563), ('remained', 400), ('rate', 388), ('economic', 359), ('Committee', 334), ('quarter', 325), ('market', 319), ('participants', 310), ('continued', 299), ('prices', 288), ('conditions', 275), ('term', 270), ('year', 262), ('growth', 254), ('percent', 246), ('period', 241), ('policy', 228), ('longer', 223), ('recent', 210), ('financial', 201), ('markets', 191), ('measures', 190), ('pace', 186), ('consumer', 185), ('funds', 185), ('securities', 184), ('credit', 180), ('spending', 175), ('months', 173), ('federal', 169), ('levels', 168), ('increased', 165), ('rates', 163), ('expected', 161), ('activity', 160), ('labor', 158), ('price', 157), ('based', 156), ('low', 154), ('declined', 153), ('little', 152), ('noted', 145), ('intermeeting', 145), ('business', 143), ('unemployment', 142), ('expectations', 141), ('run', 141), ('outlook', 140), ('likely', 135), ('energy', 134), ('Treasury', 134), ('sector', 132), ('loans', 127), ('foreign', 126), ('somewhat', 123), ('meeting'
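
`Counter` is case-sensitive, so a sentence-initial 'Consumer' and a mid-sentence 'consumer' are counted under separate keys. The following is a minimal sketch of case-insensitive counting; `sample_words` is a small hypothetical list standing in for `words_without_punctuation`.

```python
from collections import Counter

# Hypothetical sample standing in for words_without_punctuation
sample_words = ["Consumer", "price", "inflation", "consumer",
                "prices", "Inflation", "inflation"]

# Lowercase each token before counting so case variants are merged
word_freq_ci = Counter(word.lower() for word in sample_words)

# most_common(n) returns only the n most frequent (word, count) pairs
print(word_freq_ci.most_common(2))
```

Lowercasing also merges proper nouns with common words of the same spelling, so whether it is appropriate depends on the analysis.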

## Part of Speech tags
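Part-of-speech (POS) tags identify the grammatical category of each word, for example noun, verb, or adjective. Note that the tagger runs here on the stop-word-filtered text; because it relies on sentence context, some of the tags below are unreliable (for example, `SOMA` is tagged `VERB`).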

In [9]:
# concatenate all the list elements into one string for easier processing

oneString_words_without_punctuation = ' '.join([str(item) for item in words_without_punctuation])
print(oneString_words_without_punctuation)

# add PoS tags to words in oneString_words_without_punctuation

doc = nlp(oneString_words_without_punctuation)

for token in doc:
    print (token.text, token.pos_)

Survey measures individuals expectations future labor market conditions improved November largely reversing post Katrina declines high profit margins imply existing pricing power provide cushion absorb future cost increases Consumer price inflation remained elevated April reflecting sharp rises energy prices rapid increases core prices Spending cars light trucks increased somewhat July lackluster pace second quarter apparently weakened August Manager discussed Committee results recent review management domestic security holdings SOMA Higher imports capital goods excluding aircraft computers semiconductors oil contributed overall gain imports TIPS based inflation compensation year horizon unchanged inflation compensation longer horizons crept higher participants noted capital expenditures internally financed making sensitive credit market conditions readings business sector softer Industrial production fell October orders shipments capital goods showing signs improvement late September 

Survey NOUN
measures VERB
individuals NOUN
expectations NOUN
future ADJ
labor NOUN
market NOUN
conditions NOUN
improved VERB
November PROPN
largely ADV
reversing VERB
post NOUN
Katrina PROPN
declines VERB
high ADJ
profit NOUN
margins NOUN
imply VERB
existing VERB
pricing NOUN
power NOUN
provide VERB
cushion NOUN
absorb VERB
future ADJ
cost NOUN
increases VERB
Consumer NOUN
price NOUN
inflation NOUN
remained VERB
elevated ADJ
April PROPN
reflecting VERB
sharp ADJ
rises NOUN
energy NOUN
prices NOUN
rapid ADJ
increases NOUN
core NOUN
prices NOUN
Spending VERB
cars NOUN
light NOUN
trucks NOUN
increased VERB
somewhat ADV
July PROPN
lackluster ADJ
pace NOUN
second ADJ
quarter NOUN
apparently ADV
weakened VERB
August PROPN
Manager PROPN
discussed VERB
Committee PROPN
results VERB
recent ADJ
review NOUN
management NOUN
domestic ADJ
security NOUN
holdings NOUN
SOMA VERB
Higher ADJ
imports NOUN
capital NOUN
goods NOUN
excluding VERB
aircraft NOUN
computers NOUN
semiconductors NOUN
oil PROPN
cont

## NER (Named Entity Recognition) 

| Label    | Description                                          |
|----------|------------------------------------------------------|
| ORG      | Companies, agencies, institutions.                   |
| GPE      | Geopolitical entity, i.e. countries, cities, states. |
| CARDINAL | Numerals that do not fall under another type.       |

In [19]:
# recognize named entities

ner = []
ner_label = []
doc = nlp(oneString_words_without_punctuation)

for ent in doc.ents:
    # store each entity as one "text LABEL" string
    ner.append(ent.text + " " + ent.label_)
    # print(ent.text, ent.label_)  # variant 2: print instead of collecting

print(ner)

['November DATE', 'Katrina EVENT', 'April DATE', 'Spending WORK_OF_ART', 'July DATE', 'second quarter DATE', 'August DATE', 'SOMA Higher PERSON', 'year DATE', 'October DATE', 'late September DATE', 'October DATE', 'fourth quarter quarter DATE', 'Incoming ORG', 'Asia LOC', 'second quarter DATE', 'Mexico Credit ORG', 'Federal Reserve Term Auction Facility ORG', 'TAF GPE', '$ 448 billion MONEY', 'Federal Reserve ORG', 'Participants ORG', 'Orders PERSON', 'October November DATE', 'Federal Reserve ORG', 'fourth quarter year DATE', 'Committee ORG', 'CDS ORG', '2002 DATE', '2001 DATE', 'Federal Reserve ORG', 'Federal Reserve ORG', '$ 300 billion MONEY', 'Treasury ORG', 'October 2009 DATE', 'September DATE', 'slightly quarter DATE', 'year month DATE', 'month DATE', 'Based ORG', '0 ¼ percent PERCENT', 'U.S. GPE', 'weekly hours TIME', 'months DATE', 'half year DATE', 'quarter DATE', 'Treasury ORG', 'second half year DATE', 'Treasury ORG', 'past months DATE', 'decades DATE', 'M2 PRODUCT', 'Septem
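
Joining the entity text and its label into one string makes it awkward to analyse the labels on their own. Since spaCy's entity labels contain no spaces, the label can be recovered by splitting at the last space. The sketch below does this for a few strings copied from the output above and counts entities per label; `split_entity` and `ner_sample` are hypothetical names introduced for illustration.

```python
from collections import Counter

# A few entries copied from the printed `ner` list
ner_sample = ["November DATE", "Katrina EVENT", "Federal Reserve ORG",
              "$ 448 billion MONEY", "U.S. GPE", "Federal Reserve ORG"]


def split_entity(entry):
    """Split a 'text LABEL' string at the last space into (text, label)."""
    text, label = entry.rsplit(" ", 1)
    return text, label


pairs = [split_entity(entry) for entry in ner_sample]

# Count how many entities were found per label
label_freq = Counter(label for _, label in pairs)

print(pairs[2])
print(label_freq.most_common())
```

In the loop itself, appending `(ent.text, ent.label_)` tuples instead of concatenated strings would avoid the split altogether.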

## Plot Graphs (NER)

In [26]:
# frequency of each recognized named entity

word_freq2 = Counter(ner)
common_words2 = word_freq2.most_common()

print(common_words2)

[('Committee ORG', 143), ('Treasury ORG', 104), ('U.S. GPE', 95), ('2 percent PERCENT', 68), ('recent months DATE', 62), ('quarter DATE', 58), ('April DATE', 54), ('fourth quarter DATE', 52), ('Federal Reserve ORG', 51), ('year DATE', 46), ('December DATE', 45), ('PCE ORG', 44), ('July DATE', 42), ('second quarter DATE', 42), ('October DATE', 42), ('November DATE', 39), ('September DATE', 37), ('quarter CARDINAL', 37), ('June DATE', 37), ('August DATE', 36), ('monthly DATE', 35), ('February DATE', 35), ('Participants ORG', 34), ('March DATE', 34), ('January DATE', 33), ('MBS ORG', 33), ('Household PERSON', 30), ('12 month DATE', 29), ('earlier year DATE', 28), ('China GPE', 26), ('CRE ORG', 25), ('months DATE', 24), ('0 1/4 percent PERCENT', 24), ('second ORDINAL', 23), ('FOMC ORG', 21), ('month DATE', 20), ('United States GPE', 20), ('January February DATE', 19), ('CMBS ORG', 18), ('Desk ORG', 18), ('half year DATE', 17), ('October November DATE', 15), ('Canada GPE', 15), ('C&I ORG', 

In [25]:
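# convert the date column to datetime; no format is passed, so pandas infers it
# (note that the day-first string 03.01.06 is read month-first as 2006-03-01)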

df['date'] = pd.to_datetime(df['date'])
print(df['date'])

0      2006-03-01
1      2006-02-21
2      2006-07-20
3      2006-11-10
4      2006-11-15
          ...    
3060   2008-11-19
3061   2009-08-04
3062   2009-08-04
3063   2009-05-20
3064   2009-07-15
Name: date, Length: 3065, dtype: datetime64[ns]
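
The raw `date` strings are in day.month.year order (`21.02.06` cannot be read any other way), yet the first row, `03.01.06`, comes out above as 2006-03-01, i.e. month-first. The following is a hedged sketch of parsing with an explicit format, using three strings from the dataset preview; that every row in the file follows `DD.MM.YY` is an assumption.

```python
import pandas as pd

# Date strings copied from the first rows of the dataset preview
raw_dates = pd.Series(["03.01.06", "21.02.06", "11.10.06"])

# An explicit format removes the day/month ambiguity
parsed = pd.to_datetime(raw_dates, format="%d.%m.%y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Applied to the notebook, the equivalent call would be `pd.to_datetime(df['date'], format='%d.%m.%y')` on the original string column.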


In [27]:
print(df)

         id       date                                               text  \
0     10836 2006-03-01  Survey measures of individuals' expectations o...   
1     10837 2006-02-21  Although high profit margins could imply some ...   
2     10838 2006-07-20  Consumer price inflation remained elevated in ...   
3     10839 2006-11-10  Spending on cars and light trucks increased so...   
4     10840 2006-11-15  The Manager also discussed with the Committee ...   
...     ...        ...                                                ...   
3060  13896 2008-11-19  Participants noted that the financial turmoil ...   
3061  13897 2009-08-04  Among the advanced foreign economies, headline...   
3062  13898 2009-08-04  Consumer outlays showed some signs of stabiliz...   
3063  13899 2009-05-20  The staff raised its near-term estimate of cor...   
3064  13900 2009-07-15  Such spreads, however, remained somewhat wide ...   

     growth_sentiment_1 employment_sentimen_1 inflation_sentiment_1  
0    

In [31]:
df[df["text"].str.contains("MBS")]

|      | id    | date       | text                                                 | growth_sentiment_1 | employment_sentimen_1 | inflation_sentiment_1 |
|------|-------|------------|------------------------------------------------------|--------------------|-----------------------|-----------------------|
| 59   | 10895 | 2014-08-01 | Year-to-date issuance of commercial mortgage-b...    | positive           | positive              | neutral               |
| 85   | 10921 | 2016-06-04 | Spreads on commercial mortgage-backed securiti...    | neutral            | neutral               | neutral               |
| 320  | 11156 | 2009-08-04 | One member preferred to focus additional purch...    | neutral            | neutral               | neutral               |
| 416  | 11252 | 2020-08-04 | Those Treasury and agency MBS purchases would ...    | neutral            | neutral               | neutral               |
| 506  | 11342 | 2012-04-10 | Participants discussed the effectiveness of pu...    | neutral            | neutral               | neutral               |
| ...  | ...   | ...        | ...                                                  | ...                | ...                   | ...                   |
| 2988 | 13824 | 2017-05-07 | While delinquency rates on CRE loans held by b...    |                    |                       |                       |
| 3017 | 13853 | 2021-06-01 | Triple-B-rated non-agency CMBS spreads came do...    |                    |                       |                       |
| 3028 | 13864 | 2022-05-01 | Increase the SOMA holdings of Treasury securit...    |                    |                       |                       |
| 3031 | 13867 | 2022-06-04 | For agency mortgage-backed securities (MBS), t...    |                    |                       |                       |
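
`str.contains("MBS")` is a plain substring match, so texts that mention "CMBS" are returned as well (the visible part of row 3017 is an example). If only the standalone abbreviation is wanted, a word-boundary regular expression is one option. The sketch below compares the two on hypothetical stand-in sentences; `sample_texts` is not part of the dataset.

```python
import pandas as pd

# Hypothetical sentences standing in for df["text"]
sample_texts = pd.Series([
    "Purchases of agency MBS continued.",
    "Non-agency CMBS spreads narrowed.",
    "Treasury yields were little changed.",
])

# Substring match: also hits the "MBS" inside "CMBS"
substring_hits = sample_texts.str.contains("MBS")

# \b marks a word boundary, so only the standalone token "MBS" matches
whole_word_hits = sample_texts.str.contains(r"\bMBS\b", regex=True)

print(substring_hits.tolist())
print(whole_word_hits.tolist())
```

Which variant is appropriate depends on whether commercial MBS should count as a mention of MBS in the analysis.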
