## Configuration

In [1]:
# Import spaCy
import spacy

from pathlib import Path
from spacy.matcher import Matcher

In [2]:
# This contains the processing pipeline
# As well as language-specific rules for tokenization etc.
nlp = spacy.load('en_core_web_lg')

In [3]:
# Read each text file and strip out the newline characters
moby_dick = Path('../Text Files/moby_dick.txt').read_text(encoding='utf8')
moby_dick = moby_dick.replace('\n', '')

ai_forecast_1 = Path('../Text Files/ai_forecast1.txt').read_text(encoding='utf8')
ai_forecast_1 = ai_forecast_1.replace('\n', '')

ai_forecast_2 = Path('../Text Files/ai_forecast2.txt').read_text(encoding='utf8')
ai_forecast_2 = ai_forecast_2.replace('\n', '')

# Run each text through the pipeline to get a Doc object
moby_dick_doc = nlp(moby_dick)
ai_forecast_1_doc = nlp(ai_forecast_1)
ai_forecast_2_doc = nlp(ai_forecast_2)

# Test
print(moby_dick_doc.text)

I stuffed a shirt or two into my old carpet-bag, tucked it under my arm, and started for Cape Horn and the Pacific. Quitting the good city of old Manhatto, I duly arrived in New Bedford. It was a Saturday night in December. Much was I disappointed upon learning that the little packet for Nantucket had already sailed, and that no way of reaching that place would offer, till the following Monday. As most young candidates for the pains and penalties of whaling stop at this same New Bedford, thence to embark on their voyage, it may as well be related that I, for one, had no idea of so doing. For my mind was made up to sail in no other than a Nantucket craft, because there was a fine, boisterous something about everything connected with that famous old island, which amazingly pleased me. Besides though New Bedford has of late been gradually monopolising the business of whaling, and though in this matter poor old Nantucket is now much behind her, yet Nantucket was her great original—the Tyre
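One thing to keep in mind about the loading cell: `replace('\n', '')` deletes line breaks outright, so if a file wraps its lines without a trailing space, the last word of one line is fused to the first word of the next. Below is a minimal sketch of an alternative that uses only the standard library. The helper name `normalize_whitespace` is hypothetical (it is not part of the notebook), and the sample string is assembled from the Moby Dick output above with line breaks added for illustration.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


# Illustrative sample: two sentences from the text above, wrapped across lines
sample = "I duly arrived in New Bedford.\nIt was a Saturday night\nin December."

fused = sample.replace("\n", "")           # what the notebook cell does
normalized = normalize_whitespace(sample)  # hypothetical alternative

print(fused)
print(normalized)
```

If the texts were loaded this way, the token offsets reported by the matchers further down would likely shift, so the printed Start and End values should not be expected to stay the same.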

# Print out the POS and the grammatical structure of the sentences

Note: Only the first two sentences of each document are printed, to keep the output short.

To see every sentence, remove the [0:2] slice, or iterate over the sentences directly:
```
for sentence in doc.sents:
```
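The note above can be sketched as a full loop. This is only an illustration under a few assumptions: it uses spaCy v3, a blank English pipeline with the rule-based `sentencizer` stands in for `en_core_web_lg` so that no model download is needed, and the sample text is made up. Because no tagger or parser is loaded, it prints `token.is_punct` instead of the POS and dependency labels.

```python
import spacy

# Assumption: blank pipeline + sentencizer instead of en_core_web_lg,
# so sentence boundaries come from punctuation rules, not the parser.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Made-up sample text, not taken from the forecast files
doc = nlp("The market is growing. Analysts expect this to continue.")

# Materialise the generator once so the sentences can be counted and indexed
sentences = list(doc.sents)

# enumerate() supplies the index, so no sentence-to-sentence comparison is needed
for index, sentence in enumerate(sentences):
    separator = "" if index == 0 else "\n\n\n"
    print(f"{separator}{sentence.text}\n")
    for token in sentence:
        # With en_core_web_lg, token.pos_, token.dep_ and token.head.text
        # would be available here as well
        print(token.text, token.is_punct)
```

With the full model loaded, the same loop works on `ai_forecast_1_doc.sents` or `ai_forecast_2_doc.sents` unchanged.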

## AI Forecast 1

In [4]:
# Storing sentences
doc_sentences = [sentence for sentence in ai_forecast_1_doc.sents]

# Iterate over the sentences
for sentence in doc_sentences[0:2]:
    # These conditions are just added for prettier printing
    if sentence == doc_sentences[0]:
        print('{}\n'.format(sentence))

    else:
        print('\n\n\n{}\n'.format(sentence))

    # Iterate over the tokens
    for token in sentence:
        # Print the text, the predicted part-of-speech tag, syntactic dependencies and head text
        print(token.text, token.pos_, token.dep_, token.head.text)

Pune, India, Sept. 13, 2022 (GLOBE NEWSWIRE) --

Pune PROPN ROOT Pune
, PUNCT punct Pune
India PROPN npadvmod Pune
, PUNCT punct Pune
Sept. PROPN npadvmod Pune
13 NUM nummod Sept.
, PUNCT punct Sept.
2022 NUM nummod Sept.
( PUNCT punct Pune
GLOBE PROPN compound NEWSWIRE
NEWSWIRE PROPN appos Pune
) PUNCT punct Pune
-- PUNCT punct Pune



The global AI market size is projected to grow from USD 387.45 billion in 2022 to USD 1394.30 billion in 2029 at a CAGR of 20.1% in the forecast period.

The DET det size
global ADJ amod size
AI PROPN compound size
market NOUN compound size
size NOUN nsubjpass projected
is AUX auxpass projected
projected VERB ROOT projected
to PART aux grow
grow VERB xcomp projected
from ADP prep grow
USD SYM pobj from
387.45 NUM compound billion
billion NUM nummod USD
in ADP prep grow
2022 NUM pobj in
to ADP prep grow
USD SYM compound billion
1394.30 NUM compound billion
billion NUM pobj to
in ADP prep grow
2029 NUM pobj in
at ADP prep grow
a DET det CAGR
CAGR NOUN pob

## AI Forecast 2

In [5]:
# Storing sentences
doc_sentences = [sentence for sentence in ai_forecast_2_doc.sents]

# Iterate over the sentences
for sentence in doc_sentences[0:2]:
    # These conditions are just added for prettier printing
    if sentence == doc_sentences[0]:
        print('{}\n'.format(sentence))

    else:
        print('\n\n\n{}\n'.format(sentence))

    # Iterate over the tokens
    for token in sentence:
        # Print the text, the predicted part-of-speech tag, syntactic dependencies and head text
        print(token.text, token.pos_, token.dep_, token.head.text)

The global artificial intelligence market size was $93.5 billion in 2021.

The DET det size
global ADJ amod size
artificial ADJ amod size
intelligence NOUN compound market
market NOUN compound size
size NOUN nsubj was
was AUX ROOT was
$ SYM quantmod billion
93.5 NUM compound billion
billion NUM attr was
in ADP prep was
2021 NUM pobj in
. PUNCT punct was



And according to Grand View Research, Inc., it is projected to expand at a compound annual growth rate (CAGR) of 38.1% from 2022 to 2030.

And CCONJ cc projected
according VERB prep projected
to ADP prep according
Grand PROPN compound Inc.
View PROPN compound Inc.
Research PROPN nmod Inc.
, PUNCT punct Inc.
Inc. PROPN pobj to
, PUNCT punct projected
it PRON nsubjpass projected
is AUX auxpass projected
projected VERB ROOT projected
to PART aux expand
expand VERB xcomp projected
at ADP prep expand
a DET det rate
compound ADJ nmod rate
annual ADJ amod rate
growth NOUN compound rate
rate NOUN pobj at
( PUNCT punct rate
CAGR NOUN appos ra

# Display the named entities, their labels and label descriptions

## AI Forecast 1

In [6]:
# Iterate over the predicted entities
for ent in ai_forecast_1_doc.ents:
    # Print the entity text, its label and the label's explanation
    print('Named Entity - {}\nEntity Label - {}\nEntity Label Description - {}\n\n'.format(ent.text, ent.label_, spacy.explain(ent.label_)))

Named Entity - Pune
Entity Label - GPE
Entity Label Description - Countries, cities, states


Named Entity - India
Entity Label - GPE
Entity Label Description - Countries, cities, states


Named Entity - Sept. 13, 2022
Entity Label - DATE
Entity Label Description - Absolute or relative dates or periods


Named Entity - GLOBE NEWSWIRE
Entity Label - ORG
Entity Label Description - Companies, agencies, institutions, etc.


Named Entity - USD 387.45 billion
Entity Label - MONEY
Entity Label Description - Monetary values, including unit


Named Entity - 2022
Entity Label - DATE
Entity Label Description - Absolute or relative dates or periods


Named Entity - USD 1394.30 billion
Entity Label - MONEY
Entity Label Description - Monetary values, including unit


Named Entity - 2029
Entity Label - DATE
Entity Label Description - Absolute or relative dates or periods


Named Entity - 20.1%
Entity Label - PERCENT
Entity Label Description - Percentage, including "%"


Named Entity - the next severa

## AI Forecast 2

In [7]:
# Iterate over the predicted entities
for ent in ai_forecast_2_doc.ents:
    # Print the entity text, its label and the label's explanation
    print('Named Entity - {}\nEntity Label - {}\nEntity Label Description - {}\n\n'.format(ent.text, ent.label_, spacy.explain(ent.label_)))

Named Entity - $93.5 billion
Entity Label - MONEY
Entity Label Description - Monetary values, including unit


Named Entity - 2021
Entity Label - DATE
Entity Label Description - Absolute or relative dates or periods


Named Entity - Grand View Research, Inc.
Entity Label - ORG
Entity Label Description - Companies, agencies, institutions, etc.


Named Entity - annual
Entity Label - DATE
Entity Label Description - Absolute or relative dates or periods


Named Entity - 38.1%
Entity Label - PERCENT
Entity Label Description - Percentage, including "%"


Named Entity - 2022
Entity Label - DATE
Entity Label Description - Absolute or relative dates or periods


Named Entity - 2030
Entity Label - DATE
Entity Label Description - Absolute or relative dates or periods


Named Entity - healthcare
Entity Label - ORG
Entity Label Description - Companies, agencies, institutions, etc.


Named Entity - around 66%
Entity Label - PERCENT
Entity Label Description - Percentage, including "%"


Named Entity 

# Use the matcher to look up 'Artificial Intelligence' in the text

Documentation: [https://spacy.io/api/matcher](https://spacy.io/api/matcher)

In [8]:
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"LOWER": "artificial"}, {"LOWER": "intelligence"}]
matcher.add("ARTIFICIAL_INTELLIGENCE_PATTERN", [pattern])

# Call the matcher on the doc
matches = matcher(ai_forecast_2_doc)

# Iterate over the matches
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    # Get the matched span
    matched_span = ai_forecast_2_doc[start:end]
    print('Match ID: {}\nMatched Pattern: {}\nStart: {}\nEnd: {}\nMatched Text: {}\n\n'.format(match_id, string_id, start, end, matched_span.text))

Match ID: 8058983868262941017
Matched Pattern: ARTIFICIAL_INTELLIGENCE_PATTERN
Start: 2
End: 4
Matched Text: artificial intelligence


Match ID: 8058983868262941017
Matched Pattern: ARTIFICIAL_INTELLIGENCE_PATTERN
Start: 95
End: 97
Matched Text: Artificial intelligence


Match ID: 8058983868262941017
Matched Pattern: ARTIFICIAL_INTELLIGENCE_PATTERN
Start: 159
End: 161
Matched Text: artificial intelligence


Match ID: 8058983868262941017
Matched Pattern: ARTIFICIAL_INTELLIGENCE_PATTERN
Start: 359
End: 361
Matched Text: Artificial Intelligence


Match ID: 8058983868262941017
Matched Pattern: ARTIFICIAL_INTELLIGENCE_PATTERN
Start: 447
End: 449
Matched Text: Artificial Intelligence


Match ID: 8058983868262941017
Matched Pattern: ARTIFICIAL_INTELLIGENCE_PATTERN
Start: 458
End: 460
Matched Text: artificial intelligence


Match ID: 8058983868262941017
Matched Pattern: ARTIFICIAL_INTELLIGENCE_PATTERN
Start: 627
End: 629
Matched Text: Artificial Intelligence


Match ID: 8058983868262941017
Mat

# Construct a matcher that looks for the word 'AI' followed by a verb

The task did not specify which document to use, so I only ran this on the Key Market Insights text.
The other document produced no matches anyway.

In [9]:
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"LOWER": "ai"}, {"POS": "VERB"}]
matcher.add("AI_FOLLOWED_BY_VERB_PATTERN", [pattern])

# Call the matcher on the doc
matches = matcher(ai_forecast_2_doc)

# Iterate over the matches
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    # Get the matched span
    matched_span = ai_forecast_2_doc[start:end]
    print('Match ID: {}\nMatched Pattern: {}\nStart: {}\nEnd: {}\nMatched Text: {}\n\n'.format(match_id, string_id, start, end, matched_span.text))

Match ID: 17616057253289433113
Matched Pattern: AI_FOLLOWED_BY_VERB_PATTERN
Start: 1054
End: 1056
Matched Text: AI grew
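One possible reason the pattern finds so few matches is that `{"POS": "VERB"}` does not cover auxiliaries, which the tagger labels `AUX` (as with "is" in the dependency output earlier). The sketch below compares the original pattern with a variant that also accepts `AUX`. It assumes spaCy v3 and builds a `Doc` with hand-assigned POS tags so that it runs without the large model; the example sentence is made up and the helper `find` is hypothetical.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: the blank pipeline is used only for its vocab; no tagger runs
nlp = spacy.blank("en")

# Made-up sentence with POS tags assigned by hand
words = ["AI", "is", "growing", "and", "AI", "grew", "."]
pos = ["PROPN", "AUX", "VERB", "CCONJ", "PROPN", "VERB", "PUNCT"]
doc = Doc(nlp.vocab, words=words, pos=pos)


def find(pattern):
    """Return the text of every span in doc that matches a single pattern."""
    matcher = Matcher(nlp.vocab)
    matcher.add("AI_PATTERN", [pattern])
    return [doc[start:end].text for _, start, end in matcher(doc)]


# Pattern from the notebook: 'AI' followed by a full verb
verb_only = find([{"LOWER": "ai"}, {"POS": "VERB"}])

# Variant: 'AI' followed by a full verb or an auxiliary
verb_or_aux = find([{"LOWER": "ai"}, {"POS": {"IN": ["VERB", "AUX"]}}])

print(verb_only)
print(verb_or_aux)
```

Whether auxiliaries should count as "a verb" depends on how the task is read, so the variant is an option rather than a correction.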




# Construct a matcher to find all numbers followed by a percent sign (%)

## AI Forecast 1

In [10]:
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"POS": "NUM"}, {"TEXT": "%"}]
matcher.add("NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN", [pattern])

# Call the matcher on the doc
matches = matcher(ai_forecast_1_doc)

# Iterate over the matches
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    # Get the matched span
    matched_span = ai_forecast_1_doc[start:end]
    print('Match ID: {}\nMatched Pattern: {}\nStart: {}\nEnd: {}\nMatched Text: {}\n\n'.format(match_id, string_id, start, end, matched_span.text))

Match ID: 4462332722968326931
Matched Pattern: NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN
Start: 38
End: 40
Matched Text: 20.1%


Match ID: 4462332722968326931
Matched Pattern: NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN
Start: 275
End: 277
Matched Text: 23%


Match ID: 4462332722968326931
Matched Pattern: NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN
Start: 531
End: 533
Matched Text: 55%


Match ID: 4462332722968326931
Matched Pattern: NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN
Start: 536
End: 538
Matched Text: 75%


Match ID: 4462332722968326931
Matched Pattern: NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN
Start: 547
End: 549
Matched Text: 77%


Match ID: 4462332722968326931
Matched Pattern: NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN
Start: 559
End: 561
Matched Text: 37%




## AI Forecast 2

In [11]:
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"POS": "NUM"}, {"TEXT": "%"}]
matcher.add("NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN", [pattern])

# Call the matcher on the doc
matches = matcher(ai_forecast_2_doc)

# Iterate over the matches
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    # Get the matched span
    matched_span = ai_forecast_2_doc[start:end]
    print('Match ID: {}\nMatched Pattern: {}\nStart: {}\nEnd: {}\nMatched Text: {}\n\n'.format(match_id, string_id, start, end, matched_span.text))

Match ID: 4462332722968326931
Matched Pattern: NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN
Start: 37
End: 39
Matched Text: 38.1%


Match ID: 4462332722968326931
Matched Pattern: NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN
Start: 196
End: 198
Matched Text: 66%


Match ID: 4462332722968326931
Matched Pattern: NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN
Start: 1236
End: 1238
Matched Text: 115%


Match ID: 4462332722968326931
Matched Pattern: NUMBER_FOLLOWED_BY_PERCENT_SIGN_PATTERN
Start: 1378
End: 1380
Matched Text: 42%
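The `POS`-based pattern relies on the tagger labelling each number as `NUM`. A possible alternative, sketched here, is the lexical attribute `LIKE_NUM`, which is computed from the token text alone and therefore works with just the tokenizer. The sketch assumes spaCy v3 and a blank English pipeline; the sample string is assembled for illustration from figures that appear in the output above.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer only, no statistical components
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
# LIKE_NUM is true for tokens that look like numbers; ORTH matches the exact text
matcher.add("LIKE_NUM_PERCENT", [[{"LIKE_NUM": True}, {"ORTH": "%"}]])

# Illustrative sample built from values seen in the notebook output
doc = nlp("a CAGR of 38.1% from 2022 to 2030, and around 66% elsewhere")

percentages = [doc[start:end].text for _, start, end in matcher(doc)]
print(percentages)
```

The two patterns are not equivalent: `LIKE_NUM` also accepts spelled-out numbers such as "ten", so the results on the full documents may differ from the ones printed above.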




# Construct a matcher to look for company names

## AI Forecast 1

In [12]:
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"ENT_TYPE": "ORG"}]
matcher.add("COMPANY_NAME_PATTERN", [pattern])

# Call the matcher on the doc
matches = matcher(ai_forecast_1_doc)

# Iterate over the matches
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    # Get the matched span
    matched_span = ai_forecast_1_doc[start:end]
    print('Match ID: {}\nMatched Pattern: {}\nStart: {}\nEnd: {}\nMatched Text: {}\n\n'.format(match_id, string_id, start, end, matched_span.text))

Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 9
End: 10
Matched Text: GLOBE


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 10
End: 11
Matched Text: NEWSWIRE


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 66
End: 67
Matched Text: Fortune


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 67
End: 68
Matched Text: Business


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 68
End: 69
Matched Text: Insights


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 69
End: 70
Matched Text: ™


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 122
End: 123
Matched Text: BFSI


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 124
End: 125
Matched Text: healthcare


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 150
End: 151
Matched Text: the


Match

## AI Forecast 2

In [13]:
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"ENT_TYPE": "ORG"}]
matcher.add("COMPANY_NAME_PATTERN", [pattern])

# Call the matcher on the doc
matches = matcher(ai_forecast_2_doc)

# Iterate over the matches
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    # Get the matched span
    matched_span = ai_forecast_2_doc[start:end]
    print('Match ID: {}\nMatched Pattern: {}\nStart: {}\nEnd: {}\nMatched Text: {}\n\n'.format(match_id, string_id, start, end, matched_span.text))

Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 16
End: 17
Matched Text: Grand


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 17
End: 18
Matched Text: View


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 18
End: 19
Matched Text: Research


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 19
End: 20
Matched Text: ,


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 20
End: 21
Matched Text: Inc.


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 110
End: 111
Matched Text: healthcare


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 206
End: 207
Matched Text: AI


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 212
End: 213
Matched Text: life


Match ID: 11620291018188914089
Matched Pattern: COMPANY_NAME_PATTERN
Start: 213
End: 214
Matched Text: sciences


Match ID: 
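The single-token pattern reports every token of an organisation separately ("Grand", "View", "Research", ...). A possible way to get one match per name is to add the `"OP": "+"` quantifier and pass `greedy="LONGEST"` to `matcher.add`. The sketch below assumes spaCy v3; to run without `en_core_web_lg`, it uses a blank pipeline and marks the entity by hand, which is an assumption of the example and not how the notebook obtains its entities.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: no NER model is loaded; the ORG entity is set manually below
nlp = spacy.blank("en")
text = "And according to Grand View Research, Inc., it is projected to expand."
doc = nlp(text)

# Mark the organisation by character offsets so no token indices are hard-coded
name = "Grand View Research, Inc."
start_char = text.index(name)
org_span = doc.char_span(start_char, start_char + len(name), label="ORG")
doc.ents = [org_span]

matcher = Matcher(nlp.vocab)
# "OP": "+" matches one or more consecutive ORG tokens;
# greedy="LONGEST" keeps the longest of the overlapping candidates
matcher.add(
    "COMPANY_NAME_PATTERN",
    [[{"ENT_TYPE": "ORG", "OP": "+"}]],
    greedy="LONGEST",
)

# as_spans=True returns Span objects instead of (match_id, start, end) tuples
company_names = [span.text for span in matcher(doc, as_spans=True)]
print(company_names)
```

Two organisations that sit directly next to each other would be merged into one match by this pattern. Filtering `doc.ents` for `ent.label_ == "ORG"` avoids that, at the cost of not using the matcher.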

# Where do you find these matches?

For each match, I print the Start and End token indices, which should answer the question. Slicing the Doc with those indices returns the matched text:

In [14]:
print(ai_forecast_2_doc[16:21])

Grand View Research, Inc.
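Token indices are only one way to locate a match. As a sketch, the matched span also exposes character offsets (`start_char`, `end_char`) and the sentence it belongs to (`sent`). The example assumes spaCy v3 and a blank English pipeline with the `sentencizer`, so that it runs without the large model. The first sentence is taken from the second forecast text; the second sentence is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank pipeline + sentencizer instead of en_core_web_lg
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp(
    "The global artificial intelligence market size was $93.5 billion in 2021. "
    "Artificial intelligence is expanding."
)

matcher = Matcher(nlp.vocab)
matcher.add(
    "ARTIFICIAL_INTELLIGENCE_PATTERN",
    [[{"LOWER": "artificial"}, {"LOWER": "intelligence"}]],
)

locations = []
for span in matcher(doc, as_spans=True):
    # Token offsets index into the Doc; character offsets index into doc.text
    locations.append(
        (span.start, span.end, span.start_char, span.end_char, span.sent.text)
    )
    print(
        f"Tokens {span.start}:{span.end} | "
        f"Characters {span.start_char}:{span.end_char} | "
        f"Sentence: {span.sent.text}"
    )
```

Character offsets are useful for highlighting a match in the raw text, while token offsets are what `doc[start:end]` expects.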
