# Exploring the use of spaCy in NLP, general and aspect-based sentiment analysis  

#### Includes data collection, cleaning, and EDA

<i>Note: for ease of reading by non-NLP practitioners, terminology specific to NLP, sentiment analysis (SA), or spaCy is tagged as [word] where needed.</i>

### Third-party package imports
<b>spaCy</b> provides all the NLP functionality used here, including several trained models, built-in tokenization, annotations, etc.

<b>Beautiful Soup 4</b> provides web scraping functionality

In [26]:
%time

import re
import requests

import spacy
from bs4 import BeautifulSoup as bs


CPU times: total: 0 ns
Wall time: 0 ns


In [17]:
# load the small English model; models must be installed separately from the base spaCy package
# python -m spacy download en_core_web_sm --user

nlp = spacy.load(r'C:\Users\zhuwe\AppData\Roaming\Python\Python310\site-packages\en_core_web_sm\en_core_web_sm-3.3.0')
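The absolute path above only resolves on one machine. A minimal sketch of a more portable load, assuming the model was installed as a regular Python package (which is what `python -m spacy download` does). `load_model` is a hypothetical helper introduced here for illustration, not part of spaCy:

```python
import importlib.util

MODEL_NAME = "en_core_web_sm"


def load_model(name: str = MODEL_NAME):
    """Load an installed spaCy pipeline by package name (hypothetical helper)."""
    # Check that the model package is importable before asking spaCy to load it.
    if importlib.util.find_spec(name) is None:
        raise RuntimeError(
            f"Model '{name}' is not installed; try: python -m spacy download {name}"
        )
    import spacy  # imported lazily so the check above gives a clear error first

    return spacy.load(name)
```

With the model installed, `nlp = load_model()` should be equivalent to the path-based call above.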


In [20]:
# scrape web data from a sample news article
link = "https://www.channelnewsasia.com/singapore/hiv-risk-transmission-man-did-not-inform-sexual-partner-jail-2732376"

# fail fast on a hung connection or an HTTP error status
resp = requests.get(link, timeout=10)
resp.raise_for_status()

In [38]:
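# get only the text inside the <p> tags
# the explicit parser is an assumption; the original call relied on bs4's default choice
news_text = [p.get_text() for p in bs(resp.content, 'html.parser').find_all('p')]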

In [40]:
# manually filter news website boilerplate, legal disclaimers, etc.
non_boilerplate_text = news_text[3:-5]

# create a spaCy Doc object
doc = nlp(''.join(non_boilerplate_text))
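Because the paragraphs are joined with an empty string, the last word of one paragraph runs straight into the first word of the next, which may confuse sentence segmentation and tokenization. A minimal sketch of a whitespace-aware join, shown on made-up paragraphs rather than the scraped text. `join_paragraphs` is a hypothetical helper:

```python
import re


def join_paragraphs(paragraphs, sep=" "):
    """Join paragraph strings with a separator, normalizing whitespace."""
    cleaned = []
    for paragraph in paragraphs:
        # Collapse runs of whitespace (newlines, tabs, non-breaking spaces) into one space.
        text = re.sub(r"\s+", " ", paragraph).strip()
        if text:  # skip paragraphs that are empty after cleaning
            cleaned.append(text)
    return sep.join(cleaned)


# Made-up sample, not taken from the article.
sample = ["The court heard the case.\n", "  ", "A verdict   followed."]
print("".join(sample))          # paragraphs fused, stray whitespace kept
print(join_paragraphs(sample))  # The court heard the case. A verdict followed.
```

In this notebook the helper would stand in for `''.join(non_boilerplate_text)`. The sentence count and token lists reported below were produced with the original join, so they may differ after such a change.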


## Analysis of textual data in this article

In [53]:
print(f"Number of sentences in this news article: {len(list(doc.sents))}")

Number of sentences in this news article: 27
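`doc.sents` yields one span per detected sentence, so the same iterator can serve as a quick sanity check on segmentation: unusually long "sentences" often indicate that two were fused. A rough sketch, assuming only that each sentence span supports `len()` (its token count) and `.text`, as spaCy spans do. `longest_sentences` is a hypothetical helper:

```python
def longest_sentences(doc, n=3):
    """Return the n longest sentences as (token_count, text) pairs, longest first."""
    # len() of a spaCy Span is its number of tokens.
    sized = [(len(sent), sent.text) for sent in doc.sents]
    return sorted(sized, reverse=True)[:n]
```

Calling `longest_sentences(doc)` on the article would list the three longest detected sentences for manual inspection.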


### Token-wise analysis

#### POS (Part of Speech) tagging
Using [POS] tags, we can determine if an individual [token] is a noun, adjective, etc.

Of the available UPOS tags, the following seem most relevant to sentiment analysis:

| POS tag | Full name | Examples       |
|---------|-----------|----------------|
| ADJ     | adjective | enormous, fast |
| ADV     | adverb    | very, exactly  |
| VERB    | verb      | eat, running   |
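spaCy exposes the coarse-grained tag of each token as the uppercase string `token.pos_`, so the three tags in the table can be collected in one pass. A minimal sketch, assuming a `doc` produced by a pipeline with a tagger. `group_by_pos` and `SENTIMENT_POS` are hypothetical names introduced for illustration:

```python
from collections import defaultdict

# Tags from the table above; which tags matter most is a judgment call.
SENTIMENT_POS = ("ADJ", "ADV", "VERB")


def group_by_pos(doc, tags=SENTIMENT_POS):
    """Map each requested UPOS tag to the texts of the tokens carrying it."""
    groups = defaultdict(list)
    for token in doc:
        if token.pos_ in tags:
            groups[token.pos_].append(token.text)
    return dict(groups)
```

For example, `group_by_pos(doc)["ADJ"]` would list every adjective in the article in document order.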



In [52]:
# Token objects are distinct per position, so repeated words appear more than once in this set
{token for token in doc if token.pos_ == 'VERB'}

{sentenced,
 informing,
 named,
 diagnosed,
 interviewed,
 told,
 required,
 inform,
 perceived,
 pleaded,
 considered,
 offered,
 booked,
 decided,
 go,
 engaged,
 inform,
 contracting,
 obtain,
 accept,
 said,
 discovered,
 reported,
 claimed,
 assaulted,
 having,
 informing,
 contracting,
 charged,
 formed,
 taken,
 indicated,
 detected,
 indicated,
 stated,
 According,
 cited,
 was,
 detected,
 said,
 was,
 stated,
 tested,
 detected,
 taken,
 accepted,
 detected,
 stated,
 asked,
 noting,
 accused,
 means,
 was,
 inform,
 exposed,
 reoffended,
 said,
 accused,
 use,
 engaged,
 accept,
 mitigated,
 was,
 said,
 sought,
 arguing,
 transmit,
 consumed,
 including,
 clouded,
 added,
 disclose,
 knew,
 was,
 argued,
 "Afraid,
 omitted,
 disclose,
 appealed,
 jailed,
 fined}
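To list each verb once, together with how often it occurs, the tokens can be reduced to lemma strings before counting. A minimal sketch, assuming the pipeline provides `token.lemma_` (the small English model does). `verb_lemma_counts` is a hypothetical helper:

```python
from collections import Counter


def verb_lemma_counts(doc):
    """Count verbs by lowercase lemma so inflected forms and repeats collapse."""
    return Counter(
        token.lemma_.lower() for token in doc if token.pos_ == "VERB"
    )
```

`verb_lemma_counts(doc).most_common(10)` would then give the ten most frequent verbs in the article.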