# Preparation for NLP pipelines

Let's load these polarizing galactic tweets from the files we would have scraped, and initialize the NLP pipelines.
Ideally we would clean the text first (remove stop words, lowercase, lemmatize, etc.), though the NER pipelines work acceptably on the raw text.
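
The cleaning step mentioned above is skipped in this notebook, but a minimal sketch could look like the following. It uses only the standard library; the stop-word list and the sample string are hypothetical, and lemmatization is left out because it needs a language model (spaCy's `token.lemma_` would be the natural place to get it).

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use
# a fuller list such as the one shipped with spaCy.
STOP_WORDS = {"a", "an", "and", "is", "of", "on", "the", "to"}

URL_RE = re.compile(r"https?://\S+")
# Words, optionally prefixed with # or @ so hashtags and mentions survive.
TOKEN_RE = re.compile(r"[#@]?\w+(?:'\w+)?")


def clean_post(text: str) -> list[str]:
    """Lowercase a post, drop URLs and stop words, and return the tokens."""
    text = URL_RE.sub(" ", text.lower())
    return [tok for tok in TOKEN_RE.findall(text) if tok not in STOP_WORDS]


# Hypothetical sample post, not taken from warposts.json.
sample = "Taking an amazing vacation on Deneb IV! https://example.com #SpaceHistory"
print(clean_post(sample))
```

Keeping hashtags as single tokens is a design choice here: several of the patterns used later (such as `#EvilMonarchs`) are hashtags, so stripping the `#` would change what can be matched.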

In [1]:
import spacy
from spacy import displacy

spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [2]:
import json

# Load the scraped posts; the context manager closes the file for us.
with open('warposts.json') as f:
    posts = json.load(f)['posts']

# Breaking down Space Tweets with spaCy

First we download the spaCy pipeline, then we run some tests on the loaded text.
Below are the categories spaCy assigns to the entities it detects.

| **Category**                              | **Description**                                      |
| ----------------------------------------- | ---------------------------------------------------- |
| PERSON:                                   | People, including fictional.                         |
| NORP:                                     | Nationalities or religious or political groups.      |
| FAC:                                      | Buildings, airports, highways, bridges, etc.         |
| ORG:                                      | Companies, agencies, institutions, etc.              |
| GPE:                                      | Countries, cities, states.                           |
| LOC:                                      | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT:                                  | Objects, vehicles, foods, etc. (Not services.)       |
| EVENT:                                    | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART:                              | Titles of books, songs, etc.                         |
| LAW:                                      | Named documents made into laws.                      |
| LANGUAGE:                                 | Any named language.                                  |
| DATE:                                     | Absolute or relative dates or periods.               |
| TIME:                                     | Times smaller than a day.                            |
| PERCENT:                                  | Percentage, including “%”.                           |
| MONEY:                                    | Monetary values, including unit.                     |
| QUANTITY:                                 | Measurements, as of weight or distance.              |
| ORDINAL:                                  | “first”, “second”, etc.                              |
| CARDINAL:                                 | Numerals that do not fall under another type.        |

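spaCy pipelines split the text into tokens and tag each token with a part of speech. The table below lists the coarse universal tags. spaCy's own `token.pos_` values follow the Universal Dependencies scheme, so some names differ (for example `CCONJ`/`SCONJ` rather than `CONJ`, `PART` rather than `PRT`, and `PUNCT` rather than `.`):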

| **Tag** |     **Meaning**     |          **English Examples**          |
|:-------:|:-------------------:|:--------------------------------------:|
| ADJ     | adjective           | new, good, high, special, big, local   |
| ADP     | adposition          | on, of, at, with, by, into, under      |
| ADV     | adverb              | really, already, still, early, now     |
| CONJ    | conjunction         | and, or, but, if, while, although      |
| DET     | determiner, article | the, a, some, most, every, no, which   |
| NOUN    | noun                | year, home, costs, time, Africa        |
| NUM     | numeral             | twenty-four, fourth, 1991, 14:24       |
| PRT     | particle            | at, on, out, over, per, that, up, with |
| PRON    | pronoun             | he, their, her, its, my, I, us         |
| VERB    | verb                | is, say, told, given, playing, would   |
| .       | punctuation marks   | . , ; !                                |
| X       | other               | ersatz, esprit, dunno, gr8, univeristy |

In [3]:
# Mark entities 
for txt in posts:
    text1 = nlp(txt)
    displacy.render(text1,style="ent", jupyter=True)



Let's teach the pipeline to recognize some violence-related themes. Here we use hand-written rules; doing this properly would mean training on a large labelled set. From the above, we see that some people and organizations are linked to this war event, so we add them to a custom pattern list in the pipeline, along with other terms we feel belong to this event.

In [9]:
# Phrases to tag as WAR-EVENT. String patterns are matched exactly (case-sensitive),
# so the spellings follow the posts themselves (e.g. "facist", "Galatic War").
patterns = ["space battle", "evil agenda", "pig", "pigs", "authoritarian", "genocide", "die", "#EvilMonarchs", "facist", "Red Matter", "Borg Cubes", "Laira Rillak",
            "Klingon invasion", "Romulan invasions", "invasions", "goddam slave", "slaves", "Galatic War", "war", "Kahless the Unforgettable"]

# Create or replace the EntityRuler so this cell can be re-run safely.
if nlp.has_pipe("entity_ruler"):
    nlp.remove_pipe("entity_ruler")

# Add the ruler ahead of the statistical NER so its matches take precedence.
ruler = nlp.add_pipe("entity_ruler", first=True)
ruler.add_patterns([{"label": "WAR-EVENT", "pattern": p} for p in patterns])

# Mark entities with new labels
for txt in posts:
    text1 = nlp(txt)
    displacy.render(text1,style="ent", jupyter=True)

## Sentiment Analysis

This is the process of determining the attitude or emotion expressed in a text.
We will test *TextBlob*, which exposes two properties:

- **polarity**: a float in the range [-1, 1], where 1 means a positive statement and -1 a negative one.
- **subjectivity**: a float in the range [0, 1]; higher values indicate more subjective sentences, which generally express personal opinion.

We will also trial VADER (Valence Aware Dictionary and sEntiment Reasoner), which returns '**neg**', '**neu**' and '**pos**' scores plus a '**compound**' score. The first three are floats in [0, 1] giving the proportion of the text that falls in each category; '**compound**' is a normalized overall score in [-1, 1].

In the comparison below, VADER's compound score tracks the tone of these posts more closely than TextBlob's polarity. For example, the post about the "horrible authoritarian pig" gets a near-neutral TextBlob polarity (-0.03) but a strongly negative VADER score (-0.80).
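
To make VADER's compound score less of a black box, here is a rough sketch of the final normalization step. As far as I know, VADER sums the valence of the words in the text (after its heuristic adjustments) and squashes that sum into [-1, 1] with `x / sqrt(x² + alpha)`, where `alpha` defaults to 15. The ±0.05 cut-offs are the conventional ones suggested in the VADER README. The function names are mine, not part of the library.

```python
import math


def normalize(score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into [-1, 1] (sketch of VADER's step)."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))


def label_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to a coarse label using the conventional cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# A summed valence of 4.0 is an illustrative input, not a value from the posts.
print(round(normalize(4.0), 4))
print(label_compound(-0.5242))
```

Because of the square root, the score saturates quickly: a handful of strongly charged words is enough to push a post close to -1 or 1, which is consistent with the large magnitudes VADER reports for these emotive posts.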

In [5]:
import pandas as pd
from spacytextblob.spacytextblob import SpacyTextBlob  # importing registers the "spacytextblob" component
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sid_obj = SentimentIntensityAnalyzer()

# Add the TextBlob sentiment component once.
if not nlp.has_pipe("spacytextblob"):
    nlp.add_pipe("spacytextblob")

# TBP: TextBlob polarity, TBS: TextBlob subjectivity, VS: VADER compound score.
rows = []
for txt in posts:
    doc = nlp(txt)
    rows.append({
        "TBP": doc._.blob.polarity,
        "TBS": doc._.blob.subjectivity,
        "VS": sid_obj.polarity_scores(txt)["compound"],
        "Text": txt,
    })

# Build the frame once instead of concatenating inside the loop.
df = pd.DataFrame(rows, columns=["TBP", "TBS", "VS", "Text"])
df

|   | **TBP**   | **TBS**  | **VS**  | **Text**                                              |
| - | --------- | -------- | ------- | ----------------------------------------------------- |
| 0 | 0.392857  | 0.7      | 0.938   | There was no greater role model than James Tib...     |
| 1 | 0.277273  | 0.58     | -0.5242 | The Galatic #Federation is selling a timeline ...     |
| 2 | -0.025622 | 0.632143 | -0.8046 | Some horrible authoritarian pig of an alien, c...     |
| 3 | 0.39      | 0.69     | 0.9412  | Taking an amazing vacation on Deneb IV! Thanks...     |
| 4 | -0.077539 | 0.684375 | -0.8574 | No absolutely not!!! Klingon and their non-Fed...     |
| 5 | 0.083333  | 0.5      | -0.6369 | If ya ain't human or similar ANNNDDDDD pretty:...     |
| 6 | 0.167424  | 0.255303 | -0.5574 | "There is no way the Galatic War can be accept...     |
| 7 | 0.525     | 0.475    | 0.807   | Great learning experience at the "Vulcan Schoo...     |
| 8 | 0.475     | 0.8375   | 0.2808  | LOL This galatic war is meh. Happy to tell you...     |


# Putting it all together

We identified the entities related to the war event, and we can programmatically estimate the sentiment of these war-related posts. With this we can recognize polarizing posts and ban them from our feeds and social networks. We can also store them and later retrieve them quickly by their named entities, or revisit them for a topic-modelling exercise.
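
The "store and find by named entity" idea can be sketched as a small inverted index. This is only an illustration under assumptions: the records below are hypothetical stand-ins for what the pipeline would produce (post id, VADER compound score, detected entities), and `build_entity_index` is a name introduced here, not part of spaCy or VADER.

```python
from collections import defaultdict

# Hypothetical records: (post id, compound score, [(entity text, label), ...]).
records = [
    (0, 0.9, [("Deneb IV", "LOC")]),
    (1, -0.5, [("Klingon invasion", "WAR-EVENT"), ("war", "WAR-EVENT")]),
    (2, -0.8, [("war", "WAR-EVENT"), ("Laira Rillak", "WAR-EVENT")]),
    (3, 0.3, [("war", "WAR-EVENT")]),
]


def build_entity_index(records, label="WAR-EVENT", limit=0.1):
    """Map each matching entity (lowercased) to the ids of flagged posts."""
    index = defaultdict(list)
    for post_id, compound, ents in records:
        matched = {text.lower() for text, ent_label in ents if ent_label == label}
        # Same rule as the filter: has a war entity and scores below the limit.
        if matched and compound < limit:
            for ent in sorted(matched):
                index[ent].append(post_id)
    return dict(index)


index = build_entity_index(records)
print(index["war"])
```

Post 3 mentions a war entity but scores above the limit, so it is left out of the index; looking up `"war"` returns only the posts that would also be flagged.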

In [10]:
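# Flag posts whose VADER compound score falls below this limit.
NEGATIVE_POST_LIMIT = 0.1

bad_posts = []
for txt in posts:
    sent = sid_obj.polarity_scores(txt)['compound']
    doc = nlp(txt)
    # A post is "bad" if it mentions any WAR-EVENT entity and scores below the limit.
    if sent < NEGATIVE_POST_LIMIT and any(ent.label_ == "WAR-EVENT" for ent in doc.ents):
        bad_posts.append(txt)

bad_posts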

['The Galatic #Federation is selling a timeline of the facist Klingon and wicked Romulan invasions as #NFTs!\nThe collection is titled #SpaceHistory: The Galatic War Museum and each token is associated with space battle.\nIt was confirmed by tweet of the Mega President of the United Federation of Planets: Laira Rillak, that the federation has trust in #blockchaintechnology!\nLong live freedom through digital assets, death to the slave of Kahless the Unforgettable.',
 "Some horrible authoritarian pig of an alien, calling himself a 'savior' #EvilMonarchs decided he knows better about what we need, decided he has the right to ruin the federation, its planets and the societies we admired to live in by starting a war, killing our children and separate us from our loved ones through planetary hate!\nWe will never understand and will never forgive anyone who supports this Klingon invasion in any way.\nAny businesses, or planets that continue the affairs with Klingon: supports this destructive