# Dataset

In [15]:
import pandas as pd
df = pd.read_csv("Data/hate-text.csv")

Initial model test without handling negation and polarity. We simply trust the algorithm to pick up negation patterns on its own, which probably **won't** work.

# Base keras model

No cleaning, no regularization, no nothin'

In [16]:
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Preprocess
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['test_case'])

X = tokenizer.texts_to_sequences(df['test_case'])
X = pad_sequences(X)

y = (df['label_gold'] == 'hateful').astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=100, input_length=X.shape[1]))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [17]:
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.2)

loss, accuracy = model.evaluate(X_test, y_test)
loss, accuracy

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


(0.1818477213382721, 0.9436619877815247)

In [24]:
def preprocess_text(text, tokenizer):
    sequences = tokenizer.texts_to_sequences([text])
    padded_sequences = pad_sequences(sequences, maxlen=X.shape[1])
    return padded_sequences

def predict_hatefulness(model, text, tokenizer):
    processed_text = preprocess_text(text, tokenizer)
    prediction = model.predict(processed_text)[0, 0]
    
    label = 'Hateful' if prediction >= 0.5 else 'Not Hateful'
    
    return label, prediction

new_text = ["I hate women.", "I don't hate women.", "I don't love women", "I love women"]

for i in new_text:
    label, prediction = predict_hatefulness(model, i, tokenizer)
    print(f"Text: '{i}'")
    print(f"Predicted Label: {label}")
    print(f"Prediction Score: {prediction}")


Text: 'I hate women.'
Predicted Label: Hateful
Prediction Score: 0.7363396883010864
Text: 'I don't hate women.'
Predicted Label: Hateful
Prediction Score: 0.5670516490936279
Text: 'I don't love women'
Predicted Label: Not Hateful
Prediction Score: 0.44292059540748596
Text: 'I love women'
Predicted Label: Hateful
Prediction Score: 0.6289886236190796


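As expected, the model fails to pick up negation from the training data: "I don't hate women." is still labeled hateful (score 0.57), and even "I love women" is labeled hateful (score 0.63).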

Negation terms include no, not, won't, shouldn’t, etc. When a negation appears in a sentence, it is critical to determine which words fall within its scope.
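One simple way to approximate which words a negation affects is to mark every token that follows a negation cue until the next punctuation mark. The sketch below is only an illustration of that heuristic: the cue list, the `NOT_` prefix, and the function name `mark_negation_scope` are assumptions, not part of the notebook's pipeline.

```python
import re

# Assumed, non-exhaustive list of negation cues (illustrative only).
NEGATION_CUES = {"no", "not", "never", "none", "nobody", "nothing", "cannot"}
# Clause-level punctuation closes the negation scope.
SCOPE_END = {".", ",", ";", ":", "!", "?"}

def mark_negation_scope(text):
    """Prefix tokens that follow a negation cue with NOT_ until punctuation."""
    # Normalize curly apostrophes so "don’t" and "don't" are treated alike.
    text = text.lower().replace("’", "'")
    tokens = re.findall(r"[\w']+|[.,;:!?]", text)
    marked, in_scope = [], False
    for tok in tokens:
        if tok in SCOPE_END:
            in_scope = False          # punctuation ends the scope
            marked.append(tok)
        elif tok in NEGATION_CUES or tok.endswith("n't"):
            in_scope = True           # a cue opens a new scope
            marked.append(tok)
        else:
            marked.append("NOT_" + tok if in_scope else tok)
    return marked

print(mark_negation_scope("I don't hate women."))
# ['i', "don't", 'NOT_hate', 'NOT_women', '.']
```

With this marking, "hate" and "NOT_hate" become distinct vocabulary items, so a bag-of-tokens model can learn different weights for them. Note that the default `filters` argument of the Keras `Tokenizer` strips underscores, so the `NOT_` marker would need a custom `filters` string (or a different marker) to survive tokenization.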

Negation terms like these matter when analyzing the sentiment of a sentence, a phrase, or even a paragraph. To handle them, we define what is called sentence polarity.

The sentence polarity is calculated from the parts of a sentence. A sentence may contain simple parts of speech (POS) such as verbs, adverbs, and adjectives, or complex constituents (Noun Phrase [Pronoun, Noun] or Verb Phrase [Verb, Noun Phrase], relations of possession, determiners, etc.). The following hierarchy is an example of the constituents of a complete sentence.

(Sentence
(Noun Phrase (Pronoun, Noun))
(Adverbial Phrase (Adverb))
(Verb Phrase (Verb)
(Sentence
(Verb Phrase (Verb)
(Noun Phrase (Noun))
) ) ) )


Sentiment polarity calculation is a nested process. It calculates the sentiment of the innermost level first and then combines the result with the next higher level, a step also called sentiment propagation. Along the way it determines the polarity and intensity of words and phrases. If there is a negation term, the polarity is adjusted accordingly. The following example illustrates the whole process of polarity calculation.

Example 1:
They have not succeeded, and will never succeed, in breaking the will of this valiant people.
(Sentence
(Pronoun They)
(Verb Phrase
(Verb Phrase (have not)
(Verb Phrase (Verb succeeded)))
(and)
(Verb Phrase (will)
(Adverbial Phrase (Adverb never))
(Verb Phrase (succeed)))
(Prepositional Phrase (in)
(Sentence
(Verb Phrase (breaking)
(Noun Phrase
(Noun Phrase (the will))
(Prepositional Phrase (of)
(Noun Phrase (this valiant people)))))))))

The negation word 'not' affects succeeded (+), whereas 'never' affects succeed (+); the two verbs are joined by 'and', which links phrases of the same polarity. Both successes consist in breaking (-) the will of a valiant (+) people. Since they have not succeeded in doing something negative, the polarity of the sentence is positive.

Source: [Ashudeep Singh, Quora](https://www.quora.com/NLP-whats-the-best-method-to-detect-negated-contexts-in-text)
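The inside-out calculation described above can be sketched with a small recursive function. This is a toy model under loud assumptions: the lexicon, the negator list, the nested-list tree encoding, and the rule "multiply the polarities of the children, then flip once per negator" are all made up for illustration. The tree is a simplified single clause of the example sentence (the coordination with "will never succeed" is left out).

```python
# Toy polarity lexicon and negator list: assumptions for illustration only.
LEXICON = {"succeeded": 1, "succeed": 1, "breaking": -1, "valiant": 1}
NEGATORS = {"not", "never"}

def polarity(node):
    """Return +1, -1 or 0 for a word (str) or a phrase (list of nodes)."""
    if isinstance(node, str):
        return LEXICON.get(node.lower(), 0)
    result, seen_polar, flip = 1, False, False
    for child in node:
        if isinstance(child, str) and child.lower() in NEGATORS:
            flip = not flip           # a negator inverts the phrase it sits in
            continue
        p = polarity(child)           # inner phrases are resolved first
        if p != 0:
            result *= p
            seen_polar = True
    if not seen_polar:
        return 0                      # phrase contains no polar word
    return -result if flip else result

# Simplified parse of "They have not succeeded in breaking the will
# of this valiant people", written by hand as nested lists.
inner = ["succeeded", ["in", ["breaking", ["the", "will",
         ["of", ["this", "valiant", "people"]]]]]]
clause = ["They", ["have", "not", inner]]

print(polarity(clause))
# 1
```

Dropping "not" from the clause gives -1, because succeeding (+) in breaking (-) the will of a valiant (+) people is a negative outcome. A real system would also need intensities and a proper treatment of coordination, which this sketch ignores.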

---

### Possible solution

All negation words are divided into three categories.

- All negations that totally reverse the polarity of other words are classified as syntactic negations.
- The diminisher class covers all negation words that lessen the polarities rather than inverting them.
- All prefixes and suffixes that can be used to produce a morphological negative are included in the morphological class. These prefixes and suffixes are also employed to identify the existence of a morphological negative.

[source](https://analyticsindiamag.com/when-to-use-negation-handling-in-sentiment-analysis/)
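As a rough sketch of the three categories listed above, the function below sorts a single word into the syntactic, diminisher, or morphological class. The word lists, the affix lists, and the small `KNOWN_WORDS` vocabulary (used to avoid flagging words such as "under") are hypothetical placeholders; a real lexicon would be far larger.

```python
# Small illustrative lists; all of them are assumptions, not a real lexicon.
SYNTACTIC = {"no", "not", "never", "neither", "nor", "cannot"}
DIMINISHERS = {"hardly", "barely", "rarely", "seldom", "scarcely"}
NEG_PREFIXES = ("un", "non", "dis")
NEG_SUFFIXES = ("less",)
# Base words we accept after stripping an affix (hypothetical vocabulary).
KNOWN_WORDS = {"happy", "fair", "honest", "help", "sense"}

def negation_type(word):
    """Return 'syntactic', 'diminisher', 'morphological', or None."""
    w = word.lower().replace("’", "'")
    if w in SYNTACTIC or w.endswith("n't"):
        return "syntactic"            # fully reverses polarity
    if w in DIMINISHERS:
        return "diminisher"           # weakens polarity instead of flipping it
    for prefix in NEG_PREFIXES:
        if w.startswith(prefix) and w[len(prefix):] in KNOWN_WORDS:
            return "morphological"    # negative built with a prefix
    for suffix in NEG_SUFFIXES:
        if w.endswith(suffix) and w[:-len(suffix)] in KNOWN_WORDS:
            return "morphological"    # negative built with a suffix
    return None

for w in ["not", "won't", "hardly", "unhappy", "helpless", "under"]:
    print(w, negation_type(w))
```

Checking the stripped remainder against a vocabulary is one possible design choice; without it, any word that merely starts with "un" or "dis" would be misclassified as a morphological negative.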

# Test with pre-trained

https://stanfordnlp.github.io/stanza/<br>
https://spacy.io/universe/project/spacy-stanza

In [19]:
import spacy 
import stanza 
import spacy_stanza
from negspacy.negation import Negex
from negspacy.termsets import termset 

nlp_model = spacy_stanza.load_pipeline('en')

2024-01-22 20:05:51 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2024-01-22 20:05:53 INFO: Loading these models for language: en (English):
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| pos          | combined_charlm     |
| lemma        | combined_nocharlm   |
| constituency | ptb3-revised_charlm |
| depparse     | combined_charlm     |
| sentiment    | sstplus             |
| ner          | ontonotes_charlm    |

2024-01-22 20:05:53 INFO: Using device: cpu
2024-01-22 20:05:53 INFO: Loading: tokenize
2024-01-22 20:05:53 INFO: Loading: pos
2024-01-22 20:05:54 INFO: Loading: lemma
2024-01-22 20:05:54 INFO: Loading: constituency
2024-01-22 20:05:55 INFO: Loading: depparse
2024-01-22 20:05:55 INFO: Loading: sentiment
2024-01-22 20:05:55 INFO: Loading: ner
2024-01-22 20:05:56 INFO: Done loading processors!


In [20]:
nlp_model.add_pipe("negex", config={"ent_types":["PERSON","ORG","CARDINAL", "DATE", "EVENT", "LANGUAGE", "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART"]})

<negspacy.negation.Negex at 0x14e070dd0>

In [21]:
sample = nlp_model('There is no English language option.')
 
for e in sample.ents:
  print(e.text, e._.negex)

doc = nlp_model('He does not like Adolf Hitler but likes German products.')
 
for e in doc.ents:
  print(e.text, e._.negex)

English True
Adolf Hitler True
German False


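A value of `True` means the entity falls within the scope of a negation (it is negated in the sentence); `False` means it is not negated. The flag says nothing about whether the entity itself carries a positive or negative meaning.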

In [28]:
stanza.download('en')
nlp = stanza.Pipeline('en', use_gpu=False)

In [32]:
doc = nlp("Immigrants like you do not deserve to live.")

print(doc)
print(doc.entities)

[
  [
    {
      "id": 1,
      "text": "Immigrants",
      "lemma": "immigrant",
      "upos": "NOUN",
      "xpos": "NNS",
      "feats": "Number=Plur",
      "head": 6,
      "deprel": "nsubj",
      "start_char": 0,
      "end_char": 10,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 2,
      "text": "like",
      "lemma": "like",
      "upos": "ADP",
      "xpos": "IN",
      "head": 3,
      "deprel": "case",
      "start_char": 11,
      "end_char": 15,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 3,
      "text": "you",
      "lemma": "you",
      "upos": "PRON",
      "xpos": "PRP",
      "feats": "Case=Acc|Person=2|PronType=Prs",
      "head": 1,
      "deprel": "nmod",
      "start_char": 16,
      "end_char": 19,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 4,
      "text": "do",
      "lemma": "do",
      "upos": "AUX",
      "xpos": "VBP",
      "feats": "Mo

In [33]:
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)

Immigrants immigrant NOUN
like like ADP
you you PRON
do do AUX
not not PART
deserve deserve VERB
to to PART
live live VERB
. . PUNCT
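Beyond lemmas and POS tags, the dependency fields (`id`, `head`) shown in the parser output can tell us which word a negation attaches to. The following sketch works on plain dictionaries shaped like that output, so it runs without loading a model. The sample annotation for "I do not hate women" is written by hand for illustration (it is not parser output), and the cue list and the function name `negated_heads` are assumptions.

```python
# Assumed set of negation lemmas (illustrative, not exhaustive).
NEGATION_LEMMAS = {"not", "no", "never", "n't"}

def negated_heads(words):
    """Return the text of every word that a negation cue depends on."""
    by_id = {w["id"]: w for w in words}
    heads = []
    for w in words:
        is_negation = (
            w["lemma"] in NEGATION_LEMMAS
            or "Polarity=Neg" in w.get("feats", "")
        )
        # head == 0 marks the root, which has no governing word.
        if is_negation and w["head"] in by_id:
            heads.append(by_id[w["head"]]["text"])
    return heads

# Hand-annotated sample in the same shape as the word dictionaries above.
sample = [
    {"id": 1, "text": "I", "lemma": "I", "head": 4, "deprel": "nsubj"},
    {"id": 2, "text": "do", "lemma": "do", "head": 4, "deprel": "aux"},
    {"id": 3, "text": "not", "lemma": "not", "head": 4, "deprel": "advmod"},
    {"id": 4, "text": "hate", "lemma": "hate", "head": 0, "deprel": "root"},
    {"id": 5, "text": "women", "lemma": "woman", "head": 4, "deprel": "obj"},
]

print(negated_heads(sample))
# ['hate']
```

Knowing that "not" governs "hate" is the kind of signal the base Keras model lacked: it could be used, for example, to mark the negated verb before the text is tokenized for training.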


---

In [34]:
import stanza

stanza.download('en')
nlp = stanza.Pipeline('en', use_gpu=False)

def tokenize_and_lemmatize(text):
    doc = nlp(text)
    tokens = [word.lemma for sent in doc.sentences for word in sent.words]
    return tokens

