# Sentiment Analysis - Feature extraction

This notebook is part of the paper *Automated Identification of Security-Relevant Configuration Settings Using NLP* submitted to the [**37<sup>th</sup> IEEE/ACM International Conference on Automated Software Engineering (ASE)**](https://conf.researchr.org/track/ase-2022/ase-2022-industry-showcase).

The other notebooks can be found here:

- [Topic Modeling and Latent Dirichlet Allocation](https://www.kaggle.com/code/tumin4/topic-modeling-and-latent-dirichlet-allocation)
- [Transformer-based Machine Learning](https://www.kaggle.com/tumin4/transformer-based-machine-learning)

and on [GitHub](https://github.com/tum-i4/Automated-Identification-of-Security-Relevant-Configuration-Settings-Using-NLP/).

## Contact

If you have any questions, please contact [Patrick Stöckle](mailto:patrick.stoeckle@tum.de?subject=Kaggle%20Notebook%20%22Sentiment%20Analysis%22).


1. Acknowledgements
2. Import libraries
3. Load data
4. Text preprocessing
5. Class distribution
6. Feature extraction characteristics
    1. POS Tags
    2. N-grams
    3. Frequency

### Acknowledgements

This kernel is inspired by the following notebooks: 
* NLP - EDA, Bag of Words, TF IDF, GloVe, BERT
* Twitter sentiment Extaction-Analysis,EDA and Model

### Import libraries

In [None]:
from json import load
from collections import Counter
from pandas import DataFrame
from wordcloud import WordCloud
from nltk import word_tokenize
from nltk.text import Text
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import matplotlib.pyplot as plt
import string
import spacy

### Load data

In [None]:
with open("../input/ase2022/docs/cis/1909/sec_docs.json") as f_read:
    security_documents = load(f_read)

with open("../input/ase2022/docs/cis/1909/non_sec_docs.json") as f_read:
    non_security_documents = load(f_read)

print("CIS Windows 10 version 1909:")
print(f"{len(security_documents)} security documents.")
print(f"{len(non_security_documents)} non security documents.")

In [None]:
# Pandas dataframe
df_sec_docs = DataFrame(security_documents)
df_non_sec_docs = DataFrame(non_security_documents)

# Text representation
sec_docs_text = df_sec_docs.text.tolist()
non_sec_docs_text = df_non_sec_docs.text.tolist()

In [None]:
print(df_sec_docs.head(10))

### Text preprocessing

The spaCy library tokenizes the documents and automatically attaches several attributes to each token:
* lemma
* POS tag
* stop-word flag

spaCy does not include a stemmer, so the analysis below works with lemmas.

In [None]:
nlp = spacy.load('en_core_web_sm')
spacy_non_sec_docs= [nlp(document) for document in df_non_sec_docs['text']]
spacy_sec_docs= [nlp(document) for document in df_sec_docs['text']]

### Class distribution

In [None]:
plt.rcParams['figure.figsize'] = (7, 5)
plt.bar(10,len(df_non_sec_docs),3, label="Non security relevant docs", color='blue')
plt.bar(15,len(df_sec_docs),3, label="Security relevant docs", color='red')
plt.legend()
ax = plt.gca()
ax.axes.xaxis.set_visible(False)
plt.ylabel('Number of documents')
plt.title('Class distribution Windows 10 v1909 configuration settings')
plt.savefig('ClassDist.png', bbox_inches='tight')
plt.show()

### Feature extraction characteristics

In [None]:
punctuations = string.punctuation

def print_word_cloud(input_words, title, file_name):
    """
    create word cloud and save to file
    """
    plt.figure(figsize=(16,13))
    word_cloud_dict = Counter(input_words)
    wc = WordCloud(background_color="white", max_words=500, max_font_size= 200,  width=1600, height=800).generate_from_frequencies(word_cloud_dict)
    plt.title(title, fontsize=20)
    plt.imshow(wc.recolor( colormap= 'viridis' , random_state=17), alpha=0.98, interpolation="bilinear", )
    plt.axis('off')
    wc.to_file(file_name)
    
def to_df(words, col_lst):
    """
    display dataframe with color gradient
    """
    df = DataFrame(words)
    df.columns = col_lst
    display(df.style.background_gradient(cmap='Blues'))

### POS Tags

#### Distribution
The words from the security documents are grouped by POS tag (noun, adjective, verb, adverb), and each group is displayed as a word cloud.

In [None]:
# raw string avoids the invalid escape sequence warning for "\|"
regex = re.compile(r'[@_!#$%^&*()<>?/\|}{~:]')
pos_dict = {'NN':[],'JJ':[],'VB':[],'RB':[]}
for pos in pos_dict:
    pos_dict[pos]= [token.lemma_.lower() for doc in spacy_sec_docs for sent in doc.sents for token in sent if (regex.search(token.text) is None) and (not token.is_stop) and re.search(re.compile(rf'{pos}.*'),token.tag_)]
    print_word_cloud(pos_dict[pos], f"{pos} in Windows 10 v1909 security documents", f"{pos}Wc.png") 
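
The grouping in the cell above can be sketched without spaCy. The `(lemma, tag)` pairs and the helper `group_by_tag_prefix` below are hypothetical stand-ins for the real tokens, not part of the notebook. The sketch shows how a coarse prefix such as `NN` also collects the fine-grained tags `NNS` and `NNP`. It uses `str.startswith`, which is stricter than the `re.search` call above: a tag such as `WRB` contains `RB` but does not start with it.

```python
from collections import defaultdict

# Hypothetical (lemma, fine-grained tag) pairs standing in for spaCy tokens.
tagged = [
    ("password", "NN"),
    ("setting", "NNS"),
    ("windows", "NNP"),
    ("enable", "VB"),
    ("configure", "VBN"),
    ("secure", "JJ"),
    ("remotely", "RB"),
    ("when", "WRB"),
]

def group_by_tag_prefix(pairs, prefixes=("NN", "JJ", "VB", "RB")):
    """Collect lemmas under the first prefix their tag starts with."""
    groups = defaultdict(list)
    for lemma, tag in pairs:
        for prefix in prefixes:
            if tag.startswith(prefix):
                groups[prefix].append(lemma)
                break
    return dict(groups)

groups = group_by_tag_prefix(tagged)
for prefix, lemmas in groups.items():
    print(prefix, lemmas)
```

Whether wh-adverbs such as "when" should count as adverbs is a design choice. In the notebook they are likely removed by the stop-word filter before the tag is checked.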

#### Frequent noun chunks

In [None]:
noun_chunks = Counter([chunk.text.lower() for doc in spacy_sec_docs for sent in doc.sents for chunk in sent.noun_chunks if len(chunk.text)>4])
to_df(noun_chunks.most_common(50), ['noun_chunks','count'])

### N-grams

#### Collocations

In [None]:
plain_tokens =[token for document in sec_docs_text for token in word_tokenize(document)]
Text(plain_tokens).collocations()
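
As a rough intuition for what a collocation is, the sketch below counts adjacent word pairs and keeps those that repeat. This is a simplification and not NLTK's method: NLTK ranks candidate pairs with a statistical association measure, while this sketch swaps in plain frequency counts. The token stream and the helper `frequent_bigrams` are hypothetical.

```python
from collections import Counter

# Hypothetical token stream; the notebook obtains its tokens from word_tokenize.
tokens = (
    "the windows firewall blocks remote desktop and "
    "the windows firewall logs remote desktop sessions"
).split()

def frequent_bigrams(words, min_count=2):
    """Return adjacent word pairs that occur at least min_count times."""
    pairs = Counter(zip(words, words[1:]))
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]

for (first, second), n in frequent_bigrams(tokens):
    print(f"{first} {second}: {n}")
```

Raw counts favor pairs built from very common words, which is one reason an association measure is usually preferred on real text.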

#### Named Entity Recognition

In [None]:
ner=[ent.text.lower() for doc in spacy_sec_docs for sent in doc.sents for ent in sent.ents if len(sent)>3]
print_word_cloud(ner, "Named entities in security documents", "NE.png")

### Frequency

#### 1. High frequency

#### Sec-docs

In [None]:
sec_lemma_tokens = Counter([token.lemma_.lower() for doc in spacy_sec_docs for sent in doc.sents for token in sent if (not token.is_stop) and (token.lemma_ not in punctuations)])
print(f"Amount of sec tokens: {len(sec_lemma_tokens)}")
to_df(sec_lemma_tokens.most_common(50), ['Frequent_words','Count'])

#### Non-sec-docs

In [None]:
non_sec_lemma_tokens = Counter([token.lemma_.lower() for doc in spacy_non_sec_docs for sent in doc.sents for token in sent if (not token.is_stop) and (token.lemma_ not in punctuations)])
print(f"Amount of non-sec tokens: {len(non_sec_lemma_tokens)}")
to_df(non_sec_lemma_tokens.most_common(50), ['Frequent_words','Count'])

#### Frequency of security relevant & frequent words from sec-docs in non-sec-docs

In [None]:
security_identifying_words = {'access', 'password', 'bitlocker', 'encryption', 'update', 'recovery'}
sec_tokens_in_non_sec_docs = Counter([token.lemma_.lower() for doc in spacy_non_sec_docs for sent in doc.sents for token in sent if token.lemma_.lower() in security_identifying_words])
to_df(sec_tokens_in_non_sec_docs.most_common(6), ['Sec_words','Count_in_non-sec_docs'])

#### 2. Words occurring only in sec or non-sec docs

#### Words occurring only in sec docs

In [None]:
sec_tokens = sec_lemma_tokens.keys()
non_sec_tokens = non_sec_lemma_tokens.keys()

sec_only_words = sec_tokens-non_sec_tokens
# keep words of 3 to 15 characters without special characters that occur only in sec-docs
long_sec_only_words = set(word for word in sec_only_words if (len(word)>2) and (len(word)<16) and (regex.search(word) is None))

print(f"Number of words occurring only in security documents: {len(long_sec_only_words)}")

# Word-Cloud
print_word_cloud(long_sec_only_words, "Words occurring in sec-docs only", "OnlySecWords.png")

In [None]:
long_low_freq_sec_only_words = set(word for word in long_sec_only_words if sec_lemma_tokens[word]<=5) 
print(f"Number of words occurring only in security documents having a low frequency <= 5: {len(long_low_freq_sec_only_words)}")

In [None]:
long_G5freq_sec_only_words = set(word for word in long_sec_only_words if sec_lemma_tokens[word]>5)

print(f"Number of words occurring only in security documents having a frequency > 5: {len(long_G5freq_sec_only_words)}")

# Word-Cloud
print_word_cloud(long_G5freq_sec_only_words, "Words occurring only in sec docs with a frequency > 5", "freq-sec-only.png")
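
The cells above rely on the fact that dictionary key views support set operations, so the class-exclusive vocabulary is a plain set difference. A minimal sketch with hypothetical lemma counts (the words and numbers are invented for illustration):

```python
from collections import Counter

# Hypothetical lemma counts for the two classes.
sec_counts = Counter({"password": 12, "bitlocker": 7, "setting": 30, "wdigest": 2})
non_sec_counts = Counter({"setting": 45, "wallpaper": 9, "password": 1})

# Key views behave like sets, so no explicit set() conversion is needed.
sec_only = sec_counts.keys() - non_sec_counts.keys()
non_sec_only = non_sec_counts.keys() - sec_counts.keys()

# Split the security-only words at a frequency threshold of 5.
rare = {word for word in sec_only if sec_counts[word] <= 5}
frequent = {word for word in sec_only if sec_counts[word] > 5}

print(sorted(sec_only))
print(sorted(non_sec_only))
print(sorted(rare), sorted(frequent))
```

A word that appears even once in the other class is dropped from the exclusive set, as "password" is here, so the result is sensitive to rare occurrences.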

#### Words occurring only in non-sec-docs

In [None]:
non_sec_only_words = non_sec_tokens-sec_tokens
# keep words of 3 to 15 characters without special characters that occur only in non-sec-docs
long_non_sec_only_words = set(word for word in non_sec_only_words if (len(word)>2) and (len(word)<16) and (regex.search(word) is None))

print(f"Number of words occurring only in non-sec documents: {len(long_non_sec_only_words)}")

# Word-Cloud
print_word_cloud(long_non_sec_only_words, "Words occurring in non-sec-docs only", "Non-sec-only.png")

In [None]:
non_sec_only_freq = Counter([token.lemma_.lower() for doc in spacy_non_sec_docs for sent in doc.sents for token in sent if (not token.is_stop) and (token.lemma_ not in punctuations) and (token.lemma_.lower() not in sec_tokens)])
to_df(non_sec_only_freq.most_common(50), ['Words_in_non-sec_docs_only','Count'])

#### 3. Tf-idf

In [None]:
# https://www.bogotobogo.com/python/NLTK/tf_idf_with_scikit-learn_NLTK.php
def tokenize(text):
    """
    tokenize text and remove stop words and punctuations
    """
    spacy_text= nlp(text)
    lemma_tokens = [token.lemma_.lower() for sent in spacy_text.sents for token in sent if (not token.is_stop) and (token.lemma_ not in punctuations)]
    return lemma_tokens

def tf_idf_feature_extraction(data):
    """
    Extract features using TF-IDF
    """
    vectorizer = TfidfVectorizer(tokenizer=tokenize)
    feature_set = set()
    response = vectorizer.fit_transform(data)
    for i in range(len(data)):
        df = DataFrame(response[i].T.todense(),
            # get_feature_names() was removed in scikit-learn 1.2
            index=vectorizer.get_feature_names_out(),
            columns=["tfidf"])
        for e in df.index[df.tfidf >= 0.5].tolist():
            if (len(e) > 2) and (len(e) < 16):
                feature_set.add(e)
    return feature_set
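
To make the 0.5 threshold more concrete, the following sketch recomputes tf-idf weights by hand. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document). The documents and the helpers `tfidf_rows` and `high_weight_terms` are hypothetical and only mirror the idea of `tf_idf_feature_extraction`.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents.
docs = [
    ["enable", "password", "complexity"],
    ["enable", "firewall", "logging", "enable"],
    ["disable", "autoplay"],
]

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(weight * weight for weight in raw.values()))
        rows.append({term: weight / norm for term, weight in raw.items()})
    return rows

def high_weight_terms(tokenized_docs, threshold=0.5):
    """Collect terms whose weight reaches the threshold in any document."""
    return {
        term
        for row in tfidf_rows(tokenized_docs)
        for term, weight in row.items()
        if weight >= threshold
    }

print(sorted(high_weight_terms(docs)))
```

Because each document vector has unit length, a weight of 0.5 or more means the term carries a large share of that document's total weight. Short documents therefore pass the threshold more easily than long ones.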

#### Sec-docs

In [None]:
tfidf_sec_docs_features = tf_idf_feature_extraction(sec_docs_text)
print(f"Feature size: {len(tfidf_sec_docs_features)}")

# Word-Cloud
print_word_cloud(tfidf_sec_docs_features, "Tf-idf sec-docs features", "tf_idf_sec_docs_features.png")

##### Manual identification of security relevant words

In [None]:
sec_identifying_words={'inprivate', 'connectivity', 'trust', 'spotlight', 'encryption', 'registration', 'wdig', 'print', 'insecure', 'pin', 'preview', 'blocker', 'recovery', 'recording', 'disconnected', 'remote', 'log', 'game', 'bridge', 'antivirus', 'updates', 'store', 'standby', 'peer', 'driver', 'quality', 'complexity', 'build', 'index', 'winrm', 'pause', 'boot', 'protocol', 'search', 'autoplay', 'toast', 'cookie', 'sehop', 'monitoring', 'camera', 'cortana', 'flag', 'certificate', 'notification', 'scan', 'connection', 'rpc', 'installation', 'elevate', 'dangerous', 'smartscreen', 'clipboard', 'password', 'lpt', 'microphone', 'credssp', 'watson', 'join', 'share', 'sleep', 'player', 'redirection', 'publish', 'push', 'credential', 'dma', 'expiration', 'update', 'authentication', 'mapper', 'location', 'late', 'ntp', 'saver', 'enumerate', 'restart', 'error', 'autorun', 'tip', 'llmnr'}
print(f"Number of security identifying words: {len(sec_identifying_words)}")

# Word-Cloud
print_word_cloud(sec_identifying_words, "Tf-idf security words", "security_words.png")

#### Frequency of security relevant words identified with tf-idf in non-sec docs

In [None]:
freq_sec_words_in_non_sec_docs = Counter([token.lemma_.lower() for doc in spacy_non_sec_docs for sent in doc.sents for token in sent if token.lemma_.lower() in sec_identifying_words])
to_df(freq_sec_words_in_non_sec_docs.most_common(50), ['Sec_words','Count_in_non-sec_docs'])

#### POS Tags of security relevant words

In [None]:
tag_dict = {'NN':0, 'VB':0, 'JJ':0, 'RB':0}
for pos in tag_dict:
    tag_dict[pos]= len([1 for doc in spacy_sec_docs for sent in doc.sents for token in sent if (token.lemma_.lower() in sec_identifying_words) and re.search(re.compile(rf'{pos}.*'),token.tag_)])
plt.rcParams['figure.figsize'] = (18.0, 6.0)
x,y=zip(*tag_dict.items())
plt.bar(x,y)
plt.savefig('POSDist.png', bbox_inches='tight')     

#### Non-sec-docs

In [None]:
tfidf_non_sec_docs_features = tf_idf_feature_extraction(non_sec_docs_text)
print(f"Feature size: {len(tfidf_non_sec_docs_features)}")

# Word-Cloud
print_word_cloud(tfidf_non_sec_docs_features, "Tf-idf non-sec-docs features", "tf_idf_non_sec_docs_features.png")

In [None]:
tfidf_non_sec_words_freq = Counter([token.lemma_.lower() for doc in spacy_non_sec_docs for sent in doc.sents for token in sent if (not token.is_stop) and (token.lemma_ not in punctuations) and (token.lemma_.lower() in tfidf_non_sec_docs_features) and (token.lemma_.lower() not in sec_tokens)])
print(f"Number of tf-idf features occurring in non-sec-docs but not in sec-docs: {len(tfidf_non_sec_words_freq)}")
to_df(tfidf_non_sec_words_freq.most_common(50), ['Non-sec_tf-idf_features','Count'])

In [None]:
# Word-Cloud
print_word_cloud(tfidf_non_sec_words_freq, "Tf-idf non-sec-only features", "tf_idf_non_sec_only_features.png")