
# Chapter 1 Basic Feature Extraction

## Number of characters

In [None]:
# Compute the number of characters
text = "I don't know."
num_char = len(text)

# Print the number of characters
print(num_char)

# Create a 'num_chars' feature
# (df is assumed to be a pandas DataFrame with a 'review' column)
df['num_chars'] = df['review'].apply(len)

## Number of words

In [None]:
# Function that returns number of words in string
def word_count(string):
    # Split the string into words
    words = string.split()

    # Return length of words list
    return len(words)

# Create num_words feature in df
df['num_words'] = df['review'].apply(word_count)

## Average Word Length

In [None]:
# Function that returns average word length
def avg_word_length(x):
    # Split the string into words
    words = x.split()

    # Guard against empty strings to avoid dividing by zero
    if not words:
        return 0.0

    # Compute length of each word and store in a separate list
    word_lengths = [len(word) for word in words]

    # Compute and return average word length
    return sum(word_lengths) / len(words)

# Create a new feature avg_word_length
df['avg_word_length'] = df['review'].apply(avg_word_length)

## Hashtags and mentions

In [None]:
# Function that returns number of hashtags
def hashtag_count(string):
    # Split the string into words
    words = string.split()

    # Create a list of hashtags
    hashtags = [word for word in words if word.startswith('#')]

    # Return number of hashtags
    return len(hashtags)
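
The heading also covers mentions, which can be counted the same way. A minimal sketch, assuming a mention is a whitespace-delimited word that starts with `@` and has at least one character after it; `mention_count` and the sample tweet are illustrative names, not part of the original notebook:

```python
# Function that returns number of mentions (hypothetical helper)
def mention_count(string):
    # Split the string into words
    words = string.split()

    # Keep words that start with '@' and have at least one character after it
    mentions = [word for word in words if word.startswith('@') and len(word) > 1]

    # Return number of mentions
    return len(mentions)


# Illustrative sample tweet
tweet = "@alice and @bob are learning #nlp with @ symbols"
print(mention_count(tweet))
```

On a DataFrame it would be applied like the other features, for example `df['review'].apply(mention_count)`.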

## Readability Test

* Flesch reading ease
* Gunning fog index

In [None]:
# Import the Textatistic class
from textatistic import Textatistic

# Create a Textatistic object and compute its readability scores
readability_scores = Textatistic(text).scores

# Print the Flesch reading ease and Gunning fog scores
print(readability_scores['flesch_score'])
print(readability_scores['gunningfog_score'])
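
For reference, both scores are simple functions of word, sentence, and syllable counts. The sketch below applies the standard published formulas to raw counts. The function names and the counts are made up for illustration, and Textatistic's own counting of syllables and sentences may lead to slightly different numbers on real text:

```python
def flesch_reading_ease(num_words, num_sentences, num_syllables):
    """Flesch reading ease from raw counts (higher means easier to read)."""
    return (206.835
            - 1.015 * (num_words / num_sentences)
            - 84.6 * (num_syllables / num_words))


def gunning_fog(num_words, num_sentences, num_complex_words):
    """Gunning fog index from raw counts (higher means harder to read)."""
    return 0.4 * ((num_words / num_sentences)
                  + 100 * (num_complex_words / num_words))


# Illustrative counts: 100 words, 5 sentences, 150 syllables, 10 complex words
print(flesch_reading_ease(100, 5, 150))
print(gunning_fog(100, 5, 10))
```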

# Chapter 2 Text Preprocessing

## Tokenization using spaCy

In [None]:
import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# Initialize string
string = "Hello! I don't know what I'm doing here."
# Create a Doc object
doc = nlp(string)
# Generate list of tokens
tokens = [token.text for token in doc]
print(tokens)

['Hello', '!', 'I', 'do', "n't", 'know', 'what', 'I', "'m", 'doing', 'here', '.']


## Lemmatization using spaCy

In [None]:
import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# Initialize string
string = "Hello! I don't know what I'm doing here."
# Create a Doc object
doc = nlp(string)
# Generate list of lemmas
lemmas = [token.lemma_ for token in doc]
print(lemmas)

['hello', '!', '-PRON-', 'do', 'not', 'know', 'what', '-PRON-', 'be', 'do', 'here', '.']


## Text Cleaning

* Unnecessary whitespace and escape sequences
* Punctuation
* Special characters (numbers, emojis, etc.)
* Stopwords

### Removing non-alphabetic characters

In [None]:
string = """
OMG!!!! This is like
the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""
import spacy
# Load the model and generate list of lemmas
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]
...
...
# Remove tokens that are not alphabetic
# ('-PRON-' is the pronoun lemma in spaCy v2; spaCy v3 returns the pronoun itself)
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() or lemma == '-PRON-']
# Print string after text cleaning
print(' '.join(a_lemmas))

OMG this be like the good thing ever wow such an amazing song -PRON- be hooked top definitely


### Removing stopwords using spaCy

In [None]:
# Get list of stopwords
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS
string = """
OMG!!!! This is like
the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""
...
...
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]
# Print string after text cleaning
print(' '.join(a_lemmas))

OMG like good thing wow amazing song hooked definitely


### Other text preprocessing techniques
* Removing HTML/XML tags
* Replacing accented characters (such as é)
* Correcting spelling errors
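
The first two techniques can be sketched with the standard library alone. This is a rough illustration, assuming a naive regular expression is good enough for the tags involved (it is not a full HTML parser); `strip_tags`, `strip_accents`, and the sample string are hypothetical names introduced here:

```python
import re
import unicodedata


def strip_tags(text):
    # Drop anything that looks like an HTML/XML tag (naive; not a full parser)
    return re.sub(r'<[^>]+>', '', text)


def strip_accents(text):
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


raw = "<p>The café serves <b>crème brûlée</b>.</p>"
print(strip_accents(strip_tags(raw)))
```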

## Part of Speech Tagging

### POS tagging using spaCy

In [None]:
import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# Initialize string
string = "Jane is an amazing guitarist"
# Create a Doc object
doc = nlp(string)
...
...
# Generate list of tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

[('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'), ('amazing', 'ADJ'), ('guitarist', 'NOUN')]


## Named Entity Recognition

In [None]:
import spacy
string = "John Doe is a software engineer working at Google. He lives in France."

# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)

# Generate named entities
ne = [(ent.text, ent.label_) for ent in doc.ents]
print(ne)

[('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]


# Chapter 3 Bag of Words Model

### BoW using scikit-learn

In [None]:
import pandas as pd

corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
])

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Create CountVectorizer object
vectorizer = CountVectorizer()
# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())

[[0 0 0 0 1 1 1 0 1 0 1 0 3]
 [0 1 0 1 0 0 0 1 0 1 1 0 0]
 [1 0 1 0 1 0 0 0 1 0 0 1 1]]
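
To see which word each column stands for, the same matrix can be rebuilt by hand. The sketch below is a plain-Python approximation of what `CountVectorizer` does with its default settings (lowercasing, keeping tokens of two or more characters, ordering columns alphabetically); the helper name `tokenize` is an assumption made for this example:

```python
import re
from collections import Counter

corpus = [
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())


# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# One row of counts per document, one column per vocabulary word
bow_rows = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    bow_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```

The last column corresponds to `the`, which is why the first row ends in 3. The single-character word `a` is dropped by the two-character minimum.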


### Building a BoW Naive Bayes classifier

In [None]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer(strip_accents='ascii', stop_words='english', lowercase=False)

# Import train_test_split
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.25)

# Generate training BoW vectors
X_train_bow = vectorizer.fit_transform(X_train)

# Generate test BoW vectors
X_test_bow = vectorizer.transform(X_test)

### Training the Naive Bayes classifier

In [None]:
# Import MultinomialNB
from sklearn.naive_bayes import MultinomialNB

# Create MultinomialNB object
clf = MultinomialNB()

# Train clf
clf.fit(X_train_bow, y_train)

# Compute accuracy on test set
accuracy = clf.score(X_test_bow, y_test)
print(accuracy)

### Building n-gram models

In [None]:
# Generates only bigrams.
bigrams = CountVectorizer(ngram_range=(2,2))

# Generates unigrams, bigrams and trigrams.
ngrams = CountVectorizer(ngram_range=(1,3))
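
As a quick illustration of what an n-gram is, the sketch below slides a window of `n` words over a sentence. It splits on whitespace only, so it is simpler than the tokenization `CountVectorizer` performs; `word_ngrams` is a hypothetical helper written for this example:

```python
def word_ngrams(text, n):
    # Lowercase, split on whitespace, and slide a window of n words
    words = text.lower().split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = 'The lion is the king of the jungle'
print(word_ngrams(sentence, 2))
print(word_ngrams(sentence, 3))
```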

# Chapter 4 Building tf-idf document vectors

## tf-idf using scikit-learn

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()
# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())

[[0.         0.         0.         0.         0.25434658 0.33443519
  0.33443519 0.         0.25434658 0.         0.25434658 0.
  0.76303975]
 [0.         0.46735098 0.         0.46735098 0.         0.
  0.         0.46735098 0.         0.46735098 0.35543247 0.
  0.        ]
 [0.45954803 0.         0.45954803 0.         0.34949812 0.
  0.         0.         0.34949812 0.         0.         0.45954803
  0.34949812]]
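
The weights above can be reproduced by hand, which shows where the numbers come from. The sketch assumes the `TfidfVectorizer` defaults: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and rows scaled to unit Euclidean length. The corpus is repeated so the example stands alone, and `tokenize` is a hypothetical helper:

```python
import math
import re
from collections import Counter

corpus = [
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())


docs = [tokenize(doc) for doc in corpus]
vocabulary = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each word
doc_freq = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed inverse document frequency
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
       for word in vocabulary}

tfidf_rows = []
for doc in docs:
    counts = Counter(doc)
    weights = [counts[word] * idf[word] for word in vocabulary]
    # Scale each row to unit Euclidean length
    norm = math.sqrt(sum(w * w for w in weights))
    tfidf_rows.append([w / norm for w in weights])

for row in tfidf_rows:
    print([round(w, 4) for w in row])
```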


## Cosine similarity

In [None]:
# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
# Define two 3-dimensional vectors A and B
A = (4,7,1)
B = (5,2,3)
# Compute the cosine score of A and B
score = cosine_similarity([A], [B])
# Print the cosine score
print(score)
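
The score is the dot product of the two vectors divided by the product of their lengths. A minimal plain-Python sketch of that formula, using the same vectors (the function name `cosine` is chosen here for illustration):

```python
import math


def cosine(a, b):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


A = (4, 7, 1)
B = (5, 2, 3)
print(cosine(A, B))
```

For these vectors the dot product is 37 and the norms are the square roots of 66 and 38, giving a score of roughly 0.739.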

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of tf-idf vectors
tfidf_matrix = vectorizer.fit_transform(movie_plots)

# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# Generate cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Look up recommendations (get_recommendations and indices must be defined beforehand)
get_recommendations('The Lion King', cosine_sim, indices)
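
The notebook does not show how `get_recommendations` or `indices` are built. One possible shape is sketched below, assuming `indices` maps each title to its row in the similarity matrix. This is a hypothetical implementation, not the original helper; the `top_n` parameter, the extra titles, and the similarity values are all made up for illustration:

```python
def get_recommendations(title, cosine_sim, indices, top_n=2):
    # Row of similarity scores for the requested title
    idx = indices[title]
    scores = list(enumerate(cosine_sim[idx]))

    # Sort by similarity, highest first, and drop the title itself
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    scores = [pair for pair in scores if pair[0] != idx]

    # Map row positions back to titles
    titles = {position: name for name, position in indices.items()}
    return [titles[position] for position, _ in scores[:top_n]]


# Toy data for illustration only: three titles and a made-up similarity matrix
indices = {'The Lion King': 0, 'Film B': 1, 'Film C': 2}
cosine_sim = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
]
print(get_recommendations('The Lion King', cosine_sim, indices))
```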

## Beyond n-grams: word embeddings

* Mapping words into an n-dimensional vector space
* Produced using deep learning and huge amounts of data
* Discern how similar two words are to each other
* Used to detect synonyms and antonyms
* Captures complex relationships
  * King - Queen → Man - Woman
  * France - Paris → Russia - Moscow
* Dependent on the spaCy model; independent of the dataset you use
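
The "complex relationships" point can be pictured with a toy example. The vectors below are hand-made, two-dimensional, and purely illustrative (real embeddings are learned from data and have many more dimensions); the two axes are assumed to mean "royalty" and "maleness":

```python
# Hand-made toy vectors: (royalty, maleness). Not real embeddings.
vectors = {
    'king':  (1.0, 1.0),
    'queen': (1.0, 0.0),
    'man':   (0.0, 1.0),
    'woman': (0.0, 0.0),
}


def subtract(a, b):
    # Component-wise difference of two vectors
    return tuple(x - y for x, y in zip(a, b))


# In this toy space the king-queen offset equals the man-woman offset
print(subtract(vectors['king'], vectors['queen']))
print(subtract(vectors['man'], vectors['woman']))
```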

## Word embeddings using spaCy

In [None]:
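import spacy
# Load a model that ships with word vectors and create Doc object
# (en_core_web_sm has no static word vectors; en_core_web_md and en_core_web_lg do)
nlp = spacy.load('en_core_web_md')
doc = nlp('I am happy')
# Generate word vectors for each token
for token in doc:
    print(token.vector)
# Compute pairwise similarity between tokens
doc = nlp("happy joyous sad")
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))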

# Review

* Basic features (characters, words, mentions, etc.)
* Readability scores
* Tokenization and lemmatization
* Text cleaning
* Part-of-speech tagging & named entity recognition
* n-gram modeling
* tf-idf
* Cosine similarity
* Word embeddings