**Table of contents**<a id='toc0_'></a>    
- [Structured vs Unstructured Data](#toc1_)    
  - [Structured Data](#toc1_1_)    
    - [Pros $^{[1]}$](#toc1_1_1_)    
    - [Cons $^{[1]}$](#toc1_1_2_)    
    - [Tools](#toc1_1_3_)    
  - [Unstructured Data](#toc1_2_)    
    - [Pros $^{[1]}$](#toc1_2_1_)    
    - [Cons $^{[1]}$](#toc1_2_2_)    
    - [Tools](#toc1_2_3_)    
- [Intro to NLP](#toc2_)    
  - [Text preprocessing](#toc2_1_)    
    - [Tokenization](#toc2_1_1_)    
      - [Word tokenization](#toc2_1_1_1_)    
      - [Sentence tokenization](#toc2_1_1_2_)    
    - [POS tagging](#toc2_1_2_)    
    - [Stemming](#toc2_1_3_)    
    - [Lemmatization](#toc2_1_4_)    
    - [Stopwords removal](#toc2_1_5_)    
    - [Vectorization - Bag of Words (BoW) model](#toc2_1_6_)    
  - [News clustering](#toc2_2_)    
    - [Extract news data](#toc2_2_1_)    
    - [Preprocess text](#toc2_2_2_)    
    - [Vectorization](#toc2_2_3_)    
    - [Clustering](#toc2_2_4_)    
- [Resources](#toc3_)    
- [References](#toc4_)    
- [Acknowledgements](#toc5_)    

# <a id='toc1_'></a>[Structured vs Unstructured Data](#toc0_)

![](https://imgs.search.brave.com/-3045xn2hR2tPAVWfaseQTUx2r88o-SkOSWMKpPrRNc/rs:fit:860:0:0/g:ce/aHR0cHM6Ly9sYXd0/b21hdGVkLmNvbS93/cC1jb250ZW50L3Vw/bG9hZHMvMjAxOS8w/NC9zdHJ1Y3R1cmVk/VnNVbnN0cnVjdHVy/ZWRJZ25lb3MucG5n)  
(Source: [Structured vs Unstructured Data: An Overview, Mongo DB](https://www.mongodb.com/unstructured-data/structured-vs-unstructured))

## <a id='toc1_1_'></a>[Structured Data](#toc0_)

![](https://images.surferseo.art/970da28a-3eeb-45ec-a3bf-1f9dce66c94d.png)  
(Source: PhoenixNAP Global IT Services, June 2021)

### <a id='toc1_1_1_'></a>Pros [$^{[1]}$](https://www.mongodb.com/unstructured-data/structured-vs-unstructured)       [&#8593;](#toc0_)
> - **Easily used by machine learning (ML) algorithms**: The specific and organized architecture of structured data eases manipulation and querying of ML data.
> - **Easily used by business users**: Structured data does not require an in-depth understanding of different types of data and how they function. With a basic understanding of the topic relative to the data, users can easily access and interpret the data.
> - **Accessible by more tools**: Since structured data predates unstructured data, there are more tools available for using and analyzing structured data.

### <a id='toc1_1_2_'></a>Cons [$^{[1]}$](https://www.mongodb.com/unstructured-data/structured-vs-unstructured)       [&#8593;](#toc0_)
> - **Limited usage**: Data with a predefined structure can only be used for its intended purpose, which limits its flexibility and usability.
> - **Limited storage options**: Structured data is generally stored in data storage systems with rigid schemas (e.g., “data warehouses”). Therefore, changes in data requirements necessitate an update of all structured data, which leads to a massive expenditure of time and resources.

### <a id='toc1_1_3_'></a>[Tools](#toc0_)

- Mainly SQL-based (relational) databases, e.g. PostgreSQL, MySQL, SQLite
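
To make the "easy to query" point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` table and its rows are made up for illustration; any relational database would behave similarly.

```python
import sqlite3

# In-memory database; the table and rows are hypothetical examples
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Ana", "PT"), ("Ben", "NL"), ("Carla", "PT")],
)

# Because every row follows the same schema, grouping and counting is a single query
rows = conn.execute(
    "SELECT country, COUNT(*) FROM customers GROUP BY country ORDER BY country"
).fetchall()
print(rows)
conn.close()
```

The flip side, as noted in the cons above, is that adding a new kind of information means changing the table schema first.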

## <a id='toc1_2_'></a>[Unstructured Data](#toc0_)

![](https://www.egnyte.com/sites/default/files/inline-images/zITEmudg0OvGfRblGApjWuFu20xY1NCNAnmu8O52KKtD4FLSPG.png)  
(Source: [What Is Unstructured Data?, Egnyte](https://www.egnyte.com/guides/governance/unstructured-data))

### <a id='toc1_2_1_'></a>Pros [$^{[1]}$](https://www.mongodb.com/unstructured-data/structured-vs-unstructured)       [&#8593;](#toc0_)
> - **Native format**: Unstructured data, stored in its native format, remains undefined until needed. Its adaptability increases file formats in the database, which widens the data and enables data scientists to prepare and analyze only the data they need.
> - **Fast accumulation rates**: Since there is no need to predefine the data, it can be collected quickly and easily.
> - **Data lake storage**: Allows for massive storage and pay-as-you-use pricing, which cuts costs and eases scalability.

### <a id='toc1_2_2_'></a>Cons [$^{[1]}$](https://www.mongodb.com/unstructured-data/structured-vs-unstructured)       [&#8593;](#toc0_)
> - **Requires expertise**: Due to its undefined/non-formatted nature, data science expertise is required to prepare and analyze unstructured data. This is beneficial to data analysts but alienates unspecialized business users who may not fully understand specialized data topics or how to utilize their data.
> - **Specialized tools**: Specialized tools are required to manipulate unstructured data, which limits product choices for data managers.

### <a id='toc1_2_3_'></a>[Tools](#toc0_)

- NoSQL databases such as MongoDB and DynamoDB (AWS), plus distributed file systems and object storage such as Hadoop (HDFS), Azure storage services (Microsoft), and S3 (AWS)

# <a id='toc2_'></a>[Intro to NLP](#toc0_)

In [None]:
# You know the drill
# !pip install nltk
# !pip install spacy

In [None]:
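import nltk   # imported here so the version check also works before the main import cell runs
import spacy
print(nltk.__version__)    # I have 3.8.1
print(spacy.__version__)   # I have 3.5.3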

In [None]:
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

import spacy #if this doesn't work after installation, you might need to create another environment
pd.set_option('display.max_colwidth', None)

In [None]:
text = """The US State Department has just issued a statement on the meeting between Blinken and Abbas earlier today in Ramallah.

“Secretary Blinken discussed ongoing efforts to minimize civilian harm in Gaza and accelerate and increase the delivery of humanitarian assistance to Palestinian civilians throughout Gaza,” the statement quoted spokesperson Matthew Miller as saying.

Blinken also “noted increased volatility” in the occupied West Bank and discussed US efforts to address “extremist violence”.

“He also underscored the United States’ position that all Palestinian tax revenues collected by Israel should be consistently conveyed to the Palestinian Authority in accordance with prior agreements,” Miller said, adding that the US “supports tangible steps towards the creation of a Palestinian state alongside the State of Israel, with both living in peace and security”.

Earlier, we reported that during the meeting Abbas discussed the efforts made to stop the Israeli war on Gaza and the importance of accelerating the entry of aid into the bombarded territory.

For its part, Hamas denounced Blinken’s visit to the region saying the US official’s “attempts to justify the genocide committed by the Israeli occupation army against Palestinian civilians … are miserable attempts to wash the hands of the criminal occupation of the blood of children, women and the elderly of Gaza”."""

In [None]:
# Alternative, shorter sample text. Running this cell overwrites `text`; skip it to follow the remarks below, which refer to the first text.
text = """Satya Nadella, the chief executive of Microsoft, OpenAI’s biggest investor with a 49% stake, led mediation efforts that were complicated by Altman’s reported insistence that OpenAI’s board be removed as a precondition for his return. The interim CEO, Mira Murati, OpenAI’s chief technology officer, signalled her support for Altman’s return by posting a heart emoji next to her former colleague’s post professing love for OpenAI."""

## <a id='toc2_1_'></a>[Text preprocessing](#toc0_)

### <a id='toc2_1_1_'></a>[Tokenization](#toc0_)

> Tokenization, in the realm of Natural Language Processing (NLP) and machine learning, refers to the process of converting a sequence of text into smaller parts, known as tokens. These tokens can be as small as characters or as long as words. The primary reason this process matters is that it helps machines understand human language by breaking it down into bite-sized pieces, which are easier to analyze. [$^{[2]}$](https://www.datacamp.com/blog/what-is-tokenization)

Types of tokenization: [$^{[2]}$](https://www.datacamp.com/blog/what-is-tokenization)

> - **Multiple word tokenization**. Breaking down text into groups of words (n-grams) or sentences. This can be relevant for elements that typically go together such as "New York", or for analyzing larger corpora of data, e.g. books.
>
> - **Word tokenization**. Breaking text down into individual words. It's the most common approach and is particularly effective for languages with clear word boundaries like English. All NLP packages support this.
>
> - **Character tokenization**. Breaking text down into individual characters. This method is beneficial for languages that lack clear word boundaries (e.g. Chinese, Japanese, Arabic) or for tasks that require a granular analysis, such as spelling correction. For Chinese and Arabic, you can use the [Stanford Word Segmenter](https://nlp.stanford.edu/software/segmenter.shtml), a separate Stanford NLP tool that NLTK can call through a wrapper.
>
> - **Subword tokenization**. Striking a balance between word and character tokenization, this method breaks text into units that might be larger than a single character but smaller than a full word. For instance, "Chatbots" could be tokenized into "Chat" and "bots". This approach is especially useful for languages that form meaning by combining smaller units (e.g. German) or when dealing with out-of-vocabulary words in NLP tasks.
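
The four granularities above can be sketched in plain Python. This is only an illustration: the sample sentence and `toy_vocab` are made up, the word split is a naive whitespace split, and real subword tokenizers (e.g. BPE or WordPiece) learn their vocabulary from data instead of using a hand-written one.

```python
sample = "New York chatbots answer questions"

# Word tokenization (naive whitespace split; real tokenizers also handle punctuation)
words = sample.split()

# Multiple word tokenization: 2-grams built from consecutive words
bigrams = list(zip(words, words[1:]))

# Character tokenization
chars = list("York")

# Subword tokenization: greedy longest-match against a tiny, hypothetical vocabulary
def subword_tokenize(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining piece first; fall back to a single character
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

toy_vocab = {"chat", "bots", "bot", "s"}
print(words)
print(bigrams[:2])
print(chars)
print(subword_tokenize("chatbots", toy_vocab))
```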

#### <a id='toc2_1_1_1_'></a>[Word tokenization](#toc0_)

In [None]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')

tokens = word_tokenize(text)
tokens[:15]

We often remove punctuation after tokenization since punctuation is unlikely to be a good predictive feature:

In [None]:
tokens = [word for word in tokens if word.isalnum()]
tokens[:15]

#### <a id='toc2_1_1_2_'></a>[Sentence tokenization](#toc0_)

In [None]:
from nltk.tokenize import sent_tokenize
sent_tokenize(text)

### <a id='toc2_1_2_'></a>[POS tagging](#toc0_)

> Part of speech can be a useful feature in itself, but it is also heavily used to make lemmatization more accurate:

In [None]:
nltk.download('averaged_perceptron_tagger')
nltk.pos_tag(tokens, lang='eng')[:15]
#explanation of all these codes can be found here: https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Spacy does the tokenization under the hood
for token in doc:
    if (not token.is_punct) and (not token.is_space): # but it doesn't automatically remove stopwords and punctuation
        print(token, token.pos_)

In [None]:
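# spaCy can also visualize the dependency parse, with the POS tag shown under each token
from spacy import displacy
# render() draws inline in Jupyter; displacy.serve(doc, style="dep") starts a blocking web server instead
displacy.render(doc, style="dep", jupyter=True)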

POS tagging can be very useful for disambiguating words whose meaning depends on context. For example, you could distinguish between Thermos (the brand, a proper noun) and thermos (the everyday item, a common noun) and also identify which other words it is connected to. It also feeds into NER (Named Entity Recognition), machine translation, and other downstream tasks.

### <a id='toc2_1_3_'></a>[Stemming](#toc0_)

Stemming is an easy way to shrink our vocabulary: it chops word endings off with simple rules (the Porter stemmer below only strips suffixes), so that related forms such as "discussed" and "discussing" collapse into one token:

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed = [ps.stem(w) for w in tokens]
stemmed[:15]

However, stemming doesn't care about the semantics (meaning) of words, so words with different meanings can end up as the same stem, and the stem itself may not be a real word. That's when we would prefer to use lemmatization instead.
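
A quick way to see this effect is to stem a few words that share a prefix but mean different things. The word list below is an illustrative choice, not taken from the sample text; the Porter stemmer reduces all three to the same stem.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Three words with clearly different meanings
words = ["universe", "university", "universal"]
stems = {word: ps.stem(word) for word in words}
print(stems)
```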

### <a id='toc2_1_4_'></a>[Lemmatization](#toc0_)

> Lemmatization is a more context-aware version of stemming, where we reduce each word to its dictionary form (its lemma). However, lemmatization needs an understanding of the language to work, meaning that for under-represented languages (e.g. African languages), stemming may be more appropriate, if the language allows it. 

In [None]:
# WordNet is the best-known lemmatizer for English
nltk.download('wordnet') 
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
lemmatized[:15]

> Lemmatization may still look a bit weak (e.g. `issued` should ideally be reduced to `issue`), mostly because `WordNetLemmatizer` treats every word as a noun unless we tell it the part of speech.

In [None]:
display(lemmatizer.lemmatize("was"))
# display(lemmatizer.lemmatize("was", wordnet.VERB))
display(lemmatizer.lemmatize("better"))
# display(lemmatizer.lemmatize("better", wordnet.ADJ))
display(lemmatizer.lemmatize("canning"))
# display(lemmatizer.lemmatize("canning", wordnet.NOUN))
# display(lemmatizer.lemmatize("canning", wordnet.VERB))

The tagger used earlier (`nltk.pos_tag`) returns Penn Treebank tags (e.g. `VBD`, `NNS`), while the WordNet lemmatizer expects WordNet's own POS constants, so we need to map one onto the other before the tags can inform our lemmatizer:

In [None]:
# nltk.download('averaged_perceptron_tagger')

# unfortunately pos_tag and lemmatize use different codes for parts of speech
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper() # gets first letter of POS categorization
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN) # get returns second argument if first key does not exist

lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word,get_wordnet_pos(word)) for word in tokens]
lemmatized[:15]

Now let's see how spaCy approaches it:

In [None]:
spacy_lemma = [token.lemma_ for token in doc if (not token.is_punct and not token.is_space)]
spacy_lemma[:15]

### <a id='toc2_1_5_'></a>[Stopwords removal](#toc0_)

Stopwords are words that support the structure of a language without providing additional meaning to it. Thus, removal of stopwords allows us to reduce the noise in the data and focus on the words that carry meaning.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

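english_sw = set(stopwords.words('english'))  # stopwords.words() with no language loads every language's list
without_sw = [word for word in lemmatized if word not in english_sw]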
without_sw[:15]

In [None]:
# Let's check the full sentence
" ".join(without_sw[:15])

Let's also try with spaCy:

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

# Uncomment to download the package of small models for English
# !python -m spacy download en_core_web_sm  
# To find a list of languages available, you can check out https://spacy.io/usage/models#languages

In [None]:
words_spacy = []
for token in doc:
    if (not token.is_stop) and (not token.is_punct) and (not token.is_space):
        words_spacy.append(token.lemma_)
words_spacy[:15]

In [None]:
# Let's check the full sentence
" ".join(str(word) for word in words_spacy[:15])

spaCy's `is_stop` check ignores case, which cuts both ways: it removes "The", which the case-sensitive NLTK filter above kept, but it also removes US (the country), because it treats it as us (the pronoun).
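
One possible workaround, sketched below in plain Python, is to exempt all-caps tokens from a case-insensitive stopword check. This is an assumption about what might suit news text, not a spaCy feature: `toy_stopwords` is a tiny hypothetical set (not spaCy's actual list), `is_stopword` is a made-up helper, and the rule would also keep shouted words such as "THE".

```python
# Tiny, hypothetical stopword set for illustration only
toy_stopwords = {"the", "us", "has", "a", "on"}

def is_stopword(token, keep_acronyms=True):
    # All-caps tokens of 2+ characters are treated as acronyms and kept
    if keep_acronyms and token.isupper() and len(token) > 1:
        return False
    return token.lower() in toy_stopwords

tokens = ["The", "US", "State", "Department", "has", "issued", "a", "statement"]
kept = [t for t in tokens if not is_stopword(t)]
print(kept)
```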

### <a id='toc2_1_6_'></a>[Vectorization - Bag of Words (BoW) model](#toc0_)

![](https://www.mathworks.com/discovery/bag-of-words/_jcr_content/mainParsys/columns/8977d091-c0d0-4e65-925d-2d3cd856939c/image.adapt.full.medium.jpg/1686734791097.jpg)  
(Source: [What Is a Bag-of-Words?, Mathworks](https://www.mathworks.com/discovery/bag-of-words.html))

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
bow_vect = CountVectorizer()
# Fit creates one entry for each different word seen
bow_vect.fit([" ".join(without_sw)])

In [None]:
without_sw

In [None]:
bow_vect.transform([text]).toarray()

In [None]:
bow_vect.transform(['gaza civilian today gaza']).toarray()

Transform only considers the words it has seen during the training (fitting) stage:

In [None]:
bow_vect.transform(['gaza strip isi']).toarray()

The Bag of Words model is a very simple way to turn unstructured text into structured tabular data. However, it has a few shortcomings:
- It ignores word order, so it misses the meaning of word combinations, e.g. "not bad" or "don't do". This can be mitigated by also counting sequences of 2 words (2-grams) or more (n-grams) in the vectorizer.
- It doesn't take context into account, which means it may fail to pick up on things like sarcasm. Word embedding models, which we'll discuss in the next lesson, are the usual way to capture more context.

## <a id='toc2_2_'></a>[News clustering](#toc0_)

### <a id='toc2_2_1_'></a>[Extract news data](#toc0_)

> Corpus of 120k news headlines, here shortened to 10k:

In [None]:
all_news = pd.read_csv('https://raw.githubusercontent.com/sabinagio/data-analytics/main/data/news.csv')
print(all_news.shape)
all_news.head()

In [None]:
all_news.iloc[100]['news']

### <a id='toc2_2_2_'></a>[Preprocess text](#toc0_)

We do the same process as before, except now we do it for all news pieces:

In [None]:
# Tokenization, lowercasing, removing punctuation
def tokenizer_and_remove_punctuation(row):
  tokens = word_tokenize(row['news'])
  return [word.lower() for word in tokens if word.isalpha()]

all_news['tokenized'] = all_news.apply(tokenizer_and_remove_punctuation, axis=1)
all_news.head()

In [None]:
# Lemmatization using POS tags
lemmatizer = WordNetLemmatizer()

def lemmatizer_with_pos(row):
  return [lemmatizer.lemmatize(word,get_wordnet_pos(word)) for word in row['tokenized']]

all_news['lemmatized'] = all_news.apply(lemmatizer_with_pos, axis=1) # This one will take a while
all_news.head()

In [None]:
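# Stopwords removal (English list, built once; the list comprehension keeps word order and repeated words)
english_sw = set(stopwords.words('english'))

def remove_sw(row):
  return [word for word in row['lemmatized'] if word not in english_sw]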

all_news['no_stopwords'] = all_news.apply(remove_sw, axis=1)
all_news.head()

In [None]:
# Re-creating the sentences with the preprocessed text
def recreate_sentences(row):
  return " ".join(row['no_stopwords'])

all_news['clean_text'] = all_news.apply(recreate_sentences, axis=1)
all_news.head()

### <a id='toc2_2_3_'></a>[Vectorization](#toc0_)

In [None]:
# We will use only the most common 1000 words
bow_vect = CountVectorizer(max_features=1000)
# Fit transform creates one entry for each different word seen
X = bow_vect.fit_transform(all_news['clean_text']).toarray()
as_df = pd.DataFrame(X, columns=bow_vect.get_feature_names_out())
as_df.head()

### <a id='toc2_2_4_'></a>[Clustering](#toc0_)

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10, n_init=10, random_state=100)  # n_init set explicitly because its default differs across scikit-learn versions
kmeans.fit(X)
pred = kmeans.predict(X)

In [None]:
pred

In [None]:
predict_df = pd.concat([all_news['news'], pd.DataFrame(pred, columns=['class'])], axis=1)
predict_df.head()

Let's check out the clusters:

In [None]:
predict_df[predict_df['class']==0]

In [None]:
predict_df[predict_df['class']==1]

In [None]:
predict_df[predict_df['class']==4]

In [None]:
predict_df[predict_df['class']==5]

# <a id='toc3_'></a>[Resources](#toc0_)

- spaCy tutorial for beginners, made by spaCy creators: [Advanced NLP with spaCy](https://spacy.io/universe/project/spacy-course)
- NLP Intro (full of maths): [NLP 4 You](https://lena-voita.github.io/nlp_course.html)

# <a id='toc4_'></a>[References](#toc0_)

[1] [Structured vs Unstructured Data: An Overview, Mongo DB](https://www.mongodb.com/unstructured-data/structured-vs-unstructured)  
[2] [What is tokenization?, DataCamp](https://www.datacamp.com/blog/what-is-tokenization)

# <a id='toc5_'></a>[Acknowledgements](#toc0_)

Thank you, David Henriques, for your awesome lesson structure & content!