## AI Text Summarizer

Rodžers Ušackis, ACS301

### Extractive summarization

Main idea:
- the most important words will be the words that appear frequently
- if we print out the sentences with the most important words then we probably have some sort of summary

This is comparable to a student highlighting the important parts of a text.
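The two bullet points above can be sketched in plain Python. This is only a minimal illustration and not the spaCy pipeline used later on; the naive regex sentence splitter, the helper name `summarize_by_frequency` and the sample text are assumptions made for this example.

```python
import re
from collections import Counter


def summarize_by_frequency(text, top_n=2):
    """Return the top_n sentences ranked by average word frequency."""
    # Naive sentence split on ., ! or ? followed by whitespace (assumption)
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Count how often every lowercased word occurs in the whole text
    counts = Counter(re.findall(r'\w+', text.lower()))

    scored = []
    for index, sentence in enumerate(sentences):
        words = re.findall(r'\w+', sentence.lower())
        # Average frequency, so long sentences are not favoured automatically
        score = sum(counts[w] for w in words) / len(words) if words else 0.0
        scored.append((sentence, score, index))

    # Keep the best sentences, then restore document order
    best = sorted(scored, key=lambda item: -item[1])[:top_n]
    return [sentence for sentence, _, _ in sorted(best, key=lambda item: item[2])]


sample = "Cats sleep. Cats and dogs play. The weather was nice."
print(summarize_by_frequency(sample, top_n=1))
```

The word "cats" occurs twice, so the sentences that contain it outrank the one about the weather.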

### Abstractive summarization

Main idea:

- Feed our model with a lot of documents and human-made summaries.
- The deep learning model learns how to make a summary itself from these examples
- Afterwards we evaluate the generated summaries against the human-made ones with a metric such as BLEU, where a higher score is better
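BLEU compares a generated summary with a human-written reference through clipped n-gram precision, so a higher value means more overlap. The sketch below shows only the unigram case and is not the full metric, which also combines higher-order n-grams and a brevity penalty; the helper name `clipped_unigram_precision` and the example strings are assumptions.

```python
from collections import Counter


def clipped_unigram_precision(candidate, reference):
    """Share of candidate words that also occur in the reference (clipped)."""
    candidate_counts = Counter(candidate.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Each candidate word is credited at most as often as it occurs in the reference
    overlap = sum(min(count, reference_counts[word])
                  for word, count in candidate_counts.items())
    total = sum(candidate_counts.values())
    return overlap / total if total else 0.0


print(clipped_unigram_precision("the the the cat", "the cat sat"))
```

Clipping is what stops a candidate from scoring well by repeating one matching word over and over.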


## Configuration

### Imports

In [None]:
import spacy
import textacy
import wikipedia
import re

from pathlib import Path
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim import models
from gensim import corpora

In [None]:
# This contains the processing pipeline
# As well as language-specific rules for tokenization etc.
nlp = spacy.load('en_core_web_lg')

In [None]:
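# Import the document
thomas_splint = Path('Text Files/thomas_splint.txt').read_text()
# Remove double and single quotes
thomas_splint = re.sub(r'"', '', thomas_splint)
thomas_splint = re.sub(r"'", '', thomas_splint)
# Replace newlines with spaces, then trim and collapse repeated whitespace
# (a regex needs re.sub; str.replace only matches literal text)
thomas_splint = thomas_splint.replace('\n', ' ')
thomas_splint = re.sub(r'^\s+|\s+$|\s+(?=\s)', '', thomas_splint)
thomas_splint_doc = nlp(thomas_splint)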

## AI Text Summarizer

### Tutorial

This is my first walk-through of the 'build AI Summarizer in 30 lines or less' tutorial.

Here I document my steps, which will serve as the basis for future improved versions.

#### Word dictionary

I first start off by creating a word dictionary, in which I count up the number of occurrences for each word in the document.

In [None]:
# create dictionary
word_dict = {}
# loop through every token and count its occurrences
for word in thomas_splint_doc:
    word = word.text.lower()
    if word in word_dict:
        word_dict[word] += 1
    else:
        word_dict[word] = 1

#### Scoring each sentence for the text summarizer

Afterwards I loop through the entire document, sentence by sentence and assign a score to the sentences based on the previously created word dictionary.

In [None]:
# create a list of tuples (sentence text, score, index)
sents = []
# score sentences
for index, sent in enumerate(thomas_splint_doc.sents):
    # reset the score for every sentence so totals do not carry over
    sent_score = 0
    for word in sent:
        word = word.text.lower()
        sent_score += word_dict[word]
    # average the score over the sentence length
    sents.append((sent.text.replace("\n", " "), sent_score/len(sent), index))

#### Sorting the sentences for the text summarizer

Then, for the text summary itself, I sort the sentences by score (highest to lowest), keep only the top 5 and put those back into their original order.

In [None]:
# sort sentence by word occurrences
sents = sorted(sents, key=lambda x: -x[1])
# keep the top 5 and restore the original sentence order
sents = sorted(sents[:5], key=lambda x: x[2])

#### Returning the summary

In [None]:
# compile them into text
summary_text = ''
for sent in sents:
    summary_text += sent[0] + '\n'

print(summary_text)

He employed the splint successfully for many years in his orthopaedic practice.
Its use on the Western Front brought recognition to the appliance and its value.  
It was not, however, routinely supplied to army medical teams until 1917.
When the Thomas Splint was consistently used, an enormous reduction in the mortality rate of soldiers with severe thigh injuries was observed.
This Anzac Day learn more about the medical advancements of World War One by visiting the Anzac Square Memorial Galleries.



#### What is the idea behind this technique?

The AI text summarization model shown in the tutorial uses extractive summarization.

But what does that mean?

It means that to build a summary we first count how often each word occurs, score every sentence by the frequencies of the words it contains, and then print the highest-scoring sentences.
In this particular example, we print only the top 5 sentences.

#### How would I improve it with spaCy?

In theory, I could add some text pre-processing, such as converting all words to lowercase and removing punctuation, special characters, extra whitespace and stop words.

Whether that helps the model remains to be seen.

## My 'own' text summarizers

Since this was my first time doing this and I had no clue what works best, I ended up trying out multiple versions (even beyond what is documented here).

I took inspiration from the previous assignment and removed the unnecessary tokens.

But this time, instead of doing it on the whole document, I only do it while building the word dictionary, so that the original document stays intact.

This ensures that only the words that matter are taken into account for the frequency count.

### Version 1

In this version, I walk through each process step by step with some explanations.

In later versions, I will only write down the changes.

#### Pre-Processing

##### Get text from a Wikipedia article

I decided to go for an approach in which the user can easily pick whatever topic they want from Wikipedia, and get a summary on it.

If this ends up working, I really wish I had this in high school...

In [None]:
wiki_page = wikipedia.page('The Barricades')
barricades_text = wiki_page.content

print(barricades_text)

The Barricades (Latvian: Barikādes) were a series of confrontations between the Republic of Latvia and the Union of Soviet Socialist Republics in January 1991 which took place mainly in Riga. The events are named for the popular effort of building and protecting barricades from 13 January until about 27 January. Latvia, which had declared restoration of independence from the Soviet Union a year earlier, anticipated that the Soviet Union might attempt to regain control over the country by force.
After attacks by the Soviet OMON on Riga in early January, the government called on people to build barricades for protection of possible targets (mainly in the capital city of Riga and nearby Ulbroka, as well as Kuldīga and Liepāja). Six people were killed in further attacks, several were wounded in shootings or beaten by OMON. Most victims were shot during the Soviet attack on the Latvian Ministry of the Interior on January 20, while another person died in a building accident reinforcing the b

##### Clean up the Wikipedia text

What we get from Wikipedia is plain text in which section headings are marked as '== {heading} ==' or '=== {heading} ==='.

Since we're not interested in anything after the '== See also ==' heading, I remove everything that follows it.

Afterwards I remove the remaining headings themselves, as well as quotes.

I don't remove punctuation, stop words and similar tokens from the text itself, because the summarizer is based on the word dictionary, so I only filter them out there.

In [None]:
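remove_after = '== See also =='
split_barricades_text = barricades_text.split(remove_after, 1)[0]

# Remove section headings such as '== {heading} ==' (one heading per line)
end_barricades_text = re.sub(r'==.*==', '', split_barricades_text)
# Remove double and single quotes
end_barricades_text = re.sub(r'"', '', end_barricades_text)
end_barricades_text = re.sub(r"'", '', end_barricades_text)
# Replace newlines with spaces so that paragraphs are not glued together,
# then trim and collapse repeated whitespace (this needs re.sub, not str.replace)
end_barricades_text = end_barricades_text.replace('\n', ' ')
end_barricades_text = re.sub(r'^\s+|\s+$|\s+(?=\s)', '', end_barricades_text)

end_barricades_text = nlp(end_barricades_text)

print(end_barricades_text)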



##### Word Dictionary

For my first version of the text summarizer, the word dictionary is built the same way as in the tutorial, with one difference.

While adding tokens to the word dictionary, I skip stop words, punctuation, numbers, whitespace, brackets, quotes, currency symbols, etc.

In [None]:
# create dictionary
word_dict = {}

# loop through every token and count its occurrences
for token in end_barricades_text:
    # skip tokens that should not influence the frequency count
    if (token.is_stop or token.is_punct or token.like_num or token.is_space
            or token.is_bracket or token.is_left_punct or token.is_right_punct
            or token.is_quote or token.is_currency):
        continue
    word = token.text.lower()
    word_dict[word] = word_dict.get(word, 0) + 1

In [None]:
print(word_dict)



Out of curiosity, let's check which is the most commonly used word in this document.

In [None]:
max_occurrence = max(word_dict.values())

print('Most frequently used word/word\'s with occurrence of {}:\n'.format(max_occurrence))

# print every word whose count equals the maximum
for word, occurrence in word_dict.items():
    if occurrence == max_occurrence:
        print(word)

Most frequently used word/word's with occurrence of 66:

soviet


Apparently it's the word 'soviet'.

Probably because it is used both on its own and as part of bi-grams and tri-grams.

##### Normalization of the word dictionary

I thought it would also make sense to normalize these values, so that the range would be between 0 and 1.

In [None]:
for word in word_dict.keys():
    print('{} divided by {} = {}'.format(word_dict[word], max_occurrence, (word_dict[word] / max_occurrence)))
    word_dict[word] = word_dict[word] / max_occurrence

41 divided by 66 = 0.6212121212121212
24 divided by 66 = 0.36363636363636365
1 divided by 66 = 0.015151515151515152
2 divided by 66 = 0.030303030303030304
1 divided by 66 = 0.015151515151515152
4 divided by 66 = 0.06060606060606061
32 divided by 66 = 0.48484848484848486
17 divided by 66 = 0.25757575757575757
66 divided by 66 = 1.0
1 divided by 66 = 0.015151515151515152
4 divided by 66 = 0.06060606060606061
31 divided by 66 = 0.4696969696969697
5 divided by 66 = 0.07575757575757576
2 divided by 66 = 0.030303030303030304
3 divided by 66 = 0.045454545454545456
20 divided by 66 = 0.30303030303030304
6 divided by 66 = 0.09090909090909091
1 divided by 66 = 0.015151515151515152
19 divided by 66 = 0.2878787878787879
3 divided by 66 = 0.045454545454545456
7 divided by 66 = 0.10606060606060606
1 divided by 66 = 0.015151515151515152
4 divided by 66 = 0.06060606060606061
2 divided by 66 = 0.030303030303030304
13 divided by 66 = 0.19696969696969696
2 divided by 66 = 0.030303030303030304
1 divided b

In [None]:
print(word_dict)



Looking at this data, I'm wondering whether outlier removal would be helpful here, since the word 'soviet' has A LOT more occurrences than anything else.

It pushes the normalized value of many words close to 0.
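One way to soften the effect of a single dominant word is to scale the counts logarithmically before normalizing. This is only an untested assumption for this article, not something the tutorial does; the helper `normalize_log_scaled` is hypothetical, and apart from the count of 66 for 'soviet' the numbers below are made up.

```python
import math


def normalize_log_scaled(word_counts):
    """Normalize counts to (0, 1] after log-scaling to dampen dominant words."""
    # log(1 + count) grows slowly, so a count of 66 no longer dwarfs a count of 1
    scaled = {word: math.log(1 + count) for word, count in word_counts.items()}
    max_value = max(scaled.values())
    return {word: value / max_value for word, value in scaled.items()}


counts = {'soviet': 66, 'riga': 20, 'bridge': 1}  # 'riga' and 'bridge' are made-up counts
linear = {word: count / max(counts.values()) for word, count in counts.items()}
damped = normalize_log_scaled(counts)
print(linear['bridge'], damped['bridge'])
```

With linear normalization a word that occurs once ends up at 1/66, while the log-scaled version keeps it noticeably above zero.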

##### Scoring each sentence

In [None]:
# create a list of tuples (sentence text, score, index)
sents = []
# score sentences
for index, sent in enumerate(end_barricades_text.sents):
    # reset the score for every sentence so totals do not carry over
    sent_score = 0
    for word in sent:
        word = word.text.lower()
        # words filtered out of the dictionary (stop words etc.) are skipped
        if word in word_dict:
            sent_score += word_dict[word]
    sents.append((sent.text.replace('\n', ''), sent_score/len(sent), index))

##### Print summary

In [None]:
# sort sentence by word occurrences
sents = sorted(sents, key=lambda x: -x[1])
# keep the top 20
sents = sents[:20]

# compile them into text
summary_text = ''
for sent in sents:
    summary_text += sent[0] + '\n'

print(summary_text)

It also gathers information on participants.
The fund is for the families of victims.
Another person was killed on the barricades.
Two other people were also injured.
The Barricades are also commemorated by numerous monuments in Latvia.
Another bombing took place at 8:45 pm.
They included colleagues and students.
Artists were invited to entertain people.
17 cars were burned during the day.
The OMON attacked Brasa and Vecmilgrāvis bridges.
An often noted example is Lithuania.
Rumors were spread that attacks were planned.
The individual barricades were organised by regions.
It was noted that the attackers also suffered casualties.
This announcement was broadcast in the Soviet media.
Others gathered after the midday demonstration.
The Popular Front withdrew its call to protect the barricades.
This was seen by some as disaffection with the whole idea.
Foreign calls to Lithuania were transferred through Riga.
In 1995, a support fund for Participants of the Barricades of 1991 was created.



In [None]:
print(sents)

[('It also gathers information on participants.', 35.21645021644947, 166), ('The fund is for the families of victims.', 27.36700336700279, 165), ('Another person was killed on the barricades.', 24.60416666666627, 141), ('Two other people were also injured.', 24.545454545454255, 118), ('The Barricades are also commemorated by numerous monuments in Latvia.', 23.03719008264414, 171), ('Another bombing took place at 8:45 pm.', 21.59090909090883, 120), ('They included colleagues and students.', 21.035353535353646, 65), ('Artists were invited to entertain people.', 20.794372294372202, 92), ('17 cars were burned during the day.', 20.329545454545237, 112), ('The OMON attacked Brasa and Vecmilgrāvis bridges.', 20.312499999999787, 111), ('An often noted example is Lithuania.', 20.056277056276997, 86), ('Rumors were spread that attacks were planned.', 19.84469696969677, 109), ('The individual barricades were organised by regions.', 19.38825757575739, 106), ('It was noted that the attackers also s

##### Ending thoughts

I don't think an extractive summarizer is a good fit for summarizing Wikipedia articles.

Although it highlights some things based on the sentence weight, the context is missing, and many short sentences receive a high weight, so the summary is often all over the place.


### Version 2

In this version I manually implement Bi-Grams and Tri-Grams into the word dictionary and use them for the sentence scoring.
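To make the n-gram idea concrete without spaCy or textacy, here is a minimal pure-Python sketch that counts bi-grams and keeps only those seen at least twice. The helper name `count_ngrams`, the whitespace tokenization and the sample sentence are assumptions, and the sketch does no filtering of stop words or punctuation.

```python
from collections import Counter


def count_ngrams(tokens, n, min_freq=2):
    """Count n-grams of length n and keep those occurring at least min_freq times."""
    # Slide a window of size n over the token list
    ngrams = (' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(ngrams)
    return {ngram: count for ngram, count in counts.items() if count >= min_freq}


tokens = "the soviet union and the soviet army".split()
print(count_ngrams(tokens, 2))
```

The resulting keys are lowercase strings joined by single spaces, which is the same shape as the keys looked up during sentence scoring.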

Also, I converted everything to functions and will do so moving forward, so that they could be tested out separately on demand.

The way it's intended to work is for the user to pass in the **Wikipedia article** and **percentage of total text**.
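The percentage-based selection can be sketched on its own, using the (sentence text, score, index) tuple layout from this notebook. The helper name `select_top_percent`, the floor of one sentence and the step that restores document order (as the tutorial does) are assumptions of this sketch.

```python
def select_top_percent(scored_sentences, percentage):
    """Keep the best-scoring share of sentences and return them in document order."""
    # At least one sentence, so very small percentages still yield a summary
    keep = max(1, int(len(scored_sentences) * percentage / 100))
    best = sorted(scored_sentences, key=lambda item: -item[1])[:keep]
    return [text for text, _, _ in sorted(best, key=lambda item: item[2])]


scored = [('First.', 0.2, 0), ('Second.', 0.9, 1), ('Third.', 0.5, 2),
          ('Fourth.', 0.7, 3), ('Fifth.', 0.1, 4)]
print(select_top_percent(scored, 40))
```

Returning the sentences in document order tends to read more naturally than returning them ordered by score.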

#### Functions

##### Main function

In [None]:
def version_2_summarizer_function(article_name, percentage_of_total_text):
    processed_text = version_2_get_text_from_wiki_article(article_name)
    text_doc = nlp(processed_text)

    word_dictionary = version_2_create_word_dictionary(text_doc)
    word_dictionary = version_2_normalize_word_dictionary(word_dictionary)

    sentences = version_2_score_each_sentence(text_doc, word_dictionary)

    number_of_lines = int(len(sentences) * (percentage_of_total_text / 100))

    # sort sentence by word occurrences
    sorted_sentences = sorted(sentences, key=lambda x: -x[1])
    # keep the requested percentage of sentences
    sorted_sentences = sorted_sentences[:number_of_lines]

    text_summary = ""
    for sentence in sorted_sentences:
        text_summary += sentence[0] + '\n'

    return text_summary

##### Word dictionary function

In [None]:
def version_2_create_word_dictionary(text_doc):
    # create dictionary
    word_dictionary = {}

    # loop through every token and count its occurrences
    for token in text_doc:
        # skip tokens that should not influence the frequency count
        if (token.is_stop or token.is_punct or token.like_num or token.is_space
                or token.is_bracket or token.is_left_punct or token.is_right_punct
                or token.is_quote or token.is_currency):
            continue
        word = token.text.lower()
        word_dictionary[word] = word_dictionary.get(word, 0) + 1

    # the same for bi-gram
    bi_grams = list(textacy.extract.basics.ngrams(text_doc, 2, min_freq=2))

    for token in bi_grams:
        bi_gram = token.text.lower()
        if bi_gram in word_dictionary:

            # Thought about adding += 2, = 4
            # Still testing
            word_dictionary[bi_gram] += 1
        else:
            word_dictionary[bi_gram] = 1

    # the same for tri-gram
    tri_grams = list(textacy.extract.basics.ngrams(text_doc, 3, min_freq=2))

    for token in tri_grams:
        tri_gram = token.text.lower()
        if tri_gram in word_dictionary:

            # Thought about adding += 3, = 6
            # Still testing
            word_dictionary[tri_gram] += 1
        else:
            word_dictionary[tri_gram] = 1

    return word_dictionary

##### Dictionary normalization function

In [None]:
def version_2_normalize_word_dictionary(word_dictionary):
    # use the dictionary that was passed in, not the global word_dict from Version 1
    max_occurrence = max(word_dictionary.values())

    for word in word_dictionary.keys():
        word_dictionary[word] = word_dictionary[word] / max_occurrence

    # print(word_dictionary)
    return word_dictionary

##### Sentence scoring function

In [None]:
def version_2_score_each_sentence(text_doc, word_dictionary):
    # create a list of tuples (sentence text, score, index)
    sentences = []

    for index, sentence in enumerate(text_doc.sents):
        # reset the score for every sentence so totals do not carry over
        sentence_score = 0
        words = [token.text.lower() for token in sentence]

        for j, word in enumerate(words):

            # Tri-Grams
            # -1 because of end of sentence punctuation
            if (j + 2) < (len(words) - 1):
                tri_gram_word = ' '.join(words[j:j + 3])
                if tri_gram_word in word_dictionary:
                    sentence_score += word_dictionary[tri_gram_word]
                    print('Adding Tri Gram Score for sentence #{}: {}'.format(index, word_dictionary[tri_gram_word]))

                    # Skip to next iteration if matched, to avoid duplicates
                    # (World War 1, World War, World) for example would add three separate dictionary words to the sentence score.
                    continue

            # Bi-Grams
            # -1 because of end of sentence punctuation
            if (j + 1) < (len(words) - 1):
                bi_gram_word = ' '.join(words[j:j + 2])
                if bi_gram_word in word_dictionary:
                    sentence_score += word_dictionary[bi_gram_word]
                    print('Adding Bi Gram Score for sentence #{}: {}'.format(index, word_dictionary[bi_gram_word]))

                    # Skip to next iteration if matched, to avoid duplicates
                    continue

            # Single words
            # Words filtered out of the dictionary (stop words etc.) are skipped
            if word in word_dictionary:
                sentence_score += word_dictionary[word]

        sentences.append((sentence.text.replace("\n", " "), sentence_score / len(sentence), index))

    return sentences

##### Article retrieval and processing function

In [None]:
def version_2_get_text_from_wiki_article(topic):
    wikipedia_page = wikipedia.page(topic)
    wikipedia_text = wikipedia_page.content

    # Drop everything from the 'See also' section onwards
    separator = '== See also =='
    split_wikipedia_text = wikipedia_text.split(separator, 1)[0]

    # Remove section headings and quotes
    end_wikipedia_text = re.sub(r'==.*==', '', split_wikipedia_text)
    end_wikipedia_text = re.sub(r'"', '', end_wikipedia_text)
    end_wikipedia_text = re.sub(r"'", '', end_wikipedia_text)
    # Replace newlines with spaces, then trim and collapse repeated whitespace
    # (a regex needs re.sub; str.replace only matches literal text)
    end_wikipedia_text = end_wikipedia_text.replace('\n', ' ')
    end_wikipedia_text = re.sub(r'^\s+|\s+$|\s+(?=\s)', '', end_wikipedia_text)

    return end_wikipedia_text

##### Print summarization with the summarizer_function

In [None]:
version_2_result = version_2_summarizer_function('The Barricades', 20)

Adding Tri Gram Score for sentence #0: 3.0
Adding Tri Gram Score for sentence #2: 2.0
Adding Tri Gram Score for sentence #3: 3.0
Adding Tri Gram Score for sentence #3: 2.0
Adding Tri Gram Score for sentence #7: 2.0
Adding Tri Gram Score for sentence #7: 5.0
Adding Bi Gram Score for sentence #9: 2.0
Adding Bi Gram Score for sentence #10: 16.0
Adding Bi Gram Score for sentence #12: 16.0
Adding Tri Gram Score for sentence #16: 3.0
Adding Tri Gram Score for sentence #16: 2.0
Adding Bi Gram Score for sentence #16: 16.0
Adding Tri Gram Score for sentence #18: 5.0
Adding Tri Gram Score for sentence #18: 3.0
Adding Tri Gram Score for sentence #18: 3.0
Adding Tri Gram Score for sentence #19: 2.0
Adding Tri Gram Score for sentence #19: 2.0
Adding Tri Gram Score for sentence #20: 2.0
Adding Bi Gram Score for sentence #20: 2.0
Adding Tri Gram Score for sentence #21: 2.0
Adding Tri Gram Score for sentence #31: 2.0
Adding Tri Gram Score for sentence #33: 2.0
Adding Bi Gram Score for sentence #33: 4.

In [None]:
print(version_2_result)

It also gathers information on participants.
The fund is for the families of victims.
Another person was killed on the barricades.
Two other people were also injured.
The Barricades are also commemorated by numerous monuments in Latvia.
Another bombing took place at 8:45 pm.
Artists were invited to entertain people.
17 cars were burned during the day.
The OMON attacked Brasa and Vecmilgrāvis bridges.
Rumors were spread that attacks were planned.
It was noted that the attackers also suffered casualties.
The individual barricades were organised by regions.
They included colleagues and students.
This announcement was broadcast in the Soviet media.
An often noted example is Lithuania.
The Popular Front withdrew its call to protect the barricades.
This was seen by some as disaffection with the whole idea.
Others gathered after the midday demonstration.
In 1995, a support fund for Participants of the Barricades of 1991 was created.
A delegation of the Supreme Soviet of the USSR visited Riga.

### Version 3

**ABANDONED - DIDN'T WORK AS INTENDED**


In this version I decided to try and make a copy of the text, but **Lemmatized**, and use that for the dictionary.

#### Functions

##### Main function

In [None]:
def version_3_summarizer_function(text, percentage_of_total_text):
    text_doc = nlp(text)

    lemmatized_text_doc =  version_3_lemmatize_doc(text_doc)
    lemmatized_text_doc = nlp(lemmatized_text_doc)

    word_dictionary = version_3_create_word_dictionary(lemmatized_text_doc)
    word_dictionary = version_3_normalize_word_dictionary(word_dictionary)

    sentences = version_3_score_each_sentence(text_doc, lemmatized_text_doc, word_dictionary)

    number_of_lines = int(len(sentences) * (percentage_of_total_text / 100))

    # sort sentence by word occurrences
    sorted_sentences = sorted(sentences, key=lambda x: -x[1])
    # keep the requested percentage of sentences
    sorted_sentences = sorted_sentences[:number_of_lines]

    text_summary = ""
    for sentence in sorted_sentences:
        text_summary += sentence[0] + "\n"

    return text_summary

##### Word dictionary function

In [None]:
def version_3_create_word_dictionary(text_doc):
    # create dictionary
    word_dictionary = {}

    # loop through every token and count its occurrences
    for token in text_doc:
        # skip tokens that should not influence the frequency count
        if (token.is_stop or token.is_punct or token.like_num or token.is_space
                or token.is_bracket or token.is_left_punct or token.is_right_punct
                or token.is_quote or token.is_currency):
            continue
        word = token.text.lower()
        word_dictionary[word] = word_dictionary.get(word, 0) + 1

    # the same for bi-gram
    bi_grams = list(textacy.extract.basics.ngrams(text_doc, 2, min_freq=2))

    for token in bi_grams:
        bi_gram = token.text.lower()
        if bi_gram in word_dictionary:

            # Thought about adding += 2, = 4
            # Still testing
            word_dictionary[bi_gram] += 1
        else:
            word_dictionary[bi_gram] = 1

    # the same for tri-gram
    tri_grams = list(textacy.extract.basics.ngrams(text_doc, 3, min_freq=2))

    for token in tri_grams:
        tri_gram = token.text.lower()
        if tri_gram in word_dictionary:

            # Thought about adding += 3, = 6
            # Still testing
            word_dictionary[tri_gram] += 1
        else:
            word_dictionary[tri_gram] = 1

    return word_dictionary

##### Dictionary normalization function

In [None]:
def version_3_normalize_word_dictionary(word_dictionary):
    # use the dictionary that was passed in, not the global word_dict from Version 1
    max_occurrence = max(word_dictionary.values())

    for word in word_dictionary.keys():
        word_dictionary[word] = word_dictionary[word] / max_occurrence

    # print(word_dictionary)
    return word_dictionary

##### Sentence scoring function

In [None]:
def version_3_score_each_sentence(original_text_doc, lemmatized_text_doc, word_dictionary):
    # create a list of tuples (sentence text, score, index)
    sentences = []

    # Doc.sents is a generator and cannot be indexed, so materialize it first
    original_sentences = list(original_text_doc.sents)

    for index, sentence in enumerate(lemmatized_text_doc.sents):
        # The lemmatized copy is parsed separately, so its sentence boundaries
        # are not guaranteed to line up with the original document
        if index >= len(original_sentences):
            break

        # reset the score for every sentence so totals do not carry over
        sentence_score = 0
        words = [token.text.lower() for token in sentence]

        for j, word in enumerate(words):

            # Tri-Grams
            # -1 because of end of sentence punctuation
            if (j + 2) < (len(words) - 1):
                tri_gram_word = ' '.join(words[j:j + 3])
                if tri_gram_word in word_dictionary:
                    sentence_score += word_dictionary[tri_gram_word]
                    print('Adding Tri Gram Score for sentence #{}: {}'.format(index, word_dictionary[tri_gram_word]))

                    # Skip to next iteration if matched, to avoid duplicates
                    # (World War 1, World War, World) for example would add three separate dictionary words to the sentence score.
                    continue

            # Bi-Grams
            # -1 because of end of sentence punctuation
            if (j + 1) < (len(words) - 1):
                bi_gram_word = ' '.join(words[j:j + 2])
                if bi_gram_word in word_dictionary:
                    sentence_score += word_dictionary[bi_gram_word]
                    print('Adding Bi Gram Score for sentence #{}: {}'.format(index, word_dictionary[bi_gram_word]))

                    # Skip to next iteration if matched, to avoid duplicates
                    continue

            # Single words
            # Words filtered out of the dictionary (stop words etc.) are skipped
            if word in word_dictionary:
                sentence_score += word_dictionary[word]

        sentences.append((original_sentences[index].text.replace("\n", " "), sentence_score / len(sentence), index))

    return sentences

##### Article retrieval and processing function

In [None]:
def version_3_get_text_from_wiki_article(topic):
    wikipedia_page = wikipedia.page(topic)
    wikipedia_text = wikipedia_page.content

    # Drop everything from the 'See also' section onwards
    separator = '== See also =='
    split_wikipedia_text = wikipedia_text.split(separator, 1)[0]

    # Remove section headings and quotes
    end_wikipedia_text = re.sub(r'==.*==', '', split_wikipedia_text)
    end_wikipedia_text = re.sub(r'"', '', end_wikipedia_text)
    end_wikipedia_text = re.sub(r"'", '', end_wikipedia_text)
    # Replace newlines with spaces, then trim and collapse repeated whitespace
    # (a regex needs re.sub; str.replace only matches literal text)
    end_wikipedia_text = end_wikipedia_text.replace('\n', ' ')
    end_wikipedia_text = re.sub(r'^\s+|\s+$|\s+(?=\s)', '', end_wikipedia_text)

    return end_wikipedia_text

##### Document lemmatize function

In [None]:
def version_3_lemmatize_doc(document):
    # join the lowercased lemma of every token into one string
    return ' '.join(token.lemma_.lower() for token in document)

##### Print summarization with the summarizer_function

**ABANDONED - DIDN'T WORK AS INTENDED**

In [None]:
# version_3_result = version_3_summarizer_function(barricades_text, 20)

In [None]:
# print(version_3_result)

### Gensim text summarizer

In [None]:
# gensim.summarization was removed in Gensim 4.0, so this cell requires an older release
from gensim.summarization import summarize

gensim_summarization = summarize(barricades_text)

print(gensim_summarization)

The Barricades (Latvian: Barikādes) were a series of confrontations between the Republic of Latvia and the Union of Soviet Socialist Republics in January 1991 which took place mainly in Riga.
After attacks by the Soviet OMON on Riga in early January, the government called on people to build barricades for protection of possible targets (mainly in the capital city of Riga and nearby Ulbroka, as well as Kuldīga and Liepāja).
Most victims were shot during the Soviet attack on the Latvian Ministry of the Interior on January 20, while another person died in a building accident reinforcing the barricades.
Consequently, the tension in relations between Latvia and the Soviet Union and between the independence movement and pro-Soviet forces, such as the International Front of the Working People of Latvia (Interfront) and the Communist Party of Latvia, along with its All-Latvian Public Rescue Committee, grew.
A series of bombings occurred in December 1990, Marshal of the Soviet Union Dmitry Yazo

### Answer to the TF-IDF Question

Question:
NLP: TF-IDF used?
Could you use the TF-IDF technique? And how?

Answer:
From my understanding, to use TF-IDF you need multiple documents (a document corpus),
since the IDF part weighs each term by the number of documents it appears in relative to the total number of documents.

If we're talking about the gensim TfidfModel class, then it doesn't make sense to use it in my scenario, since I'm only summarizing one document.

But if we're talking about the TF part of TF-IDF, I already implemented something very close to it with the word weighting function, which divides each count by the highest count instead of by the total number of terms.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

Below I also added a snippet showing how TF-IDF would look with one document.
I didn't want to go too deep into it, since it would be a lot of work to reproduce what I've already done.

Sources:
- [https://radimrehurek.com/gensim/models/tfidfmodel.html](https://radimrehurek.com/gensim/models/tfidfmodel.html)
- [https://www.capitalone.com/tech/machine-learning/understanding-tf-idf/](https://www.capitalone.com/tech/machine-learning/understanding-tf-idf/)

Abbreviations:
- TF: Term Frequency
- IDF: Inverse Document Frequency
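To illustrate why a single document is not enough, here is a minimal pure-Python sketch of the textbook formulation, with TF as count divided by document length and IDF as the natural log of N divided by the document frequency. Libraries use their own variants of the formula, and the helper name `tf_idf` and the toy tokens are assumptions.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document in the corpus."""
    num_docs = len(corpus)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            # TF = count / document length, IDF = log(N / document frequency)
            term: (count / len(doc)) * math.log(num_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


single = [['soviet', 'riga', 'soviet']]
print(tf_idf(single))

pair = [['soviet', 'riga', 'soviet'], ['riga', 'bridge']]
print(tf_idf(pair))
```

With one document every term has a document frequency of 1, so the IDF is log(1) = 0 and every weight collapses to zero. Only once a second document is added do the terms that are specific to one document receive a positive weight.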

### Gensim TF-IDF snippet

In [None]:
temp_text = nlp(barricades_text)

temp_processed_document_corpus = []
temp_processed_document = []

for token in temp_text:
    # skip tokens that should not end up in the corpus
    if (token.is_stop or token.is_punct or token.like_num or token.is_space
            or token.is_bracket or token.is_left_punct or token.is_right_punct
            or token.is_quote or token.is_currency):
        continue
    temp_processed_document.append(token.text.lower())

temp_processed_document_corpus.append(temp_processed_document)

print(temp_processed_document_corpus)



In [None]:
from gensim import corpora

dictionary = corpora.Dictionary(temp_processed_document_corpus)

bow_corpus = [dictionary.doc2bow(text) for text in temp_processed_document_corpus]

tfidf = models.TfidfModel(bow_corpus)

print(tfidf)

TfidfModel(num_docs=1, num_nnz=902)


In [None]:
print(dictionary.token2id)



In [None]:
print(dictionary.doc2bow(barricades_text.lower().split()))

[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (6, 1), (9, 1), (10, 1), (11, 3), (12, 1), (13, 2), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 2), (24, 1), (25, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 3), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 2), (43, 1), (44, 1), (45, 2), (47, 1), (48, 3), (49, 1), (50, 4), (51, 1), (52, 2), (53, 1), (54, 1), (55, 1), (56, 1), (57, 12), (58, 7), (59, 1), (60, 5), (61, 2), (62, 2), (63, 1), (64, 1), (66, 2), (67, 3), (68, 1), (69, 1), (70, 1), (71, 1), (72, 2), (74, 1), (75, 9), (76, 1), (78, 1), (79, 29), (80, 1), (81, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 2), (88, 1), (89, 1), (90, 3), (91, 1), (93, 1), (94, 2), (95, 1), (96, 1), (97, 2), (99, 1), (100, 2), (101, 1), (105, 3), (106, 3), (107, 1), (108, 1), (109, 3), (110, 5), (111, 1), (112, 1), (113, 1), (114, 1), (115, 1), (116, 9), (117, 4), (118, 1), (120, 2), (121, 1), (122, 1), (123, 2), (12