In [2]:
with open('Text.txt', 'r') as f:
    text = f.read()

print(text)

Those Who Are Resilient Stay In The Game Longer
“On the mountains of truth you can never climb in vain: either you will reach a point higher up today, or you will be training your powers so that you will be able to climb higher tomorrow.” — Friedrich Nietzsche.
Challenges and setbacks are not meant to defeat you, but promote you. However, I realize after many years of defeats, it can crush your spirit and it is easier to give up than risk further setbacks and disappointments. Have you experienced this before? To be honest, I don’t have the answers. I can’t tell you what the right course of action is; only you will know. However, it’s important not to be discouraged by failure when pursuing a goal or a dream, since failure itself means different things to different people. To a person with a Fixed Mindset failure is a blow to their self-esteem, yet to a person with a Growth Mindset, it’s an opportunity to improve and find new ways to overcome their obstacles. Same failure, yet different

# Approach 1: Sentence scoring based on mean word frequency

In [3]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams

We compute the n-grams (with punctuation and stop words removed) and the frequency of each one.

In [5]:
def preprocess(text: str, lemmatizer: WordNetLemmatizer, n):

    # To lower case and tokenization
    tokens = word_tokenize(text.lower())

    # Stop word and punctuation removal (build the stop-word set once, not once per token)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token.isalpha() and token not in stop_words]

    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    if n <= 1:
        return lemmatized_tokens
    
    # NGram generation
    ngram_set = []
    
    for i in range(1, n + 1):
        processed_text = ngrams(lemmatized_tokens, i)
        ngram_set.extend([' '.join(grams) for grams in processed_text])

    return ngram_set
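
For `n > 1`, `preprocess` returns every i-gram for i from 1 to `n`, not only the n-grams. The accumulation can be sketched with the standard library alone; `all_ngrams` is a hypothetical helper written for this illustration and is not part of the notebook.

```python
def all_ngrams(tokens, n):
    """Return all i-grams for i = 1..n as space-joined strings (hypothetical helper)."""
    result = []
    for i in range(1, n + 1):
        # Zipping i shifted views of the token list yields consecutive i-tuples
        for gram in zip(*(tokens[k:] for k in range(i))):
            result.append(' '.join(gram))
    return result


sample = ['resilient', 'stay', 'game', 'longer']
bigrams = all_ngrams(sample, 2)
print(bigrams)
# ['resilient', 'stay', 'game', 'longer', 'resilient stay', 'stay game', 'game longer']
```

With `n = 1` the result is just the token list, which matches the early return in `preprocess`.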

In [7]:
lemmatizer = WordNetLemmatizer()
tokens = preprocess(text, lemmatizer, 1) # Extract the 1-grams: the individual words
print(tokens[:10])

['resilient', 'stay', 'game', 'longer', 'mountain', 'truth', 'never', 'climb', 'vain', 'either']


In [8]:
from collections import Counter
from collections import defaultdict

In [9]:
def build_dictionary(tokens):
    dictionary = defaultdict(lambda: 0)
    
    for token in tokens:
        dictionary[token] += 1
    
    return dictionary
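
`Counter` is imported above but not used. As a sketch, it should produce the same counts as `build_dictionary`; the token list here is a made-up toy example, not the notebook's data.

```python
from collections import Counter, defaultdict


def build_dictionary(tokens):
    # Same logic as the notebook's function, repeated so the block runs on its own
    dictionary = defaultdict(lambda: 0)
    for token in tokens:
        dictionary[token] += 1
    return dictionary


sample_tokens = ['game', 'stay', 'game', 'climb', 'game', 'stay']  # toy data
counts = Counter(sample_tokens)

# Both structures hold identical token -> frequency mappings
assert dict(counts) == dict(build_dictionary(sample_tokens))
print(counts.most_common(2))  # [('game', 3), ('stay', 2)]
```

One difference worth noting: looking up a missing key in a `defaultdict` inserts it, while `Counter` returns 0 without modifying itself.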

In [10]:
dictionary = build_dictionary(tokens)

print(list(dictionary.items())[:10])

[('resilient', 2), ('stay', 2), ('game', 3), ('longer', 2), ('mountain', 1), ('truth', 1), ('never', 2), ('climb', 2), ('vain', 1), ('either', 1)]


In [11]:
from nltk.tokenize import sent_tokenize

In [12]:
sentences = sent_tokenize(text)

In [17]:
def score_sentences(sentences, dictionary, lemmatizer):
    sentence_score = dict()

    for idx, sentence in enumerate(sentences):
        relevant_tokens = 0
        score = 0
        
        for token in word_tokenize(sentence.lower()):
            token = lemmatizer.lemmatize(token)

            if token in dictionary:
                relevant_tokens += 1
                score += dictionary[token]
        
        # Avoid division by zero for sentences with no dictionary tokens
        sentence_score[idx] = score / relevant_tokens if relevant_tokens else 0

    return sentence_score
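
The score of a sentence is the mean corpus frequency of its tokens that appear in the dictionary; tokens outside the dictionary (stop words, punctuation) are ignored. A minimal sketch on toy data, where `mean_frequency_score` and the frequency table are hypothetical and tokenization is assumed to be already done:

```python
def mean_frequency_score(sentence_tokens, frequencies):
    """Average corpus frequency of the tokens found in `frequencies`."""
    relevant = [frequencies[t] for t in sentence_tokens if t in frequencies]
    # Sentences with no dictionary tokens get a score of 0
    return sum(relevant) / len(relevant) if relevant else 0.0


toy_frequencies = {'game': 3, 'stay': 2, 'mountain': 1}  # toy frequency table
toy_sentence = ['stay', 'in', 'the', 'game']             # 'in' and 'the' are ignored
toy_score = mean_frequency_score(toy_sentence, toy_frequencies)
print(toy_score)  # (2 + 3) / 2 = 2.5
```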

In [18]:
scores = score_sentences(sentences, dictionary, lemmatizer)

print(list(scores.items())[:3])

[(0, 1.5714285714285714), (1, 2.0), (2, 2.1666666666666665)]


In [19]:
def average_score(scores: dict):
    average = 0

    for value in scores.values():
        average += value

    average /= len(scores)

    return average

In [20]:
avg_score = average_score(scores) # Compute the mean of the scores
print('avg_score', avg_score)

alpha = 1.3

threshold = avg_score * alpha # Score threshold
print('threshold', threshold)

avg_score 2.2208785663728694
threshold 2.8871421362847305
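
The multiplier `alpha` controls how selective the summary is: a larger value raises the threshold and keeps fewer sentences. The sketch below illustrates this on made-up scores; `select_above_threshold` is a hypothetical helper, not a function from the notebook.

```python
def select_above_threshold(scores, alpha):
    """Return the indices whose score is at least alpha times the mean score."""
    mean = sum(scores.values()) / len(scores)
    threshold = mean * alpha
    return [idx for idx, score in scores.items() if score >= threshold]


toy_scores = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}  # toy data, mean = 2.5

for alpha in (1.0, 1.3, 1.7):
    print(alpha, select_above_threshold(toy_scores, alpha))
```

With these toy scores the selection shrinks from two sentences to one to none as `alpha` grows, so the value 1.3 is best read as a tunable choice rather than a fixed constant.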


We build the summary by keeping the sentences whose score is greater than or equal to the threshold computed above.

In [23]:
def generate_summary(sentences, scores, threshold):
    count = 0
    summary = ''

    for idx, sentence in enumerate(sentences):
        if scores[idx] >= (threshold):
            summary += ' ' + sentence
            count += 1

    return summary[1:], count

In [24]:
summary, sentence_count = generate_summary(sentences, scores, threshold)

print(f'Original text is composed of {len(sentences)} sentences')
print(f'Summary is composed of {sentence_count} sentences')

Original text is composed of 54 sentences
Summary is composed of 9 sentences


In [26]:
print(summary)

I can’t tell you what the right course of action is; only you will know. However, it’s important not to be discouraged by failure when pursuing a goal or a dream, since failure itself means different things to different people. Same failure, yet different responses. If you settle for less, you will receive less than you deserve and convince yourself you are justified to receive it. Where are you settling in your life right now? Are you willing to play bigger even if it means repeated failures and setbacks? So become intentional on what you want out of life. Nurture your dreams. Don’t leave your dreams to chance.


# Approach 2: TF-IDF

We treat each sentence as a document and compute the TF-IDF of every token.
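
Written out, the weight computed in this section for a token $t$ in sentence $s$ can be expressed as follows. The symbols are notation introduced here for illustration: $f_{t,s}$ is the number of occurrences of $t$ in $s$ (counting only dictionary tokens), $N$ is the number of sentences, and $n_t$ is the number of sentences that contain $t$.

```latex
\mathrm{tfidf}(t, s) \;=\;
\underbrace{\frac{f_{t,s}}{\sum_{t'} f_{t',s}}}_{\text{term frequency}}
\;\cdot\;
\underbrace{\log_{10}\frac{N}{n_t}}_{\text{inverse document frequency}}
```

A token that occurs in every sentence has $n_t = N$, so its weight is zero.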

In [27]:
from collections import defaultdict
import numpy as np

In [28]:
def compute_tf_idf(sentences, dictionary, lemmatizer: WordNetLemmatizer):
    tf_dict = defaultdict(lambda: defaultdict(lambda: 0))
    relevant_tokens = defaultdict(lambda: 0)
    token_sentences_dict = defaultdict(set)

    for idx, sentence in enumerate(sentences):
        for token in word_tokenize(sentence.lower()):
            token = lemmatizer.lemmatize(token)

            if token in dictionary:
                relevant_tokens[idx] += 1
                tf_dict[idx][token] += 1
                token_sentences_dict[token].add(idx)
    
    tf_idf = defaultdict(lambda: defaultdict(lambda: 0))
    
    for token, token_sentences in token_sentences_dict.items():
        idf = np.log10(len(sentences) / len(token_sentences))
        
        for idx in token_sentences:
            tf_idf[idx][token] = tf_dict[idx][token] / relevant_tokens[idx] * idf

    return tf_idf
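
The same computation can be checked by hand on a small example. The sketch below assumes already-tokenized toy sentences and uses only the standard library; `toy_tf_idf` is a hypothetical simplification, not the notebook's function.

```python
import math
from collections import Counter


def toy_tf_idf(tokenized_sentences):
    """TF-IDF per sentence, treating each sentence as a document (log base 10)."""
    n_sentences = len(tokenized_sentences)
    # Document frequency: number of sentences that contain each token
    doc_freq = Counter(token for sent in tokenized_sentences for token in set(sent))
    weights = []
    for sent in tokenized_sentences:
        counts = Counter(sent)
        weights.append({
            token: count / len(sent) * math.log10(n_sentences / doc_freq[token])
            for token, count in counts.items()
        })
    return weights


toy = [['dream', 'failure'], ['dream', 'goal'], ['dream', 'dream']]  # toy data
weights = toy_tf_idf(toy)
print(weights[0])
```

Here `dream` occurs in all three sentences, so its inverse document frequency is log10(3/3) = 0 and its weight vanishes, while `failure` gets 0.5 * log10(3).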

In [29]:
tf_idf = compute_tf_idf(sentences, dictionary, lemmatizer)

print(tf_idf)

defaultdict(<function compute_tf_idf.<locals>.<lambda> at 0x7f7a82253d80>, {0: defaultdict(<function compute_tf_idf.<locals>.<lambda>.<locals>.<lambda> at 0x7f7a82253ce0>, {'resilient': 0.06816017924566606, 'stay': 0.06816017924566606, 'game': 0.05977488119539552, 'longer': 0.06816017924566606, 'mountain': 0.08249494094395088, 'truth': 0.08249494094395088, 'never': 0.06816017924566606, 'climb': 0.16498988188790176, 'vain': 0.08249494094395088, 'either': 0.08249494094395088, 'reach': 0.08249494094395088, 'point': 0.06816017924566606, 'higher': 0.16498988188790176, 'today': 0.08249494094395088, 'training': 0.08249494094395088, 'power': 0.06816017924566606, 'able': 0.08249494094395088, 'friedrich': 0.08249494094395088, 'nietzsche': 0.08249494094395088}), 12: defaultdict(<function compute_tf_idf.<locals>.<lambda>.<locals>.<lambda> at 0x7f7a82253e20>, {'resilient': 0.17892047051987342, 'stay': 0.17892047051987342, 'game': 0.15690906313791325, 'longer': 0.17892047051987342, 'mean': 0.1291779

In [31]:
def score_sentences2(sentences, tf_idf, lemmatizer: WordNetLemmatizer):
    sentence_score = dict()

    for idx, sentence in enumerate(sentences):
        relevant_tokens = 0
        score = 0
        
        for token in word_tokenize(sentence.lower()):
            token = lemmatizer.lemmatize(token)

            if token in tf_idf[idx]:
                relevant_tokens += 1
                score += tf_idf[idx][token]
        
        # Avoid division by zero for sentences with no scored tokens
        sentence_score[idx] = score / relevant_tokens if relevant_tokens else 0

    return sentence_score

In [34]:
scores2 = score_sentences2(sentences, tf_idf, lemmatizer)

print(list(scores2.items())[:3])

[(0, 0.09303070922192891), (1, 0.2962678517604615), (2, 0.1206603480602574)]


In [35]:
avg_score2 = average_score(scores2)
print('avg_score', avg_score2)

alpha = 1.3

threshold2 = avg_score2 * alpha
print('threshold', threshold2)

avg_score 0.3126222712255653
threshold 0.4064089525932349


In [36]:
summary2, sentence_count2 = generate_summary(sentences, scores2, threshold2)

print('Original text is composed of {} sentences'.format(len(sentences)))
print('Summary is composed of {} sentences'.format(sentence_count2))

Original text is composed of 54 sentences
Summary is composed of 10 sentences


In [38]:
print(summary2)

Have you experienced this before? To be honest, I don’t have the answers. Who is right and who is wrong? Neither. It was at that point their biggest breakthrough came. It must come from within you. Where are you settling in your life right now? Commit to it. Nurture your dreams. Don’t leave your dreams to chance.


# Approach 3: Embedding

Idea: we vectorize the whole text and each individual sentence, then compare them with a binary classifier.

We need to define the vectorization functions and the comparison function.

We use an approach based on neural networks.

Embedding based on self-attention

Comparison based on cosine similarity
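
The notebook stops at the outline, so the following is only a rough sketch of how the two pieces could fit together, under strong assumptions: the token vectors are made-up 2-dimensional toy values, and the self-attention has no learned projections (queries, keys and values are all the raw input), whereas a real model would learn them. The names `self_attention_embed` and `cosine_similarity` are hypothetical.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_embed(token_vectors):
    """Mean-pooled scaled dot-product self-attention with Q = K = V = input."""
    d = len(token_vectors[0])
    pooled = [0.0] * d
    for q in token_vectors:
        # Attention weights of this token over all tokens
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                           for k in token_vectors])
        for w, v in zip(weights, token_vectors):
            for j in range(d):
                pooled[j] += w * v[j]
    return [p / len(token_vectors) for p in pooled]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy token vectors (assumed, not produced by any real embedding model)
toy_sentences = {
    'on_topic': [[1.0, 0.0], [0.9, 0.1]],
    'off_topic': [[0.0, 1.0], [0.1, 0.9]],
}
document_tokens = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0], [0.1, 0.9]]

doc_vec = self_attention_embed(document_tokens)
similarities = {name: cosine_similarity(self_attention_embed(vecs), doc_vec)
                for name, vecs in toy_sentences.items()}
ranking = sorted(similarities, key=similarities.get, reverse=True)
print(ranking)
```

Sentences whose embedding points in the same direction as the document embedding rank first; a threshold on the similarity would then act as the binary keep-or-drop decision.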