# Text Summarization: Sport News and Travel Forum Discussion Data

This project summarizes text. Instead of abstractive methods, it uses **Extractive Summarization**, which gives weights to important phrases and sentences and selects the highest-weighted ones to form a summary of the entire text. The functions are adapted from Blueprints for Text Analytics by Albrecht et al. (2021), with several adjustments for clarity. For the ready-to-use functions, please refer to the file **fun_nlp_spacy_text.py**.

Getting to a summarized text takes at least these steps:

1. Create an intermediate representation of the text
2. Score the sentences/phrases based on the chosen representation
3. Rank and choose sentences to create a summary of the text
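As a rough illustration of these three steps, here is a minimal sketch that uses plain word counts as the intermediate representation. The function name `frequency_summary`, the regex-based sentence split, and the toy text are assumptions made for this example; the methods below use TF-IDF, LSA, and TextRank instead.

```python
import re
from collections import Counter

def frequency_summary(text, num_sentences=2):
    """Toy extractive summarizer: word-count representation, sum scoring, top-k ranking."""
    # naive sentence split on ., ! or ? followed by whitespace (an assumption, not a real tokenizer)
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    # step 1: intermediate representation = word counts over the whole text
    word_counts = Counter(re.findall(r'[a-z]+', text.lower()))

    # step 2: score each sentence as the sum of the counts of its words
    scores = [sum(word_counts[w] for w in re.findall(r'[a-z]+', s.lower()))
              for s in sentences]

    # step 3: rank by score, keep the top-k, and return them in their original order
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]

toy_text = "Cats sleep a lot. Cats and dogs are pets. Birds fly."
print(frequency_summary(toy_text, num_sentences=2))
# ['Cats sleep a lot.', 'Cats and dogs are pets.']
```

Summing raw counts favors long sentences, which is one reason the methods below use weighted representations.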

In [311]:
import pandas as pd
import numpy as np
import regex as re
import os
from tqdm import tqdm

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import tokenize
import spacy

from newspaper import Article
import reprlib

    Reading the Data

In [258]:
from newspaper import Article
 
url = "https://www.mirror.co.uk/sport/football/transfer-news/lionel-messi-sergio-ramos-psg-29440311"
 
# download and parse article
article = Article(url)
article.download()
article.parse()

In [259]:
import reprlib  # to limit the printed text to a maximum number of characters

r = reprlib.Repr()
r.maxstring = 800  # maximum number of characters to print

print(r.repr(article.text))

'Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid threats of UEFA sanctions which could hurt their Champions League status\n\nParis Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid concerns they will face fresh Champions League sanctions for further Financial Fair Play (FFP) breaches.\n\nPSG were one of 10 clubs fine...til definitive decisions have been reached.\n\nAlongside Mbappe’s new deal last year, PSG also signed Portugal international midfield duo Renato Sanches and Vitinha from Lille and Porto respectively, alongside midfielder Carlos Soler from Valencia. They also added striker Hugo Ekitike (Reims) and defender Nordi Mukiele (RB Leipzig) while making Nuno Mendes’s loan move from Sporting CP permanent.'


    Data Cleaning

In [260]:
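# replace newlines in the text:
# a paragraph break without a closing period becomes ". ", any other newline becomes a space

sentence = str(article.text)
clean_text = ''

for i, char in enumerate(sentence):
    # look ahead safely, so a trailing newline does not raise an IndexError
    next_char = sentence[i+1] if i + 1 < len(sentence) else ''
    if char == "\n" and next_char == "\n" and sentence[i-1] != ".":
        clean_text += ". "
    elif char == "\n" and next_char != "\n":
        clean_text += " "
    elif char != "\n":
        clean_text += char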

In [261]:
clean_text

'Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid threats of UEFA sanctions which could hurt their Champions League status.  Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid concerns they will face fresh Champions League sanctions for further Financial Fair Play (FFP) breaches. PSG were one of 10 clubs fined for breaching UEFA’s Financial Fair Play (FFP) rules for the 2020-21 season. They paid out a €10million fine while having a further €45m suspended pending future accounts, with further punishments possible. PSG have been posting record losses in recent years with their most recent set of accounts showing €370m in losses. Messi and Ramos are among their highest earning stars and with both out of contract this summer, the club must make tough financial decisions. Possible sanctions could include the French club being unable to register any new players to their European squad for next season and a po

In [262]:
# check if we still have \n

re.findall('\n', clean_text)

[]

## Method 1 - Identifying Important Words with TF-IDF Values

In [263]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import tokenize

# tokenize by sentence
# and transform it using TfidfVectorizer

sentences = tokenize.sent_tokenize(clean_text)
tfidfVectorizer = TfidfVectorizer()
words_tfidf = tfidfVectorizer.fit_transform(sentences)

In [264]:
# sort the sentence in descending order by tfidf sum value

sent_sum = words_tfidf.sum(axis=1)
important_sent = np.argsort(sent_sum, axis=0)[::-1]

In [265]:
# print the most important sentences in order they appear
# set how many sentences from num_sum_sent parameter

num_sum_sent = 3

for i in range(len(sentences)):
    if i in important_sent[:num_sum_sent]:
        print(sentences[i])

Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid concerns they will face fresh Champions League sanctions for further Financial Fair Play (FFP) breaches.
Now aged 36, Ramos impressed in PSG’s Champions League elimination this week but the club have a deal lined up to sign Milan Skriniar from Inter while they want to sign Pau Torres from Villarreal.
A report from L’Équipe has outlined how PSG want to keep Messi and Ramos at the club with informal discussions opened but the FFP pressures and threat of sanctions may mean new deals are not feasible.


In [266]:
# make it into a reproducible function

def tfidf_summary(text, num_sum_sent):
    sum_list = []
    
    sentences = tokenize.sent_tokenize(text)
    tfidfVectorizer = TfidfVectorizer()
    words_tfidf = tfidfVectorizer.fit_transform(sentences)
    
    sent_sum = words_tfidf.sum(axis=1)
    important_sent = np.argsort(sent_sum, axis=0)[::-1]
    
    for i in range(len(sentences)):
        if i in important_sent[:num_sum_sent]:
            sum_list.append(sentences[i])
            
    return sum_list

In [267]:
tfidf_sum = tfidf_summary(clean_text, 3)
tfidf_sum

['Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid concerns they will face fresh Champions League sanctions for further Financial Fair Play (FFP) breaches.',
 'Now aged 36, Ramos impressed in PSG’s Champions League elimination this week but the club have a deal lined up to sign Milan Skriniar from Inter while they want to sign Pau Torres from Villarreal.',
 'A report from L’Équipe has outlined how PSG want to keep Messi and Ramos at the club with informal discussions opened but the FFP pressures and threat of sanctions may mean new deals are not feasible.']

## Method 2 - LSA Algorithm

Instead of building from scratch, this method uses an existing library, <a href='https://miso-belica.github.io/sumy/'>Sumy, made by Michal Belica.</a> It makes summarization with the LSA method easier, requiring only 4 steps:

1. Tokenize the string using Tokenizer()
2. Create the document using PlaintextParser()
3. Use Stemmer() to reduce inflected words to a common stem
4. Use a summarizer (here LsaSummarizer()) to summarize the document, with stop words removed for a better result; the summarization method can be swapped at this step
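To give a feel for what an LSA summarizer does with the parsed document, here is a simplified sketch using NumPy. It is an illustration under stated assumptions, not Sumy's implementation: the function name `lsa_sketch`, the raw count matrix, and the scoring rule (length of each sentence vector in the top latent topics, weighted by singular values) are choices made for this example.

```python
import re
import numpy as np

def lsa_sketch(sentences, num_sentences=2, num_topics=2):
    """Pick sentences that are strongest in the top latent topics of a term-sentence matrix."""
    # term-by-sentence count matrix: one row per word, one column per sentence
    vocab = sorted({w for s in sentences for w in re.findall(r'[a-z]+', s.lower())})
    index = {w: row for row, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for col, s in enumerate(sentences):
        for w in re.findall(r'[a-z]+', s.lower()):
            A[index[w], col] += 1

    # SVD: each column of vt describes one sentence in terms of latent topics
    _, sigma, vt = np.linalg.svd(A, full_matrices=False)
    k = min(num_topics, len(sigma))

    # score = length of the sentence vector in the top-k topic space,
    # with each topic weighted by its singular value
    scores = np.sqrt(((sigma[:k, None] * vt[:k, :]) ** 2).sum(axis=0))

    # keep the highest-scoring sentences, returned in their original order
    top = np.argsort(scores)[::-1][:num_sentences]
    return [sentences[i] for i in sorted(top)]

toy_sentences = [
    "PSG paused contract talks with Messi.",
    "Messi and Ramos may leave PSG.",
    "The weather was sunny.",
    "PSG face sanctions.",
]
print(lsa_sketch(toy_sentences, num_sentences=2))
```

Sumy additionally applies stemming and stop-word removal before building the matrix, which this sketch skips.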

In [268]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

from sumy.summarizers.lsa import LsaSummarizer

In [269]:
lang = 'english'
stemmer = Stemmer(lang)

parser = PlaintextParser.from_string(clean_text, Tokenizer(lang))
summarizer = LsaSummarizer(stemmer)  # we can actually choose the summarizer method here, by changing the method
summarizer.stop_words = get_stop_words(lang)

In [270]:
num_sum_sent = 3

for sentence in summarizer(parser.document, num_sum_sent):
    print(str(sentence))

Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid threats of UEFA sanctions which could hurt their Champions League status.
Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid concerns they will face fresh Champions League sanctions for further Financial Fair Play (FFP) breaches.
They also added striker Hugo Ekitike (Reims) and defender Nordi Mukiele (RB Leipzig) while making Nuno Mendes’s loan move from Sporting CP permanent.


In [271]:
# make it into a reproducible function

def lsa_summary(text, num_sum_sent, lang):
    sum_list = []
    stemmer = Stemmer(lang)

    parser = PlaintextParser.from_string(text, Tokenizer(lang))
    summarizer = LsaSummarizer(stemmer)  # we can actually choose the summarizer method here, by changing the method
    summarizer.stop_words = get_stop_words(lang)
    
    for sentence in summarizer(parser.document, num_sum_sent):
        sum_list.append(str(sentence))
    
    return sum_list

In [272]:
lsa_sum = lsa_summary(clean_text, 3, 'english')
lsa_sum

['Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid threats of UEFA sanctions which could hurt their Champions League status.',
 'Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid concerns they will face fresh Champions League sanctions for further Financial Fair Play (FFP) breaches.',
 'They also added striker Hugo Ekitike (Reims) and defender Nordi Mukiele (RB Leipzig) while making Nuno Mendes’s loan move from Sporting CP permanent.']

## Method 3 - Indicator Representation (Page Rank)

The PageRank method derives from the algorithm Google uses to rank web pages. PageRank assumes that the rank of a webpage 'W' depends on the importance passed to it by the pages that link to it: if webpage 'X' links to webpage 'W', then 'X' contributes to the importance of 'W'. For summarization, we treat each sentence as a webpage and the similarity between two sentences as a link, which is the idea behind TextRank. For more, see this article written by Mehul Gupta: <a href='https://medium.com/data-science-in-your-pocket/text-summarization-using-textrank-in-nlp-4bce52c5b390'>Text summarization using TextRank in NLP</a>
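The idea of treating sentences as pages can be sketched in a few lines of plain Python. This is a simplified illustration, not Sumy's implementation: the function name `textrank_sketch`, the Jaccard word overlap used as edge weight, and the fixed number of iterations are all assumptions made for this example.

```python
import re

def textrank_sketch(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank over a word-overlap similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [set(re.findall(r'[a-z]+', s.lower())) for s in sentences]

    # edge weight = Jaccard overlap between two sentences (a simple stand-in similarity)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and tokens[i] | tokens[j]:
                weights[i][j] = len(tokens[i] & tokens[j]) / len(tokens[i] | tokens[j])

    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n

    # power iteration: each sentence passes its score to its neighbors,
    # split in proportion to the edge weights
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = sum(scores[i] * weights[i][j] / out_sums[i]
                           for i in range(n) if out_sums[i] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

toy_sentences = [
    "PSG paused contract talks with Messi.",
    "Messi and Ramos may leave PSG.",
    "The weather was sunny.",
]
print(textrank_sketch(toy_sentences))
```

In this toy example the first two sentences share words and reinforce each other, while the third has no links and keeps only the baseline score.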

In [273]:
# we can use sumy as well to use the page rank method

from sumy.summarizers.text_rank import TextRankSummarizer

parser = PlaintextParser.from_string(clean_text, Tokenizer(lang))
summarizer = TextRankSummarizer(stemmer)  # this is the example we can change summarizer method, from LsaSummarizer to TextRank
summarizer.stop_words = get_stop_words(lang)

In [274]:
num_sum_sent = 3

for sentence in summarizer(parser.document, num_sum_sent):
    print(str(sentence))

Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid threats of UEFA sanctions which could hurt their Champions League status.
Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid concerns they will face fresh Champions League sanctions for further Financial Fair Play (FFP) breaches.
A report from L’Équipe has outlined how PSG want to keep Messi and Ramos at the club with informal discussions opened but the FFP pressures and threat of sanctions may mean new deals are not feasible.


In [275]:
# make it into a reproducible function

def pagerank_summary(text, num_sum_sent, lang):
    sum_list = []
    stemmer = Stemmer(lang)  # build the stemmer here instead of relying on a global

    parser = PlaintextParser.from_string(text, Tokenizer(lang))  # parse the text argument, not the global clean_text
    summarizer = TextRankSummarizer(stemmer)  # same pipeline as lsa_summary, with LsaSummarizer swapped for TextRankSummarizer
    summarizer.stop_words = get_stop_words(lang)

    for sentence in summarizer(parser.document, num_sum_sent):
        sum_list.append(str(sentence))

    return sum_list

In [276]:
pagerank_sum = pagerank_summary(clean_text, num_sum_sent, lang)
pagerank_sum

['Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid threats of UEFA sanctions which could hurt their Champions League status.',
 'Paris Saint-Germain have put contract talks with Lionel Messi and Sergio Ramos on hold amid concerns they will face fresh Champions League sanctions for further Financial Fair Play (FFP) breaches.',
 'A report from L’Équipe has outlined how PSG want to keep Messi and Ramos at the club with informal discussions opened but the FFP pressures and threat of sanctions may mean new deals are not feasible.']

    Important about using PageRank!

- TextRank generally works better for longer content, as it identifies relationships between sentences through graph links
- A shorter text has fewer sentences, so the graph has fewer links to work with
- It works best for research papers, Wikipedia pages, collections of writings, etc.

## ROUGE-N Measuring the Performance of Text Summarization 

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a method for measuring the quality of a text summary by counting the terms shared between a benchmark (reference) summary and the summary generated by the algorithms above. The N in ROUGE-N is the size of the n-grams being compared: ROUGE-1 compares single words (unigrams), ROUGE-2 compares bigrams, and so on.
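To make the metric concrete, here is a minimal from-scratch sketch of ROUGE-N. The function name `rouge_n` and the whitespace tokenization are assumptions for this example; the `rouge_score` package used below also normalizes the text and can apply stemming, so its numbers may differ.

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) based on n-grams shared by the two texts."""
    def ngrams(text):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum((ref & cand).values())  # shared n-grams, counted at most as often as in either text

    precision = overlap / max(sum(cand.values()), 1)  # share of the candidate found in the reference
    recall = overlap / max(sum(ref.values()), 1)      # share of the reference covered by the candidate
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# 5 of the 6 unigrams are shared, so precision and recall are both 5/6
print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1))

# 3 of the 5 bigrams are shared, so precision and recall are both 3/5
print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2))
```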

In [277]:
# load a reference summary to use as the benchmark
# for convenience, I used a summary generated by ChatGPT

with open('activity-7_bench-sum.txt') as f:
    bench_sum = f.readlines()

print(bench_sum)

['Paris Saint-Germain has put contract negotiations with Lionel Messi and Sergio Ramos on hold due to concerns that further Financial Fair Play (FFP) breaches could lead to UEFA sanctions. The club faces possible sanctions, including the inability to register new players and a reduction in the size of their registered squad for European competitions, which could impact their Champions League status.']


In [278]:
from rouge_score import rouge_scorer

# define the scorer once, then score each generated summary against the benchmark
scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
reference = ' '.join(bench_sum)  # bench_sum is a list of lines, so join it into one string

for name, sents in zip(['tfidf_sum', 'pagerank_sum', 'lsa_sum'], [tfidf_sum, pagerank_sum, lsa_sum]):
    summary = ' '.join(sents)
    scores = scorer.score(reference, summary)  # score(target, prediction)
    print(name, scores)

tfidf_sum {'rouge1': Score(precision=0.375, recall=0.6290322580645161, fmeasure=0.4698795180722892)}
pagerank_sum {'rouge1': Score(precision=0.42105263157894735, recall=0.6451612903225806, fmeasure=0.5095541401273885)}
lsa_sum {'rouge1': Score(precision=0.4024390243902439, recall=0.532258064516129, fmeasure=0.4583333333333333)}


    Important! Variations of ROUGE-N

- ROUGE-L: Measures the longest common subsequence (LCS) between the reference and the generated summary
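As a rough sketch of how ROUGE-L can be computed, the example below finds the longest common subsequence of words with dynamic programming. The function names `lcs_length` and `rouge_l` and the whitespace tokenization are assumptions for this example, not the `rouge_score` implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l(reference, candidate):
    """Return (precision, recall, f1) based on the word-level LCS."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1

# the LCS is "the cat on the mat" (5 words), so precision and recall are both 5/6
print(rouge_l("the cat sat on the mat", "the cat lay on the mat"))
```

Unlike ROUGE-N, the matched words do not have to be adjacent, but they must appear in the same order in both texts.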

# Hands-on: Summarizing Text Using Machine Learning (ML) Method

For this part, I will analyze threads from an online discussion forum, based on the research of Tarnpradab et al. For the raw data set, please refer to <a href='https://www.dropbox.com/s/dcds423fl7fscow/threadDataSet.zip'>this link</a>. This analysis will **summarize each thread into a 2-3 sentence summary using a Machine Learning method**.

## Data Cleaning

In [292]:
# list all xml files from the dataset directory

path = "dataset/activity-7_threadDataSet/threads as original xml/"
dir_file = []

for (root, dirs, file) in os.walk(path):
    for f in file:
        if '.xml' in f:
            f_path = os.path.join(root, f)
            dir_file.append(f_path)

len(dir_file)

699

In [293]:
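df = pd.DataFrame()

for f in dir_file:

    try:
        data = pd.read_xml(f)

        # row 0 carries the ThreadID and row 1 the Title; copy them to every row
        data['ThreadID'] = data.loc[0, 'ThreadID']
        data['Title'] = data.loc[1, 'Title']
        # the text of row 2 sits in 'icontent' rather than 'rcontent'; move it over
        data.loc[2, 'rcontent'] = data.loc[2, 'icontent']

        data.rename(columns={'rcontent': 'text'}, inplace=True)
        data.drop(index=[0,1], columns='icontent', inplace=True)
        data.reset_index(inplace=True, drop=True)

        data['postNum'] = data.index + 1
        # assign the selection back, otherwise this line has no effect
        data = data[['ThreadID', 'Title', 'UserID', 'Date', 'text', 'postNum']]
        df = pd.concat([df, data])

    except Exception:  # skip files that cannot be parsed, but do not swallow KeyboardInterrupt
        print('error in data:', f)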

error in data: dataset/activity-7_threadDataSet/threads as original xml/batch_one/60763_5_1613009.xml
error in data: dataset/activity-7_threadDataSet/threads as original xml/batch_one/60763_5_693940.xml
error in data: dataset/activity-7_threadDataSet/threads as original xml/batch_one/60763_5_1439149.xml
error in data: dataset/activity-7_threadDataSet/threads as original xml/batch_one/60763_5_1802405.xml
error in data: dataset/activity-7_threadDataSet/threads as original xml/batch_one/60763_5_871889.xml
error in data: dataset/activity-7_threadDataSet/threads as original xml/batch_one/60763_5_1521254.xml
error in data: dataset/activity-7_threadDataSet/threads as original xml/batch_one/60763_5_1353950.xml
error in data: dataset/activity-7_threadDataSet/threads as original xml/batch_two/60763_5_1107220.xml
error in data: dataset/activity-7_threadDataSet/threads as original xml/batch_two/60763_5_1606932.xml
error in data: dataset/activity-7_threadDataSet/threads as original xml/batch_two/60

In [294]:
display(df.head(5))
print('length of the dataframe:', len(df))

Unnamed: 0,ThreadID,Title,UserID,Date,text,postNum
0,60763_5_666188,New York Trip Day 2,twixbar,"18 June 2006, 6:02",So even though we were exhausted and had gone ...,1
1,60763_5_666188,New York Trip Day 2,SummerShowers...,"18 June 2006, 19:05",Day Two really was a perfect day! Sorry your f...,2
2,60763_5_666188,New York Trip Day 2,Daisiegee,"18 June 2006, 21:34",You've got me hooked......,3
3,60763_5_666188,New York Trip Day 2,twixbar,"19 June 2006, 2:57",I'm glad we went to the game too. It's stil ha...,4
0,60763_5_912589,Construction Work Near Casablanca,trollking,"19 December 2006, 1:11",Any info on the construction work near the Cas...,1


length of the dataframe: 7234


In [295]:
# list all human summary (.txt) files, skipping the second annotator's folders

path = "dataset/activity-7_threadDataSet/human summaries/"
dir_file = []

for (root, dirs, file) in os.walk(path, topdown=True):
    dirs[:] = [d for d in dirs if d not in ['batch_one_Annotator_Two', 'batch_two_Annotator_Two', 'gold_Annotator_Two']]
    for f in file:
        if '.txt' in f:
            f_path = os.path.join(root, f)
            dir_file.append(f_path)

In [296]:
sum_text = {}
pattern = r"/([^/]+)\.txt$"

for f in dir_file:
    try:
        with open(f) as t:
            text = t.read().replace('\n', ' ')
            match = re.search(pattern, f)
            
            if match:
                file_name = match.group(1)
                sum_text[file_name] = text
                    
    except Exception:  # report files that cannot be read and continue
        print('error in data:', f)

error in data: dataset/activity-7_threadDataSet/human summaries/gold_Annotator_One/60763_5_3056374.txt
error in data: dataset/activity-7_threadDataSet/human summaries/gold_Annotator_One/60763_5_3155258.txt
error in data: dataset/activity-7_threadDataSet/human summaries/gold_Annotator_One/60974_588_2410400.txt


In [297]:
display(sum_text['60763_5_666188'])
print('length of the summary:', len(sum_text))

'User twixbar posted a long trip report on their second day in New York. On three hours of sleep, woke up and went to the French Roast for breakfast, a nice local place. Didn’t rush them at all. Took a cab to B&H, a huge photography store. Next took a cab to Battery Park, used pre-purchased tickets to take the ferry to the Statue of Liberty, took a tour inside the Statue. Quickly explored Ellis Island, hailed a cab to the Ed Sullivan Theater for the David Letterman Show. Got to sit up front because they looked enthusiastic, ate at the Hello Deli. Brittany Spears was the surprise guest, and Kurt Russel and a group called the Little Willies (lead singer is Norah Jones) also played. Her friend had purchased tickets to the Yankee-Red Socks game, but lost them. They managed to get tickets again from a ticket agency, went to the game and had great seats, but the friend stayed behind. After the game, wandered home down a different street, took pictures at Rockefeller Center and walked past Sa

length of the summary: 696


In [298]:
sum_text = pd.DataFrame.from_dict(sum_text, orient='index')
df = df.merge(sum_text, left_on='ThreadID', right_index=True)
df.rename(columns={0: 'summary'}, inplace=True)

df.reset_index(drop=True, inplace=True)
df.head(5)

Unnamed: 0,ThreadID,Title,UserID,Date,text,postNum,summary
0,60763_5_666188,New York Trip Day 2,twixbar,"18 June 2006, 6:02",So even though we were exhausted and had gone ...,1,User twixbar posted a long trip report on thei...
1,60763_5_666188,New York Trip Day 2,SummerShowers...,"18 June 2006, 19:05",Day Two really was a perfect day! Sorry your f...,2,User twixbar posted a long trip report on thei...
2,60763_5_666188,New York Trip Day 2,Daisiegee,"18 June 2006, 21:34",You've got me hooked......,3,User twixbar posted a long trip report on thei...
3,60763_5_666188,New York Trip Day 2,twixbar,"19 June 2006, 2:57",I'm glad we went to the game too. It's stil ha...,4,User twixbar posted a long trip report on thei...
4,60763_5_912589,Construction Work Near Casablanca,trollking,"19 December 2006, 1:11",Any info on the construction work near the Cas...,1,User trollking asked about the construction wo...


In [306]:
print('total threads available for analysis:', df['ThreadID'].nunique())

total threads available for analysis: 686


## Step 1: Creating Target Labels

In [299]:
# take a look at one thread for example

df[df['ThreadID'] == '60763_5_666188'].head(1).T

Unnamed: 0,0
ThreadID,60763_5_666188
Title,New York Trip Day 2
UserID,twixbar
Date,"18 June 2006, 6:02"
text,So even though we were exhausted and had gone ...
postNum,1
summary,User twixbar posted a long trip report on thei...


    Text Pre-processing

In [309]:
# preprocessing text using the functions we built before

import fun_nlp_spacy_text as pp_text

df['text'] = df['text'].apply(pp_text.clean)

In [312]:
# extract lemmatized version of the text
# and a version keeping only nouns, proper nouns, adjectives, adverbs and verbs ('nav')

nlp = spacy.load('en_core_web_sm')
pos_to_take = ['NOUN', 'PROPN', 'ADJ', 'ADV', 'VERB']

for i, row in tqdm(df.iterrows(), total=df.shape[0]):
    doc = nlp(str(row['text']))
    df.at[i, 'lemmas'] = ' '.join([token.lemma_ for token in doc])  # lemmatized version
    df.at[i, 'nav'] = ' '.join([token.lemma_ for token in doc if token.pos_ in pos_to_take])  # noun-adjective-verb

100%|██████████| 7210/7210 [02:04<00:00, 57.80it/s]


In [321]:
df.head(5)

Unnamed: 0,ThreadID,Title,UserID,Date,text,postNum,summary,lemmas,nav
0,60763_5_666188,New York Trip Day 2,twixbar,"18 June 2006, 6:02",So even though we were exhausted and had gone ...,1,User twixbar posted a long trip report on thei...,so even though we be exhausted and have go to ...,so even exhausted go bed New York Time only ha...
1,60763_5_666188,New York Trip Day 2,SummerShowers...,"18 June 2006, 19:05",Day Two really was a perfect day! Sorry your f...,2,User twixbar posted a long trip report on thei...,day two really be a perfect day ! sorry your f...,day really perfect day friend join Yankees rea...
2,60763_5_666188,New York Trip Day 2,Daisiegee,"18 June 2006, 21:34",You've got me hooked......,3,User twixbar posted a long trip report on thei...,you 've get I hook ......,get hook
3,60763_5_666188,New York Trip Day 2,twixbar,"19 June 2006, 2:57",I'm glad we went to the game too. It's stil ha...,4,User twixbar posted a long trip report on thei...,I be glad we go to the game too . it be stil h...,glad go game too stil hard ot believe Yankee S...
4,60763_5_912589,Construction Work Near Casablanca,trollking,"19 December 2006, 1:11",Any info on the construction work near the Cas...,1,User trollking asked about the construction wo...,any info on the construction work near the Cas...,info construction work Casablanca hotel mentio...


In [314]:
# splitting between train and test split
# using group shuffle split, because it's grouped by the thread id

from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.2)
train_split, test_split = next(gss.split(df, groups=df['ThreadID']))

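# .copy() gives independent dataframes, so adding columns later does not raise SettingWithCopyWarning
train_df = df.iloc[train_split].copy()
test_df = df.iloc[test_split].copy()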

In [320]:
print('number of threads for train:', train_df['ThreadID'].nunique())
print('number of threads for test:', test_df['ThreadID'].nunique())

number of threads for train: 548
number of threads for test: 138


    Determining Target Label for Each Post

Determine whether a post in a thread should be included in the summary by calculating the similarity between each post's text and the thread's human-written summary, then picking the posts that are most similar.
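The labeling idea can be sketched on a single toy thread with the standard library only. This is a simplified illustration: the function name `label_summary_posts` and the toy posts are made up for this example, `difflib.SequenceMatcher` stands in for `textdistance.jaro_winkler`, and ties are not handled the way the pandas rank-based code below handles them.

```python
import math
from difflib import SequenceMatcher

def label_summary_posts(posts, summary, compression_factor=0.3):
    """Return one True/False label per post: True for the posts most similar to the summary."""
    # difflib's ratio is a stand-in similarity; the notebook uses textdistance.jaro_winkler
    sims = [SequenceMatcher(None, post, summary).ratio() for post in posts]

    # number of posts to keep, rounded up so that every thread keeps at least one
    keep = math.ceil(compression_factor * len(posts))

    top = sorted(range(len(posts)), key=lambda i: sims[i], reverse=True)[:keep]
    return [i in top for i in range(len(posts))]

toy_posts = [
    "Any info on the construction work near the hotel?",
    "lol",
    "The construction work near the hotel ended in May.",
    "Thanks!",
]
toy_summary = "A user asked about construction work near the hotel; it ended in May."

print(label_summary_posts(toy_posts, toy_summary))
# [True, False, True, False]
```

With four posts and a compression factor of 0.3, `ceil(1.2) = 2` posts are labeled as summary posts.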

In [370]:
import textdistance  # for calculating similarity between text

compression_factor = 0.3  # fraction of the posts in a thread that we want to include in the summary

train_df['similarity'] = train_df.apply(
    lambda x: textdistance.jaro_winkler(x.text, x.summary), axis=1)  # using jaro winkler, but there's other methods we can use
train_df['rank'] = train_df.groupby('ThreadID')['similarity'].rank(
    'max', ascending=False)

topN = lambda x: x <= np.ceil(compression_factor * x.max())
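# group_keys=False keeps the original row index, so the result aligns with train_df
train_df['summaryPost'] = train_df.groupby('ThreadID', group_keys=False)['rank'].apply(topN)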



In [371]:
# check 5 data

train_df[['text', 'summary', 'summaryPost']][train_df['ThreadID'] == '60763_5_912589'].sample(5)

Unnamed: 0,text,summary,summaryPost
14,"Thanks again Cockle and Yea no probs Dublin, w...",User trollking asked about the construction wo...,True
7,"I've never stayed at the Casablanca, but many ...",User trollking asked about the construction wo...,True
10,I stayed at the Casablanca Thursday night and ...,User trollking asked about the construction wo...,True
18,When do we get the trip report??? ;),User trollking asked about the construction wo...,False
20,The trip report shall be done this evening.,User trollking asked about the construction wo...,False


In [372]:
# exclude posts of 20 characters or fewer
# they are likely just noise

train_df.loc[train_df['text'].str.len() <= 20, 'summaryPost'] = False

## Step 2: Adding Features to Assist Model Prediction

    Adding similarity between post and the title

In [373]:
train_df['titleSimilarity'] = train_df.apply(
    lambda x: textdistance.jaro_winkler(x.text, x.Title), axis=1
)

train_df.head(3)



Unnamed: 0,ThreadID,Title,UserID,Date,text,postNum,summary,lemmas,nav,similarity,rank,summaryPost,titleSimilarity,textLength
4,60763_5_912589,Construction Work Near Casablanca,trollking,"19 December 2006, 1:11",Any info on the construction work near the Cas...,1,User trollking asked about the construction wo...,any info on the construction work near the Cas...,info construction work Casablanca hotel mentio...,0.581996,11.0,False,0.535753,182
5,60763_5_912589,Construction Work Near Casablanca,Rosie12,"19 December 2006, 1:39",We were staying at the Casablanca from 24 nov ...,2,User trollking asked about the construction wo...,we be stay at the Casablanca from 24 nov to 28...,stay Casablanca nov nov construction work dist...,0.630352,8.0,False,0.516112,334
6,60763_5_912589,Construction Work Near Casablanca,trollking,"19 December 2006, 19:22",I have emailed the Casablanca and they have sa...,3,User trollking asked about the construction wo...,I have email the Casablanca and they have say ...,email Casablanca say know long construction go...,0.626002,10.0,False,0.512962,311


    Length of the post

In [374]:
train_df['textLength'] = train_df['text'].str.len()
train_df.head(3)



Unnamed: 0,ThreadID,Title,UserID,Date,text,postNum,summary,lemmas,nav,similarity,rank,summaryPost,titleSimilarity,textLength
4,60763_5_912589,Construction Work Near Casablanca,trollking,"19 December 2006, 1:11",Any info on the construction work near the Cas...,1,User trollking asked about the construction wo...,any info on the construction work near the Cas...,info construction work Casablanca hotel mentio...,0.581996,11.0,False,0.535753,182
5,60763_5_912589,Construction Work Near Casablanca,Rosie12,"19 December 2006, 1:39",We were staying at the Casablanca from 24 nov ...,2,User trollking asked about the construction wo...,we be stay at the Casablanca from 24 nov to 28...,stay Casablanca nov nov construction work dist...,0.630352,8.0,False,0.516112,334
6,60763_5_912589,Construction Work Near Casablanca,trollking,"19 December 2006, 19:22",I have emailed the Casablanca and they have sa...,3,User trollking asked about the construction wo...,I have email the Casablanca and they have say ...,email Casablanca say know long construction go...,0.626002,10.0,False,0.512962,311


    Vector of the lemmatized text

In [375]:
tfidf = TfidfVectorizer(min_df=10, ngram_range=(1,2), stop_words='english')
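tfidf_result = tfidf.fit_transform(train_df['nav']).toarray()  # using the 'nav' column, could also be done using the lemmatized form
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
tfidf_df = pd.DataFrame(tfidf_result, columns=tfidf.get_feature_names_out())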

tfidf_df.head(5)



Unnamed: 0,10th,11th,12th,14th,14th street,15th,17th,18th,19th,1st,...,yorker,yorkers,young,yr,yr old,yummy,zabar,zero,zone,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


    Inserting all the features, combine into one dataframe

In [376]:
# renaming columns of tfidf with token, to show that this is weight of each words
tfidf_df.columns = ['token_' + str(x) for x in tfidf_df.columns]
tfidf_df.index = train_df.index

# add post num, might be also explaining whether it should be included in the summary
feature_cols = ['titleSimilarity', 'textLength', 'postNum']  
train_df_tf = pd.concat([train_df[feature_cols], tfidf_df], axis=1)
train_df_tf.head(5)

Unnamed: 0,titleSimilarity,textLength,postNum,token_10th,token_11th,token_12th,token_14th,token_14th street,token_15th,token_17th,...,token_yorker,token_yorkers,token_young,token_yr,token_yr old,token_yummy,token_zabar,token_zero,token_zone,token_zoo
4,0.535753,182,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.516112,334,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.512962,311,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.493886,567,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.500226,515,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### One thing is missing: build the same dataframe for test_df!

In [377]:
### similarity to the summary

compression_factor = 0.3  # fraction of the posts in a thread that we want to include in the summary

test_df['similarity'] = test_df.apply(
    lambda x: textdistance.jaro_winkler(x.text, x.summary), axis=1)  # using jaro winkler, but there's other methods we can use
test_df['rank'] = test_df.groupby('ThreadID')['similarity'].rank(
    'max', ascending=False)

topN = lambda x: x <= np.ceil(compression_factor * x.max())
test_df['summaryPost'] = test_df.groupby('ThreadID')['rank'].apply(topN)

test_df.loc[test_df['text'].str.len() <= 20, 'summaryPost'] = False


### feature engineering

test_df['titleSimilarity'] = test_df.apply(
    lambda x: textdistance.jaro_winkler(x.text, x.Title), axis=1
)

test_df['textLength'] = test_df['text'].str.len()

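# IMPORTANT: for the test set use transform() instead of fit_transform(), so the vocabulary and weights learned on the training data are reused
tfidf_result = tfidf.transform(test_df['nav']).toarray()
# on scikit-learn >= 1.2, get_feature_names() is removed; use get_feature_names_out() instead
tfidf_df = pd.DataFrame(tfidf_result, columns=tfidf.get_feature_names())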


### inserting all features into a dataframe

tfidf_df.columns = ['token_' + str(x) for x in tfidf_df.columns]
tfidf_df.index = test_df.index

feature_cols = ['titleSimilarity', 'textLength', 'postNum']  
test_df_tf = pd.concat([test_df[feature_cols], tfidf_df], axis=1)
test_df_tf.head(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['similarity'] = test_df.apply(

Unnamed: 0,titleSimilarity,textLength,postNum,token_10th,token_11th,token_12th,token_14th,token_14th street,token_15th,token_17th,...,token_yorker,token_yorkers,token_young,token_yr,token_yr old,token_yummy,token_zabar,token_zero,token_zone,token_zoo
0,0.501797,7293,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.041082,0.0,0.0,0.0,0.0
1,0.497845,245,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.403027,26,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.415121,84,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,0.513547,3028,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058798,0.0,0.0
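
To see what the ranking and `compression_factor` steps do, here is a sketch on two invented threads. The similarity values are made up, and it uses `transform` rather than `apply` so that the result keeps the original row index regardless of the pandas version:

```python
import numpy as np
import pandas as pd

compression_factor = 0.3

# hypothetical similarities for a 4-post thread "a" and a 3-post thread "b"
toy = pd.DataFrame({
    "ThreadID": ["a", "a", "a", "a", "b", "b", "b"],
    "similarity": [0.9, 0.5, 0.7, 0.6, 0.2, 0.8, 0.4],
})

# rank 1 = most similar post within its thread
toy["rank"] = toy.groupby("ThreadID")["similarity"].rank(method="max", ascending=False)

# the largest rank equals the thread size n, so this keeps the top ceil(0.3 * n) posts
toy["summaryPost"] = toy.groupby("ThreadID")["rank"].transform(
    lambda r: r <= np.ceil(compression_factor * r.max())
)

selected = toy["summaryPost"].tolist()
print(toy)
```

Thread "a" keeps two of its four posts (ceil of 1.2) and thread "b" keeps one of three (ceil of 0.9), so every thread contributes at least one post.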


## Step 3: Build the Machine Learning Model

Build the model with a RandomForestClassifier; a tree-based algorithm might perform better on data that combines numeric and categorical features.

In [379]:
# build the ML model, fit it on the training features, then predict on the test features

from sklearn.ensemble import RandomForestClassifier

x = train_df_tf
y = train_df['summaryPost']

model = RandomForestClassifier()
model.fit(x, y)

test_df['summaryPost_predicted'] = model.predict(test_df_tf)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['summaryPost_predicted'] = model.predict(test_df_tf)


In [381]:
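# function to calculate the ROUGE-1 score of one thread

from rouge_score import rouge_scorer  # needed here if not already imported earlier in the notebook

def calculate_rouge_score(x, column_name):
    # x holds the posts of one thread; every row carries the same reference summary
    ref_summary = x['summary'].values[0]

    # concatenate the posts flagged in column_name (note: ''.join puts no separator between posts)
    predicted_summary = ''.join(x['text'][x[column_name]])

    # ROUGE-1 F-measure: unigram overlap between reference and predicted summary, with stemming
    scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
    scores = scorer.score(ref_summary, predicted_summary)
    return scores['rouge1'].fmeasure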

In [385]:
rouge_score = test_df.groupby('ThreadID')[['summary', 'text', 'summaryPost_predicted']].\
    apply(calculate_rouge_score, column_name='summaryPost_predicted').mean()
    
print('rouge score for test:', rouge_score)

rouge score for test: 0.35351810647553134
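
For intuition about what this number measures, ROUGE-1 can be computed by hand as unigram overlap. The sketch below is a simplified stand-in, not the `rouge_score` implementation: it splits on whitespace and skips stemming, so its values will differ somewhat from the library's. The function name `rouge1_f` and the two sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f(reference, candidate):
    """Simplified ROUGE-1 F-measure: whitespace tokens, no stemming."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "she visited the zoo and the aquarium"
candidate = "the zoo has elephants"
score = rouge1_f(reference, candidate)
print(round(score, 4))
```

Here two of the four candidate words ("the", "zoo") appear in the seven-word reference, giving precision 2/4, recall 2/7 and an F-measure of 4/11.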


In [440]:
# inspect the predictions for a single thread from the test set

thread_id = '60974_588_1849664'

instance_df = test_df[test_df['ThreadID'] == thread_id]
print(instance_df['summary'].iloc[0])

print('\nPredicted Summary:\n', instance_df[instance_df['summaryPost_predicted'] == True]['text'].values, '\n')

display(instance_df[['postNum', 'text', 'summaryPost', 'summaryPost_predicted']])

A woman wanted to spend 2 hours with the kids in Niagara. She felt spending time at McDonald was a good idea but still needed suggestions to spend time. Someone suggested visiting the zoo which had elephants. The other places suggested were a museum of Science, beautiful collection of greenhouses.  Some other person suggests her to visit Aquatic and Fitness Center on Delaware which had Olympic-size swimming pool and which was 15 min ride from the Darwin Martin house. The other suggested places were ToyTown museum, Roycroft campus and Vidlers. Finally she ended up at the Niagara Aquarium. 

Predicted Summary:
 ['Buffalo doesn\'t have a children\'s musuem, unfortunately. The closest one is in Rochester, NY and that\'s a one hour drive. The Bflo zoo is an ideal location since you can park near the Darwin Martin house and walk to the zoo. They have river otters so cute! And you can see the elephants up close and personal if they\'re not outside (it looks like it might be cold tomorrow, but

Unnamed: 0,postNum,text,summaryPost,summaryPost_predicted
6757,1,We are in Niagara CA side and will be in Buffa...,True,False
6758,2,"Buffalo doesn't have a children's musuem, unfo...",False,True
6759,3,Thank you Lady Dee. I read about the one in Ro...,False,False
6760,4,Sorry I am late....A few hours at the Galleria...,False,False
6761,5,"I'm late, too, but another option, if you're i...",True,False
6762,6,Explore and More is a children's museum in Eas...,True,True
6763,7,"Wow, some great suggestions! Unfortunately I d...",False,False
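
As a quick check on this thread, the labelled and predicted columns of the table can be compared directly. A small sketch in plain Python, with the two boolean columns copied by hand from the table above:

```python
# summaryPost and summaryPost_predicted for posts 1..7 of thread 60974_588_1849664
summary_post = [True, False, False, False, True, True, False]
predicted = [False, True, False, False, False, True, False]

# posts that are both labelled and predicted as summary posts
true_positives = sum(a and p for a, p in zip(summary_post, predicted))

precision = true_positives / sum(predicted)   # share of predicted posts that are labelled
recall = true_positives / sum(summary_post)   # share of labelled posts that were found

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Only post 6 is both labelled and predicted, so for this thread the model recovers one of the three labelled posts, and one of its two predictions is correct.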
