#### Idea
We are interested in exploring text summarization and, in particular, headline generation. Here is a first baseline: we just take the first 30 words of each text as a hypothesis for a title.

#### Data
[Kaggle Dataset](https://www.kaggle.com/kashnitsky/news-about-major-cryptocurrencies-20132018-40k) with ~40k articles sharing news on major cryptocurrencies. 

#### Task
All articles have a `title` and a `text`; the task is to generate a title given the text. The chosen metric is an average of ROUGE-1, ROUGE-2, and ROUGE-L, see [this report](http://www.dialog-21.ru/media/4661/camerareadysubmission-157.pdf) describing the metric, page 3.
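
To make the metric more tangible, here is a minimal pure-Python sketch of the three scores and their average. It is a simplified illustration, not the `rouge` package's implementation: tokens come from plain whitespace splitting, n-gram counts are clipped, ROUGE-L is computed on the whole string as one sentence, and F1 is the plain harmonic mean. All function names (`rouge_n_f1`, `rouge_l_f1`, `mean_rouge`) and the example strings are made up for this sketch, so the numbers it gives may differ from the package's.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, hyp_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 if there is no overlap."""
    if overlap == 0 or hyp_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / hyp_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(hypothesis, reference, n):
    """Simplified ROUGE-N F1 based on clipped n-gram counts."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((hyp & ref).values())  # multiset intersection clips counts
    return f1(overlap, sum(hyp.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis, reference):
    """Simplified ROUGE-L F1: LCS length over hypothesis/reference lengths."""
    hyp, ref = hypothesis.split(), reference.split()
    return f1(lcs_length(hyp, ref), len(hyp), len(ref))


def mean_rouge(hypothesis, reference):
    """Average of ROUGE-1, ROUGE-2 and ROUGE-L F1 (the metric of interest)."""
    return (rouge_n_f1(hypothesis, reference, 1)
            + rouge_n_f1(hypothesis, reference, 2)
            + rouge_l_f1(hypothesis, reference)) / 3


# Made-up example strings, not taken from the dataset
hyp = "bitcoin price hits new high"
ref = "bitcoin hits new record high"
print(rouge_n_f1(hyp, ref, 1), rouge_n_f1(hyp, ref, 2), rouge_l_f1(hyp, ref))
print(mean_rouge(hyp, ref))
```

In this toy pair, 4 of 5 unigrams match but only 1 of 4 bigrams does, which is why ROUGE-2 tends to be much lower than ROUGE-1 and ROUGE-L.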

#### Results

ROUGE scores (F1 variant):
- ROUGE-1 – 18.4%
- ROUGE-2 – 5.3%
- ROUGE-L – 16.9%
- Average – 13.5%

Pretty mediocre. Hopefully ML models will do a better job.

#### Installing the Rouge package and playing around with the metric 

In [None]:
# https://pypi.org/project/rouge/
!pip install rouge > /dev/null

Example of Rouge calculation

In [None]:
from rouge import Rouge 

hypothesis = "Some London Underground stations should be closed, as the city is trying to reduce the impact of a coronavirus outbreak.".lower()

reference = "Up to 40 stations on the London Underground network are to be shut as the city attempts to reduce the effect of the coronavirus outbreak.".lower()

rouge = Rouge()
scores = rouge.get_scores(hypothesis, reference)
scores

#### Reading and briefly exploring data

In [None]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
from pathlib import Path

from matplotlib import pyplot as plt
%config InlineBackend.figure_format = 'retina'

In [None]:
PATH_TO_CRYPTO_NEWS = Path('../input/news-about-major-cryptocurrencies-20132018-40k/')

In [None]:
train_df = pd.read_csv(PATH_TO_CRYPTO_NEWS / 'crypto_news_parsed_2013-2017_train.csv')
valid_df = pd.read_csv(PATH_TO_CRYPTO_NEWS / 'crypto_news_parsed_2018_validation.csv')

In [None]:
train_df.info()

In [None]:
# reading empty strings is a bit different locally and here, but not a big deal
# assign back instead of inplace=True on a column selection (deprecated in recent pandas)
train_df['text'] = train_df['text'].fillna(' ')

In [None]:
valid_df.info()

In [None]:
train_df.head(2)

**URL**

The URL serves as the id of a news article.

In [None]:
train_df['url'].nunique() == len(train_df)

We can take a look at some of the actual articles on the Web

In [None]:
train_df.loc[:5, 'url']

https://www.ccn.com/paris-hiltons-hotel-mogul-father-to-sell-38-million-mansion-for-cryptocurrency/

<img src="https://habrastorage.org/webt/4c/3n/eg/4c3neg5owcdohooydlz4dbdwzdo.png" width=70% />

**Title**

These are on average pretty short; the median is just 9 words.

In [None]:
train_df['title'].apply(lambda s: len(s.split())).describe()

Dunno if wordclouds have ever been useful but let's build one

In [None]:
from wordcloud import WordCloud, STOPWORDS

wordcloud = WordCloud(background_color='black', stopwords = STOPWORDS,
                max_words = 200, max_font_size = 100, 
                random_state = 17, width=800, height=400)

plt.figure(figsize=(16, 12))
# join all titles; str(Series) would only give the truncated repr
wordcloud.generate(' '.join(train_df['title'].astype(str)))
plt.imshow(wordcloud);

**Text**

Texts are pretty long, a normal length for online news; the median is around 400 words.

In [None]:
train_df['text'].apply(lambda s: len(s.split())).describe()

Let's extract the first sentence of each text in a dumb way, simply splitting on the dot (it's far from perfect!).

First sentences are longer than titles: the median is 21 words and the maximum is 232. So it also makes sense to try just a part of the first sentence as a hypothesis for a title.

In [None]:
first_sentences_dumb = train_df['text'].apply(lambda s: s.split('.')[0])
first_sentences_dumb.apply(lambda s: len(s.split())).describe()

Let's perform a sanity check: do texts actually start like normal articles, without any placeholders at the beginning (like a timestamp)? We check this simply by taking the first 10 words (10 is an arbitrary choice) of each first sentence and looking at the most frequent values.

In [None]:
first_ten_words_dumb = first_sentences_dumb.apply(lambda s: " ".join(s.split()[:10]))
first_ten_words_dumb.value_counts().head(20)

Indeed, we see some problems with taking everything before the first dot as a first sentence. 

Peculiarities of the first sentence:

 - Splitting on a dot is imperfect, so it breaks up some phrases like "Dr. Brown"
 - Looks like there are duplicates of some news (plagiarism?) published in different sources ("Noelle Acheson is a 10-year veteran of company analysis" appears over 40 times)
 - Some common welcome messages have already been fixed ("The views and opinions expressed here are solely those of ")
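
The first point is easy to reproduce. Here is a small sketch of the dot-splitting approach on two made-up strings (they are not from the dataset, and `first_sentence_naive` is a hypothetical helper name):

```python
def first_sentence_naive(text):
    """Everything before the first dot, as in the 'dumb' approach above."""
    return text.split('.')[0]


# Made-up examples: an abbreviation and a decimal number
abbreviation = "Dr. Brown bought one bitcoin. The price then fell."
decimal = "Bitcoin rose 3.5% today. More details soon."

print(first_sentence_naive(abbreviation))  # prints "Dr"
print(first_sentence_naive(decimal))       # prints "Bitcoin rose 3"
```

Both abbreviations and decimal numbers cut the "sentence" short, which is likely one reason the dot-split first sentences come out so short.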

#### So let's try to use `sent_tokenize` to better extract the first sentence

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
def extract_first_sent(text):
    
    sent_tok = sent_tokenize(text)
    
    return sent_tok[0].strip() if sent_tok else ''

Now we see that first sentences are on average about twice as long as the dumb ones.

In [None]:
first_sentences = train_df['text'].progress_apply(extract_first_sent)
first_sentences.apply(lambda s: len(s.split())).describe()

In [None]:
first_ten_words = first_sentences.apply(lambda s: " ".join(s.split()[:10]))
first_ten_words.value_counts().head(20)

Now it's a bit better, though still not perfect


**Year**

The train-validation split is done based on year.

In [None]:
train_df['year'].value_counts()

In [None]:
valid_df['year'].value_counts()

**Author**

In [None]:
train_df['author'].nunique()

In [None]:
train_df['author'].value_counts().head()

**Source**

This is a feature of the actual scraping: some articles come from websites that have tags in their metadata (no more information on that).

In [None]:
train_df['source'].nunique()

In [None]:
train_df['source'].value_counts().head()

That's it for the analysis, let's now create a first headline generation baseline. We saw that titles are short, up to 30 words, so we'll just use the first 30 words as a hypothesis for a title (in the code below, these are the first 30 words of the first sentence).
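
The truncation itself is a one-liner. Here is a minimal sketch of it as a standalone helper; the name `first_n_words` is made up for illustration, and the example input is not from the dataset. In the notebook the same logic is applied inline, to the first sentence extracted with `extract_first_sent`.

```python
def first_n_words(text, n=30):
    """Return the first n whitespace-separated words of text, lowercased."""
    return " ".join(text.split()[:n]).lower()


# Made-up example: a 50-word text gets cut down to 30 words
long_text = " ".join(f"Word{i}" for i in range(50))
hypothesis = first_n_words(long_text)
print(len(hypothesis.split()))
print(first_n_words("Bitcoin Hits New High", n=2))
```

Texts shorter than `n` words are returned whole, and an empty text gives an empty hypothesis, which is what `ignore_empty=True` in the `get_scores` call is there to skip.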

#### Now calculating ROUGE scores for the validation part with first 30 words as hypotheses.

In [None]:
true_val_titles = valid_df['title'].str.lower()

In [None]:
first_sentences_val = valid_df['text'].progress_apply(extract_first_sent)
first_thirty_words_val = first_sentences_val.loc[valid_df.index].apply(lambda s: " ".join(s.split()[:30]).lower())

In [None]:
%%time
rouge = Rouge()
scores = rouge.get_scores(hyps=first_thirty_words_val, refs=true_val_titles, avg=True, ignore_empty=True)

In [None]:
scores

Average between ROUGE-1, ROUGE-2, and ROUGE-L (the metric of interest)

In [None]:
final_metric = (scores['rouge-1']['f'] + scores['rouge-2']['f'] + scores['rouge-l']['f']) / 3
final_metric