# Text Summarization

Methods for summarizing a text fall into two categories: extractive and abstractive.
- Extractive summaries are built out of the document itself: they consist of text taken from the original document, so the summary is a subset of the input
- Abstractive summaries can contain new sequences of text that are not necessarily taken from the input
- The former are much easier to generate than the latter: you only need code that identifies the relevant parts, whereas for the latter you need to develop an understanding of the input and convert that understanding into new text
- This section of the course focuses on extractive summaries
- Abstractive summaries are better suited to deep learning methods such as seq2seq models and transformers

### We'll look at two methods:
- Method 1: requires only knowledge of vector-based methods (TF-IDF)
- Method 2: more complex, based on Google's PageRank; requires a bit more knowledge of probability and algebra (specifically Markov chains?), different from the probability used earlier

## Using Vectors
- we'll use TF-IDF
- split the document into sentences
- score each sentence
- rank the sentences by those scores
- our summary will simply be the top-scoring sentences

### Sentence Splitting & TF-IDF
- this is called sentence tokenization and can be done with nltk: `nltk.sent_tokenize(text)`
- then treat each sentence as if it were a separate document
- build the TF-IDF matrix
- previously the rows were documents; now they're sentences, and the columns are still terms
#### Scoring each sentence:
- simplest way: take the average of the non-zero values in each sentence's row
- i.e. divide by the number of non-zero terms, not by the number of features
- why does it work? remember that each component of the TF-IDF vector reflects how often a specific term appears in the sentence, scaled down by how many sentences contain that term
- so if a word appears in many sentences, its TF-IDF value shrinks
- this way unimportant words end up with smaller values
- why the mean and not the sum? the sum would be biased towards longer sentences
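
To see the difference between the three options, here is a small sketch on made-up numbers (the rows are not real TF-IDF output, and `nonzero_mean` is just a helper name chosen for this example). It compares the sum, the plain mean over all features, and the mean over non-zero values.

```python
import numpy as np

def nonzero_mean(rows):
    """Average the non-zero entries of each row (0.0 for an all-zero row)."""
    rows = np.asarray(rows, dtype=float)
    counts = (rows != 0).sum(axis=1)             # non-zero terms per sentence
    return rows.sum(axis=1) / np.maximum(counts, 1)

# Made-up rows: a short sentence with two heavy terms,
# and a long sentence with five light terms.
toy = np.array([
    [0.6, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.3, 0.3, 0.3, 0.3],
])

print("sum          :", toy.sum(axis=1))    # the long sentence scores higher
print("plain mean   :", toy.mean(axis=1))   # divides both rows by all 7 features
print("non-zero mean:", nonzero_mean(toy))  # divides by 2 and by 5 terms
```

With the sum, the long sentence wins just because it has more terms; with the non-zero mean, the short sentence with the heavier terms wins.
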
#### What to do with scores?
- idea: sort the scores and pick the sentences with the highest scores
- there are multiple ways to do this; try them and choose the best
- simplest: take the top N sentences
- also simple: take the top N words or top N characters if you have a length limit, e.g. if you're building a search engine, there is limited space to show each result's summary
- or take the top X% of sentences or words
- or define a threshold score, e.g. the average score, but that can let too many sentences through
- maybe threshold = average score * factor
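
The options above could be sketched like this in plain Python. All function names, the scores, and the placeholder sentences are made up for illustration. Each function returns sentence indices in their original order, so the summary reads in document order.

```python
def top_n(scores, n):
    """Indices of the n highest scores, returned in original sentence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])

def top_fraction(scores, fraction):
    """Indices of the top `fraction` of sentences (always at least one)."""
    return top_n(scores, max(1, int(len(scores) * fraction)))

def above_threshold(scores, factor=1.0):
    """Indices whose score exceeds mean(scores) * factor."""
    threshold = factor * sum(scores) / len(scores)
    return [i for i, s in enumerate(scores) if s > threshold]

def within_char_budget(scores, sentences, max_chars):
    """Greedily add the best-scoring sentences while the total length fits."""
    chosen, used = [], 0
    for i in sorted(range(len(scores)), key=lambda i: scores[i], reverse=True):
        if used + len(sentences[i]) <= max_chars:
            chosen.append(i)
            used += len(sentences[i])
    return sorted(chosen)

scores = [0.20, 0.22, 0.29, 0.23, 0.33]
sentences = ["aaaa", "bbbbbb", "cc", "dddddddd", "eee"]  # placeholders

print(top_n(scores, 2))                                     # [2, 4]
print(top_fraction(scores, 0.4))                            # [2, 4]
print(above_threshold(scores, factor=1.1))                  # [2, 4]
print(within_char_budget(scores, sentences, max_chars=10))  # [0, 2, 4]
```
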

## Exercise Prompt
- use the BBC dataset again
- try it on multiple articles
- split the article into sentences
- build the TF-IDF matrix from the sentences
- score each sentence by taking the average
- sort the sentences by score
- filter out the low-scoring sentences

### My solutions:

In [1]:
import numpy as np
import pandas as pd

import nltk
from nltk.stem.porter import PorterStemmer 
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# the instructor of this course uses Multinomial instead :/
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix
from sklearn.metrics import accuracy_score 

In [2]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yagmuraslan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yagmuraslan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
# https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

--2023-06-14 14:34:45--  https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
Resolving lazyprogrammer.me (lazyprogrammer.me)... 172.67.213.166, 104.21.23.210
Connecting to lazyprogrammer.me (lazyprogrammer.me)|172.67.213.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5085081 (4.8M) [text/csv]
Saving to: 'bbc_text_cls.csv'


2023-06-14 14:34:50 (1.86 MB/s) - 'bbc_text_cls.csv' saved [5085081/5085081]



In [5]:
df = pd.read_csv('bbc_text_cls.csv')
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [89]:
article = np.random.choice(df["text"])

In [90]:
df2 = nltk.sent_tokenize(article)
df2

['Brussels raps mobile call charges\n\nThe European Commission has written to the mobile phone operators Vodafone and T-Mobile to challenge "the high rates" they charge for international roaming.',
 'In letters sent to the two companies, the Commission alleged the firms were abusing their dominant market position in the German mobile phone market.',
 "It is the second time Vodafone has come under the Commission's scrutiny.",
 'The UK operator is already appealing against allegations that its UK roaming rates are "unfair and excessive".',
 "Vodafone's response to the Commission's letter was defiant.",
 '"We believe the roaming market is competitive and we expect to resist the charges," said a Vodafone spokesman.',
 '"However we will need time to examine the statement of objections in detail before we formally respond."',
 "The Commission's investigation into Vodafone and Deutsche Telekom's T-Mobile centres on the tariffs the two companies charge foreign mobile operators to access their 

In [91]:
vectorizer = TfidfVectorizer(max_features = 2000)
Tfidf_matrix = vectorizer.fit_transform(df2)
Tfidf_matrix.shape

(17, 176)

In [92]:
scores = []
for i in range(Tfidf_matrix.shape[0]):
    row = Tfidf_matrix[i, :]
    # average only the non-zero TF-IDF values of sentence i
    score = row[row != 0].mean()
    scores.append(score)
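
The loop above can also be written without iterating over the rows. This is only a sketch: `score_rows` is a made-up helper name, and the small `demo` matrix stands in for `Tfidf_matrix` with invented values. It relies on the CSR layout, where `indptr` marks where each row's stored values begin and end.

```python
import numpy as np
from scipy.sparse import csr_matrix

def score_rows(matrix):
    """Mean of the stored (non-zero) values in each row of a sparse matrix."""
    matrix = matrix.tocsr()
    totals = np.asarray(matrix.sum(axis=1)).ravel()  # row sums as a flat array
    counts = np.diff(matrix.indptr)                  # stored entries per row
    return totals / np.maximum(counts, 1)            # avoid 0/0 for empty rows

# Small stand-in for Tfidf_matrix (values are made up).
demo = csr_matrix(np.array([
    [0.5, 0.0, 0.5],
    [0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0],
]))

print(score_rows(demo))
```

An empty row (a sentence with no known terms) gets a score of 0 here instead of a NaN.
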

In [93]:
scores

[0.19994132833176456,
 0.21976090204394835,
 0.2910878344231112,
 0.2323031631934066,
 0.3272872917776975,
 0.2413613513341671,
 0.23462907933313298,
 0.17222826036904718,
 0.23460764343564638,
 0.19605695778684437,
 0.20576261432126783,
 0.22506468991753328,
 0.21371595211004196,
 0.21343302648096493,
 0.21015165229124375,
 0.16995956087787353,
 0.27118940091231314]

In [94]:
len(scores)

17

In [95]:
scores2idx = pd.Series(scores)
scores2idx

0     0.199941
1     0.219761
2     0.291088
3     0.232303
4     0.327287
5     0.241361
6     0.234629
7     0.172228
8     0.234608
9     0.196057
10    0.205763
11    0.225065
12    0.213716
13    0.213433
14    0.210152
15    0.169960
16    0.271189
dtype: float64

In [96]:
# e.g. the top 8 sentences:

indices = scores2idx.nlargest(8)
summary = []
for i in range(len(df2)):
    if i in indices.index:
        summary.append(df2[i])
        
print(" ".join(summary))    

It is the second time Vodafone has come under the Commission's scrutiny. The UK operator is already appealing against allegations that its UK roaming rates are "unfair and excessive". Vodafone's response to the Commission's letter was defiant. "We believe the roaming market is competitive and we expect to resist the charges," said a Vodafone spokesman. "However we will need time to examine the statement of objections in detail before we formally respond." The Commission believes these wholesale prices are too high and that the excess is passed on to consumers. Vodafone sent the Commission a response to those allegations in December last year and is now waiting for a reply. The investigation involves regulators assessing whether there is effective competition in the roaming market.


In [97]:
# top 30% of the sentences

indices = scores2idx.nlargest(int(len(scores2idx) * 0.3))
summary = []
for i in range(len(df2)):
    if i in indices.index:
        summary.append(df2[i])
        
print(" ".join(summary))  

It is the second time Vodafone has come under the Commission's scrutiny. Vodafone's response to the Commission's letter was defiant. "We believe the roaming market is competitive and we expect to resist the charges," said a Vodafone spokesman. "However we will need time to examine the statement of objections in detail before we formally respond." The investigation involves regulators assessing whether there is effective competition in the roaming market.


#### Unlike me, the instructor separated the titles from the articles, used stopwords and normalization in TF-IDF, and defined a general-purpose function.
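
A general-purpose function along those lines might look like the sketch below. This is my own guess, not the instructor's code: the exact stopword list and normalization are not recorded in these notes, so `stop_words="english"` (scikit-learn's built-in list) and `norm="l2"` (its default) are assumptions. The function name `summarize`, its parameters, and the regex splitter in the demo are also made up; the splitter only keeps the demo runnable without the nltk data.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize(text, top_n=5, sent_tokenize=None):
    """Return (title, summary) for an article shaped like 'title\\n\\nbody'."""
    if sent_tokenize is None:
        import nltk                       # needs the 'punkt' data downloaded
        sent_tokenize = nltk.sent_tokenize

    title, _, body = text.partition("\n\n")   # the first blank line ends the title
    if not body:                              # no title found: summarize everything
        title, body = "", text
    sentences = sent_tokenize(body)

    # Assumed settings: built-in English stopwords, default L2 normalization.
    vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
    tfidf = vectorizer.fit_transform(sentences).tocsr()

    # Score = mean of the non-zero TF-IDF values in each sentence.
    totals = np.asarray(tfidf.sum(axis=1)).ravel()
    counts = np.diff(tfidf.indptr)
    scores = totals / np.maximum(counts, 1)

    keep = sorted(np.argsort(-scores)[:top_n])  # best sentences, original order
    return title.strip(), " ".join(sentences[i].strip() for i in keep)

# Demo with a simple regex splitter instead of nltk.
splitter = lambda s: re.split(r"(?<=[.!?])\s+", s.strip())
demo = ("Brussels raps mobile call charges\n\n"
        "The Commission wrote to Vodafone. "
        "Vodafone was defiant. "
        "Regulators will assess competition in the roaming market.")
title, summary = summarize(demo, top_n=2, sent_tokenize=splitter)
print(title)
print(summary)
```

With the notebook's data, the call would be `summarize(article, top_n=5)`, which falls back to `nltk.sent_tokenize`.
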