# Loading the libraries

In [None]:
import pandas as pd
import numpy as np
import nltk
import re
nltk.download('punkt')

# Loading the dataset

In [None]:
df = pd.read_csv('../input/olympic-news-dataset/olympic_news.csv',encoding= 'unicode_escape')

In [None]:
df.head(10)

In [None]:
df.info()

In [None]:
# I will drop the article_title column.
# Reason: I want to keep things simple and work with the article text only.

In [None]:
df.drop(['article_title'], axis = 1, inplace=True)

In [None]:
df.head()

In [None]:
# Now I am looking at the first 3 article texts using a while loop.
# This helps me get a proper understanding of the article text.
i = 0
while i < 3:
    # Print before incrementing so that rows 0, 1 and 2 are shown.
    print(df['article_text'][i])
    i = i + 1

# **Preprocessing**

# **1. TOKENIZATION (Splitting the whole paragraph into sentences)**

**What is tokenization?**

Tokenization is a way of separating a piece of text into smaller units called tokens.
Here, tokens can be words, characters, or subwords.
Hence, tokenization can be broadly classified into 3 types:
word, character, and subword (character n-gram) tokenization.

In this case we are splitting the paragraph into sentences.
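To make the different levels concrete, here is a minimal sketch using only the standard library. The sample text, the naive regex-based sentence split and the helper `char_ngrams` are assumptions made for illustration; the notebook itself relies on nltk's `sent_tokenize`, which handles abbreviations and other edge cases better.

```python
import re

text = "Tokenization splits text. It has several levels!"

# Sentence level: a naive split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Word level: runs of letters, digits or underscores.
word_tokens = re.findall(r"\w+", text)

# Character level: every character of a word is a token.
char_tokens = list("text")

# Subword level: overlapping character n-grams (here n = 3).
def char_ngrams(word, n=3):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(sentence_tokens)
print(word_tokens)
print(char_tokens)
print(char_ngrams("splits"))
```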

In [None]:
from nltk.tokenize import sent_tokenize
sentences = [sent_tokenize(s) for s in df['article_text']]

sentences = [y for x in sentences for y in x] # flatten list

# Above I used a list comprehension instead of a conventional for loop.
# Checking the first 3 sentences.
sentences[:3]

# **2. WORD EMBEDDING (Then splitting the sentences into words.)**

In very simple terms, word embeddings are texts converted into numbers, and the same text can have different numerical representations.
Word embeddings are a type of word representation that allows words with similar meanings to have similar representations.

I am going to use GloVe for word embedding.
GloVe is an unsupervised learning algorithm for obtaining vector representations for words.
Training is performed on aggregated global word-word co-occurrence statistics from a corpus,
and the resulting representations showcase interesting linear substructures of the word vector space.

Read more here https://nlp.stanford.edu/projects/glove/
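The idea that similar words get similar vectors can be sketched with a few toy vectors. The 3-dimensional numbers below are made up for illustration and are not real GloVe values; the helper `cosine` is also an assumption of this sketch.

```python
import numpy as np

# Made-up 3-dimensional vectors, for illustration only.
# Real vectors from glove.6B.100d.txt have 100 dimensions.
toy_embeddings = {
    "gold":   np.array([0.9, 0.8, 0.1]),
    "silver": np.array([0.8, 0.9, 0.2]),
    "banana": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words with related meanings should score higher than unrelated ones.
related = cosine(toy_embeddings["gold"], toy_embeddings["silver"])
unrelated = cosine(toy_embeddings["gold"], toy_embeddings["banana"])
print(related, unrelated)
```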

In [None]:
# Downloading glove.6B.zip, which contains "glove.6B.100d.txt"
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

In [None]:
# Map each word to its 100-dimensional GloVe vector.
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [None]:
len(word_embeddings)

# **3. Removing punctuation, special characters and numbers**

Doing this leaves only alphabetic words, which reduces noise and helps process the text faster.
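As a small sketch of what this pattern does, the example below applies the same character class to a made-up sentence using `re.sub`. Note the side effects: digits inside tokens and apostrophes are also replaced, so "100m" becomes "m" and "men's" becomes "men s".

```python
import re

# Made-up sample sentence, for illustration only.
sample = "The 100m final (men's) starts at 9:50 p.m.!"

# Replace every character that is not an ASCII letter with a space.
cleaned = re.sub(r"[^a-zA-Z]", " ", sample)

# Splitting on whitespace shows which tokens survive.
print(cleaned.split())
```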

In [None]:
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ",  regex=True)

In [None]:
print(clean_sentences[0])
print(clean_sentences[1])
print(clean_sentences[2])

**Converting to lower case**

**Reason:**

Lowercasing makes the pipeline treat a word the same way wherever it appears. Without it, a word capitalized at the beginning of a sentence would be treated as different from the same word appearing later in the sentence without a capital letter, which can reduce accuracy. The glove.6B vectors used here are also uncased, so a capitalized word would not be found in the vocabulary. Some information is lost, such as the capitalization of proper nouns, but I think lowercasing is the better trade-off.

In [None]:
clean_sentences = [s.lower() for s in clean_sentences]

# **4. Removing stop words**

**What are the stop words?**

These are the most common words in a language (such as articles, prepositions, pronouns and conjunctions) and do not add much information to the text. Examples of stop words in English are “the”, “a”, “an”, “so”, “what”.

**Why do we remove the stop words?**

Stop words are abundant in any human language. By removing them, we strip low-level information from the text so that more focus falls on the important information. In other words, removing such words usually has no negative consequences for the model we train for our task.
Removing stop words also reduces the dataset size and thus the training time, because fewer tokens are involved in the training.
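A minimal sketch of the filtering step, assuming a tiny hand-picked stop word set (`toy_stop_words`) and a hypothetical helper `drop_stop_words`. The notebook itself uses nltk's English stop word list, which is much longer.

```python
# A tiny hand-picked stop word set, for illustration only.
toy_stop_words = {"the", "a", "an", "so", "what", "in", "of"}

def drop_stop_words(sentence, stop_words):
    """Return the sentence with every stop word removed."""
    kept = [word for word in sentence.split() if word not in stop_words]
    return " ".join(kept)

result = drop_stop_words("the athlete won a medal in the final", toy_stop_words)
print(result)
```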

In [None]:
nltk.download('stopwords')

In [None]:
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [None]:
from nltk.corpus import stopwords
# A set gives faster membership checks than a list.
stop_words = set(stopwords.words('english'))

In [None]:
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [None]:
clean_sentences[0:5]

# **5. Vector representation of sentences**

In [None]:
# A vector representation is a prerequisite for building the similarity matrix.

In [None]:
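# Represent each sentence as the average of its word vectors.
# Words missing from GloVe count as zero vectors; the small 0.001 term guards against division by zero.
sentence_vectors = []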
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)
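The averaging step can be checked by hand on a toy example. The 3-dimensional embeddings and the helper `sentence_vector` below are assumptions for illustration; unlike the loop above, this sketch divides by the exact word count rather than the count plus 0.001.

```python
import numpy as np

# Made-up 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "athlete": np.array([1.0, 0.0, 2.0]),
    "won":     np.array([0.0, 2.0, 0.0]),
    "medal":   np.array([2.0, 1.0, 1.0]),
}
dim = 3

def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the words in a sentence.

    Unknown words contribute a zero vector, and an empty sentence
    maps to the zero vector.
    """
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.sum(vectors, axis=0) / len(words)

vec = sentence_vector("athlete won medal", toy_embeddings, dim)
print(vec)
```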

# **6. Similarity matrix**

I will use cosine similarity to measure the similarity between sentences. Sentences that are highly similar to many other sentences are treated as more important; we will rank them accordingly and later build the summary from the top-ranked ones.

[Read more on cosine similarity.](https://www.machinelearningplus.com/nlp/cosine-similarity/#:~:text=Cosine%20similarity%20is%20a%20metric,in%20a%20multi%2Ddimensional%20space.&text=The%20smaller%20the%20angle%2C%20higher%20the%20cosine%20similarity.)
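For reference, the cosine similarity of two sentence vectors $A$ and $B$ is the cosine of the angle $\theta$ between them:

$$
\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{k} A_k B_k}{\sqrt{\sum_{k} A_k^2}\,\sqrt{\sum_{k} B_k^2}}
$$

Vectors pointing in the same direction score 1, and orthogonal vectors score 0.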

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
similarity_matrix = np.zeros([len(sentences), len(sentences)])
# The above code creates a square matrix of zeros with one row and one column per sentence.

In [None]:
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      similarity_matrix[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]

In [None]:
print(similarity_matrix.shape)
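The double loop above calls `cosine_similarity` once per pair of sentences, which can be slow for many sentences. As a possible alternative, here is a vectorised sketch in plain NumPy that computes all pairs at once. The function name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration; all-zero rows are kept at similarity 0.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities with a zero diagonal."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows at zero instead of dividing by 0
    unit = X / norms          # every non-zero row now has length 1
    sim = unit @ unit.T       # dot products of unit vectors are cosines
    np.fill_diagonal(sim, 0.0)  # a sentence is not compared with itself
    return sim

# Made-up 2-dimensional sentence vectors, for illustration only.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(toy_sim.round(3))
```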

# **7. Converting the similarity matrix into a graph**

The nodes of this graph will represent the sentences and the edges will carry the similarity scores between the sentences. On this graph, we will apply the PageRank algorithm to arrive at the sentence rankings.

In [None]:
import networkx as nx

nx_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(nx_graph)
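To make the ranking step concrete, here is a simplified power-iteration sketch of weighted PageRank in NumPy. It is not networkx's implementation: the function name, the default parameters and the toy matrix are assumptions for illustration, and the sketch assumes all similarity weights are non-negative.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=100):
    """Simplified weighted PageRank on a non-negative similarity matrix."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    # Row-normalise so each row is a probability distribution over neighbours;
    # rows with no outgoing weight jump uniformly to every node.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    P = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1 - damping) / n + damping * (P.T @ scores)
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Made-up similarity matrix: sentences 0 and 1 are close, sentence 2 is an outlier.
toy_matrix = [[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.2],
              [0.1, 0.2, 0.0]]
toy_scores = pagerank_power_iteration(toy_matrix)
print(toy_scores)
```

The scores sum to 1, and the sentence that is least similar to the others receives the lowest score.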

# **8. Summarization**

Sorting the sentences by score, highest first.

In [None]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [None]:
# Extract top 10 sentences as the summary
for i in range(10):
  print(ranked_sentences[i][1])
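If there are fewer than 10 sentences, the loop above raises an `IndexError`. As a hedged alternative, the sketch below selects up to `n` sentences with `heapq.nlargest`. The helper `top_sentences` and the toy data are assumptions for illustration; the scores are a dict keyed by sentence index, like the result of `nx.pagerank`.

```python
import heapq

def top_sentences(scores, sentences, n=10):
    """Return up to n sentences with the highest scores, best first."""
    best = heapq.nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in best]

# Made-up sentences and scores, for illustration only.
toy_sentences = ["first", "second", "third"]
toy_scores = {0: 0.2, 1: 0.5, 2: 0.3}

summary = top_sentences(toy_scores, toy_sentences, n=2)
print(summary)
```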