Text summarization condenses a piece of text into a shorter version while preserving its key information and overall meaning. Because summarizing text by hand is slow and laborious, automating the task is gaining popularity and has become a strong motivation for academic research. In this article, I will take you through summarizing text with Natural Language Processing and Machine Learning.

Text summarization has important applications across Natural Language Processing tasks such as text classification, question answering, legal text summarization, news summarization, and headline generation. The goal of summarizing a text is to create an accurate and fluent summary containing only the main points of the document.

#### Types of Approaches to Summarize Text

Text summarization approaches are generally classified into two main types: extractive and abstractive. Let’s go through both before we dive into the coding part.

#### 1. The Extractive Approach

The extractive approach takes sentences directly from the document, ranked by a scoring function, and combines them into a cohesive summary. It works by identifying the important sections of the text, then cropping and assembling them to produce a condensed version.
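
To make the idea of a scoring function concrete, here is a minimal sketch of an extractive summarizer. It is not the TextRank method used later in this article: it simply scores each sentence by the average frequency of its words across the document. The names `score_sentences` and `extractive_summary` and the toy `doc` are my own assumptions for illustration.

```python
import re
from collections import Counter

def score_sentences(sentences):
    """Score each sentence by the average document-wide frequency of its words."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    return [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]

def extractive_summary(sentences, k=2):
    """Pick the k best-scoring sentences and return them in document order."""
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

# Hypothetical toy document, already split into sentences
doc = [
    "The match lasted three hours.",
    "The champion won the match in the final set.",
    "Fans cheered loudly.",
]
summary = extractive_summary(doc, k=1)
print(summary)
```

The summary is built only from sentences that already exist in the document, which is what makes the approach extractive.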

#### 2. The Abstractive Approach

The abstractive approach interprets the text using advanced natural language techniques and generates a new, shorter text that conveys the most important information. Parts of this summary may not appear in the original document.

Here I use the extractive approach to summarize text with Machine Learning and Python. I will be using TextRank, an extractive, unsupervised algorithm for text summarization.






In [None]:
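import pandas as pd
from google.colab import files  # Colab-only helper for uploading local files

# Upload tennis.csv from the local machine, then load it into a DataFrame
uploaded = files.upload()
df = pd.read_csv("tennis.csv")
df.head()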

In [None]:
df['article_text'][1]

Next, split the articles into individual sentences by tokenizing them and collecting the results in a list:



In [None]:
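import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # tokenizer data; newer NLTK releases may need 'punkt_tab' instead

# Tokenize each article into sentences, then flatten into a single list
sentences = []
for s in df['article_text']:
    sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x]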

Now I am going to use GloVe for word representation. It is an unsupervised learning algorithm developed at Stanford University that generates word embeddings from the aggregated global word-to-word co-occurrence statistics of a corpus.
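
To illustrate what a word-to-word co-occurrence matrix is, here is a rough sketch that counts how often two words appear within a small window of each other. This is only the counting step, in simplified form: GloVe then fits vectors to the logarithm of such counts, and that training step is not shown. The function name `cooccurrence_counts`, the `window` parameter, and the toy corpus are assumptions for illustration.

```python
from collections import defaultdict

def cooccurrence_counts(tokenized_sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(float)
    for tokens in tokenized_sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if i != j:
                    counts[(word, tokens[j])] += 1.0
    return dict(counts)

# Hypothetical toy corpus, already tokenized into words
toy_corpus = [
    ["federer", "won", "the", "match"],
    ["nadal", "won", "the", "final"],
]
counts = cooccurrence_counts(toy_corpus, window=1)
print(counts[("won", "the")])
```

In this article we skip training entirely and load vectors that were already trained on a large corpus.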

In [None]:
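import numpy as np
from nltk.corpus import stopwords

# Load the pre-trained 100-dimensional GloVe vectors into a dict: word -> vector
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

# Replace everything except letters with spaces, then lowercase.
# regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default.
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)
clean_sentences = [s.lower() for s in clean_sentences]

# Requires nltk.download('stopwords') on first use
stop_words = stopwords.words('english')

def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]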

Next, create a vector for each sentence by averaging the GloVe vectors of its words:


In [None]:
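sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        # Average the word vectors; unknown words count as zero vectors,
        # and the 0.001 keeps the denominator non-zero
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)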
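
The averaging step can be checked by hand on a tiny example. The sketch below applies the same formula as the cell above to made-up 3-dimensional embeddings. The helper `sentence_vector` and the values in `toy_embeddings` are hypothetical and are not real GloVe vectors.

```python
import numpy as np

def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a sentence; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    # Same smoothing term as in the article's code
    return np.sum(vectors, axis=0) / (len(words) + 0.001)

# Hypothetical toy embeddings for illustration only
toy_embeddings = {
    "tennis": np.array([1.0, 0.0, 2.0]),
    "match": np.array([3.0, 2.0, 0.0]),
}
vec = sentence_vector("tennis match", toy_embeddings, dim=3)
print(vec)
```

Each component of `vec` is the sum of the two word vectors divided by 2.001, so it is very close to their plain average.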

#### Finding Similarities to Summarize Text

The next step is to find similarities between the sentences, and I will use cosine similarity for this. Let’s create an empty similarity matrix and fill it with the cosine similarities of the sentences:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Start from an all-zero matrix; the diagonal (self-similarity) stays at zero
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
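
The nested loop above makes one library call per sentence pair, which gets slow as the number of sentences grows. As a possible alternative, here is a sketch that computes the whole matrix at once with NumPy. The function name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration. Treating all-zero vectors as having zero similarity to everything is a choice I made here.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity with the diagonal set to zero."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero vectors from causing division by zero
    unit = X / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # ignore self-similarity, as in the loop version
    return sim

# Hypothetical 2-dimensional vectors, including one all-zero vector
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(toy_sim.round(3))
```

With the real data, this would be called as `cosine_similarity_matrix(sentence_vectors)`. scikit-learn's `cosine_similarity` also accepts a full 2-D array of vectors, so passing `np.vstack(sentence_vectors)` to it is another way to avoid the loop.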

Now I am going to convert the sim_mat similarity matrix into a graph. The nodes in this graph represent the sentences, and the edges represent the similarity scores between them. Applying PageRank to this graph gives each sentence a score:

In [None]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
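
To show roughly what happens inside the PageRank call, here is a simplified power-iteration sketch on a small similarity matrix. It is not the networkx implementation. The function name `pagerank_power_iteration`, its parameters, and the toy matrix are assumptions for illustration. The damping factor of 0.85 matches the default `alpha` of `nx.pagerank`.

```python
import numpy as np

def pagerank_power_iteration(sim, damping=0.85, tol=1e-8, max_iter=100):
    """Simplified weighted PageRank on a symmetric similarity matrix."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    col_sums = sim.sum(axis=0)
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    # Normalize each column to sum to 1; isolated nodes spread their rank uniformly
    transition = np.where(col_sums > 0, sim / safe_sums, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * (transition @ rank)
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank

# Hypothetical similarity matrix for three sentences
toy_sim_mat = np.array([
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.8],
    [0.1, 0.8, 0.0],
])
toy_scores = pagerank_power_iteration(toy_sim_mat)
print(toy_scores)
```

The second sentence is strongly similar to both of the others, so it ends up with the highest score. That is the intuition behind TextRank: a sentence is important if it resembles many other sentences.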

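Finally, let’s print the highest-ranked sentences as the summary:

In [None]:
# Rank every sentence by its PageRank score, highest first
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

# The ranking covers all articles together, so the top sentences summarize the whole collection
print("SUMMARY:")
for score, sentence in ranked_sentences[:5]:
    print(sentence)
    print('\n')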