
In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
from nltk.tokenize import sent_tokenize
import numpy as np
import networkx as nx
import re

In [3]:
def read_article(text):
  # Split the raw text into sentences with NLTK's Punkt tokenizer.
  # The sentences are returned unmodified, so the summary keeps its
  # original punctuation and casing.
  sentences = sent_tokenize(text)
  return sentences
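
If punctuation should be stripped before sentences are compared, a regex substitution is one way to do it. The following is a minimal sketch under that assumption; `clean_sentence` is a hypothetical helper and is not part of this notebook.

```python
import re

def clean_sentence(sentence):
    """Replace every character outside a-z, A-Z, 0-9 with a space (hypothetical helper)."""
    # str.replace treats its first argument as a literal string, so a
    # character class such as [^a-zA-Z0-9] needs re.sub instead.
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", sentence)
    # Collapse the runs of spaces left behind by the removed punctuation.
    return " ".join(cleaned.split())

print(clean_sentence("Thorn-apple, a poisonous weed!"))
# prints: Thorn apple a poisonous weed
```

Because Python strings are immutable, the cleaned value has to be captured from the return value; the input sentence itself is never changed.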

In [4]:
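def sentence_similarity(sent1, sent2, stopwords=None):
  if stopwords is None:
    stopwords = []
  # NOTE: in this notebook sent1 and sent2 arrive as plain strings, so
  # iterating over them yields characters, not words, and the vectors
  # below hold character counts. Pass lists of word tokens instead to
  # compare sentences at the word level.
  sent1 = [w.lower() for w in sent1]
  sent2 = [w.lower() for w in sent2]

  all_words = list(set(sent1 + sent2))

  vector1 = [0] * len(all_words)
  vector2 = [0] * len(all_words)
  # build the vector for the first sentence
  for w in sent1:
    if w not in stopwords:
      vector1[all_words.index(w)] += 1
  # build the vector for the second sentence
  for w in sent2:
    if w not in stopwords:
      vector2[all_words.index(w)] += 1

  # cosine similarity = 1 - cosine distance
  return 1 - cosine_distance(vector1, vector2)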
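
For comparison, here is a minimal sketch of a word-level cosine similarity, assuming a simple regex tokenizer rather than an NLTK one. `word_cosine_similarity` is a hypothetical name, and the function is not what the notebook's recorded results were produced with.

```python
import math
import re
from collections import Counter

def word_cosine_similarity(sent1, sent2, stopwords=()):
    """Cosine similarity between the word-count vectors of two sentences."""
    def counts(sentence):
        # Assumed tokenizer: lowercase runs of letters and digits.
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        return Counter(w for w in words if w not in stopwords)

    c1, c2 = counts(sent1), counts(sent2)
    # Only words present in both sentences contribute to the dot product.
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    # A sentence made up only of stopwords has a zero vector; treat it as dissimilar.
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

a = "The rice contained a poisonous weed."
b = "Tests showed the rice contained the weed."
print(word_cosine_similarity(a, b, stopwords={"the", "a"}))
```

Returning 0.0 for an all-stopword sentence avoids a division by zero; whether that is the right convention depends on how such sentences should rank.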

In [5]:
def build_similarity_matrix(sentences, stop_words):
  # create an empty similarity matrix
  similarity_matrix = np.zeros((len(sentences), len(sentences)))
  for idx1 in range(len(sentences)):
    for idx2 in range(len(sentences)):
      if idx1 != idx2:
        similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
  return similarity_matrix

In [6]:
def generate_summary(text, top_n):
  nltk.download('stopwords')
  nltk.download('punkt')
  stop_words = stopwords.words('english')
  summarize_text = []
  # Step 1: read text and tokenize into sentences
  sentences = read_article(text)
  # Step 2: generate similarity matrix
  sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)
  # Step 3: rank sentences by running PageRank on the similarity graph
  sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
  scores = nx.pagerank(sentence_similarity_graph)
  # Step 4: sort sentences by score, highest first
  ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

  # Step 5: take the top n sentences, capped at the number available
  for i in range(min(top_n, len(ranked_sentences))):
    summarize_text.append(ranked_sentences[i][1])
  # Step 6: return the summary and the number of sentences in the source text
  return " ".join(summarize_text), len(sentences)
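
To make the ranking step more concrete, the sketch below runs a simplified weighted PageRank by power iteration on a toy similarity matrix. It is illustrative only and is not networkx's implementation: `pagerank_scores` is a hypothetical function, the damping factor of 0.85 is an assumption, and nodes without any edges get no special handling here.

```python
def pagerank_scores(similarity, damping=0.85, iterations=100):
    """Simplified weighted PageRank via power iteration (illustrative only)."""
    n = len(similarity)
    # Total outgoing weight per node, used to normalize each row.
    row_sums = [sum(row) for row in similarity]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Rank flowing into node j from every node i, split by edge weight.
            incoming = sum(
                scores[i] * similarity[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Toy symmetric similarity matrix for three sentences (made-up values).
toy_similarity = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
toy_scores = pagerank_scores(toy_similarity)
# Sentence indices ordered from highest to lowest score.
ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)
print(ranking)
```

In the toy matrix, sentence 1 has the strongest links to the other two, so it ends up ranked first; a summary would pick sentences in that order.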

In [7]:
result = generate_summary("Several cooks and kitchen workers at the roadside canteen were arrested after investigators found that rice there contained datura, a poisonous weed of the nightshade family. Laboratory tests showed the rice contained quantities of the weed, which is otherwise known as thorn-apple. But physicians caring for some of the survivors have questioned whether the small amounts detected could have been the cause of illness. Reports said the victims, all of them men, began to become delirious within two hours after eating, with symptoms that included heavy salivating and extreme breathing difficulties. An account in The Pioneer, a newspaper published in New Delhi, described wrenching scenes among the victims, most of whom had come to the city to work from distant states like Uttar Pradesh in north-central India and were thus without family members to care for them. ''Almost all of them could be seen thrashing around, writhing with pain and tied with gauze bandages to their beds,'' the report said. Other accounts said that local hospitals had no supply available of an antitoxin that could have helped the men and that it was five days before supplies were flown in from Poland. The newspaper accounts of the discovery of the poisonous weed and the detention of canteen employees, two of whom fled 1,000 miles to their homes in Uttar Pradesh, also suggested possible motives. The accounts said the police were investigating the relationship between one of the detained employees and the canteen owner, who was among those who died. They also said the police were probing a possible link to a dispute that began when the owner resisted pressures from other canteen operators to raise prices. The episode has focused attention and renewed concern across India on the woeful conditions that face the vast majority of Indians who are poor, and in particular on the health hazards caused by the country's dismal public hygiene and sanitation. One recent study showed that 700 million of the country's 930 million people, having no toilets, either defecate into buckets or on open land. Shortcomings in hygiene, especially in the food chain, were emphasized by two other episodes of food poisoning, this time involving schoolchildren, that occurred in the same district as the textile town a week after the canteen disaster. Almost 440 children, most of them ages 6 to 12, fell ill after eating milk-based sweets known as pedas that schools distributed at celebrations commemorating India's independence on Aug. 15, 1947. The food had been kept unrefrigerated in 90-degree heat. None of the children died, though many remained in the hospital for days. The possibility that the canteen deaths resulted from datura poisoning set Indians to delving into folklore, which indicates that the weed has been used for centuries, especially in north India, for various kinds of skulduggery. The Pioneer reported that datura seeds ''are often mixed in food and offered to train passengers to knock them off and rob them.'' As early as the 16th century, accounts by early western travelers to India told of servants drugging their employers with the weed in order to rob them and of wives punishing unfaithful husbands by intoxicating them with datura-based potions. For decades, Indian leaders have promised crash programs to improve conditions. This year, Prime Minister H. D. Deve Gowda, who grew up the son of a poor farmer, renewed the pledges in his independence day speech. ''I regret that despite 50 years of independence, basic amenities still elude the rural and urban poor,'' he said in the speech in New Delhi on Thursday. He said his Government would increase spending on the poor and provide at least basic levels of sanitation, electricity and housing for all of India's people by 2000.",2)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [8]:
result

('The possibility that the canteen deaths resulted from datura poisoning set Indians to delving into folklore, which indicates that the weed has been used for centuries, especially in north India, for various kinds of skulduggery. Shortcomings in hygiene, especially in the food chain, were emphasized by two other episodes of food poisoning, this time involving schoolchildren, that occurred in the same district as the textile town a week after the canteen disaster.',
 23)

In [9]:
type(result)

tuple

In [10]:
model_text, num_sentences = result  # summary text, number of sentences in the source article

In [11]:
model_text

'The possibility that the canteen deaths resulted from datura poisoning set Indians to delving into folklore, which indicates that the weed has been used for centuries, especially in north India, for various kinds of skulduggery. Shortcomings in hygiene, especially in the food chain, were emphasized by two other episodes of food poisoning, this time involving schoolchildren, that occurred in the same district as the textile town a week after the canteen disaster.'

In [12]:
reference = "With the death toll at 52 in what has been described as India's worst food poisoning case in many years and with 18 more patients still in intensive care, investigators in a textile town north of Bombay say they are still unsure whether the episode was a case of mass murder or an accident. More than 120 people, most of them migrant workers at local textile plants, became ill 12 days ago after eating lunch at a canteen in Bhiwandi, a boom town about 75 miles north of Bombay."

In [13]:
%pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [14]:
from rouge import Rouge

In [15]:
rouge = Rouge()


In [16]:
rouge.get_scores(model_text, reference)


[{'rouge-1': {'f': 0.22047243599727212, 'p': 0.24561403508771928, 'r': 0.2},
  'rouge-2': {'f': 0.026143785926781224,
   'p': 0.029850746268656716,
   'r': 0.023255813953488372},
  'rouge-l': {'f': 0.15748031001302018,
   'p': 0.17543859649122806,
   'r': 0.14285714285714285}}]
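
Each entry above reports precision (`p`), recall (`r`) and an F-score (`f`). The reported `f` values are consistent with the harmonic mean of `p` and `r`: for rouge-1, 2 × 0.2456 × 0.2 / (0.2456 + 0.2) ≈ 0.2205. The sketch below shows one common way to compute unigram-overlap scores of this kind. It assumes whitespace tokenization and clipped counts, so it is not a reimplementation of the `rouge` package and its numbers need not match the ones above; `rouge_1` is a hypothetical name.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 (illustrative, whitespace tokens)."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

toy = rouge_1("the rice contained the weed", "the rice was poisoned")
print(toy)
```

In the toy pair, two of the five summary words also occur in the reference (precision 2/5) and they cover two of its four words (recall 2/4).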