<a href="https://colab.research.google.com/github/suhanik19/research-paper-summarizer/blob/main/Summarize_Research_Papers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U sentence-transformers
!pip install keybert
!pip install transformers

import nltk
nltk.download('wordnet')

Collecting sentence-transformers
  Downloading sentence_transformers-3.1.0-py3-none-any.whl.metadata (23 kB)
Downloading sentence_transformers-3.1.0-py3-none-any.whl (249 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.1.0
Collecting keybert
  Downloading keybert-0.8.5-py3-none-any.whl.metadata (15 kB)
Downloading keybert-0.8.5-py3-none-any.whl (37 kB)
Installing collected packages: keybert
Successfully installed keybert-0.8.5


[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from sentence_transformers import SentenceTransformer, util
from keybert import KeyBERT
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline
import numpy as np
import os
import json
import requests



In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')
key_model = KeyBERT('all-MiniLM-L6-v2')
sentiment_model = "distilbert-base-uncased-finetuned-sst-2-english"
lemmatizer = WordNetLemmatizer()


def summarizer(abstract):
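  # Split the text into sentences, dropping empty fragments and stray whitespace
  # (a plain split on "." is a rough heuristic: it also breaks on decimals and abbreviations)
  papers = [s.strip() for s in abstract.split(".") if s.strip()]
  if not papers:
    print("\nSummary: no sentences found")
    return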

  # Compute sentence embeddings
  sentence_embeddings = np.array(model.encode(papers))

  # Calculate the mean embedding (overall theme of the text)
  mean_embedding = np.mean(sentence_embeddings, axis=0)

  similarities = cosine_similarity([mean_embedding], sentence_embeddings)[0]

  # Rank sentences based on cosine similarity (highest similarity = most important)
  top_n = int(input("Number of sentences you want in the summary: "))
  top_n = max(1, min(top_n, len(papers)))  # clamp: a value of 0 would otherwise select every sentence
  top_sentence_indices = similarities.argsort()[-top_n:][::-1]  # Get top-N indices, best first

  # Generate the summary
  summary = [papers[i] for i in top_sentence_indices]

  # Print the summary
  print("\nSummary:")
  for sentence in summary:
      print(sentence)

def keywords(abstract):

  # Extract only unique keywords (convert to set and back to list to ensure uniqueness)
  # unique_keywords = list(set([keyword for keyword, score in keywords]))

  # Function to get the part of speech for lemmatization
  def get_wordnet_pos(word):
      """Return the WordNet part of speech of the word's first synset (noun if none is found), so the lemmatizer is applied with the right POS."""
      tag = wordnet.synsets(word)
      if not tag:
          return wordnet.NOUN
      tag = tag[0].pos()
      return tag

  # Extract keywords (keyphrase_ngram_range=(1, 1) ensures single words)
  keywords = key_model.extract_keywords(abstract, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=10)

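  # Lemmatize the keywords to merge inflected forms (like "models" and "model")
  lemmatized = [lemmatizer.lemmatize(keyword, get_wordnet_pos(keyword)) for keyword, score in keywords]
  # Deduplicate with dict.fromkeys, which keeps KeyBERT's ranking order (a set would scramble it)
  lemmatized_keywords = list(dict.fromkeys(lemmatized))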

  # Select the top N unique keywords
  top_n = 5
  final_keywords = lemmatized_keywords[:top_n]

  # Print unique and lemmatized keywords
  print("\nUnique Keywords:")
  print(final_keywords)

def sentiment_analysis(abstract):
  # Perform sentiment analysis using the Hugging Face model
  # (the pipeline is rebuilt on every call; pass the model by keyword for clarity)
  sentiment_analysis_model = pipeline("sentiment-analysis", model=sentiment_model)
  sentiment_results = sentiment_analysis_model(abstract)

  # Print the sentiment result
  print("\nSentiment Analysis Result:")
  print(sentiment_results)


def main():
  question = input("Choose between the following: \n1. Summarizer \n2. Keywords \n3. Sentiment Analysis \n4. All \n")
  abstract = input("Enter abstract here: ")
  if question == "1":
    summarizer(abstract)
  elif question == "2":
    keywords(abstract)
  elif question == "3":
    sentiment_analysis(abstract)
  elif question == "4":
    summarizer(abstract)
    keywords(abstract)
    sentiment_analysis(abstract)
  else:
    print("Invalid input")

main()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Choose between the following: 
1. Summarizer 
2. Keywords 
3. Sentiment Analysis 
4. All 
4
Enter abstract here: The increasing complexity of climate systems presents a significant challenge in accurately predicting future environmental conditions. This study explores the application of deep neural networks (DNNs) to enhance climate prediction models under conditions of high uncertainty. Leveraging historical climate data and advanced machine learning algorithms, we constructed a multi-layered neural network capable of learning non-linear dependencies between atmospheric variables. The model was trained on both regional and global climate datasets, incorporating variables such as temperature, humidity, and greenhouse gas concentrations. Results indicate that the proposed DNN model outperforms traditional statistical methods, particularly in scenarios with incomplete or noisy data. The model's predictive accuracy was validated using cross-validation techniques, with performance metrics 


Sentiment Analysis Result:
[{'label': 'POSITIVE', 'score': 0.99456787109375}]
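
The ranking step inside `summarizer` can be tried without downloading any model. The sketch below is a dependency-free illustration, not the notebook's code: `rank_by_centroid` and `cosine` are hypothetical helper names, and the 2-D vectors are made-up stand-ins for real sentence embeddings. It scores each vector by cosine similarity to the mean vector and returns the indices of the best matches, which is the same idea as the `np.mean` plus `cosine_similarity` steps above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_centroid(embeddings, top_n):
    """Return the indices of the top_n vectors closest to the mean vector."""
    dim = len(embeddings[0])
    # The centroid stands in for the "overall theme" of the text.
    centroid = [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]
    scores = [cosine(vec, centroid) for vec in embeddings]
    # Sort indices by score, highest similarity first.
    order = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)
    return order[:top_n]

# Made-up 2-D "embeddings" for three sentences (illustrative values only).
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(rank_by_centroid(toy_embeddings, top_n=1))
```

Here the middle vector points in the same direction as the centroid, so its index is the one selected. With real embeddings from `model.encode`, the same ranking picks the sentences that sit closest to the abstract's average meaning.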
