<a href="https://colab.research.google.com/github/suhanik19/research-paper-summarizer/blob/main/Summarize_Research_Papers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Builds extractive summaries of research-paper abstracts by ranking sentences by cosine similarity, finds similar research papers, extracts the top 5 keywords, and runs sentiment analysis with a Hugging Face model.
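As a rough illustration of the cosine-similarity ranking behind the summarizer, the sketch below scores each sentence vector against the mean of all vectors and keeps the closest ones. It uses made-up 2-D vectors in place of real sentence embeddings, and the helper names `cosine` and `rank_by_centroid` are hypothetical, not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_centroid(vectors, top_n):
    """Return indices of the top_n vectors most similar to the mean vector, best first."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    scores = [cosine(centroid, v) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return order[:top_n]

# Toy 2-D "embeddings" (invented for illustration; real ones come from a sentence encoder)
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
top = rank_by_centroid(toy_vectors, top_n=2)
print(top)
```

Slicing the sorted order with `[:top_n]` also behaves sensibly when `top_n` is 0, returning an empty list rather than every sentence.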

In [None]:
!pip install -U sentence-transformers
!pip install keybert
!pip install transformers

import nltk
nltk.download('wordnet')



[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from sentence_transformers import SentenceTransformer, util
from keybert import KeyBERT
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline
import numpy as np
import os
import json
import requests

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')
key_model = KeyBERT('all-MiniLM-L6-v2')
sentiment_model = "distilbert-base-uncased-finetuned-sst-2-english"
lemmatizer = WordNetLemmatizer()


def summarizer(abstract):
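  # Split the text into sentences on periods, dropping empty fragments
  # (e.g. the empty string after the final period). Note that this naive
  # split also breaks on decimals and abbreviations.
  papers = [s.strip() for s in abstract.split(".") if s.strip()]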

  # Compute sentence embeddings
  sentence_embeddings = np.array(model.encode(papers))

  # Calculate the mean embedding (overall theme of the text)
  mean_embedding = np.mean(sentence_embeddings, axis=0)

  similarities = cosine_similarity([mean_embedding], sentence_embeddings)[0]

  # Ask how many sentences the summary should contain
  top_n = int(input("Number of sentences you want in the summary: "))

  # Rank sentences by cosine similarity (highest similarity = most important).
  # Sorting descending and slicing with [:top_n] also handles top_n == 0,
  # whereas [-top_n:] would return every sentence in that case.
  top_sentence_indices = similarities.argsort()[::-1][:top_n]

  # Generate the summary
  summary = [papers[i] for i in top_sentence_indices]

  # Print the summary
  print("\nSummary:")
  for sentence in summary:
      print(sentence)

def keywords(abstract):
  """Print up to five unique, lemmatized keywords extracted from the abstract."""

  # Function to get the part of speech for lemmatization
  def get_wordnet_pos(word):
      """Return the part of speech of the word's first WordNet synset (noun if none is found), so the lemmatizer treats it as a noun, verb, adjective, etc."""
      synsets = wordnet.synsets(word)
      if not synsets:
          return wordnet.NOUN
      return synsets[0].pos()

  # Extract keywords (keyphrase_ngram_range=(1, 1) ensures single words)
  extracted = key_model.extract_keywords(abstract, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=10)

  # Lemmatize the keywords so inflected forms collapse together (like "predicting" and "predict").
  # dict.fromkeys removes duplicates while keeping the order KeyBERT returned;
  # list(set(...)) would scramble it, making the "top N" below arbitrary.
  lemmatized_keywords = list(dict.fromkeys(lemmatizer.lemmatize(keyword, get_wordnet_pos(keyword)) for keyword, score in extracted))

  # Select the top N unique keywords
  top_n = 5
  final_keywords = lemmatized_keywords[:top_n]

  # Print unique and lemmatized keywords
  print("\nUnique Keywords:")
  print(final_keywords)

def sentiment_analysis(abstract):
  # Perform sentiment analysis using the Hugging Face model
  sentiment_analysis_model = pipeline("sentiment-analysis", model=sentiment_model)
  sentiment_results = sentiment_analysis_model(abstract)

  # Print the sentiment result
  print("\nSentiment Analysis Result:")
  print(sentiment_results)


def main():
  question = input("Choose between the following: \n1. Summarizer \n2. Keywords \n3. Sentiment Analysis \n4. All \n")
  abstract = input("Enter abstract here: ")
  if question == "1":
    summarizer(abstract)
  elif question == "2":
    keywords(abstract)
  elif question == "3":
    sentiment_analysis(abstract)
  elif question == "4":
    summarizer(abstract)
    keywords(abstract)
    sentiment_analysis(abstract)
  else:
    print("Invalid input")

main()

Choose between the following: 
1. Summarizer 
2. Keywords 
3. Sentiment Analysis 
4. All 
4
Enter abstract here: The increasing complexity of climate systems presents a significant challenge in accurately predicting future environmental conditions. This study explores the application of deep neural networks (DNNs) to enhance climate prediction models under conditions of high uncertainty. Leveraging historical climate data and advanced machine learning algorithms, we constructed a multi-layered neural network capable of learning non-linear dependencies between atmospheric variables. The model was trained on both regional and global climate datasets, incorporating variables such as temperature, humidity, and greenhouse gas concentrations. Results indicate that the proposed DNN model outperforms traditional statistical methods, particularly in scenarios with incomplete or noisy data. The model's predictive accuracy was validated using cross-validation techniques, with performance metrics 

During trials, I noticed that one of the outputs was ['climate', 'predictive', 'predicting', 'forecasts', 'prediction']. Keywords like "predictive," "predicting," and "prediction" appear as separate keywords even though they convey the same meaning. To address this, apply lemmatization to normalize the words to their base form, and then remove duplicates or highly similar words.

Steps to Resolve the Issue:
Lemmatization: This converts inflected words to their base form (e.g., "predicting" becomes "predict"). Note that WordNet treats derived words such as "prediction" and "predictive" as lemmas in their own right, so they may remain distinct after this step.
Remove Duplicates: After lemmatization, filter out any duplicates that result from the lemmatization process.
Solution Using Lemmatization:
We will use NLTK's WordNetLemmatizer for this task, as it converts words to their base form (lemmas).
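
The two steps above can be sketched as a small order-preserving filter. This is a minimal illustration under stated assumptions, not the notebook's implementation: `unique_by_lemma`, `toy_normalize`, and the `TOY_LEMMAS` lookup table are hypothetical names, the table stands in for a real lemmatizer such as WordNetLemmatizer, and the scored keywords are invented sample data.

```python
def unique_by_lemma(scored_keywords, normalize, top_n=5):
    """Normalize each keyword and keep only the first occurrence of each
    normalized form, preserving the input (relevance) order."""
    seen = set()
    result = []
    for keyword, _score in scored_keywords:
        lemma = normalize(keyword)
        if lemma not in seen:
            seen.add(lemma)
            result.append(lemma)
    return result[:top_n]

# Hypothetical stand-in normalizer: a lookup table, not a real lemmatizer.
TOY_LEMMAS = {"predicting": "predict", "predicts": "predict", "forecasts": "forecast"}

def toy_normalize(word):
    """Lowercase the word and map it through the toy lookup table."""
    return TOY_LEMMAS.get(word.lower(), word.lower())

# Invented (keyword, score) pairs in descending-score order
scored = [("climate", 0.6), ("predicting", 0.5), ("predicts", 0.4), ("forecasts", 0.3)]
result = unique_by_lemma(scored, toy_normalize)
print(result)
```

Passing the normalizer in as a function keeps the deduplication logic independent of the lemmatizer, so a stemmer or a WordNet-based function could be swapped in without changing the filter.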