# Extractive Text Summarization Using NLP Techniques

# Step 1: Set Up Your Environment

## 1. Install Required Libraries:

In [15]:
!pip install nltk spacy gensim scikit-learn torch transformers rouge-score
!pip install python-docx PyPDF2




[notice] A new release of pip is available: 23.3.1 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip





## 2. Download Required NLP Models:

In [16]:
import nltk
nltk.download('punkt')
# import spacy
# spacy.cli.download("en_core_web_sm")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\visut\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Step 2: Data Collection

## 1. Read Input Files:
Create functions to read text, DOCX, and PDF files.

In [17]:
import os
from docx import Document
from PyPDF2 import PdfReader

def read_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

def read_docx_file(file_path):
    doc = Document(file_path)
    return "\n".join([para.text for para in doc.paragraphs])

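def read_pdf_file(file_path):
    reader = PdfReader(file_path)
    # Fall back to "" for pages with no extractable text, and separate pages
    # with a newline so words at page boundaries do not fuse together.
    return "\n".join(page.extract_text() or "" for page in reader.pages)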

def read_file(file_path):
    _, file_extension = os.path.splitext(file_path)
    # Compare in lowercase so that ".PDF" or ".Docx" are also accepted
    file_extension = file_extension.lower()
    if file_extension == '.txt':
        return read_text_file(file_path)
    elif file_extension == '.docx':
        return read_docx_file(file_path)
    elif file_extension == '.pdf':
        return read_pdf_file(file_path)
    else:
        raise ValueError(f"Unsupported file format: {file_extension!r}")

# Since the notebook and PDF are in the same folder, use just the file name
# file_path = "JP summer repot.pdf"
file_path = "D3.pdf"

text = read_file(file_path)
print(text[:500])  # Print the first 500 characters to verify the content


Genome and Symptoms based Disease Prediction
Mrunal Kurhade
Computer Engg. Dept, S.P .I.T
Mumbai, India
mrunal.kurhade@spit.ac.inMeghan Lendhe
Computer Engg. Dept, S.P .I.T
Mumbai, India
meghan.lendhe@spit.ac.inShreya Patel
Computer Engg. Dept, S.P .I.T
Mumbai, India
shreya.patel@spit.ac.in
I. L ITERATURE SURVEY
Translation of genomic knowledge for use in medical prac-
tice is a highly anticipated goal. Disease causing genomes can
be identiﬁed by various methods. Few methods are discussed
in thi


# Step 3: Preprocessing

## Clean the Text:
Remove `special characters`, `stop words`, and irrelevant information, then tokenize sentences using `nltk` or `spaCy`.
The cell below performs only the tokenization step, using `nltk`.

In [18]:
from nltk.tokenize import sent_tokenize, word_tokenize

def preprocess(text):
    sentences = sent_tokenize(text)
    words = [word_tokenize(sentence) for sentence in sentences]
    # Additional cleaning steps like removing stop words, lowercasing, etc.
    return sentences, words

sentences, words = preprocess(text)
print(sentences)
print(words)

['Genome and Symptoms based Disease Prediction\nMrunal Kurhade\nComputer Engg.', 'Dept, S.P .I.T\nMumbai, India\nmrunal.kurhade@spit.ac.inMeghan Lendhe\nComputer Engg.', 'Dept, S.P .I.T\nMumbai, India\nmeghan.lendhe@spit.ac.inShreya Patel\nComputer Engg.', 'Dept, S.P .I.T\nMumbai, India\nshreya.patel@spit.ac.in\nI. L ITERATURE SURVEY\nTranslation of genomic knowledge for use in medical prac-\ntice is a highly anticipated goal.', 'Disease causing genomes can\nbe identiﬁed by various methods.', 'Few methods are discussed\nin this paper.', 'Thus, genetic-based predictive models can have\nprofound impact in diagnosis and detection.', 'However, given\nthat (i) symptoms/disease are affected by genetic and environ-\nmental factors, (ii) the genetic view of susceptibility to dis-\neases is not well-understood, and (iii) replicable susceptibility\nalleles, in combination, account for only a moderate amount\nof disease heritability, prediction of disease using genomes is\ndifﬁcult [8].', 'Consid
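The output above still shows extraction artifacts such as words hyphenated across line breaks (`prac-\ntice`) and ligature characters (`identiﬁed`). The following is a minimal sketch of a cleaning step that could run before `sent_tokenize`, using only the standard library. The names `clean_extracted_text`, `content_words`, and the tiny `STOP_WORDS` set are hypothetical and chosen for illustration; `nltk.corpus.stopwords` would be the fuller choice in this notebook.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "for", "to", "and", "be", "by", "can"}

def clean_extracted_text(raw):
    """Repair common PDF-extraction artifacts before sentence splitting."""
    # NFKC normalization expands ligatures such as U+FB01 into "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "prac-\ntice" -> "practice".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse remaining newlines and runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

def content_words(sentence):
    """Lowercase a sentence, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Sample taken from the extracted text shown earlier ("\ufb01" is the "fi" ligature).
sample = (
    "Translation of genomic knowledge for use in medical prac-\n"
    "tice is a highly anticipated goal. Disease causing genomes can\n"
    "be identi\ufb01ed by various methods."
)
cleaned = clean_extracted_text(sample)
print(cleaned)
print(content_words(cleaned))
```

One caveat with this sketch: the de-hyphenation rule also joins genuine compounds that happen to break at a line end (for example `well-` followed by `understood`), so it trades a few merged compounds for many repaired words.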

# Step 4: Feature Extraction

## Extract Features:
Use `spaCy` or `nltk` for `POS tagging` and `Gensim` for `word embeddings`.
The cell below uses `nltk` and extracts `POS tags` only.

In [19]:
import nltk
from nltk import word_tokenize

# Download the necessary NLTK data files (only needed once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def extract_features(sentences):
    # Tokenize the sentences into words
    words = word_tokenize(" ".join(sentences))
    
    # Perform POS tagging
    pos_tags = nltk.pos_tag(words)
    
    # Extract only the POS tags (you can modify this if you need more information)
    pos_only = [tag for word, tag in pos_tags]
    
    # Further feature extraction like TF-IDF or word embeddings can be done here
    return pos_only

# Example usage
# sentences = ["This is a sentence.", "Here is another one."]
features = extract_features(sentences)
print(features)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\visut\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\visut\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


['NNP', 'CC', 'NNP', 'VBN', 'NNP', 'NNP', 'NNP', 'NNP', 'NNP', 'NNP', '.', 'NNP', ',', 'NNP', 'NNP', 'NNP', ',', 'NNP', 'VBD', 'NNP', 'NN', 'NNP', 'NNP', 'NNP', '.', 'NNP', ',', 'NNP', 'NNP', 'NNP', ',', 'NNP', 'NNP', 'NNP', 'VBD', 'NNP', 'NNP', 'NNP', '.', 'NNP', ',', 'NNP', 'NNP', 'NNP', ',', 'NNP', 'NN', 'NNP', 'NN', 'NNP', 'NNP', 'NNP', 'NNP', 'NNP', 'IN', 'JJ', 'NN', 'IN', 'NN', 'IN', 'JJ', 'JJ', 'NN', 'VBZ', 'DT', 'RB', 'JJ', 'NN', '.', 'NNP', 'VBG', 'NNS', 'MD', 'VB', 'VBN', 'IN', 'JJ', 'NNS', '.', 'JJ', 'NNS', 'VBP', 'VBN', 'IN', 'DT', 'NN', '.', 'RB', ',', 'JJ', 'JJ', 'NNS', 'MD', 'VB', 'VBN', 'NN', 'IN', 'NN', 'CC', 'NN', '.', 'RB', ',', 'VBN', 'IN', '(', 'NN', ')', 'NN', 'VBP', 'VBN', 'IN', 'JJ', 'CC', 'JJ', 'JJ', 'NNS', ',', '(', 'NN', ')', 'DT', 'JJ', 'NN', 'IN', 'NN', 'TO', 'JJ', 'NNS', 'VBZ', 'RB', 'JJ', ',', 'CC', '(', 'NN', ')', 'JJ', 'NN', 'NNS', ',', 'IN', 'NN', ',', 'NN', 'IN', 'RB', 'DT', 'JJ', 'NN', 'IN', 'NN', 'NN', ',', 'NN', 'IN', 'NN', 'VBG', 'NNS', 'VBZ', 'JJ
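The comment in `extract_features` mentions TF-IDF as a possible further feature. As a rough illustration of what that feature measures, here is a standard-library sketch that treats each sentence as a document and computes a TF-IDF weight per term. The function names are hypothetical. The smoothed IDF formula follows the one scikit-learn documents for `smooth_idf=True`, but the results are not identical to `TfidfVectorizer` output because this sketch uses relative term frequency and skips the L2 normalization.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())

def tfidf_per_sentence(sentences):
    """Return one {term: weight} dict per sentence, treating each sentence as a document."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: the number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        # Smoothed IDF: log((1 + n) / (1 + df)) + 1 keeps every weight positive.
        weights.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights

example_sentences = [
    "Disease causing genomes can be identified.",
    "Few methods are discussed in this paper.",
    "DNA can be used for prediction of disease.",
]
weights = tfidf_per_sentence(example_sentences)
# "genomes" occurs in one sentence and "disease" in two, so "genomes" weighs more.
print(weights[0]["genomes"], weights[0]["disease"])
```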


# Step 5: Sentence Ranking

## Develop Ranking Model:
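Use traditional ML models (like logistic regression) or transformer-based models to rank sentences.
A basic approach could involve sentence length, keyword presence, or cosine similarity with the document title.
The cell below scores each sentence by the sum of its TF-IDF cosine similarities to every sentence in the document and keeps the top five.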

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, features, top_n=5):
    # `features` is accepted for later use but does not affect the ranking yet.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)
    similarity_matrix = cosine_similarity(tfidf_matrix)
    # Score each sentence by its total similarity to all sentences in the document.
    scores = similarity_matrix.sum(axis=1)
    ranked_indices = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked_indices[:top_n]]

ranked_summary = rank_sentences(sentences, features)
print(ranked_summary)

['However, given\nthat (i) symptoms/disease are affected by genetic and environ-\nmental factors, (ii) the genetic view of susceptibility to dis-\neases is not well-understood, and (iii) replicable susceptibility\nalleles, in combination, account for only a moderate amount\nof disease heritability, prediction of disease using genomes is\ndifﬁcult [8].', 'G ENOMIC PREDICTOR FOR DISEASES\nDisease prediction based symptoms and genome analysis\nis one of the most interesting and challenging task.', 'Thus DNA can be\nused for prediction of disease.', 'However,\ngiven that (i) medical traits result from a complex interplay\nbetween genetic and environmental factors, (ii) the underlying\ngenetic architectures for susceptibility to common diseases are\nnot well-understood, and (iii) replicable susceptibility alleles,\nin combination, account for only a moderate amount of disease\nheritability, there are substantial challenges to constructing and\nimplementing genetic risk prediction models wit
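The simpler heuristics named above (sentence length, keyword presence, and cosine similarity with the document title) can be combined into a single score. The following is a minimal sketch using plain word counts rather than TF-IDF. The function names, the `0.6 / 0.2 / 0.2` weights, and the `max_len` cap are all assumptions made for illustration and would need tuning on real data.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase alphabetic tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def score_sentence(sentence, title, keywords, max_len=40):
    words = bag_of_words(sentence)
    title_sim = cosine(words, bag_of_words(title))
    keyword_hit = 1.0 if any(k in words for k in keywords) else 0.0
    # Longer sentences score higher, capped at max_len tokens.
    length_score = min(sum(words.values()), max_len) / max_len
    # Hypothetical weights; not taken from the notebook.
    return 0.6 * title_sim + 0.2 * keyword_hit + 0.2 * length_score

def rank_by_heuristics(sentences, title, keywords, top_n=2):
    ranked = sorted(sentences, key=lambda s: score_sentence(s, title, keywords), reverse=True)
    return ranked[:top_n]

title = "Genome and Symptoms based Disease Prediction"
candidates = [
    "Few methods are discussed in this paper.",
    "Thus DNA can be used for prediction of disease.",
    "Mumbai, India",
]
print(rank_by_heuristics(candidates, title, keywords={"genome", "disease"}))
```

Unlike the centrality score in `rank_sentences`, this variant favors sentences that echo the title, which may help filter out author and affiliation lines picked up from the PDF header.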

# Step 6: Summary Generation

## Generate Summaries:
Combine the top-ranked sentences to form the summary. The sentences are joined in rank order, not in their original document order.

In [22]:
summary = " ".join(ranked_summary)
print(summary)

However, given
that (i) symptoms/disease are affected by genetic and environ-
mental factors, (ii) the genetic view of susceptibility to dis-
eases is not well-understood, and (iii) replicable susceptibility
alleles, in combination, account for only a moderate amount
of disease heritability, prediction of disease using genomes is
difﬁcult [8]. G ENOMIC PREDICTOR FOR DISEASES
Disease prediction based symptoms and genome analysis
is one of the most interesting and challenging task. Thus DNA can be
used for prediction of disease. However,
given that (i) medical traits result from a complex interplay
between genetic and environmental factors, (ii) the underlying
genetic architectures for susceptibility to common diseases are
not well-understood, and (iii) replicable susceptibility alleles,
in combination, account for only a moderate amount of disease
heritability, there are substantial challenges to constructing and
implementing genetic risk prediction models with high utility. Given this 
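Because the summary above is joined in rank order and keeps the line breaks from the PDF, it can read as disjointed. A possible refinement, sketched here under the assumption that a per-sentence score list is available, is to select the top sentences by score and then emit them in their original document order with whitespace normalized. The function name `summarize_in_document_order` is hypothetical.

```python
def summarize_in_document_order(sentences, scores, top_n=5):
    """Pick the top_n highest-scoring sentences, then join them in document order."""
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    # " ".join(s.split()) collapses embedded newlines and repeated spaces.
    return " ".join(" ".join(sentences[i].split()) for i in sorted(top))

demo_sentences = ["A first.", "B\nsecond.", "C third.", "D fourth."]
demo_scores = [0.1, 0.9, 0.5, 0.7]
print(summarize_in_document_order(demo_sentences, demo_scores, top_n=2))
```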

# Step 7: Evaluation

## Evaluate Summaries:
Use `ROUGE` metrics and manual evaluation.

In [23]:
from rouge_score import rouge_scorer

# Placeholder: replace this with a human-written reference summary.
# Scoring the generated summary against itself always yields 1.0 for every metric.
reference_summary = summary

def evaluate_summary(reference, generated):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, generated)
    return scores

evaluation_scores = evaluate_summary(reference_summary, summary)
print(evaluation_scores)

{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}
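The perfect scores above come from comparing the summary with itself, so they say nothing about quality. To illustrate what ROUGE-1 measures when the reference differs, here is a simplified standard-library sketch of unigram overlap. It is not the `rouge_score` implementation (for instance, it does no stemming), and the reference sentence is a hypothetical hand-written example rather than a real reference summary for this paper.

```python
import re
from collections import Counter

def rouge1(reference, generated):
    """Simplified ROUGE-1: clipped unigram overlap between reference and generated text."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    gen = Counter(re.findall(r"[a-z0-9]+", generated.lower()))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref & gen).values())
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

# Hypothetical reference, written only for this example.
reference_example = "Predicting disease from genomes is difficult."
generated_example = "Thus DNA can be used for prediction of disease."
rouge1_scores = rouge1(reference_example, generated_example)
print(rouge1_scores)
```

Here only the word "disease" is shared, so precision is 1/9 and recall is 1/6. With `use_stemmer=True`, as in the cell above, "predicting" and "prediction" could additionally match.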


# Step 8: Experimentation and Scaling

## Experiment with Models:

Use Hugging Face’s transformers for a more advanced summarization model.
Try different architectures and hyperparameters.

In [14]:
# from transformers import pipeline

# summarizer = pipeline("summarization")
# # truncation=True cuts off input that exceeds the model's maximum input length
# transformer_summary = summarizer(text, truncation=True)[0]['summary_text']
# print(transformer_summary)
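Summarization models generally accept only a bounded number of input tokens, so a full paper may need to be split before it is passed to a `summarizer`. One possible approach, sketched below, groups whole sentences into chunks under a word budget so that each chunk can be summarized separately. The function name `chunk_sentences` and the default `max_words=400` are assumptions, and word count is only a rough proxy for the model's token count.

```python
def chunk_sentences(sentences, max_words=400):
    """Group consecutive sentences into chunks of at most max_words words.

    A single sentence longer than max_words still becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

demo_chunks = chunk_sentences(["a b c", "d e", "f g h i"], max_words=5)
print(demo_chunks)
```

Each chunk could then be summarized on its own and the partial summaries concatenated, at the cost of losing context that spans chunk boundaries.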