# Import Necessary Libraries

We will use the following libraries to process and analyze text from PDF files:

1. **pdfminer.high_level**: To extract text from PDF files.
2. **re**: Python's regular expressions module for searching and replacing text patterns.
3. **nltk**: The Natural Language Toolkit, used for text processing tasks.
4. **Counter from collections**: For counting word frequencies.

In [10]:
from pdfminer.high_level import extract_text
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from collections import Counter

# Download NLTK data ( Already downloaded)

In [11]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shivankar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shivankar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Extract Text from PDF

## Function: `extract_text_from_pdf`

This function will:
1. **Extract Text**: Use `pdfminer` to read the text from a PDF file.
2. **Fix Spacing**:
   - Insert spaces between lowercase and uppercase letters (e.g., "helloWorld" becomes "hello World").
   - Ensure there’s a space after periods and commas if followed by a letter (e.g., "end.Here" becomes "end. Here").
   - Add spaces between words and numbers (e.g., "word123" becomes "word 123").
3. **Remove Extra Whitespace**: Collapse multiple spaces into one and trim any leading or trailing spaces.

In [12]:
def extract_text_from_pdf(pdf_path):
    # Extract text using pdfminer
    text = extract_text(pdf_path)
    
    # Join characters that are incorrectly separated
    text = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)  # Insert space between lowercase and uppercase letters
    text = re.sub(r'(?<=[.])(?=[A-Za-z])', ' ', text)  # Insert space after periods if not followed by space
    text = re.sub(r'(?<=[,])(?=[A-Za-z])', ' ', text)  # Insert space after commas if not followed by space
    text = re.sub(r'(?<=\w)(?=\d)', ' ', text)  # Insert space between words and digits
    text = re.sub(r'(?<=\d)(?=\w)', ' ', text)  # Insert space between digits and words

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Summarize Text

We will define a function `summarize` that performs the following tasks:

1. **Tokenize Sentences**: Splits the text into individual sentences.
2. **Process Words**:
    - Converts text to lowercase and splits it into words.
    - Removes common English stopwords and non-alphanumeric characters.
3. **Count Word Frequencies**: Uses `Counter` to count how often each word appears in the text.
4. **Score Sentences**: Assigns a score to each sentence based on the frequency of the words it contains.
5. **Select Top Sentences**: Picks the top `num_sentences` sentences with the highest scores.
6. **Construct Summary**: Joins these top sentences into a single summary string.

In [13]:
def summarize(text, num_sentences=25):
    sentences = sent_tokenize(text)
    words = text.lower().split()
    
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words and word.isalnum()]
    
    word_freq = Counter(words)
    
    sentence_scores = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word in word_freq:
                if sentence not in sentence_scores:
                    sentence_scores[sentence] = word_freq[word]
                else:
                    sentence_scores[sentence] += word_freq[word]
    
    summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:num_sentences]
    
    summary_sentences = [sentence for sentence in sentences if sentence in summary_sentences]
    
    return ' '.join(summary_sentences)

# Specify PDF Path

In [17]:
pdf_path = r"D:\Shivankar\Resulation AI\New folder\Operations Management.pdf"

# Extract and Print Text

In this section, we will:
1. Call the `extract_text_from_pdf` function to get the full text from the PDF.
2. Print the length (word count) of the extracted text.
3. Print the first 500 characters as a sample of the extracted text.

In [18]:
# Extract text from PDF
full_text = extract_text_from_pdf(pdf_path)

# Print extracted text length
print(f"Extracted text length: {len(full_text.split())} words")

# Print a sample of the extracted text
print("\nSample of extracted text:")
print(full_text[:500])

Extracted text length: 5443 words

Sample of extracted text:
Operations Management: Oil and Gas Report Introduction Operations management is a branch of management that deals with the designing and supervision of operational processes in a business organization. Operations management covers the responsibility over all processes that involve the production of goods and services as well as the delivery of such productions to the final consumers. In its duties, an operations management department ensures that processes are planned for and executed in an effi


# Generate and Print Summary

1. Calls summarize to create a summary of the extracted text.
2. Prints the summary
3. Prints the length (word count) of the summary.

In [16]:
# Summarize the extracted text
summary = summarize(full_text)

# Print the summary
print("\nSummary:")
print(summary)

# Print summary length
print(f"\nSummary length: {len(summary.split())} words")


Summary:
Operations Management: Oil and Gas Report Introduction Operations management is a branch of management that deals with the designing and supervision of operational processes in a business organization. Operations management covers the responsibility over all processes that involve the production of goods and services as well as the delivery of such productions to the final consumers. Operations management ensures “effective management of resources and activities that produce or deliver goods and services of any business” (Sox 1). Operations management therefore involves the management of “people, materials, equipments and information resources that a business may need” (Sox 1) in its daily activities. History of Operations Management The history of operations management stems all the way back to the eighteenth century. Operations management is however still on its development with focus being made on its elements such as “market focus, globalization, quality management system

In [2]:
import streamlit
import pdfminer
import nltk

print(f"Streamlit version: {streamlit.__version__}")
print(f"pdfminer.six version: {pdfminer.__version__}")
print(f"NLTK version: {nltk.__version__}")

Streamlit version: 1.35.0
pdfminer.six version: 20240706
NLTK version: 3.8.1
