#### Tutorial: Text Extraction and Filtering from Word Documents using Python

This tutorial provides a step-by-step guide to extracting, filtering, and processing text from Word documents using Python. With these tools, you can handle a range of text-processing tasks in your NLP projects. Feel free to extend the script and customize it for your specific needs!

### Introduction

In this tutorial, we will explore how to use Python to extract text from Microsoft Word documents (.docx) and filter these documents based on specific terms. 

This guide is designed for NLP (Natural Language Processing) enthusiasts who are interested in document processing, as well as LLM (Large Language Model) enthusiasts who want to preprocess data for model training.


### We'll cover the following topics:

Extracting text from .docx files using python-docx.

Filtering documents based on the presence of specific allowed and disallowed terms.

Extending the script with additional features such as sorting and categorizing documents.

### Prerequisites

Before starting, ensure you have the following installed:

Python 3.x

The python-docx library

You can install python-docx using pip:

In [1]:
! pip install python-docx


Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Collecting lxml>=3.1.0 (from python-docx)
  Downloading lxml-5.3.0-cp312-cp312-macosx_10_9_universal2.whl.metadata (3.8 kB)
Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Downloading lxml-5.3.0-cp312-cp312-macosx_10_9_universal2.whl (8.2 MB)
Installing collected packages: lxml, python-docx
Successfully installed lxml-5.3.0 python-docx-1.1.2


A typical project layout might look like this:

your_project/
│
├── docs/
│   ├── document1.docx
│   ├── document2.docx
│   └── document3.docx
│
└── filter_documents.py


### 1. Extracting Text from Word Documents

We start by creating a function to extract text from a .docx file. This is done using the python-docx library.

In [23]:
from docx import Document

# Function to extract text from a Word document
def get_doc_text(doc_path):
    doc = Document(doc_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)
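`doc.paragraphs` covers body paragraphs only, so text stored inside tables is not returned by the function above. The sketch below is one possible way to include table cells as well. It assumes it is given an already opened python-docx `Document` object, and the name `get_doc_text_with_tables` is a hypothetical helper, not part of the library.

```python
def get_doc_text_with_tables(doc):
    """Return paragraph text followed by table text from an opened document.

    `doc` is assumed to be a python-docx Document object.
    """
    parts = [para.text for para in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # Join the cells of one row with tabs so columns stay distinguishable
            parts.append("\t".join(cell.text for cell in row.cells))
    return "\n".join(parts)
```

With python-docx installed, a call would look like `get_doc_text_with_tables(Document("BP.docx"))`. Paragraphs and tables are collected separately here, so the original reading order of the document is not preserved.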


### Example Usage:


In [4]:
text = get_doc_text("BP.docx")
print(text)


Context 

Market Analysis 

Strategy and development plan 

Context Section

Context: High-Frequency Trading (HFT) Industry Overview and the Need for Accessibility

Industry Broad Spectrum:
The High-Frequency Trading (HFT) industry is a dynamic and influential segment of the financial markets. HFT accounts for a substantial portion of trading activity, especially in the U.S. equities market, where it constitutes approximately 56% of trade volume (Jones et al., 2020). This high level of activity underscores HFT's critical role in modern trading ecosystems.

Challenges and Trends:
While HFT has introduced efficiencies and liquidity to the market, it also presents significant challenges. Regulatory bodies, such as the SEC in the United States and the European Securities and Markets Authority (ESMA) in Europe, have implemented stringent measures to ensure market integrity and fairness (SEC, 2018). Technological advancements continue to shape the HFT landscape, with firms constantly innovat

### 2. Filtering Documents Based on Specific Terms

Next, we create a function that checks whether a document contains any of the allowed terms and excludes those with disallowed terms.

In [5]:
# Function to check if document contains any of the allowed terms and excludes the disallowed ones
def contains_specific_terms(text, allowed_terms, disallowed_terms):
    contains_allowed = any(term in text for term in allowed_terms)
    contains_disallowed = any(term in text for term in disallowed_terms)
    return contains_allowed and not contains_disallowed
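The `in` check above is a case-sensitive substring test, so "C1" also matches inside "C10", and "confidential" would not match "Confidential". If that is not what you want, one option is a case-insensitive whole-word match with the standard `re` module. This is a minimal sketch; the function name `contains_specific_terms_ci` is hypothetical.

```python
import re

def contains_specific_terms_ci(text, allowed_terms, disallowed_terms):
    """Case-insensitive, whole-word variant of the term check."""
    def has_any(terms):
        # re.escape keeps characters such as '.' or '+' in a term literal;
        # \b requires a word boundary on both sides of the term
        return any(
            re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE)
            for term in terms
        )
    return has_any(allowed_terms) and not has_any(disallowed_terms)
```

Word boundaries work best for terms that start and end with a letter or digit. Terms that begin or end with punctuation may need a different pattern.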


### Example Usage:

This function is useful for filtering out documents that are irrelevant based on your criteria.


In [26]:
# Directory containing the Word documents
directory = "/Users/skalaliya/Desktop"

In [32]:
allowed_terms = ["C0", "C1", "C2"]
disallowed_terms = ["C3", "C4", "C5"]

text = get_doc_text("BP.docx")
if contains_specific_terms(text, allowed_terms, disallowed_terms):
    print("Document meets the criteria")
else:
    print("Document does not meet the criteria")


Document does not meet the criteria
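When a document is rejected, as in the output above, it can help to see which terms were actually found. The following sketch reports the matches using the same substring test as `contains_specific_terms`. The helper name `explain_match`, the dictionary keys, and the sample text are illustrative assumptions.

```python
def explain_match(text, allowed_terms, disallowed_terms):
    """Report which terms were found and whether the document passes."""
    found_allowed = [term for term in allowed_terms if term in text]
    found_disallowed = [term for term in disallowed_terms if term in text]
    return {
        "allowed_found": found_allowed,
        "disallowed_found": found_disallowed,
        # Passes only if at least one allowed term and no disallowed term is present
        "passes": bool(found_allowed) and not found_disallowed,
    }

# Made-up sample text, not taken from a real document
report = explain_match("Levels C1 and C4", ["C0", "C1", "C2"], ["C3", "C4", "C5"])
print(report)
```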


### 3. Processing Multiple Documents

We can now use these functions to process multiple documents in a directory. 

We will filter them based on the terms and then sort them by filename.

In [33]:
import os

# Directory containing the Word documents
directory = '/Users/skalaliya/Desktop'  #"path_to_your_directory"
# Get a list of all .docx files in the directory
docx_files = [f for f in os.listdir(directory) if f.endswith('.docx')]

docx_files

['BP.docx',
 'DACE Final Project.docx',
 'EDB_Fall 2023_Template Orange - Case study.docx',
 'ACTE MRIAGE -2 marocain aya.docx',
 'sgement 2.docx',
 '4 SLIDES BDH.docx']
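Two details of the listing above are worth knowing: `endswith('.docx')` is case-sensitive, and Word creates temporary lock files whose names start with `~$` while a document is open. Those lock files are not real documents. A possible variant using `pathlib` is sketched below; the function name `list_docx_files` is hypothetical.

```python
from pathlib import Path

def list_docx_files(directory):
    """Return sorted .docx file names, skipping Word lock files (~$...)."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file()
        and path.suffix.lower() == ".docx"      # also accepts .DOCX
        and not path.name.startswith("~$")      # temporary Word lock file
    )
```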

In [36]:
# Create a list of tuples (file_name, document_text)
filtered_docs = []

for file in docx_files:
    doc_path = os.path.join(directory, file)
    text = get_doc_text(doc_path)
    if contains_specific_terms(text, allowed_terms, disallowed_terms):
        filtered_docs.append((file, text))
print(filtered_docs)

[]
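The loop above stops at the first file that cannot be opened, for example a corrupt or password-protected document. The sketch below records such failures and continues with the remaining files. The `reader` and `predicate` parameters are hypothetical; they stand in for `get_doc_text` and a term check built from `contains_specific_terms`.

```python
def filter_with_errors(paths, reader, predicate):
    """Return (kept, failed): matching (path, text) pairs and unreadable paths."""
    kept, failed = [], []
    for path in paths:
        try:
            text = reader(path)
        except Exception as exc:  # broad on purpose: any read error skips the file
            failed.append((path, repr(exc)))
            continue
        if predicate(text):
            kept.append((path, text))
    return kept, failed

# Possible usage with the functions defined earlier:
# kept, failed = filter_with_errors(
#     [os.path.join(directory, f) for f in docx_files],
#     get_doc_text,
#     lambda text: contains_specific_terms(text, allowed_terms, disallowed_terms),
# )
```

An empty `kept` list together with an empty `failed` list indicates that every file was read and none met the criteria.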


In [37]:

# Sort the documents by file name (change the key to sort by other criteria)
sorted_docs = sorted(filtered_docs, key=lambda x: x[0])
sorted_docs

[]

In [18]:


# Print or process the filtered and sorted documents
for doc_name, text in sorted_docs:
    print(f"Document: {doc_name}, First Line: {text.splitlines()[0] if text else 'Empty Document'}")
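Note that `text.splitlines()[0]` returns an empty string when a document starts with a blank paragraph. If the first meaningful line is wanted instead, a small helper along these lines could be used; the name `first_nonempty_line` is hypothetical.

```python
def first_nonempty_line(text, default="Empty Document"):
    """Return the first line containing non-whitespace characters."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    # No line with visible content was found
    return default
```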


### Example Scenarios:

**Filtering Sensitive Documents:** Imagine you are working on a project where you need to filter out documents containing confidential information (e.g., terms like "confidential" or "private") while keeping those that are public.

**Categorizing Documents by Topic:** You can categorize documents by checking for the presence of specific keywords. For instance, categorize documents related to "Finance", "Technology", or "Health" by looking for these terms.

**Quality Assurance in Data Collection:** Use this script to ensure that only relevant documents are included in your dataset for training a language model, by filtering out documents that do not meet your criteria.
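The topic-categorization scenario can be sketched as follows. The helper name `categorize` and the keyword lists are illustrative assumptions; in practice the keywords would come from your own domain.

```python
def categorize(text, topic_keywords):
    """Return the topics whose keywords appear in the text (case-insensitive)."""
    lowered = text.lower()
    return [
        topic
        for topic, keywords in topic_keywords.items()
        if any(keyword.lower() in lowered for keyword in keywords)
    ]

# Illustrative keyword lists, not a recommended vocabulary
topic_keywords = {
    "Finance": ["trading", "equities"],
    "Technology": ["software", "algorithm"],
    "Health": ["patient", "clinical"],
}
topics = categorize("An algorithm for trading equities", topic_keywords)
print(topics)
```

A document can land in several categories at once, or in none, which is often what you want for a first pass over a mixed collection.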

### 4. Advanced Features and Extensions

**Adding a User Interface:**

You can add a simple user interface using tkinter to allow users to select directories and input terms interactively.

**Storing Filtered Results in a File:**

Extend the script to save the names of filtered documents and their content to a CSV file or another format for further analysis.

In [19]:
import csv

# Save filtered document info to a CSV file
# An explicit encoding avoids errors with non-ASCII text on some platforms
with open('filtered_documents.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Document Name", "First Line"])
    for doc_name, text in sorted_docs:
        writer.writerow([doc_name, text.splitlines()[0] if text else 'Empty Document'])
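As an example of "another format", JSON Lines (one JSON object per line) keeps the full text of each document and is a common input format for model-training pipelines. This is a minimal sketch; the function name `save_jsonl` and the field names `name` and `text` are assumptions.

```python
import json

def save_jsonl(docs, path):
    """Write (file_name, text) tuples as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for name, text in docs:
            # json.dumps escapes newlines, so each record stays on a single line
            record = {"name": name, "text": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Possible usage with the list built earlier:
# save_jsonl(sorted_docs, "filtered_documents.jsonl")
```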


**Sorting by Content Length:**

Sort the documents by the length of their content instead of their filenames.

In [20]:
sorted_docs = sorted(filtered_docs, key=lambda x: len(x[1]))  # Sort by content length
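When several documents have the same length, a tuple key makes the order predictable. The sketch below sorts the longest documents first and breaks ties by file name; the sample data is made up for illustration.

```python
# Made-up (file_name, text) pairs in the same shape as filtered_docs
docs = [("b.docx", "short"), ("a.docx", "short"), ("c.docx", "a much longer text")]

# Longest first; documents of equal length are ordered by file name
by_length_desc = sorted(docs, key=lambda doc: (-len(doc[1]), doc[0]))
print([name for name, _ in by_length_desc])
```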


### Extending the Script with Examples

Here are a few examples and scenarios where this script could be extended:

### Example 1: Filtering by Multiple Sets of Terms

Suppose you want to filter documents based on different sets of allowed and disallowed terms. 

You could extend the script to handle multiple sets of terms:

In [38]:
term_sets = [
    {"allowed": ["A1", "A2"], "disallowed": ["B1", "B2"]},
    {"allowed": ["C1", "C2"], "disallowed": ["D1", "D2"]},
]

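for terms in term_sets:
    allowed_terms = terms["allowed"]
    disallowed_terms = terms["disallowed"]

    # Filter the documents with this term set, using the same logic as before
    matches = [
        file for file in docx_files
        if contains_specific_terms(
            get_doc_text(os.path.join(directory, file)), allowed_terms, disallowed_terms
        )
    ]
    print(f"Allowed {allowed_terms}, disallowed {disallowed_terms}: {matches}")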
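To avoid re-reading each file for every term set, the texts can be extracted once and then tested against all sets. The sketch below assumes `docs` is a list of `(file_name, text)` tuples; the function name `filter_by_term_sets` and the sample data are hypothetical.

```python
def filter_by_term_sets(docs, term_sets):
    """Map each term-set index to the file names that pass that set."""
    results = {}
    for index, terms in enumerate(term_sets):
        results[index] = [
            name
            for name, text in docs
            if any(term in text for term in terms["allowed"])
            and not any(term in text for term in terms["disallowed"])
        ]
    return results

# Made-up texts for illustration
sample_docs = [("x.docx", "A1 only"), ("y.docx", "C1 and D2"), ("z.docx", "C2")]
sample_sets = [
    {"allowed": ["A1", "A2"], "disallowed": ["B1", "B2"]},
    {"allowed": ["C1", "C2"], "disallowed": ["D1", "D2"]},
]
print(filter_by_term_sets(sample_docs, sample_sets))
```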


### Example 2: Saving Filtered Documents to a New Directory

If you want to save the filtered documents to a new directory instead of just printing them, you can modify the script to write the filtered text to new files:

In [39]:
output_directory = "filtered_docs"
os.makedirs(output_directory, exist_ok=True)

for doc_name, text in sorted_docs:
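    # Swap only the file extension, so '.docx' elsewhere in the path is left alone
    output_path = os.path.join(output_directory, os.path.splitext(doc_name)[0] + '.txt')
    with open(output_path, 'w', encoding='utf-8') as f: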
        f.write(text)


### Applying NLP Techniques to Filtered Text

Once you've filtered the documents, you might want to apply further NLP techniques, such as sentiment analysis or keyword extraction, using libraries like nltk, spacy, or transformers.

In [40]:
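import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data; download it once
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt', quiet=True)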

for doc_name, text in sorted_docs:
    tokens = word_tokenize(text)
    print(f"Document: {doc_name}, Number of Tokens: {len(tokens)}")
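For a quick look at keywords without extra dependencies, word frequencies can be counted with the standard library. This is a deliberately simple sketch that uses a regular expression instead of the nltk tokenizer. The function name `top_keywords` and the tiny stopword set are assumptions, and a real project would likely use a fuller stopword list.

```python
import re
from collections import Counter

# Small illustrative stopword set, not exhaustive
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def top_keywords(text, n=5):
    """Return the n most frequent words as (word, count) pairs."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(n)

# Possible usage with the filtered documents:
# for doc_name, text in sorted_docs:
#     print(doc_name, top_keywords(text))
```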


### Full Code

In [43]:
from docx import Document
import os
import shutil

# Function to extract text from a Word document
def get_doc_text(doc_path):
    doc = Document(doc_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)

# Function to check if document contains any of the allowed terms and excludes the disallowed ones
def contains_specific_terms(text, allowed_terms, disallowed_terms):
    contains_allowed = any(term in text for term in allowed_terms)
    contains_disallowed = any(term in text for term in disallowed_terms)
    return contains_allowed and not contains_disallowed

# Directory containing the Word documents
directory = "/Users/skalaliya/Desktop"
output_directory = "/Users/skalaliya/Desktop/filtered_docs"

# Create output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

# Terms to include and exclude
allowed_terms = ["C0", "C1", "C2"]
disallowed_terms = ["C3", "C4", "C5"]

# Get a list of all .docx files in the directory
docx_files = [f for f in os.listdir(directory) if f.endswith('.docx')]

# Create a list of tuples (file_name, document_text)
filtered_docs = []

for file in docx_files:
    doc_path = os.path.join(directory, file)
    text = get_doc_text(doc_path)
    if contains_specific_terms(text, allowed_terms, disallowed_terms):
        filtered_docs.append((file, text))

# Sort the documents by file name (change the key to sort by other criteria)
sorted_docs = sorted(filtered_docs, key=lambda x: x[0])

# Save copies of the filtered and sorted documents to the output directory
for doc_name, _ in sorted_docs:
    src_path = os.path.join(directory, doc_name)
    dest_path = os.path.join(output_directory, doc_name)
    shutil.copy(src_path, dest_path)
    print(f"Copied {doc_name} to {output_directory}")

print("All filtered and sorted documents have been copied to the output directory.")


All filtered and sorted documents have been copied to the output directory.


### Copying All .docx Files Without Filtering

Here is a simple script that copies all .docx files from one directory to another without any filtering or conditions.

#### How to use the scripts below:

Replace the source_directory and output_directory paths as needed for your specific use case.

Copy and paste the code into a Python script.

Run the script, and it will copy all files of the specified type from the source directory to the output directory.

In [44]:
import os
import shutil

# Directory containing the Word documents
source_directory = "/Users/skalaliya/Desktop"
output_directory = "/Users/skalaliya/Desktop/all_docs"

# Create output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

# Get a list of all .docx files in the source directory
docx_files = [f for f in os.listdir(source_directory) if f.endswith('.docx')]

# Copy each .docx file to the output directory
for file in docx_files:
    src_path = os.path.join(source_directory, file)
    dest_path = os.path.join(output_directory, file)
    shutil.copy(src_path, dest_path)
    print(f"Copied {file} to {output_directory}")

print(f"All .docx files have been copied to the '{output_directory}' directory.")


Copied BP.docx to /Users/skalaliya/Desktop/all_docs
Copied DACE Final Project.docx to /Users/skalaliya/Desktop/all_docs
Copied EDB_Fall 2023_Template Orange - Case study.docx to /Users/skalaliya/Desktop/all_docs
Copied ACTE MRIAGE -2 marocain aya.docx to /Users/skalaliya/Desktop/all_docs
Copied sgement 2.docx to /Users/skalaliya/Desktop/all_docs
Copied 4 SLIDES BDH.docx to /Users/skalaliya/Desktop/all_docs
All .docx files have been copied to the '/Users/skalaliya/Desktop/all_docs' directory.


### For Excel Files (.xlsx and .xls):


In [45]:
import os
import shutil

# Directory containing the Excel files
source_directory = "/Users/skalaliya/Desktop"
output_directory = "/Users/skalaliya/Desktop/all_excel_files"

# Create output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

# Get a list of all .xlsx and .xls files in the source directory
excel_files = [f for f in os.listdir(source_directory) if f.endswith(('.xlsx', '.xls'))]

# Copy each Excel file to the output directory
for file in excel_files:
    src_path = os.path.join(source_directory, file)
    dest_path = os.path.join(output_directory, file)
    shutil.copy(src_path, dest_path)
    print(f"Copied {file} to {output_directory}")

print(f"All Excel files have been copied to the '{output_directory}' directory.")


Copied fastfood_data.xlsx to /Users/skalaliya/Desktop/all_excel_files
All Excel files have been copied to the '/Users/skalaliya/Desktop/all_excel_files' directory.


### For PowerPoint Files (.pptx and .ppt):


In [46]:
import os
import shutil

# Directory containing the PowerPoint files
source_directory = "/Users/skalaliya/Desktop"
output_directory = "/Users/skalaliya/Desktop/all_ppt_files"

# Create output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

# Get a list of all .pptx and .ppt files in the source directory
ppt_files = [f for f in os.listdir(source_directory) if f.endswith(('.pptx', '.ppt'))]

# Copy each PowerPoint file to the output directory
for file in ppt_files:
    src_path = os.path.join(source_directory, file)
    dest_path = os.path.join(output_directory, file)
    shutil.copy(src_path, dest_path)
    print(f"Copied {file} to {output_directory}")

print(f"All PowerPoint files have been copied to the '{output_directory}' directory.")


Copied Formation GIT Bitbucket.pptx to /Users/skalaliya/Desktop/all_ppt_files
Copied CV Elliot.pptx to /Users/skalaliya/Desktop/all_ppt_files
Copied McCabeStrategicEntryDeterrenceWeek7_edit.pptx to /Users/skalaliya/Desktop/all_ppt_files
All PowerPoint files have been copied to the '/Users/skalaliya/Desktop/all_ppt_files' directory.


### For PDF Files (.pdf):


In [47]:
import os
import shutil

# Directory containing the PDF files
source_directory = "/Users/skalaliya/Desktop"
output_directory = "/Users/skalaliya/Desktop/all_pdf_files"

# Create output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

# Get a list of all .pdf files in the source directory
pdf_files = [f for f in os.listdir(source_directory) if f.endswith('.pdf')]

# Copy each PDF file to the output directory
for file in pdf_files:
    src_path = os.path.join(source_directory, file)
    dest_path = os.path.join(output_directory, file)
    shutil.copy(src_path, dest_path)
    print(f"Copied {file} to {output_directory}")

print(f"All PDF files have been copied to the '{output_directory}' directory.")


Copied https___evisa.imigrasi.go.id_web_visa-status_z4gIku0korktNszWis4xNaZzur1SLNuO9bJ+AwOC7nS9STmPxcvcQLPLJh2CmgQ3.pdf to /Users/skalaliya/Desktop/all_pdf_files
Copied Aya_Passport.pdf to /Users/skalaliya/Desktop/all_pdf_files
All PDF files have been copied to the '/Users/skalaliya/Desktop/all_pdf_files' directory.
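The four copy scripts above differ only in the file extensions and the output directory, so they could be folded into one function. The sketch below is one possible way to do that; the name `copy_files_by_extension` is hypothetical, and unlike the scripts above it matches extensions case-insensitively.

```python
import os
import shutil

def copy_files_by_extension(source_directory, output_directory, extensions):
    """Copy files whose names end with one of `extensions`; return the copied names."""
    os.makedirs(output_directory, exist_ok=True)
    suffixes = tuple(ext.lower() for ext in extensions)
    copied = []
    for name in sorted(os.listdir(source_directory)):
        src_path = os.path.join(source_directory, name)
        # Skip sub-directories, including the output directory itself
        if os.path.isfile(src_path) and name.lower().endswith(suffixes):
            shutil.copy(src_path, os.path.join(output_directory, name))
            copied.append(name)
    return copied

# Possible usage (paths are placeholders):
# copy_files_by_extension("path_to_source", "path_to_output", [".xlsx", ".xls"])
```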
