TSA_Text-Preprocessing

A project on "Text Cleaning and Pre-Processing Tool using NLTK", completed during my third year of college.

Document Text Cleaner

A Python tool that extracts and cleans text from Word documents (.docx) and PDF files using natural language processing techniques.

Features

Extract text from Word documents (.docx)
Extract text from PDF files (.pdf)
Text cleaning using NLTK:
- Tokenization
- Stopword removal
- Lemmatization
- Punctuation removal
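
The four cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the function names `clean_tokens` and `clean_text_with_nltk` are hypothetical, and the order of the steps is an assumption. The NLTK imports are done lazily, and the NLTK path needs the tokenizer, stopword and WordNet data to be fetched beforehand with `nltk.download` (the exact resource names can differ between NLTK versions).

```python
import string
from typing import Callable, Iterable, List, Set


def clean_tokens(
    tokens: Iterable[str],
    stop_words: Set[str],
    lemmatize: Callable[[str], str],
) -> List[str]:
    """Lowercase, drop punctuation and stopwords, then lemmatize each token."""
    cleaned = []
    for token in tokens:
        word = token.lower()
        # Punctuation removal: skip tokens made up only of punctuation characters.
        if all(ch in string.punctuation for ch in word):
            continue
        # Stopword removal: skip words found in the supplied stopword set.
        if word in stop_words:
            continue
        # Lemmatization: reduce the remaining word to its base form.
        cleaned.append(lemmatize(word))
    return cleaned


def clean_text_with_nltk(text: str) -> str:
    """Hypothetical wiring of the steps to NLTK (requires downloaded NLTK data)."""
    # Imported lazily so clean_tokens stays usable without NLTK installed.
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)  # Tokenization
    words = clean_tokens(tokens, set(stopwords.words("english")), lemmatizer.lemmatize)
    return " ".join(words)
```

Keeping the stopword set and the lemmatizer as parameters makes the filtering logic easy to try out with small stand-ins before NLTK and its data are installed.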

Installation

Install required dependencies:

pip install -r requirements.txt

Usage

Run the script with a document file as an argument:

python main.py <document_file>

Examples

# Process a Word document
python main.py document.docx

# Process a PDF file
python main.py report.pdf

Output

The tool will display:

Original text length
Cleaned text length
The processed and cleaned text
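
As a rough sketch of how such a summary could be assembled, assuming the lengths are character counts (the README does not say whether characters or words are counted) and using a hypothetical helper name `format_report` with made-up label wording:

```python
def format_report(original: str, cleaned: str) -> str:
    """Build a three-part summary; lengths are character counts (an assumption)."""
    return "\n".join(
        [
            f"Original text length: {len(original)}",
            f"Cleaned text length: {len(cleaned)}",
            "Cleaned text:",
            cleaned,
        ]
    )


# Example call with a short made-up sentence and its cleaned form.
print(format_report("The cats are running!", "cat running"))
```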

Supported File Formats

.docx - Microsoft Word documents
.pdf - PDF documents
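
One way to route a file to the right extractor is to dispatch on its extension. The sketch below is an illustration under assumptions: it uses python-docx and pypdf for extraction, which may not be the libraries listed in requirements.txt, and the names `extract_docx`, `extract_pdf`, `EXTRACTORS` and `pick_extractor` are hypothetical.

```python
from pathlib import Path


def extract_docx(path: Path) -> str:
    # Assumption: python-docx is installed (imported as `docx`).
    from docx import Document

    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def extract_pdf(path: Path) -> str:
    # Assumption: pypdf is installed; the project may use a different PDF library.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    # Fall back to an empty string for pages where no text could be extracted.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Maps a lowercase file extension to the function that reads that format.
EXTRACTORS = {".docx": extract_docx, ".pdf": extract_pdf}


def pick_extractor(filename: str):
    """Return the extractor for a file name, or raise ValueError if unsupported."""
    suffix = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix or '(none)'}") from None
```

With this layout, supporting another format would only mean adding one function and one entry to the mapping.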

Requirements

Python 3.6+
See requirements.txt for package dependencies

