Text Cleaning and Pre-Processing Tool using NLTK: a project I completed in my third year of college.
A Python tool that extracts and cleans text from Word documents (.docx) and PDF files using natural language processing techniques.
- Extract text from Word documents (.docx)
- Extract text from PDF files (.pdf)
- Text cleaning using NLTK:
  - Tokenization
  - Stopword removal
  - Lemmatization
  - Punctuation removal
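
The four cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `clean_tokens` and `clean_text_with_nltk` are hypothetical, and lowercasing tokens before the stopword check is an assumption. The NLTK calls used (`word_tokenize`, `stopwords.words`, `WordNetLemmatizer.lemmatize`) need their data packages downloaded first with `nltk.download`.

```python
import string


def clean_tokens(tokens, stopword_set, lemmatize=lambda token: token):
    """Drop punctuation and stopwords from tokens, then lemmatize the rest.

    Hypothetical helper: the stopword set and the lemmatizer are passed in,
    so the core logic does not depend on NLTK being installed.
    """
    cleaned = []
    for token in tokens:
        token = token.lower()  # assumption: comparison is case-insensitive
        # Punctuation removal: skip tokens made only of punctuation characters
        if all(ch in string.punctuation for ch in token):
            continue
        # Stopword removal
        if token in stopword_set:
            continue
        # Lemmatization
        cleaned.append(lemmatize(token))
    return cleaned


def clean_text_with_nltk(text):
    """Run the same steps with NLTK's tokenizer, stopword list and lemmatizer."""
    # Imported here so clean_tokens stays usable without NLTK.
    # Requires NLTK data: "stopwords", "wordnet", and the tokenizer models
    # ("punkt", or "punkt_tab" on newer NLTK releases).
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)  # Tokenization
    return clean_tokens(tokens, set(stopwords.words("english")), lemmatizer.lemmatize)
```

Calling `clean_text_with_nltk(text)` returns a list of cleaned tokens; joining them with spaces gives a cleaned string.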
 
- Install required dependencies:
    pip install -r requirements.txt
- Run the script with a document file as an argument:
    python main.py <document_file>

    # Process a Word document
    python main.py document.docx
    # Process a PDF file
    python main.py report.pdf

The tool will display:
- Original text length
- Cleaned text length
- The processed and cleaned text
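
As a rough sketch of how the command-line argument and the report above might fit together: the helper names `build_parser` and `format_summary` are hypothetical, and measuring length in characters is an assumption, since the tool may count something else (for example, tokens).

```python
import argparse


def build_parser():
    """Build a parser for the single positional argument used by main.py."""
    parser = argparse.ArgumentParser(
        description="Extract and clean text from a .docx or .pdf file."
    )
    parser.add_argument("document_file", help="path to a .docx or .pdf file")
    return parser


def format_summary(original_text, cleaned_text):
    """Return the three reported items as one printable string.

    Assumption: "length" means number of characters.
    """
    return "\n".join(
        [
            f"Original text length: {len(original_text)}",
            f"Cleaned text length: {len(cleaned_text)}",
            "Cleaned text:",
            cleaned_text,
        ]
    )
```

A script could call `build_parser().parse_args()` to read the file path, then print `format_summary(original, cleaned)` once extraction and cleaning are done.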

Supported file formats:
- .docx: Microsoft Word documents
- .pdf: PDF documents
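
One way to route each supported format to its extractor is to dispatch on the file extension. The sketch below assumes the python-docx and pypdf packages for reading the files; the project's requirements.txt may name different libraries, and the function names here are hypothetical.

```python
from pathlib import Path


def extract_docx(path):
    """Return the paragraph text of a Word document (assumes python-docx)."""
    from docx import Document  # imported lazily; provided by python-docx

    document = Document(str(path))
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


def extract_pdf(path):
    """Return the text of every page of a PDF (assumes pypdf)."""
    from pypdf import PdfReader  # imported lazily; provided by pypdf

    reader = PdfReader(str(path))
    # extract_text() may yield nothing for image-only pages, hence the fallback
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Map each supported extension to its extractor
EXTRACTORS = {".docx": extract_docx, ".pdf": extract_pdf}


def pick_extractor(filename):
    """Choose an extractor by file extension, ignoring case."""
    suffix = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename}") from None
```

With this layout, `pick_extractor("report.pdf")("report.pdf")` would return the raw text, and any other extension raises a `ValueError` before a file is opened.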

Requirements:
- Python 3.6+
- See requirements.txt for package dependencies