An NLP pipeline for document processing and semantic analysis, built on TF-IDF vectorization and Word2Vec embeddings.
- ✔️ Automated text cleaning pipeline
- ✔️ Customizable stopword filtering
- ✔️ Punctuation and special character removal
- 🎯 TF-IDF vectorization with scikit-learn
- 🎯 Word2Vec embedding training
- 🎛️ Configurable hyperparameters via global_options.py
- 🔍 Seed word similarity scoring
- 📊 Document-level semantic profiling
- 💾 Results export to CSV
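The cleaning features listed above (lowercasing, punctuation and special character removal, customizable stopword filtering) can be illustrated with a minimal standard-library sketch. The function name `clean_text` and the `DEFAULT_STOPWORDS` set are hypothetical and are not taken from the project's scripts; the actual pipeline presumably loads its stopwords from data/dictionaries/.

```python
import re
import string

# Hypothetical default stopword list, for illustration only.
# The real project is assumed to read its list from data/dictionaries/.
DEFAULT_STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text, stopwords=DEFAULT_STOPWORDS):
    """Lowercase the text, strip punctuation and special characters,
    then drop any token found in `stopwords`. Returns a list of tokens."""
    text = text.lower().translate(_PUNCT_TABLE)
    # Replace anything that is not an ASCII letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in stopwords]


tokens = clean_text("The plaintiff filed a lawsuit, and the court dismissed it!")
print(tokens)
```

Passing a different set as `stopwords` is one simple way to make the filtering customizable, for example `clean_text(doc, stopwords={"court"})`.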
text_processing_pipeline/
│
├── data/
│ ├── input/ # Raw documents (.txt)
│ ├── processed/ # Cleaned text and intermediate files
│ └── dictionaries/ # Seed words and stopwords
│
├── models/ # Serialized Word2Vec models
│ └── word_vectors.kv # Pretrained embeddings
│
├── outputs/
│ ├── word_similarities/ # Per-seed-word similarity scores
│ └── df_listscore.csv # Final aggregated scores
│
├── config/
│ └── global_options.py # Paths and hyperparameters
│
└── scripts/ # Processing modules
  ├── NER_pipeline.py
  ├── preprocessing.py
  ├── ML.py
  ├── feature_engineering.py
  └── litigation_score_final.py
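As a rough illustration of how the path configuration in global_options.py could mirror the layout above, here is a sketch using `pathlib`. The `build_paths` helper, the dictionary keys, and the choice of a relative project root are all assumptions made for this example; the actual contents of global_options.py may differ.

```python
from pathlib import Path


def build_paths(root):
    """Return the project's data, model, and output locations under `root`.

    Hypothetical helper: the directory and file names come from the
    layout shown above, but the key names are invented for illustration.
    """
    root = Path(root)
    return {
        "input": root / "data" / "input",                # raw .txt documents
        "processed": root / "data" / "processed",        # cleaned text
        "dictionaries": root / "data" / "dictionaries",  # seed words, stopwords
        "model": root / "models" / "word_vectors.kv",    # Word2Vec vectors
        "similarities": root / "outputs" / "word_similarities",
        "scores": root / "outputs" / "df_listscore.csv",
    }


# Assumption: scripts are launched from the directory that contains the project.
PATHS = build_paths("text_processing_pipeline")
print(PATHS["scores"])
```

Keeping every location in one mapping means the scripts can import a single object instead of hard-coding paths in each module.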