sukilau / chinese-nlp Public


Topic modeling for Chinese news articles using Chinese NLP techniques


Chinese Text Classification

Task :

Supervised Chinese text classification (news articles, Traditional Chinese)

Dataset :

Training set : ~4000 labeled Chinese news articles (3 classes)
Test set : ~1000 unlabeled Chinese news articles

Algorithm :

Preprocess the text and segment it into words using beautifulsoup4, jieba, and a customized stopword list.
Build TF-IDF-weighted bag-of-words features using TfidfVectorizer in scikit-learn.
Train a Random Forest classifier and predict labels for the test set.

Evaluation :

Average accuracy of 0.99 on 10-fold cross-validation.
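
One way such a figure could be computed is with scikit-learn's `cross_val_score`. The sketch below is an assumption about the procedure, not the repository's code: the `mean_cv_accuracy` helper, the toy documents (already segmented and joined with spaces), and the label names are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def mean_cv_accuracy(docs, labels, folds=10):
    """Return the mean accuracy over `folds` cross-validation splits."""
    model = make_pipeline(
        # Documents are assumed to be pre-segmented, so split on whitespace only.
        TfidfVectorizer(token_pattern=r"\S+", lowercase=False),
        RandomForestClassifier(n_estimators=100, random_state=0),
    )
    scores = cross_val_score(model, docs, labels, cv=folds, scoring="accuracy")
    return scores.mean()


# Toy corpus: 10 documents per class, invented for illustration.
docs = ["股市 上漲 投資"] * 10 + ["球隊 比賽 勝利"] * 10 + ["電影 上映 導演"] * 10
labels = ["finance"] * 10 + ["sports"] * 10 + ["entertainment"] * 10

accuracy = mean_cv_accuracy(docs, labels)
print(f"Mean 10-fold accuracy: {accuracy:.2f}")
```

Cross-validating the whole pipeline, rather than vectorizing once up front, means the TF-IDF vocabulary is refit on each training fold, so held-out articles do not influence the features.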

Requirements

Python 3.6
Modules: pandas, numpy, scikit-learn, beautifulsoup4, jieba

To install required modules : $ pip install pandas numpy scikit-learn beautifulsoup4 jieba

Instruction

To run the Python script : $ python main.py

What is in this repo

main.py

Python script for Chinese text classification

tagging-prediction.ipynb

Jupyter notebook with the same script

prediction.csv

Prediction on test set

stopwords.txt

Customized Chinese stopwords
