Topic modeling for Chinese news articles using Chinese NLP techniques

sukilau/chinese-nlp

Chinese Text Classification

Task :

  • Supervised Chinese text classification of news articles written in Traditional Chinese

Dataset :

  • Training set : ~4000 labeled Chinese news articles (3 classes)
  • Test set : ~1000 unlabeled Chinese news articles

Algorithm :

  • Preprocess the text and segment it into words using beautifulsoup4, jieba, and a customized stopword list.
  • Build TF-IDF-weighted bag-of-words features using TfidfVectorizer from scikit-learn.
  • Train a Random Forest classifier and predict labels for the test set.

Evaluation :

  • Average accuracy of 0.99 under 10-fold cross-validation.
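A 10-fold cross-validated accuracy of this kind can be computed with scikit-learn's `cross_val_score`. The sketch below assumes documents that have already been segmented into space-separated tokens; the toy data, the `split_tokens` helper, and the hyperparameters are placeholders, so it does not reproduce the figure reported above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def split_tokens(doc):
    """Tokenize an already-segmented document on whitespace."""
    return doc.split()


# Toy placeholder data: 10 pre-segmented documents per class, 3 classes.
docs = ["股市 上漲 投資"] * 10 + ["球隊 比賽 冠軍"] * 10 + ["電影 導演 上映"] * 10
labels = ["finance"] * 10 + ["sports"] * 10 + ["film"] * 10

pipeline = make_pipeline(
    TfidfVectorizer(tokenizer=split_tokens, lowercase=False, token_pattern=None),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

# For a classifier, cv=10 uses stratified 10-fold splitting.
scores = cross_val_score(pipeline, docs, labels, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()
print(f"Mean 10-fold CV accuracy: {mean_accuracy:.2f}")
```

Cross-validating the whole pipeline, rather than a pre-computed TF-IDF matrix, refits the vectorizer on each training fold so that no vocabulary or IDF statistics leak from the held-out fold.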

Requirements

  • Python 3.6
  • Modules: pandas, numpy, scikit-learn, beautifulsoup4, jieba

To install required modules : $ pip install pandas numpy scikit-learn beautifulsoup4 jieba

Instructions

To run the Python script : $ python main.py

What is in this repo

main.py

  • Python script for Chinese text classification

tagging-prediction.ipynb

  • Jupyter notebook with the same script

prediction.csv

  • Predictions on the test set

stopwords.txt

  • Customized Chinese stopwords
