tychen5/IR_TextMining

Information Retrieval and Text Mining project implementation

  1. Text Preprocessing
    • Tokenization
    • Lowercasing
    • Stemming
    • Stopword removal
  2. Construct dictionary & tf-idf vector
    • term dictionary
    • tf-idf unit vector
    • cosine similarity
  3. Naive Bayes classification
    • Multinomial NB classifier
    • feature selection
    • smoothing
  4. HAC clustering
    • hierarchical clustering
    • pair-wise document similarity
    • similarity between clusters
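
As a rough illustration of how steps 1 and 2 fit together, here is a minimal Python sketch using only the standard library. It is not taken from this repository: the function names, the tiny stopword list, the sample documents, and the choice of `tf * log10(N / df)` as the weighting are all assumptions made for the example, and stemming is left out because it would need an external stemmer.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stopwords (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_dictionary(docs_tokens):
    """Map each term to its document frequency (number of docs containing it)."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return df


def tfidf_unit_vector(tokens, df, n_docs):
    """Sparse tf-idf vector, assumed weighting tf * log10(N / df), scaled to unit length."""
    tf = Counter(tokens)
    vec = {t: tf[t] * math.log10(n_docs / df[t]) for t in tf if t in df}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def cosine(u, v):
    """Cosine similarity of two unit vectors, which reduces to their dot product."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Made-up sample documents.
docs = ["The cat sat on the mat", "A cat and a dog", "Stock markets fell"]
docs_tokens = [preprocess(d) for d in docs]
df = build_dictionary(docs_tokens)
vecs = [tfidf_unit_vector(tokens, df, len(docs)) for tokens in docs_tokens]

print(cosine(vecs[0], vecs[1]))  # positive: both documents contain "cat"
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared terms
```

Because every vector is normalized to unit length, the same `cosine` function can also serve as the pair-wise document similarity needed for HAC clustering in step 4.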