- The code has been run on Google Colab, which provides free GPU memory.
-
- Natural Language Processing(自然语言处理)
  - IMDB(English Data)
  - SNLI(English Data)
  - 微众银行智能客服(Chinese Data)
-
- Spoken Language Understanding(对话理解)
  - ATIS(English Data)
  - 青云语料(Chinese Data)
    - Python Inference(基于 Python 的推理)
    - Java Inference(基于 Java 的推理)
-
- Multi-turn Dialogue Rewriting(多轮对话改写)
  - 微信 AI 研发数据(Chinese Data)
    - Python Inference(基于 Python 的推理)
    - Java Inference(基于 Java 的推理)
-
- Facebook AI Research Data(English Data)
- bAbI(English Data)
-
- Word Extraction
- Word Segmentation
-
- Knowledge Graph(知识图谱)
- Movielens 1M(English Data)
└── finch/tensorflow2/text_classification/imdb
│
├── data
│ └── glove.840B.300d.txt # pretrained embedding, download and put here
│ └── make_data.ipynb # step 1. make data and vocab: train.txt, test.txt, word.txt
│ └── train.txt # incomplete sample, format <label, text> separated by \t
│ └── test.txt # incomplete sample, format <label, text> separated by \t
│ └── train_bt_part1.txt # (back-translated) incomplete sample, format <label, text> separated by \t
│
├── vocab
│ └── word.txt # incomplete sample, list of words in vocabulary
│
└── main
└── attention_linear.ipynb # step 2: train and evaluate model
└── attention_conv.ipynb # step 2: train and evaluate model
└── fasttext_unigram.ipynb # step 2: train and evaluate model
└── fasttext_bigram.ipynb # step 2: train and evaluate model
└── sliced_rnn.ipynb # step 2: train and evaluate model
└── sliced_rnn_bt.ipynb # step 2: train and evaluate model
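For orientation, here is a minimal sketch of how the step-1 outputs could be consumed downstream, assuming only the tab-separated `<label, text>` format and the one-word-per-line `word.txt` described above; the paths, the helper name, and the use of id 0 as `<unk>`/`<pad>` are illustrative, not the notebook's actual code.

```python
# Sketch: read the tab-separated <label, text> files produced by make_data.ipynb
# and map tokens to ids via vocab/word.txt.
def load_split(path):
    labels, texts = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            label, text = line.rstrip('\n').split('\t', 1)
            labels.append(label)
            texts.append(text.split())          # whitespace tokenization (an assumption)
    return labels, texts

with open('../vocab/word.txt', encoding='utf-8') as f:
    word2idx = {w.rstrip('\n'): i for i, w in enumerate(f)}

train_labels, train_texts = load_split('../data/train.txt')
train_ids = [[word2idx.get(w, 0) for w in sent] for sent in train_texts]   # 0 as an assumed <unk>/<pad> id
```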
-
Task: IMDB(English Data)
Training Data: 25000, Testing Data: 25000, Labels: 2
-
Model: TF-IDF + Logistic Regression
-
Model: FastText
-
Model: Feedforward Attention
-
Model: Sliced RNN
-
TensorFlow 2
-
<Notebook> Sliced LSTM + Back-Translation -> 91.7 % Testing Accuracy
-
<Notebook> Sliced LSTM + Back-Translation + Char Embedding -> 92.3 % Testing Accuracy
-
<Notebook> Sliced LSTM + Back-Translation + Char Embedding + Label Smoothing
-> 92.5 % Testing Accuracy
-
<Notebook> Sliced LSTM + Back-Translation + Char Embedding + Label Smoothing + Cyclical LR
-> 92.6 % Testing Accuracy
This result (without transfer learning) is higher than CoVe (with transfer learning)
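The last two improvements come from label smoothing and a cyclical learning rate. Below is a hedged TF2 sketch of both ideas; the triangular schedule is a generic hand-written one and the hyperparameters are placeholders, not necessarily what the notebook uses.

```python
import tensorflow as tf

# Label smoothing is built into the Keras cross-entropy losses (expects one-hot labels).
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)

# Cyclical learning rate: a simple triangular cycle as a custom LearningRateSchedule.
class TriangularCLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, lo=1e-4, hi=3e-4, step_size=2000):
        super().__init__()
        self.lo, self.hi, self.step_size = lo, hi, step_size

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        cycle = tf.floor(1.0 + step / (2.0 * self.step_size))
        x = tf.abs(step / self.step_size - 2.0 * cycle + 1.0)
        return self.lo + (self.hi - self.lo) * tf.maximum(0.0, 1.0 - x)

    def get_config(self):
        return {'lo': self.lo, 'hi': self.hi, 'step_size': self.step_size}

optimizer = tf.keras.optimizers.Adam(learning_rate=TriangularCLR())
```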
-
└── finch/tensorflow2/text_matching/snli
│
├── data
│ └── glove.840B.300d.txt # pretrained embedding, download and put here
│ └── download_data.ipynb # step 1. run this to download snli dataset
│ └── make_data.ipynb # step 2. run this to generate train.txt, test.txt, word.txt
│ └── train.txt # incomplete sample, format <label, text1, text2> separated by \t
│ └── test.txt # incomplete sample, format <label, text1, text2> separated by \t
│
├── vocab
│ └── word.txt # incomplete sample, list of words in vocabulary
│
└── main
└── dam.ipynb # step 3. train and evaluate model
└── esim.ipynb # step 3. train and evaluate model
└── ......
-
Task: SNLI(English Data)
Training Data: 550152, Testing Data: 10000, Labels: 3
-
TensorFlow 2
-
Model: DAM
-
<Notebook> DAM -> 85.3% Testing Accuracy
The accuracy of this implementation is higher than UCL MR Group's implementation (84.6%)
-
-
Model: Match Pyramid
-
<Notebook> Pyramid -> 87.1% Testing Accuracy
The accuracy of this model is 0.3% below ESIM; however, it is about 1x faster (roughly twice the speed of ESIM)
-
-
Model: ESIM
-
<Notebook> ESIM -> 87.4% Testing Accuracy
The accuracy of this implementation is comparable to UCL MR Group's implementation (87.2%)
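DAM and ESIM (and RE2 below) share the same soft-alignment core: each sentence is re-expressed as an attention-weighted average of the other before comparison. The following is a generic sketch of that step, not this repository's exact code.

```python
import tensorflow as tf

def soft_align(a, b, mask_a, mask_b):
    # a: [B, La, D], b: [B, Lb, D]; mask_*: [B, L*] with 1.0 for real tokens, 0.0 for padding.
    # Returns a_tilde (b re-expressed for each position of a) and b_tilde (the reverse) --
    # the shared alignment step of DAM / ESIM / RE2.
    e = tf.matmul(a, b, transpose_b=True)                         # [B, La, Lb] dot-product similarity
    e_ab = e + (1.0 - mask_b[:, None, :]) * -1e9                  # ignore padded tokens of b
    e_ba = tf.transpose(e, [0, 2, 1]) + (1.0 - mask_a[:, None, :]) * -1e9
    a_tilde = tf.matmul(tf.nn.softmax(e_ab, axis=-1), b)          # [B, La, D]
    b_tilde = tf.matmul(tf.nn.softmax(e_ba, axis=-1), a)          # [B, Lb, D]
    return a_tilde, b_tilde
```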
-
-
Model: RE2
-
└── finch/tensorflow2/text_matching/chinese
│
├── data
│ └── make_data.ipynb # step 1. run this to generate char.txt and char.npy
│ └── train.csv # incomplete sample, format <text1, text2, label> separated by comma
│ └── test.csv # incomplete sample, format <text1, text2, label> separated by comma
│
├── vocab
│ └── cc.zh.300.vec # pretrained embedding, download and put here
│ └── char.txt # incomplete sample, list of chinese characters
│ └── char.npy # saved pretrained embedding matrix for this task
│
└── main
└── pyramid.ipynb # step 2. train and evaluate model
└── esim.ipynb # step 2. train and evaluate model
└── ......
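make_data.ipynb turns char.txt plus the fastText vectors into the saved matrix char.npy. Here is a hedged sketch of how such a matrix can be built; the random fallback for characters missing from cc.zh.300.vec is an assumption, not necessarily the notebook's choice.

```python
import numpy as np

with open('../vocab/char.txt', encoding='utf-8') as f:
    chars = [line.rstrip('\n') for line in f]
char_set = set(chars)

vectors = {}
with open('../vocab/cc.zh.300.vec', encoding='utf-8') as f:
    next(f)                                        # header line of a .vec file: "<count> <dim>"
    for line in f:
        token, *values = line.rstrip().split(' ')
        if token in char_set:
            vectors[token] = np.asarray(values, dtype=np.float32)

# Characters missing from fastText fall back to small random vectors (an assumption).
embedding = np.stack([vectors.get(c, np.random.uniform(-0.1, 0.1, 300).astype(np.float32))
                      for c in chars])
np.save('../vocab/char.npy', embedding)
```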
-
Task: 微众银行智能客服(Chinese Data)
Training Data: 100000, Testing Data: 10000, Labels: 2
-
Model
-
TensorFlow 2
These results are higher than the results reported here and here
-
TensorFlow 1 + bert4keras
-
<Notebook> BERT -> 85.0% Testing Accuracy
Weights downloaded from here
-
-
-
Data: 2373 Lines of Book Titles(English Data)
-
Model: TF-IDF + LDA
-
PySpark
-
Sklearn + pyLDAvis
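A hedged sketch of the Sklearn + pyLDAvis route follows; the topic count and other parameter values are placeholders, not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis.sklearn          # renamed to pyLDAvis.lda_model in newer pyLDAvis releases

# `titles` stands in for the 2373 book-title lines mentioned above.
titles = ["introduction to machine learning", "a history of modern art"]

vectorizer = TfidfVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(titles)

lda = LatentDirichletAllocation(n_components=10, random_state=0)   # topic count is a placeholder
doc_topics = lda.fit_transform(dtm)

panel = pyLDAvis.sklearn.prepare(lda, dtm, vectorizer)              # interactive topic view
pyLDAvis.display(panel)
```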
-
-
└── finch/tensorflow2/spoken_language_understanding/atis
│
├── data
│ └── glove.840B.300d.txt # pretrained embedding, download and put here
│ └── make_data.ipynb # step 1. run this to generate vocab: word.txt, intent.txt, slot.txt
│ └── atis.train.w-intent.iob # incomplete sample, format <text, slot, intent>
│ └── atis.test.w-intent.iob # incomplete sample, format <text, slot, intent>
│
├── vocab
│ └── word.txt # list of words in vocabulary
│ └── intent.txt # list of intents in vocabulary
│ └── slot.txt # list of slots in vocabulary
│
└── main
└── bigru.ipynb # step 2. train and evaluate model
└── bigru_self_attn.ipynb # step 2. train and evaluate model
└── transformer.ipynb # step 2. train and evaluate model
└── transformer_elu.ipynb # step 2. train and evaluate model
-
Task: ATIS(English Data)
Training Data: 4978, Testing Data: 893
-
Model: Bi-directional RNN
-
TensorFlow 2
-
<Notebook> Bi-GRU
97.4% Intent Acc, 95.4% Slot Micro-F1 on Testing Data
-
<Notebook> Bi-GRU + Self-Attention
97.6% Intent Acc, 95.7% Slot Micro-F1 on Testing Data
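The joint model behind these numbers pairs a sentence-level intent classifier with a token-level slot tagger on top of a shared Bi-GRU. Here is a minimal Keras sketch of that idea; the layer sizes and vocabulary sizes are placeholders, not the notebook's configuration, and padding/masking is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_joint_slu_model(vocab_size, num_intents, num_slots, emb_dim=300, hidden=128):
    # Shared encoder: word embeddings + Bi-GRU.
    words = tf.keras.Input(shape=(None,), dtype=tf.int32)
    x = layers.Embedding(vocab_size, emb_dim)(words)
    h = layers.Bidirectional(layers.GRU(hidden, return_sequences=True))(x)
    # Token-level head: one slot label per position.
    slot_logits = layers.Dense(num_slots, name='slots')(h)
    # Sentence-level head: a second Bi-GRU summarises the sequence for the intent label.
    sent = layers.Bidirectional(layers.GRU(hidden))(h)
    intent_logits = layers.Dense(num_intents, name='intent')(sent)
    return tf.keras.Model(words, [slot_logits, intent_logits])

model = make_joint_slu_model(vocab_size=10000, num_intents=22, num_slots=120)  # sizes are placeholders
model.compile(optimizer='adam',
              loss={'slots': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                    'intent': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)})
```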
-
-
-
Model: ELMo Embedding
-
TensorFlow 1
-
97.5% Intent Acc, 96.1% Slot Micro-F1 on Testing Data
-
-
└── finch/tensorflow1/free_chat/chinese_qingyun
│
├── data
│   └── raw_data.csv # raw data downloaded from an external source
│ └── make_data.ipynb # step 1. run this to generate vocab {char.txt} and data {train.txt & test.txt}
│ └── train.txt # processed text file generated by {make_data.ipynb}
│
├── vocab
│ └── char.txt # list of chars in vocabulary for chinese
│   └── cc.zh.300.vec # fastText pretrained embedding downloaded from an external source
│ └── char.npy # chinese characters and their embedding values (300 dim)
│
└── main
└── lstm_seq2seq_train.ipynb # step 2. train and evaluate model
└── lstm_seq2seq_export.ipynb # step 3. export model
└── lstm_seq2seq_infer.ipynb # step 4. model inference
└── transformer_train.ipynb # step 2. train and evaluate model
└── transformer_export.ipynb # step 3. export model
└── transformer_infer.ipynb # step 4. model inference
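Step 4 (Python inference) loads the artefacts written by the export notebook. Assuming the export step writes a standard TensorFlow SavedModel (the Java inference below appears to load the same transformer_export/ directory), a TF 1.x sketch could look like this; the directory name, signature keys, and input encoding are placeholders, not the notebooks' actual ones.

```python
import tensorflow as tf   # TF 1.x

export_dir = '../data/transformer_export'        # illustrative path

with tf.Session(graph=tf.Graph()) as sess:
    # Load the SavedModel and look up its serving signature.
    meta_graph = tf.saved_model.loader.load(sess, ['serve'], export_dir)
    sig = meta_graph.signature_def['serving_default']
    input_name = sig.inputs['input'].name        # hypothetical signature key
    output_name = sig.outputs['output'].name     # hypothetical signature key
    # Feed a batch of character ids for the query and fetch the generated reply ids.
    answer_ids = sess.run(output_name, {input_name: [[1, 2, 3]]})
```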
-
Task: 青云语料(Chinese Data)
Training Data: 107687, Testing Data: 3350
-
Data
-
Model: RNN Seq2Seq + Attention
-
TensorFlow 1
-
LSTM + Attention + Beam Search -> 3.540 Testing Perplexity
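For reference, the reported testing perplexity is simply the exponential of the average per-token cross-entropy (negative log-likelihood) on the test set:

```python
import numpy as np

# token_nll: per-token negative log-likelihood on the test set (placeholder values here).
token_nll = np.array([1.2, 1.3, 1.3])
perplexity = np.exp(token_nll.mean())   # e.g. a mean NLL of ~1.264 nats corresponds to perplexity 3.540
```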
-
-
Model Inference
-
-
Model: Transformer
-
TensorFlow 1 + texar
-
Transformer (6 Layers, 8 Heads) -> 3.540 Testing Perplexity
-
-
Model Inference
-
-
└── FreeChatInference
│
├── data
│ └── transformer_export/
│ └── char.txt
│ └── libtensorflow-1.14.0.jar
│ └── tensorflow_jni.dll
│
└── src
└── ModelInference.java
└── finch/tensorflow2/semantic_parsing/tree_slu
│
├── data
│ └── glove.840B.300d.txt # pretrained embedding, download and put here
│ └── make_data.ipynb # step 1. run this to generate vocab: word.txt, intent.txt, slot.txt
│ └── train.tsv # incomplete sample, format <text, tokenized_text, tree>
│ └── test.tsv # incomplete sample, format <text, tokenized_text, tree>
│
├── vocab
│ └── source.txt # list of words in vocabulary for source (of seq2seq)
│ └── target.txt # list of words in vocabulary for target (of seq2seq)
│
└── main
└── lstm_seq2seq_tf_addons.ipynb # step 2. train and evaluate model
└── ......
-
Task: Semantic Parsing for Task Oriented Dialog(English Data)
Training Data: 31279, Testing Data: 9042
-
Model: RNN Seq2Seq + Attention
-
TensorFlow 2
-
<Notebook> LSTM + Attention + Beam Search ->
72.4% Exact Match Accuracy on Testing Data
-
<Notebook> LSTM + Attention + Beam Search + Cyclical LR + Label Smoothing ->
74.1% Exact Match Accuracy on Testing Data
-
-
└── finch/tensorflow2/knowledge_graph_completion/wn18
│
├── data
│ └── download_data.ipynb # step 1. run this to download wn18 dataset
│ └── make_data.ipynb # step 2. run this to generate vocabulary: entity.txt, relation.txt
│ └── wn18 # wn18 folder (will be auto created by download_data.ipynb)
│ └── train.txt # incomplete sample, format <entity1, relation, entity2> separated by \t
│ └── valid.txt # incomplete sample, format <entity1, relation, entity2> separated by \t
│ └── test.txt # incomplete sample, format <entity1, relation, entity2> separated by \t
│
├── vocab
│ └── entity.txt # incomplete sample, list of entities in vocabulary
│ └── relation.txt # incomplete sample, list of relations in vocabulary
│
└── main
└── distmult_1-N.ipynb # step 3. train and evaluate model
-
Task: WN18
Training Data: 141442, Testing Data: 5000
-
We use 1-N fast evaluation to greatly accelerate the evaluation process (see the sketch below)
MRR: Mean Reciprocal Rank
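Concretely, 1-N evaluation scores each (head, relation) query against every entity in a single matrix product instead of looping over candidate triples, and MRR is read off the rank of the true tail. Below is a sketch using DistMult-style scoring; the names and shapes are illustrative, and the filtered-ranking detail is omitted.

```python
import numpy as np

def one_to_n_mrr(head_emb, rel_emb, true_tails, entity_emb):
    # head_emb, rel_emb: [num_queries, dim]; entity_emb: [num_entities, dim];
    # true_tails: [num_queries] integer ids of the correct tail entities.
    # DistMult 1-N scoring: score(h, r, every t) in a single matmul.
    scores = (head_emb * rel_emb) @ entity_emb.T               # [num_queries, num_entities]
    # Rank of the true tail = 1 + number of entities scored strictly higher.
    true_scores = scores[np.arange(len(true_tails)), true_tails]
    ranks = 1 + (scores > true_scores[:, None]).sum(axis=1)
    # (Filtering out other known true triples, as in the standard filtered setting, is omitted.)
    return (1.0 / ranks).mean()                                 # MRR: mean reciprocal rank
```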
-
Model: DistMult
-
TensorFlow 2
-
-
Model: TuckER
-
TensorFlow 2
-
-
Model: ComplEx
-
TensorFlow 2
-
-
Data Scraping
-
SPARQL
-
Neo4j + Cypher
└── finch/tensorflow1/question_answering/babi
│
├── data
│ └── make_data.ipynb # step 1. run this to generate vocabulary: word.txt
│ └── qa5_three-arg-relations_train.txt # one complete example of babi dataset
│ └── qa5_three-arg-relations_test.txt # one complete example of babi dataset
│
├── vocab
│ └── word.txt # complete list of words in vocabulary
│
└── main
└── dmn_train.ipynb
└── dmn_serve.ipynb
└── attn_gru_cell.py
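The qa5 files follow the standard bAbI layout: numbered statement lines, and question lines that carry the question, answer, and supporting-fact ids separated by tabs, with the line counter resetting at the start of each new story. A small parsing sketch under that assumption:

```python
def parse_babi(path):
    # Collect (context sentences, question, answer) triples from a bAbI task file.
    stories, context = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            idx, text = line.rstrip('\n').split(' ', 1)
            if int(idx) == 1:                      # id 1 starts a new story
                context = []
            if '\t' in text:                       # question line: question \t answer \t support ids
                question, answer, supports = text.split('\t')
                stories.append((list(context), question, answer))
            else:                                  # plain statement line
                context.append(text)
    return stories

stories = parse_babi('../data/qa5_three-arg-relations_train.txt')
```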
-
Task: bAbI(English Data)
-
Word Extraction
-
Chinese
-
-
Word Segmentation
-
Chinese
-
Custom TensorFlow Op added by applenob
-
-
└── finch/tensorflow1/recommender/movielens
│
├── data
│ └── make_data.ipynb # run this to generate vocabulary
│
├── vocab
│ └── user_job.txt
│ └── user_id.txt
│ └── user_gender.txt
│ └── user_age.txt
│ └── movie_types.txt
│ └── movie_title.txt
│ └── movie_id.txt
│
└── main
└── dnn_softmax.ipynb
└── ......
-
Task: Movielens 1M(English Data)
Training Data: 900228, Testing Data: 99981, Users: 6000, Movies: 4000, Rating: 1-5
-
Model: Fusion
-
TensorFlow 1
MAE: Mean Absolute Error
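Judging from the vocab files above, the Fusion model combines user features (id, gender, age, job) with movie features (id, types, title) before predicting the 1-5 rating, and dnn_softmax.ipynb suggests a softmax over the five rating classes. The following is only a schematic Keras-style sketch of that fusion idea; every size, and the single-id treatment of multi-valued fields such as movie_types and movie_title, is an assumption rather than the notebook's architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def embed(name, vocab_size, dim=32):
    # One categorical input per vocab file listed above (vocabulary sizes are placeholders).
    inp = tf.keras.Input(shape=(), dtype=tf.int32, name=name)
    return inp, layers.Embedding(vocab_size, dim)(inp)

inputs, feats = zip(*[embed(n, v) for n, v in [
    ('user_id', 6041), ('user_gender', 2), ('user_age', 7), ('user_job', 21),
    ('movie_id', 3953), ('movie_types', 19), ('movie_title', 5000)]])

fused = layers.Dense(128, activation='relu')(layers.Concatenate()(list(feats)))
rating_probs = layers.Dense(5, activation='softmax')(fused)        # softmax over ratings 1..5
model = tf.keras.Model(list(inputs), rating_probs)
# MAE can then be computed from the expected rating, sum_k (k + 1) * p_k, against the true rating.
```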
-
└── finch/tensorflow1/multi_turn_rewrite/chinese/
│
├── data
│ └── make_data.ipynb # run this to generate vocab, split train & test data, make pretrained embedding
│   └── corpus.txt # original data downloaded from an external source
│ └── train_pos.txt # processed positive training data after {make_data.ipynb}
│ └── train_neg.txt # processed negative training data after {make_data.ipynb}
│ └── test_pos.txt # processed positive testing data after {make_data.ipynb}
│ └── test_neg.txt # processed negative testing data after {make_data.ipynb}
│
├── vocab
│   └── cc.zh.300.vec # fastText pretrained embedding downloaded from an external source
│ └── char.npy # chinese characters and their embedding values (300 dim)
│ └── char.txt # list of chinese characters used in this project
│
└── main
└── baseline_lstm_train.ipynb
└── baseline_lstm_export.ipynb
└── baseline_lstm_predict.ipynb
-
Task: Multi-turn Dialogue Rewriting(Chinese Data)
Training Data (Positive): 18986, Testing Data (Positive): 1008
(effective Training Data = 2 * 18986 due to 1:1 Negative Sampling)
-
Model: RNN Seq2Seq + Attention + Dynamic Memory
-
TensorFlow 1
-
<Notebook> LSTM Seq2Seq + Attention + Memory + Beam Search
-> BLEU-1: 94.6, BLEU-2: 89.1, BLEU-4: 78.5, EM: 56.2%
This result (without BERT) is comparable to the result here with BERT
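The BLEU-1/2/4 and EM numbers can be reproduced with, for example, NLTK's sentence-level BLEU; here is a minimal sketch, where character-level tokenization and simple per-sentence averaging are assumptions.

```python
from nltk.translate.bleu_score import sentence_bleu

def rewrite_metrics(references, hypotheses):
    # references / hypotheses: lists of token lists (e.g. Chinese characters).
    bleu1 = bleu2 = bleu4 = em = 0.0
    for ref, hyp in zip(references, hypotheses):
        bleu1 += sentence_bleu([ref], hyp, weights=(1, 0, 0, 0))
        bleu2 += sentence_bleu([ref], hyp, weights=(0.5, 0.5, 0, 0))
        bleu4 += sentence_bleu([ref], hyp, weights=(0.25, 0.25, 0.25, 0.25))
        em += float(hyp == ref)                    # exact match
    n = len(references)
    return 100 * bleu1 / n, 100 * bleu2 / n, 100 * bleu4 / n, 100 * em / n
```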
-
-
└── MultiDialogInference
│
├── data
│ └── baseline_lstm_export/
│ └── char.txt
│ └── libtensorflow-1.14.0.jar
│ └── tensorflow_jni.dll
│
└── src
└── ModelInference.java
-
Rule-based System(基于规则的系统)