Chinese Typo Correction with Taiwan-LLaMa

Abstract

We developed a language model that identifies commonly misused words with 98.6% accuracy, outperforming GPT-4, which reached 82% accuracy.

Data Generation and Preprocessing

Data Generation

python3 generator.py --number_of_data n --output_dir /path/to/output.json

Data Preprocessing

python3 preprocessing.py \
    --data_dir /path/to/output.json \
    --output_dir_0 /path/to/zero_shot.json \
    --output_dir_1 /path/to/one_shot.json \
    --output_dir_2 /path/to/two_shot.json

For example, the following commands generate 1000 training examples and preprocess them into zero-shot, one-shot, and two-shot files:

python3 generator.py \
    --number_of_data 1000 \
    --output_dir data/output.json

python3 preprocessing.py \
    --data_dir data/output.json \
    --output_dir_0 data/train_1000_zero_shot.json \
    --output_dir_1 data/train_1000_one_shot.json \
    --output_dir_2 data/train_1000_two_shot.json
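
Before training, it can help to confirm that the three preprocessed files were written and parse as JSON. The sketch below is a minimal check, not part of the repository. It assumes each file holds either a single JSON array or one JSON object per line; the actual layout produced by preprocessing.py is not documented here, and the helper name count_records is hypothetical.

```python
import json
from pathlib import Path


def count_records(path):
    """Return the number of records in a JSON or JSON Lines file.

    Assumption: the file is either one JSON array or one JSON value per line.
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return 0
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one JSON value per non-empty line.
        lines = [line for line in text.splitlines() if line.strip()]
        for line in lines:
            json.loads(line)  # raises if a line is not valid JSON
        return len(lines)
    # A top-level list counts its items; any other single value counts as one record.
    return len(data) if isinstance(data, list) else 1


if __name__ == "__main__":
    # File names follow the example commands above.
    for name in ("zero_shot", "one_shot", "two_shot"):
        path = Path("data") / f"train_1000_{name}.json"
        if path.exists():
            print(f"{path}: {count_records(path)} records")
        else:
            print(f"{path}: missing")
```

If a count differs from the value passed to --number_of_data, the preprocessing step may have filtered or expanded examples, which is worth checking before training.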

Training

accelerate launch -m axolotl.cli.train examples/llama-2/qlora_final.yml \
    --datasets.path="/path/to/dataset" \
    --output_dir="/path/to/output/"

Alternatively, edit training_final.sh as needed and run:

bash training_final.sh
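
The contents of training_final.sh are not shown here. As a rough illustration of how the training command above can be parameterized, the sketch below assembles the same accelerate invocation from a dataset path and an output directory. The function name build_train_command and the example paths are hypothetical; only the command-line arguments themselves come from this README.

```python
import subprocess


def build_train_command(dataset_path, output_dir,
                        config="examples/llama-2/qlora_final.yml"):
    """Assemble the axolotl training command from this README as an argument list."""
    return [
        "accelerate", "launch", "-m", "axolotl.cli.train", config,
        f"--datasets.path={dataset_path}",
        f"--output_dir={output_dir}",
    ]


if __name__ == "__main__":
    # Example paths are placeholders; substitute your own dataset and output directory.
    cmd = build_train_command("data/train_1000_zero_shot.json", "qlora-out")
    print(" ".join(cmd))
    # Uncomment to start training (requires accelerate and axolotl to be installed):
    # subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when paths contain spaces.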

Inference and Evaluation

bash run.sh \
    /path/to/Taiwan-LLM-7B-v2.0-chat/ \
    /path/to/qlora-out/ \
    /path/to/test.json \
    /path/to/prediction.json \
    /path/to/combined_prediction.json

Alternatively, edit inference.sh as needed and run:

bash inference.sh
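
The abstract reports accuracy, and one simple way to score predictions is exact-match accuracy against the reference answers. The sketch below assumes predictions and references are available as two parallel lists of strings. How run.sh stores them in prediction.json is not specified here, so the loading step is left out, and the function name exact_match_accuracy is hypothetical. Whether the reported figures were computed exactly this way is also an assumption.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the project's test set.
    preds = ["再接再厲", "按部就班", "迫不急待"]
    refs = ["再接再厲", "按部就班", "迫不及待"]
    print(f"accuracy = {exact_match_accuracy(preds, refs):.3f}")
```

Exact match is strict: a prediction that fixes the typo but adds extra text counts as wrong, so outputs may need normalizing before scoring.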
