chore: add show-encoding cmd options
Signed-off-by: Yaohui Wang <wangyaohuicn@gmail.com>
wangyaohui committed Oct 5, 2023
1 parent fdaa42a, commit 1305b51
Showing 2 changed files with 159 additions and 208 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,221 +1,164 @@ | ||
# gocloc | ||
# ctoc | ||
|
||
[![GoDoc](https://godoc.org/github.com/hhatto/gocloc?status.svg)](https://godoc.org/github.com/hhatto/gocloc) | ||
[![ci](https://github.com/hhatto/gocloc/workflows/Go/badge.svg)](https://github.com/hhatto/gocloc/actions) | ||
[![Go Report Card](https://goreportcard.com/badge/github.com/hhatto/gocloc)](https://goreportcard.com/report/github.com/hhatto/gocloc) | ||
[![Docker Pulls](https://img.shields.io/docker/pulls/hhatto/gocloc)](https://hub.docker.com/r/hhatto/gocloc) | ||
[![Docker Image Size](https://img.shields.io/docker/image-size/hhatto/gocloc)](https://hub.docker.com/r/hhatto/gocloc) | ||
_Count Tokens of Code_. | ||
|
||
A little fast [cloc(Count Lines Of Code)](https://github.com/AlDanial/cloc), written in Go. | ||
Inspired by [tokei](https://github.com/Aaronepower/tokei). | ||
> Token counts play a key role in shaping an LLM's memory and conversation history.<br/> | |
> **ctoc** provides a lightweight tool to analyze codebases at the token level. | ||
> | ||
> Built on top of [gocloc](https://github.com/hhatto/gocloc). | ||
## Installation | ||
|
||
require Go 1.19+ | ||
|
||
``` | ||
$ go install github.com/hhatto/gocloc/cmd/gocloc@latest | ||
``` | ||
[![GoDoc](https://godoc.org/github.com/yaohui-wyh/ctoc?status.svg)](https://godoc.org/github.com/yaohui-wyh/ctoc) | ||
[![ci](https://github.com/yaohui-wyh/ctoc/workflows/Go/badge.svg)](https://github.com/yaohui-wyh/ctoc/actions) | ||
[![Go Report Card](https://goreportcard.com/badge/github.com/yaohui-wyh/ctoc)](https://goreportcard.com/report/github.com/yaohui-wyh/ctoc) | |
|
||
Arch Linux user can also install from AUR: [gocloc-git](https://aur.archlinux.org/packages/gocloc-git/). | ||
<details> | ||
<summary>What are <b>Tokens</b>? (in the context of Large Language Models)</summary> | |
|
||
## Usage | ||
> https://learn.microsoft.com/en-us/semantic-kernel/prompt-engineering/tokens | ||
### Basic Usage | ||
``` | ||
$ gocloc . | ||
``` | ||
- **Tokens**: basic units of text/code for LLM AI models to process/generate language. | ||
- **Tokenization**: splitting input/output texts into smaller units for LLM AI models. | ||
- **Vocabulary size**: the number of tokens each model uses, which varies among different GPT models. | ||
- **Tokenization cost**: affects the memory and computational resources that a model needs, which influences the cost | ||
and performance of running an OpenAI or Azure OpenAI model. | ||
</details> | ||
|
||
``` | ||
$ gocloc . | ||
------------------------------------------------------------------------------- | ||
Language files blank comment code | ||
------------------------------------------------------------------------------- | ||
Markdown 3 8 0 18 | ||
Go 1 29 1 323 | ||
------------------------------------------------------------------------------- | ||
TOTAL 4 37 1 341 | ||
------------------------------------------------------------------------------- | ||
``` | ||
## Installation | ||
|
||
### Via Docker | ||
with [dockerhub](https://hub.docker.com/repository/docker/hhatto/gocloc) | ||
``` | ||
$ docker run --rm -v "${PWD}":/workdir hhatto/gocloc . | ||
``` | ||
Requires Go 1.19+ | |
|
||
with [GitHub Packages](https://github.com/hhatto/gocloc/packages/350535) on GitHub Actions | ||
``` | ||
jobs: | ||
build: | ||
name: example of code measurement using gocloc | ||
runs-on: ubuntu-18.04 | ||
steps: | ||
- uses: actions/checkout@master | ||
- name: Login GitHub Registry | ||
run: docker login docker.pkg.github.com -u owner -p ${{ secrets.GITHUB_TOKEN }} | ||
- name: Run gocloc | ||
run: docker run --rm -v "${PWD}":/workdir docker.pkg.github.com/hhatto/gocloc/gocloc:latest . | ||
$ go install github.com/yaohui-wyh/ctoc/cmd/ctoc@latest | ||
``` | ||
|
||
### Integration Jenkins CI | ||
use [SLOCCount Plugin](https://wiki.jenkins-ci.org/display/JENKINS/SLOCCount+Plugin). | ||
## Usage | ||
|
||
``` | ||
$ cloc --by-file --output-type=sloccount . > sloccount.scc | ||
``` | ||
### Basic Usage | ||
|
||
``` | ||
$ cat sloccount.scc | ||
398 Go ./main.go | ||
190 Go ./language.go | ||
132 Markdown ./README.md | ||
24 Go ./xml.go | ||
18 Go ./file.go | ||
15 Go ./option.go | ||
$ ctoc -h | ||
Usage: | ||
ctoc [OPTIONS] | ||
Application Options: | ||
--by-file report results for every encountered source file | ||
--sort=[name|files|blank|comment|code|tokens] sort based on a certain column (default: code) | ||
--output-type= output type [values: default,cloc-xml,sloccount,json] (default: default) | ||
--exclude-ext= exclude file name extensions (separated commas) | ||
--include-lang= include language name (separated commas) | ||
--match= include file name (regex) | ||
--not-match= exclude file name (regex) | ||
--match-d= include dir name (regex) | ||
--not-match-d= exclude dir name (regex) | ||
--debug dump debug log for developer | ||
--skip-duplicated skip duplicated files | ||
--show-lang print about all languages and extensions | ||
--version print version info | ||
--encoding=[cl100k_base|p50k_base|p50k_edit|r50k_base] specify tokenizer encoding (default: cl100k_base) | ||
Help Options: | ||
-h, --help Show this help message | ||
``` | ||
|
||
``` | ||
$ ctoc . | ||
------------------------------------------------------------------------------------------------ | ||
Language files blank comment code tokens | ||
------------------------------------------------------------------------------------------------ | ||
Go 15 282 153 2096 21839 | ||
XML 3 0 0 140 1950 | ||
YAML 1 0 0 40 237 | ||
Markdown 1 13 0 34 322 | ||
Makefile 1 6 0 15 128 | ||
------------------------------------------------------------------------------------------------ | ||
TOTAL 21 301 153 2325 24476 | ||
------------------------------------------------------------------------------------------------ | ||
``` | ||
|
||
## Support Languages | ||
use `--show-lang` option | ||
|
||
``` | ||
$ gocloc --show-lang | ||
``` | ||
> Same as [gocloc](https://github.com/hhatto/gocloc#support-languages) | ||
``` | ||
$ ctoc --show-lang | ||
``` | ||
|
||
## Support Models | ||
|
||
``` | ||
$ ctoc --show-encoding | ||
text-davinci-002 (p50k_base) | ||
text-davinci-001 (r50k_base) | ||
babbage (r50k_base) | ||
text-babbage-001 (r50k_base) | ||
code-cushman-002 (p50k_base) | ||
code-search-ada-code-001 (r50k_base) | ||
text-davinci-003 (p50k_base) | ||
davinci (r50k_base) | ||
text-similarity-ada-001 (r50k_base) | ||
text-curie-001 (r50k_base) | ||
curie (r50k_base) | ||
ada (r50k_base) | ||
code-davinci-002 (p50k_base) | ||
text-davinci-edit-001 (p50k_edit) | ||
text-embedding-ada-002 (cl100k_base) | ||
text-similarity-curie-001 (r50k_base) | ||
text-similarity-babbage-001 (r50k_base) | ||
gpt2 (gpt2) | ||
gpt-4 (cl100k_base) | ||
text-ada-001 (r50k_base) | ||
code-davinci-001 (p50k_base) | ||
text-search-davinci-doc-001 (r50k_base) | ||
text-search-curie-doc-001 (r50k_base) | ||
code-search-babbage-code-001 (r50k_base) | ||
code-cushman-001 (p50k_base) | ||
cushman-codex (p50k_base) | ||
code-davinci-edit-001 (p50k_edit) | ||
gpt-3.5-turbo (cl100k_base) | ||
text-similarity-davinci-001 (r50k_base) | ||
text-search-babbage-doc-001 (r50k_base) | ||
text-search-ada-doc-001 (r50k_base) | ||
davinci-codex (p50k_base) | ||
``` | ||
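The listing above is effectively a lookup table from model name to tokenizer encoding. A minimal sketch of such a lookup follows, using a few rows copied from that output; the `modelEncodings` map and the `encodingFor` helper are hypothetical names for illustration, not ctoc's actual implementation.

```go
package main

import "fmt"

// modelEncodings holds a handful of rows copied from the `--show-encoding`
// listing. It is illustrative only and is not ctoc's internal table.
var modelEncodings = map[string]string{
	"gpt-4":                  "cl100k_base",
	"gpt-3.5-turbo":          "cl100k_base",
	"text-embedding-ada-002": "cl100k_base",
	"text-davinci-003":       "p50k_base",
	"code-davinci-002":       "p50k_base",
	"text-davinci-edit-001":  "p50k_edit",
	"davinci":                "r50k_base",
}

// encodingFor (hypothetical helper) returns the encoding listed for a model
// and reports whether the model appears in the table.
func encodingFor(model string) (string, bool) {
	enc, ok := modelEncodings[model]
	return enc, ok
}

func main() {
	for _, m := range []string{"gpt-4", "text-davinci-003", "unknown-model"} {
		if enc, ok := encodingFor(m); ok {
			fmt.Printf("%s -> %s\n", m, enc)
		} else {
			fmt.Printf("%s -> not in the table\n", m)
		}
	}
}
```

The encoding name found this way is the value to pass to the `--encoding` option, for example `ctoc --encoding=p50k_base .` when counting tokens for `text-davinci-003`.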
|
||
The BPE dictionary for each encoding is downloaded and cached automatically the first time that encoding is used.<br/> | |
For details, see [tiktoken-go#cache](https://github.com/pkoukk/tiktoken-go#cache). | |
|
||
## Performance | ||
* CPU 3.8GHz 8core Intel Core i7 / 32GB 2667MHz DDR4 / MacOSX 13.3.1 | ||
* cloc 1.96 | ||
* tokei 12.1.2 compiled with serialization support: json | ||
* gocloc [a88edc5](https://github.com/hhatto/gocloc/commit/a88edc52b3eea697687f9546f6ac74a03c91c5fb) | ||
* target repository is [golang/go commit:f742ddc](https://github.com/golang/go/tree/f742ddc349723667fc9af5d0f16233f7762aeaa0) | ||
|
||
### cloc | ||
|
||
``` | ||
$ time cloc . | ||
12003 text files. | ||
11150 unique files. | ||
1192 files ignored. | ||
8 errors: | ||
Line count, exceeded timeout: ./src/cmd/dist/build.go | ||
Line count, exceeded timeout: ./src/cmd/trace/static/webcomponents.min.js | ||
Line count, exceeded timeout: ./src/net/http/requestwrite_test.go | ||
Line count, exceeded timeout: ./src/vendor/golang.org/x/net/idna/tables10.0.0.go | ||
Line count, exceeded timeout: ./src/vendor/golang.org/x/net/idna/tables11.0.0.go | ||
Line count, exceeded timeout: ./src/vendor/golang.org/x/net/idna/tables12.0.0.go | ||
Line count, exceeded timeout: ./src/vendor/golang.org/x/net/idna/tables13.0.0.go | ||
Line count, exceeded timeout: ./src/vendor/golang.org/x/net/idna/tables9.0.0.go | ||
github.com/AlDanial/cloc v 1.96 T=35.07 s (317.9 files/s, 78679.3 lines/s) | ||
----------------------------------------------------------------------------------- | ||
Language files blank comment code | ||
----------------------------------------------------------------------------------- | ||
Go 9081 205135 337681 1779107 | ||
Text 1194 11530 0 210849 | ||
Assembly 563 15549 21625 122329 | ||
HTML 17 3197 78 24983 | ||
C 139 1324 982 6895 | ||
JSON 20 0 0 3122 | ||
CSV 1 0 0 2119 | ||
Markdown 27 674 106 1949 | ||
Bourne Shell 16 253 868 1664 | ||
JavaScript 10 234 221 1517 | ||
Perl 10 173 171 1111 | ||
C/C++ Header 26 145 346 724 | ||
Bourne Again Shell 16 120 263 535 | ||
Python 1 133 104 375 | ||
CSS 3 4 13 337 | ||
DOS Batch 5 56 66 207 | ||
Windows Resource File 4 23 0 146 | ||
Logos 2 16 0 101 | ||
Dockerfile 2 13 15 47 | ||
C++ 2 11 14 24 | ||
make 5 9 10 21 | ||
Objective-C 1 2 3 11 | ||
Fortran 90 2 1 3 8 | ||
awk 1 1 6 7 | ||
YAML 1 0 0 5 | ||
MATLAB 1 1 0 4 | ||
----------------------------------------------------------------------------------- | ||
SUM: 11150 238604 362575 2158197 | ||
----------------------------------------------------------------------------------- | ||
cloc . 33.70s user 1.48s system 99% cpu 35.237 total | ||
``` | ||
|
||
### tokei | ||
|
||
``` | ||
$ time tokei --sort code --exclude "**/*.txt" . | ||
=============================================================================== | ||
Language Files Lines Code Comments Blanks | ||
=============================================================================== | ||
Go 9242 2330107 1812147 318036 199924 | ||
GNU Style Assembly 565 159534 127093 16888 15553 | ||
C 143 9272 6949 1000 1323 | ||
JSON 21 3122 3122 0 0 | ||
Shell 16 2785 2267 342 176 | ||
JavaScript 10 1972 1520 218 234 | ||
Perl 9 1360 1032 170 158 | ||
C Header 27 1222 727 349 146 | ||
BASH 16 918 521 279 118 | ||
Python 1 612 421 70 121 | ||
CSS 3 354 337 13 4 | ||
Autoconf 9 283 274 0 9 | ||
Batch 5 329 207 66 56 | ||
Alex 2 117 101 0 16 | ||
Dockerfile 2 75 47 15 13 | ||
C++ 2 49 24 14 11 | ||
Makefile 5 40 20 10 10 | ||
Objective-C 2 21 15 3 3 | ||
FORTRAN Modern 2 12 8 3 1 | ||
Markdown 18 2402 0 1853 549 | ||
------------------------------------------------------------------------------- | ||
HTML 17 19060 18584 49 427 | ||
|- CSS 4 2071 1852 10 209 | ||
|- HTML 1 219 212 0 7 | ||
|- JavaScript 8 6920 6876 16 28 | ||
(Total) 28270 27524 75 671 | ||
=============================================================================== | ||
Total 10117 2533646 1975416 339378 218852 | ||
=============================================================================== | ||
tokei --sort code --exclude "**/*.txt" . 0.76s user 0.50s system 562% cpu 0.224 total | ||
- CPU 2.6GHz 6core Intel Core i7 / 32GB 2667MHz DDR4 / MacOSX 13.5.2 | ||
- ctoc [fdaa42](https://github.com/yaohui-wyh/ctoc/commit/fdaa42) | ||
|
||
``` | ||
➜ kubernetes git:(master) time ctoc . | ||
------------------------------------------------------------------------------------------------ | ||
Language files blank comment code tokens | ||
------------------------------------------------------------------------------------------------ | ||
Go 15172 503395 992193 3921496 53747627 | ||
JSON 430 2 0 1011821 10428573 | ||
YAML 1224 612 1464 156024 974131 | ||
Markdown 461 24842 170 93141 3251948 | ||
BASH 318 6522 12788 33010 528217 | ||
Protocol Buffers 130 5864 19379 12809 358110 | ||
Assembly 50 2212 925 8447 129534 | ||
Plain Text 31 203 0 6664 48218 | ||
Makefile 58 594 940 2027 31548 | ||
Bourne Shell 9 154 119 687 8055 | ||
sed 4 4 32 439 3138 | ||
Python 7 114 160 418 5435 | ||
Zsh 1 14 3 191 1872 | ||
PowerShell 3 44 79 181 2496 | ||
C 5 42 55 140 1799 | ||
TOML 6 31 107 101 2049 | ||
HTML 2 0 0 2 21 | ||
Batch 1 2 17 2 170 | ||
------------------------------------------------------------------------------------------------ | ||
TOTAL 17912 544651 1028431 5247600 69522941 | ||
------------------------------------------------------------------------------------------------ | ||
ctoc . 160.09s user 8.08s system 119% cpu 2:20.96 total | |
``` | ||
|
||
### gocloc | ||
|
||
``` | ||
$ time gocloc --exclude-ext=txt . | ||
------------------------------------------------------------------------------- | ||
Language files blank comment code | ||
------------------------------------------------------------------------------- | ||
Go 9096 205242 352844 1764503 | ||
Assembly 563 15555 21624 122324 | ||
HTML 17 3197 212 24849 | ||
C 139 1324 983 6894 | ||
JSON 20 0 0 3122 | ||
BASH 27 345 1106 2122 | ||
Markdown 18 549 28 1825 | ||
JavaScript 10 234 218 1520 | ||
C Header 26 145 346 724 | ||
Perl 10 173 584 698 | ||
Python 1 133 104 375 | ||
CSS 3 4 13 337 | ||
Batch 5 56 0 273 | ||
Plan9 Shell 4 23 50 96 | ||
Bourne Shell 5 28 24 78 | ||
C++ 2 11 14 24 | ||
Makefile 5 10 10 20 | ||
Objective-C 2 3 3 15 | ||
FORTRAN Modern 2 1 3 8 | ||
Awk 1 1 6 7 | ||
------------------------------------------------------------------------------- | ||
TOTAL 9956 227034 378172 1929814 | ||
------------------------------------------------------------------------------- | ||
gocloc --exclude-ext=txt . 0.65s user 0.51s system 119% cpu 0.970 total | ||
``` | ||
|
||
## License | ||
|
||
MIT |