chore: add show-encoding cmd options
Signed-off-by: Yaohui Wang <wangyaohuicn@gmail.com>
wangyaohui committed Oct 5, 2023
1 parent fdaa42a commit 1305b51
Showing 2 changed files with 159 additions and 208 deletions.
331 changes: 137 additions & 194 deletions README.md
@@ -1,221 +1,164 @@
# gocloc
# ctoc

[![GoDoc](https://godoc.org/github.com/hhatto/gocloc?status.svg)](https://godoc.org/github.com/hhatto/gocloc)
[![ci](https://github.com/hhatto/gocloc/workflows/Go/badge.svg)](https://github.com/hhatto/gocloc/actions)
[![Go Report Card](https://goreportcard.com/badge/github.com/hhatto/gocloc)](https://goreportcard.com/report/github.com/hhatto/gocloc)
[![Docker Pulls](https://img.shields.io/docker/pulls/hhatto/gocloc)](https://hub.docker.com/r/hhatto/gocloc)
[![Docker Image Size](https://img.shields.io/docker/image-size/hhatto/gocloc)](https://hub.docker.com/r/hhatto/gocloc)
_Count Tokens of Code_.

A little fast [cloc(Count Lines Of Code)](https://github.com/AlDanial/cloc), written in Go.
Inspired by [tokei](https://github.com/Aaronepower/tokei).
> Token counts play a key role in shaping an LLM's memory and conversation history.<br/>
> **ctoc** provides a lightweight tool to analyze codebases at the token level.
>
> Built on top of [gocloc](https://github.com/hhatto/gocloc).
## Installation

Requires Go 1.19+

```
$ go install github.com/hhatto/gocloc/cmd/gocloc@latest
```
[![GoDoc](https://godoc.org/github.com/yaohui-wyh/ctoc?status.svg)](https://godoc.org/github.com/yaohui-wyh/ctoc)
[![ci](https://github.com/yaohui-wyh/ctoc/workflows/Go/badge.svg)](https://github.com/yaohui-wyh/ctoc/actions)
[![Go Report Card](https://goreportcard.com/badge/github.com/hhatto/gocloc)](https://goreportcard.com/report/github.com/yaohui-wyh/ctoc)

Arch Linux users can also install from the AUR: [gocloc-git](https://aur.archlinux.org/packages/gocloc-git/).
<details>
<summary>What are <b>Tokens</b>? (in the context of Large Language Models)</summary>

## Usage
> https://learn.microsoft.com/en-us/semantic-kernel/prompt-engineering/tokens
### Basic Usage
```
$ gocloc .
```
- **Tokens**: basic units of text/code for LLM AI models to process/generate language.
- **Tokenization**: splitting input/output texts into smaller units for LLM AI models.
- **Vocabulary size**: the number of tokens each model uses, which varies among different GPT models.
- **Tokenization cost**: affects the memory and computational resources that a model needs, which influences the cost
and performance of running an OpenAI or Azure OpenAI model.
</details>

```
$ gocloc .
-------------------------------------------------------------------------------
Language                        files         blank       comment          code
-------------------------------------------------------------------------------
Markdown                            3             8             0            18
Go                                  1            29             1           323
-------------------------------------------------------------------------------
TOTAL                               4            37             1           341
-------------------------------------------------------------------------------
```
## Installation

### Via Docker
with [dockerhub](https://hub.docker.com/repository/docker/hhatto/gocloc)
```
$ docker run --rm -v "${PWD}":/workdir hhatto/gocloc .
```
Requires Go 1.19+

with [GitHub Packages](https://github.com/hhatto/gocloc/packages/350535) on GitHub Actions
```
jobs:
  build:
    name: example of code measurement using gocloc
    runs-on: ubuntu-18.04
    steps:
      - uses: actions/checkout@master
      - name: Login GitHub Registry
        run: docker login docker.pkg.github.com -u owner -p ${{ secrets.GITHUB_TOKEN }}
      - name: Run gocloc
        run: docker run --rm -v "${PWD}":/workdir docker.pkg.github.com/hhatto/gocloc/gocloc:latest .
$ go install github.com/yaohui-wyh/ctoc/cmd/ctoc@latest
```

### Integration Jenkins CI
use [SLOCCount Plugin](https://wiki.jenkins-ci.org/display/JENKINS/SLOCCount+Plugin).
## Usage

```
$ cloc --by-file --output-type=sloccount . > sloccount.scc
```
### Basic Usage

```
$ cat sloccount.scc
398 Go ./main.go
190 Go ./language.go
132 Markdown ./README.md
24 Go ./xml.go
18 Go ./file.go
15 Go ./option.go
$ ctoc -h
Usage:
ctoc [OPTIONS]
Application Options:
--by-file report results for every encountered source file
--sort=[name|files|blank|comment|code|tokens] sort based on a certain column (default: code)
--output-type= output type [values: default,cloc-xml,sloccount,json] (default: default)
--exclude-ext= exclude file name extensions (separated commas)
--include-lang= include language name (separated commas)
--match= include file name (regex)
--not-match= exclude file name (regex)
--match-d= include dir name (regex)
--not-match-d= exclude dir name (regex)
--debug dump debug log for developer
--skip-duplicated skip duplicated files
--show-lang print about all languages and extensions
--version print version info
--encoding=[cl100k_base|p50k_base|p50k_edit|r50k_base] specify tokenizer encoding (default: cl100k_base)
Help Options:
-h, --help Show this help message
```

```
$ ctoc .
------------------------------------------------------------------------------------------------
Language                           files         blank       comment          code        tokens
------------------------------------------------------------------------------------------------
Go                                    15           282           153          2096         21839
XML                                    3             0             0           140          1950
YAML                                   1             0             0            40           237
Markdown                               1            13             0            34           322
Makefile                               1             6             0            15           128
------------------------------------------------------------------------------------------------
TOTAL                                 21           301           153          2325         24476
------------------------------------------------------------------------------------------------
```
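
As a rough reading aid for the table above, the `tokens` and `code` columns can be combined into a tokens-per-line density for each language. The sketch below is only an illustration: `langStat` and `tokensPerLine` are hypothetical names that are not part of ctoc, and the ratio is approximate because it divides the token count by code lines only.

```go
package main

import "fmt"

// langStat holds the columns of one summary row that this sketch needs.
// The type is hypothetical and is not part of ctoc.
type langStat struct {
	Name   string
	Code   int // value of the "code" column
	Tokens int // value of the "tokens" column
}

// tokensPerLine returns the average number of tokens per line of code.
// It returns 0 when there are no code lines, to avoid dividing by zero.
func tokensPerLine(s langStat) float64 {
	if s.Code == 0 {
		return 0
	}
	return float64(s.Tokens) / float64(s.Code)
}

func main() {
	// Rows copied from the sample `ctoc .` output above.
	rows := []langStat{
		{"Go", 2096, 21839},
		{"XML", 140, 1950},
		{"YAML", 40, 237},
		{"Markdown", 34, 322},
		{"Makefile", 15, 128},
	}
	for _, r := range rows {
		fmt.Printf("%-10s %6.2f tokens/line\n", r.Name, tokensPerLine(r))
	}
}
```

A density like this can help when estimating how much of a codebase fits into a fixed token budget, but it should be treated as a ballpark figure only.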

## Support Languages
use `--show-lang` option

```
$ gocloc --show-lang
```
> Same as [gocloc](https://github.com/hhatto/gocloc#support-languages)
```
$ ctoc --show-lang
```

## Support Models

```
$ ctoc --show-encoding
text-davinci-002 (p50k_base)
text-davinci-001 (r50k_base)
babbage (r50k_base)
text-babbage-001 (r50k_base)
code-cushman-002 (p50k_base)
code-search-ada-code-001 (r50k_base)
text-davinci-003 (p50k_base)
davinci (r50k_base)
text-similarity-ada-001 (r50k_base)
text-curie-001 (r50k_base)
curie (r50k_base)
ada (r50k_base)
code-davinci-002 (p50k_base)
text-davinci-edit-001 (p50k_edit)
text-embedding-ada-002 (cl100k_base)
text-similarity-curie-001 (r50k_base)
text-similarity-babbage-001 (r50k_base)
gpt2 (gpt2)
gpt-4 (cl100k_base)
text-ada-001 (r50k_base)
code-davinci-001 (p50k_base)
text-search-davinci-doc-001 (r50k_base)
text-search-curie-doc-001 (r50k_base)
code-search-babbage-code-001 (r50k_base)
code-cushman-001 (p50k_base)
cushman-codex (p50k_base)
code-davinci-edit-001 (p50k_edit)
gpt-3.5-turbo (cl100k_base)
text-similarity-davinci-001 (r50k_base)
text-search-babbage-doc-001 (r50k_base)
text-search-ada-doc-001 (r50k_base)
davinci-codex (p50k_base)
```
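
Because many models share a tokenizer, it can be handy to group the listing above by encoding before choosing a value for `--encoding`. The following is a minimal sketch that assumes the `model (encoding)` line format shown above; `groupByEncoding` is a hypothetical helper, not something ctoc provides.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// groupByEncoding parses lines of the form "model (encoding)" and returns
// a map from encoding name to the sorted list of models that use it.
// Lines that do not match this shape are skipped.
func groupByEncoding(output string) map[string][]string {
	groups := make(map[string][]string)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		open := strings.LastIndex(line, "(")
		if open <= 0 || !strings.HasSuffix(line, ")") {
			continue
		}
		model := strings.TrimSpace(line[:open])
		enc := line[open+1 : len(line)-1]
		if model == "" || enc == "" {
			continue
		}
		groups[enc] = append(groups[enc], model)
	}
	for _, models := range groups {
		sort.Strings(models)
	}
	return groups
}

func main() {
	// A few lines copied from the `ctoc --show-encoding` listing above.
	sample := `gpt-4 (cl100k_base)
gpt-3.5-turbo (cl100k_base)
text-embedding-ada-002 (cl100k_base)
text-davinci-003 (p50k_base)
text-davinci-edit-001 (p50k_edit)
davinci (r50k_base)`

	groups := groupByEncoding(sample)

	// Print encodings in a stable, sorted order.
	encs := make([]string, 0, len(groups))
	for enc := range groups {
		encs = append(encs, enc)
	}
	sort.Strings(encs)
	for _, enc := range encs {
		fmt.Printf("%s: %s\n", enc, strings.Join(groups[enc], ", "))
	}
}
```

In practice the input would come from piping `ctoc --show-encoding` into such a program; the sample string is used here only to keep the sketch self-contained.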

The BPE dictionary for each encoding is downloaded automatically and cached the first time that encoding is used.<br/>
For additional information, please refer to [tiktoken-go#cache](https://github.com/pkoukk/tiktoken-go#cache).

## Performance
* CPU 3.8GHz 8core Intel Core i7 / 32GB 2667MHz DDR4 / MacOSX 13.3.1
* cloc 1.96
* tokei 12.1.2 compiled with serialization support: json
* gocloc [a88edc5](https://github.com/hhatto/gocloc/commit/a88edc52b3eea697687f9546f6ac74a03c91c5fb)
* target repository is [golang/go commit:f742ddc](https://github.com/golang/go/tree/f742ddc349723667fc9af5d0f16233f7762aeaa0)

### cloc

```
$ time cloc .
12003 text files.
11150 unique files.
1192 files ignored.
8 errors:
Line count, exceeded timeout: ./src/cmd/dist/build.go
Line count, exceeded timeout: ./src/cmd/trace/static/webcomponents.min.js
Line count, exceeded timeout: ./src/net/http/requestwrite_test.go
Line count, exceeded timeout: ./src/vendor/golang.org/x/net/idna/tables10.0.0.go
Line count, exceeded timeout: ./src/vendor/golang.org/x/net/idna/tables11.0.0.go
Line count, exceeded timeout: ./src/vendor/golang.org/x/net/idna/tables12.0.0.go
Line count, exceeded timeout: ./src/vendor/golang.org/x/net/idna/tables13.0.0.go
Line count, exceeded timeout: ./src/vendor/golang.org/x/net/idna/tables9.0.0.go
github.com/AlDanial/cloc v 1.96 T=35.07 s (317.9 files/s, 78679.3 lines/s)
-----------------------------------------------------------------------------------
Language files blank comment code
-----------------------------------------------------------------------------------
Go 9081 205135 337681 1779107
Text 1194 11530 0 210849
Assembly 563 15549 21625 122329
HTML 17 3197 78 24983
C 139 1324 982 6895
JSON 20 0 0 3122
CSV 1 0 0 2119
Markdown 27 674 106 1949
Bourne Shell 16 253 868 1664
JavaScript 10 234 221 1517
Perl 10 173 171 1111
C/C++ Header 26 145 346 724
Bourne Again Shell 16 120 263 535
Python 1 133 104 375
CSS 3 4 13 337
DOS Batch 5 56 66 207
Windows Resource File 4 23 0 146
Logos 2 16 0 101
Dockerfile 2 13 15 47
C++ 2 11 14 24
make 5 9 10 21
Objective-C 1 2 3 11
Fortran 90 2 1 3 8
awk 1 1 6 7
YAML 1 0 0 5
MATLAB 1 1 0 4
-----------------------------------------------------------------------------------
SUM: 11150 238604 362575 2158197
-----------------------------------------------------------------------------------
cloc . 33.70s user 1.48s system 99% cpu 35.237 total
```

### tokei

```
$ time tokei --sort code --exclude "**/*.txt" .
===============================================================================
Language Files Lines Code Comments Blanks
===============================================================================
Go 9242 2330107 1812147 318036 199924
GNU Style Assembly 565 159534 127093 16888 15553
C 143 9272 6949 1000 1323
JSON 21 3122 3122 0 0
Shell 16 2785 2267 342 176
JavaScript 10 1972 1520 218 234
Perl 9 1360 1032 170 158
C Header 27 1222 727 349 146
BASH 16 918 521 279 118
Python 1 612 421 70 121
CSS 3 354 337 13 4
Autoconf 9 283 274 0 9
Batch 5 329 207 66 56
Alex 2 117 101 0 16
Dockerfile 2 75 47 15 13
C++ 2 49 24 14 11
Makefile 5 40 20 10 10
Objective-C 2 21 15 3 3
FORTRAN Modern 2 12 8 3 1
Markdown 18 2402 0 1853 549
-------------------------------------------------------------------------------
HTML 17 19060 18584 49 427
|- CSS 4 2071 1852 10 209
|- HTML 1 219 212 0 7
|- JavaScript 8 6920 6876 16 28
(Total) 28270 27524 75 671
===============================================================================
Total 10117 2533646 1975416 339378 218852
===============================================================================
tokei --sort code --exclude "**/*.txt" . 0.76s user 0.50s system 562% cpu 0.224 total
- CPU 2.6GHz 6core Intel Core i7 / 32GB 2667MHz DDR4 / MacOSX 13.5.2
- ctoc [fdaa42](https://github.com/yaohui-wyh/ctoc/commit/fdaa42)

```
➜ kubernetes git:(master) time ctoc .
------------------------------------------------------------------------------------------------
Language                           files         blank       comment          code        tokens
------------------------------------------------------------------------------------------------
Go                                 15172        503395        992193       3921496      53747627
JSON                                 430             2             0       1011821      10428573
YAML                                1224           612          1464        156024        974131
Markdown                             461         24842           170         93141       3251948
BASH                                 318          6522         12788         33010        528217
Protocol Buffers                     130          5864         19379         12809        358110
Assembly                              50          2212           925          8447        129534
Plain Text                            31           203             0          6664         48218
Makefile                              58           594           940          2027         31548
Bourne Shell                           9           154           119           687          8055
sed                                    4             4            32           439          3138
Python                                 7           114           160           418          5435
Zsh                                    1            14             3           191          1872
PowerShell                             3            44            79           181          2496
C                                      5            42            55           140          1799
TOML                                   6            31           107           101          2049
HTML                                   2             0             0             2            21
Batch                                  1             2            17             2           170
------------------------------------------------------------------------------------------------
TOTAL                              17912        544651       1028431       5247600      69522941
------------------------------------------------------------------------------------------------
ctoc . 160.09s user 8.08s system 119% cpu 2:20.96 total
```

### gocloc

```
$ time gocloc --exclude-ext=txt .
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Go 9096 205242 352844 1764503
Assembly 563 15555 21624 122324
HTML 17 3197 212 24849
C 139 1324 983 6894
JSON 20 0 0 3122
BASH 27 345 1106 2122
Markdown 18 549 28 1825
JavaScript 10 234 218 1520
C Header 26 145 346 724
Perl 10 173 584 698
Python 1 133 104 375
CSS 3 4 13 337
Batch 5 56 0 273
Plan9 Shell 4 23 50 96
Bourne Shell 5 28 24 78
C++ 2 11 14 24
Makefile 5 10 10 20
Objective-C 2 3 3 15
FORTRAN Modern 2 1 3 8
Awk 1 1 6 7
-------------------------------------------------------------------------------
TOTAL 9956 227034 378172 1929814
-------------------------------------------------------------------------------
gocloc --exclude-ext=txt . 0.65s user 0.51s system 119% cpu 0.970 total
```

## License

MIT