
Can mmap be used to save memory for the Trie or the dictionary? #3

Open
jannson opened this issue Nov 7, 2013 · 12 comments

@jannson commented Nov 7, 2013

The dictionary / Trie only needs to be built once, in a single process; any other process that needs to segment text could then just read that Trie's memory instead of loading the file and building the dictionary all over again. Using mmap would save memory and further reduce startup time.
There are open-source projects with an mmap-based C++ allocator, for example:
https://github.com/johannesthoma/mmap_allocator

Centralized segmentation could also be done over HTTP, but I believe mmap would be a bigger efficiency win.

@yanyiwu (Owner) commented Nov 15, 2013

@jannson
I haven't used mmap; I'll look into it when I find the time.
That said, loading the dictionary is a one-time initialization, so leaving it unoptimized isn't a big deal.
Optimizing the algorithms inside the segmentation functions themselves would pay off more.

@jannson (Author) commented Nov 18, 2013

The goal of this optimization isn't to run faster but to save memory, because I want to run this on a cloud host where memory is limited. Many processes all need segmentation, and each of them has to load this big dictionary; being able to share the dictionary through shared memory would be ideal.
That said, after trying for a few days I found it isn't as simple as I imagined. It's roughly as this fellow put it:

This is a bit similar to constructing an on-disk DAWG, which I did a while back. What made that so very sweet was that it could be loaded directly with mmap instead of reading the file. If the hash-space is manageable, say 2^16 or 2^24 entries, then I think I would do something like this:

- Keep a list of free indices. (If the table is empty, each chain-index would point at the next index.)
- When chaining is needed, use the free space in the table.
- If you need to put something in an index that's occupied by a squatter (overflow from elsewhere):
  - record the index (let's call it N),
  - swap the new element and the squatter,
  - put the squatter in a new free index (F),
  - follow the chain on the squatter's hash index, to replace N with F.
- If you completely run out of free indices, you probably need a bigger table, but you can cope a little longer by using mremap to create extra room after the table.

This should allow you to mmap and use the table directly, without modification (scary fast if it's in the OS cache!), but you have to work with indices instead of pointers. It's pretty spooky to have megabytes available in syscall-round-trip time, and still have it take up less than that in physical memory, because of paging.

So I've shelved it for now.
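The key point in the quoted approach is replacing pointers with array indices, which makes the structure position-independent and therefore directly mmap-able. A toy illustration of that layout idea (hypothetical fields, not the commenter's actual table):

```cpp
#include <cstdint>

// An array of nodes that reference each other by index is
// position-independent: the same bytes work at whatever address mmap
// returns, so the file can be used in place with no fix-up pass.
struct FlatNode {
    uint16_t label;       // character code on the incoming edge
    int32_t first_child;  // index into the node array, -1 if none
    int32_t next_sibling; // index of the next child of the same parent
    uint8_t is_word;      // terminal flag
};

// Lookup walks indices only; `nodes` could equally be a pointer
// returned by mmap over a prebuilt file.
int find_child(const FlatNode* nodes, int parent, uint16_t label) {
    for (int i = nodes[parent].first_child; i != -1; i = nodes[i].next_sibling)
        if (nodes[i].label == label) return i;
    return -1;
}
```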

@yanyiwu (Owner) commented Nov 18, 2013

@jannson
Ah, now I see. I'd been wondering why you were so concerned about memory.

@jannson (Author) commented Mar 17, 2014

Recently I found some Double-Array Trie implementations; they seem very memory-efficient and fast as well. But I've only used them; I haven't tested or studied them in depth.

On Sun, Mar 16, 2014 at 2:59 PM, soe-coe notifications@github.com wrote:

Could we estimate the size of the dictionary and build a memory pool for the TrieNodes, to cut the overhead of calling new each time?


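The pooling idea in the quoted mail is to allocate TrieNodes in large chunks instead of one `new` per node, which removes per-allocation overhead and improves cache locality. A minimal arena sketch (the `TrieNode` here is a stand-in, not CppJieba's actual type):

```cpp
#include <cstddef>
#include <vector>

struct TrieNode { int value; TrieNode* next; };  // simplified stand-in

// Hands out TrieNodes from pre-allocated chunks. Nodes are never freed
// individually; the whole pool is released at once, which matches how a
// dictionary trie is built once and torn down once.
class NodePool {
public:
    explicit NodePool(size_t chunk = 4096) : chunk_(chunk), used_(chunk) {}
    TrieNode* alloc() {
        if (used_ == chunk_) {            // current chunk exhausted
            chunks_.push_back(std::vector<TrieNode>(chunk_));
            used_ = 0;
        }
        return &chunks_.back()[used_++];
    }
    size_t chunks() const { return chunks_.size(); }
private:
    size_t chunk_, used_;
    std::vector<std::vector<TrieNode> > chunks_;
};
```

Sizing the first chunk from the dictionary's line count, as the suggestion says, would make most builds touch a single chunk.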

@ydlme commented Mar 18, 2014

A Double-Array Trie is indeed a good choice, but I'm not yet familiar enough with the C++11 standard, so I'd better not embarrass myself. Best wishes.
Oh, and I wonder: would threading the TrieNodes into an Aho-Corasick-like structure improve performance?

@yanyiwu (Owner) commented Mar 18, 2014

I read up on the Double-Array Trie; it should indeed help save memory. @jannson is Darts the library you're using?
@soe-coe this project uses very little C++11; it could entirely be written in C++98 syntax. I haven't implemented Aho-Corasick myself; maybe you could fork this project and give it a try :)

@yanyiwu (Owner) commented Mar 18, 2014

@soe-coe
If you want to benchmark, run this in the build directory and look at the timing: `make load_test && time ./test/load_test`

@jannson
I also got bitten by memory usage recently when deploying on some cloud hosts. For now I changed this line in Trie.hpp:

`typedef unordered_map<uint16_t, struct TrieNode*> TrieNodeMap;`

to

`typedef map<uint16_t, struct TrieNode*> TrieNodeMap;`

That cuts memory usage by nearly 50%, while segmentation becomes only about 10% slower.

I've put this change on the less_memory branch; check it out if you need it. :)
When I have time I'll try a DoubleArrayTrie; a pull request would be even better if you have the time :)
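A note on why the swap helps: each trie node carries its own child container, and `std::unordered_map` allocates a bucket array per instance, which dominates when there are millions of nearly empty child maps, while `std::map` only pays per stored child. A simplified stand-in for the types involved (not the real Trie.hpp):

```cpp
#include <cstdint>
#include <map>

struct TrieNode;
// One ordered child map per node: roughly one red-black tree node per
// child and no per-instance bucket array. Lookup is O(log k) in the
// number of children, which is tiny for most nodes, consistent with the
// modest ~10% slowdown reported above.
typedef std::map<uint16_t, TrieNode*> TrieNodeMap;

struct TrieNode {
    TrieNodeMap next;  // children keyed by the next code unit
    bool is_word;
    TrieNode() : is_word(false) {}
};
```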

@jannson (Author) commented Mar 18, 2014

Thanks, the author is really thoughtful. A project like this deserves a hundred upvotes!

@aszxqw I'm using this library: http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/
And here are the small modifications I made: https://github.com/jannson/wordmaker/blob/master/src/cedar.h
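For readers unfamiliar with the structure: a Double-Array Trie (the idea behind cedar and Darts) stores the whole trie in two parallel integer arrays, `base[]` and `check[]`. The child of state `s` on input code `c` lives at `t = base[s] + c`, and the edge is valid iff `check[t] == s`, so lookup is one array access per character and the two flat arrays serialize (or mmap) trivially. A toy walk over hand-built arrays (illustrative only, not cedar's actual layout):

```cpp
#include <cstddef>

// Follow a sequence of input codes through a double-array trie.
// Returns the final state index, or -1 as soon as an edge is missing.
int da_walk(const int* base, const int* check, size_t size,
            int root, const int* codes, size_t n) {
    int s = root;
    for (size_t i = 0; i < n; ++i) {
        int t = base[s] + codes[i];
        if (t < 0 || (size_t)t >= size || check[t] != s) return -1;
        s = t;
    }
    return s;
}
```

Building the arrays (finding non-colliding `base` values) is the hard part, which is exactly what libraries like cedar solve.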

@ultimate010 commented Jun 26, 2015

Aho-Corasick is actually better suited to segmentation than a Double-Array Trie: a single pass over the text finds all the matches. My advisor's segmentation system used it, though unfortunately I never fully understood the code.
Introductions to the principle are easy to find online, but doing it well involves a lot, e.g. how to compress the storage, and serialization, which would also eliminate rebuilding the trie structure on every startup.

@yanyiwu (Owner) commented Jun 26, 2015

@ultimate010 Yes, if I remember correctly, the current CppJieba Trie implementation uses the Aho-Corasick algorithm.

@ultimate010 commented Jun 26, 2015

Oh, I last read the source a year ago and didn't end up using it; it's even better that it has changed now.

@t-k- (Contributor) commented May 5, 2016

@jannson I also looked into data structures for term lookup a while back. I heard a plain trie isn't a great fit for Chinese, and finite-state machines save memory but are hard to understand (at least that was my impression reading whoosh's implementation), so it seems everyone doing Chinese search uses a DAT, though I have no hard statistics.
