If you manage many different kinds of document reports with many templates, how can I classify them by content or by layout using NLP or layout analysis? #4856
WangShunzhiDQ
started this conversation in
General
Replies: 1 comment
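Suppose you manage many different kinds of document reports (doc, docx, Excel, PPT, PDF, JPG, and so on), with many templates and many document types. How can I classify them by document content or by layout, using NLP or layout analysis?
My current idea is to convert every document to images page by page, then use PaddleNLP for annotation and information extraction, and train models per category. This still needs time for experiments and validation.
PaddleNLP is maturing steadily, but I have not been able to find a better approach. I would appreciate guidance from anyone experienced, and a demo would be even better.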
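To make the "classify by layout" half of the question concrete, here is a minimal sketch of one possible approach. It is not a PaddleNLP API. It assumes some layout-analysis step has already produced bounding boxes `(x0, y0, x1, y1)` for the blocks on each page image; the function names, the 4x4 grid, and the 0.9 threshold are all hypothetical choices for illustration. Each page is reduced to a coarse grid-occupancy fingerprint, and pages with similar fingerprints are grouped as sharing a template.

```python
from math import sqrt

GRID = 4  # coarse grid resolution; an arbitrary choice for this sketch


def layout_fingerprint(blocks, page_w, page_h, grid=GRID):
    """Rasterize block boxes (x0, y0, x1, y1) into a grid*grid occupancy vector."""
    cells = [0.0] * (grid * grid)
    cell_w, cell_h = page_w / grid, page_h / grid
    for x0, y0, x1, y1 in blocks:
        for row in range(grid):
            for col in range(grid):
                # Overlap between the block and this grid cell.
                ox = min(x1, (col + 1) * cell_w) - max(x0, col * cell_w)
                oy = min(y1, (row + 1) * cell_h) - max(y0, row * cell_h)
                if ox > 0 and oy > 0:
                    cells[row * grid + col] += (ox * oy) / (cell_w * cell_h)
    return cells


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def group_by_template(fingerprints, threshold=0.9):
    """Greedy grouping: a page joins the first group whose representative is similar enough."""
    groups = []  # list of (representative fingerprint, [page indices])
    for idx, fp in enumerate(fingerprints):
        for rep, members in groups:
            if cosine(rep, fp) >= threshold:
                members.append(idx)
                break
        else:
            groups.append((fp, [idx]))
    return [members for _, members in groups]


# Made-up example pages on a 100x100 canvas (hypothetical layout-analysis output).
pages = [
    [(10, 5, 90, 15), (10, 20, 90, 90)],   # header + full-width body
    [(12, 6, 88, 16), (10, 22, 90, 88)],   # same template, slightly shifted
    [(60, 60, 95, 95)],                    # a single block in the bottom-right corner
]
fingerprints = [layout_fingerprint(blocks, 100, 100) for blocks in pages]
groups = group_by_template(fingerprints)
print(groups)
```

In this toy data the first two pages share a template and the third does not. Grouping pages by layout first could reduce the number of categories that need manual annotation before training a content-based classifier, but that is a design guess and would need the same experiments and validation mentioned above.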