The official implementation of the Corpus Definition Language (CDL), the Corpus Manipulation Language (CML), the Corpus Control Language (CCL), and the Corpus Query Language (CQL) of the Quicktext Query Language (QQL). It is a corpus server.

About Quickcorpus

Background

By the time I completed my doctoral dissertation, I still had not found an effective way to manage patent corpora.

My dissertation is about US patents.

I downloaded the US patent data from Google Patents and then cleaned it.

I ran into many problems.

These problems still exist today.

Problems

The problems are as follows:

  1. US patent files are in XML format, but there are many XML schema versions.
  2. I want to clean the XML data. The official cleaning program is based on dom4j. I have also tried other programs, such as Gabe Fierro's solution, which is written in Python and stores the data in a MySQL database, but cleaning still costs me a lot of time.
  3. I first tried storing the XML files in a relational database, but it was slow. Then I tried an XML database, such as the Sedna XML Database, but it crashed many times. Finally, I considered storing the files in the file system and processing them with a full-text engine such as Apache Lucene. That is not a good choice either, because patent files are semi-structured: once the XML files are indexed by Lucene, I cannot analyze the data directly.
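
To illustrate the first problem, here is a minimal sketch (not part of Quickcorpus) that groups XML documents by schema before any cleaning is attempted. It reads only the root element of each document. The function names, the sample documents, and the assumption that the schema version is exposed in a `dtd-version` attribute on the root element are all hypothetical; real bulk files may need extra handling.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def detect_schema(xml_bytes):
    """Return (root tag, dtd-version attribute or None) for one document.

    Only the first "start" event is consumed, so the rest of the
    document is not turned into a tree.
    """
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        # Assumption: the version lives in a "dtd-version" root attribute.
        return elem.tag, elem.get("dtd-version")
    return None, None


def group_by_schema(docs):
    """Map each (root tag, version) pair to the names of matching documents."""
    groups = defaultdict(list)
    for name, data in docs.items():
        groups[detect_schema(data)].append(name)
    return dict(groups)


# Hypothetical, heavily shortened sample documents for illustration only.
SAMPLES = {
    "a.xml": b'<us-patent-grant dtd-version="v4.2 2006-08-23"><claims/></us-patent-grant>',
    "b.xml": b'<us-patent-grant dtd-version="v4.5 2014-04-03"><claims/></us-patent-grant>',
    "c.xml": b"<PATDOC><SDOBI/></PATDOC>",
}

if __name__ == "__main__":
    for schema, names in group_by_schema(SAMPLES).items():
        print(schema, names)
```

Grouping files this way makes it possible to write one cleaning routine per schema version instead of a single parser that has to guess the layout of every file.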

My Solution

In recent years, I have designed a new domain-specific language (DSL) for managing corpora of papers.

I will design a new corpus server oriented toward US patents.

I have read the source code of some relational databases, such as Apache Derby and SQLite.

I will soon design a new corpus server based on the principles of DSL, relational database (RDB), and indexing technology.

There are many differences between database query languages and my solution.

Please see the features:

Features

For more information, please visit:

http://www.quicktext.org/

Genix
