GitHub - zakimjz/ECLAT: Original ECLAT algorithm for frequent itemset mining

Eclat uses the original vertical tidset approach for mining all frequent itemsets [1997-eclat], combined with the diffsets improvement [2003-diffsets]. It also contains the maxeclat, clique, and maxclique approaches mentioned in [2000-eclat:tkde].

[1997-eclat] Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li. New algorithms for fast discovery of association rules. In 3rd International Conference on Knowledge Discovery and Data Mining (KDD). August 1997.
[2003-diffsets] Mohammed J. Zaki and Karam Gouda. Fast vertical mining using Diffsets. In 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. August 2003.
[2000-ecalt:tkde] Mohammed J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372–390, May/Jun 2000. doi:10.1109/69.846291.

programs

eclat: eclat.cc is a clean version of only eclat code

assocFB: assocFB has all algorithms (eclat, clique, maxeclat, maxclique)

you will also need the tposedb utils from: https://github.com/zakimjz/tposedb

for the IBM generator see https://github.com/zakimjz/IBMGenerator

steps:

Generate a data file using the IBM data generator program, gen. OR Start with an ascii file (see chess.ascii example file in this directory)

The format of the ascii/binary file should be

If ascii file, first convert to binary using makebin
```
 makebin chess.ascii chess.data
```

Binary file MUST have .data extension

Get configuration by running getconf (gen automatically generates conf file, so this step can be skipped)
```
getconf -i chess -o chess -a
```

Before running the rest you should now have the following files

    chess.data
    chess.conf

run: exttpose -i XXX -o XXX -l -s LMINSUP -a 0

example: exttpose -i chess -o chess -l -s 0 -a 0

note: this produces the files XXX.tpose, and XXX.idx

The XXX.tpose file is the DB in vertical format, and XXX.idx is an index file specifying where the tid-list for each item begins.

You can specify a value of LMINSUP to be the same as the one you will use to run eclat below, in which case you will have to rerun exttpose each time you use a new lower MINSUP. Alternatively, you can use a small value for LMINSUP, and it will continue to work for all values of MINSUP >= LMINSUP when you run elcat.

The time for inverting is stored in summary.out. The format is: TPOSE DB_FILENAME X NUMITEMS TOTAL_TIME

(see note one TOTAL_TIME below)

You should now have the following files:

    chess.data
    chess.conf
    chess.tpose
    chess.idx

run eclat eclat -i XXX -e 1 -s -d -l

 other flags
  -d to use diffsets instead of tidsets (only from pass 3)
  -l use tidset of pass 2 but then use diffsets after that
         (this should NOT be used for sparse datasets, since tidset
          size of pass 2 is smaller than diffset size for
          sparse sets. Thus one would omit this option for
          any dataset generated by IBM gen)

MINSUP is in fractions, i.e., specify 0.5 if you want 50% minsup or 0.01 if you want 1% support.

note that the summary of the run is stored in the summary.out file. The format of this file is as follows:

ECLAT (other options) DB_FILENAME MINSUP NUMTRANS_IN_DB ACTUAL_SUPPORT [ ITER_i |Ci| |Fi| timeForIter_i avg_tidset/diffset_size ] [TOT total_cands tot_freq tot_elapsed_time] NumberofIntersections XXX XXX XXX user_time sys_time

Note3: -e 1 option is a flag indicating eclat to compute the support of 2-itemsets from scratch. The number 1 says there is only one DB partition that will be inverted entirely in main memory. If the original DB is large then this inversion will obviously take too much time. So in this case I recommend dividing the DB into chunks of size roughly 5MB (assuming there is 32MB available to the process). The exttpose program is equiped to handle this case. If you specify a <-p NUMPART> flag to exttpose it will divide the DB into NUMPART chunks. Now you can run eclat with -e NUMPART option. You must do this if the DB is large otherwise the timings for eclat will be skewed. Generally, the more the partitions the better the running time for eclat. For example:

  exttpose -i XXX -o XXX -l -a 0 -s LMINSUP -p 10
  eclat -i XXX -s MINSUP -e 10

In summary run

    eclat -i chess -e 1 -s 0.8 -d -l

since chess is a dense set I have turned on -l option

ORIGINAL ECLAT/CLIQUE/MAX-ECLAT/MAX_CLIQUE

run

  assocFB -e 1 -i XXX -s <MINSUP> -t 1000 for ECLAT
  assocFB -e 1 -i XXX -s <MINSUP> for MAXECLAT
  assocFB -e 1 -i XXX -s <MINSUP> -c -t 1000 for CLIQUE
  assocFB -e 1 -i XXX -s <MINSUP> -c for MAXCLIQUE

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
bin		bin
.gitignore		.gitignore
Array.cc		Array.cc
Array.h		Array.h
Eqclass.cc		Eqclass.cc
Eqclass.h		Eqclass.h
GArray.cc		GArray.cc
GArray.h		GArray.h
Graph.cc		Graph.cc
Graph.h		Graph.h
HashTable.cc		HashTable.cc
HashTable.h		HashTable.h
Itemset.cc		Itemset.cc
Itemset.h		Itemset.h
LICENSE		LICENSE
List.cc		List.cc
Lists.h		Lists.h
Makefile		Makefile
README.md		README.md
Util.cc		Util.cc
Util.h		Util.h
assoc.h		assoc.h
assocFB.cc		assocFB.cc
calcdb.cc		calcdb.cc
calcdb.h		calcdb.h
chess.ascii		chess.ascii
eclat.cc		eclat.cc
extl2.cc		extl2.cc
extl2.h		extl2.h
memman.cc		memman.cc
memman.h		memman.h
partition.cc		partition.cc
partition.h		partition.h
stats.h		stats.h

License

zakimjz/ECLAT

Folders and files

Latest commit

History

Repository files navigation

programs

ORIGINAL ECLAT/CLIQUE/MAX-ECLAT/MAX_CLIQUE

About

Topics

Resources

License

Stars

Watchers

Forks

Languages