Examples running existing algorithms on existing data #6

Open
nkons opened this issue Feb 21, 2018 · 21 comments

Comments

@nkons

nkons commented Feb 21, 2018

Hi,

I am trying to invoke from Eclipse the main method in de.metanome.cli.App, in order to get one of the existing algorithms to produce profiling data about some data I have locally.

Looking at the documentation (user/developer guides), it wasn't obvious how to do this, so hopefully this issue will be of help to others as well.

My starting point would be to use DUCC to discover keys in a set of CSV files.
The arguments I tried using look like the following, but no luck yet (I keep getting 'Could not initialize algorithm.').

-a de.metanome.algorithms.ducc.DuccAlgorithm
--files load:/path/to/a.csv;/path/to/b.csv;/path/to/c.csv
--file-key INPUT_GENERATOR
--algorithm-config NULL_EQUALS_NULL:true VALIDATE_PARALLEL:false 
 ENABLE_MEMORY_GUARDIAN:false MAX_UCC_SIZE:1000 INPUT_ROW_LIMIT:1000

So, I believe some examples using metanome-cli would be useful, e.g.:
a) on top of existing CSV files (say, DUCC, for key discovery), and
b) on top of a relational backend (say, BINDER-Database, to discover foreign keys in 3 tables; what happens if these are in different schemas/databases?). It is not clear how to store database connection settings (is a ProfileDB necessary? If so, what would an example look like?).

Thank you in advance.

Best,
Nikos

@sekruse
Owner

sekruse commented Feb 21, 2018

Hi Nikos,

thanks for pointing these issues out and I think having examples in the readme is an excellent idea!

That being said, let me see if I can help you troubleshoot your problem at hand:

  • Do you have a full stacktrace of your error?
  • Is the DUCC algorithm jar on the classpath of your eclipse run configuration? That's probably a step that you have to do manually.
  • The --files parameter only needs the load: prefix if you are passing a file that contains file names, e.g., --files load:my-file-list.txt.
  • If you want to run your algorithm on a database, you need to
    1. pass a PGPass file via --db-connection that specifies the database to connect to, including credentials,
    2. specify the --table-key, and
    3. the --tables.
      Beware that algorithms implement different ways of handling databases. Some algorithms provide dedicated algorithm classes (e.g., BINDER afaik), while others don't.

Last but not least, Metanome CLI does not support multiple database connections (although technically that should be possible).
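
A minimal sketch of the `load:` convention described above, assuming that a `load:`-prefixed value names a text file with one path per line and that any other value is itself a path. `LoadPrefixSketch` and `expand` are hypothetical names for illustration only; this is not metanome-cli's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of metanome-cli.
final class LoadPrefixSketch {

    static final String LOAD_PREFIX = "load:";

    // Expands each "load:<list-file>" entry into the non-blank lines of that
    // file; every other entry is treated as a path and kept unchanged.
    static List<String> expand(List<String> values) throws IOException {
        List<String> result = new ArrayList<>();
        for (String value : values) {
            if (value.startsWith(LOAD_PREFIX)) {
                Path listFile = Paths.get(value.substring(LOAD_PREFIX.length()));
                for (String line : Files.readAllLines(listFile)) {
                    if (!line.trim().isEmpty()) {
                        result.add(line.trim());
                    }
                }
            } else {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Write a throwaway list file, then mix it with a directly named CSV.
        Path listFile = Files.createTempFile("my-file-list", ".txt");
        Files.write(listFile, Arrays.asList("/path/to/a.csv", "/path/to/b.csv"));
        System.out.println(expand(Arrays.asList("load:" + listFile, "/path/to/c.csv")));
        Files.delete(listFile);
    }
}
```

Under this reading, `--files load:/path/to/a.csv` would try to read a.csv as a list of file names instead of profiling it, which is why the prefix should be dropped when passing CSV files directly.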

@nkons
Author

nkons commented Feb 22, 2018

Hi,

Thanks for the prompt response. I am now able to run Ducc. Apparently, I had the class name wrong, so the constructor was missing. Corrected it to de.metanome.algorithms.ducc.Ducc, and it now works.

Also created a pgpass file, and I am able to process relational data.

Now, the problem is with BINDERDatabase as, to my understanding, the algorithm needs a set of tables as input, which I can provide by passing the argument to metanome-cli:
--inputs load:/path/to/tables.txt
but when it comes to algorithm parameters, I am not sure how to pass more than one input table to --algorithm-config. Specifically, the arguments look like the following:

-a de.metanome.algorithms.binder.BINDERDatabase
--inputs load:/path/to/tables.txt
--input-key INPUT_DATABASE 
--db-connection /path/to/db.pgpass
--db-type postgresql
--output print
--algorithm-config
TEMP_FOLDER_PATH:/tmp/
DATABASE_NAME:testdb
DATABASE_TYPE:POSTGRESQL
INPUT_TABLES:load:/path/to/tables.txt

The issue here is with the last line of the above. BINDERDatabase requires the table names to be specified using INPUT_TABLES, but I wasn't sure how to pass an array of values for this parameter in the algorithm configuration. Just leaving the --inputs and omitting INPUT_TABLES: causes the algorithm to crash.

So, I would like to ask whether there is a way to specify multi-valued algorithm parameters, using load:, or e.g. something like:

--algorithm-config INPUT_TABLES:[schema1.table1,schema2.table2,schema3.table3]

Thanks for your support.

Best,
Nikos

@sekruse
Owner

sekruse commented Feb 22, 2018

I am glad to hear that DUCC is already running!

I had a look at BINDERDatabase to see how it's configured.
Apparently, it requires a DatabaseConnectionGenerator for which the current setup code in metanome-cli seems weird.

-a de.metanome.algorithms.binder.BINDERDatabase
--input-key INPUT_DATABASE
--inputs some_ignored_value
--db-connection /path/to/db.pgpass
--db-type postgresql
--output print
--algorithm-config
TEMP_FOLDER_PATH:/tmp/
DATABASE_NAME:testdb
DATABASE_TYPE:POSTGRESQL
INPUT_TABLES:???
...

Unfortunately, multiple values are not supported in the current version. But I have seen just now that this pull request might introduce the functionality by supporting repeated key:value pairs, i.e. --algorithm-config INPUT_TABLES:table1 INPUT_TABLES:table2 ....
Unfortunately, the contributor has changed the formatting, so it will take some time to review the changes, but I guess that this can help you.

I would be happy to hear from you whether the PR works fine for you!
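
To make the idea of repeated key:value pairs concrete, here is a minimal sketch that groups them into multi-valued parameters. It assumes each pair is split at its first colon and that a repeated key simply accumulates values; `RepeatedConfigSketch` and `parse` are hypothetical names, and the pull request mentioned above may behave differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for illustration; not taken from metanome-cli or the PR.
final class RepeatedConfigSketch {

    // Groups "KEY:value" pairs by key, keeping the order of first appearance.
    // Only the first colon separates key and value, so a value may itself
    // contain colons (e.g. "load:/path/to/tables.txt").
    static Map<String, List<String>> parse(List<String> pairs) {
        Map<String, List<String>> config = new LinkedHashMap<>();
        for (String pair : pairs) {
            int colon = pair.indexOf(':');
            if (colon < 1) {
                throw new IllegalArgumentException("Expected KEY:value but got: " + pair);
            }
            String key = pair.substring(0, colon);
            String value = pair.substring(colon + 1);
            config.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return config;
    }

    public static void main(String[] args) {
        System.out.println(parse(Arrays.asList(
                "TEMP_FOLDER_PATH:/tmp/",
                "INPUT_TABLES:schema1.table1",
                "INPUT_TABLES:schema2.table2")));
    }
}
```

With this grouping, a single-valued parameter is just a list of length one, so existing single-value configurations would keep working unchanged.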

@fyndalf

fyndalf commented Sep 3, 2018

Hi, sorry for hijacking this issue, I didn't know where else to ask this:

When running the metanome-cli, I faced two problems:

1.) When running BINDERDatabase,

Initializing algorithm.
Algorithm does not implement a supported input method (relational/tables).

was returned, despite using the latest algorithm and a (hopefully correct) configuration. I can't quite see how I should change my configuration to fix this.

2.) When running BINDERFile, I still don't quite understand what the --input-key parameter does, as I've already specified the files using the --files key. Putting anything as an input key yields a de.metanome.algorithm_integration.AlgorithmConfigurationException: Unknown configuration: FILENAME -> de.metanome.backend.input.file.DefaultFileInputGenerator. What do I actually have to provide with that key for the algorithm to work?

Thanks in advance!

@sekruse
Owner

sekruse commented Sep 3, 2018

No worries, let's see if I can help you there.

  1. BINDERDatabase is a DatabaseConnectionParameterAlgorithm, but it implements neither RelationalInputParameterAlgorithm nor TableInputParameterAlgorithm, as expected by Metanome CLI. Turns out that BINDERDatabase creates the table inputs itself - a case that Metanome CLI apparently did not account for. However, in the light of this new insight, Metanome CLI should only fail if a --db-connection/parameters.pgpassPath is specified and the specified algorithm is neither a RelationalInputParameterAlgorithm nor a TableInputParameterAlgorithm nor a DatabaseConnectionParameterAlgorithm. Furthermore, it should not be possible for an algorithm to be a (RelationalInputParameterAlgorithm xor TableInputParameterAlgorithm) and DatabaseConnectionParameterAlgorithm at the same time, as already allowed. All this should be easily achieved by restructuring this code branch.

    Do you want to give it a try and see if that does the trick for you? It would be awesome if you could share a PR on success!

  2. Metanome algorithms are configured via key-value pairs. The input files are also a key-value pair, but Metanome CLI has a special way of exposing them (namely via --input-key <key> --input-files <values...>), because they require special interpretation. For BINDERFile, the --input-key parameter must be INPUT_FILES.

Feel free to reach out if you have further questions!
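
The first rule from point 1 (fail only if a database connection is given and the algorithm implements none of the three interfaces) could be sketched roughly as follows. The three interfaces are declared here as empty local stand-ins so the example compiles on its own; `InputMethodCheckSketch`, `check`, and `ConnectionOnlyAlgorithm` are hypothetical names, and this is not the actual code branch in metanome-cli. The second remark about combining interfaces is not modeled.

```java
// Hypothetical sketch for illustration; not the actual metanome-cli code.
final class InputMethodCheckSketch {

    // True if the algorithm can take its input from a database in any of
    // the three ways named in the comment above.
    static boolean acceptsDatabaseInput(Object algorithm) {
        return algorithm instanceof RelationalInputParameterAlgorithm
                || algorithm instanceof TableInputParameterAlgorithm
                || algorithm instanceof DatabaseConnectionParameterAlgorithm;
    }

    // pgpassPath is null when no --db-connection was given; in that case
    // this check has nothing to reject.
    static void check(String pgpassPath, Object algorithm) {
        if (pgpassPath != null && !acceptsDatabaseInput(algorithm)) {
            throw new IllegalArgumentException(
                    "Algorithm does not implement a supported database input method.");
        }
    }

    // Stand-in for an algorithm that, like BINDERDatabase, only takes a
    // database connection and creates its table inputs itself.
    static final class ConnectionOnlyAlgorithm implements DatabaseConnectionParameterAlgorithm {}

    public static void main(String[] args) {
        check("/path/to/db.pgpass", new ConnectionOnlyAlgorithm()); // accepted
        try {
            check("/path/to/db.pgpass", new Object());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}

// Empty local stand-ins for the Metanome interfaces of the same names.
interface RelationalInputParameterAlgorithm {}
interface TableInputParameterAlgorithm {}
interface DatabaseConnectionParameterAlgorithm {}
```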

@wunderbarr

Hello,
I would like to ask about running CSV files. I am testing with the adult CSV, and my command is: java -cp metanome-cli-1.1.0.jar:pyro-distro-1.0-SNAPSHOT-distro.jar de.metanome.cli.App --algorithm de.hpi.isg.pyro.algorithms.Pyro --files load:file.txt --file-key INPUT_GENERATOR. In file.txt I store the path to the CSV file.
And I get the error:
Running de.hpi.isg.pyro.algorithms.Pyro

  • in: [load:file.txt]
  • out: file
  • configuration: []

Initializing algorithm.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Could not initialize algorithm.
java.lang.IllegalArgumentException: Unsupported argument.
    at de.hpi.isg.pyro.algorithms.Pyro.setRelationalInputConfigurationValue(Pyro.java:407)
    at de.metanome.cli.App.setUpInputGenerators(App.java:374)
    at de.metanome.cli.App.configureAlgorithm(App.java:268)
    at de.metanome.cli.App.run(App.java:83)
    at de.metanome.cli.App.main(App.java:47)

Actually, I am not sure what I should pass to --file-key. Looking at the parameter configurations above, I tried INPUT_GENERATOR and INPUT_FILES, and neither works.
Could you give me some hints?
Thank you!

@wunderbarr

Hello,
After adding slf4j-simple-1.7.25.jar to the classpath, the SLF4J error above disappears. But I still cannot run Pyro.

My cmd:
java -cp metanome-cli-1.1.0.jar:slf4j-simple-1.7.25.jar:pyro-distro-1.0-SNAPSHOT-distro.jar de.metanome.cli.App --algorithm de.hpi.isg.pyro.algorithms.Pyro --files adult.csv --file-key INPUT_GENERATOR --algorithm-config maxFdError:0.01

Is there any configuration problem?

Running de.hpi.isg.pyro.algorithms.Pyro

  • in: [adult.csv]
  • out: file
  • configuration: [maxFdError:0.01]
Initializing algorithm.
Could not initialize algorithm.
java.lang.IllegalArgumentException: Unsupported argument.
    at de.hpi.isg.pyro.algorithms.Pyro.setRelationalInputConfigurationValue(Pyro.java:407)
    at de.metanome.cli.App.setUpInputGenerators(App.java:374)
    at de.metanome.cli.App.configureAlgorithm(App.java:268)
    at de.metanome.cli.App.run(App.java:83)
    at de.metanome.cli.App.main(App.java:47)

Thank you!

@sekruse
Owner

sekruse commented Mar 4, 2019

I think that it should read --file-key inputFile (cf. here and here).

@wunderbarr

Thank you! It works!

@Ryang326

Ryang326 commented Jun 7, 2019

Hello sekruse,
I would like to ask about running CSV files with the FUN algorithm to discover FDs. I am testing with iris.csv, and my command is: java -cp metanome-cli-1.1.jar:fun_for_metanome-0.0.2-SNAPSHOT.jar de.metanome.cli.App --algorithm de.uni_potsdam.hpi.metanome.algorithms.fun.Fun --file-key Relational Input --files load:iris.csv
In file.txt I store the path to the CSV file.
And I get the error: Could not parse command line args: Was passed main parameter 'Input' but no main parameter was defined
I tried the command without any parameters. It asked me to provide: --file-key, --input-key, --table-key, --files, --inputs, --tables
However, after reading the problems above, I have only tried to figure out the --file-key and --files parameters, and I am not sure that is right. I am confused about what I should pass for these six parameters.
Could you please give me some hints?
Thank you!

@sekruse
Owner

sekruse commented Jun 15, 2019

Hi Ryang326 and sorry for the delayed response.

Essentially, --file-key, --input-key, and --table-key are all the same, and what you need to put here depends on the Metanome algorithm. For FUN, you indeed need to specify Relational Input according to the code. However, Relational Input needs to be put in quotation marks. Otherwise, your shell will deliver it as two individual arguments to the Metanome CLI, which expects only a single argument.

--files, --inputs, and --tables are also all the same and you can pick one of them. After that, you can add your CSV file. But don't use load:. That would be correct only if iris.csv were a file that contained a list of CSV files.

So you can try:

java -cp metanome-cli-1.1.jar:fun_for_metanome-0.0.2-SNAPSHOT.jar de.metanome.cli.App --algorithm de.uni_potsdam.hpi.metanome.algorithms.fun.Fun --file-key "Relational Input" --files iris.csv
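
When the CLI is launched from Java rather than from a shell (for instance from an IDE or a test harness), the same point applies to the argument list: the key must arrive as one element. A minimal sketch, assuming the jar names from the command above; `ArgumentListSketch` and `funCommand` are hypothetical names, and the process is only described here, not started.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the argument vector; nothing is executed.
final class ArgumentListSketch {

    // Each list element is delivered to the program as exactly one argument,
    // so "Relational Input" stays intact without any shell quoting.
    static List<String> funCommand() {
        return Arrays.asList(
                "java", "-cp", "metanome-cli-1.1.jar:fun_for_metanome-0.0.2-SNAPSHOT.jar",
                "de.metanome.cli.App",
                "--algorithm", "de.uni_potsdam.hpi.metanome.algorithms.fun.Fun",
                "--file-key", "Relational Input",
                "--files", "iris.csv");
    }

    public static void main(String[] args) {
        List<String> command = funCommand();
        for (int i = 0; i < command.size(); i++) {
            System.out.println(i + ": [" + command.get(i) + "]");
        }
        // With both jars and iris.csv in the working directory,
        // new ProcessBuilder(command).inheritIO().start() would launch the run.
    }
}
```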

@cccshuang

cccshuang commented Jul 4, 2019

When using the DC algorithm: java -cp metanome-cli.jar;hydra-1.2-SNAPSHOT.jar de.metanome.cli.App --algorithm de.hpi.naumann.dc.algorithms.hybrid.HydraMetanome --file-key "INPUT" --files Tax.csv --escape \ --separator , --algorithm-config EFFICIENCY_THRESHOLD:0.005 CROSS_COLUMN_STRING_MIN_OVERLAP:0.15 SAMPLE_ROUNDS:20 NO_CROSS_COLUMN:false
I get the error:
... 15:33:54.771 [main] INFO d.h.n.d.a.hybrid.HydraMetanome - Result size: 5117
Algorithm crashed.
java.lang.NullPointerException
    at de.hpi.naumann.dc.algorithms.hybrid.HydraMetanome.execute(HydraMetanome.java:76)
    at de.metanome.cli.App.run(App.java:89)
    at de.metanome.cli.App.main(App.java:47)
Elapsed time: 0:00:03.332 (3332 ms).
DC works in the Metanome tool, but fails with Metanome CLI.
I think it may be caused by this: Initializing algorithm. Could not configure any result receiver. But I don't know how to solve it.
Could you please give me some hints? Thank you very much!

@sekruse
Owner

sekruse commented Jul 5, 2019

Since this line is crashing, I think it really is the missing result receiver.

In fact, it appears that this method is lacking the necessary code to configure a DenialConstraintResultReceiver for DenialConstraintAlgorithms.

Unfortunately, I don't have the time to fix this. Do you want to send a PR with a fix?

@cccshuang

cccshuang commented Jul 5, 2019

OK. After adding some code to the "configureResultReceiver" method, it can support DC now.
It may rely on the latest Metanome 1.2, because Metanome 1.1 may not contain the DC algorithm.
I don't know why running the project caused this error, so I also added the dependency com.ecwid.ecwid-mailchimp to solve this problem.

@faisal-ksolves

Hello @sekruse,
What is the maximum dataset size Metanome algorithms can run on?

@sekruse
Owner

sekruse commented Dec 13, 2022

@faisal-ksolves – That depends on the algorithm, your hardware, and various dataset properties besides its size. Most often, RAM is the limiting factor, especially for datasets with many columns. Please refer to the research papers of the individual algorithms for a detailed evaluation.

@faisal-ksolves

@sekruse Can I get some quick links to those papers?

@sekruse
Owner

sekruse commented Dec 13, 2022

@faisal-ksolves – https://hpi.de/naumann/projects/data-profiling-and-analytics/metanome-data-profiling.html should contain most links. The BINDER paper is called Divide & Conquer-based Inclusion Dependency Discovery.

@faisal-ksolves

Hello,
Can anyone help me run the HyMD algorithm on Metanome CLI?
It throws an error when run:
Exception in thread "main" java.lang.NoSuchFieldError: SNAKE_CASE
    at de.metanome.algorithms.hymd.Jackson.createMapper(Jackson.java:22)
    at de.metanome.algorithms.hymd.Jackson.createReader(Jackson.java:16)
    at de.metanome.algorithms.hymd.HyMD.readConfig(HyMD.java:198)
    at de.metanome.algorithms.hymd.HyMD.setStringConfigurationValue(HyMD.java:156)
    at de.metanome.backend.configuration.ConfigurationValueString.triggerSetValue(ConfigurationValueString.java:69)
    at de.metanome.cli.Helpers.AlgorithmInitializer.triggerSetValue(AlgorithmInitializer.java:173)
    at de.metanome.cli.Helpers.AlgorithmInitializer.apply(AlgorithmInitializer.java:79)
    at de.metanome.cli.App.loadMiscConfigurations(App.java:372)
    at de.metanome.cli.App.configureAlgorithm(App.java:343)
    at de.metanome.cli.App.run(App.java:134)
    at de.metanome.cli.App.main(App.java:98)
I am using the following command:
--algorithm de.metanome.algorithms.hymd.HyMD --file-key RELATION --files src/main/java/de/metanome/cli/Inputs/test.csv --header

@RaVincentHuang

Hello sekruse,
I can't run CFDFinder with the CLI. My algorithm and CLI versions are both 1.2.
My command is:
java -cp metanome-cli-1.2-SNAPSHOT.jar:CFDFinder-1.2-SNAPSHOT.jar de.metanome.cli.App --algorithm de.metanome.algorithms.cfdfinder.CFDFinder --files ./adult.csv --file-key INPUT_GENERATOR --output print
And it throws:

(metanome-cli) ERROR    Could not configure any result receiver.
(metanome-cli) ERROR    Algorithm crashed.: de.metanome.algorithm_integration.AlgorithmConfigurationException: No result receiver set!
	at de.metanome.algorithms.cfdfinder.CFDFinder.execute(CFDFinder.java:286)
	at de.metanome.cli.App.run(App.java:110)
	at de.metanome.cli.App.main(App.java:75)
(metanome-cli) INFO     Elapsed time: 0:00:00.001 (1 ms).
(metanome-cli) INFO     Results:

I found that the result receiver for CFDs has been implemented in version 1.2 of the CLI, but the check if (algorithm instanceof ConditionalFunctionalDependencyAlgorithm) does not seem to succeed. I don't know the reason for this problem.
