Add a script which tests key functions #16

Merged: 1 commit merged from testing into master on Dec 16, 2021
Conversation

@nshahquinn (Collaborator)

It won't be very helpful to look at the raw files here; instead, this can be reviewed by checking out the branch on one of the analytics clients and opening the notebook there.

@bearloga (Member) commented on Feb 1, 2021

Stopped at first cell in the Loading Data section:

Py4JJavaError: An error occurred while calling o65.saveAsTable.
: java.util.concurrent.ExecutionException: org.apache.hadoop.security.AccessControlException:
  Permission denied: user=bearloga, access=READ_EXECUTE,
  inode="/user/hive/warehouse/neilpquinn.db/wmfdata_test_1":neilpquinn-wmf:hadoop:drwxr-x---

Replacing the neilpquinn database with bearloga in the queries worked.

One potential solution is to replace it with a placeholder variable whose value defaults to os.environ['USER'], but that relies on the strong assumption that any user running this test notebook uses the same name for their personal database on the cluster as their shell username.
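
For illustration, a minimal sketch of that placeholder idea (the environment-variable override and table name here are hypothetical, not part of the actual notebook):

import os

# Hypothetical placeholder: default the test database to the shell username,
# with an env-var override for users whose personal database is named differently.
# As noted above, the USER fallback is a strong assumption.
test_db = os.environ.get("WMFDATA_TEST_DB", os.environ["USER"])

create_table = f"CREATE TABLE {test_db}.wmfdata_test_1 (x INT)"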

@nshahquinn (Collaborator, Author)

Stopped at first cell in the Loading Data section:

Py4JJavaError: An error occurred while calling o65.saveAsTable.
: java.util.concurrent.ExecutionException: org.apache.hadoop.security.AccessControlException:
  Permission denied: user=bearloga, access=READ_EXECUTE,
  inode="/user/hive/warehouse/neilpquinn.db/wmfdata_test_1":neilpquinn-wmf:hadoop:drwxr-x---

Thanks for catching this! I just updated it to use Hive's default database, which everyone should be able to write to.

@bearloga (Member) commented on Feb 6, 2021

Py4JJavaError: An error occurred while calling o65.saveAsTable.
: java.util.concurrent.ExecutionException: org.apache.hadoop.security.AccessControlException:
  Permission denied: user=bearloga, access=READ_EXECUTE,
  inode="/user/hive/warehouse/wmfdata_test_1":neilpquinn-wmf:hadoop:drwxr-x---

Indeed:

$ hdfs dfs -ls /user/hive/warehouse
drwxr-x---   - neilpquinn-wmf   hadoop             0 2021-02-04 15:27 /user/hive/warehouse/wmfdata_test_1
drwxrwxrwt   - neilpquinn-wmf   hadoop             0 2021-02-04 15:27 /user/hive/warehouse/wmfdata_test_2
drwxrwxrwt   - neilpquinn-wmf   hadoop             0 2021-02-04 15:27 /user/hive/warehouse/wmfdata_test_3
drwxrwxrwt   - neilpquinn-wmf   hadoop             0 2021-02-04 15:15 /user/hive/warehouse/wmfdata_test_4

Similar problem with default.wmfdata_test_3:

ChildProcessError: The Hive command line client encountered the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask.
Directory hdfs://analytics-hadoop/user/hive/warehouse/wmfdata_test_2 could not be cleaned up.

@elukey: Do you have a recommendation for fixing our permission problems? We want anyone with analytics client access to be able to run the cells in this test notebook to verify that the package works. It appears that using the default database doesn't change how the permissions are set up in HDFS, so the default.wmfdata_test_* tables are locked to @neilpquinn's account. What's weird is that hdfs://analytics-hadoop/user/hive/warehouse/wmfdata_test_2 has group write permissions, so I should be able to drop the table associated with it, right?


It may be that anyone who wishes to test the package with the notebook will need to run it as the analytics-product user, although we have yet to use that system user to schedule notebook executions in crontab.

@ottomata Is there a way to use Anaconda as the analytics-product system user? Because I tried to run

$ chgrp analytics-product-users ~/temp/test
$ cd ~/temp/test
$ sudo --user=analytics-product --group=analytics-product-users \
  /home/bearloga/venv/bin/jupyter nbconvert --execute --to html Untitled.ipynb

and got

PermissionError: [Errno 13] Permission denied: '/srv/home/bearloga/.local/share/jupyter/nbconvert/templates/conf.json'

(even after chmod --recursive o+r /srv/home/bearloga/.local)

I think jupyter nbconvert temporarily creates a conf.json file, which prevents analytics-product from using my jupyter binary. Hence my interest in creating a conda environment for that system user, if possible.

@nshahquinn (Collaborator, Author)

From my DM to @bearloga:

We can probably fix the database permissions problem by having the script delete the tables after it finishes. It's not a perfect solution: if someone creates the tables but hits an error midway through, they might never clean them up. But we can put a reminder to do that in the notebook, and that should be good enough?
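
A minimal sketch of that cleanup idea, assuming wmfdata's hive.run() for executing statements and a hypothetical run_tests() entry point:

from wmfdata import hive

TEST_TABLES = ["wmfdata_test_1", "wmfdata_test_2", "wmfdata_test_3", "wmfdata_test_4"]

def run_with_cleanup(db):
    try:
        run_tests(db)  # hypothetical: creates and reads the test tables
    finally:
        # Drop the tables even if a test failed midway, so stale tables
        # don't pile up under one user's HDFS permissions.
        for table in TEST_TABLES:
            hive.run(f"DROP TABLE IF EXISTS {db}.{table}")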

@bearloga (Member)

@neilpquinn Do you get this error when you run the !pip setup cells?

[screenshot: error output from the !pip setup cells]

When I restart the kernel and skip those cells, I can import wmfdata with no problem. Maybe we shouldn't use the !pip magic in this case? Between testing out a PR, fetching it with git, and having to authenticate with kinit, people are going to have to open a terminal in Jupyter anyway, so we may as well require them to run pip install . manually if they're testing a patched version of wmfdata.


test_data_1_via_hive fails for me because of Parquet-related messages that get printed to the TSV:

[screenshot: test_data_1_via_hive failure showing Parquet log lines mixed into the TSV output]

Feb 11, 2021 6:06:55 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Feb 11, 2021 6:06:55 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
Feb 11, 2021 6:06:55 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Feb 11, 2021 6:06:55 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 26 ms. row count = 10

Everything else runs without problems.

@nshahquinn (Collaborator, Author)

Note to self: the Presto test can currently fail because the order of the output differs from the input. I need to add an ORDER BY to the query.
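
A sketch of that fix, assuming wmfdata's presto.run() and a hypothetical key column. Without ORDER BY, Presto returns rows in nondeterministic order, so a row-by-row comparison with the input can fail even when the data is correct:

from wmfdata import presto

result = presto.run(
    "SELECT * FROM wmfdata_test_1 "
    "ORDER BY x"  # hypothetical key column; makes the output order deterministic
)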

@nshahquinn force-pushed the testing branch 5 times, most recently from f702c70 to 9b1e4fb on December 6, 2021
@nshahquinn (Collaborator, Author)

Okay, I have a new version of this I'm happy with! I've solved the permissions issues by simply having the user pass the name of the Hive database to use when they invoke the script. That way, each user should be using their own database.
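
For illustration, a minimal sketch of how the script might accept the database name (the argument handling and run_tests() entry point here are assumptions, not the actual implementation):

import argparse

parser = argparse.ArgumentParser(description="Test key wmfdata functions.")
parser.add_argument("database", help="Hive database to create the test tables in")
args = parser.parse_args()

run_tests(args.database)  # hypothetical entry point; each user supplies their own database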

If one of the tests fails, the script stops with an error. If all the tests pass, the output will look like this:

neilpquinn-wmf@stat1008:~/wmfdata-python/testing$ ./tests.py nshahquinn
PYSPARK_PYTHON=/home/neilpquinn-wmf/.conda/envs/2021-10-29T21.52.56_neilpquinn-wmf/bin/python3
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/12/06 16:47:07 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
21/12/06 16:47:07 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001.
21/12/06 16:47:07 WARN Utils: Service 'sparkDriver' could not bind on port 12001. Attempting port 12002.
21/12/06 16:47:07 WARN Utils: Service 'sparkDriver' could not bind on port 12002. Attempting port 12003.
21/12/06 16:47:07 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
21/12/06 16:47:07 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
21/12/06 16:47:07 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
21/12/06 16:47:07 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
21/12/06 16:47:07 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13001. Attempting port 13002.
21/12/06 16:47:07 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13002. Attempting port 13003.
TEST PASSED: Read via Hive                                                                            
TEST PASSED: Silent command via Hive
PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.
TEST PASSED: Read via Spark
PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.
TEST PASSED: Silent command via Spark
TEST PASSED: Read via Presto
TEST PASSED: Read via MariaDB
TEST PASSED: Silent command via MariaDB
TEST PASSED: Load CSV to Data Lake

There's a lot of annoying log spam, but I think it still gets the point across.

@nshahquinn requested review from nettrom, ottomata, and milimetric, and removed the request for bearloga on December 6, 2021
@ottomata (Collaborator) commented on Dec 6, 2021

Is the title of this PR incorrect? a notebook? :)

@nshahquinn (Collaborator, Author)

Is the title of this PR incorrect? a notebook? :)

Sorry, it was a mutation due to the excessive radiation coming off the Data Lake! Also, I decided a script would be cleaner and more maintainable than a notebook 😊

@nshahquinn changed the title from "Add a notebook which tests key functions" to "Add a script which tests key functions" on Dec 6, 2021
@milimetric (Contributor) left a comment

Seems nice to me; I'll keep it in mind as we think about testing Airflow pipelines.

@nshahquinn changed the base branch from release to master on December 14, 2021
@milimetric (Contributor)

For the record, I tested the tests and they test great. I approve these changes, but I think we can't merge until all reviewers agree.

@nshahquinn (Collaborator, Author)

For the record, I tested the tests and they test great. I approve these changes, but I think we can't merge until all reviewers agree.

Thank you! I don't think it was the reviewers; I think you just didn't have merge permissions. I've given @wikimedia/data-engineering those rights, so next time you should be able to.

@nshahquinn merged commit c0f7511 into master on Dec 16, 2021
@nshahquinn deleted the testing branch on December 16, 2021