Add a script which tests key functions #16

Merged: 1 commit merged from testing into master on Dec 16, 2021
Conversation

@nshahquinn (Collaborator)

It won't be very helpful to look at the raw files here; instead, this can be reviewed by checking out the branch on one of the analytics clients and opening the notebook there.

@bearloga (Member) commented on Feb 1, 2021

Stopped at first cell in the Loading Data section:

Py4JJavaError: An error occurred while calling o65.saveAsTable.
: java.util.concurrent.ExecutionException: org.apache.hadoop.security.AccessControlException:
  Permission denied: user=bearloga, access=READ_EXECUTE,
  inode="/user/hive/warehouse/neilpquinn.db/wmfdata_test_1":neilpquinn-wmf:hadoop:drwxr-x---

Replacing the neilpquinn database with bearloga in the queries worked.

One potential solution is to replace it with a placeholder variable whose value defaults to os.environ['USER'], but that relies on the strong assumption that any user running this test notebook uses the same name for their personal database on the cluster as their shell username.
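
For illustration, a minimal sketch of that placeholder idea (the environment-variable override and table name here are hypothetical, not part of the actual notebook):

import os

# Hypothetical placeholder: default the test database to the shell username,
# with an env-var override for users whose personal database is named differently.
# As noted above, the USER fallback is a strong assumption.
test_db = os.environ.get("WMFDATA_TEST_DB", os.environ["USER"])

create_table = f"CREATE TABLE {test_db}.wmfdata_test_1 (x INT)"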

@nshahquinn (Collaborator, Author)

Stopped at first cell in the Loading Data section:

Py4JJavaError: An error occurred while calling o65.saveAsTable.
: java.util.concurrent.ExecutionException: org.apache.hadoop.security.AccessControlException:
  Permission denied: user=bearloga, access=READ_EXECUTE,
  inode="/user/hive/warehouse/neilpquinn.db/wmfdata_test_1":neilpquinn-wmf:hadoop:drwxr-x---

Thanks for catching this! I just updated it to use Hive's default database, which everyone should be able to write to.

@bearloga (Member) commented on Feb 6, 2021

Py4JJavaError: An error occurred while calling o65.saveAsTable.
: java.util.concurrent.ExecutionException: org.apache.hadoop.security.AccessControlException:
  Permission denied: user=bearloga, access=READ_EXECUTE,
  inode="/user/hive/warehouse/wmfdata_test_1":neilpquinn-wmf:hadoop:drwxr-x---

Indeed:

$ hdfs dfs -ls /user/hive/warehouse
drwxr-x---   - neilpquinn-wmf   hadoop             0 2021-02-04 15:27 /user/hive/warehouse/wmfdata_test_1
drwxrwxrwt   - neilpquinn-wmf   hadoop             0 2021-02-04 15:27 /user/hive/warehouse/wmfdata_test_2
drwxrwxrwt   - neilpquinn-wmf   hadoop             0 2021-02-04 15:27 /user/hive/warehouse/wmfdata_test_3
drwxrwxrwt   - neilpquinn-wmf   hadoop             0 2021-02-04 15:15 /user/hive/warehouse/wmfdata_test_4

Similar problem with default.wmfdata_test_3:

ChildProcessError: The Hive command line client encountered the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask.
Directory hdfs://analytics-hadoop/user/hive/warehouse/wmfdata_test_2 could not be cleaned up.

@elukey: Do you have a recommendation for fixing our permission problems? We want anyone with analytics client access to be able to run the cells in this test notebook to verify that the package works. It appears that using the default database doesn't change how the permissions are set up in HDFS, so the default.wmfdata_test_* tables are locked to @neilpquinn's account. What's weird is that hdfs://analytics-hadoop/user/hive/warehouse/wmfdata_test_2 has group write permissions, so I should be able to drop the table associated with it, right?


It may be that anyone who wishes to test the package with the notebook will need to run it as the analytics-product user, although we have yet to use that system user to schedule notebook executions in crontab.

@ottomata Is there a way to use Anaconda as the analytics-product system user? Because I tried to run

$ chgrp analytics-product-users ~/temp/test
$ cd ~/temp/test
$ sudo --user=analytics-product --group=analytics-product-users \
  /home/bearloga/venv/bin/jupyter nbconvert --execute --to html Untitled.ipynb

and got

PermissionError: [Errno 13] Permission denied: '/srv/home/bearloga/.local/share/jupyter/nbconvert/templates/conf.json'

(even after chmod --recursive o+r /srv/home/bearloga/.local)

I think jupyter nbconvert temporarily creates a conf.json file, which prevents analytics-product from using my jupyter binary. Hence my interest in creating a conda environment for that system user, if possible.

@nshahquinn (Collaborator, Author)

From my DM to @bearloga:

We can probably fix the database permissions problem by having the script delete the tables after it finishes. It's not a perfect solution: if someone creates the tables but hits an error midway through, they might never clean them up. But we can put a reminder to do that in the notebook, and that should be good enough?
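
A minimal sketch of that cleanup idea, assuming wmfdata's hive.run() for executing statements and a hypothetical run_tests() entry point:

from wmfdata import hive

TEST_TABLES = ["wmfdata_test_1", "wmfdata_test_2", "wmfdata_test_3", "wmfdata_test_4"]

def run_with_cleanup(db):
    try:
        run_tests(db)  # hypothetical: creates and reads the test tables
    finally:
        # Drop the tables even if a test failed midway, so stale tables
        # don't pile up under one user's HDFS permissions.
        for table in TEST_TABLES:
            hive.run(f"DROP TABLE IF EXISTS {db}.{table}")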

@bearloga (Member)

@neilpquinn Do you get this error when you run the !pip setup cells?

[screenshot: error output from the !pip setup cells]

When I restart the kernel and skip those cells, I can import wmfdata with no problem. Maybe we shouldn't use the !pip magic in this case? Between testing out a PR, fetching it with git, and having to authenticate with kinit, people are going to have to open a terminal in Jupyter anyway, so we may as well require them to run pip install . manually if they're testing a patched version of wmfdata.


test_data_1_via_hive fails for me because of Parquet-related messages that get printed to the TSV:

[screenshot: test_data_1_via_hive failure showing Parquet log lines mixed into the TSV output]

Feb 11, 2021 6:06:55 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Feb 11, 2021 6:06:55 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
Feb 11, 2021 6:06:55 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Feb 11, 2021 6:06:55 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 26 ms. row count = 10

Everything else runs without problems.

@nshahquinn (Collaborator, Author)

Note to self: the Presto test can currently fail because the order of the output differs from the input. I need to add an ORDER BY to the query.
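
A sketch of that fix, assuming wmfdata's presto.run() and a hypothetical key column. Without ORDER BY, Presto returns rows in nondeterministic order, so a row-by-row comparison with the input can fail even when the data is correct:

from wmfdata import presto

result = presto.run(
    "SELECT * FROM wmfdata_test_1 "
    "ORDER BY x"  # hypothetical key column; makes the output order deterministic
)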

@nshahquinn force-pushed the testing branch 5 times, most recently from f702c70 to 9b1e4fb on December 6, 2021
@nshahquinn (Collaborator, Author)

Okay, I have a new version of this I'm happy with! I've solved the permissions issues by simply having the user pass the name of the Hive database to use when they invoke the script. That way, each user should be using their own database.
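
For illustration, a minimal sketch of how the script might accept the database name (the argument handling and run_tests() entry point here are assumptions, not the actual implementation):

import argparse

parser = argparse.ArgumentParser(description="Test key wmfdata functions.")
parser.add_argument("database", help="Hive database to create the test tables in")
args = parser.parse_args()

run_tests(args.database)  # hypothetical entry point; each user supplies their own database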

If one of the tests fails, the script stops with an error. If all the tests pass, the output will look like this:

neilpquinn-wmf@stat1008:~/wmfdata-python/testing$ ./tests.py nshahquinn
PYSPARK_PYTHON=/home/neilpquinn-wmf/.conda/envs/2021-10-29T21.52.56_neilpquinn-wmf/bin/python3
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/12/06 16:47:07 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
21/12/06 16:47:07 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001.
21/12/06 16:47:07 WARN Utils: Service 'sparkDriver' could not bind on port 12001. Attempting port 12002.
21/12/06 16:47:07 WARN Utils: Service 'sparkDriver' could not bind on port 12002. Attempting port 12003.
21/12/06 16:47:07 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
21/12/06 16:47:07 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
21/12/06 16:47:07 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
21/12/06 16:47:07 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
21/12/06 16:47:07 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13001. Attempting port 13002.
21/12/06 16:47:07 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13002. Attempting port 13003.
TEST PASSED: Read via Hive                                                                            
TEST PASSED: Silent command via Hive
PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.
TEST PASSED: Read via Spark
PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.
TEST PASSED: Silent command via Spark
TEST PASSED: Read via Presto
TEST PASSED: Read via MariaDB
TEST PASSED: Silent command via MariaDB
TEST PASSED: Load CSV to Data Lake

There's a lot of annoying log spam, but I think it still gets the point across.

@nshahquinn requested review from nettrom, ottomata, and milimetric, and removed the request for bearloga on December 6, 2021
@ottomata (Collaborator) commented on Dec 6, 2021

Is the title of this PR incorrect? a notebook? :)

@nshahquinn (Collaborator, Author)

Is the title of this PR incorrect? a notebook? :)

Sorry, it was a mutation due to the excessive radiation coming off the Data Lake! Also, I decided a script would be cleaner and more maintainable than a notebook 😊

@nshahquinn changed the title from "Add a notebook which tests key functions" to "Add a script which tests key functions" on Dec 6, 2021
@milimetric (Contributor) left a comment

Seems nice to me; I'll keep it in mind as we think about testing Airflow pipelines.

@nshahquinn changed the base branch from release to master on December 14, 2021
@milimetric (Contributor)

For the record, I tested the tests and they test great. I approve these changes, but I think we can't merge until all reviewers agree.

@nshahquinn (Collaborator, Author)

For the record, I tested the tests and they test great. I approve these changes, but I think we can't merge until all reviewers agree.

Thank you! I don't think it was the reviewers; I think you just didn't have merge permissions. I've given @wikimedia/data-engineering those rights, so next time you should be able to.

@nshahquinn merged commit c0f7511 into master on Dec 16, 2021
@nshahquinn deleted the testing branch on December 16, 2021