
deployment of v4.3.3 on corefacility #159

Closed
kevinkle opened this issue Jul 11, 2017 · 36 comments
@kevinkle

No description provided.

@kevinkle kevinkle changed the title deployment on corefacility deployment of v4.3.3 on corefacility Jul 11, 2017
@kevinkle

Running Blazegraph on the host, via screen, inside /Warehouse/Users/claing/superphy/spfy/docker-blazegraph/2.1.4; the bigdata.jnl will be stored there.

java -server -Xmx4g -Dbigdata.propertyFile=/Warehouse/Users/claing/superphy/spfy/docker-blazegraph/2.1.4/RWStore.properties -jar blazegraph.jar
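For reference, one way to keep that command running detached (the session name is illustrative):

```shell
# start a detached screen session named "blazegraph" running the server
screen -dmS blazegraph java -server -Xmx4g \
  -Dbigdata.propertyFile=/Warehouse/Users/claing/superphy/spfy/docker-blazegraph/2.1.4/RWStore.properties \
  -jar blazegraph.jar
# reattach later with: screen -r blazegraph
```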

@kevinkle

Looks like #159 (comment) works for storing the database in /Warehouse. But we now need to network the Docker composition to the Blazegraph instance running on the host (moby/moby#1143 (comment)); it doesn't look like there is an easy, Docker-approved method.

@kevinkle commented Jul 11, 2017

Implemented moby/moby#1143 (comment) in our docker-compose.yml file:
From the comment:

version: '2'
services:
  <container_name>:
    image: <image_name>
    networks:
      - dockernet

networks:
  dockernet:
    driver: bridge
    ipam:
      config:
        - subnet: 192.168.0.0/24
          gateway: 192.168.0.1

except that we added, to all of our services:

networks:
      - dockernet

which still lets us use the service names defined by Docker (such as redis) to link containers. Then we add a firewall rule on the CentOS host:

sudo firewall-cmd --zone=public --add-port=9999/tcp --permanent
sudo firewall-cmd --reload

This dangerously exposes port 9999 to the outside world, but for now it means we can run:

curl http://192.168.0.1:9999/blazegraph/

to connect to the blazegraph instance running on our host.
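Putting the pieces together, a minimal sketch of the resulting docker-compose.yml (the service and image names here are illustrative, not our actual composition):

```yaml
version: '2'
services:
  webserver:
    image: superphy/spfy-webserver   # illustrative image name
    networks:
      - dockernet
  redis:
    image: redis
    networks:
      - dockernet

networks:
  dockernet:
    driver: bridge
    ipam:
      config:
        - subnet: 192.168.0.0/24
          gateway: 192.168.0.1
```

Containers on dockernet reach the host's Blazegraph via the gateway IP, i.e. http://192.168.0.1:9999/blazegraph/.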

@kevinkle

After #159 (comment), we can now start reactapp outside of corefacility and upload files / get results. Note that files are still being stored inside the VM instead of on /Warehouse, but this will be addressed in #148 and doesn't require any modification on corefacility.

@kevinkle

Need to address superphy/grouch#14 now so we can run reactapp out of corefacility.

@kevinkle

screen -r 15365.pts-1.superphy

@kevinkle commented Jul 11, 2017

We're up! 9766690 uses a prefix for reactapp URIs plus a homepage spec. Weirdly, the commit is listed under Chad (probably because I pushed it from the VM, but whatever).
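For reference, in a create-react-app project the URI prefix is typically picked up from a `homepage` field in package.json; a sketch of what that spec might look like (the path is an assumption based on the deployed URL, not pulled from the actual commit):

```json
{
  "name": "grouch",
  "homepage": "https://lfz.corefacility.ca/superphy/grouch"
}
```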

@kevinkle

Subtyping and Database work fine, but there is an issue with the Fisher's task not returning any results.

@kevinkle kevinkle reopened this Jul 12, 2017
@kevinkle

This is weird, as the Database task can retrieve serotype info without problems, which implies the ECTyper jobs finished correctly, whereas you can't compare VF results in the Fisher's task.

@kevinkle

There are only 15 genomes in the db at the moment; perhaps this is just a case of no shared VFs between two H types (since there may be only one genome per H type)? Will upload a larger set of reference genomes to test.
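For context, the Fisher's task boils down to a Fisher's exact test on a 2x2 contingency table (VF present/absent x H-type group 1/group 2); with one genome per H type the table degenerates and nothing comes back. A minimal pure-Python sketch of the test itself (for illustration only, not Spfy's actual implementation):

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table.

    table = [[a, b], [c, d]], e.g. rows = VF present/absent,
    columns = genomes in H-type group 1 / group 2.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row1 = a + b          # total genomes with the VF present
    col1 = a + c          # total genomes in group 1

    def pmf(x):
        # hypergeometric probability of seeing x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = pmf(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # two-sided: sum probabilities of all tables at least as extreme as observed
    return sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-9))
```

With e.g. `[[8, 2], [1, 5]]` this agrees with the standard two-sided p-value of about 0.035.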

@kevinkle

Looks like we're hitting the timeouts again superphy/grouch#43

Perhaps this is because the VM has lower specs? May need to bump timeouts even more.

@kevinkle

Retested #159 (comment) locally and we don't have this problem. It's almost as if the corefacility deployment of Blazegraph is losing transactions.

@kevinkle commented Jul 12, 2017

Figured out the error: I hadn't merged the inferencing branch of our docker-blazegraph repo into master, and was running Blazegraph on corefacility without inferencing.

In /Warehouse/Users/claing/superphy/spfy/docker-blazegraph/2.1.4-inferencing

New command: java -server -Xmx4g -Dbigdata.propertyFile=/Warehouse/Users/claing/superphy/spfy/docker-blazegraph/2.1.4-inferencing/RWStore.properties -jar blazegraph.jar

@kevinkle

Still need to address #159 (comment)

@kevinkle

Looks to be associated with the harakiri option: https://stackoverflow.com/questions/24127601/uwsgi-request-timeout-in-python
Added in #165, testing now...

@kevinkle

*** WARNING: you have enabled harakiri without post buffering. Slow upload could be rejected on post-unbuffered webservers ***
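As the warning says, harakiri should be paired with post buffering so a slow upload isn't rejected mid-POST. A sketch of the relevant uwsgi.ini options (the values are illustrative, not necessarily what #165 sets):

```ini
[uwsgi]
; kill any worker stuck on a single request for more than 10 minutes
harakiri = 600
; buffer request bodies larger than 8 KB before handing them to the app,
; so slow uploads don't hold a worker open unbuffered
post-buffering = 8192
```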

@kevinkle

An addition to the fix added: #168

@kevinkle

Still didn't work, same error:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Bad Gateway</title>
</head><body>
<h1>Bad Gateway</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
</p>
</body></html>

@kevinkle commented Jul 14, 2017

So far tested:
- ~40 genomes, unzipped: works
- 50+ genomes (e.g. 56 genomes, 290.9 MB), unzipped: 502 error
- 48 genomes (249.6 MB), zipped to 75.2 MB: works
- 124 genomes (652.5 MB), zipped to 197.2 MB: works
- 656 genomes (3.6 GB), zipped to 1.1 GB: 502 error

@kevinkle

Tested on Cybera and still getting problems. pallets/flask#2086 (comment) looks highly relevant and may be a possible fix.

@kevinkle commented Jul 17, 2017

Something different...

webserver_1              | 2017/07/17 16:03:41 [error] 10#10: *1 client intended to send too large body: 197175003 bytes, client: 192.168.0.1, server: , request: "POST /api/v0/upload HTTP/1.1", host: "localhost:8090", referrer: "https://lfz.corefacility.ca/superphy/grouch/subtyping"
webserver_1              | 192.168.0.1 - - [17/Jul/2017:16:03:41 +0000] "POST /api/v0/upload HTTP/1.1" 413 601 "https://lfz.corefacility.ca/superphy/grouch/subtyping" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36" "192.197.71.189"

This occurs only after Upload... hits 100%.

@kevinkle

#159 (comment) is fixed in 33970dc

@kevinkle

I'm wondering if this is possibly above the Docker level, because on successful POSTs we get the GET request in the Docker logs, but when it fails there is absolutely nothing.

@kevinkle commented Jul 17, 2017

🍰 Partial success!
Looks like the error was with the nginx.conf running above Docker on the VM. A max file size specified there was capping uploads at 200m; changing it to 60g (60 GB) now sends the files into Docker. Slightly weird, though, because the access.log on that nginx says nothing about the POST request until axios on the front end reports 100% upload; perhaps there is another webserver running at corefacility which is also handling buffering?

We just need to increase the space available / offload the temporary storage of the files somewhere on the VM and we'll be set.

EDIT: another webserver we're unaware of would also explain why we get generic 502 errors instead of 500 errors.
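For reference, the nginx directive involved is `client_max_body_size` (default 1m), which returns 413 when exceeded, matching the 413 seen in the earlier logs. The fix on the VM's nginx.conf was along these lines (surrounding context illustrative):

```nginx
http {
    # was 200m, which rejected large genome uploads with 413
    client_max_body_size 60g;
}
```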

@kevinkle

The space when unzipping issue (#159 (comment)) is confirmed in https://sentry.io/share/issue/3133353338392e333031353732303031/

@kevinkle

Heads up! May have to adjust speed of Blazegraph queries as /Warehouse seems slow https://sentry.io/share/issue/3133353338392e333132353036333439/

@kevinkle commented Jul 18, 2017

A new HDD was attached to the VM. Using https://forums.docker.com/t/how-do-i-change-the-docker-image-installation-directory/1169 to host Docker's storage there via a symlink.
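The symlink approach from that thread looks roughly like this (the mount point is illustrative; assumes the new HDD is mounted at /mnt/newdisk):

```shell
sudo systemctl stop docker
sudo mv /var/lib/docker /mnt/newdisk/docker
sudo ln -s /mnt/newdisk/docker /var/lib/docker
sudo systemctl start docker
```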

@kevinkle commented Jul 19, 2017

What a successful upload should look like in the logs:

webserver_1              | upload(): received req. at 2017-07-19-16-24
webserver_1              | [<FileStorage: u'GCA_001912665.1_ASM191266v1_genomic.fna' ('application/octet-stream')>]
webserver_1              | upload(): about to enqueue files
webserver_1              | upload(): all files enqueued, returning...
webserver_1              | handle_groupresults(): started
webserver_1              | handle_groupresults(): finished
webserver_1              | [pid: 21|app: 0|req: 90/162] 192.168.0.1 () {56 vars in 1113 bytes} [Wed Jul 19 16:24:42 2017] POST /api/v0/upload => generated 162 bytes in 2946 msecs (HTTP/1.1 200) 4 headers in 144 bytes (20 switches on core 0)
webserver_1              | 192.168.0.1 - - [19/Jul/2017:16:24:45 +0000] "POST /api/v0/upload HTTP/1.1" 200 162 "https://lfz.corefacility.ca/superphy/grouch/subtyping" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36" "10.0.10.83"
webserver_1              | [pid: 20|app: 0|req: 73/163] 192.168.0.1 () {52 vars in 962 bytes} [Wed Jul 19 16:24:45 2017] GET /api/v0/results/blob2907573319237084527 => generated 10 bytes in 6 msecs (HTTP/1.1 200) 3 headers in 103 bytes (2 switches on core 0)
webserver_1              | 192.168.0.1 - - [19/Jul/2017:16:24:45 +0000] "GET /api/v0/results/blob2907573319237084527 HTTP/1.1" 200 10 "https://lfz.corefacility.ca/superphy/grouch/subtyping" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36" "10.0.10.83"

What we got with large files (note: it looks like files are still being enqueued, just no blob id is returned):

webserver_1              | upload(): received req. at 2017-07-19-16-59
webserver_1              | [<FileStorage: u'656_files-3.6_GB-ecoli-genomes.zip' ('application/zip')>]
webserver_1              | upload(): about to enqueue files
webserver_1              | 192.168.0.1 - - [19/Jul/2017:17:00:01 +0000] "POST /api/v0/upload HTTP/1.1" 499 0 "https://lfz.corefacility.ca/superphy/grouch/subtyping" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36" "192.197.71.189"
webserver_1              | upload(): all files enqueued, returning...
webserver_1              | handle_groupresults(): started
webserver_1              | handle_groupresults(): finished
webserver_1              | Wed Jul 19 17:00:32 2017 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /api/v0/upload (ip 192.168.0.1) !!!
webserver_1              | Wed Jul 19 17:00:32 2017 - uwsgi_response_writev_headers_and_body_do(): Broken pipe [core/writer.c line 296] during POST /api/v0/upload (192.168.0.1)
webserver_1              | IOError: write error
webserver_1              | [pid: 20|app: 0|req: 76/168] 192.168.0.1 () {52 vars in 1040 bytes} [Wed Jul 19 16:50:02 2017] POST /api/v0/upload => generated 0 bytes in 631156 msecs (HTTP/1.1 200) 4 headers in 0 bytes (3782 switches on core 0)

@kevinkle

From #159 (comment), it looks like uwsgi is still active and the disconnect is happening in either nginx or reactapp.

@kevinkle

Only immediately after a 502 error shows up on reactapp do we get this in logs:

webserver_1              | upload(): received req. at 2017-07-19-21-05
webserver_1              | [<FileStorage: u'656_files-3.6_GB-ecoli-genomes.zip' ('application/zip')>]
webserver_1              | upload(): about to enqueue files
webserver_1              | 192.168.0.1 - - [19/Jul/2017:21:05:49 +0000] "POST /api/v0/upload HTTP/1.1" 499 0 "https://lfz.corefacility.ca/superphy/grouch/subtyping" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36"

@kevinkle

#159 (comment) looks like an error with uwsgi in Docker; the 192.168.0.1 refers to the nginx running above Docker on the VM.

@kevinkle

[claing@superphy docker]$ sudo tail /var/log/nginx/error.log
2017/07/19 11:25:10 [warn] 10922#0: *323 an upstream response is buffered to a temporary file /var/lib/nginx/tmp/proxy/9/00/0000000009 while reading upstream, client: 10.0.0.10, server: lfz.corefacility.ca, request: "GET //grouch/static/js/main.bfae872c.js.map HTTP/1.1", upstream: "http://[::1]:8091/static/js/main.bfae872c.js.map", host: "lfz.corefacility.ca"
2017/07/19 11:25:19 [warn] 10922#0: *327 a client request body is buffered to a temporary file /var/lib/nginx/tmp/client_body/0000000010, client: 10.0.0.10, server: lfz.corefacility.ca, request: "POST //spfy/api/v0/upload HTTP/1.1", host: "lfz.corefacility.ca", referrer: "https://lfz.corefacility.ca/superphy/grouch/subtyping"
2017/07/19 13:28:58 [warn] 13540#0: *15 a client request body is buffered to a temporary file /var/lib/nginx/tmp/client_body/0000000001, client: 10.0.0.10, server: lfz.corefacility.ca, request: "POST //spfy/api/v0/upload HTTP/1.1", host: "lfz.corefacility.ca", referrer: "https://lfz.corefacility.ca/superphy/grouch/subtyping"
2017/07/19 14:04:37 [error] 13540#0: *15 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.0.0.10, server: lfz.corefacility.ca, request: "POST //spfy/api/v0/upload HTTP/1.1", upstream: "http://127.0.0.1:8090/api/v0/upload", host: "lfz.corefacility.ca", referrer: "https://lfz.corefacility.ca/superphy/grouch/subtyping"
2017/07/19 14:04:37 [warn] 13540#0: *15 upstream server temporarily disabled while reading response header from upstream, client: 10.0.0.10, server: lfz.corefacility.ca, request: "POST //spfy/api/v0/upload HTTP/1.1", upstream: "http://127.0.0.1:8090/api/v0/upload", host: "lfz.corefacility.ca", referrer: "https://lfz.corefacility.ca/superphy/grouch/subtyping"

@kevinkle

[claing@superphy 2.1.4-inferencing]$ ls -lah
total 7.6G
drwxrwxr-x.  2 claing collaborators  152 Jul 12 16:14 .
drwxrwxr-x. 11 claing nobody         329 Jul 12 16:02 ..
-rwxrwxr-x.  1 claing collaborators 6.2G Jul 20 12:16 bigdata.jnl
-rwxrwxr-x.  1 claing nobody         54M May 23 22:08 blazegraph.jar
-rwxrwxr-x.  1 claing nobody         622 Jul 12 16:02 Dockerfile
-rwxrwxr-x.  1 claing nobody        701M Jul 20 12:16 rules.log
-rwxrwxr-x.  1 claing nobody        2.7K Jul 12 16:02 RWStore.properties
[claing@superphy 2.1.4-inferencing]$ pwd
/Warehouse/Users/claing/superphy/spfy/docker-blazegraph/2.1.4-inferencing

@kevinkle

Looks like database status queries are working fine, but POSTs are failing erratically. This is with 637 genome files.

@kevinkle

I'm closing this issue; a number of major changes, including breaking changes, will probably be required to bring this into viable production use. Everything will be tracked under https://github.com/superphy/backend/milestone/1
