
Fluentd Logging System Part 1 #652

Merged
merged 32 commits into from Apr 30, 2019

Conversation

rkooo567
Collaborator

This is part of the Logging project: #625.

Summary of the PR:

  1. Boots up a Fluentd instance inside the Clipper cluster when users turn on the flag use_centralized_log=True for DockerContainerManager (see the usage sketch after this list).
  2. The Fluentd container image is pulled from the official fluentd image. This will change in the second-phase PR of this project.
  3. Our Fluentd instance copies the default config file from clipper_admin/docker/logging/fluentd/clipper_fluentd.conf. This is done within the FluentdConfig class, which also writes the correct port number.
  4. The Fluentd conf file is stored in a temp file, like the metric conf files.
  5. Once everything is set up, the Fluentd instance centralizes all the logs within the cluster. Currently, the config is very simple: it collects all the logs to the stdout of the Fluentd instance, meaning it is not that useful yet.
  6. README.md explains how to use this feature.
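As a usage reference, a minimal sketch of enabling the flag, assuming the standard clipper_admin bootstrap flow (only the use_centralized_log keyword comes from this PR; the rest is the usual quick-start pattern):

```python
# Minimal sketch: start a Clipper cluster with centralized logging enabled.
# `use_centralized_log` is the flag added in this PR.
from clipper_admin import ClipperConnection, DockerContainerManager

clipper_conn = ClipperConnection(
    DockerContainerManager(use_centralized_log=True))
clipper_conn.start_clipper()  # also boots the Fluentd container

# All cluster logs are now forwarded to the Fluentd instance's stdout;
# inspect them with `docker logs <fluentd-container-name>`.
```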

Note

Since @simon-mo mentioned that Grafana or other logging tools could potentially be used, I tried to make this part as pluggable as possible. If we want to use a different logging system, we can just point logging_system at a different class, i.e., create a new class that has the same public functions as the Fluentd class. I didn't define an interface because I thought it was too much at this point. I will write about it in README.md later.
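To illustrate the pluggability idea, a hedged sketch of a drop-in replacement; the class and method names below are hypothetical assumptions, since this PR deliberately leaves the contract informal ("same public functions as the Fluentd class"):

```python
# Hedged sketch of a pluggable logging system. The method names are
# illustrative, not the actual public API of the Fluentd class.
class HypotheticalLoggingSystem(object):
    def __init__(self, docker_client, port):
        self.docker_client = docker_client
        self.port = port

    def start(self):
        # Pull and run the logging container, as the Fluentd class does.
        raise NotImplementedError

    def get_log_config(self):
        # Return the docker logging-driver options that other Clipper
        # containers should be started with.
        raise NotImplementedError

# DockerContainerManager would then hold
# logging_system = HypotheticalLoggingSystem(...) instead of Fluentd.
```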

Testing

  1. Tested that connect() works correctly (3 tests). 2 of the tests check that a ConnectionRefusedError is raised when the old connection and the new connection have different values for the use_centralized_log flag. You can check this in _is_valid_logging_state_to_connect within DockerContainerManager.
  2. Tested that logs from other Clipper nodes appear in the Fluentd stdout logs (i.e., docker logs of the fluentd container).
  3. Tested that model logs appear in the Fluentd stdout logs once the models are deployed (see the sketch after this list for a manual reproduction).
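A rough sketch of reproducing checks 2 and 3 by hand; the "fluentd" name filter is an assumption about how the container is named:

```python
# Hedged sketch: verify that other containers' logs show up in the
# Fluentd container's stdout. Assumes a running container whose name
# contains "fluentd" and a local docker CLI.
import subprocess

names = subprocess.check_output(
    ["docker", "ps", "--filter", "name=fluentd", "--format", "{{.Names}}"])
fluentd_name = names.decode().strip().splitlines()[0]

# Log lines from other Clipper nodes and deployed models should appear here.
print(subprocess.check_output(["docker", "logs", fluentd_name]).decode())
```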

@AmplabJenkins

Can one of the admins verify this patch?

@simon-mo simon-mo self-requested a review March 22, 2019 01:42
@simon-mo
Contributor

jenkins ok to test

@simon-mo
Contributor

jenkins add to whitelist

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1811/

@rkooo567
Collaborator Author

Okay, there are 2 issues, one in each fluentd integration test. Please let me know if you can think of any reason why they occur.

  1. integration_py2_fluentd: For some reason, it cannot import the fluentd module that I created. I will test locally with Python 2.
[integration_py2_fluentd] File "/clipper/clipper_admin/clipper_admin/docker/docker_container_manager.py", line 25, in <module>
 [integration_py2_fluentd] from clipper_admin.docker.logging.fluentd.fluentd import Fluentd
 [integration_py2_fluentd] ImportError: No module named fluentd.fluentd
  2. integration_py3_fluentd: For some reason, it cannot initialize the docker fluentd logging driver. I will test it again in a VM.

…Used requests' ConnectionError class instead of ConnectionRefusedError, which is not supported in Python 2. Added user='root' inside the fluentd container run function so that it can access the conf file that lives in a root folder within the container.
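For context, a minimal sketch of the compatibility pattern the commit describes: requests.exceptions.ConnectionError exists on both Python 2 and 3, while the built-in ConnectionRefusedError is Python 3 only (the helper name below is illustrative):

```python
# Sketch of the fix: raise requests' ConnectionError, which is importable
# on both Python 2 and 3, instead of Python 3's ConnectionRefusedError.
from requests.exceptions import ConnectionError

def _check_logging_flags(old_flag, new_flag):
    # Stand-in for the real comparison in _is_valid_logging_state_to_connect.
    if old_flag != new_flag:
        raise ConnectionError(
            "use_centralized_log differs between the running cluster "
            "and this DockerContainerManager")
```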
@rkooo567
Collaborator Author

I fixed the errors. Please test again!

@simon-mo
Contributor

Jenkins test this please

@simon-mo
Contributor

Jenkins ok to test

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1815/

@rkooo567
Collaborator Author

rkooo567 commented Mar 24, 2019

Looks like there is an issue with the PyTorch container?

 [pytorch-container] resp = super(CacheControlAdapter, self).send(request, **kw) 
 [pytorch-container] File "/usr/local/lib/python2.7/site-packages/pip/_vendor/requests/adapters.py", line 508, in send 
 [pytorch-container] raise ConnectionError(e, request=request) 
 [pytorch-container] ConnectionError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Max retries exceeded with url: /packages/93/b3/672813e65ac7605bac14a2a8e40e1a18e03bf0ac588c6a799508ee66c289/torch-1.0.1.post2-cp27-cp27mu-manylinux1_x86_64.whl (Caused by ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",))

Do you think it is due to my change? Let me test deploy_pytorch_container locally in a VM.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1858/

@simon-mo
Contributor

@rkooo567 it seems there are two issues:

  1. The ports are somehow not being assigned to unbound ports. You can occupy a port with netcat -l 22424 and then try to start Clipper with fluentd (see the sketch after this list).

  2. There is some sort of infinite loop; see the console log.
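A stdlib-only sketch of the same reproduction, assuming the port 22424 from the comment above (whether Clipper actually probes that exact port is an assumption):

```python
# Hedged sketch: occupy a port with the Python stdlib instead of netcat,
# then start Clipper with fluentd and check it does not reuse the port.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("0.0.0.0", 22424))  # equivalent to `netcat -l 22424`
sock.listen(1)
input("Port 22424 is now bound; start Clipper, then press Enter...")
sock.close()
```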

@@ -459,6 +489,26 @@ def stop_all(self, graceful=True):
else:
c.kill()

def _is_valid_logging_state_to_connect(self, all_labels):
Collaborator Author

I will change the logic of this part. I will make the new Clipper connection turn on the use_centralized_log flag whenever a fluentd instance is running in the cluster, regardless of the use_centralized_log flag of the current DockerContainerManager instance. The current logic raises an error whenever the flag disagrees with the cluster state (e.g., use_centralized_log is on but there's no fluentd instance). I will change this to (see the sketch after this list):

  1. If a Fluentd instance is in the cluster and the current flag is off -> turn the current flag on.
  2. If the current flag is on but there's no fluentd instance running -> raise a ClipperException.
  3. Otherwise, behave the same as before.
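A hedged sketch of the revised check, assuming a hypothetical helper _fluentd_running() and the attribute name use_centralized_log (the actual internals of this PR may differ):

```python
# Sketch of the revised connect-time logic described above.
# `_fluentd_running` is a hypothetical stand-in for the real detection.
from clipper_admin.exceptions import ClipperException

def _is_valid_logging_state_to_connect(self, all_labels):
    fluentd_running = self._fluentd_running(all_labels)
    if fluentd_running and not self.use_centralized_log:
        # Case 1: adopt the cluster's state instead of raising.
        self.use_centralized_log = True
    elif self.use_centralized_log and not fluentd_running:
        # Case 2: centralized logging requested but unavailable.
        raise ClipperException(
            "use_centralized_log=True but no Fluentd instance is running")
    # Case 3: flags already agree; nothing to do.
```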

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1890/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1891/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1893/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1894/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1895/

@rkooo567
Collaborator Author

Yes! The tests finally passed. @simon-mo, please review the PR and leave me some comments. Also, are there other people who will be involved in the code review?

@rkooo567
Collaborator Author

@withsmilo If you have time, can you also review the PR? I would appreciate it!

@withsmilo
Collaborator

@rkooo567 Sure, I will review this PR tonight.

@simon-mo
Contributor

Sorry about the delay. I'll review this over the weekend.

or not os.path.isfile(self._file_path):
self._file_path = self.build_temp_file()

# Logging-TODO: Currently, it copies the default conf from clipper_fluentd.conf.
Contributor

Add this as an issue.

Collaborator Author

I added a comment to #625 instead (because it will be handled in PR 2 anyway, and that PR is not merged yet). If you still want me to file this as a new issue, I will do it after this PR is merged. Please let me know!

@@ -40,7 +40,7 @@ def signal_handler(signal, frame):

if __name__ == '__main__':
signal.signal(signal.SIGINT, signal_handler)
clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn = ClipperConnection(DockerContainerManager(use_centralized_log=False))
Contributor

Isn't it turned off by default?

Collaborator Author

I wanted to expose this option to users by including it in the example code. If you think it is better to remove it, I will do that! Let me know.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1910/

@rkooo567
Collaborator Author

rkooo567 commented Apr 27, 2019

@simon-mo Can you run Jenkins again? It failed on the docker-metric test, but it passes for me locally on a VM.
The log says:

[integration_py3_docker_metric] 19-04-27:19:31:52 ERROR [clipper_metric_docker.py:126] Failed to parse: http://localhost:31119/api/v1/series?match[]=clipper_mc_pred_total

@simon-mo
Contributor

Jenkins test this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1912/

@rkooo567
Collaborator Author

rkooo567 commented Apr 27, 2019

Hmm, I got the same error. I don't know how we fail to parse the URL; when I run urlparse in an interactive shell, it looks fine (see the sketch below). I will try to figure it out soon.
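For reference, the interactive check mentioned above, assuming Python 3's urllib.parse (urlparse tolerates the literal match[] brackets in the query string):

```python
# Sketch of the interactive check: urlparse accepts the Prometheus-style
# query URL from the error message without raising.
from urllib.parse import urlparse

url = "http://localhost:31119/api/v1/series?match[]=clipper_mc_pred_total"
parsed = urlparse(url)
print(parsed.netloc)  # localhost:31119
print(parsed.query)   # match[]=clipper_mc_pred_total
```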

@rkooo567
Collaborator Author

rkooo567 commented Apr 29, 2019

The tests should pass now. If so, I will rebase the commits.
Never mind, it seems the commits are automatically squashed once merged.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/1921/

@simon-mo simon-mo merged commit ac6aa42 into ucbrise:develop Apr 30, 2019