Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C10D: Added TCPStore to support C10D store interface #7560

Merged
merged 5 commits into from May 17, 2018

Conversation

teng-li
Copy link
Contributor

@teng-li teng-li commented May 15, 2018

Added TCP Store as another Store for the C10D interface.

For more details on the C10D Store interface, refer: #7434

A few things done in the PR:

(1) TCP Store can be initialized either as a master who will launch a daemon thread or a slave who will connect to the master's store daemon thread.
(2) TCP Store supports all the interface functions for Store.h
(3) Refactored some of the TCP helper functions into Utils.hoo/cpp
(4) Refactored test to some common folders.

Test Plan:

~/pytorch/torch/lib/c10d$ bin/test.sh
+ mkdir -p build
+ cd build
+ cmake ../
-- Configuring done
-- Generating done
-- Build files have been written to: /private/home/tengli/pytorch/torch/lib/c10d/build
+ make all test
[ 11%] Building CXX object CMakeFiles/store.dir/Utils.cpp.o
[ 22%] Building CXX object CMakeFiles/store.dir/Store.cpp.o
[ 33%] Building CXX object CMakeFiles/store.dir/FileStore.cpp.o
[ 44%] Building CXX object CMakeFiles/store.dir/TcpStore.cpp.o
[ 55%] Linking CXX static library libstore.a
[ 55%] Built target store
Scanning dependencies of target TcpStoreTest
[ 66%] Building CXX object test/CMakeFiles/TcpStoreTest.dir/TcpStoreTest.cpp.o
[ 77%] Linking CXX executable TcpStoreTest
[ 77%] Built target TcpStoreTest
[ 88%] Building CXX object test/CMakeFiles/FileStoreTest.dir/FileStoreTest.cpp.o
[100%] Linking CXX executable FileStoreTest
[100%] Built target FileStoreTest
Running tests...
Test project /private/home/tengli/pytorch/torch/lib/c10d/build
    Start 1: FileStoreTest
1/2 Test #1: FileStoreTest ....................   Passed    0.03 sec
    Start 2: TcpStoreTest
2/2 Test #2: TcpStoreTest .....................   Passed    0.69 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) =   0.74 sec

@teng-li teng-li requested review from pietern and apaszke May 15, 2018 01:51
@teng-li teng-li added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 15, 2018
Copy link
Contributor

@apaszke apaszke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like most of the code is coming straight from THD, is that right? Can you please post diffs comparing them to the original files? That would make reviewing much easier.

int connect(const std::string& address,
PortType port,
bool wait,
int timeout) {

This comment was marked as off-topic.

This comment was marked as off-topic.


bytesToReceive -= bytesReceived;
currentBytes += bytesReceived;
}

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

c10d::test::set(*slaveStores[i], key, val);
c10d::test::check(*slaveStores[i], key, val);
}
})));

This comment was marked as off-topic.

This comment was marked as off-topic.

@teng-li
Copy link
Contributor Author

teng-li commented May 15, 2018

@apaszke Utils.cpp/hpp are mostly from THD with some refactoring. TcpStore.hpp/cpp are based on architecture of Gloo store, but the API implementation needs some changes to support these new functions like wait with timeout, add, check functions that are defined in the new base store class. I sent you a diff file internally on messenger.

Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @teng-li!

General comments:

  • Can we name it TCPStore instead?
  • There is inconsistency in style, w.r.t. double newlines, braces on newline vs end of line, function argument indentation, namespace comments.
  • A few comments below talk about this: I think we should not let the exception propagation determine shutdown of the daemon thread. It will happen one node dies for whatever reason and this will immediately propagate everywhere. Instead it should just go away, be removed from the fd vector, and the daemon should keep running.

keysAwaited_.push_back(0);
fds.push_back({ .fd = sockFd, .events = POLLIN });
}
for (size_t rank = 0; rank < sockets_.size(); rank++) {

This comment was marked as off-topic.

This comment was marked as off-topic.


SYSCHECK(::poll(fds.data(), fds.size(), -1));

if (fds[0].revents != 0) {

This comment was marked as off-topic.

This comment was marked as off-topic.


if (fds[0].revents != 0) {
if (fds[0].revents ^ POLLIN) {
throw std::system_error(ECONNABORTED, std::system_category());

This comment was marked as off-topic.

This comment was marked as off-topic.

* side has been closed. If the closing was due to normal exit, then the
* store should exit too. Otherwise, if it was different exception,
* other processes will get an exception once they try to use the store.
*/

This comment was marked as off-topic.

This comment was marked as off-topic.

}

void TcpStoreDaemon::join() {
daemonThread_.join();

This comment was marked as off-topic.

This comment was marked as off-topic.


{
if (isServer_) {
// Openning up the listening socket

This comment was marked as off-topic.

This comment was marked as off-topic.

return false;
} else {
throw std::runtime_error("stop_waiting or keep_waiting response expected");
}

This comment was marked as off-topic.

This comment was marked as off-topic.

* or, in the case of wait
* type of query | number of args | size of arg1 | arg1 | ...
*/
void TcpStoreDaemon::query(RankType rank) {

This comment was marked as off-topic.

This comment was marked as off-topic.

void TcpStore::set(const std::string& key, const std::vector<uint8_t>& data) {
tcputil::sendValue<QueryType>(storeSocket_, QueryType::SET);
tcputil::sendString(storeSocket_, key, true);
tcputil::sendVector<uint8_t>(storeSocket_, data);

This comment was marked as off-topic.

@@ -13,4 +14,5 @@ foreach(test_src ${test_srcs})
target_include_directories(${test_name} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/..)
target_link_libraries(${test_name} ${test_libraries})
add_test(NAME ${test_name} COMMAND $<TARGET_FILE:${test_name}>)

This comment was marked as off-topic.

@teng-li
Copy link
Contributor Author

teng-li commented May 16, 2018

@pietern Changed the file name to TCPStore, but somehow the new commit cannot see the diff :(, I can revert the name back so that you can see the diff and later rename it.

Anyway, addressed all of your comments, like properly shutdown the daemon using pipes etc.

@apaszke addressed your comments too,

@teng-li teng-li force-pushed the TcpStore branch 6 times, most recently from bbd1059 to e11d975 Compare May 16, 2018 02:09
ret = false;
it++;
} else {
it = keys.erase(it);

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

}
/* sleep override */
std::this_thread::sleep_for(std::chrono::milliseconds(10));
}

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

}

// send a string's length and data
inline void sendString(int socket,

This comment was marked as off-topic.

This comment was marked as off-topic.

const std::chrono::milliseconds& timeout = kNoTimeout);

// Helper resource guard class
class ResourceGuard {

This comment was marked as off-topic.

This comment was marked as off-topic.

std::string val = "thread_val_" +
std::to_string(numIterations - 1);
c10d::test::check(*slaveStores[i], key, val);
}

This comment was marked as off-topic.

This comment was marked as off-topic.

Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few instances of the style inconsistencies I was talking about in the previous review. There are functions that start with a newline and ones that don't. Also the mixed use of // and /**/ is odd on the eyes.

}

TCPStoreDaemon::~TCPStoreDaemon() {

This comment was marked as off-topic.

}

void TCPStoreDaemon::run() {

This comment was marked as off-topic.

void TCPStoreDaemon::run() {

// Create the control pipe
controlPipeFd_ = std::vector<int>{-1, -1};

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

return listenPort;
}

}

This comment was marked as off-topic.

return static_cast<RankType>(rank);
}

// TCP util namespace

This comment was marked as off-topic.

c10d::test::Semaphore sem1, sem2;

// Each thread will have a slave store to send/recv data
std::vector<std::unique_ptr<c10d::TCPStore>> slaveStores;

This comment was marked as off-topic.

This comment was marked as off-topic.

Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Micro nit remaining

}

bool TCPStoreDaemon::
checkKeys(const std::vector<std::string>& keys) const {

This comment was marked as off-topic.

// we hit an exception here.
::close(fds[fdIdx].fd);
fds.erase(fds.begin() + fdIdx);
sockets_.erase(sockets_.begin() + fdIdx - 2);

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

@teng-li teng-li force-pushed the TcpStore branch 2 times, most recently from e5f7c5c to 0171a57 Compare May 17, 2018 18:08
@teng-li teng-li merged commit 0d27d26 into pytorch:master May 17, 2018
@teng-li teng-li deleted the TcpStore branch May 17, 2018 20:38
onnxbot added a commit to onnxbot/onnx-fb-universe that referenced this pull request May 17, 2018
petrex added a commit to petrex/pytorch that referenced this pull request May 23, 2018
* upstream/master:
  Makes AccumulateGrad high priority in backwards passes (pytorch#7604)
  [C++ API] Implement builder style construction (pytorch#7597)
  C10D: Added TCPStore to support C10D store interface (pytorch#7560)
  [auto] Update onnx to ba86ec2 - Protobuf typing (onnx/onnx#982) onnx/onnx@ba86ec2
  Add LBFGS optimization algorithm to C++ API (pytorch#7596)
weiyangfb pushed a commit to weiyangfb/pytorch that referenced this pull request Jun 11, 2018
Reference: pytorch#7434

* C10D: Added TCPStore to support C10D store interface

* Used pipe to terminate the store daemon and addressed all comments

* Used notify/wake for wait and addressed all comments

* Clean up nits

* Clean up all socket states when the socket is closed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants