cluster design consideration
I realize this isn't directly related to ROCKS, but the venerable Beowulf list is considerably slower these days.
We are soon to grow our cluster from 78 nodes to 118, and after that to 374. I anticipate that there will be unanticipated scaling issues.
- You won't see any appreciable scaling issues with Rocks itself at 118 nodes, we have several users who have clusters in the O(300) range (with a single GigE frontend).
- Identifying your storage as the potential bottleneck is right on, especially if jobs never span more than a few nodes.
Currently our links are all gigabit. We think that this will scale as almost all of our compute jobs are single-node jobs, and when not, are only 3-4 node jobs max. We can easily arrange that the 3-4 node jobs do not span switches.
The biggest anticipated scaling issue will be that we are currently aggregating our post-processed data on a single nfs server. As jobs tend to spread themselves out over time, we rarely have all 78 nodes writing to that server at once - they can do most of their writing to local disk and only write a smaller chunk of post-processed data to the nfs server. In addition they also write an even smaller chunk of data to a mysql database. I figure that just adding more nfs servers will solve this problem.
- Either more NFS servers or a beefed-up NFS server. If you look at 24-48 drive systems with 10 Gigabit ethernet, you can generally see several hundred MB/sec of aggregate NFS service.
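To see whether "several hundred MB/sec" from a single beefed-up server is plausible, here is a back-of-the-envelope sketch. The per-drive rate and RAID efficiency are assumptions for illustration, not measurements from the post:

```python
# Rough estimate of deliverable aggregate NFS throughput from a
# 24-48 drive server with a 10GigE link. All numbers are assumed.

DRIVES = 24                 # low end of the 24-48 drive range
MB_PER_DRIVE = 40           # conservative streaming MB/s per SATA drive (assumed)
RAID_EFFICIENCY = 0.5       # haircut for parity/seek overhead (assumed)
NIC_LIMIT_MB = 10_000 / 8   # 10 Gb/s wire rate is 1250 MB/s

disk_mb = DRIVES * MB_PER_DRIVE * RAID_EFFICIENCY
aggregate_mb = min(disk_mb, NIC_LIMIT_MB)   # bounded by disks or the NIC
print(f"disk-limited: {disk_mb:.0f} MB/s, deliverable: {aggregate_mb:.0f} MB/s")
```

Even with conservative assumptions the disks, not the 10GigE link, are the limit, and the result lands in the "several hundred MB/sec" range the reply describes.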
- The SMC 8748L2 is a 48-port gigE switch with 2 optional 10GigE ports and costs about $2K with ports but no optics. We've successfully run Sun x4500s (using ZFS and Solaris) as NFS servers and have had quite good results with 100+ nodes of simultaneous access (each client node connected at gigE).
- I would worry most about network bisection (especially to storage). With 370 nodes, you could build out of 8748s with channel-bonded switch-to-switch links. With 11 such switches, you could configure one as root with 4-port channel-bonds to 10 client (or leaf) switches. With 40 gigE clients on each leaf switch, you would have a 10:1 bisection. That's OK, but not great, and channel bonding doesn't always work the way you expect it to. This is ~$20K for an inexpensive switch structure.
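The bisection and cost arithmetic for this channel-bonded design can be sketched as follows, using the figures from the reply (node counts and the ~$2K switch price; anything else is an assumption):

```python
# Channel-bonded design: 1 root + 10 leaf SMC 8748L2 switches,
# each leaf uplinked to the root with a 4-port gigE bond.

CLIENTS_PER_LEAF = 40   # gigE clients per leaf switch (from the post)
UPLINK_GBITS = 4        # 4-port gigE channel bond to the root
LEAVES = 10
SWITCH_COST = 2000      # ~$2K per SMC 8748L2

# Each client offers 1 Gb/s, so 40 Gb/s of demand crosses a 4 Gb/s uplink.
bisection = CLIENTS_PER_LEAF / UPLINK_GBITS
total_nodes = CLIENTS_PER_LEAF * LEAVES
total_cost = (LEAVES + 1) * SWITCH_COST   # 10 leaves + 1 root

print(f"{total_nodes} client ports, {bisection:.0f}:1 bisection, ~${total_cost:,}")
```

The 40 client ports per leaf leave room for the 4 bonded uplink ports on each 48-port switch, and the ~$22K computed here matches the "~$20K" ballpark in the reply.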
- A different take might use 8 48-port 8748s with 10GigE uplinks to a root 10GigE switch (8×48 = 384 nodes). The SMC 8708 (about $6K) could serve as a root switch. You have to add the cost of optics, which is O($700)/XFP (a 10 Gigabit pluggable optic, like an SFP). 16×$700 ≈ $11K, so $11K + 8×$2K + $6K = $33K, and this yields a 5:1 bisection.
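The same arithmetic for the 10GigE-uplink alternative, using the prices quoted in the reply (the 16 XFPs are two optics per leaf-to-root link, one at each end):

```python
# 10GigE-uplink design: 8 leaf 8748s, one SMC 8708 root, XFP optics.

LEAF_SWITCHES = 8
LEAF_COST = 2000     # SMC 8748L2, ~$2K each
ROOT_COST = 6000     # SMC 8708, ~$6K
XFP_COST = 700       # per 10GigE optic
XFPS = 16            # 2 optics per link x 8 leaf-to-root links

optics_cost = XFPS * XFP_COST
total_cost = optics_cost + LEAF_SWITCHES * LEAF_COST + ROOT_COST

# 48 gigE clients share one 10 Gb/s uplink per leaf switch.
bisection = 48 / 10

print(f"optics ${optics_cost:,}, total ~${total_cost:,}, "
      f"roughly {round(bisection)}:1 bisection")
```

At roughly $33K for a 5:1 bisection versus ~$22K for 10:1, the extra cost buys double the cross-switch bandwidth and avoids the channel-bonding caveats.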
- If you want better bisection, then you need a larger (more expensive) root switch. The next level up is something like a 24-port Force10 2410 ($20K). And then finally, you get to an enterprise-level switch with lots of gigE capacity and forgo leaf switches entirely. The Force10 E300, Cisco 6509, Foundry, etc. all have offerings in this space.