TensorFlow Serving performance and optimization tips

Would someone be willing to post performance numbers from their production TensorFlow Serving systems? I'm curious about latency percentiles (p50/p90/p99, sometimes written tp50/tp90/tp99), QPS, request and response sizes, and how the numbers compare for traffic within a data center versus over the open web.
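
To make the numbers comparable across answers, here is a minimal sketch of how I would compute them from a list of per-request latencies recorded on the client. It assumes the nearest-rank definition of a percentile and a fixed measurement window; the function names `percentile` and `summarize` are just illustrative and are not part of TensorFlow Serving.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not sorted_values:
        raise ValueError("need at least one sample")
    # Rank is 1-based; clamp to 1 so very small pct values still map to a sample.
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms, window_seconds):
    """Summarize per-request latencies (in ms) collected over window_seconds."""
    ordered = sorted(latencies_ms)
    return {
        "p50_ms": percentile(ordered, 50),
        "p90_ms": percentile(ordered, 90),
        "p99_ms": percentile(ordered, 99),
        # QPS here is simply completed requests divided by the window length.
        "qps": len(ordered) / window_seconds,
    }
```

If you report numbers, it would help to say whether latency was measured at the client (including network time) or at the server, since that is most of the difference between the data center and open web cases.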

Also, what are some best practices for squeezing out more performance? For instance, does streaming or batching requests help?