Deep packet inspection to classify V2Ray traffic #569
Comments
interesting |
What kind of V2Ray configurations were used in the data collection stage? If a "tcp transport" was used, I am not surprised a DNN model can classify between non-HTTP/HTTPS data (data encrypted with VMess in your case) and HTTP/HTTPS data with a high score. If a "ws+tls transport" was used, a DNN model could still learn to classify between different TLS fingerprints. (v2ray/v2ray-core#2098) DNN models are powerful, but if you are not using the right dataset, you are making a wrong assumption. As the first step, maybe you could try to include more diverse data in your test set, e.g. data encrypted with Shadowsocks, replicate some VMess encrypted data and re-fill them with random bytes (keep the length), mark them as non-V2Ray data, and see if there is a performance drop. |
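The control experiment suggested above could be sketched roughly as follows. This is a minimal sketch, not code from the thread: the function names and the `samples` list are hypothetical, and real payloads would come from the packet capture.

```python
import os

def randomize_payload(payload: bytes) -> bytes:
    """Return random bytes with the same length as the captured payload."""
    return os.urandom(len(payload))

def build_control_set(vmess_payloads):
    """Pair each random-filled copy with label 0 (non-V2Ray)."""
    return [(randomize_payload(p), 0) for p in vmess_payloads]

# Hypothetical stand-ins for captured VMess payloads.
samples = [b"\x01" * 64, b"\x02" * 128]
control = build_control_set(samples)
```

If the classifier's accuracy drops sharply once such samples are added to the test set, that would suggest it learned "random-looking bytes" rather than anything specific to VMess.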
I used VMess over TCP. I just released the packet post-processing repository. If you are interested, you can replicate my experiment with your own V2Ray configuration and your own traffic. |
I am not going to reproduce your work. If you can't eliminate the above possibility, your work is meaningless, IMO. |
Meaningless is a strong word, kid. I ran PacketSorter on my pfSense router for over two months. The non-V2Ray traffic includes OpenVPN, SSH, SMTP, IMAP, HTTP, HTTPS, and video streaming such as Netflix, iTunes and YouTube. It is typical American family home network traffic. There are naysayers and doers in our world. You are free to make your own choices. |
Good, you gave more information about your dataset. Nowadays many protocols use TLS (including OpenVPN). The header fields of the TLS protocol are salient features that DNN models would love to learn from; just looking at the version number field would achieve a high classification rate for TLS traffic. They are easy to learn compared to VMess's "random" bytes. So why wouldn't a DNN model just learn to distinguish this structured data from "random" bytes? I believe almost all "normal" web traffic uses structured protocols, and they are all easy for a DNN model to learn. But "random" bytes need not be VMess data. If you won't allow anyone to question your results and keep saying "you can replicate my experiment", why are you posting your results here? |
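To illustrate how little a model needs in order to recognize structured TLS traffic, here is a minimal sketch that checks only the five-byte TLS record header (content type, two version bytes, two length bytes). The function name and the sample byte strings are made up for illustration; nothing here comes from the classifier discussed in this thread.

```python
# Record-layer content types: change_cipher_spec, alert, handshake, application_data
TLS_CONTENT_TYPES = {20, 21, 22, 23}
MAX_RECORD_LENGTH = 2**14 + 2048  # upper bound on a TLS 1.2 ciphertext record

def looks_like_tls_record(payload: bytes) -> bool:
    """Heuristic: does the payload start with a plausible TLS record header?"""
    if len(payload) < 5:
        return False
    content_type, major, minor = payload[0], payload[1], payload[2]
    length = int.from_bytes(payload[3:5], "big")
    return (
        content_type in TLS_CONTENT_TYPES
        and major == 3          # SSL 3.0 and every TLS version use major version 3
        and minor <= 4          # 0x0300 through 0x0304
        and length <= MAX_RECORD_LENGTH
    )

tls_like = bytes.fromhex("1603010200") + bytes(512)   # handshake record header
random_like = bytes.fromhex("8f3a910010")             # arbitrary non-TLS bytes
```

A uniformly random byte stream will rarely pass all four checks, so a rule this simple already separates "structured" from "random" without any learning.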
It seems the word meaningless hit you hard, but I am using the right word, and I stated my precondition. I put my arguments quite clearly; if you don't get them, that's fine. V2Ray users will make their own judgements and their own choices. |
Chill, guys. Both of you are comrades fighting against evil, which I can only admire without contributing much! The research is very interesting, so let's stick to the technical side and hope others chime in too. |
Today, I did another round of tests. I spun up a Google Compute Engine (GCE) instance to deploy a V2Ray server. Its IP, port, UUID and OS are completely different from the Amazon Lightsail setup. I used the model previously trained on traffic from the V2Ray server in AWS to classify traffic from the V2Ray server in GCE. The classification can still achieve (loss, acc) -- Regarding VMess with TLS, I followed the user guide. But in the packets captured by Wireshark, I don't see a clear-text client hello, server hello or even a key exchange as in HTTPS. It seems the very first packet is already encrypted. Am I missing something? Or does VMess with TLS masquerade the VMess protocol inside TLS? |
I did another round of configuration trials with TLS today. In vmess protocol, TLS with websocket For sure, pure VMess over TCP can now be easily classified, as my previous research showed. But it would be interesting to see whether TLS with WebSocket and TLS with HTTP/2 can blend in with legitimate web browser HTTPS traffic. From my initial observation in Wireshark, the answer might be no. We will see. [Update] From Wireshark's initial parsing results, V2Ray TLS with WebSocket uses TLS v1.3, which doesn't show the key exchange communication, while other legitimate web browser HTTPS traffic still uses TLS v1.2 and shows a completely different key exchange pattern. I'm not an expert on TLS. But it seems that Golang provides the latest TLS version while the majority of web servers still use the old version. |
I'm not sure whether your trained model can pick out the right connections from a large number of TCP connections without being given the answer in advance (that is, the IP address and port of the deployed service). The traffic may include many unfamiliar TCP connections, and some may use data headers with similar patterns. Can it identify them? TLS does what it is supposed to do: it carries VMess data streams inside TLS data streams. It would also be worth checking whether this multi-layer transport mode has the same or a similar communication pattern that can be identified. As far as I know, TLS 1.3 was introduced based on user feedback, while older versions of V2Ray still use TLS 1.2. Whether the use of TLS 1.3 can serve as a pattern depends on the popularity of TLS 1.3. Some websites still offer it only as a test feature. Due to the disappearance of the main developers, it may be difficult for the "night guard" developers to bring feature changes to this protocol. |
V2Ray v4.18.1 introduced TLS 1.3. You can use an older version of V2Ray to test TLS 1.2. |
@Cwek The model trained on traffic data from the V2Ray server running in AWS can classify traffic from the V2Ray server running in Google Compute Engine with 99% accuracy. Machine learning is about data. That's why I released my infrastructure: so anyone can collect their own traffic in mainland China and train and evaluate the model on their own. From my observation, VMess over TCP should NOT be recommended by default. It can be classified with a high true positive rate (>99%) and a low false positive rate (<1%). Regarding TLS, I'm reading the V2Ray source code. It uses Golang's
I'm not concerned with how to get TLS 1.2 back. I'm curious whether a CNN can classify traffic that uses the TLS combination. Because V2Ray uses the Golang TLS implementation, by default it carries a TLS traffic pattern quite distinct from mainstream web browsers like Chrome and Firefox. If a CNN can classify VMess over TLS, adversarial samples might be the last defense against AI tools. |
I would say it is indeed meaningless to launch a fruitless quarrel. What makes sense is to formulate a well-posed question, such as "Does the dataset really reflect real family data transfer?" or "Does this trained classifier work in another network?"; a corresponding experiment will then settle it. |
I second your point. As I pointed out, the data I trained my classification model on is typical North American home network traffic. I released all the data collection and processing tools and the classification model source code, because I expect those living inside mainland China to replicate the test with their own data. Hopefully someone can tell me something with empirical data rather than calling others' work meaningless here. Pretending to be knowledgeable is easier than getting your hands dirty to do something meaningful for our world. |
V2Ray is a rather complicated tool involving multiple protocols; different combinations of protocols generate different packets and thus result in different features. |
In the first experiment, I used a pure VMess over TCP setting. The model is in a Python notebook. You can see how I partition the test set and training set in the data generator and how I collect data. All network traffic from my home network passed through the pfSense router. Because I used one V2Ray server with a known IP and port, I used these as the tag to label V2Ray traffic, while anything else is non-V2Ray traffic. Of course, I mask the IP and port with zero values in post-processing so that the model is IP and port agnostic. Also, please read my last blog post. I used the pre-trained classifier to test on unseen V2Ray server traffic, where the V2Ray server moved from AWS to Google Compute Engine with a different OS flavor and a different UUID. It still achieves a near-perfect ROC curve. I'm not concerned that the model is overfitting. I'm not saying V2Ray should not be used now. But we should NOT use the VMess protocol over TCP. Big brother can tell you are using it without decryption. To counter the classifier, I'm researching adversarial examples to evade it. So far I have tried two gradient-based approaches, namely FGSM and DeepFool, to add noise to the first 16 non-empty TCP payload packets. The adversarial example generated by FGSM is easy to compute and effective, while the one by DeepFool is not easy to compute due to the vanishing gradient in the activation function of the output layer. The hard part is figuring out how to apply this adversarial noise in our communication setting, where we have domain-specific constraints, such as how to encode and decode the noise between two parties and how to apply adversarial noise effectively in the TCP payload only (you can't perturb the TCP and IP headers). In any case, I will publish more research on my blog once I have sorted out a feasible solution. |
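The IP and port masking step described above might look roughly like this. It is a sketch under stated assumptions, not the code from the released post-processing repository: it assumes each sample is a raw IPv4/TCP packet that starts at the IPv4 header (no Ethernet frame), the function name is hypothetical, and checksums are deliberately left untouched.

```python
def mask_ip_and_ports(packet: bytes) -> bytes:
    """Zero the IPv4 addresses and TCP ports of a raw IPv4/TCP packet."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    if packet[9] != 6:
        raise ValueError("not a TCP segment")
    ihl = (packet[0] & 0x0F) * 4           # IPv4 header length in bytes
    if ihl < 20 or len(packet) < ihl + 4:
        raise ValueError("truncated packet")
    buf = bytearray(packet)
    buf[12:20] = bytes(8)                  # source + destination IPv4 addresses
    buf[ihl:ihl + 4] = bytes(4)            # source + destination TCP ports
    return bytes(buf)

# Hypothetical packet: 10.0.0.1:8080 -> 10.0.0.2:443, 20-byte IP and TCP headers.
example = (
    bytes([0x45]) + bytes(8) + bytes([6]) + bytes(2)   # version/IHL ... protocol, checksum
    + bytes([10, 0, 0, 1, 10, 0, 0, 2])                # source and destination addresses
    + bytes([0x1F, 0x90, 0x01, 0xBB])                  # source and destination ports
    + b"\xaa" * 16                                     # rest of the TCP header
    + b"payload"
)
masked = mask_ip_and_ports(example)
```

Everything outside the address and port fields is passed through unchanged, so the payload bytes the classifier sees are the same as in the capture.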
I published a new blog post on white-box adversarial examples. There is a Python notebook where I wrote an adversarial noise generator in TensorFlow 2.0. Under the three assumptions below, we could evade the DNN classifier by adding adversarial noise. In other words, V2Ray traffic with adversarial noise will be misclassified as non-V2Ray traffic.
My next step is to remove these assumptions one by one. |
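For readers unfamiliar with FGSM, here is a minimal white-box sketch of the idea. It is not the TensorFlow notebook mentioned above: a tiny logistic-regression model stands in for the CNN, the weights, input and `eps` are made-up values, and features are assumed to be bytes scaled to [0, 1]. FGSM moves each input feature by `eps` in the direction of the sign of the loss gradient.

```python
import math

def predict(w, b, x):
    """Probability that x is V2Ray traffic under a toy logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One FGSM step: x + eps * sign(d loss / d x), clipped to [0, 1]."""
    p = predict(w, b, x)
    # For binary cross-entropy with a sigmoid output, d loss / d x_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]

# Hypothetical model and sample; y = 1 means "V2Ray".
w, b = [2.0, -1.0, 0.5], 0.0
x = [0.8, 0.2, 0.5]
x_adv = fgsm(w, b, x, y=1, eps=0.1)
```

With these toy numbers the perturbed input gets a lower V2Ray probability than the original. Whether such noise can be applied to real traffic is exactly the open question raised above, since only the TCP payload can be perturbed and both endpoints must be able to encode and decode it.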
You don't need to tell me why this is important. If it were not important, I would not spend my spare time on it. Free speech is the only cure for a sick country. There are many configuration settings in V2Ray. It takes time to collect traffic data with one particular setting for machine learning purposes. I spent a month collecting network traffic with the VMess over TCP setting for my first report. I started collecting a new set of data with the VMess + TLS configuration on Jan 29th. Today, I found from my home server's apcupsd daemon log that there was a power outage on Feb 6th. My pfSense router runs on a retired PC that isn't connected to a UPS. The power outage interrupted the data collection, and I lost seven days of data. It is a minor setback.
There are tons of smart people in mainland China. Can anyone replicate this experiment with your own V2Ray settings and traffic data? I know I am repeating myself. But God helps those who help themselves. TBH, I don't need V2Ray at all. I have no idea how you set up V2Ray, nor do I have traffic patterns similar to yours. We need a collective effort to beat big brother. They have a trove of data to train their models, and I can't fight this alone. If it was well known that VMess over TCP was insecure before I released my report, why is it still the recommended setting in the v2ray.com user guide? Why not deprecate the setting out of the box in V2Ray? One more note: I'm not sure how you arrive at a percentage for how secure a V2Ray configuration is. We need an experiment to verify the claim. This is science, not religion. VMess over TCP is not secure at all. This has been proven. IMO, security in software is a binary thing: either secure or insecure. There is no gray area. |
I'm not interested in publishing papers. I'm more of a garage scientist. My motivation is to defeat censorship in all forms. Even in a free society like America, we still have censorship such as political correctness. But I use my vote in America, rather than the scientific web browsing technique you use in China, to defeat it. I understand some of your points. But one thing I want to point out is that we should make our research public. In the encryption field, the most secure way to design an encryption algorithm is to publish it. Only then can other eyeballs help find the flaws in its principles and the bugs in its implementation. Keeping your encryption in a safe doesn't make it secure. I think Nazi Germany and Imperial Japan learned that lesson in WWII. Modern asymmetric encryption is based on a hard math problem: factoring large numbers. Its principles and implementations are in the public domain. We rely on a problem widely believed to be hard to defend our encryption, and we no longer keep the algorithm under the mattress as Germany and Japan did. We don't know whether big brother uses DNNs to classify network traffic now. But these techniques are feasible in reality. If the Chinese scholars applied for the patent, it is likely that they are going to commercialize this technique in the near future. That is how software companies and patent trolls protect their IP in the West. To accelerate research on finding a way to defeat it, it is better that we open our research and let other smart minds join the fight. |
At the beginning I thought Mr. Su was from the authorities and was here to steer circumvention technology into a swamp where censorship can be easily achieved. However, after reading his meaningless debate, I jumped to the conclusion that he/she is a crybaby begging for a stop to research into technologies that could very likely be used for traffic sniffing. |
I cherish Western universal values. Freedom of speech means everyone, even those working for the evil regime, is entitled to speak freely. In the past, I was furious to see so many fifty cents active on Western social media like Facebook, Twitter, and YouTube to influence the views of people living beyond the wall. I'm not grumpy any more now that I realize why Americans allow anyone, regardless of motivation, to use their social platforms: given free access to information, even fifty cents or hardcore commies will see their values and ideology change. Debate between rational and civilized people is an enlightenment process that benefits every participant. One thing I learned from @suhuaixing's lengthy argument is how the cool kids in China use V2Ray today to bypass censorship. Now I am changing course to research TLS traffic. I understand that English is a second language for all of us here. The adjective meaningless is a strong word. Even if someone, myself included, asks a stupid question or argues like a fool, it is great when someone else points it out with solid evidence and rational reasoning. I'm not pointing fingers to say who is a fifty cent or who is working for the evil regime. Nobody is an expert on everything. The more we discuss, the better informed we are. |
Protocols like Shadowsocks and VMess share the same “flaw”: their packets look very random. To identify those “random” packets, the easiest way is to compute the entropy of the first packets of a TCP stream, like this: https://github.com/isofew/sssniff. It has been a known issue for years. That's why the DNN approach here is thought of as “meaningless”, or at least “overkill”. BTW, the right direction for future circumvention tools is obfuscation. |
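The entropy idea can be sketched in a few lines. This is a generic Shannon entropy computation over the bytes of a payload, not the actual logic of the linked sssniff project; the function name is hypothetical, and any decision threshold would have to be chosen from real traffic.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

uniform = bytes(range(256))    # every byte value exactly once
constant = b"A" * 100          # a single repeated byte
```

Encrypted or random-looking payloads score close to 8 bits per byte, while plain-text protocol headers score noticeably lower, which is why a single statistic on the first packets can already flag fully encrypted streams.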
Throughout the thread I see nothing about any specific issue with the project, only discussion of your own research. This is the place to report issues to V2Ray developers, not a general discussion board, which you can find at v2ray/discussion. Therefore I reckon this issue can be closed for now, since you don't seem to be coming up with a fix. V2Ray itself, though widely used for getting past firewalls, is still a general network utility that focuses on efficiency, functionality and more, not just on hiding your traffic. If that's what you thought V2Ray was, you are in the wrong place. The last thing I would like to say is that as long as it is a designed protocol, there should be a way to tell it apart from others; only the means and effort are uncertain. If you truly want your packets to blend in, simply wrap them in another protocol as one more outer layer. |
The issue is crystal clear. The VMess protocol over TCP can be identified without decryption, active probing, or knowledge of the port or IP. Thus, it can be blocked in practice. I read the research report on Shadowsocks. If a tool can be detected, the next step is for it to be blocked by the GFW. Haven't we learned the lesson by now? My initial proposal is to disable the VMess over TCP setting out of the box. It is unwise to expose citizens to risk when we know for sure that it can be identified. Can the V2Ray dev team make this change? Or at least update the official user guide to inform users about this. I'm still collecting data with the VMess over TLS setting, where traffic is masqueraded as TLS. I'm not sure this is a solution, because we can classify SMTP over VPN, web traffic over VPN, etc. Why would TLS make a difference? This is my speculation. It takes time to prove it. I don't use V2Ray. I don't need any circumvention tools to access the Internet. The only reason I spend my time here is to defend freedom of expression. You can keep moving the goalposts left and right, but that doesn't address the problem at all. |
What is crystal clear is that your logic is problematic. Disabling VMess over TCP follows exactly the same logic as the case below, which is ridiculous:
Though the V2Ray project identifies itself as a "unified platform for anti-censorship", which means it gives anti-censorship high priority, the prerequisite is that it is still a unified platform, in other words, a versatile network utility. VMess over TCP is a core function and foundation of the whole project, so I don't think there is any chance we will stop supporting it in the near future. With basic knowledge of statistics you should quickly realize that disabling an option won't give you a better chance of getting past a firewall.
|
Are we playing a GRE analytical writing game here? If so, I will point out the fallacy in your analogy. There are many ways to deliver safe water. If we know for sure that water delivered through the city water pipes may contain poisonous lead, should we stop distributing water this way? Hell, yes. Look up the news on the "Flint Water Crisis". But it doesn't mean we stop the water supply completely. People in Flint, Michigan still need water. Therefore, they switched to drinking bottled water temporarily. At the same time, the city is rebuilding a lead-free water pipe system. The same logic applies to V2Ray. There are many settings provided by V2Ray. I never advised people to discontinue V2Ray completely so that no one can circumvent the GFW. I just propose to stop using a specific configuration that has been proven insecure, or at least to give a warning in the user guide. My statistical background does NOT lead me to your conclusion.
If you imply the non-techie should use the insecure setting to increase the chance for the techie to evade GFW, that makes me sick. You don't need to quote my words from my profile. This will be my next step. |
|
I am really interested in the vmess-ws-tls combination. Have you finished the second-stage data collection? What is the result? |
You reason like the Chinese ambassador in Australia:
The VMess protocol is neither water nor a water source. We are talking about a flawed application protocol that may expose users' circumvention of the GFW. But if you want to play with the analogy, then fine. Let's do it right. Free information behind the GFW is like the clean water we desperately want. The VMess over TCP configuration in V2Ray is like the city water pipe that may leach poisonous lead. If we know this, we should inform both the techies and the non-techies about the risk of consumption. Can we find such FYI precautions in the user guide? No, they don't exist. I'm not sure why you quote the limitation of liability clause at the end of your argument. Do you imply that it justifies allowing a flawed application protocol to exist without a warning? That's a silly argument. I finished the month-long data collection with the TLS+WS setting yesterday. The DNN model can still achieve >99% classification accuracy on the validation set (I split 80% of the data for the training set and 20% for the validation set). In the following weeks, I will collect traffic data from a different host with a different TLS certificate and a different UUID and see whether the previously trained model can generalize its classification capabilities. Once all is done, I will publish a blog post. |
Oh no! Basically that means we are doomed! Thanks for your hard work. |
I published my findings on TLS + WS in my blog post and uploaded the pre-trained model as well. In a nutshell, the pre-trained model can generalize its classification capabilities to different UUIDs, different IPs, different ports, different domain names, different OS flavors, and different TLS certificates. Can @madeye elaborate on the "obfuscating" solution? |
Over the last three months, I have been verifying the patent claim by replicating it with code and data.
First, I spent two months writing a TCP packet sorter. Then I collected home traffic for a month, in which I mixed V2Ray traffic with other normal network traffic.
Last night, I stopped capturing packet data and started to build a classifier. I spent two hours writing a 1D Convolutional Neural Network to do binary classification of V2Ray and non-V2Ray traffic.
To my surprise, I achieved a classification accuracy of up to 98.86%. In a nutshell, the patent claim is real.
I shared my classification model in a Python notebook and my findings in my blog post.