Deep packet inspection to classify V2Ray traffic #569

Closed
rickyzhang82 opened this issue Jan 19, 2020 · 33 comments

@rickyzhang82

Over the last 3 months, I have been verifying the patent claim by replicating it with code and data.

First, I spent two months writing a TCP packet sorter. Then I collected home traffic for a month, mixing V2Ray traffic with other normal network traffic.

Last night, I stopped capturing packet data and started to build a classifier. I spent two hours writing a 1D convolutional neural network that performs binary classification of V2Ray versus non-V2Ray traffic.

To my surprise, it achieves a classification accuracy of up to 98.86%. In a nutshell, the patent claim is real.

I shared my classification model in a Python notebook and my findings in a blog post.

@twskipper

interesting

@eycorsican

What kind of V2Ray configurations were used in the data collection stage?

If a "tcp transport" was used, I am not surprised a DNN model can classify between non-HTTP/HTTPS data (data encrypted with VMess in your case) and HTTP/HTTPS data with a high score.

If a "ws+tls transport" was used, a DNN model could still learn to classify between different TLS fingerprints. (v2ray/v2ray-core#2098)

DNN models are powerful, but if you are not using the right dataset, you are making a wrong assumption.

As a first step, you could include more diverse data in your test set, e.g. data encrypted with Shadowsocks. You could also copy some VMess-encrypted samples, refill them with random bytes (keeping the length), mark them as non-V2Ray data, and see whether performance drops.
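
The random-fill control suggested above can be sketched roughly as follows. This is only an illustration, assuming each captured sample is available as a plain `bytes` payload; `make_random_control` and the label constants are hypothetical names, not part of any released tool.

```python
import os

NON_V2RAY = 0  # hypothetical label values
V2RAY = 1

def make_random_control(payloads):
    """Build control samples: same lengths as the inputs, but filled with
    uniformly random bytes and labeled as non-V2Ray."""
    return [(os.urandom(len(p)), NON_V2RAY) for p in payloads]

# Stand-ins for captured VMess payloads (real ones would come from the capture).
vmess_payloads = [b"\x8f\x02\x91" * 20, b"\x11" * 7]
controls = make_random_control(vmess_payloads)
```

If accuracy drops sharply once such controls are mixed into the test set, the model is probably separating structured bytes from random-looking bytes rather than recognizing VMess specifically.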

@rickyzhang82
Author

@eycorsican

I used VMess over TCP. I just released the packet post-processing repository.

If you are interested in it, you can replicate my experiment on your own V2Ray configuration and your own traffic.

@eycorsican

> If a "tcp transport" was used, I am not surprised a DNN model can classify between non-HTTP/HTTPS data (data encrypted with VMess in your case) and HTTP/HTTPS data with a high score.

I am not going to reproduce your work. If you can't eliminate the above possibility, your work is meaningless, IMO.

@rickyzhang82
Author

Meaningless is a strong word, kid.

I ran PacketSorter on my pfSense router over two months. The non-V2Ray traffic includes OpenVPN, SSH, SMTP, IMAP, HTTP, HTTPS, and video streaming such as Netflix, iTunes and YouTube. It is typical American family home network traffic.

There are naysayers and doers in our world. You are free to make your own choices.

@eycorsican

Good, you gave more information about your dataset.

Nowadays many protocols use TLS (including OpenVPN). The header fields of the TLS protocol are salient features that DNN models love to learn from; just looking at the version number field would achieve a high classification rate for TLS traffic. These fields are easy to learn compared to VMess's "random" bytes. So why wouldn't a DNN model simply learn to distinguish this structured data from "random" bytes? I believe almost all "normal" web traffic uses structured protocols, and they are all easy for a DNN model to learn.
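
To illustrate how little it takes to spot structured TLS bytes, here is a minimal sketch of a header check. It assumes the input is the first TCP payload of a stream and only inspects the 5-byte TLS record header (content type, two version bytes, length); `looks_like_tls_record` is a hypothetical helper, not something from the notebook under discussion.

```python
def looks_like_tls_record(payload: bytes) -> bool:
    """Heuristic: does the payload start with a plausible TLS record header?"""
    if len(payload) < 5:
        return False
    content_type, major, minor = payload[0], payload[1], payload[2]
    length = int.from_bytes(payload[3:5], "big")
    return (
        content_type in (20, 21, 22, 23)  # change_cipher_spec, alert, handshake, application_data
        and major == 3
        and minor in (1, 2, 3, 4)         # version codes used by TLS 1.0 to 1.3
        and 0 < length <= 2**14 + 2048    # TLS 1.2 upper bound on ciphertext record length
    )

client_hello_like = bytes([22, 3, 1, 0, 200]) + bytes(200)  # handshake record, version 0x0301
random_like = b"\x8f\x02\x91\x00\x10" + bytes(16)           # no valid content type
```

A three-field rule like this already separates TLS records from random-looking payloads, which is why a high accuracy against mostly-TLS background traffic says little on its own.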

But, "random" bytes need not be VMess data.

If you don't allow anyone to question your results and keep saying "you can replicate my experiment", why are you posting your results here?

@rickyzhang82

This comment has been minimized.

@eycorsican

> If you can't eliminate the above possibility, your work is meaningless, IMO.

It seems the word meaningless hits you hard, but I am using the right word and I stated my precondition.

Well, I put my arguments quite clearly; if you don't get them, that's fine.

V2Ray users will have their own judgements and make their own choices.

@GreatBigWhiteWorld

Chill, guys. Both of you are comrades fighting against evil, which I can only admire without contributing much! The research is very interesting, so let's stick to the technical side and hope others chime in too.

@rickyzhang82
Author

Today, I did another round of tests.

I spun up a Google Compute Engine (GCE) instance to deploy a V2Ray server. It has a completely different IP, port, UUID and OS from the Amazon Lightsail one. I used the model previously trained on traffic from the V2Ray server in AWS to classify traffic from the V2Ray server in GCE.

The classifier still achieves (loss, acc) -- [0.023288102071980347, 0.99675000000000002]. The ROC curve still looks perfect.

Regarding VMess with TLS, I followed the user guide. But in the packets captured by Wireshark, I don't see a clear-text client hello, server hello or even a key exchange like in HTTPS. It seems the very first packet is already encrypted. Am I missing something? Or does VMess with TLS masquerade the VMess protocol inside TLS?

@rickyzhang82
Author

rickyzhang82 commented Jan 22, 2020

I did another round of configuration trials with TLS today.

With the VMess protocol, TLS with WebSocket (ws) or TLS with HTTP/2 (http) shows typical clear-text TLS communication (client hello, server hello), but without the server/client key exchange seen in HTTPS from our web browsers. TLS with tcp, however, shows an encrypted packet pattern very similar to VMess over TCP without TLS.

For sure, pure VMess over TCP can now be easily classified, as my previous research shows. But it would be interesting to see whether TLS with WebSocket and TLS with HTTP/2 can blend in with legitimate web browser HTTPS traffic. From my initial observation in Wireshark, the answer might be no. We will see.

[Update] From Wireshark's initial parsing result, V2Ray TLS with WebSocket uses TLS v1.3, which doesn't show the key exchange messages, while other legitimate web browser HTTPS traffic still uses TLS v1.2 and shows a completely different key exchange pattern. I'm not an expert on TLS.

But it seems that Golang provides the latest TLS version, whereas the majority of web servers still use the old version.

@Cwek

Cwek commented Jan 22, 2020

I'm not sure whether your trained model can pick out the right connections from a large number of TCP connections without being given the answer in advance (that is, the IP address and port on which the service is deployed). Such traffic may include many unfamiliar TCP connections, and some may use data headers with similar patterns. Can it identify them?

TLS does what it is supposed to do: it wraps VMess data streams inside TLS data streams. It may also be worth checking whether this multi-layer transport mode has the same or a similar communication pattern that can be identified.

As far as I know, TLS 1.3 was introduced based on user feedback, while older versions of V2Ray still use TLS 1.2. Whether the use of TLS 1.3 can serve as a pattern depends on the popularity of TLS 1.3. Some websites still offer it as a test feature.

Due to the disappearance of the main developers, it may be difficult for the "night guard" developers to bring feature changes to this protocol.

@Cwek

Cwek commented Jan 22, 2020

V2Ray v4.18.1 introduced TLS 1.3. You can use an older version of V2Ray to check TLS 1.2.

@rickyzhang82
Author

rickyzhang82 commented Jan 22, 2020

@Cwek
The configuration I used in my Python notebook is VMess over TCP. The data set was collected over a one-month period on my home router. It represents typical North American home network traffic. The source and destination IPs and ports have been masked to zero, so the trained model is immune to IP and port changes.
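
For readers who want to see what this masking amounts to at the byte level, here is a minimal sketch. It assumes each sample is a raw IPv4 packet carrying TCP, starting at the IP header; `mask_ip_and_port` is a hypothetical helper and not the code from the released post-processing repository.

```python
def mask_ip_and_port(packet: bytes) -> bytes:
    """Zero the IPv4 addresses and TCP ports of a raw IPv4/TCP packet."""
    buf = bytearray(packet)
    if len(buf) < 20 or buf[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    ihl = (buf[0] & 0x0F) * 4            # IPv4 header length in bytes
    if len(buf) < ihl + 4:
        raise ValueError("truncated TCP header")
    buf[12:20] = bytes(8)                # source and destination addresses
    buf[ihl:ihl + 4] = bytes(4)          # TCP source and destination ports
    return bytes(buf)

# Toy packet: 20-byte IPv4 header (version 4, IHL 5) followed by 20 bytes of TCP header.
packet = bytes([0x45]) + b"\xff" * 39
masked = mask_ip_and_port(packet)
```

The sketch leaves the IP and TCP checksums untouched, so they no longer match the masked fields; whether those bytes should also be zeroed before training is a separate choice.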

The model trained on traffic from the V2Ray server running in AWS can classify traffic from the V2Ray server running in Google Compute Engine with 99% accuracy.

Machine learning is about data. That's why I released my infrastructure, so that anyone can collect their own traffic in mainland China and train and evaluate the model on their own.

From my observation, VMess over TCP should NOT be recommended by default. It can be classified with a high true positive rate (>99%) and a low false positive rate (<1%).
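
As a reminder of how these two rates are defined, here is a small sketch that computes them from binary labels (1 = V2Ray, 0 = non-V2Ray). The label lists are made-up toy values, not results from the experiment.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

# Toy data: 4 V2Ray flows and 6 other flows.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
tpr, fpr = tpr_fpr(y_true, y_pred)
```

Both rates matter in practice: when V2Ray flows are a tiny fraction of all traffic, even a false positive rate below 1% can produce more false alarms than true detections.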

Regarding TLS, I'm reading the V2Ray source code. It uses Golang's crypto/tls implementation and opts in to TLS 1.3 through the GODEBUG environment variable. I'm not sure yet whether the CNN model can classify VMess over TLS + WebSocket or VMess over TLS + HTTP/2; I need to collect new training data. But moving everyone to TLS 1.3 might not be a good idea while the majority of TLS traffic is still on version 1.2.

func init() {
	// opt-in TLS 1.3 for Go1.12
	// TODO: remove this line when Go1.13 is released.
	if !strings.Contains(os.Getenv("GODEBUG"), "tls13") {
		_ = os.Setenv("GODEBUG", os.Getenv("GODEBUG")+",tls13=1")
	}
}

I'm not concerned with how to get TLS 1.2 back. I'm curious whether a CNN can classify traffic in the TLS combinations. Because V2Ray uses Golang's TLS implementation, by default it carries a TLS traffic pattern quite distinct from mainstream web browsers such as Chrome and Firefox.

If a CNN can classify VMess over TLS, adversarial samples might be the last defense against AI tools.

@ailinwang

> Good, you gave more information about your dataset.

> Nowadays many protocols use TLS (including OpenVPN). The header fields of the TLS protocol are salient features that DNN models love to learn from; just looking at the version number field would achieve a high classification rate for TLS traffic. These fields are easy to learn compared to VMess's "random" bytes. So why wouldn't a DNN model simply learn to distinguish this structured data from "random" bytes? I believe almost all "normal" web traffic uses structured protocols, and they are all easy for a DNN model to learn.

> But, "random" bytes need not be VMess data.

> If you don't allow anyone to question your results and keep saying "you can replicate my experiment", why are you posting your results here?

I would say it is indeed meaningless to start a fruitless quarrel. What makes sense is to formulate a well-posed question, such as "Does the dataset really reflect real family network traffic?" or "Does this trained classifier work on another network?", and then settle it with a corresponding experiment.
It is therefore suggested to open a new board for discussion. Any security flaw counts, as it will CERTAINLY affect the efficacy of the anti-censorship software.
Machine learning currently poses a MAJOR threat to cybersecurity.

@rickyzhang82
Author

@ailinwang

I second your point.

As I pointed out, the data I trained my classification model on is typical North American home network traffic.

I released all the data collection and processing tools and the classification model source code because I expect those living in mainland China to replicate the test with their own data. Hopefully someone can tell me something backed by empirical data rather than calling others' work meaningless here.

Pretending to be a knowledgeable guy is easier than getting your hands dirty to do something meaningful for our world.

@ailinwang

ailinwang commented Feb 8, 2020

@rickyzhang82

> @ailinwang

> I second your point.

> As I pointed out, the data I trained my classification model on is typical North American home network traffic.

> I released all the data collection and processing tools and the classification model source code because I expect those living in mainland China to replicate the test with their own data. Hopefully someone can tell me something backed by empirical data rather than calling others' work meaningless here.

> Pretending to be a knowledgeable guy is easier than getting your hands dirty to do something meaningful for our world.

V2Ray is a rather complicated tool involving multiple protocols. Different combinations of protocols generate different packets and thus result in different features.
Let me sum up your experiment: you ran it on your own dataset, collected in your home in the US over 2 weeks. You have several clients that connect to the Internet via a pfSense router which has a copy of V2Ray installed, and the data was collected on the upstream port of your router. Then you trained your model (I cannot find the detailed architecture of your network in your blog post), predicted with the trained model, and got an accuracy of 98%~99%. By the way, how did you partition your test set and training set? And how good is its generalization capability?
I think it would be better to write a well-formatted and rigorous paper describing the experiment, including the data collection procedure, the training procedure, the means of evaluation, your analysis, etc.
Besides, I would greatly appreciate it if you could provide a countermeasure, through a PR or an algorithm proposal, to lower the detection rate or to avoid detection altogether.

@rickyzhang82
Author

rickyzhang82 commented Feb 8, 2020

In the first experiment, I used pure VMESS over TCP settings.

The model is in the Python notebook. You can see how I partition the test set and training set in the data generator and how I collect data.

All network traffic from my home network passes through the pfSense router. Because I used one V2Ray server with a known IP and port, I used these as the tag to label V2Ray traffic, while anything else is non-V2Ray traffic. Of course, I mask the IP and port with zero values in post-processing so that the model is IP and port agnostic.
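
The labeling rule described here can be sketched in a few lines. The server endpoint below is a made-up placeholder (a documentation-range address and an arbitrary port), and `label_flow` is a hypothetical helper rather than code from the released tools.

```python
# Hypothetical server endpoint; 203.0.113.0/24 is reserved for documentation.
V2RAY_SERVER = ("203.0.113.10", 10086)

def label_flow(src, dst, server=V2RAY_SERVER):
    """Return 1 (V2Ray) if either endpoint of the flow is the known server, else 0."""
    return 1 if server in (src, dst) else 0

flows = [
    (("192.168.1.20", 51514), ("203.0.113.10", 10086)),  # client to V2Ray server
    (("192.168.1.20", 51515), ("198.51.100.7", 443)),    # ordinary HTTPS
]
labels = [label_flow(src, dst) for src, dst in flows]
```

Labeling has to happen before the IP and port fields are zeroed, since the mask removes exactly the information the label is derived from.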

Also, please read my last blog post. I used the pre-trained classifier to test on unseen V2Ray server traffic, where the V2Ray server moved from AWS to Google Compute Engine with a different OS flavor and a different UUID. It still achieves a near-perfect ROC curve. I'm not concerned that the model is over-fitting.

I'm not saying V2Ray should not be used now. But we should NOT use the VMess protocol over TCP. Big brother can now know you are using it without decryption.

To counter the classifier, I'm researching adversarial examples to evade it. So far I have tried two gradient-based approaches, namely FGSM and DeepFool, to add noise to the first 16 non-empty TCP payload packets. The adversarial example generated by FGSM is easy to compute and effective, while the one by DeepFool is not easy to compute due to the vanishing gradient in the activation function of the output layer.
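
For anyone unfamiliar with FGSM, the idea is a single step x_adv = x + eps * sign(dLoss/dx). The sketch below is not the TensorFlow generator against the 1D CNN; it swaps in a toy logistic-regression classifier with made-up weights so that the input gradient, (p - y) * w for cross-entropy loss, can be written by hand.

```python
import math

def predict(w, b, x):
    """Probability that x is V2Ray under a toy logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(w, b, x, y, eps):
    """One FGSM step: shift each feature by eps in the direction that raises the loss."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]            # d(cross-entropy)/dx for logistic regression
    sign = lambda g: (g > 0) - (g < 0)
    # Clip to [0, 1] because features are assumed to be bytes scaled to that range.
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]

w, b = [4.0, -3.0, 2.0], -1.0   # toy weights, chosen only for illustration
x = [0.9, 0.1, 0.8]             # toy sample the model labels as V2Ray
x_adv = fgsm(w, b, x, y=1, eps=0.5)
```

In this toy case the original sample scores above 0.5 and the perturbed one falls below it, which is the misclassification FGSM aims for. The perturbation budget eps here is far larger than anything one could apply to real payload bytes.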

The hard part is figuring out how to apply this adversarial noise in our communication setting, where we have domain-specific constraints such as how to encode and decode the noise between two parties and how to apply adversarial noise effectively to the TCP payload only (you can't perturb the TCP and IP headers).

In any case, I will publish more research on my blog once I have sorted out a feasible solution.

@rickyzhang82
Author

rickyzhang82 commented Feb 13, 2020

I published a new blog post on white-box adversarial examples. There is a Python notebook in which I wrote an adversarial noise generator in TensorFlow 2.0.

Under the three assumptions below, we can evade the DNN classifier by adding adversarial noise. In other words, V2Ray traffic with adversarial noise will be misclassified as non-V2Ray traffic.

  1. We own the classifier model.
  2. We are oracles. Thus, we know the first 16 non-empty TCP payload packets ahead of time.
  3. We can apply adversarial noise without any restrictions.

My next step is to remove these assumptions one by one.

@rickyzhang82
Author

You don't need to tell me why this is important. If this were not important, I would not spend my spare time on this.

Free speech is the only cure to the sick country.

There are many configuration settings in V2Ray. It takes time to collect traffic data for one particular setting for machine learning purposes. I spent a month collecting network traffic with the VMess over TCP setting for my first report.

I started collecting a new set of data with the VMess + TLS configuration on Jan 29th. Today, I found from my home server's apcupsd daemon log that there was a power outage on Feb 6th. My pfSense router runs on a retired PC that isn't connected to the UPS. The power outage interrupted the data collection, so I have now missed seven days of data collection.

This is a minor setback.

Feb 06 13:59:41 t20 apcupsd[1729]: Power failure.
Feb 06 13:59:43 t20 apcupsd[1729]: Power is back. UPS running on mains.
Feb 06 13:59:52 t20 apcupsd[1729]: Power failure.
Feb 06 13:59:54 t20 apcupsd[1729]: Power is back. UPS running on mains.
Feb 06 14:00:06 t20 apcupsd[1729]: Power failure.
Feb 06 14:00:07 t20 apcupsd[1729]: Power is back. UPS running on mains.
Feb 06 14:00:08 t20 apcupsd[1729]: Power failure.
Feb 06 14:00:10 t20 apcupsd[1729]: Power is back. UPS running on mains.
Feb 06 14:00:24 t20 apcupsd[1729]: Power failure.
Feb 06 14:00:26 t20 apcupsd[1729]: Power is back. UPS running on mains.
Feb 06 14:00:46 t20 apcupsd[1729]: Power failure.
Feb 06 14:00:48 t20 apcupsd[1729]: Power is back. UPS running on mains.

There are tons of smart people in mainland China. Can anyone replicate this experiment with your own V2Ray settings and traffic data? I know I'm repeating myself. But God helps those who help themselves.

TBH, I don't need V2Ray at all. I have no idea how you set up V2Ray, nor do I have a traffic pattern similar to yours. We need a collective effort to beat big brother. They have a trove of data to train their model, and I can't fight this alone.

If it was well known that VMess over TCP was insecure before I released my report, why is it still the recommended setting in the v2ray.com user guide? Why not deprecate the setting out of the box in V2Ray?

One more note: I'm not sure how you arrive at a percentage of security for a V2Ray configuration. We need an experiment to verify the claim. This is science, not religion. VMess over TCP is not secure at all. This has been proven. IMO, security in software is a binary thing: either secure or insecure. There is no gray area.

@rickyzhang82
Author

I'm not interested in publishing a paper. I'm more of a garage scientist. My motivation is to defeat censorship in all forms. Even in a free society like America, we still have censorship such as political correctness. But I use my vote in America rather than the scientific web browsing technique you use in China to defeat it.

I understand some of your points. But one thing I want to point out is that we should make our research public. In cryptography, the most secure way to design an encryption algorithm is to publish it. Only then can other eyeballs help find the flaws in its principles and the bugs in its implementation. Keeping your encryption in a safe doesn't make it secure. I think Nazi Germany and Japan learned that lesson in WWII. Modern asymmetric encryption such as RSA is based on a hard math problem: factoring large numbers. Its principles and implementations are in the public domain. We rely on a well-known hard problem to defend our encryption, and we no longer keep the algorithm under the mattress as Nazi Germany or Japan did.

We don't know whether big brother currently uses DNNs to classify network traffic. But these techniques are feasible in reality. If Chinese scholars applied for the patent, it is likely that they are going to commercialize this technique in the near future. That is how a software company or a patent troll protects its IP in the West.

To accelerate research on finding a way to defeat it, it is better that we open up our research and let other smart minds join the fight.

@ailinwang

At the beginning I thought Mr. Su was from the authorities and was here to steer circumvention technology into a swamp where censorship can be easily achieved. However, after reading his meaningless debate, I jumped to the conclusion that he/she is such a crybaby, begging for a stop to research into technologies that could very likely be used for traffic sniffing.
The best defense is to attack, as Bruce Lee put it. Without knowing the mechanism by which the sniffing works, one will NEVER know what vulnerabilities V2Ray has.
Please BE SMART! @suhuaixing

@rickyzhang82
Author

I cherish Western universal values. Freedom of speech means everyone, even those working for the evil regime, is entitled to speak freely as well. In the past, I was furious to see so many fifty cents active on Western social media like Facebook, Twitter, and YouTube to influence the views of people living beyond the wall. Now I'm no longer grumpy, since I realized why Americans allow anyone, regardless of their motivation, to use their social platforms: given free access to information, even fifty cents or hardcore commies will see their values and ideology change.

Debate between rational and civilized humans is an enlightening process that benefits every participant. One thing I learned from @suhuaixing's lengthy argument is how the cool kids in China today use V2Ray to bypass censorship. Now I am changing course to research TLS traffic.

I understand that English is a second language for all of us here. The adjective meaningless is a strong word. Even if someone, or I, asks a stupid question or argues like a fool, it is great when someone else can point it out with solid evidence and rational reasoning.

I'm not pointing fingers to say who is a fifty cent or who is working for the evil regime. Nobody is an expert on everything. The more we discuss, the better informed we are.

@madeye

madeye commented Feb 25, 2020

Protocols like Shadowsocks and VMess share the same “flaw”: their packets look very random.

To identify those “random” packets, the easiest way is to compute the entropy of the first packets of a TCP stream, as in https://github.com/isofew/sssniff.
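
A byte-level entropy check of this kind can be sketched as follows. This is a generic Shannon entropy calculation for illustration, not a description of how sssniff is actually implemented, and the sample payloads are made up.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

uniform = bytes(range(256))                         # every byte value exactly once
http_like = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
h_uniform = shannon_entropy(uniform)
h_http = shannon_entropy(http_like)
```

Encrypted or random-looking payloads sit close to the 8-bit ceiling, while plain-text protocol headers score noticeably lower. Short payloads cannot reach the ceiling at all, so any threshold would have to account for payload length.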

It has been a known issue for years. That’s why the DNN approach here is regarded as “meaningless”, or at least “overkill”.

BTW, the right direction for future circumvention tools is obfuscation.

@nicholascw

nicholascw commented Feb 25, 2020

Throughout the thread I see nothing about any specific issue with the project, only discussion of your own research. This is the place to report issues to V2Ray developers, not a general discussion board, which you can find at v2ray/discussion. Therefore I reckon this issue can be closed for now, since you do not seem to have come up with a fix. I'll add that V2Ray itself, though widely used for bypassing firewalls, is still a general network utility that focuses on efficiency, functionality and more, instead of just hiding your traffic. If that's what you thought V2Ray was, you are in the wrong place.

The last thing I would like to say is that as long as it's a designed protocol, there will be a way to tell it apart from others; only the means and effort are uncertain. If you truly want to blend all your packets together, simply wrap them in another protocol as one more outer layer.

@nicholascw nicholascw transferred this issue from v2ray/v2ray-core Feb 26, 2020
@rickyzhang82
Author

rickyzhang82 commented Feb 26, 2020

The issue is crystal clear. The VMess protocol over TCP can be identified without decryption, active probing, port or IP. Thus, it can be blocked in practice. I read the research report on Shadowsocks. Once a tool can be detected, the next step is for it to be blocked by the GFW. Haven't we learned the lesson by now?

My initial proposal is to disable, out of the box, the settings that use the VMess protocol over TCP. It is unwise to expose citizens to risk when we know for sure that it can be identified. Can the V2Ray dev team make this change? Or at least update the official user guide to inform users about this.

I'm still collecting data with the setting using VMess over TLS, where traffic is masqueraded as TLS. I'm not sure this is a solution, because we can already classify SMTP over VPN, web traffic over VPN, etc. Why would TLS make a difference? This is my speculation. It takes time to prove it.

I don't use V2Ray. I don't need any circumvention tools to access the Internet. The only reason I spend my time here is to defend freedom of expression.

You can keep moving the post left and right. But it doesn't address the problem at all.

@nicholascw

What is crystal clear is that your logic is problematic. Disabling VMess over TCP follows exactly the same logic as the case below, which is ridiculous:

   A new study shows that there are ways to poison water without the target victim noticing.
=> There must be people poisoning any target water source, whatever the water is used for.
=> We should cease the water supply to everybody as soon as possible.

Although the V2Ray project identifies itself as a "unified platform for anti-censorship", which means it gives anti-censorship high priority, the prerequisite is that it is still a unified platform, in other words, a versatile network utility. As it is a core function and foundation of the whole project, I don't think there is any chance we will stop supporting VMess over TCP in the near future.

With basic statistics knowledge you should quickly realize that you won't gain a higher probability of successfully passing a firewall by having an option disabled.

If you don't like it, fork it. [1]

[1] Zhang, R. @rickyzhang82 GitHub Profile. Retrieved February 27, 2020, from https://github.com/rickyzhang82

@rickyzhang82
Author

Are we playing a GRE analytical writing game here? If so, I will point out the fallacy in your analogy.

There are many ways to deliver safe water. If we know for sure that water delivered through the city water pipes may contain poisonous lead, should we stop distributing water this way? Hell, yes. Look up the news on the "Flint Water Crisis".

But it doesn't mean we stop the water supply completely. People in Flint, Michigan still need water. Therefore, they switched to drinking bottled water temporarily. At the same time, the city is rebuilding a lead-free water pipe system.

The same logic applies to V2Ray. There are many settings provided by V2Ray. I never advised people to discontinue V2Ray completely so that no one can circumvent the GFW. I just propose to stop using a specific configuration that has been proven insecure, or at least to give a warning in the user guide.

My statistical background does NOT lead me to your conclusion.

> With basic statistics knowledge you should quickly realize that you won't gain a bigger possibility of successfully passing a firewall with an option disabled.

If you are implying that non-techies should use the insecure setting to increase the chance for techies to evade the GFW, that makes me sick.

You don't need to quote my words from my profile. This will be my next step.

@nicholascw

nicholascw commented Feb 29, 2020

  1. You are replacing water with a specific water source.
    VMess itself, which is a protocol, is generally the same as water rather than a specific water source until you deploy it in a specific environment. Removing VMess over TCP is like forbidding the use of non-drinkable water to run a steam engine.

  2. There are various ways provided to wrap, encapsulate, and deliver VMess, including features like UNIX socket, mKCP, mux, websocket, and tls. The features, their working mechanisms, properties and performance are described on the respective pages of the official manual. We never intend to guide users to use ANY form of configuration in ANY environment. It is the users' job to decide which way fits their needs.

  3. At the same time, you shouldn't automatically assume to any degree that the provided configuration file sample, which is the most basic, most plain VMess configuration demo, is the recommended way to use it in any circumstances. We made no recommendations in the document, neither for VMess nor for any other combined configuration, and the demo files are there simply to act as a skeleton for users to modify. Also, there are more configuration templates to be found at https://github.com/KiriKira/vTemplate, a repo listed as a third-party tool in the manual, not to mention that we also forked, maintain, and are internationalizing ToutyRater's V2Ray Guide, which we operate at https://guide.v2fly.org and which lists a number of configuration scenarios intended to help "non-techies" understand.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

@vrnobody

vrnobody commented Mar 4, 2020

I am really interested in the vmess-ws-tls combination.
Sadly I know nothing about machine learning, so I could not conduct the experiment myself.

Have you finished the second-stage data collection? And what is the result?

@rickyzhang82
Author

@nicholascw

> VMess itself which is a protocol, is generally the same as water rather than specific water source until you deploy it into a specific environment. Removing VMess over TCP is like forbidding use non-drinkable water to run a steam engine.

You reason like a Chinese foreign ambassador in Aussie:

  • Evade the crux of the matter.
  • Shift the focus like a monkey jumping here and there.

The VMess protocol is neither water nor a water source. We are talking about a flawed application protocol that may expose users' circumvention of the GFW.

But if you want to play with the analogy, then fine. Let's do it right. Free information behind GFW is like the clean water we desperately want. The vmess protocol over TCP configuration in V2Ray is like the city water pipe that may leak the poisonous lead. If we knew this, we should inform the techies and also the non-techies about the risk of consumption.

Can we find such FYI precautions in the user guide? No, it doesn't exist.

I'm not sure why you quote the limitation of liability clause at the end of your argument. Do you imply that it is a justification to allow the existence of a flawed application protocol without a warning? That's a silly argument.

@vrnobody

I finished one month of data collection with the TLS+WS setting yesterday. The DNN model still achieves >99% classification accuracy on the validation set (I split the data 80% for the training set and 20% for the validation set).
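
An 80/20 split like the one mentioned here can be sketched as below. This is a generic shuffled split under the assumption that samples are independent items in a list; `split_80_20` is a hypothetical helper, and the notebook's data generator may partition the data differently.

```python
import random

def split_80_20(samples, seed=0):
    """Shuffle deterministically, then return (training 80%, validation 20%)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

train, val = split_80_20(range(100))
```

If each sample is a packet rather than a whole TCP stream, splitting by stream instead would keep packets of the same connection from landing on both sides of the split.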

In the following weeks, I will collect traffic data from a different host with a different TLS certificate and a different UUID and see whether the previously trained model generalizes.

Once all is done, I will publish a blog post.

@vrnobody

vrnobody commented Mar 6, 2020

Oh no! Basically that means we are doomed!

Thanks for your hard work.

@rickyzhang82
Author

@madeye @vrnobody

I published my findings on TLS + WS in my blog post and uploaded the pre-trained model as well.

In a nutshell, the pre-trained model can generalize its classification capabilities to different UUID, different IP, different ports, different domain names, different OS flavors, and different TLS certificates.

Can @madeye elaborate on the "obfuscating" solution?


9 participants