Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extension endpoint, dealing with numbers #1078

Closed
bobvanluijt opened this issue Feb 8, 2020 · 7 comments · Fixed by #1097
Closed

Extension endpoint, dealing with numbers #1078

bobvanluijt opened this issue Feb 8, 2020 · 7 comments · Fixed by #1097

Comments

@bobvanluijt
Copy link
Member

Currently, we can't add extensions with numbers in them.

Current behavior: Add concept with capitals

POST /v1/c11y/extensions

{
  "concept": "APT12",
  "definition": "APT12 is a threat group that has been attributed to China. The group has targeted a variety of victims including but not limited to media outlets, high-tech companies, and multiple governments.",
  "weight": 1
}

results in:

{
    "error": [
        {
            "message": "rpc error: code = InvalidArgument desc = invalid extension: concept must be made up of all lowercase letters and/or numbers, for custom compund words use spaces, e.g. 'flux capacitor'"
        }
    ]
}

Expected behavior: Weaviate handles capitalization automatically.


Current behavior: Add concept with lowercase letters

POST /v1/c11y/extensions

{
  "concept": "apt12",
  "definition": "APT12 is a threat group that has been attributed to China. The group has targeted a variety of victims including but not limited to media outlets, high-tech companies, and multiple governments.",
  "weight": 1
}

results in a 200 OK


Current behavior: Locate the new concept

GET /v1/c11y/concepts/Apt12

{
	"error": [{
		"message": "invalid word input: words must match ^[a-zA-Z]+$"
	}]
}

results in:

{
	"error": [{
		"message": "invalid word input: words must match ^[a-zA-Z]+$"
	}]
}

Expected behavior: Correct representation for this word.

PS:
Source of the example: https://attack.mitre.org/groups/G0005/

@bobvanluijt
Copy link
Member Author

Another example that ideally works like this:

POST /v1/c11y/extensions

{
  "concept": "Winlogon Helper DLL",
  "definition": "Winlogon. exe is a Windows component responsible for actions at logon/logoff as well as the secure attention sequence (SAS) triggered by Ctrl-Alt-Delete. Registry entries in HKLM\\SoftwareMicrosoft\\Windows NT\\CurrentVersion\\Winlogon\\ and HKCU\\Software\\Microsoft\\Windows NT\\CurrentVersion\\Winlogon\\ are used to manage additional helper programs and functionalities that support Winlogon.  Malicious modifications to these Registry keys may cause Winlogon to load and execute malicious DLLs and/or executables. Specifically, the following subkeys have been known to be possibly vulnerable to abuse: Winlogon\\Notify - points to notification package DLLs that handle Winlogon eventsWinlogon\\Userinit - points to userinit. exe, the user initialization program executed when a user logs onWinlogon\\Shell - points to explorer. exe, the system shell executed when a user logs adversaries may take advantage of these features to repeatedly execute malicious code and establish Persistence.",
  "weight": 1
}

@etiennedi
Copy link
Member

I didn't yet have the time to look into this in more detail, but two thoughts upfront.

On Problem 1:

Expected behavior: Weaviate handles capitalization automatically.

The workaround at the moment is to lowercase it yourself. I'm not sure if auto-lowercasing is a good idea. Let's discuss this when we have some time.

On Problem 2:

Expected behavior: Correct representation for this word.

This is only a limitation of the /c11y/concpets endpoint. I.e. we can't view it at the moment if it contains numbers, but this limitation is not part of the vectorization, the word apt12 (in all capitalizations) would be correctly recognized in a vectorization or search.

@bobvanluijt
Copy link
Member Author

  • Problem 1 curious to understand this better.
  • Problem 2 I notice that others including myself rely on this end-point to validate things. So it would be great if the c11y/concepts endpoint reflects what is in Weaviate :-)

@etiennedi
Copy link
Member

On Problem 1:

The contextionary itself (based on the training data) is all lowercase at the moment. That's why we currently have the requirement that additions also have to be lowercased. If we magically lowercase all the user input, the user will never know that the meaning they added is lost. So for example, they might add "abc" -> "alphabet" and then add "ABC" -> "American Broadcast Company" thinking they added two extensions. However, since we lowercased them both, the user actually just overwrote the first on and only ends up with one.

On Problem 2:

Makes sense, this is an easy fix, I'll prioritize this.

@bobvanluijt
Copy link
Member Author

On Problem 1

I'm a bit confused, because with the current behavior

  1. "abc" -> "alphabet" would result in a 200.
  2. "ABC" -> "American Broadcast Company" would result in the error mentioned above.

Or am I missing something?

@etiennedi
Copy link
Member

Correct and I think that's good. Because if we don't error on (1) and lowercase the input, it would overwerite (0), but the user would be none the wiser.

@etiennedi
Copy link
Member

As agreed on Hangouts, we're only focusing on Problem 2 for now. We're leaving Problem 1 as is - this means the user will have to add some toLower() to their import scripts. This additional step is worth being explicit about the contextionary being all lowercase at the moment.

In the future we should look into what it would take to build a cased contextionary and accept both entries as individual extenstions.

etiennedi added a commit that referenced this issue Mar 2, 2020
closes #1078

Prior to this we used a very simple regex to validate. Now instead we're
accepting any letter/digit utf-8 value.

There were also no tests on this feature, so I've added a few simple
test cases to cover the most common cases and make this a bit more
robust.
etiennedi added a commit that referenced this issue Mar 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants