# KWIC Tutorial

The Lexos `kwic` module generates a keyword-in-context (KWIC) table: each match of a search pattern is shown alongside the text immediately before and after it.

In [1]:
# Python imports
from lexos.kwic import Kwic
from lexos.tokenizer import Tokenizer

# We'll load a small English model for demonstration purposes
tokenizer = Tokenizer(model="en_core_web_sm")

In the next cell, we will define a text and search for a character pattern with a context window of 10 characters.

In [None]:
text = "A bloody great Key Word in Context is a common term in NLP. KWIC is an acronym for Keyword in Context."
pattern = "KWIC"
kwic = Kwic()
kwic(docs=text, patterns=pattern, window=10)
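
To make the idea of a character window concrete, here is a minimal sketch of the same kind of search written with Python's standard `re` module. It is not the Lexos implementation: the function name `kwic_chars` and the tuple layout are assumptions made for illustration only.

```python
import re

def kwic_chars(text, pattern, window=10):
    """Hypothetical helper: return (left, keyword, right) for each regex match."""
    rows = []
    for m in re.finditer(pattern, text):
        # Take up to `window` characters on each side of the match.
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        rows.append((left, m.group(), right))
    return rows

text = "A bloody great Key Word in Context is a common term in NLP. KWIC is an acronym for Keyword in Context."
rows = kwic_chars(text, "KWIC", window=10)
print(rows)
```

The slice for the left context is clamped at zero so that a match near the start of the text does not wrap around to the end of the string.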

We can search for multiple patterns by passing them as a list.

In [None]:
patterns = ["KWIC", "Key Word in Context"]
kwic = Kwic()
kwic(docs=text, labels=None, patterns=patterns, window=10)

We can also use regex patterns.

In [None]:
patterns = [r".WIC"]
kwic = Kwic()
kwic(docs=text, labels=None, patterns=patterns, window=10)
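
As a quick illustration of what the pattern `.WIC` means in Python's regex syntax, the sketch below uses the standard `re` module on a made-up sample string. It assumes, without confirmation from the Lexos source, that patterns follow the same syntax.

```python
import re

# "." matches any single character, so ".WIC" matches any character
# followed by the literal uppercase letters "WIC".
sample = "KWIC and kwic and QWIC"
matches = re.findall(r".WIC", sample)
print(matches)
```

The lowercase `kwic` is not matched because the search is case-sensitive by default.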

We can also search multiple documents by passing a list of texts.

In [None]:
text1 = "A bloody great Key Word in Context is a common term in NLP. KWIC is an acronym for Keyword in Context."
text2 = "KWIC is very useful in literary studies."
texts = [text1, text2]
kwic = Kwic()
kwic(docs=texts, labels=None, patterns=patterns, window=10)

We can supply our own labels.

In [None]:
kwic(docs=texts, labels=["First Doc", "Second Doc"], patterns=patterns, window=10)

We can make our search case-insensitive, so that the lowercase `kwic` in the second document below also matches.

In [None]:
text1 = "A bloody great Key Word in Context is a common term in NLP. KWIC is an acronym for Keyword in Context."
text2 = "The kwic tool is very useful in literary studies."
texts = [text1, text2]
kwic(docs=texts, labels=["First Doc", "Second Doc"], patterns=patterns, window=10, case_sensitive=False)
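
The combined effect of labels and case-insensitive matching can be sketched in plain Python. This is a stand-in built on `re.IGNORECASE`, not Lexos code; the helper `kwic_docs` and the shortened sample texts are hypothetical.

```python
import re

def kwic_docs(docs, labels, pattern, window=10, case_sensitive=True):
    """Hypothetical helper: search several labelled texts for one regex."""
    flags = 0 if case_sensitive else re.IGNORECASE
    rows = []
    for label, text in zip(labels, docs):
        for m in re.finditer(pattern, text, flags):
            left = text[max(0, m.start() - window):m.start()]
            right = text[m.end():m.end() + window]
            rows.append((label, left, m.group(), right))
    return rows

docs = ["KWIC is an acronym.", "The kwic tool is useful."]
labels = ["First Doc", "Second Doc"]

strict = kwic_docs(docs, labels, r".WIC")                       # case-sensitive
loose = kwic_docs(docs, labels, r".WIC", case_sensitive=False)  # ignores case
print(len(strict), len(loose))
```

With case sensitivity on, only the first text matches; with it off, the lowercase `kwic` in the second text is found as well.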

We can also sort the results with the `sort_by` and `ascending` parameters.

In [None]:
kwic(docs=texts, patterns=["KWIC", "NLP"], window=10, sort_by="keyword", ascending=False)

We could also sort the resulting DataFrame with pandas afterwards. Sorting in Lexos is especially convenient when we want the results as a list of tuples, which we get by setting `as_df=False`.

In [None]:
kwic(docs=texts, patterns=["KWIC", "NLP"], window=10, sort_by="keyword", ascending=False, as_df=False)
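
If you already have results as a list of tuples, they can also be reordered with Python's built-in `sorted`. The rows below are made-up illustrative data, and the tuple layout (label, left context, keyword, right context) is an assumption rather than a documented Lexos format.

```python
# Illustrative rows only; the real tuple layout may differ.
rows = [
    ("First Doc", "term in ", "NLP", ". KWIC is "),
    ("First Doc", "m in NLP. ", "KWIC", " is an acr"),
    ("Second Doc", "", "KWIC", " is very u"),
]

# Sort by the keyword column (index 2) in descending order.
# sorted() is stable, so rows with equal keywords keep their original order.
by_keyword_desc = sorted(rows, key=lambda row: row[2], reverse=True)
print([row[2] for row in by_keyword_desc])
```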

## Searching Token Patterns

All of the examples above perform regex searches on raw text. If you pass spaCy Docs, they are converted to text strings first.

If you have spaCy Docs, you may wish to search for token patterns instead. A simple way to do this is to set the `matcher` parameter to `"tokens"`.

In [None]:
text = "A bloody great Key Word in Context is a common term in NLP. KWIC is an acronym for Keyword in Context."
doc = tokenizer.make_doc(text)
patterns = "KWIC"
kwic(docs=doc, labels=None, patterns=patterns, window=5, matcher="tokens")
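
The difference between a character window and a token window can be sketched without spaCy. The example below uses whitespace splitting as a rough stand-in for a real tokenizer (spaCy would also split off punctuation), and the helper `kwic_tokens` and sample sentence are hypothetical.

```python
def kwic_tokens(tokens, keyword, window=5):
    """Hypothetical helper: context is counted in tokens, not characters."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            rows.append((" ".join(left), tok, " ".join(right)))
    return rows

# Whitespace tokenization is an assumption made to keep the sketch stdlib-only.
tokens = "A common term in NLP is KWIC which stands for Keyword in Context".split()
rows = kwic_tokens(tokens, "KWIC", window=3)
print(rows)
```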

To match tokens with a regex, set `use_regex=True`. The `case_sensitive` parameter works here too.

In [None]:
patterns = r".wic"
kwic(docs=doc, labels=None, patterns=patterns, window=5, matcher="tokens", use_regex=True, case_sensitive=False)

We can also perform more sophisticated token-based searches using spaCy's rule-matching syntax. To use it, we set the `matcher` parameter to `"rule"`. See the [spaCy documentation](https://spacy.io/usage/rule-based-matching#matcher) for details on how to construct rules for token-based matching.

In [None]:
pattern1 = [{"LOWER": "key"}, {"LOWER": "word"}, {"LOWER": "in"}, {"LOWER": "context"}]
pattern2 = [{"TEXT": "KWIC"}]
patterns = [pattern1, pattern2]
kwic(docs=doc, patterns=patterns, window=5, matcher="rule")
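
To show how a rule such as `pattern1` is read, here is a plain-Python analogue that checks a sequence of per-token constraints. It handles only the `TEXT` and `LOWER` attributes, silently ignores any other key, and is not how spaCy's `Matcher` is implemented; the function `match_rule` and the sample sentence are hypothetical.

```python
def match_rule(tokens, rule):
    """Hypothetical helper: return (start, end) token spans that satisfy the rule."""
    def satisfies(token, constraint):
        for attr, value in constraint.items():
            if attr == "TEXT" and token != value:
                return False  # exact, case-sensitive match required
            if attr == "LOWER" and token.lower() != value:
                return False  # match on the lowercased token
        return True

    n = len(rule)
    spans = []
    # Slide a window of len(rule) tokens and test each token against its constraint.
    for i in range(len(tokens) - n + 1):
        if all(satisfies(t, c) for t, c in zip(tokens[i:i + n], rule)):
            spans.append((i, i + n))
    return spans

tokens = "A Key Word in Context table is also called KWIC".split()
rule1 = [{"LOWER": "key"}, {"LOWER": "word"}, {"LOWER": "in"}, {"LOWER": "context"}]
rule2 = [{"TEXT": "KWIC"}]
spans1 = match_rule(tokens, rule1)
spans2 = match_rule(tokens, rule2)
print(spans1, spans2)
```

Each dictionary constrains exactly one token, which is why a four-word phrase needs a list of four dictionaries.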

Finally, we can search for multi-token phrases using spaCy's `PhraseMatcher` by setting `matcher` to `"phrase"`.

In [None]:
patterns = ["Key Word in Context", "KWIC"]
kwic(docs=doc, patterns=patterns, window=5, matcher="phrase")
