# Lesson 1: Exploring Syntactic Dependencies and Token Shapes in NLP

### Introduction to Syntactic Dependencies

Hello, welcome to the next step in your linguistic journey! In today's lesson, we'll expand upon your foundational linguistic knowledge and step into the world of syntactic dependencies and token shapes. This knowledge will equip you to delve even deeper into the fascinating realm of Natural Language Processing (NLP).

The first stop in our journey is syntactic dependencies. So, what are they? Simply put, syntactic dependencies are the grammatical relationships between words in a sentence. This could be a subject-verb relationship, an adjective-noun relationship, or other types of grammatical relations. Why are they important in NLP? They help us understand how words relate to each other and how they come together to convey meaning in a sentence.

In Python, with the help of spaCy, we can extract these dependencies easily. Let's take a look at how to do this with a sample text from the Reuters corpus.

```python
import spacy
from nltk.corpus import reuters

# Let's load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Take a sample text from reuters corpus
sample_text = reuters.raw(reuters.fileids()[0])

# Pass the text to the nlp object
doc = nlp(sample_text)

# For syntactic tokens
print('\nSyntactic Dependencies:\n')
for token in doc:
    print(f"{token.text:<10s} {token.dep_:<10s} {token.head.text:<10s}")
```

In each line of output, the first column is the token, the second is its syntactic dependency label, and the third is the token's head. The head is the word that the token attaches to and depends on grammatically; the root of a sentence is reported as its own head. This short piece of code already reveals a great deal about the grammatical structure of the text!
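To make the token, label, and head columns concrete without running a parser, here is a minimal sketch that stores a toy sentence as rows of (text, label, head index) and follows the head links up to the root. The sentence, its labels, and the helper name `path_to_root` are assumptions made for illustration; they are hand-written, not spaCy output.

```python
# Hand-annotated toy parse (an assumption for illustration, not spaCy output).
# Each row: (token text, dependency label, index of the head token).
# The root points at itself, mirroring how spaCy reports a root's head.
toy_parse = [
    ("The", "det", 1),
    ("cat", "nsubj", 2),
    ("chased", "ROOT", 2),
    ("the", "det", 4),
    ("mouse", "dobj", 2),
]


def path_to_root(parse, index):
    """Follow head links from one token up to the root and return the texts."""
    path = [parse[index][0]]
    while parse[index][2] != index:  # stop once a token is its own head
        index = parse[index][2]
        path.append(parse[index][0])
    return path


for i, (text, dep, head) in enumerate(toy_parse):
    chain = " -> ".join(path_to_root(toy_parse, i))
    print(f"{text:<10s} {dep:<10s} {toy_parse[head][0]:<10s} {chain}")
```

Every chain ends at the same word, which is what it means for all tokens to be connected to the root directly or indirectly.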

### Unpacking Syntactic Dependencies Output

Alright, let's take a concrete look at the potential output our syntactic dependencies code could produce.

```
ASIAN      compound   EXPORTERS 
EXPORTERS  nsubj      FEAR      
FEAR       ccomp      said      
DAMAGE     nsubj      raised    
FROM       prep       DAMAGE    
U.S.-JAPAN compound   friction  
```

Even at first glance, we can already start to see patterns and relationships emerge from this data. However, to truly gain insights, we must understand what these labels mean:

- **ASIAN**: Here, "ASIAN" has a compound dependency type. A compound relationship is formed when two nouns come together to form a new noun, such as "ASIAN EXPORTERS".
- **EXPORTERS**: The nominal subject (nsubj) of the verb "FEAR" is "EXPORTERS". The nominal subject is typically the "doer" of the action and corresponds to "who" or "what" in the sentence.
- **FEAR**: Labeled ccomp (clausal complement), with "said" as its head. A clausal complement is a subordinate clause that completes the meaning of its head verb; it usually can't stand alone as a separate sentence.
- **DAMAGE**: It is considered to be the nominal subject (nsubj) for the verb "raised".
- **FROM**: Labeled prep (prepositional modifier), "FROM" attaches to its head "DAMAGE" and introduces the phrase that follows it.
- **U.S.-JAPAN**: Another compound, this time modifying the noun "friction".

Besides these, there are different types of dependencies that you might encounter as well:

- **relcl**: Stands for relative clause modifier. Relative clauses often begin with words like "who" or "which" and add detail about a noun.
- **dobj**: Denotes direct object. This may be the noun or noun phrase that is receiving the action in the sentence.
- **ROOT**: The root of the sentence, usually its main verb. Every other word is connected to it, either directly or indirectly.
- **nsubjpass**: This refers to the nominal subject in a passive sentence. In such sentences, the subject is usually receiving the action of the verb.
- **pobj**: Stands for object of a preposition. This is usually the noun coming after the preposition in the sentence.

Finally, remember that understanding these dependencies is vital if you want to dive deeply into the grammatical structure and meaning of a sentence. Now that we've dissected syntactic dependencies output, let's move on to our next interesting segment - the exploration of token shapes.

### Delving into Token Shapes

The next concept we'll explore is token shapes. A token shape is a transformation of a token's text that describes its orthographic structure: it records the kinds of characters in the token rather than the characters themselves.

Here's how the transformation works:

- Alphabetic characters are replaced by x or X. Lowercase characters become x and uppercase characters become X.
- Numeric characters are replaced by d.
- Sequences of the same character are truncated after length 4.

For example, "Python" has an initial uppercase letter followed by five lowercase letters. The run of lowercase letters is truncated to four, so the word is transformed to "Xxxxx".
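The three rules can be sketched in plain Python. The helper name `word_shape` is hypothetical, and the function is only an approximation written from the rules listed here, not spaCy's own code; `token.shape_` remains the source of truth.

```python
def word_shape(text: str) -> str:
    """Approximate a token shape using the three rules described above."""
    shape = []
    run_char = ""  # shape character of the current run
    run_len = 0    # how many times it has repeated in a row
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and other characters are kept as-is
        if mapped == run_char:
            run_len += 1
        else:
            run_char, run_len = mapped, 1
        if run_len <= 4:  # drop anything past the fourth repeat
            shape.append(mapped)
    return "".join(shape)


for word in ["Python", "seven", "12", "2023", "U.S.-JAPAN"]:
    print(f"{word:<12s} {word_shape(word)}")
```

Note how "seven" and any longer all-lowercase word collapse to the same shape, which is exactly the abstraction the truncation rule is meant to provide.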

Let's see how to get these token shapes using our example text:

```python
print('\nToken Shapes:\n')
for token in doc[25:]:
    print(f"{token.text:<10s} {token.shape_:<10s}")
```

When put to work, token shapes can provide valuable insights. You may notice, for instance, that capitalized tokens are often proper nouns and that shapes made only of d mark numeric values, among other patterns.

### Understanding the Token Shapes Output

Looking at the output produced by our code:

```
seven      xxxx      
and        xxx       
12         dd        
pct        xxx       
of         xx        
China      Xxxxx     
's         'x     
```

Here's how to interpret these shapes:

- **seven**: The shape xxxx shows that "seven" consists only of lowercase letters. The word has five letters, but the run of x characters is truncated to four, so the shape abstracts away both the specific letters and the exact length.
- **and**: With a shape of xxx, this indicates that "and" consists of three lowercase letters. This distinct shape aids in recognizing small, commonly used words in analyses.
- **12**: Represented as dd, it clearly illustrates that "12" is a numeric token, consisting of two digits. This differentiation is vital for tasks that require numeric value processing or identification.
- **pct**: The token "pct" is shown with a shape xxx, indicating three lowercase letters.
- **of**: Its shape xx succinctly reflects that "of" is a short, two-letter word, all in lowercase. Recognizing such functional tokens is crucial for understanding the grammatical structure of sentences.
- **China**: The shape Xxxxx signals that "China" starts with an uppercase letter followed by lowercase letters, a characteristic feature of proper nouns. This insight is fundamental for tasks like Named Entity Recognition, as it distinguishes proper nouns from other text elements.
- **'s**: With a shape of 'x, this combination suggests the presence of a punctuation mark followed by a lowercase letter, a common feature in possessive constructions or contractions. Identifying these constructions is essential for parsing and understanding sentence structures.

With this understanding of token shapes, you can now integrate this intelligence into your NLP tasks, yielding even more insightful results!

### Experience the Power of Linguistics

Now that we've extracted syntactic dependencies and token shapes from our text, let's take a moment to reflect on the insights that these features offer. First, the syntactic dependencies give us a good understanding of the grammatical structure of the text. This can be extremely helpful when we're trying to parse sentences and understand the relationships between words.

On the other hand, token shapes allow us to observe patterns in the structure of words. This can be especially useful in tasks such as spam detection, where certain patterns of words or characters might be more common.
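One simple way to observe such patterns is to group tokens by their shape. The sketch below reuses the (token, shape) pairs from the sample output earlier in this lesson; the variable names are illustrative choices.

```python
from collections import defaultdict

# (token, shape) pairs copied from the sample output shown earlier in this lesson.
pairs = [
    ("seven", "xxxx"), ("and", "xxx"), ("12", "dd"), ("pct", "xxx"),
    ("of", "xx"), ("China", "Xxxxx"), ("'s", "'x"),
]

# Collect the tokens that share each shape.
tokens_by_shape = defaultdict(list)
for text, shape in pairs:
    tokens_by_shape[shape].append(text)

# List the most common shapes first.
for shape, texts in sorted(tokens_by_shape.items(), key=lambda item: -len(item[1])):
    print(f"{shape:<10s} {texts}")
```

On a full document, the same grouping would be fed from `token.text` and `token.shape_` instead of a hard-coded list.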

On the whole, understanding these linguistic features provides us with a deeper understanding of text, equipping us to perform more nuanced analyses.

### Lesson Summary and Practice

Congratulations on completing this detailed journey into syntactic dependencies and token shapes! You've not only learned what these concepts are, but have also extracted them from a text using Python and spaCy. Remember, linguistics is at the heart of Natural Language Processing, and understanding these features will stand you in good stead for more advanced tasks in this field.

In the upcoming practice exercises, you'll have the opportunity to apply these concepts to various texts. This practice will solidify your understanding and prepare you for the next lesson, where we'll explore the intricacies of semantics in NLP. Happy learning!


## Filtering Syntactic Dependencies and Token Shapes

Excellent work, Stellar Navigator! Let's further manipulate our textual analysis:

- Change the syntactic dependency filter to 'relcl' instead of 'dobj'.
- Alter the token shape filter to select those that are entirely digits.

Go ahead and adjust the code, then observe your new output.

```python
import spacy
from nltk.corpus import reuters

# Let's load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Take a sample text from reuters corpus
sample_text = reuters.raw(reuters.fileids()[0])

# Pass the text to the nlp object
doc = nlp(sample_text)

# For syntactic tokens
print('\nSyntactic Dependencies:\n')
for token in doc:
    if token.dep_ == 'dobj':
        print(f"{token.text:<10s} {token.dep_:<10s} {token.head.text:<10s}")

# For token shapes
print('\nToken Shapes:\n')
for token in doc:
    if all(char == 'x' for char in token.shape_):
        print(f"{token.text:<10s} {token.shape_:<10s}")

```

Here’s the adjusted code to filter syntactic dependencies for 'relcl' instead of 'dobj' and to select token shapes that are entirely digits:

```python
import spacy
from nltk.corpus import reuters

# Let's load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Take a sample text from reuters corpus
sample_text = reuters.raw(reuters.fileids()[0])

# Pass the text to the nlp object
doc = nlp(sample_text)

# For syntactic tokens
print('\nSyntactic Dependencies:\n')
for token in doc:
    if token.dep_ == 'relcl':  # Changed from 'dobj' to 'relcl'
        print(f"{token.text:<10s} {token.dep_:<10s} {token.head.text:<10s}")

# For token shapes
print('\nToken Shapes:\n')
for token in doc:
    if all(char == 'd' for char in token.shape_):  # Select tokens whose shape is entirely digits
        print(f"{token.text:<10s} {token.shape_:<10s}")
```

### Key Changes:

1. **Syntactic Dependency Filter:**  
   The filter for syntactic dependencies has been changed from `dobj` to `relcl`, which will now display tokens that are relative clause modifiers.

2. **Token Shape Filter:**  
   The condition for token shapes now checks that every character of the shape is `d`, so it matches digit-only tokens of any length (shapes `d`, `dd`, `ddd`, or `dddd`). Comparing against exactly 'dd' would match only two-digit tokens.

### Expected Output:

When you run this code, you will see two sections of output:

1. **Syntactic Dependencies:**  
   This will list all tokens that have a syntactic dependency of 'relcl', along with their respective heads.

2. **Token Shapes:**  
   This will display tokens that are entirely numeric, along with their shape representation.

This adjustment will help you analyze the specific syntactic relationships and numeric tokens in the text from the Reuters corpus. If you have any further questions or need additional modifications, feel free to ask!
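Because runs of the same shape character are truncated after four, a digit-only token can have the shape d, dd, ddd, or dddd. The sketch below works on plain shape strings and uses a hypothetical helper, `is_all_digits`, to show a check that accepts all of them while rejecting mixed shapes.

```python
def is_all_digits(shape: str) -> bool:
    """True when a shape string is non-empty and contains only 'd'."""
    return bool(shape) and set(shape) == {"d"}


# Sample shape strings: digit-only, digits with punctuation, and letters.
for shape in ["d", "dd", "dddd", "d.dd", "xxxx", "dxx"]:
    print(f"{shape:<10s} {is_all_digits(shape)}")
```

Inside the token loop, the same test would read `if is_all_digits(token.shape_):`.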


## Filtering Specific Syntactic Dependencies and Token Shapes

Excellent work, Stellar Navigator! Now, let's make some changes to our linguistic feature extraction code. Your task is to modify this code to print only the syntactic dependencies of type 'nsubj' and the token shapes that are fully uppercase (use the `isupper()` string method on `shape_`). Both of these changes need to be reflected in your final output. Good luck!

```python
import spacy
from nltk.corpus import reuters

# Let's load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Take a sample text from reuters corpus
sample_text = reuters.raw(reuters.fileids()[0])

# Pass the text to the nlp object
doc = nlp(sample_text)

# For syntactic tokens
print('\nSyntactic Dependencies:\n')
for token in doc:
    print(f"{token.text:<10s} {token.dep_:<10s} {token.head.text:<10s}")

# For token shapes
print('\nToken Shapes:\n')
for token in doc:
    print(f"{token.text:<10s} {token.shape_:<10s}")


```

To modify the provided code to filter for syntactic dependencies of type 'nsubj' and token shapes that are fully uppercase, you can adjust the loops accordingly. Here’s the updated code:
```python
import spacy
from nltk.corpus import reuters

# Let's load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Take a sample text from reuters corpus
sample_text = reuters.raw(reuters.fileids()[0])

# Pass the text to the nlp object
doc = nlp(sample_text)

# For syntactic tokens of type 'nsubj'
print('\nSyntactic Dependencies (nsubj):\n')
for token in doc:
    if token.dep_ == 'nsubj':
        print(f"{token.text:<10s} {token.dep_:<10s} {token.head.text:<10s}")

# For token shapes that are fully uppercase
print('\nToken Shapes (Fully Uppercase):\n')
for token in doc:
    if token.shape_.isupper():
        print(f"{token.text:<10s} {token.shape_:<10s}")
```

### Changes Made:
1. In the first loop, a condition checks if `token.dep_` is equal to `'nsubj'` before printing the token details.
2. In the second loop, a condition checks if `token.shape_` is fully uppercase using the `isupper()` method before printing the token shapes.

This will ensure that only the relevant syntactic dependencies and token shapes are printed in the output.
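It may help to see how `str.isupper()` behaves on shape strings directly. One subtlety worth knowing: the shape character `d` is itself a lowercase letter, so any shape containing a digit fails the test. The shape strings below are sample values chosen for illustration.

```python
# Sample shape strings; isupper() needs at least one cased character
# and no lowercase ones, so 'x' and 'd' both make it return False.
shapes = ["XXXX", "X.X.", "Xxxxx", "dd", "Xd", "xxxx"]

upper_shapes = [shape for shape in shapes if shape.isupper()]
print(upper_shapes)
```

Punctuation does not count as cased, which is why an abbreviation shape such as `X.X.` still passes.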

## Creating Sentence with Unique Dependency and Shape

Nice work, Stellar Navigator! Ready for further exploration? Fill in the missing sections, marked with ____, to create a sentence that includes at least one token starting with a number (shape 'dxxxx') and a syntactic dependency of 'ROOT'. Let's bring it into reality!


```python
import spacy

# Let's load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Provide a sentence with a token having a shape 'dxxxx' and a syntactic dependency 'ROOT'
sentence = ____

# Pass the sentence to the nlp object
doc = nlp(sentence)

# For syntactic tokens
print('\nSyntactic Dependencies:\n')
for token in doc:
    if ____:
        print(f"{token.text:<10s} {token.dep_:<10s} {token.head.text:<10s}")

# For token shapes
print('\nToken Shapes:\n')
for token in doc:
    if ____:
        print(f"{token.text:<10s} {token.shape_:<10s}")

```

To create a sentence that includes at least one token starting with a number (shape 'dxxxx') and a syntactic dependency of 'ROOT', you can fill in the missing sections as follows:

```python
import spacy

# Let's load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Provide a sentence with a token having a shape 'dxxxx' and a syntactic dependency 'ROOT'
sentence = "Users discussed the rumor on 4chan yesterday."

# Pass the sentence to the nlp object
doc = nlp(sentence)

# For syntactic tokens
print('\nSyntactic Dependencies:\n')
for token in doc:
    if token.dep_ == 'ROOT':
        print(f"{token.text:<10s} {token.dep_:<10s} {token.head.text:<10s}")

# For token shapes
print('\nToken Shapes:\n')
for token in doc:
    if token.shape_ == 'dxxxx':
        print(f"{token.text:<10s} {token.shape_:<10s}")
```

### Explanation of the Filled Sections:
1. **Sentence**: "Users discussed the rumor on 4chan yesterday." This sentence includes the token "4chan", a digit followed by four lowercase letters, so its shape is 'dxxxx' (assuming the tokenizer keeps it as a single token). A purely numeric token such as "2023" has the shape 'dddd' and would not match.
2. **Syntactic Dependency Check**: The condition checks if `token.dep_` is equal to 'ROOT'.
3. **Token Shape Check**: The condition checks if `token.shape_` is equal to 'dxxxx'.

This code will output the syntactic dependencies and token shapes as specified.
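An exact comparison with 'dxxxx' matches only one specific pattern: a digit followed by four or more lowercase letters. If the goal is any token that starts with a digit, a prefix test on the shape is more forgiving. The sketch below compares the two on sample shape strings chosen for illustration.

```python
# Sample shape strings: digit + letters, digits only, and letters only.
shapes = ["dxxxx", "dddd", "dxx", "xxxx"]

exact_matches = [shape for shape in shapes if shape == "dxxxx"]
digit_initial = [shape for shape in shapes if shape.startswith("d")]

print("exact 'dxxxx':   ", exact_matches)
print("starts with 'd': ", digit_initial)
```

The prefix test is safe because lowercase letters are always mapped to `x`, so a shape can begin with `d` only when the token begins with a digit.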

## Syntactic Dependencies and Token Shapes Filtering

Great strides, Stellar Navigator! Next stop: code additions!

Fill in the missing lines to print syntactic dependencies of the type 'nsubjpass' and token shapes that are not purely alphanumeric (.isalnum can be helpful here). Take your time, and trust your learning. You've got this!

```python
import spacy
from nltk.corpus import reuters

# Let's load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Take a sample text from reuters corpus
sample_text = reuters.raw(reuters.fileids()[0])

# Pass the text to the nlp object
doc = nlp(sample_text)

# For syntactic tokens
print('\nSyntactic Dependencies:\n')
# TODO: Add a loop to print syntactic dependencies of type 'nsubjpass'

# For token shapes
print('\nToken Shapes:\n')
# TODO: Add a loop to print token shapes not purely alphanumeric


```


To complete the code by adding loops that print syntactic dependencies of type 'nsubjpass' and token shapes that are not purely alphanumeric, you can fill in the missing lines as follows:

```python
import spacy
from nltk.corpus import reuters

# Let's load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Take a sample text from reuters corpus
sample_text = reuters.raw(reuters.fileids()[0])

# Pass the text to the nlp object
doc = nlp(sample_text)

# For syntactic tokens
print('\nSyntactic Dependencies:\n')
for token in doc:
    if token.dep_ == 'nsubjpass':
        print(f"{token.text:<10s} {token.dep_:<10s} {token.head.text:<10s}")

# For token shapes
print('\nToken Shapes:\n')
for token in doc:
    if not token.shape_.isalnum():
        print(f"{token.text:<10s} {token.shape_:<10s}")
```

### Explanation of the Added Lines:
1. **Syntactic Dependencies Loop**: The loop iterates through each token in the document and checks if the syntactic dependency (`token.dep_`) is equal to 'nsubjpass'. If it is, it prints the token text, its dependency type, and the head token's text.
  
2. **Token Shapes Loop**: This loop checks whether the token's shape is not purely alphanumeric using `not token.shape_.isalnum()`. Letters and digits become `x`, `X`, or `d` in the shape, so the condition is met only when the token contains punctuation, symbols, or whitespace. If so, it prints the token text and its shape.

This code will effectively filter and display the desired syntactic dependencies and token shapes.
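To see which shapes count as "not purely alphanumeric", the sketch below applies `str.isalnum()` to a few sample shape strings chosen for illustration.

```python
# Sample shape strings: plain words and numbers, then shapes with punctuation.
shapes = ["xxxx", "dd", "Xxxxx", "'x", "X.X.-XXXX", ","]

non_alnum = [shape for shape in shapes if not shape.isalnum()]
print(non_alnum)
```

Shapes built only from `x`, `X`, and `d` pass `isalnum()`, so the filter keeps exactly the tokens whose shape retains a punctuation mark or symbol.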

## Filtering Syntactic Dependencies and Numerically Initiated Token Shapes

Stellar Navigator, you have demonstrated outstanding groundwork in linguistics. It is now time for the grand finale of this lesson. Assemble a complete Python program that uses the spaCy NLP model to:

- Process the text from the Reuters corpus.
- Filter and print tokens with the 'pobj' syntactic dependency.
- Display token shapes for tokens that start with a numeric digit.

Trust your learning - you can do this!

```python
# TODO: Import necessary libraries, load the English NLP model, and take a sample text from Reuters corpus

# TODO: Create an NLP object and pass the sample text to it

# TODO: Write a loop to print syntactic dependencies of type 'pobj'

# TODO: Write a loop to print token shapes that start with a digit


```