In [2]:
from nltk.stem import PorterStemmer

In [3]:
stemmer = PorterStemmer()

In [5]:
corpus="""Running Runner Ran Runs Easily Heavily fairly"""

In [6]:
from nltk.tokenize import word_tokenize

In [7]:
words = word_tokenize(corpus)

In [10]:
stemmed_words = [stemmer.stem(word) for word in words]

In [15]:
print(stemmed_words)

['run', 'runner', 'ran', 'run', 'easili', 'heavili', 'fairli']


In [16]:
stemmer.stem('congratulations')

'congratul'

In [22]:
stemmer.stem('maternal')

'matern'

### Quantifiers

Quantifiers define **how many times** a character or group should appear in a match. Here are the three main quantifiers:

1. `*` (asterisk) – Matches **0 or more** occurrences of the preceding character or group.
2. `+` (plus) – Matches **1 or more** occurrences of the preceding character or group.
3. `{n,m}` (curly braces) – Matches **between `n` and `m`** occurrences of the preceding character or group.

Let's look at each one in detail with examples.

#### Example 1: `*` (Matches 0 or More Times)

Suppose we want to match words ending in "sing" or "ssing." We can use the pattern `r'ss*ing$'`:
- The first `s` is required, and `s*` then allows **0 or more additional "s" characters**, so the pattern matches "sing" (one "s"), "ssing" (two), and longer runs.

```python
import re

pattern = r'ss*ing$'
words = ["sing", "ssing", "missing", "assing", "bossing"]
matches = [word for word in words if re.search(pattern, word)]
print(matches)  # Output: ['sing', 'ssing', 'missing', 'assing', 'bossing']
```

Explanation:
- The pattern `ss*ing$` finds words ending with "sing" or "ssing" (or with a longer run of "s" before "ing").
- Because `re.search` looks for the pattern anywhere in the string, "missing", "assing", and "bossing" also match: each ends in "ssing".

#### Example 2: `+` (Matches 1 or More Times)

Let’s say we want to find words that end in at least one "s" followed by "ing". Here, we can use `s+ing$`:
- The `s+` means **1 or more "s" characters**, so it will only match words that have at least one "s" before "ing."

```python
pattern = r's+ing$'
words = ["sing", "ssing", "missing", "ing"]
matches = [word for word in words if re.search(pattern, word)]
print(matches)  # Output: ['sing', 'ssing', 'missing']
```

Explanation:
- The pattern `s+ing$` matches "sing", "ssing", and "missing" because each has at least one "s" before "ing".
- It does **not** match "ing" because it lacks the required "s" character.
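
Since `ss*` means one "s" followed by zero or more further "s" characters, it should accept exactly the same strings as `s+`. The following is a quick check of that equivalence, sketched on an illustrative word list (the list and the variable names are assumptions made for this check):

```python
import re

# Illustrative word list (an assumption for this check)
words = ["sing", "ssing", "missing", "assing", "bossing", "ing", "king"]

star_matches = [w for w in words if re.search(r'ss*ing$', w)]
plus_matches = [w for w in words if re.search(r's+ing$', w)]

print(star_matches)                  # ['sing', 'ssing', 'missing', 'assing', 'bossing']
print(star_matches == plus_matches)  # True: both patterns accept the same words
```

In practice `s+` is the more readable way to say "at least one".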

#### Example 3: `{n,m}` (Matches Between `n` and `m` Times)

Suppose we want to match words where "s" appears **between 1 and 2 times** before "ing." We can use `s{1,2}ing$`:
- The `{1,2}` means **between 1 and 2 occurrences** of "s".

```python
pattern = r's{1,2}ing$'
words = ["sing", "ssing", "sssings", "missing", "bossing"]
matches = [word for word in words if re.search(pattern, word)]
print(matches)  # Output: ['sing', 'ssing', 'missing', 'bossing']
```

Explanation:
- The pattern `s{1,2}ing$` matches "sing", "ssing", "missing", and "bossing" because each ends in one or two "s" characters followed by "ing".
- It does **not** match "sssings" because that word ends in "s", so `ing$` cannot match at the end. The number of "s" characters is not the reason: `re.search` would still match "sssing", since its last two "s" characters satisfy `s{1,2}`.
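
Because `re.search` may start matching anywhere in the word, `s{1,2}ing$` on its own also accepts a word such as "sssing": the last two "s" characters are enough to satisfy the quantifier. One way to enforce the upper bound, sketched here under the assumption that the goal is "at most two consecutive s characters", is a negative lookbehind that forbids another "s" immediately before the run. The word list is illustrative:

```python
import re

words = ["sing", "ssing", "sssing", "missing", "bossing"]

loose = r's{1,2}ing$'          # upper bound is not enforced under re.search
strict = r'(?<!s)s{1,2}ing$'   # (?<!s): the run of "s" must not follow another "s"

loose_matches = [w for w in words if re.search(loose, w)]
strict_matches = [w for w in words if re.search(strict, w)]

print(loose_matches)   # ['sing', 'ssing', 'sssing', 'missing', 'bossing']
print(strict_matches)  # ['sing', 'ssing', 'missing', 'bossing']
```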

### Escaping Special Characters with `\`

Some characters have special meanings in regex. To match these characters **literally** (as plain text), we need to "escape" them using a backslash (`\`). Here are some common cases:

1. **`.` (dot)** – Matches any character except a newline.
   - To match an actual dot (like in an email or URL), use `\.`.

2. **`*`, `+`, `?`, `$`** – Used for quantifiers, end of line, etc.
   - To match them literally, prefix with `\`.

#### Example: Matching ".com" at the End of a Word

Suppose we want to match any word that ends with ".com". We can use the pattern `r'\.com$'`:
- The `\.` escapes the dot, so it matches a literal period (`.`) instead of "any character".
- The `$` indicates the end of the word, so this pattern will match words that **end** with ".com".

```python
pattern = r'\.com$'
words = ["website.com", "mydomain.com", "textcom", "example.org"]
matches = [word for word in words if re.search(pattern, word)]
print(matches)  # Output: ['website.com', 'mydomain.com']
```

Explanation:
- The pattern `\.com$` matches words ending with ".com" exactly.
- It does **not** match "textcom" because there is no dot before "com".
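
When the literal text comes from a variable, escaping by hand is error-prone. Python's standard `re.escape` adds the backslashes for you. The sketch below uses a hypothetical `suffix` variable to build the same pattern as above:

```python
import re

suffix = ".com"                     # literal text to match (illustrative)
pattern = re.escape(suffix) + r'$'  # builds the same pattern as r'\.com$'

words = ["website.com", "mydomain.com", "textcom", "example.org"]
matches = [word for word in words if re.search(pattern, word)]

print(pattern)  # \.com$
print(matches)  # ['website.com', 'mydomain.com']
```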

### Summary

- **Quantifiers** help define repetition: `*` (0+), `+` (1+), `{n,m}` (between `n` and `m`).
- **Escaping special characters** with `\` allows matching them literally, rather than using their special meanings.

In [27]:
from nltk.stem import RegexpStemmer

In [28]:
pattern = r'(ing|ed|ly)$'

In [29]:
stemmer = RegexpStemmer(pattern)

In [30]:
words = ["running", "played", "quickly", "joyfully", "cat"]

In [31]:
stemmed_words = [stemmer.stem(word) for word in words]

In [32]:
print(stemmed_words)

['runn', 'play', 'quick', 'joyful', 'cat']


In [33]:
pattern1 = r'^(un|re)'

In [46]:
stemmer1 = RegexpStemmer(pattern1,min=5)

In [47]:
words1 = ["undo", "redo", "unanimously", "unknown", "replay", "relay"]

In [48]:
for word in words1:
    print(stemmer1.stem(word))

undo
redo
animously
known
play
lay


#### Combining Complex Affixes

In [61]:
pattern2 = r'^(un|re|dis|non)|ing$'

In [62]:
reg_stemmer = RegexpStemmer(pattern2)

In [63]:
words2 = ["running","unload","ingwalking","reload","disproportionate","nondiscriminatory"]

In [64]:
for word in words2:
    print(reg_stemmer.stem(word))

runn
load
ingwalk
load
proportionate
discriminatory


Creating patterns for `RegexpStemmer` in NLTK requires an understanding of regular expressions (regex), which allow for precise control over what gets removed or modified in each word. Here’s a breakdown of instructions and rules to create effective patterns for `RegexpStemmer`.

### Basic Regular Expression Rules for `RegexpStemmer`

1. **Identify Affix Position:** 
   - Use `^` to indicate a **prefix** (start of a word).
   - Use `$` to indicate a **suffix** (end of a word).

   For example:
   - `r'^un'` matches words starting with "un" (prefix).
   - `r'ing$'` matches words ending with "ing" (suffix).

2. **Match Multiple Affixes Using the OR Operator (`|`):**
   - Use `|` to match multiple patterns in one regex.
   - For example, `r'(ing|ed|ly)$'` matches words that end in "ing", "ed", or "ly".

3. **Grouping Affixes with Parentheses `()`**:
   - Parentheses group alternatives so that an anchor or quantifier applies to the whole group.
   - For example, `r'^(un|re|in)'` matches any word starting with "un", "re", or "in". Without the `^`, the group would match those letters anywhere in the word.

4. **Character Classes and Sets `[]`:**
   - Use square brackets `[]` to define a set of characters that should match one position.
   - For example, `r'[aeiou]ing$'` matches words ending in "ing" that are preceded by a vowel (e.g., "doing", "seeing"). The vowel is part of the match, so a stemmer built on this pattern removes it as well.

5. **Quantifiers**:
   - Quantifiers specify how many times a character or group can repeat.
   - `*` matches 0 or more times, `+` matches 1 or more times, and `{n,m}` matches between `n` and `m` times.
   - For example, `r'ss*ing$'` matches words ending in one or more "s" characters followed by "ing", such as "sing" or "ssing" (equivalent to `r's+ing$'`).

6. **Escaping Special Characters with `\`**:
   - Some characters like `.`, `*`, `+`, `?`, and `$` have special meanings in regex. Use `\` to escape them when you want to match the literal character.
   - For example, `r'\.com$'` matches words ending with ".com".
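
The effect of anchors, groups, and character classes is easiest to see by removing the matches directly. The sketch below uses plain `re.sub` with an empty replacement as a stand-in for the stemmer (an assumption made to keep the example dependency-free), and the sample words are illustrative:

```python
import re

# Unanchored group: matches anywhere in the word
unanchored = re.sub(r'(un|re|in)', '', 'training')
# Anchored with ^: only a leading prefix is removed
anchored = re.sub(r'^(un|re|in)', '', 'training')
prefix_removed = re.sub(r'^(un|re|in)', '', 'unreal')
print(unanchored, anchored, prefix_removed)  # trag training real

# A character class is part of the match, so the vowel is removed too
with_class = re.sub(r'[aeiou]ing$', '', 'doing')
# A lookbehind checks for the vowel without consuming it
with_lookbehind = re.sub(r'(?<=[aeiou])ing$', '', 'doing')
print(with_class, with_lookbehind)  # d do
```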

### Steps to Design a Pattern for `RegexpStemmer`

#### Step 1: Define Your Affixes

Start by defining the specific affixes you want to remove. Identify whether they are prefixes or suffixes:
- Suffixes (like `ing`, `ed`, `ly`) are removed by targeting the end of the word with `$`.
- Prefixes (like `un`, `re`, `pre`) are removed by targeting the beginning of the word with `^`.

#### Step 2: Create Patterns with `|` for Multiple Affixes

If you have multiple suffixes or prefixes to remove, use the `|` operator to combine them:
```python
pattern = r'(ing|ed|ly)$'  # Matches any word ending in "ing", "ed", or "ly"
```

#### Step 3: Use Quantifiers to Target Variable-Length Affixes

Quantifiers help match affixes that might vary slightly in length. For example, an optional trailing "s" can be written with `?` (0 or 1 occurrences):
```python
pattern = r'ing$'    # Matches "ing" at the end of a word
pattern = r'ings?$'  # Matches "ing" or "ings" at the end of a word; same as r'(ing|ings)$'
```
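
A variable-length suffix can be written either with a quantifier or as an explicit alternation. The sketch below checks that the two forms give the same result on a few illustrative words, again using `re.sub` with an empty replacement as a stand-in for the stemmer (an assumption, not NLTK code):

```python
import re

words = ["meeting", "meetings", "king"]

with_quantifier = [re.sub(r'ings?$', '', w) for w in words]
with_alternation = [re.sub(r'(ing|ings)$', '', w) for w in words]

print(with_quantifier)                      # ['meet', 'meet', 'k']
print(with_quantifier == with_alternation)  # True
```

The result for "king" shows that a suffix rule cannot tell an inflection from part of the root.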

#### Step 4: Add Minimum Length (Optional)

You may not want to stem very short words. `min` sets the minimum length a word must have in order to be stemmed; shorter words are returned unchanged. The check applies to the input word, not to the resulting stem:
```python
from nltk.stem import RegexpStemmer

pattern = r'(ing|ed|ly)$'
stemmer = RegexpStemmer(pattern, min=3)
```
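
A small sketch of how `min` behaves, with illustrative words. The length check is applied to the word passed in, so a word that is just long enough can still end up with a very short stem:

```python
from nltk.stem import RegexpStemmer

stemmer = RegexpStemmer(r'(ing|ed|ly)$', min=5)

short = stemmer.stem("sing")   # 4 characters: shorter than min, returned as-is
exact = stemmer.stem("bring")  # 5 characters: long enough, so "ing" is stripped
print(short, exact)            # sing br
```

If the goal is to guarantee a minimum stem length, `min` needs to be chosen with the longest affix in mind.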

#### Step 5: Test the Pattern

Use sample words to test your pattern. Try various cases to ensure it works as expected.

### Common Patterns for `RegexpStemmer`

Here are a few common patterns and explanations:

1. **Removing Common English Suffixes:**
   ```python
   pattern = r'(ing|ed|s|es|ly)$'
   ```
   - This pattern removes common suffixes like `ing`, `ed`, `s`, `es`, and `ly` at the end of a word.

2. **Removing Prefixes like "un" and "re":**
   ```python
   pattern = r'^(un|re|in|dis|non)'
   ```
   - This pattern removes the prefixes `un`, `re`, `in`, `dis`, and `non` at the beginning of a word.

3. **Stemming Words Ending with "ful", "ness", or "ation":**
   ```python
   pattern = r'(ful|ness|ation)$'
   ```
   - This pattern removes suffixes commonly used in English nouns and adjectives.

4. **Removing Plurals:**
   ```python
   pattern = r's$'
   ```
   - This pattern removes a single `s` at the end of a word (for plurals), but it would also match any word ending in `s`, which can sometimes lead to over-stemming.

5. **Combining Complex Affixes:**
   ```python
   pattern = r'^(un|re|dis|non)|ing$'
   ```
   - This pattern removes multiple prefixes at the start or the suffix "ing" at the end.
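
The over-stemming mentioned for the plural pattern can be seen with a few illustrative words. The sketch uses `re.sub` with an empty replacement as a stand-in for the stemmer (an assumption), and the guarded variant with a lookbehind is one possible mitigation, not a recommended rule:

```python
import re

words = ["cats", "glass", "bus"]

plain = [re.sub(r's$', '', w) for w in words]
# (?<!s): only strip a final "s" that does not follow another "s"
guarded = [re.sub(r'(?<!s)s$', '', w) for w in words]

print(plain)    # ['cat', 'glas', 'bu']
print(guarded)  # ['cat', 'glass', 'bu']
```

The lookbehind protects "glass" but not "bus"; leaving short words unstemmed via `min` is another option for cases like that.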

### Examples of Pattern Testing

Here’s an example of testing patterns with different words:
```python
from nltk.stem import RegexpStemmer

pattern = r'(ing|ed|ly)$'
stemmer = RegexpStemmer(pattern)

words = ["running", "quickly", "joyfully", "played", "cats", "runner"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)  # ['runn', 'quick', 'joyful', 'play', 'cats', 'runner']
```

### Summary

To create effective patterns:
- Use `^` for prefixes and `$` for suffixes.
- Group multiple affixes with `|` inside `()`.
- Control length and repetition with quantifiers.
- Set `min` so that words shorter than that length are left unstemmed.

Testing the pattern on various words will help ensure it works as intended for your application.