
## Tokenization and Text Manipulation
Tokenization and text manipulation are fundamental concepts in natural language processing (NLP) and data preprocessing. Let’s break them down:

- **Tokenization:** The process of breaking text into smaller units called tokens. Tokens can be words, sentences, characters, or subwords, depending on the task.
  - Example text: "Pedestrian detection is important."
  - Word-level tokens: ["Pedestrian", "detection", "is", "important", "."]
  - Character-level tokens: ["P", "e", "d", "e", "s", "t", ...]
  - Sentence-level tokens: whole sentences (this example contains only one).

Tokenization converts text into a structured sequence of units that machine learning models can process.
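The three levels above can be sketched with the standard library alone. This is a minimal illustration, not a production tokenizer: the second sentence in the sample text is added here so the sentence split has something to do, and the sentence rule (split after `.`, `!`, or `?` followed by whitespace) is a naive assumption that breaks on abbreviations such as "e.g.".

```python
import re

# Sample text; the second sentence is an addition for illustration
text = "Pedestrian detection is important. It saves lives."

# Word-level: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character, including spaces
char_tokens = list(text)

# Sentence-level: naive split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens)
print(char_tokens[:6])
print(sentence_tokens)
```

Subword tokenization is left out because it usually relies on a trained vocabulary rather than a fixed rule.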

- **Text Manipulation:** The operations applied to text data to clean, transform, and prepare it. Common techniques include:
  - Lowercasing: Converting text to lowercase for consistency (e.g., "Detection" → "detection").
  - Removing punctuation: Dropping characters such as `.`, `,`, `?`, and `!` when they are not needed.
  - Stopword removal: Removing common words like "is", "the", and "and" that carry little meaning on their own.
  - Stemming and lemmatization: Reducing words to a base form (e.g., "running" → "run"). Stemming trims suffixes by rule, while lemmatization maps a word to its dictionary form.
  - Replacing or removing special characters.
  - Normalization: Standardizing text so that different forms of the same word map to one common form.

Together, tokenization and text manipulation help convert raw text into a clean, structured format suitable for analysis or feeding into models.
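A few of these operations can be chained in a handful of lines. The sketch below is only an illustration: `clean` is a hypothetical helper, the three-word `STOPWORDS` set is a stand-in for a real stopword list, and stemming and lemmatization are skipped because they normally need an external library.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"is", "the", "and"}

def clean(text):
    """Lowercase, strip punctuation, collapse whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return [word for word in text.split() if word not in STOPWORDS]

cleaned = clean("The Detection is FAST, and the  results are good!")
print(cleaned)
```

With this input the result is `['detection', 'fast', 'results', 'are', 'good']`.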

### Step 1: Import libraries and read data

In [1]:
# Import required libraries
import re
import string

In [7]:
# Read Data
text = "Hello there!   I'm working on pedestrian   detection, it's quite exciting. 123   times more  than I expected!"

text

"Hello there!   I'm working on pedestrian   detection, it's quite exciting. 123   times more  than I expected!"

### Step 2: Preprocess Text
- Convert text to lowercase for consistency.
- Collapse repeated whitespace and trim the ends. Special characters are left in place here; punctuation is handled in Step 4.

In [8]:
# Convert text to lowercase
text = text.lower()

print(text, "\n")

# Collapse runs of whitespace into a single space and trim both ends
# (str.strip() alone would only remove leading and trailing whitespace)
text = re.sub(r"\s+", " ", text).strip()

print(text)

hello there!   i'm working on pedestrian   detection, it's quite exciting. 123   times more  than i expected! 

hello there! i'm working on pedestrian detection, it's quite exciting. 123 times more than i expected!


### Step 3: Tokenization Using Regex
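- Use a regex to extract words and numbers. Contractions are split into a stem and an apostrophe-led suffix (e.g., `i'm` becomes `i` and `'m`).

In [11]:
# Regex pattern to extract words, numbers, and contraction suffixes.
# Alternatives are tried left to right, so \b\w+\b matches "i" before the
# last alternative (\w+\'\w+) can match "i'm"; the suffix "'m" is then
# picked up by \'\w+.
pattern = r"\b\w+\b|\'\w+|\w+\'\w+"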
tokens = re.findall(pattern, text)

print(tokens)

['hello', 'there', 'i', "'m", 'working', 'on', 'pedestrian', 'detection', 'it', "'s", 'quite', 'exciting', '123', 'times', 'more', 'than', 'i', 'expected']
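If contractions should stay in one piece instead, one option is to put the contraction alternative first. The sketch below is an alternative to the pattern above, not a replacement for it. `whole_pattern`, `whole_tokens`, and `expanded` are names introduced here, and the sample text is a shortened copy of the notebook's. Keeping contractions whole also makes a lookup table keyed on full contractions, such as `{"i'm": "i am"}`, usable.

```python
import re

# Shortened copy of the notebook's lowercased sample text
text = "hello there!   i'm working on pedestrian   detection, it's quite exciting."

# Try the contraction form first; fall back to a plain word or number
whole_pattern = r"\w+'\w+|\w+"
whole_tokens = re.findall(whole_pattern, text)
print(whole_tokens)

# Whole contractions can now be looked up directly and split into words
contractions = {"i'm": "i am", "it's": "it is"}
expanded = [word for token in whole_tokens
            for word in contractions.get(token, token).split()]
print(expanded)
```

The first list keeps `"i'm"` and `"it's"` intact; the second expands them to `i`, `am` and `it`, `is`.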


### Step 4: Text Manipulation & Cleaning
- Remove numbers (if not needed).
- Expand contractions for better NLP processing. This has to happen before punctuation is removed, otherwise `'m` is reduced to `m`.
- Remove punctuation (optional).

In [14]:
# Remove numbers (if not needed)
tokens = [token for token in tokens if not token.isdigit()]

print(tokens, "\n")

# Expand contractions (simple example). The tokenizer in Step 3 emits the
# suffix as its own token ("'m", "'s"), so the lookup is keyed on the suffix.
# Note: "'s" can also mean "has" or mark a possessive; "is" is assumed here.
contractions = {"'m": "am", "'s": "is"}
tokens = [contractions.get(token, token) for token in tokens]

print(tokens, "\n")

# Remove punctuation (optional)
punct_table = str.maketrans('', '', string.punctuation)
tokens = [token.translate(punct_table) for token in tokens]
tokens = [token for token in tokens if token]  # Remove empty strings

print(tokens)

['hello', 'there', 'i', "'m", 'working', 'on', 'pedestrian', 'detection', 'it', "'s", 'quite', 'exciting', 'times', 'more', 'than', 'i', 'expected'] 

['hello', 'there', 'i', 'am', 'working', 'on', 'pedestrian', 'detection', 'it', 'is', 'quite', 'exciting', 'times', 'more', 'than', 'i', 'expected'] 

['hello', 'there', 'i', 'am', 'working', 'on', 'pedestrian', 'detection', 'it', 'is', 'quite', 'exciting', 'times', 'more', 'than', 'i', 'expected']
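
The steps of this notebook can also be packaged into one reusable function. The following is a sketch under a few assumptions: `preprocess`, `keep_numbers`, and the `suffixes` table are names introduced here for illustration, and contraction suffixes are expanded before punctuation is stripped so that `'m` is not reduced to `m`.

```python
import re
import string

def preprocess(text, keep_numbers=False):
    """Hypothetical helper: lowercase, tokenize, and clean a string."""
    text = text.lower().strip()

    # Words and numbers, plus apostrophe-led contraction suffixes
    tokens = re.findall(r"\b\w+\b|'\w+", text)

    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]

    # Expand suffixes first ("'s" is assumed to mean "is")
    suffixes = {"'m": "am", "'s": "is"}
    tokens = [suffixes.get(t, t) for t in tokens]

    # Strip any remaining punctuation and drop tokens left empty
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]
    return [t for t in tokens if t]

result = preprocess("Hello there!   I'm working on pedestrian   detection.")
print(result)
```

Keeping number removal behind a flag makes the optional step explicit: `preprocess("123 times", keep_numbers=True)` keeps `'123'`, while the default drops it.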
