<table align="left" width=100%>
    <tr>
        <td width="10%">
            <img src="../images/RA_Logo.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                  <b> 3. Lower casing </b>
                </font>
            </div>
        </td>
    </tr>
</table>

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/vidyadharbendre/learn_nlp_using_examples/blob/main/notebooks/03_LowerCasing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/vidyadharbendre/learn_nlp_using_examples/blob/main/notebooks/03_LowerCasing.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

## What is Lowercasing?
Lowercasing refers to the process of converting all characters in a piece of text to lowercase. This step is important because it helps standardize the text data by treating words like "Hello", "hello", and "HELLO" as identical, regardless of their original capitalization.
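
As a quick illustration of the definition above, Python's built-in `str.lower()` is enough to show the effect. This is a minimal sketch; the variable names are illustrative and do not come from the notebook cells that follow.

```python
# Three spellings that differ only in capitalization
variants = ["Hello", "hello", "HELLO"]

# str.lower() returns a new string with every cased character lowercased
lowercased = [word.lower() for word in variants]
print(lowercased)

# After lowercasing, all three spellings collapse to one distinct form
unique_forms = set(lowercased)
print(unique_forms)
```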

## Why Lowercasing?
Lowercasing is a common normalization step in NLP that makes text consistent and can improve model performance. By converting all text to lowercase:

- We reduce the vocabulary size by collapsing words that differ only by case into a single representation.
- We avoid treating the same word differently based on capitalization, which would otherwise make the data sparser and can hurt model accuracy.
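
The vocabulary reduction described above can be measured on a toy example. The sketch below assumes a made-up, whitespace-tokenized sentence (hypothetical data, not from the notebook) and counts distinct tokens before and after lowercasing.

```python
from collections import Counter

# A toy, whitespace-tokenized corpus (hypothetical data for illustration)
tokens = "The cat saw the Cat and THE dog saw The DOG".split()

# Count each distinct token with and without lowercasing
vocab_cased = Counter(tokens)
vocab_lower = Counter(token.lower() for token in tokens)

print(len(vocab_cased), "distinct tokens before lowercasing")
print(len(vocab_lower), "distinct tokens after lowercasing")

# Counts that were split across "The", "the" and "THE" are now pooled
print("Occurrences of 'the':", vocab_lower["the"])
```

Pooling the counts this way is what reduces sparsity: each remaining vocabulary entry is backed by more observations.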

## How to Achieve Lowercasing Programmatically?
Using SpaCy:

In [1]:
import spacy

# Print the version of SpaCy installed
print(spacy.__version__)

3.5.4


In [2]:
import spacy

# Load SpaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Example text
text = "This is an Example Sentence Demonstrating the Lowercasing of Text."

# Process the text with SpaCy
doc = nlp(text)

# Lowercase each token's text with Python's str.lower() (token.lower_ gives the same result)
lowercased_text_spacy = [token.text.lower() for token in doc]

print(lowercased_text_spacy)

['this', 'is', 'an', 'example', 'sentence', 'demonstrating', 'the', 'lowercasing', 'of', 'text', '.']


In [3]:
text = "What is SEP - Skill Enhancement Program"

In [5]:
doc = nlp(text)

In [6]:
for token in doc:
    print(token)

What
is
SEP
-
Skill
Enhancement
Program


In [7]:
[token.text.lower() for token in doc]

['what', 'is', 'sep', '-', 'skill', 'enhancement', 'program']
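
Note that in the output above the acronym "SEP" becomes "sep" and can no longer be told apart from an ordinary word. If that distinction matters for a task, one possible approach is to lowercase selectively. The sketch below is not part of the original notebook: `lower_except_acronyms` is a hypothetical helper, and its all-caps heuristic is an assumption that will also preserve shouted words such as "HELLO".

```python
def lower_except_acronyms(tokens, min_len=2):
    """Lowercase tokens, but keep all-uppercase alphabetic tokens unchanged.

    Hypothetical helper for illustration: a token is treated as an acronym
    when it is purely alphabetic, fully uppercase, and at least min_len long.
    """
    result = []
    for token in tokens:
        is_acronym = token.isalpha() and token.isupper() and len(token) >= min_len
        result.append(token if is_acronym else token.lower())
    return result


# Same tokens as the SpaCy output above, written out as plain strings
tokens = ["What", "is", "SEP", "-", "Skill", "Enhancement", "Program"]
selectively_lowered = lower_except_acronyms(tokens)
print(selectively_lowered)
```

The helper works on plain strings, so it can be applied to token texts from either SpaCy or NLTK.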

Using NLTK:

In [4]:
import nltk

# Print the version of NLTK installed
print("NLTK version:", nltk.__version__)

NLTK version: 3.8.1


In [8]:
from nltk.tokenize import word_tokenize
import nltk

In [9]:
# Ensure the Punkt tokenizer data used by word_tokenize is downloaded
# (NLTK 3.9 and later look for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/vidyadharbendre/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
# Example text
text = "This is an Example Sentence Demonstrating the Lowercasing of Text."

In [11]:
word_tokenize(text)

['This',
 'is',
 'an',
 'Example',
 'Sentence',
 'Demonstrating',
 'the',
 'Lowercasing',
 'of',
 'Text',
 '.']

In [12]:
# Tokenize the text
words = word_tokenize(text)

In [13]:
# Lowercase each token with Python's str.lower() (NLTK only did the tokenizing)
lowercased_text_nltk = [word.lower() for word in words]

print(lowercased_text_nltk)

['this', 'is', 'an', 'example', 'sentence', 'demonstrating', 'the', 'lowercasing', 'of', 'text', '.']


## Explanation of the Code:

SpaCy Approach: We load the SpaCy English model and process the text, which splits it into tokens. Each token in the document (doc) is then lowercased by calling Python's lower() on token.text; SpaCy also exposes the same value directly as token.lower_.

NLTK Approach: We tokenize the text into words using word_tokenize() from NLTK. NLTK only handles the tokenization here; each word is then lowercased with Python's word.lower().
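
Both approaches rely on Python's `str.lower()`. For text that may contain non-English characters, Python also provides `str.casefold()`, a more aggressive variant intended for caseless matching. The following sketch (with illustrative German spellings that are not from the notebook) shows where the two differ.

```python
# Three spellings of the same German word ("street")
words = ["Straße", "STRASSE", "strasse"]

# lower() keeps "ß" as it is, so two distinct forms remain
lowered = [w.lower() for w in words]
print(set(lowered))

# casefold() maps "ß" to "ss", so all three spellings collapse to one form
folded = [w.casefold() for w in words]
print(set(folded))
```

For plain English text such as the examples in this notebook, `lower()` and `casefold()` give the same result.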

## Summary:
Lowercasing is a fundamental preprocessing step in NLP that standardizes text so that words differing only in capitalization are treated alike, which helps downstream tasks such as text classification, information retrieval, and sentiment analysis. With SpaCy or NLTK handling tokenization, lowercasing each token takes a single call to Python's lower(), and the smaller vocabulary that results can improve both the quality and efficiency of NLP models.