<h1>What Is Preprocessing?</h1>
<p>
In Natural Language Processing (NLP), preprocessing refers to the steps taken to clean and prepare raw text before it is used in a machine learning model. These steps transform the text into a format that algorithms can analyze more easily. The goal of preprocessing is to reduce noise, normalize the text, and prepare it for conversion into numerical features that can be fed into machine learning models.</p>

<h2>1. Steps of Preprocessing</h2>
<ol>
    <li>Lowercasing</li>
    <li>Removing Punctuation</li>
    <li>Tokenization</li>
    <li>Removing Stop Words</li>
    <li>Lemmatization</li> </ol>

#### 1.1  Lowercasing 
<p> 
   Converting all characters in the text to lowercase ensures that words like "Depression" and "depression" are treated as the same word.
</p>

In [41]:
import spacy

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

In [42]:
# Process the text with spaCy
text = "I am Feeling Great Today!"
doc = nlp(text)

In [43]:
# Converting text to lowercase
lowercased_text = text.lower()

print(lowercased_text)  

i am feeling great today!
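
To make the effect of lowercasing concrete, the following minimal sketch counts words before and after the conversion. The word list and the variable names `counts_raw` and `counts_lower` are made up for illustration and are not part of the notebook above.

```python
from collections import Counter

# Illustrative word list: the same two words written with different casing
words = ["Depression", "depression", "DEPRESSION", "Anxiety", "anxiety"]

# Without lowercasing, every spelling variant is counted as a separate word
counts_raw = Counter(words)

# After lowercasing, the variants collapse into one entry per word
counts_lower = Counter(word.lower() for word in words)

print(len(counts_raw))     # number of distinct entries before lowercasing
print(dict(counts_lower))  # merged counts after lowercasing
```

Here five distinct spellings collapse into two vocabulary entries, which is the kind of normalization this step is meant to achieve.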


#### 1.2 Removing Punctuation
<p>Remove punctuation marks (such as periods, commas, and question marks) from the text, since they often contribute little to the meaning in many NLP tasks.

 </p>

In [44]:
# Import string
import string 

In [45]:
# Example text
text1 = "What is NLP?"
text2 = "Feeling Depressed and anxious."

In [46]:
# Converting to lower case
text1_lower = text1.lower()
text2_lower = text2.lower()

In [47]:
# Remove punctuation
text1_no_punct = text1_lower.translate(str.maketrans('', '', string.punctuation))

text2_no_punct = text2_lower.translate(str.maketrans('', '', string.punctuation))

In [48]:
text1_no_punct

'what is nlp'

In [49]:
text2_no_punct

'feeling depressed and anxious'
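
Note that `string.punctuation` contains only ASCII punctuation, and it includes the apostrophe and the hyphen, so contractions and hyphenated words are altered as well. The sketch below wraps the same `translate` approach in a small helper; the function name `remove_punctuation` and its `keep` parameter are hypothetical, introduced here only to illustrate one way of preserving selected characters.

```python
import string

def remove_punctuation(text, keep=""):
    """Delete ASCII punctuation from text, except the characters listed in keep."""
    # Build the set of characters to delete, skipping anything we want to keep
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_delete))

sample = "I don't feel well-rested."

no_punct = remove_punctuation(sample)               # strips ' and - too
kept_inner = remove_punctuation(sample, keep="'-")  # keeps ' and -

print(no_punct)    # I dont feel wellrested
print(kept_inner)  # I don't feel well-rested
```

Whether to keep apostrophes and hyphens depends on the task; this is a judgment call rather than a fixed rule.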

#### 1.3 Tokenization
<p> <b>Tokenization</b> is the process of breaking down text into smaller units called tokens. In NLP, tokens are typically words, punctuation marks, or other meaningful elements. Tokenization is a crucial early step in many NLP tasks because it allows you to work with individual units of text.

 </p>

In [50]:
# Example Text 
text = "I am feeling great today! How about you?"

In [51]:
# lower case text
text_lower = text.lower()

# removing punctuation
text_no_punct = text_lower.translate(str.maketrans('','',string.punctuation))

In [52]:
# Process the input text
doc = nlp(text_no_punct)

In [53]:
# iterating over tokens 
for token in doc:
    print(token.text)

i
am
feeling
great
today
how
about
you
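
To show the idea of tokenization without any library, here is a rough regex-based sketch. It is only an approximation and not how spaCy tokenizes: spaCy applies language-specific rules, while this hypothetical `simple_tokenize` helper just separates runs of word characters from individual punctuation marks.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I am feeling great today! How about you?")
print(tokens)
# ['I', 'am', 'feeling', 'great', 'today', '!', 'How', 'about', 'you', '?']
```

Because the punctuation was not removed first, the marks appear as separate tokens. A pattern this simple also splits contractions such as "don't" into three pieces, which is one reason a dedicated tokenizer is usually preferable.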


#### 1.4 Removing Stop Words
<p> Stop words are frequently occurring words, such as "the," "is," "and," and "in," that carry little meaning or information for the analysis. Removing stop words can help reduce noise in the text data. </p>

In [54]:
# Input text
text_with_stop_words = "I am feeling great today, but tomorrow might not be the same."

# Process the text with spaCy
doc = nlp(text_with_stop_words)

In [55]:
# Filter out stop words and join the remaining tokens
filtered_text = ' '.join([token.text for token in doc if not token.is_stop])

In [56]:
# Print the filtered text
filtered_text

'feeling great today , tomorrow .'
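
The output above still contains the comma and the period, because stop-word filtering alone does not remove punctuation. The following library-free sketch combines lowercasing, punctuation removal, and stop-word filtering in one place. The `STOP_WORDS` set is a tiny hand-picked subset chosen for this sentence, not spaCy's actual list, and `remove_stop_words` is a hypothetical helper name.

```python
import re

# Illustrative subset only; real stop-word lists are much longer
STOP_WORDS = {"i", "am", "but", "might", "not", "be", "the", "same"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep only word tokens, and drop stop words."""
    # \w+ keeps runs of word characters, so punctuation is discarded here
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]

filtered_tokens = remove_stop_words(
    "I am feeling great today, but tomorrow might not be the same."
)
print(filtered_tokens)  # ['feeling', 'great', 'today', 'tomorrow']
```

Notice that "not" is dropped in both this sketch and the spaCy output above. For tasks where negation matters, such as sentiment analysis, it may be worth keeping such words.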

#### 1.5 Lemmatization
<p> 
  Lemmatization is the process of reducing words to their base or dictionary form, called the lemma. It normalizes words so that variations of the same word are treated as the same token, which can improve the performance of natural language processing tasks like text analysis, information retrieval, and machine learning.

<h4>How Lemmatization Works</h4>

<b>Tokenization:</b> First, the text is split into individual words or tokens.<br>
<b>Part-of-Speech Tagging:</b> Each token is assigned a part-of-speech tag to determine its grammatical category (noun, verb, adjective, etc.).<br>
<b>Lemmatization:</b> Based on the part-of-speech tag, each token is mapped to its base form (lemma). This goes beyond stripping inflectional suffixes, because irregular forms are handled too; for example, "am" becomes "be".
 </p>
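
The three steps above can be sketched with a toy lookup table. This is only an illustration of why the part-of-speech tag matters, not how spaCy implements lemmatization; the `LEMMA_TABLE` entries, the tags assigned to the example tokens, and the `lemmatize` helper are all assumptions made for this example.

```python
# Hypothetical lookup table keyed by (lowercased word, part-of-speech tag)
LEMMA_TABLE = {
    ("am", "AUX"): "be",
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",  # "running" as a noun keeps its form
    ("parks", "NOUN"): "park",
}

def lemmatize(tagged_tokens):
    """Map each (word, pos) pair to its lemma, falling back to the word itself."""
    return [
        LEMMA_TABLE.get((word.lower(), pos), word)
        for word, pos in tagged_tokens
    ]

# Steps 1 and 2 (tokenization and tagging) are written out by hand here
tagged = [
    ("I", "PRON"), ("am", "AUX"), ("running", "VERB"),
    ("in", "ADP"), ("the", "DET"), ("park", "NOUN"),
]

lemmas = lemmatize(tagged)
print(lemmas)  # ['I', 'be', 'run', 'in', 'the', 'park']
```

The same surface form "running" maps to "run" when tagged as a verb but stays "running" when tagged as a noun, which is why tagging comes before the lemma lookup.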

In [57]:
# Input text
text = "I am running in the park."

# Process the text with spaCy
doc = nlp(text)

In [58]:
# Lemmatize each token and join the results
lemmatized_text = ' '.join([token.lemma_ for token in doc])

In [59]:
lemmatized_text

'I be run in the park .'