<a href="https://colab.research.google.com/github/raghavmahajan821/NLP/blob/main/Exploring_%26_Processing_Text_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Processing Text Data**

We are going to discuss the following recipes under text preprocessing
and exploratory data analysis.

Recipe 1. Lowercasing

Recipe 2. Punctuation removal

Recipe 3. Stop words removal

Recipe 4. Text standardization

Recipe 5. Spelling correction

Recipe 6. Tokenization

Recipe 7. Stemming

Recipe 8. Lemmatization

Recipe 9. Exploratory data analysis

Recipe 10. End-to-end processing pipeline

In [None]:
import pandas as pd
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

###Text in the form of dataframe

In [None]:
Text = [
    'This @is %introduction %to &@NLP',
    'It is likely to be #useful to people',
    'Machine &*learning is the new electricity',
    'There ^ would be less hype around AI and more action going forward',
    'python is the best tool!',
    'R is a good language',
    'I like this #&book',
    'I want more books like this'
]  # each string in the list becomes one row of the dataframe


# Creating a dataframe from the list
df=pd.DataFrame({'tweet':Text})
print(df)

                                               tweet
0                   This @is %introduction %to &@NLP
1               It is likely to be #useful to people
2          Machine &*learning is the new electricity
3  There ^ would be less hype around AI and more ...
4                           python is the best tool!
5                               R is a good language
6                                 I like this #&book
7                        I want more books like this


###Normal Text

In [None]:
text = "Natural !@ language processing (NLP) is a #!field " + \
       "of computer science, artificial intelligence@ " + \
       "and computational linguistics concerned with " + \
       "the interactions ^ between computers and human " + \
       "(natural) languages. In particular, " + \
       "concerned with & programming computers to " + \
       "fruitfully %% process large natural language " + \
       "corpora. Challenges in natural language. " + \
       "Processing frequently ##involve natural " + \
       "language understanding, natural language" + \
       "generation frequently from formal, machine" + \
       "-readable logical forms), connecting language " + \
       "and machine perception, managing human-" + \
       "computer dialog systems, or some combination " + \
       "thereof."
# text = "This is a sample sentence for tokenization."
print(text)

Natural !@ language processing (NLP) is a #!field of computer science, artificial intelligence@ and computational linguistics concerned with the interactions ^ between computers and human (natural) languages. In particular, concerned with & programming computers to fruitfully %% process large natural language corpora. Challenges in natural language. Processing frequently ##involve natural language understanding, natural languagegeneration frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.


###Task 1- LowerCasing

In [None]:
#Lowercasing strings
text=text.lower()
print(text)
print('\n\n')

# Lowercasing a dataframe column
df['tweet'] = df['tweet'].apply(lambda x: " ".join(w.lower() for w in x.split()))
print(df['tweet'])

natural !@ language processing (nlp) is a #!field of computer science, artificial intelligence@ and computational linguistics concerned with the interactions ^ between computers and human (natural) languages. in particular, concerned with & programming computers to fruitfully %% process large natural language corpora. challenges in natural language. processing frequently ##involve natural language understanding, natural languagegeneration frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.



0                     this @is %introduction %to &@nlp
1                 it is likely to be #useful to people
2            machine &*learning is the new electricity
3    there ^ would be less hype around ai and more ...
4                             python is the best tool!
5                                 r is a good language
6                                   i like this #&book

###Task 2-Punctuation Removal

In [None]:
import re
text=re.sub(r'[^\w\s]',' ',text)
print(text)
print('\n\n\n')

# Punctuation removal in dataframe
df['tweet'] = df['tweet'].apply(lambda x: " ".join(re.sub(r'[^\w\s]', '', x).split()))
print(df['tweet'])

natural    language processing  nlp  is a   field of computer science  artificial intelligence  and computational linguistics concerned with the interactions   between computers and human  natural  languages  in particular  concerned with   programming computers to fruitfully    process large natural language corpora  challenges in natural language  processing frequently   involve natural language understanding  natural languagegeneration frequently from formal  machine readable logical forms   connecting language and machine perception  managing human computer dialog systems  or some combination thereof 




0                          this is introduction to nlp
1                  it is likely to be useful to people
2              machine learning is the new electricity
3    there would be less hype around ai and more ac...
4                              python is the best tool
5                                 r is a good language
6                                     i like this book
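
The two calls above differ in one detail: the string version replaces each punctuation character with a space, while the dataframe version deletes it. The sketch below shows why that matters for hyphenated words. The sample string is made up for illustration.

```python
import re

sample = "human-computer dialog, machine-readable forms"  # hypothetical sample text

# Replace each punctuation character with a space, then collapse repeated whitespace.
spaced = " ".join(re.sub(r"[^\w\s]", " ", sample).split())

# Delete each punctuation character outright.
deleted = " ".join(re.sub(r"[^\w\s]", "", sample).split())

print(spaced)   # human computer dialog machine readable forms
print(deleted)  # humancomputer dialog machinereadable forms
```

Replacing with a space keeps the two halves of a hyphenated word as separate tokens; deleting fuses them into one.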

###Task 3- Stop words removal

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop=stopwords.words('english')
df['tweet'] = df['tweet'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))
print(df['tweet'])

0                                  introduction nlp
1                              likely useful people
2                  machine learning new electricity
3    would less hype around ai action going forward
4                                  python best tool
5                                   r good language
6                                         like book
7                                   want books like
Name: tweet, dtype: object


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


###Task 5-Spelling Correction

In [None]:
from textblob import TextBlob
# Example text with spelling errors
text = "Thes is an example sentnce with speling miskates."

# Create a TextBlob object
blob = TextBlob(text)

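# Correct spelling errors. correct() works word by word without looking at context,
# so a fix can still be wrong (in the output below, 'Thes' becomes 'The', not 'This').
corrected_text = blob.correct()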
print(corrected_text)

The is an example sentence with spelling mistakes.


###Task 6- Tokenization

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
print(text)

Thes is an example sentnce with speling miskates.


In [None]:
print(sent_tokenize(text))  # splits text into sentences with the pretrained Punkt model, not simply at every dot (.)

['Thes is an example sentnce with speling miskates.']


In [None]:
print(word_tokenize(text))  # splits into words and punctuation; note that '.' becomes its own token

['Thes', 'is', 'an', 'example', 'sentnce', 'with', 'speling', 'miskates', '.']
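
To see why a tokenizer is preferable to splitting on spaces, here is a minimal sketch comparing `str.split()` with a rough regular-expression tokenizer. The regex is only an approximation written for this example; it is not the algorithm NLTK uses.

```python
import re

sentence = "Thes is an example sentnce with speling miskates."

# Splitting on whitespace leaves the final period attached to the last word.
whitespace_tokens = sentence.split()

# Rough approximation (not NLTK's algorithm): runs of word characters,
# or any single character that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens[-1])  # miskates.
print(regex_tokens[-2:])      # ['miskates', '.']
```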


###Task 7-Stemming
Stemming is a text normalization technique in natural language processing (NLP) that aims to reduce words to their root or base form, often by removing suffixes. The goal of stemming is to map different inflections or derivations of a word to a common form so that words with the same meaning are treated as equal tokens.

It is essential to understand its limitations:

1.   Stemming can sometimes produce non-words or words that are not in common use.
2.   It may not capture irregular plurals or verb conjugations correctly.
3.   Stemming does not consider the context or meaning of words, which can lead to over-stemming (reducing unrelated words to the same stem) or under-stemming (leaving related forms of a word with different stems).



In [None]:
from nltk.stem import PorterStemmer
# Create a stemmer object
stemmer = PorterStemmer()

# Example words to be stemmed
words = ["running", "flies", "happily", "unhappiness"]

# Apply stemming to the words
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['run', 'fli', 'happili', 'unhappi']
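
The over-stemming and under-stemming limitations listed earlier can be illustrated with a short sketch. The word lists are chosen for this example only, and the exact stems may differ with another stemmer.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: words with different meanings collapse to a single stem.
over = {w: stemmer.stem(w) for w in ["universe", "university", "universal"]}
print(over)

# Under-stemming: forms of the same word end up with different stems.
under = {w: stemmer.stem(w) for w in ["alumnus", "alumni"]}
print(under)
```

In the first dictionary all three values should be identical; in the second, the two values should differ even though the words are singular and plural forms of the same noun.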


###Task 8- Lemmatization
Lemmatization is a text normalization technique in natural language processing (NLP) that reduces words to their base or dictionary form, known as the lemma. Unlike stemming, which crudely removes suffixes from words, lemmatization takes into account the meaning and part of speech of the word. The goal is to transform words to a common and meaningful root form. Lemmatization is more linguistically informed and generally produces valid words.

In [None]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()

# Example words to be lemmatized
words = ["running", "flies", "happily", "unhappiness"]

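# Apply lemmatization to the words. lemmatize() treats every word as a noun unless
# a pos argument is passed (e.g. pos='v'), which is why 'running' stays unchanged below.
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]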
print(lemmatized_words)

[nltk_data] Downloading package wordnet to /root/nltk_data...


['running', 'fly', 'happily', 'unhappiness']


**Brief Overview**

 Here's a step-by-step guide on how to standardize text in NLP:


**Lowercasing:**
Convert all text to lowercase to ensure uniformity and prevent case sensitivity issues. This is usually done to avoid treating "Word" and "word" as different tokens.

text = text.lower()

---


**Removing Special Characters:**
Remove special characters, punctuation, and symbols from the text, which are not typically essential for many NLP tasks.

import re

text = re.sub(r'[^a-zA-Z0-9]', ' ', text)



---


**Tokenization:**
Tokenization is the process of splitting text into individual words or tokens. This step is essential for various NLP tasks.

from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)



---


**Stopword Removal (optional):**
Remove common stopwords (e.g., "the," "is," "and") if they are not relevant to your analysis.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]



---


**Stemming or Lemmatization (optional):**
Reduce words to their base form to handle different word forms (e.g., "running," "ran" -> "run"). You can choose either stemming or lemmatization, depending on your specific needs.

Stemming example using the NLTK library:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

Lemmatization example using NLTK:


from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]


---


**Handling Contractions (optional):**
If you need to expand contractions (e.g., "I'm" -> "I am," "they're" -> "they are"), you can use contraction mapping dictionaries or libraries like pycontractions.
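
A minimal sketch of the mapping-dictionary approach follows. The `CONTRACTIONS` dictionary and the `expand_contractions` helper are hypothetical names made up for this example, the mapping is far from complete, and the code assumes the text has already been lowercased and uses straight apostrophes.

```python
import re

# A tiny, hypothetical mapping for illustration; real use needs a much fuller list.
CONTRACTIONS = {
    "i'm": "i am",
    "they're": "they are",
    "don't": "do not",
    "can't": "cannot",
}

# One alternation pattern that matches any key as a whole word.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)

def expand_contractions(text: str) -> str:
    """Replace each known contraction in lowercased text with its expansion."""
    return _pattern.sub(lambda match: CONTRACTIONS[match.group(0)], text)

print(expand_contractions("i'm sure they're here and don't mind."))
# i am sure they are here and do not mind.
```

Expanding contractions should happen before punctuation removal, because stripping the apostrophe first would turn "don't" into "dont" and the mapping would no longer match.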



---


**Whitespace Trimming:**
Remove extra whitespaces and trim the text.


text = ' '.join(lemmatized_tokens)

After performing these steps, your text data should be standardized and ready for various NLP tasks like text classification, sentiment analysis, or information retrieval. Remember that the specific preprocessing steps you use can vary depending on the task and the nature of your text data.
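
The steps above can be chained into one function. The sketch below is only one possible arrangement: the `standardize` name and the small `STOP_WORDS` set are assumptions made so the example runs without downloading NLTK data, and whitespace splitting stands in for `word_tokenize`.

```python
import re
from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stop word set so the sketch needs no NLTK
# downloads; swap in stopwords.words('english') for real use.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "this"}

stemmer = PorterStemmer()

def standardize(text: str) -> str:
    """Lowercase, strip special characters, tokenize, drop stop words, stem, rejoin."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]", " ", text)  # special characters become spaces
    tokens = text.split()                   # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [stemmer.stem(t) for t in tokens]
    return " ".join(tokens)                 # single spaces, no leading or trailing whitespace

print(standardize("This @is %introduction %to &@NLP"))
```

Keeping every step inside one function makes it easy to apply the same processing to a dataframe column, for example with `df['tweet'].apply(standardize)`.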

#Exploring Text data


##Basic Statistics:

In [None]:
texts = ["This is the first document. It contains some words.",
         "Here is another document with more words.",
         "The third document is shorter."]


# Calculate the number of documents
num_documents = len(texts)

# Calculate the average word count
average_word_count = sum(len(text.split()) for text in texts) / num_documents

In [None]:
print(f'No. of documents: {num_documents}')
print(f'Average no. of words per document: {average_word_count}')

No. of documents: 3
Average no. of words per document: 7.0


##Part-of-Speech Tagging:

In [None]:
# Sample text documents
texts = ["This is the first document. It contains some words.",
         "Here is another document with more words.",
         "The third document is shorter.",
         "I love second document.It is very nice and well structured ",
         "I'm fed up of these articles.I hate them"]

# Tokenize and perform part-of-speech tagging for each document
for text in texts:
    words = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(words)
    print(pos_tags)

[('This', 'DT'), ('is', 'VBZ'), ('the', 'DT'), ('first', 'JJ'), ('document', 'NN'), ('.', '.'), ('It', 'PRP'), ('contains', 'VBZ'), ('some', 'DT'), ('words', 'NNS'), ('.', '.')]
[('Here', 'RB'), ('is', 'VBZ'), ('another', 'DT'), ('document', 'NN'), ('with', 'IN'), ('more', 'RBR'), ('words', 'NNS'), ('.', '.')]
[('The', 'DT'), ('third', 'JJ'), ('document', 'NN'), ('is', 'VBZ'), ('shorter', 'JJR'), ('.', '.')]
[('I', 'PRP'), ('love', 'VBP'), ('second', 'JJ'), ('document.It', 'NN'), ('is', 'VBZ'), ('very', 'RB'), ('nice', 'JJ'), ('and', 'CC'), ('well', 'RB'), ('structured', 'VBD')]
[('I', 'PRP'), ("'m", 'VBP'), ('fed', 'VBN'), ('up', 'IN'), ('of', 'IN'), ('these', 'DT'), ('articles.I', 'JJ'), ('hate', 'VBP'), ('them', 'PRP')]


##Sentiment Analysis:

In [None]:
# Sentiment analysis using TextBlob
from textblob import TextBlob
for text in texts:
  sentiment = TextBlob(text).sentiment
  print(sentiment)

Sentiment(polarity=0.25, subjectivity=0.3333333333333333)
Sentiment(polarity=0.5, subjectivity=0.5)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.4266666666666667, subjectivity=0.5333333333333333)
Sentiment(polarity=-0.8, subjectivity=0.9)


Each Sentiment object has two properties:

**Polarity:** It measures how positive or negative the text is, ranging from **-1 (most negative) to 1 (most positive)**. In the examples above, the polarities are 0.25, 0.5, 0.0, 0.43, and -0.8.

**Subjectivity:** It measures whether the text expresses a **factual or objective statement (closer to 0) or a subjective opinion (closer to 1).**

In the examples above, the subjectivities are 0.33 (closer to objective), 0.5 (somewhat subjective), 0.0 (objective), 0.53 (somewhat subjective), and 0.9 (highly subjective).

**Here's how to interpret these examples:**

The first example has a polarity of 0.25, indicating a **slightly positive** sentiment, and a subjectivity of approximately 0.33, suggesting that the text contains **some opinion or subjectivity but is relatively objective.**

The second example has a polarity of 0.5, indicating a **more positive** sentiment, and a subjectivity of 0.5, indicating a **balanced combination of objectivity and subjectivity**.

The third example has a polarity of 0.0, indicating a **neutral sentiment**, and a subjectivity of 0.0, suggesting that the **text is factual and lacks any subjective opinion**.

The fourth example has a polarity of about 0.43, indicating a **clearly positive sentiment**, and a subjectivity of about 0.53, suggesting that the text **likely includes opinions or subjective language**.

The fifth example has a polarity of -0.8, indicating a **very negative sentiment**, and a subjectivity of 0.9, suggesting that the text is **rich in opinions or subjective language**.

Keep in mind that TextBlob's default analyzer scores text from a word lexicon, so a plainly factual sentence such as the first example can still receive a non-zero polarity.



These sentiment analysis results are often used in applications like opinion mining, social media analysis, and customer feedback analysis to understand the sentiment expressed in text data.
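
For such applications the raw polarity score is often turned into a coarse label. The sketch below shows one way to do that; the `label_polarity` helper and its 0.1 threshold are assumptions chosen for illustration, not values defined by TextBlob.

```python
def label_polarity(polarity: float, threshold: float = 0.1) -> str:
    """Map a polarity score in [-1, 1] to a coarse label.

    The 0.1 threshold is an arbitrary assumption for illustration.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Polarity scores copied from the output above.
polarities = [0.25, 0.5, 0.0, 0.4266666666666667, -0.8]
labels = [label_polarity(p) for p in polarities]
print(labels)
# ['positive', 'positive', 'neutral', 'positive', 'negative']
```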







##Topic Modeling:

In [None]:
# Topic modeling using Latent Dirichlet Allocation (LDA)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit_transform(X)

array([[0.02000495, 0.0202561 , 0.02016091, 0.0200115 , 0.91956654],
       [0.02500583, 0.02518463, 0.02512084, 0.02501354, 0.89967516],
       [0.03333838, 0.86586453, 0.03358586, 0.03334505, 0.03386619],
       [0.01818579, 0.01833237, 0.92690262, 0.01819103, 0.0183882 ],
       [0.89997811, 0.02500532, 0.02500357, 0.02501044, 0.02500257]])
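
Each row of this array is one document and each column is one topic; the values in a row are that document's topic weights and sum to roughly 1. A minimal sketch of reading off the dominant topic per document, using the numbers copied from the output above (the variable names are made up for this example):

```python
# Document-topic matrix copied from the output above
# (rows: documents, columns: topics).
doc_topic = [
    [0.02000495, 0.0202561, 0.02016091, 0.0200115, 0.91956654],
    [0.02500583, 0.02518463, 0.02512084, 0.02501354, 0.89967516],
    [0.03333838, 0.86586453, 0.03358586, 0.03334505, 0.03386619],
    [0.01818579, 0.01833237, 0.92690262, 0.01819103, 0.0183882],
    [0.89997811, 0.02500532, 0.02500357, 0.02501044, 0.02500257],
]

# Index of the largest weight in each row is that document's dominant topic.
dominant_topics = [
    max(range(len(weights)), key=weights.__getitem__) for weights in doc_topic
]

for doc_index, topic in enumerate(dominant_topics):
    weight = doc_topic[doc_index][topic]
    print(f"document {doc_index}: topic {topic} (weight {weight:.2f})")

print(dominant_topics)  # [4, 4, 1, 2, 0]
```

With five topics for only five short documents, almost every document gets a topic of its own, so a smaller `n_components` is worth trying on a corpus this small.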

## Vocabulary Size:

In [None]:
# Calculate vocabulary size
unique_words = set(" ".join(texts).split())
vocabulary_size = len(unique_words)
vocabulary_size

34
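
The count above is case-sensitive and keeps punctuation attached, so "document" and "document." are counted as different words and "document.It" counts as a single word. A minimal sketch of a normalized count follows; the regular expression is an assumption for this example and keeps only lowercase letters and apostrophes.

```python
import re

texts = ["This is the first document. It contains some words.",
         "Here is another document with more words.",
         "The third document is shorter.",
         "I love second document.It is very nice and well structured ",
         "I'm fed up of these articles.I hate them"]

# Raw vocabulary: whitespace split only, as in the cell above.
raw_vocabulary = set(" ".join(texts).split())

# Normalized vocabulary: lowercase, then keep runs of letters and apostrophes,
# so trailing punctuation is dropped and "document.It" is split in two.
normalized_vocabulary = set(re.findall(r"[a-z']+", " ".join(texts).lower()))

print(len(raw_vocabulary), len(normalized_vocabulary))
```

The normalized vocabulary is smaller because variants of the same word are merged.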