# <b><p style="background-color: #ff6200; font-family:calibri; color:white; font-size:100%; font-family:Verdana; text-align:center; border-radius:15px 50px;">Task 37-> NLP Preprocessing</p></b>

# What is Natural Language Processing (NLP)?
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) concerned with how computers process human language. Its goal is to enable computers to understand, interpret, and generate language in a way that is both meaningful and useful, which requires handling context, semantics, sentiment, and intent in text or speech. Preprocessing, the subject of this notebook, cleans and normalizes raw text before any such modelling.

## <span style='color:#ff6200'> Importing Libraries</span>

In [13]:
import pandas as pd
import nltk
import re

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from datasets import load_dataset


In [14]:
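# Tokenizer models used by word_tokenize
# (newer NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('punkt')
# Stop-word lists used by stopwords.words('english')
nltk.download('stopwords')
# WordNet data used by WordNetLemmatizer
nltk.download('wordnet')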

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## <span style='color:#ff6200'> Common Text Preprocessing Techniques</span>

In [15]:
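# Build these once; re-creating them on every call is wasteful when the
# function is applied row by row.
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()


def preprocess_text(text):
    """Lowercase, strip non-letters, drop stop words, then stem and lemmatize."""
    # Lowercasing
    text = text.lower()

    # Removing Punctuation. Characters are deleted, not replaced by a space,
    # so "curious-yellow" becomes "curiousyellow".
    text = re.sub(r'[^\w\s]', '', text)

    # Removing Stop Words. Apostrophes are already gone at this point, so
    # stop-word entries such as "don't" can no longer match ("dont").
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)

    # Removing Special Characters (underscores and non-ASCII letters)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Removing Numbers
    text = re.sub(r'\d+', '', text)

    # Tokenization
    tokens = word_tokenize(text)

    # Stemming
    tokens = [STEMMER.stem(word) for word in tokens]

    # Lemmatization. It runs on stems here, so it usually has little effect;
    # many pipelines apply either stemming or lemmatization, not both.
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    return tokens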
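
To see what each regular expression in `preprocess_text` contributes, the sketch below replays only the regex steps and returns every intermediate string. It is an illustration, not part of the original notebook: `regex_steps` is a hypothetical helper, the sample strings are made up, and the stop-word, tokenization, stemming, and lemmatization steps are left out so that it runs without NLTK.

```python
import re


def regex_steps(text):
    """Replay the regex-only steps of preprocess_text (hypothetical helper)."""
    lowered = text.lower()
    # [^\w\s] deletes punctuation; the neighbours of a deleted character are fused.
    no_punct = re.sub(r'[^\w\s]', '', lowered)
    # [^a-zA-Z0-9\s] additionally deletes underscores and non-ASCII letters.
    no_special = re.sub(r'[^a-zA-Z0-9\s]', '', no_punct)
    # \d+ deletes every run of digits.
    no_digits = re.sub(r'\d+', '', no_special)
    return lowered, no_punct, no_special, no_digits


# Made-up samples; "\u00e9" is the precomposed letter e with an acute accent.
for sample in ["Curious-Yellow 10/10", "snake_case f\u00e9minin"]:
    print(regex_steps(sample))
```

For the first sample the punctuation step already yields `curiousyellow 1010`, and the digit step leaves `curiousyellow ` with a trailing space. For the second sample the punctuation step changes nothing, because `\w` matches both the underscore and the accented letter; only the special-character step reduces it to `snakecase fminin`.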

## <span style='color:#ff6200'> Movie Reviews Dataset</span>

In [16]:
print("\nMovie Reviews Dataset Preprocessing:")
movie_reviews = load_dataset('imdb', split='train').select(range(10))  
movie_reviews_df = pd.DataFrame(movie_reviews)


Movie Reviews Dataset Preprocessing:


In [19]:
movie_reviews_df['preprocessed_text'] = movie_reviews_df['text'].apply(preprocess_text)
movie_reviews_df[['text', 'preprocessed_text']].head()

| | text | preprocessed_text |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | [rent, curiousyellow, video, store, controvers... |
| 1 | "I Am Curious: Yellow" is a risible and preten... | [curiou, yellow, risibl, pretenti, steam, pile... |
| 2 | If only to avoid making this type of film in t... | [avoid, make, type, film, futur, film, interes... |
| 3 | This film was probably inspired by Godard's Ma... | [film, probabl, inspir, godard, masculin, fmin... |
| 4 | Oh, brother...after hearing about this ridicul... | [oh, brotheraft, hear, ridicul, film, umpteen,... |
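
The fused or clipped tokens in the output above (`curiousyellow`, `brotheraft`, `fmin`) appear to come from deleting punctuation and accented letters outright. One possible alternative, sketched here under assumptions, replaces non-letters with a space and folds accents to their ASCII base letter. `normalize_text` is a hypothetical helper rather than the notebook's method, the sample strings are illustrative, and ASCII folding is only reasonable for mostly-English text.

```python
import re
import unicodedata


def normalize_text(text):
    """Lowercase, fold accents to ASCII, and turn non-letters into spaces."""
    text = text.lower()
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with errors='ignore' then drops the mark.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Replace (rather than delete) anything that is not a-z or whitespace.
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Collapse the runs of whitespace created above.
    return ' '.join(text.split())


print(normalize_text("I AM CURIOUS-YELLOW"))
print(normalize_text("Oh, brother...after hearing"))
print(normalize_text("Masculin, f\u00e9minin"))
```

The trade-off is that apostrophes also become spaces, so a contraction such as "can't" is split into "can" and "t". Whether that is acceptable depends on the stop-word list and the downstream task.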


## <span style='color:#ff6200'> Tweets Dataset</span>

In [18]:
print("\nTweets Dataset Preprocessing:")
tweets_dataset = load_dataset('tweets_hate_speech_detection', split='train').select(range(10))  
tweets_df = pd.DataFrame(tweets_dataset)


Tweets Dataset Preprocessing:


In [20]:
tweets_df['preprocessed_text'] = tweets_df['tweet'].apply(preprocess_text)
tweets_df[['tweet', 'preprocessed_text']].head()

| | tweet | preprocessed_text |
|---|---|---|
| 0 | @user when a father is dysfunctional and is so... | [user, father, dysfunct, selfish, drag, kid, d... |
| 1 | @user @user thanks for #lyft credit i can't us... | [user, user, thank, lyft, credit, cant, use, c... |
| 2 | bihday your majesty | [bihday, majesti] |
| 3 | #model i love u take with u all the time in ... | [model, love, u, take, u, time, ur] |
| 4 | factsguide: society now #motivation | [factsguid, societi, motiv] |
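
In the tweets output above, the `@user` placeholders survive as `user` tokens and hashtags lose only their `#`. If mentions carry no signal for the task, a tweet-specific cleanup can run before `preprocess_text`. The sketch below shows one way to do that; `clean_tweet` is a hypothetical helper, and dropping mentions while keeping hashtag words is an assumption about what the downstream task needs.

```python
import re


def clean_tweet(text):
    """Drop @mentions and unwrap #hashtags before general preprocessing."""
    # Remove mentions such as "@user" entirely.
    text = re.sub(r'@\w+', ' ', text)
    # Keep the hashtag word but drop the leading '#'.
    text = re.sub(r'#(\w+)', r'\1', text)
    # Collapse repeated whitespace left behind by the removals.
    return ' '.join(text.split())


print(clean_tweet("@user @user thanks for #lyft credit"))
```

In the notebook it could be chained as `tweets_df['tweet'].apply(clean_tweet).apply(preprocess_text)`.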


## <span style='color:#ff6200'> News Articles Dataset</span>

In [22]:
print("\nNews Articles Dataset Preprocessing:")
news_articles = load_dataset('ag_news', split='train').select(range(10))  
news_df = pd.DataFrame(news_articles)


News Articles Dataset Preprocessing:


In [23]:
news_df['preprocessed_text'] = news_df['text'].apply(preprocess_text)
news_df[['text', 'preprocessed_text']].head()

| | text | preprocessed_text |
|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | [wall, st, bear, claw, back, black, reuter, re... |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | [carlyl, look, toward, commerci, aerospac, reu... |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | [oil, economi, cloud, stock, outlook, reuter, ... |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | [iraq, halt, oil, export, main, southern, pipe... |
| 4 | Oil prices soar to all-time record, posing new... | [oil, price, soar, alltim, record, pose, new, ... |
