
In [1]:
import pandas as pd

tweets_df = pd.read_csv('/content/Tweets.csv')
print(tweets_df.head())

             tweet_id airline_sentiment  airline_sentiment_confidence  \
0  570306133677760513           neutral                        1.0000   
1  570301130888122368          positive                        0.3486   
2  570301083672813571           neutral                        0.6837   
3  570301031407624196          negative                        1.0000   
4  570300817074462722          negative                        1.0000   

  negativereason  negativereason_confidence         airline  \
0            NaN                        NaN  Virgin America   
1            NaN                     0.0000  Virgin America   
2            NaN                        NaN  Virgin America   
3     Bad Flight                     0.7033  Virgin America   
4     Can't Tell                     1.0000  Virgin America   

  airline_sentiment_gold        name negativereason_gold  retweet_count  \
0                    NaN     cairdin                 NaN              0   
1                    NaN    jnar

In [2]:
tweets = tweets_df['text']
print("Extracted tweets (first 5):")
for i, tweet in enumerate(tweets.head()):
    print(f"{i+1}. {tweet}")

Extracted tweets (first 5):
1. @VirginAmerica What @dhepburn said.
2. @VirginAmerica plus you've added commercials to the experience... tacky.
3. @VirginAmerica I didn't today... Must mean I need to take another trip!
4. @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
5. @VirginAmerica and it's a really big bad thing about it


# Task
Preprocess the text in the `tweets` Series by cleaning, tokenizing, removing stopwords, and lemmatizing the tokens, and then join the processed tokens back into a string.

## Clean Text

### Subtask:
Remove mentions, URLs, special characters, and convert text to lowercase.


**Reasoning**:
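The subtask requires removing mentions, URLs, and special characters and converting the text to lowercase. I will define a function that performs these steps with regular expressions and apply it to the `tweets` Series. The character filter keeps only letters and whitespace, so it also strips digits, apostrophes ("you've" becomes "youve") and the `#` of hashtags, and it leaves `amp` behind wherever a tweet contains the HTML entity `&amp;`.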



In [3]:
import re

def clean_text(text):
    # Remove mentions
    text = re.sub(r'@\w+', '', text)
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove special characters and numbers, keep spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    return text.strip()

cleaned_tweets = tweets.apply(clean_text)

print("Original tweets (first 5):")
for i, tweet in enumerate(tweets.head()):
    print(f"{i+1}. {tweet}")

print("\nCleaned tweets (first 5):")
for i, tweet in enumerate(cleaned_tweets.head()):
    print(f"{i+1}. {tweet}")

Original tweets (first 5):
1. @VirginAmerica What @dhepburn said.
2. @VirginAmerica plus you've added commercials to the experience... tacky.
3. @VirginAmerica I didn't today... Must mean I need to take another trip!
4. @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
5. @VirginAmerica and it's a really big bad thing about it

Cleaned tweets (first 5):
1. what  said
2. plus youve added commercials to the experience tacky
3. i didnt today must mean i need to take another trip
4. its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse
5. and its a really big bad thing about it
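
The cleaned output above shows two side effects: a double space where a mention was removed mid-sentence ("what  said") and a stray `amp` left over from `&amp;`. A minimal sketch of a variant that avoids both is shown below. The name `clean_text_v2` is hypothetical and not part of the notebook; it only uses the standard-library `html` and `re` modules.

```python
import html
import re

def clean_text_v2(text):
    """Hypothetical variant of clean_text: decodes HTML entities and collapses whitespace."""
    text = html.unescape(text)                    # '&amp;' becomes '&', which the filter below removes
    text = re.sub(r'@\w+', ' ', text)             # remove mentions
    text = re.sub(r'http\S+|www\S+', ' ', text)   # remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)       # keep only letters and whitespace
    text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace into one space
    return text.lower().strip()

print(clean_text_v2('@VirginAmerica What @dhepburn said.'))
print(clean_text_v2('faces &amp; they have little recourse'))
```

If adopted, it could be applied the same way as the original function, for example `tweets.apply(clean_text_v2)`; the later outputs in this notebook were produced with the original `clean_text`.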


## Tokenize Text

### Subtask:
Tokenize the cleaned text into individual words.

**Reasoning**:
The next step is to tokenize the cleaned text. I will use NLTK's `word_tokenize` function for this, but first I need to make sure the tokenizer data it depends on (`punkt_tab`) is downloaded.



In [9]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

def tokenize_text(text):
    return word_tokenize(text)

tokenized_tweets = cleaned_tweets.apply(tokenize_text)

print("Cleaned tweets (first 5):")
for i, tweet in enumerate(cleaned_tweets.head()):
    print(f"{i+1}. {tweet}")

print("\nTokenized tweets (first 5):")
for i, tokens in enumerate(tokenized_tweets.head()):
    print(f"{i+1}. {tokens}")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Cleaned tweets (first 5):
1. what  said
2. plus youve added commercials to the experience tacky
3. i didnt today must mean i need to take another trip
4. its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse
5. and its a really big bad thing about it

Tokenized tweets (first 5):
1. ['what', 'said']
2. ['plus', 'youve', 'added', 'commercials', 'to', 'the', 'experience', 'tacky']
3. ['i', 'didnt', 'today', 'must', 'mean', 'i', 'need', 'to', 'take', 'another', 'trip']
4. ['its', 'really', 'aggressive', 'to', 'blast', 'obnoxious', 'entertainment', 'in', 'your', 'guests', 'faces', 'amp', 'they', 'have', 'little', 'recourse']
5. ['and', 'its', 'a', 'really', 'big', 'bad', 'thing', 'about', 'it']


# Task
Remove stopwords from the `tokenized_tweets` Series.

## Remove Stopwords

### Subtask:
Eliminate common words that do not carry significant meaning from the tokenized text.


**Reasoning**:
I need to ensure the NLTK 'stopwords' corpus is downloaded and then define and apply a function to remove stopwords from the tokenized text.



In [10]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Create a set of English stopwords for efficient lookup
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

filtered_tweets = tokenized_tweets.apply(remove_stopwords)

print("Tokenized tweets (first 5):")
for i, tokens in enumerate(tokenized_tweets.head()):
    print(f"{i+1}. {tokens}")

print("\nFiltered tweets (first 5) (stopwords removed):")
for i, tokens in enumerate(filtered_tweets.head()):
    print(f"{i+1}. {tokens}")

Tokenized tweets (first 5):
1. ['what', 'said']
2. ['plus', 'youve', 'added', 'commercials', 'to', 'the', 'experience', 'tacky']
3. ['i', 'didnt', 'today', 'must', 'mean', 'i', 'need', 'to', 'take', 'another', 'trip']
4. ['its', 'really', 'aggressive', 'to', 'blast', 'obnoxious', 'entertainment', 'in', 'your', 'guests', 'faces', 'amp', 'they', 'have', 'little', 'recourse']
5. ['and', 'its', 'a', 'really', 'big', 'bad', 'thing', 'about', 'it']

Filtered tweets (first 5) (stopwords removed):
1. ['said']
2. ['plus', 'youve', 'added', 'commercials', 'experience', 'tacky']
3. ['didnt', 'today', 'must', 'mean', 'need', 'take', 'another', 'trip']
4. ['really', 'aggressive', 'blast', 'obnoxious', 'entertainment', 'guests', 'faces', 'amp', 'little', 'recourse']
5. ['really', 'big', 'bad', 'thing']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
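
In the filtered output above, `youve` and `didnt` survive because the cleaning step removed their apostrophes, while NLTK's English stopword list stores such contractions with the apostrophe (for example "you've" and "didn't"). One possible workaround, sketched below, is to add apostrophe-free variants to the stopword set. The small `sample_stop_words` set is a stand-in for `stopwords.words('english')` so the example runs on its own, and both it and `remove_stopwords_expanded` are hypothetical names.

```python
# Stand-in for set(stopwords.words('english')); the real list is much longer.
sample_stop_words = {"i", "to", "the", "its", "it's", "you've", "didn't"}

# Add apostrophe-free variants so they match tokens produced after cleaning.
expanded_stop_words = sample_stop_words | {w.replace("'", "") for w in sample_stop_words}

def remove_stopwords_expanded(tokens):
    """Drop tokens that match a stopword with or without its apostrophe."""
    return [t for t in tokens if t not in expanded_stop_words]

tokens = ['plus', 'youve', 'added', 'commercials', 'to', 'the', 'experience', 'tacky']
print(remove_stopwords_expanded(tokens))
```

Whether negations such as `didnt` should be dropped at all depends on the downstream task; for sentiment analysis it may be preferable to keep them.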


# Task
Download the necessary NLTK corpora (`wordnet` and `omw-1.4`), lemmatize the tokens in the `filtered_tweets` Series, join the lemmatized tokens back into a single string for each tweet, and display the first few lemmatized tweets.

## Lemmatize Tokens

### Subtask:
Reduce tokens to their base or dictionary form using NLTK's WordNetLemmatizer.


**Reasoning**:
To normalize the text further, I will lemmatize the filtered tokens so that inflected forms of a word map to one dictionary form (e.g., 'commercials' becomes 'commercial'). This uses NLTK's `WordNetLemmatizer`, which requires downloading the 'wordnet' and 'omw-1.4' corpora. Because `lemmatize` treats every word as a noun unless a part of speech is passed, verb forms such as 'added' and 'said' are left unchanged. I will download the corpora, import and initialize `WordNetLemmatizer`, define a function that lemmatizes each token, apply it to the `filtered_tweets` Series, and display the first five lemmatized token lists.



In [18]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer

# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

lemmatized_tweets = filtered_tweets.apply(lemmatize_tokens)

print("Filtered tweets (first 5):")
for i, tokens in enumerate(filtered_tweets.head()):
    print(f"{i+1}. {tokens}")

print("\nLemmatized tweets (first 5):")
for i, tokens in enumerate(lemmatized_tweets.head()):
    print(f"{i+1}. {tokens}")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Filtered tweets (first 5):
1. ['said']
2. ['plus', 'youve', 'added', 'commercials', 'experience', 'tacky']
3. ['didnt', 'today', 'must', 'mean', 'need', 'take', 'another', 'trip']
4. ['really', 'aggressive', 'blast', 'obnoxious', 'entertainment', 'guests', 'faces', 'amp', 'little', 'recourse']
5. ['really', 'big', 'bad', 'thing']

Lemmatized tweets (first 5):
1. ['said']
2. ['plus', 'youve', 'added', 'commercial', 'experience', 'tacky']
3. ['didnt', 'today', 'must', 'mean', 'need', 'take', 'another', 'trip']
4. ['really', 'aggressive', 'blast', 'obnoxious', 'entertainment', 'guest', 'face', 'amp', 'little', 'recourse']
5. ['really', 'big', 'bad', 'thing']
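
Only plural nouns changed in the output above ('commercials', 'guests', 'faces'), because `lemmatizer.lemmatize(word)` assumes a noun by default. A common refinement is to pass a part of speech derived from a tagger. The sketch below shows the mapping step only, assuming Penn Treebank tags such as those returned by NLTK's `pos_tag`; `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helpers, and the lemmatizer is passed in as a callable so the block runs without NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r', or 'n')."""
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # anything else is treated as a noun, like the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(word, pos) for each (word, tag) pair and return the results."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Assumed usage in this notebook, tagging the tokens before they are lemmatized:
#   from nltk import pos_tag
#   lemmatize_tagged(pos_tag(tokens), lemmatizer.lemmatize)
print(penn_to_wordnet('VBD'))
```

Tagging tends to be more reliable on tokens that still include stopwords and original word forms, so this variant would fit best if tagging ran earlier in the pipeline.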


In [20]:
def join_tokens(tokens):
    return ' '.join(tokens)

preprocessed_tweets = lemmatized_tweets.apply(join_tokens)

print("Preprocessed tweets (first 5):")
for i, tweet in enumerate(preprocessed_tweets.head()):
    print(f"{i+1}. {tweet}")

Preprocessed tweets (first 5):
1. said
2. plus youve added commercial experience tacky
3. didnt today must mean need take another trip
4. really aggressive blast obnoxious entertainment guest face amp little recourse
5. really big bad thing


# PoS Tagging on Lemmatized Tweets

In [22]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')
from nltk import pos_tag

def tag_tokens(tokens):
    return pos_tag(tokens)

pos_tagged_tweets = lemmatized_tweets.apply(tag_tokens)

print("Lemmatized tweets (first 5):")
for i, tokens in enumerate(lemmatized_tweets.head()):
    print(f"{i+1}. {tokens}")

print("\nPOS Tagged tweets (first 5):")
for i, tagged_tokens in enumerate(pos_tagged_tweets.head()):
    print(f"{i+1}. {tagged_tokens}")

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


Lemmatized tweets (first 5):
1. ['said']
2. ['plus', 'youve', 'added', 'commercial', 'experience', 'tacky']
3. ['didnt', 'today', 'must', 'mean', 'need', 'take', 'another', 'trip']
4. ['really', 'aggressive', 'blast', 'obnoxious', 'entertainment', 'guest', 'face', 'amp', 'little', 'recourse']
5. ['really', 'big', 'bad', 'thing']

POS Tagged tweets (first 5):
1. [('said', 'VBD')]
2. [('plus', 'CC'), ('youve', 'NN'), ('added', 'JJ'), ('commercial', 'JJ'), ('experience', 'NN'), ('tacky', 'NN')]
3. [('didnt', 'NN'), ('today', 'NN'), ('must', 'MD'), ('mean', 'VB'), ('need', 'MD'), ('take', 'VB'), ('another', 'DT'), ('trip', 'NN')]
4. [('really', 'RB'), ('aggressive', 'JJ'), ('blast', 'NN'), ('obnoxious', 'JJ'), ('entertainment', 'NN'), ('guest', 'JJS'), ('face', 'NN'), ('amp', 'NN'), ('little', 'JJ'), ('recourse', 'NN')]
5. [('really', 'RB'), ('big', 'JJ'), ('bad', 'JJ'), ('thing', 'NN')]


# Task
Define a Python function to detect hashtags (words starting with '#') from a list of tokens, apply this function to the `lemmatized_tweets` Series, and display the first few detected hashtags.

## Define Hashtag Detector Function

### Subtask:
Create a Python function that takes a list of tokens and returns a list of identified hashtags (words starting with '#').


**Reasoning**:
I will define the `detect_hashtags` function as specified: it iterates through a list of tokens, collects those starting with '#', and returns them. I will then demonstrate it on a sample tweet from `lemmatized_tweets`, with two hashtags appended manually because the preprocessed tokens no longer contain any.



In [23]:
def detect_hashtags(tokens):
    hashtags = []
    for token in tokens:
        if token.startswith('#'):
            hashtags.append(token)
    return hashtags

# Example usage with a sample tweet from lemmatized_tweets
# For demonstration, let's manually add some hashtags to a lemmatized tweet sample
sample_tweet_tokens = lemmatized_tweets.iloc[3] + ['#aggressive', '#obnoxious']

detected_hashtags = detect_hashtags(sample_tweet_tokens)

print(f"Original tokens (sample with added hashtags): {sample_tweet_tokens}")
print(f"Detected hashtags: {detected_hashtags}")

Original tokens (sample with added hashtags): ['really', 'aggressive', 'blast', 'obnoxious', 'entertainment', 'guest', 'face', 'amp', 'little', 'recourse', '#aggressive', '#obnoxious']
Detected hashtags: ['#aggressive', '#obnoxious']


**Reasoning**:
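The `detect_hashtags` function has been defined and tested with a sample. Now, I will apply this function to the entire `lemmatized_tweets` Series and display the first few results. Because `clean_text` removed every non-letter character, including `#`, before tokenization, no token in `lemmatized_tweets` can start with '#', so every list is expected to be empty.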



In [24]:
all_detected_hashtags = lemmatized_tweets.apply(detect_hashtags)

print("Detected hashtags for first 5 tweets:")
for i, hashtags in enumerate(all_detected_hashtags.head()):
    print(f"{i+1}. {hashtags}")

Detected hashtags for first 5 tweets:
1. []
2. []
3. []
4. []
5. []
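
Every list above is empty because the `#` character was stripped during cleaning, before the tokens reached `detect_hashtags`. To recover hashtags, detection would have to run on the raw tweet text instead. The following is a minimal sketch of that idea using a regular expression; `extract_hashtags` is a hypothetical helper, and the sample tweet is made up for illustration rather than taken from the dataset.

```python
import re

# A hashtag is taken here to be '#' followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r'#\w+')

def extract_hashtags(text):
    """Return the hashtags found in raw (uncleaned) tweet text, lowercased."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]

# Made-up sample tweet, not from Tweets.csv.
sample = "@VirginAmerica great flight today! #HappyCustomer #travel"
print(extract_hashtags(sample))
```

In the notebook this could be applied to the unprocessed Series, for example `tweets.apply(extract_hashtags)`.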
