# Text Data Processing

This assignment implements the following two functions for text data processing:

<ul>
    <li>
        <b>preprocess():</b> Takes a pandas.Series() containing a corpus of text data as its argument and returns an indexed vocabulary and the preprocessed tokens.
    </li>
    <li>
        <b>encode():</b> Takes two arguments: 1) a pandas.Series() (typically the preprocessed token output of preprocess()), and 2) the name of an encoding method. The supported encoding methods must include Bag-of-Words, TF-IDF, and Word2Vec.
    </li>
</ul>

In [1]:
!pip install --upgrade pip
!pip install nltk
!pip install contractions
!pip install inflect
!pip install scikit-learn 
!pip install gensim
!pip uninstall -y tensorflow
!pip install torch
!pip install transformers



In [2]:
from platform import python_version

print(python_version())

3.9.13


In [3]:
from transformers import pipeline

# Specify the model
model_id = "cardiffnlp/twitter-roberta-base-sentiment-latest"

sentiment_pipe = pipeline("sentiment-analysis", model=model_id)
print(sentiment_pipe('I hate it'))

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'label': 'negative', 'score': 0.7479290962219238}]


In [4]:
print(sentiment_pipe('I would avoid it'))

[{'label': 'negative', 'score': 0.49786096811294556}]


In [5]:
import pandas as pd
import numpy as np
import sklearn
from IPython.display import display, HTML

# Display properties
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.colheader_justify', 'center')
pd.set_option('display.precision', 2)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [6]:
import nltk
import string
import re
import inflect
import contractions
from data_pipeline import Text_Pipeline

# Download the required NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Initialize the text pipeline
text_pipeline = Text_Pipeline('CONVERT')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shaileshhemdev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/shaileshhemdev/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/shaileshhemdev/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/shaileshhemdev/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [10]:
import pandas as pd 

CORPUS = [
    "The quick brown fox jumps over the lazy dog",
    "A king's strength also includes his allies",
    "History is written by the victors",
    "An apple a day keeps the doctor away",
    "Nothing happens until something moves",
    "The 10,000,303 striped bats    aren't hanging on their feet for best.",
    "I did not like it"
    ]

# Create a pandas Series from the corpus
s = pd.Series(CORPUS)

# Obtain the preprocessed series
preprocessed_series = text_pipeline.preprocess(s)
print(preprocessed_series)

0                        quick brown fox jump lazy dog
1                      king strength also include ally
2                                 history write victor
3                           apple day keep doctor away
4                       nothing happens something move
5    ten million three hundred three strip bat hang...
6                                                 like
dtype: object
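
The task description also asks `preprocess()` for an indexed vocabulary, which the cell above does not print. One way to derive such a vocabulary from preprocessed strings is sketched below. `build_vocabulary` is a hypothetical helper, and assigning indices in alphabetical order is an assumption; the actual `Text_Pipeline` may index tokens differently.

```python
import pandas as pd


def build_vocabulary(preprocessed: pd.Series) -> dict:
    """Map each distinct whitespace-separated token to an integer index."""
    tokens = {token for doc in preprocessed for token in doc.split()}
    # Sort so the index assignment is deterministic across runs
    return {token: index for index, token in enumerate(sorted(tokens))}


# A few of the preprocessed rows shown in the output above
sample = pd.Series([
    "quick brown fox jump lazy dog",
    "history write victor",
    "like",
])
vocabulary = build_vocabulary(sample)

# Each document expressed as a list of vocabulary indices
encoded_docs = [[vocabulary[token] for token in doc.split()] for doc in sample]
```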


In [22]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map a word's POS tag to the WordNet POS constant that lemmatize() accepts.

    Parameters
    ----------
    word : str
        The word whose POS tag is looked up

    Returns
    -------
    str
        The WordNet POS constant for the word (noun if the tag is not mapped)

    """
    # Tag the word once and reuse the result; the prints are debugging output
    tagged = nltk.pos_tag([word])
    print(tagged)
    tag = tagged[0][1][0].lower()
    print(tag)
    tag_dict = {"j": wordnet.ADJ,
                "n": wordnet.NOUN,
                "v": wordnet.VERB,
                "r": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

input_text = "I hated it"

# Remove whitespace
transformed_text = Text_Pipeline.remove_whitespace(input_text)

# Make the text lower case
transformed_text = transformed_text.lower()

# Expand Contractions
transformed_text = Text_Pipeline.expand_contractions(transformed_text)
print(transformed_text)

# Remove punctuation and convert numbers
transformed_text = Text_Pipeline.remove_punctuation(transformed_text)

# Remove stopwords
filtered_words = Text_Pipeline.remove_stopwords(transformed_text)

# Lemmatize
lemmatizer = WordNetLemmatizer()
transformed_text = ' '.join([lemmatizer.lemmatize(w, pos=get_wordnet_pos(w)) for w in filtered_words])

print(transformed_text)

i hated it
[('hated', 'VBN')]
v
hat


In [None]:
import pandas as pd

# Get matrix using BOW
matrix, column_names = text_pipeline.encode(preprocessed_series, 'BOW')

result = pd.DataFrame(
    data=matrix.toarray(), 
    index=preprocessed_series.values, 
    columns=column_names
)

result.head()

In [None]:
# Get matrix using TF-IDF
matrix, column_names = text_pipeline.encode(preprocessed_series, 'TFIDF')

result = pd.DataFrame(
    data=matrix.toarray(), 
    index=preprocessed_series.values, 
    columns=column_names
)

result.head()

In [None]:
# Get word vectors using Word2Vec
matrix = text_pipeline.encode(preprocessed_series, 'WordToVec')

result = pd.DataFrame(
    data=matrix.vectors, 
    index=matrix.key_to_index.keys()
)

result.head()

Next, we classify the sentiment of raw review text using the pretrained language model (`sentiment_pipe`) loaded above.

In [None]:
def analyze_sentiment(text):
    """Return the sentiment label that sentiment_pipe predicts for the given text."""
    result = sentiment_pipe(text)
    return result[0]['label']

In [None]:

# Analyze the sentiment of a few sentences
amazon_reviews = [
    "My kiddos liked it!",
    "Amazon, please buy the show! I'm hooked!",
]

#amazon_reviews = df1['text'].values

# Analyze sentiment for each review
sentiments = [analyze_sentiment(review) for review in amazon_reviews]

In [None]:
print(sentiments)

In [None]:
print(type(amazon_reviews))

In [None]:
print(type(preprocessed_series))