# Natural Language Processing Introduction

Natural language processing (NLP) aims to give computers the ability to understand, process, and even generate human language. This notebook introduces common preprocessing steps and demonstrates how to use a widely used transformer model (`distilbert-base-uncased-finetuned-sst-2-english`) to perform sentiment analysis. 😀😦🙁

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

In [2]:
pd.set_option('display.max_columns', 100)

## Load data

This exercise uses a small sampled dataset containing reviews of property management companies in campus towns.

In [3]:
df_b = pd.read_csv(
    'https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/businesses-sample.csv'
)
df_r = pd.read_csv(
    'https://github.com/bdi475/datasets/raw/main/campustowns-leasing-company-reviews/reviews-sample.csv',
    parse_dates = ['review_datetime_utc', 'owner_answer_timestamp_datetime_utc']
)

In [4]:
display(df_b.head(2))
display(df_r.head(2))

Unnamed: 0,campus,place_id,name,site,category,borough,street,city,postal_code,state,latitude,longitude,verified
0,Indiana University Bloomington,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,State On Campus Bloomington,https://stateoncampus.com/bloomington/?utm_sou...,Apartment complex,,2036 N Walnut St,Bloomington,47404,Indiana,39.184846,-86.532875,True
1,Indiana University Bloomington,ChIJPb8SbdpnbIgR82bkOLSKtZM,The Standard at Bloomington,https://www.thestandardbloomington.com/?utm_so...,Student housing center,,250 E 14th St,Bloomington,47408,Indiana,39.175974,-86.531609,True


Unnamed: 0,place_id,review_id,author_id,author_title,review_text,review_rating,review_img_url,review_datetime_utc,owner_answer,owner_answer_timestamp_datetime_utc,review_likes
0,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE,109839330111495228413,Aziz Bohra,"This place has my heart! Spacious rooms, quali...",5,https://lh5.googleusercontent.com/p/AF1QipM0Jm...,2023-04-01 03:19:15+00:00,Thanks for your feedback. We are grateful tha...,2022-11-01 18:54:45+00:00,0
1,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,ChZDSUhNMG9nS0VJQ0FnSUNaOC15NkhBEAE,102607480175477014087,Nessa Bacher,I’ll start with the positives of living at Sta...,4,,2023-09-25 00:17:21+00:00,We are so pleased to hear that you enjoy livin...,2023-09-18 14:20:59+00:00,0


## 🪓 Tokenization using spaCy

In [5]:
import spacy

### Trained pipelines

Trained pipelines are models that enable spaCy to predict linguistic attributes in context, such as:

- Part-of-speech tags
- Syntactic dependencies
- Named entities

`'en_core_web_sm'` is an English pipeline optimized for CPU.

Components: 

- tok2vec
- tagger
- parser
- senter
- ner
- attribute_ruler
- lemmatizer

In [6]:
nlp = spacy.load('en_core_web_sm')

In [7]:
text = 'I love this apartment'
doc = nlp(text)

for token in doc:
    print('------------------')
    print(f'text: {token.text}')
    print(f'lemma: {token.lemma_}')
    print(f'pos: {token.pos_}') # pos_ stands for part-of-speech
    print(f'explain: {spacy.explain(token.pos_)}')
    print(f'is_stop: {token.is_stop}')

------------------
text: I
lemma: I
pos: PRON
explain: pronoun
is_stop: True
------------------
text: love
lemma: love
pos: VERB
explain: verb
is_stop: False
------------------
text: this
lemma: this
pos: DET
explain: determiner
is_stop: True
------------------
text: apartment
lemma: apartment
pos: NOUN
explain: noun
is_stop: False


We can visualize the parse tree using `displacy`.

In [8]:
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)

### Tokenization and lemmatization

Tokenization takes a piece of text and breaks it down into meaningful units called "tokens." These tokens can be individual words, punctuation marks, numbers, or even phrases depending on the task and chosen method.

Lemmatization goes a step further, focusing on the "base form" or "dictionary form" of words. It groups together different grammatical variations of the same word (like "playing," "plays," "played") and reduces them to a common base form ("play"). This helps capture the meaning of the text regardless of which form of a word is used.
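
To make the two steps concrete without relying on a trained pipeline, here is a toy sketch that uses only the standard library. The regular expression and the tiny `TOY_LEMMAS` lookup table are assumptions made up for this illustration; spaCy's tokenizer and lemmatizer rely on far richer rules and data.

```python
import re

# Hypothetical lookup table for illustration only; a real lemmatizer
# uses much larger rule sets and part-of-speech information.
TOY_LEMMAS = {
    "playing": "play",
    "plays": "play",
    "played": "play",
    "rooms": "room",
}


def toy_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)


def toy_lemmatize(token):
    """Lowercase the token and map it to a base form if one is known."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


tokens = toy_tokenize("She plays, he played; they are playing!")
lemmas = [toy_lemmatize(t) for t in tokens]

print(tokens)
print(lemmas)
```

All three verb forms collapse to the same lemma, which is why counting lemmas rather than raw tokens gives a cleaner picture of what a text is about.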

In [9]:
cols = ["text", "lemma", "pos", "explain", "is_stop"]
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df_tokens = pd.DataFrame(rows, columns=cols)
df_tokens

Unnamed: 0,text,lemma,pos,explain,is_stop
0,I,I,PRON,pronoun,True
1,love,love,VERB,verb,False
2,this,this,DET,determiner,True
3,apartment,apartment,NOUN,noun,False


In [10]:
cols = ["review_id", "text", "lemma", "pos", "explain", "is_stop"]
rows = []

for index, row in df_r[df_r['review_text'].notna()].iterrows():
    doc = nlp(row['review_text'])
    for t in doc:
        new_row = [row['review_id'], t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
        rows.append(new_row)

df_tokens = pd.DataFrame(rows, columns=cols)
df_tokens

Unnamed: 0,review_id,text,lemma,pos,explain,is_stop
0,ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE,This,this,DET,determiner,True
1,ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE,place,place,NOUN,noun,False
2,ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE,has,have,VERB,verb,True
3,ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE,my,my,PRON,pronoun,True
4,ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE,heart,heart,NOUN,noun,False
...,...,...,...,...,...,...
18255,ChZDSUhNMG9nS0VJQ0FnSUNSdUpEV2FREAE,!,!,PUNCT,punctuation,False
18256,ChZDSUhNMG9nS0VJQ0FnSUNSeUp1dUVBEAE,Hooray,Hooray,PROPN,proper noun,False
18257,ChZDSUhNMG9nS0VJQ0FnSUNSeUp1dUVBEAE,!,!,PUNCT,punctuation,False
18258,ChZDSUhNMG9nS0VJQ0FnSUR4aDZQckl3EAE,wowie,wowie,PROPN,proper noun,False
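
The loop above calls `nlp` once per review. spaCy also provides `nlp.pipe`, which processes texts as a stream and is usually faster on larger datasets. The sketch below is one possible way to use it, not the approach this notebook takes: the helper name `tokens_to_rows` is hypothetical, and it leaves out the `explain` column so that the function depends only on the `nlp` object passed in.

```python
def tokens_to_rows(nlp, texts, review_ids):
    """Build one row per token, tagging each row with its review ID."""
    rows = []
    # as_tuples=True makes nlp.pipe accept (text, context) pairs and
    # yield (doc, context) pairs, so each Doc stays paired with its ID.
    for doc, review_id in nlp.pipe(zip(texts, review_ids), as_tuples=True):
        for t in doc:
            rows.append([review_id, t.text, t.lemma_, t.pos_, t.is_stop])
    return rows


# Possible usage with the objects defined in this notebook:
# df_text = df_r[df_r['review_text'].notna()]
# rows = tokens_to_rows(nlp, df_text['review_text'], df_text['review_id'])
# df_tokens = pd.DataFrame(
#     rows, columns=["review_id", "text", "lemma", "pos", "is_stop"]
# )
```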


### Remove stop words

Stop words are commonly used words in a language that are often filtered out before text is processed in NLP tasks. These words, such as "the," "a," "is," "and," and "on," carry little independent meaning and contribute minimally to the overall understanding of the text.

We remove the stop words here for two reasons:

1. Reduce noise: By removing commonly used words, we focus on the content-rich keywords that convey the core meaning of the text.
2. Improve efficiency: Removing stop words reduces the overall size of the data, making NLP tasks faster and less computationally expensive.
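
Both effects are visible even on a handful of tokens. The sketch below uses a tiny hand-picked stop word set (`TOY_STOP_WORDS`, an assumption for illustration); spaCy ships a much longer list and exposes the result through `token.is_stop`.

```python
from collections import Counter

# Tiny hand-picked stop word set for illustration only
TOY_STOP_WORDS = {"the", "a", "is", "and", "on", "this", "i", "in"}

tokens = [
    "i", "love", "this", "apartment", "and", "the", "staff",
    "is", "great", "the", "apartment", "is", "clean",
]

# Keep only the tokens that are not in the stop word set
content_tokens = [t for t in tokens if t not in TOY_STOP_WORDS]

print(f"{len(tokens)} tokens before, {len(content_tokens)} after")
print(Counter(content_tokens).most_common(2))
```

More than half of the tokens disappear, and the most frequent remaining word describes the subject of the review rather than its grammar.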

In [11]:
# keep only non-stop words
df_tokens_filtered = df_tokens[~df_tokens['is_stop']]

# keep only lemmas that are at least 4 characters long
df_tokens_filtered = df_tokens_filtered[df_tokens_filtered['lemma'].str.len() >= 4]

df_tokens_filtered

Unnamed: 0,review_id,text,lemma,pos,explain,is_stop
1,ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE,place,place,NOUN,noun,False
4,ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE,heart,heart,NOUN,noun,False
6,ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE,Spacious,spacious,ADJ,adjective,False
7,ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE,rooms,room,NOUN,noun,False
9,ChZDSUhNMG9nS0VJQ0FnSUMtbE1XaWR3EAE,quality,quality,NOUN,noun,False
...,...,...,...,...,...,...
18252,ChZDSUhNMG9nS0VJQ0FnSUNSdUpEV2FREAE,staff,staff,NOUN,noun,False
18254,ChZDSUhNMG9nS0VJQ0FnSUNSdUpEV2FREAE,awesome,awesome,ADJ,adjective,False
18256,ChZDSUhNMG9nS0VJQ0FnSUNSeUp1dUVBEAE,Hooray,Hooray,PROPN,proper noun,False
18258,ChZDSUhNMG9nS0VJQ0FnSUR4aDZQckl3EAE,wowie,wowie,PROPN,proper noun,False


Count how often each remaining lemma appears.

In [12]:
df_tokens_filtered['lemma'].value_counts()

lemma
apartment    125
live         119
place         99
staff         71
great         62
            ... 
discard        1
needle         1
massive        1
forever        1
zowa           1
Name: count, Length: 1457, dtype: int64

## 🧪 Sentiment analysis using DistilBERT

From [Hugging Face's Documentation](https://huggingface.co/docs/transformers/main/en/index):

> Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as:

> 📝 **Natural Language Processing**: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.

In [13]:
! pip install transformers

Defaulting to user installation because normal site-packages is not writeable


In [14]:
from transformers import pipeline

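# name the model explicitly instead of relying on the pipeline default
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)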


### Run sentiment classifier

In [15]:
classifier("We are very happy to show you the 🤗 Transformers library.")

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In [16]:
classifier("These thieves tried to steal my security deposit.")

[{'label': 'NEGATIVE', 'score': 0.996752142906189}]
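
Each call returns a list with one dictionary per input text, holding a `label` and a confidence `score` between 0 and 1. For later analysis it can be handy to fold both into a single signed number. The `signed_score` helper below is a hypothetical convenience written for this notebook, not part of the Transformers library, and it assumes the model only emits the labels `POSITIVE` and `NEGATIVE`.

```python
def signed_score(result):
    """Map one pipeline result dict to a value between -1 and 1."""
    # Assumption: any label other than POSITIVE is treated as negative
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]


# Results copied from the two classifier calls above
results = [
    {"label": "POSITIVE", "score": 0.9997795224189758},
    {"label": "NEGATIVE", "score": 0.996752142906189},
]

for r in results:
    print(f"{r['label']:<8} {signed_score(r):+.4f}")
```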

### Sample 30 rows for in-class exercise

Although the `distilbert-base-uncased-finetuned-sst-2-english` model is pre-trained and distilled (40% smaller than the original BERT model), running it on the entire dataset would still be slow.

For the in-class exercise, sample only 30 rows where `review_rating` is less than or equal to 4 out of 5 stars.

In [17]:
df_sample = df_r[df_r['review_rating'] <= 4].sample(30)

Create two new columns to store sentiment labels and scores.

In [18]:
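# use an object column for the string labels (avoids a dtype warning
# when strings are written into it) and a float column for the scores
df_sample['sentiment'] = pd.Series(np.nan, index=df_sample.index, dtype='object')
df_sample['score'] = np.nan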

Run sentiment analysis.

In [19]:
num_rows = df_sample.shape[0]

for i in range(num_rows):
    # grab the review text for the current row
    review_text = df_sample['review_text'].iloc[i]

    # calculate sentiment only for non-missing review texts;
    # the model supports up to 512 tokens, so truncate longer texts
    if pd.notna(review_text):
        result = classifier(
            review_text,
            truncation=True,
            padding=True,
            max_length=512
        )

        df_sample.iloc[i, df_sample.columns.get_loc('sentiment')] = result[0]['label']
        df_sample.iloc[i, df_sample.columns.get_loc('score')] = result[0]['score']

    # display progress
    progress_percentage = round((i + 1) / num_rows * 100, 2)
    print(f'{i + 1}/{num_rows} ({progress_percentage}%)', end=' ')

    # start a new line after every 10 rows
    if (i + 1) % 10 == 0:
        print('')

print('====================')
print('Complete')


1/30 (3.33%) 2/30 (6.67%) 3/30 (10.0%) 4/30 (13.33%) 5/30 (16.67%) 6/30 (20.0%) 7/30 (23.33%) 8/30 (26.67%) 9/30 (30.0%) 10/30 (33.33%) 
11/30 (36.67%) 12/30 (40.0%) 13/30 (43.33%) 14/30 (46.67%) 15/30 (50.0%) 16/30 (53.33%) 17/30 (56.67%) 18/30 (60.0%) 19/30 (63.33%) 20/30 (66.67%) 
21/30 (70.0%) 22/30 (73.33%) 23/30 (76.67%) 24/30 (80.0%) 25/30 (83.33%) 26/30 (86.67%) 27/30 (90.0%) 28/30 (93.33%) 29/30 (96.67%) 30/30 (100.0%) 
Complete
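
The loop classifies one review at a time. The pipeline also accepts a list of texts, so the same work can be done in a single call. The sketch below is one possible variant, not the method used in this notebook: `label_texts` is a hypothetical helper that takes the classifier as an argument, skips missing reviews, and keeps the results aligned with the input order.

```python
def label_texts(classifier, texts):
    """Return a (label, score) pair per text, or (None, None) if missing."""
    # Positions of entries that hold actual, non-empty text;
    # missing values read from pandas are floats (NaN), not strings
    valid_positions = [
        i for i, t in enumerate(texts) if isinstance(t, str) and t.strip()
    ]
    valid_texts = [texts[i] for i in valid_positions]

    # One call for the whole batch; truncation=True shortens texts
    # that exceed the model's maximum input length
    results = classifier(valid_texts, truncation=True) if valid_texts else []

    labeled = [(None, None)] * len(texts)
    for pos, res in zip(valid_positions, results):
        labeled[pos] = (res["label"], res["score"])
    return labeled


# Possible usage with the objects defined in this notebook:
# pairs = label_texts(classifier, df_sample['review_text'].tolist())
# df_sample['sentiment'] = [label for label, _ in pairs]
# df_sample['score'] = [score for _, score in pairs]
```

Passing the classifier in as a parameter also makes the helper easy to try out with a stand-in function before loading the real model.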


In [20]:
df_sample

Unnamed: 0,place_id,review_id,author_id,author_title,review_text,review_rating,review_img_url,review_datetime_utc,owner_answer,owner_answer_timestamp_datetime_utc,review_likes,sentiment,score
88,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,ChdDSUhNMG9nS0VJQ0FnSUN3eTZ6QXlnRRAB,107921325574085425680,J,Great place to live! But when it gets snowy it...,3,,2018-01-15 17:15:59+00:00,Thank you for your review. We are glad you are...,2018-01-15 19:12:02+00:00,0,POSITIVE,0.922939
50,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,ChZDSUhNMG9nS0VJQ0FnSUMwajlQb2NREAE,102119770053957405964,Hannah Miller,The price is inexpensive for how nice the apar...,4,,2019-10-14 21:14:49+00:00,,NaT,0,POSITIVE,0.999724
184,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,ChZDSUhNMG9nS0VJQ0FnSURnd3J6TkpREAE,117784691122254913422,Leo Cabr,,1,,2018-08-23 19:38:26+00:00,,NaT,0,,
54,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,ChdDSUhNMG9nS0VJQ0FnSUNZbWItUV9BRRAB,104270285434832150877,Anna Peura,Management here is not that great. I understan...,1,,2019-04-15 18:03:05+00:00,,NaT,7,NEGATIVE,0.999542
185,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,ChZDSUhNMG9nS0VJQ0FnSUNnMXE2N0lREAE,115211396667741977481,Derek Oden,,3,,2018-07-02 09:25:34+00:00,,NaT,0,,
19,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,ChdDSUhNMG9nS0VJQ0FnSUNlLVlhMzV3RRAB,110019894227513393757,KJen Howie,Chill on the No prorates. I promise you and yo...,1,,2022-09-18 16:01:57+00:00,"Thank you for your feedback, KJen. We are than...",2022-09-19 14:30:09+00:00,5,NEGATIVE,0.99813
148,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,ChdDSUhNMG9nS0VJQ0FnSUN3M2ItUjVnRRAB,111649617446437780997,Jacci Carey,,1,,2018-06-08 19:35:53+00:00,,NaT,0,,
207,ChIJPb8SbdpnbIgR82bkOLSKtZM,ChdDSUhNMG9nS0VJQ0FnSUNseExTTWx3RRAB,105558539338032788154,Elena Wilson,** The reason there are so many positive revie...,2,,2023-11-14 03:48:19+00:00,"Hey Elena, thank you for your review. We are i...",2023-11-14 17:27:22+00:00,2,NEGATIVE,0.999679
117,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,ChdDSUhNMG9nS0VJQ0FnSUNlZ0pfcGlBRRAB,104610055701729616913,Newmi Oldmi,Rude management and yes I’m talking about you ...,1,,2022-09-03 02:39:35+00:00,"Hi there, Newmi. We cannot find you in our sys...",2022-09-05 15:31:47+00:00,5,NEGATIVE,0.850365
168,ChIJY1yB5NJmbIgRZn7E2oF5gVQ,ChdDSUhNMG9nS0VJQ0FnSUNBcU43X2hRRRAB,117278245074325192982,jake watson,,1,,2018-09-07 20:47:49+00:00,,NaT,0,,
