# Project Part 3

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/rjbeer/CS39AA-Project/blob/main/ProjectPart2.ipynb)


# Using a pre-trained model

In Part 2, the Starbucks reviews dataset was used to train and test default models. Here a pre-trained model, specifically BERT from assignment 5, is fine-tuned on the same data, and the results are collected for comparison against those from Part 2.

First the data must be cleaned. In Part 2 the data had to be trimmed by cutting all 1-star reviews; here they are kept. Below is the initial data load, followed by the data cleaning.

In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch.nn.functional as F
import torch.cuda

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/starbucks-reviews-dataset/reviews_data.csv


In [4]:
df = pd.read_csv("/kaggle/input/starbucks-reviews-dataset/reviews_data.csv")
#df.columns = ['label', 'text']
df.head()

Unnamed: 0,name,location,Date,Rating,Review,Image_Links
0,Helen,"Wichita Falls, TX","Reviewed Sept. 13, 2023",5.0,Amber and LaDonna at the Starbucks on Southwes...,['No Images']
1,Courtney,"Apopka, FL","Reviewed July 16, 2023",5.0,** at the Starbucks by the fire station on 436...,['No Images']
2,Daynelle,"Cranberry Twp, PA","Reviewed July 5, 2023",5.0,I just wanted to go out of my way to recognize...,['https://media.consumeraffairs.com/files/cach...
3,Taylor,"Seattle, WA","Reviewed May 26, 2023",5.0,Me and my friend were at Starbucks and my card...,['No Images']
4,Tenessa,"Gresham, OR","Reviewed Jan. 22, 2023",5.0,I’m on this kick of drinking 5 cups of warm wa...,['https://media.consumeraffairs.com/files/cach...


In [5]:
df.isnull().sum()

name             0
location         0
Date             0
Rating         145
Review           0
Image_Links      0
dtype: int64

In [6]:
df.shape

(850, 6)

In [7]:
df.columns.tolist()

['name', 'location', 'Date', 'Rating', 'Review', 'Image_Links']

In [8]:
X_target = df[df['Rating'].isnull()]
X_target.head()

Unnamed: 0,name,location,Date,Rating,Review,Image_Links
704,James,"Kansas City, MO","Reviewed July 25, 2011",,I just wanted to amend my email the I sent to ...,['No Images']
705,James,"Kansas City, MO","Reviewed July 25, 2011",,"Recently, I have gone to your Starbucks at Bar...",['No Images']
706,Mike,"Revere, ma","Reviewed June 26, 2011",,Upon my first visit to this location on my way...,['No Images']
707,Hughes,"Macclesfield, Other","Reviewed Jan. 13, 2011",,"Recently, British Royal Marines in Iraq wrote ...",['No Images']
708,Sherrilynn,"Jenison, MI","Reviewed Jan. 4, 2011",,"On the way to catch our plane, we got a medium...",['No Images']


In [9]:
X_target.shape

(145, 6)

In [10]:
# Create a deep copy of the data to avoid compromising the integrity of the original.
# The deep copy can be safely manipulated and altered, and if a new copy needs to be made
# the original is still intact.

df_copy = df.copy(deep=True)
df_copy.dropna(inplace=True)
df_copy.isnull().sum()

name           0
location       0
Date           0
Rating         0
Review         0
Image_Links    0
dtype: int64

In [11]:
df_copy.shape

(705, 6)

In [12]:
df_copy.drop(['name', 'location', 'Date', 'Image_Links'], axis=1, inplace=True)
df_copy.head()

Unnamed: 0,Rating,Review
0,5.0,Amber and LaDonna at the Starbucks on Southwes...
1,5.0,** at the Starbucks by the fire station on 436...
2,5.0,I just wanted to go out of my way to recognize...
3,5.0,Me and my friend were at Starbucks and my card...
4,5.0,I’m on this kick of drinking 5 cups of warm wa...


In [None]:
#df_copy = df_copy.drop(df_copy[df_copy['Rating'] == 1].index)

#print(df_copy)

In [13]:
df_copy['label'] = np.where(df_copy['Rating'] > 3, 1, 0)  # Assign 1 when the rating is above 3 (4 or 5 stars), else assign 0.
df_copy.sample(5)

Unnamed: 0,Rating,Review,label
676,1.0,I just recently moved to Renton Washington and...,0
584,2.0,I'm a regular customer at Starbucks and I neve...,0
397,1.0,Starbucks sucks! Last night I stop with my wif...,0
363,5.0,Great service. Never have problems with this c...,1
489,2.0,I was visiting the Starbucks at 2183 Vista Way...,0
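
The labelling rule above can be checked in isolation. The following is a minimal sketch in plain Python (no pandas); `rating_to_label` is a hypothetical helper that is not part of the notebook. It mirrors the `Rating > 3` comparison, so 4- and 5-star reviews become 1 and everything else, including 3-star reviews, becomes 0.

```python
def rating_to_label(rating: float, threshold: float = 3.0) -> int:
    """Return 1 for ratings strictly above the threshold, else 0."""
    return 1 if rating > threshold else 0


# Map every possible star rating to its binary sentiment label.
labels = {rating: rating_to_label(rating) for rating in (1.0, 2.0, 3.0, 4.0, 5.0)}
print(labels)
```

Because the comparison is strict, a 3-star review is treated as negative, which matches the sampled rows shown above.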


In [14]:
# split data into train and validation sets: X_train_raw/y_train and X_val_raw/y_val
from sklearn.model_selection import train_test_split

X = df_copy['Review'].copy()
y = df_copy['label'].copy()

X_train_raw, X_val_raw, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

Now the BERT model will be downloaded and imported.

In [15]:


import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification,  TrainingArguments, Trainer
from datasets import Dataset, load_metric



In [16]:


MODEL_NAME = "bert-base-cased"
MAX_LENGTH=50

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
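# The labels built above are binary (0 and 1), so num_labels=2 would be sufficient;
# num_labels=3 is what produced the three-way probabilities printed in the cells below.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3, max_length=MAX_LENGTH, output_attentions=False, output_hidden_states=False)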



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
classes = df_copy.label.unique().tolist()
class_tok2idx = dict((v, k) for k, v in enumerate(classes))
class_idx2tok = dict((k, v) for k, v in enumerate(classes))
print(class_tok2idx)
print(class_idx2tok)

{1: 0, 0: 1}
{0: 1, 1: 0}


In [21]:
#df_copy['label'] = df_copy['Rating'].apply(lambda x: class_tok2idx[x])
#df_copy.head()

sequence_0 = "We had to correct them on our order 3 times. They never got it right then the manager came over to us and said we made her employee uncomfortable because we were trying to correct our order. The manager tried was racist against my stepmom (Chinese) taking over her but when I (**) would talk she would stop talking and listen to me."
seq0_tokens = tokenizer(sequence_0, return_tensors="pt")
print(f"number of tokens in seq0 is {len(seq0_tokens['input_ids'].flatten())}")
print(seq0_tokens)
F.softmax(model(**seq0_tokens).logits, dim=1)

number of tokens in seq0 is 75
{'input_ids': tensor([[  101,  1284,  1125,  1106,  5663,  1172,  1113,  1412,  1546,   124,
          1551,   119,  1220,  1309,  1400,  1122,  1268,  1173,  1103,  2618,
          1338,  1166,  1106,  1366,  1105,  1163,  1195,  1189,  1123,  7775,
          8504,  1272,  1195,  1127,  1774,  1106,  5663,  1412,  1546,   119,
          1109,  2618,  1793,  1108, 18848,  1222,  1139,  2585,  3702,  1306,
           113,  1922,   114,  1781,  1166,  1123,  1133,  1165,   146,   113,
           115,   115,   114,  1156,  2037,  1131,  1156,  1831,  2520,  1105,
          5113,  1106,  1143,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

tensor([[0.3780, 0.3840, 0.2380]], grad_fn=<SoftmaxBackward0>)
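
The probabilities above are close to uniform, which is what one would expect from a classification head that has only just been initialised: its logits are all small, so softmax spreads the mass fairly evenly. As a rough illustration in plain Python, `softmax` below is a hypothetical stand-in for `F.softmax` applied to a single row, and `example_logits` are made-up values, not logits taken from the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits only: small values of similar size, as an untrained head might emit.
example_logits = [0.10, 0.12, -0.35]
probs = softmax(example_logits)
print([round(p, 4) for p in probs])
```

Until the head is fine-tuned, differences between classes at this scale should not be read as a meaningful prediction.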

In [22]:
sequence_1 = "Amber and LaDonna at the Starbucks on Southwest Parkway are always so warm and welcoming. There is always a smile in their voice when they greet you at the drive-thru. And their customer service is always spot-on, they always get my order right and with a smile. I would actually give them more than 5 stars if they were available."
seq1_tokens = tokenizer(sequence_1, return_tensors="pt")
print(f"number of tokens in seq1 is {len(seq1_tokens['input_ids'].flatten())}")
F.softmax(model(**seq1_tokens).logits, dim=1)

number of tokens in seq1 is 77


tensor([[0.3659, 0.4244, 0.2097]], grad_fn=<SoftmaxBackward0>)

In [23]:
sequence_2 = "Every time I try to buy a Strawberry Refresher Starbucks never has strawberries to put into the drink. How is the drink called Strawberry Refresher and you guys never have any damn strawberries. It seems every time we go on the Starbucks to order a specialty drink you guys are constantly out of it. Itâ€™s like calling a pizza place, and theyâ€™re telling us theyâ€™re out of cheese. I donâ€™t give you a partial payment so I donâ€™t expect a partial drink."
seq2_tokens = tokenizer(sequence_2, return_tensors="pt")
print(f"number of tokens in seq2 is {len(seq2_tokens['input_ids'].flatten())}")
F.softmax(model(**seq2_tokens).logits, dim=1)

number of tokens in seq2 is 106


tensor([[0.3557, 0.4469, 0.1973]], grad_fn=<SoftmaxBackward0>)

In [24]:
sequence_3 = "I walk to the Starbucks near my house every other day and get the same drink every day, chai tea latte. The service is usually great and I don't have to wait that long at all. The employees are friendly and kind. This goes for other locations that I visit. Unfortunately, the quality of Starbucks products has significantly declined and the prices have skyrocketed but that is understandable given the current economy. Every time I get a chai tea latte, it either tastes like water or nothing but milk, and on top of that, I am paying a premium for the venti size."
seq3_tokens = tokenizer(sequence_3, return_tensors="pt")
print(f"number of tokens in seq3 is {len(seq3_tokens['input_ids'].flatten())}")
F.softmax(model(**seq3_tokens).logits, dim=1)

number of tokens in seq3 is 130


tensor([[0.3690, 0.4159, 0.2150]], grad_fn=<SoftmaxBackward0>)

In [25]:


ds_raw = Dataset.from_pandas(df_copy[['label','Review']])
ds_raw[0]



{'label': 1,
 'Review': 'Amber and LaDonna at the Starbucks on Southwest Parkway are always so warm and welcoming. There is always a smile in their voice when they greet you at the drive-thru. And their customer service is always spot-on, they always get my order right and with a smile. I would actually give them more than 5 stars if they were available.',
 '__index_level_0__': 0}

In [27]:


def tokenize_function(examples):
    return tokenizer(examples["Review"], padding="max_length", truncation=True, max_length=MAX_LENGTH)

ds = ds_raw.map(tokenize_function, batched=True)
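
With `padding="max_length"` and `truncation=True`, every review ends up as exactly `MAX_LENGTH` token ids. The sketch below imitates that behaviour in plain Python for a single sequence. `pad_or_truncate` is a hypothetical helper, not a tokenizer API, and it assumes the special ids 101 for `[CLS]`, 102 for `[SEP]` and 0 for `[PAD]`, which agree with the ids visible in this notebook's outputs.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids for bert-base-cased


def pad_or_truncate(token_ids, max_length):
    """Fit word-piece ids (without special tokens) into a fixed-length sequence.

    Returns (input_ids, attention_mask), both of length max_length.
    """
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


# A short sequence gets padded, a long one gets cut.
short_ids, short_mask = pad_or_truncate([1284, 1125, 1106], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `MAX_LENGTH=50`, reviews of 75 to 130 tokens such as the sample sequences above keep only their first 48 word pieces plus the two special tokens, so the model sees only the opening of each longer review.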




In [28]:
ds[0]

{'label': 1,
 'Review': 'Amber and LaDonna at the Starbucks on Southwest Parkway are always so warm and welcoming. There is always a smile in their voice when they greet you at the drive-thru. And their customer service is always spot-on, they always get my order right and with a smile. I would actually give them more than 5 stars if they were available.',
 '__index_level_0__': 0,
 'input_ids': [101, 11623, 1105, 2001, 2137, 1320, 1605, 1120, 1103, 2537, 7925, 8770, 1113, 10859, 14293, 1132, 1579, 1177, 3258, 1105, 20028, 119, 1247, 1110, 1579, 170, 2003, 1107, 1147, 1490, 1165, 1152, 18884, 1128, 1120, 1103, 2797, 118, 24438, 5082, 119, 1262, 1147, 8132, 1555, 1110, 1579, 3205, 118, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,


In [29]:
ds = ds.shuffle(seed=42)
ds[0]

{'label': 0,
 'Review': 'On 3/12, my wife went to the Starbucks on Highway 20 in Yuba City and ordered a granda and a venita Vanilla Chai Tea, both drinks were too hot to drink for almost 45 minutes. By this time, we were 35 miles away, and once we could drink them, we found that there was no vanilla. We would have taken it back if we were not so far from the store. My wife also had an apple fritter that was hard as steel.',
 '__index_level_0__': 663,
 'input_ids': [101, 1212, 124, 120, 1367, 117, 1139, 1676, 1355, 1106, 1103, 2537, 7925, 8770, 1113, 3580, 1406, 1107, 10684, 2822, 1392, 1105, 2802, 170, 5372, 1161, 1105, 170, 1396, 2605, 1777, 3605, 5878, 24705, 1182, 15832, 117, 1241, 8898, 1127, 1315, 2633, 1106, 3668, 1111, 1593, 2532, 1904, 119, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [31]:
train_prop = 0.85
ds_train = ds.select(range(int(len(ds)*train_prop)))
ds_eval = ds.select(range(int(len(ds)*train_prop), len(ds)))

In [32]:
print(f"len(ds_train) = {len(ds_train)}")
print(f"len(ds_eval) = {len(ds_eval)}")

len(ds_train) = 599
len(ds_eval) = 106
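
The sizes printed above follow directly from the slicing arithmetic: `int()` truncates the fractional row count, and the evaluation set takes whatever is left. A minimal sketch, where `split_sizes` is a hypothetical helper and 705 is the row count reported by `df_copy.shape` earlier:

```python
def split_sizes(n_rows, train_prop):
    """Mirror ds.select(range(int(n * p))) and ds.select(range(int(n * p), n))."""
    n_train = int(n_rows * train_prop)  # int() truncates toward zero
    return n_train, n_rows - n_train


n_train, n_eval = split_sizes(705, 0.85)
print(f"train = {n_train}, eval = {n_eval}")
```

Since the dataset is shuffled with a fixed seed before slicing, the same rows land in each split on every run.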


In [34]:


import os
os.environ["WANDB_DISABLED"] = "true"



In [35]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(num_train_epochs=10,
                                  do_train=True,
                                  report_to="none",
                                  output_dir="/kaggle/working",
                                  evaluation_strategy="steps",
                                  eval_steps=200,
                                  learning_rate=1e-5,
                                  per_device_train_batch_size=32,
                                  per_device_eval_batch_size=32)

trainer = Trainer(model = model, 
                  args = training_args,
                  train_dataset = ds_train, 
                  eval_dataset = ds_eval,
                  compute_metrics = compute_metrics,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
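
To make the `compute_metrics` step concrete, the sketch below reproduces its logic without NumPy or the `datasets` metric: take the arg-max of each row of logits and count how often it equals the reference label. `argmax` and `accuracy_from_logits` are hypothetical helpers, and the logits and labels are toy values chosen for illustration.

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose arg-max class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy batch: four examples, three output classes (as in num_labels=3).
toy_logits = [
    [2.0, -1.0, 0.1],
    [0.3, 1.2, -0.5],
    [0.9, 0.8, -2.0],
    [-0.4, 0.6, 0.0],
]
toy_labels = [0, 1, 1, 1]
acc = accuracy_from_logits(toy_logits, toy_labels)
print(acc)
```

Accuracy treats every error equally, so on a dataset where one class dominates it may look better than the model's behaviour on the minority class.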


In [38]:
if torch.cuda.is_available():
    device = "cuda:0"
    print("Using GPU")
else: 
    device = "cpu"

In [39]:
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [40]:
torch.set_grad_enabled(True)
trainer.train()
trainer.evaluate()

Step,Training Loss,Validation Loss


{'eval_loss': 0.257964551448822,
 'eval_accuracy': 0.9056603773584906,
 'eval_runtime': 8.4781,
 'eval_samples_per_second': 12.503,
 'eval_steps_per_second': 0.472,
 'epoch': 10.0}
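
Two things in this output can be checked with simple arithmetic. The sketch below assumes a single device, no gradient accumulation, and one optimizer step per batch; under those assumptions the run has fewer total steps than `eval_steps=200`, which would explain why the Step table above has no rows. It also recovers how many of the 106 evaluation reviews the reported accuracy corresponds to.

```python
import math

n_train, n_eval = 599, 106                    # split sizes printed earlier
batch_size, epochs, eval_steps = 32, 10, 200  # values from TrainingArguments

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs
intermediate_evals = total_steps // eval_steps     # evaluations triggered during training

reported_accuracy = 0.9056603773584906
correct = round(reported_accuracy * n_eval)        # correctly classified reviews

print(steps_per_epoch, total_steps, intermediate_evals, correct)
```

If intermediate validation results are wanted, a smaller `eval_steps` (or evaluating once per epoch) would be one way to get them, since the final `trainer.evaluate()` call is the only evaluation that ran here.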