In [1]:
from datasets import load_dataset


In [2]:
dataset = load_dataset("yelp_review_full")

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [4]:
dataset["train"][10]

{'label': 0,
 'text': "Owning a driving range inside the city limits is like a license to print money.  I don't think I ask much out of a driving range.  Decent mats, clean balls and accessible hours.  Hell you need even less people now with the advent of the machine that doles out the balls.  This place has none of them.  It is april and there are no grass tees yet.  BTW they opened for the season this week although it has been golfing weather for a month.  The mats look like the carpet at my 107 year old aunt Irene's house.  Worn and thread bare.  Let's talk about the hours.  This place is equipped with lights yet they only sell buckets of balls until 730.  It is still light out.  Finally lets you have the pit to hit into.  When I arrived I wasn't sure if this was a driving range or an excavation site for a mastodon or a strip mining operation.  There is no grass on the range. Just mud.  Makes it a good tool to figure out how far you actually are hitting the ball.  Oh, they are cash 

In [5]:
dataset["train"].features

{'label': ClassLabel(names=['1 star', '2 star', '3 stars', '4 stars', '5 stars'], id=None),
 'text': Value(dtype='string', id=None)}

As you now know, you need a tokenizer to process the text, along with a padding and truncation strategy to handle variable sequence lengths. To process your dataset in one step, use the 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/process.html#map) method to apply a preprocessing function over the entire dataset:

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [7]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [8]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)
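To make the padding and truncation strategy concrete, here is a toy sketch on plain lists of token ids. It is not the tokenizer's actual implementation: the helper name `pad_or_truncate`, the pad id `0`, and the sample ids are assumptions for illustration, and special tokens are ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy illustration of padding="max_length" combined with truncation=True."""
    # Truncation: keep at most max_length ids.
    ids = token_ids[:max_length]
    # Attention mask: 1 marks a real token, 0 marks padding.
    mask = [1] * len(ids)
    # Padding: fill up to max_length with pad_id.
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }

# Hypothetical token ids, chosen only to show both cases.
short = pad_or_truncate([5, 6, 7, 8], max_length=6)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], max_length=6)

print(short)
print(long)
```

Every sequence comes out with the same length, which is what lets the examples be stacked into fixed-size batches.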

To reduce the time fine-tuning and evaluation take, you can work with smaller subsets of the full dataset:

In [9]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
small_validation_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000,2000))
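The evaluation and validation subsets above both come from the test split, shuffled with the same seed, and differ only in the index range selected. The sketch below illustrates why that yields disjoint subsets. It uses Python's `random` module as a stand-in, which is an assumption: 🤗 Datasets uses its own random generator, so the actual permutation differs, but the reasoning is the same.

```python
import random

def shuffled_indices(n, seed):
    """Return a seeded permutation of range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx

# Same size as the test split shown earlier (50000 rows).
perm_a = shuffled_indices(50_000, seed=42)
perm_b = shuffled_indices(50_000, seed=42)

# Same seed, same permutation, so slicing different ranges cannot overlap.
eval_idx = perm_a[:1000]
validation_idx = perm_b[1000:2000]
overlap = set(eval_idx) & set(validation_idx)

print(perm_a == perm_b, len(overlap))
```

If the two shuffles used different seeds, the two ranges could share rows.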

<a id='trainer'></a>

In [10]:
from transformers import AutoModelForSequenceClassification

base_model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
base_model.config.id2label

{0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2', 3: 'LABEL_3', 4: 'LABEL_4'}
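The generic `LABEL_n` names appear because the model config was not given the dataset's class names. As a minimal sketch, the mappings can be built from the `ClassLabel` names printed earlier; the variable names here are illustrative.

```python
# Class names copied from dataset["train"].features["label"] above.
class_names = ["1 star", "2 star", "3 stars", "4 stars", "5 stars"]

# Integer id -> readable name, and the inverse mapping.
id2label = dict(enumerate(class_names))
label2id = {name: i for i, name in id2label.items()}

print(id2label)
print(label2id)

# These dicts could then be passed when loading the model, e.g.
# AutoModelForSequenceClassification.from_pretrained(
#     "bert-base-cased", num_labels=5, id2label=id2label, label2id=label2id
# )
```

With the mappings set in the config, pipeline outputs would report the star ratings instead of `LABEL_0` through `LABEL_4`.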

In [12]:
trained_model = AutoModelForSequenceClassification.from_pretrained("models/bert_base_cased/yelp", num_labels=5)

In [13]:
from transformers import pipeline

pipe_base = pipeline(
    "text-classification",
    model=base_model,
    batch_size=8,
    tokenizer=tokenizer,
    device=0,
    top_k=1
)

pipe_trained = pipeline(
    "text-classification",
    model=trained_model,
    batch_size=8,
    tokenizer=tokenizer,
    device=0,
    top_k=1
)

In [15]:
for n in range(20):
    text = small_validation_dataset[n]["text"]
    print("Review Text", text)
    # Note: [:512] keeps the first 512 characters, not the first 512 tokens.
    print("Model prediction (trained): ", pipe_trained(text[:512]))
    #print("Model prediction (base)", pipe_base(text[:512]))
    print("Label ", small_validation_dataset[n]["label"])

Review Text Hefty portions, great value, homey comfortable service: not overly efficient, but not slow either. Stick to T-Bird specialties and typical fare. Being adventurous doesn't pay here. Ask the waiter what's cooking to find the best options. Pros: Cheap, hefty, homemade feel. Cons: distinct bar atmosphere, poor atypical fare.
Model prediction (trained):  [[{'label': 'LABEL_2', 'score': 0.6693612337112427}]]
Label  2
Review Text DO NOT GO HERE.  We bought a bouquet for $50 that only lasted 2 DAYS.  I used to work in a flower shop and flowers should be lasting longer than that.  I am NOT HAPPY.  MONEY DOWN THE DRAIN
Model prediction (trained):  [[{'label': 'LABEL_0', 'score': 0.8262115120887756}]]
Label  0
Review Text Cannot recommend before safety standards are improved. \n\nPros: courteous, professional shuttle drivers who are punctual; office was helpful in rescheduling due to my delayed flight\n\nCons: Loose safety procedures. My single seat buggy lacked left foot rest, four p
Model prediction (trained):  [[{'label': 'LABEL_2', 'score': 0.458805650472641}]]
Label  2
Review Text Had dinner last night.  I ordered calamari from our waiter who called himself \"meatball,\" and I  asked if he had tarter sauce.  He said they did not have tarter sauce. I asked if he had any mayo (I could make my own).  He said to me \"I'm not making you tarter sauce,\" then turned his head and mouthed some obscene remark.  I was very insulted.    After dinner I paid the bill and gave him a 10% tip instead of my standard 20% and on the check I added a message that he insulted me and I'd see him on Yelp.  He ran after me where I was standing at the curb, irrationally and aggressively talking about the tarter sauce.  I tried to explain that he totally missed the point, and that it was how he handled my request, but he wouldn't listen.  He told me to get off the  (public) property or he would call the police.  My husband thought \"meatball,\" a 350 pound guy, was going to hit me.  The w
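Rather than comparing predictions and labels by eye, they can be scored programmatically. The sketch below assumes the nested `[[{...}]]` output shape printed above and uses only the three complete prediction/label pairs shown; `label_to_id` is a hypothetical helper, not part of the library.

```python
def label_to_id(label):
    """Convert a generic label name such as 'LABEL_2' to the integer 2."""
    return int(label.rsplit("_", 1)[1])

# The three complete predictions printed above, in order.
predictions = [
    [[{"label": "LABEL_2", "score": 0.6693612337112427}]],
    [[{"label": "LABEL_0", "score": 0.8262115120887756}]],
    [[{"label": "LABEL_2", "score": 0.458805650472641}]],
]
gold = [2, 0, 2]

# With top_k=1, each result holds a single-item list for the single input text.
predicted_ids = [label_to_id(p[0][0]["label"]) for p in predictions]
accuracy = sum(p == g for p, g in zip(predicted_ids, gold)) / len(gold)

print(predicted_ids, accuracy)
```

Three examples are far too few to say anything about model quality; the same loop applies unchanged to all of `small_validation_dataset`.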

In [None]:
small_validation_dataset

In [None]:
len(small_validation_dataset[0]["text"])