# Modelling

Prompt:

Assume you are an expert in NLP with TensorFlow and Keras. For every problem I describe, please share code that solves it using TensorFlow and Keras.

Prompt:

Please create code that downloads the Yelp dataset, which is in CSV format, from this link: https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0. Preprocess the reviews by:

- removing links
- removing punctuation
- removing stopwords
- keeping alphanumeric characters only
- lowercasing every word

Apply this preprocessing to the dataset. Then use the Keras tokenizer to convert each review into a tensor of token IDs. The vocabulary should contain every word that occurs at least 5 times, and no review should be truncated. Finally, ensure all reviews are padded to the same length.
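
The steps requested in this prompt can be sketched without any dependencies. The following is a minimal illustration, not the Keras implementation: the stopword list, the example reviews, and the helper names `clean`, `build_vocab`, and `encode_and_pad` are all assumptions made for the example, and `min_freq` is lowered to 2 so the toy data produces a non-empty vocabulary.

```python
import re
from collections import Counter

# Tiny stand-in stopword list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "it"}

def clean(text):
    """Lowercase, drop links, keep only letters and digits, drop stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)       # remove links
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove punctuation and other symbols
    return [w for w in text.split() if w not in STOP_WORDS]

def build_vocab(token_lists, min_freq=5):
    """Map each word seen at least min_freq times to an ID (0 = padding, 1 = unknown)."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    kept = sorted((w for w, c in counts.items() if c >= min_freq),
                  key=lambda w: (-counts[w], w))
    return {w: i + 2 for i, w in enumerate(kept)}

def encode_and_pad(token_lists, vocab):
    """Encode every review and right-pad to the longest one, so nothing is truncated."""
    ids = [[vocab.get(w, 1) for w in tokens] for tokens in token_lists]
    max_len = max(len(seq) for seq in ids)
    return [seq + [0] * (max_len - len(seq)) for seq in ids]

# Made-up reviews for illustration only.
reviews = ["The food was great! http://example.com", "Great service, great food."]
tokens = [clean(r) for r in reviews]
vocab = build_vocab(tokens, min_freq=2)   # the prompt asks for 5; 2 suits the toy data
padded = encode_and_pad(tokens, vocab)
print(vocab)
print(padded)
```

With Keras, the same idea would typically use `Tokenizer` (fit once, inspect `word_counts`, then set `num_words` so only the frequent words survive) and `pad_sequences` with its default `maxlen=None`, which pads to the longest sequence. Note that Keras keeps only word indices strictly below `num_words`, so the value needs to account for that offset.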

Prompt:

Now let's change our approach and apply transfer learning with DistilBERT for sentiment analysis. Use the model distilbert-base-uncased-finetuned-sst-2-english as the base, with a classification head of size 2 to predict sentiment. Remember to use TensorFlow and Keras, and perform the training.

In [None]:
!pip install transformers

In [None]:
import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification, DistilBertConfig, DistilBertTokenizerFast
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import urllib.request

# Downloading nltk stopwords
nltk.download('stopwords')

# Set of stopwords
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """
    Preprocess a single review:
    - Lowercasing
    - Removing links
    - Removing punctuation and other non-alphanumeric characters
    - Removing words of one or two characters
    - Removing stopwords
    """
    text = text.lower()
    text = re.sub(r'http\S+', ' ', text)  # Remove links
    text = re.sub(r'[^a-z0-9\s]', ' ', text)  # Keep only letters, digits and whitespace
    text = re.sub(r'\b\w{1,2}\b', ' ', text)  # Remove words with 1 or 2 characters
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

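# Load the Yelp dataset. Download yelp.csv from the Dropbox link above and place it
# in the data folder (or upload it to Colab); adjust file_path if your copy lives elsewhere.
file_path = 'data/yelp.csv'  # assumed location
yelp_data = pd.read_csv(file_path)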

# Preprocess the reviews
yelp_data['processed_reviews'] = yelp_data['text'].apply(preprocess_text)

yelp_data

In [None]:
# Load the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

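# Tokenize the dataset
def tokenize_reviews(reviews):
    # DistilBERT accepts at most 512 tokens, so longer reviews are truncated.
    # NumPy output lets scikit-learn's train_test_split index the arrays directly.
    return tokenizer(reviews, padding=True, truncation=True, max_length=512, return_tensors='np')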

# Tokenize Yelp reviews
tokenized_reviews = tokenize_reviews(yelp_data['processed_reviews'].tolist())

In [None]:
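yelp_data['binary_labels'] = yelp_data['stars'].apply(lambda x: 1 if x > 3 else 0)  # 4-5 stars = positive
labels = np.array(yelp_data['binary_labels'])

# Split token IDs and attention masks together so padding can be masked during training
train_ids, val_ids, train_mask, val_mask, train_labels, val_labels = train_test_split(
    np.asarray(tokenized_reviews['input_ids']), np.asarray(tokenized_reviews['attention_mask']),
    labels, test_size=0.2, random_state=42)
train_reviews = {'input_ids': train_ids, 'attention_mask': train_mask}
val_reviews = {'input_ids': val_ids, 'attention_mask': val_mask}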
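
The labelling rule and the split can be checked on toy data. The sketch below uses only the standard library; `binarize` and `split_aligned` are hypothetical helpers written for this illustration, and the token IDs and masks are made up. It shows that 3-star reviews fall into the negative class and that several parallel arrays must be split with one shared shuffle so each row keeps its own label.

```python
import random

def binarize(stars):
    """4 and 5 stars -> 1 (positive); 1 to 3 stars -> 0 (negative)."""
    return [1 if s > 3 else 0 for s in stars]

def split_aligned(arrays, test_size=0.2, seed=0):
    """Split several same-length lists with one shared shuffle so rows stay aligned."""
    n = len(arrays[0])
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = max(1, round(n * test_size))
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    def pick(arr, ids):
        return [arr[i] for i in ids]

    return [(pick(a, train_idx), pick(a, val_idx)) for a in arrays]

stars = [5, 3, 1, 4, 2, 5, 4, 1, 3, 5]
labels = binarize(stars)
input_ids = [[101, i, 102] for i in range(len(stars))]   # toy token IDs (made up)
attention_mask = [[1, 1, 1] for _ in stars]              # toy masks (made up)

(train_ids, val_ids), (train_mask, val_mask), (train_y, val_y) = split_aligned(
    [input_ids, attention_mask, labels], test_size=0.2
)
print(len(train_ids), len(val_ids))
```

In the notebook itself, scikit-learn's `train_test_split` plays this role: it accepts any number of arrays and applies the same split to all of them.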

In [None]:
# Load the pre-trained DistilBERT model
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english', num_labels=2)

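# Compile the model. The classification head outputs two raw logits per review and the
# labels are integer class IDs, so use sparse categorical cross-entropy on logits.
optimizer = Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])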

# Train the model
model.fit(train_reviews, train_labels, epochs=3, batch_size=32, validation_data=(val_reviews, val_labels))
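
After training, the model returns raw logits of shape (batch, 2) rather than probabilities. A minimal sketch of turning them into a sentiment label follows; the logit values are made up, `predict_label` is a hypothetical helper, and the label order (index 0 = negative, index 1 = positive) is assumed to match both the SST-2 checkpoint and the `binary_labels` column above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, id2label=("NEGATIVE", "POSITIVE")):
    """Return the most likely label and its probability for one review."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Made-up logits standing in for the model's output on two reviews.
example_logits = [[-2.1, 2.3], [1.4, -0.9]]
predictions = [predict_label(row) for row in example_logits]
for label, prob in predictions:
    print(f"{label}: {prob:.3f}")
```

In the notebook, the logits would come from something like `model.predict(val_reviews).logits`, and `tf.nn.softmax` could replace the hand-written function.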