# Project MS2

Weihua Pan
IE7374 -- Gen AI

# Introduction

As Large Language Models (LLMs) continue to evolve, their parameter counts keep growing.
However, the pace of computational resource development
struggles to keep up. Consequently, there is a significant advantage in devising a framework
that can reduce the parameter count while preserving performance. Such a framework would enable
these models to run on devices with limited computational capabilities, including mobile
phones.


# Problem Statement
This project aims to develop a car chatbot that comments on a specific car model and can recommend a car for a given budget and set of features. <br>
In addition, the project focuses on finding methods to decrease model size without
compromising performance on the specified NLP tasks. I will identify a Large Language Model
(LLM) such as GPT-4 or BERT that is appropriate for this project and subsequently develop a user
interface tailored for real-time text analysis.

# Dataset

The dataset I am using was scraped by `ankkur13`. It contains reviews and ratings for each car brand and model, among other fields. Here is the link: https://github.com/ankkur13/Edmunds-Car-Consumer-Ratings-and-Reviews

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('Scraped Data/Scraped_Car_Review_eagle.csv',lineterminator='\n',index_col=0)

In [3]:
df

Unnamed: 0,Review_Date,Author_Name,Vehicle_Title,Review_Title,Review,Rating\r
0,on 12/14/05 20:47 PM (PST),Trisha,1997 Eagle Talon Hatchback ESi 2dr Hatchback,AWESOME CAR!,This is a fantastic car! Not only is it fun t...,5.0
1,on 12/04/05 00:04 AM (PST),Patricia,1997 Eagle Talon Hatchback ESi 2dr Hatchback,Awesome Car!,This car is the most fun car I have ever driv...,5.0
2,on 02/19/05 14:20 PM (PST),Paco,1997 Eagle Talon Hatchback 2dr Hatchback,Not Bad.,"Most people don't like these cars, think\rthe...",4.25
3,on 08/06/04 00:00 AM (PDT),strave,1997 Eagle Talon Hatchback ESi 2dr Hatchback,enjoyable,for a four cylander with no turbo it is \ran ...,4.875
4,on 12/23/03 00:00 AM (PST),Deelio,1997 Eagle Talon Hatchback ESi 2dr Hatchback,"Fun, Reliable Car",This was my first brand spankin' new \rcar an...,4.75
5,on 12/02/03 00:00 AM (PST),D. Strickland,1997 Eagle Talon Hatchback ESi 2dr Hatchback,ESi,"Car is fun to drive, most DSM vehicles \rhave...",4.625
6,on 05/07/03 00:00 AM (PDT),DR1665,1997 Eagle Talon Hatchback 2dr Hatchback,pocket rocket!,I have owned my Talon since it was \rnew. Th...,5.0
7,on 09/26/02 00:00 AM (PDT),Andy Ashcraft,1997 Eagle Talon Hatchback ESi 2dr Hatchback,I love mine!,I've had nothing but fun driving my '97 \rTal...,4.625
8,on 09/02/02 00:00 AM (PDT),Mitch'sTalon,1997 Eagle Talon Hatchback ESi 2dr Hatchback,One of the funnest cars to drive!,If you don't own one or have never \rdriven o...,4.75
9,on 09/07/14 00:22 AM (PDT),mack_kenamond,1997 Eagle Talon TSi TSi Turbo 2dr Hatchback AWD,18 years of fun,"In case you didn't know, this car is virtuall...",4.875


Let's take a look at the structure of one CSV file. The columns are `Review_Date`, `Author_Name`, `Vehicle_Title`, `Review_Title`, `Review`, and `Rating\r` (the last header carries a trailing carriage return). I don't think the date and author name would help train the car chatbot, so I will ignore them and transform each CSV row into plain text like this:

Vehicle: 1997 Eagle Talon Hatchback ESi 2dr Hatchback<br>
Title: AWESOME CAR!<br>
Review: "This is a fantastic car! Not only is it fun to drive, but it is a head turner. I have had more compliments on this car than I have had on any other car I have owned. I haven't had any major repairs on this car. It has been very dependable. I love driving it. It is red with the black roof & spoiler, which is very sharp. I haven't seen very many on the road, & when I take it in for oil changes, the mechanics tell me they have never seen one in such good condition. I will keep this car until it just won't go anymore. I now have 55,000 miles on it & expect to put a lot more miles on it before it dies. I was disappointed to hear that 1998 was the last year for this car."<br>
Rating: 5.0  [round to 1 decimal]

The next car review then follows in the same format.

In [4]:
# put all csv filenames into one list
cvs_list = []
with open("Scraped Data/cvs_list.txt",mode="r",encoding='utf-8') as f:
    cvs_list = [line.strip() for line in f.readlines()]

In [5]:
def df_transform_txt(df: pd.DataFrame) -> str:
    """Format the reviews of a CSV dataframe as plain text.

    Only the first 1% of the rows are transformed.
    """
    # The last header carries a trailing "\r", so strip the column names
    df.columns = df.columns.str.strip()
    txt_lines = []
    one_percent = 0.01 * len(df)
    # enumerate gives the row position, independent of the index labels
    for position, (_, row) in enumerate(df.iterrows()):
        if position >= one_percent:
            break
        vehicle = row['Vehicle_Title']
        title = row['Review_Title']
        review = row['Review'].replace("\r", ' ')  # fix the review format
        rating = round(row["Rating"], 1)

        txt_lines.append(f"Vehicle: {vehicle}\nTitle: {title}\nReview: {review}\nRating: {rating}\n")

    # Join all entries into a single string, separated by blank lines
    txt = "\n\n".join(txt_lines)
    return txt

In [6]:
# loop through each csv file and collect its formatted text; the pieces are joined into one string below
car_reviews_text_list = []
for filename in cvs_list:
    print(filename)
    df = pd.read_csv(f"Scraped Data/{filename}",lineterminator="\n",index_col=0)
    text = df_transform_txt(df)
    car_reviews_text_list.append(text)

car_reviews_text = "\n".join(car_reviews_text_list)
    

Scraped_Car_Review_daewoo.csv
Scraped_Car_Review_dodge.csv
Scraped_Car_Review_eagle.csv
Scraped_Car_Review_ferrari.csv
Scraped_Car_Review_fiat.csv
Scraped_Car_Review_fisker.csv
Scraped_Car_Review_ford.csv
Scraped_Car_Review_genesis.csv
Scraped_Car_Review_geo.csv
Scraped_Car_Review_hummer.csv
Scraped_Car_Review_hyundai.csv
Scraped_Car_Review_infiniti.csv
Scraped_Car_Review_isuzu.csv
Scraped_Car_Review_jaguar.csv
Scraped_Car_Review_jeep.csv
Scraped_Car_Review_kia.csv
Scraped_Car_Review_lamborghini.csv
Scraped_Car_Review_land-rover.csv
Scraped_Car_Review_lexus.csv
Scraped_Car_Review_lincoln.csv
Scraped_Car_Review_lotus.csv
Scraped_Car_Review_maserati.csv
Scraped_Car_Review_maybach.csv
Scraped_Car_Review_mazda.csv
Scraped_Car_Review_mclaren.csv
Scraped_Car_Review_mercedes-benz.csv
Scraped_Car_Review_mercury.csv
Scraped_Car_Review_mini.csv
Scraped_Car_Review_mitsubishi.csv
Scraped_Car_Review_nissan.csv
Scraped_Car_Review_oldsmobile.csv
Scraped_Car_Review_panoz.csv
Scraped_Car_Review_plymout

In [7]:
# define output file path
output_file_path = 'car_reviews.txt'

# write to the file
with open(output_file_path, 'w', encoding='utf-8') as file:
    file.write(car_reviews_text)

print(f"Review text has been saved to '{output_file_path}'")

Review text has been saved to 'car_reviews.txt'


# Modeling

In [8]:
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.optim import Adam

In [9]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [10]:
device

device(type='cuda')

In [11]:
# Lowercase the text
text = car_reviews_text.lower()

# Define the tokenizer
tokenizer = get_tokenizer('basic_english')

# Tokenize the text
tokenized_text = [list(tokenizer(text))]

# Build the vocabulary from the tokenized text
vocab = build_vocab_from_iterator(tokenized_text)

# Numericalize the text
numericalized_text = [vocab[token] for token in tokenized_text[0]]

In [12]:
token_list = numericalized_text[0:10]

In [13]:
vocab.lookup_tokens(token_list)

['vehicle',
 '2000',
 'daewoo',
 'leganza',
 'sedan',
 'sx',
 '4dr',
 'sedan',
 'title',
 'best']

This checks that `numericalized_text` is reversible: looking up the first ten token ids recovers the opening tokens of the text.

In [14]:
len(vocab)

12444

The vocabulary size is 12,444, which is quite large for a 1 MB text file.
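
One possible way to shrink the vocabulary is to map rare tokens to a single placeholder. The sketch below illustrates the idea in plain Python rather than with the notebook's torchtext pipeline. `build_pruned_vocab`, `numericalize`, the `min_freq` threshold, and the `<unk>` token are names assumed here for illustration, and the sample tokens are made up.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for rare or unseen tokens


def build_pruned_vocab(tokens, min_freq=2):
    """Return a token -> id mapping that keeps tokens seen at least min_freq times."""
    counts = Counter(tokens)
    # Id 0 is reserved for the placeholder; kept tokens follow by descending frequency.
    kept = [tok for tok, n in counts.most_common() if n >= min_freq]
    return {tok: idx for idx, tok in enumerate([UNK] + kept)}


def numericalize(tokens, vocab):
    """Map tokens to ids, falling back to the placeholder id for pruned tokens."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


# Toy token stream (made up for illustration)
sample_tokens = "vehicle 2000 daewoo leganza sedan sx 4dr sedan title best sedan title".split()
pruned_vocab = build_pruned_vocab(sample_tokens, min_freq=2)
sample_ids = numericalize(sample_tokens, pruned_vocab)
print(len(set(sample_tokens)), "->", len(pruned_vocab))
```

In torchtext, `build_vocab_from_iterator` accepts `min_freq` and `specials` arguments that serve the same purpose. Whether pruning helps here is untested, and the trade-off is that every pruned word becomes indistinguishable to the model.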

In [15]:
# Define the dataset
class LlamaDataset(Dataset):
    def __init__(self, text, sequence_length):
        self.text = text
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.text) - self.sequence_length

    def __getitem__(self, idx):
        return (
            torch.tensor(self.text[idx:idx+self.sequence_length]),
            torch.tensor(self.text[idx+1:idx+self.sequence_length+1]),
        )

# Create the dataset and dataloader
sequence_length = 32
dataset = LlamaDataset(numericalized_text, sequence_length)
dataloader = DataLoader(dataset, batch_size=128)
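
To make the indexing in `__getitem__` concrete, the sketch below reproduces the same slicing on a short list of made-up token ids. `make_pair`, `toy_ids`, and `toy_length` are illustrative names, not part of the notebook.

```python
def make_pair(ids, idx, sequence_length):
    """Return the (input, target) windows that start at position idx."""
    inputs = ids[idx:idx + sequence_length]
    targets = ids[idx + 1:idx + sequence_length + 1]  # same window, shifted by one
    return inputs, targets


toy_ids = [10, 11, 12, 13, 14, 15]  # made-up token ids
toy_length = 4

# Same count as __len__ in the dataset class: len(ids) - sequence_length
num_pairs = len(toy_ids) - toy_length
for i in range(num_pairs):
    x, y = make_pair(toy_ids, i, toy_length)
    print(i, x, y)
```

Each target is the input shifted one position to the right, so position `t` of the target is the token that follows position `t` of the input. Consecutive samples overlap in all but one token.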

In [16]:
class LlamaModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers, num_heads, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.transformer = nn.Transformer(
            d_model=embed_size,
            nhead=num_heads,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dim_feedforward=hidden_size,
            dropout=dropout,
            batch_first=True,
        )
        self.fc = nn.Linear(embed_size, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)
        output = self.transformer(embedded, embedded)
        output = self.fc(output)
        return output
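
The `forward` above passes no attention mask to `nn.Transformer`, so every position may attend to later tokens, including the one it is trained to predict. A causal mask is the usual remedy in next-token models. The sketch below shows what such a mask does in plain Python. It is not the notebook's model: `causal_mask` and `masked_softmax` are hypothetical helpers, and the scores are made up.

```python
import math


def causal_mask(size):
    """Additive mask: 0.0 where attention is allowed, -inf for future positions."""
    return [[0.0 if col <= row else float("-inf") for col in range(size)]
            for row in range(size)]


def masked_softmax(scores, mask):
    """Row-wise softmax over scores + mask; masked entries get weight 0."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        shifted = [s + m for s, m in zip(score_row, mask_row)]
        top = max(shifted)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - top) for v in shifted]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Made-up attention scores for a sequence of three tokens
toy_scores = [[0.5, 2.0, 1.0],
              [1.0, 1.0, 3.0],
              [0.2, 0.4, 0.6]]
toy_weights = masked_softmax(toy_scores, causal_mask(3))
for row in toy_weights:
    print([round(w, 3) for w in row])
```

In PyTorch, a float mask of the same shape can be passed through the `src_mask`, `tgt_mask`, and `memory_mask` arguments of `nn.Transformer.forward`. Whether this changes the generated text in this project would need to be checked by retraining.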

In [17]:
# Initialize the model and the optimizer
model = LlamaModel(len(vocab), embed_size=128, hidden_size=128, num_layers=2, num_heads=8, dropout=0.1).to(device)

# If there are multiple GPUs, wrap the model with nn.DataParallel
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
model = model.to(device)

optimizer = Adam(model.parameters(), lr=0.001)

# Train the model
for epoch in range(20):
    for batch in dataloader:
        x, y = batch
        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad()
        y_pred = model(x)
        loss = nn.functional.cross_entropy(y_pred.view(-1, len(vocab)), y.view(-1))
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch}, Loss {loss.item()}')
    if float(loss.item()) < 0.06:
        break



Epoch 0, Loss 5.0411696434021
Epoch 1, Loss 4.13360071182251
Epoch 2, Loss 3.409588098526001
Epoch 3, Loss 2.9343554973602295
Epoch 4, Loss 2.6019606590270996
Epoch 5, Loss 2.356454849243164
Epoch 6, Loss 2.1351141929626465
Epoch 7, Loss 2.0737743377685547
Epoch 8, Loss 1.949976921081543
Epoch 9, Loss 1.8891923427581787
Epoch 10, Loss 1.7727535963058472
Epoch 11, Loss 1.6936289072036743
Epoch 12, Loss 1.6191486120224
Epoch 13, Loss 1.5524119138717651
Epoch 14, Loss 1.4959806203842163
Epoch 15, Loss 1.4818544387817383
Epoch 16, Loss 1.3750232458114624
Epoch 17, Loss 1.4040517807006836
Epoch 18, Loss 1.2730814218521118
Epoch 19, Loss 1.3316398859024048


The model takes about 1 minute to run one epoch. Over the first 20 epochs the loss drops from about 5.0 to about 1.3 (each printed value is the loss of the last batch in that epoch), and the model is converging.
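
To put these loss values in context, cross-entropy can be converted to perplexity and compared with a uniform guess over the vocabulary. The sketch below uses only the numbers reported above. They are last-batch training losses, so the result is a rough indication rather than an evaluation on held-out data.

```python
import math

vocab_size = 12444               # len(vocab) reported above
first_loss = 5.0411696434021     # Epoch 0
last_loss = 1.3316398859024048   # Epoch 19

# Cross-entropy of a model that assigns equal probability to every token
uniform_loss = math.log(vocab_size)

# Perplexity is the exponential of the cross-entropy (natural log base)
first_perplexity = math.exp(first_loss)
last_perplexity = math.exp(last_loss)

print(f"uniform baseline loss: {uniform_loss:.2f}")
print(f"perplexity at epoch 0: {first_perplexity:.1f}")
print(f"perplexity at epoch 19: {last_perplexity:.2f}")
```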

In [24]:
# Use the trained model to generate new text
def generate_text(model, human_input, num_tokens):
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():  # No need to track the gradients
        tokens = [vocab[token] for token in tokenizer(human_input)]
        tokens = torch.tensor(tokens).unsqueeze(0).to(device)
        for _ in range(num_tokens):
            output = model(tokens)
            probabilities = nn.functional.softmax(output[0, -1], dim=0)
            next_token = torch.multinomial(probabilities, 1).item()
            tokens = torch.cat([tokens, torch.tensor([[next_token]]).to(device)], dim=1)
        generated_text = ' '.join(vocab.get_itos()[token] for token in tokens[0].cpu().numpy())
        return generated_text
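
`generate_text` samples each token from the full softmax distribution with `torch.multinomial`. Two common variations are temperature scaling and top-k filtering. The sketch below shows both in plain Python under assumed names (`sample_next`, `temperature`, `top_k`) and made-up logits. It is an illustration of the sampling step only, not a drop-in replacement for the function above.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token id from logits after temperature scaling and optional top-k filtering."""
    scaled = [value / temperature for value in logits]
    candidates = list(range(len(scaled)))
    if top_k is not None:
        # Keep only the k highest-scoring token ids
        candidates = sorted(candidates, key=lambda i: scaled[i], reverse=True)[:top_k]
    top = max(scaled[i] for i in candidates)
    weights = [math.exp(scaled[i] - top) for i in candidates]  # unnormalized softmax
    return rng.choices(candidates, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary
rng = random.Random(0)              # fixed seed so the run is repeatable
samples = [sample_next(toy_logits, temperature=0.8, top_k=2, rng=rng) for _ in range(10)]
print(samples)
```

A temperature below 1 sharpens the distribution and a temperature above 1 flattens it, while `top_k` removes the low-probability tail. Whether either setting reduces the repetition seen in this model's output is an open question.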

# Experiment Findings

In [85]:
result = generate_text(model, human_input="I love tesla  ", num_tokens=32)
print(result)

i love tesla love rating a performance a performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance performance


From the generated output above, we can see that the model associates tesla with performance. This is interesting, since Tesla is indeed often described as a performance car.

In [172]:
result = generate_text(model, human_input="vehicle 2012 bmw m3", num_tokens=32)
print(result)

vehicle 2012 bmw m3 m3 m3 m3 m3 m3 m3 m3 m3 m3 m3 m3 m3 m3 m3 m3 m3 m3 all cost all cost all cost all cost all cost all cost cost cost cost


With m3 as input, the model outputs some cost tokens at the end.

# Conclusion

The model can output some words, but in general the sentences do not make sense. I will see whether training for more epochs improves the output. Another option is to increase the data size (only 1% of each file is used so far), but that would dramatically increase the computation, and I don't think it is trainable on one GPU. I think the best approach would be to take a small pretrained LLM such as LLaMA 7B and fine-tune it on this dataset.