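# Task description
There are 7 prompts, and students write essays based on the prompts and their source texts. The essays for 2 of the prompts make up the training set; the essays for the remaining prompts make up the test set. Note: almost all of the training essays are written by students, so additional essays can be generated with an LLM and added to the training set. The final goal is to output, for each essay, the probability that it was generated by an LLM.
## Approach
1. Data preprocessing: clean the text data  
2. Encode the text for BERT. This step is also called tokenization: in NLP tasks the text has to be split into wordpieces  
3. Train the model  
4. Predict and save the results to a csv file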
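
To make the wordpiece idea in step 2 concrete, here is a minimal sketch of greedy longest-match-first splitting. The vocabulary `TOY_VOCAB` and the helper names are made up for illustration; the real `bert-base-uncased` tokenizer also normalizes text and handles punctuation, which this sketch ignores.

```python
# Toy WordPiece tokenizer: greedy longest-match-first over a tiny, made-up vocabulary.
# TOY_VOCAB is hypothetical; the real bert-base-uncased vocabulary has 30522 entries.
TOY_VOCAB = {"[UNK]", "car", "##s", "limit", "##ing", "usage", "green", "##house"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercase word into wordpieces, or return [unk] if it cannot be split."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text):
    """Lowercase, split on whitespace, and apply WordPiece to every word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    return tokens


print(tokenize("Limiting cars usage greenhouse zebra"))
# → ['limit', '##ing', 'car', '##s', 'usage', 'green', '##house', '[UNK]']
```

Words that are missing from the vocabulary are broken into known pieces, and only words that cannot be covered at all fall back to `[UNK]`.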

In [17]:
import pandas as pd
import markdown2
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split
import nltk

In [15]:
train_data = pd.read_csv("D:/kaggle/AI text/data/train_essays.csv")
test_data = pd.read_csv("D:/kaggle/AI text/data/test_essays.csv")
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1378 entries, 0 to 1377
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         1378 non-null   object
 1   prompt_id  1378 non-null   int64 
 2   text       1378 non-null   object
 3   generated  1378 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 43.2+ KB


In [4]:
train_data['prompt_id']

0       0
1       0
2       0
3       0
4       0
       ..
1373    1
1374    0
1375    0
1376    0
1377    0
Name: prompt_id, Length: 1378, dtype: int64

In [5]:
set(train_data['prompt_id'])

{0, 1}

As expected, only prompts 0 and 1 are used for the training set.

In [7]:
set(train_data['generated'])

{0, 1}

In [12]:
value_counts = train_data['generated'].value_counts()
percentage = value_counts/ len(train_data)* 100
result = pd.concat([value_counts, percentage], axis = 1)
result.columns = ['value', 'percentage']
result

           value  percentage
generated                   
0           1375   99.782293
1              3    0.217707


Of the 1378 samples, only 3 were generated by an LLM.
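
With an imbalance this strong, plain accuracy says very little. The sketch below uses only the two counts from the table above to show the accuracy of a trivial "always student-written" model, together with inverse-frequency class weights. The weighting formula is one common choice and an assumption here; the notebook itself does not use it.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 1375, 1: 3}
total = sum(counts.values())

# Accuracy of a trivial model that always predicts "student-written" (class 0).
baseline_accuracy = counts[0] / total

# Inverse-frequency ("balanced") class weights: total / (n_classes * count).
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"baseline accuracy: {baseline_accuracy:.4f}")
for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.3f}")
# → baseline accuracy: 0.9978
# → class 0: weight 0.501
# → class 1: weight 229.667
```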

In [8]:
train_prompts = pd.read_csv("D:/kaggle/AI text/data/train_prompts.csv")
source_text_0 = train_prompts['source_text'][0]
source_text_0

'# In German Suburb, Life Goes On Without Cars by Elisabeth Rosenthal\n\n1 VAUBAN, Germany—Residents of this upscale community are suburban pioneers, going where few soccer moms or commuting executives have ever gone before: they have given up their cars.\n\n2 Street parking, driveways and home garages are generally forbidden in this experimental new district on the outskirts of Freiburg, near the French and Swiss borders. Vauban’s streets are completely “car-free”—except the main thoroughfare, where the tram to downtown Freiburg runs, and a few streets on one edge of the community. Car ownership is allowed, but there are only two places to park—large garages at the edge of the development, where a car-owner buys a space, for $40,000, along with a home.\n\n3 As a result, 70 percent of Vauban’s families do not own cars, and 57 percent sold a car to move here. “When I had a car I was always tense. I’m much happier this way,” said Heidrun Walter, a media trainer and mother of two, as she 

It is not yet clear what the source text is useful for, so let's look at an essay first.

In [16]:
essay_0 = train_data['text'][0]
essay_0

'Cars. Cars have been around since they became famous in the 1900s, when Henry Ford created and built the first ModelT. Cars have played a major role in our every day lives since then. But now, people are starting to question if limiting car usage would be a good thing. To me, limiting the use of cars might be a good thing to do.\n\nIn like matter of this, article, "In German Suburb, Life Goes On Without Cars," by Elizabeth Rosenthal states, how automobiles are the linchpin of suburbs, where middle class families from either Shanghai or Chicago tend to make their homes. Experts say how this is a huge impediment to current efforts to reduce greenhouse gas emissions from tailpipe. Passenger cars are responsible for 12 percent of greenhouse gas emissions in Europe...and up to 50 percent in some carintensive areas in the United States. Cars are the main reason for the greenhouse gas emissions because of a lot of people driving them around all the time getting where they need to go. Article

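The text contains escape characters (such as `\n`), so it needs to be cleaned. In addition, NLP pipelines commonly remove stop words and punctuation.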

In [18]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # Replace any character that is not a word character or a whitespace character with an empty string
    words = text.split()  # Tokenize
    words = [word.lower() for word in words if word.isalpha()]  # Lowercase and remove non-alphabetic words
    words = [word for word in words if word not in stop_words]  # Remove stop words
    return ' '.join(words)
train_data['clean_text'] = train_data['text'].apply(clean_text)
train_data['clean_text'][0]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tan13\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'cars cars around since became famous henry ford created built first modelt cars played major role every day lives since people starting question limiting car usage would good thing limiting use cars might good thing like matter article german suburb life goes without cars elizabeth rosenthal states automobiles linchpin suburbs middle class families either shanghai chicago tend make homes experts say huge impediment current efforts reduce greenhouse gas emissions tailpipe passenger cars responsible percent greenhouse gas emissions europeand percent carintensive areas united states cars main reason greenhouse gas emissions lot people driving around time getting need go article paris bans driving due smog robert duffer says paris days nearrecord pollution enforced partial driving ban clear air global city also says monday motorist evennumbered license plates ordered leave cars home fined fine order would applied oddnumbered plates following day cars reason polluting entire cities like pa
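
The cleaned output above contains fused tokens such as `europeand`, because `clean_text` deletes punctuation instead of replacing it with a space. The sketch below, on a fragment adapted from the essay, compares the two behaviours. Replacing with a space is a possible variant, not what the notebook does.

```python
import re

sample = "emissions in Europe...and up to 50 percent in some areas"

# Deleting punctuation (as clean_text does) glues neighbouring words together.
deleted = re.sub(r"[^\w\s]", "", sample)

# Replacing punctuation with a space keeps the word boundary.
spaced = re.sub(r"[^\w\s]", " ", sample)

print(deleted.split())
print(spaced.split())
```

The first list contains the single token `Europeand`, while the second keeps `Europe` and `and` as separate words.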

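Split the training data again into a training subset and a validation subset. This is a single hold-out split rather than full cross-validation, and it is used to detect overfitting.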

In [21]:
X_train, X_validate, y_train, y_validate = train_test_split(train_data['clean_text'], train_data['generated'], test_size=0.2, random_state=42)

X holds the features and y the labels; the training subset takes 80% of the data.
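
Because only 3 essays are labelled as generated, a purely random 80/20 split can leave the validation subset without any positive example. The following is a minimal pure-Python sketch of a stratified split. The function `stratified_split` is hypothetical and written for illustration; with scikit-learn the same effect is usually obtained through the `stratify=` argument of `train_test_split`.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting every class separately.

    Classes with at least two samples end up with at least one sample on each side.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Per-class test size, clamped so neither side ends up empty.
        n_test = round(len(indices) * test_fraction)
        n_test = min(max(n_test, 1), len(indices) - 1)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Same class counts as the training set: 1375 student essays, 3 LLM essays.
labels = [0] * 1375 + [1] * 3
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in train_idx), sum(labels[i] for i in test_idx))
# → 1102 276
# → 2 1
```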

In [23]:
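# Tokenization: load the tokenizer that matches the pretrained BERT model
# (padding, truncation and max_length take effect as arguments of the tokenizer call, not of from_pretrained)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)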


In [26]:
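# Encode the text: pad to the longest sequence and truncate to at most 128 tokens
encoded_train = tokenizer(X_train.tolist(), padding=True, truncation=True, max_length=128, return_tensors='pt')
encoded_validate = tokenizer(X_validate.tolist(), padding=True, truncation=True, max_length=128, return_tensors='pt')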
encoded_train

{'input_ids': tensor([[ 101, 3765, 2191,  ...,    0,    0,    0],
        [ 101, 2420, 2156,  ...,    0,    0,    0],
        [ 101, 4883, 2602,  ...,    0,    0,    0],
        ...,
        [ 101, 2715, 9935,  ...,    0,    0,    0],
        [ 101, 6092, 2267,  ...,    0,    0,    0],
        [ 101, 2047, 2287,  ...,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
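
In the output above every row starts with 101 and ends in zeros, and `attention_mask` is 1 for real tokens and 0 for padding. The sketch below reproduces that layout by hand. The token ids in `batch` are arbitrary and the helper `pad_batch` is hypothetical; the special ids 0, 101 and 102 are the `[PAD]`, `[CLS]` and `[SEP]` ids of `bert-base-uncased`.

```python
PAD_ID = 0    # [PAD] in bert-base-uncased
CLS_ID = 101  # [CLS]
SEP_ID = 102  # [SEP]


def pad_batch(sequences, max_length):
    """Add [CLS]/[SEP], truncate to max_length, pad to the longest sequence in the batch."""
    wrapped = [[CLS_ID] + seq[: max_length - 2] + [SEP_ID] for seq in sequences]
    width = max(len(seq) for seq in wrapped)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in wrapped]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in wrapped]
    return input_ids, attention_mask


# Hypothetical token ids for two short texts of different length.
batch = [[3765, 2191, 2420], [4883]]
input_ids, attention_mask = pad_batch(batch, max_length=8)
print(input_ids)
print(attention_mask)
# → [[101, 3765, 2191, 2420, 102], [101, 4883, 102, 0, 0]]
# → [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```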

In [25]:
# Convert labels to tensors
train_labels = torch.tensor(y_train.values)
validate_labels = torch.tensor(y_validate.values)

In [32]:
# Create TensorDatasets
train_dataset = TensorDataset(encoded_train['input_ids'], encoded_train['attention_mask'], train_labels)
validate_dataset = TensorDataset(encoded_validate['input_ids'], encoded_validate['attention_mask'], validate_labels)
# DataLoader for efficient processing
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True) # shuffle=True randomizes the sample order every epoch
validate_loader = DataLoader(validate_dataset, batch_size=16, shuffle=False)

In [28]:
# Define the BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

The output above shows the architecture of a BERT-based sequence classification model.
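
The warning above says that only `classifier.weight` and `classifier.bias` are newly initialized. A short calculation, using the sizes printed in the architecture and `num_labels=2`, shows how small that head is next to the pretrained embedding block alone. It assumes the classification head is a single linear layer from the hidden size to the number of labels.

```python
# Sizes read from the printed architecture above.
vocab_size, hidden, max_positions, token_types = 30522, 768, 512, 2
num_labels = 2  # passed to from_pretrained above

embedding_params = (
    vocab_size * hidden       # word_embeddings
    + max_positions * hidden  # position_embeddings
    + token_types * hidden    # token_type_embeddings
    + 2 * hidden              # LayerNorm weight and bias
)

# Classification head: one linear layer from hidden size to num_labels (weight + bias).
classifier_params = hidden * num_labels + num_labels

print(f"embedding parameters:  {embedding_params:,}")
print(f"classifier parameters: {classifier_params:,}")
# → embedding parameters:  23,837,184
# → classifier parameters: 1,538
```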

In [29]:
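# torch.optim.AdamW replaces the deprecated transformers.AdamW; it has no correct_bias option
# weight_decay=0.0 matches the default of the transformers implementation
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)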
epochs = 10
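
To get a feel for how long this configuration trains, the number of optimizer steps can be worked out from the dataset size, the 80/20 split and the batch size used above. This is plain arithmetic, assuming the `DataLoader` keeps the last partial batch (its default).

```python
import math

n_samples = 1378   # rows in train_essays.csv
test_size = 0.2    # validation fraction used in train_test_split
batch_size = 16
epochs = 10

n_validate = math.ceil(n_samples * test_size)  # scikit-learn rounds the test part up
n_train = n_samples - n_validate
steps_per_epoch = math.ceil(n_train / batch_size)  # DataLoader keeps the last partial batch
total_steps = steps_per_epoch * epochs

print(n_train, n_validate, steps_per_epoch, total_steps)
# → 1102 276 69 690
```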



In [33]:
for epoch in range(epochs):
    model.train()
    total_loss = 0

    for batch in train_loader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Gradient clipping to avoid exploding gradients
        optimizer.step()

    avg_train_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch + 1}/{epochs}, Average Training Loss: {avg_train_loss:.2f}")

KeyboardInterrupt: 
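
Step 4 of the approach (predict and save to csv) is not reached in this run. As a minimal sketch of that step, the code below turns two-class logits into the probability of class 1 with a softmax and writes the result as csv. The essay ids, the logit values and the `id`/`generated` output columns are assumptions for illustration; with the trained model the logits would come from `model(...).logits`, and `torch.softmax(logits, dim=1)[:, 1]` computes the same quantity.

```python
import csv
import io
import math


def prob_generated(logits):
    """Softmax over [logit_class_0, logit_class_1]; return the class-1 probability."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    return exps[1] / sum(exps)


# Hypothetical essay ids and logits; real values would come from model(...).logits.
example_logits = {"essay_a": [2.0, -1.0], "essay_b": [0.0, 0.0]}

buffer = io.StringIO()  # stands in for an open CSV file
writer = csv.writer(buffer)
writer.writerow(["id", "generated"])
for essay_id, logits in example_logits.items():
    writer.writerow([essay_id, f"{prob_generated(logits):.4f}"])

print(buffer.getvalue())
```

Writing to an `io.StringIO` keeps the example free of file paths; replacing it with `open(path, "w", newline="")` would write an actual file.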