## Step 1 - Install the required Python environment and packages
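
The later cells import `pandas` and `zeno_client`, so it can help to confirm both are importable before going further. A minimal sketch of such a check; the helper `missing_packages` is hypothetical, and `zeno-client` as the PyPI name for `zeno_client` is an assumption:

```python
from importlib.util import find_spec

# Import names this notebook relies on, mapped to their PyPI package names.
# "zeno-client" as the PyPI name for zeno_client is an assumption.
REQUIRED = {"pandas": "pandas", "zeno_client": "zeno-client"}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of required modules that cannot be imported."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported missing, install it in the active environment and re-run the check.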

## Step 2 - Load and clean the data

In [3]:
import pandas as pd

df = pd.read_csv("./tweets.csv")
df.head(5)

FileNotFoundError: [Errno 2] No such file or directory: './tweets.csv'

We can see that Roberta uses 0/1/2 for negative/neutral/positive, while gpt2 uses LABEL_0/1/2 for the same three classes. To simplify later processing, we normalize both to one format.

In [None]:
def label_map_gpt2(x):
    if x == "LABEL_0":
        return 0
    elif x == "LABEL_1":
        return 1
    elif x == "LABEL_2":
        return 2
    else:
        return x
    
def label_map_roberta(x):
    if x == 0:
        return 0
    elif x == 1:
        return 1
    elif x == 2:
        return 2
    else:
        return x

# Normalize the label format of both model outputs
df['gpt2'] = df['gpt2'].map(label_map_gpt2)
df['roberta'] = df['roberta'].map(label_map_roberta)

# Convert the label column to numeric values as well so it can be compared
def label_map_text(x):
    if x == "negative":
        return 0
    elif x == "neutral":
        return 1
    elif x == "positive":
        return 2
    else:
        return x

df['label'] = df['label'].map(label_map_text)
df.head(5)

Unnamed: 0,text,label,roberta,roberta_score,gpt2,gpt2_score
0,@user @user what do these '1/2 naked pics' hav...,1,0,0.8047260642051697,2,0.9134505987167358
1,OH: âI had a blue penis while I was thisâ?[...,1,1,0.8669487237930298,1,0.7534046173095703
2,"@user @user That's coming, but I think the vic...",1,1,0.7637239098548889,2,0.9999619722366332
3,I think I may be finally in with the in crowd ...,2,2,0.7740470767021179,2,0.8987836837768555
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",0,1,0.4163974821567535,2,0.9864314198493958


Zeno requires an `input_length` column and an `id` column, so we add both to the data.

In [None]:
df["input_length"] = df["text"].str.len()
df['id'] = df.index
df.head(5)

Unnamed: 0,text,label,roberta,roberta_score,gpt2,gpt2_score,input_length,id
0,@user @user what do these '1/2 naked pics' hav...,1,0,0.8047260642051697,2,0.9134505987167358,96,0
1,OH: âI had a blue penis while I was thisâ?[...,1,1,0.8669487237930298,1,0.7534046173095703,75,1
2,"@user @user That's coming, but I think the vic...",1,1,0.7637239098548889,2,0.9999619722366332,87,2
3,I think I may be finally in with the in crowd ...,2,2,0.7740470767021179,2,0.8987836837768555,83,3
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",0,1,0.4163974821567535,2,0.9864314198493958,133,4


In [None]:
# Check dtypes and unique values
print("Label column unique values:", df['label'].unique())
print("Label column dtype:", df['label'].dtype)
print("GPT2 column unique values:", df['gpt2'].unique())
print("Roberta column unique values:", df['roberta'].unique())

# Check for missing or anomalous values
print("\nMissing values:")
print(df.isnull().sum())

# Check the shape of the data
print(f"\nDataframe shape: {df.shape}")

# Show the first few rows to inspect the data
df.head()

Label column unique values: [1 2 0 ' they MUST discredit #PizzaGate"'
 ' buddy. There are lots of nasty women and bad hombres that are staying too."'
 " and Hillary most likely didn't win the popular vote"
 ' homeopathy? Patients need to know this is a scam with no basis in science (or reality)."'
 'her sis was big in election'
 ' and I will still like the one with Zac Efron better."'
 ' human or animal"'
 ' does it ð\x9f\x98\x94 #fracking destroys #environment @user"'
 ' get it together. #TheWalkingDead"' ' etc.: #Palestine #Israel"'
 ' which will include marine conservation'
 " but can someone enlighten me as to what's next?? Venezuelans rejoiced when Chavez died"
 ' I can\'t help but feel like David Blaine. #magic #static"']
Label column dtype: object
GPT2 column unique values: [2 1 0 '93' '92' '116' '112' '96' '102' '109' '105' '115' '132' '97' '107'
 '110' '0.7893977761268616' '76' '63' '113' '131' '79' '38' '72' '89'
 '114' '103' '108' '57' '0.8704230189323425' '87' '39' '106' '9

Unnamed: 0,text,label,roberta,roberta_score,gpt2,gpt2_score,input_length,id
0,@user @user what do these '1/2 naked pics' hav...,1,0,0.8047260642051697,2,0.9134505987167358,96,0
1,OH: âI had a blue penis while I was thisâ?[...,1,1,0.8669487237930298,1,0.7534046173095703,75,1
2,"@user @user That's coming, but I think the vic...",1,1,0.7637239098548889,2,0.9999619722366332,87,2
3,I think I may be finally in with the in crowd ...,2,2,0.7740470767021179,2,0.8987836837768555,83,3
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",0,1,0.4163974821567535,2,0.9864314198493958,133,4
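
The unique values printed above include fragments of tweet text in `label` and numeric strings in `gpt2`. This suggests that some rows were parsed with their columns shifted, plausibly because of unquoted commas inside the tweet text (an assumption, since the raw file is not shown here). A minimal sketch for counting such rows, using an illustrative toy frame and an assumed `VALID` list:

```python
import pandas as pd

# Toy frame mimicking the problem: the second row has shifted columns.
toy = pd.DataFrame({
    "text": ["great day", "well", "so sad"],
    "label": [2, " that was odd", 0],
    "gpt2": [2, "93", 0],
})

VALID = [0, 1, 2]  # the only values expected after label mapping

# A row is treated as malformed if either column holds a value outside VALID.
bad_rows = ~toy["label"].isin(VALID) | ~toy["gpt2"].isin(VALID)

print(f"{bad_rows.sum()} of {len(toy)} rows look malformed")
print(toy[bad_rows])
```

Running the same check on `df` would give a rough count of how many rows the cleaning step needs to drop.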


In [None]:
# Reload and clean the data
df = pd.read_csv("./tweets.csv")

def clean_label_data(x):
    """Normalize a label: accept 0/1/2, LABEL_<n>, or negative/neutral/positive; return None otherwise."""
    if str(x).strip() in ['0', '1', '2']:
        return int(x)
    elif str(x).startswith('LABEL_'):
        try:
            return int(str(x).split('_')[1])
        except ValueError:
            # The suffix after "LABEL_" is not an integer
            return None
    elif str(x).lower() == 'negative':
        return 0
    elif str(x).lower() == 'neutral':
        return 1
    elif str(x).lower() == 'positive':
        return 2
    else:
        return None

# Clean all three label columns
df['label_clean'] = df['label'].apply(clean_label_data)
df['gpt2_clean'] = df['gpt2'].apply(clean_label_data)  
df['roberta_clean'] = df['roberta'].apply(clean_label_data)

# Keep only rows where all three values are valid
df_clean = df[(df['label_clean'].notna()) & 
              (df['gpt2_clean'].notna()) & 
              (df['roberta_clean'].notna())].copy()

# Rebuild the DataFrame with the cleaned columns
df_final = pd.DataFrame({
    'text': df_clean['text'],
    'label': df_clean['label_clean'].astype(int),
    'roberta': df_clean['roberta_clean'].astype(int),
    'roberta_score': df_clean['roberta_score'],
    'gpt2': df_clean['gpt2_clean'].astype(int),
    'gpt2_score': df_clean['gpt2_score'],
    'input_length': df_clean['text'].str.len(),
    'id': range(len(df_clean))
})

print(f"Original data shape: {df.shape}")
print(f"Cleaned data shape: {df_final.shape}")
print("\nLabel distribution:")
print(df_final['label'].value_counts().sort_index())
print("\nGPT2 predictions distribution:")
print(df_final['gpt2'].value_counts().sort_index())
print("\nRoberta predictions distribution:")
print(df_final['roberta'].value_counts().sort_index())

# Point df at the cleaned data
df = df_final
df.head()

Original data shape: (460, 9)
Cleaned data shape: (408, 8)

Label distribution:
label
0    123
1    201
2     84
Name: count, dtype: int64

GPT2 predictions distribution:
gpt2
0     66
1    110
2    232
Name: count, dtype: int64

Roberta predictions distribution:
roberta
0    153
1    168
2     87
Name: count, dtype: int64


Unnamed: 0,text,label,roberta,roberta_score,gpt2,gpt2_score,input_length,id
0,@user @user what do these '1/2 naked pics' hav...,1,0,0.8047260642051697,2,0.9134505987167358,96,0
1,OH: âI had a blue penis while I was thisâ?[...,1,1,0.8669487237930298,1,0.7534046173095703,75,1
2,"@user @user That's coming, but I think the vic...",1,1,0.7637239098548889,2,0.9999619722366332,87,2
3,I think I may be finally in with the in crowd ...,2,2,0.7740470767021179,2,0.8987836837768555,83,3
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",0,1,0.4163974821567535,2,0.9864314198493958,133,4
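
The distributions above show GPT2 predicting class 2 for 232 of the 408 rows, while only 84 rows carry that label. A confusion table makes this kind of skew easier to read. A minimal sketch with `pd.crosstab` on a toy frame (the data below is illustrative, not taken from `tweets.csv`):

```python
import pandas as pd

toy = pd.DataFrame({
    "label": [0, 0, 1, 1, 2, 2],
    "gpt2":  [2, 0, 2, 1, 2, 2],
})

# Rows are true labels, columns are predictions; each cell is a count.
confusion = pd.crosstab(toy["label"], toy["gpt2"],
                        rownames=["label"], colnames=["gpt2"])
print(confusion)
```

On the notebook's data the equivalent call would be `pd.crosstab(df["label"], df["gpt2"])`.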


## Step 3 - Launch Zeno for model analysis

Create a [Zeno](https://hub.zenoml.com/account) account, then read the code below and run it. Once it completes, the created project will appear under your account.

In [None]:
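import os

import pandas as pd
from zeno_client import ZenoClient, ZenoMetric

# Read the API key from the environment rather than hardcoding it in the notebook.
# ZENO_API_KEY is an assumed variable name; set it to the key from your Zeno account.
client = ZenoClient(os.environ["ZENO_API_KEY"])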

# Create the project
proj = client.create_project(
    name="Twitter Sentiment Analysis",
    view="text-classification",
    metrics=[
        ZenoMetric(name="roberta_accuracy", type="mean", columns=["roberta_correct"]),
        # Define the matching evaluation metric for the gpt2 model
        ZenoMetric(name="gpt2_accuracy", type="mean", columns=["gpt2_correct"]),
    ]
)

proj.upload_dataset(df, id_column="id", data_column='text', label_column="label")

# Build the system DataFrame for the Roberta model
df_roberta = pd.DataFrame({
    "id": df["id"],
    "output": df["roberta"],
    "roberta_correct": (df["roberta"] == df["label"]).astype(int)
})
proj.upload_system(df_roberta, name="Roberta", id_column="id", output_column="output")

# Build the system DataFrame for the gpt2 model
df_gpt2 = pd.DataFrame({
    "id": df["id"],
    "output": df["gpt2"],
    "gpt2_correct": (df["gpt2"] == df["label"]).astype(int)
})
proj.upload_system(df_gpt2, name="GPT2", id_column="id", output_column="output")

Successfully created project.
Access your project at  https://hub.zenoml.com/project/d5003e37-a40b-4e99-9b52-4e694d80f987/Twitter%20Sentiment%20Analysis


Successfully uploaded data


100%|██████████| 1/1 [00:00<00:00,  1.30it/s]


Successfully uploaded system


100%|██████████| 1/1 [00:00<00:00,  1.30it/s]

Successfully uploaded system
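
Both metrics in the cell above are declared with `type="mean"` over a 0/1 `*_correct` column, so what they report is plain accuracy. As a cross-check on the numbers Zeno displays, the same quantity can be computed locally. A minimal sketch on toy data (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "label":   [0, 1, 2, 1],
    "roberta": [0, 1, 1, 1],
    "gpt2":    [2, 1, 2, 2],
})

# The mean of a 0/1 correctness column equals the fraction of correct rows.
accuracy = {}
for model in ["roberta", "gpt2"]:
    correct = (toy[model] == toy["label"]).astype(int)
    accuracy[model] = correct.mean()
    print(f"{model}_accuracy = {accuracy[model]:.2f}")
```

Replacing `toy` with the cleaned `df` should give values that match the project's overall metrics.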





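## Step 4 - Create analysis slices and dig deeper

You need to create 5 different slices in total using the interface Zeno provides.

Start by creating two slices:

1. Tweets with hashtags (containing "#")
2. Tweets containing a strongly positive word (e.g. love); choose a word yourself

To create a slice, click the "+" button. Slices can be defined by basic value matching or by regular expression; see the [documentation](https://zenoml.com/docs/intro/) for details.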

![image.png](images/image.png)
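
The two starter slices can be previewed locally before building them in the Zeno UI. A minimal sketch using pandas string matching on a toy frame; the tweets and the choice of "love" as the keyword are illustrative, and this is plain pandas rather than the Zeno API:

```python
import pandas as pd

toy = pd.DataFrame({"text": [
    "I love this song #music",
    "Traffic again this morning",
    "Loved the finale! #TheWalkingDead",
    "glove compartment is stuck",
]})

# Slice 1: basic substring match on "#".
has_hashtag = toy["text"].str.contains("#", regex=False)

# Slice 2: regex match on the whole word "love", ignoring case.
# \b keeps "glove" and "Loved" out; drop the trailing \b to also match "loved".
has_love = toy["text"].str.contains(r"\blove\b", case=False, regex=True)

print(toy[has_hashtag])
print(toy[has_love])
```

The word-boundary pattern illustrates why a regular expression can be preferable to a basic match: a plain substring search for "love" would also pick up "glove".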

For more interesting ways to use Zeno, see the [README](https://github.com/zeno-ml/zeno) in the Zeno repository.

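Write down the three additional slices you want to create here, create them in your Zeno account, and summarize in a sentence or two how the models perform on each:

3. **Short tweets (length < 50 characters)**: Short texts usually carry less semantic information, so the models may be less accurate here and are prone to misclassifying terse expressions of sentiment.

4. **Tweets with user mentions (@username)**: Tweets containing "@" are usually conversational or interactive. Their sentiment can be more complex or context-dependent, which may make this kind of social exchange harder for the models to interpret.

5. **Tweets with negation words (e.g. "not", "no", "never")**: Negation flips the polarity of a sentence, a classic difficulty in sentiment analysis. Accuracy usually drops on such text because the model has to work out how the negation affects the overall sentiment.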
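
These three slices can be expressed as boolean masks, which is one way to check their size before creating them in Zeno. A minimal sketch, assuming the `text` and `input_length` columns built in Step 2; the toy tweets and the list of negation words are illustrative:

```python
import pandas as pd

toy = pd.DataFrame({"text": [
    "not bad",
    "@user I never said that, and I would appreciate it if you stopped repeating it",
    "Nothing beats a quiet Sunday morning with coffee and a good book on the porch",
]})
toy["input_length"] = toy["text"].str.len()

# Slice 3: short tweets.
is_short = toy["input_length"] < 50

# Slice 4: tweets with a user mention ("@" followed by word characters).
has_mention = toy["text"].str.contains(r"@\w+", regex=True)

# Slice 5: negation words as whole words, so "Nothing" does not match "no" or "not".
NEGATIONS = r"\b(?:not|no|never)\b"
has_negation = toy["text"].str.contains(NEGATIONS, case=False, regex=True)

print(pd.DataFrame({"short": is_short, "mention": has_mention,
                    "negation": has_negation}))
```

The non-capturing group `(?:...)` is a deliberate choice: pandas warns when a pattern passed to `str.contains` has capturing groups.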

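## Newly created slices
1. **Long tweets (length > 100 characters)**: The models' accuracy on this slice is low. These tweets carry more content and may even mix several sentiments, which makes them hard to predict accurately.

2. **Tweets containing an exclamation mark**: Exclamation marks usually signal strong sentiment. Accuracy on this slice is relatively high, since the exclamation mark reinforces the sentiment and makes it easier for the models to pick up.

3. **Tweets with user mentions (@user)**: Tweets containing "@" are usually interactions with other users. Their sentiment may depend on specific context or on more detailed information about the users, and accuracy on this slice is low.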
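
The observations in this list compare accuracy across slices, and the same comparison can be reproduced offline. A minimal sketch of a per-slice accuracy table, assuming the column names from Step 2; `slice_accuracy` is a hypothetical helper, the toy data is illustrative, and the length threshold is shrunk to 10 so the toy tweets can exercise it:

```python
import pandas as pd


def slice_accuracy(df, masks, models=("roberta", "gpt2")):
    """Return a table of slice size and per-model accuracy for each mask."""
    rows = {}
    for name, mask in masks.items():
        subset = df[mask]
        row = {"n": len(subset)}
        for model in models:
            # Mean of the boolean comparison is the fraction predicted correctly.
            row[model] = (subset[model] == subset["label"]).mean()
        rows[name] = row
    return pd.DataFrame.from_dict(rows, orient="index")


toy = pd.DataFrame({
    "text": ["Best day ever!", "ok", "What a mess!", "fine I guess"],
    "label":   [2, 1, 0, 1],
    "roberta": [2, 1, 0, 2],
    "gpt2":    [2, 2, 2, 2],
})

masks = {
    "all": pd.Series(True, index=toy.index),
    "exclamation": toy["text"].str.contains("!", regex=False),
    "long (> 10 chars)": toy["text"].str.len() > 10,
}
table = slice_accuracy(toy, masks)
print(table)
```

Including an `all` row gives a baseline, so a slice's accuracy can be read as higher or lower than the model's overall figure.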

## Submission:
1. This notebook, with your code and outputs preserved
2. A screenshot showing the 5 slices you created