## Step 1 - Install the required Python environment and packages
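
A quick way to confirm the environment is ready is to check that the modules imported later in this notebook can be found. The sketch below is only a convenience check; the mapping from import names to pip package names (`zeno-client`, `python-dotenv`) is an assumption based on the imports used in Step 3, and `missing_packages` is a hypothetical helper.

```python
import importlib.util

# Import names used later in this notebook, mapped to the pip package
# that is assumed to provide each one.
REQUIRED = {
    "pandas": "pandas",
    "zeno_client": "zeno-client",
    "dotenv": "python-dotenv",
}

def missing_packages(required):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```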

## Step 2 - Load and clean the data

In [39]:
import pandas as pd

df = pd.read_csv("./tweets.csv")
df.head(5)

Unnamed: 0,text,label,roberta,roberta_score,gpt2,gpt2_score
0,@user @user what do these '1/2 naked pics' hav...,neutral,0,0.8047260642051697,LABEL_2,0.9134505987167358
1,OH: âI had a blue penis while I was thisâ?[...,neutral,1,0.8669487237930298,LABEL_1,0.7534046173095703
2,"@user @user That's coming, but I think the vic...",neutral,1,0.7637239098548889,LABEL_2,0.9999619722366332
3,I think I may be finally in with the in crowd ...,positive,2,0.7740470767021179,LABEL_2,0.8987836837768555
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",negative,1,0.4163974821567535,LABEL_2,0.9864314198493958


We can see that Roberta uses 0/1/2 for negative/neutral/positive, while gpt2 uses LABEL_0/1/2 for negative/neutral/positive. To simplify later processing, we map both columns to the same label names.

In [None]:
# Lookup tables from each model's raw output to a readable sentiment label.
GPT2_LABELS = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}
ROBERTA_LABELS = {"0": "negative", "1": "neutral", "2": "positive"}

def label_map_gpt2(x):
    # Anything outside the known labels is marked "invalid".
    return GPT2_LABELS.get(str(x), "invalid")

def label_map_roberta(x):
    # str() also covers the case where pandas parsed the column as integers.
    return ROBERTA_LABELS.get(str(x), "invalid")

df['gpt2'] = df['gpt2'].map(label_map_gpt2)
df['roberta'] = df['roberta'].map(label_map_roberta)

# print(f"Before cleaning: {df.shape}")
# Drop rows with invalid labels. Commented out for now because they do not affect the steps below; enable it if you need cleaner data.
# df = df[df['gpt2'] != 'invalid']
# df = df[df['roberta'] != 'invalid']
# df = df[pd.to_numeric(df['roberta_score'], errors='coerce').between(0, 1)]
# df = df[pd.to_numeric(df['gpt2_score'], errors='coerce').between(0, 1)]
# Drop rows whose label is not neutral / positive / negative
# df = df[df['label'].isin(['neutral', 'positive', 'negative'])]
# print(f"After cleaning: {df.shape}")

df.head(5)

Unnamed: 0,text,label,roberta,roberta_score,gpt2,gpt2_score
0,@user @user what do these '1/2 naked pics' hav...,neutral,negative,0.8047260642051697,positive,0.9134505987167358
1,OH: âI had a blue penis while I was thisâ?[...,neutral,neutral,0.8669487237930298,neutral,0.7534046173095703
2,"@user @user That's coming, but I think the vic...",neutral,neutral,0.7637239098548889,positive,0.9999619722366332
3,I think I may be finally in with the in crowd ...,positive,positive,0.7740470767021179,positive,0.8987836837768555
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",negative,neutral,0.4163974821567535,positive,0.9864314198493958
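
To see what the commented-out cleaning filters would do, here is a minimal sketch on a hypothetical toy DataFrame (the rows are invented for illustration and are not from `tweets.csv`). It combines the same checks into one boolean mask: valid label names in all three label columns, and scores that parse as numbers in the range 0 to 1.

```python
import pandas as pd

# Hypothetical toy rows; the real data comes from tweets.csv.
toy = pd.DataFrame({
    "label": ["neutral", "positive", "negative", "unknown"],
    "roberta": ["neutral", "invalid", "negative", "positive"],
    "gpt2": ["positive", "positive", "negative", "neutral"],
    "roberta_score": ["0.80", "0.87", "oops", "0.77"],
    "gpt2_score": [0.91, 0.75, 0.99, 1.5],
})

valid_labels = ["negative", "neutral", "positive"]

def in_unit_interval(series):
    """True where the value parses as a number in [0, 1]; unparsable values become False."""
    return pd.to_numeric(series, errors="coerce").between(0, 1)

mask = (
    toy["label"].isin(valid_labels)
    & toy["roberta"].isin(valid_labels)
    & toy["gpt2"].isin(valid_labels)
    & in_unit_interval(toy["roberta_score"])
    & in_unit_interval(toy["gpt2_score"])
)
clean = toy[mask]
print(f"Before cleaning: {toy.shape}, after cleaning: {clean.shape}")
```

Building a single mask keeps the before/after row counts easy to compare and avoids filtering the DataFrame five times in a row.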


For the Zeno analysis we need to add an `input_length` column and an `id` column, so we process the data a little further.

In [41]:
df["input_length"] = df["text"].str.len()
df['id'] = df.index.astype(str)
df.head(5)

Unnamed: 0,text,label,roberta,roberta_score,gpt2,gpt2_score,input_length,id
0,@user @user what do these '1/2 naked pics' hav...,neutral,negative,0.8047260642051697,positive,0.9134505987167358,96,0
1,OH: âI had a blue penis while I was thisâ?[...,neutral,neutral,0.8669487237930298,neutral,0.7534046173095703,75,1
2,"@user @user That's coming, but I think the vic...",neutral,neutral,0.7637239098548889,positive,0.9999619722366332,87,2
3,I think I may be finally in with the in crowd ...,positive,positive,0.7740470767021179,positive,0.8987836837768555,83,3
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",negative,neutral,0.4163974821567535,positive,0.9864314198493958,133,4
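
Two details of these columns are worth checking before uploading. The sketch below uses a hypothetical toy DataFrame: `str.len()` counts characters (not tokens), a missing tweet would produce NaN unless it is filled first, and the `id` values should be unique because they are presumably what Zeno uses to match rows.

```python
import pandas as pd

# Hypothetical toy rows, including one missing tweet.
toy = pd.DataFrame({"text": ["I love this!", None, "#tag http://t.co/x"]})

# str.len() counts characters, not tokens; missing text would yield NaN,
# so fill it with an empty string first to keep the column integer-valued.
toy["input_length"] = toy["text"].fillna("").str.len()
toy["id"] = toy.index.astype(str)

assert toy["id"].is_unique              # each row needs its own id
assert toy["input_length"].ge(0).all()  # lengths are never negative
print(toy[["id", "input_length"]])
```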


## Step 3 - Launch Zeno for model analysis

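Create a [Zeno](https://hub.zenoml.com/account) account, then read the code below and run it. The code reads your API key from the `ZENO_API_KEY` entry of a local `.env` file. When the run finishes, the newly created project appears under your account.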

In [42]:
from zeno_client import ZenoClient, ZenoMetric
import pandas as pd

import os
from dotenv import load_dotenv
load_dotenv(dotenv_path='.env')
ZENO_API_KEY = os.getenv("ZENO_API_KEY")
client = ZenoClient(ZENO_API_KEY)

# Create the project
proj = client.create_project(
    name="Twitter Sentiment Analysis",
    view="text-classification",
    metrics=[
        ZenoMetric(name="roberta_accuracy", type="mean", columns=["roberta_correct"]),
        ZenoMetric(name="gpt2_accuracy", type="mean", columns=["gpt2_correct"]),
    ]
)

proj.upload_dataset(df, id_column="id", data_column='text', label_column="label")

# Create a separate system DataFrame for each model, starting with Roberta
df_roberta = pd.DataFrame({
    "id": df["id"].astype(str),
    "output": df["roberta"],
    "roberta_correct": (df["roberta"] == df["label"]).astype(int)
})
proj.upload_system(df_roberta, name="Roberta", id_column="id", output_column="output")

df_gpt2 = pd.DataFrame({
    "id": df["id"].astype(str),
    "output": df["gpt2"],
    "gpt2_correct": (df["gpt2"] == df["label"]).astype(int)
})

proj.upload_system(df_gpt2, name="gpt2", id_column="id", output_column="output")

Successfully created project.
Access your project at  https://hub.zenoml.com/project/b75173d4-20c8-494a-a15a-d350e2ce1e56/Twitter%20Sentiment%20Analysis


  0%|          | 0/1 [00:00<?, ?it/s]

Successfully uploaded data


  0%|          | 0/1 [00:00<?, ?it/s]

Successfully uploaded system


  0%|          | 0/1 [00:00<?, ?it/s]

Successfully uploaded system
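
Before trusting the numbers shown in the Zeno UI, it can help to compute the same accuracy locally. The metrics above are declared with `type="mean"` over the `roberta_correct` and `gpt2_correct` columns, so the sketch below assumes the accuracy is simply the mean of the 0/1 correctness indicator. The toy rows mirror the five rows displayed by `df.head(5)`; `accuracy` is a hypothetical helper.

```python
import pandas as pd

# The five rows shown by df.head(5) after label mapping.
toy = pd.DataFrame({
    "label":   ["neutral", "neutral", "neutral", "positive", "negative"],
    "roberta": ["negative", "neutral", "neutral", "positive", "neutral"],
    "gpt2":    ["positive", "neutral", "positive", "positive", "positive"],
})

def accuracy(predictions, labels):
    """Mean of the 0/1 correctness indicator."""
    return (predictions == labels).astype(int).mean()

for system in ["roberta", "gpt2"]:
    print(f"{system}_accuracy = {accuracy(toy[system], toy['label']):.2f}")
```

Running the same function on the full `df` gives a local reference value to compare against the overall accuracy reported by Zeno.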


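## Step 4 - Create slices and analyze further

You need to create 5 different slices in total using the interface Zeno provides.

Start by creating these two slices:

1. Tweets with hashtags (containing "#")
2. Tweets containing a strongly positive word (such as love); pick one word yourself

Slices can be created directly by clicking the "+" button, using either simple value matching or a regular expression. See the [documentation](https://zenoml.com/docs/intro/) for details.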

![image.png](images/image.png)
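
A regular expression can be tried out locally with pandas before it is entered in the slice dialog. The sketch below uses a hypothetical toy DataFrame and a hypothetical helper `slice_mask`; it assumes Zeno's regex matching behaves like Python's `re` for these patterns (in particular the inline `(?i)` flag), which is worth verifying in the UI.

```python
import pandas as pd

# Hypothetical toy rows for trying out slice patterns.
toy = pd.DataFrame({
    "text": [
        "So HAPPY about the #launch today",
        "unhappy with the service",
        "happy days, no tags here",
        "Happy #Friday everyone",
    ],
    "label":   ["positive", "negative", "positive", "positive"],
    "roberta": ["positive", "negative", "positive", "neutral"],
})

def slice_mask(texts, pattern):
    """Boolean mask of rows whose text matches the regular expression."""
    return texts.str.contains(pattern, regex=True, na=False)

slices = {
    "has_hashtag": r"#\w+",
    "contains_happy": r"(?i)\bhappy\b",  # \b keeps "unhappy" out of the slice
}

correct = toy["roberta"] == toy["label"]
for name, pattern in slices.items():
    mask = slice_mask(toy["text"], pattern)
    print(name, "size:", int(mask.sum()), "accuracy:", round(correct[mask].mean(), 2))
```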

For more interesting ways to use Zeno, see the [README](https://github.com/zeno-ml/zeno) in the Zeno repository.

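Write down the three additional slices you want to create here and create them in your Zeno account. For each slice, summarize in one or two sentences how the models behave on it: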

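Overall accuracy is 0.67 (Roberta) and 0.35 (GPT2).
1. Tweets with hashtags (regular expression `#\w+`)
    - Roberta accuracy is 0.63; GPT2 accuracy is 0.38.
    - Both models perform close to their overall level on this slice.
2. Tweets containing a strongly positive word (chosen word: "happy"; regular expression `(?i)\bhappy\b`)
    - Roberta accuracy is 0.71; GPT2 accuracy is 0.57.
    - GPT2 does markedly better here than its overall level, which suggests it recognizes strongly emotional words more reliably.
3. Long text, `input_length > 100`
    - Roberta accuracy is 0.60; GPT2 accuracy is 0.25.
    - Both models get worse on long text, and GPT2 drops further, which suggests GPT2 handles dependencies across long sequences poorly.
4. Short text, `input_length < 30`
    - Roberta accuracy is 0.73; GPT2 accuracy is 0.67.
    - Both models beat their overall level on short text, and GPT2 improves the most (0.35 to 0.67), which suggests it is comparatively strong at classifying short text.
5. Tweets containing an http link (regular expression `http`)
    - Roberta accuracy is 0.44; GPT2 accuracy is 0.44.
    - Roberta is well below its overall level (0.67) on tweets with links, which suggests it does not handle links effectively; GPT2 is above its overall level (0.35), which suggests it copes with links to some extent.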

![](./screenshot.png)
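
The slice comparisons above can also be cross-checked locally. The sketch below, on a hypothetical toy DataFrame, reports each slice's size together with the difference between slice accuracy and overall accuracy per model; `slice_report` is a hypothetical helper. Reporting the size is a deliberate choice here, since an accuracy computed on a very small slice is noisy and should be interpreted with care.

```python
import pandas as pd

# Hypothetical toy rows; replace with the full df to check the real slices.
toy = pd.DataFrame({
    "text": [
        "short one",
        "see http://t.co/abc for details",
        "x" * 120,
        "ok",
    ],
    "label":   ["neutral", "neutral", "negative", "positive"],
    "roberta": ["neutral", "positive", "negative", "positive"],
    "gpt2":    ["positive", "neutral", "positive", "positive"],
})
toy["input_length"] = toy["text"].str.len()

masks = {
    "long (input_length > 100)": toy["input_length"] > 100,
    "short (input_length < 30)": toy["input_length"] < 30,
    "has link (http)": toy["text"].str.contains("http", regex=False),
}

def slice_report(df, mask, systems=("roberta", "gpt2")):
    """Slice size plus, per system, slice accuracy minus overall accuracy."""
    report = {"size": int(mask.sum())}
    for system in systems:
        correct = df[system] == df["label"]
        report[f"{system}_delta"] = correct[mask].mean() - correct.mean()
    return report

for name, mask in masks.items():
    print(name, slice_report(toy, mask))
```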

## Submission
1. This notebook, with your code and its output preserved
2. A screenshot showing the 5 slices you created