## Step 1 - Install the required Python environment and packages
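
This step lists no commands, so the following is only a minimal sketch for checking the environment before moving on. It assumes the import names used later in this notebook (`pandas`, `zeno_client`, `dotenv`) and, as a further assumption, that they are published on PyPI as `pandas`, `zeno-client`, and `python-dotenv`. The helper `missing_packages` is a hypothetical name introduced here.

```python
import importlib.util

# Import name -> assumed PyPI package name.
REQUIRED = {
    "pandas": "pandas",
    "zeno_client": "zeno-client",
    "dotenv": "python-dotenv",
}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of required packages that cannot be imported."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported missing, install it in the same environment that runs the notebook kernel.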

## Step 2 - Load and clean the data

In [51]:
import pandas as pd

df = pd.read_csv("./tweets.csv")
df.head(5)

Unnamed: 0,text,label,roberta,roberta_score,gpt2,gpt2_score
0,@user @user what do these '1/2 naked pics' hav...,neutral,0,0.8047260642051697,LABEL_2,0.9134505987167358
1,OH: âI had a blue penis while I was thisâ?[...,neutral,1,0.8669487237930298,LABEL_1,0.7534046173095703
2,"@user @user That's coming, but I think the vic...",neutral,1,0.7637239098548889,LABEL_2,0.9999619722366332
3,I think I may be finally in with the in crowd ...,positive,2,0.7740470767021179,LABEL_2,0.8987836837768555
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",negative,1,0.4163974821567535,LABEL_2,0.9864314198493958


We can see that Roberta uses 0/1/2 for negative/neutral/positive, while gpt2 uses LABEL_0/1/2 for the same three classes. To simplify later processing, we normalize both columns to the text labels.

In [52]:
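# Map each model's raw output to the shared label vocabulary.
GPT2_LABELS = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}
ROBERTA_LABELS = {"0": "negative", "1": "neutral", "2": "positive"}

def label_map_gpt2(x):
    # Unrecognized values become the string "NaN".
    return GPT2_LABELS.get(x, "NaN")

def label_map_roberta(x):
    # str() lets this work whether pandas parsed the column as strings or integers.
    return ROBERTA_LABELS.get(str(x), "NaN")

df['gpt2'] = df['gpt2'].map(label_map_gpt2)
df['roberta'] = df['roberta'].map(label_map_roberta)
# Keep only rows with a valid gold label; .copy() avoids a
# SettingWithCopyWarning when columns are added in the next cell.
df = df[df["label"].isin(["negative", "neutral", "positive"])].copy()
df.head(5)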

Unnamed: 0,text,label,roberta,roberta_score,gpt2,gpt2_score
0,@user @user what do these '1/2 naked pics' hav...,neutral,negative,0.8047260642051697,positive,0.9134505987167358
1,OH: âI had a blue penis while I was thisâ?[...,neutral,neutral,0.8669487237930298,neutral,0.7534046173095703
2,"@user @user That's coming, but I think the vic...",neutral,neutral,0.7637239098548889,positive,0.9999619722366332
3,I think I may be finally in with the in crowd ...,positive,positive,0.7740470767021179,positive,0.8987836837768555
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",negative,neutral,0.4163974821567535,positive,0.9864314198493958


Zeno requires an input_length column and an id column, so we add both to the data.

In [53]:
df["input_length"] = df["text"].str.len()
df['id'] = df.index
df.head(5)

Unnamed: 0,text,label,roberta,roberta_score,gpt2,gpt2_score,input_length,id
0,@user @user what do these '1/2 naked pics' hav...,neutral,negative,0.8047260642051697,positive,0.9134505987167358,96,0
1,OH: âI had a blue penis while I was thisâ?[...,neutral,neutral,0.8669487237930298,neutral,0.7534046173095703,75,1
2,"@user @user That's coming, but I think the vic...",neutral,neutral,0.7637239098548889,positive,0.9999619722366332,87,2
3,I think I may be finally in with the in crowd ...,positive,positive,0.7740470767021179,positive,0.8987836837768555,83,3
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",negative,neutral,0.4163974821567535,positive,0.9864314198493958,133,4


## Step 3 - Launch Zeno for model analysis

Create a [Zeno](https://hub.zenoml.com/account) account, then read the code below and run it. When it finishes, the new project will appear under your account.
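
The cell below reads `API_KEY` from the environment (populated from a `.env` file by `load_dotenv()`). `os.getenv` returns `None` when the variable is missing, which would only surface later as a confusing client error. A small guard can fail early instead. This is a sketch, not part of the assignment: `require_env` and the variable `DEMO_API_KEY` are hypothetical names used for illustration.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add a line like {name}=<your key> to .env "
            "or export it in the shell."
        )
    return value


# Demonstration with a placeholder variable (not a real key).
os.environ["DEMO_API_KEY"] = "placeholder-key"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same guard could wrap the real variable, for example `API_KEY = require_env("API_KEY")` after `load_dotenv()`.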

In [55]:
from zeno_client import ZenoClient, ZenoMetric
from dotenv import load_dotenv
import os
import pandas as pd

load_dotenv()
API_KEY = os.getenv("API_KEY")
client = ZenoClient(API_KEY)

# Create the project
proj = client.create_project(
    name="Twitter Sentiment Analysis",
    view="text-classification",
    metrics=[
        ZenoMetric(name="roberta_accuracy", type="mean", columns=["roberta_correct"]),
        # TODO: following the pattern above, create the matching metric for the gpt2 model
        ZenoMetric(name="gpt2_accuracy", type="mean", columns=["gpt2_correct"])
    ]
)

proj.upload_dataset(df, id_column="id", data_column='text', label_column="label")

# Build the system DataFrame for the Roberta model
df_roberta = pd.DataFrame({
    "id": df["id"],
    "output": df["roberta"],
    "roberta_correct": (df["roberta"] == df["label"]).astype(int)
})
proj.upload_system(df_roberta, name="Roberta", id_column="id", output_column="output")
# TODO: following the pattern above, build the system DataFrame for the gpt2 model
df_gpt2 = pd.DataFrame({
    "id": df["id"],
    "output": df["gpt2"],
    "gpt2_correct": (df["gpt2"] == df["label"]).astype(int)
})
proj.upload_system(df_gpt2, name="gpt2", id_column="id", output_column="output")


Successfully updated project.
Access your project at  https://hub.zenoml.com/project/e10c4455-8b15-44c4-b0c9-920955a1239c/Twitter%20Sentiment%20Analysis


100%|██████████| 1/1 [00:01<00:00,  1.66s/it]


Successfully uploaded data


100%|██████████| 1/1 [00:01<00:00,  1.38s/it]


Successfully uploaded system


100%|██████████| 1/1 [00:04<00:00,  4.05s/it]

Successfully uploaded system
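
As a sanity check on what Zeno will display, the two `mean` metrics can be reproduced locally: each is just the fraction of rows where the model's output equals the label. The sketch below uses only the five rows shown in the `df.head(5)` output above, so the numbers describe that preview and not the full dataset. `accuracy` is a hypothetical helper name.

```python
import pandas as pd

# The five rows shown by df.head(5) after cleaning (labels only).
sample = pd.DataFrame({
    "label":   ["neutral", "neutral", "neutral", "positive", "negative"],
    "roberta": ["negative", "neutral", "neutral", "positive", "neutral"],
    "gpt2":    ["positive", "neutral", "positive", "positive", "positive"],
})


def accuracy(frame, model_column, label_column="label"):
    """Fraction of rows where the model's prediction equals the gold label."""
    return float((frame[model_column] == frame[label_column]).mean())


for model in ("roberta", "gpt2"):
    print(model, accuracy(sample, model))
```

Running the same function on the full `df` should give values comparable to the `roberta_accuracy` and `gpt2_accuracy` metrics in the Zeno project.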





## Step 4 - Create analysis slices and analyze further

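You need to create 5 different slices in total using the interface Zeno provides.

Start by creating two slices:

1. Tweets with hashtags (containing "#")
2. Tweets containing a strongly positive word (such as love); pick the word yourself

To create a slice, click the "+" button. A slice can be defined by simple value matching or by a regular expression; see the [documentation](https://zenoml.com/docs/intro/) for details.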

![image.png](images/image.png)

For more interesting uses of Zeno, see the [README](https://github.com/zeno-ml/zeno) in the Zeno repository.
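
Before creating slices in the Zeno UI, it can help to prototype the matching rules locally. The sketch below approximates the hashtag slice and the positive-word slice with pandas regular expressions and reports each model's accuracy on the matching rows. The tweets are made up for illustration, `SLICES` and `slice_accuracy` are hypothetical names, and Zeno's own regex matching may differ in details such as case sensitivity.

```python
import pandas as pd

# Made-up tweets for illustration only.
sample = pd.DataFrame({
    "text": [
        "I love this #sunny day",
        "@user traffic was awful",
        "new phone arrives tomorrow #excited",
        "love the new album",
    ],
    "label":   ["positive", "negative", "positive", "positive"],
    "roberta": ["positive", "negative", "neutral", "positive"],
    "gpt2":    ["positive", "positive", "positive", "neutral"],
})

# Slice name -> regular expression applied to the tweet text.
SLICES = {
    "has_hashtag": r"#\w+",
    "contains_love": r"\blove\b",
}


def slice_accuracy(frame, pattern, model_column):
    """Accuracy of one model on the rows whose text matches `pattern`."""
    mask = frame["text"].str.contains(pattern, case=False, regex=True)
    subset = frame[mask]
    if subset.empty:
        return None  # no rows in this slice
    return float((subset[model_column] == subset["label"]).mean())


for name, pattern in SLICES.items():
    for model in ("roberta", "gpt2"):
        print(name, model, slice_accuracy(sample, pattern, model))
```

The same function can be pointed at the full `df` with other patterns, such as `r"@\w+"` for mentions, to preview a slice before building it in Zeno.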

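Write down the three additional slices you want to create here, and create them in your Zeno account. For each slice, summarize in a sentence or two how the models perform on it:

3. negative words:
   
   gpt2 outperforms roberta
   
4. positive words:
   
   Both roberta and gpt2 perform well, with roberta ahead
   
5. @someone:
   
   roberta outperforms gpt2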

## Submission:
1. This notebook, with your code and its output preserved
2. A screenshot showing the 5 slices you created
   ![slice.png](images/slice.png)