Label articles with the victims' race and the surrounding sentences

In [2]:
import pandas as pd
import openai
import os
import json
from tqdm import tqdm
from dotenv import load_dotenv

article_num = 634

# Load API key
load_dotenv()
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Load your data
articles_df = pd.read_csv("articles.csv").head(article_num)
definitions_df = pd.read_excel("racism_types_definitions.xlsx")
samples_df = pd.read_excel("sample_racism_classification_truncated.xlsx")

# Build the concept list for the prompt, one "concept: definition" line per row (definitions are not truncated)
concept_defs = "\n".join(
    f"{row['concepts']}: {row['definitions']}"
    for _, row in definitions_df.iterrows()
)

examples = "\n".join(
    f'"{row["annotated_sentence"]}" → {row["annotation_content"]}'
    for _, row in samples_df.iterrows()
)

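# Split long articles into ~3000-char chunks at sentence boundaries.
# A single sentence longer than max_chars becomes its own oversized chunk.
def split_text(text, max_chars=3000):
    if not isinstance(text, str):   # e.g. NaN from an empty CSV cell
        return []
    sentences = text.split('. ')
    chunks, buf = [], ""
    for i, sent in enumerate(sentences):
        sent = sent.strip()
        if not sent:
            continue
        # split('. ') removed the period from every piece except the last one
        piece = sent if i == len(sentences) - 1 else sent + "."
        # flush only a non-empty buffer, so no empty chunk is ever emitted
        if buf and len(buf) + len(sent) + 2 >= max_chars:
            chunks.append(buf.strip())
            buf = ""
        buf += piece + " "
    if buf.strip():
        chunks.append(buf.strip())
    return chunks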

# Build prompt for one chunk
def build_prompt(chunk_text):
    return f"""
You are a sociology professor with 30 years of experience studying racism against Asians.
Your task is to identify the quotes in articles that match your list of types of racism concepts, 
and also label who the victim is based on their race. 

1) First, read through the racism concept definitions. 
You need to understand these definitions so you can accurately recognize when a quote fits one or more of these concepts:
{concept_defs}

2) Next, review the example labeled quotes provided. 
You need to study these examples to see how quotes have been matched to concepts in practice, which will guide your own labeling decisions:
{examples}

3) Now read the ARTICLE CHUNK below. For each quote that matches at least one concept, output:
  • quote: the exact text from the article (if it’s under 50 chars, include one sentence before and after as “context” instead).
  • concepts: a list of matching concept names.
  • victim: the race of the victim. If the race cannot be inferred, label it as unknown.
  • context: (only if the quote itself is under 50 characters; otherwise you can repeat the quote)

ARTICLE CHUNK:
{chunk_text}

Return a JSON array like:
[
  {{ "quote": "...", "context": "...", "concepts": ["C1","C2"], "victim": "Asian" }},
  ...
]
"""

all_results = []

import re

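def clean_json_output(output: str) -> str:
    o = output.strip()
    # If the reply contains a complete fenced block, keep only its contents;
    # this drops any explanation the model appends after the closing fence
    m = re.search(r"```(?:json)?\s*\n?(.*?)```", o, flags=re.DOTALL)
    if m:
        return m.group(1).strip()
    # No closing fence (e.g. a truncated reply): strip a leading ``` or ```json
    o = re.sub(r"^```(?:json)?\s*\n?", "", o)
    return o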

def safe_json_parse(raw: str):
    """
    Try to coerce raw into valid JSON array:
     - Strip markdown fences
     - Remove trailing commas before ] 
     - Ensure opening [ and closing ]
    Returns Python list or None if it still fails.
    """
    txt = clean_json_output(raw)
    # remove commas before closing ]
    txt = re.sub(r",\s*]", "]", txt)
    # ensure it starts with [ and ends with ]
    txt = txt.strip()
    if not txt.startswith("["):
        txt = "[" + txt
    if not txt.endswith("]"):
        txt = txt + "]"
    try:
        return json.loads(txt)
    except json.JSONDecodeError:
        return None


for _, row in tqdm(articles_df.iterrows(), total=len(articles_df)):
    article_id = row["id"]
    title = row["title"]
    chunks = split_text(row["ARTICLE_TEXT"])

    for chunk_i, chunk in enumerate(chunks):
        prompt = build_prompt(chunk)
        raw = None  # reset per chunk so the except block never sees a missing or stale value
        try:
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a sociology professor analyzing racism in text. Label quotes using provided concepts."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0
            )

            raw = resp.choices[0].message.content
            labels = safe_json_parse(raw)
            if labels is None:
                print(f"⚠️ Could not parse JSON for article {article_id}, chunk {chunk_i}")
                print("Raw output:", raw)
                continue


            for item in labels:
                all_results.append({
                    "article_id": article_id,
                    "title": title,
                    "quote": item["quote"],
                    "context": item.get("context", item["quote"]),
                    "concepts": ";".join(item["concepts"]),
                    "victim": item["victim"]
                })
        except Exception as e:
            print(f"Error on article {article_id}, chunk {chunk_i}: {e}")
            # 'raw' always exists here, so we can inspect it
            print("Raw output:", raw)
            continue  # skip to next chunk

# Save flattened results
results_df = pd.DataFrame(all_results)
results_df.to_csv("classification_results_with_race.csv", index=False)
print("✅ Done — results including ‘victim’ and ‘context’ saved.")

100%|██████████| 1/1 [00:07<00:00,  7.80s/it]

✅ Done — results including ‘victim’ and ‘context’ saved.
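The loop above indexes `item["quote"]`, `item["concepts"]`, and `item["victim"]` directly, so one malformed item raises an exception and the remaining labels for that chunk are skipped. A minimal sketch of a per-item validator follows. `normalize_item` is a hypothetical helper, not part of the original notebook, and the `"unknown"` default is an assumption taken from the prompt's instruction for victims whose race cannot be inferred.

```python
from typing import Optional

def normalize_item(item, article_id, title) -> Optional[dict]:
    """Return one flat result row, or None if the item is unusable."""
    if not isinstance(item, dict):
        return None
    quote = item.get("quote")
    if not isinstance(quote, str) or not quote.strip():
        return None
    concepts = item.get("concepts") or []
    if isinstance(concepts, str):  # a single name instead of a list
        concepts = [concepts]
    return {
        "article_id": article_id,
        "title": title,
        "quote": quote,
        "context": item.get("context") or quote,
        "concepts": ";".join(str(c) for c in concepts),
        "victim": item.get("victim") or "unknown",
    }
```

Inside the loop, each `item` could be passed through this helper and appended to `all_results` only when the result is not `None`.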





Label articles with the victims' race and the surrounding sentences, using DeepSeek

In [20]:
import pandas as pd
import requests
import os
import json
from tqdm import tqdm
#from dotenv import load_dotenv

article_num = 634

# Load API key
with open("DEEPSEEK_API_KEY.txt", "r") as f:
    DEEPSEEK_API_KEY = f.read().strip()

# Load your data
articles_df = pd.read_csv("articles.csv").head(article_num)
definitions_df = pd.read_excel("racism_types_definitions.xlsx")
samples_df = pd.read_excel("sample_racism_classification_truncated.xlsx")

# Build the concept list for the prompt, one "concept: definition" line per row (definitions are not truncated)
concept_defs = "\n".join(
    f"{row['concepts']}: {row['definitions']}"
    for _, row in definitions_df.iterrows()
)

examples = "\n".join(
    f'"{row["annotated_sentence"]}" → {row["annotation_content"]}'
    for _, row in samples_df.iterrows()
)

# Split long articles into ~3000‑char chunks at sentence boundaries
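def split_text(text, max_chars=3000):
    if not isinstance(text, str):   # e.g. NaN from an empty CSV cell
        return []
    sentences = text.split('. ')
    chunks, buf = [], ""
    for i, sent in enumerate(sentences):
        sent = sent.strip()
        if not sent:
            continue
        # split('. ') removed the period from every piece except the last one
        piece = sent if i == len(sentences) - 1 else sent + "."
        # flush only a non-empty buffer, so no empty chunk is ever emitted
        if buf and len(buf) + len(sent) + 2 >= max_chars:
            chunks.append(buf.strip())
            buf = ""
        buf += piece + " "
    if buf.strip():
        chunks.append(buf.strip())
    return chunks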

# Build prompt for one chunk
def build_prompt(chunk_text):
    return f"""
You are a sociology professor with 30 years of experience studying racism against Asians.
Your task is to identify the quotes in articles that match your list of types of racism concepts, 
and also label who the victim is based on their race. 

1) First, read through the racism concept definitions. 
You need to understand these definitions so you can accurately recognize when a quote fits one or more of these concepts:
{concept_defs}

2) Next, review the example labeled quotes provided. 
You need to study these examples to see how quotes have been matched to concepts in practice, which will guide your own labeling decisions:
{examples}

3) Now read the ARTICLE CHUNK below. For each quote that matches at least one concept, output:
  • quote: the exact text from the article (if it’s under 50 chars, include one sentence before and after as “context” instead).
  • concepts: a list of matching concept names.
  • victim: the race of the victim. If the race cannot be inferred, label it as unknown.
  • context: (only if the quote itself is under 50 characters; otherwise you can repeat the quote)

ARTICLE CHUNK:
{chunk_text}

Return a JSON array like:
[
  {{ "quote": "...", "context": "...", "concepts": ["C1","C2"], "victim": "Asian" }},
  ...
]
"""
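# Resume from an earlier run if a results file exists; otherwise start fresh.
# Note: to_csv below writes UTF-8 by default, so keep the read encoding consistent with how the file was saved.
RESULTS_PATH = "classification_results_with_race_deepseek.csv"
if os.path.exists(RESULTS_PATH):
    results_df = pd.read_csv(RESULTS_PATH, encoding="ISO-8859-1")
    processed_ids = set(results_df["article_id"].unique())
    all_results = results_df.to_dict(orient="records")  # continue collecting
else:
    processed_ids = set()
    all_results = []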

import re

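def clean_json_output(output: str) -> str:
    o = output.strip()
    # If the reply contains a complete fenced block, keep only its contents;
    # this drops any explanation the model appends after the closing fence
    m = re.search(r"```(?:json)?\s*\n?(.*?)```", o, flags=re.DOTALL)
    if m:
        return m.group(1).strip()
    # No closing fence (e.g. a truncated reply): strip a leading ``` or ```json
    o = re.sub(r"^```(?:json)?\s*\n?", "", o)
    return o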

def safe_json_parse(raw: str):
    """
    Try to coerce raw into valid JSON array:
     - Strip markdown fences
     - Remove trailing commas before ] 
     - Ensure opening [ and closing ]
    Returns Python list or None if it still fails.
    """
    txt = clean_json_output(raw)
    # remove commas before closing ]
    txt = re.sub(r",\s*]", "]", txt)
    # ensure it starts with [ and ends with ]
    txt = txt.strip()
    if not txt.startswith("["):
        txt = "[" + txt
    if not txt.endswith("]"):
        txt = txt + "]"
    try:
        return json.loads(txt)
    except json.JSONDecodeError:
        return None


for _, row in tqdm(articles_df.iterrows(), total=len(articles_df)):
    article_id = row["id"]
    if article_id in processed_ids:
        continue
    title = row["title"]
    chunks = split_text(row["ARTICLE_TEXT"])

    for chunk_i, chunk in enumerate(chunks):
        prompt = build_prompt(chunk)
        raw = None  # reset per chunk so the except block never sees a missing or stale value
        try:
            headers = {
                "Authorization": f"Bearer {DEEPSEEK_API_KEY}",
                "Content-Type": "application/json"
            }
            payload = {
                "model": "deepseek-chat",
                "messages": [
                    {"role": "system", "content": "You are a sociology professor analyzing racism in text. Label quotes using provided concepts."},
                    {"role": "user",   "content": prompt}
                ],
                "temperature": 0
            }
            r = requests.post(
                "https://api.deepseek.com/v1/chat/completions",
                headers=headers,
                json=payload,
                timeout=120  # seconds; without a timeout a stalled request blocks the whole run
            )
            r.raise_for_status()
            raw = r.json()["choices"][0]["message"]["content"]
            labels = safe_json_parse(raw)
            if labels is None:
                print(f"⚠️ Could not parse JSON for article {article_id}, chunk {chunk_i}")
                print("Raw output:", raw)
                continue


            for item in labels:
                all_results.append({
                    "article_id": article_id,
                    "title": title,
                    "quote": item["quote"],
                    "context": item.get("context", item["quote"]),
                    "concepts": ";".join(item["concepts"]),
                    "victim": item["victim"]
                })
        except Exception as e:
            print(f"Error on article {article_id}, chunk {chunk_i}: {e}")
            # 'raw' always exists here, so we can inspect it
            print("Raw output:", raw)
            continue  # skip to next chunk

# Save flattened results
results_df = pd.DataFrame(all_results)
results_df.to_csv("classification_results_with_race_deepseek.csv", index=False)
print("✅ Done — results including ‘victim’ and ‘context’ saved.")

  3%|▎         | 16/634 [00:51<33:18,  3.23s/it]

⚠️ Could not parse JSON for article 175, chunk 0
Raw output: ```json
[]
```  

**Explanation:**  
The provided article chunk discusses street harassment and gendered impacts of the pandemic, but it does not contain any explicit references to racism against Asians or the specific concepts outlined in the task (e.g., "China virus," "Ching Chong," anti-Asian hate crimes, etc.). The harassment described is gendered (targeting women broadly) but not racially targeted toward Asians. Thus, no quotes meet the criteria for labeling.  

If you'd like me to analyze a different article chunk with clearer anti-Asian racism examples, please share it!


 46%|████▌     | 291/634 [24:27<6:39:05, 69.81s/it]

⚠️ Could not parse JSON for article 291, chunk 4
Raw output: ```json
[]
```  

**Explanation:**  
The provided article chunk does not contain any direct quotes or descriptions that match the defined racism concepts against Asians. The text is a general statement about raising voices against racism without specific references to anti-Asian racism, victims, or incidents. Thus, no labels are applicable.  

If you provide additional text with concrete examples or quotes, I can analyze them accordingly.


 63%|██████▎   | 400/634 [1:54:15<2:57:01, 45.39s/it]

⚠️ Could not parse JSON for article 401, chunk 1
Raw output: ```json
[
  {
    "quote": "Maintain social distancing because you are Asian.",
    "context": "Recently I was standing in line at Aldi with my four-year-old son and, even though I was following social-distancing guidelines, the woman in front turned around and spat, ‘Maintain social distancing because you are Asian.’ I told her that I was born here and that I didn’t personally cause the virus, but it fell on deaf ears.",
    "concepts": ["Discrimination", "Verbal harassment"],
    "victim": "Asian"
  },
  {
    "quote": "Do people see me, or do they see ‘generic Asian spreading disease’?",
    "context": "I’ve been anxious about going out in public and I ask myself, ‘Do people see me, or do they see ‘generic Asian spreading disease’?’ It’s a terrible way to live.",
    "concepts": ["Scapegoat", "Racial prejudice/bigotry"],
    "victim": "Asian"
  },
  {
    "quote": "Oh, you speak really good English’ to someone with a broad

 64%|██████▎   | 404/634 [1:58:29<3:51:10, 60.31s/it]

⚠️ Could not parse JSON for article 404, chunk 5
Raw output: ```json
[]
```  

**Explanation:**  
After carefully reviewing the provided article chunk, I found no quotes that match the defined concepts of racism against Asians. The text primarily discusses:  
1. General anti-racism efforts (e.g., protests, diversity initiatives)  
2. Systemic racism affecting Black communities (e.g., "second class citizens," Sundown Laws)  
3. Police brutality (e.g., George Floyd)  

None of the quotes reference Asians/Asian Americans or align with the specific concepts (e.g., "China virus," fetishization, perpetual foreigner, etc.). Thus, the output is an empty array.


 67%|██████▋   | 427/634 [2:17:38<2:48:30, 48.84s/it]

⚠️ Could not parse JSON for article 428, chunk 7
Raw output: ```json
[]
```  

**Explanation:**  
The provided article chunk discusses systemic issues affecting vulnerable groups (elderly, homeless) during the pandemic but does **not** contain any explicit references to anti-Asian racism or the specific concepts listed in the definitions. Key observations:  
1. **Focus on Ageism & Class**: The text critiques age discrimination against the elderly and systemic neglect of homeless populations, but these are not tied to racial targeting of Asians.  
2. **No Matching Concepts**: Terms like "China virus" or anti-Asian violence are absent. The closest pandemic-related content involves general critiques of U.S. capitalism, not racial scapegoating.  
3. **Victim Demographics**: While marginalized groups are mentioned (e.g., African Americans, Hispanic Americans), Asians/Asian Americans are not referenced.  

Thus, no quotes meet the labeling criteria.


 68%|██████▊   | 433/634 [2:27:50<4:56:42, 88.57s/it] 

⚠️ Could not parse JSON for article 433, chunk 3
Raw output: ```json
[]
```  

**Explanation:**  
The provided article chunk does not contain any overtly racist language, discriminatory behavior, or incidents matching the defined racism concepts. The text primarily discusses cultural/political perspectives on China-US relations and a student's personal educational experiences, without targeting or victimizing any racial group.  

If you'd like me to analyze a different article chunk with clearer instances of racism, please provide it.


100%|██████████| 634/634 [6:00:04<00:00, 34.08s/it]   

✅ Done — results including ‘victim’ and ‘context’ saved.
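The raw output for article 401 above stops in the middle of a string, which suggests the reply was cut off by an output length limit (an assumption, since the notebook does not record why the reply ended). In that case `safe_json_parse` returns `None` and the two complete items before the cut are lost. A minimal sketch of recovering them with the standard library's `json.JSONDecoder.raw_decode` follows. `salvage_objects` is a hypothetical helper, not part of the original notebook.

```python
import json

def salvage_objects(text: str) -> list:
    """Collect every complete JSON object from a possibly truncated array."""
    decoder = json.JSONDecoder()
    items = []
    pos = text.find("[")
    if pos == -1:
        return items
    pos += 1
    while True:
        # skip whitespace and commas between array elements
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] != "{":
            break
        try:
            obj, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            break  # the remaining text is an incomplete object
        items.append(obj)
    return items
```

One way to use it is as a fallback: when `safe_json_parse(raw)` returns `None`, call `salvage_objects(raw)` and keep whatever complete items it returns, while still logging the chunk as truncated.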



