NLP-ASD

Study repo focused on non-literal language (sarcasm) and autism spectrum disorder (ASD). It builds a pipeline to collect, filter, and analyze sarcasm-related discussions from public web sources.

Project overview

This repository supports a study on how people on the autism spectrum describe, interpret, and react to sarcasm in online discussions.

Core goal:

Scrape public text → run a sarcasm detector → keep sarcasm-relevant content → analyze ASD-related reactions.

Research goal

High-level question

How do people on the autism spectrum describe their reactions to sarcasm, and what patterns appear in their responses (e.g., confusion, distress, coping strategies, learning, cues used, context dependence)?

This repo focuses on:

Collecting relevant text from public web sources (web scraping)
Filtering scraped text using a trained sarcasm detection model
Preparing a dataset for analysis (manual coding, NLP features, clustering, sentiment/emotion, etc.)

Pipeline

1) Web scraping (collection)

A Playwright-based scraper discovers and opens search results, then extracts paragraph-level content.

Script: scrap_updated.py
Output (recommended): raw extracted paragraphs + metadata (URL, timestamp, position, etc.)
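
As a rough illustration of the recommended output, the sketch below turns the paragraph strings extracted from one page into records with URL, position, and timestamp, then serializes them as JSON Lines. The `ScrapedParagraph` schema, the `make_records` and `to_jsonl` helpers, and the `min_chars` cutoff are assumptions made for this example; they are not taken from `scrap_updated.py`.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class ScrapedParagraph:
    """One extracted paragraph plus metadata (hypothetical schema)."""
    text: str
    url: str
    position: int     # index of the paragraph within its page
    scraped_at: str   # ISO 8601 timestamp in UTC


def make_records(url, paragraphs, min_chars=40):
    """Build records for one page, dropping very short fragments."""
    now = datetime.now(timezone.utc).isoformat()
    records = []
    for position, raw in enumerate(paragraphs):
        text = " ".join(raw.split())  # collapse runs of whitespace
        if len(text) >= min_chars:
            records.append(ScrapedParagraph(text, url, position, now))
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)
```

Keeping `position` as the index in the original page (rather than renumbering after short fragments are dropped) makes it possible to locate a paragraph in its source later.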

2) Model-based sarcasm filtering (main use)

The final sarcasm detection model (trained/selected in this repo) will be used to filter the scraped data.

Filtering concept:

Run the model on each scraped text segment (paragraph/post)
Keep items with sarcasm probability above a threshold (e.g., p >= 0.5, tune as needed)
Save:
- text
- sarcasm_score
- is_sarcastic (thresholded label)
- metadata (source URL, timestamp)

This replaces the current keyword-only heuristic (e.g., checking for the word "sarcasm").
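
The filtering concept above can be sketched as a function that takes any scorer returning a sarcasm probability. The names `score_items` and `keyword_score` are hypothetical, and `keyword_score` is only a stand-in that mimics the keyword heuristic; in the intended pipeline the trained model's probability output would be passed as `score_fn` instead.

```python
def keyword_score(text):
    """Stand-in scorer: 1.0 if the word 'sarcasm' appears, else 0.0."""
    return 1.0 if "sarcasm" in text.lower() else 0.0


def score_items(items, score_fn, threshold=0.5):
    """Annotate each scraped item with a score and a thresholded label.

    items: iterable of dicts with a "text" key and optional "url"/"timestamp".
    score_fn: callable mapping text -> sarcasm probability in [0, 1].
    """
    rows = []
    for item in items:
        score = float(score_fn(item["text"]))
        rows.append({
            "text": item["text"],
            "sarcasm_score": score,
            "is_sarcastic": score >= threshold,
            "metadata": {
                "url": item.get("url"),
                "timestamp": item.get("timestamp"),
            },
        })
    return rows


def keep_sarcastic(rows):
    """Keep only rows whose thresholded label is positive."""
    return [row for row in rows if row["is_sarcastic"]]
```

Scoring every item first and filtering afterwards keeps the raw scores available, so the threshold can be tuned without re-running the model.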

Sarcasm Detection Baselines (LLM + RoBERTa)

This repo includes two notebooks to build/benchmark sarcasm detection approaches:

LLM prompt benchmark: prompt-based sarcasm classification across datasets
RoBERTa-Large baseline (TPU-safe): supervised training + evaluation in Colab/TPU

Notebooks

SarcasmGPTBenchmark.ipynb
- Prompt-based sarcasm benchmark across datasets
- Configurable prompt versions and metrics reporting
sarcasm_roberta_large_full_tpu.ipynb
- TPU-safe training pipeline for RoBERTa-Large
- Produces evaluation metrics (e.g., accuracy/F1) and model outputs
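
For reference, accuracy and F1 for a binary sarcasm classifier can be computed as in the minimal sketch below. This is a plain-Python illustration with a hypothetical `binary_metrics` helper; the notebooks may compute their metrics differently (for example with a library).

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```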

Datasets used (in the notebooks)

Hugging Face

cardiffnlp/tweet_eval (irony)
tasksource/figlang2020-sarcasm

Kaggle (via `kagglehub`)

danofer/sarcasm (SARC-style Reddit sarcasm)
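
A minimal sketch of fetching these datasets, assuming the `datasets` and `kagglehub` packages are installed and network access is available. The helper names are hypothetical, and split and column names are deliberately not assumed here; inspect the returned objects before use.

```python
# Dataset identifiers as listed above: (repository id, config name or None).
HF_DATASETS = [
    ("cardiffnlp/tweet_eval", "irony"),
    ("tasksource/figlang2020-sarcasm", None),
]
KAGGLE_DATASET = "danofer/sarcasm"


def load_hf_datasets():
    """Load each Hugging Face dataset; returns {repository id: dataset object}."""
    from datasets import load_dataset  # imported lazily; requires `pip install datasets`
    return {repo: load_dataset(repo, config) for repo, config in HF_DATASETS}


def download_kaggle_dataset():
    """Download the Kaggle dataset and return the local directory path."""
    import kagglehub  # imported lazily; requires `pip install kagglehub`
    return kagglehub.dataset_download(KAGGLE_DATASET)
```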

Ethics & responsible research (important)

Respect website/platform Terms of Service and robots policies.
Prefer public content and minimize collection of personal identifiers.
Avoid re-publishing raw scraped text unless permitted.
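
One way to honor robots policies before fetching a page is to check the URL against the site's robots.txt with Python's standard library. The sketch below uses a hypothetical `allowed_by_robots` helper and parses robots.txt content passed in as a string so it can be run offline; it is an illustration, not part of `scrap_updated.py`, and it does not cover Terms of Service.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example robots.txt content (made up for illustration).
EXAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
```

In a live scraper, `RobotFileParser.set_url()` followed by `read()` would fetch the site's actual robots.txt instead of a string.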

