
Reddit Scraper and Text Cleaning Guide

This project helps you do two things in a simple workflow:

  1. Scrape Reddit posts into JSON
  2. Clean the text inside a notebook for analysis

It is beginner-friendly and works well if you want a small, practical pipeline for collecting r/datascience posts and preparing the selftext field for NLP work.

What This Project Does

  • Scrapes Reddit submission data into a clean JSON file
  • Saves the raw scraped file in data/raw/
  • Includes a notebook guide for text cleaning and preprocessing
  • Lets you export a cleaned JSON file after notebook processing
  • Includes a word cloud step to visualize the most-used cleaned words
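The word cloud step boils down to counting the most frequent tokens in the cleaned text. A minimal sketch of that counting, using only the standard library (the notebook itself uses the wordcloud package; `top_words` is a hypothetical helper, not a function from this repo):

```python
from collections import Counter

def top_words(cleaned_texts, n=5):
    """Count the most common words across cleaned selftext strings."""
    counts = Counter()
    for text in cleaned_texts:
        counts.update(text.split())
    return counts.most_common(n)

posts = ["data science jobs", "data cleaning for data science"]
print(top_words(posts, 2))  # [('data', 3), ('science', 2)]
```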

Project Layout

reddit_nepal_scraper/
|-- README.md
|-- requirements.txt
|-- src/
|   |-- __init__.py
|   |-- scraper.py
|   `-- utils.py
|-- data/
|   |-- raw/
|   `-- cleaned/
|-- notebooks/
|   `-- text_cleaning.ipynb
`-- tests/

Requirements

  • Python 3.9+
  • Windows PowerShell is the easiest setup path for this repo

Setup on Windows PowerShell

Open PowerShell in the project folder, then run these commands one by one:

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt

If PowerShell blocks activation, run:

Set-ExecutionPolicy -Scope Process Bypass
.\.venv\Scripts\Activate.ps1

Important:

  • Run each command on its own line
  • Do not paste two commands together on one line
  • After activation, your prompt should show (.venv)

Run the Scraper

The scraper entrypoint is:

python -m src.scraper

By default, the scraper:

  • targets r/datascience
  • fetches up to one page of results
  • saves one raw JSON file
  • writes output to data/raw/

Raw Output You Should Expect

After a successful scraper run, you should see a file like:

data/raw/datascience_20260511_161217.json
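The file name encodes the subreddit plus a run timestamp. A sketch of how such a name can be built (the repo's actual naming lives in src/scraper.py; `build_raw_path` here is a hypothetical helper, shown only to explain the pattern):

```python
from datetime import datetime
from pathlib import Path

def build_raw_path(subreddit: str, when: datetime, base: str = "data/raw") -> Path:
    """Build a timestamped path like data/raw/datascience_20260511_161217.json."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(base) / f"{subreddit}_{stamp}.json"

p = build_raw_path("datascience", datetime(2026, 5, 11, 16, 12, 17))
print(p.as_posix())  # data/raw/datascience_20260511_161217.json
```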

Each record contains fields such as:

  • id
  • title
  • author
  • selftext
  • subreddit
  • score
  • created_utc
  • permalink
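Once the raw file exists, records can be inspected with the standard json module. A small sketch, assuming the file holds a JSON array of post objects with the fields above (the inline `sample` record here is illustrative, not real scraped data):

```python
import json

# Illustrative record; in practice use json.load(open("data/raw/<your_file>.json"))
sample = ('[{"id": "abc123", "title": "Hiring advice", "author": "u1", '
          '"selftext": "Some text", "subreddit": "datascience", "score": 42, '
          '"created_utc": 1762900337, "permalink": "/r/datascience/comments/abc123/"}]')
posts = json.loads(sample)
for post in posts:
    print(post["id"], post["score"], post["title"])
```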

Open the Notebook for Cleaning

The notebook guide in this repo is:

notebooks/text_cleaning.ipynb

Open it in Jupyter or in VS Code.

The notebook asks you to set two paths near the top:

  • JSON_FILE: the path to your raw scraper output in data/raw/
  • EXPORT_JSON_FILE: the path where the cleaned JSON will be saved

Default example used in the notebook:

JSON_FILE = r"../data/raw/datascience_20260511_161217.json"
EXPORT_JSON_FILE = r"../data/cleaned/cleaned_selftext_posts.json"

Cleaned Output You Should Expect

After running the notebook, you should get a cleaned JSON file in data/cleaned/, for example:

data/cleaned/cleaned_selftext_posts.json

The cleaned output keeps useful fields like:

  • id
  • title
  • author
  • subreddit
  • score
  • raw_selftext
  • clean_selftext
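The cleaning step turns raw_selftext into clean_selftext. A minimal sketch of what such a cleaner might do, assuming the usual steps of lowercasing, stripping URLs and punctuation, and collapsing whitespace (the notebook's exact steps may differ):

```python
import re

def clean_selftext(raw: str) -> str:
    """Lowercase, drop URLs and punctuation, and collapse whitespace."""
    text = raw.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # keep letters and digits only
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of spaces

print(clean_selftext("Check https://example.com - it's GREAT!!"))
# check it s great
```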

The repo also currently contains a cleaned sample file in:

data/cleaned/gg.json

What You Get After Running This

After following the full workflow, you will have:

  • raw Reddit JSON in data/raw/
  • cleaned JSON generated through the notebook in data/cleaned/
  • a notebook-based text cleaning walkthrough you can rerun with different files

Common Workflow

If you are using this project for the first time, the normal flow is:

  1. Set up the virtual environment
  2. Install requirements
  3. Run python -m src.scraper
  4. Confirm a new datascience_*.json file appears in data/raw/
  5. Open notebooks/text_cleaning.ipynb
  6. Set JSON_FILE to the raw file you want to clean
  7. Set EXPORT_JSON_FILE to the cleaned output file you want
  8. Run the notebook cells top to bottom

Dependencies

This project currently uses:

  • requests
  • pandas
  • matplotlib
  • wordcloud
  • pytest

Troubleshooting

PowerShell will not activate the venv

Use:

Set-ExecutionPolicy -Scope Process Bypass
.\.venv\Scripts\Activate.ps1

pip install -r requirements.txt fails with a strange filename error

This usually happens when two commands were pasted on the same line.

Wrong:

pip install -r requirements.txt .\.venv\Scripts\Activate.ps1

Correct:

pip install -r requirements.txt

No output file appears in data/raw/

Possible reasons:

  • the request failed
  • the API returned no data
  • the environment blocked network access

Run the scraper again and watch the logs in the terminal.

Notebook cannot find your JSON file

Check the JSON_FILE variable in the notebook.

Good example:

JSON_FILE = r"../data/raw/datascience_20260511_161217.json"

You can also use an absolute Windows path if needed.

Notebook saves cleaned output to the wrong place

Check the EXPORT_JSON_FILE variable in the notebook.

Good example:

EXPORT_JSON_FILE = r"../data/cleaned/cleaned_selftext_posts.json"

Final Note

This repo is best used as a small end-to-end learning project:

  • scrape real Reddit data
  • inspect and clean the text
  • export cleaner data for later NLP tasks

Once your cleaned JSON is ready, you can move on to TF-IDF, sentiment analysis, clustering, topic modeling, or embeddings.
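As a taste of that follow-on work, TF-IDF can be computed over the cleaned text with nothing but the standard library (libraries like scikit-learn offer this with far more options; `tf_idf` below is a hypothetical sketch, not part of this repo):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf-idf score} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append({w: (c / len(toks)) * math.log(n / df[w]) for w, c in tf.items()})
    return scores

docs = ["data science", "data cleaning"]
print(tf_idf(docs))  # "data" scores 0.0: it appears in every document
```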
