
Reddit Scraper and Text Cleaning Guide

This project helps you do two things in a simple workflow:

  1. Scrape Reddit posts into JSON
  2. Clean the text inside a notebook for analysis

It is beginner-friendly and works well if you want a small, practical pipeline for collecting r/datascience posts and preparing the selftext field for NLP work.

What This Project Does

  • Scrapes Reddit submission data into a clean JSON file
  • Saves the raw scraped file in data/raw/
  • Includes a notebook guide for text cleaning and preprocessing
  • Lets you export a cleaned JSON file after notebook processing
  • Includes a word cloud step to visualize the most-used cleaned words
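The word cloud step boils down to counting the most frequent tokens in the cleaned text. A minimal sketch of that counting, using only the standard library (the notebook itself uses the wordcloud package; `top_words` is a hypothetical helper, not a function from this repo):

```python
from collections import Counter

def top_words(cleaned_texts, n=5):
    """Count the most common words across cleaned selftext strings."""
    counts = Counter()
    for text in cleaned_texts:
        counts.update(text.split())
    return counts.most_common(n)

posts = ["data science jobs", "data cleaning for data science"]
print(top_words(posts, 2))  # [('data', 3), ('science', 2)]
```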

Project Layout

reddit_nepal_scraper/
|-- README.md
|-- requirements.txt
|-- src/
|   |-- __init__.py
|   |-- scraper.py
|   `-- utils.py
|-- data/
|   |-- raw/
|   `-- cleaned/
|-- notebooks/
|   `-- text_cleaning.ipynb
`-- tests/

Requirements

  • Python 3.9+
  • Windows PowerShell is the easiest setup path for this repo

Setup on Windows PowerShell

Open PowerShell in the project folder, then run these commands one by one:

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt

If PowerShell blocks activation, run:

Set-ExecutionPolicy -Scope Process Bypass
.\.venv\Scripts\Activate.ps1

Important:

  • Run each command on its own line
  • Do not paste two commands together on one line
  • After activation, your prompt should show (.venv)

Run the Scraper

The scraper entrypoint is:

python -m src.scraper

By default, the scraper:

  • targets r/datascience
  • fetches up to one page of results
  • saves one raw JSON file
  • writes output to data/raw/

Raw Output You Should Expect

After a successful scraper run, you should see a file like:

data/raw/datascience_20260511_161217.json
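The file name encodes the subreddit plus a run timestamp. A sketch of how such a name can be built (the repo's actual naming lives in src/scraper.py; `build_raw_path` here is a hypothetical helper, shown only to explain the pattern):

```python
from datetime import datetime
from pathlib import Path

def build_raw_path(subreddit: str, when: datetime, base: str = "data/raw") -> Path:
    """Build a timestamped path like data/raw/datascience_20260511_161217.json."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(base) / f"{subreddit}_{stamp}.json"

p = build_raw_path("datascience", datetime(2026, 5, 11, 16, 12, 17))
print(p.as_posix())  # data/raw/datascience_20260511_161217.json
```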

Each record contains fields such as:

  • id
  • title
  • author
  • selftext
  • subreddit
  • score
  • created_utc
  • permalink
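Once the raw file exists, records can be inspected with the standard json module. A small sketch, assuming the file holds a JSON array of post objects with the fields above (the inline `sample` record here is illustrative, not real scraped data):

```python
import json

# Illustrative record; in practice use json.load(open("data/raw/<your_file>.json"))
sample = ('[{"id": "abc123", "title": "Hiring advice", "author": "u1", '
          '"selftext": "Some text", "subreddit": "datascience", "score": 42, '
          '"created_utc": 1762900337, "permalink": "/r/datascience/comments/abc123/"}]')
posts = json.loads(sample)
for post in posts:
    print(post["id"], post["score"], post["title"])
```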

Open the Notebook for Cleaning

The notebook guide in this repo is:

notebooks/text_cleaning.ipynb

Open it in Jupyter or in VS Code.

The notebook asks you to set two paths near the top:

  • JSON_FILE: the path to your raw scraper output in data/raw/
  • EXPORT_JSON_FILE: the path where the cleaned JSON will be saved

Default example used in the notebook:

JSON_FILE = r"../data/raw/datascience_20260511_161217.json"
EXPORT_JSON_FILE = r"../data/cleaned/cleaned_selftext_posts.json"

Cleaned Output You Should Expect

After running the notebook, you should get a cleaned JSON file in data/cleaned/, for example:

data/cleaned/cleaned_selftext_posts.json

The cleaned output keeps useful fields like:

  • id
  • title
  • author
  • subreddit
  • score
  • raw_selftext
  • clean_selftext
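The cleaning step turns raw_selftext into clean_selftext. A minimal sketch of what such a cleaner might do, assuming the usual steps of lowercasing, stripping URLs and punctuation, and collapsing whitespace (the notebook's exact steps may differ):

```python
import re

def clean_selftext(raw: str) -> str:
    """Lowercase, drop URLs and punctuation, and collapse whitespace."""
    text = raw.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # keep letters and digits only
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of spaces

print(clean_selftext("Check https://example.com - it's GREAT!!"))
# check it s great
```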

The repo also currently contains a cleaned sample file in:

data/cleaned/gg.json

What You Get After Running This

After following the full workflow, you will have:

  • raw Reddit JSON in data/raw/
  • cleaned JSON generated through the notebook in data/cleaned/
  • a notebook-based text cleaning walkthrough you can rerun with different files

Common Workflow

If you are using this project for the first time, the normal flow is:

  1. Set up the virtual environment
  2. Install requirements
  3. Run python -m src.scraper
  4. Confirm a new datascience_*.json file appears in data/raw/
  5. Open notebooks/text_cleaning.ipynb
  6. Set JSON_FILE to the raw file you want to clean
  7. Set EXPORT_JSON_FILE to the cleaned output file you want
  8. Run the notebook cells top to bottom

Dependencies

This project currently uses:

  • requests
  • pandas
  • matplotlib
  • wordcloud
  • pytest

Troubleshooting

PowerShell will not activate the venv

Use:

Set-ExecutionPolicy -Scope Process Bypass
.\.venv\Scripts\Activate.ps1

pip install -r requirements.txt fails with a strange filename error

This usually happens when two commands were pasted on the same line.

Wrong:

pip install -r requirements.txt .\.venv\Scripts\Activate.ps1

Correct:

pip install -r requirements.txt

No output file appears in data/raw/

Possible reasons:

  • the request failed
  • the API returned no data
  • the environment blocked network access

Run the scraper again and watch the logs in the terminal.

Notebook cannot find your JSON file

Check the JSON_FILE variable in the notebook.

Good example:

JSON_FILE = r"../data/raw/datascience_20260511_161217.json"

You can also use an absolute Windows path if needed.

Notebook saves cleaned output to the wrong place

Check the EXPORT_JSON_FILE variable in the notebook.

Good example:

EXPORT_JSON_FILE = r"../data/cleaned/cleaned_selftext_posts.json"

Final Note

This repo is best used as a small end-to-end learning project:

  • scrape real Reddit data
  • inspect and clean the text
  • export cleaner data for later NLP tasks

Once your cleaned JSON is ready, you can move on to TF-IDF, sentiment analysis, clustering, topic modeling, or embeddings.
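As a taste of that follow-on work, TF-IDF can be computed over the cleaned text with nothing but the standard library (libraries like scikit-learn offer this with far more options; `tf_idf` below is a hypothetical sketch, not part of this repo):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf-idf score} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append({w: (c / len(toks)) * math.log(n / df[w]) for w, c in tf.items()})
    return scores

docs = ["data science", "data cleaning"]
print(tf_idf(docs))  # "data" scores 0.0: it appears in every document
```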
