This project helps you do two things in a simple workflow:
- Scrape Reddit posts into JSON
- Clean the text inside a notebook for analysis
It is beginner-friendly and works well if you want a small, practical pipeline for collecting r/datascience posts and preparing the selftext field for NLP work.
- Scrapes Reddit submission data into a clean JSON file
- Saves the raw scraped file in `data/raw/`
- Includes a notebook guide for text cleaning and preprocessing
- Lets you export a cleaned JSON file after notebook processing
- Includes a word cloud step to visualize the most-used cleaned words
```
reddit_nepal_scraper/
|-- README.md
|-- requirements.txt
|-- src/
|   |-- __init__.py
|   |-- scraper.py
|   `-- utils.py
|-- data/
|   |-- raw/
|   `-- cleaned/
|-- notebooks/
|   `-- text_cleaning.ipynb
`-- tests/
```
- Python 3.9+
- Windows PowerShell is the easiest setup path for this repo
Open PowerShell in the project folder, then run these commands one by one:

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
```

If PowerShell blocks activation, run:

```powershell
Set-ExecutionPolicy -Scope Process Bypass
.\.venv\Scripts\Activate.ps1
```

Important:
- Run each command on its own line
- Do not paste two commands together on one line
- After activation, your prompt should show `(.venv)`
The scraper entrypoint is:

```powershell
python -m src.scraper
```

By default, the scraper:
- targets `r/datascience`
- fetches up to 1 page
- saves one raw JSON file
- writes output to `data/raw/`
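The actual implementation lives in `src/scraper.py` and is not reproduced here. Purely as an illustration of how a scraper with these defaults could work, here is a minimal sketch; it assumes Reddit's public JSON listing endpoint (which this repo does not confirm it uses), and keeps only the record fields this README documents:

```python
import json
from datetime import datetime
from pathlib import Path

import requests

SUBREDDIT = "datascience"   # default target, per this README
OUT_DIR = Path("data/raw")  # default output directory, per this README

# The record fields this README documents
FIELDS = ["id", "title", "author", "selftext", "subreddit",
          "score", "created_utc", "permalink"]

def fetch_one_page(subreddit: str, limit: int = 100) -> list:
    """Fetch a single page of new submissions from Reddit's public listing."""
    resp = requests.get(
        f"https://www.reddit.com/r/{subreddit}/new.json",
        params={"limit": limit},
        headers={"User-Agent": "reddit_nepal_scraper/0.1"},  # Reddit rejects blank UAs
        timeout=30,
    )
    resp.raise_for_status()
    children = resp.json()["data"]["children"]
    return [{k: c["data"].get(k) for k in FIELDS} for c in children]

if __name__ == "__main__":
    posts = fetch_one_page(SUBREDDIT)
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out_file = OUT_DIR / f"{SUBREDDIT}_{stamp}.json"
    out_file.write_text(json.dumps(posts, indent=2), encoding="utf-8")
    print(f"Saved {len(posts)} posts to {out_file}")
```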
After a successful scraper run, you should see a file like:
`data/raw/datascience_20260511_161217.json`
Each record contains fields such as:
- `id`
- `title`
- `author`
- `selftext`
- `subreddit`
- `score`
- `created_utc`
- `permalink`
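To sanity-check a raw file before moving on, you can load it and inspect these fields. This sketch assumes the raw file is a JSON array of post objects; the filename below is the example from above, so substitute the file your scraper actually produced:

```python
import json

# Example filename from this README; use your own scraper output
with open("data/raw/datascience_20260511_161217.json", encoding="utf-8") as f:
    posts = json.load(f)

print(len(posts), "posts loaded")
print(sorted(posts[0].keys()))   # expect id, title, author, selftext, ...
print(posts[0]["title"])
```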
The notebook guide in this repo is:
`notebooks/text_cleaning.ipynb`
Open it in Jupyter or in VS Code.
The notebook asks you to set two paths near the top:
- `JSON_FILE`: this should point to your raw scraper output in `data/raw/`
- `EXPORT_JSON_FILE`: this is where the cleaned JSON will be saved
Default example used in the notebook:
JSON_FILE = r"../data/raw/datascience_20260511_161217.json"
EXPORT_JSON_FILE = r"../data/cleaned/cleaned_selftext_posts.json"After running the notebook, you should get a cleaned JSON file in the cleaned-data area, for example:
data/cleaned/cleaned_selftext_posts.json
The cleaned output keeps useful fields like:
- `id`
- `title`
- `author`
- `subreddit`
- `score`
- `raw_selftext`
- `clean_selftext`
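The actual cleaning rules live in the notebook cells. As a rough sketch of the kind of transformation that derives `clean_selftext` from the raw `selftext` (the specific rules here, lowercasing plus URL and punctuation stripping, are assumptions, not the notebook's confirmed logic):

```python
import json
import re
import string

def clean_text(text: str) -> str:
    """Lowercase, drop URLs, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

JSON_FILE = r"../data/raw/datascience_20260511_161217.json"
EXPORT_JSON_FILE = r"../data/cleaned/cleaned_selftext_posts.json"

with open(JSON_FILE, encoding="utf-8") as f:
    posts = json.load(f)

# Keep the fields this README lists for cleaned output
cleaned = [
    {
        "id": p["id"],
        "title": p["title"],
        "author": p["author"],
        "subreddit": p["subreddit"],
        "score": p["score"],
        "raw_selftext": p.get("selftext") or "",
        "clean_selftext": clean_text(p.get("selftext") or ""),
    }
    for p in posts
]

with open(EXPORT_JSON_FILE, "w", encoding="utf-8") as f:
    json.dump(cleaned, f, indent=2)
```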
The repo also currently contains a cleaned sample file in `data/cleaned/gg.json`.
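The word cloud step mentioned in the feature list can also be reproduced outside the notebook with a sketch like the one below. It assumes a cleaned file with the `clean_selftext` field described above; the notebook's own cell may differ:

```python
import json

import matplotlib.pyplot as plt
from wordcloud import WordCloud

with open("../data/cleaned/cleaned_selftext_posts.json", encoding="utf-8") as f:
    posts = json.load(f)

# Join all cleaned bodies into one corpus string
text = " ".join(p["clean_selftext"] for p in posts if p["clean_selftext"])

wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```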
After following the full workflow, you will have:
- raw Reddit JSON in `data/raw/`
- cleaned JSON generated through the notebook in `data/cleaned/`
- a notebook-based text cleaning walkthrough you can rerun with different files
If you are using this project for the first time, the normal flow is:
- Set up the virtual environment
- Install requirements
- Run `python -m src.scraper`
- Confirm a new `datascience_*.json` file appears in `data/raw/`
- Open `notebooks/text_cleaning.ipynb`
- Set `JSON_FILE` to the raw file you want to clean
- Set `EXPORT_JSON_FILE` to the cleaned output file you want
- Run the notebook cells top to bottom
This project currently uses:
- `requests`
- `pandas`
- `matplotlib`
- `wordcloud`
- `pytest`
If PowerShell blocks virtual-environment activation, use:

```powershell
Set-ExecutionPolicy -Scope Process Bypass
.\.venv\Scripts\Activate.ps1
```

If a setup command fails with a strange argument error, it usually happens when two commands were pasted on the same line.
Wrong:

```powershell
pip install -r requirements.txt .\.venv\Scripts\Activate.ps1
```

Correct:

```powershell
pip install -r requirements.txt
```

If the scraper did not produce a new raw file, possible reasons:
- the request failed
- the API returned no data
- the environment blocked network access
Run the scraper again and watch the logs in the terminal.
If the notebook cannot find your raw file, check the `JSON_FILE` variable in the notebook.

Good example:

```python
JSON_FILE = r"../data/raw/datascience_20260511_161217.json"
```

You can also use an absolute Windows path if needed.
If the cleaned file is not written where you expect, check the `EXPORT_JSON_FILE` variable in the notebook.

Good example:

```python
EXPORT_JSON_FILE = r"../data/cleaned/cleaned_selftext_posts.json"
```

This repo is best used as a small end-to-end learning project:
- scrape real Reddit data
- inspect and clean the text
- export cleaner data for later NLP tasks
Once your cleaned JSON is ready, you can move on to TF-IDF, sentiment analysis, clustering, topic modeling, or embeddings.
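For instance, a minimal TF-IDF pass over the cleaned text could look like this sketch. Note that scikit-learn is not in `requirements.txt`, so you would install it separately:

```python
import json

# scikit-learn is an extra dependency: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

with open("data/cleaned/cleaned_selftext_posts.json", encoding="utf-8") as f:
    posts = json.load(f)

# Skip posts whose cleaned body came out empty
docs = [p["clean_selftext"] for p in posts if p["clean_selftext"]]

vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
X = vectorizer.fit_transform(docs)
print(X.shape)  # (number of posts, number of terms)
```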