Scripts to explore Stackoverflow data for various blockchains

tradingstrategy-ai/blockchain-stackoverflow

StackExchange blockchain data exploration scripts

This repository contains data research into how blockchain development has changed over the years, based on data dumps from StackOverflow.com, the most popular programmers' forum.

Read the research report.

Prerequisites

If you want to use this notebook for your own research, you need to:

  • Know Python and UNIX shell basics
  • Have Python 3.11 installed
  • Have Poetry installed

Get started

Check out files with git-lfs:

git clone ...
cd blockchain-stackoverflow
git lfs install
git lfs pull

Create Python environment:

poetry shell
poetry install

Usage

Once the Python environment and the large files are set up, you can open research.ipynb in your notebook editor (for example Visual Studio Code) and point the Python interpreter to the environment created with Poetry.

Alternatively, you can open the notebook using stock Jupyter in a web browser:

jupyter notebook research.ipynb

Recreating datasets

We supply ./blockchain-questions.parquet with the GitHub repository. You might want to update this dataset once StackOverflow resumes publishing its data dumps.

To re-create the dataset you need ~200 GB of free disk space. We recommend working on a remote server using the Visual Studio Code remote extensions.

We need:

  • Posts dataset
  • Tags dataset

Creating tag map

First we need to create a mapping from tag name to primary key, which we can use to navigate the StackOverflow posts dump.

Create a tags CSV file that we can import into Pandas:

wget -O stackoverflow.com-Tags.7z https://archive.org/download/stackexchange/stackoverflow.com-Tags.7z
7z x stackoverflow.com-Tags.7z
./converter --source-path Tags.xml --result-format csv --store-to-dir csv

Then we create tags.parquet using our script:

python blockchain_stackoverflow/tag_map.py 

This will create tags.parquet and also output post counts for our tags:

ethereum with 6681 posts
blockchain with 6637 posts
solidity with 6534 posts
svelte with 4932 posts
hyperledger with 3938 posts
smartcontracts with 2989 posts...
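
As a rough illustration of what the tag map step does, here is a minimal sketch that builds a tag name -> primary key mapping with Pandas. It is not the actual tag_map.py. The column names (Id, TagName, Count) are assumed to follow the attributes in Tags.xml, the tag list is hypothetical, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical list of tags of interest; the real script may track others.
TAGS_OF_INTEREST = ["ethereum", "blockchain", "solidity"]


def build_tag_map(tags: pd.DataFrame, names: list[str]) -> dict[str, int]:
    """Map tag name -> primary key for the selected tags.

    Assumes the columns Id and TagName exist (an assumption about the CSV).
    """
    selected = tags[tags["TagName"].isin(names)]
    return {
        name: int(tag_id)
        for name, tag_id in zip(selected["TagName"], selected["Id"])
    }


# Tiny in-memory stand-in for the converted tags CSV (made-up ids and counts).
sample_tags = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4],
        "TagName": ["python", "ethereum", "solidity", "blockchain"],
        "Count": [1000, 30, 20, 25],
    }
)

tag_map = build_tag_map(sample_tags, TAGS_OF_INTEREST)

# Print post counts per selected tag, largest first.
selected = sample_tags[sample_tags["TagName"].isin(TAGS_OF_INTEREST)]
for row in selected.sort_values("Count", ascending=False).itertuples():
    print(f"{row.TagName} with {row.Count} posts")
```

With real data you would load the converted CSV using pd.read_csv instead of building the frame by hand.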

Downloading and extracting the full posts dataset

We now need to get all StackOverflow questions to a CSV file.

Download using BitTorrent, so that you do not die of old age waiting for the download to finish.

cd download
npm install
node_modules/.bin/webtorrent --select stackoverflow.com-Posts.7z stackexchange_archive.torrent 
# 658 = index for Posts.7z
node_modules/.bin/webtorrent --select 658 stackexchange_archive.torrent 

Or HTTPS:

wget -O download/stackexchange/stackoverflow.com-Posts.7z https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z

[Screenshot: Webtorrent downloading]

And then after two hours:

7z x download/stackexchange/stackoverflow.com-Posts.7z
./converter --source-path Posts.xml --result-format csv --store-to-dir csv
rm Posts.xml  # Save 95 GB space
ipython create-reduced-dataset.ipynb  # Or run in Visual Studio Code

Now we have created blockchain-posts.parquet.

Creating blockchain questions only reduced dataset

As the full posts dataset is too large to read into RAM, we use a chunked reader to create a smaller dataset of 25k blockchain questions, weighing around 25 MB.

ipython create-reduced-dataset.ipynb  # Or use Visual Studio Code
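
The chunked reading idea can be sketched as follows. This is a simplified illustration, not the contents of create-reduced-dataset.ipynb. It assumes the posts CSV has PostTypeId and Tags columns, that questions have PostTypeId 1, and that tags are stored as a string such as `<ethereum><solidity>`; the tag list and sample rows are hypothetical.

```python
import io

import pandas as pd

# Hypothetical tags of interest.
BLOCKCHAIN_TAGS = {"ethereum", "solidity", "blockchain"}


def is_blockchain_question(tags) -> bool:
    """True if the Tags cell mentions any tag of interest."""
    if not isinstance(tags, str):
        return False  # Rows without tags (e.g. answers) are read as NaN
    found = set(tags.strip("<>").split("><"))
    return not BLOCKCHAIN_TAGS.isdisjoint(found)


def reduce_posts(source, chunksize: int = 100_000) -> pd.DataFrame:
    """Stream a posts CSV in chunks, keeping only blockchain questions."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        questions = chunk[chunk["PostTypeId"] == 1]  # Assumed: 1 = question
        mask = questions["Tags"].map(is_blockchain_question).astype(bool)
        kept.append(questions[mask])
    return pd.concat(kept, ignore_index=True)


# Small in-memory stand-in for the posts CSV (made-up rows).
sample_csv = io.StringIO(
    "Id,PostTypeId,CreationDate,Tags\n"
    "1,1,2021-01-05T10:00:00,<python><pandas>\n"
    "2,1,2021-01-09T12:30:00,<ethereum><solidity>\n"
    "3,2,2021-01-10T08:15:00,\n"
    "4,1,2021-02-01T09:00:00,<blockchain>\n"
)

reduced = reduce_posts(sample_csv, chunksize=2)
# With real data the result could then be saved, for example:
# reduced.to_parquet("blockchain-questions.parquet")
```

Only one chunk is held in memory at a time, so peak memory use depends on the chunk size and the number of matching rows rather than on the size of the full dump.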

Creating StackOverflow question count baseline

Because StackOverflow itself is in decline, we need to separate StackOverflow's overall decline from any possible decline in blockchain development.

For this purpose, we create a time-series that contains monthly binned question counts of all StackOverflow posts.

We do this with our notebook, which is also going to display a graph of the question counts:

ipython create-baseline.ipynb  # Or use Visual Studio Code
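
A minimal sketch of the monthly binning is shown below. It is not the contents of create-baseline.ipynb. The column names PostTypeId and CreationDate are assumptions about the CSV layout, and the relative_share helper is a hypothetical example of one way to use the baseline, not necessarily the method used in the research notebook.

```python
import pandas as pd


def monthly_question_counts(posts: pd.DataFrame) -> pd.Series:
    """Count questions per calendar month.

    Assumes PostTypeId == 1 marks a question and CreationDate is parseable.
    Months without any questions do not appear in the result.
    """
    questions = posts[posts["PostTypeId"] == 1]
    created = pd.to_datetime(questions["CreationDate"])
    return created.dt.to_period("M").value_counts().sort_index()


def relative_share(topic_counts: pd.Series, baseline_counts: pd.Series) -> pd.Series:
    """Hypothetical helper: topic questions as a fraction of all questions."""
    return (topic_counts / baseline_counts).dropna()


# Made-up sample rows standing in for the full posts dataset.
sample_posts = pd.DataFrame(
    {
        "PostTypeId": [1, 1, 2, 1],
        "CreationDate": [
            "2021-01-05T10:00:00",
            "2021-01-09T12:30:00",
            "2021-01-10T08:15:00",
            "2021-03-01T09:00:00",
        ],
    }
)

baseline = monthly_question_counts(sample_posts)

# Made-up topic counts: one blockchain question in January 2021.
topic = pd.Series([1], index=pd.PeriodIndex(["2021-01"], freq="M"))
share = relative_share(topic, baseline)
```

Dividing the topic counts by the baseline gives a per-month share that is less sensitive to the site-wide trend than raw counts.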

Exporting Jupyter Notebook as Ghost blog post

First, let's convert the notebook to static HTML:

jupyter nbconvert --to=html --no-input --embed-images --output-dir html-export research.ipynb

Then you can open html-export/research.html in your web browser and copy-paste content to the Ghost blog post editor.

Useful links and background
