# Satoshi QA

*What if Satoshi himself could answer our questions?*

This notebook is about neural question answering using transformer models.

I will be using Satoshi Nakamoto's forum posts and personal emails, compiled by https://satoshi.nakamotoinstitute.org/ under CC 4.0.

The secret sauce behind scaling up is Haystack. It lets you scale QA models to large collections of documents!
You can read more about this amazing library here: https://github.com/deepset-ai/haystack


In [1]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

Collecting pip
  Downloading pip-22.2.2-py3-none-any.whl (2.0 MB)
     2.0/2.0 MB 1.9 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.1.2
    Uninstalling pip-22.1.2:
      Successfully uninstalled pip-22.1.2
Successfully installed pip-22.2.2
Collecting farm-haystack[colab]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-t_0k6lwd/farm-haystack_aed14f94a40b4be184e75901b521bc85
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-t_0k6lwd/farm-haystack_aed14f94a40b4be184e75901b521bc85
  Resolved https://github.com/deepset-ai/haystack.git to commit d61755322fd76536f62b8edaabd612ae080e1e17
  Installing build dependencies ... done
  Getting requirements to build wheel ... done

In [2]:
from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http, print_answers
from haystack.nodes import FARMReader, TransformersReader
import numpy as np
import pandas as pd

## Data prep

The Satoshi Nakamoto Institute offers multiple JSON files with text from Satoshi himself. I chose the forum posts and emails: they are better suited for QA, and combined they give almost 4,000 examples.

In [3]:
import json
posts  = []
emails  = []
with open("../input/satoshi-nakamoto-institute/posts.json", "r") as read_file:
    posts = json.load(read_file)

In [4]:
posts = pd.DataFrame(posts)

In [5]:
posts = posts[['subject', 'content']].rename(columns={'subject':'document','content':'content'})

In [6]:
with open("../input/satoshi-nakamoto-institute/emails.json", "r") as read_file:
    emails = json.load(read_file)

emails = pd.DataFrame(emails)
emails = emails[['subject', 'text']].rename(columns={'subject':'document','text':'content'})

In [7]:
emails.head(5)

Unnamed: 0,document,content
0,Bitcoin P2P e-cash paper,I've been working on a new electronic cash system that's fully\npeer-to-peer...
1,Bitcoin P2P e-cash paper,Satoshi Nakamoto wrote:\n> I've been working on a new electronic cash system...
2,Bitcoin P2P e-cash paper,>Satoshi Nakamoto wrote:\n>> I've been working on a new electronic cash syst...
3,Bitcoin P2P e-cash paper,"> As long as honest nodes control the most CPU power on the network,\n> they..."
4,Bitcoin P2P e-cash paper,">> As long as honest nodes control the most CPU power on the network,\n>> th..."


In [8]:
posts.head(5)

Unnamed: 0,document,content
0,Bitcoin open source implementation of P2P currency,"<div class=""post"">\nI've developed a new open source P2P e-cash system calle..."
1,Bitcoin open source implementation of P2P currency,"<div class=""post"">Great stuff.<br/>\n<br/>\nThis is the first real innovatio..."
2,Bitcoin open source implementation of P2P currency,"<div class=""post""><a href=""http://p2pfoundation.ning.com/profile/DanteGabrye..."
3,Bitcoin open source implementation of P2P currency,"<div class=""post"">Could be. They're talking about the old Chaumian central m..."
4,Bitcoin open source implementation of P2P currency,"<div class=""post"">Hi Satoshi,<br/>\n<br/>\nwe are actually really talking ab..."


In [9]:
emails.describe()

Unnamed: 0,document,content
count,63,63
unique,19,63
top,Bitcoin P2P e-cash paper,I've been working on a new electronic cash system that's fully\npeer-to-peer...
freq,32,1


In [10]:
posts.describe()

Unnamed: 0,document,content
count,3845,3845
unique,530,3842
top,Re: BitDNS and Generalizing Bitcoin,"<div class=""post"">adg</div>"
freq,264,4


In [11]:
data = pd.concat([posts, emails])

In [12]:
data.describe()

Unnamed: 0,document,content
count,3908,3908
unique,549,3905
top,Re: BitDNS and Generalizing Bitcoin,"<div class=""post"">adg</div>"
freq,264,4


💡 May want to:
* Clean the data: remove HTML tags and attributes from the raw texts
* Try labelling the data with the Deepset Annotation Tool
* Try data augmentation
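
The first item above could look roughly like the following. This is a minimal sketch using only Python's standard library; `strip_html` is a hypothetical helper name, and treating `<br/>` as a line break is my own assumption about what reads best for the Reader. It does not try to remove quoted replies, only the markup itself.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops every tag and attribute."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Treat <br> and <br/> as line breaks so sentences do not run together.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\s*\n\s*", "\n", text)   # collapse runs of blank lines
    return text.strip()
```

Applied to the notebook's data, this would be something like `posts['content'] = posts['content'].map(strip_html)` before the posts and emails are concatenated.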

## Setting up DocumentStore

The DocumentStore is the database that holds the documents for our search. Haystack recommends Elasticsearch, but also offers lighter-weight options for fast prototyping (SQL or in-memory).

In [13]:
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2
 
import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
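
Instead of a fixed 30-second sleep, one could poll until Elasticsearch answers. The sketch below is only an illustration: `es_is_up` and `wait_until_ready` are hypothetical helper names, and it assumes Elasticsearch listens on its default HTTP port 9200 on localhost.

```python
import time
import urllib.error
import urllib.request


def es_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET to `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, etc.: the server is not ready yet.
        return False


def wait_until_ready(probe=es_is_up, max_wait=60.0, interval=1.0):
    """Call `probe` repeatedly until it returns True or `max_wait` seconds pass."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_ready()` after starting the server returns as soon as the node responds, and returns `False` if it never comes up within `max_wait` seconds.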

In [14]:
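from haystack.utils import launch_es

# launch_es() starts Elasticsearch through Docker. Docker is not available in this
# environment (see the error below), so the instance started manually above is used.
launch_es()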

/bin/sh: 1: docker: not found


In [15]:
# Connect to Elasticsearch

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

In [16]:
data.head()

Unnamed: 0,document,content
0,Bitcoin open source implementation of P2P currency,"<div class=""post"">\nI've developed a new open source P2P e-cash system calle..."
1,Bitcoin open source implementation of P2P currency,"<div class=""post"">Great stuff.<br/>\n<br/>\nThis is the first real innovatio..."
2,Bitcoin open source implementation of P2P currency,"<div class=""post""><a href=""http://p2pfoundation.ning.com/profile/DanteGabrye..."
3,Bitcoin open source implementation of P2P currency,"<div class=""post"">Could be. They're talking about the old Chaumian central m..."
4,Bitcoin open source implementation of P2P currency,"<div class=""post"">Hi Satoshi,<br/>\n<br/>\nwe are actually really talking ab..."


In [17]:
data = data.to_dict(orient='records')

In [18]:
print(data[:3])

[{'document': 'Bitcoin open source implementation of P2P currency', 'content': '<div class="post">\nI\'ve developed a new open source P2P e-cash system called Bitcoin. It\'s completely decentralized, with no central server or trusted parties, because everything is based on crypto proof instead of trust. Give it a try, or take a look at the screenshots and design paper:<br/>\n<br/>\nDownload Bitcoin v0.1 at <a href="http://www.bitcoin.org">http://www.bitcoin.org</a><br/>\n<br/>\nThe root problem with conventional currency is all the trust that\'s required to make it work. The central bank must be trusted not to debase the currency, but the history of fiat currencies is full of breaches of that trust. Banks must be trusted to hold our money and transfer it electronically, but they lend it out in waves of credit bubbles with barely a fraction in reserve. We have to trust them with our privacy, trust them not to let identity thieves drain our accounts. Their massive overhead costs make mic

In [19]:
# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(data)
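
The `describe()` output earlier shows a few repeated `content` values (3,905 unique out of 3,908). A possible preprocessing step, sketched below, drops exact duplicates and makes the metadata explicit before writing. `to_haystack_dicts` is a hypothetical helper, and storing the thread subject under `meta["name"]` is my assumption about a convenient layout, not something the dataset prescribes.

```python
def to_haystack_dicts(records):
    """Turn {'document': ..., 'content': ...} records into dicts with an
    explicit `meta` field, keeping only the first occurrence of each
    exact `content` string."""
    seen = set()
    docs = []
    for record in records:
        content = record["content"]
        if content in seen:
            continue  # skip exact duplicates of text already kept
        seen.add(content)
        docs.append({"content": content, "meta": {"name": record["document"]}})
    return docs
```

With this in place, the write step would become `document_store.write_documents(to_haystack_dicts(data))`. Note that deduplicating on text alone discards the thread subject of the later copies.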

## Initialize Retriever, Reader & Pipeline

1. **Retriever:** A fast, simple algorithm that identifies candidate passages from a large collection of documents. Options include TF-IDF, BM25, custom Elasticsearch queries, and embedding-based approaches. The Retriever narrows the search down to smaller units of text where a given question could be answered.
2. **Reader:** A powerful neural model that reads through texts in detail to find an answer. It can use models such as BERT, RoBERTa, or XLNet, trained via FARM or Transformers on SQuAD-like tasks. The Reader takes multiple passages of text as input and returns the top-n answers with corresponding confidence scores. You can load a pretrained model from Hugging Face's model hub or fine-tune one on your own domain data.
3. **Pipeline:** Glues together a Reader and a Retriever to provide an easy-to-use question answering interface. This notebook uses `ExtractiveQAPipeline`.
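
To give an intuition for what a BM25 retriever computes, here is a small pure-Python sketch of the standard Okapi BM25 formula. It is not how Haystack or Elasticsearch implement it internally; the whitespace tokenizer is a simplification, and `k1=1.2`, `b=0.75` are commonly used default values that I am assuming here.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not documents:
        return []
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that share more (and rarer) terms with the query score higher, and the Retriever passes only the top-scoring candidates on to the Reader.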

In [20]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

In [21]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)


In [22]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

## Testing

In [23]:
# You can configure how many candidates the Reader and Retriever shall return.
# The higher the Retriever's top_k, the better (but also the slower) your answers.
prediction = pipe.run(
    query="Who are you?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)



In [24]:
# ...or use a util to simplify the output
# Change `minimum` to `medium` or `all` to raise the level of detail
print_answers(prediction, details="minimum")


Query: Who are you?
Answers:
[   {   'answer': 'early adopters',
        'context': 't of the Faucet since I refilled it last '
                   'night.<br/><br/>Any of you early adopters who generated '
                   'tens of thousands of coins back in the early days, ar'},
    {   'answer': 'Anyone who currently uses PayPal for a start',
        'context': 'iv class="quoteheader">Quote</div><div '
                   'class="quote">Anyone who currently uses PayPal for a '
                   'start.<br/></div><br/>Most of those will prefer a fast '
                   'iss'},
    {   'answer': 'FreeMoney',
        'context': '://bitcointalk.org/index.php?topic=1268.msg13910#msg13910">Quote '
                   'from: FreeMoney on September 24, 2010, 04:42:21 '
                   'AM</a></div><div class="quote">Does a'},
    {   'answer': 'whoever you are who donated',
        'context': '<div class="post">Thanks for the donations... whoever you '
                   'are who dona

Note to self: dig into how to get better results...
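
One simple thing to try, since the Reader returns a confidence score with each answer, is to keep only answers above a threshold. The sketch below assumes each answer object exposes a numeric `score` attribute; `confident_answers` is a hypothetical helper and the `0.5` cutoff is an arbitrary starting point, not a recommended value.

```python
def confident_answers(answers, min_score=0.5):
    """Keep answers whose `score` attribute is at least `min_score`,
    sorted from most to least confident."""
    kept = [answer for answer in answers if answer.score >= min_score]
    return sorted(kept, key=lambda answer: answer.score, reverse=True)
```

Under that assumption, usage would be along the lines of `confident_answers(prediction["answers"])`.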