<a href="https://colab.research.google.com/github/takky0330/ChatBot/blob/master/txtai_TEST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install dependencies

Install txtai and all of its dependencies.

In [2]:
!pip uninstall transformers
!pip install transformers==3.0.2

import transformers
transformers.__version__

Uninstalling transformers-3.0.2:
  Would remove:
    /usr/local/bin/transformers-cli
    /usr/local/lib/python3.6/dist-packages/transformers-3.0.2.dist-info/*
    /usr/local/lib/python3.6/dist-packages/transformers/*
Proceed (y/n)? y
  Successfully uninstalled transformers-3.0.2
Collecting transformers==3.0.2
  Using cached https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl
Installing collected packages: transformers
Successfully installed transformers-3.0.2


'3.0.2'

In [1]:
%%capture
!pip install git+https://github.com/neuml/txtai

In [2]:
%%capture

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

# Running similarity queries

In [3]:
import numpy as np

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get the index of the section that best matches the query
    uid = np.argmax(embeddings.similarity(query, sections))

    print("%-20s %s" % (query, sections[uid]))

Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Make huge profits without work, earn up to $100,000 a day
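
The ranking above boils down to comparing the query vector with each section vector and taking the highest score. The following is a minimal sketch of that idea using cosine similarity, under the assumption that the scores behave like cosine similarities. The three-dimensional vectors, `toy_sections`, and the `cosine_similarity` helper are made up for illustration and are not part of txtai.

```python
import numpy as np

def cosine_similarity(query_vec, section_vecs):
    """Cosine similarity between one query vector and each row of section_vecs."""
    query_vec = np.asarray(query_vec, dtype=float)
    section_vecs = np.asarray(section_vecs, dtype=float)
    # Normalize to unit length so the dot product equals the cosine
    q = query_vec / np.linalg.norm(query_vec)
    s = section_vecs / np.linalg.norm(section_vecs, axis=1, keepdims=True)
    return s @ q

# Toy 3-dimensional "embeddings" (hypothetical values, not model output)
toy_sections = ["lottery win", "ice shelf collapse", "virus cases"]
toy_vectors = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.9, 0.1],
                        [0.0, 0.2, 0.9]])
toy_query = np.array([0.2, 1.0, 0.0])  # pretend this encodes "climate change"

scores = cosine_similarity(toy_query, toy_vectors)
best = int(np.argmax(scores))  # same argmax step as in the cell above
print(toy_sections[best], scores)
```

With real embeddings the vectors have hundreds of dimensions, but the argmax step is the same.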


# Building an Embeddings index


In [6]:
# Create an index for the list of sections
embeddings.index([(uid, text, None) for uid, text in enumerate(sections)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print section
    print("%-20s %s" % (query, sections[uid]))

Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Make huge profits without work, earn up to $100,000 a day
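
To make the `(uid, score)` result format concrete, here is a minimal sketch of an exact, brute-force index. `BruteForceIndex` is a hypothetical class written for this illustration and is not txtai's implementation. A production index usually relies on an approximate nearest neighbor library for speed, whereas this version simply scores every stored vector.

```python
import numpy as np

class BruteForceIndex:
    """Hypothetical exact-search index; not txtai's implementation."""

    def __init__(self):
        self.ids = []
        self.matrix = None

    def index(self, items):
        # items: iterable of (uid, vector) pairs
        ids, vectors = zip(*items)
        matrix = np.asarray(vectors, dtype=float)
        # Store unit-length rows so a dot product gives cosine similarity
        self.matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
        self.ids = list(ids)

    def search(self, vector, limit):
        query = np.asarray(vector, dtype=float)
        query = query / np.linalg.norm(query)
        scores = self.matrix @ query
        # Highest scores first, truncated to the requested limit
        top = np.argsort(-scores)[:limit]
        return [(self.ids[i], float(scores[i])) for i in top]

index = BruteForceIndex()
index.index([(0, [0.9, 0.1, 0.0]), (1, [0.1, 0.9, 0.1]), (2, [0.0, 0.2, 0.9])])

results = index.search([0.2, 1.0, 0.0], 2)
uid = results[0][0]  # same access pattern as search(query, 1)[0][0]
print(results)
```

Indexing once and searching many times avoids re-encoding every section for each query, which is the main difference from the similarity loop in the previous section.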


# Embeddings load/save



In [9]:
embeddings.save("index")

embeddings = Embeddings()
embeddings.load("index")

uid = embeddings.search("climate change", 1)[0][0]
print(sections[uid])

Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg


In [None]:
!ls index

config	embeddings
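
The directory listing shows two files, `config` and `embeddings`. The sketch below mimics that layout to illustrate a save/load round trip. The file formats used here (JSON for the settings, NumPy's binary format for the vectors) and the `save_index`/`load_index` helpers are assumptions made for this example; the notebook does not show which on-disk formats txtai actually uses.

```python
import json
import os
import tempfile

import numpy as np

def save_index(path, config, vectors):
    """Write settings and vectors into a directory (hypothetical layout)."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "config"), "w") as f:
        json.dump(config, f)
    # Pass a file object so NumPy does not append ".npy" to the file name
    with open(os.path.join(path, "embeddings"), "wb") as f:
        np.save(f, vectors)

def load_index(path):
    """Read back the settings and vectors written by save_index."""
    with open(os.path.join(path, "config")) as f:
        config = json.load(f)
    with open(os.path.join(path, "embeddings"), "rb") as f:
        vectors = np.load(f)
    return config, vectors

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index")
    original = np.array([[0.9, 0.1], [0.1, 0.9]])
    save_index(path, {"method": "transformers"}, original)
    config, restored = load_index(path)
    files = sorted(os.listdir(path))

print(files, config, np.array_equal(original, restored))
```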


# Embedding methods

Embeddings supports two methods for creating text vectors: the sentence-transformers library and word embedding vectors. Each has its merits, as outlined below:

- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
  - Creates a single embeddings vector via mean pooling of vectors generated by the transformers library. 
  - Supports models stored on Hugging Face's model hub or stored locally. 
  - See sentence-transformers for details on how to create custom models, which can be kept local or uploaded to Hugging Face's model hub.
  - Base models require significant compute capability (GPU preferred). It is possible to build smaller, lighter-weight models that trade off accuracy for speed.
- word embeddings
  - Creates a single embeddings vector via BM25 scoring of each word component. See this [Medium article](https://towardsdatascience.com/building-a-sentence-embedding-index-with-fasttext-and-bm25-f07e7148d240) for the logic behind this method.
  - Backed by the [pymagnitude](https://github.com/plasticityai/magnitude) library. Pre-trained word vectors can be installed from the referenced link.
  - See [vectors.py](https://github.com/neuml/txtai/blob/master/src/python/txtai/vectors.py) for code that can build word vectors for custom datasets.
  - Significantly better runtime performance with the default models. For larger datasets, it offers a good trade-off between speed and accuracy.
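
The two pooling strategies in the list above can be compared on toy data. This is a minimal sketch, assuming the standard Okapi BM25 term weight with the common defaults `k1=1.2` and `b=0.75`; the exact scoring variant txtai uses may differ. The two-dimensional word vectors, the tiny corpus, and the helper names are all hypothetical.

```python
import math
from collections import Counter

import numpy as np

def mean_pool(token_vectors):
    """Plain average of token vectors (the mean pooling strategy)."""
    return np.mean(token_vectors, axis=0)

def bm25_weights(doc_tokens, corpus, k1=1.2, b=0.75):
    """Okapi BM25 weight for each token of one document in the corpus."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    freqs = Counter(doc_tokens)
    weights = []
    for token in doc_tokens:
        df = sum(1 for d in corpus if token in d)  # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        f = freqs[token]
        tf = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
        weights.append(idf * tf)
    return np.array(weights)

def weighted_pool(token_vectors, weights):
    """Average of token vectors, weighted per token."""
    return np.average(token_vectors, axis=0, weights=weights)

corpus = [["the", "ice", "shelf", "collapsed"],
          ["the", "lottery", "ticket"],
          ["the", "virus", "cases"]]

# Made-up 2-dimensional word vectors for the first document
word_vectors = {"the": [1.0, 1.0], "ice": [0.0, 1.0],
                "shelf": [0.1, 0.9], "collapsed": [0.2, 0.8]}

doc = corpus[0]
vectors = np.array([word_vectors[t] for t in doc])
weights = bm25_weights(doc, corpus)

print("mean pooled:  ", mean_pool(vectors))
print("BM25 weighted:", weighted_pool(vectors, weights))
```

Because "the" occurs in every document, its BM25 weight is small, so the weighted vector is pulled toward the rarer, more informative words.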


# Experiments start here!

In [11]:
## Let's search for "China"!

uid = embeddings.search("China", 1)[0][0]
print(sections[uid])

Beijing mobilises invasion craft along coast as Taiwan tensions escalate


Wow! The sentence about the Beijing and Taiwan tensions was selected, even though the word "China" does not appear in it at all...


In [13]:
## Let's search for "Natural Threats"!

uid = embeddings.search("Natural Threats", 1)[0][0]
print(sections[uid])

The National Park Service warns against sacrificing slower friends in a bear attack


"in a bear attack" certainly is a natural threat...

In [14]:

## Let's search for "rich person"!

uid = embeddings.search("rich person", 1)[0][0]
print(sections[uid])

Make huge profits without work, earn up to $100,000 a day


Oh... and it's passive income, no less! I'm jealous.





# So that's how it works...
# Searching with natural words picks out sentences with a similar meaning quite nicely
