<a href="https://colab.research.google.com/github/waifuai/interpersonal/blob/master/tutorials/4_creating_traits_databases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Creating Traits Databases

The traits databases are the core of the program. Large precomputed traits databases are available from our [releases on GitHub](https://github.com/waifuai/interpersonal/releases) or [Kaggle](https://www.kaggle.com/waifuai/interpersonal-traits).
To build the traits databases (for friendliness and dominance) yourself, first download the two required files: the Google News word vectors and the Brown Corpus.

Start by installing the library with pip. We pin the version number so that this tutorial keeps working in the future.

In [0]:
!pip install interpersonal==0.0.1

Collecting interpersonal==0.0.1
  Downloading https://files.pythonhosted.org/packages/be/6a/4b907680f76484b29494aa5da47c7c64fcd21c0edf658445dda097d12c84/interpersonal-0.0.1-py3-none-any.whl
Installing collected packages: interpersonal
Successfully installed interpersonal-0.0.1


In [0]:
!echo "Downloading GoogleNews-vectors (1.57 GB)"
!link="https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz";wget -c $link 2>/dev/null || curl -L -O -C - $link || curl -L -O $link
!du -h /content/GoogleNews-vectors-negative300.bin.gz
!echo "Downloading Brown Corpus (3.3 MB)"
!link="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip";wget -c $link 2>/dev/null || curl -L -O -C - $link || curl -L -O $link
!du -h /content/brown.zip

Downloading GoogleNews-vectors (1.57 GB)
1.6G	/content/GoogleNews-vectors-negative300.bin.gz
Downloading Brown Corpus (3.3 MB)
3.2M	/content/brown.zip


In [0]:
"""
Library for the functions used in populate_traits.py
"""


def is_adjective(word, adjectives):
    """
    Check whether the given word is an adjective or not
    :param word: the input word
    :param adjectives: collection of known adjectives
        (a set gives the fastest lookups)
    :return: True if the word is an adjective, False otherwise
    """
    return word in adjectives


def filter_adjectives(a_list, adjectives):
    """
    Filter a list of (word, score) pairs, keeping the adjectives
    :param a_list: A list of (word, score) pairs
    :param adjectives: The collection of known adjectives
    :return: A list of all pairs from the input whose word is an adjective
    """
    return [pair for pair in a_list if is_adjective(pair[0], adjectives)]


def scale_min_max(x, xmin, xmax, ymin, ymax):
    """
    Linearly scale the input into the output range, truncated to an integer
    :param x: the input value to transform
    :param xmin: the minimum of the input range
    :param xmax: the maximum of the input range
    :param ymin: the minimum of the output range
    :param ymax: the maximum of the output range
    :return: the scaled output value, as an int
    """
    if xmax == xmin:
        # degenerate input range: return the lower bound
        # instead of dividing by zero
        return int(ymin)
    y = (x - xmin) / (xmax - xmin)
    y *= (ymax - ymin)
    y += ymin
    return int(y)


def scale_my_list(pairs, positivity):
    """
    Scale the scores in a list of (word, score) pairs
    Example input :
        [('warm_hearted', 0.43241143226623535),
        ('playful', 0.3962867259979248),...
    Example output:
        [['warm_hearted', 10],
        ['playful', 7],...
    :param pairs: the list to scale
    :param positivity: True if friendliness or dominance,
    False otherwise
    :return: the scaled list
    """
    if not pairs:
        return []
    # negative poles (unfriendliness, undominance) get negative scores
    multiplier = 1 if positivity else -1
    # take the bounds from the data itself rather than assuming
    # that every score lies between 0 and 1
    scores = [score for _, score in pairs]
    lowest, highest = min(scores), max(scores)
    return [[word, multiplier * scale_min_max(score, lowest, highest, 1, 10)]
            for word, score in pairs]
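
The helper functions above boil down to two steps: keep the candidate words that are adjectives, then rescale their similarity scores to integers. The following is a minimal, self-contained sketch of that pipeline on toy data. The words, scores, and adjective set are invented for illustration, and the compact `scale_min_max` here is a restatement of the formula above, not the library's own code.

```python
# Toy stand-in for the (word, similarity) pairs that a word2vec query returns.
# These words and scores are made up for illustration.
candidates = [("warm", 0.75), ("quickly", 0.625), ("gentle", 0.5),
              ("table", 0.375), ("cheerful", 0.25)]

# Hypothetical adjective set; the tutorial builds the real one from the Brown Corpus.
adjectives = {"warm", "gentle", "cheerful"}


def scale_min_max(x, xmin, xmax, ymin, ymax):
    """Map x from [xmin, xmax] onto [ymin, ymax] and truncate to an int."""
    return int((x - xmin) / (xmax - xmin) * (ymax - ymin) + ymin)


# Step 1: keep only the pairs whose word is a known adjective.
kept = [(word, score) for word, score in candidates if word in adjectives]

# Step 2: rescale the surviving scores into the range 1..10.
low = min(score for _, score in kept)
high = max(score for _, score in kept)
scaled = [[word, scale_min_max(score, low, high, 1, 10)] for word, score in kept]

print(scaled)  # [['warm', 10], ['gentle', 5], ['cheerful', 1]]
```

Because the result is truncated with `int()`, the midpoint score maps to 5 rather than 5.5. For the negative poles, the same scaled values are simply multiplied by -1.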

In [0]:
from gensim.models import KeyedVectors

print("Loading GoogleNews-vectors into word2vec (~30 seconds)")
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz',
    binary=True,
    limit=500000
)

Loading GoogleNews-vectors into word2vec (~30 seconds)



In [0]:
"""
This script does all the heavy-lifting of setting up the
    initial databases...
topn ~ limit/5 seems a good number to ensure
    the combination of having enough traits
    yet avoiding matches that are completely irrelevant
"""

import nltk
import zipfile
from nltk.corpus import brown

from tqdm import tqdm

from interpersonal.classes.trait_dao import TraitDao

TraitDao.create_tables()
TraitDao.empty_tables()

# nltk.download() is the simplest way to fetch the Brown Corpus, but it
# can fail with certificate errors that are hard to resolve without root
# access. If it fails for you, comment it out and use the manual
# extraction of brown.zip below instead.
nltk.download('brown')

# print("Extracting brown corpora to directory")
# zip_ref = zipfile.ZipFile('brown.zip', 'r')
# # nltk looks for the corpus under <data path>/corpora/brown
# # if nltk.data.path[0] fails, try [1], [2], ... and so on
# # (our system had about 10 alternative paths, [0] to [9]);
# # you need write permission for the path you choose
# zip_ref.extractall(nltk.data.path[0] + '/corpora')
# zip_ref.close()



print("Extracting words from word2vec model")
friendliness = model.most_similar(
    positive=['friendly', 'affectionate', 'loving', 'kind'],
    negative=['hostile', 'hurtful', 'unfriendly', 'mean'],
    topn=100000
)
unfriendliness = model.most_similar(
    positive=['hostile', 'hurtful', 'unfriendly', 'mean'],
    negative=['friendly', 'affectionate', 'loving', 'kind'],
    topn=100000
)
dominance = model.most_similar(
    positive=['dominant', 'assertive', 'capable', 'important'],
    negative=['submissive', 'apologetic', 'meek', 'passive'],
    topn=100000
)
undominance = model.most_similar(
    positive=['submissive', 'apologetic', 'meek', 'passive'],
    negative=['dominant', 'assertive', 'capable', 'important'],
    topn=100000
)

# set of over 8000 adjectives
adjectives = {word for word, pos in brown.tagged_words()
              if pos.startswith('JJ')}

print("Filtering for adjectives from extracted words")
friendliness = filter_adjectives(friendliness, adjectives)
unfriendliness = filter_adjectives(unfriendliness, adjectives)
dominance = filter_adjectives(dominance, adjectives)
undominance = filter_adjectives(undominance, adjectives)

print("Scaling the list to fit the range (0,10) or (-10,0)")
friendliness = scale_my_list(friendliness, True)
unfriendliness = scale_my_list(unfriendliness, False)
dominance = scale_my_list(dominance, True)
undominance = scale_my_list(undominance, False)

print("Adding traits to database:")
print("- 1/4")
for trait in tqdm(friendliness):
    TraitDao.add_friendliness_trait(trait[0], trait[1])
print("- 2/4")
for trait in tqdm(unfriendliness):
    TraitDao.add_friendliness_trait(trait[0], trait[1])
print("- 3/4")
for trait in tqdm(dominance):
    TraitDao.add_dominance_trait(trait[0], trait[1])
print("- 4/4")
for trait in tqdm(undominance):
    TraitDao.add_dominance_trait(trait[0], trait[1])

print("traits.db is ready!")

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
Extracting words from word2vec model
Filtering for adjectives from extracted words
Scaling the list to fit the range (0,10) or (-10,0)
Adding traits to database:
- 1/4
100%|██████████| 1738/1738 [00:14<00:00, 117.04it/s]
- 2/4
100%|██████████| 1503/1503 [00:13<00:00, 115.49it/s]
- 3/4
100%|██████████| 1261/1261 [00:10<00:00, 116.11it/s]
- 4/4
100%|██████████| 2074/2074 [00:23<00:00, 88.58it/s]
traits.db is ready!
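
To build intuition for the `positive`/`negative` queries used above, here is a rough pure-Python sketch of the idea: combine the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The 2-D vectors and the local `most_similar` function are hypothetical stand-ins written for this example. They approximate what such a query does and are not gensim's implementation.

```python
import math

# Hypothetical 2-D "embeddings"; the real Google News vectors have 300 dimensions.
vectors = {
    "friendly": (1.0, 0.0),
    "hostile": (-1.0, 0.0),
    "kind": (0.8, 0.6),
    "cruel": (-0.8, 0.6),
    "neutral": (0.0, 1.0),
}


def unit(v):
    """Return v scaled to length 1."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def most_similar(positive, negative, topn):
    """Rank words by cosine similarity to (positive words minus negative words)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, c in enumerate(unit(vectors[word])):
            query[i] += sign * c
    query = unit(query)
    # The query words themselves are excluded from the results.
    scored = [
        (word, sum(a * b for a, b in zip(unit(vec), query)))
        for word, vec in vectors.items()
        if word not in positive and word not in negative
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


ranked = most_similar(positive=["friendly"], negative=["hostile"], topn=3)
for word, score in ranked:
    print(word, round(score, 2))
```

In this toy space "kind" ranks first, "neutral" sits in the middle, and "cruel" comes last, which is why the tutorial runs a second query with the word lists swapped to collect the opposite pole.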





In [0]:
!du -h traits.db

128K	traits.db
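
If you want to look inside the generated file, the sketch below lists the tables and row counts of a SQLite database using only the standard library. It assumes that `traits.db` is a SQLite file, which this tutorial does not confirm, and it makes no assumption about table or column names. The `list_tables` helper and the `demo_trait` table are hypothetical and exist only for this example.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_tables(path):
    """Return (table_name, row_count) pairs for the SQLite file at path."""
    with closing(sqlite3.connect(path)) as conn:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        return [
            (name, conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0])
            for name in names
        ]


# Demonstration on a throwaway database with a made-up table,
# so the example runs without the real traits.db.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
with closing(sqlite3.connect(demo_path)) as conn:
    conn.execute("CREATE TABLE demo_trait (word TEXT, score INTEGER)")
    conn.executemany(
        "INSERT INTO demo_trait VALUES (?, ?)", [("warm", 10), ("cold", -10)]
    )
    conn.commit()

summary = list_tables(demo_path)
print(summary)  # [('demo_trait', 2)]
```

To try it on the real file, call `list_tables('traits.db')` in the notebook. If the file is not SQLite, the query will raise an error instead of returning a list.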
