<a href="https://colab.research.google.com/github/qamtam/ads_final_project/blob/main/Project_Abel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 1 - overview and assumptions

This is Project Abel, where I try to use a fragment of the Polish National Text Corpus to make my own embeddings, and then utilize NAWL's research "emotional embeddings" with publically available lists of names to explore if there could potentially be any deep language associations between names and emotionally charged words/adjectives.

After making the embeddings using essentially just the [Word2Vec](https://https://www.tensorflow.org/tutorials/text/word2vec) tutorial and some data management, we're going to 
to extract what pairs of words I want to compare from external databases (names and emotionally charged words), then compute similarities using standard measure of cosine similarity. The last point is to try to 'rank' the names based on some criteria - for example what name is the closest to the cluster of positive adjectives?

As a bonus, I attempt to make a 'best names' ranking based on weighted averages of their similarity scores to wanted/unwanted qualities. Potentially these ranking tools could be used to help new parents choose a 'objectively' good name that fits the wanted qualities.

My personal benchmark of success for the embeddings is going to be a RMSE metric that compares the similarity between my embeddings and the NAWL databases that is going to be in the same ballpark as [one achieved by embeddings based on the whole corpus in a similar test](https://github.com/qamtam/ads_final_project/blob/main/Project_CHONK.ipynb) (say, lower than 0.75).

Of course, right off the bat it is worth to point out that this tool won't be an oracle, if only because the cosine similarity of embeddings might associate strongly two words with directly opposite meaning (for example 
```
cosine_similarity("good", "bad")
``` will be very high in most embeddings. As far as I know, we don't have a way to extract the 'direction' of similarity. The National Corpus also is based on a vast stretch of time and combined embeddings won't necessarily bring the 'best' possible ones for the language of 2021. However, at the end of the day it can still be a fun party tool that's much more closer to reality than horoscopes ;)

# Section 2 - data preparation

We are going to use 5 main databases:


*   Polish National Text Corpus Embeddings
*   NAWL database with default thresholds
*   NAWL database marked for grammar
*   Comprehensive Polish first names list
*   Statistical rankings for given names for babies in Poland in 21st cenury

After downloading, extracting and preprocessing data using mostly Pandas for this project we will create the embeddings and then compute 'scoring tables' (names v words) using cosine similarity to get the starting point for actual analysis.

In [4]:
#let's try to make our very own embeddings of Polish language
import tensorflow as tf
import pandas as pd
import numpy as np
import io
import itertools
import os
import re
import string
import tensorflow as tf
import tqdm
import requests
import urllib.request
import tarfile
import shutil
import csv
from bs4 import BeautifulSoup
from os import walk
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation, Dense, Dot, Embedding, Flatten, GlobalAveragePooling1D, Reshape
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization


In [5]:
#preprocess raw data and get pandas-friendly csv
cwd = os.getcwd()
data_path = os.path.join(cwd, 'Data')
dest = os.path.join(data_path, 'dest')

In [6]:
url = 'http://clip.ipipan.waw.pl/NationalCorpusOfPolish?action=AttachFile&do=get&target=NKJP-PodkorpusMilionowy-1.2.tar.gz'
#^ this is the hand-annotated fragment of the full corpus, couple hundred MBs
thetarfile = url
ftpstream = urllib.request.urlopen(thetarfile)
thetarfile = tarfile.open(fileobj=ftpstream, mode="r|gz")
thetarfile.extractall(data_path)
# this should unpack the archive in /Data folder

In [7]:
try:
  os.mkdir(dest) 
except FileExistsError as err:
  pass

i = 0
# extract only pure text files (text.xml files), with not too much metadata
for subdir, dirs, files in os.walk(data_path):
  for file in files:
      if file == "text.xml":
          i+=1
          base, extension = os.path.splitext(file)
          destination = os.path.join(dest, '{}_{}{}'.format(base, i, extension))
          original = os.path.join(subdir, file)
          shutil.copyfile(original, destination)
          #^ copy files one by one to /Data/dest

In [8]:
nawlg = pd.read_excel('https://static-content.springer.com/esm/art%3A10.3758%2Fs13428-014-0552-1/MediaObjects/13428_2014_552_MOESM1_ESM.xlsx')


url = 'https://drive.google.com/file/d/1frgNyuSUZ9qQKma3CskKof7pLFGX0C7v/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
nawl = pd.read_csv(path)
#^ path to hand-downloaded file with sentiment/emotion analysis of about a thousand words 
# more info : https://exp.lobi.nencki.gov.pl/nawl-analysis
nawlg = nawlg[["NAWL_word", "Gram"]]


names = pd.read_csv('https://raw.githubusercontent.com/sdadas/polish-nlp-resources/master/lexicons/names/names.csv')
names.columns = ['name', 'metadata', 'u']
is_name =  names['metadata']=='P-N' #leaves only first Polish names
first_names = names[is_name].name


#NAWL researches sentiments behind popular words in basic emotions
# Happiness, Anger, Sadness, Fear, Disgust
# I'll simplify the dataset and think of 'happy' words as positive and every other type as negative


nawlg.columns = ['word', 'gram']

nawl.sort_values(by='word', ascending=True)


words_df = pd.merge(nawl, nawlg, on=['word']) # main nawl table doesn't have grammar, nawlg doesn't define categories as they are left up to the user

positive_adjectives = words_df[(words_df['category'] == 'H') & (words_df['gram'] == 3)].word #gram == 3 is adjective
negative_adjectives = words_df[(words_df['category'].isin(['A', 'S', 'D', 'F'])) & (words_df['gram'] == 3)].word
positive_words = words_df[(words_df['category'] == 'H')].word
negative_words = words_df[(words_df['category'].isin(['A', 'S', 'D', 'F']))].word
# handy shortcut of ASDF - anger sadness disgust and fear


popular_names  = pd.read_csv('https://api.dane.gov.pl/resources/21457,imiona-nadane-dzieciom-w-polsce-w-latach-2000-2019-imie-pierwsze/file') # finding this link was a PITA
#names given to Polish babies in 21st century; sorted by popularity in any given year
summed_names = popular_names.groupby(['Imię', 'Płeć']).agg({'Liczba': 'sum'}).sort_values(by= 'Liczba') # wanting to preserve information about sex creates a MultiIndex
girl_names = summed_names[summed_names.index.isin(['K'], level=1)]
boy_names = summed_names[summed_names.index.isin(['M'], level=1)]

def get_names_list(names_table, top_n):
  names_table.reset_index(inplace=True)
  names_table.drop(columns='Płeć', inplace=True)
  new = names_table.sort_values(by='Liczba', ascending=False)[:top_n]
  new.drop(columns='Liczba', inplace=True)
  new = pd.Series(new['Imię'])
  l = []
  for name in new:
    l.append(name.lower().capitalize())
  return pd.Series(l)

pop_boy_names = get_names_list(boy_names, 170)
pop_girl_names = get_names_list(girl_names, 170)
pop_names = get_names_list(summed_names, 400)
all_names = pd.concat((pop_names,first_names), copy=False)
all_names.drop_duplicates(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [9]:
_, _, filenames = next(walk(dest))
print(filenames)
outputcsv = os.path.join(dest, "output.csv")
#copy all of pure text content in the corpus to a single csv file
with open(outputcsv, 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    i = 0
    for file in filenames:
      i += 1
      fi = os.path.join(dest, file)
      with open(fi, "r", encoding="utf-8") as f:
        contents = f.read()
        soup = BeautifulSoup(contents, 'lxml')
        texts = soup.find_all("ab")
        writer.writerows(texts)

['text_3651.xml', 'text_2046.xml', 'text_3483.xml', 'text_3327.xml', 'text_206.xml', 'text_352.xml', 'text_1041.xml', 'text_1702.xml', 'text_3640.xml', 'text_1256.xml', 'text_1562.xml', 'text_1899.xml', 'text_2220.xml', 'text_618.xml', 'text_1746.xml', 'text_164.xml', 'text_2097.xml', 'text_1213.xml', 'text_1019.xml', 'text_137.xml', 'text_2839.xml', 'text_2386.xml', 'text_1263.xml', 'text_292.xml', 'text_3342.xml', 'text_1469.xml', 'text_1852.xml', 'text_1737.xml', 'text_2382.xml', 'text_2320.xml', 'text_3169.xml', 'text_2146.xml', 'text_2763.xml', 'text_3060.xml', 'text_3503.xml', 'text_2392.xml', 'text_752.xml', 'text_2758.xml', 'text_518.xml', 'text_1823.xml', 'text_2221.xml', 'text_1481.xml', 'text_2480.xml', 'text_1616.xml', 'text_646.xml', 'text_1878.xml', 'text_3142.xml', 'text_1960.xml', 'text_838.xml', 'text_1176.xml', 'text_973.xml', 'text_168.xml', 'text_1890.xml', 'text_2061.xml', 'text_3065.xml', 'text_413.xml', 'text_3677.xml', 'text_3187.xml', 'text_2806.xml', 'text_106

In [10]:
data = pd.read_csv(outputcsv, header=None)
pd.set_option('display.max_colwidth', None)

In [11]:
data.head(20)

Unnamed: 0,0
0,"Włodarze mistrzów Polski pragną podpisać z braćmi Gollobami co najmniej 2-letnie umowy. Po sezonie 1998 skończy się bowiem kontrakt Piotrowi Protasiewiczowi, zaś jednoczesne prowadzenie rozmów z trzema liderami na pewno przekroczyłoby wówczas finansowe możliwości klubu. Z kolei Tomasz i Jacek pragną podpisać roczne kontrakty."
1,"Siostry znów wyszły na drogi - około 70 pielęgniarek i położnych blokowało przez godzinę trasę dojazdową na jedyny most przez Wisłę w Płocku i prowadzącą tamtędy drogę krajową nr 60. Protest przerwały po negocjacjach z policją. Według danych OZZPiP w 48 placówkach w całym kraju trwa strajk bezterminowy, a głodówka w 31."
2,– ty się ciesz w mniejszych miasteczkach nie ma pracy w ogóle..
3,– no dlatego mówię że..
4,– ty jeszcze w ostatniej chwili tutaj się. złapałeś..
5,– może robiłbym coś innego wtedy..
6,– byś był pracownikiem socjalnym tak? pewno jak byś nie spotkał takich przyjemnych osób jak my. to byś szybko uciekał..
7,– z miasta Łodzi..
8,– a bo głupi byłem bo rodzinie pożyczył bo tam komuś pożyczył i nie ma pieniędzy rozumiesz? a miał kasę taką że miał dom budować..
9,– ale to chyba oddadzą mu nie?


In [12]:
#simplify the analyzed texts

def removeAccents(input_text):
    strange='ŮôῡΒძěἊἦëĐᾇόἶἧзвŅῑἼźἓŉἐÿἈΌἢὶЁϋυŕŽŎŃğûλВὦėἜŤŨîᾪĝžἙâᾣÚκὔჯᾏᾢĠфĞὝŲŊŁČῐЙῤŌὭŏყἀхῦЧĎὍОуνἱῺèᾒῘᾘὨШūლἚύсÁóĒἍŷöὄЗὤἥბĔõὅῥŋБщἝξĢюᾫაπჟῸდΓÕűřἅгἰშΨńģὌΥÒᾬÏἴქὀῖὣᾙῶŠὟὁἵÖἕΕῨčᾈķЭτἻůᾕἫжΩᾶŇᾁἣჩαἄἹΖеУŹἃἠᾞåᾄГΠКíōĪὮϊὂᾱიżŦИὙἮὖÛĮἳφᾖἋΎΰῩŚἷРῈĲἁéὃσňİΙῠΚĸὛΪᾝᾯψÄᾭêὠÀღЫĩĈμΆᾌἨÑἑïოĵÃŒŸζჭᾼőΣŻçųøΤΑËņĭῙŘАдὗპŰἤცᾓήἯΐÎეὊὼΘЖᾜὢĚἩħĂыῳὧďТΗἺĬὰὡὬὫÇЩᾧñῢĻᾅÆßшδòÂчῌᾃΉᾑΦÍīМƒÜἒĴἿťᾴĶÊΊȘῃΟúχΔὋŴćŔῴῆЦЮΝΛῪŢὯнῬũãáἽĕᾗნᾳἆᾥйᾡὒსᾎĆрĀüСὕÅýფᾺῲšŵкἎἇὑЛვёἂΏθĘэᾋΧĉᾐĤὐὴιăąäὺÈФĺῇἘſგŜæῼῄĊἏØÉПяწДĿᾮἭĜХῂᾦωთĦлðὩზკίᾂᾆἪпἸиᾠώᾀŪāоÙἉἾρаđἌΞļÔβĖÝᾔĨНŀęᾤÓцЕĽŞὈÞუтΈέıàᾍἛśìŶŬȚĳῧῊᾟάεŖᾨᾉςΡმᾊᾸįᾚὥηᾛġÐὓłγľмþᾹἲἔбċῗჰხοἬŗŐἡὲῷῚΫŭᾩὸùᾷĹēრЯĄὉὪῒᾲΜᾰÌœĥტ'
    ascii_replacements='UoyBdeAieDaoiiZVNiIzeneyAOiiEyyrZONgulVoeETUiOgzEaoUkyjAoGFGYUNLCiIrOOoqaKyCDOOUniOeiIIOSulEySAoEAyooZoibEoornBSEkGYOapzOdGOuraGisPngOYOOIikoioIoSYoiOeEYcAkEtIuiIZOaNaicaaIZEUZaiIaaGPKioIOioaizTIYIyUIifiAYyYSiREIaeosnIIyKkYIIOpAOeoAgYiCmAAINeiojAOYzcAoSZcuoTAEniIRADypUitiiIiIeOoTZIoEIhAYoodTIIIaoOOCSonyKaAsSdoACIaIiFIiMfUeJItaKEISiOuxDOWcRoiTYNLYTONRuaaIeinaaoIoysACRAuSyAypAoswKAayLvEaOtEEAXciHyiiaaayEFliEsgSaOiCAOEPYtDKOIGKiootHLdOzkiaaIPIIooaUaOUAIrAdAKlObEYiINleoOTEKSOTuTEeiaAEsiYUTiyIIaeROAsRmAAiIoiIgDylglMtAieBcihkoIrOieoIYuOouaKerYAOOiaMaIoht'
    translator=str.maketrans(strange,ascii_replacements)
    return input_text.translate(translator)
def removeAccentsFromAList(words):
  ls = []
  for word in words:
    word = removeAccents(word)
    ls.append(word)
  return ls



# Section 3 - making the embeddings

This is mostly done for the sake of understanding the logic underneath skip-grams and just making it work within context of my dataset. The easiest way to make a compatible dataset was to make a pandas Series object, and to get the Series we've used a standard dictionary.

This section of code will take a while to compute, therefore I recommend using maybe up to 5 epochs if you are just using Colab.

Basic logic:


1.   Get data and all sentences
2.   Vectorize the sentences
3.   Generate targets, labels and contexts for training the model:
*    for each sentence and all target words X within the sentence generate positive skip-grams (if I show you the word X what would be the words that are close to that word X). Effectively these will make a 'true' label equal to 1.
*    for each positive skip-gram generate a few fake negative skip-grams for the sake of training (given that I know that in the neighbourhood of the word X I have the word Y in this sentence find me a couple of 'fake-pairs' (X,Z1) (X,Z2)) .... Effectively these will make a 'false' label equal to 0.
*    stitch together three lists:  target words (vectorized) , a list of context words for each target word (vectorized), and a list of binary labels so that the model will be able to differentiate and learn.
4.   Train the model using Word2Vec that embeds and then looks up both the target word and the context words and then creates a dot product for all of target-context combinations (in this case 5 multiplications per target word). These dot products are used as base for logits in case of the loss products. Loss function is then applied across the batch to upgrade the embeddings.
5.   Save the embeddings and fitting words in an external file









In [13]:
dictionary = {}
i = 0
#data is a a DataFrame with a single column, data[0] is a Series with all sentences 
for value in data[0]:
  dictionary[i] = removeAccents(value)
  i+=1
s = pd.Series(dictionary)



dataset = tf.data.Dataset.from_tensor_slices(s)
# tensorflow word2vec tutorial
# this is just changing a sentence to a sequence of numbers
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

text_dataset = dataset
max_features = 30000  # Maximum vocab size.
max_len = 2000  # Sequence length to pad the outputs to.
embedding_dims = 2

# Create the layer.
vectorize_layer = TextVectorization(
 max_tokens=max_features,
 output_mode='int',
 output_sequence_length=max_len)

# Now that the vocab layer has been created, call `adapt` on the text-only
# dataset to create the vocabulary. You don't have to batch, but for large
# datasets this means we're not keeping spare copies of the dataset.
vectorize_layer.adapt(text_dataset.batch(64))

# Create the model that uses the vectorize text layer
model = tf.keras.models.Sequential()

# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))

# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
model.add(vectorize_layer)

# Now, the model can map strings to integers, and you can add an embedding
# layer to map these integers to learned embeddings.
input_data = [["Janusz"]]
model.predict(input_data)





array([[1227,    0,    0, ...,    0,    0,    0]])

In [14]:
def vectorize_text(text):
  text = tf.expand_dims(text, -1)
  return tf.squeeze(vectorize_layer(text))
# Vectorize the data in text_ds.
text_vector_ds = dataset.batch(1024).prefetch(1).map(vectorize_layer).unbatch()

In [15]:
sequences = list(text_vector_ds.as_numpy_iterator())

In [16]:
#here we generate "tensorflow-friendly" training data, basing on skip-grams
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
AUTOTUNE = tf.data.AUTOTUNE
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  # Elements of each training example are appended to these lists.
  targets, contexts, labels = [], [], []

  # Build the sampling table for vocab_size tokens.
  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  # Iterate over all sequences (sentences) in dataset.
  for sequence in tqdm.tqdm(sequences):

    # Generate positive skip-gram pairs for a sequence (sentence).
    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
          sequence, 
          vocabulary_size=vocab_size,
          sampling_table=sampling_table,
          window_size=window_size,
          negative_samples=0)

    # Iterate over each positive skip-gram pair to produce training examples 
    # with positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1, 
          num_sampled=num_ns, 
          unique=True, 
          range_max=vocab_size, 
          seed=42, 
          name="negative_sampling")

      # Build context and label vectors (for one target word)
      negative_sampling_candidates = tf.expand_dims(
          negative_sampling_candidates, 1)

      context = tf.concat([context_class, negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64") #one skip-gram is correct, others ain't

      # Append each element from the training example to global lists.
      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels

In [17]:
targets, contexts, labels = generate_training_data(
    sequences=sequences, 
    window_size=3, 
    num_ns=4, 
    vocab_size=30000, 
    seed=42)
print(len(targets), len(contexts), len(labels))

100%|██████████| 39566/39566 [11:59<00:00, 54.98it/s]

1579081 1579081 1579081





In [53]:
class Word2Vec(Model):
  def __init__(self, vocab_size, embedding_dim):
    super(Word2Vec, self).__init__()
    self.target_embedding = Embedding(vocab_size, 
                                      embedding_dim,
                                      input_length=1,
                                      name="w2v_embedding", )
    #^ this is the layer we've wanted all along
    self.context_embedding = Embedding(vocab_size, 
                                       embedding_dim, 
                                       input_length=4+1)
    self.dots = Dot(axes=(3,2))
    self.flatten = Flatten()

  def call(self, pair):
    target, context = pair
    we = self.target_embedding(target) # 1024 * 1 * 128
    ce = self.context_embedding(context) # 1024 * 5 * 1 * 128
    dots = self.dots([ce, we]) # 1024 * 5 * 1 * 1
    # tf.print(self.flatten(dots)
    return self.flatten(dots) #1024* 5 -> batched dot products that are going to become logits through from_logits=False

In [19]:
def custom_loss(x_logit, y_true):
      return tf.nn.sigmoid_cross_entropy_with
      _logits(logits=x_logit, labels=y_true)

In [54]:
embedding_dim = 128
word2vec = Word2Vec(30000, embedding_dim)
word2vec.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [21]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [56]:
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)


In [None]:
word2vec.fit(dataset, epochs=5, callbacks=[tensorboard_callback]) #I usually use 30 epochs, but they take a while

In [58]:
#get the weights and save later for further analysis
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [59]:
word2vec.get_layer('w2v_embedding')(vectorize_text("")[0]) #exact same embed for empty text - correct

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([-0.02483749, -0.03673226,  0.01241419,  0.01682151,  0.00544847,
       -0.01824473, -0.02830104,  0.03343764, -0.00673729,  0.04658386,
       -0.04492407,  0.00945389,  0.03718581,  0.03711898,  0.00615319,
        0.04511199, -0.01636462,  0.04169687,  0.04331774, -0.00337882,
       -0.01841544, -0.00472198, -0.04208391, -0.01743228,  0.00390824,
       -0.02334045, -0.02550955,  0.03696283, -0.01576573,  0.00769513,
        0.04868232, -0.01871911,  0.02419556, -0.0018949 , -0.02006028,
       -0.02160146, -0.03797182,  0.03930881, -0.02757534,  0.03011488,
       -0.02690549, -0.01187821, -0.03138383,  0.02126404,  0.02107034,
       -0.02285545,  0.04750523,  0.04638315,  0.0240323 , -0.00519545,
       -0.02278857,  0.04368022, -0.02596071,  0.01892178, -0.03381976,
       -0.01043171, -0.02123529, -0.01035564,  0.03518642, -0.00031258,
       -0.00473078,  0.04352957, -0.03083882,  0.03857578,  0.02121847,
       -0.011086

In [60]:
weights[0] #most popular token is just empty one - correct

array([-0.02483749, -0.03673226,  0.01241419,  0.01682151,  0.00544847,
       -0.01824473, -0.02830104,  0.03343764, -0.00673729,  0.04658386,
       -0.04492407,  0.00945389,  0.03718581,  0.03711898,  0.00615319,
        0.04511199, -0.01636462,  0.04169687,  0.04331774, -0.00337882,
       -0.01841544, -0.00472198, -0.04208391, -0.01743228,  0.00390824,
       -0.02334045, -0.02550955,  0.03696283, -0.01576573,  0.00769513,
        0.04868232, -0.01871911,  0.02419556, -0.0018949 , -0.02006028,
       -0.02160146, -0.03797182,  0.03930881, -0.02757534,  0.03011488,
       -0.02690549, -0.01187821, -0.03138383,  0.02126404,  0.02107034,
       -0.02285545,  0.04750523,  0.04638315,  0.0240323 , -0.00519545,
       -0.02278857,  0.04368022, -0.02596071,  0.01892178, -0.03381976,
       -0.01043171, -0.02123529, -0.01035564,  0.03518642, -0.00031258,
       -0.00473078,  0.04352957, -0.03083882,  0.03857578,  0.02121847,
       -0.01108634,  0.03293141,  0.03806681, -0.04023711, -0.04

In [61]:
# save as files
# I've commented these out after first run
# you can download these if you want to

with open('/content/Data/fullvectors.tsv', 'w', encoding='utf-8') as writefile:
  for index, word in enumerate(vocab):
    vec = weights[index]
    writefile.write('\t'.join([str(x) for x in vec]) + "\n")


with open('/content/Data/fullvocab.tsv', 'w', encoding='utf-8') as writefile:
  for index, word in enumerate(vocab):
    vec = weights[index]
    writefile.write(word + "\n")



In [62]:
df = pd.DataFrame(index = vocab, data = weights)

In [63]:
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127
,-0.024837,-0.036732,0.012414,0.016822,0.005448,-0.018245,-0.028301,0.033438,-0.006737,0.046584,-0.044924,0.009454,0.037186,0.037119,0.006153,0.045112,-0.016365,0.041697,0.043318,-0.003379,-0.018415,-0.004722,-0.042084,-0.017432,0.003908,-0.023340,-0.025510,0.036963,-0.015766,0.007695,0.048682,-0.018719,0.024196,-0.001895,-0.020060,-0.021601,-0.037972,0.039309,-0.027575,0.030115,...,0.047388,0.000687,-0.027133,0.020897,-0.049600,-0.038707,-0.021299,-0.025382,-0.003328,-0.045998,-0.002449,0.017047,0.016116,-0.011192,0.030194,0.016583,0.022205,0.004009,0.016263,-0.002267,-0.047626,-0.024924,-0.036637,0.032473,-0.007820,0.004478,-0.023541,0.018130,0.048227,-0.020017,-0.014868,0.022978,0.016364,-0.005410,-0.034334,-0.000721,-0.016032,0.014276,0.038685,-0.048515
[UNK],-0.064359,-0.047015,-0.004526,-0.180534,-0.123830,0.065138,-0.273333,-0.106018,0.035714,0.392990,0.197673,0.251411,0.029346,0.209724,-0.392422,0.504875,0.110454,0.373819,-0.101998,0.035423,-0.186290,0.339091,0.215500,0.037200,0.107900,-0.154004,0.016064,-0.052432,0.243302,-0.209174,0.050753,0.188743,0.037367,0.064409,-0.198460,-0.054484,0.230221,0.171321,-0.188478,0.024947,...,0.082984,-0.398212,-0.152714,-0.205721,0.019334,0.062322,-0.189677,-0.112330,-0.012367,0.083114,0.062711,-0.050594,0.394863,-0.173089,0.051981,0.411680,-0.457209,-0.092347,0.240177,0.284800,0.308165,-0.188713,-0.156145,-0.273702,-0.188762,0.065544,0.221635,-0.019786,-0.179682,0.057393,-0.281604,-0.134375,0.347260,0.234844,-0.520911,0.030800,-0.063385,0.123444,-0.006586,0.073736
w,0.056266,-0.045822,0.395911,-0.038012,-0.118911,-0.049477,-0.204109,-0.132256,0.209648,-0.098129,-0.068143,-0.175674,-0.108409,0.218563,-0.060869,-0.325691,0.227364,0.021227,-0.010059,-0.112198,-0.036145,0.101465,0.148372,0.062876,0.103714,-0.136108,-0.232315,0.126297,-0.191143,-0.535246,-0.129462,0.220678,-0.214070,-0.011959,0.045341,0.209881,-0.272174,0.077772,0.023669,-0.287892,...,0.153236,-0.217081,0.108446,-0.170730,0.097523,-0.005537,-0.421822,0.229934,-0.216907,-0.363507,-0.001881,0.074098,-0.153773,-0.144519,-0.118475,0.381971,-0.011843,-0.257297,-0.310013,-0.131974,-0.073667,0.073014,0.011014,-0.141172,0.352729,-0.249161,0.316622,-0.071495,-0.217206,0.004697,-0.075110,-0.125751,0.296172,0.091784,-0.072615,-0.182985,-0.041949,0.190219,-0.048720,-0.020745
i,-0.098896,0.054621,-0.201118,0.108995,-0.074785,0.378209,-0.011638,-0.129472,0.048733,0.227663,-0.252854,-0.326302,0.210770,0.403392,-0.240531,-0.192813,-0.075281,-0.170519,-0.102864,0.233133,-0.246156,0.037031,-0.184695,0.355848,0.038112,-0.230394,-0.163109,-0.183815,0.188815,-0.029593,-0.021575,0.126311,-0.096236,-0.231859,0.027623,0.068245,-0.234732,-0.149734,0.064883,0.134109,...,0.330849,-0.146145,0.165825,-0.198414,0.095211,-0.094334,0.117748,0.056130,0.129842,-0.158052,-0.112068,0.050612,0.357501,-0.212121,0.034596,0.179489,-0.113229,-0.435815,0.247262,-0.068847,0.163874,-0.146734,-0.136601,0.037481,0.295391,0.036213,-0.118212,0.152301,-0.048234,-0.101550,-0.034617,-0.086828,-0.102655,0.124136,0.163484,0.142503,0.211788,0.043500,-0.033191,-0.237854
sie,-0.070718,-0.002241,-0.047012,-0.371792,-0.083358,0.096385,0.244463,0.003952,-0.076309,-0.030033,-0.109702,-0.505827,0.146561,0.135615,-0.328960,-0.005799,0.210233,-0.153766,0.116432,-0.074843,-0.190284,0.265984,-0.043114,0.184675,0.079656,0.147328,-0.210179,-0.207956,0.128580,-0.021809,0.081989,-0.090329,0.051729,0.068627,0.131462,-0.191967,-0.019155,0.070678,0.041561,-0.072503,...,-0.012833,-0.359638,0.100713,0.030499,0.009069,0.340664,-0.089640,0.010810,-0.238951,0.131268,0.108168,0.025912,-0.196329,-0.113024,0.192865,0.023198,-0.265593,-0.083724,0.027461,0.134425,-0.252683,-0.061862,-0.325193,-0.384301,0.029241,0.179920,0.467372,-0.205286,-0.333561,0.031550,-0.022287,0.045707,-0.284012,0.285857,-0.099026,-0.044976,-0.083772,0.401077,0.271711,0.027083
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
trzymamy,0.052315,-0.008762,0.129851,-0.025644,-0.156673,0.094053,0.026970,0.148202,-0.078114,0.035028,0.080837,-0.009543,0.039121,0.165118,-0.081582,-0.000328,-0.128693,0.012118,0.049143,0.009930,0.095287,-0.048284,-0.045773,-0.097127,0.171910,-0.036805,0.028408,-0.149740,0.029134,0.149688,0.091240,0.080241,0.047771,-0.121698,0.074974,0.111587,0.220265,0.119885,0.006444,0.023481,...,-0.033830,0.034160,0.020482,0.069199,0.174578,-0.011803,-0.172840,0.238714,-0.194605,-0.006980,0.045307,-0.046986,0.080449,0.111523,-0.064762,0.043405,-0.000153,-0.095957,-0.053971,-0.060867,0.065962,0.062860,-0.087496,-0.059379,0.097202,0.218391,-0.018738,0.023419,-0.090280,0.023935,-0.055868,0.016594,0.040294,-0.081500,0.026649,0.060546,-0.011309,0.023799,-0.046171,-0.153522
trzymacie,-0.062123,0.186920,-0.091498,0.083987,-0.342476,0.078476,0.131664,0.314952,0.011901,0.149310,-0.247451,0.171842,0.248780,-0.065513,-0.322037,0.067789,0.381753,-0.064026,-0.170716,-0.029317,0.024556,0.071480,-0.094231,0.056266,-0.109961,-0.160616,-0.229537,-0.145049,0.170932,0.007391,-0.172557,0.010699,0.118290,0.132552,-0.327730,0.030509,-0.132483,0.349839,-0.144324,0.209134,...,0.042758,-0.098441,0.014189,-0.063849,-0.086092,0.152088,0.020124,0.235945,0.114999,-0.027333,0.176772,0.105222,0.068621,-0.040074,-0.112347,-0.122145,-0.015784,-0.165722,-0.114229,0.155939,-0.035118,-0.038683,-0.215092,0.072998,-0.080205,0.103007,-0.034898,-0.085899,-0.118534,0.071638,-0.161946,0.041331,0.036249,0.225211,-0.071558,-0.001049,-0.065818,0.321383,0.071582,-0.041985
trzydziestej,0.351589,0.024960,-0.259358,-0.011474,-0.110001,0.034904,0.136953,0.022302,0.047036,0.343929,-0.074164,-0.307230,-0.013447,-0.129908,-0.155190,0.066066,0.120953,0.267573,0.294167,0.148991,-0.004710,0.268174,-0.125255,0.112876,-0.040184,0.282129,-0.180770,0.097785,-0.065791,-0.075466,-0.152915,-0.085818,-0.224804,-0.160662,-0.209274,-0.181914,0.205868,-0.064031,0.041652,-0.011547,...,-0.167090,0.150052,0.145778,-0.161874,0.024398,0.147614,-0.035390,0.376594,-0.301244,-0.164475,0.067786,-0.035736,0.230805,-0.247909,-0.384732,0.097048,-0.260754,-0.159712,-0.336423,0.251263,0.314574,0.070573,-0.266805,0.216211,-0.173197,0.151920,0.045521,-0.098654,-0.010340,0.000676,0.087192,-0.011863,-0.002598,0.077393,-0.095633,-0.058531,0.109051,0.030025,0.251389,0.256590
trzydniowa,0.154316,0.338883,-0.202823,0.017109,-0.069654,0.147592,0.072134,-0.077329,0.042543,0.000958,0.042164,0.227838,-0.100273,-0.059961,-0.153195,0.023772,-0.039304,0.100093,0.113152,0.008360,-0.131142,0.035947,0.253485,-0.101730,0.185677,0.078599,-0.291340,0.057882,0.070235,-0.253420,-0.056537,0.165932,-0.090237,-0.062576,-0.087962,0.273487,-0.015274,-0.162589,-0.024496,0.365763,...,0.070416,-0.113697,-0.003403,-0.046504,-0.058013,0.074039,-0.191720,-0.047577,0.012867,0.059482,0.158064,0.225145,0.133150,-0.049210,0.051741,-0.035212,-0.063664,0.231165,0.094318,-0.282643,-0.081737,0.229816,0.094642,0.255421,0.046669,-0.084916,0.103934,0.133178,-0.079175,-0.043617,0.104222,0.215814,0.043279,-0.095749,-0.080650,-0.188117,0.023888,-0.031042,-0.066440,0.370761


In [64]:
df.index = vocab
df.index

Index(['', '[UNK]', 'w', 'i', 'sie', 'na', 'nie', 'z', 'do', 'to',
       ...
       'tuli', 'tula', 'tudziez', 'tucholskiego”', 'tt', 'trzymamy',
       'trzymacie', 'trzydziestej', 'trzydniowa', 'trzezwym'],
      dtype='object', length=30000)

In [65]:

def similarity(text1, text2):
  try:
    return cosine_similarity(df.loc[text1].values.reshape(1,-1), df.loc[text2].values.reshape(1,-1))[0][0]
  except KeyError:
    return np.nan
    pass
def generate_scoreboard(names, objects):
  table = []
  for name in names:
    row = []
    for obj in objects:
      row.append(similarity(name, obj))
    table.append(row)
  df = pd.DataFrame(data = table, index = names, columns = objects)
  df.dropna(how = 'all', axis = 1, inplace=True)
  df.dropna(how = 'all', axis = 0, inplace=True)
  return df

In [66]:
from sklearn.metrics.pairwise import cosine_similarity 
positive_adjectives = removeAccentsFromAList(positive_adjectives)
positive_words = removeAccentsFromAList(positive_words)
negative_adjectives = removeAccentsFromAList(negative_adjectives)
negative_words = removeAccentsFromAList(negative_words)
first_names = removeAccentsFromAList(first_names)
all_names = removeAccentsFromAList(all_names)

In [67]:
#let's see how closely related are names and adjectives and other emotionally charged words 
positive_scores_from_adj = generate_scoreboard(all_names, positive_adjectives)
positive_scores_from_words = generate_scoreboard(all_names, positive_words)
negative_scores_from_words = generate_scoreboard(all_names, negative_words)
negative_scores_from_adj = generate_scoreboard(all_names, negative_adjectives)

In [68]:
positive_scores_from_adj

Unnamed: 0,swietny,ladny,sloneczny,wierny,pozytywny,szczesliwy,zdrowy,mily,czysty,udany,wygodny,bezpieczny,zadowolony,inteligentny,wiosenny,wesoly,genialny,spokojny,sliczny,kochany,radosny,cieply,kolorowy,wspanialy
adam,0.213965,0.099813,0.183076,0.112663,-0.033610,0.172997,0.160217,0.216495,0.209499,0.094382,0.021792,0.073307,0.103370,0.103290,0.167065,0.299295,0.257546,0.072736,0.083186,0.146089,0.126639,0.109453,0.227432,0.060280
adolf,0.137449,-0.058679,0.106547,0.047050,0.028958,0.191087,0.153232,0.034508,0.029242,0.146049,-0.076716,-0.020616,0.124444,0.005798,0.229138,0.196488,0.265730,0.181999,0.220892,0.139591,0.335833,0.170038,0.341744,0.081969
adrian,0.162258,-0.235445,0.145528,-0.014037,-0.042950,0.013868,-0.079577,-0.142102,0.203345,-0.009664,-0.269542,-0.074669,0.045400,-0.022945,0.209224,-0.105478,0.073231,0.031932,-0.021906,0.009868,0.260384,0.064198,0.211016,0.217653
agata,0.311782,0.261003,0.350692,0.361884,0.067773,0.328007,0.206374,0.269149,0.438176,0.038511,-0.003899,0.022451,0.255077,0.327868,0.260027,0.297693,0.184877,0.107622,0.360481,0.273347,0.166735,0.140366,0.301142,0.479917
ala,0.117852,0.250598,0.247033,0.170614,0.186519,0.008233,0.142353,0.170999,0.035461,0.162404,0.202874,-0.037489,0.202850,0.145227,-0.132467,0.288612,0.003538,-0.038870,0.119018,0.100231,0.017138,0.026092,0.094723,0.206338
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
wlodzimierza,0.178356,-0.028909,0.154642,0.209339,0.217597,0.115000,0.113227,0.079454,0.232664,0.183903,-0.015616,-0.063733,0.082424,0.135541,0.559247,0.082155,0.288818,0.199700,0.080160,0.141993,0.322348,0.178780,0.191694,0.158977
zbigniewa,0.082865,0.215007,-0.042065,0.246781,-0.094747,0.202065,0.104366,0.220096,-0.003072,0.082810,-0.034926,-0.038475,0.117778,-0.038180,0.302546,0.277738,0.312383,0.087901,0.334147,0.245057,0.159119,0.216899,0.255016,0.176748
zbyszek,0.012970,0.277027,0.095971,0.261003,0.246741,0.329874,0.159392,0.193538,0.050384,-0.041352,0.318520,0.125237,0.478673,0.078606,-0.022471,0.262232,0.111541,0.149243,0.266282,0.125378,-0.085944,0.029299,0.110129,0.096676
zdzislawa,0.154115,-0.090976,0.090922,0.128597,-0.129328,0.109762,0.018274,0.036572,0.022555,-0.011697,-0.132256,-0.041932,0.050664,-0.037877,0.238570,0.218806,0.323835,0.103888,0.270156,0.141876,0.108451,0.090291,0.262907,0.016177


In [69]:
# let's try to find "strongest" positive and negative features of a name, basing on encodings
# in other words - what would be predicted features of a baby with a given name according to the language itself?

def rank_features_of_a_name(name, df):
  return df.sort_values(by = name, axis = 1).loc[name]

rank_features_of_a_name("harry", positive_scores_from_adj).tail(10)
#harry scores good grades in genius, great and good-hearted, looks like the books are reflected here well
#this snippet requires using all_names to generate scoreboards, Harry isn't a traditional Polish name

wiosenny      0.260281
szczesliwy    0.265580
czysty        0.288713
kochany       0.293227
kolorowy      0.300339
swietny       0.318472
sloneczny     0.324745
mily          0.351989
sliczny       0.378656
wspanialy     0.413249
Name: harry, dtype: float64

In [70]:
#Good sanity check - look up Adolf in the context of negative adjectives, if any name had strong associations it would be that
#there might have been a couple more Adolfs during the decades, but let's be honest - you know the one responsible for 95%+ of articles
#this snippet requires using all_names to generate scoreboards, Adolf isn't exactly popular anymore
rank_features_of_a_name("adolf", negative_scores_from_adj).tail(10)
#some of the related are vain, mad, spendy, offended, depressed, despairing, pedantic, stubborn, arrogant
#sounds about right

olbrzymi       0.177481
wyczerpany     0.186030
uparty         0.190441
ponury         0.193700
zrozpaczony    0.208108
tajny          0.219996
niespokojny    0.292116
glosny         0.343888
policyjny      0.358815
daleki         0.420135
Name: adolf, dtype: float64

In [71]:
#Okay, what would be his strong points then?
rank_features_of_a_name("adolf", positive_scores_from_adj).tail(10)
#surprisingly the top word appears to be "kind hearted", perhaps in context of his closest environment?
#also genius, trusted, loyal, intelligent, lovely, cheerful
#taking into account that he was highly bipolar this might be sensible

zdrowy        0.153232
cieply        0.170038
spokojny      0.181999
szczesliwy    0.191087
wesoly        0.196488
sliczny       0.220892
wiosenny      0.229138
genialny      0.265730
radosny       0.335833
kolorowy      0.341744
Name: adolf, dtype: float64

In [72]:
for name in all_names:
  print(name) #list of names for reference

Jakub
Julia
Kacper
Wiktoria
Zuzanna
Szymon
Natalia
Aleksandra
Mateusz
Maja
Michal
Oliwia
Filip
Lena
Amelia
Jan
Bartosz
Zofia
Dawid
Piotr
Mikolaj
Antoni
Adam
Weronika
Hanna
Martyna
Wojciech
Alicja
Wiktor
Anna
Aleksander
Karolina
Patryk
Emilia
Maciej
Kamil
Maria
Nikola
Dominik
Igor
Franciszek
Pawel
Gabriela
Marcel
Magdalena
Oskar
Patrycja
Karol
Klaudia
Krzysztof
Paulina
Tomasz
Maksymilian
Kinga
Hubert
Adrian
Oliwier
Bartlomiej
Katarzyna
Laura
Milosz
Sebastian
Stanislaw
Nadia
Dominika
Alan
Milena
Nikodem
Lukasz
Agata
Antonina
Krystian
Marta
Damian
Marcin
Daniel
Kornelia
Pola
Iga
Malgorzata
Joanna
Michalina
Leon
Konrad
Liliana
Gabriel
Tymoteusz
Fabian
Marcelina
Jagoda
Kamila
Tymon
Daria
Ignacy
Sandra
Julian
Izabela
Barbara
Nina
Rafal
Helena
Przemyslaw
Grzegorz
Artur
Roksana
Eryk
Agnieszka
Ksawery
Justyna
Natan
Kaja
Sara
Ewa
Blazej
Monika
Blanka
Lucja
Radoslaw
Olga
Olaf
Anastazja
Borys
Kuba
Klara
Kajetan
Robert
Marek
Angelika
Adrianna
Arkadiusz
Malwina
Norbert
Cezary
Matylda
Eliza
Urszula
G

In [73]:
user_name = "mateusz" #@param {type:"string"}
user_df = positive_scores_from_adj #@param ["positive_scores_from_adj", "negative_scores_from_adj", "positive_scores_from_words", "negative_scores_from_words"] {type:"raw"}
#Try it yourself!
#Type in a name form field and press Shift+Enter to automatically execute this cell and find positive/negative adjectives/words closest to that name
rank_features_of_a_name(user_name, user_df).tail(10)

wierny        0.187829
czysty        0.194548
zdrowy        0.198441
wygodny       0.203467
wesoly        0.220694
zadowolony    0.266448
mily          0.273808
sloneczny     0.303760
szczesliwy    0.318839
kochany       0.351850
Name: mateusz, dtype: float64

In [74]:
user_word = "swietny" #@param {type:"string"}
user_df = positive_scores_from_adj #@param ["positive_scores_from_adj", "negative_scores_from_adj", "positive_scores_from_words", "negative_scores_from_adj"] {type:"raw"}
user_names = pop_names #@param ["all_names", "pop_names", "pop_boy_names", "pop_girl_names"] {type:"raw"}
def rank_names_by_word(word, df, names=user_names):
  return df.loc[list(set(names).intersection(set(df.index)))].sort_values(by = word)[[word]]

rank_names_by_word(user_word, user_df).tail(10) #what names are the closest to given adjective (happy by default)?

Unnamed: 0,swietny


In [75]:
user_df = positive_scores_from_adj #@param ["positive_scores_from_adj", "positive_scores_from_words", "negative_scores_from_adj", "negative_scores_from_words"] {type:"raw"}
user_positive = True #@param {type:"boolean"}
user_names = pop_boy_names #@param ["all_names", "pop_names", "pop_boy_names", "pop_girl_names"] {type:"raw"}

#is there a "best name"?
#let's find out!
#note that you can change user_names to boy or girl names if that's interesting for you

def rank_names(df, positive=True, names=pop_names):
  if positive == True:
    return df.loc[list(set(names).intersection(set(df.index)))].mean(axis = 1, skipna=True).sort_values()
  else:
    return (df.loc[list(set(names).intersection(set(df.index)))].mean(axis = 1, skipna=True)*- 1.).sort_values()


rank_names(user_df, user_positive, user_names).tail(50) 
#try it yourself, leaving user_positive unchecked flips the values multiplying them by -1
#scoring negative dataframes this way ensures the "least bad names" will make their way to the tail (the top)
#rank_names gives list of names that "are best" in given context, the higher score the better

Series([], dtype: float64)

In [76]:
overall_scores_words = pd.concat((positive_scores_from_words * 1/((len(positive_scores_from_words.columns)/len(negative_scores_from_words.columns))), negative_scores_from_words* -1.), axis =1) 
# because there is many more negative words than positive ones, I'm scaling positive scores by 1/(number of positive columns/number of negative columns)
overall_scores_adj = pd.concat((positive_scores_from_adj * (len(negative_scores_from_adj.columns)/len(positive_scores_from_adj.columns)), negative_scores_from_adj * -1), axis=1)

In [77]:
rank_names(overall_scores_adj).tail(30)

#"best balanced", "with least powerful downside" names appear to be about 80% female names, overall "non-traditional" names
#conversly, changing "names" parameter of the function to all_names yields about 30% of "traditional" names


Series([], dtype: float64)

In [78]:
def rank_features_of_a_name(name, df):
  return df.sort_values(by = name, axis = 1).loc[name]

rank_features_of_a_name("adolf", negative_scores_from_words).tail(40)

benzyna           0.222444
ostrzezenie       0.233749
studnia           0.234231
tunel             0.234584
wystepowac        0.237473
sedzia            0.246977
dreszcz           0.250389
zaklad            0.252409
posiedzenie       0.255715
lancuch           0.256237
zaprzeczac        0.270997
odpady            0.275062
wdowa             0.275090
dol               0.275191
przepasc          0.277833
zoladek           0.280786
pluc              0.284764
propozycja        0.284802
pila              0.287025
bariera           0.287538
niespokojny       0.292116
rakieta           0.303604
krawedz           0.304331
napiecie          0.304435
lek               0.319531
armia             0.333853
funkcjonariusz    0.334371
szczecina         0.343662
glosny            0.343888
amunicja          0.346355
lza               0.356335
policyjny         0.358815
agent             0.375063
kopalnia          0.393088
mikrofon          0.399019
lawina            0.415727
daleki            0.420135
p

In [79]:
def rank_names_by_word(word, df):
  return df.sort_values(by = word)[[word]]

rank_names_by_word("inteligentny", positive_scores_from_adj)

Unnamed: 0,inteligentny
maja,-0.193539
emila,-0.174965
mariusz,-0.162603
slawomir,-0.158105
slawomir,-0.158105
...,...
kasia,0.368313
teodora,0.382183
kinga,0.390986
wiktoria,0.395117


In [80]:
def rank_names(df, positive=True):
  if positive == True:
    return df.mean(axis = 1, skipna=True).sort_values()
  else:
    return (df.mean(axis = 1, skipna=True)*- 1.).sort_values()

rank_names(positive_scores_from_adj).tail(50) #50 names that score the highest on average on positive adjectives

bogumil       0.195868
franciszek    0.196553
aniela        0.197043
waldemara     0.197170
mariola       0.199307
karolina      0.200099
teodor        0.201775
feliks        0.202681
tomek         0.207587
kinga         0.208709
krystian      0.208965
wiktor        0.209539
radosc        0.213986
iza           0.214176
bernard       0.214772
roma          0.215044
eleonora      0.215149
august        0.217011
jurek         0.218027
slawa         0.218496
jagienka      0.219044
weronika      0.221347
teofil        0.221398
walentyna     0.221838
harry         0.222615
ela           0.223590
paulina       0.224483
hania         0.226854
peter         0.227600
zygmunt       0.229687
bozena        0.230188
dominik       0.231068
waldemar      0.232323
klara         0.232966
bogumila      0.240056
albert        0.240187
natalia       0.240845
agata         0.241961
joachim       0.242817
grazyna       0.244121
ewelina       0.246784
alfreda       0.247279
igor          0.248177
laura      

In [81]:
rank_features_of_a_name("mateusz", positive_scores_from_words).tail(40)

raj            0.127373
wycieczka      0.130996
medal          0.133621
komfort        0.134496
przyjaciel     0.137999
wspierac       0.139397
spac           0.140038
sliczny        0.143835
ladny          0.143981
sport          0.148676
lato           0.150071
radosny        0.161353
smiech         0.165649
spokojny       0.167351
tort           0.177643
wiosna         0.179456
wierny         0.187829
spokoj         0.188713
czysty         0.194548
slonce         0.196945
zdrowy         0.198441
wygodny        0.203467
jesc           0.219418
przyjemnosc    0.219580
wesoly         0.220694
muzyka         0.222124
cud            0.226112
biust          0.234330
zadowolony     0.266448
mily           0.273808
talent         0.276817
hobby          0.279429
humor          0.288328
sloneczny      0.303760
mama           0.305648
szczesliwy     0.318839
zart           0.331795
lubic          0.338042
kochany        0.351850
poduszka       0.371697
Name: mateusz, dtype: float64

In [82]:
full_scores_words = positive_scores_from_words.merge(negative_scores_from_words*-1.0, left_index=True, right_index=True)
full_scores_words = full_scores_words[~full_scores_words.index.duplicated(keep='first')]
rank_names(full_scores_words).tail(40) #40 "on average" best names, or maybe with the smallest downside?

mateusz      -0.005714
albert       -0.005221
malgorzata   -0.005087
roza         -0.005073
alina        -0.004937
nadzieja     -0.004834
eliza        -0.004769
zbigniewa    -0.004155
joanna       -0.004106
halina       -0.003781
milena       -0.003622
patrycja     -0.003448
wanda        -0.003374
karol        -0.002922
ferdynand    -0.002845
marian       -0.002586
wladimir     -0.002403
teresa       -0.002357
krzysiek     -0.002198
natalia      -0.001807
oktawia      -0.001368
pawla        -0.001217
marcin        0.000059
ala           0.000287
wiktoria      0.000514
karolina      0.000665
jadwiga       0.001359
milosza       0.001729
magda         0.002104
emanuel       0.002793
renata        0.003431
monika        0.003546
rafala        0.004179
basia         0.005904
kasia         0.008540
ela           0.008698
bartek        0.010308
kacper        0.011048
lucja         0.012025
marianna      0.014053
dtype: float64

In [83]:
df[:10]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127
,-0.024837,-0.036732,0.012414,0.016822,0.005448,-0.018245,-0.028301,0.033438,-0.006737,0.046584,-0.044924,0.009454,0.037186,0.037119,0.006153,0.045112,-0.016365,0.041697,0.043318,-0.003379,-0.018415,-0.004722,-0.042084,-0.017432,0.003908,-0.02334,-0.02551,0.036963,-0.015766,0.007695,0.048682,-0.018719,0.024196,-0.001895,-0.02006,-0.021601,-0.037972,0.039309,-0.027575,0.030115,...,0.047388,0.000687,-0.027133,0.020897,-0.0496,-0.038707,-0.021299,-0.025382,-0.003328,-0.045998,-0.002449,0.017047,0.016116,-0.011192,0.030194,0.016583,0.022205,0.004009,0.016263,-0.002267,-0.047626,-0.024924,-0.036637,0.032473,-0.00782,0.004478,-0.023541,0.01813,0.048227,-0.020017,-0.014868,0.022978,0.016364,-0.00541,-0.034334,-0.000721,-0.016032,0.014276,0.038685,-0.048515
[UNK],-0.064359,-0.047015,-0.004526,-0.180534,-0.12383,0.065138,-0.273333,-0.106018,0.035714,0.39299,0.197673,0.251411,0.029346,0.209724,-0.392422,0.504875,0.110454,0.373819,-0.101998,0.035423,-0.18629,0.339091,0.2155,0.0372,0.1079,-0.154004,0.016064,-0.052432,0.243302,-0.209174,0.050753,0.188743,0.037367,0.064409,-0.19846,-0.054484,0.230221,0.171321,-0.188478,0.024947,...,0.082984,-0.398212,-0.152714,-0.205721,0.019334,0.062322,-0.189677,-0.11233,-0.012367,0.083114,0.062711,-0.050594,0.394863,-0.173089,0.051981,0.41168,-0.457209,-0.092347,0.240177,0.2848,0.308165,-0.188713,-0.156145,-0.273702,-0.188762,0.065544,0.221635,-0.019786,-0.179682,0.057393,-0.281604,-0.134375,0.34726,0.234844,-0.520911,0.0308,-0.063385,0.123444,-0.006586,0.073736
w,0.056266,-0.045822,0.395911,-0.038012,-0.118911,-0.049477,-0.204109,-0.132256,0.209648,-0.098129,-0.068143,-0.175674,-0.108409,0.218563,-0.060869,-0.325691,0.227364,0.021227,-0.010059,-0.112198,-0.036145,0.101465,0.148372,0.062876,0.103714,-0.136108,-0.232315,0.126297,-0.191143,-0.535246,-0.129462,0.220678,-0.21407,-0.011959,0.045341,0.209881,-0.272174,0.077772,0.023669,-0.287892,...,0.153236,-0.217081,0.108446,-0.17073,0.097523,-0.005537,-0.421822,0.229934,-0.216907,-0.363507,-0.001881,0.074098,-0.153773,-0.144519,-0.118475,0.381971,-0.011843,-0.257297,-0.310013,-0.131974,-0.073667,0.073014,0.011014,-0.141172,0.352729,-0.249161,0.316622,-0.071495,-0.217206,0.004697,-0.07511,-0.125751,0.296172,0.091784,-0.072615,-0.182985,-0.041949,0.190219,-0.04872,-0.020745
i,-0.098896,0.054621,-0.201118,0.108995,-0.074785,0.378209,-0.011638,-0.129472,0.048733,0.227663,-0.252854,-0.326302,0.21077,0.403392,-0.240531,-0.192813,-0.075281,-0.170519,-0.102864,0.233133,-0.246156,0.037031,-0.184695,0.355848,0.038112,-0.230394,-0.163109,-0.183815,0.188815,-0.029593,-0.021575,0.126311,-0.096236,-0.231859,0.027623,0.068245,-0.234732,-0.149734,0.064883,0.134109,...,0.330849,-0.146145,0.165825,-0.198414,0.095211,-0.094334,0.117748,0.05613,0.129842,-0.158052,-0.112068,0.050612,0.357501,-0.212121,0.034596,0.179489,-0.113229,-0.435815,0.247262,-0.068847,0.163874,-0.146734,-0.136601,0.037481,0.295391,0.036213,-0.118212,0.152301,-0.048234,-0.10155,-0.034617,-0.086828,-0.102655,0.124136,0.163484,0.142503,0.211788,0.0435,-0.033191,-0.237854
sie,-0.070718,-0.002241,-0.047012,-0.371792,-0.083358,0.096385,0.244463,0.003952,-0.076309,-0.030033,-0.109702,-0.505827,0.146561,0.135615,-0.32896,-0.005799,0.210233,-0.153766,0.116432,-0.074843,-0.190284,0.265984,-0.043114,0.184675,0.079656,0.147328,-0.210179,-0.207956,0.12858,-0.021809,0.081989,-0.090329,0.051729,0.068627,0.131462,-0.191967,-0.019155,0.070678,0.041561,-0.072503,...,-0.012833,-0.359638,0.100713,0.030499,0.009069,0.340664,-0.08964,0.01081,-0.238951,0.131268,0.108168,0.025912,-0.196329,-0.113024,0.192865,0.023198,-0.265593,-0.083724,0.027461,0.134425,-0.252683,-0.061862,-0.325193,-0.384301,0.029241,0.17992,0.467372,-0.205286,-0.333561,0.03155,-0.022287,0.045707,-0.284012,0.285857,-0.099026,-0.044976,-0.083772,0.401077,0.271711,0.027083
na,0.094989,-0.138435,0.268268,-0.133411,0.005595,-0.045381,0.302277,0.065705,-0.149646,0.207797,0.41832,0.069702,-0.236195,0.00031,-0.026303,-0.027301,-0.195893,0.013216,-0.210038,-0.456683,-0.174639,0.277412,-0.383787,0.275872,0.159973,0.132721,0.247778,0.056552,0.072145,0.369459,-0.178587,0.413389,0.323408,-0.1375,-0.20786,0.249293,-0.058914,0.036899,0.226499,-0.114159,...,0.141564,-0.350961,0.003885,0.11657,-0.112948,0.043836,0.125023,0.221218,-0.065547,0.266148,0.096736,-0.050871,-0.033972,-0.170005,-0.098052,-0.019353,-0.143551,-0.074583,0.008292,0.210233,0.056219,0.052335,-0.14636,-0.050764,0.215547,-0.109305,0.076133,-0.040316,0.216177,-0.115203,-0.146856,0.193346,-0.172362,0.215984,0.116352,0.053128,0.185232,-0.024301,0.30127,0.108475
nie,-0.016828,0.080407,-0.064766,0.24201,-0.202142,0.055977,-0.211779,-0.30857,0.025311,-0.06389,-0.078602,-0.219529,0.125917,-0.016703,0.343509,-0.110908,0.137495,-0.257601,-0.05223,0.13343,0.061872,-0.145657,0.005206,-0.143927,0.260739,0.132336,-0.121228,-0.073217,0.19433,0.208707,0.021924,0.249826,0.202913,0.139853,-0.147759,-0.423485,-0.102136,0.160104,0.153297,0.090204,...,0.07749,-0.215175,0.194701,0.128733,-0.047145,-0.04631,-0.207536,0.165743,-0.03316,-0.056121,-0.138126,-0.197391,-0.126295,0.072731,-0.196428,0.244897,-0.006697,-0.439443,0.296264,0.101855,-0.0649,-0.272586,0.027974,-0.159216,0.07781,0.21916,-0.167731,0.257529,0.162152,-0.34926,0.061351,-0.208306,-0.103432,-0.041028,-0.304114,-0.334979,0.046871,-0.306897,0.117508,-0.12677
z,-0.20084,-0.097292,0.582133,-0.343571,0.072971,-0.104237,-0.018696,-0.095806,-0.037956,0.078828,-0.121294,-0.363225,-0.109516,0.013249,0.052822,-0.177849,0.092145,-0.116198,-0.089622,0.011578,-0.153903,0.1053,0.186985,0.213728,-0.224319,0.096104,0.043713,-0.522521,-0.409028,-0.436095,0.003052,-0.26029,-0.115179,0.084519,0.061224,0.171878,0.029656,-0.476242,0.087381,0.072059,...,-0.217078,-0.202879,0.014725,-0.08489,-0.108245,0.092486,-0.39392,-0.006321,0.166764,-0.148751,-0.41294,-0.040359,-0.107552,-0.32542,-0.366268,0.065154,0.184371,-0.242825,0.119193,0.107811,-0.167147,0.031072,-0.150131,-0.291759,-0.33393,-0.209495,0.003205,0.258409,0.203966,0.296849,-0.191301,0.111497,0.336033,0.149509,-0.535386,-0.113001,0.19272,0.351019,0.008984,-0.098384
do,0.184176,0.258459,-0.167279,-0.178438,-0.085894,0.249364,-0.080821,0.317001,0.091114,0.16265,-0.090484,0.05347,0.157851,-0.294134,0.141775,0.248727,-0.171279,-0.035977,0.007758,0.148924,-0.241007,-0.247883,-0.003103,0.216072,0.256303,-0.222622,0.06543,-0.262108,-0.419506,-0.355705,0.287099,0.189843,-0.248175,0.344646,0.331359,-0.186142,-0.024074,-0.122,-0.133515,0.129806,...,0.213478,-0.510763,-0.047727,0.084877,0.06394,0.132011,-0.498584,-0.188604,0.261172,-0.157516,-0.046866,-0.121181,-0.162406,-0.04905,-0.096664,0.147041,-0.545862,-0.128476,-0.143182,0.237387,0.164801,0.055793,-0.343227,-0.057342,-0.069825,-0.186813,-0.045051,0.131058,-0.235888,0.046444,-0.188131,-0.20917,0.169586,0.10391,0.096414,0.170154,-0.147615,-0.000854,-0.00648,0.058288
to,-0.026749,0.195447,-0.180041,-0.172786,-0.176641,-0.061498,-0.10689,0.067505,-0.029466,-0.089735,-0.170702,0.009715,0.228167,-0.235104,-0.023694,-0.061049,-0.03445,-0.307337,0.114359,-0.23037,0.196238,0.139627,0.202034,-0.029819,0.21449,0.190586,0.060001,0.142347,-0.149775,0.104574,-0.183094,-0.078349,0.123903,0.162021,-0.024742,-0.110994,-0.217437,0.336977,-0.076563,-0.145794,...,-0.000865,-0.092719,-0.062991,0.210581,0.147764,-0.106518,0.007073,0.074407,0.081801,0.092644,-0.26209,-0.081203,0.052523,0.346497,-0.265002,0.056847,-0.152249,-0.232236,-0.130962,0.206671,0.089821,-0.250786,-0.189257,0.040382,0.356003,0.235183,-0.145561,-0.225651,-0.088966,0.145093,0.005613,-0.11941,0.04835,-0.079429,-0.158255,0.292353,-0.029878,0.167054,-0.107909,0.054254


In [84]:
full_scores_adj = positive_scores_from_adj.merge(negative_scores_from_adj*-1.0, left_index=True, right_index=True)
full_scores_adj = full_scores_adj[~full_scores_adj.index.duplicated(keep='first')] #reduce to adjectives only
rank_names(full_scores_adj).tail(40)

grazyna       0.025895
aleksandra    0.026379
ludwik        0.027018
kinga         0.027031
milosza       0.027153
bogdan        0.027433
ferdynand     0.027474
janusz        0.027818
bernard       0.028023
tomasz        0.028096
nina          0.028529
wladyslaw     0.028794
alojzy        0.029371
wladimir      0.029897
bogdana       0.030196
wieslaw       0.030319
malgorzata    0.030492
marian        0.030626
irena         0.030676
miroslawa     0.031775
teresa        0.032230
beata         0.032247
eugeniusz     0.032328
robert        0.032834
henryka       0.033608
zdzislawa     0.034293
waldemar      0.034776
ewa           0.035111
zdzislaw      0.035353
wladyslawa    0.035593
walentyna     0.036748
rafala        0.038108
jadwiga       0.038487
miroslaw      0.038700
bartek        0.040234
marianna      0.041440
zenona        0.045360
zbigniewa     0.051169
albert        0.053609
lucja         0.059532
dtype: float64