# Don't Go Breaking my Heart: Bayesian/GMM clustering of potential romantic partners

This project showcases how to combine a Bayesian evidential approach to BERT embeddings with a Gaussian mixture model (GMM) to cluster individuals by the cultural views encoded in their language.

#### Basic steps
1. Restrict the data to users who are female, gay, and in Berkeley.
2. Select our target person at random, and a person with whom they had a bad experience, also at random.
3. Take one of the essays for both people and pull out all NNs (nouns) in the essay. Use these NNs as $w'$ in other essays.
4. Use each NN as a $w'$ to search profiles with.
5. Generate evidence scores.
6. Use GMM to cluster users into those close to the target person and those close to the bad date.
7. Select the people in the middle (because they have maximal distance from the subject and minimal proximity to the bad date).

#### Alternative.
I'm basing this version on Hannah Fry's TEDx talk, The Mathematics of Love: https://www.ted.com/talks/hannah_fry_the_mathematics_of_love/transcript?language=en

Here's how we implement her select-the-3rd-person logic.

1. Restrict the data to users who are female, gay, and in Berkeley.
2. Select our target person at random, and 3 people with whom they had experiences, also at random.
3. Take one of the essays for all 3 people and pull out all NNs (nouns) in the essay. Use these NNs as $w'$ in other essays.
4. Now, use $P(R|u) \sim \sum P(E_{u, w'}|E_{k,w'})$, where $u$ is the unknown users, $k$ is the known set of four, and $w'$ are the search terms we pulled from $k$'s profiles.
5. Use GMM to cluster the data into two groups: yes and no.
6. Reduce our 4 axes to 2 dimensions using PCA.
7. Draw a joint KDE plot for visualization.

##### Notes:
Okay. The pre-sampled set looks pretty good on essay 1 and essay 4. I think I'll go with essay 1.
Search terms I want to include here:
- 'grad'
- 'career'
- 'job'
- 'travel'
- 'human rights'
- 'time'

If I go with essay 4 instead, just use:
- 'book'
- 'movies'
- 'music'
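
One thing to watch with short search terms like these: a plain substring check will also fire on 'time' inside 'sometimes'. Here's a minimal sketch of a whole-word match using the standard `re` module. The helper name `contains_term` and the example essay are made up for illustration, and whether whole-word matching is actually what we want for this project is an open choice, not something settled.

```python
import re

def contains_term(term: str, text: str) -> bool:
    """True if `term` occurs in `text` as a whole word or phrase (case-insensitive).
    Hypothetical helper, not part of the project code."""
    pattern = r'\b' + re.escape(term) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Made-up essay text for illustration only.
essay = "Sometimes I think about my career, and I care about human rights."
terms = ['grad', 'career', 'job', 'travel', 'human rights', 'time']

hits = [w for w in terms if contains_term(w, essay)]
print(hits)  # ['career', 'human rights']
```

Note that 'time' is a substring of this essay (via 'Sometimes') but is not reported as a hit, which is the difference from a plain `w in text.lower()` check.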



In [None]:
# initial samples and data constructor
import pandas as pd
import numpy as np
import spacy
nlp = spacy.load('en_core_web_trf')

df = pd.read_csv('datingapp/data/okcupid_profiles.csv')

samples = df.loc[df['orientation'].isin(['gay']) & df['sex'].isin(['f']) & df['location'].isin(['berkeley, california'])].index.tolist()
# sampled = np.random.choice(samples, size=(4,), replace=False)
sampled = np.array([17148, 49387, 18574,  5060])

w_ = ['grad', 'career', 'job', 'human rights', 'time', 'science', 'liberation movement', 'jobs']
# Build one row per (profile, search term) pair where the term appears in essay1.
data = []
for i in samples:
    text = str(df['essay1'].loc[i])
    data += [[i, w, text] for w in w_ if w in text.lower()]

# Save the population that the embedding step reads back in.
data = pd.DataFrame(data, columns=['id', 'w', 'text'])
data.to_csv('datingapp/data/population.csv', index=False, encoding='utf-8')

### Embedded Representations of search terms in context

This whole script is offloaded to a remote server for efficiency (my MacBook does not have a great graphics card for pumping out BERT representations and searching through them for the right one). In terms of implementation, we convert the entire text $t$ to BERT embeddings, and then keep only those embeddings whose tokens equal our search term $w'$.

$$ E_{t,w'} = \mathrm{BERT}(t)\,\delta_{w=w'} $$
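
To make the $\delta_{w=w'}$ selection concrete, here's a toy sketch in NumPy. The "embeddings" are just a small made-up array standing in for BERT output, and `select_term_vectors` is a hypothetical helper, not the `nC` function used in this project. It also assumes the search term is a single token, which will not hold for phrases like 'human rights' or for words that a subword tokenizer splits up.

```python
import numpy as np

# Stand-in for BERT output: one 3-d vector per token (real BERT vectors are much larger).
tokens = ['my', 'job', 'is', 'a', 'fun', 'job']
embeddings = np.arange(len(tokens) * 3, dtype=float).reshape(len(tokens), 3)

def select_term_vectors(tokens, embeddings, term):
    """Keep only the rows whose token equals the search term (the delta in the formula).
    Hypothetical helper for illustration."""
    mask = np.array([tok == term for tok in tokens])
    return embeddings[mask]

E = select_term_vectors(tokens, embeddings, 'job')
print(E.shape)  # (2, 3): 'job' occurs twice
```

A term that never appears simply yields an empty `(0, 3)` array, so downstream code has to be ready for profiles that contribute no vectors.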

In [None]:
import pandas as pd
import numpy as np
from kgen2.BERTcruncher.context_vecs import *

PATH = DPATH + 'datingapp/'
df = pd.read_csv(PATH + 'population.csv')

print(PATH, len(df))

level=0

#(1) Set up .csv file for data repo
vec_data = pd.DataFrame(columns=['id', 'w', 'vec', 'text'])
vec_data.to_csv(PATH+'okc-vecs.csv', index=False, encoding='utf-8')

#(2) Generate embeddings with appropriate metadata
for id,w,TEXT in df.values:
    try:
        vecs = nC(str(w),str(TEXT),layer_no=level)
        update = [[id, str(w), str(vec.view(-1).tolist()), str(TEXT)] for vec in vecs]
        update = pd.DataFrame(np.array(update), columns=list(vec_data))
        update.to_csv(PATH +'okc-vecs.csv', index=False, encoding='utf-8', header=False, mode='a')
    except (ValueError, IndexError):
        # Skip rows that could not be embedded and move on to the next one.
        continue

### Evidentiality Score

At the heart of this project is the question of how similar potential new dates are to the people our fictional protagonist has already dated. To answer it, we'll rely on what is basically the weighted evidence that two people are related to one another or part of the same group. Mathematically this looks as follows, where $j$ is the $j^{th}$ previously dated person, and $u$ is the $u^{th}$ new, undated user of OKC:

$$ P(j|u) = \frac{1}{k_{u}} \sum_{i,j} \frac{P(E_{u,w'_i}|E_{j_n,w'_i})}{P(j_n,w'_i)} \delta_{j_n \in j} $$

Let me unpack this. For each term $w'_i$ and each entity $j_n$ in our set of known, previous dates $j$, we normalize the evidence. We then sum across all of these normalized instances and renormalize across $u$ to get a probability that $u$ and $j$ are similar.
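
To make the normalize, sum, and pool steps concrete, here's a toy sketch in NumPy. The Gaussian kernel is purely an assumption standing in for the similarity function (it is not the definition of `probFn`), and every vector, id, and term below is made up. The sketch normalizes each known instance over the unknown instances and then pools instances by person; it does not include the $1/k_u$ factor from the formula.

```python
import numpy as np

def gaussian_kernel(U, K, sigma=1.0):
    """Unnormalised similarity between every row of U and every row of K.
    Assumed kernel for illustration only."""
    sq_dist = ((U[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2 * sigma ** 2))

# Toy context vectors: 3 instances from unknown users, 2 from known dates.
U_vecs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
U_ids = np.array([10, 10, 11])            # user 10 contributes two instances
U_terms = np.array(['job', 'time', 'job'])
K_vecs = np.array([[0.0, 0.1], [1.0, 0.1]])
K_ids = np.array([1, 2])
K_terms = np.array(['job', 'time'])

P = gaussian_kernel(U_vecs, K_vecs)
P = P * (U_terms[:, None] == K_terms[None, :])   # only compare w'_i with w'_i
P = P / P.sum(axis=0, keepdims=True)             # normalise each known instance over u

# Pool instances: columns by known date, rows by unknown user.
known = np.unique(K_ids)
unknown = np.unique(U_ids)
P = np.stack([P[:, K_ids == j].sum(axis=1) for j in known], axis=1)
P = np.stack([P[U_ids == u].sum(axis=0) for u in unknown], axis=0)
print(P.shape)  # (2, 2): unknown users x known dates
```

Each column sums to 1, so a column can be read as how the evidence for one known date is shared out among the unknown users.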

In [None]:
import numpy as np
import pandas as pd
import torch
from datingapp.mod.sim_matrix import *

df = pd.read_csv('datingapp/data/okc-vecs.csv')

sampled = np.array([17148, 49387, 18574,  5060])

#We'll use these here to help us use the sel() function
# to pick correct columns and rows as needed.
sampled_df = df.loc[df['id'].isin(sampled)]
non_sampled_df = df.loc[~df['id'].isin(sampled)]

#Parse the stringified vectors back into a float tensor
# (np.float was removed from NumPy, so use the builtin float).
V = torch.FloatTensor([[float(i) for i in vec.replace('[', '').replace(']', '').split(', ')] for vec in df['vec'].values])

pi = probFn(.1)
P = pi.PROB(V[~sel(sampled,df['id'].values)], V[sel(sampled,df['id'].values)])

#Creates a 2-D mask of shape (w' x Elements)
mW1 = torch.cat([torch.FloatTensor(sel([w_], non_sampled_df['w'].values)).view(1,-1) for w_ in df['w'].unique()])
mW2 = torch.cat([torch.FloatTensor(sel([w_], sampled_df['w'].values)).view(1,-1) for w_ in df['w'].unique()])

#The objective with these masks is to limit our analysis
# to only those instances in which we're comparing
# our search term w'i to w'i from our data. This is
# just an efficient, linear algebra way of doing this:
# the product of the two masks, summed over w', is 1
# exactly where an unknown instance and a known instance
# share the same search term, and 0 elsewhere.
same_term = (mW1.unsqueeze(-1) * mW2.unsqueeze(1)).sum(dim=0)
P = P * same_term

#Normalize each known instance (column) across unknown instances.
P = P/P.sum(dim=0)

P = torch.cat([P[:,sel([j], sampled_df['id'].values)].sum(dim=-1).view(-1,1) for j in sampled], dim=-1)
P = torch.cat([P[sel([u], non_sampled_df['id'].values)].sum(dim=0).view(1,-1) for u in non_sampled_df['id'].unique()], dim=0)


This will give us the evidence in favor of two people being similar to one another, based on the semantic contexts surrounding the key interests of our protagonist. Now we want to find, among all the remaining people, who might be a good partner. To do this, we'll use a GMM to create three clusters. We'll pick people from the cluster that includes neither the bad date nor the protagonist. We want some fun differences between us and our partners after all :).
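
Here's one possible way to turn "the cluster that includes neither" into code, sketched in NumPy. Everything in it is an assumption for illustration: I'm reading "includes" as "is, on average, most similar to", the column positions of the protagonist and the bad date are hypothetical, and the toy matrix and labels are made up. `candidate_clusters` is a hypothetical helper, not part of `gmm_alt`.

```python
import numpy as np

def candidate_clusters(P, labels, protagonist_col=0, bad_date_col=1):
    """Return cluster ids that are neither the most protagonist-like
    nor the most bad-date-like, judged by mean evidence per cluster.
    Column indices are assumptions about how P is laid out."""
    ids = np.unique(labels)
    means = np.stack([P[labels == c].mean(axis=0) for c in ids])  # (clusters, known people)
    like_me = ids[means[:, protagonist_col].argmax()]
    like_bad = ids[means[:, bad_date_col].argmax()]
    return [int(c) for c in ids if c not in (like_me, like_bad)]

# Toy evidence matrix: 6 unknown users x 2 known people, with made-up labels.
P_toy = np.array([[0.9, 0.1], [0.8, 0.2],
                  [0.1, 0.9], [0.2, 0.8],
                  [0.4, 0.4], [0.5, 0.5]])
labels_toy = np.array([0, 0, 1, 1, 2, 2])

print(candidate_clusters(P_toy, labels_toy))  # [2]
```

If the same cluster happens to be closest to both the protagonist and the bad date, this returns two clusters instead of one, so the result still needs a sanity check by eye.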

In [None]:
from datingapp.mod.gmm_alt import *

GMM = GaussianMixture(n_components=3, n_features=4)
GMM.fit(P)
labels = GMM.predict(P)

From here, we can reduce the dimensionality of the data (using PCA) and plot it in the usual way.

In [1]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(P.numpy())

# One row per unknown user: id, 2-D PCA coordinates, and GMM cluster label.
dfi = pd.DataFrame()
dfi['id'] = non_sampled_df['id'].unique()
dfi[['x0', 'x1']] = pca.transform(P.numpy()) * 4
dfi['l'] = labels.view(-1).numpy()

import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')

sns.kdeplot(data=dfi, x='x0', y='x1', hue='l')
plt.show()


