In [2]:
#dependencies
import numpy as np
import pandas as pd

import random

import seaborn as sns

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
from sklearn.feature_extraction.text import TfidfVectorizer

import string
import re

import gensim
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, preprocess_string, stem_text
from gensim.models import Word2Vec

from tqdm import tqdm
tqdm.pandas()

import spacy
from collections import Counter

[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


# Text preprocessing

In [3]:
#importing the dataset
essays = pd.read_csv('essays.csv')

essay0- My self summary \
essay1- What I’m doing with my life \
essay2- I’m really good at \
essay3- The first thing people usually notice about me \
essay4- Favorite books, movies, show, music, and food \
essay5- The six things I could never do without \
essay6- I spend a lot of time thinking about \
essay7- On a typical Friday night I am \
essay8- The most private thing I am willing to admit \
essay9- You should message me if...

In [4]:
essays = essays.fillna('.')

In [36]:
clean_df = pd.DataFrame()
clean_df["id"] = essays.index
# set 'essay0','essay1','essay2','essay5','essay7' as description
des_col = ['essay0','essay1','essay2','essay5','essay7']
clean_df['describe'] = essays[des_col].apply(lambda x: ' '.join(x.astype(str)), axis=1)
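# set 'essay9' as expectation
clean_df['expect'] = essays['essay9']

# drop rows where all five description essays are missing, or essay9 is missing
# (missing values were filled with '.', so an empty description is five
# placeholders joined by spaces)
empty_describe = ' '.join(['.'] * len(des_col))
clean_df = clean_df[~(clean_df['describe'] == empty_describe)]
clean_df = clean_df[~(clean_df['expect'] == ".")]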

Data: The dataset contains 59,946 rows, each with 10 essay columns (essay0 to essay9).

Feature Selection (Column Processing): 

Since interests are used as classification criteria, we concatenated essay0, essay1, essay2, essay5, and essay7 to form a comprehensive user description while excluding essay4 (Favorite books, movies, shows, music, and food) to prevent potential noise caused by the diverse range of interests listed in this section. Additionally, essay9 was selected to represent each user’s expectations for a potential match. The user ID was also retained.

Data Cleaning (Row Processing): 

Rows in which all five descriptive essays (essay0, essay1, essay2, essay5, essay7) are missing, or in which essay9 is missing, were removed, leaving 47,335 rows. To test different methods on a smaller scale, we drew a random sample of 3,000 users without replacement.
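The row filter can be illustrated on a toy input. This is a minimal sketch in plain Python, assuming missing essays are filled with the '.' placeholder as in the cell above; the three toy profiles and the helper names `build_row` and `keep` are hypothetical.

```python
PLACEHOLDER = "."
DES_COLS = ["essay0", "essay1", "essay2", "essay5", "essay7"]

def build_row(raw):
    """Fill missing essays with the placeholder and build describe/expect."""
    filled = {c: (raw.get(c) or PLACEHOLDER) for c in DES_COLS + ["essay9"]}
    describe = " ".join(filled[c] for c in DES_COLS)
    return {"describe": describe, "expect": filled["essay9"]}

# An all-missing description is five placeholders joined by spaces.
EMPTY_DESCRIBE = " ".join([PLACEHOLDER] * len(DES_COLS))

def keep(row):
    """Keep a row only if it has some description and an expectation."""
    return row["describe"] != EMPTY_DESCRIBE and row["expect"] != PLACEHOLDER

toy = [
    {"essay0": "i like hiking", "essay9": "you like dogs"},  # kept
    {"essay9": "you like dogs"},                             # no description
    {"essay0": "i like hiking"},                             # no expectation
]
rows = [build_row(r) for r in toy]
kept = [r for r in rows if keep(r)]
print(repr(EMPTY_DESCRIBE), len(kept))
```

Note that the joined placeholder contains spaces, so the comparison string has to be built the same way as the describe column.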

Data Transformation: 

We applied spaCy’s en_core_web_sm model to the user descriptions (describe), performing tokenization, punctuation and stop-word removal, and lemmatization to produce describe_ls. The same preprocessing steps were applied to expect to produce expect_ls.

In [6]:
#draw a reproducible random sample of 3,000 users
sample = clean_df.sample(n=3000, replace=False, random_state=42).reset_index()

In [7]:
nlp = spacy.load('en_core_web_sm')

def process_text(text, extra_list):

    # Load stopwords
    all_stopwords = nlp.Defaults.stop_words.union(set(extra_list))
    
    # Tokenization using Spacy
    doc = nlp(text)
    
    # Process tokens
    tokens_clean = [
        (token.lemma_[:-4] if token.lemma_.endswith('.<br') else token.lemma_)
        for token in doc
        if token.text.lower() not in all_stopwords and not token.is_punct
    ]
    
    return tokens_clean

In [8]:
extra_list = ['\n','<','>','br','.','<br']

sample['describe_ls'] = sample['describe'].progress_apply(lambda x: process_text(x, extra_list))
sample['expect_ls'] = sample['expect'].progress_apply(lambda x: process_text(x, extra_list))

100%|██████████| 3000/3000 [01:12<00:00, 41.56it/s]
100%|██████████| 3000/3000 [00:14<00:00, 213.81it/s]


In [9]:
#sample.to_csv('sample_less.csv')

# Contextualized Weak Supervision

Logistics:

We first divide both the user descriptions (describe) and expectations (expect) into 10 groups. Each user will be matched with others whose describe label aligns with their expect label.
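The matching rule can be sketched as follows. This is only an illustration under assumed data: the three toy users and the helper `candidates_for` are hypothetical, and the label strings are borrowed from the categories that appear later in the notebook.

```python
# Toy users: each has a label for who they are and a label for who they want.
users = {
    1: {"describe_label": "Foodie/Culinary", "expect_label": "Adventurous Outdoorsy"},
    2: {"describe_label": "Adventurous Outdoorsy", "expect_label": "Foodie/Culinary"},
    3: {"describe_label": "Adventurous Outdoorsy", "expect_label": "Creative/Artistic"},
}

def candidates_for(user_id, users):
    """Return ids of other users whose describe label equals this user's expect label."""
    wanted = users[user_id]["expect_label"]
    return [
        uid for uid, u in users.items()
        if uid != user_id and u["describe_label"] == wanted
    ]

for uid in users:
    print(uid, candidates_for(uid, users))
```

In this sketch a match is one-directional: user 3 is a candidate for user 1 even though user 1 does not fit what user 3 is looking for.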

Method: 

Because no labeled data are available, we use a minimally supervised approach: we manually define 10 categories, each associated with a list of either 3 or 10 seed words, and rely only on the text of the documents. State-of-the-art methods such as WeSTClass and ConWea require significant computational resources, so we adopt a simpler approximation instead: each document is assigned to the category whose seed words are most similar to its text.

To ensure that the categories are both mutually independent and representative, as well as that the seed words within each category are meaningful, we generate them using the following structured dialogue with ChatGPT: https://chatgpt.com/share/e/67d609c4-6ed4-800e-ad6b-921d13f6250d

In [10]:
# sample = pd.read_csv('sample_less.csv')
# sample['describe_ls'] = sample['describe_ls'].apply(lambda x: x[2:-2].split("', '"))
# sample['expect_ls'] = sample['expect_ls'].apply(lambda x: x[2:-2].split("', '"))
# del sample['Unnamed: 0']
# del sample['index']

In [11]:
cat_df = pd.read_csv('dating_categories.csv')
cat_df['Seed Words'] = cat_df['Seed Words'].apply(lambda x: x.split(', '))

In [13]:
cat_df_10 = pd.read_csv('dating_categories_updated.csv')
cat_df_10['Seed Words'] = cat_df_10['Seed Words'].apply(lambda x: x[2:-2].split("', '"))

## Algorithm 1: String Matching

We assume that if certain keywords appear frequently in a document, the document is likely to belong to the corresponding category. Based on this assumption, we calculate the frequency of seed words from different categories appearing in each user’s describe section and assign the label of the category with the highest frequency (Figure 1).

Initially, we determined labels solely based on the total frequency of all seed words. However, this approach presented a limitation: If a specific seed word (e.g., love) is widely used across different contexts, many users may be incorrectly classified into that category.

To mitigate this issue, we refined the method to consider seed-word coverage before frequency. We first count the number of unique seed words from each category that appear in a given text. If several categories tie on this count, we compare the total occurrences of their matched seed words in the text. If the top categories still tie (for example, when no seed word appears at all), the text is left unclassified. This refinement keeps a single frequent word from pulling most users into one dominant category.
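A small worked example shows why coverage is compared before frequency. This is a sketch under assumptions: the token list is made up, the "Romantic" category and its seed words are hypothetical, and tie handling is left out for brevity.

```python
from collections import Counter

def coverage_score(tokens, seed_words):
    """Return (number of distinct seed words present, total occurrences)."""
    counts = Counter(tokens)
    matched = [w for w in set(seed_words) if w in counts]
    return (len(matched), sum(counts[w] for w in matched))

tokens = ["love", "love", "love", "love", "hike", "travel"]
seeds = {
    "Romantic": ["love", "romance", "date"],  # hypothetical category
    "Adventurous Outdoorsy": ["travel", "hike", "adventure"],
}

scores = {cat: coverage_score(tokens, words) for cat, words in seeds.items()}
# Tuples compare element by element, so coverage decides before frequency.
best = max(scores, key=scores.get)
print(scores, best)
```

Ranking by total frequency alone would pick the hypothetical "Romantic" category (four occurrences of one word), whereas the coverage-first ordering picks the category with two distinct seed words.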

In [40]:
def assign_category_with_coverage(desc_list, cat_df):
    best_category = None
    word_counts = Counter(desc_list)
    category_scores = {}

    for _, row in cat_df.iterrows():
        category = row['Category']
        seed_words = set(row['Seed Words'])  # Convert to set for uniqueness

        # Count how many unique seed words are covered
        unique_matches = sum(1 for word in seed_words if word in word_counts)
        # Compute total occurrences of all matched seed words
        total_match_count = sum(word_counts[word] for word in seed_words if word in word_counts)

        category_scores[category] = (unique_matches, total_match_count)

    # Determine the best category based on unique matches first, then frequency
    sorted_categories = sorted(category_scores.items(), key=lambda x: (x[1][0], x[1][1]), reverse=True)

    if len(sorted_categories) == 1 or (sorted_categories[0][1] > sorted_categories[1][1]):
        best_category = sorted_categories[0][0]

    return best_category

def ar1(sample_df, cat_df):
    # Label every describe and expect token list with the coverage-based rule
    cat_fq_des = sample_df['describe_ls'].progress_apply(lambda x: assign_category_with_coverage(x, cat_df))
    cat_fq_exp = sample_df['expect_ls'].progress_apply(lambda x: assign_category_with_coverage(x, cat_df))

    return cat_fq_des, cat_fq_exp

In [15]:
cat_fq_des,cat_fq_exp = ar1(sample, cat_df)

In [16]:
cat_fq_des.isna().sum(),cat_fq_exp.isna().sum()

(1075, 2322)

In [17]:
cat_fq_des_10,cat_fq_exp_10 = ar1(sample, cat_df_10)

In [18]:
cat_fq_des_10.isna().sum(),cat_fq_exp_10.isna().sum()

(845, 2117)

This method has a limitation: the set of seed words is finite. For instance, the seed words for the Adventurous Outdoorsy category are travel, hike, and adventure. However, if a text describes outdoor activities such as camping, mountain climbing, or skiing, it may not be classified into this category due to the absence of these specific keywords.

As a result, with three seed words per category, about 36% of the describe entries (1,075 of 3,000) and 77% of the expect entries (2,322 of 3,000) cannot be classified. Expanding the seed word list to ten reduces the number of unclassified instances, but 28% of the describe entries (845) and 71% of the expect entries (2,117) still remain unclassified.
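The shares follow directly from the unclassified counts printed above and the sample size of 3,000. The snippet below only redoes that arithmetic; the dictionary layout is an illustrative choice.

```python
SAMPLE_SIZE = 3000

# Unclassified counts reported by the isna().sum() cells above.
unclassified = {
    ("3 seeds", "describe"): 1075,
    ("3 seeds", "expect"): 2322,
    ("10 seeds", "describe"): 845,
    ("10 seeds", "expect"): 2117,
}

# Convert each count to a percentage of the sample, rounded to one decimal.
shares = {key: round(100 * n / SAMPLE_SIZE, 1) for key, n in unclassified.items()}
for key, pct in shares.items():
    print(key, f"{pct}%")
```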

The higher proportion of missing classifications in expect than describe is primarily due to the fact that most users provide only a short sentence for their expectations, whereas their self-descriptions tend to be more informative.

## Algorithm 2: Similarity Comparison on the entire document

To measure similarity, we use cosine similarity instead of Euclidean distance. This choice is based on our observation that the seed words from different categories have significant differences in vector magnitudes, which could result in most users being classified into a single dominant category. Cosine similarity, by ignoring vector magnitude and focusing on angular difference, provides a more balanced classification approach.
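The effect of vector magnitude can be seen on a two-dimensional toy case. This is a sketch with made-up vectors, not real word embeddings: one seed points in the same direction as the document but is much longer, the other is short but points elsewhere.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc = np.array([1.0, 1.0])
seed_a = np.array([10.0, 10.0])  # same direction as doc, large magnitude
seed_b = np.array([1.0, 0.0])    # different direction, small magnitude

cos_a, cos_b = cosine(doc, seed_a), cosine(doc, seed_b)
dist_a = float(np.linalg.norm(doc - seed_a))
dist_b = float(np.linalg.norm(doc - seed_b))

print(f"cosine:    a={cos_a:.3f}  b={cos_b:.3f}")
print(f"euclidean: a={dist_a:.3f}  b={dist_b:.3f}")
```

Cosine similarity ranks seed_a first because it is parallel to the document, while Euclidean distance ranks seed_b first only because it is shorter.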

We first vectorize the describe text using the word2vec-google-news-300 model, obtaining a matrix of size (number of words in the document × 300). We then compute the column-wise mean of this matrix to derive a single 1 × 300 vector representation for the entire document.

Next, we apply the same word embedding method to the seed words of each category. For a category with three seed words, we obtain a 3 × 300 matrix.

We then compute the cosine similarity between the document vector and each seed word vector within a category. The category is selected based on the following rules:

1. Compute the average cosine similarity over the seed words of each category and select the category with the highest value.
2. If the highest average similarity among all categories is ≤ 0, select the category with the highest individual similarity score instead.
3. If the highest individual similarity score among all categories is still ≤ 0, return None (unclassified).

The same classification procedure is applied to the expect text (Figure 2).
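The three selection rules can be sketched on toy two-dimensional vectors. The function name `pick_category` and all vectors here are hypothetical stand-ins for the 300-dimensional embeddings; the small epsilon is an assumption to avoid division by zero.

```python
import numpy as np

def pick_category(doc_vec, seed_vecs_by_cat, eps=1e-9):
    """Apply the average-then-max-then-None rules to one document vector."""
    avg, best_single = {}, {}
    for cat, seeds in seed_vecs_by_cat.items():
        norms = np.linalg.norm(seeds, axis=1) * np.linalg.norm(doc_vec) + eps
        sims = seeds @ doc_vec / norms
        avg[cat], best_single[cat] = sims.mean(), sims.max()

    top = max(avg, key=avg.get)
    if avg[top] > 0:                       # rule 1: best average similarity
        return top
    top = max(best_single, key=best_single.get)
    if best_single[top] > 0:               # rule 2: best individual similarity
        return top
    return None                            # rule 3: unclassified

doc = np.array([1.0, 0.0])
rule1 = {"A": np.array([[1.0, 0.0], [0.0, 1.0]]),
         "B": np.array([[-1.0, 0.0], [0.0, 1.0]])}
rule2 = {"A": np.array([[1.0, 0.1], [-1.0, 0.0]]),
         "B": np.array([[-1.0, 0.0], [-1.0, 0.1]])}
rule3 = {"A": np.array([[-1.0, 0.0]]),
         "B": np.array([[-1.0, 0.0]])}

print(pick_category(doc, rule1), pick_category(doc, rule2), pick_category(doc, rule3))
```

In the second toy case every category average is negative, so the decision falls through to the single most similar seed word.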

In [19]:
import gensim.downloader as api
model_google = api.load('word2vec-google-news-300')

In [41]:
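def doc2vec(doc, wv, dim=300):
    # A bare string is treated as one token; iterating over it would
    # otherwise look up each character separately.
    if isinstance(doc, str):
        doc = [doc]
    # Collect the vectors of all in-vocabulary tokens
    vecs = []
    for token in doc:
        try:
            vecs.append(wv[token])
        except KeyError:
            pass
    # No known tokens: fall back to a zero vector
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)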

def assign_category_by_similarity(desc_vec, cat_df, row_name='seed_vec'):
    
    category_scores = {}
    max_value_per_category = {}
    # Iterate through each category and compute similarity scores
    for _, row in cat_df.iterrows():
        category = row['Category']
        seed_vecs = row[row_name]
        
        # Compute cosine similarity for each seed word
        desc_vec_norm = np.linalg.norm(desc_vec) + 1e-9  
        seed_vecs_norm = np.linalg.norm(seed_vecs, axis=1) + 1e-9

        similarities = np.dot(seed_vecs, desc_vec) / (seed_vecs_norm * desc_vec_norm)
        
        # Compute the average and max similarity score
        avg_similarity = np.mean(similarities)
        max_similarity = np.max(similarities)

        category_scores[category] = avg_similarity
        max_value_per_category[category] = max_similarity

    # Find the best category based on average similarity
    best_category = max(category_scores, key=category_scores.get)
    best_score = category_scores[best_category]

    if best_score > 0:
        return best_category
    else:
        # If all scores are <= 0, choose based on max individual similarity value
        best_category_by_max = max(max_value_per_category, key=max_value_per_category.get)
        best_max_value = max_value_per_category[best_category_by_max]
        if best_max_value > 0:
            return best_category_by_max
        else:
            return None  # If all values are <= 0



def ar2(sample_df,cat_df):
    
    sample_df['describe_vec'] = sample_df['describe_ls'].progress_apply(lambda x: doc2vec(x,model_google))
    sample_df['expect_vec'] = sample_df['expect_ls'].progress_apply(lambda x: doc2vec(x,model_google))
    cat_df['seed_vec'] = cat_df['Seed Words'].progress_apply(lambda x: np.array([doc2vec(wd,model_google) for wd in x]))
    
    cat_sm_des = sample_df['describe_vec'].progress_apply(lambda x: assign_category_by_similarity(x, cat_df))
    cat_sm_exp = sample_df['expect_vec'].progress_apply(lambda x: assign_category_by_similarity(x, cat_df))

    return cat_sm_des,cat_sm_exp

In [21]:
cat_sm_des,cat_sm_exp = ar2(sample, cat_df)

In [22]:
cat_sm_des.isna().sum(),cat_sm_exp.isna().sum()

(11, 20)

In [23]:
cat_sm_des_10,cat_sm_exp_10 = ar2(sample, cat_df_10)

This method largely resolves the unclassified-instance problem of Algorithm 1. However, a small number of entries still cannot be classified because they contain too little information (e.g., one user's expectation reads "at least 5'4""). With the three-seed-word setting, the proportion of unclassified entries is about 0.4% for describe (11 of 3,000) and 0.7% for expect (20 of 3,000).

Additionally, this method has a limitation: averaging the vectors of all words compresses a significant amount of information. This approach assumes that all words contribute equally to the final representation, which is not always realistic.

## Algorithm 3: Similarity Comparison on Key Words

To emphasize the contribution of important words, we use TF-IDF to select the 10 highest-scoring words from each describe entry. We set min_df = 2 so that terms appearing in only one document are ignored, which reduces noise from typographical errors. For entries with fewer than ten scored words, we keep all words with a TF-IDF score > 0.

Next, we vectorize the selected keywords using word2vec-google-news-300, resulting in a (number of keywords × 300) matrix. We then compute the cosine similarity between each keyword and each category’s seed words. For each category, we take the highest similarity score among all keyword–seed word pairs as the category’s final score. The category with the highest score is assigned as the label (Figure 3).
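The per-category score is the maximum over a keyword-by-seed similarity matrix. The sketch below uses made-up two-dimensional vectors in place of real embeddings; `max_pairwise_cosine` is a hypothetical helper name.

```python
import numpy as np

def max_pairwise_cosine(keyword_vecs, seed_vecs, eps=1e-9):
    """Highest cosine similarity over all (keyword, seed word) pairs."""
    k = keyword_vecs / (np.linalg.norm(keyword_vecs, axis=1, keepdims=True) + eps)
    s = seed_vecs / (np.linalg.norm(seed_vecs, axis=1, keepdims=True) + eps)
    return float((k @ s.T).max())  # (n_keywords x n_seeds) matrix, then max

keywords = np.array([[1.0, 0.0], [0.0, 1.0]])
seeds = {
    "A": np.array([[1.0, 1.0], [2.0, 0.0]]),
    "B": np.array([[-1.0, 0.0], [1.0, 1.0]]),
}

scores = {cat: max_pairwise_cosine(keywords, vecs) for cat, vecs in seeds.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Because only the single best pair counts, one strongly matching keyword is enough to decide the category, regardless of the other keywords.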

In [24]:
vectorizer = TfidfVectorizer(min_df=2) # reduce the impact of mistyped words
vectorizer.fit([" ".join(tokens) for tokens in sample['describe_ls']] 
               + [" ".join(tokens) for tokens in sample['expect_ls']] )

In [42]:
def extract_top_tfidf_words_single(vectorizer, row_tokens, top_n=10):

    if not row_tokens:  # handle empty token lists
        return []

    row_text = " ".join(row_tokens)
    row_tfidf = vectorizer.transform([row_text])  # TF-IDF for the current row only
    feature_names = vectorizer.get_feature_names_out()

    # Get the TF-IDF score of every vocabulary term
    row_tfidf_scores = row_tfidf.toarray().flatten()
    non_zero_indices = row_tfidf_scores.nonzero()[0]  # indices of terms with TF-IDF > 0

    # Handle short rows
    num_top_words = min(top_n, len(non_zero_indices))  # cap at the number of non-zero terms
    if np.sum(row_tfidf_scores) == 0:
        return []
    # Sort by score and keep the highest-scoring words
    top_indices = row_tfidf_scores.argsort()[-num_top_words:][::-1]
    top_words = [feature_names[i] for i in top_indices]

    return top_words

def compute_max_cosine_similarity(describe_top_g, seed_vec_g):
    
    if describe_top_g.shape[0] == 0:
        return 0.0  # no keywords: return 0

    # Compute cosine similarity
    describe_norm = np.linalg.norm(describe_top_g, axis=1, keepdims=True) + 1e-9  # avoid division by zero
    seed_norm = np.linalg.norm(seed_vec_g, axis=1, keepdims=True) + 1e-9

    cosine_similarities = np.dot(describe_top_g, seed_vec_g.T) / (describe_norm * seed_norm.T)  # (n_keywords x n_seeds) matrix

    # Final score: the highest pairwise similarity
    final_score = np.max(cosine_similarities)
    
    return final_score

def similarity(x, cat_df, row_name='seed_vec'):
    # Score each category by its best keyword-seed word similarity
    category_scores = {}
    for i in range(cat_df.shape[0]):
        seed_vec_g = cat_df[row_name][i]
        cate = cat_df['Category'][i]
        category_scores[cate] = compute_max_cosine_similarity(x, seed_vec_g)

    # Return the highest-scoring category
    best_category = max(category_scores, key=category_scores.get)

    return best_category

def ar3(sample_df,cat_df):
    
    sample_df['describe_top10'] = sample_df['describe_ls'].progress_apply(lambda x: extract_top_tfidf_words_single(vectorizer,x))
    sample_df['expect_top10'] = sample_df['expect_ls'].progress_apply(lambda x: extract_top_tfidf_words_single(vectorizer,x))

    sample_df['describe_top_vec'] = sample_df['describe_top10'].progress_apply(lambda x:  np.array([doc2vec(wd,model_google) for wd in x]))
    sample_df['expect_top_vec'] = sample_df['expect_top10'].progress_apply(lambda x: np.array([doc2vec(wd,model_google) for wd in x]))
    cat_df['seed_vec'] = cat_df['Seed Words'].apply(lambda x: np.array([doc2vec(wd,model_google) for wd in x]))

    cat_kw_des = sample_df['describe_top_vec'].progress_apply(lambda x: similarity(x, cat_df,row_name = 'seed_vec'))
    cat_kw_exp = sample_df['expect_top_vec'].progress_apply(lambda x: similarity(x, cat_df,row_name = 'seed_vec'))

    return cat_kw_des,cat_kw_exp

In [26]:
cat_kw_des,cat_kw_exp = ar3(sample, cat_df)

100%|██████████| 3000/3000 [00:07<00:00, 424.45it/s]
100%|██████████| 3000/3000 [00:06<00:00, 449.05it/s]


In [27]:
cat_kw_des.isna().sum(),cat_kw_exp.isna().sum()

(0, 0)

In [28]:
cat_kw_des_10,cat_kw_exp_10 = ar3(sample, cat_df_10)

100%|██████████| 3000/3000 [00:07<00:00, 395.70it/s]
100%|██████████| 3000/3000 [00:06<00:00, 452.81it/s]


# Evaluation

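Because no labeled data are available, we could only evaluate by manually inspecting randomly sampled instances. We found that each of the three algorithms had its own advantages. Therefore, we took the most frequently assigned category across all six labelings (three algorithms, each run with the 3-seed and 10-seed lists) as the final label; when two or more categories tie for the top count, no final label is assigned.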
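The voting rule can be illustrated on plain lists of labels. This is a sketch with a hypothetical helper `majority_label` and placeholder labels "A", "B", "C"; None stands for an algorithm that returned no label.

```python
from collections import Counter

def majority_label(labels):
    """Return the unique most frequent non-missing label, or None on a tie."""
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None  # every labeling was missing
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # two or more labels share the top count
    return ranked[0][0]

clear_winner = majority_label(["A", "A", "B", None, "A", "C"])
tie = majority_label(["A", "A", "B", "B", None, None])
all_missing = majority_label([None] * 6)
print(clear_winner, tie, all_missing)
```

Missing labels are ignored rather than counted as votes, so a label can win with fewer than half of the six labelings.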

However, even when combining all six methods, this seed-word-based weakly supervised learning approach still has several limitations.

First, all methods assume that each user belongs to a single category, making it difficult to accurately classify individuals with diverse interests. In dating apps, most users tend to present multiple facets of themselves, leading to high variance in our classification results.

Second, none of the methods account for negative exclusions. For example:

1. If a user’s expect states, “You should message me if you’re not a nerdy guy,” the model would delete the stop word "not" and likely classify them under Intellectual/Bookish based on "nerdy", which contradicts their actual preference.

2. Similarly, if someone’s expect states, “You should message me if you accept an unconventional family and want to have a child,” the presence of keywords like child and family may lead to classification under Family-Oriented. However, this expectation does not align with the typical preferences of users labeled as Family-Oriented.
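The first case can be reproduced with a plain whitespace tokenizer. This is only a sketch: the stop-word set below is a small hypothetical list chosen for the example, not spaCy's actual list, and no lemmatization is applied.

```python
# Hypothetical stop-word list that, like the one described above, contains "not".
STOP_WORDS = {"you", "should", "me", "if", "you're", "not", "a"}

def strip_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

tokens = strip_stop_words("You should message me if you're not a nerdy guy")
print(tokens)
```

Once "not" is removed, the remaining tokens are the same as for "you're a nerdy guy", so any seed-word or embedding comparison sees the opposite of what the user wrote.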

In [29]:
result = pd.DataFrame()
result['describe'] = sample['describe']
result['label_frequence'] = cat_fq_des
result['label_frequence_10'] = cat_fq_des_10
result['label_similarity'] = cat_sm_des
result['label_similarity_10'] = cat_sm_des_10
result['label_keywords'] = cat_kw_des
result['label_keywords_10'] = cat_kw_des_10

In [30]:
def most_frequent_label(row):

    label_counts = {}

    for col in ['label_frequence', 'label_frequence_10', 'label_similarity', 
                'label_similarity_10', 'label_keywords', 'label_keywords_10']:
        label = row[col]
        if pd.notna(label):  # skip missing labels
            label_counts[label] = label_counts.get(label, 0) + 1

    if not label_counts:
        return None  # no method produced a label

    max_count = max(label_counts.values())
    top_labels = [label for label, count in label_counts.items() if count == max_count]

    return top_labels[0] if len(top_labels) == 1 else None  # return None when several labels tie for the top count

In [31]:
final_des_label = result.apply(most_frequent_label, axis=1)

In [32]:
result = pd.DataFrame()
result['label_frequence'] = cat_fq_exp
result['label_frequence_10'] = cat_fq_exp_10
result['label_similarity'] = cat_sm_exp
result['label_similarity_10'] = cat_sm_exp_10
result['label_keywords'] = cat_kw_exp
result['label_keywords_10'] = cat_kw_exp_10

In [33]:
final_exp_label = result.apply(most_frequent_label, axis=1)

# Roll Out

In [39]:
clean_df['describe_ls'] = clean_df['describe'].progress_apply(lambda x: process_text(x, extra_list))
clean_df['expect_ls'] = clean_df['expect'].progress_apply(lambda x: process_text(x, extra_list))

100%|██████████| 47335/47335 [21:37<00:00, 36.49it/s]   
100%|██████████| 47335/47335 [04:59<00:00, 158.12it/s]


In [43]:
cat_fq_des,cat_fq_exp = ar1(clean_df, cat_df)
cat_fq_des_10,cat_fq_exp_10 = ar1(clean_df, cat_df_10)
cat_sm_des,cat_sm_exp = ar2(clean_df, cat_df)
cat_sm_des_10,cat_sm_exp_10 = ar2(clean_df, cat_df_10)
cat_kw_des,cat_kw_exp = ar3(clean_df, cat_df)
cat_kw_des_10,cat_kw_exp_10 = ar3(clean_df, cat_df_10)

100%|██████████| 47335/47335 [00:08<00:00, 5681.79it/s]
100%|██████████| 47335/47335 [00:07<00:00, 6473.29it/s]
100%|██████████| 47335/47335 [00:08<00:00, 5637.76it/s]
100%|██████████| 47335/47335 [00:07<00:00, 6267.54it/s]
100%|██████████| 47335/47335 [00:08<00:00, 5416.53it/s]
100%|██████████| 47335/47335 [00:01<00:00, 30914.18it/s]
100%|██████████| 10/10 [00:00<00:00, 10111.63it/s]
100%|██████████| 47335/47335 [00:15<00:00, 3104.66it/s]
100%|██████████| 47335/47335 [00:14<00:00, 3219.74it/s]
100%|██████████| 47335/47335 [00:05<00:00, 8305.85it/s]
100%|██████████| 47335/47335 [00:00<00:00, 48271.88it/s]
100%|██████████| 10/10 [00:00<00:00, 6054.13it/s]
100%|██████████| 47335/47335 [00:15<00:00, 2998.11it/s]
100%|██████████| 47335/47335 [00:15<00:00, 3130.64it/s]
100%|██████████| 47335/47335 [01:48<00:00, 435.67it/s]
100%|██████████| 47335/47335 [01:45<00:00, 449.97it/s]
100%|██████████| 47335/47335 [00:05<00:00, 9380.14it/s] 
100%|██████████| 47335/47335 [00:03<00:00, 13394.63it/s]

In [45]:
result = pd.DataFrame()
result['label_frequence'] = cat_fq_des
result['label_frequence_10'] = cat_fq_des_10
result['label_similarity'] = cat_sm_des
result['label_similarity_10'] = cat_sm_des_10
result['label_keywords'] = cat_kw_des
result['label_keywords_10'] = cat_kw_des_10
final_des_label = result.progress_apply(most_frequent_label, axis=1)

100%|██████████| 47335/47335 [00:00<00:00, 109667.41it/s]


In [49]:
result = pd.DataFrame()
result['label_frequence'] = cat_fq_exp
result['label_frequence_10'] = cat_fq_exp_10
result['label_similarity'] = cat_sm_exp
result['label_similarity_10'] = cat_sm_exp_10
result['label_keywords'] = cat_kw_exp
result['label_keywords_10'] = cat_kw_exp_10
final_exp_label = result.progress_apply(most_frequent_label, axis=1)

100%|██████████| 47335/47335 [00:00<00:00, 106619.52it/s]


In [64]:
clean_df['describe_label'] = final_des_label
clean_df['expect_label'] = final_exp_label

# Case Study

In [82]:
clean_df[clean_df.id == 16440]['expect'][16440]

'you have hair...and teeth.<br />\nyou like to have fun, and you know how to.<br />\nyou were born a female.'

In [83]:
clean_df[clean_df.id == 2340]['expect'][2340]

'you are down to earth, open minded, kind, funny, friendly, honest,\nspontaneous, willing to play, passionate, a good listener and\ncommunicator and soulful.'

In [95]:
match_16440 = clean_df[clean_df.describe_label == 'Creative/Artistic'].sample(n=3, replace=False, random_state=42)
print(clean_df[clean_df.id == 16440]['expect'][16440],clean_df[clean_df.id == 16440]['expect_label'][16440])
for i in match_16440.index:
    print('------------------------------')
    print(match_16440.loc[i,'describe'])


you have hair...and teeth.<br />
you like to have fun, and you know how to.<br />
you were born a female. Creative/Artistic
------------------------------
i'm just kind of looking to see who's out there. i'm not someone
who needs to be in a relationship to feel complete. i like myself
(though i suppose i could be more ambitious) and don't mind being
alone. in fact, i enjoy it. that way, i always get to do what i
want. . music (singing, occasionally composing), writing,
design/decorating, trivia. a television/movies<br />
music<br />
my imagination<br />
air<br />
family<br />
friends at the gym, then home alone, watching tv/movies, reading.
------------------------------
chill guy here, masculine, smart, fun, honest, and loyal. i'm
looking to get to know other guys to hang out with and go from
there. good people and good conversation are important. i'm not
really down with pop culture. working class, tats, scruff, geeks
and great smiles are sexy. i help people, and have fun. and that's

In [96]:
match_2340 = clean_df[clean_df.describe_label == 'Party/Nightlife Lover'].sample(n=3, replace=False, random_state=42)
print(clean_df[clean_df.id == 2340]['expect'][2340],clean_df[clean_df.id == 2340]['expect_label'][2340])
for i in match_2340.index:
    print('------------------------------')
    print(match_2340.loc[i,'describe'])

you are down to earth, open minded, kind, funny, friendly, honest,
spontaneous, willing to play, passionate, a good listener and
communicator and soulful. Party/Nightlife Lover
------------------------------
i've been lurking on here for a while now and browse the occasional
profile and answer a question or 3. it makes a nice change from
farcebook(tm)<br />
<br />
there are a couple of potabilities: you're browsing or otherwise
found your way here. welcome!<br />
<br />
or: i've sent you a message and you came here to see if you can
find out a bit more about me before deciding if you want to reply.
very sensible but think for a moment how many guys put the truth
down in these profile things?<br />
<br />
i've allways wanted to be a 6' 6" rocket scientist who looks just
like (insert name of movie star here. anyone apart from brad pitt)
thanks to the wonders of the www i and millions of others can now
be who we want to be. scarey isn't it?<br />
<br />
maybe i'll reinvent myself some tim

In [97]:
match_307 = clean_df[clean_df.describe_label == 'Adventurous Outdoorsy'].sample(n=3, replace=False, random_state=42)
print(clean_df[clean_df.id == 307]['expect'][307],clean_df[clean_df.id == 307]['expect_label'][307])

for i in match_307.index:
    print('------------------------------')
    print(match_307.loc[i,'describe'])

if you like having fun, like adventures! dont judge. and dont care
about deep conversations or politics (thank you in advance!) Adventurous Outdoorsy
------------------------------
just moved to the city from chicago not too long ago so i don't
know very many people in the area. i am looking for a fun,
intelligent, laid back and most importantly drama free girl who i
can explore the city with and with whom i can at least be friends
with if nothing else ensues. i like to live life to its fullest and
do what makes me happy and hope that you do too. i love to meet new
people, if you want to know more, just ask :)<br />
<br />
i am no drama, silly, and laid back working for a startup in sv! seinfeld trivia friends/family<br />
sunshine<br />
staying fit<br />
sense of humor<br />
sports<br />
expanding my knowledge out exploring the city!
------------------------------
racing down mount diablo on my bicycle is pure joy, but crashing my
bicycle on the mountain is not. i am more cautious now

In [104]:
match_629 = clean_df[clean_df.describe_label == 'Foodie/Culinary'].sample(n=3, replace=False, random_state=32)
print(clean_df[clean_df.id == 629]['expect'][629],clean_df[clean_df.id == 629]['expect_label'][629])

for i in match_629.index:
    print('------------------------------')
    print(match_629.loc[i,'describe'])

you know where to get good food and willing to share intelligence Foodie/Culinary
------------------------------
i value joy, honesty, and an appreciation of the beauty in the
world and in people. i try to be open to the experiences of the
world that others have to share, but i'm not always good at
listening. i want to do something good and important with my life,
and so far i've had a few good tries, but i'm not quite sure at the
moment of what the 'next big thing' will be for me. i have spent a
lot of time in school and while i'm always a sucker for gobbling up
new ideas, there will be no more school in my future (unless i'm
teaching it!)<br />
<br />
i know that i want to share my life with someone. someone who has a
good heart and positive energy. i think it will probably be
important that we have a fair number of activity-type interests in
common, but not every single interest has gotta be the same.<br />
<br />
ah, what do i do? well, i'm a scientist for now, and i'm aspiring
to 

In [105]:
#clean_df.to_csv('clean_df.csv')