## Word Embeddings

In this notebook, we explore word embeddings. Embeddings represent words (and longer texts) as vectors, and models such as BERT can generate them. Text similarity can then be measured as the cosine similarity between embedding vectors. One issue I see is that many of the industries are similar to each other: IoT is close to AI, and so on.
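
As a minimal sketch of the similarity measure mentioned above, the snippet below computes cosine similarity between toy vectors with NumPy. The vectors and the `cosine` helper are invented for illustration; real BERT embeddings have hundreds of dimensions.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-dimensional "embeddings", made up for this example.
iot = [1.0, 2.0, 0.0]
ai = [2.0, 4.0, 0.0]       # same direction as iot, twice the length
fashion = [0.0, 0.0, 3.0]  # orthogonal to both

sim_iot_ai = cosine(iot, ai)
sim_iot_fashion = cosine(iot, fashion)
print(sim_iot_ai, sim_iot_fashion)
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why the two parallel vectors score as identical.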

This notebook uses different functions from the `Embeddings()` class in `src/data/tagging.py`, because it was only meant as an exploration of the concept as a whole.

### Cell 1 - Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel
import warnings
import torch
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in favour of tqdm.notebook.tqdm
warnings.filterwarnings('ignore')


### Cell 2 - Load Data

In [None]:
industry_data = pd.read_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed\industries_clean.csv', sep='\t')
startups = pd.read_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed/startup_dataset_clean_1560_range.csv')
startups = startups.head(200)

### Cell 3 - Generate Embeddings

We use the `generate_embeddings()` function to generate embeddings for the startups and the industries. The model is `bert-base-uncased`, loaded through the `AutoTokenizer` and `AutoModel` classes from the `transformers` library. The function takes the following parameters:

- `texts`: The DataFrame containing the text to be embedded (it must have an `id` column)
- `tokenizer`: The tokenizer to be used
- `model`: The model to be used
- `startup`: A boolean indicating whether the rows are startups (text read from `cb_description`) or industries (text read from `keywords`)
- `pool`: The pooling method applied to the token embeddings: `max`, `avg`, or `concat` (max and average pooling concatenated)
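
The three pooling options can be illustrated on a random array standing in for BERT's `last_hidden_state`. This is a NumPy sketch of the shapes only, not the function used in this notebook; the sizes assume one text, 60 token positions, and the 768-dimensional hidden state of `bert-base-uncased`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for outputs.last_hidden_state: (batch, tokens, hidden)
hidden = rng.normal(size=(1, 60, 768))

max_pooled = hidden.max(axis=1)   # largest value per hidden unit -> (1, 768)
avg_pooled = hidden.mean(axis=1)  # mean over token positions     -> (1, 768)
# 'concat' places the two pooled vectors side by side -> (1, 1536)
concat_pooled = np.concatenate([max_pooled, avg_pooled], axis=1)

print(max_pooled.shape, avg_pooled.shape, concat_pooled.shape)
```

With `concat`, every text is represented by a vector twice as long, so startups and industries need to be embedded with the same `pool` value before they are compared.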

In [67]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')


def generate_embeddings(texts, tokenizer, model, startup=True, pool='max'):
    # Startups are embedded from their description, industries from their keywords
    text_column = 'cb_description' if startup else 'keywords'
    embeddings_list = []
    for _, row in tqdm(texts.iterrows(), total=len(texts)):
        record_id = row['id']  # avoid shadowing the built-in `id`
        description = row[text_column]
        inputs = tokenizer.encode_plus(description, return_tensors="pt", truncation=True, padding="max_length", max_length=60)
        # Inference only, so gradient tracking is not needed
        with torch.no_grad():
            outputs = model(**inputs)
        last_hidden_states = outputs.last_hidden_state

        # Note: pooling runs over all 60 positions, padding tokens included
        if pool == 'max':
            pooling = torch.max(last_hidden_states, dim=1).values
        elif pool == 'avg':
            pooling = torch.mean(last_hidden_states, dim=1)
        elif pool == 'concat':
            max_pooling = torch.max(last_hidden_states, dim=1).values
            average_pooling = torch.mean(last_hidden_states, dim=1)
            pooling = torch.cat((max_pooling, average_pooling), dim=1)
        else:
            raise ValueError('pool must be either max, avg or concat')

        # Add the id and the embeddings (as a list) to the embeddings_list
        embeddings_list.append({'id': record_id, 'embeddings': pooling.detach().numpy().tolist()})

    # Create a DataFrame from the embeddings_list and merge it with the original DataFrame
    embeddings_df = pd.DataFrame(embeddings_list)
    merged_df = pd.merge(texts, embeddings_df, on='id', how='left')

    return merged_df


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [68]:
industries_embeds = generate_embeddings(industry_data, tokenizer, model, startup=False, pool='concat')
startups_embeds = generate_embeddings(startups, tokenizer, model, startup=True, pool='concat')


In [47]:
# # # # Cluster using the embeds # # # #
from src.data.algorithms import Clustering

#clustering = Clustering(data=startups_embeds, n_clusters=5)

embeddings = np.array([embedding for embedding in startups_embeds['embeddings']]).squeeze()
k = 5  # Replace this with your chosen number of clusters
kmeans = KMeans(n_clusters=k, random_state=42)
clusters = kmeans.fit_predict(embeddings)
startups_embeds['cluster'] = clusters
silhouette_avg = silhouette_score(embeddings, clusters)
print(f"Silhouette score: {silhouette_avg}")



Silhouette score: 0.0636466486209045


In [23]:
from sklearn.metrics.pairwise import cosine_similarity

def assign_industry(startups, industries):
    assigned_industries = []

    for startup_embedding in startups['embeddings']:
        startup_embedding = np.array(startup_embedding).flatten()

        industry_embeddings = np.array([np.array(x).flatten() for x in industries['embeddings']])
        similarities = cosine_similarity([startup_embedding], industry_embeddings)[0]
        best_industry_index = np.argmax(similarities)
        assigned_industries.append(industries.iloc[best_industry_index]['industry'])

    return assigned_industries

startups['assigned_industry'] = assign_industry(startups_embeds, industries_embeds)

In [69]:
# TOP 3

def assign_industry_v2(startups, industries):
    assigned_industries = []

    for startup_embedding in startups['embeddings']:
        startup_embedding = np.array(startup_embedding).flatten()

        industry_embeddings = np.array([np.array(x).flatten() for x in industries['embeddings']])
        similarities = cosine_similarity([startup_embedding], industry_embeddings)[0]

        # Get the top 3 industries and their scores
        top_3_industry_indices = np.argsort(similarities)[-3:][::-1]
        top_3_industries = [{'industry': industries.iloc[index]['industry'], 'score': similarities[index]} for index in top_3_industry_indices]

        assigned_industries.append(top_3_industries)

    return assigned_industries

assigned_industries = assign_industry_v2(startups_embeds, industries_embeds)

startups['industry1'] = [x[0]['industry'] for x in assigned_industries]
startups['score1'] = [x[0]['score'].round(3) for x in assigned_industries]
startups['industry2'] = [x[1]['industry'] for x in assigned_industries]
startups['score2'] = [x[1]['score'].round(3) for x in assigned_industries]
startups['industry3'] = [x[2]['industry'] for x in assigned_industries]
startups['score3'] = [x[2]['score'].round(3) for x in assigned_industries]

startups



Unnamed: 0,id,name,cb_description,industry1,score1,industry2,score2,industry3,score3
0,1820,0xKYC,modular knowledge system identity credential m...,Web3,0.917,DeFi,0.912,InsurTech,0.909
1,3640,10X-Genomics,genomic create revolutionary dna sequence tech...,Genomics,0.922,Longevity,0.905,Generative AI,0.897
2,9594,111Skin,commit positive luxury skincare push boundary ...,Fashion,0.891,Circular Economy,0.891,Beauty,0.891
3,473,1stdibs,internet company offer marketplace rare desira...,Creator Economy,0.910,Retail,0.900,Fashion,0.899
4,7956,1v1Me,application allow user play match favorite vid...,Creator Economy,0.899,Gaming,0.892,Metaverse,0.890
...,...,...,...,...,...,...,...,...,...
195,4900,Aptatek-Biosciences,aptatek develop diagnosis monitor inborn disea...,Genomics,0.928,Longevity,0.911,BioTech,0.901
196,4702,Aqdot,aqdot technology allow raw material efficient ...,Gut Microbiome,0.911,Semiconductors,0.905,DeFi,0.905
197,7814,Aquapak-Polymers,aquapak polymer developer polyvinyl alcohol re...,Longevity,0.897,DeFi,0.895,3D Printing,0.892
198,2884,Aquapharm-Biodiscovery,aquapharm innovative drug discovery company fo...,Longevity,0.908,Gut Microbiome,0.907,Psychedelics,0.904


In [60]:
startups
# startups.to_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\tagged/tagged_after_keyword_embedding_200startups.csv')

Unnamed: 0,id,name,cb_description,industry1,score1,industry2,score2,industry3,score3
0,1820,0xKYC,modular knowledge system identity credential m...,Web3,0.975,Health Tech,0.973,RegTech,0.968
1,3640,10X-Genomics,genomic create revolutionary dna sequence tech...,Genomics,0.971,Deep Tech,0.957,Computer Vision,0.956
2,9594,111Skin,commit positive luxury skincare push boundary ...,Circular Economy,0.962,Automotive,0.961,Climate Tech/CleanTech,0.961
3,473,1stdibs,internet company offer marketplace rare desira...,Gaming,0.965,Creator Economy,0.964,PropTech,0.963
4,7956,1v1Me,application allow user play match favorite vid...,Web3,0.967,Metaverse,0.962,Sharing Economy,0.961
...,...,...,...,...,...,...,...,...,...
195,4900,Aptatek-Biosciences,aptatek develop diagnosis monitor inborn disea...,Genomics,0.978,Health Tech,0.976,MedTech,0.971
196,4702,Aqdot,aqdot technology allow raw material efficient ...,AgTech,0.956,GreenTech,0.952,Energy Storage,0.951
197,7814,Aquapak-Polymers,aquapak polymer developer polyvinyl alcohol re...,Nano,0.960,Deep Tech,0.958,Materials,0.958
198,2884,Aquapharm-Biodiscovery,aquapharm innovative drug discovery company fo...,Genomics,0.942,Longevity,0.939,AgTech,0.938


In [15]:
startups.to_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed\startups_with_industries_keyword_embeds.csv', index=False)

## Results

The results are poor. Maybe I should use LLMs to generate keywords associated with each industry and then use those keywords to cluster the industries; I will try that in the next notebook. With GPT-3 this would be far more expensive, though, both computationally and financially.

I tried the `openai-gpt` model on Hugging Face, but the results were not good. Instead I used ChatGPT, which is not a feasible long-term solution but works for now.
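
One way to put a number on how weakly the scores separate industries is the margin between the best and second-best similarity for each startup. The sketch below uses `score1` and `score2` from the first five rows of the first top-3 table shown earlier; the `margin` variable is a name introduced here for illustration, and five rows are only a rough indication.

```python
import numpy as np

# score1 and score2 copied from the first five rows of the first top-3 table.
names = ['0xKYC', '10X-Genomics', '111Skin', '1stdibs', '1v1Me']
score1 = np.array([0.917, 0.922, 0.891, 0.910, 0.899])
score2 = np.array([0.912, 0.905, 0.891, 0.900, 0.892])

# A small margin means the top industry barely beats the runner-up.
margin = score1 - score2
for name, m in zip(names, margin):
    print(f"{name}: {m:.3f}")
print(f"mean margin: {margin.mean():.4f}")
```

Margins this small mean that the ranking of industries is decided by differences in the third decimal place, which fits the impression that the assignments are unreliable.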

In [10]:
warnings.filterwarnings('ignore')

industry_data = pd.read_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed\industries_clean.csv', sep='\t')
startups = pd.read_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed/startup_dataset_clean_1560_range.csv')
startups = startups.head(200)

class Embedding:

    def __init__(self, startups, industries, llm='bert'):
        self.startups = startups
        self.industries = industries
        self.llm = {
            'bert': 'bert-base-uncased',
            'gpt2': 'gpt2',
            'gpt': 'openai-gpt',
            'roberta': 'roberta-base',
            'distilbert': 'distilbert-base-uncased',
            'xlnet': 'xlnet-base-cased',  # XLNet base checkpoints are published as cased only
            'electra': 'google/electra-base-discriminator',
            'industry_classifier': 'sampathkethineedi/industry-classification'
        }
        self.model = AutoModel.from_pretrained(self.llm[llm])
        self.tokenizer = AutoTokenizer.from_pretrained(self.llm[llm])

    def generate_embeddings(self, startup=True, pool='max'):
        texts = self.startups if startup else self.industries
        embeddings_list = []
        for i, row in tqdm(texts.iterrows()):
            id = row['id']
            if startup:
                description = row['cb_description']
            else:
                description = row['keywords']
            inputs = self.tokenizer.encode_plus(description, return_tensors="pt", truncation=True, padding="max_length", max_length=60)
            outputs = self.model(**inputs)
            last_hidden_states = outputs.last_hidden_state

            if pool == 'max':
                pooling = torch.max(last_hidden_states, dim=1).values
            elif pool == 'avg':
                pooling = torch.mean(last_hidden_states, dim=1)
            elif pool == 'concat':
                max_pooling = torch.max(last_hidden_states, dim=1).values
                average_pooling = torch.mean(last_hidden_states, dim=1)
                # Keep this a tensor: the conversion to a list happens once, below
                pooling = torch.cat((max_pooling, average_pooling), dim=1)
            else:
                raise ValueError('pool must be either max, avg or concat')

            embeddings_list.append({'id': id, 'embeddings': pooling.detach().numpy().tolist()})

        embeddings_df = pd.DataFrame(embeddings_list)
        merged_df = pd.merge(texts, embeddings_df, on='id', how='left')

        if startup:
            self.startups = merged_df
        else:
            self.industries = merged_df

        return merged_df

    def assign_industry(self, num_labels=3):
        self.assigned_industries = []
        for startup_embedding in self.startups['embeddings']:
            startup_embedding = np.array(startup_embedding).flatten()
            industry_embeddings = np.array([np.array(x).flatten() for x in self.industries['embeddings']])

            similarities = cosine_similarity([startup_embedding], industry_embeddings)[0]
            top_industry_indices = np.argsort(similarities)[-num_labels:][::-1]
            top_industries = [{'industry': self.industries.iloc[index]['industry'], 'score': similarities[index]} for index in top_industry_indices]

            self.assigned_industries.append(top_industries)

        return self.assigned_industries

    def update_dataframe(self):
        max_industries = max([len(x) for x in self.assigned_industries])

        for i in range(max_industries):
            self.startups[f'industry{i + 1}'] = [x[i]['industry'] if i < len(x) else None for x in self.assigned_industries]
            self.startups[f'score{i + 1}'] = [x[i]['score'].round(3) if i < len(x) else None for x in self.assigned_industries]

        self.startups.drop(columns=['embeddings'], inplace=True)
        self.industries.drop(columns=['embeddings'], inplace=True)

        return self.startups


In [11]:
embedding_class = Embedding(startups, industry_data, llm='industry_classifier')
embedding_class.generate_embeddings(startup=True, pool='max')
embedding_class.generate_embeddings(startup=False, pool='max')
embedding_class.assign_industry()
df = embedding_class.update_dataframe()

df

Some weights of the model checkpoint at sampathkethineedi/industry-classification were not used when initializing DistilBertModel: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Unnamed: 0,id,name,cb_description,industry1,score1,industry2,score2,industry3,score3
0,1820,0xKYC,modular knowledge system identity credential m...,Analytics,0.888,Platforms,0.834,Blockchain,0.827
1,3640,10X-Genomics,genomic create revolutionary dna sequence tech...,Life Sciences,0.763,Deep Tech,0.756,Genomics,0.724
2,9594,111Skin,commit positive luxury skincare push boundary ...,Beauty,0.910,Nutrition,0.849,Gut Microbiome,0.563
3,473,1stdibs,internet company offer marketplace rare desira...,Fashion,0.776,Creator Economy,0.599,Travel,0.585
4,7956,1v1Me,application allow user play match favorite vid...,Social Networks,0.683,Metaverse,0.658,Creator Economy,0.641
...,...,...,...,...,...,...,...,...,...
195,4900,Aptatek-Biosciences,aptatek develop diagnosis monitor inborn disea...,MedTech,0.814,FamilyTech,0.738,Health Tech,0.733
196,4702,Aqdot,aqdot technology allow raw material efficient ...,3D Printing,0.851,Materials,0.838,Chemicals,0.797
197,7814,Aquapak-Polymers,aquapak polymer developer polyvinyl alcohol re...,Chemicals,0.779,Materials,0.759,3D Printing,0.743
198,2884,Aquapharm-Biodiscovery,aquapharm innovative drug discovery company fo...,Life Sciences,0.895,BioTech,0.868,Longevity,0.833


In [4]:
from sentence_transformers import SentenceTransformer

In [89]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# `industry_data` is expected to have a 'keywords' column; pass the texts as a plain list of strings
industry_keywords = industry_data['keywords'].tolist()

# Generate BERT embeddings for the keywords
model = SentenceTransformer('bert-base-nli-mean-tokens')
industry_embeddings = model.encode(industry_keywords)

# Determine the optimal number of clusters using silhouette scores
silhouette_scores = []
max_clusters = 10
for n_clusters in range(2, max_clusters + 1):
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    cluster_labels = kmeans.fit_predict(industry_embeddings)
    silhouette_avg = silhouette_score(industry_embeddings, cluster_labels)
    silhouette_scores.append(silhouette_avg)

optimal_clusters = np.argmax(silhouette_scores) + 2

# Perform clustering using the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_clusters, random_state=0)
clusters = kmeans.fit_predict(industry_embeddings)

# Add cluster labels to the industries DataFrame
industry_data['cluster'] = clusters

# Visualize the clusters using PCA
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(industry_embeddings)

plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=clusters, cmap='viridis')
plt.show()

Unnamed: 0,id,name,cb_description,industry1,score1,industry2,score2,industry3,score3,embeddings
0,1820,0xKYC,modular knowledge system identity credential m...,Web3,0.924,DeFi,0.921,Blockchain,0.920,"[[0.6992632746696472, 0.9152489900588989, 1.11..."
1,3640,10X-Genomics,genomic create revolutionary dna sequence tech...,Genomics,0.928,Longevity,0.916,Generative AI,0.913,"[[1.0245652198791504, 0.848852276802063, 1.081..."
2,9594,111Skin,commit positive luxury skincare push boundary ...,Beauty,0.911,Circular Economy,0.907,Connected Life,0.906,"[[1.5274184942245483, 0.44603708386421204, 0.8..."
3,473,1stdibs,internet company offer marketplace rare desira...,Creator Economy,0.918,Retail,0.913,Fashion,0.912,"[[1.1128085851669312, 0.945479154586792, 0.915..."
4,7956,1v1Me,application allow user play match favorite vid...,Creator Economy,0.906,Gaming,0.905,Esports,0.902,"[[0.579791784286499, 0.3094499409198761, 0.988..."
...,...,...,...,...,...,...,...,...,...,...
195,4900,Aptatek-Biosciences,aptatek develop diagnosis monitor inborn disea...,Genomics,0.937,Longevity,0.924,BioTech,0.917,"[[0.6032254099845886, 0.7531402707099915, 1.11..."
196,4702,Aqdot,aqdot technology allow raw material efficient ...,Gut Microbiome,0.929,Longevity,0.926,DeFi,0.924,"[[1.2192195653915405, 1.3865875005722046, 1.19..."
197,7814,Aquapak-Polymers,aquapak polymer developer polyvinyl alcohol re...,Longevity,0.913,DeFi,0.911,Psychedelics,0.908,"[[1.034882664680481, 0.9140047430992126, 1.133..."
198,2884,Aquapharm-Biodiscovery,aquapharm innovative drug discovery company fo...,Gut Microbiome,0.926,Longevity,0.925,DeFi,0.922,"[[0.8383574485778809, 0.8880403637886047, 1.24..."
