# Explanation

This notebook generates embeddings using sentence transformers. Their architecture is similar to that of other transformer models, but they are trained to capture the meaning of a whole sentence, rather than producing word-by-word embeddings that then have to be pooled, as with BERT.


### Cell 1
The cell below is a copy of the `Embedding` class from the `src/data/tagging.py` module. You can import it with `from src.data.tagging import Embedding` and then use the class as shown in cells 2-6.

In [1]:
import pandas as pd
import numpy as np
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in favour of tqdm.notebook
from sentence_transformers import SentenceTransformer
import torch


class Embedding:
    """
    A class to generate embeddings for startups and industries using specified language models and pooling methods.
    """

    def __init__(self, startups, industries, llm='bert', pool='max', sentence_transformer=False):

        """
        Initializes the Embedding class with specified language models and pooling methods.

        :param startups: DataFrame containing startup data with 'id' and 'cb_description' columns
        :param industries: DataFrame containing industry data with 'id' and 'keywords' columns
        :param llm: string, the language model to use for generating embeddings, default is 'bert'
        :param pool: string, the pooling method to use for generating embeddings, default is 'max'
        :param sentence_transformer: bool, whether to use a sentence transformer model, default is False
        """

        self.startups = startups
        self.industries = industries
        self.sentence_transformer = sentence_transformer
        self.pool = pool
        self.llm = {
            'bert': 'bert-base-uncased',
            'gpt2': 'gpt2',
            'gpt': 'openai-gpt',
            'roberta': 'roberta-base',
            'distilbert': 'distilbert-base-uncased',
            'xlnet': 'xlnet-base-cased',  # XLNet is published as cased checkpoints only
            'electra': 'google/electra-base-discriminator',
            'industry_classifier': 'sampathkethineedi/industry-classification'
        }
        if not sentence_transformer:
            self.model = AutoModel.from_pretrained(self.llm[llm])
            self.tokenizer = AutoTokenizer.from_pretrained(self.llm[llm])
        else:
            self.model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    def generate_embeddings(self, startup=True):
        """
        Generates embeddings for startups or industries using the specified language model and pooling method.

        :param startup: bool, if True, generates embeddings for startups, if False, generates embeddings for industries
        :return: DataFrame with generated embeddings merged with the original input DataFrame
        """
        texts = self.startups if startup else self.industries
        embeddings_list = []

        # Pass the row count so the progress bar can show a percentage and an ETA
        for i, row in tqdm(texts.iterrows(), total=len(texts)):
            id = row['id']
            if startup:
                description = row['cb_description']
            else:
                description = row['keywords']
            if self.sentence_transformer:
                embeddings = self.model.encode(description)
            else:
                inputs = self.tokenizer.encode_plus(description, return_tensors="pt", truncation=True, padding="max_length", max_length=60)
                # Inference only, so skip gradient tracking to save memory
                with torch.no_grad():
                    outputs = self.model(**inputs)
                last_hidden_states = outputs.last_hidden_state
                embeddings = self.pooling(last_hidden_states)

            embeddings_list.append({'id': id, 'embeddings': embeddings.tolist()})

        embeddings_df = pd.DataFrame(embeddings_list)
        merged_df = pd.merge(texts, embeddings_df, on='id', how='left')

        if startup:
            self.startups = merged_df
        else:
            self.industries = merged_df

        return merged_df


    def assign_industry(self, num_labels=3):
        """
        Assigns top industries to startups based on their cosine similarity to the industry embeddings.

        :param num_labels: int, the number of top industries to assign to each startup, default is 3
        :return: list of lists containing dictionaries with assigned industries and their similarity scores
        """
        self.assigned_industries = []
        for startup_embedding in self.startups['embeddings']:
            startup_embedding = np.array(startup_embedding).flatten()
            industry_embeddings = np.array([np.array(x).flatten() for x in self.industries['embeddings']])

            similarities = cosine_similarity([startup_embedding], industry_embeddings)[0]
            top_industry_indices = np.argsort(similarities)[-num_labels:][::-1]
            top_industries = [{'industry': self.industries.iloc[index]['industry'], 'score': similarities[index]} for index in top_industry_indices]

            self.assigned_industries.append(top_industries)

        return self.assigned_industries

    def pooling(self, last_hidden_states):
        """
        Applies the specified pooling method to the given last hidden states tensor.

        :param last_hidden_states: tensor, the last hidden states from the language model
        :return: NumPy array of pooled embeddings
        """
        if self.pool == 'max':
            self.pooled_embeds = torch.max(last_hidden_states, dim=1).values
        elif self.pool == 'avg':
            self.pooled_embeds = torch.mean(last_hidden_states, dim=1)
        elif self.pool == 'concat':
            max_pooling = torch.max(last_hidden_states, dim=1).values
            average_pooling = torch.mean(last_hidden_states, dim=1)
            self.pooled_embeds = torch.cat((max_pooling, average_pooling), dim=1)
        else:
            raise ValueError('pool must be either max, avg or concat')
        return self.pooled_embeds.detach().numpy()

    def update_dataframe(self):
        """
        Updates the startup and industry DataFrames with assigned industries and their similarity scores.

        :return: DataFrame with updated startups data
        """
        max_industries = max([len(x) for x in self.assigned_industries])

        for i in range(max_industries):
            self.startups[f'industry{i + 1}'] = [x[i]['industry'] if i < len(x) else None for x in self.assigned_industries]
            self.startups[f'score{i + 1}'] = [x[i]['score'].round(3) if i < len(x) else None for x in self.assigned_industries]

        self.startups.drop(columns=['embeddings'], inplace=True)
        self.industries.drop(columns=['embeddings'], inplace=True)

        return self.startups


### Cell 2
We load the data in a separate cell so that the datasets can be swapped ad hoc without re-running the class definition.

In [2]:
warnings.filterwarnings('ignore')

industry_data = pd.read_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed\industries_clean.csv', sep='\t')
startups = pd.read_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed\startups_clean_noents_1560.csv')

### Cell 3
This is where the magic happens. We create an instance of the `Embedding` class and pass the two dataframes as arguments. Embeddings can be generated in one of two ways:
1. General transformers (BERT, GPT-2, etc.)
    - If using a general transformer, set the `llm` argument to one of the keys of the `llm` dictionary in the `Embedding` class (`bert`, `gpt2`, `roberta`, ...). The dictionary maps each short name to a model name on the Hugging Face Hub. The default is `bert`, which loads `bert-base-uncased`. You can browse the available models [here](https://huggingface.co/models); to use one that is not listed, add it to the dictionary first. The pooling method is set with the `pool` argument: `max` (the default), `avg`, or `concat`, which concatenates the max-pooled and average-pooled vectors.
2. Sentence transformers (SBERT)
    - If using a sentence transformer, all you need to do is set `sentence_transformer=True`. The class then loads `sentence-transformers/all-MiniLM-L6-v2` and ignores the `llm` and `pool` arguments.
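
The three `pool` options can be illustrated on a toy input. The sketch below uses plain Python lists in place of a PyTorch tensor; the helper names and the numbers are illustrative assumptions, not part of the `Embedding` class. The last helper, `masked_avg_pool`, is a hypothetical variant that skips padding positions. The class itself pads every text to 60 tokens and pools over all of them.

```python
# Toy "last hidden state" for one text: 3 tokens x 4 hidden dimensions.
# The values are invented; a real model returns a (batch, tokens, hidden) tensor.
hidden_states = [
    [0.1, -0.4, 0.3, 0.0],
    [0.5, 0.2, -0.1, 0.8],
    [-0.3, 0.6, 0.1, 0.4],
]


def max_pool(states):
    """Largest value of each hidden dimension across all tokens."""
    return [max(column) for column in zip(*states)]


def avg_pool(states):
    """Mean of each hidden dimension across all tokens."""
    return [sum(column) / len(column) for column in zip(*states)]


def concat_pool(states):
    """Max-pooled vector followed by the average-pooled vector."""
    return max_pool(states) + avg_pool(states)


def masked_avg_pool(states, mask):
    """Hypothetical variant: average only the tokens whose mask entry is 1."""
    kept = [row for row, keep in zip(states, mask) if keep]
    return avg_pool(kept)


for name, vector in [
    ("max", max_pool(hidden_states)),
    ("avg", avg_pool(hidden_states)),
    ("concat", concat_pool(hidden_states)),
    ("masked avg", masked_avg_pool(hidden_states, [1, 1, 0])),
]:
    print(name, len(vector), [round(v, 3) for v in vector])
```

Note that `concat` doubles the embedding size, so the startup and industry embeddings must be generated with the same `pool` setting before they can be compared.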

In [3]:
embeddings = Embedding(startups, industry_data, sentence_transformer=True, pool='max', llm='industry_classifier')


### Cell 4
We can now generate the embeddings for the startups and the industries. The `startup` argument of `generate_embeddings()` is a boolean that selects which dataset to embed: `True` (the default) for the startups, `False` for the industries, so the method is called once for each. Each call returns the merged dataframe, but there is no need to capture it, because the result is also stored in the `startups` or `industries` attribute of the `Embedding` instance.

In [7]:
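embeddings.generate_embeddings(startup=True)
# Reload the industries and drop rows with missing values
industry_data = pd.read_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed\industries_clean.csv', sep='\t')
industry_data.dropna(inplace=True)
# The instance still holds the dataframe passed to the constructor, so hand it the cleaned one
embeddings.industries = industry_data
embeddings.generate_embeddings(startup=False)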

0it [00:00, ?it/s]

| Unnamed: 0 | id | industry | keywords | embeddings |
|---|---|---|---|---|
| 0 | 0 | neuro | neurology signal neuron memory network cogniti... | [-0.008603241294622421, -0.09912002831697464, ... |
| 1 | 1 | procurement | source supply chain proposal supplier negotiat... | [-0.08471089601516724, 0.0118993716314435, 0.0... |
| 2 | 2 | greentech | biofuel solar renewable sustainability geother... | [0.04065382108092308, 0.07953336834907532, 0.0... |
| 3 | 3 | social impact | empowerment volunteer justice activism social ... | [0.0023484171833842993, 0.012581953778862953, ... |
| 4 | 4 | esports | streaming competition game virtual tournament ... | [0.01624584011733532, -0.006599493324756622, -... |
| ... | ... | ... | ... | ... |
| 115 | 117 | extremism | violence radicalization right speech hate far ... | [0.0543396957218647, 0.04583355411887169, -0.0... |
| 116 | 118 | connected home | appliance | [-0.08160713315010071, 0.04330090433359146, -0... |
| 117 | 119 | network infrastructure | sdn router optic wan switch backbone | [-0.012510308064520359, -0.05666343495249748, ... |
| 118 | 120 | food & beverage | restaurant beverage catering foodtech | [0.004870914854109287, -0.025196079164743423, ... |


### Cell 5

In this cell we assign industries to the startups. The `num_labels` argument sets how many industries are assigned to each startup; the default is 3. The method returns a list with one entry per startup, where each entry is a list of dictionaries holding an industry name and its cosine similarity score, sorted from highest to lowest score. The same list is stored in the `assigned_industries` attribute of the `Embedding` instance.
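
As a rough illustration of the ranking step, the sketch below scores one startup vector against a few industry vectors and keeps the best matches. It uses plain Python rather than `sklearn`'s `cosine_similarity`, and the function names, the three-dimensional vectors and the industry subset are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_industries(startup_vec, industry_vecs, num_labels=3):
    """Return the num_labels most similar industries, best match first."""
    scored = [
        {"industry": name, "score": cosine(startup_vec, vec)}
        for name, vec in industry_vecs.items()
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:num_labels]


# Invented toy embeddings; real ones have hundreds of dimensions.
industry_vecs = {
    "fashion": [0.9, 0.1, 0.0],
    "biotech": [0.0, 1.0, 0.2],
    "payments": [0.1, 0.0, 1.0],
}
startup_vec = [0.8, 0.3, 0.1]

result = top_industries(startup_vec, industry_vecs, num_labels=2)
for item in result:
    print(item["industry"], round(item["score"], 3))
```

Cosine similarity depends only on the angle between two vectors, not on their length, which is why the raw embeddings can be compared without normalising them first.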

In [9]:
embeddings.assign_industry(num_labels=3)

[[{'industry': 'cybersecurity', 'score': 0.35219988499503274},
  {'industry': 'payments', 'score': 0.3104075607907534},
  {'industry': 'professional services', 'score': 0.310153900753311}],
 [{'industry': 'biotech', 'score': 0.40561349479158315},
  {'industry': 'genomics', 'score': 0.3783544543743757},
  {'industry': 'longevity', 'score': 0.34221835381535737}],
 [{'industry': 'beauty', 'score': 0.4568139952826272},
  {'industry': 'fashion', 'score': 0.3424932103045579},
  {'industry': 'social impact', 'score': 0.31223965328988634}],
 [{'industry': 'fashion', 'score': 0.46614104445633964},
  {'industry': 'e-commerce', 'score': 0.4531913040837924},
  {'industry': 'analytics', 'score': 0.41094813654028706}],
 [{'industry': 'esports', 'score': 0.45680492230202446},
  {'industry': 'social networks', 'score': 0.3517344025167309},
  {'industry': 'sharing economy', 'score': 0.3346094884140761}],
 [{'industry': 'fashion', 'score': 0.47951542785542584},
  {'industry': 'e-commerce', 'score': 0.32

### Cell 6

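Finally, we update the startups dataframe with the assigned industries. `update_dataframe()` adds an `industry1`, `score1`, `industry2`, `score2`, ... column pair for each assigned label, with scores rounded to three decimals, and drops the `embeddings` column from both dataframes. It returns the updated dataframe, which is also stored in the `startups` attribute of the `Embedding` instance. Because the `embeddings` columns are dropped, run this cell only once per set of generated embeddings; a second call raises a `KeyError`.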
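
The way the nested list of assignments is flattened into columns can be sketched without pandas. The helper name `expand_labels` and the sample scores below are invented for illustration; the column naming follows the `industry{i}` / `score{i}` pattern used by the class.

```python
def expand_labels(assigned):
    """Turn a per-startup list of label dicts into named columns.

    Startups with fewer labels than the widest row get None in the
    missing positions.
    """
    width = max(len(labels) for labels in assigned)
    columns = {}
    for i in range(width):
        columns[f"industry{i + 1}"] = [
            labels[i]["industry"] if i < len(labels) else None
            for labels in assigned
        ]
        columns[f"score{i + 1}"] = [
            round(labels[i]["score"], 3) if i < len(labels) else None
            for labels in assigned
        ]
    return columns


# Invented sample: the first startup has two labels, the second only one.
assigned = [
    [{"industry": "fashion", "score": 0.4712}, {"industry": "e-commerce", "score": 0.4531}],
    [{"industry": "biotech", "score": 0.4056}],
]

columns = expand_labels(assigned)
for name, values in columns.items():
    print(name, values)
```

Each key of the resulting dictionary corresponds to one new column of the startups dataframe, with one value per startup in the original row order.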

In [13]:
df = embeddings.update_dataframe()

In [14]:
df.to_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\tagged\tagged_with_sentence_transformer.csv', index=False)

KeyError: <__main__.Embedding object at 0x0000020D970A9300>