# Progress

### EDA
The project started with obtaining the data and its description. Based on that description, I did some EDA on the data, then cleaned and processed it in the following ways:

1. Stop word removal using spaCy
2. Lemmatization using spaCy
3. Tokenization using spaCy
4. Removal of noise tokens such as punctuation and email addresses

Following this, I visualized the data and decided to keep only the startup descriptions that were between 15 and 60 tokens long, so that the remaining descriptions were neither too short nor too long. I experimented with several different lengths before settling on this range. There are currently 2512 startups in the data that meet this criterion.
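The length filter can be sketched in a few lines. This is a minimal illustration, assuming the descriptions are already tokenized into lists and that both bounds are inclusive; the function name `filter_by_length` and the toy data are hypothetical, not taken from the project code.

```python
def filter_by_length(token_lists, min_tokens=15, max_tokens=60):
    """Keep only the token lists whose length is within [min_tokens, max_tokens]."""
    return [
        tokens for tokens in token_lists
        if min_tokens <= len(tokens) <= max_tokens
    ]


# Toy example: three already-tokenized descriptions of different lengths.
descriptions = [
    ["fintech"] * 5,    # too short, dropped
    ["health"] * 30,    # within range, kept
    ["robotics"] * 80,  # too long, dropped
]
kept = filter_by_length(descriptions)
print(len(kept))
```

Keeping the bounds as parameters makes it easy to repeat the experiment with other ranges.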

### Topic Modelling & Clustering

Currently I am toying around with several algorithms:
1. Vectorization
    - TF-IDF
2. Topic modelling
    - LDA
    - LSA
    - NMF
3. Clustering
    - DBSCAN
    - K-Means (KM)
    - Hierarchical clustering (HC)
4. Dimensionality reduction
    - t-SNE

I combined topic modelling and clustering to group the startups into industries: LDA, LSA and NMF extract the topics from the data, and the topics are then used to cluster the data.

Through topic modelling, I found that some of the topics were purely location-based. For example, some topics had top words such as india, san francisco and US. I decided to remove these words from my dataset by returning to the preprocessing step and removing all named entities from the data using spaCy. I then ran topic modelling again and found that the topics were much better.

Following this, I clustered the data using DBSCAN, K-Means and hierarchical clustering. I tried two approaches:
1. Using the top 10 words in each topic as the features
2. Using the whole description as the features

I then clustered the data with the best-performing algorithm. The silhouette scores were poor, so I decided to abandon this approach midway and try another route.
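One way the topic-then-cluster idea can be wired together with scikit-learn is sketched below. It is only an illustration under assumptions: it uses NMF and K-Means as representatives, feeds the per-description topic weights to the clustering step (which is a different feature choice from the two approaches listed above), and the function name `cluster_by_topics`, the parameter values and the toy corpus are all hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def cluster_by_topics(texts, n_topics=2, n_clusters=2, seed=0):
    """TF-IDF -> NMF topic weights -> K-Means labels, plus the silhouette score."""
    tfidf = TfidfVectorizer().fit_transform(texts)
    # Each row of doc_topics holds one description's weight on every topic.
    nmf = NMF(n_components=n_topics, init="nndsvda", random_state=seed, max_iter=500)
    doc_topics = nmf.fit_transform(tfidf)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(doc_topics)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters.
    score = silhouette_score(doc_topics, labels)
    return doc_topics, labels, score


# Toy corpus (assumed, already preprocessed): three finance-like, three biotech-like.
texts = [
    "mobile payment platform bank lending",
    "online bank payment lending credit",
    "credit lending payment app bank",
    "gene sequencing biotech drug discovery",
    "drug discovery biotech lab gene",
    "biotech gene therapy drug lab",
]
doc_topics, labels, score = cluster_by_topics(texts)
print(labels, round(float(score), 3))
```

Swapping `KMeans` for `DBSCAN` or `AgglomerativeClustering` keeps the rest of the sketch unchanged, which makes it straightforward to compare silhouette scores across algorithms.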


### Embeddings

The next route I tried was using word embeddings. This is the progress on that so far:
1. I used BERT to get the embeddings for the data
2. The token embeddings were pooled (max, average and concatenation)
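The pooling step can be illustrated with NumPy. This is a sketch under assumptions: it treats the BERT output for one description as a `(n_tokens, dim)` array, and it interprets "concat" as concatenating the max-pooled and average-pooled vectors, which may differ from what was actually done. The function name `pool_tokens` and the toy array are hypothetical.

```python
import numpy as np


def pool_tokens(token_embeddings):
    """Pool a (n_tokens, dim) array into fixed-size description vectors."""
    max_pooled = token_embeddings.max(axis=0)    # strongest value per dimension
    avg_pooled = token_embeddings.mean(axis=0)   # mean value per dimension
    # Assumed meaning of "concat": max and average side by side (2 * dim values).
    concat_pooled = np.concatenate([max_pooled, avg_pooled])
    return {"max": max_pooled, "avg": avg_pooled, "concat": concat_pooled}


# Toy stand-in for BERT output: 3 tokens, 4 dimensions.
tokens = np.array([
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 1.0, 0.0, -1.0],
    [2.0, 2.0, 1.0, -4.0],
])
pooled = pool_tokens(tokens)
print(pooled["concat"].shape)
```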

I then tried two different things:
1. Directly associating the embeddings of the startups with the industries using cosine similarity between the embeddings
    - This proved quite poor, and the industries were wrong for many of the startups.
    - This could be because the embeddings are not good enough, or because the cosine similarity is not a good metric to use

2. Indirectly associating the embeddings of the startups with the industries using cosine similarity between the embeddings obtained from keywords of each industry

    - I reasoned that perhaps the industry names alone were not informative enough to compare against, so I asked GPT-4 to generate 10 keywords per industry, and used those to make embeddings for the industries
    - The result of this was much better, but still not good enough
    - A major problem I saw was that many of the industries are very closely related: there is an entry called oceantech and one called marinetech, and there are genomics, life sciences and biotech. I therefore decided it might be best to first group related industries together, and then use the embeddings to cluster the startups. This step is not done yet
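The cosine-similarity matching in both variants above can be sketched as follows. The sketch assumes that each industry's vector is the average of its keyword embeddings (the notes do not say how the keyword embeddings were combined), and it uses tiny 2-dimensional vectors in place of real BERT embeddings. The names `normalize`, `industry_vector` and `assign_industries` are hypothetical.

```python
import numpy as np


def normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def industry_vector(keyword_vectors):
    """Assumed combination rule: average the keyword embeddings of one industry."""
    return np.asarray(keyword_vectors, dtype=float).mean(axis=0)


def assign_industries(startup_vectors, industry_vectors, industry_names):
    """Label each startup with the industry of highest cosine similarity."""
    sims = normalize(startup_vectors) @ normalize(industry_vectors).T
    best = sims.argmax(axis=1)
    return [industry_names[i] for i in best], sims


# Toy 2-d vectors standing in for real embeddings.
names = ["biotech", "fintech"]
industries = np.stack([
    industry_vector([[1.0, 0.0], [0.8, 0.2]]),  # biotech keywords
    industry_vector([[0.0, 1.0], [0.2, 0.8]]),  # fintech keywords
])
startups = np.array([[2.0, 0.5], [0.1, 3.0]])
assigned, sims = assign_industries(startups, industries, names)
print(assigned)
```

The similarity matrix `sims` also shows how close the runner-up industry is, which helps spot near-duplicate industries such as oceantech and marinetech.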



### Transformers

The transformers that I used for v1 of the project were:
1. BERT
2. GPT2
3. XLNet
4. RoBERTa
5. DistilBERT
6. T5
7. ALBERT
8. ELECTRA
9. BART
10. sampathkethineedi/industry-classification

The best one was BERT, but its labels were still not good enough to use as training data for a neural network (NN). I decided that one of two things could be done:
1. Industries must be defined more concretely for the embeddings to work.
    This is a problem that I am not sure how to solve without human intervention. Throughout this project I have tried to avoid scraping new data and to rely only on the data that I already have.
2. Using sentence embeddings instead of word embeddings.
    This can be done with the Sentence Transformers library. I have used it before, but the results were worse than with word embeddings. That was on a scientific corpus, which might be why. I will try it again with the startup data and see if it works better on general English.

### Results

The sentence embeddings worked very well on the startup data and were quite accurate. I suppose the best pipeline would be one where the sentence embeddings are used to produce the labels for the data, and the labels are then used to train a NN. I will try this out in the next version of the project.

