# Report part 1

# NLP Research into public 10K filings

In this project, Mary Xu, Peitong Lu, and Richard Ye worked with the `business description` of 1500+ 10K Annual Report Filings from SEC, with support from [Ubineer](https://www.ubineer.com/).

# Packages used

- os
- json
- re
- scikit-learn
- pandas
- numpy
- nltk
- bertopic
- matplotlib
- plotly

In [1]:
import os
import json
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
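# One-time downloads of the NLTK data used below:
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('punkt')  # tokenizer models needed by word_tokenize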

## Topics explored

1. Word Embedding techniques
2. Distance metrics on those word embeddings
3. Topic Modelling/Embedding
4. Dynamic Word Embedding comparisons

## Data cleaning/processing

The data was pulled and preprocessed for us by Ubineer. One of the main datasets we used is the `bq_2018_top5SIC.json` file prepared by Professor Sotiros Damouras, which contains companies that filed in 2018 and belong to the top 5 industries within the dataset. This file has 1127 filings (one per company).

The file schema contains the columns:
- `accessionNumber`
- `filingDate`
- `reportingDate`
- `financialEntity`
- `htmlFile`
- `coDescription`
- `CIK`
- `name`
- `countryinc`
- `cityma`
- `SIC`
- `SIC_desc`

For our purposes, we will be focusing on `name` (identifies the company), `coDescription` (the Business Overview), and `SIC_desc` (the industry the company operates in).

Within our pre-processing, we focus on `coDescription`.
We cleaned up the `coDescription` text by removing leftover HTML code, stripping the first couple of words that were common among all filings, such as _"business overview"_, and keeping only filings whose description has at least 250 characters.

We then removed _stop words_ from the business descriptions. These are very common words such as "the", "they", "and", "is", "are", and "there". They carry little meaning on their own and therefore do not contribute to our goal of extracting meaning.

We also lemmatized the words (a form of text/word normalization), so that, for example, "am", "are", and "is" are converted to "be", and "playing", "played", and "plays" are all converted to "play". This reduces the number of distinct words we have to process and condenses the information we receive, since words that carry the same meaning are represented together.

In [None]:
df = pd.read_json("bq_2018_top5SIC.json", lines = True)

# strip any leftover HTML and the boilerplate "Item 1. Business Overview" header
def clean_data_fn(insrt_data):
    clean_txt = re.compile('<.*?>')  # non-greedy match of a single HTML tag
    clean_data = []
    for idx, ele in insrt_data.iterrows():
        # skip filings whose description contains an EDGAR archive link
        if "https://www.sec.gov/Archives/edgar/data/" in ele["coDescription"]:
            continue
        desc = re.sub(clean_txt, '', ele["coDescription"]).replace(u'\xa0', u' ').replace("   ", "").replace("'", "").replace('"', '')
        # cut the text at any stray "<" left behind by an unclosed tag
        stray = re.search('<', desc)
        if stray:
            desc = desc[:stray.start()]
        desc = desc.lower()
        business_pos = desc.find("business")
        if 0 <= business_pos < 20:  # "business" is part of the header: drop everything up to and including it
            desc = desc[business_pos + 8:]
        else:  # no "business" in the first 20 characters: remove the "Item 1." stuff only
            desc = desc[6:]
        overview_pos = desc.find("overview")
        if 0 <= overview_pos <= 20:  # "overview" is part of the header: drop it as well
            desc = desc[overview_pos + 8:]
        # remove leading white space and periods
        desc = re.sub(r"^[\s.]+", "", desc).strip()
        # keep only filings with at least 250 characters of description (less is not enough information for us)
        if len(desc) >= 250:
            new_data = ele.copy()
            new_data["coDescription"] = desc
            clean_data.append(new_data)

    return pd.DataFrame(clean_data)

non_html_data = clean_data_fn(df)#.rename(columns = {"financialEntity":"CIK"})
non_html_data["CIK"] = non_html_data["CIK"].astype(int)

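#lemmatization
lemmatizer = WordNetLemmatizer()

def lemmatize_sentence(sentence):
    # note: lemmatize() defaults to pos="n" (noun), so verb forms such as "playing"
    # are only reduced if a part-of-speech tag is passed, e.g. lemmatize(w, pos="v")
    lemmatized_output = [lemmatizer.lemmatize(w) for w in word_tokenize(sentence)]
    return " ".join(lemmatized_output)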

lemma_desc = non_html_data["coDescription"].apply(lemmatize_sentence)
non_html_data["coDescription_lemmatized"] = lemma_desc
non_html_data["coDescription_lemmatized"].head()

# remove all numbers so they don't show up as dimensions
def remove_nums(x):
    text = x.lower()
    text = re.sub(r'\d+', '', text)
    return text

# remove stopwords and punctuation
stop_words = set(stopwords.words('english'))  # build the set once, not on every call

def remove_stopwords(x):
    word_tokens = word_tokenize(x)
    # keep alphanumeric tokens that are not stop words
    filtered_sentence = ' '.join([w for w in word_tokens if w.lower() not in stop_words and w.isalnum()])
    return filtered_sentence

rm_num_stopwords = non_html_data["coDescription_lemmatized"].apply(remove_nums).apply(remove_stopwords)
non_html_data["coDescription_stopwords"] = rm_num_stopwords

non_html_data.head()
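
To make the HTML-stripping step concrete, here is a minimal sketch on a made-up string. The helper `strip_tags` and the sample text are hypothetical. Unlike `clean_data_fn`, it replaces each tag with a space and collapses whitespace, an assumption chosen here so that words on either side of a tag do not run together.

```python
import re

# Non-greedy pattern: matches one HTML tag at a time, as in clean_data_fn
TAG_RE = re.compile(r"<.*?>")

def strip_tags(raw):
    """Hypothetical helper: drop HTML tags and non-breaking spaces, then collapse whitespace."""
    text = TAG_RE.sub(" ", raw).replace("\xa0", " ")
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample, not taken from a real filing
sample = "<p>Item 1.\xa0Business</p><div>We design <b>software</b> tools.</div>"
cleaned = strip_tags(sample)
print(cleaned)  # Item 1. Business We design software tools.
```

A greedy pattern (`<.*>`) would instead match from the first `<` to the last `>` and delete the whole sample, which is why the non-greedy `?` matters.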

## Evaluation and visualization techniques

In this project we focused on visually examining 2- and 3-dimensional plots of the word embeddings, reduced using PCA and Truncated SVD (the latter specifically for LSA). We also had access to extra information for the 2018 filings, which allowed us to evaluate the word embedding clusters against each company's actual industry classification. This was done using a simple 1-nearest-neighbour (1-NN) classifier. The results were put into a confusion matrix, which shows how well 1-NN recovers the industry labels from our word embedding.
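
The 1-NN evaluation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 2-D points and industry labels are made up, the helper names are hypothetical, and the leave-one-out scheme (each filing is labelled by its nearest other filing) is an assumption, since the exact evaluation setup is not spelled out here.

```python
import math
from collections import Counter

# Made-up 2-D embeddings (e.g. after dimensionality reduction) with made-up labels
points = [
    ((0.0, 0.1), "software"),
    ((0.2, 0.0), "software"),
    ((0.1, 0.3), "software"),
    ((2.0, 2.1), "pharma"),
    ((2.2, 1.9), "pharma"),
    ((0.4, 0.4), "pharma"),  # deliberately placed near the software group
]

def one_nn_leave_one_out(points):
    """Predict each point's label from its nearest *other* point (Euclidean distance)."""
    predictions = []
    for i, (vec, _) in enumerate(points):
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(vec, points[j][0]),
        )
        predictions.append(points[nearest][1])
    return predictions

def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs; these are the confusion-matrix cells."""
    return Counter(zip(actual, predicted))

actual = [label for _, label in points]
predicted = one_nn_leave_one_out(points)
matrix = confusion_counts(actual, predicted)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(points)
```

Off-diagonal cells such as `matrix[("pharma", "software")]` count filings whose nearest neighbour belongs to a different industry, so a good embedding concentrates the counts on the diagonal.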

# 1. Word Embeddings techniques

## Term Frequency/Counter Vectorizer/Bag of Words

We started with the basic term frequency matrix, which represents each company description as a vector over `n` words/terms (`n` is a hyperparameter): each dimension is a word/term, and the value is the count of that word in the document. The number of unique words across all documents can be very large, so in this technique we keep only the `n` words that occur most often across all documents.


\begin{align}
\text{The formula:}\ \ \text{tf}(t,d) &= \text{number of times term } t \text{ occurs in document } d
\end{align}

This technique helps us analyze how many of the `n` words each filing contains, which provides us with information about the kind of terms or topics each company may discuss. This approach is very easy to implement, but is not very powerful because very common words will have the largest values and therefore carry the most weight. For financial statements like these, you can expect words like "financial" and "report" to have some of the highest values.

From these vectors for each company filing, we can treat each term as a dimension and place each `n`-dimensional vector in an `n`-dimensional space, which is called a **word embedding**. You can think of the filings as points in this `n`-dimensional space.
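
As a rough illustration of the term frequency matrix, the sketch below builds one by hand for three made-up descriptions. The helper name `term_frequency_matrix`, the whitespace tokenisation, and the alphabetical tie-breaking are all assumptions made for this example; in practice scikit-learn's `CountVectorizer` with `max_features=n` performs a comparable top-`n` selection.

```python
from collections import Counter

# Made-up, already-cleaned descriptions
docs = [
    "we develop cloud software and software tools",
    "we develop cancer drugs",
    "cloud software for drug research",
]

def term_frequency_matrix(docs, n):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    corpus_counts = Counter(tok for doc in tokenized for tok in doc)
    # keep the n most frequent terms across all documents (ties broken alphabetically)
    top_terms = sorted(corpus_counts, key=lambda t: (-corpus_counts[t], t))[:n]
    vocab = sorted(top_terms)
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

vocab, matrix = term_frequency_matrix(docs, n=3)
print(vocab)   # ['cloud', 'develop', 'software']
print(matrix)  # [[1, 1, 2], [0, 1, 0], [1, 0, 1]]
```

Each row is one filing's coordinates in the 3-dimensional space spanned by the selected terms.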

## Term Frequency - Inverse Document Frequency (tf-idf)

To address the problem of very common words dominating, we moved on to the tf-idf technique. tf-idf augments the term frequency matrix we created above by multiplying each word's count in each document by its "importance" to that document. The details are within the [1_Tf-idf_analysis.ipynb](https://github.com/richardye101/ubineer_nlp_research/blob/main/content/richard/1_Tf-idf_analysis.ipynb) notebook. This technique adjusts the weighting of terms used in the word embedding so that the _points_ representing each company filing more accurately reflect where companies sit in this `n`-dimensional space relative to other companies. For example, ideally we want technology companies close together, and pharmaceutical companies close together.

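\begin{align}
\text{tf-idf}(t,d) &= \text{tf}(t,d) \cdot \text{idf}(t)\\ \\
\text{Where: } \quad \text{tf}(t,d) &= \text{number of times term } t \text{ occurs in document } d\\ \\
\text{idf}(t) &= \log\frac{N}{\text{df}(t)}
\end{align}

Here `N` is the total number of documents and `df(t)` is the number of documents that contain the term `t`. The inverse document frequency depends only on the term, not on the individual document.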
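
As a worked illustration of the formula above, the following pure-Python sketch computes the weights directly from the definition. The three toy documents and the helper name `tf_idf` are made up for illustration, and the notebook's own implementation may differ. For instance, scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2 normalisation by default, so its values will not match these.

```python
import math
from collections import Counter

# Made-up, already-tokenised documents
docs = [
    ["financial", "report", "software", "software"],
    ["financial", "report", "drug"],
    ["financial", "software", "cloud"],
]

def tf_idf(docs):
    """Return one {term: tf * log(N / df)} dictionary per document."""
    n_docs = len(docs)
    # df(t): number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf(t, d): raw count of t in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(docs)
for w in weights:
    print({term: round(value, 3) for term, value in w.items()})
```

Because "financial" occurs in every document, its idf is `log(3/3) = 0` and it drops out of every vector, while "drug", which occurs in a single document, receives the largest idf of `log(3)`.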

