# Seeking Alpha Market News NLP and Topic Modeling

Using unsupervised machine learning techniques, we analyze Seeking Alpha's news posts to discover topics and track how they trend over time. Tools include MongoDB, numpy, NLTK, spaCy, scipy, and TextBlob.

Final Model: NMF on TF-IDF-vectorized text, reduced to a 15-dimensional topic space

Topics: Valuation, IPO/SEOs, Recommendations, Pharmaceuticals, Capital Markets, Energy, Management, Earnings, Federal Reserve, Retail, Technology, Mergers & Acquisitions, Debt Offerings, Corporate Strategy, Job Market

### Import relevant libraries

In [None]:
import numpy as np
import random
from datetime import datetime
import time
import pickle
from pymongo import MongoClient

import re
import string
import nltk
from nltk.corpus import stopwords
import spacy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy.sparse.linalg import svds
from sklearn.decomposition import NMF, TruncatedSVD
from stemming.porter2 import stem
from sklearn.preprocessing import normalize, StandardScaler, RobustScaler, robust_scale, QuantileTransformer

nlp = spacy.load('en')
stopwords = stopwords.words()

### Import data from MongoDB

In [None]:
client = MongoClient()
db = client.sa_structured_v2
docs = db.collections

## NLP

Add words to the stopword list (these are overrepresented in the corpus and/or carry very little semantic value).

In [None]:
stopwords += ['new', 'cash', 'flow'] + ['time', 'times'] + ['quick', 'loss', 'include', 'including', 'included', 'total', 'form', 'perform', 'gross', 'operating', 'expense', 'expenses', 'asset', 'assets', 'flow', 'margin', 'average', 'firm', 'according', 'rather', 'among', 'given', 'know', 'take', 'look', 'loss', 'using', 'january', 'february', 'april', 'june', 'july', 'august', 'september', 'october', 'november', 'december', 'week', 'recent', 'although', 'making', 'reported', 'since', 'looks', 'see', 'things', 'made', 'another', 'showed', 'shows', 'show', 'told', 'likely', 'use', 'almost', 'due', 'used', 'seeing', 'dividend', 'dividends', 'saw', 'could', 'like', 'yet', 'day', 'saying', 'comes', 'would', 'sees', 'others', 'via', 'ago', 'eps', 'say', 'results', 'result', 'get', 'owned', 'years', 'got', 'set', 'expects', 'basis', 'points', 'prices', 'result', 'begin', 'conference', 'call', 'announced', 'announce', 'announces', 'div', 'payable', 'year', 'share', 'consensus', 'view', 'open', 'boe', 'price', 'profit', 'revenue', 'says', 'bps', 'ah', 'earning', 'interest', 'said', 'net', 'bbl', 'adj', 'inc', 'guidance', 'income', 'shares', 'revenues', 'earnings', 'press', 'release', 'unit', 'units', 'reports', 'nthe', 'sales', 'report', 'may', 'etf', 'misses', 'miss', 'month', 'beat', 'beats', 'line']
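
One way to spot overrepresented words is to look at document frequency, i.e. the share of posts a word appears in. The sketch below is only an illustration of that idea: `document_frequency`, the sample posts, and the 0.9 threshold are hypothetical and are not how the list above was built.

```python
from collections import Counter

def document_frequency(corpus):
    """Return the fraction of documents each lowercase token appears in."""
    counts = Counter()
    for text in corpus:
        # use a set so a word counts once per document
        counts.update(set(text.lower().split()))
    n_docs = len(corpus)
    return {word: c / n_docs for word, c in counts.items()}

# made-up posts, for illustration only
sample_posts = [
    "Company reports earnings and revenue",
    "Fed raises rates as earnings season begins",
    "Oil prices fall while earnings beat estimates",
]
df = document_frequency(sample_posts)
# words present in at least 90% of documents are stopword candidates
candidates = sorted(w for w, frac in df.items() if frac >= 0.9)
print(candidates)
```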

### Keep relevant posts and clean remaining documents

Many posts consisted of stock tickers and their daily gains/losses or of companies announcing dividends or earnings, or otherwise had very little semantic value. They were mostly numbers or, where they had text, repeated the same words from post to post, so they were deemed useless for this analysis. Filtering cut the number of documents in half (around 200k, 2013-2017), but the remaining documents have much greater semantic value.

This function is rather verbose, but given the format and needs of my corpus it works quite well.
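
To make the individual steps easier to follow, here is a minimal sketch of only the regex-based part of the cleaning, applied to one made-up post. It deliberately leaves out the spaCy named-entity removal, the stopword filter, and stemming; `strip_noise` and the sample string are hypothetical names used for illustration.

```python
import re
import string

def strip_noise(text):
    """Apply the regex-only cleaning steps to a single post."""
    text = re.sub(r'Source:.*', '', text)   # drop trailing source attribution
    text = re.sub(r'\(.*?\)', '', text)     # drop parenthetical asides
    # drop any whitespace-delimited token that contains a digit
    text = ' '.join(w for w in text.split() if not any(c.isdigit() for c in w))
    text = text.replace('IPO', 'ipo')       # protect 'IPO' from the ticker filter
    text = re.sub(r'[A-Z]{3,}', '', text)   # drop runs of 3+ capitals (most tickers)
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # lowercase and keep tokens longer than two characters
    return [w.lower() for w in text.split() if len(w) > 2]

sample = "AAPL (NASDAQ:AAPL) prices its IPO at $17 per share. Source: press release"
print(strip_noise(sample))
```

Note that the order matters: `IPO` has to be lowercased before the uppercase-run filter, otherwise it would be removed along with the tickers.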

In [None]:
def clean_text(docs, stopwords=stopwords):
    '''
    Iterate through each SA post and check the title to see whether or not the document should be kept in the corpus.
    If the document is to be kept, clean documents removing punctuation, numbers, named entities, uppercase letters, and condense words to their stems.
    Add cleaned text to existing MongoDB documents and ID these documents with a 1 for `keep`.
    Certain posts won't have any remaining words after they are cleaned and therefore need to be removed from the corpus.
    Irrelevant posts and ones that were removed during the cleaning process receive a 0 for `keep`.
    ---
    IN:
        docs: MongoDB collection of documents (title, text, and date are important keys)
        stopwords: list of words that have little semantic value (default list from NLTK plus words from above)
    OUT:
        If kept and cleaned, will add cleaned text to Mongo document along with a 1 for `keep`
        If removed, cleaned text will not be added and 0 will be passed to `keep`
        Timestamp added to every document
    '''

    for doc in docs.find():
        
        #Add timestamp to Mongo document
        dt = datetime.strptime(doc['date'], '%m/%d/%y')
        t_tuple = dt.timetuple()
        ts = int(time.mktime(t_tuple))
        docs.update_one({'_id': doc['_id']}, {'$set': {'timestamp': ts}})
        
        #Flag posts with little semantic value (ticker roundups, dividend/earnings announcements, etc.)
        title = doc['title']
        if ('Gainers / Losers' in title
                or doc['text'] == ''
                or 'On the hour' in title
                or 'At the close' in title
                or 'At the open' in title
                or 'dividend' in title
                or ('Today\'s' in title and 'performance' in title)
                or re.search('declares ?.* distribution', title)
                or re.search('announces ?.* distribution', title)
                or re.search(r'reports ?.* results', title)
                or re.search(r'reports ?.* earnings', title)
                or 'Notable earnings' in title
                or ('More on' in title and re.search(r'Q\d', title))
                or 'car registrations' in title
                or 'load factor' in doc['text'].lower()):
            docs.update_one({'_id': doc['_id']}, {'$set': {'keep': 0}})
        
        else:
            #Copy post text from Mongo and identify named entities
            text = doc['text']
            ents = nlp(text).ents
            
            #Remove text at the end of the posts that reference past posts/information as well as anything within parentheses
            text = re.sub(r'Previously:.*\)', ' ', text)
            text = re.sub(r'Earlier:.*', '', text)
            text = re.sub(r'Source:.*', '', text)
            text = re.sub(r'\(.*?\)', '', text)

            #Remove named entities with the exceptions listed below
            for ent in ents:
                e = str(ent)
                if e not in ['Fed', 'Federal Reserve', 'FDA', 'bitcoin', 'Bitcoin', 'ECB', 'BOJ', 'Underweight', 'Overweight', 'IPO']:
                    text = text.replace(e, '')
            
            #Remove any strings that contain numbers
            text = ' '.join(word for word in text.split() if not any(char.isdigit() for char in word))
            #Ensure 'IPO' is retained
            text = text.replace('IPO', 'ipo')
            #Remove any runs of at least 3 consecutive uppercase letters, this removes most tickers
            text = re.sub(r'[A-Z]{3,}', '', text)
            #Remove all punctuation
            text = re.sub('[%s]' % re.escape(string.punctuation + '…' + "“" + "’"), ' ', text)

            #Lower all words, remove stopwords, and append stemmed words to cleaned_words list
            cleaned_words = []
            for word in text.split(' '):
                lower_word = word.lower()
                if lower_word not in stopwords and len(lower_word) > 2 and len(lower_word) < 15:
                    cleaned_words.append(stem(lower_word))
                    
            #Add cleaned text to Mongo document, rejects posts that had most of text removed during cleaning process
            if len(cleaned_words) > 5:
                docs.update_one({'_id': doc['_id']}, {'$set': {'keep': 1, 'cleaned_text': ' '.join(cleaned_words)}})
            
            else:
                docs.update_one({'_id': doc['_id']}, {'$set': {'keep': 0}})

clean_text(docs)

#### List of remaining documents

In [None]:
final_docs = [doc for doc in docs.find() if doc['keep'] == 1]

#### List of remaining documents' texts (Corpus)

In [None]:
cleaned_data = [doc['cleaned_text'] for doc in final_docs]

### Count Vectorize data (w/ StandardScaler and Normalization)

In [None]:
count_vectorizer = CountVectorizer(ngram_range=(1, 2), dtype='f', max_features=int(1e6))
cv_data = count_vectorizer.fit_transform(cleaned_data)

ss_cv = StandardScaler(with_mean=False)
cv_data_ss = ss_cv.fit_transform(cv_data)
cv_data_norm = normalize(cv_data)

#rs_cv = RobustScaler(with_centering=False)
#cv_data_rs = robust_scale(cv_data, with_centering=False)
#cv_data_qt = QuantileTransformer(n_quantiles=5).fit_transform(cv_data)

### TF-IDF Vectorize data (w/ StandardScaler and Normalization)

In [None]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), dtype='f', max_features=int(0.5e6))
tfidf_data = tfidf_vectorizer.fit_transform(cleaned_data)

ss_tfidf = StandardScaler(with_mean=False)
tfidf_data_ss = ss_tfidf.fit_transform(tfidf_data)
tfidf_data_norm = normalize(tfidf_data, axis=1)

#rs_tfidf = RobustScaler(with_centering=False)
#tfidf_data_rs = robust_scale(tfidf_data, with_centering=False)
#tfidf_data_qt = QuantileTransformer(n_quantiles=5).fit_transform(tfidf_data)
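
For intuition about what the TF-IDF weights look like, the sketch below computes them by hand on a toy corpus using the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization, which is what `TfidfVectorizer` does with its default settings. It is a simplified illustration: it uses unigrams only (the cell above also uses bigrams), and `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(corpus):
    """Compute L2-normalized TF-IDF rows with smoothed IDF for a toy corpus."""
    docs = [Counter(text.split()) for text in corpus]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1 for t in vocab}
    rows = []
    for d in docs:
        raw = [d[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["oil gas pipelin", "oil crude", "fed rate"])
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 3))
```

In the first document, `oil` receives a lower weight than `gas` because it also appears in the second document.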

### Function that displays topics

In [None]:
def display_topics(model_comp, feature_names, no_top_words, topic_names=None):
    '''
    Print the top words for each topic
    IN:
        model_comp: Model Component Matrix (dim x word component matrix)
        feature_names: words (columns of doc/word matrix)
        no_top_words: number of words to display
        topic_names: predetermined topic names
    OUT:
        Prints topic names (default numbers) and the top (no_top_words) words for each topic
    '''
    for ix, topic in enumerate(model_comp):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
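
The slice `argsort()[:-no_top_words - 1:-1]` is compact but easy to misread: `argsort` returns indices in ascending order of weight, and the slice walks backwards from the end to pick the `no_top_words` largest. A small sketch with made-up weights and words:

```python
import numpy as np

# hypothetical topic row and vocabulary, for illustration only
weights = np.array([0.1, 0.7, 0.0, 0.4, 0.9])
words = ['bank', 'oil', 'fed', 'drug', 'ipo']

n_top = 3
# indices of the n_top largest weights, largest first
top_idx = weights.argsort()[:-n_top - 1:-1]
top_words = [words[i] for i in top_idx]
print(top_words)
```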

## Final Model (NMF w/ TF-IDF-vectorized data, reduced to 15-dimensional "topic space")

In [165]:
nmf_tfidf = NMF(n_components=15)
W_tfidf_15 = nmf_tfidf.fit_transform(tfidf_data_norm)
H_tfidf_15 = nmf_tfidf.components_

print('TF-IDF NMF (k=15)')
display_topics(H_tfidf_15, tfidf_vectorizer.get_feature_names(), 20)

TF-IDF NMF (k=15)

Topic  0
bank, invest, fund, book, valu, loan, book valu, manag, capit, portfolio, credit, equiti, return, busi, invest bank, debt, hedg, ratio, market, investor

Topic  1
common, common stock, offer common, stock, public offer, public, allot, underwrit allot, underwrit, stock underwrit, close, close date, offer, date, warrant, proceed, allot addit, addit close, volum, corpor purpos

Topic  2
target, buy, upgrad, downgrad, buy rate, upgrad buy, rate target, initi, buy target, overweight, analyst, hold, stock, target initi, rais, invest, coverag, initi buy, rate, valuat

Topic  3
patient, treatment, studi, trial, phase, clinic, cancer, cell, approv, clinic trial, respons, endpoint, data, assess, drug, dose, primari, develop, diseas, therapi

Topic  4
stock, yield, index, higher, futur, lower, market, trade, gold, crude, gain, oil, sector, ahead, fed, nasdaq, investor, high, flat, crude oil

Topic  5
oil, gas, product, natur gas, natur, project, pipelin, field, crude, 
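
As a reminder of what the factorization does: NMF approximates the non-negative document-term matrix `V` as `W @ H`, where `W` (documents x topics) gives each document's topic weights and `H` (topics x terms) gives each topic's word weights. The sketch below uses the classic multiplicative-update rule on a tiny matrix to illustrate this. It is not the solver scikit-learn uses by default, and `nmf_multiplicative` and the toy matrix are hypothetical.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=500, seed=0, eps=1e-9):
    """Factor non-negative V (docs x terms) into W (docs x k) and H (k x terms)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k))
    H = rng.random((k, V.shape[1]))
    for _ in range(n_iter):
        # multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# two documents about terms 0-1, two documents about terms 2-3
V = np.array([[1., 1., 0., 0.],
              [2., 2., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 3., 3.]])
W, H = nmf_multiplicative(V, k=2)
print(np.round(W @ H, 2))
```

Because both factors are non-negative, each row of `H` can be read directly as a weighted word list, which is what `display_topics` prints.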

## Final list of topics (k=15)

In [None]:
topics = ["Valuation", "IPO", "Recommendations", "Pharmaceuticals", "Capital Markets", "Energy", "Management", "Earnings", "Federal Reserve", "Retail", "Technology", "Mergers & Acquisitions", "Debt Offerings", "Corporate Strategy", "Job Market"]

## Extract results from document-topic component matrix

In [None]:
start = 1356998400 #01/01/2013
stop = 1509753600 #11/04/2017

def strength(start, stop, time):
    '''
    Aggregate each topic's strength over all documents for a given time period.
    Simply averages topic strength over all documents in each time period.
    ---
    IN:
        start: start date as a Unix timestamp (01/01/2013)
        stop: end date as a Unix timestamp (stop condition, 11/04/2017)
        time: length of each time period in days (i.e. days = 1, weeks = 7, etc.)
    OUT:
        List of topic strengths, (no. of time periods x topics (15))
    '''
    strengths = []
    while start < stop:
        end = start + 86400*time
        subset_indexes = [idx for idx, doc in enumerate(final_docs) if doc['timestamp'] >= start and doc['timestamp'] < end]
        if subset_indexes:
            #Index rows directly so the result does not depend on final_docs being sorted by timestamp
            doc_mat_subset = W_tfidf_15[subset_indexes]
            topic_avg = doc_mat_subset.mean(axis=0)
            strengths.append([topic_avg[i] for i in range(len(topics))])
        else:
            #No documents in this time period
            strengths.append([0 for i in range(len(topics))])
        start = end
    
    return strengths

#Topic strengths for every day
day_strengths = strength(start,stop,1)
#Topic strengths for every week
week_strengths = strength(start,stop,7)
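
The same windowed averaging can be written with a boolean mask over an array of timestamps. The sketch below shows the idea on toy data; `window_means`, the timestamps, and the 2-topic matrix are hypothetical stand-ins for `final_docs` and `W_tfidf_15`.

```python
import numpy as np

def window_means(timestamps, doc_topic, start, stop, days):
    """Average document-topic rows within consecutive windows of `days` days."""
    timestamps = np.asarray(timestamps)
    width = 86400 * days
    out = []
    while start < stop:
        mask = (timestamps >= start) & (timestamps < start + width)
        # empty windows get a row of zeros rather than NaN
        if mask.any():
            out.append(doc_topic[mask].mean(axis=0))
        else:
            out.append(np.zeros(doc_topic.shape[1]))
        start += width
    return np.array(out)

ts = [0, 10, 86400, 3 * 86400]                      # four toy documents
W_toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
result = window_means(ts, W_toy, start=0, stop=4 * 86400, days=1)
print(result)
```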

### Make list of dates for each time period

In [None]:
start = 1356998400 #01/01/2013
stop = 1509753600 #11/04/2017

def make_dates(start,stop,time):
    '''
    Similar to the function above, but simply produces a list of dates for the given time period
    ---
    IN:
        start date, end date, time periods (i.e. days, weeks, etc.)
    OUT:
        list of dates
    '''
    dates = []
    while start < stop:
        date = datetime.fromtimestamp(start).date()
        dates.append(int(date.strftime("%Y%m%d")))
        start = start + 86400*time
    return dates

days = make_dates(start,stop,1)
weeks = make_dates(start,stop,7)
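
One thing to be aware of: `datetime.fromtimestamp` without a `tz` argument converts to the machine's local time zone, so a start value of 1356998400 (midnight UTC on 01/01/2013) can map to 12/31/2012 on machines west of UTC. A minimal sketch of a UTC-pinned variant, assuming the timestamps are meant to be read as UTC (`utc_dates` is a hypothetical name):

```python
from datetime import datetime, timezone

def utc_dates(start, stop, days):
    """List YYYYMMDD integers for each window start, interpreted in UTC."""
    dates = []
    while start < stop:
        d = datetime.fromtimestamp(start, tz=timezone.utc).date()
        dates.append(int(d.strftime("%Y%m%d")))
        start += 86400 * days
    return dates

days_utc = utc_dates(1356998400, 1509753600, 1)
print(days_utc[0], days_utc[-1], len(days_utc))
```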

## Export topic time series data to CSV (for d3 visualization)

In [None]:
with open("week_records.csv", "w") as f:
    #Header row: date followed by the 15 topic names
    f.write(",".join(["date"] + topics) + "\n")
    for date,values in zip(weeks,week_strengths):
        #One row per week: date followed by the comma-separated topic strengths
        f.write('%d'%date + ',' + ','.join('%f'%v for v in values) + '\n')
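
An alternative is the standard library's `csv` module, which handles delimiters and quoting (useful if a topic name ever contains a comma). The sketch below writes to an in-memory buffer so it can be run on its own; `write_series` and the sample values are hypothetical.

```python
import csv
import io

def write_series(handle, header, dates, rows):
    """Write one CSV row per date: the date followed by its topic strengths."""
    writer = csv.writer(handle, lineterminator='\n')
    writer.writerow(['date'] + list(header))
    for date, values in zip(dates, rows):
        writer.writerow([date] + ['%f' % v for v in values])

buf = io.StringIO()
write_series(buf,
             ['Energy', 'Mergers & Acquisitions'],
             [20130101, 20130108],
             [[0.1, 0.2], [0.3, 0.4]])
print(buf.getvalue())
```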

### Save final model and NMF component matrices

In [None]:
with open('doc_topic_mat_15.pkl', 'wb') as f:
    pickle.dump(W_tfidf_15, f)
    
with open('word_topic_mat_15.pkl', 'wb') as f:
    pickle.dump(H_tfidf_15, f)
    
with open('tfidf_nmf_model_15.pkl', 'wb') as f:
    pickle.dump(nmf_tfidf, f)

## LSA/SVDS

This was a brief attempt at LSA using `svds`. It never yielded any actionable results; NMF performed much better.
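
For reference, LSA keeps the top `k` singular directions of the document-term matrix. The sketch below illustrates this on a small dense matrix with `numpy.linalg.svd` rather than the sparse `svds` call used in this notebook; `lsa` and the toy matrix are hypothetical.

```python
import numpy as np

def lsa(X, k):
    """Rank-k LSA: return document coordinates and the k term directions."""
    # numpy returns singular values in descending order
    U, s, VT = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k], VT[:k]

X = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 1., 3.]])
doc_coords, term_dirs = lsa(X, k=2)
print(np.round(term_dirs, 2))
```

Unlike the NMF factors, the term directions are orthonormal and may contain negative entries, which tends to make them harder to read as word lists.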

In [None]:
k = 15
U_cv, s_cv, VT_cv = svds(cv_data_ss, k=k)
U_tfidf, s_tfidf, VT_tfidf = svds(tfidf_data_ss, k=k)

In [1213]:
display_topics(VT_tfidf, count_vectorizer.get_feature_names(), 20)


Topic  0
least anytim, longer train, knowledg word, thought perfect, way pure, role human, your play, train mayb, brute, brute forc, self play, liter start, start noth, complet self, pure learn, play liter, play learn, random longer, soon interview, evalu constrain

Topic  1
leverag gross, provid daylight, seem pri, mixtur competit, enterpris gradual, loom undiscrimin, spend incumb, leap advantag, fewer deal, hybrid stori, storag general, aggress newcom, help assum, cloud loom, compani midmarket, migrat acceler, big contribut, away hybrid, cannot seem, self market

Topic  2
hybrid stori, self market, help assum, fight fewer, startup leap, storag general, unchang reinvest, reinvest solv, contributor difficulti, insist unchang, suspect public, pri custom, trend sweep, acceler caus, cycl lost, difficulti lack, base cannot, leverag gross, particular buyer, aggress startup

Topic  3
eventu contact, car exampl, headset enabl, headset current, biggest acceler, beyond smartph, beyond beyond, 

In [1214]:
display_topics(VT_cv, count_vectorizer.get_feature_names(), 20)


Topic  0
howev simpl, sampl level, quantif dystrophin, baselin sampl, shoot big, differ distanc, fluoresc stain, quantit walk, surfac effect, subject ad, endpoint undermin, dystrophi read, inconsist immunofluoresc, sampl normal, fiber muscl, percentag dystrophin, done immunofluoresc, thumb review, clean data, level confer

Topic  1
debut gen, bulb smart, homepod goe, goe introduc, forthcom homepod, screen spot, monitor part, sell hue, bundl updat, similar physic, issu wasnt, bullet didnt, act buzzer, tweeter help, built hub, game integr, audio featur, remot act, embed far, swappabl cloth

Topic  2
kit framework, user desktop, codenam heavili, map open, renam next, mac watch, hub control, pictur echo, interact messag, music paid, echo process, icloud free, icloud drive, unlock mac, clipboard, sub updat, watch icloud, clipboard simultan, tab support, notif interact

Topic  3
misstep cite, stake inevit, quit distanc, bias realiti, provid incorrect, true imposs, familiar biotech, know ext