CS410 Final Project: Iterative Topic Modeling with Time Series Feedback

Team Starks

Saad Rasheed - srashee2
Javier Huamani - huamani2
Sai Allu - allu2

Demonstration

https://youtu.be/BC6qcxoF8XQ

Purpose

Team Starks set out to recreate the paper Mining Causal Topics in Text Data.

Title
Mining Causal Topics in Text Data: Iterative Topic Modeling with Time Series Feedback

Authors
Hyun Duk Kim, Malu Castellanos, Meichun Hsu, ChengXiang Zhai, Thomas Rietz, and Daniel Diermeier

Citation
Mining causal topics in text data: Iterative topic modeling with time series feedback. In Proceedings of the 22nd ACM international conference on information & knowledge management (CIKM 2013). ACM, New York, NY, USA, 885-890. DOI=10.1145/2505515.2505612

Introduction

In our final project we create a text mining application that finds causal topics in corpus data. The application starts from a probabilistic topic model and uses time-series data to look for topics that are causally related to that series. At each iteration we refine the topics by feeding the previous results back in as prior distributions. To assess causality we use Granger causality tests, a popular method for detecting lead and lag relationships between time series.

Libraries

pandas - Used for data manipulation and analysis
scikit-learn - Used for classification, regression and clustering algorithms
nltk - Used for symbolic and statistical natural language processing
pyLDAvis - Used to help interpret topics from an LDA topic model
statsmodels - Used for statistical computations

Files

Corpus Data

NYT2000_1.csv
NYT2000_2.csv

IEM Stock Data

IEM2000.xlsx

Main Application

LDA.ipynb

Code Walkthrough

We begin by reading in the corpus data, which we split into two files so that it could be stored alongside the code. We then clean the data by dropping NaN values and applying other filters, strip unnecessary characters, and lemmatize the text.
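
As a rough illustration of the cleaning step, the sketch below drops missing article bodies and strips non-letter characters. The function names and the sample rows are hypothetical, not taken from the notebook, and lemmatization is left out because nltk's WordNetLemmatizer needs the WordNet corpus to be downloaded first.

```python
import math
import re


def clean_text(text):
    """Lowercase, replace every non-letter character with a space, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(rows):
    """Drop rows whose body is missing (None or NaN) and clean the rest."""
    cleaned = []
    for date, body in rows:
        if body is None or (isinstance(body, float) and math.isnan(body)):
            continue
        text = clean_text(body)
        if text:  # skip documents that are empty after cleaning
            cleaned.append((date, text))
    return cleaned


# Hypothetical sample rows: (date, article body)
rows = [
    ("2000-05-01", "Bush & Gore debate TAX cuts!"),
    ("2000-05-02", float("nan")),
    ("2000-05-03", None),
]
cleaned = clean_corpus(rows)
```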

We then read in the IEM data and normalize it.
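
The README does not say how the IEM prices are normalized. One plausible choice, assumed here, is to divide one candidate's contract price by the sum of both candidates' prices so that the series stays between 0 and 1. The function name and the sample prices are made up for illustration.

```python
def normalize_prices(dem_prices, rep_prices):
    """Return dem / (dem + rep) per day; days where both prices are zero yield None."""
    normalized = []
    for dem, rep in zip(dem_prices, rep_prices):
        total = dem + rep
        normalized.append(dem / total if total else None)
    return normalized


# Hypothetical daily prices for two candidates
dem = [0.52, 0.48, 0.50]
rep = [0.50, 0.52, 0.50]
series = normalize_prices(dem, rep)
```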

We then take the corpus data and generate the counts and vocabulary to create a document-term matrix.
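
To make the document-term matrix concrete, here is a minimal pure-Python sketch that builds a sorted vocabulary and one row of counts per document. It only illustrates the data structure; scikit-learn's CountVectorizer produces the same kind of matrix in sparse form, and the helper name `build_dtm` is an assumption.

```python
from collections import Counter


def build_dtm(docs):
    """Return (vocab, dtm) where dtm[i][j] counts vocab[j] in docs[i]."""
    vocab = sorted({token for doc in docs for token in doc.split()})
    index = {token: j for j, token in enumerate(vocab)}
    dtm = []
    for doc in docs:
        row = [0] * len(vocab)
        for token, count in Counter(doc.split()).items():
            row[index[token]] = count
        dtm.append(row)
    return vocab, dtm


# Hypothetical cleaned documents
docs = ["tax cut tax", "oil price", "tax oil"]
vocab, dtm = build_dtm(docs)
```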

Now we can fit the LDA model. We use 15 topics, which we found to be the optimal number. For each date we create a topic stream by aggregating topic coverage, and we plot the streams.
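
The topic stream step can be sketched as follows, assuming `doc_topic` holds one row of topic weights per document (for example the document-topic output of a fitted LDA model) and `dates` holds the matching publication dates. Summing per date is one way to aggregate coverage; the names and numbers are illustrative.

```python
def topic_streams(dates, doc_topic):
    """Sum topic weights over all documents that share a date.

    Returns (ordered_dates, streams) where streams[k][t] is the total
    coverage of topic k on ordered_dates[t].
    """
    n_topics = len(doc_topic[0])
    by_date = {}
    for date, weights in zip(dates, doc_topic):
        row = by_date.setdefault(date, [0.0] * n_topics)
        for k, weight in enumerate(weights):
            row[k] += weight
    ordered = sorted(by_date)
    streams = [[by_date[d][k] for d in ordered] for k in range(n_topics)]
    return ordered, streams


# Hypothetical two-topic output for three documents
dates = ["2000-05-02", "2000-05-01", "2000-05-01"]
doc_topic = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
ordered, streams = topic_streams(dates, doc_topic)
```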

To evaluate causality we run Granger tests on each topic and output the p-values of the F-tests at each lag. To determine the optimal lag we aggregate the p-values, then sort them in ascending order.
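
The lag selection and ranking might look like the sketch below. It assumes the p-values have already been collected per topic and lag, for instance from the `"ssr_ftest"` entry that statsmodels' `grangercausalitytests` returns for each lag. Aggregating by the mean across topics is an assumption, since the aggregation rule is not stated here, and the helper names are hypothetical.

```python
def best_lag(p_values):
    """Pick the lag with the smallest mean p-value across topics.

    p_values maps topic id -> {lag: p-value of the F-test at that lag}.
    """
    lags = sorted(next(iter(p_values.values())))
    mean_p = {
        lag: sum(per_lag[lag] for per_lag in p_values.values()) / len(p_values)
        for lag in lags
    }
    return min(mean_p, key=mean_p.get)


def rank_topics(p_values, lag):
    """Sort topic ids by their p-value at the chosen lag, most significant first."""
    return sorted(p_values, key=lambda topic: p_values[topic][lag])


# Hypothetical p-values for three topics at lags 1 and 2
p_values = {
    0: {1: 0.30, 2: 0.04},
    1: {1: 0.20, 2: 0.01},
    2: {1: 0.50, 2: 0.60},
}
lag = best_lag(p_values)
ranking = rank_topics(p_values, lag)
```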

For the top 25 words of each topic we run Granger causality tests and Pearson correlation tests. We only continue if the p-value is below 0.05. To create the priors we evaluate each topic by its negative or positive bias. If a topic has a dominant negative or positive bias, we create a prior for each word and assign it to a single topic. Conversely, if neither bias dominates, we split the words into two topics, and each word is assigned to one of them.
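
A minimal sketch of this prior construction is shown below. Several details are assumptions rather than the notebook's actual choices: the sign of the Pearson coefficient decides whether a word counts as positive or negative, a bias counts as dominant when at least 90% of the significant words share a sign, and each word's prior weight is proportional to 1 minus its p-value.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def build_priors(word_stats, alpha=0.05, dominance=0.9):
    """Turn one topic's word statistics into one or two word priors.

    word_stats maps word -> (granger_p_value, pearson_r).
    Returns a list of dicts mapping word -> normalized prior weight.
    """
    significant = {w: s for w, s in word_stats.items() if s[0] < alpha}
    if not significant:
        return []
    positive = {w: 1 - p for w, (p, r) in significant.items() if r > 0}
    negative = {w: 1 - p for w, (p, r) in significant.items() if r <= 0}
    positive_share = len(positive) / len(significant)
    if positive_share >= dominance:
        groups = [positive]            # dominant positive bias: single topic
    elif positive_share <= 1 - dominance:
        groups = [negative]            # dominant negative bias: single topic
    else:
        groups = [positive, negative]  # mixed bias: split into two topics
    priors = []
    for group in groups:
        total = sum(group.values())
        priors.append({w: weight / total for w, weight in group.items()})
    return priors


# Hypothetical statistics for four words of one topic
word_stats = {
    "tax": (0.01, 0.6),
    "cut": (0.02, 0.4),
    "oil": (0.03, -0.5),
    "weather": (0.40, 0.1),
}
priors = build_priors(word_stats)
```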

Our code then iterates with the generated priors (on the first iteration the priors are empty), fitting the LDA model again each time until the maximum number of iterations is reached.
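
The overall feedback loop can be outlined as below. This is only a control-flow sketch: `fit_model`, `score_topics`, and `make_priors` are hypothetical callables standing in for the LDA fit, the Granger tests, and the prior construction, and the stub implementations exist only to make the example runnable.

```python
def iterate_topic_model(fit_model, score_topics, make_priors, max_iter=5):
    """Refit the topic model max_iter times, feeding each round's priors into the next."""
    priors = []  # the first iteration starts without priors
    history = []
    for iteration in range(max_iter):
        model = fit_model(priors)
        p_values = score_topics(model)
        priors = make_priors(model, p_values)
        history.append(
            {"iteration": iteration, "p_values": p_values, "n_priors": len(priors)}
        )
    return history


# Stand-in stubs so the loop can be exercised without real data
seen_priors = []


def fake_fit(priors):
    seen_priors.append(list(priors))
    return "model"


def fake_score(model):
    return [0.01, 0.20]


def fake_priors(model, p_values):
    return [{"tax": 1.0}]


history = iterate_topic_model(fake_fit, fake_score, fake_priors, max_iter=3)
```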

How to run

The easiest way to run our code is to download Anaconda and run it through Jupyter Notebook.

$ git clone https://github.com/srashee2/CourseProject.git
$ cd CourseProject
$ jupyter notebook

You can now open LDA.ipynb and run all cells. The code takes roughly 1 hour to run.

Contributions

Team Starks came together over the course of a few months with weekly meetings to understand, learn and recreate Iterative Topic Modeling with Time Series Feedback. More specific contributions for the team members can be found below.

All team members did the following: library research, paper breakdown and documentation.

Saad Rasheed - Logistical work, Corpus Text Extraction, Presentation, and LDA modeling iteration
Javier Huamani - Text Filtering and Manipulation, LDA Modeling, Granger Causality, and Pearson Coefficient Tests
Sai Allu - Text Filtering and Manipulation, LDA Modeling, Granger Causality, and Presentation

