# DS1003 Final Project 
## Step 2 Data Preprocessing

1. Get tokens for each forum
2. Run them through LDA to get the X variables (topic proportions) for the regression
3. Run a logistic regression on them (all topics from a single post) to predict positive (1 for increase) or negative (0 for decrease)
4. Compare with our predictions based on sentiment analysis (accuracy)

Our poster will have five sections: data collection, data preprocessing (cleaning, tokenizing), sentiment analysis, LDA, and sLDA.
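Step 4 of the plan compares two sets of predictions against the same actual outcomes. A minimal sketch of that comparison follows, assuming both approaches produce 0/1 predictions per post. The arrays are hypothetical toy values, not project results, and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy labels: 1 = price increased, 0 = price decreased.
y_true = np.array([1, 1, 0, 1, 0, 1])

# Hypothetical predictions from the two approaches being compared.
pred_sentiment = np.array([0, 1, 0, 1, 1, 1])
pred_lda_logreg = np.array([1, 1, 0, 0, 0, 1])

# Accuracy = fraction of posts where the prediction matches the outcome.
acc_sentiment = accuracy_score(y_true, pred_sentiment)
acc_lda = accuracy_score(y_true, pred_lda_logreg)

print(f"Sentiment accuracy:  {acc_sentiment:.3f}")
print(f"LDA+logreg accuracy: {acc_lda:.3f}")
```

For the comparison to be fair, both accuracies should be computed on the same set of posts.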

In [1]:
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load data
data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'), categories=['sci.space', 'rec.sport.baseball', 'comp.graphics'])
docs = data.data
y = np.where(data.target > 0, 1, 0)  # Create a binary response variable

# Create a document-term matrix
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(docs)

# Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(doc_term_matrix, y, test_size=0.2, random_state=42)

# Train LDA
lda = LatentDirichletAllocation(n_components=3, random_state=0)
X_topics = lda.fit_transform(X_train)

# Use the document-topic distribution as features to predict the response variable
reg = LogisticRegression()
reg.fit(X_topics, y_train)

# Evaluate the model (note: this is accuracy on the training data, not on the held-out X_test)
print("Model score:", reg.score(X_topics, y_train))

# Print topics
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx}: {' '.join([words[i] for i in topic.argsort()[-10:]])}")


Model score: 0.8633093525179856
Topic 0: shuttle program information software edu available nasa image data space
Topic 1: bit format images files gif edu graphics file image jpeg
Topic 2: team know game good time year like just don think
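
The score printed in In [1] comes from `reg.score(X_topics, y_train)`, so it measures fit on the training rows. A rough sketch of scoring on held-out documents instead is shown here, assuming the vectorizer, LDA, and regression are chained in a scikit-learn pipeline so that each step is fit on the training split only. The corpus is a hypothetical toy set, not the newsgroups data, and no particular score should be expected from it.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: label 1 = space, label 0 = baseball.
docs = [
    "nasa launched the shuttle into orbit",
    "the space station orbit was adjusted by nasa",
    "shuttle mission carried a satellite to orbit",
    "the satellite launch was delayed by nasa",
    "astronauts on the space station repaired a satellite",
    "the rocket launch put the shuttle in orbit",
    "the pitcher threw a fastball in the ninth inning",
    "the team won the game with a home run",
    "the batter hit a home run off the pitcher",
    "the baseball team lost the game in extra innings",
    "the pitcher and the batter argued during the game",
    "a home run in the ninth inning won the baseball game",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=42, stratify=labels
)

# Vectorizer and LDA are fit on the training documents only;
# the test documents are only transformed when scoring.
pipe = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
pipe.fit(docs_train, y_train)

train_acc = pipe.score(docs_train, y_train)
test_acc = pipe.score(docs_test, y_test)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

Reporting both numbers side by side makes it easier to see whether the model is overfitting.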


In [5]:
print(data.target_names[0:100])

['comp.graphics', 'rec.sport.baseball', 'sci.space']
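
Given this ordering of `target_names`, the label built in In [1] with `np.where(data.target > 0, 1, 0)` maps `comp.graphics` to 0 and both other categories to 1. The sketch below illustrates that mapping and the accuracy a majority-class guess would reach. The `target` array is a hypothetical stand-in for `data.target`, so the baseline value applies to the toy array only.

```python
import numpy as np

target_names = ['comp.graphics', 'rec.sport.baseball', 'sci.space']

# Hypothetical category indices standing in for data.target.
target = np.array([0, 1, 2, 2, 1, 0, 1, 2])

# Same rule as in In [1]: category 0 -> 0, categories 1 and 2 -> 1.
y = np.where(target > 0, 1, 0)

for idx, name in enumerate(target_names):
    print(f"{name} -> {int(idx > 0)}")

# Accuracy of always predicting the more frequent class.
majority_baseline = max(y.mean(), 1 - y.mean())
print("Majority-class baseline:", majority_baseline)
```

A model score is easier to interpret when it is read against this kind of baseline, since the two classes are not balanced by construction.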


In [6]:
import pandas as pd
wsbfinal2 = pd.read_csv("wsbFinal2.csv")
wsbfinal2.head()

Unnamed: 0,Post ID,Title,Publish Date,Selftext,SentScore,Tickers,Unnamed: 6,Unnamed: 7,Actual Ticker,By Sentiment,Sentiment W(1)/L(0),Actual Position,Actual Win/L,Predicted Win/L,Binary Actual
0,ejnlo3,$CRC DD,2020-01-03 23:18:04,"Alright autists, keeping up with the Daddy Ira...",-0.001141,"['OXY', 'IR', 'HAL', 'SLB']",,,CRC,S,0,Long,1,0,1
1,eomcpm,AMZN DD INSIDE (LONG),2020-01-14 15:00:29,**NOTE: This write up was done by u/soundofree...,0.001189,"['AMZN', 'F']",,,AMZN,L,1,Long,1,1,1
2,epc4op,Yet another $SPCE DD for all ye autists,2020-01-16 1:37:41,This is the ride of our generation. You didn’t...,0.005563,"['AAPL', 'NFLX', 'AMZN', 'FB', 'GOOG']",,,SPCE,l,1,Long,1,1,1
3,ermnes,FDX DD,2020-01-21 0:37:05,This afternoon I had my handler drive me to a ...,0.0,FDX,,,FDX,l,1,Long,0,0,1
4,eshzd5,"[DD] Wuhan Coronavirus: $APT, $MMM",2020-01-22 20:50:57,$APT makes protective equipment and tends to d...,0.00101,MMM,,,APT,l,1,Long,1,1,1


In [19]:
test_df = wsbfinal2.Selftext.iloc[0:52]
test_df

0     Alright autists, keeping up with the Daddy Ira...
1     **NOTE: This write up was done by u/soundofree...
2     This is the ride of our generation. You didn’t...
3     This afternoon I had my handler drive me to a ...
4     $APT makes protective equipment and tends to d...
5     We would all agree that Microsoft is a blue ch...
6     Wuhan virus is only killing old people who don...
7     FAANGS are trading at stupid high P/Es. Amazon...
8     Buy CHRW leaps today. The truckload transporta...
9     Hi Fuckers.  I work in tech sales for a digita...
10    As you may have heard Macau shut down all casi...
11    Your favorite $MSFT shill is back for round th...
12       Your favorite $MSFT shill is back for round...
13     Your favorite $MSFT shill is back for round t...
14    No bullshit story just JRE #1425 with Garrett ...
15    Alright, fellow autists, new and ready to be s...
16    So here is some decent DD... I work for one of...
17    Staged stores as you have probably heard i

In [21]:
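df_y = wsbfinal2["Binary Actual "]
# Caution: dropna() removes rows from the labels only, so the positional slice below
# lines up with test_df only if none of the first 52 labels is missing
df_y = df_y.dropna()
dfy = df_y[0:52]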
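
Because `dropna()` in In [21] is applied to the label column alone, the labels could shift relative to `test_df` if any of the early rows has a missing label. One way to keep text and labels aligned is to drop incomplete rows from both columns at once, sketched here on a hypothetical toy frame. The column names mirror the CSV but the frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; row 1 has a missing label.
toy = pd.DataFrame({
    "Selftext": ["post a", "post b", "post c", "post d"],
    "Binary Actual": [1.0, np.nan, 0.0, 1.0],
})

# Drop rows where either the text or the label is missing,
# so both columns keep exactly the same rows.
clean = toy.dropna(subset=["Selftext", "Binary Actual"])

texts = clean["Selftext"]
labels = clean["Binary Actual"].astype(int)

print(texts.tolist())
print(labels.tolist())
```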

In [23]:
# Create a document-term matrix
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(test_df)

# Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(doc_term_matrix, dfy, test_size=0.2, random_state=42)

# Train LDA
lda = LatentDirichletAllocation(n_components=3, random_state=0)
X_topics = lda.fit_transform(X_train)

# Use the document-topic distribution as features to predict the response variable
reg = LogisticRegression()
reg.fit(X_topics, y_train)

# Evaluate the model
X_topics_test = lda.transform(X_test)
print("Model score:", reg.score(X_topics_test, y_test))

# Print topics
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx}: {' '.join([words[i] for i in topic.argsort()[-10:]])}")

Model score: 0.7272727272727273
Topic 0: stock disney oil market know just like debt price year
Topic 1: 2020 drug coronavirus gild gilead virus com www https remdesivir
Topic 2: market cloud azure like www amd com https amp amazon
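
Topics 1 and 2 include tokens such as `com`, `www`, `https`, and `amp`, which look like fragments of URLs and HTML entities rather than content words. A minimal cleaning sketch follows, assuming that is their origin. The `strip_urls` helper and the regular expression are hypothetical and not part of the notebook; the pattern is a simple heuristic, not a full URL parser.

```python
import html
import re

# Heuristic pattern: anything starting with http(s):// or www. up to the next space.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def strip_urls(text: str) -> str:
    """Decode HTML entities, remove URL-like substrings, and tidy whitespace."""
    text = html.unescape(text)      # e.g. "&amp;" becomes "&"
    text = URL_RE.sub(" ", text)    # drop URL-like substrings
    return re.sub(r"\s+", " ", text).strip()


sample = "AMD &amp; Azure are growing, see https://www.example.com/report for details"
cleaned = strip_urls(sample)
print(cleaned)
```

Applied to the posts before vectorizing, for example with `test_df.map(strip_urls)`, this kind of step could keep such fragments out of the vocabulary.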
