# Author of this notebook: Fanglida Yan

* [feature preprocessing](#section-1)
    - [change US to usa](#subsection-1)
    - [lower case](#subsection-2)
    - [recover abbreviations](#subsection-3)
    - [remove everything except words, underscores and white spaces](#subsection-4)
    - [tokenization](#subsection-5)
    - [lemmatization](#subsection-6)
    - [remove stop words](#subsection-7)
* [Ideas for EDA](#section-2) 
    - [number of 1's vs 0's](#subsection-21)
    - [Test how the weekday (Monday/Tuesday/etc) affects the label](#subsection-22)
    - [word map, top 1000 most frequent words](#subsection-23)
    - [see if the most frequent words are the same for top1, top2, etc](#subsection-24)
    - [test how the presence of frequent words affects the label (1 or 0)](#subsection-25)
    - [correlation matrix of features](#subsection-26)
    - [use tf-idf as features and repeat the analysis](#subsection-27)
* [Simple models for benchmarks](#section-3) 

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**import all necessary libraries**

In [1]:
import math
import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import re
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.corpus import wordnet # for POS tagging; the POS (verb, noun, adjective, etc.) is needed for lemmatization
from nltk.corpus import stopwords
from wordcloud import WordCloud
from collections import Counter
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Lambda, Dense, Concatenate, Dropout, Softmax
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# read csv as pandas dataframe

In [1]:
djia_df = pd.read_csv("/kaggle/input/stocknews/upload_DJIA_table.csv")
news_and_djia_df = pd.read_csv("/kaggle/input/stocknews/Combined_News_DJIA.csv")
news_df = pd.read_csv("/kaggle/input/stocknews/RedditNews.csv")
news_and_djia_df_og = pd.read_csv("/kaggle/input/stocknews/Combined_News_DJIA.csv")

In [1]:
#news_and_djia_df.head(3)

In [1]:
news_and_djia_df.info()

<a id="section-1"></a>
# 1. Feature preprocessing

## (a) change US to usa
## (b) lower case
## (c) recover abbreviations (change they'll to they will, etc)
## (d) remove everything except words, underscores and white spaces
## (e) tokenization
## (f) lemmatization
## (g) remove stop words

In [1]:
news_columns=news_and_djia_df.columns[2:]
news_columns

**try lowering the case; this fails for the Top23, Top24 and Top25 columns**

In [1]:
for column in news_columns:
    try:
        news_and_djia_df[column].apply(lambda x : x.lower())
    except AttributeError: # a missing value is a float NaN, which has no .lower()
        print(column)

**value_counts function is very useful in this case**

In [1]:
news_and_djia_df['Top23'].apply(lambda x : type(x)).value_counts()

In [1]:
news_and_djia_df['Top23'][news_and_djia_df['Top23'].apply(lambda x : type(x))!=type('something')]

In [1]:
news_and_djia_df['Top24'][news_and_djia_df['Top24'].apply(lambda x : type(x))!=type('something')]

In [1]:
news_and_djia_df['Top25'][news_and_djia_df['Top25'].apply(lambda x : type(x))!=type('something')]

**it seems that almost every headline starts with b', so we impute the missing values with b'**

In [1]:
news_and_djia_df['Top23']=news_and_djia_df['Top23'].replace(float("nan"),"b'");
news_and_djia_df['Top24']=news_and_djia_df['Top24'].replace(float("nan"),"b'");
news_and_djia_df['Top25']=news_and_djia_df['Top25'].replace(float("nan"),"b'");

In [1]:
news_and_djia_df['Top23'][277]

**make sure that we have taken care of the missing values**

In [1]:
for column in news_columns:
    try:
        news_and_djia_df[column].apply(lambda x : x.lower())
    except AttributeError: # a missing value is a float NaN, which has no .lower()
        print(column)

<a id="subsection-1"></a>
## (a) before lowering case, change US to usa (otherwise US will become us which is not a country)

In [1]:
def US_to_america(news):
    # \b restricts the match to the standalone token US, so words such as RUSSIA are left alone
    return re.sub(r'\bUS\b', 'usa', news)
    #return re.sub(r'UK', 'united kingdom', news) unlike us, uk is not a word so it's probably ok
    #return re.sub(r'EU', 'european union', news)

for column in news_columns:
    news_and_djia_df[column]=news_and_djia_df[column].apply(lambda x : US_to_america(x))
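
As a quick illustration (the sample headline is made up, not taken from the dataset), the pattern given to `re.sub` decides whether `US` is also rewritten inside longer upper-case words. A plain `US` pattern matches anywhere, while a word-bounded `\bUS\b` pattern only matches the standalone token:

```python
import re

# Hypothetical headline, used only to show the two behaviours.
headline = "US and RUSSIA discuss BUSINESS with the US"

# Plain pattern: matches "US" anywhere, including inside other words.
plain = re.sub(r'US', 'usa', headline)

# Word-bounded pattern: matches only the standalone token "US".
bounded = re.sub(r'\bUS\b', 'usa', headline)

print(plain)
print(bounded)
```

Which variant is preferable depends on how many fully upper-case words the headlines contain; this sketch only shows the difference.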

<a id="subsection-2"></a>
## (b) now we can lower case all news

In [1]:
for column in news_columns:
    news_and_djia_df[column]=news_and_djia_df[column].apply(lambda x : x.lower())

**Are there any headlines that do not begin with b'? It seems that the answer is yes. From number 477 onward the headlines no longer begin with b' (except the rows with missing values, which we imputed with b').**

In [1]:
for column in news_columns[:3]:
    mask1 = news_and_djia_df[column].apply(lambda x : x[:2])!="b'"
    mask2 = news_and_djia_df[column].apply(lambda x : x[:2])!='b"'
    mask = np.bitwise_and(mask1, mask2)
    print(column)
    print(news_and_djia_df[column][mask].head(3))
    print()

**remove the b' and b" at the beginning of the news**

In [1]:
for column in news_columns:
    prefix = news_and_djia_df[column].str[:2]
    mask = (prefix == "b'") | (prefix == 'b"')
    # strip the two-character prefix only in the rows where it is present
    news_and_djia_df.loc[mask, column] = news_and_djia_df.loc[mask, column].str[2:]
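
The same prefix removal can be written as one anchored regular expression. This is only a sketch; `strip_bytes_prefix` is a hypothetical helper name and the sample strings are made up:

```python
import re

def strip_bytes_prefix(news):
    """Drop a leading b' or b" (hypothetical helper, not part of the notebook)."""
    # ^ anchors the match at the start, so a b' later in the text is untouched
    return re.sub(r'''^b['"]''', '', news)

print(strip_bytes_prefix("b'georgia downs two planes"))
print(strip_bytes_prefix("barack obama wins"))
```

A closing quote at the end of a headline is not touched here; it disappears later, when punctuation is removed.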

In [1]:
print(news_and_djia_df.iloc[1601,3])
news_and_djia_df_og.iloc[1601,3]

<a id="subsection-3"></a>
## (c) recover abbreviations (change they'll to they will, etc) 

**I copied the code from the following URL, by Yann Dubois** <br>
https://stackoverflow.com/questions/43018030/replace-apostrophe-short-words-in-python

In [1]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"u\.s\.", "usa", phrase)
    phrase = re.sub(r"united states of america", "usa", phrase)
    phrase = re.sub(r"american", "usa", phrase)
    phrase = re.sub(r"russian", "russia", phrase)
    phrase = re.sub(r"israeli", "israel", phrase)
    phrase = re.sub(r"united nations", "un", phrase)
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    #phrase = re.sub(r"u\.n\.", "united nations", phrase)
    #phrase = re.sub(r"un", "united nations", phrase)
    #phrase = re.sub(r"u\.s\.a\.", "america", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

for column in news_columns:
    news_and_djia_df[column] = news_and_djia_df[column].apply(lambda x : decontracted(x))
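
Two properties of these substitutions are worth seeing on a small example: the specific rules have to run before the general `n't` rule, and the `'s` rule cannot tell a contraction from a possessive. The functions below are hypothetical, trimmed-down copies of the rules above, and the phrases are made up:

```python
import re

def expand(phrase):
    # Hypothetical, shortened version of the rules in decontracted().
    phrase = re.sub(r"won't", "will not", phrase)  # specific rule first
    phrase = re.sub(r"n't", " not", phrase)        # general rule second
    phrase = re.sub(r"'s", " is", phrase)          # also rewrites possessives
    return phrase

def expand_wrong_order(phrase):
    # Same two rules with the order swapped, to show why order matters.
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"won't", "will not", phrase)
    return phrase

print(expand("they won't go"))
print(expand_wrong_order("they won't go"))
print(expand("china's economy isn't growing"))
```

The possessive case is mostly harmless in this notebook because `is` is removed later as a stop word.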

<a id="subsection-4"></a>
## (d) remove everything except words and white space

In [1]:
def remove_punc(news):
    news = re.sub('-', ' ', news)
    news = re.sub(r'[^\w\s]', '', news) # remove everything except words, digits, underscores and white spaces
    #news = re.sub(r'\s[^\d]{0,}[\d]{1,}[^\d]{0,}\s', '', news) #  remove words with digits e.g. $12b 
    news = re.sub(r'[\d]', '', news) # remove all digits
    news = re.sub('_', ' ', news) # the previous row doesn't remove underscore
    return news

for column in news_columns:
    news_and_djia_df[column] = news_and_djia_df[column].apply(lambda x : remove_punc(x))
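
To see what each substitution in `remove_punc` contributes, the steps can be applied one at a time to a made-up headline (the sample string is an assumption, not data from the notebook):

```python
import re

sample = "oil-price hits $12b, up_5%!"  # hypothetical headline

step1 = re.sub('-', ' ', sample)        # hyphens become spaces
step2 = re.sub(r'[^\w\s]', '', step1)   # drop punctuation, keep \w (letters, digits, _) and whitespace
step3 = re.sub(r'[\d]', '', step2)      # drop digits
step4 = re.sub('_', ' ', step3)         # underscores become spaces

for step in (step1, step2, step3, step4):
    print(repr(step))
```

Removing the digits from `12b` leaves a stray `b` behind, which is why single letters such as `b` and `m` are added to the stop words later on.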

<a id="subsection-5"></a>
## (e) tokenization 

In [1]:
for column in news_columns:
    news_and_djia_df[column] = news_and_djia_df[column].apply(lambda x: nltk.word_tokenize(x))

<a id="subsection-6"></a>
## (f) Lemmatization

In [1]:
wl = WordNetLemmatizer()
wl.lemmatize('feet','n')

In [1]:
tagged = nltk.pos_tag(['there','are','many','books','that','can','be','patiently','read'])
tagged

**the function below is taken from the top answer in https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python**

In [1]:
def get_wordnet_pos(treebank_tag):
    
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
def lemmatize_list(lista):
    tagged = nltk.pos_tag(lista)
    for i, ele in enumerate(lista):
        lista[i] = wl.lemmatize(ele, get_wordnet_pos(tagged[i][1]))
    return lista

In [1]:
lemmatize_list(['there','are','many','books','that','can','be','peacefully','read'])

In [1]:
for column in news_columns:
    news_and_djia_df[column] = news_and_djia_df[column].apply(lambda x : lemmatize_list(x))

<a id="subsection-7"></a>
## (g) remove stop words

In [1]:
stop_words=stopwords.words('english')
#stop_words.append('u') # 'u' is often short for 'you' in casual english
stop_words.append('one'); stop_words.append('two'); stop_words.append('three')
stop_words.append('four'); stop_words.append('five'); stop_words.append('six')
stop_words.append('seven'); stop_words.append('eight'); stop_words.append('nine')
stop_words.append('ten'); stop_words.append('eleven'); stop_words.append('twelve')
stop_words.append('thirteen'); stop_words.append('fourteen'); stop_words.append('fifteen')
stop_words.append('sixteen'); stop_words.append('seventeen'); stop_words.append('eighteen')
stop_words.append('nineteen'); stop_words.append('twenty'); stop_words.append('million')
stop_words.append('billion'); stop_words.append('th'); stop_words.append('m') # million
stop_words.append('b') # billion

stop_words.append('say'); stop_words.append('people'); 

stop_words.remove('no'); stop_words.remove('not')
stop_words.remove('above'); stop_words.remove('below')
#stop_words.remove('before'); stop_words.remove('after') 
stop_words.remove('up'); stop_words.remove('down') 
stop_words.remove('over'); stop_words.remove('under')

def remove_stop_words(lista):
    pt=0 # don't use a for loop because len(lista) keeps changing as we remove stop words.
    while pt<len(lista):
        if lista[pt] in stop_words:
            lista.remove(lista[pt])
        else:
            pt+=1
    return lista

for column in news_columns:
    news_and_djia_df[column] = news_and_djia_df[column].apply(lambda x: remove_stop_words(x))

In [1]:
#stop_words

In [1]:
pd.options.display.max_colwidth = 300
head=100
add=20
for i in range(head,head+add):
    print(i)
    print(news_and_djia_df_og.iloc[i, 3])
    print(news_and_djia_df.iloc[i, 3])
    print()
pd.options.display.max_colwidth = 50

<a id="section-2"></a>
# 2. Ideas for EDA 
## (a) number of 1's vs 0's
## (b) Test how the weekday (Monday/Tuesday/etc) affects the label 
## (c) word map, top 1000 most frequent words 
## (d) see if the most frequent words are the same for top1, top2, etc 
## (e) test how the presence of frequent words affects the label (1 or 0) 
## (f) correlation matrix of features
## (g) use tf-idf as features and repeat the analysis

<a id="subsection-21"></a>
## (a) number of 1's vs 0's

**replace the first column with timestamp**

In [1]:
def str_to_timestamp(date_string):
    return pd.Timestamp(year= int(date_string[:4]), month = int(date_string[5:7]), day = int(date_string[8:]))

news_and_djia_df['Date'] = news_and_djia_df['Date'].apply(str_to_timestamp)

**create a new dataframe for date and label only**

In [1]:
df = news_and_djia_df[['Date','Label']]
df = df.assign(Date = df['Date'].apply(lambda x : x.weekday()))
df = df.rename(columns={"Date": "weekday"})

**count the number of 0's and 1's in the label**

In [1]:
asdf = df['Label'].value_counts()
print(asdf)

**there are more 1's than 0's, which is plausibly consistent with the stock market growing over long time scales**

In [1]:
f, ax = plt.subplots(1,1, figsize=(5,5))
ax.pie(asdf)
ax.legend([str(label) for label in asdf.index]) # take the legend order from value_counts instead of hard-coding it
plt.title('Labels');

<a id="subsection-22"></a>
## (b) Test how the weekday (Monday/Tuesday/etc) affects the label

In [1]:
summary = df.groupby(by = 'weekday').sum()
summary = summary.rename(columns = {'Label' : 'counts of 1'})
summary

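**looks like Monday is associated with fewer up days for the DJIA; note that these are raw counts, not normalized by the number of trading days on each weekday**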

In [1]:
f, ax = plt.subplots(1,1, figsize=(5,5))
plt.bar(['Mon','Tues','Wed','Thur','Fri'], summary['counts of 1'])
ax.set_xlabel('weekday')
ax.set_ylabel('days of market increase');
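
Raw counts of up days depend on how many trading days each weekday has, so comparing the share of up days per weekday may be more informative. A minimal sketch on made-up data (the `toy` frame is hypothetical and only mirrors the `weekday` and `Label` columns used above):

```python
import pandas as pd

# Hypothetical toy data: weekday 0 = Monday, 1 = Tuesday; Label 1 = up day.
toy = pd.DataFrame({
    'weekday': [0, 0, 1, 1, 1, 1],
    'Label':   [1, 1, 1, 1, 0, 0],
})

counts = toy.groupby('weekday')['Label'].sum()   # raw number of up days
shares = toy.groupby('weekday')['Label'].mean()  # fraction of up days

print(counts)
print(shares)
```

Both weekdays have the same count of up days in this toy example, yet their shares differ because one weekday has twice as many rows.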

<a id="subsection-23"></a>
## (c) word map for all news
<a id="subsection-24"></a>
## & (d) see if the most frequent words are the same for top1, top2, etc 

In [1]:
f, ax = plt.subplots(3,3, figsize=(20,20))
#ax[0,0].bar(['Mon','Tues','Wed','Thur','Fri'], summary['counts of 1'])
for i in range(9):
    row = int(i/3)
    column = i % 3
    long_string=''
    tot = news_and_djia_df['Top'+str(i+1)].values.shape[0]
    for j in range(tot):
        lista = news_and_djia_df['Top'+str(i+1)][j]
        for ele in lista:
            long_string = long_string + ele + ' '
    wc = WordCloud(background_color ='white', width = 800, height = 800, min_font_size = 10).generate(long_string)
    ax[row, column].imshow(wc)

In [1]:
tot = news_and_djia_df['Top1'].values.shape[0]
huge_list = []
for i in range(tot):
    for j in range(25):
        huge_list.extend(news_and_djia_df.iloc[i,j+2]) # extend in place instead of copying the whole list each time
        
freq_dict = Counter(huge_list)

In [1]:
freq_list = []

for key in freq_dict:
    freq_list.append([key, freq_dict[key]])
    
def return_freq(lista):
    return lista[1]

freq_list.sort(key = return_freq, reverse = True)
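
For reference, `collections.Counter` can return the sorted (word, count) pairs directly through `most_common`, which does the same job as the manual sort. A small sketch on a made-up token list:

```python
from collections import Counter

tokens = ['china', 'usa', 'china', 'war', 'usa', 'china']  # hypothetical tokens
freq = Counter(tokens)

# most_common(n) returns the n most frequent (word, count) pairs, highest count first
top2 = freq.most_common(2)
print(top2)
```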

In [1]:
num = 6 # num^2 is the number of most frequent words we want
words = [freq_list[i][0] for i in range(num**2)]
freqs = [freq_list[i][1] for i in range(num**2)]
f, ax = plt.subplots(1, 1, figsize=(23, 10)) # plt.subplots returns (figure, axes)
ax.bar(words, freqs)
ax.set_ylabel("number of occurrences");

In [1]:
print(news_and_djia_df.iloc[1601,3])
news_and_djia_df_og.iloc[1601,3]

<a id="subsection-25"></a>
## (e) test how the presence of frequent words affects the label (1 or 0) 

**create a dictionary that stores the number of occurrences of each word for each day**<br>
**[china, china, usa] would be {china:2, usa:1}**

In [1]:
store_dicts=[]
for i in range(news_and_djia_df.shape[0]):
    dic={}
    for j in range(25):
        lista=news_and_djia_df.iloc[i,j+2]
        for ele in lista:
            dic[ele] = dic.get(ele, 0) + 1 # start from 0 the first time a word is seen
    store_dicts.append(dic)

**create a numpy array that stores the number of occurrences of the most frequent words for each day**

In [1]:
main = np.zeros((tot, num**2))

for i in range(tot):
    for j in range(num**2):
        main[i,j] = store_dicts[i].get(words[j], 0) # 0 if the word does not occur on day i

In [1]:
main_df = pd.DataFrame(main)
main_df['label'] = news_and_djia_df['Label']

**maybe today's news is a better predictor of the market change a few days later, so we pair each day's news with the label 5 trading days ahead**

In [1]:
next_5_day_label = news_and_djia_df['Label']
next_5_day_label = next_5_day_label[5:]
for i in range(5):
    next_5_day_label = np.append(next_5_day_label, float('NAN'))
main_df['next_5_day_label'] = next_5_day_label
main_df.iloc[-10:]
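
The same alignment can be sketched with `Series.shift`. The labels below are made up; a negative shift moves later values up, so each row receives the label from 5 rows ahead and the last 5 rows become NaN:

```python
import pandas as pd

labels = pd.Series([1, 0, 1, 1, 0, 1, 0])  # hypothetical daily labels

# shift(-5): row i now holds the label of row i + 5
future = labels.shift(-5)
print(future)
```

Rows with a NaN label drop out of the histograms automatically, because the comparisons `== 1` and `== 0` are both false for NaN.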

**certain words in today's news may affect today's market change; remember to normalize the histograms because there are more market growing days than dropping days**

In [1]:
f, ax = plt.subplots(num, num, figsize=(20,20))
bins=[0,1,2,3,4,5,6,7,8,9,10,11]
for i in range(num**2):
        row = int(i/num)
        column = i % num
        ax[row, column].hist(main_df[main_df['label']==1][i], alpha=0.5, rwidth=0.9, bins=bins, density=True)
        ax[row, column].hist(main_df[main_df['label']==0][i], alpha=0.5, rwidth=0.9, bins=bins, density=True)
        ax[row, column].legend([1,0])
        ax[row, column].set_title(words[i])

**certain words in today's news may affect the market change 5 days later; remember to normalize the histograms because there are more market growing days than dropping days**

In [1]:
f, ax = plt.subplots(num, num, figsize=(20,20))
bins=[0,1,2,3,4,5,6,7,8,9,10,11]
for i in range(num**2):
        row = int(i/num)
        column = i % num
        ax[row, column].hist(main_df[main_df['next_5_day_label']==1][i], alpha=0.5, rwidth=0.9, bins=bins, density=True)
        ax[row, column].hist(main_df[main_df['next_5_day_label']==0][i], alpha=0.5, rwidth=0.9, bins=bins, density=True)
        ax[row, column].legend([1,0])
        ax[row, column].set_title(words[i])

<a id="subsection-26"></a>
## (f) correlation matrix of features

In [1]:
f, ax = plt.subplots(1, 1, figsize=(10,10))
ax.matshow(main_df.corr())

In [1]:
print(words[1], words[16], ':high positive correlation')
print(words[13], words[30], ':high positive correlation')
print(words[15],words[7], ':high positive correlation')
print(words[0],words[10], ':high negative correlation')
print(words[13],words[19], ':high negative correlation')
print(words[13],words[20], ':high negative correlation')

<a id="subsection-27"></a>
## (g) repeat the analysis with tfidf

**total number of words**

In [1]:
tot_words=[]
for i in range(tot):
    tot_words.append(0)
    for j in range(25):
        tot_words[-1]+=len(news_and_djia_df.iloc[i,j+2])

**calculate tf**

In [1]:
tf_df=main_df.copy() # copy, so that main_df keeps the raw counts
for i in range(tot):
    for j in range(num**2):
        tf_df.iloc[i,j] = tf_df.iloc[i,j]/tot_words[i]

**calculate idf**

In [1]:
idf=[0]*(num**2)
for i in range(tot):
    for j in range(num**2):
        if words[j] in store_dicts[i]: # the word appears in the news of day i
            idf[j]+=1
            
for i in range(num**2):
    idf[i]=np.log(tot/idf[i])

**tf-idf**

In [1]:
tf_idf_df=tf_df.copy() # copy, so that tf_df keeps the plain term frequencies
for i in range(tot):
    for j in range(num**2):
        tf_idf_df.iloc[i,j]=tf_idf_df.iloc[i,j] * idf[j]
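
The quantities computed in the cells above follow the usual definitions: tf is the count of a word in a day divided by the total number of words that day, idf is the log of the number of days divided by the number of days containing the word, and tf-idf is their product. A self-contained sketch on a made-up corpus (the helper names `tf`, `idf` and `tf_idf` are hypothetical):

```python
import math

# Hypothetical corpus: each "day" is a list of tokens.
days = [
    ['china', 'usa', 'china', 'war'],
    ['usa', 'oil'],
    ['china', 'oil', 'oil', 'price'],
]

def tf(word, tokens):
    # share of the day's tokens that are this word
    return tokens.count(word) / len(tokens)

def idf(word, corpus):
    # log(number of days / number of days that contain the word)
    n_containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / n_containing)

def tf_idf(word, tokens, corpus):
    return tf(word, tokens) * idf(word, corpus)

print(tf_idf('china', days[0], days))
print(tf_idf('war', days[0], days))
```

`idf` would divide by zero for a word that never occurs; that cannot happen for the most frequent words used in this notebook, since each of them occurs on at least one day.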

**this shows how today's news relates to today's stock market movement**

In [1]:
f, ax = plt.subplots(num, num, figsize=(20,20))
bins=[i/1000 for i in range(11)]
for i in range(num**2):
    row = int(i/num)
    column = i % num
    ax[row, column].hist(tf_idf_df[tf_idf_df['label']==1][i], alpha=0.5, rwidth=0.9, bins=bins, density=True)
    ax[row, column].hist(tf_idf_df[tf_idf_df['label']==0][i], alpha=0.5, rwidth=0.9, bins=bins, density=True)
    ax[row, column].legend([1,0])
    ax[row, column].set_title(words[i])

**certain words in today's news may affect the market change 5 days later**

In [1]:
f, ax = plt.subplots(num, num, figsize=(20,20))
bins=[i/1000 for i in range(11)]
for i in range(num**2):
    row = int(i/num)
    column = i % num
    ax[row, column].hist(tf_idf_df[tf_idf_df['next_5_day_label']==1][i], alpha=0.5, rwidth=0.9, bins=bins, density=True)
    ax[row, column].hist(tf_idf_df[tf_idf_df['next_5_day_label']==0][i], alpha=0.5, rwidth=0.9, bins=bins, density=True)
    ax[row, column].legend([1,0])
    ax[row, column].set_title(words[i])

<a id="section-3"></a>
# 3. Simple models for benchmark

**use n-grams and TF-IDF features, then fit logistic regressions**<br>
**the idea is taken from the two notebooks below**<br>
https://www.kaggle.com/ndrewgele/omg-nlp-with-the-djia-and-reddit<br>
https://www.kaggle.com/lseiyjg/use-news-to-predict-stock-markets

In [1]:
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer # used below but was never imported
store_null=[] # store null accuracy
store_result=[] # store accuracy
store_auc=[] # store auc roc

data = pd.read_csv('../input/stocknews/Combined_News_DJIA.csv') # read in data
   
for i in range(100):
    
    train_X, test_X, train_Y, test_Y = train_test_split(data.iloc[:,2:], data.iloc[:,1], test_size=0.2, random_state=None) 

    trainheadlines = []
    for row in range(train_X.shape[0]):
        trainheadlines.append(' '.join(str(x) for x in train_X.iloc[row])) # join together all 25 news

    testheadlines = []
    for row in range(test_X.shape[0]):
        testheadlines.append(' '.join(str(x) for x in test_X.iloc[row])) # join together all 25 news
    
    # count TF-IDF on 2-grams
    advancedvectorizer = TfidfVectorizer(min_df=0.03, max_df=0.97, max_features = 200000, ngram_range = (2, 2)) 
    advancedtrain = advancedvectorizer.fit_transform(trainheadlines)
    advancedtest = advancedvectorizer.transform(testheadlines)
    
    # C is regularization parameter, when C gets larger regularization becomes smaller
    advancedmodel = LogisticRegression(C=1000, solver='liblinear')
    advancedmodel.fit(advancedtrain, train_Y)

    preds13 = advancedmodel.predict(advancedtest) # binary prediction
    preds13prob = advancedmodel.predict_proba(advancedtest)[:,1] # probability prediction
    acc13 = accuracy_score(test_Y, preds13)
    
    store_null.append(sum(test_Y)/test_Y.shape[0])
    store_result.append(acc13)
    store_auc.append(roc_auc_score(test_Y,preds13prob))

print('Average null accuracy: ',sum(store_null)/len(store_null))
print('Average accuracy: ',sum(store_result)/len(store_result))
print('Average AUC score: ',sum(store_auc)/len(store_auc))
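
The value appended to `store_null` is the share of 1's in the test split, which is the accuracy of a model that always predicts 1. A majority-class baseline would instead be the larger of that share and its complement. The two coincide whenever up days are the majority of a split. A small sketch with hypothetical helper names and made-up labels:

```python
def always_up_accuracy(labels):
    # accuracy of predicting 1 for every row = share of 1's
    return sum(labels) / len(labels)

def majority_baseline_accuracy(labels):
    # accuracy of predicting the more common class for every row
    p = always_up_accuracy(labels)
    return max(p, 1 - p)

labels = [1, 0, 0, 0]  # hypothetical test labels
print(always_up_accuracy(labels))
print(majority_baseline_accuracy(labels))
```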

In [1]:
import statistics

print('Standard deviation of null accuracy: ', statistics.pstdev(store_null))
print('Standard deviation of accuracy: ', statistics.pstdev(store_result))
print('Standard deviation of AUC score: ', statistics.pstdev(store_auc))