# Predicting Stock Market Prices Through News Texts
### Homework project prepared for the Machine Learning Course at Higher School of Economics

# 0. Introduction
The Dow Jones Industrial Average (DJIA) is a price-weighted average of the stock prices of 30 large US companies. It is often used as a summary of the state of the stock exchange because the prices of these companies' stocks tend to move together: when the DJIA grows, most of these stocks grow as well, and when it falls, they fall too.

Changes in stock prices can be caused by many different events in the world, both economic and political. Stock traders constantly follow the latest news to be aware of possible price changes. Roughly speaking, incoming news can be divided into two types: news that causes certain stocks to rise, and news that causes them to fall. This suggests the idea of a trading bot that automatically processes news about world events and sells or buys stocks accordingly. The problem is that a great variety of circumstances can cause stock prices to change. Moreover, not all of these factors can be derived from news reports, and the trading strategy, the decision-making algorithm and the correct thresholds for different actions are no less important than a clever predictive model.

Nevertheless, I decided to check whether it is possible to win on the stock market with an automatic classifier trained only on news bulletins, using linguistic analysis of the news texts and no external information. This notebook is divided into two parts: data exploration, which gives more insight into the dataset, and feature engineering, in which I try to extract different linguistic features from the raw text data.

In [None]:
# basic text pre-processing
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, LancasterStemmer
from nltk import pos_tag

# core modules
from pandas import DataFrame, concat, options
import numpy as np
import re
from collections import defaultdict, Counter

# tools and models for machine learning
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from gensim.models import Word2Vec

# visualization
import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
import matplotlib.pyplot as plt
from matplotlib import use
import seaborn as sns
from wordcloud import WordCloud

In [None]:
wordnet_lemmatizer = WordNetLemmatizer()
stemmer =  LancasterStemmer()
stop = stopwords.words('english')
py.init_notebook_mode(connected=True)
options.mode.chained_assignment = None

In [None]:
import warnings
warnings.filterwarnings('ignore')

# 1. Data Exploration

First of all, I manually divided the dataset into a training set and a test set. News from 2008 through 2014 makes up the training set; all news from 2015 onward makes up the test set.

In [None]:
from pandas import read_csv  # DataFrame.from_csv was removed in pandas 1.0
data = read_csv('../input/Combined_News_DJIA.csv').fillna(' ')

In [None]:
train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']
y_train = train.Label.values
y_test = test.Label.values
col_number = 26

In [None]:
print('The size of the training set is {} news, the size of the test set is {} news'.format(len(train), len(test)))
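
The split in the cells above is purely chronological. When the `Date` column holds ISO-formatted strings (`YYYY-MM-DD`), plain string comparison already follows calendar order. Here is a minimal sketch with made-up rows; the rows and the helper name `split_by_date` are illustrative assumptions, not part of the dataset:

```python
def split_by_date(rows, first_test_date):
    """Split rows chronologically; ISO date strings sort in calendar order."""
    train_rows = [r for r in rows if r["Date"] < first_test_date]
    test_rows = [r for r in rows if r["Date"] >= first_test_date]
    return train_rows, test_rows

# Hypothetical rows that mimic the Date/Label columns of the dataset.
rows = [
    {"Date": "2014-12-30", "Label": 1},
    {"Date": "2014-12-31", "Label": 0},
    {"Date": "2015-01-02", "Label": 1},
]
train_rows, test_rows = split_by_date(rows, "2015-01-01")
print(len(train_rows), len(test_rows))
```

Because no test day precedes a training day, the classifier is always evaluated on news that appeared after the news it was trained on.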

Let's look at our data:

In [None]:
train.head()

We have an almost equal class balance in the whole dataset. That's good.

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x=data.Label.value_counts().index, y=data.Label.value_counts().values, alpha=0.8)  # recent seaborn needs keyword x/y
plt.ylabel('Amount of objects', fontsize=16)
plt.xlabel('Class', fontsize=16)
plt.show();

To begin with, I preprocess the texts with basic NLP techniques: tokenization, lowercasing, lemmatization and/or stemming. Lemmatization is a morphological normalization algorithm that reduces a word to its lemma (the normalized dictionary form); stemming is also a morphological normalization algorithm, but it cuts the word down to its stem, which need not be a real word.
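
To make the difference concrete, here is a deliberately tiny sketch. The lookup table and the suffix list are hypothetical toys of my own, far cruder than the NLTK `WordNetLemmatizer` and `LancasterStemmer` used in this notebook, but they show the two ideas: a lemmatizer maps a word to a dictionary form, while a stemmer just strips endings.

```python
# Toy illustration only: real lemmatizers and stemmers are far richer.
TOY_LEMMAS = {"studies": "study", "better": "good", "markets": "market"}  # hypothetical lookup table

def toy_lemmatize(word):
    """Return the dictionary form if it is known, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["studies", "better", "markets"]:
    print(w, toy_lemmatize(w), toy_stem(w))
```

Note how the toy stemmer turns "studies" into the non-word "stud", while the toy lemmatizer can relate "better" to "good", which no suffix rule could do.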

In [None]:
lmb_f = [lambda x: re.sub("""^b("|')""",'', str(x)),  
         lambda x: str(x).lower(),
         lambda x: str(x).replace("'",''),
         lambda x: word_tokenize(str(x)),
         lambda x: [wordnet_lemmatizer.lemmatize(str(i)) for i in x],
         lambda x: [stemmer.stem(str(i)) for i in x],
         lambda x: ' '.join(x)
        ]

In [None]:
def parse_trainset(data, preproc='lem'):
    if preproc == 'lem':
        lambdas = lmb_f[0:5] + lmb_f[6:]
    elif preproc == 'stem':
        lambdas = lmb_f[0:4] + lmb_f[5:]
    elif preproc == 'lem+stem':
        lambdas = lmb_f
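    elif preproc == '_':
        lambdas = lmb_f[0:4]  # cleaning and tokenization only: no lemmatization, tokens stay in a list
    else:
        raise ValueError('Unknown preprocessing option: {}'.format(preproc))
    li = []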
    for col in range(1, col_number):
        s = data.loc[:,'Top' + str(col)]
        for a in lambdas:
            s = s.apply(a)
        li.append(s)
    return li

I will create four types of datasets using different options of preprocessing:

1. Raw texts (no morphological normalization);
2. Lemmatization;
3. Stemming;
4. Lemmatization + Stemming.

In the following examples I experiment only with the raw text, but at the final stage of this work I compare the performance of models trained on the different kinds of morphological normalization.

In [None]:
train_lem = concat([train.drop(train.columns[2:], axis=1), DataFrame(parse_trainset(train)).transpose()], axis=1)
test_lem = concat([test.drop(test.columns[2:], axis=1), DataFrame(parse_trainset(test)).transpose()], axis=1)

In [None]:
train_stem = concat([train.drop(train.columns[2:], axis=1), DataFrame(parse_trainset(train, 'stem')).transpose()], axis=1)
test_stem = concat([test.drop(test.columns[2:], axis=1), DataFrame(parse_trainset(test, 'stem')).transpose()], axis=1)

In [None]:
train_lem_stem = concat([train.drop(train.columns[2:], axis=1), DataFrame(parse_trainset(train, 'lem+stem')).transpose()], axis=1)
test_lem_stem = concat([test.drop(test.columns[2:], axis=1), DataFrame(parse_trainset(test, 'lem+stem')).transpose()], axis=1)

In [None]:
train_ = concat([train.drop(train.columns[2:], axis=1), DataFrame(parse_trainset(train, '_')).transpose()], axis=1)
test_ = concat([test.drop(test.columns[2:], axis=1), DataFrame(parse_trainset(test, '_')).transpose()], axis=1)

For instance, the stemmed dataset looks like this. The difference is notable.

In [None]:
train_stem.head()

All in all, I need to transform the text data into numerical features on which a supervised classifier can be trained. One of the simplest features that can be obtained from raw text is its length (here, the number of tokens in a headline). Before engineering this feature, let's check whether it correlates with the DJIA label:

In [None]:
mean_list = []
for number in range(train_.shape[0]):
    s = np.mean([len(f) for f in list(train_.loc[number][2:].values)])
    mean_list.append((s, y_train[number]))

for number2 in range(number+1, number+1+test_.shape[0]):
    s = np.mean([len(f) for f in list(test_.loc[number2][2:].values)])
    mean_list.append((s, y_test[number2-number-1]))
mean_list = sorted(mean_list)

In [None]:
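lengths, labels = zip(*mean_list)  # mean_list holds (average length, label) pairs
plt.figure(figsize=(20,6))
plt.scatter(lengths, labels, alpha=0.3, label='DJIA label')  # one point per day
plt.legend()
plt.xlabel('Average text length', size=16)
plt.ylabel('Label', size=16)
plt.show();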
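
A plot is only a visual hint, so it can help to put a number on the relationship. Below is a minimal, dependency-free sketch of the Pearson correlation coefficient between a numeric feature and a 0/1 label. The helper `pearson` and the sample values are my own illustration; with the notebook's data one would pass the two columns of `mean_list` instead.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

# Hypothetical average lengths and labels, for illustration only.
toy_lengths = [1, 2, 3, 4]
toy_labels = [0, 0, 1, 1]
print(round(pearson(toy_lengths, toy_labels), 3))
```

Values close to 0 would mean that the feature carries little linear information about the label, values close to 1 or -1 that it carries a lot.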

Well, some correlation seems to exist, so we could use this feature to build a predictive model. Great! What other features could we use? In my opinion, another nice feature is the number of mentions of countries' leaders. Let's check Barack Obama, for instance: does his name correlate with the DJIA?

In [None]:
lists = ['obama', 'us', 'u.s.', 'u s', 'u.s', 'united', 'state', 'america']

all_sum = []
for number in range(train_.shape[0]):
    # count every token of the day once, then add up the counts of the key words
    counts = Counter(i for f in train_.loc[number][2:].values for i in f)
    all_sum.append((sum(counts[i] for i in lists), y_train[number]))

for number2 in range(number+1, number+1+test_.shape[0]):
    counts = Counter(i for f in test_.loc[number2][2:].values for i in f)
    all_sum.append((sum(counts[i] for i in lists), y_test[number2-number-1]))
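
The counting step above boils down to one small function. The sketch below is my own restatement on made-up headlines (`keyword_count` is a hypothetical helper name). It also shows a detail worth knowing: since the texts are already split into single tokens, a multi-word entry such as `'u s'` can never match.

```python
from collections import Counter

def keyword_count(tokenized_headlines, keywords):
    """Total number of tokens in one day's headlines that are key words."""
    counts = Counter(token for headline in tokenized_headlines for token in headline)
    return sum(counts[k] for k in keywords)

# Hypothetical tokenized headlines for a single day.
day = [["obama", "visits", "us", "allies"], ["u", "s", "economy", "slows"]]
print(keyword_count(day, ["obama", "us", "u s"]))
```

Here "obama" and "us" are counted once each, while `"u s"` contributes nothing because "u" and "s" are separate tokens.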

In [None]:
counts = Counter(all_sum)
keys = sorted({k for k, _ in counts})  # every observed key-word count
a1 = [counts[(k, 0)] for k in keys]    # number of days with label 0
b1 = [counts[(k, 1)] for k in keys]    # number of days with label 1
red = '#B2182B'
blue = '#2166AC'
width = 0.35

plt.figure(figsize=(12,6))
plt.bar(np.arange(len(keys)), a1, width, color=red, label='0 label')
plt.bar(np.arange(len(keys)), b1, width, bottom=a1, color=blue, label='1 label')
plt.legend()
plt.xticks(np.arange(len(keys)), keys)
plt.xlabel('Key word amount', size=16)
plt.show()

Mentions of Obama show some correlation with the DJIA, so now we have another feature to use. But what about other countries? How many news items are about them? I will compare two: Russia and the (as yet unrecognised) Islamic State of Iraq and al-Sham (ISIS). Let's check how many news items mention Russia and Putin, and how many mention ISIS:

In [None]:
news_topics_russia = ['russian', 'russia', 'putin']
news_topics_isis = ['isil', 'isis', 'levant', 'daesh']

In [None]:
news_russia = []
news_isis = []

for number in range(train_.shape[0]):
    news_russia += [i for f in list(train_.loc[number][2:].values) for i in f if i.lower() in news_topics_russia]
    news_isis += [i for f in list(train_.loc[number][2:].values) for i in f if i.lower() in news_topics_isis]
for number2 in range(number+1, number+1+test_.shape[0]):
    news_russia += [i for f in list(test_.loc[number2][2:].values) for i in f if i.lower() in news_topics_russia]
    news_isis += [i for f in list(test_.loc[number2][2:].values) for i in f if i.lower() in news_topics_isis]

news_counter_russia = Counter(news_russia)
news_counter_isis = Counter(news_isis)

In [None]:
print('Russia is mentioned {} times, and ISIS is mentioned {} times'.format(sum(news_counter_russia.values()), sum(news_counter_isis.values())))

russia_bar = go.Bar(
    x=list(news_counter_russia.keys()),
    y=list(news_counter_russia.values()),
    name='Russia',
    marker=dict(
        color='red'
    )
)
isis_bar = go.Bar(
    x=list(news_counter_isis.keys()),
    y=list(news_counter_isis.values()),
    name='ISIS',
    marker=dict(
        color='black'
    )
)

layout = go.Layout(
    title='Amount of mentioning Russia and ISIS in the news',
    xaxis=dict(
        tickfont=dict(
            size=14,
            color='black'
        )
    ),
    yaxis=dict(
        title='Amount of news containing this word',
        titlefont=dict(
            size=16,
            color='black'
        ),
        tickfont=dict(
            size=14,
            color='black'
        )
    ),
    legend=dict(
        x=0,
        y=1.0,
    ),
    barmode='group',
    bargap=0.15,
    bargroupgap=0.1
)

fig = go.Figure(data=[russia_bar, isis_bar], layout=layout)
py.iplot(fig, filename='style-bar')

Well, very few news items mention ISIS, and a lot of them mention Russia. So features based on mentions of ISIS would probably be less helpful than features based on mentions of Russia. Let's explore one last thing that could help us understand which features to use. I wonder which world crises are mentioned the most?

In [None]:
crisis_marker = 'crisis'

In [None]:
list_of_crisis = defaultdict(lambda: 0.0)

for ind in range(train_.shape[0]):
    list_of_articles = list(train_.loc[ind])[2:]
    for i in list_of_articles:
        if crisis_marker in i:
            try:
                word = pos_tag([i[i.index(crisis_marker)-1]])[0]
                if word[1]=='JJ':
                    list_of_crisis[word[0]] += 1.0
            except Exception as e:
                pass

for ind2 in range(ind+1, ind+1+test_.shape[0]):
    list_of_articles = list(test_.loc[ind2])[2:]
    for i in list_of_articles:
        if crisis_marker in i:
            try:
                word = pos_tag([i[i.index(crisis_marker)-1]])[0]
                if word[1]=='JJ':
                    list_of_crisis[word[0]] += 1.0
            except Exception as e:
                pass
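
The loops above look up the token that stands directly before "crisis" and keep it if the tagger calls it an adjective. The core lookup can be sketched without NLTK as follows. This is a simplified illustration on made-up headlines: it has no part-of-speech filter, it counts every occurrence rather than only the first one, and `words_before` is a hypothetical helper name.

```python
from collections import Counter

def words_before(marker, tokenized_headlines):
    """Count the tokens that directly precede `marker` in each headline."""
    counts = Counter()
    for tokens in tokenized_headlines:
        for pos, token in enumerate(tokens):
            # pos > 0 skips headlines that start with the marker, where
            # tokens[pos - 1] would silently wrap around to the last token
            if token == marker and pos > 0:
                counts[tokens[pos - 1]] += 1
    return counts

headlines = [
    ["global", "crisis", "hits", "markets"],
    ["crisis", "talks", "continue"],
    ["leaders", "discuss", "the", "global", "crisis"],
]
print(words_before("crisis", headlines).most_common(1))
```

The explicit `pos > 0` check matters because in Python an index of `-1` does not fail: it returns the last element of the list.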

In [None]:
wordcloud = WordCloud(background_color='white')
wordcloud.generate_from_frequencies(dict(list_of_crisis))
plt.figure()
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Notably, although common collocates like "global" or "humanitarian" are used with "crisis" most often, adjectives that refer to specific military conflicts are also visible. We can see the crisis in Georgia, the crisis in Syria and the crisis in Ukraine (as well as the slightly less pronounced crises in Italy and in Europe, which were more a destabilization of the economy than serious political crises). However, military conflicts in, for example, African countries are not mentioned in the news.

I also looked at the collocations of the word "crisis" with the help of distributional semantics, visualizing the words nearest to "crisis" in a Word2Vec model trained on the news texts. But first, I create a column with the text merged from all 25 columns.

In [None]:
def make_raw_text_col(df):
    df['text'] = df['Top1'].str[1:] + ' '
    for i in df.loc[:,'Top2':'Top25']:
        df['text'] += df[i].str[1:] + ' '
    df['text'] = df['text'].str.lower().str.replace('[^a-zA-Z ]', '', regex=True)  # pandas >= 2.0 treats the pattern literally unless regex=True
    return df

In [None]:
def make_raw_text_col_lem(df):
    df['text'] = df['Top1'] + ' '
    for i in df.loc[:,'Top2':'Top25']:
        df['text'] += df[i] + ' '  # keep a space so that neighbouring headlines do not run together
    df['text'] = df['text'].str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
    return df

In [None]:
train = make_raw_text_col(train)
test = make_raw_text_col(test)

In [None]:
train_lem = make_raw_text_col_lem(train_lem)
test_lem = make_raw_text_col_lem(test_lem)

train_stem = make_raw_text_col_lem(train_stem)
test_stem = make_raw_text_col_lem(test_stem)

train_lem_stem = make_raw_text_col_lem(train_lem_stem)
test_lem_stem = make_raw_text_col_lem(test_lem_stem)

I will use a standard Skip-Gram model to find the distributional neighbours of the word "crisis" in a vector space.

In [None]:
model = Word2Vec([word_tokenize(text) for text in np.hstack((train.text.values, test.text.values))], min_count=4, size=300, window=4, sg=1, alpha=1e-4)
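
Nearest neighbours in such a vector space are ranked by cosine similarity, which is also what gensim's `most_similar` uses. Here is a minimal sketch of that ranking on made-up three-dimensional vectors; the vectors, the words and the helper names `cosine` and `nearest` are assumptions for illustration, not output of the model above.

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def nearest(word, vectors, topn=2):
    """Return the `topn` other words with the highest cosine similarity."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy embeddings.
toy_vectors = {
    "crisis": [1.0, 0.0, 0.0],
    "recession": [0.9, 0.1, 0.0],
    "banana": [0.0, 1.0, 0.0],
}
print(nearest("crisis", toy_vectors, topn=1))
```

Because only the angle between vectors matters, the length of a vector (which tends to grow with word frequency) does not influence the ranking.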

The high-dimensional vectors would be projected in a two-dimensional space with the help of dimensionality reduction technique t-SNE:

In [None]:
topn=10

labels = []
tokens = []

for word in model.wv.most_similar(crisis_marker, topn=topn):
    tokens.append(model.wv[word[0]])
    labels.append(word[0])

# perplexity must stay below the number of points (only `topn` words are projected)
tsne_model = TSNE(perplexity=min(40, len(tokens) - 1), n_components=2, init='pca', n_iter=2500, random_state=23)
new_values = tsne_model.fit_transform(np.array(tokens))

x = []
y = []
for value in new_values:
    x.append(value[0])
    y.append(value[1])

plt.figure(figsize=(5, 5))
sns.set_style('whitegrid')
plt.grid(False)
for i in range(len(x)):
    plt.scatter(x[i],y[i])
    plt.annotate(labels[i],
                fontsize=15,
                color='black',
                xy=(x[i], y[i]),
                xytext=(2, 2),
                textcoords='offset points',
                ha='right',
                va='bottom')

Well, in word embeddings the word "crisis" has neighbours that have nothing to do with words mentioned in the previous diagram.

# 2. Feature Engineering

In this part I will use different ways of obtaining numerical features from the raw texts. I will train the gradient boosting classifier on these features and explore the role of different features in the final result. 

## 2.1 Linguistic Features

The simplest features that one could use to train a baseline classifier are derived directly from the text. Some of them were mentioned above in the data exploration part of this notebook; the others are listed below:

* Number of words in the news text;
* Number of unique words;
* Number of characters;
* Number of stop words (I'm using the NLTK English stop-word list);
* Average word length.

In [None]:
lambda_func_features = [
    (lambda x: len(str(x).split()), 'NumWords'),
    (lambda x: len(set(str(x).split())), 'NumUniqueWords'),
    (lambda x: len(str(x)), 'NumChars'),
    (lambda x: len([w for w in str(x).lower().split() if w in stop]), 'NumStopWords'),
    (lambda x: np.mean([len(w) for w in str(x).split()]), 'MeanWordLen'),
]
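
To see what these five functions produce, here is a worked example on a single made-up headline. The tiny stop-word set is a hypothetical stand-in for the NLTK list, and `linguistic_features` is a helper of my own that mirrors the lambdas above with the standard library only.

```python
from statistics import mean

TOY_STOP_WORDS = {"the", "as", "a", "of"}  # hypothetical stand-in for the NLTK list

def linguistic_features(text):
    """Compute the five surface features for one piece of text."""
    words = text.split()
    return {
        "NumWords": len(words),
        "NumUniqueWords": len(set(words)),  # case-sensitive, like the lambda above
        "NumChars": len(text),
        "NumStopWords": sum(w.lower() in TOY_STOP_WORDS for w in words),
        "MeanWordLen": mean(len(w) for w in words),
    }

features = linguistic_features("The markets fell as the crisis deepened")
print(features)
```

The headline has 7 words (all unique, since "The" and "the" differ in case), 39 characters, 3 stop words and an average word length of 33/7.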

All in all, I obtain 5x25 features, which I put into a DataFrame.

In [None]:
def generate_features(train, test, train_la, test_la, lambda_func, func_name):
    train_features = DataFrame([train.loc[:,'Top' + str(col)].apply(lambda_func) for col in range(1, col_number)]).transpose()
    test_features = DataFrame([test.loc[:,'Top' + str(col)].apply(lambda_func) for col in range(1, col_number)]).transpose()
    train_features.columns = [func_name + str(i) for i in range(1, col_number)]
    test_features.columns = [func_name + str(i) for i in range(1, col_number)]
    return concat([train_la, train_features], axis=1), concat([test_la, test_features], axis=1)

In [None]:
def create_linguistic_features(train_data, test_data):
    train_la = DataFrame()
    test_la = DataFrame()
    for lambda_func, func_name in lambda_func_features:
        train_la, test_la = generate_features(train_data, test_data, train_la, test_la, lambda_func, func_name)
    test_la = test_la.reset_index(drop=True)
    return train_la, test_la

In [None]:
train_la, test_la = create_linguistic_features(train, test)

These feature dataframes look like this:

In [None]:
train_la.head()

Let's set the hyperparameters of the gradient boosting classifier and train it:

In [None]:
params = {}
params['objective'] = 'multi:softprob'
params['eta'] = 0.1
params['max_depth'] = 3
params['verbosity'] = 0  # replaces the deprecated `silent` flag
params['num_class'] = 2  # the label is binary: 0 or 1
params['eval_metric'] = 'mlogloss'
params['min_child_weight'] = 1
params['subsample'] = 0.8
params['colsample_bytree'] = 0.3
params['seed'] = 0

In [None]:
boost_rounds = 20
kfolds = 5

def make_xgboost_predictions(train_df, test_df):
    """Average the test-set class probabilities of the models trained on each fold."""
    pred_full_test = 0
    xgtest = xgb.DMatrix(test_df)
    for dev_index, val_index in KFold(n_splits=kfolds, shuffle=True, random_state=42).split(train_df):
        # only the development part of each fold is used for training
        dev_X, dev_y = train_df.loc[dev_index], y_train[dev_index]
        xgtrain = xgb.DMatrix(dev_X, dev_y)
        model = xgb.train(params=list(params.items()), dtrain=xgtrain, num_boost_round=boost_rounds)
        # no early stopping is used, so all boosting rounds take part in the prediction
        pred_full_test = pred_full_test + model.predict(xgtest)
    return pred_full_test / kfolds
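
The function above trains one model per fold and averages their test-set probabilities. The same pattern can be sketched without XGBoost. In the toy version below the "model" simply predicts the class frequencies it saw, the folds are contiguous rather than shuffled, and all names and labels are hypothetical.

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    start = 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def fit_prior(labels):
    """Toy stand-in for a real model: predict the class frequencies it saw."""
    p_up = sum(labels) / len(labels)
    return [1.0 - p_up, p_up]

def averaged_test_proba(y, n_test, k=5):
    """Average the test-set probabilities of the k fold models."""
    total = [[0.0, 0.0] for _ in range(n_test)]
    for train_idx, _ in kfold_indices(len(y), k):
        proba = fit_prior([y[i] for i in train_idx])
        for row in total:
            row[0] += proba[0]
            row[1] += proba[1]
    return [[p0 / k, p1 / k] for p0, p1 in total]

y_toy = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # hypothetical labels
test_proba = averaged_test_proba(y_toy, n_test=3)
predicted = [row.index(max(row)) for row in test_proba]  # same role as argmax(axis=1)
print(test_proba[0], predicted)
```

Averaging over folds makes the final prediction less dependent on any single train/validation split.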

In [None]:
f1_score(y_test, make_xgboost_predictions(train_la, test_la).argmax(axis=1))  # scikit-learn convention: y_true first

Well, the F1-score is not bad. We obtained a pretty nice baseline performance just on simple linguistic features. Let's look at more complex ways of obtaining numerical data from text.
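
When reading an F1 value on a roughly balanced binary task, a handy reference point is a classifier that always predicts class 1. The sketch below computes F1 by hand for such a classifier on made-up labels; `f1_binary` is my own helper for illustration, not the scikit-learn function.

```python
def f1_binary(y_true, y_pred):
    """F1 of the positive class (label 1) from true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true_toy = [1, 0, 1, 0, 1, 0]        # hypothetical, perfectly balanced labels
always_up = [1] * len(y_true_toy)      # a "model" that always predicts growth
print(round(f1_binary(y_true_toy, always_up), 3))
```

With half of the labels equal to 1, precision is 0.5 and recall is 1, so F1 is 2/3. A trained model is informative to the extent that it moves away from this reference point.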

## 2.2 Probabilities of Predictions of Naive Bayes as Features

Now I will try to use the predictions of a classifier trained on co-occurrences of words and character n-grams (word-word, n-gram-n-gram, word-text and n-gram-text). Notably, I am not pushing the co-occurrence vectors themselves into the boosting model: I feed it the class probabilities predicted by a Naive Bayes classifier and train the gradient boosting on them. The sparse matrices are L2-normalized before Naive Bayes is fitted; dimensionality reduction with Singular Value Decomposition is applied to the TF-IDF matrix in section 2.3.

In [None]:
vectorizers = [
    (CountVectorizer(stop_words='english', ngram_range=(1,5), analyzer='char'), 'CountVectorizerChar'),
    (TfidfVectorizer(stop_words='english', ngram_range=(1,5), analyzer='char'), 'TfIdfVectorizerChar'),
    (CountVectorizer(stop_words='english', ngram_range=(1,2), analyzer='word'), 'CountVectorizerWord'),
    (TfidfVectorizer(stop_words='english', ngram_range=(1,2), analyzer='word'), 'TfIdfVectorizerWord')
]
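
For readers unfamiliar with character n-grams, here is a minimal sketch of how a string is cut into overlapping pieces of length 1 to n. It is a simplified illustration with a hypothetical helper `char_ngrams`; the scikit-learn vectorizers above additionally normalize whitespace and then count or weight the pieces.

```python
def char_ngrams(text, min_n, max_n):
    """List all character n-grams of `text` with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams

print(char_ngrams("abc", 1, 2))
print(len(char_ngrams("crisis", 1, 5)))
```

A word of length L yields L - n + 1 n-grams for every n, so "crisis" with n from 1 to 5 yields 6 + 5 + 4 + 3 + 2 = 20 pieces. This is why character n-gram matrices become very wide very quickly.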

In [None]:
models = [#(MultinomialNB(), {'alpha':[0, 0.1, 0.5, 0.8, 1]}, "Naive Bayes"),
          (xgb.XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=5, min_child_weight=1, gamma=0, subsample=0.8, 
                        colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
                        {'max_depth':range(3,10,9), 'min_child_weight':range(1,6,5)}, "XGB")
         ]


In [None]:
from sklearn.model_selection import GridSearchCV

def do_grid_search(alg, array_of_vectors, array_of_tags, parameters):
    clf = GridSearchCV(alg, parameters, error_score=0.0)
    clf.fit(array_of_vectors, array_of_tags)
    print(clf.best_estimator_)
    return clf.best_estimator_

As in the previous step, I will create a DataFrame that contains all necessary feature data without any source texts.

In [None]:
def create_proba_features(train_data, test_data):

    kfolds = 5
    train_vecs = DataFrame()
    test_vecs = DataFrame()

    for vec, vec_name in vectorizers:
        vectorizer = vec
        full = vectorizer.fit_transform(np.hstack((train_data.text.values, test_data.text.values)))

        X_train_raw = vectorizer.transform(train_data.text.values)
        X_test_raw = vectorizer.transform(test_data.text.values)

        normalized = Normalizer()
        normalized.fit_transform(full)
        X_train = normalized.transform(X_train_raw)
        X_test = normalized.transform(X_test_raw)

        pred_full_test = 0
        pred_train = np.zeros([train_data.shape[0], len(train_data.Label.unique())])

        for dev_index, val_index in KFold(n_splits=kfolds, shuffle=True, random_state=42).split(train_data):
            dev_X, val_X = X_train[dev_index], X_train[val_index]
            dev_y, val_y = y_train[dev_index], y_train[val_index]
            model = MultinomialNB()
            model.fit(dev_X, dev_y)
            pred_full_test = pred_full_test + model.predict_proba(X_test)
            pred_train[val_index,:] = model.predict_proba(val_X)

        pred_full_test = pred_full_test / kfolds

        train_vecs[vec_name + 'Zero'] = pred_train[:,0]
        train_vecs[vec_name + 'One'] = pred_train[:,1]
        test_vecs[vec_name + 'Zero'] = pred_full_test[:,0]
        test_vecs[vec_name + 'One'] = pred_full_test[:,1]
        
    return train_vecs, test_vecs

In [None]:
train_vecs, test_vecs = create_proba_features(train, test)

In [None]:
f1_score(y_test, make_xgboost_predictions(train_vecs, test_vecs).argmax(axis=1))  # scikit-learn convention: y_true first

Surprisingly, the F1-score is not as high as with the simpler features. Maybe the texts are too short, and it is not possible to construct an adequate model of semantics from co-occurrence matrices. Let's try some other techniques.

## 2.3 Decomposed TF-IDF Vectors

I will give the TF-IDF vectorizer another chance by pushing its vectors directly into the gradient boosting. The matrix is decomposed with truncated SVD into 20 components, so the number of features stays small.

In [None]:
def create_svd_features(train_data, test_data):
    svd_components = 20

    vectorizer = TfidfVectorizer(ngram_range=(1,2), analyzer='word')
    full = vectorizer.fit_transform(np.hstack((train_data.text.values, test_data.text.values)))
    X_train = vectorizer.transform(train_data.text.values)
    X_test = vectorizer.transform(test_data.text.values)

    svd = TruncatedSVD(n_components=svd_components, algorithm='arpack')
    svd.fit(full)
    train_svd = DataFrame(svd.transform(X_train))
    test_svd = DataFrame(svd.transform(X_test))

    train_svd.columns = ['SVD' + str(i) for i in range(svd_components)]
    test_svd.columns = ['SVD' + str(i) for i in range(svd_components)]
    
    return train_svd, test_svd
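
The TF-IDF weights that are decomposed above can also be computed by hand on a toy corpus. The sketch below assumes scikit-learn's default settings as I understand them: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of every document. The documents and the helper name `tfidf` are made up for illustration.

```python
from collections import Counter
from math import log, sqrt

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: c * (log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Hypothetical two-document corpus.
docs = [["oil", "price", "rise"], ["oil", "price", "fall"]]
weights = tfidf(docs)
print(weights[0])
```

Terms that occur in every document ("oil", "price") get a lower weight than terms that distinguish a document ("rise", "fall"), which is exactly the property the classifier is supposed to exploit.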

In [None]:
train_svd, test_svd = create_svd_features(train, test)

In [None]:
f1_score(y_test, make_xgboost_predictions(train_svd, test_svd).argmax(axis=1))  # scikit-learn convention: y_true first

However, the F1-score is still not as high as I would like.

## 2.4 Decomposed Embedding Vectors

Previously, in the data exploration part, I already used word embeddings: real-valued representations of words learned from the neighbours of each word in a window sliding through the sentences of a corpus. Now I will try to replace the TF-IDF vectorizer with a Skip-Gram model, which is a more modern model of semantics and might give a better classification score.

In [None]:
def get_feature_vec(tokens, num_features, model):
    featureVec = np.zeros(shape=(1, num_features), dtype='float32')
    missed = 0
    for word in tokens:
        try:
            featureVec = np.add(featureVec, model.wv[word])  # word vectors live in model.wv
        except KeyError:
            missed += 1
            pass
    if len(tokens) - missed == 0:
        return np.zeros(shape=(num_features), dtype='float32')
    return np.divide(featureVec, len(tokens) - missed).squeeze()
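
The idea of `get_feature_vec` is to represent a whole text by the mean of the vectors of its known words. Here is a dependency-free sketch of the same idea on made-up two-dimensional embeddings; the dictionary and the helper name `mean_vector` are assumptions for illustration.

```python
def mean_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # no token is in the vocabulary: fall back to a zero vector
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical toy embeddings.
toy_embeddings = {"oil": [1.0, 0.0], "price": [0.0, 1.0]}
print(mean_vector(["oil", "price", "unknown"], toy_embeddings, dim=2))
print(mean_vector(["unknown"], toy_embeddings, dim=2))
```

Out-of-vocabulary words are skipped rather than counted as zeros, so a text with many rare words is not pulled towards the origin.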

In [None]:
def create_embedding_features(train_data, test_data):
    num_features = 100

    # sg=1 selects Skip-Gram, the model named in the text (sg=0 would train CBOW)
    model = Word2Vec([word_tokenize(text) for text in np.hstack((train_data.text.values, test_data.text.values))], min_count=4, size=num_features, window=4, sg=1, alpha=1e-4)

    train_embedding_vectors = []
    for i in train_data.text.values:
        train_embedding_vectors.append(get_feature_vec(word_tokenize(i), num_features, model))

    test_embedding_vectors = []
    for i in test_data.text.values:
        test_embedding_vectors.append(get_feature_vec(word_tokenize(i), num_features, model))

    train_w2v = DataFrame(train_embedding_vectors)
    test_w2v = DataFrame(test_embedding_vectors)
    
    train_w2v.columns = ['W2V' + str(i) for i in range(num_features)]
    test_w2v.columns = ['W2V' + str(i) for i in range(num_features)]
    
    return train_w2v, test_w2v

In [None]:
train_w2v, test_w2v = create_embedding_features(train, test)

In [None]:
f1_score(y_test, make_xgboost_predictions(train_w2v, test_w2v).argmax(axis=1))  # scikit-learn convention: y_true first

The score has increased a little bit. Well, maybe vector space models are simply unable to predict the dynamics of the DJIA.

## 2.5 Combining Features

In the end, I will combine all the feature matrices used before in order to obtain a better classification score.

In [None]:
X_train = concat([train_la, train_svd, train_vecs, train_w2v], axis=1)
X_test = concat([test_la, test_svd, test_vecs, test_w2v], axis=1)
print('Feature matrix consists of {} features'.format(len(X_train.columns.values)))

In [None]:
print(f1_score(y_test, make_xgboost_predictions(X_train, X_test).argmax(axis=1)))  # scikit-learn convention: y_true first

In [None]:
topf=200
features = X_train.columns.values
model_xgb = xgb.XGBClassifier()
model_xgb.fit(X_train, y_train)
# sort ascending and keep the last `topf` pairs, i.e. the most important features
x, y = (list(x) for x in zip(*sorted(zip(model_xgb.feature_importances_, features), reverse = False)[-topf:]))
trace2 = go.Bar(
    x=x ,
    y=y,
    marker=dict(
        color=x,
        colorscale = 'Viridis',
        reversescale = True
    ),
    name='Feature importance for XGBoost',
    orientation='h',
)

layout = dict(
    title='Barplot of TOP-{} Features importances for XGBoost'.format(topf),
    width = 1000, height = 1000,
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
    ))

fig1 = go.Figure(data=[trace2])
fig1['layout'].update(layout)
py.iplot(fig1, filename='plots')

I will also use all the kinds of morphologically normalized text data created at the start of this notebook. This little experiment will help us find out whether morphological preprocessing helps to predict the dynamics of stock prices more accurately.

In [None]:
preprocessed_data = [
                    (train, test, 'raw'),
                    (train_lem, test_lem, 'lem'),
                    (train_stem, test_stem, 'stem'),
                    (train_lem_stem, test_lem_stem, 'lem+stem')
]

In [None]:
for train, test, preprocess in preprocessed_data:
    train_la, test_la = create_linguistic_features(train, test)
    train_proba, test_proba = create_proba_features(train, test)
    train_svd, test_svd = create_svd_features(train, test)
    train_w2v, test_w2v = create_embedding_features(train, test)

    X_train = concat([train_la, train_svd, train_proba, train_w2v], axis=1)
    X_test = concat([test_la, test_svd, test_proba, test_w2v], axis=1)
    
    predictions = make_xgboost_predictions(X_train, X_test).argmax(axis=1)  # predict once, score twice
    print('F1 with preprocessing by {} is {:0.4f}'.format(preprocess, f1_score(y_test, predictions)))
    print('Accuracy with preprocessing by {} is {:0.4f}\n'.format(preprocess, accuracy_score(y_test, predictions)))

However, the combination of all the features used in this notebook was unable to outperform the score obtained with the simple linguistic features, even though the feature importance diagram shows that some vector space features were more important than the linguistic features in this task.

It is also observable that morphological normalization didn't help to increase classification performance. This is not so surprising, since lemmatization and stemming mostly pay off for languages with highly fusional morphology, like Russian. Maybe morphological preprocessing only adds noise to English data.

All in all, in this little study we explored some interesting characteristics of our data and tried to use different kinds of features to predict stock market prices from news texts. The predictions weren't very successful, but the accuracy is higher than 0.5, so hypothetically it is possible to profit from trading on a stock market with a simple model trained on news texts.

Thanks for reading this kernel, and I am grateful to [Ekaterina Chernyak](https://github.com/echernyak) for the very nice machine learning course for which this study was done. Natural Language Processing is fun!