# Using Twitter Moods to predict the FTSE 100

The aim of this project is to determine whether measurements of collective mood derived from Twitter feeds are correlated with, and predictive of, the value of the FTSE 100 index over time. I will collect a daily sample of tweets from the UK, along with the sentiment and mood of those tweets, and the daily closing price of the FTSE 100 index.

After analysing the tweets, I will use a neural network to predict the future value of the FTSE 100 index from the Twitter mood, and measure its accuracy with the mean absolute percentage error (MAPE): the average absolute difference between the actual value and the forecast, expressed as a percentage of the actual value.
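
To make the metric concrete, here is a minimal sketch of MAPE in plain Python. The function name `mape` and the sample values are illustrative assumptions, not taken from the project's data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    # Each term is the absolute error relative to the actual value;
    # an actual value of 0 would raise ZeroDivisionError.
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Made-up index closes and forecasts, for illustration only
actual_close = [7000.0, 7100.0, 7050.0]
forecast = [7070.0, 7029.0, 7050.0]
print(mape(actual_close, forecast))
```

A lower value means the forecasts sit closer to the actual closing prices.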

My hypothesis is that Twitter mood will be in some way predictive of movements in the FTSE 100 index.

This piece of work is inspired by Johan Bollen et al. Their paper can be found at https://arxiv.org/pdf/1010.3003.pdf

### Collecting a year's worth of tweets from Twitter

I originally planned to use the Twitter API to collect the tweets, but the API has several restrictions: you can only retrieve historical tweets going back 7 days, and there is a limit on how many requests you can make every 15 minutes.

I looked into using BeautifulSoup to scrape the search data from Twitter, but Ahmet Taspinar (http://ataspinar.com/) has already created a library that scrapes data from Twitter and returns the results as JSON. The library is called twitterscraper, and it can be installed directly with pip (`pip install twitterscraper`).

Running twitterscraper from the command line for the top 20 tweets containing the word "Trump" saves them to a file called 'test.json', which looks like:

In [8]:
import json

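# Pretty-print the first `limit` records of a JSON file (all records if None)
def pretty_print(path, limit=None):
    with open(path) as open_file:
        file_parsed = json.load(open_file)
    # print() with a single argument works in both Python 2 and Python 3
    print(json.dumps(file_parsed[:limit], indent=4, sort_keys=True))

# Only show 1 example for this demo
pretty_print('test.json', 1)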

[
    {
        "fullname": "Betsy Holahan", 
        "id": "857296620395909120", 
        "text": "White House unveils dramatic plan to overhaul tax code in major test for Trump http://wapo.st/2q8phNF?tid=ss_tw\u00a0\u2026", 
        "timestamp": "2017-04-26T18:13:36", 
        "user": "GreatPointStrat"
    }
]
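
Because the analysis works on daily groups of tweets, each record's `timestamp` field can be parsed and used as a grouping key. The following is a minimal sketch, assuming records shaped like the sample above; `group_by_day` is a hypothetical helper, not part of twitterscraper, and the sample records are made up.

```python
from collections import defaultdict
from datetime import datetime


def group_by_day(records):
    """Map each calendar date to the list of tweet texts posted that day."""
    by_day = defaultdict(list)
    for record in records:
        # Timestamps in the sample look like "2017-04-26T18:13:36"
        posted = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%S")
        by_day[posted.date()].append(record["text"])
    return dict(by_day)


# Made-up records with the same keys as the scraped JSON
sample = [
    {"timestamp": "2017-04-26T18:13:36", "text": "first tweet"},
    {"timestamp": "2017-04-26T23:59:59", "text": "second tweet"},
    {"timestamp": "2017-04-27T00:00:01", "text": "third tweet"},
]
for day, texts in sorted(group_by_day(sample).items()):
    print(day, len(texts))
```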


Now that we have tested twitterscraper, we can extend the command to include our search query and date range. We type the commands below into different instances of the terminal, and wait for our data to be collected.

After running the first command for 12 hours, I had only managed to collect 1.1 million tweets, which went back just 2 weeks...

We can try to use twitterscraper directly from within Python instead:

In [11]:
from datetime import date, timedelta
from twitterscraper import query_tweets

start_date = date(2016, 4, 28)
mid_date = date(2016, 4, 29)
end_date = date(2017, 4, 28)

delta = timedelta(days=1)

# Query one day at a time: tweets containing "i feel" between `since` and `until`
while mid_date <= end_date:
    since = start_date.strftime("%Y-%m-%d")
    until = mid_date.strftime("%Y-%m-%d")

    test_tweets = query_tweets("%22i%20feel%22%20since%3A" + since + "%20until%3A" + until)

    # Append to the output files (Python 3); `with` closes them even if a write fails
    with open('timestamp.txt', 'a', encoding='utf-8') as time_file, \
            open('tweets.txt', 'a', encoding='utf-8') as text_file:
        for i, tweet in enumerate(test_tweets, start=1):
            # keep only every 1000th tweet to reduce the amount of data written
            if i % 1000 == 0:
                time_file.write('%s \n' % tweet.timestamp)
                # flatten newlines and tabs so each tweet stays on one line
                text_file.write('%s \n' % tweet.text.strip().replace('\n', ' ').replace('\t', ' '))

    start_date += delta
    mid_date += delta
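
The percent-encoded query string in the loop above decodes to the search `"i feel" since:YYYY-MM-DD until:YYYY-MM-DD`. As a sketch, the same strings can be generated with the standard library instead of being encoded by hand. `daily_queries` is a hypothetical helper; it only builds the strings and does not call twitterscraper.

```python
from datetime import date, timedelta
from urllib.parse import quote


def daily_queries(first_day, last_day, phrase="i feel"):
    """Yield one percent-encoded search query per one-day window."""
    day = first_day
    while day < last_day:
        next_day = day + timedelta(days=1)
        raw = '"%s" since:%s until:%s' % (phrase, day.isoformat(), next_day.isoformat())
        # quote() encodes the double quotes, spaces and colons
        yield quote(raw)
        day = next_day


for query in daily_queries(date(2016, 4, 28), date(2016, 4, 30)):
    print(query)
```

Each yielded string could then be passed to `query_tweets` in place of the hand-built concatenation.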

I collected all of the data from 18 April 2017 to the most recent date but, for some reason, I struggled to get anything prior to that date. It just kept failing...
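
One generic way to make a long collection run more tolerant of intermittent failures is to retry each day's query a few times with a growing pause between attempts. This is a sketch under assumptions, not a diagnosis of the failure described here: `with_retries` is a hypothetical helper, and it assumes the failure surfaces as a Python exception, which may not be how twitterscraper reports it.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or `attempts` calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before trying again
            sleep(base_delay * 2 ** (attempt - 1))
```

For example, `with_retries(lambda: query_tweets(query))` would wrap a single day's query, where `query` is that day's search string.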