## What is the role of media coverage in explaining stock market fluctuations?

### Imports

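We import NumPy, Plotly's I/O module and the project's helper modules in `util`, which cover data loading, financial metrics, plots, Quotebank exploration, Wikipedia data, Apple stores and the predictive model. The `autoreload` extension reloads these modules whenever their source changes.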

In [None]:
import numpy as np
from util.dataloader import *
from util.finance import *
from util.plots import *
from util.quotebankexploration import *
from util.wikipedia import *
from util.apple_stores import *
from util.predictive_model import *
import plotly.io as pio
import pandas as pd  # used later for pd.read_pickle

%load_ext autoreload
%autoreload 2

In [None]:
pio.renderers.default = "notebook_connected"

# <a class="anchor" id="TOC"></a> Table of Contents

***

* 0. [Introduction](#intro)
* 1. [Loading the datasets](#sect_1)
    * 1.1 [Apple stock : yFinance API](#sect_1_1)
    * 1.2 [Quotebank dataset](#sect_1_2)
* 2. [First look into the Apple stock market and related quotes](#sect_2)
    * 2.1 [Observations on the volatility, liquidity traded and media discussion](#sect_2_1)
    * 2.2 [Finding patterns and studying correlated features](#sect_2_2)
* 3. [Sentiment Analysis of Apple related quotes](#sect_3)
* 4. [Impact of the number of pageviews on the speakers' Wikipedia page](#sect_4)
    * 4.1 [Wiki labels and wiki ID](#sect_4_1)
    * 4.2 [Number of wiki page views](#sect_4_2)
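    * 4.3 [Exact label for each quote](#sect_4_3)
    * 4.4 [Scoring quotes](#sect_4_4)
    * 4.5 [Get back to sentiment analysis](#sect_4_5)
    * 4.6 [Final scoring](#sect_4_6)
    * 4.7 [Final plot](#sect_4_7)
    * 4.8 [Distribution of the quotes according to their valence and to the fame of the speaker](#sect_4_8)
* 5. Building a model for stock market prediction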

# 0. Introduction <a class="anchor" id="intro"></a>
[Back to table of contents](#TOC)

***

This notebook walks through the thought process behind this project, which looks for patterns and correlations between the stock market and quotes in the media. We focus on the Apple stock, as Apple is one of the most valuable companies on earth and is widely covered in the media.

# 1. Loading the datasets <a class="anchor" id="sect_1"></a>
[Back to table of contents](#TOC)

***

In this project we study three datasets to gain insight into the evolution of the stock price, the coverage of Apple in the media, and its relationship with the various speakers. This first section is dedicated to loading the datasets and preprocessing the relevant information.

### 1.1 Apple stock : yFinance API <a class="anchor" id="sect_1_1"></a>
[Back to table of contents](#TOC)

yfinance is an open-source Python library that retrieves data from Yahoo Finance and gives easy access to various financial metrics for most stocks on the market. The _ticker_ of the Apple stock is _AAPL_, and we first focus on a date range from 2008 to 2020. We also add a proxy for the daily volatility, the ``Liquidity`` field, which is the daily volume multiplied by the average daily stock price. This indicator, expressed in dollars, gives a quick overview of how much Apple stock is traded in a day. Throughout this notebook we treat a day of high liquidity as a day of high volatility.
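
The implementations of `load_stock` and `high_volatility` are not shown in this notebook, so the following is only a minimal sketch of the idea on synthetic data. It assumes that the average daily price is the midpoint of `High` and `Low`, and that a day is flagged when its liquidity exceeds the 98% quantile of its own year. The column name `High_Volatility` and both helper functions are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the Yahoo Finance output (made-up values).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2019-01-01", "2020-12-31")
prices = pd.DataFrame({
    "Date": dates,
    "High": 100 + rng.random(len(dates)),
    "Low": 99 + rng.random(len(dates)),
    "Volume": rng.integers(1_000_000, 5_000_000, size=len(dates)),
})


def add_liquidity(df):
    """Liquidity in dollars: daily volume times an average daily price."""
    out = df.copy()
    # Assumption: the average daily price is the midpoint of High and Low.
    out["Liquidity"] = out["Volume"] * (out["High"] + out["Low"]) / 2
    return out


def flag_high_volatility(df, quantile=0.98):
    """Flag the days whose liquidity exceeds the given quantile of their year."""
    out = df.copy()
    yearly_threshold = out.groupby(out["Date"].dt.year)["Liquidity"].transform(
        lambda s: s.quantile(quantile)
    )
    out["High_Volatility"] = out["Liquidity"] > yearly_threshold
    return out


flagged = flag_high_volatility(add_liquidity(prices))
print(flagged.loc[flagged["High_Volatility"], ["Date", "Liquidity"]])
```

Computing the threshold per year rather than over the whole period keeps early years from being hidden by the growth of the traded dollar volume over time.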

In [None]:
# Using yFinance API we load various metrics of the Apple stock ranging from 2008 to 2020
stock = load_stock("AAPL", 2008, 2020)

# We set a day of high volatility as a day among the highest 2% of liquidity in that year.
stock = high_volatility(stock, quantile = 0.98)

display(stock)

### 1.2 Quotebank Dataset <a class="anchor" id="sect_1_2"></a>
[Back to table of contents](#TOC)

The Quotebank dataset is a large text corpus of more than 178 million quotations scraped from 337 websites. As we focus on Apple-related quotations, we applied several filters, which reduced the number of quotes to 310,816. Some of the techniques employed are:
1. White list of words that should be contained in quotes: _Apple_, _iPhone_, _Macbook_ etc.
2. Black list of words that should not be contained: _Mac n Cheese_, _apple_, _Big Apple_ etc.
3. White list of speakers, as we also included speakers related to Apple regardless of the aforementioned white words: _Steve Jobs_, _Tim Cook_, _Steve Wozniak_ etc.

This final dataset of Apple-related quotes has been saved in a pickle (`pkl`) file for easier manipulation, and can be accessed with the `get_filtered_quotes` function.
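
The three filters above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual filtering code: the lists are shortened, matching is case-sensitive on whole words (so that _Apple_ and _apple_ are told apart), and quotes from white-listed speakers are kept unconditionally. The function `is_apple_related` is hypothetical; `quotation` and `speaker` are the Quotebank column names.

```python
import re

import pandas as pd

# Shortened, illustrative lists; the project's real lists are longer.
WHITE_WORDS = ["Apple", "iPhone", "Macbook"]
BLACK_WORDS = ["Mac n Cheese", "apple", "Big Apple"]
WHITE_SPEAKERS = {"Steve Jobs", "Tim Cook", "Steve Wozniak"}


def _word_pattern(words):
    # Case-sensitive whole-word match, so "Apple" and "apple" are distinct.
    return re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")


WHITE_RE = _word_pattern(WHITE_WORDS)
BLACK_RE = _word_pattern(BLACK_WORDS)


def is_apple_related(quotation, speaker):
    """Keep a quote if its speaker is white-listed, or if it contains a
    white-listed word and no black-listed word."""
    if speaker in WHITE_SPEAKERS:
        return True
    return bool(WHITE_RE.search(quotation)) and not BLACK_RE.search(quotation)


sample = pd.DataFrame({
    "quotation": [
        "The new iPhone is impressive.",
        "I love the Big Apple in the fall.",
        "We try to stay focused.",
        "An apple a day keeps the doctor away.",
    ],
    "speaker": ["Jane Doe", "John Roe", "Tim Cook", "Jane Doe"],
})
mask = [is_apple_related(q, s) for q, s in zip(sample["quotation"], sample["speaker"])]
filtered = sample[mask]
print(filtered)
```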

In [None]:
quotes = get_filtered_quotes()

display(quotes)

# 2. First look into the Apple stock market and related quotes <a class="anchor" id="sect_2"></a>
[Back to table of contents](#TOC)

One objective of this project is to find patterns and events in both the stock market and media discussion related to Apple.

***

### 2.1 Observations on the volatility, liquidity traded and media discussion <a class="anchor" id="sect_2_1"></a>
[Back to table of contents](#TOC)

To do so, we need metrics and visualizations of the aforementioned data. To find days of interest in the stock market, we use the _Liquidity_, which is the mean daily price multiplied by the daily volume traded. This indicator gives an intuition of the dollar amount of $AAPL stock exchanged in a day. We treat a day of high liquidity as a day of high volatility, which may indicate a particular event related to Apple.

In the following plot we highlight, for each year, the top 2% of days by liquidity. Recurring patterns appear that will be studied further, such as the iPhone events in September and the quarterly reports.

In [None]:
weekly_liquidity(stock, quantile=0.98)

We then compare these high-liquidity days with the daily price of the $AAPL stock. We highlight the same days of high volatility as in the previous figure, this time next to the stock price instead of the liquidity. The intuition is that a day of high volatility may have repercussions on the stock price, either as a fall or as a rebound.

In [None]:
daily_stock_price(stock)

Finally, we look at the filtered Quotebank dataset, which only contains Apple-related quotes. As in the previous plots, we highlight the yearly top 2% of days with the most quotes. Again we observe patterns that will be dissected in the next sections: for example, a yearly spike in September when the new iPhone is released, and another in June during the yearly developer conference. Most notably, the highest spike, on 6 October 2011, follows the death of Steve Jobs the day before, which was widely covered in the media.

In [None]:
daily_quotes(quotes, quantile = 0.98)

### 2.2 Finding patterns and studying correlated features <a class="anchor" id="sect_2_2"></a>
[Back to table of contents](#TOC)

From the above observations, we had the intuition that each quarterly report released by Apple coincides with a day of high volatility. A quarterly report is a summary or collection of unaudited financial statements, such as gross revenue, net profit, operational expenses, and cash flow. As there are 252 trading days in a year and 4 quarterly reports per year, we expect high-liquidity days to recur roughly every $252/4 = 63$ trading days. This is in fact validated by the next cell, which performs a seasonal analysis over a wide range of candidate periods (in days) and keeps the periods with the lowest p-value. Later on we will see whether this pattern is also observable in the Apple-related quotes.
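
The internals of `seasonal_analysis` are not shown here. As a simplified stand-in, the sketch below scans candidate periods with a plain lag autocorrelation instead of a p-value based test, on synthetic liquidity with a spike every $252/4 = 63$ trading days. All values are made up and the function `lag_autocorrelation` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 2 * 252                              # two trading years
report_period = 252 // 4                      # 63 trading days between reports
liquidity = rng.normal(1.0, 0.05, n_days)     # made-up baseline liquidity
liquidity[::report_period] += 3.0             # spikes on "report" days


def lag_autocorrelation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag` days."""
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]


# Scan candidate periods and keep the one with the strongest autocorrelation.
candidate_lags = range(40, 101)
scores = {lag: lag_autocorrelation(liquidity, lag) for lag in candidate_lags}
best_lag = max(scores, key=scores.get)
print(best_lag, round(scores[best_lag], 3))
```

The candidate range is deliberately limited to 40 to 100 days: multiples of the true period, such as 126 days, would score almost as high and would need to be ruled out separately.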

In [None]:
seasonal_analysis(stock[stock.Date.dt.year.isin([2018,2019])], column="Liquidity")

What we finally observe is a positive correlation between the number of daily quotes related to Apple and the daily liquidity: the Pearson correlation coefficient is approximately 0.3 and the p-value is very small. An increase in liquidity therefore tends to be associated with an increase in discussion related to Apple, and the other way around.
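
For reference, a Pearson coefficient and its p-value can be obtained with `scipy.stats.pearsonr` once both series are aligned on the date. The sketch below uses synthetic data driven by a shared hidden factor. The column name `n_quotes` and all values are assumptions for illustration, and `pearson_stock_quotes` may well proceed differently.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
dates = pd.bdate_range("2018-01-01", periods=250)
attention = rng.gamma(2.0, 1.0, len(dates))   # hidden driver shared by both series

# Made-up daily liquidity and daily quote counts.
stock_daily = pd.DataFrame({
    "Date": dates,
    "Liquidity": 1e9 * (1 + attention + rng.normal(0, 1, len(dates))),
})
quotes_daily = pd.DataFrame({
    "Date": dates,
    "n_quotes": rng.poisson(20 * (1 + attention)),
})

# Align both series on the trading date before correlating them.
merged = stock_daily.merge(quotes_daily, on="Date", how="inner")
r, p_value = pearsonr(merged["Liquidity"], merged["n_quotes"])
print(f"Pearson r = {r:.3f}, p-value = {p_value:.2e}")
```

The inner join matters in practice: quotes are published on weekends and holidays too, whereas liquidity only exists on trading days.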

In [None]:
stock_price_with_quotes(stock,quotes)

In [None]:
pearson_stock_quotes(stock[stock.Date.dt.year.isin(range(2015,2019))],quotes)

# 3. Sentiment Analysis of Apple related quotes <a class="anchor" id="sect_3"></a>
[Back to table of contents](#TOC)

***

A further study that we can perform on the set of Apple-related quotes is a sentiment analysis. Distinguishing sentiments may help to understand the semantic orientation of the quotes during a period of time and to find events of interest. We use [VADER](https://www.nltk.org/howto/sentiment.html), a lexicon- and rule-based sentiment analysis tool that can classify each quote as positive, neutral or negative. More precisely, the tool gives a score in the $[-1,1]$ range, with $1$ being very positive, $-1$ very negative and $0$ neutral. In this section we look at the correlation between the sentiment of those quotes and the stock market data.
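
With NLTK, the score in $[-1,1]$ is the `compound` entry of the dictionary returned by `SentimentIntensityAnalyzer().polarity_scores(text)`, which requires the `vader_lexicon` resource to be downloaded first. How `predict_sentiment` turns this score into a class is not shown in this notebook. The sketch below assumes the thresholds of $\pm 0.05$ conventionally recommended for VADER, and the function `label_from_compound` is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment class.

    Assumption: the conventional +/-0.05 thresholds; the notebook's own
    cut-offs may differ.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hand-picked compound scores for illustration, not real VADER outputs.
for score in (0.62, 0.01, -0.48):
    print(score, label_from_compound(score))
```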


In [None]:
quotes_sentiments = predict_sentiment(quotes)

Furthermore, we can quantify the correlation between each sentiment and the observed market liquidity. As in the previous section, we compute the Pearson correlation between the number of positive, negative or neutral quotes and the liquidity. We found a correlation of 0.3154 with all quotes regardless of the sentiment; here we expect a lower score for negative quotes than for positive or neutral ones.



In [None]:
correlation_stock_sentiment(quotes_sentiments,stock)

Looking at the computed correlations, we indeed find a lower score of 0.1653 for negative quotes, compared to 0.2146 and 0.2192 for positive and neutral quotes respectively. As all those values are positive, an increase in the number of quotations is associated with an increase in the liquidity, and hence the volatility, of the Apple stock, regardless of the sentiment. Most importantly, positive quotes have a greater correlation than negative quotes. One hypothesis is that negative quotes may push holders to sell a higher volume than usual, while positive quotes have a stronger influence on trading activity because the market may be more willing to buy.


In [None]:
fig_all_sentiments(quotes_sentiments,stock)

# 4. Impact of the number of pageviews on the speakers' Wikipedia page <a class="anchor" id="sect_4"></a>
[Back to table of contents](#TOC)

***

Previously, we looked at the impact of the valence of Apple-related quotes on the stock market. Let's add more depth to our analysis of the impact of the media. What is the impact of the speaker of a quote? A simple answer is that it depends on how well known the speaker was at the time he or she was quoted in the media. A simple way to measure this is the number of pageviews of the speaker's Wikipedia page (if there is one!).

#### 4.1 Wiki labels and wiki ID <a class="anchor" id="sect_4_1"></a>
[Back to table of contents](#TOC)

Before accessing the pageview statistics, we need the exact label of each speaker's Wikipedia page. For this purpose, we load the following dataset, which links each speaker to their Wikipedia page.

In [None]:
# Get the wiki labels with its corresponding wiki ID
wiki_labels = get_wiki_labels()[['id', 'label']]
display(wiki_labels.head(5))

#### 4.2 Number of wiki page views <a class="anchor" id="sect_4_2"></a>
[Back to table of contents](#TOC)

Next, we need a way to get the yearly number of pageviews of each speaker. We use the package `pageviewapi`, which gives access to Wikipedia pageviews since 2015. We would have liked to have the pageviews since 2008, but the only tool we found for these older data was a third-party website that no longer worked. We therefore focused our study on the period between 2015 and 2020, and designed a function which returns the number of pageviews for a specific Wikipedia page and year.
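
The aggregation step behind such a function can be sketched as follows. The sketch assumes that the daily counts arrive in the shape of a Wikimedia Pageviews REST response, that is, a list of items each holding a `timestamp` of the form `YYYYMMDDHH` and a `views` count. The function `yearly_views` and the sample numbers are hypothetical, and no network call is made.

```python
from collections import defaultdict


def yearly_views(items):
    """Sum daily view counts per calendar year."""
    totals = defaultdict(int)
    for item in items:
        year = int(item["timestamp"][:4])     # e.g. "2015123100" -> 2015
        totals[year] += item["views"]
    return dict(totals)


# Made-up daily counts shaped like a pageviews response.
sample_items = [
    {"timestamp": "2015123000", "views": 120},
    {"timestamp": "2015123100", "views": 80},
    {"timestamp": "2016010100", "views": 50},
]
print(yearly_views(sample_items))
```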

In [None]:
# Get the page views for the Steve Jobs wikipedia page in 2015
speaker = 'Steve Jobs'
year = 2015
print(get_pageviews_per_year(speaker, year))

#### 4.3 Exact label for each quote <a class="anchor" id="sect_4_3"></a>
[Back to table of contents](#TOC)

The idea now is to get the exact label of the speaker of every quote. However, a quote can have more than one QID! Indeed, the name of the speaker can sometimes be confused with that of another person, which gives two or more QIDs in the list for that quote. To deal with this issue, we look at all the QIDs of each quote and keep the label of the speaker with the highest total number of pageviews.

Before that, we recover the speakers' IDs for all the quotes that are not anonymous.

In [None]:
# Get the speakers' ID
speakers_id = get_speakers_ids(quotes)
speakers_id_sample = speakers_id.head(5)
display(speakers_id_sample)

Here we can clearly see that for one speaker, there might be more than one QID. Thus, in the following cell, we recover for all the speakers the true label in Wikipedia, i.e. the one for which the number of pageviews is the highest.
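
The selection rule can be sketched in a few lines. This is not the code of `find_labels`, which is not shown: the function `pick_label`, the placeholder QIDs, the labels and the view counts are all made up for illustration.

```python
def pick_label(qids, label_by_qid, total_views_by_label):
    """Among the candidate QIDs of a quote, return the label whose
    Wikipedia page has the most total pageviews (None if no label is known)."""
    candidates = [label_by_qid[q] for q in qids if q in label_by_qid]
    if not candidates:
        return None
    return max(candidates, key=lambda label: total_views_by_label.get(label, 0))


# Placeholder identifiers and made-up view counts.
label_by_qid = {"QID_A": "Jane Doe", "QID_B": "Jane Doe (painter)"}
total_views_by_label = {"Jane Doe": 1_200_000, "Jane Doe (painter)": 3_500}

print(pick_label(["QID_A", "QID_B"], label_by_qid, total_views_by_label))
```

The rule is a heuristic: it assumes that, among homonyms, the most viewed person is the one the media most likely quoted.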

**Remark :** The following cell shows how the process works on a sample of the full dataframe `quotes`. We do not run the whole filtering below because it takes about 22 hours. Instead, we ran it once on multiple clusters and saved the results in a pickle file.

In [None]:
# Get the right label of the speaker of each quote.
# It adds a new column `label` in quotes dataframe containing the wiki label of the speaker.
speakers_labels_sample = find_labels(speakers_id_sample, wiki_labels)
display(speakers_labels_sample)

# Get the whole data set from a .pkl file
speakers_labels = get_speakers_labels()

#### 4.4 Scoring quotes <a class="anchor" id="sect_4_4"></a>
[Back to table of contents](#TOC)

We now have the exact label of the speaker of every quote. Next, we want the Wikipedia pageview statistics of every speaker for the year in which the quote was published. We take the label and the year of each quote and use the function `get_pageviews_per_year` to add a new column `score` to our dataframe.

**Remark :** As in the previous subsection, the cell below runs the process on a small sample, because the run on the whole dataset is too long (around 8 hours). Because the running time is so long, we split the steps so that they could run on multiple computers. Eventually, we obtained the number of pageviews of each speaker for every year between 2015 and 2020.

In [None]:
# Apply the process on a small sample
speakers_pageviews_sample = get_speakers_pageviews_per_year(speakers_labels_sample)
display(speakers_pageviews_sample)

# Load the whole data set from a pickle file
speakers_pageviews = get_speakers_pageviews()

Now, we directly join the year of each quote with the corresponding number of pageviews of its speaker. We then normalize the value to obtain a score for each quote (in absolute value for now).
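
A minimal sketch of this join, assuming a table with one row per speaker label and year. The normalization used by `get_score_quotes` is not specified in this notebook; dividing by the largest yearly pageview count is only one possible choice. All column names and numbers below are made up.

```python
import pandas as pd

toy_quotes = pd.DataFrame({
    "label": ["Tim Cook", "Tim Cook", "Jane Doe"],
    "date": pd.to_datetime(["2015-09-09", "2016-03-21", "2016-03-21"]),
})
# Made-up yearly pageviews, one row per (label, year).
toy_pageviews = pd.DataFrame({
    "label": ["Tim Cook", "Tim Cook", "Jane Doe"],
    "year": [2015, 2016, 2016],
    "pageviews": [4_000_000, 5_000_000, 50_000],
})

# Join each quote with the pageviews of its speaker in the year of the quote.
toy_quotes["year"] = toy_quotes["date"].dt.year
scored = toy_quotes.merge(toy_pageviews, on=["label", "year"], how="left")

# Assumption: normalize by the largest yearly count so that scores lie in [0, 1].
scored["score"] = scored["pageviews"].fillna(0) / toy_pageviews["pageviews"].max()
print(scored[["label", "year", "score"]])
```

With a left join, quotes whose speaker has no pageview record get a score of zero instead of being dropped.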

**Remark :** This step does not use the `pageviewapi` package, so it is fast enough to run directly on the whole dataframe.

In [None]:
# Add score column for our data frame
quotes_score = get_score_quotes(quotes, speakers_pageviews)
display(quotes_score.head(5))

#### 4.5 Get back to sentiment analysis <a class="anchor" id="sect_4_5"></a>
[Back to table of contents](#TOC)

We now have an absolute score for each quote, which represents the notoriety of its speaker. Can we combine this information with the sentiment analysis of the previous section? We reuse the function designed in Section 3 to get the valence of each quote (+1 for positive, -1 for negative and 0 for neutral). The following cell adds a column `sentiment` to our dataframe.

In [None]:
# Add the column sentiment
quotes_sentiment = get_sentiment_quotes(quotes_score)
display(quotes_sentiment.head(5))

#### 4.6 Final scoring <a class="anchor" id="sect_4_6"></a>
[Back to table of contents](#TOC)

Finally, we multiply the valence of each quote by its absolute score to get a positive and a negative score. We first create two columns, `negative_score` and `positive_score`, for each quote.

**Example :** If a quote has a negative valence, the column `negative_score` contains the valence (-1) multiplied by the absolute score, and the column `positive_score` is equal to zero.

In [None]:
# Add positive and negative score
quotes_neg_pos_score = get_neg_pos_score_quotes(quotes_sentiment)
display(quotes_neg_pos_score.head(5))

Then, we sum the scores for each day. The idea is to identify the days on which a lot of famous people talked about Apple positively or negatively. We keep only the columns `date`, `positive_score` and `negative_score` for the final plot. The following cell shows how the process is applied.
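
Both steps, the signed scores and the daily sum, can be sketched with pandas on a toy dataframe. The column names follow the text above; the values are made up, and the actual helpers `get_neg_pos_score_quotes` and `get_score_date` may differ in their details.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-02-17", "2016-02-17", "2016-02-17", "2016-02-18"]),
    "score": [0.8, 0.5, 0.2, 0.4],        # notoriety of the speaker
    "sentiment": [1, -1, 0, -1],          # valence of the quote
})

# Valence times score, split into a positive and a negative column.
signed = toy["sentiment"] * toy["score"]
toy["positive_score"] = signed.clip(lower=0)
toy["negative_score"] = signed.clip(upper=0)

# Sum both columns per day.
daily = toy.groupby("date", as_index=False)[["positive_score", "negative_score"]].sum()
print(daily)
```

Neutral quotes contribute zero to both columns, so a day only stands out when well-known speakers express a clear sentiment.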

In [None]:
# Get the negative and positive scores for every day
score_date = get_score_date(quotes_neg_pos_score)
display(score_date.head(5))

#### 4.7 Final plot <a class="anchor" id="sect_4_7"></a>
[Back to table of contents](#TOC)

After all these steps, we can finally visualize our results. We plot the `positive_score` and the `negative_score` as a function of the `date`, together with the Apple stock price on the same figure. The goal is to see whether there is a visible correlation between the positive and negative scores and the stock price.

In [None]:
stock_price_against_quotes_score(score_date, stock)

#### 4.8 Distribution of the quotes according to their valence and to the fame of the speaker <a class="anchor" id="sect_4_8"></a>
[Back to table of contents](#TOC)

In the following plot, we visualize the distribution of the quotes according to their valence and to the fame of the speaker. We chose 6 major events that were widely covered by the media between 2015 and 2020, picked from the top 2% of days, for which we were able to find a relevant theme, e.g. the “FBI-Apple encryption dispute”, the “Release of the iPhone X”, etc. Quotations are plotted according to their valence and to the fame of their speaker.

In [None]:
plot_distrib_val_fame(quotes_sentiment)

In [None]:
plot_wordcloud_speakers(quotes,speakers_pageviews)

----
# 5. Building a model for stock market prediction
Using the quotes related to Apple, the speakers' data and past stock performance, we can make a first attempt at predicting the daily stock price and the liquidity. For this section we use Facebook's Prophet library, which provides powerful and easy-to-use forecasting tools. At its core, the model is a modular additive regression model that can take into account past performance and additional factors.

In [None]:
quotes_sentiment = pd.read_pickle("data/quotes_score.pkl")

prediction_frame = build_prediction_frame(stock[stock.Date.dt.year.isin(range(2015,2018))],quotes_sentiment)

m = fit_prophet(Prophet(changepoint_prior_scale=0.05, seasonality_prior_scale=0.1), prediction_frame, features=['positive','negative','total'], response='Open')
pred = predict_future(m,prediction_frame,feature_frame=quotes_sentiment)

In [None]:
plot_prediction(stock, quotes_sentiment,pred)

In [None]:
# Performance evaluation
df_cv = cross_validation(m, initial='150 days', period='30 days', horizon = '60 days',parallel="processes")
df_p = performance_metrics(df_cv)

print("Mean absolute percentage error in a first week horizon", df_p["mape"].values[0])
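
For reference, the mean absolute percentage error (MAPE) reported above is the average of $|y - \hat{y}| / |y|$ over the evaluated points, expressed as a fraction rather than a percentage. A minimal NumPy sketch with made-up prices follows; the function `mape` is illustrative and is not Prophet's implementation.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


# Made-up opening prices and forecasts.
print(mape([100.0, 200.0], [110.0, 180.0]))
```

Because every error is divided by the true value, the metric is undefined when a true value is zero, which is not a concern for stock prices.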


In [None]:
times_series_predict(stock, quotes_sentiment, features = None, response = 'Open')

In [None]:
param_grid = {  
  'changepoint_prior_scale': np.linspace(0.2,2,10) ,
  'seasonality_prior_scale': np.logspace(-2,1,10),
}

tuning_results = prophet_cross_validation(param_grid, stock, quotes_sentiment, features = ['positive','negative','total'], response = 'Open', metric = 'mape')
tuning_results[tuning_results.mape == tuning_results.mape.min()]
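
The grid above contains $10 \times 10 = 100$ hyperparameter combinations. How `prophet_cross_validation` iterates over them is not shown in this notebook. One common way, sketched here as an assumption, is to expand the grid with `itertools.product` and to cross-validate one model per combination.

```python
import itertools

import numpy as np

param_grid = {
    "changepoint_prior_scale": np.linspace(0.2, 2, 10),
    "seasonality_prior_scale": np.logspace(-2, 1, 10),
}

# One dictionary of keyword arguments per combination of the grid.
all_params = [
    dict(zip(param_grid.keys(), values))
    for values in itertools.product(*param_grid.values())
]

print(len(all_params))
print(all_params[0])
```

Each dictionary can then be passed as keyword arguments when constructing a model, and the combination with the lowest cross-validated MAPE is kept.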