### Imports

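We import the numerical and plotting libraries, together with the project's own helper modules under `util` and the per-task scripts `task1` to `task4`.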

In [2]:
# Useful starting lines
import numpy as np
import seaborn as sns
from util.dataloader import *
from util.finance import *
from util.plots import *
from util.finance import stock, compare
from util.quotebankexploration import *
from util.wikipedia import *
import plotly.graph_objects as go
import plotly.io as pio
import plotly.express as px
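import pandas as pd  # used below for pd.read_pickle
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics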

from util.apple_stores import *

from task1 import *
from task2 import *
from task3 import *
from task4 import *

%load_ext autoreload
%autoreload 2

In [None]:
pio.renderers.default = "notebook_connected"

# <a class="anchor" id="TOC"></a> Table of Contents

***

* 0. [Introduction](#intro)
* 1. [Loading the datasets](#sect_1)
* 2. [First look into the Apple stock market and related quotes](#sect_2)
* 3. [Impact of the number of pageviews on the speakers' Wikipedia page](#sect_3)
    * 3.1 [Wiki labels and wiki ID](#sect_3_1)
    * 3.2 [Number of wiki page views](#sect_3_2)
    * 3.3 [Exact label for each quote](#sect_3_3)
    * 3.4 [Scoring quotes](#sect_3_4)
    * 3.5 [Get back sentiment analysis](#sect_3_5)
    * 3.6 [Final plot](#sect_3_6)

# 0. Introduction <a class="anchor" id="intro"></a>
[Back to table of contents](#TOC)

***

This notebook demonstrates the thought process that went into this project on identifying patterns and correlations between stock markets and media quotes. In particular, we focus on the Apple stock, as Apple is one of the most valuable companies on earth and is widely covered in the media.

# 1. Loading the datasets <a class="anchor" id="sect_1"></a>
[Back to table of contents](#TOC)

***

In this project we have studied three datasets to gain more insight into the evolution of the stock price, the coverage of Apple in the media, and its relationship with the various speakers. This first section is dedicated to loading the datasets and preprocessing the valuable information.

### Apple stock: yfinance API
The `yfinance` library gives easy access to Yahoo Finance data, including various financial metrics for most stocks on the market. The _ticker_ of the Apple stock is _AAPL_, and we focus on the date range from 2015 to 2020. We also add a ``Liquidity`` field as an indicator of daily volatility: the daily volume multiplied by the average daily stock price. Expressed in dollars, this indicator gives a quick overview of how much Apple stock was traded in a day, and we treat a day of high liquidity as a day of high volatility.

In [3]:
# Using the yfinance API, we load various metrics of the Apple stock ranging from 2015 to 2020
stock = load_stock("AAPL", 2015, 2020)

# We set a day of high volatility as a day among the highest 2% of liquidity in that year.
stock = high_volatility(stock, quantile = 0.98)

display(stock)

| | Date | Open | High | Low | Close | Adj Close | Volume | Liquidity | Yearly Percentile |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-08-31 | 28.007500 | 28.632500 | 28.000000 | 28.190001 | 25.846573 | 224917200 | 6.319892e+09 | Lower 98% |
| 1 | 2015-09-01 | 27.537500 | 27.969999 | 26.840000 | 26.930000 | 24.691313 | 307383600 | 8.371208e+09 | Lower 98% |
| 2 | 2015-09-02 | 27.557501 | 28.084999 | 27.282499 | 28.084999 | 25.750305 | 247555200 | 6.887295e+09 | Lower 98% |
| 3 | 2015-09-03 | 28.122499 | 28.195000 | 27.510000 | 27.592501 | 25.298744 | 212935600 | 5.931853e+09 | Lower 98% |
| 4 | 2015-09-04 | 27.242500 | 27.612499 | 27.127501 | 27.317499 | 25.046602 | 199985200 | 5.455596e+09 | Lower 98% |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1190 | 2020-05-22 | 78.942497 | 79.807503 | 78.837502 | 79.722504 | 78.955231 | 81803200 | 6.489652e+09 | Lower 98% |
| 1191 | 2020-05-26 | 80.875000 | 81.059998 | 79.125000 | 79.182503 | 78.420425 | 125522000 | 1.004537e+10 | Lower 98% |
| 1192 | 2020-05-27 | 79.035004 | 79.677498 | 78.272499 | 79.527496 | 78.762093 | 112945200 | 8.954437e+09 | Lower 98% |
| 1193 | 2020-05-28 | 79.192497 | 80.860001 | 78.907501 | 79.562500 | 78.796761 | 133560800 | 1.060172e+10 | Lower 98% |
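
The `Liquidity` and `Yearly Percentile` columns can be approximated as follows. This is a minimal sketch, not the project's `high_volatility` implementation: it assumes the average daily price is the mean of `Open` and `Close` (an assumption that is consistent with the values displayed above), and both the helper name `flag_high_liquidity` and the `Top 2%` label are hypothetical.

```python
import pandas as pd

def flag_high_liquidity(stock: pd.DataFrame, quantile: float = 0.98) -> pd.DataFrame:
    """Add a dollar-liquidity column and flag the most liquid days of each year."""
    out = stock.copy()
    # Assumption: average daily price = mean of Open and Close
    out["Liquidity"] = out["Volume"] * (out["Open"] + out["Close"]) / 2
    # Liquidity threshold computed separately for each calendar year
    threshold = out.groupby(out["Date"].dt.year)["Liquidity"].transform(
        lambda s: s.quantile(quantile)
    )
    top = f"Top {round((1 - quantile) * 100)}%"  # hypothetical label
    low = f"Lower {round(quantile * 100)}%"
    out["Yearly Percentile"] = (out["Liquidity"] > threshold).map({True: top, False: low})
    return out

# First two rows of the stock table, used only to check the Liquidity formula
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2015-08-31", "2015-09-01"]),
    "Open": [28.007500, 27.537500],
    "Close": [28.190001, 26.930000],
    "Volume": [224917200, 307383600],
})
flagged = flag_high_liquidity(demo)
print(flagged[["Date", "Liquidity", "Yearly Percentile"]])
```

With only two rows the yearly threshold is not meaningful; the demo only checks that the computed liquidity agrees with the displayed values.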


### Quotebank dataset
The Quotebank data is a large text corpus of more than 178 million quotations scraped from 337 websites. As we are focusing on Apple-related quotations, we applied several filters to reduce the number of quotes to 310,816. Some of the techniques employed:
1. White list of words that should be contained in quotes: _Apple_, _iPhone_, _Macbook_ etc.
2. Black list of words that should not be contained: _Mac n Cheese_, _apple_, _Big Apple_ etc.
3. White list of speakers, as we also included speakers related to Apple regardless of the aforementioned white words: _Steve Jobs_, _Tim Cook_, _Steve Wozniak_ etc.

This final dataset of Apple-related quotes has been saved in a `pkl` file to make it easier to manipulate, and can be accessed with the `get_filtered_quotes` function.

In [4]:
quotes = get_filtered_quotes()

display(quotes)

| | quoteID | quotation | speaker | qids | date | numOccurrences | probas | urls | phase |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-10-15-012147 | five times the graphics performance. | steve jobs | [Q19837] | 2008-10-15 06:38:46 | 2 | [[steve jobs, 0.7707], [None, 0.2293]] | [http://us.rd.yahoo.com/dailynews/rss/search/m... | A |
| 1 | 2008-09-25-038377 | my iphone is full of pictures. i wanted to see... | ken whisenhunt | [Q1758635] | 2008-09-25 08:08:08 | 1 | [[ken whisenhunt, 0.3759], [None, 0.3186], [ma... | [http://azcentral.com/arizonarepublic/sports/a... | A |
| 2 | 2008-09-04-052789 | the people who connect needs and ideas the bes... | steve jobs | [Q19837] | 2008-09-04 18:19:47 | 1 | [[steve jobs, 0.717], [None, 0.283]] | [http://businessweek.com/magazine/content/07_4... | A |
| 3 | 2008-11-10-038937 | sesame workshop will get 70 percent of the rev... | robert macmillan | [Q21453558, Q28094302] | 2008-11-10 22:59:44 | 1 | [[robert macmillan, 0.6071], [None, 0.3929]] | [http://macdailynews.com/index.php/weblog/comm... | A |
| 4 | 2008-09-26-048683 | the huge success of the apple itunes app store... | steve howard | [Q23670647, Q2457386, Q7612886] | 2008-09-26 11:29:23 | 1 | [[steve howard, 0.5034], [None, 0.4966]] | [http://ipod.consumerelectronicsnet.com/articl... | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 310811 | 2020-03-06-004716 | Apple continues to show their tremendous suppo... | none | [] | 2020-03-06 01:14:31 | 2 | [[None, 0.6748], [Octavia Spencer, 0.2615], [M... | [https://www.denofgeek.com/tv/truth-be-told-se... | E |
| 310812 | 2020-01-21-010221 | But at the same time, we're also not saying th... | satya nadella | [Q7426870] | 2020-01-21 15:44:24 | 1 | [[Satya Nadella, 0.8487], [None, 0.1513]] | [https://www.theverge.com/2020/1/21/21071108/m... | E |
| 310813 | 2020-01-22-079970 | The Big Mac burger sauce is so hard to get hol... | none | [] | 2020-01-22 12:34:13 | 2 | [[None, 0.8268], [Big Mac, 0.1733]] | [https://www.mirror.co.uk/money/shopping-deals... | E |
| 310814 | 2020-04-01-004824 | Archie has been loving doing FaceTime playdate... | none | [] | 2020-04-01 12:14:50 | 1 | [[None, 0.6707], [Prince Harry, 0.2197], [Megh... | [https://www.pinkvilla.com/entertainment/holly... | E |
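
The white list and black list filtering described in this subsection can be sketched as below. This is only an illustration under stated assumptions: the word lists are small subsets taken from the examples in the text, matching is done on whole words and case-insensitively, and `filter_apple_quotes` is a hypothetical helper rather than the project's actual filtering code.

```python
import re
import pandas as pd

# Illustrative subsets only; the project's full lists are not shown in this notebook
WHITE_WORDS = ["apple", "iphone", "macbook"]
BLACK_PHRASES = ["mac n cheese", "big apple"]
WHITE_SPEAKERS = {"steve jobs", "tim cook", "steve wozniak"}

def any_of(terms):
    """Compile a regex matching any of the terms as whole words, ignoring case."""
    body = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)

def filter_apple_quotes(quotes: pd.DataFrame) -> pd.DataFrame:
    """Keep quotes with a white word and no black phrase, or by a white-listed speaker."""
    white, black = any_of(WHITE_WORDS), any_of(BLACK_PHRASES)
    has_white = quotes["quotation"].map(lambda text: bool(white.search(text)))
    has_black = quotes["quotation"].map(lambda text: bool(black.search(text)))
    from_white_speaker = quotes["speaker"].str.lower().isin(WHITE_SPEAKERS)
    return quotes[(has_white & ~has_black) | from_white_speaker]

demo = pd.DataFrame({
    "quotation": [
        "five times the graphics performance.",
        "The new iPhone camera is impressive.",
        "I moved to the Big Apple last year.",
        "The weather is nice today.",
    ],
    "speaker": ["steve jobs", "jane doe", "john doe", "none"],
})
kept = filter_apple_quotes(demo)
print(kept)
```

The first quote is kept because of its speaker, the second because of the word _iPhone_; the third is dropped by the black list and the fourth matches nothing.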


# 2. First look into the Apple stock market and related quotes <a class="anchor" id="sect_2"></a>
[Back to table of contents](#TOC)

***

In [None]:
weekly_liquidity(stock, quantile=0.98)

In [None]:
daily_stock_price(stock)

In [None]:
quotes.date

In [None]:
daily_quotes(quotes[quotes.date.dt.year >= 2015], quantile = 0.98)

In [None]:
seasonal_analysis(stock[stock.Date.dt.year.isin([2018,2019])], column="Liquidity")

In [None]:
stock_price_with_quotes(stock,quotes)

In [None]:
pearson_stock_quotes(stock[stock.Date.dt.year.isin(range(2015,2019))],quotes)

# 3. Impact of the number of pageviews on the speakers' Wikipedia page <a class="anchor" id="sect_3"></a>
[Back to table of contents](#TOC)

***

Now that we have looked at the positiveness of the quotes in our dataset, we want to add more depth on the impact of the quotes on the stock market. To do so, we ask ourselves the following question: what is the impact of a quote on other people? The answer is relatively simple: it depends on how well known the speaker was at the time he or she was quoted in the media. A simple indicator of this is the number of pageviews of the speaker's Wikipedia page (if there is one).

#### 3.1 Wiki labels and wiki ID <a class="anchor" id="sect_3_1"></a>
[Back to table of contents](#TOC)

Before accessing this number of pageviews, we need the exact label of each speaker's Wikipedia page. For that, we load the following dataset, which maps each wiki ID (the QIDs attached to every quote in our quotes dataset) to the exact label of the corresponding page.

In [None]:
# Get the wiki labels with their corresponding wiki IDs
wiki_labels = get_wiki_labels()[['id', 'label']]
display(wiki_labels.head(5))

#### 3.2 Number of wiki page views <a class="anchor" id="sect_3_2"></a>
[Back to table of contents](#TOC)

Next, we need a way to get the number of pageviews of a given speaker for a specific year. For that we use the `pageviewapi` package, which provides Wikipedia pageviews from 2015 onwards. We would have liked to have pageviews going back to 2008, but this proved too complicated: someone had built a way to get this older data, but the website no longer works. We therefore focused our study on 2015 to 2020 and designed a function that returns the number of pageviews for a given wiki page and year, as follows.

In [None]:
# Get the page views for the Steve Jobs wikipedia page in 2015
speaker = 'Steve Jobs'
year = 2015
print(get_pageviews_per_year(speaker, year))
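
A function like `get_pageviews_per_year` could be built on `pageviewapi.per_article`, as sketched below. This is an assumption about its internals, not the project's actual code: `yearly_views` and `sum_views` are hypothetical names, and the response layout (`{"items": [{"views": ...}, ...]}`) is the one used by the Wikimedia pageviews REST API.

```python
def sum_views(response: dict) -> int:
    """Total the `views` field over all items of a pageviews response."""
    return sum(item["views"] for item in response.get("items", []))

def yearly_views(page: str, year: int, project: str = "en.wikipedia") -> int:
    """Pageviews of `page` during `year` (needs network access and pageviewapi)."""
    import pageviewapi  # imported lazily so that sum_views can be used offline
    response = pageviewapi.per_article(
        project, page, f"{year}0101", f"{year}1231",
        access="all-access", agent="all-agents", granularity="monthly",
    )
    return sum_views(response)

# Offline check on a hand-written response with the same layout (made-up counts)
fake_response = {"items": [{"timestamp": "2015070100", "views": 120},
                           {"timestamp": "2015080100", "views": 80}]}
print(sum_views(fake_response))  # 200
```

Requesting monthly granularity keeps the response to at most twelve items per year, which matters when the call is repeated for many speakers.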

#### 3.3 Exact label for each quote <a class="anchor" id="sect_3_3"></a>
[Back to table of contents](#TOC)

The idea now is to get the exact label of the speaker of every quote in our quotes dataset. In principle this can be done directly by merging the ID in `wiki_labels` with the QID of the `quotes` data frame, but it is not that simple here. A quote in the Quotebank dataset can have more than one QID, because the speaker's name may be shared with other people, in which case all candidate QIDs are listed for that quote. To deal with this, we look at all the QIDs of each quote and keep the label of the candidate with the highest total number of pageviews.

Before that, we collect all the speaker IDs given for each quote in our dataset.

In [None]:
# Get the speakers' ID
speakers_id = get_speakers_ids(quotes)
speakers_id_sample = speakers_id.head(5)
display(speakers_id_sample)

Here we can clearly see that a single speaker in our quotes may have more than one QID. The following cell shows the process we applied to get the true Wikipedia label for every speaker.

**Remark:** The following cell shows how the process works on a sample of the `quotes` dataframe. We do not run the whole filtering here because it takes about 22 hours, so we ran it once on multiple clusters and saved the result in a `.pkl` file. The run is this long because we call `pageviewapi` at least once per iteration to get the number of pageviews for the QID under consideration.

In [None]:
# Get the right label for the speaker of each quote.
# It adds a new column `label` to the quotes dataframe, containing the wiki label of the speaker.
speakers_labels_sample = find_labels(speakers_id_sample, wiki_labels)
display(speakers_labels_sample)

# Get the whole data set from a .pkl file
speakers_labels = get_speakers_labels()
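
The disambiguation rule of this subsection (keep the candidate with the most pageviews) can be sketched as follows. This is a minimal illustration, not the code of `find_labels`: `pick_label` is a hypothetical helper, and the labels and view counts are made up.

```python
from typing import Callable, Dict, List, Optional

def pick_label(qids: List[str], labels: Dict[str, str],
               total_views: Callable[[str], int]) -> Optional[str]:
    """Return the label of the candidate QID whose page has the most views."""
    candidates = [labels[qid] for qid in qids if qid in labels]
    if not candidates:
        return None  # speaker without a known Wikipedia label
    return max(candidates, key=total_views)

# Made-up labels and view counts, for illustration only
labels = {"Q1": "Page A", "Q2": "Page B"}
views = {"Page A": 5000, "Page B": 12000}

def lookup_views(label: str) -> int:
    return views.get(label, 0)

print(pick_label(["Q1", "Q2", "Q3"], labels, lookup_views))  # Page B
```

Passing the pageview lookup as a function keeps the rule independent of the API, so a cached or offline lookup can be swapped in when the real call is too slow.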

#### 3.4 Scoring quotes <a class="anchor" id="sect_3_4"></a>
[Back to table of contents](#TOC)

Now that we have the exact label of the speaker of each quote, we want the number of pageviews of the speaker's wiki page in the year the quote was published. The idea is simple: we take the label and the year of each quote and use the function `get_pageviews_per_year` to add a new column `score` to our data frame.

**Remark:** As in the previous subsection, we show the process on a small sample of the dataset because the full run is too long (around 8 hours for this one). Given the long running time, we split the work into steps so that it could run on multiple computers. The idea was to get the pageviews of each speaker for every year between 2015 and 2020. Here is a sample of how the code works.

In [None]:
# Apply the process on a small sample
speakers_pageviews_sample = get_speakers_pageviews_per_year(speakers_labels_sample)
display(speakers_pageviews_sample)

# Load the whole data set from a pickle file
speakers_pageviews = get_speakers_pageviews()
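
Splitting this step across several computers, as mentioned in the remark, could look like the sketch below. It is an assumption about the workflow rather than the project's code: `shard`, `pageviews_table` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API call.

```python
from itertools import product
from typing import Callable, Iterable, List
import pandas as pd

def shard(items: List[str], n_shards: int, shard_id: int) -> List[str]:
    """Every n-th item, so that each machine gets a disjoint subset."""
    return items[shard_id::n_shards]

def pageviews_table(labels: Iterable[str], years: Iterable[int],
                    fetch: Callable[[str, int], int]) -> pd.DataFrame:
    """One row per (label, year) pair with its pageview count."""
    rows = [{"label": label, "year": year, "pageviews": fetch(label, year)}
            for label, year in product(labels, years)]
    return pd.DataFrame(rows)

def fake_fetch(label: str, year: int) -> int:
    """Stand-in for the API call; returns an arbitrary number."""
    return len(label) * year

labels = ["Page A", "Page B", "Page C"]
# Each element of `parts` would be computed on a different machine
parts = [pageviews_table(shard(labels, 2, i), range(2015, 2021), fake_fetch)
         for i in range(2)]
full = pd.concat(parts, ignore_index=True)
print(full.shape)  # (18, 3)
```

Each machine only needs its shard index, and the partial tables can be saved separately and concatenated afterwards.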

Now, we directly join the year of each quote with the corresponding number of pageviews of its speaker. After that, we normalize the values so that the resulting score is meaningful.

In [None]:
# Add a score column to our data frame
quotes_score_sample = get_score_quotes(quotes, speakers_pageviews)
display(quotes_score_sample.head(5))

# Get the whole data set from a .pkl file
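
The join and normalization described before this cell can be sketched as follows. The normalization used by `get_score_quotes` is not shown in this notebook, so min-max scaling is assumed here purely for illustration; `add_score` and the column names `label`, `year` and `pageviews` are likewise assumptions.

```python
import pandas as pd

def add_score(quotes: pd.DataFrame, pageviews: pd.DataFrame) -> pd.DataFrame:
    """Attach the speaker's yearly pageviews to each quote and scale them to [0, 1]."""
    out = quotes.assign(year=quotes["date"].dt.year)
    out = out.merge(pageviews, on=["label", "year"], how="left")
    views = out["pageviews"].fillna(0)  # speakers without a page get 0 views
    span = views.max() - views.min()
    # Assumption: min-max normalization; the project's actual choice may differ
    out["score"] = (views - views.min()) / span if span else 0.0
    return out

# Made-up quotes and pageview counts, for illustration only
demo_quotes = pd.DataFrame({
    "label": ["Page A", "Page B", "Page C"],
    "date": pd.to_datetime(["2015-09-01", "2016-03-10", "2016-07-22"]),
})
demo_views = pd.DataFrame({
    "label": ["Page A", "Page B"],
    "year": [2015, 2016],
    "pageviews": [100, 400],
})
scored = add_score(demo_quotes, demo_views)
print(scored[["label", "year", "score"]])
```

A left merge keeps quotes whose speaker has no pageview entry, which then receive the lowest score instead of being dropped.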


#### 3.5 Get back sentiment analysis <a class="anchor" id="sect_3_5"></a>
[Back to table of contents](#TOC)

In [None]:
# Use the code of Camille

#### 3.6 Final plot <a class="anchor" id="sect_3_6"></a>
[Back to table of contents](#TOC)

In [None]:
# Plot

In [17]:
task3()

Pearson pos (0.13376385130081211, 1.8546584403164494e-06)
Pearson neg (0.054111576648412166, 0.05913964751159423)
Pearson neut (0.12004957725073283, 2.302720989785653e-05)


----
# 5. Building a model for stock market prediction
Using the quotes related to Apple, speaker data and past stock performance, we can make a first attempt at predicting the daily stock price and the liquidity. For this section we use Facebook's Prophet library, which provides powerful and easy-to-use forecasting tools. At its core, the model is a modular additive regression model that can take into account past performance and additional factors.

### 5.1 Model Fitting
In this section we fit and test the model on the range from 2015 to 2019. A stock split of AAPL occurred in 2020, which can throw off the model's predictions. We train our model between 2015 and 2018 and test its predictive capability in 2019. The resulting model is then validated further through forecasting cross validation.
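
A year-based train and test split of this kind can be sketched as follows, assuming a `Date` column of datetimes as in the `stock` frame. `split_by_year` is a hypothetical helper, not part of the project's `util` modules. Note that Python's `range` excludes its end value, so `range(2015, 2019)` covers 2015 through 2018.

```python
import pandas as pd

def split_by_year(frame: pd.DataFrame, train_years, test_years, date_col: str = "Date"):
    """Split a frame into train and test parts according to the calendar year."""
    years = frame[date_col].dt.year
    return frame[years.isin(train_years)], frame[years.isin(test_years)]

# Synthetic daily dates, only to demonstrate which years end up in each part
demo = pd.DataFrame({"Date": pd.date_range("2015-01-01", "2020-12-31", freq="D")})
train, test = split_by_year(demo, range(2015, 2019), [2019])
print(sorted(train["Date"].dt.year.unique()), sorted(test["Date"].dt.year.unique()))
```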

In [8]:
# Load the sentiment and quotes features obtained in previous sections
quotes_sentiment = pd.read_pickle("data/quotes_score.pkl")

# Build the train dataframe of stock, sentiment and quotes features
# (range(2015, 2018) selects the years 2015 through 2017)
prediction_frame = build_prediction_frame(stock[stock.Date.dt.year.isin(range(2015,2018))],quotes_sentiment)

# Fit the model on the train dataframe
m = fit_prophet(Prophet(changepoint_prior_scale=0.05, seasonality_prior_scale=0.1), prediction_frame, features=['positive','negative','total'], response='Open')

INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.



In [9]:
# Predict the next 300 days of trading
pred = predict_future(m,prediction_frame,feature_frame=quotes_sentiment)

# Plot the predicted future 300 days compared to the true stock price
plot_prediction(stock, quotes_sentiment, pred)

In [16]:
# Estimate the 60 days future prediction using forecasting cross validation
df_cv = cross_validation(m, initial='150 days', period='30 days', horizon = '60 days',parallel="processes")

# Evaluate multiple metrics on the 60 days future prediction
df_p = performance_metrics(df_cv)

print("Mean absolute percentage error in a first week horizon", df_p["mape"].values[0])


INFO:prophet:Making 22 forecasts with cutoffs between 2016-02-08 00:00:00 and 2017-10-30 00:00:00
INFO:prophet:Applying in parallel with <concurrent.futures.process.ProcessPoolExecutor object at 0x7f87f0274eb0>
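
The cutoff dates reported in the log can be reproduced with a simple rolling-origin scheme, sketched below. This is a simplified illustration of forecasting cross validation, not Prophet's exact implementation (which also skips cutoffs that have no data in their horizon). `rolling_cutoffs` is a hypothetical helper, and the last training day, 2017-12-29, is an assumption; the first day comes from the stock table.

```python
import pandas as pd

def rolling_cutoffs(first_day, last_day, initial, period, horizon):
    """Cutoff dates, oldest first, for forecasting cross validation."""
    first_day, last_day = pd.Timestamp(first_day), pd.Timestamp(last_day)
    initial, period, horizon = (pd.Timedelta(x) for x in (initial, period, horizon))
    cutoffs = []
    cutoff = last_day - horizon           # latest cutoff that leaves a full horizon
    while cutoff >= first_day + initial:  # keep at least `initial` days of history
        cutoffs.append(cutoff)
        cutoff = cutoff - period
    return cutoffs[::-1]

cutoffs = rolling_cutoffs("2015-08-31", "2017-12-29", "150 days", "30 days", "60 days")
print(len(cutoffs), cutoffs[0].date(), cutoffs[-1].date())
```

Under these assumptions the sketch yields 22 cutoffs between 2016-02-08 and 2017-10-30, the same as in the log. For each cutoff, the model is fitted on the data up to that date and evaluated on the following 60 days.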



In [19]:
# Set the parameters to be evaluated
param_grid = {
    'changepoint_prior_scale': np.linspace(0.2, 2, 5),
    'seasonality_prior_scale': np.logspace(-2, 1, 5),
}

# Performs grid search cross validation to estimate the mean MAPE for each pair of parameters
tuning_results = prophet_cross_validation(param_grid, stock, quotes_sentiment, features = ['positive','negative','total'], response = 'Open', metric = 'mape')
tuning_results[tuning_results.mape == tuning_results.mape.min()]

  0%|          | 0/25 [00:00<?, ?it/s]

INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:prophet:Making 50 forecasts with cutoffs between 2016-02-07 00:00:00 and 2020-02-16 00:00:00
INFO:prophet:Applying in parallel with <concurrent.futures.process.ProcessPoolExecutor object at 0x7f87929bf070>



Process SpawnProcess-24:
Process SpawnProcess-21:
Process SpawnProcess-22:
Process SpawnProcess-17:
Process SpawnProcess-20:
Process SpawnProcess-23:
Process SpawnProcess-19:
Traceback (most recent call last):
  File "/Users/raphaelattias/opt/miniconda3/envs/ada/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/raphaelattias/opt/miniconda3/envs/ada/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/raphaelattias/opt/miniconda3/envs/ada/lib/python3.9/concurrent/futures/process.py", line 237, in _process_worker
    call_item = call_queue.get(block=True)
  File "/Users/raphaelattias/opt/miniconda3/envs/ada/lib/python3.9/multiprocessing/queues.py", line 102, in get
    with self._rlock:
  File "/Users/raphaelattias/opt/miniconda3/envs/ada/lib/python3.9/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (mos

KeyboardInterrupt: 
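
The grid in this cell has 5 x 5 = 25 parameter combinations, which matches the `0/25` progress bar. The sketch below shows how such a grid can be enumerated and how the best row can be selected. It is an assumption about what `prophet_cross_validation` does internally; `best_row` is a hypothetical helper, and the `mape` values are placeholders, not results.

```python
from itertools import product
import numpy as np
import pandas as pd

param_grid = {
    "changepoint_prior_scale": np.linspace(0.2, 2, 5),
    "seasonality_prior_scale": np.logspace(-2, 1, 5),
}

# One dict per combination of parameter values
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(combos))  # 25

def best_row(results: pd.DataFrame, metric: str = "mape") -> pd.DataFrame:
    """Rows of a results table with the smallest value of `metric`."""
    return results[results[metric] == results[metric].min()]

# Placeholder scores, only to show the selection step
results = pd.DataFrame(combos).assign(mape=np.arange(len(combos), 0, -1) / 100)
print(best_row(results))
```

In a real run, each combination would be scored by fitting a model and averaging the cross-validated MAPE, which is why the full search is slow.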
