#### Imports

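This cell imports the numerical and plotting libraries (NumPy, seaborn, Plotly), the project's helper modules in `util`, and the per-task modules `task1` to `task4`. It also sets the default Plotly renderer and enables `autoreload` so that edits to the helper modules are picked up without restarting the kernel.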

In [1]:
# Useful starting lines
import numpy as np
import seaborn as sns
from util.dataloader import *
from util.finance import *
from util.plots import *
from util.finance import stock, compare
from util.quotebankexploration import *
from util.wikipedia import *
import plotly.graph_objects as go
import plotly.io as pio
import plotly.express as px

from util.apple_stores import *

from task1 import *
from task2 import *
from task3 import *
from task4 import *

pio.renderers.default = "notebook_connected"

%load_ext autoreload
%autoreload 2

# <a class="anchor" id="TOC"></a> Table of Contents

***

* 0. [Introduction](#intro)
* 1. [Loading the datasets](#sect_1)
* 2. [First look into the Apple stock market and related quotes](#sect_2)
* 3. [Impact of the number of pageviews on the speakers' Wikipedia page](#sect_3)
    * 3.1 [Wiki labels and wiki ID](#sect_3_1)
    * 3.2 [Number of wiki page views](#sect_3_2)
    * 3.3 [Exact label for each quote](#sect_3_3)
    * 3.4 [Scoring quotes](#sect_3_4)
    * 3.5 [Get back sentiment analysis](#sect_3_5)
    * 3.6 [Final plot](#sect_3_6)

# 0. Introduction <a class="anchor" id="intro"></a>
[Back to table of contents](#TOC)

***

This notebook walks through the thought process behind this project, which looks for patterns and correlations between stock markets and media quotes. In particular, we focus on the Apple stock, as Apple is one of the most valuable companies on earth and is widely covered in the media.

# 1. Loading the datasets <a class="anchor" id="sect_1"></a>
[Back to table of contents](#TOC)

***

In this project we study three datasets to gain more insight into the evolution of the stock price, the coverage of Apple in the media, and its relationship with the various speakers. This first section is dedicated to loading the datasets and preprocessing the relevant information.

## Apple stock: yFinance API
The yFinance API is an open-source Python library that retrieves data from Yahoo Finance and gives easy access to various financial metrics for most stocks on the market. The _ticker_ of the Apple stock is _AAPL_, and we first focus on a date range from 2008 to 2020. We also add a ``Liquidity`` field, the daily volume multiplied by the average daily stock price, as an indicator of daily volatility. Expressed in dollars, it gives a quick overview of the amount of Apple stock traded in a day, and we treat a day of high liquidity as a proxy for a day of high volatility.

In [2]:
# Using yFinance API we load various metrics of the Apple stock ranging from 2008 to 2020
stock = load_stock("AAPL", 2008, 2020)

# We set a day of high volatility as a day among the highest 2% of liquidity in that year.
stock = high_volatility(stock, quantile = 0.98)

display(stock)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Liquidity,Yearly Percentile
0,2008-09-02,6.157143,6.196429,5.892857,5.935357,5.088770,780763200,4.720690e+09,Lower 98%
1,2008-09-03,5.958571,6.024286,5.857143,5.962857,5.112345,734834800,4.380140e+09,Lower 98%
2,2008-09-04,5.923571,5.996786,5.743214,5.757857,4.936587,743386000,4.341905e+09,Lower 98%
3,2008-09-05,5.663929,5.800000,5.630357,5.720714,4.904742,786884000,4.479197e+09,Lower 98%
4,2008-09-08,5.877500,5.888929,5.409286,5.640000,4.835539,1045979200,6.023533e+09,Lower 98%
...,...,...,...,...,...,...,...,...,...
2951,2020-05-22,78.942497,79.807503,78.837502,79.722504,78.955231,81803200,6.489652e+09,Lower 98%
2952,2020-05-26,80.875000,81.059998,79.125000,79.182503,78.420410,125522000,1.004537e+10,Lower 98%
2953,2020-05-27,79.035004,79.677498,78.272499,79.527496,78.762093,112945200,8.954437e+09,Lower 98%
2954,2020-05-28,79.192497,80.860001,78.907501,79.562500,78.796753,133560800,1.060172e+10,Lower 98%
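
The `Liquidity` and `Yearly Percentile` columns shown above could be derived roughly as follows. This is only a sketch, not the project's `high_volatility` function: the helper name `flag_high_liquidity` and the `Top 2%` label are hypothetical, and taking the average daily price as the mean of `Open` and `Close` is an assumption (it agrees with the displayed values up to display precision).

```python
import pandas as pd


def flag_high_liquidity(df: pd.DataFrame, quantile: float = 0.98) -> pd.DataFrame:
    """Add `Liquidity` and `Yearly Percentile` columns (hypothetical helper)."""
    out = df.copy()
    # Assumption: the average daily price is the mean of Open and Close.
    out["Liquidity"] = out["Volume"] * (out["Open"] + out["Close"]) / 2
    # Liquidity threshold at the requested quantile, computed per calendar year.
    threshold = out.groupby(out["Date"].dt.year)["Liquidity"].transform(
        lambda s: s.quantile(quantile)
    )
    pct = round(quantile * 100)
    # "Top 2%" is a made-up label; only "Lower 98%" appears in the output above.
    out["Yearly Percentile"] = (out["Liquidity"] > threshold).map(
        {True: f"Top {100 - pct}%", False: f"Lower {pct}%"}
    )
    return out


# First five rows of the table above (only the columns the sketch needs).
sample = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2008-09-02", "2008-09-03", "2008-09-04", "2008-09-05", "2008-09-08"]
    ),
    "Open": [6.157143, 5.958571, 5.923571, 5.663929, 5.877500],
    "Close": [5.935357, 5.962857, 5.757857, 5.720714, 5.640000],
    "Volume": [780763200, 734834800, 743386000, 786884000, 1045979200],
})
flagged = flag_high_liquidity(sample)
print(flagged[["Date", "Liquidity", "Yearly Percentile"]])
```

On this five-row sample the threshold is computed from five values only, so the flags differ from the full table, where each year's threshold is taken over all of its trading days.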


## Quotebank Dataset
The Quotebank data is a large text corpus of more than 178 million quotations scraped from 337 websites. As we focus on Apple-related quotations, we applied several filters to reduce the number of quotes to 310,816. Some of the techniques employed:
1. White list of words that should be contained in quotes: _Apple_, _iPhone_, _Macbook_ etc.
2. Black list of words that should not be contained: _Mac n Cheese_, _apple_, _Big Apple_ etc.
3. White list of speakers, as we also included speakers related to Apple regardless of the aforementioned white-listed words: _Steve Jobs_, _Tim Cook_, _Steve Wozniak_ etc.
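
The three filters listed above could be combined along these lines. This is a minimal sketch, not the project's actual filtering code: the function name `filter_apple_quotes` is hypothetical, the lists contain only the examples given in the text, and both the case-sensitive substring matching and the choice to let white-listed speakers bypass the black list are assumptions.

```python
import re

import pandas as pd

# Illustrative lists, limited to the examples given in the text.
WHITE_WORDS = ["Apple", "iPhone", "Macbook"]
BLACK_WORDS = ["Mac n Cheese", "apple", "Big Apple"]
WHITE_SPEAKERS = {"steve jobs", "tim cook", "steve wozniak"}


def any_word(words):
    """Build a case-sensitive regex that matches any of the given words."""
    return "|".join(re.escape(w) for w in words)


def filter_apple_quotes(quotes: pd.DataFrame) -> pd.DataFrame:
    """Keep quotes that pass the word lists or come from a white-listed speaker."""
    text = quotes["quotation"]
    has_white = text.str.contains(any_word(WHITE_WORDS), regex=True)
    has_black = text.str.contains(any_word(BLACK_WORDS), regex=True)
    known_speaker = quotes["speaker"].str.lower().isin(WHITE_SPEAKERS)
    # Assumption: white-listed speakers are kept regardless of the word lists.
    return quotes[(has_white & ~has_black) | known_speaker]


# Made-up quotes, for illustration only.
demo = pd.DataFrame({
    "quotation": [
        "The new iPhone is great.",
        "I love the Big Apple.",
        "An apple a day.",
        "Stay hungry, stay foolish.",
    ],
    "speaker": ["none", "none", "none", "Steve Jobs"],
})
kept = filter_apple_quotes(demo)
print(kept)
```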

This final dataset of Apple-related quotes has been saved in a `pkl` file for ease of manipulation, and can be accessed with the `get_filtered_quotes` function.

In [3]:
quotes = get_filtered_quotes()

display(quotes)

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2008-10-15-012147,five times the graphics performance.,steve jobs,[Q19837],2008-10-15 06:38:46,2,"[[steve jobs, 0.7707], [None, 0.2293]]",[http://us.rd.yahoo.com/dailynews/rss/search/m...,A
1,2008-09-25-038377,my iphone is full of pictures. i wanted to see...,ken whisenhunt,[Q1758635],2008-09-25 08:08:08,1,"[[ken whisenhunt, 0.3759], [None, 0.3186], [ma...",[http://azcentral.com/arizonarepublic/sports/a...,A
2,2008-09-04-052789,the people who connect needs and ideas the bes...,steve jobs,[Q19837],2008-09-04 18:19:47,1,"[[steve jobs, 0.717], [None, 0.283]]",[http://businessweek.com/magazine/content/07_4...,A
3,2008-11-10-038937,sesame workshop will get 70 percent of the rev...,robert macmillan,"[Q21453558, Q28094302]",2008-11-10 22:59:44,1,"[[robert macmillan, 0.6071], [None, 0.3929]]",[http://macdailynews.com/index.php/weblog/comm...,A
4,2008-09-26-048683,the huge success of the apple itunes app store...,steve howard,"[Q23670647, Q2457386, Q7612886]",2008-09-26 11:29:23,1,"[[steve howard, 0.5034], [None, 0.4966]]",[http://ipod.consumerelectronicsnet.com/articl...,A
...,...,...,...,...,...,...,...,...,...
310811,2020-03-06-004716,Apple continues to show their tremendous suppo...,none,[],2020-03-06 01:14:31,2,"[[None, 0.6748], [Octavia Spencer, 0.2615], [M...",[https://www.denofgeek.com/tv/truth-be-told-se...,E
310812,2020-01-21-010221,"But at the same time, we're also not saying th...",satya nadella,[Q7426870],2020-01-21 15:44:24,1,"[[Satya Nadella, 0.8487], [None, 0.1513]]",[https://www.theverge.com/2020/1/21/21071108/m...,E
310813,2020-01-22-079970,The Big Mac burger sauce is so hard to get hol...,none,[],2020-01-22 12:34:13,2,"[[None, 0.8268], [Big Mac, 0.1733]]",[https://www.mirror.co.uk/money/shopping-deals...,E
310814,2020-04-01-004824,Archie has been loving doing FaceTime playdate...,none,[],2020-04-01 12:14:50,1,"[[None, 0.6707], [Prince Harry, 0.2197], [Megh...",[https://www.pinkvilla.com/entertainment/holly...,E


## Wikipedia

[Description of the Wikipedia data to be written]

In [None]:
# TODO: load and preprocess the Wikipedia data

# 2. First look into the Apple stock market and related quotes <a class="anchor" id="sect_2"></a>
[Back to table of contents](#TOC)

***

In [4]:
weekly_liquidity(stock, quantile=0.98)

In [5]:
daily_stock_price(stock)

In [6]:
daily_quotes(quotes, quantile = 0.98)

In [7]:
seasonal_analysis(stock[stock.Date.dt.year.isin([2018,2019])], column="Liquidity")

The Liquidity can be fitted with a seasonal model of period 64 with p_value 6.639388969749278e-28


In [None]:
stock_price_with_quotes(stock,quotes)

In [None]:
pearson_stock_quotes(stock[stock.Date.dt.year.isin(range(2015,2019))],quotes)

# Predictor

In [None]:
task4(stock)

# 3. Impact of the number of pageviews on the speakers' Wikipedia page <a class="anchor" id="sect_3"></a>
[Back to table of contents](#TOC)

***

Now that we have looked at the positiveness of the quotes in our dataset, we want to add more depth on the impact of the quotes on the stock market. To do so, we ask ourselves the following question: what is the impact of a quote on other people? Our working assumption is relatively simple: it depends on how well known the speaker was at the time he or she was quoted in the media. A simple indicator of this is the number of pageviews of the speaker's Wikipedia page (if there is one!).

#### 3.1 Wiki labels and wiki ID <a class="anchor" id="sect_3_1"></a>
[Back to table of contents](#TOC)

Before we can access this number of pageviews, we need the exact label of each speaker's Wikipedia page. To get it, we load the following dataset, which maps each wiki ID (the QIDs attached to every quote in our quotes dataset) to the exact label of the corresponding page.

In [None]:
# Get the wiki labels with its corresponding wiki ID
wiki_labels = get_wiki_labels()[['id', 'label']]
display(wiki_labels.head(5))

#### 3.2 Number of wiki page views <a class="anchor" id="sect_3_2"></a>
[Back to table of contents](#TOC)

Next, we need a way to get the number of pageviews of a speaker for a specific year. For that we use the package `pageviewapi`, which gives us Wikipedia pageviews from 2015 onwards. We would have liked to have the pageviews since 2008, but this proved too complicated: a third-party website used to provide these data, but it was no longer working. We therefore focused our study on 2015 to 2020 and designed a function that returns the number of pageviews of a specific wiki page for a given year, as shown in the following cell.

In [None]:
# Get the page views for the Steve Jobs wikipedia page in 2015
speaker = 'Steve Jobs'
year = 2015
print(get_page_views_per_year(speaker, year))
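
A function of this kind could look roughly like the sketch below. It is not the project's `get_page_views_per_year`: the name `yearly_page_views`, the choice of the `en.wikipedia` project, and the monthly aggregation are assumptions. The `fetch` argument and the `fake_fetch` stub (which returns made-up numbers) exist only so that the sketch can be tried without network access.

```python
from typing import Callable, Optional


def yearly_page_views(article: str, year: int,
                      fetch: Optional[Callable[..., dict]] = None) -> int:
    """Sum the monthly pageviews of a Wikipedia article over one calendar year."""
    if fetch is None:
        # Imported lazily so the sketch can be tried offline with a stub.
        import pageviewapi
        fetch = pageviewapi.per_article
    response = fetch(
        "en.wikipedia", article, f"{year}0101", f"{year}1231",
        access="all-access", agent="all-agents", granularity="monthly",
    )
    # Each item of the response carries the view count of one month.
    return sum(item["views"] for item in response["items"])


def fake_fetch(project, page, start, end, **kwargs):
    """Offline stand-in returning made-up numbers in the API's response shape."""
    return {"items": [{"timestamp": "2015070100", "views": 100},
                      {"timestamp": "2015080100", "views": 200}]}


print(yearly_page_views("Steve Jobs", 2015, fetch=fake_fetch))
```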

#### 3.3 Exact label for each quote <a class="anchor" id="sect_3_3"></a>
[Back to table of contents](#TOC)

The idea now is to get the exact label of the speaker of every quote in our quotes dataset. In principle this can be done by merging the IDs in `wiki_labels` with the QIDs of the `quotes` data frame, but it is not that simple. A quote in Quotebank can have more than one QID, because the speaker's name can be shared with other people, in which case all the candidate QIDs are listed for that quote. To deal with this, we look at all the QIDs of each quote and keep the label of the candidate with the highest total number of pageviews. The following cell shows the process on a sample of the `quotes` dataframe. We do not run the whole labelling here because it takes about 22 hours, so we ran it once on multiple clusters and saved the result in a `.pkl` file.

In [None]:
# Get the label of the speaker of each quote.
# This adds a new column `label` to the quotes dataframe containing the wiki label of the speaker.
quotes_label_sample = find_labels(quotes.head(5), wiki_labels)
display(quotes_label_sample)
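
The disambiguation rule described above (keep the candidate with the most pageviews) can be sketched as follows. This is an illustration rather than the project's `find_labels`: the helper name `pick_label`, the QIDs, the labels, and the view counts are all made up.

```python
from typing import Callable, Dict, List, Optional


def pick_label(qids: List[str], qid_to_label: Dict[str, str],
               total_views: Callable[[str], int]) -> Optional[str]:
    """Return the label of the candidate QID with the most total pageviews."""
    # Ignore QIDs that have no entry in the label table.
    labels = [qid_to_label[q] for q in qids if q in qid_to_label]
    if not labels:
        return None
    return max(labels, key=total_views)


# Made-up lookup tables, for illustration only.
demo_labels = {"QA": "Speaker A", "QB": "Speaker B"}
demo_views = {"Speaker A": 120, "Speaker B": 4500}

print(pick_label(["QA", "QB", "QC"], demo_labels, demo_views.get))
```

Since the full run is reported to take about 22 hours, a real implementation would probably want to cache the pageview totals per label so that each speaker is requested only once.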

#### 3.4 Scoring quotes <a class="anchor" id="sect_3_4"></a>
[Back to table of contents](#TOC)



In [None]:
# Add the yearly pageviews to the data frame and select the ones
# corresponding to the publication year of each quote.
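
One possible way to carry out the step described in the comments above is a merge on the speaker label and the publication year. This is only a sketch under assumed column names (`label`, `date`, `year`, `views`) with made-up data, and it does not define a final score.

```python
import pandas as pd

# Made-up quotes with a resolved speaker label and a publication date.
quotes_demo = pd.DataFrame({
    "label": ["Speaker A", "Speaker B", "Speaker A"],
    "date": pd.to_datetime(["2015-09-01", "2016-03-15", "2017-01-20"]),
})

# Made-up yearly pageviews in long format: one row per (label, year).
yearly_views = pd.DataFrame({
    "label": ["Speaker A", "Speaker A", "Speaker B"],
    "year": [2015, 2016, 2016],
    "views": [1000, 1500, 300],
})

# Attach to each quote the pageviews of its speaker in the publication year.
# Quotes without a matching (label, year) row get NaN.
scored = (
    quotes_demo.assign(year=quotes_demo["date"].dt.year)
    .merge(yearly_views, on=["label", "year"], how="left")
)
print(scored)
```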

#### 3.5 Get back sentiment analysis <a class="anchor" id="sect_3_5"></a>
[Back to table of contents](#TOC)

In [None]:
# Use the code of Camille

#### 3.6 Final plot <a class="anchor" id="sect_3_6"></a>
[Back to table of contents](#TOC)

In [None]:
# Plot