In [1]:
import os
from datetime import datetime
import pandas as pd

# Parameters

In [2]:
# Parameters dictionary.
pm = {
    'organization': 'Google Inc',
    'start_date_train': datetime(2022,1,1),
    'end_date_train': datetime(2022,11,1),
    'start_date_test': datetime(2022,11,1),
    'end_date_test': datetime(2023,1,1),
    'n_articles': 5,
    'text_columns': [ 'abstract', 'lead_paragraph', 'snippet', 'headline.main', ],
}

⭕ **Possible Improvements:**

* The date range for the market data (dependent variable) could be larger than the date range for the news, since there may be a time lag.
* Test for a number of years.
* Play with what text columns are included or not.
* Test if weighting articles by how much a company is mentioned in the article improves predictions.
* Inspect how the number of articles published changes things.
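
The first improvement above, widening the market-data window to allow for a time lag, could be sketched as follows. The helper name `lagged_window` and the `max_lag_days` parameter are hypothetical and not part of this notebook; the five-day lag is only an illustrative value.

```python
from datetime import datetime, timedelta


def lagged_window(start, end, max_lag_days=5):
    """Return a (start, end) window whose end is pushed back by max_lag_days.

    Hypothetical helper: the market data may react to news with a delay,
    so its window ends later than the news window.
    """
    return start, end + timedelta(days=max_lag_days)


# News window taken from the training dates in the parameters dictionary.
news_start, news_end = datetime(2022, 1, 1), datetime(2022, 11, 1)
market_start, market_end = lagged_window(news_start, news_end, max_lag_days=5)
```

The lag itself would then be another hyperparameter to test alongside the ones listed above.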

# NYTimes Data

## Retrieve Data
To access the NYTimes API we will be using the `pynytimes` repository, for which the BibTeX citation is:
```
@software{Den_Heijer_pynytimes_2023,
    author = {Den Heijer, Micha},
    license = {MIT},
    title = {{pynytimes}},
    url = {https://github.com/michadenheijer/pynytimes},
    version = {0.10.0},
    year = {2023},
    doi = {10.5281/zenodo.7821090}
}
```

Our API key is stored in the environment variable `NYTIMES_KEY`, which is set in e.g. `~/.bash_profile` or `~/.zshrc`.
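
The `my_api_info` module imported below is not shown in this notebook. A minimal sketch of what a helper like `get_nytimes_key` might look like, assuming it simply reads the `NYTIMES_KEY` environment variable, is:

```python
import os


def get_nytimes_key(env_var='NYTIMES_KEY'):
    """Read the NYTimes API key from an environment variable.

    Assumed implementation: the real my_api_info module may differ.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f'Environment variable {env_var} is not set; '
            'export it in e.g. ~/.bash_profile or ~/.zshrc.'
        )
    return key
```

Raising an error when the variable is missing makes a misconfigured shell fail here, rather than later inside the API call.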

In [3]:
from pynytimes import NYTAPI
from my_api_info import get_nytimes_key

In [4]:
nytapi = NYTAPI(get_nytimes_key(), parse_dates=True )

In [5]:
results = nytapi.article_search(
    query=pm['organization'],
    results=pm['n_articles'],
    dates={ 'begin':pm['start_date_train'], 'end':pm['end_date_train'] }
)

⭕ **Possible Improvement:**

We currently search by keyword. A more advanced option is to use the filter query feature of the NYTimes API, e.g.
```
options={
    'fq': 'organizations:("Google Inc")',
},
```
This also requires filtering on the "rank" of the organization within the article, as found in e.g. the `rank` field of the entries in `article['keywords']`. Otherwise we'll get articles only tangentially related to the target company.
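
The rank filter described above could be sketched as follows. This assumes each article carries a `keywords` list whose entries are dictionaries with `name`, `value`, and `rank` fields; the function name `is_primary_organization`, the `max_rank` cutoff, and the sample articles are hypothetical and made up for illustration.

```python
def is_primary_organization(article, organization, max_rank=2):
    """Return True if `organization` is among the article's top-ranked
    organization keywords (a lower rank is assumed to mean more central)."""
    for keyword in article.get('keywords', []):
        if (
            keyword.get('name') == 'organizations'
            and keyword.get('value') == organization
            and keyword.get('rank', float('inf')) <= max_rank
        ):
            return True
    return False


# Made-up records in the assumed shape, for illustration only.
sample_articles = [
    {'keywords': [{'name': 'organizations', 'value': 'Google Inc', 'rank': 1}]},
    {'keywords': [{'name': 'organizations', 'value': 'Google Inc', 'rank': 7}]},
    {'keywords': [{'name': 'subject', 'value': 'Audiobooks', 'rank': 1}]},
]
relevant = [a for a in sample_articles if is_primary_organization(a, 'Google Inc')]
```

Here only the first sample article is kept; `max_rank` would be another hyperparameter to tune.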

## Format Data

In [6]:
# Create storage dictionary
nyt_data = {
    'pub_date': [],
}
for column in pm['text_columns']:
    nyt_data[column] = []

In [7]:
# Collect
for result in results:
    for column in nyt_data.keys():
        
        # Parse column
        if '.' in column:
            column_keys = column.split( '.' )
            column_val = result[column_keys[0]][column_keys[1]]
        else:
            column_val = result[column]
            
        # Store
        nyt_data[column].append( column_val )
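
The dotted-column parsing above handles exactly one level of nesting. A more general lookup could be sketched as below; `get_nested` is a hypothetical helper, not part of this notebook or of `pynytimes`, and the sample record is made up.

```python
from functools import reduce


def get_nested(record, dotted_key, default=None):
    """Look up a value such as 'headline.main' in a nested dictionary.

    Returns `default` if any key along the path is missing.
    """
    try:
        return reduce(lambda node, key: node[key], dotted_key.split('.'), record)
    except (KeyError, TypeError):
        return default


# Made-up record in the same shape as the fields used in this notebook.
sample = {'headline': {'main': 'Example headline'}, 'abstract': 'Example abstract'}
headline = get_nested(sample, 'headline.main')
missing = get_nested(sample, 'headline.kicker', default='')
```

Returning a default instead of raising keeps the collection loop running when an article lacks one of the text fields.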

In [8]:
# Turn into a dataframe
nyt = pd.DataFrame( nyt_data )

In [9]:
# Collect the full string
nyt['text'] = ( nyt[pm['text_columns']] + ' ' ).sum( axis=1 )

In [10]:
nyt.head()

Unnamed: 0,pub_date,abstract,lead_paragraph,snippet,headline.main,text
0,2022-10-25 20:37:03+00:00,Google’s parent company reported earnings that...,"Even Alphabet, the parent company of Google an...",Google’s parent company reported earnings that...,Alphabet’s Profit Drops 27 Percent From a Year...,Google’s parent company reported earnings that...
1,2022-10-26 22:47:44+00:00,A series of quarterly earnings reports is show...,Google this week reported a steep decline in p...,A series of quarterly earnings reports is show...,Tech’s Biggest Companies Are Sending Worrying ...,A series of quarterly earnings reports is show...
2,2022-10-20 15:05:58+00:00,"Ken Paxton, the state attorney general, said p...",The Texas attorney general filed a privacy law...,"Ken Paxton, the state attorney general, said p...",Texas Sues Google for Collecting Biometric Dat...,"Ken Paxton, the state attorney general, said p..."
3,2022-10-28 12:06:38+00:00,The social network’s new owner has just a few ...,Elon Musk closes his purchase of Twitter and f...,The social network’s new owner has just a few ...,Elon Musk Faces Another Big Decision at Twitter,The social network’s new owner has just a few ...
4,2022-10-25 18:24:56+00:00,Apple has rejected Spotify’s new app three tim...,"Daniel Ek, the chief executive of Spotify, wan...",Apple has rejected Spotify’s new app three tim...,Spotify Wants to Get Into Audiobooks but Says ...,Apple has rejected Spotify’s new app three tim...


## Sentiment Analysis
Here we use TextBlob as our sentiment analysis tool. For each article, we take the `text` column of the dataframe and compute both polarity (from -1, negative, to 1, positive) and subjectivity (from 0, objective, to 1, subjective). At the end, we combine the scores with the article data into a single dataframe.

In [15]:
from textblob import TextBlob

In [17]:
# Score the combined text of each article
pol_vec = []
subj_vec = []
for text in nyt['text']:
    blob = TextBlob(text)
    pol_vec.append(blob.sentiment.polarity)
    subj_vec.append(blob.sentiment.subjectivity)

In [18]:
d = {'Polarity': pol_vec, 'Subjectivity': subj_vec}
t = pd.DataFrame(data=d)

In [19]:
display(t)

Unnamed: 0,Polarity,Subjectivity
0,0.075,0.41
1,0.229167,0.558333
2,0.0375,0.375
3,0.093939,0.274242
4,0.292532,0.487013
5,0.21131,0.342262
6,0.171429,0.565568
7,-0.068376,0.401709
8,0.355682,0.581818
9,-0.075385,0.182418


In [20]:
f = pd.concat([nyt,t], axis=1)

In [21]:
display(f)

Unnamed: 0,pub_date,abstract,lead_paragraph,snippet,headline.main,text,Polarity,Subjectivity
0,2022-10-25 20:37:03+00:00,Google’s parent company reported earnings that...,"Even Alphabet, the parent company of Google an...",Google’s parent company reported earnings that...,Alphabet’s Profit Drops 27 Percent From a Year...,Google’s parent company reported earnings that...,0.075,0.41
1,2022-10-26 22:47:44+00:00,A series of quarterly earnings reports is show...,Google this week reported a steep decline in p...,A series of quarterly earnings reports is show...,Tech’s Biggest Companies Are Sending Worrying ...,A series of quarterly earnings reports is show...,0.229167,0.558333
2,2022-10-20 15:05:58+00:00,"Ken Paxton, the state attorney general, said p...",The Texas attorney general filed a privacy law...,"Ken Paxton, the state attorney general, said p...",Texas Sues Google for Collecting Biometric Dat...,"Ken Paxton, the state attorney general, said p...",0.0375,0.375
3,2022-10-28 12:06:38+00:00,The social network’s new owner has just a few ...,Elon Musk closes his purchase of Twitter and f...,The social network’s new owner has just a few ...,Elon Musk Faces Another Big Decision at Twitter,The social network’s new owner has just a few ...,0.093939,0.274242
4,2022-10-25 18:24:56+00:00,Apple has rejected Spotify’s new app three tim...,"Daniel Ek, the chief executive of Spotify, wan...",Apple has rejected Spotify’s new app three tim...,Spotify Wants to Get Into Audiobooks but Says ...,Apple has rejected Spotify’s new app three tim...,0.292532,0.487013
5,2022-10-13 19:26:02+00:00,Shares of the SPAC trying to merge with Donald...,The shares of the cash-rich special purpose ac...,Shares of the SPAC trying to merge with Donald...,Google’s Move to Include Truth Social in App S...,Shares of the SPAC trying to merge with Donald...,0.21131,0.342262
6,2022-10-12 17:00:18+00:00,"Developed with Fitbit, Google’s first smart wa...","It’s 2022, and Google finally has a response t...","Developed with Fitbit, Google’s first smart wa...",The Google Watch Is Here. But You’d Better Lov...,"Developed with Fitbit, Google’s first smart wa...",0.171429,0.565568
7,2022-09-29 18:01:39+00:00,"After nearly three years, Google has decided t...",Google said it would shutter the video game st...,"After nearly three years, Google has decided t...",Google to Shut Down Stadia Video Game Streamin...,"After nearly three years, Google has decided t...",-0.068376,0.401709
8,2022-09-28 17:25:57+00:00,From more photo-based results to neighborhood ...,Google’s search engine looks a little differen...,From more photo-based results to neighborhood ...,Google to Make Search and Maps More ‘Immersive’,From more photo-based results to neighborhood ...,0.355682,0.581818
9,2022-09-29 09:00:28+00:00,"Online, Iranians engage in a world their leade...","In the physical world, Iran’s authoritarian le...","Online, Iranians engage in a world their leade...","Despite Iran’s Efforts to Block Internet, Tech...","Online, Iranians engage in a world their leade...",-0.075385,0.182418


## Removing 'Irrelevant' Rows
Some articles don't mention Google *enough* times in their combined text. This portion of the notebook removes those rows.

We need to download the NLTK `punkt` tokenizer data in the cell below, since TextBlob relies on it to tokenize text for `word_counts`.

In [25]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/michaelluo/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Counting instances of key words

`word_threshold` is a hyperparameter: the minimum number of times 'google' must be mentioned in the text for an article to be kept. We can also add frequency thresholds for additional or different keywords depending on the problem at hand, for example when researching another stock.

In [26]:
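# Minimum number of 'google' mentions required to keep an article
word_threshold = 1
# Spot check: mentions of 'google' in the last article (word_counts keys are lowercase)
freq = TextBlob(nyt['text'].iloc[-1]).word_counts['google']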

In [28]:
# Count mentions of 'google' per article and keep rows that meet the threshold
google_counts = nyt['text'].apply( lambda text: TextBlob(text).word_counts['google'] )
f = f[google_counts >= word_threshold]
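
Extending the filter to several keywords, as suggested above, could be sketched as follows. This is a standard-library stand-in for TextBlob's `word_counts`, using a simple regex tokenizer, so counts may differ slightly from TextBlob's. The names `keyword_mentions` and `is_relevant`, the threshold values, and the example sentence are hypothetical.

```python
import re
from collections import Counter


def keyword_mentions(text, keywords):
    """Count case-insensitive occurrences of each keyword in `text`."""
    counts = Counter(re.findall(r'[a-z0-9]+', text.lower()))
    return {keyword: counts[keyword.lower()] for keyword in keywords}


def is_relevant(text, thresholds):
    """Keep an article if ANY keyword meets its own minimum count."""
    mentions = keyword_mentions(text, thresholds)
    return any(mentions[keyword] >= minimum for keyword, minimum in thresholds.items())


# Hypothetical per-keyword thresholds and a made-up sentence.
thresholds = {'google': 2, 'alphabet': 1}
example_text = 'Google and its parent Alphabet reported earnings. Google fell.'
example_counts = keyword_mentions(example_text, thresholds)
```

Using `any` treats the keywords as alternatives for the same company; requiring every threshold with `all` would be a stricter design choice.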

In [29]:
display(f)

Unnamed: 0,pub_date,abstract,lead_paragraph,snippet,headline.main,text,Polarity,Subjectivity
0,2022-10-25 20:37:03+00:00,Google’s parent company reported earnings that...,"Even Alphabet, the parent company of Google an...",Google’s parent company reported earnings that...,Alphabet’s Profit Drops 27 Percent From a Year...,Google’s parent company reported earnings that...,0.075,0.41
1,2022-10-26 22:47:44+00:00,A series of quarterly earnings reports is show...,Google this week reported a steep decline in p...,A series of quarterly earnings reports is show...,Tech’s Biggest Companies Are Sending Worrying ...,A series of quarterly earnings reports is show...,0.229167,0.558333
2,2022-10-20 15:05:58+00:00,"Ken Paxton, the state attorney general, said p...",The Texas attorney general filed a privacy law...,"Ken Paxton, the state attorney general, said p...",Texas Sues Google for Collecting Biometric Dat...,"Ken Paxton, the state attorney general, said p...",0.0375,0.375
5,2022-10-13 19:26:02+00:00,Shares of the SPAC trying to merge with Donald...,The shares of the cash-rich special purpose ac...,Shares of the SPAC trying to merge with Donald...,Google’s Move to Include Truth Social in App S...,Shares of the SPAC trying to merge with Donald...,0.21131,0.342262
6,2022-10-12 17:00:18+00:00,"Developed with Fitbit, Google’s first smart wa...","It’s 2022, and Google finally has a response t...","Developed with Fitbit, Google’s first smart wa...",The Google Watch Is Here. But You’d Better Lov...,"Developed with Fitbit, Google’s first smart wa...",0.171429,0.565568
7,2022-09-29 18:01:39+00:00,"After nearly three years, Google has decided t...",Google said it would shutter the video game st...,"After nearly three years, Google has decided t...",Google to Shut Down Stadia Video Game Streamin...,"After nearly three years, Google has decided t...",-0.068376,0.401709
8,2022-09-28 17:25:57+00:00,From more photo-based results to neighborhood ...,Google’s search engine looks a little differen...,From more photo-based results to neighborhood ...,Google to Make Search and Maps More ‘Immersive’,From more photo-based results to neighborhood ...,0.355682,0.581818
