# Overnight News Sentiment
With some scraping magic, our team collected news headlines for our `FANG` stocks from `https://www.nasdaq.com/symbol/{ticker}/news-headlines` into `dataset/nasdaq/FANG.csv`.

Using the `mkt_dt_utils` and `sentiment_utils` modules we created, we can compute **Overnight News Sentiments** for a given stock and study them against overnight stock returns.

## Let's Load Our Data

In [15]:
import pandas as pd
import numpy as np

In [16]:
fname = 'dataset/nasdaq/raw.csv'
data = pd.read_csv( fname, header = 0, index_col = None)

In [17]:
data.tail()

Unnamed: 0,datetime,stockcode,source,headline,article,urls
6400,"February 04, 2019, 03:26:00 PM EDT",NFLX,Zacks.com,"Super Bowl Spending, Streaming & Amazon's Big ...",\n Welcome to the latest episode of the Fu...,https://www.nasdaq.com/article/super-bowl-spen...
6401,"February 04, 2019, 01:24:00 PM EDT",NFLX,Zacks.com,Buy Apple (AAPL) Stock on Streaming TV Push to...,\n Shares of Apple AAPL jumped over 2.5% Mond...,https://www.nasdaq.com/article/buy-apple-aapl-...
6402,"February 04, 2019, 12:42:43 PM EDT",NFLX,InvestorPlace Media,3 Must-Watch Comic Book Movies on Netflix,"\nInvestorPlace - Stock Market News, Stock Adv...",https://www.nasdaq.com/article/3-must-watch-co...
6403,"February 04, 2019, 12:36:05 PM EDT",NFLX,InvestorPlace Media,2 Crazy Acquisitions That Would Boost Alphabet...,"\nInvestorPlace - Stock Market News, Stock Adv...",https://www.nasdaq.com/article/2-crazy-acquisi...
6404,"February 04, 2019, 04:38:00 PM EDT",NFLX,,US STOCKS-Boost in tech shares sends Wall Stre...,\n\n\nShutterstock photo\n\n@media screen and ...,https://www.nasdaq.com/article/us-stocksboost-...


In [18]:
# data['timestring'] = data.index
# data['timestring'] = data['timestring'].astype('str')
data['datetime'] = data['datetime'].astype('str')
data['tz'] = data['datetime'].apply( lambda x: x[-3:])

In [19]:
data[data['tz']!= 'EDT']

Unnamed: 0,datetime,stockcode,source,headline,article,urls,tz
545,,AMZN,,"The Zacks Analyst Blog Highlights: Netflix, Di...",\n\n\nShutterstock photo\n\n\n\n\n\n\n\r\n ...,https://www.nasdaq.com/article/the-zacks-analy...,
551,,AMZN,,Amazon Announces New Job Positions for New Yor...,\n\n\nShutterstock photo\n\n\n\n\n\n\n\r\n ...,https://www.nasdaq.com/article/amazon-announce...,
582,,AMZN,,Why Amazon (AMZN) is Poised to Beat Earnings E...,\n\n\nShutterstock photo\n\n\n\n\n\n\n\r\n ...,https://www.nasdaq.com/article/why-amazon-amzn...,
602,,AMZN,,Amazon (AMZN) Dips More Than Broader Markets: ...,\n\n\nShutterstock photo\n\n\n\n\n\n\n\r\n ...,https://www.nasdaq.com/article/amazon-amzn-dip...,
2061,,GOOGL,,Amazon Announces New Job Positions for New Yor...,\n\n\nShutterstock photo\n\n\n\n\n\n\n\r\n ...,https://www.nasdaq.com/article/amazon-announce...,
2089,,GOOGL,,Alphabet (GOOGL) Dips More Than Broader Market...,\n\n\nShutterstock photo\n\n\n\n\n\n\n\r\n ...,https://www.nasdaq.com/article/alphabet-googl-...,
4992,,NFLX,,"The Zacks Analyst Blog Highlights: Netflix, Di...",\n\n\nShutterstock photo\n\n\n\n\n\n\n\r\n ...,https://www.nasdaq.com/article/the-zacks-analy...,


### Remove Bad Data

In [29]:
dup_filter = data.duplicated()
na_dt = pd.to_datetime(data['datetime']).isna() #data.datetime.isna()

print(f'Dataset has {len(data)} points, of which {sum(dup_filter)} are duplicates and {sum(na_dt)} are missing DateTime info.')

Dataset has 6021 points, of which 0 are duplicates and 7 are missing DateTime info.


In [30]:
data = data.drop_duplicates()
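# Use ~ (logical NOT) to invert the boolean mask; unary minus on a boolean Series is not supported in current pandas
data = data[~pd.to_datetime(data.datetime).isna()]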
print(f'After cleaning we have {len(data)} data points')



After cleaning we have 6014 data points


In [22]:
data['tz'] = data['tz'].apply( lambda x : 'EST' if x == 'EDT' else x)
data['dt'] = pd.to_datetime(data['datetime']).dt.tz_localize('EST')
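
The cell above relabels every `EDT` stamp as `EST` and localizes to the fixed-offset `EST` zone. That matches the February rows shown here, but it would be off by an hour for any article published during daylight saving time, if the dataset contains any. A minimal sketch of an alternative, assuming the scraped strings always follow the `Month DD, YYYY, HH:MM:SS AM/PM TZ` layout seen in the sample rows and represent New York wall-clock times (the July row is made up for illustration):

```python
import pandas as pd

# Two sample strings in the scraped layout; the second one is hypothetical.
raw = pd.Series([
    "February 07, 2019, 09:11:00 PM EDT",
    "July 15, 2018, 10:30:00 AM EDT",
])

# Drop the trailing " EDT" label and parse with an explicit format.
naive = pd.to_datetime(raw.str[:-4], format="%B %d, %Y, %I:%M:%S %p")

# Localize to the regional zone so the UTC offset follows daylight saving rules.
# Ambiguous or nonexistent wall-clock times become NaT instead of raising.
eastern = naive.dt.tz_localize(
    "America/New_York", ambiguous="NaT", nonexistent="NaT"
)

print(eastern)
```

With this approach the winter row carries a `-05:00` offset and the summer row a `-04:00` offset, so downstream market-hours checks see the correct instant in both seasons.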

In [8]:
data.head()

Unnamed: 0,datetime,stockcode,source,headline,article,urls,tz,dt
0,"February 07, 2019, 09:11:00 PM EDT",AMZN,,Amazon CEO Jeff Bezos Accuses National Enquire...,\n\n\nShutterstock photo\n\n@media screen and ...,https://www.nasdaq.com/article/amazon-ceo-jeff...,EST,2019-02-07 21:11:00-05:00
1,"February 07, 2019, 02:49:22 PM EDT",AMZN,BNK Invest,"Notable Thursday Option Activity: AMZN, BKNG, ...",\nAmong the underlying components of the S&P 5...,https://www.nasdaq.com/article/notable-thursda...,EST,2019-02-07 14:49:22-05:00
2,"February 07, 2019, 06:55:00 PM EDT",AMZN,,Amazon's Bezos says National Enquirer owner tr...,\n\n\nReuters\n\n@media screen and (max-device...,https://www.nasdaq.com/article/amazons-bezos-s...,EST,2019-02-07 18:55:00-05:00
3,"February 07, 2019, 06:26:00 PM EDT",AMZN,,Amazon's Bezos says National Enquirer tried to...,\n\n\nReuters\n\n@media screen and (max-device...,https://www.nasdaq.com/article/amazons-bezos-s...,EST,2019-02-07 18:26:00-05:00
4,"February 07, 2019, 11:55:45 AM EDT",AMZN,InvestorPlace Media,"IRS Tax Refund 2019: So, Where’s My Tax Refund?","\nInvestorPlace - Stock Market News, Stock Adv...",https://www.nasdaq.com/article/irs-tax-refund-...,EST,2019-02-07 11:55:45-05:00


## Read Pre-existing DF and only work on New records

In [128]:
output_fname = 'dataset/nasdaq/overnight_sentiments.csv'

old_df = pd.read_csv(output_fname, index_col = 0)

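# NOTE: `in` on a pandas Series tests the index labels, not the values,
# so this comparison does not actually match URL strings (see the checks below)
l_new = [(url in old_df['urls']) for url in data['urls']]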

sum(l_new)

0

In [132]:
test_url = data['urls'][0]
test_url

'https://www.nasdaq.com/article/amazon-ceo-jeff-bezos-accuses-national-enquirer-owner-of-extortion--blackmail-20190207-01538'

In [133]:
test_url in old_df['urls']

False

In [130]:
old_df['urls'][0]

'https://www.nasdaq.com/article/amazon-ceo-jeff-bezos-accuses-national-enquirer-owner-of-extortion--blackmail-20190207-01538'
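
The puzzle above (the same URL is stored in `old_df['urls']`, yet `test_url in old_df['urls']` is `False`) comes from how pandas defines `in` for a Series: it looks at the index labels, not the stored values. A small sketch of the behaviour and of a value-based check with `Series.isin`; the URLs and the names `old_urls`, `new_urls`, and `already_seen` are placeholders, not part of the notebook:

```python
import pandas as pd

old_urls = pd.Series(["https://example.com/a", "https://example.com/b"])
new_urls = pd.Series(["https://example.com/b", "https://example.com/c"])

# `in` checks the index (0, 1), so a stored URL is reported as missing...
found_by_in = "https://example.com/a" in old_urls
# ...while an index label is reported as present.
label_found = 0 in old_urls

# Value-based membership, one boolean per new URL.
already_seen = new_urls.isin(old_urls)

# Keep only the URLs that were not processed before.
to_process = new_urls[~already_seen]

print(found_by_in, label_found)
print(to_process.tolist())
```

Note that the same article URL can appear under several stock codes in this dataset, so a real "new records" filter would probably need to compare on the `stockcode` and `urls` pair rather than on `urls` alone.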

## Label Overnight News and Output Next Market Date

In [38]:
from datetime import datetime
from mkt_dt_utils import IsMarketOpen, GetNextMktDate, days_hours_mins_secs

temp_fname = 'dataset/nasdaq/overnight_sentiments_temp.csv'
ExchgName = 'NYSE'
data_limit = None #10

df = data.drop(columns= ['tz'])
if data_limit:
    df = df[: data_limit]

stime = datetime.now()
print(f'Filtering for {len(df)} Overnight News Articles...')
df['IsMarketOpen'] = df['dt'].apply(
    lambda x: IsMarketOpen(x.to_pydatetime(), ExchgName)    
    )

df.to_csv(temp_fname)

print(f'Determining Trade Date for Overnight News Articles...')
df_on = df[df['IsMarketOpen'] == False]
df_on['TradeDate'] = df_on.apply(
    lambda x: GetNextMktDate(x['dt'].to_pydatetime(),ExchgName),
    axis =1
    )

df_on.to_csv(temp_fname)

ttime = datetime.now() - stime
d_ , h_, m_ , s_ = days_hours_mins_secs(ttime)
print(f'Time elapsed {h_} hours, {m_} minutes, {s_} seconds.')

Filtering for 6014 Overnight News Articles...
Determining Trade Date for Overnight News Articles...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Time elapsed 1 hours, 29 minutes, 47 seconds.
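
The `SettingWithCopyWarning` in the output above arises because `df_on` is a filtered selection of `df`, and pandas cannot tell whether the new `TradeDate` column is meant for the selection or for the original frame. A minimal sketch of one common way to make the intent explicit, using a toy frame and made-up trade dates in place of the real data:

```python
import pandas as pd

df = pd.DataFrame({
    "headline": ["a", "b", "c"],
    "IsMarketOpen": [True, False, False],
})

# Take an explicit copy of the overnight rows so later assignments
# clearly target the new frame and not a view of `df`.
df_on = df.loc[~df["IsMarketOpen"]].copy()

# Placeholder values; the notebook computes these with GetNextMktDate.
df_on["TradeDate"] = ["2019-02-08", "2019-02-08"]

print(df_on)
```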


In [39]:
print(f'--- Only {len(df_on)} of {len(df)} are overnight news ---')

--- Only 3148 of 6014 are overnight news ---
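
To make the labelling concrete, here is a simplified stand-in for the two helpers used above. It is not the implementation in `mkt_dt_utils`: it assumes regular NYSE hours of 09:30 to 16:00 Eastern, treats the input as exchange-local time, and ignores holidays and early closes. The function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta

MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def is_market_open_simple(ts: datetime) -> bool:
    """True on weekdays between 09:30 and 16:00 (holidays ignored)."""
    return ts.weekday() < 5 and MARKET_OPEN <= ts.time() < MARKET_CLOSE


def next_market_date_simple(ts: datetime) -> date:
    """Date of the first session that opens after an overnight timestamp."""
    day = ts.date()
    # News after the close, or on a weekend, belongs to a later session.
    if ts.weekday() >= 5 or ts.time() >= MARKET_CLOSE:
        day += timedelta(days=1)
    # Skip Saturday and Sunday.
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return day


evening = datetime(2019, 2, 7, 21, 11)     # Thursday, after the close
pre_open = datetime(2019, 2, 7, 9, 25, 9)  # Thursday, before the open
print(is_market_open_simple(evening), next_market_date_simple(evening))
print(is_market_open_simple(pre_open), next_market_date_simple(pre_open))
```

Under these assumptions, Thursday-evening news is assigned to Friday's session and pre-open news to the same day's session, which agrees with the `TradeDate` values in the sample rows later in this notebook.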


## Generate Relevance + Sentiment for each Article

In [17]:
import pandas as pd
from datetime import datetime
from mkt_dt_utils import days_hours_mins_secs

temp_fname = 'dataset/nasdaq/overnight_sentiments_temp.csv'

#df_ = df_on
tmp_fname = 'dataset/nasdaq/overnight_sentiments_temp_0.csv'
df_ = pd.read_csv(tmp_fname, index_col = 0)

In [18]:
from sentiment_utils import GetCleanText, GetSummary, GetRelavency, GetSentimentScore, Stock_KW_Dict

stime = datetime.now()

# 1. Clean + Summarize
print(f'Summarizing {len(df_)} overnight news article...')
df_['summary'] = df_['article'].apply(
    lambda x : GetSummary(GetCleanText( x ), method = 'sumy-lex_rank')
)

df_.to_csv(temp_fname)

ttime = datetime.now() - stime
d_ , h_, m_ , s_ = days_hours_mins_secs(ttime)
print(f'Time elapsed {h_} hours, {m_} minutes, {s_} seconds.')

Summarizing 3148 overnight news article...
Time elapsed 0 hours, 3 minutes, 29 seconds.
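
Each processing step in this notebook repeats the same start/stop timing lines. One possible way to factor that out is a small context manager. The sketch below uses hypothetical names (`split_timedelta`, `timed`) and its own helper, since the exact behaviour of `days_hours_mins_secs` in `mkt_dt_utils` is not shown here:

```python
from contextlib import contextmanager
from datetime import datetime, timedelta


def split_timedelta(td: timedelta):
    """Break a timedelta into (days, hours, minutes, seconds)."""
    minutes, seconds = divmod(td.seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return td.days, hours, minutes, seconds


@contextmanager
def timed(label: str):
    """Print the elapsed wall-clock time of the enclosed block."""
    start = datetime.now()
    try:
        yield
    finally:
        _, h, m, s = split_timedelta(datetime.now() - start)
        print(f"{label}: time elapsed {h} hours, {m} minutes, {s} seconds.")


with timed("Summarizing"):
    total = sum(range(1000))  # stand-in for the real work
```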


In [19]:
# 3. Relevance
stime = datetime.now()
print(f'Determining {len(df_)} overnight news article relevance...')
df_['_relevance'] = df_.apply(
    lambda x: GetRelavency( x['summary'], Stock_KW_Dict[ x['stockcode']], debugmode = False),
    axis = 1
)

df_.to_csv(temp_fname)

ttime = datetime.now() - stime
d_ , h_, m_ , s_ = days_hours_mins_secs(ttime)
print(f'Time elapsed {h_} hours, {m_} minutes, {s_} seconds.')

Determining 3148 overnight news article relevance...
Time elapsed 0 hours, 49 minutes, 12 seconds.
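
The relevance step maps each summary and the keyword list for its stock to an integer score. The actual `GetRelavency` logic lives in `sentiment_utils` and is not shown here; purely as an illustration of the idea, a keyword-count score could look like the sketch below. The function name and the keyword list are assumptions.

```python
import re


def count_keyword_hits(text: str, keywords: list[str]) -> int:
    """Count whole-word, case-insensitive keyword occurrences in text."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(kw.lower()) + r"\b", lowered))
        for kw in keywords
    )


# Hypothetical keyword list for one ticker.
amzn_keywords = ["Amazon", "Bezos", "AWS"]
summary = "Amazon CEO Jeff Bezos says Amazon will respond."

print(count_keyword_hits(summary, amzn_keywords))
```

A score of zero under a scheme like this would mean the summary never mentions the stock's keywords, which is one way to read the many zero `_relevance` values in the final dataset.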


In [25]:
import numpy as np

# 4. Sentiment
stime = datetime.now()
print(f'Generating {len(df_)} Sentiment Score...')
df_['_sentiment'] = df_['summary'].apply(
    lambda x: GetSentimentScore( x , method = 'bespoke') if x != '' else np.nan
)

df_.to_csv(temp_fname)
ttime = datetime.now() - stime
d_ , h_, m_ , s_ = days_hours_mins_secs(ttime)
print(f'Time elapsed {h_} hours, {m_} minutes, {s_} seconds.')

Generating 3148 Sentiment Score...
Time elapsed 0 hours, 2 minutes, 25 seconds.


In [26]:
df_.head()

Unnamed: 0,datetime,stockcode,source,headline,article,urls,dt,IsMarketOpen,TradeDate,summary,_relevance,_sentiment
0,"February 07, 2019, 09:11:00 PM EDT",AMZN,,Amazon CEO Jeff Bezos Accuses National Enquire...,\n\n\nShutterstock photo\n\n@media screen and ...,https://www.nasdaq.com/article/amazon-ceo-jeff...,2019-02-07 21:11:00-05:00,False,2019-02-08,Shutterstock photo@media screen and (Amazon CE...,11,0.717833
2,"February 07, 2019, 06:55:00 PM EDT",AMZN,,Amazon's Bezos says National Enquirer owner tr...,\n\n\nReuters\n\n@media screen and (max-device...,https://www.nasdaq.com/article/amazons-bezos-s...,2019-02-07 18:55:00-05:00,False,2019-02-08,"Jeff Bezos, chief executive of Amazon.com Inc,...",5,0.34276
3,"February 07, 2019, 06:26:00 PM EDT",AMZN,,Amazon's Bezos says National Enquirer tried to...,\n\n\nReuters\n\n@media screen and (max-device...,https://www.nasdaq.com/article/amazons-bezos-s...,2019-02-07 18:26:00-05:00,False,2019-02-08,"Jeff Bezos, chief executive of Amazon.com Inc,...",5,0.45036
22,"February 07, 2019, 09:25:09 AM EDT",AMZN,Motley Fool,Only 1 Amazon Metric Should Really Matter to I...,\n Wall Street didn't show much love for Ama...,https://www.nasdaq.com/article/only-1-amazon-m...,2019-02-07 09:25:09-05:00,False,2019-02-07,"Yes, management's projections for next quart...",4,0.41008
23,"February 07, 2019, 08:36:00 AM EDT",AMZN,Motley Fool,Amazon Is Running Away With the Smart-Speaker ...,\n The latest data for the smart-speaker marke...,https://www.nasdaq.com/article/amazon-is-runni...,2019-02-07 08:36:00-05:00,False,2019-02-07,The company's Echo devices hold about 70% of t...,5,0.43714


### Some Final Data Cleaning before exporting to CSV

In [30]:
from nltk import sent_tokenize

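def get_source(df, debugmode = False):
    """Fill in missing `source` values by looking for a wire-service name
    in the first, second, or last sentence of the article.

    input: the entire dataframe (modified in place and returned)
    """

    for i in df.index:

        if debugmode:
            print(f'row {i} of {len(df)}')

        if pd.isna(df.loc[i, 'source']):
            sent_tokens = sent_tokenize(df.loc[i, 'article'])

            # Nothing to inspect if the article has no sentences
            if not sent_tokens:
                continue

            # Assign with .loc rather than chained indexing (df.source[i] = ...),
            # which can silently write to a temporary copy
            if len(sent_tokens) > 1:
                if ('RTTNews' in sent_tokens[0]) or ('RTTNews' in sent_tokens[-1]) or ('RTTNews' in sent_tokens[1]):
                    df.loc[i, 'source'] = 'RTTNews'
                elif ('Reuters' in sent_tokens[0]) or ('reuters' in sent_tokens[-1]) or ('Reuters' in sent_tokens[1]):
                    df.loc[i, 'source'] = 'Reuters'
            else:
                if 'RTTNews' in sent_tokens[0]:
                    df.loc[i, 'source'] = 'RTTNews'
                elif 'Reuters' in sent_tokens[0]:
                    df.loc[i, 'source'] = 'Reuters'
    return df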

In [31]:
df_out = get_source(df_)
source_filt = df_out['source'].isna()
print(f'{sum(source_filt)} records with source = NA')
df_out[source_filt]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


0 records with source = NA


Unnamed: 0,datetime,stockcode,source,headline,article,urls,dt,IsMarketOpen,TradeDate,summary,_relevance,_sentiment


#### Remove Records with No Summary and No Sentiment Score

In [33]:
bad_data = df_out[df_out['summary']== '']
print(f'{len(bad_data)} records with no summary')
bad_data

4 records with no summary


Unnamed: 0,datetime,stockcode,source,headline,article,urls,dt,IsMarketOpen,TradeDate,summary,_relevance,_sentiment
1123,"December 20, 2018, 08:32:00 AM EDT",AMZN,Reuters,"U.S. STOCKS ON THE MOVE-Tilray, Conagra, MannK...",\n\n\nReuters\n\n@media screen and (max-device...,https://www.nasdaq.com/article/us-stocks-on-th...,2018-12-20 08:32:00-05:00,False,2018-12-20,,0,
2340,"December 20, 2018, 08:32:00 AM EDT",GOOGL,Reuters,"U.S. STOCKS ON THE MOVE-Tilray, Conagra, MannK...",\n\n\nReuters\n\n@media screen and (max-device...,https://www.nasdaq.com/article/us-stocks-on-th...,2018-12-20 08:32:00-05:00,False,2018-12-20,,0,
3827,"December 20, 2018, 08:32:00 AM EDT",FB,Reuters,"U.S. STOCKS ON THE MOVE-Tilray, Conagra, MannK...",\n\n\nReuters\n\n@media screen and (max-device...,https://www.nasdaq.com/article/us-stocks-on-th...,2018-12-20 08:32:00-05:00,False,2018-12-20,,0,
4941,"January 18, 2019, 07:51:00 AM EDT",NFLX,Reuters,"U.S. STOCKS ON THE MOVE-Tesla, Netflix, VF Cor...",\n\n\nReuters\n\n@media screen and (max-device...,https://www.nasdaq.com/article/us-stocks-on-th...,2019-01-18 07:51:00-05:00,False,2019-01-18,,0,


In [43]:
df_out = df_out[ df_out['summary']!= '']
print(f'Final DF has {len(df_out)} records.')

Final DF has 3144 records.
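
One caveat with the `== ''` filter: it works when the summaries were just computed in memory, but after a round trip through `to_csv` and `read_csv` pandas reads empty fields back as `NaN`, so the comparison no longer matches them. A small sketch of a check that handles both cases, using an inline CSV and the hypothetical name `has_summary`:

```python
import io

import pandas as pd

# Second row has an empty summary field.
csv_text = "urls,summary\nu1,Some summary\nu2,\n"
df = pd.read_csv(io.StringIO(csv_text))

# After reading from CSV the empty field is NaN, so this finds nothing.
empty_by_equality = (df["summary"] == "").sum()

# Treat NaN and whitespace-only strings as "no summary".
has_summary = df["summary"].fillna("").str.strip() != ""
df_clean = df[has_summary]

print(empty_by_equality, len(df_clean))
```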


In [122]:
data.columns

Index(['datetime', 'stockcode', 'source', 'headline', 'article', 'urls', 'tz',
       'dt'],
      dtype='object')

In [44]:
df_out.columns

Index(['datetime', 'stockcode', 'source', 'headline', 'article', 'urls', 'dt',
       'IsMarketOpen', 'TradeDate', 'summary', '_relevance', '_sentiment'],
      dtype='object')

In [45]:
df_out.describe()

Unnamed: 0,_relevance,_sentiment
count,3144.0,3144.0
mean,1.235051,0.385152
std,1.882425,0.262075
min,0.0,-0.84185
25%,0.0,0.24296
50%,0.0,0.41643
75%,2.0,0.547865
max,21.0,0.992


### Add new data into CSV file

In [46]:
df_all = df_out
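
The cell above simply takes the freshly processed frame as the full dataset. If previously exported records are to be kept as well, one option is to concatenate the old and new frames and drop repeated rows. The sketch below uses toy frames and assumes that a record is identified by its `stockcode` and `urls` pair (the same article can be listed under several tickers):

```python
import pandas as pd

old_df = pd.DataFrame({
    "stockcode": ["AMZN", "GOOGL"],
    "urls": ["u1", "u1"],
    "_sentiment": [0.10, 0.20],
})
df_out = pd.DataFrame({
    "stockcode": ["AMZN", "NFLX"],
    "urls": ["u1", "u2"],
    "_sentiment": [0.15, 0.30],
})

# Stack old and new records, keeping the most recent version of any
# (stockcode, urls) pair that appears in both.
df_all = (
    pd.concat([old_df, df_out], ignore_index=True)
    .drop_duplicates(subset=["stockcode", "urls"], keep="last")
    .reset_index(drop=True)
)

print(df_all)
```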

## Write DF to File

In [47]:
l_req_col = ['datetime', 'stockcode', 'source', 'headline', 'article', 'urls']
output_fname = 'dataset/nasdaq/overnight_sentiments.csv'

df_all.to_csv(output_fname)

In [48]:
df_all.head()

Unnamed: 0,datetime,stockcode,source,headline,article,urls,dt,IsMarketOpen,TradeDate,summary,_relevance,_sentiment
0,"February 07, 2019, 09:11:00 PM EDT",AMZN,RTTNews,Amazon CEO Jeff Bezos Accuses National Enquire...,\n\n\nShutterstock photo\n\n@media screen and ...,https://www.nasdaq.com/article/amazon-ceo-jeff...,2019-02-07 21:11:00-05:00,False,2019-02-08,Shutterstock photo@media screen and (Amazon CE...,11,0.717833
2,"February 07, 2019, 06:55:00 PM EDT",AMZN,Reuters,Amazon's Bezos says National Enquirer owner tr...,\n\n\nReuters\n\n@media screen and (max-device...,https://www.nasdaq.com/article/amazons-bezos-s...,2019-02-07 18:55:00-05:00,False,2019-02-08,"Jeff Bezos, chief executive of Amazon.com Inc,...",5,0.34276
3,"February 07, 2019, 06:26:00 PM EDT",AMZN,Reuters,Amazon's Bezos says National Enquirer tried to...,\n\n\nReuters\n\n@media screen and (max-device...,https://www.nasdaq.com/article/amazons-bezos-s...,2019-02-07 18:26:00-05:00,False,2019-02-08,"Jeff Bezos, chief executive of Amazon.com Inc,...",5,0.45036
22,"February 07, 2019, 09:25:09 AM EDT",AMZN,Motley Fool,Only 1 Amazon Metric Should Really Matter to I...,\n Wall Street didn't show much love for Ama...,https://www.nasdaq.com/article/only-1-amazon-m...,2019-02-07 09:25:09-05:00,False,2019-02-07,"Yes, management's projections for next quart...",4,0.41008
23,"February 07, 2019, 08:36:00 AM EDT",AMZN,Motley Fool,Amazon Is Running Away With the Smart-Speaker ...,\n The latest data for the smart-speaker marke...,https://www.nasdaq.com/article/amazon-is-runni...,2019-02-07 08:36:00-05:00,False,2019-02-07,The company's Echo devices hold about 70% of t...,5,0.43714
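
Since the stated goal is to study these scores against overnight returns, the per-article rows will likely need to be collapsed to one value per stock and trade date. The following is only a sketch of one possible aggregation, a relevance-weighted mean that falls back to a plain mean when all weights are zero. The weighting scheme and the toy NFLX row are assumptions, not something this notebook prescribes.

```python
import pandas as pd

df = pd.DataFrame({
    "stockcode": ["AMZN", "AMZN", "AMZN", "NFLX"],
    "TradeDate": ["2019-02-08", "2019-02-08", "2019-02-07", "2019-02-08"],
    "_relevance": [11, 5, 4, 0],
    "_sentiment": [0.717833, 0.34276, 0.41008, 0.5],
})

# Pre-multiply so the weighted mean reduces to two group sums.
df["_weighted"] = df["_sentiment"] * df["_relevance"]

daily = df.groupby(["stockcode", "TradeDate"]).agg(
    n_articles=("_sentiment", "size"),
    mean_sentiment=("_sentiment", "mean"),
    weighted_sum=("_weighted", "sum"),
    relevance_sum=("_relevance", "sum"),
)

# Weighted mean where some relevance exists, plain mean otherwise.
daily["weighted_sentiment"] = (
    daily["weighted_sum"] / daily["relevance_sum"]
).where(daily["relevance_sum"] > 0, daily["mean_sentiment"])

print(daily[["n_articles", "mean_sentiment", "weighted_sentiment"]])
```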
