# <center>Testing News Asymmetry Between Small and Large Companies</center>


### <center>A sentiment analysis study by Robert Grote and Ryan Fairhurst</center>

This is an interactive Python notebook, which requires that the user run the cells within in order to see the results. Run the cells from the top, all the way down the page by clicking on the first cell, then holding down shift or command or control (depending on whether you're on PC or Mac), then hitting enter.

Install the modules that you will need for reading in all of the data by running the following cell.    
Uncomment the following two lines the first time you run this in order to install the necessary modules by deleting the hash symbol.

In [3]:
#For Mac
'''
!pip install pandas-datareader
!pip install yahoo-finance
!pip install matplotlib
'''

'\n!pip install pandas-datareader\n!pip install yahoo-finance\n!pip install matplotlib\n'

Import the following modules:

In [4]:
import yahoo_finance as yfin
import pandas as pd
import pandas_datareader.data as web
import datetime
import csv
import math
import pandas_datareader.data as data
from yahoo_finance import Share
from pandas_datareader.yahoo.quotes import _yahoo_codes
import matplotlib

The basic idea is that, as companies grow in size, the relative proportion of bad news grows with respect to the amount of good news. We are interested in testing whether this is true and appears in the data, whether it differs by industry, whether there is a maximum threshold of how much bad news can be experienced by small companies, and how the proportions change by company size. In the following  prototype, we conceptualize using jumps in Google Trends data and jumps in stock price data to detect newsworthy events, and observations which will amount to the idea of linking sentiment to a company, then recording its size and several other key variables that we think will have interesting relationships with company size and sentiment.

In [5]:
#'Jumps' in data--a place for our ideas
"""
Defining a Jump
When an event happens
Positive/Negative Event
Positive/Negative interest
"""
#trends-->prices
#prices-->trends
#volume-->prices


#Jump Program: Percentage change--not dollar change
#Different times (Stamp when, magnitude, duration, direction)
#--make a program that measures relatively, not absolutely (percentages, not dollars)
#Sustained vs transient growth

#Jumps:
#1: Identify Jumps
#2:Categorize comparable categories
##Compare news profiles (trends)
##Compare

#Var Types:
#C-Continuous
#S-Scalar
#B-Binary (Dummy)
#P-Percentage change
#R-Rate of change
#D-Discreet--qualitative
#Z-Z Score
#T-Time

#Create a new dataset of jump observations
###Categories###
#*0: Company Name/Ticker Symbol
#1:S Size of company (Market cap pre- and post-jump)
#*2:B Direction (+/-, good/bad news)
#*3:S Time: Does this relationship between news and jumps change in different market conditions?
#4:B Period (Does the relationship between news and jumps change in different market conditions (recession, recovery))
###Possibly just use S&P 500 or define broad market categories
#*4:P Magnitude (percent change in price from before jump until after jump)
#5:R Average Rate of change (magnitude/time [measured in days])
#6:D Industry (do different industries behave differently)
#7:B Sustainability-do prices return to pre-jump level (separate test)
#8: Accompanying Google Trends pattern
###Increased interest before/after/during
###sustained interest sustained increased interest?
###Magnitude
###Rate of change
#*9:Z News volatility (maybe find a relationship between this and company size)
#10:Z Stock Price Volatility
#10:T Duration: when do prices stabilize?

#Defining jump: like Robot Broker Program, but with variable x time and variable percentage p
#get rid of redundancy and characterize different types of movement

'\nDefining a Jump\nWhen an event happens\nPositive/Negative Event\nPositive/Negative interest\n'

In [6]:
stockHistory=pd.read_csv('WilshireHist')

In [7]:
#graph=stockHistory.plot.line()
#graph.draw()

In [8]:
stockHistory

Unnamed: 0,Date,AA,AACC,AAI,AAII,AAME,AAN,AAON,AAP,AAPL,...,ZN,ZNT,ZOLL,ZOLT,ZOOM,ZQK,ZRAN,ZTHO,ZUMZ,ZZ
0,2004-01-01,,,,,,,,,,...,,,,,,,,,,
1,2004-01-02,,,,,2.967206,7.933725,3.250350,25.954398,1.384490,...,,,,5.250,,,,,,
2,2004-01-05,,,,,2.957786,8.036587,3.265805,25.718507,1.442394,...,,,,5.210,,,,,,
3,2004-01-06,,,,,3.004885,8.036587,3.312162,26.177538,1.437189,...,,,,5.409,,,,,,
4,2004-01-07,,,,,2.863589,7.929433,3.294995,26.999970,1.469720,...,,,,5.700,,,,,,
5,2004-01-08,,,,,2.901268,8.208036,3.298428,26.285920,1.519816,...,,,,5.400,,,,,,
6,2004-01-09,,,,,2.882428,8.229465,3.370541,26.254043,1.496395,...,,,,5.210,,,,,,
7,2004-01-12,,,,,2.929527,8.400914,3.348219,26.566441,1.543889,...,,,,5.460,,,,,,
8,2004-01-13,,,,,2.920107,8.465209,3.348219,26.713074,1.569262,...,,,,5.400,,,,,,
9,2004-01-14,,,,,2.920107,8.572364,3.504475,26.840585,1.574467,...,,,,5.250,,,,,,


In [9]:
jumpFilter=pd.DataFrame()
#jumpFilter['ticker']=[]
#jumpFilter['percentjumpy']=[]
#jumpFilter['posneg']=[]
#jumpFilter['startdate']=[]


In [10]:
jumpFilter

In [11]:
print stockHistory.iloc[0,0]

2004-01-01


In [12]:
type(stockHistory.values[1][0])

str

In [13]:
stockHistory.values[1:5]

array([['2004-01-02', nan, nan, ..., nan, nan, nan],
       ['2004-01-05', nan, nan, ..., nan, nan, nan],
       ['2004-01-06', nan, nan, ..., nan, nan, nan],
       ['2004-01-07', nan, nan, ..., nan, nan, nan]], dtype=object)

In [23]:
def jumpFilter(df, y=.1,x=1):
    lastPrice=1000 #something really high so that we do not have to call the first thing a jump
    listOfJumps=[] #ticker, date, jump threshold passed, lookahead period, pos/neg, percentJump, avgdailyrateofchange, counter
    y=.15 #threshold
    x=1 #lookahead period
    for i in range(1,len(df.columns)): #go over the columns. skip the first one because it is a set of dates; change 3 to len(df.columns) for implementation
        counter=0
        for j in range(len(df.values[1:10])): #go down and check for each row; get rid of the [1:10] to be comprehensive
            if (j+x)<len(df.values): #execute the following code until the final date minus the lookahead
                if type(df.values[j][i])!=str:
                    if math.isnan(float(df.values[j][i]))==False:
                        currentPrice=df.values[j][i]
                        futurePrice=df.values[j+x][i]
                        if futurePrice>currentPrice:
                            posneg=1
                        elif futurePrice<currentPrice:
                            posneg=0
                        percentJump=abs((currentPrice-futurePrice)/currentPrice)
                        if (currentPrice-futurePrice)/currentPrice>=y:
                            counter+=1
                            avgDailyRateChange=percentJump/(x+counter-1)
                        elif (currentPrice-futurePrice)/currentPrice<y and counter>=1:
                            listOfJumps.append([df.columns[i],df.values[j][0], y, x, posneg, percentJump, avgDailyRateChange, counter])
                            counter=0
    return listOfJumps

In [88]:
jumpFilter(stockHistory) #columns=['ticker','date','ythresh','lookahead','pos1/neg0','percentjump','avgdailyrateofchange','counter']

[['AACC',
  '2004-02-26',
  0.03,
  1,
  1,
  0.012784880489160669,
  0.03383458646616555,
  1],
 ['AACC',
  '2004-03-12',
  0.03,
  1,
  1,
  0.00673400673400679,
  0.03675675675675674,
  1],
 ['AACC',
  '2004-04-14',
  0.03,
  1,
  0,
  0.025773195876288662,
  0.04901960784313726,
  1],
 ['AACC', '2004-06-21', 0.03, 1, 0, 0.0, 0.034852546916890007, 1],
 ['AACC',
  '2004-06-28',
  0.03,
  1,
  1,
  0.044738500315059805,
  0.022275737507525595,
  2],
 ['AACC',
  '2004-07-14',
  0.03,
  1,
  1,
  0.003634161114475997,
  0.042898550724637594,
  1],
 ['AACC',
  '2004-07-21',
  0.03,
  1,
  1,
  0.030303030303030304,
  0.03339191564147629,
  1],
 ['AACC',
  '2004-09-08',
  0.03,
  1,
  1,
  0.0023242300987797297,
  0.03908431044109432,
  1]]

In [None]:
ourdata=pd.DataFrame(jumpFilter(stockHistory), columns=['ticker','date','ythresh','lookahead','pos1/neg0','percentjump','avgdailyrateofchange','counter'])

In [None]:
ourdata

In [16]:
bonk=float('NaN')
bonk
math.isnan(bonk)
type(bonk)

NameError: name 'math' is not defined

In [17]:
#jumplist=[]
#for rowIndex in range(len(stockHistory.values)): #delete the 3 for enlarged application
#    for ticker in range(len(stockHistory.columns[1:3])):
#        myfavthings.append(stockHistory.iloc[ticker,rowIndex])
#jumplist

In [18]:
#jumpFilter['stuff']=jumplist

In [19]:
#stockHistory.values

In [20]:
#list(stockHistory)

## Companies we had in mind:

Dryships
<img src='DRYS.png'>
(Source: Yahoo Finance)

Semi-LED Corp
<img src='DRYS.png'>
(Source: Yahoo Finance)

Chipotle
<img src='CMG.png'>
(Source: Yahoo Finance)

In [43]:
#read in the csv with the russell 3000 stock data and Wilshire 5000 stock data (Just a dataframe of symbols for now)
#Commented out since the repository now contains this csv

dfRussell=pd.read_csv('russell_3000_2011-06-27.csv')
dfWilshire=pd.read_csv('wilshire5000.csv')

In [44]:
#Taking the values from the dataframes and making them into arrays
#Commented out since the repository now contains this csv

arrayRussell=dfRussell.values.flatten()
arrayWilshire=dfWilshire.values.flatten()

In [45]:
#Make the dataframe with the adjusted close prices for all of the Wilshire 5000 Companies
#Commented out since the repository now contains this csv

ls_key = 'Adj Close'
start = datetime.datetime(2004,1,1)
end = datetime.datetime(2016,12,31)
f = data.DataReader(arrayWilshire[:], 'yahoo',start,end)

cleanData = f.ix[ls_key]
dataFrame = pd.DataFrame(cleanData)

print dataFrame[:]



                   AA    AACC  AAI  AAII      AAME        AAN       AAON  \
Date                                                                       
2004-01-01        NaN     NaN  NaN   NaN       NaN        NaN        NaN   
2004-01-02        NaN     NaN  NaN   NaN  2.967206   7.933725   3.250350   
2004-01-05        NaN     NaN  NaN   NaN  2.957786   8.036587   3.265805   
2004-01-06        NaN     NaN  NaN   NaN  3.004885   8.036587   3.312162   
2004-01-07        NaN     NaN  NaN   NaN  2.863589   7.929433   3.294995   
2004-01-08        NaN     NaN  NaN   NaN  2.901268   8.208036   3.298428   
2004-01-09        NaN     NaN  NaN   NaN  2.882428   8.229465   3.370541   
2004-01-12        NaN     NaN  NaN   NaN  2.929527   8.400914   3.348219   
2004-01-13        NaN     NaN  NaN   NaN  2.920107   8.465209   3.348219   
2004-01-14        NaN     NaN  NaN   NaN  2.920107   8.572364   3.504475   
2004-01-15        NaN     NaN  NaN   NaN  2.920107   8.572364   3.700213   
2004-01-16  

In [46]:
#Make the dataframe into a csv for turning into more meaningful data on 'jumps'
#Commented out since the repository now contains this csv

dataFrame.to_csv('WilshireHist')

In [25]:
yahoo=Share('YHOO')

NameError: name 'Share' is not defined

In [26]:
yahoo.get_market_cap()

NameError: name 'yahoo' is not defined

# Land of Forgotten Code

In [8]:
#start=datetime.datetime(2013,1,1)
#end=datetime.datetime(2017,1,1)
#df = data.DataReader(arrayWilshire[0:4], 'yahoo', start, end) 
#print df
#dates =[]
#for x in range(len(df)):
#    newdate = str(df.index[x])
#    newdate = newdate[0:10]
#    dates.append(newdate)

#df['dates'] = dates

#print df.head()
#print df.tail()

In [None]:
#yahoo=Share('YHOO')
#print yahoo
#yahoo.get_price()

#stocklist = ['aapl','goog','fb','amzn','COP']

#http://www.jarloo.com/yahoo_finance/
#https://greenido.wordpress.com/2009/12/22/yahoo-finance-hidden-api/
#_yahoo_codes.update({'Market Cap': 'j1'})
#_yahoo_codes.update({'Div Yield': 'y'})
#_yahoo_codes.update({'Bid': 'b'})
#_yahoo_codes.update({'Ask': 'a'})
#_yahoo_codes.update({'Prev Close': 'p'})
#_yahoo_codes.update({'Open': 'o'})
#_yahoo_codes.update({'1 yr Target Price': 't8'})
#_yahoo_codes.update({'Earnings/Share': 'e'})
#_yahoo_codes.update({"Day’s Range": 'm'})
#_yahoo_codes.update({'52-week Range': 'w'})
#_yahoo_codes.update({'Volume': 'v'})
#_yahoo_codes.update({'Avg Daily Volume': 'a2'})
#_yahoo_codes.update({'EPS Est Current Year': 'e7'})
#_yahoo_codes.update({'EPS Est Next Quarter': 'e9'})

#data.get_quote_yahoo(stocklist).to_csv('test.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)

#data.get_quote_yahoo(stocklist).transpose()