The aim of this work is to analyse transcripts and do comparisons between transcripts of adjacent quarters.

In order to do that we use 3 methods:
    1. Fuzzy Wuzzy: Implement a transcript summary utility that compares one transcript to the previous one and lists the major differences between the quarters.
    
    2. Topics Modelling: What are new things that happened this quarter and didn't happen in the previous one. What things happened in the previous quarter and didn't happen in this one. 
    
    3. Give a sentiment score to each transcript with Mc Donald Loughran Lexicon, Vader and Text Blob.


# FS2017-FS2018

In [4]:
import pandas as pd
import numpy as np

from flashtext import KeywordProcessor
pos_keyword_processor = KeywordProcessor()
neg_keyword_processor = KeywordProcessor()


In [2]:
fs2018 = pd.read_csv('FS2018.csv',encoding='utf-8')
fs2017 = pd.read_csv('FS2017.csv',encoding='utf-8')

In [3]:
len(fs2018)

13371

I append the transcripts of 2017 to the file of 2018.

In [7]:
fs2018=fs2018.append(fs2017,sort=True)

In [8]:
fs2018[['Indicator','Ticker']] = fs2018.ticker.str.split(":",expand=True)
indexNames = fs2018[fs2018['Indicator']== 'EXCH' ].index
# Delete these row indexes from dataFrame
fs2018.drop(indexNames , inplace=True)


In [9]:
fs2018 = fs2018.reset_index(drop=True)

In [10]:
fs2018.head()

Unnamed: 0,Indicator,Ticker,article_date,article_id,article_source,article_text,article_title,organization_name,ticker
0,US,IUS.XX9,2018-01-03 15:00:00|2018-01-03 15:53:05,2026528,Q1 2018 Earnings Call - 2026528 : 904708104,ALL TEXT IS RELEVANT\r\n\r\nLadies and gentlem...,Q1 2018 Earnings Call,UniFirst Corp.,US:IUS.XX9
1,US,UNF,2018-01-03 15:00:00|2018-01-03 15:53:05,2026528,Q1 2018 Earnings Call - 2026528 : 904708104,ALL TEXT IS RELEVANT\r\n\r\nLadies and gentlem...,Q1 2018 Earnings Call,UniFirst Corp.,US:UNF
2,US,VERU,2018-01-05 13:00:00|2018-01-05 21:23:05,2029277,Q4 2017 Earnings Call - 2029277 : 92536C103,"ALL TEXT IS RELEVANT\r\n\r\nGood morning, ladi...",Q4 2017 Earnings Call,"Veru, Inc.",US:VERU
3,GB,MCRO,2018-01-08 09:00:00|2018-01-08 10:48:49,2021320,Q2 2018 Earnings Call - 2021320 : G6117L186,ALL TEXT IS RELEVANT\r\n\r\nGood morning every...,Q2 2018 Earnings Call,Micro Focus International Plc,GB:MCRO
4,CA,EXF,2018-01-09 22:00:00|2018-01-09 23:42:39,2024627,Q1 2018 Earnings Call - 2024627 : 302046107,ALL TEXT IS RELEVANT\r\n\r\nGood day and welco...,Q1 2018 Earnings Call,"EXFO, Inc.",CA:EXF


I remove all the stopwords to compare in more efficient manner further the transcripts of adjacent quarters.

In [11]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
def clean_string(text):
    #text=''.join([word for word in text if word not in string.punctuation])
    #text=text.lower()
    text=' '.join([word for word in text.split() if word not in stop_words])
    return text

fs2018['article_text'] = fs2018['article_text'].apply(clean_string)

In [12]:
import pickle

with open('mypickle.pickle', 'wb') as f:
    pickle.dump(fs2018, f)

# Summary transcript

### Arrange the data

I arrange the data to compare the adjacent quarters later.

In [15]:


with open("mypickle.pickle", "rb") as f:
    fs2018 = pickle.load(f)

fs2018['article_title'] = fs2018['article_title'].str[:7]
# dropping ALL duplicte values 
fs2018.drop_duplicates(subset =["article_title","Ticker"], 
                     keep = 'last', inplace = True) 


    
data = np.array([['article_title'],
                ['Q1 2017'],
                ['Q2 2017'],
                ['Q3 2017'],
                ['Q4 2017'],
                ['Q1 2018'],
                ['Q2 2018'],
                ['Q3 2018'],
                ['Q4 2018'],
                ])
                
main_df=pd.DataFrame(data=data[1:,:],
                  columns=data[0,:])


fs2018.drop(['organization_name', 'ticker', 'article_id', 'article_source', 'article_date','Indicator'], 1, inplace=True)
fs2018.dropna()

tickers = fs2018['Ticker'].unique().tolist()



for tick in tickers:
   
    df = fs2018.loc[fs2018.Ticker==tick]

    df.rename(columns={'article_text': tick}, inplace=True)
    
    df.drop(['Ticker'], 1, inplace=True)
   
  
    main_df= pd.merge(main_df , df, on='article_title',how='left')
    


main_df.set_index('article_title', inplace=True)
print(main_df)

                                                         IUS.XX9  \
article_title                                                      
Q1 2017        ALL TEXT IS RELEVANT Ladies gentlemen, thank s...   
Q2 2017        ALL TEXT IS RELEVANT Ladies gentlemen, thank s...   
Q3 2017        ALL TEXT IS RELEVANT Ladies gentlemen, thank s...   
Q4 2017        ALL TEXT IS RELEVANT Ladies gentlemen, thank s...   
Q1 2018        ALL TEXT IS RELEVANT Ladies gentlemen, thank s...   
Q2 2018        ALL TEXT IS RELEVANT Welcome second quarter ea...   
Q3 2018                                                      NaN   
Q4 2018                                                      NaN   

                                                             UNF  \
article_title                                                      
Q1 2017        ALL TEXT IS RELEVANT Ladies gentlemen, thank s...   
Q2 2017        ALL TEXT IS RELEVANT Ladies gentlemen, thank s...   
Q3 2017        ALL TEXT IS RELEVANT Ladies gent

In [16]:
with open('Earning-Quarter.pickle', 'wb') as f:
    pickle.dump(main_df, f)

# FUZZY WUZZY

I use the method FUZZY WUZZY to detect similar sentence and then get the difference between them.

In [6]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process



I first preprocess my data by stemmatizing and lematizing the words of each transcripts.

In [8]:

import re
import nltk
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer 

 
# init stemmer
porter_stemmer=PorterStemmer()
 
# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

def my_cool_preprocessor(text):
    
    text=text.lower() 
    text=re.sub("\\W"," ",text) # remove special chars
    text=re.sub("\\s+(in|the|all|for|and|on)\\s+"," _connector_ ",text) # normalize certain words

    # stem words
    words=re.split("\\s+",text)
    stemmed_words=[porter_stemmer.stem(word=word) for word in words]
    lematized_words=[lemmatizer.lemmatize(word=word) for word in stemmed_words]
    return ' '.join(lematized_words)
 

I compare all the sentence in adjacent quarter that have a ratio of similarity higher than 70% and more than 10 characters.
(To avoid the sentences like "Yeah.", "Good morning.")

Then I get 4 set:
    
    1. Words that appears in this quarter but not in the previous one.
    2. Numbers that appears in this quarter but not in the previous one.
    1. Words that appears in the previous quarter but not in the current one.
    1. Numbers that appears in this quarter but not in the current one.

In [9]:
def match_names(previous_Q,follow_Q):
    
    ratio_array=[]
    followQ_words=set()
    followQ_numerical=set()
    previousQ_numerical=set()
    previousQ_words=set()

    for row_p in previous_Q:
        for row_f in follow_Q:
            sort_ratio=fuzz.token_sort_ratio(row_p, row_f)
            if len(row_p)>10 and  len(row_f)>10 and sort_ratio>70:
               ratio_array.append((row_p,row_f,sort_ratio))
            
               elem1 = set(my_cool_preprocessor(row_p).split(' '))
               elem2 = set(my_cool_preprocessor(row_f).split(' '))
             
               for item in elem1:
                    if item not in elem2 and len(item)>2 and not item.isdigit():
                        previousQ_words.add(item)
                    elif item not in elem2 and item.isdigit():
                        previousQ_numerical.add(item)
            
               for item in elem2:
                    if item not in elem1 and len(item)>2 and not item.isdigit():
                        followQ_words.add(item)
                    elif item not in elem1 and item.isdigit():
                        followQ_numerical.add(item)
             
             
              
    return followQ_words,followQ_numerical,previousQ_numerical,previousQ_words
 
 

In [10]:
import pickle
with open("Earning-Quarter.pickle", "rb") as f:
    sum_transcript = pickle.load(f)
    sum_transcript.dropna(axis='columns',how="all")
    

with open("Earning-Quarter.pickle", "rb") as f:
    main_df= pickle.load(f)
    main_df.dropna(axis='columns',how="all")
    
main_df.head()

Unnamed: 0_level_0,IUS.XX9,UNF,VERU,MCRO,EXF,DUST,SZU,532187,8905,ABFLA,...,MXFC,540702,EEI,PURP,531092,ACR,TXCL,SKO,MYDX,CLNV
article_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Q1 2017,"ALL TEXT IS RELEVANT Ladies gentlemen, thank s...","ALL TEXT IS RELEVANT Ladies gentlemen, thank s...",,,"ALL TEXT IS RELEVANT Ladies gentlemen, thank s...",,,,"ALL TEXT IS RELEVANT I Umeda, General Manager ...",ALL TEXT IS RELEVANT Good morning. My name Reg...,...,,,,,,,,,,
Q2 2017,"ALL TEXT IS RELEVANT Ladies gentlemen, thank s...","ALL TEXT IS RELEVANT Ladies gentlemen, thank s...",ALL TEXT IS RELEVANT [Abrupt Start] gentlemen ...,,ALL TEXT IS RELEVANT Good day welcome EXFO's S...,"ALL TEXT IS RELEVANT Ladies gentlemen, thank s...",,,ALL TEXT IS RELEVANT I Yoshida AEON MALL. Than...,ALL TEXT IS RELEVANT Good morning. My name Reg...,...,,,,,,"ALL TEXT IS RELEVANT Ladies gentlemen, thank s...","ALL TEXT IS RELEVANT Ladies gentlemen, welcome...",,,
Q3 2017,"ALL TEXT IS RELEVANT Ladies gentlemen, thank s...","ALL TEXT IS RELEVANT Ladies gentlemen, thank s...","ALL TEXT IS RELEVANT Good morning, ladies gent...",,ALL TEXT IS RELEVANT Please stand by. Good day...,"ALL TEXT IS RELEVANT Good morning, ladies gent...",ALL TEXT IS RELEVANT Thank good morning everyb...,"ALL TEXT IS RELEVANT Ladies gentlemen, good da...",ALL TEXT IS RELEVANT [Abrupt Start] Now I talk...,ALL TEXT IS RELEVANT Good morning. My name Reg...,...,,,,,,,,,,
Q4 2017,"ALL TEXT IS RELEVANT Ladies gentlemen, thank s...","ALL TEXT IS RELEVANT Ladies gentlemen, thank s...","ALL TEXT IS RELEVANT Good morning, ladies gent...","ALL TEXT IS RELEVANT Good morning, everyone. T...",ALL TEXT IS RELEVANT Good day welcome EXFO's F...,"ALL TEXT IS RELEVANT Good morning, ladies gent...","ALL TEXT IS RELEVANT Thank you, good afternoon...",ALL TEXT IS RELEVANT Good afternoon. Thank joi...,,ALL TEXT IS RELEVANT Good morning. My name Reg...,...,ALL TEXT IS RELEVANT I think 9:30 kick then. T...,,,,,,,,,
Q1 2018,"ALL TEXT IS RELEVANT Ladies gentlemen, thank s...","ALL TEXT IS RELEVANT Ladies gentlemen, thank s...","ALL TEXT IS RELEVANT Good morning, ladies gent...",,ALL TEXT IS RELEVANT Good day welcome EXFO's F...,"ALL TEXT IS RELEVANT Thank standing by, ladies...","ALL TEXT IS RELEVANT Good morning, ladies gent...","ALL TEXT IS RELEVANT Ladies gentlemen, good da...",,ALL TEXT IS RELEVANT Good morning. My name Reg...,...,,,"ALL TEXT IS RELEVANT Good afternoon, welcome E...",,,,,,,


I summarize all the information in a dataframe

In [12]:
from nltk.tokenize import sent_tokenize

for col in main_df.columns[:10]:
    for row in range(1,main_df.shape[0]):
  
        if pd.isnull(main_df[col][row])==False and pd.isnull(main_df[col][row-1])==False:
            previous_Q=sent_tokenize(
                main_df[col][row-1])
            follow_Q=sent_tokenize(main_df[col][row])
        
            solution=match_names(previous_Q, follow_Q)
            sum_transcript.at[main_df.index[row-1],'{} Summary quarter'.format(col)]=' '.join(list(solution[0]))
            sum_transcript.at[main_df.index[row-1],'{} Summary num quarter'.format(col)]=' '.join(list(solution[1]))
            sum_transcript.at[main_df.index[row-1],'{} Summary previous quarter'.format(col)]=' '.join(list(solution[3]))
            sum_transcript.at[main_df.index[row-1],'{} Summary num previous quarter'.format(col)]=' '.join(list(solution[2]))
             



In [13]:
sum_transcript=sum_transcript.drop(columns=main_df.columns)
    

In [14]:
sum_transcript.head()

Unnamed: 0_level_0,IUS.XX9 Summary quarter,IUS.XX9 Summary num quarter,IUS.XX9 Summary previous quarter,IUS.XX9 Summary num previous quarter,UNF Summary quarter,UNF Summary num quarter,UNF Summary previous quarter,UNF Summary num previous quarter,VERU Summary quarter,VERU Summary num quarter,...,532187 Summary previous quarter,532187 Summary num previous quarter,8905 Summary quarter,8905 Summary num quarter,8905 Summary previous quarter,8905 Summary num previous quarter,ABFLA Summary quarter,ABFLA Summary num quarter,ABFLA Summary previous quarter,ABFLA Summary num previous quarter
article_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Q1 2017,sic like million the partial provid second mon...,391 33 43 4 9 114 1 313 2 7 363 57 8 16 358 69...,result like come million next rest second decr...,42 73 28 21 43 4 18 53 9 35 1 373 2 286 386 17...,sic like million the partial provid second mon...,391 33 43 4 9 114 1 313 2 7 363 57 8 16 358 69...,result like come million next rest second decr...,42 73 28 21 43 4 18 53 9 35 1 373 2 286 386 17...,,,...,,,rise come full expans space steadili incom hal...,22 5 12 8 21 11 141 4 9 14 7 6 31 3,million page annual high attend incom china ov...,22 5 35 12 1 8 28 71 11 500 900 0 63 7 6 9,fulli think kbw organiz kleinhanzl non phase e...,27 124 4 25 827 9 1 29 80 17 24 13 150 151 12 ...,lower happi discus million the regina session ...,448 28 30 21 623 9 1 29 539 100 17 981 14 7 18...
Q2 2017,arrow plan result like million close futur sin...,367 30 4 49 9 68 312 1 80 2 2016 55 7 19 24 3 ...,record arrow result substanti provid second un...,0 4 9 114 313 2 2016 7 3 57 11 16 358 69 22 5 ...,arrow plan result like million close futur sin...,367 30 4 49 9 68 312 1 80 2 2016 55 7 19 24 3 ...,record arrow result substanti provid second un...,0 4 9 114 313 2 2016 7 3 57 11 16 358 69 22 5 ...,eastern discus the consid associ remark better...,800 46 03 21 4 18 02 49 53 9 35 10110602 570 2...,...,,,result initi the digit page next annual doubl ...,211 8 33 2 6 19 4 18 7 32 9,oper store overview thi significantli announc ...,22 5 13 12 21 11 141 4 14 6 3,exampl result highlight underwrit stabl our te...,5 13 338 710 926 20 80 6 34 102 4 15 19 14 32 ...,think exist the rest strong pick still postag ...,27 21 4 9 200 20 1 80 2 17 7 569 13 12 168 8 1...
Q3 2017,result like full term impair charg next unifir...,0 4 104 9 35 20 7 14 207 2017 363 364 403 41 5...,plan dilut the sintro month decreas unifirst n...,367 30 4 49 9 68 312 1 29 80 2 409 55 7 19 3 8...,result like full term impair charg next unifir...,0 4 104 9 35 20 7 14 207 2017 363 364 403 41 5...,plan dilut the sintro month decreas unifirst n...,367 30 4 49 9 68 312 1 29 80 2 409 55 7 19 3 8...,fc2 preboost back million close remark execut ...,42 30 01 47 4 25 26 06 2 7 3 000 2017 13 8 51 ...,...,_connector_ reason one that third,,,,,,share join kbw punch the our page nanci regina...,5 1 29 28 2 16 14 59 24 26 9,think million our sandler regina vine remind n...,5 20 8 512 16 4 15 19 24 26
Q4 2017,dilut think like full come our decreas unifirs...,28 47 0 9 374 65 1 373 2 2016 19 3 67 2017 13 ...,like full come the next unifirst anticip addit...,625 0 4 9 35 2 2016 7 14 24 207 3 2017 363 12 ...,dilut think like full come our decreas unifirs...,28 47 0 9 374 65 1 373 2 2016 19 3 67 2017 13 ...,like full come the next unifirst anticip addit...,625 0 4 9 35 2 2016 7 14 24 207 3 2017 363 12 ...,sheet preboost dilut break the remark beyond w...,5 12 8 944 1 56 2 2019 3 08 4 04 14 10116643 6...,...,result four doe that part there today much,,,,,,lower outsid the nanci regina vine chang remin...,5 800 12 1 8 30 54 75 16 4 19 23 17 6 53 9,fulli record join million full our higher outs...,2017 5 22 1 29 11 2 6 16 4 2016 25 7 14 350 20...
Q1 2018,like the remind second revis unifirst brief bu...,42 27 85 4 9 1 2 55 7 24 3 419 379 16 387 22 5...,sheet like million gener futur paid second ste...,46 28 0 4 9 374 1 373 2 19 67 8 5 415 56 34 38...,like the remind second revis unifirst brief bu...,42 27 85 4 9 1 2 55 7 24 3 419 379 16 387 22 5...,sheet like million gener futur paid second ste...,46 28 0 4 9 374 1 373 2 19 67 8 5 415 56 34 38...,,,...,time mahesh due right base correct constraint ...,60 357 500 700 000,,,,,,,,


In [15]:
with open('Summary_transcript.pickle', 'wb') as f:
    pickle.dump(sum_transcript, f)


  
# TOPIC MODELLING

Now that I have an insight of the difference between adjacent quarters, I want to know what are the main topics of each transcript.

In [28]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation


In [33]:

with open("mypickle.pickle", "rb") as f:
    fs2018 = pickle.load(f)


I want to get the four main topics with the 20 most relevant words.

In [36]:
n_samples = 2000
n_features = 1000
n_topics = 4
n_top_words = 20
dic={}

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
   
        dic[topic_idx]=" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]])
    return dic

I personalize my list of stopwords.
And I use two methods to get to the main topics:
    1) Count Vectorizer
    2) Tfidf Vectorizer

In [37]:
my_stops = stopwords.words('english')
my_stops = my_stops + ['ahead', 'youre', 'weve', 'yeah', 'hi', 'hey', 'im', 'youve', 'theres', 'indiscernible',\
                      'thats', 'theyre', 'please', 'operator', 'glenn', 'officer', 'executive', 'vice', 'president',\
                       'mayo', 'morning']

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=2, stop_words=my_stops, max_features=n_features, ngram_range=(1,2))
cv_vectorizer = CountVectorizer(max_df=0.9, min_df=2, stop_words=my_stops, max_features=n_features)
tfidf_vectorizer 
cv_vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.9, max_features=1000, min_df=2,
                ngram_range=(1, 1), preprocessor=None,
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [44]:
cv_df=pd.DataFrame()
tfidf_df=pd.DataFrame()

quarters=[ 'Q1 2017',
            'Q2 2017',
            'Q3 2017',
            'Q4 2017',
            'Q1 2018',
            'Q2 2018',
            'Q3 2018',
            'Q4 2018']
                



for col in main_df.columns:
    for row in range(0,main_df.shape[0]):
        
        if pd.isnull(main_df[col][row])==False :
  
             tfidf = tfidf_vectorizer.fit_transform(sent_tokenize(fs2018['article_text'][row]))
             cv = cv_vectorizer.fit_transform(sent_tokenize(fs2018['article_text'][row]))
             nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5)
             topic_vectors = nmf.fit_transform(cv)
        
        

             cvdic={}
             tfidfdic={}
            
             
             cv_feature_names = cv_vectorizer.get_feature_names()
             cvdic=print_top_words(nmf, cv_feature_names, n_top_words)
             cv_df['{} {}'.format(col,quarters[row])]=list(dic.values())
            
            
             tfidf_feature_names = tfidf_vectorizer.get_feature_names()   
             tfidfdic=print_top_words(nmf, tfidf_feature_names, n_top_words)
             tfidf_df['{} {}'.format(col,quarters[row])]=list(dic.values())
             


In [45]:
cv_df.head()

Unnamed: 0,IUS.XX9 Q1 2017,IUS.XX9 Q2 2017,IUS.XX9 Q3 2017,IUS.XX9 Q4 2017,IUS.XX9 Q1 2018,IUS.XX9 Q2 2018,UNF Q1 2017,UNF Q2 2017,UNF Q3 2017,UNF Q4 2017,UNF Q1 2018,UNF Q2 2018
0,quarter million first income fiscal revenues o...,quarter million first income fiscal revenues o...,million year prior net 2017 september 30 compa...,year period focus micro hpe software half six ...,million quarter first sale 2018 2017 fourth co...,sek year million last compared quarter cash fl...,quarter million first income fiscal revenues o...,quarter million first income fiscal revenues o...,million year prior net 2017 september 30 compa...,year period focus micro hpe software half six ...,million quarter first sale 2018 2017 fourth co...,sek year million last compared quarter cash fl...
1,think going side people sort saying us busines...,think going side people sort saying us busines...,nursing men term care tamsulosin capsules rele...,debt net adjusted number ebitda million times ...,astellia exfo share financial result 2018 loss...,services advanced also products course smb mar...,think going side people sort saying us busines...,think going side people sort saying us busines...,nursing men term care tamsulosin capsules rele...,debt net adjusted number ebitda million times ...,astellia exfo share financial result 2018 loss...,services advanced also products course smb mar...
2,tax fiscal rate benefit reform 2018 expect asu...,tax fiscal rate benefit reform 2018 expect asu...,company risks related development results deve...,months 12 october revenue going 2017 look like...,year yenista optics well expenses million doll...,margin volume see bit would focus gross say th...,tax fiscal rate benefit reform 2018 expect asu...,tax fiscal rate benefit reform 2018 expect asu...,company risks related development results deve...,months 12 october revenue going 2017 look like...,year yenista optics well expenses million doll...,margin volume see bit would focus gross say th...
3,year little compared bit ago last third first ...,year little compared bit ago last third first ...,prostate cancer clinical veru oral therapies r...,business one software really think management ...,fiber gig deployment good business high 100 ge...,segment sales growth slide organic smb result ...,year little compared bit ago last third first ...,year little compared bit ago last third first ...,prostate cancer clinical veru oral therapies r...,business one software really think management ...,fiber gig deployment good business high 100 ge...,segment sales growth slide organic smb result ...


In [47]:
tfidf_df.head()

Unnamed: 0,IUS.XX9 Q1 2017,IUS.XX9 Q2 2017,IUS.XX9 Q3 2017,IUS.XX9 Q4 2017,IUS.XX9 Q1 2018,IUS.XX9 Q2 2018,UNF Q1 2017,UNF Q2 2017,UNF Q3 2017,UNF Q4 2017,UNF Q1 2018,UNF Q2 2018
0,margin honest difficult far diluted next quest...,margin honest difficult far diluted next quest...,get much prescription launch keep going start ...,product innovation debt provision go detail en...,foreign exchange interconnect credit facility ...,november remember improved good cash margin ba...,margin honest difficult far diluted next quest...,margin honest difficult far diluted next quest...,get much prescription launch keep going start ...,product innovation debt provision go detail en...,foreign exchange interconnect credit facility ...,november remember improved good cash margin ba...
1,prior year environmental outstanding like part...,prior year environmental outstanding like part...,granule formulation future operating back dr n...,committed guidance range 842 million helps us ...,also continued margins credit let 19 million f...,offering activities advanced managed conclude ...,prior year environmental outstanding like part...,prior year environmental outstanding like part...,granule formulation future operating back dr n...,committed guidance range 842 million helps us ...,also continued margins credit let 19 million f...,offering activities advanced managed conclude ...
2,pretty diluted mean asu million reported 2016 ...,pretty diluted mean asu million reported 2016 ...,biopharmaceutical marketing looked capsules al...,got take 12 heritage micro mainframe dividend ...,overall part gross margins optics continuing f...,higher margins quite nordic b2b relation credi...,pretty diluted mean asu million reported 2016 ...,pretty diluted mean asu million reported 2016 ...,biopharmaceutical marketing looked capsules al...,got take 12 heritage micro mainframe dividend ...,overall part gross margins optics continuing f...,higher margins quite nordic b2b relation credi...
3,really future results caution balance addition...,really future results caution balance addition...,last associated better pipeline growth opportu...,beginning hewlett packard month let organizati...,cost decrease company deployment announcements...,norriq new ended sek organic let organizations...,really future results caution balance addition...,really future results caution balance addition...,last associated better pipeline growth opportu...,beginning hewlett packard month let organizati...,cost decrease company deployment announcements...,norriq new ended sek organic let organizations...


# Sentiment Analyser

#### Count Positive/Negative words with Mcdonald Loughran lexicon

In [62]:
with open("mypickle.pickle", "rb") as f:
    fs2018 = pickle.load(f)


In [63]:
def read_list_from_file(fname):
    with open(fname) as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    return content

In [64]:
pos_list = read_list_from_file('LM_Positive.txt')
neg_list = read_list_from_file('LM_Negative.txt')

for word in pos_list:
    pos_keyword_processor.add_keyword(word)
for word in neg_list:
    neg_keyword_processor.add_keyword(word)

In [65]:
from collections import Counter

import pandas as pd
from nltk import word_tokenize

positive_words = set(pos_list)
negative_words = set(neg_list)



2 methods to count positive and negative words for each transcript.

In [70]:
def count_pos_neg(df):
    df['Tokenized'] = df['article_text'].apply(str.lower).apply(word_tokenize)
    df['WordCount'] = df['Tokenized'].apply(lambda x: Counter(x))

    df['Positive'] = df['WordCount'].apply(lambda x: sum(v for k,v in x.items() if k in positive_words))
    df['Negative'] = df['WordCount'].apply(lambda x: sum(v for k,v in x.items() if k in negative_words))
    return df

def count_pos_neg_flash(df):
    #df['Tokenized'] = df['article_text'].apply(str.lower).apply(word_tokenize)
    #df['WordCount'] = df['Tokenized'].apply(lambda x: Counter(x))

    df['Positive'] = df['article_text'].apply(lambda x: len(pos_keyword_processor.extract_keywords(x)))
    df['Negative'] = df['article_text'].apply(lambda x: len(neg_keyword_processor.extract_keywords(x)))
    
    return df



In [71]:
%time fs2018 = count_pos_neg_flash(fs2018)

Wall time: 32min 51s


I calculate the sentiment score between -1 and 1.

In [72]:
fs2018['LM Score']=(fs2018['Positive']-fs2018['Negative'])/fs2018['Positive']

# VADER

“Valence Aware Dictionary and sEntiment Reasoner” is another popular rule-based library for sentiment analysis. Like TextBlob, it uses a sentiment lexicon that contains intensity measures for each word based on human-annotated labels. A key difference however, is that VADER was designed with a focus on social media texts. This means that it puts a lot of emphasis on rules that capture the essence of text typically seen on social media — for example, short sentences with emojis, repetitive vocabulary and copious use of punctuation (such as exclamation marks). 

In my case it is not the most appropriate method but I still want to use it to check if it can give me good result anyway.

In [76]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer


[nltk_data] Downloading package vader_lexicon to C:\Users\Ilan
[nltk_data]     avraham\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [78]:
analyzer = SentimentIntensityAnalyzer()

fs2018['Vader'] = fs2018['article_text'].apply(lambda x:analyzer.polarity_scores(x)['compound'])   
    

# TextBlob

TextBlob is a popular Python library for processing textual data. It is built on top of NLTK, another popular Natural Language Processing toolbox for Python. TextBlob uses a sentiment lexicon (consisting of predefined words) to assign scores for each word, which are then averaged out using a weighted average to give an overall sentence sentiment score. Three scores: “polarity”, “subjectivity” and “intensity” are calculated for each word.

In [82]:
from textblob import TextBlob

fs2018['TextBlob'] = fs2018['article_text'].apply(str.lower).apply(lambda x:TextBlob(x).sentiment.polarity)   


In [83]:
fs2018.head()

Unnamed: 0,Indicator,Ticker,article_date,article_id,article_source,article_text,article_title,organization_name,ticker,Positive,Negative,LM Score,Vader,TextBlob
0,US,IUS.XX9,2018-01-03 15:00:00|2018-01-03 15:53:05,2026528,Q1 2018 Earnings Call - 2026528 : 904708104,"ALL TEXT IS RELEVANT Ladies gentlemen, thank s...",Q1 2018 Earnings Call,UniFirst Corp.,US:IUS.XX9,74,37,0.5,0.9999,0.13827
1,US,UNF,2018-01-03 15:00:00|2018-01-03 15:53:05,2026528,Q1 2018 Earnings Call - 2026528 : 904708104,"ALL TEXT IS RELEVANT Ladies gentlemen, thank s...",Q1 2018 Earnings Call,UniFirst Corp.,US:UNF,74,37,0.5,0.9999,0.13827
2,US,VERU,2018-01-05 13:00:00|2018-01-05 21:23:05,2029277,Q4 2017 Earnings Call - 2029277 : 92536C103,"ALL TEXT IS RELEVANT Good morning, ladies gent...",Q4 2017 Earnings Call,"Veru, Inc.",US:VERU,66,70,-0.060606,0.9999,0.085348
3,GB,MCRO,2018-01-08 09:00:00|2018-01-08 10:48:49,2021320,Q2 2018 Earnings Call - 2021320 : G6117L186,ALL TEXT IS RELEVANT Good morning everyone wel...,Q2 2018 Earnings Call,Micro Focus International Plc,GB:MCRO,107,66,0.383178,1.0,0.119395
4,CA,EXF,2018-01-09 22:00:00|2018-01-09 23:42:39,2024627,Q1 2018 Earnings Call - 2024627 : 302046107,ALL TEXT IS RELEVANT Good day welcome EXFO's F...,Q1 2018 Earnings Call,"EXFO, Inc.",CA:EXF,57,25,0.561404,0.9999,0.145621


To go further, I wanted to get the return of each ticker after each transcript (take me too long to run the code so I abandon it)
and check if there is a correlation between the sentiment score, by using supervised method of sentiment analysis where the return is the target.
We can split the return into category to make it easier.
After we use classification method (like Random forest, Neural Network..) to check the accuracy of our model.