# Probability and Naive Bayes
In this tutorial, we cover some basic probability, build a baseline financial sentiment classifier from scratch using Bayes' rule, and then try to beat some deep learning (DL) models with NB-SVM!
- Companion Slides: https://docs.google.com/presentation/d/1S_hFtayisDzBVVwG-CAa4rRVquIq0Ax3BCU5oUbMrzU/edit?usp=sharing
- Companion Excel: https://docs.google.com/spreadsheets/d/1ha-kv2gV1OwBhVWpdg42Fcu7NMlYdl_rvPAljZqwYeU/edit?usp=sharing

![Bayes](https://i1.wp.com/www.rensvandeschoot.com/wp-content/uploads/2017/09/bayes-theorem.png?ssl=1)

In [None]:
import numpy as np 
import pandas as pd 
pd.set_option('display.max_colwidth', None)

In [None]:
df = pd.read_csv("/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv", encoding='latin-1', header = None)
df.columns = ["Sentiment", "Headline"]
df.sample(5)

## Frequentist Approach To Probability - Probability as the long run frequency of events

In [None]:
df.Sentiment.hist()

In [None]:
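#normalize=True returns relative frequencies; naming the Series explicitly keeps the column name stable across pandas versions
sent=df.Sentiment.value_counts(normalize=True).rename('Probability').to_frame()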
sent.style.format({'Probability':'{:.0%}'})

The long-run frequencies converge to the probabilities.

In [None]:
labels=['neutral','negative','positive']
d={label:0 for label in labels}
rows=[] #DataFrame.append was removed in pandas 2.0, so collect the rows in a list instead
for _ in range(200):
    tmp=df.sample(10)
    for label,n in tmp.Sentiment.value_counts().items(): d[label]+=n #accumulate class counts
    cumFreq=sum(d.values())
    rows.append([d[label]/cumFreq for label in labels]) #running relative frequencies
freq=pd.DataFrame(rows,columns=labels)
freq.plot()
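The same convergence can be checked without the dataset. The sketch below is only an illustration: the "true" probabilities in `true_probs` are made up, and the draws come from Python's `random` module rather than from the headlines.

```python
import random

# Hypothetical "true" class probabilities, chosen only for illustration.
true_probs = {"neutral": 0.6, "positive": 0.3, "negative": 0.1}

rng = random.Random(0)  # fixed seed so the run is repeatable
labels = list(true_probs)
weights = [true_probs[label] for label in labels]

counts = {label: 0 for label in labels}
checkpoints = {}  # number of draws -> observed relative frequencies

for n in range(1, 100_001):
    drawn = rng.choices(labels, weights=weights)[0]
    counts[drawn] += 1
    if n in (10, 1_000, 100_000):
        checkpoints[n] = {label: counts[label] / n for label in labels}

for n, freqs in checkpoints.items():
    worst_gap = max(abs(freqs[label] - true_probs[label]) for label in labels)
    print(f"{n:>7} draws: largest gap from the true probability = {worst_gap:.3f}")
```

With only a few draws the observed frequencies can be far off; after many draws they should sit close to the probabilities used to generate them.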

## Joint Probability - the intersection of 2 events
1. What is the probability that a sentence contains "profit" and its sentiment is positive?
1. What is the probability that a sentence contains "loss" and its sentiment is negative?

![intersection](https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Venn0001.svg/2560px-Venn0001.svg.png)

In [None]:
#1
probProfit           =df[(df.Headline.str.contains('profit'))].shape[0]/len(df)
probPositive         =df[((df.Sentiment=='positive'))].shape[0]/len(df)
probProfitANDPositive=df[(df.Headline.str.contains('profit'))&(df.Sentiment=='positive')].shape[0]/len(df)

print(f'Probability of Profits is {probProfit:.0%}')
print(f'Probability of Positive Sentiment is {probPositive:.0%}')
print(f'Probability of Profits and Positive Sentiment is {probProfitANDPositive:.0%}')
df[(df.Headline.str.contains('profit'))&(df.Sentiment=='positive')].sample(2)

In [None]:
#2
probLoss           =df[(df.Headline.str.contains('loss'))].shape[0]/len(df)
probNegative       =df[((df.Sentiment=='negative'))].shape[0]/len(df)
probLossANDNegative=df[(df.Headline.str.contains('loss'))&(df.Sentiment=='negative')].shape[0]/len(df)

print(f'Probability of Loss is {probLoss:.0%}')
print(f'Probability of Negative Sentiment is {probNegative:.0%}')
print(f'Probability of Loss and Negative Sentiment is {probLossANDNegative:.0%}')
df[(df.Headline.str.contains('loss'))&(df.Sentiment=='negative')].sample(2)
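To see the joint probability on something small enough to check by hand, here is a sketch on a made-up five-sentence corpus (the sentences and labels are hypothetical, not rows from the dataset). It also shows that the joint probability is generally not the product of the two individual probabilities; that only holds when the events are independent.

```python
# Hypothetical toy corpus of (sentiment, headline) pairs, invented for illustration.
corpus = [
    ("positive", "operating profit rose"),
    ("positive", "net sales increased"),
    ("negative", "the company reported a loss"),
    ("negative", "operating profit fell"),
    ("neutral", "the report will be published in May"),
]

n = len(corpus)
has_profit = ["profit" in text for _, text in corpus]
is_positive = [label == "positive" for label, _ in corpus]

p_profit = sum(has_profit) / n      # P(contains "profit")
p_positive = sum(is_positive) / n   # P(positive)
# Joint probability: both events must hold for the same sentence.
p_joint = sum(a and b for a, b in zip(has_profit, is_positive)) / n

print(f"P(profit) = {p_profit:.2f}, P(positive) = {p_positive:.2f}")
print(f"P(profit AND positive) = {p_joint:.2f}")
print(f"P(profit) * P(positive) = {p_profit * p_positive:.2f}")
```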

## Conditional Probability - Probability of event A happening given event B has happened
What is the probability that a sentence contains "profit", given that its sentiment is positive?

First way: apply the formula directly.

In [None]:
#first approach - math P(A|B)=P(A,B)/P(B)
f'P(Sentence containing "profit" | Positive Sentiment) is {probProfitANDPositive / probPositive:0.0%}.'

Second way: given that the sentence is positive, our "world" shrinks to only the positive sentences. To calculate the probability of "profit" in that world, divide the number of sentences that are both positive and contain "profit" by the number of positive sentences.

In [None]:
#2nd approach - denominator is the number of positive sentences; numerator is the number of sentences that are both positive and contain "profit"
numPositive         =df[((df.Sentiment=='positive'))].shape[0]
numProfitANDPositive=df[(df.Headline.str.contains('profit'))&(df.Sentiment=='positive')].shape[0]

f'P(Sentence containing "profit" | Positive Sentiment) is {numProfitANDPositive / numPositive:0.0%}.'
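The two approaches agree because the `len(df)` terms cancel. The sketch below checks that with made-up counts (all four numbers are hypothetical) and then takes one more step, using Bayes' rule to flip the condition: P(positive | profit) = P(profit | positive) · P(positive) / P(profit).

```python
import math

# Hypothetical counts, chosen only so the arithmetic is easy to follow.
n_total = 100      # all sentences
n_positive = 40    # positive sentences
n_profit = 25      # sentences containing "profit"
n_both = 20        # positive sentences containing "profit"

p_positive = n_positive / n_total
p_profit = n_profit / n_total
p_joint = n_both / n_total

# Approach 1: the formula P(A|B) = P(A,B) / P(B).
via_formula = p_joint / p_positive
# Approach 2: restrict the "world" to positive sentences and count.
via_counts = n_both / n_positive

# Bayes' rule flips the condition.
p_positive_given_profit = via_formula * p_positive / p_profit

print(f"P(profit | positive) = {via_formula:.0%} (formula), {via_counts:.0%} (counts)")
print(f"P(positive | profit) = {p_positive_given_profit:.0%}")
```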

## A simple walkthrough of Naive Bayes and Count Ratios
Refer to the Google Sheets link: https://docs.google.com/spreadsheets/d/1ha-kv2gV1OwBhVWpdg42Fcu7NMlYdl_rvPAljZqwYeU/edit?usp=sharing

In [None]:
#Using a Simple Sentence and its variants as a toy example
df.loc[[965]]

## Implementing the Excel walkthrough in Python

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer()

In [None]:
sentences=['Productional situation has now improved.',
                            'Productional situation has now deproved.',
                            'the situation has improved.']
X=vectorizer.fit_transform(sentences)
X.toarray()

In [None]:
y=np.array(['positive','negative','positive'])

In [None]:
vocab=list(vectorizer.get_feature_names_out()) #get_feature_names() was removed in scikit-learn 1.2
vocab

In [None]:
pd.DataFrame(X.toarray(),columns=vocab,index=sentences)

In [None]:
tst_term_doc=vectorizer.transform(['the situation deproved,deproved!'])
tst_term_doc.toarray()

In [None]:
import numpy as np
b=np.log((2/3)/(1/3)) #we know that 2 out of 3 documents are positive and 1 out of 3 documents is negative

f'Log of prior probability ratio is {b:.1f}'

In [None]:
posProb=(X[y=='positive'].sum(0)+1)/(X[y=='positive'].sum(0).sum()+len(vocab))
negProb=(X[y=='negative'].sum(0)+1)/(X[y=='negative'].sum(0).sum()+len(vocab))

#convert matrix to a numpy array
posProb=np.squeeze(np.asarray(posProb)) 
negProb=np.squeeze(np.asarray(negProb))

In [None]:
R = np.log(posProb/negProb)

In [None]:
preds=tst_term_doc@R+b
print(preds)
preds=np.array(['negative' if x <0 else 'positive' for x in preds])
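For readers who want to trace every number, here is a sketch of the same toy example in plain Python. It assumes the tokenization matches `CountVectorizer`'s default (lowercase, tokens of two or more word characters); the helper names `tokenize` and `smoothed_probs` are made up for this illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

train = [
    ("Productional situation has now improved.", "positive"),
    ("Productional situation has now deproved.", "negative"),
    ("the situation has improved.", "positive"),
]
vocab = sorted({tok for text, _ in train for tok in tokenize(text)})

def smoothed_probs(label):
    # (word count in class + 1) / (total words in class + vocabulary size)
    counts = Counter(tok for text, y in train if y == label for tok in tokenize(text))
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

pos = smoothed_probs("positive")
neg = smoothed_probs("negative")
log_ratio = {w: math.log(pos[w] / neg[w]) for w in vocab}

n_pos = sum(y == "positive" for _, y in train)
n_neg = len(train) - n_pos
b = math.log(n_pos / n_neg)  # log of the prior ratio

test_tokens = tokenize("the situation deproved,deproved!")
# Each occurrence of a known word adds its log-count ratio to the score.
score = b + sum(log_ratio[t] for t in test_tokens if t in log_ratio)
prediction = "positive" if score >= 0 else "negative"
print(f"score = {score:.3f} -> {prediction}")
```

The word "deproved" appears twice in the test sentence, so its negative log-count ratio is counted twice and outweighs the positive prior.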

## Now let's do the same thing, but for the entire dataset. 
Let's exclude neutral sentences to keep things simple and binary.

In [None]:
df=df[df.Sentiment!='neutral']

In [None]:
from sklearn.model_selection import train_test_split
trn_x, tst_x, trn_y, tst_y = train_test_split(df.Headline, df.Sentiment, test_size=0.2, random_state=42)

In [None]:
p=(trn_y=='positive').sum()/len(trn_y) #prior for positive
q=(trn_y=='negative').sum()/len(trn_y) #prior for negative

f'Prior probabilities for positive and negative classes are {p:.0%} and {q:.0%}'

In [None]:
import numpy as np
b=np.log(p/q)

f'Log of prior probability ratio is {b:.1f}'

In [None]:
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer()
trn_term_doc= vectorizer.fit_transform(trn_x)
tst_term_doc= vectorizer.transform(tst_x)

What happens if there is a word in `tst_x` that didn't occur in `trn_x`?
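One way to reason about this: a fitted vocabulary is a fixed word-to-column mapping, so a word that never appeared during fitting has no column and is dropped at transform time. `CountVectorizer.transform` behaves this way; the sketch below is a simplified plain-Python stand-in (the functions `fit_vocab` and `transform` are hypothetical helpers, not scikit-learn code).

```python
import re

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocab(texts):
    # Map each distinct training token to a column index (alphabetical order).
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(text, vocab):
    # Count only tokens that have a column; unseen tokens are silently ignored.
    row = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            row[vocab[tok]] += 1
    return row

vocab = fit_vocab(["profit rose", "profit fell"])
row = transform("profit plunged", vocab)  # "plunged" was never seen
print(vocab)
print(row)
```

The unseen word contributes nothing to the row, so it cannot influence the prediction either way.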

In [None]:
vocab=list(vectorizer.get_feature_names_out()) #get_feature_names() was removed in scikit-learn 1.2
vocab[500:510]

Qn: How many documents and terms are there?

In [None]:
trn_term_doc

In [None]:
len(vectorizer.vocabulary_)

In [None]:
#type code here

Qn: How often does the word "the" appear in the phrase below? Confirm that the correct counts are stored in the term-document matrix.
> - **Tip1:** To find the position of an element(x) in a list: LIST.index(x)
> - **Tip2:** Convert a matrix to a numpy array for easier indexing: ARRAY = np.squeeze(np.asarray(MATRIX)) 

In [None]:
trn_x.sample(1)

In [None]:
type(vectorizer.vocabulary_)

In [None]:
vectorizer.vocabulary_['the']

In [None]:
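trn_term_doc[1,vectorizer.vocabulary_['the']] #count of "the" in the second training document; look the column up instead of hard-coding it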

In [None]:
#type code here

In [None]:
X=trn_term_doc
y=np.array(trn_y)

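#smoothed word probabilities per class: (word count + 1) / (total word count in class + vocabulary size), as in the toy example
posProb=(X[y=='positive'].sum(axis=0)+1)/(X[y=='positive'].sum()+len(vocab))
negProb=(X[y=='negative'].sum(axis=0)+1)/(X[y=='negative'].sum()+len(vocab))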

#convert matrix to a numpy array
posProb=np.squeeze(np.asarray(posProb)) 
negProb=np.squeeze(np.asarray(negProb))

Qn: Why do we add 1 to the numerator?

In [None]:
#answer
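A hint for the question above: consider a word that never appears in one of the classes. The sketch below uses made-up counts (all numbers are hypothetical) to show what happens to the log-count ratio with and without the added 1, which is commonly called Laplace or add-one smoothing.

```python
import math

# Hypothetical counts for a single word.
vocab_size = 7
pos_count, pos_total = 0, 9   # the word never appears in positive documents
neg_count, neg_total = 1, 5   # it appears once in negative documents

# Without smoothing the positive probability is exactly zero,
# and the logarithm of a zero ratio is undefined.
unsmoothed_pos = pos_count / pos_total
unsmoothed_neg = neg_count / neg_total
try:
    math.log(unsmoothed_pos / unsmoothed_neg)
    unsmoothed_failed = False
except ValueError:
    unsmoothed_failed = True

# With add-one smoothing every word keeps a small non-zero probability.
smoothed_pos = (pos_count + 1) / (pos_total + vocab_size)
smoothed_neg = (neg_count + 1) / (neg_total + vocab_size)
smoothed_log_ratio = math.log(smoothed_pos / smoothed_neg)

print(f"unsmoothed log ratio failed: {unsmoothed_failed}")
print(f"smoothed log ratio: {smoothed_log_ratio:.3f}")
```

Adding the vocabulary size to the denominator keeps the smoothed values summing to 1 across the vocabulary.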

In [None]:
R = np.log(posProb/negProb)

Question: Which class is more probable for the words 'fell' and 'rose'?
> **Tip:** To find the position of an element(x) in a list, use list.index(x)

In [None]:
vectorizer.vocabulary_['fell']

In [None]:
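R[vectorizer.vocabulary_['fell']] #negative values favour the negative class, positive values the positive class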

In [None]:
preds=tst_term_doc @ R+b
preds=np.array(['negative' if x <0 else 'positive' for x in preds])

In [None]:
(preds[:4]==tst_y[:4]).mean()

In [None]:
acc=(preds==tst_y).mean()
f'Accuracy using NB is {acc:.1%}'

## NB-SVM
Can we do better than a standard Naive Bayes?
Yes, by:
1. Binarizing the counts
1. Adding bi-grams (or maybe even tri-grams!)
1. Using the NB features as input to an SVM

#### Source: https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf

Changing CountVectorizer to be binary and take in bi/tri grams

In [None]:
vectorizer = CountVectorizer(binary=True,ngram_range=(1,3)) #using unigrams, bigrams and trigrams and binarize
X=trn_term_doc_bin_ngram= vectorizer.fit_transform(trn_x)
tst_term_doc_bin_ngram= vectorizer.transform(tst_x)
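To make "binarize" and "n-grams" concrete, here is a plain-Python sketch on one made-up sentence. The `ngrams` helper is hypothetical and only approximates what `CountVectorizer` does internally.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    # All contiguous word sequences of length n_min..n_max, joined by spaces.
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

tokens = "profit rose and rose again".split()
grams = ngrams(tokens)

counts = Counter(grams)           # raw counts: "rose" appears twice
binary = {g: 1 for g in counts}   # binarized: only presence or absence is kept

print(counts["rose"], binary["rose"])
print(sorted(g for g in binary if " " in g))
```

Binarizing drops the information that "rose" occurred twice, while the n-grams add short phrases such as "rose again" as features of their own.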

In [None]:
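#compute the two smoothed probabilities separately: a/b/c/d evaluates as a/(b*c*d), not (a/b)/(c/d)
vocab_size=len(vectorizer.vocabulary_) #size of the new n-gram vocabulary, not the earlier unigram one
posProb=(X[y=='positive'].sum(axis=0)+1)/(X[y=='positive'].sum()+vocab_size)
negProb=(X[y=='negative'].sum(axis=0)+1)/(X[y=='negative'].sum()+vocab_size)
R = np.squeeze(np.asarray(np.log(posProb/negProb)))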

In [None]:
from sklearn.svm import LinearSVC #for this tutorial, consider this as a blackbox machine learning model

Naive Bayes as input features

In [None]:
x_nb=trn_term_doc_bin_ngram.multiply(R)
m = LinearSVC().fit(x_nb, y)
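What the `multiply(R)` step does can be seen on a single document: each binary feature is scaled by that feature's log-count ratio, so the SVM sees present n-grams weighted by how strongly they lean positive or negative. The feature names and ratio values below are hypothetical.

```python
import math

# Hypothetical log-count ratios for three features.
features = ["profit", "loss", "the"]
log_ratio = {"profit": math.log(3.0), "loss": math.log(1 / 3), "the": 0.0}

# Binary indicator row for one document containing "profit" and "the".
doc_binary = [1, 0, 1]

# Element-wise product: absent features stay 0, present ones take their ratio.
doc_nb = [x * log_ratio[w] for x, w in zip(doc_binary, features)]
print(doc_nb)
```

A word like "the", which is about equally likely in both classes, contributes a value near zero even when present.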

In [None]:
acc_nbsvm=m.score(tst_term_doc_bin_ngram.multiply(R),tst_y)
f'Accuracy using NBSVM is {acc_nbsvm:.3%}, {acc_nbsvm-acc:.1%} better than NB alone!'

Exercise: Build an NB model from scratch using bigrams or 4-grams. Does it do better?
> For `CountVectorizer` parameters and to tweak them, refer to https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html 

In [None]:
#type code here

# Beating LSTM, LPS and HSC with NBSVM
Without any fancy architecture or pre-processing, NBSVM beats LSTMs, which are deep learning models!
- https://arxiv.org/pdf/1908.10063.pdf - refer to Table 2 in finBERT paper for comparison

In [None]:
!pip install -qq nbsvm

from nbsvm import NBSVM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

In [None]:
df = pd.read_csv("/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv", encoding='latin-1', header = None)
df.columns = ["Sentiment", "Headline"]

clf = make_pipeline(CountVectorizer(binary=True), NBSVM())
scores = cross_val_score(clf, df.Headline, df.Sentiment, cv=10)
f'Accuracy on all data is slightly higher than LSTM, LPS and HSC at {np.mean(scores):.0%}!'

In [None]:
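#a multi-character sep is treated as a regular expression and needs the python engine; setting it explicitly avoids a ParserWarning
df=pd.read_csv('../input/sentiment-analysis-for-financial-news/FinancialPhraseBank/Sentences_AllAgree.txt', encoding='latin-1', header = None,sep='.@',engine='python')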
df.columns = ["Headline","Sentiment"]

clf = make_pipeline(CountVectorizer(binary=True), NBSVM())
scores = cross_val_score(clf, df.Headline, df.Sentiment, cv=10)

f'Accuracy on 100% agreement data is much higher than LSTM, LPS and HSC at {np.mean(scores):.0%}!'