# Prospecção de Dados (Data Mining) DI/FCUL - HA1

## First Home Assignment (MC/DI/FCUL - 2024)

### Fill in the section below

### GROUP: `02`

* João Martins, 62532 - Hours worked on the project
* Rúben Torres, 62531 - Hours worked on the project
* Nuno Pereira, 56933 - Hours worked on the project




The purpose of this Home Assignment is to:
* Read a Data file with a Set of Texts
* Compute similarities between texts
* Perform simple classification of texts using a Naive Bayes classifier

**NOTE 1: Students are not allowed to add more cells to the notebook**

**NOTE 2: The notebook must be submitted fully executed**


## 1. Read the Dataset

The dataset is the file `Sentences_75Agree.txt` from the [Financial Sentiment Analysis database on Hugging Face](https://huggingface.co/datasets/financial_phrasebank)

* Read the dataset and split it into individual documents (one document per line)
* The last word of each document is the class and it **must be removed from the document** but kept separate for use in the classification tasks below
    * classes can be `.@positive`, `.@negative`, `.@neutral`
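
A minimal sketch of the label-splitting step described above, assuming the class always follows the last `@` on the line. The helper name `split_label` and the two sample lines are hypothetical and are not taken from the dataset.

```python
def split_label(line: str, delimiter: str = "@") -> tuple[str, str]:
    """Split one raw line into (text, label) at the LAST delimiter."""
    # rpartition cuts at the final delimiter only, so an "@" that occurs
    # inside the sentence itself stays part of the text.
    # If the delimiter is missing, text is "" and label is the whole line.
    text, _, label = line.strip().rpartition(delimiter)
    return text.strip(), label


# Hypothetical sample lines, made up for illustration.
samples = [
    "Operating profit rose to EUR 13.1 mn .@positive",
    "Contact us at info@example.com for details .@neutral",
]
pairs = [split_label(s) for s in samples]
for text, label in pairs:
    print(label, "<--", text)
```

Splitting at the last delimiter rather than the first is a design choice: it keeps the document intact even when the sentence contains the delimiter character.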
    


In [97]:
FILE_NAME: str = "Sentences_75Agree.txt"
CLASS_DELIMITER: str = "@"


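def classify_document(document: str) -> dict[str, str]:
    # Split on the last delimiter only, so an "@" inside the text is kept.
    text, _, label = document.strip().rpartition(CLASS_DELIMITER)
    return {"document": text, "class": label}


# Read all lines and close the file once done.
with open(FILE_NAME, encoding="ISO-8859-15") as file:
    documents: list[str] = file.readlines()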
documents_classified: list[dict[str, str]] = [
    classify_document(document) for document in documents
]

#192 193 338

## 2. Compute similarities between texts

* Compute the TF.IDF of all words in texts
* Compute the average similarity between texts
* Plot the document similarity distribution (suggestion: use [boxplots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html), [histograms](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) or [histograms with density](https://matplotlib.org/stable/gallery/statistics/histogram_features.html))
* Comment your results
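
The steps above can be sketched on a toy corpus. This is only an illustration under stated assumptions: TF is the raw count divided by the most frequent word's count in the document, IDF is `log2(N / document_frequency)`, and the three documents are made up. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2, sqrt


def tf(words):
    """Term frequency, normalised by the most frequent word in the document."""
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {w: c / top for w, c in counts.items()}


def idf(docs):
    """Inverse document frequency: log2(N / number of docs containing the word)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: log2(n / c) for w, c in df.items()}


def tfidf_vector(words, idfs):
    return {w: v * idfs[w] for w, v in tf(words).items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a document whose words all have zero IDF
    return dot / (norm_u * norm_v)


# Made-up toy corpus, already tokenised.
docs = [
    "sales rose in the quarter".split(),
    "sales fell in the quarter".split(),
    "the company opened a new plant".split(),
]
idfs = idf(docs)
vectors = [tfidf_vector(d, idfs) for d in docs]
sims = [cosine(vectors[i], vectors[j]) for i, j in combinations(range(len(docs)), 2)]
average_similarity = sum(sims) / len(sims)
print(sims, average_similarity)
```

In this toy corpus the word "the" occurs in every document, so its IDF is zero and it contributes nothing to any similarity; the third document therefore has similarity 0 with the other two.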


In [98]:
import unicodedata
import numpy as np
from math import log2


# ========== CORPUS Start ========== #

def basic_word_tokenizer(text: str) -> list[str]:
    return text.split()

def remove_accents(text: str) -> str:
    nfkd_form: str = unicodedata.normalize("NFKD", text)
    return "".join([c for c in nfkd_form if not unicodedata.combining(c)])

def remove_stuff(text: str) -> str:
    for c in "\\\t0123456789Ææœ—‘’\ufeff{|}“”.,()$£%&[]?@#!=;*+–\"ǁ":
        text = text.replace(c, "")
    text = text.replace("-", " ")
    return text

def get_words_from_text(text: str) -> list[str]:
    text = text.strip().lower()
    text = remove_accents(text)
    text = remove_stuff(text)
    text = text.lower()
    return basic_word_tokenizer(text)

def get_words_from_corpus(corpus: list[str]) -> list[list[str]]:
    return [get_words_from_text(text) for text in corpus]

# ========== CORPUS End ========== #

# ========== TF Start ========== #

def word_counter(words: list[str]) -> dict[str, int]:
    unique_words: set[str] = set(words)
    counter: dict[str, int] = dict(zip(unique_words, [0] * len(unique_words)))
    for word in words:
        counter[word] += 1
    return counter

def TF(word_counts: dict[str, int]) -> dict[str, float]:
    counts: list[int] = list(word_counts.values())
    if len(counts) == 0:
        return {}
    counts_max: int = max(counts)
    return dict(zip(word_counts.keys(), [count / counts_max for count in counts]))

def TF_all(words_texts: list[list[str]]) -> list[dict[str, float]]:
    return [TF(word_counter(words)) for words in words_texts]

# ========== TF End ========== #

# ========== IDF Start ========== #

def calc_all_words(words_text_sets):
    all_words=set()
    for words in words_text_sets: all_words |= words
    return all_words

def IDF(all_words, doc_word_counts):
    #first initialize a new dictionary with one entry for each word
    D=dict(zip(all_words, [0]*len(all_words)))
    N=len(doc_word_counts)
    for doc in doc_word_counts:
        for word in doc: D[word]+=1
    return {w: log2(N/D[w]) for w in D}

# ========== IDF End ========== #

# ========== TF-IDF Start ========== #

def cosine_similarity_tfidf(idx1, idx2, words_text_sets, all_tfs, idfs):
    text1= words_text_sets[idx1]
    text2= words_text_sets[idx2]
    tfs1=all_tfs[idx1]
    tfs2=all_tfs[idx2]

    common_words = text1 & text2
    if len(common_words)==0: return 0.0
    common_tfidfs = [tfs1[w]*tfs2[w]*idfs[w]*idfs[w] for w in common_words]

    #squared tfidfs
    tfidfs2_1=np.array([tfs1[w]*idfs[w] for w in text1])**2
    tfidfs2_2=np.array([tfs2[w]*idfs[w] for w in text2])**2

    norm = np.sqrt(tfidfs2_1.sum()) * np.sqrt(tfidfs2_2.sum())
    # Guard against documents whose words all have zero IDF.
    return sum(common_tfidfs) / norm if norm > 0 else 0.0

def text_similarities2(words_text_sets, all_tfs, idfs):
    N=len(words_text_sets)
    sims=[]
    for i in range(N-1):
        for j in range(i+1, N):
            sim = cosine_similarity_tfidf(i,j, words_text_sets, all_tfs, idfs)
            sims.append((sim, (i,j)))
    return sims

# ========== TF-IDF End ========== #

In [99]:
corpus: list[str] = [
    document_classified["document"] for document_classified in documents_classified
]
words_texts: list[list[str]] = get_words_from_corpus(corpus)
all_tfs: list[dict[str, float]] = TF_all(words_texts)

words_texts_sets: list[set[str]] = [set(words) for words in words_texts]

all_words = calc_all_words(words_texts_sets)
idfs = IDF(all_words, words_texts_sets)

# ========= Problems with dataset ========== #

#print(corpus)
#print()
#print(words_texts_sets)
#print()
class_encoding = {'negative': 0, 'neutral': 1, 'positive': 2}

# Extract the class of each document and encode it as an integer label.
labels: list[str] = [doc["class"] for doc in documents_classified]
y_corpus: list[int] = [class_encoding[label] for label in labels]
x_corpus = words_texts_sets
#print(y_corpus)
#print()

# sims = text_similarities2(words_texts_sets, all_tfs, idfs)
# sims = sorted(sims, reverse=True)
# sims
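
A possible way to summarise and plot the similarity distribution is sketched below. It assumes the similarities are available as a flat list of floats; since `text_similarities2` returns `(sim, (i, j))` tuples, the values would first be extracted with something like `[s for s, _ in sims]`. The helper names and the example values are hypothetical, and matplotlib is imported only inside the plotting function.

```python
import numpy as np


def summarize_similarities(sims):
    """Return mean, quartiles and the share of exactly-zero similarities."""
    values = np.asarray(sims, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "mean": float(values.mean()),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "zero_fraction": float(np.mean(values == 0.0)),
    }


def plot_similarity_histogram(sims, bins=50):
    """Draw a density histogram of the similarities and return the figure."""
    import matplotlib.pyplot as plt  # imported lazily, only when plotting

    fig, ax = plt.subplots()
    ax.hist(sims, bins=bins, density=True)
    ax.set_xlabel("cosine similarity")
    ax.set_ylabel("density")
    ax.set_title("Pairwise document similarity")
    return fig


# Made-up values, for illustration only.
example_sims = [0.0, 0.0, 0.1, 0.3]
summary = summarize_similarities(example_sims)
print(summary)
```

Because the number of document pairs grows quadratically with the corpus size, reporting the share of zero similarities alongside the mean can help when reading the histogram.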



### Your short analysis here

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum


## 3. Make a Naive Bayes Classifier 

* Split dataset randomly into training and testing (20% for testing)
* Train a Naive Bayes Model and do some sensitivity analysis on the hyperparameters
* Evaluate your results with the testing set
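
The steps above can be sketched as follows. This is not the notebook's own classifier: it swaps in a presence-only Naive Bayes variant with additive smoothing on per-class document counts, and it sums log-probabilities to avoid numerical underflow. All function names, the toy documents and the `alpha` values are assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def train_test_split(X, y, test_ratio=0.2, seed=0):
    """Shuffle indices with a fixed seed and hold out test_ratio for testing."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = int(len(X) * test_ratio)
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def fit_counts(docs, labels):
    """Count documents per class and, per class, documents containing each word."""
    class_docs = Counter(labels)
    word_docs = defaultdict(Counter)
    for words, label in zip(docs, labels):
        word_docs[label].update(set(words))
    vocab = set().union(*docs)
    return class_docs, word_docs, vocab


def predict(words, class_docs, word_docs, vocab, alpha=1.0):
    """Pick the class with the highest log-posterior (alpha must be > 0)."""
    total = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_c in class_docs.items():
        score = math.log(n_c / total)  # log prior
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_docs[label][w] + alpha) / (n_c + 2 * alpha))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy documents (sets of words) with labels 0 = negative, 2 = positive.
docs = [{"profit", "rose"}, {"sales", "rose"}, {"profit", "fell"},
        {"sales", "fell"}, {"loss", "fell"}, {"profit", "up", "rose"}]
labels = [2, 2, 0, 0, 0, 2]
model = fit_counts(docs, labels)

# Sensitivity to alpha, measured on the training data because the toy set is tiny.
accuracy_by_alpha = {}
for alpha in (0.01, 0.1, 1.0):
    preds = [predict(d, *model, alpha=alpha) for d in docs]
    accuracy_by_alpha[alpha] = sum(p == t for p, t in zip(preds, labels)) / len(labels)
print(accuracy_by_alpha)
```

On the real dataset the same loop would be run on the held-out 20% returned by the split, so that the effect of `alpha` is measured on documents the model has not seen.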


In [100]:
### Add supporting functions here
from sklearn.metrics import f1_score, matthews_corrcoef, confusion_matrix, precision_score, recall_score
from random import sample, shuffle

def PD_NBa(X, y, alpha=0):
    #X is a 2-D array of categorical features with one row per sample
    N, M=X.shape

    #1. first compute the priors
    yv, yc=np.unique(y, return_counts=True)
    priors={yv[i]: yc[i]/sum(yc) for i in range(len(yv)) }
    #2. now the likelihoods
    #       the L_hoods data structure can be read as
    #           L_hoods[(i, "B", "A")]       =>       P(Xi = B | Y = A)
    #first we initialize every entry to alpha
    L_hoods={}
    for j in range(M):
        xs=np.unique(X[:,j])
        for v in xs:
            for yi in yv: L_hoods[(j, v,  yi)]=alpha
    #now we fill in the values for each row in our dataset, updating the corresponding counts for each class
    #for each output class
    for yi in yv:
        X_c = X[y==yi]
        #...now search each individual X column
        for j in range(M):
            col = X_c[:,j]
            vs, cs = np.unique(col, return_counts=True) # The Evil part of this implementation xD
            # ...for each possible value now we divide it all
            for i, v in enumerate(vs):
                L_hoods[(j, v,  yi)] += cs[i]/np.sum(cs)
    

    return priors, L_hoods


def make_train_test(X, y, r=0.2):
    N=len(X)
    test_idx  = set(sample(range(N), int(N*r)))
    train_idx = list(set(range(N)) - test_idx)
    test_idx =list(test_idx)
    # shuffle(train_idx)
    # shuffle(test_idx)
    train_set_X = [X[i] for i in train_idx]
    train_set_y = [y[i] for i in train_idx]
    test_set_X = [X[i] for i in test_idx]
    test_set_y = [y[i] for i in test_idx]
    return train_set_X, train_set_y, test_set_X, test_set_y

def calc_prior_counts(labels):
    yv, yc=np.unique(labels, return_counts=True)
    priors=np.ones(len(yv))
    priors[yv]=yc
    return priors

def calc_all_words(words_text_sets):
    #joins all the words from lists of words in documents to get a unique set of all the words
    all_words=set()
    for words in words_text_sets: 
        all_words |= words
    return all_words


def init_likelihood_counts(docs_words, n_labels):
    all_words=calc_all_words(docs_words)
    L_hoods={}
    for w in all_words:
        L_hoods[w]=np.zeros(n_labels)
    return L_hoods

def update_likelihood_counts(L_hoods, words, label):
    for word in words: 
        L_hoods[word][label]+=1

def calc_likelihood_counts(docs_words, labels):
    n_labels=len(set(labels)) # assumes labels are encoded 0..n_labels-1 and every class occurs in `labels`
    L_hoods = init_likelihood_counts(docs_words, n_labels)
    for i, words in enumerate(docs_words): 
        update_likelihood_counts(L_hoods, words, labels[i])
    return L_hoods

def classify_new_document(words, priors, L_hoods, alpha=0):
    res=priors/priors.sum()
    alpha_vec=np.ones(len(res))*alpha
    for word in words:
        if word in L_hoods: 
            res*=(L_hoods[word]/L_hoods[word].sum() + alpha_vec)
    return res/res.sum()

def classify_documents(docs, priors, L_hoods, alpha=0):
    return [classify_new_document(words, priors, L_hoods, alpha).argmax(axis=0) for words in docs]


In [101]:
### Add processing code here
#sims
x_train, y_train, x_test, y_test = make_train_test(x_corpus, y_corpus, r=0.2)
# print((x_train))
# print(len(y_train))

PC = calc_prior_counts(y_train)
LHC= calc_likelihood_counts(x_train, y_train)

preds=classify_documents(x_test, PC, LHC, alpha=0.0001)

for i, words in enumerate(x_test[:10]):
    print(i, "--", preds[i], "<--", words)

# print((y_test))
# print(preds)

print("The F1 score is: %7.4f" % f1_score(y_test, preds, average='micro'))
print("The MCC score is: %7.4f" % matthews_corrcoef(y_test, preds))
print("The precision score is: %7.4f" %  precision_score(y_test, preds, average='micro'))
print("The recall score is: %7.4f" %  recall_score(y_test, preds, average='micro'))

print(confusion_matrix(y_test, preds))

0 -- 2 <-- {'rose', 'to', 'from', 'of', 'net', 'eur', 'corresponding', 'mn', 'sales', 'operating', 'the', 'representing', 'profit', 'period', 'in'}
1 -- 1 <-- {'of', 'asphalt', 'mix', 'will', 'in', 'the', 'contract', 'be', 'than', 'more', 'tonnes', 'used'}
2 -- 1 <-- {'payment', 'to', 'divested', 'received', 'limited', 'its', 'part', 'of', 'from', 'sappi', 'graphic', 'metsnliitto', 'a', 'at', 'papers', 'million', 'stock', 'has', 'cash', 'real', 'august', 'end', 'release', 'm', 'exchange', 'for', 'eur', 'group', 'corporation', 'pm', 'business', 'the'}
3 -- 1 <-- {'to', 'quarter', 'helsinki', 'said', 'plans', 'and', "'", 'g', 'its', 'led', 'with', 'afx', 'earnings', 'announced', 'fourth', 'team', 'nokian', 'expectations', 'closed', 'beat', 'manufacture', 'nokia', 'shares', 'up', 'it', 'higher', 'tyres', 'sanyo', 'report', 'by', 'after', 'handsets', 'analysts', 'dealers'}
4 -- 1 <-- {'third', 'results', 'october', 'quarter', 'approximately', 'neste', 'oil', 'friday', 'publish', 'am', 'wil

## 4. Discuss your findings [to fill on your own]

* Comment your results above
* Discuss how they could be used in a Big Data environment


Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum
