# Seedtag codetest: NLP Researcher

## Part 3. Message-matcher baseline model
This notebook contains a baseline message-matcher model. Given a query message and a corpus of historical messages, the matcher retrieves every historical message that is similar to the query. Your goal is to improve this model.
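
To make the retrieval step concrete, here is a minimal sketch of the idea in plain Python: weight terms by TF-IDF, score each corpus message against the query with cosine similarity, and keep the messages that reach a threshold. It is a simplified stand-in, not the baseline itself. The helper names (`tokenize`, `tfidf_vectors`, `cosine`, `match`) and the toy messages are made up for illustration, and the sketch omits the `min_df`/`max_df` vocabulary pruning that the baseline's scikit-learn vectorizers apply.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs; a deliberately crude tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    # Raw term counts weighted by a smoothed IDF, ln((1 + n) / (1 + df)) + 1.
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match(query, corpus, threshold=0.2):
    # Vectorize the query together with the corpus, then keep corpus messages
    # whose score reaches the threshold. Scores of 0.999 or more are skipped,
    # mirroring the baseline's exclusion of (near-)identical texts.
    vectors = tfidf_vectors([query] + corpus)
    scored = [(doc, round(cosine(vectors[0], vec), 3)) for doc, vec in zip(corpus, vectors[1:])]
    hits = [(doc, score) for doc, score in scored if threshold <= score < 0.999]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for this example.
toy_corpus = [
    "Pilots black out long before 45g accelerations.",
    "Solar sails gain speed by passing close to the sun.",
    "Are high accelerations tolerable for humans?",
]
results = match("Are 45g accelerations humanly tolerable?", toy_corpus)
for doc, score in results:
    print(score, doc)
```

On this toy input only the two acceleration-related messages reach the 0.2 threshold, with the message that shares more terms with the query ranked first.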

In [10]:
import os
from hashlib import md5
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Initialize resources
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

### 0. Auxiliary Functions

In [2]:
def create_df(path, tag):
    '''
    Creates a data frame for a given class
    --------------------------------------
    Input:
        path (str): path where all classes folders are stored.
        tag (str): name of the folder containing class "tag".
    Output:
        df (pd.DataFrame): dataframe with file as index and columns=[Text, tag]
    '''
    list_of_text = []
    tag_dir = os.path.join(path, tag)
    for file in os.listdir(tag_dir):
        with open(os.path.join(tag_dir, file), encoding="utf-8", errors="ignore") as f:
            list_of_text.append((f.read(), file))
    # Build the frame once, after the loop, so an empty folder still yields a valid (empty) frame
    df = pd.DataFrame(list_of_text, columns=['Text', 'file']).set_index('file')
    df['tag'] = tag
    return df


def get_all_dfs(path, tags):
    '''
    Loops over all classes in path, each in the corresponding folder
    --------------------------------
    Input:
        path (str): path where all classes folders are stored.
        tags (list): list of classes names.
    Output:
        df (pd.DataFrame): pandas dataframe with the dataframes corresponding to all classes concatenated.
    '''
    list_of_dfs = []
    for tag in tags:

        df = create_df(path, tag)
        list_of_dfs.append(df)
    data = pd.concat(list_of_dfs)
    return data


def to_md5(rsc_id: str) -> str:
    """
    Convert the rsc_id string into an md5 hexdigest.
    :param rsc_id: str.
    :return: hexdigest representation of the md5 hash of the input string.
    """
    md5_rsc = bytes(rsc_id, 'utf-8')
    result_1 = md5(md5_rsc)
    return result_1.hexdigest()


def get_similarity(resources: pd.DataFrame, space: str = 'tfidf', max_df: float = .75) -> np.array:
    """
    Compute pairwise cosine similarity for resources in a given vector representation (tf or tfidf).
    :param resources: pd.DataFrame with the resources as rows and at least 'Text' as column.
    :param space: vector space representation of resources, either 'tf' or 'tfidf'.
    :param max_df: maximum value for document frequency, just as in sklearn Vectorizers.
    :return: symmetric np.array with cosine similarity score for each resource pair.
    """
    if space == 'tf':
        vec = CountVectorizer(min_df=2, max_df=max_df)
    elif space == 'tfidf':
        vec = TfidfVectorizer(min_df=2, max_df=max_df)
    else:
        print('The "space" input must be either "tf" or "tfidf", using the default "tfidf" option...')
        vec = TfidfVectorizer(min_df=2, max_df=max_df)
    vec_res = vec.fit_transform(resources['Text'].fillna(''))
    sims = cosine_similarity(vec_res, vec_res)
    return sims


def find_similar_rsc(similarity_scores: np.array, threshold: float) -> pd.DataFrame:
    """
    Get the resource pairs whose similarity score reaches the threshold.
    :param similarity_scores: matrix of similarity scores per pair of resources, of shape
    (number of resources, number of resources).
    :param threshold: minimum similarity score for a resource to be retrieved as similar.
    :return: a pd.DataFrame with 'resource_idx', 'similar_res_idx' and 'similarity_score' as columns, one row per
    similar pair. Pairs scoring 0.999 or above (self-matches and near-identical texts) are excluded.
    """
    similar_rsc_idx = np.where((similarity_scores >= threshold) & (similarity_scores < 0.999))
    similar_scores = np.round(similarity_scores[similar_rsc_idx], 3)
    sim_res = pd.DataFrame({'resource_idx': similar_rsc_idx[0],
                            'similar_res_idx': similar_rsc_idx[1],
                            'similarity_score': similar_scores})
    return sim_res


def get_similar_rsc(resources: pd.DataFrame, threshold: float = 0.75, space: str = 'tfidf') -> dict:
    """
    Get similar resources per resource.
    :param resources: pd.DataFrame with the resources as rows and at least 'Text' as column.
    :param threshold: the similarity score threshold for retrieving as similar resource.
    :param space: vector space representation of resources, either 'tf' or 'tfidf'.
    :return: a dictionary with resources as keys and similar resources as values.
    """
    sims = get_similarity(resources, space)
    find_sims = find_similar_rsc(sims, threshold)
    sim_df = find_sims.copy()
    sim_df.reset_index(inplace=True)
    sim_df['resource_id'] = resources['resource_id'].iloc[find_sims.resource_idx].values
    sim_df['similar_res'] = resources['resource_id'].iloc[find_sims.similar_res_idx].values
    sim_df['sim_resources'] = sim_df.apply(lambda x: [[x.similar_res, x.similarity_score]], axis=1)
    grouped_sim_res = sim_df[['resource_id', 'sim_resources']].groupby('resource_id').agg(lambda x: np.sum(x))
    similar_res_dict = grouped_sim_res.T.to_dict('records')[0]
    sim_res = {k: sorted(v, key=lambda x: x[1], reverse=True) for k, v in similar_res_dict.items()}
    return sim_res


def get_similar(input_text: str, corpus: pd.DataFrame, threshold: float=0.75, space: str = 'tfidf') -> list:
    """
    Retrieves a set of messages from a given corpus that are similar enough to an input message.
    :param input_text: query text.
    :param corpus: pd.DataFrame with historical messages as column 'Text'.
    :param threshold: the similarity score threshold for retrieving as similar resource.
    :param space: vector space representation of resources, either 'tf' or 'tfidf'.
    :return: a list of [message text, similarity score] pairs for the corpus messages similar to the query,
    or [None, 0] if no message reaches the threshold.
    """
    input_id = to_md5(input_text)
    input_df = pd.DataFrame({'Text': [input_text], 'resource_id': [input_id]})
    data = pd.concat([input_df, corpus])
    sim_dict = get_similar_rsc(data, threshold, space)
    result = list()
    if sim_dict.get(input_id):
        for sim_id, sim_score in sim_dict.get(input_id):
            result.append([corpus['Text'][corpus['resource_id'] == sim_id].values[0], sim_score])
    else:
        result = [None, 0]
    return result

### Extra functions

In [13]:
def clean_text(text):
    """
    Clean and preprocess text by normalizing, removing noise, and lemmatizing.

    Parameters:
    ----------
    text : str
        Input text string to be cleaned.

    Returns:
    --------
    cleaned_text : str
        Processed text with noise removed, normalized, and lemmatized.
    """
    text = text.lower()
    text = re.sub(r'article-i\.d\.: [^\s]+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'http[s]?://\S+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()

    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    cleaned_text = ' '.join(tokens)
    
    return cleaned_text
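
The effect of the noise-removal steps in `clean_text` can be checked in isolation. The sketch below repeats only the regex and punctuation steps (no NLTK tokenization, stop-word removal, or lemmatization) on a made-up sample string. `strip_noise` and its `punctuation_to_space` flag are hypothetical names introduced here; the flag shows one possible variation, replacing punctuation with spaces instead of deleting it.

```python
import re
import string

# Made-up sample in the style of the dataset's messages.
sample = ("Article-I.D.: cs.groups_733694492\n\n"
          "SPACE ACTIVIST/INTEREST/RESEARCH GROUPS, "
          "see http://example.org or mail info@example.org")


def strip_noise(text, punctuation_to_space=False):
    text = text.lower()
    text = re.sub(r'article-i\.d\.: [^\s]+', '', text)  # drop the Article-I.D. header
    text = re.sub(r'\S+@\S+', '', text)                 # drop e-mail addresses
    text = re.sub(r'http[s]?://\S+', '', text)          # drop URLs
    if punctuation_to_space:
        # Variation: map each punctuation character to a space.
        table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    else:
        # Same behavior as clean_text: delete punctuation characters.
        table = str.maketrans('', '', string.punctuation)
    text = text.translate(table)
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)          # replace non-ASCII runs
    return re.sub(r'\s+', ' ', text).strip()            # collapse whitespace


print(strip_noise(sample))
# space activistinterestresearch groups see or mail
print(strip_noise(sample, punctuation_to_space=True))
# space activist interest research groups see or mail
```

Deleting punctuation outright fuses slash-separated words into a single token, which is likely the origin of tokens such as `activistinterestresearch` in the `cleaned_text` column. Whether mapping punctuation to spaces helps the matcher is an open question to test, not a given.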


### 1. Preparing data

From a given set of messages, a historical corpus and a query message are defined. The query message is then fed into the message matcher, which retrieves all corpus messages that are similar to it.
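
One caveat about the split: the notebook draws the 90% corpus with `DataFrame.sample` and no seed, so the corpus, and therefore the message picked later with `test_data.iloc[42]`, can differ between runs. The sketch below shows a seeded 90/10 split in plain Python as a stand-in. The message list and variable names are invented for illustration; with pandas, the usual way to get the same effect is to pass `random_state` to `sample`.

```python
import random
from hashlib import md5

# Invented stand-in for the message collection.
messages = [f"message number {i}" for i in range(20)]
ids = [md5(m.encode("utf-8")).hexdigest() for m in messages]

# A seeded generator makes the 90% corpus draw repeatable across runs.
rng = random.Random(0)
corpus_ids = set(rng.sample(ids, int(len(ids) * 0.9)))

# Everything not drawn into the corpus becomes a candidate query message.
corpus_msgs = [m for m, i in zip(messages, ids) if i in corpus_ids]
query_msgs = [m for m, i in zip(messages, ids) if i not in corpus_ids]

print(len(corpus_msgs), len(query_msgs))
```

Fixing the seed keeps the reported similarity results comparable when the model is changed, since every variant is evaluated on the same corpus and queries.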

In [3]:
path = '../part1/dataset'
tags = os.listdir(path)
data_full = get_all_dfs(path, tags)[['Text']]
data_full['resource_id'] = data_full['Text'].apply(to_md5)

In [4]:
corpus = data_full.sample(int(data_full.shape[0] * 0.9))
test_data = data_full[~data_full.resource_id.isin(corpus.resource_id)]
print(corpus.shape)
corpus.tail()

(3467, 2)


file,Text,resource_id
54357,\nPosted by Cathy Smith for L. Neil Smith\n\n ...,ac22da42387cfe902642f3776b2d369b
60997,\nIn article <1r46o9INN14j@mojo.eng.umd.edu> s...,ccda3081810de5243b810dbe1be2b0c2
54703,\nIn article <C5D05G.6xw@undergrad.math.uwater...,f532fdae92d458e008e0353714a6a4c7
51278,\nIn <1993Apr4.093904.20517@proxima.alt.za> lu...,e34b0e4b6124c8c58581500def576ea1
101650,Article-I.D.: morrow.1psg9cINNn86\n\nIn articl...,537e88736575edfa1362bb0f12ea8c7f


In [7]:
corpus['Text']

file
102723    Article-I.D.: pollux.1psvouINNa2l\n\n\nThe Ang...
38432     \nspworley@netcom.com (Steve Worley) writes:\n...
61161     Article-I.D.: aurora.1993Apr23.123433.1\n\nIn ...
178533    \n(oh boy. it's the [in]famous Phill Hallam-Ba...
38627     \nMark A. Cartwright (markc@emx.utexas.edu) wr...
                                ...                        
54357     \nPosted by Cathy Smith for L. Neil Smith\n\n ...
60997     \nIn article <1r46o9INN14j@mojo.eng.umd.edu> s...
54703     \nIn article <C5D05G.6xw@undergrad.math.uwater...
51278     \nIn <1993Apr4.093904.20517@proxima.alt.za> lu...
101650    Article-I.D.: morrow.1psg9cINNn86\n\nIn articl...
Name: Text, Length: 3467, dtype: object

In [8]:
# output_file = 'text_column_output.txt'

# # Open a file in write mode and save the text
# with open(output_file, 'w', encoding='utf-8') as f:
#     for index, row in your_dataframe.iterrows():
#         f.write(f"Index: {index}\n")
#         f.write(f"Text: {row['Text']}\n")
#         f.write("-" * 80 + "\n")  # Separator between rows

# print(f"The 'Text' column has been saved to {output_file}.")

The 'Text' column has been saved to text_column_output.txt.


#### 1.1) Deeper analysis of the data

In [15]:
data_full['cleaned_text'] = data_full['Text'].apply(clean_text)

In [16]:
data_full.head()

file,Text,resource_id,cleaned_text
59848,Article-I.D.: cs.controversy_733694426\n\n\nCO...,32701d55c7514412c8e297dc566bebc6,controversial question issue periodically come...
59849,Article-I.D.: cs.groups_733694492\n\n\nSPACE A...,13890b221fa3da7c82444a9c2f7d6126,space activistinterestresearch group space pub...
59850,Article-I.D.: cs.astronaut_733694515\n\n\nHOW ...,dae3a32be9293511b5a2ad48095f039b,become astronaut first short form authored hen...
59870,\n\nDIFFS SINCE LAST FAQ POSTING (IN POSTING O...,dde5ba372c661832f108fa0693e4a0cc,diffs since last faq posting posting order han...
59873,"\n\nONLINE AND OTHER SOURCES OF IMAGES, DATA, ...",288935f9f99377966abc786c29b0ee79,online source image data etc introduction wide...


### 2. Getting similar messages

In [5]:
query_text = test_data.iloc[42]['Text']
print(query_text)


Hi,
    I was reading through "The Spaceflight Handbook" and somewhere in
there the author discusses solar sails and the forces acting on them
when and if they try to gain an initial acceleration by passing close to
the sun in a hyperbolic orbit. The magnitude of such accelerations he
estimated to be on the order of 700g. He also says that this is may not
be a big problem for manned craft because humans (and this was published
in 1986) have already withstood accelerations of 45g. All this is very
long-winded but here's my question finally - Are 45g accelerations in
fact humanly tolerable? - with the aid of any mechanical devices of
course. If these are possible, what is used to absorb the acceleration?
Can this be extended to larger accelerations?

Thanks is advance...
-Amruth Laxman




In [6]:
similar_results = get_similar(query_text, corpus, 0.2)
if similar_results[0]:
    print("Similar Messages:")
    for result in similar_results:
        print("-"*75)
        print(result[0])
        print(f"Similarity score: {result[1]}")
        print("-"*75)

Similar Messages:
---------------------------------------------------------------------------

Amruth Laxman <al26+@andrew.cmu.edu> writes:
> Hi,
>     I was reading through "The Spaceflight Handbook" and somewhere in
> there the author discusses solar sails and the forces acting on them
> when and if they try to gain an initial acceleration by passing close to
> the sun in a hyperbolic orbit. The magnitude of such accelerations he
> estimated to be on the order of 700g. He also says that this is may not
> be a big problem for manned craft because humans (and this was published
> in 1986) have already withstood accelerations of 45g. All this is very
> long-winded but here's my question finally - Are 45g accelerations in
> fact humanly tolerable? - with the aid of any mechanical devices of
> course. If these are possible, what is used to absorb the acceleration?
> Can this be extended to larger accelerations?

are you sure 45g is the right number? as far as i know, pilots are
blackout i

### 3. Using embeddings for vectorization