A hands-on Python coding task focused on data preprocessing, aligned with requirements from the take-home task. Make sure the candidate writes clean code with an object oriented approach.


Exercise Example:
Given a dataset with columns [qid, pid, query, title, description, relevance], write a function to preprocess this data, creating input pairs. Implement a basic data split into train, validation, and test sets.

tasks:
1. read file
2. do the statistical analysis on labels
3. create train, validation, test sets
4. make sure all split has the same distributions based labels

## Import libraries

In [33]:
import pandas as pd 
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer

In [5]:
## Task 1: read file
def create_df_from_csv(file_path):
    df = pd.read_csv(file_path)
    return df
interview_df = create_df_from_csv("esci_sample.csv")
interview_df = interview_df.drop(columns=["Unnamed: 0"])
interview_df.head(5)

Unnamed: 0,qid,pid,query,title,description,relevance
0,13069,B00JPO0KJ8,bar keepers friend,Bar Keepers Friend Powder Cleanser 12 Oz - Mul...,Bar Keepers Friend Powder Cleanser 12 Oz - Mul...,Exact
1,6078,B079H53D2B,abortion pills for pregnancy,Probiotics 60 Billion CFU - Probiotics for Wom...,Probiotics 60 Billion CFU - Probiotics for Wom...,Irrelevant
2,45027,B071JM699P,get set for school flip crayons,"Amazon Basics Woodcased #2 Pencils, Pre-sharpe...","Amazon Basics Woodcased #2 Pencils, Pre-sharpe...",Substitute
3,87357,B00AQIULD2,reusable camping trash bag,Glad ForceFlexPlus Drawstring Large Trash Bags...,Glad ForceFlexPlus Drawstring Large Trash Bags...,Irrelevant
4,19433,B07J4PP6SD,braiding hair 4,"26""-8 Packs Braiding Hair Pre Stretched Hair f...","26""-8 Packs Braiding Hair Pre Stretched Hair f...",Exact


In [6]:
interview_df['relevance'].unique()

array(['Exact', 'Irrelevant', 'Substitute', 'Complement'], dtype=object)

In [None]:
## 2. Do the statistical analysis on labels
def calculate_relevance_porpotion(df):
    relevance_list = list(df['relevance'])
    unique_relevance_list = df['relevance'].unique()
    total_sample_size = len(interview_df["relevance"])
    relevance_dict = {}

    for relevance in relevance_list:
        if (relevance in relevance_dict):
            relevance_dict[relevance]  = relevance_dict[relevance] + 1
        else:
            relevance_dict[relevance] = 1

    for unique_relevance in unique_relevance_list:
        if (unique_relevance in relevance_dict):
            relevance_dict[unique_relevance] = relevance_dict[unique_relevance] / total_sample_size
    return relevance_dict

dict = calculate_relevance_porpotion(interview_df)


{'Exact': 0.678921568627451,
 'Irrelevant': 0.09436274509803921,
 'Substitute': 0.20554812834224598,
 'Complement': 0.021167557932263815}

In [60]:
## 3. split the data into train, validation and test 
def split_data(df,target,test_ratio=0.2):
    X_train, X_test, y_train, y_test= train_test_split(df,target,test_size=0.2,random_state=54,stratify=target)
    return X_train, X_test, y_train, y_test
target = list(interview_df['relevance'])

## create a combined field for description and title
def create_combined_field(df):
    df['combined_text'] = df['description'].fillna("") + " " + df['title'].fillna("")
    return df

interview_df = create_combined_field(interview_df)
X_df = interview_df[['query','combined_text']]

X_train, X_test, y_train, y_test = split_data(X_df,target)
X_train_final, X_validation, y_train_final, y_validation = split_data(X_train,y_train)

In [23]:
len(X_train_final) + len(X_validation) + len(X_test) == len(interview_df)

True

In [None]:
## 2. Do the statistical analysis on labels
def calculate_relevance_porpotion(relevance_list):
    unique_relevance_list = set(relevance_list)
    total_sample_size = len(relevance_list)
    relevance_dict = {}

    for relevance in relevance_list:
        if (relevance in relevance_dict):
            relevance_dict[relevance]  = relevance_dict[relevance] + 1
        else:
            relevance_dict[relevance] = 1

    for unique_relevance in unique_relevance_list:
        if (unique_relevance in relevance_dict):
            relevance_dict[unique_relevance] = relevance_dict[unique_relevance] / total_sample_size
    return relevance_dict

In [32]:
# check split porpotion
def check_split_size(relevance_list):
    relevance_size_dict = calculate_relevance_porpotion(relevance_list)
    print(relevance_size_dict)
    return relevance_size_dict
train_relevance_size_dict = check_split_size(y_train)
test_relevance_size_dict = check_split_size(y_test)
validation_relevance_size_dict = check_split_size(y_validation)

{'Substitute': 0.20557103064066853, 'Exact': 0.6789693593314763, 'Complement': 0.02116991643454039, 'Irrelevant': 0.09428969359331477}
{'Exact': 0.6787305122494433, 'Substitute': 0.205456570155902, 'Irrelevant': 0.09465478841870824, 'Complement': 0.021158129175946547}
{'Substitute': 0.20543175487465182, 'Exact': 0.6789693593314763, 'Irrelevant': 0.09401114206128133, 'Complement': 0.02158774373259053}


In [61]:
X_test

Unnamed: 0,query,combined_text
154,white heavy duty extension cord,IRON FORGE CABLE 25 ft Lighted Outdoor Extensi...
8371,martin acoustic lifespan 92/8,Martin MA540T Authentic Acoustic Lifespan 2.0 ...
2481,chateau self storage lock,"12 Pack Round Padlock with Shielded Shackle, 2..."
6918,g9 odyssey,Sceptre IPS 24-Inch Business Computer Monitor ...
3244,therabands resistance bands set,"A AZURELIFE Resistance Bands Set, Professional..."
...,...,...
8963,yoda do or do not there is no try art,Small Motivational Wall Art Inspirational Quot...
6251,therabands resistance bands set,"TheraBand Resistance Bands, 6 Yard Roll Profes..."
1949,bird feeder for outside,Perky-Pet 336 Squirrel-Be-Gone Wild Bird Feede...
6811,100 oz water bottle bpa free,MYFOREST Half Gallon Tritan Water Bottle with ...


tasks:
1. select a sentence transformer
2. use it to compute query-combined_text cosine similarity (>0.99 exact, >0.8 and <0 .99 substitute, =0 irrelevant, else compliment )
3. use s-learn confusion matrix between predicted and actual

In [62]:
X_test.reset_index(inplace=True)

In [63]:
X_test = X_test.drop(columns=["index"])
X_test

Unnamed: 0,query,combined_text
0,white heavy duty extension cord,IRON FORGE CABLE 25 ft Lighted Outdoor Extensi...
1,martin acoustic lifespan 92/8,Martin MA540T Authentic Acoustic Lifespan 2.0 ...
2,chateau self storage lock,"12 Pack Round Padlock with Shielded Shackle, 2..."
3,g9 odyssey,Sceptre IPS 24-Inch Business Computer Monitor ...
4,therabands resistance bands set,"A AZURELIFE Resistance Bands Set, Professional..."
...,...,...
1791,yoda do or do not there is no try art,Small Motivational Wall Art Inspirational Quot...
1792,therabands resistance bands set,"TheraBand Resistance Bands, 6 Yard Roll Profes..."
1793,bird feeder for outside,Perky-Pet 336 Squirrel-Be-Gone Wild Bird Feede...
1794,100 oz water bottle bpa free,MYFOREST Half Gallon Tritan Water Bottle with ...


In [None]:
from sentence_transformers import util
# Compute Cos similarity 
model = SentenceTransformer("all-MiniLM-L6-v2")

def compute_predict_y(X_test,model):
    predicted_y = []
    # Embedding the query
    for i in range(len(X_test)):
        query_embedding = model.encode(X_test['query'][i], convert_to_tensor=True)
        product_embedding = model.encode(X_test['combined_text'][i], convert_to_tensor=True)
        cos_similarity = util.cos_sim(query_embedding,product_embedding)
        if (cos_similarity>=0.99):
            predicted_y.append("Exact")
        elif (cos_similarity<0.99 and cos_similarity>=0.8):
            predicted_y.append("Substitue")
        elif (cos_similarity == 0):
            predicted_y.append("Irrelevant")
        else:
            predicted_y.append("Compliment")
    return predicted_y


In [64]:
X_test.dropna(subset=["query","combined_text"],inplace=True)

In [65]:
X_test

Unnamed: 0,query,combined_text
0,white heavy duty extension cord,IRON FORGE CABLE 25 ft Lighted Outdoor Extensi...
1,martin acoustic lifespan 92/8,Martin MA540T Authentic Acoustic Lifespan 2.0 ...
2,chateau self storage lock,"12 Pack Round Padlock with Shielded Shackle, 2..."
3,g9 odyssey,Sceptre IPS 24-Inch Business Computer Monitor ...
4,therabands resistance bands set,"A AZURELIFE Resistance Bands Set, Professional..."
...,...,...
1791,yoda do or do not there is no try art,Small Motivational Wall Art Inspirational Quot...
1792,therabands resistance bands set,"TheraBand Resistance Bands, 6 Yard Roll Profes..."
1793,bird feeder for outside,Perky-Pet 336 Squirrel-Be-Gone Wild Bird Feede...
1794,100 oz water bottle bpa free,MYFOREST Half Gallon Tritan Water Bottle with ...


In [66]:
predict_y = compute_predict_y(X_test,model)

In [71]:
from sklearn.metrics import confusion_matrix

In [73]:
y_test

['Exact',
 'Exact',
 'Substitute',
 'Substitute',
 'Substitute',
 'Exact',
 'Exact',
 'Substitute',
 'Substitute',
 'Exact',
 'Exact',
 'Exact',
 'Exact',
 'Substitute',
 'Substitute',
 'Substitute',
 'Exact',
 'Exact',
 'Substitute',
 'Substitute',
 'Substitute',
 'Exact',
 'Exact',
 'Exact',
 'Exact',
 'Substitute',
 'Exact',
 'Exact',
 'Irrelevant',
 'Exact',
 'Exact',
 'Substitute',
 'Irrelevant',
 'Exact',
 'Exact',
 'Substitute',
 'Exact',
 'Substitute',
 'Exact',
 'Exact',
 'Exact',
 'Substitute',
 'Complement',
 'Substitute',
 'Exact',
 'Exact',
 'Exact',
 'Exact',
 'Irrelevant',
 'Substitute',
 'Substitute',
 'Irrelevant',
 'Exact',
 'Exact',
 'Exact',
 'Exact',
 'Substitute',
 'Exact',
 'Exact',
 'Exact',
 'Substitute',
 'Exact',
 'Irrelevant',
 'Substitute',
 'Substitute',
 'Irrelevant',
 'Substitute',
 'Exact',
 'Exact',
 'Exact',
 'Exact',
 'Exact',
 'Irrelevant',
 'Substitute',
 'Irrelevant',
 'Substitute',
 'Exact',
 'Exact',
 'Exact',
 'Exact',
 'Exact',
 'Complement',


In [75]:
y_set = set(y_test)

In [78]:
label_list = list(y_set)

In [79]:
confusion_matrix(y_test, predict_y, labels=label_list)

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])