# LTR Test Task
## Jeff Wagg, March 2023

Classical search algorithms generate candidate matches from an input query in order to return 'relevant' results. Here, we are given a large training dataset of queries together with the resulting links to images and their metadata (such as text and title), only some of which are actually relevant to the original search. We want to develop a model that predicts whether an image is relevant or not, based on a set of associated features (keywords, image characteristics, etc.). To achieve this, we will attempt to combine image recognition for classifying the images returned from the query, keyword search, and Natural Language Processing (NLP) to refine and interpret the text query.

Once a model has been developed, it will be applied to a smaller test dataset in order to classify images as relevant or not.

In [None]:
# start by importing some useful packages
import pandas as pd 
import numpy as np
import re
import requests, os, json, lxml
from PIL import Image
from pytesseract import pytesseract
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import classification_report
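from unidecode import unidecode
# svglib and reportlab are needed to convert SVG images to PNG before opening them with PIL
from svglib.svglib import svg2rlg
from reportlab.graphics import renderPM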

In [None]:
# we will use tesseract to extract words from images 
path_to_tesseract = '/usr/local/bin/tesseract'
pytesseract.tesseract_cmd = path_to_tesseract

In [None]:
# need to set the header for url searches so that our queries are not blocked by the sites as bots
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome",
    "Accept":"text/html,application/xhtml+xml,application/xml; q=0.9,image/webp,*/*;q=0.8",'Accept-Language': 'en-US,en;q=0.8'}

## Load the Data and Perform some Exploratory Data Analysis and Cleaning 

In [None]:
%%time 

file_train = 'train.feather'
file_test = 'test.feather'

df_train = pd.read_feather(file_train)
df_train = df_train.fillna(0) # fill na entries with '0'
print("Training set loaded. The number of samples is: ",len(df_train.index))

df_test = pd.read_feather(file_test)
df_test = df_test.fillna(0) # fill na entries with '0'
print("Testing set loaded. The number of samples is: ",len(df_test.index))

print(df_train.columns)

We see that there are 23 columns in the training dataset. We want to build a model that can be used to assess which features are most useful for determining whether a particular search result is relevant or not. I initially hypothesize that some combination of 'title', 'src' (url of image), 'text_tag', and 'text' will be the most useful for predicting the outcome, or target variable: 'is_relevant'.

Now, let's find the total number of unique searches in the training set and the average number, k, of candidates generated for each. 

In [None]:
print("The number of searches is: ",df_train['query'].nunique())

# average number of candidates, k, generated per query
kavg = df_train.groupby('query').size().mean()
print("Each search returns an average of ",kavg," results")

query_uniq = df_train['query'].unique()
print(query_uniq)

We are given a list of features contained in the dataframe that may be useful for this analysis. These include:

“id” : unique identifier for an image. We do not expect this to be useful for the predictions. \
“query” : text query, which is used to determine the relevance of an image. Note that every query is a single token. \
“url_page”: webpage where the image is found. The keywords derived from this feature may be used in the model. \
“src” : the URL of the image itself. The keywords extracted from this link are expected to be important for the modelling. We may also be able to use the image itself for predictions. \
“title”: title of the “url_page”. There may be words in the title which prove useful for the model. \
“alt”: alternate text for the image. Words extracted from here are likely to be useful for making predictions. \
“is_relevant”: 1 if image is relevant to the query, 0 otherwise. This is the target for the analysis. 

In examining the list of queries, I find that some of them are misspelled. For example, there is a query for 'eldery', which should be 'elderly', and a query for 'alltvshows', which should be 'all tv shows'. These will need to be corrected for the modelling. Some of the queries are in Spanish, so they may also need to be translated and have their accents removed. A quick scan of the list of queries reveals more than 100 misspellings for which we propose changes. To find the most likely replacement, I run a Bing search with the original query and extract the recommended alternative text (the first highlighted term in the top result). I also tried Google but encountered a security block after too many web-scraping calls.

In [None]:
%%time

query_map = {}
for q in query_uniq: 
    params = {
      'q': q,
      'hl': 'en',
      'gl': 'us',
    }
    
    # the query is passed through 'params', so the link must not already end in '?q='
    #link = 'https://www.google.com/search'
    link = 'https://www.bing.com/search'
    html = requests.get(link, headers=headers, params=params).text
    newq = q
    try:
        soup = BeautifulSoup(html, 'html.parser')
        linetags = str(soup.find("li",{"class":"b_algo"}))
        search_word = linetags.split("<strong>")[1].split()[0]
        newq = search_word.split("</strong>")[0].split()[0].lower()
    except IndexError:
        # no highlighted word was found in the first result
        print("Keeping original query")
        
    print(q," suggested: ",newq)
    query_map[q] = newq

# apply all replacements at once so that one suggestion cannot be rewritten by a later one
df_train['query'] = df_train['query'].map(query_map)
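
The string handling in the loop above can be checked offline on a hand-written snippet. The sketch below applies the same split-based steps to a made-up result item; the HTML string and the helper name `first_highlighted_word` are illustrative assumptions, and Bing's real markup may differ.

```python
def first_highlighted_word(result_html, fallback):
    """Return the first word inside the first <strong> tag, lower-cased.

    Mirrors the split-based parsing used in the notebook; falls back to the
    original query when no highlighted word can be found.
    """
    try:
        after_open = result_html.split("<strong>")[1].split()[0]
        return after_open.split("</strong>")[0].split()[0].lower()
    except IndexError:
        return fallback


# Hypothetical result item, written by hand for illustration only
sample = '<li class="b_algo"><h2><a href="#"><strong>Elderly</strong> care</a></h2></li>'

suggestion = first_highlighted_word(sample, "eldery")
no_match = first_highlighted_word("<li>nothing highlighted</li>", "eldery")
print(suggestion, no_match)
```

Only the first highlighted word is kept, so multi-word suggestions are truncated; this is one reason some suggestions need manual correction afterwards.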


In [None]:
# Some words do not have replacements, or the Bing suggestions do not make sense. I edit these myself
df_train['query'] = df_train['query'].replace(['home'], 'homedepot')
df_train['query'] = df_train['query'].replace(['jorgenssns'], 'jorgensen')
df_train['query'] = df_train['query'].replace(['bennyes'], 'bennys')  
df_train['query'] = df_train['query'].replace(['about'], 'ajw')  
df_train['query'] = df_train['query'].replace(['tin'], 'thethao24h')  
df_train['query'] = df_train['query'].replace(['a'], 'aeiou')  
df_train['query'] = df_train['query'].replace(['cryptocurrenct'], 'cryptocurrency')  
# technology -> mstm
df_train['query'] = df_train['query'].replace(['bloggle'], 'buc')  
df_train['query'] = df_train['query'].replace(['leukorrhea'], 'leucorrhea') 
df_train['query'] = df_train['query'].replace(['sheikh'], 'sheik')
df_train['query'] = df_train['query'].replace(['aliminum'], 'aluminum')
df_train['query'] = df_train['query'].replace(['arcangels'], 'archangel')
df_train['query'] = df_train['query'].replace(['basball'], 'baseball') 
df_train['query'] = df_train['query'].replace(['distillium'], 'distylium')
df_train['query'] = df_train['query'].replace(['ktire'], 'kture')  
df_train['query'] = df_train['query'].replace(['neglegence'], 'negligence')  
df_train['query'] = df_train['query'].replace(['unfazed'], 'unphased') 
df_train['query'] = df_train['query'].replace(['bonbas'], 'bombas')
df_train['query'] = df_train['query'].replace(['lasik'], 'lazik')


I note that some of the words appear in different languages. This is not taken into account for this analysis except to remove any accents. 
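
Accent removal is handled with `unidecode` in the feature engineering cells. As a rough illustration of what that step does, here is a standard-library sketch that strips combining marks; `strip_accents` is a hypothetical helper and is not a full replacement for `unidecode`, which also transliterates non-Latin scripts.

```python
import unicodedata


def strip_accents(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Spanish words chosen for illustration
examples = ["canción", "niño", "teléfono"]
cleaned = [strip_accents(word) for word in examples]
print(cleaned)
```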

Here, I read in some of the relevant images to determine if their characteristics can be used in the model as features. 

In [None]:
%%time

for i in range(0,19):
    if df_train['is_relevant'][i] == 1:
        try:
            r = requests.get(df_train['src'][i],headers=headers)
            with open('tmpimg', 'wb') as outfile:
                outfile.write(r.content)
            if 'svg' in df_train['src'][i]:
                # PIL cannot open SVG files, so convert them to PNG first
                drawing = svg2rlg("tmpimg")
                renderPM.drawToFile(drawing, "tmpimg", fmt="PNG")
            img =  Image.open('tmpimg')
            # extract any text in the image and compare it with the query
            text = pytesseract.image_to_string(img)
            text = text.replace('\n','')
            img_low = str(text).lower()
            #img.show()
            if (df_train['query'][i] in img_low):
                print(i,"Yes, query is in image.")
            else:
                print(i,"No, query is not in image")
        except Exception:
            print(i,"Sorry, could not read image: ",df_train['src'][i])        

In the previous cell, we tried reading the images (linked through 'src') to see whether any of their characteristics could be used as a model feature. After using Pillow to open only the images with 'is_relevant' equal to 1, I found that many of the images could not be read, in some cases showing 'File not found'. The main cause appears to be sites using security measures to block potential web scrapers. This was fixed by including the headers defined in an earlier cell of the code. Some of the files are in 'svg' format and could not be read with PIL. These need to be read with 'svg2rlg(IMAGE_NAME)' and then converted to PNG format.

We then used tesseract to extract any text found in the image and checked whether the 'query' was among these words. We found that the query rarely appeared in the image itself. As such, this is unlikely to be a good model feature, and we will exclude it from here on.

In running these tests while printing out some of the other feature values, I did note that the query text often appears in either the 'alt' text or the URL of the relevant image ('src'). I also note that the query phrase appears in the title about 70% of the time, whether the image is relevant or not. In the next section, I check whether these matches are a reliable indicator of relevance.

Finally, I calculate the fraction of images which are deemed to be relevant. These make up only ~3.6% of the training sample, meaning that we have imbalanced data, which should be factored into the model.

In [None]:
print(f"Only {100.*df_train['is_relevant'].mean():.1f}% of the images are relevant.")
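
One common way to factor this imbalance into a model is to weight each class inversely to its frequency. The sketch below computes such weights for a toy label list with the same ~3.6% positive rate; the helper name and the toy labels are assumptions for illustration, and the weights are not used elsewhere in this notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels: 36 relevant out of 1000, i.e. a 3.6% positive rate
toy_labels = [1] * 36 + [0] * 964
weights = balanced_class_weights(toy_labels)
print(weights)
```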

## Feature Engineering 

Given that we have seen that the presence of the search phrase in at least two of the features can give us some insight into whether the resulting image is relevant, or not, we create new features which can be used as input into a machine learning model such as logistic regression. We will create 'isintxt_XXX' columns in the training dataframe, where 'XXX' is the name of the original column, and the values will be '0'- text not present, '>=1' - the number of times the query appears in the text. 

Note - after running a few tests, we found that running tesseract on the images using a CPU is too slow to search through all of the training images in a reasonable amount of time (~110 hours for all 600k images). Given our previous finding that the query is rarely found in the image even when the image is relevant, I decided not to use this as a feature.
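
To make the 'isintxt_XXX' definition concrete, here is a small sketch of the count for a single row. The sample strings and the helper `count_occurrences` are hypothetical; the characters stripped mirror those removed from the 'alt' text in this notebook.

```python
def count_occurrences(query, text, strip_chars=(" ", "'", "-")):
    """Count how often the lower-cased query occurs in the normalised text."""
    normalised = str(text).lower()
    for ch in strip_chars:
        normalised = normalised.replace(ch, "")
    return normalised.count(str(query).lower())


# Made-up 'alt' strings for illustration
n_two = count_occurrences("baseball", "Base-ball bat and Baseball glove")
n_zero = count_occurrences("baseball", "Football helmet")
print(n_two, n_zero)
```

Because spaces are removed before counting, a query can also match across word boundaries, which is one source of false positives for very short queries.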

In [None]:
%%time

# source column -> (new feature column, substrings stripped before counting, remove accents?)
norm_rules = {
    'url_page': ('isintxt_url', [" ", "_"], False),
    'src': ('isintxt_src', ["_", "http://", "https://", "/"], False),
    'alt': ('isintxt_alt', [" ", "'", "-"], True),
    'title': ('isintxt_title', [" ", "'", "-"], True),
    'text': ('isintxt_text', [" ", "_"], True),
}

def count_query(query, value, strip, deaccent):
    # number of times the lower-cased query appears in the normalised text (0 if absent)
    text = str(value).lower()
    for s in strip:
        text = text.replace(s, "")
    if deaccent:
        text = unidecode(text)
    return text.count(str(query).lower())

# build each feature column in one pass; this covers every row (including the last one)
# and avoids chained assignment such as df_train[col][i] = ...
for col, (new_col, strip, deaccent) in norm_rules.items():
    df_train[new_col] = [count_query(q, v, strip, deaccent)
                         for q, v in zip(df_train['query'], df_train[col])]
        

In [None]:
# Run a short check to see if there are any obvious patterns in the occurrence of the query text 
print("Query Relevant?   # occurrences in:  title Alt SRC Text")

for i in range(0,len(df_train.index)):  
    if df_train['is_relevant'][i] == 1:
        print(df_train['query'][i],df_train['is_relevant'][i],"                          ",
              df_train['isintxt_title'][i],df_train['isintxt_alt'][i],df_train['isintxt_src'][i],
                df_train['isintxt_text'][i])

Before creating any machine learning models, I want to run a quick check to see whether there are any obvious combinations of features that indicate whether an image is relevant to a search query.

In [None]:
alt_src_text_rel = 0
alt_src_rel = 0
alt_text_rel = 0
src_text_rel = 0
allzeros_rel = 0
alt_src_text_irrel = 0
alt_src_irrel = 0
alt_text_irrel = 0
src_text_irrel = 0
allzeros_irrel = 0
alt_rel = 0
src_rel = 0
text_rel = 0 
alt_irrel = 0
src_irrel = 0
text_irrel = 0 


for i in range(0,len(df_train.index)):  
    if ((df_train['isintxt_alt'][i] >=1 ) and (df_train['isintxt_src'][i] >= 1) and (df_train['isintxt_text'][i] >= 1)):
        if (df_train['is_relevant'][i] == 1):
            alt_src_text_rel += 1
        else:
            alt_src_text_irrel += 1
        
    if ((df_train['isintxt_alt'][i] >=1) and (df_train['isintxt_src'][i] >= 1) and df_train['isintxt_text'][i] == 0):
        if (df_train['is_relevant'][i] == 1):
            alt_src_rel += 1
        else:
            alt_src_irrel += 1
    
    if ((df_train['isintxt_alt'][i] >= 1) and (df_train['isintxt_text'][i] >= 1) and df_train['isintxt_src'][i] == 0):
        if (df_train['is_relevant'][i] == 1):
            alt_text_rel += 1
        else:
            alt_text_irrel += 1
    
    if ((df_train['isintxt_text'][i] >= 1) and (df_train['isintxt_src'][i] >= 1) and df_train['isintxt_alt'][i] == 0):
        if (df_train['is_relevant'][i] == 1):
            src_text_rel += 1
        else:
            src_text_irrel += 1
       
    if (df_train['isintxt_alt'][i] == df_train['isintxt_src'][i] == 0 and df_train['isintxt_text'][i] >= 1):
        if (df_train['is_relevant'][i] == 1):
            text_rel += 1
        else:
            text_irrel += 1
       
    if (df_train['isintxt_src'][i] == df_train['isintxt_text'][i] == 0 and df_train['isintxt_alt'][i] >= 1):
        if (df_train['is_relevant'][i] == 1):
            alt_rel += 1
        else:
            alt_irrel += 1
       
    if (df_train['isintxt_alt'][i] == df_train['isintxt_text'][i] == 0 and df_train['isintxt_src'][i] >= 1):
        if (df_train['is_relevant'][i] == 1):
            src_rel += 1
        else:
            src_irrel += 1
       
    if (df_train['isintxt_alt'][i] == df_train['isintxt_src'][i] == df_train['isintxt_text'][i] == 0):
        if (df_train['is_relevant'][i] == 1):
            allzeros_rel += 1
        else:
            allzeros_irrel += 1
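
The same tally can be written more compactly by treating each row as a pattern of presence flags. The sketch below, on a handful of made-up rows, groups rows by whether the query appears in ('alt', 'src', 'text') and reports the fraction that are relevant for each pattern; the rows and the helper name are illustrative assumptions, not values from the training set.

```python
from collections import defaultdict


def relevance_by_pattern(rows):
    """Map each (alt, src, text) presence pattern to (n_rows, fraction relevant)."""
    totals = defaultdict(lambda: [0, 0])  # pattern -> [n_rows, n_relevant]
    for alt, src, text, is_relevant in rows:
        pattern = (alt >= 1, src >= 1, text >= 1)
        totals[pattern][0] += 1
        totals[pattern][1] += int(is_relevant == 1)
    return {p: (n, n_rel / n) for p, (n, n_rel) in totals.items()}


# Made-up rows: (isintxt_alt, isintxt_src, isintxt_text, is_relevant)
toy_rows = [
    (2, 1, 3, 1),
    (1, 1, 1, 0),
    (0, 0, 0, 0),
    (0, 0, 0, 0),
    (1, 0, 2, 1),
    (0, 0, 0, 1),
]
summary = relevance_by_pattern(toy_rows)
for pattern, (n, frac) in sorted(summary.items(), reverse=True):
    print(pattern, n, round(frac, 2))
```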

In [None]:
alt_text_rel / (alt_text_rel + alt_text_irrel)

In [None]:
num_isrelev = 0
num_notrelev = 0
num_insrcalt_rel = 0
num_insrcalt_notrel = 0

for i in range(0, len(df_train.index)):
    if df_train['is_relevant'][i] == 1:
        num_isrelev += 1
        if ((df_train['isintxt_text'][i] >= 1) and (df_train['isintxt_src'][i] >= 1) and (df_train['isintxt_alt'][i] >= 1)):
            num_insrcalt_rel += 1
    else:
        num_notrelev += 1
        if ((df_train['isintxt_text'][i] >= 1) and (df_train['isintxt_src'][i] >= 1) and (df_train['isintxt_alt'][i] >= 1)):
            num_insrcalt_notrel += 1

In [None]:
print("Fraction of relevant images where query found in alt, src and text:",num_insrcalt_rel / num_isrelev)
print("Fraction of irrelevant images where query found in alt, src and text:",num_insrcalt_notrel / num_notrelev)

These results are interesting, as they suggest that finding the 'query' in some of the other fields can help to distinguish relevant images from irrelevant ones. Finding the query in either 'title' or 'url' does not help, as the probability is the same irrespective of the relevance (the query appears in 70% of the titles and 60% of the URLs). However, in the case of 'alt', 'src' and 'text', the fraction of times that the query appears is about 5.5x higher when the image is relevant. This suggests that these are the most important features for our models. When the image is relevant, the query appears in at least one of 'alt', 'src' or 'text' about 68% of the time, while this is true for only 22% of the irrelevant images. Changing the OR condition to an AND, the query appears in all three fields for only 2% of the irrelevant images, compared with 12% of the relevant images.
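
Combining these fractions with the ~3.6% base rate gives a feel for how much a match actually tells us. The sketch below applies Bayes' rule to the figures quoted above; the function name is hypothetical, and the inputs are the rounded percentages from the text, so the outputs are only approximate.

```python
def posterior_relevant(p_match_given_rel, p_match_given_irr, base_rate):
    """P(relevant | match) from Bayes' rule."""
    hit = p_match_given_rel * base_rate
    false_alarm = p_match_given_irr * (1.0 - base_rate)
    return hit / (hit + false_alarm)


base_rate = 0.036  # approximate fraction of relevant images in the training set

# query found in at least one of 'alt', 'src', 'text'
p_any = posterior_relevant(0.68, 0.22, base_rate)
# query found in all three of 'alt', 'src', 'text'
p_all = posterior_relevant(0.12, 0.02, base_rate)

print(f"P(relevant | match in any field)  = {p_any:.3f}")
print(f"P(relevant | match in all fields) = {p_all:.3f}")
```

Under these rounded inputs, even a match in all three fields leaves most matching images irrelevant, which is consistent with the features being informative but far from decisive.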

## Machine Learning Model 

Now that we have defined some plausible model features, we will attempt to develop machine learning models that can use these features to predict the target variable, 'is_relevant'. I attempted to implement three models: 1) Logistic Regression, 2) Poisson Regression, and 3) k-Nearest Neighbour (kNN).

I first split the training data into testing (20%) and training (80%) subsets. The models were fit to the training subset and then applied to the testing subset to verify performance. The first two models proved to be unsuccessful: I was unable to find model parameters that led to any success in predicting when an image would be relevant based on the presence of the query text in any of the features. I have therefore excluded the Logistic Regression and Poisson Regression models from the code below. However, the kNN model did show some promise, as it assigns each image a probability of being relevant, rather than a binary prediction, based on the feature variables.

In [None]:
feature_cols = ['isintxt_alt','isintxt_src','isintxt_text','isintxt_title','isintxt_url']
X = df_train[feature_cols] # Features
y = df_train['is_relevant']

# split the training data into test and training sub-samples. Standard 80/20 rule of thumb
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

##### fit the model #####
knn_model = KNeighborsRegressor(n_neighbors=44)
knn_model.fit(X_train, y_train)
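
For a 0/1 target, a kNN regressor with uniform weights predicts the mean label of the k nearest training points, which is simply the fraction of those neighbours that are relevant. The brute-force sketch below illustrates the idea on made-up feature counts; it is a simplified stand-in for `KNeighborsRegressor` (Euclidean distance, no indexing structure), and ties between equidistant points may be broken differently.

```python
import math


def knn_predict(train_X, train_y, query_point, k):
    """Average the labels of the k training points closest to query_point."""
    distances = [
        (math.dist(x, query_point), label) for x, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(label for _, label in nearest) / k


# Made-up rows of (isintxt_alt, isintxt_src, isintxt_text) counts
toy_X = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (2, 1, 1), (1, 1, 2), (0, 1, 0)]
toy_y = [0, 0, 1, 1, 0, 0]

score_match = knn_predict(toy_X, toy_y, (1, 1, 1), k=3)
score_nomatch = knn_predict(toy_X, toy_y, (0, 0, 0), k=3)
print(score_match, score_nomatch)
```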

In [None]:
%%time

# try a grid search to find the best value of 'k', the number of neighbours used by the regressor
parameters = {"n_neighbors": range(33, 45)}
# score on RMSE explicitly; the default score for a regressor is R^2
gridsearch = GridSearchCV(KNeighborsRegressor(), parameters, scoring="neg_root_mean_squared_error")
gridsearch.fit(X_train, y_train)
gridsearch.best_params_

# We find that 19 neighbours gives the lowest root mean squared error in the grid search when using binary features
# when we let the features indicate the number of matches, the number increases to 44 neighbours

In [None]:
%%time
# make predictions using the training subsample to see how well the model fits 
train_preds = knn_model.predict(X_train)
mse = mean_squared_error(y_train, train_preds)
rmse = sqrt(mse)
print(rmse)

In [None]:
%%time
# make predictions using the testing subsample to see how well the model fits 
test_preds = knn_model.predict(X_test)
mse = mean_squared_error(y_test, test_preds)
rmse = sqrt(mse)
print(rmse)
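
When reading these RMSE values it helps to have a baseline. With only ~3.6% of images relevant, a model that always predicts the base rate p already achieves an RMSE of sqrt(p(1-p)), so the kNN scores are only informative if they fall below that. A minimal sketch, using the rounded base rate from the text as an assumption (the function names are hypothetical):

```python
from math import sqrt


def constant_predictor_rmse(base_rate):
    """RMSE of always predicting base_rate for a 0/1 target with that mean."""
    return sqrt(base_rate * (1.0 - base_rate))


def all_zero_rmse(base_rate):
    """RMSE of always predicting 0 for the same target."""
    return sqrt(base_rate)


p = 0.036  # approximate fraction of relevant images in the training set
baseline = constant_predictor_rmse(p)
zeros = all_zero_rmse(p)
print(f"constant base-rate RMSE: {baseline:.3f}")
print(f"all-zero RMSE:           {zeros:.3f}")
```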

In [None]:
# average predicted score for the truly relevant and truly irrelevant images in the held-out subsample
y_true = y_test.to_numpy()
ymean_rel = test_preds[y_true == 1].mean()
ymean_irr = test_preds[y_true == 0].mean()

print("Avg of relevant: ",ymean_rel,
      "Avg of irrelevant:",ymean_irr)

## Apply the Model to the Test Data

At this stage we want to apply our kNN machine learning model to the test data set. I first have to preprocess the test data provided, including some feature engineering. 

In [None]:
print("The number of searches in the test data is: ",df_test['query'].nunique())

# average number of candidates, k, generated per query
kavg = df_test.groupby('query').size().mean()
print("Each search returns an average of ",kavg," results")

query_uniq_test = df_test['query'].unique()
print(query_uniq_test)

In [None]:
%%time

query_map_test = {}
for q in query_uniq_test: 
    params = {
      'q': q,
      'hl': 'en',
      'gl': 'us',
    }
    
    # the query is passed through 'params', so the link must not already end in '?q='
    link = 'https://www.bing.com/search'
    html = requests.get(link, headers=headers, params=params).text
    newq = q
    try:
        soup = BeautifulSoup(html,'html.parser')
        linetags = str(soup.find("li",{"class":"b_algo"}))
        search_word = linetags.split("<strong>")[1].split()[0]
        newq = search_word.split("</strong>")[0].split()[0].lower()
    except IndexError:
        # no highlighted word was found in the first result
        print("Keeping original query")
        
    print(q," suggested: ",newq)
    query_map_test[q] = newq

# apply all replacements at once so that one suggestion cannot be rewritten by a later one
df_test['query'] = df_test['query'].map(query_map_test)


In [None]:
%%time

# source column -> (new feature column, substrings stripped before counting, remove accents?)
norm_rules = {
    'url_page': ('isintxt_url', [" ", "_"], False),
    'src': ('isintxt_src', ["_", "http://", "https://", "/"], False),
    'alt': ('isintxt_alt', [" ", "'", "-"], True),
    'title': ('isintxt_title', [" ", "'", "-"], True),
    'text': ('isintxt_text', [" ", "_"], True),
}

def count_query(query, value, strip, deaccent):
    # number of times the lower-cased query appears in the normalised text (0 if absent)
    text = str(value).lower()
    for s in strip:
        text = text.replace(s, "")
    if deaccent:
        text = unidecode(text)
    return text.count(str(query).lower())

# same feature engineering as for the training data, applied to every row of the test set
for col, (new_col, strip, deaccent) in norm_rules.items():
    df_test[new_col] = [count_query(q, v, strip, deaccent)
                        for q, v in zip(df_test['query'], df_test[col])]
        

In [None]:
# This block of code is for checking the number of times the query appears in some of the most important features
print("query    is in: title Alt SRC Text    src")
print()

numrows = 0
for i in range(0,len(df_test.index)):  
    print(df_test['query'][i],"            ",df_test['isintxt_title'][i],df_test['isintxt_alt'][i],df_test['isintxt_src'][i],
          df_test['isintxt_text'][i],str(df_test['src'][i]))
    numrows = numrows + 1
    
# the test set has no relevance labels, so this is simply the number of rows inspected
print("Number of images checked: ",numrows)

In [None]:
# Now apply the model to the test data 
X_test_final = df_test[feature_cols] # Features
df_test['is_relevant'] = 0
y_test_final = df_test['is_relevant']

test_preds_final = knn_model.predict(X_test_final)
df_test['is_relevant'] = test_preds_final

In [None]:
# check the model output and write to csv file
for i in range(0, len(df_test.index)):
    print(df_test['id'][i],df_test['is_relevant'][i])
    
df_test.to_csv('submission.csv', index=False, columns=['id','is_relevant'])
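
The values written to 'is_relevant' are scores between 0 and 1 rather than hard labels. If a binary label were required, one option would be to apply a cut-off to the scores. The sketch below shows this on made-up scores; the threshold value and the helper name are assumptions, and any real cut-off would need tuning on the held-out split.

```python
def to_binary_labels(scores, threshold):
    """Label a score as relevant (1) when it reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


# Made-up kNN scores for illustration
toy_scores = [0.0, 0.05, 0.30, 0.52, 0.11]
labels = to_binary_labels(toy_scores, threshold=0.25)  # hypothetical cut-off
print(labels)
```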

## Summary

After performing some data cleaning and using natural language processing to refine the queries, I have generated new features using the metadata associated with the training dataset. Following this, I attempted to develop three different machine learning models, finding that a simple kNN worked best, with 44 neighbours giving the optimal solution. This kNN model provides some insight into whether the images returned by a query are relevant or not, assigning probabilities based on the number of times the query appears in the various features.

With more computing resources, it would be interesting to also try using classification techniques on the images themselves. This is likely to lead to an improvement in the model performance. One might also consider using word vectors instead of searching for the query words directly in the text features, as this should prove to be computationally faster.