## Cosine Similarity features (Version 1)
This program implements a Linear Regression model with 4 features. The 4 features are the cosine similarity values of the search query with each of: product description, product title, product attribute name and product attribute value.

### Overview of the methodology:

It first builds the vocab of words under a column and then computes the tfidf vector of that column for each training example. The key part of the implementation is that it uses both unigrams and bigrams of words to build the tfidf representation. Using the idf values learned from the column, the tfidf vector of the search query is then computed over the same vocab. Each example therefore has 2 vectors per column: a document vector and a search vector. Each feature is the cosine similarity of a document vector with the corresponding search vector.
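
As a minimal sketch of this idea on toy data: the documents and query below are made up, and the query is weighted by calling `vectorizer.transform` directly, which is a simplification of the manual query weighting this notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for one column (e.g. product titles).
docs = [
    "steel angle bracket",
    "wooden shelf bracket",
    "cordless power drill",
]
query = "angle bracket"

# Build the vocab from unigrams and bigrams and fit the idf weights.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(docs)

doc_vectors = vectorizer.transform(docs)        # one sparse row per document
search_vector = vectorizer.transform([query])   # one sparse row for the query

# One cosine similarity per document; higher means a closer match.
similarities = cosine_similarity(search_vector, doc_vectors)[0]
for doc, sim in zip(docs, similarities):
    print(f"{sim:.3f}  {doc}")
```

The first document shares two unigrams and a bigram with the query, the second shares one unigram, and the third shares nothing, so the scores should come out in that order.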

### Bigram feature engineering:
It makes more sense to match a word in its context than in isolation. If a pair of words appears consecutively, in the same order, in both a document and the search query, the two vectors should score as more similar than if only the individual words matched.
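
To illustrate why word order matters, here is a small standard-library sketch. The helper `unigrams_and_bigrams` is a hypothetical name introduced for this example and is not part of the notebook; it only mimics the kind of terms a unigram plus bigram vocab contains.

```python
import re

def unigrams_and_bigrams(text):
    """Return the lowercase words of `text` plus each pair of adjacent words."""
    words = re.findall(r"\w+", text.lower())
    bigrams = [first + " " + second for first, second in zip(words, words[1:])]
    return words + bigrams

doc_terms = set(unigrams_and_bigrams("Heavy duty angle bracket"))
same_order = set(unigrams_and_bigrams("angle bracket"))
swapped = set(unigrams_and_bigrams("bracket angle"))

# Both queries share the two unigrams with the document,
# but only the first one also shares the bigram.
print(sorted(doc_terms & same_order))
print(sorted(doc_terms & swapped))
```

The query with the words in document order overlaps on one extra term, which is what gives it the added similarity.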

In [1]:
import pandas as pd
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import linear_model
import re

The following method takes two vectors, the document vector and the search vector, and returns the cosine similarity between them. It first fills in the values of the search vector. Both vectors are passed in as sparse representations of the tfidf values of the words in the vocab.
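
As a small worked example of the unigram weighting used to fill in the search vector, (count of the word / number of query words) * idf, consider the sketch below. The function name `query_weights`, the vocab and the idf values are all assumptions made up for illustration, and bigrams are left out to keep it short.

```python
def query_weights(words, vocabulary, idf):
    """Weight each known query word as (count / query length) * idf."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1

    weights = {}
    for word, count in counts.items():
        index = vocabulary.get(word)
        if index is not None:  # words outside the vocab get no weight
            weights[word] = idf[index] * count / len(words)
    return weights

# Hypothetical vocab (word -> column index) and idf values.
vocabulary = {"angle": 0, "bracket": 1}
idf = [2.0, 1.5]

weights = query_weights(["angle", "bracket", "bracket", "zinc"], vocabulary, idf)
print(weights)
```

Here "angle" gets 2.0 * 1 / 4 = 0.5, "bracket" gets 1.5 * 2 / 4 = 0.75, and "zinc" is ignored because it is not in the vocab.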

In [2]:
# Takes the document and search vectors and returns the cosine similarity of the 2 vectors.
def computeCosineSim(words, doc_vector, search_vector, vectorizer):
    # First store the term frequencies of the query words.
    # (Named term_counts so that the built-in dict is not shadowed.)
    term_counts = {}
    for word in words:
        term_counts[word] = term_counts.get(word, 0) + 1

    doc_array = doc_vector.toarray()
    search_array = search_vector.toarray()

    # Get the idf of the vocab.
    idf = vectorizer.idf_

    # Now compute the tfidf of the query terms using the idf learned from the documents.
    # Tfidf of a word in the search query = (tf of the word) * (idf of the word in the document corpus)
    for word in term_counts:
        index = vectorizer.vocabulary_.get(word, -1)
        # Terms missing from the vocab are skipped: writing to index -1 would
        # overwrite the last column of the search vector.
        if index != -1:
            search_array[0][index] = (idf[index] * term_counts[word]) / len(words)

        # Computing the tfidf of the bigrams with this word as the first word.
        # Every pair of query words is tried; only pairs present in the vocab get a weight.
        for word2 in term_counts:
            index1 = vectorizer.vocabulary_.get(word + " " + word2, -1)
            if index1 != -1:
                search_array[0][index1] = (idf[index1] * (term_counts[word] + term_counts[word2])) / len(words)
    return cosine_similarity(search_array, doc_array)


Given a dataframe, a features matrix (the set of cosine similarities for each training/test example) and an open file, this method writes the feature values to the given csv file.

In [None]:

def writeFeaturesToFile(df, features, myfile):
    myfile.write("\"product_uid\"" + ",\"cosine_desc\"" + ",\"cosine_title\"" + ",\"cosine_attName\"" + ",\"cosine_attValue\"\n")
    for i in range(0, len(features)):
        myfile.write(str(df['product_uid'][i]) + "," + str(features[i][0]) + "," + str(features[i][1]) + "," + str(features[i][2]) + "," + str(features[i][3]) + "\n")
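
The same output can also be produced with the standard `csv` module, which handles quoting and separators. This is only a sketch of an alternative: `write_features` is a hypothetical name, and the ids and feature values are made up.

```python
import csv
import io

def write_features(product_uids, features, out_file):
    """Write one row per example: the product id followed by its 4 features."""
    writer = csv.writer(out_file, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(["product_uid", "cosine_desc", "cosine_title",
                     "cosine_attName", "cosine_attValue"])
    for uid, row in zip(product_uids, features):
        writer.writerow([uid] + list(row))

# Made-up ids and feature values, written to an in-memory buffer.
buffer = io.StringIO()
write_features([100001, 100002],
               [[0.1, 0.2, 0.0, 0.3], [0.5, 0.0, 0.0, 0.1]],
               buffer)
print(buffer.getvalue())
```

With `QUOTE_NONNUMERIC` the header names are quoted and the numeric values are not, matching the layout written by the method above.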


The first step in extracting the features is merging the files. For the training set, the preprocessing step works on 3 files: attributes.csv, train.csv and product_descriptions.csv. For the test set, test.csv takes the place of train.csv. Motivation: we need the descriptions, titles and attributes of each product in one place.
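
The merge can be sketched on a few made-up rows as follows. The column names follow the files named above, but the data is invented and the aggregation is written with `groupby(...).agg(...)`, which is an assumption about a convenient way to do it rather than the exact calls used in this notebook.

```python
import pandas as pd

# Made-up rows in the shape of attributes.csv and product_descriptions.csv.
attributes = pd.DataFrame({
    "product_uid": [100001, 100001, 100002],
    "name": ["Material", "Color", "Material"],
    "value": ["Steel", "Grey", "Wood"],
})
descriptions = pd.DataFrame({
    "product_uid": [100001, 100002, 100003],
    "product_description": ["angle bracket", "shelf bracket", "power drill"],
})

# Collapse the attribute rows into one row per product by joining the text.
per_product = attributes.groupby("product_uid", as_index=False).agg(
    {"name": " ".join, "value": " ".join}
)

# A left merge keeps every product description, even without attributes.
merged = pd.merge(descriptions, per_product, on="product_uid", how="left")
print(merged)
```

Product 100003 has no attributes, so its name and value come out as missing values; this is why the later steps fill missing cells before vectorizing.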

In [None]:
# This method takes the different product files and combines them into one on product_uid.
def preprocess(isTest):
    # Combining the rows of the attributes file on product_uid.
    if not isTest:
        myfile_att = open("att_mod.csv", "w")
        df_attributes = pd.read_csv("attributes.csv", encoding='latin-1')
        df_attributes['name'] = df_attributes['name'].astype(str) + " "
        df_attributes['value'] = df_attributes['value'].astype(str) + " "
        df_attributes = df_attributes.groupby('product_uid').apply(lambda x: x.sum())
        df_attributes = df_attributes.drop(df_attributes.columns[[0]], axis=1)
        df_attributes.to_csv(myfile_att, sep=',', quoting=csv.QUOTE_NONNUMERIC)
        myfile_att.close()

        df_attributes = pd.read_csv("att_mod.csv", encoding='latin-1')
        df_prodDesc = pd.read_csv("product_descriptions.csv", encoding='latin-1')
        result = pd.merge(pd.DataFrame(df_prodDesc), pd.DataFrame(df_attributes), on='product_uid', how='left')
        merged_file = open("merged.csv", "w")
        result.to_csv(merged_file, sep=',', quoting=csv.QUOTE_NONNUMERIC)
        merged_file.close()

        # Merge the prod_titles in the train file with the product details.
        df_train = pd.read_csv("train.csv", encoding='latin-1')
        df_merged = pd.read_csv("merged.csv", encoding='latin-1')
        result = pd.merge(pd.DataFrame(df_merged),
            pd.DataFrame(df_train)[['product_uid', 'product_title','search_term','relevance']], on='product_uid', how='inner')
        merged_file = open("training.csv", "w")
        result.to_csv(merged_file, sep=',', quoting=csv.QUOTE_NONNUMERIC)
        merged_file.close()
    else:
        df_attributes = pd.read_csv("att_mod.csv", encoding='latin-1')
        df_prodDesc = pd.read_csv("product_descriptions.csv", encoding='latin-1')
        result = pd.merge(pd.DataFrame(df_prodDesc), pd.DataFrame(df_attributes), on='product_uid', how='left')
        merged_file = open("merged_test.csv", "w")
        result.to_csv(merged_file, sep=',', quoting=csv.QUOTE_NONNUMERIC)
        merged_file.close()

        # Merge the prod_titles in the test file with the product details.
        df_test = pd.read_csv("test.csv", encoding='latin-1')
        df_merged = pd.read_csv("merged_test.csv", encoding='latin-1')
        result = pd.merge(pd.DataFrame(df_merged),
            pd.DataFrame(df_test)[['product_uid', 'product_title','search_term']], on='product_uid', how='inner')
        merged_file = open("testing.csv", "w")
        result.to_csv(merged_file, sep=',', quoting=csv.QUOTE_NONNUMERIC)
        merged_file.close()


### Vectorize the corpus:
This method vectorizes the text under a given column of the merged csv file. TfidfVectorizer builds the vocab and fits the idf weights on that column. Each training example is then transformed with the fitted vectorizer before its cosine similarity with the search query is computed. For the test set, the vocab is still built from the training file.

The vocab of each column (e.g. product description) consists of all the unigrams (individual words) and bigrams (pairs of adjacent words) that appear in any row of that column.
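
To make the vocab concrete, the sketch below fits a vectorizer on two made-up strings and lists the terms it learns. The input text is invented; only the `TfidfVectorizer` settings match the ones used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up column values standing in for e.g. product descriptions.
column_text = ["steel angle bracket", "angle grinder"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(column_text)

# vocabulary_ maps each unigram/bigram to its column index in the tfidf matrix.
for term in sorted(vectorizer.vocabulary_):
    print(term)

# A fitted vectorizer encodes any text as a 1 x vocab-size sparse row.
row = vectorizer.transform(["angle bracket"])
print(row.shape)
```

The vocab holds 4 unigrams and 3 bigrams here, so every encoded row has 7 columns regardless of the text being transformed.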

In [None]:
# Given the column to vectorize, this function builds the vocab from the words under
# the column, gets the tfidf vector representation, and computes and returns the
# cosine similarity of each row with its search query.
def vectorize(column):
    df_merged.fillna(' ', inplace=True)
    text = df_merged[column].values.astype('U')

    # create the transform - Using both unigrams and bigrams
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    # tokenize and build vocab
    vectorizer.fit(text)

    # X = vectorizer.fit_transform(text)
    # print(X.toarray())

    cosine = []
    # Encode each document based on the transform.
    for t in range(0, len(df_merged)):
        # Encode document
        doc_vector = vectorizer.transform([df_merged[column][t]])
        # print(doc_vector.toarray())

        # Now get the TFIDF of the search query.
        search_vector = vectorizer.transform([df_merged['search_term'][t]])
        words = re.findall(r"\w+", str(df_merged['search_term'][t]))
        cos = computeCosineSim(words, doc_vector, search_vector, vectorizer)

        cosine.append(cos)
    return cosine


# Returns the cosine Similarities of the specified column in the test file with the search query.
def vectorizeTest(column):
    df_merged.fillna(' ', inplace=True)
    df_test_merged.fillna(' ', inplace=True)
    text = df_merged[column].values.astype('U')

    # create the transform - Using both unigrams and bigrams
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    # tokenize and build vocab
    vectorizer.fit(text)

    cosine_test = []
    # Encode each document based on the transform.
    for t in range(0, len(df_test_merged)):
        doc_vector = vectorizer.transform([df_test_merged[column][t]])

        # Now get the TFIDF of the search query.
        search_vector = vectorizer.transform([df_test_merged['search_term'][t]])
        words = re.findall(r"\w+", str(df_test_merged['search_term'][t]))
        # words = str(df_test_merged['search_term'][t]).split(" ")
        cos = computeCosineSim(words, doc_vector, search_vector, vectorizer)
        cosine_test.append(cos)
    return cosine_test


### Flow of the program:

The following section is the driver of the program.

Steps:

1) Preprocess the train and test files to produce one merged file for the train data and one for the test data.

2) Compute the cosine similarities of the queries (search terms) with product description, product title, product attribute name, product attribute value.

3) Collect the resulting 4 values for each training example.

4) Learn a Linear Regression model based on the input and output.
    Input = 4 Cosine Similarity values
    Output = Relevance values for each training row

5) Compute the same 4 features for the test data.

6) Predict using Linear Regression.

7) Print the predictions to submission.csv.
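
Steps 4 to 6 can be sketched in isolation as follows. All feature values and relevance labels below are made up purely to show the shapes involved; they are not taken from the data set.

```python
import numpy as np
from sklearn import linear_model

# Made-up feature rows: [cosine_desc, cosine_title, cosine_attName, cosine_attValue].
train_features = np.array([
    [0.10, 0.40, 0.00, 0.05],
    [0.30, 0.10, 0.20, 0.00],
    [0.00, 0.00, 0.10, 0.10],
    [0.50, 0.60, 0.00, 0.20],
    [0.20, 0.30, 0.30, 0.10],
    [0.05, 0.20, 0.10, 0.00],
])
# Made-up relevance labels, one per training row.
train_relevance = np.array([2.3, 2.0, 1.3, 3.0, 2.7, 1.7])

test_features = np.array([
    [0.25, 0.35, 0.05, 0.10],
    [0.00, 0.05, 0.00, 0.00],
])

# Fit on the training features, then predict one relevance per test row.
regr = linear_model.LinearRegression()
regr.fit(train_features, train_relevance)
predictions = regr.predict(test_features)

print("Coefficients:", regr.coef_)
print("Predictions:", predictions)
```

The model learns one coefficient per feature, so `regr.coef_` has 4 entries in the same order as the feature columns.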


## Result:
### RMSE = 0.52134
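
For reference, RMSE is the square root of the mean squared difference between predicted and actual relevance. The sketch below shows that calculation on made-up numbers; the notebook does not say how the figure above was obtained, so this is only the standard definition.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predictions and labels, only to show the calculation.
print(rmse([2.0, 3.0, 1.0], [2.0, 1.0, 1.0]))
```
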
#### Coefficients of Linear Regression Model: [ 0.86960024,  0.70365176,  0.31511568, -0.42450679]

In [None]:
# Preprocess the training file
preprocess(False)
# Preprocess the test file
preprocess(True)

df_merged = pd.read_csv("training.csv", encoding='latin-1')
df_test_merged = pd.read_csv("testing.csv", encoding='latin-1')

# Feature columns, in the order the cosine similarities are stored for each row.
columns = ['product_description', 'product_title', 'name', 'value']

features = [[] for _ in range(len(df_merged))]
features_test = [[] for _ in range(len(df_test_merged))]

for column in columns:
    # Each entry is a 1x1 matrix, so [0][0] extracts the similarity value.
    for row, cos in zip(features, vectorize(column)):
        row.append(cos[0][0])
    for row, cos in zip(features_test, vectorizeTest(column)):
        row.append(cos[0][0])


train_features = open("train_features.csv", "w")
writeFeaturesToFile(df_merged, features, train_features)
train_features.close()

test_features = open("test_features.csv", "w")
writeFeaturesToFile(df_test_merged, features_test, test_features)
test_features.close()

df_relevance = df_merged['relevance']

regr = linear_model.LinearRegression()

regr.fit(features, df_relevance)
print("Coefficients: " + str(regr.coef_))

test_predictions = regr.predict(features_test)

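# NOTE: predictions are paired with ids by row position, which assumes testing.csv
# has the same row order as test.csv. The merge in preprocess() follows the row order
# of its left table, so verify this before relying on the output.
df_test_ids = pd.read_csv("test.csv", encoding='latin-1')['id']

myfile = open("submission.csv", "w")
myfile.write("\"id\"" + ",\"relevance\"\n")
for i in range(0, len(df_test_ids)):
    myfile.write(str(df_test_ids[i]) + "," + str(test_predictions[i]) + "\n")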
myfile.close()