# Final Project - Advanced Topics in Computer Science 8
Yehonatan Peisakhovsky yehonatan-pe@campus.technion.ac.il ?????????
Ido Zuckerman ido.z@campus.technion.ac.il 323102830

## Topic
Strategic classification, specifically the behavior of strategic suppliers in information retrieval.

## Introduction
In information retrieval, a domain closely related to recommendation systems, the system's objective is to retrieve relevant information from a large collection of data in response to a user's query. It does so by finding the documents with the highest probability of being relevant to the query.
The objective of content creators, who produce that data, is different: they want as many people as possible to read their content (news sites), buy their products (Amazon, eBay, etc.), or see the advertisements on their site. In each case, the content creator's goal is to get users to choose their document.
Because higher exposure usually translates into more users consuming their content, content creators want to be ranked as high as possible for as many queries as possible. They may therefore change their content to rank higher and gain exposure, even if the changes are superficial and serve only to improve the ranking.
Moreover, not all queries carry the same value: a user searching for a new phone is worth much more to an e-commerce site than a user searching for bottled water. Content creators may therefore tailor their content to the higher-value queries.
In this work, we explore how such strategic behavior affects the overall performance of the information retrieval system, and how it affects the content creators' revenue.
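
The manipulation described in this introduction can be illustrated with a toy ranking. The sketch below is only an illustration under stated assumptions: the three documents, the query, and the helper `rank_documents` are made up for this example, and the ranker is a plain TF-IDF cosine similarity built with scikit-learn, not the models used in the experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_documents(docs, query):
    """Return document indices sorted from most to least similar to the query.

    Hypothetical helper for this illustration only.
    """
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Made-up corpus: document 0 is relevant to the query, documents 1 and 2 are not
docs = [
    "review of the latest phone with a great camera",
    "recipe for lemon cake with fresh cream",
    "history of the bicycle in europe",
]
query = "new phone camera"

before = rank_documents(docs, query)

# The creator of the irrelevant document 1 appends query terms to gain exposure
stuffed_docs = list(docs)
stuffed_docs[1] = docs[1] + " phone camera phone camera phone camera"
after = rank_documents(stuffed_docs, query)

print("Top document before manipulation:", before[0])
print("Top document after manipulation:", after[0])
```

In this toy setting the relevant document is ranked first before the change, and the stuffed, irrelevant document overtakes it afterwards, which is the kind of superficial change the project studies.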


## Research Question
How does strategic behavior of content creators affect system integrity?
How does strategic behavior of content creators affect their revenue?


## Assumptions
Our assumptions are:
1.	Different information retrieval systems will behave differently when exposed to strategic behavior.
2.	Strategic behavior will decrease the performance of the information retrieval system.
3.	Strategic behavior will increase content creators' revenue only up to a point: if there are too many strategic agents, strategic behavior might not yield additional revenue.



## Methods and Experiments
????? (experiments and metrics)

## Results

In [None]:
from enviorment import Env
from utils import *
from sklearn.model_selection import KFold
from sklearn.metrics import ndcg_score
from tqdm import tqdm
import random
from collections import defaultdict
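import matplotlib.pyplot as plt
# Imported explicitly instead of relying on `from utils import *` to expose them
import nltk
import numpy as np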

nltk.download('stopwords')

#### Hyperparameters

In [None]:
amount_of_queries = 50
k = 5
Epochs = 3

#### Preprocess

In [None]:
docs, queries, rel = create_documents_queries_and_relevence()
queries_with_relevant_docs = list(rel.keys())
queries = {k: queries[k] for k in queries_with_relevant_docs}
docs, queries = preprocess_data(docs, queries)

# Temp Experiments

In [None]:
def run_exp(strategic_per_nums, number_of_word_to_add_nums, model, k=4, strategy="top_k_softmax"):
    agents_params = {"strategy": strategy, "k": k, "t": 100}
    print(f"\nModel: {model}\tstrategic_per: {strategic_per_nums}\tnumber_of_word_to_add: {number_of_word_to_add_nums}"
          f"\tstrategy: {strategy}\tk: {k}")
    # Preprocess
    docs, queries, rel = create_documents_queries_and_relevence()
    queries_with_relevant_docs = list(rel.keys())
    queries = {k: queries[k] for k in queries_with_relevant_docs}
    docs, queries = preprocess_data(docs, queries)


    # Chosen queries to work with and do cross validation
    chosen_queries = [x for x in queries.keys()][:amount_of_queries]
    cv = KFold(5)
    undisturbed_ndcgs = defaultdict(float)
    disturbed_ndcgs = defaultdict(float)
    agents_values = defaultdict(float)
    for strategic_per in tqdm(strategic_per_nums):
        for number_of_word_to_add in number_of_word_to_add_nums:
            # original ndcg without changing the documents
            undisturbed_ndcg = []
            # Ndcg after the change
            disturbed_ndcg = []
            agents_value = []
            for train_indices, test_indices in cv.split(chosen_queries):
                # The ids of the queries used to train,
                # they actually matter just when using rank svm
                train_queries_ids = [chosen_queries[x] for x in train_indices]
                # The ids of the queries used to test the model
                test_queries_ids = [chosen_queries[x] for x in test_indices]
                train_queries = {k: queries[k] for k in train_queries_ids}
                train_rel = {k: rel[k] for k in train_queries_ids}
                test_queries = {k: queries[k] for k in test_queries_ids}
                test_rel = {k: rel[k] for k in test_queries_ids}
                # Assigns random values to the queries, we can change seed
                queries_worth = create_queries_worth(test_queries)
                # Environment initialization
                environment = Env(docs=docs, strategic_percentage=strategic_per, model_type=model,
                                  number_of_word_to_add=number_of_word_to_add, train_queries=train_queries,
                                  train_relevance_ranking=train_rel, seed=10, agents_params=agents_params)
                # Epochs is the number of times we re-draw which agents are strategic.
                # TODO: decide how strategic agents should be chosen. They are currently
                # chosen at random, so with too many epochs the NDCG with and without
                # corruption will be about the same (sometimes relevant strategic
                # documents are pushed up, sometimes irrelevant ones are).
                for i in range(Epochs):
                    environment.change_config()
                    # Scores of unchanged documents and real ndcg
                    queries_scores = environment.run(test_queries)
                    strategic_agents_value_pre_change = environment.calculate_strategic_revenue(test_queries,queries_worth)
                    for query_id, scores in queries_scores.items():
                        gt = np.array([[1 if i in test_rel[query_id] else 0 for i in docs.keys()]])
                        undisturbed_ndcg.append(ndcg_score(gt, scores))
                    # Makes strategic agents change their documents
                    environment.corrupt(test_queries, queries_worth)
                    # Scores after documents change
                    queries_scores = environment.run(test_queries)
                    strategic_agents_value_post_change = environment.calculate_strategic_revenue(test_queries,queries_worth)
                    agents_value.append(strategic_agents_value_post_change/strategic_agents_value_pre_change)
                    for query_id, scores in queries_scores.items():
                        gt = np.array([[1 if i in test_rel[query_id] else 0 for i in docs.keys()]])
                        disturbed_ndcg.append(ndcg_score(gt, scores))
            # print(f"Undisturbed ndcg {np.mean(undisturbed_ndcg):.4f}")
            # print(f"Disturbed ndcg {np.mean(disturbed_ndcg):.4f}")
            undisturbed_ndcgs[(strategic_per, number_of_word_to_add)] = np.mean(undisturbed_ndcg)
            disturbed_ndcgs[(strategic_per, number_of_word_to_add)] = np.mean(disturbed_ndcg)
            agents_values[(strategic_per, number_of_word_to_add)] = np.mean(agents_value)
    # Figure 1: NDCG after manipulation for each (strategic percentage, words added) pair
    ax = plt.axes(projection='3d')
    keys = np.array([[x, y, value] for (x, y), value in disturbed_ndcgs.items()])
    zdata = keys[:, 2]
    xdata = keys[:, 0]
    ydata = keys[:, 1]
    ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens')
    ax.dist = 13

    ax.set_xlabel("Strategic Percentage")
    ax.set_ylabel("Number of Words Added")
    ax.set_zlabel("Disturbed NDCG")
    ax.set_title(f"Model: {model}")
    plt.show()

    # Figure 2: strategic agents' revenue ratio (post-change / pre-change).
    # Create a new axes: the one above belongs to the figure already shown by plt.show().
    ax = plt.axes(projection='3d')
    keys = np.array([[x, y, value] for (x, y), value in agents_values.items()])
    zdata = keys[:, 2]
    xdata = keys[:, 0]
    ydata = keys[:, 1]
    ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens')
    ax.dist = 13

    ax.set_xlabel("Strategic Percentage")
    ax.set_ylabel("Number of Words Added")
    ax.set_zlabel("Agent Value")
    ax.set_title(f"Model: {model}")
    plt.show()
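
`run_exp` measures retrieval quality by calling `ndcg_score(gt, scores)` with a binary ground-truth row per query. The sketch below shows that metric on toy data, as a rough illustration only: the document ids, the relevance set, and both score vectors are invented for the example and do not come from the experiments.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Made-up collection: d1 and d3 are the documents relevant to one query
doc_ids = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}

# Binary ground truth with shape (1, n_docs): one row per query
gt = np.array([[1 if doc_id in relevant else 0 for doc_id in doc_ids]])

# Hypothetical model scores before and after a strategic change:
# the irrelevant document d2 is boosted above every other document
scores_before = np.array([[0.9, 0.2, 0.8, 0.1]])
scores_after = np.array([[0.9, 0.95, 0.8, 0.1]])

ndcg_before = ndcg_score(gt, scores_before)
ndcg_after = ndcg_score(gt, scores_after)

print(f"NDCG before: {ndcg_before:.4f}")
print(f"NDCG after:  {ndcg_after:.4f}")
```

Before the change both relevant documents occupy the top two positions, so the ranking is ideal. After the change an irrelevant document sits at rank 1 and the NDCG drops, which is the effect the undisturbed and disturbed NDCG values in `run_exp` are meant to capture.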

In [None]:
run_exp(np.arange(0, 1, 0.2), range(1, 10000, 2000), "tf_idf", strategy="dominant_word_not_present")

In [None]:
run_exp(np.arange(0, 1, 0.2), range(1, 10000, 2000), "okapi_bm25", strategy="dominant_word_not_present")

In [None]:
for k in range(1, 5):
    run_exp(np.arange(0, 1, 0.2), range(1, 10000, 2000), "tf_idf", k=k, strategy="top_k_softmax")

In [None]:
run_exp(np.arange(0, 1, 0.2), range(1, 10000, 2000), "tf_idf")

In [None]:
run_exp(np.arange(0, 1, 0.2), range(1, 10000, 2000), "rank_svm")

In [None]:
run_exp(np.arange(0, 1, 0.1), range(1, 10000, 1000), "okapi_bm25")
run_exp(np.arange(0, 1, 0.1), range(1, 10000, 1000), "tf_idf")
run_exp(np.arange(0, 1, 0.1), range(1, 10000, 1000), "rank_svm")
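
The experiments above record the ratio between the strategic agents' revenue after and before they change their documents. The implementation of `Env.calculate_strategic_revenue` is not shown in this notebook, so the sketch below is only one possible reading of it. It assumes revenue is the query's worth multiplied by the number of strategic documents in the top `top_k` results; the function `strategic_revenue`, its parameters, and all the numbers are hypothetical.

```python
import math

import numpy as np


def strategic_revenue(scores_per_query, query_worth, strategic_docs, top_k=1):
    """Sum, over queries, the query worth times the number of strategic
    documents among the top_k highest-scoring documents.

    Hypothetical stand-in; not the project's calculate_strategic_revenue.
    """
    revenue = 0.0
    for query_id, scores in scores_per_query.items():
        top = np.argsort(scores)[::-1][:top_k]
        hits = sum(1 for doc_idx in top if int(doc_idx) in strategic_docs)
        revenue += query_worth[query_id] * hits
    return revenue


def revenue_ratio(pre, post):
    """Post/pre revenue ratio; NaN when the baseline revenue is zero."""
    return post / pre if pre else math.nan


# Made-up example: q1 is worth ten times more than q2, document 2 is strategic
query_worth = {"q1": 10.0, "q2": 1.0}
strategic_docs = {2}

scores_pre = {"q1": np.array([0.9, 0.5, 0.1]), "q2": np.array([0.2, 0.3, 0.8])}
# After the change, document 2 also tops the high-value query q1
scores_post = {"q1": np.array([0.9, 0.5, 0.95]), "q2": np.array([0.2, 0.3, 0.8])}

pre = strategic_revenue(scores_pre, query_worth, strategic_docs)
post = strategic_revenue(scores_post, query_worth, strategic_docs)
ratio = revenue_ratio(pre, post)

print(f"Revenue before: {pre}, after: {post}, ratio: {ratio}")
```

The explicit zero check is a design choice of this sketch: a ratio is undefined whenever the baseline revenue is zero, which could happen if no strategic document earns anything before the change.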

## Discussion
?????

## Limitations
?????