# COMP-9318 Final Project

## Instructions:
1. This notebook contains instructions for **COMP9318 Final-Project**.

* You are required to complete your implementation in a file `submission.py` provided along with this notebook.

* Do not print unnecessary output. We will ignore anything printed to the screen; all results must be returned by the corresponding functions in the appropriate data structures.

* This notebook contains all the details required for the project. Detailed instructions, including **CONSTRAINTS**, **FEEDBACK** and **EVALUATION**, are provided in the respective sections. If you have further questions, post them on Piazza.

* This project is **time-consuming**, so it is highly advised that you start working on this as early as possible.

* You are allowed to use only the permitted libraries and modules (as mentioned in the **CONSTRAINTS** section). Do not import unnecessary modules/libraries: if such a module cannot be imported at test time, your submission will fail with an error.

* You are **NOT ALLOWED** to use dictionaries and/or external data resources for this project.

* We will provide you **LIMITED FEEDBACK** for your submission (only **15** attempts allowed to each group). Instructions for the **FEEDBACK** and final submission are given in the **SUBMISSION** section.

* For **Final Evaluation** we will be using a different dataset, so your final scores may vary.  

* Submission deadline for this assignment is **23:59:59 on 27-May, 2018**.
* **Late Penalty: 10% on day 1 and 20% on each subsequent day.**

## Introduction:

In this project, you are required to devise an algorithm/technique to fool a binary classifier named `target-classifier`. You only have access to the following information:

<br>
1. The `target-classifier` is a binary classifier classifying data to two categories, $\textit{i.e.}$, **class-1** and **class-0**.

2. You have access to part of the classifier's training data, $\textit{i.e.}$, a sample of 540 paragraphs: 180 for **class-1** and 360 for **class-0**, provided in the files `class-1.txt` and `class-0.txt` respectively.

3. The `target-classifier` belongs to the SVM family.

4. The `target-classifier` allows **EXACTLY 20 DISTINCT** modifications in each test sample.
5. You are provided with a test sample of **200** paragraphs from **class-1** (in the file `test_data.txt`). You can use these test samples to get feedback from the target classifier (**only 15 attempts** allowed to each group).
6. **NOTE: You are not allowed to use the data `test_data.txt` for your model training (if any). VIOLATIONS in this regard will get ZERO score**.

<br>
### To-do:
* You are required to come up with an algorithm named `fool_classifier()` that makes the best use of the above-mentioned information (**points 1-4**) to fool the `target-classifier`. By fooling the classifier we mean that your algorithm causes the classifier to misclassify as many test instances (**point 5**) as possible using a limited number of modifications (**EXACTLY 20 DISTINCT** modifications allowed to each test sample).

* **NOTE::** We put a **harsh limit** on the number of modifications allowed for each test instance. You are only allowed to modify each test sample by **EXACTLY 20 DISTINCT tokens (NO MORE NO LESS)**.

* **NOTE::** **ADDING** or **DELETING** one word at a time is **ONE** modification. Replacement will be considered as **TWO** modifications $(\textit{i.e.,}$ **Deletion** followed by **Insertion**).
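
The counting rule above can be illustrated with a short sketch. It assumes, as the provided `check_data()` function does, that modifications are counted over the *sets* of tokens in the original and modified samples; the function name `count_modifications` is hypothetical and not part of `helper.py`.

```python
def count_modifications(original_tokens, modified_tokens):
    """Count distinct tokens that were added or deleted (set symmetric difference)."""
    original, modified = set(original_tokens), set(modified_tokens)
    return len(original ^ modified)


original = "the quick brown fox".split(" ")
deleted_one = "the quick fox".split(" ")       # "brown" deleted
replaced_one = "the quick red fox".split(" ")  # "brown" replaced by "red"

print(count_modifications(original, deleted_one))   # 1
print(count_modifications(original, replaced_one))  # 2
```

Because the count is set-based, deleting every occurrence of a repeated word still counts as a single distinct modification under this reading.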

## Constraints

Your implementation `submission.py` should comply with the following constraints.

1. You should implement your methodology using `Python3`.
* You should implement your code in the function `fool_classifier()` in the file `submission.py`. 
* You are only allowed to use the pre-defined class `strategy()`, defined in the file `helper.py`, to train your models (if any). 
* You **should not** do any pre-processing on the data. We have already pre-processed the data for you.  
* You are supposed to implement your algorithm using **scikit-learn (version=0.19.1)**. We will **NOT** accept implementations using other Libraries.

* You are **not supposed to augment** the data using external/additional resources.  You are only allowed to use the partial training data provided to you ($\textit{i.e.,} $ `class-1.txt` and `class-0.txt`).

* You are **not** allowed to use the test samples ($\textit{i.e.,}$ `test_data.txt`) for model training and/or inference building. You can only use this data for testing, $\textit{i.e.,}$ calculating success %-age (as described in the **EVALUATION** section.). **VIOLATIONS IN THIS REGARD WILL GET ZERO SCORE**.

* You are **not** allowed to hard code the ground truth and any other information into your implementation `submission.py`. 

* Considering the **RUNNING TIME**, your implementation is supposed to read the test data file ($\textit{i.e.,}$ `test_data.txt` with 200 test samples), process it and write the modified file (`modified_data.txt`) within **12 Minutes**.

* Each modified test sample in the modified file (`modified_data.txt`) must differ from the corresponding original test sample in `test_data.txt` by exactly 20 distinct tokens, which is what `check_data()` asserts.

* **NOTE::** Inserting or Deleting a word is **ONE** modification. Replacement will be considered as **TWO** modifications $(\textit{i.e.,}$ deletion followed by insertion).
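
To check the 12-minute running-time constraint locally, one option is to time the whole call. The sketch below is only an illustration: `run_with_timing` is a hypothetical helper, and the built-in `sum` stands in for `fool_classifier` so that the example is self-contained.

```python
import time

TIME_LIMIT_SECONDS = 12 * 60  # the 12-minute limit stated in the constraints


def run_with_timing(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in practice you would pass fool_classifier and the test file path.
result, elapsed = run_with_timing(sum, range(1000))
print("within limit:", elapsed < TIME_LIMIT_SECONDS)
```

Timings on your own machine are only indicative, since the test environment may be slower or faster.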

## Submission Instructions:

* Please read these instructions **VERY CAREFULLY**.

### FEEDBACK:
* For this project, we will provide real-time feedback on the test data ($\textit{i.e.,}$ the file `test_data.txt` containing **200** test cases).
* Each group is allowed only **15 attempts in TOTAL**, so use your attempts **WISELY**.
* We will only provide **AGGREGATE FEEDBACK** ($\textit{i.e.,}$ how many modified test samples out of **200** were classified as Class-0). We **WILL NOT** provide detailed feedback for individual test cases.
* For the feedback, you are required to submit the modified text file ($\textit{i.e.,}$ `modified_data.txt`) via the submission portal: http://kg.cse.unsw.edu.au:8318/project/ (using Group name and Group password).
* **NOTE::** Please make sure that the modified text file is generated by your program `fool_classifier()`, and that it obeys the modification constraints. We have provided a function named `check_data()` in the class `strategy()` to check whether the modified file `modified_data.txt` obeys the constraints.

* Your algorithm should modify each test sample in `test_data.txt` by **EXACTLY 20 DISTINCT TOKENS**.

### Final Submission:
1. For final submission, you need to submit:
    * Your code in the file `submission.py`
    * A report (`report.pdf`) outlining your approach for this project.
2. We will release the detailed instructions for the final submission via Piazza.

## Implementation Details

1. In the file `submission.py`, you are required to implement a function named `fool_classifier()` that reads a text file named `test_data.txt` from the Present Working Directory (PWD), and writes out the modified text file `modified_data.txt` in the same directory.
* We have provided the implementation of the **strategy** class in a separate file `helper.py`. You are supposed to use this class for your model training (if any) and inference building.

* **Detailed description of input and/or output parts is given below:**

### Input: 
* The function `fool_classifier()` reads a text file named `test_data.txt` containing approximately 500-1500 test samples. Each line in the input file corresponds to a single test sample.

* **Note:** We will also provide the partial training data ($\textit{(i)}$ `class-0.txt` and $\textit{(ii)}$ `class-1.txt`) in the test environment. You can  access this data using the class: `strategy()`. 

### Output:
* You are supposed to write the modified file, named `modified_data.txt`, in the same directory and in the same format as `test_data.txt`. In addition, your program is supposed to return the instance of the `strategy` class defined in `helper.py`.


* **Note:** Please make sure that the file: `modified_data.txt` is generated by your code, and it follows the **MODIFICATION RESTRICTIONS (ADD** and/or **DELETE EXACTLY 20 DISTINCT TOKENS)**. In case of **ERRORS**, we will **NOT** allow more feedback attempts. 
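
As a minimal sketch of the input/output format described above (one sample per line, tokens separated by single spaces, mirroring how `helper.py` reads the files), the round trip could look like the following. The helper names `read_samples` and `write_samples` are hypothetical, and a temporary directory is used here instead of the working directory so that the example does not overwrite real data.

```python
import os
import tempfile


def read_samples(path):
    """Read one sample per line and split it into tokens on single spaces."""
    with open(path, "r") as infile:
        return [line.strip().split(" ") for line in infile]


def write_samples(path, samples):
    """Write one sample per line, joining tokens with single spaces."""
    with open(path, "w") as outfile:
        for tokens in samples:
            outfile.write(" ".join(tokens) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test_data.txt")
    dst = os.path.join(tmp, "modified_data.txt")
    write_samples(src, [["a", "b", "c"], ["d", "e"]])
    samples = read_samples(src)
    write_samples(dst, samples)       # unmodified copy, format check only
    round_trip = read_samples(dst)

print(round_trip == samples)
```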

In [4]:
from sklearn import svm

# We have provided these implementations in the file helper.py, provided along with this project.
## Please do not change these functions.
###################
class countcalls(object):
    __instances = {}
    def __init__(self, f):
        self.__f = f
        self.__numcalls = 0
        countcalls.__instances[f] = self
    def __call__(self, *args, **kwargs):
        self.__numcalls += 1
        return self.__f(*args, **kwargs)
    @staticmethod
    def count(f):
        return countcalls.__instances[f].__numcalls
    @staticmethod
    def counts():
        res = sum(countcalls.count(f) for f in countcalls.__instances)
        for f in countcalls.__instances:
            countcalls.__instances[f].__numcalls = 0
        return res
    
## Strategy() class provided in helper.py to facilitate the implementation.
class strategy:
    ## Read in the required training data...
    def __init__(self):
        with open('class-0.txt','r') as class0:
            class_0=[line.strip().split(' ') for line in class0]
        with open('class-1.txt','r') as class1:
            class_1=[line.strip().split(' ') for line in class1]
        self.class0=class_0
        self.class1=class_1
    
    @countcalls
    def train_svm(parameters, x_train, y_train):
        ## Populate the parameters...
        gamma=parameters['gamma']
        C=parameters['C']
        kernel=parameters['kernel']
        degree=parameters['degree']
        coef0=parameters['coef0']
        
        ## Train the classifier...
        clf = svm.SVC(kernel=kernel, C=C, gamma=gamma, degree=degree, coef0=coef0)
        assert x_train.shape[0] <=541 and x_train.shape[1] <= 5720
        clf.fit(x_train, y_train)
        return clf
    
    ## Function to check the Modification Limits...(You can modify EXACTLY 20-DISTINCT TOKENS)
    def check_data(self, original_file, modified_file):
        with open(original_file, 'r') as infile:
            data=[line.strip().split(' ') for line in infile]
        Original={}
        for idx in range(len(data)):
            Original[idx] = data[idx]

        with open(modified_file, 'r') as infile:
            data=[line.strip().split(' ') for line in infile]
        Modified={}
        for idx in range(len(data)):
            Modified[idx] = data[idx]

        for k in sorted(Original.keys()):
            record=set(Original[k])
            sample=set(Modified[k])
            assert len((set(record)-set(sample)) | (set(sample)-set(record)))==20
        return True

In [None]:
constants = ['#' * i for i in range(1,100)]

In [None]:
test_dt = None
with open('test_data.txt', 'r') as infile:
    test_dt = [line.strip().split(' ') for line in infile]
' '.join(test_dt[0])

In [None]:
import numpy as np
import sklearn
from datetime import datetime

start_time = datetime.now()

st = strategy()
parameters = {
    'C': 1,
    'gamma': 'auto',
    'kernel': 'linear',
    'coef0': 0.0,
    'degree': 3
}

lines = [' '.join(line) for line in st.class0] + [' '.join(line) for line in st.class1]

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

# cv = TfidfVectorizer()
cv = CountVectorizer()
# print(X_train)
X_train_tfidf = cv.fit_transform(lines)
target = np.array([0] * 360 + [1] * 180)

model = svm.SVC(kernel=parameters['kernel'], C=parameters['C'], gamma=parameters['gamma'], 
                degree=parameters['degree'], coef0=parameters['coef0'])
model.fit(X_train_tfidf, target)

test_lines = [' '.join(i) for i in test_dt]
# test_dataset = cv.transform(test_lines)
test_dataset = cv.transform(test_lines)
predicted = model.predict(test_dataset)
print('predicted1', np.mean(predicted == 1))

top_coef_sorted = np.argsort(model.coef_.toarray()[0])[::-1]
top_features = np.array(cv.get_feature_names())

modified_list = []

for record in test_dt:
    record_new = record
    for coef_index in top_coef_sorted:
        feature = top_features[coef_index]

        record_new = [word for word in record_new if word != feature]

        if len((set(record) - set(record_new)) | \
               (set(record_new) - set(record))) == 20: # no more modifications
            break       

    if len((set(record) - set(record_new)) | \
               (set(record_new) - set(record))) != 20: 
        for const in constants:
            if const not in record_new:
                record_new += [const]
            if len((set(record) - set(record_new)) | \
               (set(record_new) - set(record))) == 20: 
                break

    modified_list.append(record_new)  # inside the loop, so every record is kept

# new_file = open("modified_data.txt", "w")

# for i in modified_list:
#     new_file.write(' '.join(i))
#     new_file.write('\n')
# new_file.close()

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))


# modified_dt = None
# with open('modified_data.txt', 'r') as infile:
#     modified_dt = [line.strip().split(' ') for line in infile]
# modified_dtset = cv.transform([' '.join(i) for i in modified_dt])
# predicted2 = model.predict(modified_dtset)
# np.mean(predicted2 == 1)

In [None]:
# from collections import defaultdict

def change_set(line):
    return set(line.split(' '))

a = 'michael owen'
change_set(a)

In [5]:
import numpy as np
import sklearn
from datetime import datetime
from sklearn.feature_extraction.text import CountVectorizer

st = strategy()
parameters = {
    'C': 1,
    'gamma': 'auto',
    'kernel': 'linear',
    'coef0': 0.0,
    'degree': 3
}
lines = [' '.join(line) for line in st.class0] \
            + [' '.join(line) for line in st.class1]
    
mod_lines = None
with open('modified_data.txt', 'r') as infile:
    mod_lines = [line.strip() for line in infile]
    
test_lines = None
with open('test_data.txt', 'r') as infile:
    test_lines = [line.strip() for line in infile]

target = np.array([0] * 360 + [1] * 180)
cv = CountVectorizer()
cv.fit(lines)
X_train = cv.transform(lines)
model = st.train_svm(parameters, X_train, np.array([0] * 360 + [1] * 180))
features = np.array(cv.get_feature_names())

# line0_sorted = np.argsort(X_train[0].toarray()[0])[::-1]
# print(features[line0_sorted[:96]])
# print(X_train[0].toarray()[0][line0_sorted[:96]])

print(model.predict(cv.transform(test_lines)))
print(model.predict(cv.transform(mod_lines)))

# np.where(features == 'hello')




[1 0 0 1 0 1 1 0 1 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0 1
 0 0 1 0 0 1 0 0 1 1 1 0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 1 0 1 0
 1 1 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 1 1 1 1 0 0 1 1 1 0 0 1 0 1 1 1 0 0 0
 1 0 1 0 0 1 1 1 0 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 1 1 1 1 1
 0 1 0 0 1 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0
 0 0 0 1 1 0 1 0 0 0 1 0 1 1 1]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [None]:
l1 = test_lines[0]
l2 = mod_lines[0]

diff1 = change_set(l1) - change_set(l2)
diff2 = change_set(l2) - change_set(l1)
removed = diff1 | diff2
# print(removed, len(removed))
# print(model.predict(cv.transform(test_lines[:1])))
# print(model.predict(cv.transform(mod_lines[:1])))

# temp = l1.split(" ")
# for word in list(removed):
#     x, = np.where(features == word)
#     print(f'{word}\t{model.coef_.data[x[0]]}')
#     temp = [i for i in temp if i != word]
#     print(model.predict(cv.transform([' '.join(temp)])))

# coe_arr = []
# for word in l1.split(" "):
#     x, = np.where(features == word)
#     if len(x) > 0:
# #         print(f'{word}\t{model.coef_.toarray()[0][x[0]]}')
#         coe_arr.append((word, model.coef_.toarray()[0][x[0]]))

# top_20 = sorted(coe_arr, key=lambda x: x[1], reverse=True)[:20]
# temp = l1.split(" ")
# for word, _ in top_20:
#     x, = np.where(features == word)
#     print(f'{word}\t{model.coef_.data[x[0]]}')
#     temp = [i for i in temp if i != word]
#     print(model.predict(cv.transform([' '.join(temp)])))


np.argsort(model.coef_.toarray()[0])

In [None]:
a = [-0.033706742840289503, -0.025667649604084575, -0.048741199668066004, -0.060958130817115583, 0.037578565133285953, 0.03045565583049753, 0.027046251179203847, 0.046543105543863844, -0.014483809105959157, -0.069304461977054865, 0.044895435233572173, -0.013897922431222242, -0.029180422206721485, 0.16730445540091685, -0.025375080556183528, -0.058229462455121353, 0.024256194988257512, -0.055916084763357049, 0.058073041841478844, -0.011532623928430056, -0.085578750026615344, 0.0038493433344678994, 0.046351509359357349, 0.03045565583049753, 0.052326105801658709, -0.12071944768867805, -0.10445832116566628, 0.063598764665613011, 0.1336985316446371, -0.0058358670369515406, -0.014800964793316218, -0.046132520041169828, -0.014800964793316218, -0.072807842031844094, -0.029539430348497257, -0.016529789157141955, 0.086581460912192432, -0.012059750636169715, -0.048741199668066004, -0.059698820187955595, 0.03045565583049753, 0.045594433550108109, 0.0062321636431825773, 0.06841788413420935, 0.054519806692008885, 0.079276622410511879, 0.046351509359357349, -0.014800964793316218, 0.065732551099928288, 0.012201600663164225, 0.082094578213351005, -0.15341143957055678, -0.016529789157141955, 0.054486108417417078, -0.014800964793316218, -0.079034888264888706, 0.07122728309724341, -0.0015632816322620896, -0.045086556725529213, 0.020306695753977116, -0.014800964793316218, -0.073693645185750972, -0.12057433233880716, 0.022880394970827195, 0.0027044140313002736, -0.024503129624850885, -0.014800964793316218, 0.14491958631214708, 0.10405286465262994, 0.13702530254640488, 0.0061051045063516343, 0.047146024505769442, -0.0083612863355028016, 0.0098955076628064673, -0.011643233593216529, -0.0023890325660017482, -0.010864281778394429, 0.052030638866491043, 0.03045565583049753, 0.060341955315721965, 0.030144900320601055, 0.035638238342581299, 0.036782081615101658, -0.016529789157141955, -0.071512439951866791, 0.081979427535624122, 0.097543605029522584, -0.0015632816322620896, 
-0.033706742840289503, 0.063176986315459163, 0.0098955076628064673, 0.010316164543263867, -0.051924640219383096, 0.03045565583049753, -0.014800964793316218, -0.014675483457957925]

sorted(a)[::-1][:20]

In [None]:
from sklearn.model_selection import GridSearchCV

X = X_train_tfidf
y = np.array([0] * 360 + [1] * 180)
clf_start = st.train_svm(parameters, X, y)
param_range = np.arange(0.001,1,0.01) 
param_grid = [{'C': param_range, 'kernel': ['linear']}]
grid = GridSearchCV(clf_start, param_grid)
grid.fit(X,y)
clf = grid.best_estimator_
print(clf)
print('hello')

In [None]:
import matplotlib.pyplot as plt
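# Use the dense weight vector: `.data` on a sparse matrix holds only the stored
# entries, so it may be shorter than (and misaligned with) the feature indices.
coef = model.coef_.toarray()[0]
t = coef[top_coef_sorted]
plt.plot(t)
plt.show()

coef[top_coef_sorted][100], coef[top_coef_sorted][-100]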

In [None]:
modified_dt = None
with open('modified_data.txt', 'r') as infile:
    modified_dt = [line.strip().split(' ') for line in infile]
modified_dtset = cv.transform([' '.join(i) for i in modified_dt])
predicted2 = model.predict(modified_dtset)
print(np.mean(predicted2 == 1))
print(predicted2)

In [None]:
# cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=top_features[top_position_coef])
# count_array = cv.transform([' '.join(i) for i in test_dt]).toarray()
# count_array.sum(axis=1)

# print(np.sort(model.coef_.data))

# print(model.coef_.data[np.argsort(model.coef_.data)[::-1][:10]])
# print(model.coef_.data[np.argsort(model.coef_.data)[:10]])

# strategy().check_data('./test_data.txt','./modified_data.txt')
original_file = './test_data.txt'
modified_file = './modified_data.txt'

with open(original_file, 'r') as infile:
    data=[line.strip().split(' ') for line in infile]
Original={}
for idx in range(len(data)):
    Original[idx] = data[idx]

with open(modified_file, 'r') as infile:
    data=[line.strip().split(' ') for line in infile]
Modified={}
for idx in range(len(data)):
    Modified[idx] = data[idx]
    
for k in sorted(Original.keys()):
    record=set(Original[k])
    sample=set(Modified[k])
#     print(len(record), len(sample))
#     print(set(record)-set(sample))
#     print(set(sample)-set(record))
    if len((set(record)-set(sample)) | (set(sample)-set(record))) != 20:
        print(k, len(set(record)-set(sample)), len(set(sample)-set(record)))
        print(set(record)-set(sample))
        print(set(sample)-set(record))

# len(Original[1]), len(Modified[1])

In [None]:
# print(len(cv.vocabulary_))
# print(cv.get_feature_names()[-30:])
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import svm
import numpy as np

my_test = ['I am psa. michael', 'got damn it michael michael']
cv2 = CountVectorizer()
cv2.fit(my_test)
# cv2.vocabulary_
print(cv2.get_feature_names())
X_train2 = cv2.transform(my_test)
print(X_train2.toarray())
model2 = svm.SVC(kernel='linear')
# X_train2.shape, np.array([0, 0]).shape
model2.fit(X_train2, np.array([0, 1]))
# print(cv2.get_feature_names())
print(model2.coef_.data)
# print(model2.coef_.data, model2.coef_.shape)

# print(np.argsort(model2.coef_.data)[::-1])
# for i in cv2.transform(['I got got you','i miss you']).toarray():
#     print(i)

sorted_coef = np.argsort(model2.coef_.data)[::-1]
print(model2.coef_.data[sorted_coef])
print(np.array(cv2.get_feature_names())[sorted_coef])

In [1]:
import helper
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def fool_classifier(test_data): ## Please do not change the function definition...
    ## Read the test data file, i.e., 'test_data.txt' from Present Working Directory...
    test_dt = None
    with open('test_data.txt', 'r') as infile:
        test_dt = [line.strip().split(' ') for line in infile]
    
    ## You are supposed to use pre-defined class: 'strategy()' in the file `helper.py` for model training (if any),
    #  and modifications limit checking
#     constants = ['#' * i for i in range(100)]
    strategy_instance = helper.strategy() 
    parameters = {
        'C': 1,
        'gamma': 'auto',
        'kernel': 'linear',
        'coef0': 0.0,
        'degree': 3
    }
    lines = [' '.join(line) for line in strategy_instance.class0] \
            + [' '.join(line) for line in strategy_instance.class1]
    
    cv = CountVectorizer()
    cv.fit(lines)
    X_train = cv.transform(lines)
    model = strategy_instance.train_svm(parameters, X_train, np.array([0] * 360 + [1] * 180))
    ## Rank features by their dense coefficient, most positive (class-1) first.
    coef = model.coef_.toarray()[0]
    top_coef_sorted = np.argsort(coef)[::-1]
    top_features = np.array(cv.get_feature_names())
    ##..................................#
    modified_list = []

    for record in test_dt:
        record_new = record
        for coef_index in top_coef_sorted:
            feature = top_features[coef_index]
            record_new = [word for word in record_new if word != feature]

            if len((set(record) - set(record_new)) | \
                   (set(record_new) - set(record))) == 20: # no more modifications
                break

        modified_list.append(record_new)
    
    ## Write out the modified file, i.e., 'modified_data.txt' in Present Working Directory...
    with open("modified_data.txt", "w") as new_file:
        for i in modified_list:
            new_file.write(' '.join(i))
            new_file.write('\n')
    
    ## You can check that the modified text is within the modification limits.
    modified_data='./modified_data.txt'
    assert strategy_instance.check_data(test_data, modified_data)
    return strategy_instance ## NOTE: You are required to return the instance of this class.


 **NOTE:** 
 1. **You are required to return the instance of the class: `strategy()`, $\textit{e.g.}$, `strategy_instance` in the above cell.**
 2. **You are supposed to write out the file `modified_data.txt` in the same directory, and in the same format as that of `test_data.txt`**
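
One way to think about choosing the modifications for a linear model is sketched below. This is only an illustration under stated assumptions, not the reference solution: `weights` is a hypothetical dictionary mapping each vocabulary token to its linear-SVM coefficient (positive values pushing towards **class-1**), features are assumed to be raw token counts, and `modify_sample` is a made-up name. Deleting a token with a positive weight lowers the decision score by the weight times the token's count, while adding an absent token with a negative weight lowers it by the absolute value of that weight, so the sketch ranks both kinds of move by their estimated gain.

```python
from collections import Counter


def modify_sample(tokens, weights, budget=20):
    """Greedy sketch: pick the `budget` moves that lower a linear score the most."""
    counts = Counter(tokens)
    moves = []
    for tok, w in weights.items():
        if tok in counts and w > 0:
            # Deleting every occurrence is one distinct modification.
            moves.append((w * counts[tok], "delete", tok))
        elif tok not in counts and w < 0:
            # Adding an absent class-0 token once is one distinct modification.
            moves.append((-w, "add", tok))
    moves.sort(reverse=True)  # largest estimated gain first
    chosen = moves[:budget]
    deleted = {tok for _, kind, tok in chosen if kind == "delete"}
    added = [tok for _, kind, tok in chosen if kind == "add"]
    return [t for t in tokens if t not in deleted] + added


# Toy example with made-up weights and a budget of 2 instead of 20.
toy_weights = {"good": 0.5, "great": 0.3, "bad": -0.4, "poor": -0.2}
toy_tokens = ["good", "good", "movie", "great"]
print(modify_sample(toy_tokens, toy_weights, budget=2))  # ['movie', 'great', 'bad']
```

If fewer than `budget` candidate moves exist, this sketch returns a sample with fewer modifications, so some padding step would still be needed to satisfy the **EXACTLY 20 DISTINCT** rule.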

## How we test your code

In [None]:
import helper
import submission as submission
test_data='./test_data.txt'
strategy_instance = submission.fool_classifier(test_data)

########
#
# Testing Script.......
#
#
########

print('Success %-age = {}-%'.format(result))

In [1]:
import helper
import submission as submission
test_data='./test_data.txt'
strategy_instance = submission.fool_classifier(test_data)
print('hello')

hello


## EVALUATION:

1. For evaluation, we will use a set of test paragraphs with:
    * Approximately 500-1500 test samples for class-1, with each line corresponding to a distinct test sample. The input test file will follow the same format as `test_data.txt`.
    * We will consider the success rate of your algorithm for final evaluation. By success rate we mean the %-age of samples misclassified by the `target-classifier` ($\textit{i.e.,}$ instances of `class-1` classified as `class-0` after `20` distinct modifications). 

### Example:

1. Consider 200 test-samples (classified as **class-1** by the `target-classifier`). 
2. For example, if after modifying each test sample by **20 DISTINCT TOKENS** the `target-classifier` misclassifies **100** test samples ($\textit{i.e.,}$ 100 test samples are classified as **class-0**), then your **success %-age** is:

3. **success %-age** = (100) x 100/200 = **50%**
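
The same arithmetic can be written as a small function. This is a sketch only: the function name `success_percentage` is hypothetical, and it assumes the predictions for the modified class-1 samples are available as a list of 0/1 labels.

```python
def success_percentage(predicted_labels):
    """Percentage of modified class-1 samples that are classified as class-0."""
    if not predicted_labels:
        return 0.0
    fooled = sum(1 for label in predicted_labels if label == 0)
    return 100.0 * fooled / len(predicted_labels)


# 200 modified samples, 100 of which are now classified as class-0.
predictions = [0] * 100 + [1] * 100
print(success_percentage(predictions))  # 50.0
```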

In [None]:
# make own model
# check it against test data
# find out top coefficient (positive), sort it by value
# modify test data according to the top coefficient
# when I put in the parameter, the kernel should be linear
# if the feature is positive and it's in the record, remove it
# if negative and it's in the record, add 20 distinct words
# if negative and it's not in the record, add it

'''
>>> x = np.arange(9.).reshape(3, 3)
>>> np.where( x > 5 )
(array([2, 2, 2]), array([0, 1, 2]))
>>> x[np.where( x > 3.0 )]               # Note: result is 1D.
array([ 4.,  5.,  6.,  7.,  8.])
>>> np.where(x < 5, x, -1)               # Note: broadcasting.
array([[ 0.,  1.,  2.],
       [ 3.,  4., -1.],
       [-1., -1., -1.]])
'''

x = np.arange(9.).reshape(3, 3)
print(x)
# print(np.where( x > 5 ))
# print(x[np.where( x > 3.0 )] )
print(np.where(x < 5, x, -1))

In [None]:
a = ['a', 'b', 'c', 'c', 'd', 'e']
# [i for i in a if i != 'c']

np.array(a)

test_file = open("test_write.txt", "w")

# file.write("This is a test")
# file.write("To add more lines.")
for i in a:
    test_file.write(i + "\n")

test_file.close()

In [None]:
a = [1, 2, 3, 4, 4]
# len(set(a))

# ['hello', 'world'] + ['michael']

# for i in ab:
#     cd = i.split(" ")
#     print(cd[1], cd[2], int(cd[1]) | int(cd[2]))

# np.where(np.array(a) > 3)[0]

ab = [('mi', 2), ('sad', 5), ('owen', 1), ]
sorted(ab, key=lambda x: x[1])

In [None]:
for record in test_dt[:1]:
    del_count = 0
    record_new = list(record)  # copy so the original test record is not mutated
    for feature in top_features[top_coef_sorted][:20]:
        feat_count = record_new.count(feature)
        if del_count + feat_count <= 20:
            del_count += feat_count
            record_new = [word for word in record_new if word != feature]
        else:
            while del_count != 20:
                if feature in record_new:
                    record_new.remove(feature)
                    del_count += 1
                else:
                    break
            if del_count == 20:
                break
    modified_list.append(record_new)
    print(f'original count: {len(record)}, modified count: {len(record_new)}')
                