# TF-IDF

This notebook adds the term frequency-inverse document frequency (tf-idf) measure as a feature. The term
frequency is the count of a term in a specific question; the inverse document frequency is the log of the total number
of questions divided by the number of questions containing the term.
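
As a minimal sketch of this definition, the snippet below computes the weight for a toy corpus in plain Python. The three questions and the helper name `tf_idf` are made up for illustration. scikit-learn's `TfidfTransformer`, which the notebook uses later, adds 1 to the idf when `smooth_idf=False` and L2-normalizes each row by default, so its numbers differ from this textbook form.

```python
import math
from collections import Counter

# Toy corpus; these questions are illustrative, not taken from the dataset.
questions = [
    "how do i learn python",
    "how do i learn c",
    "what is python",
]

tokenized = [q.split() for q in questions]
n_questions = len(tokenized)

# Document frequency: the number of questions that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(term, tokens):
    """Count of `term` in one question times log(N / document frequency)."""
    tf = tokens.count(term)
    idf = math.log(n_questions / doc_freq[term])
    return tf * idf


print(tf_idf("python", tokenized[0]))  # "python" occurs in 2 of the 3 questions
print(tf_idf("c", tokenized[1]))       # "c" occurs in 1 of the 3 questions
```

A term that appears in fewer questions gets a larger weight, which is why rare words dominate the summed question weights computed further down.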

### Steps

1. Load cleaned data
2. Compute TF-IDF

'question1_lowercase' - lowercased question text, stop words retained

'concatenated_questions' = 'question1_lowercase' + 'question2_lowercase'

CountVectorizer is fitted on 'concatenated_questions'

Credit: Some of the code was inspired by this awesome [NLP repo][1].

  [1]: https://github.com/rouseguy/DeepLearningNLP_Py

In [225]:
import pandas as pd
pd.options.mode.chained_assignment = None
pd.set_option('max_colwidth', 250)
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
#from nltk import ngrams

#from sklearn import metrics
import xgboost as xgb
#from sklearn.linear_model import LogisticRegression

#from gensim.models import word2vec

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (16,6)

import multiprocessing as mp

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from difflib import SequenceMatcher

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss


%load_ext autotime

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 16.3 ms


In [204]:
#data = pd.read_csv('data/train.csv').sample(50000, random_state=23)
train = pd.read_csv('../data/train.csv').sample(10000, random_state=23)

time: 1.1 s


In [205]:
for data in [train]:
    for col in ['question1', 'question2']:
        data[col] = data[col].fillna('')  # replace missing questions without chained assignment
del data

time: 13.5 ms


In [188]:
assert 404290 == train.shape[0]

AssertionError: 

time: 8.69 ms


In [189]:
print(train.shape)
train.head(3)

(10000, 5)


id,qid1,qid2,question1,question2,is_duplicate
237921,9732,79801,Is sex necessary in a relationship?,Why is sex important in a good relationship?,1
181001,277377,277378,What are the most inspiring start up stories?,What are the most inspirational stories ever?,0
294691,150129,93109,What is your best way to do digital marketing?,What are the best unique ways to do Digital Marketing?,1


time: 9.67 ms


In [190]:
train.tail(3)

id,qid1,qid2,question1,question2,is_duplicate
165245,256635,256636,How much money does the author of an academic textbook earn?,How much does it cost to publish a textbook?,0
251754,69108,11477,How can I improve my speaking?,I want to improve my English?,1
130958,37405,210066,What are some tips for having sex for the first time?,What are some tips for first time sex?,1


time: 9.33 ms


In [30]:
for col in ['question1', 'question2']:
    train[col] = train[col].apply(lambda x: nltk.word_tokenize( x.lower() ))

time: 2min 4s


In [206]:
# Build the set once so membership tests are O(1) and it is not rebuilt on every call
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(tokenized_sent):
    return [word for word in tokenized_sent if word.lower() not in STOP_WORDS]


def concatenate_tokens(token_list):
    return str(' '.join(token_list))


def find_similarity(sent1, sent2):
    # isjunk receives one character at a time, so the double quote is listed as a single character
    return SequenceMatcher(lambda x: x in (' ', '?', '.', '"', '!'), sent1, sent2).ratio()


def return_common_tokens(sent1, sent2):
    return " ".join([word.lower() for word in sent1 if word in sent2])


def convert_tokens_lower(tokens):
    return [token.lower() for token in tokens]

time: 10.8 ms
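
To make the character-level similarity used by `find_similarity` concrete, here is a small sketch with `difflib.SequenceMatcher`. The helper name `char_similarity` and the example strings are assumptions for illustration, and the sketch passes `None` as the junk filter instead of the lambda used above.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    # ratio() returns 2*M/T, where M is the number of matched characters
    # and T is the combined length of both strings.
    return SequenceMatcher(None, a, b).ratio()


same = char_similarity("How do I lose weight?", "How do I lose weight?")
close = char_similarity("How do I lose weight?", "How can I lose weight?")
far = char_similarity("How do I lose weight?", "What is tf-idf?")

print(same, close, far)
```

The ratio works on characters, not meaning, so two paraphrases with different wording can score low even when they are true duplicates.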


In [207]:
train_transformed = pd.DataFrame(index = train.index)
temp_features = pd.DataFrame()

for i in (1, 2):
    # question tokens
    train_transformed['question%s_tokens' % i] = train['question%s' % i].apply(nltk.word_tokenize)
    # question lowercase tokens
    train_transformed['question%s_lowercase_tokens' % i] = train_transformed['question%s_tokens' % i].apply(convert_tokens_lower)
    # question lowercase tokens join with ' '
    train_transformed['question%s_lowercase' % i] = train_transformed['question%s_lowercase_tokens' % i].apply(concatenate_tokens)
    # remove stop words from question tokens
    train_transformed['question%s_words' % i] = train_transformed['question%s_tokens' % i].apply(remove_stopwords)
    # w\o stop words join ' '
    train_transformed['question%s_pruned' % i] = train_transformed['question%s_words' % i].apply(concatenate_tokens)

time: 3.34 s


In [241]:
temp_features['common_tokens'] = np.vectorize(return_common_tokens)(
    train_transformed['question1_tokens'],
    train_transformed['question2_tokens'])

naive_similarity = pd.DataFrame()
naive_similarity['similarity'] = np.vectorize(find_similarity)(
    train['question1'], train['question2'])
naive_similarity['pruned_similarity'] = np.vectorize(find_similarity)(
    train_transformed['question1_pruned'], train_transformed['question2_pruned'])

time: 4.73 s


In [209]:
train.tail(3)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
165245,165245,256635,256636,How much money does the author of an academic textbook earn?,How much does it cost to publish a textbook?,0
251754,251754,69108,11477,How can I improve my speaking?,I want to improve my English?,1
130958,130958,37405,210066,What are some tips for having sex for the first time?,What are some tips for first time sex?,1


time: 9.85 ms


In [210]:
dictionary = pd.DataFrame()

#Deriving the TF-IDF
dictionary['concatenated_questions'] = train['question1'] +\
                                       train['question2']

vectorizer = CountVectorizer()
terms_matrix = vectorizer.fit_transform(dictionary['concatenated_questions'])
terms_matrix_1 = vectorizer.transform(train['question1'])
terms_matrix_2 = vectorizer.transform(train['question2'])
common_terms_matrx = vectorizer.transform(temp_features['common_tokens'])

transformer = TfidfTransformer(smooth_idf = False)
weights_matrix = transformer.fit_transform(terms_matrix)
weights_matrix_1 = transformer.transform(terms_matrix_1)
weights_matrix_2 = transformer.transform(terms_matrix_2)
common_weights_matrix = transformer.transform(common_terms_matrx)

len(transformer.idf_), terms_matrix.shape

(15187, (10000, 15187))

time: 765 ms
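
The idea of fitting one vocabulary on both questions and then counting each question against it can be sketched in plain Python. This is only an approximation of `CountVectorizer`: the regular expression mimics its default token pattern (lowercased tokens of two or more word characters), and the names `tokenize`, `vocab` and `count_vector` are hypothetical. The sketch joins the two questions with a space; the cell above concatenates them directly, which can fuse the last token of `question1` with the first token of `question2` when no punctuation separates them.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Two question pairs in the style of the dataset.
q1 = ["How do I lose weight?", "Should I learn C?"]
q2 = ["What can I do to lose 20 pounds?", "Why should I learn C language?"]

# Shared vocabulary built from both questions of every pair.
vocab = sorted({tok for a, b in zip(q1, q2) for tok in tokenize(a + " " + b)})
term_id = {tok: i for i, tok in enumerate(vocab)}


def count_vector(text):
    """Term counts for one question, aligned to the shared vocabulary."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocab]


vec_q1 = [count_vector(t) for t in q1]
vec_q2 = [count_vector(t) for t in q2]

print(len(vocab), vec_q1[0])
```

Because both count matrices share the same columns, the weights of a term in `question1` and `question2` are directly comparable.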


In [211]:
#Converting the sparse matrices into dataframes
transformed_matrix_1 = weights_matrix_1.tocoo(copy = False)
transformed_matrix_2 = weights_matrix_2.tocoo(copy = False)
transformed_common_weights_matrix = common_weights_matrix.tocoo(copy = False)

weights_dataframe_1 = pd.DataFrame({'index_': transformed_matrix_1.row,
                                    'term_id': transformed_matrix_1.col,
                                    'weight_q1': transformed_matrix_1.data}
                                   )[['index_', 'term_id', 'weight_q1']].sort_values(['index_', 'term_id']).reset_index(drop = True)
weights_dataframe_2 = pd.DataFrame({'index_': transformed_matrix_2.row,
                                    'term_id': transformed_matrix_2.col,
                                    'weight_q2': transformed_matrix_2.data}
                                   )[['index_', 'term_id', 'weight_q2']].sort_values(['index_', 'term_id']).reset_index(drop = True)
weights_dataframe_3 = pd.DataFrame({'index_': transformed_common_weights_matrix.row,
                                    'term_id': transformed_common_weights_matrix.col,
                                    'common_weight': transformed_common_weights_matrix.data}
                                   )[['index_', 'term_id', 'common_weight']].sort_values(['index_', 'term_id']).reset_index(drop = True)

time: 68.8 ms


In [212]:
weights_dataframe_1[weights_dataframe_1.index_ == 0]

Unnamed: 0,index_,term_id,weight_q1
0,0,6899,0.189101
1,0,7313,0.158552
2,0,9198,0.618865
3,0,11314,0.538469
4,0,12173,0.515906


time: 9.63 ms


In [213]:
#Summing the weights of each token in each question to get the summed weight of the question
sum_weights_1 = weights_dataframe_1.groupby('index_').sum()
sum_weights_2 = weights_dataframe_2.groupby('index_').sum()
sum_weights_3 = weights_dataframe_3.groupby('index_').sum()

# Join on the question row position (index_); the summed term_id columns are dropped below
weights = sum_weights_1.join(sum_weights_2, how = 'outer', lsuffix = '_q1',
                             rsuffix = '_q2').\
    join(sum_weights_3, how = 'outer', lsuffix = '_cw', rsuffix = '_cw')

weights = weights.fillna(0)
del weights['term_id_q1'], weights['term_id_q2'], weights['term_id']

print (weights[:20])

        weight_q1  weight_q2  common_weight
index_                                     
0        2.020893   2.438344       1.616312
1        2.485519   2.303256       1.917377
2        2.658949   2.735171       1.946522
3        2.737284   2.750768       1.406763
4        4.007998   4.007998       4.007998
5        3.447859   3.507630       3.252584
6        2.799849   2.004068       1.000000
7        3.085802   2.422342       0.000000
8        2.503914   2.142919       1.000000
9        1.993572   1.864838       1.599615
10       2.517155   2.439050       1.884600
11       2.539523   2.502484       1.763797
12       2.553175   2.678135       2.168378
13       2.808284   2.949881       1.569570
14       3.537830   3.624433       3.537830
15       3.434058   3.527817       2.012766
16       4.033230   3.174608       2.696729
17       3.523631   4.568617       2.141541
18       2.834307   3.805428       1.268429
19       2.618584   2.267223       2.169166
time: 32.8 ms
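
The per-question summation can be checked by hand. In the sketch below, the five triples for row 0 are copied from the `weights_dataframe_1` output shown earlier, while the two triples for row 1 are hypothetical and only there to show the grouping.

```python
from collections import defaultdict

# (index_, term_id, weight) triples in the same long format as weights_dataframe_1.
triples = [
    (0, 6899, 0.189101),
    (0, 7313, 0.158552),
    (0, 9198, 0.618865),
    (0, 11314, 0.538469),
    (0, 12173, 0.515906),
    (1, 42, 0.6),  # hypothetical row, for illustration only
    (1, 77, 0.8),  # hypothetical row, for illustration only
]

summed = defaultdict(float)
for row, _term_id, weight in triples:
    summed[row] += weight  # only the weights are summed; term ids are ignored

print(round(summed[0], 6))
```

The total for row 0 agrees with the `weight_q1` value of 2.020893 printed for `index_` 0 above.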


In [258]:
#Creating a random train-test split
X = naive_similarity.join(weights, how = 'inner')
y = train['is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 42)

#Scaling the features
#sc = StandardScaler()
#X_train = sc.fit_transform(X_train)
#X_test = sc.transform(X_test)


print (X_train[:20])

# We train our algorithm (a gradient boosting classifier, here XGBoost) and print the logarithmic loss.

# Training the algorithm and making a prediction

params = {'alpha': 0.1,
 'colsample_bytree': 0.7,
 'eta': 0.01,
 'eval_metric': 'logloss',
 'max_depth': 6,
 'min_child_weight': 25,
 'objective': 'binary:logistic',
 'seed': 42,
 'silent': 1,
 'subsample': 0.7}

num_rounds = 800 
plst = list(params.items())
sc = StandardScaler()
xgtrain = xgb.DMatrix(X_train, label=y_train)
xgtest = xgb.DMatrix(X_test)

model = xgb.train(plst, xgtrain, num_rounds, verbose_eval=50)
y_pred = model.predict(xgtest)
print ('The log loss is %s' % log_loss(y_test, y_pred))

prediction = pd.DataFrame(y_pred, columns = ['is_duplicate'], index = X_test.index)

"""
#Training the algorithm and making a prediction
gbc = GradientBoostingClassifier(n_estimators = 8000, learning_rate = 0.3, max_depth = 3)
gbc.fit(X_train, y_train.values.ravel())
prediction = pd.DataFrame(gbc.predict(X_test), columns = ['is_duplicate'], index = X_test.index)

#Inspecting our mistakes
prediction_actual = prediction.join(y_test, how = 'inner', lsuffix = '_predicted', rsuffix = '_actual').join(train[['question1', 'question2']], how = 'inner').join(X_test, how = 'inner')
print ('The log loss is SCKL %s' % log_loss(y_test, prediction))
"""
0

      similarity  pruned_similarity  weight_q1  weight_q2  common_weight
7660    0.727273           0.682927   2.096303   2.017127       1.383557
275     0.680000           0.730159   2.433464   2.867656       1.757610
2985    0.875000           0.804348   2.722076   2.776350       2.590012
5646    0.388060           0.356436   3.061119   3.254577       1.587374
5849    0.377358           0.363636   2.937788   3.559681       1.655746
3477    0.327684           0.292308   3.909865   3.748374       2.143320
7558    0.722892           0.707692   2.252590   2.348782       2.022280
3558    0.682927           0.727273   2.710025   2.306698       1.866726
2809    0.625000           0.666667   2.717055   2.665511       2.100895
9113    0.540000           0.459016   2.584222   2.298426       1.572971
4861    0.240310           0.372881   3.787417   4.400992       1.942285
6586    0.494118           0.654545   3.005096   4.065321       2.197647
3416    0.250000           0.267717   2.659246   3.

0

time: 1.51 s
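
The cell above scores the model with `sklearn.metrics.log_loss`. As a rough sketch of what that metric measures for binary labels, the function below averages the negative log-likelihood of the predicted probabilities. The name `binary_log_loss`, the clipping constant `eps` and the example probabilities are assumptions for illustration, not values from this notebook.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.9, 0.1]  # hypothetical predictions
hesitant = [0.6, 0.4, 0.6, 0.4]   # hypothetical predictions

print(binary_log_loss(labels, confident), binary_log_loss(labels, hesitant))
```

Predicting 0.5 for every pair yields a loss of ln 2 (about 0.693), a useful baseline when reading the reported score.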


In [264]:
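# NOTE: `prediction` and `X_test` are indexed by row position (0..9999), while
# `y_test`, `train` and `train_transformed` keep the original row labels of the
# sampled data, so these inner joins only keep rows where the two happen to coincide.
prediction_actual = prediction.join(y_test, how = 'inner',
                                    lsuffix = '_predicted',
                                    rsuffix = '_actual').\
    join(train[['question1', 'question2']], how = 'inner').\
    join(train_transformed, how = 'inner').\
    join(X_test, how = 'inner')

# False positives: non-duplicates scored above 0.6 (this mask is overwritten below)
bind = (prediction_actual.is_duplicate_actual == 0) & (prediction_actual.is_duplicate_predicted>0.6)

# False negatives: duplicates scored below 0.6 (the mask actually used)
bind = (prediction_actual.is_duplicate_actual == 1) & (prediction_actual.is_duplicate_predicted < 0.6)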
print(np.sum(bind), X_test.shape)
prediction_actual[bind]

17 (5000, 5)


Unnamed: 0,is_duplicate_predicted,is_duplicate_actual,question1,question2,question1_tokens,question1_lowercase_tokens,question1_lowercase,question1_words,question1_pruned,question2_tokens,question2_lowercase_tokens,question2_lowercase,question2_words,question2_pruned,similarity,pruned_similarity,weight_q1,weight_q2,common_weight
2304,0.459989,1,What might cause a brown discharge during the menstrual cycle (before/during/after menstruation)?,Why do I get a brown discharge before my menstrual cycle?,"[What, might, cause, a, brown, discharge, during, the, menstrual, cycle, (, before/during/after, menstruation, ), ?]","[what, might, cause, a, brown, discharge, during, the, menstrual, cycle, (, before/during/after, menstruation, ), ?]",what might cause a brown discharge during the menstrual cycle ( before/during/after menstruation ) ?,"[might, cause, brown, discharge, menstrual, cycle, (, before/during/after, menstruation, ), ?]",might cause brown discharge menstrual cycle ( before/during/after menstruation ) ?,"[Why, do, I, get, a, brown, discharge, before, my, menstrual, cycle, ?]","[why, do, i, get, a, brown, discharge, before, my, menstrual, cycle, ?]",why do i get a brown discharge before my menstrual cycle ?,"[get, brown, discharge, menstrual, cycle, ?]",get brown discharge menstrual cycle ?,0.850467,0.864516,4.137393,3.693397,3.582522
6603,0.34649,1,Why are 500 and 1000 notes being banned in India?,What are the reasons why eradication of 1000 rs and 500 rs notes?,"[Why, are, 500, and, 1000, notes, being, banned, in, India, ?]","[why, are, 500, and, 1000, notes, being, banned, in, india, ?]",why are 500 and 1000 notes being banned in india ?,"[500, 1000, notes, banned, India, ?]",500 1000 notes banned India ?,"[What, are, the, reasons, why, eradication, of, 1000, rs, and, 500, rs, notes, ?]","[what, are, the, reasons, why, eradication, of, 1000, rs, and, 500, rs, notes, ?]",what are the reasons why eradication of 1000 rs and 500 rs notes ?,"[reasons, eradication, 1000, rs, 500, rs, notes, ?]",reasons eradication 1000 rs 500 rs notes ?,0.507937,0.4375,2.704007,2.240851,1.826248
532,0.208236,1,How imminent is world war III?,Are we heading toward World War 3?,"[How, imminent, is, world, war, III, ?]","[how, imminent, is, world, war, iii, ?]",how imminent is world war iii ?,"[imminent, world, war, III, ?]",imminent world war III ?,"[Are, we, heading, toward, World, War, 3, ?]","[are, we, heading, toward, world, war, 3, ?]",are we heading toward world war 3 ?,"[heading, toward, World, War, 3, ?]",heading toward World War 3 ?,0.451613,0.540541,2.298379,1.739305,1.0
1683,0.020531,1,What is the best app for Berlin public transportation?,What are the best public transportation apps to help me in Berlin?,"[What, is, the, best, app, for, Berlin, public, transportation, ?]","[what, is, the, best, app, for, berlin, public, transportation, ?]",what is the best app for berlin public transportation ?,"[best, app, Berlin, public, transportation, ?]",best app Berlin public transportation ?,"[What, are, the, best, public, transportation, apps, to, help, me, in, Berlin, ?]","[what, are, the, best, public, transportation, apps, to, help, me, in, berlin, ?]",what are the best public transportation apps to help me in berlin ?,"[best, public, transportation, apps, help, Berlin, ?]",best public transportation apps help Berlin ?,0.237288,0.235294,3.177638,2.750046,1.357952
7444,0.230187,1,Can we kill herpes virus once it is out of nerve cell I attached a pic of the description of the question please read and answer?,Can we kill herpes virus once it is out of nerves cell?,"[Can, we, kill, herpes, virus, once, it, is, out, of, nerve, cell, I, attached, a, pic, of, the, description, of, the, question, please, read, and, answer, ?]","[can, we, kill, herpes, virus, once, it, is, out, of, nerve, cell, i, attached, a, pic, of, the, description, of, the, question, please, read, and, answer, ?]",can we kill herpes virus once it is out of nerve cell i attached a pic of the description of the question please read and answer ?,"[kill, herpes, virus, nerve, cell, attached, pic, description, question, please, read, answer, ?]",kill herpes virus nerve cell attached pic description question please read answer ?,"[Can, we, kill, herpes, virus, once, it, is, out, of, nerves, cell, ?]","[can, we, kill, herpes, virus, once, it, is, out, of, nerves, cell, ?]",can we kill herpes virus once it is out of nerves cell ?,"[kill, herpes, virus, nerves, cell, ?]",kill herpes virus nerves cell ?,0.361809,0.373134,3.114752,3.91918,2.24064
7578,0.55846,1,What should I know to get into GSoC?,What do I need to learn to get a fair chance of getting selected for GSoC?,"[What, should, I, know, to, get, into, GSoC, ?]","[what, should, i, know, to, get, into, gsoc, ?]",what should i know to get into gsoc ?,"[know, get, GSoC, ?]",know get GSoC ?,"[What, do, I, need, to, learn, to, get, a, fair, chance, of, getting, selected, for, GSoC, ?]","[what, do, i, need, to, learn, to, get, a, fair, chance, of, getting, selected, for, gsoc, ?]",what do i need to learn to get a fair chance of getting selected for gsoc ?,"[need, learn, get, fair, chance, getting, selected, GSoC, ?]",need learn get fair chance getting selected GSoC ?,0.842105,0.871795,1.849544,1.67263,1.192016
8204,0.56586,1,Why should I learn C language?,Should I learn C?,"[Why, should, I, learn, C, language, ?]","[why, should, i, learn, c, language, ?]",why should i learn c language ?,"[learn, C, language, ?]",learn C language ?,"[Should, I, learn, C, ?]","[should, i, learn, c, ?]",should i learn c ?,"[learn, C, ?]",learn C ?,0.46875,0.564103,3.253949,2.968196,2.021113
8383,0.538299,1,How do I lose weight?,What can I do to lose 20 pounds?,"[How, do, I, lose, weight, ?]","[how, do, i, lose, weight, ?]",how do i lose weight ?,"[lose, weight, ?]",lose weight ?,"[What, can, I, do, to, lose, 20, pounds, ?]","[what, can, i, do, to, lose, 20, pounds, ?]",what can i do to lose 20 pounds ?,"[lose, 20, pounds, ?]",lose 20 pounds ?,0.876712,0.727273,2.172359,2.200785,2.065846
1272,0.484915,1,Why does magnetic field produce when current flows in a conductor?,Why is a magnetic field produced when current flows through a conductor?,"[Why, does, magnetic, field, produce, when, current, flows, in, a, conductor, ?]","[why, does, magnetic, field, produce, when, current, flows, in, a, conductor, ?]",why does magnetic field produce when current flows in a conductor ?,"[magnetic, field, produce, current, flows, conductor, ?]",magnetic field produce current flows conductor ?,"[Why, is, a, magnetic, field, produced, when, current, flows, through, a, conductor, ?]","[why, is, a, magnetic, field, produced, when, current, flows, through, a, conductor, ?]",why is a magnetic field produced when current flows through a conductor ?,"[magnetic, field, produced, current, flows, conductor, ?]",magnetic field produced current flows conductor ?,0.507692,0.474227,3.319158,2.74789,2.220099
8641,0.33797,1,What do you want more of in your life?,What do you want in life?,"[What, do, you, want, more, of, in, your, life, ?]","[what, do, you, want, more, of, in, your, life, ?]",what do you want more of in your life ?,"[want, life, ?]",want life ?,"[What, do, you, want, in, life, ?]","[what, do, you, want, in, life, ?]",what do you want in life ?,"[want, life, ?]",want life ?,0.62069,0.684211,1.659821,2.177453,1.519794


time: 150 ms
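
The two masks in the cell above separate the model's mistakes by type. A plain-Python sketch of the same split follows, keeping both groups instead of overwriting one. The first two rows reuse predictions from the table above; the last two rows and the names `false_positives` and `false_negatives` are hypothetical.

```python
THRESHOLD = 0.6  # same cut-off as in the cell above

# (row id, predicted probability, actual label)
rows = [
    (2304, 0.459989, 1),
    (1683, 0.020531, 1),
    (9001, 0.75, 0),  # hypothetical row, for illustration only
    (9002, 0.91, 1),  # hypothetical row, for illustration only
]

# Non-duplicates that the model scored above the threshold.
false_positives = [r for r in rows if r[2] == 0 and r[1] > THRESHOLD]
# Duplicates that the model scored below the threshold.
false_negatives = [r for r in rows if r[2] == 1 and r[1] < THRESHOLD]

print(len(false_positives), len(false_negatives))
```

Keeping the two lists under separate names makes it possible to inspect both kinds of error in one run.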
