## Feature Engineering

The functions that follow each take the data as a pandas DataFrame and return the same DataFrame with additional columns for the new features. The columns are added in place, so the DataFrame passed in is modified as well.

Most functions also accept an optional argument _stopwords_. If it is set to a list of strings, those words are removed from the text before the feature is computed.
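
As a minimal sketch of this calling convention, the snippet below uses a hypothetical function `add_num_words` and a made-up one-row DataFrame; neither is part of this notebook's feature set, and punctuation handling is left out to keep it short.

```python
import pandas as pd

def add_num_words(df, stopwords=None):
    # Hypothetical feature function: DataFrame in, same DataFrame out with one new column
    stop = set(stopwords or [])
    df["num_words"] = df["question1"].apply(
        lambda text: len([w for w in text.lower().split() if w not in stop])
    )
    return df

# Toy data, invented for illustration only
toy = pd.DataFrame({"question1": ["What is the capital of France ?"]})

# Copy first, because the function adds the column to the frame it receives
with_stop = add_num_words(toy.copy(), stopwords=["what", "is", "the", "of"])
without_stop = add_num_words(toy.copy())
```

Passing a copy keeps `toy` unchanged, which is useful when comparing a feature computed with and without stopwords.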


### An outline of the features included:

#### 1. Basics

#### 2. String Distance Features and Fuzzy Features

#### 3. Tf-Idf Features

#### 4. Word Embedding Features

#### 5. Linguistic Features


### 0.1 Import packages and data

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("../data/processed/train.csv")

In [66]:
df.head()

| | Unnamed: 0 | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 2 | 1 | 3 | 4 | What is the story of Kohinoor Koh - i - Noor D... | What would happen if the Indian government sto... | 0 |
| 2 | 3 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 4 | 3 | 7 | 8 | Why am I mentally very lonely ? How can I solv... | Find the remainder when math 23 24 math is div... | 0 |
| 4 | 5 | 4 | 9 | 10 | Which one dissolve in water quickly sugar , sa... | Which fish would survive in salt water ? | 0 |


In [6]:
df.describe()

| | Unnamed: 0 | id | qid1 | qid2 | is_duplicate |
|---|---|---|---|---|---|
| count | 404290.0 | 404290.0 | 404290.0 | 404290.0 | 404290.0 |
| mean | 202145.5 | 202144.5 | 217243.942418 | 220955.655337 | 0.369198 |
| std | 116708.614503 | 116708.614503 | 157751.700002 | 159903.182629 | 0.482588 |
| min | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 |
| 25% | 101073.25 | 101072.25 | 74437.5 | 74727.0 | 0.0 |
| 50% | 202145.5 | 202144.5 | 192182.0 | 197052.0 | 0.0 |
| 75% | 303217.75 | 303216.75 | 346573.5 | 354692.5 | 1.0 |
| max | 404290.0 | 404289.0 | 537932.0 | 537933.0 | 1.0 |


In [136]:
# small sample for testing
sample = df.sample(100)

### 0.2 Objects I will use repeatedly in this notebook

In [36]:
punctuation = ["?", "!", ",", ".", '"', "-", ' ']

### 1. Basic features

#### 1.1 Number of intersecting words

In [111]:
def len_intersection(df, stopwords = None):
    # Tokens to drop: punctuation always, stopwords only when provided
    exclude = set(punctuation) | set(stopwords or [])

    def tokens(text):
        # Lowercase both questions so the comparison is case-insensitive
        # (previously only question2 was lowercased)
        return set(w for w in text.strip().lower().split(" ") if w not in exclude)

    df['len_intersection'] = df.apply(lambda x: len(tokens(x["question1"]).intersection(
                                                    tokens(x["question2"]))),
                                      axis = 1)
    return(df)
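
To make the intersection count concrete, here is a small worked sketch on a single pair of questions. The helper name `shared_words`, the local `punct` set, and the two sentences are invented for illustration, and the sketch assumes both questions are lowercased before they are compared.

```python
# Same symbols as the notebook's punctuation list, kept local so the sketch is self-contained
punct = {"?", "!", ",", ".", '"', "-", " "}

def shared_words(q1, q2, stopwords=None):
    # Hypothetical helper: the set of words the two questions have in common
    exclude = punct | set(stopwords or [])

    def tokens(text):
        return {w for w in text.strip().lower().split(" ") if w not in exclude}

    return tokens(q1) & tokens(q2)

# Invented example sentences, tokenized with spaces around punctuation like the data
q1 = "How can I learn Python quickly ?"
q2 = "How can you learn python ?"

common = shared_words(q1, q2)
common_no_stop = shared_words(q1, q2, stopwords=["how", "can", "i", "you"])
```

Here `len(common)` is the value the feature would take without stopwords, and `len(common_no_stop)` is the value once the listed stopwords are removed.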

#### 1.2 Number of words in each sentence

In [112]:
def num_words_q1(df, stopwords = None):
    if stopwords:
        df['num_words_q1'] = df['question1'].apply(lambda x: len([w for w in x.strip().lower().split(" ")
                                                          if w not in punctuation and w not in stopwords]))
        return(df)
    
    else:
        df['num_words_q1'] = df['question1'].apply(lambda x: len([w for w in x.strip().lower().split(" ") if
                                                          w not in punctuation]))
        return(df)

In [113]:
def num_words_q2(df, stopwords = None):
    if stopwords:
        df['num_words_q2'] = df['question2'].apply(lambda x: len([w for w in x.strip().lower().split(" ")
                                                          if w not in punctuation and w not in stopwords]))
        return(df)
    
    else:
        df['num_words_q2'] = df['question2'].apply(lambda x: len([w for w in x.strip().lower().split(" ")
                                                          if w not in punctuation]))
        return(df)

#### 1.3 Difference in number of words 

Assumes the functions *num_words_q1* and *num_words_q2* have already been applied.

In [129]:
def num_words_diff(df):
    df['num_words_diff'] = abs(df['num_words_q1'] - df['num_words_q2'])
    return(df)
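
Because this feature depends on columns created by other functions, one option is to check for them up front. The sketch below uses a hypothetical helper `require_columns` and made-up word counts; it is one possible guard, not something the notebook itself does.

```python
import pandas as pd

def require_columns(df, columns):
    # Hypothetical helper: fail early with a clear message if a prerequisite is missing
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns {missing}; run the functions that create them first")
    return df

# Made-up word counts standing in for the output of num_words_q1 / num_words_q2
toy = pd.DataFrame({"num_words_q1": [7, 3], "num_words_q2": [5, 9]})

require_columns(toy, ["num_words_q1", "num_words_q2"])
toy["num_words_diff"] = (toy["num_words_q1"] - toy["num_words_q2"]).abs()
```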

#### 1.4 Character length of each question

In [122]:
def num_chars_q1(df, stopwords = None):
    if stopwords:
        # Compare lowercased words so capitalized stopwords (e.g. "What") are dropped too
        df['num_chars_q1'] = df['question1'].apply(lambda x: sum([len([c for c in w if c not in punctuation]) for
                                                                 w in x.strip().split() if
                                                                 w.lower() not in stopwords]))
        return(df)
    else:
        # punctuation includes ' ', so spaces are not counted either
        df['num_chars_q1'] = df['question1'].apply(lambda x: len([c for c in x.strip() if
                                                                  c not in punctuation]))
        return(df)

In [127]:
def num_chars_q2(df, stopwords = None):
    if stopwords:
        # Compare lowercased words so capitalized stopwords (e.g. "What") are dropped too
        df['num_chars_q2'] = df['question2'].apply(lambda x: sum([len([c for c in w if c not in punctuation]) for
                                                                 w in x.strip().split() if
                                                                 w.lower() not in stopwords]))
        return(df)
    else:
        # punctuation includes ' ', so spaces are not counted either
        df['num_chars_q2'] = df['question2'].apply(lambda x: len([c for c in x.strip() if
                                                                  c not in punctuation]))
        return(df)

#### 1.5 Difference in number of characters between q1 and q2

Assumes the functions *num_chars_q1* and *num_chars_q2* have already been applied.

In [137]:
def num_chars_diff(df):
    df['num_chars_diff'] = abs(df['num_chars_q1'] - df['num_chars_q2'])
    return(df)

---

### 2. String distance features

These features are inspired by, and reuse some code from, Abhishek Thakur's excellent [blog post](https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur/) and Marco Bonzanini's [FuzzyWuzzy tutorial](https://marcobonzanini.com/2015/02/25/fuzzy-string-matching-in-python/).