# Feature Engineering and Syntactic Similarity

- Computation of a similarity measure between texts
- Impact of preprocessing on
    - the similarity measure
    - the speed of execution

1. A toy dataset example

2. A real dataset

### A Toy dataset experiment

We consider a text made of four sentences:

In [2]:
import pandas as pd
import numpy as np

In [2]:
sentences = ["It was the best of times",
            "it was the worst of times",
            "it was the age of wisdom",
            "it was the age of foolishness"]

As a first step, we skip preprocessing, since we have only a few short sentences.

We want to illustrate two steps:
    
    1. Transforming the text into a numerical array of dimensions (number of documents × number of words). We will consider three options:
        1. One-hot encoding
        2. Word count in each document
        3. TF-IDF for each document
        
    2. Computing a similarity indicator between each pair of documents. We will compute:
        1. The similarity matrix
        2. The cosine similarity matrix
        

We will use scikit-learn functions.

##### One-Hot Encoding with scikit-learn
 
With one-hot encoding, each text is represented by a vector of size (1 × number of distinct words in the set of **all** documents) whose elements equal 1 if the word is in the text and 0 otherwise.

This encoding ignores the number of occurrences of each word.

In [3]:
# Split each sentence on whitespace to get one list of tokens per sentence
tokenized_sentences = [sentence.split() for sentence in sentences]

In [4]:
print(tokenized_sentences)

[['It', 'was', 'the', 'best', 'of', 'times'], ['it', 'was', 'the', 'worst', 'of', 'times'], ['it', 'was', 'the', 'age', 'of', 'wisdom'], ['it', 'was', 'the', 'age', 'of', 'foolishness']]


In [20]:
# Flatten the sublists into a single list, then keep each word only once.
# A set is unordered and holds unique elements; 'It' and 'it' count as two different words.
one_list = set(sum(tokenized_sentences, []))
print(one_list)

{'of', 'age', 'the', 'times', 'It', 'it', 'worst', 'best', 'wisdom', 'was', 'foolishness'}


In [10]:
# Same vocabulary, built with a set comprehension
vocabulary = {w for s in tokenized_sentences for w in s}
print(vocabulary)

{'of', 'age', 'the', 'times', 'It', 'it', 'worst', 'best', 'wisdom', 'was', 'foolishness'}


In [27]:
from sklearn.preprocessing import MultiLabelBinarizer
lb = MultiLabelBinarizer()
lb.fit([vocabulary])  # learn one column per vocabulary word
A = lb.transform(tokenized_sentences)  # one row per sentence; 1 if the word is present
print(A)

[[1 0 1 0 0 1 1 1 1 0 0]
 [0 0 0 0 1 1 1 1 1 0 1]
 [0 1 0 0 1 1 1 0 1 1 0]
 [0 1 0 1 1 1 1 0 1 0 0]]
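
To make the columns of `A` easier to read, the encoding can be rebuilt by hand. This is a minimal sketch in plain Python, not the scikit-learn implementation. It assumes that the columns follow the sorted vocabulary, which is consistent with the output above; the names `vocab_sorted` and `one_hot` are introduced here for illustration only.

```python
sentences = ["It was the best of times",
             "it was the worst of times",
             "it was the age of wisdom",
             "it was the age of foolishness"]

tokenized = [sentence.split() for sentence in sentences]

# Sorted vocabulary: uppercase letters sort before lowercase, so 'It' comes first
vocab_sorted = sorted({word for tokens in tokenized for word in tokens})

# One row per sentence, one column per vocabulary word (1 = word present)
one_hot = [[1 if word in tokens else 0 for word in vocab_sorted]
           for tokens in tokenized]

print(vocab_sorted)
for row in one_hot:
    print(row)
```

The first column corresponds to 'It' and the fifth to 'it': without lowercasing, the two spellings are treated as different words.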


### The Similarity Matrix

A simple measure of the similarity between two documents is based on the product of their respective vectors. With one-hot vectors, $S_{ij}$ is the number of words that documents $i$ and $j$ have in common:

$$ S_{ij}=\sum_{k} D_{ik}D_{jk} = D_{i}D_{j}^{T}=D_{j}D_{i}^{T} $$ or $$S=DD^{T}$$

In [30]:
np.dot(A,np.transpose(A))

array([[6, 4, 3, 3],
       [4, 6, 4, 4],
       [3, 4, 6, 5],
       [3, 4, 5, 6]])

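- The highest numbers are on the diagonal: each text has maximum similarity with itself (6, the number of words in each sentence)

- Texts 3 and 4 are the closest pair (5 shared words)
- Text 2 shares 4 words with each of the other texts
- Text 1 shares 4 words with text 2 and 3 words with texts 3 and 4; each count is one lower than it would be after lowercasing, because 'It' and 'it' are treated as different words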
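
The first sentence starts with a capitalized 'It', which whitespace tokenization treats as a different word from 'it'. The sketch below illustrates how this single preprocessing choice changes the similarity matrix $S=DD^{T}$. It is written in plain Python, and `similarity_matrix` is a hypothetical helper introduced for this example.

```python
sentences = ["It was the best of times",
             "it was the worst of times",
             "it was the age of wisdom",
             "it was the age of foolishness"]

def similarity_matrix(token_lists):
    """Return S = D D^T for one-hot vectors built from lists of tokens."""
    vocab = sorted({w for tokens in token_lists for w in tokens})
    vectors = [[1 if w in tokens else 0 for w in vocab] for tokens in token_lists]
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

raw = similarity_matrix([s.split() for s in sentences])              # case-sensitive
lowered = similarity_matrix([s.lower().split() for s in sentences])  # lowercased first

for raw_row, low_row in zip(raw, lowered):
    print(raw_row, low_row)
```

After lowercasing, every off-diagonal entry involving the first sentence increases by one, since the shared word 'it' is now counted.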

#### Bag-of-Words Models

- We count the frequency of words in each document => *bag-of-words* representation
- *Bag-of-words* representations are frequently used for classification or sentiment detection
- *Bag-of-words* representations are required for some topic-modeling methods, such as Latent Dirichlet Allocation (LDA)

Note that each word appears only once per sentence in our example. We add two extra sentences, one of which contains words occurring more than once.

In [6]:
more_sentences = sentences + [
    "John likes to watch movies. Mary likes to watch movies too",
    "Mary also likes to watch football games",
]

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()  # default settings: lowercasing, words of two or more characters

In [8]:
cv.fit(more_sentences)

In [10]:
cv.get_feature_names_out()

array(['age', 'also', 'best', 'foolishness', 'football', 'games', 'it',
       'john', 'likes', 'mary', 'movies', 'of', 'the', 'times', 'to',
       'too', 'was', 'watch', 'wisdom', 'worst'], dtype=object)

In [12]:
dt =cv.transform(more_sentences)

In [13]:
dt

<6x20 sparse matrix of type '<class 'numpy.int64'>'
	with 38 stored elements in Compressed Sparse Row format>

In [16]:
dt.todense()

matrix([[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
        [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 2, 0, 0, 0, 2, 1, 0, 2, 0, 0],
        [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]],
       dtype=int64)

In [18]:
pd.DataFrame(dt.toarray(), columns=cv.get_feature_names_out())

,age,also,best,foolishness,football,games,it,john,likes,mary,movies,of,the,times,to,too,was,watch,wisdom,worst
0,0,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,0,0
1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,0,1
2,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,1,0
3,1,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,1,2,1,2,0,0,0,2,1,0,2,0,0
5,0,1,0,0,1,1,0,0,1,1,0,0,0,0,1,0,0,1,0,0


Some words appear more than once in some documents: for example, *likes* and *movies* each occur twice in document 4.
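
The counts in a row can be checked by hand. The sketch below uses only the standard library and assumes the default `CountVectorizer` tokenization, that is, lowercasing followed by the token pattern `\b\w\w+\b` (words of at least two characters). The helper `bag_of_words` is hypothetical.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercased tokens of two or more word characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

counts = bag_of_words("John likes to watch movies. Mary likes to watch movies too")
for word in sorted(counts):
    print(word, counts[word])
```

The result agrees with row 4 of the table: *likes*, *movies*, *to* and *watch* appear twice, the other words once.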

#### Computation of the similarity matrix

In [20]:
# Similarity matrix S = D D^T; the @ operator performs sparse matrix multiplication
(dt @ dt.T).todense()

matrix([[ 6,  5,  4,  4,  0,  0],
        [ 5,  6,  4,  4,  0,  0],
        [ 4,  4,  6,  5,  0,  0],
        [ 4,  4,  5,  6,  0,  0],
        [ 0,  0,  0,  0, 19,  7],
        [ 0,  0,  0,  0,  7,  7]], dtype=int64)

- The products in the cells of the similarity matrix increase with the length of each document.
- They also depend on the number of occurrences of each word => we need a measure that removes this effect

### Computation of similarity

- A measure of similarity between documents can be based on the angle between their vectors.
- The cosine of the angle between two vectors is defined as:

$$\cos(a,b)=\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}=\frac{\sum_{i} a_{i}b_{i}}{\sqrt{\sum_{i} a_{i}^{2}}\sqrt{\sum_{i} b_{i}^{2}}}$$
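
As a small illustration of the formula, the sketch below computes the cosine by hand for two made-up count vectors (`doc_a` and `doc_b` are not taken from the dataset). It also shows why the cosine removes the length effect: repeating a document doubles its dot products but leaves the cosine unchanged.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

doc_a = [1, 0, 1, 1]                  # word counts of a short document
doc_b = [1, 1, 0, 1]
doc_a_twice = [2 * c for c in doc_a]  # the same text repeated twice

print(dot(doc_a, doc_b), dot(doc_a_twice, doc_b))        # the dot product doubles
print(cosine(doc_a, doc_b), cosine(doc_a_twice, doc_b))  # the cosine stays the same
```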

scikit-learn has a function that directly computes the cosine similarity matrix:

In [22]:
from sklearn.metrics.pairwise import cosine_similarity
S= cosine_similarity(dt,dt)

In [25]:
print(pd.DataFrame(S))

          0         1         2         3         4         5
0  1.000000  0.833333  0.666667  0.666667  0.000000  0.000000
1  0.833333  1.000000  0.666667  0.666667  0.000000  0.000000
2  0.666667  0.666667  1.000000  0.833333  0.000000  0.000000
3  0.666667  0.666667  0.833333  1.000000  0.000000  0.000000
4  0.000000  0.000000  0.000000  0.000000  1.000000  0.606977
5  0.000000  0.000000  0.000000  0.000000  0.606977  1.000000


- The matrix is symmetric
- Documents 0/1 and 2/3 are the most similar pairs
- Documents 4 and 5 have no similarity with the first four documents
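
The cosine matrix can also be derived from the similarity matrix of counts computed earlier, since $\cos_{ij} = S_{ij} / \sqrt{S_{ii} S_{jj}}$. The following sketch copies that matrix as a literal and applies the normalization in plain Python; the variable names are chosen for this example.

```python
import math

# Similarity matrix S = D D^T of the count vectors, copied from the output above
S = [[6, 5, 4, 4, 0, 0],
     [5, 6, 4, 4, 0, 0],
     [4, 4, 6, 5, 0, 0],
     [4, 4, 5, 6, 0, 0],
     [0, 0, 0, 0, 19, 7],
     [0, 0, 0, 0, 7, 7]]

n = len(S)
# Divide each cell by the product of the two document norms (square roots of the diagonal)
cosine_from_S = [[S[i][j] / math.sqrt(S[i][i] * S[j][j]) for j in range(n)]
                 for i in range(n)]

for row in cosine_from_S:
    print([round(value, 6) for value in row])
```

For documents 4 and 5 this gives 7 / sqrt(19 × 7) ≈ 0.606977, the value reported by `cosine_similarity`.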

### With TF-IDF

#### Option 1

- We can use TfidfTransformer with count matrix as input


In [29]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
tfidf_dt=tfidf.fit_transform(dt)
pd.DataFrame(tfidf_dt.toarray(),columns=cv.get_feature_names_out())

,age,also,best,foolishness,football,games,it,john,likes,mary,movies,of,the,times,to,too,was,watch,wisdom,worst
0,0.0,0.0,0.56978,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.467228,0.0,0.0,0.338027,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.467228,0.0,0.0,0.338027,0.0,0.0,0.56978
2,0.467228,0.0,0.0,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.0,0.0,0.0,0.338027,0.0,0.56978,0.0
3,0.467228,0.0,0.0,0.56978,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.0,0.0,0.0,0.338027,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.260453,0.42715,0.213575,0.520906,0.0,0.0,0.0,0.42715,0.260453,0.0,0.42715,0.0,0.0
5,0.0,0.419233,0.0,0.0,0.419233,0.419233,0.0,0.0,0.343777,0.343777,0.0,0.0,0.0,0.0,0.343777,0.0,0.0,0.343777,0.0,0.0


In [31]:
pd.DataFrame(cosine_similarity(tfidf_dt,tfidf_dt))

,0,1,2,3,4,5
0,1.0,0.675351,0.457049,0.457049,0.0,0.0
1,0.675351,1.0,0.457049,0.457049,0.0,0.0
2,0.457049,0.457049,1.0,0.675351,0.0,0.0
3,0.457049,0.457049,0.675351,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.513956
5,0.0,0.0,0.0,0.0,0.513956,1.0
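
To see where these weights come from, the sketch below recomputes them with NumPy. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is $\ln\frac{1+n}{1+df_t}+1$ and each row is then scaled to unit length. The names `docs`, `manual_tfidf` and `reference` are introduced for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness",
        "John likes to watch movies. Mary likes to watch movies too",
        "Mary also likes to watch football games"]

counts_sparse = CountVectorizer().fit_transform(docs)
counts = counts_sparse.toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency
weighted = counts * idf                          # term frequency times idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # unit-length rows

reference = TfidfTransformer().fit_transform(counts_sparse).toarray()
print(np.allclose(manual_tfidf, reference))
print(manual_tfidf[0].round(6))
```

A word found in a single document, such as *best*, gets a larger weight (0.56978) than a word found in four documents, such as *it* (0.338027).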


#### Option 2
We can use TfidfVectorizer with the documents as input

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
V_tfidf = TfidfVectorizer()
tfidf_dt=V_tfidf.fit_transform(more_sentences)

In [42]:
V_tfidf.get_feature_names_out()

array(['age', 'also', 'best', 'foolishness', 'football', 'games', 'it',
       'john', 'likes', 'mary', 'movies', 'of', 'the', 'times', 'to',
       'too', 'was', 'watch', 'wisdom', 'worst'], dtype=object)

In [47]:
tfidf_dt.todense()

matrix([[0.        , 0.        , 0.56977984, 0.        , 0.        ,
         0.        , 0.3380271 , 0.        , 0.        , 0.        ,
         0.        , 0.3380271 , 0.3380271 , 0.46722762, 0.        ,
         0.        , 0.3380271 , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.3380271 , 0.        , 0.        , 0.        ,
         0.        , 0.3380271 , 0.3380271 , 0.46722762, 0.        ,
         0.        , 0.3380271 , 0.        , 0.        , 0.56977984],
        [0.46722762, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.3380271 , 0.        , 0.        , 0.        ,
         0.        , 0.3380271 , 0.3380271 , 0.        , 0.        ,
         0.        , 0.3380271 , 0.        , 0.56977984, 0.        ],
        [0.46722762, 0.        , 0.        , 0.56977984, 0.        ,
         0.        , 0.3380271 , 0.        , 0.        , 0.        ,
         0.        , 0.3380271 

In [48]:
pd.DataFrame(tfidf_dt.todense(),columns=V_tfidf.get_feature_names_out())

,age,also,best,foolishness,football,games,it,john,likes,mary,movies,of,the,times,to,too,was,watch,wisdom,worst
0,0.0,0.0,0.56978,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.467228,0.0,0.0,0.338027,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.467228,0.0,0.0,0.338027,0.0,0.0,0.56978
2,0.467228,0.0,0.0,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.0,0.0,0.0,0.338027,0.0,0.56978,0.0
3,0.467228,0.0,0.0,0.56978,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.0,0.0,0.0,0.338027,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.260453,0.42715,0.213575,0.520906,0.0,0.0,0.0,0.42715,0.260453,0.0,0.42715,0.0,0.0
5,0.0,0.419233,0.0,0.0,0.419233,0.419233,0.0,0.0,0.343777,0.343777,0.0,0.0,0.0,0.0,0.343777,0.0,0.0,0.343777,0.0,0.0


In [52]:
pd.DataFrame(cosine_similarity(tfidf_dt,tfidf_dt))

,0,1,2,3,4,5
0,1.0,0.675351,0.457049,0.457049,0.0,0.0
1,0.675351,1.0,0.457049,0.457049,0.0,0.0
2,0.457049,0.457049,1.0,0.675351,0.0,0.0
3,0.457049,0.457049,0.675351,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.513956
5,0.0,0.0,0.0,0.0,0.513956,1.0


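- The similarity values are lower than with raw counts (for example, 0.675 instead of 0.833 for documents 0 and 1), because TF-IDF gives less weight to the words shared by many documents
- Documents 0/1 and 2/3 are still the closest pairs
- Documents 4/5 are rather similar to each other but have no similarity with the first four documents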
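
The two options produce the same matrix, which the short check below confirms on the toy sentences. It also illustrates a consequence of the default L2 normalization: since every TF-IDF row has unit length, the plain product of the matrix with its transpose already equals the cosine similarity. This is a sketch with default settings only, and the variable names are chosen for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)
from sklearn.metrics.pairwise import cosine_similarity

docs = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness",
        "John likes to watch movies. Mary likes to watch movies too",
        "Mary also likes to watch football games"]

# Option 1: counts first, then TF-IDF weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))
# Option 2: both steps in one object
one_step = TfidfVectorizer().fit_transform(docs)

same_weights = np.allclose(two_step.toarray(), one_step.toarray())

# Rows have unit length, so X X^T is already the cosine similarity matrix
dot_products = (one_step @ one_step.T).toarray()
same_similarity = np.allclose(dot_products, cosine_similarity(one_step))

print(same_weights, same_similarity)
```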


## Same Exercise with a Large Dataset

In [7]:
df = pd.read_csv("abcnews-date-text.csv",parse_dates=["publish_date"])

In [8]:
print(len(df))

1244184


In [10]:
df.head()

,publish_date,headline_text
0,2003-02-19,aba decides against community broadcasting lic...
1,2003-02-19,act fire witnesses must be aware of defamation
2,2003-02-19,a g calls for infrastructure protection summit
3,2003-02-19,air nz staff in aust strike for pay rise
4,2003-02-19,air nz strike to affect australian travellers


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
dt=tfidf.fit_transform(df["headline_text"])

In [12]:
dt

<1244184x105966 sparse matrix of type '<class 'numpy.float64'>'
	with 8072405 stored elements in Compressed Sparse Row format>

- 1 244 184 documents (headlines)
- 105 966 distinct words (the vocabulary)
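
The matrix summary also shows why a sparse format is needed here. The arithmetic below uses only the three numbers printed above and assumes 8 bytes per cell for a dense `float64` array.

```python
# Figures taken from the sparse matrix summary above
n_docs, n_terms, n_stored = 1_244_184, 105_966, 8_072_405

density = n_stored / (n_docs * n_terms)  # share of non-zero cells
avg_terms_per_doc = n_stored / n_docs    # non-zero entries per headline
dense_bytes = n_docs * n_terms * 8       # size if every cell were stored as float64

print(f"density: {density:.6%}")
print(f"average non-zero terms per headline: {avg_terms_per_doc:.2f}")
print(f"dense float64 size: {dense_bytes / 1e12:.2f} TB")
```

Each headline has about 6.5 non-zero terms on average, so fewer than 0.01% of the cells are filled, while a dense copy would need roughly one terabyte.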

In [16]:
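%%time
from sklearn.metrics.pairwise import cosine_similarity
# Rows: documents 1 to 9999; columns: documents 0 to 9999.
# Because the row slice starts at 1, the self-similarities (1.0) sit one column to the right of the diagonal.
cosine_similarity(dt[1:10000],dt[0:10000])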

CPU times: total: 219 ms
Wall time: 266 ms


array([[0.        , 1.        , 0.        , ..., 0.        , 0.03053607,
        0.06614406],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.03618138, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.03053607, 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.06614406, 0.        , ..., 0.        , 0.        ,
        1.        ]])
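
The call above compares only the first 10 000 headlines. A dense similarity matrix for all 1 244 184 headlines would have about 1.5 × 10^12 cells, so one possible approach is to process the rows in blocks and keep only the best match per document. The sketch below illustrates this idea on four headlines from `df.head()`; the function `most_similar` and its `chunk_size` parameter are hypothetical, not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

headlines = ["air nz staff in aust strike for pay rise",
             "air nz strike to affect australian travellers",
             "act fire witnesses must be aware of defamation",
             "a g calls for infrastructure protection summit"]
X = TfidfVectorizer().fit_transform(headlines)

def most_similar(matrix, chunk_size=2):
    """For each row, return the index and score of the most similar other row."""
    n_docs = matrix.shape[0]
    best_index = np.empty(n_docs, dtype=int)
    best_score = np.empty(n_docs)
    for start in range(0, n_docs, chunk_size):
        stop = min(start + chunk_size, n_docs)
        # Dense block of shape (stop - start, n_docs): only this block is held in memory
        block = cosine_similarity(matrix[start:stop], matrix)
        # Mask each document's similarity with itself
        block[np.arange(stop - start), np.arange(start, stop)] = -1.0
        best_index[start:stop] = block.argmax(axis=1)
        best_score[start:stop] = block.max(axis=1)
    return best_index, best_score

best_index, best_score = most_similar(X)
print(best_index, best_score.round(3))
```

A score of 0 means that the headline shares no word with any other headline, in which case the reported index is not meaningful.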

### Reducing feature dimensions (reducing the number of words)
1. Removing stop words
2. Minimum frequency
3. Maximum frequency
4. Lemmatization
5. Limiting word types
6. Removing the most common words
7. Adding context via n-grams
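
Several of these options map directly to `TfidfVectorizer` parameters (`stop_words`, `min_df`, `max_df`, `ngram_range`). The sketch below measures their effect on the vocabulary size of the toy sentences; the thresholds are arbitrary values chosen for illustration, and `vocabulary_size` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness",
        "John likes to watch movies. Mary likes to watch movies too",
        "Mary also likes to watch football games"]

def vocabulary_size(**options):
    """Fit a TfidfVectorizer with the given options and return the number of features."""
    return len(TfidfVectorizer(**options).fit(docs).vocabulary_)

sizes = {
    "default": vocabulary_size(),
    "stop_words='english'": vocabulary_size(stop_words="english"),
    "min_df=2": vocabulary_size(min_df=2),        # keep words found in at least 2 documents
    "max_df=0.5": vocabulary_size(max_df=0.5),    # drop words found in more than half of them
    "ngram_range=(1, 2)": vocabulary_size(ngram_range=(1, 2)),  # add pairs of adjacent words
}
for name, size in sizes.items():
    print(f"{name:22s} {size}")
```

Note that n-grams add context but increase the number of features, whereas the other options reduce it.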


In [13]:
import textacy
import textacy.preprocessing as tprep

preproc = tprep.make_pipeline(
    tprep.replace.urls,
    tprep.remove.html_tags,
    tprep.normalize.hyphenated_words,
    tprep.normalize.quotation_marks,
    tprep.normalize.unicode,
    tprep.remove.accents,
    tprep.remove.punctuation,
    tprep.normalize.whitespace,
    tprep.replace.numbers
)

In [15]:
df['clean_text']=df['headline_text'].apply(preproc)
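
`make_pipeline` chains the functions so that each one receives the output of the previous one. To show the idea without textacy, here is a rough standard-library sketch. The helpers below are hypothetical stand-ins for three of the nine steps and do not reproduce textacy's exact behavior.

```python
import re
import unicodedata
from functools import reduce

def make_simple_pipeline(*funcs):
    """Return a function that applies funcs to a text from left to right."""
    def pipeline(text):
        return reduce(lambda value, func: func(value), funcs, text)
    return pipeline

def strip_accents(text):
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def replace_punctuation(text):
    # Replace anything that is neither a word character nor whitespace with a space
    return re.sub(r"[^\w\s]", " ", text)

def collapse_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

simple_preproc = make_simple_pipeline(strip_accents, replace_punctuation, collapse_whitespace)
cleaned = simple_preproc("Café owner's  plea: 'save our high-street'!")
print(cleaned)
```

The order matters: whitespace is collapsed last because replacing punctuation with spaces can create runs of blanks.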

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Keep words present in at least 5 headlines and drop English stop words
text_vectorizer = TfidfVectorizer(min_df=5,stop_words='english')
# Fit the newly configured vectorizer (not the earlier default `tfidf` object)
text_tfidf=text_vectorizer.fit_transform(df['clean_text'])

In [19]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(text_tfidf[1:10000],text_tfidf[0:10000])