<h2> 3.6 Featurizing text data with TF-IDF weighted word vectors </h2>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import re
import time
import warnings
import numpy as np
from nltk.corpus import stopwords
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
warnings.filterwarnings("ignore")
import sys
import os 
import pandas as pd
import numpy as np
from tqdm import tqdm

# extract word2vec vectors
# https://github.com/explosion/spaCy/issues/1721
# http://landinghub.visualstudio.com/visual-cpp-build-tools
import spacy

This code is importing several libraries that are commonly used in data analysis and natural language processing. Here is an explanation of each line:

1. `import pandas as pd`: This line imports the pandas library and gives it the alias `pd`. Pandas is a library for data manipulation and analysis.
2. `import matplotlib.pyplot as plt`: This line imports the `pyplot` module from the `matplotlib` library and gives it the alias `plt`. Matplotlib is a library for creating visualizations.
3. `import re`: This line imports the `re` module, which provides regular expression matching operations.
4. `import time`: This line imports the `time` module, which provides various time-related functions.
5. `import warnings`: This line imports the `warnings` module, which allows you to issue warning messages.
6. `import numpy as np`: This line imports the numpy library and gives it the alias `np`. Numpy is a library for working with arrays of data.
7. `from nltk.corpus import stopwords`: This line imports the `stopwords` corpus from the Natural Language Toolkit (nltk) library. Stopwords are common words that are often removed from text data when processing natural language.
8. `from sklearn.preprocessing import normalize`: This line imports the `normalize` function from the scikit-learn library's preprocessing module. This function is used to scale input vectors individually to unit norm.
9. `from sklearn.feature_extraction.text import CountVectorizer`: This line imports the `CountVectorizer` class from scikit-learn's feature_extraction.text module. This class can be used to convert a collection of text documents into a matrix of token counts.
10. `from sklearn.feature_extraction.text import TfidfVectorizer`: This line imports the `TfidfVectorizer` class from scikit-learn's feature_extraction.text module. This class can be used to convert a collection of text documents into a matrix of TF-IDF features.
11. `warnings.filterwarnings("ignore")`: This line sets the warnings filter to ignore all warnings.
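12-14. These lines import the `sys` and `os` standard-library modules and then import pandas a second time under the same alias `pd`. The repeated pandas import is redundant but harmless.
15. `import numpy as np`: This line imports the numpy library again under the same alias `np`. It is likewise redundant.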
16. `from tqdm import tqdm`: This line imports the tqdm function from the tqdm library. Tqdm is a library for creating progress bars in Python loops and iterators.
17-19: These lines are comments that provide links to resources related to spaCy and Visual C++ Build Tools.
20. `import spacy`: This line imports the spaCy library, which is used for natural language processing.

In [2]:
# avoid decoding problems
df = pd.read_csv("train.csv")
 
# encode questions to unicode
# https://stackoverflow.com/a/6812069
# ----------------- python 2 ---------------------
# df['question1'] = df['question1'].apply(lambda x: unicode(str(x),"utf-8"))
# df['question2'] = df['question2'].apply(lambda x: unicode(str(x),"utf-8"))
# ----------------- python 3 ---------------------
df['question1'] = df['question1'].apply(lambda x: str(x))
df['question2'] = df['question2'].apply(lambda x: str(x))

Here is an explanation of each line of the code you provided:

1. `# avoid decoding problems`: This is a comment that explains the purpose of the following lines of code.
2. `df = pd.read_csv("train.csv")`: This line reads a CSV file named "train.csv" into a pandas DataFrame called `df`.
3. `# encode questions to unicode`: This is a comment that explains the purpose of the following lines of code.
4. `# https://stackoverflow.com/a/6812069`: This is a comment that provides a link to a Stack Overflow post related to encoding text in Python.
5. `# ----------------- python 2 ---------------------`: This is a comment that indicates that the following lines of code are specific to Python 2.
6. `# df['question1'] = df['question1'].apply(lambda x: unicode(str(x),"utf-8"))`: This line, which is commented out, shows how the values in the `question1` column of the DataFrame would be encoded as unicode strings in Python 2 using the `unicode` function.
7. `# df['question2'] = df['question2'].apply(lambda x: unicode(str(x),"utf-8"))`: This line, which is commented out, shows how the values in the `question2` column of the DataFrame would be encoded as unicode strings in Python 2 using the `unicode` function.
8. `# ----------------- python 3 ---------------------`: This is a comment that indicates that the following lines of code are specific to Python 3.
9. `df['question1'] = df['question1'].apply(lambda x: str(x))`: This line converts every value in the `question1` column of the DataFrame to a Python string. The `apply` method calls the lambda function on each element of the column. Missing values (`NaN`) become the literal string `'nan'`.
10. `df['question2'] = df['question2'].apply(lambda x: str(x))`: This line does the same for the `question2` column.
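The string conversion can be checked on a toy column. This is a minimal sketch with made-up data; the `fillna("")` variant is an alternative shown for comparison, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row column: one real question and one missing value.
toy = pd.DataFrame({"question1": ["What is TF-IDF?", np.nan]})

# Same conversion as in the cell above: NaN becomes the string 'nan'.
as_text = toy["question1"].apply(lambda x: str(x))

# Alternative: blank out missing values before converting.
cleaned = toy["question1"].fillna("").astype(str)

print(as_text.tolist())
print(cleaned.tolist())
```

If missing values are blanked this way, an empty string would likely yield an empty spaCy `Doc`, so later code that indexes `doc[0]` may need a guard.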

In [3]:
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# merge texts
questions = list(df['question1']) + list(df['question2'])

tfidf = TfidfVectorizer(lowercase=False, )
tfidf.fit_transform(questions)

# dict key:word and value:tf-idf score
word2tfidf = dict(zip(tfidf.get_feature_names(), tfidf.idf_))

This code is using the TfidfVectorizer class from scikit-learn's feature_extraction.text module to compute the TF-IDF scores for the words in two columns of a DataFrame. Here is an explanation of each line:

1. `from sklearn.feature_extraction.text import TfidfVectorizer`: This line imports the `TfidfVectorizer` class from scikit-learn's feature_extraction.text module. This class can be used to convert a collection of text documents into a matrix of TF-IDF features.
2. `from sklearn.feature_extraction.text import CountVectorizer`: This line imports the `CountVectorizer` class from scikit-learn's feature_extraction.text module. This class can be used to convert a collection of text documents into a matrix of token counts.
3. `# merge texts`: This is a comment that explains the purpose of the following line of code.
4. `questions = list(df['question1']) + list(df['question2'])`: This line creates a new list called `questions` that contains the values from the `question1` and `question2` columns of the DataFrame `df`.
5. `tfidf = TfidfVectorizer(lowercase=False, )`: This line creates an instance of the TfidfVectorizer class with the `lowercase` parameter set to `False`.
6. `tfidf.fit_transform(questions)`: This line fits the TfidfVectorizer to the `questions` list and transforms it into a matrix of TF-IDF features. The returned matrix is not stored; only the fitted vocabulary and IDF values are used afterwards.
7. `# dict key:word and value:tf-idf score`: This is a comment that explains the purpose of the following line of code.
8. `word2tfidf = dict(zip(tfidf.get_feature_names(), tfidf.idf_))`: This line creates a dictionary called `word2tfidf` that maps each word in the vocabulary to its IDF (inverse document frequency) value. Despite the name and the comment, these are not per-document TF-IDF scores: the keys come from the `get_feature_names` method and the values from the `idf_` attribute, which holds one IDF value per word. Note that `get_feature_names` was removed in scikit-learn 1.2; newer versions provide `get_feature_names_out` instead.
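To make the contents of `idf_` concrete, here is a small hand computation on a toy corpus. It is a sketch under two assumptions: that the vectorizer uses scikit-learn's default smoothed formula, ln((1 + n) / (1 + df)) + 1, and the default token pattern (tokens of two or more word characters). The corpus and variable names are hypothetical.

```python
import math
import re
from collections import Counter

# Toy corpus standing in for the merged question list (made-up data).
questions = [
    "How do I learn Python",
    "How do I learn Java",
    "How do magnets work",
]

# Tokens of two or more word characters, case preserved (as with lowercase=False).
token_re = re.compile(r"\b\w\w+\b")

# Document frequency: the number of questions that contain each token.
doc_freq = Counter()
for q in questions:
    doc_freq.update(set(token_re.findall(q)))

n_docs = len(questions)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
word2idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

for word in ("How", "learn", "Python"):
    print(word, round(word2idf[word], 3))
```

A word that appears in every question gets the minimum weight, while rarer words get larger weights, which is why the IDF values are useful for emphasising informative words.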

- After we find the TF-IDF scores, we convert each question to a weighted average of its word2vec vectors, using these scores as weights.
- Here we use a pre-trained GloVe model which comes free with spaCy: https://spacy.io/usage/vectors-similarity
- It is trained on Wikipedia and is therefore stronger in terms of word semantics.
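The weighted average described in these bullets can be sketched with toy data. The vectors, weights, and the helper name `weighted_average` are hypothetical; real vectors would come from the spaCy model and real weights from the fitted vectorizer.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors and IDF weights, for illustration only.
word_vectors = {
    "learn": np.array([1.0, 0.0, 0.0]),
    "Python": np.array([0.0, 1.0, 0.0]),
    "How": np.array([0.0, 0.0, 1.0]),
}
word2idf = {"learn": 1.0, "Python": 3.0, "How": 0.0}


def weighted_average(tokens, vectors, weights):
    """Return sum(w_i * v_i) / sum(w_i), or a zero vector if all weights are zero."""
    dim = len(next(iter(vectors.values())))
    total = np.zeros(dim)
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # skip tokens without a vector
        w = weights.get(tok, 0.0)  # words without a weight count as 0
        total += w * vectors[tok]
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else total


question_vec = weighted_average(["How", "learn", "Python"], word_vectors, word2idf)
print(question_vec)
```

Dividing by the sum of the weights keeps vectors of long and short questions on a comparable scale.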

In [5]:
!python -m spacy download en_core_web_lg
# !python -m spacy download en_core_web_sm
# !python -m spacy download en_core_web_md
!python -m spacy download en

Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
     ---------------------------------------- 587.7/587.7 MB ? eta 0:00:00

2023-04-04 17:13:48.193148: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
2023-04-04 17:13:48.193197: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

[notice] A new release of pip available: 22.3 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip



Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.5.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 12.8/12.8 MB 1.1 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.5.0
[!] As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use
the full pipeline package name 'en_core_web_sm' instead.
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


2023-04-04 17:26:18.044604: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
2023-04-04 17:26:18.044640: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

[notice] A new release of pip available: 22.3 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [22]:
nlp = spacy.load('en_core_web_lg')
vecs1 = []

for qu1 in tqdm(list(df['question1'])):
    doc1 = nlp(qu1) 
    mean_vec1 = np.zeros([len(doc1), len(doc1[0].vector)])
    for word1 in doc1:
        vec1 = word1.vector
        try:
            idf = word2tfidf[str(word1)]
        except:
            idf = 0
        mean_vec1 += vec1 * idf
    mean_vec1 = mean_vec1.mean(axis=0)
    vecs1.append(mean_vec1)

df['q1_feats_m'] = list(vecs1)

100%|█████████████████████████████████████████████████████████████████████████| 404290/404290 [40:05<00:00, 168.10it/s]


This code uses the spaCy library to compute an IDF-weighted word-vector representation for each question in the `question1` column of a DataFrame called `df`. Here is an explanation of each line:

1. `nlp = spacy.load('en_core_web_lg')`: This line loads the `en_core_web_lg` model from spaCy and assigns it to the variable `nlp`.
2. `vecs1 = []`: This line creates an empty list called `vecs1`.
3. `for qu1 in tqdm(list(df['question1'])):`: This line starts a for loop that iterates over the values in the `question1` column of the DataFrame `df`. The `tqdm` function is used to create a progress bar for the loop.
4. `doc1 = nlp(qu1)`: This line processes each question using the spaCy model and assigns the resulting Doc object to the variable `doc1`.
5. `mean_vec1 = np.zeros([len(doc1), len(doc1[0].vector)])`: This line creates an array of zeros with one row per token and one column per vector dimension, and assigns it to the variable `mean_vec1`.
6. `for word1 in doc1:`: This line starts a nested for loop that iterates over each token in the Doc object.
7. `vec1 = word1.vector`: This line extracts the word vector for each token and assigns it to the variable `vec1`.
8. `try:`: This line starts a try block that attempts to retrieve the IDF score for each token from a dictionary called `word2tfidf`.
9. `idf = word2tfidf[str(word1)]`: This line retrieves the IDF score for each token from the dictionary and assigns it to the variable `idf`.
10. `except:`: This line starts an except block that handles cases where a token is not found in the dictionary. A bare `except` catches every exception; `except KeyError:` would be the narrower choice.
11. `idf = 0`: This line sets the value of `idf` to 0 if a token is not found in the dictionary, so that token contributes nothing.
12. `mean_vec1 += vec1 * idf`: This line adds the token's word vector, scaled by its IDF score, to `mean_vec1`. Because the one-dimensional vector is broadcast across the rows, every row receives the same update.
13. `mean_vec1 = mean_vec1.mean(axis=0)`: After all tokens have been processed, this line takes the mean along axis 0. Since all rows are identical, the result equals the IDF-weighted sum of the token vectors; it is not divided by the number of tokens or by the total weight.
14. `vecs1.append(mean_vec1)`: This line appends the resulting vector to the list called `vecs1`.
15. `df['q1_feats_m'] = list(vecs1)`: After all questions have been processed, this line adds a new column called `'q1_feats_m'` to the DataFrame and sets its values to be equal to those in the list called `vecs1`.
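The accumulation pattern in the loop above can be checked numerically without loading spaCy. This sketch uses random stand-in vectors and made-up IDF values to compare the loop's output with a direct weighted sum.

```python
import numpy as np

# Hypothetical token vectors (3 tokens, 4 dimensions) and IDF weights.
rng = np.random.default_rng(0)
token_vecs = rng.normal(size=(3, 4))
idfs = np.array([1.5, 0.0, 2.0])

# Same accumulation pattern as in the notebook cell.
acc = np.zeros([len(token_vecs), token_vecs.shape[1]])
for vec, idf in zip(token_vecs, idfs):
    acc += vec * idf  # the 1-D vector is broadcast onto every row
loop_result = acc.mean(axis=0)  # all rows are identical, so the mean equals any row

# Direct computation of sum_i idf_i * v_i.
weighted_sum = idfs @ token_vecs

print(np.allclose(loop_result, weighted_sum))
```

Under these assumptions the two results agree, so a single one-dimensional accumulator would give the same features with less memory.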

In [23]:
vecs2 = []
for qu2 in tqdm(list(df['question2'])):
    doc2 = nlp(qu2) 
    mean_vec2 = np.zeros([len(doc2), len(doc2[0].vector)])
    for word2 in doc2:
        # word2vec
        vec2 = word2.vector
        # fetch df score
        try:
            idf = word2tfidf[str(word2)]
        except:
            #print word
            idf = 0
        # compute final vec
        mean_vec2 += vec2 * idf
    mean_vec2 = mean_vec2.mean(axis=0)
    vecs2.append(mean_vec2)
df['q2_feats_m'] = list(vecs2)

100%|█████████████████████████████████████████████████████████████████████████| 404290/404290 [37:37<00:00, 179.05it/s]


This code processes a column of text data named `question2` in a DataFrame `df`. Here's a line-by-line explanation:

1. `vecs2 = []`: Initializes an empty list `vecs2` to store the results.
2. `for qu2 in tqdm(list(df['question2'])):`: Iterates over each element in the `question2` column of the DataFrame `df`. The `tqdm` function is used to display a progress bar.
3. `doc2 = nlp(qu2)`: Processes the text data using the `nlp` function from the spaCy library.
4. `mean_vec2 = np.zeros(...)`: Initializes `mean_vec2`, a zero array whose columns match the word-vector dimension. The IDF-weighted word vectors are accumulated into it.
5. `for word2 in doc2:`: Iterates over each token in the processed text data.
6. `vec2 = word2.vector`: Retrieves the word vector for the current token.
7. `try: idf = word2tfidf[str(word2)]`: Attempts to retrieve the inverse document frequency (IDF) score for the current token from a dictionary named `word2tfidf`.
8. `except: idf = 0`: If the IDF score is not found in the dictionary, it is set to 0.
9. `mean_vec2 += vec2 * idf`: Multiplies the word vector by its IDF score and adds it to a running total.
10. `mean_vec2 = mean_vec2.mean(axis=0)`: Computes the mean of the accumulated word vectors along axis 0.
11. `vecs2.append(mean_vec2)`: Appends the computed mean vector to the list `vecs2`.
12. `df['q2_feats_m'] = list(vecs2)`: Adds a new column named `q2_feats_m` to the DataFrame and assigns it the values in the list `vecs2`.

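This code computes an IDF-weighted combination of word vectors for each element in the `question2` column of the DataFrame. Because the same weighted vector is added to every row of the accumulator, the final `mean(axis=0)` returns the weighted sum of the word vectors rather than a normalized average.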


In [None]:
#prepro_features_train.csv (Simple Preprocessing Features)
#nlp_features_train.csv (NLP Features)
if os.path.isfile('nlp_features_train.csv'):
    dfnlp = pd.read_csv("nlp_features_train.csv",encoding='latin-1')
else:
    print("download nlp_features_train.csv from drive or run previous notebook")

if os.path.isfile('df_fe_without_preprocessing_train.csv'):
    dfppro = pd.read_csv("df_fe_without_preprocessing_train.csv",encoding='latin-1')
else:
    print("download df_fe_without_preprocessing_train.csv from drive or run previous notebook")

This code checks if two files, `nlp_features_train.csv` and `df_fe_without_preprocessing_train.csv`, exist in the current working directory. If they do, the code reads them into two DataFrames `dfnlp` and `dfppro` using the `pandas` library. If either file is not found, a message is printed instructing the user to download it or run a previous notebook. Here's a line-by-line explanation:

1. The first comment indicates that the file `prepro_features_train.csv` contains simple preprocessing features.
2. `#nlp_features_train.csv (NLP Features)`: A comment indicating that the file `nlp_features_train.csv` contains NLP features.
3. `if os.path.isfile('nlp_features_train.csv'):`: Checks if the file `nlp_features_train.csv` exists in the current working directory using the `isfile` function from the `os.path` module.
4. `dfnlp = pd.read_csv("nlp_features_train.csv",encoding='latin-1')`: If the file exists, it is read into a DataFrame named `dfnlp` using the `read_csv` function from the `pandas` library. The file is assumed to be encoded in `'latin-1'`.
5. `else: print("download nlp_features_train.csv from drive or run previous notebook")`: If the file does not exist, a message is printed instructing the user to download it or run a previous notebook.
6. `if os.path.isfile('df_fe_without_preprocessing_train.csv'):`: Checks if the file `df_fe_without_preprocessing_train.csv` exists in the current working directory using the same method as before.
7. `dfppro = pd.read_csv("df_fe_without_preprocessing_train.csv",encoding='latin-1')`: If the file exists, it is read into a DataFrame named `dfppro` using the same method as before.
8. `else: print("download df_fe_without_preprocessing_train.csv from drive or run previous notebook")`: If the file does not exist, a message is printed instructing the user to download it or run a previous notebook.
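Printing a message leaves `dfnlp` or `dfppro` undefined when a file is missing, so the failure only surfaces in a later cell. One possible alternative is to fail early; the helper `require_file` below is hypothetical and not part of the notebook.

```python
from pathlib import Path


def require_file(path):
    """Return `path` as a Path, or raise FileNotFoundError with a hint."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"{p} not found: download it from drive or run the previous notebook"
        )
    return p


# Possible usage in the notebook (commented out because the files may not exist here):
# dfnlp = pd.read_csv(require_file("nlp_features_train.csv"), encoding="latin-1")
# dfppro = pd.read_csv(require_file("df_fe_without_preprocessing_train.csv"), encoding="latin-1")
```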


In [None]:
df1 = dfnlp.drop(['qid1','qid2','question1','question2'],axis=1)
df2 = dfppro.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)
df3 = df.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)
df3_q1 = pd.DataFrame(df3.q1_feats_m.values.tolist(), index= df3.index)
df3_q2 = pd.DataFrame(df3.q2_feats_m.values.tolist(), index= df3.index)

This code performs several operations on three DataFrames `dfnlp`, `dfppro`, and `df`. Here's a line-by-line explanation:

1. `df1 = dfnlp.drop(['qid1','qid2','question1','question2'],axis=1)`: Creates a new DataFrame `df1` by dropping the columns `['qid1','qid2','question1','question2']` from the DataFrame `dfnlp` using the `drop` method. The `axis=1` parameter indicates that columns are being dropped.
2. `df2 = dfppro.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)`: Creates a new DataFrame `df2` by dropping the columns `['qid1','qid2','question1','question2','is_duplicate']` from the DataFrame `dfppro` using the same method as before.
3. `df3 = df.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)`: Creates a new DataFrame `df3` by dropping the columns `['qid1','qid2','question1','question2','is_duplicate']` from the DataFrame `df` using the same method as before.
4. `df3_q1 = pd.DataFrame(df3.q1_feats_m.values.tolist(), index= df3.index)`: Creates a new DataFrame `df3_q1` by converting the values in the `q1_feats_m` column of the DataFrame `df3` to a list of lists using the `tolist` method and passing it to the `DataFrame` constructor from the `pandas` library. The index of the new DataFrame is set to be the same as that of the DataFrame `df3`.
5. `df3_q2 = pd.DataFrame(df3.q2_feats_m.values.tolist(), index= df3.index)`: Creates a new DataFrame `df3_q2` using the same method as before, but with the values in the `q2_feats_m` column of the DataFrame `df3`.
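The effect of `values.tolist()` followed by the `DataFrame` constructor is easier to see on a toy frame. The data below is made up; only the column name `q1_feats_m` is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one array-valued column, mimicking `q1_feats_m`.
toy = pd.DataFrame({"id": [10, 11]})
toy["q1_feats_m"] = [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]

# Each array becomes a row; each vector component becomes its own column (0, 1, 2).
expanded = pd.DataFrame(toy["q1_feats_m"].values.tolist(), index=toy.index)

print(expanded.shape)
print(list(expanded.columns))
```

The new columns are labelled with integers, which matters later when two such frames are merged and their labels collide.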



In [None]:
# dataframe of nlp features
df1.head()

In [None]:
# data before preprocessing 
df2.head()

In [None]:
# Questions 1 tfidf weighted word2vec
df3_q1.head()

In [None]:
# Questions 2 tfidf weighted word2vec
df3_q2.head()

In [None]:
print("Number of features in nlp dataframe :", df1.shape[1])
print("Number of features in preprocessed dataframe :", df2.shape[1])
print("Number of features in question1 w2v  dataframe :", df3_q1.shape[1])
print("Number of features in question2 w2v  dataframe :", df3_q2.shape[1])
print("Number of features in final dataframe  :", df1.shape[1]+df2.shape[1]+df3_q1.shape[1]+df3_q2.shape[1])

In [None]:
# storing the final features to csv file
if not os.path.isfile('final_features.csv'):
    df3_q1['id']=df1['id']
    df3_q2['id']=df1['id']
    df1  = df1.merge(df2, on='id',how='left')
    df2  = df3_q1.merge(df3_q2, on='id',how='left')
    result  = df1.merge(df2, on='id',how='left')
    result.to_csv('final_features.csv')

This code checks if a file named `final_features.csv` exists in the current working directory. If it does not, the code performs several operations to merge the DataFrames `df1`, `df2`, `df3_q1`, and `df3_q2` into a single DataFrame `result` and writes it to a CSV file named `final_features.csv`. Here's a line-by-line explanation:

1. `# storing the final features to csv file`: A comment indicating that the purpose of the code is to store the final features to a CSV file.
2. `if not os.path.isfile('final_features.csv'):`: Checks if the file `final_features.csv` does not exist in the current working directory using the `isfile` function from the `os.path` module.
3. `df3_q1['id']=df1['id']`: Adds a new column named `id` to the DataFrame `df3_q1` and assigns it the values from the `id` column of the DataFrame `df1`.
4. `df3_q2['id']=df1['id']`: Adds a new column named `id` to the DataFrame `df3_q2` and assigns it the values from the `id` column of the DataFrame `df1`.
5. `df1  = df1.merge(df2, on='id',how='left')`: Merges the DataFrames `df1` and `df2` into a single DataFrame using the `merge` method from the `pandas` library. The merge is performed on the `id` column and uses a left join.
6. `df2  = df3_q1.merge(df3_q2, on='id',how='left')`: Merges the DataFrames `df3_q1` and `df3_q2` into a single DataFrame using the same method as before.
7. `result  = df1.merge(df2, on='id',how='left')`: Merges the resulting DataFrames from steps 5 and 6 into a single DataFrame named `result` using the same method as before.
8. `result.to_csv('final_features.csv')`: Writes the resulting DataFrame to a CSV file named `final_features.csv` using the `to_csv` method.
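The merge sequence can be traced on miniature frames. All data here is invented, and the feature names `cwc_min` and `q1len` are placeholders for whatever columns the earlier notebooks produced.

```python
import pandas as pd

# Hypothetical miniature versions of the four feature frames.
nlp_feats = pd.DataFrame({"id": [0, 1], "is_duplicate": [0, 1], "cwc_min": [0.5, 0.9]})
basic_feats = pd.DataFrame({"id": [0, 1], "q1len": [20, 35]})
q1_vecs = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]])
q2_vecs = pd.DataFrame([[0.5, 0.6], [0.7, 0.8]])

# Attach the shared key, as the cell above does.
q1_vecs["id"] = nlp_feats["id"]
q2_vecs["id"] = nlp_feats["id"]

left = nlp_feats.merge(basic_feats, on="id", how="left")
# Both vector frames use the labels 0 and 1, so pandas appends _x / _y suffixes.
right = q1_vecs.merge(q2_vecs, on="id", how="left")
result = left.merge(right, on="id", how="left")

print(list(result.columns))
```

By default `to_csv` also writes the index as an extra first column; passing `index=False` avoids that if the index carries no information.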

