# Exploring Preprocessing Text Data


In [None]:
# Setup
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Preprocessing text data means transforming text into a format that a model can be trained on. Here we use the `volunteer` data set, which contains ads for volunteer opportunities in NYC. In this exercise we will prepare the `title` column by vectorizing it and then filtering the resulting sparse matrix down to its most important columns.

First I will show the dataset to gain familiarity.

In [None]:
volunteer = pd.read_csv('datasets/volunteer_opportunities.csv')
volunteer

Then, I'll select the `title` column and instantiate the `TfidfVectorizer`, which assigns a tf-idf weight to each word in each title.

In [None]:
title_text = volunteer["title"]

tfidf = TfidfVectorizer()
text_tfidf = tfidf.fit_transform(title_text)
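
To make the shape of the result concrete, here is a minimal sketch on three made-up titles. The `toy_titles` list and the `toy_` names are hypothetical stand-ins, not part of the volunteer data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles standing in for the volunteer `title` column
toy_titles = ["web designer", "urban gardening helper", "web content helper"]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_titles)

# One row per title, one column per distinct word
print(toy_matrix.shape)  # (3, 6)

# Words listed in the order of their columns
print(sorted(toy_tfidf.vocabulary_, key=toy_tfidf.vocabulary_.get))
```

The result is a sparse matrix, so only the non-zero weights are stored.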

To get a better understanding of what the tf-idf matrix looks like, we convert `text_tfidf` into a data frame using the method `.get_feature_names_out()`, which returns the words in the order of their columns in the vectorizer. (Older scikit-learn releases exposed this as `.get_feature_names()`, which was deprecated in version 1.0 and removed in 1.2.) The resulting `df` has one row for each row in our data set. Each column holds the weight of a word in that row, relative to the other words in the row and to how common the word is across the whole dataset.

In [None]:
feature_names = tfidf.get_feature_names_out()
df = pd.DataFrame(text_tfidf.toarray(), columns=feature_names)
print(df.head())

Next, we want to filter the columns in `text_tfidf` to only include those that are the most relevant for the model later on. The function `return_weights` returns the column indices of the top n highest-weighted words in a particular row. We will call it on every row later on to collect the most heavily weighted words for the whole data set.

In [None]:
vocab = {v:k for k,v in tfidf.vocabulary_.items()}

def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Transform that zipped dict into a series which makes the data easier to operate on
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print the column indices of the top 3 weighted words in the second row
print(return_weights(vocab, tfidf.vocabulary_, text_tfidf, 1, 3))

To break down the function above, there are some key things to know. First, the `.vocabulary_` attribute is a dictionary whose keys are the words from the dataset and whose values are the column positions of those words in the tf-idf matrix.

In [None]:
vocab_dict = tfidf.vocabulary_
vocab_dict

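Therefore, the `vocab` variable simply swaps the keys and values of that dict, so it maps column positions back to words. `return_weights` uses both mappings: `vocab` to label each weight with its word, and `original_vocab` to turn the top words back into column indices.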
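
As a small illustration of the two mappings, here is a sketch with a hand-written dictionary shaped like `tfidf.vocabulary_`. The words and column numbers are made up for the example.

```python
# Hypothetical word -> column mapping, shaped like tfidf.vocabulary_
toy_vocabulary = {"content": 0, "designer": 1, "web": 2}

# Swap keys and values to get a column -> word mapping
toy_vocab = {column: word for word, column in toy_vocabulary.items()}

print(toy_vocab[2])                  # web
print(toy_vocabulary[toy_vocab[2]])  # 2
```

The swap is only safe because every word has its own column, so no two keys collide.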

To understand `zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))`, it helps to break down `.indices` and `.data`. For a single row of `text_tfidf`, `.indices` holds the column positions of the words that appear in that row, and `.data` holds the corresponding tf-idf weights. If you cross-reference the output shown below with the output of `.vocabulary_`, you'll see that for the second row the column position of "designer" is 297 and the column position of "web" is 1095, which matches the title in the second row of the data set.

In [None]:
print(text_tfidf[1].indices)
print(text_tfidf[1].data)
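
The same two attributes can be seen on a tiny hand-built matrix. This is a sketch that assumes a SciPy CSR matrix, which is the format `fit_transform` returns; the values are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2 x 4 weight matrix with mostly zero entries
toy = csr_matrix(np.array([[0.0, 0.6, 0.0, 0.8],
                           [0.5, 0.0, 0.0, 0.0]]))

row = toy[0]
print(row.indices)  # [1 3]      columns holding non-zero values
print(row.data)     # [0.6 0.8]  the values in those columns
```

The zeros are never stored, which is why a row only reports the columns of the words it actually contains.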


When you zip these together for a particular row, `zip` pairs each column position with its weight as a tuple. Wrapping the result in `dict` makes the column position of the word the key and the weight of that word the value.

In [None]:
zipped = zip(text_tfidf[1].indices, text_tfidf[1].data)
print(list(zipped))
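
Once the pairs are in a dict, picking the heaviest words is a sort by value. The sketch below uses plain Python instead of the pandas Series in `return_weights`, with made-up column positions and weights.

```python
# Hypothetical column positions and weights for one row
indices = [1, 3, 4]
weights = [0.6, 0.8, 0.1]

# Column position -> weight
col_to_weight = dict(zip(indices, weights))

# Column positions of the two largest weights, heaviest first
top_2 = sorted(col_to_weight, key=col_to_weight.get, reverse=True)[:2]
print(top_2)  # [3, 1]
```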

Finally, we will use `words_to_filter` to collect the top column indices across all rows and filter the matrix with them. The filtered matrix now has only 1061 columns.

In [None]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call the return_weights function and extend filter_list
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf.vocabulary_, text_tfidf, 3)

# Filter the columns in text_tfidf to only those in filtered_words
filtered_text = text_tfidf[:, list(filtered_words)]

In [None]:
filtered_text.shape
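
The column selection itself can be tried on a small matrix. This sketch assumes a SciPy CSR matrix and a made-up set of columns to keep; sorting the set is an addition here, used only because a set has no guaranteed order and sorting keeps the column order reproducible.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2 x 4 weight matrix
toy = csr_matrix(np.array([[0.0, 0.6, 0.0, 0.8],
                           [0.5, 0.0, 0.0, 0.0]]))

# Hypothetical set of column indices to keep
keep = sorted({3, 0})

filtered = toy[:, keep]
print(filtered.shape)  # (2, 2)
print(filtered.toarray())
```

All rows are kept and only the selected columns survive, which is the same operation applied to `text_tfidf` above.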

Of course, preprocessing will look different depending on the model you end up using and the other data available. This example is meant to illustrate the steps that can be involved in preprocessing text data.