# Exploring Preprocessing Text Data


In [1]:
# Setup
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Preprocessing text data means transforming raw text into a format that a model can be trained on. Here we use the `volunteer` data set, which contains ads for volunteer opportunities in NYC. In this exercise we prepare the `title` column by vectorizing it into a sparse TF-IDF matrix and then filtering that matrix down to its most important columns.

First, I'll display the dataset to get familiar with it.

In [2]:
volunteer = pd.read_csv('datasets/volunteer_opportunities.csv')
volunteer

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,amsl,amsl_unit,org_title,org_content_id,addresses_count,locality,region,postalcode,primary_loc,display_url,recurrence_type,hours,created_date,last_modified_date,start_date_date,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,,,Center For NYC Neighborhoods,4426,1,,NY,,,/opportunities/4996,onetime,0,January 13 2011,June 23 2011,July 30 2011,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,,,Bpeace,37026,1,"5 22nd St\nNew York, NY 10010\n(40.74053152272...",NY,10010.0,,/opportunities/5008,onetime,0,January 14 2011,January 25 2011,February 01 2011,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,,,Street Project,3001,1,,NY,10026.0,,/opportunities/5016,onetime,0,January 19 2011,January 21 2011,January 29 2011,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,,,Oxfam America,2170,1,,NY,2114.0,,/opportunities/5022,ongoing,0,January 21 2011,January 25 2011,February 14 2011,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,,,Office of Recycling Outreach and Education,36773,1,,NY,10455.0,,/opportunities/5055,onetime,0,January 28 2011,February 01 2011,February 05 2011,February 05 2011,approved,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
660,5640,50193,3,0,Volunteer for NYLAG's Food Stamps Project,197,"Volunteers needed to file for fair hearings, d...",,2.0,Helping Neighbors in Need,,,New York Legal Assistance Group,104,1,"7 Hanover Square\nNew York, NY 10004\n(40.7043...",NY,10004.0,,/opportunities/5640,ongoing,0,August 16 2011,August 17 2011,August 16 2011,November 15 2012,approved,,,,,,,,
661,5218,38711,10,0,Iridescent Science Studio Open House Volunteers,113,Come out to the South Bronx to help us hold ou...,,1.0,Strengthening Communities,,,Iridescent,38544,1,"890 Garrison Ave\nBronx, NY 10474\n(40.8171141...",NY,10474.0,,/opportunities/5218,onetime,0,March 21 2011,March 21 2011,April 13 2011,April 13 2011,approved,,,,,,,,
662,5541,47820,1,0,French Translator,145,Volunteer needed to translate written material...,,2.0,Helping Neighbors in Need,,,"Services for the UnderServed, Inc.",38951,1,"305 Seventh Avenue\nNew York, NY 10001\n(40.74...",NY,10001.0,,/opportunities/5541,ongoing,0,July 20 2011,August 23 2011,July 20 2011,September 01 2011,approved,,,,,,,,
663,5398,40722,2,0,Marketing & Advertising Volunteer,330,World Cares Center is looking for individuals ...,,1.0,Strengthening Communities,,,World Cares Center,36979,1,"520 8th Ave\nNY, NY 10018\n(40.75376054978079,...",NY,10018.0,,/opportunities/5398,ongoing,0,June 01 2011,August 09 2011,June 01 2011,May 31 2012,approved,,,,,,,,


Then I'll select the `title` column and instantiate a `TfidfVectorizer`, which assigns each word a weight based on how often it appears in a title and how rare it is across all titles.

In [3]:
title_text = volunteer["title"]

tfidf = TfidfVectorizer()
text_tfidf = tfidf.fit_transform(title_text)
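
To make the weights less opaque, here is a minimal pure-Python sketch of how a TF-IDF weight can be computed. It assumes scikit-learn's documented defaults (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and a plain whitespace tokenizer, which is cruder than the vectorizer's own tokenizer. The `toy_titles` corpus and the `tfidf_rows` helper are made up for illustration and are not part of the exercise.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the volunteer data set
toy_titles = ["web designer", "web volunteer", "volunteer needed"]

def tfidf_rows(docs):
    """Return one {word: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed inverse document frequency
        raw = {w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
               for w, c in counts.items()}
        # Scale the row so its weights have unit Euclidean length
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({w: v / norm for w, v in raw.items()})
    return rows

rows = tfidf_rows(toy_titles)
for title, row in zip(toy_titles, rows):
    print(title, {w: round(v, 3) for w, v in row.items()})
```

Within each toy title, the word that appears in fewer titles overall receives the larger weight, which is the property the filtering step later relies on.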

To get a better understanding of what the TF-IDF matrix looks like, we convert `text_tfidf` into a data frame using the vectorizer's `.get_feature_names_out()` method, which returns the words in the same order as the matrix columns. Older scikit-learn releases exposed this as `.get_feature_names()`, which was removed in version 1.2. The resulting data frame has one row per row in our data set and one column per word. Each cell holds the weight of that word in that title, which depends on how often it appears in the title and how rare it is across all titles.

In [4]:
feature_names = tfidf.get_feature_names_out()
df = pd.DataFrame(text_tfidf.toarray(), columns=feature_names)
print(df.head())

    11  125th  14th   17  175th   20  ...  young  your  youth  zion  zoo  zumba
0  0.0    0.0   0.0  0.0    0.0  0.0  ...    0.0   0.0    0.0   0.0  0.0    0.0
1  0.0    0.0   0.0  0.0    0.0  0.0  ...    0.0   0.0    0.0   0.0  0.0    0.0
2  0.0    0.0   0.0  0.0    0.0  0.0  ...    0.0   0.0    0.0   0.0  0.0    0.0
3  0.0    0.0   0.0  0.0    0.0  0.0  ...    0.0   0.0    0.0   0.0  0.0    0.0
4  0.0    0.0   0.0  0.0    0.0  0.0  ...    0.0   0.0    0.0   0.0  0.0    0.0

[5 rows x 1136 columns]


Next, we want to keep only the columns in `text_tfidf` that are most relevant for the model. The function `return_weights` returns the column positions of the top n most heavily weighted words in a given row. Later we will call it on every row to collect the most heavily weighted words for the whole data set.

In [7]:
vocab = {v:k for k,v in tfidf.vocabulary_.items()}

def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    # Map each non-zero column position in the chosen row to its tf-idf weight
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Transform that zipped dict into a series which makes the data easier to operate on
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
print(return_weights(vocab, tfidf.vocabulary_, text_tfidf, 1, 3))

[1095, 297]


Only two positions come back even though we asked for three, because the title in row 1, "Web designer", contains only two words. To break down the function above, there are some key things to know. First, the `.vocabulary_` attribute is a dictionary whose keys are all the words in the data set and whose values are the column positions of those words in the TF-IDF matrix.

In [20]:
vocab_dict = tfidf.vocabulary_
vocab_dict

{'volunteers': 1086,
 'needed': 690,
 'for': 404,
 'rise': 869,
 'up': 1061,
 'stay': 959,
 'put': 822,
 'home': 493,
 'rescue': 855,
 'fair': 375,
 'web': 1095,
 'designer': 297,
 'urban': 1063,
 'adventures': 43,
 'ice': 515,
 'skating': 930,
 'at': 98,
 'lasker': 587,
 'rink': 868,
 'fight': 392,
 'global': 447,
 'hunger': 512,
 'and': 75,
 'support': 986,
 'women': 1108,
 'farmers': 380,
 'join': 562,
 'the': 1012,
 'oxfam': 739,
 'action': 31,
 'corps': 255,
 'in': 523,
 'nyc': 710,
 'stop': 962,
 'swap': 989,
 'queens': 825,
 'staff': 951,
 'development': 300,
 'trainer': 1037,
 'claro': 213,
 'brooklyn': 155,
 'volunteer': 1084,
 'attorney': 101,
 'cents': 188,
 'ability': 23,
 'community': 235,
 'health': 480,
 'advocates': 48,
 'supervise': 984,
 'children': 202,
 'highland': 491,
 'park': 748,
 'garden': 433,
 'worldofmoney': 1118,
 'org': 727,
 'youth': 1132,
 'amazing': 67,
 'race': 826,
 'qualified': 824,
 'board': 142,
 'member': 649,
 'seats': 899,
 'available': 106,
 ...

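The `vocab` variable simply inverts this dictionary, so that column positions are the keys and words are the values. `return_weights` uses both mappings: `vocab` to label the weights with words, and `original_vocab` to turn the top words back into column positions.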
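
As a small illustration of that inversion, the sketch below uses three entries copied from the `.vocabulary_` output above. The names `word_to_column` and `column_to_word` are chosen for this example only.

```python
# A few entries copied from the tfidf.vocabulary_ output shown above
word_to_column = {"web": 1095, "designer": 297, "volunteers": 1086}

# Swap keys and values so a column position can be used to look up its word
column_to_word = {column: word for word, column in word_to_column.items()}

print(column_to_word[297])   # designer
print(column_to_word[1095])  # web
```

The inversion is safe here because every word has its own column, so no two keys collide after the swap.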

To understand `zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))`, it helps to break down `.indices` and `.data`. For a single row of `text_tfidf`, `.indices` holds the column positions of the words that appear in that row, and `.data` holds the corresponding TF-IDF weights. If you cross-reference the output below with the `.vocabulary_` output, you'll see that for the second row (index 1) the column position of "designer" is 297 and the column position of "web" is 1095, which matches the title "Web designer" in that row of the data set.

In [21]:
print(text_tfidf[1].indices)
print(text_tfidf[1].data)


[ 297 1095]
[0.68245886 0.73092401]
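
A small stand-in matrix can make these two attributes more concrete. This sketch assumes NumPy and SciPy are installed (scikit-learn depends on both) and uses a made-up 3 x 5 matrix named `matrix` in place of `text_tfidf`, stored in the same CSR sparse format.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 5 matrix standing in for text_tfidf; most entries are zero
dense = np.array([
    [0.0, 0.6, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5, 0.0],
])
matrix = csr_matrix(dense)

row = matrix[0]           # still a sparse object, holding a single row
print(row.indices)        # column positions of the non-zero entries in this row
print(row.data)           # the values stored at those positions
print(matrix[1].indices)  # an all-zero row has no stored entries
```

Only the non-zero entries are stored, which is why a short title produces just a handful of index and weight pairs despite the matrix having more than a thousand columns.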


When you zip these together for a particular row, `zip` pairs each column position with its weight as a tuple. Wrapping the result in `dict` makes the column position the key and the weight the value.

In [11]:
zipped = zip(text_tfidf[1].indices, text_tfidf[1].data)
print(list(zipped))

[(297, 0.6824588570832413), (1095, 0.7309240099960025)]
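
The pairing and ranking can be sketched for row 1 using the values printed above. Note that this version ranks with the built-in `sorted` rather than the pandas `Series` used in `return_weights`, and the names `position_to_weight` and `ranked` are illustrative only.

```python
# Values copied from the output for row 1 above
indices = [297, 1095]
weights = [0.6824588570832413, 0.7309240099960025]

# Pair each column position with its weight, then build a lookup dict
position_to_weight = dict(zip(indices, weights))

# Rank the column positions by weight, highest first, and keep the top n
top_n = 3
ranked = sorted(position_to_weight, key=position_to_weight.get, reverse=True)[:top_n]
print(ranked)  # [1095, 297]
```

Slicing with `[:top_n]` never fails when a row has fewer than `top_n` words; it just returns what is there.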


Finally, we use `words_to_filter` to collect the top words from every row and filter the matrix, which leaves 1061 of the original 1136 columns.

In [25]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call the return_weights function and extend filter_list
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf.vocabulary_, text_tfidf, 3)

# Filter the columns in text_tfidf to only those in filtered_words
# (sorted so the kept columns stay in a reproducible order)
filtered_text = text_tfidf[:, sorted(filtered_words)]

In [27]:
filtered_text.shape

(665, 1061)
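
The same collect-and-deduplicate idea can be sketched without scikit-learn. The example below is a simplified stand-in, not the function above: each hypothetical row in `toy_rows` is a plain dict from column position to weight, and `top_columns` and `columns_to_keep` are names invented for this sketch.

```python
# Hypothetical rows: each dict maps a column position to its weight
toy_rows = [
    {0: 0.2, 3: 0.9, 4: 0.4},
    {1: 0.7, 3: 0.7},
    {2: 0.1, 3: 0.5, 4: 0.8},
]

def top_columns(row, top_n):
    """Return the column positions of the top_n largest weights in one row."""
    return sorted(row, key=row.get, reverse=True)[:top_n]

def columns_to_keep(rows, top_n):
    """Union of the top columns over all rows, sorted for a stable column order."""
    keep = set()
    for row in rows:
        keep.update(top_columns(row, top_n))
    return sorted(keep)

kept = columns_to_keep(toy_rows, top_n=2)
print(kept)  # [1, 3, 4]
```

Columns 0 and 2 are dropped because they never rank in the top two of any row, which mirrors how the real matrix shrinks from 1136 to 1061 columns.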

Of course, preprocessing will look different depending on the model you end up using and the other data available. This example walks through the steps that can be involved in preprocessing text data.