Sage Hahn

**Data Science Final Project**

Testing different pre-processing techniques

In [1]:
import nltk
import pickle

Right now I have a few main sources of text, reddit posts and a pubically avaliable sample of tweets, clips from news articles and blog posts. These sources are in a few different formats, and contain text of varying length and usefulness. I will begin by only loading one dataset, lets say the news clips from coursera, and will use it to explore which pre-processing steps I would like to apply universally to each dataset.

In [2]:
read_path = 'data/courseraNews.txt'

news_posts = []

with open(read_path, 'r') as f:
    news_posts = f.readlines()

In [3]:
len(news_posts)

1010242

Alright there are a little over a million samples of writing here. The first thing I'd like to try is using the sentence tokenizer via the NLTK library, in order to split the dataset up into sentences.

In [4]:
sentences = []

for line in news_posts:
    sent_text = nltk.sent_tokenize(line)
    
    for sent in sent_text:
        sentences.append(sent.lower())  #Also change to lowercase

In [5]:
len(sentences)

1950024

It appears this set of news data was already broken down a little bit, as the size once turned into sentences only roughly doubled. Lets not try with the much rougher reddit data,

In [6]:
read_path = 'data/redditTextBlock1.pkl'

#Read in the first post chunk
posts = []

with open(read_path, 'rb') as f:
    posts = pickle.load(f)

In [7]:
reddit_sentences = []

for line in posts[:1000000]: #Only investigating w/ one million vs all 10 million
    sent_text = nltk.sent_tokenize(line)
    
    for sent in sent_text:
        reddit_sentences.append(sent.lower()) #Also change to lowercase

In [8]:
len(reddit_sentences)

2309955

Okay, I was right, but only slightly, w/ 2.4 mil after turning into sentences. Regardless, to continue exploring pre-processing steps I'll combine the two sets.

In [9]:
for line in reddit_sentences:
    sentences.append(line)

In [10]:
len(sentences)

4259979

In [11]:
#Lets see a few random samples
print(sentences[50])
print(sentences[50000])
print(sentences[2500003])

but suggesting we raise the taxes on the wealthy will lead to more problems than it solves.
borders said it is losing about $2 million a day at the stores it plans to close.
the fact you can search for yours using fetchlands is really cool (and powerful).


In [12]:
from nltk.stem import WordNetLemmatizer

In [13]:
w_lem = WordNetLemmatizer()

In [14]:
lemma_test = []

for sent in sentences:
    words = nltk.word_tokenize(sent)

    sentence = ""
    
    for word in words:
        word = w_lem.lemmatize(word)
        
        sentence = sentence + " " + word
        
    lemma_test.append(sentence)

In [15]:
len(lemma_test)

4259979

The other option is to use stemming instead, so lets try that as well. 

In [16]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)  #Since well... english is the scope here, and lets ignore stopwords

In [17]:
stem_test = []

for sent in sentences:
    words = nltk.word_tokenize(sent)

    sentence = ""
    
    for word in words:
        word = stemmer.stem(word)
        
        sentence = sentence + " " + word
        
    stem_test.append(sentence)

In [18]:
len(stem_test)

4259979

In [19]:
#As a sanity check lets compare a few line

print(lemma_test[400])
print(stem_test[400])
print(" ")
print(lemma_test[405])
print(stem_test[405])
print(" ")
print(lemma_test[1400000])
print(stem_test[1400000])

 this year 's `` what the bos make '' survey cover the period for fiscal year ending from march 31 to dec. 31 , 2008 .
 this year 's `` what the boss make '' survey cover the period for fiscal year end from march 31 to dec. 31 , 2008 .
 
 the rest of his family is spread out geographically -- his parent live in portland , along with one brother .
 the rest of his famili is spread out geograph -- his parent live in portland , along with one brother .
 
 in early november , filandrinos encouraged her to apply for another temporary position with bjc healthcare .
 in earli novemb , filandrino encourag her to appli for anoth temporari posit with bjc healthcar .


The results look pretty muchas expected, the two definitely produce different sentences, though the variance changes.

Alright now that I have 2 test sets pre-proccessed a little differently it is time to decide which one will work better. Rather then run the whole rest of the expiriment on both sets, and have to create two full sets, which would be too intensive/take too long, I am simply going to choose the option which gives me more results from a keyword search. I'll have to be careful here though to pre-processs the key words.

In [30]:
#def getLemmaCount(key_words):
    
    count_list = []
    
    for line in lemma_test:
        for key in key_words:
            if ((w_lem.lemmatize(key) in line) and (line not in count_list)):
                count_list.append(line)
    
    return len(count_list)

def getStemCount(key_words):
    
    count_list = []
    
    for line in stem_test:
        for key in key_words:
            if ((stemmer.stem(key) in line) and (line not in count_list)):
                count_list.append(line)
    
    return len(count_list)


#Define some random key words
key_words_list = [["monkey","gorilla","chimp"],['sailor', 'seamen', 'seaman'], ['boat','vessel','ship']]


#Don't actually run
#for k_w in key_words_list:
#    print("Lemma Count for", k_w, " : ", getLemmaCount(k_w))
#    print("Stem Count for", k_w, " : ", getStemCount(k_w))

KeyboardInterrupt: 

So before I actually run that chunk...

In [31]:
key_words_list = [["monkey","gorilla","chimp"],['sailor', 'seamen', 'seaman'], ['boat','vessel','ship']]
for k_w in key_words_list:
    for key in k_w:
        print(w_lem.lemmatize(key))
        print(stemmer.stem(key))

monkey
monkey
gorilla
gorilla
chimp
chimp
sailor
sailor
seaman
seamen
seaman
seaman
boat
boat
vessel
vessel
ship
ship


Yeah so... for most nouns or search query-esque words they will most likely be the same... so the verdict is using stemming, since it runs slightly faster for me.