## DS-UA 301 Assignment 1

Welcome! In this notebook, we will explore and practice several foundational concepts and techniques in Natural Language Processing, specifically in the "pre-processing" stage: 
+ Finding and getting a corpus up and running in your Jupyter notebook
+ Cleaning & tokenizing text data
+ POS tagging 
+ Stemming & lemmatization

We'll also *start to* think about taking a hypothesis-driven approach to our NLP work. This will help for the project, as well as life in general, as we've elaborated ad nauseam in lecture already.

Some tips: 
+ You may use any data you like (built-in NLTK or something else!) as long as you are able to complete all the prompts below.
+ You're also welcome to try the assignment with both an NLTK corpus *and* something you collect yourself! No extra credit, just extra learning :).
+ While we will be grading you for correctness and completeness (i.e., you must correctly complete the specific tasks we ask of you), there is rarely a single "right answer" in much of data science, NLP very much included, when it comes to *which* strategy to use.
+ What this means for you: while we do need you to, e.g., remove all capital letters if we ask you to, for questions about whether removing capital letters is appropriate, we care more that you thoughtfully and transparently explain your reasoning than whether it's "right" to keep them or not.
+ Of course, we want you to use your best judgement, but usually there isn't a single correct answer floating around out there for questions about which method to use. It's about balancing tradeoffs, and we want to see that you understand the tradeoffs.
+ In other words, to loosely quote a wise data scientist (not me, ha ha ha (but really it isn't)): **Keep the precision but let go of perfectionism!**
+ Also, 2-3 sentences (at most!) for the non-code questions should generally be enough. No need for essays!

Finally, each question is worth 1 whole point, for 45 points total.

## 1. Finding some text you'd like to represent as data

This step may be less trivial an exercise than it seems at first glance. As with structured, numeric data, finding data that's both usable *and* addresses the question you're interested in is often no small feat.

For the project, you'll eventually need to use your own data (alas, NLTK library data is not allowed for that). But for this assignment, especially if it's your first go at NLP, you're more than welcome to use something built in so you can focus on the techniques.

That said, it may make your life (eventually) easier if you start exploring for and with your own data now!

(a) What text will you be investigating in this assignment?

**The King James Version (KJV) of the Bible**


(b) Why did you choose this text? What do you hope to learn from it? (Even if you're working with an NLTK corpus, surely you have a reason you were drawn to one!)

**I grew up in a Catholic household, yet attended an Episcopal school, and this was the version of the Bible used in that school. I never actually read this version of the bible, or any other version for that matter, but I have always understood this version to be the most lexically complex, yet beautifully translated, of them all.**

(c) Imagine, hypothetically, that you needed to form a hypothesis about this data (you won't have to test it in this assignment). What are three hypotheses you could test using this data? (They can be three related hypotheses, or not! You can also imagine that you would eventually add other data if needed, or not.)

**1. Some of the most used words in the KJV Bible are no longer used in the English lexicon.**

**2. The KJV Bible is the source of many words in popular use today.**

**3. Bible versions today are less faithful to their original translation to English (KJV) than the KJV was to its Greek and Latin sources.**

(d) For *one* of the hypotheses above, how would you test it, and how would you know if your hypothesis is wrong? In other words, what sort of result would disprove it in the context of your study? (Again, you don't have to test anything in this assignment, but (obviously) this is practice for the project!)

**A good point of comparison for the popularity of words, I believe, would be Shakespeare's works from before 1611, when the KJV was published, along with some popular texts that came after 1611. You could compare the KJV's vocabulary with Shakespeare's to see how its language relates to what came before it, and with works published a few decades or even centuries later to see which of its words stayed in use.**
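As a rough illustration of how the first hypothesis could be checked, here is a minimal sketch. It assumes both texts are already tokenized; the function name `share_of_top_words_missing` and the two toy samples are hypothetical stand-ins, not real corpus data. A share close to zero against a modern reference text would count as evidence against the hypothesis.

```python
from collections import Counter

def share_of_top_words_missing(source_tokens, reference_tokens, top_n=100):
    """Fraction of the top_n most frequent source words absent from the reference."""
    source_counts = Counter(w.lower() for w in source_tokens if w.isalpha())
    reference_vocab = {w.lower() for w in reference_tokens}
    top_words = [w for w, _ in source_counts.most_common(top_n)]
    if not top_words:
        return 0.0
    missing = [w for w in top_words if w not in reference_vocab]
    return len(missing) / len(top_words)

# Toy stand-ins for the KJV and a later reference text (hypothetical data).
kjv_sample = "And he saith unto them , Verily I say unto thee".split()
modern_sample = "and he said to them , truly I say to you".split()
print(share_of_top_words_missing(kjv_sample, modern_sample, top_n=5))
```

In the notebook, `kjv_words` could be passed as the source, with any later text (tokenized the same way) as the reference.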

(e) Ok, it's time to get it ready to go in Python. Import, load, or do whatever's necessary so your text is workable in one document here (you can have more than one document, but it's not needed for this assignment).

In [1]:
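import statistics
import string

import nltk
from nltk import word_tokenize  # used for tokenizing in Question 5

# Word-level and sentence-level views of the KJV from NLTK's Gutenberg corpus
kjv_words = nltk.corpus.gutenberg.words('bible-kjv.txt')
kjvsentences = nltk.corpus.gutenberg.sents('bible-kjv.txt')

# Rebuild one raw string: tokens joined by spaces, one sentence per line
text = '\n'.join(' '.join(sent) for sent in kjvsentences)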

(f) How many words are we working with here? (I.e., how many are in the corpus you'll be using in this assignment?)

In [2]:
len(kjv_words)

1010654

## 2. Capitalization

(a) Hooray! You're ready to start pre-processing. First, consider capitalizations. Before we do anything with them, think about what you'd like to do with this data, and what sort of data it is. Do you think you'll want to keep the capitalizations, or change everything to lowercase? Or something else?

I think changing everything to lowercase will make this text easier to work with.

(b) Briefly explain your answer to 2a above.

I don't think this Bible version has many names that could be confused with other words (e.g., Peter, Isaiah, David), and the mentions of God mostly relate to the one Christian God. I believe there's little to no reason to keep the capitalization of words in the text, since keeping it would only create more work for the program.

(c) For fun(!), regardless of your answer to 2a, write some code to remove all capitalizations. (If you want to keep the capitalization in later stages, just comment it out after you confirm it works, but leave the code.)

In [3]:
# Lowercase every token in every sentence
lower_kjvsentences = [[word.lower() for word in sent] for sent in kjvsentences]

(d) For even more(!) fun, regardless of your answer to 2a, write some code to instead remove only capitalizations that appear at the beginning of each sentence. (Again, if this doesn't suit your overall analysis, just comment it out after you're done.)

In [4]:
# Most, if not all, sentences in kjvsentences start with three tokens that are
# either a number or a colon (chapter, ':', verse), so the first real word is
# the 4th token (index 3). Lowercase only that token.
lower_first_kjvsentences = [
    [word.lower() if j == 3 else word for j, word in enumerate(sent)]
    for sent in kjvsentences
]
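The fixed index above relies on every sentence starting with a chapter:verse prefix. A slightly more defensive sketch, assuming the same tokenized-sentence format, lowercases the first alphabetic token wherever it occurs. The helper name `lowercase_sentence_start` is hypothetical.

```python
def lowercase_sentence_start(tokens):
    """Lowercase only the first alphabetic token of a tokenized sentence."""
    result = list(tokens)
    for index, token in enumerate(result):
        if token.isalpha():
            result[index] = token.lower()
            break
    return result

example = ['1', ':', '1', 'In', 'the', 'beginning', 'God', 'created']
print(lowercase_sentence_start(example))
```

This also handles sentences without a verse prefix, such as the title line, at the cost of one extra check per token.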

(e) So far we've tried the capitalization options we discussed in lecture. Now, think up your own capitalization rule (anything at all is fine, though try to think of what might be most useful for your text and goals). What rule will you implement?

Let's lowercase every word in the Bible except God: that word keeps whatever form it has in the text, either God or god. (The code below also leaves Jesus capitalized.)

(f) Go on, implement it!

In [5]:
# Lowercase everything except the exact tokens 'God' and 'Jesus'
keep_capitalized = {'God', 'Jesus'}
God_kjvsentences = [
    [word if word in keep_capitalized else word.lower() for word in sent]
    for sent in kjvsentences
]

(g) Now that you've explored some capitalization options, which one do you think is the "best" fit for your text and goals? (And here's where you can comment out anything you won't use going forward and then re-run what's left so it's in the shape you want for the next questions.)

I feel the rule from (e) will be most useful here because there are several instances where god is used instead of God: the lowercase form alludes to a "false" god, while the capitalized form refers to the Christian God.
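To check how often each form actually occurs, a small counting sketch could look like the following. It assumes tokenized sentences that have not been lowercased yet; `case_variant_counts` and the toy sentences are hypothetical.

```python
from collections import Counter

def case_variant_counts(sentences, word):
    """Count each capitalization variant of `word` across tokenized sentences."""
    target = word.lower()
    return Counter(
        tok for sent in sentences for tok in sent if tok.lower() == target
    )

# Toy sentences standing in for the corpus (hypothetical data).
toy_sentences = [
    ['In', 'the', 'beginning', 'God', 'created'],
    ['Thou', 'shalt', 'worship', 'no', 'other', 'god'],
    ['for', 'the', 'LORD', 'is', 'a', 'jealous', 'God'],
]
print(case_variant_counts(toy_sentences, 'god'))
```

Running it on the original `kjvsentences` rather than a lowercased copy would show whether the two forms are frequent enough to be worth keeping apart.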

(h) Why do you think the rule you chose in g is the most appropriate for your work? Give one strength and one weakness of the rule you've chosen.

One strength is that we'll be able to tell how the Bible speaks of god compared to God, either by the sentiment of the sentence or by how comparative the phrasing is, e.g., "my God is the most...".

I don't have enough experience with NLP yet to know a weakness of lowercasing every other word, but we'll see if something goes wrong later!

## 3. Punctuation

(a) We're going to go through a similar process for punctuation as we did with capitalization, but don't worry, we'll pick up the pace a bit. First, do you think you'll want punctuation, or not, or something in between? Briefly explain your answer.

I don't know if punctuation is necessary, because most Bible verses are split by commas and/or spaces anyway. It might be useful to get rid of it, then.

(b) Remove whatever punctuation you think is appropriate for your work. You may need to try a few versions. There's no need to show us anything but the final one in this case.

In [6]:
# Drop comma and period tokens from each sentence
non_punctuation = [
    [tok for tok in sent if tok not in {',', '.'}]
    for sent in God_kjvsentences
]

(c) What punctuation rule did you land on? Why did you decide on this one? Briefly walk us through the options you considered, if any. (If it was crystal clear to you what you wanted to do with the punctuation, just explain why.)

I found that there are many commas in several verses, so taking these out is useful as well. There are also the colons in verse references (3:2), so I'll take those out in the next cell.
There aren't many other punctuation marks, at least that I know of, so handling these three (periods, commas, colons) should be enough.

I also find it useful to remove the leading numbers (chapter/verse) of each sentence, because those aren't useful right now. So I'll take those out as well in the next problem (d).

(d) This isn't *exactly* punctuation, but let's do it here: Go ahead and remove any other miscellaneous text, symbols, or tags, if there are any, that you don't need. If there aren't any, just say so! (You don't even have to explain why you're doing it (for once!). We *get* it.)

In [7]:
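# Drop tokens found in string.punctuation. Note that `in` on a string is a
# substring test, so multi-character runs such as '()' are dropped as well.
sentences = [
    [tok for tok in sent if tok not in string.punctuation]
    for sent in God_kjvsentences
]
# With the colon gone, the first two tokens are the chapter and verse numbers
sentences = [sent[2:] for sent in sentences]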
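Slicing off the first two tokens assumes every sentence begins with a verse reference, so a sentence without one (the title line, for example) loses two real words. A more cautious sketch, assuming references look like `['1', ':', '1', ...]` and that it runs before the colon is removed, checks the shape first. The helper name `strip_verse_reference` is hypothetical.

```python
def strip_verse_reference(tokens):
    """Drop a leading chapter ':' verse prefix such as ['1', ':', '1'] if present."""
    has_reference = (
        len(tokens) >= 3
        and tokens[0].isdigit()
        and tokens[1] == ':'
        and tokens[2].isdigit()
    )
    return list(tokens[3:]) if has_reference else list(tokens)

print(strip_verse_reference(['1', ':', '1', 'in', 'the', 'beginning']))
print(strip_verse_reference(['the', 'king', 'james', 'bible']))
```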

## 4. Stop words

(a) It's time to talk stop words! Before we get into which stop word list to use or what other stop words there might be for your text, share what you think will be right for you: drop most "standard" stop words, keep them, add some stop words of your own, or some combination? Briefly explain your reasoning.

I think the Bible is mostly stop words, so I'll be taking out most of them. Most are really just redundant ("And Jesus said", "and he spoke to", "and he went to").

(b) It's sometimes a little tricky to tell where the most useful line is between the benefit of simplification and the loss of substance when it comes to stop words in a particular text. Explore their usefulness for your text by first dropping all stop words from the NLTK stop word list.

In [8]:
stop_words = set(nltk.corpus.stopwords.words('english'))
filtered_text = [
    [word for word in line if word not in stop_words]
    for line in sentences
]

(c) How do you think the NLTK stop words did? Do you think removing them is generally helpful for your analysis? Why or why not? 

(By the way, I know you're not doing a *full* analysis, so it's ok if at this point you're thinking -- hang on, well, if eventually I want to do X, I wouldn't need stop words, but if I wanted to do Y, I probably would -- feel free to tell us that, or just pick a general direction from your hypotheses above and steer towards that. 

I *also* know we haven't gotten far in terms of techniques so it may not be obvious what you even *could* eventually do. But trust your curiosity and instincts! It's ok if it's not something you ultimately end up doing, or even is feasible long term. You can also be quite general, like "understand trends in X over time" and leave it more or less as that.)

In [9]:
mean_kjvsentence_length = statistics.mean([len(sent) for sent in kjvsentences])
mean_filteredsentence_length = statistics.mean([len(sent) for sent in filtered_text])
mean_sentence_length = statistics.mean([len(sent) for sent in sentences])

In [10]:
print("Unfiltered Text Mean sentence length: "+str(mean_kjvsentence_length),
      "\nMean sentence length with punctuation taken out: " +str(mean_sentence_length), 
     '\nMean sentence length with punctuation AND stopwords taken out ' + str(mean_filteredsentence_length))

Unfiltered Text Mean sentence length: 33.57319868451649 
Mean sentence length with punctuation taken out: 26.38215460253131 
Mean sentence length with punctuation AND stopwords taken out 12.720227219878417


Comparing the average sentence lengths, we see a massive drop in the average number of words. Keep in mind that about 5 "words" were dropped along with the punctuation: brackets (counted as two words), the two numbers at the start of each sentence (chapter and verse), and the colon between them. That alone lowers the mean by about 5. There are also many commas, and taking these out as well gives a drop of roughly 7 words in the mean.

Taking the stop words out then roughly halves the mean sentence length, which is what I meant about stop words making up the majority of words in the Bible.
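The same point can be checked directly by measuring what share of all tokens are stop words. Below is a minimal sketch on toy data; the function name `stop_word_share` and the tiny stop list are hypothetical, and the NLTK list from the cell above would be used in practice.

```python
def stop_word_share(sentences, stop_words):
    """Fraction of all tokens in the tokenized sentences that are stop words."""
    total = sum(len(sent) for sent in sentences)
    if total == 0:
        return 0.0
    stopped = sum(1 for sent in sentences for tok in sent if tok in stop_words)
    return stopped / total

# Toy data (hypothetical); swap in `sentences` and `stop_words` from the notebook.
toy_stop_words = {'and', 'the', 'he', 'to'}
toy_sentences = [
    ['and', 'he', 'went', 'to', 'the', 'mountain'],
    ['and', 'jesus', 'said'],
]
print(stop_word_share(toy_sentences, toy_stop_words))
```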

(d) Put the NLTK stop words back in, and try removing stop words from a *different* list. It doesn't matter what list it is as long as it's not from NLTK or your own brain (yet).

In [11]:
# Hand-picked list: God, Jesus, a few religious terms, and the apostles' names
custom_stop_words = {'God', 'Jesus', 'heaven', 'apostle', 'prophet', 'peter', 'andrew', 'james', 'john', 'philip', 'bartholomew', 'matthew', 'thomas', 'simon', 'judas'}
non_stopword_sentences = [
    [word for word in line if word not in custom_stop_words]
    for line in sentences
]
mean_nonstopword_length = statistics.mean([len(sent) for sent in non_stopword_sentences])
print(mean_nonstopword_length)

26.169351891838023


(e) What stop word list did you use, and why did you choose it?

I chose to eliminate any mention of the 12 apostles, the words prophet, apostle, and heaven, plus any mention of Jesus and God. This brought the average sentence length down by only about 0.2 from the mean of the punctuation-free text that still had all these words in.
I thought these words would have a greater impact, but they barely have any. Perhaps they matter for the meaning of the Bible, but, for example, you can remove 'and', and this will lower the mean by two entire words!

(f) How do you think it did? Do you think it's more useful than the NLTK list, or not? Why or why not?

The NLTK stop word list is still more useful, in my opinion. As I said before, almost the entire Bible consists of sentences full of boring words like and, or, if, etc., so these can easily be taken out without much loss of meaning.

(g) And now the moment you've all been waiting for -- it's time to think about some potentially useful stop words to explore dropping that might be unique to your text. First, what might some of these stop words be? Feel free to write some code below to work out what might be helpful (no need if that's not necessary). (Note: your final decision might be to remove no stop words, not even your own, but for now you must choose at least one!)

In [12]:
filtered_text = [
    [word for word in line if word not in stop_words]
    for line in sentences
]
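One way to look for stop word candidates unique to a text is to list the most frequent tokens that a standard list misses. The sketch below assumes tokenized sentences; `frequent_candidates` and the toy data are hypothetical.

```python
from collections import Counter

def frequent_candidates(sentences, known_stop_words, top_n=10):
    """Most frequent tokens that are not already in the known stop word list."""
    counts = Counter(
        tok for sent in sentences for tok in sent if tok not in known_stop_words
    )
    return counts.most_common(top_n)

# Toy data (hypothetical); archaic words like 'thou' are not in modern stop lists.
toy_sentences = [['thou', 'shalt', 'not'], ['thou', 'art'], ['and', 'thou']]
print(frequent_candidates(toy_sentences, {'and', 'not'}, top_n=2))
```

Called with `sentences` and `stop_words` from the notebook, the top of this list would be a starting point for deciding which words, if any, to add.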

(h) Go ahead and drop your unique stop words! (You may do so while retaining previous dropped stop words, or not, depending on what is more useful for you.)

I'll keep the above code (filtered_text) as my primary text from now on. 

(i) Having explored a few angles on stop words, what version do you think is best? (You could also choose, e.g., a subset of an existing list, or a subset + a few unique ones -- anything. If you do something outside of what we've already done anywhere in Question 4, just include your code below.) Briefly describe your stop word strategy and why you think it's the most useful. (As before, you may comment out any stop word code that you ultimately won't want to use going forward.)

I feel that removing stop words from the NLTK list is enough, because removing any other words, as in problem (d), will result in loss of meaning. I think this is good enough.

## 5. Tokenize this!

(a) Go ahead and tokenize your text!

In [13]:
# Rebuild a raw string from the filtered sentences, one sentence per line
filtered_raw = '\n'.join(' '.join(sent) for sent in filtered_text)
tokens_raw = word_tokenize(text)
tokens_filtered = word_tokenize(filtered_raw)

(b) How many *tokens* are in your corpus?

In [14]:
print("Raw text tokens: "+str(len(tokens_raw)))
print("Filtered text tokens: "+str(len(tokens_filtered)))

Raw text tokens: 1011046
Filtered text tokens: 383304


(c) How many *types* are in your corpus?

In [15]:
print("Raw text types: "+str(len(set(tokens_raw))))
print("Filtered text types: "+str(len(set(tokens_filtered))))

Raw text types: 13758
Filtered text types: 12527


(d) How many *terms* are in your corpus? How did you come to this number?

In [16]:
fdist1 = nltk.FreqDist(tokens_filtered)
# Keep only the non-numeric entries
filtered_word_freq = {word: freq for word, freq in fdist1.items() if not word.isdigit()}
# Summing the frequencies counts every occurrence of the non-numeric tokens
terms = sum(filtered_word_freq.values())
terms

373252

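I am really confused about the difference between terms and tokens, so I assumed a token is any occurrence of a word and a term is any word without repeats. So james could be in tokens 100 times, but in terms it would count only once. I may be wrong, and I probably am. Note that the code above sums the frequencies, so 373252 is the number of non-numeric tokens rather than the number of distinct words; `len(filtered_word_freq)` would give the count without repeats.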
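To make the distinction concrete, here is a small sketch under one common convention, which is an assumption here and may differ from the lecture's definition: tokens are all occurrences, types are distinct surface forms, and terms are distinct forms after normalization. The function `corpus_counts` is hypothetical, and lowercasing stands in for whatever normalization is chosen.

```python
def corpus_counts(tokens, normalize=str.lower):
    """Return (token count, type count, term count) for a list of tokens."""
    n_tokens = len(tokens)                            # every occurrence
    n_types = len(set(tokens))                        # distinct surface forms
    n_terms = len({normalize(tok) for tok in tokens})  # distinct normalized forms
    return n_tokens, n_types, n_terms

toy_tokens = ['James', 'spake', 'and', 'james', 'spake', 'unto', 'James']
print(corpus_counts(toy_tokens))
```

A stemmer or lemmatizer could be passed as `normalize` instead of `str.lower`.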

(e) Is there anything else you'd like to share about your tokenizing experience? It's ok if the answer is "no", as tokenizing is (probably) the least controversial of all the steps. (Having said that, I'm sure the tokenization wars on Twitter will now blow up in my face!)

This process was easier than expected.

## 5. POS tagging

(a) In lecture we talked about several strategies for POS tagging. Two of them were *lexical-based* and *rule-based*. Tag your corpus according to a *lexical-based* strategy.

In [24]:
lexical_based = nltk.pos_tag(tokens_filtered)
print(type(lexical_based[0]))

<class 'tuple'>


(b) Now, remove those tags and instead tag your corpus according to a *rule-based* strategy.

In [46]:
# Rules are checked in order; the first one that matches decides the tag
proper_nouns = {'God', 'Jesus', 'heaven', 'peter', 'andrew', 'james', 'john', 'philip', 'bartholomew', 'matthew', 'thomas', 'simon', 'judas', 'moses'}
rule_based = []
for token in tokens_filtered:
    if token.endswith('ed'):
        tag = 'VBD'   # past-tense verb
    elif token in proper_nouns:
        tag = 'NNP'   # proper noun
    elif token.isnumeric():
        tag = 'CD'    # cardinal number
    elif token.endswith('s'):
        tag = 'NNS'   # plural noun
    else:
        tag = 'NN'    # default: singular noun
    rule_based.append((token, tag))
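The same idea can be written as an ordered table of patterns, which makes rules easier to add or reorder. This is only a sketch: the patterns in `RULES` and the function `rule_tag` are hypothetical, and the `eth` rule is a guess at archaic third-person verbs such as "cometh".

```python
import re

# Ordered (pattern, tag) pairs; the first matching pattern wins.
RULES = [
    (re.compile(r'^\d+$'), 'CD'),            # numbers
    (re.compile(r'^[A-Z][a-z]+$'), 'NNP'),   # capitalized words, e.g. God
    (re.compile(r'.*eth$'), 'VBZ'),          # archaic verbs, e.g. cometh
    (re.compile(r'.*ed$'), 'VBD'),           # past tense, e.g. created
    (re.compile(r'.*s$'), 'NNS'),            # plural nouns, e.g. waters
]

def rule_tag(tokens, rules=RULES, default_tag='NN'):
    """Tag each token with the first matching rule, else the default tag."""
    tagged = []
    for token in tokens:
        for pattern, tag in rules:
            if pattern.match(token):
                tagged.append((token, tag))
                break
        else:
            tagged.append((token, default_tag))
    return tagged

print(rule_tag(['God', 'created', '3', 'cometh', 'waters', 'light']))
```

Suffix rules like these will mislabel words such as "blessed" used as an adjective, which is one of the tradeoffs of a purely rule-based tagger.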

(c) Which strategy do you think is more useful for your analysis, and why? (As ever, you may comment out the one you don't choose.)

I was going to list all the books of the Bible and tag them as NNP, but it would have taken me ages to type them all out, so I think I'll be staying with nltk.pos_tag.

There are also a lot of words (thus, henceforth, etc.) that the nltk method will be able to tag and that my rules won't catch.
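For contrast, a purely lexical tagger can be sketched as a dictionary lookup with a default tag. As far as I know, `nltk.pos_tag` relies on a pretrained statistical model by default rather than a plain lexicon, so this sketch shows the simpler idea only; `lexicon_tag` and the tiny lexicon are hypothetical.

```python
def lexicon_tag(tokens, lexicon, default_tag='NN'):
    """Look each token up in a word-to-tag dictionary, falling back to a default."""
    return [(tok, lexicon.get(tok, default_tag)) for tok in tokens]

# Tiny hand-made lexicon (hypothetical); a real one would come from a tagged corpus.
toy_lexicon = {'God': 'NNP', 'said': 'VBD', 'let': 'VB', 'thus': 'RB'}
print(lexicon_tag(['thus', 'God', 'said', 'let', 'light'], toy_lexicon))
```

Words like "thus" are handled only if they appear in the lexicon, which is why coverage matters more here than clever rules.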

## 6. Let's stem!!

(a) In lecture we discussed two stemmers: Porter and Lancaster. Apply the Porter stemmer to your text.

In [49]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
porter_stem = [porter.stem(token) for token in tokens_filtered]

(b) How did it do? Do you think the Porter stemmer is useful for your work? Why or why not? 

I think this stemmer will work just fine. It is not as aggressive as Lancaster, which shortened most words too much (called became cal).

(c) Undo the Porter stemmer and instead apply the Lancaster stemmer to your text.

In [54]:
lancaster = nltk.LancasterStemmer()
lancaster_stem = [lancaster.stem(i) for i in tokens_filtered]

(d) How did the Lancaster stemmer do? Do you think it's more or less useful for your work than the Porter stemmer? Briefly explain your reasoning.

I found this stemming process to be really aggressive, and it would strip my text of its meaning. I will stick with Porter.
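One rough way to put a number on "aggressive" is to compare vocabulary size before and after normalization: the more distinct words collapse together, the more aggressive the stemmer. The sketch below uses a toy suffix stripper, which is explicitly not the Porter or Lancaster algorithm; both function names are hypothetical.

```python
def vocabulary_reduction(tokens, normalize):
    """Return (distinct tokens before, distinct tokens after) normalization."""
    before = len(set(tokens))
    after = len({normalize(tok) for tok in tokens})
    return before, after

def toy_strip_suffix(token):
    """Toy normalizer (NOT Porter or Lancaster): strip one common suffix."""
    for suffix in ('ing', 'ed', 's'):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

toy_tokens = ['call', 'called', 'calling', 'calls', 'light']
print(vocabulary_reduction(toy_tokens, toy_strip_suffix))
```

In the notebook, `porter.stem` and `lancaster.stem` could each be passed as `normalize` with `tokens_filtered` to compare the two stemmers on the same text.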

## 7. Last but (well, maybe) not least ... Lemmatization!

(a) Go on, Lemmatize! You may do so in any way you see fit.

In [56]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
lemmatize = [wordnet_lemmatizer.lemmatize(i) for i in tokens_filtered]

(b) Why did you choose the Lemmatizer that you did? (Note: "It seemed like the easiest one" is a fine answer for this assignment!)

It seemed like the easiest one. I didn't even know there were different lemmatizers.

(c) What do you think is more useful for your work, the winning stemmer from Question 6, or the Lemmatizer? Briefly explain your reasoning. In doing so, please comment on the tradeoffs between the two choices and why you landed where you did.

I couldn't see much change from lemmatizing compared with stemming, maybe because stemming is more aggressive and its changes are more obvious. Lemmatization replaces entire words instead of just cropping the last couple of characters. I don't know the Bible word by word, so it'd be hard for me to notice which words were changed. Perhaps lemmatization works better for different tenses, so I'll stay with lemmatization.
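One possible reason so few words appear to change is that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, so verb forms are mostly left alone. A small helper can map Penn Treebank tags to the single-letter codes WordNet uses; the name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, also used as the fallback

print([penn_to_wordnet(tag) for tag in ['VBD', 'NNS', 'JJ', 'RB', 'CD']])
```

With the tagged list from the POS section, each pair could then be lemmatized as `wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.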

## 8. Summing up

(a) Hopefully you've had a chance to experience, without getting *too* irritated at me for a few touches of tedium in this assignment, that even in this *relatively* uncontroversial pre-processing stage, you had to make a number of choices about how to clean and standardize your text. 

Overall, which choice was the most difficult to make, and which was the easiest, or most "obvious" for you? (Even if none were particularly difficult or particularly easy -- all possible depending on your text -- try to pull out at least the extremes!) Briefly explain your answer.

The most obvious was punctuation, but the one I found most difficult to understand was POS tagging. This corpus is obviously historical and won't be read in the present tense, so perhaps POS tagging is more useful when working with more modern, first-person texts like tweets or articles. I also don't know if there's an easier way to do rule-based tagging; it is very cumbersome unless you have a dictionary at hand to look up each word's part of speech, which I didn't, so I simply gave up on it.

(b) How confident are you in the choices you've made? In other words, if you were to proceed with analyzing this text, how likely do you think it is that you'd eventually want to go back and tweak (or totally change) some of the decisions you've made? What do you think you'd be most likely to change? Briefly explain your reasoning. If you expect to be 100 percent confident in all your choices in this assignment until the end of time, explain why.

I feel quite confident, to be honest. If anything, I would probably keep the proper nouns capitalized. Other than that, I feel I did what was best to analyze the text more easily.

(c) We haven't learned much (yet!!) about what to actually *do* with text once it's pre-processed, but now that you have it, what do *you* imagine the next step would be in your analysis in order to test (or at least get closer to being ready to test) one or all of the hypotheses you identified in Question 1? Go with your instincts -- I bet there's a technique for it (and if there isn't, well, now you're a future NLP methods developer!).

The next step would be to run the same process on every other version of the Bible throughout history and compare their use of verbs, connotations, etc.

(d) Finally, while, again, we haven't *really* done any analysis yet, what's something you've learned about your text from this pre-processing work? No lesson is too large or too small.

I thought this process would be way harder, but I had never worked with NLTK, whose methods make everything 20x faster to do and process. I am really liking this course, however little we've done so far.

# The end! 

Congratulations on finishing your first assignment, and your first NLP work ever (for most of you, at least)! Don't forget to comment out (not delete!) the code that you decided is not appropriate for your work. In other words, leave the final version of this notebook such that we can run *your* analysis top to bottom.