# Working with strings

It is possible to extract quite a lot of interesting, structured information from text data simply by using string processing techniques.

In this session, we'll see how to do some of these things, specifically calculating word frequencies and showing keywords in context (concordances). We'll do this for individual files, and then you'll work together to write Python code that does the same for a larger corpus of texts.

In [15]:
import os  # operating system tools
import re  # regular expressions (regex)
import string  # string processing tools
from collections import Counter, OrderedDict

__Loading text files__

We start by defining a filepath using ```os.path.join()``` like we saw last week.

In [16]:
filename = os.path.join("..", "data", "Dickens_Expectations_1861.txt")

We then need to load the file that we want to work with.

There are a number of ways to do this in Python, but the following should be considered "best practice".

In [17]:
with open(filename, "r", encoding="utf-8-sig") as file:  # "r" opens the file in read mode
    text = file.read()

# We could also call open() and file.read() without the "with" block, but then we would have
# to close the file ourselves; "with" closes it automatically, which helps when working with many files.
# "utf-8-sig" is the encoding of this specific file: UTF-8 with a byte-order mark (BOM), which is stripped on reading.
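To see what `utf-8-sig` changes, here is a minimal sketch using a throwaway file. The file name and its content are made up for the demonstration; only the two encoding names come from Python itself.

```python
import os
import tempfile

# Write a tiny UTF-8 file that starts with a byte-order mark (BOM).
# The path and the content are invented for this demonstration.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "bom_demo.txt")
with open(demo_path, "wb") as f:
    f.write(b"\xef\xbb\xbfGREAT EXPECTATIONS")

# Plain "utf-8" keeps the BOM as an invisible first character ...
with open(demo_path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# ... while "utf-8-sig" removes it.
with open(demo_path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom[:6]))     # '\ufeffGREAT'
print(repr(without_bom[:6]))  # 'GREAT '
```

If the first token of a file looks odd after loading, a leftover BOM is one of the things worth checking.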

When we load the text file, we just have a simple string object which can be indexed and sliced.

In [18]:
# slicing the first 300 characters (not words)
#text = text[:300]  # careful: this would overwrite text; the raw string contains a lot of "\n" line breaks


In [19]:

# note the difference between this cell and the one above
#print(text[:300])  # print() renders the line breaks, giving a neat, readable preview

You can see that there are some formatting things that are a little funky, such as lots of newline breaks.

We can get rid of those by using the ```.replace()``` method on strings.

In [20]:
text = text.replace("\n", " ")
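One thing to be aware of: replacing each newline with a space can leave runs of several spaces where the file had blank lines. A small sketch on a made-up sample string shows the difference between `.replace()` and a split-and-join approach, which is one possible way to normalize whitespace:

```python
sample = "My father's family name\n\nbeing Pirrip,\nand my Christian name Philip"

replaced = sample.replace("\n", " ")    # the blank line leaves a double space behind
normalized = " ".join(sample.split())   # split on any whitespace, then re-join with single spaces

print(repr(replaced))
print(repr(normalized))
```

For this notebook the double spaces are harmless, because splitting on whitespace later treats a run of spaces as a single separator.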

__Tokenize text__

So far, we have one long string of characters. But we want to be able to work with individual words. To do that, we have to *tokenize* our data - in other words, to split it into individual tokens (or words).

In [21]:
tokens = text.split()  # with no argument, split() splits on any run of whitespace
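Splitting on whitespace keeps punctuation attached to the words (`'Pip,'`, `'Pip.'`). As a rough alternative, a regular expression can pull out only the word-like parts. The sketch below assumes that a "word" is a run of letters, optionally with apostrophes inside; that definition is a simplification chosen for this example, not a general tokenizer.

```python
import re

sample = "So, I called myself Pip, and came to be called Pip."

whitespace_tokens = sample.split()
# Assumption for this sketch: a "word" is a run of letters,
# optionally with apostrophes inside (as in "father's").
regex_tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", sample)

print(whitespace_tokens)
print(regex_tokens)
```

The rest of this notebook sticks with whitespace splitting and cleans the tokens afterwards.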

In [22]:
tokens

['REAT',
 'EXPECTATIONS',
 '1867',
 'Edition',
 'by',
 'Charles',
 'Dickens',
 'Chapter',
 'I',
 'My',
 "father's",
 'family',
 'name',
 'being',
 'Pirrip,',
 'and',
 'my',
 'Christian',
 'name',
 'Philip,',
 'my',
 'infant',
 'tongue',
 'could',
 'make',
 'of',
 'both',
 'names',
 'nothing',
 'longer',
 'or',
 'more',
 'explicit',
 'than',
 'Pip.',
 'So,',
 'I',
 'called',
 'myself',
 'Pip,',
 'and',
 'came',
 'to',
 'be',
 'called',
 'Pip.',
 'I',
 'give',
 'Pirrip',
 'as',
 'my',
 "father's",
 'family',
 'name,',
 'on',
 'the',
 'authority',
 'of',
 'his',
 'tombstone',
 'and',
 'my',
 'sister,',
 '-',
 'Mrs.',
 'Joe',
 'Gargery,',
 'who',
 'married',
 'the',
 'blacksmith.',
 'As',
 'I',
 'never',
 'saw',
 'my',
 'father',
 'or',
 'my',
 'mother,',
 'and',
 'never',
 'saw',
 'any',
 'likeness',
 'of',
 'either',
 'of',
 'them',
 'for',
 'their',
 'days',
 'were',
 'long',
 'before',
 'the',
 'days',
 'of',
 'photographs',
 ',',
 'my',
 'first',
 'fancies',
 'regarding',
 'what',
 't

__Get sentences with regex__

We can use a similar logic to split the data into separate sentences.

This time we use a bit of ```regex``` to do our string splitting.

In [34]:
sentences = re.split(r'[\.\?!]\s*', text)  # split at any . ? or ! plus any whitespace that follows
# this is not the best way: it also splits after abbreviations,
# so "Mr. Johnson" is cut in two after "Mr."
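One way to soften the abbreviation problem is to tell the regex not to split right after a few known titles. The sketch below uses negative lookbehinds for that. The list of titles (`Mr`, `Mrs`, `Dr`) is an assumption for illustration and is far from complete, and the sample text is made up.

```python
import re

sample = "Mrs. Joe Gargery married the blacksmith. I never saw my father! Did Mr. Pip?"

# Split at . ? or ! followed by whitespace, unless the punctuation directly
# follows one of a few titles. The titles listed here are an assumption
# for this sketch, not a complete list of abbreviations.
pattern = r"(?<!\bMr)(?<!\bMrs)(?<!\bDr)[.?!]\s+"
sentences_demo = re.split(pattern, sample)

for sentence in sentences_demo:
    print(sentence)
```

Note that this version requires at least one whitespace character after the punctuation, so the final `?` stays attached to the last sentence.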

## Find word frequencies

We can count how many times an individual word appears manually, simply by iterating over the list of tokens and using a counter. 

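To do this, we loop over the tokens with a ```for``` loop, strip punctuation with ```.strip()```, and lowercase each token with ```.lower()``` before comparing it to the keyword.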

In [24]:
counter = 0
keyword = "love"

for token in tokens:  # for every token
    stripped = token.strip(string.punctuation)  # remove leading and trailing punctuation
    lowered = stripped.lower()  # lowercase, so "Love" matches "love"
    if lowered == keyword:  # is the cleaned token the keyword?
        counter += 1  # if yes, count it



In [25]:
print(counter)

60


In [26]:
tokens.count("love")  # does the same as the for loop, but without cleaning the tokens

43
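The two counts differ (60 versus 43) because `.count()` only matches tokens that are exactly equal to `"love"`. A small sketch on a made-up token list shows which tokens get missed:

```python
import string

demo_tokens = ["Love", "love,", "love", "loved", "love."]

exact = demo_tokens.count("love")  # only the exact string "love" matches

# Same cleaning as in the loop above: strip punctuation, then lowercase
cleaned_demo = [tok.strip(string.punctuation).lower() for tok in demo_tokens]
after_cleaning = cleaned_demo.count("love")

print(exact, after_cleaning)  # 1 4
```

In both cases `"loved"` is not counted, since it is a different token.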

In [55]:
cleaned = []

for token in tokens:
    stripped = token.strip(string.punctuation)  # remove leading and trailing punctuation
    lowered = stripped.lower()  # lowercase the token
    cleaned.append(lowered)

In [28]:
Counter(cleaned)

Counter({'reat': 1,
         'expectations': 29,
         '1867': 1,
         'edition': 1,
         'by': 809,
         'charles': 1,
         'dickens': 1,
         'chapter': 63,
         'i': 6484,
         'my': 2070,
         "father's": 14,
         'family': 39,
         'name': 121,
         'being': 265,
         'pirrip': 5,
         'and': 7078,
         'christian': 9,
         'philip': 7,
         'infant': 6,
         'tongue': 8,
         'could': 483,
         'make': 160,
         'of': 4431,
         'both': 133,
         'names': 13,
         'nothing': 176,
         'longer': 19,
         'or': 566,
         'more': 404,
         'explicit': 1,
         'than': 293,
         'pip': 326,
         'so': 794,
         'called': 62,
         'myself': 235,
         'came': 235,
         'to': 5079,
         'be': 1034,
         'give': 87,
         'as': 1773,
         'on': 1419,
         'the': 8143,
         'authority': 8,
         'his': 1858,
         'tombstone
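The full `Counter` output is long and hard to scan. A `Counter` can also report just the most frequent items, or the count for a single token. The sketch below uses a short made-up token list rather than the novel:

```python
from collections import Counter

demo_cleaned = ["the", "dear", "fellow", "let", "me", "love", "him", "and", "the", "love", "the"]
freqs = Counter(demo_cleaned)

# most_common(n) returns the n most frequent items as (token, count) pairs
for token, count in freqs.most_common(2):
    print(f"{token:10}{count}")

print(freqs["love"])     # 2
print(freqs["missing"])  # 0, a Counter returns 0 for unseen keys
```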

We can use a similar logic to find all sentences where a certain keyword appears.

In [54]:
# find every sentence with "love" in it
keyword = "love"

# pad the keyword with spaces so we only match the whole word "love", not e.g. "loved"
modified_keyword = " " + keyword + " "

for idx, sentence in enumerate(sentences):  # enumerate() gives us the index of each sentence
    # make things lowercase
    lowered = sentence.lower()
    # strip punctuation from both ends of the sentence
    stripped = lowered.strip(string.punctuation)

    if modified_keyword in stripped:
        print(idx, stripped)  # the index and the sentence containing the keyword


918 but i loved joe, - perhaps for no better reason in those early days than because the dear fellow let me love him, - and, as to him, my inner self was not so easily composed
1845  "cousin raymond," observed another lady, "we are to love our neighbor
2034 it's something to have seen the object of one's love and duty for even so short a time
2875 if i could only get myself to fall in love with you, - you don't  mind my speaking so openly to such an old acquaintance
4116 there appeared upon the scene - say at the races, or the public balls, or anywhere else you like - a certain man, who made love to miss havisham
4125 your guardian was not at that time in miss havisham's counsels, and she was too haughty and too much in love to be advised by any one
4345 that did not extend to me, she told me in a gush of love and confidence  at that time, i had known her something less than five minutes ; if they were all like me, it would be quite another thing
4731  not a man of them, sir, would be 
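Padding the keyword with spaces is simple, but it misses sentences where the keyword comes first, comes last, or is followed by a comma. A possible alternative is a regex with word boundaries (`\b`). The sentences below are short examples chosen for illustration, not output from the novel.

```python
import re

demo_sentences = [
    "Love her, love her, love her",
    "But I loved Joe",
    "He made love to Miss Havisham",
    "with anxiety of those I love",
]

keyword = "love"
# \b matches at a word boundary, so "loved" is skipped but "love," and a
# sentence-final "love" are found; re.escape guards against special characters.
pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)

matches = [(idx, s) for idx, s in enumerate(demo_sentences) if pattern.search(s)]
for idx, s in matches:
    print(idx, s)
```

Because the pattern ignores case, there is no need to lowercase the sentences first, so the printed sentences keep their original capitalization.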

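Python also has some built-in tools which we can use to count how many times a token appears in a list, such as the list method .count() and the Counter class, both used above.

There are some problems, though! These tools match tokens exactly, so counting "love" in the raw tokens gives 43, whereas counting after stripping punctuation and lowercasing gives 60.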

## Viewing keywords in context (KWIC, concordancing)

In [63]:
# define keyword
keyword = "love"

# find the words before and after the keyword
for idx, token in enumerate(cleaned):
    if token == keyword:
        # join the 5 tokens before the keyword with a space (the keyword itself is not included);
        # max() stops the slice start from going negative near the beginning of the list
        before = ' '.join(cleaned[max(0, idx - 5):idx])
        # join the 5 tokens after the keyword with a space
        after = ' '.join(cleaned[idx + 1:idx + 6])
        # put the three parts in display order
        full = [before, token, after]
        # pad each part to a fixed width in characters, which gives a tabular layout
        print("{:50}, {:20}, {:50}".format(*full))

the dear fellow let me                            , love                , him  and as to                                    
another lady we are to                            , love                , our neighbor sarah pocket returned                
with anxiety of those i                           , love                , if i could be less                                
higher than your head my                          , love                , said mr camilla i have                            
expect to thank you my                            , love                , without expecting any thanks or                   
seen the object of one's                          , love                , and duty for even so                              
get myself to fall in                             , love                , with you  you don't                               
a certain man who made                            , love                , to miss havisham i never                          
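Concordances are often easier to read when the left context is right-aligned, so that the keyword lines up in a single column. The sketch below shows one way to do that with format specifiers. The two rows are adapted from the output above, and the column widths are arbitrary choices.

```python
rows = [
    ("the dear fellow let me", "love", "him and as to"),
    ("seen the object of one's", "love", "and duty for even so"),
]

# ">" right-aligns, "^" centers, "<" left-aligns within the given width
lines = ["{:>30} | {:^6} | {:<30}".format(before, kw, after) for before, kw, after in rows]
for line in lines:
    print(line)
```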


In [67]:
# the join() method:
animals = ["dog", "bird", "cat"]

"..".join(animals)  # joins the items of the list into one string, separated by ".."
# other separators work the same way:
#" ".join(animals)
#"<3".join(animals)



'dog..bird..cat'

## Exercises

In groups, work on the following exercises in class. 

I've left these somewhat underspecified, so you're welcome to solve them in whatever way you please, and to save the results in whatever format you think works best.

- Write some code which searches through *all* of the novels in the folder called *100 English Novels* and shows how many times a given keyword appears in each novel.
   - Save your results in a way which 
- Turn the KWIC code above into a function which can be used to show *all* occurrences of a keyword in the corpus.
  - Bonus: Your output should have the same format as the output above, but with an additional column showing the filename
  - Bonus: Write your function in such a way that a user can define the context window size to display.