# Python for Linguists

Notebook 5: Files

Venelin Kovatchev

University of Barcelona 2020

### Strings, Lists, Dictionaries

In this notebook we will continue the previous practice with strings, lists, dictionaries, and text statistics.

However, this time we will read the data from files instead of a fixed variable.


In [None]:
# In order to read files, we need to open them first

# Let's start by opening the manel file
# The manel file is in the same folder as this program, so we don't need the full path, only the file name
with open('manel.txt', 'r', encoding='utf-8') as f:
    # Read the whole file into a single string
    manel_str = f.read()

with open('manel.txt', 'r', encoding='utf-8') as f_2:
    # Read the file into a list with one string per line
    manel_list = f_2.readlines()
    
# When the indented "with" block ends, Python automatically closes the file
# You don't have to close the file yourself in this example

print("The file read as a single string")
print(manel_str)


In [None]:
# Let's see the first 100 characters:
print(manel_str[:100])

# Observe that print() displays each "\n" as a line break
# If we want to see the text the way it is actually stored, we can use the repr() function
# As you can see, the file content is in fact one very long string. "\n" marks the end of each "expected" line
print(repr(manel_str[:100]))


In [None]:
# Now let's see the list version of the file
print(manel_list[:5])
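
Note that every element returned by readlines() still ends with the newline character. The following is a minimal sketch of how that newline could be removed. It does not use the lab files: the file `sample.txt` and the names `sample_path`, `raw_lines`, and `clean_lines` are assumptions made up for this illustration.

```python
import os
import tempfile

# Hypothetical sample file, created here so the example runs on its own
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as f_out:
    f_out.write("first line\nsecond line\n")

with open(sample_path, 'r', encoding='utf-8') as f_in:
    raw_lines = f_in.readlines()

# Each element still ends with "\n"
print(raw_lines)

# rstrip("\n") removes the trailing newline from each line
clean_lines = [line.rstrip("\n") for line in raw_lines]
print(clean_lines)
```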

In [None]:
# Finally, let's see how reading a file line by line works:

num_lines = 0

with open('manel.txt', 'r', encoding='utf-8') as f_3:
    for cur_line in f_3:
        # In this code we just print the line and some text
        # In real code we can do things with each line - e.g. we can split(), we can count, we can keep it in another variable
        print("Printing line number " + str(num_lines))
        print(cur_line)
        num_lines += 1
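
As a possible variation on the loop above, the counter can be supplied by enumerate(), and each line can be split into words as it is read. This is only a sketch on a made-up file: `sample.txt`, `sample_path`, and `words_per_line` are hypothetical names that are not part of the lab.

```python
import os
import tempfile

# Hypothetical sample file, created here so the example runs on its own
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as f_out:
    f_out.write("one two three\nfour five\n")

words_per_line = []
with open(sample_path, 'r', encoding='utf-8') as f_in:
    # enumerate() supplies the line counter, starting at 1 here
    for line_number, cur_line in enumerate(f_in, start=1):
        # split() without arguments splits on whitespace and drops the trailing "\n"
        words = cur_line.split()
        words_per_line.append(len(words))
        print("Line", line_number, "has", len(words), "words")
```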

In [None]:
# Writing to files works as described in the lectures:
with open('test.txt', 'w', encoding='utf-8') as f_4:
    f_4.write("Some text here")
    # You can use the "\n" symbol to write a new line
    f_4.write("\n")
    # You can also use a variable
    str_to_write = "Some text ending with a newline.\n"
    f_4.write(str_to_write)
    
    # You can go and open the file with notepad to see the content
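
Instead of opening the file in a text editor, the content can also be checked by reading it back. The sketch below additionally shows the 'a' (append) mode, which adds to the end of a file, whereas 'w' overwrites it. The file `demo.txt` and the names `demo_path` and `content` are assumptions chosen for this example.

```python
import os
import tempfile

# Hypothetical file in a temporary folder, so no existing file is touched
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# 'w' creates the file, or overwrites it if it already exists
with open(demo_path, 'w', encoding='utf-8') as f_out:
    f_out.write("first\n")

# 'a' keeps the existing content and writes at the end
with open(demo_path, 'a', encoding='utf-8') as f_out:
    f_out.write("second\n")

# Read the file back to check what was written
with open(demo_path, 'r', encoding='utf-8') as f_in:
    content = f_in.read()

print(repr(content))
```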

For today's exercise, we will continue working with text statistics.

We will use the exercise from last week to read a corpus, create a vocabulary, and count the frequencies.

However, instead of using a predefined corpus in a list, we will read the data from a file.

We will also put some additional requirements on counting characters.

In [None]:
# The following function will open a file and read it line by line
# There are three possible texts in this lab, the names are "manel", "macbeth" and "quijote"

def readLineFromText(text):
    with open(text+'.txt', 'r', encoding='utf-8') as f:
        for line in f:
            yield line
            
# Observe the behavior of the readLineFromText function:
# It gives us one line at a time, just like the line-by-line loop we saw earlier

for sentence in readLineFromText("manel"):
    print("This is a sentence: ")
    print(sentence)
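
Because readLineFromText uses yield, it is a generator: it produces one line each time the next one is requested, and it does not load the whole file first. The sketch below illustrates this with a hypothetical helper `read_lines` that takes a full path (unlike readLineFromText, which adds '.txt' itself) and a made-up file `sample.txt`.

```python
import os
import tempfile

def read_lines(path):
    # Hypothetical helper with the same structure as readLineFromText
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            yield line

# Hypothetical sample file, created here so the example runs on its own
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as f_out:
    f_out.write("a\nb\nc\n")

line_gen = read_lines(sample_path)

# next() asks the generator for a single line
first = next(line_gen)
print(repr(first))

# list() consumes whatever lines are left
rest = list(line_gen)
print(rest)
```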

In [None]:
# We will use the same functions to print and plot statistics as in the previous task
# Do not change those
import collections
import matplotlib.pyplot as plt
def mostCommonWords(d):
    c = collections.Counter(d)
    most = c.most_common(10)
    print('Rank\t\tWord\t\tFrequency')
    print('----\t\t----\t\t---------')
    for i,(w,f) in enumerate(most):
        print(str(i+1)+ '\t\t' + w + '\t\t' + str(f))

def plot(d):
    c = collections.Counter(d)
    most = c.most_common()
    plt.plot([x[1] for x in most])
    plt.grid()
    plt.xlabel('Word rank')
    plt.ylabel('Frequency')
    
    


In [None]:
# Task 1

# Use the function readLineFromText to process the whole text file
# When reading a sentence, separate all the words using .split() and keep them in a list, similar to Task 1 from last class

In [None]:
# Task 2 
# Create a vocabulary for the text files, similar to Task 2 from last class
# The vocabulary should be case insensitive, it should not make a difference between "SPAIN", "Spain", and "spain"
# Ignore stop words and do not add them to the vocabulary
# For the purpose of this exercise, stop words are words with length 1, 2, or 3
# Alternatively, create a list variable with stop words and check if a word is in the list
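
The two requirements above can be illustrated on a small made-up word list. This is a sketch, not the expected solution: the list `words` is invented, and storing the vocabulary in a set is an assumption (a list or dictionary, as in last class, works as well).

```python
# Made-up words, only for illustration
words = ["SPAIN", "Spain", "spain", "is", "the", "country", "Country"]

vocabulary = set()
for word in words:
    # lower() makes "SPAIN", "Spain", and "spain" identical
    lowered = word.lower()
    # Words of length 1, 2, or 3 are treated as stop words and skipped
    if len(lowered) > 3:
        vocabulary.add(lowered)

print(sorted(vocabulary))
```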


In [None]:
# Task 3
# Calculate the word frequencies for the text files, similar to Task 3 from last class
# Use the vocabulary from task 2, case insensitive and ignoring stop words
# 


In [None]:
# Task 4
# Experiment with the Counter() dictionary
# Unlike a normal dictionary, a Counter does not raise an error when you try to modify the value of a non-existing key
# It treats the missing value as 0
# You can create a new counter with the following code
import collections
counts = collections.Counter()

# Redo Task 3 using a Counter instead of a normal dictionary
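
To see the difference from a normal dictionary, here is a minimal sketch on a made-up word list. The names `toy_words` and `toy_counts` are assumptions chosen for this example and are not tied to the lab files.

```python
import collections

# Made-up words, only for illustration
toy_words = ["casa", "gat", "casa", "mar", "casa", "gat"]

toy_counts = collections.Counter()
for word in toy_words:
    # No check for a missing key is needed: a missing key counts as 0
    toy_counts[word] += 1

# The two most frequent words with their counts
print(toy_counts.most_common(2))

# Looking up a word that never appeared gives 0 instead of an error
print(toy_counts["platja"])
```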

In [None]:
# Task 5 (Advanced)
# Normalize the frequencies
# Now we're switching from absolute frequencies to relative frequencies, that is, each value will be the
# fraction of occurrences of its key, and all values in the dictionary should sum to 1. Then show the results again.
# 
# Hints:
# - relative frequency of x = absolute frequency of x / sum of all absolute frequencies
# - You don't need to sum all absolute frequencies each time you do the previous operation
#   in your code, because the sum is constant. You can calculate it once and store it in a variable.
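
The formula from the hints can be checked on a tiny made-up dictionary before applying it to the real frequencies. In this sketch, `toy_freq`, `total`, and `relative` are hypothetical names and the counts are invented.

```python
# Made-up absolute frequencies, only for illustration
toy_freq = {"casa": 3, "gat": 2, "mar": 5}

# The sum of all absolute frequencies is computed once
total = sum(toy_freq.values())

# relative frequency of x = absolute frequency of x / total
relative = {word: count / total for word, count in toy_freq.items()}

print(relative)
# The relative frequencies should add up to 1 (up to floating-point rounding)
print(sum(relative.values()))
```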