# Homework for the Python Basics Workshop

Ok, so we delivered on our promise of making you capable of writing non-trivial Python programs. Now you can become even more dangerous by learning how to read text files directly, parse them, and calculate statistics on the words inside them. We're going to expose you to files and strings and ask you to do some data analysis using them.

### Files

The built-in `open()` function is a **constructor** that creates a Python file object, which serves as a link to a file residing on your machine. After calling `open()`, data can be transferred to and from the associated file by calling the returned file object's **methods**.

At this point, you can read the file as a whole with `.read()`, or n characters at a time with `.read(n)` (n bytes if the file was opened in binary mode). You can read one line at a time with `.readline()`, or all the lines into a list of strings with `.readlines()`. Or simply treat the file like a list and iterate through it: you will get it line by line. Similar methods exist for writing.
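
To make the reading methods above concrete, here is a minimal sketch. It assumes a small three-line sample file that the code creates itself in a temporary directory; the names `demo_path`, `whole`, `first_five`, `rest_of_line`, and `remaining` are made up for this illustration.

```python
import os
import tempfile

# Hypothetical three-line sample file, created only for this demo.
demo_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
fd = open(demo_path, "w")
fd.write("first line\nsecond line\nthird line\n")
fd.close()

# Read the whole file as one string.
fd = open(demo_path)
whole = fd.read()
fd.close()

# Read piece by piece: each call continues where the previous one stopped.
fd = open(demo_path)
first_five = fd.read(5)        # the next 5 characters
rest_of_line = fd.readline()   # the rest of the current line, newline included
remaining = fd.readlines()     # all remaining lines as a list of strings
fd.close()

print(repr(whole))
print(repr(first_five), repr(rest_of_line), remaining)
```

Note that the file object remembers its position, so `.readlines()` here returns only the lines that have not been read yet.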

You must close the file with `.close()` after you finish using it.
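
As an aside, Python's `with` statement can do the closing for you. The sketch below is one possible alternative to calling `.close()` by hand, not the style used in the rest of this notebook; the file and the name `demo_path` are hypothetical and created only for the demo.

```python
import os
import tempfile

# Hypothetical demo file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# The file is closed automatically when the indented block ends.
with open(demo_path, "w") as fd:
    fd.write("hello\n")

with open(demo_path) as fd:
    text = fd.read()

print(fd.closed)   # the file object reports that it is closed
print(repr(text))
```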

![](https://github.com/univai-ghf/ghfmedia/raw/main/images/filemethods.png)

The next line fetches the data into Colab. You can safely ignore it.

In [3]:
mkdir -p data; pushd data; wget https://raw.githubusercontent.com/univai-ghf/ghfmedia/main/data/JuliusCaesar.txt; popd

In [4]:
fd = open("data/JuliusCaesar.txt")
thecontents = fd.read()
fd.close()
type(thecontents)

The text read in from a file is a **string** of characters. Here are the first 200:

In [5]:
thecontents[0:200]

### Strings

Strings are objects that behave much like lists: you can index them, slice them, and iterate over them. Just like lists, they also have methods defined on them.
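
A small sketch of this list-like behavior, using a made-up string `greeting`. It also shows one difference worth knowing: unlike lists, strings cannot be modified in place.

```python
greeting = "Hello World"   # hypothetical example string

print(len(greeting))       # number of characters, like len() of a list
print(greeting[0])         # indexing, like a list
print(greeting[0:5])       # slicing, like a list
print(greeting.lower())    # a string method; it returns a new string

# Unlike lists, strings are immutable: item assignment raises a TypeError.
try:
    greeting[0] = "J"
    modified = True
except TypeError:
    modified = False
print(modified)
```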

In [6]:
# a long string in python split over multiple lines
alongstring = """
Hello World
Hello My Friends
"""

In [7]:
# a string is listy
for character in alongstring:
    print(character, end=":")

In [8]:
# how did I find the arguments to print out?
?print

Here's a method that splits the string on whitespace (including newlines, tabs, and spaces):

In [9]:
alongstring.split()

### Files are like lists

In [10]:
fd = open("data/JuliusCaesar.txt")
counter = 0
for line in fd:
    if counter < 10: # print first 10 lines, there are lots!
        print("<<", line, ">>")
    counter = counter + 1 # also writeable as counter += 1
fd.close()

Notice that the newlines remain. You can use the string method `strip` on `line` to remove them. Note that `strip` removes all leading and trailing whitespace, not just the newline.
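
Here is a quick sketch of what `strip` does, on a made-up line with extra spaces and a trailing newline. The related method `rstrip("\n")` is shown for comparison, in case you ever want to keep the leading spaces.

```python
line = "   Hello My Friends \n"   # hypothetical line with extra whitespace

stripped = line.strip()           # drops whitespace at both ends
newline_only = line.rstrip("\n")  # drops only trailing newline characters

print(repr(stripped))
print(repr(newline_only))
```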

In [11]:
fd = open("data/JuliusCaesar.txt")
counter = 0
for line in fd:
    if counter < 10: # print first 10 lines
        print("<<", line.strip(), ">>")
    else:
        break # break out of for loop
    counter = counter + 1 # also writeable as counter += 1
fd.close()
print(counter)

### What about writing?

We may want to modify or process files, and then write them out. For this we must use the `open` constructor with an additional argument, `"w"`, which signifies that the file is opened in write mode. Be careful: opening an existing file in write mode erases its previous contents.

In [12]:
fd2 = open("data/JuliusCaesar2.txt", "w")
fd2.write(thecontents)
fd2.close()
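
The sketch below illustrates the writing methods on a throwaway file. The path `out_path` is hypothetical and lives in a temporary directory. It also uses the `"a"` (append) mode, which is not used elsewhere in this notebook, to show the contrast with `"w"`.

```python
import os
import tempfile

# Hypothetical output file in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")

fd = open(out_path, "w")   # "w": create the file, or erase it if it exists
fd.write("line one\n")     # write() does not add a newline for you
fd.close()

fd = open(out_path, "a")   # "a": keep existing contents and add to the end
fd.write("line two\n")
fd.close()

fd = open(out_path)
result = fd.read()
fd.close()
print(repr(result))
```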

## Finally, the homework QUESTION

1. Read Julius Caesar. 
2. Get each line. 
3. Remove newline characters from each line. 
4. Split the line to get the words from the line (use the split method on strings). 
5. Lowercase them (use the `lower` method on strings).
6. Now let us make a histogram that has the counts of all the words in the play, except for words that are in the `stop_words` list provided below. This list contains the most common words in the English language, like 'and', 'the', 'i', 'we'.
7. Your output is a dictionary `worddict` which will store these counts as values, with the words as keys.
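
As a hint for the counting step, here is a toy sketch of building a histogram in a dictionary. It counts letters in a single made-up word rather than words in the play, and the names `word`, `skip`, and `letter_counts` are hypothetical. Adapting the pattern to the steps above is left to you.

```python
word = "mississippi"   # hypothetical toy input
skip = ["m"]           # plays the role of a list of items to ignore

letter_counts = {}
for letter in word:
    if letter in skip:
        continue       # ignore anything in the skip list
    # get() returns the current count, or 0 if the key is not there yet
    letter_counts[letter] = letter_counts.get(letter, 0) + 1

print(letter_counts)
```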

In [None]:
mkdir -p data; pushd data; wget https://raw.githubusercontent.com/univai-ghf/ghfmedia/main/data/stopwords.pkl; popd

In [20]:
# The pickle library serializes and de-serializes a
# Python object structure.
# In simple words, it can be used to save Python
# objects (almost everything in Python is an object)
import pickle
fd = open('data/stopwords.pkl','rb') # binary read
stop_words=pickle.load(fd)
fd.close()

#display the first 20 elements in the list
stop_words[:20]
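
To see what pickle does in both directions, here is a minimal round-trip sketch. It saves a short made-up list to a temporary file and loads it back; `sample_words` and `pkl_path` are hypothetical names and have nothing to do with the real `stopwords.pkl`.

```python
import os
import pickle
import tempfile

sample_words = ["and", "the", "i", "we"]   # hypothetical short list
pkl_path = os.path.join(tempfile.mkdtemp(), "sample.pkl")

fd = open(pkl_path, "wb")       # binary write
pickle.dump(sample_words, fd)   # serialize the list into the file
fd.close()

fd = open(pkl_path, "rb")       # binary read
restored = pickle.load(fd)      # rebuild an equal list from the file
fd.close()

print(restored == sample_words)
```

Only load pickle files from sources you trust, since loading a pickle can execute arbitrary code.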

In [15]:
# your code here


We sort the keys of `worddict`, using the method `worddict.get` as the sort key so that the words are ordered by their counts. We then print the top 20 words along with their counts.

In [16]:
topwords = sorted(worddict, key = worddict.get, reverse=True)
top20 = topwords[:20]
for word in top20:
    print(word, worddict[word])
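
The sorting idiom above can be seen more easily on a tiny dictionary. The counts in `toy_counts` are made up for illustration and are not taken from the play.

```python
toy_counts = {"caesar": 3, "brutus": 5, "rome": 1}   # hypothetical counts

# Iterating over a dictionary yields its keys; key=toy_counts.get tells
# sorted() to compare those keys by their values (the counts).
ranked = sorted(toy_counts, key=toy_counts.get, reverse=True)

print(ranked)
for w in ranked[:2]:
    print(w, toy_counts[w])
```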

Now we use matplotlib to plot a horizontal bar chart for these top 20 words.

In [18]:
# various imports of libraries

import numpy as np
import matplotlib.pyplot as plt

# ask matplotlib to plot in notebooks
%matplotlib inline
fig, ax = plt.subplots(figsize=(9, 7))
pos = range(len(top20))
ax.barh(pos, [worddict[word] for word in top20],
                     align='center',
                     tick_label=top20)
ax.set_title('Most frequent words in THE TRAGEDY OF JULIUS CAESAR')
plt.show()