#Python Day 3
##File I/O and String Manipulation

##Modules

* Basically any piece of Python code ending in .py could be considered a module
* ...But that can have some odd effects

#What Are Modules?

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


#Why Use Modules?

* Programmers are lazy, therefore we always assume we may need to reuse our code.
* Modules are Python's basic unit of code reuse.
* Also, there are a bunch of useful ones already written!  (More on this later)

#Using Modules

* Load modules using the import command
* Access functions using module name

In [4]:
#import the module time 
import time 

#Use a function of the time module
print(time.gmtime())

time.struct_time(tm_year=2019, tm_mon=2, tm_mday=12, tm_hour=2, tm_min=48, tm_sec=31, tm_wday=1, tm_yday=43, tm_isdst=0)


#Using Modules

* Alternatively, the `from` command also works
* Then just use the function
  * (I prefer the first for readability)

In [2]:
#import the module time 
from time import gmtime

#Use a function of the time module
print(gmtime())

2019-02-11 21:48:04.328704


#Using Modules

* The `dir` function can be used to show the available functions
* Though most of the time it's easier to just check the documentation

In [3]:
import time
#Check the available functions
dir(time)

['_STRUCT_TM_ITEMS',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'altzone',
 'asctime',
 'clock',
 'ctime',
 'daylight',
 'get_clock_info',
 'gmtime',
 'localtime',
 'mktime',
 'monotonic',
 'perf_counter',
 'process_time',
 'sleep',
 'strftime',
 'strptime',
 'struct_time',
 'time',
 'timezone',
 'tzname',
 'tzset']

#Using Modules

* IDE Protip:  waiting for a second after typing moduleName. will bring up a list of functions!

#Using Modules
* There are many modules included in the basic install.
* A full index can be found at http://docs.python.org/py-modindex.html
* Many more are available via the Python Package Index: https://pypi.python.org/pypi

#Using Modules

* Adding new modules is as simple as placing them where Python can find them.
* Python always checks the folder of the main script
* It also checks the Lib folder of the install directory
* You can also change the module search path (`sys.path`, which is seeded from the PYTHONPATH environment variable)
* Or just use `pip`, or `conda` if you're using Anaconda
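
To make the search path concrete, here is a minimal sketch that inspects `sys.path` and adds a directory to it. The directory name `/tmp/my_modules` is a made-up placeholder, not something from the course material.

```python
import sys

# sys.path is an ordinary list of directories that import searches in order
for directory in sys.path[:3]:
    print(directory)

# Hypothetical directory holding your own modules (assumption for illustration)
extra_dir = "/tmp/my_modules"
if extra_dir not in sys.path:
    sys.path.append(extra_dir)

print(extra_dir in sys.path)
```

Changes made this way last only for the current run of the interpreter.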


#Writing Modules
* At the basic level any .py file will do
* Assuming Python knows where it is, just use import to pull it in; it then acts just like the standard modules
* Gotcha:  leave off the extension. Write `import yourFile`, not `import yourFile.py`

#Writing Modules

* Handy trick:  Have a bunch of functions in a script you may want to reuse?
* Just add `if __name__ == '__main__':`
* Code written under this will only be executed if the file is being run directly (vs. imported)
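
A minimal sketch of that trick, assuming a hypothetical file called `greetings.py` (both the file name and the `greet` function are invented for illustration):

```python
# Contents of a hypothetical file named greetings.py

def greet(name):
    """Return a greeting; other scripts can import and reuse this."""
    return "Hello, " + name + "!"

if __name__ == "__main__":
    # Runs only when executed directly (python greetings.py),
    # not when another script does "import greetings"
    print(greet("world"))
```

Another script could then `import greetings` and call `greetings.greet(...)` without triggering the print at the bottom.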

#Exercise

In [5]:
'''
Run the following code.  Use what you know about modules to try to figure
out the final secret message.  Pay attention to the naming convention.
'''


import base64
from itertools import cycle

def thisWillPrintTheKey():
    key = 'c3JlZG5hU3JhZUY='
    print ("The Key is: " + str(base64.b64decode(key)))

def xor_crypt_string(data, key):
    return ''.join(chr(ord(x) ^ ord(y)) for (x,y) in zip(data, cycle(key)))
    
if __name__ == "__main__":
    answer = "0\x1d\x0b\x03\x1c\x00'\x07\r\x042\x1a\x1d\x0b\x17BA*\x1d\x14E \x1a\x15\x10\x16\x0b\x05s\x1d\x14\x11f\x07\x1a\x00D\x0f\x0f \x05\x04\x17hSR#\r\x1c\x12'R\x0e\x0b#S\x06\nD\x1d\x00*R&\x0c(\x14\x17\x173\x01\x0e8\x1b\x04E!\x16\x06\x16D\x0fA1\x17\x04\x17"
    k = input("Enter the Key:")
    print (xor_crypt_string(answer, k))
    thisWillPrintTheKey()

Enter the Key:b'srednaSraeF'
R:xqydIf^vS[,ue2Xxp+AIgqsM":gccta}nEC#uO  First onetd-7oeN<G_ZurQc)Kia!OwUd%ju#d
The Key is: b'srednaSraeF'


#File Input and Output

* So far all your data has been entered by hand.
* This will rarely actually happen
* Python has built-in functions to open files
* To open files use the built-in function `open`
* The first argument is the path to the file
* Paths can sometimes be tricky because Windows uses “\”, which also starts escape sequences in Python strings
* Use raw strings to avoid this. Ex. `r'C:\dir\file'`
* The second argument is the mode

In [7]:
#Basic file open
file = open('exampleFile.txt')

print(file)

<_io.TextIOWrapper name='exampleFile.txt' mode='r' encoding='UTF-8'>


#Common File Object Modes

* 'r' - opens the file for only reading. (this is default if the argument is omitted)
* 'r+' - opens the file for both reading and writing
* 'w' - opens the file for only writing (an existing file with the same name will be erased)
* 'w+' - opens the file for both reading and writing. (an existing file with the same name will be erased) If the file does not exist, creates a new file for reading and writing
* 'a' - opens the file for appending; any data written to the file is automatically added to the end
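
The sketch below walks through three of these modes in order. It writes to a temporary directory so that no real file is touched, and the file name `modes.txt` is just a placeholder.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "modes.txt")

# 'w' creates the file (or erases an existing one)
f = open(path, "w")
f.write("first\n")
f.close()

# 'a' keeps the existing contents and adds to the end
f = open(path, "a")
f.write("second\n")
f.close()

# 'r' (the default) reads the file back
f = open(path, "r")
contents = f.read()
f.close()

print(contents)
```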

#Interacting With Files
* Once you've opened a file you need to actually do something with it.  
* File handle variables are used to access the file in your programs
* These objects have a number of different methods, but we'll cover the most common ones.

#Reading Files
* The most common way to read input from a file is `file.readlines()`
* Returns a list containing each line of the file
* Can be huge if the file is huge
* `file.readline()` reads just a single line
* `file.read()` reads the entire file as a string
* Our Example File

```
This is a File.
There are many like it, 
but this one is mine.
```

In [5]:
#Read and print each line

#open the file
inputFile = open('exampleFile.txt')

#read in lines as a list
lines = inputFile.readlines()

#print the lines
for line in lines:
    print(line)
    
inputFile.close()

This is a File.

There are many like it, 

but this one is mine.


In [4]:
#open the file
inputFile = open('exampleFile.txt')

#read in a single line
line  = inputFile.readline()

print(line)

line  = inputFile.readline()
inputFile.close()
print(line)

This is a File.

There are many like it, 



In [6]:
#open the file
inputFile = open('exampleFile.txt')

print(inputFile.read())

inputFile.close()

This is a File.
There are many like it, 
but this one is mine.


#Writing Files

* Writing to a file looks an awful lot like reading
* `file.write()` writes a single string (add your own `\n` to end the line)
* `file.writelines()` writes a sequence of strings, without adding any newlines

In [7]:
#open the file for writing
outputFile = open('outputTest.txt', 'w')
#write something
outputFile.write('This is a test\n')
#close the file
outputFile.close()
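
For comparison, a small sketch of `writelines()`. It assumes a throwaway file in a temporary directory; the file name and the three sample lines are made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "writelinesTest.txt")

# writelines() does not add newlines, so include them yourself
rows = ["line one\n", "line two\n", "line three\n"]

outputFile = open(path, "w")
outputFile.writelines(rows)
outputFile.close()

# Read the file back to check what was written
inputFile = open(path)
written = inputFile.read()
inputFile.close()
print(written)
```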

#File I/O Tips

* file.readlines() is generally the easiest way to read a file unless you have memory constraints
* Always close files.
* If you hate closing files use `with` statements

In [8]:
#auto closes the file when you finish reading
with open('exampleFile.txt') as inputFile:
    lines = inputFile.readlines()

print(lines)

['This is a File.\n', 'There are many like it, \n', 'but this one is mine.']
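
If memory is a concern, one common alternative to `readlines()` is to loop over the file object itself, which reads one line at a time. This is a sketch using a small stand-in file with made-up contents:

```python
import os
import tempfile

# Build a small stand-in file (hypothetical contents)
path = os.path.join(tempfile.mkdtemp(), "big.txt")
with open(path, "w") as outputFile:
    outputFile.write("one\ntwo\nthree\n")

# Looping over the file object reads one line at a time
lineCount = 0
with open(path) as inputFile:
    for line in inputFile:
        lineCount += 1
        print(line.rstrip("\n"))

print(lineCount)
```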


#os.path Module

* One of the nice things about Python is it's cross platform, but Windows and *nix platforms represent file paths differently
  * '/' for *nix
  * '\' for Windows
* Using os.path lets you avoid having to worry about that
* Use it anytime you think you may have to use the script on multiple platforms
* Basically, os.path.join lets you join directories without worrying about the OS.

In [1]:
windowsStr = r'C:\Folder\File'  # Raw string; this path wouldn't work on *nix

import os.path
betterString = os.path.join('Folder', 'File')
print(betterString)

Folder/File


#os.path Module

* os.path also has a lot of other features
  * Directory walks
  * Absolute paths
  * File existence
* Check the documentation if you need to do any of these things
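
A rough sketch of those three features follows. It builds its own temporary folder, so the names `sub` and `note.txt` are invented. Note that in Python 3 the directory walk lives in `os.walk` rather than in `os.path`.

```python
import os
import os.path
import tempfile

# Temporary directory tree used only for this demo
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub"))
filePath = os.path.join(root, "sub", "note.txt")
with open(filePath, "w") as f:
    f.write("hi\n")

# File existence
print(os.path.exists(filePath))
print(os.path.exists(os.path.join(root, "missing.txt")))

# Absolute path of a relative name (depends on the current directory)
print(os.path.abspath("exampleFile.txt"))

# Directory walk: os.walk yields (directory, subdirectories, files)
found = []
for dirPath, dirNames, fileNames in os.walk(root):
    for name in fileNames:
        found.append(os.path.join(dirPath, name))

print(found)
```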

#Exercise
1. Write a program that asks the user for a file to open and then copies the contents of that file to a second, user-specified file.
2. Write a program that opens a specified text file and counts the number of lines in it.  Once it has this number, it appends the number and a brief descriptor to the end of the file.

###Solution 1

In [2]:
#Get files to open
inFile  = input("Enter the input file: ")
outFile = input("Enter the output file: ")

#open files
i = open(inFile)
o = open(outFile, 'w')

#read the lines in the file and output each line
for line in i.readlines():
    o.write(line)

#close files
i.close()
o.close()

Enter the input file: exampleFile.txt
Enter the output file: output.txt


###Solution 2

In [None]:
#get file name from user
inFile  = input("Enter the input file: ")

#open file, read it in as a list, then call len on that list
with open(inFile) as i:
    lines = len(i.readlines())

#open output file using the append method and write a final line with info
o = open(inFile, 'a')
o.write("There are " + str(lines) + " lines in this file")

#close the file
o.close()

#Strings

* Strings will show up a lot
* One of Python's greatest strengths is a robust set of ways to interact with strings
* Python even views binary data as a byte-string
* Strings are a lot like lists
* Most things you can do to read a list (indexing, slicing, looping, `in`) also work on a string, but strings cannot be changed in place

In [4]:
myString = "My String"

#string concatenation
myString = myString + " with some stuff added"
print(myString)

#in works too
print ("My" in myString)

My String with some stuff added
True
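
To illustrate the list comparison, here is a short sketch of list-style operations on a string, plus the one big difference: strings cannot be modified in place.

```python
myString = "My String"

# Indexing and slicing work just like lists
print(myString[0])
print(myString[3:])

# So do len() and for loops
print(len(myString))
for ch in myString[:2]:
    print(ch)

# Unlike lists, strings are immutable: item assignment raises TypeError
try:
    myString[0] = "m"
except TypeError:
    print("strings cannot be changed in place")

# Build a new string instead
lowered = "m" + myString[1:]
print(lowered)
```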


#Strings

* Strings also have a number of special methods to make using them easier
* We'll cover some of the more common ones
* If you ever need to you can normally code these by hand, but it's no fun

In [10]:
myString = "Yet another string example."

#find() locates the start index of the search string
print("Another found at %d" % myString.find('another'))

#count() counts the number of specified values in the string
print("There are %d e's" % myString.count('e'))

#replace() replaces the first pattern with the second
print(myString.replace("Yet", "Yes!"))

#split() splits the string into a list of strings
print(myString.split(' '))

#strip() removes the specified beginning and trailing characters
print(myString.strip('.'))

Another found at 4
There are 4 e's
Yes! another string example.
['Yet', 'another', 'string', 'example.']
Yet another string example


#Hex Strings

* Python also makes hex easy
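* Python generally assumes that characters are plain text; however, you can specify a character by its hex value using “\x”. In a Python 3 string this gives the character with that code point; for actual raw bytes, use a bytes literal such as `b'\xFF'`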

In [13]:
#simple hex string
print('\x00Hi\xFF')

#methods like "find" work
print('\x00Hi\xFF'.find('\xFF'))

 Hiÿ
3
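
For raw binary data, Python 3 has a separate `bytes` type. The sketch below uses the same four values as a bytes literal and shows a few of the standard `bytes` methods.

```python
# A bytes literal holds raw byte values (note the b prefix)
raw = b'\x00Hi\xFF'

print(len(raw))            # number of bytes
print(raw[3])              # indexing bytes gives an integer
print(raw.hex())           # two hex digits for every byte
print(raw.find(b'\xFF'))   # find works, but needs a bytes argument

# Round trip from a hex string back to bytes
same = bytes.fromhex('004869ff')
print(same == raw)
```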


#Unicode

* Unicode is a standard way of dealing with international language characters.
* In Python 3 every string is already Unicode. The “u” prefix is still accepted for backward compatibility, but the `unicode` built-in from Python 2 no longer exists.

In [16]:
u1 = u'Unicode String'
print(u1)

Unicode String
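
Unicode mostly matters when text has to be converted to or from bytes, for example when writing to a file or a socket. A minimal sketch, assuming UTF-8 as the encoding and an arbitrary sample word:

```python
# 'Ol\u00e9' is the word "Ole" with an accented final e
text = 'Ol\u00e9'

# encode() turns a str into bytes using the named encoding
encoded = text.encode('utf-8')
print(encoded)
print(len(text), len(encoded))

# decode() turns bytes back into a str
decoded = encoded.decode('utf-8')
print(decoded == text)
```

The accented character takes two bytes in UTF-8, which is why the two lengths differ.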


#Exercise

1. Write a program that reads in a file and counts the number of words in it.  For our purposes words are considered to be a string of characters separated by a space.

2. Write a program that Olde Englishifies a document.  Turn "You" into "Thou", "the" into "thy" and "has" into "hast".

###Solution 1

In [17]:
#get file name from user
inFile  = input("Enter the input file: ")

#open file and read in its contents
with open(inFile) as i:
    lines = i.readlines()

#create our counter variable
words = 0

for line in lines:
    words += len(line.split(' ')) #create a list of split words and add its length

print ("There are %d words in this file." % words)

Enter the input file: exampleFile.txt
There are 15 words in this file.


###Solution 2

In [18]:
#get file name from user
inFile  = input("Enter the input file: ")

#open file and read in its contents as a single line
i = open(inFile)
allLines = i.read()

#create a new version of the string for each replacement we make
allLines = allLines.replace('you', 'thou')
allLines = allLines.replace('You', 'Thou')
allLines = allLines.replace('the', 'thy')
allLines = allLines.replace('The', 'Thy')
allLines = allLines.replace('has', 'hast')
allLines = allLines.replace('Has', 'Hast')

#output the file
i.close()
o  = open(inFile, 'w')
o.write(allLines)

o.close()

Enter the input file: exampleFile.txt


#Regular Expressions

* In many cases `in` will let you match parts of a string.  However, sometimes you will need a more robust searching mechanism.
* Regular Expressions are the standard programming tool for pattern matching
* Python fully supports regular expressions but is a bit less intuitive than something like Perl

#Regular Expressions

* Regular Expressions provide the ability to match abstract user defined patterns and not just exact matches.
* Examples of this include:
  * Phone Numbers
  * Credit Card Numbers
  * IP address
  * Formatted log entries

#Regular Expressions

* Basic Regular Expressions (REs) are made up of elements, which are intended to match items, and metacharacters, which specify things about the elements.
* The most basic elements are exact matches, for example, the letter “a” or the number “4”
* “abc” is a valid regular expression and would function exactly like searching for “abc” in a string
* Wildcard elements are used to make REs more versatile
  * “.” is the element that matches any single character (except a newline, by default)
  * “a.c” would match “abc” or “a2c” (among others)
* There are a number of useful wildcards
  * “\w” is any alphanumeric character or underscore
  * “\s” is whitespace
* Check the Python documentation for a full list

#Regular Expressions

* REs also support logical ORs over single characters by placing all of the allowed values in [].  
* “a[b1]c” would match only “abc” or “a1c”
  * Use “-” to specify a range. For example, [A-Z] is all uppercase letters.
* This syntax is useful, but REs really start to get useful when we don't know how many values there will be
* Metacharacters are used to specify how many instances of an element are allowed
  * “*” specifies 0 or more repetitions of the preceding element 
* For example, “.*” matches any run of characters (by default, anything except a newline)
* There are a number of useful metacharacters
  * “+” specifies 1 or more repetitions
  * “{min,max}” specifies between min and max repetitions (no space after the comma)
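
A quick sketch of character sets and repetition counts using Python's `re` module. It uses `re.fullmatch`, which succeeds only when the whole string fits the pattern; the test strings are arbitrary.

```python
import re

# [b1] matches exactly one character: either "b" or "1"
print(bool(re.fullmatch(r'a[b1]c', 'abc')))
print(bool(re.fullmatch(r'a[b1]c', 'a1c')))
print(bool(re.fullmatch(r'a[b1]c', 'a2c')))

# * allows zero repetitions, + requires at least one
print(bool(re.fullmatch(r'ab*c', 'ac')))
print(bool(re.fullmatch(r'ab+c', 'ac')))

# {min,max} bounds the repetition count
print(bool(re.fullmatch(r'[A-Z]{2,3}', 'ABC')))
print(bool(re.fullmatch(r'[A-Z]{2,3}', 'ABCD')))
```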

#Regular Expressions

* We can group characters together using “()”.  These groups will also become important later for extraction of values 
  * For example “(ab)+” will match “ab” or “abab” or “ababab”, etc
* Finally, “\” is the escape character that will let you use reserved characters in a pattern.
  * “\.” will match an actual period
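
The grouping and escaping rules can be checked the same way. This is a small sketch with arbitrary test strings, again using `re.fullmatch` so that the whole string has to fit.

```python
import re

# (ab)+ repeats the whole group, not just the "b"
print(bool(re.fullmatch(r'(ab)+', 'ababab')))
print(bool(re.fullmatch(r'(ab)+', 'abb')))

# An unescaped "." matches any character; "\." matches only a period
print(bool(re.fullmatch(r'a.c', 'abc')))
print(bool(re.fullmatch(r'a\.c', 'abc')))
print(bool(re.fullmatch(r'a\.c', 'a.c')))
```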

#Regular Expressions

* Common Regular expression examples
  * Phone number: “\d-\d\d\d-\d\d\d-\d\d\d\d”
  * Social Security number: “\d{3}-\d{2}-\d{4}”
  * IP Address: “(\d{1,3}\.){3}\d{1,3}”
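
Here is a sketch that tries the three patterns above on made-up sample values. The last check shows a limitation worth knowing: the IP pattern only checks the shape of the address, not that each number is 255 or less.

```python
import re

phone = r'\d-\d\d\d-\d\d\d-\d\d\d\d'
ssn = r'\d{3}-\d{2}-\d{4}'
ip = r'(\d{1,3}\.){3}\d{1,3}'

# Made-up sample values, used only to exercise the patterns
print(bool(re.fullmatch(phone, '1-555-867-5309')))
print(bool(re.fullmatch(ssn, '123-45-6789')))
print(bool(re.fullmatch(ip, '192.168.1.10')))

# The IP pattern only checks the shape, not the 0-255 range
print(bool(re.fullmatch(ip, '999.999.999.999')))
```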

#Regular Expressions in Python

* In Python, regular expressions are accessed using the re module
* There are two main types
  * `match` only matches at the beginning of the string (use `fullmatch` to require that the entire string match)
  * `search` will let you match a substring anywhere in the string
* There is also a findall function that will find all the given matches in a string
* To further complicate things, you can also precompile RE objects or just run the RE functions directly
* Precompiling is more efficient if you plan on repeatedly using the RE
* Also, `match` and `search` return a match object (or None if nothing matched), so assign the result to a variable if you want to inspect it
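
A side-by-side sketch of the main `re` functions on one arbitrary sample string, first called directly and then through a precompiled pattern:

```python
import re

pattern = r'\d+'
text = 'abc 123 def 45'

# match() only succeeds at the start of the string
print(re.match(pattern, text))

# search() finds the first match anywhere
print(re.search(pattern, text).group(0))

# fullmatch() requires the whole string to match
print(re.fullmatch(pattern, '123') is not None)

# findall() returns every non-overlapping match as a list
print(re.findall(pattern, text))

# A compiled pattern exposes the same functions as methods
digits = re.compile(pattern)
print(digits.findall(text))
```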

#Basic Pattern Matching

In [21]:
import re

#create a pattern
pattern = r'X[a-z]*X'

#Run against some strings
res = re.match(pattern, "XthismatchesX")
res2 = re.match(pattern, 'XthisDOESNTmatchX')

if res:
    print("The first results includes a match")
    
if res2:
    pass
else:
    print("The second doesn't match")


The first results includes a match
The second doesn't match


#Basic Pattern Matching

In [23]:
import re

#create a pattern
pattern = r'X[a-z]*X'

#pre-compile our matches
matcher = re.compile(pattern)

#Run against some strings
res = matcher.match( "XthismatchesX")
#you can search with the same pattern as well
res2 = matcher.search('XthisXDOESmatchX')

if res:
    print("The first results includes a match")
    
if res2:
    print("The Search matches")
else:
    print("The second doesn't match")

The first results includes a match
The Search matches


#Extracting Data

* Grouping is a way to extract data from a pattern using groups designated by “()”
* Grouping is why match objects return more than a boolean
* Accessed via calling the group() method on an object
  * group(0) is the entire match
  * group(1) is the first group, etc

In [24]:
import re

#create a pattern
pattern = r'X([a-z]*)X'

#Run against some strings
res = re.match(pattern, "XthismatchesX")

if res:
    print("The first results includes a match of %s" % res.group(1))
    

The first results includes a match of thismatches


#Extracting Data
* groups() (note the plural) is an easy way to extract multiple matches for looping
* This method returns a tuple containing all of the various matches

In [25]:
import re

#create a pattern
pattern = r'X([a-z]*)X([a-z]*)X([a-z]*)X'

#Run against some strings
res = re.match(pattern, "XthisXmatchesXthriceX")

if res:
    print(res.groups())

('this', 'matches', 'thrice')


#Exercise

1. Write a program that prompts the user to enter a MAC address and then determines if the address is valid.  Remember, MAC addresses consist of 6 pairs of hex characters separated by either a dash or a colon.
2. Write a program that parses a text file and retrieves the last word of each sentence.
3. Bonus!  Modify exercise 1 to recognize only valid IP addresses.

###Solution 1

In [27]:
#import the module
import re

#create a regex pattern (raw string; accepts upper- or lowercase hex)
regex = r"([0-9A-Fa-f]{2}[-:]){5}[0-9A-Fa-f]{2}"

#get input
mac = input("Enter A MAC address: ")

#Check that the whole input matches and store the results
res = re.fullmatch(regex, mac)

#check to see if we have a result.
if res:
    print ("Valid MAC Address")
else:
    print ("Not a valid MAC Address")

Enter A MAC address: AA:11:BB:CC:FF:45
Valid MAC Address


###Solution 2

In [4]:
import re
#get file name from user
#inFile  = input("Enter the input file: ")

#open file and read in its contents
i = open('exampleFile.txt')
allLines = i.read()
i.close()

#create our regex
regex = r"([^\s]+?)[\.\!\?]"

#use find all to find each instance
res = re.findall(regex, allLines)

#display our results
print ("Found the following last words:")
for word in res:
    print(word)

Found the following last words:
File
mine


#Final Exercise

You have been given a large Apache log for your company webserver.  Write a program that will parse the file and extract useful data.  Read in the file and write out a summary which contains the following data.

1. Unique IPs
2. Most Popular Page
3. Total number of requests

Bonus Problem 1:  Figure out where it might be worthwhile to put a cached copy of the page by identifying possible proxy servers 

In [None]:
import re

#function to find the largest value in dictionary 
def findLargestValue(d):
    largestValue = 0
    returnKey = None

    for key in d:
        if d[key] > largestValue:
            largestValue = d[key]
            returnKey = key

    return returnKey
    


fileName = "http_logs"
#create our regex
regex = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - \[.*?\] "(?:(?:GET)|(?:POST)) (.*?) HTTP/1.1".*'
matcher = re.compile(regex)

#open and read in the file
inputFile = open(fileName)
lines = inputFile.readlines()
inputFile.close()

#create dictionaries for easy deduping
IPs = {}
pages = {}

#parse through log and extract data, storing data in the dictionaries
for line in lines:
    match = matcher.match(line)
    if match:
        if match.group(1) in IPs:
            IPs[match.group(1)] += 1
        else:
            IPs[match.group(1)] = 1

        if match.group(2) in pages:
            pages[match.group(2)] += 1
        else:
            pages[match.group(2)] = 1

#find final results
finalIP = findLargestValue(IPs)
finalPage = findLargestValue(pages)
totalHits = len(lines)

#print our results
print ("Unique IPs: ")
for key in IPs:
    print(key)

print ("Most Popular Page: " + finalPage)
print ("Total Hits: " + str(totalHits))
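
As an alternative sketch, the counting can be handed to `collections.Counter` from the standard library, which replaces both the if/else bookkeeping and the hand-written largest-value function. The three log lines are made up for illustration, and the regex is a lightly simplified version of the one above.

```python
import re
from collections import Counter

# Hypothetical log lines in Apache common log format (made up for this sketch)
sampleLines = [
    '10.0.0.1 - - [12/Feb/2019:02:48:31 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [12/Feb/2019:02:48:32 +0000] "GET /about.html HTTP/1.1" 200 256',
    '10.0.0.1 - - [12/Feb/2019:02:48:33 +0000] "POST /index.html HTTP/1.1" 200 128',
]

regex = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - \[.*?\] "(?:GET|POST) (.*?) HTTP/1\.1".*'
matcher = re.compile(regex)

# Counter handles the "add one or start at one" bookkeeping
IPs = Counter()
pages = Counter()

for line in sampleLines:
    match = matcher.match(line)
    if match:
        IPs[match.group(1)] += 1
        pages[match.group(2)] += 1

# most_common(1) returns a list holding the top (key, count) pair
topPage, topCount = pages.most_common(1)[0]

print(sorted(IPs))
print(topPage, topCount)
```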