# File Manipulation!

Here, we will go over some ways to load in, manipulate, and save txt/data files.

As with all Jupyter notebooks, everything we do here can also be done in a standalone Python script.

# Read in and parse a simple file

In [1]:
infile = open('blank.txt', 'r') # Creates a text wrapper (file) object that we save to the variable infile
lines = infile.readlines() # the file object has the method .readlines(),
                           # which returns a list with one element per line
infile.close() # Close the file once we are done reading from it
print(type(lines)) # This shows us that the variable lines is in fact a list!
print(lines)

<class 'list'>
['0\n', '1\n', '2\n', '3\n']


In [2]:
# Because the variable lines is a list, we can parse it like we did in Lesson 2

mean = 0 
for number in lines:
    mean = mean + number
mean = mean/len(lines)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [3]:
# Hmm... that didn't work.
# Seems like the elements inside the variable lines are strings, so we can't add them to an integer.
# Converting each one with float() first fixes this.

mean = 0 
for line in lines:
    number = float(line)
    mean = mean + number
mean = mean/len(lines)

print(mean)

1.5
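
The cells above never close `blank.txt`. A minimal sketch of the same calculation using a `with` block, which closes the file automatically, might look like the following. The temporary file and the name `sample_path` are assumptions made so the example runs on its own; the file simply mimics the contents of `blank.txt` shown above.

```python
import os
import tempfile

# Assumption: build a small stand-in for blank.txt so the sketch is self-contained
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'blank.txt')
with open(sample_path, 'w') as outfile:
    outfile.write('0\n1\n2\n3\n')

# The with block closes the file for us, even if an error occurs inside it
with open(sample_path, 'r') as infile:
    values = [float(line) for line in infile]  # float() ignores the trailing '\n'

mean_value = sum(values) / len(values)
print(mean_value)
```

Looping directly over `infile` reads one line at a time, so this also works for files too large to hold in a list with `.readlines()`.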


## Parse a more complicated file

If the lines in a file are more complex (say, several words separated by commas), our previous method of reading in the lines won't give us access to each individual word. Each list element will be a single string containing all of the words and commas on that line.

Instead, we can split each string by defining a delimiter.

In [5]:
infile = open('numbers.csv', 'r')

numbers = infile.readlines()
print("first line of the file numbers[0]: ",numbers[0])
print("has the type:", type(numbers[0]))
print("but if I try to grab a number by indexing with numbers[0][2]:", numbers[0][2])

first line of the file numbers[0]:  34,6,3,65,1234,56,56

has the type: <class 'str'>
but if I try to grab a number by indexing with numbers[0][2]: ,


In [7]:
# We can take the comma-separated string of numbers in the first element spot 
# in numbers and split the string along the commas.
num_delim = numbers[0].split(',')
print(".split(',') generates a new list of strings but without the commas")
print(num_delim)
print(num_delim[0])

.split(',') generates a new list of strings but without the commas
['34', '6', '3', '65', '1234', '56', '56\n']
34
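
Notice that the last element above still carries its newline (`'56\n'`) and that every element is a string. Here is a small sketch of one way to clean this up, assuming the same first line shown in the output above (the names `first_line`, `fields` and `values` are just for illustration):

```python
# Assumption: the same first line of numbers.csv shown in the output above
first_line = '34,6,3,65,1234,56,56\n'

# strip() removes the trailing newline before we split on the commas
fields = first_line.strip().split(',')

# Convert each string to an integer so we can do arithmetic with it
values = [int(field) for field in fields]

print(values)
print(sum(values))
```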


If the data is stored in multiple columns, you can read it in with a NumPy function such as np.genfromtxt.

In [10]:
import numpy as np
block_data = np.genfromtxt("./block_data.txt",dtype='str')

In [9]:
print(block_data)

[['Month' 'Year' 'Money']
 ['January' '2014' '10']
 ['February' '2014' '15']
 ['March' '2014' '16']
 ['April' '2014' '20']
 ['May' '2014' '19']
 ['June' '2014' '11']
 ['July' '2014' '20']
 ['August' '2014' '21']
 ['September' '2014' '22']
 ['November' '2014' '20']
 ['August' '2015' '22']
 ['September' '2015' '27']
 ['May' '2015' '21']
 ['January' '2015' '20']]


# Practicals! Do 1-2

1. Read words from 'words.txt' into a list and sort the list. 

2. Read the comma-separated file 'numbers.csv' and store the 1st and 4th columns in lists.

In [17]:
# Practical 1.
with open('words.txt') as input_file:  # the with block closes the file for us
    lines = []
    for line in input_file:
        lines.append(line.strip('\n'))
lines.sort()  # sort once, after every word has been read
print(lines)



['brown', 'brown', 'brown', 'brown', 'cat', 'cat', 'cat', 'cat', 'cow', 'cow', 'cow', 'cow', 'fluffy', 'fluffy', 'fluffy', 'fluffy', 'fluffy', 'fluffy', 'fluffy', 'fluffy', 'goat', 'goat', 'goat', 'goat', 'goat', 'goat', 'goat', 'goat', 'happy', 'happy', 'happy', 'happy', 'house', 'house', 'house', 'house', 'linux', 'linux', 'linux', 'linux', 'linux', 'linux', 'linux', 'linux', 'moose', 'moose', 'moose', 'moose', 'moose', 'moose', 'moose', 'moose', 'zebra', 'zebra', 'zebra', 'zebra']


In [18]:
# Practical 2.
import csv
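
The cell above stops at the import. One possible way to finish Practical 2 with the `csv` module is sketched below. Only the first row of `numbers.csv` is shown in this notebook, so the second row here is invented for illustration, and `io.StringIO` stands in for the real file.

```python
import csv
import io

# Assumption: stand-in for numbers.csv; only the first row comes from the notebook
fake_file = io.StringIO('34,6,3,65,1234,56,56\n1,2,3,4,5,6,7\n')

first_column = []
fourth_column = []

# csv.reader splits each line on commas and drops the newline for us
for row in csv.reader(fake_file):
    first_column.append(int(row[0]))   # 1st column has index 0
    fourth_column.append(int(row[3]))  # 4th column has index 3

print(first_column)
print(fourth_column)
```

With the real file, `fake_file` would be replaced by `open('numbers.csv', newline='')` inside a `with` block.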


## Writing to a file

This works the same way: use the open() function, but pass a mode that allows writing instead of the default read-only mode. The available modes are:


    r: Opens the file in read-only mode. Starts reading from the beginning of the file and is the default mode for the open() function.

    rb: Opens the file as read-only in binary format and starts reading from the beginning of the file. While binary format can be used for different purposes, it is usually used when dealing with things like images, videos, etc.

    r+: Opens a file for reading and writing, placing the pointer at the beginning of the file.

    w: Opens in write-only mode. The pointer is placed at the beginning of the file and this will overwrite any existing file with the same name. It will create a new file if one with the same name doesn't exist.
   
    wb: Opens a write-only file in binary mode.

    w+: Opens a file for writing and reading. Like w, it overwrites an existing file with the same name or creates a new one.

    wb+: Opens a file for writing and reading in binary mode.

    a: Opens a file for appending new information to it. The pointer is placed at the end of the file. A new file is created if one with the same name doesn't exist.

    ab: Opens a file for appending in binary mode.

    a+: Opens a file for both appending and reading.

    ab+: Opens a file for both appending and reading in binary mode.
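
To make the difference between a few of these modes concrete, here is a small sketch. The temporary path and the file name `modes_demo.txt` are assumptions chosen so the example does not touch any real data.

```python
import os
import tempfile

# Assumption: a throwaway file so we do not touch any real data
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w') as f:   # 'w' creates the file (or empties an existing one)
    f.write('first\n')

with open(path, 'a') as f:   # 'a' keeps the old contents and adds to the end
    f.write('second\n')

with open(path, 'r') as f:   # 'r' reads from the beginning
    after_append = f.read()

with open(path, 'w') as f:   # opening with 'w' again wipes what was there
    f.write('third\n')

with open(path, 'r') as f:
    after_overwrite = f.read()

print(repr(after_append))
print(repr(after_overwrite))
```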

In [19]:
new_test_file = open('./new_test_file.txt', 'w')

new_test_file.write("Testing Testing 1 2 3")
# At this point the file has been created, but the text may still be sitting in a buffer.
# It is only guaranteed to be on disk once we flush or close the file!

new_test_file.close()
# There we go, all written!


# Dictionaries

Dictionaries are similar to lists, but instead of looking up elements by position you look them up by a key. They are built from 'key : value' pairs.

    D = {}              #empty dictionary
    
    D = {'my_key' : 5}  #one entry
    
    D['my_key']         #access value

In [21]:
# I'll build us an example dictionary to play with
temps = {'Oslo':13, 'London':15.4, 'Paris':17.5}
print(temps)
print(temps['London'])

# Add a new key and value
temps['Madrid'] = 26.0
print(temps)

{'Oslo': 13, 'London': 15.4, 'Paris': 17.5}
15.4
{'Oslo': 13, 'London': 15.4, 'Paris': 17.5, 'Madrid': 26.0}


## Looping over Dictionaries

In [22]:
# We can print out the dictionary information in a nice way

for city in temps:
    print('The temperature in %s is %g'%(city, temps[city]))

The temperature in Oslo is 13
The temperature in London is 15.4
The temperature in Paris is 17.5
The temperature in Madrid is 26
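
As a variation on the loop above, the sketch below uses `.items()` to get each key and value together, and `sorted()` to visit the keys alphabetically. The dictionary is repeated here only so the example runs on its own.

```python
# Same dictionary as above
temps = {'Oslo': 13, 'London': 15.4, 'Paris': 17.5, 'Madrid': 26.0}

# .items() hands us each key together with its value
for city, temp in temps.items():
    print(f'The temperature in {city} is {temp:g}')

# sorted() gives the keys in alphabetical order instead of insertion order
sorted_cities = sorted(temps)
print(sorted_cities)
```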


In [23]:
# And we can check if a certain key exists and give nice output if it does/doesn't
if 'Berlin' in temps:
    print('Berlin', temps['Berlin'])
else:
    print('No temperature data for Berlin')

No temperature data for Berlin
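
A more compact alternative to the if/else check is the dictionary method `.get()`, which returns a fallback value when the key is missing. This is a sketch using the same `temps` dictionary; the fallback string is an arbitrary choice.

```python
temps = {'Oslo': 13, 'London': 15.4, 'Paris': 17.5, 'Madrid': 26.0}

# .get() returns the value if the key exists, otherwise the default we supply
berlin = temps.get('Berlin', 'No temperature data')
london = temps.get('London', 'No temperature data')

print('Berlin:', berlin)
print('London:', london)
```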


# Practicals!

3. Read in the file block_data.txt using either loadtxt or genfromtxt. Print the month and year that made the most money.

4. Write pairs of x and cos(x) values to a file.

5. Read words from words.txt and count how many times they occur

In [29]:
# Practical 3

import numpy as np

# Import the columns into arrays, skipping the header row in both calls
# so that the same index refers to the same row in every array
file_name = 'block_data.txt'
month = np.genfromtxt(file_name, usecols=[0], dtype='str', skip_header=1)
years, money = np.loadtxt(file_name, usecols=[1,2], skiprows=1, dtype='int', unpack=True)

max_val = max(money)
index = list(money).index(max_val)

print("The most money was earned in %s, %d which was $%dk" % (month[index], years[index], money[index]))

The most money was earned in September, 2015 which was $27k


In [30]:
# Practical 4
import math as m

# Write (x, cos(x)) pairs to a file

# Set up the x values
dx = 0.1
start = 0
stop = 2*m.pi
N = int((stop - start) / dx)  # number of whole steps of size dx between start and stop

# Create a writable file; the with block closes it for us when we are done
print('printing values')
with open('cos_vals.txt', 'w') as output_file:
    print('X values, Cos(X) values', file=output_file)
    for i in range(N + 1):
        x = start + i*dx  # computing x from i avoids accumulating round-off error
        print(str(round(x, ndigits=1)) + ' ' + str(m.cos(x)), file=output_file)

print('All Done')


printing values
All Done


In [37]:
# Practical 5

# Read in words.txt and count each word with a dictionary
word_counts = {}

with open('words.txt', 'r') as infile:
    for line in infile:
        word = line.strip('\n')
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1

print(word_counts)

{'happy': 4, 'brown': 4, 'cat': 4, 'fluffy': 8, 'goat': 8, 'cow': 4, 'moose': 8, 'linux': 8, 'zebra': 4, 'house': 4}
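
The standard library also offers `collections.Counter`, which does this kind of counting for us. Below is a minimal sketch; the short `words` list is a made-up stand-in for the contents of `words.txt`.

```python
from collections import Counter

# Assumption: a short stand-in list instead of the full words.txt
words = ['happy', 'fluffy', 'goat', 'fluffy', 'cat', 'fluffy', 'goat']

counts = Counter(words)  # counts how many times each word appears

print(counts['fluffy'])
print(counts.most_common(1))  # the single most frequent word with its count
```

A `Counter` behaves like a dictionary, so the looping and key checks shown earlier work on it as well.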


# Storing Data Objects

pickle https://docs.python.org/3/library/pickle.html

A brief warning: pickle is not secure. Unpickling a file can execute arbitrary code, and Python will unpickle any file you tell it to, no matter what the repercussions might be. It's completely safe to use for your own work and research, but do not go downloading and opening strange pickles from the web.

json https://docs.python.org/3/library/json.html
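
As a rough sketch of the json route, the same kind of dictionary can be saved to a human-readable text file and loaded back. The temporary path and the file name `temps.json` are assumptions for illustration, and the dictionary is repeated so the example runs on its own.

```python
import json
import os
import tempfile

temps = {'Oslo': 13, 'London': 15.4, 'Paris': 17.5, 'Madrid': 26.0}

# Assumption: a throwaway path; any writable location works
path = os.path.join(tempfile.mkdtemp(), 'temps.json')

# json files are plain text, so we open in 'w' and 'r' rather than 'wb' and 'rb'
with open(path, 'w') as json_file:
    json.dump(temps, json_file, indent=2)

with open(path, 'r') as json_file:
    temps_from_json = json.load(json_file)

print(temps == temps_from_json)
```

Unlike pickle, json only handles basic types such as dictionaries, lists, strings and numbers, but loading a json file does not run code.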

In [38]:
import pickle

In [39]:
# I'm going to store our temps dictionary in a pickle file
with open('./temps.pickle', 'wb') as temp_file:
    pickle.dump(temps, temp_file, protocol=pickle.HIGHEST_PROTOCOL)
    
with open('./temps.pickle', 'rb') as temp_file:
    temp2 = pickle.load(temp_file)
print(pickle.HIGHEST_PROTOCOL)
print(temps == temp2)

4
True


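By the way, the protocol option is NOT necessary. However, the default differs between versions: Python 2's pickle dumps with protocol=0 by default, while Python 3 defaults to protocol=3 (protocol=4 from Python 3.8 onward). HIGHEST_PROTOCOL is 4 in the Python version used here, as the output above shows.

Higher protocols are generally more efficient and tend to produce smaller binary files, but older Python versions cannot read them. This is generally not important unless you are doing some major pickling.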
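
To get a feel for the size difference between protocols, one option is to compare the number of bytes each one produces. This is only a sketch: the list `data` is a made-up object, and the exact byte counts depend on the Python version, so none are quoted here.

```python
import pickle

# Assumption: a made-up object that is big enough for the difference to show
data = list(range(1000))

size_protocol_0 = len(pickle.dumps(data, protocol=0))
size_highest = len(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))

print('protocol 0 bytes:', size_protocol_0)
print('highest protocol bytes:', size_highest)

# Whichever protocol wrote the bytes, pickle.loads() detects it automatically
restored = pickle.loads(pickle.dumps(data, protocol=0))
print(restored == data)
```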