<div class="pagebreak"></div>

# Files

Files are one of the most ubiquitous abstractions for computers.  As users, we constantly interact with files to store our documents and other data.  We organize these files into directories (folders).  Directories can contain subdirectories to provide a hierarchical structure of various contents.  

As with other programming languages, Python provides a rich set of functionality to interact with files and directories. Interacting with files will also be necessary to allow us to persist data. So far, we have just used variables that hold data in the computer's memory - such data will be lost when the program terminates.  By storing data in files, the information is placed (stored/persisted) on [secondary storage devices](https://en.wikipedia.org/wiki/Computer_data_storage#Secondary_storage) such as hard drives and USB sticks. 

We'll also use files to share data with other individuals and systems.  Such data is usually stored in common formats such as tab-delimited values, comma-separated values (CSV), and JavaScript Object Notation (JSON).

Python's view of files and directories is largely based on the Unix/Linux family of operating systems.  [Overview of the Unix File System](https://web.archive.org/web/20210419161551/https://homepages.uc.edu/~thomam/Intro_Unix_Text/File_System.html)<br>
(You should be familiar with all of the material on the "Overview of the Unix File System" page.)

Support for files is defined in Python's [io](https://docs.python.org/3/library/io.html) module.


## Working with Files
Generally, to read or write files, you'll follow these steps:
1. Open the file.
2. Read or write to the file.
3. Close the file once you are finished.

Step 1:<br>
To open a file, use the built-in function [open()](https://docs.python.org/3/library/functions.html#open)

```
f = open(filename, mode)
```
`open()` returns a file object.  By default, a file is opened for reading as a text file, i.e., `mode='rt'`.

This table contains the different modes that may be specified:

| Character | Meaning |
| :-------: | :------ |
| 'r' | open for reading (default) |
| 'w' | open for writing, truncating the file first |
| 'x' | open for exclusive creation, failing if the file already exists |
| 'a' | open for writing, appending to the end of file if it exists |
| 'b' | binary mode.  Specify in conjunction with 'r', 'w', 'x', or 'a' |
| 't' | text mode (default). Specify in conjunction with 'r', 'w', 'x', or 'a' |
| '+' | open for updating (reading and writing).  Rarely used. See [open()](https://docs.python.org/3/library/functions.html#open) |
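
As a rough illustration of the modes in the table, the following sketch exercises `'w'`, `'a'`, `'x'`, and `'r'` in turn. The file name `modes_demo.txt` and the temporary directory are placeholders chosen for this example only.

```python
import os
import tempfile

# Work in a throwaway directory so no existing files are touched.
demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "modes_demo.txt")   # hypothetical file name

f = open(path, "w")      # 'w' creates the file (or truncates an existing one)
f.write("first\n")
f.close()

f = open(path, "a")      # 'a' keeps the existing contents and appends
f.write("second\n")
f.close()

try:
    f = open(path, "x")  # 'x' refuses to overwrite an existing file
    f.close()
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

f = open(path, "r")      # 'r' (the default) opens for reading
contents = f.read()
f.close()

print(contents)
print("Exclusive creation failed:", exclusive_failed)
```

Because the file already exists by the time `'x'` is tried, that call is expected to raise `FileExistsError` rather than silently replace the data.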


Step 2:<br>
Read or write data to the file as necessary

Step 3:<br>
Finally, call `close()` to notify the operating system and interpreter that we are done with the file. The operating system can then release any allocated resources.
```
f.close()
```
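
To make the effect of `close()` concrete, here is a minimal sketch that inspects the file object's `closed` attribute before and after the call. The file name `close_demo.txt` is a placeholder; once a file is closed, further I/O on that object raises a `ValueError`.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "close_demo.txt")  # hypothetical file name

f = open(path, "w")
f.write("some data\n")
print(f.closed)      # the file is still open at this point
f.close()
print(f.closed)      # the file object can no longer be used for I/O

try:
    f.write("more data\n")
    write_error = None
except ValueError as err:   # raised for I/O on a closed file
    write_error = err
print("Writing after close() failed:", write_error)
```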

## Text Files
Python views text files as a continuous stream of characters. Strings in Python are Unicode, and when no `encoding` argument is given, `open()` uses the platform's preferred encoding, which is UTF-8 on most Linux and macOS systems.

### Creating a new text file
The following code block opens a file called "test.txt" in the current directory for writing text.  The code block then shows two different ways of putting a string into the file. Finally, the last line closes the file.

In [None]:
f = open("test.txt", "wt")
print('String message, print built-function, but specify the file', file=f)
f.write('Another message, uses the write method of the file object')
f.write("test")
f.close()

If you examine the file in a text editor, you'll notice the file contains:
```
String message, print built-function, but specify the file
Another message, uses the write method of the file objecttest
```
By default, `print()` adds a newline at the end of each call unless you specify a different value in the `end` parameter.

The `write()` method does not add any newline characters - you will need to manually add newlines as needed.
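
The difference between `print()` and `write()` can be sketched as follows. The file name `newline_demo.txt` is a placeholder for this example; the point is only where the `\n` characters come from.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "newline_demo.txt")  # hypothetical file name

f = open(path, "wt")
f.write("line one\n")                      # newline added by hand
f.write("line two" + "\n")                 # same idea, using concatenation
print("line three", file=f)                # print() appends "\n" for us
print("no newline here", end="", file=f)   # end="" suppresses it
f.close()

f = open(path, "rt")
written = f.read()
f.close()
print(repr(written))   # repr() makes the newline characters visible
```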

### Reading a text file
To read a text file, we can use several different methods:
- `read()`
- `readline()`
- `readlines()`
- an iterator

`read()` with no arguments will read the entire contents of the file into a string.  As such, you'll need to be careful with large files as you may exhaust the available memory in the computer.  

In [None]:
f = open("test.txt", "rt")
contents = f.read()
f.close()
print(contents)

To limit the number of characters read from the file in one method call, you can specify the maximum number of characters to read at a time.

In [None]:
f = open("test.txt", "rt")
numCharacters = 20
message = ""
while True:
    text = f.read(numCharacters)
    if not text:    #string is empty, nothing else to read in the file
        break
    print(text,end="###")
    message += text
f.close()
print("\n\nNow, display the message:")
print(message)

`readline()` will read a line at a time, returning the contents in a string. Any newline characters at the end of the line are kept in the returned string. If the end of the file is reached, an empty string is returned.  If a blank line exists, a string with a newline character is returned. This behavior allows a string variable to be used as a boolean in a condition check. If the string is non-empty (even just a newline), the string evaluates to true.  If the string is empty, the string evaluates to false.

In [None]:
f = open("test.txt", "rt")
while True:
    line = f.readline()
    if not line:    #string is empty, nothing else to read in the file
        break
    print(line)
f.close()

Notice that in the above output, the newlines stored in the file are kept in the returned string.  If they were simply stripped from the return value, it would not be possible to distinguish between an empty line and the end of the file.
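
The truthiness rule that the `readline()` loop relies on can be checked directly. This sketch uses `io.StringIO` as an in-memory stand-in for a text file (an assumption made so the example needs no file on disk) containing a line, a blank line, and a final line without a newline.

```python
import io

# io.StringIO stands in for a text file: a line, a blank line, a final line.
f = io.StringIO("first\n\nlast")

results = []
while True:
    line = f.readline()
    results.append((line, bool(line)))   # record the string and its truth value
    if not line:                         # only the empty string is false
        break

for line, truth in results:
    print(repr(line), truth)
```

The blank line comes back as `'\n'`, which is true, so the loop continues; only the empty string at the end of the data stops it.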

`readlines()` will read the entire contents of the file at once, returning a list where each element is a line from the file. Newline characters are not removed automatically.

In [None]:
f = open("test.txt", "rt")
lines = f.readlines()
f.close()
for line in lines:
    print(line.strip())    # strip the newline character from the end of the string


Probably the most conventional way to read a text file in Python is to use an iterator:

In [None]:
f = open("test.txt", "rt")
for line in f:
    print(line.strip())
f.close()

As before, this method keeps newline characters in the returned string.

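While we noted that calling `read()` without a limit can lead to memory issues, the other methods may have issues as well. `readlines()` also loads the entire file at once, and `readline()` and the iterator split the data on newline characters, so a file with very long lines (or no newlines at all) can still produce a very large string.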

## Closing Files Automatically
In CPython, the standard Python implementation, a file is typically closed once it is no longer referenced (e.g., the file was opened in a function and the function has ended), although the language does not guarantee when this happens.  Closing a file explicitly still serves two important purposes:
1. Forces any remaining writes to be completed and "flushed" to the file.  "Flushing" forces any internal buffers used by the Python interpreter to send any remaining data to the operating system to be stored. For performance reasons, Python and other programming languages use buffers when reading and writing data; the buffers require fewer calls to the operating system to manipulate files.
2. Clears any resources allocated to managing the open file
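
The buffering described in the first point can be observed with a small sketch. The file name `flush_demo.txt` is a placeholder, and the size reported before the flush depends on the buffer size in use, so treat that first number as illustrative only.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "flush_demo.txt")  # hypothetical file name

f = open(path, "w")
f.write("buffered text")
# The data may still be sitting in Python's buffer, so the size on disk
# is often still 0 here (the exact behavior depends on the buffer size).
print("Size before flush:", os.path.getsize(path))

f.flush()   # push Python's buffered data to the operating system
size_after_flush = os.path.getsize(path)
print("Size after flush:", size_after_flush)

f.close()   # close() flushes as well, then releases the file
```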

Python's `with` statement relies upon context managers to automatically take action when a code block is entered and then exited, by calling the special methods `__enter__()` and `__exit__()`.

Code that uses a context manager then takes the form:
<pre>
with <i>expression</i> as <i>variable</i>:
    <i>code block</i>
</pre>


In [None]:
with open("test.txt") as f:
    for line in f:
        print(line.strip())

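Behind the scenes, Python uses roughly this equivalent code sequence (the `try`/`finally` ensures the file is closed even if an exception occurs in the block):

In [None]:
f = open("test.txt")
f.__enter__()
try:
    for line in f:
        print(line.strip())
finally:
    f.__exit__(None, None, None)    # closes the file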
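
Any object that defines `__enter__()` and `__exit__()` can be used in a `with` statement, not just files. The class below is a hypothetical example written for these notes (it is not part of the standard library); it simply records when each special method is called, including when the block raises an exception.

```python
class Announcer:
    """Hypothetical context manager that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self          # this value is bound to the name after `as`

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        return False         # do not suppress exceptions from the block


announcer = Announcer()
try:
    with announcer as a:
        a.events.append("inside block")
        raise RuntimeError("something went wrong")
except RuntimeError:
    pass

print(announcer.events)
```

Even though the block raised an exception, `__exit__()` still ran, which is the same guarantee that closes a file opened in a `with` statement.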

## Binary Files
During this course, we'll primarily use text files, but binary files (images, videos, executables, specialized data files, etc.) are used constantly in computer systems.

To read and write data to binary files, we can use [bytes](https://docs.python.org/3/library/stdtypes.html#bytes-objects) and [bytearray](https://docs.python.org/3/library/stdtypes.html#bytearray-objects) objects.  Other APIs have been written on top of these types to provide richer capabilities.

Byte strings can also be written as literals by prefixing a string literal with `b`, for example `b'abc'`.
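
A brief sketch of how `bytes` and `bytearray` objects behave; the values are arbitrary and chosen only for illustration.

```python
data = b"Py\x00\xff"            # bytes literal: a string literal prefixed with b
print(data)                     # printable ASCII shown as characters, the rest as \x escapes
print(len(data))                # number of bytes, not characters
print(data[0])                  # indexing yields an int (the byte's value)
print(data[:2])                 # slicing yields another bytes object

buf = bytearray(data)           # bytearray is the mutable counterpart
buf[0] = ord("p")               # in-place modification is allowed
print(buf)

try:
    data[0] = 112               # bytes objects are immutable
    mutated = True
except TypeError:
    mutated = False
print("bytes object was modified:", mutated)
```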

### Binary Example: IP Addresses

The following code fragment resolves a domain name into an IP address.  As you visit various websites on the Internet, the computer performs this resolution so that it can send your request to the appropriate server.

In this example, `socket.gethostbyname()` returns a string representation of the IP address.  An IPv4 address is composed of 4 parts, each with a value between 0 and 255.  So, each value fits in a single byte and an IPv4 address can be represented with 4 bytes.  (IPv6 addresses are represented by 16 bytes.)   After printing out the value, the code converts it with `socket.inet_aton()` to a `bytes` object, which is an immutable sequence of byte values similar to a string. As such, we can use indexes and slices just as we can with strings, tuples, and lists.

When displaying a `bytes` object, Python shows the ASCII character for each byte that corresponds to a printable ASCII character; otherwise, it displays the byte as a hexadecimal escape such as `\xc0`. Recall that in one of the earlier notebooks, we presented the built-in function `chr()` to convert a number to the corresponding Unicode character. (ASCII characters are the same as the first 128 characters of Unicode.)

In [None]:
import socket
addr = socket.gethostbyname('wsj.com')
print(addr)
ba = socket.inet_aton(addr)
print(ba)
print(ba[-1])
print(chr(ba[0]), chr(ba[1]), chr(ba[2]), chr(ba[3]))
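
The cell above needs a network connection and its output changes with the address returned. As a repeatable sketch of the same conversion, the example below starts from a fixed address; `192.0.2.65` is an assumption for illustration, taken from a range reserved for documentation.

```python
import socket

addr = "192.0.2.65"              # documentation-only example address (assumption)
packed = socket.inet_aton(addr)  # 4-byte representation of the address

print(packed)                    # b'\xc0\x00\x02A'
print(list(packed))              # [192, 0, 2, 65]
print(len(packed))               # 4
print(chr(packed[-1]))           # A
print(socket.inet_ntoa(packed))  # 192.0.2.65
```

Only the last byte, 65, corresponds to a printable ASCII character (`A`); the other three are shown as hexadecimal escapes.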

### Writing to a Binary File

In [None]:
with open("test_binary.dat", 'wb') as f:
    f.write(ba)

### Reading from a Binary File

In [None]:
with open("test_binary.dat", 'rb') as f:
    ip_address = f.read()
print(ip_address)
print(type(ip_address))
print(socket.inet_ntoa(ip_address))    # convert the byte array to a string representation

##  Tricky Issues

### Newline Characters
One of the common issues when dealing with text files is that different platforms use different characters to signify a new line.  On Linux and macOS, a newline is represented with just the byte `0x0a <LF> \n`, while on Windows, the two bytes `0x0d 0x0a <CR><LF> \r\n` represent a new line.

In Python 3, the `open()` function has a parameter `newline` which controls how newlines are processed when reading text files.  By default, universal newlines are enabled.  In this mode, lines can end with `\n`, `\r`, or `\r\n`.  Python will translate all of these to `\n` before returning a value to the caller.

When writing output to a file, any `\n` characters are translated to the system default line separator, `os.linesep`, as the output is sent to a file.

For both reading and writing, there are additional modes to force how newlines are handled if required.
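
The effect of the `newline` parameter on reading can be sketched as follows. The file name `newline_modes.txt` is a placeholder; the file is written in binary mode so that the example controls exactly which line endings are stored.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "newline_modes.txt")  # hypothetical file name

# Write raw bytes so that we control exactly which line endings are in the file.
with open(path, "wb") as f:
    f.write(b"unix\nwindows\r\nold mac\rend")

with open(path, "rt") as f:                # default: universal newlines
    translated = f.read()

with open(path, "rt", newline="") as f:    # newline="": no translation on read
    untranslated = f.read()

print(repr(translated))
print(repr(untranslated))
```

With the default setting, all three line endings arrive as `\n`; with `newline=""`, the original `\r\n` and `\r` are passed through unchanged.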

### Encodings
Text files can be created with different encodings, which determine how characters outside of ASCII are stored.  By default, Python uses the platform's preferred encoding, which is UTF-8 on most Linux and macOS systems.   However, you may run across files stored in a different encoding (yes, Windows strikes again ...).  You'll need to figure out what the encoding is and then open the file by passing the right value in the `encoding` argument to `open()`.

In [None]:
with open("data/PakistanSuicideAttacks.csv") as f:
    for line in f:
        print(line.strip())

We can use the third-party `chardet` module to detect the encoding.

In [None]:
import chardet
with open("data/PakistanSuicideAttacks.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
print(result)

In [None]:
with open("data/PakistanSuicideAttacks.csv",encoding='Windows-1252') as f:
    for line in f:
        print(line.strip())
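
The same kind of mismatch can be reproduced without the data file. In this sketch, a short string is written as Windows-1252 (`cp1252`) and then read back with the wrong and the right encoding; the file name `encoding_demo.txt` is a placeholder.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "encoding_demo.txt")  # hypothetical file name

with open(path, "w", encoding="cp1252") as f:   # cp1252 is Windows-1252
    f.write("café\n")

try:
    with open(path, encoding="utf-8") as f:     # wrong encoding for this file
        f.read()
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print("Could not decode as UTF-8:", err)

with open(path, encoding="cp1252") as f:        # matching encoding
    text = f.read()
print(text)
```

The byte that Windows-1252 uses for `é` is not valid on its own in UTF-8, which is why the first read fails.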

### Preferred Encoding 
To see the preferred encoding for our current platform / operating system, we can use the `locale` module and `getpreferredencoding()`.

In [None]:
import locale
locale.getpreferredencoding() 

## File Case Study: DJIA Returns and Statistics
The following code reads a file containing the annual returns for the Dow Jones Industrial Average (DJIA) from 1886 to mid-2022.  The data in the file is stored in a [comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values) (CSV) format. In this format, fields are separated by commas and records by newlines. Optionally, a header row may be present with the field names.
<pre>
Year,Return
2022,-8.6
2021,18.73
2020,7.25
2019,22.34
2018,-5.63
</pre>
This file format is relatively common despite some of the flaws, non-standard versions, and flawed parsers (Microsoft Excel) that exist. The format is especially problematic with fields containing commas or newlines.

Python does contain a [csv module](https://docs.python.org/3/library/csv.html) that you should use rather than trying to parse records yourself.  Another option for many data science related projects is to use the `read_csv()` function in [pandas](https://pandas.pydata.org/) - this toolset will be covered in later notebooks.  Both of these existing capabilities handle many of the more challenging situations (dealing with strings containing commas and newlines) when parsing CSV files.
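
As a minimal sketch of the `csv` module, the example below parses a few rows in the same layout as the DJIA file. To stay self-contained it reads from `io.StringIO` objects rather than a file on disk, and the second data set (with commas inside quoted fields) is made up for illustration.

```python
import csv
import io

# A few rows in the same layout as the DJIA file, held in memory for the example.
sample = io.StringIO("Year,Return\n2022,-8.6\n2021,18.73\n2020,7.25\n")

reader = csv.DictReader(sample)      # uses the header row for the field names
rows = []
for record in reader:
    rows.append((int(record["Year"]), float(record["Return"])))
print(rows)

# Quoted fields may contain commas; csv.reader keeps them together.
tricky = io.StringIO('name,note\n"Dow, Jones","up, then down"\n')
parsed = list(csv.reader(tricky))
print(parsed)
```

A plain `split(",")` would break the second data set into four pieces per line, which is the kind of problem the module handles for you.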

This implementation uses two [parallel arrays](https://en.wikipedia.org/wiki/Parallel_array) to track the data.  This approach is not best practice and is generally discouraged.  As practice, you should rewrite this code using a dictionary where the key is the year and the value is the percentage return.

The code below computes many descriptive statistics.  Try creating functions for these.  The functions should have a single parameter that is a list (sequence) of values. You should also look at computing the [first quartile, third quartile, and interquartile range](https://en.wikipedia.org/wiki/Interquartile_range).

### Read and Parse the File

In [None]:
returns = []
years   = []
with open("data/djia_returns_1886_2022.csv") as f:
    headerline = f.readline()
    for line in f:
        split_line = line.strip().split(",")
        years.append(int(split_line[0]))
        returns.append(float(split_line[1]))

### Computing Descriptive Statistics

In [None]:
returns_sorted = sorted(returns)
total = sum(returns)
mean  = total / len(returns)
mid = len(returns_sorted) // 2
# odd count: the middle value; even count: the average of the two middle values
median = returns_sorted[mid] if len(returns_sorted) % 2 == 1 else (returns_sorted[mid - 1] + returns_sorted[mid]) / 2
print ("Mean(average):",mean)
print ("Median:",median)
print ("Min:", returns_sorted[0])
print ("Max:", returns_sorted[-1])
print ("Range:", returns_sorted[-1] - returns_sorted[0] )

In [None]:
dif = 0
for x in returns:
    dif += (mean-x)**2
population_variance = dif/len(returns)
std_dev = population_variance**.5
print ("Population Variance:", population_variance)
print ("Population Standard Deviation:",std_dev) 
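
One way to cross-check hand-written calculations like these is the standard library's `statistics` module. The sketch below uses only the five sample rows shown earlier (not the full file), so the numbers will differ from the results for the complete data set.

```python
import statistics

sample_returns = [-8.6, 18.73, 7.25, 22.34, -5.63]   # the five rows shown earlier

print("Mean:", statistics.mean(sample_returns))
print("Median:", statistics.median(sample_returns))
print("Population variance:", statistics.pvariance(sample_returns))
print("Population standard deviation:", statistics.pstdev(sample_returns))

# Same population-variance formula as the loop above, for comparison.
m = sum(sample_returns) / len(sample_returns)
manual_variance = sum((m - x) ** 2 for x in sample_returns) / len(sample_returns)
print("Manual population variance:", manual_variance)
```

`pvariance()` and `pstdev()` divide by the number of values (population statistics), while `variance()` and `stdev()` divide by one less (sample statistics).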

### Basic Analysis
There are many questions we can ask of this data.  For instance, when did the DJIA have its best return?  Its worst return?

In [None]:
max_year_index = returns.index(returns_sorted[-1])
print("Best year:", years[max_year_index])
print("Worst year:", years[returns.index(returns_sorted[0])])

### Distribution
Here we'll bring in a visualization library, [seaborn](https://seaborn.pydata.org/), to see the distribution of returns using a histogram. 

In [None]:
import seaborn as sns
axes = sns.histplot(returns,bins=20)
axes.set_title("Distribution of the DJIA Annual Returns: 1886-2022")
axes.set(xlabel='Percentage Return', ylabel='Count')

It's not difficult to create visualizations when you can re-use code modules developed by others.  Probably the hardest part with many visualizations is simply getting data into the expected data format for the library.

### Demonstration of the Central Limit Theorem for $\bar{y}$
The Central Limit Theorem (CLT) is one of the fundamental concepts in statistics.  The CLT can be stated as follows:

Let $\bar{y}$ denote the sample mean computed from a random sample of $n$ measurements from a population having a mean, $\mu$, and standard deviation, $\sigma$. 
Let $\mu_{\bar{y}}$ and $\sigma_{\bar{y}}$ denote the mean and standard deviation of the sampling distribution of $\bar{y}$. Then
1. $\mu_{\bar{y}} = \mu$
2. $\sigma_{\bar{y}} = \sigma / \sqrt{n}$
3. As $n$ grows large, the sampling distribution of $\bar{y}$ approaches a normal distribution.
4. When the population distribution is normal, the sampling distribution of $\bar{y}$ is exactly normal for any sample size $n$.

The following code block picks $n$ values from the population of DJIA returns.  It then computes the mean of that sample.  That process is repeated 5000 times.  Then the mean and standard deviation of the sample means are computed.  We then compare those to the population mean and the expected standard deviation from the CLT.

In [None]:
import random
random.seed(42)   # apply a seed so that the result is reproducible.  If omitted, the seed comes from the operating system's randomness source or the current time

sample_means = []
n = 25
for y in range(5000):
    picked = []
    for x in range(n):
        picked.append(returns[random.randrange(0,len(returns))])
    sample_means.append(sum(picked)/len(picked))
    
sample_mean = sum(sample_means)/len(sample_means)

dif_sample = 0
for x in sample_means:
    dif_sample += (sample_mean-x)**2
sample_variance = dif_sample/(len(sample_means)-1)
sample_std_dev = sample_variance**.5
est_std_dev = std_dev / (n**.5)

print("Population Mean:",mean)
print("Sample mean: ", sample_mean)
print("Sample Standard Deviation:",sample_std_dev) 
print("Estimated Standard Deviation:", est_std_dev)    
    
axes = sns.histplot(sample_means,bins=20)
axes.set_title("Mean of Random Sample Returns, n="+str(n))
axes.set(xlabel='Mean(sample)', ylabel='Count')

### Distribution of Gaussian Random Values
This final code block pulls 5,000 random values from a Gaussian distribution where the mean and standard deviation are defined from the DJIA return values. [Documentation for random.gauss()](https://docs.python.org/3/library/random.html#random.gauss)

In [None]:
random_values = []
for x in range(5000):
    random_values.append(random.gauss(mean, std_dev))
axes = sns.histplot(random_values,bins=20)
axes.set_title("Distribution of Gaussian Random Values")
axes.set(xlabel='Random Value', ylabel='Count')

## Exercises 
1. Write a function named `blastoff` with two parameters: filename and countdown.  Countdown is a positive integer.  Using a range function, write a file that looks like 10 9 8 7 6 5 4 3 2 1---BlastOff

2. Write a function named `sum_blastoff_file` with one parameter: filename. The function reads a file produced by the function in the previous exercise.  It will read all of the numbers and produce their sum, printing the result to the console.

3. Given the stocks in x files, which stock had the largest monetary one-day gain?  Which had the biggest percentage change (gain or loss)?  Which has performed the best since its inception?

TODO NEED to CREATE THIS FILE