# File objects and opening and closing files

* `open` creates a file object (often called a file handle)
    * Not the file itself
    * Used to read and write
* Needs proper handling
    * open
    * close
    * flush

In [1]:
# read ('r') is the default mode
f = open('sell_short_trades.txt')
type(f)

_io.TextIOWrapper

In [5]:
f

<_io.TextIOWrapper name='sell_short_trades.txt' mode='r' encoding='UTF-8'>

In [6]:
f.close()
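
The open, flush, close lifecycle listed above can be sketched as follows. This is a minimal illustration that writes to a scratch file in a temporary directory; the file name `demo.txt` and the variable names are made up for the example and are not part of the trades data.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

f = open(path, 'w')        # create the file object (and the file on disk)
f.write('first line\n')    # the text may still sit in an in-memory buffer
f.flush()                  # push buffered text out to the operating system
print(f.closed)            # False: the file object is still open
f.close()                  # flush anything left and release the file
print(f.closed)            # True

f = open(path)             # reopen for reading ('r' is the default mode)
contents = f.read()
f.close()
print(repr(contents))      # 'first line\n'
```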

## Reading the whole file with `read`

You can read the whole file with the `read` method.

In [7]:
f = open('sell_short_trades.txt')
trades_file = f.read()
f.close()

In [9]:
trades_file[:500]

'                                                                                                                \n DOTC                                                                                 RUN DATE : 06/14/17 21:56 \n                                                                                             RPT DATE :06/14/17 \n                                    TTS0126:SELL SHORT TRADES & SHORT EXEMPT                                    \n Symbol   Side    Cxl      Qty    Price       Bi'

## Reading all lines with `readlines`

You can read all the lines of a file into a list with the `readlines` method.

In [3]:
f = open('sell_short_trades.txt')
lines = f.readlines()
f.close()

In [4]:
lines[:10]

['                                                                                                                \n',
 ' DOTC                                                                                 RUN DATE : 06/14/17 21:56 \n',
 '                                                                                             RPT DATE :06/14/17 \n',
 '                                    TTS0126:SELL SHORT TRADES & SHORT EXEMPT                                    \n',
 ' Symbol   Side    Cxl      Qty    Price       Bid        Ask        T-DatS-DatTradeID      TradeTiSS      Exbkr \n',
 '                                                                                                                \n',
 ' TradeCommType  SourceCommission   Account   OrderID      GTL               Trailer Info               Clr      \n',
 ' CERS     SS      NEW        2,756   2.400000    2.340000   2.45000006/1406/191706149900003 09:30:CustSS        \n',
 '                                              

## Reading files is unsafe!!1!one!

* File might not exist.
* File might be in use/locked.
* Path might not exist.
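
One way to guard against the failures listed above is to catch the exceptions that `open` raises. The sketch below is only an illustration: the helper name `read_lines_or_empty` and the path are hypothetical, and how a locked file is reported depends on the operating system (on some systems it shows up as `PermissionError`).

```python
def read_lines_or_empty(path):
    """Return the lines of `path`, or an empty list if it cannot be read."""
    try:
        with open(path) as f:
            return f.readlines()
    except FileNotFoundError:
        # Raised when the file, or a directory in its path, does not exist.
        print(f'No such file: {path}')
    except PermissionError:
        # Raised when the operating system refuses access to the file.
        print(f'Not allowed to read: {path}')
    return []

# Hypothetical path that should not exist on any machine.
missing = read_lines_or_empty('no_such_directory/no_such_file.txt')
print(missing)             # []
```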

## Writing files is EVEN MORE unsafe!!1!one!

* You can overwrite data!
* File might not exist.
* File might be in use/locked.
* Path might not exist.
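
If overwriting is the main worry, one option is the exclusive-creation mode `'x'`, which refuses to open a file that already exists. A minimal sketch, assuming a scratch file in a temporary directory (the name `report.csv` is made up):

```python
import os
import tempfile

# Hypothetical output file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'report.csv')

with open(path, 'x') as outfile:       # 'x' creates the file, fails if it exists
    outfile.write('original data\n')

try:
    with open(path, 'x') as outfile:   # second attempt: the file is already there
        outfile.write('replacement data\n')
    overwritten = True
except FileExistsError:
    overwritten = False

with open(path) as f:
    kept = f.read()

print(overwritten)         # False
print(repr(kept))          # 'original data\n'
```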

## Managing files inside `with` fixes many problems.

* The file is closed automatically when the block ends, even if an error occurs.
* Automatic close $\rightarrow$ the file is not left locked.
* Context managers take care of cleanup in the awkward cases for you.
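
The automatic close can be checked directly. In this small sketch (scratch file and names are made up for the demonstration), an error is raised inside the `with` block and the file object still ends up closed:

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('some text\n')

try:
    with open(path) as f:
        first = f.readline()
        # Simulate a failure in the middle of processing.
        raise ValueError('something went wrong while processing')
except ValueError:
    pass

print(f.closed)            # True: closed despite the exception
```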

## ALWAYS manage files inside a `with` statement 

* Opening and closing is *context management*
* Automate this with `with`
    * safe and **important**

In [29]:
with open('sell_short_trades.txt') as f:
    lines = f.readlines()
lines[:5]

['                                                                                                                \n',
 ' DOTC                                                                                 RUN DATE : 06/14/17 21:56 \n',
 '                                                                                             RPT DATE :06/14/17 \n',
 '                                    TTS0126:SELL SHORT TRADES & SHORT EXEMPT                                    \n',
 ' Symbol   Side    Cxl      Qty    Price       Bid        Ask        T-DatS-DatTradeID      TradeTiSS      Exbkr \n']

In [30]:
len(lines)

4158

## Cleaning File Workflow

When cleaning a file, we generally

1. Read and split the file, e.g. into lines (unfold)
2. Transform the parts of the file, e.g. filter the lines (transform)
3. Join the processed lines together (fold)
4. Write the result to a *new file*
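
The four steps above can be sketched end to end. This is only an illustration on made-up sample text written to a temporary directory; the file names `raw.txt` and `clean.txt` and the sample rows are assumptions, not the real trades file.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'raw.txt')      # hypothetical input file
out_path = os.path.join(workdir, 'clean.txt')   # hypothetical output file

# Made-up sample content standing in for a messy report.
with open(in_path, 'w') as f:
    f.write(' HEADER LINE \n CERS  SS  NEW \n\n ABCD  SS  NEW \n CERS  SS  CXL \n')

# 1. Read and split the file into lines (unfold).
with open(in_path) as f:
    raw_lines = f.readlines()

# 2. Transform: keep only the stripped lines that start with 'CERS'.
kept = [line.strip() for line in raw_lines if line.strip().startswith('CERS')]

# 3. Join the processed lines together (fold).
content = '\n'.join(kept)

# 4. Write the result to a new file, leaving the input untouched.
with open(out_path, 'w') as outfile:
    outfile.write(content)

print(kept)                # ['CERS  SS  NEW', 'CERS  SS  CXL']
```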

## Example 1

Suppose we only want to keep the lines that start with `'CERS'`.  Like most things, this can be accomplished with a comprehension.

**Note:** `strip` removes whitespace from the ends of the string.  Using `strip` early and often is a good habit!

In [40]:
cers_lines = [line.strip() for line in lines if line.strip().startswith('CERS')]
cers_lines[:5]

['CERS     SS      NEW        2,756   2.400000    2.340000   2.45000006/1406/191706149900003 09:30:CustSS',
 'CERS     SS      NEW          100   2.36000018422.360000   2.37000006/1406/191706149900003 10:20:ContraSSFREX',
 'CERS     SS      NEW          200   2.350000    2.360000   2.37000006/1406/191706149900003 10:20:ContraSSFREX',
 'CERS     SS      NEW          100   2.350000    2.350000   2.36000006/1406/191706149900003 10:20:ContraSSFREX',
 'CERS     SS      NEW          100   2.350000    2.350000   2.36000006/1406/191706149900003 10:22:ContraSSFREX']
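
The comprehension above calls `strip` twice per line. One possible variation, shown here on made-up sample lines, strips each line once with a generator expression and then filters the stripped strings; the names `sample_lines`, `stripped`, and `cers_only` are illustrative.

```python
# Made-up sample lines in the same spirit as the trades file.
sample_lines = [
    ' CERS  SS  NEW  100 \n',
    '   \n',
    ' ABCD  SS  NEW  200 \n',
    ' CERS  SS  NEW  300 \n',
]

# Strip lazily, one line at a time, then filter the stripped strings.
stripped = (line.strip() for line in sample_lines)
cers_only = [s for s in stripped if s.startswith('CERS')]

print(cers_only)           # ['CERS  SS  NEW  100', 'CERS  SS  NEW  300']
```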

## Example 2

Suppose we only want to keep the lines that start with `'CERS'` **and split up all the data on each line**.  Like most things, this can be accomplished with a comprehension.


In [37]:
cers_split_lines = [line.strip().split() for line in lines if line.strip().startswith('CERS')]
cers_split_lines[:5]

[['CERS',
  'SS',
  'NEW',
  '2,756',
  '2.400000',
  '2.340000',
  '2.45000006/1406/191706149900003',
  '09:30:CustSS'],
 ['CERS',
  'SS',
  'NEW',
  '100',
  '2.36000018422.360000',
  '2.37000006/1406/191706149900003',
  '10:20:ContraSSFREX'],
 ['CERS',
  'SS',
  'NEW',
  '200',
  '2.350000',
  '2.360000',
  '2.37000006/1406/191706149900003',
  '10:20:ContraSSFREX'],
 ['CERS',
  'SS',
  'NEW',
  '100',
  '2.350000',
  '2.350000',
  '2.36000006/1406/191706149900003',
  '10:20:ContraSSFREX'],
 ['CERS',
  'SS',
  'NEW',
  '100',
  '2.350000',
  '2.350000',
  '2.36000006/1406/191706149900003',
  '10:22:ContraSSFREX']]

## Example 3 - Being selective with `enumerate`

When the position of an element in a sequence matters, we can use `enumerate` to gain access to the element's index.  Suppose we only want to keep the lines that start with `CERS` and that also sit at an even index in `lines`.

In [38]:
every_other = [line for i, line in enumerate(lines) if line.strip().startswith('CERS') and i % 2 == 0]
every_other[:5]

[' CERS     SS      NEW          100   2.36000018422.360000   2.37000006/1406/191706149900003 10:20:ContraSSFREX  \n',
 ' CERS     SS      NEW          200   2.350000    2.360000   2.37000006/1406/191706149900003 10:20:ContraSSFREX  \n',
 ' CERS     SS      NEW          100   2.350000    2.350000   2.36000006/1406/191706149900003 10:20:ContraSSFREX  \n',
 ' CERS     SS      NEW          100   2.350000    2.350000   2.36000006/1406/191706149900003 10:22:ContraSSFREX  \n',
 ' CERS     SS      NEW          100   2.360000    2.350000   2.36000006/1406/191706149900003 10:24:ContraSSFREX  \n']
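
Note that `i` in the comprehension above counts positions in the full list of lines, not positions among the matching lines. If the goal is every other *matching* line, one option is to filter first and then enumerate (or slice) the result. A minimal sketch on made-up sample lines; all names here are illustrative:

```python
# Made-up sample lines; the matching lines sit at indices 1, 3, 5 and 6.
sample_lines = [
    'header\n',
    ' CERS a\n',
    ' trailer\n',
    ' CERS b\n',
    ' trailer\n',
    ' CERS c\n',
    ' CERS d\n',
]

# Index over the whole file: keeps matches that sit on even file positions.
by_file_index = [line for i, line in enumerate(sample_lines)
                 if line.strip().startswith('CERS') and i % 2 == 0]

# Filter first, then index over the matches only.
matches = [line for line in sample_lines if line.strip().startswith('CERS')]
every_other_match = [line for i, line in enumerate(matches) if i % 2 == 0]

print(by_file_index)       # [' CERS d\n']
print(every_other_match)   # [' CERS a\n', ' CERS c\n']
```

Slicing with `matches[::2]` gives the same result as the second comprehension.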

<h2> <font color='red'> Exercise 1 </font></h2>

Suppose that we are only interested in keeping the second and third entry in each line that starts with `'CERS'`.  Perform this task with a list comprehension.  **Hint:** Slice!

## Joining processed lines

After processing the lines in a file and before writing the results, we join the lines to prepare the contents of the file as a single string.  This is accomplished using `'\n'.join`, which glues the lines back together with `'\n'` between them, so that each string ends up on its own line.

In [42]:
content = '\n'.join(cers_lines)
content[:100]

'CERS     SS      NEW        2,756   2.400000    2.340000   2.45000006/1406/191706149900003 09:30:Cus'
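
A small detail worth seeing on a toy list: `'\n'.join` puts the separator *between* the items only, so the result has no newline after the last line. The names below are made up for the illustration.

```python
rows = ['first', 'second', 'third']

joined = '\n'.join(rows)
print(repr(joined))                 # 'first\nsecond\nthird'

# Splitting on the same separator recovers the original list.
print(joined.split('\n') == rows)   # True

# Add a final newline by hand if the output file should end with one.
with_final_newline = joined + '\n'
print(repr(with_final_newline))     # 'first\nsecond\nthird\n'
```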

## Using `','.join` to create csv files.

After splitting each line of data into parts (see Example 2), we generally join the values back together using commas.  This can be accomplished by applying `','.join` to each of the split lines.

In [45]:
csv_content = '\n'.join([','.join(line) for line in cers_split_lines])
print(csv_content[:500])

CERS,SS,NEW,2,756,2.400000,2.340000,2.45000006/1406/191706149900003,09:30:CustSS
CERS,SS,NEW,100,2.36000018422.360000,2.37000006/1406/191706149900003,10:20:ContraSSFREX
CERS,SS,NEW,200,2.350000,2.360000,2.37000006/1406/191706149900003,10:20:ContraSSFREX
CERS,SS,NEW,100,2.350000,2.350000,2.36000006/1406/191706149900003,10:20:ContraSSFREX
CERS,SS,NEW,100,2.350000,2.350000,2.36000006/1406/191706149900003,10:22:ContraSSFREX
CERS,SS,NEW,100,2.360000,2.350000,2.36000006/1406/191706149900003,10:24:Cont
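
In the first row of the output above, the quantity `2,756` already contains a comma, so a CSV reader would see it as two fields. One possible remedy, not used in the original workflow, is the standard library's `csv` module, which quotes such values. A minimal sketch on two shortened, made-up rows:

```python
import csv
import io

# Shortened, made-up rows; the first quantity contains a comma.
split_rows = [
    ['CERS', 'SS', 'NEW', '2,756', '2.400000'],
    ['CERS', 'SS', 'NEW', '100', '2.350000'],
]

# Plain joining: the embedded comma turns one value into two fields.
naive = '\n'.join(','.join(row) for row in split_rows)
print(naive.splitlines()[0])    # CERS,SS,NEW,2,756,2.400000

# csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerows(split_rows)
quoted = buffer.getvalue()
print(quoted.splitlines()[0])   # CERS,SS,NEW,"2,756",2.400000

# Reading the quoted text back recovers the original five fields per row.
round_trip = list(csv.reader(io.StringIO(quoted)))
print(round_trip == split_rows) # True
```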


<h2> <font color='red'> Exercise 2 </font></h2>

Suppose we are planning to write the contents of our work in <font color='red'> Exercise 1 </font> into a csv file.  Create a content string that contains the data, separated by commas, with one line per data row.

## Writing to files

* Need to open with `mode='w'` or `'a'`
    * `'w'` is *write*
    * `'a'` is *append*
* **Be careful!**
    * `open('file', 'w')` **immediately** erases the contents of `file`
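
The difference between the two modes can be seen on a scratch file. This is a small sketch; the file name `modes_demo.txt` and the variable names are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w') as f:     # 'w' creates the file (or empties an existing one)
    f.write('one\n')

with open(path, 'a') as f:     # 'a' keeps what is there and adds to the end
    f.write('two\n')

with open(path) as f:
    after_append = f.read()
print(repr(after_append))      # 'one\ntwo\n'

with open(path, 'w') as f:     # opening with 'w' again wipes the old contents
    f.write('three\n')

with open(path) as f:
    after_write = f.read()
print(repr(after_write))       # 'three\n'
```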

## Example - writing out the `'CERS'` lines to a csv file

In [48]:
with open('cers_lines.csv','w') as outfile:
    outfile.write(csv_content)

In [49]:
!cat cers_lines.csv | head -n 5

CERS,SS,NEW,2,756,2.400000,2.340000,2.45000006/1406/191706149900003,09:30:CustSS
CERS,SS,NEW,100,2.36000018422.360000,2.37000006/1406/191706149900003,10:20:ContraSSFREX
CERS,SS,NEW,200,2.350000,2.360000,2.37000006/1406/191706149900003,10:20:ContraSSFREX
CERS,SS,NEW,100,2.350000,2.350000,2.36000006/1406/191706149900003,10:20:ContraSSFREX
CERS,SS,NEW,100,2.350000,2.350000,2.36000006/1406/191706149900003,10:22:ContraSSFREX
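
The `!cat ... | head` check relies on a Unix-style shell. Where that is not available, a rough pure-Python equivalent is to read only the first few lines with `itertools.islice`. The sketch below uses a made-up scratch file rather than `cers_lines.csv`:

```python
import os
import tempfile
from itertools import islice

# Hypothetical scratch file with eight short rows.
path = os.path.join(tempfile.mkdtemp(), 'rows.csv')
with open(path, 'w') as outfile:
    outfile.write('\n'.join(f'row {i}' for i in range(8)))

# Read at most the first five lines without loading the whole file.
with open(path) as f:
    head = [line.rstrip('\n') for line in islice(f, 5)]

print(head)    # ['row 0', 'row 1', 'row 2', 'row 3', 'row 4']
```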


<h2> <font color='red'> Exercise 3 </font></h2>

Write the content from <font color='red'> Exercise 2 </font> into a csv file titled `example3.csv`, then use `cat` piped into `head` to check the result.