##  Python assignment 5, Data Science for Biology
## Topics to cover
- Working with files: Input/Output
- `for` loops, processing files one line at a time.
- Regular expressions
- `sys.argv`, `re.search`, `glob`

### The project for this week involves reorganizing the many files generated by temperature loggers that Kelly Klingler (former PhD student in the Peacock lab) used to collect microclimate data around pika haypiles during 2012-2014. There were 38 temperature loggers that recorded data during 2012-2013 and 37 that recorded data during 2013-2014 (one was lost to water damage). Each logger generates one text file per recording season (75 text files in total), and each file has 5 columns of information. You will notice that almost every data logger has two text files: one for 2012-2013 (denoted as “13”) and one for 2013-2014 (denoted as “14”). In addition, two data loggers were placed at each of 19 haypile locations (H1-H19), so each filename also indicates whether the logger was placed at the surface (“S”) or at some depth within the talus slope (“D”).


### To get started working with these files, download the `.tgz` archive containing the 75 files from the course webpage. To extract it, use the Unix command `tar`:

    $ tar -zxvf logfiles.tgz

Once you have downloaded and extracted the files, have a careful look at their contents. You will notice that they all have the same format, so you can use the same code to extract similar information from each file.

### You will need to use `open` inside a loop so that you can read, and work on, each of the 75 files individually. Something like the code below will work (I realize I am giving you the solution here, but that's OK with me this week. I want folks to feel progress!):

    for filename in sys.argv[1:]:
        IN = open(filename, 'r')
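
As a slightly fuller sketch of the same loop, the block below wraps it in a function and uses `with` so that each file is closed automatically. The helper name `count_lines` and the idea of counting lines are illustrative assumptions, not part of the assignment; the point is only the open-inside-a-loop pattern.

```python
import sys

def count_lines(filenames):
    """Return {filename: number_of_lines} for each file (hypothetical helper)."""
    counts = {}
    for filename in filenames:
        # errors='ignore' skips any bytes that are not valid UTF-8
        with open(filename, 'r', encoding='utf-8', errors='ignore') as IN:
            counts[filename] = sum(1 for line in IN)
    return counts

# In a script you would pass the command-line arguments:
#     counts = count_lines(sys.argv[1:])
```

Running the script as `python script.py *.txt.txt` lets the shell expand the wildcard, so `sys.argv[1:]` holds one entry per logger file.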

### For this week's program you are going to read in all of the files and make a single file that holds the specified data in an easily accessible format. Each logger file has 5 columns; we are going to ignore the first, so the data of interest are in the remaining 4 columns. For each .txt file that is read in from `sys.argv`, write the data from each of those four columns to its own row in the outfile, so the outfile will have 4 rows for each infile. Each line should start with the name of the infile, with the different parts of that name comma separated so that the data can eventually be sorted by that information.

For example: 
Infile: 1901302136_H15_D_14.txt.txt

The beginning of each line in the outfile that will have data from the infile (after you have used a regular expression to get rid of "_" and ".txt.txt") should look like this:

1901302136,H15,D,14, data…………………..
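
One way to build that line prefix is sketched below. The helper name `name_to_prefix` is hypothetical, and the pattern assumes every filename follows the serial_haypile_position_year layout shown above.

```python
import re

# Raw string so \d and \w reach the regex engine unchanged; \. matches a literal dot.
FILE_REGEX = r"(\d+)_(\w\d+)_(\w)_(\d+)\.txt\.txt$"

def name_to_prefix(filename):
    """Turn '1901302136_H15_D_14.txt.txt' into '1901302136,H15,D,14'."""
    match = re.search(FILE_REGEX, filename)
    if match is None:
        raise ValueError("unexpected filename: " + filename)
    # groups() returns the four captured pieces in order
    return ",".join(match.groups())

print(name_to_prefix("1901302136_H15_D_14.txt.txt"))  # 1901302136,H15,D,14
```

Capturing the four pieces with groups, rather than deleting "_" and ".txt.txt", also lets you reject files that do not follow the naming scheme.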

The output file should have 4 lines per logger (per infile) and should look as below. The idea here is that this file can easily be read into and worked on in R for the rest of the analyses Kelly needs to do. 

Example data for 1901302108,H7,D,13:

    1901302108,H7,D,13,10/7/2012,10/7/2012,10/7/2012,10/7/2012,
    1901302108,H7,D,13,8:00:00,8:35:00,9:10:00,9:45:00,10:20:00
    1901302108,H7,D,13,AM,AM,AM,AM,AM,AM,AM,PM,PM,PM,PM,PM,PM,
    1901302108,H7,D,13,41.4,41.4,41.4,41.4,41.6,41.7,42.0

Example data for 1901302108,H7,D,14:

    1901302108,H7,D,14,9/29/2013,9/29/2013,9/29/2013,9/29/2013,
    1901302108,H7,D,14,8:00:00,8:35:00,9:10:00,9:45:00,10:20:00,
    1901302108,H7,D,14,AM,AM,AM,AM,AM,AM,AM,PM,PM,PM,PM,PM,PM,PM,
    1901302108,H7,D,14,32.4,32.5,32.5,32.6,33.0,33.0,32.9
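
The column-to-row step can be sketched on a few in-memory lines before touching the real files. This is a minimal sketch under assumptions: fields are taken to be tab-separated in the order index, date, time plus AM/PM, temperature, and the header line in `sample` is made up, so check both against your own files. The helper name `columns_to_rows` is hypothetical.

```python
def columns_to_rows(prefix, lines):
    """Build four comma-separated output rows: date, time, AM/PM, temperature."""
    dates, times, ampms, temps = [], [], [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0].isdigit():
            continue  # skip header or blank lines
        clock, ampm = fields[2].split(" ")
        dates.append(fields[1])
        times.append(clock)
        ampms.append(ampm)
        temps.append(fields[3])
    # Each output row is the filename prefix followed by one whole column
    return [",".join([prefix] + col) for col in (dates, times, ampms, temps)]

sample = [
    "Index\tDate\tTime\tTemp\n",            # assumed header layout
    "1\t10/7/2012\t8:00:00 AM\t41.4\n",
    "2\t10/7/2012\t8:35:00 AM\t41.4\n",
]
for row in columns_to_rows("1901302108,H7,D,13", sample):
    print(row)
```

Using `",".join` avoids a trailing comma, which R would otherwise read as an extra empty field.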


### Example code with one way to do this, using `glob` to access files in the working directory

### importing modules

In [8]:
import glob
import re
import codecs

### Using `glob` in place of command-line arguments in Jupyter notebooks.
- The code below stores every filename in the working directory that ends in `.txt.txt` in `filelist`.
- The `for` loop just prints the elements of the list to confirm this is working as expected.


In [9]:
filelist = glob.glob('*.txt.txt')
for file in filelist:
    print(file)

1901302120_H5_D_13.txt.txt
1901302235_H6_S_14.txt.txt
1901302225_H12_D_13.txt.txt
1901302121_H8_S_14.txt.txt
1901302158_H9_S_13.txt.txt
1901302217_H18_D_14.txt.txt
1901302236_H17_D_13.txt.txt
1901302110_H13_S_14.txt.txt
1901302146_H14_D_14.txt.txt
1901302222_H2_S_13.txt.txt
1901302240_H8_D_14.txt.txt
1901302109_H7_S_14.txt.txt
1901302212_H9_D_13.txt.txt
1901302150_H14_S_14.txt.txt
1901302119_H4_S_13.txt.txt
1901302141_H19_S_14.txt.txt
1901302241_H2_D_14.txt.txt
1901302203_H7_S_13.txt.txt
1901302118_H1_D_13.txt.txt
1901302194_H10_S_14.txt.txt
1901302117_H15_S_13.txt.txt
1901302227_H11_S_14.txt.txt
1901302224_H3_S_13.txt.txt
1901302237_H10_D_13.txt.txt
1901302136_H15_D_14.txt.txt
1901302108_H7_D_13.txt.txt
1901302115_H1_S_13.txt.txt
1901302138_H11_D_13.txt.txt
1901302228_H4_D_13.txt.txt
1901302223_H16_S_14.txt.txt
1901302134_H13_D_13.txt.txt
1901302157_H12_S_13.txt.txt
1901302214_H19_D_13.txt.txt
1901302243_H5_S_14.txt.txt
1901302122_H3_D_14.txt.txt
1901302137_H18_S_13.txt.txt
1901302202
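
`glob.glob` returns names in arbitrary order, as the listing above suggests. If you want the loggers to appear in a predictable order in the output, one option is to sort the list. The sketch below runs in a temporary directory so that no real data are touched; the helper name `list_logger_files` and the extra file `notes.txt` are hypothetical.

```python
import glob
import os
import tempfile

def list_logger_files(directory):
    """Return the .txt.txt filenames in `directory`, sorted alphabetically."""
    pattern = os.path.join(directory, "*.txt.txt")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))

# Demo in a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["1901302120_H5_D_13.txt.txt",
                 "1901302108_H7_D_13.txt.txt",
                 "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    print(list_logger_files(tmp))
```

Only the two names ending in `.txt.txt` are returned, with `1901302108_H7_D_13.txt.txt` first.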

### opening one output file handle to write to

In [12]:
OUT = open("outNEW_logger.txt", 'w')
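
A handle opened this way stays open until `OUT.close()` runs, so if an error interrupts the loop some buffered lines may never reach the disk. An alternative, sketched here with a hypothetical helper `write_rows` and a temporary output path, is to let a `with` block close the file for you.

```python
import os
import tempfile

def write_rows(path, rows):
    """Write each row as one line; the file is closed when the with block ends."""
    with open(path, "w") as OUT:
        for row in rows:
            OUT.write(row + "\n")

# Demo path in a temporary directory (the real program would use its own outfile name)
demo_path = os.path.join(tempfile.mkdtemp(), "outNEW_logger.txt")
write_rows(demo_path, ["1901302108,H7,D,13,41.4,41.4"])
with open(demo_path) as check:
    print(check.read())
```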


### Two `for` loops to open filehandles to each file in the directory ending in .txt.txt, and to then go through each file one line at a time.

In [13]:
for filename in filelist:
    # Raw string; the escaped dots match the literal ".txt.txt" suffix
    file_regex = r"(\d+)_(\w\d+)_(\w)_(\d+)\.txt\.txt"
    file_match = re.search(file_regex, filename)
    if file_match is None:
        continue  # skip files that do not follow the naming scheme
    # Every output row starts with serial number, haypile, S/D, year
    prefix = list(file_match.groups())
    date = list(prefix)
    time = list(prefix)
    ampm = list(prefix)
    temp = list(prefix)

    # The with block closes each input file automatically
    with codecs.open(filename, 'r', encoding='utf-8', errors='ignore') as IN:
        for line in IN:
            # Drop stray null bytes first, then the line ending
            stripped = line.replace("\0", "").rstrip("\r\n")
            # Data lines start with a digit; header lines do not
            if re.match(r"\d", stripped):
                split_strip = stripped.split("\t")
                split_strip_time_ampm = split_strip[2].split(" ")
                date.append(split_strip[1])
                time.append(split_strip_time_ampm[0])
                ampm.append(split_strip_time_ampm[1])
                temp.append(split_strip[3])

    # One comma-separated row per column, with no trailing comma
    for row in (date, time, ampm, temp):
        OUT.write(",".join(row) + "\n")

OUT.close()
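
Before handing the file over for analysis in R, a quick sanity check is to confirm that every logger contributed exactly 4 rows. The sketch below assumes the outfile is plain comma-separated text whose first four fields identify the logger; the helper name `rows_per_logger` is hypothetical.

```python
import csv
from collections import Counter

def rows_per_logger(path):
    """Count output rows for each (serial, haypile, position, year) prefix."""
    counts = Counter()
    with open(path, newline="") as handle:
        for fields in csv.reader(handle):
            if fields:  # skip blank lines
                counts[tuple(fields[:4])] += 1
    return counts

# Example check, once the outfile has been written and closed:
#     all(n == 4 for n in rows_per_logger("outNEW_logger.txt").values())
```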
