# File Parsing

Typing data points into lists by hand is fine for small examples, but what if you have **lots** of data? Wouldn't it be nice to parse large amounts of data from a file instead of entering it all manually? You can do this with Python's built-in file handling functions!

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

We will read Voyager-2 data in this notebook.
[Voyager-2](https://voyager.jpl.nasa.gov) is a spacecraft that was launched in [1977](https://en.wikipedia.org/wiki/Voyager_2). It collects data about the local space environment, including the flux (the rate per unit time and area) of electrons and protons in the nearby environment.

Let's try reading the file.

We're going to use Python's built-in functions to read data from a NASA Voyager data file.

The line below [opens](https://docs.python.org/3/library/functions.html#open) a file called ```VY2PLA_1H_FMT.txt``` in the folder ```infiles```. If you look on GitHub or in the folder containing these notebooks, you should find a folder called ```infiles``` and a file therein called ```VY2PLA_1H_FMT.txt```. The ```"r"``` string at the end tells the computer to open the file for _reading_ rather than for _writing_ (or something else).
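To see the difference between the reading and writing modes in isolation, here is a minimal sketch. It uses a throwaway temporary file (the name `demo.txt` is made up for illustration) so it does not touch anything in ```infiles```, and it uses a `with` block, which closes the file automatically when the block ends.

```python
import os
import tempfile

# Create a throwaway file so this example does not depend on infiles/.
# The file name "demo.txt" is hypothetical.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")

# "w" opens the file for writing (creating it, or overwriting it if it exists).
with open(demo_path, "w") as demo_file:
    demo_file.write("first line\nsecond line\n")

# "r" opens the file for reading; read() returns the whole file as one string.
with open(demo_path, "r") as demo_file:
    demo_text = demo_file.read()

print(demo_text)
# Leaving the with-block closes the file for us.
print(demo_file.closed)
```

Opening a file with ```"w"``` erases whatever was in it, so ```"r"``` is the safe choice whenever you only want to look at existing data.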


In [None]:
outfile = open("infiles/VY2PLA_1H_FMT.txt","r")
print(outfile.name)

You might think, "Great work!" But when you open a file in, say, Microsoft Word and want to know what's in it, you still have to read it. You have to tell the computer to do that too. We will use [readlines()](https://docs.python.org/3/library/io.html#io.IOBase.readlines) to do it here.

In [None]:
data=outfile.readlines()

This reads all the lines in the file into a list. How many lines are there?

In [None]:
len(data)

And what's in the file?

In [None]:
print(data)

Look carefully at the list above that gets returned by readlines. You'll see that each item in the list is a string that corresponds to a single line of text in the file. Each string ends with ```\n```, the newline character, which tells the computer to go to the next line.
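A small sketch of how the trailing ```\n``` behaves, using a made-up line rather than one taken from the Voyager file. `repr` shows the string the way it appears inside a printed list, and `rstrip("\n")` returns a copy without the trailing newline.

```python
# A made-up line for illustration; it is not taken from the Voyager file.
sample_line = "2007 1 0\n"

# repr() shows the newline explicitly, just like the printed list does.
print(repr(sample_line))

# The newline counts as a single character in the string's length.
print(len(sample_line))

# rstrip("\n") returns a new string with the trailing newline removed.
stripped_line = sample_line.rstrip("\n")
print(repr(stripped_line))
```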

#### <span style="color:blue"> Exercise 4.1 </span>
To make the contents easier to read, write a for loop that prints each line of the file on its own line.

In [None]:
# Write a for loop that prints each line in the file.

### Parsing data

The file we just read, ```VY2PLA_1H_FMT.txt```, describes how the data files are formatted. Armed with that information (go back and review it if you haven't already), we can proceed to extract individual data points from the data files. ```v2_hour_2007.txt``` is one such file.

In [None]:
# The previous file defined the format. Now we want to read the data itself.
datafile = open("infiles/v2_hour_2007.txt","r")
data=datafile.readlines()
print(datafile.name)

In [None]:
print(len(data))

In [None]:
for line in data[0:4]:
    # each line already ends with \n, so tell print not to add another one
    print(line, end="")

Let's try to extract the date (year, day-of-year, hour). I can see from the formatting information that the year is the first column of the data, the day-of-year is the second column, and hour is the third column.

So let's try ```line[0], line[1], line[2]``` in our for loop above.

In [None]:
for line in data[0:4]:
    year = line[0]
    dayofyear = line[1]
    hour = line[2]
    print(year, dayofyear, hour)

Does it work? 

It shouldn't. That's because each line is a single string with all the data separated by spaces, so indexing it returns individual characters rather than columns. We have to split the line into its separate values. Review the documentation for [split()](https://docs.python.org/3/library/stdtypes.html#str.split). By default the split function divides the line wherever it finds whitespace, but you can split a line based on other delimiters if you want. (Review the docs to figure out how.)
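As a quick illustration, here is a sketch using made-up strings (not lines from the Voyager file) that compares indexing a string directly, splitting on whitespace, and splitting on a comma.

```python
# Made-up lines for illustration; the values are not real Voyager data.
space_line = "2007   1  0   9999.9\n"
comma_line = "2007,1,0,9999.9\n"

# Indexing a string gives single characters, not columns.
indexed_chars = [space_line[0], space_line[1], space_line[2]]
print(indexed_chars)

# With no argument, split() divides on any run of whitespace
# and ignores the trailing newline.
space_items = space_line.split()
print(space_items)

# Passing a delimiter splits on that exact string instead.
# strip() removes the trailing newline first.
comma_items = comma_line.strip().split(",")
print(comma_items)
```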

Below we will extract the first line from the data, split the line on the spaces, and print the resulting list.

In [None]:
line = data[0]
print(line)
items = line.split()
print(items)

What type are the new items in the list? They should be strings, and you can tell because there are ```' '``` quotation marks around each string. 

#### <span style="color:blue"> Exercise 4.2 </span>
But we want the year, day-of-year, and hour to be floats. Cast the items into floats.

In [None]:
## Convert the items to floats




Now we will write a for loop that cycles through the first few lines and extracts the year, day-of-year, and hour.

In [None]:
# cycle through each line and extract
# the year and day of year, hour,
## proton speed, proton density
## and proton temperature
for line in data[0:4]:
    split_line = line.split()
    print(split_line)
    print(float(split_line[0]))
    year = float(split_line[0])
    dayofyear = float(split_line[1])
    hour = float(split_line[2])
    print(year, dayofyear, hour)
    #proton_speed_kms = float(split_line[3])
    #proton_density_cm3 = float(split_line[4])
    #proton_temperature_K = float(split_line[5])**2*0.0052 * 11604.505

Let's try printing the year, dayofyear, and hour again, after we've parsed all those lines.

In [None]:
print(year, dayofyear, hour)

Notice a problem? We wanted to read **all** the data, and right now we only have the last ones. To read and **store** all the data, we'll need to use lists.

In [None]:
# Now store the data in lists
years = []
days  = []
hours = []


for line in data:
    split_line = line.split()

    years.append(float(split_line[0]))
    days.append(float(split_line[1]))
    hours.append(float(split_line[2]))

#### <span style="color:blue"> Exercise 4.3 </span>
We also want to look at the data recorded by the plasma sensors on Voyager-2. In particular, we're interested in the protons near the Voyager-2 spacecraft.

Go back to the formatting text and figure out which of the columns have the speed, density, and temperature of the protons. Then store the data into three lists.

Note that to get the temperature in units of Kelvin:
$$ T = 0.0052 \times 11604.505  v_{thermal}^2 $$
where $v_{thermal}$ is the proton's thermal speed.
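As a worked example of the conversion, here is a minimal sketch. The function name `thermal_speed_to_kelvin` and the input value are made up for illustration, and the sketch assumes the thermal speed is given in km/s; check the formatting file for the actual units of the column.

```python
def thermal_speed_to_kelvin(v_thermal):
    """Apply T = 0.0052 * 11604.505 * v_thermal**2 from the formula above.

    Hypothetical helper: assumes v_thermal is in km/s.
    """
    return 0.0052 * 11604.505 * v_thermal ** 2


# A made-up thermal speed, not a value from the Voyager file.
example_speed = 10.0
example_temperature_K = thermal_speed_to_kelvin(example_speed)
print(example_temperature_K)
```

Because the speed is squared, doubling the thermal speed raises the temperature by a factor of four.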

In [None]:
# cycle through each line and extract
# the year and day of year, hour,
# proton speed in km/s, proton density in cm^-3
# and proton temperature in K, storing the data in lists

# here are some empty lists for you to use
years = []
days  = []
hours = []
proton_speeds_kms = []
proton_densities_cm3 = []
proton_temperatures_K = []







## Interpreting the Data
Since it's been traveling for more than **40 years**, Voyager-2 may have left the edge of the solar system by now. You might wonder, how can we not know whether it has? Isn't the size of the solar system known?

Well, the answer is kind of. Voyager can help us understand how large the solar system is and what shape it has. You can tell by plotting the protons' density, speed, and temperature over time.

The sun pumps protons into the solar system in the form of the solar wind. When Voyager-2 leaves the solar system, it crosses a shock wave. Inside the shock, the protons are moving very fast, pumped by the solar wind. Outside, you're in interstellar space, where the particles are moving much slower. You can read the fascinating story of Voyager crossing the shock [here](https://www.nature.com/articles/454038a).

#### <span style="color:blue"> Exercise 4.4 </span>
So let's take a look at what the data are telling us. 

Make three plots:
+ The proton speed vs. the day of year
+ The proton density vs. the day of year.
+ The proton temperature vs. the day of year.

You should see a transition in all three of these plots. Which one is the most striking? Can you find the day of year (and in what year) Voyager-2 left the solar system?

In [None]:
# make a plot of the proton speed vs. the day of year




In [None]:
# make a plot of the proton density vs. the day of year




In [None]:
# make a plot of the proton temperature vs. the day of year.





## Writing Data to a File

In [None]:
# now let's write our data to some files
import csv

# (filename, values) pairs: each file gets one "day, value" row per measurement
outputs = [('proton_speeds_kms.csv', proton_speeds_kms),
           ('proton_densities_cm3.csv', proton_densities_cm3),
           ('proton_temperatures_K.csv', proton_temperatures_K)]

for filename, values in outputs:
    # the with-block closes each file so every row is actually written to disk;
    # newline='' lets the csv module control the line endings
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for day, value in zip(days, values):
            writer.writerow([day, value])
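To check that a file was written correctly, you can read it back with `csv.reader`. The sketch below is self-contained: it uses made-up values and a temporary file (the name `demo_speeds.csv` is hypothetical) rather than the Voyager lists, and it converts each field back to a float because the csv module returns strings.

```python
import csv
import os
import tempfile

# Made-up values for illustration; these are not real Voyager measurements.
demo_days = [1.0, 2.0, 3.0]
demo_speeds = [400.5, 398.2, 401.7]

# Write to a throwaway location so no real output file is overwritten.
demo_csv_path = os.path.join(tempfile.mkdtemp(), "demo_speeds.csv")

with open(demo_csv_path, "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for day, speed in zip(demo_days, demo_speeds):
        writer.writerow([day, speed])

# Read the rows back; csv.reader yields each row as a list of strings.
with open(demo_csv_path, "r", newline="") as csv_file:
    rows_read = [[float(value) for value in row] for row in csv.reader(csv_file)]

print(rows_read)
```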