In [1]:
import pandas as pd

# Wranglin' – Corralling Unruly Data
One bit at a time
***

**Version 0.1**

By AA Miller  
2025 Sep 9

For this exercise you will need some pre-prepared text files. They have been compiled into a tarball that you should [download](https://arch.library.northwestern.edu/downloads/8g84mm66j?locale=en) and unpack in the same directory as this notebook. 

Webster's Dictionary$^\ast$ defines wrangler as:

**wrangler** noun

wran·gler | raŋ-g(ə-)lər

(short for horse-wrangler, probably partial translation of Mexican Spanish caballerango groom): a ranch hand who takes care of the saddle horses broadly : cowboy 

$^\ast$actually https://www.merriam-webster.com/dictionary/ - Webster's didn't define wrangler in the way I wanted

How then, as physicists and astronomers, are we all like cowhands?

Data are often like horses in that they all differ, rarely conform to a single standard of behavior, and love to eat hay.$^\dagger$

$^\dagger$I made that last one up.

Thus, in our efforts to better understand the Universe, we must often manipulate, coax, and, in some cases, force our data to "behave." This involves a variety of tasks, such as: gathering, cleaning, matching, restructuring, transforming, filtering, combining, merging, verifying, and fixing data.

Here is a brief and unfortunate truth: there isn't a single person in the entire world who would organize data in *exactly* the same way that you would.

As a result, you may find that data that are useful to you are not organized in an optimal fashion for use in your workflow/software.

Hence: the need to wrangle.

There is one important way in which our lives as physicists are much better than that of the average data scientist: virtually all of our data are numbers.

Furthermore, I contend that more often than not your data can easily be organized into a simple tabular structure.

Nevertheless, as you will see during the exercises, even with relatively simple, small numerical data sets there is a need for wrangling.

And wrangling brings up a lot of issues...

Consider the following data set that contains the street names for my best friends from childhood:

    ['Ewing', 'Isabella', 'Reese', 'Isabella', 
     'Thayer', 'Reese', 'Reese', 'Ewing', 'Reece']

Do you notice anything interesting?

Either my hometown has a street named "Reese"  and a street named "Reece", or the last entry was input incorrectly. 

If the latter is true, then we have to raise the question: what should we do?

For this particular data set, it would be possible to create a verification procedure to test for similar errors.

1. Collect every street name in the city (from post office?)
2. Confirm every data entry has a counterpart.

For any instances where this isn't the case, one could then intervene with a correction. 
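The two-step check above can be sketched in a few lines. This is only an illustration: `known_streets` is a hypothetical reference list (in practice it would come from an authoritative source), and the suggested replacement from `difflib` is a hint for a human to confirm, not an automatic fix.

```python
import difflib

entries = ['Ewing', 'Isabella', 'Reese', 'Isabella',
           'Thayer', 'Reese', 'Reese', 'Ewing', 'Reece']

# Hypothetical reference list of every street in the city
known_streets = ['Ewing', 'Isabella', 'Lawndale', 'Reese', 'Thayer']

flagged = {}
for position, name in enumerate(entries):
    if name not in known_streets:
        # Record the closest known name as a suggestion for manual review
        flagged[position] = difflib.get_close_matches(name, known_streets, n=1)

print(flagged)
```

Only entries without a counterpart are flagged, so an entry that is wrong but still a valid street name passes this check untouched.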

This particular verification catches this street name error, but it doesn't correct for the possibility that the person doing the data entry may have been reading addresses really quickly and the third "Reese" entry should have actually said "Lawndale."

(verification is really hard)

Data provenance – a historical record of the data and its origins – is really really hard.

If you are making "corrections" to the data, then each and every one of those corrections should be reported (for databases this is called "logging"). Ideally, these reports would live with the data so others could understand how things have changed.

If you did change "Reece" to "Reese", anyone working with the data should be able to confirm those changes.
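One lightweight way to keep such a record, sketched here under the assumption that the data live in a `pandas` `Series`, is to route every change through a helper that appends to a correction log. The helper name `correct` and the wording of the `reason` string are made up for illustration.

```python
import pandas as pd

streets = pd.Series(['Ewing', 'Isabella', 'Reese', 'Isabella',
                     'Thayer', 'Reese', 'Reese', 'Ewing', 'Reece'],
                    name='street')

corrections = []  # one record per change, meant to be stored with the data


def correct(series, position, new_value, reason):
    """Return a corrected copy of series and log what was changed."""
    corrections.append({'position': position,
                        'old': series.iloc[position],
                        'new': new_value,
                        'reason': reason})
    fixed = series.copy()
    fixed.iloc[position] = new_value
    return fixed


cleaned = correct(streets, 8, 'Reese', 'no "Reece" in the reference list')
log = pd.DataFrame(corrections)
print(log)
```

Because the original `Series` is left intact and the log stores both old and new values, anyone can confirm (or undo) the change later.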

Suppose now you wanted to use the same street name data set to estimate which street I lived on while growing up. 

One way to mathematically approach this problem would be to convert the streets in the data set to GPS coordinates, and then perform an average for the coordinates of where I lived. 

This too is a form of wrangling, because the data you have (street names) are not the data you need (coordinates). 
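As a rough sketch of that conversion, suppose a lookup table of street coordinates were available. The latitude and longitude values below are placeholders invented purely for illustration, not real positions.

```python
import pandas as pd

streets = pd.Series(['Ewing', 'Isabella', 'Reese', 'Isabella',
                     'Thayer', 'Reese', 'Reese', 'Ewing', 'Reece'])

# Hypothetical coordinates for each street (placeholder numbers only)
coords = pd.DataFrame(
    {'lat': [42.060, 42.072, 42.065, 42.068],
     'lon': [-87.720, -87.700, -87.725, -87.710]},
    index=['Ewing', 'Isabella', 'Reese', 'Thayer'])

# One row per data entry, in data order; names with no match become NaN
located = coords.reindex(streets)

# mean() skips NaN by default, so the unmatched entry is silently ignored
estimate = located.mean()
print(estimate)
print(f"Unmatched entries: {located['lat'].isna().sum()}")
```

Note that the misspelled "Reece" drops out of the average without any warning, which is one more reason to verify the data before transforming them.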

Why harp on this? 

In practice, data scientists (including physicists) spend an unreasonable amount of time manipulating and quality checking data (some industry experts estimate that up to 80% of their time is spent wranglin').

Today, we will work through several examples that require wrangling, while, hopefully, building some strategies to minimize the amount of time you spend on these tasks in the future.

For completeness, I will mention that there is a canonical paper about [data wrangling](http://vis.stanford.edu/files/2011-Wrangler-CHI.pdf), which introduces [`Wrangler`](http://vis.stanford.edu/wrangler/), a tool specifically designed to take heterogeneous (text) data and provide a set of suggested operations/manipulations to create a homogeneous table amenable to standard statistical analysis.

One extremely nice property of the `Wrangler` is that it records every operation performed on the data, ensuring high-fidelity reporting on the data provenance. We should do a better job of this (certainly for astronomy, possibly in your field as well).

Today, we are going to focus on `python` solutions to some specific data sets (drawn from astronomy, but the specifics will not matter).

Hopefully you learn some tricks to make your work easier in the future.

## Problem 0) An (Incomplete) Introduction to Pandas DataFrames

[`Pandas`](https://pandas.pydata.org/) is a powerful open-source Python library designed for data analysis and manipulation. 

There are two primary data structures: 

1. `Series` (for one-dimensional data)
2. `DataFrame` (for two-dimensional, tabular data).

If you need to load, clean, transform, and/or analyze a data set, `pandas` makes this very easy. 

It also natively knows about many of the most common data formats (e.g., CSV, Excel, SQL databases, etc), which significantly streamlines the data reading process. For example: 

`astro_df = pd.read_csv('star_table1.csv')`

`pandas` is fairly intuitive, with a relatively gentle learning curve. Its built-in methods let users compute aggregate statistics, handle missing values, filter rows, group data, and perform complex operations efficiently.

(I am worried that the lectures have become painfully dry as I slowly discuss software syntax, so I include several useful and basic examples below but will not present these as slides)

To create a pandas `Series` and inspect the basic attributes:

In [2]:
s = pd.Series([1.0, 2.5, 3.3], index=['x', 'y', 'z'], name='example')

print(f"Values: {s.values}")
print(f"Index: {list(s.index)}")
print(f"Dtype: {s.dtype}")
print(f"Shape: {s.shape}")
print(f"Name: {s.name}\n")

Values: [1.  2.5 3.3]
Index: ['x', 'y', 'z']
Dtype: float64
Shape: (3,)
Name: example



To select data based on a slice or a mask:

In [3]:
print(f"Position 1: {s.iloc[1]}\n")

print(f"Slice by positions [1:3):\n{s.iloc[1:3]}\n")

mask = s > 3
print(f"Mask s > 3:\n{mask}\n")
print(f"Filtered (s > 3):\n{s[mask]}\n")


Position 1: 2.5

Slice by positions [1:3):
y    2.5
z    3.3
Name: example, dtype: float64

Mask s > 3:
x    False
y    False
z     True
Name: example, dtype: bool

Filtered (s > 3):
z    3.3
Name: example, dtype: float64



To get basic statistics and summaries:

*Note* – there are many more options than the ones shown here.

In [4]:
print(f"Mean: {s.mean()}")
print(f"Std: {s.std()}")
print(f"Min/Max: {s.min()} / {s.max()}\n")

Mean: 2.2666666666666666
Std: 1.1676186592091329
Min/Max: 1.0 / 3.3



It is also possible to sort, find and replace, and remove missing values – read the docs for more examples! 
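A quick sketch of those three operations, using a small made-up `Series` in which `-999.0` plays the role of a "missing" sentinel:

```python
import numpy as np
import pandas as pd

s2 = pd.Series([3.3, np.nan, 1.0, -999.0], index=['z', 'w', 'x', 'y'])

sorted_s = s2.sort_values()            # ascending; NaN is placed last by default
replaced = s2.replace(-999.0, np.nan)  # swap the sentinel for a true missing value
no_missing = s2.dropna()               # drop entries that are NaN

print(sorted_s)
print(replaced)
print(no_missing)
```

Each call returns a new `Series` and leaves `s2` unchanged.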

For tabular data, one would instead create a `DataFrame`:

In [5]:
# Create a DataFrame from a dictionary 
#(note - 2d data arrays can also be used, with column names separately specified)
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Occupation": ["Engineer", "Doctor", "Artist"]
}

df = pd.DataFrame(data)

# Inspect basic attributes
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Index: {df.index.tolist()}")
print(f"Data types:\n{df.dtypes}")


Shape: (3, 3)
Columns: ['Name', 'Age', 'Occupation']
Index: [0, 1, 2]
Data types:
Name          object
Age            int64
Occupation    object
dtype: object


The `DataFrame` can be sliced and specific data can be extracted: 

In [6]:
# Select a single column
ages = df["Age"]
print(f"Ages:\n{ages}")

# Select multiple columns
subset = df[["Name", "Occupation"]]
print(f"Subset:\n{subset}")

# Select rows by index
first_two = df.iloc[:2]
print(f"First two rows:\n{first_two}")

# Conditional selection
older_than_28 = df[df["Age"] > 28]
print(f"People older than 28:\n{older_than_28}")


Ages:
0    25
1    30
2    35
Name: Age, dtype: int64
Subset:
      Name Occupation
0    Alice   Engineer
1      Bob     Doctor
2  Charlie     Artist
First two rows:
    Name  Age Occupation
0  Alice   25   Engineer
1    Bob   30     Doctor
People older than 28:
      Name  Age Occupation
1      Bob   30     Doctor
2  Charlie   35     Artist


Basic statistics and summaries: 

In [7]:
# Summary statistics
summary = df.describe()
print(f"Summary statistics:\n{summary}")

# Mean age
mean_age = df["Age"].mean()
print(f"Mean age: {mean_age:.2f}")

# Count of unique occupations
unique_jobs = df["Occupation"].nunique()
print(f"Number of unique occupations: {unique_jobs}")


Summary statistics:
        Age
count   3.0
mean   30.0
std     5.0
min    25.0
25%    27.5
50%    30.0
75%    32.5
max    35.0
Mean age: 30.00
Number of unique occupations: 3


Columns can be modified and new columns can be added:

In [8]:
# Add a new column
df["Age in 10 Years"] = df["Age"] + 10
print(f"DataFrame with new column:\n{df}")

# Modify an existing column
df["Name"] = df["Name"].str.upper()
print(f"Modified names:\n{df['Name']}")


DataFrame with new column:
      Name  Age Occupation  Age in 10 Years
0    Alice   25   Engineer               35
1      Bob   30     Doctor               40
2  Charlie   35     Artist               45
Modified names:
0      ALICE
1        BOB
2    CHARLIE
Name: Name, dtype: object


**Problem 0a**

Read the SDSS data for Problem 4, stored in a csv file called `DSFP_SDSS_spec_train.csv`, into a `pandas` DataFrame called `sdss_spec`.

In [9]:
sdss_spec = pd.read_csv("DSFP_SDSS_spec_train.csv")
sdss_spec.head()

Unnamed: 0,specObjID,z,type,psfMag_u,psfMag_g,psfMag_r,psfMag_i,psfMag_z,modelMag_u,modelMag_g,...,extinction_i,extinction_z,w1mpro,w1snr,w2mpro,w2snr,w3mpro,w3snr,w4mpro,w4snr
0,299567742770505728,0.071414,ext,20.88291,19.23907,18.5617,18.09715,17.76469,19.62189,18.03702,...,0.061267,0.045572,14.395,21.1,14.236,15.6,11.029,6.8,8.579,-0.9
1,299568017178650624,0.07138,ext,20.88291,19.23907,18.5617,18.09715,17.76469,19.62189,18.03702,...,0.061267,0.045572,14.395,21.1,14.236,15.6,11.029,6.8,8.579,-0.9
2,299566643258877952,0.088173,ext,20.84844,18.9604,18.08027,17.62953,17.31857,20.18508,18.2612,...,0.047582,0.035392,14.162,35.0,13.97,22.3,12.233,2.9,9.067,-0.3
3,299569116690278400,0.088161,ext,20.84844,18.9604,18.08027,17.62953,17.31857,20.18508,18.2612,...,0.047582,0.035392,14.162,35.0,13.97,22.3,12.233,2.9,9.067,-0.3
4,299568292056557568,0.066539,ext,21.28256,19.61427,18.98529,18.52956,18.26322,20.18081,18.47435,...,0.058709,0.043669,14.734,30.6,14.512,17.6,11.078,9.8,9.054,2.3


`pandas` provides many different methods for selecting columns from the DataFrame. Supposing you wanted the `psfMag` columns, you could use any of the following:

    sdss_spec['psfMag_g']
    sdss_spec[['psfMag_r', 'psfMag_z']]
    sdss_spec.psfMag_g

(notice that selecting multiple columns requires a list within `[]`)
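To grab every `psfMag` column at once, one option is to select on the column names themselves. The sketch below uses a tiny stand-in table (`toy`, with a few values retyped from the `head()` output above) rather than the full file. Keep in mind that attribute-style access such as `sdss_spec.psfMag_g` only works when the column name is a valid Python identifier that does not collide with an existing `DataFrame` attribute or method.

```python
import pandas as pd

# Toy stand-in for sdss_spec with a handful of its columns
toy = pd.DataFrame({'type': ['ext', 'ext'],
                    'psfMag_g': [19.23907, 18.9604],
                    'psfMag_r': [18.5617, 18.08027],
                    'modelMag_g': [18.03702, 18.2612]})

# Keep columns whose name contains the substring 'psfMag'
psf_cols = toy.filter(like='psfMag')

# Equivalent selection with a boolean mask over the column names
same = toy.loc[:, toy.columns.str.startswith('psfMag')]

print(psf_cols)
```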

`pandas` can also be used to aggregate the results of a search.

**Problem 0c**

How many extended sources (`type` = `ext`) have `modelMag_i` between 19 and 20? Use as few lines as possible.

In [10]:
len(sdss_spec[(sdss_spec.type == 'ext') & 
              (sdss_spec.modelMag_i > 19) & 
              (sdss_spec.modelMag_i < 20)])

369

`pandas` also enables [`GROUP BY`](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) operations, where the data are split based on some criterion, a function is then applied to the groups, and the results are then combined back into a data structure.

**Problem 0d**

Group the data by their `type` and then report the minimum, median, and maximum redshift of each group. Can you immediately tell anything about these sources based on these results?

*Hint* - just execute the cell below.

In [11]:
grouped = sdss_spec.groupby([sdss_spec.type])
print(grouped['z'].min())
print(grouped['z'].median())
print(grouped['z'].max())

type
ext   -0.005469
ps    -0.010875
Name: z, dtype: float64
type
ext    0.107352
ps     0.000275
Name: z, dtype: float64
type
ext    6.839257
ps     6.687022
Name: z, dtype: float64
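The three separate calls can also be collapsed into a single split-apply-combine step with `agg`. The sketch below runs on a small toy table; the `ext` redshifts are retyped from the `head()` output above, while the `ps` redshifts are placeholders chosen only for illustration.

```python
import pandas as pd

toy_z = pd.DataFrame({'type': ['ext', 'ext', 'ext', 'ps', 'ps'],
                      'z': [0.071414, 0.088173, 0.066539, 0.0001, 0.0003]})

# Split by type, apply three reductions to z, combine into one table
summary = toy_z.groupby('type')['z'].agg(['min', 'median', 'max'])
print(summary)
```

The result has one row per group and one column per statistic, which is often easier to read than three separate printouts.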


Finally, we currently only have a single table, but `pandas` also has methods to join two or more tables (providing a lot of functionality similar to databases), which makes the [join or merge](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) extremely powerful.
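As a minimal sketch of a merge, consider two made-up tables that share an identifier column. The table names, IDs, and magnitudes here are all hypothetical.

```python
import pandas as pd

optical = pd.DataFrame({'specObjID': [1, 2, 3],
                        'psfMag_r': [18.5, 18.1, 19.0]})
infrared = pd.DataFrame({'specObjID': [2, 3, 4],
                         'w1mpro': [14.2, 14.7, 15.1]})

# Inner join: keep only sources present in both tables
inner = pd.merge(optical, infrared, on='specObjID', how='inner')

# Left join: keep every optical source; indicator=True adds a '_merge'
# column recording where each row came from
left = pd.merge(optical, infrared, on='specObjID', how='left', indicator=True)

print(inner)
print(left)
```

The `_merge` column is a cheap verification step: it shows immediately which rows failed to find a counterpart.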

## Problem 1) The Sins of Our Predecessors

If at any point in your career you need to access archival infrared data, you will likely need to retrieve that information from the [NASA IPAC InfraRed Science Archive](https://irsa.ipac.caltech.edu). IRSA houses the data for every major NASA IR mission, and several ground-based missions as well (e.g., 2MASS, IRTF). Whether you are studying brown dwarfs, explosive transients, solar system objects, star-formation, galaxy evolution, Milky Way dust and the resulting extinction of extragalactic observations, or quasars (and much more) the IR plays a critical role.

Given the importance of IR observations, it stands to reason that IRSA would provide data in a simple to read format for modern machines, such as comma separated values or FITS binary tables...

Right?...

**Right?...**

In fact, IRSA has created their own standard for storing data in a text file. The particulars of this format can be found in `irsa_catalog_WISE_iPTF14jg_search_results.tbl`, a file that is written in the standard IRSA format.

*shameless plug alert!* iPTF14jg is a [really strange star](https://arxiv.org/pdf/1901.10693.pdf) that exhibited a large outburst that we still don't totally understand. The associated data file includes [NEOWISE](https://neowise.ipac.caltech.edu/) observations of the mid-IR evolution of this outburst.

**Problem 1a**

Using `pandas` read the data in the IRSA table file into a `DataFrame` object.

*Hint 1* - you absolutely should look at the text file to develop a strategy to accomplish this goal.

*Hint 2* - you may want to manipulate the text file so that it can more easily be read by `pandas`. **If you do this** be sure to copy the file to another name as you will want to leave the original intact. 

In [12]:
# Solution 1 - pure python solution with pandas

with open('irsa_catalog_WISE_iPTF14jg_search_results.tbl') as f:
    ll = f.readlines()
    for linenum, l in enumerate(ll):
        if l[0] == '|':
            header = l.replace('|', ',').replace(' ', '')
            header = list(header[1:-2].split(','))
            break
            
irsa_tbl = pd.read_csv("irsa_catalog_WISE_iPTF14jg_search_results.tbl", 
            skiprows=linenum+4, sep=r'\s+',  # delim_whitespace is deprecated in pandas >= 2.2
            header=None, names=header)
irsa_tbl.head(5)

Unnamed: 0,ra,dec,sigra,sigdec,sigradec,w1mpro,w1sigmpro,w1snr,w1rchi2,w2mpro,...,w4sigmpro_allwise,tmass_key,j_m_2mass,j_msig_2mass,h_m_2mass,h_msig_2mass,k_m_2mass,k_msig_2mass,dist,angle
0,40.125566,60.879306,0.0947,0.0936,-0.0337,13.251,0.038,28.8,0.8767,12.735,...,,,,,,,,,0.028668,242.330013
1,40.125574,60.879297,0.0925,0.0874,-0.0397,13.241,0.036,30.0,1.261,12.629,...,,,,,,,,,0.048011,192.659014
2,40.125623,60.879318,0.0808,0.0794,0.0098,12.753,0.039,27.8,0.764,11.901,...,,,,,,,,,0.079498,69.867298
3,40.125553,60.879286,0.0663,0.0608,-0.0156,12.774,0.028,38.6,1.361,11.935,...,,,,,,,,,0.097389,209.287467
4,40.125572,60.87928,0.0645,0.0599,-0.0223,12.84,0.028,38.7,1.555,11.897,...,,,,,,,,,0.107707,187.00605


That pure python solution is a bit annoying as it requires a for loop with a break, and specific knowledge about how IRSA tables handle data headers (hence the use of `linenum + 4` for `skiprows`). Alternatively, one could manipulate the data file to read in the data.

In [13]:
# solution 2 - edit the text file
# !cp irsa_catalog_WISE_iPTF14jg_search_results.tbl tmp.tbl
### delete lines 1-89, and 90-92
### replace whitespace with commas (may require multiple operations)
### replace '|' with commas
### replace ',,' with single commas
### replace ',\n,' with '\n'
### delete the comma at the very beginning and very end of the file

tedit_tbl = pd.read_csv('tmp.tbl')
tedit_tbl.head(5)

Unnamed: 0,ra,dec,sigra,sigdec,sigradec,w1mpro,w1sigmpro,w1snr,w1rchi2,w2mpro,...,w4sigmpro_allwise,tmass_key,j_m_2mass,j_msig_2mass,h_m_2mass,h_msig_2mass,k_m_2mass,k_msig_2mass,dist,angle
0,40.125566,60.879306,0.0947,0.0936,-0.0337,13.251,0.038,28.8,0.8767,12.735,...,,,,,,,,,0.028668,242.330013
1,40.125574,60.879297,0.0925,0.0874,-0.0397,13.241,0.036,30.0,1.261,12.629,...,,,,,,,,,0.048011,192.659014
2,40.125623,60.879318,0.0808,0.0794,0.0098,12.753,0.039,27.8,0.764,11.901,...,,,,,,,,,0.079498,69.867298
3,40.125553,60.879286,0.0663,0.0608,-0.0156,12.774,0.028,38.6,1.361,11.935,...,,,,,,,,,0.097389,209.287467
4,40.125572,60.87928,0.0645,0.0599,-0.0223,12.84,0.028,38.7,1.555,11.897,...,,,,,,,,,0.107707,187.00605


That truly wasn't all that much better, as it required a bunch of clicks/text editor edits. (There are programs such as `sed` and `awk` that could be used to execute all the necessary edits from the command line, but that too is cumbersome and somewhat like the initial all-`python` solution.)
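For reference, here is a sketch of how the `python` approach could avoid the hard-coded offset by locating the `|` header rows itself. It runs on a made-up miniature table (`sample`) that merely imitates the layout of the IRSA file (leading `\` comment lines, four `|`-delimited header rows, whitespace-separated data, and `null` for missing values); the real file may differ in details.

```python
import io
import pandas as pd

# Made-up miniature table imitating the IRSA text layout
sample = """\\ hypothetical comment line
\\ another hypothetical comment line
|  ra       |  dec      |  w1mpro  |
|  double   |  double   |  double  |
|  deg      |  deg      |  mag     |
|  null     |  null     |  null    |
   40.1256     60.8793     13.251
   40.1257     60.8794     null
"""

lines = sample.splitlines()

# Positions of every header row (those that start with '|')
header_rows = [i for i, line in enumerate(lines) if line.startswith('|')]

# Column names come from the first header row
names = [col.strip() for col in lines[header_rows[0]].strip('|').split('|')]

# Data start right after the last header row
data_start = header_rows[-1] + 1

tbl = pd.read_csv(io.StringIO(sample), skiprows=data_start, sep=r'\s+',
                  header=None, names=names, na_values=['null'])
print(tbl)
```

This still bakes in knowledge of the format, but the number of header rows is discovered rather than assumed.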

Side note - if astronomers are creating data in a "standard" format, then it ought to be easy for other astronomers to access that data.

Fortunately, in this particular case, there is an easy solution - [`astropy Tables`](http://docs.astropy.org/en/stable/table/). 

IRSA tables are so commonly used throughout the community, that the folks at `astropy` have created a convenience method for all of us to read in tables created in that particular (unusual?) format. I show an example here, but this will only be relevant for the astronomy students.

**Problem 1b**

Use [`Table.read()`](http://docs.astropy.org/en/stable/api/astropy.table.Table.html#astropy.table.Table.read) to read in `irsa_catalog_WISE_iPTF14jg_search_results.tbl` to an `astropy Table` object.

In [14]:
from astropy.table import Table

Table.read('irsa_catalog_WISE_iPTF14jg_search_results.tbl', format='ipac')

ra,dec,sigra,sigdec,sigradec,w1mpro,w1sigmpro,w1snr,w1rchi2,w2mpro,w2sigmpro,w2snr,w2rchi2,nb,na,cc_flags,ph_qual,qual_frame,mjd,allwise_cntr,w1mpro_allwise,w1sigmpro_allwise,w2mpro_allwise,w2sigmpro_allwise,w3mpro_allwise,w3sigmpro_allwise,w4mpro_allwise,w4sigmpro_allwise,tmass_key,j_m_2mass,j_msig_2mass,h_m_2mass,h_msig_2mass,k_m_2mass,k_msig_2mass,dist,angle
deg,deg,arcsec,arcsec,arcsec,mag,mag,Unnamed: 7_level_1,Unnamed: 8_level_1,mag,mag,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,mjdate,Unnamed: 19_level_1,mag,mag,mag,mag,mag,mag,mag,mag,Unnamed: 28_level_1,mag,mag,mag,mag,mag,mag,arcsec,deg
float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,int64,int64,str4,str2,int64,float64,int64,float64,float64,float64,float64,float64,float64,float64,float64,int64,float64,float64,float64,float64,float64,float64,float64,float64
40.1255655,60.8793063,0.0947,0.0936,-0.0337,13.251,0.038,28.8,0.8767,12.735,0.058,18.6,0.7135,1,0,0000,AA,10,57621.23501414,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,0.028668,242.330013
40.125574,60.879297,0.0925,0.0874,-0.0397,13.241,0.036,30.0,1.261,12.629,0.051,21.2,1.121,1,0,0000,AA,5,57621.62810604,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,0.048011,192.659014
40.1256226,60.8793176,0.0808,0.0794,0.0098,12.753,0.039,27.8,0.764,11.901,0.034,32.0,1.059,1,0,0000,AA,10,57063.79830186,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,0.079498,69.867298
40.1255528,60.8792864,0.0663,0.0608,-0.0156,12.774,0.028,38.6,1.361,11.935,0.037,29.3,0.7133,1,0,0000,AA,10,57256.77485551,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,0.097389,209.287467
40.1255725,60.8792803,0.0645,0.0599,-0.0223,12.84,0.028,38.7,1.555,11.897,0.031,35.1,1.045,1,0,0000,AA,5,57257.10275142,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,0.107707,187.00605
40.1255736,60.8792799,0.1164,0.1059,-0.039,13.587,0.041,26.4,1.647,13.145,0.076,14.3,1.03,1,0,0000,AA,5,57987.97028435,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,0.108928,185.907602
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40.1253594,60.8792163,0.0631,0.0675,-0.0093,12.791,0.03,36.6,1.13,11.972,0.039,27.5,0.6409,1,0,0000,AA,10,57422.73853115,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,0.512985,228.885562
40.1252771,60.8793411,0.115,0.1078,-0.0397,13.511,0.041,26.2,1.299,12.668,0.061,17.8,0.6642,1,0,0000,AA,5,57790.01894558,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,0.542346,281.913746
40.1253172,60.8792186,0.1035,0.1091,-0.0272,13.328,0.038,28.6,1.868,12.893,0.091,11.9,1.075,1,0,0000,AA,5,57790.21542903,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,--,0.565899,234.448038


A benefit to using this method, as opposed to `pandas`, is that data typing and data units are naturally read from the IRSA table and included with the associated columns. Thus, if you are uncertain if some brightness measurement is in magnitudes or Janskys, the `astropy Table` can report on that information.

Unfortunately, `astropy` does *not* know about every strange formatting decision that every astronomer has made at some point in their lives (as we are about to see...) 

## Problem 2) The Sins of Our Journals

Unlike IRSA/IPAC, which uses a weird but nevertheless consistent format for data tables, data retrieved from Journal articles essentially follows no rules. In principle, tables in Journal articles are supposed to be provided in a machine readable format. In practice, as we are about to see, this is far from the case.

For this particular wrangling case study we will focus on supernova light curves. These should be simple to report (time, filter, brightness, and the uncertainty on that brightness), yet the community has nevertheless managed to mangle them into some truly wild and difficult-to-parse forms.

(Sorry for the heavy emphasis on time-domain examples - I'm pulling straight from my own life today, but the issues described here are not perfectly addressed by any subfield within the astro umbrella)

Here is the LaTeX-formatted version of Table 4 from [Miller et al. 2011](https://iopscience.iop.org/article/10.1088/0004-637X/730/2/80/meta):

<img style="display: block; margin-left: auto; margin-right: auto" src="images/Miller11_tbl4.png" width="350" align="middle">

That is a very simple table to interpret, no?

Have a look at the ["machine-readable" file](https://iopscience.iop.org/0004-637X/730/2/80/suppdata/apj382770t4_ascii.txt?doi=10.1088/0004-637X/730/2/80) that ApJ provides for readers that might want to evaluate these photometric measurements.

**Problem 2a** 

Read the ApJ version of Table 4 from Miller et al. 2011 – called `Miller_et_al2011_table4.txt` – into a `pandas DataFrame`.

In [15]:
# pure python solution with pandas

tbl4 = pd.read_csv('Miller_et_al2011_table4.txt', 
                   skiprows=5, sep=r'\s+',  # delim_whitespace is deprecated in pandas >= 2.2
                   skipfooter=3, engine='python',
                   names=['t_mid', 'J', 'Jdum', 'J_unc', 
                                   'H', 'Hdum', 'H_unc',
                                   'K', 'Kdum', 'K_unc'])
tbl4.drop(columns=['Jdum', 'Hdum', 'Kdum'], inplace=True)

print(tbl4)

       t_mid      J  J_unc     H  H_unc     K  K_unc
0  55466.137  10.04   0.03  9.14   0.03  8.65   0.03
1  55468.145   9.99   0.03  9.06   0.04  8.64   0.04
2  55469.148  10.04   0.03  9.07   0.03  8.70   0.03
3  55479.109  10.11   0.03  9.11   0.03  8.63   0.04
4  55504.164  10.20   0.03  9.24   0.03  8.74   0.03
5  55513.195  10.29   0.03  9.34   0.03  8.79   0.03
6  55518.168  10.32   0.03  9.34   0.04  8.84   0.03
7  55527.117  10.35   0.03  9.40   0.03  8.89   0.03
8  55531.145  10.40   0.03  9.44   0.03  8.97   0.03
9  55543.066  10.45   0.03  9.48   0.04  9.06   0.04


That wasn't too terrible. But what if we consider a more typical light curve table, where there are loads of missing data, such as Table 2 from [Foley et al. 2009](https://iopscience.iop.org/article/10.1088/0004-6256/138/2/376#aj309430t2):

<img style="display: block; margin-left: auto; margin-right: auto" src="images/Foley09_tbl2.png" width="650" align="middle">

Again, this table is straightforward to read, and it isn't hard to imagine how one could construct a machine-readable csv or other file from this information. But alas, this is not what is available from ApJ. So, we will need to figure out how to deal with both the missing data, "...", and the weird convention that many astronomers use where the uncertainties are (a) not reported in their own column, and (b) not provided in the same units as the measurement itself. I can understand the former, but the latter is somewhat baffling...

**Problem 2b** 

Read the ApJ version of Table 2 from Foley et al. 2009 – called `Foley_et_al2009_table2.txt` – into either a `pandas DataFrame` or an `astropy Table`.

In [16]:
# a (not terribly satisfying) pure python solution
# read the file in, parse and write another file that plays nice with pandas

with open('Foley_et_al2009_for_pd.csv','w') as fw:
    print('JD,Bmag,Bmag_unc,Vmag,Vmag_unc,Rmag,Rmag_unc,Imag,Imag_unc,Unfiltmag,Unfiltmag_unc,Telescope',file=fw)
    with open('Foley_et_al2009_table2.txt') as f:
        ll = f.readlines()
        for l in ll:
            if l[0] == '2':
                print_str = l.split()[0] + ','
                for col in l.split()[1:]:
                    if col == 'sdotsdotsdot':
                        print_str += '-999,-999,'
                    elif col[0] == '>':
                        print_str += '{},-999,'.format(col[1:])
                    elif col == 'KAIT':
                        print_str += 'KAIT'
                    elif col == 'Nickel':
                        print_str += 'Nickel'
                    elif col[0] == '(':
                        print_str += '0.{},'.format(col[1:-1])
                    else:
                        print_str += '{},'.format(col)

                print(print_str,file=fw)

pd.read_csv('Foley_et_al2009_for_pd.csv')

Unnamed: 0,JD,Bmag,Bmag_unc,Vmag,Vmag_unc,Rmag,Rmag_unc,Imag,Imag_unc,Unfiltmag,Unfiltmag_unc,Telescope
0,2454764.8,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,19.5,-999.0,KAIT
1,2454778.69,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,18.069,0.092,KAIT
2,2454781.76,18.34,0.084,17.828,0.037,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,KAIT
3,2454783.74,18.229,0.062,17.718,0.042,17.509,0.041,17.377,0.054,-999.0,-999.0,KAIT
4,2454784.64,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,17.736,0.091,KAIT
5,2454784.71,18.23,0.03,17.635,0.03,17.57,0.03,17.392,0.03,-999.0,-999.0,Nickel
6,2454785.67,18.385,0.03,17.66,0.027,17.61,0.023,17.425,0.034,17.683,0.038,KAIT
7,2454786.8,18.415,0.03,17.71,0.03,17.544,0.03,17.358,0.03,-999.0,-999.0,Nickel
8,2454787.67,18.596,0.03,17.762,0.03,17.552,0.03,17.376,0.03,-999.0,-999.0,KAIT
9,2454789.68,18.904,0.03,17.827,0.014,17.573,0.011,17.353,0.03,17.732,0.028,KAIT


Okay - there is nothing elegant about that particular solution. But it works, and wranglin' ain't pretty. 
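The `'0.{}'` trick above works when every parenthetical uncertainty has exactly as many digits as the measurement has decimal places. A slightly more general sketch is shown below, assuming the convention that the digits in parentheses are in units of the last quoted decimal place of the measurement. The helper name `paren_unc` and the example strings are hypothetical.

```python
def paren_unc(value_str, unc_str):
    """Convert a measurement and its parenthetical uncertainty to floats.

    Assumes the digits inside the parentheses are in units of the last
    decimal place quoted for the measurement.
    """
    digits = unc_str.strip('()')
    # Number of decimal places quoted for the measurement itself
    n_decimals = len(value_str.partition('.')[2])
    return float(value_str), int(digits) / 10**n_decimals


print(paren_unc('18.340', '(084)'))
print(paren_unc('17.8', '(2)'))
```

Working from the strings rather than from floats matters here: once `'18.340'` becomes `18.34`, the number of quoted decimal places is lost.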

It is likely that you developed a solution that looks very different from this one, and that is fine. When data are provided in an unruly format, the most important thing is to develop some method, any method, for converting the information into a useful format. Following whatever path you used above, it should now be easy to plot the light curve of SN 2008ha.
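Before plotting, it is worth converting the `-999` sentinel to a proper missing value, otherwise it will wreck the axis limits and any statistics. A sketch using the first four rows of the output above (retyped as CSV text, with a subset of the columns):

```python
import io
import pandas as pd

# First four rows of the table above, subset of columns
csv_text = """JD,Bmag,Bmag_unc,Vmag,Vmag_unc,Unfiltmag,Unfiltmag_unc,Telescope
2454764.8,-999,-999,-999,-999,19.5,-999,KAIT
2454778.69,-999,-999,-999,-999,18.069,0.092,KAIT
2454781.76,18.34,0.084,17.828,0.037,-999,-999,KAIT
2454783.74,18.229,0.062,17.718,0.042,-999,-999,KAIT
"""

# Treat the sentinel as missing while reading
lc = pd.read_csv(io.StringIO(csv_text), na_values=['-999'])

# With this encoding, upper limits are rows with a magnitude but no uncertainty
limits = lc['Unfiltmag'].notna() & lc['Unfiltmag_unc'].isna()

print(lc)
print(limits)
```

With missing values stored as NaN, reductions such as `mean()` skip them by default, and the upper limits can be drawn with a different marker than the detections.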