# Data Visualization Tips and Tricks

Hopefully this can be a supplement for your data visualization notes from the bootcamp! Feel free to refer back to this during our Astronomy module.

## Loading modules

The main modules we'll use are $\texttt{numpy}$, $\texttt{pandas}$, and $\texttt{matplotlib}$. There are other plotting modules we could potentially use, but $\texttt{matplotlib}$ is the most common.

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# this line is so I can show mathematical symbols
plt.rcParams.update({"text.usetex": True,  "axes.formatter.use_mathtext": True}) 
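
The $\texttt{text.usetex}$ setting above only works if LaTeX is installed on your machine. As a hedged sketch (the check is a rough heuristic, and the variable name $\texttt{latex\_found}$ is just for illustration), you could switch it on only when a $\texttt{latex}$ executable is found and otherwise let $\texttt{matplotlib}$ use its built-in math rendering:

```python
import shutil

import matplotlib.pyplot as plt

# Rough heuristic (an assumption, not a full check): is a `latex` executable on the PATH?
latex_found = shutil.which('latex') is not None

# Use LaTeX for text only when it appears to be available;
# matplotlib's built-in mathtext still renders $...$ expressions either way.
plt.rcParams.update({'text.usetex': latex_found,
                     'axes.formatter.use_mathtext': True})

print('Rendering text with LaTeX:', plt.rcParams['text.usetex'])
```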

## Reading in data

In order to visualize any data, we need to read it into our code! Most data is in some kind of tabular format, like what you would see in Excel or Google Sheets. It is typically 1D (a list) or 2D (a sheet, with columns and rows), though if you have time series data, you might have a 3D or higher-dimensional data set. Let's just focus on the 1D and 2D cases for now though.
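
To make the 1D versus 2D distinction concrete, here is a tiny sketch with made-up placeholder values (the names $\texttt{one\_d}$ and $\texttt{two\_d}$ are just for illustration). The $\texttt{ndim}$ and $\texttt{shape}$ attributes tell you how many dimensions an array has and how long each one is:

```python
import numpy as np

# A 1D data set: a simple list of labels (placeholder values).
one_d = np.array(['a', 'b', 'c'])

# A 2D data set: three rows and two columns, like a small spreadsheet.
two_d = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

print(one_d.ndim, one_d.shape)  # 1 (3,)
print(two_d.ndim, two_d.shape)  # 2 (3, 2)
```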

### $\color{green}{\texttt{numpy}}$
To read data with $\texttt{numpy}$, it helps to have it in what's known as a "comma-separated values" or CSV format. Your files will have the ".csv" extension if they are in this format, and you can save Google Sheets or Excel files in this way.

Let's read in a 1D file, a list of galaxies:

In [None]:
galaxies = np.loadtxt('galaxies.csv', delimiter=',', dtype='str')
print(galaxies)

We've called the $\texttt{loadtxt}$ function from the $\texttt{numpy}$ library and specified the file path, the delimiter (the character separating the values), and the type of data in this file (in this case, text strings). By default, $\texttt{loadtxt}$ assumes that the values are separated by whitespace and that the data are floats (numbers), so we'll be specific here to a) build good habits in reading in data and b) make sure we don't throw any errors.
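
Here is a small sketch of how those two arguments change what $\texttt{loadtxt}$ does. It reads from an in-memory string instead of a real file, and the numbers are placeholders, so you can run it without any data on disk:

```python
import io

import numpy as np

# Hypothetical stand-in for a small CSV file, kept in memory so the example runs anywhere.
csv_text = "1.0,2.0,3.0\n4.0,5.0,6.0\n"

# With delimiter=',' each line is split on commas and parsed as floats (the default dtype).
values = np.loadtxt(io.StringIO(csv_text), delimiter=',')
print(values.shape, values.dtype)

# Asking for strings instead keeps every entry as text.
as_text = np.loadtxt(io.StringIO(csv_text), delimiter=',', dtype=str)
print(as_text[0, 0])

# Without a delimiter, loadtxt splits on whitespace, sees "1.0,2.0,3.0" as one value,
# and cannot convert it to a float.
try:
    np.loadtxt(io.StringIO(csv_text))
    default_failed = False
except ValueError:
    default_failed = True
print('Default settings failed on comma-separated data:', default_failed)
```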

But we can't really do much visualization with this 1D set of names. What if we now introduce a numerical amount for our ecological study of galaxy populations, and read in a 2D table:

In [None]:
# one (name, format) pair per column: text for Name and Type, floats for the rest
galaxy_inventory = np.loadtxt('galaxy_inventory.csv', delimiter=',',
                              dtype={'names': ('Name', 'Mass', 'Distance', 'Size', 'Luminosity', 'Type'),
                                     'formats': ('U15', float, float, float, float, 'U15')})

print(galaxy_inventory)

We had to specify the different data types in order to make sure the names and numbers were read in correctly (something we wouldn't have to do in $\texttt{pandas}$!), but now we have some quantitative data we can plot. If you want to explore the data and pull out specific information, you can call it by index number. Note that $\color{green}{\texttt{python}}$ is a zero-indexed language, meaning it starts counting at 0 instead of 1. So if we wanted to look at the first row of data in this table, we would type:

In [None]:
galaxy_inventory[0]

If we wanted to look at a single column, we would index by its name instead:

In [None]:
galaxy_inventory['Name']

In [None]:
galaxy_inventory['Mass']
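
Row and column access can be tried out on a tiny in-memory table. This is only a sketch: the three rows are made-up placeholders (not real measurements), and the table has fewer columns than $\texttt{galaxy\_inventory.csv}$ to keep it short.

```python
import io

import numpy as np

# Made-up placeholder rows standing in for a CSV file.
csv_text = "GalaxyA,1.0,10.0\nGalaxyB,2.0,20.0\nGalaxyC,3.0,30.0\n"

table = np.loadtxt(
    io.StringIO(csv_text),
    delimiter=',',
    dtype={'names': ('Name', 'Mass', 'Distance'),
           'formats': ('U15', float, float)},
)

first_row = table[0]     # index 0 is the first row
last_row = table[-1]     # negative indices count from the end
masses = table['Mass']   # a whole column, selected by its name
second_name = table['Name'][1]  # column first, then row: the second galaxy's name

print(first_row)
print(last_row)
print(masses)
print(second_name)
```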

### $\color{green}{\texttt{pandas}}$

Using $\texttt{pandas}$ is much better, in my opinion. It can figure out what your data looks like with much less input, and it has additional functionality that makes reading and interpreting data a breeze compared to $\texttt{numpy}$. Plus, its table display is much easier on the eye. It similarly likes CSV files, but it can read other data formats just as easily. $\texttt{pandas}$ creates objects known as "dataframes" as opposed to arrays. There are several ways you can create dataframes, either from your data files or from scratch within $\texttt{python}$. Let's first load in that 2D galaxies table:

In [None]:
galaxy_inventory_pd = pd.read_csv('galaxy_inventory.csv', names=['Name', 'Mass', 'Distance', 'Size', 'Luminosity', 'Type'])
galaxy_inventory_pd

Look at that! We didn't even have to specify the delimiter or the data types: $\texttt{read\_csv}$ assumes commas by default and infers each column's type on its own. We can even convert $\texttt{numpy}$ tables into $\texttt{pandas}$ tables pretty easily:

In [None]:
pd.DataFrame(galaxy_inventory)
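
If you want to check what $\texttt{pandas}$ inferred, the $\texttt{dtypes}$ attribute lists the type of each column. The sketch below uses a made-up two-row table held in memory (placeholder values, fewer columns than the real file) and also shows that a $\texttt{numpy}$ structured array keeps its column names when converted:

```python
import io

import numpy as np
import pandas as pd

# Made-up placeholder rows standing in for a CSV file with no header line.
csv_text = "GalaxyA,1.0,10.0\nGalaxyB,2.0,20.0\n"
names = ['Name', 'Mass', 'Distance']

# pandas splits on commas by default and infers the column types.
df = pd.read_csv(io.StringIO(csv_text), names=names)
print(df.dtypes)

# The same data read with numpy needs explicit formats for each column...
structured = np.loadtxt(
    io.StringIO(csv_text),
    delimiter=',',
    dtype={'names': tuple(names), 'formats': ('U15', float, float)},
)

# ...but the field names carry over as column names in the dataframe.
df_from_numpy = pd.DataFrame(structured)
print(list(df_from_numpy.columns))
```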

## Plotting

### Scatter plots

Scatter plots are good for visualizing data that might 'cluster' or follow a trend.

In [None]:
plt.scatter(galaxy_inventory['Mass'], galaxy_inventory['Luminosity'])

# raw strings (r'...') keep the backslash in \odot from being read as an escape
plt.xlabel(r'Mass (M$_{\odot}$)')
plt.ylabel(r'Luminosity (L$_{\odot}$)')
plt.title('Galactic Mass-Luminosity Relationship')
plt.savefig('mass_luminosity.png', transparent=False, bbox_inches='tight', dpi=300)
plt.show()
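
The same kind of plot can be made straight from a $\texttt{pandas}$ dataframe. The sketch below is self-contained, so it uses made-up placeholder values rather than the galaxy table, and it draws through an explicit figure and axes object, which is one common alternative to the $\texttt{plt.}$ calls above. Saving into an in-memory buffer is only there so the example does not write a file; in practice you would pass a filename as before.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder values, only here so the example runs on its own.
df = pd.DataFrame({'Mass': [1.0, 2.0, 3.0, 4.0],
                   'Luminosity': [1.5, 2.5, 3.5, 5.0]})

fig, ax = plt.subplots()
ax.scatter(df['Mass'], df['Luminosity'])

# Raw strings keep the LaTeX backslashes intact.
ax.set_xlabel(r'Mass (M$_{\odot}$)')
ax.set_ylabel(r'Luminosity (L$_{\odot}$)')
ax.set_title('Galactic Mass-Luminosity Relationship')

# Save to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format='png', bbox_inches='tight', dpi=150)
plt.close(fig)
```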