# <p style="background-color: #f5df18; padding: 10px;">Programming & Plotting in Python | **Reading Tabular Data into DataFrames** </p>



<div style="display: flex;">
    <div style="flex: 1; margin-right: 20px;">
        <h2>Questions</h2>
        <ul>
            <li>How can I read tabular data?</li>
        </ul>
    </div>
    <div style="flex: 1;">
        <h2>Learning Objectives</h2>
        <ul>
            <li>Import the Pandas library.</li>
            <li>Use Pandas to load a simple CSV data set.</li>
            <li>Get some basic information about a Pandas DataFrame.</li>
        </ul>
    </div>
</div>


## Use the Pandas library to do statistics on tabular data.
---


- [Pandas](https://pandas.pydata.org/) is a widely used Python library for statistics, particularly on tabular data.


- Pandas uses `DataFrames`, which are 2-dimensional tables whose columns have names
    and potentially have different data types.


- We load Pandas with `import pandas as pd`. The alias `pd` is commonly used to refer to the Pandas library in code.


- We read a Comma Separated Values (CSV) data file with `pd.read_csv`.
  - The argument is the name of the file to be read.
  - It returns a dataframe that you can assign to a variable.
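
As a minimal sketch of the points above, the example below imports Pandas and reads a small table with `pd.read_csv`. To keep it self-contained, it reads from an in-memory string instead of a file on disk; the variable names `csv_text` and `toy_catalog` and all of the values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the values are invented.
csv_text = """objid,ra,dec,class
1,150.1,2.2,GALAXY
2,150.3,2.5,STAR
3,151.0,2.9,QSO
"""

# read_csv accepts a file name or a file-like object and returns a DataFrame,
# which we assign to a variable so we can use it later.
toy_catalog = pd.read_csv(io.StringIO(csv_text))

print(type(toy_catalog))
print(toy_catalog)
```

With a real file, you would pass its path as a string in place of the `io.StringIO(...)` object.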


Let's read and print the contents of the `SDSS_2020.csv` file stored in the `data` directory.

In [None]:
# use pandas to load the file `SDSS_2020.csv`



In [None]:
# use the '.head()' method to view the first 5 rows


In [None]:
# use the built-in print function to display the dataframe


- The columns in a dataframe are the observed variables, and the rows are the observations.

- Pandas uses backslash `\` to show wrapped lines when output is too wide to fit the screen.

- Using descriptive dataframe names helps us distinguish between multiple dataframes so we won't accidentally overwrite a dataframe or read from the wrong one.
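
To illustrate rows as observations and columns as variables, here is a small sketch on a made-up table (the name `toy_catalog` and its values are assumptions for illustration, not part of the SDSS file). `DataFrame.shape` reports the number of rows and columns, and `DataFrame.head()` returns the first five rows by default.

```python
import io

import pandas as pd

# Hypothetical table with 7 observations (rows) of 3 variables (columns).
csv_text = """objid,class,redshift
1,GALAXY,0.12
2,STAR,0.00
3,QSO,1.85
4,GALAXY,0.08
5,STAR,0.00
6,GALAXY,0.21
7,QSO,2.40
"""
toy_catalog = pd.read_csv(io.StringIO(csv_text))

# shape is (number of rows, number of columns).
print(toy_catalog.shape)

# head() returns a new dataframe holding the first five rows.
first_rows = toy_catalog.head()
print(first_rows)
```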

## 🔔 File Not Found
---

Our lessons store their data files in a `data` sub-directory,
which is why the path to the file is `data/SDSS_2020.csv`.
If you forget to include `data/`,
or if you include it but your copy of the file is somewhere else,
you will get a [runtime error](04-built-in.md)
that ends with a line like this:

```error
FileNotFoundError: [Errno 2] No such file or directory: 'data/SDSS_2020.csv'
```
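
One way to diagnose this error is to check where Python is looking before reading the file. The sketch below uses `pathlib` from the standard library; the variable names `data_path` and `sdss` are chosen here for illustration, and the code assumes the notebook expects the file at `data/SDSS_2020.csv` relative to the current working directory.

```python
from pathlib import Path

import pandas as pd

# Path to the data file, relative to the current working directory.
data_path = Path("data") / "SDSS_2020.csv"

# Show where Python is running and whether the file is visible from there.
print("Working directory:", Path.cwd())
print("File exists:", data_path.exists())

try:
    sdss = pd.read_csv(data_path)
except FileNotFoundError as err:
    # The path is wrong or the file lives somewhere else.
    sdss = None
    print("Could not read the file:", err)
```

If `File exists` prints `False`, compare the working directory with the location of your copy of the file and adjust the path accordingly.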


## Use the `DataFrame.info()` method to find out more about a dataframe.
---


In [None]:
# use .info() to find out more about our dataframe



- This is a `DataFrame`.
- There are 500,000 rows, each representing a unique astronomical object (a star, galaxy, or quasar).
- There are 23 columns with various data types.
  - `Non-Null Count` shows how many values in each column are not missing.
- It uses 68.7 MB of memory.
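
To see how `Non-Null Count` behaves, the following sketch calls `.info()` on a small made-up table with two deliberately missing values (the name `toy_catalog` and the data are assumptions for illustration). It also uses `isna().sum()`, which counts the missing values per column, as a complementary check.

```python
import io

import pandas as pd

# Hypothetical table: row 2 has no redshift and row 4 has no class.
csv_text = """objid,class,redshift
1,GALAXY,0.12
2,STAR,
3,QSO,1.85
4,,0.03
"""
toy_catalog = pd.read_csv(io.StringIO(csv_text))

# info() prints the row count, column names, non-null counts,
# data types, and memory usage.
toy_catalog.info()

# isna().sum() counts the missing values in each column.
missing_per_column = toy_catalog.isna().sum()
print(missing_per_column)
```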

## The `DataFrame.columns` variable stores information about the dataframe's columns.
---

## 🔭 SDSS Data Column Descriptions

This dataset contains information about astronomical objects observed by the [Sloan Digital Sky Survey (SDSS)](https://en.wikipedia.org/wiki/Sloan_Digital_Sky_Survey), including their **sky positions (`ra`, `dec`), brightness in different filters (`u`, `g`, `r`, `i`, and `z`), and spectroscopic measurements such as `redshift` (a proxy for distance) and the object's classification, `class` (e.g., star, galaxy, or quasar).**


Below is a brief explanation of each column in the SDSS data table:

| **Column**      | **Description** |
|-----------------|-----------------|
| `objid`         | Unique identifier for the imaging object in the SDSS database. |
| `ra`            | Right Ascension (in degrees) — the celestial equivalent of longitude, indicating the object's position on the sky. |
| `dec`           | Declination (in degrees) — the celestial equivalent of latitude, indicating the object's position on the sky. |
| `u`, `g`, `r`, `i`, `z` | Magnitudes of the object in the five SDSS photometric bands: ultraviolet (`u`), green (`g`), red (`r`), near-infrared (`i`), and a longer-wavelength near-infrared band (`z`). These are **apparent magnitudes**, typically in the AB magnitude system. |
| `run`           | The specific imaging run number during which the object was observed. |
| `rerun`         | Version of the data processing pipeline used; rerun indicates reprocessed imaging data. |
| `camcol`        | Camera column — identifies which of the six columns of CCDs on the SDSS camera recorded the observation. |
| `field`         | The field number within a given run and camcol — helps locate the object in the SDSS imaging stripe. |
| `specobjid`     | Unique identifier for the spectroscopic observation of the object. |
| `class`         | The spectroscopic classification of the object (e.g., `'GALAXY'`, `'STAR'`, `'QSO'`). |
| `redshift`      | The redshift measured from the object's spectrum (conventionally written *z*, not to be confused with the `z` magnitude column); indicates its distance and recessional velocity. |
| `plate`         | Plate number of the spectroscopic observation — corresponds to a specific metal plate drilled with holes for fibers. |
| `mjd`           | Modified Julian Date on which the spectroscopic observation was made. |
| `fiberid`       | Fiber ID number — identifies which fiber on the plate was assigned to the object for spectroscopy. |

---


In [None]:
## We can access individual columns using the syntax df['column_name']
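
As a hedged illustration of the column access described above, the sketch below works on a small made-up table (`toy_catalog` and its values are assumptions, not the SDSS data). `DataFrame.columns` lists the column names, single brackets with one name return a `Series`, and a list of names returns a smaller `DataFrame`.

```python
import io

import pandas as pd

# Hypothetical table with a few of the SDSS-style column names.
csv_text = """objid,ra,dec,class
1,150.1,2.2,GALAXY
2,150.3,2.5,STAR
3,151.0,2.9,QSO
"""
toy_catalog = pd.read_csv(io.StringIO(csv_text))

# columns is an attribute (no parentheses) holding the column names.
print(toy_catalog.columns)

# One column name in brackets returns a Series.
classes = toy_catalog["class"]
print(classes)

# A list of column names returns a DataFrame with just those columns.
positions = toy_catalog[["ra", "dec"]]
print(positions)
```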



## Use `DataFrame.T` to transpose a dataframe.
---

- Sometimes we want to treat columns as rows and vice versa.
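- Transpose (written `.T`) doesn't modify the original dataframe; it gives the program a different view of the data (Pandas may copy the data internally when the columns have different types).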
- Like `columns`, it is a member variable.
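
A minimal sketch of transposing, using a made-up table of magnitudes (the name `toy_mags` and the values are assumptions for illustration). Note that `.T` is accessed without parentheses, and the original dataframe keeps its shape.

```python
import io

import pandas as pd

# Hypothetical table: 2 objects (rows) with magnitudes in 3 bands (columns).
csv_text = """u,g,r
19.5,18.1,17.4
18.2,17.0,16.5
"""
toy_mags = pd.read_csv(io.StringIO(csv_text))

# .T swaps rows and columns: the band names become the row labels.
transposed = toy_mags.T

print(toy_mags.shape)
print(transposed.shape)
print(transposed)
```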


## Use `DataFrame.describe()` to get summary statistics about data.
---

`DataFrame.describe()` gets the summary statistics of only the columns that have numerical data.
All other columns are ignored, unless you use the argument `include='all'`.
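
The difference between the default call and `include='all'` can be sketched on a small made-up table (`toy_catalog` and its values are assumptions for illustration). The default summary covers only the numeric columns, while `include='all'` adds the text column along with extra rows such as `unique`, `top`, and `freq`.

```python
import io

import pandas as pd

# Hypothetical table with one text column and two numeric columns.
csv_text = """class,redshift,r
GALAXY,0.12,17.4
STAR,0.0,15.2
QSO,1.85,19.1
GALAXY,0.08,16.8
"""
toy_catalog = pd.read_csv(io.StringIO(csv_text))

# Default: summary statistics for the numeric columns only.
numeric_summary = toy_catalog.describe()
print(numeric_summary)

# include='all': every column is summarized; statistics that do not
# apply to a column are shown as NaN.
full_summary = toy_catalog.describe(include="all")
print(full_summary)
```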

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Reading other data </p>

---

Read the data in `national-pokedex.csv`
(which should be in the same directory as `SDSS_2020.csv`)
into a variable called `national_dex`
and display its summary statistics.

In [None]:
### your answer here ###

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Inspecting data </p>

---

After reading the data for the National Pokédex,
use `help(national_dex.head)` and `help(national_dex.tail)`
to find out what `DataFrame.head` and `DataFrame.tail` do.

1. What method call will display the first three rows of this data?
2. What method call will display the last three columns of this data?
  (Hint: you may need to change your view of the data.)

In [None]:
### your answer here ###

In [None]:

### your answer here ###

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Reading Files in Other Directories </p>
---

Suppose you carried out an observing run with the CTIO Blanco 4m telescope, capturing optical images of the Small Magellanic Cloud (SMC). After the run, a table of measured brightnesses (photometry) was saved in a file called `smc_photometry.csv` inside a folder called `images`.
You are analyzing the data in a Jupyter notebook named `analyze_smc_data.ipynb` that lives in a sibling folder named `notebooks`:

```output
your_home_directory/
├── images/
│   └── smc_photometry.csv
└── notebooks/
    └── analyze_smc_data.ipynb
```

In this hypothetical scenario, which value(s) should you pass to `read_csv` to read `smc_photometry.csv` in `analyze_smc_data.ipynb`?

In [None]:
### your answer here ###

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Modifying and saving data </p>
---


1.) Pandas provides the `DataFrame.insert` method to add a new column to an existing dataframe. Use this method to insert a new column labeled `Total` that is the sum of the six individual stats (HP, Attack, Defense, and so on). You can use `help` to get information on how to use `insert`.

2.) In addition to the `read_csv` function for reading data from a file,
Pandas provides the `DataFrame.to_csv` method to write dataframes to files. Applying what you've learned about reading from files, write one of your dataframes to a file called `updated-national-dex.csv`. You can use `help` to get information on how to use `to_csv`.

In [None]:
### your answer here ###

# <p style="background-color: #f5df18; padding: 10px;"> 🗝️ Key points</p>
---

- Use the Pandas library to get basic statistics out of tabular data.
- Use `index_col` to specify that a column's values should be used as row headings.
- Use `DataFrame.info` to find out more about a dataframe.
- The `DataFrame.columns` variable stores information about the dataframe's columns.
- Use `DataFrame.T` to transpose a dataframe.
- Use `DataFrame.describe` to get summary statistics about data.