# Introduction to the Pandas Library

*pandas* is a Python library designed for data analysis. Like Excel, it handles tabular data, including large datasets, but with
the advantage that the data can be manipulated programmatically.
You can find the pandas documentation [here](https://pandas.pydata.org/docs/).


There is an [introductory video available](https://youtu.be/_T8LGqJtuGc) that aims to teach the basics of pandas in just 10 minutes!

### Prerequisites
- variables and data types
- libraries
- Boolean operators
- print
- f-strings

### Learning Outcomes
- Read and write files
- Understand what a DataFrame is
- Check files are imported correctly
- Select a subset of a DataFrame
- Add new columns to a DataFrame
- Calculate summary statistics


The community standard alias for the pandas package is *pd*, which is assumed in the pandas documentation and in a lot of code you may see online.

In [None]:
import pandas as pd

## Reading files

In pandas, it is useful to read data into a [**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame),
which is similar to an Excel spreadsheet:

![Pandas DataFrame](DataFrame.png)
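
To make the structure concrete, here is a minimal sketch that builds a small DataFrame by hand. The three-row table and the name `small_table` are made up for illustration (the column names mirror those used later in this notebook); nothing here is read from `periodic_table.csv`.

```python
import pandas as pd

# A hypothetical three-row table; each dictionary key becomes a column.
small_table = pd.DataFrame(
    {
        "AtomicNumber": [1, 2, 3],
        "Symbol": ["H", "He", "Li"],
        "Name": ["Hydrogen", "Helium", "Lithium"],
    }
)

print(small_table.shape)          # (number of rows, number of columns)
print(list(small_table.columns))  # column labels
print(list(small_table.index))    # row labels, numbered from 0 by default
```

Each column holds values of one kind, and each row is identified by its index label, much like the row numbers in a spreadsheet.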

There are many ways to read data into pandas depending on the file type, but for regular delimited files,
 the function [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) can be used.

In [None]:
data = pd.read_csv("periodic_table.csv")
data

> This function assumes the data is comma-separated. For other separators, pass the separator using the `delimiter` parameter (an alias for `sep`). If the separator is not a
> regular character (e.g. a tab or multiple spaces), an internet search should tell you what string to use. For example, for a *tab*-separated file:
>
> ```data_tab = pd.read_csv("**need to get a file**", delimiter="\t")```
>
> Other parameters are available to specify the headers, the data types, etc. See [the documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for full details.
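
As a minimal sketch of reading tab-separated data without needing a file on disk, the text below is wrapped in `io.StringIO`, which `read_csv` accepts in place of a filename. The contents of `tab_text` are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk.
tab_text = "AtomicNumber\tSymbol\tName\n1\tH\tHydrogen\n2\tHe\tHelium\n"

# read_csv accepts a file-like object as well as a filename.
data_tab = pd.read_csv(io.StringIO(tab_text), delimiter="\t")
print(data_tab)
```

With a real file, the first argument would simply be the filename, as in the `periodic_table.csv` example.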


### Viewing the data

Now that we have imported the data, it is important to view it to fully understand how it is formatted and to ensure we imported it correctly. As you
may have noticed, when we try to display the DataFrame, only some of the rows are displayed. This is because, for long DataFrames, only the first and last 5 rows are shown
by default. There are methods and attributes we can use to display specific parts of the DataFrame:

- `data.head()` shows rows from the top of the DataFrame (the first 5 by default)
- `data.tail()` shows rows from the bottom of the DataFrame (the last 5 by default)
- `data.columns` shows the column names (header)

If a number is given to `head` or `tail`, that many rows are displayed.
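
A short sketch of these three tools, using a hypothetical ten-row table (`numbers`) so that it is easy to see which rows come back:

```python
import pandas as pd

# Hypothetical ten-row table, used only to show which rows are returned.
numbers = pd.DataFrame(
    {"n": range(1, 11), "n_squared": [n**2 for n in range(1, 11)]}
)

print(numbers.head())   # no argument: the first 5 rows
print(numbers.tail(3))  # the last 3 rows
print(numbers.columns)  # the column names
```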

It can also be useful to check how pandas *interpreted* the data, and then change it if necessary. The data type can be checked using `.dtypes` and
it can be changed using `.astype()`.

To display the data type of every column, we can use the `dtypes` attribute on the whole DataFrame:

In [None]:
data.dtypes

Or we can instead check a single column, using `dtype` on that column:

In [None]:
data["AtomicNumber"].dtype

To change the data type, we need to reassign that column. For example, to change the "Name" column to the string data type:

In [None]:
print(f'Data type before change: {data["Name"].dtype}')
data["Name"] = data["Name"].astype("string")
print(f'Data type after change: {data["Name"].dtype}')
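
The same check-then-convert pattern applies to numeric columns. The sketch below uses a hypothetical table (`elements`) rather than the periodic table file, and converts an integer column to floating point as well as a text column to the string type.

```python
import pandas as pd

# Hypothetical table; the values are for illustration only.
elements = pd.DataFrame({"AtomicNumber": [1, 2, 3], "Symbol": ["H", "He", "Li"]})

print(elements.dtypes)  # data type of every column before the change

# astype returns a new column, so the result is assigned back.
elements["AtomicNumber"] = elements["AtomicNumber"].astype("float64")
elements["Symbol"] = elements["Symbol"].astype("string")

print(elements.dtypes)  # data type of every column after the change
```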

## Exercise

Display the first 8 elements.

In [None]:
# Add your answer here

In [None]:
# Answer
data.head(8)

What element has atomic number 110? Hint: The table has 118 elements in it.

In [None]:
# Add your answer here

In [None]:
# Answer
data.tail(9)

# The element with an atomic number of 110 is Darmstadtium.

Change the "Symbol" data to strings. Check the data type of the column afterwards.

In [None]:
# Add your answer here

In [None]:
# Answer
data["Symbol"] = data["Symbol"].astype("string")
print(f'Data type after change: {data["Symbol"].dtype}')

## Writing files

As with reading files, there are many ways to write data to a file depending on the file type wanted, but for regular delimited files,
 the function [`to_csv`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) can be used.

As DataFrames have an index column, we have to decide whether to keep it in the output. By default, `to_csv` writes the index as the first column; this is controlled by the `index` parameter. To **NOT**
include the index column, use `index=False`.

In [None]:
data.to_csv("periodic_table_out.csv", index=False)

> As with reading files, we can specify which separator to use when writing the data with the `sep` parameter. There are many other useful parameters for
> specifying what data to save and how to save it. See [the documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) for more information.
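
To see the effect of `index` and `sep` without creating files, the sketch below relies on the fact that `to_csv` returns the text as a string when no filename is given. The table `elements` is hypothetical.

```python
import io

import pandas as pd

# Hypothetical table; the values are for illustration only.
elements = pd.DataFrame({"AtomicNumber": [1, 2], "Symbol": ["H", "He"]})

# With no filename, to_csv returns the text it would have written.
with_index = elements.to_csv()
without_index = elements.to_csv(index=False, sep="\t")

print(with_index)     # the header starts with an empty name for the index column
print(without_index)  # tab-separated, no index column

# Reading the tab-separated text back in, using the same separator.
round_trip = pd.read_csv(io.StringIO(without_index), sep="\t")
print(round_trip)
```

Note that the separator used when reading has to match the one used when writing.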

# To Do
- select a subset of a df
- create new columns
- calculate statistics