# Introduction to the Pandas Library

*pandas* is a Python library designed for data analysis. Like Excel, it can handle large tabular datasets, but with
the advantage that the data can be manipulated programmatically.
You can find the pandas documentation [here](https://pandas.pydata.org/docs/).


There is an [introductory video available](https://youtu.be/_T8LGqJtuGc) that tries to teach the basics of pandas in just 10 minutes!

## Prerequisites
- variables and data types
- libraries (not sure if this is needed)
- Boolean operators
- print
- f-strings

## Learning Outcomes
- Read and write files
- Understand what a DataFrame is
- Check that files are imported correctly
- Select a subset of a DataFrame
- Add new columns to a DataFrame
- Calculate summary statistics


The community standard alias for the pandas package is *pd*, which is assumed in the pandas documentation and in a lot of code you may see online.

In [1]:
import pandas as pd

## Reading files

In pandas, it is useful to read data into a [**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame),
which is similar to an Excel spreadsheet:

![Pandas DataFrame](DataFrame.png)

There are many ways to read data into pandas depending on the file type, but for regular delimited files,
 the function [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) can be used.

In [2]:
data = pd.read_csv("periodic_table.csv")
data

Unnamed: 0,AtomicNumber,Symbol,Name,AtomicMass,CPKHexColor,ElectronConfiguration,Electronegativity,AtomicRadius,IonizationEnergy,ElectronAffinity,OxidationStates,StandardState,MeltingPoint,BoilingPoint,Density,GroupBlock,YearDiscovered
0,1,H,Hydrogen,1.008000,FFFFFF,1s1,2.20,120.0,13.598,0.754,"+1, -1",Gas,13.81,20.28,0.000090,Nonmetal,1766
1,2,He,Helium,4.002600,D9FFFF,1s2,,140.0,24.587,,0,Gas,0.95,4.22,0.000179,Noble gas,1868
2,3,Li,Lithium,7.000000,CC80FF,[He]2s1,0.98,182.0,5.392,0.618,+1,Solid,453.65,1615.00,0.534000,Alkali metal,1817
3,4,Be,Beryllium,9.012183,C2FF00,[He]2s2,1.57,153.0,9.323,,+2,Solid,1560.00,2744.00,1.850000,Alkaline earth metal,1798
4,5,B,Boron,10.810000,FFB5B5,[He]2s2 2p1,2.04,192.0,8.298,0.277,+3,Solid,2348.00,4273.00,2.370000,Metalloid,1808
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113,114,Fl,Flerovium,290.192000,,[Rn]7s2 7p2 5f14 6d10 (predicted),,,,,"6, 4,2, 1, 0",Expected to be a Solid,,,,Post-transition metal,1998
114,115,Mc,Moscovium,290.196000,,[Rn]7s2 7p3 5f14 6d10 (predicted),,,,,"3, 1",Expected to be a Solid,,,,Post-transition metal,2003
115,116,Lv,Livermorium,293.205000,,[Rn]7s2 7p4 5f14 6d10 (predicted),,,,,"+4, +2, -2",Expected to be a Solid,,,,Post-transition metal,2000
116,117,Ts,Tennessine,294.211000,,[Rn]7s2 7p5 5f14 6d10 (predicted),,,,,"+5, +3, +1, -1",Expected to be a Solid,,,,Halogen,2010


> This function assumes the data is comma-separated. For other separators, you can specify the separator using the `delimiter` parameter. If the separator is not a
> regular character (e.g. a tab or multiple spaces), an internet search should tell you what string to use. For example, for a *tab*-separated file:
>
> ```data_tab = pd.read_csv("**need to get a file**", delimiter="\t")```
>
> There are other parameters available, to specify the headers, the datatype etc. See [the documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for full details.
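
The note above can be tried without a file on disk. The following is a minimal sketch, assuming two small made-up tables held in strings (`tab_text` and `space_text` are hypothetical names, not files from this tutorial). It reads one tab-separated table and one table separated by runs of spaces, using the regular expression `\s+` for the latter.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for real files on disk.
tab_text = "Symbol\tName\tAtomicNumber\nH\tHydrogen\t1\nHe\tHelium\t2\n"
space_text = (
    "Symbol   Name      AtomicNumber\n"
    "H        Hydrogen  1\n"
    "He       Helium    2\n"
)

# Tab-separated data: pass the tab character as the delimiter.
# (`sep` is another name for the same parameter.)
data_tab = pd.read_csv(io.StringIO(tab_text), delimiter="\t")

# Columns separated by one or more whitespace characters:
# use the regular expression \s+ as the delimiter.
data_space = pd.read_csv(io.StringIO(space_text), delimiter=r"\s+")

print(data_tab)
print(data_space)
```

Both calls should produce the same three columns; only the delimiter string differs.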


### Viewing the data

Now that we have imported the data, it is important to view it to fully understand how it is formatted and to ensure we imported it correctly. As you
may have noticed, when we display the DataFrame, only some of the rows are shown. This is because only the first and last 5 rows are shown
by default. There are methods and attributes we can use to display specific parts of the DataFrame:

- `data.head()` shows rows from the top of the file
- `data.tail()` shows rows from the bottom of the file
- `data.columns` shows the column names (header)

If a number is given to `head` or `tail`, that many rows are displayed; with no number, 5 rows are shown.
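
As a quick illustration of these three tools, here is a minimal sketch that assumes a small made-up table of the first ten elements (the name `small` is hypothetical and is not the `data` DataFrame loaded above).

```python
import pandas as pd

# Hypothetical miniature table: the first ten elements only.
small = pd.DataFrame(
    {
        "AtomicNumber": list(range(1, 11)),
        "Symbol": ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"],
    }
)

first_five = small.head()           # no argument: the first 5 rows
last_three = small.tail(3)          # the last 3 rows
column_names = list(small.columns)  # the column names as a plain list

print(first_five)
print(last_three)
print(column_names)
```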

It can also be useful to check how pandas *interpreted* the data, and then change it if necessary. The data type can be checked using `.dtypes` and
it can be changed using `.astype()`.

To display the data type of every column, we can use `.dtypes` on the whole DataFrame:

In [3]:
data.dtypes

AtomicNumber               int64
Symbol                    object
Name                      object
AtomicMass               float64
CPKHexColor               object
ElectronConfiguration     object
Electronegativity        float64
AtomicRadius             float64
IonizationEnergy         float64
ElectronAffinity         float64
OxidationStates           object
StandardState             object
MeltingPoint             float64
BoilingPoint             float64
Density                  float64
GroupBlock                object
YearDiscovered            object
dtype: object

Or we can instead check only one column using `.dtype`:

In [4]:
data["AtomicNumber"].dtype

dtype('int64')

`.astype()` returns a new column rather than modifying the existing one, so to change the data type we need to reassign that column. For example, to change the "Name" data to a string:

In [5]:
print(f'Data type before change: {data["Name"].dtype}')
data["Name"] = data["Name"].astype("string")
print(f'Data type after change: {data["Name"].dtype}')

Data type before change: object
Data type after change: string
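
In the `dtypes` output above, `YearDiscovered` is listed as `object` rather than as a number. The table contains the text value `Ancient` in that column, and a column with such a value cannot be converted with `.astype(int)` without raising an error. The sketch below is an assumption about how one might handle this, not part of the tutorial's method: it swaps in `pd.to_numeric` with `errors="coerce"` on a small made-up Series (`years` is a hypothetical name).

```python
import pandas as pd

# Hypothetical stand-in for a few values of the "YearDiscovered" column.
years = pd.Series(["1766", "1868", "Ancient"], name="YearDiscovered")

# Values that cannot be parsed as numbers become NaN instead of raising an error.
years_numeric = pd.to_numeric(years, errors="coerce")

print(years_numeric)
print(f"Data type after conversion: {years_numeric.dtype}")
print(f"Number of values that could not be converted: {years_numeric.isna().sum()}")
```

Because missing values are stored as NaN, the converted column is a floating-point column rather than an integer one.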


## Exercise

Display the first 8 elements.

In [6]:
# Add your answer here

In [7]:
# Answer
data.head(8)

Unnamed: 0,AtomicNumber,Symbol,Name,AtomicMass,CPKHexColor,ElectronConfiguration,Electronegativity,AtomicRadius,IonizationEnergy,ElectronAffinity,OxidationStates,StandardState,MeltingPoint,BoilingPoint,Density,GroupBlock,YearDiscovered
0,1,H,Hydrogen,1.008,FFFFFF,1s1,2.2,120.0,13.598,0.754,"+1, -1",Gas,13.81,20.28,9e-05,Nonmetal,1766
1,2,He,Helium,4.0026,D9FFFF,1s2,,140.0,24.587,,0,Gas,0.95,4.22,0.000179,Noble gas,1868
2,3,Li,Lithium,7.0,CC80FF,[He]2s1,0.98,182.0,5.392,0.618,+1,Solid,453.65,1615.0,0.534,Alkali metal,1817
3,4,Be,Beryllium,9.012183,C2FF00,[He]2s2,1.57,153.0,9.323,,+2,Solid,1560.0,2744.0,1.85,Alkaline earth metal,1798
4,5,B,Boron,10.81,FFB5B5,[He]2s2 2p1,2.04,192.0,8.298,0.277,+3,Solid,2348.0,4273.0,2.37,Metalloid,1808
5,6,C,Carbon,12.011,909090,[He]2s2 2p2,2.55,170.0,11.26,1.263,"+4, +2, -4",Solid,3823.0,4098.0,2.267,Nonmetal,Ancient
6,7,N,Nitrogen,14.007,3050F8,[He] 2s2 2p3,3.04,155.0,14.534,,"+5, +4, +3, +2, +1, -1, -2, -3",Gas,63.15,77.36,0.001251,Nonmetal,1772
7,8,O,Oxygen,15.999,FF0D0D,[He]2s2 2p4,3.44,152.0,13.618,1.461,-2,Gas,54.36,90.2,0.001429,Nonmetal,1774


Which element has atomic number 110? Hint: The table has 118 elements in it.

In [8]:
# Add your answer here

In [9]:
# Answer
data.tail(9)

# The element with an atomic number of 110 is Darmstadtium.

Unnamed: 0,AtomicNumber,Symbol,Name,AtomicMass,CPKHexColor,ElectronConfiguration,Electronegativity,AtomicRadius,IonizationEnergy,ElectronAffinity,OxidationStates,StandardState,MeltingPoint,BoilingPoint,Density,GroupBlock,YearDiscovered
109,110,Ds,Darmstadtium,282.166,,[Rn]7s2 5f14 6d8 (predicted),,,,,"8, 6, 4, 2, 0",Expected to be a Solid,,,,Transition metal,1994
110,111,Rg,Roentgenium,282.169,,[Rn]7s2 5f14 6d9 (predicted),,,,,"5, 3, 1, -1",Expected to be a Solid,,,,Transition metal,1994
111,112,Cn,Copernicium,286.179,,[Rn]7s2 5f14 6d10 (predicted),,,,,"2, 1, 0",Expected to be a Solid,,,,Transition metal,1996
112,113,Nh,Nihonium,286.182,,[Rn]5f14 6d10 7s2 7p1 (predicted),,,,,,Expected to be a Solid,,,,Post-transition metal,2004
113,114,Fl,Flerovium,290.192,,[Rn]7s2 7p2 5f14 6d10 (predicted),,,,,"6, 4,2, 1, 0",Expected to be a Solid,,,,Post-transition metal,1998
114,115,Mc,Moscovium,290.196,,[Rn]7s2 7p3 5f14 6d10 (predicted),,,,,"3, 1",Expected to be a Solid,,,,Post-transition metal,2003
115,116,Lv,Livermorium,293.205,,[Rn]7s2 7p4 5f14 6d10 (predicted),,,,,"+4, +2, -2",Expected to be a Solid,,,,Post-transition metal,2000
116,117,Ts,Tennessine,294.211,,[Rn]7s2 7p5 5f14 6d10 (predicted),,,,,"+5, +3, +1, -1",Expected to be a Solid,,,,Halogen,2010
117,118,Og,Oganesson,295.216,,[Rn]7s2 7p6 5f14 6d10 (predicted),,,,,"+6, +4, +2, +1, 0, -1",Expected to be a Gas,,,,Noble gas,2006


Change the "Symbol" data to strings. Check the data type of the column afterwards.

In [10]:
# Add your answer here

In [11]:
# Answer
data["Symbol"] = data["Symbol"].astype("string")
print(f'Data type after change: {data["Symbol"].dtype}')

Data type after change: string


## Writing files

As with reading files, there are many ways to write data to a file depending on the file type wanted, but for regular delimited files,
 the function [`to_csv`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) can be used.

As DataFrames have an index column, we have to decide if we want to keep this or not. We can do this using the `index` parameter. To **NOT**
include the index column, use `index=False`.

In [12]:
data.to_csv("periodic_table_out.csv", index=False)

> As with reading files, we can specify the separator used when writing the data with the `sep` parameter. There are many other useful parameters for
> specifying what data to save and how to save it. See [the documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) for more information.
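
To see what the `index` and `sep` parameters change, the following minimal sketch writes a small made-up table to text instead of to a file (`small` is a hypothetical DataFrame; calling `to_csv` without a file path returns the text that would have been written).

```python
import io

import pandas as pd

# Hypothetical two-row table standing in for the full periodic table.
small = pd.DataFrame({"Symbol": ["H", "He"], "AtomicNumber": [1, 2]})

# With no file path, to_csv returns the text it would have written.
with_index = small.to_csv()                          # index column included
without_index = small.to_csv(index=False)            # index column left out
tab_separated = small.to_csv(index=False, sep="\t")  # tabs instead of commas

print(with_index)
print(without_index)
print(tab_separated)

# Reading the text back shows the effect of keeping the index:
# its unnamed header is read in as an extra column called "Unnamed: 0".
reread_with_index = pd.read_csv(io.StringIO(with_index))
reread_without_index = pd.read_csv(io.StringIO(without_index))
print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```

If the file is going to be read back into pandas, `index=False` avoids picking up that extra column.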

# To Do
- select a subset of a df
- create new columns
- calculate statistics