# Introduction to Pandas

Pandas is the most popular Python library for working with tabular data. You can think of it as a more powerful, programmable version of Excel.

Pandas lets us work with the less structured data that arrives in many forms. Compared with plain NumPy arrays, it adds flexibility (e.g., attaching labels to data, handling missing data) and brings additional operations that do not map well to element-wise broadcasting (e.g., groupings and pivots).

Pandas Series and DataFrame objects build on the NumPy array structure and provide efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.

> ✏️ The example is inspired by {cite}`tomasbeuzen` and {cite}`vanderplas_2017`.

Let's import pandas with the alias pd, along with NumPy as np, and check the installed pandas version:

In [1]:
import pandas as pd
import numpy as np

pd.__version__

'1.5.0'

## Pandas Series

A Series is like a NumPy array but with labels. A Series is strictly one-dimensional and can hold any data type (integers, strings, floats, objects, etc.), including a mix of them.
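
As a small sketch of what "a mix of them" means in practice, the snippet below compares the dtype of a purely numeric Series with one that mixes types. The variable names are illustrative and not part of the original example.

```python
import pandas as pd

# A Series holding only integers gets an integer dtype.
numbers = pd.Series(data=[1, 2, 3])
print(numbers.dtype)

# Mixing integers, strings, and floats falls back to the generic "object" dtype.
mixed = pd.Series(data=[1, "two", 3.0])
print(mixed.dtype)
```

Object-dtype Series are flexible, but they typically lose the speed of NumPy's numeric arrays, so a mix of types is best kept for cases that really need it.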

### Creating Series

By default, Series are labeled with indices starting from 0. For example:

In [2]:
pd.Series(data = [-5, 1.3, 21, 6, 3])

0    -5.0
1     1.3
2    21.0
3     6.0
4     3.0
dtype: float64

However, you can add a custom index:

In [3]:
pd.Series(data = [-5, 1.3, 21, 6, 3], index = ['a', 'b', 'c', 'd', 'e'])

a    -5.0
b     1.3
c    21.0
d     6.0
e     3.0
dtype: float64

You can create a Series from a dictionary:

In [4]:
pd.Series(data = {'a': 10, 'b': 20, 'c': 30})

a    10
b    20
c    30
dtype: int64
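
When a dictionary is combined with an explicit index, pandas keeps only the requested keys and fills labels that have no matching key with NaN. The sketch below illustrates this; the names `prices` and `subset` are made up for the example.

```python
import pandas as pd

prices = {"a": 10, "b": 20, "c": 30}

# Only the keys listed in `index` are kept, in that order.
# "z" is not a key of the dictionary, so its value becomes NaN,
# which also forces the dtype from int64 to float64.
subset = pd.Series(data=prices, index=["a", "c", "z"])
print(subset)
```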

Or from an ndarray:

In [5]:
pd.Series(data = np.random.randn(3))

0    1.721273
1   -1.010299
2    0.237278
dtype: float64
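
Because the values above are random, they will differ on every run. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator is acceptable in place of `np.random.randn` (the seed value and the names are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same draws on every run.
rng = np.random.default_rng(seed=42)
s_random = pd.Series(data=rng.standard_normal(3))

# Re-creating the generator with the same seed reproduces the same values.
rng_again = np.random.default_rng(seed=42)
s_again = pd.Series(data=rng_again.standard_normal(3))
print(s_random.equals(s_again))
```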

### Series Characteristics

You can access the index labels of your Series using the _.index_ attribute:

In [6]:
s = pd.Series(data = np.random.randn(3))
s.index

RangeIndex(start=0, stop=3, step=1)

You can access the underlying data array using .to_numpy():

In [7]:
s.to_numpy()

array([0.3807332 , 0.40186111, 0.54178968])
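
Besides the index and the underlying array, a Series exposes a few other descriptive attributes. The sketch below shows some common ones on a small, made-up Series (`s_demo` is an illustrative name):

```python
import numpy as np
import pandas as pd

s_demo = pd.Series(data=[0.5, 1.5, 2.5], index=["x", "y", "z"], name="demo")

print(s_demo.dtype)  # data type of the values
print(s_demo.shape)  # always a one-element tuple, since a Series is 1-D
print(s_demo.size)   # number of elements
print(s_demo.name)   # optional name attached to the Series

# to_numpy() drops the labels and returns a plain ndarray.
values = s_demo.to_numpy()
print(type(values))
```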

### Indexing and Slicing Series

Series behave much like ndarrays when indexed and sliced by position (in fact, Series can be passed to most NumPy functions):

In [8]:
s = pd.Series(data = range(5), index = ['A', 'B', 'C', 'D', 'A'])
s

A    0
B    1
C    2
D    3
A    4
dtype: int64

In [9]:
s[0]

0

In [10]:
s[0:3]

A    0
B    1
C    2
dtype: int64
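
Plain square brackets with integers, as in the cells above, rely on pandas treating the integers as positions because the index holds strings. Newer pandas releases have been moving away from that fallback, so it may be safer to state the intent explicitly. A minimal sketch using .iloc (position-based) and .loc (label-based) on an illustrative Series with unique labels:

```python
import pandas as pd

letters = pd.Series(data=range(4), index=["A", "B", "C", "D"])

# .iloc always works by integer position.
first = letters.iloc[0]
by_position = letters.iloc[0:3]   # positions 0, 1, 2 (end excluded)

# .loc always works by label; label slices include the end label.
by_label = letters.loc["A":"C"]   # labels A, B, and C (end included)

print(first)
print(by_position.equals(by_label))
```

Note the difference in slice semantics: positional slices exclude the stop position, while label slices include the stop label.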

Series are also like dictionaries in that we can access values using index labels:

In [11]:
s[["B", "D", "C"]]

B    1
D    3
C    2
dtype: int64

In [12]:
"A" in s

True

Series do allow non-unique index labels, but be careful: selecting a duplicated label returns every matching entry (a Series) rather than a single value:

In [13]:
s["A"]

A    0
A    4
dtype: int64
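
If duplicated labels might be present, it can help to check for them before indexing. A short sketch, rebuilding the same Series under the illustrative name `s_dup`:

```python
import pandas as pd

s_dup = pd.Series(data=range(5), index=["A", "B", "C", "D", "A"])

# is_unique reports whether every label occurs exactly once.
print(s_dup.index.is_unique)

# duplicated(keep=False) flags every occurrence of a repeated label.
dup_mask = s_dup.index.duplicated(keep=False)
print(s_dup[dup_mask])

# A unique label returns a scalar; a repeated label returns a Series.
print(type(s_dup["B"]))
print(type(s_dup["A"]))
```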

Finally, we can also do boolean indexing with Series:

In [14]:
s[s >= 1]

B    1
C    2
D    3
A    4
dtype: int64
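
Boolean conditions can also be combined. The sketch below uses the element-wise operators & (and), | (or), and ~ (not); each condition needs its own parentheses because these operators bind more tightly than comparisons. The Series `s_bool` is an illustrative stand-in.

```python
import pandas as pd

s_bool = pd.Series(data=range(5), index=["A", "B", "C", "D", "E"])

middle = s_bool[(s_bool >= 1) & (s_bool < 4)]   # both conditions hold
edges = s_bool[(s_bool < 1) | (s_bool > 3)]     # either condition holds
not_two = s_bool[~(s_bool == 2)]                # condition is negated

print(middle)
print(edges)
print(not_two)
```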

### Series Operations

Unlike ndarrays, operations between Series (+, -, /, *) align values based on their **LABELS** (not their position in the structure). The resulting index will be the sorted union of the two indexes, and any label that appears in only one of the Series gets NaN in the result. This gives you the flexibility to run operations on Series whose labels do not match exactly.

In [15]:
s1 = pd.Series(data = range(4), index = ["A", "B", "C", "D"])
s1

A    0
B    1
C    2
D    3
dtype: int64

In [16]:
s2 = pd.Series(data = range(10, 14), index = ["B", "C", "D", "E"])
s2

B    10
C    11
D    12
E    13
dtype: int64

In [17]:
s1 + s2

A     NaN
B    11.0
C    13.0
D    15.0
E     NaN
dtype: float64
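
The NaN values appear because "A" and "E" each exist in only one of the two Series. If that is not the desired behavior, one option is the add method with fill_value, which substitutes a default for the missing side; another is to drop the unmatched labels afterwards. A sketch of both, reusing s1 and s2 as defined above:

```python
import pandas as pd

s1 = pd.Series(data=range(4), index=["A", "B", "C", "D"])
s2 = pd.Series(data=range(10, 14), index=["B", "C", "D", "E"])

# Treat a label missing from one Series as 0 instead of producing NaN.
total = s1.add(s2, fill_value=0)
print(total)

# Alternatively, keep only the labels present in both Series.
overlap = (s1 + s2).dropna()
print(overlap)
```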

We can also perform standard operations on a Series, like multiplying or squaring. NumPy also accepts a Series as an argument to most functions because Series are built on top of NumPy arrays:

In [18]:
s1 ** 2

A    0
B    1
C    4
D    9
dtype: int64

In [19]:
np.exp(s1)

A     1.000000
B     2.718282
C     7.389056
D    20.085537
dtype: float64

Finally, like arrays, Series have many built-in methods for various operations:

In [20]:
s1.mean()

1.5
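
A few more of these built-in methods, sketched on the same s1 as above. This is only a sample; the selection here is an illustrative choice, not a complete list.

```python
import pandas as pd

s1 = pd.Series(data=range(4), index=["A", "B", "C", "D"])

print(s1.sum())       # sum of all values
print(s1.max())       # largest value
print(s1.idxmax())    # label of the largest value
print(s1.describe())  # count, mean, std, min, quartiles, max in one call
```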

In [21]:
# TODO - DFs - create them, reading and writing them, index and slice them, operations, merging and joining

## Exercises

In this practice exercise, we'll investigate different foods' carbon footprints. We'll leverage a dataset compiled by [Kasia Kulma](https://r-tastic.co.uk/post/from-messy-to-tidy/) and contributed to [R's Tidy Tuesday project](https://github.com/rfordatascience/tidytuesday).

Import the dataset as a data frame named df from this URL:

In [22]:
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
df = pd.read_csv(url)
df.head()

     country food_category  consumption  co2_emmission
0  Argentina          Pork        10.51           37.2
1  Argentina       Poultry        38.66          41.53
2  Argentina          Beef        55.48         1712.0
3  Argentina   Lamb & Goat         1.56          54.63
4  Argentina          Fish         4.36           6.96


#### What is the maximum co2_emmission in the dataset and which food type and country does it belong to?

In [23]:
# TODO: your answer here

#### How many countries produce more than 1000 kg CO2/person/year for at least one food type?

In [24]:
# TODO: your answer here

#### What are the total emissions of all other (non-meat) products in the dataset combined?

In [25]:
# TODO: your answer here

## Resources

```{bibliography}
:filter: docname in docnames
```