# 1 `pandas` series and data frames

In [1]:
# importing libraries
import pandas as pd
import numpy as np

**series**: a one-dimensional array of indexed data

The main difference between a `pandas.Series` and a NumPy array is that a `pandas.Series` has an index.

In [2]:
# A numpy array
arr = np.random.randn(4) # random values from std normal distribution
print(type(arr))
print(arr, "\n")

# A pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s)

<class 'numpy.ndarray'>
[0.07615405 0.25868038 0.76977358 0.86834243] 

<class 'pandas.core.series.Series'>
0    0.076154
1    0.258680
2    0.769774
3    0.868342
dtype: float64


Here, the index of the Series was printed along with the values, whereas for the NumPy array it was not. The Series output also reports the data type (`dtype`) of its values.

A NumPy array is indexable, but the index is not part of its data structure.
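To make the distinction concrete, here is a minimal sketch (the values and the labels `'a'`, `'b'`, `'c'` are made up for illustration) showing that a Series stores its index as an attribute, while the underlying array only supports positional access:

```python
import numpy as np
import pandas as pd

arr = np.array([0.1, 0.2, 0.3])
s = pd.Series(arr, index=['a', 'b', 'c'])  # illustrative labels

# the index is stored on the Series itself
print(s.index)

# the values are still available as a NumPy array
print(s.to_numpy())

# a Series can be accessed by label; a NumPy array only by position
print(s['b'])   # 0.2
print(arr[1])   # 0.2
```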

## Creating a `pandas.Series`

Basic method:

`s = pd.Series(data, index=index)`

The `data` parameter can be:
- a list or NumPy array
- a Python dictionary
- a single number, boolean (True/False), or string

The `index` parameter is optional. When `data` is a list or array, the index must be a list of labels with the same length as the data.

If the `index` parameter is not included, the default is to make the index equal to `[0, ..., len(data)-1]`.

### Examples

In [6]:
# a series from a numpy array

pd.Series(np.arange(3), index=[2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

In [7]:
# from a list (with default index)

pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

In [8]:
# from a dictionary
# keys become the index and the values become the corresponding data

# construct dictionary

d = {'key_0':2, 'key_1':'3', 'key_2':5}

# initialize a series using a dictionary

pd.Series(d)

key_0    2
key_1    3
key_2    5
dtype: object

Note: The data type `object` of the values in a Series usually indicates that the Series is made up of strings. However, from the example above we can see that `object` can also indicate a mix of strings *and* numbers. 
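Since `object` alone does not say what is stored, one way to check is to look at the Python type of each element. The sketch below reuses the dictionary from the example above; using `map(type)` for the inspection is a suggestion here, not something the original notes prescribe:

```python
import pandas as pd

mixed = pd.Series({'key_0': 2, 'key_1': '3', 'key_2': 5})

# dtype only reports `object`; it does not say which Python types are inside
print(mixed.dtype)

# map(type) applies type() to each element and returns a new Series
element_types = mixed.map(type)
print(element_types)
```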

In [9]:
# from a single value
# in this case, an index is needed to set the length of the series;
# the single value will be repeated to match the length of the index.

pd.Series(3.0, index = ['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64

## Simple Operations

Arithmetic operations work on series and so do most NumPy functions. 

In [13]:
# define a Series

s = pd.Series([98, 73, 65], index = ['Andrea', 'Beth', 'Carolina'])

# divide each element in series by 10

print(s/10, '\n')

# take the exponential of each element in Series

print(np.exp(s), '\n')

# original series is unchanged
print(s)

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64


Note: Passing `'\n'` as an extra argument to `print` prints a newline character after the object, which leaves a blank line between consecutive outputs.

In [14]:
# boolean

s > 70

# produces new pandas.Series indicating whether elements satisfy conditions

Andrea       True
Beth         True
Carolina    False
dtype: bool
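As a small sketch of how such a boolean Series can be used, passing it inside square brackets keeps only the elements where the condition is `True`. The variable names `condition` and `selected` are illustrative:

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# boolean Series: True where the value is greater than 70
condition = s > 70

# keep only the elements where the condition is True
selected = s[condition]
print(selected)
```

Here `selected` contains only the entries for Andrea and Beth; the original `s` is not modified.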

"These kinds of simple conditions on pandas.Series will be key when we are selecting data from data frames."

## Identifying Missing Values

In `pandas`, a missing, NULL, or NA value can be represented with the float value `numpy.nan`, which stands for "not a number".

In [15]:
# Series with NAs in it

s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

Notice that the data type is `float64`: because `np.nan` is a float, the integers in the list were converted to floats.

The `.hasnans` attribute of a `pandas.Series` is `True` if there are any NA values in it, and `False` otherwise.

In [17]:
# check if series has NAs

s.hasnans

True

After figuring out whether or not there are NA values, it might be useful to know *which* elements in the series are NAs. We can do this using the `.isna()` method.

In [18]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool

The output is a `pandas.Series` of boolean values indicating, for each index, whether the element is NA (`True`) or not (`False`).
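Because `True` counts as 1 when summed, the boolean output of `.isna()` can also be used to count and locate the missing values. A minimal sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

is_missing = s.isna()

# summing a boolean Series counts the True values
n_missing = is_missing.sum()
print(n_missing)        # 2

# index labels of the missing elements
missing_labels = s[is_missing].index.tolist()
print(missing_labels)   # [2, 4]
```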

In [23]:
# trying with -999 as NA value

s = pd.Series([1, 2, -999, -999], index=['A', 'B', 'C', 'D'])

# using mask() method to replace -999 with NA

s.mask(s == -999)

A    1.0
B    2.0
C    NaN
D    NaN
dtype: float64

`mask()` returns a new Series in which every value that was -999 has been replaced with NaN.
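One detail worth checking is that `mask()` does not modify the Series it is called on. The sketch below (the name `masked` is illustrative) compares the original and the result:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, -999, -999], index=['A', 'B', 'C', 'D'])

# entries where the condition is True become NaN in the returned Series
masked = s.mask(s == -999)

# the original still contains the -999 placeholders
print((s == -999).sum())   # 2

# the result has NaN instead, so its values were converted to floats
print(masked.hasnans)      # True
print(masked.dtype)        # float64
```

To keep the cleaned version, assign the result to a variable, for example `s = s.mask(s == -999)`.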

## Creating a `pandas.DataFrame`

**pandas.DataFrame**: the most used `pandas` object. It represents tabular data and we can think of it as a spreadsheet.

Each column of a `pandas.DataFrame` is a `pandas.Series`.

In [25]:
# initializing dictionary with columns' data

d = {'col_name_1': pd.Series(np.arange(3)),
    'col_name_2' : pd.Series([3.1, 3.2, 3.3])}

# creating a data frame from the dictionary
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3


In [27]:
# changing index
df.index = ['a', 'b', 'c']
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


In [32]:
# changing column names using columns attribute

df.columns = ['integers', 'floats']

df

   integers  floats
a         0     3.1
b         1     3.2
c         2     3.3
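To connect this back to the statement that the columns of a data frame are `pandas.Series`, here is a minimal sketch that rebuilds the same data frame in one step and selects a single column with square brackets. The variable name `col` is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'integers': np.arange(3),
                   'floats': [3.1, 3.2, 3.3]},
                  index=['a', 'b', 'c'])

# selecting one column by name returns a pandas.Series
col = df['floats']
print(isinstance(col, pd.Series))   # True

# the Series shares the data frame's index and is named after the column
print(col.index.tolist())           # ['a', 'b', 'c']
print(col.name)                     # floats
```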
