### 1 `pandas` series and data frames 

This section introduces two core objects in the `pandas` library: the `pandas.Series` and the `pandas.DataFrame`.
The goals are to gain familiarity with these two objects, understand how they relate to each other, and review the Python data structures they build on, namely dictionaries and lists.


### `pandas`

`pandas` is a Python package to wrangle and analyze tabular data. It is built on top of NumPy and has become the core tool for doing data analysis in Python. The standard abbreviation for `pandas` is `pd`.


In [14]:
# Import all packages in a single cell, with each package on its own line

import pandas as pd
import numpy as np

### Series 

The first core object of pandas is the series. A series is a one-dimensional array of indexed data.

The main difference between a `pandas.Series` and a NumPy array is that a `pandas.Series` has an index.

In [2]:
# A numpy array 

array = np.random.randn(4) # random values from std normal distribution 
print(type(array))
print(array, "\n")

<class 'numpy.ndarray'>
[ 1.09796344  1.36370408  0.08745993 -0.72353854] 



In [4]:
# A pandas series made from the previous array 

s = pd.Series(array)
print(type(s))
print(s)

<class 'pandas.core.series.Series'>
0    1.097963
1    1.363704
2    0.087460
3   -0.723539
dtype: float64


The index is printed as part of the `pandas.Series`. A NumPy array can be indexed by position, but that index is not part of the data structure. Printing the `pandas.Series` also shows the values and the data type.
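To make the difference concrete, here is a minimal sketch that compares positional access on an array with label-based access on a series. The variable names (`arr`, `ser`) and the labels `x`, `y`, `z` are made up for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([10.0, 20.0, 30.0])
ser = pd.Series(arr, index=["x", "y", "z"])  # hypothetical labels

# The array only supports access by position
first_from_array = arr[0]

# The series stores an explicit index next to its values
labels = list(ser.index)   # the index is an attribute of the series
values = ser.to_numpy()    # the values as a NumPy array
by_label = ser["y"]        # access through the index label
by_position = ser.iloc[1]  # positional access is still available
```

Both `by_label` and `by_position` refer to the same element; the label is simply an extra way to reach it.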

### Creating a `pandas.Series`

The basic way to create a `pandas.Series` is to call:

s = pd.Series(data, index = index) 

`data` can be a list, a NumPy array, a Python dictionary, or a single number, boolean (`True`/`False`), or string.

The `index` parameter is optional.

In [10]:
# A series from a numpy array 
pd.Series(np.arange(3), index=[2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

In [11]:
# A series from a list of strings with default index 

pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

Example: Creating a `pandas.Series` from a dictionary

A dictionary is a set of key-value pairs. If we create a `pandas.Series` from a dictionary, the keys become the index and the values become the corresponding data.

In [13]:
# Construct a dictionary 

d = {'key_0': 2,
     'key_1': 3,
     'key_2': 5}

# Initialize a series using the dictionary

pd.Series(d)

key_0    2
key_1    3
key_2    5
dtype: int64
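As a side note, the optional `index` argument can also be combined with a dictionary. The sketch below, which uses a made-up key `key_9` that is not in the dictionary, suggests how pandas then picks values by key and fills in NaN where a key is missing.

```python
import pandas as pd

d_demo = {'key_0': 2,
          'key_1': 3,
          'key_2': 5}

# Keys listed in `index` are looked up in the dictionary;
# 'key_9' is a hypothetical label with no matching key
picked = pd.Series(d_demo, index=['key_2', 'key_0', 'key_9'])
print(picked)
```

Because NaN is a float, the resulting series is stored as floats rather than integers.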

Creating a `pandas.Series` from a single value:

If we only provide a single number, boolean, or string as the data, we need to provide an index. The value is repeated for every index label.

In [3]:
pd.Series(3.0, index = ['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64

Simple operations

Arithmetic operations work on series, and so do most NumPy functions:

In [4]:
# Define a series

s = pd.Series([98, 73, 65], index = ['Andrea', 'Beth', 'Carolina'])

# Divide each element in series by 10

print(s/10, '\n')

# Take exponential of each element in a series
print(np.exp(s), '\n')

# The original series is unchanged
print(s)

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64


In [5]:
s > 70

Andrea       True
Beth         True
Carolina    False
dtype: bool
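A comparison like the one above returns a boolean series, which can in turn be used to select elements. The following is a small sketch of that idea; the names `scores`, `above_70`, and `passed` are chosen here for illustration.

```python
import pandas as pd

scores = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Boolean series: True where the value is greater than 70
above_70 = scores > 70

# Use the boolean series to keep only the matching elements
passed = scores[above_70]

# True counts as 1, so the sum is the number of matches
n_passed = int(above_70.sum())

print(passed)
print(n_passed)
```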

### Identifying missing values 

In pandas, we can represent a missing, NULL, or NA value with the float value `numpy.nan`, which stands for "not a number".

In [6]:
# Series with NAs in it

s = pd.Series([1, 2, np.nan, 4, np.nan]) 
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

The `hasnans` attribute of a `pandas.Series` returns `True` if there are any NA values in it and `False` otherwise.

In [8]:
# Check if series has NAs

s.hasnans

True

We can use the `isna()` method to find out which elements in a series are NAs.

In [9]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool
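The boolean output of `isna()` can be combined with other operations. As a rough sketch (the variable names are illustrative), we could count the missing values, find where they are, or drop them:

```python
import numpy as np
import pandas as pd

s_na = pd.Series([1, 2, np.nan, 4, np.nan])

# Count missing values: True counts as 1 when summed
n_missing = int(s_na.isna().sum())

# Index labels of the missing entries
missing_labels = s_na[s_na.isna()].index.tolist()

# A new series without the missing entries; s_na itself is unchanged
complete = s_na.dropna()

print(n_missing, missing_labels)
print(complete)
```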

1. The integer number -999 is often used to represent missing values. Create a pandas.Series named s with four integer values, two of which are -999. The index of this series should be letters A through D.

2. In the pandas.Series documentation, look for the method `mask()` to update the series s so that the -999 values are replaced by NA values. 

In [4]:
s = pd.Series([1, -999, 2, -999], index = ['A', 'B', 'C', 'D'])
s

A      1
B   -999
C      2
D   -999
dtype: int64

In [10]:
?s.mask

Signature:
s.mask(
    cond,
    other=<no_default>,
    *,
    inplace: 'bool_t' = False,
    axis: 'Axis | None' = None,
    level: 'Level | None' = None,
) -> 'Self | None'
Docstring:
Replace values where the condition is True.

Parameters
----------
cond : bool Series/DataFrame, array-like, or callable
    Where `cond` is False, keep the original value. Where
    True, replace with corresponding value from `other`.
    If `cond` 

In [12]:
s.mask(s == -999)

A    1.0
B    NaN
C    2.0
D    NaN
dtype: float64
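`mask()` is what the exercise asks for, but it is not the only route. For comparison, here is a sketch of two other calls that should give the same result for this series: `replace()`, which swaps one value for another, and `where()`, which keeps values where the condition is True (the opposite convention of `mask()`). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1, -999, 2, -999], index=['A', 'B', 'C', 'D'])

# mask: replace values where the condition is True
masked = s_demo.mask(s_demo == -999)

# replace: swap the sentinel value for NaN directly
replaced = s_demo.replace(-999, np.nan)

# where: keep values where the condition is True, NaN elsewhere
kept = s_demo.where(s_demo != -999)

print(masked)
```

None of these calls modifies `s_demo`; each returns a new series.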

### Data frames 

The `pandas.DataFrame` is the most used pandas object. It represents tabular data, and we can think of it as a spreadsheet. Each column of a `pandas.DataFrame` is a `pandas.Series`.



### Creating a `pandas.DataFrame`

We can think of a `pandas.DataFrame` as a dictionary of `pandas.Series`, with each column name being a key and the column's values being that key's value.

We can create one in this way:

In [16]:
# Initialize the dictionary with columns' data 

d = {'col_name_1' : pd.Series(np.arange(3)), 
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]), 
    }

# Create a data frame

df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
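To check the claim that each column of a data frame is a series, we can select a single column and inspect it. This is a minimal, self-contained sketch that rebuilds the same small data frame under the illustrative name `df_demo`.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'col_name_1': pd.Series(np.arange(3)),
                        'col_name_2': pd.Series([3.1, 3.2, 3.3])})

# Selecting one column by name returns a pandas.Series
col = df_demo['col_name_2']
is_series = isinstance(col, pd.Series)

# The column keeps its name and shares the data frame's index
same_index = col.index.equals(df_demo.index)

print(type(col))
print(col.name, same_index)
```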


We can change the index by assigning a new value to the data frame's `index` attribute.

In [17]:
# Change the index 

df.index = ['a', 'b', 'c']
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


We can access the data frame's column names via the `columns` attribute. Update the column names to `C1` and `C2` by assigning to this attribute.

In [24]:
df.columns = ['C1', 'C2']
df

   C1   C2
a   0  3.1
b   1  3.2
c   2  3.3
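Assigning to `columns` changes the data frame in place. If we would rather leave the original untouched, one alternative is the `rename()` method, which returns a new data frame. The sketch below rebuilds the example under the illustrative name `df_demo`.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'col_name_1': np.arange(3),
                        'col_name_2': [3.1, 3.2, 3.3]},
                       index=['a', 'b', 'c'])

# Map old column names to new ones; a new data frame is returned
renamed = df_demo.rename(columns={'col_name_1': 'C1',
                                  'col_name_2': 'C2'})

print(list(renamed.columns))
print(list(df_demo.columns))  # the original names are still in place
```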
