# Pandas
 Pandas is a Python package used to wrangle and analyze tabular data. 


In [6]:
import pandas as pd
import numpy as np

## pandas.Series 
  A series is a one-dimensional array of indexed data.
  A `NumPy` array does not have an index as part of the data structure.
  A `pandas.Series` has an index as part of the data structure.
  
  `pandas.Series` are the columns of the `pandas.DataFrame`.
  

In [7]:
# A numpy array
arr = np.random.randn(4) # random values from std normal distribution
print(type(arr))
print(arr, "\n")

# A pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s)

<class 'numpy.ndarray'>
[ 0.6676212  -1.34025222  0.2171997  -0.08591628] 

<class 'pandas.core.series.Series'>
0    0.667621
1   -1.340252
2    0.217200
3   -0.085916
dtype: float64
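
To make the difference concrete, the sketch below looks at the two parts of a series separately: its index and its underlying values. It uses fixed values instead of random ones so the result is reproducible, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values (instead of random ones) so the example is reproducible
arr = np.array([0.5, -1.0, 2.0, 0.0])
s = pd.Series(arr)

# The index is stored alongside the data
print(s.index)       # default integer index: 0, 1, 2, 3

# The underlying values come back as a NumPy array
values = s.to_numpy()
print(type(values))
print(values)
```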


## Creating a pandas.Series

`s = pd.Series(data, index=index)`

The `data` parameter can be:
1. a `NumPy` array or a Python list
2. a Python dictionary
3. a single value (a number, boolean, or string)

The `index` parameter is optional, but if included it should be the same length as the data. If no index is provided, the default is the integers 0 to `len(data) - 1`.
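
As a quick check of both rules, the sketch below builds one series without an index and then tries an index of the wrong length for array data. The names `data`, `s_default`, and `mismatch_raised` are illustrative only.

```python
import numpy as np
import pandas as pd

data = np.arange(3)

# No index given: pandas assigns 0, 1, ..., len(data) - 1
s_default = pd.Series(data)
print(list(s_default.index))

# Index with the wrong length for array data: pandas raises a ValueError
try:
    pd.Series(data, index=['a', 'b'])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```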

#### Creating a pandas.Series from a numpy array

In [8]:
# A series from a numpy array 
pd.Series(np.arange(3), index=[2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

#### Creating a pandas.Series from a list

In [11]:
# A series from a list of strings with default index
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

#### Creating a pandas.Series from a dictionary

In [12]:
# Construct dictionary
d = {'key_0':2, 'key_1':'3', 'key_2':5}

# Initialize series using a dictionary
pd.Series(d)

key_0    2
key_1    3
key_2    5
dtype: object
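
The `object` dtype in this output comes from the mixed value types: `'3'` is a string while the other values are integers. The sketch below compares the same dictionary with a hypothetical all-integer variant; `mixed` and `all_ints` are illustrative names.

```python
import pandas as pd

# Same dictionary as above: one value is the string '3'
mixed = pd.Series({'key_0': 2, 'key_1': '3', 'key_2': 5})
print(mixed.dtype)

# Hypothetical variant where every value is an integer
all_ints = pd.Series({'key_0': 2, 'key_1': 3, 'key_2': 5})
print(all_ints.dtype)
```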

#### Creating a pandas.Series from a single value

We have to provide an index, and the length of the series will be the length of the index: the value is repeated for each index label.

In [13]:
pd.Series(3.0, index = ['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64

## Basic arithmetic on series

Note: printing `'\n'` adds a blank line between the outputs, which improves readability.

In [14]:
# Define a series
s = pd.Series([98,73,65],index=['Andrea', 'Beth', 'Carolina'])

# Divide each element in series by 10
print(s /10, '\n')

# Take the exponential of each element in series
print(np.exp(s), '\n')

# Check whether each value is greater than 70 (returns a boolean series)
print(s > 70, '\n')

# Original series is unchanged
print(s)



Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea       True
Beth         True
Carolina    False
dtype: bool 

Andrea      98
Beth        73
Carolina    65
dtype: int64
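
The boolean series returned by a comparison can be reused. As a minimal sketch, it can be summed to count matches or passed back to the series to select them; `above_70` and `n_above` are illustrative names.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Comparison returns a boolean series aligned with the original index
above_70 = s > 70

# True counts as 1, so the sum is the number of values above 70
n_above = above_70.sum()
print(n_above)

# A boolean series can also select the matching elements
print(s[above_70])
```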


## Identifying missing values 

Missing values can be represented by the float value `numpy.nan` (`np.nan`), which stands for "not a number".

In [15]:
# Series with NAs in it
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

The `hasnans` attribute is `True` if the series contains NAs and `False` otherwise.

In [16]:
# Check if series has NAs
s.hasnans

True

In [17]:
# Identify which values are NA (returns a boolean series)
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool
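
Building on `isna()`, one possible way to count the missing values and find where they are is sketched below; `n_missing` and `missing_labels` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# Count missing values: True counts as 1 when summed
n_missing = s.isna().sum()
print(n_missing)

# Index labels where the data are missing
missing_labels = s[s.isna()].index.tolist()
print(missing_labels)
```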

## Check-in

1. The integer number -999 is often used to represent missing values. Create a `pandas.Series` named `s` with four integer values, two of which are -999. The index of this series should be the letters A through D.

2. In the `pandas.Series` documentation, look for the method `mask()`. Use this method to update the series `s` so that the -999 values are replaced by NA values. HINT: check the first example in the method's documentation.

In [42]:
s = pd.Series([100, -999, 10, -999], index = ['A', 'B', 'C', 'D'])
# mask() returns a new series, so reassign to update s
s = s.mask(s == -999)
s

A    100.0
B      NaN
C     10.0
D      NaN
dtype: float64
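
Two details of this result are worth a closer look: the values became floats because `NaN` is a float, and `mask()` also accepts a replacement value through its `other` parameter. The sketch below illustrates both; `masked` and `filled` are illustrative names, and replacing with 0 is only an example choice.

```python
import pandas as pd

s = pd.Series([100, -999, 10, -999], index=['A', 'B', 'C', 'D'])

# Default: masked positions become NaN, so the integer series becomes float
masked = s.mask(s == -999)
print(masked.dtype)

# With `other`, masked positions take that value instead of NaN
filled = s.mask(s == -999, other=0)
print(filled.tolist())
```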

In [47]:
# Looking at the Series.mask() documentation
?s.mask

Signature:
s.mask(
    cond,
    other=<no_default>,
    *,
    inplace: 'bool_t' = False,
    axis: 'Axis | None' = None,
    level: 'Level | None' = None,
) -> 'Self | None'
Docstring:
Replace values where the condition is True.

Parameters
----------
cond : bool Series/DataFrame, array-like, or callable
    Where `cond` is False, keep the original value. Where
    True, replace with corresponding value from `other`.
    If `cond` 

# pandas.DataFrame

Each column of a `pandas.DataFrame` is a `pandas.Series`. A `pandas.DataFrame` can be thought of as a dictionary of `pandas.Series`, where each column name is a key and the column's values are the corresponding value.

In [49]:
# Initialize dictionary with columns' data 
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

# Create data frame
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
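
The dictionary analogy can be checked directly. As a minimal sketch, selecting a column by its name, the way one would look up a dictionary key, returns a `pandas.Series`; `col` is an illustrative name.

```python
import numpy as np
import pandas as pd

d = {'col_name_1': pd.Series(np.arange(3)),
     'col_name_2': pd.Series([3.1, 3.2, 3.3])}
df = pd.DataFrame(d)

# Selecting a column by name, like a dictionary key, returns a series
col = df['col_name_2']
print(type(col))

# Column names play the role of the dictionary keys
print(list(df.columns))
```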


In [50]:
# Change index
df.index = ['a','b','c']
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


## Check-in 

In [51]:
# Changing column names
df.columns = ['C1', 'C2']
df

   C1   C2
a   0  3.1
b   1  3.2
c   2  3.3
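
Assigning to `df.columns` replaces every column name at once. When only some names should change, `rename()` with a mapping is one alternative. The sketch below rebuilds the data frame so that it runs on its own; `renamed` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'col_name_1': [0, 1, 2],
                   'col_name_2': [3.1, 3.2, 3.3]},
                  index=['a', 'b', 'c'])

# rename() returns a new data frame; only the listed columns change
renamed = df.rename(columns={'col_name_1': 'C1'})
print(list(renamed.columns))

# The original data frame keeps its column names
print(list(df.columns))
```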
