# MODULE: Pandas - SERIES

Pandas (supposedly derived from the words *panel* and *data*) is used for:
- data processing and cleaning, before feeding it into machine learning or data mining algorithms. 
- dealing with missing values
- creating new features (columns) from existing columns
- exploratory data analysis to understand what the data contains, summary statistics and plotting (using Matplotlib).

## Data Structures

The `pandas` library contains these useful data structures:
* `Series` object - A 1D array, similar to a column in a spreadsheet (with a column name and row labels).
* `DataFrame` object - A 2D table, similar to a spreadsheet (with column names and row labels).
* `Index` object - 1D array used to store the indices of the data values



## Task for this notebook

In section 2, we will first learn about the Pandas Series, an object that represents one column of (homogeneous) data. Then, in section 3, we will learn about DataFrames, which contain many columns.

In section 3, we will import 3 data files on US state populations from the web, save them locally, and open each one as a DataFrame. We will learn techniques to clean and process the data, identify and deal with missing values, explore the data, and create new computed variables (also called measures or features).

**NOTE:**\
Anywhere I write `Series.method()` or `Series.attribute`, it should technically be `pd.Series.method()` and `pd.Series.attribute`, since Series and DataFrame are found in the `pd` package. Similarly, `DF.method()` stands for `pd.DataFrame.method()`.\
I use the shorter form for simplicity.

## 4.1 Imports

In [2]:
# imports

import numpy as np
import pandas as pd

In [3]:
pd.__version__

'1.4.4'

## 4.2 Creating Series, attributes and displaying

### 4.2.1 creating from NumPy array

In [4]:
s = pd.Series(data = np.array([-2, -1, 0, 1, 2]))
s

0   -2
1   -1
2    0
3    1
4    2
dtype: int32

NumPy is typically used to store numeric data, but it can also store text (dtype = 'str' or 'U' for Unicode) or objects (dtype = 'O').

In [5]:
# let us convert from ndarray into a Series object
a = pd.Series(data = np.array(['a1', 'a2', 'a3', 'a4'], dtype='str'))
a

0    a1
1    a2
2    a3
3    a4
dtype: object

**Index labels**

Each item in a `Series` object has an identifier called the *index label* (labels are usually unique, although Pandas does not require it). By default, it is simply the position of the item in the `Series` (starting at `0`), but you can also set the index labels manually using the *index* parameter.
The index can be a sequence of numeric or text values.

Set the index using the `index` parameter.

We can use a list comprehension to provide a monotonically increasing index. When the number of items is large, a list comprehension is much easier than typing out every label.


In [6]:
pd.Series(np.arange(-2, 3), index = [i for i in range(100, 105)])
#OR
pd.Series(np.arange(-2, 3), index = [100,101,102,103,104])

100   -2
101   -1
102    0
103    1
104    2
dtype: int32

### 4.2.2 from standard python objects

We can set additional parameters.\
*index* \
*name* - give the series a name \
*dtype* - set the data type of the content

In [7]:
# create a sequence (list) to use as index values
region=['Northeast', 'Midwest', 'South', 'West']

# create a series of US population (2019) in 4 main regions
population = pd.Series([55982803, 68329004, 125580448, 78347268], 
                    index = region, 
                    name= 'Pop', 
                    dtype = np.int32)
population

Northeast     55982803
Midwest       68329004
South        125580448
West          78347268
Name: Pop, dtype: int32

When a dict is used as data, the dict keys become the index labels.

In [8]:
dictdata = {'Northeast': 55982803, 'Midwest': 68329004,
            'South': 125580448, 'West':  78347268}

population = pd.Series(dictdata, 
                    name= 'Pop', 
                    dtype = int)
population

Northeast     55982803
Midwest       68329004
South        125580448
West          78347268
Name: Pop, dtype: int32
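
A dict and the *index* parameter can also be combined. The sketch below reuses the region figures from above; the label `'Central'` is a made-up example that is not a key in the dict. Labels are looked up in the dict, so the result follows the order of *index*, and a label with no matching key gets a missing value.

```python
import pandas as pd

dictdata = {'Northeast': 55982803, 'Midwest': 68329004,
            'South': 125580448, 'West': 78347268}

# only the requested labels are kept, in the requested order;
# 'Central' (hypothetical label) has no key in the dict, so its value is NaN
subset = pd.Series(dictdata, index=['West', 'South', 'Central'], name='Pop')
print(subset)
```

Because `NaN` is a float, the whole Series is stored as `float64` in this case.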

We will look at reading from external files in section 3. The Pandas functions for reading files typically return a Pandas DataFrame.

### 4.2.3 Attributes of Series

In [9]:
print(s.name)
s.name = "sample"
print(s.name)
s

None
sample


0   -2
1   -1
2    0
3    1
4    2
Name: sample, dtype: int32

Notice that the data type of a Series of strings is `object` (`'O'`) in Pandas.

In [10]:
print(s.dtype)
print(a.dtype)

int32
object


In [11]:
s.index

RangeIndex(start=0, stop=5, step=1)

In [12]:
s.index.name = "newindex"
s

newindex
0   -2
1   -1
2    0
3    1
4    2
Name: sample, dtype: int32

In [13]:
s.index = ['a', 'b', 'c', 'x', 'y']
s

a   -2
b   -1
c    0
x    1
y    2
Name: sample, dtype: int32

In [14]:
s.values

array([-2, -1,  0,  1,  2])

In [15]:
s.index.values

array(['a', 'b', 'c', 'x', 'y'], dtype=object)

In [16]:
# not an attribute, but a related method
s.to_numpy()  

array([-2, -1,  0,  1,  2])

### 4.2.4 Displaying / viewing Series data

In [17]:
myseries = pd.Series(data = [i for i in np.random.randint(0, 200, 50)])
myseries.head()

0    141
1      9
2    167
3    126
4     90
dtype: int32

In [18]:
myseries.tail(n = 10)

40    112
41     69
42     40
43    155
44     13
45     75
46    115
47    135
48     35
49    122
dtype: int32

In [19]:
myseries.sample(frac = .1)

33    164
37    145
38     49
28    111
16     93
dtype: int32
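
`Series.sample()` picks different rows on every call. If a repeatable sample is needed, the `random_state` parameter can be set. A minimal sketch, assuming a seeded NumPy generator for the data (the seed values 0 and 42 are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                 # seeded, so the data is repeatable
values = pd.Series(rng.integers(0, 200, 50))   # 50 random ints in [0, 200)

# the same random_state selects the same rows on every call
first = values.sample(n=5, random_state=42)
second = values.sample(n=5, random_state=42)
print(first.equals(second))
```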

### 4.2.5 Sorting

In [20]:
myseries.sort_values().head()

29     3
7      8
1      9
44    13
25    14
dtype: int32

In [21]:
s.sort_index(ascending = False)

y    2
x    1
c    0
b   -1
a   -2
Name: sample, dtype: int32

## 4.3 Indexing for Series - selecting specific rows 

You can access the items by integer location, like in a regular array. You can also access items by their index label.

To make it clear whether you are accessing by label or by integer location, it is recommended to use the `Series.loc` attribute when accessing by label, and the `Series.iloc` attribute when accessing by integer location.

Strictly speaking, writing `.iloc` or `.loc` is optional for a Series in the Pandas version used here (1.4.4) because there is only one axis. When we work with DataFrames, this is not so. Plain `s[0]` is ambiguous when the labels are themselves integers, and newer Pandas versions deprecate using it for positions, so the explicit form is the safer habit.
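
The difference between the two attributes is easiest to see when the labels are integers that do not start at 0. A small sketch, using a throwaway Series named `t` (not part of the notebook's data):

```python
import pandas as pd

t = pd.Series([-2, -1, 0, 1, 2], index=[100, 101, 102, 103, 104])

by_label = t.loc[100]           # label 100 is the first item
by_position = t.iloc[0]         # position 0 is also the first item
last_by_position = t.iloc[-1]   # negative positions count from the end

label_slice = t.loc[100:102]    # label slices include the end label
position_slice = t.iloc[0:2]    # position slices exclude the end position

print(by_label, by_position, last_by_position)
print(len(label_slice), len(position_slice))
```

Here `t.loc[0]` would raise a `KeyError`, because 0 is not one of the labels.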

### 4.3.1  integer/position indexing for rows

In [22]:
s[0]
#or
s.iloc[0]  

-2

### 4.3.2  label indexing for rows

In [23]:
population['Midwest']
#or
population.loc['Midwest']   #preferred

68329004

### 4.3.3 slicing for rows

You can use slicing to obtain multiple rows/values. 

Again, use `Series.iloc` when slicing using integer index locations. Use `Series.loc` when using labels. Notice that the behavior of the end index is different when slicing using ints vs labels. Here too, `.iloc` and `.loc` are optional for Series.

In [24]:
# end index is not inclusive when slicing using int row index
population.iloc[1:3]
#or
population[1:3]


Midwest     68329004
South      125580448
Name: Pop, dtype: int32

In [25]:
# end index is inclusive when slicing using labels
population.loc['Midwest':'West']
#or
population['Midwest':'West']

Midwest     68329004
South      125580448
West        78347268
Name: Pop, dtype: int32

### 4.3.4 fancy indexing for rows

In [26]:
population.iloc[[0,2]]
#or
population[[0,2]]

Northeast     55982803
South        125580448
Name: Pop, dtype: int32

In [27]:
population.loc[['Northeast','South']]
#or
population[['Northeast','South']]

Northeast     55982803
South        125580448
Name: Pop, dtype: int32

### 4.3.5 boolean or mask indexing - applied to values only

Recall that the boolean condition is any expression that returns `True` or `False`. You can use any kind of arithmetic, logical, comparison, conditional operators to create a mask of True/False values, based on the data values in the Series.

Boolean indexing is combined with `loc`, but strictly speaking, writing `.loc` is optional.

In [28]:
population.loc[population > 100000000]
#or
population[population > 100000000]

South    125580448
Name: Pop, dtype: int32

use a single `&` to specify an *and* condition\
use a single `|` to specify an *or* condition \
use a single `~` to specify a *not* condition\
wrap each comparison in parentheses, because `&` and `|` bind more tightly than `<` or `>`



In [29]:
population.loc[(population < 100000000) & 
               (population > 60000000)]
#or
population[(population < 100000000) & 
               (population > 60000000)]

Midwest    68329004
West       78347268
Name: Pop, dtype: int32

In [30]:
s[s != 0]

a   -2
b   -1
x    1
y    2
Name: sample, dtype: int32
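
Masks can also be stored in variables and combined. In Pandas the element-wise *not* operator is written `~`. The sketch below rebuilds the `population` Series so that it runs on its own; the thresholds are arbitrary illustration values.

```python
import pandas as pd

population = pd.Series({'Northeast': 55982803, 'Midwest': 68329004,
                        'South': 125580448, 'West': 78347268}, name='Pop')

big = population > 100_000_000     # boolean mask
small = population < 60_000_000    # another boolean mask

either = population[big | small]      # or: very large or very small regions
not_big = population[~big]            # not: everything except the large regions
middle = population[~big & ~small]    # and combined with not

print(either, not_big, middle, sep="\n\n")
```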

### 4.3.6 isin() method - applied to row labels or values

In [31]:
s

a   -2
b   -1
c    0
x    1
y    2
Name: sample, dtype: int32

In [32]:
s[s.isin([-1,1])]

b   -1
x    1
Name: sample, dtype: int32

In [33]:
s[s.index.isin(['m', 'n', 'b'])]

b   -1
Name: sample, dtype: int32

### 4.3.7 filter() method - applied to labels only

It has several optional parameters (only one of them can be used per call):
- `items` - used to specify exact labels to search for in index labels
- `like` - used to specify a single substring to search for (anywhere) in index labels
- `regex` - used to specify a regular expression pattern to match in the labels

In [34]:
population.filter(items = ['South', 'West'])

South    125580448
West      78347268
Name: Pop, dtype: int32

In [35]:
population.filter(like = "outh")

South    125580448
Name: Pop, dtype: int32

In [36]:
print(population.filter(regex='h$'), end="\n\n")
print(population.filter(regex='^N'), end="\n\n")
print(population.filter(regex='s'), end="\n\n")
print(population.filter(regex='[Ww]'), end="\n\n")
print(population.filter(regex='[0-9]'), end="\n\n")
print(population.filter(regex='.+th.+'), end="\n\n")

South    125580448
Name: Pop, dtype: int32

Northeast    55982803
Name: Pop, dtype: int32

Northeast    55982803
Midwest      68329004
West         78347268
Name: Pop, dtype: int32

Midwest    68329004
West       78347268
Name: Pop, dtype: int32

Series([], Name: Pop, dtype: int32)

Northeast    55982803
Name: Pop, dtype: int32



### 4.3.8 set or change values in-place in a Series

You can use any of the indexing methods above to locate one or more rows and then set their values using `=`. This changes the underlying Series in place.

In [37]:
s[0] = 15
s

a    15
b    -1
c     0
x     1
y     2
Name: sample, dtype: int32

In [38]:
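# note: filter() returns the value -1 (not the label 'b'), so here -1 acts as a position and 'y' changes;
# to target the label itself, use s.loc[s.filter(like = 'b').index] = -10
s[s.filter(like = 'b')] = -10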
s

a    15
b    -1
c     0
x     1
y   -10
Name: sample, dtype: int32

In [39]:
# set the same value in all selected index locations
s[[2,3]] = 12
print(s)

# set different values as given in the list (list size should match the number of indices selected)
s[[2,3]] = [12, 14]
s

a    15
b    -1
c    12
x    12
y   -10
Name: sample, dtype: int32


a    15
b    -1
c    12
x    14
y   -10
Name: sample, dtype: int32

In [40]:
s

a    15
b    -1
c    12
x    14
y   -10
Name: sample, dtype: int32
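
The same kinds of assignment can be written with explicit `.loc` and `.iloc`, which avoids any doubt about whether a key is a label or a position. A minimal sketch on a fresh copy of the sample Series (the assigned values are arbitrary):

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2], index=['a', 'b', 'c', 'x', 'y'], name='sample')

s.loc['b'] = -10                # one label
s.loc[['c', 'x']] = [12, 14]    # several labels, one value each
s.iloc[0] = 15                  # first position
s.loc[s < 0] = 0                # every value selected by a boolean mask

print(s)
```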

## 4.4 Series Operations


### 4.4.1 concat() function for concatenating Series (and / or DF)

`pd.concat()` is a top-level function that accepts a list of Pandas objects to concatenate.\
It has a parameter called `axis` (`axis=0` appends rows, `axis=1` adds columns).
Concatenating Series along `axis=0` produces a longer Series; concatenating them along `axis=1` produces a DataFrame.

Note: Missing values can be written as `None`, `np.nan`, or `pd.NA`. In a float Series, `None` and `np.nan` are both shown as `NaN` (Not a Number) in Pandas output.

In [41]:
popA = pd.concat([population, 
                  (pd.Series({'Central': np.nan, 'East' : None}))], 
                 axis = 0)
popA

Northeast     55982803.0
Midwest       68329004.0
South        125580448.0
West          78347268.0
Central              NaN
East                 NaN
dtype: float64

In [42]:
popB = pd.concat([population, 
                  (pd.Series({'South': 126546789, 'East' : None}))], 
                 axis = 1)
popB

                   Pop            0
Northeast   55982803.0          NaN
Midwest     68329004.0          NaN
South      125580448.0  126546789.0
West        78347268.0          NaN
East               NaN          NaN
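
The type of the result depends on the axis. A short sketch with two small Series; the values in `right` and the name `'Other'` are made up for illustration:

```python
import numpy as np
import pandas as pd

left = pd.Series({'Northeast': 55982803, 'Midwest': 68329004}, name='Pop')
right = pd.Series({'Midwest': 68000000.0, 'East': np.nan}, name='Other')  # made-up values

stacked = pd.concat([left, right], axis=0)        # rows appended one after the other
side_by_side = pd.concat([left, right], axis=1)   # columns aligned on the index labels

print(type(stacked).__name__)        # Series
print(type(side_by_side).__name__)   # DataFrame
print(side_by_side)
```

With `axis=1`, each Series name becomes a column label, and labels that appear in only one Series get `NaN` in the other column.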


### 4.4.2 converting DataFrame to Series

In [43]:
popA = popA.squeeze()
popA

Northeast     55982803.0
Midwest       68329004.0
South        125580448.0
West          78347268.0
Central              NaN
East                 NaN
dtype: float64
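
Note that `popA` above is already a Series, so `squeeze()` returns it unchanged. To see the conversion itself, here is a sketch with a small one-column DataFrame (built by hand for this example):

```python
import pandas as pd

df = pd.DataFrame({'Pop': [55982803, 68329004]}, index=['Northeast', 'Midwest'])

as_series = df.squeeze()   # one column -> a Series named after the column
same = df['Pop']           # selecting the column gives an equal Series

print(type(as_series).__name__, as_series.name)
```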

### 4.4.3 converting Series to DataFrame

You can call the `to_frame()` method of a Pandas Series to change it into a DataFrame with one column.
If the Series has no name, the column will be labeled `0`.
We can use the parameter `name` to provide a name for the column.

In [44]:
popA.to_frame()

                     0
Northeast   55982803.0
Midwest     68329004.0
South      125580448.0
West        78347268.0
Central            NaN
East               NaN
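
A small sketch of the `name` parameter, using an unnamed two-row Series so that both cases are visible:

```python
import pandas as pd

pop = pd.Series({'Northeast': 55982803, 'Midwest': 68329004})   # unnamed Series

default_frame = pop.to_frame()            # column label is 0
named_frame = pop.to_frame(name='Pop')    # column label is 'Pop'

print(list(default_frame.columns), list(named_frame.columns))
```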


### 4.4.4 counts and missing values

`Series.count()` gives us the total number of non-missing observations.

In [45]:
population.count()

4

In [46]:
len(population)

4

If we have categorical values in a Series, it is useful to use `Series.value_counts()` to get a distribution of the values.

In [47]:
animals = pd.Series(['cat', 'dog', 'dog', 'rabbit', 'elephant', 'cat'])
animals.value_counts()

cat         2
dog         2
rabbit      1
elephant    1
dtype: int64
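
`value_counts()` has a few optional parameters that are handy for distributions. The sketch below adds one missing value (`None`) to the animal list to show how it is treated:

```python
import pandas as pd

animals = pd.Series(['cat', 'dog', 'dog', 'rabbit', 'elephant', 'cat', None])

counts = animals.value_counts()                      # missing values are left out
shares = animals.value_counts(normalize=True)        # proportions instead of counts
with_missing = animals.value_counts(dropna=False)    # count the missing value too

print(counts, shares, with_missing, sep="\n\n")
```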

`Series.isna()` can be used to determine whether each value is missing (null), i.e. **None, pd.NA or np.nan**. It returns True where the value is missing and False otherwise.

In [48]:
popA.isna()

Northeast    False
Midwest      False
South        False
West         False
Central       True
East          True
dtype: bool

`Series.any()` returns whether any element is True, potentially over an axis. Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).

In [49]:
# are there any True (non-zero) values? Missing values are skipped by default

popA.any()

True

In [50]:
# are there any missing values in popA? isna() marks missing values as True

popA.isna().any()

True

In [51]:
# how many? - we can chain isna() with sum()
popA.isna().sum()

2

In [52]:
# are there any missing values in population?
# no: isna() is False for every row, so any() returns False

population.isna().any()

False

### 4.4.5 replacing values in a new array  (if-then-else)

This will not change the values in the original Series

In [53]:
# where() method: keep the value if the condition is True, otherwise replace it
popA.where(popA.notna(), popA.mean())

# wherever the condition is False, i.e., in the last two rows that had missing values,
# the value is replaced with the mean

Northeast    5.598280e+07
Midwest      6.832900e+07
South        1.255804e+08
West         7.834727e+07
Central      8.205988e+07
East         8.205988e+07
dtype: float64
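
For the specific case of missing values, the same result can be sketched with `fillna()`, and `dropna()` removes the missing rows instead. The block rebuilds `popA` by hand so that it runs on its own:

```python
import numpy as np
import pandas as pd

popA = pd.Series({'Northeast': 55982803, 'Midwest': 68329004,
                  'South': 125580448, 'West': 78347268,
                  'Central': np.nan, 'East': np.nan})

filled_where = popA.where(popA.notna(), popA.mean())   # as in the cell above
filled_fillna = popA.fillna(popA.mean())               # shorter form for missing values
dropped = popA.dropna()                                # remove the missing rows instead

print(filled_where.equals(filled_fillna))
print(len(dropped))
```

None of these calls changes `popA` itself; each returns a new Series.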

In [54]:
# mask() is the opposite of where()
# it produces a new Series in which the value is changed to the second parameter wherever the condition is True
popA.mask(popA > popA.mean(), 0)

Northeast    55982803.0
Midwest      68329004.0
South               0.0
West         78347268.0
Central             NaN
East                NaN
dtype: float64

In [55]:
# replace() returns a new Series; s itself is unchanged because the result is not assigned back
s.replace({15 : -15, 20 : -20})
s

a    15
b    -1
c    12
x    14
y   -10
Name: sample, dtype: int32

### 4.4.6 Unary and binary operations

We will look at these when we work with DataFrames.

### 4.4.7 passing Series objects as parameters to NumPy functions

In [56]:
# here we compute the exponential (e**x) of the Series values

np.exp(s)

a    3.269017e+06
b    3.678794e-01
c    1.627548e+05
x    1.202604e+06
y    4.539993e-05
Name: sample, dtype: float64
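
Other element-wise NumPy functions behave the same way: the result is again a Series with the same index labels and name. A small sketch using the current values of `s`, rebuilt here so the block is self-contained:

```python
import numpy as np
import pandas as pd

s = pd.Series([15, -1, 12, 14, -10], index=['a', 'b', 'c', 'x', 'y'], name='sample')

absolute = np.abs(s)         # still a Series, index and name are kept
roots = np.sqrt(absolute)    # functions can be chained on the result

print(type(absolute).__name__)
print(roots)
```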

### 4.4.8 working with String values

The `Series.str` attribute gives us access to several **vectorized** string functions. They also allow the use of regular expressions for manipulating string values.

One use of `Series.str` is to help clean up columns that have string content. Let us create a Series of animal names in which the data contains extra spaces.

Then we use `Series.str` methods to remove the leading and trailing spaces, and capitalize the words in the Series.

In [57]:
animals = pd.Series(['cat   ', '  dog', 'dog', ' rabbit ', '  elephant', 'cat'])
animals

0        cat   
1           dog
2           dog
3       rabbit 
4      elephant
5           cat
dtype: object

In [58]:
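# strip() removes only leading and trailing whitespace; str.replace(" ", "") would also remove spaces inside a value
animals = animals.str.strip()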
animals

0         cat
1         dog
2         dog
3      rabbit
4    elephant
5         cat
dtype: object

In [59]:
animals = animals.str.capitalize()
animals

0         Cat
1         Dog
2         Dog
3      Rabbit
4    Elephant
5         Cat
dtype: object

In [60]:
# does the string contain a substring?

animals.str.contains('bb')

0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool

In [61]:
# select only those rows where the value contains 'bb'
# any expression that returns a Series of booleans- True/ False can be used as a mask
# to filter out values where the mask is False (keep only values where mask is True)

mask = animals.str.contains('bb')
animals.loc[mask]

3    Rabbit
dtype: object

There are several other methods - `Series.str.startswith()`, `Series.str.endswith()`, etc.
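
A brief sketch of these methods on the cleaned animal names. Each returns a boolean Series (or, for `str.len()`, a Series of integers) that can be used as a mask:

```python
import pandas as pd

animals = pd.Series(['Cat', 'Dog', 'Dog', 'Rabbit', 'Elephant', 'Cat'])

starts_with_c = animals[animals.str.startswith('C')]   # names beginning with 'C'
ends_with_t = animals[animals.str.endswith('t')]       # names ending with 't'
lengths = animals.str.len()                            # number of characters per name

print(starts_with_c, ends_with_t, lengths, sep="\n\n")
```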

# THE END