NumPy array attributes:
1. ndim: the number of dimensions
2. shape: the size of each dimension
3. size: the total number of elements in the array
4. dtype: the data type of the array elements
5. itemsize: the size (in bytes) of each array element
6. nbytes: the total size (in bytes) of the array, equal to size × itemsize

In [3]:
import numpy as np


In [4]:
rng = np.random.default_rng(seed=42)
print(type(rng))
x1 = rng.integers(10, size=6)
x2 = rng.integers(10, size=(3, 4))
x3 = rng.integers(10, size=(3, 4, 5))

<class 'numpy.random._generator.Generator'>


In [5]:
print(x3.size) # total elements in array
print(x3.shape) # tuple giving the size of each dimension
print(x3.itemsize) # size of each element in bytes
print(x3.nbytes) # total size in bytes of the array
print(x3.dtype) # data type of the array
print()

60
(3, 4, 5)
8
480
int64



In [6]:
# endpoint=True makes the upper bound inclusive, so values fall in [10, 100]
array = np.random.default_rng(seed=42).integers(low=10, high=100, size=(3, 4), endpoint=True)
print(array)
print(array.dtype)
print(array.size)

[[18 80 69 49]
 [49 88 17 73]
 [28 18 57 98]]
int64
12


### Array Indexing: Accessing Single Elements

In [7]:
# Accessing the element and changing the value
print(x2)
x2[1, 1] = 12
print()
print(x2)

[[0 6 2 0]
 [5 9 7 7]
 [7 7 5 1]]

[[ 0  6  2  0]
 [ 5 12  7  7]
 [ 7  7  5  1]]


Reference: https://numpy.org/doc/stable/reference/generated/numpy.arange.html

`numpy.arange([start, ]stop, [step, ]dtype=None, *, like=None)`
Return evenly spaced values within a given interval.

arange can be called with a varying number of positional arguments:

arange(stop): Values are generated within the half-open interval [0, stop) (in other words, the interval including start but excluding stop).

arange(start, stop): Values are generated within the half-open interval [start, stop).

arange(start, stop, step): Values are generated within the half-open interval [start, stop), with spacing between values given by step.
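
The three-argument form is not exercised in the cells that follow, so here is a minimal sketch of it. The numbers are arbitrary and the variable names are chosen only for this illustration.

```python
import numpy as np

# start=2, stop=11, step=3: the stop value itself is never included
stepped = np.arange(2, 11, 3)
print(stepped)    # [2 5 8]

# a negative step counts down; the half-open rule still applies
countdown = np.arange(5, 0, -1)
print(countdown)  # [5 4 3 2 1]
```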

In [8]:
# Generating array of size 10 from 0 to 9
x = np.arange(10)
print(x)
print(x.dtype)

[0 1 2 3 4 5 6 7 8 9]
int64


In [9]:
y = np.arange(start=1, stop=10, dtype='float')
y

array([1., 2., 3., 4., 5., 6., 7., 8., 9.])

A NumPy array consists of:
1. a contiguous data buffer holding the data elements
2. metadata describing the array (shape, dtype, strides, and so on)

### view() vs copy()

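view() - creates a new array object that shares the original data buffer; only the metadata is new

copy() - creates a new array object with its own copy of the data buffer and metadata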

In [10]:
arr = np.array([3, 4, 5])
x = arr.view()
arr[0] = 42

print(arr)
print(x)

[42  4  5]
[42  4  5]
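
For contrast with the view above, the following sketch puts `view()` and `copy()` side by side. The names `original`, `shared`, and `independent` are assumptions made for this example; `np.shares_memory` is used only to confirm which arrays share a buffer.

```python
import numpy as np

original = np.array([3, 4, 5])
shared = original.view()        # new array object, same data buffer
independent = original.copy()   # new array object, new data buffer

original[0] = 42

print(shared)       # [42  4  5]  (sees the change)
print(independent)  # [3 4 5]     (unaffected)
print(np.shares_memory(original, shared))       # True
print(np.shares_memory(original, independent))  # False
```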


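### ndarray.base
`ndarray.base` is an attribute, not a function. It holds the base object if the array's memory comes from some other object, and is `None` if the array owns its data.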

In [11]:
arr = np.array([1, 2, 3, 4, 5])

x = arr.copy()
y = arr.view()

print(x.base)
print(y.base)

print(x.base is None)
print(y.base is arr)
print(arr.base is None)

None
[1 2 3 4 5]
True
True
True


`Note:`
Basic slicing returns a view, so changing the elements of a slice also changes the original array.
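
A small sketch of that behaviour, using a hypothetical array `grid_demo` that does not appear elsewhere in these notes:

```python
import numpy as np

grid_demo = np.arange(12).reshape(3, 4)
block = grid_demo[:2, :2]   # basic slicing returns a view
block[0, 0] = 99            # writes through to grid_demo

print(grid_demo[0, 0])         # 99
print(block.base is not None)  # True: the slice does not own its data

safe = grid_demo[:2, :2].copy()  # an explicit copy breaks the link
safe[0, 0] = -1
print(grid_demo[0, 0])           # still 99
```

Calling `.copy()` on the slice is the usual way to get an independent sub-array when the original must stay untouched.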

### np.reshape()
`numpy.reshape(a, newshape, order='C')`
Gives a new shape to an array without changing its data.

Parameters:
a : array_like
Array to be reshaped.

newshape : int or tuple of ints

`The new shape should be compatible with the original shape`. If an integer, then the result will be a 1-D array of that length. One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions.

order : {‘C’, ‘F’, ‘A’}, optional
Read the elements of a using this index order, and place the elements into the reshaped array using this index order. ‘C’ means to read / write the elements using C-like index order, with the last axis index changing fastest, back to the first axis index changing slowest. ‘F’ means to read / write the elements using Fortran-like index order, with the first index changing fastest, and the last index changing slowest. Note that the ‘C’ and ‘F’ options take no account of the memory layout of the underlying array, and only refer to the order of indexing. ‘A’ means to read / write the elements in Fortran-like index order if a is Fortran contiguous in memory, C-like order otherwise.

Returns:
reshaped_array : ndarray
This will be a new view object if possible; otherwise, it will be a copy. Note there is no guarantee of the memory layout (C- or Fortran- contiguous) of the returned array.

In [12]:
arr = np.arange(9).reshape((3,3))
print(arr)

[[0 1 2]
 [3 4 5]
 [6 7 8]]
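
The `-1` placeholder and the `order` argument described above can be sketched as follows. The array `flat` and the target shapes are assumptions chosen for illustration.

```python
import numpy as np

flat = np.arange(6)

# -1 lets NumPy infer the missing dimension: 6 elements / 2 rows = 3 columns
inferred = flat.reshape(2, -1)
print(inferred.shape)  # (2, 3)

# order='F' fills the new shape column by column instead of row by row
fortran = flat.reshape((2, 3), order='F')
print(fortran)
# [[0 2 4]
#  [1 3 5]]

# an incompatible shape raises ValueError
try:
    flat.reshape(4, 2)
except ValueError as err:
    print("reshape failed:", err)
```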


### numpy.concatenate
`numpy.concatenate((a1, a2, ...), axis=0, out=None, dtype=None, casting="same_kind")`
Join a sequence of arrays along an existing axis.

Parameters:
a1, a2, … : sequence of array_like
The arrays must have the same shape, except in the dimension corresponding to axis (the first, by default).

axis : int, optional
The axis along which the arrays will be joined. If axis is None, arrays are flattened before use. Default is 0.

out : ndarray, optional
If provided, the destination to place the result. The shape must be correct, matching that of what concatenate would have returned if no out argument were specified.

dtype : str or dtype
If provided, the destination array will have this dtype. Cannot be provided together with out.


casting : {‘no’, ‘equiv’, ‘safe’, ‘same_kind’, ‘unsafe’}, optional
Controls what kind of data casting may occur. Defaults to ‘same_kind’.

Returns:
res : ndarray
The concatenated array.

In [13]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = [99, 99, 99]
print(np.concatenate([x, y, z],dtype='float'))

[ 1.  2.  3.  3.  2.  1. 99. 99. 99.]


In [14]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
grid

array([[1, 2, 3],
       [4, 5, 6]])

In [15]:
# concatenate along the first axis/ index (default)
grid1 = np.concatenate([grid, grid])
print(grid1)
print(grid1.shape)

[[1 2 3]
 [4 5 6]
 [1 2 3]
 [4 5 6]]
(4, 3)


In [16]:
# concatenate along the second axis/ columns
grid2 = np.concatenate([grid, grid], axis=1)
print(grid2)
print(grid2.shape)

[[1 2 3 1 2 3]
 [4 5 6 4 5 6]]
(2, 6)


### numpy.vstack
`numpy.vstack(tup, *, dtype=None, casting='same_kind')`
Stack arrays in sequence vertically (row wise).

This is equivalent to concatenation along the first axis after 1-D arrays of shape (N,) have been reshaped to (1,N). Rebuilds arrays divided by vsplit.

This function makes most sense for arrays with up to 3 dimensions. For instance, for pixel-data with a height (first axis), width (second axis), and r/g/b channels (third axis). The functions concatenate, stack and block provide more general stacking and concatenation operations.

`np.row_stack` is an alias for `np.vstack`; they are the same function. The alias is deprecated in NumPy 2.0, so prefer `np.vstack` in new code.

Parameters:
tup : sequence of ndarrays
The arrays must have the same shape along all but the first axis. 1-D arrays must have the same length.

dtype : str or dtype
If provided, the destination array will have this dtype. Cannot be provided together with out.

New in version 1.24.

casting : {‘no’, ‘equiv’, ‘safe’, ‘same_kind’, ‘unsafe’}, optional
Controls what kind of data casting may occur. Defaults to ‘same_kind’.

New in version 1.24.

Returns:
stacked : ndarray
The array formed by stacking the given arrays, will be at least 2-D.

In [17]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

In [18]:
print(np.vstack([x, grid]))
print()
print(np.row_stack([x,grid]))

[[1 2 3]
 [9 8 7]
 [6 5 4]]

[[1 2 3]
 [9 8 7]
 [6 5 4]]



### numpy.hstack()
`numpy.hstack(tup, *, dtype=None, casting='same_kind')`
Stack arrays in sequence horizontally (column wise).

This is equivalent to concatenation along the second axis, except for 1-D arrays where it concatenates along the first axis. Rebuilds arrays divided by hsplit.

This function makes most sense for arrays with up to 3 dimensions. For instance, for pixel-data with a height (first axis), width (second axis), and r/g/b channels (third axis). The functions concatenate, stack and block provide more general stacking and concatenation operations.

Parameters:
tup : sequence of ndarrays
The arrays must have the same shape along all but the second axis, except 1-D arrays which can be any length.

dtype : str or dtype
If provided, the destination array will have this dtype. Cannot be provided together with out.

New in version 1.24.

casting : {‘no’, ‘equiv’, ‘safe’, ‘same_kind’, ‘unsafe’}, optional
Controls what kind of data casting may occur. Defaults to ‘same_kind’.

New in version 1.24.

Returns:
stacked : ndarray
The array formed by stacking the given arrays.

In [19]:
y = np.array([[99],
              [10]])

In [20]:
print(np.hstack([grid,y]))

[[ 9  8  7 99]
 [ 6  5  4 10]]


### Splitting the array:
An ndarray can be split using np.split(), np.hsplit(), and np.vsplit().

`numpy.split`

`numpy.split(ary, indices_or_sections, axis=0)`
Split an array into multiple sub-arrays as views into ary.

Parameters:
ary : ndarray
Array to be divided into sub-arrays.

indices_or_sections : int or 1-D array
If indices_or_sections is an integer, N, the array will be divided into N equal arrays along axis. If such a split is not possible, an error is raised.

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in

ary[:2]

ary[2:3]

ary[3:]

If an index exceeds the dimension of the array along axis, an empty sub-array is returned correspondingly.

axis : int, optional
The axis along which to split, default is 0.

Returns:
sub-arrays : list of ndarrays
A list of sub-arrays as views into ary.

Raises:
ValueError
If indices_or_sections is given as an integer, but a split does not result in equal division.

`np.hsplit()` - To split along the second axis (columns)

`np.vsplit()` - To split along the first axis (rows)

In [21]:
grid = np.arange(16).reshape((4, 4))
grid.shape

(4, 4)

In [22]:
# split this array into upper and lower halves
vs = np.vsplit(grid, [2])
print(vs)
print(vs[0].shape)

[array([[0, 1, 2, 3],
       [4, 5, 6, 7]]), array([[ 8,  9, 10, 11],
       [12, 13, 14, 15]])]
(2, 4)


In [23]:
# split the array into left and right halves
hs = np.hsplit(grid, [2])
print(hs)
print(hs[0].shape)

[array([[ 0,  1],
       [ 4,  5],
       [ 8,  9],
       [12, 13]]), array([[ 2,  3],
       [ 6,  7],
       [10, 11],
       [14, 15]])]
(4, 2)
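
The cells above use `np.vsplit` and `np.hsplit`; a comparable sketch with plain `np.split` on a 1-D array shows both forms of `indices_or_sections` and the `ValueError` mentioned in the documentation excerpt. The array `values` is an assumption made for this example.

```python
import numpy as np

values = np.arange(9)

# an integer asks for that many equal parts
thirds = np.split(values, 3)
print([part.tolist() for part in thirds])  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# a list of indices gives the cut points: values[:2], values[2:5], values[5:]
pieces = np.split(values, [2, 5])
print([part.tolist() for part in pieces])  # [[0, 1], [2, 3, 4], [5, 6, 7, 8]]

# 9 elements cannot be divided into 2 equal parts
try:
    np.split(values, 2)
except ValueError as err:
    print("split failed:", err)
```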


### `Aggregation Functions`
Aggregation functions perform calculations such as sum, max, min, mean, and median over an array.
For a multidimensional array, an additional `axis` argument specifies whether the aggregation runs along the rows or the columns.


np.sum - compute the sum of elements

np.prod - compute the product of elements

np.mean - compute the mean of elements

np.std - compute the standard deviation

np.var - compute the variance

np.min - find the minimum element

np.max - find the maximum element

np.argmin - find the index of the minimum element

np.argmax - find the index of the maximum element

np.median - compute the median of elements

np.percentile - compute rank-based statistics

np.any - evaluate whether any elements are true

np.all - evaluate whether all elements are true
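
To make the role of `axis` concrete, here is a minimal sketch on a small 2x3 array. The array `m` is hypothetical and chosen so the results are easy to check by hand.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(m))          # 21, all elements
print(np.sum(m, axis=0))  # [5 7 9], collapses the rows: one value per column
print(np.sum(m, axis=1))  # [ 6 15], collapses the columns: one value per row
print(np.argmax(m))       # 5, index into the flattened array
print(np.percentile(m, 50) == np.median(m))  # True: the 50th percentile is the median
```

The `axis` argument names the dimension that is collapsed, not the one that is kept, which is the usual source of confusion.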

### Universal functions (Ufunc)
Universal functions (ufuncs) are functions that operate on n-dimensional arrays element by element.

np.add

np.subtract

np.negative

np.multiply

np.divide

np.floor_divide

np.power

np.mod
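
A short sketch of a few of these ufuncs in use. The array `a` is an assumption for illustration; the `+` line shows that the arithmetic operators on arrays give the same result as the corresponding ufunc.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

print(np.add(a, 10))          # [11 12 13 14]
print(a + 10)                 # [11 12 13 14], same result through the + operator
print(np.floor_divide(a, 2))  # [0 1 1 2]
print(np.power(a, 2))         # [ 1  4  9 16]
print(np.mod(a, 2))           # [1 0 1 0]
print(np.negative(a))         # [-1 -2 -3 -4]
```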

### `Broadcasting`

Broadcasting is a set of rules for applying binary ufuncs to arrays of different shapes.

<div>
<img src="https://files.speakerdeck.com/presentations/d9e9387ef12f4ce3b37291cd392b775c/slide_14.jpg"  width="700"/>
</div>
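
In brief, the rules are: shapes are compared from the trailing dimension backwards, a missing leading dimension is treated as 1, and a dimension of size 1 is stretched to match the other operand; any other mismatch is an error. The sketch below walks through these cases with arrays (`row`, `col`, `ones`) that are assumptions made for this example.

```python
import numpy as np

row = np.arange(3)                # shape (3,)
col = np.arange(3).reshape(3, 1)  # shape (3, 1)
ones = np.ones((2, 3))            # shape (2, 3)

print((row + 5).shape)     # (3,): the scalar is stretched to match
print((ones + row).shape)  # (2, 3): row is treated as (1, 3), then stretched
print((col + row).shape)   # (3, 3): both operands are stretched
print(col + row)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]

# shapes (2, 3) and (2,) disagree in the trailing dimension and neither is 1
try:
    ones + np.arange(2)
except ValueError as err:
    print("broadcast failed:", err)
```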


### Python Pandas
Pandas builds on NumPy and lets rows and columns be identified by labels.

Three fundamental Pandas data structures:
1. Series
2. DataFrame
3. Index

`Series`: a one-dimensional array of indexed data

The difference between a Pandas Series and a NumPy array is that a NumPy array has an `implicitly` defined integer index, whereas a Pandas Series can be referenced using an `explicitly` defined, labeled index.
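
A minimal sketch of that difference, using hypothetical objects `arr_demo` and `ser_demo`:

```python
import numpy as np
import pandas as pd

arr_demo = np.array([0.25, 0.5, 0.75])
ser_demo = pd.Series([0.25, 0.5, 0.75], index=['a', 'b', 'c'])

print(arr_demo[1])       # 0.5, position is the only way in
print(ser_demo['b'])     # 0.5, looked up by label
print(ser_demo.iloc[1])  # 0.5, position still works through iloc
print(ser_demo.index)    # the explicit labels are stored in an Index object
```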




### pd.Series()

`class pandas.Series(data=None, index=None, dtype=None, name=None, copy=None, fastpath=_NoDefault.no_default)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.html

Constructing Series from a list with copy=False.

In [24]:
import pandas as pd

In [25]:
r = [1, 2]
ser = pd.Series(r, copy=False)
ser.iloc[0] = 999
r

[1, 2]

Because the input is a Python list, the Series holds a copy of the original data even though copy=False, so the list is unchanged.

Constructing Series from a 1d ndarray with copy=False.

In [26]:
r = np.array([1, 2])
ser = pd.Series(r, copy=False)
ser.iloc[0] = 999
r

array([999,   2])

Constructing the Series using dictionary:


In [27]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population_dict

{'California': 38332521,
 'Texas': 26448193,
 'New York': 19651127,
 'Florida': 19552860,
 'Illinois': 12882135}

In [28]:
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

### Constructing a DataFrame
1. From a single Series object (a single-column DataFrame)
2. From a list of dictionaries
3. From a dictionary of Series objects
4. From a two-dimensional NumPy array
5. From [a structured numpy array](https://numpy.org/doc/stable/user/basics.rec.html)
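
Options 2, 4, and 5 are not shown in the cells of these notes, so the following is a hedged sketch of each. All data and names (`records`, `values`, `structured`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# from a list of dictionaries: keys become columns, missing keys become NaN
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
from_records = pd.DataFrame(records)
print(from_records)

# from a two-dimensional NumPy array, with explicit labels
values = np.arange(6).reshape(3, 2)
from_array = pd.DataFrame(values, columns=['foo', 'bar'], index=['x', 'y', 'z'])
print(from_array)

# from a structured NumPy array: field names become columns
structured = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
from_structured = pd.DataFrame(structured)
print(from_structured.dtypes)
```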

### Pandas Index
Indexes are immutable

Slicing in a DataFrame:
When slicing with the `explicit index`, the final index is `included` in the result, whereas with the `implicit index` the final index is `excluded`.

`loc` - allows indexing and slicing that always refer to the `explicit` index

`iloc` - allows indexing and slicing that always refer to the `implicit` index
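
The distinction matters most when the explicit labels are themselves integers. A small sketch, assuming a Series `s` with labels 1, 3, 5:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(s.loc[1:3].tolist())   # ['a', 'b']: labels 1 through 3, end label included
print(s.iloc[1:3].tolist())  # ['b', 'c']: positions 1 and 2, end position excluded
print(s.loc[1])              # a, the element with label 1
print(s.iloc[1])             # b, the element at position 1
```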

1. From a single series object


In [29]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population_dict

{'California': 38332521,
 'Texas': 26448193,
 'New York': 19651127,
 'Florida': 19552860,
 'Illinois': 12882135}

In [30]:
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [31]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area_dict

{'California': 423967,
 'Texas': 695662,
 'New York': 141297,
 'Florida': 170312,
 'Illinois': 149995}

In [32]:
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [33]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [34]:
# From single series object
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


3. From a dictionary of Series objects

In [35]:
pd.DataFrame({'population': population,
              'area': area},columns=['population','area'])

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


### Pandas Index

Index as ordered set

Index objects also have many of the attributes familiar from NumPy arrays.
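
A brief sketch of the NumPy-like attributes, together with the immutability noted earlier. The index `ind` is an assumption made for this example.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

print(ind[1])                         # 3
print(ind[::2].tolist())              # [2, 5, 11]
print(ind.size, ind.shape, ind.ndim)  # 5 (5,) 1

# unlike a NumPy array, an Index cannot be modified in place
try:
    ind[1] = 0
except TypeError as err:
    print("assignment failed:", err)
```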


In [36]:
# Index as ordered set
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [37]:
# intersection
indA.intersection(indB)

Int64Index([3, 5, 7], dtype='int64')

In [38]:
# union
indA.union(indB)

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [39]:
# symmetric difference
indA.symmetric_difference(indB)

Int64Index([1, 2, 9, 11], dtype='int64')

### Operating on data in Pandas
Ufunc operations:
1. Index preservation
2. Index alignment

`Index preservation`:

Ufuncs can be applied to Pandas objects in the same way as to NumPy arrays.

The result is another Pandas object with the index preserved.

`Index alignment`:
When two Series objects are combined with an arithmetic operation, the result is indexed by the union of the two indexes; labels present in only one of them produce NaN.

A similar alignment takes place on both columns and indexes when operating on DataFrames.
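
Both behaviours can be sketched in a few lines. The objects below (`ser`, `left`, `right`) are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 4.0, 9.0], index=['x', 'y', 'z'])
roots = np.sqrt(ser)         # ufunc applied element-wise
print(roots.index.tolist())  # ['x', 'y', 'z']: labels are preserved

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=[0, 1])
right = pd.DataFrame({'B': [10, 20], 'C': [30, 40]}, index=[1, 2])
total = left + right         # aligned on both row and column labels
print(total)
```

Only the cell at row 1, column `B` exists in both frames, so it is the only non-NaN entry in `total` (4 + 10 = 14).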

### Missing values in Pandas

In Python, `None` marks data that is not present; arrays that contain it use the object data type.

`NaN` is a special floating-point value used in arrays with a numerical data type.


In [40]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A)
print(B)

0    2
1    4
2    6
dtype: int64
1    1
2    3
3    5
dtype: int64


In [41]:
A.add(B)

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [42]:
B.add(A, fill_value=2)

0    4.0
1    5.0
2    9.0
3    7.0
dtype: float64

In [43]:
rng = np.random.default_rng(100)
A = pd.DataFrame(rng.integers(0, 10, (2, 2)),
                 columns=list('AB'))
A

   A  B
0  7  8
1  1  5


### Handling missing values

In [44]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A)
print(B)

0    2
1    4
2    6
dtype: int64
1    1
2    3
3    5
dtype: int64


In [45]:
A.add(B,fill_value=A.mean())

0    6.0
1    5.0
2    9.0
3    9.0
dtype: float64

In [46]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

In [47]:
1 + np.nan

nan

In [48]:
0 * np.nan

nan

In [49]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

In [50]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)

In a NumPy array, a `None` value forces the `object` dtype, whereas a Pandas Series of otherwise numeric values converts it to `NaN` and uses `float`.

In [51]:
x = np.array([1, None, 3, 4])
x

array([1, None, 3, 4], dtype=object)

In [52]:
#import numpy as np
x = np.array([1, np.nan, 3, 4])
x.dtype

dtype('float64')

In [53]:
pd.Series([1, None, 3, 4])

0    1.0
1    NaN
2    3.0
3    4.0
dtype: float64

In [54]:
#import pandas as pd
pd.Series([1, None, '3', 4])

0       1
1    None
2       3
3       4
dtype: object

### Detecting null values: isnull() and notnull()


In [55]:
data = pd.Series([1, np.nan, 'hello', None])

In [56]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [57]:
data.notnull()

0     True
1    False
2     True
3    False
dtype: bool

In [58]:
data[data.notnull()]

0        1
2    hello
dtype: object

Dropping null values:

In [59]:
data.dropna()

0        1
2    hello
dtype: object

In [60]:
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [61]:
data.dropna(inplace=True)

In [62]:
data

0        1
2    hello
dtype: object

### pandas.DataFrame.dropna
`DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False, ignore_index=False)`

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In [63]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [64]:
df.dropna() # default axis=0: drop rows containing any null value

     0    1  2
1  2.0  3.0  5


In [65]:
df.dropna(axis=1) # axis=1: drop columns containing any null value

   2
0  2
1  5
2  6


In [66]:
df[3] = np.nan
# how='all' drops only the columns in which every value is null (column 3 here)
df.dropna(axis=1, how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


### Filling missing values


### pandas.DataFrame.fillna

`DataFrame.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=_NoDefault.no_default)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

In [67]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [68]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [69]:
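# forward-fill: propagate the last valid value forward
# (pandas 2.1+ deprecates the `method` argument in favour of data.ffill())
data.fillna(method='ffill')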

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [70]:
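# back-fill: propagate the next valid value backward
# (pandas 2.1+ deprecates the `method` argument in favour of data.bfill())
data.fillna(method='bfill')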

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

In [71]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [72]:
df.fillna(method='bfill', axis=0)

     0    1  2
0  1.0  3.0  2
1  2.0  3.0  5
2  NaN  4.0  6


In [73]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [np.nan,      np.nan,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  NaN  NaN  5
2  NaN  4.0  6


In [74]:
df.fillna(method='ffill', axis=0)

     0    1  2
0  1.0  NaN  2
1  1.0  NaN  5
2  1.0  4.0  6


### Multi Indexing
A MultiIndex can be created from:

1. a list of arrays (using [MultiIndex.from_arrays()](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.from_arrays.html#pandas.MultiIndex.from_arrays)),

2. an array of tuples (using [MultiIndex.from_tuples()](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.from_tuples.html#pandas.MultiIndex.from_tuples)),

3. a crossed set of iterables (using [MultiIndex.from_product()](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.from_product.html#pandas.MultiIndex.from_product)),

4.  a DataFrame (using [MultiIndex.from_frame()](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.from_frame.html#pandas.MultiIndex.from_frame))

Reference: https://pandas.pydata.org/docs/user_guide/advanced.html

In [75]:
# Using from_tuples
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [76]:
# Using from_arrays
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [77]:
# Using from_product
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
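
The fourth constructor in the list, `MultiIndex.from_frame()`, is not shown above. A minimal sketch, assuming a small DataFrame `frame` whose column names `letter` and `number` are invented for this example:

```python
import pandas as pd

frame = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                      'number': [1, 2, 1, 2]})
mi = pd.MultiIndex.from_frame(frame)

print(list(mi.names))  # ['letter', 'number']: column names become level names
print(mi.nlevels)      # 2
print(mi.tolist())     # [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
```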

In [78]:

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [79]:
pops = pd.Series(populations, index=index)
pops

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [80]:
type(pops.index)

pandas.core.indexes.multi.MultiIndex

### pandas.DataFrame.reindex
`DataFrame.reindex(labels=None, *, index=None, columns=None, axis=None, method=None, copy=None, level=None, fill_value=nan, limit=None, tolerance=None)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reindex.html

In [81]:
index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
                  'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
                  index=index)
df

           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00


In [82]:
# Create a new index and reindex the dataframe. By default values in the new index that do not have corresponding records in the dataframe are assigned NaN
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
             'Chrome']
df.reindex(new_index)

               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02
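
The `fill_value` parameter in the signature replaces the default NaN for labels that have no matching row. A hedged sketch on a smaller, hypothetical frame `browsers`:

```python
import pandas as pd

browsers = pd.DataFrame({'http_status': [200, 404]},
                        index=['Firefox', 'Safari'])

# labels missing from the original get fill_value instead of NaN
filled = browsers.reindex(['Safari', 'Iceweasel'], fill_value=0)
print(filled)
```

Here `Safari` keeps its value of 404 and the new label `Iceweasel` receives 0.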


### pandas.DataFrame.stack
`DataFrame.stack(level=-1, dropna=_NoDefault.no_default, sort=_NoDefault.no_default, future_stack=False)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html

Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe:

if the columns have a single level, the output is a Series;

if the columns have multiple levels, the new index level(s) is (are) taken from the prescribed level(s) and the output is a DataFrame.

### pandas.DataFrame.unstack
`DataFrame.unstack(level=-1, fill_value=None, sort=True)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html#pandas.DataFrame.unstack

Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels.

If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are not a MultiIndex)


In [83]:
pop_df = pops.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [84]:
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

`Index names`




In [85]:
pops.index.names = ['state', 'year']
pops

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

### Aggregation Functions in Pandas
`pandas.DataFrame.mean`

`DataFrame.mean(axis=0, skipna=True, numeric_only=False, **kwargs)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html
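
A short sketch of how `axis` and `skipna` affect the result. The frame `scores` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'A': [1.0, 2.0, np.nan],
                       'B': [4.0, 5.0, 6.0]})

print(scores.mean())              # column means, NaN skipped: A 1.5, B 5.0
print(scores.mean(axis=1))        # row means: 2.5, 3.5, 6.0
print(scores.mean(skipna=False))  # A becomes NaN because one value is missing
```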

### Pandas Groupby

pandas.DataFrame.groupby

`DataFrame.groupby(by=None, axis=_NoDefault.no_default, level=None, as_index=True, sort=True, group_keys=True, observed=_NoDefault.no_default, dropna=True)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

Groupby Aggregate function

`DataFrameGroupBy.aggregate(func=None, *args, engine=None, engine_kwargs=None, **kwargs)`

`aggregate` accepts a string, a function, a list of these, or a dict mapping column names to them, and computes all the aggregates at once.

Reference: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html


In [86]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9


In [87]:
# aggregation names passed as strings; equivalent to passing np.median and the builtin max
df.groupby('key').aggregate(['min', 'median', 'max'])

    data1            data2
      min median max   min median max
key
A       0    1.5   3     3    4.0   5
B       1    2.5   4     0    3.5   7
C       2    3.5   5     3    6.0   9


In [88]:
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})

     data1  data2
key
A        0      5
B        1      7
C        2      9


`Groupby Filter function`

`DataFrameGroupBy.filter(func, dropna=True, *args, **kwargs)`

Elements from groups are filtered out if the group does not satisfy the boolean criterion specified by func; a group is kept only when func returns True for it.

In [89]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar'],
                   'B' : [1, 2, 3, 4, 5, 6],
                   'C' : [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A')
grouped.filter(lambda x: x['B'].mean() > 3.)

     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0


`Groupby Transform function`

`DataFrameGroupBy.transform(func, *args, engine=None, engine_kwargs=None, **kwargs)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html


In [90]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two'],
                   'C' : [1, 5, 5, 2, 5, 5],
                   'D' : [2.0, 5., 8., 1., 2., 9.]})
df

     A      B  C    D
0  foo    one  1  2.0
1  bar    one  5  5.0
2  foo    two  5  8.0
3  bar  three  2  1.0
4  foo    two  5  2.0
5  bar    two  5  9.0


In [91]:
grouped = df.groupby('A')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7de7746418d0>

In [92]:
# standardise only the numeric columns; the string column 'B' cannot be averaged
grouped[['C', 'D']].transform(lambda x: (x - x.mean()) / x.std())


          C         D
0 -1.154701 -0.577350
1  0.577350  0.000000
2  0.577350  1.154701
3 -1.154701 -1.000000
4  0.577350 -0.577350
5  0.577350  1.000000


`Groupby Apply function`

`DataFrameGroupBy.apply(func, *args, include_groups=True, **kwargs)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.apply.html
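
`apply` passes each group to `func` as a DataFrame and combines the results. The sketch below is one possible use, with an invented frame `df_demo` and an invented per-group ratio; the value columns are selected explicitly so the grouping key is not passed to the function.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                        'data1': [1, 2, 3, 4],
                        'data2': [10, 20, 30, 60]})

# for each group, divide the sum of data1 by the sum of data2
ratio = df_demo.groupby('key')[['data1', 'data2']].apply(
    lambda g: g['data1'].sum() / g['data2'].sum())
print(ratio)
# group A: (1 + 3) / (10 + 30) = 0.1
# group B: (2 + 4) / (20 + 60) = 0.075
```

Because the function returns one scalar per group, the result is a Series indexed by `key`.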

### Time Series data
`Time stamps` refer to particular moments in time.

`Time intervals and periods` refer to a length of time between a particular beginning and end point.

`Time deltas and durations` refer to an exact length of time.

datetime and dateutil: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

Numpy datetime64:
https://numpy.org/doc/stable/reference/arrays.datetime.html





### Pandas Time Series Datastructure

For **Time Stamps**, Pandas provides the Timestamp type.
It is based on the numpy.datetime64 data type.

The associated index structure is `DatetimeIndex`.

For **Time Periods**, Pandas provides the Period type.
This encodes a fixed-frequency interval based on numpy.datetime64.

The associated index structure is `PeriodIndex`.

For **Time deltas or durations**, Pandas provides the Timedelta type.

It is based on numpy.timedelta64.

The associated index structure is `TimedeltaIndex`.
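
The three scalar types and their index structures can be sketched together. The dates and variable names below are assumptions chosen for illustration.

```python
import pandas as pd

stamp = pd.Timestamp('2023-07-04')
period = pd.Period('2023-07', freq='M')
delta = pd.Timedelta(hours=36)

dates = pd.to_datetime(['2023-07-04', '2023-07-06'])  # DatetimeIndex
months = dates.to_period('M')                         # PeriodIndex
gaps = dates - dates[0]                               # TimedeltaIndex

print(type(dates).__name__)   # DatetimeIndex
print(type(months).__name__)  # PeriodIndex
print(type(gaps).__name__)    # TimedeltaIndex
print(stamp + delta)          # 2023-07-05 12:00:00
print(period.start_time, period.end_time)  # the interval covered by the period
```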


### DateTime objects


In [93]:
from datetime import datetime
datetime(year=2020, month=7, day=4)

datetime.datetime(2020, 7, 4, 0, 0)

In [94]:
from dateutil import parser
date = parser.parse("4th of July, 2023")
date

datetime.datetime(2023, 7, 4, 0, 0)

Numpy datetime64 dtype

Reference: https://numpy.org/doc/stable/reference/arrays.datetime.html

In [95]:
date = np.array('2023-07-04', dtype=np.datetime64)
date

array('2023-07-04', dtype='datetime64[D]')

In [96]:
date + np.arange(12)

array(['2023-07-04', '2023-07-05', '2023-07-06', '2023-07-07',
       '2023-07-08', '2023-07-09', '2023-07-10', '2023-07-11',
       '2023-07-12', '2023-07-13', '2023-07-14', '2023-07-15'],
      dtype='datetime64[D]')

In [97]:
np.datetime64('2023-07-04')

numpy.datetime64('2023-07-04')

In [100]:
type(np.datetime64('2023-07-04'))

numpy.datetime64

In [98]:
np.datetime64('2023-07-04 12:00')

numpy.datetime64('2023-07-04T12:00')

In [99]:
np.datetime64('2020-07-04 12:59:59.50', 'ns')

numpy.datetime64('2020-07-04T12:59:59.500000000')

### Pandas DateTime

Reference: https://pandas.pydata.org/docs/user_guide/timeseries.html

### pandas.to_datetime
`pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=False, format=None, exact=_NoDefault.no_default, unit=None, infer_datetime_format=_NoDefault.no_default, origin='unix', cache=True)`

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html

In [101]:
date = pd.to_datetime("4th of July, 2020")
date

Timestamp('2020-07-04 00:00:00')

In [102]:
date.strftime('%A')

'Saturday'
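
`to_datetime` also works on sequences, and its result supports vectorised arithmetic. The sketch below is illustrative only; the input strings and names (`start`, `week`, `mixed`) are assumptions.

```python
import numpy as np
import pandas as pd

start = pd.to_datetime("2020-07-04")

# adding a TimedeltaIndex to a Timestamp yields a DatetimeIndex
week = start + pd.to_timedelta(np.arange(7), unit='D')
print(week.day_name().tolist())
# ['Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

# errors='coerce' turns unparseable input into NaT instead of raising
mixed = pd.to_datetime(['2020-07-04', 'not a date'], errors='coerce')
print(mixed.isna().tolist())  # [False, True]
```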

### Numpy Timedelta

NumPy allows the subtraction of two datetime values, an operation which produces a number with a time unit. Because NumPy doesn’t have a physical quantities system in its core, the timedelta64 data type was created to complement datetime64. The arguments for timedelta64 are a number, to represent the number of units, and a date/time unit, such as `(D)ay, (M)onth, (Y)ear, (h)ours, (m)inutes, or (s)econds`. The timedelta64 data type also accepts the string `NAT` in place of the number for a `Not A Time` value.

Reference: https://numpy.org/doc/stable/reference/arrays.datetime.html

In [104]:
np.timedelta64(1, 'D')

numpy.timedelta64(1,'D')

In [105]:
np.timedelta64(4, 'h')

numpy.timedelta64(4,'h')

In [106]:
np.timedelta64('nAt')

numpy.timedelta64('NaT')

Datetimes and Timedeltas work together to provide ways for simple datetime calculations.

In [103]:
np.datetime64('2009-01-01') - np.datetime64('2008-01-01')

numpy.timedelta64(366,'D')

In [107]:
np.datetime64('2011-06-15T00:00') + np.timedelta64(12, 'h')

numpy.datetime64('2011-06-15T12:00')

In [108]:
np.timedelta64(1,'W') / np.timedelta64(1,'D')

7.0

There are two Timedelta units (‘Y’, years and ‘M’, months) which are treated specially, because how much time they represent changes depending on when they are used. While a timedelta day unit is equivalent to 24 hours, there is no way to convert a month unit into days, because different months have different numbers of days.

In [109]:
a = np.timedelta64(1, 'Y')

In [110]:
np.timedelta64(a, 'M')

numpy.timedelta64(12,'M')

In [111]:
np.timedelta64(a, 'D')  # Raises TypeError: a year has no fixed number of days

TypeError: Cannot cast NumPy timedelta64 scalar from metadata [Y] to [D] according to the rule 'same_kind'
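Since a `'Y'` timedelta cannot be cast to days, one way around it is to anchor the year to concrete dates and subtract them. A minimal sketch; `days_in_year` is a hypothetical helper, not a NumPy function:

```python
import numpy as np

def days_in_year(year: int) -> int:
    """Count the days in a calendar year by subtracting two concrete dates."""
    start = np.datetime64(f"{year}-01-01", "D")
    end = np.datetime64(f"{year + 1}-01-01", "D")
    # Dividing two timedeltas with fixed units yields a plain float.
    return int((end - start) / np.timedelta64(1, "D"))

print(days_in_year(2020))  # 366 (leap year)
print(days_in_year(2021))  # 365
```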

Format values for datetime and timedelta:

<div>
<img src="https://i.stack.imgur.com/Ps5ez.png"  width="700"/>
</div>



### Pandas Timedelta
class `pandas.Timedelta(value=<object object>, unit=None, **kwargs)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html

<div>
<img src="https://img-blog.csdnimg.cn/835abf6e44a44980828b2940aaa5639a.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBA56eN6bqm5Y2X5bGx5LiL,size_20,color_FFFFFF,t_70,g_se,x_16"  width="700"/>
</div>

In [112]:
td = pd.Timedelta(1, "d")
td

Timedelta('1 days 00:00:00')

Initialize the Timedelta object with kwargs

In [113]:
td2 = pd.Timedelta(days=1)
td2

Timedelta('1 days 00:00:00')
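The two constructions above produce the same value. A small sketch comparing them and showing a third form, a string, together with a couple of accessors; the variable names are illustrative:

```python
import pandas as pd

from_unit = pd.Timedelta(1, "D")                # number plus unit
from_kwargs = pd.Timedelta(days=1)              # keyword arguments
from_string = pd.Timedelta("1 days 06:30:00")   # parsed from a string

print(from_unit == from_kwargs)        # True
print(from_string.total_seconds())     # 109800.0
print(from_string.components.hours)    # 6
print(from_string.components.minutes)  # 30
```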

### pandas.to_timedelta
`pandas.to_timedelta(arg, unit=None, errors='raise')`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.to_timedelta.html

In [115]:
pd.to_timedelta(np.arange(12), 'D')

TimedeltaIndex([ '0 days',  '1 days',  '2 days',  '3 days',  '4 days',
                 '5 days',  '6 days',  '7 days',  '8 days',  '9 days',
                '10 days', '11 days'],
               dtype='timedelta64[ns]', freq=None)

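A TimedeltaIndex is also created when one date is subtracted from another, for example when the first element of a DatetimeIndex is subtracted from the whole index. The `dates` index used here is the one built in cell In [116] further down: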

In [119]:
dates - dates[0]

TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)
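A self-contained version of the same subtraction, as a sketch; `dates_demo` is an illustrative stand-in holding the same five dates as `dates`:

```python
import pandas as pd

dates_demo = pd.DatetimeIndex(
    ["2020-07-03", "2020-07-04", "2020-07-06", "2020-07-07", "2020-07-08"]
)

# Subtracting a single Timestamp from a DatetimeIndex gives a TimedeltaIndex.
offsets = dates_demo - dates_demo[0]
print(type(offsets).__name__)  # TimedeltaIndex
print(offsets.days.tolist())   # [0, 1, 3, 4, 5]
```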

In [114]:
date = pd.to_datetime("4th of July, 2020")
date + pd.to_timedelta(np.arange(12), 'D')

DatetimeIndex(['2020-07-04', '2020-07-05', '2020-07-06', '2020-07-07',
               '2020-07-08', '2020-07-09', '2020-07-10', '2020-07-11',
               '2020-07-12', '2020-07-13', '2020-07-14', '2020-07-15'],
              dtype='datetime64[ns]', freq=None)

### pandas.Period
`pandas.Period(value=None, freq=None, ordinal=None, year=None, month=None, quarter=None, day=None, hour=None, minute=None, second=None)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.Period.html

In [128]:
period = pd.Period('2012-1-1', freq='D')
period

Period('2012-01-01', 'D')
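A `Period` stands for a whole span of time, not a single instant. A brief sketch of what that allows, using the same date as above with illustrative variable names:

```python
import pandas as pd

day = pd.Period("2012-01-01", freq="D")

# Integer arithmetic moves by whole periods of the given frequency.
next_day = day + 1
print(next_day)  # 2012-01-02

# A period has a start (and an end) timestamp.
print(day.start_time)  # 2012-01-01 00:00:00

# asfreq converts to a coarser frequency: the month containing this day.
month = day.asfreq("M")
print(month)  # 2012-01
```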

### pandas.Series.dt.to_period

Pandas Time Series: Indexing by Time

`Series.dt.to_period(*args, **kwargs)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.to_period.html



In [116]:
dates = pd.to_datetime([datetime(2020, 7, 3), '4th of July, 2020',
                       '2020-Jul-6', '07-07-2020', '20200708'])
dates

DatetimeIndex(['2020-07-03', '2020-07-04', '2020-07-06', '2020-07-07',
               '2020-07-08'],
              dtype='datetime64[ns]', freq=None)

In [117]:
dates.to_period('D')   # 'D' indicates daily frequency

PeriodIndex(['2020-07-03', '2020-07-04', '2020-07-06', '2020-07-07',
             '2020-07-08'],
            dtype='period[D]')

In [118]:
df = pd.DataFrame({"y": [1, 2, 3]},
                  index=pd.to_datetime(["2000-03-31 00:00:00",
                                        "2000-05-31 00:00:00",
                                        "2000-08-31 00:00:00"]))
df.index.to_period("M")

PeriodIndex(['2000-03', '2000-05', '2000-08'], dtype='period[M]')
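The cells above call `to_period` on a `DatetimeIndex`. The `Series.dt.to_period` accessor named in the heading does the same for a column of datetimes. A minimal sketch with an illustrative Series:

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2000-03-31", "2000-05-31", "2000-08-31"]))

# The .dt accessor exposes datetime methods on a Series of datetimes.
months = stamps.dt.to_period("M")
print(months.astype(str).tolist())  # ['2000-03', '2000-05', '2000-08']
```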

### Pandas functions for regular date sequences

pd.date_range() for timestamps

pd.period_range() for periods

pd.timedelta_range() for time deltas


### pandas.date_range
`pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, inclusive='both', *, unit=None, **kwargs)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.date_range.html

Frequency List: https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases

<div>
<img src="https://imgconvert.csdnimg.cn/aHR0cHM6Ly91cGxvYWQtaW1hZ2VzLmppYW5zaHUuaW8vdXBsb2FkX2ltYWdlcy8yMjMyODc1NS03MjlkOWY5NTUxZDhiZWQ3LnBuZw?x-oss-process=image/format,png" width="700"/>
</div>

Specify start and end, with the default daily frequency

In [120]:
pd.date_range(start='1/1/2018', end='1/08/2018')

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', freq='D')

Specify start and periods, the number of timestamps to generate (days, at the default frequency)

In [121]:
pd.date_range(start='1/1/2018', periods=8)

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', freq='D')

In [125]:
pd.date_range(start='1/1/2018', periods=5, freq='M')

DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31'],
              dtype='datetime64[ns]', freq='M')
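The signature above also includes `inclusive`, which controls whether the endpoints are kept. A short sketch, assuming a pandas version that has this parameter (it appears in the signature quoted above):

```python
import pandas as pd

# Default: both endpoints are included.
both = pd.date_range(start="2018-01-01", end="2018-01-04")

# inclusive="left" keeps the start but drops the end.
left = pd.date_range(start="2018-01-01", end="2018-01-04", inclusive="left")

print(len(both), len(left))  # 4 3
print(left[-1])              # 2018-01-03 00:00:00
```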

### pandas.period_range
`pandas.period_range(start=None, end=None, periods=None, freq=None, name=None)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.period_range.html

In [126]:
pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')

PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04', '2017-05', '2017-06',
             '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12',
             '2018-01'],
            dtype='period[M]')

In [127]:
pd.period_range(start=pd.Period('2017Q1', freq='Q'),
                end=pd.Period('2017Q2', freq='Q'), freq='M')

PeriodIndex(['2017-03', '2017-04', '2017-05', '2017-06'], dtype='period[M]')

### pandas.timedelta_range
`pandas.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None, *, unit=None)`

Reference: https://pandas.pydata.org/docs/reference/api/pandas.timedelta_range.html

Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix:

Q-JAN, BQ-FEB, QS-MAR, BQS-APR, etc.

A-JAN, BA-FEB, AS-MAR, BAS-APR, etc.

In the same way, the split-point of the weekly frequency can be modified by adding a three-letter weekday code:

W-SUN, W-MON, W-TUE, W-WED, etc.

On top of this, codes can be combined with numbers to specify other frequencies.

All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the `pd.tseries.offsets` module.
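The frequency codes described above can be tried out directly. A minimal sketch showing a weekday-anchored weekly code, a code combined with numbers, and the offset object behind a code; the start dates are arbitrary choices for illustration:

```python
import pandas as pd

# Weekly frequency anchored on Mondays (2020-07-01 was a Wednesday).
mondays = pd.date_range("2020-07-01", periods=3, freq="W-MON")
print(mondays.strftime("%Y-%m-%d").tolist())
# ['2020-07-06', '2020-07-13', '2020-07-20']

# Codes combined with numbers: steps of 2 hours 30 minutes.
steps = pd.timedelta_range(start="0 days", periods=4, freq="2h30min")
print(steps[1])  # 0 days 02:30:00

# The same weekly rule expressed as an offset object (weekday=0 is Monday).
same_mondays = pd.date_range(
    "2020-07-01", periods=3, freq=pd.tseries.offsets.Week(weekday=0)
)
print(mondays.equals(same_mondays))  # True
```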

In [131]:
pd.timedelta_range(start='1 day', periods=4)

TimedeltaIndex(['1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq='D')

The closed parameter specifies which endpoint is included. The default behavior is to include both endpoints.

In [130]:
pd.timedelta_range(start='1 day', periods=4, closed='right')

TimedeltaIndex(['2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq='D')

The freq parameter specifies the frequency of the TimedeltaIndex. Only fixed frequencies can be passed; non-fixed frequencies such as ‘M’ (month end) will raise an error.

In [129]:
pd.timedelta_range(start='1 day', end='2 days', freq='6h')

TimedeltaIndex(['1 days 00:00:00', '1 days 06:00:00', '1 days 12:00:00',
                '1 days 18:00:00', '2 days 00:00:00'],
               dtype='timedelta64[ns]', freq='6H')
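The restriction to fixed frequencies can be checked with a small sketch. It passes a month-end offset object instead of the `'M'` string and assumes the failure surfaces as a `ValueError` or `TypeError`; the exact exception type and message may differ between pandas versions:

```python
import pandas as pd

# A fixed frequency works: five points from 1 day to 2 days in 6-hour steps.
quarter_days = pd.timedelta_range(start="1 day", end="2 days", freq="6h")
print(len(quarter_days))  # 5

# A month end has no fixed length, so it cannot drive a timedelta range.
month_end = pd.offsets.MonthEnd()
try:
    pd.timedelta_range(start="1 day", periods=3, freq=month_end)
    rejected = False
except (ValueError, TypeError) as exc:  # assumed exception types
    rejected = True
    print(type(exc).__name__, exc)

print(rejected)
```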