(**You can also open this notebook in Google Colab**)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/data-programming-with-python/blob/main/2023-fall/2023-09-19/notebook/code_demo.ipynb)

# Python basics - additional topics

## Class and Objects in Python

In general
- A `class` is a blueprint for declaring and creating objects
- An `object` is a class instance that allows programmers to use variables and methods from inside the class
- A class defines a set of `attributes` (`<--> properties`) and `methods` (`<--> functions`) that the objects of that class will have.

### Create a class in Python

In [15]:
class table:
    def __init__(self, l, w, h):
        self.l = l
        self.w = w
        self.h = h
        self.has_a_flat_top = True
    
    def hold_weight(self, weight):
        print(f'Holding a weight of {weight} kg')

    def self_introduction(self):
        print(f'''
        I have a width of {self.w}
        I have a length of {self.l}
        I have a height of {self.h}
        ''')

### Create an object out of a class

In [16]:
t1 = table(l=3, w=1, h=1)

In [17]:
type(t1)

__main__.table

In [18]:
int1 = int(1.2)
type(int1)

int

### Access the attributes and methods of an object
You can access the `attributes` and `methods` of an object of class `table` using the following pattern (here the object is named `table_1`):
```python
table_1.l
table_1.w
table_1.h
table_1.has_a_flat_top
table_1.hold_weight(weight=10)
```

In [19]:
t1.hold_weight(weight=2)

Holding a weight of 2 kg


In [20]:
t1.l, t1.w, t1.h

(3, 1, 1)

In [21]:
t1.has_a_flat_top

True

In [22]:
t1.self_introduction()


        I have a width of 1
        I have a length of 3
        I have a height of 1
        


### Class inheritance
In Python, `class inheritance` is a mechanism by which a new class can be created from an existing class, inheriting its attributes and methods. The new class is called a `subclass` or `derived class`, while the existing class is called the `superclass` or `base class`.

To create a subclass in Python, you can define a new class that inherits from the superclass using the syntax `class Subclass(Superclass)`

```mermaid
flowchart TD
    animal-->dog
    animal-->cat
```

```python
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        pass

class Dog(Animal):
    pass

class Cat(Animal):
    pass

```

In [33]:
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        pass

class Dog(Animal):
    def bark(self):
        print('barking')

    def speak(self):
        print('I can speak')

class Cat(Animal):
    def meow(self):
        print('meow')

    def speak(self):
        print('meow')

In [34]:
d1 = Dog('dog')
c1 = Cat('cat')

In [35]:
d1.name

'dog'

In [36]:
c1.name

'cat'

In [37]:
d1.bark()

barking


In [38]:
c1.meow()

meow


In [39]:
d1.speak()

I can speak


In [40]:
c1.speak()

meow
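
A subclass can also extend, rather than fully replace, the behaviour it inherits by calling `super()`. The sketch below is only an illustration: the `Puppy` class and its `age` attribute are hypothetical names that do not appear in the notebook, and the methods return strings instead of printing so the result is easy to inspect.

```python
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} makes a sound"


class Dog(Animal):
    def speak(self):
        # Override the method inherited from Animal
        return f"{self.name} barks"


class Puppy(Dog):  # hypothetical subclass, for illustration only
    def __init__(self, name, age):
        super().__init__(name)  # reuse Animal.__init__ to set self.name
        self.age = age

    def speak(self):
        # Extend Dog.speak instead of replacing it
        return super().speak() + " softly"


p = Puppy("Rex", age=1)
print(p.speak())
print(isinstance(p, Animal))  # a Puppy is also a Dog and an Animal
```

`isinstance` follows the whole inheritance chain, so an instance of a subclass can be used wherever an instance of its superclass is expected.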


## Files and `I/O`
* Major tool/function: `open(file, mode='r')` (https://docs.python.org/3/library/functions.html#open)
* The default mode is 'r' (open for reading text, synonym of 'rt'). The available modes:

    | Character | Meaning                                                         |
    |-----------|-----------------------------------------------------------------|
    | 'r'       | open for reading (default)                                      |
    | 'w'       | open for writing, truncating the file first                     |
    | 'a'       | open for writing, appending to the end of the file if it exists |
    | 'b'       | binary mode                                                     |
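
The modes in the table can be compared side by side. The following is a minimal sketch that writes to a temporary directory so no real files are touched; the file name `demo.txt` is a hypothetical choice.

```python
import os
import tempfile

# Work in a temporary directory that is removed automatically afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    with open(path, "w") as f:   # 'w' creates the file (or truncates it)
        f.write("first\n")

    with open(path, "w") as f:   # opening with 'w' again discards "first"
        f.write("second\n")

    with open(path, "a") as f:   # 'a' keeps existing content and appends
        f.write("third\n")

    with open(path, "r") as f:   # 'r' reads text and returns str
        text_content = f.read()

    with open(path, "rb") as f:  # adding 'b' returns bytes instead of str
        raw_content = f.read()

print(repr(text_content))
print(type(raw_content))
```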

**read**

In [42]:
## Read from a file
var = 'test-read.txt'
var2 = '/Users/xiangshiyin/Documents/Teaching/data-programming-with-python/2023-fall/2023-09-19/notebook/' + var
print(f'The full path is {var2}')
fr = open(var2,'r') # create one file handle
lines = fr.readlines() # read all into a list
fr.close()

The full path is /Users/xiangshiyin/Documents/Teaching/data-programming-with-python/2023-fall/2023-09-19/notebook/test-read.txt


In [43]:
lines

['this is the 1st line\n', 'this is the 2nd line\n', 'this is the 3rd line']

In [44]:
## Another convenient way to automatically handle file handle closure

with open('test-read.txt','r') as fr:
    for line in fr.readlines():
        print(line)

this is the 1st line

this is the 2nd line

this is the 3rd line


In [45]:
with open('test-read.txt','r') as fr:
    for line in fr:
        print(line)

this is the 1st line

this is the 2nd line

this is the 3rd line


**write**

In [46]:
## open file in 'w' mode
fw = open('test-write-1.txt','w')
fw.write('this is a test')
fw.close()

In [47]:
with open('test-write-1.txt','r') as fr:
    for line in fr:
        print(line)

this is a test


In [48]:
## Write new content to a file
with open('test-write-2.txt','w') as fw:
    for i in range(6,11):
        fw.write(f'this is line {i}\n')

In [50]:
## Trying to append with 'w' mode does NOT work: the existing content is truncated first
with open('test-write-2.txt','w') as fw:
    for i in range(4,7):
        fw.write(f'this is line {i}\n')

**append** - the correct way

In [52]:
## Append to an existing file with 'a' mode
with open('test-write-2.txt','a') as fa:
    fa.write('this is a new line\n')
    for i in range(7,10):
        fa.write(f'this is line {i}\n')

## Library import in depth
### A simple Python package
Assume we have a package with the following file layout
```md
└── sample_package
    ├── sample.py
    └── subpackage
        └── subsample.py
```
The content of `sample.py`
```python
x = 123
y = 234

def hello():
    print('Hello World')
```

The content of `subsample.py`
```python
xx = 1
yy = 2
```

### Things might be more complicated
![](../pics/library_tree.png)

***You could***
* `import` the whole library, by `import a`
* `import` a module (python script), by `import a.aa`
* `import` an object (variable, function, class, etc.) in a module, by `from a.aa import aaa`. Note that `import a.aa.aaa` only works when `aaa` is itself a module or package, not a variable, function, or class


**However**, after a plain `import <object>` statement you must keep using the full dotted `<object>` name in your program to reference what you imported. **Sometimes, this can be quite inconvenient** because the `<object>` string can be pretty long due to the complicated file structure of the Python library.

**There are two ways** to solve the problem:
* `from a import aa` (use the `from` statement to reference the complicated folder relationships)
* `import a.aa as aa` (create an alias)
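
These import styles can be tried with a package from the standard library. In the sketch below `os.path` is used only as a stand-in for `a.aa`, and `join` plays the role of an object inside that module; the alias `osp` is an arbitrary choice.

```python
import os.path                    # import a submodule; refer to it by its full name
import os.path as osp             # same module, bound to a shorter alias
from os import path               # same module, bound to the name `path`
from os.path import join          # a single function taken from the module

# All four names lead to the same function object
print(os.path.join is osp.join is path.join is join)

print(join("sample_package", "sample.py"))
```

Whichever form is used, Python loads the module only once; the different statements just bind different names to it.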

In [None]:
%%sh

tree sample_package

In [None]:
from sample_package.sample import hello
hello()

In [None]:
from sample_package.subpackage.subsample import xx

In [None]:
xx

# Numpy recap

## import `numpy`

In [53]:
import numpy as np

In [None]:
# import numpy

## Create numpy arrays

In [54]:
# create a numpy array out of a list
aList = [1,2,3,4]
aNumpyArray = np.array(aList)

In [55]:
aNumpyArray

array([1, 2, 3, 4])

In [56]:
type(aNumpyArray)

numpy.ndarray

In [57]:
# check the dimension of the numpy array
aNumpyArray.ndim

1

In [58]:
# the shape of a numpy array
aNumpyArray.shape

(4,)

In [59]:
len(aNumpyArray)

4

In [60]:
# get the absolute size/length of a vector
bList = [3,4]
bNumpyArray = np.array(bList)
np.linalg.norm(bNumpyArray)

5.0

In [61]:
# a 2D example
aNumpyArray = np.array([2, 0, 0, 2]).reshape(2,2)

In [62]:
aNumpyArray

array([[2, 0],
       [0, 2]])

In [63]:
aNumpyArray.ndim

2

In [64]:
aNumpyArray.shape

(2, 2)

In [65]:
## get the inverse of the 2D matrix
np.linalg.inv(aNumpyArray)

array([[0.5, 0. ],
       [0. , 0.5]])
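
One way to sanity-check an inverse is to multiply it back with the original matrix and compare the result with the identity. The sketch below also shows what can happen with a matrix that has no inverse; the `singular` matrix is a hypothetical example.

```python
import numpy as np

A = np.array([[2, 0],
              [0, 2]])
A_inv = np.linalg.inv(A)

# A matrix times its inverse should give the identity matrix
product = A @ A_inv
print(np.allclose(product, np.eye(2)))

# The inverse is only defined for square, non-singular matrices
singular = np.array([[1, 2],
                     [2, 4]])  # second row is a multiple of the first
try:
    np.linalg.inv(singular)
except np.linalg.LinAlgError as err:
    print("Cannot invert:", err)
```

`np.allclose` is used instead of `==` because floating-point results are rarely exact.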

## Operations on numpy arrays

In [66]:
aNumpyArray.T

array([[2, 0],
       [0, 2]])

In [67]:
a = np.array(aList).reshape(2,2)
b = np.eye(2)

In [68]:
a

array([[1, 2],
       [3, 4]])

In [69]:
b

array([[1., 0.],
       [0., 1.]])

In [70]:
a.dot(b)

array([[1., 2.],
       [3., 4.]])
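
Multiplying by the identity matrix hides the difference between matrix multiplication and element-wise multiplication. A small sketch with a non-identity matrix `c` (a hypothetical example) makes the distinction visible:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
c = np.array([[0, 1],
              [1, 0]])

matrix_product = a @ c        # same result as a.dot(c)
elementwise_product = a * c   # multiplies matching entries only

print(matrix_product)
print(elementwise_product)
```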

## Generate random numbers with `numpy`

In [71]:
np.random.rand(3)

array([0.6466486 , 0.97280517, 0.13422135])

In [72]:
np.random.randn(2,2)

array([[0.60758591, 1.10146034],
       [0.84684232, 0.31525558]])

In [73]:
np.random.randint(low=0, high=10, size=100)

array([1, 8, 6, 1, 2, 5, 9, 0, 3, 7, 5, 7, 4, 6, 9, 6, 1, 2, 5, 9, 5, 9,
       3, 8, 3, 7, 5, 1, 0, 4, 5, 0, 9, 6, 9, 7, 2, 3, 7, 0, 1, 9, 8, 1,
       0, 9, 7, 0, 2, 5, 2, 6, 8, 5, 7, 2, 8, 9, 8, 4, 9, 9, 2, 7, 8, 5,
       2, 3, 7, 4, 2, 0, 8, 5, 1, 0, 2, 0, 8, 9, 0, 3, 6, 3, 3, 0, 0, 8,
       6, 2, 8, 3, 3, 4, 8, 8, 1, 9, 1, 2])
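
The numbers above change on every run. When reproducible results are needed, one option is NumPy's `Generator` interface created with `np.random.default_rng`; this is a sketch, and the seed value `42` is an arbitrary assumption.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

sample_a = rng_a.integers(low=0, high=10, size=5)  # integers in [0, 10)
sample_b = rng_b.integers(low=0, high=10, size=5)

print(np.array_equal(sample_a, sample_b))

uniform = rng_a.random(3)                 # uniform on [0, 1), like np.random.rand
normal = rng_a.standard_normal((2, 2))    # standard normal, like np.random.randn
print(uniform.shape, normal.shape)
```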

## Example

### `axis` in numpy array

### `np.sum()` - [[*official doc*](https://numpy.org/doc/stable/reference/generated/numpy.sum.html#numpy.sum), [*how does `axis` work in numpy*](https://stackoverflow.com/questions/22320534/how-does-the-axis-parameter-from-numpy-work)]

#### For 1D array (or vector)

In [74]:
vector = np.array([1,2,3])
np.sum(vector)

6

In [75]:
vector.sum()

6

#### For 2D array (or matrix)
- `axis = 0` is equivalent to $\sum_{i}{A_{ij}}$
- `axis = 1` is equivalent to $\sum_{j}{A_{ij}}$
- `axis = None` (default) is equivalent to $\sum_{i,j}{A_{ij}}$

In [76]:
matrix = np.array([
    [1,2,3],
    [4,5,6]
])

In [77]:
matrix

array([[1, 2, 3],
       [4, 5, 6]])

In [79]:
matrix.sum(axis=0)

array([5, 7, 9])

In [80]:
matrix.sum(axis=1)

array([ 6, 15])

In [78]:
matrix.sum()

21
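
To make the index notation concrete, the sums over `i` and `j` can be written out as plain Python loops and compared with `matrix.sum(axis=...)`. This is only a sketch for checking the meaning of `axis`, not a recommended way to sum arrays.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
n_rows, n_cols = matrix.shape

# axis=0: for each column j, add up the entries over all rows i
column_sums = [int(sum(matrix[i, j] for i in range(n_rows))) for j in range(n_cols)]

# axis=1: for each row i, add up the entries over all columns j
row_sums = [int(sum(matrix[i, j] for j in range(n_cols))) for i in range(n_rows)]

print(column_sums, matrix.sum(axis=0))
print(row_sums, matrix.sum(axis=1))
```

In both cases the axis named in `axis=` is the one that disappears from the shape of the result.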

### The `length` - the $L_2$ norm
#### For 1D vector

In [81]:
vector = np.array([3,4])
np.linalg.norm(vector)

5.0

#### For 2D array
- `axis = 0` is equivalent to $\sqrt{\sum_{i}|A_{ij}|^2}$
- `axis = 1` is equivalent to $\sqrt{\sum_{j}|A_{ij}|^2}$
- `axis = None` (default) is equivalent to $\sqrt{\sum_{i,j}|A_{ij}|^2}$

In [82]:
# Calculating the norm of a 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Calculate the L2 (Euclidean) norm along different axes
l2_norm_axis_0 = np.linalg.norm(matrix, axis=0)  # Calculate along columns (axis=0)
l2_norm_axis_1 = np.linalg.norm(matrix, axis=1)  # Calculate along rows (axis=1)
l2_norm_axis_none = np.linalg.norm(matrix) # Calculate on the flattened array

print("L2 Norm along columns (axis=0):", l2_norm_axis_0)
print("L2 Norm along rows (axis=1):", l2_norm_axis_1)
print("L2 Norm of the whole array (axis=None):", l2_norm_axis_none)


L2 Norm along columns (axis=0): [4.12310563 5.38516481 6.70820393]
L2 Norm along rows (axis=1): [3.74165739 8.77496439]
L2 Norm of the whole array (axis=None): 9.539392014169456


#### For 3D array
- `axis = 0` is equivalent to $\sqrt{\sum_{i}|A_{ijk}|^2}$
- `axis = 1` is equivalent to $\sqrt{\sum_{j}|A_{ijk}|^2}$
- `axis = 2` is equivalent to $\sqrt{\sum_{k}|A_{ijk}|^2}$
- `axis = None` (default) is equivalent to $\sqrt{\sum_{i,j,k}|A_{ijk}|^2}$

In [83]:
import numpy as np

# Create a 3D array (3x4x2)
array_3d = np.array([
    [
        [1, 2],
        [3, 4],
        [5, 6],
        [7, 8]
    ],
    [
        [9, 10],
        [11, 12],
        [13, 14],
        [15, 16]
    ],
    [
        [17, 18],
        [19, 20],
        [21, 22],
        [23, 24]
    ]
])

print(f"The shape of the 3D array: {array_3d.shape}")

# Calculate the L2 norm along different axes
l2_norm_axis_0 = np.linalg.norm(array_3d, axis=0)  # Calculate along the first dimension (axis=0)
l2_norm_axis_1 = np.linalg.norm(array_3d, axis=1)  # Calculate along the second dimension (axis=1)
l2_norm_axis_2 = np.linalg.norm(array_3d, axis=2)  # Calculate along the third dimension (axis=2)

print("L2 Norm along the first dimension (axis=0):")
print(l2_norm_axis_0)

print("\nL2 Norm along the second dimension (axis=1):")
print(l2_norm_axis_1)

print("\nL2 Norm along the third dimension (axis=2):")
print(l2_norm_axis_2)





The shape of the 3D array: (3, 4, 2)
L2 Norm along the first dimension (axis=0):
[[19.26136028 20.68816087]
 [22.15851981 23.66431913]
 [25.19920634 26.75817632]
 [28.33725463 29.93325909]]

L2 Norm along the second dimension (axis=1):
[[ 9.16515139 10.95445115]
 [24.41311123 26.38181192]
 [40.24922359 42.23742416]]

L2 Norm along the third dimension (axis=2):
[[ 2.23606798  5.          7.81024968 10.63014581]
 [13.45362405 16.2788206  19.10497317 21.9317122 ]
 [24.75883681 27.58622845 30.41381265 33.24154028]]
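
The formulas for the 3D case can be checked directly: square every entry, sum along the chosen axis, and take the square root. The sketch below rebuilds the same 3x4x2 array with `np.arange` (an equivalent shortcut, not the construction used in the cell above) and compares both approaches.

```python
import numpy as np

array_3d = np.arange(1, 25).reshape(3, 4, 2)  # values 1..24, shape (3, 4, 2)

for axis in (0, 1, 2):
    by_norm = np.linalg.norm(array_3d, axis=axis)
    # Square every entry, sum along the chosen axis, then take the square root
    by_hand = np.sqrt((array_3d ** 2).sum(axis=axis))
    print(axis, by_norm.shape, np.allclose(by_norm, by_hand))
```

The reduced axis is removed from the result, so the shapes are `(4, 2)`, `(3, 2)` and `(3, 4)` for `axis=0`, `1` and `2` respectively.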


### Nearest neighbor search

Euclidean distance between 2 points $(x_1,y_1,z_1)$ and $(x_2,y_2,z_2)$ is:
$$\sqrt{(x_2-x_1)^2+(y_2-y_1)^2+(z_2-z_1)^2}$$

In [84]:
### Pure iterative Python ###
points = [[9,2,8],[4,7,2],[3,4,4],[5,6,9],[5,0,7],[8,2,7],[0,3,2],[7,3,0],[6,1,1],[2,9,6]]
target = [4,5,3]

shortest_distance = 10 ** 10
nearest_neighbor = []
for point in points:
    x,y,z = point
    x0,y0,z0 = target
    # Euclidean distance between the current point and the target
    d = ((x-x0)**2 + (y-y0)**2 + (z-z0)**2) ** 0.5
    # if this distance is the smallest so far, remember it and the point
    if d <= shortest_distance:
        shortest_distance = d
        nearest_neighbor = [x,y,z]

# after the loop, report the shortest distance and the data point closest to the target

print(f'The shortest distance to the target is {shortest_distance}')
print(f'The nearest neighbor is {nearest_neighbor}')

The shortest distance to the target is 1.7320508075688772
The nearest neighbor is [3, 4, 4]


In [87]:
# # # Equivalent NumPy vectorization # # #
import numpy as np
points = np.array([[9,2,8],[4,7,2],[3,4,4],[5,6,9],[5,0,7],[8,2,7],[0,3,2],[7,3,0],[6,1,1],[2,9,6]])
# points.shape
target = np.array([4,5,3]).reshape(1,3)
distances = np.linalg.norm(points-target,axis=1)  # compute all Euclidean distances at once
minIdx = np.argmin(distances)  # index of the smallest distance
print(f'The shortest distance to the target is {distances[minIdx]}')
print(f'The nearest neighbor is {points[minIdx]}')

The shortest distance to the target is 1.7320508075688772
The nearest neighbor is [3 4 4]


In [86]:
distances

array([7.68114575, 2.23606798, 1.73205081, 6.164414  , 6.4807407 ,
       6.40312424, 4.58257569, 4.69041576, 4.89897949, 5.38516481])
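
The vectorized version relies on broadcasting: subtracting a single point from an array of points repeats the subtraction for every row. The sketch below spells out the intermediate shapes on three of the points and, as a hypothetical extension, uses `np.argsort` to return the `k` closest points instead of only one.

```python
import numpy as np

points = np.array([[9, 2, 8],
                   [4, 7, 2],
                   [3, 4, 4]])      # shape (3, 3): three points
target = np.array([4, 5, 3])        # shape (3,): one point

# Broadcasting repeats `target` for every row of `points`
differences = points - target       # shape (3, 3)
distances = np.sqrt((differences ** 2).sum(axis=1))  # one distance per row

k = 2  # hypothetical choice: return the two closest points
nearest_indices = np.argsort(distances)[:k]
print(nearest_indices)
print(points[nearest_indices])
```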

# Pandas

* `pandas` is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
* It is included in the installation of the Anaconda distribution
* When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean and process your data. In pandas, a data table is called a `DataFrame`.

<img align="center" src="../pics/dataframe-structure.png" style="height:300px;">


## Import the core libraries

In [88]:
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

## Important data structures - `Series` and `DataFrame`

### `Series`
A `Series` is a one-dimensional `array-like` object containing a sequence of values of the same type (the types are similar to NumPy types) and an associated array of data labels, called its `index`.

In [89]:
x = pd.Series([1,2,3,4])
x

0    1
1    2
2    3
3    4
dtype: int64

In [90]:
# the array part
x.array

<PandasArray>
[1, 2, 3, 4]
Length: 4, dtype: int64

In [91]:
# the index part
x.index

RangeIndex(start=0, stop=4, step=1)

In [92]:
y = pd.Series([1,3,5,7,9],index=['a','b','c','d','e'])
y

a    1
b    3
c    5
d    7
e    9
dtype: int64

In [93]:
type(y)

pandas.core.series.Series

In [94]:
y['a']

1

In [95]:
# mutable
y['c'] = 11

In [96]:
y

a     1
b     3
c    11
d     7
e     9
dtype: int64

**Just like 1D numpy arrays ...**

In [97]:
y.ndim

1

In [98]:
y.shape

(5,)

A `Series` can also be converted to a dictionary

In [99]:
y.to_dict()

{'a': 1, 'b': 3, 'c': 11, 'd': 7, 'e': 9}
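
The conversion also works in the other direction: a dictionary can be passed to `pd.Series`, in which case its keys become the index. A minimal sketch, where the `prices` dictionary is hypothetical data:

```python
import pandas as pd

prices = {"a": 1, "b": 3, "c": 11}   # hypothetical data
s = pd.Series(prices)                # dictionary keys become the index

print(s.index.tolist())
print(s.to_dict() == prices)         # converting back recovers the dictionary

# Label-based and position-based access return the same element here
print(s["b"], s.iloc[1])
```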

### `DataFrame`

#### Create `dataframe` from raw data

In [100]:
import pandas as pd

In [101]:
# create df from a dictionary
x = {
    'A':[1,2,'a',4],
    'B':np.arange(5,9),
    'C':['abc','def','ghi','jkl']
}

df1 = pd.DataFrame(x)

In [102]:
df1

Unnamed: 0,A,B,C
0,1,5,abc
1,2,6,def
2,a,7,ghi
3,4,8,jkl


In [103]:
# create df from a list
y = [
    ['a','b','c'],
    ['d','e','f']
]

df2 = pd.DataFrame(y, columns=['col1','col2','col3'])
df2

Unnamed: 0,col1,col2,col3
0,a,b,c
1,d,e,f


In [104]:
type(y)

list

In [105]:
# create df with fancier settings
z = {
    'A': 1.,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
}
df3 = pd.DataFrame(z) 

In [106]:
df3

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


#### Create `dataframe` from text file

In [107]:
df = pd.read_csv('../data/imf-gdp-per-capita-2015.csv')
df

Unnamed: 0,Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,2015,Estimates Start After
0,Afghanistan,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",599.994,2013.0
1,Albania,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",3995.38,2010.0
2,Algeria,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",4318.14,2014.0
3,Angola,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",4100.32,2014.0
4,Antigua and Barbuda,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",14414.30,2011.0
...,...,...,...,...,...,...,...
184,Venezuela,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",7744.75,2010.0
185,Vietnam,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",2088.34,2012.0
186,Yemen,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1302.94,2008.0
187,Zambia,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1350.15,2010.0


In [108]:
df = pd.read_csv('../data/imf-gdp-per-capita-2015.csv',sep=',',header=0)

In [109]:
df.head(3)

Unnamed: 0,Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,2015,Estimates Start After
0,Afghanistan,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",599.994,2013.0
1,Albania,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",3995.38,2010.0
2,Algeria,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",4318.14,2014.0


**Just like in `numpy arrays`**

In [110]:
df.shape

(189, 7)

In [111]:
df.ndim

2

**Something more ...**

In [112]:
df.dtypes

Country                           object
Subject Descriptor                object
Units                             object
Scale                             object
Country/Series-specific Notes     object
2015                              object
Estimates Start After            float64
dtype: object

In [113]:
df.columns

Index(['Country', 'Subject Descriptor', 'Units', 'Scale',
       'Country/Series-specific Notes', '2015', 'Estimates Start After'],
      dtype='object')

In [114]:
list(df.columns)

['Country',
 'Subject Descriptor',
 'Units',
 'Scale',
 'Country/Series-specific Notes',
 '2015',
 'Estimates Start After']

In [116]:
# list(df.index)

In [117]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 7 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        189 non-null    object 
 1   Subject Descriptor             189 non-null    object 
 2   Units                          189 non-null    object 
 3   Scale                          189 non-null    object 
 4   Country/Series-specific Notes  188 non-null    object 
 5   2015                           187 non-null    object 
 6   Estimates Start After          188 non-null    float64
dtypes: float64(1), object(6)
memory usage: 10.5+ KB


**A little bit of reformatting ...**

In [118]:
df = pd.read_csv('../data/imf-gdp-per-capita-2015.csv',sep=',',header=0, thousands=',')
df.head(3)

Unnamed: 0,Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,2015,Estimates Start After
0,Afghanistan,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",599.994,2013.0
1,Albania,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",3995.38,2010.0
2,Algeria,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",4318.14,2014.0


In [119]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 7 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        189 non-null    object 
 1   Subject Descriptor             189 non-null    object 
 2   Units                          189 non-null    object 
 3   Scale                          189 non-null    object 
 4   Country/Series-specific Notes  188 non-null    object 
 5   2015                           187 non-null    float64
 6   Estimates Start After          188 non-null    float64
dtypes: float64(2), object(5)
memory usage: 10.5+ KB
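
The effect of `thousands=','` can be reproduced without the data file. The sketch below parses a small in-memory CSV (hypothetical rows modeled on the file above) with and without the option and compares the resulting column types.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the quoted value uses "," as a thousands separator
csv_text = 'Country,2015\nAfghanistan,599.994\nAntigua and Barbuda,"14,414.30"\n'

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(raw["2015"].dtype)      # not numeric: "14,414.30" cannot be parsed as a number
print(parsed["2015"].dtype)   # numeric once the separator is declared
print(parsed["2015"].tolist())
```

A single unparseable value is enough to make pandas keep the whole column as text, which is why the `2015` column was not numeric before the option was added.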


#### Create `dataframe` from excel spreadsheet

In [None]:
# pd.read_excel() # press shift + tab

In [120]:
## import from excel spreadsheet (need to have package `openpyxl` pre-installed)
df2 = pd.read_excel(io='../data/excel-test-file.xlsx', sheet_name='tab1', header=0)

df2.head(5)

Unnamed: 0,col1,col2,col3
0,1,a,a12
1,2,b,b23
2,3,c,c31


In [121]:
df3 = pd.read_excel(io='../data/excel-test-file.xlsx',sheet_name='tab2',header=0)
df3.head(3)

Unnamed: 0,col4,col5
0,d,4
1,e,5
2,f,6


## View `dataframe`

In [122]:
# create a dataframe from a numpy array, with columns labeled
df = pd.DataFrame(np.random.randn(6,4), columns = ['Ann', "Bob", "Charly", "Don"])
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


**df.head()**

In [123]:
df.head(2)

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572


In [124]:
df.head()

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079


**df.tail()**

In [125]:
df.tail(2)

Unnamed: 0,Ann,Bob,Charly,Don
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


In [126]:
df.tail()

Unnamed: 0,Ann,Bob,Charly,Don
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


**`dataframe` attributes**

In [127]:
type(df)

pandas.core.frame.DataFrame

In [128]:
list(df.columns)

['Ann', 'Bob', 'Charly', 'Don']

In [129]:
list(df.index)

[0, 1, 2, 3, 4, 5]

In [130]:
df.ndim

2

In [131]:
df.shape

(6, 4)

In [132]:
len(df)

6

In [133]:
df.dtypes

Ann       float64
Bob       float64
Charly    float64
Don       float64
dtype: object

In [134]:
df.values # convert df to numpy array

array([[-1.17894913,  0.05948412, -0.34705229, -1.89280965],
       [-1.8031279 , -1.7650486 ,  1.05518358,  0.16757236],
       [-0.04019974,  0.02018757,  0.46933455, -1.19158792],
       [-0.7969876 , -1.68401536,  0.1330125 , -1.31833578],
       [-1.44583619, -1.70577173, -0.55753979,  1.70407883],
       [ 0.97885456,  0.1037467 , -1.13470362, -0.64309923]])

In [135]:
df.values.shape

(6, 4)

In [136]:
# you can also do
df.to_numpy()

array([[-1.17894913,  0.05948412, -0.34705229, -1.89280965],
       [-1.8031279 , -1.7650486 ,  1.05518358,  0.16757236],
       [-0.04019974,  0.02018757,  0.46933455, -1.19158792],
       [-0.7969876 , -1.68401536,  0.1330125 , -1.31833578],
       [-1.44583619, -1.70577173, -0.55753979,  1.70407883],
       [ 0.97885456,  0.1037467 , -1.13470362, -0.64309923]])

In [138]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Ann     6 non-null      float64
 1   Bob     6 non-null      float64
 2   Charly  6 non-null      float64
 3   Don     6 non-null      float64
dtypes: float64(4)
memory usage: 320.0 bytes


**df.describe()**

In [139]:
df.describe() # generate descriptive stats on the data

Unnamed: 0,Ann,Bob,Charly,Don
count,6.0,6.0,6.0,6.0
mean,-0.714374,-0.82857,-0.063628,-0.52903
std,1.02612,0.975347,0.780505,1.296682
min,-1.803128,-1.765049,-1.134704,-1.89281
25%,-1.379114,-1.700333,-0.504918,-1.286649
50%,-0.987968,-0.831914,-0.10702,-0.917344
75%,-0.229397,0.04966,0.385254,-0.035096
max,0.978855,0.103747,1.055184,1.704079


**df.transpose()**

In [140]:
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


In [141]:
# transpose a dataframe

df.transpose()
# type(df.transpose())

Unnamed: 0,0,1,2,3,4,5
Ann,-1.178949,-1.803128,-0.0402,-0.796988,-1.445836,0.978855
Bob,0.059484,-1.765049,0.020188,-1.684015,-1.705772,0.103747
Charly,-0.347052,1.055184,0.469335,0.133013,-0.55754,-1.134704
Don,-1.89281,0.167572,-1.191588,-1.318336,1.704079,-0.643099


In [142]:
df.T # you can also do it this way

Unnamed: 0,0,1,2,3,4,5
Ann,-1.178949,-1.803128,-0.0402,-0.796988,-1.445836,0.978855
Bob,0.059484,-1.765049,0.020188,-1.684015,-1.705772,0.103747
Charly,-0.347052,1.055184,0.469335,0.133013,-0.55754,-1.134704
Don,-1.89281,0.167572,-1.191588,-1.318336,1.704079,-0.643099


**sort `dataframe`**

In [143]:
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


In [144]:
# sort_index(), by labels (index or column)
# df
df.sort_index(axis=0, ascending=False)

Unnamed: 0,Ann,Bob,Charly,Don
5,0.978855,0.103747,-1.134704,-0.643099
4,-1.445836,-1.705772,-0.55754,1.704079
3,-0.796988,-1.684015,0.133013,-1.318336
2,-0.0402,0.020188,0.469335,-1.191588
1,-1.803128,-1.765049,1.055184,0.167572
0,-1.178949,0.059484,-0.347052,-1.89281


In [145]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,Don,Charly,Bob,Ann
0,-1.89281,-0.347052,0.059484,-1.178949
1,0.167572,1.055184,-1.765049,-1.803128
2,-1.191588,0.469335,0.020188,-0.0402
3,-1.318336,0.133013,-1.684015,-0.796988
4,1.704079,-0.55754,-1.705772,-1.445836
5,-0.643099,-1.134704,0.103747,0.978855


In [146]:
# sort_values(), by values
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


In [147]:
df.sort_values(by='Ann', ascending=True)

Unnamed: 0,Ann,Bob,Charly,Don
1,-1.803128,-1.765049,1.055184,0.167572
4,-1.445836,-1.705772,-0.55754,1.704079
0,-1.178949,0.059484,-0.347052,-1.89281
3,-0.796988,-1.684015,0.133013,-1.318336
2,-0.0402,0.020188,0.469335,-1.191588
5,0.978855,0.103747,-1.134704,-0.643099


In [None]:
# df.sort_values(by='Ann', ascending=True, inplace=True)

In [149]:
df.sort_values(by=['Ann','Bob'], ascending=True)

Unnamed: 0,Ann,Bob,Charly,Don
1,-1.803128,-1.765049,1.055184,0.167572
4,-1.445836,-1.705772,-0.55754,1.704079
0,-1.178949,0.059484,-0.347052,-1.89281
3,-0.796988,-1.684015,0.133013,-1.318336
2,-0.0402,0.020188,0.469335,-1.191588
5,0.978855,0.103747,-1.134704,-0.643099
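
With random floats there are no ties in `Ann`, so the second sort key never has an effect in the output above. The sketch below uses hypothetical data with repeated values to show the tie-breaking, and passes a list to `ascending` to sort each key in a different direction.

```python
import pandas as pd

# Hypothetical data with repeated values in column 'Ann'
scores = pd.DataFrame({
    "Ann": [2, 1, 2, 1],
    "Bob": [10, 40, 30, 20],
})

# Sort by 'Ann' ascending; break ties by 'Bob' descending
ordered = scores.sort_values(by=["Ann", "Bob"], ascending=[True, False])
print(ordered)

# sort_values returns a new DataFrame; the original row order is unchanged
print(scores.index.tolist())
```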


## Select `dataframe`

Pandas documentation on selecting and indexing a `dataframe`:
* https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing
* https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced

### The different ways

| Type                  | Notes                                       |
|-----------------------|---------------------------------------------|
| `df[column]`          | Select by column labels                     |
| `df.loc[rows]`        | Select by row labels                        |
| `df.loc[:, cols]`     | Select by column labels                     |
| `df.loc[rows, cols]`  | Select by row and column labels             |
| `df.iloc[rows]`       | Select by row positional indices            |
| `df.iloc[:, cols]`    | Select by column positional indices         |
| `df.iloc[rows, cols]` | Select by row and column positional indices |
| `df.at[row, col]`     | Select an element by row and column labels  |
| `df.iat[row, col]`    | Select an element by row and column indices |
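
The rows of the table can be tried on a small frame. In this sketch the frame `df_demo` is hypothetical and uses string row labels on purpose, so that label-based and position-based selection are easy to tell apart.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ
df_demo = pd.DataFrame(
    {"Ann": [1.0, 2.0, 3.0], "Bob": [4.0, 5.0, 6.0]},
    index=["x", "y", "z"],
)

print(df_demo["Ann"])                    # column by label -> Series
print(df_demo.loc["y"])                  # row by label -> Series
print(df_demo.loc[["x", "y"], ["Bob"]])  # rows and columns by label -> DataFrame
print(df_demo.iloc[1])                   # row by position -> same row as loc["y"]
print(df_demo.iloc[0:2, 1])              # first two rows, second column
print(df_demo.at["z", "Bob"])            # single element by labels
print(df_demo.iat[2, 1])                 # single element by positions
```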

In [150]:
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


### Single column vs. multiple columns

In [151]:
df['Ann']

0   -1.178949
1   -1.803128
2   -0.040200
3   -0.796988
4   -1.445836
5    0.978855
Name: Ann, dtype: float64

In [153]:
type(df['Ann'])

pandas.core.series.Series

In [154]:
df.Ann

0   -1.178949
1   -1.803128
2   -0.040200
3   -0.796988
4   -1.445836
5    0.978855
Name: Ann, dtype: float64

In [155]:
type(df.Ann)

pandas.core.series.Series

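Selecting multiple columns yields a dataframe that contains a subset of the original columns. Whether the result shares memory with the original depends on pandas internals and version, so do not rely on it being a view: call `.copy()` explicitly if you plan to modify the selection independently.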

In [157]:
df[['Ann','Bob']]

Unnamed: 0,Ann,Bob
0,-1.178949,0.059484
1,-1.803128,-1.765049
2,-0.0402,0.020188
3,-0.796988,-1.684015
4,-1.445836,-1.705772
5,0.978855,0.103747


In [158]:
type(df[['Ann','Bob']])

pandas.core.frame.DataFrame

In [156]:
df[['Ann']]

Unnamed: 0,Ann
0,-1.178949
1,-1.803128
2,-0.0402
3,-0.796988
4,-1.445836
5,0.978855
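
If you intend to modify a column selection without affecting the original `DataFrame`, an explicit `.copy()` makes that intent unambiguous. A minimal sketch with hypothetical data:

```python
import pandas as pd

df_demo = pd.DataFrame({"Ann": [1, 2], "Bob": [3, 4], "Don": [5, 6]})  # hypothetical data

subset = df_demo[["Ann", "Bob"]].copy()  # explicit, independent copy
subset.loc[0, "Ann"] = 100               # modify the copy only

print(subset.loc[0, "Ann"])
print(df_demo.loc[0, "Ann"])             # the original value is untouched
```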


### Select by labels
* You can use the `.loc` indexer of a `dataframe` to select data by labels. The typical format is
```python
df.loc[row_indexer, column_indexer]
```
* More details can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing


In [159]:
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


In [160]:
df.index

RangeIndex(start=0, stop=6, step=1)

In [161]:
# by row label
df.loc[0]

Ann      -1.178949
Bob       0.059484
Charly   -0.347052
Don      -1.892810
Name: 0, dtype: float64

In [162]:
# by row and column label
df.loc[[0,1],['Ann','Bob']]

Unnamed: 0,Ann,Bob
0,-1.178949,0.059484
1,-1.803128,-1.765049


In [163]:
df.loc[0,['Ann','Bob']] # get a Series

Ann   -1.178949
Bob    0.059484
Name: 0, dtype: float64

In [164]:
x = [1,2,3,4,5]
x[0:2]

[1, 2]

In [165]:
df.loc[0:2,['Ann','Bob']] # label slices include both ends, so the row for `index=2` is also displayed

Unnamed: 0,Ann,Bob
0,-1.178949,0.059484
1,-1.803128,-1.765049
2,-0.0402,0.020188
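
The inclusive end point is easier to see when the row labels are not integers. The following is a minimal sketch with a hypothetical dataframe `demo` whose rows are labelled with letters, compared against ordinary list slicing.

```python
import pandas as pd

# Hypothetical dataframe with string row labels
labels = ["a", "b", "c", "d"]
demo = pd.DataFrame({"Ann": [1, 2, 3, 4]}, index=labels)

by_label = demo.loc["a":"c"]   # label slice: both "a" and "c" are included
plain_list = labels[0:2]       # list slice: the element at position 2 is excluded

print(list(by_label.index))    # ['a', 'b', 'c']
print(plain_list)              # ['a', 'b']
```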


In [166]:
# by column label only
df.loc[:,['Ann']] # note that you'll get a dataframe instead of a Series

Unnamed: 0,Ann
0,-1.178949
1,-1.803128
2,-0.0402
3,-0.796988
4,-1.445836
5,0.978855


In [171]:
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


In [167]:
# what if I just want to get the value of a particular cell?
df.loc[2,'Ann']

-0.040199735516232356

In [168]:
# you can also do
df.at[2,'Ann']

-0.040199735516232356

### Select by Position

* You can use the `.iloc` indexer of a `dataframe` to select data by integer position. The typical format is:
```python
df.iloc[row_position_indexer, column_position_indexer]
```
* More details can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing

In [169]:
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


In [170]:
# select by row position
df.iloc[0]

Ann      -1.178949
Bob       0.059484
Charly   -0.347052
Don      -1.892810
Name: 0, dtype: float64

In [172]:
# select by row position range
df.iloc[0:2] # note that the end of the range is excluded, which is different from df.loc

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572


In [173]:
# you can also do
df.iloc[0:2,]

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572


In [174]:
df.iloc[0:2,:]

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572


In [175]:
# select by column position range
df.iloc[:,0:2]

Unnamed: 0,Ann,Bob
0,-1.178949,0.059484
1,-1.803128,-1.765049
2,-0.0402,0.020188
3,-0.796988,-1.684015
4,-1.445836,-1.705772
5,0.978855,0.103747


In [176]:
# select by row and column position range
df.iloc[0:2,0:2]

Unnamed: 0,Ann,Bob
0,-1.178949,0.059484
1,-1.803128,-1.765049


In [177]:
# what if I just want to get the value of a particular cell?
df.iloc[0,0]

-1.1789491258520512

In [178]:
# you can also do
df.iat[0,0]

-1.1789491258520512
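
With the default `RangeIndex`, labels and positions happen to coincide, which can hide the difference between `.loc` and `.iloc`. The following is a minimal sketch with a hypothetical dataframe `demo` whose labels are 10, 20, 30.

```python
import pandas as pd

# Hypothetical dataframe whose row labels are not 0, 1, 2
demo = pd.DataFrame({"Ann": [1.5, 2.5, 3.5]}, index=[10, 20, 30])

first_by_position = demo.iloc[0, 0]   # first row, first column, whatever the labels are
row_by_label = demo.loc[10, "Ann"]    # row labelled 10, column labelled "Ann"

print(first_by_position, row_by_label)   # 1.5 1.5

# No row is labelled 0, so a label lookup for 0 fails
try:
    demo.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False
print(label_zero_found)   # False
```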

### Select by conditions

In [179]:
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


In [180]:
df[df.Ann>=0]

Unnamed: 0,Ann,Bob,Charly,Don
5,0.978855,0.103747,-1.134704,-0.643099


In [181]:
df.loc[df.Ann>=0,['Ann','Bob']]

Unnamed: 0,Ann,Bob
5,0.978855,0.103747


In [182]:
df.loc[(df.Ann>=-0.5)&(df.Ann<=1.4),['Ann','Bob']]

Unnamed: 0,Ann,Bob
2,-0.0402,0.020188
5,0.978855,0.103747
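
Conditions are combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. The following is a minimal sketch on a hypothetical dataframe `demo`; `isin` is shown as one way to test membership in a list of values.

```python
import pandas as pd

# Hypothetical data, only for illustration
demo = pd.DataFrame({"Ann": [-1.0, 0.5, 2.0, -0.2], "Bob": [1, 2, 3, 4]})

in_range = demo[(demo.Ann >= -0.5) & (demo.Ann <= 1.4)]   # both conditions hold
outside = demo[(demo.Ann < -0.5) | (demo.Ann > 1.4)]      # at least one condition holds
not_negative = demo[~(demo.Ann < 0)]                      # condition does not hold
picked = demo[demo.Bob.isin([1, 4])]                      # Bob is one of the listed values

print(list(in_range.index))       # [1, 3]
print(list(outside.index))        # [0, 2]
print(list(not_negative.index))   # [1, 2]
print(list(picked.index))         # [0, 3]
```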


## Set/change values - "mutable"

In [183]:
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-1.178949,0.059484,-0.347052,-1.89281
1,-1.803128,-1.765049,1.055184,0.167572
2,-0.0402,0.020188,0.469335,-1.191588
3,-0.796988,-1.684015,0.133013,-1.318336
4,-1.445836,-1.705772,-0.55754,1.704079
5,0.978855,0.103747,-1.134704,-0.643099


In [184]:
# add a new column
df['E'] = 5
df

Unnamed: 0,Ann,Bob,Charly,Don,E
0,-1.178949,0.059484,-0.347052,-1.89281,5
1,-1.803128,-1.765049,1.055184,0.167572,5
2,-0.0402,0.020188,0.469335,-1.191588,5
3,-0.796988,-1.684015,0.133013,-1.318336,5
4,-1.445836,-1.705772,-0.55754,1.704079,5
5,0.978855,0.103747,-1.134704,-0.643099,5


In [185]:
df['F'] = np.arange(6)
df

Unnamed: 0,Ann,Bob,Charly,Don,E,F
0,-1.178949,0.059484,-0.347052,-1.89281,5,0
1,-1.803128,-1.765049,1.055184,0.167572,5,1
2,-0.0402,0.020188,0.469335,-1.191588,5,2
3,-0.796988,-1.684015,0.133013,-1.318336,5,3
4,-1.445836,-1.705772,-0.55754,1.704079,5,4
5,0.978855,0.103747,-1.134704,-0.643099,5,5


In [186]:
df['G'] = [5 if i%2==0 else 6 for i in range(6)]
df

Unnamed: 0,Ann,Bob,Charly,Don,E,F,G
0,-1.178949,0.059484,-0.347052,-1.89281,5,0,5
1,-1.803128,-1.765049,1.055184,0.167572,5,1,6
2,-0.0402,0.020188,0.469335,-1.191588,5,2,5
3,-0.796988,-1.684015,0.133013,-1.318336,5,3,6
4,-1.445836,-1.705772,-0.55754,1.704079,5,4,5
5,0.978855,0.103747,-1.134704,-0.643099,5,5,6


In [187]:
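# set values by labels
# note: the label '2020-08-25' is not in the index, so `.loc` appends a new row;
# the other columns of that row are NaN and the integer columns become float
df.loc['2020-08-25','E'] = 3
# df.at['2020-08-25','E'] = 3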
df

Unnamed: 0,Ann,Bob,Charly,Don,E,F,G
0,-1.178949,0.059484,-0.347052,-1.89281,5.0,0.0,5.0
1,-1.803128,-1.765049,1.055184,0.167572,5.0,1.0,6.0
2,-0.0402,0.020188,0.469335,-1.191588,5.0,2.0,5.0
3,-0.796988,-1.684015,0.133013,-1.318336,5.0,3.0,6.0
4,-1.445836,-1.705772,-0.55754,1.704079,5.0,4.0,5.0
5,0.978855,0.103747,-1.134704,-0.643099,5.0,5.0,6.0
2020-08-25,,,,,3.0,,
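
The extra row in the output above appears because assigning through `.loc` with a row label that does not exist enlarges the dataframe. The following is a minimal sketch of that behavior on a hypothetical dataframe `demo`; the label `"new"` is made up for the example.

```python
import pandas as pd

# Hypothetical dataframe with two integer columns and labels 0, 1, 2
demo = pd.DataFrame({"E": [5, 5, 5], "F": [0, 1, 2]})

demo.loc[1, "E"] = 3        # label 1 exists: the existing cell is overwritten
rows_before = len(demo)

demo.loc["new", "E"] = 3    # label "new" does not exist: a row is appended
rows_after = len(demo)

print(rows_before, rows_after)   # 3 4
print(list(demo.index))          # [0, 1, 2, 'new']
print(demo.loc["new", "F"])      # nan, the untouched column has no value in the new row
print(demo["F"].dtype)           # a float dtype, since integer columns cannot hold NaN
```

To change an existing cell, use a label that is already in the index, as in the first assignment.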


In [None]:
# set values by position (row 0, column 5, which is column `F` here)
df.iloc[0,5] = -1
df

In [None]:
# set values by condition
df.loc[df.Ann>0,'E'] = 4
df

### `Reindex`
Create a new object with the values rearranged to align with the new index. Labels that are not present in the original index get missing values (`NaN`).

#### On `series`

In [None]:
x = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

In [None]:
x

In [None]:
y = x.reindex(["a", "b", "c", "d", "e"])
y
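
The following is a minimal sketch of what `reindex` does with a label that is missing from the original index, and of the `fill_value` argument as one way to supply a default instead of `NaN`.

```python
import pandas as pd

x = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

y = x.reindex(["a", "b", "c", "d", "e"])                   # "e" is new -> NaN
z = x.reindex(["a", "b", "c", "d", "e"], fill_value=0.0)   # "e" is new -> 0.0

print(y["a"], y["e"])    # -5.3 nan
print(z["e"])            # 0.0
print(list(x.index))     # ['d', 'b', 'a', 'c'], the original is unchanged
```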

#### On `dataframe`

In [None]:
df = pd.DataFrame(
    np.arange(9).reshape(3,3),
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)

df

In [None]:
df2 = df.reindex(index=['a', 'b', 'c', 'd'])
df2

In [None]:
df3 = df.reindex(columns=['Texas', 'Utah', 'California'])
df3

## Missing values

`pandas` primarily uses the value `np.nan` to represent missing data. By default, missing values are excluded from computations. See the [Missing Data section](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data) of the official `pandas` documentation for more details.
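
The following is a minimal sketch of what "excluded from computations" means in practice, using a hypothetical Series `s` with one missing value.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.mean())               # 2.0, the NaN is skipped
print(s.mean(skipna=False))   # nan, the NaN propagates into the result
print(s.count())              # 2, count() only counts non-missing values
```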

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [None]:
dates = pd.date_range(start='2020-08-25', end='2020-10-01', freq='7D')
dates

In [None]:
# none of these dates are in the index of `df`, so every value in df1 starts out as NaN
df1 = df.reindex(index=dates[:6],columns=list(df.columns)+['G'])
df1

In [None]:
# fill in values at some locations
df1.loc['2020-08-25':'2020-09-08','G'] = 1
df1

In [None]:
# to get the boolean mask where values are nan
df1.isna()

In [None]:
# you can also do
pd.isna(df1)

In [None]:
# drop any rows that have missing values
df2 = df1.copy()
df2.dropna(how='any')

In [None]:
df2 # df2 is unchanged: dropna returns a new dataframe unless inplace=True is passed

In [None]:
# fill missing values
df1.fillna(value=-999)
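
The following is a minimal sketch of a few common variations of `dropna` and `fillna` on a hypothetical dataframe `demo`. Each call returns a new dataframe and leaves `demo` as it was.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

any_missing = demo.dropna(how="any")            # drop rows with at least one NaN
all_missing = demo.dropna(how="all")            # drop rows where every value is NaN
only_a = demo.dropna(subset=["a"])              # only consider column "a"
filled = demo.fillna({"a": 0.0, "b": -999.0})   # a different fill value per column

print(list(any_missing.index))   # [2]
print(list(all_missing.index))   # [0, 2]
print(list(only_a.index))        # [0, 2]
print(int(demo.isna().sum().sum()))   # 3, the original still has its missing values
```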

## Operations on `dataframe`

**Stats**

In [None]:
df.describe()

In [None]:
df

In [None]:
# df.mean()
list(df.mean())

In [None]:
df.mean()

In [None]:
df.mean().values

In [None]:
df.mean(axis=0)

In [None]:
df.mean(axis=1)
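
The `axis` argument decides the direction of the computation: `axis=0` gives one result per column and `axis=1` gives one result per row. The following is a minimal sketch on a small hypothetical dataframe `demo` where the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["x", "y", "z"])
#    x  y  z
# 0  0  1  2
# 1  3  4  5

col_means = demo.mean(axis=0)   # one value per column
row_means = demo.mean(axis=1)   # one value per row

print(col_means.tolist())   # [1.5, 2.5, 3.5]
print(row_means.tolist())   # [1.0, 4.0]
```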

**Histogram**

In [None]:
df

In [None]:
df['histcol'] = np.random.randint(0,3,size=3)
df

In [None]:
df.histcol.value_counts()

In [None]:
df.histcol.nunique()

In [None]:
df.histcol.unique()

In [None]:
# df.histcol.hist()
df.histcol.hist(density=True)
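
If you want the frequencies as numbers rather than as a plot, `value_counts` can also return relative frequencies. The following is a minimal sketch on a hypothetical Series `s`.

```python
import pandas as pd

s = pd.Series([0, 1, 1, 2, 1, 0])

counts = s.value_counts()                 # absolute frequency of each value
shares = s.value_counts(normalize=True)   # relative frequency, sums to 1

print(counts.sort_index().to_dict())   # {0: 2, 1: 3, 2: 1}
print(shares[1])                       # 0.5
```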

**Apply functions/logic to the data**

In [None]:
df

In [None]:
df.apply(np.cumsum) # apply the function on all columns

In [None]:
df.apply(lambda x: -x) # apply the function on all columns

In [None]:
df.California.map(lambda x: x+1) # apply the function on one single column
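
By default `apply` passes each column to the function; with `axis=1` it passes each row instead. The following is a minimal sketch on a hypothetical dataframe `demo` that has the same shape and column names as the one above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["Ohio", "Texas", "California"])

col_range = demo.apply(lambda col: col.max() - col.min())           # per column (axis=0)
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)   # per row
plus_one = demo["California"].map(lambda v: v + 1)                  # per element of one column

print(col_range.tolist())   # [6, 6, 6]
print(row_range.tolist())   # [2, 2, 2]
print(plus_one.tolist())    # [3, 6, 9]
```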

## `dataframe` and table operations

In [None]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['a','b','c','d'])
df

**Concat**

In [None]:
pieces = [df[:3], df[7:]]
print("pieces:\n", pieces)
print("put back together:\n")
# pd.concat(pieces, axis=1)
pd.concat(pieces, axis=0)
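
The following is a minimal sketch of how `axis` and `ignore_index` change the result of `pd.concat`, using two small hypothetical dataframes `top` and `bottom`.

```python
import pandas as pd

top = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
bottom = pd.DataFrame({"a": [5], "b": [6]})

stacked = pd.concat([top, bottom], axis=0)                          # rows appended, labels kept
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)    # rows appended, labels reset
side_by_side = pd.concat([top, bottom], axis=1)                     # columns appended, aligned on row labels

print(list(stacked.index))      # [0, 1, 0]
print(list(renumbered.index))   # [0, 1, 2]
print(side_by_side.shape)       # (2, 4), row 1 has NaN in the columns coming from `bottom`
```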

**Joins**

More details at https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
![](joins.jpg)

In [None]:
tb1 = pd.DataFrame({'key': ['foo', 'boo', 'foo'], 'lval': [1, 2, 3]})
tb2 = pd.DataFrame({'key': ['foo', 'coo'], 'rval': [5, 6]})

In [None]:
tb1

In [None]:
tb2

In [None]:
pd.merge(tb1, tb2, on='key', how='inner')

In [None]:
pd.merge(tb1, tb2, on='key', how='left')

In [None]:
pd.merge(tb1, tb2, on='key', how='right')

In [None]:
pd.merge(tb1, tb2, on='key', how='outer')
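
The four join types differ in which keys survive the merge. The following is a minimal sketch that counts the rows produced by each `how` option on the same two tables, and uses `indicator=True` to add a `_merge` column showing which table each row came from.

```python
import pandas as pd

tb1 = pd.DataFrame({"key": ["foo", "boo", "foo"], "lval": [1, 2, 3]})
tb2 = pd.DataFrame({"key": ["foo", "coo"], "rval": [5, 6]})

# Number of rows returned by each join type
sizes = {how: len(pd.merge(tb1, tb2, on="key", how=how))
         for how in ["inner", "left", "right", "outer"]}
print(sizes)   # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# The _merge column is one of: both, left_only, right_only
outer = pd.merge(tb1, tb2, on="key", how="outer", indicator=True)
origin = {key: str(flag) for key, flag in zip(outer["key"], outer["_merge"])}
print(origin)
```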

**Grouping**

By `group by` we are referring to a process involving one or more of the following steps:

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

See the Grouping section from the `pandas` official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
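
The three steps above can be written out by hand to see what `groupby` does in one call. The following is a minimal sketch on a hypothetical dataframe `demo`; in practice the one-line `groupby` form at the end is the one to use.

```python
import pandas as pd

# Hypothetical data, only for illustration
demo = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]})

# Split: one sub-dataframe per distinct value of A
groups = {key: part for key, part in demo.groupby("A")}
# Apply: compute the mean of C within each group
means = {key: part["C"].mean() for key, part in groups.items()}
# Combine: collect the per-group results into one Series
manual = pd.Series(means, name="C").sort_index()

# The same result in one line
direct = demo.groupby("A")["C"].mean()

print(manual.to_dict())   # {'bar': 3.0, 'foo': 2.0}
print(direct.to_dict())   # {'bar': 3.0, 'foo': 2.0}
```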

In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df

In [None]:
df.groupby('A')['C'].mean().reset_index() # simple stats grouped by 1 column

In [None]:
df.groupby(['A','B']).sum().reset_index() # simple stats grouped by multiple columns

In [None]:
df.groupby(['A','B']).mean().reset_index() # simple stats grouped by multiple columns

In [None]:
# df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x**2)).reset_index() # customized aggregation
df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x)).reset_index() # customized aggregation
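
Several aggregations can be computed in one pass with `agg` and named results. The following is a minimal sketch on a hypothetical dataframe `demo`; the output column names `c_mean`, `d_sum` and `c_sum_sq` are made up for the example.

```python
import pandas as pd

# Hypothetical data, only for illustration
demo = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

summary = demo.groupby("A").agg(
    c_mean=("C", "mean"),                        # built-in aggregation, referenced by name
    d_sum=("D", "sum"),
    c_sum_sq=("C", lambda s: (s ** 2).sum()),    # customized aggregation
).reset_index()

print(summary)
```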

## Write/Export `dataframe` to files

**CSV file**

In [None]:
df

In [189]:
df.to_csv('../data/to-csv-test.csv',sep=',',header=True)

**Excel spreadsheet**

In [190]:
df.to_excel('../data/to-excel-test.xlsx',sheet_name='tab1',header=True,index=False)
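
The following is a minimal sketch of a CSV round trip that writes to an in-memory buffer instead of a file, so it runs without the `../data` folder. The dataframe `demo` is hypothetical. Passing `index=False` leaves the row labels out of the output; without it, `to_csv` writes them as an extra first column.

```python
import io
import pandas as pd

# Hypothetical data, only for illustration
demo = pd.DataFrame({"A": ["foo", "bar"], "C": [1.5, 2.5]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)   # write CSV text into the buffer
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a dataframe
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```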