# Today's Coding Topics
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/data-programming-with-python/blob/main/2023-summmer/2023-06-21/notebook/concept_and_code_demo.ipynb)

* Recap of previous lecture
  * Numpy
  * Nearest neighbor search example review
  * Special topic: namespaces in Python
* `Pandas`


# Special topic

## Concepts: Scope and Namespace

### How do we find the Sherlock Holmes Museum? What if we only know the street name?
![](../pics/221b-baker-street-address.png)

**We need to find the right street name within a certain scope/domain**
* Different search scopes lead to different addresses, and therefore different search results
    * uk
    * uk, london
    * uk, london, 221b baker street
    
**The same logic applies to variable/object identification in Python programming!**
* We need the address (`NAMESPACE`) to find a place (`VARIABLE/OBJECT`)
* Different search scopes (`SCOPE`) lead to different addresses (`NAMESPACE`), and therefore different search results (`VARIABLE/OBJECT`)
* If we write the `namespace` in the `python` way, with dots separating the levels:
    * uk
    * uk.london
    * uk.london.221bBakerStreet
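
As a small illustration of the dotted `namespace` idea, here is a sketch using two modules that both define a function named `sqrt`. The choice of `math` and `numpy` is only an example; any two modules that share a name would do.

```python
import math
import numpy as np

# Both modules define a function called `sqrt`; the module name before the dot
# is the namespace that tells Python which one we mean.
root_math = math.sqrt(16)    # looks up `sqrt` inside the `math` namespace
root_numpy = np.sqrt(16)     # looks up `sqrt` inside the `numpy` namespace
print(root_math, root_numpy)
print(math.sqrt is np.sqrt)  # False: same name, two different objects

# A dotted path can go several levels deep, like uk.london.<street>
length = np.linalg.norm([3, 4])  # numpy -> linalg -> norm
print(length)
```

Just like the street address, the full dotted path is what makes the name unambiguous.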

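**Indented code blocks show up in several places in Python, but not every indented block is a new `SCOPE`!** Below are the places where we see code indentation:
* Function definition
* Loops (for/while)
* Conditional statement (if-else)

Of these, only a function definition creates a new `SCOPE`; loops and conditional statements run in the scope that contains them. For the purposes of this lecture, `SCOPE`s in Python code are defined by:
* Function definition
* (*SPECIAL*) The script file that the code belongs to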
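
A minimal sketch of the difference between indentation and scope. All names here (`found_in_loop`, `make_value`, `inside_function`) are made up for the illustration.

```python
for i in range(3):
    if i == 2:
        found_in_loop = i * 10   # assigned inside a for loop and an if block

# Neither the loop nor the if block created a new scope,
# so both names are still visible here.
print(i, found_in_loop)

def make_value():
    inside_function = 'only visible inside make_value()'
    return inside_function

make_value()
try:
    print(inside_function)       # this name lives in the function's own scope
    function_name_visible = True
except NameError as err:
    print('NameError:', err)
    function_name_visible = False
```

The loop variables survive after the loop ends, while the name assigned inside the function cannot be reached from outside it.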

When you use a variable in your code, **Python searches the following `SCOPE`s in order to locate the appropriate `VARIABLE/OBJECT`** and uses the first match:
* The `local scope`: the function (code block) where you use the variable
* The `enclosing scope`: any outer function that the current function is defined inside
* The `global scope`: the script file (Jupyter Notebook in our lecture demo) that your code belongs to
* The `built-in scope`: names that Python always provides, such as `print` and `len`
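
The lookup order can be sketched with one name, `level`, defined at several levels. The function names are hypothetical and only serve the demo.

```python
level = 'global'

def outer():
    level = 'enclosing'

    def inner():
        level = 'local'
        return level          # found in the local scope

    def inner_no_local():
        return level          # not local, so found in the enclosing scope

    return inner(), inner_no_local()

def no_local_or_enclosing():
    return level              # falls through to the global scope

results = outer() + (no_local_or_enclosing(),)
print(results)                # ('local', 'enclosing', 'global')

# `len` is not defined anywhere in this file, so Python finds it in the built-in scope
print(len(results))           # 3
```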



![](../pics/scope.png)

In [None]:
import numpy as np

### Global & Local Variables
* Global variable: a variable defined in the global scope
* Local variable: a variable defined in a local scope (for example, inside a function)

In [None]:
s = 'I am a string' # the global scope
def test():
    print('From function test(): ',s) # use the global scope

test()

In [None]:
s = 'I am a string' # the global scope
def test():
    s = 'I am a new string' # the local scope
    print('From function test(): ',s) # use the local scope

test()
print('Outside function test(): ',s) # the global scope

In [None]:
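s = 'I am a string' # the global scope
def test():
    # `s` is assigned later in this function, so Python treats it as local everywhere in test()
    try:
        print('From function test(), before assignment: ',s) # will try the local scope first
    except UnboundLocalError as err:
        print('From function test(), before assignment: ',err) # the local `s` has no value yet
    s = 'I am a new string' # the local scope
    print('From function test(), after assignment: ',s)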

test()

In [None]:
var = 123

def test():
    var = 124
    print(var)

def test1():
    var = 1234
    print(var)
    
test()
test1()
print(var)

In [None]:
var = 123
id(var)

In [None]:
var = 123

def test():
    var = 124
#     print(var)
    print(id(var))

def test1():
    var = 1234
#     print(var)
    print(id(var))
    
test()
test1()
print(id(var))

In [None]:
s = 'I am a string' # the global scope
def test():
    global s
    print('From function test(), before assignment: ',s) # will try the global scope
    s = 'I am a new string'
    print('From function test(), after assignment: ',s)

test()
print('From outside test(), ', s)

In [None]:
s = 'I am a string' # the global scope
def test():
#     global s
#     print('From function test(), before assignment: ',s) # will try the global scope
    s = 'I am a new string'
    print('From function test(), after assignment: ',s)

test()
print('From outside test(), ', s)

In [None]:
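a = [1,2,3] # the global scope
for i in range(3):
    a = 4 # a for loop does not create a new scope, so this rebinds the global `a`
    print(i,a)
print(a) # prints 4: the name `a` no longer refers to the list [1,2,3]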

### Reading Materials
* Fabrizio Romano *Learning Python: Learn to code like a professional with Python - an opensource, versatile, and powerful programming language* Packt Publishing, 2015. Chapter 1

### More Examples

In [None]:
s = 'I am a string'
def test1():
    s = 'I am a new string'
    print('From function test1(): ',s)

def test2():
    s = 'I am another new string'
    print('From function test2(): ',s)

print('Outside function test1() and test2(), before calling the 2 functions: ',s)
test1()
test2()
print('Outside function test1() and test2(), after calling the 2 functions: ',s)

### Passing parameters to the function

In [None]:
s = 'I am a string'
def test(s):
    print('From function test(), before assignment: ',s)
    s = 'I am a new string'
    print('From function test(), after assignment: ',s)

print('Outside function test(), before calling test(): ',s)    
test(s)
print('Outside function test(), after calling test() ',s)

In [None]:
## The above code is actually equivalent to the following code

s = 'I am a string'
def test(s_local):
    print('From function test(), before assignment: ',s_local)
    s_local = 'I am a new string'
    print('From function test(), after assignment: ',s_local)

print('Outside function test(), before calling test(): ',s)    
test(s_local=s)
print('Outside function test(), after calling test() ',s)

In [None]:
s = [1,2,3]
def test(s):
    print('From function test(), before assignment: ',s)
    s[1] = 5
    print('From function test(), after assignment: ',s)

print('Outside function test(), before calling test(): ',s)    
test(s)
print('Outside function test(), after calling test() ',s)

In [None]:
## The above code is actually equivalent to the following code

s = [1,2,3]
def test(s_local):
    print('From function test(), before assignment: ',s_local)
    s_local[1]=5
    print('From function test(), after assignment: ',s_local)

print('Outside function test(), before calling test(): ',s)    
test(s_local=s)
print('Outside function test(), after calling test() ',s)

# Recap of previous lecture

## Numpy and Numpy Arrays

## import `numpy`

In [None]:
import numpy as np

## Create numpy arrays

In [None]:
aList = [1,2,3,4]
aNumpyArray = np.array(aList)

In [None]:
aNumpyArray

In [None]:
aNumpyArray.ndim

In [None]:
aNumpyArray.shape

In [None]:
## get the length (Euclidean norm) of a vector
bList = [3,4]
bNumpyArray = np.array(bList)
np.linalg.norm(bNumpyArray)

In [None]:
aNumpyArray = np.array(aList).reshape(2,2)

In [None]:
aNumpyArray.ndim

In [None]:
aNumpyArray.shape

In [None]:
## get the inverse of the 2x2 matrix
np.linalg.inv(aNumpyArray)

## Operations on numpy arrays

In [None]:
aNumpyArray.T

In [None]:
a = np.array(aList).reshape(2,2)
b = np.eye(2)

In [None]:
b

In [None]:
a.dot(b)

## Generate random numbers with `numpy`

In [None]:
np.random.rand(3)

In [None]:
np.random.randn(2,2)

In [None]:
np.random.randint(low=0, high=10, size=100)

## Example: Nearest neighbor search

The Euclidean distance between two points $(x_1,y_1,z_1)$ and $(x_2,y_2,z_2)$ is:
$$\sqrt{(x_2-x_1)^2+(y_2-y_1)^2+(z_2-z_1)^2}$$
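
A quick worked instance of the formula, using one of the points and the query point from this example. The variable names `p`, `q`, `dist_formula` and `dist_norm` are just for this sketch.

```python
import numpy as np

p = np.array([4, 7, 2])   # one entry of the `points` list
q = np.array([4, 5, 3])   # the query point `qPoint`

# Term by term: (4-4)^2 + (7-5)^2 + (2-3)^2 = 0 + 4 + 1 = 5
dist_formula = np.sqrt(np.sum((p - q) ** 2))

# np.linalg.norm of the difference vector computes the same quantity
dist_norm = np.linalg.norm(p - q)

print(dist_formula, dist_norm)  # both are sqrt(5), about 2.236
```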

In [None]:

### Pure iterative Python ###
points = [[9,2,8],[4,7,2],[3,4,4],[5,6,9],[5,0,7],[8,2,7],[0,3,2],[7,3,0],[6,1,1],[2,9,6]]
qPoint = [4,5,3]

minIdx = -1
minDist = -1
for idx, point in enumerate(points):  # iterate over all points
    print('index is {}, point is {}'.format(idx, point))
    dist = sum([(dp-dq)**2 for dp,dq in zip(point,qPoint)])**0.5  # compute the euclidean distance for each point to q
    if dist < minDist or minDist < 0:  # if necessary, update minimum distance and index of the corresponding point
        minDist = dist
        minIdx = idx

print('Nearest point to q: ', points[minIdx])

In [None]:
# # # Equivalent NumPy vectorization # # #
import numpy as np
points = np.array([[9,2,8],[4,7,2],[3,4,4],[5,6,9],[5,0,7],[8,2,7],[0,3,2],[7,3,0],[6,1,1],[2,9,6]])
qPoint = np.array([4,5,3]).reshape(1,3)
minIdx = np.argmin(np.linalg.norm(points-qPoint,axis=1))  # compute all euclidean distances at once and return the index of the smallest one
print('Nearest point to q: ', points[minIdx])

# Quick Tutorial on Pandas

* `pandas` is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
* It is included in the installation of the Anaconda distribution
* When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean and process your data. In pandas, a data table is called a DataFrame.

<img align="center" src="../pics/dataframe-structure.png" style="height:300px;">


## Import the core libraries

In [None]:
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

## Create `dataframe` from raw data

In [None]:
{
    'A':[1,2,'a',4],
    'B':np.arange(5,9),
    'C':['abc','def','ghi','jkl']
}

In [None]:
# create df from a dictionary
df1 = pd.DataFrame({
    'A':[1,2,'a',4],
    'B':np.arange(5,9),
    'C':['abc','def','ghi','jkl']
})

In [None]:
df1

In [None]:
[
    ['a','b','c'],
    ['d','e','f']
]

In [None]:
# create df from a list
df2 = pd.DataFrame([
    ['a','b','c'],
    ['d','e','f']
], columns=['col1','col2','col3'])
df2

In [None]:
[3] * 4

In [None]:
# create df with fancier settings
df3 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
}) 

In [None]:
df3

In [None]:
df3.shape

In [None]:
df3.ndim

References
* `Series`: https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#series
* `Time series and date functionality`: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-series-date-functionality

## Create `dataframe` from text file

In [None]:
df = pd.read_csv('../data/imf-gdp-per-capita-2015.csv',sep=',',header=0)

In [None]:
df.head(10)

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
list(df.columns)

In [None]:
df.info()

In [None]:
df = pd.read_csv('../data/imf-gdp-per-capita-2015.csv',sep=',',header=0, thousands=',')
df.head(10)

In [None]:
df.info()

## Create `dataframe` from excel spreadsheet

In [None]:
# pd.read_excel() # press shift + tab

In [None]:
## import from excel spreadsheet (need to have package `openpyxl` pre-installed)
df2 = pd.read_excel(io='../data/excel-test-file.xlsx', sheet_name='tab1', header=0)

df2.head(5)

In [None]:
df3 = pd.read_excel(io='../data/excel-test-file.xlsx',sheet_name='tab2',header=0)
df3.head(3)

## [SPECIAL] Create `Series`

One-dimensional ndarray with axis labels (including time series).

In [None]:
s = pd.Series([1,3,5,7,9],index=['a','b','c','d','e'])
s

In [None]:
type(s)

In [None]:
s['a']

In [None]:
s['c']

## View `dataframe`

In [None]:
# create a dataframe from a numpy array, with columns labeled
df = pd.DataFrame(np.random.randn(6,4), columns = ['Ann', "Bob", "Charly", "Don"])
df

**df.head()**

In [None]:
df.head(2)

In [None]:
df.head()

**df.tail()**

In [None]:
df.tail(2)

In [None]:
df.tail()

**`dataframe` attributes**

In [None]:
list(df.index)

In [None]:
df

In [None]:
dates = pd.date_range(start='20200825', end='20201201', freq='7D')
dates

In [None]:
type(dates)

In [None]:
dates.ndim

In [None]:
df.index = dates[:6]
df

In [None]:
df.index

In [None]:
# df.columns
list(df.columns)

In [None]:
df.dtypes

In [None]:
df.values # convert df to numpy array

In [None]:
df.values.shape

In [None]:
# you can also do
df.to_numpy()

**df.describe()**

In [None]:
df.describe() # generate descriptive stats on the data

**df.transpose()**

In [None]:
df

In [None]:
# transpose a dataframe

df.transpose()
# type(df.transpose())

In [None]:
df.T # you can also do it this way

**sort `dataframe`**

In [None]:
df

In [None]:
# sort_index(), by labels (index or column)
# df
df.sort_index(axis=0, ascending=False)

In [None]:
df

In [None]:
df.sort_index(axis=1, ascending=False)

In [None]:
# sort_values(), by values
df

In [None]:
df.sort_values(by='Ann', ascending=True)
# df.sort_values(by=['Ann','Bob'], ascending=True)

## Select `dataframe`

Pandas documentation on selecting and indexing a `dataframe`:
* https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing
* https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced

### Select columns

Selecting a single column yields a `Series`; `df['Ann']` is equivalent to `df.Ann`.

In [None]:
df['Ann']

In [None]:
type(df)

In [None]:
type(df['Ann'])

In [None]:
df.Ann

In [None]:
type(df['Ann'])

Selecting multiple columns with a list of names yields a `dataframe` that contains only those columns. Note that the result behaves as a new copy, so changing it does NOT change the original `dataframe`!
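
A minimal sketch of what happens to the original when a multi-column selection is modified. `df_demo` is a hypothetical stand-in for the lecture dataframe, and the warning filter is only there because some pandas versions emit a `SettingWithCopyWarning` for this pattern.

```python
import warnings
import pandas as pd

# small stand-in for the lecture dataframe (hypothetical values)
df_demo = pd.DataFrame({'Ann': [1.0, 2.0], 'Bob': [3.0, 4.0], 'Don': [5.0, 6.0]})

subset = df_demo[['Ann', 'Bob']]       # select two columns with a list of names
print(type(subset))                    # a DataFrame, not a Series

with warnings.catch_warnings():
    # silence a possible SettingWithCopyWarning for the demo
    warnings.simplefilter('ignore')
    subset.loc[0, 'Ann'] = 99.0        # change the selection ...

print(subset.loc[0, 'Ann'])            # 99.0
print(df_demo.loc[0, 'Ann'])           # 1.0, the original is untouched
```

If the goal is to change the original dataframe, assign through `df.loc[...]` on the original instead of through the selection.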

In [None]:
df[['Ann','Bob']]

In [None]:
type(df[['Ann','Bob']])

### Select by labels
* You can use the `.loc` indexer of a `dataframe` to select data by labels. The typical format is
```python
df.loc[row_indexer, column_indexer]
```
* More details can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing
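
A short sketch of label-based selection on a tiny, made-up dataframe with a weekly date index (`df_demo` and its values are assumptions for the illustration). The point to notice is that a label slice includes both endpoints.

```python
import pandas as pd

# hypothetical stand-in for the lecture dataframe, with a weekly date index
idx = pd.date_range(start='2020-08-25', periods=4, freq='7D')
df_demo = pd.DataFrame({'Ann': [1, 2, 3, 4], 'Bob': [5, 6, 7, 8]}, index=idx)

# a label slice includes BOTH endpoints
rows = df_demo.loc['2020-08-25':'2020-09-08', ['Ann']]
print(rows)
print(len(rows))   # 3 rows: 08-25, 09-01 and 09-08

# a single row label plus a single column label returns one cell value
cell = df_demo.loc[idx[1], 'Bob']
print(cell)        # 6
```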


In [None]:
dates = pd.date_range(start='20200825', end='20201201', periods=15)
dates

In [None]:
df

In [None]:
df.index

In [None]:
dates[0]

In [None]:
# by row label
df.loc[dates[0]]

In [None]:
# by row and column label
df.loc[dates[0:2],['Ann','Bob']]

In [None]:
df.loc['2020-08-25',['Ann','Bob']] # get a Series

In [None]:
df.loc['2020-08-25':'2020-09-08',['Ann','Bob']] # note here the row for '2020-09-08' is also displayed

In [None]:
# by column label only
df.loc[:,['Ann']] # note that you'll get a dataframe instead of a Series

In [None]:
# what if I just want to get the value of a particular cell?
df.loc['2020-08-25','Ann']

In [None]:
# you can also do
df.at['2020-08-25','Ann']

### Select by Position

* You can use the `.iloc` indexer of a `dataframe` to select data by integer position. The typical format is
```python
df.iloc[row_position_indexer, column_position_indexer]
```
* More details can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing
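
To make the contrast with `.loc` concrete, here is a small sketch on a made-up dataframe (the row labels `'w'` to `'z'` are hypothetical). Position slices follow the usual Python rule and exclude the end, while label slices include it.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Ann': [1, 2, 3, 4], 'Bob': [5, 6, 7, 8]},
    index=['w', 'x', 'y', 'z'],   # hypothetical row labels
)

by_position = df_demo.iloc[0:2]    # positions 0 and 1; position 2 is excluded
by_label = df_demo.loc['w':'y']    # labels 'w', 'x' AND 'y'; the end label is included

print(len(by_position))     # 2
print(len(by_label))        # 3
print(df_demo.iloc[-1, 0])  # 4: last row, first column
```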

In [None]:
df

In [None]:
# select by row position
df.iloc[0]

In [None]:
# select by row position range
df.iloc[0:2] # note that the end position (2) is excluded, unlike label slices with df.loc

In [None]:
# you can also do
df.iloc[0:2,]

In [None]:
# select by column position range
df.iloc[:,0:2]

In [None]:
# select by row and column position range
df.iloc[0:2,0:2]

In [None]:
# what if I just want to get the value of a particular cell?
df.iloc[0,0]

In [None]:
# you can also do
df.iat[0,0]

### Select by conditions

In [None]:
df

In [None]:
df[df.Ann>=0]

In [None]:
df.loc[df.Ann>=0,['Ann','Bob']]

In [None]:
df.loc[(df.Ann>=-0.5)&(df.Ann<=1.4),['Ann','Bob']]

### Set values

In [None]:
df

In [None]:
# add a new column
df['E'] = 5
df

In [None]:
df['F'] = np.arange(6)
df

In [None]:
# set values by labels
df.loc['2020-08-25','E'] = 3
# df.at['2020-08-25','E'] = 3
df

In [None]:
# set values by position
df.iloc[0,5] = -1
df

In [None]:
# set values by condition
df.loc[df.Ann>0,'E'] = 4
df

## Missing values

`pandas` primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the [Missing Data section](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data) from `pandas` official documentation for more details.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
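
A minimal sketch of both points, on a tiny made-up dataframe (`df_demo` and its labels are assumptions): reindexing returns a new object with `NaN` for labels that did not exist, and `NaN` is skipped in computations unless you ask otherwise.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])

# reindex returns a new dataframe; labels that did not exist before get NaN
df_re = df_demo.reindex(index=['a', 'b', 'c', 'd'])
print(df_re)
print(df_demo.shape, df_re.shape)       # (3, 1) (4, 1): the original is unchanged

# NaN is skipped by default in computations such as mean()
mean_skip = df_re['x'].mean()
mean_keep = df_re['x'].mean(skipna=False)
print(mean_skip)                        # 2.0, the mean of 1, 2, 3
print(mean_keep)                        # nan, because the missing value is included
```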

In [None]:
df1 = df.reindex(index=dates[:6],columns=list(df.columns)+['G'])
df1

In [None]:
# fill in values at some locations
df1.loc['2020-08-25':'2020-09-08','G'] = 1
df1

In [None]:
# to get the boolean mask where values are nan
df1.isna()

In [None]:
# you can also do
pd.isna(df1)

In [None]:
# drop any rows that have missing values
df2 = df1.copy()
df2.dropna(how='any')

In [None]:
df2 # df2 is unchanged: dropna() returns a new dataframe unless inplace=True is passed

In [None]:
# fill missing values
df1.fillna(value=-999)

## Operations on `dataframe`

**Stats**

In [None]:
df

In [None]:
# df.mean()
list(df.mean())

In [None]:
df.mean()

In [None]:
df.mean().values

In [None]:
df.mean(axis=0)

In [None]:
df.mean(axis=1)

**Histogram**

In [None]:
df

In [None]:
df['histcol'] = np.random.randint(0,3,size=6)
df

In [None]:
df.histcol.value_counts()

In [None]:
df.histcol.nunique()

In [None]:
df.histcol.unique()

In [None]:
# df.histcol.hist()
df.histcol.hist(density=True)

**Apply functions/logics to the data**

In [None]:
df

In [None]:
df.apply(np.cumsum) # apply the function on all columns

In [None]:
df.apply(lambda x: -x) # apply the function on all columns

In [None]:
df.E.map(lambda x: x+1) # apply the function on one single column

## `dataframe` and table operations

In [None]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['a','b','c','d'])
df

**Concat**

In [None]:
pieces = [df[:3], df[7:]]
print("pieces:\n", pieces)
print("put back together:\n")
# pd.concat(pieces, axis=1)
pd.concat(pieces, axis=0)

**Append new data from another `dataframe`**

In [None]:
df_p2 = pd.DataFrame(np.random.randn(4, 4), columns=['a','b','c','d'])
df_p2

In [None]:
pd.__version__
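
A minimal sketch of appending the rows of one dataframe to another, assuming pandas 2.0 or later, where `DataFrame.append` is no longer available and `pd.concat` is used instead. The frames `df_a` and `df_b` are small hypothetical stand-ins with the same columns.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['a', 'b', 'c', 'd'])
df_b = pd.DataFrame(np.arange(8, 20).reshape(3, 4), columns=['a', 'b', 'c', 'd'])

# stack the rows of df_b below df_a; ignore_index=True renumbers the rows 0..n-1
appended = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(appended)
print(appended.shape)   # (5, 4)
```

Without `ignore_index=True` the original row labels are kept, so the result would contain duplicate index values.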

**Joins**

More details at https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
![](joins.jpg)

In [None]:
tb1 = pd.DataFrame({'key': ['foo', 'boo', 'foo'], 'lval': [1, 2, 3]})
tb2 = pd.DataFrame({'key': ['foo', 'coo'], 'rval': [5, 6]})

In [None]:
tb1

In [None]:
tb2

In [None]:
pd.merge(tb1, tb2, on='key', how='inner')

In [None]:
pd.merge(tb1, tb2, on='key', how='left')

In [None]:
pd.merge(tb1, tb2, on='key', how='right')

In [None]:
pd.merge(tb1, tb2, on='key', how='outer')

**Grouping**

By `group by` we are referring to a process involving one or more of the following steps

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

See the Grouping section of the official `pandas` documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
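
The three steps can be sketched by hand on a tiny made-up dataframe and compared with what `groupby` does in one call. `df_demo` and `manual` are names invented for this illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                        'C': [1.0, 2.0, 3.0, 4.0]})

# 1) split: iterate over the groups that groupby creates
# 2) apply: compute the mean of column C within each group
manual = {}
for key, group in df_demo.groupby('A'):
    manual[key] = group['C'].mean()

# 3) combine: collect the per-group results into one Series
manual = pd.Series(manual)
print(manual)

# groupby(...).mean() performs all three steps in one call
one_call = df_demo.groupby('A')['C'].mean()
print(one_call)
```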

In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df

In [None]:
df.groupby('A')['C'].mean().reset_index() # simple stats grouped by 1 column

In [None]:
df.groupby(['A','B']).sum().reset_index() # simple stats grouped by multiple columns

In [None]:
df.groupby(['A','B']).mean().reset_index() # simple stats grouped by multiple columns

In [None]:
# df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x**2)).reset_index() # customized aggregation
df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x)).reset_index() # customized aggregation

**Pivot table**

In [None]:
df = pd.DataFrame({'ModelNumber' : ['one', 'one', 'two', 'three'] * 3,
                   'Submodel' : ['A', 'B', 'C'] * 4,
                   'Type' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'Xval' : np.random.randn(12),
                   'Yval' : np.random.randn(12)})

df

We can produce pivot tables from this data very easily:

In [None]:
pd.pivot_table(
    df
    , values='Xval'
    , index=['ModelNumber', 'Submodel']
    , columns=['Type']
)

In [None]:
pd.pivot_table(
    df
    , values='Xval'
    , index=['ModelNumber', 'Submodel']
    , columns=['Type']
#     , aggfunc='count'
    , aggfunc=lambda x: x.abs().max() # a custom aggfunc should reduce each group to a single value
)

## Write/Export `dataframe` to files

**CSV file**

In [None]:
df

In [None]:
df.to_csv('../data/to-csv-test.csv',sep=',',header=True,index=False) # index=False: do not write the row labels

**Excel spreadsheet**

In [None]:
df.to_excel('../data/to-excel-test.xlsx',sheet_name='tab1',header=True,index=False) # index=False: do not write the row labels

## Pandas and time series data [offline-reading]

Please check `pandas` official documentations:
* https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#time-series
* https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

In [None]:
rng = pd.date_range('1/1/2012', periods=10, freq='s') # 's' = seconds; the uppercase 'S' alias is deprecated in recent pandas versions
rng

In [None]:
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts

In [None]:
ts.resample('2s').mean() # 2-second bins; the uppercase '2S' alias is deprecated in recent pandas versions

More on `resampling` can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling

In [None]:
ts.rolling(window=5).mean() # rolling average

In [None]:
ts.rolling(window=5, center=True).mean() # set the label at the center

More on `pandas.Series.rolling` can be found at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rolling.html

In [None]:
ts.plot()