(**You can also open this notebook in Google Colab**)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/data-programming-with-python/blob/main/2023-fall/2023-09-26/notebook/code_demo.ipynb)

# Python basics - additional topics

## Library import in depth
### A simple Python package
Assume we have a package with the following file layout:
```md
└── sample_package
    ├── sample.py
    └── subpackage
        └── subsample.py
```
The content of `sample.py` is:
```python
x = 123
y = 234

def hello():
    print('Hello World')
```

The content of `subsample.py` is:
```python
xx = 1
yy = 2
```

### Things might be more complicated
![](../pics/library_tree.png)

***You could***
* `import` the whole library with `import a`
* `import` a module (a Python script) with `import a.aa`
* `import` an object (variable, function, class, etc.) from a module with `from a.aa import aaa`. Note that `import a.aa.aaa` only works when `aaa` is itself a module or subpackage, not an object defined inside `aa`


**However**, after a plain `import <name>` statement you must keep using the full dotted `<name>` in your program to reference what you imported. **Sometimes this is quite inconvenient**, because the dotted name can get long when the Python library has a complicated file structure.

**There are two ways** to solve the problem:
* `from a import aa` (the `from` clause carries the long package path, so you refer to the module simply as `aa`)
* `import a.aa as aa` (create an alias)
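
The options in the list above can be tried without any custom package. The sketch below uses the standard-library module `xml.etree.ElementTree` as a stand-in for the hypothetical `a.aa`; the variable names are only for illustration.

```python
# 1. Plain import: every use repeats the full dotted path.
import xml.etree.ElementTree
root1 = xml.etree.ElementTree.fromstring("<a><b/></a>")

# 2. `from ... import ...` on a module: the package path lives in the `from` clause.
from xml.etree import ElementTree
root2 = ElementTree.fromstring("<a><b/></a>")

# 3. `from ... import ...` on an object (here a function) inside the module.
from xml.etree.ElementTree import fromstring
root3 = fromstring("<a><b/></a>")

# 4. Alias: choose a short local name for the long dotted path.
import xml.etree.ElementTree as ET
root4 = ET.fromstring("<a><b/></a>")

# All four calls reach the same function, so the parsed roots have the same tag.
tags = [r.tag for r in (root1, root2, root3, root4)]
print(tags)
```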

In [None]:
%%sh

tree sample_package

In [None]:
from sample_package.sample import hello
hello()

In [None]:
from sample_package.subpackage.subsample import xx

In [None]:
xx

# `pandas` continued

In [5]:
import pandas as pd
import numpy as np

## Create `dataframe` from files

### `csv` file

In [2]:
df1 = pd.read_csv('../data/imf-gdp-per-capita-2015.csv',sep=',',header=0, thousands=',')
df1.head(3)

Unnamed: 0,Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,2015,Estimates Start After
0,Afghanistan,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",599.994,2013.0
1,Albania,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",3995.38,2010.0
2,Algeria,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",4318.14,2014.0


### `excel` file

In [3]:
df2 = pd.read_excel(io='../data/excel-test-file.xlsx', sheet_name='tab1', header=0)
df2.head(3)

Unnamed: 0,col1,col2,col3
0,1,a,a12
1,2,b,b23
2,3,c,c31


In [4]:
df3 = pd.read_excel(io='../data/excel-test-file.xlsx',sheet_name='tab2',header=0)
df3.head(3)

Unnamed: 0,col4,col5
0,d,4
1,e,5
2,f,6


## Different ways to select a subset of a `dataframe`

| Type                  | Notes                                       |
|-----------------------|---------------------------------------------|
| `df[column]`          | Select by column labels                     |
| `df.loc[rows]`        | Select by row labels                        |
| `df.loc[:, cols]`     | Select by column labels                     |
| `df.loc[rows, cols]`  | Select by row and column labels             |
| `df.iloc[rows]`       | Select by row positional indices            |
| `df.iloc[:, cols]`    | Select by column positional indices         |
| `df.iloc[rows, cols]` | Select by row and column positional indices |
| `df.at[row, col]`     | Select an element by row and column labels  |
| `df.iat[row, col]`    | Select an element by row and column indices |
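
A minimal sketch of the selectors in the table, using a small frame built for illustration (`df_sel` and the other variable names are not part of the notebook). Note that label slices with `.loc` include the end label, while positional slices with `.iloc` exclude the end position.

```python
import numpy as np
import pandas as pd

df_sel = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

col = df_sel["Texas"]                                       # one column as a Series
row_by_label = df_sel.loc["c"]                              # one row, by label
block_by_label = df_sel.loc[["a", "d"], ["Ohio", "Texas"]]  # rows and columns, by label
row_by_pos = df_sel.iloc[1]                                 # one row, by position
block_by_pos = df_sel.iloc[[0, 2], [0, 1]]                  # rows and columns, by position
cell_by_label = df_sel.at["d", "California"]                # single element, by label
cell_by_pos = df_sel.iat[2, 2]                              # single element, by position

label_slice = df_sel.loc["a":"c"]   # end label "c" is included
pos_slice = df_sel.iloc[0:1]        # end position 1 is excluded

print(block_by_label)
print(cell_by_label, cell_by_pos)
```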

### `Reindex`
`reindex` creates a new object with the values rearranged to align with the new index. Labels that are not present in the original object are filled with `NaN`.

#### On `series`

In [None]:
x = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

In [None]:
x

In [None]:
y = x.reindex(["a", "b", "c", "d", "e"])
y

#### On `dataframe`

In [None]:
df = pd.DataFrame(
    np.arange(9).reshape(3,3),
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)

df

In [None]:
df2 = df.reindex(index=['a', 'b', 'c', 'd'])
df2

In [None]:
df3 = df.reindex(columns=['Texas', 'Utah', 'California'])
df3
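
As a small sketch of what the cells above demonstrate: `reindex` leaves the original object untouched and introduces `NaN` for labels it cannot find. The `fill_value` argument used below is an optional `reindex` parameter that the notebook itself does not use; it is shown here only as one possible alternative to `NaN`.

```python
import pandas as pd

x = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# "e" does not exist in x, so it becomes NaN in the new object.
y = x.reindex(["a", "b", "c", "d", "e"])

# Same reindexing, but missing labels get 0.0 instead of NaN.
y_filled = x.reindex(["a", "b", "c", "d", "e"], fill_value=0.0)

print(y)
print(y_filled)
print(list(x.index))  # the original order is unchanged
```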

## Missing values

`pandas` primarily uses the value `np.nan` to represent missing data. By default, missing values are excluded from computations such as `sum` and `mean`. See the [Missing Data section](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data) of the official `pandas` documentation for more details.
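
A short sketch of that default, on a toy series created for illustration: reductions skip `NaN` unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

skipped = s.mean()                  # NaN is ignored: (1 + 3) / 2
propagated = s.mean(skipna=False)   # NaN takes part, so the result is NaN
n_valid = s.count()                 # number of non-missing values

print(skipped, propagated, n_valid)
```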

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [None]:
dates = pd.date_range(start='2020-08-25', end='2020-10-01', freq='7D')
dates

In [None]:
df1 = df.reindex(index=dates[:6],columns=list(df.columns)+['G'])
df1

In [None]:
# fill in values at some locations
df1.loc['2020-08-25':'2020-09-08','G'] = 1
df1

In [None]:
# to get the boolean mask where values are nan
df1.isna()

In [None]:
# you can also do
pd.isna(df1)

In [None]:
# drop any rows that have missing values
df2 = df1.copy()
df2.dropna(how='any')

In [None]:
df2  # df2 is unchanged: dropna returns a new object unless inplace=True is passed

In [None]:
# fill missing values
df1.fillna(value=-999)
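
The cells above rely on `dropna` and `fillna` returning new objects. A minimal sketch on a toy frame (the names `frame`, `dropped_any`, `dropped_all` and `filled` are illustrative) also contrasts `how='any'` with `how='all'`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

dropped_any = frame.dropna(how="any")   # drop rows with at least one NaN
dropped_all = frame.dropna(how="all")   # drop rows where every value is NaN
filled = frame.fillna(value=-999)       # replace NaN with a sentinel value

# None of the calls modified `frame`; keep a result by assigning it to a name.
print(dropped_any)
print(dropped_all)
print(filled)
print(frame.isna().sum().sum())  # total NaN count still in the original
```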

## Operations on `dataframe`

**Stats**

In [6]:
df = pd.DataFrame(
    np.arange(9).reshape(3,3),
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)

df

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [7]:
df.describe()

Unnamed: 0,Ohio,Texas,California
count,3.0,3.0,3.0
mean,3.0,4.0,5.0
std,3.0,3.0,3.0
min,0.0,1.0,2.0
25%,1.5,2.5,3.5
50%,3.0,4.0,5.0
75%,4.5,5.5,6.5
max,6.0,7.0,8.0


In [None]:
df

In [None]:
# df.mean()
list(df.mean())

In [None]:
df.mean()

In [None]:
df.mean().values

In [None]:
df.mean(axis=0)

In [None]:
df.mean(axis=1)

**Histogram**

In [None]:
df

In [None]:
df['histcol'] = np.random.randint(0,3,size=3)
df

In [None]:
df.histcol.value_counts()

In [None]:
df.histcol.nunique()

In [None]:
df.histcol.unique()

In [None]:
# df.histcol.hist()
df.histcol.hist(density=True)

**Apply functions/logic to the data**

In [None]:
df

In [None]:
df.apply(np.cumsum) # apply the function on all columns

In [None]:
df.apply(lambda x: -x) # apply the function on all columns

In [None]:
df.California.map(lambda x: x+1) # apply the function on one single column
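
To make the difference between the three calls above explicit, here is a sketch on a fresh copy of the same 3x3 frame (`df_fn` is an illustrative name). `DataFrame.apply` passes a whole column to the function by default, or a whole row with `axis=1`, while `Series.map` passes one element at a time.

```python
import numpy as np
import pandas as pd

df_fn = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# The function receives each column as a Series: one result per column.
col_ranges = df_fn.apply(lambda col: col.max() - col.min())

# With axis=1 the function receives each row instead: one result per row.
row_ranges = df_fn.apply(lambda row: row.max() - row.min(), axis=1)

# Series.map works element by element on a single column.
plus_one = df_fn["California"].map(lambda v: v + 1)

print(col_ranges)
print(row_ranges)
print(plus_one)
```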

## `dataframe` and table operations

In [None]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['a','b','c','d'])
df

**Concat**

In [None]:
pieces = [df[:3], df[7:]]
print("pieces:\n", pieces)
print("put back together:\n")
# pd.concat(pieces, axis=1)
pd.concat(pieces, axis=0)

**Joins**

More details at https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
![](joins.jpg)

In [None]:
tb1 = pd.DataFrame({'key': ['foo', 'boo', 'foo'], 'lval': [1, 2, 3]})
tb2 = pd.DataFrame({'key': ['foo', 'coo'], 'rval': [5, 6]})

In [None]:
tb1

In [None]:
tb2

In [None]:
pd.merge(tb1, tb2, on='key', how='inner')

In [None]:
pd.merge(tb1, tb2, on='key', how='left')

In [None]:
pd.merge(tb1, tb2, on='key', how='right')

In [None]:
pd.merge(tb1, tb2, on='key', how='outer')
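
One way to compare the four `how` options is to count the rows each one produces and to ask `merge` where every row came from. The sketch below rebuilds `tb1` and `tb2` so it runs on its own; `indicator=True` is an existing `merge` option that the cells above do not use, and it adds a `_merge` column.

```python
import pandas as pd

tb1 = pd.DataFrame({"key": ["foo", "boo", "foo"], "lval": [1, 2, 3]})
tb2 = pd.DataFrame({"key": ["foo", "coo"], "rval": [5, 6]})

# Number of rows returned by each join type.
row_counts = {
    how: len(pd.merge(tb1, tb2, on="key", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)

# `_merge` is "both", "left_only" or "right_only" for each output row.
outer = pd.merge(tb1, tb2, on="key", how="outer", indicator=True)
print(outer)
```

Only the key `foo` exists in both tables, and it appears twice in `tb1`, so it contributes two matched rows to every join type.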

**Grouping**

By `group by` we are referring to a process involving one or more of the following steps:

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

See the Grouping section from the `pandas` official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
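
The three steps can be spelled out by hand and compared with the one-line `groupby` call. This is only a sketch on a toy frame named `sales` (a hypothetical name), not how `pandas` implements grouping internally.

```python
import pandas as pd

sales = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]})

# Split: one sub-frame per distinct value of column A.
groups = {key: part for key, part in sales.groupby("A")}

# Apply: compute the mean of C inside each group independently.
means = {key: part["C"].mean() for key, part in groups.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name="C").sort_index()

# The same result in one call.
builtin = sales.groupby("A")["C"].mean()

print(manual)
print(builtin)
```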

In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df

In [None]:
df.groupby('A')['C'].mean().reset_index() # simple stats grouped by 1 column

In [None]:
df.groupby(['A','B']).sum().reset_index() # simple stats grouped by multiple columns

In [None]:
df.groupby(['A','B']).mean().reset_index() # simple stats grouped by multiple columns

In [None]:
# df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x**2)).reset_index() # customized aggregation
df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x)).reset_index() # customized aggregation
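
As an alternative to `apply` with a lambda, several aggregations can be computed at once with named aggregation in `agg`. The sketch below uses a small deterministic frame (`toy`) and output column names (`total`, `sum_sq`) chosen only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "one", "two"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

summary = (
    toy.groupby(["A", "B"])
    .agg(
        total=("C", "sum"),                         # built-in aggregation by name
        sum_sq=("C", lambda x: (x ** 2).sum()),     # customized aggregation
    )
    .reset_index()
)

print(summary)
```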

## Write/Export `dataframe` to files

**CSV file**

In [None]:
df

In [None]:
df.to_csv('../data/to-csv-test.csv',sep=',',header=True)

**Excel spreadsheet**

In [None]:
df.to_excel('../data/to-excel-test.xlsx', sheet_name='tab1', header=True, index=False)
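
When a frame is written with its index (the default for `to_csv`), reading the file back naively produces an extra unnamed column. The sketch below shows the round trip using an in-memory `io.StringIO` buffer instead of a file on disk, which is an assumption made so the example runs anywhere; the frame `out` is illustrative.

```python
import io

import pandas as pd

out = pd.DataFrame({"A": ["foo", "bar"], "C": [1.5, 2.5]})

# Write with the index: it becomes an unnamed first column in the CSV text.
buf = io.StringIO()
out.to_csv(buf, sep=",", header=True)
text = buf.getvalue()

naive = pd.read_csv(io.StringIO(text))                   # index shows up as a data column
restored = pd.read_csv(io.StringIO(text), index_col=0)   # first column used as the index

# Write without the index: nothing extra to handle when reading back.
buf_no_index = io.StringIO()
out.to_csv(buf_no_index, index=False)
clean = pd.read_csv(io.StringIO(buf_no_index.getvalue()))

print(list(naive.columns))
print(list(restored.columns))
print(list(clean.columns))
```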