(**You can also open this notebook in Google Colab**)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/data-programming-with-python/blob/main/2023-fall/2023-09-26/notebook/code_demo.ipynb)

# Python basics - additional topics

## Library import in depth
### A simple Python package
Assume we have a package with the following file layout:
```md
└── sample_package
    ├── sample.py
    └── subpackage
        └── subsample.py
```
The content of `sample.py` is:
```python
x = 123
y = 234

def hello():
    print('Hello World')
```

The content of `subsample.py` is:
```python
xx = 1
yy = 2
```

### Things might be more complicated
![](../pics/library_tree.png)

***You could***
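* `import` the whole library (the top-level package), with `import a`
* `import` a module (a Python script), with `import a.aa`
* `import` an object (variable, function, class, etc.) from a module, with `from a.aa import aaa`. Note that `import a.aa.aaa` only works when `aaa` is itself a module or subpackage, not an object defined inside `a.aa`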


**However**, after a plain `import <name>` statement you must keep using the full `<name>` in your program to reference what you imported. **Sometimes this is quite inconvenient**, because the dotted name can become very long when the library has a deeply nested file structure.

**There are two ways** to solve the problem:
* `from a import aa` (use the `from` statement to reference the complicated folder relationships)
* `import a.aa as aa` (create an alias)
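
As a minimal sketch of both options, the standard library module `xml.etree.ElementTree` stands in here for a deeply nested module (the package `a` above is only illustrative):

```python
# Plain dotted import: every reference must repeat the whole path.
import xml.etree.ElementTree
root = xml.etree.ElementTree.fromstring("<a><b/></a>")

# Option 1: `from ... import ...` binds only the last name.
from xml.etree import ElementTree
root1 = ElementTree.fromstring("<a><b/></a>")

# Option 2: `import ... as ...` creates a short alias.
import xml.etree.ElementTree as ET
root2 = ET.fromstring("<a><b/></a>")

# All three names refer to the same module object.
print(ET is ElementTree is xml.etree.ElementTree)
print(root.tag, root1.tag, root2.tag)
```

The alias form is the same convention used for `import pandas as pd` and `import numpy as np`.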

In [None]:
%%sh

tree sample_package

In [None]:
from sample_package.sample import hello
hello()

In [None]:
from sample_package.subpackage.subsample import xx

In [None]:
xx

# `pandas` continued

In [None]:
import pandas as pd
import numpy as np

## Create `dataframe` from files

### `csv` file

In [None]:
df1 = pd.read_csv('../data/imf-gdp-per-capita-2015.csv',sep=',',header=0, thousands=',')
df1.head(3)

### `excel` file

In [None]:
df2 = pd.read_excel(io='../data/excel-test-file.xlsx', sheet_name='tab1', header=0)
df2.head(3)

In [None]:
df3 = pd.read_excel(io='../data/excel-test-file.xlsx',sheet_name='tab2',header=0)
df3.head(3)

## Different ways to select a subset of a `dataframe`

| Type                  | Notes                                       |
|-----------------------|---------------------------------------------|
| `df[column]`          | Select by column labels                     |
| `df.loc[rows]`        | Select by row labels                        |
| `df.loc[:, cols]`     | Select by column labels                     |
| `df.loc[rows, cols]`  | Select by row and column labels             |
| `df.iloc[rows]`       | Select by row positional indices            |
| `df.iloc[:, cols]`    | Select by column positional indices         |
| `df.iloc[rows, cols]` | Select by row and column positional indices |
| `df.at[row, col]`     | Select an element by row and column labels  |
| `df.iat[row, col]`    | Select an element by row and column positions |
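
The table above can be sketched on a small dataframe. The name `demo` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

col = demo["Texas"]                               # one column by label -> Series
rows = demo.loc[["a", "d"]]                       # rows by label
block = demo.loc[["a", "d"], ["Ohio", "Texas"]]   # rows and columns by label
first_two = demo.iloc[:2]                         # rows by position
corner = demo.iloc[:2, [0, 2]]                    # rows and columns by position
cell_by_label = demo.at["c", "Texas"]             # single element by label
cell_by_pos = demo.iat[1, 1]                      # single element by position

print(block)
print(corner)
print(cell_by_label, cell_by_pos)
```

`loc` and `at` use the labels shown in the index and column headers, while `iloc` and `iat` count positions from 0 regardless of the labels.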

### `Reindex`
Create a new object with the values rearranged to align with the new index. Labels that do not exist in the original index get missing values (`NaN`).

#### On `series`

In [None]:
x = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

In [None]:
x

In [None]:
y = x.reindex(["a", "b", "c", "d", "e"])
y

#### On `dataframe`

In [None]:
df = pd.DataFrame(
    np.arange(9).reshape(3,3),
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)

df

In [None]:
df2 = df.reindex(index=['a', 'b', 'c', 'd'])
df2

In [None]:
df3 = df.reindex(columns=['Texas', 'Utah', 'California'])
df3

## Missing values

`pandas` primarily uses the value `np.nan` to represent missing data. By default, missing values are excluded from computations. See the [Missing Data section](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data) of the official `pandas` documentation for more details.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
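
A minimal sketch of both statements, using a made-up series `s`: reindexing returns a new object with `NaN` for the new label, and aggregations skip `NaN` unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = s.reindex(["a", "b", "c", "d"])   # "d" has no value, so it becomes NaN

print(s2)
print(s2.sum())               # NaN is skipped by default
print(s2.mean())              # mean over the 3 non-missing values
print(s2.sum(skipna=False))   # NaN propagates when skipna=False
print(len(s), len(s2))        # the original series is unchanged
```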

### Introduce missing value to data

In [None]:
import pandas as pd
import numpy as np

In [None]:
dates = pd.date_range(start='2020-08-25', end='2020-10-01', freq='7D')
dates

In [None]:
df1 = df.reindex(index=dates[:6],columns=list(df.columns)+['G'])
df1

In [None]:
# fill in values at some locations
df1.loc['2020-08-25':'2020-09-08','G'] = 1
df1

In [None]:
# to get the boolean mask where values are nan
df1.isna()

In [None]:
# you can also do
pd.isna(df1)

You could also do the following ...

In [None]:
df1.notna()

In [None]:
pd.notna(df1)

In [None]:
# drop any rows that have missing values
df2 = df1.copy()
df2.dropna(how='any')

In [None]:
df2 # df2 is unchanged: dropna() returns a new dataframe unless inplace=True is passed

In [None]:
# fill missing values
df1.fillna(value=-999)

### What represents `missing/Null` value

* When viewing a dataframe/series with missing values, these are the common markers that indicate missing/null values
  * `NaN`
  * `None`
  * `NaT`
* In terms of the actual values, here are the common markers
  * `np.nan` - the primary marker used to represent missing values in pandas. It is a special floating-point value that can be used for numerical and non-numerical data types. You can use pandas methods like `isna()`, `isnull()`, `fillna()`, and others to work with `np.nan` values.
  * `None` - the built-in Python None type
  * Custom missing value markers - depends on the actual data you have
    * Could be an empty string
    * Could be conventional strings such as `N/A`, `NULL`, etc.
* In `pd.read_csv()` and `pd.read_excel()`, you can use the `na_values` parameter to tell pandas which additional values should be treated as missing/null
  * [*read_csv()*](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
  * [*read_excel()*](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)
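
A short sketch of `na_values`, reading from an in-memory string instead of a file. The CSV content and the markers `missing` and `-` are made up for illustration:

```python
import io
import pandas as pd

csv_text = "name,score\nalice,10\nbob,missing\ncarol,-\n"

# Without na_values, the custom markers are read as ordinary strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, the same markers become NaN and the column is numeric.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["missing", "-"])

print(raw)
print(clean)
print(clean["score"].isna().sum())
```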

In [None]:
s = pd.Series([1., 2., 3.])
s.loc[0] = None
s

In [None]:
s = pd.Series(["a", "b", "c"])
s.loc[1] = np.nan
s

In [None]:
df = pd.DataFrame(
    np.random.randn(5, 3),
    index=["a", "c", "e", "f", "h"],
    columns=["one", "two", "three"],
)

df['timestamp'] = pd.Timestamp("20120101")
df.loc['f', 'timestamp'] = None
df

## Operations on `dataframe`

**Stats**

In [None]:
df = pd.DataFrame(
    np.arange(9).reshape(3,3),
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)

df

In [None]:
df.describe()

In [None]:
df

In [None]:
# df.mean()
list(df.mean())

In [None]:
df.mean()

In [None]:
df.mean().values

In [None]:
df.mean(axis=0)

In [None]:
df.mean(axis=1)
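
As a quick check of what `axis` means, here is a sketch on the same 3x3 layout (the name `demo` is only for illustration): `axis=0` collapses the rows and gives one value per column, while `axis=1` collapses the columns and gives one value per row.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

col_means = demo.mean(axis=0)   # indexed by column label
row_means = demo.mean(axis=1)   # indexed by row label

print(col_means)
print(row_means)
```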

**Histogram**

In [None]:
df

In [None]:
df['histcol'] = np.random.randint(0,3,size=3)
df

In [None]:
df.histcol.value_counts()

In [None]:
df.histcol.nunique()

In [None]:
df.histcol.unique()

In [None]:
# df.histcol.hist()
df.histcol.hist(density=True)

**Apply functions/logics to the data**

In [None]:
df

In [None]:
df.apply(np.cumsum) # apply the function on all columns

In [None]:
df.apply(lambda x: -x) # apply the function on all columns

In [None]:
df.California.map(lambda x: x+1) # apply the function on one single column

## `dataframe` and table operations

In [None]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['a','b','c','d'])
df

**Concat**

In [None]:
pieces = [df[:3], df[7:]]
print("pieces:\n", pieces)
print("put back together:\n")
# pd.concat(pieces, axis=1)
pd.concat(pieces, axis=0)

**Joins**

More details at https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
![](../pics/joins.jpg)

In [None]:
tb1 = pd.DataFrame({'key': ['foo', 'boo', 'foo'], 'lval': [1, 2, 3]})
tb2 = pd.DataFrame({'key': ['foo', 'coo'], 'rval': [5, 6]})

In [None]:
tb1

In [None]:
tb2

In [None]:
pd.merge(tb1, tb2, on='key', how='inner')

In [None]:
pd.merge(tb1, tb2, on='key', how='left')

In [None]:
pd.merge(tb1, tb2, on='key', how='right')

In [None]:
pd.merge(tb1, tb2, on='key', how='outer')
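
To compare the four join types side by side, the sketch below rebuilds the same two small tables and counts the rows each `how` produces. The `indicator=True` option adds a `_merge` column that shows where each row came from:

```python
import pandas as pd

tb1 = pd.DataFrame({"key": ["foo", "boo", "foo"], "lval": [1, 2, 3]})
tb2 = pd.DataFrame({"key": ["foo", "coo"], "rval": [5, 6]})

# Number of result rows for each join type.
row_counts = {
    how: len(pd.merge(tb1, tb2, on="key", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)

# `_merge` is "both", "left_only", or "right_only" for each row.
outer = pd.merge(tb1, tb2, on="key", how="outer", indicator=True)
print(outer)
```

Rows tagged `left_only` or `right_only` are the ones that an inner join drops.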

**Grouping**

By `group by` we are referring to a process involving one or more of the following steps

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

See the Grouping section from the `pandas` official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
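
The three steps can be sketched by hand on a small made-up dataframe (`toy` is a hypothetical name) and compared with the one-line `groupby` call:

```python
import pandas as pd

toy = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# 1. Split: iterating over a groupby yields (group name, sub-dataframe) pairs.
# 2. Apply: compute the mean of column C within each group.
pieces = {}
for name, group in toy.groupby("A"):
    pieces[name] = group["C"].mean()

# 3. Combine: collect the per-group results into one Series.
manual = pd.Series(pieces, name="C")

# The usual one-liner does all three steps at once.
builtin = toy.groupby("A")["C"].mean()

print(manual)
print(builtin)
```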

In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df

In [None]:
df.groupby('A')['C'].mean().reset_index() # simple stats grouped by 1 column

In [None]:
df.groupby(['A','B']).sum().reset_index() # simple stats grouped by multiple columns

In [None]:
df.groupby(['A','B']).mean().reset_index() # simple stats grouped by multiple columns

## Write/Export `dataframe` to files

**CSV file**

In [None]:
df

In [None]:
df.to_csv('../data/to-csv-test.csv',sep=',',header=True)
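
Note that `to_csv` writes the row index by default. The sketch below uses a small made-up dataframe (`out`) and in-memory strings instead of files to show the effect when the CSV is read back:

```python
import io
import pandas as pd

out = pd.DataFrame({"A": ["foo", "bar"], "C": [1.5, 2.5]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = out.to_csv(sep=",", header=True)                 # index=True is the default
without_index = out.to_csv(sep=",", header=True, index=False)

back1 = pd.read_csv(io.StringIO(with_index))
back2 = pd.read_csv(io.StringIO(without_index))

print(list(back1.columns))   # the unnamed index comes back as an extra column
print(list(back2.columns))
```

Passing `index=False` when writing, or `index_col=0` when reading, avoids the extra column.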

**Excel spreadsheet**

In [None]:
df.to_excel('../data/to-excel-test.xlsx',sheet_name='tab1',header=True,index=None)

# Pandas Exercise

A long time ago, Microsoft created a fictitious multinational manufacturing company called Adventure Works and shipped the AdventureWorks database as part of SQL Server.

**TASK**
1. Write the pandas expression that produces the table described in each problem statement.
2. The SQL expression may give you a hint. It also lets you see both systems side by side.
3. If you don't know SQL, just ignore the SQL code.

In [None]:
import pandas as pd
import numpy as np

In [None]:
pd.set_option('display.max_columns',None) #unlimited
pd.set_option('display.max_rows',None)

## Import the dataset

In [None]:
%%time

Employees = pd.read_excel('../data/Employees.xls')
Territory = pd.read_excel('../data/SalesTerritory.xls')
Customers = pd.read_excel('../data/Customers.xls')
Orders = pd.read_excel('../data/ItemsOrdered.xls')

In [None]:
Employees.head(3)

In [None]:
Employees.shape

In [None]:
Territory.head(3)

In [None]:
Territory.shape

In [None]:
Territory

In [None]:
Customers.head(3)

In [None]:
Customers.shape

In [None]:
Orders.head(3)

In [None]:
Orders.shape

## Filtering

### Provide a list of employees that are married

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.MaritalStatus = 'M';
```

In [None]:
Employees.head(3)

In [None]:
Employees.MaritalStatus.nunique()

In [None]:
Employees.MaritalStatus.unique()

In [None]:
## select by condition
Employees.loc[Employees.MaritalStatus == 'M', ['EmployeeID', 'FirstName', 'LastName']].head(3)

In [None]:
x = Employees.loc[Employees.MaritalStatus == 'M', ['EmployeeID', 'FirstName', 'LastName']].shape

# type(x)
x

In [None]:
x[0]

In [None]:
Employees.loc[Employees.MaritalStatus == 'S', ['EmployeeID', 'FirstName', 'LastName']].shape[0]

In [None]:
x[1]

In [None]:
Employees.loc[Employees.MaritalStatus == 'S', ['EmployeeID', 'FirstName', 'LastName']].head(3)

### Show me a list of employees that have a last name that begins with "R"

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.LastName LIKE 'R%';
```

In [None]:
'Robert'.startswith('R')

In [None]:
'Robert'.startswith('S')

In [None]:
Employees.loc[Employees.LastName.str.startswith("R"), ['EmployeeID', 'FirstName', 'LastName']].head(10)

In [None]:
Employees.loc[Employees.LastName.str.startswith("R"), ['EmployeeID', 'FirstName', 'LastName']].shape[0]

In [None]:
Employees.loc[Employees.LastName.map(lambda x: str(x).startswith('R')), ['EmployeeID', 'FirstName', 'LastName']].shape[0]

In [None]:
Employees.loc[Employees.LastName.map(lambda x: str(x).startswith('R')), ['EmployeeID', 'FirstName', 'LastName']].head(3)

### Show me a list of employees that have a last name that ends with "r"

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.LastName LIKE '%r';
```

In [None]:
'Robert'.endswith('a')

In [None]:
'Robert'.endswith('t')

In [None]:
Employees.loc[Employees.LastName.map(lambda x: str(x).endswith('r')), ['EmployeeID', 'FirstName', 'LastName']].head(10)

### Provide a list of employees that have a hyphenated last name.

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.LastName LIKE '%-%';
```

In [None]:
'd' in 'abc'

In [None]:
'b' in 'abc'

In [None]:
Employees.loc[Employees.LastName.map(lambda x: '-' in str(x)), 
              ['EmployeeID', 'FirstName', 'LastName']].shape[0]

In [None]:
Employees.loc[Employees.LastName.map(lambda x: '-' in str(x)), 
              ['EmployeeID', 'FirstName', 'LastName']]

In [None]:
Employees.loc[Employees.LastName.apply(lambda x: '-' in str(x)), 
              ['EmployeeID', 'FirstName', 'LastName']]

In [None]:
Employees.loc[Employees.LastName.str.contains('-'), 
              ['EmployeeID', 'FirstName', 'LastName']].head(3)
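
The three SQL `LIKE` patterns used above map onto the vectorized `.str` methods. The sketch below uses a made-up series of names (including a missing one) and passes `na=False` so that missing names count as non-matches, which is an alternative to wrapping each value in `str(x)`:

```python
import pandas as pd

names = pd.Series(["Roberts", "Miller", "Smith-Jones", None, "Raheem"])

starts_r = names.str.startswith("R", na=False)               # LIKE 'R%'
ends_r = names.str.endswith("r", na=False)                   # LIKE '%r'
has_hyphen = names.str.contains("-", regex=False, na=False)  # LIKE '%-%'

print(names[starts_r].tolist())
print(names[ends_r].tolist())
print(names[has_hyphen].tolist())
```

These pandas methods are case sensitive, whereas the case sensitivity of `LIKE` depends on the database configuration.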

### Provide a list of employees that are on salary and have more than 35 vacation hours left.

SQL logic
```sql
SELECT 	
  e.EmployeeID
  , e.FirstName
  , e.LastName
  , e.VacationHours
  , e.SalariedFlag
FROM dbo.Employees AS e
WHERE (e.SalariedFlag = 1) AND (e.VacationHours > 35);
```

In [None]:
Employees.head(3)

In [None]:
Employees.columns

In [None]:
Employees.SalariedFlag.unique()

In [None]:
Employees.info()

In [None]:
Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35), 
              ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].head(3)

In [None]:
Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35), 
              ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].shape[0]

In [None]:
Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35), 
              ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].EmployeeID.nunique()


### Show the same as above but limit it to American employees. [practice]

SQL logic
```sql
SELECT DISTINCT CountryName FROM dbo.Employees;

SELECT 	
  e.EmployeeID 
  , e.FirstName
  , e.LastName
  , e.VacationHours
  , e.SalariedFlag
  , e.CountryName
FROM dbo.Employees AS e
WHERE 
  e.SalariedFlag = 1
  AND e.VacationHours > 35
  AND e.CountryName = 'United States';
```

In [None]:
Employees.CountryName.unique()

In [None]:
Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35)&(Employees.CountryName=='United States'), 
              ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].shape[0]

### Change the logic to include anyone who meets any of the 3 conditions (i.e., people who are either married, live in Washington state, or have more than 35 vacation hours left)

SQL logic
```sql
SELECT 	
  e.EmployeeID
  ,e.FirstName
  ,e.LastName
  ,e.MaritalStatus
  ,e.VacationHours
  ,e.SalariedFlag
  ,e.StateProvinceName
  ,e.CountryName
FROM dbo.Employees AS e
WHERE 
  e.MaritalStatus = 'M' 
  OR e.VacationHours > 35 
  OR e.StateProvinceName = 'Washington'
	;
```

In [None]:
Employees.loc[(Employees.MaritalStatus=='M')|(Employees.VacationHours>35)|(Employees.StateProvinceName=='Washington'), 
              ['EmployeeID', 'FirstName', 'LastName','MaritalStatus','VacationHours','SalariedFlag','StateProvinceName','CountryName']].head(3)

In [None]:
Employees.loc[(Employees.MaritalStatus=='M')|(Employees.VacationHours>35)|(Employees.StateProvinceName=='Washington'), 
              ['EmployeeID', 'FirstName', 'LastName','MaritalStatus','VacationHours','SalariedFlag','StateProvinceName','CountryName']].EmployeeID.nunique()
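
A small sketch of how the boolean masks combine. The column names mirror the exercise, but the dataframe `emp` and its values are made up. Each comparison needs its own parentheses because `&`, `|`, and `~` bind more tightly than `==` and `>`:

```python
import pandas as pd

emp = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "MaritalStatus": ["M", "S", "S", "M"],
    "VacationHours": [40, 10, 50, 20],
})

married = emp.MaritalStatus == "M"
rested = emp.VacationHours > 35

both = emp.loc[married & rested]          # SQL: ... AND ...
either = emp.loc[married | rested]        # SQL: ... OR ...
neither = emp.loc[~(married | rested)]    # SQL: NOT (... OR ...)

print(both.EmployeeID.tolist())
print(either.EmployeeID.tolist())
print(neither.EmployeeID.tolist())
```

Naming the masks first, as done here, keeps long conditions readable.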

## Joins
![](../pics/joins.jpg)

### If any are salespeople then show me the details about their sales territory
```sql
SELECT e.EmployeeID ,e.FirstName + ' ' + e.LastName AS EmployeeName ,st.* 
FROM dbo.Employees AS e 
INNER JOIN dbo.SalesTerritory AS st ON e.TerritoryID = st.TerritoryID
```

In [None]:
Territory.shape

In [None]:
Territory

In [None]:
Employees.columns

In [None]:
Employees.head(10)

In [None]:
df = pd.merge(
    Employees.loc[:,['TerritoryID','EmployeeID','FirstName','LastName']],
    Territory,
    on = 'TerritoryID',
    how = 'inner'
)
df.head(3)

In [None]:
df.EmployeeID.nunique()

In [None]:
df = pd.merge(
    Employees.loc[:,['TerritoryID','EmployeeID','FirstName','LastName']],
    Territory,
    on = 'TerritoryID',
    how = 'left'
)
df.head(3)

In [None]:
df.shape

In [None]:
Employees.shape

In [None]:
df['EmployeeName'] = [
    first + ' ' + last
    for first,last in zip(df.FirstName, df.LastName)
]

In [None]:
df.head(3)

### For all sales territories, also show what customers fall under them
```sql
SELECT * 
FROM dbo.SalesTerritory AS st 
LEFT OUTER JOIN dbo.Customers AS c ON c.SalesTerritoryID = st.TerritoryID ;
```

In [None]:
Customers.head(3)

In [None]:
Territory.head(3)

In [None]:
Territory.shape

In [None]:
df = pd.merge(
    Territory,
    Customers,
#     on = 'TerritoryID',
    left_on='TerritoryID',
    right_on='SalesTerritoryID',
    how = 'left'
)
df.head(3)

In [None]:
df.shape

In [None]:
df = pd.merge(
    Territory,
    Customers.rename(columns={'SalesTerritoryID':'TerritoryID'}),
    on = 'TerritoryID',
    how = 'left'
)
df.head(3)

Are there any sales territories that don't have any customers associated?

In [None]:
# df.loc[condition, column_list]
df[df.CustomerID.isna()]

In [None]:
df[df.CustomerID.isna()].shape[0]

## Grouping

Reading Materials: 
* (official doc): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
* (summary) https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

### What is the earliest birthdate for all employees?

SQL logic
```sql
SELECT MIN(e.BirthDate) FROM dbo.Employees AS e;
```

In [None]:
Employees.head(3)

In [None]:
Employees.columns

In [None]:
type(Employees.dtypes)

In [None]:
Employees.loc[:,['BirthDate']].head(3)

In [None]:
Employees.dtypes

In [None]:
# Employees.dtypes.reset_index()
# Employees.dtypes['BirthDate']
str(Employees.dtypes['BirthDate'])

In [None]:
Employees.BirthDate.dtypes

In [None]:
'1970-01-01' < '2023-06-26'

In [None]:
Employees.BirthDate.min()

In [None]:
Employees.BirthDate.max()

In [None]:
Employees.BirthDate.nunique()

### Add to the above, the most recent birthdate for all employees

SQL logic
```sql
SELECT 
  MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e;
```

In [None]:
x = [4,5,1,2,3]
min(x), max(x)

* Lexicographic order [[wikipedia](https://en.wikipedia.org/wiki/Lexicographic_order)]

In [None]:
'2ab' < '1ab'

In [None]:
# 'abcdefg'

'a' > 'b'
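
Lexicographic order is why `min`/`max` give sensible answers for dates stored as `YYYY-MM-DD` strings, but not for other date formats. A minimal sketch with made-up date strings:

```python
import pandas as pd

# ISO-style strings sort the same way as the dates they represent.
iso_ok = "1970-01-01" < "2023-06-26"

# Month/day/year strings do not: "9" is compared with "1" first.
us_wrong = "9/1/2020" < "10/1/2020"

# Parsing to datetimes compares actual dates instead of characters.
parsed = pd.to_datetime(["9/1/2020", "10/1/2020"], format="%m/%d/%Y")
us_right = parsed[0] < parsed[1]

print(iso_ok, us_wrong, us_right)
```

Checking the column's dtype before calling `min()` or `max()` tells you which kind of comparison is being used.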

In [None]:
Employees.agg({'BirthDate':['min','max']}).T

# Employees.agg({'BirthDate':['min','max']})

In [None]:
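# the built-in functions also work, but recent pandas versions may emit a FutureWarning; the strings 'min'/'max' are the safer choice
Employees.agg({'BirthDate':[min,max]}).T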

In [None]:
Employees.agg({'BirthDate':[min,max]}).T.reset_index(drop=True)
# Employees.agg({'BirthDate':['min','max']}).T.reset_index(drop=False)

### Show the above results broken down by gender

SQL logic
```sql
SELECT 
  e.Gender
  , MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e
GROUP BY e.Gender
;
```

In [None]:
Employees.groupby('Gender')['BirthDate'].min().reset_index()

In [None]:
Employees.groupby('Gender').agg({'BirthDate':[min,max]})

In [None]:
Employees.groupby('Gender').agg(
    min_bday=('BirthDate',min),
    max_bday=('BirthDate',max)
).reset_index()

### Show the above results broken down by gender, and salaried/hourly

SQL logic
```sql
SELECT 
  e.Gender
  , e.SalariedFlag
  , MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e
GROUP BY e.Gender, e.SalariedFlag
;
```

In [None]:
Employees.groupby(['Gender','SalariedFlag']).agg(
    min_bday=('BirthDate',min),
    max_bday=('BirthDate',max)
).reset_index()

### What are the average vacation hours for all employees?

SQL logic
```sql
SELECT AVG(e.VacationHours)
FROM dbo.Employees AS e	
;
```

In [None]:
Employees.VacationHours.mean()

### Show the above results broken down and ordered by job title

SQL logic
```sql
SELECT 
  e.JobTitle
  , AVG(e.VacationHours) AS 'Average Vacation'
  , MIN(e.VacationHours) AS 'Minimum Vacation'
FROM dbo.Employees AS e
GROUP BY e.JobTitle
;
```

In [None]:
Employees.groupby('JobTitle')['VacationHours'].min().reset_index().head(3)

In [None]:
Employees.groupby('JobTitle')['VacationHours'].mean().reset_index().head(3)

In [None]:
Employees.groupby('JobTitle')['VacationHours'].apply(lambda x: sum(x)/len(x)).reset_index().head(3)

In [None]:
Employees.groupby('JobTitle').agg(
    avg_pto_left=('VacationHours',lambda x: sum(x)/len(x)),
    min_pto_left=('VacationHours',min)
).reset_index()

In [None]:
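# customized aggregation; `df` must be the toy dataframe (columns A, B, C, D) from the
# earlier Grouping demo, because the Joins section above reassigned `df`
# df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x**2)).reset_index()
df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x)).reset_index()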
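
A self-contained sketch of a customized aggregation, using a small made-up dataframe (`toy`, hypothetical values) so that it does not depend on the current contents of `df`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": ["one", "one", "two", "one", "one"],
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Custom logic per group: the sum of squares of column C.
sum_sq = (
    toy.groupby(["A", "B"])["C"]
    .apply(lambda x: np.sum(x ** 2))
    .reset_index()
)

print(sum_sq)
```

The lambda receives the values of `C` for one `(A, B)` group at a time as a Series, so any function that reduces a Series to a single value can be used.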