# Today's Coding Topics
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/data-programming-with-python/blob/main/2023-summmer/2023-06-26/notebook/concept_and_code_demo.ipynb)

* Recap of previous lecture
* `Pandas` exercise


## Pandas intro

* `pandas` is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
* It ships with the Anaconda distribution, so no separate installation is needed there.
* When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean and process your data. In pandas, a data table is called a DataFrame.

<img align="center" src="../pics/dataframe-structure.png" style="height:300px;">


In [None]:
import pandas as pd
import numpy as np  # needed for np.arange / np.random in the cells below

In [None]:
pd.__version__

In [None]:
x = {
    'A':[1,2,'a',4],
    'B':np.arange(5,9),
    'C':['abc','def','ghi','jkl']
}

In [None]:
# create df from a dictionary
df1 = pd.DataFrame(x)

In [None]:
df1

In [None]:
y = [
    ['a','b','c'],
    ['d','e','f']
]

In [None]:
y

In [None]:
# create df from a list of lists (one inner list per row)
df2 = pd.DataFrame(y, columns=['col1','col2','col3'])
df2

## Create `dataframe` from text file

In [None]:
df = pd.read_csv('../data/imf-gdp-per-capita-2015.csv',sep=',',header=0, thousands=',')

In [None]:
df

In [None]:
df.head(2)

In [None]:
df.tail(2)

In [None]:
df.info()

## Left-over topics

In [None]:
# create a dataframe from a numpy array, with columns labeled
df = pd.DataFrame(np.random.randn(6,4), columns = ['Ann', "Bob", "Charly", "Don"])
df

**Apply functions/logic to the data**

In [None]:
df

In [None]:
df.apply(np.cumsum) # apply the function on all columns

In [None]:
df.apply(lambda x: -x) # apply the function on all columns

In [None]:
df.Don.map(lambda x: x+1) # apply the function on one single column

**`dataframe` and table operations**

In [None]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['a','b','c','d'])
df

**Concat**

In [None]:
pieces = [df[:3], df[7:]]
print("pieces:\n", pieces)
print("put back together:\n")
# pd.concat(pieces, axis=1)
pd.concat(pieces, axis=0)

**Append new data from another `dataframe`**

In [None]:
df_p2 = pd.DataFrame(np.random.randn(4, 4), columns=['a','b','c','d'])
df_p2
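
The cell above only builds the second frame. One way to append its rows to the first is `pd.concat` with `ignore_index=True` (`DataFrame.append` was removed in pandas 2.0). This is a minimal sketch on stand-in frames; the names `top`, `extra`, and `combined` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in frames with the same columns as `df` and `df_p2` above
top = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
extra = pd.DataFrame(np.random.randn(4, 4), columns=['a', 'b', 'c', 'd'])

# Stack the rows; ignore_index=True renumbers the result from 0
combined = pd.concat([top, extra], axis=0, ignore_index=True)
print(combined.shape)
```

Without `ignore_index=True`, the original row labels are kept, so the labels 0 to 3 would appear twice in the result.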

**Joins**

More details at https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
![](../pics/joins.jpg)

In [None]:
tb1 = pd.DataFrame({'key': ['foo', 'boo', 'foo'], 'lval': [1, 2, 3]})
tb2 = pd.DataFrame({'key': ['foo', 'coo'], 'rval': [5, 6]})

In [None]:
tb1

In [None]:
tb2

In [None]:
pd.merge(tb1, tb2, on='key', how='inner')

In [None]:
pd.merge(tb1, tb2, on='key', how='left')

In [None]:
pd.merge(tb1, tb2, on='key', how='right')

In [None]:
pd.merge(tb1, tb2, on='key', how='outer')
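
A quick way to compare the four join types is to count the rows each one returns. The sketch below rebuilds the same two toy tables so it runs on its own; the `rows` dictionary is just a helper for this illustration.

```python
import pandas as pd

tb1 = pd.DataFrame({'key': ['foo', 'boo', 'foo'], 'lval': [1, 2, 3]})
tb2 = pd.DataFrame({'key': ['foo', 'coo'], 'rval': [5, 6]})

# Number of result rows for each join type
rows = {}
for how in ['inner', 'left', 'right', 'outer']:
    rows[how] = len(pd.merge(tb1, tb2, on='key', how=how))

print(rows)
```

Because `foo` appears twice in `tb1`, the single `foo` row of `tb2` is matched twice, which is why the right join returns more rows than `tb2` has.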

**Grouping**

By `group by` we mean a process involving one or more of the following steps:

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

See the Grouping section of the official `pandas` documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
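
To make the three steps concrete, here is a rough sketch that performs them by hand on a tiny made-up table (`toy`) and then compares the result with the usual one-liner. It is meant only to illustrate the idea, not to show how `pandas` implements it internally.

```python
import pandas as pd

# Tiny made-up table, just for illustration
toy = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                    'C': [1.0, 2.0, 3.0, 4.0]})

# 1. Split: iterating over a groupby yields (key, sub-frame) pairs
# 2. Apply: compute one number per group
partial = {key: group['C'].mean() for key, group in toy.groupby('A')}

# 3. Combine: collect the per-group results into one Series
manual = pd.Series(partial, name='C').sort_index()

# The one-liner does all three steps at once
builtin = toy.groupby('A')['C'].mean()
print(manual.to_dict(), builtin.to_dict())
```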

In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df

In [None]:
df.groupby('A')['C'].mean().reset_index() # simple stats grouped by 1 column

In [None]:
df.groupby(['A','B']).sum().reset_index() # simple stats grouped by multiple columns

In [None]:
df.groupby(['A','B']).mean().reset_index() # simple stats grouped by multiple columns

In [None]:
# df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x**2)).reset_index() # customized aggregation
df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x)).reset_index() # customized aggregation

**Pivot table**

In [None]:
df = pd.DataFrame({'ModelNumber' : ['one', 'one', 'two', 'three'] * 3,
                   'Submodel' : ['A', 'B', 'C'] * 4,
                   'Type' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'Xval' : np.random.randn(12),
                   'Yval' : np.random.randn(12)})

df

We can produce pivot tables from this data very easily:

In [None]:
pd.pivot_table(
    df
    , values='Xval'
    , index=['ModelNumber', 'Submodel']
    , columns=['Type']
)

In [None]:
pd.pivot_table(
    df
    , values='Xval'
    , index=['ModelNumber', 'Submodel']
    , columns=['Type']
#     , aggfunc='count'
    , aggfunc=lambda x: x.abs().mean()  # aggfunc must reduce each group to a single value
)

**Write/Export `dataframe` to files**

**CSV file**

In [None]:
df

In [None]:
df.to_csv('../data/to-csv-test.csv', sep=',', header=True, index=False)  # index=False skips the row labels

**Excel spreadsheet**

In [None]:
# writing .xlsx needs an Excel writer engine such as openpyxl
df.to_excel('../data/to-excel-test.xlsx', sheet_name='tab1', header=True, index=False)

# Pandas Exercise

Long ago, Microsoft created a fictitious multinational manufacturing company called Adventure Works and shipped the AdventureWorks database as part of SQL Server.

**TASK**
1. Write the `pandas` expression that produces the table described in each problem statement.
2. The SQL expression may give you a hint, and it lets you compare the two systems side by side.
3. If you don't know SQL, just ignore the SQL code.

In [None]:
import pandas as pd
import numpy as np

In [None]:
pd.set_option('display.max_columns',None) #unlimited
pd.set_option('display.max_rows',None)

## import the dataset

In [None]:
%%time

Employees = pd.read_excel('../data/Employees.xls')
Territory = pd.read_excel('../data/SalesTerritory.xls')
Customers = pd.read_excel('../data/Customers.xls')
Orders = pd.read_excel('../data/ItemsOrdered.xls')

In [None]:
Employees.head(3)

In [None]:
Territory.head(3)

In [None]:
Customers.head(3)

In [None]:
Orders.head(3)

## Filtering

### Provide a list of employees that are married

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.MaritalStatus = 'M';
```

In [None]:
Employees.MaritalStatus.unique()

In [None]:
## select by condition
Employees.loc[Employees.MaritalStatus == 'M', ['EmployeeID', 'FirstName', 'LastName']].head(3)

In [None]:
Employees.loc[Employees.MaritalStatus == 'M', ['EmployeeID', 'FirstName', 'LastName']].shape

In [None]:
Employees.loc[Employees.MaritalStatus == 'S', ['EmployeeID', 'FirstName', 'LastName']].shape[0]

### Show me a list of employees whose last name begins with "R"

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.LastName LIKE 'R%';
```

In [None]:
'Robert'.startswith('R')

In [None]:
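# na=False treats missing last names as non-matches instead of propagating NaN into the mask
Employees.loc[Employees.LastName.str.startswith("R", na=False), ['EmployeeID', 'FirstName', 'LastName']].head(10)

In [None]:
Employees.loc[Employees.LastName.str.startswith("R", na=False), ['EmployeeID', 'FirstName', 'LastName']].shape[0]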

In [None]:
Employees.loc[Employees.LastName.map(lambda x: str(x).startswith('R')), ['EmployeeID', 'FirstName', 'LastName']].shape[0]

### Show me a list of employees whose last name ends with "r"

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.LastName LIKE '%r';
```

In [None]:
# 'Robert'.endswith('t')
'Robert'.endswith('a')

In [None]:
Employees.loc[Employees.LastName.map(lambda x: str(x).endswith('r')), ['EmployeeID', 'FirstName', 'LastName']].head(10)

### Provide a list of employees that have a hyphenated last name

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.LastName LIKE '%-%';
```

In [None]:
'd' in 'abc'

In [None]:
# Employees.loc[Employees.LastName.apply(lambda x: '-' in str(x)), 
#               ['EmployeeID', 'FirstName', 'LastName']].head(3)

Employees.loc[Employees.LastName.apply(lambda x: '-' in str(x)), 
              ['EmployeeID', 'FirstName', 'LastName']].shape[0]

# Employees.loc[Employees.LastName.str.contains('-'), 
#               ['EmployeeID', 'FirstName', 'LastName']].head(3)

### Provide a list of employees that are on salary and have more than 35 vacation hours left.

SQL logic
```sql
SELECT 	
  e.EmployeeID
  , e.FirstName
  , e.LastName
  , e.VacationHours
  , e.SalariedFlag
FROM dbo.Employees AS e
WHERE (e.SalariedFlag = 1) AND (e.VacationHours > 35);
```

In [None]:
Employees.head(3)

In [None]:
Employees.columns

In [None]:
Employees.SalariedFlag.unique()

In [None]:
Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35), 
              ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].head(3)

In [None]:
Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35), 
              ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].shape

In [None]:
Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35), 
              ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].EmployeeID.nunique()

# list(Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35), 
#               ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].EmployeeID)

### Show the same as above but limit it to American employees. [practice]

SQL logic
```sql
SELECT DISTINCT CountryName FROM dbo.Employees;

SELECT 	
  e.EmployeeID 
  , e.FirstName
  , e.LastName
  , e.VacationHours
  , e.SalariedFlag
  , e.CountryName
FROM dbo.Employees AS e
WHERE 
  e.SalariedFlag = 1
  AND e.VacationHours > 35
  AND e.CountryName = 'United States';
```

In [None]:
Employees.CountryName.unique()
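
One possible answer to this practice question adds a third condition to the mask used in the previous problem. The sketch below runs on made-up rows (`emp`) standing in for the `Employees` table, so the values are illustrative only; the column names and the `'United States'` value are taken from the SQL above.

```python
import pandas as pd

# Made-up rows standing in for the Employees table
emp = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'FirstName': ['Ann', 'Bob', 'Cat', 'Dan'],
    'LastName': ['Lee', 'Ray', 'Fox', 'Kim'],
    'VacationHours': [40, 50, 10, 60],
    'SalariedFlag': [1, 1, 1, 0],
    'CountryName': ['United States', 'Canada', 'United States', 'United States'],
})

# All three conditions must hold, so combine them with &
mask = (
    (emp.SalariedFlag == 1)
    & (emp.VacationHours > 35)
    & (emp.CountryName == 'United States')
)
cols = ['EmployeeID', 'FirstName', 'LastName',
        'VacationHours', 'SalariedFlag', 'CountryName']
result = emp.loc[mask, cols]
print(result)
```

Each condition needs its own parentheses because `&` binds more tightly than `==` and `>` in Python.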

### Change the logic to include anyone who meets any of the 3 conditions (i.e., people who are either married, live in Washington state, or have more than 35 vacation hours left)

SQL logic
```sql
SELECT 	
  e.EmployeeID
  ,e.FirstName
  ,e.LastName
  ,e.MaritalStatus
  ,e.VacationHours
  ,e.SalariedFlag
  ,e.StateProvinceName
  ,e.CountryName
FROM dbo.Employees AS e
WHERE 
  e.MaritalStatus = 'M' 
  OR e.VacationHours > 35 
  OR e.StateProvinceName = 'Washington'
	;
```

In [None]:
Employees.loc[(Employees.MaritalStatus=='M')|(Employees.VacationHours>35)|(Employees.StateProvinceName=='Washington'), 
              ['EmployeeID', 'FirstName', 'LastName','MaritalStatus','VacationHours','SalariedFlag','StateProvinceName','CountryName']].head(10)

## Joins
![](../pics/joins.jpg)

### Show me all the employees, and if any are salespeople then show me the details about their sales territory
```sql
SELECT e.EmployeeID, e.FirstName + ' ' + e.LastName AS EmployeeName, st.*
FROM dbo.Employees AS e
LEFT OUTER JOIN dbo.SalesTerritory AS st ON e.TerritoryID = st.TerritoryID;
```

In [None]:
Territory.shape

In [None]:
Territory

In [None]:
Employees.columns

In [None]:
Employees.head(10)

In [None]:
df = pd.merge(
    Employees.loc[:,['TerritoryID','EmployeeID','FirstName','LastName']],
    Territory,
    on = 'TerritoryID',
    how = 'inner'
)
df.head(3)

In [None]:
df.loc[(df.FirstName=='Ben')&(df.LastName=='Miller'),:]

In [None]:
df.shape

In [None]:
df = pd.merge(
    Employees.loc[:,['TerritoryID','EmployeeID','FirstName','LastName']],
    Territory,
    on = 'TerritoryID',
    how = 'left'
)
df.head(3)

In [None]:
df.shape

In [None]:
Employees.shape

In [None]:
df['EmployeeName'] = [
    first + ' ' + last
    for first,last in zip(df.FirstName, df.LastName)
]

In [None]:
df.head(3)

### For those sales territories, also show what customers fall under them
```sql
SELECT * 
FROM dbo.SalesTerritory AS st 
LEFT OUTER JOIN dbo.Customers AS c ON c.SalesTerritoryID = st.TerritoryID ;
```

In [None]:
Customers.head(3)

In [None]:
Territory.head(3)

In [None]:
df = pd.merge(
    Territory,
    Customers,
#     on = 'TerritoryID',
    left_on='TerritoryID',
    right_on='SalesTerritoryID',
    how = 'left'
)
df.head(3)

In [None]:
df = pd.merge(
    Territory,
    Customers.rename(columns={'SalesTerritoryID':'TerritoryID'}),
    on = 'TerritoryID',
    how = 'left'
)
df.head(3)

Are there any sales territories that don't have any customers associated?

In [None]:
# df.loc[condition, column_list]
df[df.CustomerID.isna()]
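
An alternative to checking for missing `CustomerID` values is to let `pd.merge` label each row with `indicator=True`, which adds a `_merge` column. The sketch below uses made-up stand-ins (`terr`, `cust`) for the two tables; the `Name` column and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for Territory and Customers
terr = pd.DataFrame({'TerritoryID': [1, 2, 3],
                     'Name': ['North', 'South', 'East']})
cust = pd.DataFrame({'CustomerID': [10, 11],
                     'SalesTerritoryID': [1, 1]})

joined = pd.merge(
    terr,
    cust,
    left_on='TerritoryID',
    right_on='SalesTerritoryID',
    how='left',
    indicator=True,   # adds a _merge column: 'both' or 'left_only' here
)

# Territories that found no matching customer
no_customers = joined.loc[joined['_merge'] == 'left_only', ['TerritoryID', 'Name']]
print(no_customers)
```

This also works when the right-hand table has legitimately missing values in the column you would otherwise test with `isna()`.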

## Grouping

Reading Materials: 
* (official doc): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
* (summary) https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

### What is the earliest birthdate for all employees?

SQL logic
```sql
SELECT MIN(e.BirthDate) FROM dbo.Employees AS e;
```

In [None]:
Employees.head(3)

In [None]:
Employees.describe()

In [None]:
Employees.dtypes

In [None]:
type(Employees.dtypes)

In [None]:
Employees.loc[:,['BirthDate']].head(3)

In [None]:
# Employees.dtypes.reset_index()
# Employees.dtypes['BirthDate']
str(Employees.dtypes['BirthDate'])

In [None]:
Employees.BirthDate.dtypes

In [None]:
Employees.BirthDate.min()

In [None]:
Employees.BirthDate.max()

In [None]:
Employees.BirthDate.nunique()

### Add to the above, the most recent birthdate for all employees

SQL logic
```sql
SELECT 
  MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e;
```

In [None]:
x = [4,5,1,2,3]
min(x), max(x)

* Lexicographic order [[wikipedia](https://en.wikipedia.org/wiki/Lexicographic_order)]

In [None]:
'2ab' < '1ab'

In [None]:
# 'abcdefg'

'a' > 'b'
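
Strings are compared character by character, so `min` and `max` on date strings give chronological answers only when the format is year-first and zero-padded. The sketch below uses made-up dates to show the difference, and parses the strings with `pd.to_datetime` as the safer route.

```python
import pandas as pd

iso = ['1990-12-01', '1985-03-15', '2001-01-30']   # year-month-day
us = ['12/01/1990', '03/15/1985', '01/30/2001']    # month/day/year

print(min(iso))   # string order matches date order for this format
print(min(us))    # smallest string, which is not the earliest date

# Parsing first makes min/max compare real dates
parsed = pd.to_datetime(pd.Series(us), format='%m/%d/%Y')
earliest = parsed.min()
print(earliest)
```

This is why it is worth checking `Employees.dtypes` before trusting `min()` and `max()` on a date column.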

In [None]:
Employees.agg({'BirthDate':['min','max']}).T

# Employees.agg({'BirthDate':['min','max']})

In [None]:
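# the built-in min/max are accepted too, but some recent pandas releases warn about them; the string names above are the safer choice
Employees.agg({'BirthDate':[min,max]}).T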

In [None]:
Employees.agg({'BirthDate':[min,max]}).T.reset_index(drop=True)
# Employees.agg({'BirthDate':['min','max']}).T.reset_index(drop=False)

### Show the above results broken down by gender

SQL logic
```sql
SELECT 
  e.Gender
  , MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e
GROUP BY e.Gender
;
```

In [None]:
Employees.groupby('Gender')['BirthDate'].min().reset_index()

In [None]:
Employees.groupby('Gender').agg({'BirthDate':[min,max]})

In [None]:
Employees.groupby('Gender').agg(
    min_bday=('BirthDate',min),
    max_bday=('BirthDate',max)
).reset_index()

### Show the above results broken down by gender, and salaried/hourly

SQL logic
```sql
SELECT 
  e.Gender
  , e.SalariedFlag
  , MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e
GROUP BY e.Gender, e.SalariedFlag
;
```

In [None]:
Employees.groupby(['Gender','SalariedFlag']).agg(
    min_bday=('BirthDate',min),
    max_bday=('BirthDate',max)
).reset_index()

### What are the average vacation hours for all employees?

SQL logic
```sql
SELECT AVG(e.VacationHours)
FROM dbo.Employees AS e	
;
```

In [None]:
Employees.VacationHours.mean()

### Show the above results broken down and ordered by job title

SQL logic
```sql
SELECT 
  e.JobTitle
  , AVG(e.VacationHours) AS 'Average Vacation'
  , MIN(e.VacationHours) AS 'Minimum Vacation'
FROM dbo.Employees AS e
GROUP BY e.JobTitle
ORDER BY e.JobTitle
;
```

In [None]:
Employees.groupby('JobTitle')['VacationHours'].min().reset_index().head(3)

In [None]:
Employees.groupby('JobTitle')['VacationHours'].mean().reset_index().head(3)

In [None]:
Employees.groupby('JobTitle')['VacationHours'].apply(lambda x: sum(x)/len(x)).reset_index().head(3)

In [None]:
Employees.groupby('JobTitle').agg(
    avg_pto_left=('VacationHours',lambda x: sum(x)/len(x)),
    min_pto_left=('VacationHours',min)
).reset_index()

# The Python Statistics Landscape

There are many Python statistics libraries for you to work with.

* **Foundation Libraries**
    * `statistics`: built-in Python library for descriptive statistics (link: https://docs.python.org/3/library/statistics.html)
    * `numpy`: numerical computing, numpy arrays, covered in lecture 03
    * `scipy`: scientific computing based on numpy, the `scipy.stats` module (link: https://docs.scipy.org/doc/scipy/reference/stats.html) covers a large number of probability distributions and statistical functions (link: https://www.scipy.org/)
    
* **Data Science Libraries**
    * `pandas`: 1D and 2D labeled data manipulations and computation, covered in lecture 04
    * `statsmodels`: a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration (link: https://www.statsmodels.org/stable/index.html)
    * `matplotlib`: graphs and visualization (link: https://matplotlib.org/)

# Descriptive Statistical Analysis

Descriptive statistics is about describing and summarizing data. It uses two main approaches:

* The quantitative approach describes and summarizes data numerically.
* The visual approach illustrates data with charts, plots, histograms, and other graphs.

You can apply descriptive statistics to one or many datasets or variables. When you describe and summarize a single variable, you’re performing univariate analysis. When you search for statistical relationships among a pair of variables, you’re doing a bivariate analysis. Similarly, a multivariate analysis is concerned with multiple variables at once.
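
As a small illustration of the three kinds of analysis, the sketch below uses a made-up table (`toy`, with invented columns `hours`, `score`, and `sleep`) and standard `pandas` methods.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'hours': [1, 2, 3, 4, 5],
                    'score': [52, 58, 61, 70, 74],
                    'sleep': [8, 7, 7, 6, 5]})

# Univariate: summarize a single variable
hours_summary = toy['hours'].describe()

# Bivariate: relationship between one pair of variables
r = toy['hours'].corr(toy['score'])

# Multivariate: all pairwise correlations at once
corr_matrix = toy.corr()

print(hours_summary)
print(round(r, 2))
print(corr_matrix)
```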


**[Case Study]**

**Atlanta Police Department Crime Data** ![APD Logo](https://atlantapd.galls.com/photos/partners/atlantapd/logo.jpg)


The Atlanta Police Department provides raw crime data at http://www.atlantapd.org/i-want-to/crime-data-downloads


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## Load the 2009-2019 crime data

In [None]:
df = pd.read_csv('../data/COBRA-2009-2019.csv',sep=',',header=0)
df.head(3)

In [None]:
df.shape

In [None]:
df.info()

## Quantitative Analysis

In [None]:
df['rpt_yr'] = df['Report Date'].map(lambda x: x[:4])

df.head(3)

In [None]:
## number of reports every year
# df['rpt_yr'] = df['Report Date'].map(lambda x: x[:4])

num_rpt_by_yr = df.groupby('rpt_yr').agg(
    num_row=('Report Number',len),
    num_rpt=('Report Number',lambda x: len(set(x)))
).reset_index()
num_rpt_by_yr

In [None]:
df.groupby('rpt_yr')['Report Number'].nunique().reset_index()

In [None]:
## number of cases per shift in 2019
num_rpt_by_shift = df[df.rpt_yr=='2019'].groupby('Shift Occurence').agg(
    num_rpt=('Report Number',lambda x: len(set(x)))
).reset_index()
num_rpt_by_shift

In [None]:
## number of cases per shift in the past 3 years
num_rpt_by_yr_shift = df[df.rpt_yr>='2017'].groupby(['rpt_yr','Shift Occurence']).agg(
    num_rpt=('Report Number',lambda x: len(set(x)))
).reset_index()
# num_rpt_by_yr_shift
num_rpt_by_yr_shift.sort_values(by=['Shift Occurence','rpt_yr'])

In [None]:
## % of cases per shift in the past 3 years
num_rpt_by_yr_shift2 = pd.merge(
    num_rpt_by_yr_shift,
    num_rpt_by_yr.loc[:,['rpt_yr','num_rpt']].copy().rename(columns={'num_rpt':'annual_total'}),
    on='rpt_yr'
)
num_rpt_by_yr_shift2.sort_values(by=['Shift Occurence','rpt_yr'])

In [None]:
num_rpt_by_yr_shift2['percent'] = [
    round(subtotal/total,2)
    for subtotal,total in zip(num_rpt_by_yr_shift2.num_rpt,num_rpt_by_yr_shift2.annual_total)
]
num_rpt_by_yr_shift2.sort_values(by=['Shift Occurence','rpt_yr'])

**Can you do better??**
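
One possible answer is to replace the merge and the Python loop with a vectorized `groupby(...).transform('sum')`. The sketch below uses made-up counts in a frame called `counts`. Note one assumption: it divides by the sum of the per-shift counts, which matches the annual distinct-report total only if no report number appears under more than one shift.

```python
import pandas as pd

# Made-up per-year, per-shift counts
counts = pd.DataFrame({
    'rpt_yr': ['2018', '2018', '2019', '2019'],
    'Shift Occurence': ['Day Watch', 'Evening Watch', 'Day Watch', 'Evening Watch'],
    'num_rpt': [30, 70, 45, 55],
})

# transform('sum') returns the yearly total aligned to every row
annual_total = counts.groupby('rpt_yr')['num_rpt'].transform('sum')

# Column arithmetic replaces the zip-based list comprehension
counts['percent'] = (counts['num_rpt'] / annual_total).round(2)
print(counts)
```

Because `transform` keeps the original row order and length, no merge or rename step is needed.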

## Visual Analysis

### Visualize the year-over-year (YOY) change in % by shift with a line chart

In [None]:
num_rpt_by_yr_shift2.rpt_yr = num_rpt_by_yr_shift2.rpt_yr.astype(int)
dw = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Day Watch']
ew = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Evening Watch']
mw = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Morning Watch']
unk = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Unknown']

plt.plot(dw.rpt_yr, dw.percent, '-o', label='day watch')
plt.plot(ew.rpt_yr, ew.percent, '-o', label='evening watch')
plt.plot(mw.rpt_yr, mw.percent, '-o', label='morning watch')
plt.plot(unk.rpt_yr, unk.percent, '-o', label='unknown')

plt.xticks(ticks=[2017,2018,2019])
plt.legend()
plt.show()