# Today's Coding Topics
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/data-programming-with-python/blob/main/2023-summer/2023-06-26/notebook/concept_and_code_demo.ipynb)

* Recap of previous lecture
* `Pandas` exercise


## Pandas intro

* `pandas` is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
* It is included in the Anaconda distribution.
* When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean and process your data. In pandas, a data table is called a DataFrame.

<img align="center" src="../pics/dataframe-structure.png" style="height:300px;">


In [1]:
import pandas as pd

In [2]:
pd.__version__

'2.0.2'

In [4]:
import numpy as np

In [5]:
x = {
    'A':[1,2,'a',4],
    'B':np.arange(5,9),
    'C':['abc','def','ghi','jkl']
}

In [6]:
# create df from a dictionary
df1 = pd.DataFrame(x)

In [7]:
df1

Unnamed: 0,A,B,C
0,1,5,abc
1,2,6,def
2,a,7,ghi
3,4,8,jkl


In [8]:
y = [
    ['a','b','c'],
    ['d','e','f']
]

In [9]:
y

[['a', 'b', 'c'], ['d', 'e', 'f']]

In [10]:
df2 = pd.DataFrame(y)
df2

Unnamed: 0,0,1,2
0,a,b,c
1,d,e,f


In [11]:
# create df from a list
df2 = pd.DataFrame(y, columns=['col1','col2','col3'])
df2

Unnamed: 0,col1,col2,col3
0,a,b,c
1,d,e,f


## Create `dataframe` from text file

In [12]:
# df = pd.read_csv('../data/imf-gdp-per-capita-2015.csv',sep=',',header=0, thousands=',')
df = pd.read_csv('../data/imf-gdp-per-capita-2015.csv',sep=',',header=0)

In [13]:
df

Unnamed: 0,Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,2015,Estimates Start After
0,Afghanistan,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",599.994,2013.0
1,Albania,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",3995.38,2010.0
2,Algeria,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",4318.14,2014.0
3,Angola,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",4100.32,2014.0
4,Antigua and Barbuda,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",14414.30,2011.0
...,...,...,...,...,...,...,...
184,Venezuela,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",7744.75,2010.0
185,Vietnam,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",2088.34,2012.0
186,Yemen,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1302.94,2008.0
187,Zambia,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1350.15,2010.0


In [14]:
type(df)

pandas.core.frame.DataFrame

In [15]:
df.head(2)

Unnamed: 0,Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,2015,Estimates Start After
0,Afghanistan,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",599.994,2013.0
1,Albania,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",3995.38,2010.0


In [16]:
df.tail(2)

Unnamed: 0,Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,2015,Estimates Start After
187,Zambia,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1350.15,2010.0
188,Zimbabwe,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1064.35,2012.0


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 7 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        189 non-null    object 
 1   Subject Descriptor             189 non-null    object 
 2   Units                          189 non-null    object 
 3   Scale                          189 non-null    object 
 4   Country/Series-specific Notes  188 non-null    object 
 5   2015                           187 non-null    object 
 6   Estimates Start After          188 non-null    float64
dtypes: float64(1), object(6)
memory usage: 10.5+ KB
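
Note that `info()` reports the `2015` column as `object` rather than a numeric dtype, so arithmetic on it will not work as expected. The cause in this particular file is not shown here. The sketch below assumes the usual culprits (thousands separators or placeholder strings) and uses a made-up frame named `raw` to show one way to coerce such a column with `pd.to_numeric`.

```python
import pandas as pd

# Toy stand-in for the "2015" column; these values are invented for illustration.
raw = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "2015": ["599.994", "14,414.30", "n/a"],
})

# Strip thousands separators, then turn anything non-numeric into NaN.
no_commas = raw["2015"].str.replace(",", "", regex=False)
raw["2015_num"] = pd.to_numeric(no_commas, errors="coerce")

print(raw.dtypes)
```

With `errors="coerce"`, unparseable entries become `NaN` instead of raising, which makes them easy to count with `isna().sum()`.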


## Left-over topics

In [18]:
# create a dataframe from a numpy array, with columns labeled
df = pd.DataFrame(np.random.randn(6,4), columns = ['Ann', "Bob", "Charly", "Don"])
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-0.529609,-0.992113,-0.186842,-1.201507
1,0.117599,-0.082121,0.4288,2.366233
2,-1.598973,0.815992,-0.282021,-1.14618
3,0.67104,-0.916267,-0.366229,1.077208
4,-0.006897,-1.048466,-0.452772,0.620554
5,-1.545505,-0.465299,1.055477,0.415989


**Apply functions/logic to the data**

In [19]:
df.apply(np.cumsum) # apply the function on all columns

Unnamed: 0,Ann,Bob,Charly,Don
0,-0.529609,-0.992113,-0.186842,-1.201507
1,-0.41201,-1.074234,0.241958,1.164726
2,-2.010983,-0.258242,-0.040063,0.018546
3,-1.339943,-1.174509,-0.406292,1.095754
4,-1.346839,-2.222975,-0.859064,1.716308
5,-2.892344,-2.688274,0.196413,2.132297


In [20]:
df.apply(lambda x: -x) # apply the function on all columns

Unnamed: 0,Ann,Bob,Charly,Don
0,0.529609,0.992113,0.186842,1.201507
1,-0.117599,0.082121,-0.4288,-2.366233
2,1.598973,-0.815992,0.282021,1.14618
3,-0.67104,0.916267,0.366229,-1.077208
4,0.006897,1.048466,0.452772,-0.620554
5,1.545505,0.465299,-1.055477,-0.415989


In [21]:
df.Don.apply(lambda x: x+1) # apply the function on one single column

0   -0.201507
1    3.366233
2   -0.146180
3    2.077208
4    1.620554
5    1.415989
Name: Don, dtype: float64

In [22]:
df[['Don']].apply(lambda x: x+1)

Unnamed: 0,Don
0,-0.201507
1,3.366233
2,-0.14618
3,2.077208
4,1.620554
5,1.415989


In [23]:
df.Don.map(lambda x: x+1) # apply the function on one single column

0   -0.201507
1    3.366233
2   -0.146180
3    2.077208
4    1.620554
5    1.415989
Name: Don, dtype: float64

In [26]:
df[['Don']].map(lambda x: x+1)

AttributeError: 'DataFrame' object has no attribute 'map'
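
The error above fits the pandas version printed earlier (2.0.2): `DataFrame.map` was added in pandas 2.1 as the successor to `DataFrame.applymap`. Here is a minimal sketch, using a made-up three-row frame named `demo`, that runs on either side of that change.

```python
import pandas as pd

demo = pd.DataFrame({"Don": [0.5, -1.0, 2.0]})  # illustrative values only

# DataFrame.map exists from pandas 2.1 onward; older versions expose applymap.
if hasattr(pd.DataFrame, "map"):
    shifted = demo[["Don"]].map(lambda v: v + 1)
else:
    shifted = demo[["Don"]].applymap(lambda v: v + 1)

# For simple arithmetic, a plain vectorized expression avoids the question entirely.
vectorized = demo[["Don"]] + 1

print(shifted)
```

`Series.map` is unaffected by this change, which is why `df.Don.map(...)` works in the earlier cell.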

**`dataframe` and table operations**

In [27]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['a','b','c','d'])
df

Unnamed: 0,a,b,c,d
0,-0.203283,-1.005227,1.45155,-0.210442
1,-1.944991,-1.095959,-1.724071,-0.092406
2,0.734761,-1.019218,0.113762,2.122268
3,0.122388,-0.713059,1.561926,-0.264754
4,0.22125,1.300123,0.116931,0.764971
5,-1.840479,1.528076,0.040473,0.10712
6,-0.725554,0.004309,1.133448,0.246196
7,0.232094,-0.002987,-1.701746,0.353314
8,-0.364404,1.997004,-1.132634,-1.403338
9,-1.623864,-1.325168,1.00076,0.482729


**Concat**

In [28]:
df[:3]

Unnamed: 0,a,b,c,d
0,-0.203283,-1.005227,1.45155,-0.210442
1,-1.944991,-1.095959,-1.724071,-0.092406
2,0.734761,-1.019218,0.113762,2.122268


In [29]:
df[7:]

Unnamed: 0,a,b,c,d
7,0.232094,-0.002987,-1.701746,0.353314
8,-0.364404,1.997004,-1.132634,-1.403338
9,-1.623864,-1.325168,1.00076,0.482729


In [30]:
pieces = [df[:3], df[7:]]
print("pieces:\n", pieces)

pieces:
 [          a         b         c         d
0 -0.203283 -1.005227  1.451550 -0.210442
1 -1.944991 -1.095959 -1.724071 -0.092406
2  0.734761 -1.019218  0.113762  2.122268,           a         b         c         d
7  0.232094 -0.002987 -1.701746  0.353314
8 -0.364404  1.997004 -1.132634 -1.403338
9 -1.623864 -1.325168  1.000760  0.482729]


In [31]:
print("put back together:\n")
pd.concat(pieces, axis=0)

put back together:



Unnamed: 0,a,b,c,d
0,-0.203283,-1.005227,1.45155,-0.210442
1,-1.944991,-1.095959,-1.724071,-0.092406
2,0.734761,-1.019218,0.113762,2.122268
7,0.232094,-0.002987,-1.701746,0.353314
8,-0.364404,1.997004,-1.132634,-1.403338
9,-1.623864,-1.325168,1.00076,0.482729


In [33]:
# pd.concat(pieces, axis=1)
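
The commented-out call above concatenates along columns instead of rows. Since `pd.concat` aligns on the index, pieces with non-overlapping row labels end up side by side with `NaN` filling the gaps. Here is a small sketch with a deterministic frame (`frame` and `parts` are illustrative names).

```python
import numpy as np
import pandas as pd

# Deterministic 10x4 frame so the result is easy to inspect.
frame = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("abcd"))
parts = [frame[:3], frame[7:]]  # row labels 0-2 and 7-9, no overlap

# axis=1 places the pieces next to each other, aligned on the row index.
side_by_side = pd.concat(parts, axis=1)

print(side_by_side.shape)
print(side_by_side)
```

Each piece contributes four columns, and every row is missing from one of the two pieces, so half of the cells are `NaN`.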

In [34]:
print("put back together:\n")
pd.concat(pieces, axis=0, ignore_index=True)

put back together:



Unnamed: 0,a,b,c,d
0,-0.203283,-1.005227,1.45155,-0.210442
1,-1.944991,-1.095959,-1.724071,-0.092406
2,0.734761,-1.019218,0.113762,2.122268
3,0.232094,-0.002987,-1.701746,0.353314
4,-0.364404,1.997004,-1.132634,-1.403338
5,-1.623864,-1.325168,1.00076,0.482729


**Append new data from another `dataframe`**

In [35]:
df_p2 = pd.DataFrame(np.random.randn(4, 4), columns=['a','b','c','d'])
df_p2

Unnamed: 0,a,b,c,d
0,0.856823,1.207297,1.074037,-0.088272
1,2.082606,1.097481,0.327872,1.026959
2,-0.66352,0.174239,-1.134953,1.18888
3,0.180761,-0.921303,-0.192347,0.784204


In [36]:
df

Unnamed: 0,a,b,c,d
0,-0.203283,-1.005227,1.45155,-0.210442
1,-1.944991,-1.095959,-1.724071,-0.092406
2,0.734761,-1.019218,0.113762,2.122268
3,0.122388,-0.713059,1.561926,-0.264754
4,0.22125,1.300123,0.116931,0.764971
5,-1.840479,1.528076,0.040473,0.10712
6,-0.725554,0.004309,1.133448,0.246196
7,0.232094,-0.002987,-1.701746,0.353314
8,-0.364404,1.997004,-1.132634,-1.403338
9,-1.623864,-1.325168,1.00076,0.482729


In [37]:
df.append(df_p2)  # DataFrame.append was removed in pandas 2.0, hence the error below; use pd.concat instead

AttributeError: 'DataFrame' object has no attribute 'append'
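
`DataFrame.append` was removed in pandas 2.0, which explains the error above. `pd.concat` covers the same use case. Here is a small sketch with made-up frames (`top` and `extra` are illustrative names, not objects from this notebook).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
top = pd.DataFrame(rng.standard_normal((3, 4)), columns=list("abcd"))
extra = pd.DataFrame(rng.standard_normal((2, 4)), columns=list("abcd"))

# Stack the rows of `extra` under `top` and renumber the index from 0.
combined = pd.concat([top, extra], ignore_index=True)

print(combined)
```

Without `ignore_index=True`, the original row labels are kept, so the labels 0 and 1 would appear twice.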

**Joins**

More details at https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
![](../pics/joins.jpg)

In [38]:
tb1 = pd.DataFrame({'key': ['foo', 'boo', 'foo'], 'lval': [1, 2, 3]})
tb2 = pd.DataFrame({'key': ['foo', 'coo'], 'rval': [5, 6]})

In [39]:
tb1

Unnamed: 0,key,lval
0,foo,1
1,boo,2
2,foo,3


In [40]:
tb2

Unnamed: 0,key,rval
0,foo,5
1,coo,6


In [41]:
pd.merge(tb1, tb2, on='key', how='inner')

Unnamed: 0,key,lval,rval
0,foo,1,5
1,foo,3,5


In [42]:
pd.merge(tb1, tb2, on='key', how='left')

Unnamed: 0,key,lval,rval
0,foo,1,5.0
1,boo,2,
2,foo,3,5.0


In [43]:
pd.merge(tb1, tb2, on='key', how='right')

Unnamed: 0,key,lval,rval
0,foo,1.0,5
1,foo,3.0,5
2,coo,,6


In [44]:
pd.merge(tb1, tb2, on='key', how='outer')

Unnamed: 0,key,lval,rval
0,foo,1.0,5.0
1,foo,3.0,5.0
2,boo,2.0,
3,coo,,6.0
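
When comparing join types, it can help to see which side each row came from. The sketch below reuses the `tb1` and `tb2` definitions from above and adds the `indicator=True` option of `pd.merge`, which appends a `_merge` column. The name `audit` is only illustrative.

```python
import pandas as pd

tb1 = pd.DataFrame({"key": ["foo", "boo", "foo"], "lval": [1, 2, 3]})
tb2 = pd.DataFrame({"key": ["foo", "coo"], "rval": [5, 6]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
audit = pd.merge(tb1, tb2, on="key", how="outer", indicator=True)

print(audit)
print(audit["_merge"].value_counts())
```

Rows marked `both` are exactly the ones an inner join would return.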


**Grouping**

By `group by` we are referring to a process involving one or more of the following steps:

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

See the Grouping section of the official `pandas` documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
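
The three steps above can be sketched in one expression. This example uses a tiny made-up frame (`sales`) and named aggregation, so that each output column gets an explicit label. It is one of several possible spellings, not the only one.

```python
import pandas as pd

sales = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

summary = (
    sales.groupby("A")                                  # split into groups by A
    .agg(total=("C", "sum"), average=("C", "mean"))     # apply per group
    .reset_index()                                      # combine into a flat table
)

print(summary)
```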

In [45]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df

Unnamed: 0,A,B,C,D
0,foo,one,-0.442363,1.267936
1,bar,one,0.832728,0.477826
2,foo,two,1.668707,-0.960048
3,bar,three,-0.733803,-0.978669
4,foo,two,0.421418,-1.591831
5,bar,two,0.439527,-0.523856
6,foo,one,-0.108889,1.091792
7,foo,three,1.192294,-0.061249


In [46]:
df.groupby('A')['C'].mean().reset_index() # simple stats grouped by 1 column

Unnamed: 0,A,C
0,bar,0.179484
1,foo,0.546233


In [47]:
df.groupby(['A','B']).sum().reset_index() # simple stats grouped by multiple columns

Unnamed: 0,A,B,C,D
0,bar,one,0.832728,0.477826
1,bar,three,-0.733803,-0.978669
2,bar,two,0.439527,-0.523856
3,foo,one,-0.551253,2.359727
4,foo,three,1.192294,-0.061249
5,foo,two,2.090124,-2.551879


In [48]:
df.groupby(['A','B']).mean().reset_index() # simple stats grouped by multiple columns

Unnamed: 0,A,B,C,D
0,bar,one,0.832728,0.477826
1,bar,three,-0.733803,-0.978669
2,bar,two,0.439527,-0.523856
3,foo,one,-0.275626,1.179864
4,foo,three,1.192294,-0.061249
5,foo,two,1.045062,-1.275939


In [49]:
df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x)).reset_index() # customized aggregation

Unnamed: 0,A,B,C
0,bar,one,0.832728
1,bar,three,-0.733803
2,bar,two,0.439527
3,foo,one,-0.551253
4,foo,three,1.192294
5,foo,two,2.090124


In [50]:
df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x**2)).reset_index() # customized aggregation


Unnamed: 0,A,B,C
0,bar,one,0.693435
1,bar,three,0.538467
2,bar,two,0.193184
3,foo,one,0.207542
4,foo,three,1.421566
5,foo,two,2.962175


In [52]:
output = df.groupby(['A','B'])[['C']].apply(lambda x: np.sum(x**2))
# type(output)

output

Unnamed: 0_level_0,Unnamed: 1_level_0,C
A,B,Unnamed: 2_level_1
bar,one,0.693435
bar,three,0.538467
bar,two,0.193184
foo,one,0.207542
foo,three,1.421566
foo,two,2.962175


In [53]:
df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x**2)).reset_index()

Unnamed: 0,A,B,C
0,bar,one,0.693435
1,bar,three,0.538467
2,bar,two,0.193184
3,foo,one,0.207542
4,foo,three,1.421566
5,foo,two,2.962175


**Pivot table**

In [54]:
df = pd.DataFrame({'ModelNumber' : ['one', 'one', 'two', 'three'] * 3,
                   'Submodel' : ['A', 'B', 'C'] * 4,
                   'Type' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'Xval' : np.random.randn(12),
                   'Yval' : np.random.randn(12)})

df

Unnamed: 0,ModelNumber,Submodel,Type,Xval,Yval
0,one,A,foo,0.023584,-0.570346
1,one,B,foo,0.438156,-0.215334
2,two,C,foo,-0.909017,-2.011637
3,three,A,bar,-0.727137,0.933237
4,one,B,bar,-0.956029,0.5777
5,one,C,bar,0.295525,-1.911593
6,two,A,foo,-0.113633,-0.376586
7,three,B,foo,-0.110332,-0.42094
8,one,C,foo,1.785561,-0.327511
9,one,A,bar,0.419486,0.10426


We can produce pivot tables from this data very easily:

In [55]:
pd.pivot_table(
    df
    , values='Xval'
    , index=['ModelNumber', 'Submodel']
    , columns=['Type']
)

Unnamed: 0_level_0,Type,bar,foo
ModelNumber,Submodel,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,0.419486,0.023584
one,B,-0.956029,0.438156
one,C,0.295525,1.785561
three,A,-0.727137,
three,B,,-0.110332
three,C,0.284524,
two,A,,-0.113633
two,B,0.605151,
two,C,,-0.909017


In [56]:
pd.pivot_table(
    df
    , values='Xval'
    , index=['ModelNumber', 'Submodel']
    , columns=['Type']
#     , aggfunc='count'
    , aggfunc=lambda x: abs(x)  # not a true aggregation; it works here only because each group holds a single row
)

Unnamed: 0_level_0,Type,bar,foo
ModelNumber,Submodel,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,0.419486,0.023584
one,B,0.956029,0.438156
one,C,0.295525,1.785561
three,A,0.727137,
three,B,,0.110332
three,C,0.284524,
two,A,,0.113633
two,B,0.605151,
two,C,,0.909017
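
In the tables above every combination holds at most one row, so the choice of `aggfunc` hardly matters. The sketch below uses a made-up frame (`toy`) in which one cell receives two rows, to show what `aggfunc` actually does, and adds `fill_value` to replace the empty cells.

```python
import pandas as pd

toy = pd.DataFrame({
    "ModelNumber": ["one", "one", "one", "two"],
    "Type": ["foo", "foo", "bar", "foo"],
    "Xval": [1.0, 3.0, 5.0, 7.0],
})

table = pd.pivot_table(
    toy,
    values="Xval",
    index="ModelNumber",
    columns="Type",
    aggfunc="mean",   # the two ("one", "foo") rows are averaged into one cell
    fill_value=0,     # combinations with no rows show 0 instead of NaN
)

print(table)
```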


**Write/Export `dataframe` to files**

**CSV file**

In [57]:
df

Unnamed: 0,ModelNumber,Submodel,Type,Xval,Yval
0,one,A,foo,0.023584,-0.570346
1,one,B,foo,0.438156,-0.215334
2,two,C,foo,-0.909017,-2.011637
3,three,A,bar,-0.727137,0.933237
4,one,B,bar,-0.956029,0.5777
5,one,C,bar,0.295525,-1.911593
6,two,A,foo,-0.113633,-0.376586
7,three,B,foo,-0.110332,-0.42094
8,one,C,foo,1.785561,-0.327511
9,one,A,bar,0.419486,0.10426


In [59]:
df.to_csv('../data/to-csv-test.csv', sep=',', header=True, index=False)
# df.to_csv('../data/to-csv-test.csv',sep=',',header=True)
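
A quick way to check an export is to read it back. The sketch below writes to an in-memory buffer rather than to `../data/`, so it runs anywhere. The frame `out` is made up for illustration.

```python
import io

import pandas as pd

out = pd.DataFrame({"Submodel": ["A", "B"], "Xval": [0.5, -1.25]})

# Write without the row index, then read the same text back in.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)
back = pd.read_csv(buffer)

print(back)
```

If the index were written (the default), reading the file back would produce an extra `Unnamed: 0` column.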

**Excel spreadsheet**

In [60]:
df.to_excel('../data/to-excel-test.xlsx', sheet_name='tab1', header=True, index=False)

# Pandas Exercise

Adventure Works is a fictitious multinational manufacturing company that Microsoft created long ago; its AdventureWorks sample database shipped as part of SQL Server.

**TASK**
1. Write the Python Pandas expression to produce a table as described in the problem statements.
2. The SQL expression may give you a hint. It also allows you to see both systems side-by-side.
3. If you don't know SQL, just ignore the SQL code.

In [61]:
import pandas as pd
import numpy as np

In [62]:
pd.set_option('display.max_columns',None) #unlimited
pd.set_option('display.max_rows',None)

## Import the dataset

In [63]:
%%time

Employees = pd.read_excel('../data/Employees.xls')
Territory = pd.read_excel('../data/SalesTerritory.xls')
Customers = pd.read_excel('../data/Customers.xls')
Orders = pd.read_excel('../data/ItemsOrdered.xls')

CPU times: user 44.4 ms, sys: 6.8 ms, total: 51.2 ms
Wall time: 62.7 ms


In [64]:
Employees.head(3)

Unnamed: 0,EmployeeID,ManagerID,TerritoryID,Title,FirstName,MiddleName,LastName,Suffix,JobTitle,NationalIDNumber,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,PhoneNumber,PhoneNumberType,EmailAddress,AddressLine1,AddressLine2,City,StateProvinceName,PostalCode,CountryName
0,259,250.0,,,Ben,T,Miller,,Buyer,20269531,1967-07-05,M,M,2004-04-09,0,55,47,151-555-0113,Work,ben0@adventure-works.com,101 Candy Rd.,,Redmond,Washington,98052,United States
1,278,274.0,6.0,,Garrett,R,Vargas,,Sales Representative,234474252,1969-03-07,M,M,2005-07-01,1,33,36,922-555-0165,Work,garrett1@mapleleafmail.ca,10203 Acorn Avenue,,Calgary,Alberta,T2P 2G8,Canada
2,204,26.0,,,Gabe,B,Mares,,Production Technician - WC40,440379437,1982-06-11,M,M,2003-04-09,0,57,48,310-555-0117,Work,gabe0@adventure-works.com,1061 Buskrik Avenue,,Edmonds,Washington,98020,United States


In [70]:
Employees.shape

(291, 26)

In [65]:
Territory.head(3)

Unnamed: 0,TerritoryID,Name,CountryCode,Region,SalesYTD,SalesLastYear
0,1,Northwest,US,North America,7887186.79,3298694.49
1,2,Northeast,US,North America,2402176.85,3607148.94
2,3,Central,US,North America,3072175.12,3205014.08


In [66]:
Territory.shape

(12, 6)

In [67]:
Territory

Unnamed: 0,TerritoryID,Name,CountryCode,Region,SalesYTD,SalesLastYear
0,1,Northwest,US,North America,7887186.79,3298694.49
1,2,Northeast,US,North America,2402176.85,3607148.94
2,3,Central,US,North America,3072175.12,3205014.08
3,4,Southwest,US,North America,10510853.87,5366575.71
4,5,Southeast,US,North America,2538667.25,3925071.43
5,6,Canada,CA,North America,6771829.14,5693988.86
6,7,France,FR,Europe,4772398.31,2396539.76
7,8,Germany,DE,Europe,3805202.35,1307949.79
8,9,Australia,AU,Pacific,5977814.92,2278548.98
9,10,United Kingdom,GB,Europe,5012905.37,1635823.4


In [68]:
Customers.head(3)

Unnamed: 0,CustomerID,SalesTerritoryID,FirstName,LastName,City,StateName
0,10101,1,John,Gray,Lynden,Washington
1,10298,4,Leroy,Brown,Pinetop,Arizona
2,10299,1,Elroy,Keller,Snoqualmie,Washington


In [69]:
Customers.shape

(17, 6)

In [71]:
Orders.head(3)

Unnamed: 0,CustomerID,OrderDate,Item,Quantity,Price
0,10330,2004-06-30,Pogo stick,1,28.0
1,10101,2004-06-30,Raft,1,58.0
2,10298,2004-07-01,Skateboard,1,33.0


In [72]:
Orders.shape

(32, 5)

## Filtering

### Provide a list of employees that are married

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.MaritalStatus = 'M';
```

In [73]:
Employees.head(3)

Unnamed: 0,EmployeeID,ManagerID,TerritoryID,Title,FirstName,MiddleName,LastName,Suffix,JobTitle,NationalIDNumber,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,PhoneNumber,PhoneNumberType,EmailAddress,AddressLine1,AddressLine2,City,StateProvinceName,PostalCode,CountryName
0,259,250.0,,,Ben,T,Miller,,Buyer,20269531,1967-07-05,M,M,2004-04-09,0,55,47,151-555-0113,Work,ben0@adventure-works.com,101 Candy Rd.,,Redmond,Washington,98052,United States
1,278,274.0,6.0,,Garrett,R,Vargas,,Sales Representative,234474252,1969-03-07,M,M,2005-07-01,1,33,36,922-555-0165,Work,garrett1@mapleleafmail.ca,10203 Acorn Avenue,,Calgary,Alberta,T2P 2G8,Canada
2,204,26.0,,,Gabe,B,Mares,,Production Technician - WC40,440379437,1982-06-11,M,M,2003-04-09,0,57,48,310-555-0117,Work,gabe0@adventure-works.com,1061 Buskrik Avenue,,Edmonds,Washington,98020,United States


In [75]:
Employees.MaritalStatus.nunique()

2

In [76]:
Employees.MaritalStatus.unique()

array(['M', 'S'], dtype=object)

In [77]:
## select by condition
Employees.loc[Employees.MaritalStatus == 'M', ['EmployeeID', 'FirstName', 'LastName']].head(3)

Unnamed: 0,EmployeeID,FirstName,LastName
0,259,Ben,Miller
1,278,Garrett,Vargas
2,204,Gabe,Mares


In [84]:
x = Employees.loc[Employees.MaritalStatus == 'M', ['EmployeeID', 'FirstName', 'LastName']].shape

# type(x)
x

(147, 3)

In [85]:
x[0]

147

In [79]:
Employees.loc[Employees.MaritalStatus == 'S', ['EmployeeID', 'FirstName', 'LastName']].shape[0]

144

In [86]:
x[1]

3

In [80]:
Employees.loc[Employees.MaritalStatus == 'S', ['EmployeeID', 'FirstName', 'LastName']].head(3)

Unnamed: 0,EmployeeID,FirstName,LastName
5,66,Karan,Khanna
6,270,François,Ajenstat
7,22,Sariya,Harnpadoungsataya
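
The counts per marital status were obtained above by filtering twice and reading `.shape[0]`. As an alternative, `value_counts` returns all of them in one call. The frame `staff` below is a made-up stand-in for `Employees`.

```python
import pandas as pd

staff = pd.DataFrame({"MaritalStatus": ["M", "S", "M", "M", "S"]})  # illustrative values

# One row per distinct value, with the number of occurrences.
counts = staff["MaritalStatus"].value_counts()

print(counts)
```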


### Show me a list of employees whose last name begins with "R"

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.LastName LIKE 'R%';
```

In [81]:
'Robert'.startswith('R')

True

In [82]:
'Robert'.startswith('S')

False

In [90]:
Employees.loc[Employees.LastName.str.startswith("R"), ['EmployeeID', 'FirstName', 'LastName']].head(10)

Unnamed: 0,EmployeeID,FirstName,LastName
9,124,Kim,Ralls
10,10,Michael,Raheem
16,166,Jack,Richins
27,147,Sandra,Reátegui Alayo
44,133,Michael,Rothkugel
95,44,Simon,Rapier
99,65,Randy,Reeves
128,145,Cynthia,Randall
131,149,Andy,Ruth
166,74,Bjorn,Rettig


In [91]:
Employees.loc[Employees.LastName.str.startswith("R"), ['EmployeeID', 'FirstName', 'LastName']].shape[0]

15

In [92]:
Employees.loc[Employees.LastName.map(lambda x: str(x).startswith('R')), ['EmployeeID', 'FirstName', 'LastName']].shape[0]

15

In [93]:
Employees.loc[Employees.LastName.map(lambda x: str(x).startswith('R')), ['EmployeeID', 'FirstName', 'LastName']].head(3)

Unnamed: 0,EmployeeID,FirstName,LastName
9,124,Kim,Ralls
10,10,Michael,Raheem
16,166,Jack,Richins
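
Both approaches above agree because `LastName` has no missing values. With missing values they can behave differently: `.str.startswith` returns a missing value for those rows, and a mask containing missing values cannot be used for boolean indexing. Passing `na=False` treats them as non-matches. Here is a sketch with a made-up series named `names`.

```python
import pandas as pd

names = pd.Series(["Ralls", None, "Miller", "Reeves"])  # one missing entry on purpose

# na=False makes the missing entry count as "does not start with R".
mask = names.str.startswith("R", na=False)

print(names[mask])
```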


### Show me a list of employees whose last name ends with "r"

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.LastName LIKE '%r';
```

In [94]:
'Robert'.endswith('a')

False

In [95]:
'Robert'.endswith('t')

True

In [96]:
Employees.loc[Employees.LastName.map(lambda x: str(x).endswith('r')), ['EmployeeID', 'FirstName', 'LastName']].head(10)

Unnamed: 0,EmployeeID,FirstName,LastName
0,259,Ben,Miller
8,161,Kirk,Koenigsbauer
18,203,Ken,Myer
49,199,Paula,Nartker
53,41,Bryan,Baker
56,104,Mary,Baker
64,225,Alan,Brewer
75,156,Lane,Sacksteder
95,44,Simon,Rapier
97,96,Elizabeth,Keyser


### Provide a list of employees that have a hyphenated last name.

SQL logic
```sql
SELECT 
  e.EmployeeID
  , e.FirstName
  , e.LastName 
FROM dbo.Employees AS e
WHERE e.LastName LIKE '%-%';
```

In [97]:
'd' in 'abc'

False

In [98]:
'b' in 'abc'

True

In [99]:
Employees.loc[Employees.LastName.map(lambda x: '-' in str(x)), 
              ['EmployeeID', 'FirstName', 'LastName']].shape[0]

3

In [100]:
Employees.loc[Employees.LastName.map(lambda x: '-' in str(x)), 
              ['EmployeeID', 'FirstName', 'LastName']]

Unnamed: 0,EmployeeID,FirstName,LastName
114,284,Tete,Mensa-Annan
134,180,Katie,McAskill-White
176,280,Pamela,Ansman-Wolfe


In [102]:
Employees.loc[Employees.LastName.apply(lambda x: '-' in str(x)), 
              ['EmployeeID', 'FirstName', 'LastName']]

Unnamed: 0,EmployeeID,FirstName,LastName
114,284,Tete,Mensa-Annan
134,180,Katie,McAskill-White
176,280,Pamela,Ansman-Wolfe


In [103]:
Employees.loc[Employees.LastName.str.contains('-'), 
              ['EmployeeID', 'FirstName', 'LastName']].head(3)

Unnamed: 0,EmployeeID,FirstName,LastName
114,284,Tete,Mensa-Annan
134,180,Katie,McAskill-White
176,280,Pamela,Ansman-Wolfe
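
One caveat with `.str.contains`: the pattern is treated as a regular expression by default. A hyphen is harmless, but characters such as `.` have a special meaning. The sketch below uses a made-up series (`names`) to show the difference and how `regex=False` requests a literal match.

```python
import pandas as pd

names = pd.Series(["Mensa-Annan", "Miller", "St. John"])  # illustrative values

# Literal search: only the name that really contains a period matches.
literal = names.str.contains(".", regex=False)

# Default regex search: "." matches any character, so every name matches.
as_regex = names.str.contains(".")

print(literal.tolist())
print(as_regex.tolist())
```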


### Provide a list of employees that are on salary and have more than 35 vacation hours left.

SQL logic
```sql
SELECT 	
  e.EmployeeID
  , e.FirstName
  , e.LastName
  , e.VacationHours
  , e.SalariedFlag
FROM dbo.Employees AS e
WHERE (e.SalariedFlag = 1) AND (e.VacationHours > 35);
```

In [104]:
Employees.head(3)

Unnamed: 0,EmployeeID,ManagerID,TerritoryID,Title,FirstName,MiddleName,LastName,Suffix,JobTitle,NationalIDNumber,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,PhoneNumber,PhoneNumberType,EmailAddress,AddressLine1,AddressLine2,City,StateProvinceName,PostalCode,CountryName
0,259,250.0,,,Ben,T,Miller,,Buyer,20269531,1967-07-05,M,M,2004-04-09,0,55,47,151-555-0113,Work,ben0@adventure-works.com,101 Candy Rd.,,Redmond,Washington,98052,United States
1,278,274.0,6.0,,Garrett,R,Vargas,,Sales Representative,234474252,1969-03-07,M,M,2005-07-01,1,33,36,922-555-0165,Work,garrett1@mapleleafmail.ca,10203 Acorn Avenue,,Calgary,Alberta,T2P 2G8,Canada
2,204,26.0,,,Gabe,B,Mares,,Production Technician - WC40,440379437,1982-06-11,M,M,2003-04-09,0,57,48,310-555-0117,Work,gabe0@adventure-works.com,1061 Buskrik Avenue,,Edmonds,Washington,98020,United States


In [105]:
Employees.columns

Index(['EmployeeID', 'ManagerID', 'TerritoryID', 'Title', 'FirstName',
       'MiddleName', 'LastName', 'Suffix', 'JobTitle', 'NationalIDNumber',
       'BirthDate', 'MaritalStatus', 'Gender', 'HireDate', 'SalariedFlag',
       'VacationHours', 'SickLeaveHours', 'PhoneNumber', 'PhoneNumberType',
       'EmailAddress', 'AddressLine1', 'AddressLine2', 'City',
       'StateProvinceName', 'PostalCode', 'CountryName'],
      dtype='object')

In [106]:
Employees.SalariedFlag.unique()

array([0, 1])

In [107]:
Employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 291 entries, 0 to 290
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   EmployeeID         291 non-null    int64         
 1   ManagerID          290 non-null    float64       
 2   TerritoryID        14 non-null     float64       
 3   Title              8 non-null      object        
 4   FirstName          291 non-null    object        
 5   MiddleName         278 non-null    object        
 6   LastName           291 non-null    object        
 7   Suffix             2 non-null      object        
 8   JobTitle           291 non-null    object        
 9   NationalIDNumber   291 non-null    int64         
 10  BirthDate          291 non-null    object        
 11  MaritalStatus      291 non-null    object        
 12  Gender             291 non-null    object        
 13  HireDate           291 non-null    datetime64[ns]
 14  SalariedFl

In [108]:
Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35), 
              ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].head(3)

Unnamed: 0,EmployeeID,FirstName,LastName,VacationHours,SalariedFlag
6,270,François,Ajenstat,67,1
11,248,Mike,Seamans,59,1
19,245,Barbara,Moreland,58,1


In [109]:
Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35), 
              ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].shape[0]

30

In [110]:
Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35), 
              ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].EmployeeID.nunique()


30
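
The combined condition above is repeated in three cells. Two possible ways to make such filters easier to read are naming each mask, or using `DataFrame.query`. The sketch uses a made-up frame (`emp`) with the same column names.

```python
import pandas as pd

emp = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "SalariedFlag": [1, 0, 1, 1],
    "VacationHours": [40, 50, 10, 36],
})

# Option 1: name each condition, then combine with & (and) or | (or).
is_salaried = emp["SalariedFlag"] == 1
has_hours = emp["VacationHours"] > 35
via_mask = emp.loc[is_salaried & has_hours, "EmployeeID"].tolist()

# Option 2: write the condition as a string, close to the SQL WHERE clause.
via_query = emp.query("SalariedFlag == 1 and VacationHours > 35")["EmployeeID"].tolist()

print(via_mask, via_query)
```

The parentheses around each comparison in the original cells are required because `&` binds more tightly than `==` and `>`. Named masks sidestep that.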

### Show the same as above but limit it to American employees. [practice]

SQL logic
```sql
SELECT DISTINCT CountryName FROM dbo.Employees;

SELECT 	
  e.EmployeeID 
  , e.FirstName
  , e.LastName
  , e.VacationHours
  , e.SalariedFlag
  , e.CountryName
FROM dbo.Employees AS e
WHERE 
  e.SalariedFlag = 1
  AND e.VacationHours > 35
  AND e.CountryName = 'United States';
```

In [111]:
Employees.CountryName.unique()

array(['United States', 'Canada', 'Australia', 'France', 'United Kingdom',
       'Germany'], dtype=object)

In [113]:
Employees.loc[(Employees.SalariedFlag==1)&(Employees.VacationHours>35)&(Employees.CountryName=='United States'), 
              ['EmployeeID', 'FirstName', 'LastName','VacationHours','SalariedFlag']].shape[0]

28

### Change the logic to include anyone who meets any of the 3 conditions (i.e., people who are either married, live in Washington state, or have more than 35 vacation hours left)

SQL logic
```sql
SELECT 	
  e.EmployeeID
  ,e.FirstName
  ,e.LastName
  ,e.MaritalStatus
  ,e.VacationHours
  ,e.SalariedFlag
  ,e.StateProvinceName
  ,e.CountryName
FROM dbo.Employees AS e
WHERE 
  e.MaritalStatus = 'M' 
  OR e.VacationHours > 35 
  OR e.StateProvinceName = 'Washington'
	;
```

In [115]:
Employees.loc[(Employees.MaritalStatus=='M')|(Employees.VacationHours>35)|(Employees.StateProvinceName=='Washington'), 
              ['EmployeeID', 'FirstName', 'LastName','MaritalStatus','VacationHours','SalariedFlag','StateProvinceName','CountryName']].head(3)

Unnamed: 0,EmployeeID,FirstName,LastName,MaritalStatus,VacationHours,SalariedFlag,StateProvinceName,CountryName
0,259,Ben,Miller,M,55,0,Washington,United States
1,278,Garrett,Vargas,M,33,1,Alberta,Canada
2,204,Gabe,Mares,M,57,0,Washington,United States


In [116]:
Employees.loc[(Employees.MaritalStatus=='M')|(Employees.VacationHours>35)|(Employees.StateProvinceName=='Washington'), 
              ['EmployeeID', 'FirstName', 'LastName','MaritalStatus','VacationHours','SalariedFlag','StateProvinceName','CountryName']].EmployeeID.nunique()

286

## Joins
![](../pics/joins.jpg)

### If any employees are salespeople, show me the details about their sales territory
```sql
SELECT e.EmployeeID ,e.FirstName + ' ' + e.LastName AS EmployeeName ,st.* 
FROM dbo.Employees AS e 
INNER JOIN dbo.SalesTerritory AS st ON e.TerritoryID = st.TerritoryID
```

In [117]:
Territory.shape

(12, 6)

In [118]:
Territory

Unnamed: 0,TerritoryID,Name,CountryCode,Region,SalesYTD,SalesLastYear
0,1,Northwest,US,North America,7887186.79,3298694.49
1,2,Northeast,US,North America,2402176.85,3607148.94
2,3,Central,US,North America,3072175.12,3205014.08
3,4,Southwest,US,North America,10510853.87,5366575.71
4,5,Southeast,US,North America,2538667.25,3925071.43
5,6,Canada,CA,North America,6771829.14,5693988.86
6,7,France,FR,Europe,4772398.31,2396539.76
7,8,Germany,DE,Europe,3805202.35,1307949.79
8,9,Australia,AU,Pacific,5977814.92,2278548.98
9,10,United Kingdom,GB,Europe,5012905.37,1635823.4


In [119]:
Employees.columns

Index(['EmployeeID', 'ManagerID', 'TerritoryID', 'Title', 'FirstName',
       'MiddleName', 'LastName', 'Suffix', 'JobTitle', 'NationalIDNumber',
       'BirthDate', 'MaritalStatus', 'Gender', 'HireDate', 'SalariedFlag',
       'VacationHours', 'SickLeaveHours', 'PhoneNumber', 'PhoneNumberType',
       'EmailAddress', 'AddressLine1', 'AddressLine2', 'City',
       'StateProvinceName', 'PostalCode', 'CountryName'],
      dtype='object')

In [120]:
Employees.head(10)

Unnamed: 0,EmployeeID,ManagerID,TerritoryID,Title,FirstName,MiddleName,LastName,Suffix,JobTitle,NationalIDNumber,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,PhoneNumber,PhoneNumberType,EmailAddress,AddressLine1,AddressLine2,City,StateProvinceName,PostalCode,CountryName
0,259,250.0,,,Ben,T,Miller,,Buyer,20269531,1967-07-05,M,M,2004-04-09,0,55,47,151-555-0113,Work,ben0@adventure-works.com,101 Candy Rd.,,Redmond,Washington,98052,United States
1,278,274.0,6.0,,Garrett,R,Vargas,,Sales Representative,234474252,1969-03-07,M,M,2005-07-01,1,33,36,922-555-0165,Work,garrett1@mapleleafmail.ca,10203 Acorn Avenue,,Calgary,Alberta,T2P 2G8,Canada
2,204,26.0,,,Gabe,B,Mares,,Production Technician - WC40,440379437,1982-06-11,M,M,2003-04-09,0,57,48,310-555-0117,Work,gabe0@adventure-works.com,1061 Buskrik Avenue,,Edmonds,Washington,98020,United States
3,78,26.0,,,Reuben,H,D'sa,,Production Supervisor - WC40,370989364,1981-09-27,M,M,2003-01-16,0,72,56,191-555-0112,Work,reuben0@adventure-works.com,1064 Slow Creek Road,,Seattle,Washington,98104,United States
4,255,250.0,,,Gordon,L,Hee,,Buyer,466142721,1960-12-30,M,M,2004-02-12,0,52,46,230-555-0144,Cell,gordon0@adventure-works.com,108 Lakeside Court,,Bellevue,Washington,98004,United States
5,66,26.0,,,Karan,R,Khanna,,Production Technician - WC60,834186596,1964-04-07,S,M,2004-01-23,0,28,34,447-555-0186,Work,karan0@hotmail.com,1102 Ravenwood,,Seattle,Washington,98104,United States
6,270,263.0,,,François,P,Ajenstat,,Database Administrator,643805155,1969-06-17,S,M,2003-02-18,1,67,53,785-555-0110,Cell,françois0@yahoo.com,1144 Paradise Ct.,,Issaquah,Washington,98027,United States
7,22,16.0,,,Sariya,E,Harnpadoungsataya,,Marketing Specialist,95958330,1981-06-21,S,M,2003-01-13,0,45,42,399-555-0176,Work,sariya0@adventure-works.com,1185 Dallas Drive,,Everett,Washington,98201,United States
8,161,26.0,,,Kirk,J,Koenigsbauer,,Production Technician - WC45,275962311,1979-03-10,S,M,2003-01-16,0,74,57,669-555-0150,Work,kirk0@hotmail.com,1220 Bradford Way,,Seattle,Washington,98104,United States
9,124,250.0,,,Kim,T,Ralls,,Stocker,420776180,1978-06-01,S,F,2003-01-27,0,98,69,309-555-0129,Work,kim0@adventure-works.com,1226 Shoe St.,,Bothell,Washington,98011,United States


In [121]:
df = pd.merge(
    Employees.loc[:,['TerritoryID','EmployeeID','FirstName','LastName']],
    Territory,
    on = 'TerritoryID',
    how = 'inner'
)
df.head(3)

Unnamed: 0,TerritoryID,EmployeeID,FirstName,LastName,Name,CountryCode,Region,SalesYTD,SalesLastYear
0,6.0,278,Garrett,Vargas,Canada,CA,North America,6771829.14,5693988.86
1,6.0,282,José,Saraiva,Canada,CA,North America,6771829.14,5693988.86
2,1.0,283,David,Campbell,Northwest,US,North America,7887186.79,3298694.49


In [122]:
df.EmployeeID.nunique()

14

In [123]:
df = pd.merge(
    Employees.loc[:,['TerritoryID','EmployeeID','FirstName','LastName']],
    Territory,
    on = 'TerritoryID',
    how = 'left'
)
df.head(3)

Unnamed: 0,TerritoryID,EmployeeID,FirstName,LastName,Name,CountryCode,Region,SalesYTD,SalesLastYear
0,,259,Ben,Miller,,,,,
1,6.0,278,Garrett,Vargas,Canada,CA,North America,6771829.14,5693988.86
2,,204,Gabe,Mares,,,,,


In [124]:
df.shape

(291, 9)

In [None]:
Employees.shape
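
The difference between the two `how` settings above can be checked on a tiny example. This is a minimal sketch with made-up tables (the lowercase names `employees` and `territory` and their values are illustrative, not the course data): an inner join keeps only rows whose key finds a match, while a left join keeps every row of the left table.

```python
import pandas as pd

# Toy stand-ins (values are made up): employee 3 has no territory
employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'TerritoryID': [6.0, 1.0, None],
})
territory = pd.DataFrame({
    'TerritoryID': [1.0, 6.0],
    'Name': ['Northwest', 'Canada'],
})

inner = pd.merge(employees, territory, on='TerritoryID', how='inner')
left = pd.merge(employees, territory, on='TerritoryID', how='left')

print(inner.shape)  # only employees with a matching territory
print(left.shape)   # every employee; unmatched ones get NaN in the territory columns
```

Comparing the row count of the left join with the row count of the left table is a quick sanity check that the join did not duplicate or drop rows.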

In [125]:
# vectorized string concatenation; no Python-level loop needed
df['EmployeeName'] = df.FirstName + ' ' + df.LastName

In [126]:
df.head(3)

Unnamed: 0,TerritoryID,EmployeeID,FirstName,LastName,Name,CountryCode,Region,SalesYTD,SalesLastYear,EmployeeName
0,,259,Ben,Miller,,,,,,Ben Miller
1,6.0,278,Garrett,Vargas,Canada,CA,North America,6771829.14,5693988.86,Garrett Vargas
2,,204,Gabe,Mares,,,,,,Gabe Mares


### For all sales territories, also show what customers fall under them
```sql
SELECT * 
FROM dbo.SalesTerritory AS st 
LEFT OUTER JOIN dbo.Customers AS c ON c.SalesTerritoryID = st.TerritoryID ;
```

In [127]:
Customers.head(3)

Unnamed: 0,CustomerID,SalesTerritoryID,FirstName,LastName,City,StateName
0,10101,1,John,Gray,Lynden,Washington
1,10298,4,Leroy,Brown,Pinetop,Arizona
2,10299,1,Elroy,Keller,Snoqualmie,Washington


In [128]:
Territory.head(3)

Unnamed: 0,TerritoryID,Name,CountryCode,Region,SalesYTD,SalesLastYear
0,1,Northwest,US,North America,7887186.79,3298694.49
1,2,Northeast,US,North America,2402176.85,3607148.94
2,3,Central,US,North America,3072175.12,3205014.08


In [131]:
Territory.shape

(12, 6)

In [129]:
df = pd.merge(
    Territory,
    Customers,
#     on = 'TerritoryID',
    left_on='TerritoryID',
    right_on='SalesTerritoryID',
    how = 'left'
)
df.head(3)

Unnamed: 0,TerritoryID,Name,CountryCode,Region,SalesYTD,SalesLastYear,CustomerID,SalesTerritoryID,FirstName,LastName,City,StateName
0,1,Northwest,US,North America,7887186.79,3298694.49,10101.0,1.0,John,Gray,Lynden,Washington
1,1,Northwest,US,North America,7887186.79,3298694.49,10299.0,1.0,Elroy,Keller,Snoqualmie,Washington
2,1,Northwest,US,North America,7887186.79,3298694.49,10325.0,1.0,Ginger,Schultz,Pocatello,Idaho


In [130]:
df.shape

(25, 12)

In [132]:
df = pd.merge(
    Territory,
    Customers.rename(columns={'SalesTerritoryID':'TerritoryID'}),
    on = 'TerritoryID',
    how = 'left'
)
df.head(3)

Unnamed: 0,TerritoryID,Name,CountryCode,Region,SalesYTD,SalesLastYear,CustomerID,FirstName,LastName,City,StateName
0,1,Northwest,US,North America,7887186.79,3298694.49,10101.0,John,Gray,Lynden,Washington
1,1,Northwest,US,North America,7887186.79,3298694.49,10299.0,Elroy,Keller,Snoqualmie,Washington
2,1,Northwest,US,North America,7887186.79,3298694.49,10325.0,Ginger,Schultz,Pocatello,Idaho


Are there any sales territories that have no customers associated with them?

In [133]:
# df.loc[condition, column_list]
df[df.CustomerID.isna()]

Unnamed: 0,TerritoryID,Name,CountryCode,Region,SalesYTD,SalesLastYear,CustomerID,FirstName,LastName,City,StateName
5,2,Northeast,US,North America,2402176.85,3607148.94,,,,,
18,6,Canada,CA,North America,6771829.14,5693988.86,,,,,
19,7,France,FR,Europe,4772398.31,2396539.76,,,,,
20,8,Germany,DE,Europe,3805202.35,1307949.79,,,,,
21,9,Australia,AU,Pacific,5977814.92,2278548.98,,,,,
22,10,United Kingdom,GB,Europe,5012905.37,1635823.4,,,,,
23,11,Brazil,BR,South America,0.0,261589.958,,,,,
24,12,Mexico,MX,North America,0.0,0.0,,,,,


In [134]:
df[df.CustomerID.isna()].shape[0]

8
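
Another way to find unmatched rows, sketched here on made-up data, is the `indicator` parameter of `pd.merge`. It adds a `_merge` column that says whether each row came from the left table only, the right table only, or both. The tables and values below are illustrative assumptions, not the course data.

```python
import pandas as pd

# Toy stand-ins for the Territory and Customers tables (values are made up)
territory = pd.DataFrame({
    'TerritoryID': [1, 2, 3],
    'Name': ['Northwest', 'Northeast', 'Central'],
})
customers = pd.DataFrame({
    'CustomerID': [10101, 10299, 10400],
    'SalesTerritoryID': [1, 1, 3],
})

merged = pd.merge(
    territory,
    customers,
    left_on='TerritoryID',
    right_on='SalesTerritoryID',
    how='left',
    indicator=True,   # adds a `_merge` column: 'left_only', 'right_only' or 'both'
)

# Territories that found no matching customer
no_customers = merged.loc[merged['_merge'] == 'left_only', ['TerritoryID', 'Name']]
print(no_customers)
```

Filtering on `_merge` does not rely on a right-hand column such as `CustomerID` being free of missing values in the source table, which the `isna()` approach quietly assumes.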

## Grouping

Reading Materials: 
* (official doc): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
* (summary) https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

### What is the earliest birthdate for all employees?

SQL logic
```sql
SELECT MIN(e.BirthDate) FROM dbo.Employees AS e;
```

In [135]:
Employees.head(3)

Unnamed: 0,EmployeeID,ManagerID,TerritoryID,Title,FirstName,MiddleName,LastName,Suffix,JobTitle,NationalIDNumber,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,PhoneNumber,PhoneNumberType,EmailAddress,AddressLine1,AddressLine2,City,StateProvinceName,PostalCode,CountryName
0,259,250.0,,,Ben,T,Miller,,Buyer,20269531,1967-07-05,M,M,2004-04-09,0,55,47,151-555-0113,Work,ben0@adventure-works.com,101 Candy Rd.,,Redmond,Washington,98052,United States
1,278,274.0,6.0,,Garrett,R,Vargas,,Sales Representative,234474252,1969-03-07,M,M,2005-07-01,1,33,36,922-555-0165,Work,garrett1@mapleleafmail.ca,10203 Acorn Avenue,,Calgary,Alberta,T2P 2G8,Canada
2,204,26.0,,,Gabe,B,Mares,,Production Technician - WC40,440379437,1982-06-11,M,M,2003-04-09,0,57,48,310-555-0117,Work,gabe0@adventure-works.com,1061 Buskrik Avenue,,Edmonds,Washington,98020,United States


In [137]:
Employees.columns

Index(['EmployeeID', 'ManagerID', 'TerritoryID', 'Title', 'FirstName',
       'MiddleName', 'LastName', 'Suffix', 'JobTitle', 'NationalIDNumber',
       'BirthDate', 'MaritalStatus', 'Gender', 'HireDate', 'SalariedFlag',
       'VacationHours', 'SickLeaveHours', 'PhoneNumber', 'PhoneNumberType',
       'EmailAddress', 'AddressLine1', 'AddressLine2', 'City',
       'StateProvinceName', 'PostalCode', 'CountryName'],
      dtype='object')

In [None]:
type(Employees.dtypes)

In [None]:
Employees.loc[:,['BirthDate']].head(3)

In [138]:
Employees.dtypes

EmployeeID                    int64
ManagerID                   float64
TerritoryID                 float64
Title                        object
FirstName                    object
MiddleName                   object
LastName                     object
Suffix                       object
JobTitle                     object
NationalIDNumber              int64
BirthDate                    object
MaritalStatus                object
Gender                       object
HireDate             datetime64[ns]
SalariedFlag                  int64
VacationHours                 int64
SickLeaveHours                int64
PhoneNumber                  object
PhoneNumberType              object
EmailAddress                 object
AddressLine1                 object
AddressLine2                 object
City                         object
StateProvinceName            object
PostalCode                   object
CountryName                  object
dtype: object

In [139]:
# Employees.dtypes.reset_index()
# Employees.dtypes['BirthDate']
str(Employees.dtypes['BirthDate'])

'object'

In [None]:
Employees.BirthDate.dtypes

In [140]:
'1970-01-01' < '2023-06-26'

True

In [141]:
Employees.BirthDate.min()

'1945-11-17'

In [142]:
Employees.BirthDate.max()

'1985-07-01'

In [143]:
Employees.BirthDate.nunique()

279

### Add to the above, the most recent birthdate for all employees

SQL logic
```sql
SELECT 
  MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e;
```

In [None]:
x = [4,5,1,2,3]
min(x), max(x)

* Lexicographic order [[wikipedia](https://en.wikipedia.org/wiki/Lexicographic_order)]

In [None]:
'2ab' < '1ab'

In [None]:
# 'abcdefg'

'a' > 'b'
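
Because `BirthDate` is stored as text, `min()` and `max()` compare the values character by character. That happens to give the right answer for year-first strings like `'1945-11-17'`, but not for every date format. The sketch below uses three made-up dates to show the difference and how `pd.to_datetime` avoids the problem; the variable names are illustrative.

```python
import pandas as pd

# Year-first strings (YYYY-MM-DD) sort the same way as the dates they represent
iso = pd.Series(['1985-07-01', '1945-11-17', '1969-03-07'])
print(iso.min(), iso.max())     # string comparison, character by character

# The same dates written as MM/DD/YYYY do not: string min() picks '03/07/1969'
us = pd.Series(['07/01/1985', '11/17/1945', '03/07/1969'])
print(us.min())

# Parsing to datetime removes the dependence on the text format
parsed = pd.to_datetime(us, format='%m/%d/%Y')
print(parsed.min(), parsed.max())
```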

In [None]:
Employees.agg({'BirthDate':['min','max']}).T

# Employees.agg({'BirthDate':['min','max']})

In [None]:
Employees.agg({'BirthDate':[min,max]}).T

In [None]:
Employees.agg({'BirthDate':[min,max]}).T.reset_index(drop=True)
# Employees.agg({'BirthDate':['min','max']}).T.reset_index(drop=False)

### Show the above results broken down by gender

SQL logic
```sql
SELECT 
  e.Gender
  , MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e
GROUP BY e.Gender
;
```

In [None]:
Employees.groupby('Gender')['BirthDate'].min().reset_index()

In [None]:
Employees.groupby('Gender').agg({'BirthDate':[min,max]})

In [None]:
Employees.groupby('Gender').agg(
    min_bday=('BirthDate',min),
    max_bday=('BirthDate',max)
).reset_index()

### Show the above results broken down by gender, and salaried/hourly

SQL logic
```sql
SELECT 
  e.Gender
  , e.SalariedFlag
  , MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e
GROUP BY e.Gender, e.SalariedFlag
;
```

In [None]:
Employees.groupby(['Gender','SalariedFlag']).agg(
    min_bday=('BirthDate',min),
    max_bday=('BirthDate',max)
).reset_index()

### What are the average vacation hours for all employees?

SQL logic
```sql
SELECT AVG(e.VacationHours)
FROM dbo.Employees AS e	
;
```

In [None]:
Employees.VacationHours.mean()

### Show the above results broken down and ordered by job title

SQL logic
```sql
SELECT 
  e.JobTitle
  , AVG(e.VacationHours) AS 'Average Vacation'
  , MIN(e.VacationHours) AS 'Minimum Vacation'
FROM dbo.Employees AS e
GROUP BY e.JobTitle
;
```

In [None]:
Employees.groupby('JobTitle')['VacationHours'].min().reset_index().head(3)

In [None]:
Employees.groupby('JobTitle')['VacationHours'].mean().reset_index().head(3)

In [None]:
Employees.groupby('JobTitle')['VacationHours'].apply(lambda x: sum(x)/len(x)).reset_index().head(3)

In [None]:
Employees.groupby('JobTitle').agg(
    avg_pto_left=('VacationHours',lambda x: sum(x)/len(x)),
    min_pto_left=('VacationHours',min)
).reset_index()
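
The same named aggregation can also be written with pandas' built-in function names passed as strings. This is a minimal sketch on a made-up table (the `employees` frame and its values are illustrative). `groupby` sorts the result by the group key by default, which covers "ordered by job title"; `sort_values` is shown as one way to order by a different column.

```python
import pandas as pd

# Toy stand-in for the Employees table (values are made up)
employees = pd.DataFrame({
    'JobTitle': ['Buyer', 'Stocker', 'Buyer', 'Stocker', 'Buyer'],
    'VacationHours': [55, 98, 52, 90, 58],
})

summary = (
    employees
    .groupby('JobTitle')
    .agg(
        avg_pto_left=('VacationHours', 'mean'),   # same result as sum(x)/len(x)
        min_pto_left=('VacationHours', 'min'),
    )
    .reset_index()
    .sort_values('avg_pto_left', ascending=False)  # highest average first
)
print(summary)
```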

# The Python Statistics Landscape

There are many Python statistics libraries for you to work with.

* **Foundation Libraries**
    * `statistics`: built-in Python library for descriptive statistics (link: https://docs.python.org/3/library/statistics.html)
    * `numpy`: numerical computing, numpy arrays, covered in lecture 03
    * `scipy`: scientific computing based on numpy (link: https://www.scipy.org/); its `scipy.stats` module covers a large number of probability distributions and statistical functions (link: https://docs.scipy.org/doc/scipy/reference/stats.html)
    
* **Data Science Libraries**
    * `pandas`: 1D and 2D labeled data manipulations and computation, covered in lecture 04
    * `statsmodels`: a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration (link: https://www.statsmodels.org/stable/index.html)
    * `matplotlib`: graphs and visualization (link: https://matplotlib.org/)

# Descriptive Statistical Analysis

Descriptive statistics is about describing and summarizing data. It uses two main approaches:

* The quantitative approach describes and summarizes data numerically.
* The visual approach illustrates data with charts, plots, histograms, and other graphs.

You can apply descriptive statistics to one or many datasets or variables. When you describe and summarize a single variable, you’re performing univariate analysis. When you search for statistical relationships between a pair of variables, you’re doing bivariate analysis. Similarly, multivariate analysis is concerned with multiple variables at once.
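
As a rough illustration of the three terms, the sketch below uses the `VacationHours` and `SickLeaveHours` values from the first six `Employees` rows shown earlier. The frame name `data` and the lowercase column names are introduced here for illustration only.

```python
import pandas as pd

# Values taken from the first six Employees rows printed earlier
data = pd.DataFrame({
    'vacation_hours': [55, 33, 57, 72, 52, 28],
    'sick_leave_hours': [47, 36, 48, 56, 46, 34],
})

# Univariate: summarize one variable at a time
print(data['vacation_hours'].describe())

# Bivariate: relationship between a pair of variables (Pearson correlation)
r = data['vacation_hours'].corr(data['sick_leave_hours'])
print(round(r, 3))

# Multivariate: pairwise correlations across all numeric columns
print(data.corr())
```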


**[Case Study]**

**Atlanta Police Department Crime Data** ![APD Logo](https://atlantapd.galls.com/photos/partners/atlantapd/logo.jpg)


The Atlanta Police Department provides raw crime data at http://www.atlantapd.org/i-want-to/crime-data-downloads


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## Load the 2009-2019 crime data

In [None]:
df = pd.read_csv('../data/COBRA-2009-2019.csv',sep=',',header=0)
df.head(3)

In [None]:
df.shape

In [None]:
df.info()

## Quantitative Analysis

In [None]:
df['rpt_yr'] = df['Report Date'].map(lambda x: x[:4])

df.head(3)
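
Slicing the first four characters works when the date string starts with the year. A possible alternative, sketched below on made-up dates, is to parse the column with `pd.to_datetime` and read the year from the `.dt` accessor. The date format used here is an assumption; the real `Report Date` values may need an explicit `format=` argument.

```python
import pandas as pd

# Made-up report dates; the real 'Report Date' format is an assumption here
reports = pd.DataFrame({'Report Date': ['2017-03-14', '2018-11-02', '2019-01-30']})

# String slicing, as in the cell above: the year stays a string
reports['rpt_yr_str'] = reports['Report Date'].str[:4]

# Parsing first: the year becomes an integer
reports['rpt_yr_int'] = pd.to_datetime(reports['Report Date']).dt.year
print(reports.dtypes)
```

Note the type difference: with the integer version, later filters would compare against `2019` rather than `'2019'`.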

In [None]:
## number of reports every year
# df['rpt_yr'] = df['Report Date'].map(lambda x: x[:4])

num_rpt_by_yr = df.groupby('rpt_yr').agg(
    num_row=('Report Number',len),
    num_rpt=('Report Number',lambda x: len(set(x)))
).reset_index()
num_rpt_by_yr

In [None]:
df.groupby('rpt_yr')['Report Number'].nunique().reset_index()

In [None]:
## number of cases per shift in 2019
num_rpt_by_shift = df[df.rpt_yr=='2019'].groupby('Shift Occurence').agg(
    num_rpt=('Report Number',lambda x: len(set(x)))
).reset_index()
num_rpt_by_shift

In [None]:
## number of cases per shift in the past 3 years
num_rpt_by_yr_shift = df[df.rpt_yr>='2017'].groupby(['rpt_yr','Shift Occurence']).agg(
    num_rpt=('Report Number',lambda x: len(set(x)))
).reset_index()
# num_rpt_by_yr_shift
num_rpt_by_yr_shift.sort_values(by=['Shift Occurence','rpt_yr'])

In [None]:
## % of cases per shift in the past 3 years
num_rpt_by_yr_shift2 = pd.merge(
    num_rpt_by_yr_shift,
    num_rpt_by_yr.loc[:,['rpt_yr','num_rpt']].copy().rename(columns={'num_rpt':'annual_total'}),
    on='rpt_yr'
)
num_rpt_by_yr_shift2.sort_values(by=['Shift Occurence','rpt_yr'])

In [None]:
num_rpt_by_yr_shift2['percent'] = [
    round(subtotal/total,2)
    for subtotal,total in zip(num_rpt_by_yr_shift2.num_rpt,num_rpt_by_yr_shift2.annual_total)
]
num_rpt_by_yr_shift2.sort_values(by=['Shift Occurence','rpt_yr'])

**Can you do better??**
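
One possible answer, sketched on made-up counts: `groupby(...).transform('sum')` broadcasts each year's total back onto its rows, so the share can be computed with plain column arithmetic and no merge or Python loop. The `counts` frame below is a toy stand-in for `num_rpt_by_yr_shift`. One caveat: this total is the sum of the per-shift counts, which could differ from the unique-report total used above if a report number appears under more than one shift.

```python
import pandas as pd

# Toy counts standing in for num_rpt_by_yr_shift (numbers are made up)
counts = pd.DataFrame({
    'rpt_yr': ['2018', '2018', '2019', '2019'],
    'Shift Occurence': ['Day Watch', 'Evening Watch', 'Day Watch', 'Evening Watch'],
    'num_rpt': [30, 70, 45, 55],
})

# Broadcast each year's total back onto its rows, then divide column-wise
counts['annual_total'] = counts.groupby('rpt_yr')['num_rpt'].transform('sum')
counts['percent'] = (counts['num_rpt'] / counts['annual_total']).round(2)
print(counts)
```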

## Visual Analysis

### Visualize the YOY change of % by shift with a line chart

In [None]:
num_rpt_by_yr_shift2.rpt_yr = num_rpt_by_yr_shift2.rpt_yr.astype(int)
dw = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Day Watch']
ew = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Evening Watch']
mw = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Morning Watch']
unk = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Unknown']

plt.plot(dw.rpt_yr, dw.percent, '-o', label='day watch')
plt.plot(ew.rpt_yr, ew.percent, '-o', label='evening watch')
plt.plot(mw.rpt_yr, mw.percent, '-o', label='morning watch')
plt.plot(unk.rpt_yr, unk.percent, '-o', label='unknown')

plt.xticks(ticks=[2017,2018,2019])
plt.legend()
plt.show()
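
Filtering one DataFrame per shift works, but the same data can be reshaped once so that each shift becomes a column. The sketch below does that with `pivot` on made-up numbers; the `shares` frame is a toy stand-in for `num_rpt_by_yr_shift2`.

```python
import pandas as pd

# Toy stand-in for num_rpt_by_yr_shift2 (numbers are made up)
shares = pd.DataFrame({
    'rpt_yr': [2018, 2018, 2019, 2019],
    'Shift Occurence': ['Day Watch', 'Evening Watch', 'Day Watch', 'Evening Watch'],
    'percent': [0.30, 0.70, 0.45, 0.55],
})

# One row per year, one column per shift
wide = shares.pivot(index='rpt_yr', columns='Shift Occurence', values='percent')
print(wide)
```

With the data in this shape, `wide.plot(marker='o')` should draw one line per shift and `wide.plot.bar()` a grouped bar chart, both with the legend filled in from the column names.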