# Today's Coding Topics
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/data-programming-with-python/blob/main/2023-summmer/2023-06-26/notebook/concept_and_code_demo.ipynb)

* Recap of previous lecture
* `Pandas` data table practice


## Pandas intro

* `pandas` is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
* It ships with the Anaconda distribution, so no separate installation is needed there.
* When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean and process your data. In pandas, a data table is called a DataFrame.

<img align="center" src="../pics/dataframe-structure.png" style="height:300px;">
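
As a rough illustration of the structure pictured above, the sketch below builds a tiny `DataFrame` and inspects its rows, columns, and index. The data and the names `people` and `ages` are made up for this example and are not part of the lecture dataset.

```python
import pandas as pd

# Hypothetical example data, not part of the original notebook.
people = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cat"], "age": [31, 25, 47]},
    index=["r1", "r2", "r3"],  # custom row labels
)

print(people.shape)          # (number of rows, number of columns)
print(list(people.columns))  # column labels
print(list(people.index))    # row labels

ages = people["age"]         # selecting one column gives a Series
print(type(ages).__name__)
```

Each column of a `DataFrame` is a `Series` that shares the same row index.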


In [1]:
import numpy as np  # needed below for np.arange and np.random
import pandas as pd

In [None]:
pd.__version__

In [2]:
x = {
    'A':[1,2,'a',4],
    'B':np.arange(5,9),
    'C':['abc','def','ghi','jkl']
}

In [3]:
# create df from a dictionary
df1 = pd.DataFrame(x)

In [4]:
df1

Unnamed: 0,A,B,C
0,1,5,abc
1,2,6,def
2,a,7,ghi
3,4,8,jkl
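
Column `A` above mixes integers with the string `'a'`, so pandas has to store it with the generic `object` dtype. The sketch below checks the dtypes and shows one possible way to force the column to numbers with `pd.to_numeric`. The names `mixed` and `a_numeric` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Same shape of data as df1, rebuilt here so the block runs on its own.
mixed = pd.DataFrame({
    "A": [1, 2, "a", 4],   # integers mixed with a string
    "B": np.arange(5, 9),  # homogeneous integers
})
print(mixed.dtypes)

# errors="coerce" replaces values that cannot be parsed with NaN
a_numeric = pd.to_numeric(mixed["A"], errors="coerce")
print(a_numeric)
```

Whether coercing to `NaN` is appropriate depends on the data; it is only one option for cleaning a mixed column.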


In [5]:
y = [
    ['a','b','c'],
    ['d','e','f']
]

In [6]:
y

[['a', 'b', 'c'], ['d', 'e', 'f']]

In [7]:
# create df from a list of lists (each inner list becomes one row)
df2 = pd.DataFrame(y, columns=['col1','col2','col3'])
df2

Unnamed: 0,col1,col2,col3
0,a,b,c
1,d,e,f


## Create `dataframe` from text file

In [10]:
df = pd.read_csv('../data/imf-gdp-per-capita-2015.csv',sep=',',header=0, thousands=',')

In [11]:
df

Unnamed: 0,Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,2015,Estimates Start After
0,Afghanistan,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",599.994,2013.0
1,Albania,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",3995.380,2010.0
2,Algeria,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",4318.140,2014.0
3,Angola,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",4100.320,2014.0
4,Antigua and Barbuda,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",14414.300,2011.0
...,...,...,...,...,...,...,...
184,Venezuela,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",7744.750,2010.0
185,Vietnam,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",2088.340,2012.0
186,Yemen,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1302.940,2008.0
187,Zambia,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1350.150,2010.0
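
To see what the `thousands=','` argument does without needing the course data file, here is a small sketch that reads CSV text from memory. The two rows and the names `csv_text` and `parsed` are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; the quoted field contains a thousands separator.
csv_text = 'Country,2015\nX,"14,414.3"\nY,599.994\n'

# thousands="," tells the parser to drop the comma inside numeric fields
parsed = pd.read_csv(io.StringIO(csv_text), sep=",", header=0, thousands=",")
print(parsed.dtypes)
print(parsed)
```

Without `thousands=","`, the value `"14,414.3"` would be kept as text and the column would not be numeric.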


In [13]:
df.head(2)

Unnamed: 0,Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,2015,Estimates Start After
0,Afghanistan,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",599.994,2013.0
1,Albania,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",3995.38,2010.0


In [14]:
df.tail(2)

Unnamed: 0,Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,2015,Estimates Start After
187,Zambia,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1350.15,2010.0
188,Zimbabwe,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1064.35,2012.0


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 7 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        189 non-null    object 
 1   Subject Descriptor             189 non-null    object 
 2   Units                          189 non-null    object 
 3   Scale                          189 non-null    object 
 4   Country/Series-specific Notes  188 non-null    object 
 5   2015                           187 non-null    float64
 6   Estimates Start After          188 non-null    float64
dtypes: float64(2), object(5)
memory usage: 10.5+ KB
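
The `Non-Null Count` column above shows that some columns contain missing values. One possible way to count and drop them is sketched below on a small made-up frame; `gdp`, `missing_per_column`, and `complete_rows` are hypothetical names, and the values are not taken from the IMF file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the GDP table, with one missing value.
gdp = pd.DataFrame({
    "Country": ["X", "Y", "Z"],
    "2015": [599.994, np.nan, 1064.35],
})

missing_per_column = gdp.isna().sum()         # missing values per column
complete_rows = gdp.dropna(subset=["2015"])   # keep rows that have a 2015 value

print(missing_per_column)
print(complete_rows)
```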


## Left-over topics

In [16]:
# create a dataframe from a numpy array, with columns labeled
df = pd.DataFrame(np.random.randn(6,4), columns = ['Ann', "Bob", "Charly", "Don"])
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-0.075613,1.017055,-0.231971,0.281038
1,1.871802,1.177187,0.488473,1.617121
2,-1.13293,-2.571783,-1.143734,0.294608
3,-1.395896,-1.399802,-0.005413,-0.118741
4,0.031269,-1.849808,1.511087,-0.351729
5,-1.634218,-1.435256,-0.243885,-0.707536


**Apply functions/logic to the data**

In [17]:
df

Unnamed: 0,Ann,Bob,Charly,Don
0,-0.075613,1.017055,-0.231971,0.281038
1,1.871802,1.177187,0.488473,1.617121
2,-1.13293,-2.571783,-1.143734,0.294608
3,-1.395896,-1.399802,-0.005413,-0.118741
4,0.031269,-1.849808,1.511087,-0.351729
5,-1.634218,-1.435256,-0.243885,-0.707536


In [18]:
df.apply(np.cumsum) # apply the function to each column (running total down the rows)

Unnamed: 0,Ann,Bob,Charly,Don
0,-0.075613,1.017055,-0.231971,0.281038
1,1.79619,2.194242,0.256502,1.898159
2,0.663259,-0.377541,-0.887232,2.192766
3,-0.732636,-1.777343,-0.892645,2.074025
4,-0.701367,-3.627151,0.618442,1.722295
5,-2.335586,-5.062407,0.374557,1.014759


In [19]:
df.apply(lambda x: -x) # apply the function to each column (negate every value)

Unnamed: 0,Ann,Bob,Charly,Don
0,0.075613,-1.017055,0.231971,-0.281038
1,-1.871802,-1.177187,-0.488473,-1.617121
2,1.13293,2.571783,1.143734,-0.294608
3,1.395896,1.399802,0.005413,0.118741
4,-0.031269,1.849808,-1.511087,0.351729
5,1.634218,1.435256,0.243885,0.707536


In [20]:
df.Don.map(lambda x: x+1) # apply the function to each value of a single column

0    1.281038
1    2.617121
2    1.294608
3    0.881259
4    0.648271
5    0.292464
Name: Don, dtype: float64
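
As a complement to the cells above, this sketch contrasts column-wise `apply`, row-wise `apply` (`axis=1`), and element-wise `map` on a small frame with round numbers, so the results are easy to verify by hand. The frame `scores` and the result names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with easy-to-check values.
scores = pd.DataFrame({"Ann": [1.0, 2.0, 3.0], "Bob": [10.0, 20.0, 30.0]})

col_totals = scores.apply(np.sum)          # one result per column
row_totals = scores.apply(np.sum, axis=1)  # one result per row

shifted_map = scores["Bob"].map(lambda v: v + 1)  # element by element
shifted_vec = scores["Bob"] + 1                   # same result, vectorized

print(col_totals)
print(row_totals)
```

For simple arithmetic like `+ 1`, the vectorized form is usually preferable to `map` with a lambda.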

**`dataframe` and table operations**

In [21]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['a','b','c','d'])
df

Unnamed: 0,a,b,c,d
0,0.479395,0.009698,-0.642094,1.526164
1,-1.16595,-0.605575,-0.301711,0.021265
2,-1.29171,-0.401191,0.705015,-0.654065
3,-1.915392,-0.26943,0.45812,-0.181959
4,0.436119,-1.124019,-1.508345,-0.188509
5,-1.175274,-0.378172,1.099213,-0.191926
6,0.306628,0.22411,1.702252,-0.422782
7,1.022252,0.675524,-0.073074,-0.915405
8,0.702581,-1.093067,0.61571,-0.771582
9,-0.624004,1.893307,-0.417301,0.53336


**Concat**

In [22]:
pieces = [df[:3], df[7:]]
print("pieces:\n", pieces)
print("put back together:\n")
# pd.concat(pieces, axis=1)
pd.concat(pieces, axis=0)

pieces:
 [          a         b         c         d
0  0.479395  0.009698 -0.642094  1.526164
1 -1.165950 -0.605575 -0.301711  0.021265
2 -1.291710 -0.401191  0.705015 -0.654065,           a         b         c         d
7  1.022252  0.675524 -0.073074 -0.915405
8  0.702581 -1.093067  0.615710 -0.771582
9 -0.624004  1.893307 -0.417301  0.533360]
put back together:



Unnamed: 0,a,b,c,d
0,0.479395,0.009698,-0.642094,1.526164
1,-1.16595,-0.605575,-0.301711,0.021265
2,-1.29171,-0.401191,0.705015,-0.654065
7,1.022252,0.675524,-0.073074,-0.915405
8,0.702581,-1.093067,0.61571,-0.771582
9,-0.624004,1.893307,-0.417301,0.53336


**Append new data from another `dataframe`**

In [23]:
df_p2 = pd.DataFrame(np.random.randn(4, 4), columns=['a','b','c','d'])
df_p2

Unnamed: 0,a,b,c,d
0,2.118406,-0.072762,0.240113,0.726529
1,0.581579,-0.182054,-0.504068,-1.157024
2,2.021594,-1.165638,0.557288,1.357327
3,1.627731,-0.533214,1.984791,0.81846
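
The cell above only creates the second frame. One way to actually append its rows is `pd.concat` with `ignore_index=True`; note that the older `DataFrame.append` method was removed in pandas 2.0. The sketch uses two small made-up frames (`base`, `extra`) rather than `df` and `df_p2` so that it runs on its own.

```python
import pandas as pd

# Hypothetical frames sharing the same columns.
base = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
extra = pd.DataFrame({"a": [5], "b": [6]})

# Stack the rows of `extra` under `base` and renumber the index 0..n-1
combined = pd.concat([base, extra], ignore_index=True)
print(combined)
```

The same pattern would apply to the notebook's frames, for example `pd.concat([df, df_p2], ignore_index=True)`.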


**Joins**

More details at https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
![](../pics/joins.jpg)

In [24]:
tb1 = pd.DataFrame({'key': ['foo', 'boo', 'foo'], 'lval': [1, 2, 3]})
tb2 = pd.DataFrame({'key': ['foo', 'coo'], 'rval': [5, 6]})

In [25]:
tb1

Unnamed: 0,key,lval
0,foo,1
1,boo,2
2,foo,3


In [26]:
tb2

Unnamed: 0,key,rval
0,foo,5
1,coo,6


In [27]:
pd.merge(tb1, tb2, on='key', how='inner')

Unnamed: 0,key,lval,rval
0,foo,1,5
1,foo,3,5


In [28]:
pd.merge(tb1, tb2, on='key', how='left')

Unnamed: 0,key,lval,rval
0,foo,1,5.0
1,boo,2,
2,foo,3,5.0


In [29]:
pd.merge(tb1, tb2, on='key', how='right')

Unnamed: 0,key,lval,rval
0,foo,1.0,5
1,foo,3.0,5
2,coo,,6


In [30]:
pd.merge(tb1, tb2, on='key', how='outer')

Unnamed: 0,key,lval,rval
0,foo,1.0,5.0
1,foo,3.0,5.0
2,boo,2.0,
3,coo,,6.0
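
To check where each row of a join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below rebuilds `tb1` and `tb2` so it is self-contained; the names `merged`, `n_both`, `n_left_only`, and `n_right_only` are introduced only here.

```python
import pandas as pd

tb1 = pd.DataFrame({'key': ['foo', 'boo', 'foo'], 'lval': [1, 2, 3]})
tb2 = pd.DataFrame({'key': ['foo', 'coo'], 'rval': [5, 6]})

# indicator=True labels each row as 'both', 'left_only' or 'right_only'
merged = pd.merge(tb1, tb2, on='key', how='outer', indicator=True)
print(merged)

n_both = int((merged['_merge'] == 'both').sum())
n_left_only = int((merged['_merge'] == 'left_only').sum())
n_right_only = int((merged['_merge'] == 'right_only').sum())
```

Rows marked `left_only` or `right_only` are exactly the ones that an inner join would drop.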


**Grouping**

By `group by` we mean a process involving one or more of the following steps:

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

See the Grouping section of the official `pandas` documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
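
The three steps listed above can be written out by hand and compared with the built-in shortcut. This is only a sketch on made-up data; `sales`, `manual`, and `built_in` are hypothetical names.

```python
import pandas as pd

# Hypothetical data with easy-to-check values.
sales = pd.DataFrame({
    "team": ["foo", "bar", "foo", "bar"],
    "points": [1, 2, 3, 4],
})

# Manual split-apply-combine
partial = {}
for name, group in sales.groupby("team"):   # split into one sub-frame per team
    partial[name] = group["points"].sum()   # apply a function to each group
manual = pd.Series(partial, name="points")  # combine the results

# The usual one-liner does all three steps at once
built_in = sales.groupby("team")["points"].sum()

print(manual)
print(built_in)
```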

In [31]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df

Unnamed: 0,A,B,C,D
0,foo,one,-0.882953,-1.205503
1,bar,one,-0.04742,1.121723
2,foo,two,1.209021,-0.559162
3,bar,three,-0.157566,-0.022916
4,foo,two,1.161196,0.002663
5,bar,two,0.324595,-1.05781
6,foo,one,-0.896778,-0.125148
7,foo,three,0.238036,1.341379


In [32]:
df.groupby('A')['C'].mean().reset_index() # simple stats grouped by 1 column

Unnamed: 0,A,C
0,bar,0.03987
1,foo,0.165704


In [33]:
df.groupby(['A','B']).sum().reset_index() # simple stats grouped by multiple columns

Unnamed: 0,A,B,C,D
0,bar,one,-0.04742,1.121723
1,bar,three,-0.157566,-0.022916
2,bar,two,0.324595,-1.05781
3,foo,one,-1.779731,-1.33065
4,foo,three,0.238036,1.341379
5,foo,two,2.370217,-0.5565


In [34]:
df.groupby(['A','B']).mean().reset_index() # simple stats grouped by multiple columns

Unnamed: 0,A,B,C,D
0,bar,one,-0.04742,1.121723
1,bar,three,-0.157566,-0.022916
2,bar,two,0.324595,-1.05781
3,foo,one,-0.889866,-0.665325
4,foo,three,0.238036,1.341379
5,foo,two,1.185108,-0.27825


In [35]:
# df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x**2)).reset_index() # customized aggregation
df.groupby(['A','B'])['C'].apply(lambda x: np.sum(x)).reset_index() # customized aggregation

Unnamed: 0,A,B,C
0,bar,one,-0.04742
1,bar,three,-0.157566
2,bar,two,0.324595
3,foo,one,-1.779731
4,foo,three,0.238036
5,foo,two,2.370217
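
When several statistics are needed at once, named aggregation with `.agg` is one option. The sketch below uses made-up values, and the output column names `c_mean`, `c_max`, and `n` are chosen only for this example.

```python
import pandas as pd

# Hypothetical data with easy-to-check values.
toy = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 6.0],
})

# Each keyword is an output column: (input column, aggregation)
summary = (
    toy.groupby("A")
       .agg(c_mean=("C", "mean"), c_max=("C", "max"), n=("C", "size"))
       .reset_index()
)
print(summary)
```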


**Pivot table**

In [36]:
df = pd.DataFrame({'ModelNumber' : ['one', 'one', 'two', 'three'] * 3,
                   'Submodel' : ['A', 'B', 'C'] * 4,
                   'Type' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'Xval' : np.random.randn(12),
                   'Yval' : np.random.randn(12)})

df

Unnamed: 0,ModelNumber,Submodel,Type,Xval,Yval
0,one,A,foo,-0.308225,-0.782224
1,one,B,foo,-0.149588,1.189076
2,two,C,foo,1.220761,-0.144764
3,three,A,bar,1.117211,0.05749
4,one,B,bar,0.261723,1.436341
5,one,C,bar,-0.557518,0.5738
6,two,A,foo,-0.905013,0.287883
7,three,B,foo,-0.297401,-0.507857
8,one,C,foo,1.38222,-0.625587
9,one,A,bar,0.093153,-1.15903


We can produce pivot tables from this data very easily:

In [37]:
pd.pivot_table(
    df
    , values='Xval'
    , index=['ModelNumber', 'Submodel']
    , columns=['Type']
)

,Type,bar,foo
ModelNumber,Submodel,,
one,A,0.093153,-0.308225
one,B,0.261723,-0.149588
one,C,-0.557518,1.38222
three,A,1.117211,
three,B,,-0.297401
three,C,-0.609428,
two,A,,-0.905013
two,B,1.445047,
two,C,,1.220761


In [38]:
pd.pivot_table(
    df
    , values='Xval'
    , index=['ModelNumber', 'Submodel']
    , columns=['Type']
#     , aggfunc='count'
    , aggfunc=lambda x: x.abs().sum()  # aggfunc must return one value per group
)

,Type,bar,foo
ModelNumber,Submodel,,
one,A,0.093153,0.308225
one,B,0.261723,0.149588
one,C,0.557518,1.38222
three,A,1.117211,
three,B,,0.297401
three,C,0.609428,
two,A,,0.905013
two,B,1.445047,
two,C,,1.220761
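
The empty cells in the tables above are combinations that do not occur in the data. As a sketch, `fill_value` can replace them, and a named `aggfunc` such as `'sum'` combines rows that fall into the same cell. The frame `orders` and its columns are invented for this example.

```python
import pandas as pd

# Hypothetical data: two rows share the cell (model "one", kind "foo").
orders = pd.DataFrame({
    "model": ["one", "one", "one", "two"],
    "kind": ["foo", "foo", "bar", "foo"],
    "x": [1.0, 2.0, 5.0, 7.0],
})

table = pd.pivot_table(
    orders,
    values="x",
    index="model",
    columns="kind",
    aggfunc="sum",   # add up rows that land in the same cell
    fill_value=0,    # show 0 where a combination never occurs
)
print(table)
```

When `aggfunc` is omitted, `pivot_table` uses the mean.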


**Write/Export `dataframe` to files**

**CSV file**

In [39]:
df

Unnamed: 0,ModelNumber,Submodel,Type,Xval,Yval
0,one,A,foo,-0.308225,-0.782224
1,one,B,foo,-0.149588,1.189076
2,two,C,foo,1.220761,-0.144764
3,three,A,bar,1.117211,0.05749
4,one,B,bar,0.261723,1.436341
5,one,C,bar,-0.557518,0.5738
6,two,A,foo,-0.905013,0.287883
7,three,B,foo,-0.297401,-0.507857
8,one,C,foo,1.38222,-0.625587
9,one,A,bar,0.093153,-1.15903


In [40]:
df.to_csv('../data/to-csv-test.csv', sep=',', header=True, index=False)  # index=False omits the row index

**Excel spreadsheet**

In [41]:
# writing .xlsx needs an Excel writer engine such as openpyxl to be installed
df.to_excel('../data/to-excel-test.xlsx', sheet_name='tab1', header=True, index=False)
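
A quick way to confirm that an export worked is to read the file back. The sketch below does this for CSV using a temporary directory instead of the `../data` folder; `frame`, `restored`, and the file name `roundtrip.csv` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for the notebook's df.
frame = pd.DataFrame({"Type": ["foo", "bar"], "Xval": [1.5, -0.25]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")
    frame.to_csv(path, index=False)  # write without the row index
    restored = pd.read_csv(path)     # read it back into a new DataFrame

print(restored)
```

If the index is written out (`index=True`), reading the file back yields an extra unnamed column unless `index_col=0` is passed to `read_csv`.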