## 01_py_basics

#### Python repository for general data manipulation techniques and working with pandas. 

## This notebook will cover:

**1. Selecting columns**
* basic selection and drop columns
* select using pattern recognition
* relocate columns

**2. Creating new columns and values**
* basics of creating new columns
* basic sums and numerical column manipulation
* row sums and combine columns together
* conditionally create columns (ifelse / case_when equivalent)

**3. Filtering data**
* basic filtering
* filter NAs
* filter using arrays

**4. Aggregations using groupby**
* basic groupby aggregation
* multiple calculations
* groupby with conditional calculations
* unnest concatenated cells

**5. Joins (merge)**
* left join using merge
* concatenate dfs together

**Glossary**
Glossary of functions used throughout notebook. 

### 0. Set up ---

Basic set up to load and inspect data before any data exploration and analysis:

The first step is to load the basic Python libraries and the data.

**Please note that the trade data is dummy data for demonstration purposes, not real-world values.**

In [1]:
# pandas and numpy are universally used in python, like tidyverse is in R. 
import pandas as pd
import numpy as np

!pip install openpyxl

# change from scientific notation
pd.set_option('display.float_format', lambda x: '%.5f' % x)

trade = pd.read_excel("data/trade_data.xlsx") # upload xlsx
tariff = pd.read_excel("data/tariff_data.xlsx")
uk_trqs = pd.read_csv("data/uk_trqs.csv",dtype={'quota__order_number': str})
# upload csv

Looking in indexes: https://s3-eu-west-2.amazonaws.com/mirrors.notebook.uktrade.io/pypi/
Collecting openpyxl
  Downloading https://s3-eu-west-2.amazonaws.com/mirrors.notebook.uktrade.io/pypi/openpyxl/openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading https://s3-eu-west-2.amazonaws.com/mirrors.notebook.uktrade.io/pypi/et-xmlfile/et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.10

In [2]:
trade.head()

Unnamed: 0,Year,Flow,Commodity Code,Country Code,Country Name,Value GBP,Suppression notes
0,2020,Exports,1012100,TW,Taiwan,892,
1,2020,Exports,1062000,TW,Taiwan,14101,
2,2020,Exports,1063100,TW,Taiwan,1750,
3,2020,Exports,2031913,TW,Taiwan,290818,
4,2020,Exports,2031990,TW,Taiwan,1140,


basic df exploration:

In [3]:
# column names and types:
trade.dtypes

Year                   int64
Flow                  object
Commodity Code        object
Country Code          object
Country Name          object
Value GBP              int64
Suppression notes    float64
dtype: object

In [4]:
# df summary:
trade.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41142 entries, 0 to 41141
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               41142 non-null  int64  
 1   Flow               41142 non-null  object 
 2   Commodity Code     41142 non-null  object 
 3   Country Code       41142 non-null  object 
 4   Country Name       41142 non-null  object 
 5   Value GBP          41142 non-null  int64  
 6   Suppression notes  0 non-null      float64
dtypes: float64(1), int64(2), object(4)
memory usage: 2.2+ MB


Using .info() is very useful: in addition to the dtypes, it prints the "non-null" count for each column, which shows how many values are not NA. For example, the suppression notes column has 0 non-null values, so it contains only NAs.

In [5]:
# summarise numerical values
trade.describe()

Unnamed: 0,Year,Value GBP,Suppression notes
count,41142.0,41142.0,0.0
mean,2019.92774,2658886.4581,
std,0.25892,61225646.8482,
min,2019.0,4.0,
25%,2020.0,5892.25,
50%,2020.0,34204.5,
75%,2020.0,260768.75,
max,2020.0,8963450144.0,


In [6]:
# simple df dimensions use shape:
trade.shape

(41142, 7)

**Note:** the year column is loaded as a numeric value (int64). It may be preferable to work with a string type rather than a number for this column. The data type can be specified when loading the data:

In [3]:
trade2 = pd.read_excel("data/trade_data.xlsx",dtype={'Year': str}) # convert year to string when uploading data
trade3 = pd.read_excel("data/trade_data.xlsx",dtype=str) # all columns as string
trade4 = pd.read_excel("data/trade_data.xlsx",dtype={'Value GBP': np.float64}) # convert value to float as opposed to integer. Floats allow for decimal points
print(trade2.dtypes,trade3.dtypes,trade4.dtypes)

Year                  object
Flow                  object
Commodity Code        object
Country Code          object
Country Name          object
Value GBP              int64
Suppression notes    float64
dtype: object Year                 object
Flow                 object
Commodity Code       object
Country Code         object
Country Name         object
Value GBP            object
Suppression notes    object
dtype: object Year                   int64
Flow                  object
Commodity Code        object
Country Code          object
Country Name          object
Value GBP            float64
Suppression notes    float64
dtype: object
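
If the data has already been loaded, the types can also be converted afterwards with astype. The following is a minimal sketch using a small made-up frame (`toy` is hypothetical and only borrows the column names from the trade data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the trade data
toy = pd.DataFrame({"Year": [2019, 2020], "Value GBP": [892, 14101]})

# Convert selected columns after loading rather than at read time
toy = toy.astype({"Year": str, "Value GBP": "float64"})
print(toy.dtypes)
```

Specifying dtype at read time avoids a second pass over the data, while astype is handy when the need for a different type only becomes clear later.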


In [2]:
# want float for value so re-upload trade data:
trade = pd.read_excel("data/trade_data.xlsx",dtype={'Value GBP': np.float64})
trade.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41142 entries, 0 to 41141
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               41142 non-null  int64  
 1   Flow               41142 non-null  object 
 2   Commodity Code     41142 non-null  object 
 3   Country Code       41142 non-null  object 
 4   Country Name       41142 non-null  object 
 5   Value GBP          41142 non-null  float64
 6   Suppression notes  0 non-null      float64
dtypes: float64(2), int64(1), object(4)
memory usage: 2.2+ MB


## janitor - clean_names() equivalent. 

Working with cleaner string/column names is highly recommended.

In [2]:
trade.columns = trade.columns.str.lower().str.replace(" ","_")
trade.dtypes

year                   int64
flow                  object
commodity_code        object
country_code          object
country_name          object
value_gbp              int64
suppression_notes    float64
dtype: object

In [4]:
# using function - helpful if multiple dataframes to convert.
def cleanCols(df):
    df.columns = df.columns.str.lower().str.replace(" ","_")
    return df

trade = cleanCols(trade)
trade2 = cleanCols(trade2)
trade3 = cleanCols(trade3)
tariff = cleanCols(tariff)
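
Note that a function which assigns to df.columns also renames the columns of the dataframe passed in. As a possible variation (an assumption about what might be wanted, not part of the original notebook), the sketch below works on a copy and also replaces any run of non-alphanumeric characters with a single underscore. The name `clean_cols` and the toy frame are hypothetical:

```python
import pandas as pd

def clean_cols(df):
    """Return a copy of df with lower-case, underscore-separated column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()                        # trim surrounding spaces
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)   # spaces, dashes etc. -> _
        .str.strip("_")                                # no leading/trailing _
    )
    return out

# Hypothetical toy frame with untidy column names
toy = pd.DataFrame({"Value GBP": [1.0], " Country-Name ": ["Taiwan"]})
cleaned = clean_cols(toy)
print(list(cleaned.columns))
```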

## 1. Select columns ----

basic selection:

In [11]:
trade2 = trade[["year","flow","commodity_code","country_name","value_gbp"]]
trade2.dtypes

year                int64
flow               object
commodity_code     object
country_name       object
value_gbp         float64
dtype: object

In [12]:
# use an array:
cols = ["year","flow","commodity_code","country_name","value_gbp"]
trade2 = trade[cols]

In [13]:
trade2.dtypes

year                int64
flow               object
commodity_code     object
country_name       object
value_gbp         float64
dtype: object

drop columns:

In [14]:
# remove columns
trade2 = trade.drop(["year","flow","commodity_code"], axis=1) # axis=1 refers to columns, so these columns are removed from the df
trade2.dtypes

country_code          object
country_name          object
value_gbp            float64
suppression_notes    float64
dtype: object

In [15]:
trade2 = trade.drop(cols, axis=1)
trade2.dtypes

country_code          object
suppression_notes    float64
dtype: object

select columns using column indexes (numbers): tbc:
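
One way to select by position is .iloc, which takes integer row and column positions. This is a minimal sketch on a made-up frame (`toy` is hypothetical; positions start at 0):

```python
import pandas as pd

# Hypothetical toy frame using a few of the trade column names
toy = pd.DataFrame({
    "year": [2020, 2020],
    "flow": ["Exports", "Imports"],
    "commodity_code": ["01012100", "01062000"],
    "value_gbp": [892.0, 14101.0],
})

first_two = toy.iloc[:, :2]             # all rows, first two columns
picked = toy.iloc[:, [0, 3]]            # columns at positions 0 and 3
by_position = toy[toy.columns[[0, 3]]]  # same result via the column Index
```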

### 1.a select columns using string patterns

The tariff data uploaded is a good df for this example as it has a lot of strings with patterns which can be used for tidy selections.

In [16]:
tariff.dtypes

commodity_heading                           object
commodity_code                               int64
x8_digit_or_10_digit                         int64
commodity_code_description                  object
mfn_applied_duty_rate                       object
preferential_applied_duty_rate_2021         object
preferential_applied_duty_rate_2022         object
preferential_applied_duty_rate_2023         object
preferential_applied_duty_rate_2024         object
quota_number                                object
in_quota_tariff_line_code                  float64
preferential_applied_duty_rate_excluded     object
notes                                       object
cn8                                          int64
hs2                                          int64
hs_section                                  object
hs2_description                             object
mfn_applied_rate_ukgt                       object
value_usd                                  float64
cn8_count                      

In [17]:
prefCol = tariff.columns[tariff.columns.str.contains(pat = 'pref')]
prefCol2 = [col for col in tariff.columns if 'pref' in col]

In [18]:
print(prefCol,prefCol2)

Index(['preferential_applied_duty_rate_2021',
       'preferential_applied_duty_rate_2022',
       'preferential_applied_duty_rate_2023',
       'preferential_applied_duty_rate_2024',
       'preferential_applied_duty_rate_excluded'],
      dtype='object') ['preferential_applied_duty_rate_2021', 'preferential_applied_duty_rate_2022', 'preferential_applied_duty_rate_2023', 'preferential_applied_duty_rate_2024', 'preferential_applied_duty_rate_excluded']


Note the difference between output types: the first is a pandas Index, the second is a plain Python list.

In [19]:
mfnCol = [col for col in tariff.columns if 'mfn' in col]

In [20]:
codeCol = [col for col in tariff.columns if 'commodity' in col]

In [21]:
colNames = [codeCol,mfnCol,prefCol2]
print(colNames)

[['commodity_heading', 'commodity_code', 'commodity_code_description'], ['mfn_applied_duty_rate', 'mfn_applied_rate_ukgt'], ['preferential_applied_duty_rate_2021', 'preferential_applied_duty_rate_2022', 'preferential_applied_duty_rate_2023', 'preferential_applied_duty_rate_2024', 'preferential_applied_duty_rate_excluded']]


In [22]:
#tariff2 = tariff[colNames]
#tariff2.dtypes
# for error fix use:
#colNames = np.concatenate((codeCol,prefCol, mfnCol))

**NOTE the error.** colNames is a list containing three separate lists (a nested list), which can't be used in this way to select columns from a pandas df.

Instead, combine the lists into one flat array of column names using np.concatenate:

In [23]:
prefCol = [col for col in tariff.columns if 'pref' in col]
mfnCol = [col for col in tariff.columns if 'mfn' in col]
codeCol = [col for col in tariff.columns if 'commodity' in col]
colNames2 =  np.concatenate((codeCol,prefCol, mfnCol))
tariff2 = tariff[colNames2]
tariff2.head()

Unnamed: 0,commodity_heading,commodity_code,commodity_code_description,preferential_applied_duty_rate_2021,preferential_applied_duty_rate_2022,preferential_applied_duty_rate_2023,preferential_applied_duty_rate_2024,preferential_applied_duty_rate_excluded,mfn_applied_duty_rate,mfn_applied_rate_ukgt
0,01 - Live Animals,1012100,Pure-bred breeding horses,0%,0%,0%,0%,N,0%,0.0
1,01 - Live Animals,1012910,Horses for slaughter,0%,0%,0%,0%,N,0%,0.0
2,01 - Live Animals,1012990,"Live horses (excl. for slaughter, pure-bred fo...",0%,0%,0%,0%,N,10%,0.1
3,01 - Live Animals,1013000,Live asses,0%,0%,0%,0%,N,6%,0.06
4,01 - Live Animals,1019000,Live mules and hinnies,0%,0%,0%,0%,N,10%,0.1
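
As an alternative to np.concatenate, plain Python lists can be joined with + or flattened with itertools.chain. A small sketch, assuming a made-up empty frame (`toy`) that borrows a few of the tariff column names:

```python
from itertools import chain

import pandas as pd

# Hypothetical empty frame with a handful of tariff-style column names
toy = pd.DataFrame(columns=[
    "commodity_code", "mfn_applied_duty_rate",
    "preferential_applied_duty_rate_2021", "notes",
])

code_col = [c for c in toy.columns if "commodity" in c]
pref_col = [c for c in toy.columns if "pref" in c]
mfn_col = [c for c in toy.columns if "mfn" in c]

# Option 1: list concatenation gives one flat list
col_names = code_col + pref_col + mfn_col

# Option 2: flatten an existing nested list
nested = [code_col, pref_col, mfn_col]
flat = list(chain.from_iterable(nested))

subset = toy[col_names]
```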


### 1b. select columns with numerical values and combination of string patterns

Select columns which contain numerical values and where numerical values end the column string

i.e. preferential... + 2021, 2022 etc.

```python
tariff2=tariff[["commodity_code","preferential_applied_duty_rate_2021",
                "preferential_applied_duty_rate_2022",
                "preferential_applied_duty_rate_2023",
                "preferential_applied_duty_rate_2024"]]
```

If there were even more columns, typing everything out manually would be tedious and time-consuming when it can easily be done using string pattern matching.

In [24]:
col = np.array(tariff.columns[tariff.columns.str.contains('.*[0-9].*', regex=True)]) # select columns with any numerical value
col

array(['x8_digit_or_10_digit', 'preferential_applied_duty_rate_2021',
       'preferential_applied_duty_rate_2022',
       'preferential_applied_duty_rate_2023',
       'preferential_applied_duty_rate_2024', 'cn8', 'hs2',
       'hs2_description', 'cn8_count', 'tariff_status_2021',
       'tariff_status_final_2021', 'tariff_status_2022',
       'tariff_status_final_2022', 'tariff_status_2023',
       'tariff_status_final_2023', 'tariff_status_2024',
       'tariff_status_final_2024'], dtype=object)

This doesn't return what is required, as it matches a digit anywhere in the column name. str.contains can be combined multiple times to narrow the selection:

In [25]:
# doesn't work when trying to match numerical values at the end of the string: endswith() takes a literal suffix, not a regex pattern, so nothing matches
col_list = [col for col in tariff.columns if col.endswith('.*[0-9].*')]
col_list

[]
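
To match digits at the end of a name, a regex anchored with $ can be used with str.contains or re.search. A minimal sketch on a made-up Index (`cols` is hypothetical and only contains a few of the tariff column names):

```python
import re

import pandas as pd

# Hypothetical sample of tariff column names
cols = pd.Index([
    "x8_digit_or_10_digit", "preferential_applied_duty_rate_2021",
    "hs2", "hs2_description", "tariff_status_2021", "notes",
])

# $ anchors the pattern to the end of the string
ends_with_digit = cols[cols.str.contains(r"[0-9]$", regex=True)]

# list-comprehension version: names ending in an underscore plus four digits
ends_with_year = [c for c in cols if re.search(r"_[0-9]{4}$", c)]
```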

In [26]:
# alternative quick way is to use a simple pattern within the numerical strings; however, this also extracts unwanted tariff columns:
cl = tariff.columns[tariff.columns.str.contains(pat = '20')]
cl

Index(['preferential_applied_duty_rate_2021',
       'preferential_applied_duty_rate_2022',
       'preferential_applied_duty_rate_2023',
       'preferential_applied_duty_rate_2024', 'tariff_status_2021',
       'tariff_status_final_2021', 'tariff_status_2022',
       'tariff_status_final_2022', 'tariff_status_2023',
       'tariff_status_final_2023', 'tariff_status_2024',
       'tariff_status_final_2024'],
      dtype='object')

In [27]:
col = np.array(tariff.columns[tariff.columns.str.contains('20',regex=True)]) # select columns containing '20'
col

array(['preferential_applied_duty_rate_2021',
       'preferential_applied_duty_rate_2022',
       'preferential_applied_duty_rate_2023',
       'preferential_applied_duty_rate_2024', 'tariff_status_2021',
       'tariff_status_final_2021', 'tariff_status_2022',
       'tariff_status_final_2022', 'tariff_status_2023',
       'tariff_status_final_2023', 'tariff_status_2024',
       'tariff_status_final_2024'], dtype=object)

In [28]:
#example using startswith and endswith:
col_list = [col for col in tariff.columns if (col.startswith('pref') and col.endswith("2"))]
col_list

['preferential_applied_duty_rate_2022']

In [29]:
c = np.array(tariff.columns[tariff.columns.str.contains(pat = "pref") & tariff.columns.str.contains('20',regex=True)])
c

array(['preferential_applied_duty_rate_2021',
       'preferential_applied_duty_rate_2022',
       'preferential_applied_duty_rate_2023',
       'preferential_applied_duty_rate_2024'], dtype=object)

In [30]:
# need to combine commodity code with c in np.array
cd = ["commodity_code"]
c2 = np.concatenate((cd,c))
tariff[c2].head()

Unnamed: 0,commodity_code,preferential_applied_duty_rate_2021,preferential_applied_duty_rate_2022,preferential_applied_duty_rate_2023,preferential_applied_duty_rate_2024
0,1012100,0%,0%,0%,0%
1,1012910,0%,0%,0%,0%
2,1012990,0%,0%,0%,0%
3,1013000,0%,0%,0%,0%
4,1019000,0%,0%,0%,0%


In [31]:
# full solution:
c = np.array(tariff.columns[tariff.columns.str.contains(pat = "pref") & tariff.columns.str.contains('20',regex=True)])
cd = ["commodity_code"]
c2 = np.concatenate((cd,c))
tariff2 = tariff[c2]
tariff2.head()

Unnamed: 0,commodity_code,preferential_applied_duty_rate_2021,preferential_applied_duty_rate_2022,preferential_applied_duty_rate_2023,preferential_applied_duty_rate_2024
0,1012100,0%,0%,0%,0%
1,1012910,0%,0%,0%,0%
2,1012990,0%,0%,0%,0%
3,1013000,0%,0%,0%,0%
4,1019000,0%,0%,0%,0%
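
The same selection can possibly be written in one step with DataFrame.filter, which keeps the columns whose names match a regex. This is a sketch on a made-up empty frame (`toy` is hypothetical), not the method used in the notebook:

```python
import pandas as pd

# Hypothetical empty frame with a few tariff-style column names
toy = pd.DataFrame(columns=[
    "commodity_code", "commodity_code_description",
    "preferential_applied_duty_rate_2021", "preferential_applied_duty_rate_2022",
    "preferential_applied_duty_rate_excluded", "tariff_status_2021",
])

# keep commodity_code exactly, plus pref columns ending in a four-digit year
subset = toy.filter(regex=r"^commodity_code$|^pref.*_[0-9]{4}$")
print(list(subset.columns))
```

filter keeps the columns in their original order, so commodity_code stays first only because it already comes first in the frame.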


****

### 1c. Relocate columns:

I am currently unaware of a single-line function which achieves this, like relocate in tidyverse. However, it only takes a few lines once you have specified the columns to be relocated within the df.

Example: trade data set - move flow column next to trade value

In [32]:
trade2 = trade.copy()

In [33]:
# name column(s) to be moved:
col = trade2["flow"]
# drop column in df
trade2.drop(labels=["flow"], axis = 1, inplace = True)
# insert column back in and select position. Value gbp is column 5(4 when index starts at 0). 
trade2.insert(4,"flow",col)
trade2.head()

Unnamed: 0,year,commodity_code,country_code,country_name,flow,value_gbp,suppression_notes
0,2020,1012100,TW,Taiwan,Exports,892.0,
1,2020,1062000,TW,Taiwan,Exports,14101.0,
2,2020,1063100,TW,Taiwan,Exports,1750.0,
3,2020,2031913,TW,Taiwan,Exports,290818.0,
4,2020,2031990,TW,Taiwan,Exports,1140.0,


In [34]:
# Can easily move multiple columns using the same method:
cols = ["country_name","country_code"]
col1 = trade2["country_name"]
col2 = trade2["country_code"]
trade2.drop(cols, axis = 1, inplace = True)
# insert both columns back in at position 1 (directly after year).
# country_code is inserted last at the same position, so it ends up ahead of country_name.
trade2.insert(1,"country_name",col1)
trade2.insert(1,"country_code",col2)
trade2.head()

Unnamed: 0,year,country_code,country_name,commodity_code,flow,value_gbp,suppression_notes
0,2020,TW,Taiwan,1012100,Exports,892.0,
1,2020,TW,Taiwan,1062000,Exports,14101.0,
2,2020,TW,Taiwan,1063100,Exports,1750.0,
3,2020,TW,Taiwan,2031913,Exports,290818.0,
4,2020,TW,Taiwan,2031990,Exports,1140.0,


If you want to move a larger selection of columns, the above method isn't the most helpful. You can more easily specify the selection by naming the columns in order (similar to select in tidyverse):

In [35]:
trade2 = trade[["year","country_code","country_name","flow","commodity_code","value_gbp","suppression_notes"]]
trade2.head()

Unnamed: 0,year,country_code,country_name,flow,commodity_code,value_gbp,suppression_notes
0,2020,TW,Taiwan,Exports,1012100,892.0,
1,2020,TW,Taiwan,Exports,1062000,14101.0,
2,2020,TW,Taiwan,Exports,1063100,1750.0,
3,2020,TW,Taiwan,Exports,2031913,290818.0,
4,2020,TW,Taiwan,Exports,2031990,1140.0,


However, if you have a lot more columns this is also not particularly helpful if you want to reduce the time spent writing out column names.

In [36]:
#example df:
    
prefCol = [col for col in tariff.columns if 'pref' in col]
mfnCol = [col for col in tariff.columns if 'mfn' in col]
codeCol = [col for col in tariff.columns if 'commodity' in col]
tariffCol = [col for col in tariff.columns if 'tariff' in col]
colNames2 =  np.concatenate((codeCol,prefCol, mfnCol,tariffCol))
tariff2 = tariff[colNames2]
tariff2.dtypes

commodity_heading                           object
commodity_code                               int64
commodity_code_description                  object
preferential_applied_duty_rate_2021         object
preferential_applied_duty_rate_2022         object
preferential_applied_duty_rate_2023         object
preferential_applied_duty_rate_2024         object
preferential_applied_duty_rate_excluded     object
mfn_applied_duty_rate                       object
mfn_applied_rate_ukgt                       object
in_quota_tariff_line_code                  float64
tariff_status_2021                          object
tariff_status_final_2021                    object
tariff_status_2022                          object
tariff_status_final_2022                    object
tariff_status_2023                          object
tariff_status_final_2023                    object
tariff_status_2024                          object
tariff_status_final_2024                    object
dtype: object

There are a lot of string patterns within this dataframe's column names. However, I am approaching this as if there weren't and we wanted to relocate multiple columns to selected positions within a df.

In [37]:
tariff2 = tariff.copy()

In [38]:
# relocate MFN columns to front of data frame (method is useful when moving numerous columns to new position)
cols_to_move = ["mfn_applied_duty_rate","mfn_applied_rate_ukgt"]
#col_index = ["commo
tariff3 = tariff2[cols_to_move + [ col for col in tariff2.columns if col not in cols_to_move ]]
tariff3.head(3)

Unnamed: 0,mfn_applied_duty_rate,mfn_applied_rate_ukgt,commodity_heading,commodity_code,x8_digit_or_10_digit,commodity_code_description,preferential_applied_duty_rate_2021,preferential_applied_duty_rate_2022,preferential_applied_duty_rate_2023,preferential_applied_duty_rate_2024,...,value_usd,cn8_count,tariff_status_2021,tariff_status_final_2021,tariff_status_2022,tariff_status_final_2022,tariff_status_2023,tariff_status_final_2023,tariff_status_2024,tariff_status_final_2024
0,0%,0.0,01 - Live Animals,1012100,8,Pure-bred breeding horses,0%,0%,0%,0%,...,15123.15643,1,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero
1,0%,0.0,01 - Live Animals,1012910,8,Horses for slaughter,0%,0%,0%,0%,...,,1,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero
2,10%,0.1,01 - Live Animals,1012990,8,"Live horses (excl. for slaughter, pure-bred fo...",0%,0%,0%,0%,...,25331.97586,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero


In [39]:
tariff2.dtypes

commodity_heading                           object
commodity_code                               int64
x8_digit_or_10_digit                         int64
commodity_code_description                  object
mfn_applied_duty_rate                       object
preferential_applied_duty_rate_2021         object
preferential_applied_duty_rate_2022         object
preferential_applied_duty_rate_2023         object
preferential_applied_duty_rate_2024         object
quota_number                                object
in_quota_tariff_line_code                  float64
preferential_applied_duty_rate_excluded     object
notes                                       object
cn8                                          int64
hs2                                          int64
hs_section                                  object
hs2_description                             object
mfn_applied_rate_ukgt                       object
value_usd                                  float64
cn8_count                      

In [40]:
# move pref columns to end of df
cols_to_move = [col for col in tariff.columns if 'pref' in col]
tariff3 = tariff2[[ col for col in tariff2.columns if col not in cols_to_move ]+cols_to_move]
tariff3.dtypes

commodity_heading                           object
commodity_code                               int64
x8_digit_or_10_digit                         int64
commodity_code_description                  object
mfn_applied_duty_rate                       object
quota_number                                object
in_quota_tariff_line_code                  float64
notes                                       object
cn8                                          int64
hs2                                          int64
hs_section                                  object
hs2_description                             object
mfn_applied_rate_ukgt                       object
value_usd                                  float64
cn8_count                                    int64
tariff_status_2021                          object
tariff_status_final_2021                    object
tariff_status_2022                          object
tariff_status_final_2022                    object
tariff_status_2023             

### **Still looking for a solution to move selected columns to an arbitrary position in df, i.e. relocate pref columns after "in_quota_tariff_line_code" for example**
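
One possible approach is a small helper that rebuilds the column order as a list. This is only a sketch: `relocate` is a hypothetical function written for illustration (it is not a pandas function), and the toy frame uses shortened, made-up column names.

```python
import pandas as pd

def relocate(df, cols_to_move, after):
    """Return df with cols_to_move placed directly after the column `after`."""
    # columns that stay put, in their current order
    remaining = [c for c in df.columns if c not in cols_to_move]
    # position just after the anchor column (ValueError if `after` is missing
    # or is itself one of the columns being moved)
    pos = remaining.index(after) + 1
    new_order = remaining[:pos] + list(cols_to_move) + remaining[pos:]
    return df[new_order]

# Hypothetical empty frame with shortened tariff-style column names
toy = pd.DataFrame(columns=[
    "commodity_code", "pref_2021", "pref_2022",
    "quota_number", "in_quota_tariff_line_code", "notes",
])
pref_cols = [c for c in toy.columns if "pref" in c]
moved = relocate(toy, pref_cols, after="in_quota_tariff_line_code")
print(list(moved.columns))
```

The helper returns a reordered selection and leaves the original dataframe untouched, in the same way as the cols_to_move examples above.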

****

## 2. Create new columns

Creating columns is simple in Python.

#### Basics

In [41]:
trade2 = trade.copy()

In [42]:
trade2["new_col"] = 10
trade2.head(3)

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,new_col
0,2020,Exports,1012100,TW,Taiwan,892.0,,10
1,2020,Exports,1062000,TW,Taiwan,14101.0,,10
2,2020,Exports,1063100,TW,Taiwan,1750.0,,10


In [43]:
#convert gbp values:
usd = 0.8
eur = 0.9
trade2["value_usd"] = trade2["value_gbp"]*usd
trade2["value_eur"] = trade2["value_gbp"]*eur
trade2.head(3)

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,new_col,value_usd,value_eur
0,2020,Exports,1012100,TW,Taiwan,892.0,,10,713.6,802.8
1,2020,Exports,1062000,TW,Taiwan,14101.0,,10,11280.8,12690.9
2,2020,Exports,1063100,TW,Taiwan,1750.0,,10,1400.0,1575.0


In [44]:
# add columns together
trade2["new_col"] = trade2["value_gbp"]+trade2["value_usd"]+trade2["value_eur"]
trade2["new_col2"] = trade2["value_gbp"]/100
trade2.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,new_col,value_usd,value_eur,new_col2
0,2020,Exports,1012100,TW,Taiwan,892.0,,2408.4,713.6,802.8,8.92
1,2020,Exports,1062000,TW,Taiwan,14101.0,,38072.7,11280.8,12690.9,141.01
2,2020,Exports,1063100,TW,Taiwan,1750.0,,4725.0,1400.0,1575.0,17.5
3,2020,Exports,2031913,TW,Taiwan,290818.0,,785208.6,232654.4,261736.2,2908.18
4,2020,Exports,2031990,TW,Taiwan,1140.0,,3078.0,912.0,1026.0,11.4


In [45]:
# summarise column values easily:
trade2.sum()

year                                                          83103867
flow                 ExportsExportsExportsExportsExportsExportsExpo...
commodity_code       0101210001062000010631000203191302031990020322...
country_code         TWTWTWTWTWTWTWTWTWTWTWTWTWTWTWTWTWTWTWTWTWTWTW...
country_name         TaiwanTaiwanTaiwanTaiwanTaiwanTaiwanTaiwanTaiw...
value_gbp                                           109391906659.00000
suppression_notes                                              0.00000
new_col                                             295358147979.30005
value_usd                                            87513525327.20001
value_eur                                            98452715993.10001
new_col2                                              1093919066.59000
dtype: object

In [46]:
total_value_gbp = trade["value_gbp"].sum()
total_value_gbp

109391906659.0

In [47]:
# count total number of NaNs. Very useful for a quick check.
trade.isnull().sum()

year                     0
flow                     0
commodity_code           0
country_code             0
country_name             0
value_gbp                0
suppression_notes    41142
dtype: int64

### **Sum across rows:**

In [48]:
trade2.dtypes

year                   int64
flow                  object
commodity_code        object
country_code          object
country_name          object
value_gbp            float64
suppression_notes    float64
new_col              float64
value_usd            float64
value_eur            float64
new_col2             float64
dtype: object

In [49]:
# example: sum all new column values together
trade2["sum_col"] = trade2["new_col"]+trade2["value_usd"]+trade2["value_eur"]+trade2["new_col2"]

In [50]:
# alternatively name columns and sum across which is cleaner and less time to type:
sum_cols = ["new_col","value_usd","value_eur","new_col2"]
trade2["sum_col2"] = trade2[sum_cols].sum(axis=1)
trade2.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,new_col,value_usd,value_eur,new_col2,sum_col,sum_col2
0,2020,Exports,1012100,TW,Taiwan,892.0,,2408.4,713.6,802.8,8.92,3933.72,3933.72
1,2020,Exports,1062000,TW,Taiwan,14101.0,,38072.7,11280.8,12690.9,141.01,62185.41,62185.41
2,2020,Exports,1063100,TW,Taiwan,1750.0,,4725.0,1400.0,1575.0,17.5,7717.5,7717.5
3,2020,Exports,2031913,TW,Taiwan,290818.0,,785208.6,232654.4,261736.2,2908.18,1282507.38,1282507.38
4,2020,Exports,2031990,TW,Taiwan,1140.0,,3078.0,912.0,1026.0,11.4,5027.4,5027.4


### **Update numerical columns only:**

In [51]:
# example 1:
#numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
#for c in [c for c in trade.columns if trade[c].dtype in numerics]:
#    trade[c] = trade[c]/100

In [52]:
# smaller one line example:
numeric_df = trade2.apply(lambda x: x/100 if np.issubdtype(x.dtype, np.number) else x)
numeric_df.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,new_col,value_usd,value_eur,new_col2,sum_col,sum_col2
0,20.2,Exports,1012100,TW,Taiwan,8.92,,24.084,7.136,8.028,0.0892,39.3372,39.3372
1,20.2,Exports,1062000,TW,Taiwan,141.01,,380.727,112.808,126.909,1.4101,621.8541,621.8541
2,20.2,Exports,1063100,TW,Taiwan,17.5,,47.25,14.0,15.75,0.175,77.175,77.175
3,20.2,Exports,2031913,TW,Taiwan,2908.18,,7852.086,2326.544,2617.362,29.0818,12825.0738,12825.0738
4,20.2,Exports,2031990,TW,Taiwan,11.4,,30.78,9.12,10.26,0.114,50.274,50.274
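
Note that the one-line example also divides year by 100, because year is stored as a number. A sketch of an alternative using select_dtypes to list the numeric columns and then exclude year (the toy frame is made up; excluding year is an assumption about what is wanted):

```python
import pandas as pd

# Hypothetical toy frame using a few of the trade column names
toy = pd.DataFrame({
    "year": [2020, 2019],
    "flow": ["Exports", "Imports"],
    "value_gbp": [892.0, 932.0],
})

scaled = toy.copy()
# numeric column names, minus the ones that should not be rescaled
num_cols = scaled.select_dtypes(include="number").columns.drop("year")
scaled[num_cols] = scaled[num_cols] / 100
scaled.head()
```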


In [53]:
# update multiple columns at once:
cols = ["value_eur","value_usd"]
trade2[cols] = trade2[cols]*1000
trade2.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,new_col,value_usd,value_eur,new_col2,sum_col,sum_col2
0,2020,Exports,1012100,TW,Taiwan,892.0,,2408.4,713600.0,802800.0,8.92,3933.72,3933.72
1,2020,Exports,1062000,TW,Taiwan,14101.0,,38072.7,11280800.0,12690900.0,141.01,62185.41,62185.41
2,2020,Exports,1063100,TW,Taiwan,1750.0,,4725.0,1400000.0,1575000.0,17.5,7717.5,7717.5
3,2020,Exports,2031913,TW,Taiwan,290818.0,,785208.6,232654400.0,261736200.0,2908.18,1282507.38,1282507.38
4,2020,Exports,2031990,TW,Taiwan,1140.0,,3078.0,912000.0,1026000.0,11.4,5027.4,5027.4


### **Combine columns together:**

In [54]:
trade2 = trade.copy()
trade2["new_col"] = trade2["year"].map(str)+trade2["flow"] # use map(str) as year is numeric column
trade2["new_col2"] = trade2["country_code"]+" - "+trade2["country_name"]
trade2["commoidty_code2"] = "0"+trade2["commodity_code"]
trade2.tail()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,new_col,new_col2,commoidty_code2
41137,2019,Imports,94036090,ZM,Zambia,932.0,,2019Imports,ZM - Zambia,94036090
41138,2020,Imports,95030041,ZM,Zambia,3812.0,,2020Imports,ZM - Zambia,95030041
41139,2020,Imports,95030099,ZM,Zambia,3972.0,,2020Imports,ZM - Zambia,95030099
41140,2020,Imports,97050000,ZM,Zambia,2213.0,,2020Imports,ZM - Zambia,97050000
41141,2020,Imports,99209900,ZM,Zambia,25009.0,,2020Imports,ZM - Zambia,99209900
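
Adding "0" to every value prefixes all codes, including those that are already full length. If the aim is to restore leading zeros, str.zfill pads only the shorter codes. A minimal sketch, assuming the codes should be 8 characters long (that width and the toy values are assumptions):

```python
import pandas as pd

# Hypothetical commodity codes stored as strings, one missing its leading zero
toy = pd.DataFrame({"commodity_code": ["1012100", "94036090"]})

# pad with zeros on the left up to 8 characters; longer values are unchanged
toy["commodity_code_padded"] = toy["commodity_code"].str.zfill(8)
print(toy)
```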


### **Conditionally create columns**

There are two useful and simple ways to create and update columns using conditional logic:

In [56]:
trade2 = trade.copy()
trade2.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes
0,2020,Exports,1012100,TW,Taiwan,892.0,
1,2020,Exports,1062000,TW,Taiwan,14101.0,
2,2020,Exports,1063100,TW,Taiwan,1750.0,
3,2020,Exports,2031913,TW,Taiwan,290818.0,
4,2020,Exports,2031990,TW,Taiwan,1140.0,


**np.where**

In [58]:
# create column to indicate if value is greater than 10,000
trade2["value_flag"] = np.where(trade2["value_gbp"] > 10000,"Yes","No")
trade2.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,value_flag
0,2020,Exports,1012100,TW,Taiwan,892.0,,No
1,2020,Exports,1062000,TW,Taiwan,14101.0,,Yes
2,2020,Exports,1063100,TW,Taiwan,1750.0,,No
3,2020,Exports,2031913,TW,Taiwan,290818.0,,Yes
4,2020,Exports,2031990,TW,Taiwan,1140.0,,No


In [61]:
# nested np.where statement:
trade["value_flag"] = np.where(trade["value_gbp"] > 100000, "100k",
                               np.where(trade["value_gbp"] > 10000,"10k",
                                        np.where(trade["value_gbp"] > 1000, "1k","<1k"))) # the final argument is the fallback when no condition is met
trade

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,value_flag
0,2020,Exports,01012100,TW,Taiwan,892.00000,,<1k
1,2020,Exports,01062000,TW,Taiwan,14101.00000,,10k
2,2020,Exports,01063100,TW,Taiwan,1750.00000,,1k
3,2020,Exports,02031913,TW,Taiwan,290818.00000,,100k
4,2020,Exports,02031990,TW,Taiwan,1140.00000,,1k
...,...,...,...,...,...,...,...,...
41137,2019,Imports,94036090,ZM,Zambia,932.00000,,<1k
41138,2020,Imports,95030041,ZM,Zambia,3812.00000,,1k
41139,2020,Imports,95030099,ZM,Zambia,3972.00000,,1k
41140,2020,Imports,97050000,ZM,Zambia,2213.00000,,1k


In [70]:
# can use & or | operators inside np.where statements
# create flag if country = Taiwan and value is > 100k
# create flag if value is > 100k or less than 1k

# ** ensure each logical condition is wrapped in brackets ()
trade["example_flag"] = np.where((trade["value_gbp"] > 100000) & (trade["country_name"] == "Taiwan"),"Yes","no")
trade["example_flag2"] = np.where((trade["value_gbp"] > 100000) | (trade["value_gbp"] < 1000),"Yes","no")
trade

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,value_flag,example_flag,example_flag2
0,2020,Exports,01012100,TW,Taiwan,892.00000,,<1k,no,Yes
1,2020,Exports,01062000,TW,Taiwan,14101.00000,,10k,no,no
2,2020,Exports,01063100,TW,Taiwan,1750.00000,,1k,no,no
3,2020,Exports,02031913,TW,Taiwan,290818.00000,,100k,Yes,Yes
4,2020,Exports,02031990,TW,Taiwan,1140.00000,,1k,no,no
...,...,...,...,...,...,...,...,...,...,...
41137,2019,Imports,94036090,ZM,Zambia,932.00000,,<1k,no,Yes
41138,2020,Imports,95030041,ZM,Zambia,3812.00000,,1k,no,no
41139,2020,Imports,95030099,ZM,Zambia,3972.00000,,1k,no,no
41140,2020,Imports,97050000,ZM,Zambia,2213.00000,,1k,no,no


**np.select**

When dealing with multiple conditions, np.select can help you write cleaner, more concise code that is easier to read and follow.

With this method you specify your conditions and outcomes as lists (or arrays), then pass them to np.select to define the column.

In [78]:
conditions = [(trade["value_gbp"]>100000), (trade["value_gbp"] >10000), (trade["value_gbp"] >1000)]
choices = ["100k","10k","1k"]
trade['value_flag2'] = np.select(conditions, choices, default="<1k") # change default to 0 or any other value
trade

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,value_flag,example_flag,example_flag2,value_flag2
0,2020,Exports,01012100,TW,Taiwan,892.00000,,<1k,no,Yes,<1k
1,2020,Exports,01062000,TW,Taiwan,14101.00000,,10k,no,no,10k
2,2020,Exports,01063100,TW,Taiwan,1750.00000,,1k,no,no,1k
3,2020,Exports,02031913,TW,Taiwan,290818.00000,,100k,Yes,Yes,100k
4,2020,Exports,02031990,TW,Taiwan,1140.00000,,1k,no,no,1k
...,...,...,...,...,...,...,...,...,...,...,...
41137,2019,Imports,94036090,ZM,Zambia,932.00000,,<1k,no,Yes,<1k
41138,2020,Imports,95030041,ZM,Zambia,3812.00000,,1k,no,no,1k
41139,2020,Imports,95030099,ZM,Zambia,3972.00000,,1k,no,no,1k
41140,2020,Imports,97050000,ZM,Zambia,2213.00000,,1k,no,no,1k
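
np.select checks the conditions in the order they are listed and uses the first one that is True, so the most restrictive condition should come first. A minimal sketch on made-up values (the `demo` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"value_gbp": [500, 5000, 50000, 500000]})

# correct order: most restrictive condition first
conditions = [demo["value_gbp"] > 100000, demo["value_gbp"] > 10000, demo["value_gbp"] > 1000]
choices = ["100k", "10k", "1k"]
good = np.select(conditions, choices, default="<1k")

# reversed order: "> 1000" matches first and swallows the larger bands
bad = np.select(conditions[::-1], choices[::-1], default="<1k")

print(good)
print(bad)
```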


****

## 3. Filtering

Filter trade data for simple conditions like year or country name

In [8]:
# filter for country name = United States
df = trade.copy()
df = df.loc[df["country_name"] == "United States"]
df

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes
20956,2020,Exports,01012100,US,United States,16637203,
20957,2020,Exports,01012990,US,United States,1752234,
20958,2020,Exports,01019000,US,United States,54000,
20959,2020,Exports,01061100,US,United States,163530,
20960,2020,Exports,01061900,US,United States,80969,
...,...,...,...,...,...,...,...
33842,2020,Imports,97030000,US,United States,89441661,
33843,2020,Imports,97040000,US,United States,515297,
33844,2020,Imports,97050000,US,United States,154824702,
33845,2020,Imports,97060000,US,United States,66962179,


In [11]:
# you don't have to use loc, but I have grown accustomed to this method.
df = trade.copy()
df = df[df["country_name"] == "United States"]
df.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes
20956,2020,Exports,1012100,US,United States,16637203,
20957,2020,Exports,1012990,US,United States,1752234,
20958,2020,Exports,1019000,US,United States,54000,
20959,2020,Exports,1061100,US,United States,163530,
20960,2020,Exports,1061900,US,United States,80969,


In [14]:
# filter using operators:
# filter for United States, year and flow:
df = trade.copy()
df2 = df.loc[(df["country_name"]=="United States") & (df["year"] == 2020) & (df["flow"] == "Imports")]
df2.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes
27391,2020,Imports,1012100,US,United States,5271191,
27392,2020,Imports,1012990,US,United States,88494,
27393,2020,Imports,1013000,US,United States,2655,
27394,2020,Imports,1019000,US,United States,3275,
27395,2020,Imports,1051300,US,United States,316106,


In [16]:
# filter if value of trade is > 10000 or < 1000
df2 = df.loc[(df["value_gbp"] > 10000) | (df["value_gbp"] < 1000)]

**Important:** when creating multiple conditions ensure they are within brackets ()
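
The brackets matter because `&` and `|` bind more tightly than comparisons such as `>` and `<`. The sketch below, on a made-up `demo` frame, shows the bracketed filter working and the unbracketed version failing.

```python
import pandas as pd

demo = pd.DataFrame({"value_gbp": [500, 5000, 50000, 500000]})

# with brackets: each comparison is evaluated first, then combined with |
mask = (demo["value_gbp"] > 10000) | (demo["value_gbp"] < 1000)
print(demo[mask])

# without brackets: | is evaluated before > and <, so Python computes
# 10000 | demo["value_gbp"] first and the chained comparison then fails
try:
    demo["value_gbp"] > 10000 | demo["value_gbp"] < 1000
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)
```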

****

### Filter NAs

The UK TRQ data set has NAs throughout, which makes it a useful dataset to demonstrate NA filtering.

In [19]:
# utilise .info() for quick overview of Non-Null counts
uk_trqs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2848 entries, 0 to 2847
Data columns (total 17 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   quota_definition__sid                   2848 non-null   int64  
 1   quota__order_number                     2848 non-null   object 
 2   quota__geographical_areas               2830 non-null   object 
 3   quota__headings                         2848 non-null   object 
 4   quota__commodities                      2845 non-null   object 
 5   quota__measurement_unit                 2848 non-null   object 
 6   quota__monetary_unit                    0 non-null      float64
 7   quota_definition__description           0 non-null      float64
 8   quota_definition__validity_start_date   2848 non-null   object 
 9   quota_definition__validity_end_date     2848 non-null   object 
 10  quota_definition__suspension_periods    6 non-null      obje

In [20]:
# alternatively - quick simple sum of null values
uk_trqs.isnull().sum()

quota_definition__sid                        0
quota__order_number                          0
quota__geographical_areas                   18
quota__headings                              0
quota__commodities                           3
quota__measurement_unit                      0
quota__monetary_unit                      2848
quota_definition__description             2848
quota_definition__validity_start_date        0
quota_definition__validity_end_date          0
quota_definition__suspension_periods      2842
quota_definition__blocking_periods        2848
quota_definition__status                     0
quota_definition__last_allocation_date    2131
quota_definition__initial_volume             0
quota_definition__balance                 1093
quota_definition__fill_rate                  0
dtype: int64

In [23]:
# filter df for NAs in geographical areas column
na_df = uk_trqs[uk_trqs['quota__geographical_areas'].isnull()]
na_df.head(2)

Unnamed: 0,quota_definition__sid,quota__order_number,quota__geographical_areas,quota__headings,quota__commodities,quota__measurement_unit,quota__monetary_unit,quota_definition__description,quota_definition__validity_start_date,quota_definition__validity_end_date,quota_definition__suspension_periods,quota_definition__blocking_periods,quota_definition__status,quota_definition__last_allocation_date,quota_definition__initial_volume,quota_definition__balance,quota_definition__fill_rate
341,21857,50825,,0603 – Cut flowers and flower buds of a kind s...,603197000,Kilogram (kg),,,01/12/2021,31/12/2021,,,Closed,30/12/2021,4246,4246.0,0.0
342,21858,50825,,0603 – Cut flowers and flower buds of a kind s...,603197000,Kilogram (kg),,,01/01/2022,31/12/2022,,,Open,,50000,50000.0,0.0


In [26]:
# filter df for not NAs in geographical areas column
not_na_df = uk_trqs[~(uk_trqs['quota__geographical_areas'].isnull())]
not_na_df.head(2)

Unnamed: 0,quota_definition__sid,quota__order_number,quota__geographical_areas,quota__headings,quota__commodities,quota__measurement_unit,quota__monetary_unit,quota_definition__description,quota_definition__validity_start_date,quota_definition__validity_end_date,quota_definition__suspension_periods,quota_definition__blocking_periods,quota_definition__status,quota_definition__last_allocation_date,quota_definition__initial_volume,quota_definition__balance,quota_definition__fill_rate
0,20815,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,01/01/2021,14/02/2021,,,Closed,28/01/2021,2022900,2022900.0,0.0
1,20814,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2021,14/02/2022,,,Closed,,2112000,2112000.0,0.0
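
Instead of negating `isnull()` with `~`, pandas also offers `notna()`, which returns the inverse mask directly. A small sketch with made-up rows (the `demo` frame is illustrative only):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"quota__geographical_areas": ["ERGA OMNES", np.nan, "ERGA OMNES"]})

negated = demo[~demo["quota__geographical_areas"].isnull()]  # approach used above
direct = demo[demo["quota__geographical_areas"].notna()]     # same rows, no ~ needed

print(direct)
```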


**Drop NA columns**

In [29]:
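# drop columns whose NA count exceeds a threshold (here the number of columns, 17);
# columns with fewer NAs, e.g. quota__commodities (3), are kept: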
drop_na = uk_trqs.drop(uk_trqs.columns[uk_trqs.isna().sum()>len(uk_trqs.columns)],axis = 1)
drop_na.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2848 entries, 0 to 2847
Data columns (total 10 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   quota_definition__sid                  2848 non-null   int64  
 1   quota__order_number                    2848 non-null   object 
 2   quota__headings                        2848 non-null   object 
 3   quota__commodities                     2845 non-null   object 
 4   quota__measurement_unit                2848 non-null   object 
 5   quota_definition__validity_start_date  2848 non-null   object 
 6   quota_definition__validity_end_date    2848 non-null   object 
 7   quota_definition__status               2848 non-null   object 
 8   quota_definition__initial_volume       2848 non-null   int64  
 9   quota_definition__fill_rate            2848 non-null   float64
dtypes: float64(1), int64(2), object(7)
memory usage: 222.6+ KB


In [41]:
# via using a list:
na_cols = uk_trqs.columns[uk_trqs.isna().any()].tolist() # create list of columns with NAs in.
uk_trqs2 = uk_trqs[[col for col in uk_trqs.columns if col not in na_cols]]
uk_trqs2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2848 entries, 0 to 2847
Data columns (total 9 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   quota_definition__sid                  2848 non-null   int64  
 1   quota__order_number                    2848 non-null   object 
 2   quota__headings                        2848 non-null   object 
 3   quota__measurement_unit                2848 non-null   object 
 4   quota_definition__validity_start_date  2848 non-null   object 
 5   quota_definition__validity_end_date    2848 non-null   object 
 6   quota_definition__status               2848 non-null   object 
 7   quota_definition__initial_volume       2848 non-null   int64  
 8   quota_definition__fill_rate            2848 non-null   float64
dtypes: float64(1), int64(2), object(6)
memory usage: 200.4+ KB
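
The list-based approach above can likely be shortened with `dropna` itself. The sketch below uses a made-up `demo` frame (column names and values are assumptions) to show `how="any"` and the `thresh` argument.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "sid": [1, 2, 3],
    "commodities": ["0302", np.nan, "0305"],    # one NA
    "monetary_unit": [np.nan, np.nan, np.nan],  # all NA
})

# drop columns with at least one NA
any_na_dropped = demo.dropna(axis=1, how="any")
print(any_na_dropped.columns.tolist())

# thresh keeps columns with at least that many non-NA values
mostly_filled = demo.dropna(axis=1, thresh=2)
print(mostly_filled.columns.tolist())
```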


In [31]:
# drop columns which only contain NAs i.e. (quota__monetary_unit)
drop_na_cols = uk_trqs.dropna(axis=1, how='all') 
drop_na_cols.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2848 entries, 0 to 2847
Data columns (total 14 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   quota_definition__sid                   2848 non-null   int64  
 1   quota__order_number                     2848 non-null   object 
 2   quota__geographical_areas               2830 non-null   object 
 3   quota__headings                         2848 non-null   object 
 4   quota__commodities                      2845 non-null   object 
 5   quota__measurement_unit                 2848 non-null   object 
 6   quota_definition__validity_start_date   2848 non-null   object 
 7   quota_definition__validity_end_date     2848 non-null   object 
 8   quota_definition__suspension_periods    6 non-null      object 
 9   quota_definition__status                2848 non-null   object 
 10  quota_definition__last_allocation_date  717 non-null    obje

****

### Filter by arrays

Rather than using multiple OR operators you can use simple arrays to filter your data frame.

Example - filter the trade data for Thailand, Taiwan and United States. 

In [6]:
df = trade.copy()
df2 = df.loc[(df["country_name"] == "Taiwan") | (df["country_name"] == "Thailand") | (df["country_name"] == "United States")]
pd.unique(df2["country_name"])

array(['Taiwan', 'Thailand', 'United States'], dtype=object)

In [8]:
country_array = ["Taiwan","Thailand","United States"]
df2 = df[df["country_name"].isin(country_array)]
pd.unique(df2["country_name"])

array(['Taiwan', 'Thailand', 'United States'], dtype=object)

This method is strongly preferable when working with a larger number of values to filter by.

In [13]:
code_array = ["01012100","01062000","02031913","02031990","94036090"]
df2 = df[df["commodity_code"].isin(code_array)]
print(pd.unique(df2["commodity_code"]),df2.shape)

['01012100' '01062000' '02031913' '02031990' '94036090'] (40, 7)


In [14]:
# not in:
code_array = ["01012100","01062000","02031913","02031990","94036090"]
df2 = df[~(df["commodity_code"].isin(code_array))]
print(pd.unique(df2["commodity_code"]),df2.shape)

['01063100' '02032219' '02032290' ... '20089329' '86063000' '08093090'] (41102, 7)


In [19]:
# filter using values from another df's column:
df = trade.head(20)
df2 = trade.head(40)

In [21]:
# 20 unique codes:
code_filt = pd.unique(df["commodity_code"])
code_filt.shape

(20,)

In [23]:
print(pd.unique(df2["commodity_code"]).shape)
# 40 unique codes:

(40,)


In [25]:
# filter df2 using df will result in 20 codes:
df3 = df2[df2["commodity_code"].isin(code_filt)]
df3.shape

(20, 7)

****

### Filter across columns

Filter across columns if a value exists, i.e. any value column contains "0". tbc.

In [10]:
df = tariff.copy()
df.head()

Unnamed: 0,commodity_heading,commodity_code,x8_digit_or_10_digit,commodity_code_description,mfn_applied_duty_rate,preferential_applied_duty_rate_2021,preferential_applied_duty_rate_2022,preferential_applied_duty_rate_2023,preferential_applied_duty_rate_2024,quota_number,...,value_usd,cn8_count,tariff_status_2021,tariff_status_final_2021,tariff_status_2022,tariff_status_final_2022,tariff_status_2023,tariff_status_final_2023,tariff_status_2024,tariff_status_final_2024
0,01 - Live Animals,1012100,8,Pure-bred breeding horses,0%,0%,0%,0%,0%,,...,15123.15643,1,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero
1,01 - Live Animals,1012910,8,Horses for slaughter,0%,0%,0%,0%,0%,,...,,1,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero
2,01 - Live Animals,1012990,8,"Live horses (excl. for slaughter, pure-bred fo...",10%,0%,0%,0%,0%,,...,25331.97586,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
3,01 - Live Animals,1013000,8,Live asses,6%,0%,0%,0%,0%,,...,,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
4,01 - Live Animals,1019000,8,Live mules and hinnies,10%,0%,0%,0%,0%,,...,,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero


In [14]:
df.shape

(4826, 28)

In [46]:
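# filter rows where any column in df contains the text string:
# (groupby(level=0).any() replaces any(level=0), which newer pandas versions no longer accept)
df2 = df[df.stack().str.contains('10%', na=False).groupby(level=0).any()]
#df2 = df[df.stack().str.contains('7%', na=False).groupby(level=0).any()]
#df2 = df[df.stack().str.contains('Eggs', na=False).groupby(level=0).any()]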
df2

Unnamed: 0,commodity_heading,commodity_code,x8_digit_or_10_digit,commodity_code_description,mfn_applied_duty_rate,preferential_applied_duty_rate_2021,preferential_applied_duty_rate_2022,preferential_applied_duty_rate_2023,preferential_applied_duty_rate_2024,quota_number,...,value_usd,cn8_count,tariff_status_2021,tariff_status_final_2021,tariff_status_2022,tariff_status_final_2022,tariff_status_2023,tariff_status_final_2023,tariff_status_2024,tariff_status_final_2024
2,01 - Live Animals,1012990,8,"Live horses (excl. for slaughter, pure-bred fo...",10%,0%,0%,0%,0%,,...,25331.97586,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
4,01 - Live Animals,1019000,8,Live mules and hinnies,10%,0%,0%,0%,0%,,...,,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
411,"03 - Fish And Crustaceans, Molluscs And Other ...",3029100,8,"Fresh or chilled fish livers, roes and milt",10%,0%,0%,0%,0%,,...,,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
413,"03 - Fish And Crustaceans, Molluscs And Other ...",3029900,8,"Fresh or chilled fish fins, heads, tails, maws...",10%,0%,0%,0%,0%,,...,,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
507,"03 - Fish And Crustaceans, Molluscs And Other ...",3039190,8,"Frozen fish livers, roes and milt (excl. hard ...",10%,0%,0%,0%,0%,,...,986388.63031,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4729,48 - Paper And Paperboard; Articles Of Paper P...,48101300,8,"Paper and paperboard used for writing, printin...",0%,0%,0%,0%,0%,,...,29624.16550,1,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero
4730,48 - Paper And Paperboard; Articles Of Paper P...,48101400,8,"Paper and paperboard used for writing, printin...",0%,0%,0%,0%,0%,,...,88016.30949,1,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero
4731,48 - Paper And Paperboard; Articles Of Paper P...,48101900,8,"Paper and paperboard used for writing, printin...",0%,0%,0%,0%,0%,,...,62340.10734,1,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero
4733,48 - Paper And Paperboard; Articles Of Paper P...,48102930,8,"Paper and paperboard used for writing, printin...",0%,0%,0%,0%,0%,,...,,1,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero,mfn_zero


Alternatively, use applymap with lambda x and any (this will also work with non-text values):

In [48]:
df2 = df[df.applymap(lambda x: x == "10%").any(axis=1)]
df2

Unnamed: 0,commodity_heading,commodity_code,x8_digit_or_10_digit,commodity_code_description,mfn_applied_duty_rate,preferential_applied_duty_rate_2021,preferential_applied_duty_rate_2022,preferential_applied_duty_rate_2023,preferential_applied_duty_rate_2024,quota_number,...,value_usd,cn8_count,tariff_status_2021,tariff_status_final_2021,tariff_status_2022,tariff_status_final_2022,tariff_status_2023,tariff_status_final_2023,tariff_status_2024,tariff_status_final_2024
2,01 - Live Animals,1012990,8,"Live horses (excl. for slaughter, pure-bred fo...",10%,0%,0%,0%,0%,,...,25331.97586,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
4,01 - Live Animals,1019000,8,Live mules and hinnies,10%,0%,0%,0%,0%,,...,,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
411,"03 - Fish And Crustaceans, Molluscs And Other ...",3029100,8,"Fresh or chilled fish livers, roes and milt",10%,0%,0%,0%,0%,,...,,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
413,"03 - Fish And Crustaceans, Molluscs And Other ...",3029900,8,"Fresh or chilled fish fins, heads, tails, maws...",10%,0%,0%,0%,0%,,...,,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
507,"03 - Fish And Crustaceans, Molluscs And Other ...",3039190,8,"Frozen fish livers, roes and milt (excl. hard ...",10%,0%,0%,0%,0%,,...,986388.63031,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4544,44 - Wood And Articles Of Wood; Wood Charcoal,44123110,8,Plywood consisting solely of sheets of wood <=...,10%,0%,0%,0%,0%,,...,,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
4549,44 - Wood And Articles Of Wood; Wood Charcoal,44129410,8,Laminated wood with at least one outer ply of ...,10%,0%,0%,0%,0%,,...,,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
4552,44 - Wood And Articles Of Wood; Wood Charcoal,44129940,8,Veneered panels and similar laminated wood wit...,10%,0%,0%,0%,0%,,...,,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
4553,44 - Wood And Articles Of Wood; Wood Charcoal,44129950,8,Veneered panels and similar laminated wood wit...,10%,0%,0%,0%,0%,,...,10997.57965,1,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero,pref_zero
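
For an exact match across all columns, a lambda is not strictly needed: `DataFrame.eq` compares every cell with a value in one call. The sketch below is a minimal illustration on a made-up `demo` frame (values are assumptions, loosely modelled on the tariff columns).

```python
import pandas as pd

demo = pd.DataFrame({
    "commodity_code": [1012100, 1012990, 1013000],
    "mfn_applied_duty_rate": ["0%", "10%", "6%"],
    "preferential_applied_duty_rate_2021": ["0%", "0%", "10%"],
})

# eq compares every cell with the value; any(axis=1) flags rows with at least one match
matches = demo[demo.eq("10%").any(axis=1)]
print(matches)
```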


Filter select columns rather than all df columns:

In [76]:
# select columns to filter across. can use iloc, name columns or use string recognition:
#1
col_names = df.iloc[:,[5,6,7,8]]
col_names = col_names.columns
col_names
#2
col_names = ["preferential_applied_duty_rate_2021","preferential_applied_duty_rate_2022","preferential_applied_duty_rate_2023","preferential_applied_duty_rate_2024"]
#3
col_names = np.array(tariff.columns[tariff.columns.str.contains(pat = "pref") & tariff.columns.str.contains('20',regex=True)])
col_names

array(['preferential_applied_duty_rate_2021',
       'preferential_applied_duty_rate_2022',
       'preferential_applied_duty_rate_2023',
       'preferential_applied_duty_rate_2024'], dtype=object)

In [91]:
# filter rows where pref tariff == x in any of the selected columns
df2 = df[(df[col_names] == "2%").any(axis=1)]
df2

Unnamed: 0,commodity_heading,commodity_code,x8_digit_or_10_digit,commodity_code_description,mfn_applied_duty_rate,preferential_applied_duty_rate_2021,preferential_applied_duty_rate_2022,preferential_applied_duty_rate_2023,preferential_applied_duty_rate_2024,quota_number,...,value_usd,cn8_count,tariff_status_2021,tariff_status_final_2021,tariff_status_2022,tariff_status_final_2022,tariff_status_2023,tariff_status_final_2023,tariff_status_2024,tariff_status_final_2024
690,"03 - Fish And Crustaceans, Molluscs And Other ...",306931010,10,"Smoked, whether in shell or not, whether or no...",6%,3%,2%,1%,0%,,...,,2,dutiable,partially_dutiable,dutiable,partially_dutiable,dutiable,partially_dutiable,pref_zero,not_dutiable
691,"03 - Fish And Crustaceans, Molluscs And Other ...",306939010,10,"Smoked, whether in shell or not, whether or no...",6%,3%,2%,1%,0%,,...,,2,dutiable,partially_dutiable,dutiable,partially_dutiable,dutiable,partially_dutiable,pref_zero,not_dutiable
1721,"16 - Preparations Of Meat, Of Fish Or Of Crust...",16051000,8,"Crab, prepared or preserved (excl. smoked)",8%,3%,2%,1%,0%,,...,1522.28417,1,dutiable,dutiable,dutiable,dutiable,dutiable,dutiable,pref_zero,pref_zero


****

## 4. Aggregations

Grouping and aggregating data is one of the most common transformations I and our department carry out. We extract and clean large amounts of data, then aggregate it into more actionable outputs for teams. Groupby is essential and straightforward for aggregations.

I will demonstrate aggregation using the trade data set, which is rich and useful because there are multiple ways to group and summarise it.

In [41]:
df = trade.copy()
df["value_gbp2"] = df["value_gbp"]*10

In [42]:
# group by year - sum total value. Notice the difference when as_index is set to False:
df_agg  = df.groupby(["year"])["value_gbp"].sum()
df_agg

year
2019    13419136018
2020    95972770641
Name: value_gbp, dtype: int64

In [43]:
df_agg2 = df.groupby(["year"], as_index = False)["value_gbp"].sum()
df_agg2

Unnamed: 0,year,value_gbp
0,2019,13419136018
1,2020,95972770641
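
The two results above differ only in where `year` lives: in the index or in a column. The sketch below, on a made-up `demo` frame, shows that calling `reset_index()` on the indexed result gives the same flat layout as `as_index = False`.

```python
import pandas as pd

demo = pd.DataFrame({"year": [2019, 2019, 2020], "value_gbp": [10, 20, 5]})

as_series = demo.groupby("year")["value_gbp"].sum()                 # year is the index
as_frame = demo.groupby("year", as_index=False)["value_gbp"].sum()  # year is a column

# reset_index() turns the indexed result into the same flat frame
flattened = as_series.reset_index()
print(flattened)
```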


In [44]:
# group by using count and mean
df_agg = df.groupby(["year"])["value_gbp"].count()
# df_agg = df.groupby(["year"])["value_gbp"].mean()
df_agg

year
2019     2973
2020    38169
Name: value_gbp, dtype: int64

In [45]:
# multiple calculations of same column:
df_agg = df.groupby(["year"]).agg({"value_gbp": ["sum","mean","count","max","min"]})
df_agg

Unnamed: 0_level_0,value_gbp,value_gbp,value_gbp,value_gbp,value_gbp
Unnamed: 0_level_1,sum,mean,count,max,min
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
2019,13419136018,4513668.35452,2973,4640828469,50
2020,95972770641,2514416.69001,38169,8963450144,4


In [46]:
# Multiple grouping for year and flow
df_agg = df.groupby(["year","flow"]).agg({"value_gbp": ["sum","mean","count","max","min"]})
df_agg

Unnamed: 0_level_0,Unnamed: 1_level_0,value_gbp,value_gbp,value_gbp,value_gbp,value_gbp
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,count,max,min
year,flow,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2019,Exports,399852360,281586.16901,1420,67777232,567
2019,Imports,13019283658,8383312.07856,1553,4640828469,50
2020,Exports,48254392304,2064182.41451,23377,1952956674,5
2020,Imports,47718378337,3225958.51386,14792,8963450144,4
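
Passing a list of functions to `agg` produces two-level column headers, as in the outputs above. If flat column names are preferred, named aggregation is one option. The sketch below uses a made-up `demo` frame, and the output names (`total_gbp`, `mean_gbp`, `n_rows`) are illustrative choices.

```python
import pandas as pd

demo = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020],
    "flow": ["Exports", "Imports", "Exports", "Exports"],
    "value_gbp": [100, 300, 50, 150],
})

# named aggregation: new_column=(source_column, function)
summary = demo.groupby(["year", "flow"], as_index=False).agg(
    total_gbp=("value_gbp", "sum"),
    mean_gbp=("value_gbp", "mean"),
    n_rows=("value_gbp", "count"),
)
print(summary)
```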


In [48]:
# separate aggregate calculations per column:
df_agg = df.groupby(["year","flow"], as_index = False).agg({"value_gbp":"sum","value_gbp2":"mean"})
df_agg

Unnamed: 0,year,flow,value_gbp,value_gbp2
0,2019,Exports,399852360,2815861.69014
1,2019,Imports,13019283658,83833120.78558
2,2020,Exports,48254392304,20641824.1451
3,2020,Imports,47718378337,32259585.13859


****

### Conditional aggregations (similar to SUMIF in Excel)

In [None]:
# Calculate total trade values for each year and trade flow for America:

In [59]:
df_agg = df.groupby(["year","flow"]).apply(lambda x: x[x['country_name'] == 'United States']['value_gbp'].sum())
df_agg

year  flow   
2019  Exports              0
      Imports     7982756283
2020  Exports    44699521286
      Imports    37221280504
dtype: int64
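
Another way to get a similar result is to filter first and then group. The difference is that groups with no matching rows disappear instead of showing 0, unless they are added back. A minimal sketch on a made-up `demo` frame:

```python
import pandas as pd

demo = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020],
    "flow": ["Exports", "Imports", "Exports", "Imports"],
    "country_name": ["Taiwan", "United States", "United States", "United States"],
    "value_gbp": [10, 20, 30, 40],
})

us_only = demo[demo["country_name"] == "United States"]
us_totals = us_only.groupby(["year", "flow"])["value_gbp"].sum()
print(us_totals)  # no (2019, Exports) row, because no US rows fall in that group

# reindex against all year/flow pairs to restore the zero-value groups
all_groups = demo.groupby(["year", "flow"]).size().index
us_totals_full = us_totals.reindex(all_groups, fill_value=0)
print(us_totals_full)
```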

In [66]:
# alternative way using assign and numpy:
df_agg = df.assign(
    us1 = np.where(df["country_name"]=="United States",df.value_gbp,0),
    us2 = np.where(df["country_name"]=="United States",df.value_gbp2,0)
   ).groupby("year").agg({"us1":"sum","us2":"mean"})

df_agg

Unnamed: 0_level_0,us1,us2
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2019,7982756283,26850845.21695
2020,81920801790,21462653.40722
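
One thing to watch with this approach: rows for other countries are set to 0, so a mean such as `us2` is averaged over every row in the year, not only the United States rows. If a mean over matching rows only is wanted, one option is to use `np.nan` as the fallback, since `mean` skips NaN. The sketch below uses a made-up `demo` frame, and the column names `us_zero` and `us_nan` are illustrative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "year": [2020, 2020, 2020],
    "country_name": ["United States", "United States", "Taiwan"],
    "value_gbp": [100, 200, 999],
})
is_us = demo["country_name"] == "United States"

result = demo.assign(
    us_zero=np.where(is_us, demo["value_gbp"], 0),      # non-US rows count as 0
    us_nan=np.where(is_us, demo["value_gbp"], np.nan),  # non-US rows are ignored
).groupby("year").agg({"us_zero": "mean", "us_nan": "mean"})
print(result)
```

Sums are unaffected by the choice of fallback; only means and counts change.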


This method is handy if you want to conditionally aggregate specific countries into a wider dataframe. For example, what are the yearly trade values of Taiwan and Thailand by flow:

In [68]:
df_agg = df.assign(
    thailand = np.where(df["country_name"]=="Thailand",df.value_gbp,0),
    taiwan = np.where(df["country_name"]=="Taiwan",df.value_gbp,0)
   ).groupby(["year","flow"],as_index=False).agg({"thailand":"sum","taiwan":"sum"})

df_agg

Unnamed: 0,year,flow,thailand,taiwan
0,2019,Exports,0,208299148
1,2019,Imports,0,153283679
2,2020,Exports,1161053338,947693378
3,2020,Imports,2564820816,3160904392


****

### Unnest equivalent

Split a cell of values joined by a delimiter into separate rows inside a df.

Load the uk_trqs data, where commodity codes are concatenated together in one column separated by a delimiter.

In [80]:
uk_trqs = pd.read_csv("data/uk_trqs.csv",dtype={'quota__order_number': str}) # load csv
uk_trqs.head()

Unnamed: 0,quota_definition__sid,quota__order_number,quota__geographical_areas,quota__headings,quota__commodities,quota__measurement_unit,quota__monetary_unit,quota_definition__description,quota_definition__validity_start_date,quota_definition__validity_end_date,quota_definition__suspension_periods,quota_definition__blocking_periods,quota_definition__status,quota_definition__last_allocation_date,quota_definition__initial_volume,quota_definition__balance,quota_definition__fill_rate
0,20815,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,01/01/2021,14/02/2021,,,Closed,28/01/2021,2022900,2022900.0,0.0
1,20814,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2021,14/02/2022,,,Closed,,2112000,2112000.0,0.0
2,21865,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2022,14/02/2023,,,Future,,2112000,,0.0
3,20816,50007,ERGA OMNES,0305 –,0305511010|0305511020|0305519010|0305519020|03...,Kilogram (kg),,,01/01/2021,31/12/2021,,,Closed,30/12/2021,2000,5093.1,0.0
4,21866,50007,ERGA OMNES,0305 –,0305511010|0305511020|0305519010|0305519020|03...,Kilogram (kg),,,01/01/2022,31/12/2022,,,Critical,28/02/2022,2000,106.696,0.94665


In [85]:
# select the column to group by and the column to split. In this instance we have a quota-level data frame, so we select the quota order number and commodity codes.
df = uk_trqs[["quota__order_number","quota__commodities"]]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2848 entries, 0 to 2847
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   quota__order_number  2848 non-null   object
 1   quota__commodities   2845 non-null   object
dtypes: object(2)
memory usage: 44.6+ KB


In [86]:
# NOTE the below unnest steps won't work if NaN present in data:
# remove NaN values. 
df = df.loc[~df["quota__commodities"].isnull()]

Unnamed: 0,quota__order_number,quota__commodities
0,50006,0302410000|0303510000|0304595000|0304599010|03...
1,50006,0302410000|0303510000|0304595000|0304599010|03...
2,50006,0302410000|0303510000|0304595000|0304599010|03...
3,50007,0305511010|0305511020|0305519010|0305519020|03...
4,50007,0305511010|0305511020|0305519010|0305519020|03...
...,...,...
2843,59281,0202100015|0202100099|0202201015|0202201099|02...
2844,59281,0202100015|0202100099|0202201015|0202201099|02...
2845,59282,0203121100|0203121900|0203191100|0203191300|02...
2846,59282,0203121100|0203121900|0203191100|0203191300|02...


In [88]:
# following steps to split out each cell within delimiters and create new row:
new_df = pd.DataFrame(df.quota__commodities.str.split('|').tolist(), index=df.quota__order_number).stack()
new_df = new_df.reset_index([0, 'quota__order_number'])
new_df.columns = ['quota__order_number', 'quota__commodities']
new_df['quota__order_number'] = new_df[ 'quota__order_number'].str.strip() # remove whitespace
new_df

Unnamed: 0,quota__order_number,quota__commodities
0,50006,0302410000
1,50006,0303510000
2,50006,0304595000
3,50006,0304599010
4,50006,0304992300
...,...,...
21407,59282,0203295900
21408,59282,0210111100
21409,59282,0210111900
21410,59282,0210113100
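
In recent pandas versions the same unnesting can probably be done more directly with `str.split` followed by `explode`. The sketch below uses a two-row made-up `demo` frame rather than the full uk_trqs data.

```python
import pandas as pd

demo = pd.DataFrame({
    "quota__order_number": ["50006", "50007"],
    "quota__commodities": ["0302410000|0303510000", "0305511010"],
})

unnested = (
    demo.assign(quota__commodities=demo["quota__commodities"].str.split("|"))  # string -> list
        .explode("quota__commodities")  # one row per list element
        .reset_index(drop=True)
)
print(unnested)
```

The other columns are repeated for each new row, so the quota order number stays attached to every commodity code.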


****

## 5. Joins

In [None]:
# simple left join using dfs with unique rows with simple one to one relationship:

In [106]:
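# numeric_only=True keeps only the numeric columns (newer pandas versions need this when text columns are present)
df = trade.groupby("country_name").mean(numeric_only=True)
df2 = trade.groupby("country_name").sum(numeric_only=True)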

In [107]:
# two dataframes with same index, can join using index
df3 = pd.merge(df,df2, left_index = True, right_index = True)
df3.head()

Unnamed: 0_level_0,year_x,value_gbp_x,suppression_notes_x,year_y,value_gbp_y,suppression_notes_y
country_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Taiwan,2019.88456,694558.82489,,12999977,4470180597.0,0.0
Tajikistan,2020.0,16364.82258,,125240,1014619.0,0.0
Tanzania (United Republic of),2020.0,103189.5672,,2254320,115159557.0,0.0
Thailand,2020.0,666644.14994,,11289780,3725874154.0,0.0
Timor-Leste,2020.0,19510.93103,,58580,565817.0,0.0


In [111]:
# concat uses an outer join by default. axis = 1 combines columns side by side; axis = 0 stacks rows.
df3 = pd.concat([df2,df2], axis = 1)
df3

Unnamed: 0,country_name,year,value_gbp,suppression_notes,country_name.1,year.1,value_gbp.1,suppression_notes.1
0,Taiwan,12999977,4470180597.0,0.0,Taiwan,12999977,4470180597.0,0.0
1,Tajikistan,125240,1014619.0,0.0,Tajikistan,125240,1014619.0,0.0
2,Tanzania (United Republic of),2254320,115159557.0,0.0,Tanzania (United Republic of),2254320,115159557.0,0.0
3,Thailand,11289780,3725874154.0,0.0,Thailand,11289780,3725874154.0,0.0
4,Timor-Leste,58580,565817.0,0.0,Timor-Leste,58580,565817.0,0.0
5,Tonga,48480,472700.0,0.0,Tonga,48480,472700.0,0.0
6,Trinidad and Tobago,3203720,408306612.0,0.0,Trinidad and Tobago,3203720,408306612.0,0.0
7,Tunisia,3464277,308408720.0,0.0,Tunisia,3464277,308408720.0,0.0
8,Turkmenistan,606000,14082988.0,0.0,Turkmenistan,606000,14082988.0,0.0
9,Tuvalu,2020,1381.0,0.0,Tuvalu,2020,1381.0,0.0


In [113]:
# you can combine multiple dfs together using concat:
df4 = pd.concat([df,df2,df3], axis = 1) # again note - to bind rows together, change axis to 0.
df4.head()

Unnamed: 0,country_name,year,value_gbp,suppression_notes,country_name.1,year.1,value_gbp.1,suppression_notes.1,country_name.2,year.2,value_gbp.2,suppression_notes.2,country_name.3,year.3,value_gbp.3,suppression_notes.3
0,Taiwan,2019.88456,694558.82489,,Taiwan,12999977,4470180597.0,0.0,Taiwan,12999977,4470180597.0,0.0,Taiwan,12999977,4470180597.0,0.0
1,Tajikistan,2020.0,16364.82258,,Tajikistan,125240,1014619.0,0.0,Tajikistan,125240,1014619.0,0.0,Tajikistan,125240,1014619.0,0.0
2,Tanzania (United Republic of),2020.0,103189.5672,,Tanzania (United Republic of),2254320,115159557.0,0.0,Tanzania (United Republic of),2254320,115159557.0,0.0,Tanzania (United Republic of),2254320,115159557.0,0.0
3,Thailand,2020.0,666644.14994,,Thailand,11289780,3725874154.0,0.0,Thailand,11289780,3725874154.0,0.0,Thailand,11289780,3725874154.0,0.0
4,Timor-Leste,2020.0,19510.93103,,Timor-Leste,58580,565817.0,0.0,Timor-Leste,58580,565817.0,0.0,Timor-Leste,58580,565817.0,0.0


In [110]:
# merge not using index:
df = trade.groupby("country_name", as_index = False).mean()
df2 = trade.groupby("country_name", as_index = False).sum()
df3 = pd.merge(df,df2, on = "country_name", how = "left")
df3

Unnamed: 0,country_name,year_x,value_gbp_x,suppression_notes_x,year_y,value_gbp_y,suppression_notes_y
0,Taiwan,2019.88456,694558.82489,,12999977,4470180597.0,0.0
1,Tajikistan,2020.0,16364.82258,,125240,1014619.0,0.0
2,Tanzania (United Republic of),2020.0,103189.5672,,2254320,115159557.0,0.0
3,Thailand,2020.0,666644.14994,,11289780,3725874154.0,0.0
4,Timor-Leste,2020.0,19510.93103,,58580,565817.0,0.0
5,Tonga,2020.0,19695.83333,,48480,472700.0,0.0
6,Trinidad and Tobago,2020.0,257444.26986,,3203720,408306612.0,0.0
7,Tunisia,2019.98659,179830.15743,,3464277,308408720.0,0.0
8,Turkmenistan,2020.0,46943.29333,,606000,14082988.0,0.0
9,Tuvalu,2020.0,1381.0,,2020,1381.0,0.0


**NOTE:** when the join columns have different names in the two dataframes, use "left_on" and "right_on" instead of "on".
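
A minimal sketch of this note, using two small made-up frames (`sales` and `lookup` are hypothetical names and values, not part of the trade data) whose key columns are named differently:

```python
import pandas as pd

# Hypothetical toy frames: the join key is "country_name" on the left
# and "country" on the right.
sales = pd.DataFrame({"country_name": ["Taiwan", "Thailand", "Tonga"],
                      "value_gbp": [100.0, 200.0, 300.0]})
lookup = pd.DataFrame({"country": ["Taiwan", "Thailand"],
                       "region": ["Asia", "Asia"]})

joined = pd.merge(sales, lookup,
                  left_on="country_name", right_on="country",
                  how="left")

# Both key columns are kept after the merge, so drop the right-hand one
# if it is not needed.
joined = joined.drop(columns="country")
print(joined)
```

Rows without a match on the right (Tonga here) are kept by the left join, with NaN in the columns that come from `lookup`.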

```python

# By default:

# df.join   is a column-wise left join (on the index)
# pd.merge  is a column-wise inner join
# pd.concat is a row-wise outer join (axis = 0)

```
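
The defaults can be checked on a pair of tiny frames. This is a sketch only: `left` and `right` are hypothetical frames that share a single index label, so each method keeps a different set of rows.

```python
import pandas as pd

# Hypothetical toy frames sharing only the index label "b".
left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"y": [3, 4]}, index=["b", "c"])

joined = left.join(right)                                          # left join on the index
merged = pd.merge(left, right, left_index=True, right_index=True)  # inner join
stacked = pd.concat([left, right])                                 # rows stacked, outer join on columns
side_by_side = pd.concat([left, right], axis=1)                    # columns bound, outer join on index

print(joined.shape, merged.shape, stacked.shape, side_by_side.shape)
```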

****

## Glossary

### 0. Set up

```python

# simple uploads

import pandas as pd
import numpy as np

pd.read_excel('filepath') 
pd.read_csv('filepath')

pd.read_excel('filepath', dtype={'column': str}) # convert "column" to string when uploading data
pd.read_excel("filepath",dtype=str) # convert all columns to string

```

```python
# simple data exploration

df.dtypes # column types
df.info() # dataframe info: column types and non-null counts per column.
df.shape # shape of df, i.e. number of rows, columns. 
df.describe() # summarise numerical values

df.head() 
df.tail()

```

```python
# clean column names

df.columns = df.columns.str.lower().str.replace(" ","_")

```

### 1. Select columns

```python

# basic selection 

df[["col1","col2","col3"]] # ensure double square brackets [[]]

# selection using array

array = ["col1","col2","col3"]
df[array]

# drop columns

df.drop(["col2","col3"], axis = 1) # axis must be passed by keyword in pandas 2.0+

```

#### 1a select using pattern recognition

```python

pattern_col = [col for col in df.columns if 'pattern' in col]
pattern_col2 = [col for col in df.columns if 'pattern2' in col]

# combine using np.concatenate to filter df:

colNames =  np.concatenate((pattern_col, pattern_col2))
new_df = df[colNames]

```

#### 1b select columns with numerical values

```python

cols = np.array(df.columns[df.columns.str.contains('.*[0-9].*', regex=True)]) # select columns containing any numerical character

# pattern using endswith and startswith
# (these match literal strings, not regex, so pass a tuple of digits)
digits = tuple("0123456789")
cols = [col for col in df.columns if col.endswith(digits)]
cols = [col for col in df.columns if col.startswith(digits)]


# combine numerical pattern recognition with string

col_list = [col for col in df.columns if (col.startswith('pattern') and col.endswith("2"))]

col_list = np.array(df.columns[df.columns.str.contains(pat = "pref") & df.columns.str.contains('20',regex=True)])

```
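
As a runnable sketch of the patterns above, the regex anchors `^` and `$` give the "starts with a digit" and "ends with a digit" selections directly. The column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical column names, for illustration only.
df = pd.DataFrame(columns=["country", "pref_2019", "pref_2020", "mfn_2020", "2021_total"])

has_digit = list(df.columns[df.columns.str.contains(r"[0-9]")])      # digit anywhere
ends_digit = list(df.columns[df.columns.str.contains(r"[0-9]$")])    # digit at the end
starts_digit = list(df.columns[df.columns.str.contains(r"^[0-9]")])  # digit at the start

# combine a string prefix with a numerical pattern
pref_20 = [col for col in df.columns if col.startswith("pref") and "20" in col]

print(has_digit, ends_digit, starts_digit, pref_20, sep="\n")
```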

#### 1c relocate columns

```python

col = df["col1"]
# drop column in df
df.drop(labels=["col1"], axis = 1, inplace = True)
df.insert(3,"col1",col) # 3 is column position (change to index number you want)

# move multiple columns to start or end of df:

cols_to_move = ["col1","col2","col3"]

df2 = df[cols_to_move + [ col for col in df.columns if col not in cols_to_move ]] # move to start
df2 = df[[ col for col in df.columns if col not in cols_to_move ]+cols_to_move] # move to end
```
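
A short sketch of the same idea on a toy frame (column names `a` to `d` are hypothetical). `DataFrame.pop` removes a column and returns it, so the drop and insert steps can be written as one line.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})

# pop removes "d" and returns it; insert places it back at position 0.
df.insert(0, "d", df.pop("d"))
order_after_insert = list(df.columns)

# move several columns to the end using list concatenation
cols_to_move = ["d", "a"]
at_end = df[[col for col in df.columns if col not in cols_to_move] + cols_to_move]

print(order_after_insert, list(at_end.columns))
```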

### 2. Create new columns

##### simple creation of columns

```python
# new columns are simple to create:

df["new_col"] = 10 # value in all cells
df["new_col"] = df["value_col"]*10
df["new_col"] = df["value_col1"] + df["value_col2"]

# sum across rows:

sum_cols = ["col1","col2","col3","col4"]
df["sum_col"] = df[sum_cols].sum(axis=1)

```

##### update numerical columns:

```python
# lambda-defined function (example: dividing numeric columns by 100):

df = df.apply(lambda x: x/100 if np.issubdtype(x.dtype, np.number) else x)

```

##### update multiple defined columns at once:

```python
cols = ["col1","col2","col3"]
df[cols] = df[cols]*1000

```

##### combine columns together

```python
# equivalent to using paste in R - concatenate columns together

# use "+"

df["new_col"] = df["value_col"].map(str)+df["col2"] # use map(str) because value_col is a numeric column
df["new_col"] = df["col1"]+" - "+df["col2"] # create string combining two columns, separated by " - "
df["new_col"] = "0"+df["col1"] # combine simple string with column

```
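
A small worked example of the same pattern, on made-up data. `astype(str)` is used here in place of `map(str)`. Both convert a numeric column to strings so it can be joined with `+`.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"country": ["Taiwan", "Tonga"], "year": [2019, 2020]})

# numeric columns must be converted to string before concatenating
df["label"] = df["country"] + " - " + df["year"].astype(str)
df["padded"] = "0" + df["year"].astype(str)

print(df)
```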

#### conditionally create columns

##### np.where

```python
# equivalent to using mutate combined with ifelse in R - something used commonly.


# np.where

df["new_col"] = np.where(df["value_col"] > 10000,"Yes","No")

# nested np.where statement:

df["value_flag"] = np.where(df["value_col"] > 100000, "100k",
                               np.where(df["value_col"] > 10000,"10k",
                                        np.where(df["value_col"] > 1000, "1k","<1k")))

# using logical operators:
# ensure each condition is inside a bracket ()
df["example_flag"] = np.where((df["value_col"] > 100000) & (df["col"] == "Taiwan"),"Yes","no")
df["example_flag2"] = np.where((df["value_col"] > 100000) | (df["value_col"] < 1000),"Yes","no")

```

##### np.select

np.select is very useful for writing more concise and cleaner code when there are multiple conditions. Conditions are checked in order, and the first one that is true sets the value.

```python
conditions = [(df["value_col"]>100000), (df["value_col"] >10000), (df["value_col"] >1000)]
choices = ["100k","10k","1k"]
df["value_flag"] = np.select(conditions, choices, default="<1k") # write to a new column so value_col is not overwritten; change default to any value of the same type as the choices

```
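
To illustrate, the sketch below runs the nested `np.where` and the `np.select` version on the same toy values (made up for this example) and checks that they give the same flags.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, one per band.
df = pd.DataFrame({"value_col": [500, 5_000, 50_000, 500_000]})

nested = np.where(df["value_col"] > 100000, "100k",
                  np.where(df["value_col"] > 10000, "10k",
                           np.where(df["value_col"] > 1000, "1k", "<1k")))

# conditions are evaluated in order, so list the strictest one first
conditions = [df["value_col"] > 100000, df["value_col"] > 10000, df["value_col"] > 1000]
choices = ["100k", "10k", "1k"]
df["value_flag"] = np.select(conditions, choices, default="<1k")

print(df)
print((nested == df["value_flag"]).all())
```

If the conditions were listed from loosest to strictest, every value above 1000 would be flagged "1k", because the first matching condition wins.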

### 3. Filtering

##### basic filtering



```python

# there are multiple ways to filter a dataframe. The only two I use for simple filtering are:

df = df[df["col"] == "condition"]
df = df.loc[df["col"] == "condition"]

```

```python 

# filter using operators (ensure use of brackets () )

df = df.loc[(df["col1"] == "condition") & (df["col2"] == "condition2")]

df = df.loc[(df["val1"] > 100) | (df["val2"] < 10)]
```

##### filter NAs

```python

# filter where col is NaN
na_df = df[df["col"].isnull()]

# filter where col is NOT NaN

not_na_df = df[~(df["col"].isnull())]
```

##### Drop columns with NaN

```python
drop_na = df.drop(df.columns[df.isna().sum() > 0], axis = 1) # drop every column containing at least one NaN

# OR

na_cols = df.columns[df.isna().any()].tolist() # create list of columns with NAs in.
drop_na = df[[col for col in df.columns if col not in na_cols]]


# drop columns which only contain NAs 
drop_na_cols = df.dropna(axis=1, how='all') 

```
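
`dropna` covers the common cases on its own. The sketch below, on a made-up frame, compares dropping columns with any NaN, columns that are entirely NaN, and columns with fewer than a chosen number of non-missing values (the `thresh` parameter).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per column.
df = pd.DataFrame({"full": [1, 2, 3, 4],
                   "some_na": [1.0, np.nan, 3.0, 4.0],
                   "mostly_na": [np.nan, np.nan, np.nan, 4.0],
                   "all_na": [np.nan, np.nan, np.nan, np.nan]})

no_na_cols = df.dropna(axis=1)             # drop columns with any NaN
not_all_na = df.dropna(axis=1, how="all")  # drop columns that are entirely NaN
mostly_full = df.dropna(axis=1, thresh=3)  # keep columns with at least 3 non-NaN values

print(list(no_na_cols.columns), list(not_all_na.columns), list(mostly_full.columns))
```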

##### filter by arrays

```python

array = ["value1","value2","value3"]
df2 = df[df["value_col"].isin(array)]

# not in
array = ["value1","value2","value3"]
df2 = df[~(df["value_col"].isin(array))]

```

##### filter across columns

```python
# filter rows where any column contains the string:
# (the level argument of .any() was removed in pandas 2.0, so group by the row index instead)

df = df[df.stack().str.contains('string', na=False).groupby(level=0).any()]

# OR (exact match rather than contains)

df = df[(df == "string").any(axis=1)]

# filter across selected columns:

columns_to_filt = ["col1", "col2", "col3", "col4"]

df2 = df[(df[columns_to_filt] == "condition").any(axis=1)]

```
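
The difference between an exact match and a "contains" match across columns can be seen on a toy frame. The data and the `str_cols` list are hypothetical. Restricting the search to named text columns avoids calling `.str` on numeric columns.

```python
import pandas as pd

# Hypothetical toy data: two text columns and one numeric column.
df = pd.DataFrame({"col1": ["apple", "pear", "fig"],
                   "col2": ["kiwi", "apple pie", "plum"],
                   "n": [1, 2, 3]})

str_cols = ["col1", "col2"]

# rows where any selected column equals the string exactly
exact = df[(df[str_cols] == "apple").any(axis=1)]

# rows where any selected column contains the string
contains = df[df[str_cols].apply(lambda col: col.str.contains("apple")).any(axis=1)]

print(exact)
print(contains)
```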

### 4. Aggregations

##### basic aggregations using groupby

```python
# basic and quick aggregation:

df.groupby(["col_agg"])["col_value"].sum() # (.count(), .mean() etc)
df.groupby(["col_agg"], as_index = False)["col_value"].sum() # use as_index = False to keep the grouping key as a column instead of the index

# multiple calculations on one value

df.groupby(["col_agg"]).agg({"col_value": ["sum","mean","count","max","min"]})

# multiple grouping columns within aggregation:

df.groupby(["col_agg1","col_agg2","col_agg3"]).agg({"col_value": "sum"})

# separate aggregate calculations:
df.groupby(["col_agg1","col_agg2"], as_index = False).agg({"col_value1":"sum","col_value2":"mean"})

```
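
Passing a list of functions to `agg` produces two-level column names. One alternative, sketched here on made-up data, is pandas named aggregation, where each output column is given as `name=(column, function)`. The output names `total`, `average` and `n` are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"col_agg": ["a", "a", "b"], "col_value": [1, 2, 10]})

summary = df.groupby("col_agg", as_index=False).agg(
    total=("col_value", "sum"),
    average=("col_value", "mean"),
    n=("col_value", "count"),
)

print(summary)
```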

##### conditional aggregations

```python

# equivalent to sumif
df.groupby(["col_agg"]).apply(lambda x: x[x['col'] == 'condition']['col_value'].sum())

# alternative way using assign and numpy:

df_agg = df.assign(
    val1 = np.where(df["col"]=="condition",df.col_value1,0),
    val2 = np.where(df["col"]=="condition",df.col_value2,0)
   ).groupby("col_agg").agg({"val1":"sum","val2":"mean"})


```
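
A runnable sketch of a "sumif" on toy data (all names and values are made up). It compares filtering before grouping with the `assign` + `np.where` approach. The two differ for groups that have no matching rows: filtering first drops them, while the `np.where` version keeps them with a total of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "c" has no rows where col == "x".
df = pd.DataFrame({"col_agg": ["a", "a", "b", "b", "c"],
                   "col": ["x", "y", "x", "x", "y"],
                   "col_value": [1, 2, 3, 4, 5]})

# filter first, then aggregate: groups without a match disappear
sumif_filtered = df[df["col"] == "x"].groupby("col_agg")["col_value"].sum()

# zero out non-matching rows, then aggregate: every group is kept
sumif_all = (df.assign(val=np.where(df["col"] == "x", df["col_value"], 0))
               .groupby("col_agg")["val"].sum())

print(sumif_filtered)
print(sumif_all)
```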

### 5. Joins

```python

df3 = pd.merge(df,df2, left_index = True, right_index = True)

# merge using defined "how"
df3 = pd.merge(df,df2,on = "joinID", how = "left") # can be placed by right, inner etc. 

# concat (default outer join): axis = 1 binds columns, axis = 0 binds rows

df4 = pd.concat([df,df2,df3], axis = 1) 
df4 = pd.concat([df,df2,df3], axis = 0) 


# By default:

# df.join   is a column-wise left join (on the index)
# pd.merge  is a column-wise inner join
# pd.concat is a row-wise outer join (axis = 0)

```

End.