## 01_py_basics

#### Python repository for general data manipulation techniques and working with pandas. 

## This notebook will cover:

**1. Selecting columns**
* basic selection and drop columns
* select using pattern recognition
* relocate columns

**2. Creating new columns and values**

**3. Filtering data**

**4. Aggregations using groupby**

**5. Joins (merge)**

### 0. Set up ---

Basic set up to load and inspect data before any data exploration and analysis:

The first step is to load the basic Python libraries and the data.

In [1]:
# pandas and numpy are universally used in python, like tidyverse is in R. 
import pandas as pd
import numpy as np

!pip install openpyxl

# change from scientific notation to fixed decimals
pd.set_option('display.float_format', lambda x: '%.5f' % x)

trade = pd.read_excel("data/trade_data.xlsx") # load xlsx
tariff = pd.read_excel("data/tariff_data.xlsx")
# csv files can be loaded with pd.read_csv

Looking in indexes: https://s3-eu-west-2.amazonaws.com/mirrors.notebook.uktrade.io/pypi/
You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.

In [2]:
trade.head()

Unnamed: 0,Year,Flow,Commodity Code,Country Code,Country Name,Value GBP,Suppression notes
0,2020,Exports,1012100,TW,Taiwan,892,
1,2020,Exports,1062000,TW,Taiwan,14101,
2,2020,Exports,1063100,TW,Taiwan,1750,
3,2020,Exports,2031913,TW,Taiwan,290818,
4,2020,Exports,2031990,TW,Taiwan,1140,


Basic df exploration:

In [3]:
# column names and types:
trade.dtypes

Year                   int64
Flow                  object
Commodity Code        object
Country Code          object
Country Name          object
Value GBP              int64
Suppression notes    float64
dtype: object

In [4]:
# df summary:
trade.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41142 entries, 0 to 41141
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               41142 non-null  int64  
 1   Flow               41142 non-null  object 
 2   Commodity Code     41142 non-null  object 
 3   Country Code       41142 non-null  object 
 4   Country Name       41142 non-null  object 
 5   Value GBP          41142 non-null  int64  
 6   Suppression notes  0 non-null      float64
dtypes: float64(1), int64(2), object(4)
memory usage: 2.2+ MB


Using .info() is very useful because, in addition to the dtypes, it prints the non-null count for each column, which shows how many values are missing (NA). For example, the Suppression notes column has 0 non-null values, so it contains only NAs.
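
The same check can be made explicit by counting missing values per column. Below is a minimal sketch on a small made-up frame; `demo` and its values are illustrative stand-ins, not the real trade data:

```python
import numpy as np
import pandas as pd

# toy frame standing in for the trade data (values are made up)
demo = pd.DataFrame({
    "Year": [2020, 2020, 2019],
    "Value GBP": [892, 14101, 1750],
    "Suppression notes": [np.nan, np.nan, np.nan],
})

# number of missing values in each column
na_counts = demo.isna().sum()
print(na_counts)

# names of the columns that contain only missing values
all_na_cols = demo.columns[demo.isna().all()].tolist()
print(all_na_cols)
```

Columns listed in `all_na_cols` are candidates for dropping before any analysis.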

In [5]:
# summarise numerical values
trade.describe()

Unnamed: 0,Year,Value GBP,Suppression notes
count,41142.0,41142.0,0.0
mean,2019.92774,2658886.4581,
std,0.25892,61225646.8482,
min,2019.0,4.0,
25%,2020.0,5892.25,
50%,2020.0,34204.5,
75%,2020.0,260768.75,
max,2020.0,8963450144.0,


In [6]:
# simple df dimensions use shape:
trade.shape

(41142, 7)

**Note:** the Year column is loaded as an integer. It may be preferable to work with a string (character) type rather than a number for this column. The data type can be specified when loading the data:

In [7]:
trade2 = pd.read_excel("data/trade_data.xlsx",dtype={'Year': str}) # load year as a string
trade3 = pd.read_excel("data/trade_data.xlsx",dtype=str) # load all columns as strings
trade4 = pd.read_excel("data/trade_data.xlsx",dtype={'Value GBP': np.float64}) # load value as a float rather than an integer. Floats allow decimal places
print(trade2.dtypes,trade3.dtypes,trade4.dtypes)

Year                  object
Flow                  object
Commodity Code        object
Country Code          object
Country Name          object
Value GBP              int64
Suppression notes    float64
dtype: object Year                 object
Flow                 object
Commodity Code       object
Country Code         object
Country Name         object
Value GBP            object
Suppression notes    object
dtype: object Year                   int64
Flow                  object
Commodity Code        object
Country Code          object
Country Name          object
Value GBP            float64
Suppression notes    float64
dtype: object


In [8]:
# want float for value, so re-load the trade data:
trade = pd.read_excel("data/trade_data.xlsx",dtype={'Value GBP': np.float64})
trade.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41142 entries, 0 to 41141
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               41142 non-null  int64  
 1   Flow               41142 non-null  object 
 2   Commodity Code     41142 non-null  object 
 3   Country Code       41142 non-null  object 
 4   Country Name       41142 non-null  object 
 5   Value GBP          41142 non-null  float64
 6   Suppression notes  0 non-null      float64
dtypes: float64(2), int64(1), object(4)
memory usage: 2.2+ MB


## janitor - clean_names() equivalent. 

Working with cleaner string/column names is highly recommended.

In [9]:
trade.columns = trade.columns.str.lower().str.replace(" ","_")
trade.dtypes

year                   int64
flow                  object
commodity_code        object
country_code          object
country_name          object
value_gbp            float64
suppression_notes    float64
dtype: object

In [10]:
# using a function - helpful if there are multiple dataframes to convert.
# note: this renames the columns of the dataframe passed in (in place) and returns it.
def cleanCols(df):
    df.columns = df.columns.str.lower().str.replace(" ","_")
    return df

trade = cleanCols(trade)
trade2 = cleanCols(trade2)
trade3 = cleanCols(trade3)
tariff = cleanCols(tariff)

## 1. Select columns ----

basic selection:

In [11]:
trade2 = trade[["year","flow","commodity_code","country_name","value_gbp"]]
trade2.dtypes

year                int64
flow               object
commodity_code     object
country_name       object
value_gbp         float64
dtype: object

In [12]:
# use a list:
cols = ["year","flow","commodity_code","country_name","value_gbp"]
trade2 = trade[cols]

In [13]:
trade2.dtypes

year                int64
flow               object
commodity_code     object
country_name       object
value_gbp         float64
dtype: object

drop columns:

In [14]:
# remove columns
trade2 = trade.drop(["year","flow","commodity_code"], axis=1) # axis=1 means the labels refer to columns (positional axis is not accepted in pandas 2.0+)
trade2.dtypes

country_code          object
country_name          object
value_gbp            float64
suppression_notes    float64
dtype: object

In [15]:
trade2 = trade.drop(cols, axis=1)
trade2.dtypes

country_code          object
suppression_notes    float64
dtype: object

select columns using column indexes (numbers):
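
A minimal sketch of positional selection with `.iloc`, using a small made-up frame (`demo` is an illustrative stand-in for the trade data):

```python
import pandas as pd

# toy frame standing in for the trade data (values are made up)
demo = pd.DataFrame({
    "year": [2020, 2020],
    "flow": ["Exports", "Imports"],
    "commodity_code": ["1012100", "1062000"],
    "country_name": ["Taiwan", "Taiwan"],
    "value_gbp": [892.0, 14101.0],
})

# first three columns (positions 0, 1 and 2; the end of a slice is excluded)
first_three = demo.iloc[:, 0:3]

# specific positions, given as a list
picked = demo.iloc[:, [0, 4]]

# positions can also be turned back into column names
names = demo.columns[[1, 3]].tolist()

print(first_three.columns.tolist(), picked.columns.tolist(), names)
```

Positional selection is brittle if the column order of the source file changes, so selecting by name is usually safer for anything that will be re-run.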

### 1.a select columns using string patterns

The tariff data is a good df for this example as it has a lot of column names with patterns which can be used for tidy selections.

In [16]:
tariff.dtypes

commodity_heading                           object
commodity_code                               int64
x8_digit_or_10_digit                         int64
commodity_code_description                  object
mfn_applied_duty_rate                       object
preferential_applied_duty_rate_2021         object
preferential_applied_duty_rate_2022         object
preferential_applied_duty_rate_2023         object
preferential_applied_duty_rate_2024         object
quota_number                                object
in_quota_tariff_line_code                  float64
preferential_applied_duty_rate_excluded     object
notes                                       object
cn8                                          int64
hs2                                          int64
hs_section                                  object
hs2_description                             object
mfn_applied_rate_ukgt                       object
value_usd                                  float64
cn8_count                      

In [17]:
prefCol = tariff.columns[tariff.columns.str.contains(pat = 'pref')]
prefCol2 = [col for col in tariff.columns if 'pref' in col]

In [18]:
print(prefCol,prefCol2)

Index(['preferential_applied_duty_rate_2021',
       'preferential_applied_duty_rate_2022',
       'preferential_applied_duty_rate_2023',
       'preferential_applied_duty_rate_2024',
       'preferential_applied_duty_rate_excluded'],
      dtype='object') ['preferential_applied_duty_rate_2021', 'preferential_applied_duty_rate_2022', 'preferential_applied_duty_rate_2023', 'preferential_applied_duty_rate_2024', 'preferential_applied_duty_rate_excluded']


Note the difference between the output types: the first is a pandas Index, the second is a plain Python list.

In [19]:
mfnCol = [col for col in tariff.columns if 'mfn' in col]

In [20]:
codeCol = [col for col in tariff.columns if 'commodity' in col]

In [21]:
colNames = [codeCol,mfnCol,prefCol2]
print(colNames)

[['commodity_heading', 'commodity_code', 'commodity_code_description'], ['mfn_applied_duty_rate', 'mfn_applied_rate_ukgt'], ['preferential_applied_duty_rate_2021', 'preferential_applied_duty_rate_2022', 'preferential_applied_duty_rate_2023', 'preferential_applied_duty_rate_2024', 'preferential_applied_duty_rate_excluded']]


In [22]:
#tariff2 = tariff[colNames]
#tariff2.dtypes
# for error fix use:
#colNames = np.concatenate((codeCol,prefCol, mfnCol))

**NOTE the error.** colNames is a list of three lists (a nested list), not a flat list of column names, so it can't be used in this way to select columns from a pandas df.

The lists need to be flattened into a single sequence of column names first, for example with np.concatenate:

In [41]:
prefCol = [col for col in tariff.columns if 'pref' in col]
mfnCol = [col for col in tariff.columns if 'mfn' in col]
codeCol = [col for col in tariff.columns if 'commodity' in col]
colNames2 =  np.concatenate((codeCol,prefCol, mfnCol))
tariff2 = tariff[colNames2]
tariff2.head()

Unnamed: 0,commodity_heading,commodity_code,commodity_code_description,preferential_applied_duty_rate_2021,preferential_applied_duty_rate_2022,preferential_applied_duty_rate_2023,preferential_applied_duty_rate_2024,preferential_applied_duty_rate_excluded,mfn_applied_duty_rate,mfn_applied_rate_ukgt
0,01 - Live Animals,1012100,Pure-bred breeding horses,0%,0%,0%,0%,N,0%,0.0
1,01 - Live Animals,1012910,Horses for slaughter,0%,0%,0%,0%,N,0%,0.0
2,01 - Live Animals,1012990,"Live horses (excl. for slaughter, pure-bred fo...",0%,0%,0%,0%,N,10%,0.1
3,01 - Live Animals,1013000,Live asses,0%,0%,0%,0%,N,6%,0.06
4,01 - Live Animals,1019000,Live mules and hinnies,0%,0%,0%,0%,N,10%,0.1
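
As an alternative to np.concatenate, plain Python lists can be joined with `+`, which also gives a flat list of names. A minimal sketch on a made-up frame (`demo` and its columns are illustrative, loosely modelled on the tariff data):

```python
import pandas as pd

# toy frame loosely modelled on the tariff data (values are made up)
demo = pd.DataFrame({
    "commodity_code": [1012100, 1012910],
    "mfn_applied_duty_rate": ["0%", "0%"],
    "preferential_applied_duty_rate_2021": ["0%", "0%"],
    "notes": ["a", "b"],
})

code_cols = [c for c in demo.columns if "commodity" in c]
mfn_cols = [c for c in demo.columns if "mfn" in c]
pref_cols = [c for c in demo.columns if "pref" in c]

# a list of lists: this is the shape that caused the error
nested = [code_cols, mfn_cols, pref_cols]

# one flat list of column names: this can be used to select columns
flat = code_cols + mfn_cols + pref_cols

subset = demo[flat]
print(subset.columns.tolist())
```

List addition keeps everything as ordinary Python strings, whereas np.concatenate returns a numpy array; both work as column selectors.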


### 1b. select columns with numerical values and combination of string patterns

Select columns which contain numerical values and where numerical values end the column string

i.e. preferential + 2021, 2022, etc.

```python
tariff2 = tariff[["commodity_code",
                  "preferential_applied_duty_rate_2021",
                  "preferential_applied_duty_rate_2022",
                  "preferential_applied_duty_rate_2023",
                  "preferential_applied_duty_rate_2024"]]
```

If there were even more columns, typing everything out manually would be tedious and time-consuming when it can easily be done using string pattern matching.

In [42]:
col = np.array(tariff.columns[tariff.columns.str.contains('.*[0-9].*', regex=True)]) # select columns containing any numerical value
col

array(['x8_digit_or_10_digit', 'preferential_applied_duty_rate_2021',
       'preferential_applied_duty_rate_2022',
       'preferential_applied_duty_rate_2023',
       'preferential_applied_duty_rate_2024', 'cn8', 'hs2',
       'hs2_description', 'cn8_count', 'tariff_status_2021',
       'tariff_status_final_2021', 'tariff_status_2022',
       'tariff_status_final_2022', 'tariff_status_2023',
       'tariff_status_final_2023', 'tariff_status_2024',
       'tariff_status_final_2024'], dtype=object)

This doesn't return what is required, as it picks up every column name containing a digit. The cells below try narrower patterns and then combine str.contains conditions:

In [25]:
# doesn't work when trying to match numerical values at the end of the string: str.endswith matches literal text, not a regex pattern, so nothing is returned
col_list = [col for col in tariff.columns if col.endswith('.*[0-9].*')]
col_list

[]
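
One way to match digits at the end of a name is to use a regular expression anchored with `$`, either through `re.search` or through the pandas string methods. A minimal sketch, using a hand-written Index of names taken from the tariff columns (the `cols` object is an illustrative stand-in for `tariff.columns`):

```python
import re
import pandas as pd

# a few column names copied from the tariff data
cols = pd.Index([
    "x8_digit_or_10_digit",
    "preferential_applied_duty_rate_2021",
    "preferential_applied_duty_rate_excluded",
    "hs2",
    "hs2_description",
    "tariff_status_final_2024",
])

# list comprehension with re.search: r"\d$" means "a digit at the very end"
ends_with_digit = [c for c in cols if re.search(r"\d$", c)]

# the same idea with the vectorised string method, here requiring
# an underscore followed by exactly four digits at the end
ends_with_year = cols[cols.str.contains(r"_\d{4}$", regex=True)].tolist()

print(ends_with_digit)
print(ends_with_year)
```

The stricter `_\d{4}$` pattern excludes names such as `hs2` that end in a digit but are not year columns.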

In [26]:
# an alternative quick way is to use a simple pattern within the numerical strings, however this also extracts unwanted tariff columns:
cl = tariff.columns[tariff.columns.str.contains(pat = '20')]
cl

Index(['preferential_applied_duty_rate_2021',
       'preferential_applied_duty_rate_2022',
       'preferential_applied_duty_rate_2023',
       'preferential_applied_duty_rate_2024', 'tariff_status_2021',
       'tariff_status_final_2021', 'tariff_status_2022',
       'tariff_status_final_2022', 'tariff_status_2023',
       'tariff_status_final_2023', 'tariff_status_2024',
       'tariff_status_final_2024'],
      dtype='object')

In [27]:
col = np.array(tariff.columns[tariff.columns.str.contains('20',regex=True)]) # same selection as above, returned as a numpy array
col

array(['preferential_applied_duty_rate_2021',
       'preferential_applied_duty_rate_2022',
       'preferential_applied_duty_rate_2023',
       'preferential_applied_duty_rate_2024', 'tariff_status_2021',
       'tariff_status_final_2021', 'tariff_status_2022',
       'tariff_status_final_2022', 'tariff_status_2023',
       'tariff_status_final_2023', 'tariff_status_2024',
       'tariff_status_final_2024'], dtype=object)

In [28]:
# example using startswith and endswith:
col_list = [col for col in tariff.columns if (col.startswith('pref') and col.endswith("2"))]
col_list

['preferential_applied_duty_rate_2022']

In [29]:
c = np.array(tariff.columns[tariff.columns.str.contains(pat = "pref") & tariff.columns.str.contains('20',regex=True)])
c

array(['preferential_applied_duty_rate_2021',
       'preferential_applied_duty_rate_2022',
       'preferential_applied_duty_rate_2023',
       'preferential_applied_duty_rate_2024'], dtype=object)

In [43]:
# need to combine commodity_code with c in an np.array
cd = ["commodity_code"]
c2 = np.concatenate((cd,c))
tariff[c2].head()

Unnamed: 0,commodity_code,preferential_applied_duty_rate_2021,preferential_applied_duty_rate_2022,preferential_applied_duty_rate_2023,preferential_applied_duty_rate_2024
0,1012100,0%,0%,0%,0%
1,1012910,0%,0%,0%,0%
2,1012990,0%,0%,0%,0%
3,1013000,0%,0%,0%,0%
4,1019000,0%,0%,0%,0%


In [44]:
# full solution:
c = np.array(tariff.columns[tariff.columns.str.contains(pat = "pref") & tariff.columns.str.contains('20',regex=True)])
cd = ["commodity_code"]
c2 = np.concatenate((cd,c))
tariff2 = tariff[c2]
tariff2.head()

Unnamed: 0,commodity_code,preferential_applied_duty_rate_2021,preferential_applied_duty_rate_2022,preferential_applied_duty_rate_2023,preferential_applied_duty_rate_2024
0,1012100,0%,0%,0%,0%
1,1012910,0%,0%,0%,0%
2,1012990,0%,0%,0%,0%
3,1013000,0%,0%,0%,0%
4,1019000,0%,0%,0%,0%
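
The same selection can also be sketched in one step with `DataFrame.filter` and a single regular expression. The frame below is a made-up stand-in for the tariff data, and the pattern is one possible choice rather than the only one:

```python
import pandas as pd

# toy frame loosely modelled on the tariff data (values are made up)
demo = pd.DataFrame({
    "commodity_code": [1012100, 1012910],
    "preferential_applied_duty_rate_2021": ["0%", "0%"],
    "preferential_applied_duty_rate_2022": ["0%", "0%"],
    "preferential_applied_duty_rate_excluded": ["N", "N"],
    "tariff_status_2021": ["x", "y"],
})

# keep commodity_code, plus any column starting with "pref" and ending in four digits
subset = demo.filter(regex=r"^commodity_code$|^pref.*\d{4}$")
print(subset.columns.tolist())
```

`filter` keeps the columns in their original order, so no concatenation step is needed.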


****

### 1c. Relocate columns:

I am currently unaware of a single-line function which achieves this, like relocate in tidyverse. However, it only takes a few lines once the columns to be relocated have been specified.

Example: trade data set - move flow column next to trade value

In [32]:
trade2 = trade.copy()

In [45]:
# column to be moved (as a Series):
col = trade2["flow"]
# drop the column from the df
trade2.drop(labels=["flow"], axis = 1, inplace = True)
# insert the column back in at the chosen position. value_gbp is the 5th column (position 4, as the index starts at 0).
trade2.insert(4,"flow",col)
trade2.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Unnamed: 0,year,country_code,country_name,commodity_code,flow,value_gbp,suppression_notes
0,2020,TW,Taiwan,1012100,Exports,892.0,
1,2020,TW,Taiwan,1062000,Exports,14101.0,
2,2020,TW,Taiwan,1063100,Exports,1750.0,
3,2020,TW,Taiwan,2031913,Exports,290818.0,
4,2020,TW,Taiwan,2031990,Exports,1140.0,
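
The drop-then-insert steps can be shortened with `DataFrame.pop`, which removes a column and returns it in one call. A minimal sketch on a made-up frame (`demo` is illustrative); looking up the target position by name avoids hard-coding the number 4:

```python
import pandas as pd

# toy frame standing in for the trade data (values are made up)
demo = pd.DataFrame({
    "year": [2020],
    "flow": ["Exports"],
    "commodity_code": ["1012100"],
    "country_name": ["Taiwan"],
    "value_gbp": [892.0],
})

# pop removes the column and returns it as a Series
flow = demo.pop("flow")

# the position of value_gbp is looked up after the removal,
# so "flow" is inserted directly before value_gbp
demo.insert(demo.columns.get_loc("value_gbp"), "flow", flow)

print(demo.columns.tolist())
```

Because `demo` here is built directly rather than sliced from another frame, this sketch should not trigger the copy-of-a-slice warning shown above; on a sliced frame, taking an explicit `.copy()` first is the usual remedy.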


In [34]:
# Can easily move multiple columns using the same method:
cols = ["country_name","country_code"]
col1 = trade2["country_name"]
col2 = trade2["country_code"]
trade2.drop(cols, axis = 1, inplace = True)
# insert the columns back in at position 1 (directly after year). Inserting country_code last places it before country_name.
trade2.insert(1,"country_name",col1)
trade2.insert(1,"country_code",col2)
trade2.head()

Unnamed: 0,year,country_code,country_name,commodity_code,flow,value_gbp,suppression_notes
0,2020,TW,Taiwan,1012100,Exports,892.0,
1,2020,TW,Taiwan,1062000,Exports,14101.0,
2,2020,TW,Taiwan,1063100,Exports,1750.0,
3,2020,TW,Taiwan,2031913,Exports,290818.0,
4,2020,TW,Taiwan,2031990,Exports,1140.0,


If you want to move a larger selection of columns, the above method isn't the most helpful. It is easier to specify the selection by naming the columns in the required order (similar to select in tidyverse):

In [35]:
trade2 = trade[["year","country_code","country_name","flow","commodity_code","value_gbp","suppression_notes"]]
trade2.head()

Unnamed: 0,year,country_code,country_name,flow,commodity_code,value_gbp,suppression_notes
0,2020,TW,Taiwan,Exports,1012100,892.0,
1,2020,TW,Taiwan,Exports,1062000,14101.0,
2,2020,TW,Taiwan,Exports,1063100,1750.0,
3,2020,TW,Taiwan,Exports,2031913,290818.0,
4,2020,TW,Taiwan,Exports,2031990,1140.0,


However, if you have a lot more columns, this is also not particularly helpful if you want to reduce the time spent writing out column names.

In [36]:
#example df:
    
prefCol = [col for col in tariff.columns if 'pref' in col]
mfnCol = [col for col in tariff.columns if 'mfn' in col]
codeCol = [col for col in tariff.columns if 'commodity' in col]
tariffCol = [col for col in tariff.columns if 'tariff' in col]
colNames2 =  np.concatenate((codeCol,prefCol, mfnCol,tariffCol))
tariff2 = tariff[colNames2]
tariff2.dtypes

commodity_heading                           object
commodity_code                               int64
commodity_code_description                  object
preferential_applied_duty_rate_2021         object
preferential_applied_duty_rate_2022         object
preferential_applied_duty_rate_2023         object
preferential_applied_duty_rate_2024         object
preferential_applied_duty_rate_excluded     object
mfn_applied_duty_rate                       object
mfn_applied_rate_ukgt                       object
in_quota_tariff_line_code                  float64
tariff_status_2021                          object
tariff_status_final_2021                    object
tariff_status_2022                          object
tariff_status_final_2022                    object
tariff_status_2023                          object
tariff_status_final_2023                    object
tariff_status_2024                          object
tariff_status_final_2024                    object
dtype: object

There are a lot of string patterns within this dataframe. However, I am approaching this as if there weren't, and we wanted to relocate multiple columns to selected positions within a df.

In [37]:
tariff2 = tariff.copy()

In [38]:
# relocate MFN columns to front of data frame (method is useful when moving numerous columns to new position)
cols_to_move = ["mfn_applied_duty_rate","mfn_applied_rate_ukgt"]
#col_index = ["commo
tariff3 = tariff2[cols_to_move + [ col for col in tariff2.columns if col not in cols_to_move ]]
tariff3.dtypes

mfn_applied_duty_rate                       object
mfn_applied_rate_ukgt                       object
commodity_heading                           object
commodity_code                               int64
x8_digit_or_10_digit                         int64
commodity_code_description                  object
preferential_applied_duty_rate_2021         object
preferential_applied_duty_rate_2022         object
preferential_applied_duty_rate_2023         object
preferential_applied_duty_rate_2024         object
quota_number                                object
in_quota_tariff_line_code                  float64
preferential_applied_duty_rate_excluded     object
notes                                       object
cn8                                          int64
hs2                                          int64
hs_section                                  object
hs2_description                             object
value_usd                                  float64
cn8_count                      

In [39]:
tariff2.dtypes

commodity_heading                           object
commodity_code                               int64
x8_digit_or_10_digit                         int64
commodity_code_description                  object
mfn_applied_duty_rate                       object
preferential_applied_duty_rate_2021         object
preferential_applied_duty_rate_2022         object
preferential_applied_duty_rate_2023         object
preferential_applied_duty_rate_2024         object
quota_number                                object
in_quota_tariff_line_code                  float64
preferential_applied_duty_rate_excluded     object
notes                                       object
cn8                                          int64
hs2                                          int64
hs_section                                  object
hs2_description                             object
mfn_applied_rate_ukgt                       object
value_usd                                  float64
cn8_count                      

In [40]:
# move pref columns to end of df
cols_to_move = [col for col in tariff.columns if 'pref' in col]
tariff3 = tariff2[[ col for col in tariff2.columns if col not in cols_to_move ]+cols_to_move]
tariff3.dtypes

commodity_heading                           object
commodity_code                               int64
x8_digit_or_10_digit                         int64
commodity_code_description                  object
mfn_applied_duty_rate                       object
quota_number                                object
in_quota_tariff_line_code                  float64
notes                                       object
cn8                                          int64
hs2                                          int64
hs_section                                  object
hs2_description                             object
mfn_applied_rate_ukgt                       object
value_usd                                  float64
cn8_count                                    int64
tariff_status_2021                          object
tariff_status_final_2021                    object
tariff_status_2022                          object
tariff_status_final_2022                    object
tariff_status_2023             

### **Still looking for a solution to move selected columns to an arbitrary position in the df, i.e. relocate the pref columns after "in_quota_tariff_line_code", for example**
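
One possible approach, sketched as a small helper. The function name `relocate` and its `after` argument are hypothetical (they are not part of pandas), and the `demo` frame is a made-up stand-in for the tariff data:

```python
import pandas as pd

def relocate(df, cols_to_move, after):
    """Return df with cols_to_move placed directly after the column `after`.

    Hypothetical helper, not a pandas function. `after` must not be one of
    the columns being moved, otherwise list.index raises a ValueError.
    """
    # columns that stay where they are, in their original order
    remaining = [c for c in df.columns if c not in cols_to_move]
    # position just after the anchor column
    pos = remaining.index(after) + 1
    new_order = remaining[:pos] + list(cols_to_move) + remaining[pos:]
    return df[new_order]

# toy frame loosely modelled on the tariff data (values are made up)
demo = pd.DataFrame({
    "commodity_code": [1012100],
    "preferential_applied_duty_rate_2021": ["0%"],
    "preferential_applied_duty_rate_2022": ["0%"],
    "quota_number": ["q1"],
    "in_quota_tariff_line_code": [1.0],
    "notes": ["n"],
})

pref_cols = [c for c in demo.columns if "pref" in c]
moved = relocate(demo, pref_cols, after="in_quota_tariff_line_code")
print(moved.columns.tolist())
```

The helper only builds a new column order and selects with it, so the original frame is left unchanged.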

****

## 2. Create new columns