## 02_py_strings

This notebook covers general string manipulation in Python.

Topics covered:

**1. Basic string manipulation**
* combine strings
* extract characters from strings
* string lengths

**2. Remove string patterns** 
* remove strings
* remove whitespace

**3. String detection**
* detect pattern
* use patterns to create columns and filter data

**Glossary**

In [3]:
# Set up

# pandas and numpy are used almost universally in Python, much as the tidyverse is in R.
import pandas as pd
import numpy as np

!pip install openpyxl

# change the float display from scientific notation to 5 decimal places
pd.set_option('display.float_format', lambda x: '%.5f' % x)

trade = pd.read_excel("data/trade_data.xlsx") # load the .xlsx file
tariff = pd.read_excel("data/tariff_data.xlsx")
uk_trqs = pd.read_csv("data/uk_trqs.csv",dtype={'quota__order_number': str})

trade.columns = trade.columns.str.lower().str.replace(" ","_")
uk_trqs.columns = uk_trqs.columns.str.lower().str.replace(" ","_")

Looking in indexes: https://s3-eu-west-2.amazonaws.com/mirrors.notebook.uktrade.io/pypi/
Collecting openpyxl
  Downloading https://s3-eu-west-2.amazonaws.com/mirrors.notebook.uktrade.io/pypi/openpyxl/openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading https://s3-eu-west-2.amazonaws.com/mirrors.notebook.uktrade.io/pypi/et-xmlfile/et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.10

****

### 1. Basic string manipulation

For simple string manipulation we will use the trade dataset.

In [5]:
trade.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41142 entries, 0 to 41141
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   year               41142 non-null  int64  
 1   flow               41142 non-null  object 
 2   commodity_code     41142 non-null  object 
 3   country_code       41142 non-null  object 
 4   country_name       41142 non-null  object 
 5   value_gbp          41142 non-null  int64  
 6   suppression_notes  0 non-null      float64
dtypes: float64(1), int64(2), object(4)
memory usage: 2.2+ MB


In [6]:
trade

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes
0,2020,Exports,01012100,TW,Taiwan,892,
1,2020,Exports,01062000,TW,Taiwan,14101,
2,2020,Exports,01063100,TW,Taiwan,1750,
3,2020,Exports,02031913,TW,Taiwan,290818,
4,2020,Exports,02031990,TW,Taiwan,1140,
...,...,...,...,...,...,...,...
41137,2019,Imports,94036090,ZM,Zambia,932,
41138,2020,Imports,95030041,ZM,Zambia,3812,
41139,2020,Imports,95030099,ZM,Zambia,3972,
41140,2020,Imports,97050000,ZM,Zambia,2213,


In [9]:
# combine strings of country_code and country_name

df = trade.copy()
df["combined_col"] = df["country_code"]+df["country_name"]
df["combined_col2"] = df["country_code"]+"_"+df["country_name"]
df.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,combined_col,combined_col2
0,2020,Exports,1012100,TW,Taiwan,892,,TWTaiwan,TW_Taiwan
1,2020,Exports,1062000,TW,Taiwan,14101,,TWTaiwan,TW_Taiwan
2,2020,Exports,1063100,TW,Taiwan,1750,,TWTaiwan,TW_Taiwan
3,2020,Exports,2031913,TW,Taiwan,290818,,TWTaiwan,TW_Taiwan
4,2020,Exports,2031990,TW,Taiwan,1140,,TWTaiwan,TW_Taiwan


In [10]:
# combine year (numerical column) and flow (string)
# note: you can't combine a numerical value with a string directly, so convert it to a string first
# this can be done simply with map(str).

df = trade.copy()
df["combined_col"] = df["year"].map(str) + df["flow"]
df.head()

# note: try running the code without map(str) to see the error it raises.

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,combined_col
0,2020,Exports,1012100,TW,Taiwan,892,,2020Exports
1,2020,Exports,1062000,TW,Taiwan,14101,,2020Exports
2,2020,Exports,1063100,TW,Taiwan,1750,,2020Exports
3,2020,Exports,2031913,TW,Taiwan,290818,,2020Exports
4,2020,Exports,2031990,TW,Taiwan,1140,,2020Exports
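
As a minimal sketch of the same idea on toy data (the two-row frame below is illustrative, not taken from the trade dataset), `astype(str)` does the same conversion as `map(str)`, and `str.cat` is an alternative to `+` when a separator is wanted:

```python
import pandas as pd

# Toy stand-in for the trade table (illustrative values only)
toy = pd.DataFrame({"year": [2019, 2020], "flow": ["Imports", "Exports"]})

# Convert the numeric column to strings, then concatenate with "+"
toy["combined_plus"] = toy["year"].astype(str) + "_" + toy["flow"]

# Same result using str.cat with an explicit separator
toy["combined_cat"] = toy["year"].astype(str).str.cat(toy["flow"], sep="_")

print(toy)
```

Both new columns hold the same values; which form to use is largely a matter of readability.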


#### Extract strings

In [13]:
# extract a substring
# example using the commodity code column, which is in 8-digit format: extract the HS2, HS4 and HS6 codes.
# equivalent of using LEFT in Excel.
# uses str.slice

df = trade.copy()
df["hs2"] = df["commodity_code"].str.slice(0,2) # index 0 for start of string. 
df["hs4"] = df["commodity_code"].str.slice(0,4)
df["hs6"] = df["commodity_code"].str.slice(0,6)
df.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,hs2,hs4,hs6
0,2020,Exports,1012100,TW,Taiwan,892,,1,101,10121
1,2020,Exports,1062000,TW,Taiwan,14101,,1,106,10620
2,2020,Exports,1063100,TW,Taiwan,1750,,1,106,10631
3,2020,Exports,2031913,TW,Taiwan,290818,,2,203,20319
4,2020,Exports,2031990,TW,Taiwan,1140,,2,203,20319
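
The sketch below, on two illustrative 8-digit codes held as strings, shows how `str.slice(start, stop)` and the `str[start:stop]` shorthand line up with LEFT, MID and RIGHT. The stop index is exclusive, as with ordinary Python slicing.

```python
import pandas as pd

# Two illustrative 8-digit codes, stored as strings so leading zeros survive
codes = pd.Series(["01012100", "94036090"])

left4 = codes.str.slice(0, 4)    # LEFT 4:  same as codes.str[:4]
mid4 = codes.str.slice(2, 6)     # MID:     characters at index 2, 3, 4, 5
right4 = codes.str.slice(-4)     # RIGHT 4: same as codes.str[-4:]

print(left4.tolist(), mid4.tolist(), right4.tolist())
```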


In [21]:
# RIGHT example

df["right"] = df["commodity_code"].str[-2:] # last 2 digits
df["right2"] = df["commodity_code"].str[-4:] # last 4 digits, etc.

# alternative way of performing the LEFT equivalent
df["left"] = df["commodity_code"].str[:2]
df["left2"] = df["commodity_code"].str[:4]
df["left3"] = df["commodity_code"].str[:6]

# MID example
df["mid"] = df["commodity_code"].str.slice(2,6) # middle four digits: start at index 2 (the 3rd character) and stop before index 6

df

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,hs2,hs4,hs6,right,right2,left,left2,left3,mid
0,2020,Exports,01012100,TW,Taiwan,892,,01,0101,010121,00,2100,01,0101,010121,0121
1,2020,Exports,01062000,TW,Taiwan,14101,,01,0106,010620,00,2000,01,0106,010620,0620
2,2020,Exports,01063100,TW,Taiwan,1750,,01,0106,010631,00,3100,01,0106,010631,0631
3,2020,Exports,02031913,TW,Taiwan,290818,,02,0203,020319,13,1913,02,0203,020319,0319
4,2020,Exports,02031990,TW,Taiwan,1140,,02,0203,020319,90,1990,02,0203,020319,0319
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41137,2019,Imports,94036090,ZM,Zambia,932,,94,9403,940360,90,6090,94,9403,940360,0360
41138,2020,Imports,95030041,ZM,Zambia,3812,,95,9503,950300,41,0041,95,9503,950300,0300
41139,2020,Imports,95030099,ZM,Zambia,3972,,95,9503,950300,99,0099,95,9503,950300,0300
41140,2020,Imports,97050000,ZM,Zambia,2213,,97,9705,970500,00,0000,97,9705,970500,0500


#### String length

In [28]:
# use str.len

df = trade.copy()
df["length"] = df["commodity_code"].str.len()
df.head(3)

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,length
0,2020,Exports,1012100,TW,Taiwan,892,,8
1,2020,Exports,1062000,TW,Taiwan,14101,,8
2,2020,Exports,1063100,TW,Taiwan,1750,,8


In [27]:
# NOTE: you can't use str.len directly on numerical values.
# you can convert the numerical value to a string and combine that with apply:

df["year_len"] = df["year"].map(str).apply(len)
df.head(2)

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes,length,year_len
0,2020,Exports,1012100,TW,Taiwan,892,,8,4
1,2020,Exports,1062000,TW,Taiwan,14101,,8,4
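
One practical use of a length check is spotting codes that have lost a leading zero, which typically happens when they are read in as numbers. The sketch below assumes, purely for illustration, that every code should be 8 characters long; the two integer values are hypothetical inputs.

```python
import pandas as pd

# Hypothetical codes that were read in as integers, so "01012100" became 1012100
codes = pd.Series([1012100, 94036090])

as_text = codes.astype(str)
lengths = as_text.str.len()      # 7 for the code that lost its leading zero
padded = as_text.str.zfill(8)    # left-pad with zeros up to 8 characters

print(lengths.tolist())
print(padded.tolist())
```

Reading the column as text in the first place (for example with `dtype=str`, as the set-up cell does for `quota__order_number`) avoids the problem altogether.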


### 2. Remove string patterns

In [12]:
# UK TRQ data can be found here: https://data.gov.uk/dataset/4a478c7e-16c7-4c28-ab9b-967bb79342e9/uk-trade-quotas
# use str.replace to remove "|" from the quota commodities column
df = uk_trqs.copy()
df.head(2)

Unnamed: 0,quota_definition__sid,quota__order_number,quota__geographical_areas,quota__headings,quota__commodities,quota__measurement_unit,quota__monetary_unit,quota_definition__description,quota_definition__validity_start_date,quota_definition__validity_end_date,quota_definition__suspension_periods,quota_definition__blocking_periods,quota_definition__status,quota_definition__last_allocation_date,quota_definition__initial_volume,quota_definition__balance,quota_definition__fill_rate
0,20815,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,01/01/2021,14/02/2021,,,Closed,28/01/2021,2022900,2022900.0,0.0
1,20814,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2021,14/02/2022,,,Closed,,2112000,2112000.0,0.0


In [14]:
#df["quota__commodities"] = df["quota__commodities"].str.replace("|","", regex=False) # remove | and replace with nothing
#df["quota__commodities"] = df["quota__commodities"].str.replace("|",";", regex=False) # remove | and replace with ";"
df["quota__commodities"] = df["quota__commodities"].str.replace("|"," , ", regex=False) # regex=False treats "|" as a literal character; replace it with " , "
df.head(3)


Unnamed: 0,quota_definition__sid,quota__order_number,quota__geographical_areas,quota__headings,quota__commodities,quota__measurement_unit,quota__monetary_unit,quota_definition__description,quota_definition__validity_start_date,quota_definition__validity_end_date,quota_definition__suspension_periods,quota_definition__blocking_periods,quota_definition__status,quota_definition__last_allocation_date,quota_definition__initial_volume,quota_definition__balance,quota_definition__fill_rate
0,20815,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...","0302410000 , 0303510000 , 0304595000 , 0304599...",Kilogram (kg),,,01/01/2021,14/02/2021,,,Closed,28/01/2021,2022900,2022900.0,0.0
1,20814,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...","0302410000 , 0303510000 , 0304595000 , 0304599...",Kilogram (kg),,,16/06/2021,14/02/2022,,,Closed,,2112000,2112000.0,0.0
2,21865,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...","0302410000 , 0303510000 , 0304595000 , 0304599...",Kilogram (kg),,,16/06/2022,14/02/2023,,,Future,,2112000,,0.0
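
The "|" character is the alternation operator in regular expressions, so it is safest to state explicitly whether a pattern is literal or a regex. A minimal sketch on two illustrative strings: either pass `regex=False`, or escape the character and pass `regex=True`.

```python
import pandas as pd

# Illustrative pipe-separated commodity strings
s = pd.Series(["0302410000|0303510000", "0305511010"])

# Option 1: treat the pattern as a literal string
literal = s.str.replace("|", ", ", regex=False)

# Option 2: treat it as a regex, escaping the special character
escaped = s.str.replace(r"\|", ", ", regex=True)

print(literal.tolist())
```

Both options give the same result here; `regex=False` is usually the simpler choice when no pattern matching is needed.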


In [18]:
# remove every "0" character from the string (not only leading zeros) and convert to numeric
df = trade.head(20).copy()
df["commodity_code"] = df["commodity_code"].str.replace("0","").astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   year               20 non-null     int64  
 1   flow               20 non-null     object 
 2   commodity_code     20 non-null     int64  
 3   country_code       20 non-null     object 
 4   country_name       20 non-null     object 
 5   value_gbp          20 non-null     int64  
 6   suppression_notes  0 non-null      float64
dtypes: float64(1), int64(3), object(3)
memory usage: 1.2+ KB
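
Note that `str.replace("0", "")` drops every zero, including those in the middle of a code. If the aim is only to drop leading zeros, `str.lstrip("0")` is one option. The sketch below contrasts the two on illustrative codes.

```python
import pandas as pd

codes = pd.Series(["01012100", "02031990"])  # illustrative 8-digit codes

all_removed = codes.str.replace("0", "", regex=False)  # removes every "0"
leading_removed = codes.str.lstrip("0")                # removes leading "0"s only
as_int = codes.astype(int)                             # int conversion also drops leading zeros

print(all_removed.tolist())
print(leading_removed.tolist())
```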


In [19]:
# replace strings in multiple columns:
# pre-define the columns and apply the function to them
# replace "T" with "*" in the country columns
df = trade.copy()
df.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes
0,2020,Exports,1012100,TW,Taiwan,892,
1,2020,Exports,1062000,TW,Taiwan,14101,
2,2020,Exports,1063100,TW,Taiwan,1750,
3,2020,Exports,2031913,TW,Taiwan,290818,
4,2020,Exports,2031990,TW,Taiwan,1140,


In [48]:
df = trade.copy()
df[["country_code","country_name"]] = df[["country_code","country_name"]].replace('T','*', regex=True)
df.head()

Unnamed: 0,year,flow,commodity_code,country_code,country_name,value_gbp,suppression_notes
0,2020,Exports,1012100,*W,*aiwan,892,
1,2020,Exports,1062000,*W,*aiwan,14101,
2,2020,Exports,1063100,*W,*aiwan,1750,
3,2020,Exports,2031913,*W,*aiwan,290818,
4,2020,Exports,2031990,*W,*aiwan,1140,
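
A small sketch of why `regex=True` matters in the cell above, using an illustrative two-row frame: without it, `DataFrame.replace` only matches cells whose entire value equals the search string, so substrings are left alone.

```python
import pandas as pd

# Toy stand-in for the trade table (illustrative values only)
toy = pd.DataFrame({
    "country_code": ["TW", "ZM"],
    "country_name": ["Taiwan", "Zambia"],
    "flow": ["Exports", "Imports"],
})
cols = ["country_code", "country_name"]

whole_cell = toy[cols].replace("T", "*")             # no cell equals "T", so nothing changes
substring = toy[cols].replace("T", "*", regex=True)  # every "T" inside a cell is replaced

toy[cols] = substring  # columns outside `cols` are untouched
print(toy)
```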


#### Remove whitespace

In [22]:
df = uk_trqs.copy()
df["quota__commodities"] = df["quota__commodities"].str.replace("|"," , ", regex=False) # regex=False treats "|" as a literal character
df["quota__commodities"] = df["quota__commodities"].str.strip() # remove whitespace from both ends of the string
df["quota__commodities"] = df["quota__commodities"].str.lstrip() # remove whitespace from the left-hand side of the string
df["quota__commodities"] = df["quota__commodities"].str.rstrip() # remove whitespace from the right-hand side of the string
df.head()


Unnamed: 0,quota_definition__sid,quota__order_number,quota__geographical_areas,quota__headings,quota__commodities,quota__measurement_unit,quota__monetary_unit,quota_definition__description,quota_definition__validity_start_date,quota_definition__validity_end_date,quota_definition__suspension_periods,quota_definition__blocking_periods,quota_definition__status,quota_definition__last_allocation_date,quota_definition__initial_volume,quota_definition__balance,quota_definition__fill_rate
0,20815,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...","0302410000 , 0303510000 , 0304595000 , 0304599...",Kilogram (kg),,,01/01/2021,14/02/2021,,,Closed,28/01/2021,2022900,2022900.0,0.0
1,20814,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...","0302410000 , 0303510000 , 0304595000 , 0304599...",Kilogram (kg),,,16/06/2021,14/02/2022,,,Closed,,2112000,2112000.0,0.0
2,21865,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...","0302410000 , 0303510000 , 0304595000 , 0304599...",Kilogram (kg),,,16/06/2022,14/02/2023,,,Future,,2112000,,0.0
3,20816,50007,ERGA OMNES,0305 –,"0305511010 , 0305511020 , 0305519010 , 0305519...",Kilogram (kg),,,01/01/2021,31/12/2021,,,Closed,30/12/2021,2000,5093.1,0.0
4,21866,50007,ERGA OMNES,0305 –,"0305511010 , 0305511020 , 0305519010 , 0305519...",Kilogram (kg),,,01/01/2022,31/12/2022,,,Critical,28/02/2022,2000,106.696,0.94665
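
`str.strip` only touches the ends of each string, which is why the spaces around the commas are still present in the output above. A minimal sketch on one illustrative value, with `str.replace` as one way to remove interior spaces as well:

```python
import pandas as pd

s = pd.Series(["  0302410000 , 0303510000  "])  # illustrative value with padding

stripped = s.str.strip()                          # leading/trailing whitespace only
no_spaces = s.str.replace(" ", "", regex=False)   # every space, including interior ones

print(repr(stripped[0]))
print(repr(no_spaces[0]))
```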


### 3. String detection

In [28]:
# check whether a string contains a pattern, to create new columns or filter data

# create flags if the quota unit in the uk_trqs dataset contains kg, litre, hectolitre etc.
df = uk_trqs.copy()
df["kg_flag"] = df["quota__measurement_unit"].str.contains("kg") # plain "kg": the brackets in "(kg)" would be read as a regex group and trigger a warning
df["hl_flag"] = df["quota__measurement_unit"].str.contains("hl")
df.head()
#df.loc[df["hl_flag"]==True]
# the default output is True/False.


Unnamed: 0,quota_definition__sid,quota__order_number,quota__geographical_areas,quota__headings,quota__commodities,quota__measurement_unit,quota__monetary_unit,quota_definition__description,quota_definition__validity_start_date,quota_definition__validity_end_date,quota_definition__suspension_periods,quota_definition__blocking_periods,quota_definition__status,quota_definition__last_allocation_date,quota_definition__initial_volume,quota_definition__balance,quota_definition__fill_rate,kg_flag,hl_flag
0,20815,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,01/01/2021,14/02/2021,,,Closed,28/01/2021,2022900,2022900.0,0.0,True,False
1,20814,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2021,14/02/2022,,,Closed,,2112000,2112000.0,0.0,True,False
2,21865,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2022,14/02/2023,,,Future,,2112000,,0.0,True,False
3,20816,50007,ERGA OMNES,0305 –,0305511010|0305511020|0305519010|0305519020|03...,Kilogram (kg),,,01/01/2021,31/12/2021,,,Closed,30/12/2021,2000,5093.1,0.0,True,False
4,21866,50007,ERGA OMNES,0305 –,0305511010|0305511020|0305519010|0305519020|03...,Kilogram (kg),,,01/01/2022,31/12/2022,,,Critical,28/02/2022,2000,106.696,0.94665,True,False
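
`str.contains` treats its pattern as a regular expression by default, so brackets are grouping characters rather than literal text. The sketch below, on illustrative unit labels, shows the difference between matching the literal text "(kg)" and matching "kg" anywhere; `na=False` is used so that missing values come back as False instead of NaN.

```python
import pandas as pd

# Illustrative unit labels, including a missing value
units = pd.Series(["Kilogram (kg)", "Tonne (1,000 kg)", "Hectolitre (hl)", None])

# Literal match on the text "(kg)", brackets included
literal = units.str.contains("(kg)", regex=False, na=False)

# Match "kg" anywhere in the string
anywhere = units.str.contains("kg", na=False)

print(literal.tolist())
print(anywhere.tolist())
```

Which of the two is wanted depends on whether tonnes should count as a "kg" unit.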


In [31]:
# you can easily create custom columns using np.where logic if you don't want the default True/False output
df = uk_trqs.copy()
df["kg_flag"] = np.where(df["quota__measurement_unit"].str.contains("kg"),"Yes","No")
df.head(3)

Unnamed: 0,quota_definition__sid,quota__order_number,quota__geographical_areas,quota__headings,quota__commodities,quota__measurement_unit,quota__monetary_unit,quota_definition__description,quota_definition__validity_start_date,quota_definition__validity_end_date,quota_definition__suspension_periods,quota_definition__blocking_periods,quota_definition__status,quota_definition__last_allocation_date,quota_definition__initial_volume,quota_definition__balance,quota_definition__fill_rate,kg_flag
0,20815,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,01/01/2021,14/02/2021,,,Closed,28/01/2021,2022900,2022900.0,0.0,Yes
1,20814,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2021,14/02/2022,,,Closed,,2112000,2112000.0,0.0,Yes
2,21865,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2022,14/02/2023,,,Future,,2112000,,0.0,Yes
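
`np.where` handles a single yes/no condition. When there are several patterns to label, `np.select` is one possible extension. This is a sketch under assumptions: the unit labels and the group names ("weight", "volume_hl", "other") are illustrative and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative unit labels
units = pd.Series(["Kilogram (kg)", "Tonne (1,000 kg)", "Hectolitre (hl)", "Cubic meter (m3)"])

# One boolean array per pattern; the first condition that is True wins
conditions = [
    units.str.contains("kg", na=False).to_numpy(dtype=bool),
    units.str.contains("hl", na=False).to_numpy(dtype=bool),
]
labels = ["weight", "volume_hl"]  # hypothetical group names

unit_group = np.select(conditions, labels, default="other")
print(unit_group.tolist())
```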


In [36]:
# filter data using a pattern match
# filter df where the quota heading contains "Fish" (the match is case-sensitive)
df = uk_trqs.copy()
df_filt = df.loc[df["quota__headings"].str.contains("Fish")]
df_filt

Unnamed: 0,quota_definition__sid,quota__order_number,quota__geographical_areas,quota__headings,quota__commodities,quota__measurement_unit,quota__monetary_unit,quota_definition__description,quota_definition__validity_start_date,quota_definition__validity_end_date,quota_definition__suspension_periods,quota_definition__blocking_periods,quota_definition__status,quota_definition__last_allocation_date,quota_definition__initial_volume,quota_definition__balance,quota_definition__fill_rate
0,20815,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,01/01/2021,14/02/2021,,,Closed,28/01/2021,2022900,2022900.00000,0.00000
1,20814,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2021,14/02/2022,,,Closed,,2112000,2112000.00000,0.00000
2,21865,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2022,14/02/2023,,,Future,,2112000,,0.00000
5,20817,50008,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302311000|0302321000|0302331000|0302341000|03...,Kilogram (kg),,,01/01/2021,31/12/2021,,,Closed,,29000,29000.00000,0.00000
6,21867,50008,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302311000|0302321000|0302331000|0302341000|03...,Kilogram (kg),,,01/01/2022,31/12/2022,,,Open,,29000,29000.00000,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2395,22105,58304,Canada,0304 – Fish fillets and other fish meat (wheth...,304839019,"Tonne (1,000 kg)",,,01/01/2022,31/12/2022,,,Open,,10,10.00000,0.00000
2396,22106,58304,Canada,0304 – Fish fillets and other fish meat (wheth...,304839019,"Tonne (1,000 kg)",,,01/01/2023,31/12/2023,,,Future,,10,,0.00000
2573,20762,58403,Canada,0304 – Fish fillets and other fish meat (wheth...,0304719000|0304791000,Kilogram (kg),,,01/01/2021,31/12/2021,,,Closed,08/12/2021,791000,186732.00000,0.76393
2574,22103,58403,Canada,0304 – Fish fillets and other fish meat (wheth...,0304719000|0304791000,Kilogram (kg),,,01/01/2022,31/12/2022,,,Open,18/03/2022,791000,726460.00000,0.08159
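
Because the match is case-sensitive by default, a heading that only contains lowercase "fish" is not picked up by `"Fish"`. A minimal sketch with illustrative headings, using the `case=False` argument of `str.contains`:

```python
import pandas as pd

# Illustrative headings
headings = pd.Series([
    "0302 – Fish, fresh or chilled",
    "1604 – Prepared or preserved fish",
    "7202 – Ferro-alloys",
])

case_sensitive = headings.str.contains("Fish")               # misses lowercase "fish"
case_insensitive = headings.str.contains("fish", case=False) # matches either case

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```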


In [33]:
# filter data where "Fish" does not exist in the column:
df = uk_trqs.copy()
df_filt = df.loc[~(df["quota__headings"].str.contains("Fish"))]
df_filt

Unnamed: 0,quota_definition__sid,quota__order_number,quota__geographical_areas,quota__headings,quota__commodities,quota__measurement_unit,quota__monetary_unit,quota_definition__description,quota_definition__validity_start_date,quota_definition__validity_end_date,quota_definition__suspension_periods,quota_definition__blocking_periods,quota_definition__status,quota_definition__last_allocation_date,quota_definition__initial_volume,quota_definition__balance,quota_definition__fill_rate
3,20816,50007,ERGA OMNES,0305 –,0305511010|0305511020|0305519010|0305519020|03...,Kilogram (kg),,,01/01/2021,31/12/2021,,,Closed,30/12/2021,2000,5093.10000,0.00000
4,21866,50007,ERGA OMNES,0305 –,0305511010|0305511020|0305519010|0305519020|03...,Kilogram (kg),,,01/01/2022,31/12/2022,,,Critical,28/02/2022,2000,106.69600,0.94665
9,20819,50013,ERGA OMNES,"4412 – Plywood, veneered panels and similar la...",4412390010|4412419900|4412419910|4412490000|44...,Cubic meter (m3),,,01/01/2021,31/12/2021,,,Exhausted,17/05/2021,167352,0.00000,1.00000
10,21869,50013,ERGA OMNES,"4412 – Plywood, veneered panels and similar la...",4412390010|4412419900|4412419910|4412490000|44...,Cubic meter (m3),,,01/01/2022,31/12/2022,,,Critical,22/03/2022,167352,49782.16200,0.70253
11,20820,50023,ERGA OMNES,7202 – Ferro-alloys,7202491020|7202495011,Kilogram (kg),,,01/01/2021,31/12/2021,,,Closed,,146000,146000.00000,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2843,22025,59281,Canada,"0202 – Meat of bovine animals, frozen|0206 – E...",0202100015|0202100099|0202201015|0202201099|02...,Kilogram (kg),,,01/01/2022,31/12/2022,,,Open,18/02/2022,968000,951585.94000,0.01696
2844,22026,59281,Canada,"0202 – Meat of bovine animals, frozen|0206 – E...",0202100015|0202100099|0202201015|0202201099|02...,Kilogram (kg),,,01/01/2023,31/12/2023,,,Future,,968000,,0.00000
2845,20768,59282,Canada,"0203 – Meat of swine, fresh, chilled or frozen...",0203121100|0203121900|0203191100|0203191300|02...,Kilogram (kg),,,01/01/2021,31/12/2021,,,Closed,,4838000,4838000.00000,0.00000
2846,22033,59282,Canada,"0203 – Meat of swine, fresh, chilled or frozen...",0203121100|0203121900|0203191100|0203191300|02...,Kilogram (kg),,,01/01/2022,31/12/2022,,,Critical,14/03/2022,4838000,5805.00000,0.99880


In [45]:
# filter using str.contains with regex:
df = uk_trqs.copy()
df_filt = df.loc[df["quota__headings"].str.contains("eggs")]
df_filt2 = df.loc[df["quota__headings"].str.contains("Fish")]
print(df_filt.shape,df_filt2.shape)

(117, 17) (103, 17)


The two dataframes contain 117 and 103 rows respectively. Provided no heading matches both patterns, we would expect 220 rows when they are combined.

In [47]:
# using the or operator "|" within str.contains matches rows containing either pattern.
df_filt = df.loc[df["quota__headings"].str.contains("Fish|eggs")]
df_filt

Unnamed: 0,quota_definition__sid,quota__order_number,quota__geographical_areas,quota__headings,quota__commodities,quota__measurement_unit,quota__monetary_unit,quota_definition__description,quota_definition__validity_start_date,quota_definition__validity_end_date,quota_definition__suspension_periods,quota_definition__blocking_periods,quota_definition__status,quota_definition__last_allocation_date,quota_definition__initial_volume,quota_definition__balance,quota_definition__fill_rate
0,20815,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,01/01/2021,14/02/2021,,,Closed,28/01/2021,2022900,2022900.00000,0.00000
1,20814,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2021,14/02/2022,,,Closed,,2112000,2112000.00000,0.00000
2,21865,50006,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302410000|0303510000|0304595000|0304599010|03...,Kilogram (kg),,,16/06/2022,14/02/2023,,,Future,,2112000,,0.00000
5,20817,50008,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302311000|0302321000|0302331000|0302341000|03...,Kilogram (kg),,,01/01/2021,31/12/2021,,,Closed,,29000,29000.00000,0.00000
6,21867,50008,ERGA OMNES,"0302 – Fish, fresh or chilled, excluding fish ...",0302311000|0302321000|0302331000|0302341000|03...,Kilogram (kg),,,01/01/2022,31/12/2022,,,Open,,29000,29000.00000,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2823,22459,59161,Ceuta|European Union|Melilla|San Marino,1604 – Prepared or preserved fish; caviar and ...,1604140000,"Tonne (1,000 kg)",,,01/01/2022,31/12/2022,,,Open,11/03/2022,3000,2999437.15300,0.00000
2824,22460,59161,Ceuta|European Union|Melilla|San Marino,1604 – Prepared or preserved fish; caviar and ...,1604140000,"Tonne (1,000 kg)",,,01/01/2023,31/12/2023,,,Future,,3000,,0.00000
2825,21050,59162,Ceuta|European Union|Melilla|San Marino,1604 – Prepared or preserved fish; caviar and ...,1604207000,"Tonne (1,000 kg)",,,01/01/2021,31/12/2021,,,Closed,,4000,4000.00000,0.00000
2826,22491,59162,Ceuta|European Union|Melilla|San Marino,1604 – Prepared or preserved fish; caviar and ...,1604207000,"Tonne (1,000 kg)",,,01/01/2022,31/12/2022,,,Open,28/02/2022,4000,3999999.86800,0.00000
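
The row count of an "either" filter equals the sum of the two separate counts only when no row matches both patterns. The sketch below checks this on illustrative headings (not taken from the dataset) by combining boolean masks with `&`.

```python
import pandas as pd

# Illustrative headings; the third one matches both patterns
headings = pd.Series([
    "0302 – Fish, fresh or chilled",
    "0407 – Birds' eggs, in shell",
    "0304 – Fish fillets; fish eggs",
    "7202 – Ferro-alloys",
])

has_fish = headings.str.contains("Fish")
has_eggs = headings.str.contains("eggs")
either = headings.str.contains("Fish|eggs")

n_fish, n_eggs = int(has_fish.sum()), int(has_eggs.sum())
n_either = int(either.sum())
n_both = int((has_fish & has_eggs).sum())  # rows counted twice if the counts are simply added

print(n_fish, n_eggs, n_either, n_both)
```

Here `n_either` equals `n_fish + n_eggs - n_both`, so a non-zero `n_both` explains any shortfall against the simple sum.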


### Glossary

#### 1. String manipulation

```python

# combine strings

string3 = string1+string2
df["new_col"] = df["col1"] + "_" + df["col2"] + "end_string"

# convert a numerical column to string with map(str) before combining

df["new_col"] = df["value_col"].map(str) + df["string_col"] 
```

```python
# extract strings

# LEFT
# first 4 characters
df["left"] = df["col1"].str.slice(0,4) # index 0 is the first character
# alternatively:
df["left"] = df["col1"].str[:4]

# RIGHT
# last 4 characters (8-character string)
df["right"] = df["col1"].str.slice(4,8) # start at index 4 (the 5th character) and stop before index 8
df["right"] = df["col1"].str[-4:]

# MID
# middle 4 characters (8-character string)
df["mid"] = df["col1"].str.slice(2,6)


```

```python
# string lengths

df["length"] = df["col1"].str.len() # only if the column holds strings
df["length"] = df["col1"].map(str).apply(len) # use map(str) + apply(len) to check the string length of a numerical column
```

### 2. String removal

#### patterns

```python
# string replace
df["col"].str.replace("pattern","pattern_replace")
df["col"].str.replace(";",",") # etc.

# replace a string in multiple df columns:

cols = ["col1","col2","col3"]

df[cols] = df[cols].replace("string_to_remove","string_to_replace", regex = True)

# whitespace

df["col"].str.strip() # remove whitespace from both ends of each cell
df["col"].str.lstrip() # remove whitespace from the left of each cell
df["col"].str.rstrip() # remove whitespace from the right of each cell

```

### 3. String detection

```python
# detect string:

df["col"].str.contains("pattern")

# create a column from a pattern with custom values instead of True/False.

df["new_col"] = np.where(df["col"].str.contains("pattern"),"pattern_exists","pattern_doesn't_exist")

# filter data based on pattern

df_filt = df.loc[df["col"].str.contains("pattern")]

# doesn't contain pattern:

df_filt = df.loc[~(df["col"].str.contains("pattern"))]

# combine with regex for multiple conditions:

df_filt = df.loc[df["col"].str.contains("pattern1|pattern2")]

```

End. 