## 01_py_basics

#### Python repository for general data manipulation techniques and working with pandas. 

## This notebook will cover:

**1. Selecting columns**
* basic selection and drop columns
* select using pattern recognition
* relocate columns

**2. Creating new columns and values**

**3. Filtering data**

**4. Aggregations using groupby**

**5. Joins (merge)**

### 0. Set up ---

Basic setup to load and inspect the data before any exploration and analysis:

The first step is to load the core Python libraries and the data.

In [None]:
# pandas and numpy are used almost universally in Python data work, much like the tidyverse is in R.
import pandas as pd
import numpy as np

# openpyxl is the engine pandas uses to read .xlsx files
!pip install openpyxl

# display floats with 5 decimal places instead of scientific notation
pd.set_option('display.float_format', lambda x: '%.5f' % x)

# load the Excel (.xlsx) files
trade = pd.read_excel("data/trade_data.xlsx")
tariff = pd.read_excel("data/tariff_data.xlsx")

In [None]:
trade.head()

Basic df exploration:

In [None]:
# column names and types:
trade.dtypes

In [None]:
# df summary:
trade.info()

Using `.info()` is very useful: in addition to printing each column's dtype, it gives the "non-null" count for every column, which can be compared with the total number of rows to see how many NA values there are. For example, the suppression notes column contains only NA values.
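
If you want the NA counts directly rather than reading them off `.info()`, something like the sketch below may help. The `toy` dataframe and its values are made up for illustration (they are not the real trade data); `isna()`, `sum()` and `all()` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the trade data; the values are invented for illustration.
toy = pd.DataFrame({
    "year": [2020, 2021, 2022],
    "value_gbp": [100.0, np.nan, 250.0],
    "suppression_notes": [np.nan, np.nan, np.nan],
})

# Number of missing (NA) values in each column.
na_counts = toy.isna().sum()
print(na_counts)

# Names of the columns that contain nothing but NA values.
all_na_cols = toy.columns[toy.isna().all()].tolist()
print(all_na_cols)
```

Columns that turn out to be entirely NA are often candidates to drop before analysis.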

In [None]:
# summarise numerical values
trade.describe()

In [None]:
# simple df dimensions use shape:
trade.shape

**Note:** the year column is loaded as a numeric type. It may be preferable to work with this column as a string rather than a number. The data type can be specified when the data is loaded.

In [None]:
trade2 = pd.read_excel("data/trade_data.xlsx", dtype={'Year': str}) # read year as a string when loading the data
trade3 = pd.read_excel("data/trade_data.xlsx", dtype=str) # read all columns as strings
trade4 = pd.read_excel("data/trade_data.xlsx", dtype={'Value GBP': np.float64}) # read value as a float rather than an integer. Floats allow decimal places
print(trade2.dtypes, trade3.dtypes, trade4.dtypes)

In [None]:
# we want value as a float, so reload the trade data:
trade = pd.read_excel("data/trade_data.xlsx",dtype={'Value GBP': np.float64})
trade.info()
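
If the data is already loaded, an alternative to re-reading the file is to convert the columns afterwards with `astype`. This is only a sketch: the `toy` dataframe below is a made-up stand-in that borrows the `Year` and `Value GBP` column names.

```python
import pandas as pd

# Toy stand-in; the values are invented for illustration.
toy = pd.DataFrame({"Year": [2020, 2021], "Value GBP": [100, 250]})

# Convert several columns at once by passing a {column: dtype} mapping.
converted = toy.astype({"Year": str, "Value GBP": "float64"})
print(converted.dtypes)
```

`astype` returns a new dataframe, so `toy` itself keeps its original integer columns.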

## janitor - clean_names() equivalent. 

Working with cleaner string/column names is highly recommended.

In [None]:
trade.columns = trade.columns.str.lower().str.replace(" ","_")
trade.dtypes

In [None]:
# using a function - helpful if there are multiple dataframes to convert.
def cleanCols(df):
    df.columns = df.columns.str.lower().str.replace(" ","_")
    return df

trade = cleanCols(trade)
trade2 = cleanCols(trade2)
trade3 = cleanCols(trade3)
tariff = cleanCols(tariff)
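
The function above only handles spaces. A slightly more defensive sketch is shown below; `clean_names` is a hypothetical helper written for this notebook (it is not the janitor function, nor part of pandas), and the messy column names are made up. It also works on a copy rather than renaming the original dataframe in place.

```python
import pandas as pd

def clean_names(df):
    """Return a copy of df with lower-case, underscore-separated column names.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    out.columns = (
        out.columns.str.strip()                        # remove leading/trailing spaces
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)   # runs of other characters -> "_"
        .str.strip("_")                                # remove leading/trailing underscores
    )
    return out

# Made-up messy column names.
toy = pd.DataFrame(columns=["Value GBP", " Country Name ", "Commodity-Code (8 digit)"])
cleaned = clean_names(toy)
print(list(cleaned.columns))
```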

## 1. Select columns ----

basic selection:

In [None]:
trade2 = trade[["year","flow","commodity_code","country_name","value_gbp"]]
trade2.dtypes

In [None]:
# use a list of column names:
cols = ["year","flow","commodity_code","country_name","value_gbp"]
trade2 = trade[cols]

In [None]:
trade2.dtypes

drop columns:

In [None]:
# remove columns
trade2 = trade.drop(["year","flow","commodity_code"], axis=1) # axis=1 means the labels refer to columns (recent pandas versions require axis as a keyword)
trade2.dtypes

In [None]:
trade2 = trade.drop(cols, axis=1)
trade2.dtypes

select columns using column indexes (numbers):
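
A minimal sketch of positional selection with `iloc` is given below. The `toy` dataframe is a made-up stand-in that reuses some of the trade column names; positions start at 0.

```python
import pandas as pd

# Toy stand-in for the trade data; the values are invented for illustration.
toy = pd.DataFrame({
    "year": [2020, 2021],
    "flow": ["Imports", "Exports"],
    "commodity_code": [1011000, 1012100],
    "value_gbp": [100.0, 250.0],
})

first_two = toy.iloc[:, :2]     # slice of positions: the first two columns
picked = toy.iloc[:, [0, 3]]    # list of positions: the first and fourth columns
last_col = toy.iloc[:, -1]      # a single position returns a Series, not a dataframe

print(list(first_two.columns), list(picked.columns), last_col.name)
```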

### 1.a select columns using string patterns

The tariff data is a good df for this example, as many of its column names share string patterns which can be used for tidy selections.

In [None]:
tariff.dtypes

In [None]:
prefCol = tariff.columns[tariff.columns.str.contains(pat = 'pref')]
prefCol2 = [col for col in tariff.columns if 'pref' in col]

In [None]:
print(prefCol,prefCol2)

Note the difference between the output types: the first is a pandas `Index`, the second is a plain Python list.

In [None]:
mfnCol = [col for col in tariff.columns if 'mfn' in col]

In [None]:
codeCol = [col for col in tariff.columns if 'commodity' in col]

In [None]:
colNames = [codeCol,mfnCol,prefCol2]
print(colNames)

In [None]:
#tariff2 = tariff[colNames]
#tariff2.dtypes
# for error fix use:
#colNames = np.concatenate((codeCol,prefCol, mfnCol))

**NOTE the error.** Three lists have been combined into a list of lists (a nested list), which can't be used in this way to select columns from a pandas df.

You can instead join the column names into a single flat numpy array using `np.concatenate` and use that to select the data.

In [None]:
prefCol = [col for col in tariff.columns if 'pref' in col]
mfnCol = [col for col in tariff.columns if 'mfn' in col]
codeCol = [col for col in tariff.columns if 'commodity' in col]
colNames2 =  np.concatenate((codeCol,prefCol, mfnCol))
tariff2 = tariff[colNames2]
tariff2.head()
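
Because the three selections here are plain Python lists, they can also be joined with `+` instead of `np.concatenate`. A small sketch follows, using a made-up empty dataframe that borrows a few of the tariff column names:

```python
import pandas as pd

# Toy stand-in with a few tariff-style column names and no rows.
toy = pd.DataFrame(columns=[
    "commodity_code",
    "mfn_applied_duty_rate",
    "preferential_applied_duty_rate_2021",
    "tariff_status_2021",
])

code_cols = [c for c in toy.columns if "commodity" in c]
mfn_cols = [c for c in toy.columns if "mfn" in c]
pref_cols = [c for c in toy.columns if "pref" in c]

nested = [code_cols, mfn_cols, pref_cols]   # list of lists: not a valid column selector
flat = code_cols + mfn_cols + pref_cols     # one flat list of column names

subset = toy[flat]
print(list(subset.columns))
```

Note that `+` only behaves this way for lists; a pandas `Index` (as returned by `str.contains` filtering) would need converting with `.tolist()` first.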

### 1b. select columns with numerical values and combination of string patterns

Select columns whose names contain numerical values, in particular where the numerical values end the column name

i.e. preferential... + 2021, 2022 etc.

```python
tariff2 = tariff[["commodity_code",
                  "preferential_applied_duty_rate_2021",
                  "preferential_applied_duty_rate_2022",
                  "preferential_applied_duty_rate_2023",
                  "preferential_applied_duty_rate_2024"]]
```

If there were even more columns, typing everything out manually would be tedious and time consuming, when the same selection can easily be made using string pattern matching.

In [None]:
col = np.array(tariff.columns[tariff.columns.str.contains('.*[0-9].*', regex=True)]) # select columns whose names contain any digit
col

This doesn't give what is required, as it also picks up the tariff status columns. `str.contains` can be combined multiple times to narrow the selection:

In [None]:
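# str.endswith() matches a literal suffix, not a regex, so col.endswith('.*[0-9].*') returns an empty list.
# To match digits at the end of the name, anchor the regex with `$` instead:
col_list = list(tariff.columns[tariff.columns.str.contains('[0-9]$', regex=True)])
col_list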

In [None]:
# an alternative quick way is to match a simple pattern within the numerical part of the name; however, this also extracts the unwanted tariff status columns:
cl = tariff.columns[tariff.columns.str.contains(pat = '20')]
cl

In [None]:
col = np.array(tariff.columns[tariff.columns.str.contains('20',regex=True)]) # same selection (names containing '20'), returned as a numpy array
col

In [None]:
# example using startswith and endswith (only names ending in "2", e.g. ..._2022, are matched):
col_list = [col for col in tariff.columns if col.startswith('pref') and col.endswith("2")]
col_list

In [None]:
c = np.array(tariff.columns[tariff.columns.str.contains(pat = "pref") & tariff.columns.str.contains('20',regex=True)])
c

In [None]:
# need to combine commodity_code with c in an np.array
cd = ["commodity_code"]
c2 = np.concatenate((cd,c))
tariff[c2].head()

In [None]:
# full solution:
c = np.array(tariff.columns[tariff.columns.str.contains(pat = "pref") & tariff.columns.str.contains('20',regex=True)])
cd = ["commodity_code"]
c2 = np.concatenate((cd,c))
tariff2 = tariff[c2]
tariff2.head()
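
Another possible route is `DataFrame.filter(regex=...)`, which keeps every column whose name matches a regular expression. The sketch below uses a made-up empty dataframe with a few of the tariff column names; the pattern itself is only one way to express "commodity_code, plus anything starting with pref and ending in a four-digit year".

```python
import pandas as pd

# Toy stand-in with a few tariff-style column names and no rows.
toy = pd.DataFrame(columns=[
    "commodity_code",
    "preferential_applied_duty_rate_2021",
    "preferential_applied_duty_rate_2022",
    "preferential_applied_duty_rate_excluded",
    "tariff_status_2021",
])

# ^commodity_code$    : exactly the commodity_code column
# ^pref.*[0-9]{4}$    : starts with "pref" and ends with four digits
subset = toy.filter(regex=r"^commodity_code$|^pref.*[0-9]{4}$")
print(list(subset.columns))
```

`filter` keeps the columns in their original order, so the result does not depend on the order of the alternatives in the pattern.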

****

### 1c. Relocate columns:

I am currently unaware of a single-line function which achieves this like `relocate` in the tidyverse. However, it only takes a few lines once the columns to be relocated within the df have been specified.

Example: trade data set - move flow column next to trade value

In [None]:
trade2 = trade.copy()

In [None]:
# take the column to be moved (as a Series):
col = trade2["flow"]
# drop the column from the df
trade2.drop(labels=["flow"], axis = 1, inplace = True)
# insert the column back in and select its position. value_gbp is column 5 (4 when the index starts at 0).
trade2.insert(4,"flow",col)
trade2.head()

In [None]:
# Multiple columns can easily be moved using the same method:
cols = ["country_name","country_code"]
col1 = trade2["country_name"]
col2 = trade2["country_code"]
trade2.drop(cols, axis = 1, inplace = True)
# insert the columns back in at position 1. country_code is inserted last, so it ends up before country_name.
trade2.insert(1,"country_name",col1)
trade2.insert(1,"country_code",col2)
trade2.head()
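
The drop-then-insert steps can be shortened a little with `pop`, and the target position can be looked up by name rather than typed as a number. This is only a sketch on a made-up `toy` dataframe that reuses some of the trade column names:

```python
import pandas as pd

# Toy stand-in for the trade data; the values are invented for illustration.
toy = pd.DataFrame({
    "year": [2020, 2021],
    "flow": ["Imports", "Exports"],
    "country_name": ["France", "Japan"],
    "value_gbp": [100.0, 250.0],
})

# pop() removes the column from the df and returns it as a Series in one step.
flow = toy.pop("flow")

# Look up the position of value_gbp by name instead of hard-coding an integer.
target = toy.columns.get_loc("value_gbp")

# Inserting at that position places flow immediately before value_gbp.
toy.insert(target, "flow", flow)
print(list(toy.columns))
```

Looking the position up after the `pop` matters, because removing a column shifts the positions of the columns to its right.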

If you want to move a larger selection of columns, the above method isn't the most helpful. You can more easily specify the selection by naming the columns in the order you want (similar to `select` in the tidyverse):

In [None]:
trade2 = trade[["year","country_code","country_name","flow","commodity_code","value_gbp","suppression_notes"]]
trade2.head()

However, if you have a lot more columns, this is also not particularly helpful if you want to reduce the time spent writing out column names.

In [291]:
#example df:
    
prefCol = [col for col in tariff.columns if 'pref' in col]
mfnCol = [col for col in tariff.columns if 'mfn' in col]
codeCol = [col for col in tariff.columns if 'commodity' in col]
tariffCol = [col for col in tariff.columns if 'tariff' in col]
colNames2 =  np.concatenate((codeCol,prefCol, mfnCol,tariffCol))
tariff2 = tariff[colNames2]
tariff2.dtypes

commodity_heading                           object
commodity_code                               int64
commodity_code_description                  object
preferential_applied_duty_rate_2021         object
preferential_applied_duty_rate_2022         object
preferential_applied_duty_rate_2023         object
preferential_applied_duty_rate_2024         object
preferential_applied_duty_rate_excluded     object
mfn_applied_duty_rate                       object
mfn_applied_rate_ukgt                       object
in_quota_tariff_line_code                  float64
tariff_status_2021                          object
tariff_status_final_2021                    object
tariff_status_2022                          object
tariff_status_final_2022                    object
tariff_status_2023                          object
tariff_status_final_2023                    object
tariff_status_2024                          object
tariff_status_final_2024                    object
dtype: object

There are a lot of recognisable string patterns within this dataframe. However, I am approaching this as if there weren't and we wanted to relocate multiple columns to selected positions within a df.

In [288]:
tariff2 = tariff.copy()

In [313]:
# relocate MFN columns to the front of the data frame (this method is useful when moving numerous columns to a new position)
cols_to_move = ["mfn_applied_duty_rate","mfn_applied_rate_ukgt"]
tariff3 = tariff2[cols_to_move + [ col for col in tariff2.columns if col not in cols_to_move ]]
tariff3.dtypes

mfn_applied_duty_rate                       object
mfn_applied_rate_ukgt                       object
commodity_heading                           object
commodity_code                               int64
commodity_code_description                  object
preferential_applied_duty_rate_2021         object
preferential_applied_duty_rate_2022         object
preferential_applied_duty_rate_2023         object
preferential_applied_duty_rate_2024         object
preferential_applied_duty_rate_excluded     object
in_quota_tariff_line_code                  float64
tariff_status_2021                          object
tariff_status_final_2021                    object
tariff_status_2022                          object
tariff_status_final_2022                    object
tariff_status_2023                          object
tariff_status_final_2023                    object
tariff_status_2024                          object
tariff_status_final_2024                    object
dtype: object

In [312]:
tariff2.dtypes

commodity_heading                           object
commodity_code                               int64
commodity_code_description                  object
preferential_applied_duty_rate_2021         object
preferential_applied_duty_rate_2022         object
preferential_applied_duty_rate_2023         object
preferential_applied_duty_rate_2024         object
preferential_applied_duty_rate_excluded     object
mfn_applied_duty_rate                       object
mfn_applied_rate_ukgt                       object
in_quota_tariff_line_code                  float64
tariff_status_2021                          object
tariff_status_final_2021                    object
tariff_status_2022                          object
tariff_status_final_2022                    object
tariff_status_2023                          object
tariff_status_final_2023                    object
tariff_status_2024                          object
tariff_status_final_2024                    object
dtype: object

In [292]:
# move pref columns to end of df
cols_to_move = [col for col in tariff.columns if 'pref' in col]
tariff3 = tariff2[[ col for col in tariff2.columns if col not in cols_to_move ]+cols_to_move]
tariff3.dtypes

commodity_heading                           object
commodity_code                               int64
commodity_code_description                  object
mfn_applied_duty_rate                       object
mfn_applied_rate_ukgt                       object
in_quota_tariff_line_code                  float64
tariff_status_2021                          object
tariff_status_final_2021                    object
tariff_status_2022                          object
tariff_status_final_2022                    object
tariff_status_2023                          object
tariff_status_final_2023                    object
tariff_status_2024                          object
tariff_status_final_2024                    object
preferential_applied_duty_rate_2021         object
preferential_applied_duty_rate_2022         object
preferential_applied_duty_rate_2023         object
preferential_applied_duty_rate_2024         object
preferential_applied_duty_rate_excluded     object
dtype: object

### **Still looking for a solution to move selected columns to an arbitrary position in a df, i.e. relocate pref columns after "in_quota_tariff_line_code", for example**
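
One possible approach, offered as a sketch rather than a definitive answer, is to build the new column order as a list and split it at the anchor column. `relocate_after` below is a hypothetical helper written for this notebook (it is not a pandas function), and the empty `toy` dataframe just borrows a few of the tariff column names.

```python
import pandas as pd

def relocate_after(df, cols_to_move, after):
    """Return df with cols_to_move placed directly after the column `after`.

    Hypothetical helper for illustration; not part of pandas.
    """
    if after in cols_to_move:
        raise ValueError("the anchor column cannot be one of the moved columns")
    # Columns that stay put, in their existing order.
    remaining = [c for c in df.columns if c not in cols_to_move]
    # Split the remaining columns just after the anchor and slot the moved ones in.
    pos = remaining.index(after) + 1
    new_order = remaining[:pos] + list(cols_to_move) + remaining[pos:]
    return df[new_order]

# Toy stand-in with a few tariff-style column names and no rows.
toy = pd.DataFrame(columns=[
    "commodity_code",
    "preferential_applied_duty_rate_2021",
    "preferential_applied_duty_rate_2022",
    "mfn_applied_duty_rate",
    "in_quota_tariff_line_code",
    "tariff_status_2021",
])

pref_cols = [c for c in toy.columns if "pref" in c]
moved = relocate_after(toy, pref_cols, after="in_quota_tariff_line_code")
print(list(moved.columns))
```

The same list-splitting idea covers the "move to front" and "move to end" cases above, since those are just splits at position 0 and at the end of the list.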

****

## 2. Create new columns