<h1>Table of Contents (Clickable in sidebar)<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-revised-research-question" data-toc-modified-id="The-revised-research-question-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The revised research question</a></span></li><li><span><a href="#Functions-Section" data-toc-modified-id="Functions-Section-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Functions Section</a></span></li><li><span><a href="#Combine-CCKP--Timeseries-Files-and-Add-a-CountryYear-Key" data-toc-modified-id="Combine-CCKP--Timeseries-Files-and-Add-a-CountryYear-Key-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Combine CCKP  Timeseries Files and Add a CountryYear Key</a></span></li><li><span><a href="#The-Climate-Change-Knowledge-Portal.-(n.d.).-Retrieved-February-20,-2023,-from-https://climateknowledgeportal.worldbank.org/" data-toc-modified-id="The-Climate-Change-Knowledge-Portal.-(n.d.).-Retrieved-February-20,-2023,-from-https://climateknowledgeportal.worldbank.org/-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The Climate Change Knowledge Portal. (n.d.). Retrieved February 20, 2023, from <a href="https://climateknowledgeportal.worldbank.org/" rel="nofollow" target="_blank">https://climateknowledgeportal.worldbank.org/</a></a></span></li><li><span><a href="#The-Relevant--FAOSTAT-and-CCKP-data-merged-to-main_df" data-toc-modified-id="The-Relevant--FAOSTAT-and-CCKP-data-merged-to-main_df-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>The Relevant  FAOSTAT and CCKP data merged to main_df</a></span></li></ul></div>

# Merging FAO Stock Data with CCKP Weather Data

## The revised research question
How has Ireland's beef sector performed compared to the EU 27 countries since 2000, and can we forecast future prices using this historical data? Additionally, what can we learn from sentiment analysis of the beef industry during this time period? By focusing on data from 2000 onwards, we can better capture the current state of the beef industry and make more relevant predictions about future trends. 

In [1]:
## Libraries, modules and orientation


### Data Manipulation and Analysis
import csv
import pandas as pd
import numpy as np
import fancyimpute
import missingno as msno
from functools import partial, reduce
### Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import matplotlib.image as mpimg

### File System and OS
import glob
import os
# from IPython.display import display, HTML
# ### Date and Time
# import datetime
# import time
# ### Data Presentation
# from tabulate import tabulate
# from IPython.display import HTML, Image, display

# ### Data Types
# from typing import Dict, List, Tuple
import warnings

# Filter out the FutureWarning with the level keyword
# warnings.filterwarnings('ignore', message='Using the level keyword in DataFrame and Series aggregations is deprecated')

# Reset the warning filter to default
# warnings.filterwarnings('default')

## Functions Section

To keep the layout tidy and readable, the functions are collected here rather than scattered throughout the notebook.

In [2]:
def combine_timeseries_files(path: str, subfolder: str, col_name: str) -> pd.DataFrame:
    """Reads in all CSV files in the given subfolder of the directory, renames the first unnamed column to "Year",
    builds a 'Key' from the country name and the year, and combines the results into a single dataframe.
    Args:
        path (str): The relative path of the directory containing the data.
        subfolder (str): The name of the subfolder within the directory to read the CSV files from.
        col_name (str): The name to assign to the country-level measurement column in the resulting dataframe.
    Returns:
        pandas.DataFrame: The 'Key' and `col_name` columns, combined across all CSV files in the specified subfolder.
    Raises:
        Exception: Any error raised while reading or combining the files is reported and re-raised.
    """
    # Built outside the try block so the error message below can always refer to it
    folder_path = os.path.join(path, subfolder)
    try:
        csv_filenames = glob.glob(os.path.join(folder_path, "*.csv"))
        processed_dfs = []
        for filename in csv_filenames:
            # glob already returns the full path; skip the first line so the second line becomes the header
            df = pd.read_csv(filename, on_bad_lines='skip', skiprows=1)
            if df.columns[0].startswith("Unnamed"):
                df = df.rename(columns={df.columns[0]: "Year"})
            # The second column header is the country name; build the key before renaming that column
            df['Key'] = df.columns[1] + df['Year'].astype(str)
            df = df.rename(columns={df.columns[1]: col_name})
            # Keep only the key and the country-level series; regional columns are dropped
            processed_dfs.append(df.filter(['Key', col_name]))
        # Concatenate all dataframes into a single dataframe
        return pd.concat(processed_dfs, ignore_index=True)
    except Exception as e:
        print(f"Error combining CSV files in folder {folder_path}: {e}")
        # Re-raise so the caller never receives None in place of a DataFrame
        raise


##  Combine CCKP  Timeseries Files and Add a CountryYear Key

The combine_timeseries_files function reads every CSV file in the rain or temperature subfolder. As the first two rows of Rain_IRL_df below show, the first (unnamed) column holds the year, and the header of the second column holds the country name, which the function extracts by position. It then forms a Key by concatenating the country and year values. This 'Key' column is used to merge the weather data with our cattle stocks data into a single dataframe. Only the Key and the country-level measurement are kept, so all redundant regional readings are dropped. The resulting dataframe contains the weather data of 27 European countries.

In [3]:
import pandas as pd

# Read the first two rows of data, with headers in the second row
Rain_IRL_df = pd.read_csv('rain/pr_timeseries_annual_cru_1901-2021_IRL.csv', header=1, nrows=2)

# Show the dataframe
Rain_IRL_df.head()

Unnamed: 0.1,Unnamed: 0,Ireland,Carlow,Cavan,Clare,Cork,Donegal,Dublin,Galway,Kerry,...,Monaghan,Munster,Offaly,Roscommon,Sligo,Tipperary,Waterford,Westmeath,Wexford,Wicklow
0,1901,1068.52,983.2,1018.93,1055.24,1153.06,1288.02,942.31,1044.95,1323.52,...,933.47,933.47,851.14,1008.41,1206.83,1030.92,1129.96,901.61,988.85,978.16
1,1902,1016.31,965.42,939.44,1004.48,1119.44,1167.77,934.83,984.17,1277.05,...,864.39,864.39,831.83,943.86,1087.75,1002.13,1101.99,852.81,967.98,966.45
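
The key construction described above can be sketched on a small in-memory sample. This is a minimal illustration rather than the notebook's function: it assumes the file layout suggested by the output above (one leading line that is skipped, then a header row whose first cell is empty), the leading line is a placeholder, and only the first three data columns of the Ireland rainfall file are reproduced.

```python
import io

import pandas as pd

# Assumed CCKP-style layout: one leading line (placeholder text here),
# a header row with an empty first cell, then one row per year.
# The numbers are the two Ireland rows shown in the output above.
sample_csv = io.StringIO(
    "placeholder metadata line\n"
    ",Ireland,Carlow,Cavan\n"
    "1901,1068.52,983.2,1018.93\n"
    "1902,1016.31,965.42,939.44\n"
)

# Skip the leading line so the second line is used as the header
df = pd.read_csv(sample_csv, skiprows=1)

# The empty header cell is read as "Unnamed: 0"; it holds the year
df = df.rename(columns={df.columns[0]: "Year"})

# The second column header is the country name
country = df.columns[1]
df["Key"] = country + df["Year"].astype(str)

# Keep only the key and the country-level series; regional columns are dropped
df = df.rename(columns={country: "Rmmy"})[["Key", "Rmmy"]]
print(df)
```

The regional columns (Carlow, Cavan) disappear in the last step, which mirrors the filtering that the function applies to each file.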




## The Climate Change Knowledge Portal. (n.d.). Retrieved February 20, 2023, from https://climateknowledgeportal.worldbank.org/

To maintain consistency with the FAO data, annual rather than monthly time series aggregates were taken for **precipitation** and **mean-temperature**. Both series come from CRU TS 4.04, a gridded climate dataset produced by the Climatic Research Unit (CRU) at the University of East Anglia in the United Kingdom. The range, variance, and standard deviation of the monthly data may be revisited for insights in the statistics section.
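
If the monthly series are revisited, per-year summaries could be derived along the following lines. This is only a sketch: the monthly values are made up for illustration, and the column names (`Year`, `Month`, `Rmm`) are assumptions rather than the names used in the CCKP files.

```python
import pandas as pd

# Made-up monthly rainfall values for two years (illustration only)
monthly = pd.DataFrame({
    "Year": [2000] * 12 + [2001] * 12,
    "Month": list(range(1, 13)) * 2,
    "Rmm": [float(m) for m in range(1, 13)] + [10.0] * 12,
})

# One row per year: annual total plus the spread of the monthly values
summary = monthly.groupby("Year")["Rmm"].agg(
    total="sum",
    value_range=lambda s: s.max() - s.min(),
    variance="var",   # sample variance (ddof=1), the pandas default
    std_dev="std",    # sample standard deviation (ddof=1)
)
print(summary)
```

For temperature, the annual aggregate would be a mean rather than a sum, so `total="sum"` would be swapped for `"mean"`.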

The file names and folder names of the CCKP data used in this project are tabulated below.

All datasets from the CCKP are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 IGO (CC BY-NC-SA 3.0 IGO). 
Source: CCKP (2023). Time Series datasets. Retrieved from [https://climateknowledgeportal.worldbank.org/download-data].

## The Relevant  FAOSTAT and CCKP data merged to main_df

In [4]:
# Rmmy stands for rain in mm/year
rain_df = combine_timeseries_files('', 'rain', 'Rmmy')
# Save the rain DataFrame to a CSV file
rain_df.to_csv('clean/rain.csv', index=False)
# T°C stands for mean temperature in degrees Celsius
temperature_df = combine_timeseries_files('', 'temperature', 'T\u00b0C')
# Save the temperature DataFrame to a CSV file
temperature_df.to_csv('clean/temperature.csv', index=False)
# Load the cleaned cattle stock CSV file (saved in the 'beef/clean' folder by the first notebook)
main_df = pd.read_csv('clean/stock.csv')
# Cast "Year" to string and append it to "Country" to form the "Key", e.g. Denmark2004
main_df['Key'] = main_df['Country'] + main_df['Year'].astype(str)
# pd.merge defaults to an inner join, so only keys present in both frames are kept
main_df = pd.merge(main_df, rain_df, on='Key')
main_df = pd.merge(main_df, temperature_df, on='Key')
del rain_df
del temperature_df
main_df.sample(4)

Unnamed: 0,Country,Year,Stock,Key,Rmmy,T°C
136,Denmark,2004,1645764,Denmark2004,754.84,8.71
426,Netherlands,2008,3890000,Netherlands2008,843.28,10.53
386,Luxembourg,2012,188473,Luxembourg2012,982.17,9.87
47,Bulgaria,2003,691230,Bulgaria2003,615.62,10.69
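
Because `pd.merge` performs an inner join by default, stock rows whose Key has no match in the weather data are dropped without warning. One way to surface such rows is a left merge with `indicator=True`. The sketch below uses made-up miniature frames (countries "A" and "B" and all values are hypothetical), not the notebook's data.

```python
import pandas as pd

# Hypothetical stock table: country "B" has no weather record
stock = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2000, 2001, 2000],
    "Stock": [100, 110, 5],
})
stock["Key"] = stock["Country"] + stock["Year"].astype(str)

# Hypothetical weather table keyed the same way
rain = pd.DataFrame({
    "Key": ["A2000", "A2001"],
    "Rmmy": [1000.0, 1100.0],
})

# how="left" keeps every stock row; indicator=True adds a "_merge" column;
# validate="one_to_one" raises if a Key is duplicated on either side
check = stock.merge(rain, on="Key", how="left",
                    indicator=True, validate="one_to_one")

# Keys present in the stock table but missing from the weather table
unmatched = check.loc[check["_merge"] == "left_only", "Key"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that the inner join loses no stock rows.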


In [8]:
main_df

Unnamed: 0,Country,Year,Stock,Key,Rmmy,T°C
0,Austria,2000,2152811,Austria2000,1171.79,7.97
1,Austria,2001,2155447,Austria2001,1058.36,7.03
2,Austria,2002,2118454,Austria2002,1212.56,7.73
3,Austria,2003,2066942,Austria2003,891.80,7.35
4,Austria,2004,2052033,Austria2004,1080.30,6.75
...,...,...,...,...,...,...
589,Sweden,2017,1448590,Sweden2017,678.22,3.16
590,Sweden,2018,1435450,Sweden2018,539.63,3.62
591,Sweden,2019,1404670,Sweden2019,682.36,3.34
592,Sweden,2020,1390960,Sweden2020,669.34,4.47


In [9]:
main_df.to_csv('clean/main_stock_cckp.csv', index=False)
print(os.listdir('clean'))

['arch', 'areas.npy', 'AreasEU.csv', 'beluxPivot.csv', 'beneluxPivot.csv', 'beneluxPivot.npy', 'benelux_pivot.csv', 'cattle_stocks.csv', 'kept_countries.txt', 'main.csv', 'main_stock_cckp.csv', 'meadowpasture.csv', 'missing.csv', 'missing.npy', 'nutrient2002', 'nutrient2002.csv', 'orderedstock.csv', 'rain.csv', 'stock.csv', 'stockkey.csv', 'stock_cckp.csv', 'temperature.csv', 'topstock.txt', 'top_10_countries_stock.csv', 'top_countries.txt', 'top_countries_stock.csv']


In [10]:
# read in the CSV file and create a DataFrame
df = pd.read_csv('clean/main_stock_cckp.csv')

# print the first five rows of the DataFrame
df.head()



Unnamed: 0,Country,Year,Stock,Key,Rmmy,T°C
0,Austria,2000,2152811,Austria2000,1171.79,7.97
1,Austria,2001,2155447,Austria2001,1058.36,7.03
2,Austria,2002,2118454,Austria2002,1212.56,7.73
3,Austria,2003,2066942,Austria2003,891.8,7.35
4,Austria,2004,2052033,Austria2004,1080.3,6.75


In [11]:
del df