<h1>Table of Contents (Clickable in sidebar)<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Research-Question" data-toc-modified-id="Research-Question-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Research Question</a></span></li><li><span><a href="#Libraries-and-modules" data-toc-modified-id="Libraries-and-modules-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Libraries and modules</a></span></li><li><span><a href="#Housekeeping" data-toc-modified-id="Housekeeping-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Housekeeping</a></span></li><li><span><a href="#Apply-data-exploration-functions-to-livestock-data" data-toc-modified-id="Apply-data-exploration-functions-to-livestock-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Apply data exploration functions to livestock data</a></span></li><li><span><a href="#The-df.nunique()-method-reveals--2-issues-in-categorical-data." data-toc-modified-id="The-df.nunique()-method-reveals--2-issues-in-categorical-data.-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>The df.nunique() method reveals  2 issues in categorical data.</a></span></li><li><span><a href="#Belgium-Luxembourg-started-reporting-beef-stocks-independently-in-2000" data-toc-modified-id="Belgium-Luxembourg-started-reporting-beef-stocks-independently-in-2000-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Belgium-Luxembourg started reporting beef stocks independently in 2000</a></span></li><li><span><a href="#Investigate-historical--cattle-stock-reporting-in-the--BELUX-union-and-Belgium--Luxembourg" data-toc-modified-id="Investigate-historical--cattle-stock-reporting-in-the--BELUX-union-and-Belgium--Luxembourg-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Investigate historical  cattle stock reporting in the  BELUX union and Belgium  Luxembourg</a></span></li><li><span><a href="#A-revised-research-question" data-toc-modified-id="A-revised-research-question-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>A revised research question</a></span></li><li><span><a 
href="#Returning-to-the-master-beef-stock-reporting-file" data-toc-modified-id="Returning-to-the-master-beef-stock-reporting-file-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Returning to the master beef stock reporting file</a></span></li><li><span><a href="#Filter--strictly-to--1999<year<2022" data-toc-modified-id="Filter--strictly-to--1999<year<2022-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Filter  strictly to  1999&lt;year&lt;2022</a></span></li><li><span><a href="#The-df.nunique()-method-to-identify-invariant-data" data-toc-modified-id="The-df.nunique()-method-to-identify-invariant-data-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>The df.nunique() method to identify invariant data</a></span></li><li><span><a href="#Eliminate-the-Belgium-Luxembourg" data-toc-modified-id="Eliminate-the-Belgium-Luxembourg-12"><span class="toc-item-num">12&nbsp;&nbsp;</span>Eliminate the Belgium-Luxembourg</a></span></li><li><span><a href="#Lack-of-estimated-and-unofficial-reporting" data-toc-modified-id="Lack-of-estimated-and-unofficial-reporting-13"><span class="toc-item-num">13&nbsp;&nbsp;</span>Lack of estimated and unofficial reporting</a></span></li><li><span><a href="#The-devil-is-in-the-dtypes!" 
data-toc-modified-id="The-devil-is-in-the-dtypes!-14"><span class="toc-item-num">14&nbsp;&nbsp;</span>The devil is in the dtypes!</a></span></li><li><span><a href="#The-absence-of-missing-Stock-values-enables-us-to-only-now-cast-them-as-32-bit-integers" data-toc-modified-id="The-absence-of-missing-Stock-values-enables-us-to-only-now-cast-them-as-32-bit-integers-15"><span class="toc-item-num">15&nbsp;&nbsp;</span>The absence of missing Stock values enables us to only now cast them as 32-bit integers</a></span></li><li><span><a href="#This-Jupyter-Notebook-has-focussed-on-cleaning-our-target-variable" data-toc-modified-id="This-Jupyter-Notebook-has-focussed-on-cleaning-our-target-variable-16"><span class="toc-item-num">16&nbsp;&nbsp;</span>This Jupyter Notebook has focussed on cleaning our target variable</a></span></li><li><span><a href="#We-finished-up-with-..." data-toc-modified-id="We-finished-up-with-...-17"><span class="toc-item-num">17&nbsp;&nbsp;</span>We finished up with ...</a></span></li><li><span><a href="#Open-notebook-in-root-directory-begining-02_...." data-toc-modified-id="Open-notebook-in-root-directory-begining-02_....-18"><span class="toc-item-num">18&nbsp;&nbsp;</span>Open notebook in root directory begining 02_....</a></span></li></ul></div>

# Exploratory Data Analysis Irish Beef
## Research Question
How has Ireland's beef sector performed compared to the EU 27 countries from 1961 to 2021, and can we forecast future prices using historical data? Additionally, what can we learn from sentiment analysis of the beef industry during this time period?
## Libraries and modules

In [1]:

### Data Manipulation and Analysis
import csv
import pandas as pd
import numpy as np
import fancyimpute
import missingno as msno
from functools import partial, reduce

### Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import matplotlib.image as mpimg

### Statistical Analysis
from scipy.stats import ks_2samp, shapiro

### Machine Learning
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR

### Text Processing
import html
import re

### Country Information
from countryinfo import CountryInfo
import pycountry
from countrygroups import EUROPEAN_UNION

### File System and OS
import glob
import os

### Date and Time
import datetime
import time

### Data Presentation
from tabulate import tabulate
from IPython.display import HTML, Image, display

### Data Types
from typing import Dict, List, Tuple





## Housekeeping   

In [2]:
print(os.getcwd()) # working directory.
print(os.listdir('.')) #List current directory
print(os.listdir('data')) # Our source files from FAOSTAT are in 'data' folder


C:\Users\ronan\beef
['.git', '.ipynb_checkpoints', '01_eda_beef.ipynb', '02_ml_beef.ipynb', '02_multivariate.ipynb', 'arch', 'beef-main.zip', 'beef.pdf', 'clean', 'css', 'data', 'ignore', 'images', 'rain', 'README.md', 'temperature', 'Untitled Folder', 'Untitled.ipynb']
['alive.csv', 'stocks.csv', 'temperature.csv', 'temperature_change.csv', 'temperature_sd.csv']


In [3]:
df = pd.read_csv('data/stocks.csv')  # Load the cattle stock CSV file into the pandas DataFrame df
print(df.shape)  # Inspect the dimensions of the dataset (rows, columns): (1708, 14)

(1708, 14)


## Apply data exploration functions to livestock data

In [4]:

df.head()  # Returns the first five rows, which suggest that many fields may be invariant

Unnamed: 0,Domain Code,Domain,Area Code (M49),Area,Element Code,Element,Item Code (CPC),Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,QCL,Crops and livestock products,40,Austria,5111,Stocks,2111,Cattle,1961,1961,Head,2386761.0,A,Official figure
1,QCL,Crops and livestock products,40,Austria,5111,Stocks,2111,Cattle,1962,1962,Head,2456557.0,A,Official figure
2,QCL,Crops and livestock products,40,Austria,5111,Stocks,2111,Cattle,1963,1963,Head,2437123.0,A,Official figure
3,QCL,Crops and livestock products,40,Austria,5111,Stocks,2111,Cattle,1964,1964,Head,2310667.0,A,Official figure
4,QCL,Crops and livestock products,40,Austria,5111,Stocks,2111,Cattle,1965,1965,Head,2350269.0,A,Official figure


In [5]:
df.dtypes

Domain Code          object
Domain               object
Area Code (M49)       int64
Area                 object
Element Code          int64
Element              object
Item Code (CPC)       int64
Item                 object
Year Code             int64
Year                  int64
Unit                 object
Value               float64
Flag                 object
Flag Description     object
dtype: object

In [6]:
# Count the unique values in each column
df.nunique()  # We have 28 countries/regions reporting and 61 years of data

Domain Code            1
Domain                 1
Area Code (M49)       28
Area                  28
Element Code           1
Element                1
Item Code (CPC)        1
Item                   1
Year Code             61
Year                  61
Unit                   1
Value               1365
Flag                   3
Flag Description       3
dtype: int64

## The df.nunique() method reveals  2 issues in categorical data.

1. Flag Description
- Alongside official figures, some rows are flagged as "Estimated value" or "Unofficial figure", and some rows have no flag at all.
- In the flag table below, 1160 and 1182 are the row indexes at which those two flags first appear, not counts of flagged rows.
- In the context of this analysis, estimated and unofficial figures might have a significant impact on the accuracy and reliability of our conclusions.
2. The European Union (EU) is a political and economic union of 27 member states, and yet we have 28 unique values of Area. This is more pressing than the flag descriptions and is investigated first.
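To judge how much the estimated and unofficial flags matter, it helps to count the rows behind each flag. The sketch below shows one way to do that with `value_counts`; the `toy` frame and its values are illustrative stand-ins, not the FAOSTAT data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stocks DataFrame; the rows are invented for illustration.
toy = pd.DataFrame({
    "Flag": ["A", "A", "E", "T", np.nan, "A"],
    "Flag Description": [
        "Official figure", "Official figure", "Estimated value",
        "Unofficial figure", np.nan, "Official figure",
    ],
})

# dropna=False keeps rows with a missing flag as their own category.
flag_counts = toy["Flag"].value_counts(dropna=False)
print(flag_counts)
```

Applied to the real data, the same call on `df['Flag']` would give the number of rows per flag, including the rows with no flag.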


In [7]:
flag_df = df[['Flag', 'Flag Description']].drop_duplicates()
flag_df # Besides official figures there are estimated, unofficial and missing flags. We will hold back on dealing with these for now

Unnamed: 0,Flag,Flag Description
0,A,Official figure
61,,
1160,E,Estimated value
1182,T,Unofficial figure


In [8]:
missing_values = df.isnull().sum()
print(missing_values)  # Unit, Value, Flag and Flag Description are each missing in 319 rows

Domain Code           0
Domain                0
Area Code (M49)       0
Area                  0
Element Code          0
Element               0
Item Code (CPC)       0
Item                  0
Year Code             0
Year                  0
Unit                319
Value               319
Flag                319
Flag Description    319
dtype: int64


## Belgium-Luxembourg started reporting beef stocks independently in 2000
As part of the Belgium-Luxembourg Economic Union (BLEU), Belgium and Luxembourg reported their cattle stock numbers as a single entity until 1999. From 2000 onwards they reported independently.
We zoom in on the df around 2000 and compare the three reporting areas side by side, with the Netherlands acting as an exemplar of the rest of the countries.

In [9]:


# Collect the distinct values of the 'Area' column
unique_areas = df['Area'].unique()

# Convert to a DataFrame and take a look to discover the BELUX anomaly
areasEU_df = pd.DataFrame({'Area': unique_areas})
print(areasEU_df)


                  Area
0              Austria
1              Belgium
2   Belgium-Luxembourg
3             Bulgaria
4              Croatia
5               Cyprus
6              Czechia
7              Denmark
8              Estonia
9              Finland
10              France
11             Germany
12              Greece
13             Hungary
14             Ireland
15               Italy
16              Latvia
17           Lithuania
18          Luxembourg
19               Malta
20         Netherlands
21              Poland
22            Portugal
23             Romania
24            Slovakia
25            Slovenia
26               Spain
27              Sweden


In [10]:
# Write to CSV file
areasEU_df.to_csv('clean/AreasEU.csv', index=False)
areasEU_df.head() 

Unnamed: 0,Area
0,Austria
1,Belgium
2,Belgium-Luxembourg
3,Bulgaria
4,Croatia


In [11]:
print(os.listdir('clean')) # Not saying these are modelling ready but not source so into clean folder!

['Areas.csv', 'AreasEU.csv', 'benelux.csv', 'benelux_pivot.csv', 'cattle_stock.csv', 'country.csv', 'master_data.csv', 'stocks.csv']


In [12]:

benelux_df = df[df['Area'].isin(['Belgium-Luxembourg', 'Belgium', 'Luxembourg', 'Netherlands'])]
benelux_df.sample(5) 

Unnamed: 0,Domain Code,Domain,Area Code (M49),Area,Element Code,Element,Item Code (CPC),Item,Year Code,Year,Unit,Value,Flag,Flag Description
1115,QCL,Crops and livestock products,442,Luxembourg,5111,Stocks,2111,Cattle,1978,1978,,,,
1264,QCL,Crops and livestock products,528,Netherlands,5111,Stocks,2111,Cattle,2005,2005,Head,3799000.0,A,Official figure
1158,QCL,Crops and livestock products,442,Luxembourg,5111,Stocks,2111,Cattle,2021,2021,Head,187200.0,A,Official figure
122,QCL,Crops and livestock products,58,Belgium-Luxembourg,5111,Stocks,2111,Cattle,1961,1961,Head,2684120.0,A,Official figure
164,QCL,Crops and livestock products,58,Belgium-Luxembourg,5111,Stocks,2111,Cattle,2003,2003,,,,


In [13]:
# Reshape the DataFrame with pivot()
benelux_pivot_df = benelux_df.pivot(index='Year', columns='Area', values='Value')

In [14]:
# Rename the columns of the pivot table
benelux_pivot_df.columns = ['{}_stock'.format(col.replace(' ', '_')) for col in benelux_pivot_df.columns]

In [15]:
# Display the resulting pivot table
benelux_pivot_df

Unnamed: 0_level_0,Belgium_stock,Belgium-Luxembourg_stock,Luxembourg_stock,Netherlands_stock
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1961,,2684120.0,,3622588.0
1962,,2798130.0,,3816942.0
1963,,2847478.0,,3695185.0
1964,,2641407.0,,3567379.0
1965,,2685510.0,,3750629.0
...,...,...,...,...
2017,2385988.0,,202281.0,4030000.0
2018,2398090.0,,194390.0,3690000.0
2019,2373100.0,,192100.0,3721000.0
2020,2335440.0,,190690.0,3691000.0


In [16]:
# Write to CSV file, keeping the Year index so each row stays labelled with its year
benelux_pivot_df.to_csv('clean/benelux_pivot.csv')

In [17]:

print(os.listdir('clean'))


['Areas.csv', 'AreasEU.csv', 'benelux.csv', 'benelux_pivot.csv', 'cattle_stock.csv', 'country.csv', 'master_data.csv', 'stocks.csv']


In [18]:
benelux_pivot = benelux_pivot_df.loc[1997:2002]  # Year is an integer index, so slice with integer labels
print(benelux_pivot.head(6))


      Belgium_stock  Belgium-Luxembourg_stock  Luxembourg_stock  \
Year                                                              
1997            NaN                 3280000.0               NaN   
1998            NaN                 3184000.0               NaN   
1999            NaN                 3395000.0               NaN   
2000      3041560.0                       NaN          205072.0   
2001      3037760.0                       NaN          205193.0   
2002      2891260.0                       NaN          197257.0   

      Netherlands_stock  
Year                     
1997          4411000.0  
1998          4283000.0  
1999          4206000.0  
2000          4070000.0  
2001          4047000.0  
2002          3858000.0  


## Investigate historical  cattle stock reporting in the  BELUX union and Belgium  Luxembourg
In pandas, the unique() method was used above to return an array of unique values. We discovered that there were 28 unique entries for "Area" despite there being only 27 EU countries. Further investigation revealed that Belgium and Luxembourg reported cattle stocks as one region, "Belgium-Luxembourg". The Netherlands, their partner in the Benelux union, was included for comparison. We filtered the DataFrame to these areas and used the pivot() method to reshape it, so that reported and missing cattle stock values can be compared year by year. The Netherlands reported cattle stock independently for all 61 years of our research interval, while Belgium and Luxembourg reported jointly as Belgium-Luxembourg from 1961 to 1999 and individually from 2000 onwards.
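The reporting periods can also be read off programmatically rather than by eye. The following is a minimal sketch that assumes a pivoted table shaped like the one above (Year as the index, one column per area); `toy_pivot` is a four-year excerpt built from the values printed earlier, and `coverage` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Four-year excerpt of the pivoted table, using the values printed above.
toy_pivot = pd.DataFrame(
    {
        "Belgium_stock": [np.nan, np.nan, 3041560.0, 3037760.0],
        "Belgium-Luxembourg_stock": [3184000.0, 3395000.0, np.nan, np.nan],
        "Luxembourg_stock": [np.nan, np.nan, 205072.0, 205193.0],
        "Netherlands_stock": [4283000.0, 4206000.0, 4070000.0, 4047000.0],
    },
    index=pd.Index([1998, 1999, 2000, 2001], name="Year"),
)

# For each area, find the first and last year with a reported (non-missing) value.
coverage = pd.DataFrame({
    "first_year": toy_pivot.apply(pd.Series.first_valid_index),
    "last_year": toy_pivot.apply(pd.Series.last_valid_index),
})
print(coverage)
```

Run against the full pivot table, this would summarise in one small frame when each area starts and stops reporting.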
## A revised research question
How has Ireland's beef sector performed compared to the EU 27 countries since 2000, and can we forecast future prices using this historical data? Additionally, what can we learn from sentiment analysis of the beef industry during this time period? By focusing on data from 2000 onwards, we can better capture the current state of the beef industry and make more relevant predictions about future trends. At the end of 1999, Belgium-Luxembourg ceased to be reported as a single entity in the cattle stock data, and Belgium and Luxembourg began to be reported individually. This change may reflect the economic development of the individual countries within the union. It gives us a data reason to restrict the research to the 21st century, and it also provides an economic reason for refining the research question.

## Returning to the master beef stock reporting file
We load the source cattle stock CSV file into pandas again, so that nothing from the Benelux investigation interferes with the cleaning steps.


In [19]:
df = pd.read_csv('data/stocks.csv')  # Reload the source file

In [20]:
df.shape

(1708, 14)

In [21]:
missing_df=df.isnull().sum()

In [22]:
# Count the number of NaN values in each column
print(df.isna().sum())

Domain Code           0
Domain                0
Area Code (M49)       0
Area                  0
Element Code          0
Element               0
Item Code (CPC)       0
Item                  0
Year Code             0
Year                  0
Unit                319
Value               319
Flag                319
Flag Description    319
dtype: int64


## Filter  strictly to  1999<year<2022
Apart from solving the BELUX data cleaning problem, recasting the research question to 2000-2021 acknowledges that farming in the 20th century was significantly different from farming now. Dropping the older data makes the data we keep more relevant for modelling current multivariate trends and predicting future trends, along with suggested mitigations.
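The year and area filters can be combined into a single boolean mask, followed by a check that every remaining area covers the same number of years. This is a sketch under stated assumptions: `toy` is a small excerpt using values shown earlier, and the names `mask`, `recent` and `years_per_area` are introduced here for illustration.

```python
import pandas as pd

# Small excerpt in the long format of the source file (values from the tables above).
toy = pd.DataFrame({
    "Area": ["Belgium-Luxembourg", "Belgium-Luxembourg", "Belgium", "Belgium",
             "Netherlands", "Netherlands", "Netherlands"],
    "Year": [1999, 2000, 2000, 2001, 1999, 2000, 2001],
    "Value": [3395000.0, None, 3041560.0, 3037760.0,
              4206000.0, 4070000.0, 4047000.0],
})

# Keep rows from 2000 onwards and exclude the combined reporting area.
mask = (toy["Year"] >= 2000) & (toy["Area"] != "Belgium-Luxembourg")
recent = toy.loc[mask].copy()  # .copy() gives an independent frame for later edits

# A balanced panel has the same number of distinct years for every area.
years_per_area = recent.groupby("Area")["Year"].nunique()
print(years_per_area)
```

On the real data, a balanced result would show 22 years for each of the 27 countries.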

In [23]:
# Keep only rows with Year >= 2000, reusing the familiar name df
df = df[df['Year'] >= 2000]

In [24]:
df = df.loc[df['Area'] != 'Belgium-Luxembourg'] #Exclude 'Belgium-Luxembourg' from the 'Area' column in df.

## The df.nunique() method to identify invariant data
The df.nunique() method returns the number of unique values in each column of our beef stocks data. This is useful for identifying columns that have only one value, like **Domain**, which is always **Crops and livestock products**. All of this invariant data is redundant and can be dropped to reduce the size of the DataFrame and simplify later analysis. Note that the Element column has **Stocks** as every value. This is useful in that it gives us the new name for the Value field, but the column itself is of no further use and gets the chop. Because we are keeping fewer than half of the fields, it is easier to select the small subset of columns we want than to list the ones we drop.
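As an illustration of the idea, the invariant columns can be listed and dropped directly from the `nunique()` result. The sketch below uses a toy frame with a few values from the tables above; `invariant_cols` and `trimmed` are names introduced here, and this is one possible approach rather than the selection used in the cells that follow.

```python
import pandas as pd

# Toy excerpt with two invariant columns (Domain and Element).
toy = pd.DataFrame({
    "Domain": ["Crops and livestock products"] * 3,
    "Area": ["Austria", "Austria", "Sweden"],
    "Element": ["Stocks"] * 3,
    "Year": [2000, 2001, 2021],
    "Value": [2152811.0, 2155447.0, 1389890.0],
})

# Columns with exactly one distinct value carry no information for modelling.
n_unique = toy.nunique()
invariant_cols = n_unique[n_unique == 1].index.tolist()

trimmed = toy.drop(columns=invariant_cols)
print(invariant_cols)
print(trimmed.columns.tolist())
```

Note that `nunique()` ignores missing values by default, so a column that is entirely NaN reports 0 rather than 1 and would need separate handling.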


In [25]:
# Count the unique values in each column after filtering
df.nunique()  # We now have 27 countries reporting and 22 years of data

Domain Code           1
Domain                1
Area Code (M49)      27
Area                 27
Element Code          1
Element               1
Item Code (CPC)       1
Item                  1
Year Code            22
Year                 22
Unit                  1
Value               592
Flag                  1
Flag Description      1
dtype: int64

## Eliminate the Belgium-Luxembourg

By restricting the years and fixing our **BELUX** problem, the estimated and unofficial stock reporting flags have also disappeared, so we have killed two birds with one stone, so to speak! We will also do some relabelling and reduce dimensionality based on invariant data.


In [26]:
df = df.rename(columns={'Value': 'Stocks', 'Area': 'Country'}) #Renamed 'Value' column to 'Stocks' and Area column to Country.

In [27]:
df.head() # Check that  filtering and renaming worked correctly

Unnamed: 0,Domain Code,Domain,Area Code (M49),Country,Element Code,Element,Item Code (CPC),Item,Year Code,Year,Unit,Stocks,Flag,Flag Description
39,QCL,Crops and livestock products,40,Austria,5111,Stocks,2111,Cattle,2000,2000,Head,2152811.0,A,Official figure
40,QCL,Crops and livestock products,40,Austria,5111,Stocks,2111,Cattle,2001,2001,Head,2155447.0,A,Official figure
41,QCL,Crops and livestock products,40,Austria,5111,Stocks,2111,Cattle,2002,2002,Head,2118454.0,A,Official figure
42,QCL,Crops and livestock products,40,Austria,5111,Stocks,2111,Cattle,2003,2003,Head,2066942.0,A,Official figure
43,QCL,Crops and livestock products,40,Austria,5111,Stocks,2111,Cattle,2004,2004,Head,2052033.0,A,Official figure


In [28]:
df = df[["Country", "Year", "Stocks", "Flag", "Flag Description"]] # Easier to stipulate what we keep 

In [29]:
df.shape  # Not displayed, since only the last expression in a cell is shown; the DataFrame is now 594 rows by 5 columns
df.tail()

Unnamed: 0,Country,Year,Stocks,Flag,Flag Description
1703,Sweden,2017,1448590.0,A,Official figure
1704,Sweden,2018,1435450.0,A,Official figure
1705,Sweden,2019,1404670.0,A,Official figure
1706,Sweden,2020,1390960.0,A,Official figure
1707,Sweden,2021,1389890.0,A,Official figure


In [30]:
print(df.dtypes)

Country              object
Year                  int64
Stocks              float64
Flag                 object
Flag Description     object
dtype: object


In [31]:
df.isnull().sum()# returns the number of missing values in each column.

Country             0
Year                0
Stocks              0
Flag                0
Flag Description    0
dtype: int64

## Lack of estimated and unofficial reporting
The absence of estimated and unofficial figures makes the Flag and Flag Description fields redundant, so we drop them.

In [32]:
flag_df = df[['Flag', 'Flag Description']].drop_duplicates()
flag_df # Unofficial and estimated reports were eliminated by the earlier filtering!

Unnamed: 0,Flag,Flag Description
39,A,Official figure


In [33]:
df.dtypes

Country              object
Year                  int64
Stocks              float64
Flag                 object
Flag Description     object
dtype: object

## The devil is in the dtypes!

While there is a somewhat academic exposition to be had on the importance of using the int type for discrete values $\mathbb{Z}$ and the floating type to represent real numbers $\mathbb{R}$, suffice it to say here that integers are for counting, and as well as counting sheep we count cattle, and that is the end of it. Bad things can happen when floating points are used for discrete values, and even if they don't, we may slow down our modelling.

Also, int64 can represent a much larger range of values than int32, but it requires twice the memory, so we will recast the Year type from int64 to int32. If we are still farming in two billion years, in the year 2,147,483,647, someone or some bovine can cast it back up to int64, depending on evolution and who is in charge!
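The range and memory trade-off can be checked directly. This sketch uses `np.iinfo` for the integer limits and `Series.memory_usage` for the footprint of a 22-year column; `years64` and `years32` are illustrative names, not variables from the notebook.

```python
import numpy as np
import pandas as pd

# Largest values that signed 32-bit and 64-bit integers can hold.
print(np.iinfo(np.int32).max)
print(np.iinfo(np.int64).max)

# The years 2000 to 2021 stored at both widths.
years64 = pd.Series(np.arange(2000, 2022), dtype="int64")
years32 = years64.astype("int32")

# Bytes used by the values alone (index excluded).
bytes64 = years64.memory_usage(index=False)
bytes32 = years32.memory_usage(index=False)
print(bytes64, bytes32)
```

For a frame of this size the saving is tiny in absolute terms; the habit matters more on larger datasets.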

## The absence of missing Stock values enables us to only now cast them as 32-bit integers
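Now that there are no missing values in the 'Stocks' column of our pandas DataFrame, df, we can cast the 'Stocks' column to a 32-bit integer data type using the astype() method. A plain integer column cannot hold NaN, which is why this cast had to wait until the missing values were gone.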
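Before casting, it can be worth confirming that the cast loses nothing. The helper below is a hypothetical sketch (`to_int32_checked` is not part of pandas or of this notebook) that refuses to cast when values are missing, fractional, or outside the int32 range.

```python
import numpy as np
import pandas as pd


def to_int32_checked(values: pd.Series) -> pd.Series:
    """Hypothetical helper: cast to int32 only when the cast is lossless."""
    if values.isna().any():
        raise ValueError("missing values cannot be stored in a plain int32 column")
    if not (values % 1 == 0).all():
        raise ValueError("fractional values would be truncated")
    limits = np.iinfo(np.int32)
    if values.min() < limits.min or values.max() > limits.max:
        raise ValueError("values fall outside the int32 range")
    return values.astype("int32")


# Float head counts as they appear in the source file (values from the tables above).
stocks = pd.Series([2152811.0, 2155447.0, 2118454.0], name="Stocks")
stocks_int = to_int32_checked(stocks)
print(stocks_int.dtype)
```

If missing values had to be kept, pandas also offers a nullable `Int32` extension dtype, which would be an alternative to filtering first.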

In [34]:
max_int32 = 2**31 - 1
print(max_int32)  # One of the 32 bits stores the sign, and the -1 accounts for 0!


2147483647


In [35]:
# The absence of missing values frees us up to cast values to integers. We can't have live bovine parts!
df['Stocks'] = df['Stocks'].astype('int32')  # Explicit width: plain int can map to int32 or int64 depending on platform and NumPy version
# Cast the 'Year' column from int64 to int32
df['Year'] = df['Year'].astype('int32')
df.dtypes # Check


Country             object
Year                 int32
Stocks               int32
Flag                object
Flag Description    object
dtype: object

In [36]:
df.info()# provides a concise summary of the DataFrame, including column data types, non-null values, and memory usage.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 594 entries, 39 to 1707
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Country           594 non-null    object
 1   Year              594 non-null    int32 
 2   Stocks            594 non-null    int32 
 3   Flag              594 non-null    object
 4   Flag Description  594 non-null    object
dtypes: int32(2), object(3)
memory usage: 23.2+ KB


In [37]:
df.head()


Unnamed: 0,Country,Year,Stocks,Flag,Flag Description
39,Austria,2000,2152811,A,Official figure
40,Austria,2001,2155447,A,Official figure
41,Austria,2002,2118454,A,Official figure
42,Austria,2003,2066942,A,Official figure
43,Austria,2004,2052033,A,Official figure


In [38]:
df = df.drop(['Flag', 'Flag Description'], axis=1) # Removed two columns from the DataFrame as all reporting is official

In [39]:
df['Stocks'] = df['Stocks'].astype('int32') # Already int32 from the earlier cast, so this changes nothing. Naive reason: Integers are for counting!

## This Jupyter Notebook has focussed on cleaning our target variable 
- Let's punctuate the workflow here and continue with quality checking, visualisation, and merging data in the next notebook.
- We will need to do some rigorous testing and visualisation on it, as well as merging it with expected predictor variables, before any modelling and machine learning, but progress has been made along with piles of enjoyable learning on the author's part.

In [40]:
df

Unnamed: 0,Country,Year,Stocks
39,Austria,2000,2152811
40,Austria,2001,2155447
41,Austria,2002,2118454
42,Austria,2003,2066942
43,Austria,2004,2052033
...,...,...,...
1703,Sweden,2017,1448590
1704,Sweden,2018,1435450
1705,Sweden,2019,1404670
1706,Sweden,2020,1390960


In [41]:
df.to_csv('clean/master_data.csv', index=False) # Stash it away

In [42]:
print(os.listdir('clean')) # make sure

['Areas.csv', 'AreasEU.csv', 'benelux.csv', 'benelux_pivot.csv', 'cattle_stock.csv', 'country.csv', 'master_data.csv', 'stocks.csv']


In [43]:
del df ## free up memory

In [44]:
print(os.listdir('clean')) # The cleaned files, including master_data.csv, are still on disk

['Areas.csv', 'AreasEU.csv', 'benelux.csv', 'benelux_pivot.csv', 'cattle_stock.csv', 'country.csv', 'master_data.csv', 'stocks.csv']


In [45]:
df = pd.read_csv('clean/master_data.csv')  # Load the cleaned master file back into the DataFrame df

In [46]:
df


Unnamed: 0,Country,Year,Stocks
0,Austria,2000,2152811
1,Austria,2001,2155447
2,Austria,2002,2118454
3,Austria,2003,2066942
4,Austria,2004,2052033
...,...,...,...
589,Sweden,2017,1448590
590,Sweden,2018,1435450
591,Sweden,2019,1404670
592,Sweden,2020,1390960


## We finished up with ...
- We saved the cleaned data and checked that our cleaned files are where they should be
- We deleted the df variable using del as a means of managing memory
- We read our master file back into df
- We took a look
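One thing to watch when reading the master file back: CSV stores no type information, so the int32 casts are not preserved and pandas infers int64 on reload. The sketch below shows the effect and one way to restore the types with the `dtype` argument of `read_csv`; it writes to an in-memory buffer instead of `clean/master_data.csv`, and the frame `clean` is a two-row stand-in.

```python
import io

import pandas as pd

# Two-row stand-in for the cleaned master data, with explicit 32-bit integers.
clean = pd.DataFrame({
    "Country": ["Austria", "Austria"],
    "Year": pd.Series([2000, 2001], dtype="int32"),
    "Stocks": pd.Series([2152811, 2155447], dtype="int32"),
})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Default read: integer columns come back as int64.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Typed read: ask for int32 explicitly.
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"Year": "int32", "Stocks": "int32"})

print(default_read.dtypes)
print(typed_read.dtypes)
```

The next notebook could pass the same `dtype` mapping when it loads the master file, if the 32-bit types are wanted there.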

## Open notebook in root directory begining 02_....