## Analysis of Irish Dairy Product Production and Consumption in Comparison with EU Member States, with Particular Regard to the Rise of Non-Dairy Product Production and Consumption

This analysis examines the production and consumption of dairy products in Ireland and the impact of increasingly popular alternatives on the public's perception of dairy products. It does so by comparing Ireland's situation with that of an EU member state with a similarly sized dairy market and output.

Sentiment analysis will be performed to explore the apparent increase in public interest in non-dairy alternatives in recent years, and its impact on dairy product sales and consumption.

Additionally, forecasting will be performed to best characterise the future behaviour of the investigated trends, such as the downturn in the consumption of dairy milk.

## Import Required Libraries

In [18]:
#Commonly used libraries for plotting, statistical analysis and data analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

import seaborn as sns
from pandas import DatetimeIndex
from scipy.stats import poisson

#scikit-learn is a widely used machine learning library; statsmodels supplies additional error metrics.
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, explained_variance_score
from sklearn.linear_model import LinearRegression, Ridge
from statsmodels.tools.eval_measures import mse, rmse
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (GridSearchCV, cross_val_score)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

## Notes on Licensing of Data

## Data Sourced from FAO

"FAO encourages you to use FAO databases for research, statistical, and scientific purposes. You may access, download, create copies and re-disseminate datasets subject to these Dataset terms.

Unless specifically stated otherwise, all datasets disseminated through the databases below are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 IGO (CC BY-NC-SA 3.0 IGO) explained here with the additional terms listed below" - https://www.fao.org/contact-us/terms/db-terms-of-use/en/

## Notes on Metadata

## Main Sources of Error
Overall Accuracy

"It is not possible to assess the overall accuracy but as there is a substantial amount of estimated or imputed data points, the accuracy for certain products, countries and regions is not good" - https://www.fao.org/faostat/en/#data/FBSH

Sampling Error

"No information available. In the EU the coefficient of variation shall according to regulations not exceed 3% for the area of cultivation for main crops. For non EU countries the coefficient of variation might be significantly larger. For further information see country metadata when available."

Non-sampling error

"No information available of the magnitude on non-sampling errors. One such category of errors is measurement errors which are due mainly to lack of harmonisation in statistical methods. For instance, when FAO concepts do not fit with national concepts, there may be significant measurement errors."


## Preliminary inspection of available datasets

Relevant datasets from reputable sources are first explored and evaluated to determine which are suitable for this analysis. Given the nature of the topic, geographical comparability is limited by differences in methods and coverage, except among regions made up of homogeneous countries. Because EU member states are comparatively homogeneous in this respect, comparing their industries is simpler, and the data and figures reported by a state are less likely to be misinterpreted.

## Notes on FAO Data

## Definition of Statistical Measures used in Datasets

"Areas refer to the area under cultivation. Area under cultivation means the area that corresponds to the total sown area, but after the harvest it excludes ruined areas (e.g. due to natural disasters). If the same land parcel is used twice in the same year, the area of this parcel can be counted twice. Production means the harvested production. Harvested production means production including on-holding losses and wastage, quantities consumed directly on the farm and marketed quantities, indicated in units of basic product weight. Yield means the harvested production per ha for the area under cultivation"- https://www.fao.org/faostat/en/#data/QCL
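As a worked illustration of the yield definition quoted above, the following minimal sketch divides harvested production by the area under cultivation. The function name `yield_per_ha` and all figures are hypothetical and are not taken from FAOSTAT.

```python
def yield_per_ha(production_t: float, area_ha: float) -> float:
    """Return yield in tonnes per hectare of area under cultivation."""
    if area_ha <= 0:
        raise ValueError("area under cultivation must be positive")
    return production_t / area_ha


# Hypothetical parcel of 10 ha sown twice in the same year: per the
# definition, its area can be counted twice, giving 20 ha under cultivation.
area_under_cultivation = 10 * 2

# Hypothetical harvested production of 150 t across both sowings.
print(yield_per_ha(150.0, area_under_cultivation))  # 7.5
```

Counting the parcel twice halves the apparent yield relative to a single 10 ha count, which is worth bearing in mind when comparing yields across countries.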

Unit of measure

"LIVESTOCK PRIMARY: Laying [1000 heads], milk animals [heads], prod Population [No], prod Population [heads], producing Animals/Slaughtered [1000 heads], producing Animals/Slaughtered [heads], production [1000 heads], production [1000], production [heads], production [t], yield [100 mg/head], yield [No/head], yield [hg/head], yield/Carcass Weight [0.1 g/head], yield/Carcass Weight [hg/head]. 5) LIVESTOCK PROCESSED: Production is expressed in tonnes [t]" - https://www.fao.org/faostat/en/#data/QCL

Production quantity is provisionally taken as the metric of interest, on the assumption that it reflects demand; this assumption remains to be confirmed. All items explicitly identified as being produced from dairy milk were captured in the dataset. Items without explicit identification could not be verified and so were omitted to avoid the use of incorrect data.

One option is to include all milk items, even those that cannot be verified, in order to show how much data was excluded because of this inability to verify. If the excluded share is minimal, the omission can be further justified. This would form part of the EDA and data preparation.
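The size of the excluded share could be measured along the following lines. This is only a sketch: the item names, the values and the keyword rule (an item counts as verified dairy if its name mentions "cattle" or "cow") are all assumptions made for illustration, not the FAOSTAT item list or the rule used in this analysis.

```python
import pandas as pd

# Hypothetical item names and production values, for illustration only.
items = pd.DataFrame({
    "Item": [
        "Raw milk of cattle",
        "Butter of cow milk",
        "Cheese from whole cow milk",
        "Skim milk, unspecified",
        "Cheese, unspecified",
    ],
    "Value": [5000.0, 150.0, 200.0, 40.0, 10.0],
})

# Assumed rule: an item is verified dairy if its name mentions cattle or cow.
verified = items["Item"].str.contains(r"cattle|cow", case=False, regex=True)

# Count the unverified rows and their share of total production quantity.
excluded_rows = int((~verified).sum())
excluded_share = items.loc[~verified, "Value"].sum() / items["Value"].sum()

print(f"Excluded rows: {excluded_rows} of {len(items)}")
print(f"Excluded share of production: {excluded_share:.1%}")
```

Reporting both the row count and the share of production quantity helps show whether the omitted items are numerous but small, or few but significant.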

In [19]:
#Inspect a dataset sourced from the Food and Agriculture Organization of the United Nations
#This dataset details the Food Balance Sheet for Ireland between 1961 and 2013; a comprehensive picture 
df=pd.read_csv("/Users/markc/Documents/Data Analytics Masters/CA2/Data/Dairy Data/FAOSTAT_data_en_12-13-2022.csv")
print(df.shape)

(212, 14)


In [20]:
print(df.head())

    Domain Code                                             Domain  \
0          FBSH  Food Balances (-2013, old methodology and popu...   
1          FBSH  Food Balances (-2013, old methodology and popu...   
2          FBSH  Food Balances (-2013, old methodology and popu...   
3          FBSH  Food Balances (-2013, old methodology and popu...   
4          FBSH  Food Balances (-2013, old methodology and popu...   

     Area Code (M49)     Area  Element Code                Element  \
0                372  Ireland          5511             Pro

In [21]:
print(df.columns)

Index(['Domain Code', 'Domain', 'Area Code (M49)', 'Area', 'Element Code',
       'Element', 'Item Code (CPC)', 'Item', 'Year Code', 'Year', 'Unit',
       'Value', 'Flag', 'Flag Description'],
      dtype='object')


In [26]:
display(df.describe(include='all'))

| | Domain Code | Domain | Area Code (M49) | Area | Element Code | Element | Item Code (CPC) | Item | Year Code | Year | Unit | Value | Flag | Flag Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 212 | 212 | 212.0 | 212 | 212.0 | 212 | 212 | 212 | 212.0 | 212.0 | 212 | 212.0 | 212 | 212 |
| unique | 1 | 1 | | 1 | | 2 | 3 | 3 | | | 1 | | 1 | 1 |
| top | FBSH | Food Balances (-2013, old methodology and popu... | | Ireland | | Production | S2848 | Milk - Excluding Butter | | | 1000 tonnes | | I | Imputed value |
| freq | 212 | 212 | | 212 | | 159 | 106 | 106 | | | 212 | | 212 | 212 |
| mean | | | 372.0 | | 5421.75 | | | | 1987.0 | 1987.0 | | 1241.816038 | | |
| std | | | 0.0 | | 154.951418 | | | | 15.333265 | 15.333265 | | 2017.668705 | | |
| min | | | 372.0 | | 5154.0 | | | | 1961.0 | 1961.0 | | 2.0 | | |
| 25% | | | 372.0 | | 5421.75 | | | | 1974.0 | 1974.0 | | 51.0 | | |
| 50% | | | 372.0 | | 5511.0 | | | | 1987.0 | 1987.0 | | 136.0 | | |
| 75% | | | 372.0 | | 5511.0 | | | | 2000.0 | 2000.0 | | 1166.5 | | |
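The summary above suggests that several columns hold a single value for every row (a `unique` count of 1, or a `std` of 0.0), so they carry no information for a single-country extract. A minimal sketch of finding such columns programmatically follows. The helper name `constant_columns` and the small stand-in frame are hypothetical; the stand-in only mirrors a few of the column names above with illustrative values.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    counts = frame.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()


# Stand-in frame with illustrative values, not the FAOSTAT extract itself.
sample = pd.DataFrame({
    "Domain Code": ["FBSH", "FBSH", "FBSH"],
    "Area": ["Ireland", "Ireland", "Ireland"],
    "Element": ["Production", "Production", "Other"],
    "Value": [5000.0, 150.0, 136.0],
})

print(constant_columns(sample))  # ['Domain Code', 'Area']
```

Applied to the full dataframe, the returned list would give candidate columns to drop, although a constant column such as `Unit` may still be worth recording before removal.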


In [23]:
#Drop identifier columns that hold a single constant value for every row.
df = df.drop(columns=['Domain Code', 'Domain', 'Area Code (M49)'])