This notebook contains some basic data visualization and evaluation.

This is to provide a starting point for your exploration of the given datasets.  

Please leave any questions/concerns in the comment section. Upvote if you find it useful.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import os

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from matplotlib import pyplot as plt
from plotly.subplots import make_subplots

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
    

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
Crops_data = '../input/crop-statistics-fao-all-countries/Crops_AllData_Normalized.csv'
data = pd.read_csv(Crops_data,  encoding='cp1252')
data.head()

Remove columns that are irrelevant/unnecessary

In [None]:
data.drop(['Area Code','Item Code', 'Element Code', 'Year Code', 'Flag'],inplace=True,axis=1)
data.head()

Filter the data to show only data on crops in Spain

In [None]:
data_spain = data.loc[data.Area == 'Spain']
data_spain.head()

Filter the data once again to keep only the Production quantity (in tonnes) of the different crops in Spain, since production quantity is the only element I have chosen to evaluate for now.

In [None]:
production = data_spain.loc[data_spain.Element == 'Production']
production.head()
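
The two filtering steps above can also be written as a single combined condition. Here is a minimal sketch on a toy table; the rows and values are made up for illustration and are not taken from the FAO dataset.

```python
import pandas as pd

# Toy stand-in for the crops table; rows and values are invented for illustration.
toy = pd.DataFrame({
    "Area": ["Spain", "Spain", "France", "Spain"],
    "Element": ["Production", "Yield", "Production", "Production"],
    "Item": ["Wheat", "Wheat", "Wheat", "Barley"],
    "Value": [100.0, 3.2, 250.0, 80.0],
})

# Combine both conditions with & (each condition needs its own parentheses).
spain_production = toy.loc[(toy["Area"] == "Spain") & (toy["Element"] == "Production")]
print(spain_production)
```

Keeping the two steps separate, as in the cells above, is still handy when the intermediate `data_spain` table is needed on its own.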

In [None]:
production_sort = production.sort_values(by=['Value'], ascending=False) 
production_sort

More information on the function used above can be found at https://www.geeksforgeeks.org/python-pandas-dataframe-sort_values-set-1/

We see that the item with the highest level of production in Spain is Cereals. This suggests that cereals are the principal crop in Spain.
(Cereals is a classification of crops that includes wheat, rice, maize, oats, barley, rye, millet and sorghum.)

In [None]:
production_sort.dropna(subset = ["Value"], inplace=True)
production_sort

Refer to https://www.kite.com/python/answers/how-to-drop-empty-rows-from-a-pandas-dataframe-in-python#:~:text=Use%20df.,contain%20NaN%20under%20those%20columns for more information on the function used.

This call removes every row that has a missing value (NaN) in the 'Value' column. Missing values in other columns do not cause a row to be dropped.
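
To see what `dropna(subset=...)` does in isolation, here is a small sketch on a toy table (the rows are made up for illustration). It uses the non-inplace form, which returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Toy table with one missing entry in "Value" and missing entries in "Note".
toy = pd.DataFrame({
    "Item": ["Wheat", "Barley", "Olives"],
    "Value": [100.0, np.nan, 50.0],
    "Note": [None, "x", None],
})

# Only the "Value" column is checked; the gaps in "Note" are ignored.
cleaned = toy.dropna(subset=["Value"])
print(cleaned)
```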

In [None]:
fig2 = px.area(production_sort, x="Year", y="Value", color="Item", line_group="Item", title='Production of Crops in Spain (in Tonnes)')
fig2.show()

The graph above is a stacked area chart. It shows the total level of crop production and the contribution of each crop to that total over the period 1961-2019.
The production quantity of crops in Spain (in tonnes) seems to have peaked in 2018 at around 160 million.


Plotting crop production per country in 2018,
to compare the quantity of crops produced in Spain with the rest of the world.

In [None]:
# Keep only the production rows for 2018
global_produce = data.loc[(data.Element == 'Production')  & (data.Year == 2018)]
countries = data.Area.unique()
area = []
amnt = []
# Sum the production of all items for each area (areas with no 2018 rows get 0)
for country in countries:
    temp = global_produce.loc[global_produce.Area == country]
    amount = temp.Value.sum()
    area.append(country)
    amnt.append(amount)
data_global = pd.DataFrame({'Country': area, 'Amount': amnt})
# Largest producers first
data_global = data_global.sort_values(by=['Amount'], ascending=False)
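
The per-country loop above can also be expressed with `groupby`. The following is a sketch of that alternative on a toy table (values made up for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Toy stand-in for the crops table; values are invented for illustration.
toy = pd.DataFrame({
    "Area": ["Spain", "Spain", "India", "India", "Brazil"],
    "Element": ["Production"] * 5,
    "Year": [2018, 2018, 2018, 2018, 2017],
    "Value": [10.0, 5.0, 40.0, 20.0, 99.0],
})

mask = (toy["Element"] == "Production") & (toy["Year"] == 2018)
totals = (
    toy.loc[mask]
    .groupby("Area", as_index=False)["Value"].sum()         # one row per area
    .rename(columns={"Area": "Country", "Value": "Amount"})
    .sort_values("Amount", ascending=False)
    .reset_index(drop=True)
)
print(totals)
```

One difference from the loop: areas with no 2018 rows (Brazil in this toy table) are simply absent from the result instead of appearing with a total of 0.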

In [None]:
fig = px.bar(data_global, x=data_global.Country[1:20], y=data_global.Amount[1:20], title='Countries with highest Crop Production in 2018',color=data_global.Amount[1:20],
             labels={'x': 'Country', 'y': 'Amount (tonnes)', 'color': 'Amount- Tonnes'})
fig.show()

When analysing this data we can ignore the continents and focus solely on the countries in the chart.
We see that countries like India (2.34B tonnes) and Brazil (2.26B tonnes) have a much higher level of crop production than Spain (even in the year with its highest level of production). This could be attributed to the difference in the population of these countries, or to differences in technological advancement and infrastructure.
I will only be looking at the relationship between crops produced in Spain and the population of Spain over the specified time frame (1961 to 2019), but I strongly recommend looking at the relationship between the crop production level of all these different countries and their population density/population.
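
Instead of ignoring the continents by eye, the aggregate rows could be filtered out before plotting. This is only a sketch: the `AGGREGATES` list is a hypothetical, incomplete set of names, and the amounts are made up. The actual aggregate labels should be checked against `data.Area.unique()`.

```python
import pandas as pd

# Toy totals table; the amounts are invented for illustration.
totals = pd.DataFrame({
    "Country": ["World", "Asia", "India", "Brazil", "Spain"],
    "Amount": [900.0, 500.0, 234.0, 226.0, 16.0],
})

# Hypothetical, incomplete list of aggregate areas (assumption, not from the dataset).
AGGREGATES = ["World", "Asia", "Americas", "Europe", "Africa", "Oceania"]

# Keep only rows whose name is not in the aggregate list.
countries_only = totals.loc[~totals["Country"].isin(AGGREGATES)].reset_index(drop=True)

# Rank 1 is the largest producer among the remaining rows.
countries_only["Rank"] = countries_only["Amount"].rank(ascending=False).astype(int)
print(countries_only)
```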

Below is the same bar graph extended to show the crop production levels of the top 60 countries, so that the bar representing Spain can be seen (in the 10th position from the end).

In [None]:
fig1 = px.bar(data_global, x=data_global.Country[1:60], y=data_global.Amount[1:60], title='Countries with highest Crop Production in 2018',color=data_global.Amount[1:60],
             labels={'x': 'Country', 'y': 'Amount (tonnes)', 'color': 'Amount- Tonnes'})
fig1.show()

Considering population data, so that we can see the relationship between total population and crop production in Spain (and better understand the various factors that influence crop production).

In [None]:
population_data = "../input/crop-statistics-fao-all-countries/Total_Population_All_Countries.csv"
population = pd.read_csv(population_data,encoding='cp1252')
population.head()


Remove columns that are irrelevant/unnecessary

In [None]:
population.drop(['LocID','PopMale', 'PopFemale','VarID','MidPeriod'],inplace=True,axis=1)

In [None]:
pop_spain = population.loc[population.Location == 'Spain']
pop_spain

Note that for population data we will be referring to the medium variant in the dataset provided. More information on this can be found at https://population.un.org/wpp/Download/Standard/Population/

In [None]:
population_medium = pop_spain.loc[pop_spain.Variant == 'Medium']
#population_medium.head()

Filter the population data to contain only the years present in the crop production dataset (1961-2019)

In [None]:
pop_filtered = population_medium.loc[(population_medium.Time > 1960) & (population_medium.Time < 2020)]
pop_filtered.head()

In [None]:
temp1 = production_sort
temp1 = temp1.sort_values(by=['Year'], ascending=True)
years = temp1.Year.unique()
yr = []
amnt = []
for year in years:
    temp_year = temp1.loc[temp1.Year == year]
    amount = temp_year.Value.sum()
    yr.append(year)
    amnt.append(amount)
    
data_yearly = pd.DataFrame({'Year': yr, 'Value': amnt})
data_yearly = data_yearly.sort_values(by=['Year'], ascending=True)

In [None]:
fig4 = px.line(data_yearly, x="Year", y="Value", title='Crop production per year in Spain (In tonnes)')
#fig4.show()

In [None]:
fig3 = px.line(pop_filtered, x="Time", y="PopTotal", title='Population In Spain (In thousands)')
#fig3.show()

In [None]:
fig3.show()
fig4.show()

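We see that both of these graphs follow a similar general upward trend, so the two series appear to be positively correlated. Note that this is a visual impression only: no correlation coefficient has been computed here, and two series that both grow over time will look correlated even without a direct link. This relationship can be explored further by looking at population density, or by comparing population with crop production in other countries.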
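
To put a number on the relationship between the two series, they can be joined on the year and passed to `Series.corr`, which computes the Pearson coefficient by default. The sketch below uses made-up yearly values with the same column names as `data_yearly` and `pop_filtered`; it is an illustration, not a result from the real data.

```python
import pandas as pd

# Toy yearly series; all numbers are invented for illustration.
production_yearly = pd.DataFrame({
    "Year": [1961, 1962, 1963, 1964],
    "Value": [10.0, 12.0, 15.0, 19.0],
})
population_yearly = pd.DataFrame({
    "Time": [1961, 1962, 1963, 1964],
    "PopTotal": [30.0, 31.0, 32.5, 34.0],
})

# Align the two tables on the year, keeping only years present in both.
merged = production_yearly.merge(
    population_yearly, left_on="Year", right_on="Time", how="inner"
)

# Pearson correlation between production and population.
r = merged["Value"].corr(merged["PopTotal"])
print(f"Pearson r = {r:.3f}")
```

A high coefficient here would only say that the two series move together over time, not that one drives the other.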

Note that this notebook just contains some basic data analysis/visualization and is just to provide a starting point for your exploration of the given data.