# Polish Crops analysis

In this notebook I would like to play around with some Python machine learning tools.

To that end, I've created this simple project to check a simple hypothesis: do precipitation and/or temperature in Poland affect the price of crops (soft wheat)? The steps are:

* **Step 1 Download source reports** - downloading the necessary files from publicly available databases.

* **Step 2 Data preprocessing** - removing unnecessary data and preparing the rest for further processing.

* **Step 3 Data visualization** - plotting a couple of charts, just to visualize the downloaded data.

* **Step 4 Train machine learning models** - training a couple of machine learning regression models to check which of them fits best, and whether there is a correlation between climate (temperature, precipitation) and the price of soft wheat.
---

## Step 1 Download source reports

First, the necessary data needs to be downloaded. It comes from the following sources:

* [IMGW (Instytut Meteorologii i Gospodarki Wodnej, *eng.* Polish Institute of Meteorology and Water Management)](https://danepubliczne.imgw.pl)
* [Eurostat](https://ec.europa.eu/eurostat/data/database)


To avoid boilerplate code and keep this notebook short, the `./tools` directory contains Python modules that download the following reports:

* [Monthly temperature report in Poland](https://dane.imgw.pl/data/dane_pomiarowo_obserwacyjne/dane_meteorologiczne/miesieczne/klimat/)
* [Monthly precipitation report in Poland](https://dane.imgw.pl/data/dane_pomiarowo_obserwacyjne/dane_meteorologiczne/miesieczne/opad/)
* [Selling prices of crop products (absolute prices) - annual price (from 2000 onwards)](https://ec.europa.eu/eurostat/data/database?p_p_id=NavTreeportletprod_WAR_NavTreeportletprod_INSTANCE_nPqeVbPXRmWQ&p_p_lifecycle=0&p_p_state=pop_up&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_NavTreeportletprod_WAR_NavTreeportletprod_INSTANCE_nPqeVbPXRmWQ_nodeInfoService=true&nodeId=98243)

The reports do not explain every column or the codes used in some cells, so the accompanying dictionary files need to be downloaded as well.

In [6]:
import tools.imgw as imgw
import tools.eurostat as eurostat

years = list(range(2001, 2019))  # 2001 through 2018 inclusive
raw_temperature_report_paths = imgw.download_temperature_reports(years)
temperature_report_dic_path = imgw.download_temperature_report_dic()

raw_precipitation_report_paths = imgw.download_precipitation_reports(years)
precipitation_report_dic_path = imgw.download_precipitation_report_dic()

raw_crop_prices_report_path = eurostat.download_crop_prices()
crop_categories_dic_path = eurostat.download_crop_categories_dic()
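
The `tools` modules themselves are not shown in this notebook. As a rough idea only, a download helper of this kind could look like the sketch below. The name `download_file`, its `target_dir` default and the skip-if-cached behaviour are assumptions made for illustration, not the actual code in `./tools`.

```python
import urllib.request
from pathlib import Path


def download_file(url, target_dir="data/raw"):
    """Download `url` into `target_dir` and return the local path.

    Hypothetical helper: the real functions in `tools.imgw` and
    `tools.eurostat` may work differently.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Use the last part of the URL as the local file name.
    target_path = target_dir / url.rstrip("/").split("/")[-1]
    # Skip the download when the file is already cached locally.
    if not target_path.exists():
        urllib.request.urlretrieve(url, str(target_path))
    return target_path
```

Caching the file by name keeps repeated notebook runs from hitting the public servers again, at the cost of never refreshing a report that changes upstream.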

## Step 2 Data preprocessing

Before training machine learning models or plotting charts, the data needs to be prepared.

### Step 2.1 Prepare temperature data

None of the temperature report files have column headers, so to check what each column means we need to examine the `k_m_t_format.txt` file. Here is code that prints the headers (unfortunately in Polish; I'll translate them in the next step):

In [11]:
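# The format file starts with an empty line, so counting from -1 makes
# the printed numbers match the zero-based column positions in the reports.
with open(temperature_report_dic_path) as f:
    for index, line in enumerate(f, start=-1):
        # `line` keeps its trailing newline, hence the blank lines in the output.
        print('{}\t{}'.format(index, line))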

-1	

0	Kod stacji                                       9

1	Nazwa stacji                                    30

2	Rok                                              4

3	Miesiąc                                          2

4	Średnia miesięczna temperatura [°C]              5/1

5	Status pomiaru TEMP                              1

6	Średnia miesięczna wilgotność względna [%]       8/1

7	Status pomiaru WLGS                              1

8	Średnia miesięczna prędkość wiatru [m/s]         6/1

9	Status pomiaru FWS                               1

10	Średnie miesięczne zachmurzenie ogólne [oktanty] 6/1

11	Status pomiaru NOS                               1

12	   

13	Status "8" brak pomiaru
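
For reference, the headers printed above translate roughly as follows. The dictionary name `TEMPERATURE_COLUMNS` is only illustrative and the translations are mine; the keys are the column positions from the listing.

```python
# English translations of the Polish headers, keyed by column position.
# The name TEMPERATURE_COLUMNS is illustrative, not part of the `tools` modules.
TEMPERATURE_COLUMNS = {
    0: 'Station code',
    1: 'Station name',
    2: 'Year',
    3: 'Month',
    4: 'Mean monthly temperature [°C]',
    5: 'Measurement status TEMP',
    6: 'Mean monthly relative humidity [%]',
    7: 'Measurement status WLGS',
    8: 'Mean monthly wind speed [m/s]',
    9: 'Measurement status FWS',
    10: 'Mean monthly total cloud cover [oktas]',
    11: 'Measurement status NOS',
}

# The last line of the format file: a status value of 8 means "no measurement".
NO_MEASUREMENT_STATUS = 8
```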


In the previous step we downloaded one file per year; these now need to be combined into a single file. Moreover, each file contains temperature readings for several places across Poland, so I've decided to calculate the average temperature for each month across the whole of Poland.

In [22]:
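import pandas as pd

monthly_frames = []
for year in years:
    # The raw reports have no header row, so columns are addressed by position.
    df = pd.read_csv(raw_temperature_report_paths.get(year), header=None, encoding='unicode_escape')
    # Keep only year (2), month (3) and mean monthly temperature (4).
    df = df.drop(columns=[0, 1, 5, 6, 7, 8, 9, 10, 11])
    df = df.rename(columns={2: 'Year', 3: 'Month', 4: 'Temperature [°C]'})
    # Average over all stations to get one value per month for the whole country.
    monthly_frames.append(df.groupby('Month').mean())

# Write the combined data once with a single header row,
# so re-running the cell does not append duplicate rows.
pd.concat(monthly_frames).to_csv('data/processed/temperature.csv')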
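
The format file notes that status "8" means no measurement, while the code above drops the status column without looking at it. A minimal sketch of filtering such rows before averaging is shown below. The sample rows are made up (not real IMGW data) and contain only the first six columns; I am also assuming that the status cell is empty for valid readings and is parsed as a number, so the comparison may need the string `'8'` on the real files.

```python
import io

import pandas as pd

# Made-up rows in the layout described by the format file (not real IMGW data):
# station code, station name, year, month, mean temperature, TEMP status.
sample = io.StringIO(
    '1,"A",2001,1,-1.0,\n'
    '2,"B",2001,1,0.0,8\n'
    '3,"C",2001,1,-3.0,\n'
)

df = pd.read_csv(sample, header=None)

# Column 5 is "Status pomiaru TEMP"; drop rows flagged as "no measurement".
measured = df[df[5] != 8]

# Average the remaining stations per year and month.
monthly_mean = (
    measured.rename(columns={2: 'Year', 3: 'Month', 4: 'Temperature [°C]'})
    .groupby(['Year', 'Month'])['Temperature [°C]']
    .mean()
)
print(monthly_mean)
```

Without the filter, the placeholder value stored for station "B" would be averaged in as if it were a real reading.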

## Step 3 Data visualization

## Step 4 Train machine learning models