# MapSPAM 2010 data

In this notebook we download the data that we need from [MapSPAM](http://mapspam.info/data/).

The SPAM data (10 x 10 km grid-cell resolution) consists of four variables calculated by the model: `physical area`, `harvest area`, `production` and `yield`. Each variable is available for each of the 42 crops, split by `rainfed` and `irrigated` production system, as well as for the total of both.

SPAM generates a large collection of data: ~800,000 pixels x 42 crops x 4 production systems x 4 variables = ~500 million records.
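
As a quick check of the order of magnitude quoted above, the product can be computed directly. This is only the arithmetic from the sentence above; the variable names are chosen for illustration.

```python
# Approximate size of the SPAM record collection, using the rounded
# figures quoted in the text.
pixels = 800_000
n_crops = 42
production_systems = 4
n_variables = 4

records = pixels * n_crops * production_systems * n_variables
print(f"{records:,} records")  # 537,600,000, i.e. roughly 500 million
```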

Please check the [methodology](http://mapspam.info/methodology/) section for a thorough explanation.

**SPAM 2010 v1.0 Global Data (Latest)**

Download pre-packaged SPAM 2010 v1.0 data and map files at the global level, as one separate file for each crop, production system, variable and format. The available formats are DBF and GeoTIFF. See the [ReadMe](https://s3.amazonaws.com/mapspam/2010/v1.0/ReadMe_v1r0_Global.txt) file first!

* List of available SPAM variables:   

|SPAM long name | SPAM short name |description|  
|:---|:---|:---|  
|PHYSICAL AREA| phys_area|Physical area is measured in hectares and represents the actual area where a crop is grown, not counting how often production was harvested from it. Physical area is calculated for each production system and crop, and the sum of the physical areas of the four production systems constitutes the total physical area for that crop. The sum of the physical areas of all crops in a pixel may not be larger than the pixel size.|   
|HARVESTED AREA|harv_area|Also measured in hectares, harvested area is at least as large as physical area, and sometimes larger, since it also accounts for multiple harvests of a crop on the same plot. As for physical area, harvested area is calculated for each production system, and the sum of the harvested areas of all production systems in a pixel amounts to the total harvested area of the pixel.|  
|PRODUCTION| prod|Production, for each production system and crop, is calculated by multiplying harvested area by yield. It is measured in metric tons. The total production of a crop includes the production of all production systems of that crop.|  
|YIELD| yield|Yield is a measure of productivity, the amount of production per harvested area, and is measured in kilograms per hectare. The total yield of a crop, when considering all production systems, is not the sum of the individual yields, but the weighted average of the four yields.|  
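
The relationship between harvested area, yield and production described in the table can be illustrated with a small sketch. The numbers and the system labels `A` to `D` are made up for illustration, and weighting by harvested area is an assumption that follows from the definition of yield as production per harvested area.

```python
# Hypothetical values for one crop in one pixel, one entry per production system.
harv_area_ha = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0}          # hectares
yield_kg_ha = {"A": 4000.0, "B": 3000.0, "C": 2000.0, "D": 1000.0}   # kg/ha

# Production (metric tons) = harvested area (ha) * yield (kg/ha) / 1000
prod_mt = {s: harv_area_ha[s] * yield_kg_ha[s] / 1000.0 for s in harv_area_ha}

total_area = sum(harv_area_ha.values())
total_prod = sum(prod_mt.values())

# Total yield is the area-weighted average, not the sum or the simple mean.
total_yield = total_prod * 1000.0 / total_area
simple_mean = sum(yield_kg_ha.values()) / len(yield_kg_ha)

print(total_prod)   # 200.0 metric tons
print(total_yield)  # 2000.0 kg/ha
print(simple_mean)  # 2500.0 kg/ha, which overstates the total yield here
```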

* List of available [crops](http://mapspam.info/wp-content/uploads/4-Methodology-Crops-of-SPAM-2005-2015-02-26.csv) 

|No.| SPAM short name | SPAM long name | FAO names | Group |
|:---|:---|:---|:---|:---|  
|1 | whea | wheat                | wheat                                | cereals                       |
|2 | rice | rice                 | rice                                 | cereals                       |
|3 | maiz | maize                | maize                                | cereals                       |
|4 | barl | barley               | barley                               | cereals                       |
|5 | pmil | pearl millet         | millet                               | cereals                       |
|6 | smil | small millet         | millet                               | cereals                       |
|7 | sorg | sorghum              | sorghum                              | cereals                       |
|8 | ocer | other cereals        | other cereals ++                     | cereals                       |
|9 | pota | potato               | potato                               | roots&tubers or starchy roots |
|10| swpo | sweet potato         | sweet potato                         | roots&tubers or starchy roots |
|11| yams | yams                 | yam                                  | roots&tubers or starchy roots |
|12| cass | cassava              | cassava                              | roots&tubers or starchy roots |
|13| orts | other roots          | yautia ++                            | roots&tubers or starchy roots |
|14| bean | bean                 | beans, dry                           | pulses                        |
|15| chic | chickpea             | chickpea                             | pulses                        |
|16| cowp | cowpea               | cowpea                               | pulses                        |
|17| pige | pigeonpea            | pigeon pea                           | pulses                        |
|18| lent | lentil               | lentils                              | pulses                        |
|19| opul | other pulses         | broad beans ++                       | pulses                        |
|20| soyb | soybean              | soybean                              | oilcrops                      |
|21| grou | groundnut            | groundnut, with shell                | oilcrops                      |
|22| cnut | coconut              | coconut                              | oilcrops                      |
|23| oilp | oilpalm              | palmoil                              | oilcrops                      |
|24| sunf | sunflower            | sunflower seed                       | oilcrops                      |
|25| rape | rapeseed             | rapeseed                             | oilcrops                      |
|26| sesa | sesameseed           | sesame seed                          | oilcrops                      |
|27| ooil | other oil crops      | olives ++                            | oilcrops                      |
|28| sugc | sugarcane            | sugar cane                           | sugar crops                   |
|29| sugb | sugarbeet            | sugarbeet                            | sugar crops                   |
|30| cott | cotton               | seed cotton                          | fibres                        |
|31| ofib | other fibre crops    | other fibres ++                      | fibres                        |
|32| acof | arabica coffee       | coffee                               | stimulant                     |
|33| rcof | robusta coffee       | coffee                               | stimulant                     |
|34| coco | cocoa                | cocoa                                | stimulant                     |
|35| teas | tea                  | tea                                  | stimulant                     |
|36| toba | tobacco              | tobacco leaves                       | stimulant                     |
|37| bana | banana               | banana                               | fruits                        |
|38| plnt | plantain             | plantain                             | fruits                        |
|39| trof | tropical fruit       | oranges ++                           | fruits                        |
|40| temf | temperate fruit      | apples ++                            | fruits                        |
|41| vege | vegetables           | cabbages and other brassicas ++      | vegetables                    |
|42| rest | rest of crops        | all individual other crops           |                               |

* List of available production systems

|Name|Description|
|:---|:---|
|Irrigated|irrigated portion of crop|
|Rainfed|rainfed portion of crop|
|Rainfed high inputs|rainfed high inputs portion of crop|
|Rainfed low inputs|rainfed low inputs portion of crop|
|Total|all technologies together, i.e. complete crop|


**SPAM 2010 data structure:**

The SPAM 2010 data files are structured like this:

- Base url: https://s3.amazonaws.com/mapspam/2010/v1.0/dbf/
- Zip files: 

|File name| description |
|:---|:---| 
|spam2010v1r0_global_harv_area.dbf.zip|SPAM area harvested, global pixels, files in 2 formats x 6 technologies, structure A, record type H |
|spam2010v1r0_global_prod.dbf.zip|SPAM production, global pixels, files in 2 formats x 6 technologies, structure A, record type P |

- File names: 

    All files have standard names, which allow direct identification of variable and technology:
    spam2010v1r0_global_v_t.dbf
    
    where
    
    v = variable
    
    t = technology
    
    - Variables (v):
        
| v  | SPAM name | 
|:---|:---|
| harvested-area | harvested area |
| production | production |
    
    - Technologies (t):
     
| t  | description | 
|:---|:---|
|ti | irrigated portion of crop|
|tr | rainfed portion of crop  |
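
The naming convention above can be sketched as two small helpers that build the zip URL and the DBF file name. The helper names `zip_url` and `dbf_name` are hypothetical; the base URL, short names and codes are the ones listed in this section.

```python
BASE_URL = "https://s3.amazonaws.com/mapspam/2010/v1.0/dbf/"

# Short name used in the zip file -> variable name used in the DBF file
VARIABLES = {"harv_area": "harvested-area", "prod": "production"}
# Production system -> technology code used in the DBF file
TECHNOLOGIES = {"irrigated": "ti", "rainfed": "tr"}


def zip_url(short_name):
    """URL of the zip archive for one variable (hypothetical helper)."""
    return f"{BASE_URL}spam2010v1r0_global_{short_name}.dbf.zip"


def dbf_name(variable, technology):
    """DBF file name for one variable and technology (hypothetical helper)."""
    return f"spam2010v1r0_global_{variable}_{technology}.dbf"


for short_name, variable in VARIABLES.items():
    print(zip_url(short_name))
    for technology in TECHNOLOGIES.values():
        print("   ", dbf_name(variable, technology))
```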


In [1]:
import requests
from tqdm import tqdm
import pandas as pd
import numpy as np
import os
import zipfile
from simpledbf import Dbf5

## Download data

We download the `harvested area` and `production` variables for 2010.

In [None]:
baseUrl = 'https://s3.amazonaws.com/mapspam/2010/v1.0/dbf/'
folder = '/Volumes/MacBook HD/data/aqueduct/data_source/spamdata2010/'
variables = ['harv_area', 'prod']

for variable in variables:
    filename = 'spam2010v1r0_global_' + variable + '.dbf.zip'
    dataUrl = baseUrl + filename
    response = requests.get(dataUrl, stream=True)
    response.raise_for_status()  # stop early if the download failed

    file_output = os.path.join(folder, filename)
    print(filename)
    with open(file_output, 'wb') as handle:
        # stream the file to disk in 1 MB chunks
        for chunk in tqdm(response.iter_content(chunk_size=1024 * 1024)):
            handle.write(chunk)

    # uncompress zip file
    with zipfile.ZipFile(file_output, "r") as zip_ref:
        zip_ref.extractall(folder)

    # remove zip file
    os.remove(file_output)

## Data preprocessing

### Read data

We read only the following:
- Variables:
    - Harvested area
    - Production 
- Production systems:
    - Irrigated
    - Rainfed 
    
and merge them all together in a single table
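
A minimal sketch of the merge step, assuming both variables have already been reshaped into long tables keyed by `cell5m`, `crop` and `irrigation`. The values below are made up and the frame names `df_area_long` and `df_prod_long` are hypothetical.

```python
import pandas as pd

keys = ["cell5m", "crop", "irrigation"]

# Hypothetical long-format tables, one row per pixel, crop and production system.
df_area_long = pd.DataFrame({
    "cell5m": [1, 1, 2],
    "crop": ["wheat", "maize", "wheat"],
    "irrigation": ["rainfed", "rainfed", "irrigated"],
    "area": [12.0, 3.5, 8.0],   # hectares
})
df_prod_long = pd.DataFrame({
    "cell5m": [1, 1, 2],
    "crop": ["wheat", "maize", "wheat"],
    "irrigation": ["rainfed", "rainfed", "irrigated"],
    "prod": [30.0, 10.5, 40.0],  # metric tons
})

# An outer join keeps rows that appear in only one of the two tables.
df_merged = df_area_long.merge(df_prod_long, on=keys, how="outer")
print(df_merged)
```

An outer join is used here so that a pixel missing from one variable is kept with a `NaN` rather than silently dropped; an inner join would be the stricter choice.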

In [2]:
baseFolder = '/Volumes/MacBook HD/data/aqueduct/data_source/spamdata2010/'
variables_folder = ['harv_area', 'prod']
variables_file = ['harvested-area', 'production']
production_types = ['ti', 'tr']
suffix = ['_i', '_r']

irrigation = {'irrigated': 'ti', 'rainfed': 'tr'}

crops = {"whea": "wheat", "rice": "rice", "maiz": "maize", "barl": "barley", "pmil": "pearl millet", 
         "smil": "small millet", "sorg": "sorghum", "ocer": "other cereals", "pota": "potato", 
         "swpo": "sweet potato", "yams": "yams", "cass": "cassava", "orts": "other roots", 
         "bean": "bean", "chic": "chickpea", "cowp": "cowpea", "pige": "pigeonpea", "lent": "lentil", 
         "opul": "other pulses", "soyb": "soybean", "grou": "groundnut", "cnut": "coconut", 
         "oilp": "oilpalm", "sunf": "sunflower", "rape": "rapeseed", "sesa": "sesameseed", 
         "ooil": "other oil crops", "sugc": "sugarcane", "sugb": "sugarbeet", "cott": "cotton", 
         "ofib": "other fibre crops", "acof": "arabica coffee", "rcof": "robusta coffee", 
         "coco": "cocoa", "teas": "tea", "toba": "tobacco", "bana": "banana", "plnt": "plantain", 
         "trof": "tropical fruit", "temf": "temperate fruit", "vege": "vegetables", "rest": "rest of crops"}                           

df_all = pd.DataFrame(columns=['cell5m', 'crop', 'irrigation', 'area', 'prod', 
                               'unit_area', 'unit_prod', 'iso', 'name_cntr',
                               'prod_level'
                              ])

for itype in irrigation.keys():
    
    df_type = pd.DataFrame(columns=['cell5m', 'crop', 'irrigation', 'area', 'prod', 
                               'unit_area', 'unit_prod', 'iso', 'name_cntr',
                               'prod_level'
                              ])
    
    for nV in range(len(variables_folder)):
        folder = 'spam2010v1r0_global'+'_'+variables_folder[nV]+'.dbf'
        filename = 'spam2010v1r0_global'+'_'+variables_file[nV]+'_'+irrigation[itype]+'.dbf'
        
        print(filename)
        #dbf = Dbf5(baseFolder+folder+'/'+filename, codec='cp1252')
        #df = dbf.to_dataframe()
        #df.columns = df.columns.str.lower()
    

spam2010v1r0_global_harvested-area_ti.dbf
spam2010v1r0_global_production_ti.dbf
spam2010v1r0_global_harvested-area_tr.dbf
spam2010v1r0_global_production_tr.dbf


In [None]:
df_all = pd.DataFrame(columns=['cell5m', 'crop', 'irrigation', 'area', 'prod', 
                               'unit_area', 'unit_prod', 'iso', 'name_cntr',
                               'prod_level'
                              ])

In [3]:
folder = 'spam2010v1r0_global_prod.dbf'
filename='spam2010v1r0_global_production_tr.dbf'
dbf = Dbf5(baseFolder+folder+'/'+filename, codec='cp1252')
df = dbf.to_dataframe()
df.columns = df.columns.str.lower()
df.head()

| |iso3|prod_level|alloc_key|cell5m|x|y|rec_type|tech_type|unit|whea_r|...|trof_r|temf_r|vege_r|rest_r|crea_date|year_data|source|name_cntr|name_adm1|name_adm2|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|CHN|CH08078|4383640|1891479|123.291667|53.541667|P|R|mt|0.0|...|0.0|0.0|0.0|0.0|07/10/18 11:04:45 PM|avg(2009-2011)|F avg2|China|Heilongjiang|Mohexian|
|1|CHN|CH08078|4393627|1895786|122.208333|53.458333|P|R|mt|0.0|...|0.0|0.0|0.0|0.0|07/10/18 11:04:45 PM|avg(2009-2011)|F avg2|China|Heilongjiang|Mohexian|
|2|CHN|CH08078|4393628|1895787|122.291667|53.458333|P|R|mt|0.0|...|0.0|0.0|0.0|0.0|07/10/18 11:04:45 PM|avg(2009-2011)|F avg2|China|Heilongjiang|Mohexian|
|3|CHN|CH08078|4393629|1895788|122.375|53.458333|P|R|mt|0.0|...|0.0|0.0|0.0|0.0|07/10/18 11:04:45 PM|avg(2009-2011)|F avg2|China|Heilongjiang|Mohexian|
|4|CHN|CH08078|4393637|1895796|123.041667|53.458333|P|R|mt|0.0|...|0.0|0.0|0.0|0.0|07/10/18 11:04:45 PM|avg(2009-2011)|F avg2|China|Heilongjiang|Mohexian|


In [59]:
crop_names = [x + '_r' for x in list(crops.keys())]

In [61]:
df_prod = pd.DataFrame(columns=['cell5m', 'crop', 'irrigation', 'prod'])

for i in tqdm(range(len(df))):
    df_prod_i = pd.DataFrame(columns=['cell5m', 'crop', 'irrigation', 'prod'])

    df_prod_i['crop'] = list(crops.values())
    df_prod_i['prod'] = list(df[crop_names].iloc[i])
    df_prod_i['cell5m'] = df['cell5m'].iloc[i]
    df_prod_i['irrigation'] = 'rainfed'
    
    df_prod = pd.concat([df_prod, df_prod_i])

  0%|          | 231/832904 [00:55<55:54:58,  4.14it/s]

KeyboardInterrupt: 
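
The row-by-row loop above was interrupted because concatenating one small frame per pixel is very slow (the progress bar estimates more than 55 hours). A vectorized alternative is to reshape the wide table with `DataFrame.melt`. The sketch below uses a tiny made-up frame with only three crops; on the real data, `df`, `crops` and `crop_names` would be the objects defined in the cells above.

```python
import pandas as pd

# Small stand-in for the real objects: three crops and two pixels (made-up values).
crops = {"whea": "wheat", "rice": "rice", "maiz": "maize"}
df = pd.DataFrame({
    "cell5m": [1891479, 1895786],
    "whea_r": [0.0, 0.0],
    "rice_r": [0.0, 0.0],
    "maiz_r": [4.3, 0.0],
})
crop_names = [x + "_r" for x in crops]

# Wide -> long: one row per (pixel, crop) instead of one column per crop.
df_prod = df.melt(id_vars="cell5m", value_vars=crop_names,
                  var_name="crop", value_name="prod")

# Strip the '_r' suffix and map the SPAM short name to the long name.
df_prod["crop"] = df_prod["crop"].str[:-2].map(crops)
df_prod["irrigation"] = "rainfed"

# Same column order as the loop version, grouped by pixel.
df_prod = (df_prod[["cell5m", "crop", "irrigation", "prod"]]
           .sort_values("cell5m", kind="stable")
           .reset_index(drop=True))
print(df_prod)
```

Because `melt` operates on whole columns, it avoids the repeated `pd.concat` calls that made the loop slow.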

In [58]:
df_prod

| |cell5m|crop|irrigation|prod|
|:---|:---|:---|:---|:---|
|0|1891479|wheat|rainfed|0.0|
|1|1891479|rice|rainfed|0.0|
|2|1891479|maize|rainfed|4.3|
|3|1891479|barley|rainfed|0.0|
|4|1891479|pearl millet|rainfed|0.0|
|5|1891479|small millet|rainfed|0.0|
|6|1891479|sorghum|rainfed|0.0|
|7|1891479|other cereals|rainfed|1.5|
|8|1891479|potato|rainfed|435.8|
|9|1891479|sweet potato|rainfed|0.0|


**SPAM 2005 v3.2 Global Data**

In [None]:
df = pd.read_csv('/Users/ikersanchez/Downloads/spam2005v3r2_global_harv_area.csv/spam2005V3r2_global_H_TI.csv', sep=',', index_col=False, encoding='ISO-8859-1')

In [None]:
df.head()

In [None]:
df.columns.values

In [None]:
dbf = Dbf5('/Users/ikersanchez/Downloads/spam2005v3r2_global_yield.dbf/spam2005V3r2_global_Y_TA.DBF', codec='ISO-8859-1')
df = dbf.to_dataframe()

In [None]:
df.head()

In [None]:
df['REC_TYPE'].nunique()