# Exploring the datasets

After discussing our project in [readme.ipynb](readme.ipynb), it's now time to obtain the data and transform it as needed. We will then proceed with some data exploration to get familiar with the dataset.


## Importing the libraries

In [1]:
import folium
import json
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import requests
import seaborn as sns
sns.set()  # use seaborn's default theme
import shapefile as shp

from IPython.display import HTML, IFrame, Image
from sklearn.cluster import KMeans

## Getting the relevant GeoJSON data

The <code>SHP</code> file containing the data on the DOCG zones in the Veneto region has been made freely available by DatiOpen at this [link](http://www.datiopen.it/it/opendata/Regione_Veneto_Zone_Denominazione_Origine_Controllata_Garantita).

A SHP (shapefile) is a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (areas). The workspace containing shapefiles may also contain dBASE tables, which can store additional attributes that can be joined to a shapefile's features.
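
Because a shapefile is usually spread over several files that share a base name, it can help to check which companion files are present before reading it. The snippet below is a minimal sketch using only the standard library; the helper name `list_shapefile_parts` and the list of extensions it checks are assumptions made for illustration and are not part of the original notebook.

```python
from pathlib import Path

# Extensions conventionally found next to a .shp file:
# .shx (shape index), .dbf (dBASE attribute table), .prj (projection, optional).
SIDECAR_EXTENSIONS = (".shp", ".shx", ".dbf", ".prj")


def list_shapefile_parts(shp_path):
    """Map each expected extension to True/False depending on whether that file exists."""
    path = Path(shp_path)
    # with_suffix() swaps only the last suffix, so 'Zone_DOCG.shp' -> 'Zone_DOCG.dbf'.
    return {ext: path.with_suffix(ext).exists() for ext in SIDECAR_EXTENSIONS}
```

Calling `list_shapefile_parts('data/shapefile_data/Zone_DOCG.shp')` would report which of these files sit next to the shapefile used in this notebook.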

In Python, we can work with the shapefile using the <code>pyshp</code> library. We will then need to convert the file to <code>geojson</code> format in order to use it with <code>folium</code>.

In [2]:
shp_path = 'data/shapefile_data/Zone_DOCG.shp'

sf = shp.Reader(shp_path)

We then create a function to read the shapefile into a pandas <code>DataFrame</code>.

In [3]:
def read_shapefile(sf):
    """
    Read a shapefile into a pandas DataFrame with a 'coords'
    column holding the geometry information. This uses the pyshp
    package.
    """
    # The first entry of sf.fields is pyshp's 'DeletionFlag' marker, so skip it.
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    # One list of (x, y) points per shape.
    shps = [s.points for s in sf.shapes()]
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps)
    return df

In [4]:
df = read_shapefile(sf)
df.head()

| | denominazi | codice | zona | atto | GU | coords |
|---|---|---|---|---|---|---|
| 0 | RECIOTO SOAVE CLASSICO | A021 | A | D.M. 18/11/2010 | n. 283 del 03/12/2010 | [(11.252029507865142, 45.41758433331832), (11.... |
| 1 | RECIOTO SOAVE | A021 | X | D.M. 18/11/2010 | n. 283 del 03/12/2010 | [(11.207064614961942, 45.4507371295929), (11.2... |
| 2 | BARDOLINO SUPERIORE CLASSICO | A025 | A | D. M. 08/01/2011 | n. 275 del 25/11/2011 | [(10.794778650134427, 45.518760038125784), (10... |
| 3 | BARDOLINO SUPERIORE | A025 | X | D. M. 08/01/2011 | n. 275 del 25/11/2011 | [(10.843049258063267, 45.43160165449561), (10.... |
| 4 | SOAVE SUPERIORE CLASSICO | A026 | A | D.M. 18/11/2010 | n. 283 del 03/12/2010 | [(11.252029507865142, 45.41758433331832), (11.... |
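
Each entry in the `coords` column is a list of coordinate pairs, which for this dataset appear to be (longitude, latitude). As a rough sketch of how that geometry could be summarised, the hypothetical helper `bounding_box` below (not part of the original notebook) computes the extent of one such list. The two sample vertices are copied from the first two rows of the output above and serve only as toy input.

```python
def bounding_box(points):
    """Return (min_lon, min_lat, max_lon, max_lat) for a list of (lon, lat) tuples."""
    if not points:
        raise ValueError("points must not be empty")
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return min(lons), min(lats), max(lons), max(lats)


# Toy input: one vertex from row 0 and one from row 1 of the table above.
sample = [(11.252029507865142, 45.41758433331832),
          (11.207064614961942, 45.4507371295929)]
print(bounding_box(sample))
```

On the real data, something like `df['coords'].apply(bounding_box)` would give one bounding box per zone, assuming every row holds a non-empty list of points.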


Next, we define a function that converts the shapefile to GeoJSON and writes the result to disk.

In [6]:
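def shape2json(fname, outfile="veneto_docg.json", country='Italy'):
    """
    Convert the shapefile at `fname` into a GeoJSON FeatureCollection
    written to `outfile`. The `country` argument is currently unused.
    """
    reader = shp.Reader(fname)

    # Skip the first field, which is pyshp's 'DeletionFlag' marker.
    fields = reader.fields[1:]
    field_names = [field[0] for field in fields]
    buffer = []
    for sr in reader.shapeRecords():
        # Pair each attribute name with its value for this record.
        atr = dict(zip(field_names, sr.record))
        geom = sr.shape.__geo_interface__
        buffer.append(dict(type="Feature",
                           geometry=geom, properties=atr))

    with open(outfile, "w") as geojson:
        geojson.write(json.dumps({"type": "FeatureCollection",
                                  "features": buffer}, indent=2) + "\n")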

shape2json(shp_path, outfile="data/veneto_docg.json", country='Italy')
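
Before handing the converted file to a mapping library, a quick sanity check on its structure can catch a failed conversion early. The following is a minimal sketch using only the standard library; the function name `summarise_geojson` is hypothetical, and it assumes the file is a GeoJSON FeatureCollection whose features each carry a `geometry` object, as written by `shape2json` above.

```python
import json
from collections import Counter


def summarise_geojson(path):
    """Load a GeoJSON FeatureCollection and count its features by geometry type."""
    with open(path, encoding="utf-8") as fh:
        collection = json.load(fh)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a FeatureCollection")
    # Tally geometry types such as 'Polygon' or 'MultiPolygon'.
    return Counter(f["geometry"]["type"] for f in collection["features"])
```

For example, `summarise_geojson("data/veneto_docg.json")` would return a counter of geometry types, whose total should match the number of rows in the DataFrame built earlier.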