# Importing trees from a CSV

Step one: find and load the CSV file

In [1]:
ls ..

arboladolineal.csv  mendoza_trees/      venv/


In [2]:
data = open('../arboladolineal.csv').read()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 12: invalid continuation byte

This file clearly isn't UTF-8 - but then what is it?

We can read the data in as bytes by opening the file in binary mode (`'rb'`):

In [3]:
data = open('../arboladolineal.csv', 'rb').read()

In [4]:
data[:1000]

b'FID;Especie;\xdaltima modificaci\xf3n;Altura;Circunferencia tronco (cm);Di\xe1metro tronco;Inclinaci\xf3n;Longitud;Latitud\r\n0;Jacarand;6/5/2011 14:49;Medio (4 - 8 mts);600.000.000;Mediano;Nula (< 15\xf8);-68.840.414.843.900.000;-32.874.731.181.900.000\r\n1;Jacarand;6/5/2011 14:48;Medio (4 - 8 mts);560.000.000;Mediano;Nula (< 15\xf8);-68.840.399.524.299.900;-32.874.687.014.599.900\r\n2;Jacarand;6/5/2011 14:47;Medio (4 - 8 mts);500.000.000;Mediano;Nula (< 15\xf8);-68.840.393.443.200.000;-32.874.642.858.999.900\r\n3;Fresno europeo;2/3/2012 13:26;Bajo (2 - 4 mts);520.000.000;Mediano;Nula (< 15\xf8);-68.857.421.264.899.900;-32.895.994.448.300.000\r\n4;Fresno europeo;2/3/2012 13:25;Bajo (2 - 4 mts);480.000.000;Mediano;Nula (< 15\xf8);-68.857.416.222.900.000;-32.895.916.002.600.000\r\n5;Morera;2/3/2012 13:23;Medio (4 - 8 mts);700.000.000;Grande;Leve (15\xf8 - 30);-68.857.408.047.199.900;-32.895.857.598.900.000\r\n6;Fresno europeo;2/3/2012 13:21;Bajo (2 - 4 mts);600.000.000;Mediano;Nula (<

Since we don't know the file's character encoding, we can use the `chardet` library to make an educated guess:

In [5]:
import chardet

In [6]:
chardet.detect(data[:1000])

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

OK, let's try opening it as ISO-8859-1:

In [7]:
data = open('../arboladolineal.csv', encoding='ISO-8859-1').read()

In [8]:
data[:1000]

'FID;Especie;Última modificación;Altura;Circunferencia tronco (cm);Diámetro tronco;Inclinación;Longitud;Latitud\n0;Jacarand;6/5/2011 14:49;Medio (4 - 8 mts);600.000.000;Mediano;Nula (< 15ø);-68.840.414.843.900.000;-32.874.731.181.900.000\n1;Jacarand;6/5/2011 14:48;Medio (4 - 8 mts);560.000.000;Mediano;Nula (< 15ø);-68.840.399.524.299.900;-32.874.687.014.599.900\n2;Jacarand;6/5/2011 14:47;Medio (4 - 8 mts);500.000.000;Mediano;Nula (< 15ø);-68.840.393.443.200.000;-32.874.642.858.999.900\n3;Fresno europeo;2/3/2012 13:26;Bajo (2 - 4 mts);520.000.000;Mediano;Nula (< 15ø);-68.857.421.264.899.900;-32.895.994.448.300.000\n4;Fresno europeo;2/3/2012 13:25;Bajo (2 - 4 mts);480.000.000;Mediano;Nula (< 15ø);-68.857.416.222.900.000;-32.895.916.002.600.000\n5;Morera;2/3/2012 13:23;Medio (4 - 8 mts);700.000.000;Grande;Leve (15ø - 30);-68.857.408.047.199.900;-32.895.857.598.900.000\n6;Fresno europeo;2/3/2012 13:21;Bajo (2 - 4 mts);600.000.000;Mediano;Nula (< 15ø);-68.857.357.071.400.000;-32.895.826.568

That looks right - the accents appear to be in the correct place.
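
One way to sanity-check a guess like this is to decode a short sample with a few candidate encodings and compare the results by eye. The sketch below uses bytes copied from the raw output above; the list of candidate encodings is an assumption, not something the data tells us.

```python
# A short sample taken from the raw bytes shown earlier.
sample = b'\xdaltima modificaci\xf3n;Nula (< 15\xf8)'

# Candidate encodings to compare (an assumed shortlist).
candidates = ['utf-8', 'iso-8859-1', 'cp1252', 'cp437']

results = {}
for encoding in candidates:
    try:
        results[encoding] = sample.decode(encoding)
    except UnicodeDecodeError:
        # Record a failure instead of stopping at the first bad encoding.
        results[encoding] = None

for encoding, text in results.items():
    print(encoding, '->', repr(text))
```

In this sample ISO-8859-1 gives readable headers, but the `ø` in `15ø` looks like it was meant to be a degree sign. `cp437` decodes that byte as `°` while mangling the accented letters, so the file may mix encodings. That is a guess from the sample, and it doesn't affect the columns we import here.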

Now let's split the file into lines:

In [9]:
lines = data.split('\n')

In [10]:
lines[0]

'FID;Especie;Última modificación;Altura;Circunferencia tronco (cm);Diámetro tronco;Inclinación;Longitud;Latitud'

In [11]:
lines[1]

'0;Jacarand;6/5/2011 14:49;Medio (4 - 8 mts);600.000.000;Mediano;Nula (< 15ø);-68.840.414.843.900.000;-32.874.731.181.900.000'

In [12]:
lines[-1]

''

We only want lines that aren't blank. Let's strip any surrounding whitespace too:

In [13]:
lines = [l.strip() for l in lines if l.strip()]

In [14]:
lines[-1]

'48418;Fresno europeo;29/8/2011 10:19;Alto (> 8 mts);1.300.000.000;Grande;Leve (15ø - 30;-68.829.120.483.500.000;-32.868.771.131.599.900'

Despite the `.csv` extension, the fields are separated by semicolons rather than commas:

In [15]:
lines[1].split(';')

['0',
 'Jacarand',
 '6/5/2011 14:49',
 'Medio (4 - 8 mts)',
 '600.000.000',
 'Mediano',
 'Nula (< 15ø)',
 '-68.840.414.843.900.000',
 '-32.874.731.181.900.000']
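
As an alternative to splitting by hand, the standard library's `csv` module accepts a custom delimiter. The following is a minimal sketch, assuming the header and first data row shown above; it reads from an in-memory string rather than the real file.

```python
import csv
import io

# Header and first data row, copied from the decoded output above.
sample = (
    'FID;Especie;Última modificación;Altura;Circunferencia tronco (cm);'
    'Diámetro tronco;Inclinación;Longitud;Latitud\n'
    '0;Jacarand;6/5/2011 14:49;Medio (4 - 8 mts);600.000.000;Mediano;'
    'Nula (< 15ø);-68.840.414.843.900.000;-32.874.731.181.900.000\n'
)

# DictReader uses the first line as field names; delimiter=';' matches this file.
reader = csv.DictReader(io.StringIO(sample), delimiter=';')
rows = list(reader)

print(rows[0]['Especie'])
print(rows[0]['Latitud'])
```

With `DictReader`, columns are looked up by heading instead of by position, which makes it harder to swap latitude and longitude by accident.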

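Those latitudes and longitudes are in an odd format: a dot appears after every three digits, as if it were a thousands separator. Assuming only the first dot is the real decimal point, let's write a function to fix them.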

In [16]:
weird = '-68.840.414.843.900.000'

In [17]:
parts = weird.split('.')

In [18]:
parts[0]

'-68'

In [19]:
''.join(parts[1:])

'840414843900000'

In [20]:
def fix(value):
    parts = value.split('.')
    return float('{}.{}'.format(parts[0], ''.join(parts[1:])))

In [21]:
fix('-32.874.731.181.900.000')

-32.8747311819
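
The same idea can be written with `str.partition`, which also copes with a value that has no dot at all. This is a sketch only: `fix_coordinate` is a hypothetical name, and whether dot-free values ever occur in this file is an assumption.

```python
def fix_coordinate(value):
    """Treat the first dot as the decimal point and drop all later dots."""
    # partition splits on the first dot only: ('-68', '.', '840.414...')
    whole, _, fraction = value.strip().partition('.')
    # Fall back to '0' when there is no fractional part.
    return float('{}.{}'.format(whole, fraction.replace('.', '') or '0'))


longitude = fix_coordinate('-68.840.414.843.900.000')
latitude = fix_coordinate('-32.874.731.181.900.000')
no_dots = fix_coordinate('-68')

print(longitude, latitude, no_dots)
```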

## Import each row into a corresponding Django model

We have the raw pieces we need to read our CSV file. Let's create a Django Tree instance for every row.

First we'll import the Django models and demonstrate that they work. Thanks to `django_extensions` we don't need to do anything special to set up our import environment - everything just works:

In [23]:
from trees.models import Tree, Species

In [24]:
Tree.objects.count()

0

In [26]:
def load_row(row):
    """Given a row from the CSV, create a Tree in our database."""
    # The species name is the second column:
    species_name = row[1]
    # Latitude is the last column and longitude is the one before it:
    latitude = row[-1]
    longitude = row[-2]
    # We want one `Species` object for each unique species name.
    # Django's get_or_create is a convenient way of making these:
    species, created = Species.objects.get_or_create(
        name=species_name
    )
    # Create the tree instance:
    Tree.objects.create(
        species=species,
        latitude=fix(latitude),
        longitude=fix(longitude),
    )
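
Before writing anything to the database, it can help to check the parsing on its own. The sketch below repeats `fix` from earlier and adds a hypothetical `parse_line` helper that returns the same three values `load_row` uses; the two sample lines are copied from the output above.

```python
def fix(value):
    parts = value.split('.')
    return float('{}.{}'.format(parts[0], ''.join(parts[1:])))


def parse_line(line):
    """Return (species_name, latitude, longitude) for one data line."""
    row = line.split(';')
    # Same column positions as load_row: species second, longitude and latitude last.
    return row[1], fix(row[-1]), fix(row[-2])


sample_lines = [
    '0;Jacarand;6/5/2011 14:49;Medio (4 - 8 mts);600.000.000;Mediano;'
    'Nula (< 15ø);-68.840.414.843.900.000;-32.874.731.181.900.000',
    '3;Fresno europeo;2/3/2012 13:26;Bajo (2 - 4 mts);520.000.000;Mediano;'
    'Nula (< 15ø);-68.857.421.264.899.900;-32.895.994.448.300.000',
]

parsed = [parse_line(line) for line in sample_lines]
# The set of names is what get_or_create would turn into Species rows.
species_names = {name for name, _, _ in parsed}

print(parsed)
print(species_names)
```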

In [None]:
# The first line is the headings, so skip it:
for line in lines[1:]:
    row = line.split(';')
    load_row(row)

This step takes a few minutes. Once it's done, we can confirm that the trees have been created:

In [27]:
Tree.objects.count()

48419

In [28]:
Species.objects.count()

33

In [31]:
[(t.species.name, t.latitude, t.longitude) for t in Tree.objects.all()[:10]]

[('Jacarand', -32.8747311819, -68.8404148439),
 ('Jacarand', -32.8746870145999, -68.8403995242999),
 ('Jacarand', -32.8746428589999, -68.8403934432),
 ('Fresno europeo', -32.8959944483, -68.8574212648999),
 ('Fresno europeo', -32.8959160026, -68.8574162229),
 ('Morera', -32.8958575989, -68.8574080471999),
 ('Fresno europeo', -32.8958265686, -68.8573570714),
 ('Morera', -32.8958308492999, -68.8572981214),
 ('Morera', -32.895838834, -68.8572185353999),
 ('Morera', -32.8958462025, -68.8571426345999)]