# Import packages

A package (or library) bundles functions that are useful in a particular context. Here we import `pandas`, the ubiquitous package for manipulating data sets.

Other packages that are frequently used in data science:
- `numpy`: for manipulating numerical arrays and matrices (`pandas` uses it internally)
- `matplotlib`: for plotting graphs
- `sklearn` (scikit-learn): provides many machine learning functions and models

In [1]:
# import packages
import pandas as pd

# Set some Jupyter display options
from IPython.display import display
pd.options.display.max_columns = None

# Load the list of stations

In this part, we use Pandas to load the data file into our notebook. The data is stored in a data frame called `stations`.

Look at an excerpt of the file to get an idea of how it is formatted.

In [2]:
# A line that begins with `!` tells Jupyter to execute a shell command.
# Here `head` is a Linux command to display the first few lines of a text file
!head data/stations.csv

Identifiant;Geo Point;Marque;Nom
21360004;47.37861,4.38168;Avia;AVIA Creux Moreau
27600003;49.14759,1.33189;Elan;GARAGE POUPARDIN
27310003;49.35606,0.81532;Carrefour Market;Carrefour Market
29200009;48.40238,-4.4791;Intermarché;INTERMARCHE BREST CEDEX 2
29420001;48.62827,-3.98804;Système U;Super U PLOUENAN
5600005;44.67646,6.63422;Intermarché;INTERMARCHE
12310001;44.39367,2.77776;Total;MR MAJOREL
42510001;45.82099,4.17891;Carrefour Market;Carrefour Market
42450001;45.54196,4.18553;Elan;SARL GARAGE BARCET


In [3]:
# Read the file in memory as a data frame
stations = pd.read_csv('data/stations.csv', sep='''###CODE HERE###''')

In [4]:
# Check the first few records
stations.head()

Unnamed: 0,Identifiant,Geo Point,Marque,Nom
0,21360004,"47.37861,4.38168",Avia,AVIA Creux Moreau
1,27600003,"49.14759,1.33189",Elan,GARAGE POUPARDIN
2,27310003,"49.35606,0.81532",Carrefour Market,Carrefour Market
3,29200009,"48.40238,-4.4791",Intermarché,INTERMARCHE BREST CEDEX 2
4,29420001,"48.62827,-3.98804",Système U,Super U PLOUENAN


In the table above, we notice:
- the header with the list of fields (or columns) making up the data frame: *Identifiant, Geo Point, Marque, Nom*
- the *index*, which in this case is akin to the row number in Excel (except that it starts at 0). There was no such column in the file; Pandas added it automatically for us
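
To see the automatic index in isolation, here is a minimal sketch on made-up data. The `toy_csv` text, its `|` separator, and its column names are assumptions chosen for illustration; they are not part of the stations file. The sketch reads a small delimited text from memory and shows that Pandas numbers the rows starting at 0.

```python
import io

import pandas as pd

# Hypothetical three-line file held in memory; the '|' separator is
# chosen for illustration only.
toy_csv = "id|name\n10|alpha\n20|beta\n"

# `sep` tells read_csv which character separates the fields
toy = pd.read_csv(io.StringIO(toy_csv), sep="|")

print(list(toy.columns))  # the header row became the column names
print(list(toy.index))    # the index was generated automatically
```

The same call works on a file path instead of an in-memory buffer, as long as `sep` matches the separator actually used in the file.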

# Split the `Geo Point` field

You may have noticed that `Geo Point` is a pair of numbers (most probably *latitude, longitude*). Pandas read it as a string, but we want the two components separately.

```
'49.12625,6.27557'   -->   (49.12625, 6.27557)
```

For this, we are going to split the field in two, storing the result as a small data frame.

Before we split, here are a few operations on data frames we are going to use:
- `stations['Some Column']` is for taking the `Some Column` field from the data frame, yielding a _series_
- `series.str` means "Let's say the series is made of strings, so we can apply stringy functions on each value"
- `series.str.split()` is the text splitting function. We give it a separator, and the `expand` parameter tells Pandas to make a data frame with one column per split component
- `stations.columns` returns the names of the fields making up the data frame. We can also set the names, with the `=` operator
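
The operations listed above can be tried on a toy series first. This is only a sketch: the `sizes` series, the `-` separator, and the column names `width` and `height` are hypothetical and unrelated to the stations data.

```python
import pandas as pd

# Hypothetical series of 'width-height' strings
sizes = pd.Series(["3-4", "10-20", "7-1"])

# Split each string on '-' and expand the pieces into separate columns
parts = sizes.str.split("-", expand=True)

# The new columns are numbered 0, 1 by default; rename them by assignment
parts.columns = ["width", "height"]

print(parts)
```

Note that the pieces are still strings after the split; splitting does not convert them to numbers.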

In [5]:
# Create a smaller data frame from the split strings
geo_points = stations['Geo Point'].str.split(',', expand=True)
geo_points.columns = ['''###CODE HERE###''']

In [6]:
# Check we have split right
geo_points.head()

Unnamed: 0,Latitude,Longitude
0,47.37861,4.38168
1,49.14759,1.33189
2,49.35606,0.81532
3,48.40238,-4.4791
4,48.62827,-3.98804


In [7]:
# How does it look?
geo_points.dtypes

Latitude     object
Longitude    object
dtype: object

`.dtypes` gives the *data types* of the columns in the data frame. Here are a few:
- `int64` means integers
- `float64` means floating-point (decimal) numbers
- `datetime64` means timestamps (dates with times)
- `object` means anything else -- usually strings
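
As a small illustration of why the data type matters, the sketch below builds a hypothetical frame (`readings`, with made-up columns `x` and `y`) whose numbers are stored as text, then converts it with `.astype(float)` so that arithmetic works as expected.

```python
import pandas as pd

# Hypothetical frame whose numbers were read as text
readings = pd.DataFrame({"x": ["1.5", "2.25"], "y": ["-3", "4"]})

# Text cannot be summed as numbers, so convert every column to float first
converted = readings.astype(float)

print(converted.dtypes)      # both columns are now float64
print(converted["x"].sum())  # numeric sum of the x column
```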

In [8]:
# We must convert the cells of the data frame into the correct type
geo_points = geo_points.astype(float)

In [9]:
# Let's check
geo_points.dtypes

Latitude     float64
Longitude    float64
dtype: object

# Inject the coordinates back into the original data frame

Now we have 2 data frames: the original one, and the smaller one with the coordinates after splitting. We want to put them together and get a "complete" data frame.

There are several ways to do this; we are going to use a Pandas operation called **merge** (if you're familiar with SQL, it works much like a join).

To merge data frames, we need a piece of information that is common to both, so that Pandas can match rows from the two sides. Here, the most appropriate information is not a field, it's the **index**.

Recall how we computed `geo_points` from `stations`, by applying a function (`str.split()`) on each row? It turns out that Pandas has copied the values of the index into the new data frame: each row from `geo_points` can be traced back to its original from `stations`.

In [10]:
# In case you don't believe me :)
# `.index` yields the index values of a data frame
# The result indicates that *all* values are identical
(geo_points.index == stations.index).all()

True

Now we do the merge, asking Pandas to match on the indices of both sides. While we're at it, we remove the column that holds the two coordinates together as a string, since we don't need it any more. We tell `.drop()` that the label we want to remove lies on axis 1, the axis for columns.
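
Here is a minimal sketch of an index-based merge followed by a column drop, on toy data. The frames `left` and `right` and their column names are hypothetical; only `pd.merge`, its `left_index`/`right_index` parameters, and `.drop()` come from the text above.

```python
import pandas as pd

# Two hypothetical frames sharing the same index values (0, 1, 2)
left = pd.DataFrame({"name": ["a", "b", "c"], "raw": ["1,2", "3,4", "5,6"]})
right = pd.DataFrame({"first": [1.0, 3.0, 5.0], "second": [2.0, 4.0, 6.0]})

# Match rows by index on both sides, then drop the column we no longer need
merged = pd.merge(left, right, left_index=True, right_index=True).drop("raw", axis=1)

print(merged)
```

Each row of `merged` combines the rows of `left` and `right` that carry the same index value.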

In [15]:
stations_with_coordinates = pd.merge('''###CODE HERE###''', '''###CODE HERE###''', left_index=True, right_index=True).drop('Geo Point', axis=1)

In [16]:
# Always take a look at the result
stations_with_coordinates.head()

Unnamed: 0,Identifiant,Marque,Nom,Latitude,Longitude
0,21360004,Avia,AVIA Creux Moreau,47.37861,4.38168
1,27600003,Elan,GARAGE POUPARDIN,49.14759,1.33189
2,27310003,Carrefour Market,Carrefour Market,49.35606,0.81532
3,29200009,Intermarché,INTERMARCHE BREST CEDEX 2,48.40238,-4.4791
4,29420001,Système U,Super U PLOUENAN,48.62827,-3.98804


In [17]:
# One last check!
stations_with_coordinates.dtypes

Identifiant      int64
Marque          object
Nom             object
Latitude       float64
Longitude      float64
dtype: object

# Save the result

Now that we have enhanced our data frame, we save it to a new CSV file. The `pd.read_csv()` function from the beginning has a counterpart, the `.to_csv()` method of data frames. One notable difference is the `index` argument, which we set to `False` because we don't want the index of the data frame in the file.
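
To see what the `index` argument changes, the sketch below writes a hypothetical two-row frame (`toy`, with made-up columns `a` and `b`) to CSV text twice. When no path is given, `to_csv()` returns the text instead of writing a file, which makes the comparison easy.

```python
import pandas as pd

# Hypothetical two-row frame
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# With no path given, to_csv returns the CSV text instead of writing a file
with_index = toy.to_csv(sep=";")
without_index = toy.to_csv(sep=";", index=False)

print(with_index.splitlines()[0])     # header starts with an empty index field
print(without_index.splitlines()[0])  # header holds only the column names
```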

In [18]:
stations_with_coordinates.to_csv('output/stations_with_coordinates.csv', sep=';', index=False)

In [19]:
!head output/stations_with_coordinates.csv

Identifiant;Marque;Nom;Latitude;Longitude
21360004;Avia;AVIA Creux Moreau;47.37861;4.38168
27600003;Elan;GARAGE POUPARDIN;49.14759;1.33189
27310003;Carrefour Market;Carrefour Market;49.35606;0.81532
29200009;Intermarché;INTERMARCHE BREST CEDEX 2;48.40238;-4.4791
29420001;Système U;Super U PLOUENAN;48.62827;-3.98804
5600005;Intermarché;INTERMARCHE;44.67646;6.63422
12310001;Total;MR MAJOREL;44.39367;2.77776
42510001;Carrefour Market;Carrefour Market;45.82099;4.17891
42450001;Elan;SARL GARAGE BARCET;45.54196;4.18553
