## Problem 3: How far did people travel? (8 points)

In this task, the aim is to calculate the (air-line) distance in meters that each social media user in the data set prepared in *Problem 2* has travelled between their posts. We’re interested in the Euclidean distance between subsequent points generated by the same user.
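To make the idea of "distance between subsequent points" concrete, here is a minimal sketch using only the standard library. The coordinates are hypothetical projected values in meters, not taken from the data set.

```python
import math

# Hypothetical projected coordinates (x, y) in meters for one user's posts,
# already ordered from oldest to most recent.
posts = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]

# Sum the straight-line (Euclidean) distance of each consecutive pair of posts.
total = sum(math.dist(a, b) for a, b in zip(posts, posts[1:]))
print(total)  # 11.0
```

The two segments are 5 m and 6 m long, so the user travelled 11 m in total. This is the same quantity that the length of a line through the points gives.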

For this, we will need to use the `userid` column of the data set `kruger_points.shp` that we created in *Problem 2*.

Answer the following questions:
- What was the shortest distance a user travelled between all their posts (in meters)?
- What was the mean distance travelled per user (in meters)?
- What was the maximum distance a user travelled (in meters)?

---


### a) Read the input file and re-project it

- Read the input file `kruger_points.shp` into a GeoDataFrame called `kruger_points`
- Transform the data from WGS84 to `EPSG:32735` (UTM Zone 35S, suitable for South Africa). This CRS uses *meters* as units.
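A quick sanity check after re-projecting can be sketched as follows. It assumes a single hypothetical point built as `Point(longitude, latitude)`, since shapely expects `(x, y)` order. Valid UTM coordinates in the southern hemisphere have a positive easting and a northing between 0 and 10,000,000 m, so values outside that range can indicate that longitude and latitude were swapped when the points were created.

```python
import geopandas as gpd
from shapely.geometry import Point

# One hypothetical post near Kruger National Park; shapely expects (x, y),
# which for geographic coordinates means (longitude, latitude).
sample = gpd.GeoDataFrame(
    {"userid": [1]},
    geometry=[Point(31.484633, -24.980792)],
    crs="EPSG:4326",
)

# Re-project from WGS84 to UTM Zone 35S (units: meters).
projected = sample.to_crs(epsg=32735)
x = projected.geometry.iloc[0].x
y = projected.geometry.iloc[0].y

# Plausibility check: positive easting, northing between 0 and 10,000,000 m.
print(projected.crs.to_epsg(), x > 0, 0 < y < 10_000_000)
```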

In [15]:
# ADD YOUR OWN CODE HERE
import geopandas as gpd

# Read the input file kruger_points.shp into a GeoDataFrame
kruger_points = gpd.read_file("data/kruger_points.shp")

# Re-project from WGS84 to EPSG:32735 (UTM Zone 35S), which uses meters as units
kruger_points = kruger_points.to_crs(epsg=32735)


In [16]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
kruger_points.head()

| | lat | lon | timestamp | userid | geometry |
|---|---|---|---|---|---|
| 0 | -24.980792 | 31.484633 | 2015-07-07 03:02 | 66487960 | POINT (-4695752.719 14973674.275) |
| 1 | -25.499225 | 31.508906 | 2015-07-07 03:18 | 65281761 | POINT (-4748939.258 15014098.837) |
| 2 | -24.342578 | 30.930866 | 2015-03-07 03:38 | 90916112 | POINT (-4672729.591 14859391.193) |
| 3 | -24.854614 | 31.519718 | 2015-10-07 05:04 | 37959089 | POINT (-4679391.656 14969037.444) |
| 4 | -24.921069 | 31.520836 | 2015-10-07 05:19 | 27793716 | POINT (-4686373.982 14973910.589) |


In [17]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
import pyproj
assert kruger_points.crs == pyproj.CRS("EPSG:32735")

### b) Group the data by user id

Group the data by `userid` and store the grouped data in a variable `grouped_by_users`
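As a small illustration of what grouping produces, the sketch below uses a hypothetical miniature table with plain pandas. The user ids and the variable names `posts` and `posts_per_user` are made up for the example.

```python
import pandas as pd

# Hypothetical miniature version of the posts table.
posts = pd.DataFrame(
    {
        "userid": [10, 20, 10, 30, 20],
        "timestamp": [
            "2015-07-07 03:02",
            "2015-07-07 03:18",
            "2015-03-07 03:38",
            "2015-10-07 05:04",
            "2015-10-07 05:19",
        ],
    }
)

grouped = posts.groupby("userid")

# Iterating over the grouped object yields (key, sub-DataFrame) pairs,
# one pair per unique user id.
posts_per_user = {int(user_id): len(rows) for user_id, rows in grouped}
print(posts_per_user)  # {10: 2, 20: 2, 30: 1}
```

User 30 has only one post here, which is the kind of group that cannot form a line later on.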

In [20]:
# ADD YOUR OWN CODE HERE
grouped_by_users = kruger_points.groupby("userid")

In [21]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the number of groups:
assert len(grouped_by_users.groups) == kruger_points["userid"].nunique(), "Number of groups should match number of unique users!"

### c) Create `shapely.geometry.LineString` objects for each user connecting the points from oldest to most recent

There are multiple ways to solve this problem (see the [hints for this exercise](https://autogis-site.readthedocs.io/en/latest/lessons/lesson-2/exercise-2.html)). You can use, for instance, a dictionary or an empty GeoDataFrame to collect data that is generated using the steps below:

- Use a for-loop to iterate over the grouped object. For each user’s data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a `shapely.geometry.LineString` based on the user’s points

**CAREFUL**: Remember that every LineString needs at least two points. Skip users who have fewer than two posts.

Store the results in a `geopandas.GeoDataFrame` called `movements`, and remember to assign a CRS.
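The sorting step matters because the order of the points determines the shape and length of the line. The sketch below shows this with three hypothetical posts of a single user. Timestamps written as `YYYY-MM-DD HH:MM` strings sort chronologically even when compared as text.

```python
from shapely.geometry import LineString, Point

# Hypothetical posts of one user as (timestamp, point), deliberately unsorted.
posts = [
    ("2015-07-07 05:00", Point(0, 0)),
    ("2015-07-07 03:00", Point(10, 0)),
    ("2015-07-07 04:00", Point(5, 0)),
]

# Line through the points in their stored order.
unsorted_line = LineString([point for _, point in posts])

# Line through the points ordered from oldest to most recent post.
sorted_posts = sorted(posts, key=lambda post: post[0])
sorted_line = LineString([point for _, point in sorted_posts])

print(unsorted_line.length)  # 15.0
print(sorted_line.length)    # 10.0
```

Without sorting, the line doubles back on itself and overstates the distance travelled.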

In [55]:
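# ADD YOUR OWN CODE HERE
from shapely.geometry import LineString

# Collect one LineString per user in a dictionary of lists
movement = {"userid": [], "geometry": []}

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs
for user_id, user_posts in grouped_by_users:
    # Order the user's points from oldest to most recent post
    points = user_posts.sort_values(by="timestamp")["geometry"]
    # A LineString needs at least two points, so skip single-post users
    if len(points) >= 2:
        movement["userid"].append(user_id)
        movement["geometry"].append(LineString(points.tolist()))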


In [56]:
# Build the GeoDataFrame from the dictionary and assign the projected CRS
movements = gpd.GeoDataFrame(movement, geometry="geometry", crs="EPSG:32735")

In [57]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the result
print(type(movements))
print(movements.crs)

movements

<class 'geopandas.geodataframe.GeoDataFrame'>
EPSG:32735


| | userid | geometry |
|---|---|---|
| 0 | 16301 | LINESTRING (-4681550.088 14943799.279, -468323... |
| 1 | 45136 | LINESTRING (-4770692.23 14940874.449, -4770692... |
| 2 | 50136 | LINESTRING (-4680731.675 14947431.176, -468798... |
| 3 | 88775 | LINESTRING (-4773713.345 14938272.132, -477371... |
| 4 | 88918 | LINESTRING (-4699374.159 14988142.858, -468798... |
| ... | ... | ... |
| 9021 | 99921781 | LINESTRING (-4769482.3 14935355.642, -4769055.... |
| 9022 | 99936874 | LINESTRING (-4688007.089 14987931.595, -468803... |
| 9023 | 99964140 | LINESTRING (-4636958.081 14905786.11, -4636955... |
| 9024 | 99986933 | LINESTRING (-4638612.172 14901687.488, -463575... |


### d) Calculate the distance between all posts of a user

- Check once more that the CRS of the data frame is correct
- Compute the lengths of the lines and store them in a new column called `distance`
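As a hedged illustration of how line lengths behave, the sketch below builds two hypothetical lines in the same projected CRS. The second line connects two posts made from the same location, which is one way a distance of exactly zero can arise.

```python
import geopandas as gpd
from shapely.geometry import LineString

# Hypothetical movements in a metric CRS; the second user posted twice
# from the same location, which gives a zero-length line.
lines = gpd.GeoDataFrame(
    {"userid": [1, 2]},
    geometry=[
        LineString([(0, 0), (3, 4), (3, 10)]),
        LineString([(5, 5), (5, 5)]),
    ],
    crs="EPSG:32735",
)

# The length attribute is expressed in the units of the CRS (meters here).
lines["distance"] = lines.length
print(lines["distance"].tolist())  # [11.0, 0.0]
```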

In [60]:
# ADD YOUR OWN CODE HERE
import pyproj

# Check that the CRS is correct
assert movements.crs == pyproj.CRS("EPSG:32735")

# Compute the lengths of the lines and store them in a new column called distance
movements["distance"] = movements.length

In [61]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

| | userid | geometry | distance |
|---|---|---|---|
| 0 | 16301 | LINESTRING (-4681550.088 14943799.279, -468323... | 454640.989781 |
| 1 | 45136 | LINESTRING (-4770692.23 14940874.449, -4770692... | 0.0 |
| 2 | 50136 | LINESTRING (-4680731.675 14947431.176, -468798... | 205712.690132 |
| 3 | 88775 | LINESTRING (-4773713.345 14938272.132, -477371... | 0.095916 |
| 4 | 88918 | LINESTRING (-4699374.159 14988142.858, -468798... | 11388.30514 |


### e) Answer the original questions

You should now be able to quickly find answers to the following questions: 
- What was the shortest distance a user travelled between all their posts (in meters)? (store the value in a variable `shortest_distance`)
- What was the mean distance travelled per user (in meters)? (store the value in a variable `mean_distance`)
- What was the maximum distance a user travelled (in meters)? (store the value in a variable `longest_distance`)
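All three statistics can also be computed in one call. The sketch below uses a hypothetical `distance` column with made-up values. It also shows, as an optional extra, how to find the shortest non-zero distance when some users never moved between posts.

```python
import pandas as pd

# Hypothetical distances in meters, one value per user.
distance = pd.Series([0.0, 150.0, 450.0], name="distance")

# Minimum, mean and maximum in a single aggregation.
summary = distance.agg(["min", "mean", "max"])
print(summary)

# Optional: ignore users whose posts all share one location.
shortest_nonzero = distance[distance > 0].min()
print(shortest_nonzero)  # 150.0
```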

In [66]:
# ADD YOUR OWN CODE HERE
shortest_distance = movements['distance'].min()
mean_distance = movements['distance'].mean()
longest_distance = movements['distance'].max()
print('The shortest distance a user travelled between all their posts (in meters) is {}'.format(shortest_distance))
print('The average distance a user travelled between all their posts (in meters) is {}'.format(mean_distance))
print('The longest distance a user travelled between all their posts (in meters) is {}'.format(longest_distance))

The shortest distance a user travelled between all their posts (in meters) is 0.0
The average distance a user travelled between all their posts (in meters) is 138871.1419446001
The longest distance a user travelled between all their posts (in meters) is 8457917.497356469


### f) Save the movements in a file

Save the `movements` into a new Shapefile called `movements.shp` inside the `data` directory.

In [67]:
# ADD YOUR OWN CODE HERE
import pathlib
DATA_DIRECTORY = pathlib.Path("data")  # used by the test cell below
movements.to_file(DATA_DIRECTORY / "movements.shp")

In [69]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

assert (DATA_DIRECTORY / "movements.shp").exists()


---

# Fantastic job!

That’s all for this week! 