### Access your data in S3 through EC2
This notebook shows how to access files in S3 from an EC2 instance and use pandas to analyze Citi Bike trip data for New York City. You can download the Citi Bike trip data from https://s3.amazonaws.com/tripdata/index.html

In [1]:
!pip install boto3
!pip install smart_open



#### Check all available buckets in the S3

In [2]:
import boto3, os


# Configure your AWS credentials (see the AWS documentation for how to find your access key ID and secret access key)
os.environ["AWS_ACCESS_KEY_ID"] = "your_access_key_id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret_access_key"

# Create S3 resource
s3 = boto3.resource('s3')

for bucket in s3.buckets.all():
    print(bucket.name)

aws-logs-339851289029-us-east-1
bsds-ec
my-first-emr-cluster
www.urbanspatial.info
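
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. boto3 reads the same two environment variables if they are already set before the notebook starts, and on EC2 it can also pick up credentials from an IAM role attached to the instance. The sketch below only checks whether the variables are present; `missing_aws_credentials` is a hypothetical helper name, not part of boto3.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def missing_aws_credentials(env=None):
    """Return the names of the credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_aws_credentials()
if missing:
    # Not necessarily an error: an IAM role or a shared credentials file may be in use
    print("Not set in the environment:", missing)
else:
    print("Credentials found in the environment.")
```

If the variables are missing and no IAM role or credentials file is configured, the bucket listing above would fail with a credentials error.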


#### List the files in the bucket (replace `my-first-emr-cluster` with your own bucket name)

In [3]:
# list every CSV file in the S3 bucket
import pandas as pd

for obj in s3.Bucket(name='my-first-emr-cluster').objects.all():
    if obj.key.endswith('.csv'):
        print(obj.key)

2013-07 - Citi Bike trip data.csv
2013-08 - Citi Bike trip data.csv
2013-09 - Citi Bike trip data.csv
2013-10 - Citi Bike trip data.csv
2013-11 - Citi Bike trip data.csv
2013-12 - Citi Bike trip data.csv
2014-01 - Citi Bike trip data.csv
2014-02 - Citi Bike trip data.csv
2014-03 - Citi Bike trip data.csv
2014-04 - Citi Bike trip data.csv
2014-05 - Citi Bike trip data.csv
2014-06 - Citi Bike trip data.csv
2014-07 - Citi Bike trip data.csv
2014-08 - Citi Bike trip data.csv
201409-citibike-tripdata.csv
patterns-part1.csv
tutorialEMR/Condom_distribution_sites.csv
uber-raw-data-apr14.csv
uber-raw-data-aug14.csv
uber-raw-data-jul14.csv
uber-raw-data-jun14.csv
uber-raw-data-may14.csv
uber-raw-data-sep14.csv
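
The bucket above mixes Citi Bike files with other datasets, so it can help to narrow the key list before reading anything. The following is a minimal sketch in plain Python; `select_keys` is a hypothetical helper, and the key list is a hand-copied subset of the output above. With boto3 the keys would come from `obj.key` in the loop, and `objects.filter(Prefix=...)` can be used instead when the files share a common prefix.

```python
def select_keys(keys, contains="", suffix=".csv"):
    """Keep keys that end with `suffix` and contain `contains` (case-insensitive)."""
    needle = contains.lower()
    ending = suffix.lower()
    return [k for k in keys if k.lower().endswith(ending) and needle in k.lower()]

keys = [
    "2014-03 - Citi Bike trip data.csv",
    "201409-citibike-tripdata.csv",
    "uber-raw-data-apr14.csv",
    "tutorialEMR/Condom_distribution_sites.csv",
]

# Keep only the Citi Bike CSV files
bike_keys = select_keys(keys, contains="citi")
print(bike_keys)
```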


#### Open a file in S3 from the EC2 instance
Find the file in your S3 bucket and build its path from the bucket name and the object key.

In [18]:
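# Note: newer smart_open releases deprecate smart_open() in favor of smart_open.open()
from smart_open import smart_open

# path format: <scheme>://<bucket name>/<object key>
path = "s3n://my-first-emr-cluster/2014-03 - Citi Bike trip data.csv"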

df = pd.read_csv(smart_open(path))

In [5]:
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,949,2014-03-01 00:00:16,2014-03-01 00:16:05,317,E 6 St & Avenue B,40.724537,-73.981854,284,Greenwich Ave & 8 Ave,40.739017,-74.002638,17440,Subscriber,1942,1
1,533,2014-03-01 00:00:57,2014-03-01 00:09:50,457,Broadway & W 58 St,40.766953,-73.981693,441,E 52 St & 2 Ave,40.756014,-73.967416,20855,Subscriber,1960,1
2,122,2014-03-01 00:01:06,2014-03-01 00:03:08,146,Hudson St & Reade St,40.71625,-74.009106,276,Duane St & Greenwich St,40.717488,-74.010455,15822,Subscriber,1984,1
3,134,2014-03-01 00:01:14,2014-03-01 00:03:28,146,Hudson St & Reade St,40.71625,-74.009106,276,Duane St & Greenwich St,40.717488,-74.010455,17793,Subscriber,1985,1
4,997,2014-03-01 00:01:18,2014-03-01 00:17:55,150,E 2 St & Avenue C,40.720874,-73.980858,461,E 20 St & 2 Ave,40.735877,-73.98205,20756,Subscriber,1977,1
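
By default `read_csv` loads `starttime` and `stoptime` as plain strings. The sketch below, which uses three rows copied from the output above rather than the file in S3, shows how `parse_dates` turns them into timestamps and checks that `tripduration` matches the elapsed time in seconds. The names `sample_csv`, `sample`, and `elapsed` are illustrative only.

```python
import io
import pandas as pd

sample_csv = io.StringIO(
    "tripduration,starttime,stoptime\n"
    "949,2014-03-01 00:00:16,2014-03-01 00:16:05\n"
    "533,2014-03-01 00:00:57,2014-03-01 00:09:50\n"
    "122,2014-03-01 00:01:06,2014-03-01 00:03:08\n"
)

# parse_dates converts the listed columns to datetime values while reading
sample = pd.read_csv(sample_csv, parse_dates=["starttime", "stoptime"])

# Recompute the duration in seconds from the two timestamps
elapsed = (sample["stoptime"] - sample["starttime"]).dt.total_seconds()

print(sample.dtypes)
print((elapsed == sample["tripduration"]).all())
```

The same `parse_dates` argument can be passed to the `read_csv` call that reads from S3.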


### Rename the columns of the dataframe
Removing the spaces from the column names makes the columns easier to reference in the following steps.

In [6]:
df_bike = df.rename(columns={"start station id": "start_station_id", 
                             "start station name": "start_station_name",
                             "start station latitude": "start_station_lat",
                             "start station longitude": "start_station_lon", 
                             "end station id": "end_station_id", 
                             "end station name": "end_station_name", 
                             "end station latitude": "end_station_lat",
                             "end station longitude": "end_station_lon", 
                             "birth year": "birth_year"})
df_bike.head()

Unnamed: 0,tripduration,starttime,stoptime,start_station_id,start_station_name,start_station_lat,start_station_lon,end_station_id,end_station_name,end_station_lat,end_station_lon,bikeid,usertype,birth_year,gender
0,949,2014-03-01 00:00:16,2014-03-01 00:16:05,317,E 6 St & Avenue B,40.724537,-73.981854,284,Greenwich Ave & 8 Ave,40.739017,-74.002638,17440,Subscriber,1942,1
1,533,2014-03-01 00:00:57,2014-03-01 00:09:50,457,Broadway & W 58 St,40.766953,-73.981693,441,E 52 St & 2 Ave,40.756014,-73.967416,20855,Subscriber,1960,1
2,122,2014-03-01 00:01:06,2014-03-01 00:03:08,146,Hudson St & Reade St,40.71625,-74.009106,276,Duane St & Greenwich St,40.717488,-74.010455,15822,Subscriber,1984,1
3,134,2014-03-01 00:01:14,2014-03-01 00:03:28,146,Hudson St & Reade St,40.71625,-74.009106,276,Duane St & Greenwich St,40.717488,-74.010455,17793,Subscriber,1985,1
4,997,2014-03-01 00:01:18,2014-03-01 00:17:55,150,E 2 St & Avenue C,40.720874,-73.980858,461,E 20 St & 2 Ave,40.735877,-73.98205,20756,Subscriber,1977,1
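
When there are many columns, the spaces can also be replaced programmatically instead of listing every name. This is a sketch on an empty toy frame, not the notebook's data. Note that it keeps the full words (for example `start_station_latitude`), so it does not produce the shortened `_lat` and `_lon` names used in the rest of this notebook.

```python
import pandas as pd

toy = pd.DataFrame(columns=["tripduration", "start station id", "birth year"])

# rename accepts a function that is applied to every column name
toy = toy.rename(columns=lambda name: name.replace(" ", "_"))
print(list(toy.columns))
```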


### Count the number of trips that start and end at each station

In [7]:
start_station = df_bike.groupby('start_station_id').size().to_frame('size')
start_station

start_station_id,size
72,1433
79,1012
82,563
83,615
116,2628
...,...
2017,649
2021,1411
2022,800
2023,682


In [8]:
end_station = df_bike.groupby('end_station_id').size().to_frame('size')
end_station

end_station_id,size
72,1340
79,1039
82,583
83,636
116,2655
...,...
2017,542
2021,1382
2022,907
2023,683
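
`groupby(...).size()` counts the rows in each group. As a cross-check, `value_counts()` on the same column gives the same numbers. The sketch below uses a small made-up frame, so the counts are illustrative and not taken from the Citi Bike data.

```python
import pandas as pd

# Toy data: six trips from three stations
trips = pd.DataFrame({"start_station_id": [72, 79, 72, 116, 72, 79]})

by_groupby = trips.groupby("start_station_id").size()
# value_counts sorts by frequency, so sort by station id to compare
by_value_counts = trips["start_station_id"].value_counts().sort_index()

print(by_groupby.to_dict())
print(by_value_counts.to_dict())
```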


#### Rename the axis and prepare for merging
The start and end tables are both indexed by a station id, so we rename both index axes to `station_id`.

In [9]:
s_station = start_station.rename_axis("station_id")
e_station = end_station.rename_axis("station_id")

#### Merge the start and end station dataframes

In [10]:
SE_station = s_station.merge(e_station, on='station_id')

# rename the size columns to the number of trips that start and end at each station
SE_station = SE_station.rename(columns={"size_x": 'start', "size_y": "end"})
SE_station.head()

station_id,start,end
72,1433,1340
79,1012,1039
82,563,583
83,615,636
116,2628,2655
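
`merge` defaults to an inner join, so a station that appears only as a start (or only as an end) would be dropped from the merged table. If those stations should be kept, an outer join with the missing counts filled with zero is one option. This is a sketch on made-up counts, not the notebook's data.

```python
import pandas as pd

# Toy counts: station 79 only has starts, station 83 only has ends
starts = pd.DataFrame({"size": [3, 1]}, index=pd.Index([72, 79], name="station_id"))
ends = pd.DataFrame({"size": [2, 4]}, index=pd.Index([72, 83], name="station_id"))

# Outer join on the index keeps stations from both tables
both = starts.merge(ends, left_index=True, right_index=True,
                    how="outer", suffixes=("_start", "_end"))
both = (both.rename(columns={"size_start": "start", "size_end": "end"})
            .fillna(0)      # a missing count means no trips
            .astype(int))
print(both)
```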


#### Our table has no spatial information, so we need to join it with the original data, which has the coordinate information
Change the index name back to `start_station_id`; we will use it to join with the original table.

In [11]:
SE_station = SE_station.rename_axis("start_station_id")
SE_station

start_station_id,start,end
72,1433,1340
79,1012,1039
82,563,583
83,615,636
116,2628,2655
...,...,...
2017,649,542
2021,1411,1382
2022,800,907
2023,682,683


#### Extract the lon, lat of each station by dropping duplicate station rows, then merge with the table created above

In [13]:
df_bike_coord = df_bike.drop_duplicates(subset = ["start_station_id"])

df_bike_coord = df_bike_coord.merge(SE_station, on="start_station_id")
df_bike_coord

Unnamed: 0,tripduration,starttime,stoptime,start_station_id,start_station_name,start_station_lat,start_station_lon,end_station_id,end_station_name,end_station_lat,end_station_lon,bikeid,usertype,birth_year,gender,start,end
0,949,2014-03-01 00:00:16,2014-03-01 00:16:05,317,E 6 St & Avenue B,40.724537,-73.981854,284,Greenwich Ave & 8 Ave,40.739017,-74.002638,17440,Subscriber,1942,1,544,744
1,533,2014-03-01 00:00:57,2014-03-01 00:09:50,457,Broadway & W 58 St,40.766953,-73.981693,441,E 52 St & 2 Ave,40.756014,-73.967416,20855,Subscriber,1960,1,2182,1898
2,122,2014-03-01 00:01:06,2014-03-01 00:03:08,146,Hudson St & Reade St,40.716250,-74.009106,276,Duane St & Greenwich St,40.717488,-74.010455,15822,Subscriber,1984,1,880,872
3,997,2014-03-01 00:01:18,2014-03-01 00:17:55,150,E 2 St & Avenue C,40.720874,-73.980858,461,E 20 St & 2 Ave,40.735877,-73.982050,20756,Subscriber,1977,1,1841,1627
4,720,2014-03-01 00:01:27,2014-03-01 00:13:27,382,University Pl & E 14 St,40.734927,-73.992005,79,Franklin St & W Broadway,40.719116,-74.006667,19377,Subscriber,1983,1,2643,2943
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325,717,2014-03-02 16:04:08,2014-03-02 16:16:05,2005,Railroad Ave & Kay Ave,40.705312,-73.971001,271,Ashland Pl & Hanson Pl,40.685282,-73.978058,14850,Subscriber,1984,1,74,83
326,879,2014-03-02 19:47:56,2014-03-02 20:02:35,443,Bedford Ave & S 9th St,40.708531,-73.964090,430,York St & Jay St,40.701485,-73.986569,15590,Subscriber,1986,1,77,77
327,911,2014-03-03 12:26:07,2014-03-03 12:41:18,2001,7 Ave & Farragut St,40.698921,-73.973330,2005,Railroad Ave & Kay Ave,40.705312,-73.971001,18370,Subscriber,1990,1,118,106
328,1198,2014-03-05 08:28:03,2014-03-05 08:48:01,431,Hanover Pl & Livingston St,40.688646,-73.982634,395,Bond St & Schermerhorn St,40.688070,-73.984106,17784,Subscriber,1970,1,79,83
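
The merged table above still carries trip-specific columns (duration, times, end station, rider details) from whichever trip happened to be the first one for each station. A possible refinement is to keep only the station attributes before dropping duplicates. The sketch below uses a toy frame; the first two rows reuse values from the output above and the third row is made up.

```python
import pandas as pd

trips = pd.DataFrame({
    "start_station_id": [317, 457, 317],
    "start_station_name": ["E 6 St & Avenue B", "Broadway & W 58 St", "E 6 St & Avenue B"],
    "start_station_lat": [40.724537, 40.766953, 40.724537],
    "start_station_lon": [-73.981854, -73.981693, -73.981854],
    "tripduration": [949, 533, 600],  # last value is invented for the example
})
station_cols = ["start_station_id", "start_station_name",
                "start_station_lat", "start_station_lon"]

# One row per station, keeping only the station attributes
stations = trips[station_cols].drop_duplicates(subset=["start_station_id"])
print(stations)
```

The resulting `stations` table can then be merged with the start and end counts in the same way as above.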


#### Create a GeoDataFrame from the lon, lat info in the dataframe

In [48]:
import geopandas as gpd

# df_bike_coord has one row per start_station_id, so build the points
# from the start-station coordinates (not the end-station ones)
points_gdf = gpd.GeoDataFrame(
    df_bike_coord,
    crs="EPSG:4326",  # WGS84 longitude/latitude
    geometry=gpd.points_from_xy(df_bike_coord.start_station_lon,
                                df_bike_coord.start_station_lat))

# save the geo-dataframe as a GeoJSON file
points_gdf.to_file("stations.json", driver="GeoJSON")
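
To make the exported format less of a black box, here is a hand-built GeoJSON point feature using only the standard library. It is an illustration of the file structure, not what geopandas does internally, and the property names are chosen for the example. The values come from the first row of the merged table above.

```python
import json

def station_feature(station_id, lon, lat, start, end):
    """Build one GeoJSON Feature; coordinates are ordered [longitude, latitude]."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"station_id": station_id, "start": start, "end": end},
    }

collection = {
    "type": "FeatureCollection",
    "features": [station_feature(317, -73.981854, 40.724537, 544, 744)],
}

geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Note the coordinate order: GeoJSON stores longitude first, which matches the `points_from_xy(lon, lat)` argument order.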


### Copy your file from EC2 to your local computer
To transfer the notebook from EC2 to your local computer, run the following command in your local terminal (replace the host name with your own instance's public DNS):

`scp ubuntu@ec2-54-90-90-167.compute-1.amazonaws.com:/home/ubuntu/EC2_S3_trips.ipynb .`

To transfer the exported station file to your local computer:

`scp ubuntu@ec2-54-90-90-167.compute-1.amazonaws.com:/home/ubuntu/stations.* .`


### Homework:
Map the number of female cyclists (gender 2) older than 40 (birth year before 1971) who start and end trips at each station, using the Citi Bike data for August 2018. Upload the bike data to S3 and do your analysis on your EC2 instance. Submit the map and your notebook to Canvas. The Citi Bike data can be downloaded here: https://s3.amazonaws.com/tripdata/index.html
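
As a starting point for the filtering step, here is a sketch on a made-up frame. The column names `gender` and `birth_year` follow the renamed columns used in this notebook and are an assumption about the August 2018 file, so check the actual header first.

```python
import pandas as pd

# Toy rows, not real trips
toy = pd.DataFrame({
    "start_station_id": [72, 72, 79, 79],
    "gender": [2, 1, 2, 2],
    "birth_year": [1965, 1960, 1990, 1970],
})

# Combine conditions with & and wrap each one in parentheses
selected = toy[(toy["gender"] == 2) & (toy["birth_year"] < 1971)]
counts = selected.groupby("start_station_id").size()
print(counts)
```

The same mask can be applied before both the start and end station counts.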