# Turn exportable Bixi data (pdf) into usable csv format

PER CELL
- Import packages
- Create dictionary for station names to station codes
- Load the user's Bixi data (csv format, converted from pdf using Zamzar)
- Keep rows with start and end stations
- Remove NaNs
- For station names that are not present in the dictionary, compute the edit (Levenshtein) distance to every known station name and pick the closest
- Setup dataframe, fill the start/end station code and duration
- Save to csv

FURTHER USE
- Use filter-rides.ipynb to create [user]-filtered.csv, or immediately compute the danger-index

PROS
- Robust

CONS
- Discards the time at which each ride was taken
- Requires manually converting the pdf to a csv file
- Can we code something that does what Zamzar does?

In [19]:
# import dependencies
import pandas as pd
import numpy as np
import csv
import editdistance

In [20]:
# Create dictionary for station names to station codes
with open('../data/bixi/BixiMontrealRentals2018/Stations_2018.csv', mode='r') as f_in:
    reader = csv.reader(f_in)
    station_dict = {rows[1]:rows[0] for rows in reader}
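
If Stations_2018.csv begins with a header row, the comprehension above also stores that header as a name-to-code entry. A minimal sketch of a header-aware variant follows. The column names `code` and `name`, the in-memory sample, and the helper `load_station_dict` are assumptions for illustration; the two station codes are taken from the output further down, and the coordinates are placeholders.

```python
import csv
import io

# Hypothetical in-memory sample standing in for Stations_2018.csv.
# Assumption: the file has a header row with 'code' and 'name' columns.
# The coordinates are placeholders, not real values.
sample = io.StringIO(
    "code,name,latitude,longitude\n"
    "6213,Duluth / St-Laurent,0.0,0.0\n"
    "6065,de la Montagne / Sherbrooke,0.0,0.0\n"
)

def load_station_dict(f_in):
    """Map station name -> station code, using the header row for column names."""
    reader = csv.DictReader(f_in)
    return {row["name"]: row["code"] for row in reader}

station_dict = load_station_dict(sample)
print(station_dict["Duluth / St-Laurent"])
```

With `csv.DictReader` the header row is consumed as field names, so it never ends up as a dictionary entry.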

In [21]:
# Set username and import data file (csv). Use https://www.zamzar.com/convert/pdf-to-csv/ to create csv files
username = 'saad'
df_master = pd.read_csv('../data/bixi/%s.csv'%username,names = ["timestamp",'location','duration'],skiprows=7)

In [25]:
# Keep rows that say station names
df = df_master.drop(df_master[~(df_master['timestamp'].str.startswith('Start') | df_master['timestamp'].str.startswith('End'))].index)
df.reset_index(drop = True, inplace = True)
df.rename(columns = {'timestamp':'date'}, inplace = True)
df.head()


Unnamed: 0,date,location,duration
0,Start: 11/15/2018,Duluth / St-Laurent,11 min 10 s
1,End: 11/15/2018,de la Montagne / Sherbrooke,
2,Start: 11/15/2018,Milton / University,8 min 23 s
3,End: 11/15/2018,Duluth / St-Laurent,
4,Start: 11/15/2018,Mackay / de Maisonneuve,5 min 51 s
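
The `date` column still carries the `Start:` / `End:` prefix seen in the output above. As a sketch of how the two parts could be separated, assuming every entry looks like `Start: 11/15/2018` (month/day/year), a small helper might be written as follows. The name `parse_date_field` is hypothetical and not part of the notebook.

```python
from datetime import date, datetime

def parse_date_field(field):
    """Split e.g. 'Start: 11/15/2018' into ('Start', date(2018, 11, 15)).

    Assumes a 'Start'/'End' label, a colon, then a month/day/year date.
    """
    kind, _, raw_date = field.partition(":")
    parsed = datetime.strptime(raw_date.strip(), "%m/%d/%Y").date()
    return kind.strip(), parsed

print(parse_date_field("Start: 11/15/2018"))
```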


In [10]:
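# Remove rows with a missing station, together with their trip partner (start <-> end)
is_missing = df['location'].isna()
invalid_index = is_missing.copy()
# Series.iteritems() was removed in pandas 2.0; .items() works in old and new versions
for index, missing in is_missing.items():
    if missing:
        # Even rows are starts and odd rows are ends, so the partner is the adjacent row
        partner = index - 1 if index % 2 == 1 else index + 1
        if partner in invalid_index.index:
            invalid_index.loc[partner] = True

df.drop(df[invalid_index].index, inplace=True)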
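
The pairing rule used in this cell (rows 2k and 2k+1 form one trip, and a missing station on either side discards both) can be checked in isolation. The sketch below works on a plain list rather than a DataFrame, with `None` standing in for NaN. The helper `drop_incomplete_trips` is a hypothetical name used only for illustration.

```python
def drop_incomplete_trips(locations):
    """Return the indices to keep; rows 2k and 2k+1 form one trip.

    A trip is kept only if both its start and its end location are present.
    A trailing row without a partner is dropped.
    """
    keep = []
    for start in range(0, len(locations) - 1, 2):
        pair = (locations[start], locations[start + 1])
        if all(loc is not None for loc in pair):
            keep.extend([start, start + 1])
    return keep

locations = [
    "Duluth / St-Laurent", "de la Montagne / Sherbrooke",  # complete trip
    "Milton / University", None,                           # missing end
    None, "Mackay / de Maisonneuve",                       # missing start
]
print(drop_incomplete_trips(locations))  # [0, 1]
```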

In [11]:
# For entries that match no key in the dictionary, compute the Levenshtein distance to every key and pick the closest
for index, item in df['location'].items():
    if item in station_dict:
        continue
    # Start from infinity so the closest key is always found, however long the names are
    min_dist = float('inf')
    min_station = None
    for station_name in station_dict:
        dist = editdistance.eval(item, station_name)
        if dist < min_dist:
            min_dist = dist
            min_station = station_name
    df.loc[index, 'location'] = min_station
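
To make the matching step concrete without the `editdistance` dependency, here is a minimal pure-Python sketch of the same idea. The functions `levenshtein` and `closest_station` are stand-ins written for illustration, not the library's implementation, and the misspelled query is a made-up example.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def closest_station(name, station_names):
    """Return the known station name with the smallest edit distance to name."""
    return min(station_names, key=lambda known: levenshtein(name, known))

stations = ["Duluth / St-Laurent", "Milton / University", "Mackay / de Maisonneuve"]
print(closest_station("Duluth / St Laurent", stations))
```

Using `min` with a `key` function avoids the need for an initial distance threshold, so the closest name is found regardless of string length.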

In [12]:
# Create empty dataframe with headers corresponding to known format
with open('../data/bixi/BixiMontrealRentals2018/OD_2018-04.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)

df_full = pd.DataFrame(columns = header)

In [13]:
# Place the start/end station codes and the duration in df_full
for index, row in df.iterrows():
    trip = index // 2  # rows 2k (start) and 2k+1 (end) belong to trip k
    if index % 2 == 0:
        df_full.loc[trip, 'start_station_code'] = station_dict[row['location']]
        # Durations look like '11 min 10 s'
        parts = row['duration'].split()
        df_full.loc[trip, 'duration_sec'] = 60 * int(parts[0]) + int(parts[2])
    else:
        df_full.loc[trip, 'end_station_code'] = station_dict[row['location']]
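
The duration conversion in this cell relies on the string always having the shape `11 min 10 s`. A slightly more defensive sketch uses a regular expression instead of positional splitting. The helper `duration_to_seconds` is hypothetical, and the seconds-only and minutes-only forms it accepts are assumptions about what the export might contain, not formats observed in the data.

```python
import re

# Optional '<n> min' part followed by an optional '<n> s' part.
DURATION_RE = re.compile(r"^\s*(?:(\d+)\s*min)?\s*(?:(\d+)\s*s)?\s*$")

def duration_to_seconds(text):
    """Convert strings such as '11 min 10 s' to a number of seconds."""
    match = DURATION_RE.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"Unrecognised duration: {text!r}")
    minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return 60 * minutes + seconds

print(duration_to_seconds("11 min 10 s"))  # 670
```

Raising on unrecognised input makes malformed rows visible instead of silently producing a wrong duration.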


In [14]:
print(df_full)

    start_date start_station_code end_date end_station_code duration_sec  \
0          NaN               6213      NaN             6065          670   
1          NaN               6070      NaN             6213          503   
2          NaN               6100      NaN             7080          351   
3          NaN               6432      NaN             6100          527   
4          NaN               6065      NaN             6432          545   
..         ...                ...      ...              ...          ...   
302        NaN               6065      NaN             7084          105   
303        NaN               6070      NaN             6205           94   
304        NaN               6065      NaN             7084          120   
305        NaN               6070      NaN             6065          337   
306        NaN               6070      NaN             6205          148   

    is_member  
0         NaN  
1         NaN  
2         NaN  
3         NaN  
4      

In [15]:
# and save the file
df_full.to_csv('../data/bixi/%s-complete.csv'%username,index=False)

In [16]:
'''
Adding the time is too time-consuming (get it?). There are more important jobs to do. Surely one can find a way to robustly add the time to the user profile.

df_time = df_master.drop(df_master[~(df_master['timestamp'].str.contains('AM') | df_master['timestamp'].str.contains('PM'))].index)
df_time.reset_index(drop = True, inplace = True)
df_time.rename(columns = {'timestamp':'time'}, inplace = True)

''';

In [17]:
df_time[~df_time['location'].isna()];

NameError: name 'df_time' is not defined

In [18]:
'''Full empty dataframe with 
1. start_date & end_date (date + time)
2. start_station_code & end_station_code
3. duration_sec
''';