## Data cleaning of Citi Bike data

The dataset includes:
- trip duration (sec)
- start: time and date, station ID, lat/long
- stop: time and date, station ID, lat/long
- bike ID
- user type (Customer = 24-hr pass or 3-day pass; Subscriber = annual pass)
- user gender (0 = unknown; 1 = male; 2 = female)
- user year of birth
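
The coded fields above can be translated into readable labels when needed. The following is a minimal sketch, assuming a column named `gender` that holds the integer codes listed above; the sample values are made up for illustration and do not come from the Citi Bike files.

```python
import pandas as pd

# Hypothetical sample values; the real data is loaded in the cells below
sample = pd.DataFrame({"gender": [0, 1, 2, 1]})

# Code-to-label mapping taken from the field description above
gender_labels = {0: "unknown", 1: "male", 2: "female"}

# Series.map replaces each code with its label
sample["gender_label"] = sample["gender"].map(gender_labels)
print(sample)
```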

In [1]:
# Import dependencies
import pandas as pd
import os
import re

In [2]:
# Get a list of file names
path = "Data/"
files = os.listdir(path)

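# Choose 2018 data only
# r.match anchors at the start, so the file name must begin with "2018"
# ".*" = any characters between "2018" and the extension
# "\.csv$" = a literal ".csv" at the end of the string
r = re.compile(r"2018.*\.csv$")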

# Create a new list of csv files
newList = list(filter(r.match, files))
newList.sort() # sorts the names
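
Because `re.match` only matches at the beginning of a string, the filter above keeps names that start with `2018` and end with `.csv`. A minimal sketch of that behavior, using hypothetical file names (not the actual contents of `Data/`) and comparing it with `re.search`, which would also accept a prefixed name:

```python
import re

# Hypothetical file names, for illustration only
names = [
    "201801-citibike-tripdata.csv",
    "201712-citibike-tripdata.csv",
    "201802-citibike-tripdata.csv.zip",
    "JC-201803-citibike-tripdata.csv",
]

pattern = re.compile(r"2018.*\.csv$")

# match: "2018" must be at the very start of the name
matched = sorted(filter(pattern.match, names))

# search: "2018" may appear anywhere in the name
searched = sorted(filter(pattern.search, names))

print(matched)
print(searched)
```

If the folder ever contains prefixed files, `search` versus `match` changes which ones are picked up, so it is worth checking `newList` before reading the data.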

In [None]:
# Read each csv file into a dataframe and collect them in a list
files = [pd.read_csv(os.path.join(path, file)) for file in newList]

In [None]:
files[0].head()

In [None]:
# Concatenate dataframes into one dataframe
df = pd.concat(files, ignore_index = True)
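
The effect of `ignore_index = True` is easiest to see on small frames. This is a sketch with two made-up monthly frames (the names `jan` and `feb` and their values are hypothetical): without the flag, each frame keeps its own row labels, so labels repeat; with it, the result gets a fresh 0..n-1 index.

```python
import pandas as pd

# Two tiny hypothetical monthly frames
jan = pd.DataFrame({"tripduration": [300, 450]})
feb = pd.DataFrame({"tripduration": [600]})

# Original row labels are kept, so 0 appears twice
kept = pd.concat([jan, feb])

# Row labels are rebuilt as a single running index
combined = pd.concat([jan, feb], ignore_index=True)

print(kept.index.tolist())
print(combined.index.tolist())
```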

In [None]:
# Drop the "bikeid" column
df1 = df.drop("bikeid", axis = 1)

In [None]:
# Treat station IDs, birth year, and gender as strings (labels, not quantities)
str_cols = ["start station id", "end station id", "birth year", "gender"]
df1[str_cols] = df1[str_cols].astype(str)

# Convert dates from string to datetime format, one column at a time
# (pd.to_datetime on a single column returns a Series, so df1 stays a dataframe)
# Source: https://chrisalbon.com/python/basics/strings_to_datetime/
for col in ["starttime", "stoptime"]:
    df1[col] = pd.to_datetime(df1[col])

df1.head()
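
The timestamp conversion can be checked on a tiny hand-made frame. This is a sketch, assuming timestamps formatted like `2018-01-01 13:50:57.4340`; the sample row is invented for illustration. Each column is passed to `pd.to_datetime` on its own and assigned back, which leaves the frame intact and makes time arithmetic possible.

```python
import pandas as pd

# Hypothetical one-row sample with string timestamps
sample = pd.DataFrame({
    "starttime": ["2018-01-01 13:50:57.4340"],
    "stoptime": ["2018-01-01 14:07:08.1860"],
})

# Convert each column separately and assign it back to the frame
for col in ["starttime", "stoptime"]:
    sample[col] = pd.to_datetime(sample[col])

# With datetime columns, the trip length can be computed directly
elapsed = (sample["stoptime"] - sample["starttime"]).dt.total_seconds()

print(sample.dtypes)
print(elapsed)
```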

In [None]:
df1.dtypes