The data cleaning and merging process follows the 'Codebook from Data Source' post in the dataset's Discussion section, which can be found here: https://www.kaggle.com/benhamner/sf-bay-area-bike-share/discussion/23165

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime

In [None]:
trip = pd.read_csv('../input/trip.csv')
station = pd.read_csv('../input/station.csv')
weather = pd.read_csv('../input/weather.csv')

## Exploration

Show the first 5 rows of each of the three datasets.

In [None]:
trip.head()

In [None]:
station.head()

In [None]:
weather.head()

Find the unique identifier in the **trip** dataset.

In [None]:
result = trip.groupby('id')['start_date'].count().sort_values(ascending = False)
result.head()

The largest counts all equal 1, so no *id* value appears more than once: *id* is the unique identifier of the **trip** dataset.
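The same check can be written more directly. The sketch below is only an illustration on a made-up three-row table (`toy_trip` and its values are hypothetical, not taken from the dataset); it uses `Series.is_unique` and `Series.duplicated` instead of a group-by.

```python
import pandas as pd

# Toy stand-in for the trip table; the values are made up for illustration.
toy_trip = pd.DataFrame({
    'id': [101, 102, 103],
    'start_date': ['08/29/2013 09:08', '08/29/2013 09:24', '08/29/2013 09:25'],
})

# True when no id value repeats.
id_is_unique = toy_trip['id'].is_unique

# Number of rows whose id already appeared earlier in the table.
n_duplicate_ids = int(toy_trip['id'].duplicated().sum())

print(id_is_unique, n_duplicate_ids)
```

On the real data, `trip['id'].is_unique` should agree with the group-by result above.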

Count the missing values in each column of the three datasets.

In [None]:
trip.isnull().sum()

In [None]:
weather.isnull().sum()

In [None]:
station.isnull().sum()

## Merge Data

### Transform trip data
First, drop the *zip_code* column. As the 'Codebook from Data Source' post in the Discussion section describes, *zip_code* represents "Home zip code of subscriber (customers can choose to manually enter zip at kiosk however data is unreliable)". The column is therefore not reliable, and the analysis does not need customers' home zip codes.

In [None]:
df1 = trip.drop(columns = ['zip_code'])

Second, convert *start_date* and *end_date* to datetime objects. Then extract the calendar date (%Y-%m-%d) of *start_date* so it can be joined to the **weather** dataset later on.

In [None]:
# Transform start and end dates from strings to datetime objects
df1['start_date'] = df1['start_date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M'))
df1['end_date'] = df1['end_date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M'))
# Extract only year, month and day (time dropped) to join on weather data later on
df1['date_for_join'] = df1['start_date'].apply(lambda x: x.strftime('%Y-%m-%d'))
# Parse the date string back to a datetime so it matches the dtype of the weather date
df1['date_for_join'] = df1['date_for_join'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
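As an aside, the same transformation can probably be done without row-wise `apply`. The sketch below is an alternative, not the method used in this notebook; it runs on a made-up two-row frame (`toy_trip` is hypothetical) and uses `pd.to_datetime` with an explicit format plus `dt.normalize()` to set the time to midnight.

```python
import pandas as pd

# Toy stand-in for the trip table; the timestamps are made up for illustration.
toy_trip = pd.DataFrame({
    'start_date': ['08/29/2013 09:08', '12/01/2014 23:59'],
})

# Vectorized parsing with the same format string used above.
parsed = pd.to_datetime(toy_trip['start_date'], format='%m/%d/%Y %H:%M')

# Keep the calendar date only (time set to 00:00), still as a datetime column.
date_for_join = parsed.dt.normalize()

print(date_for_join)
```

The vectorized form is usually faster on large tables, but either version yields a datetime column suitable for the weather join.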

### Join trip with station

First, add a *zip_for_join* column to the **station** data before joining. This column links **station**.*city* to **weather**.*zip_code*. As the 'Codebook from Data Source' post in the Discussion section describes, "94107=San Francisco, 94063=Redwood City, 94301=Palo Alto, 94041=Mountain View, 95113= San Jose".

In [None]:
city_zip = pd.DataFrame({'city': ['San Jose', 'Redwood City', 'Mountain View', 'Palo Alto','San Francisco'], \
                         'zip_for_join': [95113,94063,94041,94301,94107]})
merge1 = station.merge(city_zip, how = 'left', left_on = 'city', right_on = 'city')
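A dictionary lookup is another way to attach the zip code, and it makes it easy to check that every city was matched. The sketch below is an illustration only: the city-to-zip pairs come from the codebook quote above, while `toy_station` and its station ids are made up.

```python
import pandas as pd

# City-to-zip pairs as quoted from the codebook.
city_to_zip = {
    'San Francisco': 94107,
    'Redwood City': 94063,
    'Palo Alto': 94301,
    'Mountain View': 94041,
    'San Jose': 95113,
}

# Toy stand-in for the station table; ids are hypothetical.
toy_station = pd.DataFrame({
    'id': [2, 50, 22],
    'city': ['San Jose', 'San Francisco', 'Redwood City'],
})

# Look up each city's zip; a city missing from the dict would become NaN.
toy_station['zip_for_join'] = toy_station['city'].map(city_to_zip)

# Count cities that could not be mapped (expected to be zero).
n_unmapped = int(toy_station['zip_for_join'].isnull().sum())

print(toy_station, n_unmapped)
```

A non-zero `n_unmapped` would signal a city name that is spelled differently from the codebook.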

Second, create **merge2** as a copy of **merge1** with *start_* column names, and join it to the transformed **trip** data to attach the station info of *start_station*. Then create **merge3** as another copy of **merge1** with *end_* column names to attach the station info of *end_station*.

In [None]:
merge2 = merge1.copy()
merge2.columns = ['start_station_id','start_name','start_lat','start_long','start_dock_count','start_city','start_installation_date','start_zip']

In [None]:
merge3 = merge1.copy()
merge3.columns =  ['end_station_id','end_name','end_lat','end_long','end_dock_count','end_city','end_installation_date','end_zip']

In [None]:
merge4 = df1.merge(merge2, how = 'left', left_on = 'start_station_id',right_on = 'start_station_id')

In [None]:
merge5 = merge4.merge(merge3,how = 'left', left_on = 'end_station_id',right_on = 'end_station_id' )

In [None]:
merge6 = merge5.drop(columns = ['start_name','end_name'])
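The two station joins follow the same pattern, so they could be expressed with a small helper. The sketch below is a hypothetical variant on made-up data (`station_side`, `toy_station`, `toy_trip` and all values are assumptions for illustration). It uses `DataFrame.add_prefix` for the renaming and `validate='many_to_one'` so that pandas raises an error if a station id is duplicated on the station side.

```python
import pandas as pd

# Toy stand-ins for the station and trip tables; values are made up.
toy_station = pd.DataFrame({
    'id': [2, 3],
    'city': ['San Jose', 'San Jose'],
    'dock_count': [27, 15],
})
toy_trip = pd.DataFrame({
    'id': [1, 2, 3],
    'start_station_id': [2, 3, 2],
    'end_station_id': [3, 3, 2],
})

def station_side(station, side):
    """Prefix every station column with side + '_' and name the key side_station_id."""
    prefixed = station.add_prefix(side + '_')
    return prefixed.rename(columns={side + '_id': side + '_station_id'})

joined = toy_trip
for side in ('start', 'end'):
    joined = joined.merge(
        station_side(toy_station, side),
        how='left',
        on=side + '_station_id',
        validate='many_to_one',  # each trip row must match at most one station row
    )

print(joined)
```

With a left join and a validated many-to-one key, the row count of the trip table is preserved.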

### Transform and join weather data

First, convert the *date* column to datetime objects.

In [None]:
weather['date'] = weather['date'].apply(lambda x: datetime.strptime(x,'%m/%d/%Y'))

Second, create **start_weather** as a copy of **weather** with every column prefixed by *start_*, and join it to **merge6** to attach the weather info at the start station. Then create **end_weather** as another copy of **weather**, prefixed by *end_*, to attach the weather info at the end station.

In [None]:
# Copy the weather table with every column name prefixed by 'start_'
start_weather = weather.add_prefix('start_')

In [None]:
# Copy the weather table with every column name prefixed by 'end_'
end_weather = weather.add_prefix('end_')

In [None]:
merge7 = merge6.merge(start_weather, how = 'left', left_on = ['date_for_join','start_zip'], \
                      right_on = ['start_date','start_zip_code'])

In [None]:
merge8 = merge7.merge(end_weather,how = 'left', left_on = ['date_for_join','end_zip'], \
                      right_on = ['end_date','end_zip_code'])
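Because these are left joins, trips with no matching weather row are kept with NaN weather values. One way to count such rows is the `indicator=True` option of `merge`, sketched below on made-up data (the two toy tables and their values are hypothetical; only the column names mirror the ones used above).

```python
import pandas as pd

# Toy trips: the second row has no weather record for its date and zip.
toy_trips = pd.DataFrame({
    'date_for_join': pd.to_datetime(['2013-08-29', '2013-08-30']),
    'start_zip': [94107, 95113],
})
toy_weather = pd.DataFrame({
    'start_date': pd.to_datetime(['2013-08-29']),
    'start_zip_code': [94107],
    'start_events': ['Fog'],
})

# indicator=True adds a '_merge' column: 'both' for matches, 'left_only' otherwise.
checked = toy_trips.merge(
    toy_weather,
    how='left',
    left_on=['date_for_join', 'start_zip'],
    right_on=['start_date', 'start_zip_code'],
    indicator=True,
)

n_unmatched = int((checked['_merge'] == 'left_only').sum())
print(n_unmatched)
```

Unmatched rows found this way would be the ones that later show up as NA values in the weather columns.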

In [None]:
merge8.head(5).transpose()

In [None]:
merge9 = merge8.drop(columns = ['end_zip_code','end_date_y','start_date_y',\
                                'start_zip_code','date_for_join'])
merge9.rename(columns={'start_date_x':'start_date','end_date_x':'end_date'}, inplace=True)

## Clean Merged Data

In [None]:
merge9.shape

Find all columns that contain NAs and sort them by the number of NAs.

In [None]:
na_list = pd.DataFrame(merge9.isnull().sum())
na_list['column_name'] = na_list.index
na_list.columns = ['count_na','column_name']
na_column = na_list[na_list['count_na']>0]

In [None]:
na_column.sort_values(by = 'count_na')
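The same summary can be obtained by filtering the `isnull().sum()` Series directly, without building a helper DataFrame. This is a minimal sketch on a made-up table (`toy` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Toy table: column 'a' has two NAs, 'b' has none, 'c' has one.
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [1, 2, 3],
    'c': ['x', None, 'z'],
})

# NA count per column, keeping only columns with at least one NA, sorted ascending.
na_counts = toy.isnull().sum()
na_counts = na_counts[na_counts > 0].sort_values()

print(na_counts)
```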

*start_events* and *end_events* denote the unusual weather events (fog, rain, etc.) on a particular day. NA means that there was no unusual event on that day, so I filled the NAs with "No Special Events".

In [None]:
merge9['start_events'] = merge9['start_events'].fillna('No Special Events')

In [None]:
merge9['end_events'] = merge9['end_events'].fillna('No Special Events')

In [None]:
merge9 = merge9.drop(columns = ['start_max_gust_speed_mph','end_max_gust_speed_mph'])

Next, remove the roughly 0.07% of records that still have NA values (mostly in the weather columns).

In [None]:
merge10 = merge9.dropna()
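The share of rows removed by `dropna()` can be computed from the row counts before and after. The sketch below shows the arithmetic on a made-up four-row table (`toy` is hypothetical); on the real data the same expression with **merge9** and **merge10** would give the percentage quoted above.

```python
import numpy as np
import pandas as pd

# Toy table with one incomplete row out of four.
toy = pd.DataFrame({
    'x': [1.0, 2.0, np.nan, 4.0],
    'y': ['a', 'b', 'c', 'd'],
})

cleaned = toy.dropna()

# Fraction of rows lost to dropna().
share_dropped = 1 - len(cleaned) / len(toy)
print(f'{share_dropped:.2%} of rows dropped')
```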

Check the final data shape and export the merged and cleaned dataset.

In [None]:
merge10.shape

In [None]:
merge10.isna().sum()

In [None]:
merge10.to_csv('SF_Bay_Area_Bike_Share_Data_Cleaned.csv', index = False)