## Citibike Data Combining

This notebook presents the results of exploring Citi Bike trip data.
We are looking at Citi Bike trips in 2017.
Here I will analyze the user types and the trends I found in the trip data.

In [37]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [71]:
jan_raw = pd.read_csv('./final/jan_final.csv')
feb_raw = pd.read_csv('./final/feb_final.csv')
march_raw = pd.read_csv('./final/march_final.csv')
april_raw = pd.read_csv('./final/april_final.csv')

In [72]:
may_raw = pd.read_csv('./final/may_final.csv')
june_raw = pd.read_csv('./final/june_final.csv')
july_raw = pd.read_csv('./final/july_final.csv')
august_raw = pd.read_csv('./final/august_final.csv')

In [73]:
sep_raw = pd.read_csv('./final/sep.csv')
oct_raw = pd.read_csv('./final/oct.csv')
nov_raw = pd.read_csv('./final/nov.csv')
dec_raw = pd.read_csv('./final/dec.csv')
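
The twelve `read_csv` calls above could also be written as one loop. This is a minimal sketch, assuming the same file names as in the cells above; `MONTH_FILES` and `load_months` are hypothetical names introduced only for illustration.

```python
from pathlib import Path

import pandas as pd

# File names copied from the cells above; the mapping itself is a hypothetical helper.
MONTH_FILES = {
    'jan': 'jan_final.csv', 'feb': 'feb_final.csv',
    'march': 'march_final.csv', 'april': 'april_final.csv',
    'may': 'may_final.csv', 'june': 'june_final.csv',
    'july': 'july_final.csv', 'august': 'august_final.csv',
    'sep': 'sep.csv', 'oct': 'oct.csv', 'nov': 'nov.csv', 'dec': 'dec.csv',
}


def load_months(data_dir='./final', month_files=MONTH_FILES):
    """Read each monthly CSV into a dict of DataFrames keyed by month name."""
    data_dir = Path(data_dir)
    return {month: pd.read_csv(data_dir / name) for month, name in month_files.items()}
```

With this helper, `raw = load_months()` would give `raw['jan']`, `raw['feb']`, and so on in place of the separate `*_raw` variables.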

### Additional Data Cleaning

In [81]:
jan_raw.columns

Index(['Unnamed: 0', 'tripduration', 'starttime', 'stoptime',
       'start station id', 'start station name', 'end station id',
       'end station name', 'usertype', 'birth year', 'gender',
       'start_station_geoid', 'end_station_geoid'],
      dtype='object')

In [82]:
may_raw.columns

Index(['Unnamed: 0', 'tripduration', 'starttime', 'stoptime',
       'start station id', 'start station name', 'end station id',
       'end station name', 'usertype', 'birth year', 'gender',
       'start_station_geoid', 'end_station_geoid'],
      dtype='object')

In [83]:
dec_raw.columns

Index(['tripduration', 'starttime', 'stoptime', 'start station name',
       'end station name', 'usertype', 'birth year', 'gender',
       'start_station_geoid', 'end_station_geoid'],
      dtype='object')

For January-April the columns are named 'start time', 'stop time', 'trip duration', and 'user type', while for May-December they are named 'starttime', 'stoptime', 'tripduration', and 'usertype'. I will rename the January-April columns to 'starttime', 'stoptime', 'tripduration', and 'usertype' (without the space in between).

In [84]:
jan_raw = jan_raw.rename(columns={'start time': 'starttime', 'stop time': 'stoptime', 'trip duration': 'tripduration','user type': 'usertype'})

In [85]:
feb_raw = feb_raw.rename(columns={'start time': 'starttime', 'stop time': 'stoptime', 'trip duration': 'tripduration','user type': 'usertype'})

In [86]:
march_raw = march_raw.rename(columns={'start time': 'starttime', 'stop time': 'stoptime', 'trip duration': 'tripduration','user type': 'usertype'})

In [87]:
april_raw = april_raw.rename(columns={'start time': 'starttime', 'stop time': 'stoptime', 'trip duration': 'tripduration','user type': 'usertype'})
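
The four rename cells above use the same mapping, so it could be defined once and reused. This is a minimal sketch on a toy frame with made-up contents; `RENAME_MAP` and `normalize_columns` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Same mapping as in the rename cells above.
RENAME_MAP = {
    'start time': 'starttime',
    'stop time': 'stoptime',
    'trip duration': 'tripduration',
    'user type': 'usertype',
}


def normalize_columns(df, rename_map=RENAME_MAP):
    """Return a copy of df with the spaced column names replaced by the unspaced ones."""
    # rename() skips keys that are not present, so this is safe on frames
    # whose columns are already in the unspaced form.
    return df.rename(columns=rename_map)


# Toy frame standing in for one of the January-April files.
toy = pd.DataFrame({'start time': ['2017-01-05 21:06:47'], 'trip duration': [737]})
renamed = normalize_columns(toy)
print(renamed.columns.tolist())
```

Because `rename` ignores missing keys by default, the same function could be applied to every month, for example `frames = [normalize_columns(df) for df in frames]`.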

### Combine all the months together

In this step we will combine all of the Citi Bike trip data into one DataFrame.
We cleaned the data and added the geoid columns separately, in batches of four months each.

In [88]:
frames = [jan_raw, feb_raw, march_raw, april_raw, may_raw, june_raw,
          july_raw, august_raw, sep_raw, oct_raw, nov_raw, dec_raw]
result = pd.concat(frames,ignore_index=True)

In [89]:
result.columns

Index(['Unnamed: 0', 'tripduration', 'starttime', 'stoptime',
       'start station id', 'start station name', 'end station id',
       'end station name', 'usertype', 'birth year', 'gender',
       'start_station_geoid', 'end_station_geoid'],
      dtype='object')

In [93]:
result.loc[10790070]

Unnamed: 0                                     NaN
tripduration                                   881
starttime                      2017-09-29 07:56:25
stoptime                       2017-09-29 08:11:06
start station id                               NaN
start station name           Pershing Square South
end station id                                 NaN
end station name       Greenwich St & W Houston St
usertype                                Subscriber
birth year                                    1966
gender                                           1
start_station_geoid                     3.6061e+10
end_station_geoid                       3.6061e+10
Name: 10790070, dtype: object

In [91]:
result.loc[8790090]

Unnamed: 0                            238237
tripduration                            1004
starttime                2017-08-04 19:59:25
stoptime                 2017-08-04 20:16:09
start station id                         284
start station name     Greenwich Ave & 8 Ave
end station id                           426
end station name       West St & Chambers St
usertype                            Customer
birth year                               NaN
gender                                     0
start_station_geoid               3.6061e+10
end_station_geoid                 3.6061e+10
Name: 8790090, dtype: object

In [92]:
result.loc[100091]

Unnamed: 0                          100091
tripduration                           737
starttime              2017-01-05 21:06:47
stoptime               2017-01-05 21:19:04
start station id                       280
start station name         E 10 St & 5 Ave
end station id                         394
end station name         E 9 St & Avenue C
usertype                        Subscriber
birth year                            1961
gender                                   2
start_station_geoid             3.6061e+10
end_station_geoid               3.6061e+10
Name: 100091, dtype: object

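Because we cleaned the months separately, the start and end station ids were removed from some of the files by mistake, so they are NaN for those rows in the combined data (see the September row above).
Fortunately, we won't be using them for this project.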
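
The NaN values in the September row come from how `pd.concat` handles frames whose columns differ: it keeps the union of the columns and fills the gaps with NaN. This is a small sketch on toy frames with made-up values, not the real data.

```python
import pandas as pd

# Toy frames: one keeps the station id column, the other lost it during cleaning.
with_ids = pd.DataFrame({'tripduration': [737], 'start station id': [280]})
without_ids = pd.DataFrame({'tripduration': [881]})

combined = pd.concat([with_ids, without_ids], ignore_index=True)

# Columns absent from a frame are filled with NaN for that frame's rows.
missing_per_column = combined.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` check on `result` would show how many rows are missing each column. The NaN fill is also why an integer id column ends up with a float dtype after combining.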

#### Drop start and end station ids

In [97]:
result = result.drop(['start station id','end station id'], axis = 1).copy()

#### Save to CSV and upload to our Google Drive

We save the combined data as a CSV file and upload it to Google Drive so we can use it later.

In [99]:
# to_csv returns None when given a path, so there is nothing to assign
result.to_csv('./final_trip_data.csv')
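
The 'Unnamed: 0' column in the outputs above looks like a row index that an earlier `to_csv` call wrote out and `read_csv` then read back as data. That is an assumption about the earlier cleaning step, not something the notebook confirms. The sketch below reproduces the effect on a toy frame with made-up values and shows that `index=False` avoids it.

```python
import io

import pandas as pd

toy = pd.DataFrame({'tripduration': [737, 881]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```

If the index is not needed later, passing `index=False` when saving `result` would keep the extra column out of the final file.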

In [94]:
result.dtypes

Unnamed: 0             float64
tripduration             int64
starttime               object
stoptime                object
start station id       float64
start station name      object
end station id         float64
end station name        object
usertype                object
birth year             float64
gender                   int64
start_station_geoid    float64
end_station_geoid      float64
dtype: object

These are the datatypes of the columns. This cell was run before the station id columns were dropped, so they still appear in the output.
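
Since `starttime` and `stoptime` are stored as `object` (strings), any time-based analysis would first need to parse them. This is a minimal sketch using two rows copied from the sample outputs above; the `elapsed_seconds` column is a hypothetical name added for illustration.

```python
import pandas as pd

# Two rows copied from the sample outputs above.
toy = pd.DataFrame({
    'starttime': ['2017-01-05 21:06:47', '2017-08-04 19:59:25'],
    'stoptime': ['2017-01-05 21:19:04', '2017-08-04 20:16:09'],
    'tripduration': [737, 1004],
})

# Parse the string columns into datetime values.
for col in ['starttime', 'stoptime']:
    toy[col] = pd.to_datetime(toy[col])

# Elapsed seconds computed from the parsed timestamps.
toy['elapsed_seconds'] = (toy['stoptime'] - toy['starttime']).dt.total_seconds()
print(toy[['tripduration', 'elapsed_seconds']])
```

For these two rows the elapsed time matches `tripduration`, which suggests that column is measured in seconds.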