# Data Preparation
First, let's inspect the CSV files containing the Citi Bike and weather data. We will load them, inspect them, and fix any quality issues we find.
## Reading 

In [39]:
import glob
import pandas as pd

# reading the weather data
df_weather = pd.read_csv('data/newark_airport_2016.csv')

# reading all citibike data files
citibike_data = glob.glob('data/JC-2016*.csv')

df_citibike = pd.concat(pd.read_csv(file) for file in citibike_data)
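
A side effect of the `pd.concat` call above is worth knowing: each monthly file keeps its own 0-based row index, so the combined frame carries repeated index labels. The sketch below reproduces this with two tiny in-memory CSVs (made-up data, not the real files) and shows how `ignore_index=True` would give unique labels, should they be needed:

```python
import io
import pandas as pd

# Two made-up "monthly files", held in memory instead of on disk.
january = io.StringIO("Trip Duration,Bike ID\n362,24647\n200,24605\n")
february = io.StringIO("Trip Duration,Bike ID\n248,24693\n")

frames = [pd.read_csv(january), pd.read_csv(february)]

# Default behaviour: each frame keeps its own index, so label 0 appears twice.
stacked = pd.concat(frames)
print(stacked.index.tolist())      # [0, 1, 0]

# With ignore_index=True the result gets a fresh 0..n-1 index.
renumbered = pd.concat(frames, ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2]
```

Note also that `glob.glob` does not guarantee any particular file order, so wrapping it in `sorted(...)` is a cheap way to keep the months in sequence.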

## Analysis
Let's see some information about our dataframes.

### Weather

In [19]:
df_weather.head()

Unnamed: 0,STATION,NAME,DATE,AWND,PGTM,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,TSUN,WDF2,WDF5,WSF2,WSF5
0,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-01,12.75,,0.0,0.0,0.0,41,43,34,,270,280.0,25.9,35.1
1,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-02,9.4,,0.0,0.0,0.0,36,42,30,,260,260.0,21.0,25.1
2,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-03,10.29,,0.0,0.0,0.0,37,47,28,,270,250.0,23.9,30.0
3,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-04,17.22,,0.0,0.0,0.0,32,35,14,,330,330.0,25.9,33.1
4,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-05,9.84,,0.0,0.0,0.0,19,31,10,,360,350.0,25.1,31.1


In [15]:
df_weather.describe()

Unnamed: 0,AWND,PGTM,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,TSUN,WDF2,WDF5,WSF2,WSF5
count,366.0,0.0,366.0,366.0,366.0,366.0,366.0,366.0,0.0,366.0,364.0,366.0,364.0
mean,9.429973,,0.104945,0.098087,0.342623,57.196721,65.991803,48.459016,,217.84153,228.269231,20.484426,26.801648
std,3.748174,,0.307496,1.276498,2.07851,17.466981,18.606301,17.13579,,102.548282,97.415777,6.84839,8.88261
min,2.46,,0.0,0.0,0.0,8.0,18.0,0.0,,10.0,10.0,6.9,10.1
25%,6.765,,0.0,0.0,0.0,43.0,51.25,35.0,,150.0,150.0,15.0,19.9
50%,8.72,,0.0,0.0,0.0,56.0,66.0,47.0,,240.0,260.0,19.9,25.1
75%,11.41,,0.03,0.0,0.0,74.0,83.0,64.0,,300.0,300.0,23.9,31.1
max,22.82,,2.79,24.0,20.1,89.0,99.0,80.0,,360.0,360.0,48.1,66.0


In [16]:
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 16 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   STATION  366 non-null    object 
 1   NAME     366 non-null    object 
 2   DATE     366 non-null    object 
 3   AWND     366 non-null    float64
 4   PGTM     0 non-null      float64
 5   PRCP     366 non-null    float64
 6   SNOW     366 non-null    float64
 7   SNWD     366 non-null    float64
 8   TAVG     366 non-null    int64  
 9   TMAX     366 non-null    int64  
 10  TMIN     366 non-null    int64  
 11  TSUN     0 non-null      float64
 12  WDF2     366 non-null    int64  
 13  WDF5     364 non-null    float64
 14  WSF2     366 non-null    float64
 15  WSF5     364 non-null    float64
dtypes: float64(9), int64(4), object(3)
memory usage: 45.9+ KB


In [17]:
df_weather.isna().sum()

STATION      0
NAME         0
DATE         0
AWND         0
PGTM       366
PRCP         0
SNOW         0
SNWD         0
TAVG         0
TMAX         0
TMIN         0
TSUN       366
WDF2         0
WDF5         2
WSF2         0
WSF5         2
dtype: int64

We notice that the `PGTM` and `TSUN` fields consist entirely of `NaN` values (366 out of 366 rows), and that `WDF5` and `WSF5` have two missing values each. Dropping every row with a missing value is not an option, as we would lose the whole dataset, so we need to decide column by column whether the gaps can be filled or the column should go.

`PGTM` stands for **peak gust time** (hours and minutes, i.e., HHMM). We have no other values from which the time of the peak gust could be derived, so it is impossible to fill in. The same goes for `TSUN`, the **daily total sunshine** (minutes): nothing else in the data lets us reconstruct it. Since both columns are completely empty, we can simply drop them.
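
Instead of reading the empty columns off the `isna().sum()` output by hand, they can also be found programmatically. This is a minimal sketch on a made-up frame (`toy` is a stand-in, not the real weather data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weather frame: PGTM and TSUN are completely empty.
toy = pd.DataFrame({
    "AWND": [12.75, 9.40, 10.29],
    "PGTM": [np.nan, np.nan, np.nan],
    "TSUN": [np.nan, np.nan, np.nan],
    "WDF5": [280.0, np.nan, 250.0],
})

# Columns where every single value is missing.
empty_columns = toy.columns[toy.isna().all()].tolist()
print(empty_columns)               # ['PGTM', 'TSUN']

# dropna(axis=1, how='all') removes exactly those columns and keeps WDF5,
# which has only some values missing.
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())    # ['AWND', 'WDF5']
```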

In [40]:
df_weather.drop(columns=['PGTM', 'TSUN'], inplace=True)

In [41]:
df_weather.isna().sum()

STATION    0
NAME       0
DATE       0
AWND       0
PRCP       0
SNOW       0
SNWD       0
TAVG       0
TMAX       0
TMIN       0
WDF2       0
WDF5       2
WSF2       0
WSF5       2
dtype: int64

Now let's focus on the other two remaining columns with null values.

`WDF5` stands for **direction of fastest 5-second wind** (degrees), and `WSF5` for **fastest 5-second wind speed** (tenths of meters per second). For weather data like this, we can use linear interpolation (built into `pandas`), so that the missing values are filled in from the neighboring days.
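
To make the idea concrete, here is a small worked example on made-up values. It also shows a caveat: a wind direction is an angle, so a straight-line fill is only a rough approximation when the neighbors sit on either side of north.

```python
import numpy as np
import pandas as pd

# Made-up wind speeds with one missing day in the middle.
speeds = pd.Series([30.0, np.nan, 20.0], name="WSF5")

# Linear interpolation treats the rows as equally spaced and puts the
# missing value on the straight line between its neighbours.
filled_speeds = speeds.interpolate(method="linear")
print(filled_speeds.tolist())      # [30.0, 25.0, 20.0]

# Made-up directions either side of north (350 and 10 degrees).
# The arithmetic midpoint is 180, although the angular midpoint is 0/360.
directions = pd.Series([350.0, np.nan, 10.0], name="WDF5")
filled_directions = directions.interpolate(method="linear")
print(filled_directions.tolist())  # [350.0, 180.0, 10.0]
```

With only two gaps in 366 rows, the effect on this dataset is likely negligible either way.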

In [42]:
# interpolate only the two numeric columns that still contain gaps
df_weather[['WDF5', 'WSF5']] = df_weather[['WDF5', 'WSF5']].interpolate(method='linear')


In [43]:
df_weather.isna().sum()

STATION    0
NAME       0
DATE       0
AWND       0
PRCP       0
SNOW       0
SNWD       0
TAVG       0
TMAX       0
TMIN       0
WDF2       0
WDF5       0
WSF2       0
WSF5       0
dtype: int64

In [46]:
df_weather['DATE'] = pd.to_datetime(df_weather['DATE'])

In [47]:
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   STATION  366 non-null    object        
 1   NAME     366 non-null    object        
 2   DATE     366 non-null    datetime64[ns]
 3   AWND     366 non-null    float64       
 4   PRCP     366 non-null    float64       
 5   SNOW     366 non-null    float64       
 6   SNWD     366 non-null    float64       
 7   TAVG     366 non-null    int64         
 8   TMAX     366 non-null    int64         
 9   TMIN     366 non-null    int64         
 10  WDF2     366 non-null    int64         
 11  WDF5     366 non-null    float64       
 12  WSF2     366 non-null    float64       
 13  WSF5     366 non-null    float64       
dtypes: datetime64[ns](1), float64(7), int64(4), object(2)
memory usage: 40.2+ KB


In [48]:
df_weather.head()

Unnamed: 0,STATION,NAME,DATE,AWND,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,WDF2,WDF5,WSF2,WSF5
0,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-01,12.75,0.0,0.0,0.0,41,43,34,270,280.0,25.9,35.1
1,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-02,9.4,0.0,0.0,0.0,36,42,30,260,260.0,21.0,25.1
2,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-03,10.29,0.0,0.0,0.0,37,47,28,270,250.0,23.9,30.0
3,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-04,17.22,0.0,0.0,0.0,32,35,14,330,330.0,25.9,33.1
4,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-05,9.84,0.0,0.0,0.0,19,31,10,360,350.0,25.1,31.1


In [49]:
df_weather.duplicated().sum()

0

OK, the weather data should be fine now. Let's move on to the Citi Bike data.

### Citi Bike

In [53]:
df_citibike.head()

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
0,362,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,Grove St PATH,40.719586,-74.043117,3209,Brunswick St,40.724176,-74.050656,24647,Subscriber,1964.0,2
1,200,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24605,Subscriber,1962.0,1
2,202,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24689,Subscriber,1962.0,2
3,248,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,Brunswick St,40.724176,-74.050656,3203,Hamilton Park,40.727596,-74.044247,24693,Subscriber,1984.0,1
4,903,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24573,Customer,,0


In [9]:
df_citibike.describe()

Unnamed: 0,Trip Duration,Start Station ID,Start Station Latitude,Start Station Longitude,End Station ID,End Station Latitude,End Station Longitude,Bike ID,Birth Year,Gender
count,247584.0,247584.0,247584.0,247584.0,247584.0,247584.0,247584.0,247584.0,228585.0,247584.0
mean,885.6305,3207.065206,40.723121,-74.046438,3203.572553,40.722594,-74.045855,24935.260481,1979.335276,1.123534
std,35937.98,26.955103,0.008199,0.011211,61.579494,0.007958,0.011283,748.469712,9.596809,0.518687
min,61.0,3183.0,40.69264,-74.096937,147.0,40.692216,-74.096937,14552.0,1900.0,0.0
25%,248.0,3186.0,40.717732,-74.050656,3186.0,40.71654,-74.050444,24491.0,1974.0,1.0
50%,390.0,3201.0,40.721525,-74.044247,3199.0,40.721124,-74.043117,24609.0,1981.0,1.0
75%,666.0,3211.0,40.727596,-74.038051,3211.0,40.727224,-74.036486,24719.0,1986.0,1.0
max,16329810.0,3426.0,40.752559,-74.032108,3426.0,40.801343,-73.95739,27274.0,2000.0,2.0


In [10]:
df_citibike.info()

<class 'pandas.core.frame.DataFrame'>
Index: 247584 entries, 0 to 15113
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Trip Duration            247584 non-null  int64  
 1   Start Time               247584 non-null  object 
 2   Stop Time                247584 non-null  object 
 3   Start Station ID         247584 non-null  int64  
 4   Start Station Name       247584 non-null  object 
 5   Start Station Latitude   247584 non-null  float64
 6   Start Station Longitude  247584 non-null  float64
 7   End Station ID           247584 non-null  int64  
 8   End Station Name         247584 non-null  object 
 9   End Station Latitude     247584 non-null  float64
 10  End Station Longitude    247584 non-null  float64
 11  Bike ID                  247584 non-null  int64  
 12  User Type                247204 non-null  object 
 13  Birth Year               228585 non-null  float64
 14  Gender    

In [11]:
df_citibike.isna().sum()

Trip Duration                  0
Start Time                     0
Stop Time                      0
Start Station ID               0
Start Station Name             0
Start Station Latitude         0
Start Station Longitude        0
End Station ID                 0
End Station Name               0
End Station Latitude           0
End Station Longitude          0
Bike ID                        0
User Type                    380
Birth Year                 18999
Gender                         0
dtype: int64

In [50]:
df_citibike.duplicated().sum()

0

In [51]:
user_types = df_citibike['User Type'].unique()
print(user_types)

['Subscriber' 'Customer' nan]


Looking at the result of `df_citibike.describe()`, we can see that the start stations are all close to each other (the latitude and longitude have tiny standard deviations, and their minimum and maximum values are only a few hundredths of a degree apart), so we don't need to filter the data frame by location to match the Newark weather data. We still need to handle the null values and a few other issues, though!
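
To put a number on "close to each other", one option is to measure the diagonal of the bounding box spanned by the start stations. The sketch below uses the haversine formula with the minimum and maximum coordinates from the `describe()` output; the helper `haversine_km` is written for this illustration and is not part of the original notebook.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Min/max start-station coordinates copied from the describe() output.
min_lat, max_lat = 40.69264, 40.752559
min_lon, max_lon = -74.096937, -74.032108

# Diagonal of the bounding box: roughly an upper bound on the distance
# between any two start stations.
diagonal_km = haversine_km(min_lat, min_lon, max_lat, max_lon)
print(f"{diagonal_km:.1f} km")
```

The result is under ten kilometres, which supports treating one weather station as representative for all trips.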

We don't really need `Birth Year` to analyze the impact of the weather on bike rentals, so we will leave those null values in place; they should not affect our results. However, let's take a closer look at the rows with a null `User Type`.

In [76]:
df_citibike[df_citibike['User Type'].isnull()]

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
9538,156,2016-03-23 09:08:34,2016-03-23 09:11:11,3214,Essex Light Rail,40.712774,-74.036486,3183,Exchange Place,40.716247,-74.033459,24444,,1987.0,1
9939,164,2016-03-23 22:17:45,2016-03-23 22:20:29,3183,Exchange Place,40.716247,-74.033459,3214,Essex Light Rail,40.712774,-74.036486,24675,,1987.0,1
10165,171,2016-03-24 11:46:39,2016-03-24 11:49:31,3214,Essex Light Rail,40.712774,-74.036486,3183,Exchange Place,40.716247,-74.033459,24697,,1987.0,1
10460,204,2016-03-24 20:45:45,2016-03-24 20:49:10,3183,Exchange Place,40.716247,-74.033459,3214,Essex Light Rail,40.712774,-74.036486,24387,,1987.0,1
10901,380,2016-03-25 19:15:56,2016-03-25 19:22:17,3183,Exchange Place,40.716247,-74.033459,3184,Paulus Hook,40.714145,-74.033552,24412,,1987.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12667,1266,2016-12-24 18:21:54,2016-12-24 18:43:00,3186,Grove St PATH,40.719586,-74.043117,3199,Newport Pkwy,40.728745,-74.032108,26200,,1991.0,1
14102,1791,2016-12-28 18:51:00,2016-12-28 19:20:52,3202,Newport PATH,40.727224,-74.033759,3199,Newport Pkwy,40.728745,-74.032108,26194,,1982.0,1
14103,1248,2016-12-28 18:51:07,2016-12-28 19:11:55,3202,Newport PATH,40.727224,-74.033759,3199,Newport Pkwy,40.728745,-74.032108,26292,,1987.0,2
14153,1130,2016-12-28 20:52:18,2016-12-28 21:11:08,3199,Newport Pkwy,40.728745,-74.032108,3199,Newport Pkwy,40.728745,-74.032108,26194,,1982.0,1


Do we really need to know whether a rider is a customer or a subscriber? For the weather analysis, not really. We also don't know why these fields are empty, so let's look for a relationship between the user type and some other attributes.
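
A compact way to look for such a relationship is a cross-tabulation, which counts trips for every combination of two columns at once. This is a minimal sketch on made-up rows; the label `"(missing)"` is an arbitrary placeholder chosen for this example:

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the two columns of interest.
toy = pd.DataFrame({
    "User Type": ["Subscriber", "Subscriber", "Customer",
                  "Customer", "Subscriber", np.nan],
    "Gender":    [1, 2, 0, 1, 0, 1],
})

# Count trips for every (user type, gender) pair; missing user types get
# an explicit label so they show up as their own row.
counts = pd.crosstab(toy["User Type"].fillna("(missing)"), toy["Gender"])
print(counts)
```

If one gender code mapped cleanly onto one user type, each column of the table would have a single non-zero cell.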

In [86]:
df_citibike[((df_citibike['Gender'] == 1) | (df_citibike['Gender'] == 2)) & (df_citibike['User Type'] == 'Subscriber')]

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
0,362,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,Grove St PATH,40.719586,-74.043117,3209,Brunswick St,40.724176,-74.050656,24647,Subscriber,1964.0,2
1,200,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24605,Subscriber,1962.0,1
2,202,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24689,Subscriber,1962.0,2
3,248,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,Brunswick St,40.724176,-74.050656,3203,Hamilton Park,40.727596,-74.044247,24693,Subscriber,1984.0,1
6,445,2016-01-01 01:07:45,2016-01-01 01:15:11,3186,Grove St PATH,40.719586,-74.043117,3203,Hamilton Park,40.727596,-74.044247,24510,Subscriber,1988.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15108,377,2016-12-31 23:10:00,2016-12-31 23:16:17,3206,Hilltop,40.731169,-74.057574,3225,Baldwin at Montgomery,40.723659,-74.064194,24421,Subscriber,1984.0,1
15109,557,2016-12-31 23:10:16,2016-12-31 23:19:33,3214,Essex Light Rail,40.712774,-74.036486,3203,Hamilton Park,40.727596,-74.044247,24465,Subscriber,1981.0,2
15111,173,2016-12-31 23:44:37,2016-12-31 23:47:31,3186,Grove St PATH,40.719586,-74.043117,3270,Jersey & 6th St,40.725289,-74.045572,24641,Subscriber,1978.0,1
15112,2424,2016-12-31 23:44:50,2017-01-01 00:25:14,3214,Essex Light Rail,40.712774,-74.036486,3214,Essex Light Rail,40.712774,-74.036486,26219,Subscriber,1960.0,2


In [82]:
df_citibike[((df_citibike['Gender'] == 1) | (df_citibike['Gender'] == 2)) & (df_citibike['User Type'] == 'Customer')].head()

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
11254,1536,2016-07-15 12:24:51,2016-07-15 12:50:28,3199,Newport Pkwy,40.728745,-74.032108,3187,Warren St,40.721124,-74.038051,24634,Customer,1984.0,1
11315,702,2016-07-15 14:09:21,2016-07-15 14:21:04,3187,Warren St,40.721124,-74.038051,3199,Newport Pkwy,40.728745,-74.032108,24515,Customer,1984.0,1
11518,729,2016-07-15 18:14:05,2016-07-15 18:26:14,3184,Paulus Hook,40.714145,-74.033552,3185,City Hall,40.717732,-74.043845,24532,Customer,1987.0,1
11569,6368,2016-07-15 18:56:47,2016-07-15 20:42:56,3185,City Hall,40.717732,-74.043845,3203,Hamilton Park,40.727596,-74.044247,24510,Customer,1987.0,1
11676,403,2016-07-15 20:45:14,2016-07-15 20:51:58,3203,Hamilton Park,40.727596,-74.044247,3186,Grove St PATH,40.719586,-74.043117,24558,Customer,1987.0,1


In [83]:
df_citibike[(df_citibike['Gender'] == 0) & (df_citibike['User Type'] == 'Subscriber')]

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
255,410,2016-01-02 13:58:54,2016-01-02 14:05:45,3210,Pershing Field,40.742677,-74.051789,3195,Sip Ave,40.730743,-74.063784,24559,Subscriber,1993.0,0
260,564,2016-01-02 14:09:48,2016-01-02 14:19:12,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24559,Subscriber,1993.0,0
3661,469,2016-01-13 07:35:40,2016-01-13 07:43:29,3210,Pershing Field,40.742677,-74.051789,3195,Sip Ave,40.730743,-74.063784,24575,Subscriber,1993.0,0
4425,555,2016-01-15 09:12:34,2016-01-15 09:21:50,3210,Pershing Field,40.742677,-74.051789,3195,Sip Ave,40.730743,-74.063784,24612,Subscriber,1993.0,0
4661,648,2016-01-15 19:17:51,2016-01-15 19:28:39,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24595,Subscriber,1993.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14859,223,2016-12-31 08:51:48,2016-12-31 08:55:32,3203,Hamilton Park,40.727596,-74.044247,3273,Manila & 1st,40.721651,-74.042884,26248,Subscriber,,0
14864,1155,2016-12-31 09:12:12,2016-12-31 09:31:28,3267,Morris Canal,40.712419,-74.038526,3267,Morris Canal,40.712419,-74.038526,24704,Subscriber,,0
14884,226,2016-12-31 10:21:00,2016-12-31 10:24:47,3273,Manila & 1st,40.721651,-74.042884,3203,Hamilton Park,40.727596,-74.044247,24641,Subscriber,,0
15067,224,2016-12-31 18:01:38,2016-12-31 18:05:23,3194,McGinley Square,40.725340,-74.067622,3195,Sip Ave,40.730743,-74.063784,24716,Subscriber,,0


In [91]:
df_citibike[(df_citibike['Gender'] == 0) & (df_citibike['User Type'] == 'Customer')]

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
4,903,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24573,Customer,,0
5,883,2016-01-01 01:03:28,2016-01-01 01:18:11,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24442,Customer,,0
22,988,2016-01-01 03:16:33,2016-01-01 03:33:02,3196,Riverview Park,40.744319,-74.043991,3209,Brunswick St,40.724176,-74.050656,24662,Customer,,0
53,3090,2016-01-01 11:07:15,2016-01-01 11:58:46,3203,Hamilton Park,40.727596,-74.044247,3203,Hamilton Park,40.727596,-74.044247,24444,Customer,,0
57,788,2016-01-01 11:50:30,2016-01-01 12:03:39,3210,Pershing Field,40.742677,-74.051789,3195,Sip Ave,40.730743,-74.063784,24573,Customer,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14953,7081,2016-12-31 13:41:33,2016-12-31 15:39:34,3185,City Hall,40.717732,-74.043845,3213,Van Vorst Park,40.718489,-74.047727,26265,Customer,,0
14954,7075,2016-12-31 13:41:43,2016-12-31 15:39:38,3185,City Hall,40.717732,-74.043845,3213,Van Vorst Park,40.718489,-74.047727,24513,Customer,,0
14972,2484,2016-12-31 14:35:19,2016-12-31 15:16:44,3275,Columbus Drive,40.718355,-74.038914,3199,Newport Pkwy,40.728745,-74.032108,24627,Customer,,0
14973,79669,2016-12-31 14:35:25,2017-01-01 12:43:14,3275,Columbus Drive,40.718355,-74.038914,3199,Newport Pkwy,40.728745,-74.032108,26217,Customer,,0


In [88]:
df_citibike[(df_citibike['Birth Year'].isnull()) & (df_citibike['User Type'] == 'Subscriber')]

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
2513,499,2016-06-04 12:23:12,2016-06-04 12:31:31,3203,Hamilton Park,40.727596,-74.044247,3186,Grove St PATH,40.719586,-74.043117,24462,Subscriber,,0
11555,534,2016-06-16 15:08:59,2016-06-16 15:17:53,3186,Grove St PATH,40.719586,-74.043117,3209,Brunswick St,40.724176,-74.050656,24654,Subscriber,,0
13952,829,2016-06-19 13:38:59,2016-06-19 13:52:48,3209,Brunswick St,40.724176,-74.050656,3183,Exchange Place,40.716247,-74.033459,24624,Subscriber,,0
14005,557,2016-06-19 14:16:42,2016-06-19 14:26:00,3184,Paulus Hook,40.714145,-74.033552,3185,City Hall,40.717732,-74.043845,24471,Subscriber,,0
14046,490,2016-06-19 15:01:35,2016-06-19 15:09:45,3185,City Hall,40.717732,-74.043845,3209,Brunswick St,40.724176,-74.050656,24474,Subscriber,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14859,223,2016-12-31 08:51:48,2016-12-31 08:55:32,3203,Hamilton Park,40.727596,-74.044247,3273,Manila & 1st,40.721651,-74.042884,26248,Subscriber,,0
14864,1155,2016-12-31 09:12:12,2016-12-31 09:31:28,3267,Morris Canal,40.712419,-74.038526,3267,Morris Canal,40.712419,-74.038526,24704,Subscriber,,0
14884,226,2016-12-31 10:21:00,2016-12-31 10:24:47,3273,Manila & 1st,40.721651,-74.042884,3203,Hamilton Park,40.727596,-74.044247,24641,Subscriber,,0
15067,224,2016-12-31 18:01:38,2016-12-31 18:05:23,3194,McGinley Square,40.725340,-74.067622,3195,Sip Ave,40.730743,-74.063784,24716,Subscriber,,0


In [96]:
df_citibike[(df_citibike['Birth Year'].isnull()) & (df_citibike['User Type'] == 'Customer')]

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
4,903,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24573,Customer,,0
5,883,2016-01-01 01:03:28,2016-01-01 01:18:11,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24442,Customer,,0
22,988,2016-01-01 03:16:33,2016-01-01 03:33:02,3196,Riverview Park,40.744319,-74.043991,3209,Brunswick St,40.724176,-74.050656,24662,Customer,,0
53,3090,2016-01-01 11:07:15,2016-01-01 11:58:46,3203,Hamilton Park,40.727596,-74.044247,3203,Hamilton Park,40.727596,-74.044247,24444,Customer,,0
57,788,2016-01-01 11:50:30,2016-01-01 12:03:39,3210,Pershing Field,40.742677,-74.051789,3195,Sip Ave,40.730743,-74.063784,24573,Customer,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14953,7081,2016-12-31 13:41:33,2016-12-31 15:39:34,3185,City Hall,40.717732,-74.043845,3213,Van Vorst Park,40.718489,-74.047727,26265,Customer,,0
14954,7075,2016-12-31 13:41:43,2016-12-31 15:39:38,3185,City Hall,40.717732,-74.043845,3213,Van Vorst Park,40.718489,-74.047727,24513,Customer,,0
14972,2484,2016-12-31 14:35:19,2016-12-31 15:16:44,3275,Columbus Drive,40.718355,-74.038914,3199,Newport Pkwy,40.728745,-74.032108,24627,Customer,,0
14973,79669,2016-12-31 14:35:25,2017-01-01 12:43:14,3275,Columbus Drive,40.718355,-74.038914,3199,Newport Pkwy,40.728745,-74.032108,26217,Customer,,0


Riders with no birth year and no gender (`Gender` = 0) are more likely to be Customers, and the rest Subscribers, but the relationship is not strict, so we cannot reliably infer the missing user type from those attributes. Hence, let's just drop the rows with a null `User Type`.

In [97]:
df_citibike.dropna(subset=['User Type'], inplace=True)

In [98]:
df_citibike.isna().sum()

Trip Duration                  0
Start Time                     0
Stop Time                      0
Start Station ID               0
Start Station Name             0
Start Station Latitude         0
Start Station Longitude        0
End Station ID                 0
End Station Name               0
End Station Latitude           0
End Station Longitude          0
Bike ID                        0
User Type                      0
Birth Year                 18999
Gender                         0
dtype: int64

Finally, let's clean up some data types:

In [101]:
df_citibike['Start Time'] = pd.to_datetime(df_citibike['Start Time'])
df_citibike['Stop Time'] = pd.to_datetime(df_citibike['Stop Time'])

In [104]:
import numpy as np


df_citibike['Gender'] = df_citibike['Gender'].replace({2: 'F', 1: 'M', 0: np.nan})

In [109]:
df_citibike['Birth Year'] = df_citibike['Birth Year'].astype('Int64')
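
The capital `I` in `'Int64'` matters here: it selects pandas' nullable integer dtype, which can hold missing values. A plain `astype(int)` would fail on the rows without a birth year. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Birth years as pandas reads them: floats, because NaN is a float.
birth_year = pd.Series([1964.0, 1962.0, np.nan], name="Birth Year")
print(birth_year.dtype)              # float64

# The nullable integer dtype 'Int64' stores whole numbers and represents
# the missing entry as pd.NA instead of forcing floats.
as_nullable_int = birth_year.astype("Int64")
print(as_nullable_int.dtype)         # Int64
print(as_nullable_int.isna().sum())  # 1
```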

In [110]:
df_citibike.info()

<class 'pandas.core.frame.DataFrame'>
Index: 247204 entries, 0 to 15113
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   Trip Duration            247204 non-null  int64         
 1   Start Time               247204 non-null  datetime64[ns]
 2   Stop Time                247204 non-null  datetime64[ns]
 3   Start Station ID         247204 non-null  int64         
 4   Start Station Name       247204 non-null  object        
 5   Start Station Latitude   247204 non-null  float64       
 6   Start Station Longitude  247204 non-null  float64       
 7   End Station ID           247204 non-null  int64         
 8   End Station Name         247204 non-null  object        
 9   End Station Latitude     247204 non-null  float64       
 10  End Station Longitude    247204 non-null  float64       
 11  Bike ID                  247204 non-null  int64         
 12  User Type             

In [111]:
df_citibike.head()

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
0,362,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,Grove St PATH,40.719586,-74.043117,3209,Brunswick St,40.724176,-74.050656,24647,Subscriber,1964.0,F
1,200,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24605,Subscriber,1962.0,M
2,202,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24689,Subscriber,1962.0,F
3,248,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,Brunswick St,40.724176,-74.050656,3203,Hamilton Park,40.727596,-74.044247,24693,Subscriber,1984.0,M
4,903,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24573,Customer,,


Let's move to the next step.

# Schema Creation

Let's take a look at our data frames again.

In [113]:
df_citibike.head()

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
0,362,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,Grove St PATH,40.719586,-74.043117,3209,Brunswick St,40.724176,-74.050656,24647,Subscriber,1964.0,F
1,200,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24605,Subscriber,1962.0,M
2,202,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24689,Subscriber,1962.0,F
3,248,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,Brunswick St,40.724176,-74.050656,3203,Hamilton Park,40.727596,-74.044247,24693,Subscriber,1984.0,M
4,903,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24573,Customer,,


In [114]:
df_weather.head()

Unnamed: 0,STATION,NAME,DATE,AWND,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,WDF2,WDF5,WSF2,WSF5
0,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-01,12.75,0.0,0.0,0.0,41,43,34,270,280.0,25.9,35.1
1,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-02,9.4,0.0,0.0,0.0,36,42,30,260,260.0,21.0,25.1
2,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-03,10.29,0.0,0.0,0.0,37,47,28,270,250.0,23.9,30.0
3,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-04,17.22,0.0,0.0,0.0,32,35,14,330,330.0,25.9,33.1
4,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-05,9.84,0.0,0.0,0.0,19,31,10,360,350.0,25.1,31.1


In [132]:
# unique_rows = df_citibike.drop_duplicates(subset=['Bike ID', 'Birth Year', 'Gender', 'User Type'])
# unique_rows

In [131]:
# combinations_df = pd.DataFrame(df_citibike, columns=['Gender', 'Birth Year', 'User Type', 'Bike ID'])
# print(combinations_df.duplicated().sum())
# print(combinations_df['Bike ID'].duplicated().sum())
# combinations_df

Given these data frames, let's design a schema. We will have the following tables:
## Station
- ID (PK)
- Name
- Latitude
- Longitude



In [196]:
station = df_citibike[['Start Station ID', 'Start Station Name', 'Start Station Latitude', 'Start Station Longitude']].copy()
station.rename(columns={
    'Start Station ID': 'id',
    'Start Station Name': 'name',
    'Start Station Latitude': 'latitude',
    'Start Station Longitude': 'longitude'
}, inplace=True)

station_stop = df_citibike[['End Station ID', 'End Station Name', 'End Station Latitude', 'End Station Longitude']].copy()
station_stop.rename(columns={
    'End Station ID': 'id',
    'End Station Name': 'name',
    'End Station Latitude': 'latitude',
    'End Station Longitude': 'longitude'
}, inplace=True)


station = pd.concat([station, station_stop])
station.drop_duplicates(subset=['id'], inplace=True)
station.reset_index(drop=True, inplace=True)
station.head()

Unnamed: 0,id,name,latitude,longitude
0,3186,Grove St PATH,40.719586,-74.043117
1,3209,Brunswick St,40.724176,-74.050656
2,3195,Sip Ave,40.730743,-74.063784
3,3211,Newark Ave,40.721525,-74.046305
4,3187,Warren St,40.721124,-74.038051
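
`drop_duplicates(subset=['id'])` keeps the first row it sees for each ID, which is only safe if every ID always comes with the same name and coordinates. A quick check along these lines could confirm that before deduplicating; the rows below are made up for illustration:

```python
import pandas as pd

# Made-up station rows; station 3186 appears with two different spellings.
toy_stations = pd.DataFrame({
    "id":   [3186, 3186, 3209, 3209],
    "name": ["Grove St PATH", "Grove Street PATH",
             "Brunswick St", "Brunswick St"],
})

# Number of distinct names recorded for each station id.
names_per_id = toy_stations.groupby("id")["name"].nunique()

# Ids with more than one name would silently lose a variant under
# drop_duplicates(subset=['id']).
conflicting_ids = names_per_id[names_per_id > 1].index.tolist()
print(conflicting_ids)             # [3186]
```

An empty list on the real data would mean the deduplication above loses nothing.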


## Weather
- ID (YYYYMMDD) (PK)
- Date
- AWND
- PRCP
- SNOW
- SNWD
- TAVG
- TMAX
- TMIN
- WDF2
- WDF5
- WSF2
- WSF5

In [194]:
new_columns = {
    'DATE': 'date',
    'AWND': 'avg_wind_speed',
    'PRCP': 'precipitation',
    'SNOW': 'snowfall',
    'SNWD': 'snow_depth',
    'TAVG': 'avg_temp',
    'TMAX': 'max_temp',
    'TMIN': 'min_temp',
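    'WDF2': 'fast_wind_2s_dir',    # WDF2 is the fastest 2-minute wind direction, despite the '2s' in the name
    'WDF5': 'fast_wind_5s_dir',
    'WSF2': 'fast_wind_2s_speed',  # WSF2 is the fastest 2-minute wind speed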
    'WSF5': 'fast_wind_5s_speed'
}

weather = df_weather.rename(columns=new_columns)
weather.drop(columns=['STATION', 'NAME'], inplace=True)
weather['id'] = weather['date'].dt.strftime('%Y%m%d').astype(int)
weather = weather[['id', 'date', 'avg_wind_speed', 'precipitation', 'snowfall', 'snow_depth',
                   'avg_temp', 'max_temp', 'min_temp', 'fast_wind_2s_dir', 
                   'fast_wind_5s_dir', 'fast_wind_2s_speed', 'fast_wind_5s_speed']]
weather.head()

Unnamed: 0,id,date,avg_wind_speed,precipitation,snowfall,snow_depth,avg_temp,max_temp,min_temp,fast_wind_2s_dir,fast_wind_5s_dir,fast_wind_2s_speed,fast_wind_5s_speed
0,20160101,2016-01-01,12.75,0.0,0.0,0.0,41,43,34,270,280.0,25.9,35.1
1,20160102,2016-01-02,9.4,0.0,0.0,0.0,36,42,30,260,260.0,21.0,25.1
2,20160103,2016-01-03,10.29,0.0,0.0,0.0,37,47,28,270,250.0,23.9,30.0
3,20160104,2016-01-04,17.22,0.0,0.0,0.0,32,35,14,330,330.0,25.9,33.1
4,20160105,2016-01-05,9.84,0.0,0.0,0.0,19,31,10,360,350.0,25.1,31.1
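
Because the weather `id` is just the date written as a `YYYYMMDD` integer, the same key can be derived from a trip's start time to link the two tables. The sketch below uses made-up rows, and the column name `weather_id` is hypothetical (it is not part of the schema described here):

```python
import pandas as pd

# Made-up trips and weather rows using column names from this notebook.
toy_trip = pd.DataFrame({
    "start_time": pd.to_datetime(["2016-01-01 00:02:52",
                                  "2016-01-02 14:09:48"]),
    "duration": [362, 564],
})
toy_weather = pd.DataFrame({
    "id": [20160101, 20160102],
    "avg_temp": [41, 36],
})

# Derive the same YYYYMMDD integer from the trip's start timestamp.
# 'weather_id' is a hypothetical column name used only in this sketch.
toy_trip["weather_id"] = (
    toy_trip["start_time"].dt.strftime("%Y%m%d").astype(int)
)

# Attach the daily weather to each trip through that key.
joined = toy_trip.merge(toy_weather, left_on="weather_id",
                        right_on="id", how="left")
print(joined[["start_time", "weather_id", "avg_temp"]])
```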


## Trip
- Trip ID (PK)
- Start Station ID (FK to Station)
- End Station ID (FK to Station)
- Start Time
- Stop Time
- Bike ID (FK to Bike)
- Duration
- User Type
- Gender
- Birth Year

In [201]:
trip = df_citibike[['Start Station ID', 'End Station ID', 'Start Time', 'Stop Time', 'Trip Duration', 'Bike ID', 'User Type', 'Gender', 'Birth Year']].copy()

trip.rename(columns={
    'Start Station ID': 'start_station_id',
    'End Station ID': 'end_station_id',
    'Start Time': 'start_time',
    'Stop Time': 'stop_time',
    'Trip Duration': 'duration',
    'Bike ID': 'bike_id',
    'User Type': 'user_type',
    'Gender': 'gender',
    'Birth Year': 'birth_year'
}, inplace=True)

trip.head()

Unnamed: 0,start_station_id,end_station_id,start_time,stop_time,duration,bike_id,user_type,gender,birth_year
0,3186,3209,2016-01-01 00:02:52,2016-01-01 00:08:54,362,24647,Subscriber,F,1964.0
1,3186,3213,2016-01-01 00:18:22,2016-01-01 00:21:42,200,24605,Subscriber,M,1962.0
2,3186,3213,2016-01-01 00:18:25,2016-01-01 00:21:47,202,24689,Subscriber,F,1962.0
3,3209,3203,2016-01-01 00:23:13,2016-01-01 00:27:21,248,24693,Subscriber,M,1984.0
4,3195,3210,2016-01-01 01:03:20,2016-01-01 01:18:24,903,24573,Customer,,
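
The schema lists a Trip ID primary key, but the frame above has no such column, and its row index repeats because the monthly files were concatenated. One possible way to add a surrogate key is sketched below on made-up rows; the column name `id` is an assumption.

```python
import pandas as pd

# Made-up trips whose index repeats, as it does after concatenating
# the monthly files.
toy_trip = pd.DataFrame(
    {"bike_id": [24647, 24605, 24693], "duration": [362, 200, 248]},
    index=[0, 1, 0],
)

# Replace the repeated labels with a clean 0..n-1 index ...
toy_trip = toy_trip.reset_index(drop=True)

# ... and expose a 1-based surrogate key as the first column.
# The column name 'id' is an assumption; the original code does not create it.
toy_trip.insert(0, "id", toy_trip.index + 1)
print(toy_trip)
```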


## Bike
- ID (PK)
- Number (a made-up label derived from the ID, just so the table has more than one column)

In [192]:
bike = df_citibike[['Bike ID']].copy()
bike.drop_duplicates(inplace=True)
bike['id'] = bike['Bike ID']
bike['number'] = 'BK00' + bike['id'].astype(str)
bike.drop(columns=['Bike ID'], inplace=True)
bike.head()

Unnamed: 0,id,number
0,24647,BK0024647
1,24605,BK0024605
2,24689,BK0024689
3,24693,BK0024693
4,24573,BK0024573


# Database Creation
The script can be found [here](/table_creation.sql).
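
The actual DDL lives in the linked script, which is not reproduced here. Purely as an illustration of how the keys listed in the schema section might translate into SQL, here is a hypothetical sketch run against an in-memory SQLite database. The real project targets PostgreSQL, and every column type and constraint below is an assumption.

```python
import sqlite3

# Hypothetical DDL: table and column names follow the schema section,
# but the types and constraints are assumptions, not the contents of
# table_creation.sql.
ddl = """
CREATE TABLE station (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    latitude  REAL,
    longitude REAL
);
CREATE TABLE bike (
    id     INTEGER PRIMARY KEY,
    number TEXT
);
CREATE TABLE trip (
    id               INTEGER PRIMARY KEY,
    start_station_id INTEGER REFERENCES station (id),
    end_station_id   INTEGER REFERENCES station (id),
    bike_id          INTEGER REFERENCES bike (id),
    start_time       TEXT,
    stop_time        TEXT,
    duration         INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
conn.executescript(ddl)

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)                              # ['bike', 'station', 'trip']
```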

## Creating Connection

In [163]:
from sqlalchemy import create_engine

postgresql_url = 'postgresql://postgres:postgres@localhost:5432/citibike'

engine = create_engine(postgresql_url)

connection = engine.connect()

## Data Insertion

In [204]:
weather.to_sql('weather', connection, if_exists='replace', index=False)

366

In [206]:
trip.to_sql('trip', connection, if_exists='replace', index=False)

204

In [208]:
bike.to_sql('bike', connection, if_exists='replace', index=False)

566

In [207]:
station.to_sql('station', connection, if_exists='replace', index=False)

102
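
One caveat with `if_exists='replace'`: pandas drops the existing table and recreates it from the frame's dtypes, so any primary or foreign keys defined beforehand are lost. If the tables are created by a script first, `if_exists='append'` keeps that definition. The sketch below demonstrates the difference against in-memory SQLite, an assumption made so the example runs without a PostgreSQL server:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Table created up front with a primary key, as a creation script would do.
conn.execute("CREATE TABLE bike (id INTEGER PRIMARY KEY, number TEXT)")

toy_bike = pd.DataFrame({"id": [24647, 24605],
                         "number": ["BK0024647", "BK0024605"]})

def table_sql(connection):
    """Return the CREATE TABLE statement SQLite has stored for 'bike'."""
    return connection.execute(
        "SELECT sql FROM sqlite_master WHERE name = 'bike'").fetchone()[0]

# 'append' inserts into the existing table, so the key definition survives.
toy_bike.to_sql("bike", conn, if_exists="append", index=False)
kept_key = "PRIMARY KEY" in table_sql(conn)

# 'replace' drops the table and lets pandas recreate it without the key.
toy_bike.to_sql("bike", conn, if_exists="replace", index=False)
lost_key = "PRIMARY KEY" not in table_sql(conn)

print(kept_key, lost_key)          # True True
```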

# Views Creation

Scripts for the views creation can be found [here](/views.sql). Explanation of each view below.

## Trips per Month
We can analyze in which months users take the most trips, so we know what we can expect.
## Trips per Season
The same idea as the monthly view, but with trips grouped into seasons.
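
The views themselves are written in SQL in the linked script. As a rough pandas equivalent of the monthly and seasonal counts, the following sketch groups made-up trips by month and by season; the season boundaries (meteorological seasons) are an assumption and may differ from what the views use.

```python
import pandas as pd

# Made-up trip start times spread over the year.
toy_trip = pd.DataFrame({"start_time": pd.to_datetime([
    "2016-01-15 08:00", "2016-04-02 09:30", "2016-07-04 18:45",
    "2016-07-20 07:10", "2016-10-31 17:05", "2016-12-24 12:00",
])})

# Trips per calendar month.
month = toy_trip["start_time"].dt.month
trips_per_month = toy_trip.groupby(month).size()

# Meteorological seasons (an assumed definition): Dec-Feb winter,
# Mar-May spring, Jun-Aug summer, Sep-Nov autumn.
season_of_month = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}
trips_per_season = toy_trip.groupby(month.map(season_of_month)).size()

print(trips_per_month)
print(trips_per_season)
```
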
## Average duration per Month
We analyze how long trips last on average in each month.
## Trips per Hour
We can analyze at which hours of the day users are most likely to rent a bike.
## Trips per Temperature Range
We can analyze at which temperatures users are most likely to rent a bike.
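
As a sketch of how such a grouping could work, the example below joins made-up trips to made-up daily weather and buckets the average temperature with `pd.cut`. The bin edges and labels are arbitrary assumptions, and `weather_id` is a hypothetical join column.

```python
import pandas as pd

# Made-up daily weather and per-trip weather keys (YYYYMMDD integers).
toy_weather = pd.DataFrame({"id": [20160105, 20160415, 20160720],
                            "avg_temp": [19, 56, 85]})
toy_trip = pd.DataFrame({"weather_id": [20160105, 20160415, 20160415,
                                        20160720, 20160720, 20160720]})

# Attach each trip's daily average temperature.
joined = toy_trip.merge(toy_weather, left_on="weather_id", right_on="id")

# Bucket temperatures into ranges; the bin edges are arbitrary assumptions.
# pd.cut uses right-closed intervals by default: (0, 32], (32, 50], ...
bins = [0, 32, 50, 68, 86, 104]
labels = ["0-32", "33-50", "51-68", "69-86", "87-104"]
joined["temp_range"] = pd.cut(joined["avg_temp"], bins=bins, labels=labels)

# observed=False keeps empty ranges in the result with a count of zero.
trips_per_range = joined.groupby("temp_range", observed=False).size()
print(trips_per_range)
```

Raw counts per range also depend on how many days of the year fall into each range, so dividing by the number of days per range gives a fairer comparison.
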
## Trips per Wind Speed
We can analyze how wind speed affects how likely users are to rent a bike.
## Trips per Precipitation
We compare days with and without precipitation: users are more likely to rent a bike on dry days, but they still ride when there is precipitation.


## *Note*
Other factors can be analyzed just as easily by swapping in different attributes. There are many more views that could be created for this project!