# Bike rental data management

A bike rental company has asked you to create a database to help their analysts understand the effects of weather on bike rentals. 

You’ve been given a year of bike rental data from the company and you’ll source weather data from the government. 

You’ll need to clean and validate both data sets, design a relational PostgreSQL database to store the data, and develop views for the database to assist the analytics team.

### Steps

- Prepare the Data
- Create a Schema

In [10]:
#|export

import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
import glob

In [11]:
import warnings

# Suppress warnings up front to keep the notebook output readable
warnings.filterwarnings("ignore")

## Pre-processing

- data import
- data cleaning

In [None]:
#|export

# Collect the monthly Jersey City trip files for 2016 (sorted for a stable load order)
jc_files = sorted(glob.glob(r'..\bike_rental_data\data\JC-2016*-citibike-tripdata.csv'))
jc_files

In [None]:
# Citi Bike data from Jersey City

df_jc = []

In [None]:
# Read every monthly file and stack them into one DataFrame with a fresh 0..n-1 index
df_jc = pd.concat((pd.read_csv(file) for file in jc_files), ignore_index=True)

In [15]:
# Show the first rows of the source dataset

print('Head of the source data:')
print()
df_jc.head(5)

Head of the source data:



Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
0,362,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,Grove St PATH,40.719586,-74.043117,3209,Brunswick St,40.724176,-74.050656,24647,Subscriber,1964.0,2
1,200,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24605,Subscriber,1962.0,1
2,202,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24689,Subscriber,1962.0,2
3,248,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,Brunswick St,40.724176,-74.050656,3203,Hamilton Park,40.727596,-74.044247,24693,Subscriber,1984.0,1
4,903,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24573,Customer,,0
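The raw headers contain spaces and mixed case (`Trip Duration`, `Start Station ID`), which would need quoting in PostgreSQL. A minimal sketch of one possible cleaning step, renaming columns to snake_case, is shown here. The helper `to_snake_case` and the small `sample` frame are illustrative assumptions, not part of the original notebook.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a header such as 'Start Station ID' to 'start_station_id'."""
    return re.sub(r"\s+", "_", name.strip()).lower()


# Tiny stand-in frame using a few of the raw Citi Bike headers
sample = pd.DataFrame(
    {"Trip Duration": [362], "Start Station ID": [3186], "User Type": ["Subscriber"]}
)

# rename() accepts a function and applies it to every column label
sample = sample.rename(columns=to_snake_case)
print(list(sample.columns))
```

The same call, `df_jc.rename(columns=to_snake_case)`, should work on the full frame once it is loaded.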


In [14]:
df_jc.describe()

Unnamed: 0,Trip Duration,Start Station ID,Start Station Latitude,Start Station Longitude,End Station ID,End Station Latitude,End Station Longitude,Bike ID,Birth Year,Gender
count,247584.0,247584.0,247584.0,247584.0,247584.0,247584.0,247584.0,247584.0,228585.0,247584.0
mean,885.6305,3207.065206,40.723121,-74.046438,3203.572553,40.722594,-74.045855,24935.260481,1979.335276,1.123534
std,35937.98,26.955103,0.008199,0.011211,61.579494,0.007958,0.011283,748.469712,9.596809,0.518687
min,61.0,3183.0,40.69264,-74.096937,147.0,40.692216,-74.096937,14552.0,1900.0,0.0
25%,248.0,3186.0,40.717732,-74.050656,3186.0,40.71654,-74.050444,24491.0,1974.0,1.0
50%,390.0,3201.0,40.721525,-74.044247,3199.0,40.721124,-74.043117,24609.0,1981.0,1.0
75%,666.0,3211.0,40.727596,-74.038051,3211.0,40.727224,-74.036486,24719.0,1986.0,1.0
max,16329810.0,3426.0,40.752559,-74.032108,3426.0,40.801343,-73.95739,27274.0,2000.0,2.0
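Two values in this summary look implausible: the maximum `Trip Duration` of 16,329,810 seconds is roughly 189 days, and the minimum `Birth Year` is 1900. The sketch below flags such rows instead of dropping them, so the decision stays visible to the analysts. Both cutoffs and the function name `flag_suspect_rows` are assumptions chosen for illustration.

```python
import pandas as pd

MAX_TRIP_SECONDS = 24 * 60 * 60  # assumed cutoff: one day
MIN_BIRTH_YEAR = 1916            # assumed cutoff: riders older than 100 in 2016


def flag_suspect_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with boolean columns marking implausible values."""
    out = df.copy()
    out["suspect_duration"] = out["Trip Duration"] > MAX_TRIP_SECONDS
    # NaN compared with a number is False, so missing birth years are not flagged
    out["suspect_birth_year"] = out["Birth Year"] < MIN_BIRTH_YEAR
    return out


# Stand-in rows: one normal trip, one with the extreme values seen above, one with no birth year
sample = pd.DataFrame(
    {"Trip Duration": [362, 16329810, 903], "Birth Year": [1964.0, 1900.0, None]}
)
flagged = flag_suspect_rows(sample)
print(flagged[["suspect_duration", "suspect_birth_year"]])
```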


In [16]:
df_jc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247584 entries, 0 to 247583
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Trip Duration            247584 non-null  int64  
 1   Start Time               247584 non-null  object 
 2   Stop Time                247584 non-null  object 
 3   Start Station ID         247584 non-null  int64  
 4   Start Station Name       247584 non-null  object 
 5   Start Station Latitude   247584 non-null  float64
 6   Start Station Longitude  247584 non-null  float64
 7   End Station ID           247584 non-null  int64  
 8   End Station Name         247584 non-null  object 
 9   End Station Latitude     247584 non-null  float64
 10  End Station Longitude    247584 non-null  float64
 11  Bike ID                  247584 non-null  int64  
 12  User Type                247204 non-null  object 
 13  Birth Year               228585 non-null  float64
 14  Gender                   247584 non-null  int64  
dtypes: float64(5), int64(5), object(5)
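`Start Time` and `Stop Time` are reported as `object`, meaning they are still plain strings. A rough sketch of parsing them into datetimes, and cross-checking the result against `Trip Duration`, could look like this. The format string is an assumption inferred from the first rows shown earlier, and `duration_matches` is a hypothetical helper column.

```python
import pandas as pd

# Stand-in rows copied from the head of the dataset
sample = pd.DataFrame(
    {
        "Trip Duration": [362, 200],
        "Start Time": ["2016-01-01 00:02:52", "2016-01-01 00:18:22"],
        "Stop Time": ["2016-01-01 00:08:54", "2016-01-01 00:21:42"],
    }
)

for col in ["Start Time", "Stop Time"]:
    # errors="coerce" turns unparseable strings into NaT instead of raising
    sample[col] = pd.to_datetime(
        sample[col], format="%Y-%m-%d %H:%M:%S", errors="coerce"
    )

# Cross-check: elapsed seconds should equal the recorded trip duration
elapsed = (sample["Stop Time"] - sample["Start Time"]).dt.total_seconds()
sample["duration_matches"] = elapsed == sample["Trip Duration"]
print(sample.dtypes)
```

Rows where the parse yields `NaT`, or where the check fails, are candidates for a closer look before loading.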

In [18]:
# Count missing values per column, largest first

(
    df_jc.isna()
    .sum()
    .sort_values(ascending=False)
)

Birth Year                 18999
User Type                    380
Trip Duration                  0
Start Time                     0
Stop Time                      0
Start Station ID               0
Start Station Name             0
Start Station Latitude         0
Start Station Longitude        0
End Station ID                 0
End Station Name               0
End Station Latitude           0
End Station Longitude          0
Bike ID                        0
Gender                         0
dtype: int64
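Only two columns have gaps: `Birth Year` (18,999 rows, about 7.7% of the data) and `User Type` (380 rows). One possible way to handle them before loading is sketched below. The `"Unknown"` label is an assumed placeholder, and leaving those values as NULL would be an equally reasonable choice.

```python
import pandas as pd

# Stand-in rows with one missing value in each affected column
sample = pd.DataFrame(
    {
        "User Type": ["Subscriber", None, "Customer"],
        "Birth Year": [1964.0, 1984.0, None],
    }
)

# Nullable integer dtype keeps missing years as <NA> instead of forcing floats
sample["Birth Year"] = sample["Birth Year"].astype("Int64")

# Assumed placeholder label for riders whose type was not recorded
sample["User Type"] = sample["User Type"].fillna("Unknown")

print(sample)
```

Storing `Birth Year` as `Int64` means the column can map to a nullable integer column in PostgreSQL without the trailing `.0` seen in the raw output.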

In [None]:
display(df_jc)

In [None]:
#|default_exp core
#| hide
import nbdev; nbdev.nbdev_export()