# Exploratory Analysis - SF Bay Area Bike Share

| Last Name, First Name | Student ID | Email |
| ----------------- | ------ | ------------------- |
| Alvarez Avalos, Dylan Gustavo| 98225 | dylanalvarez1995@gmail.com |
| Gerstner, Facundo Agustin | 96255 | facugerstner_29@hotmail.com |
| Llauró, Manuel Luis | 95736 | llauromanuel@gmail.com |
| Prediger Vianello, Emiliano Javier | 94165 | ej.prediger@gmail.com |

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import math
import matplotlib.pyplot as plt

%matplotlib inline

plt.style.use('default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (17, 7)

pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Clean-up

## Station.csv

In [None]:
stations = pd.read_csv('../input/station.csv', sep=',', parse_dates=['installation_date'],
                      infer_datetime_format=True,low_memory=False)

### Types

In [None]:
stations.dtypes

### Null values

In [None]:
stations.isnull().any()

### Size

In [None]:
stations.shape

### Head

In [None]:
stations.head()

## Weather.csv

In [None]:
weather = pd.read_csv('../input/weather.csv', sep=',', parse_dates=['date'],
                      infer_datetime_format=True,low_memory=False)

### Types

In [None]:
weather.dtypes

In [None]:
weather['precipitation_inches'].unique()

'T' is a valid value: it stands for "trace" and means that rain was detected, but not enough to be measured.

[Source 1](http://help.wunderground.com/knowledgebase/articles/656875-what-does-t-stand-for-on-the-rain-precipitation)

[Source 2](http://www.experts123.com/q/what-does-the-t-mean-in-the-precipitation-column-of-the-data-listing.html) states that trace precipitation is less than 0.01 inches, equivalent to 0.254 mm.
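
One way to keep trace readings rather than discard them is to map 'T' to a small positive number below the 0.01 inch threshold. The sketch below is only an illustration: the helper `parse_precipitation` and the placeholder value `TRACE_INCHES = 0.005` are assumptions, not something the data source prescribes.

```python
import math

# Assumption: any positive value below the 0.01 inch threshold could stand in for "trace"
TRACE_INCHES = 0.005


def parse_precipitation(value, trace_value=TRACE_INCHES):
    """Return precipitation in inches as a float; 'T' maps to trace_value."""
    if isinstance(value, str) and value.strip().upper() == 'T':
        return trace_value
    try:
        return float(value)
    except (TypeError, ValueError):
        # Missing or unparseable readings become NaN
        return math.nan


# The threshold quoted above, converted from inches to millimetres
threshold_mm = round(0.01 * 25.4, 3)
```

It could be applied to the column with `weather['precipitation_inches'].apply(parse_precipitation)`.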

In [None]:
weather[weather['precipitation_inches'] == 'T']['events'].unique()

The events show that these were rainy days, or at least humid ones given the presence of fog.

In [None]:
weather['events'].unique()

In [None]:
weather['events'] = weather['events'].apply(lambda x: 'Rain' if x == 'rain' else x)

### Null values

In [None]:
weather.isnull().any()

### Size

In [None]:
weather.shape

### Head

In [None]:
weather.head()

## Trip.csv

In [None]:
trips = pd.read_csv('../input/trip.csv', sep=',', parse_dates=['start_date','end_date'],
                      infer_datetime_format=True,low_memory=False)

### Types

In [None]:
trips.dtypes

In [None]:
print (trips[pd.to_numeric(trips['zip_code'], errors='coerce').isnull()]['zip_code'])

In [None]:
trips.zip_code = pd.to_numeric(trips.zip_code, errors='coerce')

### Null values

In [None]:
trips.isnull().any()

### Outliers

In [None]:
trips.sort_values(by='duration',ascending=False).head()

There are trips lasting more than 1 million seconds (about 11.6 days), with the longest running to almost 200 days. These are clearly outliers that must be removed.
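
To put these durations in perspective, seconds can be converted to days. This is a minimal sketch; the helper names `seconds_to_days` and `days_to_seconds` are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def seconds_to_days(seconds):
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


def days_to_seconds(days):
    """Convert a duration in days to seconds."""
    return days * SECONDS_PER_DAY


print(round(seconds_to_days(1_000_000), 2))  # 11.57
print(days_to_seconds(200))                  # 17280000
```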

In [None]:
# a scatter plot makes the point above easier to see

fig, ax = plt.subplots(figsize=(17, 7))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

duration_count = trips.loc[:,['duration','id']].groupby('duration').agg('count').reset_index()
duration_count.columns = ['duration','trip_count']
duration_count.plot.scatter('duration','trip_count',ax=ax);

plt.title('Number of trips vs. duration')
plt.xlabel('Duration (seconds)')
plt.ylabel('Number of trips')

In [None]:
trips.sort_values(by='duration').head()

There are also trips that last about 1 minute and whose start and end stations are the same.

In [None]:
# most trips last between 200 and 600 seconds
%matplotlib inline

x = duration_count.loc[ ( 100 < duration_count.duration ) & ( duration_count.duration < 800)].duration
y = duration_count.loc[ ( 100 < duration_count.duration ) & ( duration_count.duration < 800)].trip_count

fig, ax = plt.subplots(figsize=(17, 7))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Scatter plot of trip durations')
ax.set_xlabel('Duration (seconds)')
ax.set_ylabel('Number of trips')
ax.scatter(x,y)

In [None]:
same_destination = trips[trips['start_station_name'] == trips['end_station_name']]
same_destination = same_destination[same_destination['duration'] > 120]
same_destination = same_destination[same_destination['duration'] < 3600]
same_destination.sort_values(by='duration',ascending=True)

In [None]:
viajes_sin_destino = trips[trips['start_station_name'] == trips['end_station_name']].shape[0]

print(str(viajes_sin_destino) + ' "trips"')

In [None]:
# describe() of the trip durations, outliers included

trips.duration.describe()

In [None]:
# Data cleaning: keep trips longer than 2 minutes and no longer than 12 hours (43200 seconds).
# .copy() avoids SettingWithCopyWarning when columns are added to the result later.
clean_trips = trips[(trips['duration'] > 120) & (trips['duration'] <= 43200)].copy()

trips = clean_trips

clean_trips.duration.describe()

There are still "long" trips. A histogram shows how the durations are distributed.

In [None]:
clean_trips.sort_values(by='duration', ascending=False).head()

In [None]:
trips[trips['duration'] < 3600].shape[0]

In [None]:
trips.shape

However, the vast majority of trips last less than 1 hour, which is reasonable.
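
The proportion of trips under one hour can be stated as a number by averaging a boolean mask. The sketch below uses invented toy durations (`toy_durations`), not values from trip.csv.

```python
import pandas as pd

# Toy durations in seconds (illustrative only)
toy_durations = pd.Series([300, 450, 620, 900, 1800, 4000, 20000])

# The mean of a boolean mask is the fraction of True values
share_under_hour = (toy_durations < 3600).mean()
print('{:.1%} of the toy trips last less than one hour'.format(share_under_hour))
```

On the real data the same expression would be `(trips['duration'] < 3600).mean()`.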

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))
    
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.hist(clean_trips['duration'], bins=200)
plt.xlabel("Duration (seconds)")
plt.ylabel("Number of trips")
plt.title("Histogram of trip durations")
plt.show()

Even after keeping only the trips shorter than 12 hours, the histogram still shows a long tail; the vast majority of trips are much shorter.

In [None]:
%matplotlib inline

fig, ax = plt.subplots(figsize=(8, 8))
    
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

fig.suptitle('Box plot of trip durations under 1 hour', fontweight='bold')
# Pass the data as y so the box is drawn vertically
sns.boxplot(y=trips.loc[trips.duration < 3600, 'duration'], ax=ax)
ax.set_xlabel("trips < 3600");

### Size

In [None]:
trips.shape

### Sample

In [None]:
trips.head()

## Status.csv

In [None]:
iter_status = pd.read_csv('../input/status.csv', iterator = True, chunksize = 100000)

### Types

In [None]:
iter_status.get_chunk(1).dtypes

# Analisis

## Which station do the most trips depart from?

In [None]:
trips['start_station_name'].value_counts()[:5]

## Top 20 routes

In [None]:
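# Aggregate the trip duration (not the id) so that the mean and std columns are meaningful
routes_count = trips.loc[:,['duration','start_station_name','end_station_name']]\
        .groupby(['start_station_name','end_station_name'])['duration'].agg(['count','mean','std'])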
routes_count.columns = ['trips_count','mean_duration','duration_std']
routes_count = routes_count.reset_index().sort_values('trips_count', ascending=False)[:20]
routes_count

## What subscription types exist?

In [None]:
trips.subscription_type.unique()

### How many trips did each type make?

In [None]:
subscribers_trip_count = pd.DataFrame(trips.groupby('subscription_type')['id'].agg('count'))
subscribers_trip_count.columns = ['travel_count']
subscribers_trip_count

## How does the number of trips vary by day of the week?

In [None]:
trips.isnull().any()

In [None]:
# answering this question requires a plot, which in turn needs a "day_of_week" column
trips['day_of_week'] = trips['start_date'].dt.dayofweek
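
pandas numbers the days of the week from Monday (0) to Sunday (6), which is why the plots that follow label the sorted counts starting with Monday. The sketch below, on invented toy dates, shows that mapping and one way to guarantee all seven days are present before labels are assigned. `DAY_NAMES` and the sample dates are illustrative assumptions.

```python
import pandas as pd

DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Toy dates: a Monday, a Saturday and a Sunday
toy_dates = pd.Series(pd.to_datetime(['2014-01-06', '2014-01-11', '2014-01-12']))

# dt.dayofweek returns 0 for Monday through 6 for Sunday
day_names = toy_dates.dt.dayofweek.map(lambda d: DAY_NAMES[d])

# reindex fills in days with no trips, so the result always has 7 rows
counts = toy_dates.dt.dayofweek.value_counts().reindex(range(7), fill_value=0)
counts.index = DAY_NAMES
```

Reindexing first means that assigning seven labels cannot fail or misalign when a day happens to have no trips.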

In [None]:
trips_by_day_count = trips['day_of_week'].value_counts().sort_index()
trips_by_day_count.index = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

fig, ax = plt.subplots(figsize=(17, 7))
    
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Series.plot no longer accepts the plot kind as a positional argument
trips_by_day_count.plot(kind='bar', ax=ax)

plt.title('Total trips by day of the week')
plt.xlabel('Day of the week')
plt.ylabel('Number of trips');

## Do "Customers" tend to ride more on weekends?

In [None]:
customers_trips_by_day = trips.loc[trips['subscription_type'] == 'Customer',['day_of_week']]['day_of_week'].value_counts().sort_index()
customers_trips_by_day.index = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

fig, ax = plt.subplots(figsize=(17, 7))
    
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Series.plot no longer accepts the plot kind as a positional argument
customers_trips_by_day.plot(kind='bar', rot=0, ax=ax)

plt.title('Total trips by day of the week for "Customers"')
plt.xlabel('Day of the week')
plt.ylabel('Number of trips');

## On average, is trip duration constant across the week? Do people bike to work/school?

In [None]:
def getNames(seriesOfNumbers):
    names = []
    days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
    
    for numDay in seriesOfNumbers:
        names.append(days[numDay])
    return names

In [None]:
tripsByDayAndDuration = trips.loc[:,['day_of_week','duration']].sort_values('day_of_week')
tripsByDayAndDuration = tripsByDayAndDuration.groupby('day_of_week').mean().reset_index()
tripsByDayAndDuration['day_of_week'] = tripsByDayAndDuration[['day_of_week']].apply(lambda dates: getNames(dates))

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

trip_duration_values = tuple(tripsByDayAndDuration.duration)
ind = np.arange(len(tripsByDayAndDuration.day_of_week))
width = 0.5

plt.bar(ind, trip_duration_values, width)
plt.title('Mean trip duration')
plt.xlabel('Day of the week')
plt.ylabel('Trip duration (seconds)')
plt.xticks(ind, tuple(tripsByDayAndDuration.day_of_week))

plt.show()

## Mean Duration Customers vs. Mean Duration Subscribers

In [None]:
duration_trip_by_subscription = pd.DataFrame({'mean_duration': trips.groupby(['subscription_type'])['duration'].mean(),\
                                'std_duration': trips.groupby(['subscription_type'])['duration'].std()})

duration_trip_by_subscription

## Mean trip duration of 'Subscribers'

In [None]:
subscriptors_trips = trips.loc[trips.subscription_type == 'Subscriber',['day_of_week','duration']]
subscriptors_trips = subscriptors_trips.groupby('day_of_week').mean().reset_index()
subscriptors_trips['day_of_week'] = subscriptors_trips[['day_of_week']].apply(lambda dates: getNames(dates))

In [None]:
subscriptors_trips

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Mean trip duration of \'Subscribers\'')
ax.set_xlabel('Day of the week')
ax.set_ylabel('Trip duration (seconds)')

trip_duration_values = tuple(subscriptors_trips.duration)
ind = np.arange(len(subscriptors_trips.day_of_week))
width = 0.5

plt.bar(ind, trip_duration_values, width)

plt.xticks(ind, tuple(subscriptors_trips.day_of_week))

plt.show()

## Mean trip duration of 'Customers'

In [None]:
customers_trips = trips.loc[trips.subscription_type == 'Customer',['day_of_week','duration']]
customers_trips = customers_trips.groupby('day_of_week').mean().reset_index()
customers_trips['day_of_week'] = customers_trips[['day_of_week']].apply(lambda dates: getNames(dates))

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Mean trip duration of \'Customers\'')
ax.set_xlabel('Day of the week')
ax.set_ylabel('Trip duration (seconds)')

trip_duration_values = tuple(customers_trips.duration)
ind = np.arange(len(customers_trips.day_of_week))
width = 0.5

plt.bar(ind, trip_duration_values, width)

plt.xticks(ind, tuple(customers_trips.day_of_week))

plt.show()

## How variable is trip duration across the week?

In [None]:
# Variability for 'Subscribers'
trips_by_day = trips.loc[(trips.duration < 3600) & (trips.subscription_type == 'Subscriber'),['duration','day_of_week','id']]\
            .pivot_table(index='id', columns='day_of_week')
days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax = trips_by_day.boxplot();
ax.set_xticklabels(days);
plt.title('Variability of trip duration for \'Subscribers\'')
plt.xlabel('Day of the week')
plt.ylabel('Trip duration (seconds)');

In [None]:
# Variability for 'Customers'
trips_by_day = trips.loc[trips.subscription_type == 'Customer',['duration','day_of_week','id']]\
            .pivot_table(index='id', columns='day_of_week')
days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax = trips_by_day.boxplot();
ax.set_xticklabels(days);
plt.title('Variability of trip duration for \'Customers\'')
plt.xlabel('Day of the week')
plt.ylabel('Trip duration (seconds)');

## How do trips vary by time of day?

In [None]:
trips_with_hour = trips.loc[:, ['start_date','duration','id']]
trips_with_hour['start_time'] = trips_with_hour.start_date.dt.time
trips_with_hour['start_date'] = trips_with_hour.start_date.dt.date
trips_with_hour['start_hour'] = trips_with_hour.start_time.apply(lambda x: x.hour)

In [None]:
trips_by_hour = trips_with_hour.loc[:, ['start_hour','id']].groupby('start_hour').count()
trips_by_hour.columns = ['trips_count']
# trips_by_hour.plot(kind='bar',rot=0);

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Number of trips vs. start hour')
ax.set_xlabel('Start hour')
ax.set_ylabel('Number of trips')

trip_count_values = tuple(trips_by_hour.trips_count)
ind = np.arange(len(trips_by_hour.index))
width = 0.5

plt.bar(ind, trip_count_values, width)

plt.xticks(ind, tuple(trips_by_hour.index))

plt.show()

## How does the mean trip duration vary by time of day?

In [None]:
duration_by_hour = trips_with_hour.loc[:,['start_hour','duration']].groupby('start_hour').mean()

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Mean duration vs. start hour')
ax.set_xlabel('Start hour')
ax.set_ylabel('Trip duration (seconds)')

mean_duration_by_hour = tuple(duration_by_hour.duration)
ind =  np.arange(len(duration_by_hour.index))

width = 0.5

plt.bar(ind, duration_by_hour.duration, width)

plt.xticks(ind, tuple(duration_by_hour.index))

plt.show()

## What happens at 3 a.m.?

In [None]:
trips_with_hour.loc[trips_with_hour['start_hour'] == 3].describe()

## How does the mean trip duration vary by month?

In [None]:
trips['month'] = trips.start_date.apply(lambda x: x.month)

trips_by_month = trips.loc[:,['month','duration']].groupby('month').mean()
trips_by_month.index = ['January','February','March','April','May','June','July','August','September','October','November','December']

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Mean duration vs. month')
ax.set_xlabel('Month')
ax.set_ylabel('Trip duration (seconds)')

mean_duration_by_month = tuple(trips_by_month.duration)
ind = np.arange(len(trips_by_month.index))
width = 0.5

plt.bar(ind, mean_duration_by_month, width)

plt.xticks(ind, tuple(trips_by_month.index))

plt.show()

## How does the number of trips vary between stations?

In [None]:
stations.rename(columns={'name': 'start_station_name'}, inplace=True)
geo_station_trips = pd.merge(trips,stations, on = ['start_station_name'], how = 'outer') 
geo_station_trips = geo_station_trips[['start_station_name','end_station_name','lat','long']]

In [None]:
# Add a helper column used to count the trips to each station from a given start station
geo_station_trips['cant'] = 1

In [None]:
trips_btw_stations = geo_station_trips.groupby(['lat','long','start_station_name', 'end_station_name'])\
                    [['cant']].sum().reset_index()
trips_btw_stations.head()

In [None]:
# To get a better sense of the geography, sort the stations from north to south and from west to east
stations = stations.sort_values(by = ['lat','long'],ascending=[False, True]).reset_index()

In [None]:
# Build a matrix whose index and columns are the station names,
# to be filled with the previously computed trip counts between stations
columns = stations['start_station_name']
matrix = pd.DataFrame(0, index=columns, columns=columns)

In [None]:
# Load the data into the matrix
maxTripPos =  trips_btw_stations['cant'].argmax()
minTripPos =  trips_btw_stations['cant'].argmin()
for x in range(0, trips_btw_stations.shape[0]):
        # Start station
        pos_x = stations[stations['start_station_name'] == trips_btw_stations.iloc[x,2]].index.tolist()
        # End station
        pos_y =  stations[stations['start_station_name'] == trips_btw_stations.iloc[x,3]].index.tolist()
        # The values could be normalized with the commented-out expression below
#         val = (trips_btw_stations.iloc[x,4] - trips_btw_stations['cant'].mean())\
#         / (trips_btw_stations.iloc[maxTripPos,4] - trips_btw_stations.iloc[minTripPos,4])
        val = trips_btw_stations.iloc[x,4]
        matrix.iloc[pos_x,pos_y] = val
        if (val <= 0) : matrix.iloc[pos_x,pos_y] = 0
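
The cell-by-cell loop above can be slow on a large table. A possible alternative, sketched here on invented toy data, is to let `pd.crosstab` count the start/end pairs and then `reindex` the result to the desired station order. This is a different technique from the loop used in this notebook, and `toy_trips` and `station_order` are illustrative names.

```python
import pandas as pd

# Toy trips between three hypothetical stations
toy_trips = pd.DataFrame({
    'start_station_name': ['A', 'A', 'B', 'C', 'A'],
    'end_station_name':   ['B', 'B', 'A', 'C', 'C'],
})

# Desired row/column order (e.g. sorted north to south, west to east)
station_order = ['A', 'B', 'C']

# Count every (start, end) pair in one pass
pair_counts = pd.crosstab(toy_trips['start_station_name'], toy_trips['end_station_name'])

# Put rows and columns in the chosen order; pairs with no trips become 0
toy_matrix = pair_counts.reindex(index=station_order, columns=station_order, fill_value=0)
print(toy_matrix)
```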

In [None]:
plt.rcParams['ytick.labelsize']

In [None]:
# get the tick label font size
fontsize_pt = 10 #plt.rcParams['ytick.labelsize']
dpi = 72.27

# compute the matrix height in points and inches
matrix_height_pt = fontsize_pt * 70
matrix_height_in = matrix_height_pt / dpi

# compute the required figure height 
top_margin = 0.04  # in percentage of the figure height
bottom_margin = 0.04 # in percentage of the figure height
figure_height = matrix_height_in / (1 - top_margin - bottom_margin)


# build the figure instance with the desired height
fig, ax = plt.subplots(
        figsize=(30,figure_height), 
        gridspec_kw=dict(top=2,wspace = 12))

fig.suptitle('Number of trips between stations in the SF Bay Area', fontsize=44, fontweight='bold',x= 0.4,y=2.08)

# let seaborn do its thing
ax = sns.heatmap(matrix,cmap='Blues', linewidths=.8, ax=ax)

Because the trip counts between stations span a very wide range, the color scale is made robust to show more clearly which station pairs see the most travel.
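
seaborn documents `robust=True` as computing the colormap range from robust quantiles instead of the extreme values. The sketch below illustrates the idea with NumPy on invented counts; the 2nd and 98th percentiles are an assumption about the quantiles involved, not something this notebook verifies.

```python
import numpy as np

# Toy counts: 99 ordinary values plus one extreme station pair
toy_counts = np.append(np.arange(99, dtype=float), 5000.0)

# Assumed robust limits: 2nd and 98th percentiles instead of min and max
vmin, vmax = np.percentile(toy_counts, [2, 98])

# Values outside the limits share the end colors of the scale
clipped = np.clip(toy_counts, vmin, vmax)
print('raw max:', toy_counts.max(), 'robust max:', vmax)
```

With limits like these, a single very busy pair no longer compresses the rest of the matrix into the lightest colors.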

In [None]:
# get the tick label font size
fontsize_pt = 10 #plt.rcParams['ytick.labelsize']
dpi = 72.27

# compute the matrix height in points and inches
matrix_height_pt = fontsize_pt * 70
matrix_height_in = matrix_height_pt / dpi

# compute the required figure height 
top_margin = 0.04  # in percentage of the figure height
bottom_margin = 0.04 # in percentage of the figure height
figure_height = matrix_height_in / (1 - top_margin - bottom_margin)


# build the figure instance with the desired height
fig, ax = plt.subplots(
        figsize=(25,figure_height), 
        gridspec_kw=dict(top=2,wspace = 12))


fig.suptitle('Number of trips between stations in the SF Bay Area (robust colors)', fontsize=34, fontweight='bold',x= 0.4,y=2.08)
# let seaborn do its thing
ax = sns.heatmap(matrix,cmap='Blues', robust= True, linewidths=.8, ax=ax, xticklabels=True)

It is clear that the further north and west the stations are, the more trips run between them. In the south-east, station pairs with many trips are isolated cases, and the overall trip frequency in that area is much lower.

## Number of trips by sector

### North-west sector

In [None]:
matrix_nor_west = matrix.iloc[:35,:36]

In [None]:
# Analyze the result using the actual trip counts between stations
fig, ax = plt.subplots(figsize=(10,10));  
sns.heatmap(matrix_nor_west,cmap='Blues' , linewidths=.8, ax=ax)
fig.suptitle('Quantity of trips between stations on North-West sector of SF Bay Area', fontsize=22, fontweight='bold')
plt.xlabel('End Stations', fontsize=18)
plt.ylabel('Start Stations', fontsize=16);

Although the values differ greatly from one station to another, it is clear that there is a large number of trips throughout this sector of the SF Bay Area.

In [None]:
description = matrix_nor_west.describe()
description

In [None]:
# Number of trips to each station
matrix_nor_west.sum().sort_values(ascending=False)

In [None]:
matrix_nor_west.shape

In [None]:
# The mean number of trips in the sector is:
all_trips = 0
counter = 0
for x in range(0, 35):
    for y in range(0, 36):
        all_trips = all_trips + matrix_nor_west.iloc[x,y]
        counter += 1
print("Sum of all trips in the region: ",all_trips)
print("Mean of trips in the region: ",all_trips/counter)
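
The nested loops above add up every cell and divide by the number of cells. As a sketch of an equivalent shortcut, the same two figures can be read off the underlying array; the toy matrix `toy_sector` below is invented for illustration.

```python
import pandas as pd

# Toy 2x3 sector matrix (rows: start stations, columns: end stations)
toy_sector = pd.DataFrame([[0, 4, 2],
                           [3, 0, 1]],
                          index=['A', 'B'], columns=['A', 'B', 'C'])

# Sum and mean over all cells, without explicit loops
total_trips = toy_sector.to_numpy().sum()
mean_trips = toy_sector.to_numpy().mean()

print("Sum of all trips in the region: ", total_trips)
print("Mean of trips in the region: ", mean_trips)
```

This also avoids hard-coding the matrix dimensions in `range(...)`.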

In [None]:
# The most visited stations in the sector are:
# 1) San Francisco Caltrain (Townsend at 4th) with an average of 1781 trips
# 2) San Francisco Caltrain 2 (330 Townsend) with an average of 987 trips
# 3) Harry Bridges Plaza (Ferry Building) with an average of 932 trips
# The average number of trips to each station is shown below
matrix_nor_west.mean()

### South-east sector

In [None]:
# .ix was removed from pandas; .iloc performs the same positional slicing
matrix_south_east = matrix.iloc[54:,54:]

In [None]:
# analysis using the actual values in the matrix
fig, ax = plt.subplots(figsize=(10,10));  
sns.heatmap(matrix_south_east,cmap='Blues' , linewidths=.8, ax=ax)
fig.suptitle('Quantity of trips between stations on South-East sector of SF Bay Area', fontsize=22, fontweight='bold')
plt.xlabel('End Stations', fontsize=18)
plt.ylabel('Start Stations', fontsize=16);

In [None]:
matrix_south_east.describe()

In [None]:
# Number of trips to each station
matrix_south_east.sum().sort_values(ascending=False)

In [None]:
matrix_south_east.shape

In [None]:
# The mean for the sector is:
all_trips = 0
counter = 0
for x in range(0, 16):
    for y in range(0, 16):
        all_trips = all_trips + matrix_south_east.iloc[x,y]
        counter += 1
print("Sum of all trips in the region: ",all_trips)
print("Mean of trips in the region: ",all_trips/counter)

### Central sector

In [None]:
# .ix was removed from pandas; .iloc performs the same positional slicing
matrix_center = matrix.iloc[35:54,36:54]

In [None]:
# Analysis using the actual trip counts
fig, ax = plt.subplots(figsize=(10,10));  
sns.heatmap(matrix_center,cmap='Blues' , linewidths=.8, ax=ax)
fig.suptitle('Quantity of trips between stations on Central sector of SF Bay Area', fontsize=22, fontweight='bold')
plt.xlabel('End Stations', fontsize=18)
plt.ylabel('Start Stations', fontsize=16);

In [None]:
matrix_center.describe()

In [None]:
# Number of trips to each station
matrix_center.sum().sort_values(ascending=False)

In [None]:
matrix_center.shape

In [None]:
# The mean for the sector is:
all_trips = 0
counter = 0
for x in range(0, 19):
    for y in range(0, 18):
        all_trips = all_trips + matrix_center.iloc[x,y]
        counter += 1
print("Sum of all trips in the region: ",all_trips)
print("Mean of trips in the region: ",all_trips/counter)

## How does the weather affect the number and duration of trips?

In [None]:
trips['date'] = trips['start_date'].apply(lambda x: x.date())
trips.date = pd.to_datetime(trips.date)

In [None]:
# Mean duration and number of trips per day and zip code.
# Aggregating only 'duration' avoids taking the mean of non-numeric columns.
trip = trips.groupby(['date', 'zip_code'])['duration'].agg(['mean', 'count']).reset_index()
trip.columns = ['date', 'zip_code', 'mean_duration_sec', 'trip_count']
trip['zip_code'] = trip['zip_code'].astype(int)
trip

In [None]:
def is_int(x):
    try:
        int(x)
    except (TypeError, ValueError, OverflowError):
        return False
    return True

trip = trip[trip['zip_code'].apply(is_int)]

In [None]:
def is_float(x):
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

# Note: this drops the rows where precipitation_inches is 'T' (trace)
weather = weather[weather['precipitation_inches'].apply(is_float)].copy()
weather['precipitation_inches'] = weather['precipitation_inches'].astype(float)

In [None]:
weather['Fog'] = weather['events'].apply(lambda x: x == 'Fog' or x == 'Fog-Rain')
weather['Rain'] = weather['events'].apply(lambda x: x == 'Rain' or x == 'Fog-Rain' or x == 'Rain-Thunderstorm')
weather['Thunderstorm'] = weather['events'].apply(lambda x: x == 'Rain-Thunderstorm')
weather = weather.drop('events', axis=1)
weather

In [None]:
trips_and_weather = pd.merge(weather, trip, how='inner', on=['date','zip_code'])

In [None]:
trips_and_weather.shape

In [None]:
# numeric_only=True skips non-numeric columns such as 'date'
weather_corr = trips_and_weather.corr(numeric_only=True).abs().loc[:,['mean_duration_sec', 'trip_count']]

In [None]:
weather_corr = weather_corr[weather_corr.index != 'mean_duration_sec']
weather_corr = weather_corr[weather_corr.index != 'trip_count']
weather_corr = weather_corr[weather_corr.index != 'zip_code']

In [None]:
plt.rcParams['figure.figsize'] = (15, 10)
sns.heatmap(weather_corr,cmap='Oranges');

In [None]:
weather_corr.sort_values(by='trip_count', ascending=False)

It appears that the weather does not strongly affect trip duration, but it does affect the number of trips. In order of relevance, the factors are:

When it is very windy:

In [None]:
a = trips_and_weather.loc[:,['mean_wind_speed_mph', 'trip_count']]
a = a.groupby(['mean_wind_speed_mph']).agg(['mean'])
a = a.reset_index()
a.columns = a.columns.droplevel(1)

fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.plot(a['mean_wind_speed_mph'],a['trip_count'])

plt.title('Number of trips vs. wind speed')
plt.xlabel('Mean wind speed (mph)')
plt.ylabel('Number of trips');

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

trips_and_weather['mean_wind_speed_mph'].hist(bins=50,figsize=(17,5), ax=ax); # TODO: deal with the speed > 9 noise

plt.title('Histogram - Wind speed');

When it is very cloudy:

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Average only trip_count, so non-numeric columns are not aggregated
trips_and_weather.groupby('cloud_cover')['trip_count'].mean().plot(ax=ax).set_ylim(0);

plt.title('Mean number of trips vs. cloud cover')
plt.xlabel('Cloud cover (okta)')
plt.ylabel('Number of trips');

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

trips_and_weather['cloud_cover'].value_counts().plot(kind='bar',ax=ax, rot=0)

# value_counts() counts records (date and zip code pairs), not trips
plt.title('Number of records vs. cloud cover')
plt.xlabel('Cloud cover (okta)')
plt.ylabel('Number of records')

On very cloudy days (cloud_cover > 5) the number of records is rather uneven, so cyclists' apparent preference for cloudy days may not be as strong as it looks. Even so, the same trend is visible when only the less cloudy days are considered.
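
To judge how much weight each average deserves, the group mean can be shown next to the number of records behind it. The following is a minimal sketch on invented toy data (`toy_weather`), not on the merged dataset.

```python
import pandas as pd

# Toy records: many clear-sky rows, very few heavily clouded ones
toy_weather = pd.DataFrame({
    'cloud_cover': [0, 0, 0, 0, 7, 7, 8],
    'trip_count':  [10, 12, 11, 9, 30, 28, 40],
})

# Mean trip count and number of records per cloud cover level
summary = toy_weather.groupby('cloud_cover')['trip_count'].agg(['mean', 'count'])
print(summary)
```

A high mean backed by only one or two records, as for the cloudiest levels here, says little about typical behavior.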

Minimum humidity:

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

trips_and_weather.plot.scatter('min_humidity','trip_count',alpha=0.07,figsize=(17,8), ax=ax);

plt.title('Scatter - Number of trips vs. minimum humidity');
plt.xlabel('Minimum humidity')
plt.ylabel('Number of trips');

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

trips_and_weather.plot.scatter('min_humidity','trip_count',alpha=0.3,figsize=(17,8), ylim=(0,50),ax=ax);

plt.title('Scatter - Number of trips vs. minimum humidity');
plt.xlabel('Minimum humidity')
plt.ylabel('Number of trips');

Average number of trips on a day with rain, fog, or a thunderstorm, vs. an average day, vs. an average day with no weather events:

In [None]:
average_trip_count = trips_and_weather['trip_count'].mean()
fog = trips_and_weather[trips_and_weather['Fog']]
average_trip_count_fog = fog['trip_count'].mean()
rain = trips_and_weather[trips_and_weather['Rain']]
average_trip_count_rain = rain['trip_count'].mean()
thunderstorm = trips_and_weather[trips_and_weather['Thunderstorm']]
average_trip_count_thunderstorm = thunderstorm['trip_count'].mean()
# Records with no fog, rain, or thunderstorm
no_events = trips_and_weather[~(trips_and_weather['Fog'] | trips_and_weather['Rain'] | trips_and_weather['Thunderstorm'])]
average_trip_count_no_events = no_events['trip_count'].mean()

In [None]:
objects = (
    'Average day',
    'Rainy day',
    'Foggy day',
    'Thunderstorm day',
    'Day with no events'
)
y_pos = np.arange(len(objects))
performance = [
    average_trip_count,
    average_trip_count_rain,
    average_trip_count_fog,
    average_trip_count_thunderstorm,
    average_trip_count_no_events
]

fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.bar(y_pos, performance, align='center',alpha=0.5)
plt.xticks(y_pos, objects)
plt.title('Average number of trips vs. type of day')
plt.ylabel('Average number of trips')
plt.xlabel('Type of day')
plt.show()

Why are thunderstorm days so popular?

In [None]:
thunderstorm.shape[0]

There is only one record, which explains the high average number of trips seen in the previous chart.
Next, check that there are "enough" days for the other weather conditions.

In [None]:
print('All', trips_and_weather.shape[0])
print('Rain', rain.shape[0])
print('Fog', fog.shape[0])
print('No events', no_events.shape[0])
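
The record counts above can be folded into the averages themselves, so that a mean is only reported when enough records support it. This is a minimal sketch: the helper `mean_if_enough` and the threshold of 30 records are arbitrary assumptions, not a rule taken from this analysis.

```python
from statistics import mean


def mean_if_enough(values, min_records=30):
    """Return the mean of values, or None when there are fewer than min_records."""
    values = list(values)
    if len(values) < min_records:
        return None
    return mean(values)


# A single record (as with the thunderstorm case) yields no average at all
print(mean_if_enough([220]))                      # None
print(mean_if_enough([1, 2, 3], min_records=3))   # 2
```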

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

trips_and_weather['trip_count'].hist(bins=250,figsize=(17,5),ax=ax);

plt.title('Histogram - Number of trips (all events)');

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

fog['trip_count'].hist(bins=250,figsize=(17,5),ax=ax);

plt.title('Histogram - Number of trips (fog)');

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

rain['trip_count'].hist(bins=250,figsize=(17,5),ax=ax);

plt.title('Histogram - Number of trips (rain)');

In [None]:
fig, ax = plt.subplots(figsize=(17, 7))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

no_events['trip_count'].hist(bins=250,figsize=(17,5),ax=ax);

plt.title('Histogram - Number of trips (no events)');

The high average number of trips on foggy days also seems to be due to a lack of data: for example, the single occurrence of a foggy zip code with 220 trips is overrepresented.  
The same happens with rainy days, although to a much lesser extent.  
In addition, cases where a region (same zip code) has few trips in a day appear to be very common.
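
When one large value dominates a small sample, the median is less affected than the mean. The sketch below shows the difference on invented counts; only the value 220 echoes the case mentioned above, and the other numbers are made up for illustration.

```python
from statistics import mean, median

# Toy daily trip counts for one weather condition: mostly small, one large value
toy_fog_counts = [1, 2, 1, 3, 2, 220]

fog_mean = mean(toy_fog_counts)      # pulled upward by the single 220
fog_median = median(toy_fog_counts)  # stays close to the typical value

print('mean:', round(fog_mean, 2), 'median:', fog_median)
```

Comparing medians across weather conditions would be one way to check whether the differences in the bar chart survive without the extreme records.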