### Challenge - Kaggle - SQL and Pandas
---

Steps:

1 - Create a Kaggle account.

2 - Open the page: https://www.kaggle.com/benhamner/sf-bay-area-bike-share

3 - Create a Python notebook.

4 - Use the data stored in the challenge database.

5 - Define the relationships between the tables (logical model).

6 - Perform the requested analyses in pandas.

##### Notebook setup
---

In [1]:
## Libraries.
import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import date

In [2]:
df = pd.read_csv('trip.csv')
df.head()

Unnamed: 0,id,duration,start_date,start_station_name,start_station_id,end_date,end_station_name,end_station_id,bike_id,subscription_type,zip_code
0,4576,63,8/29/2013 14:13,South Van Ness at Market,66,8/29/2013 14:14,South Van Ness at Market,66,520,Subscriber,94127
1,4607,70,8/29/2013 14:42,San Jose City Hall,10,8/29/2013 14:43,San Jose City Hall,10,661,Subscriber,95138
2,4130,71,8/29/2013 10:16,Mountain View City Hall,27,8/29/2013 10:17,Mountain View City Hall,27,48,Subscriber,97214
3,4251,77,8/29/2013 11:29,San Jose City Hall,10,8/29/2013 11:30,San Jose City Hall,10,26,Subscriber,95060
4,4299,83,8/29/2013 12:02,South Van Ness at Market,66,8/29/2013 12:04,Market at 10th,67,319,Subscriber,94103


In [3]:
## Check whether the "duration" column has the expected (numeric) dtype
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 669959 entries, 0 to 669958
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   id                  669959 non-null  int64 
 1   duration            669959 non-null  int64 
 2   start_date          669959 non-null  object
 3   start_station_name  669959 non-null  object
 4   start_station_id    669959 non-null  int64 
 5   end_date            669959 non-null  object
 6   end_station_name    669959 non-null  object
 7   end_station_id      669959 non-null  int64 
 8   bike_id             669959 non-null  int64 
 9   subscription_type   669959 non-null  object
 10  zip_code            663340 non-null  object
dtypes: int64(5), object(6)
memory usage: 56.2+ MB
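
The `info()` output above lists `start_date` and `end_date` as `object`, meaning they are stored as strings. If date arithmetic is needed later, one option is to parse them with `pd.to_datetime`. The sketch below uses two rows copied from the `df.head()` output; the format string `DATE_FORMAT` and the `elapsed_minutes` column are assumptions inferred from those sample rows, not something the dataset documents.

```python
import pandas as pd

# Two sample rows copied from the df.head() output above.
sample = pd.DataFrame({
    "start_date": ["8/29/2013 14:13", "8/29/2013 12:02"],
    "end_date": ["8/29/2013 14:14", "8/29/2013 12:04"],
    "duration": [63, 83],
})

# Assumed format: month/day/year hour:minute, as seen in the sample rows.
DATE_FORMAT = "%m/%d/%Y %H:%M"

for col in ("start_date", "end_date"):
    sample[col] = pd.to_datetime(sample[col], format=DATE_FORMAT)

# Hypothetical helper column: elapsed time derived from the timestamps.
# The timestamps only have minute resolution, so this will not match
# "duration" (in seconds) exactly.
sample["elapsed_minutes"] = (
    (sample["end_date"] - sample["start_date"]).dt.total_seconds() / 60
)

print(sample.dtypes)
print(sample[["duration", "elapsed_minutes"]])
```

On the full dataset the same loop could be applied to `df`, assuming every row follows the same format.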


In [4]:
## Data input.
db = sqlite3.connect('database.sqlite')

## Helper function to run queries.
def run_query(query):
    return pd.read_sql_query(query, db)
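
Step 5 asks for the relationships between the tables. A possible starting point is to list the tables and their columns straight from SQLite. The sketch below runs against a small in-memory stand-in with only `station` and `trip` (the two tables queried in this notebook); the real `database.sqlite` may contain more tables and columns. The helpers `list_tables` and `list_columns` are hypothetical names introduced for illustration.

```python
import sqlite3

# In-memory stand-in for database.sqlite with a reduced, assumed schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE station (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE trip (
    id INTEGER PRIMARY KEY,
    duration INTEGER,
    start_station_id INTEGER,
    end_station_id INTEGER
);
""")

def list_tables(connection):
    """Return the names of all user tables, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]

def list_columns(connection, table):
    """Return (column_name, declared_type) pairs for one table."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    # The table name comes from sqlite_master, not from user input.
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [(row[1], row[2]) for row in rows]

for table in list_tables(conn):
    print(table, list_columns(conn, table))

# Relationship used later in this notebook:
#   trip.start_station_id -> station.id
# Assumed by analogy (not confirmed here):
#   trip.end_station_id   -> station.id
```

Pointing `sqlite3.connect` at `'database.sqlite'` instead of `":memory:"` would, under the same assumptions, print the actual schema.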

##### Analyses in Pandas:
---

#### Q1: Which trip had the longest duration?

#### Q2: Do unregistered users take longer or shorter trips?

#### Q3: Which stations are the most popular?

#### Q4: Which routes are the most popular?

In [5]:
# Connect to the database:
sqliteConnection = sqlite3.connect('database.sqlite')
cursor = sqliteConnection.cursor()
print("Connected to the database!")

Connected to the database!


In [6]:
## Q1: Longest trip:
query = """SELECT * FROM trip 
ORDER BY duration DESC 
LIMIT 5; """
df_query = pd.read_sql_query(query,sqliteConnection)
df_query.head(1)

Unnamed: 0,id,duration,start_date,start_station_name,start_station_id,end_date,end_station_name,end_station_id,bike_id,subscription_type,zip_code
0,568474,17270400,12/6/2014 21:59,South Van Ness at Market,66,6/24/2015 20:18,2nd at Folsom,62,535,Customer,95531


In [7]:
## Trips lasting 24 hours or more
## 86400 = number of seconds in a day
query = """SELECT start_date, end_date, duration 
FROM trip 
WHERE duration >= 86400 
ORDER BY duration DESC; """
df_query = pd.read_sql_query(query,sqliteConnection)
df_query.head()

Unnamed: 0,start_date,end_date,duration
0,12/6/2014 21:59,6/24/2015 20:18,17270400
1,6/28/2015 21:50,7/23/2015 15:27,2137000
2,5/2/2015 6:17,5/23/2015 16:53,1852590
3,7/10/2015 10:35,7/23/2015 13:27,1133540
4,11/30/2013 13:29,12/8/2013 22:06,722236


In [8]:
## Convert the trip duration from seconds to minutes
## (integer division in SQLite; the 'total_minutos' column holds total minutes)
query = """SELECT start_date, end_date, (duration)/60 AS 'total_minutos'
FROM trip 
ORDER BY duration DESC
LIMIT 10; """
df_query = pd.read_sql_query(query,sqliteConnection)
df_query.head(1)

Unnamed: 0,start_date,end_date,total_minutos
0,12/6/2014 21:59,6/24/2015 20:18,287840
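
To make the size of the longest trip easier to read, the duration can be broken down into days, hours, minutes and seconds. The sketch below is plain arithmetic on the `duration` value shown above (17270400 seconds); `split_seconds` is a hypothetical helper name.

```python
def split_seconds(total_seconds):
    """Split a duration in seconds into (days, hours, minutes, seconds)."""
    days, remainder = divmod(total_seconds, 86400)  # 86400 s per day
    hours, remainder = divmod(remainder, 3600)      # 3600 s per hour
    minutes, seconds = divmod(remainder, 60)        # 60 s per minute
    return days, hours, minutes, seconds

longest = 17270400  # duration of the longest trip, in seconds

print(longest // 60)           # 287840, matching total_minutos above
print(split_seconds(longest))  # (199, 21, 20, 0)
```

The longest recorded trip therefore spans roughly 200 days, which suggests an unreturned bike or a data error rather than an actual ride.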


In [9]:
df_query.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   start_date     10 non-null     object
 1   end_date       10 non-null     object
 2   total_minutos  10 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 368.0+ bytes


In [10]:
## Q2: Find which user types exist and how many trips each type made
query = """SELECT subscription_type, COUNT(*) FROM trip GROUP BY subscription_type; """

df_query = pd.read_sql_query(query,sqliteConnection)
df_query.head()

Unnamed: 0,subscription_type,count(*)
0,Customer,103213
1,Subscriber,566746
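
The counts above show how many trips each user type made, but Q2 asks about trip length. One way to approach it, sketched below, is to compare the mean and median `duration` per `subscription_type` in pandas. The rows are made-up toy values in an in-memory database, not figures from the dataset, and treating `Customer` as the "unregistered" group is an assumption.

```python
import sqlite3
import pandas as pd

# Toy data only: these durations are invented for illustration.
toy_rows = [
    (1, 300, "Subscriber"),
    (2, 420, "Subscriber"),
    (3, 600, "Subscriber"),
    (4, 900, "Customer"),
    (5, 1500, "Customer"),
    (6, 86400, "Customer"),  # one extreme outlier, as seen in Q1
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trip (id INTEGER, duration INTEGER, subscription_type TEXT)"
)
conn.executemany("INSERT INTO trip VALUES (?, ?, ?)", toy_rows)

trips = pd.read_sql_query("SELECT duration, subscription_type FROM trip", conn)

# The median is reported next to the mean because very long trips
# (such as those found in Q1) can pull the mean upward.
summary = trips.groupby("subscription_type")["duration"].agg(
    ["count", "mean", "median"]
)
print(summary)
```

Running the same `groupby` on the real `trip` table (through `sqliteConnection`) would give the comparison Q2 asks for.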


In [11]:
## Q3: Most popular stations (by number of trips started)
query = """SELECT station.name AS Estacoes, count(*) AS Viagens_Iniciadas FROM station
INNER JOIN trip ON station.id = trip.start_station_id
GROUP BY station.name
ORDER BY Viagens_Iniciadas DESC; """
df_query = pd.read_sql_query(query,sqliteConnection)
df_query.head()

Unnamed: 0,Estacoes,Viagens_Iniciadas
0,San Francisco Caltrain (Townsend at 4th),49092
1,San Francisco Caltrain 2 (330 Townsend),33742
2,Harry Bridges Plaza (Ferry Building),32934
3,Embarcadero at Sansome,27713
4,Temporary Transbay Terminal (Howard at Beale),26089
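
The query above ranks stations only by trips started there. If "popular" is taken to include arrivals as well, a possible variant counts each station once per departure and once per arrival using `UNION ALL`. The sketch below runs on the five rows shown in the `df.head()` output at the top of the notebook, loaded into an in-memory table; the alias `total_trips` is a hypothetical name.

```python
import sqlite3

# Start/end station names copied from the df.head() output above.
sample_rows = [
    ("South Van Ness at Market", "South Van Ness at Market"),
    ("San Jose City Hall", "San Jose City Hall"),
    ("Mountain View City Hall", "Mountain View City Hall"),
    ("San Jose City Hall", "San Jose City Hall"),
    ("South Van Ness at Market", "Market at 10th"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trip (start_station_name TEXT, end_station_name TEXT)")
conn.executemany("INSERT INTO trip VALUES (?, ?)", sample_rows)

# Stack departures and arrivals into one column, then count per station.
query = """
SELECT station_name, COUNT(*) AS total_trips
FROM (
    SELECT start_station_name AS station_name FROM trip
    UNION ALL
    SELECT end_station_name AS station_name FROM trip
)
GROUP BY station_name
ORDER BY total_trips DESC, station_name;
"""
ranking = conn.execute(query).fetchall()
for station_name, total_trips in ranking:
    print(station_name, total_trips)
```

Note that a trip starting and ending at the same station is counted twice for that station under this definition.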


In [12]:
## Q4: Most popular routes (start station / end station pairs)
query = '''
SELECT start_station_name, end_station_name, count(*) AS rotas_populares 
FROM trip
GROUP BY start_station_name, end_station_name
ORDER BY rotas_populares DESC'''
df_query = pd.read_sql_query(query,sqliteConnection)
df_query

Unnamed: 0,start_station_name,end_station_name,rotas_populares
0,San Francisco Caltrain 2 (330 Townsend),Townsend at 7th,6216
1,Harry Bridges Plaza (Ferry Building),Embarcadero at Sansome,6164
2,Townsend at 7th,San Francisco Caltrain (Townsend at 4th),5041
3,2nd at Townsend,Harry Bridges Plaza (Ferry Building),4839
4,Harry Bridges Plaza (Ferry Building),2nd at Townsend,4357
...,...,...,...
1911,South Van Ness at Market,Palo Alto Caltrain Station,1
1912,Stanford in Redwood City,California Ave Caltrain Station,1
1913,Stanford in Redwood City,Cowper at University,1
1914,University and Emerson,Evelyn Park and Ride,1
