# Part I: Ford GoBike Dataset Exploration
## by Thárcyla Mourão

> Project [Rubric](https://review.udacity.com/#!/rubrics/3592/view)

## Introduction
> Introduce the dataset

>**Rubric Tip**: Your code should not generate any errors, and should use functions and loops where possible to reduce repetitive code. Prefer functions for reusing code statements.

> **Rubric Tip**: Document your approach and findings in markdown cells. Use comments and docstrings in code cells to document the code functionality.

>**Rubric Tip**: Markup cells should have headers and text that organize your thoughts, findings, and what you plan on investigating next.  



## Preliminary Wrangling


In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

> Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.


In [2]:
# load dataset
df = pd.read_csv('data/fordgobike.csv')
df.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,52185,2019-02-28 17:32:10.1450,2019-03-01 08:01:55.9750,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13.0,Commercial St at Montgomery St,37.794231,-122.402923,4902,Customer,1984.0,Male,No
1,42521,2019-02-28 18:53:21.7890,2019-03-01 06:42:03.0560,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,81.0,Berry St at 4th St,37.77588,-122.39317,2535,Customer,,,No
2,61854,2019-02-28 12:13:13.2180,2019-03-01 05:24:08.1460,86.0,Market St at Dolores St,37.769305,-122.426826,3.0,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,5905,Customer,1972.0,Male,No
3,36490,2019-02-28 17:54:26.0100,2019-03-01 04:02:36.8420,375.0,Grove St at Masonic Ave,37.774836,-122.446546,70.0,Central Ave at Fell St,37.773311,-122.444293,6638,Subscriber,1989.0,Other,No
4,1585,2019-02-28 23:54:18.5490,2019-03-01 00:20:44.0740,7.0,Frank H Ogawa Plaza,37.804562,-122.271738,222.0,10th Ave at E 15th St,37.792714,-122.24878,4898,Subscriber,1974.0,Male,Yes


#### General Information

In [3]:
# get general info on the dataset
print(f"This Ford GoBike dataset has {df.shape[0]} rows and {df.shape[1]} columns", "\n")
df.info()

This Ford GoBike dataset has 183412 rows and 16 columns 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             183412 non-null  int64  
 1   start_time               183412 non-null  object 
 2   end_time                 183412 non-null  object 
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object 
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object 
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64  
 12  user_type                183412 non-null  object 
 13  

#### Drop null values

In [4]:
# drop every row that has at least one missing value, in place
df.dropna(inplace=True)
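
Comparing the two `df.info()` outputs, the drop removes 8,460 rows (183,412 down to 174,952), roughly 4.6% of the data. Before dropping, it can help to see which columns hold the gaps. The following is a minimal sketch on a hand-built frame, not the real dataset: the values are copied from the first three rows of the preview above, and the names `sample`, `null_counts`, `rows_with_nulls` and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; values copied from the first three preview rows.
sample = pd.DataFrame({
    "duration_sec": [52185, 42521, 61854],
    "start_station_id": [21.0, 23.0, 86.0],
    "member_birth_year": [1984.0, np.nan, 1972.0],
    "member_gender": ["Male", None, "Male"],
})

# Count missing values per column.
null_counts = sample.isna().sum()

# Count rows that contain at least one missing value.
rows_with_nulls = int(sample.isna().any(axis=1).sum())

# dropna() with default arguments removes every such row.
cleaned = sample.dropna()

print(null_counts)
print(f"{rows_with_nulls} of {len(sample)} rows would be dropped")
```

Running the same two counts on `df` before `dropna` would show whether the missing values are concentrated in the member columns or spread across the station columns as well.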

In [5]:
# check to see if it worked
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             174952 non-null  int64  
 1   start_time               174952 non-null  object 
 2   end_time                 174952 non-null  object 
 3   start_station_id         174952 non-null  float64
 4   start_station_name       174952 non-null  object 
 5   start_station_latitude   174952 non-null  float64
 6   start_station_longitude  174952 non-null  float64
 7   end_station_id           174952 non-null  float64
 8   end_station_name         174952 non-null  object 
 9   end_station_latitude     174952 non-null  float64
 10  end_station_longitude    174952 non-null  float64
 11  bike_id                  174952 non-null  int64  
 12  user_type                174952 non-null  object 
 13  member_birth_year        174952 non-null  float64
 14  memb

#### Convert `start_station_id`, `end_station_id` and `member_birth_year` from float to int

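Because these columns originally contained null values, pandas [automatically](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) stored them as floating-point: `NaN` is itself a float, so an integer column with missing values is upcast to `float64`. With the nulls dropped, the columns can be converted back to integers.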
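
The upcast can be reproduced on a toy series. This is a small sketch rather than part of the cleaning pipeline; the series names are made up for illustration, and the nullable `Int64` dtype is shown only as an alternative that this notebook does not use.

```python
import numpy as np
import pandas as pd

# Integer ids stay int64 while every value is present.
complete = pd.Series([21, 23, 86])

# A single missing value forces the column to float64, because NaN is a float.
with_gap = pd.Series([21, np.nan, 86])

# After dropping the gap, the values can be cast back to a plain integer dtype.
restored = with_gap.dropna().astype("int64")

# Alternative (not used in this notebook): the nullable "Int64" extension dtype
# keeps the integers and represents the gap as <NA>.
nullable = with_gap.astype("Int64")

print(complete.dtype, with_gap.dtype, restored.dtype, nullable.dtype)
```

Passing an explicit `"int64"` instead of `int` also fixes the integer width, whereas `astype(int)` follows the platform default.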

In [6]:
# list with the columns to convert
int_columns = ['start_station_id', 'end_station_id', 'member_birth_year']

# iterate through the list
for column in int_columns:
    df[column] = df[column].astype(int)

In [7]:
# check to see if it worked
df[['start_station_id', 'end_station_id', 'member_birth_year']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 3 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   start_station_id   174952 non-null  int32
 1   end_station_id     174952 non-null  int32
 2   member_birth_year  174952 non-null  int32
dtypes: int32(3)
memory usage: 3.3 MB


#### Convert `user_type` from object to categorical dtype

In [8]:
# check user types available in the dataset
df['user_type'].unique()

array(['Customer', 'Subscriber'], dtype=object)

In [9]:
# turn user_type into a categorical variable

# list of user types available in the dataset
user_types = ['Customer', 'Subscriber']

# create categorical variable with the list
cat_user_types = pd.api.types.CategoricalDtype(ordered=False, categories=user_types)

# apply categorical variable to the column
df['user_type'] = df['user_type'].astype(cat_user_types)

In [10]:
# check to see if it worked
df['user_type'].unique()

['Customer', 'Subscriber']
Categories (2, object): ['Customer', 'Subscriber']

#### Convert `member_gender` from object to categorical dtype

In [11]:
# check genders available in the dataset 
df['member_gender'].unique()

array(['Male', 'Other', 'Female'], dtype=object)

In [12]:
# turn member_gender into a categorical variable

# list of genders available in the dataset
genders = ['Male', 'Other', 'Female']

# create categorical variable with the list
cat_genders = pd.api.types.CategoricalDtype(ordered=False, categories=genders)

# apply categorical variable to the column
df['member_gender'] = df['member_gender'].astype(cat_genders)

In [13]:
# check to see if it worked
df['member_gender'].unique()

['Male', 'Other', 'Female']
Categories (3, object): ['Male', 'Other', 'Female']
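
The `user_type` and `member_gender` conversions follow the same three steps, so they could be folded into one helper. The sketch below assumes a hypothetical function name, `to_categorical`, and runs on a small made-up frame rather than on `df`.

```python
import pandas as pd


def to_categorical(frame, column, categories, ordered=False):
    """Return a copy of `frame` with `column` cast to a categorical dtype."""
    dtype = pd.api.types.CategoricalDtype(categories=categories, ordered=ordered)
    result = frame.copy()
    result[column] = result[column].astype(dtype)
    return result


# Illustrative data only; the category lists match the ones found above.
sample = pd.DataFrame({
    "user_type": ["Customer", "Subscriber", "Subscriber"],
    "member_gender": ["Male", "Other", "Female"],
})

sample = to_categorical(sample, "user_type", ["Customer", "Subscriber"])
sample = to_categorical(sample, "member_gender", ["Male", "Other", "Female"])

print(sample.dtypes)
```

One thing to keep in mind with this approach: any value that is not in the `categories` list becomes `NaN` after the cast, so checking `unique()` first, as done above, is a sensible precaution.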

#### Convert `start_time` and `end_time` to datetime

In [14]:
# convert start time and end time to datetime
dates = ['start_time', 'end_time']

for column in dates:
    df[column] = pd.to_datetime(df[column])

In [15]:
# check to see if it worked
df[['start_time', 'end_time']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   start_time  174952 non-null  datetime64[ns]
 1   end_time    174952 non-null  datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 4.0 MB
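
With both columns stored as datetimes, a quick sanity check becomes possible: the elapsed time between `start_time` and `end_time` should agree with `duration_sec`. The sketch below uses rows 0 and 4 of the preview above, copied by hand; it assumes `duration_sec` holds whole seconds with the fractional part discarded, which is what those two rows suggest.

```python
import pandas as pd

# Rows 0 and 4 from the preview, copied by hand for illustration.
sample = pd.DataFrame({
    "duration_sec": [52185, 1585],
    "start_time": pd.to_datetime(
        ["2019-02-28 17:32:10.145", "2019-02-28 23:54:18.549"]
    ),
    "end_time": pd.to_datetime(
        ["2019-03-01 08:01:55.975", "2019-03-01 00:20:44.074"]
    ),
})

# Subtracting datetimes gives timedeltas; convert them to seconds.
elapsed = (sample["end_time"] - sample["start_time"]).dt.total_seconds()

# Truncate to whole seconds and compare with the recorded duration.
matches = elapsed.astype("int64") == sample["duration_sec"]

print(elapsed.tolist())
print(matches.all())
```

Applied to the full frame, rows where the comparison fails would be candidates for a closer look before any analysis of trip duration.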


#### Convert `bike_share_for_all_trip` from object to bool

In [16]:
# check which values the column currently has
df['bike_share_for_all_trip'].unique()

array(['No', 'Yes'], dtype=object)

In [17]:
# create function to make the conversion
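def convert_bool(value):
    """Map 'Yes' to True and 'No' to False; any other value returns None."""
    if value == 'Yes':
        return True
    elif value == 'No':
        return False

# apply the conversion to every value in the column
df['bike_share_for_all_trip'] = df['bike_share_for_all_trip'].apply(convert_bool)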

In [18]:
# check to see if it worked
print(df['bike_share_for_all_trip'].unique(), '\n')
df['bike_share_for_all_trip'].info()

[False  True] 

<class 'pandas.core.series.Series'>
Int64Index: 174952 entries, 0 to 183411
Series name: bike_share_for_all_trip
Non-Null Count   Dtype
--------------   -----
174952 non-null  bool 
dtypes: bool(1)
memory usage: 1.5 MB
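
The same Yes/No conversion can also be written without a custom function by passing a dictionary to `Series.map`. This is an alternative sketch on toy data, not the approach used in the cell above; the variable names are illustrative.

```python
import pandas as pd

# Dictionary describing the conversion.
yes_no = {"Yes": True, "No": False}

# Toy series standing in for the bike_share_for_all_trip column.
flags = pd.Series(["No", "Yes", "No"])
as_bool = flags.map(yes_no)

# A value missing from the dictionary becomes NaN instead of True/False,
# which makes unexpected entries easy to detect.
unexpected = pd.Series(["No", "Maybe"]).map(yes_no)

print(as_bool.tolist(), as_bool.dtype)
print(int(unexpected.isna().sum()), "unmapped value(s)")
```

Either version gives a `bool` column here because the column only contains `'Yes'` and `'No'`, as the `unique()` check confirmed.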


#### Cleaned dataset

In [19]:
# summarize the cleaned dataset
print(f"After cleaning the data, the dataset now has {df.shape[0]} rows and {df.shape[1]} columns", "\n")
df.info()

After cleaning the data, the dataset now has 174952 rows and 16 columns 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             174952 non-null  int64         
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  int32         
 4   start_station_name       174952 non-null  object        
 5   start_station_latitude   174952 non-null  float64       
 6   start_station_longitude  174952 non-null  float64       
 7   end_station_id           174952 non-null  int32         
 8   end_station_name         174952 non-null  object        
 9   end_station_latitude     174952 non-null  float64       
 10  end_station_longitude    174952 non-null  float64       
 11  bik

In [20]:
df.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,52185,2019-02-28 17:32:10.145,2019-03-01 08:01:55.975,21,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13,Commercial St at Montgomery St,37.794231,-122.402923,4902,Customer,1984,Male,False
2,61854,2019-02-28 12:13:13.218,2019-03-01 05:24:08.146,86,Market St at Dolores St,37.769305,-122.426826,3,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,5905,Customer,1972,Male,False
3,36490,2019-02-28 17:54:26.010,2019-03-01 04:02:36.842,375,Grove St at Masonic Ave,37.774836,-122.446546,70,Central Ave at Fell St,37.773311,-122.444293,6638,Subscriber,1989,Other,False
4,1585,2019-02-28 23:54:18.549,2019-03-01 00:20:44.074,7,Frank H Ogawa Plaza,37.804562,-122.271738,222,10th Ave at E 15th St,37.792714,-122.24878,4898,Subscriber,1974,Male,True
5,1793,2019-02-28 23:49:58.632,2019-03-01 00:19:51.760,93,4th St at Mission Bay Blvd S,37.770407,-122.391198,323,Broadway at Kearny,37.798014,-122.40595,5200,Subscriber,1959,Male,False


### What is the structure of your dataset?

The cleaned Ford GoBike dataset has 174,952 rows and 16 columns. The columns are:
- `duration_sec`: the duration of the trip in seconds, 
- `start_time`: date and time the trip started, 
- `end_time`: date and time the trip ended, 
- `start_station_id`: id of the station where the trip started,
- `start_station_name`: name of the station where the trip started,
- `start_station_latitude`: latitude of the station where the trip started,
- `start_station_longitude`: longitude of the station where the trip started,
- `end_station_id`: id of the station where the trip ended,
- `end_station_name`: name of the station where the trip ended,
- `end_station_latitude`: latitude of the station where the trip ended,
- `end_station_longitude`: longitude of the station where the trip ended,
- `bike_id`: id of the bike chosen for the trip,
- `user_type`: type of user, which can be either *Customer* or *Subscriber*,
- `member_birth_year`: user's year of birth,
- `member_gender`: user's gender,
- `bike_share_for_all_trip`: whether the trip was made under Bike Share for All, the discounted [membership](https://mtc.ca.gov/news/ford-gobike-model-equitable-bike-share-access-us-thanks-community-engagement) program for low-income riders

### What is the main feature of interest in your dataset?

My main interest is in understanding which factors influence the decision to take a bike trip.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Of the available columns, I expect start time, end time, the start and end station names, user type, member gender and birth year to support this investigation.

Trip counts most likely peak at certain times of day, and busier, more popular stations should also record more trips. User type, member gender and birth year will offer additional insight into who those riders are.
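
To look at time-of-day patterns and rider profiles, a few derived columns may be useful. The sketch below is one possible way to build them; the column names `start_hour`, `start_weekday` and `member_age` are hypothetical, the frame holds two rows copied from the preview, and the age is only an approximation (trip year minus birth year).

```python
import pandas as pd

# Two rows copied from the cleaned preview, for illustration only.
trips = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["2019-02-28 17:32:10.145", "2019-02-28 23:54:18.549"]
    ),
    "member_birth_year": [1984, 1974],
})

# Hour of day (0-23) and weekday name at the start of each trip.
trips["start_hour"] = trips["start_time"].dt.hour
trips["start_weekday"] = trips["start_time"].dt.day_name()

# Approximate rider age: year of the trip minus year of birth.
trips["member_age"] = trips["start_time"].dt.year - trips["member_birth_year"]

print(trips)
```

Columns like these would feed directly into the univariate plots, for example a count of trips per `start_hour`.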

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.


> **Rubric Tip**: The project (Part I alone) should have at least 15 visualizations distributed over univariate, bivariate, and multivariate plots to explore many relationships in the data set. Use reasoning to justify the flow of the exploration.



>**Rubric Tip**: Use the "Question-Visualization-Observations" framework throughout the exploration. This framework involves **asking a question of the data, creating a visualization to find answers, and then recording observations after each visualization.**




>**Rubric Tip**: Visualizations should depict the data appropriately so that the plots are easily interpretable. You should choose an appropriate plot type, data encodings, and formatting as needed. The formatting may include setting/adding the title, labels, legend, and comments. Also, do not overplot or incorrectly plot ordinal data.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

## Conclusions
>You can write a summary of the main findings and reflect on the steps taken during the data exploration.



> Remove all Tips mentioned above, before you convert this notebook to PDF/HTML


> At the end of your report, make sure that you export the notebook as an
html file from the `File > Download as... > HTML or PDF` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you remove all of
the quote-formatted guide notes like this one before you finish your report!

