In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# (Visualization of Ford GoBike Data 2017)
## by (Mostafa Elmehy)
<br>
<br>
Ford GoBike is a regional public bicycle sharing system in the San Francisco Bay Area, California. Beginning operation in August 2013 as Bay Area Bike Share, the Ford GoBike system currently has over 2,600 bicycles in 262 stations across San Francisco, East Bay and San Jose. On June 28, 2017, the system officially launched as Ford GoBike in a partnership with Ford Motor Company.

Ford GoBike, like other bike share systems, consists of a fleet of specially designed, sturdy and durable bikes that are locked into a network of docking stations throughout the city. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. The bikes are available for use 24 hours/day, 7 days/week, 365 days/year and riders have access to all bikes in the network when they become a member or purchase a pass.


In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [None]:
df = pd.read_csv("../input/ford-gobike-data/2017-fordgobike-tripdata.csv")

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
df["user_type"].value_counts()

In [None]:
df["member_gender"].value_counts()

In [None]:
df.columns

In [None]:
df['start_time']=pd.to_datetime(df['start_time'])

In [None]:
df['end_time']=pd.to_datetime(df['end_time'])

In [None]:
df.info()

### What is the structure of your dataset?

There are 519700 trips in the dataset with 15 features ('duration_sec', 'start_time', 'end_time', 'start_station_id',
 'start_station_name', 'start_station_latitude','start_station_longitude', 'end_station_id', 'end_station_name',
 'end_station_latitude', 'end_station_longitude', 'bike_id', 'user_type','member_birth_year', 'member_gender'). Most variables are numeric. user_type and member_gender are categorical, start_time and end_time are timestamps (converted to datetime above), and the two station-name columns are text.
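
One way to check this kind of split is to group the columns by dtype. The sketch below is only an illustration on a tiny made-up frame (the `sample` name and its values are assumptions, not rows from the real file); it assumes the two categorical columns are stored with the pandas `category` dtype.

```python
import pandas as pd

# Tiny made-up sample using a few of the column names listed above.
# The values are illustrative only and do not come from the dataset.
sample = pd.DataFrame({
    "duration_sec": [600, 1250, 430],
    "start_time": pd.to_datetime(
        ["2017-10-02 08:05", "2017-10-07 14:30", "2017-11-15 17:45"]
    ),
    "user_type": ["Subscriber", "Customer", "Subscriber"],
    "member_gender": ["Male", None, "Female"],
})

# Store the two categorical variables with the 'category' dtype.
for col in ["user_type", "member_gender"]:
    sample[col] = sample[col].astype("category")

# Group the column names by kind of dtype.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(include="category").columns.tolist()
datetime_cols = sample.select_dtypes(include="datetime").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print("datetime:", datetime_cols)
```

The same three `select_dtypes` calls could be run on `df` once the categorical columns have been converted.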


### What is/are the main feature(s) of interest in your dataset?

The main interest is to find out which time of the year is the busiest and which gender and user type are the most common.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Start and end time, user type, and member gender.

## Univariate Exploration


In [None]:
c=sb.color_palette()[0]

In [None]:
data = df["user_type"].value_counts()
# value_counts() already holds the numbers to plot, so data=df is not needed
plt.pie(data,labels=data.index,autopct='%1.1f%%',shadow=True);
plt.title("user type");

Most of the users are subscribers.

In [None]:
sb.countplot(data=df,x="member_gender",color=c);
plt.xlabel("")
plt.title("member gender")

Most of the users who reported a gender are male.

In [None]:
months = df['start_time'].dt.month
order = list(range(1,13))
sb.countplot(x=months,order=order,color=c)
plt.xlabel("Months")
# countplot puts the 12 categories at positions 0..11,
# so position 4 is May and position 11 is December
plt.xlim([3.5,11.5])
plt.xticks(range(4,12),["may","june","jul","aug","sep","oct","nov","dec"]);
plt.title("busiest month");

October is the busiest month.
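
The bar chart can be backed by a direct count. The following is a minimal sketch on made-up timestamps (the `start_time` series below is an assumption for illustration, not data from the file); it counts trips per month name and picks the largest.

```python
import pandas as pd

# Made-up start times, only to illustrate the counting step (not real results).
start_time = pd.Series(pd.to_datetime([
    "2017-08-03 09:00", "2017-09-12 18:10",
    "2017-10-02 08:05", "2017-10-07 14:30", "2017-10-19 17:20",
    "2017-11-15 17:45",
]))

# Count trips per month name, then take the month with the highest count.
trips_per_month = start_time.dt.month_name().value_counts()
busiest_month = trips_per_month.idxmax()

print(trips_per_month)
print("Busiest month:", busiest_month)
```

On the notebook's data, the same two lines could be applied to `df['start_time']`.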

In [None]:
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
sb.countplot(data=df,x=df['start_time'].dt.day_name(),color=c,order=weekday);
plt.title("busiest weekday")
plt.xlabel("days")
plt.xticks(rotation=20);

Wednesday and Tuesday are the busiest days.

In [None]:
sb.countplot(data=df,x=df['start_time'].dt.hour,color=c);
plt.title("busiest hours")
plt.xlabel("hours")
plt.xticks(rotation=20);

As expected, the distribution is bimodal. The two peaks fall in the normal rush hours of the day.
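
To locate the two peaks numerically rather than by eye, one option is to look for the busiest hour before and after midday. This is a rough sketch on made-up hour values (the `hours` series and the split at 12:00 are assumptions chosen for illustration).

```python
import pandas as pd

# Made-up start hours (0-23), for illustration only.
hours = pd.Series([8, 8, 8, 9, 12, 17, 17, 17, 17, 18, 23])

# Trips per hour, sorted by hour of day.
counts = hours.value_counts().sort_index()

# Assumption: one peak lies before midday and one after it.
morning_peak = counts[counts.index < 12].idxmax()
evening_peak = counts[counts.index >= 12].idxmax()

print("Morning peak hour:", morning_peak)
print("Evening peak hour:", evening_peak)
```

With the real data, `hours` would be `df['start_time'].dt.hour`.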

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Most of the distributions were as expected, from the majority of users being subscribers to the busiest hours falling in the rush hours of the day. We also found that the busiest month is October.

## Bivariate Exploration


In [None]:
plt.scatter(data = df , x = df['start_time'].dt.dayofweek , y='duration_sec',alpha=1/10)
plt.title('Duration Of Trip Per Day')
plt.xlabel('Days Of The Week')
plt.ylabel('Duration Of Trip In Seconds');

The distribution is not as expected. Thursday seems to be the day when people take the longest trips, and the weekend days appear to be the days when people take the shortest trips.
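
A scatter plot mostly draws the eye to extreme trips, so a grouped summary may help check this reading. The sketch below uses a small made-up frame (`trips` and its numbers are assumptions, not results from the dataset) to show how the mean, median and maximum per weekday can tell different stories.

```python
import pandas as pd

# Made-up trips: two on a Monday, three on a Thursday, two on a Saturday.
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2017-10-02 08:00", "2017-10-02 17:30",                      # Monday
        "2017-10-05 08:10", "2017-10-05 12:00", "2017-10-05 18:00",  # Thursday
        "2017-10-07 11:00", "2017-10-07 15:00",                      # Saturday
    ]),
    "duration_sec": [500, 700, 400, 600, 5000, 1500, 1700],
})

trips["weekday"] = trips["start_time"].dt.day_name()

# One row per weekday with the mean, median and longest trip.
summary = trips.groupby("weekday")["duration_sec"].agg(["mean", "median", "max"])
print(summary)
```

In this toy example a single very long Thursday trip dominates the maximum, while the median is highest on Saturday. Running the same `groupby` on `df` would show which pattern holds in the real data.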

In [None]:
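# df_ is another name for df (not a copy), so the new column is added to df as well
df_ = df
df_["hour"]=df['start_time'].dt.hour
# count trips per (hour, user type) and name the count column explicitly
data=df_.groupby(["hour","user_type"]).size().reset_index(name="count")
plt.figure(figsize=(8, 5))
sb.pointplot(data=data,x="hour",hue="user_type",y="count")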
plt.title("Hours vs user type");
plt.ylabel("count");

In the first hours of the day there is no big difference, but after that, especially in the rush hours, the difference becomes very large.

In [None]:
# count trips per (hour, gender) and name the count column explicitly
data=df_.groupby(["hour","member_gender"]).size().reset_index(name="count")
plt.figure(figsize=(8, 5))
sb.pointplot(data=data,x="hour",hue="member_gender",y="count")
plt.title("Hours vs member gender");
plt.ylabel("count");

In the first hours of the day there is no big difference, but afterwards it becomes very obvious that males make up most of the users.

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

It became very clear that males and subscribers make up most of the users, and the rush hours are clearly visible in the graphs.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The relationship between the rush hours and the number of bike rides.

## Multivariate Exploration


In [None]:
xbin = np.arange(0,7+1,1)
ybin = np.arange(0,4200+250,250)
df_new = df
df_new["days"] = df_new['start_time'].dt.dayofweek
g = sb.FacetGrid (data = df_new , col='user_type', height=5)
g.map(plt.hist2d, 'days', 'duration_sec', cmin=0.5, cmap = 'inferno_r', bins=[xbin,ybin])
plt.colorbar();


From this heatmap we can see similarities between customers and subscribers. Both mostly take rides of about 4 to 12 minutes during the week. However, customers are not only more present during the weekend, they take longer trips as well.

In [None]:
plt.figure(figsize=(8, 5))
ax = sb.pointplot(data=df_new, x='days' , y='duration_sec', hue='user_type')
plt.xlabel('Days Of The Week')
plt.ylabel('Trip Duration in Seconds')
plt.title('Average Trip Duration During The Week');

This graph is as expected: trips are longer on weekends than on weekdays.
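
The weekday versus weekend comparison can also be put into a small table. This is a minimal sketch on made-up rows (`trips`, the `is_weekend` helper column and all values are assumptions for illustration); it assumes `days` follows the pandas `dayofweek` convention, where Monday is 0 and Sunday is 6.

```python
import pandas as pd

# Made-up trips; 'days' uses the dayofweek convention (Monday=0 ... Sunday=6).
trips = pd.DataFrame({
    "days": [0, 2, 5, 6, 1, 5],
    "user_type": ["Subscriber", "Customer", "Customer",
                  "Subscriber", "Subscriber", "Customer"],
    "duration_sec": [600, 1200, 2400, 800, 500, 2000],
})

# Hypothetical helper column: True for Saturday (5) and Sunday (6).
trips["is_weekend"] = trips["days"] >= 5

# Mean trip duration for each (weekend flag, user type) combination.
table = trips.pivot_table(
    index="is_weekend", columns="user_type",
    values="duration_sec", aggfunc="mean",
)
print(table)
```

Applied to `df_new`, the same pivot would give the numbers behind the point plot, split into two rows instead of seven.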

In [None]:
plt.figure(figsize=(8, 5))
ax = sb.barplot(data=df_new, x='days' , y='duration_sec', hue='member_gender')
plt.xlabel('Days Of The Week')
plt.ylabel('Trip Duration in Seconds')
plt.title('Average Trip Duration During The Week');

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

This section extended the earlier exploration by using different visuals and adding more variables to each comparison. Plotting a heatmap of trips by day of the week and duration for each user type shed new light on when each user group uses the bike sharing system. The plots also suggest that customers and female riders spend more time on their trips than the others.