# Ford GoBike Data Exploration
> Udacity Project
## by Mahmoud Hesham


> The dataset contains Ford GoBike bike-share trips from February 2019.

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [None]:
df = pd.read_csv('../input/ford-gobike-2019feb-tripdata/201902-fordgobike-tripdata.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.shape  # 183412 rows (observations)

In [None]:
# check for duplicated rows
df.duplicated().sum()

In [None]:
df.isna().sum()

# Data Wrangling 

In [None]:
#changing/fixing types
df['user_type'] = df['user_type'].astype('category')

df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

df['bike_share_for_all_trip'] = (df['bike_share_for_all_trip'] == 'Yes')

# The ID columns are stored as floats, so converting them to strings was considered:
# df['start_station_id'] = df['start_station_id'].astype('str')
# df['end_station_id'] = df['end_station_id'].astype('str')
# df['bike_id'] = df['bike_id'].astype('str')
# They are kept numeric because the plots below need them as numerical values.

In [None]:
df = df.dropna()   # drop rows with null values: only about 8K out of ~180K rows
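
As a rough check of how much data this drop removes, the two row counts reported in this notebook (183412 before, 174952 after) can be compared directly. The sketch below is only arithmetic on those two numbers; the variable names are my own.

```python
rows_before = 183412  # row count reported by df.shape before dropna()
rows_after = 174952   # trips remaining after dropna()

dropped = rows_before - rows_after
share = dropped / rows_before  # fraction of the original rows that were removed

print(f"{dropped} rows dropped ({share:.1%} of the original data)")
```

With under 5% of rows removed, dropping them rather than imputing seems a reasonable simplification here.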

In [None]:
df.info()         # now we are ready with our dataset :)

In [None]:
df.describe()

In [None]:
df.head()

### What is the structure of your dataset?

> After cleaning, the dataset holds 174952 Ford GoBike trips with 16 columns (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, bike_share_for_all_trip).
Given the type conversions above, 9 columns are numerical, 3 are object type (string), 2 are datetime, 1 is boolean and 1 is category.
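
A type breakdown like this can be checked by counting columns per dtype. The sketch below uses a tiny made-up frame that mimics only a few of the columns (the values are illustrative, not taken from the dataset); on the real data the same two calls would be applied to `df`.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this example
toy = pd.DataFrame({
    'duration_sec': [520, 1800],
    'start_time': pd.to_datetime(['2019-02-01 08:00:00', '2019-02-01 09:30:00']),
    'start_station_name': ['Station A', 'Station B'],
    'user_type': pd.Series(['Subscriber', 'Customer'], dtype='category'),
    'bike_share_for_all_trip': [True, False],
})

# Number of columns for each dtype
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Number of numerical columns only (booleans are not counted as numbers)
n_numeric = toy.select_dtypes(include='number').shape[1]
print(n_numeric)
```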

### What is/are the main feature(s) of interest in your dataset?

> I am interested in finding out how trip duration depends on the other features.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> I expect trip duration to depend strongly on the start and end stations, since crowded places should receive more rides. I also expect age, gender and user_type to have an effect on trip duration.

## Univariate Exploration


In [None]:
plt.figure(figsize=[10, 6])
plt.hist(data = df, x = 'duration_sec', bins = np.arange(0, df['duration_sec'].max()+500, 500))
plt.title('Distribution of Trip Durations')
plt.xlabel('Duration (sec)')
plt.ylabel('Number of Trips')
plt.axis([-500, 10000, 0, 90000])
plt.show()

In [None]:
# use a log scale on the x-axis since the distribution has a long right tail
bins_log = 10 ** np.arange(2.4, np.log10(df['duration_sec'].max()) + 0.05, 0.05)
plt.figure(figsize=[10, 6])
plt.hist(data = df, x = 'duration_sec', bins = bins_log)
plt.title('Distribution of Trip Durations')
plt.xlabel('Duration (sec)')
plt.ylabel('Number of Trips')
plt.xscale('log')
plt.xticks([500, 1e3, 2e3, 5e3, 1e4], ['500', '1k', '2k', '5k', '10k'])
plt.axis([250, 10000, 0, 15000])  # a log axis cannot start at 0, so start near the first bin edge
plt.show()

> The distribution is right-skewed: most trips are shorter than 2k seconds and the peak is around 600 seconds. The number of trips per bin rises from around 8000 to around 12000 at 600 seconds, then starts to fall.
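
To show how the log-spaced bins behave without the plot, here is a minimal sketch on synthetic data. The durations are drawn from a log-normal distribution that I chose only for illustration; they are not the real trips. The bin edges are built the same way as in the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed durations in seconds (NOT the real data)
durations = rng.lognormal(mean=np.log(600), sigma=0.7, size=10_000)

# Bin edges evenly spaced in log10 units, so each bin is a constant factor wider
step = 0.05
edges = 10 ** np.arange(2.4, np.log10(durations.max()) + step, step)
counts, _ = np.histogram(durations, bins=edges)

# Geometric centre of the fullest bin
i = counts.argmax()
peak = np.sqrt(edges[i] * edges[i + 1])
print(f"fullest bin is centred near {peak:.0f} s")
```

Because every edge is the previous edge times `10 ** 0.05`, the bins look equally wide once the x-axis is log-scaled, which is what makes the peak readable.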

#### next.. investigating start station and end station

In [None]:
plt.figure(figsize=[20, 8])
plt.hist(data = df, x = 'start_station_id', bins = np.arange(0, df['start_station_id'].astype(float).max()+2, 2))
plt.xticks(np.arange(0, 410, 10))
plt.title('Distribution of Start Stations')
plt.xlabel('Start Station')
plt.ylabel('Number of Trips')
plt.show()

In [None]:
plt.figure(figsize=[20, 8])
plt.hist(data = df, x = 'end_station_id', bins = np.arange(0, df['end_station_id'].astype(float).max()+2, 2))
plt.xticks(np.arange(0, 410, 10))
plt.title('Distribution of End Stations')
plt.xlabel('End Station')
plt.ylabel('Number of Trips')
plt.show()

> Both graphs look nearly the same, so the stations that are most frequent as start stations are also the most frequent as end stations.

#### next.. investigating age

In [None]:
plt.figure(figsize=[8, 5])
plt.hist(data = df.dropna(), x = 'member_birth_year', bins = np.arange(0, df['member_birth_year'].astype(float).max()+1, 1))
plt.axis([1939, 2009, 0, 12000])
plt.xticks([1939, 1949, 1959, 1969, 1979, 1989, 1999, 2009], [(2019-1939), (2019-1949), (2019-1959), (2019-1969), (2019-1979), (2019-1989), (2019-1999), (2019-2009)])
plt.gca().invert_xaxis()
plt.title('Distribution of User Age')
plt.xlabel('Age')
plt.ylabel('Number of Users')
plt.show()

# I used 2019 as a reference for age as it is the year of the dataset 

> Most users are between 20 and 40 years old.

#### next.. investigating User_type

In [None]:
user_type=df['user_type'].value_counts()
plt.figure(figsize=(8, 8), dpi= 80, facecolor='w', edgecolor='k')
plt.pie(user_type,labels=user_type.index,autopct= '%1.1f%%')
plt.title('User Type')

>Most users are Subscribers 

#### next.. investigating Gender

In [None]:
gender=df['member_gender'].value_counts()
plt.figure(figsize=(8, 8), dpi= 80, facecolor='w', edgecolor='k')
plt.pie(gender,labels=gender.index,autopct= '%1.1f%%')
plt.title('Gender')

> Most users are male.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> The trip duration distribution has a long tail, so I applied a log transform to the x-axis. The peak occurs at about 600 seconds; after that the distribution declines and does not reach another peak.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> member_birth_year was used to derive age, taking 2019 (the year of the dataset) as the reference.
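
The age conversion can be sketched as follows. The birth years here are made up for illustration, and `REFERENCE_YEAR` is a name I introduce for the 2019 reference used throughout this notebook.

```python
import pandas as pd

REFERENCE_YEAR = 2019  # year of the dataset, used as the age reference

# Made-up birth years, stored as floats like the real column
birth_years = pd.Series([1984.0, 1999.0, 1960.0, 1993.0], name='member_birth_year')

# Age in whole years
age = (REFERENCE_YEAR - birth_years).astype(int)

# Share of riders aged 20 to 40 (both ends inclusive)
share_20_to_40 = age.between(20, 40).mean()
print(age.tolist(), share_20_to_40)
```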

## Bivariate Exploration


#### investigating start station and trip duration

In [None]:
def barplot_vs_trip_duration(s, df):
    # total trip duration per station, converted from seconds to hours
    total_duration = df.groupby(s)['duration_sec'].sum() / 3600
    # keep the 30 stations with the largest totals
    df_sample = (total_duration.sort_values(ascending=False)
                 .head(30)
                 .rename('total_duration')
                 .reset_index())
    plt.figure(figsize = [20, 8])
    base_color = sb.color_palette()[9]
    sb.barplot(x = df_sample[s].astype(int), y = df_sample['total_duration'], color=base_color)
    plt.xlabel(s.upper())
    plt.ylabel('Total Duration in hours')
    # drop the trailing '_ID' from the column name for the title
    plt.title('Top 30 ' + s.upper()[:-3] + ' in Trip duration')
    plt.show()
# this works for start and end stations :)

In [None]:
barplot_vs_trip_duration('start_station_id',df)

#### and of course investigating end station and trip duration

In [None]:
barplot_vs_trip_duration('end_station_id',df)

> The same station can rank differently in the two graphs, so we can tell which stations are the starting points of longer trips and which stations are where longer trips end.
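
One caveat when reading these bars: total duration mixes the number of trips with how long each trip is. A minimal sketch on made-up numbers (two hypothetical stations, not real data) shows how a busy station with short trips can outrank a quiet station with long ones.

```python
import pandas as pd

# Made-up trips: station 3 has many short trips, station 5 has one long trip
toy = pd.DataFrame({
    'start_station_id': [3, 3, 3, 3, 5],
    'duration_sec': [600, 600, 600, 600, 1800],
})

# Total, count and mean per station side by side
summary = toy.groupby('start_station_id')['duration_sec'].agg(
    total_sec='sum', trips='count', mean_sec='mean'
)
print(summary)
```

Here station 3 leads on total duration while station 5 leads on mean duration, so the mean may be the better measure when the question is about longer trips.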

#### Investigating Gender and trip duration

In [None]:
plt.figure(figsize = [10, 6])
base_color = sb.color_palette()[9]
sb.boxplot(data = df, x = 'member_gender', y = 'duration_sec', color = base_color)
plt.ylim([0, 2000])
plt.xlabel('Gender')
plt.ylabel('Duration (sec)')
plt.title('Gender and Trip duration')
plt.show()

> Surprisingly, the Female and Other groups have longer trip durations than males, even though males make up about 75% of users.
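
The comparison read off the box plot can also be expressed as numbers: the median duration per gender next to each gender's share of trips. The sketch below uses invented values purely to show the calls; it does not reproduce the real medians.

```python
import pandas as pd

# Invented trips for illustration only
toy = pd.DataFrame({
    'member_gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Other'],
    'duration_sec': [400, 500, 600, 550, 650, 700],
})

# Median trip duration per gender (the line inside each box of a box plot)
median_by_gender = toy.groupby('member_gender')['duration_sec'].median()

# Share of trips per gender
share_by_gender = toy['member_gender'].value_counts(normalize=True)

print(median_by_gender)
print(share_by_gender)
```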

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> The start and end stations do not have a noticeable effect on trip duration.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Trip duration shows a clear dependence on gender.

## Multivariate Exploration


#### Investigating Gender, Age and Trip duration

In [None]:
gender_markers = [['Male', 's'],['Female', 'o'],['Other', (5,1)]]
plt.figure(figsize = [12, 8])
for gender, marker in gender_markers:
    df_gender = df[df['member_gender'] == gender]
    plt.scatter((2019 - df_gender['member_birth_year']), df_gender['duration_sec'], marker = marker, alpha=0.35)
plt.legend(['Male','Female','Other'])
plt.axis([10, 90, 0, 9000 ])
plt.xlabel('Age (year)')
plt.ylabel('Duration (sec)')
plt.title('Gender, Age and Trip duration relation')
plt.show()

#### Let's separate the genders for more clarity

In [None]:
df['age'] = (2019 - df['member_birth_year'])
# FacetGrid creates its own figure; `height` replaces the older `size` argument
genders = sb.FacetGrid(data = df, col = 'member_gender', col_wrap = 3, height = 6, xlim = [10, 90], ylim = [0, 9000])
genders.map(plt.scatter, 'age', 'duration_sec', alpha=0.2)
genders.set_xlabels('Age (year)')
genders.set_ylabels('Duration (sec)')
# figure-level title; plt.title would only label the last panel
genders.fig.suptitle('Age and Trip Duration for the 3 Genders', y = 1.03)
plt.show()

> We can notice an increase in trip duration for riders in the Other group who are around 55 years old.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> The Other group shows higher trip durations at around 55 years of age.

### Were there any interesting or surprising interactions between features?

> Gender and age together have an effect on trip duration.