### In this session we are going to:  

1. Clean the data
2. Look back at the business task
3. Form questions based on the data at hand
4. Visualize the data and draw conclusions

(Note: To spare my poor laptop, I won't show the output of every cell; I will only run the code needed to follow along. You are welcome to use the dataset provided and run the code yourself. Comments and suggestions are welcome too!)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import glob

In [None]:
all_files = glob.glob(os.path.join("../input/bike-share-case-study", "*.csv"))
df = pd.concat((pd.read_csv(f) for f in all_files))

In [None]:
df.head()

---

### 1. Data cleansing

First, we drop rows with empty cells.

In [None]:
df.dropna(inplace=True)
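Before dropping rows, it can help to see how much data is affected. The following is a minimal sketch on a made-up three-row frame (the column values are hypothetical, not taken from the real dataset) that counts missing cells per column and the number of rows `dropna` removes:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real frame is `df`, loaded from the CSV files.
sample = pd.DataFrame({
    "ride_id": ["a", "b", "c"],
    "start_station_name": ["X", np.nan, "Z"],
    "end_station_name": ["Y", "W", np.nan],
})

# Number of missing cells in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of rows lost when every row with at least one empty cell is dropped.
rows_before = len(sample)
cleaned = sample.dropna()
rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

If most of the missing cells sit in columns that are not used later, passing `subset=[...]` to `dropna` would keep more rows.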

Then we separate the date from the time.

In [None]:
df[['started_date', 'started_time']] = df['started_at'].str.split(' ', n=1, expand=True)  # n must be a keyword argument in pandas 2.x

In [None]:
df[['end_date', 'end_time']] = df['ended_at'].str.split(' ', n=1, expand=True)  # n must be a keyword argument in pandas 2.x

In [None]:
df = df.drop(['started_at', 'ended_at'], axis=1)
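As an alternative to string splitting, the timestamps could be parsed into real datetimes, which keeps the date and time parts available through the `.dt` accessor. This is only a sketch on a made-up two-row frame; the sample values are hypothetical, and it assumes the columns look like `YYYY-MM-DD HH:MM:SS`:

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the CSV files loaded earlier.
sample = pd.DataFrame({
    "started_at": ["2021-06-01 08:15:00", "2021-06-05 17:40:30"],
    "ended_at": ["2021-06-01 08:45:00", "2021-06-05 18:10:30"],
})

# Parse the strings into datetime64 values.
started = pd.to_datetime(sample["started_at"])
ended = pd.to_datetime(sample["ended_at"])

# Date and time parts are available without splitting strings.
sample["started_date"] = started.dt.date
sample["started_time"] = started.dt.time

# Ride duration in whole minutes, computed directly from the timestamps.
sample["minutes"] = (ended - started).dt.total_seconds() // 60
print(sample[["started_date", "started_time", "minutes"]])
```

With parsed timestamps, the ride duration can be recomputed directly rather than read from a pre-computed text column.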

Next, since we are not using the location data, we are dropping it to save memory

In [None]:
df = df.drop(['start_lat', 'start_lng', 'end_lat', 'end_lng'], axis=1)

Going further, we find odd '###' values in our ride_length column. Let's remove them, then separate the column into hour and minute columns.

In [None]:
df = df[df.ride_length != '#######################################################################################################################################################################################################']

In [None]:
df = df[df.ride_length != '###############################################################################################################################################################################################################################################################']
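The two cells above match two specific strings of `#` characters. A more general filter, sketched below on hypothetical sample values, drops any cell made up entirely of `#` regardless of its length. It assumes missing values were already removed, as done earlier with `dropna`:

```python
import pandas as pd

# Hypothetical sample: two valid "H:MM" values and two hash-only cells of different widths.
sample = pd.DataFrame({"ride_length": ["0:34", "#" * 10, "1:05", "#" * 25]})

# True where the whole cell consists of one or more '#' characters.
is_hash_only = sample["ride_length"].str.fullmatch(r"#+")

# Keep only the rows whose ride_length is not a hash-only cell.
sample = sample[~is_hash_only]
print(sample)
```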

In [None]:
df[['ridelength_hour', 'ridelength_minute']] = df['ride_length'].str.split(':', n=1, expand=True)  # n must be a keyword argument in pandas 2.x

In [None]:
df = df.drop(['ride_length'], axis=1)

To calculate the mean of ride_length, we need to convert it to an integer data type.

In [None]:
df.astype({'ridelength_hour': 'int64'}).dtypes

In [None]:
ridelength_hour_list = df.ridelength_hour.tolist()

In [None]:
hour_list = [int(s) for s in ridelength_hour_list]

In [None]:
hour_to_minute_list = [x * 60 for x in hour_list]

In [None]:
df['hour_to_minute'] = hour_to_minute_list

In [None]:
ridelength_minute_list = df.ridelength_minute.tolist()
minute_list = [int(s) for s in ridelength_minute_list]

In [None]:
df['minute_to_int'] = minute_list

In [None]:
df = df.drop(['ridelength_hour', 'ridelength_minute'], axis=1)

In [None]:
sum_list = []
for (x, y) in zip(minute_list, hour_to_minute_list):
    sum_list.append(x + y)

In [None]:
filtered = []
for x in sum_list:
    if x >= 60:
        filtered.append(x)

In [None]:
df['ridelength_in_minutes'] = sum_list

In [None]:
df = df.drop(['hour_to_minute', 'minute_to_int'], axis=1)
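The list-based conversion above can also be written with column-wise pandas operations, which avoids the intermediate lists and helper columns. A minimal sketch, assuming `ride_length` holds strings in the same `H:MM` form the cells above rely on (the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical sample in the "H:MM" format assumed by the cells above.
sample = pd.DataFrame({"ride_length": ["0:34", "1:05", "12:00"]})

# Split once on ':' and convert both parts to integers in one step.
parts = sample["ride_length"].str.split(":", n=1, expand=True).astype("int64")

# Total minutes = hours * 60 + minutes, computed for the whole column at once.
sample["ridelength_in_minutes"] = parts[0] * 60 + parts[1]
print(sample)
```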

#### Now our dataset is clean and tidy, and ready for analysis!

----

### 2. Basic calculations and summary

In [None]:
df.to_csv('df_clean.csv', index = False)

In [None]:
df_cleaned = pd.read_csv("../input/cleaned-data/df_clean.csv")

Here I'm just exporting the cleaned file and then importing it back, so that I don't need to rerun every cell each time I reopen the notebook.

In [None]:
df_cleaned.head()

---

#### Before the calculation starts, here's a quick look back at our "Business task"

Three questions will guide the future marketing program for the bike-share company:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
<br>
<br>

We are responsible for the first question: How do annual members and casual riders use Cyclistic bikes differently? And we are going to answer it by dividing it into the **following 3 questions**:

1. How does ride_length differ between casual riders and members? Why is there a difference? (Can't be done yet, so we leave it for now)

2. When does each type ride most? Weekdays? Weekends? Why is there a difference?

3. Who rides electric bikes more? Why?

-----

#### First: How does ride_length differ?

In [None]:
df_casual = df_cleaned.loc[df_cleaned['member_casual'] == 'casual']

In [None]:
df_member = df_cleaned.loc[df_cleaned['member_casual'] == 'member']

In [None]:
df_member.describe()

In [None]:
df_casual.describe()

#### Answer : 
Casual riders have an average ride length of about 34 minutes, while members average only about 14 minutes.
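The same comparison can be read off a single table instead of two separate `describe()` outputs. The sketch below uses a hypothetical four-row sample in place of `df_cleaned`, so the numbers are illustrative only:

```python
import pandas as pd

# Hypothetical sample; the real frame is df_cleaned.
sample = pd.DataFrame({
    "member_casual": ["casual", "casual", "member", "member"],
    "ridelength_in_minutes": [40, 28, 10, 18],
})

# Mean, median and number of rides per rider type, side by side.
summary = sample.groupby("member_casual")["ridelength_in_minutes"].agg(
    ["mean", "median", "count"]
)
print(summary)
```

Including the median alongside the mean shows whether a few very long rides are pulling the average up.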

--------

#### Second: When does each type ride?

In [None]:
df_member.groupby(by=["day_of_week"]).sum(numeric_only=True)  # sum only the numeric columns

In [None]:
df_casual.groupby(by=["day_of_week"]).sum(numeric_only=True)  # sum only the numeric columns

In [None]:
casual_count = [12833150, 6314262, 5650729, 5762943, 5755590, 8193701, 14747435]

In [None]:
member_count = [4703186, 4145138, 4445319, 4767769, 4443507, 4731885, 5352431]

In [None]:
day_of_week = [1, 2, 3, 4, 5, 6, 7]
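Rather than typing per-day values in by hand, lists like these can be built from a groupby result. This is a sketch under assumptions: the sample frame is hypothetical, `day_of_week` is assumed to be coded 1 to 7, and the names `casual_totals` and `member_totals` are illustrative rather than part of the notebook:

```python
import pandas as pd

# Hypothetical sample; the real frame is df_cleaned.
sample = pd.DataFrame({
    "member_casual": ["casual", "casual", "member", "member", "casual"],
    "day_of_week": [1, 7, 1, 2, 1],
    "ridelength_in_minutes": [30, 45, 10, 12, 20],
})

# One row per day, one column per rider type; days with no rides are filled with 0.
totals = (
    sample.groupby(["day_of_week", "member_casual"])["ridelength_in_minutes"]
    .sum()
    .unstack(fill_value=0)
    .reindex(range(1, 8), fill_value=0)
)

# Plain lists, ordered from day 1 to day 7, ready to pass to plt.bar.
casual_totals = totals["casual"].tolist()
member_totals = totals["member"].tolist()
print(totals)
```

Swapping `.sum()` for `.size()` on the grouped object would give the number of rides per day instead of the total minutes.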

In [None]:
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

In [None]:
figure(figsize=(12, 6), dpi=80)

plt.bar(day_of_week, casual_count)
plt.bar(day_of_week, member_count) 
plt.xlabel("Day of Week")
plt.ylabel("Ride count")
plt.title("Ride pattern of members and casual riders")
plt.legend(['Casual riders', 'Members'])
plt.show()

#### Answer : 
Member rides are evenly distributed during the week, while casual riders ride more often on weekends.

--------------

#### Last but not least: What type of bikes does each group ride more?

In [None]:
df_casual.groupby(by=['rideable_type', 'day_of_week']).sum(numeric_only=True)  # sum only the numeric columns

In [None]:
df_member.groupby(by=['rideable_type', 'day_of_week']).sum(numeric_only=True)  # sum only the numeric columns

We want to plot ridelength_in_minutes against day_of_week, with rideable_type as the hue and one panel per rider type, so we are going to use seaborn instead of matplotlib.

In [None]:
member_casual_sum = pd.read_csv("../input/member-casual-sum/member_casual_sum.csv")

The cell above loads the summed values for both casual riders and members, combined into one CSV file.

In [None]:
member_casual_sum.head()
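How that CSV file was prepared is not shown here. One way to build a comparable table directly from the cleaned data is sketched below; the sample rows and the `rideable_type` values are hypothetical, and the column names are assumed to match those used in the plotting cell:

```python
import pandas as pd

# Hypothetical sample; the real frame is df_cleaned.
sample = pd.DataFrame({
    "member_casual": ["casual", "casual", "member", "member"],
    "rideable_type": ["docked_bike", "docked_bike", "electric_bike", "docked_bike"],
    "day_of_week": [1, 1, 1, 2],
    "ridelength_in_minutes": [30, 20, 10, 15],
})

# Total minutes per rider type, bike type and day, kept as regular columns
# (as_index=False) so the result can be passed straight to a plotting function.
summary = sample.groupby(
    ["member_casual", "rideable_type", "day_of_week"], as_index=False
)["ridelength_in_minutes"].sum()
print(summary)
```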

In [None]:
import seaborn as sns

In [None]:
plot = sns.catplot(x = "day_of_week", y = "ridelength_in_minutes", hue = "rideable_type", col = "member_casual", data = member_casual_sum, kind="bar", height=4, aspect=1.5)
plot.fig.subplots_adjust(top = 0.8)
plot.fig.suptitle('What bikes do they ride?')

#### Answer: 
Although docked bikes are the most used by both members and casual riders, the gap between docked bikes and the other bike types is much greater among casual riders than among members.

----

### Conclusion
1. Casual riders tend to ride longer than members, by about 20 minutes on average
2. Casual riders tend to ride more on weekends, whereas members' rides are evenly distributed throughout the week
3. Casual riders ride docked bikes much more than members do!