# Google Data Analyst Professional: 

### Case Study: Cyclist Bike Share


### The following is a case study done as part of Google's Data Analyst Professional certification. 

**Note:**  While the course does not include Python, I've decided to use it for this case study.  Some of the steps were done more for fun than for time efficiency.  The case study includes some guided steps that I diverge from slightly.  I acknowledge that there would have been faster or more efficient methods if this were a real assignment. 

Additionally, this document is reflective of the process as outlined in the course, not as a final stakeholder deliverable.

### Intro Summary: 

We work for the marketing department of a bike share company in Chicago.  The company has three payment options: single rides, full-day passes, and annual memberships.  Annual membership holders are considered members, while all others are considered casual riders.  Until now, the company's marketing strategy has been based on building awareness and appealing to a broad audience.  The marketing director, Moreno, believes that the future of the company lies in turning more casual riders into members and wants to focus the marketing strategy accordingly, but requires approval from the executive team.



#### We are assigned the question:

How do annual members and casual riders use Cyclistic bikes differently?
    
Following the **ask, prepare, process, analyze, share and act** data analysis process you are to deliver a report including:
- A clear statement of the business task
- A description of all data sources used
- Documentation of any cleaning or manipulation of data
- A summary of your analysis
- Supporting visualizations and key findings
- Your top three recommendations based on your analysis

# ASK

What problem are we trying to solve and how will it drive business decisions?

#### Key Stakeholders:
- Moreno, Director of Marketing and your boss, is certain the key to the company's future is turning casual riders into members.

- The executive team who will be deciding whether to move forward with the marketing program.  All we know is that they are described as 'detail-oriented'

#### Facts worth noting in the case study:
- Cyclistic has 5824 bicycles and 692 docking stations, all in Chicago
- Cyclistic offers reclining bikes, hand tricycles and cargo bikes to aid riders who may have disabilities.  
 - These only make up ~8% of riders
- 30% of cyclists ride to commute to work each day 

#### Potential biases to keep in mind:
- You're told that membership holders are more profitable.  It makes logical sense, but we have not seen any data to back that up.
- Moreno appears to have already decided the course of action, market towards casual riders to get more members.  They may be looking to steer the data to prove this point.
- Moreno believes that casual riders are already aware of membership offerings.

## Business task:

With the provided data, what can we determine about the ways casual riders and members use Cyclistic bikes: what tendencies do they share, and how do they differ?

# Prepare

Locate, Organize and Verify the Data

### Data Summary:

We are provided zip files from an Amazon S3 Cloud storage including monthly and quarterly data since the company started.

We download the most recent monthly files (April 2020 through April 2021, 13 files in total), which are csv files with data on:
- ride_id
- rideable_type
- start & end times, stations and longitude/latitude
- rider status: member or casual

### Data Acquisition method:

Currently unknown

### Data Licensing:

Can be found [here](https://www.divvybikes.com/data-license-agreement)

### Data organization and verification:

These files contain 10s of thousands of rows each.  

Due to the amount of data, we will use Python via Jupyter Notebook and clean/view/merge our data with the Pandas library.  

### For this notebook, I loaded and merged all files with pandas, as described below, and then uploaded the result to Kaggle

The merged, pre-cleaned dataset can be found here: [Divvy Cyclist Dataset](https://www.kaggle.com/michaelmcray/divvy-cyclist-04-20-to-04-21)

```
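import glob

import pandas as pd
import numpy as np

#List to hold dataframes
df_list = []

#'/file_location/csv_files' is a placeholder for the folder holding the downloaded csv files
for file in sorted(glob.glob('/file_location/csv_files/*.csv')):
    df_csv = pd.read_csv(file)
    df_list.append(df_csv)
    
#Ensure all 13 files were added to the list
print(len(df_list))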
```

#### Use Concat to combine all files into single dataframe

`df = pd.concat(df_list, axis=0)`
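
As a minimal sketch of what the concatenation does, the example below stacks two tiny in-memory frames standing in for monthly files (the sample values are made up for illustration). Passing `ignore_index=True` is an optional addition, not part of the original step; it gives the merged frame a fresh 0..n-1 index instead of repeating each file's row numbers.

```python
import pandas as pd

# Two small stand-ins for monthly csv files (illustrative values only)
april = pd.DataFrame({"ride_id": ["A1", "A2"], "member_casual": ["member", "casual"]})
may = pd.DataFrame({"ride_id": ["B1"], "member_casual": ["casual"]})

# Stack the rows; ignore_index=True renumbers the index from 0
merged = pd.concat([april, may], axis=0, ignore_index=True)

print(merged.shape)
print(merged.index.tolist())
```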

## Let's get started  with our data:  

#### We'll import our libraries for analysis and load our merged csv.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('../input/divvy-cyclist-04-20-to-04-21/Divvy_Cyclist.csv', low_memory=False)

## Begin Inspecting:

We'll look through our data for nulls, duplicates, poor formatting and outliers.

### Check number of rows and columns:

In [None]:
print(df.shape)

### Check our available column names and data types

In [None]:
print(df.dtypes)


**Note**:
- started_at & ended_at should be datetime columns, but are showing as object.
 - Object means the columns may have string or mixed data.  
 - We'll need to investigate further.
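
One possible way to investigate is to parse the column with `errors='coerce'` and count how many values fail to parse. This is only a sketch on made-up sample strings, not the actual dataset.

```python
import pandas as pd

# Illustrative sample: two valid timestamps and one malformed value
raw = pd.Series(["2020-04-26 17:45:14", "2020-04-17 17:08:54", "not a date"])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```

If the count of `NaT` values is zero on the real column, the strings are all valid timestamps and a plain conversion is safe.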

### Check for Null values

In [None]:
df.isna().sum()

**Note**:
- There is a sizable number of nulls in the station columns (~20%).  
- I ended up dropping these rows after further evaluation.
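
To read the null counts as percentages, `isna().mean()` can be used in place of `isna().sum()`. The frame below is a made-up sample (the station names and coordinates are placeholders), so treat it as a sketch of the technique only.

```python
import numpy as np
import pandas as pd

# Illustrative frame: one of five station names is missing
sample = pd.DataFrame({
    "start_station_name": ["Station A", None, "Station B", "Station C", "Station D"],
    "start_lat": [41.9, 41.8, 41.9, np.nan, 41.9],
    "member_casual": ["member", "casual", "member", "member", "casual"],
})

# isna().mean() gives the fraction of nulls; multiply by 100 for a percentage
null_pct = sample.isna().mean() * 100
print(null_pct.round(1))
```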

### Quickview of our first 5 rows

In [None]:
print(df.head())

### Check for duplicate rows

In [None]:
df.shape

In [None]:
duplicate_df = df[df.duplicated()]
print(duplicate_df.shape)

**0 duplicated rows, great!**

### Check whether ids are unique or specific to rider/location

In [None]:
id_ride = df.ride_id.unique()
id_start = df.start_station_id.unique()
id_end = df.end_station_id.unique()
print(len(id_ride), len(id_start), len(id_end))

- id_ride = approximately 3826800;
- id_start, id_end were ~1800

#### id_ride is apparently unique while station id is tied to station name
- We'll delete the ride_id column since we're not planning to use SQL and currently don't need a unique column
- We'll delete the station ids because the station names will be easier for our stakeholders to understand
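
Rather than comparing counts by eye, uniqueness can also be checked directly. The sketch below uses a small made-up frame in which one id repeats; on the real data the same calls would be applied to `df['ride_id']` before the column is dropped.

```python
import pandas as pd

# Illustrative rides: the id "R2" appears twice
rides = pd.DataFrame({
    "ride_id": ["R1", "R2", "R2", "R3"],
    "start_station_id": [10, 10, 11, 12],
})

# is_unique is True only when no value repeats
print(rides["ride_id"].is_unique)

# keep=False marks every row that shares an id, so both copies can be compared
repeated = rides[rides["ride_id"].duplicated(keep=False)]
print(repeated)
```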

### Delete Columns

In [None]:
df = df.drop(['ride_id', 'start_station_id', 'end_station_id'], axis = 1)

### Rename Columns

In [None]:
df = df.rename(columns = {'rideable_type' : 'Bike_Type', 'started_at' : 'Start_Time', 'ended_at' : 'End_Time', 'start_station_name' : 'Starting_Station', 'end_station_name' : 'Ending_Station', 'member_casual' : 'Rider_Status', 'start_lat' : 'Starting_Latitude', 'start_lng' : 'Starting_Longitude','end_lat' : 'Ending_Latitude', 'end_lng' : 'Ending_Longitude'})
df.head()

# Process Data

Determine any errors, reformat as necessary, prepare to create visualizations

### Reviewing data types

In [None]:
df.dtypes

#### Start Time and End Time are objects but should be datetime dtypes

In [None]:
from datetime import datetime

In [None]:
#Convert to time format
df['Start_Time'] = pd.to_datetime(df['Start_Time'])
df['End_Time'] = pd.to_datetime(df['End_Time'])


In [None]:
df.dtypes

### Add some informative Columns:
- We'll start by creating a 'Ride Length' column expressing the difference between Start and End times.  I found this a bit clunky in Pandas, but ended up with a result in decimal hours.
- Second, we'll find the day of the week corresponding to each Start Time.


In [None]:
df['Ride_Length'] = ((df['End_Time'] - df['Start_Time']).dt.days) * 24 + ((df['End_Time'] - df['Start_Time']).dt.seconds)/ 3600

In [None]:
df['Day_of_Week'] = df['Start_Time'].dt.day_name()
print(df.head())
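
A possibly less clunky route to the same number is `dt.total_seconds()`. The sketch below, on two made-up rides (one with the end before the start), compares it with the days-plus-seconds formula used above; both give the ride length in decimal hours.

```python
import pandas as pd

start = pd.to_datetime(pd.Series(["2020-04-26 17:45:00", "2020-04-27 09:00:00"]))
end = pd.to_datetime(pd.Series(["2020-04-26 18:15:00", "2020-04-27 08:30:00"]))
delta = end - start

# Formula used above: whole days converted to hours plus the leftover seconds
hours_original = delta.dt.days * 24 + delta.dt.seconds / 3600

# Shorter alternative: total_seconds() covers days and negative values in one step
hours_total = delta.dt.total_seconds() / 3600

print(hours_original.tolist())
print(hours_total.tolist())
```

The second ride comes out negative in both versions, which is the kind of value the cleaning steps later look for.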

### Investigate data to determine any additional cleaning needed

Start by creating a new dataframe with reduced columns to check our ride length calculation. Ride length times are in hours.

In [None]:
df_short = df.drop(['Start_Time', 'End_Time', 'Starting_Station', 'Ending_Station', 'Starting_Latitude', 'Ending_Latitude', 'Starting_Longitude', 'Ending_Longitude'], axis=1 )

In [None]:
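#numeric_only=True skips the text Bike_Type column, which newer pandas versions refuse to average
df_short.groupby(['Day_of_Week', 'Rider_Status']).mean(numeric_only=True)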

In [None]:
df_short.groupby('Rider_Status')['Rider_Status'].count()

It looks like we might have some questionable data for ride length, with some times showing as negative.  Let's make a quick box plot to check for outliers.


In [None]:
import seaborn as sns                       #visualisation
import matplotlib.pyplot as plt             #visualisation
%matplotlib inline     
sns.set(color_codes=True)

sns.boxplot(x=df['Ride_Length'])

In [None]:
df['Ride_Length'].describe()

We indeed have some negative values and an unexpectedly large standard deviation.  We'll create a new sorted dataframe to inspect these values and then remove them.

In [None]:
df_sort = df.sort_values('Ride_Length')
df_sort.head()

In [None]:
df_time_error = df[df['Ride_Length']<0]
df_time_error.count()

**Results**: 10557 rows where the end time is before the start time.  Out of 3826978 rides this is a little over 1/4 of a percent (about 0.28%).  Still, we'll take a quick look to see if we can find any other correlations.

In [None]:
df_time_error['Start_Time'].max()

In [None]:
df_time_error['Start_Time'].min()

In [None]:
df_time_error.groupby('Day_of_Week')['Rider_Status'].count()

In [None]:
df_time_error.groupby('Rider_Status')['Start_Time'].count()

These errors appear to have occurred throughout the entire year and are reasonably spread out across the days of the week.  For now, I'm going to remove this data, but it's definitely something I'd investigate a little further if possible. 

In [None]:
df_time_high = df[df['Ride_Length']>24]
df_time_high.count()

Less than 0.1% of rides last over 24 hours.  For now, we'll remove these values as well.  I'll create a new dataframe without these rows and keep the original in case I want to compare the two.

In [None]:
df_clean = df[(df['Ride_Length'] < 24) & (df['Ride_Length'] > 0)]
df.shape

In [None]:
df_clean.shape
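
To make explicit what the filter keeps and drops, here is a small sketch on made-up ride lengths. Note that the strict `> 0` comparison also removes rides of exactly zero length, not only the negative ones.

```python
import pandas as pd

# Illustrative ride lengths in hours: one negative, one exactly zero, one over 24
sample = pd.DataFrame({"Ride_Length": [-0.5, 0.0, 0.25, 1.5, 30.0]})

# Same condition as above: keep rides strictly between 0 and 24 hours
keep = (sample["Ride_Length"] > 0) & (sample["Ride_Length"] < 24)
sample_clean = sample[keep]

removed = len(sample) - len(sample_clean)
print(f"removed {removed} of {len(sample)} rows ({removed / len(sample):.1%})")
```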

In [None]:
sns.boxplot(x=df_clean['Ride_Length'])

In [None]:
df_clean.groupby(['Day_of_Week', 'Rider_Status'])['Ride_Length'].mean()

Now we'll check for any more nulls and remove those as well.

In [None]:
print(df_clean.isnull().sum())

In [None]:
df_clean = df_clean.dropna()

In [None]:
print(df_clean.isnull().sum())

### Now we'll export the data to a csv and load into Tableau Public to visualize.

df_clean.to_csv('//Divvy_Cyclist.csv', index=False)


# Analyze & Share

Our **Assignment** was 'How do annual members and casual riders use Cyclistic bikes differently?'

First let's list some questions that we can ask to determine these differences:
- Who is taking these rides, members or casuals?
- Where are they riding?
- For how long are they riding?
- What times of the day and year are they riding?
- What type of bikes are they riding?

After loading the data into Tableau Public the following charts were developed to address these questions.

## Who is taking the rides?
![TotalTime.PNG](attachment:TotalTime.PNG)

#### What we found:
- There were over 3.5 million rides between April 2020 and April 2021.
- For these rides, nearly 59% were members and 41% were casuals.
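
The charts in this section were built in Tableau Public. As a rough cross-check in pandas, the share of rides per rider type could be computed with `value_counts(normalize=True)`. The sketch below uses a small made-up column rather than the real data; `Rider_Status` is the column name from the renaming step.

```python
import pandas as pd

# Illustrative column: 3 member rides and 2 casual rides
status = pd.Series(["member", "casual", "member", "member", "casual"], name="Rider_Status")

# normalize=True returns fractions instead of raw counts
share = status.value_counts(normalize=True) * 100
print(share)
```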

#### What we don't know:
- How many of these rides were from the same individuals.  Our information is specific to the ride and not the user.

#### Considerations:
- I was surprised by the high percentage of member rides.  I expected them to be a minority.

## Where are they riding?

The following gif shows start locations for members, casuals and combined:

![5aer3n.gif](attachment:5aer3n.gif)

#### What we found:
- From this graphic there doesn't appear to be a large difference between members and casuals.
- A separate graph of ending locations showed comparable results.
- Most rides appear to be clustered in the metro area.

#### What we don't know:
- The relation of ending location or duration of these trips. 
- The time of the day/year these trips were taken

#### Considerations:
- Further examinations of this based on time or specific areas may add beneficial information.

## For how long are they riding?
![Duration.PNG](attachment:Duration.PNG)
#### What we found:
- Casual rides tend to last almost twice as long as member rides.
- Rides are unsurprisingly shortest in the winter and longest in the summer.
- Members are more consistent in their ride durations, with average ride duration ranging between 12 and 19 minutes compared to between 22 and 44 minutes for casuals.

#### What we don't know:
- What ending and starting locations relate to these shorter and longer rides.
- How the number of unique users may relate to this data. 

#### Considerations:
- It was stated in the case study that members tended to be more likely to use the bikes for transportation to work.  This could explain the steadier trend and lower average.
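
The duration chart was produced in Tableau, but a comparable table could be sketched in pandas by averaging `Ride_Length` per month and rider type. The rides below are made up for illustration; the column names follow the renamed dataframe.

```python
import pandas as pd

# Illustrative rides with length in hours (values are made up)
rides = pd.DataFrame({
    "Start_Time": pd.to_datetime(["2020-07-04 10:00", "2020-07-05 11:00",
                                  "2021-01-10 08:00", "2021-01-11 08:30"]),
    "Rider_Status": ["casual", "member", "casual", "member"],
    "Ride_Length": [0.75, 0.25, 0.40, 0.20],
})

# Average ride length in minutes for each month and rider type
rides["Month"] = rides["Start_Time"].dt.month
avg_minutes = (rides.groupby(["Month", "Rider_Status"])["Ride_Length"].mean() * 60).unstack()
print(avg_minutes)
```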

## What time of day and year are they riding?

### Monthly breakdown:

![Monthly.PNG](attachment:Monthly.PNG)

#### What we found:
- In summer months, member and casual rides are nearly equal.
- This drastically falls off between November and March with members going for over twice as many rides as casuals for those months.

#### What we don't know:
- The weather is likely a large factor here, but so may location of bikes. 

#### Considerations:
- Could something be done to spark casual interest in rides even in the colder months?


### Weekly Breakdown
![dpw.png](attachment:dpw.png)
#### What we found:
- Member rides are highest on Saturday but fairly consistent.
- Casual rides are much higher on the weekend.

#### What we don't know:
- Whether any monthly outliers are affecting this.

#### Considerations:
- How can we encourage more casual rides during the week, or market to member riders on the weekend?

### Hourly Breakdown

![Hourly.PNG](attachment:Hourly.PNG)

#### What we found:
- Member rides spike heavily between 6-8 AM and again between 4 and 6 PM. They also have a peak around noon (lunch)
- Casual rides begin more slowly at the start of the day but ramp up more consistently.

#### What we don't know:
- How this varies from month to month

#### Considerations:
- This is again likely due to people using the bikes for transportation to work.
- It's interesting that casual rides don't decrease as quickly in the evenings, although they do begin declining at the same time as member rides.
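
For readers who want to reproduce the hourly view outside Tableau, one option is a cross-tabulation of start hour against rider type. This is a sketch on made-up start times, with `Hour` as an illustrative row label.

```python
import pandas as pd

# Illustrative start times (made up), with the rider type for each ride
rides = pd.DataFrame({
    "Start_Time": pd.to_datetime(["2020-06-01 07:15", "2020-06-01 07:40",
                                  "2020-06-01 17:05", "2020-06-01 14:30"]),
    "Rider_Status": ["member", "member", "member", "casual"],
})

# Count rides per start hour for each rider type; missing combinations show as 0
hourly = pd.crosstab(rides["Start_Time"].dt.hour, rides["Rider_Status"], rownames=["Hour"])
print(hourly)
```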

### What type of bikes are they riding?
![Bike%20Rides.PNG](attachment:Bike%20Rides.PNG)
#### What we found:
- Classic bikes appear to have been introduced around December 2020.
- Members appear to have taken to the classic bikes more quickly than the casual riders.
- Around the end of February casual riders became interested in the docked bikes again while the member riders stuck with the classic.

#### What we don't know:
- Much of anything about what distinguishes these bike types.
- What incentive there is to ride any of these variations.

#### Considerations:
- We'd need more information on the types of bikes to make a solid recommendation but it's worth pointing out the difference in user levels especially as the difference is increasing at the time of the investigation.

# Act

## Conclusion:
**So how do annual members and casual riders use Cyclistic bikes differently**
1. Length (Duration)
- Members are more consistent riders from day to day, while casual rides last on average twice as long.  This could be due to a price structure that makes short trips less economical for casual riders, marketing aimed at leisure instead of convenience, etc.
2. When (Day, Week, Month)
- Member rides spike at the normal start and end of the business day, whereas casual rides start slower and increase/decrease at a steadier rate.
- Casual riders prefer weekends, whereas members ride more frequently on weekdays, especially around the start and end of work.  This is likely due to members being more likely to use the bikes for commutes, and casuals for leisure.
- All rides are down in the colder months, but casual riders all but disappear then, while in the summer they are nearly as frequent as members.
3. What (Type of Bike)
- Members adopted the classic bike faster than casual riders and now strongly prefer it, whereas casual riders are showing a returned interest in docked bikes.
- **(Assumption)** Members are more likely to adapt to a new type of bike, but casual riders will more likely ride what is available.  
 - This is an assumption because we don't know what incentives, bike availability, or price structures apply to each group.


# Wrapping Up:  Google's Data Analysis Case Study

This concludes the case study portion of my analysis.  I enjoyed my time poring over the data and trying to glean some insight into what makes these ride types different.  If I were an employee of this company, I'd love to know more about the locations and the number of users of each type involved with these rides, and to spend more time looking into the paths and distances of each trip.

In the next section, I'm going to experiment a little bit with running the data through some beginner Machine Learning algorithms. This is unrelated to the assigned case study, but may provide some fun insight.

# Experimenting with Machine Learning

It seems irresponsible to run this on Kaggle and not at least put the data through a machine learning algorithm. 

### Goals:
- Gain insight into how similar/different our two ride types are based on ability for ML to accurately predict. 
- Determine which variables are most important in predictions

### Note:  This is a very basic approach to ML using Random Forest Classification to see what insight can be gained on the goals listed.  Any feedback, or suggestions for other algorithms/methods are appreciated.

## First we will need to further process our data before we can begin:

Quick review of our header:

In [None]:
df.head()

**Let's break down a few of these columns and see how many unique values we have:**

In [None]:
df['Bike_Type'].value_counts()


In [None]:
df['Starting_Station'].value_counts()

### Let's take another look at longitude and latitude

In [None]:
df['Starting_Longitude'].describe()

In [None]:
df['Starting_Latitude'].describe()

**There might be enough variance in latitude to take a closer look.  We might be able to scale it to something useful, but for now we'll remove it for simplicity**

### Reducing our variables:
- The station names have too many distinct values to categorize for this (the station IDs were already dropped).  If I knew more about the area, maybe we could break them into sections of the town, but for now we'll remove them.
- We'll continue to use Start Time and remove End Time.
- ride_id is unique to each ride and was already dropped during cleaning.
- We'll remove the latitude and longitude data.

In [None]:
#Cleaning up our columns:

dfML = df.drop(['Starting_Station', 'Ending_Station', 'End_Time', 'Starting_Latitude', 'Ending_Latitude', 'Starting_Longitude', 'Ending_Longitude'], axis='columns')
dfML.head()


**Let's move ahead and clean up our ride length data by converting to minutes (multiply by 60) and setting to integer.**

In [None]:
dfML['Ride_Length'] = dfML['Ride_Length'] * 60

In [None]:
dfML['Ride_Length'] = dfML['Ride_Length'].astype(int)

In [None]:
dfML.head()

### From our earlier analysis we assessed that the key differences in ride types were ride length (duration), ride time (hour of day, day of week, month of year), and to a lesser extent bike type

**Let's break these down a bit so they can be categories**

#### From our Rider per Month chart there seem to be two distinct periods: colder months (October-March) and warmer months (April-September)

In [None]:
#Break datetime into integer value for Month
dfML['Month'] = dfML['Start_Time'].dt.month

In [None]:
dfML.Month.value_counts()

**The following will let us split our monthly time into warm/cold**

In [None]:
dfML['Season'] = pd.cut(dfML['Month'],
                            [0,3,9,12],
                            labels = ['Cold', 'Warm', 'Cold'],
                             include_lowest=True,
                            ordered=False)
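
The `pd.cut` call above repeats the 'Cold' label for two bins, which is why `ordered=False` is passed. As an alternative sketch that avoids repeated labels, the same warm/cold split (April through September as warm) can be written with `between` and `np.where`; this is an equivalent illustration, not the method used in the notebook.

```python
import numpy as np
import pandas as pd

months = pd.Series(range(1, 13), name="Month")

# April (4) through September (9) are treated as warm, everything else as cold
season = np.where(months.between(4, 9), "Warm", "Cold")

print(dict(zip(months.tolist(), season.tolist())))
```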

### Let's separate our 'Day of Week' column based on weekday or weekend
- In our chart we can see the count of casual rides drops much lower than members during weekdays

In [None]:
dfML['Part_of_Week'] = np.where((dfML['Day_of_Week'] == 'Saturday') | (dfML['Day_of_Week'] == 'Sunday'), 1,0)

In [None]:
dfML.head()

### Now we'll separate our daily time into hours

**In our analysis we determined members had large spikes in rides during morning and evening commutes.  These may be beneficial for our model to predict**
- First we'll pull the hour out of each start time

In [None]:
#Break datetime into integer value for hour
dfML['Hours'] = dfML['Start_Time'].dt.hour

**The following will let us split our hourly time into parts of the day**

In [None]:
dfML['Part_of_Day'] = pd.cut(dfML['Hours'],
                            [0,4,8,16,20,23],
                            labels = ['Late_Night', 'Morning', 'Afternoon','Evening', 'Night'],
                             include_lowest=True,
                            ordered=False)
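
To see exactly which hours land in which bucket, the sketch below applies the same edges and labels to the boundary hours only. `pd.cut` bins are closed on the right by default, so hour 4 still counts as 'Late_Night' and hour 16 as 'Afternoon'.

```python
import pandas as pd

# Boundary hours on either side of each bin edge
hours = pd.Series([0, 4, 5, 8, 9, 16, 17, 20, 21, 23])

# Same edges and labels as above; include_lowest=True pulls hour 0 into the first bin
part_of_day = pd.cut(hours,
                     [0, 4, 8, 16, 20, 23],
                     labels=["Late_Night", "Morning", "Afternoon", "Evening", "Night"],
                     include_lowest=True)

print(dict(zip(hours.tolist(), part_of_day.astype(str).tolist())))
```

One consequence of these edges is that the 4 to 6 PM member spike seen earlier is split between 'Afternoon' (hour 16) and 'Evening' (hours 17 to 20), which may dilute its signal.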
                                                         

In [None]:
dfML.Part_of_Day.value_counts()

In [None]:
dfML.head()

### Separate our Rider Status into member = 1, casual = 0

In [None]:
dfML['Ride_Type'] = np.where(dfML['Rider_Status'] == 'member',1,0)

## Now we'll break our categorical data into numerical form using One-Hot Encoding via pandas get_dummies.  

### We'll do this for our 'Bike_Type', 'Season' and 'Part_of_Day' columns ('Rider_Status' was already converted to 0/1 above)

[Here](https://towardsdatascience.com/what-is-one-hot-encoding-and-how-to-use-pandas-get-dummies-function-922eb9bd4970) is a simple guide on One hot encoding.

I could likely do this by changing the labels of the columns we cut, but I'm going to experiment with this method first

In [None]:
bt_dummies = pd.get_dummies(dfML['Bike_Type'])

In [None]:
seasons_dummies = pd.get_dummies(dfML['Season'])

In [None]:
pod_dummies = pd.get_dummies(dfML['Part_of_Day'])

## Time to add all our dummy frames into our dataFrame

**Note: Also remove redundant columns**

In [None]:
dfML = pd.concat([dfML, bt_dummies, seasons_dummies, pod_dummies], axis=1)
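
As a small illustration of what `get_dummies` followed by `concat` produces, here is a sketch on a made-up column with two of the bike types. The `prefix` and `dtype` arguments are optional extras not used above; a prefix can help avoid name clashes once several dummy frames are combined.

```python
import pandas as pd

# Illustrative column using two bike types mentioned in the analysis
bikes = pd.DataFrame({"Bike_Type": ["docked_bike", "classic_bike", "docked_bike"]})

# One 0/1 column per category; prefix="Bike" yields names like Bike_docked_bike
dummies = pd.get_dummies(bikes["Bike_Type"], prefix="Bike", dtype=int)

# Attach the new columns next to the original one
encoded = pd.concat([bikes, dummies], axis=1)
print(encoded)
```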

In [None]:
dfML = dfML.drop(columns = ['Part_of_Day', 'Day_of_Week', 'Season','Start_Time', 'Bike_Type', 'Rider_Status'])

In [None]:
dfML.head()

In [None]:
dfML.isna().sum()

# Import sklearn and split our data into training and testing data

In [None]:
from sklearn.model_selection import train_test_split
y = dfML['Ride_Type']
X = dfML.drop(['Ride_Type'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=30)
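
Because members and casuals are not evenly balanced, one optional variation is to pass `stratify=y` so that both splits keep the same class ratio. The sketch below demonstrates this on synthetic labels; the `*_demo` and `*d_*` names are illustrative and the split above does not use stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: 60 member rides (1) and 40 casual rides (0)
X_demo = pd.DataFrame({"Ride_Length": range(100)})
y_demo = pd.Series([1] * 60 + [0] * 40, name="Ride_Type")

# stratify=y_demo keeps the 60/40 class ratio in both the training and test sets
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=30, stratify=y_demo)

print(yd_train.mean(), yd_test.mean())
```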

In [None]:
#checking training data 
y_train.value_counts()

In [None]:
X_train.value_counts()

In [None]:
#checking test data
X_test.value_counts()

In [None]:
y_test.value_counts()

In [None]:
#Lets double check we have no relevant nulls
X_test.isna().sum()

# Import RandomForestClassifier for our algorithm:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### Train our RFC model

In [None]:
clf = RandomForestClassifier(n_estimators = 100, random_state = 20)
rf_train = clf.fit(X_train,y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
clf.predict_proba(X)[0:5]

In [None]:
importance = list(zip(X, clf.feature_importances_))
importance
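
The list of tuples is easier to read when sorted. A possible approach is to wrap the importances in a pandas Series indexed by column name, shown here on synthetic data where only one feature carries signal (the feature names and data are made up for the example).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on "signal"; "noise" is random
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"signal": rng.normal(size=500), "noise": rng.normal(size=500)})
y_demo = (X_demo["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=20).fit(X_demo, y_demo)

# Pair each importance with its column name and sort, largest first
ranked = pd.Series(model.feature_importances_, index=X_demo.columns).sort_values(ascending=False)
print(ranked)
```

The importances always sum to 1, so each value can be read as a share of the total.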

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
accuracy = (metrics.accuracy_score(y_test, y_pred)) * 100
accuracy
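
An accuracy score is easier to judge next to a baseline. Since roughly 59% of rides in the earlier analysis were member rides, a model that always predicts 'member' would already score near that level on similarly balanced data. The sketch below computes such a majority-class baseline on illustrative labels; on the real data it would be applied to `y_test`.

```python
import pandas as pd

# Illustrative test labels: 59 member rides (1) and 41 casual rides (0)
y_demo = pd.Series([1] * 59 + [0] * 41)

# Accuracy of always predicting the most common class
baseline = y_demo.value_counts(normalize=True).max() * 100
print(f"majority-class baseline: {baseline:.1f}%")
```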

### Observations:
- I'm not surprised Ride Length has the highest importance since this data showed the greatest deviation, but I am surprised it seems to be the only relevant variable.
- I expected more significance from some of the grouped time and season categories.  The way I have these split may need to be reassessed.


## Let's try with fewer variables
- We'll keep Ride Length, but focus on the time variables
- The 'Part of Day' variables may need some adjusting, or maybe it's just adding too many variables.

In [None]:
dfML_time = dfML[['Ride_Type','Ride_Length', 'Month', 'Hours', 'Part_of_Week']].copy()

In [None]:
yt = dfML_time['Ride_Type']
Xt = dfML_time.drop(['Ride_Type'], axis=1)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(Xt,yt, test_size=0.3, random_state=30)

In [None]:
clft = RandomForestClassifier(n_estimators = 100, random_state = 20)
rft_train = clft.fit(Xt_train,yt_train)

In [None]:
yt_pred = clft.predict(Xt_test)

In [None]:
clft.predict_proba(Xt)[0:5]

In [None]:
impt = list(zip(Xt, clft.feature_importances_))
impt

In [None]:
print(classification_report(yt_test, yt_pred))

In [None]:
acct = (metrics.accuracy_score(yt_test, yt_pred)) * 100
acct

#### Observations

- Nearly identical accuracy with ride length now nearly 75% of the importance
- Hours measuring as high in importance as months and part of week combined.

## Just Ride Length and cold/warm months

In [None]:
dfML_temp = dfML[['Ride_Type','Ride_Length', 'Cold', 'Warm']].copy()

In [None]:
ytp = dfML_temp['Ride_Type']
Xtp = dfML_temp.drop(['Ride_Type'], axis=1)
Xtp_train, Xtp_test, ytp_train, ytp_test = train_test_split(Xtp,ytp, test_size=0.3, random_state=30)

clftp = RandomForestClassifier(n_estimators = 100, random_state = 20)
rftp_train = clftp.fit(Xtp_train,ytp_train)

ytp_pred = clftp.predict(Xtp_test)

In [None]:
clftp.predict_proba(Xtp)[0:5]

In [None]:
print(classification_report(ytp_test, ytp_pred))

In [None]:
imptp = list(zip(Xtp, clftp.feature_importances_))
imptp

In [None]:
acctp = (metrics.accuracy_score(ytp_test, ytp_pred)) * 100
acctp

#### Observation:
- Only slightly lower scores
- Ride Length now taking up over 92% of the importance

## Just the times of day by our grouped categories

In [None]:
dfML_PoD = dfML[['Ride_Type','Late_Night', 'Morning', 'Afternoon', 'Evening', 'Night']].copy()

In [None]:
yp = dfML_PoD['Ride_Type']
Xp = dfML_PoD.drop(['Ride_Type'], axis=1)
Xp_train, Xp_test, yp_train, yp_test = train_test_split(Xp,yp, test_size=0.3, random_state=30)

clfp = RandomForestClassifier(n_estimators = 100, random_state = 20)
rfp_train = clfp.fit(Xp_train,yp_train)

yp_pred = clfp.predict(Xp_test)

In [None]:
clfp.predict_proba(Xp)[0:5]

In [None]:
print(classification_report(yp_test, yp_pred))

In [None]:
imp = list(zip(Xp, clfp.feature_importances_))
imp

In [None]:
accp = (metrics.accuracy_score(yp_test, yp_pred))* 100
accp

#### Observation
- Much lower accuracy (down by ~8%) and a poor F1 score without Ride Length included
- Importance score for morning now reflective of the difference we saw in our analysis

## Just the best performers:

In [None]:
dfML_Best = dfML[['Ride_Type','Ride_Length', 'Morning', 'Hours', 'docked_bike']].copy()

In [None]:
yb = dfML_Best['Ride_Type']
Xb = dfML_Best.drop(['Ride_Type'], axis=1)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb,yb, test_size=0.3, random_state=30)

clfb = RandomForestClassifier(n_estimators = 100, random_state = 20)
rfb_train = clfb.fit(Xb_train,yb_train)

yb_pred = clfb.predict(Xb_test)

In [None]:
clfb.predict_proba(Xb)[0:5]

In [None]:
print(classification_report(yb_test, yb_pred))

In [None]:
impb = list(zip(Xb, clfb.feature_importances_))
impb

In [None]:
accb= (metrics.accuracy_score(yb_test, yb_pred)) * 100
accb

#### Observation
- Only slightly worse performance than with all our variables (about 0.8% lower accuracy)
- Once again, heavily weighted on Ride Length

# Final thoughts

- Ride Length comes out the heavy winner as the most important variable when trying to determine whether a ride was taken by a casual rider or a member 
- Business acumen on this industry and the city of Chicago, along with further user data, could help with our assessment
- Much more time could be spent trying different algorithms, adjusting our variables and tuning hyperparameters

# Final Note:

To anyone that makes it this far, thanks for reading.  I'd appreciate any feedback.  Many of these steps could have been simplified, but I've enjoyed experimenting in different areas.

### To do:
- Code and Text clean up
- Consistent formatting
- Create a data cleaning summary
