# ML Title: Predicting Bicycle Imbalances in a Bike-Sharing System

This notebook explores the data identified and collected to build a model that can predict imbalances in a bike-sharing system.

The analysis presents the investigations and plots carried out during exploration, and includes the following:

1. Heat maps
2. Correlation Matrices
3. Boxplots
4. Feature Independence Plots
5. Outlier Plots
6. Illustration of Patterns of Interest
7. Illustration of Trends in Time and Space

#### Our Approach
> The approach will use a minimum of 10 ML models
1. Problem Identification
2. Available Data
3. How Evaluation Is Done
4. The Features Available
5. Modelling
6. Experimenting

#### 1. Problem Identified
> Given the features identified in the dataset, can we predict whether there will be an imbalance at any given bike docking station?
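
The notebook does not define "imbalance" at this point. One possible way to make it concrete, assumed here purely for illustration, is the net flow of a station: arrivals minus departures over some period. The sketch below uses a made-up trip table with the dataset's `start station id` and `end station id` column names; the threshold `IMBALANCE_THRESHOLD` is hypothetical and not taken from the source.

```python
import pandas as pd

# Toy trips using the dataset's column names; the values are made up.
trips = pd.DataFrame({
    "start station id": [1, 1, 1, 2, 3],
    "end station id":   [2, 2, 3, 1, 2],
})

# Count departures and arrivals per station.
departures = trips["start station id"].value_counts()
arrivals = trips["end station id"].value_counts()

# Net flow = arrivals - departures; a station missing on one side counts as 0.
net_flow = arrivals.sub(departures, fill_value=0).astype(int).sort_index()

# Hypothetical rule: flag a station when |net flow| reaches a chosen threshold.
IMBALANCE_THRESHOLD = 2  # assumption, not taken from the notebook
imbalanced = net_flow.abs() >= IMBALANCE_THRESHOLD
```

A negative net flow means a station loses bikes over the period, and a positive one means it accumulates them. In practice the counts would likely be computed per station and per time window rather than over the whole dataset.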

#### 2. Data

The data originally comes from the [Blue Bikes](https://www.bluebikes.com/system-data) system data. Another version is also available on [Kaggle](https://www.kaggle.com/datasets/jackdaoud/bluebikes-in-boston?resource=download).

#### 3. Evaluation
> The model will be considered successful if it reaches a score of at least 97%.
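
The source does not say which metric the 97% refers to. As a rough sketch, assuming it means cross-validated accuracy, a candidate model could be checked against the target as follows. The data here is synthetic (`make_classification`) and stands in for the real features, and `TARGET_SCORE` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

TARGET_SCORE = 0.97  # the 97% target; interpreting it as accuracy is an assumption

# Synthetic stand-in data; the real features come from the trip dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# 5-fold cross-validated accuracy for one candidate model.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
mean_score = scores.mean()

# Compare the mean score against the target.
meets_target = bool(mean_score >= TARGET_SCORE)
```

If imbalanced stations turn out to be rare, accuracy alone can look high for a model that never predicts an imbalance, so precision, recall, and F1 (all imported later in this notebook) are worth reporting alongside it.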


#### 4. Features
> Details about each feature in the dataset

**Data Features/Columns**
   - ```tripduration```: duration of trip in seconds
   - ```starttime```: start time and date of trip
   - ```stoptime```: stop time and date of trip
   - ```start station id```: unique ID of station the trip started at
   - ```start station name```: name of station the trip started at
   - ```start station latitude```: latitude of start station of trip
   - ```start station longitude```: longitude of start station of trip
   - ```end station id```: unique ID of station the trip ended at
   - ```end station name```: name of station the trip ended at
   - ```end station latitude```: latitude of end station of trip
   - ```end station longitude```: longitude of end station of trip
   - ```bikeid```: unique ID of bike used for trip
   - ```usertype```: type of user, either Customer or Subscriber
   - ```postal code```: postal code of user
   - ```year```: year in which the trip took place
   - ```month```: month in which the trip took place
   - ```birth year```: birth year of user
   - ```gender```: gender of user

### EDA-TOOLS

#### The Tools to be used
> Import the necessary libraries: pandas, NumPy, Matplotlib, seaborn, and the scikit-learn models and utilities for training and evaluation

In [13]:
# imports
import matplotlib.pyplot as plt 
import pandas as pd
import numpy as np
import seaborn as sms

%matplotlib inline

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

#Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay  # replaces plot_roc_curve, which was removed in scikit-learn 1.2

# Set the seaborn-style plot formatting
plt.style.use('seaborn-colorblind')

#### Load the datasets and merge

In [15]:
bike_df1 = pd.read_csv('datasets/bluebikes_tripdata_2019.csv')
bike_df2 = pd.read_csv('datasets/bluebikes_tripdata_2020.csv')

Create a copy of each dataset

In [16]:
bike_df1 = bike_df1.copy()
bike_df2 = bike_df2.copy()

Check the number of rows and columns

In [17]:
bike_df1.shape

(2522771, 17)

In [18]:
bike_df2.shape

(1999446, 18)

Check the columns of each dataset before merging

In [19]:
bike_df1.columns

Index(['tripduration', 'starttime', 'stoptime', 'start station id',
       'start station name', 'start station latitude',
       'start station longitude', 'end station id', 'end station name',
       'end station latitude', 'end station longitude', 'bikeid', 'usertype',
       'birth year', 'gender', 'year', 'month'],
      dtype='object')

In [20]:
bike_df2.columns

Index(['tripduration', 'starttime', 'stoptime', 'start station id',
       'start station name', 'start station latitude',
       'start station longitude', 'end station id', 'end station name',
       'end station latitude', 'end station longitude', 'bikeid', 'usertype',
       'postal code', 'year', 'month', 'birth year', 'gender'],
      dtype='object')

Drop the column that does not exist in both datasets (`postal code` appears only in the 2020 data)

Merge the datasets into one
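
The two steps above can be sketched as follows. This is a minimal illustration on tiny made-up frames that stand in for `bike_df1` (2019) and `bike_df2` (2020); the names `df_2019`, `df_2020`, `extra_cols`, and `bike_df` are hypothetical, and row-wise concatenation with `pd.concat` is assumed to be what "merge" means here.

```python
import pandas as pd

# Tiny stand-ins for bike_df1 (2019) and bike_df2 (2020); values are made up.
df_2019 = pd.DataFrame({"tripduration": [300, 450], "year": [2019, 2019]})
df_2020 = pd.DataFrame({
    "tripduration": [600],
    "postal code": ["02139"],
    "year": [2020],
})

# Find columns present only in the 2020 frame and drop them.
extra_cols = df_2020.columns.difference(df_2019.columns)
df_2020 = df_2020.drop(columns=extra_cols)

# Stack the rows; ignore_index gives a clean 0..n-1 index.
bike_df = pd.concat([df_2019, df_2020], ignore_index=True)
```

`pd.concat` aligns columns by name, so the different column order in the two real datasets (`birth year` and `gender` come before `year` and `month` in 2019) does not matter once the column sets match.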

### Data Cleaning

### EDA-Exploratory Data Analysis

> Investigate the data in more detail and how it relates to the problem to be solved
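
As one example of the kind of time-based pattern this section looks for, the sketch below counts departures per station and hour of day. It runs on made-up rows; the timestamp format shown for `starttime` is an assumption about how the column is stored, and `departures_by_hour` is an illustrative name.

```python
import pandas as pd

# Made-up trips; the starttime string format is an assumption.
trips = pd.DataFrame({
    "starttime": [
        "2019-01-01 08:05:12.3450",
        "2019-01-01 08:40:00.0000",
        "2019-01-01 17:15:30.1000",
    ],
    "start station id": [1, 1, 2],
})

# Parse the timestamps and extract the hour of day.
trips["starttime"] = pd.to_datetime(trips["starttime"])
trips["hour"] = trips["starttime"].dt.hour

# Number of departures for each (station, hour) pair.
departures_by_hour = (
    trips.groupby(["start station id", "hour"]).size().rename("departures")
)
```

The same grouping on `end station id` and the hour of `stoptime` would give arrivals, and the two tables together are a natural input for a station-by-hour heat map.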