### Importing Relevant Libraries

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from pandas import set_option
plt.rcParams['figure.figsize'] = (25,18)

### Reading Data

The file contains repeated rows, so I keep only the first 30 days of data.

In [None]:
nyc_data = pd.read_csv('../input/nyc-east-river-bicycle-counts.csv',index_col='Date', parse_dates = True)
nyc_data = nyc_data.head(30)
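
As a quick sanity check on the repetition, one could count duplicated rows before trimming. The sketch below uses a small made-up frame (`toy`, with invented counts) rather than the real file, and assumes the repeats are exact copies of earlier rows.

```python
import pandas as pd

# Hypothetical stand-in for the real file: three days of counts, repeated twice.
dates = pd.to_datetime(['2016-04-01', '2016-04-02', '2016-04-03'] * 2)
toy = pd.DataFrame({'Total': [100, 200, 300] * 2}, index=dates)
toy.index.name = 'Date'

# Move the index into a column so the date takes part in the comparison,
# then count rows that already appeared earlier in the frame.
flat = toy.reset_index()
n_repeats = flat.duplicated().sum()

# Keep only the first occurrence of each row.
deduped = flat.drop_duplicates().set_index('Date')

print(n_repeats, len(deduped))
```

If the repeats are exact, `drop_duplicates()` and `head(30)` should leave the same rows; `head(30)` is simply the shorter route taken here.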

In [None]:
nyc_data.head()

The data covers the month of April 2016, from 1st April to 30th April. For each day we have the high and low temperatures, the precipitation, and the number of bikes crossing each of the 4 bridges.

### Data Cleanup 

Dropping column 'Day' as 'Date' covers the same data. Also, cleaning up column names.

In [None]:
nyc_data = nyc_data.drop(['Day', 'Unnamed: 0'], axis=1)
nyc_data = nyc_data.rename(columns={"High Temp (°F)": "HighTemp", "Low Temp (°F)": "LowTemp", "Precipitation	":"Precipitation"})
nyc_data.head()

Looking at the column types, we see that Precipitation is stored as an object (string) column. To use it in our analysis, we need to convert it to a numeric type.

In [None]:
nyc_data.dtypes

In [None]:
prep = nyc_data['Precipitation'].replace(['0.47 (S)'], '0.47')
prep = prep.replace(['T'], '0')
prep = prep.astype(float)
nyc_data['Precipitation'] = prep
nyc_data.head(4)
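
The same cleanup can be written so that it does not depend on the exact value `'0.47 (S)'`. This is only a sketch on a made-up series (`raw`); it assumes that any `(S)` suffix can be dropped and that `'T'` should be treated as 0, as in the cell above.

```python
import pandas as pd

# Hypothetical sample of raw precipitation strings.
raw = pd.Series(['0.01', '0.47 (S)', 'T', '0'])

# Strip an optional " (S)" suffix from any value, then map 'T' to '0'.
cleaned = raw.str.replace(r'\s*\(S\)', '', regex=True).replace('T', '0')

# Convert to floats; an unexpected string would raise here instead of passing silently.
cleaned = pd.to_numeric(cleaned)

print(cleaned.tolist())
```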

In [None]:
nyc_data.dtypes

In [None]:
nyc_data.describe()

In [None]:
nyc_data.corr()
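
For reference, `corr()` computes the Pearson correlation coefficient by default. The sketch below recomputes it by hand for two columns of a made-up frame (`toy`, with invented values) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from the bicycle data.
toy = pd.DataFrame({'HighTemp': [50.0, 60.0, 70.0, 80.0],
                    'Total': [100.0, 180.0, 310.0, 390.0]})

x, y = toy['HighTemp'], toy['Total']
dx, dy = x - x.mean(), y - y.mean()

# Pearson r: covariance divided by the product of the standard deviations.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same number as reported by pandas.
r_pandas = toy.corr().loc['HighTemp', 'Total']

print(r_manual, r_pandas)
```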

### Data Visualization

Visualizing the correlations:

In [None]:
sns.heatmap(nyc_data.corr(),annot = True)
plt.show()

The heatmap shows that the number of rides is negatively correlated with precipitation and strongly positively correlated with HighTemp.
To check this, let's plot Total rides against Precipitation. The number of rides is highest on days when precipitation is 0.

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set(style="darkgrid")
sns.relplot( x="Total", y="Precipitation", data=nyc_data);

In [None]:
# 'size' was renamed to 'height' in seaborn 0.9
sns.pairplot(nyc_data, x_vars=['HighTemp', 'LowTemp', 'Precipitation'], y_vars='Total', kind='reg', height=6)
plt.show()

Let's see how the rides across the bridges are spread over the month.

In [None]:
nyc_data.resample('D').sum().plot()

As seen from the chart, the highest number of rides recorded in a single day during the month is around 22-23k, and the lowest is just below 5k.

### Regression

Let's go ahead and fit a linear model to the data. Before we do that, let us split the data into training and test sets.

In [None]:
x = nyc_data.drop(['Total', 'Brooklyn Bridge', 'Manhattan Bridge', 'Williamsburg Bridge', 'Queensboro Bridge'], axis=1)
y = nyc_data['Total']

In [None]:
xtrain,xtest,ytrain,ytest = train_test_split(x,y,random_state=1)
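
Since no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below uses placeholder arrays (`x_toy`, `y_toy`) with 30 rows to show how many rows end up on each side.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 30 rows, matching the 30 days kept above.
x_toy = np.arange(60).reshape(30, 2)
y_toy = np.arange(30)

# Same call as in the notebook: default test_size, fixed random_state.
x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, random_state=1)

print(x_tr.shape, x_te.shape)
```

With only a handful of test rows, the error measured on the test set can change noticeably with a different `random_state`.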

In [None]:
linear_model = LinearRegression()
output = linear_model.fit(xtrain,ytrain)
print(output.intercept_)
output.coef_
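
`coef_` is a bare array, so it helps to pair each coefficient with its column name. The sketch below does this on synthetic, noise-free data (`features` and `target` are made up) where the true coefficients are known, so the fit can be checked by eye.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features; the column names mirror the notebook but the values are random.
rng = np.random.default_rng(0)
features = pd.DataFrame({'HighTemp': rng.uniform(40, 80, 20),
                         'Precipitation': rng.uniform(0, 1, 20)})

# Known relationship: intercept 100, +2 per degree, -50 per unit of precipitation.
target = 100.0 + 2.0 * features['HighTemp'] - 50.0 * features['Precipitation']

model = LinearRegression().fit(features, target)

# Label each coefficient with the feature it belongs to.
coefs = pd.Series(model.coef_, index=features.columns)
print(model.intercept_)
print(coefs)
```

On the real data the same pattern would be `pd.Series(output.coef_, index=x.columns)`.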

In [None]:
y_pred = linear_model.predict(xtest)
np.sqrt(metrics.mean_squared_error(ytest,y_pred))
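
The value above is the root mean squared error (RMSE), which is expressed in the same units as the target, i.e. rides per day. As a minimal sketch on made-up numbers, it is the square root of the mean of the squared differences:

```python
import numpy as np

# Hypothetical actual and predicted values.
actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
squared_errors = (actual - predicted) ** 2

# RMSE: square root of the mean squared error.
rmse = np.sqrt(squared_errors.mean())
print(rmse)
```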

In [None]:
# Combine the test features and actual totals into one frame
df = pd.concat([xtest, ytest], axis=1)
df['Predicted Total'] = np.round(y_pred,2)
df['Error'] = df['Total'] - df['Predicted Total']
df

This completes the linear regression notebook.