A quick walkthrough of day one of the five-day challenge, using Python. The goal is to get a feel for the data and to test an initial hypothesis with a scatter plot and a regression line.

In [1]:
import pandas as pd

bikes = pd.read_csv("../input/nyc-east-river-bicycle-crossings/nyc-east-river-bicycle-counts.csv")

bikes.head()

In [2]:
# 'Unnamed: 0' appears to be a saved index column, so drop it.
# Passing index_col=0 to pd.read_csv would avoid creating it, but that is not needed here.
bikes.drop('Unnamed: 0', axis=1, inplace=True)
bikes.head()
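The alternative mentioned in the comment above, reading the first column as the index at load time, could look like the sketch below. The inline CSV is a made-up stand-in for the real file (its column names and values are illustrative only); `index_col=0` is a standard `pd.read_csv` parameter.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real file has more columns and rows.
sample_csv = io.StringIO(
    ",Date,Total\n"
    "0,2016-04-01,1000\n"
    "1,2016-04-02,2000\n"
)

# index_col=0 uses the first (unnamed) column as the index,
# so no 'Unnamed: 0' column is created and no drop is needed.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```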

In [3]:
# How many rows contain at least one NaN value?
len(bikes[bikes.isnull().any(axis=1)])

In [4]:
# Call describe to get an overview - maybe see columns with data issues
bikes.describe()

Hypothesis: People are less likely to go cycling when it is cold outside. If so, the day's temperature and the total number of cyclists should be positively related.

I will choose the low temperature to start.

In [None]:
# Wish to see the relationship between low temp and number of cyclists
# import libraries
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline

In [None]:
# Separate the data
# Low temp is the 4th column and total bikers the 10th; iloc is zero-based, hence 3 and 9
# .values converts the selection to a NumPy array
data = bikes.iloc[:, [3, 9]].values

In [None]:
# Scale both columns to the range [0, 1] so they can be plotted on comparable axes
# Rider counts in the thousands versus temps in the tens makes comparison clumsy
scaler = MinMaxScaler()
data = scaler.fit_transform(data)
data  # Each column now ranges from 0 to 1
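To make the scaling step concrete: with its default settings, `MinMaxScaler` maps each column through `(x - min) / (max - min)`. The sketch below reproduces that with plain NumPy on made-up numbers (not the bike data) and checks that the correlation between the two columns is unchanged, so scaling tidies the axes without altering the strength of the relationship.

```python
import numpy as np

# Made-up values: column 0 mimics temperatures, column 1 mimics daily counts.
toy = np.array([[30.0, 5000.0],
                [45.0, 12000.0],
                [60.0, 20000.0],
                [50.0, 9000.0]])

# Min-max scaling by hand, column by column: (x - min) / (max - min).
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
scaled = (toy - col_min) / (col_max - col_min)

# Pearson correlation before and after scaling.
r_raw = np.corrcoef(toy[:, 0], toy[:, 1])[0, 1]
r_scaled = np.corrcoef(scaled[:, 0], scaled[:, 1])[0, 1]
print(np.isclose(r_raw, r_scaled))
```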

In [None]:
X = data[:, 0] # Low temps. Feature
y = data[:, 1] # Total bikers. Label
plt.scatter(X, y) # Produce scatter plot
plt.xlim((-.1, 1.1)) # Set x and y limits on graph
plt.ylim((-.1, 1.1))

In [None]:
# Use numpy polyfit to plot a quick linear regression line
import numpy as np

a, b = np.polyfit(X, y, deg=1)  # deg=1 for linear; a is the slope, b the intercept
f = lambda x: a*x + b
plt.scatter(X, y)
plt.plot(X, f(X), lw=2.5, c="orange")
plt.xlim((-.1, 1.1))
plt.ylim((-.1, 1.1))
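One way to put a number on how well such a line fits is the coefficient of determination, R². The sketch below computes it from the residuals of a `np.polyfit` line. The points are made up for illustration (`X_demo` and `y_demo` are hypothetical names standing in for `X` and `y`), so the printed value says nothing about the bike data.

```python
import numpy as np

# Made-up scaled points, standing in for X and y.
X_demo = np.array([0.0, 0.2, 0.5, 0.7, 1.0])
y_demo = np.array([0.1, 0.6, 0.3, 0.9, 0.7])

# Fit the line and compute the residuals (observed minus predicted).
a, b = np.polyfit(X_demo, y_demo, deg=1)
residuals = y_demo - (a * X_demo + b)

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```

A value near 1 means the line explains most of the variation in the counts; a value near 0 means it explains very little.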

The linear fit leaves a lot of error. Other variables not considered here likely play a big role, such as precipitation and the type of precipitation. A quick dtypes call shows that precipitation is stored as an object, and one entry in the head call above shows an (S) on a day with a low of 33.1, suggesting snow. A feature engineering exercise may yield useful information from the remaining data.
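As a rough sketch of that feature engineering step, the object-typed precipitation values could be split into a numeric amount and a snow flag. The sample strings are made up, the reading of "(S)" as snow is the assumption stated above, and `precip_amount` and `is_snow` are hypothetical column names.

```python
import pandas as pd

# Made-up sample in the style described above: some entries carry "(S)".
precip = pd.Series(["0.00", "0.15", "0.47 (S)", "0.09"])

# Flag entries marked "(S)" (assumed here to mean snow).
is_snow = precip.str.contains("(S)", regex=False)

# Strip the marker and convert the rest to numbers;
# anything that still fails to parse becomes NaN instead of raising.
amount = pd.to_numeric(
    precip.str.replace("(S)", "", regex=False).str.strip(),
    errors="coerce",
)

features = pd.DataFrame({"precip_amount": amount, "is_snow": is_snow})
print(features)
```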

Other possible predictors of the number of bikers per day include the day of the week (weekend numbers are likely larger) and the high temperature.
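A day-of-week feature could be derived along these lines. This is only a sketch on made-up rows: `date` and `total` are hypothetical column names, and the counts are invented, so the group means do not reflect the real data.

```python
import pandas as pd

# Made-up dates and counts; "date" and "total" are hypothetical column names.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-04-01", "2016-04-02", "2016-04-03", "2016-04-04"]),
    "total": [1000, 2000, 1500, 1200],
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
df["day_of_week"] = df["date"].dt.dayofweek
df["is_weekend"] = df["day_of_week"] >= 5

# Compare mean counts for weekdays and weekends.
print(df.groupby("is_weekend")["total"].mean())
```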

In [None]:
bikes.dtypes