# Lecture 9: August 25th, 2023

__Reminders:__
* All EDA outcome quizzes have been posted. Attempt the ones you're missing, and come to student hours if any issues come up! Anthony and I are here to help.

* "50 years of data science" token-earning assignment due tonight at midnight. As always, this is optional.

* I'm almost done writing the new homeworks for next week and they will be uploaded by tonight. They will be due Week 4 Friday at midnight instead of Wednesday.

__Coming up:__
* On Monday, we'll go through the instructions for the final project.
* The planning worksheet for the final project will be due during Week 5.

__Today:__

* We'll introduce Machine Learning (ML)
* We'll start by coding for linear regression
* Anthony will go through a worksheet on generating data for regression problems. Definitely go, if you are able to!

## Introduction to Machine Learning

Let's take another fieldtrip...to the iPad!

![](Teaching-43.jpg)

![](Teaching-44.jpg)

![](Teaching-45.jpg)

![](Teaching-46.jpg)

![](Teaching-47.jpg)

![](Teaching-48.jpg)

## Performing Linear Regression Using scikit-learn 

In [1]:
import pandas as pd
import altair as alt
import seaborn as sns

* Import the taxis data from Seaborn.

In [2]:
df = sns.load_dataset("taxis")

In [3]:
df.sample(5)

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
2771,2019-03-29 23:20:06,2019-03-29 23:21:46,1,0.2,3.0,0.0,0.0,4.3,yellow,cash,Astoria,Astoria,Queens,Queens
1421,2019-03-04 19:50:35,2019-03-04 20:00:08,5,2.68,10.0,2.14,0.0,16.44,yellow,credit card,Midtown East,East Harlem South,Manhattan,Manhattan
3490,2019-03-03 02:31:40,2019-03-03 02:41:53,1,2.44,10.0,0.0,0.0,13.8,yellow,cash,Lower East Side,Midtown South,Manhattan,Manhattan
5708,2019-03-04 09:42:07,2019-03-04 10:35:59,1,18.6,58.0,0.0,5.76,64.56,green,credit card,Bushwick South,Central Harlem,Brooklyn,Manhattan
1119,2019-03-03 14:15:20,2019-03-03 14:24:24,1,2.3,9.5,0.0,0.0,12.8,yellow,cash,World Trade Center,West Village,Manhattan,Manhattan


* Drop rows with missing values

In [4]:
df = df.dropna()

* Using Altair, make a scatter plot with “fare” on the y-axis and with “distance” on the x-axis.

In [5]:
alt.Chart(df).mark_circle().encode(
    x="distance",
    y="fare"
)

MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000).

See https://altair-viz.github.io/user_guide/large_datasets.html for information on how to plot large datasets, including how to install third-party data management tools and, in the right circumstance, disable the restriction

alt.Chart(...)

Here, we get a `MaxRowsError`: by default, Altair will only plot data with at most 5000 rows.

* Choose 5000 random rows to avoid the `max_rows` error.

Let's get a random selection of 5000 rows from `df`. I'm not going to worry about getting reproducible random rows yet; the point of this part is just to get a feel for what the data looks like.

In [6]:
alt.Chart(df.sample(5000)).mark_circle().encode(
    x="distance",
    y="fare"
)

Looking at the data, it seems to be roughly linear. It's not perfectly linear, but we should be able to approximate a line pretty well. The only weird thing is that horizontal line...let's see what's going on there by adding a tooltip.

James brought up a great point: some of the rides go a distance of zero miles...and are still charged. Let's remove these points from our data, because this seems very strange.

In [10]:
alt.Chart(df.sample(5000)).mark_circle().encode(
    x="distance",
    y="fare",
    tooltip=["dropoff_zone","pickup_zone","fare","distance"]
)

In [15]:
df2 = df.sample(5000,random_state=10)
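
The `random_state=10` argument above fixes the seed, so the same 5000 rows come back every time the cell is run. Here is a minimal sketch of that behavior on a made-up DataFrame (the name `toy` and its values are placeholders, not the taxis data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 6000 rows, more than Altair's default 5000-row limit
toy = pd.DataFrame({"distance": np.arange(6000) / 100})

# Same seed -> same rows, in the same order
sample_a = toy.sample(5000, random_state=10)
sample_b = toy.sample(5000, random_state=10)

print(len(sample_a))               # 5000
print(sample_a.equals(sample_b))   # True
```

Without `random_state`, each call to `sample` would generally return a different selection of rows.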

In [17]:
df2 = df2[df2["distance"] > 0]
alt.Chart(df2).mark_circle().encode(
    x="distance",
    y="fare",
    tooltip=["dropoff_zone","pickup_zone","fare","distance"]
)

The points on the horizontal line all involve rides going to or from an airport. This looks like some kind of fixed-price promotion where you can go to the airport (or get picked up from the airport) and go anywhere within a region for a fixed price.

* What would you estimate is the slope of the “line of best fit” for this data?

We have the points $(0.02,2.5)$ and $(5,16)$

In [18]:
#The slope 
(16-2.5)/(5-0.02)

2.710843373493976

If I had to approximate the line, I'd say the slope is about 2.71.

__There is a routine in scikit-learn that we will see many times! Starting now!__ 

1.) Import 
2.) Instantiate (create an instance of an object from an appropriate class)
3.) Fit 
4.) Predict
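
The four steps above can be sketched end to end on made-up data. Everything here (the DataFrame `toy`, the columns "x" and "y", and the name `model`) is a placeholder chosen for illustration; the points lie exactly on the line y = 2x + 1, so we know what answer to expect.

```python
# 1.) Import
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 2x + 1
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0, 4.0]})
toy["y"] = 2 * toy["x"] + 1

# 2.) Instantiate
model = LinearRegression()

# 3.) Fit (input as a one-column DataFrame, output as a Series)
model.fit(toy[["x"]], toy["y"])

# 4.) Predict for a new input value
new_inputs = pd.DataFrame({"x": [10.0]})
prediction = model.predict(new_inputs)

print(model.coef_[0], model.intercept_)  # approximately 2.0 and 1.0
print(prediction[0])                     # approximately 21.0
```

The double square brackets in the fit step matter; the reason is worked out in the taxi example that follows.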

* Find this slope using the `LinearRegression` class from scikit-learn.

In [21]:
#1.) import
from sklearn.linear_model import LinearRegression

Create a LinearRegression object and name it `reg` (for regression)

In [22]:
#2.) Instantiate
reg = LinearRegression()

In [23]:
type(reg)

sklearn.linear_model._base.LinearRegression

We see that `reg` is a linear regression object. This type is not from base Python; it belongs to scikit-learn.

Below, let's try to fit the data. We're going to get an error, and I can say that you will most likely run into this error many times on your own.

In [28]:
#3.) Fit
reg.fit(df2["distance"],df2["fare"])

ValueError: Expected 2D array, got 1D array instead:
array=[2.8  1.2  2.1  ... 2.68 1.6  1.47].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

What goes wrong here is that `reg.fit` expects a two-dimensional array for the input, but we passed the pandas Series `df2["distance"]`. We should think of pandas Series as one-dimensional objects.

In [29]:
df2["distance"].shape

(4972,)

Notice the blank after the comma when we call `shape`. This is letting us know that the pandas Series is one-dimensional.

Observe the difference with the following:

In [30]:
df2[["distance"]]

Unnamed: 0,distance
2871,2.80
898,1.20
845,2.10
1580,3.35
4002,10.70
...,...
1812,1.20
2191,13.11
4827,2.68
4326,1.60


In [33]:
df2[["distance"]].shape

(4972, 1)

The example above is treated as a DataFrame with just one column. This is what happens when I pass a list of column names inside the brackets, as in `df2[["distance"]]`.

One way to remember which argument needs two dimensions and which needs one is the use of capital letters. By convention the input is written as a capital "X", which means we need two dimensions, while the output is written as a lower-case "y", which means we need a single dimension.
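
To see the one-dimensional versus two-dimensional distinction side by side, here is a small sketch on a made-up DataFrame (`toy` is a placeholder; the three values are copied from the error message above). It also shows the `reshape(-1, 1)` route that the error message suggests, as an alternative to double brackets.

```python
import pandas as pd

toy = pd.DataFrame({"distance": [2.8, 1.2, 2.1]})

series_1d = toy["distance"]     # single brackets: pandas Series
frame_2d = toy[["distance"]]    # double brackets: one-column DataFrame

# The route the error message suggests: reshape the underlying NumPy array
array_2d = toy["distance"].to_numpy().reshape(-1, 1)

print(series_1d.shape)  # (3,)
print(frame_2d.shape)   # (3, 1)
print(array_2d.shape)   # (3, 1)
```

Either two-dimensional version should be accepted as the input to `fit`; the double-bracket version keeps the column name attached.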

In [34]:
reg.fit(df2[["distance"]],df2["fare"])

At this point, `reg` has done all of the hard work of finding a linear equation that approximates our data ("fare" as a linear function of "distance").

Recall: The original question was asking us to find the slope. Here's how we can get it:

The slope is stored as the `coef_` attribute.

In [35]:
reg.coef_

array([2.72848668])

Notice that this is a NumPy array. If I wanted to extract just the number, I could do this:

In [36]:
reg.coef_[0]

2.7284866819996245

We had estimated before that the slope would be about 2.71, so I think we did a pretty good job :)

* Find the intercept.

The intercept is stored as the `intercept_` attribute.

In [37]:
reg.intercept_

4.660714229453321

Putting these together, the equation of our line is given by:
$$
\text{fare} \approx 2.7284866819996245 \cdot (\text{distance}) + 4.660714229453321
$$

Good question from the chat: Why does `reg.intercept_` not give you an array?

Answer: It has to do with the form of the model. In our case, we trained on just one input, distance, so our model looks like what we wrote above. But we don't have to use distance by itself; we could also use distance, number of people, and the hour of the taxi ride. If we train on these three variables, we get three distinct coefficients, one per input, but still only one intercept. That is why the coefficients are returned in a NumPy array while the intercept is a single number.

$$
\text{fare} \approx c_0 \cdot (\text{distance}) + c_1 \cdot (\text{number of people}) + c_2 \cdot (\text{time}) + \text{intercept}
$$
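
Here is a rough sketch of the three-input case on made-up data. The DataFrame `toy`, the name `model`, and the fare rule are all invented for illustration (this is not the taxis data); the rule is chosen so that we know which coefficients to expect.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical inputs, loosely named after the taxi columns
toy = pd.DataFrame({
    "distance": rng.uniform(0, 10, size=50),
    "passengers": rng.integers(1, 6, size=50),
    "hour": rng.integers(0, 24, size=50),
})

# Made-up fare rule: three coefficients and one intercept
toy["fare"] = (
    3.0 * toy["distance"] + 0.5 * toy["passengers"] + 0.1 * toy["hour"] + 4.0
)

model = LinearRegression()
model.fit(toy[["distance", "passengers", "hour"]], toy["fare"])

print(model.coef_.shape)          # (3,): one coefficient per input column
print(np.ndim(model.intercept_))  # 0: the intercept is a single number
```

The coefficients come back in the same order as the columns passed to `fit`.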

* What are the predicted outputs for the first 5 rows? What are the actual outputs?

In [38]:
df2[:5]

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
2871,2019-03-12 20:28:02,2019-03-12 20:43:16,1,2.8,12.0,3.15,0.0,18.95,yellow,credit card,Upper East Side South,East Village,Manhattan,Manhattan
898,2019-03-24 13:17:38,2019-03-24 13:31:41,1,1.2,10.0,2.65,0.0,15.95,yellow,credit card,Murray Hill,Clinton East,Manhattan,Manhattan
845,2019-03-04 13:22:23,2019-03-04 13:38:07,1,2.1,11.5,2.96,0.0,17.76,yellow,credit card,Midtown East,Upper West Side South,Manhattan,Manhattan
1580,2019-03-21 23:31:03,2019-03-21 23:42:56,1,3.35,12.0,3.16,0.0,18.96,yellow,credit card,Kips Bay,Lincoln Square East,Manhattan,Manhattan
4002,2019-03-16 08:55:35,2019-03-16 09:37:31,3,10.7,39.0,9.1,5.76,54.66,yellow,credit card,Manhattan Valley,LaGuardia Airport,Manhattan,Queens


Notice that the first row has a distance of 2.8 and a fare of 12. For a distance of 2.8, the model predicts the following:

In [39]:
reg.coef_*2.8 + reg.intercept_

array([12.30047694])

In [41]:
reg.predict(df2[:5][["distance"]])

array([12.30047694,  7.93489825, 10.39053626, 13.80114461, 33.85552173])

`reg.fit` is still a little mysterious, but `reg.predict` is not: it just evaluates our linear function at the given distances.
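
To take a little of the mystery out of `reg.fit`: `LinearRegression` performs ordinary least squares, and when there is a single input the least-squares slope and intercept have a closed form. The sketch below uses made-up numbers (the arrays `x` and `y` and the names `slope`, `intercept`, and `model` are placeholders). It is not meant to show how scikit-learn computes the answer internally, only that the results agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical (distance, fare) pairs
x = np.array([0.5, 1.0, 2.0, 3.5, 5.0])
y = np.array([5.0, 7.5, 10.0, 14.0, 18.5])

# Closed-form least squares for one input:
# slope = sum((x - mean x)(y - mean y)) / sum((x - mean x)^2)
# intercept = mean y - slope * mean x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

model = LinearRegression().fit(x.reshape(-1, 1), y)

print(np.isclose(slope, model.coef_[0]))        # True
print(np.isclose(intercept, model.intercept_))  # True

# predict is just the line evaluated at the inputs
manual = slope * x + intercept
print(np.allclose(manual, model.predict(x.reshape(-1, 1))))  # True
```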

## Interpreting Linear Regression Coefficients

* Add a new column to the DataFrame, called “hour”, which contains the hour at which the pickup occurred.

In [49]:
df2.columns

Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
       'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
       'pickup_borough', 'dropoff_borough', 'hour'],
      dtype='object')

In [48]:
df2.dtypes

pickup             datetime64[ns]
dropoff            datetime64[ns]
passengers                  int64
distance                  float64
fare                      float64
tip                       float64
tolls                     float64
total                     float64
color                      object
payment                    object
pickup_zone                object
dropoff_zone               object
pickup_borough             object
dropoff_borough            object
hour                        int64
dtype: object

In [50]:
df2["hour"] = df2["pickup"].dt.hour
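
The `.dt` accessor only works on columns with a datetime dtype, which is why the `dtypes` check matters. Here is a minimal sketch on a tiny made-up DataFrame (`toy` is a placeholder; the two timestamps are copied from rows displayed in these notes):

```python
import pandas as pd

# Two pickup times, parsed from strings into datetime64 values
toy = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-03-23 20:21:09", "2019-03-10 01:23:59"])
})

# The .dt accessor pulls out a datetime part for every row at once
toy["hour"] = toy["pickup"].dt.hour

print(toy["hour"].tolist())  # [20, 1]
```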

In [52]:
df2.head()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough,hour
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan,20
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan,16
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan,17
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.7,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan,1
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.1,0.0,13.4,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan,13


* Remove all rows from the DataFrame where the hour is 16 or earlier. (So we are only using late afternoon and evening taxi rides.)

__That's all we got to today!__ We'll pick back up on Monday.

* Add a new column to the DataFrame, called “duration”, which contains the amount of time in minutes of the taxi ride.

Hint 1. Because the “dropoff” and “pickup” columns are already date-time values, we can subtract one from the other and pandas will know what to do.

Hint 2. I expected there to be a `minutes` attribute (after using the `dt` accessor) but there wasn’t. Call `dir` to see some options.
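
Here is one possible route, sketched on two rides copied from the table above. The choice of `total_seconds` is just one option from the `dir` output, and the names `toy`, `elapsed`, and `options` are placeholders; other approaches work too.

```python
import pandas as pd

toy = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-03-23 20:21:09", "2019-03-04 16:11:55"]),
    "dropoff": pd.to_datetime(["2019-03-23 20:27:24", "2019-03-04 16:19:00"]),
})

# Subtracting one datetime column from another gives a timedelta column
elapsed = toy["dropoff"] - toy["pickup"]

# Public names available after the dt accessor on a timedelta column
options = [name for name in dir(elapsed.dt) if not name.startswith("_")]

# Convert to minutes by way of total seconds
toy["duration"] = elapsed.dt.total_seconds() / 60

print(toy["duration"].round(2).tolist())  # [6.25, 7.08]
```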

* Fit a new `LinearRegression` object, this time using “distance”, “hour”, “passengers” as the input features, and using “duration” as the target value.
