A first example of machine learning
==
In this notebook we'll apply a scikit-learn pipeline to a simple dataset (the Airbnb apartment listings for Berlin) and see what overfitting looks like.

In [None]:
import numpy as np

We can invoke system commands by prepending them with a `!`. Commands like `head`, `tail`, and `wc` are useful for quickly inspecting a text file, but most of them are not available on Windows.

In [None]:
!head listings.csv
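Where `head` is not available, a few lines of plain Python can do a similar job. This is only a sketch: the `head` helper and the temporary `demo.csv` file are made up for illustration and are not part of the dataset.

```python
from itertools import islice
import os
import tempfile

def head(path, n=10):
    """Return the first n lines of a text file (hypothetical helper)."""
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in islice(f, n)]

# Demo on a small temporary file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('id,price\n1,$35.00\n2,$80.00\n')
    first_lines = head(demo_path, n=2)

print(first_lines)
```

In the notebook itself, the same helper could be pointed at `listings.csv`.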

NumPy provides the function `loadtxt` to load simple CSV files.

In [None]:
#np.loadtxt('listings.csv', delimiter=',', usecols=(54, 59, 48, 49, 79 ), skiprows=1)

The call above is commented out because it does not work on this file: some fields contain newlines. Luckily, Python's `csv` module can still process it.
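To see what "newlines inside the fields" means in practice, here is a small sketch on a made-up CSV string (not the real listings file). The quoted `description` field spans two physical lines, yet the `csv` module still returns it as a single record.

```python
import csv
import io

# Made-up CSV text: the first record has a newline inside a quoted field
csv_text = (
    'id,description,price\n'
    '1,"Sunny flat\nnear the park","$1,200.00"\n'
    '2,Small room,$35.00\n'
)

demo_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Three physical lines follow the header, but there are only two records
print(len(csv_text.splitlines()) - 1, len(demo_rows))
print(repr(demo_rows[0]['description']))
```

A reader that splits the text on line breaks would instead see three rows with inconsistent column counts.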

The cell below loads some columns from the CSV into separate numpy arrays.

First, we create plain Python lists, then replace them with proper arrays (faster and smaller).

Don't worry: with Pandas this kind of operation becomes much easier.

In [None]:
from csv import DictReader

review_scores_rating = []
price = []
latitude = []
longitude = []
bathrooms = []

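# newline='' lets the csv module deal with newlines inside quoted fields;
# the with block closes the file when we are done
with open('listings.csv', newline='') as f:
    for row in DictReader(f):
        price.append(row['price'])
        review_scores_rating.append(row['review_scores_rating'])
        latitude.append(row['latitude'])
        longitude.append(row['longitude'])
        bathrooms.append(row['bathrooms'])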

latitude = np.array([float(l) for l in latitude])
longitude = np.array([float(l) for l in longitude])
price = np.array([float(l[1:].replace(',', '')) for l in price])

# We assume the rating is 0 if not specified
review_scores_rating = np.array([int(l) if l != '' else 0 for l in review_scores_rating])

# We assume there's 1 bathroom if not stated otherwise
bathrooms = np.array([float(l) if l != '' else 1 for l in bathrooms])
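For comparison, the same loading and cleaning steps might look like this with Pandas. This is a sketch on a two-row, made-up sample that reuses the column names from the cell above. The `_pd` variable names are chosen here only so that the arrays built above are not overwritten.

```python
import io
import pandas as pd

# Made-up sample standing in for listings.csv (same column names as above)
sample_csv = io.StringIO(
    'price,review_scores_rating,latitude,longitude,bathrooms\n'
    '"$1,200.00",95,52.53,13.40,2.0\n'
    '$35.00,,52.49,13.42,\n'
)

df = pd.read_csv(sample_csv)

# Strip the currency symbol and the thousands separator, then convert to float
price_pd = df['price'].str.replace('[$,]', '', regex=True).astype(float).to_numpy()

# Same defaults as the loop above: 0 for a missing rating, 1 for missing bathrooms
rating_pd = df['review_scores_rating'].fillna(0).astype(int).to_numpy()
bathrooms_pd = df['bathrooms'].fillna(1).to_numpy()

print(price_pd, rating_pd, bathrooms_pd)
```

With the real file, `pd.read_csv('listings.csv')` would replace the `io.StringIO` sample.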

It's very useful to have a look at the shape of the numpy arrays.

In [None]:
print(latitude.shape)
print(bathrooms.shape)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# change the figure size
from matplotlib.pyplot import figure
figure(num=None, figsize=(8, 6), dpi=80)

# reshape is needed to create a second dimension of size 1
X = price.T.reshape(-1, 1)
Y = review_scores_rating.T
model = LinearRegression()
model.fit(X, Y)
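# score() returns R^2; print it, since only the last expression of a cell is displayed
print(f'R^2 score: {model.score(X, Y):.5f}')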


plt.scatter(X, Y, marker='X')

x_plot = np.linspace(0, 9000, 200)
y_plot = model.predict(x_plot.reshape(-1, 1))
plt.plot(x_plot, y_plot, color='red')

plt.show()

It turns out that a few prices are far greater than the rest, which makes both the visualization and the model less effective. Let's ignore them by capping the price at 500.


In [None]:
too_high = np.argwhere(price > 500)
print(f'shape before: {price.shape}')
Ylow = np.delete(Y, too_high)
Xlow = np.delete(price, too_high).reshape(-1, 1)
print(f'shape after: {Xlow.shape}')
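The same filtering can also be written with a boolean mask instead of `np.argwhere` and `np.delete`. The sketch below uses made-up prices and ratings, and the `demo_` names are chosen only to avoid clobbering the notebook's variables.

```python
import numpy as np

# Made-up prices and ratings, just to illustrate the filtering
demo_price = np.array([40.0, 9000.0, 120.0, 501.0, 500.0])
demo_rating = np.array([90, 100, 80, 0, 95])

# Boolean array: True where the listing is kept (price at most 500)
keep = demo_price <= 500

demo_Xlow = demo_price[keep].reshape(-1, 1)
demo_Ylow = demo_rating[keep]

print(demo_Xlow.shape, demo_Ylow.shape)
```

Using one mask for both arrays keeps the prices and ratings aligned by construction.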

In [None]:
model = LinearRegression()
model.fit(Xlow, Ylow)
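# score() returns R^2; print it, since only the last expression of a cell is displayed
print(f'R^2 score: {model.score(Xlow, Ylow):.5f}')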


plt.scatter(Xlow, Ylow, marker='X')

x_plot = np.linspace(0, 500, 200)
y_plot = model.predict(x_plot.reshape(-1, 1))
plt.plot(x_plot, y_plot, color='red')

plt.show()

In scikit-learn you can chain processing steps with `make_pipeline`. Here we combine `PolynomialFeatures` with `LinearRegression`, so that the linear regression runs on the features generated by the first step: the original features raised to various powers and multiplied with each other, up to the given degree.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

X = np.vstack((latitude, longitude, bathrooms)).T
print(f'the shape of X is {X.shape}')
Y = review_scores_rating.T
print(f'the shape of Y is {Y.shape}')


for degree in range(1, 20):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, Y)
    score = model.score(X, Y)
    print(f'with degree {degree} the score was {score:.5f}')
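To get a feel for what `PolynomialFeatures` feeds into the regression, it can help to apply it to a single made-up point. The sketch below is illustrative only and also counts how many columns are generated for three input features, as in the cell above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up sample with two features, a = 2 and b = 3
point = np.array([[2.0, 3.0]])

# Degree 2 produces the columns 1, a, b, a^2, a*b, b^2
expanded = PolynomialFeatures(degree=2).fit_transform(point)
print(expanded)

# Number of generated columns for 3 input features at various degrees
column_counts = {}
for d in (1, 2, 5, 10):
    poly = PolynomialFeatures(degree=d)
    column_counts[d] = poly.fit_transform(np.zeros((1, 3))).shape[1]
print(column_counts)
```

The column count grows quickly with the degree, which gives the linear regression more and more freedom to fit the training data.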

The model reaches its best score at degree 11 (this may differ for other cities). That looks like the best result, but what we are really seeing is overfitting: the dataset we use to check the model is the same one we used to train it.

Let's try again, this time partitioning the data into training and test datasets.

In [None]:
train_X = X[:21000,:]
test_X = X[21000:,:]

train_Y = Y[:21000]
test_Y = Y[21000:]

for degree in range(1, 20):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(train_X, train_Y)
    score = model.score(test_X, test_Y)
    print(f'with degree {degree} the score was {score}')
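The split above simply takes the first 21000 rows for training. If the rows in the file happen to be ordered in some way, a shuffled split may be more representative. One option is scikit-learn's `train_test_split`, sketched here on synthetic data (a noisy straight line, assumed purely for illustration) with `demo_` names so the notebook's variables are left alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a noisy straight line
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 1, size=(200, 1))
demo_Y = 3 * demo_X[:, 0] + rng.normal(0, 0.3, size=200)

# Hold out 25% of the rows, shuffled with a fixed seed for reproducibility
X_tr, X_te, Y_tr, Y_te = train_test_split(
    demo_X, demo_Y, test_size=0.25, random_state=0
)

for degree in (1, 5, 15):
    demo_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    demo_model.fit(X_tr, Y_tr)
    train_score = demo_model.score(X_tr, Y_tr)
    test_score = demo_model.score(X_te, Y_te)
    print(f'degree {degree}: train {train_score:.3f}, test {test_score:.3f}')
```

Printing both scores side by side makes the gap between training and test performance visible, and that gap is the usual symptom of overfitting.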

In [None]:
# change the figure size
from matplotlib.pyplot import figure
figure(num=None, figsize=(8, 6), dpi=80)

# reshape is needed to create a second dimension of size 1
X = price.T.reshape(-1, 1)



model = make_pipeline(PolynomialFeatures(20), LinearRegression())
#model = LinearRegression()
model.fit(X, Y)
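# score() returns R^2; print it, since only the last expression of a cell is displayed
print(f'R^2 score: {model.score(X, Y):.5f}')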



plt.scatter(X, Y, marker='X')

x_plot = np.linspace(0, 9000, 200)
y_plot = model.predict(x_plot.reshape(-1, 1))
plt.plot(x_plot, y_plot, color='red')

plt.show()

Again, the few very high prices make both the visualization and the model pointless. Let's repeat the fit on the data capped at a price of 500.

In [None]:
figure(num=None, figsize=(8, 6), dpi=80)


model = make_pipeline(PolynomialFeatures(30), LinearRegression())
#model = LinearRegression()
model.fit(Xlow, Ylow)
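# score() returns R^2; print it, since only the last expression of a cell is displayed
print(f'R^2 score: {model.score(Xlow, Ylow):.5f}')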


plt.scatter(Xlow, Ylow, marker='X')

x_plot = np.linspace(0, 500, 200)
y_plot = model.predict(x_plot.reshape(-1, 1))
plt.plot(x_plot, y_plot, color='red')

plt.show()

Just for fun, let's draw a map of the listings, colored by review score.

In [None]:
figure(num=None, figsize=(9, 7), dpi=80)

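# longitude goes on the x axis and latitude on the y axis, so north is up;
# the colormap is passed by name because plt.cm.get_cmap was removed in recent matplotlib
plt.scatter(longitude, latitude, c=review_scores_rating, marker='.', cmap='inferno')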