# Exercise 01

Build a Linear Regression model for the `california_housing` dataset

 - **PART 1**
    - **Question 1** - What are the $R^2$ metrics for train and test?
    - **Question 2** - Imagine that me and my wife, we want to sell our house in *528-426 W Scott Ave
Clovis, CA 93612* but we have no idea about the price. Our house is 30 years old, with 6 rooms and 3 bedrooms. In our geographic block group we are 300 people. Our income is 60K.

    - *NOTE: Don't use Latitude and Longitude for this part*

 - **PART 2** - Repeat the process, but now include new three variables called `distance2SF`, `distance2SJ` and `distance2SD` containing the distance from each area to San Francisco, San Jose and San Diego, respectively, in Km.

    - **Question 3** - What is the recomended for sale price of my house now? 
    - *NOTE: You can use the `geopy` library to calculate distances between locations. https://geopy.readthedocs.io/en/stable/#module-geopy.distance*

**Don't forget...**
 - Split data into train and test in order to evaluate the model with unseen data
 - Normalize data to avoid scaling issues when fitting the model (specially for the LR model)
 - Train your model and apply it to the test data.
 - Evaluate the model with the `score` function, for both train and test datasets.

In [1]:
from sklearn import datasets

In [2]:
california_housing = datasets.fetch_california_housing()

In [3]:
print(california_housing["DESCR"])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [4]:
#Building a Linear Regression model for the california_housing dataset
import pandas as pd
#PART 1
X = pd.DataFrame(california_housing["data"][:,0:6], columns=california_housing["feature_names"][0:6])
y = pd.Series(california_housing["target"], name=california_housing["target_names"][0])

In [5]:
X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467


In [6]:
#MedInc is in 10,000s
X.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333


In [7]:
y.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal, dtype: float64

In [8]:
#y is in 100,000s
y.describe()

count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42)

In [11]:
print(f"X_train size: {X_train.shape}")
print(f"y_train size: {y_train.shape}")
print(f"X_test size: {X_test.shape}")
print(f"y_test size: {y_test.shape}")

X_train size: (16512, 6)
y_train size: (16512,)
X_test size: (4128, 6)
y_test size: (4128,)


In [12]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [13]:
scaler = MinMaxScaler()

In [14]:
X_train_scaled = scaler.fit_transform(X_train)
X_train.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
count,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0
mean,3.880754,28.608285,5.435235,1.096685,1426.453004,3.096961
std,1.904294,12.602499,2.387375,0.433215,1137.05638,11.578744
min,0.4999,1.0,0.888889,0.333333,3.0,0.692308
25%,2.5667,18.0,4.452055,1.006508,789.0,2.428799
50%,3.5458,29.0,5.235874,1.049286,1167.0,2.81724
75%,4.773175,37.0,6.061037,1.100348,1726.0,3.28
max,15.0001,52.0,141.909091,25.636364,35682.0,1243.333333


In [15]:
# the result is returned as a numpy array, so let's convert it again to a Pandas Df
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_train_scaled.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
count,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0
mean,0.233159,0.541339,0.032239,0.030168,0.039896,0.001935
std,0.131329,0.247108,0.016929,0.017121,0.031869,0.009318
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.142536,0.333333,0.025267,0.026604,0.02203,0.001397
50%,0.210059,0.54902,0.030825,0.028295,0.032624,0.00171
75%,0.294705,0.705882,0.036677,0.030313,0.048292,0.002082
max,1.0,1.0,1.0,1.0,1.0,1.0


In [16]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model

In [17]:
model.fit(X_train_scaled, y_train)

In [18]:
model.coef_

array([  7.91944053,   0.85618336, -31.57723574,  28.22534335,
         0.82658243,  -5.73880374])

In [19]:
model.intercept_

-0.09339589714919061

### Question 1 - What are the  R<sup>2</sup>  metrics for train and test?

In [20]:
model.score(X_train_scaled, y_train)

0.5459161602818385

In [21]:
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
X_test_scaled.describe()
#Standard scaling transform on test dataset is making the values negative. Will repeat with MinMaxScaler
#Successfully completed with MinMaxScaler

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
count,4128.0,4128.0,4128.0,4128.0,4128.0,4128.0
mean,0.229682,0.544398,0.032018,0.030166,0.039759,0.001829
std,0.129758,0.245457,0.019818,0.024116,0.031223,0.001411
min,0.0,0.0,-0.000303,0.006587,0.00014,0.00046
25%,0.141281,0.333333,0.024874,0.026519,0.021806,0.001403
50%,0.206901,0.54902,0.03052,0.028194,0.032428,0.001715
75%,0.285929,0.705882,0.036307,0.030138,0.04818,0.00209
max,1.0,1.0,0.933515,1.333174,0.451778,0.066374


In [22]:
model.score(X_test_scaled, y_test)

0.5099337366296424

### Question 2
Imagine that me and my wife, we want to sell our house in 528-426 W Scott Ave Clovis, CA 93612 but we have no idea about the price. Our house is 30 years old, with 6 rooms and 3 bedrooms. In our geographic block group we are 300 people. Our income is 60K.

In [23]:
my_house = pd.DataFrame(
    [[6, 30, 6, 3, 300, 2]],
    #60k as 6 because max MedInc is 15 so it seems to be in 10000s
    #Average house occupancy is considered to be 2: "Me and my wife"
    columns=["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population","AveOccup"])
my_house

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
0,6,30,6,3,300,2


In [24]:
my_house_scaled = scaler.transform(my_house)
my_house_scaled = pd.DataFrame(my_house_scaled, columns=my_house.columns)
my_house_est_price = model.predict(my_house_scaled)
print(f"Estimated price for my house is: {my_house_est_price[0]} ")

Estimated price for my house is: 5.22840104899342 


## PART 2
Repeat the process, but now include new three variables called distance2SF, distance2SJ and distance2SD containing the distance from each area to San Francisco, San Jose and San Diego, respectively, in Km.


In [25]:
from geopy.distance import geodesic #geopy distance uses geodesic to calculate the distance

In [26]:
#San Francisco
SF = (37.7749, 122.4194)

#San Jose
SJ = (37.3382, 121.8863)

#San Diego
SD = (32.7157, 117.1611)

In [27]:
X1 = pd.DataFrame(california_housing["data"], columns=california_housing["feature_names"])

In [28]:
distances = []
for i in range(len(X1)):
    distances.append(geodesic((X1["Latitude"][i], X1["Longitude"][i]), SF).km)

In [29]:
X1["distance2SF"] = distances

In [30]:
distances = []
for i in range(len(X1)):
    distances.append(geodesic((X1["Latitude"][i], X1["Longitude"][i]), SJ).km)

In [31]:
X1["distance2SJ"] = distances

In [32]:
distances = []
for i in range(len(X1)):
    distances.append(geodesic((X1["Latitude"][i], X1["Longitude"][i]), SD).km)

In [33]:
X1["distance2SD"] = distances

In [34]:
X1 = X1.drop(['Latitude','Longitude'], axis=1)

In [35]:
X1_train, X1_test, y1_train, y1_test = train_test_split(
    X1, y,
    test_size=0.2,
    random_state=62)

In [36]:
X1_train.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,distance2SF,distance2SJ,distance2SD
count,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0
mean,3.873649,28.711301,5.428209,1.096125,1431.211059,3.096228,9677.499901,9745.058802,10413.120973
std,1.88893,12.605095,2.317701,0.458069,1141.317282,11.581581,290.080287,290.072214,289.803456
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,8902.710472,8970.296044,9639.595263
25%,2.5664,18.0,4.43958,1.006179,793.0,2.430995,9365.289284,9432.85175,10100.935019
50%,3.5382,29.0,5.23669,1.048829,1171.0,2.816122,9852.557559,9920.107131,10587.831394
75%,4.7569,37.0,6.058824,1.099415,1731.0,3.282376,9911.648848,9979.20602,10647.138604
max,15.0001,52.0,132.533333,34.066667,35682.0,1243.333333,10222.362356,10289.946975,10958.823585


In [37]:
X1_train_scaled = scaler.fit_transform(X1_train)
X1_train_scaled = pd.DataFrame(X1_train_scaled, columns=X1_train.columns)
X1_train_scaled.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,distance2SF,distance2SJ,distance2SD
count,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0
mean,0.232669,0.543359,0.034795,0.022612,0.040029,0.001935,0.587117,0.587097,0.586347
std,0.130269,0.247159,0.0176,0.013579,0.031988,0.00932,0.219816,0.21981,0.219676
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.142515,0.333333,0.027288,0.019946,0.022142,0.001399,0.350531,0.350514,0.349704
50%,0.209535,0.54902,0.033341,0.02121,0.032736,0.001709,0.719771,0.719744,0.718781
75%,0.293582,0.705882,0.039584,0.02271,0.048432,0.002084,0.764549,0.764528,0.763737
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [38]:
model2 = LinearRegression(fit_intercept=True)
model2

In [39]:
model2.fit(X1_train_scaled, y1_train)

In [40]:
model2.coef_

array([ 6.45031659e+00,  5.03977330e-01, -1.53521437e+01,  2.05189607e+01,
        6.85147204e-02, -4.76500289e+00,  2.22058861e+04, -2.16753601e+04,
       -5.31283826e+02])

In [41]:
model2.intercept_

-0.015586229867774026

In [42]:
model2.score(X1_train_scaled, y1_train)

0.6174991982974809

In [43]:
X1_test_scaled = scaler.transform(X1_test)
X1_test_scaled = pd.DataFrame(X1_test_scaled, columns=X1_test.columns)
X1_test_scaled.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,distance2SF,distance2SJ,distance2SD
count,4128.0,4128.0,4128.0,4128.0,4128.0,4128.0,4128.0,4128.0,4128.0
mean,0.231642,0.536318,0.034825,0.022694,0.039226,0.001832,0.594053,0.594033,0.593283
std,0.133996,0.245186,0.022936,0.01579,0.030723,0.001349,0.219688,0.219681,0.21955
min,0.0,0.0,0.005918,0.001235,5.6e-05,0.000421,0.037576,0.037576,0.037499
25%,0.141667,0.333333,0.027339,0.019927,0.02149,0.001392,0.356665,0.356645,0.355993
50%,0.208418,0.54902,0.03299,0.021189,0.03219,0.001721,0.724657,0.724632,0.72374
75%,0.289272,0.705882,0.039315,0.022736,0.04729,0.002078,0.765832,0.765801,0.764944
max,1.0,1.0,1.071197,0.75009,0.434541,0.050745,0.998752,0.998751,0.998714


In [44]:
model2.score(X1_test_scaled, y1_test)

0.6152256818876662

In [45]:
my_house_coords = (36.8252, -119.7029)
dist2SF = geodesic(my_house_coords, SF).km
dist2SJ = geodesic(my_house_coords, SJ).km
dist2SD = geodesic(my_house_coords, SD).km

my_house1 = pd.DataFrame(
    [[6, 30, 6, 3, 300, 2, dist2SF, dist2SJ, dist2SD]],
    columns=["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup","distance2SF","distance2SJ","distance2SD"])
my_house1

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,distance2SF,distance2SJ,distance2SD
0,6,30,6,3,300,2,9574.487683,9642.067596,10310.715499


In [46]:
my_house1_scaled = scaler.transform(my_house1)
my_house1_scaled = pd.DataFrame(my_house1_scaled, columns=my_house1.columns)
model2.predict(my_house1_scaled)


array([3.61137759])