I am going to build on the work I did in the previous project: I will create a more complex wrangle function, use it to clean more data, and build a model that considers more features when predicting apartment price.

In [2]:
import warnings
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
warnings.filterwarnings("ignore")

In [3]:
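def wrangle(filepath):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(filepath)

    # Keep only properties in Capital Federal
    mask_place = df["place_with_parent_names"].str.contains("Capital Federal")

    # Keep only apartments
    mask_apt = df["property_type"] == "apartment"

    # Keep only properties priced under 400,000 USD
    mask_price = df["price_aprox_usd"] < 400_000

    # Trim area outliers: keep the 10th to 90th percentile of covered surface
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)

    df = df[mask_place & mask_apt & mask_price & mask_area]

    # Drop the combined "lat-lon" column; "lat" and "lon" already exist separately.
    # Reassigning avoids an in-place edit on a filtered slice.
    df = df.drop(columns=["lat-lon"])

    return df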

In [4]:
df = wrangle("properati-AR-2016-11-01-properties-2.csv")
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14114 entries, 1 to 145742
Data columns (total 23 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   created_on                  14114 non-null  object 
 1   operation                   14114 non-null  object 
 2   property_type               14114 non-null  object 
 3   place_name                  14114 non-null  object 
 4   place_with_parent_names     14114 non-null  object 
 5   geonames_id                 13633 non-null  float64
 6   lat                         13562 non-null  float64
 7   lon                         13562 non-null  float64
 8   price                       14114 non-null  float64
 9   currency                    14114 non-null  object 
 10  price_aprox_local_currency  14114 non-null  float64
 11  price_aprox_usd             14114 non-null  float64
 12  surface_total_in_m2         10015 non-null  float64
 13  surface_covered_in_m2       14

Unnamed: 0,created_on,operation,property_type,place_name,place_with_parent_names,geonames_id,lat,lon,price,currency,...,surface_covered_in_m2,price_usd_per_m2,price_per_m2,floor,rooms,expenses,properati_url,description,title,image_thumbnail
1,2012-10-10,sell,apartment,Villa Crespo,|Argentina|Capital Federal|Villa Crespo|,3427458.0,-34.603684,-58.381559,83000.0,USD,...,40.0,2075.0,2075.0,1.0,2.0,300.0,http://villa-crespo.properati.com.ar/13tz_vent...,"2 AMBIENTES, VENTA, VILLA CRESPO1ER PISO POR E...",DEPARTAMENTO EN VENTA,https://thumbs-cf.properati.com/8/ujkSk81S7fhu...
195,2013-05-20,sell,apartment,Villa Devoto,|Argentina|Capital Federal|Villa Devoto|,3427451.0,-34.598942,-58.500647,324720.0,USD,...,140.0,,2319.428571,8.0,4.0,,http://villa-devoto.properati.com.ar/73iz_vent...,Corredor Responsable: Patricia Maria Sodor - C...,ULTIMA UNIDAD DE 4 AMBIENTES!,https://thumbs-cf.properati.com/0/G63ECCkemzvp...
197,2013-05-20,sell,apartment,Chacarita,|Argentina|Capital Federal|Chacarita|,3435506.0,-34.585106,-58.462549,80000.0,USD,...,55.0,,1454.545455,9.0,2.0,,http://chacarita.properati.com.ar/79qp_venta_d...,Corredor Responsable: Jorge Salafia - CUCICBA ...,DIVINO MONOAMB EN TRIUNVIRATO PESOSSS!!!! FINA...,https://thumbs-cf.properati.com/3/eRJkRtFUzcGu...
199,2013-05-20,sell,apartment,Palermo,|Argentina|Capital Federal|Palermo|,3430234.0,-34.600627,-58.392334,150000.0,USD,...,36.0,,4166.666667,,2.0,,http://palermo.properati.com.ar/79ty_venta_dep...,Corredor Responsable: Jorge Salafia - CUCICBA ...,Dos ambientes y cochera - CON RENTA,https://thumbs-cf.properati.com/6/usFZh3CUU5RA...
306,2013-05-24,sell,apartment,Barracas,|Argentina|Capital Federal|Barracas|,3436134.0,-34.634679,-58.371091,80000.0,USD,...,60.0,1333.333333,1333.333333,,2.0,,http://barracas.properati.com.ar/7fmr_venta_de...,"Departamento en torre con living comedor,cocin...",Departamento,https://thumbs-cf.properati.com/7/rVJm13qMQ0eV...


For the model, I am going to consider apartment location, specifically latitude and longitude. Looking at the output from df.info(), we can see that these are already in their own columns, lat and lon, with data type float64, which is what the model needs (the combined lat-lon column was dropped in wrangle). Both columns have 13562 non-null values out of 14114 rows, so the missing entries will need handling before training.
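If a file provides location only as a combined string column, it has to be split before modelling. The following is a minimal sketch of one way to do that on a toy frame. The comma-separated format of the values and the name `toy` are assumptions for illustration, not something confirmed by the output above.

```python
import pandas as pd

# Toy frame; the comma-separated "lat,lon" string format is an assumption.
toy = pd.DataFrame(
    {"lat-lon": ["-34.603684,-58.381559", "-34.585106,-58.462549"]}
)

# Split each string on the comma into two columns, then cast both to float.
toy[["lat", "lon"]] = toy["lat-lon"].str.split(",", expand=True).astype(float)

# The combined column is no longer needed once the parts are extracted.
toy = toy.drop(columns=["lat-lon"])

print(toy.dtypes)
```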

### Split

Even though I am building a different model, the steps are the same. Let's separate the features (latitude and longitude) from the target (price).

In [6]:
feature = ["lat", "lon"]
target = "price_aprox_usd"
X = df[feature]
y = df[target]
print("X_train:", X.head())
print("y_train:", y.head())

X_train:            lat        lon
1   -34.603684 -58.381559
195 -34.598942 -58.500647
197 -34.585106 -58.462549
199 -34.600627 -58.392334
306 -34.634679 -58.371091
y_train: 1       83000.0
195    324720.0
197     80000.0
199    150000.0
306     80000.0
Name: price_aprox_usd, dtype: float64


In [14]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14114 entries, 1 to 145742
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   lat     13562 non-null  float64
 1   lon     13562 non-null  float64
dtypes: float64(2)
memory usage: 330.8 KB


In [8]:
y.shape

(14114,)

## Build Model
### Baseline

Again, I need to set a baseline so I can evaluate the model's performance. Note that the value of y_mean is not exactly the same as it was in the previous project, because there are now more observations in the training data.

In [9]:
y_mean = y.mean()
y_pred_baseline = [y_mean] * len(y)
mae_baseline = mean_absolute_error(y, y_pred_baseline)
print("mean apt price:", y_mean.round(2))
print("baseline mae:", mae_baseline.round(2))

mean apt price: 154234.66
baseline mae: 59231.71


### Iterate

In the last project, I simply dropped rows that contained NaN values, but this isn't ideal. Models generally perform better when they have more data to train with, so every row is precious. Instead, we can fill in these missing values using information we get from the whole column, a process called imputation. There are many different strategies for imputing missing values, and one of the most common is filling in the missing values with the mean of the column.

In addition to predictors like LinearRegression, scikit-learn also has transformers that help us deal with issues like missing values. Let's see how one works, and then we'll add it to our model.

In [19]:
imputer = SimpleImputer().fit(X)
xt = imputer.transform(X)

xt

array([[-34.6036844 , -58.3815591 ],
       [-34.598942  , -58.500647  ],
       [-34.585106  , -58.462549  ],
       ...,
       [-34.6224065 , -58.477773  ],
       [-34.60064886, -58.37024074],
       [-34.58544047, -58.41141038]])
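On the full dataset it is hard to see which values were filled in. A small sketch on a toy frame makes the mean strategy visible; the values and the names `toy` and `filled` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column (values are invented).
toy = pd.DataFrame(
    {"lat": [-34.0, np.nan, -36.0], "lon": [-58.0, -59.0, np.nan]}
)

# SimpleImputer defaults to strategy="mean".
imputer = SimpleImputer()
filled = imputer.fit_transform(toy)

# statistics_ holds the per-column means learned during fit.
print(imputer.statistics_)
# Each NaN is replaced by the mean of its own column.
print(filled)
```

Here the missing lat is replaced by the mean of -34.0 and -36.0, and the missing lon by the mean of -58.0 and -59.0.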

Create a pipeline named model that contains a SimpleImputer transformer followed by a LinearRegression predictor.

In [17]:
model = make_pipeline(
    SimpleImputer(),
    LinearRegression()
).fit(X, y)

In [18]:
model

### Evaluate

As always, we'll start by evaluating our model's performance on the training data.

In [20]:
y_pred_train = model.predict(X)
mae_train = mean_absolute_error(y, y_pred_train)
print("y train:", y_pred_train.round(2))
print("training mae:", mae_train.round(2))

y train: [153812.56 154617.47 154469.84 ... 154301.46 153761.01 154136.38]
training mae: 59216.41
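One way to read this number is to set it against the baseline MAE printed earlier. The sketch below copies both values from the outputs above; the names `improvement` and `pct` are introduced here for illustration only.

```python
# Values copied from the printed outputs above.
mae_baseline = 59231.71
mae_train = 59216.41

# Absolute and relative reduction in error compared with the baseline.
improvement = mae_baseline - mae_train
pct = improvement / mae_baseline * 100

print(f"improvement over baseline: {improvement:.2f} USD ({pct:.3f}%)")
```

The gap is roughly 15 USD, so a linear model on latitude and longitude alone barely beats predicting the mean price for every apartment.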


### Communicate Results

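Extract the intercept and coefficients for the model. Because the features are ordered ["lat", "lon"], coef_[0] below is the latitude coefficient only; the longitude coefficient is coef_[1].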

In [22]:
intercept = model.named_steps["linearregression"].intercept_
coefficient = model.named_steps["linearregression"].coef_[0]
print("intercept:", intercept.round(2))
print("coefficient:", coefficient.round(2))

intercept: 23488.49
coefficient: 7156.35
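With two features, the fitted model has the form price = intercept + (b_lat * lat) + (b_lon * lon). The following is a minimal sketch of pulling out both coefficients and printing that equation. It uses synthetic data, so `X_toy`, `y_toy`, `toy_model` and the relationship 1000 + 2 * lat + 3 * lon are all hypothetical, not results from the apartment data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data with a known, made-up relationship to the target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    {"lat": rng.uniform(-35, -34, 50), "lon": rng.uniform(-59, -58, 50)}
)
y_toy = 1000 + 2 * X_toy["lat"] + 3 * X_toy["lon"]

toy_model = make_pipeline(SimpleImputer(), LinearRegression()).fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name.
reg = toy_model.named_steps["linearregression"]

# Pair every coefficient with its feature name instead of taking only coef_[0].
coefs = dict(zip(X_toy.columns, reg.coef_))

print(
    f"price = {reg.intercept_:.2f} "
    f"+ ({coefs['lat']:.2f} * lat) + ({coefs['lon']:.2f} * lon)"
)
```

Applied to the real pipeline, the same pattern would report the longitude coefficient alongside the latitude one.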
