## Apache Spark machine learning library (MLlib)

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

## Load sample data

In [5]:
# Use the Spark CSV datasource with options specifying:
# - First line of file is a header
# - Automatically infer the schema of the data
data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("data_geo_local.csv")
# On Databricks, load the bundled sample dataset instead:
#data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv")

data.cache() # Cache data for faster reuse

DataFrame[2014 rank: int, City: string, State: string, State Code: string, 2014 Population estimate: int, 2015 median sales price: double]

View the data in tabular form

In [None]:
display(data)  # Databricks notebook helper; use data.show() elsewhere

2014 rank,City,State,State Code,2014 Population estimate,2015 median sales price
101,Birmingham,Alabama,AL,212247.0,162.9
125,Huntsville,Alabama,AL,188226.0,157.7
122,Mobile,Alabama,AL,194675.0,122.5
114,Montgomery,Alabama,AL,200481.0,129.0
64,Anchorage[19],Alaska,AK,301010.0,
78,Chandler,Arizona,AZ,254276.0,
86,Gilbert[20],Arizona,AZ,239277.0,
88,Glendale,Arizona,AZ,237517.0,
38,Mesa,Arizona,AZ,464704.0,
148,Peoria,Arizona,AZ,166934.0,


## Prepare and visualize data

In this linear regression example, the label is the 2015 median sales price and the feature is the 2014 Population Estimate. That is, you use the feature (population) to predict the label (sales price).
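
To make the feature/label roles concrete, here is a minimal plain-Python sketch (not the Spark API) that fits a one-feature least-squares line to the four complete rows shown in the table above. The helper names `fit_line` and `predict` are hypothetical, and four rows are far too few for a meaningful model; the point is only to show which column plays which role.

```python
# Illustrative sketch only: ordinary least squares with a single feature,
# using the four Alabama rows from the sample output above.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error for y ~ slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Feature: 2014 population estimate. Label: 2015 median sales price.
population = [212247.0, 188226.0, 194675.0, 200481.0]
price = [162.9, 157.7, 122.5, 129.0]

slope, intercept = fit_line(population, price)

def predict(pop):
    """Predict the label (price) from the feature (population)."""
    return slope * pop + intercept

print(f"slope={slope:.6g}, intercept={intercept:.6g}")
print(f"predicted price for population 200000: {predict(200000.0):.1f}")
```

A least-squares line always passes through the point (mean feature, mean label), which is a quick sanity check on any fit of this kind.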

In [None]:
from pyspark.sql.functions import col

# Drop rows with missing values
data = data.dropna()
# Build expressions that rename every column, replacing spaces with _
exprs = [col(column).alias(column.replace(' ', '_')) for column in data.columns]
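
The two preparation steps can also be sketched without Spark. The plain-Python snippet below is only an illustration: it uses the standard `csv` module and a few rows copied from the table above rather than the DataFrame API. It drops rows whose price is missing and then replaces spaces in the column names with underscores, roughly mirroring `dropna()` and the `alias` expressions.

```python
import csv
import io

# A few rows copied from the sample output above; an empty last field means
# the median sales price is missing.
sample = """2014 rank,City,State,State Code,2014 Population estimate,2015 median sales price
101,Birmingham,Alabama,AL,212247.0,162.9
125,Huntsville,Alabama,AL,188226.0,157.7
64,Anchorage[19],Alaska,AK,301010.0,
78,Chandler,Arizona,AZ,254276.0,
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Analogue of data.dropna(): keep only rows where every field is non-empty.
complete = [row for row in rows if all(value != "" for value in row.values())]

# Analogue of the alias expressions: replace spaces in column names with "_".
renamed = [
    {name.replace(" ", "_"): value for name, value in row.items()}
    for row in complete
]

print([row["City"] for row in renamed])  # ['Birmingham', 'Huntsville']
print(list(renamed[0].keys()))
```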

To simplify the creation of features, register a UDF that wraps the feature column (`2014_Population_estimate`) in a one-element vector of type `VectorUDT`, then apply it to the column.

#### UDF = User-Defined Function

Spark ships with a large catalogue of SQL functions, but real-world logic (domain-specific parsing, proprietary scoring rules, complex geo math, etc.) often isn’t covered. A UDF lets you drop into *your* language (Python, Scala, Java, R) to express that logic.

Use a UDF when:  

• No built-in expression can do the job.  
• The logic is complex but not latency-critical.  

**Key takeaways**

• A UDF extends Spark’s function library with custom code.  
• Easy to write; potentially expensive to run (serialization + no optimization).  
• Prefer built-ins; fall back to vectorized (Pandas) UDFs or native Scala UDFs when custom logic is unavoidable.

In [None]:
from pyspark.ml.linalg import Vectors, VectorUDT

spark.udf.register("oneElementVec", lambda d: Vectors.dense([d]), returnType=VectorUDT())
tdata = data.select(*exprs).selectExpr("oneElementVec(2014_Population_estimate) as features", "2015_median_sales_price as label")

In [None]:
display(tdata)

features,label
"{""type"":""1"",""size"":null,""indices"":null,""values"":[""212247.0""]}",162.9
"{""type"":""1"",""size"":null,""indices"":null,""values"":[""188226.0""]}",157.7
"{""type"":""1"",""size"":null,""indices"":null,""values"":[""194675.0""]}",122.5
"{""type"":""1"",""size"":null,""indices"":null,""values"":[""200481.0""]}",129.0
"{""type"":""1"",""size"":null,""indices"":null,""values"":[""1537058.0""]}",206.1
"{""type"":""1"",""size"":null,""indices"":null,""values"":[""527972.0""]}",178.1
"{""type"":""1"",""size"":null,""indices"":null,""values"":[""197706.0""]}",131.8
"{""type"":""1"",""size"":null,""indices"":null,""values"":[""346997.0""]}",685.7
"{""type"":""1"",""size"":null,""indices"":null,""values"":[""3928864.0""]}",434.7
"{""type"":""1"",""size"":null,""indices"":null,""values"":[""319504.0""]}",281.0


## Perform linear regression

We fit two linear regression models with different regularization parameters to see how well each one predicts the *sales price* (label) from the *population* (feature).
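
As a rough intuition for what the regularization parameter does, the plain-Python sketch below fits a one-feature ridge (L2-penalized) regression in closed form. It is a textbook formulation under stated assumptions, not Spark's internal solver: the feature is z-scored first, the intercept is not penalized, and the helper name `ridge_slope` is hypothetical. The data are the four complete rows from the sample output; Spark's own objective and scaling differ in detail.

```python
from statistics import fmean, pstdev

def ridge_slope(xs, ys, reg_param):
    """Slope of y ~ w*z + b, where z is the z-scored feature.

    Minimizes (1/(2n)) * sum((y - b - w*z)^2) + (reg_param/2) * w^2,
    with the intercept b left unpenalized (assumed formulation).
    """
    n = len(xs)
    mean_x, std_x = fmean(xs), pstdev(xs)
    mean_y = fmean(ys)
    zs = [(x - mean_x) / std_x for x in xs]
    szy = sum(z * (y - mean_y) for z, y in zip(zs, ys))
    szz = sum(z * z for z in zs)  # equals n for a z-scored feature
    return szy / (szz + n * reg_param)

population = [212247.0, 188226.0, 194675.0, 200481.0]
price = [162.9, 157.7, 122.5, 129.0]

slope_a = ridge_slope(population, price, 0.0)    # no regularization
slope_b = ridge_slope(population, price, 100.0)  # heavy regularization

print(f"reg_param=0.0   -> slope per std. dev. of population: {slope_a:.4f}")
print(f"reg_param=100.0 -> slope per std. dev. of population: {slope_b:.4f}")
```

In this standardized setup the penalized slope is the unpenalized slope divided by `1 + reg_param`, so a value of 100 flattens the line almost completely and the predictions stay close to the mean price.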

### Build the model

In [None]:
# Import LinearRegression class
from pyspark.ml.regression import LinearRegression

# Define LinearRegression algorithm
lr = LinearRegression()

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
File <command-6475900955999849>, line 5
      2 from pyspark.ml.regression import LinearRegression
      4 # Define LinearRegression algorithm
----> 5 lr = LinearRegression()

File /databricks/python/lib/python3.11/site-packages/pyspark/__init__.py:120, in keyword_only.<locals>.wrapper(self, *args, **kwargs)
    118     raise TypeError("Method %s forces keyword arguments." % func.__na

In [None]:
# Fit 2 models on the prepared features/label DataFrame (tdata),
# using different regularization parameters
modelA = lr.fit(tdata, {lr.regParam: 0.0})
modelB = lr.fit(tdata, {lr.regParam: 100.0})