<a href="https://colab.research.google.com/github/vnlvih/Estudos-PySpark/blob/main/03_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pyspark


In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

In [None]:
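# Import only the types the schema below needs. A wildcard import of
# pyspark.sql.functions would shadow Python built-ins such as sum and round.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType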

## **Data Preparation**

In [None]:
schema = StructType([
                     StructField("mon", IntegerType()),
                     StructField("dom", IntegerType()),
                     StructField("dow", IntegerType()),
                     StructField("carrier", StringType()),
                     StructField("flight", StringType()),
                     StructField("org", StringType()),
                     StructField("mile", DoubleType()),
                     StructField("depart", DoubleType()),
                     StructField("duration", DoubleType()),
                     StructField("delay", DoubleType()),
])

flights = spark.read.csv("/content/drive/MyDrive/Data Science & Afins/DATACAMP/03. Big Data with PySpark/05. Machine Learning with PySpark/00. DataSets/flights.csv", header=True, schema=schema)
flights.show(5)

+---+---+---+-------+------+---+------+------+--------+-----+
|mon|dom|dow|carrier|flight|org|  mile|depart|duration|delay|
+---+---+---+-------+------+---+------+------+--------+-----+
| 11| 20|  6|     US|    19|JFK|2153.0|  9.48|   351.0| null|
|  0| 22|  2|     UA|  1107|ORD| 316.0| 16.33|    82.0| 30.0|
|  2| 20|  4|     UA|   226|SFO| 337.0|  6.17|    82.0| -8.0|
|  9| 13|  1|     AA|   419|ORD|1236.0| 10.33|   195.0| -5.0|
|  4|  2|  5|     AA|   325|ORD| 258.0|  8.92|    65.0| null|
+---+---+---+-------+------+---+------+------+--------+-----+
only showing top 5 rows



## **Dropping columns**

In [None]:
# Either drop the columns you don't want...
flights = flights.drop("flight")
flights.show(5)

# ... or select the columns you want to retain:
# flights_2 = flights.select('mon', 'dom', 'dow', 'carrier', 'org',
#                            'mile', 'depart', 'duration', 'delay')
# flights_2.show(5)

+---+---+---+-------+---+------+------+--------+-----+
|mon|dom|dow|carrier|org|  mile|depart|duration|delay|
+---+---+---+-------+---+------+------+--------+-----+
| 11| 20|  6|     US|JFK|2153.0|  9.48|   351.0| null|
|  0| 22|  2|     UA|ORD| 316.0| 16.33|    82.0| 30.0|
|  2| 20|  4|     UA|SFO| 337.0|  6.17|    82.0| -8.0|
|  9| 13|  1|     AA|ORD|1236.0| 10.33|   195.0| -5.0|
|  4|  2|  5|     AA|ORD| 258.0|  8.92|    65.0| null|
+---+---+---+-------+---+------+------+--------+-----+
only showing top 5 rows

## Filtering out missing data

In [None]:
#How many missing values?

flights.filter("delay IS NULL").count()

2978

In [None]:
# Drop records with missing values in the delay column
flights = flights.filter("delay IS NOT NULL")

# Drop records with missing values in any column
flights = flights.dropna()

## **Mutating columns** 

- Derive a new km column from the mile column, rounding to zero decimal places. One mile is 1.60934 km.
- Remove the mile column.
- Create a label column with a value of 1 indicating the delay was 15 minutes or more and 0 otherwise.

In [None]:
# Import the required function
from pyspark.sql.functions import round

# Convert 'mile' to 'km' and drop 'mile' column
flights = flights.withColumn('km', round(flights.mile * 1.60934, 0)) \
                    .drop('mile')

# Create 'label' column indicating whether flight delayed (1) or not (0)
#flights = flights.withColumn('label', (flights.delay >= 15).cast('integer'))

# Check first five records
#flights.show(5)
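
As a quick sanity check on the conversion, here is a minimal plain-Python sketch of the same arithmetic, applied to the `mile` values shown in the first table. The helper name `miles_to_km` is hypothetical and not part of the notebook. Python's built-in `round` is used here, and it agrees with Spark's `round` for these values because none of them falls exactly on a .5 boundary.

```python
KM_PER_MILE = 1.60934  # conversion factor given in the text above


def miles_to_km(miles):
    """Convert miles to km, rounded to zero decimal places (hypothetical helper)."""
    return float(round(miles * KM_PER_MILE, 0))


# mile values taken from the flights.show(5) output earlier in the notebook
for miles in [316.0, 337.0, 1236.0]:
    print(miles, "->", miles_to_km(miles))
```

The results (509.0, 542.0 and 1989.0) match the `km` column that appears in the later outputs.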

## Categorical columns

In the flights data there are two columns, carrier and org, which hold categorical data. You need to transform those columns into indexed numerical values.

In [None]:
from pyspark.ml.feature import StringIndexer

# Create an indexer
#indexer = StringIndexer(inputCol="carrier", outputCol='carrier_idx')

# Indexer identifies categories in the data
#indexer_model = indexer.fit(flights)

# Indexer creates a new column with numeric index values
#flights = indexer_model.transform(flights)

# Repeat the process for the other categorical feature
flights = StringIndexer(inputCol="org", outputCol='org_idx').fit(flights).transform(flights)
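
To make the indexing step less of a black box, the sketch below mimics in plain Python what a string indexer does under its default ordering: the most frequent label gets index 0.0, the next most frequent gets 1.0, and so on. The function name `fit_string_index` and the sample list are made up for illustration, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to a float index, most frequent label first (hypothetical helper)."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up sample, not the real flights data
sample = ["ORD", "SFO", "ORD", "JFK", "ORD", "SFO"]
mapping = fit_string_index(sample)
print(mapping)
print([mapping[label] for label in sample])
```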

## One-hot encoding


### Encoding flight origin
The org column in the flights data is a categorical variable giving the airport from which a flight departs.

- ORD — O'Hare International Airport (Chicago)
- SFO — San Francisco International Airport
- JFK — John F Kennedy International Airport (New York)
- LGA — La Guardia Airport (New York)
- SMF — Sacramento
- SJC — San Jose
- TUS — Tucson International Airport
- OGG — Kahului (Hawaii)

This is only a small subset of airports. Nevertheless, since org is a categorical variable, it needs to be one-hot encoded before it can be used in a regression model.

The data are in a variable called flights. You have already used a string indexer to create a column of indexed values corresponding to the strings in org.



In [None]:
# Import the one hot encoder class
from pyspark.ml.feature import OneHotEncoder

# Create an instance of the one hot encoder
onehot = OneHotEncoder(inputCols=["org_idx"], outputCols=["org_dummy"])

# Apply the one hot encoder to the flights data
onehot = onehot.fit(flights)
flights = onehot.transform(flights)

# Check the results
#flights.select('org', 'org_idx', 'org_dummy').distinct().sort('org_idx').show()
flights.show()

+---+---+---+-------+---+------+--------+-----+------+-------+-------------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|org_idx|    org_dummy|
+---+---+---+-------+---+------+--------+-----+------+-------+-------------+
|  0| 22|  2|     UA|ORD| 16.33|    82.0| 30.0| 509.0|    0.0|(7,[0],[1.0])|
|  2| 20|  4|     UA|SFO|  6.17|    82.0| -8.0| 542.0|    1.0|(7,[1],[1.0])|
|  9| 13|  1|     AA|ORD| 10.33|   195.0| -5.0|1989.0|    0.0|(7,[0],[1.0])|
|  5|  2|  1|     UA|SFO|  7.98|   102.0|  2.0| 885.0|    1.0|(7,[1],[1.0])|
|  7|  2|  6|     AA|ORD| 10.83|   135.0| 54.0|1180.0|    0.0|(7,[0],[1.0])|
|  1| 16|  6|     UA|ORD|   8.0|   232.0| -7.0|2317.0|    0.0|(7,[0],[1.0])|
|  1| 22|  5|     UA|SJC|  7.98|   250.0|-13.0|2943.0|    5.0|(7,[5],[1.0])|
| 11|  8|  1|     OO|SFO|  7.77|    60.0| 88.0| 254.0|    1.0|(7,[1],[1.0])|
|  4| 26|  1|     AA|SFO| 13.25|   210.0|-10.0|2356.0|    1.0|(7,[1],[1.0])|
|  4| 25|  0|     AA|ORD| 13.75|   160.0| 31.0|1574.0|    0.0|(7,[0],[1.0])|
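
The `org_dummy` values above use Spark's sparse vector notation: `(7,[0],[1.0])` reads as "a vector of length 7 whose entry at index 0 is 1.0". The following plain-Python sketch reproduces that notation for eight levels. The helper `one_hot_sparse` is hypothetical, and it assumes the encoder's default behaviour of dropping the last level, which is then represented by an all-zero vector.

```python
def one_hot_sparse(index, n_levels):
    """Return (size, indices, values) for one category index (hypothetical helper).

    The last level is dropped, so it is encoded as a vector with no active entry.
    """
    idx = int(index)
    if not 0 <= idx < n_levels:
        raise ValueError("index out of range")
    size = n_levels - 1
    if idx == size:
        # Reference level: all zeros
        return (size, [], [])
    return (size, [idx], [1.0])


print(one_hot_sparse(0.0, 8))  # ORD has org_idx 0.0 in the table above
print(one_hot_sparse(5.0, 8))  # SJC has org_idx 5.0
print(one_hot_sparse(7.0, 8))  # last level, encoded as all zeros
```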

## Assembling columns
The final stage of data preparation is to consolidate all of the predictor columns into a single column.

In this notebook the cell below assembles only km and org_dummy. The exercise itself, using a version of the flights data that takes into account all of the changes from the previous few exercises, lists the following predictor columns:

- mon, dom and dow
- carrier_idx (indexed value from carrier)
- org_idx (indexed value from org)
- km
- depart
- duration

In [None]:
# Import the necessary class
from pyspark.ml.feature import VectorAssembler

# Create an assembler object
assembler = VectorAssembler(inputCols=[
    "km","org_dummy"
], outputCol='features')

# Consolidate predictor columns
flights = assembler.transform(flights)

# Check the resulting column
#flights.select('features', 'delay').show(5, truncate=False)
flights.show(5)

+---+---+---+-------+---+------+--------+-----+------+-------+-------------+--------------------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|org_idx|    org_dummy|            features|
+---+---+---+-------+---+------+--------+-----+------+-------+-------------+--------------------+
|  0| 22|  2|     UA|ORD| 16.33|    82.0| 30.0| 509.0|    0.0|(7,[0],[1.0])|(8,[0,1],[509.0,1...|
|  2| 20|  4|     UA|SFO|  6.17|    82.0| -8.0| 542.0|    1.0|(7,[1],[1.0])|(8,[0,2],[542.0,1...|
|  9| 13|  1|     AA|ORD| 10.33|   195.0| -5.0|1989.0|    0.0|(7,[0],[1.0])|(8,[0,1],[1989.0,...|
|  5|  2|  1|     UA|SFO|  7.98|   102.0|  2.0| 885.0|    1.0|(7,[1],[1.0])|(8,[0,2],[885.0,1...|
|  7|  2|  6|     AA|ORD| 10.83|   135.0| 54.0|1180.0|    0.0|(7,[0],[1.0])|(8,[0,1],[1180.0,...|
+---+---+---+-------+---+------+--------+-----+------+-------+-------------+--------------------+
only showing top 5 rows
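
The `features` column is simply `km` followed by the seven `org_dummy` entries, so every dummy index is shifted by one. A minimal plain-Python sketch of that concatenation, using the same `(size, indices, values)` notation as the output above, could look like this. The function `assemble` is a hypothetical stand-in, not the Spark class.

```python
def assemble(km, dummy):
    """Prepend a scalar to a sparse (size, indices, values) vector (hypothetical helper)."""
    size, indices, values = dummy
    # km takes index 0, so each dummy index moves up by one
    return (size + 1, [0] + [i + 1 for i in indices], [km] + list(values))


# First two rows of the table above
print(assemble(509.0, (7, [0], [1.0])))
print(assemble(542.0, (7, [1], [1.0])))
```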



## Train/test split
To objectively assess a Machine Learning model you need to be able to test it on an independent set of data. You can't use the same data that you used to train the model: of course the model will perform (relatively) well on those data!

You will split the data into two components:

- training data (used to train the model) and
- testing data (used to test the model).


- Randomly split the flights data into two sets with 80:20 proportions. For repeatability set a random number seed of 17 for the split.
- Check that the training data has roughly 80% of the records from the original data.

In [None]:
# Split into training and testing sets in an 80:20 ratio
flights_train, flights_test = flights.randomSplit([.8,.2], seed=17)

# Ratio of training to testing records; an 80:20 split should give roughly 4
training_ratio = flights_train.count() / flights_test.count()
print(training_ratio)

3.9253168534618204
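
A random split assigns each record independently, so the proportions are only approximately 80:20. The plain-Python sketch below illustrates the idea on 10,000 made-up row ids. The helper `random_split` is hypothetical and uses Python's `random` module, so it will not reproduce the exact rows Spark selects for the same seed.

```python
import random


def random_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction (hypothetical helper)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


train, test = random_split(range(10_000), 0.8, seed=17)
train_share = len(train) / (len(train) + len(test))
print("share of records in training set:", train_share)
print("train/test ratio:", len(train) / len(test))
```

A train/test ratio near 4 and a training share near 0.8 describe the same split.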


## Flight duration model: Just distance
In this exercise you'll build a regression model to predict flight duration (the duration column).

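The exercise keeps the model simple, including only the distance of the flight (the km column) as a predictor. Note that in this notebook the features column was assembled from both km and org_dummy, so the model fitted below uses the origin airport as well.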

- Create a linear regression object. Specify the name of the label column. Fit it to the training data.
- Make predictions on the testing data.
- Create a regression evaluator object and use it to evaluate RMSE on the testing data.

In [None]:
flights_train.show()

+---+---+---+-------+---+------+--------+-----+------+-------+-------------+--------------------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|org_idx|    org_dummy|            features|
+---+---+---+-------+---+------+--------+-----+------+-------+-------------+--------------------+
|  0|  1|  2|     AA|JFK|   7.0|   385.0|-16.0|4162.0|    2.0|(7,[2],[1.0])|(8,[0,3],[4162.0,...|
|  0|  1|  2|     AA|JFK|  12.0|   370.0| 11.0|3983.0|    2.0|(7,[2],[1.0])|(8,[0,3],[3983.0,...|
|  0|  1|  2|     AA|JFK|  17.0|   379.0|-10.0|3983.0|    2.0|(7,[2],[1.0])|(8,[0,3],[3983.0,...|
|  0|  1|  2|     AA|LGA|   6.5|   240.0| 40.0|2235.0|    3.0|(7,[3],[1.0])|(8,[0,4],[2235.0,...|
|  0|  1|  2|     AA|LGA|  8.25|   250.0| 27.0|2235.0|    3.0|(7,[3],[1.0])|(8,[0,4],[2235.0,...|
|  0|  1|  2|     AA|LGA| 14.58|   165.0| -4.0|1180.0|    3.0|(7,[3],[1.0])|(8,[0,4],[1180.0,...|
|  0|  1|  2|     AA|LGA| 20.42|   185.0| 31.0|1765.0|    3.0|(7,[3],[1.0])|(8,[0,4],[1765.0,...|
|  0|  1|  2|     AA

In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a regression object and train on training data
regression = LinearRegression(labelCol = "duration").fit(flights_train)

# Create predictions for the testing data and take a look at the predictions
predictions = regression.transform(flights_test)
predictions.select('duration', 'prediction').show(5, False)

# Calculate the RMSE
RegressionEvaluator(labelCol='duration').evaluate(predictions)

+--------+-----------------+
|duration|prediction       |
+--------+-----------------+
|230.0   |259.1362558120262|
|170.0   |150.1497766375601|
|120.0   |132.0591167025192|
|135.0   |132.0591167025192|
|70.0    |75.2332350687274 |
+--------+-----------------+
only showing top 5 rows



10.84177434569271
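
RMSE is the square root of the mean squared difference between the actual and predicted values. As an illustration, the sketch below computes it by hand for the five rows printed above. The function `rmse` is a hypothetical helper, and because it sees only five rows its result will differ from the value reported for the full test set.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences (hypothetical helper)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# The five rows shown in the predictions output above
duration = [230.0, 170.0, 120.0, 135.0, 70.0]
prediction = [259.1362558120262, 150.1497766375601, 132.0591167025192,
              132.0591167025192, 75.2332350687274]

sample_rmse = rmse(duration, prediction)
print(sample_rmse)
```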

## Interpreting the coefficients
The linear regression model for flight duration as a function of distance takes the form


$$ \text{duration} = a + b \cdot \text{distance} $$

where

- a: intercept (component of duration which does not depend on distance) and
- b: coefficient (rate at which duration increases as a function of distance; also called the slope).


By looking at the coefficients of your model you will be able to infer

- how much of the average flight duration is actually spent on the ground and
- what the average speed is during a flight.

The linear regression model is available as regression.

- What's the intercept?
- What are the coefficients? This is a vector.
- Extract the element from the vector which corresponds to the slope for distance.
- Find the average speed in km per hour.



In [None]:
# Intercept (average minutes on ground)
#inter = regression.intercept
#print(inter)

# Coefficients
#coefs = regression.coefficients
#print(coefs)

# Average minutes per km
#minutes_per_km = regression.coefficients[0]
#print(minutes_per_km)

# Average speed in km per hour
#avg_speed = 60 / regression.coefficients[0]
#print(avg_speed)
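
The conversion from slope to speed is plain unit arithmetic: the slope is measured in minutes per km, so 60 divided by the slope gives km per hour. A minimal sketch, where the slope of 0.075 minutes per km is an assumed value chosen for illustration and not a result from this notebook:

```python
def avg_speed_kmh(minutes_per_km):
    """Convert a slope in minutes per km into a speed in km per hour (hypothetical helper)."""
    if minutes_per_km <= 0:
        raise ValueError("slope must be positive")
    return 60 / minutes_per_km


example_slope = 0.075  # assumed value for illustration only
print(avg_speed_kmh(example_slope))
```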

## Interpreting coefficients 2 
Remember that origin airport, org, has eight possible values (ORD, SFO, JFK, LGA, SMF, SJC, TUS and OGG) which have been one-hot encoded to seven dummy variables in org_dummy.

The values for km and org_dummy have been assembled into features, which has eight columns with sparse representation. Column indices in features are as follows:

- 0 — km
- 1 — ORD
- 2 — SFO
- 3 — JFK
- 4 — LGA
- 5 — SMF
- 6 — SJC and
- 7 — TUS.

Note that OGG does not appear in this list because it is the reference level for the origin airport category.

In this exercise you'll be using the intercept and coefficients attributes to interpret the model.

The coefficients attribute is a vector (a DenseVector), where the first element indicates how flight duration changes with flight distance.

Instructions

- Find the average speed in km per hour. This will be different to the value that you got earlier because your model is now more sophisticated.
- What's the average time on the ground at OGG?
- What's the average time on the ground at JFK?
- What's the average time on the ground at LGA?

In [None]:
# Average speed in km per hour
avg_speed_hour = 60 / regression.coefficients[0]
print(avg_speed_hour)

# Average minutes on ground at OGG
inter = regression.intercept
print(inter)

# Average minutes on ground at JFK
avg_ground_jfk = inter + regression.coefficients[3]
print(avg_ground_jfk)

# Average minutes on ground at LGA
avg_ground_lga = inter + regression.coefficients[4]
print(avg_ground_lga)

807.7305389786569
15.484675567183574
68.23100640177145
62.49678274491395
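
Because OGG is the reference level, the intercept alone is the estimated ground time at OGG, and every other airport adds its own coefficient on top. The sketch below restates that arithmetic in plain Python. The offsets are not printed anywhere in the notebook: they are derived here by subtracting the printed intercept from the printed JFK and LGA values, then rounded to two decimals.

```python
# Printed intercept from the output above, rounded to two decimals
INTERCEPT = 15.48

# Offsets relative to the reference level OGG (derived from the printed
# outputs as value minus intercept, rounded; an assumption of this sketch)
OFFSETS = {"OGG": 0.0, "JFK": 52.75, "LGA": 47.01}


def ground_minutes(airport):
    """Estimated minutes on the ground: intercept plus the airport's offset."""
    return INTERCEPT + OFFSETS[airport]


for airport in ["OGG", "JFK", "LGA"]:
    print(airport, ground_minutes(airport))
```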


## Bucketing departure time
Time of day data are a challenge with regression models. They are also a great candidate for bucketing.

In this lesson you will convert the flight departure times from numeric values between 0 (corresponding to 00:00) and 24 (corresponding to 24:00) to binned values. You'll then take those binned values and one-hot encode them.


- Create a bucketizer object with bin boundaries which correspond to 00:00, 03:00, 06:00, …, 24:00. Specify input column as depart and output column as depart_bucket.
- Bucket the departure times. Show the first five values for depart and depart_bucket.
- Create a one-hot encoder object. Specify output column as depart_dummy.
- Train the encoder on the data and then use it to convert the bucketed departure times to dummy variables. Show the first five values for depart, depart_bucket and depart_dummy.

In [None]:
from pyspark.ml.feature import Bucketizer

# Create buckets at 3 hour intervals through the day
buckets = Bucketizer(splits=[
    3 * x for x in range(9)
], inputCol='depart', outputCol='depart_bucket')

# Bucket the departure times
bucketed = buckets.transform(flights)
bucketed.select('depart', 'depart_bucket').show(5)

# Create a one-hot encoder
onehot = OneHotEncoder(inputCols=['depart_bucket'], outputCols=['depart_dummy'])

# One-hot encode the bucketed departure times
flights_onehot = onehot.fit(bucketed).transform(bucketed)
flights_onehot.select('depart', 'depart_bucket', 'depart_dummy').show(5)

+------+-------------+
|depart|depart_bucket|
+------+-------------+
| 16.33|          5.0|
|  6.17|          2.0|
| 10.33|          3.0|
|  7.98|          2.0|
| 10.83|          3.0|
+------+-------------+
only showing top 5 rows

+------+-------------+-------------+
|depart|depart_bucket| depart_dummy|
+------+-------------+-------------+
| 16.33|          5.0|(7,[5],[1.0])|
|  6.17|          2.0|(7,[2],[1.0])|
| 10.33|          3.0|(7,[3],[1.0])|
|  7.98|          2.0|(7,[2],[1.0])|
| 10.83|          3.0|(7,[3],[1.0])|
+------+-------------+-------------+
only showing top 5 rows
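
The bucket index is just the position of the three-hour interval that contains the departure time. The following plain-Python sketch reproduces the `depart_bucket` values shown above with `bisect`. The helper `bucket` is hypothetical, and treating each interval as closed on the left and open on the right, with the final interval also including 24, is an assumption of this sketch.

```python
from bisect import bisect_right

SPLITS = [3 * x for x in range(9)]  # 0, 3, 6, ..., 24


def bucket(depart, splits=SPLITS):
    """Return the index of the interval [splits[i], splits[i+1]) containing depart."""
    if not splits[0] <= depart <= splits[-1]:
        raise ValueError("value outside the bucket boundaries")
    if depart == splits[-1]:
        # The upper boundary is placed in the last bucket (assumption)
        return float(len(splits) - 2)
    return float(bisect_right(splits, depart) - 1)


# depart values from the output above
for depart in [16.33, 6.17, 10.33, 7.98, 10.83]:
    print(depart, bucket(depart))
```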



### Flight duration model - Adding departure time
In the previous exercise the departure time was bucketed and converted to dummy variables. Now you're going to include those dummy variables in a regression model for flight duration.

The data are in `flights`. The `km`, `org_dummy` and `depart_dummy` columns have been assembled into `features`, where `km` is index 0, `org_dummy` runs from index 1 to 7 and `depart_dummy` from index 8 to 14.
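
The index ranges quoted above follow directly from the width of each assembled block. A small plain-Python sketch, with a hypothetical helper `feature_layout`, shows how they can be derived:

```python
def feature_layout(blocks):
    """Map each block name to its (first, last) index in the assembled vector."""
    layout, start = {}, 0
    for name, width in blocks:
        layout[name] = (start, start + width - 1)
        start += width
    return layout


# Widths: km is a single column, each dummy block has 8 - 1 = 7 columns
layout = feature_layout([("km", 1), ("org_dummy", 7), ("depart_dummy", 7)])
print(layout)

# A flight in depart_bucket 5 therefore activates index 8 + 5 = 13
print(layout["depart_dummy"][0] + 5)
```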


In [None]:
assembler = VectorAssembler(inputCols=['km', 'org_dummy', 'depart_dummy'], outputCol='features')

flights = assembler.transform(flights_onehot.drop('features'))

flights.show(5)

+---+---+---+-------+---+------+--------+-----+------+-------+-------------+-------------+-------------+--------------------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|org_idx|    org_dummy|depart_bucket| depart_dummy|            features|
+---+---+---+-------+---+------+--------+-----+------+-------+-------------+-------------+-------------+--------------------+
|  0| 22|  2|     UA|ORD| 16.33|    82.0| 30.0| 509.0|    0.0|(7,[0],[1.0])|          5.0|(7,[5],[1.0])|(15,[0,1,13],[509...|
|  2| 20|  4|     UA|SFO|  6.17|    82.0| -8.0| 542.0|    1.0|(7,[1],[1.0])|          2.0|(7,[2],[1.0])|(15,[0,2,10],[542...|
|  9| 13|  1|     AA|ORD| 10.33|   195.0| -5.0|1989.0|    0.0|(7,[0],[1.0])|          3.0|(7,[3],[1.0])|(15,[0,1,11],[198...|
|  5|  2|  1|     UA|SFO|  7.98|   102.0|  2.0| 885.0|    1.0|(7,[1],[1.0])|          2.0|(7,[2],[1.0])|(15,[0,2,10],[885...|
|  7|  2|  6|     AA|ORD| 10.83|   135.0| 54.0|1180.0|    0.0|(7,[0],[1.0])|          3.0|(7,[3],[1.0])|(15,[0,1,11],[

In [None]:
flights_train, flights_test = flights.randomSplit([0.8, 0.2])

# Train with training data
regression = LinearRegression(labelCol='duration').fit(flights_train)
predictions = regression.transform(flights_test)

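# RMSE on the test data; the value is not displayed because it is neither printed nor the cell's last expression
RegressionEvaluator(labelCol='duration', metricName='rmse').evaluate(predictions)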

# Average minutes on ground at OGG for flights departing between 21:00 and 24:00
avg_eve_ogg = regression.intercept
print(avg_eve_ogg)

# Average minutes on ground at OGG for flights departing between 00:00 and 03:00
avg_night_ogg = regression.intercept + regression.coefficients[8]
print(avg_night_ogg)

# Average minutes on ground at JFK for flights departing between 00:00 and 03:00
avg_night_jfk = regression.intercept + regression.coefficients[3] + regression.coefficients[8]
print(avg_night_jfk)

9.945421780469697
-3.488281235413975
48.55293126557178


## Regularization
- Feature Selection

- Loss function
    - Linear regression aims to minimize the MSE
$$ MSE = \frac{1}{N} \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 $$
- Loss function with regularization
    - Add a regularization term which depends on the coefficients
$$ \text{Loss} = \frac{1}{N} \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \lambda f(\beta) $$
    - Regularizer
        - Lasso - absolute value of the coefficients
        - Ridge - square of the coefficients
    - Both will shrink the coefficients of unimportant predictors; Lasso can set them exactly to zero
    - Strength of regularization determined by parameter $\lambda$:
        - $\lambda = 0$ - no regularization (standard regression)
        - $\lambda \to \infty$ - complete regularization (all coefficients zero)
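
To see how the penalty term enters the loss, here is a minimal plain-Python sketch of the formula above, with $f(\beta)$ taken as the sum of absolute values for Lasso and the sum of squares for Ridge. All names and the toy numbers are hypothetical, and the sketch follows the simplified form in these notes, not necessarily the exact scaling a particular library uses internally.

```python
def mse(y, y_hat):
    """Mean squared error of two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y)


def penalty(beta, kind):
    """f(beta): sum of absolute values (lasso) or sum of squares (ridge)."""
    if kind == "lasso":
        return sum(abs(b) for b in beta)
    if kind == "ridge":
        return sum(b ** 2 for b in beta)
    raise ValueError("kind must be 'lasso' or 'ridge'")


def regularized_loss(y, y_hat, beta, lam, kind="lasso"):
    """MSE plus lambda times the penalty on the coefficients."""
    return mse(y, y_hat) + lam * penalty(beta, kind)


# Toy numbers for illustration only
y, y_hat, beta = [1.0, 2.0], [1.0, 2.0], [3.0, -4.0]
print(regularized_loss(y, y_hat, beta, lam=0.0))                # plain MSE
print(regularized_loss(y, y_hat, beta, lam=0.5, kind="lasso"))  # 0.5 * (3 + 4)
print(regularized_loss(y, y_hat, beta, lam=0.5, kind="ridge"))  # 0.5 * (9 + 16)
```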

        

### Flight duration model - More features!
Let's add more features to our model. This will not necessarily result in a better model. Adding some features might improve the model. Adding other features might make it worse.

More features will always make the model more complicated and difficult to interpret.

These are the features you'll include in the next model:

- `km`
- `org` (origin airport, one-hot encoded, 8 levels)
- `depart` (departure time, binned in 3 hour intervals, one-hot encoded, 8 levels)
- `dow` (departure day of week, one-hot encoded, 7 levels) and
- `mon` (departure month, one-hot encoded, 12 levels).

These have been assembled into the `features` column, which is a sparse representation of 32 columns (remember one-hot encoding produces a number of columns which is one fewer than the number of levels).
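
The count of 32 can be checked with a line of arithmetic: one column for `km`, plus one fewer than the number of levels for each encoded variable. A short sketch, using the level counts listed above:

```python
# Number of levels per one-hot encoded variable, as listed above
levels = {"org": 8, "depart": 8, "dow": 7, "mon": 12}

# One column for km, plus (levels - 1) dummy columns per encoded variable
n_columns = 1 + sum(n - 1 for n in levels.values())
print(n_columns)
```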

In [None]:
onehot = OneHotEncoder(inputCols=['dow'], outputCols=['dow_dummy'])
flights = onehot.fit(flights).transform(flights)

onehot = OneHotEncoder(inputCols=['mon'], outputCols=['mon_dummy'])
flights = onehot.fit(flights).transform(flights)

flights.show(5)

+---+---+---+-------+---+------+--------+-----+------+-------+-------------+-------------+-------------+--------------------+-------------+--------------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|org_idx|    org_dummy|depart_bucket| depart_dummy|            features|    dow_dummy|     mon_dummy|
+---+---+---+-------+---+------+--------+-----+------+-------+-------------+-------------+-------------+--------------------+-------------+--------------+
|  0| 22|  2|     UA|ORD| 16.33|    82.0| 30.0| 509.0|    0.0|(7,[0],[1.0])|          5.0|(7,[5],[1.0])|(15,[0,1,13],[509...|(6,[2],[1.0])|(11,[0],[1.0])|
|  2| 20|  4|     UA|SFO|  6.17|    82.0| -8.0| 542.0|    1.0|(7,[1],[1.0])|          2.0|(7,[2],[1.0])|(15,[0,2,10],[542...|(6,[4],[1.0])|(11,[2],[1.0])|
|  9| 13|  1|     AA|ORD| 10.33|   195.0| -5.0|1989.0|    0.0|(7,[0],[1.0])|          3.0|(7,[3],[1.0])|(15,[0,1,11],[198...|(6,[1],[1.0])|(11,[9],[1.0])|
|  5|  2|  1|     UA|SFO|  7.98|   102.0|  2.0| 885.0|    1.0|(7,[1],[

In [None]:
assembler = VectorAssembler(inputCols=[
    'km', 'org_dummy', 'depart_dummy', 'dow_dummy', 'mon_dummy'
], outputCol='features')

flights = assembler.transform(flights.drop('features'))
flights.show(5)

+---+---+---+-------+---+------+--------+-----+------+-------+-------------+-------------+-------------+-------------+--------------+--------------------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|org_idx|    org_dummy|depart_bucket| depart_dummy|    dow_dummy|     mon_dummy|            features|
+---+---+---+-------+---+------+--------+-----+------+-------+-------------+-------------+-------------+-------------+--------------+--------------------+
|  0| 22|  2|     UA|ORD| 16.33|    82.0| 30.0| 509.0|    0.0|(7,[0],[1.0])|          5.0|(7,[5],[1.0])|(6,[2],[1.0])|(11,[0],[1.0])|(32,[0,1,13,17,21...|
|  2| 20|  4|     UA|SFO|  6.17|    82.0| -8.0| 542.0|    1.0|(7,[1],[1.0])|          2.0|(7,[2],[1.0])|(6,[4],[1.0])|(11,[2],[1.0])|(32,[0,2,10,19,23...|
|  9| 13|  1|     AA|ORD| 10.33|   195.0| -5.0|1989.0|    0.0|(7,[0],[1.0])|          3.0|(7,[3],[1.0])|(6,[1],[1.0])|(11,[9],[1.0])|(32,[0,1,11,16,30...|
|  5|  2|  1|     UA|SFO|  7.98|   102.0|  2.0| 885.0|    1.0|(7,[1],[

In [None]:
flights_train, flights_test = flights.randomSplit([0.8, 0.2])

# Fit linear regression model to training data
regression = LinearRegression(labelCol='duration').fit(flights_train)

# Make predictions on test data
predictions = regression.transform(flights_test)

# Calculate the RMSE on test data
rmse = RegressionEvaluator(labelCol='duration', metricName='rmse').evaluate(predictions)
print("The test RMSE is", rmse)

# Look at the model coefficients
coeffs = regression.coefficients
print(coeffs)

The test RMSE is 10.596941156428072
[0.0743745081357682,28.080022182909694,20.992850796566717,52.23265365441294,46.506711175054306,15.825096365880174,18.221482985119398,18.20700851254615,-14.840138824863654,0.8887440123978566,4.104608736311577,6.969772435531358,4.562187033555394,8.813302149847772,8.636299463691822,0.49588802266374105,0.23587285926243468,-0.03584378615316089,0.2055225291739623,0.3908229517033893,0.2432785829719603,-2.105666928287855,-2.5363969226737644,-2.218056117955404,-3.785174620784235,-4.44475763167747,-4.436487130997319,-4.5474021968989415,-4.376836875092133,-4.1081280578824,-3.024788439179824,-1.0642346673378236]


### Flight duration model - Regularization!
In the previous exercise you added more predictors to the flight duration model. The model performed well on testing data, but with so many coefficients it was difficult to interpret.

In this exercise you'll use Lasso regression (regularized with an L1 penalty) to create a more parsimonious model. Many of the coefficients in the resulting model will be set to zero. This means that only a subset of the predictors actually contribute to the model. Despite the simpler model, it still produces a good RMSE on the testing data.

You'll use a specific value for the regularization strength. Later you'll learn how to find the best value using cross validation.


In [None]:
# Fit Lasso model (regParam is λ = 1, elasticNetParam is α = 1, i.e. a pure L1 penalty) to training data
regression = LinearRegression(labelCol='duration', regParam=1, elasticNetParam=1).fit(flights_train)
#predictions = regression.transform(flights_test)

# Calculate the RMSE on testing data
rmse = RegressionEvaluator(labelCol='duration', metricName='rmse').evaluate(regression.transform(flights_test))
print("The test RMSE is", rmse)

# Look at the model coefficients
coeffs = regression.coefficients
print(coeffs)

# Number of zero coefficients (compare on the underlying NumPy array)
zero_coeff = int((regression.coefficients.toArray() == 0).sum())
print("Number of coefficients equal to 0:", zero_coeff)

The test RMSE is 11.574541262501798
[0.07347190274623178,5.635946175467288,0.0,28.783410017215616,21.914022164297975,-2.27188784640192,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0561951743739815,0.9953671985670326,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
Number of coefficients equal to 0: 25
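
To see which predictors survive the Lasso penalty, it helps to list the indices of the non-zero coefficients. The sketch below does this in plain Python on the coefficient vector printed above, with the non-zero values rounded to three decimals. The index ranges in the comments assume the assembly order used earlier (`km`, `org_dummy`, `depart_dummy`, `dow_dummy`, `mon_dummy`).

```python
# Coefficients from the Lasso output above (non-zero values rounded)
coefficients = (
    [0.073, 5.636, 0.0, 28.783, 21.914, -2.272]  # km, then org dummies 1-5
    + [0.0] * 7                                  # indices 6-12
    + [1.056, 0.995]                             # depart dummies at 13 and 14
    + [0.0] * 17                                 # indices 15-31 (dow and mon)
)

zero_count = sum(1 for beta in coefficients if beta == 0)
nonzero_indices = [i for i, beta in enumerate(coefficients) if beta != 0]

print("total coefficients:", len(coefficients))
print("zero coefficients:", zero_count)
print("non-zero indices:", nonzero_indices)
```

Under that layout, only distance, four origin airports and two departure-time buckets remain in the model. All day-of-week and month coefficients are zero.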
