# HW 3: Getting Comfortable with Feature Engineering
**OPIM 5512: Big Data Analytics with Cloud Computing - University of Connecticut**

* Your Name Here: Urvashi Vijay Bhurase
* Your StudentID Here: 3059409

It's time to get some practice with the different options for feature engineering in PySpark! 

Check out these resources:
* https://spark.apache.org/docs/1.4.0/ml-features.html
* https://dhiraj-p-rai.medium.com/essentials-of-feature-engineering-in-pyspark-part-i-76a57680a85
* https://www.kaggle.com/code/dhirajrai87/feature-engineering-with-pyspark
* https://datascience.stackexchange.com/questions/45900/when-to-use-standard-scaler-and-when-normalizer
  * I personally have never used the normalizer, but I am curious how well it would do as a feature engineering tool when we start fitting models - students, please do some reading on this and see if/when it's useful for feature engineering! Sample-based (row) feature engineering vs. feature-based (column) approaches.

And you will notice that the common things you can do in PySpark are:
* Feature Transformers
* PolynomialExpansion*
* StringIndexer
* OneHotEncoder
* VectorIndexer
* Normalizer*
* StandardScaler* (do this one last!)
* Bucketizer*
* ElementwiseProduct
* VectorAssembler
  * You PROBABLY need to run this one first, so I will do it for you to start.

# Rubric
Using the CA Housing Dataset on the right, please try these FOUR methods (PolynomialExpansion, Bucketizer, Normalizer and StandardScaler). I want one subheader per method with a description of what you did, the code to apply the method (which runs successfully on train and is correctly applied to test), and for you to check your work with printing a few rows and/or checking shape. 

You should be adding one more column each time you do a method! Give each new column a nice name that corresponds to the method (like 'columns_scaled').

I give the `VectorAssembler` and some data reading code for you to get started. 

20 points per method (80 points) and 20 points for five meaningful bullets on what you learned. 

**Note:** if you are referencing other useful materials online, please feel free to share them on the Discussion Board!

# Start PySpark and Read Data from Colab

In [None]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317145 sha256=00419babd1c996943db5b53eb439bedfae21ad55753bc505e5f025958cd9cb87
  Stored in directory: /root/.cache/pip/wheels/9f/34/a4/159aa12d0a510d5ff7c8f0220abbea42e5d81ecf588c4fd884
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder\
        .appName("FeatureEngineering_HW")\
        .getOrCreate()

In [None]:
# specify the directory
DIRECTORY = '/content/sample_data'

In [None]:
import os

# read the train data
train = spark.read.csv(
 path=os.path.join(DIRECTORY, "california_housing_train.csv"),
 sep=",",
 header=True,
 inferSchema=True,
 timestampFormat="yyyy-MM-dd", # used to tell spark the format of dateTime columns
)

# read the test data
test = spark.read.csv(
 path=os.path.join(DIRECTORY, "california_housing_test.csv"),
 sep=",",
 header=True,
 inferSchema=True,
 timestampFormat="yyyy-MM-dd", # used to tell spark the format of dateTime columns
)

In [None]:
# view first few rows
train.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|
|  -114.57|   33.64|              14.0|     1501.0|         337.0|     515.0|     226.0|       3.1917|           73400.0|
|  -114.57|   33.57|              20.0|     1454.0|         326.0|     624.0|     262.0|        1.925|           65500.0|
|  -114.58|   33.63|    

# (Dave's demo) VectorAssembler
This method will combine all of the columns of interest into a single vector column. This vector column has been optimized for ML pipelines. You can think of it as each row having a list with ALL of the columns of interest inside of it.

## Train

In [None]:
train.columns

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value']

In [None]:
CONTINUOUS_COLUMNS = ['longitude',
                      'latitude',
                      'housing_median_age',
                      'total_rooms',
                      'total_bedrooms',
                      'population',
                      'households',
                      'median_income'] # note that we dropped the target variable here!
print(CONTINUOUS_COLUMNS)

['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']


In [None]:
TARGET_COLUMN = ['median_house_value']
print(TARGET_COLUMN)

['median_house_value']


In [None]:
from pyspark.ml.feature import VectorAssembler

# we input all the continuous columns as a vector "CONTINUOUS_COLUMNS" 
continuous_features = VectorAssembler(inputCols=CONTINUOUS_COLUMNS, outputCol="continuous_features")

In [None]:
# drop rows with a null in ANY continuous column
# (VectorAssembler raises an error on nulls by default)
vector_df_train = train.dropna(subset=CONTINUOUS_COLUMNS)
vector_df_test = test.dropna(subset=CONTINUOUS_COLUMNS)

In [None]:
#transform
vector_variable_train = continuous_features.transform(vector_df_train)
vector_variable_train.show() # see how all of the features are now in one column?

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value| continuous_features|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|[-114.31,34.19,15...|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|[-114.47,34.4,19....|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|[-114.56,33.69,17...|
|  -114.57|   33.64|              14.0|     1501.0|         337.0|     515.0|     226.0|       3.1917|           73400.0|[-114.57,33.64,14...|

## Test

In [None]:
# now apply to test
vector_variable_test = continuous_features.transform(vector_df_test)
vector_variable_test.show() # see how all of the features are now in one column?

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value| continuous_features|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+
|  -122.05|   37.37|              27.0|     3885.0|         661.0|    1537.0|     606.0|       6.6085|          344700.0|[-122.05,37.37,27...|
|   -118.3|   34.26|              43.0|     1510.0|         310.0|     809.0|     277.0|        3.599|          176500.0|[-118.3,34.26,43....|
|  -117.81|   33.78|              27.0|     3589.0|         507.0|    1484.0|     495.0|       5.7934|          270500.0|[-117.81,33.78,27...|
|  -118.36|   33.82|              28.0|       67.0|          15.0|      49.0|      11.0|       6.1359|          330000.0|[-118.36,33.82,28...|

# Polynomial Expansion

## Train

In the code below, I first define a PolynomialExpansion transformer with degree 2, input column "continuous_features" and output column "poly_Features". I then apply it with `transform` to the vector_variable_train dataframe created in the demo above. The result is the train dataframe with one extra column, "poly_Features", holding a vector with every original feature, its square, and the pairwise products between features (all terms up to degree 2). The same transformer is then applied to the test data.

In [None]:
from pyspark.ml.feature import PolynomialExpansion

polyExpansion = PolynomialExpansion(degree=2, inputCol="continuous_features", outputCol="poly_Features")

In [None]:
poly_df_train = polyExpansion.transform(vector_variable_train)
poly_df_train.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value| continuous_features|       poly_Features|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|[-114.31,34.19,15...|[-114.31,13066.77...|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|[-114.47,34.4,19....|[-114.47,13103.38...|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|[-114.56,33.69,17...|[-114.56,13123.99...|
|  -114.57|   33

In [None]:
display_df = poly_df_train.select("continuous_features", "poly_Features")
display_df.show(10, False)

+-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|continuous_features                                    |poly_Features                                                                                                                                                                                                                                                                                                                                                                                                                 

## Test

In [None]:
poly_df_test = polyExpansion.transform(vector_variable_test)
poly_df_test.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value| continuous_features|       poly_Features|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|  -122.05|   37.37|              27.0|     3885.0|         661.0|    1537.0|     606.0|       6.6085|          344700.0|[-122.05,37.37,27...|[-122.05,14896.20...|
|   -118.3|   34.26|              43.0|     1510.0|         310.0|     809.0|     277.0|        3.599|          176500.0|[-118.3,34.26,43....|[-118.3,13994.89,...|
|  -117.81|   33.78|              27.0|     3589.0|         507.0|    1484.0|     495.0|       5.7934|          270500.0|[-117.81,33.78,27...|[-117.81,13879.19...|
|  -118.36|   33

In [None]:
display_df = poly_df_test.select("continuous_features", "poly_Features")
display_df.show(10, False)

+-----------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|continuous_features                                  |poly_Features                                                                                                                                                                                                                                                                                                                                                                                                       

As we can see in the first row, poly_Features starts like this:
* -122.05 = 1st term of continuous_features
* 14896.2025 = square of the 1st term = (-122.05)^2

and so on, through the remaining features, their squares, and their cross products.
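
To get a feel for how long the expanded vector is, here is a minimal plain-Python sketch (not Spark itself). It assumes that a degree-2 expansion emits every monomial of degree 1 and 2, which matches the documented example (x, y) -> (x, x*x, y, x*y, y*y). The helper name `expand_degree_2` is hypothetical, and the term order here may differ from Spark's.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def expand_degree_2(values):
    """Return all degree-1 and degree-2 monomials of `values` (order is illustrative only)."""
    terms = []
    for degree in (1, 2):
        for combo in combinations_with_replacement(values, degree):
            terms.append(prod(combo))
    return terms

# small hypothetical vector (x, y) = (2.0, 3.0) gives x, y, x*x, x*y, y*y
small = expand_degree_2([2.0, 3.0])

# first train row of continuous_features, copied from the output above
row = [-114.31, 34.19, 15.0, 5612.0, 1283.0, 1015.0, 472.0, 1.4936]
expanded = expand_degree_2(row)

# 8 linear terms + 36 degree-2 terms, i.e. C(8 + 2, 2) - 1
n_terms = len(expanded)
```

Under that assumption, each poly_Features vector holds 44 values for our 8 input features, not just 16 (feature plus square).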

# Bucketizer

In the code below, I first define a Bucketizer with input column "median_house_value" and output column "categorical_median_house_value". I then apply it with `transform` to the vector_variable_train dataframe created in the demo above. The result is the train dataframe with one extra column, "categorical_median_house_value", containing the index of the bucket that the observation's median_house_value falls in. The same transformer is then applied to the test data.

## Train

In [None]:
from pyspark.ml.feature import Bucketizer

bucketizer = Bucketizer(splits=[-float("inf"), 25000, 50000, 75000, 100000, float("inf")], inputCol="median_house_value", outputCol="categorical_median_house_value")

In [None]:
bucketed_df_train = bucketizer.transform(vector_variable_train)
bucketed_df_train.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+------------------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value| continuous_features|categorical_median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+------------------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|[-114.31,34.19,15...|                           2.0|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|[-114.47,34.4,19....|                           3.0|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|[-114.56,

In [None]:
display_df = bucketed_df_train.select("median_house_value", "categorical_median_house_value")
display_df.show(10, False)

+------------------+------------------------------+
|median_house_value|categorical_median_house_value|
+------------------+------------------------------+
|66900.0           |2.0                           |
|80100.0           |3.0                           |
|85700.0           |3.0                           |
|73400.0           |2.0                           |
|65500.0           |2.0                           |
|74000.0           |2.0                           |
|82400.0           |3.0                           |
|48500.0           |1.0                           |
|58400.0           |2.0                           |
|48100.0           |1.0                           |
+------------------+------------------------------+
only showing top 10 rows



In [None]:
bucketed_df_train.createOrReplaceTempView("Bucket")
Bucket = spark.sql(
 "select median_house_value, categorical_median_house_value from Bucket where categorical_median_house_value = 5"
)
Bucket.show()

+------------------+------------------------------+
|median_house_value|categorical_median_house_value|
+------------------+------------------------------+
+------------------+------------------------------+



## Test

In [None]:
bucketed_df_test = bucketizer.transform(vector_variable_test)
bucketed_df_test.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+------------------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value| continuous_features|categorical_median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+------------------------------+
|  -122.05|   37.37|              27.0|     3885.0|         661.0|    1537.0|     606.0|       6.6085|          344700.0|[-122.05,37.37,27...|                           4.0|
|   -118.3|   34.26|              43.0|     1510.0|         310.0|     809.0|     277.0|        3.599|          176500.0|[-118.3,34.26,43....|                           4.0|
|  -117.81|   33.78|              27.0|     3589.0|         507.0|    1484.0|     495.0|       5.7934|          270500.0|[-117.81,

In [None]:
display_df = bucketed_df_test.select("median_house_value", "categorical_median_house_value")
display_df.show(10, False)

+------------------+------------------------------+
|median_house_value|categorical_median_house_value|
+------------------+------------------------------+
|344700.0          |4.0                           |
|176500.0          |4.0                           |
|270500.0          |4.0                           |
|330000.0          |4.0                           |
|81700.0           |3.0                           |
|67000.0           |2.0                           |
|67000.0           |2.0                           |
|166900.0          |4.0                           |
|194400.0          |4.0                           |
|164200.0          |4.0                           |
+------------------+------------------------------+
only showing top 10 rows



In [None]:
bucketed_df_test.createOrReplaceTempView("Bucket")
Bucket = spark.sql(
 "select median_house_value, categorical_median_house_value from Bucket where categorical_median_house_value = 5"
)
Bucket.show()

+------------------+------------------------------+
|median_house_value|categorical_median_house_value|
+------------------+------------------------------+
+------------------+------------------------------+



As we can verify from the results above, the first test observation (344700) is greater than 100000, so it fell in bucket 4. One more thing we can verify: six split points define only five buckets, indexed 0 through 4, and the last one, [100000, inf), catches everything above 100000. There is no bucket 5, which is why the query above returned an empty dataframe. :)
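
The bucket assignment can be mimicked in plain Python, which makes the indexing easier to check by hand. This is only a sketch under the assumption that each bucket covers [lower, upper) and that the last bucket also includes its upper bound; `bucket_index` is a hypothetical helper, not part of PySpark.

```python
from bisect import bisect_right

# same split points as the Bucketizer above
SPLITS = [-float("inf"), 25000, 50000, 75000, 100000, float("inf")]

def bucket_index(value, splits=SPLITS):
    """Return the bucket index for `value`; bucket i covers [splits[i], splits[i + 1])."""
    n_buckets = len(splits) - 1           # n + 1 split points define n buckets
    idx = bisect_right(splits, value) - 1
    return float(min(idx, n_buckets - 1)) # the last bucket also includes its upper bound

# values taken from the train and test outputs above
examples = [66900.0, 80100.0, 48500.0, 344700.0]
buckets = [bucket_index(v) for v in examples]
```

The resulting indices agree with the categorical_median_house_value column shown for those rows, and the largest index that can ever come out is 4.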

# Normalizer

In the code below, I first define a Normalizer with input column "continuous_features" and output column "normalized_Features". I then apply it with `transform` to the vector_variable_train dataframe created in the demo above. The result is the train dataframe with one extra column, "normalized_Features", in which each row's "continuous_features" vector has been rescaled to unit length (Normalizer uses the L2 norm, p=2, by default). The same transformer is then applied to the test data.

## Train

In [None]:
from pyspark.ml.feature import Normalizer

normalizer = Normalizer(inputCol="continuous_features", outputCol="normalized_Features")

In [None]:
normalized_df_train = normalizer.transform(vector_variable_train)
normalized_df_train.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value| continuous_features| normalized_Features|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|[-114.31,34.19,15...|[-0.0194873976644...|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|[-114.47,34.4,19....|[-0.0143491682138...|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|[-114.56,33.69,17...|[-0.1381339479148...|
|  -114.57|   33

In [None]:
display_df = normalized_df_train.select("continuous_features", "normalized_Features")
display_df.show(10, False)

+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|continuous_features                                    |normalized_Features                                                                                                                                                  |
+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936] |[-0.01948739766447878,0.005828660013546754,0.0025571775432349027,0.9567253581756182,0.21872391919802533,0.17303568042556175,0.08046585336045826,2.546266919050434E-4]|
|[-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82]    |[-0.014349168213811022,0.004312146296454085,0.0

## Test

In [None]:
normalized_df_test = normalizer.transform(vector_variable_test)
normalized_df_test.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value| continuous_features| normalized_Features|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|  -122.05|   37.37|              27.0|     3885.0|         661.0|    1537.0|     606.0|       6.6085|          344700.0|[-122.05,37.37,27...|[-0.0285487770672...|
|   -118.3|   34.26|              43.0|     1510.0|         310.0|     809.0|     277.0|        3.599|          176500.0|[-118.3,34.26,43....|[-0.0669265763386...|
|  -117.81|   33.78|              27.0|     3589.0|         507.0|    1484.0|     495.0|       5.7934|          270500.0|[-117.81,33.78,27...|[-0.0298267178827...|
|  -118.36|   33

In [None]:
display_df = normalized_df_test.select("continuous_features", "normalized_Features")
display_df.show(10, False)

+-----------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|continuous_features                                  |normalized_Features                                                                                                                                                  |
+-----------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085]|[-0.028548777067242737,0.008741235551027128,0.006315583619955378,0.9087423097602461,0.15461484343668536,0.35952044532857097,0.1417497656923318,0.0015457975686101895]|
|[-118.3,34.26,43.0,1510.0,310.0,809.0,277.0,3.599]   |[-0.06692657633867845,0.01938211754322167,0.0243266507401

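As we can see from the data displayed above, every element of the normalized features vector lies between -1 and 1, because each row has been divided by its own L2 norm. Note that this is a row-wise (sample-based) operation: it gives every row unit length, but it does not put the columns on a common scale. In the first train row, total_rooms still dominates (0.957) while median_income is almost zero (0.00025).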
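
As a quick sanity check, the first train row can be normalized by hand in plain Python. This sketch assumes the default L2 norm (p=2) and uses the row values printed above; the variable names are my own.

```python
import math

# first train row of continuous_features, copied from the output above
row = [-114.31, 34.19, 15.0, 5612.0, 1283.0, 1015.0, 472.0, 1.4936]

# L2 norm of the row: square root of the sum of squared elements
l2_norm = math.sqrt(sum(x * x for x in row))

# divide every element by the row's own norm
normalized = [x / l2_norm for x in row]

# the normalized row should have length 1
unit_length = math.sqrt(sum(x * x for x in normalized))
```

The values in `normalized` should line up with the first row of normalized_Features, for example about 0.9567 for total_rooms.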

# StandardScaler

In the code below, I first define a StandardScaler with input column "continuous_features" and output column "scaled_features". Unlike the previous transformers, StandardScaler is an estimator: I call `fit` on vector_variable_train to learn each feature's standard deviation, then `transform` to apply it. The result is the train dataframe with one extra column, "scaled_features", containing each continuous variable divided by its standard deviation. The same scaling is then applied to the test data.

## Train

In [None]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="continuous_features", outputCol="scaled_features")

In [None]:
scaled_df_train = scaler.fit(vector_variable_train).transform(vector_variable_train)
scaled_df_train.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value| continuous_features|     scaled_features|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|[-114.31,34.19,15...|[-57.007737372644...|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|[-114.47,34.4,19....|[-57.087531248767...|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|[-114.56,33.69,17...|[-57.132415304086...|
|  -114.57|   33

In [None]:
display_df = scaled_df_train.select("continuous_features", "scaled_features")
display_df.show(10, False)

+-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|continuous_features                                    |scaled_features                                                                                                                                            |
+-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|[-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936] |[-57.00773737264437,15.996520574532891,1.191711694581097,2.574374430228716,3.043894826413315,0.8842596012848164,1.2275017368353283,0.7827450136369047]     |
|[-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82]    |[-57.0875312487674,16.09477355261572,1.5095014798027229,3.509259513765089,4.51008890491

## Test

In [None]:
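# fit on train only, then reuse those statistics on test so both sets share one scale
# (the output shown below came from a scaler fit on test, so rerunning changes the values slightly)
scaled_df_test = scaler.fit(vector_variable_train).transform(vector_variable_test)
scaled_df_test.show()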

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value| continuous_features|     scaled_features|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+--------------------+--------------------+
|  -122.05|   37.37|              27.0|     3885.0|         661.0|    1537.0|     606.0|       6.6085|          344700.0|[-122.05,37.37,27...|[-61.179898510960...|
|   -118.3|   34.26|              43.0|     1510.0|         310.0|     809.0|     277.0|        3.599|          176500.0|[-118.3,34.26,43....|[-59.300139236760...|
|  -117.81|   33.78|              27.0|     3589.0|         507.0|    1484.0|     495.0|       5.7934|          270500.0|[-117.81,33.78,27...|[-59.054517358265...|
|  -118.36|   33

In [None]:
display_df = scaled_df_test.select("continuous_features", "scaled_features")
display_df.show(10, False)

+-----------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|continuous_features                                  |scaled_features                                                                                                                                           |
+-----------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|[-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085]|[-61.17989851096071,17.54732346515654,2.1504698821968056,1.8022880025659722,1.5902635715431972,1.491446724190841,1.6583534184903461,3.5634716643714146]   |
|[-118.3,34.26,43.0,1510.0,310.0,809.0,277.0,3.599]   |[-59.30013923676077,16.087002994815712,3.4248224049800977,0.7005031876125143,0.7458119624483981,0.785

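As you can see, every feature has been divided by its standard deviation, so each column of the scaled vector has unit variance. The means are not 0, though: StandardScaler defaults to withMean=False, which is why longitude is still around -57 (-114.31 divided by a standard deviation of about 2.005). Setting withMean=True would also center each feature at 0.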
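
The difference between scaling only and scaling plus centering can be sketched in plain Python on a toy column. The numbers are hypothetical, and the sketch assumes the scaler divides by the sample standard deviation, as the StandardScaler documentation describes.

```python
from statistics import mean, stdev

column = [1.0, 2.0, 3.0, 4.0]   # hypothetical feature values
mu = mean(column)
sigma = stdev(column)           # sample standard deviation

# what withMean=False, withStd=True does: divide by sigma only
scaled_only = [x / sigma for x in column]

# what withMean=True, withStd=True does: subtract the mean, then divide by sigma
centered = [(x - mu) / sigma for x in column]
```

Both versions end up with unit standard deviation, but only `centered` has a mean of 0; `scaled_only` keeps a shifted mean, just like the longitude values in scaled_features.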

# Bullets
Don't skip this - 20 points on the line! Do a good job here.

--The Polynomial Expansion method can be used when we suspect that there may be non-linear relationships between the input features and the output variable. 

--The Bucketizer method can be used when we want to transform a continuous variable into a categorical variable.

--The Normalizer method can be used when we want every row (sample) vector to have the same length (unit norm), so that rows are compared by the relative proportions of their features rather than by their overall magnitude.

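--The Standard Scaler method can be used when we want every feature (column) to have unit standard deviation, and zero mean if withMean=True is set, so that features measured in different units contribute on a comparable scale. It does not make the data normally distributed, and it does not remove outliers.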

--In general, the best time to use each of these methods depends on the specific problem and the nature of the data. It is always a good idea to experiment with different feature engineering techniques to see which ones work best for a particular problem.

--I also got a chance to work with a housing dataframe during my OPIM 5604 class, where one of the key insights was that the price of a house is directly proportional to its size in square feet. We found this after we tried fitting the data with the square of house_size instead of just house_size. I don't know if I am right, but I can picture using the polynomial expansion function in the same light, i.e. when a particular power of an attribute is more correlated with the target variable than the attribute itself.

--It's very obvious from the way a Bucketizer is defined that it is used to convert continuous variables into categorical variables, which is definitely useful for simplifying a variable when we don't care much about its exact value and just want to know which range it falls in. I also learned that for the Bucketizer to work, the splits have to cover the whole range of the data, and the simplest way to guarantee that is to start at negative infinity and end at positive infinity, even if the data contains only positive numbers (kind of weird!!!). Anyway, just make sure that if you want to create n bins, you supply n+1 values in the splits argument.

--The normalizer function is so useful because many times while dealing with data you need it to be relative rather than absolute to get the bigger picture!!! It's very easy to compare rows once each one has been brought to unit length. It is one of the helpful methods in my opinion.

--Given how many models behave better when all inputs are on a comparable scale, StandardScaler is definitely one of the most useful methods. One caveat: outliers can still have a drastic impact on our insights, since they affect the standard deviation the scaler divides by.

--Ok, I might have blabbered a lot above, but the most important thing I learned from this assignment is that I am now equipped with various interesting feature engineering methods that I can use for data modelling by coding in Python (I am sorry, but I will be really honest: point-and-click tools for data modelling like JMP kind of suck!!!). Coding the same thing is way more fun!!! :)
