## Overview

This notebook shows you how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. This notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python** so the default cell type is Python. However, you can use different languages by using the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [0]:
# File location and type
file_location = "/FileStore/tables/taxi_data_sub__1_.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecodeid,store_and_fwd_flag,pulocationid,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,borough_pu,median_rlst_pu,tourist_pu,entert_pu,park_pu,workplace_pu,residential_pu,borough_do,median_rlst_do,tourist_do,entert_do,park_do,workplace_do,residential_do,rate_fare,temperature,humidity,wind speed,pressure,precip,condition,date,year,month,trip_time,covid
2,2019-12-09 07:21:45,2019-12-09 07:46:41,1,6.15,1,N,262,244,1,23.5,0.0,0.5,2.5,0.0,0.3,29.3,2.5,Manhattan,0,0,1,0,0,0,Manhattan,603593,0,0,0,0,1,3.82113821138211,42,89,10,30.16,0.0,Light Rain,2019-12-09,2019,12,-441.75,0
2,2019-08-29 18:08:40,2019-08-29 18:15:11,2,1.17,1,N,233,107,1,6.5,1.0,0.5,1.0,0.0,0.3,11.8,2.5,Manhattan,0,0,0,0,1,0,Manhattan,0,0,0,1,0,0,5.55555555555556,79,38,18,29.88,0.0,Fair,2019-08-29,2019,8,-1088.66666666667,0
2,2019-07-12 19:48:09,2019-07-12 20:03:57,1,6.2,1,N,261,162,1,20.5,1.0,0.5,4.96,0.0,0.3,29.76,2.5,Manhattan,0,1,0,0,0,0,Manhattan,0,0,1,0,0,0,3.30645161290323,83,41,16,29.76,0.0,Fair,2019-07-12,2019,7,-1188.15,0
2,2019-12-31 19:07:46,2019-12-31 19:20:57,1,2.32,1,N,79,170,1,11.0,1.0,0.5,3.06,0.0,0.3,18.36,2.5,Manhattan,0,0,1,0,0,0,Manhattan,0,0,1,0,0,0,4.74137931034483,44,71,10,29.61,0.0,Mostly Cloudy,2019-12-31,2019,12,-1147.76666666667,0
2,2019-09-09 08:52:01,2019-09-09 08:58:50,2,1.41,1,N,179,146,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3,0.0,Queens,642130,0,0,0,0,1,Queens,1318454,0,0,0,0,1,4.60992907801419,68,65,12,30.2,0.0,Mostly Cloudy,2019-09-09,2019,9,-532.016666666667,0
1,2019-12-18 07:58:31,2019-12-18 08:15:09,1,2.9,1,N,75,162,1,13.0,2.5,0.5,3.25,0.0,0.3,19.55,2.5,Manhattan,1360925,0,0,0,0,1,Manhattan,0,0,1,0,0,0,4.48275862068966,33,66,12,29.83,0.0,Fair,2019-12-18,2019,12,-478.516666666667,0
2,2019-09-17 17:43:57,2019-09-17 18:18:02,2,4.81,1,N,68,263,1,23.0,1.0,0.5,3.0,0.0,0.3,30.3,2.5,Manhattan,0,0,1,0,0,0,Manhattan,0,0,1,0,0,0,4.78170478170478,75,23,14,30.06,0.0,Fair,2019-09-17,2019,9,-1063.95,0
1,2019-11-23 00:10:59,2019-11-23 00:22:53,2,2.2,1,N,148,164,1,10.0,3.0,0.5,2.75,0.0,0.3,16.55,2.5,Manhattan,0,0,1,0,0,0,Manhattan,0,0,1,0,0,0,4.54545454545455,41,70,5,29.72,0.0,Light Rain,2019-11-23,2019,11,-10.9833333333333,0
2,2019-07-17 15:46:54,2019-07-17 15:54:00,5,1.37,1,N,237,43,1,7.0,0.0,0.5,2.06,0.0,0.3,12.36,2.5,Manhattan,0,0,1,0,0,0,Manhattan,0,1,0,1,0,0,5.10948905109489,92,54,13,29.83,0.0,Fair,2019-07-17,2019,7,-946.9,0
2,2019-09-12 14:20:36,2019-09-12 14:33:58,1,0.94,1,N,164,163,1,9.0,0.0,0.5,2.46,0.0,0.3,14.76,2.5,Manhattan,0,0,1,0,0,0,Manhattan,0,0,1,0,0,0,9.57446808510638,73,81,13,30.06,0.0,Cloudy,2019-09-12,2019,9,-860.6,0


In [0]:
# Create a view or table

temp_table_name = "df"

df.createOrReplaceTempView(temp_table_name)

In [0]:
%sql

/* Query the created temp table in a SQL cell */

select * from `df`

vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecodeid,store_and_fwd_flag,pulocationid,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,borough_pu,median_rlst_pu,tourist_pu,entert_pu,park_pu,workplace_pu,residential_pu,borough_do,median_rlst_do,tourist_do,entert_do,park_do,workplace_do,residential_do,rate_fare,temperature,humidity,wind speed,pressure,precip,condition,date,year,month,trip_time,covid
2,2019-12-09 07:21:45,2019-12-09 07:46:41,1,6.15,1,N,262,244,1,23.5,0.0,0.5,2.5,0.0,0.3,29.3,2.5,Manhattan,0,0,1,0,0,0,Manhattan,603593,0,0,0,0,1,3.82113821138211,42,89,10,30.16,0.0,Light Rain,2019-12-09,2019,12,-441.75,0
2,2019-08-29 18:08:40,2019-08-29 18:15:11,2,1.17,1,N,233,107,1,6.5,1.0,0.5,1.0,0.0,0.3,11.8,2.5,Manhattan,0,0,0,0,1,0,Manhattan,0,0,0,1,0,0,5.55555555555556,79,38,18,29.88,0.0,Fair,2019-08-29,2019,8,-1088.66666666667,0
2,2019-07-12 19:48:09,2019-07-12 20:03:57,1,6.2,1,N,261,162,1,20.5,1.0,0.5,4.96,0.0,0.3,29.76,2.5,Manhattan,0,1,0,0,0,0,Manhattan,0,0,1,0,0,0,3.30645161290323,83,41,16,29.76,0.0,Fair,2019-07-12,2019,7,-1188.15,0
2,2019-12-31 19:07:46,2019-12-31 19:20:57,1,2.32,1,N,79,170,1,11.0,1.0,0.5,3.06,0.0,0.3,18.36,2.5,Manhattan,0,0,1,0,0,0,Manhattan,0,0,1,0,0,0,4.74137931034483,44,71,10,29.61,0.0,Mostly Cloudy,2019-12-31,2019,12,-1147.76666666667,0
2,2019-09-09 08:52:01,2019-09-09 08:58:50,2,1.41,1,N,179,146,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3,0.0,Queens,642130,0,0,0,0,1,Queens,1318454,0,0,0,0,1,4.60992907801419,68,65,12,30.2,0.0,Mostly Cloudy,2019-09-09,2019,9,-532.016666666667,0
1,2019-12-18 07:58:31,2019-12-18 08:15:09,1,2.9,1,N,75,162,1,13.0,2.5,0.5,3.25,0.0,0.3,19.55,2.5,Manhattan,1360925,0,0,0,0,1,Manhattan,0,0,1,0,0,0,4.48275862068966,33,66,12,29.83,0.0,Fair,2019-12-18,2019,12,-478.516666666667,0
2,2019-09-17 17:43:57,2019-09-17 18:18:02,2,4.81,1,N,68,263,1,23.0,1.0,0.5,3.0,0.0,0.3,30.3,2.5,Manhattan,0,0,1,0,0,0,Manhattan,0,0,1,0,0,0,4.78170478170478,75,23,14,30.06,0.0,Fair,2019-09-17,2019,9,-1063.95,0
1,2019-11-23 00:10:59,2019-11-23 00:22:53,2,2.2,1,N,148,164,1,10.0,3.0,0.5,2.75,0.0,0.3,16.55,2.5,Manhattan,0,0,1,0,0,0,Manhattan,0,0,1,0,0,0,4.54545454545455,41,70,5,29.72,0.0,Light Rain,2019-11-23,2019,11,-10.9833333333333,0
2,2019-07-17 15:46:54,2019-07-17 15:54:00,5,1.37,1,N,237,43,1,7.0,0.0,0.5,2.06,0.0,0.3,12.36,2.5,Manhattan,0,0,1,0,0,0,Manhattan,0,1,0,1,0,0,5.10948905109489,92,54,13,29.83,0.0,Fair,2019-07-17,2019,7,-946.9,0
2,2019-09-12 14:20:36,2019-09-12 14:33:58,1,0.94,1,N,164,163,1,9.0,0.0,0.5,2.46,0.0,0.3,14.76,2.5,Manhattan,0,0,1,0,0,0,Manhattan,0,0,1,0,0,0,9.57446808510638,73,81,13,30.06,0.0,Cloudy,2019-09-12,2019,9,-860.6,0


In [0]:
# With this registered as a temp view, it will only be available to this particular notebook. If you'd like other users to be able to query this table, you can also create a table from the DataFrame.
# Once saved, this table will persist across cluster restarts as well as allow various users across different notebooks to query this data.
# To do so, choose your table name and uncomment the bottom line.

permanent_table_name = "df"

# df.write.format("parquet").saveAsTable(permanent_table_name)

In [0]:
# Import the required libraries

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler,StringIndexer
from pyspark.ml import Pipeline

In [0]:
borough_pu_indexer = StringIndexer(inputCol='borough_pu',outputCol='borough_pu_index',handleInvalid='keep')
condition_indexer = StringIndexer(inputCol='condition',outputCol='condition_index',handleInvalid='keep')
borough_do_indexer = StringIndexer(inputCol='borough_do',outputCol='borough_do_index',handleInvalid='keep')


In [0]:
# Vector assembler is used to create a vector of input features
assembler=VectorAssembler(inputCols=['passenger_count','pulocationid','dolocationid','fare_amount','tip_amount','total_amount','trip_distance','borough_pu_index','condition_index','borough_do_index'],outputCol="features")
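
As a rough plain-Python picture of what the indexers and the assembler above do (this is not the Spark API: `index_by_frequency` and `assemble` are hypothetical helper names, and the sample rows keep only a few columns from the table shown earlier). By default `StringIndexer` gives index 0 to the most frequent label, and `VectorAssembler` concatenates the chosen columns into one numeric vector.

```python
from collections import Counter

def index_by_frequency(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def assemble(row, input_cols):
    """Concatenate the selected columns of a row into one list of floats."""
    return [float(row[c]) for c in input_cols]

# Three sample rows with values taken from the table above (only a few columns kept).
sample_rows = [
    {"passenger_count": 1, "fare_amount": 23.5, "borough_pu": "Manhattan"},
    {"passenger_count": 2, "fare_amount": 6.5, "borough_pu": "Manhattan"},
    {"passenger_count": 2, "fare_amount": 6.5, "borough_pu": "Queens"},
]

# Index the categorical column, then attach the index to each row.
borough_index = index_by_frequency(r["borough_pu"] for r in sample_rows)
for r in sample_rows:
    r["borough_pu_index"] = borough_index[r["borough_pu"]]

# Build one feature vector per row from the numeric and indexed columns.
sample_features = [
    assemble(r, ["passenger_count", "fare_amount", "borough_pu_index"])
    for r in sample_rows
]
print(borough_index)
print(sample_features)
```

In the real pipeline the three indexers run first so that the `*_index` columns exist by the time the assembler stage reads them.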


In [0]:
pipe = Pipeline(stages=[borough_pu_indexer,condition_indexer,borough_do_indexer,assembler])

In [0]:
final_df=pipe.fit(df).transform(df)

In [0]:
kmeans_model = KMeans(k=4)


In [0]:
fit_model=kmeans_model.fit(final_df)


In [0]:
# computeCost() was deprecated in Spark 2.4 and removed in Spark 3.0, so read the cost from the training summary instead.
wssse = fit_model.summary.trainingCost
print("The within set sum of squared error of the model is {}".format(wssse))

The within set sum of squared error of the model is 176665917.24196485


In [0]:
The fitted model has a within-set sum of squared errors (WSSSE) of 176665917.24196485 on the training data.
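
To make the metric concrete, here is a minimal plain-Python sketch of a within-set sum of squared errors: each point contributes its squared Euclidean distance to the nearest center. The points and centers are made-up toy values, and `wssse_of` is a hypothetical helper, not the Spark implementation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wssse_of(points, centers):
    """Sum, over all points, of the squared distance to the closest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)

# Toy data: two tight groups, each exactly 1 unit away from its center.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
toy_centers = [(0.0, 1.0), (10.0, 11.0)]

print(wssse_of(toy_points, toy_centers))
```

Because the value is a sum of squared distances in the original feature units, it grows with both the number of rows and the scale of the features, so it is mainly useful for comparing runs on the same data (for example, different values of k).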

In [0]:
centers = fit_model.clusterCenters()

In [0]:
print("Cluster Centers")
for index, cluster in enumerate(centers, start=1):
    print("Centroid {}: {}".format(index, cluster))

Cluster Centers
Centroid 1: [1.53074058e+00 2.35223457e+02 1.97766347e+02 1.08951603e+01
 2.62768728e+00 1.72705251e+01 2.20950363e+00 7.51572135e-02
 2.42383540e+00 1.19176995e-01]
Centroid 2: [1.55402470e+00 1.10904013e+02 1.27825120e+02 1.42735142e+01
 3.26855305e+00 2.15959319e+01 3.44646067e+00 1.17996799e-01
 2.42030643e+00 1.69163046e-01]
Centroid 3: [1.54120062e+00 1.96427707e+02 6.56214469e+01 1.45568938e+01
 3.27954438e+00 2.17144361e+01 3.45308979e+00 9.24576706e-02
 2.42144690e+00 2.92560287e-01]
Centroid 4: [1.54513154e+00 1.24981932e+02 2.35324161e+02 1.37555314e+01
 3.15117705e+00 2.08446356e+01 3.20227623e+00 1.03416994e-01
 2.41253402e+00 1.45751436e-01]


In [0]:
# Centroid 1 has the lowest mean fare (about 10.90) and the shortest mean trip distance (about 2.21).
# Centroids 2, 3, and 4 have higher, similar mean fares (about 13.76 to 14.56); they differ mainly in pulocationid and dolocationid.
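
One way to check statements like these is to pair each centroid coordinate with its feature name. The sketch below copies the feature order from the `VectorAssembler` cell and the centroid values from the printed output, rounded to two decimals; `printed_centers` and `labelled` are illustrative names, not part of the original notebook.

```python
# Feature order as passed to VectorAssembler(inputCols=...).
feature_names = ['passenger_count', 'pulocationid', 'dolocationid', 'fare_amount',
                 'tip_amount', 'total_amount', 'trip_distance',
                 'borough_pu_index', 'condition_index', 'borough_do_index']

# Centroid values copied from the printed output, rounded to two decimals.
printed_centers = [
    [1.53, 235.22, 197.77, 10.90, 2.63, 17.27, 2.21, 0.08, 2.42, 0.12],
    [1.55, 110.90, 127.83, 14.27, 3.27, 21.60, 3.45, 0.12, 2.42, 0.17],
    [1.54, 196.43, 65.62, 14.56, 3.28, 21.71, 3.45, 0.09, 2.42, 0.29],
    [1.55, 124.98, 235.32, 13.76, 3.15, 20.84, 3.20, 0.10, 2.41, 0.15],
]

# Turn each centroid into a {feature name: value} mapping.
labelled = [dict(zip(feature_names, center)) for center in printed_centers]

for number, center in enumerate(labelled, start=1):
    print("Centroid {}: fare_amount={}, trip_distance={}".format(
        number, center['fare_amount'], center['trip_distance']))

# 1-based number of the centroid with the lowest mean fare.
lowest_fare = min(range(len(labelled)), key=lambda i: labelled[i]['fare_amount']) + 1
print("Lowest mean fare: Centroid {}".format(lowest_fare))
```

In the notebook itself the same pairing could be applied directly to `centers` instead of the copied values.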

In [0]:
results = fit_model.transform(final_df)

In [0]:
results.select(['passenger_count','pulocationid','trip_distance','dolocationid','fare_amount','tip_amount','total_amount','prediction']).show()

+---------------+------------+-------------+------------+-----------+----------+------------+----------+
|passenger_count|pulocationid|trip_distance|dolocationid|fare_amount|tip_amount|total_amount|prediction|
+---------------+------------+-------------+------------+-----------+----------+------------+----------+
|              1|         262|         6.15|         244|       23.5|       2.5|        29.3|         0|
|              2|         233|         1.17|         107|        6.5|       1.0|        11.8|         2|
|              1|         261|          6.2|         162|       20.5|      4.96|       29.76|         0|
|              1|          79|         2.32|         170|       11.0|      3.06|       18.36|         1|
|              2|         179|         1.41|         146|        6.5|       1.0|         8.3|         1|
|              1|          75|          2.9|         162|       13.0|      3.25|       19.55|         1|
|              2|          68|         4.81|         26

In [0]:
Observation:
The table above shows the cluster that the k-means model assigns to each trip. Because the model was fit with k=4, the prediction column takes the values 0 to 3, and prediction n corresponds to Centroid n+1 in the printed list of cluster centers.
The prediction is a cluster label, not a score or a probability, so a value of 3 is not "higher" than a value of 0. In the rows shown, the trips with pulocationid 262 and 261 fall in cluster 0, the trip with pulocationid 233 falls in cluster 2, and the trips with pulocationid 79, 179, and 75 fall in cluster 1.
Passenger count does not separate the clusters: every centroid has a mean passenger_count of about 1.5, and in the rows shown trips with 1 passenger appear in clusters 0 and 1 while trips with 2 passengers appear in clusters 1 and 2. The centroids differ mostly in pulocationid and dolocationid.
Fare amount is not a clean separator either: the two cluster 0 rows shown have high fares (23.5 and 20.5), yet the cluster 0 centroid has the lowest mean fare (about 10.90).
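
How a row ends up with a given `prediction` can be sketched in plain Python: k-means assigns each point to the nearest cluster center, and the label is that center's position in `clusterCenters()`. The sketch is an approximation under a stated assumption: it uses only the seven numeric features, because the three `*_index` values for each row are not shown in the output. `nearest_center` is a hypothetical helper, and the centers are copied from the printed output, rounded to two decimals.

```python
def nearest_center(point, centers):
    """Return the 0-based index of the center closest to point (squared Euclidean distance)."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(point, center))
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i]))

# Order: passenger_count, pulocationid, dolocationid, fare_amount,
#        tip_amount, total_amount, trip_distance (index features omitted, see assumption above).
numeric_centers = [
    [1.53, 235.22, 197.77, 10.90, 2.63, 17.27, 2.21],
    [1.55, 110.90, 127.83, 14.27, 3.27, 21.60, 3.45],
    [1.54, 196.43, 65.62, 14.56, 3.28, 21.71, 3.45],
    [1.55, 124.98, 235.32, 13.76, 3.15, 20.84, 3.20],
]

# Rows 1, 2, and 4 of the table above, in the same feature order.
table_rows = [
    [1, 262, 244, 23.5, 2.5, 29.3, 6.15],
    [2, 233, 107, 6.5, 1.0, 11.8, 1.17],
    [1, 79, 170, 11.0, 3.06, 18.36, 2.32],
]

assigned = [nearest_center(row, numeric_centers) for row in table_rows]
print(assigned)
```

For these three rows the sketch gives 0, 2, and 1, matching the `prediction` column. The squared differences in pulocationid and dolocationid are in the thousands while those of the other features are far smaller, which is why the location IDs dominate the assignment when the features are not scaled.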

In [0]:
results.groupby('prediction').count().sort('prediction').show()

+----------+-----+
|prediction|count|
+----------+-----+
|         0|16379|
|         1|17492|
|         2| 9745|
|         3|13228|
+----------+-----+

