## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is the Databricks File System, which lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. However, you can switch languages in an individual cell with the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [0]:
# File location and type
file_location = "/FileStore/tables/agg_match_stats_0-2.parquet"
file_type = "parquet"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

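# Remove any files left by a previous run so the table can be recreated
dbutils.fs.rm('dbfs:/user/hive/warehouse/agg_match_stats_0', True)
permanent_table_name = "agg_match_stats_0"
df.write.format("parquet").saveAsTable(permanent_table_name)

# Reload the data from the saved table
df = spark.table(permanent_table_name)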

In [0]:
# Create a view or table

temp_table_name = "agg_match_stats_0"

df.createOrReplaceTempView(temp_table_name)

In [0]:
from pyspark.ml.classification import LogisticRegression

In [0]:
tmp_df = df.drop("match_id", "match_mode", "player_name").filter(df.party_size == 1).drop("party_size")

In [0]:
# Dropped the unwanted columns and kept only solo matches (party_size == 1).


In [0]:
from pyspark.ml.feature import VectorAssembler
# Describe how columns should be collapsed into a single row-vector
vecAssembler = VectorAssembler(
    inputCols = ["game_size", "player_assists", "player_dbno", "player_dist_ride",
                 "player_dist_walk", "player_dmg", "player_kills", "player_survive_time"],
    outputCol = "features")
# Apply the transformation
vec_data = vecAssembler.transform(tmp_df)

# Preview the new Spark DataFrame
vec_data.select("game_size", "player_assists", "player_dbno", "player_dist_ride",
                 "player_dist_walk", "player_dmg", "player_kills", "player_survive_time", "features",  "team_placement").show(10)

+---------+--------------+-----------+----------------+------------------+----------+------------+-------------------+--------------------+--------------+
|game_size|player_assists|player_dbno|player_dist_ride|  player_dist_walk|player_dmg|player_kills|player_survive_time|            features|team_placement|
+---------+--------------+-----------+----------------+------------------+----------+------------+-------------------+--------------------+--------------+
|       90|             0|          0|             0.0|        505.361755|       128|           1|             534.95|[90.0,0.0,0.0,0.0...|            39|
|       90|             0|          0|             0.0|          1151.554|       215|           1|  616.5880000000001|[90.0,0.0,0.0,0.0...|            33|
|       90|             0|          0|      3341.69238|1482.3076199999998|         0|           0|           1205.061|(8,[0,3,4,7],[90....|            23|
|       90|             0|          0|             0.0|        481.469
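
The `features` column in the preview above appears in two notations: a dense one such as `[90.0,0.0,0.0,0.0...` and a sparse one such as `(8,[0,3,4,7],[90....`, which reads as (size, indices of the non-zero entries, their values). The assembler appears to choose whichever form is more compact for each row. The sketch below is plain Python rather than Spark code; `sparse_to_dense` is a hypothetical helper, and the values are taken from the third row of the preview.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a plain list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Third preview row: game_size, player_dist_ride, player_dist_walk and
# player_survive_time are non-zero; the other four features are zero.
row = sparse_to_dense(8, [0, 3, 4, 7],
                      [90.0, 3341.69238, 1482.3076199999998, 1205.061])
print(row)
```

Both notations describe the same eight-element vector, so downstream estimators treat them identically.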

In [0]:
train_data, test_data = vec_data.randomSplit([0.8, 0.2])
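
The split above is random, so the two sets hold roughly, not exactly, 80% and 20% of the rows, and they change on every run unless a seed is supplied (`randomSplit` accepts an optional `seed` argument). The following is a minimal plain-Python sketch of the idea, not Spark's implementation; `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for k, upper in enumerate(bounds):
            if u < upper:
                splits[k].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound
            splits[-1].append(row)
    return splits

train, test = random_split(range(1000), [0.8, 0.2], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which helps when comparing models across runs.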

In [0]:
# Assembled the feature columns into a vector and split the rows into training and test sets.


In [0]:
logr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")
logr = logr.setFeaturesCol("features").setLabelCol("team_placement")
lrModel = logr.fit(train_data)
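
For intuition about `regParam` and `elasticNetParam`: Spark's documentation describes the regularization term as a mix of L1 and L2 penalties, `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The sketch below evaluates that term in plain Python for a made-up coefficient vector; `elastic_net_penalty` and `coeffs` are hypothetical and are not part of the fitted model.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a weighted mix of L1 and (halved) squared L2 norms."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    mix = elastic_net_param * l1 + (1.0 - elastic_net_param) * 0.5 * l2_squared
    return reg_param * mix

# Hypothetical coefficients, with the same settings as the cell above
coeffs = [0.5, -1.0, 0.0, 2.0]
penalty = elastic_net_penalty(coeffs, reg_param=0.3, elastic_net_param=0.8)
print(penalty)
```

With `elasticNetParam=0.8` most of the penalty comes from the L1 part, which tends to push some coefficients to exactly zero.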

In [0]:
predicted_values_ord = lrModel.transform(test_data)
predicted_values_ord.show(10)

+--------------------+---------+--------------+-----------+----------------+------------------+----------+------------+-------------------+-------+--------------+--------------------+--------------------+--------------------+----------+
|                date|game_size|player_assists|player_dbno|player_dist_ride|  player_dist_walk|player_dmg|player_kills|player_survive_time|team_id|team_placement|            features|       rawPrediction|         probability|prediction|
+--------------------+---------+--------------+-----------+----------------+------------------+----------+------------+-------------------+-------+--------------+--------------------+--------------------+--------------------+----------+
|2017-10-20T08:19:...|       95|             0|          0|             0.0|        211.052887|        34|           0| 219.74400000000003| 100083|            80|(8,[0,4,5,7],[95....|[-9.8891388892395...|[4.33620693767092...|      57.0|
|2017-10-20T08:19:...|       95|             0|     
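
The last three columns are related: for a multinomial model, `probability` is the softmax of `rawPrediction`, and `prediction` is the index of the largest probability, returned as a float. The sketch below shows that relationship in plain Python with a made-up three-class score vector; `softmax` and `raw` are hypothetical and much shorter than the real vectors.

```python
import math

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes
raw = [-1.0, 0.5, 2.0]
probs = softmax(raw)

# The predicted class is the position of the largest probability
predicted = float(max(range(len(probs)), key=probs.__getitem__))
print(probs, predicted)
```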

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

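# Create an evaluator that reports RMSE between the predicted class and the actual placement.
# The model is a classifier, so this treats the placement labels as numeric values.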
regression_eval = RegressionEvaluator(
  predictionCol = "prediction", 
  labelCol = "team_placement", 
  metricName = "rmse")

# Calculate the RMSE
rmse = regression_eval.evaluate(predicted_values_ord)

# Display the result
print(f"RMSE is {rmse:.1f}")

RMSE is 28.9
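
As a reminder of what the metric measures, RMSE is the square root of the mean squared difference between labels and predictions, so it is expressed in the same units as `team_placement`. The following plain-Python sketch uses made-up values; `rmse`, `labels` and `predictions` are hypothetical and unrelated to the result above.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical placements and predicted classes
labels = [80, 39, 33, 23]
predictions = [57.0, 39.0, 30.0, 27.0]
value = rmse(labels, predictions)
print(f"RMSE is {value:.1f}")
```

Because large misses are squared before averaging, a few badly misplaced predictions can dominate the score.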


In [0]:
# Logistic regression reference: https://spark.apache.org/docs/latest/ml-classification-regression.html