# LeBron James
Purpose: hands-on PySpark code

Data Preprocessing, Exploratory Data Analysis, Machine Learning


![](lebronjames.jpeg)

In [1]:
import findspark
findspark.init('/home/sam/spark-2.3.0-bin-hadoop2.7')

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('lebron').getOrCreate()

In [4]:
data = spark.read.csv('data/LebronJamesRegularSeason.csv', inferSchema=True, header=True)

### Peek at Data 
LeBron James stats from: www.basketball-reference.com/players/j/jamesle01/gamelog/2018
- Rank (Rk)
- Season Game (G)
- Date the game was played (Date): MM/DD/YY in this CSV, e.g. 10/17/17
- Age (Age): years-days, e.g. 32-291
- Team (Tm)
- Home or away (unnamed column 5, read as `_c5`): `@` means LeBron played away; an empty value means he played at home
- Opponent (Opp)
- Result (unnamed column 7, read as `_c7`): W/L followed by the final score difference, e.g. `W (+3)`
- Game Started (GS): 1/0
- Minutes Played (MP)
- Field Goals Made (FG)
- Field Goal Attempts (FGA)
- Field Goal Percentage (FG%)
- 3-Point Field Goals Made (3P)
- 3-Point Field Goal Attempts (3PA)
- 3-Point Field Goal Percentage (3P%)
- Free Throws Made (FT)
- Free Throw Attempts (FTA)
- Free Throw Percentage (FT%)
- Offensive Rebounds (ORB)
- Defensive Rebounds (DRB)
- Total Rebounds (TRB)
- Assists (AST)
- Steals (STL)
- Blocks (BLK)
- Turnovers (TOV)
- Personal Fouls (PF)
- Points (PTS)
- Game Score (GmSc): the formula is $$ PTS + 0.4 \times FG - 0.7 \times FGA - 0.4 \times (FTA - FT) + 0.7 \times ORB + 0.3 \times DRB + STL + 0.7 \times AST + 0.7 \times BLK - 0.4 \times PF - TOV $$ Game Score was created by John Hollinger as a rough measure of a player's productivity in a single game. Its scale is similar to that of points scored: 40 is an outstanding performance and 10 is an average one.
- Plus/Minus (+/-): the team's point differential while LeBron was on the court. For example, +5 means the Cavs outscored their opponent by 5 points during his minutes.

For more definitions go to https://www.basketball-reference.com/about/glossary.html
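
As a quick check of the Game Score formula, the sketch below evaluates it in plain Python for the first game in the CSV (10/17/17 vs. BOS). The function name `game_score` and the `opening_night` dictionary are illustrative names, not part of the notebook; the box-score values are copied from the first row of the data.

```python
def game_score(pts, fg, fga, ft, fta, orb, drb, stl, ast, blk, pf, tov):
    """Hollinger's Game Score, written term by term as in the formula above."""
    return (pts + 0.4 * fg - 0.7 * fga - 0.4 * (fta - ft)
            + 0.7 * orb + 0.3 * drb + stl + 0.7 * ast + 0.7 * blk
            - 0.4 * pf - tov)


# Box score of the 10/17/17 game, copied from the first row of the CSV
opening_night = dict(pts=29, fg=12, fga=19, ft=4, fta=4, orb=1, drb=15,
                     stl=0, ast=9, blk=2, pf=3, tov=4)

print(round(game_score(**opening_night), 1))  # 28.2
```

The result agrees with the `GmSc` value of 28.2 reported for that game.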

In [5]:
data.printSchema()

root
 |-- Rk: integer (nullable = true)
 |-- G: integer (nullable = true)
 |-- Date: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Tm: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- Opp: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- GS: integer (nullable = true)
 |-- MP: string (nullable = true)
 |-- FG: integer (nullable = true)
 |-- FGA: integer (nullable = true)
 |-- FG%: double (nullable = true)
 |-- 3P: integer (nullable = true)
 |-- 3PA: integer (nullable = true)
 |-- 3P%: double (nullable = true)
 |-- FT: integer (nullable = true)
 |-- FTA: integer (nullable = true)
 |-- FT%: double (nullable = true)
 |-- ORB: integer (nullable = true)
 |-- DRB: integer (nullable = true)
 |-- TRB: integer (nullable = true)
 |-- AST: integer (nullable = true)
 |-- STL: integer (nullable = true)
 |-- BLK: integer (nullable = true)
 |-- TOV: integer (nullable = true)
 |-- PF: integer (nullable = true)
 |-- PTS: integer (nullable = true)
 |-- G

In [6]:
data.show(5)

+---+---+--------+------+---+----+---+-------+---+--------+---+---+-----+---+---+-----+---+---+-----+---+---+---+---+---+---+---+---+---+----+---+
| Rk|  G|    Date|   Age| Tm| _c5|Opp|    _c7| GS|      MP| FG|FGA|  FG%| 3P|3PA|  3P%| FT|FTA|  FT%|ORB|DRB|TRB|AST|STL|BLK|TOV| PF|PTS|GmSc|+/-|
+---+---+--------+------+---+----+---+-------+---+--------+---+---+-----+---+---+-----+---+---+-----+---+---+---+---+---+---+---+---+---+----+---+
|  1|  1|10/17/17|32-291|CLE|null|BOS| W (+3)|  1|41:12:00| 12| 19|0.632|  1|  5|  0.2|  4|  4|  1.0|  1| 15| 16|  9|  0|  2|  4|  3| 29|28.2|  2|
|  2|  2|10/20/17|32-294|CLE|   @|MIL|W (+19)|  1|37:25:00| 10| 16|0.625|  2|  4|  0.5|  2|  2|  1.0|  1|  4|  5|  8|  1|  1|  5|  1| 24|20.6| 13|
|  3|  3|10/21/17|32-295|CLE|null|ORL|L (-21)|  1|31:12:00|  8| 15|0.533|  1|  3|0.333|  5|  6|0.833|  0|  4|  4|  2|  1|  1|  1|  0| 22|17.6|-31|
|  4|  4|10/24/17|32-298|CLE|null|CHI| W (+7)|  1|37:15:00| 13| 20| 0.65|  4|  6|0.667|  4|  5|  0.8|  0|  2|  2| 13| 
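
The schema reads `Date`, `Age`, and `MP` as strings. The plain-Python sketch below shows one way those values could be parsed, for illustration only. The helper names are hypothetical, and the reading of `MP` is an assumption: `41:12:00` is taken to mean 41 minutes 12 seconds, with the trailing `:00` treated as an export artifact.

```python
from datetime import datetime


def parse_date(text):
    """Parse a date written as MM/DD/YY, e.g. '10/17/17'."""
    return datetime.strptime(text, '%m/%d/%y').date()


def parse_age(text):
    """Split an age written as years-days, e.g. '32-291', into two ints."""
    years, days = text.split('-')
    return int(years), int(days)


def minutes_played(text):
    """Convert '41:12:00' to decimal minutes.

    ASSUMPTION: the first two fields are minutes and seconds, and the
    trailing ':00' carries no information.
    """
    minutes, seconds = text.split(':')[:2]
    return int(minutes) + int(seconds) / 60


print(parse_date('10/17/17'))   # 2017-10-17
print(parse_age('32-291'))      # (32, 291)
print(minutes_played('41:12:00'))
```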

### Data Preprocessing

In [7]:
data = data.withColumnRenamed('_c5', 'HomeAway').withColumnRenamed('_c7', 'WinLoss')

In [8]:
# Rk looks like a running row index, the season game number (G) is not needed,
# and Age only increases along with Date, so it adds no information.
# Tm is CLE in every row shown (LeBron was not traded during the season), so drop it too.
output = data.drop('Rk', 'G', 'Age', 'Tm')

In [9]:
# Import only the functions used below instead of a wildcard import
from pyspark.sql.functions import col, regexp_extract, when

In [10]:
# Relabel the column: '@' becomes 'Away', everything else (including null) becomes 'Home'
output = output.withColumn('HomeAway', when(output.HomeAway == '@', 'Away').otherwise('Home'))
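
The home games have a null marker rather than an explicit value. In a `when(...).otherwise(...)` expression a comparison against null does not count as true, so those rows fall through to `otherwise`. The plain-Python mirror below illustrates the rule; `home_away` is a hypothetical helper, and `None` stands in for Spark's null.

```python
def home_away(marker):
    # ASSUMPTION: None plays the role of Spark's null in this sketch
    return 'Away' if marker == '@' else 'Home'


# Markers of the first three games in the data preview
for marker in [None, '@', None]:
    print(repr(marker), '->', home_away(marker))
```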

In [11]:
import re

test = 'W (-55)'
# Named groups: the result letter, then the signed margin inside the parentheses
pattern = re.compile(r'(?P<Winloss>[A-Z]+)\s+\((?P<Value>[+-]\d+)')
match = pattern.search(test)
result = match.group('Winloss') if match else None
value = match.group('Value') if match else None
print(result, value)

W -55
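
The same idea can be wrapped in a small helper that returns `None` for values that do not match and converts the margin to an integer. This is a minimal sketch: `parse_win_loss` and `WIN_LOSS` are illustrative names, and the sample strings are the results of the first four games plus one deliberately malformed value.

```python
import re

# Result letter(s), whitespace, then a signed margin after the opening parenthesis
WIN_LOSS = re.compile(r'([A-Z]+)\s+\(([+-]\d+)')


def parse_win_loss(text):
    """Return (result, margin) or (None, None) when the text does not match."""
    match = WIN_LOSS.search(text or '')
    if match is None:
        return None, None
    return match.group(1), int(match.group(2))


for sample in ['W (+3)', 'W (+19)', 'L (-21)', 'W (+7)', 'n/a']:
    print(sample, '->', parse_win_loss(sample))
```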


In [12]:
# Split the WinLoss column into WinLossResult and WinBy.
# WinLossResult is W or L; WinBy is the signed margin the game was won or lost by.
win_loss_pattern = r'([A-Z]+)\s+\(([+-]\d+)'
output = (output
          .withColumn('WinLossResult', regexp_extract(col('WinLoss'), win_loss_pattern, 1))
          .withColumn('WinBy', regexp_extract(col('WinLoss'), win_loss_pattern, 2)))

In [13]:
output = output.drop('WinLoss')

In [14]:
output.show(3)

+--------+--------+---+---+--------+---+---+-----+---+---+-----+---+---+-----+---+---+---+---+---+---+---+---+---+----+---+-------------+-----+
|    Date|HomeAway|Opp| GS|      MP| FG|FGA|  FG%| 3P|3PA|  3P%| FT|FTA|  FT%|ORB|DRB|TRB|AST|STL|BLK|TOV| PF|PTS|GmSc|+/-|WinLossResult|WinBy|
+--------+--------+---+---+--------+---+---+-----+---+---+-----+---+---+-----+---+---+---+---+---+---+---+---+---+----+---+-------------+-----+
|10/17/17|    Home|BOS|  1|41:12:00| 12| 19|0.632|  1|  5|  0.2|  4|  4|  1.0|  1| 15| 16|  9|  0|  2|  4|  3| 29|28.2|  2|            W|   +3|
|10/20/17|    Away|MIL|  1|37:25:00| 10| 16|0.625|  2|  4|  0.5|  2|  2|  1.0|  1|  4|  5|  8|  1|  1|  5|  1| 24|20.6| 13|            W|  +19|
|10/21/17|    Home|ORL|  1|31:12:00|  8| 15|0.533|  1|  3|0.333|  5|  6|0.833|  0|  4|  4|  2|  1|  1|  1|  0| 22|17.6|-31|            L|  -21|
+--------+--------+---+---+--------+---+---+-----+---+---+-----+---+---+-----+---+---+---+---+---+---+---+---+---+----+---+-------------