In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Kobe Bryant shot selection

Competition link: https://www.kaggle.com/c/kobe-bryant-shot-selection

Goal: Fun and education

Using 20 years of data on Kobe's swishes and misses, can you predict which shots will find the bottom of the net? This competition is well suited for practicing classification basics, feature engineering, and time series analysis. Practice got Kobe an eight-figure contract and 5 championship rings. What will it get you?

This data contains the location and circumstances of every field goal Kobe Bryant attempted during his 20-year career. Your task is to predict whether the basket went in (shot_made_flag).

We have removed 5000 of the shot_made_flags (represented as missing values in the csv file). These are the test set shots for which you must submit a prediction. You are provided a sample submission file with the correct shot_ids needed for a valid prediction.

In [2]:
data = pd.read_csv('data/Kobe.csv')

In [3]:
data.head()

Unnamed: 0,action_type,combined_shot_type,game_event_id,game_id,lat,loc_x,loc_y,lon,minutes_remaining,period,...,shot_type,shot_zone_area,shot_zone_basic,shot_zone_range,team_id,team_name,game_date,matchup,opponent,shot_id
0,Jump Shot,Jump Shot,10,20000012,33.9723,167,72,-118.1028,10,1,...,2PT Field Goal,Right Side(R),Mid-Range,16-24 ft.,1610612747,Los Angeles Lakers,2000-10-31,LAL @ POR,POR,1
1,Jump Shot,Jump Shot,12,20000012,34.0443,-157,0,-118.4268,10,1,...,2PT Field Goal,Left Side(L),Mid-Range,8-16 ft.,1610612747,Los Angeles Lakers,2000-10-31,LAL @ POR,POR,2
2,Jump Shot,Jump Shot,35,20000012,33.9093,-101,135,-118.3708,7,1,...,2PT Field Goal,Left Side Center(LC),Mid-Range,16-24 ft.,1610612747,Los Angeles Lakers,2000-10-31,LAL @ POR,POR,3
3,Jump Shot,Jump Shot,43,20000012,33.8693,138,175,-118.1318,6,1,...,2PT Field Goal,Right Side Center(RC),Mid-Range,16-24 ft.,1610612747,Los Angeles Lakers,2000-10-31,LAL @ POR,POR,4
4,Driving Dunk Shot,Dunk,155,20000012,34.0443,0,0,-118.2698,6,2,...,2PT Field Goal,Center(C),Restricted Area,Less Than 8 ft.,1610612747,Los Angeles Lakers,2000-10-31,LAL @ POR,POR,5


In [4]:
target = 'shot_made_flag'

**Tasks:**

1. Analyze the data. Many good analysis examples are available at https://www.kaggle.com/c/kobe-bryant-shot-selection/kernels
2. Prepare features for model training: generate new features, handle missing values, check for possible outliers, encode categorical features, and so on.
3. Train a linear model, Lasso, and Ridge on the same features. Build a comparison table of the coefficients and draw conclusions about how the coefficient magnitudes change and which ones are zeroed out. Compute the RSS.

**Optional**
4. Compare results on a test set: do a train_test_split at the very start, prepare the variables, and compare the same three classifiers using ROC AUC as the metric.
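As a rough illustration of task 3, the coefficient comparison and RSS could be sketched as follows. This is only a minimal sketch on synthetic data, not the Kobe features: the feature names `f0`..`f4`, the `alpha` values, and the use of scikit-learn's `LinearRegression`, `Lasso`, and `Ridge` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge

# Synthetic stand-in for a prepared feature matrix (assumption: not the Kobe data).
# Only the first two features actually influence y.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    'linear': LinearRegression(),
    'lasso': Lasso(alpha=0.5),    # alpha values are illustrative, not tuned
    'ridge': Ridge(alpha=10.0),
}

# One row per feature, one column per model.
coef_table = pd.DataFrame(index=['f%d' % i for i in range(X.shape[1])])
rss = {}
for name, model in models.items():
    model.fit(X, y)
    coef_table[name] = model.coef_
    residuals = y - model.predict(X)
    rss[name] = float(np.sum(residuals ** 2))  # residual sum of squares on the training data

print(coef_table.round(3))
print(rss)
```

In a table like this, Lasso typically sets the coefficients of uninformative features to exactly zero, while Ridge only shrinks them toward zero. On the training data the unregularized linear model always has the smallest RSS of the three.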

In [7]:
# Keep only the rows with a known target so the models can be evaluated later
df = data[data[target].notnull()]
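The same mask can also set aside the rows whose target was removed, in case predictions are needed for them later. The sketch below uses a tiny hypothetical frame (`toy`) rather than the real CSV; the names `known_mask`, `labeled`, and `unlabeled` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (assumption: values are invented).
toy = pd.DataFrame({
    'shot_id': [1, 2, 3, 4],
    'shot_made_flag': [np.nan, 0.0, 1.0, np.nan],
})
target = 'shot_made_flag'

known_mask = toy[target].notnull()
labeled = toy[known_mask].copy()     # rows usable for training and evaluation
unlabeled = toy[~known_mask].copy()  # rows whose flag was removed

print(len(labeled), len(unlabeled))
```

Taking `.copy()` of each slice is a design choice: it lets new feature columns be added later without pandas warning about writes to a view of the original frame.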

# EDA

### Dataset notation

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25697 entries, 1 to 30696
Data columns (total 25 columns):
action_type           25697 non-null object
combined_shot_type    25697 non-null object
game_event_id         25697 non-null int64
game_id               25697 non-null int64
lat                   25697 non-null float64
loc_x                 25697 non-null int64
loc_y                 25697 non-null int64
lon                   25697 non-null float64
minutes_remaining     25697 non-null int64
period                25697 non-null int64
playoffs              25697 non-null int64
season                25697 non-null object
seconds_remaining     25697 non-null int64
shot_distance         25697 non-null int64
shot_made_flag        25697 non-null float64
shot_type             25697 non-null object
shot_zone_area        25697 non-null object
shot_zone_basic       25697 non-null object
shot_zone_range       25697 non-null object
team_id               25697 non-null int64
team_name         

In [10]:
df.describe()

Unnamed: 0,game_event_id,game_id,lat,loc_x,loc_y,lon,minutes_remaining,period,playoffs,seconds_remaining,shot_distance,shot_made_flag,team_id,shot_id
count,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0,25697.0
mean,249.348679,24741090.0,33.953043,7.148422,91.257345,-118.262652,4.886796,2.5208,0.146243,28.311554,13.457096,0.446161,1610613000.0,15328.166946
std,149.77852,7738108.0,0.088152,110.073147,88.152106,0.110073,3.452475,1.151626,0.353356,17.523392,9.388725,0.497103,0.0,8860.462397
min,2.0,20000010.0,33.2533,-250.0,-44.0,-118.5198,0.0,1.0,0.0,0.0,0.0,0.0,1610613000.0,2.0
25%,111.0,20500060.0,33.8843,-67.0,4.0,-118.3368,2.0,1.0,0.0,13.0,5.0,0.0,1610613000.0,7646.0
50%,253.0,20900340.0,33.9703,0.0,74.0,-118.2698,5.0,3.0,0.0,28.0,15.0,0.0,1610613000.0,15336.0
75%,367.0,29600270.0,34.0403,94.0,160.0,-118.1758,8.0,3.0,0.0,43.0,21.0,1.0,1610613000.0,22976.0
max,653.0,49900090.0,34.0883,248.0,791.0,-118.0218,11.0,7.0,1.0,59.0,79.0,1.0,1610613000.0,30697.0


In [11]:
df.dtypes

action_type            object
combined_shot_type     object
game_event_id           int64
game_id                 int64
lat                   float64
loc_x                   int64
loc_y                   int64
lon                   float64
minutes_remaining       int64
period                  int64
playoffs                int64
season                 object
seconds_remaining       int64
shot_distance           int64
shot_made_flag        float64
shot_type              object
shot_zone_area         object
shot_zone_basic        object
shot_zone_range        object
team_id                 int64
team_name              object
game_date              object
matchup                object
opponent               object
shot_id                 int64
dtype: object

In [16]:
df[['action_type','combined_shot_type','shot_type']].head()

Unnamed: 0,action_type,combined_shot_type,shot_type
1,Jump Shot,Jump Shot,2PT Field Goal
2,Jump Shot,Jump Shot,2PT Field Goal
3,Jump Shot,Jump Shot,2PT Field Goal
4,Driving Dunk Shot,Dunk,2PT Field Goal
5,Jump Shot,Jump Shot,2PT Field Goal


combined_shot_type - the general type of shot used to put the ball in the basket

Bank Shot - the ball bounces off the backboard into the basket

Hook Shot - taken while jumping, usually with the body perpendicular to the basket; the ball is thrown with one hand while the other arm wards off the defender

Tip Shot - taken after a first attempt fails to go in: as the ball bounces off the rim, the player jumps and tries to push it back into the basket, usually with one hand (the fingertips), without catching the ball in both hands and landing with it

Dunk - taken while jumping, when the player controls the ball above the horizontal plane of the rim and drives it straight down into the basket with one or two hands

Layup - taken with one hand from under the basket; the other hand is used for balance

Jump Shot - taken while jumping; the ball is raised above the head and sent toward the basket

In [19]:
df['combined_shot_type'].value_counts()

Jump Shot    19710
Layup         4532
Dunk          1056
Tip Shot       152
Hook Shot      127
Bank Shot      120
Name: combined_shot_type, dtype: int64

shot_type - the number of points awarded for a made shot. Free throws (unopposed attempts to score one or more points, awarded after a foul by the opposing team) are not included. 2PT = 2 points, 3PT = 3 points.
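Because the label starts with the point value, it can be turned into a numeric feature. A minimal sketch, assuming every label begins with a single digit as in the two values seen in this dataset; the names `points` and `is_three` are hypothetical:

```python
import pandas as pd

# Toy series with the two label values that occur in the dataset.
shot_type = pd.Series(['2PT Field Goal', '3PT Field Goal', '2PT Field Goal'])

# Take the leading digit of the label and convert it to an integer point value.
points = shot_type.str[0].astype(int)

# Binary indicator for three-point attempts.
is_three = (points == 3).astype(int)

print(points.tolist(), is_three.tolist())
```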

In [23]:
#In basketball, a field goal is a basket scored on any shot or tap other than a free throw, worth two or three points depending on the distance of the attempt from the basket. 
df['shot_type'].value_counts()

2PT Field Goal    20285
3PT Field Goal     5412
Name: shot_type, dtype: int64

In [24]:
df['action_type'].value_counts()

Jump Shot                          15836
Layup Shot                          2154
Driving Layup Shot                  1628
Turnaround Jump Shot                 891
Fadeaway Jump Shot                   872
Running Jump Shot                    779
Pullup Jump shot                     402
Turnaround Fadeaway shot             366
Slam Dunk Shot                       334
Reverse Layup Shot                   333
Jump Bank Shot                       289
Driving Dunk Shot                    257
Dunk Shot                            217
Tip Shot                             151
Step Back Jump shot                  106
Alley Oop Dunk Shot                   95
Floating Jump shot                    93
Driving Reverse Layup Shot            83
Hook Shot                             73
Driving Finger Roll Shot              68
Alley Oop Layup shot                  67
Reverse Dunk Shot                     61
Driving Finger Roll Layup Shot        59
Turnaround Bank shot                  58
Running Layup Sh

In [27]:
#The playoffs, play-offs, postseason and/or finals of a sports league are a competition played after the regular season by the top competitors to determine the league champion or a similar accolade. 
df['playoffs'].value_counts()

0    21939
1     3758
Name: playoffs, dtype: int64

In [28]:
df['period'].value_counts()

3    7002
1    6700
4    6043
2    5635
5     280
6      30
7       7
Name: period, dtype: int64

In [29]:
df.columns

Index(['action_type', 'combined_shot_type', 'game_event_id', 'game_id', 'lat',
       'loc_x', 'loc_y', 'lon', 'minutes_remaining', 'period', 'playoffs',
       'season', 'seconds_remaining', 'shot_distance', 'shot_made_flag',
       'shot_type', 'shot_zone_area', 'shot_zone_basic', 'shot_zone_range',
       'team_id', 'team_name', 'game_date', 'matchup', 'opponent', 'shot_id'],
      dtype='object')

In [31]:
df[['shot_zone_area', 'shot_zone_basic', 'shot_zone_range']].head()

Unnamed: 0,shot_zone_area,shot_zone_basic,shot_zone_range
1,Left Side(L),Mid-Range,8-16 ft.
2,Left Side Center(LC),Mid-Range,16-24 ft.
3,Right Side Center(RC),Mid-Range,16-24 ft.
4,Center(C),Restricted Area,Less Than 8 ft.
5,Left Side(L),Mid-Range,8-16 ft.


In [32]:
df['shot_zone_basic'].value_counts()

Mid-Range                10532
Restricted Area           5932
Above the Break 3         4720
In The Paint (Non-RA)     3880
Right Corner 3             333
Left Corner 3              240
Backcourt                   60
Name: shot_zone_basic, dtype: int64

In [34]:
#https://hooptactics.com/Basketball_Basics_Court_Areas
df['shot_zone_area'].value_counts()

Center(C)                11289
Right Side Center(RC)     3981
Right Side(R)             3859
Left Side Center(LC)      3364
Left Side(L)              3132
Back Court(BC)              72
Name: shot_zone_area, dtype: int64

In [37]:
# matchup between Kobe's team and the opposing team
df['matchup'].head(20)

1       LAL @ POR
2       LAL @ POR
3       LAL @ POR
4       LAL @ POR
5       LAL @ POR
6       LAL @ POR
8       LAL @ POR
9       LAL @ POR
10      LAL @ POR
11    LAL vs. UTA
12    LAL vs. UTA
13    LAL vs. UTA
14    LAL vs. UTA
15    LAL vs. UTA
17    LAL vs. UTA
18    LAL vs. UTA
20    LAL vs. UTA
21    LAL vs. UTA
22    LAL vs. UTA
23    LAL vs. UTA
Name: matchup, dtype: object
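The matchup string comes in two forms, `LAL @ POR` and `LAL vs. UTA`. A possible way to turn it into features is sketched below, under the assumption (not stated in the dataset itself) that `@` marks an away game and `vs.` a home game. The names `is_away` and `opponent_from_matchup` are hypothetical.

```python
import pandas as pd

# The two matchup formats that appear in the output above.
matchup = pd.Series(['LAL @ POR', 'LAL vs. UTA'])

# Assumption: '@' means the game was played at the opponent's arena.
is_away = matchup.str.contains('@', regex=False).astype(int)

# The opponent abbreviation is the last whitespace-separated token.
opponent_from_matchup = matchup.str.split().str[-1]

print(is_away.tolist(), opponent_from_matchup.tolist())
```

The extracted opponent should duplicate the existing `opponent` column, so in practice only the home/away flag is likely to add information.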

In [41]:
# the date is stored as a string
type(df['game_date'].values[0])

str
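Since the dates are plain strings, they would need to be parsed before any calendar features can be derived. A minimal sketch on a toy series, assuming the `YYYY-MM-DD` format seen in `data.head()`; the second date and the derived names are invented for the example:

```python
import pandas as pd

# Toy dates in the same format as game_date (the second value is made up).
game_date = pd.Series(['2000-10-31', '2000-11-01'])

# Parse the strings into datetime64 values.
parsed = pd.to_datetime(game_date, format='%Y-%m-%d')

# Calendar components that could serve as features.
year = parsed.dt.year
month = parsed.dt.month
day_of_week = parsed.dt.dayofweek  # Monday is 0, Sunday is 6

print(year.tolist(), month.tolist(), day_of_week.tolist())
```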

### Plots and distributions