# Intro

I'm a beginner who started using `pandas` a week ago, and this is my first attempt at some basic data analysis.  
I would very much like to hear your feedback, comments and suggested improvements to my code.

&nbsp;

---

&nbsp;

# Importing data

First we'll import the data and check that everything loaded correctly.

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('../input/data.csv')
df.head()

In [None]:
df.shape

# Exploring the columns

We'll explore the columns and see if we can modify or drop some of them.  
The columns to drop will be collected in the `not_needed` list.

In [None]:
random_sample = df.take(np.random.permutation(len(df))[:3])
random_sample.T

In [None]:
not_needed = []

## Action type and combined shot type

In [None]:
print(df['action_type'].unique())
print(df['combined_shot_type'].unique())

Let's keep both of them.

## Game event and game IDs

`game_event_id` and `game_id` won't be needed:

In [None]:
not_needed.extend(['game_event_id', 'game_id'])

## loc_x, loc_y, lat, lon

In [None]:
sns.set_style('whitegrid')
sns.pairplot(df, vars=['loc_x', 'loc_y', 'lat', 'lon'], hue='shot_made_flag')

`loc_x` and `lon` are correlated, as are `loc_y` and `lat`, so we'll drop `lon` and `lat`.

In [None]:
not_needed.extend(['lon', 'lat'])
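
The pair plot shows the relationship visually; a correlation matrix is one way to check it numerically. The sketch below uses made-up coordinates, not the real data: the `toy` frame and the linear mappings used to build `lon` and `lat` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the real data set): lon is built as a linear
# function of loc_x and lat as a linear function of loc_y.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'loc_x': rng.integers(-250, 250, size=100),
    'loc_y': rng.integers(-40, 400, size=100),
})
toy['lon'] = -118.27 + toy['loc_x'] / 1000
toy['lat'] = 34.04 - toy['loc_y'] / 1000

# Pearson correlation between every pair of columns
corr = toy[['loc_x', 'loc_y', 'lat', 'lon']].corr()
print(corr.round(3))
```

On the real data the same call would be `df[['loc_x', 'loc_y', 'lat', 'lon']].corr()`; values close to 1 or -1 would support dropping one column of each pair.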

## Minutes and seconds remaining

`minutes_remaining` and `seconds_remaining` can be combined into a single column named `time_remaining`, measured in seconds.

In [None]:
df['time_remaining'] = 60 * df.loc[:, 'minutes_remaining'] + df.loc[:, 'seconds_remaining']

In [None]:
not_needed.extend(['minutes_remaining', 'seconds_remaining'])
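
As a quick sanity check of the conversion, here is the same formula on a tiny hand-made frame (the `toy` values are invented for illustration):

```python
import pandas as pd

# Hypothetical rows: 11:59, 0:04 and 2:30 left in the period
toy = pd.DataFrame({
    'minutes_remaining': [11, 0, 2],
    'seconds_remaining': [59, 4, 30],
})

# Total seconds left in the period
toy['time_remaining'] = 60 * toy['minutes_remaining'] + toy['seconds_remaining']
print(toy)
```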

## Period

In [None]:
df['period'].unique()

## Playoffs

In [None]:
df['playoffs'].unique()

## Shot made flag

In [None]:
df['shot_made_flag'].unique()

This is the target variable we want to predict, and later we'll split the data based on it.

## Season

In the `season` column, we'll keep just the year in which the season started and convert the column to integer.

In [None]:
df['season'] = df['season'].apply(lambda x: x[:4])
df['season'] = pd.to_numeric(df['season'])
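
The same conversion can also be written with the vectorized `.str` accessor. A minimal sketch, assuming season labels that start with a four-digit year (the sample values below are made up):

```python
import pandas as pd

# Hypothetical season labels; the 'YYYY-YY' format is an assumption
seasons = pd.Series(['1996-97', '2000-01', '2015-16'])

# Take the first four characters and convert them to integers
start_year = seasons.str[:4].astype(int)
print(start_year)
```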

## Shot distance

In [None]:
dist = pd.DataFrame({'true_dist': np.sqrt((df['loc_x']/10)**2 + (df['loc_y']/10)**2), 
                     'shot_dist': df['shot_distance']})
dist[:10]

It seems that `shot_distance` is just the floored distance calculated from the x- and y-location of a shot, so we'll use the more precise measure and drop the floored one.

In [None]:
df['shot_distance_'] = dist['true_dist']
not_needed.append('shot_distance')
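
Instead of eyeballing the first ten rows, the "floored distance" idea can be tested directly. The sketch below does so on invented coordinates (the `toy` frame is hypothetical), using the same division by 10 as above:

```python
import numpy as np
import pandas as pd

# Hypothetical shot locations, in the same units as loc_x / loc_y
toy = pd.DataFrame({
    'loc_x': [0, 30, -157, 100],
    'loc_y': [0, 40, 72, 200],
})

# Euclidean distance from the origin, scaled the same way as true_dist above
toy['true_dist'] = np.hypot(toy['loc_x'], toy['loc_y']) / 10

# Flooring should reproduce the integer distance column, if the guess is right
toy['floored'] = np.floor(toy['true_dist']).astype(int)
print(toy)
```

On the real data, something like `(np.floor(dist['true_dist']) == dist['shot_dist']).mean()` would give the share of rows where the two columns agree.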

## Shot type

In [None]:
df['shot_type'].unique()

We can create a new column `3pt_goal` which will have the value `1` for a 3pt shot and `0` for a 2pt shot, and then drop the `shot_type` column.

In [None]:
df['3pt_goal'] = df['shot_type'].str.contains('3PT').astype('int')
not_needed.append('shot_type')
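
A small illustration of how `str.contains` followed by `astype` turns the text label into a 0/1 flag. The label strings below are assumed for the example, not copied from the data:

```python
import pandas as pd

# Hypothetical shot_type labels
shot_type = pd.Series(['2PT Field Goal', '3PT Field Goal', '2PT Field Goal'])

# True where the label mentions '3PT', then cast booleans to 0/1
is_three = shot_type.str.contains('3PT').astype('int')
print(is_three)
```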

## Shot zone: range, area, basic

In [None]:
print(df['shot_zone_range'].unique())
print(df['shot_zone_area'].unique())
print(df['shot_zone_basic'].unique())

`shot_zone_range` just puts `shot_distance` into 5 bins, so we don't need it.

In [None]:
not_needed.append('shot_zone_range')


Let's visualize `shot_zone_area` and `shot_zone_basic`.  
We'll put `loc_y = 0` near the top, so right and left sides show correctly in the graph.

In [None]:
area_group = df.groupby('shot_zone_area')
basic_group = df.groupby('shot_zone_basic')

plt.subplots(1, 2, figsize=(15, 7), sharey=True)
colors = list('rgbcmyk')

plt.subplot(121)
plt.ylim(500, -50)
plt.title('shot_zone_area')
for i, (_, area) in enumerate(area_group):
    plt.scatter(area['loc_x'], area['loc_y'], alpha=0.1, color=colors[i])
    
plt.subplot(122)
plt.ylim(500, -50)
plt.title('shot_zone_basic')
for i, (_, basic) in enumerate(basic_group):
    plt.scatter(basic['loc_x'], basic['loc_y'], alpha=0.1, color=colors[i])


## Team ID and name

In [None]:
print(df['team_id'].unique())
print(df['team_name'].unique())


Those two columns are the same for all entries, so we can drop them.

In [None]:
not_needed.extend(['team_id', 'team_name'])

## Game date

We'll convert `game_date` to datetime format and then split it into year, month and weekday (0 = Monday, 6 = Sunday), so the original column won't be needed anymore.

In [None]:
df['game_date'] = pd.to_datetime(df['game_date'])
df['game_year'] = df['game_date'].dt.year
df['game_month'] = df['game_date'].dt.month
df['game_day'] = df['game_date'].dt.dayofweek

not_needed.append('game_date')
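
To confirm the weekday encoding, here is the `.dt` accessor applied to two dates with known weekdays (the dates are arbitrary examples, not from the data set):

```python
import pandas as pd

# 2024-01-01 was a Monday, 2024-01-07 a Sunday
dates = pd.to_datetime(pd.Series(['2024-01-01', '2024-01-07']))

parts = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'weekday': dates.dt.dayofweek,  # 0 = Monday, 6 = Sunday
})
print(parts)
```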


## Matchup and opponent

The `matchup` and `opponent` columns give us almost the same data. The only extra information in `matchup` is whether the game was home or away (depending on whether it contains 'vs' or '@'), so we'll make a new column with that info and then drop the `matchup` column.

In [None]:
# regex=False: match 'vs.' literally (otherwise '.' is a regex wildcard)
df['home_game'] = df['matchup'].str.contains('vs.', regex=False).astype(int)
not_needed.append('matchup')
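
One detail worth knowing: `str.contains` treats its pattern as a regular expression by default, so the dot in `'vs.'` matches any character. The sketch below shows the difference on made-up matchup strings (the third value, without a dot, is hypothetical):

```python
import pandas as pd

# Hypothetical matchup strings; the last one has no literal dot
matchups = pd.Series(['LAL vs. UTA', 'LAL @ POR', 'LAL vs UTA'])

# Default: pattern is a regex, so '.' matches the space after 'vs'
regex_match = matchups.str.contains('vs.')

# regex=False: the pattern is matched as a plain substring
literal_match = matchups.str.contains('vs.', regex=False)

print(pd.DataFrame({'matchup': matchups,
                    'regex': regex_match,
                    'literal': literal_match}))
```

For strings that always contain the dot, both variants give the same flags; the literal match is just the safer habit.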

## Shot ID

We can set `shot_id` as index:

In [None]:
df.set_index('shot_id', inplace=True)

## Exploring the columns - summary

Finally, let's drop all the columns we don't need:

In [None]:
df = df.drop(not_needed, axis=1)

In [None]:
df.shape

In [None]:
pd.set_option('display.max_columns', None)
random_sample = df.take(np.random.permutation(len(df))[:10])
random_sample.head(10)

&nbsp;

---

&nbsp;

# Splitting the data

`submission_data` contains the shots where we don't know whether he scored or not. These are the shots on which the accuracy of our model will be tested.

In [None]:
submission_data = df[df['shot_made_flag'].isnull()]
submission_data = submission_data.drop('shot_made_flag', axis=1)  # axis must be passed by keyword in recent pandas
submission_data.shape

In [None]:
data = df[df['shot_made_flag'].notnull()]
data.shape
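
The split relies on `shot_made_flag` being missing (`NaN`) for the shots to predict. A minimal sketch of the same idea on an invented four-row frame (all values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame indexed by shot_id; NaN marks an unknown outcome
toy = pd.DataFrame(
    {'shot_distance_': [2.0, 15.5, 24.1, 8.3],
     'shot_made_flag': [1.0, np.nan, 0.0, np.nan]},
    index=pd.Index([1, 2, 3, 4], name='shot_id'),
)

# Rows with an unknown outcome, without the (all-NaN) target column
unknown = toy[toy['shot_made_flag'].isnull()].drop(columns='shot_made_flag')

# Rows with a known outcome, used for exploration and training
known = toy[toy['shot_made_flag'].notnull()]

print(unknown)
print(known)
```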

&nbsp;

---

&nbsp;

# Exploring the data

## Shot accuracy

In [None]:
sns.countplot(x='shot_made_flag', data=data)

In [None]:
data['shot_made_flag'].value_counts(normalize=True)

He scores around 45% of his shots.
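
Because `shot_made_flag` only takes the values 0 and 1, its mean equals the share of made shots. That is why the accuracy plots in the rest of this notebook can be built from a plain `mean()`. A small sketch with invented flags:

```python
import pandas as pd

# Hypothetical outcomes: 2 made shots out of 5
flags = pd.Series([1.0, 0.0, 0.0, 1.0, 0.0])

# Relative frequency of each outcome
shares = flags.value_counts(normalize=True)

# Mean of a 0/1 column is the fraction of ones
accuracy = flags.mean()

print(shares)
print(accuracy)
```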

Let's see his attempts depending on the seconds to the end of a period:

In [None]:
data['time_remaining'].plot(kind='hist', bins=24, xlim=(720, 0), figsize=(12,6),
                            title='Attempts made over time\n(seconds to the end of period)')

Accuracy of those shots:

In [None]:
time_bins = np.arange(0, 721, 30)
attempts_in_time = pd.cut(data['time_remaining'], time_bins, right=False)
grouped = data.groupby(attempts_in_time)
prec = grouped['shot_made_flag'].mean()

prec[::-1].plot(kind='bar', figsize=(12, 6), ylim=(0.2, 0.5), 
                title='Shot accuracy over time\n(seconds to the end of period)')
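
To make the binning step easier to follow, here is the same `pd.cut` plus `groupby` pattern on six invented shots. With `right=False` each bin includes its left edge and excludes its right edge, so a shot at exactly 30 seconds falls into `[30, 60)`. The `toy` frame is hypothetical, and `observed=True` is used here only to hide the empty bins.

```python
import numpy as np
import pandas as pd

# Hypothetical shots: seconds left in the period and the outcome
toy = pd.DataFrame({
    'time_remaining': [0, 12, 29, 30, 45, 719],
    'shot_made_flag': [0, 0, 1, 1, 0, 1],
})

# 30-second bins covering one 12-minute period: [0, 30), [30, 60), ...
bins = np.arange(0, 721, 30)
toy['time_bin'] = pd.cut(toy['time_remaining'], bins, right=False)

# Accuracy per bin; observed=True keeps only bins that contain shots
acc = toy.groupby('time_bin', observed=True)['shot_made_flag'].mean()
print(acc)
```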

There are lots of attempts in the last 30 seconds, with much worse accuracy than usual. Let's explore that more.

## Shots in the last seconds of a period

In [None]:
last_30 = data[data['time_remaining'] < 30]
last_30['shot_made_flag'].value_counts(normalize=True)

In the last 30 seconds he scores only about 33% of his shots. Pressure?

Let's explore what happens in the last two minutes of a period.

In [None]:
last_2min = data[data['time_remaining'] <= 120]

last_2min['time_remaining'].plot(kind='hist', bins=30, xlim=(120, 0), figsize=(12,6),
                            title='Attempts made over time\n(seconds to the end of period)')

OK, this explains things a bit: there are plenty of desperate last-second shots. Let's return to the last 30 seconds.

In [None]:
last_30['time_remaining'].plot(kind='hist', bins=10, xlim=(30, 0), figsize=(12,6),
                            title='Attempts made over time\n(seconds to the end of period)')

In [None]:
last_5sec_misses = data[(data['time_remaining'] <= 5) & (data['shot_made_flag'] == 0)]
last_5sec_scores = data[(data['time_remaining'] <= 5) & (data['shot_made_flag'] == 1)]


fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(12,7))
ax1.set_ylim(800, -50)

sns.regplot(x='loc_x', y='loc_y', data=last_5sec_misses, fit_reg=False, ax=ax1, color='r')
sns.regplot(x='loc_x', y='loc_y', data=last_5sec_scores, fit_reg=False, ax=ax2, color='g')

In the last 5 seconds there are some desperate shots from far away and plenty of misses from the 3pt line, but he misses a lot even from close range.

In [None]:
last_5sec_close = data[(data['time_remaining'] <= 5) & (data['shot_distance_'] <= 20)]

last_5sec_close['shot_made_flag'].value_counts(normalize=True)

For comparison, here is the accuracy from close range when there are more than 5 seconds to go:

In [None]:
close_shots = data[(data['time_remaining'] > 5) & (data['shot_distance_'] <= 20)]

close_shots['shot_made_flag'].value_counts(normalize=True)

## Period accuracy

Number of shots taken in each period:

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="period", hue="shot_made_flag", data=data)

Accuracy:

In [None]:
period_acc = data['shot_made_flag'].groupby(data['period']).mean()
period_acc.plot(kind='barh', figsize=(12, 6))

It seems that the period of a game doesn't influence his accuracy much.

## Accuracy depending on shot type

### Combined shot type

Number of different kinds of shots:

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="combined_shot_type", hue="shot_made_flag", data=data)

Accuracy:

In [None]:
shot_type_acc = data['shot_made_flag'].groupby(data['combined_shot_type']).mean()
shot_type_acc.plot(kind='barh', figsize=(12, 6))

### Action type

Number of shots:

In [None]:
plt.figure(figsize=(12,18))
sns.countplot(y="action_type", hue="shot_made_flag", data=data)

Accuracy:

In [None]:
action_type = data['shot_made_flag'].groupby(data['action_type']).mean()
action_type.sort_values()

action_type.sort_values().plot(kind='barh', figsize=(12, 18))

## Career accuracy

Number of shots over seasons:

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="season", hue="shot_made_flag", data=data)

In [None]:
season_acc = data['shot_made_flag'].groupby(data['season']).mean()
season_acc.plot(figsize=(12, 6), title='Accuracy over seasons')

Some Wikipedia insight into what happened in the 2013-14 season, and a possible explanation for the big decline in his last seasons:

*On April 12 [2013], Bryant suffered a **torn Achilles tendon** against the Golden State Warriors, ending his [2012-13] season. (...) Bryant resumed practicing starting in November, after the start of the 2013–14 season. (...)  
Bryant resumed playing on December 8 [2013] after missing the season's first 19 games. On December 17, Bryant matched his season high of 21 points in a 96–92 win over Memphis, but he suffered a **lateral tibial plateau fracture** in his left knee that was expected to sideline him for six weeks. (...)  
On March 12, 2014, the Lakers **ruled Bryant out for the remainder of the season**, citing his need for more rehab and the limited time remaining in the season.*

So I guess he never fully recovered from his injuries, at least when it comes to shot accuracy.

## Season freshness

Number of shots each month:

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="game_month", hue="shot_made_flag", data=data)

Accuracy:

In [None]:
game_month = data['shot_made_flag'].groupby(data['game_month']).mean()
game_month.plot(kind='barh', figsize=(12, 6))

Almost the same performance throughout the season: just slightly worse accuracy at the start (month 10) and at the end (month 6) of the season, but those months have far fewer games than the other months.
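
Since the number of shots per group matters when comparing accuracies, it can help to look at both numbers side by side. A minimal sketch using `agg` on invented data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical shots: month of the game and the outcome
toy = pd.DataFrame({
    'game_month': [10, 10, 11, 11, 11, 11, 6],
    'shot_made_flag': [0, 1, 1, 0, 1, 1, 0],
})

# Accuracy ('mean') and number of shots ('count') per month
summary = toy.groupby('game_month')['shot_made_flag'].agg(['mean', 'count'])
print(summary)
```

A month with very few shots, like month 6 in this toy example, can show an extreme accuracy purely by chance.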

## Manic Monday

The month doesn't affect him, but what about the weekday?

Number of shots per weekday:

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="game_day", hue="shot_made_flag", data=data)

Accuracy:

In [None]:
game_day = data['shot_made_flag'].groupby(data['game_day']).mean()
game_day.plot(kind='barh', figsize=(12, 6))

Again no noticeable difference.

## Regular season vs playoffs

Number of shots:

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="playoffs", hue="shot_made_flag", data=data)

Accuracy:

In [None]:
playoffs = data['shot_made_flag'].groupby(data['playoffs']).mean()
playoffs.plot(kind='barh', figsize=(12, 2), xlim=(0, 0.50))

No difference between regular season and playoffs.

## Accuracy depending on the shot distance

First let's create distance categories, each 3 ft wide, plus one catch-all category for everything from 30 ft upwards.

In [None]:
distance_bins = np.append(np.arange(0, 31, 3), 300) 
distance_cat = pd.cut(data['shot_distance_'], distance_bins, right=False)

dist_data = data.loc[:, ['shot_distance_', 'shot_made_flag']]
dist_data['distance_cat'] = distance_cat

distance_cat.value_counts(sort=False)
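
To see what the bin edges look like, the sketch below builds the same `distance_bins` and assigns a few invented distances (the `dists` values are hypothetical) to their categories:

```python
import numpy as np
import pandas as pd

# Edges 0, 3, ..., 30 plus a final edge at 300 for very long shots
distance_bins = np.append(np.arange(0, 31, 3), 300)

# Hypothetical distances in feet
dists = pd.Series([0.0, 2.9, 3.0, 23.9, 30.0, 47.2])

# right=False: bins are closed on the left, e.g. [0, 3), [3, 6), ...
cats = pd.cut(dists, distance_bins, right=False)
print(cats)
```

The last two values both land in `[30, 300)`, the catch-all bin.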

Number of shots in each distance category:

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="distance_cat", hue="shot_made_flag", data=dist_data)

There is a small number of shots in [21, 24) because that's just inside the 3pt line, where it is better to step outside and go for a 3pt shot.

Accuracy by distance category:

In [None]:
dist_prec = dist_data['shot_made_flag'].groupby(dist_data['distance_cat']).mean()
dist_prec.plot(kind='bar', figsize=(12, 6))

## Accuracy based on shot zones

For the difference between `shot_zone_area` and `shot_zone_basic` see [here](#Shot-zone:-range,-area,-basic).

### Shot zone area

Number of shots:

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="shot_zone_area", hue="shot_made_flag", data=data)

Accuracy:

In [None]:
shot_area = data['shot_made_flag'].groupby(data['shot_zone_area']).mean()
shot_area.plot(kind='barh', figsize=(12, 6))

He's most accurate from the center, but what's interesting is that he's slightly more accurate from the right side.

### Shot zone basic

Number of shots:

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="shot_zone_basic", hue="shot_made_flag", data=data)

Accuracy:

In [None]:
shot_basic = data['shot_made_flag'].groupby(data['shot_zone_basic']).mean()
shot_basic.plot(kind='barh', figsize=(12, 6))

We have seen that he's more accurate from right-hand side, but when it comes to corners - left corner suits him slightly better.

## Home game vs away

Number of shots:

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="home_game", hue="shot_made_flag", data=data)

Accuracy:

In [None]:
home_acc = data['shot_made_flag'].groupby(data['home_game']).mean()
home_acc.plot(kind='barh', figsize=(12, 2))

Slightly more accurate in front of his home crowd.

## Opponents

Number of shots:

In [None]:
plt.figure(figsize=(12,16))
sns.countplot(y="opponent", hue="shot_made_flag", data=data)

Accuracy:

In [None]:
opponent = data['shot_made_flag'].groupby(data['opponent']).mean()
opponent.sort_values().plot(kind='barh', figsize=(12,10))