# Predicting 2016-2017 NBA Regular Season Results

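## Project Overview
I am a die-hard NBA fan. I have followed the league since 2003, when Yao Ming had just entered the NBA, and the games I grew up playing were NBA games too, from the early NBA Live titles to today's NBA 2K.  
I enjoy watching NBA games and digging through NBA statistics. Having studied data analysis for this long, it is time to combine it with my other hobby and analyze some NBA data.  
This project uses game statistics from the 2015-2016 NBA regular season and playoffs to predict the result of every game in the 2016-2017 regular season.  
Because many factors influence the outcome of a game, this analysis ignores the trades, injuries, coaching changes, and similar events that will occur during the 16-17 season, and relies only on each team's state in the 15-16 season.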

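## Data Sources
The data come from Basketball Reference.com, a site dedicated to NBA statistics. Four tables are used for the analysis: Team Per Game Stats, each team's average per-game performance; Opponent Per Game Stats, the average per-game statistics of the opponents each team faced; Miscellaneous Stats, each team's aggregate statistics; and NBA Schedule and Results, the season calendar and results. All of these are from the 15-16 season. The 16-17 Schedule, the calendar of the 16-17 season, is used for prediction. During the analysis I found that the 16-17 calendar was missing the game between the Timberwolves and the Trail Blazers, so I added it by hand.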

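## Formulas Used
 **Elo Score**: The Elo rating was originally designed to rank chess players more accurately. Many competitive sports and games now use Elo ratings to rank players, for example soccer, basketball, and baseball, or games such as LoL and DOTA.  
 **E(A)**: expected win probability of A against B, E(A)=1/(1+10^((Elo(B)-Elo(A))/400))  
 **E(B)**: expected win probability of B against A, E(B)=1/(1+10^((Elo(A)-Elo(B))/400))  
 **Updating the Elo Score**: if A's actual score in a game, S(A) (1 for a win, 0.5 for a draw, 0 for a loss), differs from its expected score E(A),  
 its rating is adjusted with the following formula: Elo<sup>new</sup>(A)=Elo<sup>old</sup>(A)+K(S(A)-E(A)). The value of K is also adjusted according to the rating level.  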
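
As a worked example of the formulas above, here is a minimal sketch. The ratings 1700 and 1500 and the value `k=24` are illustrative assumptions, and the helper names `expected_score` and `update_elo` are hypothetical, not functions from this notebook.

```python
def expected_score(elo_a, elo_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def update_elo(elo_old, actual, expected, k=24):
    """Move a rating toward the actual result; k=24 is an illustrative value."""
    return elo_old + k * (actual - expected)


# Hypothetical ratings, chosen only for illustration.
elo_a, elo_b = 1700, 1500
e_a = expected_score(elo_a, elo_b)
e_b = expected_score(elo_b, elo_a)

# A is the favorite and wins: A gains a little, B loses the same amount
# because both updates use the same K here.
new_a = update_elo(elo_a, 1, e_a)
new_b = update_elo(elo_b, 0, e_b)
print(round(e_a, 3), round(e_b, 3))
print(round(new_a, 1), round(new_b, 1))
```

The two expected scores always sum to 1, so with a shared K the points gained by the winner equal the points lost by the loser. The notebook's own update uses different K values for the winner and the loser, so that symmetry does not hold there.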

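## Table of Contents
### 1. Understanding the Data
    1.1 Importing the data
    1.2 Initializing the data
### 2. Feature Engineering
    2.1 Feature extraction
    2.2 Feature selection
### 3. Modeling
    3.1 Building and fitting the model
    3.2 Evaluating the model
### 4. Implementation
    4.1 Predicting the validation set
    4.2 Conclusion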

In [8388]:
import pandas as pd
import math
import csv
import random
import numpy as np
import seaborn as sns
from sklearn.linear_model import LogisticRegression,Lasso
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score,precision_score,recall_score,roc_auc_score,roc_curve
from sklearn.tree import DecisionTreeClassifier

# 1. Understanding the Data

## 1.1 Importing the data

In [8389]:
Mstat = pd.read_csv('nbadata/15-16Miscellaneous_Stat.csv')# aggregate team statistics for the season
Ostat = pd.read_csv('nbadata/15-16Opponent_Per_Game_Stat.csv')# opponents' average per-game statistics for the season
Tstat = pd.read_csv('nbadata/15-16Team_Per_Game_Stat.csv')# teams' average per-game statistics for the season
result_data = pd.read_csv('nbadata/2015-2016_result.csv')# 15-16 season calendar and results
schedule1617 = pd.read_csv('nbadata/16-17Schedule.csv')# 16-17 season calendar

In [8390]:
Mstat

Unnamed: 0,Rk,Team,Age,W,L,PW,PL,MOV,SOS,SRS,...,eFG%,TOV%,ORB%,FT/FGA,eFG%.1,TOV%.1,DRB%,FT/FGA.1,Arena,Attendance
0,1,Golden State Warriors,27.4,73,9,65,17,10.76,-0.38,10.38,...,0.563,13.5,23.5,0.191,0.479,12.6,76.0,0.208,Oracle Arena,803436
1,2,San Antonio Spurs,30.3,67,15,67,15,10.63,-0.36,10.28,...,0.526,12.4,23.0,0.197,0.477,14.1,79.1,0.182,AT&T Center,756445
2,3,Oklahoma City Thunder,25.8,55,27,59,23,7.28,-0.19,7.09,...,0.524,14.0,31.1,0.228,0.484,11.7,76.0,0.205,Chesapeake Energy Arena,746323
3,4,Cleveland Cavaliers,28.1,57,25,57,25,6.0,-0.55,5.45,...,0.524,12.7,25.1,0.194,0.496,12.6,78.5,0.205,Quicken Loans Arena,843042
4,5,Los Angeles Clippers,29.7,53,29,53,29,4.28,-0.15,4.13,...,0.524,12.1,20.1,0.22,0.48,13.8,73.8,0.222,STAPLES Center,786910
5,6,Toronto Raptors,26.3,56,26,53,29,4.5,-0.42,4.08,...,0.504,12.3,24.6,0.255,0.498,12.7,77.7,0.201,Air Canada Centre,812863
6,7,Atlanta Hawks,28.2,48,34,51,31,3.61,-0.12,3.49,...,0.516,13.8,19.1,0.185,0.48,14.4,74.6,0.194,Philips Arena,690150
7,8,Boston Celtics,25.2,48,34,50,32,3.21,-0.37,2.84,...,0.488,12.1,25.1,0.208,0.487,14.6,74.6,0.231,TD Garden,749076
8,9,Charlotte Hornets,26.0,48,34,49,33,2.72,-0.36,2.36,...,0.502,11.7,20.0,0.222,0.496,12.5,79.8,0.191,Time Warner Cable Arena,716894
9,10,Utah Jazz,24.2,40,42,46,36,1.79,0.05,1.84,...,0.501,14.2,25.9,0.213,0.495,13.5,77.7,0.21,Vivint Smart Home Arena,791489


In [8391]:
Ostat.head()

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,San Antonio Spurs,82,240.3,35.7,81.8,0.436,6.6,19.9,0.331,...,0.758,9.1,31.4,40.5,20.8,7.2,3.9,14.8,19.5,92.9
1,2,Utah Jazz,82,243.4,35.6,79.9,0.446,7.9,22.2,0.357,...,0.746,9.3,30.8,40.1,19.1,8.0,4.7,14.0,19.9,95.9
2,3,Toronto Raptors,82,241.2,36.5,82.1,0.444,8.7,23.4,0.373,...,0.748,9.5,31.2,40.8,21.7,6.5,5.4,13.3,22.0,98.2
3,4,Cleveland Cavaliers,82,242.1,36.8,82.1,0.448,7.9,22.7,0.347,...,0.743,9.3,31.8,41.0,21.4,7.2,4.4,13.3,20.6,98.3
4,5,Miami Heat,82,241.8,37.2,84.3,0.442,7.4,21.2,0.347,...,0.77,9.8,31.5,41.3,20.2,7.5,4.1,12.9,19.6,98.4


In [8392]:
Tstat.head()

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Golden State Warriors,82,242.4,42.5,87.3,0.487,13.1,31.6,0.416,...,0.763,10.0,36.2,46.2,28.9,8.4,6.1,15.2,20.7,114.9
1,2,Oklahoma City Thunder,82,241.8,41.1,86.4,0.476,8.3,23.7,0.349,...,0.782,13.1,35.6,48.6,23.0,7.4,5.9,15.9,20.6,110.2
2,3,Sacramento Kings,82,241.5,40.0,86.4,0.464,8.0,22.4,0.359,...,0.725,10.6,33.7,44.2,24.5,8.9,4.5,16.2,20.4,106.6
3,4,Houston Rockets,82,241.8,37.7,83.5,0.452,10.7,30.9,0.347,...,0.694,11.3,31.7,43.1,22.2,10.0,5.2,15.9,21.8,106.5
4,5,Boston Celtics,82,241.2,39.2,89.2,0.439,8.7,26.1,0.335,...,0.788,11.6,33.3,44.9,24.2,9.2,4.2,13.7,21.9,105.7


## 1.2 Initializing the data

In [8393]:
#Drop columns that are not needed
new_Mstat = Mstat.drop(['Rk', 'Arena'], axis=1)
new_Ostat = Ostat.drop(['Rk', 'G', 'MP'], axis=1)
new_Tstat = Tstat.drop(['Rk', 'G', 'MP'], axis=1)

In [8394]:
#Join the first two tables side by side on the team name
team_stats1 = pd.merge(new_Mstat, new_Ostat, how='left', on='Team')#the 'on' column (Team) appears only once in the result

In [8395]:
team_stats1.shape

(30, 45)

In [8396]:
#Join the third table on the team name as well
team_stats1 = pd.merge(team_stats1, new_Tstat, how='left', on='Team')

In [8397]:
team_stats1.shape

(30, 66)

In [8398]:
team_stats1.head()#note: columns with the same name in both merged tables get the _x/_y suffixes

Unnamed: 0,Team,Age,W,L,PW,PL,MOV,SOS,SRS,ORtg,...,FT%_y,ORB_y,DRB_y,TRB_y,AST_y,STL_y,BLK_y,TOV_y,PF_y,PTS_y
0,Golden State Warriors,27.4,73,9,65,17,10.76,-0.38,10.38,114.5,...,0.763,10.0,36.2,46.2,28.9,8.4,6.1,15.2,20.7,114.9
1,San Antonio Spurs,30.3,67,15,67,15,10.63,-0.36,10.28,110.3,...,0.803,9.4,34.5,43.9,24.5,8.3,5.9,13.1,17.5,103.5
2,Oklahoma City Thunder,25.8,55,27,59,23,7.28,-0.19,7.09,113.1,...,0.782,13.1,35.6,48.6,23.0,7.4,5.9,15.9,20.6,110.2
3,Cleveland Cavaliers,28.1,57,25,57,25,6.0,-0.55,5.45,110.9,...,0.748,10.6,33.9,44.5,22.7,6.7,3.9,13.6,20.3,104.3
4,Los Angeles Clippers,29.7,53,29,53,29,4.28,-0.15,4.13,108.3,...,0.692,8.8,33.3,42.0,22.8,8.6,5.6,13.0,21.3,104.5


In [8399]:
team_stats=team_stats1.set_index('Team', inplace=False, drop=True)

In [8400]:
result_data.head()

Unnamed: 0,WTeam,LTeam,WLoc
0,Atlanta Hawks,New York Knicks,V
1,San Antonio Spurs,Brooklyn Nets,H
2,Memphis Grizzlies,Brooklyn Nets,H
3,Chicago Bulls,Brooklyn Nets,V
4,Detroit Pistons,Chicago Bulls,H


In [8401]:
team_stats.shape

(30, 65)

# 2. Feature Engineering

## 2.1 Feature extraction

In [8402]:
# A team that has no Elo rating yet is assigned this base rating
base_elo = 1600
team_elos = {} 
#team_stats = {}
#X = []
#y = []
folder = 'nbadata' # directory that holds the data files

In [8403]:
# Function that returns a team's Elo rating
def get_elo(team):
    try:
        return team_elos[team]
    except KeyError:
        # A team with no rating yet starts at base_elo
        team_elos[team]= base_elo
        return team_elos[team]

In [8404]:
#Function that updates the Elo ratings after a game
def calc_elo(win_team, lose_team,HorV):
    elo_win_old=get_elo(win_team)
    elo_lose_old=get_elo(lose_team)
    # HorV is the winner's location; the home side is treated as 50 Elo points stronger
    if HorV=='H':
        elo_diff=elo_lose_old-elo_win_old-50
    else:
        elo_diff=elo_lose_old-elo_win_old+50
    E_win=1/(1+10**(elo_diff/400))
    E_lose=1/(1+10**(-elo_diff/400))
    if elo_win_old>=1650:
        K=16
    elif 1550<=elo_win_old<1650:
        K=24
    else:
        K=32
    elo_win_new=elo_win_old+K*(1-E_win)
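    #With the default K values, strong teams get stronger and weak teams get weaker: a strong team has a high Elo, so a loss costs it little; a win also earns it little, but it wins far more often than it loses. A weak team earns a lot from a win but pays heavily for a loss, and it loses more often than it wins.
    #With the scheme below, a strong team that loses to a weak team drops considerably in Elo.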
    if elo_lose_old>=1650:
        K=32
    elif 1550<=elo_lose_old<1650:
        K=24
    else:
        K=16
    elo_lose_new=elo_lose_old+K*(0-E_lose)
    team_elos[win_team]=round(elo_win_new)
    team_elos[lose_team]=round(elo_lose_new)
    return team_elos[win_team],team_elos[lose_team]

In [8405]:
# Function that builds the feature arrays
def build_dataSet(result_data):
    X = []
    y = []
    skip = 0
    for index, row in result_data.iterrows():
        Wteam = row['WTeam']
        Lteam = row['LTeam']
        # Get each team's current Elo (the base Elo if it has none yet)
        team1_elo=get_elo(Wteam)
        team2_elo=get_elo(Lteam)
        #if row['WLoc'] == 'H':
            #team1_elo += 40
        #else:
            #team2_elo += 40
        # Use the Elo rating as the first feature of each team
        team1_features = [team1_elo]
        team2_features = [team2_elo]
        # Use home/away as the second feature of each team; with the Shiyanlou approach (home advantage folded into the Elo), set both to 0 instead
        if row['WLoc']=='H':
            team1_features.append(1)
            team2_features.append(0)
        else:
            team1_features.append(0)
            team2_features.append(1)
        # Append each team's statistics obtained from Basketball Reference.com
        # (Series.items() replaces iteritems(), which newer pandas versions no longer provide)
        for key,value in team_stats.loc[Wteam].items():
            team1_features.append(value)
        for key,value in team_stats.loc[Lteam].items():
            team2_features.append(value)
        # Randomly place the two teams' features on the left or right side of each game row
        # and assign the matching 0/1 label to y (1 means the team on the right won)
        if random.random()<0.5:
            X.append(team1_features+team2_features)
            y.append(0)
        else:
            X.append(team2_features+team1_features)
            y.append(1)
        # Optional: print the first sample for inspection
        if skip == 0:
            print('X',X)
            skip = 1
        # Update both teams' Elo ratings with the result of this game
        calc_elo(Wteam,Lteam,row['WLoc'])
    return X,y

In [8406]:
X,y=build_dataSet(result_data)

X [[1600, 0, 28.199999999999999, 48.0, 34.0, 51.0, 31.0, 3.6099999999999999, -0.12, 3.4900000000000002, 105.09999999999999, 101.40000000000001, 97.099999999999994, 0.23699999999999999, 0.33600000000000002, 0.55200000000000005, 0.51600000000000001, 13.800000000000001, 19.100000000000001, 0.185, 0.47999999999999998, 14.4, 74.599999999999994, 0.19399999999999998, 690150.0, 37.100000000000001, 86.099999999999994, 0.43200000000000005, 8.3000000000000007, 24.5, 0.33799999999999997, 28.899999999999999, 61.600000000000001, 0.46899999999999997, 16.699999999999999, 22.100000000000001, 0.755, 11.5, 35.0, 46.5, 22.0, 8.5999999999999996, 5.0, 16.100000000000001, 18.300000000000001, 99.200000000000003, 38.600000000000001, 84.400000000000006, 0.45799999999999996, 9.9000000000000004, 28.399999999999999, 0.34999999999999998, 28.699999999999999, 56.100000000000001, 0.51200000000000001, 15.6, 20.0, 0.78299999999999992, 8.3000000000000007, 33.799999999999997, 42.100000000000001, 25.600000000000001, 9.0999

### Feature conversion
(convert the array into a DataFrame for the evaluation that follows)

In [1]:
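#The cells below convert the list to a matrix and then to a DataFrame step by step; the steps have since been wrapped in a function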

In [8407]:
#X1=np.array(X)

In [8408]:
#X1.shape

In [8409]:
#X1=np.insert(X1,134,y,axis=1)

In [8410]:
#X1.shape

In [8411]:
#team_stats.columns.values

In [8412]:
#np.insert(team_stats.columns.values,0,['elo','HorV'])

In [8413]:
#clms=np.insert(team_stats.columns.values,0,['elo','HorV'])

In [8414]:
#clms1=clms+'1'

In [8415]:
#clms2=clms+'2'

In [8416]:
#clms_t= np.insert(clms1,len(clms1),clms2)

In [8417]:
#clms_t= np.insert(clms_t,len(clms_t),'Win')

In [8418]:
#feature_df=pd.DataFrame(X1,columns=clms_t)

In [8419]:
#feature_df.head()

In [8420]:
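#Function that converts the feature array into a DataFrame
def form_df(X,z='train'):
    X=np.array(X)
    # Column 134 (after the 2 x 67 features) holds the label:
    # the global y for training data, a 0 placeholder for test data
    if z=='train':
        X=np.insert(X,134,y,axis=1)
    elif z=='test':
        X=np.insert(X,134,0,axis=1)
    # Column names: elo, HorV and the team statistics, suffixed 1 (left team) and 2 (right team)
    clms=np.insert(team_stats.columns.values,0,['elo','HorV'])
    clms1=clms+'1'
    clms2=clms+'2'
    clms_t= np.insert(clms1,len(clms1),clms2)
    clms_t= np.insert(clms_t,len(clms_t),'Win')
    return(pd.DataFrame(X,columns=clms_t))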

## 2.2 Feature selection

### Feature evaluation

#### Correlation coefficient method

In [8421]:
feature_df=form_df(X)

In [8422]:
feature_df.corr()

Unnamed: 0,elo1,HorV1,Age1,W1,L1,PW1,PL1,MOV1,SOS1,SRS1,...,ORB_y2,DRB_y2,TRB_y2,AST_y2,STL_y2,BLK_y2,TOV_y2,PF_y2,PTS_y2,Win
elo1,1.000000,0.035202,0.427327,0.882988,-0.882988,0.850505,-0.850505,0.860947,-0.718262,0.861247,...,0.056812,0.033883,0.061015,0.002009,-0.001809,0.013709,0.021991,0.019583,0.062109,-0.235206
HorV1,0.035202,1.000000,-0.009060,0.034684,-0.034684,0.033731,-0.033731,0.034348,-0.032698,0.034173,...,0.002994,0.017893,0.015325,0.016595,0.005878,-0.015928,-0.008254,-0.003164,0.016576,-0.189547
Age1,0.427327,-0.009060,1.000000,0.453873,-0.453873,0.394061,-0.394061,0.388610,-0.217991,0.394312,...,0.093377,-0.006807,0.051989,-0.050064,-0.008242,-0.032193,0.023409,0.064801,0.022265,-0.148399
W1,0.882988,0.034684,0.453873,1.000000,-1.000000,0.975201,-0.975201,0.980787,-0.836679,0.980136,...,0.056836,0.014306,0.045724,-0.015246,-0.009985,0.014865,0.016996,0.015714,0.028708,-0.313068
L1,-0.882988,-0.034684,-0.453873,-1.000000,1.000000,-0.975201,0.975201,-0.980787,0.836679,-0.980136,...,-0.056836,-0.014306,-0.045724,0.015246,0.009985,-0.014865,-0.016996,-0.015714,-0.028708,0.313068
PW1,0.850505,0.033731,0.394061,0.975201,-0.975201,1.000000,-1.000000,0.997211,-0.860953,0.996025,...,0.048528,0.013686,0.040184,-0.002782,-0.005710,0.020406,0.012072,0.012214,0.032460,-0.301451
PL1,-0.850505,-0.033731,-0.394061,-0.975201,0.975201,-1.000000,1.000000,-0.997211,0.860953,-0.996025,...,-0.048528,-0.013686,-0.040184,0.002782,0.005710,-0.020406,-0.012072,-0.012214,-0.032460,0.301451
MOV1,0.860947,0.034348,0.388610,0.980787,-0.980787,0.997211,-0.997211,1.000000,-0.849615,0.999574,...,0.054097,0.012148,0.042484,-0.006752,-0.005179,0.017908,0.015692,0.017367,0.032755,-0.303795
SOS1,-0.718262,-0.032698,-0.217991,-0.836679,0.836679,-0.860953,0.860953,-0.849615,1.000000,-0.833875,...,0.005343,-0.029348,-0.018711,-0.017044,0.022920,-0.045997,0.002363,0.050538,-0.001492,0.253651
SRS1,0.861247,0.034173,0.394312,0.980136,-0.980136,0.996025,-0.996025,0.999574,-0.833875,1.000000,...,0.056939,0.011100,0.043452,-0.007953,-0.004140,0.016208,0.016592,0.021017,0.034244,-0.303872


In [8423]:
corr_ss=feature_df.corr().Win.map(abs).sort_values(ascending=False)

In [8424]:
corr_ss.head()

Win     1.000000
W2      0.321983
L2      0.321983
MOV2    0.321193
PL2     0.321134
Name: Win, dtype: float64

In [8427]:
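#Strip the _x/_y suffix and the trailing team digit (1 or 2), then average the correlation of features that share a prefix
corr_gr_ss=corr_ss.groupby(corr_ss.index.str.split('_').str[0].str.replace(r'\d$','',regex=True)).mean().sort_values()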

In [8428]:
#Drop the Win row
corr_gr_ss=corr_gr_ss.drop('Win')

#### Random forest classification method

In [8430]:
model_test=RandomForestClassifier()

In [8431]:
feature_df.iloc[:,:-1].head()

Unnamed: 0,elo1,HorV1,Age1,W1,L1,PW1,PL1,MOV1,SOS1,SRS1,...,FT%_y2,ORB_y2,DRB_y2,TRB_y2,AST_y2,STL_y2,BLK_y2,TOV_y2,PF_y2,PTS_y2
0,1600.0,0.0,28.2,48.0,34.0,51.0,31.0,3.61,-0.12,3.49,...,0.805,10.4,34.0,44.4,20.5,5.7,5.7,13.4,19.7,98.4
1,1600.0,0.0,26.9,21.0,61.0,22.0,60.0,-7.35,0.24,-7.12,...,0.803,9.4,34.5,43.9,24.5,8.3,5.9,13.1,17.5,103.5
2,1600.0,1.0,30.5,42.0,40.0,35.0,47.0,-2.24,0.11,-2.14,...,0.757,10.5,31.9,42.4,22.3,7.6,4.0,14.8,18.0,98.6
3,1600.0,0.0,27.6,42.0,40.0,37.0,45.0,-1.48,0.01,-1.46,...,0.757,10.5,31.9,42.4,22.3,7.6,4.0,14.8,18.0,98.6
4,1613.0,0.0,27.6,42.0,40.0,37.0,45.0,-1.48,0.01,-1.46,...,0.668,12.5,33.9,46.3,19.4,7.0,3.7,13.5,19.0,102.0


In [8432]:
model_test.fit(feature_df.iloc[:,:-1],feature_df['Win'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [8433]:
rfc_ss=pd.Series(index=feature_df.columns[:-1],data=model_test.feature_importances_).sort_values()

In [8434]:
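#Strip the _x/_y suffix and the trailing team digit (1 or 2), then average the importance of features that share a prefix
rfc_gr_ss=rfc_ss.groupby(rfc_ss.index.str.split('_').str[0].str.replace(r'\d$','',regex=True)).mean().sort_values()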

In [8435]:
rfc_gr_ss.head()

FTA       0.003092
FT/FGA    0.003220
3PAr      0.003439
TOV%.1    0.003729
FT        0.003762
dtype: float64

#### Lasso method

In [8436]:
model_test2=Lasso()

In [8437]:
model_test2.fit(feature_df.iloc[:,:-1],feature_df['Win'])

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [8438]:
model_test2.coef_

array([ -1.18590341e-03,  -0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,   0.00000000e+00,  -0.00000000e+00,
         0.00000000e+00,  -0.00000000e+00,   0.00000000e+00,
        -0.00000000e+00,  -0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,  -0.00000000e+00,   0.00000000e+00,
        -0.00000000e+00,  -0.00000000e+00,   0.00000000e+00,
        -0.00000000e+00,  -0.00000000e+00,  -0.00000000e+00,
        -8.53282829e-07,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -0.00000000e+00,   0.00000000e+00,
        -0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -0.00000000e+00,  -0.00000000e+00,
         0.00000000e+00,  -0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,

In [8439]:
lasso_ss=pd.Series(index=feature_df.columns[:-1],data=model_test2.coef_).sort_values()

In [8440]:
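#Strip the _x/_y suffix and the trailing team digit (1 or 2), then average the coefficients of features that share a prefix
lasso_gr_ss=lasso_ss.groupby(lasso_ss.index.str.split('_').str[0].str.replace(r'\d$','',regex=True)).mean().sort_values()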

In [8441]:
lasso_gr_ss.head()

Attendance   -2.532585e-07
2P            0.000000e+00
MOV           0.000000e+00
ORB           0.000000e+00
ORB%          0.000000e+00
dtype: float64

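#### Comparing the evaluation methods
Compare how the correlation coefficient, random forest classification, and Lasso methods rate each feature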

In [8442]:
rfc_gr_ss.name='rfc'

In [8443]:
corr_gr_ss.name='corr'

In [8444]:
lasso_gr_ss.name='lasso'

In [8445]:
#pd.concat([corr_gr_ss,pd.Series(range(0,len(corr_gr_ss)),index=corr_gr_ss.index,name='crank')],axis=1)#a more complicated way to add a column while keeping the index

In [8446]:
corr_df=pd.DataFrame(corr_gr_ss)

In [8447]:
corr_df['crank']=range(1,len(corr_gr_ss)+1)

In [8448]:
corr_df.head()

Unnamed: 0,corr,crank
FT/FGA,0.004391,1
FTr,0.019937,2
FGA,0.02215,3
TOV%.1,0.023688,4
ORB,0.026859,5


In [8449]:
rfc_df=pd.DataFrame(rfc_gr_ss)

In [8450]:
rfc_df['rrank']=range(1,len(rfc_df)+1)

In [8451]:
rfc_df.head()

Unnamed: 0,rfc,rrank
FTA,0.003092,1
FT/FGA,0.00322,2
3PAr,0.003439,3
TOV%.1,0.003729,4
FT,0.003762,5


In [8452]:
lasso_df=pd.DataFrame(lasso_gr_ss)

In [8453]:
lasso_df['lrank']=lasso_df.lasso.apply(lambda x:1 if x!=0 else 0)

In [8454]:
lasso_df.head()

Unnamed: 0,lasso,lrank
Attendance,-2.532585e-07,1
2P,0.0,0
MOV,0.0,0
ORB,0.0,0
ORB%,0.0,0


In [8455]:
tscore_df=pd.concat([corr_df,rfc_df,lasso_df],axis=1).sort_values('crank')

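corr is the absolute correlation between each feature and the label; crank is its correlation rank, and a smaller crank means a weaker correlation  
rfc is each feature's importance; rrank is its importance rank, and a smaller rrank means lower importance  
lasso is the coefficient from a Lasso fit on the features; lrank indicates whether that coefficient is non-zero (1 means non-zero)  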
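
The ranking scheme described above can be sketched on a toy table. The scores below are invented for four feature names and are not taken from this notebook; `mean_rank` and `cutoff` are hypothetical names used only for illustration.

```python
import pandas as pd

# Made-up scores for four parent features, for illustration only.
scores = pd.DataFrame(
    {"corr": [0.30, 0.02, 0.15, 0.01], "rfc": [0.020, 0.004, 0.010, 0.003]},
    index=["MOV", "FTA", "Pace", "FT%"],
)

# Rank 1 = weakest feature under each measure, as with crank and rrank.
scores["crank"] = scores["corr"].rank(method="first").astype(int)
scores["rrank"] = scores["rfc"].rank(method="first").astype(int)
scores["mean_rank"] = (scores["crank"] + scores["rrank"]) / 2

# Features whose mean rank is at or below the cutoff are candidates for removal.
cutoff = 2
to_drop = scores.index[scores["mean_rank"] <= cutoff].tolist()
print(scores.sort_values("mean_rank"))
print(to_drop)
```

Averaging the two ranks means a feature is only dropped when both measures agree that it is weak; a feature ranked low by one measure and high by the other stays in.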

In [8456]:
tscore_df

Unnamed: 0,corr,crank,rfc,rrank,lasso,lrank
FT/FGA,0.004391,1,0.00322,2,0.0,0
FTr,0.019937,2,0.005094,27,0.0,0
FGA,0.02215,3,0.004237,12,0.0,0
TOV%.1,0.023688,4,0.003729,4,0.0,0
ORB,0.026859,5,0.00501,24,0.0,0
Pace,0.037915,6,0.006108,34,0.0,0
ORB%,0.038486,7,0.003784,6,0.0,0
FTA,0.047677,8,0.003092,1,0.0,0
FT,0.049978,9,0.003762,5,0.0,0
FT%,0.055676,10,0.003982,8,0.0,0


#### Drop the variables whose mean of correlation rank and random forest rank is <= n (n is a user-chosen cutoff; 10 here)

In [8457]:
tscore_df.loc[(tscore_df.crank+tscore_df.rrank)/2<=10]

Unnamed: 0,corr,crank,rfc,rrank,lasso,lrank
FT/FGA,0.004391,1,0.00322,2,0.0,0
FGA,0.02215,3,0.004237,12,0.0,0
TOV%.1,0.023688,4,0.003729,4,0.0,0
ORB%,0.038486,7,0.003784,6,0.0,0
FTA,0.047677,8,0.003092,1,0.0,0
FT,0.049978,9,0.003762,5,0.0,0
FT%,0.055676,10,0.003982,8,0.0,0
3PAr,0.079473,16,0.003439,3,0.0,0


In [8458]:
del_name=tscore_df.loc[(tscore_df.crank+tscore_df.rrank)/2<=10].index.values

In [8459]:
del_name

array(['FT/FGA', 'FGA', 'TOV%.1', 'ORB%', 'FTA', 'FT', 'FT%', '3PAr'], dtype=object)

In [8460]:
#Map every feature name to its prefix (parent feature) name
map_df=pd.Series(feature_df.columns[:-1].str.split('_').str[0].str.replace(r'\d$','',regex=True))

In [8461]:
#Use the full feature names as the index of map_df
map_df.index=feature_df.columns[:-1]

In [8462]:
map_df.head()

elo1      elo
HorV1    HorV
Age1      Age
W1          W
L1          L
dtype: object
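
The prefix mapping can also be written with the standard library alone. This is a minimal sketch that mirrors the split-and-strip logic of `map_df`; the function name `parent_name` and the sample column list are assumptions made for illustration.

```python
import re


def parent_name(column):
    """Strip the merge suffix (_x/_y) and the trailing team digit (1 or 2)."""
    base = column.split("_")[0]      # 'FG%_x1' -> 'FG%'; 'eFG%.11' is unchanged
    return re.sub(r"\d$", "", base)  # 'W1' -> 'W'; 'eFG%.11' -> 'eFG%.1'


# A few column names in the style of feature_df.
columns = ["elo1", "W2", "FG%_x1", "PTS_y2", "eFG%.11", "3PAr2"]
mapping = {c: parent_name(c) for c in columns}
print(mapping)
```

Only one trailing digit is removed, so a name such as `eFG%.11` keeps the `.1` that pandas added to the duplicated column and stays distinct from `eFG%`.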

In [8463]:
X=feature_df.drop('Win',axis=1)

### Dropping the unneeded features

In [8465]:
#Feature processing function (drops the low-ranked features)
def processdf(X):
    if 'Win' in X.columns:
        X.drop('Win',axis=1,inplace=True)
    for name in X.columns:
        if map_df[name] in del_name:
            X.drop(name,axis=1,inplace=True)
    return(X)

In [8466]:
X.shape

(1316, 134)

In [8468]:
#Process the features to obtain the final set
X=processdf(X)

# 3. Modeling

## 3.1 Building and fitting the model

In [8469]:
#Model with a decision tree
model=DecisionTreeClassifier()

In [8470]:
model.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

## 3.2 Evaluating the model

In [8471]:
#Model accuracy on the training data
model.score(X,y)

0.99924012158054709

In [8472]:
#10-fold cross-validated accuracy
cross_val_score(model, X, y, cv = 10, scoring='accuracy', n_jobs=-1).mean()

0.61396768625776255
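
The gap between the training accuracy (about 0.999) and the 10-fold cross-validated accuracy (about 0.614) suggests that the unpruned tree largely memorizes the training games. Below is a minimal sketch of how one might compare an unpruned tree with a depth-limited one. The data are synthetic (`make_classification`), not the NBA features, and `max_depth=3` is an arbitrary illustrative value rather than a setting from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with noisy labels; not the features used in this notebook.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

results = {}
for depth in (None, 3):  # None = fully grown tree, 3 = depth-limited tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    cv_acc = cross_val_score(tree, X_demo, y_demo, cv=10, scoring="accuracy").mean()
    results[depth] = (train_acc, cv_acc)
    print(depth, round(train_acc, 3), round(cv_acc, 3))
```

Comparing the two rows shows how much of the training accuracy survives cross-validation. Whether a shallower tree actually helps on the NBA data would have to be checked on `X` and `y` themselves.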

In [8473]:
schedule1617.shape

(1230, 2)

In [8474]:
schedule1617.head()

Unnamed: 0,Vteam,Hteam
0,New York Knicks,Cleveland Cavaliers
1,San Antonio Spurs,Golden State Warriors
2,Utah Jazz,Portland Trail Blazers
3,Brooklyn Nets,Boston Celtics
4,Dallas Mavericks,Indiana Pacers


# 4. Implementation

## 4.1 Predicting the validation set

In [8475]:
#Function that builds the feature array for the games to predict
def predict_dataset(df):
    X=[]
    for index,row in df.iterrows():
        team1=row['Vteam']
        team2=row['Hteam']
        team1_elo=get_elo(team1)
        team2_elo=get_elo(team2)
        # The visiting team goes on the left (HorV=0), the home team on the right (HorV=1)
        feature_team1 = [team1_elo,0]
        feature_team2 = [team2_elo,1]
        for key,value in team_stats.loc[team1].items():
            feature_team1.append(value)#note: do not assign the result; append returns None
        for key,value in team_stats.loc[team2].items():
            feature_team2.append(value)
        X.append(feature_team1+feature_team2)
    return X

In [8476]:
X=predict_dataset(schedule1617)

In [8477]:
X=form_df(X,z='test')

In [8478]:
X=processdf(X)

In [8479]:
X.head()

Unnamed: 0,elo1,HorV1,Age1,W1,L1,PW1,PL1,MOV1,SOS1,SRS1,...,2P%_y2,ORB_y2,DRB_y2,TRB_y2,AST_y2,STL_y2,BLK_y2,TOV_y2,PF_y2,PTS_y2
0,1535.0,0.0,27.2,32.0,50.0,33.0,49.0,-2.73,0.0,-2.74,...,0.514,10.6,33.9,44.5,22.7,6.7,3.9,13.6,20.3,104.3
1,1681.0,0.0,30.3,67.0,15.0,67.0,15.0,10.63,-0.36,10.28,...,0.528,10.0,36.2,46.2,28.9,8.4,6.1,15.2,20.7,114.9
2,1601.0,0.0,24.2,40.0,42.0,46.0,36.0,1.79,0.05,1.84,...,0.49,11.6,33.9,45.5,21.3,6.9,4.6,14.6,21.7,105.1
3,1493.0,0.0,26.9,21.0,61.0,22.0,60.0,-7.35,0.24,-7.12,...,0.483,11.6,33.3,44.9,24.2,9.2,4.2,13.7,21.9,105.7
4,1590.0,0.0,30.3,42.0,40.0,40.0,42.0,-0.3,0.29,-0.02,...,0.486,10.3,33.9,44.2,21.2,9.0,4.8,14.9,20.0,102.2


In [8480]:
pred_y=model.predict(X)

In [8481]:
pred_y_pro=model.predict_proba(X)

In [8482]:
pred_y_pro[:,1]

array([ 0.,  1.,  0., ...,  1.,  0.,  0.])

In [8483]:
pred_y_ss=pd.Series(pred_y_pro[:,1]).map(lambda x:1 if x>0.5 else 0)#adjust the 0.5 threshold to control how hard it is for the home team to be predicted as the winner
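
A minimal sketch of the thresholding step, using made-up probabilities rather than the model's output; the helper name `home_wins` is hypothetical. As the `predict_proba` output above suggests, a fully grown decision tree returns mostly 0 or 1, so moving the threshold matters mainly for models that produce graded probabilities.

```python
import numpy as np

# Hypothetical home-win probabilities, for illustration only.
proba_home = np.array([0.0, 0.35, 0.5, 0.55, 1.0])


def home_wins(proba, threshold=0.5):
    """Return 1 where the home-win probability exceeds the threshold, else 0."""
    return (proba > threshold).astype(int)


default_pred = home_wins(proba_home)      # threshold 0.5
strict_pred = home_wins(proba_home, 0.6)  # the home team needs a clearer edge
print(default_pred.tolist(), strict_pred.tolist())
```

Raising the threshold can only turn predicted home wins into away wins, never the reverse, so it lowers the predicted home-win total.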

In [8484]:
result=pd.concat([schedule1617,pred_y_ss],axis=1)

In [8485]:
result.columns=['Vteam', 'Hteam', 'win']

In [8486]:
result.head()

Unnamed: 0,Vteam,Hteam,win
0,New York Knicks,Cleveland Cavaliers,0
1,San Antonio Spurs,Golden State Warriors,1
2,Utah Jazz,Portland Trail Blazers,0
3,Brooklyn Nets,Boston Celtics,0
4,Dallas Mavericks,Indiana Pacers,1


In [8495]:
#Number of home and away wins in the 15-16 season
result_data.WLoc.value_counts()

H    782
V    534
Name: WLoc, dtype: int64

In [8487]:
#Predicted number of home (1) and away (0) wins in the 16-17 season
result.win.value_counts()

1    696
0    534
Name: win, dtype: int64

In [8491]:
#Home wins of each team
Homewin_ss=result.groupby('Hteam').win.sum()
#Away wins of each team (games as visitor minus the games the home side won)
VictorWin_ss=result.groupby('Vteam').win.apply(lambda x:x.count()-x.sum())

In [8496]:
#Build the table of home and away wins for each team
result_per_team=pd.concat([Homewin_ss,VictorWin_ss],axis=1)

In [8497]:
#Compute each team's total wins, total losses, and win percentage, and add them to the table
result_per_team.columns=['Hwin','Vwin']
result_per_team['total_win']=result_per_team.sum(axis=1)
result_per_team['total_lose']=82-result_per_team['total_win']
result_per_team['win%']=(round(result_per_team['total_win']/82*100,0)).astype('str')+'%'

In [8498]:
#Sort the teams by total wins
result_per_team.sort_values('total_win',ascending=False)

Unnamed: 0,Hwin,Vwin,total_win,total_lose,win%
Golden State Warriors,38,29,67,15,82.0%
San Antonio Spurs,35,28,63,19,77.0%
Atlanta Hawks,32,21,53,29,65.0%
Boston Celtics,27,26,53,29,65.0%
Dallas Mavericks,30,22,52,30,63.0%
Oklahoma City Thunder,35,17,52,30,63.0%
Miami Heat,34,18,52,30,63.0%
Los Angeles Clippers,33,18,51,31,62.0%
Milwaukee Bucks,32,18,50,32,61.0%
Cleveland Cavaliers,24,23,47,35,57.0%


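## 4.2 Conclusion

Most of the time in this analysis went into building features, especially updating the Elo ratings from every game result of the 15-16 season. The final 15-16 Elo ratings set each team's baseline strength for the 16-17 season. The Elo adjustment factor K plays a key role in the whole prediction: set poorly, it widens the gap between strong and weak teams, so tuning K and the Elo ranges it applies to took a fair amount of time. Next came feature selection, based mainly on the correlation coefficients and the feature importances. Because many features that share a prefix have similar meanings, they are treated as child features of one parent feature; the parent features are screened, and all child features of the retained parents go into the model.   
Many factors are still left out of this analysis, such as trades, coaching changes, injuries, and the draft in the new season. These changes also combine to create further effects such as player chemistry: players have different styles, and stars gathered on one team do not necessarily play well together. A ball-dominant player who likes to go it alone, such as Russell Westbrook, may not suit a superteam. So there are many new variables that could be added to the analysis, and I will keep working on this when I have time. Of course, if you want a quick prediction there is a shortcut: just open 2K and simulate an entire season.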