# Classifying players & Constructing time dataframes

## Summary 

### Import libraries &  data

### Classify players
1. Create a player feature: average games per hour, counting only hours in which the player was online
2. Use decision tree (manual classification) to categorise players

### Create data frames for time series analysis
1. Hourly traffic data frame
2. Hourly traffic that includes at least one recreational player
3. Hourly count of number of professionals online (played at least one game in the hour)
4. Fill zero-traffic hours that are likely the result of data collection or poker service provider issues

### Further EDA with player classifications
1. Distribution of all traffic and recreational traffic
2. Average (mean) number of pros online by hour of day
3. Network graph of professionals

## Import libraries 

In [14]:
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
import copy

# Standard plotly imports
#import plotly.plotly as py  #unused here; moved to the chart_studio package from plotly 4
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
from plotly import tools
import networkx as nx

# Using plotly + cufflinks in offline mode
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)

## Import and format data

In [15]:
ht=pd.read_csv('hourly_traffic.csv')
ps=pd.read_csv('Player_summary.csv')
ts=pd.read_csv('Tournament_summary.csv')

In [16]:
ts.head()

Unnamed: 0,date_time,tournament_id,total_buyin,prize_buyin,rake_buyin,first_place_id,second_place_id,first_place_prize,finishing_level
0,2015-11-04 16:00:01,1,51,50,1,Mark Hunter,Juan Avery,100,1
1,2015-11-04 16:07:49,2,51,50,1,Michelle Wiley,Dana Brown,100,1
2,2015-11-04 16:12:36,3,51,50,1,Dana Brown,Richard Myers,100,2
3,2015-11-04 16:21:04,4,51,50,1,Dana Brown,Mary Campbell,100,1
4,2015-11-04 16:21:54,5,51,50,1,Jesse Myers,Jonathon Hernandez,100,3


In [17]:
#recast date_time columns as pandas datetimes
ht['date_time']=pd.to_datetime(ht['date_time'])
ts['date_time']=pd.to_datetime(ts['date_time'])

#create hour of day column
ht['hour_of_day']=ht['date_time'].dt.hour

#create day of week as number column (Monday=0) - for plotting multiple days in order
ht['num_day_of_week']=ht['date_time'].dt.weekday

#ps.set_index('player_id', inplace=True)

#create month column
ht['month']=ht['date_time'].dt.month

#create day of month column
ht['day_of_month']=ht['date_time'].dt.day

#set date time as index for tournament summary
ts.set_index('date_time', inplace=True)


## Classify players

The exploratory data analysis shows that there are two classes of players: professional players (over 1500 games, usually winners) and recreational players (fewer than 1000 games), whose results vary widely because of the high variance associated with small sample sizes (i.e. low game counts).

Over 99% of the players can be easily classed using common-sense thresholds on profit and game count. This provides:

1. Additional information to include in the player report
2. The opportunity to model various types of traffic (all traffic, traffic including at least one recreational player, number of professionals online).


#### Notes on accuracy
- For poker players, pro/rec classifications could be considered subjective
- Professional players new to the game will be misclassified until they pass a certain game count threshold
- Players whose short-term realised win rate deviates from their long-term expected win rate may be misclassified

Any classification method is difficult to implement at scale. The data set only contains games from a single stake, and the relationship between recreational and professional players may differ at other stakes, as may the relationship between stake and profit. Ideally, any information dashboard would allow users to reclassify players at will and rerun models with the new player classifications.

#### Method for classification
We have identified three variables as important in classifying players as recreational or professional: number of games played, average return on investment, and number of games played per hour online.

Any attempt to apply a classification or clustering algorithm (supervised or not) would require a manual review of its predictions to give a subjective estimate of accuracy. Because this problem is unavoidable, a manual classification is performed in a decision-tree style: slice the player groups with thresholds on the variables stated above and assign classes as appropriate.
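
The threshold rules used in the cells that follow can be summarised as one vectorised function. This is a minimal sketch intended to mirror those rules, not a replacement for the step-by-step review: the function name `classify_players` and the low-volume player are hypothetical, while the other four rows are copied from the table of unclassified players shown further down.

```python
import numpy as np
import pandas as pd

def classify_players(ps):
    """Return 1 (professional) or 0 (recreational) for each row of ps."""
    games = ps['game_count']
    roi = ps['avg_roi_%']
    gph = ps['games_per_hour']

    # conditions are checked in order; the first match wins
    conditions = [
        games < 1000,                # low volume: recreational
        (games > 1500) & (roi > 0),  # high volume and in profit: professional
        (games > 1500) & (gph > 9),  # high volume and high intensity: professional
        roi > 0,                     # remaining players in profit: professional
        gph < 5,                     # remaining low-intensity players: recreational
    ]
    choices = [0, 1, 1, 1, 0]
    # anyone left over is treated as professional
    labels = np.select(conditions, choices, default=1)
    return pd.Series(labels, index=ps.index)

# one hypothetical low-volume player plus four players from the notebook
toy = pd.DataFrame({
    'player_id': ['hypothetical_newcomer', 'Chad Tran', 'Kristen Rodriguez',
                  'Karen Adkins', 'Alyssa Huerta'],
    'game_count': [12, 1807, 1610, 1121, 1128],
    'avg_roi_%': [-20.0, -3.1, -1.84, 4.25, -0.57],
    'games_per_hour': [1.5, 9.266667, 4.375, 5.312796, 7.77931],
})
toy['is_reg'] = classify_players(toy)
```

Keeping the thresholds in one place like this would also make it easier to rerun the analysis with different cut-offs.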

In [18]:
def avg_hourly_games_online(player_id, ts=ts):
    """Given a player, return the average number of games per hour, counting only hours in which the player played at least one game"""
    #all games including player_id
    info=ts[(ts.first_place_id==player_id) | (ts.second_place_id ==player_id)]
    
    #resample by hour, name column 'game_count'
    df=pd.DataFrame(info.first_place_id.resample('H').count())
    df=df.rename(columns={"first_place_id": "game_count"})
    
    #return a scalar (mean of the column) rather than a one-element Series
    return df.loc[df['game_count']>0, 'game_count'].mean()

In [19]:
#populate games per hour column
ps['games_per_hour']=ps['player_id'].apply(avg_hourly_games_online)
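
Calling the function once per player rescans the tournament summary each time. As a rough alternative, the same feature can be computed for every player in one pass by stacking both seats into a single column. The sketch below uses a small made-up table (`ts_toy` and the player ids are hypothetical) with `date_time` as an ordinary column.

```python
import pandas as pd

# hypothetical tournament summaries: two games in hour 16, none in hour 17, three in hour 18
ts_toy = pd.DataFrame({
    'date_time': pd.to_datetime(['2015-11-04 16:05', '2015-11-04 16:40',
                                 '2015-11-04 18:10', '2015-11-04 18:20',
                                 '2015-11-04 18:50']),
    'first_place_id': ['a', 'a', 'a', 'b', 'a'],
    'second_place_id': ['b', 'c', 'b', 'c', 'c'],
})

# one row per (game, player)
seats = ts_toy.melt(id_vars='date_time',
                    value_vars=['first_place_id', 'second_place_id'],
                    value_name='player_id')
seats['hour'] = seats['date_time'].dt.floor('h')

# games per player per hour; hours with no games never appear, so only online hours count
games_in_hour = seats.groupby(['player_id', 'hour']).size()

# average over the hours in which each player was online
games_per_hour = games_in_hour.groupby(level='player_id').mean()
```

Player `a` plays two games in each of two hours, giving an average of 2.0, and the empty hour in between does not pull the average down.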

In [20]:
#set all is_reg values to -1 i.e. unclassified
ps['is_reg']=-1

In [21]:
#Classify obvious recreational and pro players

#less than 1000 games = recreational
ps.loc[ps['game_count']<1000, 'is_reg']=0

#more than 1500 games and in profit = pro
ps.loc[(ps['game_count']>1500) & (ps['avg_roi_%']>0), 'is_reg']=1
            


In [22]:
#players still unclassified
ps[ps.is_reg==-1]

Unnamed: 0,player_id,game_count,total_profit,avg_roi_%,avg_stake,games_per_hour,is_reg
417,Alyssa Huerta,1128,-328,-0.57,51.0,7.77931,-1
1326,Ashley Werner,1407,-4757,-6.63,51.0,4.289634,-1
1971,Brooke Shepherd,1458,-1158,-1.56,51.0,7.363636,-1
2141,Carmen Brown,1409,-2459,-3.42,51.0,7.045,-1
2304,Chad Tran,1807,-2857,-3.1,51.0,9.266667,-1
5398,Jacqueline Green,1518,-2318,-2.99,51.0,10.615385,-1
7358,Karen Adkins,1121,2429,4.25,51.0,5.312796,-1
8186,Kristen Rodriguez,1610,-1510,-1.84,51.0,4.375,-1
11621,Robert Green,1266,3734,5.78,51.0,3.907407,-1
12383,Sean Foster II,6098,-7098,-2.28,51.0,15.595908,-1


In [23]:
#more than 1500 games and games per hour over 9 = pro
#instances were rare - individuals were examined to ensure classification is accurate
ps.loc[(ps['game_count']>1500) & (ps['games_per_hour']>9), 'is_reg']=1

In [24]:
#players still unclassified
ps[ps.is_reg==-1]

Unnamed: 0,player_id,game_count,total_profit,avg_roi_%,avg_stake,games_per_hour,is_reg
417,Alyssa Huerta,1128,-328,-0.57,51.0,7.77931,-1
1326,Ashley Werner,1407,-4757,-6.63,51.0,4.289634,-1
1971,Brooke Shepherd,1458,-1158,-1.56,51.0,7.363636,-1
2141,Carmen Brown,1409,-2459,-3.42,51.0,7.045,-1
7358,Karen Adkins,1121,2429,4.25,51.0,5.312796,-1
8186,Kristen Rodriguez,1610,-1510,-1.84,51.0,4.375,-1
11621,Robert Green,1266,3734,5.78,51.0,3.907407,-1


In [25]:
#rois of the two remaining winners are quite high, so they are probably regs
#the other remaining players fall into two distinct games per hour groups
#under 5 = rec, 5 or over = pro

#two with positive roi likely reg
ps.loc[(ps['is_reg']==-1) & (ps['avg_roi_%']>0), 'is_reg']=1

#rest are split on games per hour (not game count)
unclassified=ps['is_reg']==-1
ps.loc[unclassified & (ps['games_per_hour']<5), 'is_reg']=0
ps.loc[unclassified & ~(ps['games_per_hour']<5), 'is_reg']=1
            
ps[ps.is_reg==-1].shape

(0, 7)

In [26]:
#all players have been assigned a class
ps[ps.is_reg==1].describe()

Unnamed: 0,game_count,total_profit,avg_roi_%,avg_stake,games_per_hour,is_reg
count,48.0,48.0,48.0,48.0,48.0,48.0
mean,4260.583333,9137.333333,3.940208,51.0,6.081107,1.0
std,3295.766581,8778.642495,3.219975,0.0,2.511498,0.0
min,1121.0,-7098.0,-3.42,51.0,3.131361,1.0
25%,1932.25,2895.0,2.6125,51.0,4.57304,1.0
50%,3163.0,7785.5,4.32,51.0,5.367209,1.0
75%,6032.75,15275.75,6.23,51.0,7.134984,1.0
max,18790.0,34310.0,9.98,51.0,15.595908,1.0


In [27]:
ps[ps.is_reg==0].describe()

Unnamed: 0,game_count,total_profit,avg_roi_%,avg_stake,games_per_hour,is_reg
count,14435.0,14435.0,14435.0,14435.0,14435.0,14435.0
mean,13.545272,-58.096709,-21.100648,51.0,1.814482,0.0
std,47.367656,230.223658,65.205433,0.0,1.532657,0.0
min,1.0,-5736.0,-100.0,51.0,1.0,0.0
25%,1.0,-102.0,-100.0,51.0,1.0,0.0
50%,3.0,-51.0,-17.44,51.0,1.142857,0.0
75%,9.0,47.0,12.04,51.0,2.0,0.0
max,1610.0,5564.0,96.08,51.0,27.75,0.0


In [28]:
ps.to_csv('Player_summary_with_pro_classification.csv')

## Create three time series

As discussed in the EDA, we will now compute three separate hourly data frames:
1. All traffic (already available as variable ht)
2. Traffic where at least one player is recreational
3. Number of unique professionals online in a given hour
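
Before building these on the full data, it may help to see the three quantities on a tiny example. The sketch below is illustrative only: `ts_toy`, `pros_toy` and the player ids are hypothetical, and `date_time` is kept as an ordinary column.

```python
import pandas as pd

# hypothetical tournament summaries: two games in hour 16, one in hour 17
ts_toy = pd.DataFrame({
    'date_time': pd.to_datetime(['2015-11-04 16:05', '2015-11-04 16:40',
                                 '2015-11-04 17:10']),
    'first_place_id': ['pro_a', 'rec_x', 'pro_b'],
    'second_place_id': ['rec_x', 'pro_a', 'pro_a'],
})
pros_toy = ['pro_a', 'pro_b']  # hypothetical professional ids

# professionals per game: 0 = rec vs rec, 1 = pro vs rec, 2 = pro vs pro
ts_toy['pro_count'] = (ts_toy['first_place_id'].isin(pros_toy).astype(int)
                       + ts_toy['second_place_id'].isin(pros_toy).astype(int))

# hourly count of games with at least one recreational player
rec_games = ts_toy[ts_toy['pro_count'] < 2]
hourly_rec = rec_games.groupby(rec_games['date_time'].dt.floor('h')).size()

# unique professionals seen in each hour: stack both seats into one column first
seats = ts_toy.melt(id_vars='date_time',
                    value_vars=['first_place_id', 'second_place_id'],
                    value_name='player_id')
pro_seats = seats[seats['player_id'].isin(pros_toy)]
pros_per_hour = (pro_seats
                 .groupby(pro_seats['date_time'].dt.floor('h'))['player_id']
                 .nunique())
```

Note that `groupby` only returns hours that contain at least one row, whereas `resample` also returns empty hours with a count of 0, which is what the zero-filling step later relies on.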

In [29]:
pros=list(ps[ps['is_reg']==1]['player_id'])

In [30]:
#number of professionals in sample
len(pros)

48

#### Add pro count to tournament summary dataframe

In [33]:
#count how many of the two players in each game are professionals (0, 1 or 2)
#computed column-wise, so duplicate time indexes and column positions do not matter
ts['pro_count']=ts['first_place_id'].isin(pros).astype(int)+ts['second_place_id'].isin(pros).astype(int)


In [34]:
#proportion of games pro vs pro
len(ts[ts['pro_count']==2])/len(ts)

0.07185389241914437

In [35]:
#proportion of games rec vs rec
len(ts[ts['pro_count']==0])/len(ts)

0.049400800931920785

In [36]:
#proportion of games pro vs rec
len(ts[ts['pro_count']==1])/len(ts)

0.8787453066489348
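
The three proportions above sum to 1, since every game has 0, 1 or 2 professionals. They could also be obtained in a single call with `value_counts(normalize=True)`. The sketch below uses a short made-up `pro_count` column rather than the notebook's data.

```python
import pandas as pd

# hypothetical pro_count column: number of professionals in each of ten games
pro_count = pd.Series([1, 1, 2, 0, 1, 1, 1, 2, 1, 1])

labels = {0: 'rec vs rec', 1: 'pro vs rec', 2: 'pro vs pro'}

# share of games in each category
shares = pro_count.map(labels).value_counts(normalize=True)
```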

#### Construct hourly traffic data frame (for games with at least 1 recreational player)

In [37]:
#subset ts dataframe for games with at least one recreational player
ts_recs=ts[ts['pro_count']<2]

#sample tournament summaries with hourly interval, taking count of each hour
ht_recs=pd.DataFrame(ts_recs.first_place_id.resample('H').count())

#change column name to count
ht_recs.columns=['game_count']

#### Construct hourly traffic data frame (number of pros online)

In [38]:
#add date_time as column for masking
ts['date_time']=ts.index

#delete index title
ts.index.name=None

In [39]:
#create data frame with hours
pros_online=pd.DataFrame(ht['date_time'])

In [40]:
pro_column=[]

for time in ht['date_time']:
    end_time=time+timedelta(hours=1)
    #subset tournament summary by single hour
    info=ts[(ts['date_time']>=time) & (ts['date_time']<end_time)]
    #collect list of unique players
    player_info=list(pd.unique(list(info['first_place_id']) +list(info['second_place_id'])))
    #number of unique pros in list
    unique_pros=0
    for player in player_info:
        if player in pros:
            unique_pros=unique_pros+1
    pro_column.append(unique_pros)
    
pros_online['pros_online']=pro_column

## Fill 0 hours where necessary

Fill instances of sequential 0 hours with the mean games per hour for that hour of the week.
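
The core idea can be sketched without explicit loops. This is a simplified illustration, not the function used below: it assumes an hourly DatetimeIndex with no gaps, the name `fill_zero_runs` and the toy data are hypothetical, and it leaves out the special handling of the recurring zero hour.

```python
import pandas as pd

def fill_zero_runs(hourly):
    """Replace zeros that sit next to another zero with the mean for that
    (day of week, hour of day) slot. Expects a DatetimeIndex and a
    'game_count' column."""
    out = hourly.copy()
    is_zero = out['game_count'] == 0
    # a zero belongs to a run if the hour before or after is also zero
    in_run = is_zero & (is_zero.shift(1, fill_value=False)
                        | is_zero.shift(-1, fill_value=False))
    # mean per hour-of-week slot, aligned back to every row
    slot_mean = (out.groupby([out.index.weekday, out.index.hour])['game_count']
                 .transform('mean'))
    out.loc[in_run, 'game_count'] = slot_mean[in_run].round().astype(int)
    return out

# two hypothetical weeks of constant traffic
idx = pd.date_range('2015-11-02', periods=24 * 14, freq='h')
toy = pd.DataFrame({'game_count': 10}, index=idx)
toy.iloc[[30, 31], 0] = 0  # two consecutive zero hours: treated as missing
toy.iloc[100, 0] = 0       # isolated zero hour: left alone

filled = fill_zero_runs(toy)
```

As in the notebook's function, the slot mean is taken over all observations including the zeros, so the filled value here is 5 rather than 10.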

In [41]:
#add columns to use for subsetting in function
pros_online['num_day_of_week']=[i.weekday() for i in pros_online.date_time]
pros_online['hour_of_day']=[i.hour for i in pros_online.date_time]

ht_recs['date_time']=ht_recs.index
ht_recs['num_day_of_week']=[i.weekday() for i in ht_recs.date_time]
ht_recs['hour_of_day']=[i.hour for i in ht_recs.date_time]


In [42]:
def fill_missing_values(hourly_traffic):
    """Fills 0 hour instances that are likely to be missing values with hour of week mean
    """
    
    ht=copy.copy(hourly_traffic)
    
    #construct dataframes of fill values
    ht_mean_weekly=pd.DataFrame(ht.groupby(['num_day_of_week', 'hour_of_day'])['game_count'].mean())
        
    index_list=[]

    for index in list(ht.index):
        #sequential 0 hour indexes
        try:
            if ((ht.loc[index]['game_count']==0 )& (ht.loc[index+1]['game_count']==0)):
                index_list.append(index)
                index_list.append(index+1)
        #skips last instance when index+1 is not index
        except:
            pass
        #also fill the recurring 0 hour (Wednesday 11:00, weekday()==2) and the hours either side of it
        if ((ht.loc[index]['num_day_of_week']==2) &(ht.loc[index]['hour_of_day']==11) &(ht.loc[index]['game_count']==0)):
                index_list.append(index)
                index_list.append(index+1)
                index_list.append(index-1)

    #unique index
    index_list=list(np.unique(index_list))
     
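    #set 0 values to the mean for that day of week and hour of day
    for index in index_list:
        try:
            day=ht.loc[index, 'num_day_of_week']
            hour=ht.loc[index, 'hour_of_day']
            ht.loc[index, 'game_count']=int(round(ht_mean_weekly.loc[(day, hour), 'game_count']))
        #skips index-1/index+1 entries that fall outside the dataframe
        except Exception:
            pass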

    return ht

In [43]:
#clean hourly traffic, all players included
ht_clean_ap=fill_missing_values(ht)

In [44]:
#clean hourly traffic, at least one recreational player in game
ht_clean_recs=fill_missing_values(ht_recs)

In [45]:
#number of pros online in given hour
#change pros_online column to game_count so it works in function
pros_online.columns=['date_time', 'game_count', 'num_day_of_week', 'hour_of_day']

pros_online_clean=fill_missing_values(pros_online)

#change game_count column back to pros_online
pros_online_clean.columns=['date_time', 'pros_online', 'num_day_of_week', 'hour_of_day']

In [46]:
ht_clean_ap.set_index('date_time', inplace = True)
pros_online_clean.set_index('date_time', inplace = True)

## EDA with player classifications

#### Comparison of game counts per hour descriptions

In [47]:
#all traffic
ht_clean_ap['game_count'].describe()

count    7565.000000
mean       26.593126
std        16.377406
min         0.000000
25%        15.000000
50%        24.000000
75%        35.000000
max       139.000000
Name: game_count, dtype: float64

In [48]:
ht_clean_recs['game_count'].describe()

count    7565.000000
mean       24.688169
std        14.179365
min         0.000000
25%        14.000000
50%        22.000000
75%        32.000000
max       107.000000
Name: game_count, dtype: float64

#### All categories of traffic resampled (mean) by day

In [49]:
resample='D'

# Create traces
trace0 = go.Scatter(
    x = ht_clean_ap.resample(resample).mean().index,
    y = ht_clean_ap.resample(resample).mean()['game_count'],
    mode = 'lines',
    name = 'All traffic'
)
trace1 = go.Scatter(
    x = ht_clean_recs.resample(resample).mean().index,
    y = ht_clean_recs.resample(resample).mean()['game_count'],
    mode = 'lines',
    name = 'Recreational traffic'
)
trace2 = go.Scatter(
    x = pros_online_clean.resample(resample).mean().index,
    y = pros_online_clean.resample(resample).mean()['pros_online'],
    mode = 'lines',
    name = 'Professionals online'
)

layout= go.Layout(
    title= 'Average hourly traffic resampled by day',
    hovermode= 'closest',
    xaxis= dict(
        title= 'Date',
        ticklen= 5,
        gridwidth= 2,
    ),
    yaxis=dict(
        title= 'Count',
        ticklen= 5,
        gridwidth= 2,
    ),
)

data = [trace0, trace1, trace2]
fig= go.Figure(data=data, layout=layout)
iplot(fig)

Finally, we see the cause of the spike in traffic in December: it corresponds to a spike in professionals online and a spike in games between professionals (the difference between recreational traffic and all traffic).

This period likely represents either a breakdown in relationships between existing professionals or the arrival of a new group of professionals; a short war that was quickly resolved.

Although the cause of the spike in all traffic has been identified, December remains the period of highest traffic in the year for games including at least one recreational player. This points to the possibility that some of the players categorised as recreational may actually be professionals.

There are similar seasonal trends for all three categories of traffic.

### Traffic breakdown by player types

In [50]:
pro_only=pd.DataFrame(ts[ts['pro_count']==2]['first_place_id'].resample('H').count())
rec_only=pd.DataFrame(ts[ts['pro_count']==0]['first_place_id'].resample('H').count())
pro_rec=pd.DataFrame(ts[ts['pro_count']==1]['first_place_id'].resample('H').count())

resample='D'

# Create traces
trace0 = go.Scatter(
    x = pro_only.resample(resample).sum().index,
    y = pro_only.resample(resample).sum()['first_place_id'],
    mode = 'lines',
    name = 'Pro vs pro traffic'
)
trace1 = go.Scatter(
    x = rec_only.resample(resample).sum().index,
    y = rec_only.resample(resample).sum()['first_place_id'],
    mode = 'lines',
    name = 'Rec vs rec traffic'
)
trace2 = go.Scatter(
    x = pro_rec.resample(resample).sum().index,
    y = pro_rec.resample(resample).sum()['first_place_id'],
    mode = 'lines',
    name = 'Pro vs rec traffic'
)

layout= go.Layout(
    title= 'Games per day',
    hovermode= 'closest',
    xaxis= dict(
        title= 'Date',
        ticklen= 5,
        gridwidth= 2,
    ),
    yaxis=dict(
        title= 'Count',
        ticklen= 5,
        gridwidth= 2,
    ),
)

data = [trace0, trace1, trace2]
fig= go.Figure(data=data, layout=layout)
iplot(fig)

#### Comparison of all traffic and recreational traffic distributions


In [51]:
trace1 = go.Histogram(
    x=ht_clean_recs.game_count,
    opacity=0.5,
    name='Recreational traffic'
  
)
trace2 = go.Histogram(
    x=ht_clean_ap.game_count,
    opacity=0.5,
    name='All traffic'
)

data = [trace1, trace2]

#overlay the two histograms so the distributions can be compared directly
layout = go.Layout(barmode='overlay',
                    title= 'Games per hour distribution',
                    xaxis= dict(
                                title= 'Games per hour',
                                ticklen= 5,
                                gridwidth= 2,
                                ),
                    yaxis=dict(
                        title= 'Count',
                        ticklen= 5,
                        gridwidth= 2,
                        )
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)


Recreational traffic is (slightly) closer to a normal distribution than all traffic, which will likely lead to higher accuracy when forecasting it (compared to all traffic).

The differences also highlight the difference between rec vs rec traffic and pro vs pro traffic. When pros play each other, they likely play 'sessions' that involve rematching or continuing to play pros for a sustained period.
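
One way to put a number on how close each series is to a normal distribution is to compare sample skewness and excess kurtosis, both of which are near 0 for normally distributed data. The sketch below uses synthetic data only, so its values say nothing about the notebook's results, and the helper name `shape_summary` is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic stand-ins for two hourly series (not the notebook's data)
symmetric = pd.Series(rng.normal(loc=25, scale=5, size=5000))
right_skewed = pd.Series(rng.exponential(scale=25, size=5000))

def shape_summary(series):
    """Skewness and excess kurtosis; both are near 0 for a normal distribution."""
    return {'skew': series.skew(), 'excess_kurtosis': series.kurt()}

summary_symmetric = shape_summary(symmetric)
summary_skewed = shape_summary(right_skewed)
```

Applied to the notebook, the same helper could be called on `ht_clean_recs['game_count']` and `ht_clean_ap['game_count']` to compare the two distributions directly.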

#### Professionals online by hour of day

In [52]:
pros_online_clean.groupby('hour_of_day')['pros_online'].mean().iplot(kind='bar', 
                 xTitle='Hour of day', 
                 yTitle='Professionals online', 
                 title='Average no. of professionals online by hour of day' 
                                                     )

The number of professionals online in a given hour broadly reflects the seasonal trend we saw in the same plot for all traffic.

In [53]:
def versus_many(player_list, ts=ts, ps=ps):
    """takes list of players in form of list of strings
       returns dataframe showing game count between each pair of players
       player against self is set to 0
        """
    
    unique_players= np.unique(player_list)
    valid_players=[]
    for player in unique_players:
        if player in list(ps.player_id):
            valid_players.append(player)
        else:
            print(player, 'is not valid player id')
    
    player_dict={player: [0]*len(valid_players) for player in valid_players}
    
    versus_df=pd.DataFrame(player_dict, index=valid_players)
    
    for i in valid_players:
        for j in valid_players:
            if i==j:
                versus_df.loc[i,j]=0
            else:
                info=ts[(ts.first_place_id==i) | (ts.second_place_id ==i)]
                info=info[(info.first_place_id==j) | (info.second_place_id ==j)]
                versus_df.loc[i,j]=len(info)
    
    return versus_df

In [54]:
pairwise_games=versus_many(pros)
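
The nested loop filters the tournament summary once per pair of players, which grows quickly with the number of players. A cross-tabulation gives the same kind of symmetric matrix in one pass. The sketch below is an alternative under assumptions, not the function above: `ts_toy` and the player ids are hypothetical.

```python
import pandas as pd

# hypothetical games between four players
ts_toy = pd.DataFrame({
    'first_place_id': ['a', 'b', 'a', 'c'],
    'second_place_id': ['b', 'a', 'c', 'd'],
})
players = ['a', 'b', 'c']  # hypothetical ids of interest

# keep only games where both players are in the list
both = ts_toy[ts_toy['first_place_id'].isin(players)
              & ts_toy['second_place_id'].isin(players)]

# count ordered (winner, runner-up) pairs
ordered = pd.crosstab(both['first_place_id'], both['second_place_id'])
ordered = ordered.reindex(index=players, columns=players, fill_value=0)

# add the transpose so the count no longer depends on who won
pairwise = ordered + ordered.T
```

The diagonal stays at 0 because a player never appears in both seats of the same game.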

In [None]:
ps_id=ps.set_index('player_id')

In [60]:
#get positions of nodes
G=nx.Graph(pairwise_games)
pos = nx.drawing.spring_layout(G)

#NOTE: improve edge trace by layering multiple edge traces (partitioned by game count), such that
#line strength also represents game count - use code below to mask dataframes for diff traces
#used to put filter on pairwise matrix
#where_are_NaNs = np.isnan(top_df[top_df>10])
#top_df[where_are_NaNs] = 0

#initiate empty edge trace
edge_trace = go.Scatter(
    x=[],
    y=[],
    line=dict(width=0.5,color='#888'),
    hoverinfo='none',
    mode='lines')

#populate edge trace using  node positions
for edge in G.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_trace['x'] += tuple([x0, x1, None])
    edge_trace['y'] += tuple([y0, y1, None])


#initiate empty node trace
node_trace = go.Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=False,#toggle to show colour bar
        # colorscale options
        #'Greys' | 'YlGnBu' | 'Greens' | 'YlOrRd' | 'Bluered' | 'RdBu' |
        #'Reds' | 'Blues' | 'Picnic' | 'Rainbow' | 'Portland' | 'Jet' |
        #'Hot' | 'Blackbody' | 'Earth' | 'Electric' | 'Viridis' |
        colorscale='Bluered',
        reversescale=True,
        color=[],
        size=10,
#         colorbar=dict(
#             thickness=15,
#             title='Node Connections',
#             xanchor='left',
#             titleside='right'
#         ),
        line=dict(width=2)))

#populate node trace with node positions
for node in G.nodes():
    x, y = pos[node]
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])
    

#set node size equal to absolute average roi
node_trace['marker']['size']=list(np.absolute(ps_id.loc[list(G.nodes)]['avg_roi_%'])*5)

#set colour by whether average roi is positive or negative (blue=profit, red=loss)
node_trace['marker']['color']=(ps_id.loc[list(G.nodes)]['avg_roi_%']>0)*1 #profit/loss colour

#set hover text of node to be name of player
node_trace['text']=list(G.nodes)


#plot masterpiece
fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='Professionals: network graph',
                titlefont=dict(size=16),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
#                 annotations=[ dict(
#                     text="Python code: <a href='https://plot.ly/ipython-notebooks/network-graphs/'> https://plot.ly/ipython-notebooks/network-graphs/</a>",
#                     showarrow=False,
#                    xref="paper", yref="paper",
#                     x=0.005, y=-0.002 ) ],
                 xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))

iplot(fig)

### Colour lines by game count....

In [51]:
# pairwise_games_u10=copy.copy(pairwise_games)
# where_is_nans=np.isnan(pairwise_games[pairwise_games<=10])
# pairwise_games_u10[where_is_nans]=0

# pairwise_games_10_100=copy.copy(pairwise_games)
# where_is_nans=np.isnan(pairwise_games[(pairwise_games>10) &(pairwise_games<=100)])
# pairwise_games_10_100[where_is_nans]=0

# pairwise_games_100_1000=copy.copy(pairwise_games)
# where_is_nans=np.isnan(pairwise_games[(pairwise_games>100) &(pairwise_games<=1000)])
# pairwise_games_100_1000[where_is_nans]=0

# pairwise_games_1000=copy.copy(pairwise_games)
# where_is_nans=np.isnan(pairwise_games[(pairwise_games>1000) ])
# pairwise_games_1000[where_is_nans]=0

### Save dataframes

In [61]:
ht_clean_ap.to_csv('all_players_modelling_df.csv', index=True) 
ht_clean_recs.to_csv('recreational_players_modelling_df.csv', index=True) 
pros_online_clean.to_csv('pros_online_modelling_df.csv', index=True) 
ts.to_csv('Tournament_summary_with_pro_count.csv', index=True)
pairwise_games.to_csv('all_pros_pairwise.csv', index=True)