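# Tic-Tac-Toe with Q-Learning


We build a Tic-Tac-Toe AI with reinforcement learning.

Contents

1.1 Game board setup

1.2 Game flow definition

2.1 Random agent vs. human

2.2 Improved random agent vs. human

3 Monte Carlo vs. Monte Carlo

4 Q-Learning player vs. human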



## 1.1 Game board setup (environment)

In [1]:
# POS State
EMPTY=0
PLAYER_X=1
PLAYER_O=-1
MARKS={PLAYER_X:"X",PLAYER_O:"O",EMPTY:" "}
DRAW=2

class TTTBoard:
    
    def __init__(self,board=None):
        if board is None:
            self.board = []
            for i in range(9):self.board.append(EMPTY)
        else:
            self.board=board
        self.winner=None
    
    def get_possible_pos(self):
        pos=[]
        for i in range(9):
            if self.board[i]==EMPTY:
                pos.append(i)
        return pos
    
    def print_board(self):
        tempboard=[]
        for i in self.board:
            tempboard.append(MARKS[i])
        row = ' {} | {} | {} '
        hr = '\n-----------\n'
        print((row + hr + row + hr + row).format(*tempboard))
               

                    
    def check_winner(self):
        win_cond = ((1,2,3),(4,5,6),(7,8,9),(1,4,7),(2,5,8),(3,6,9),(1,5,9),(3,5,7))
        for each in win_cond:
            if self.board[each[0]-1] == self.board[each[1]-1]  == self.board[each[2]-1]:
                if self.board[each[0]-1]!=EMPTY:
                    self.winner=self.board[each[0]-1]
                    return self.winner
        return None
    
    def check_draw(self):
        if len(self.get_possible_pos())==0 and self.winner is None:
            self.winner=DRAW
            return DRAW
        return None
    
    def move(self,pos,player):
        if self.board[pos]== EMPTY:
            self.board[pos]=player
        else:
            self.winner=-1*player
        self.check_winner()
        self.check_draw()
    
    def clone(self):
        return TTTBoard(self.board.copy())
    
    # Turn switching is handled by TTT_GameOrganizer.switch_player; the board
    # itself stores no player_turn, player_x or player_o attributes.
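
The eight winning lines in `check_winner` are written 1-based and shifted by one on every lookup. As a minimal standalone sketch, the same check can be written directly on 0-based indices. The names `find_winner` and `WIN_LINES` are illustrative and not part of the notebook.

```python
EMPTY, PLAYER_X, PLAYER_O = 0, 1, -1

# 0-based versions of the eight lines: three rows, three columns, two diagonals.
WIN_LINES = (
    (0, 1, 2), (3, 4, 5), (6, 7, 8),
    (0, 3, 6), (1, 4, 7), (2, 5, 8),
    (0, 4, 8), (2, 4, 6),
)


def find_winner(cells):
    """Return PLAYER_X or PLAYER_O if some line is complete, else None."""
    for a, b, c in WIN_LINES:
        if cells[a] != EMPTY and cells[a] == cells[b] == cells[c]:
            return cells[a]
    return None


top_row_x = [1, 1, 1,
             -1, -1, 0,
             0, 0, 0]
print(find_winner(top_row_x))     # X has completed the top row
print(find_winner([EMPTY] * 9))   # nobody has won on an empty board
```

Unlike `TTTBoard.check_winner`, this sketch has no side effects: it only reports the result and leaves storing it to the caller.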

## 1.2 Game flow definition (environment)

In [2]:
import random  # needed by __init__ to pick who moves first


class TTT_GameOrganizer:

    act_turn=0
    winner=None
    
    def __init__(self,px,po,nplay=1,showBoard=True,showResult=True,stat=100):
        self.player_x=px
        self.player_o=po
        self.nwon={px.myturn:0,po.myturn:0,DRAW:0}
        self.nplay=nplay
        self.players=(self.player_x,self.player_o)
        self.board=None
        self.disp=showBoard
        self.showResult=showResult
        self.player_turn=self.players[random.randrange(2)]
        self.nplayed=0
        self.stat=stat
    
    def progress(self):
        while self.nplayed<self.nplay:
            self.board=TTTBoard()
            while self.board.winner is None:
                if self.disp:print("Turn is "+self.player_turn.name)
                act=self.player_turn.act(self.board)
                self.board.move(act,self.player_turn.myturn)
                if self.disp:self.board.print_board()
               
                if self.board.winner is not None:
                    # notify every player that the game has ended
                    for i in self.players:
                        i.getGameResult(self.board) 
                    if self.board.winner == DRAW:
                        if self.showResult:print ("Draw Game")
                    elif self.board.winner == self.player_turn.myturn:
                        out = "Winner : " + self.player_turn.name
                        if self.showResult: print(out)
                    else:
                        print ("Invalid Move!")
                    self.nwon[self.board.winner]+=1
                else:
                    self.switch_player()
                    # notify the next player that the game is still going
                    self.player_turn.getGameResult(self.board)

            self.nplayed+=1
            if self.nplayed%self.stat==0 or self.nplayed==self.nplay:
                print(self.player_x.name+":"+str(self.nwon[self.player_x.myturn])+","+self.player_o.name+":"+str(self.nwon[self.player_o.myturn])
             +",DRAW:"+str(self.nwon[DRAW]))

            
    def switch_player(self):
        if self.player_turn == self.player_x:
            self.player_turn=self.player_o
        else:
            self.player_turn=self.player_x
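
`TTT_GameOrganizer` relies on duck typing: it only reads `name` and `myturn` from a player and calls `act(board)` and `getGameResult(board)`. The sketch below makes that implicit contract explicit with `typing.Protocol`. The `Player` and `Board` protocols, `FirstEmptyPlayer` and `StubBoard` are hypothetical names introduced for illustration only.

```python
from typing import List, Protocol


class Board(Protocol):
    def get_possible_pos(self) -> List[int]: ...


class Player(Protocol):
    name: str
    myturn: int

    def act(self, board: Board) -> int: ...

    def getGameResult(self, board: Board) -> None: ...


class FirstEmptyPlayer:
    """Hypothetical player that always takes the lowest-numbered free cell."""

    def __init__(self, turn: int):
        self.name = "FirstEmpty"
        self.myturn = turn

    def act(self, board: Board) -> int:
        return board.get_possible_pos()[0]

    def getGameResult(self, board: Board) -> None:
        pass  # this player does not learn from results


class StubBoard:
    """Stand-in for TTTBoard that exposes only what act() needs."""

    def __init__(self, free):
        self.free = list(free)

    def get_possible_pos(self):
        return self.free


player: Player = FirstEmptyPlayer(1)
print(player.act(StubBoard([3, 5, 8])))  # picks cell index 3
```

Any class with these four members can be passed to the organizer, which is why the random, Monte Carlo and Q-Learning players can be swapped freely.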

## 2.1 Random agent vs. human

Create an agent that plays completely random moves.

You can play against it with keyboard input.

As long as you pay attention, you will win.

In [3]:
import random
             

class PlayerRandom:
    def __init__(self,turn):
        self.name="Random"
        self.myturn=turn
        
    # Pick one of the available moves at random
    def act(self,board):
        acts=board.get_possible_pos()
        i=random.randrange(len(acts))
        return acts[i]
    
    
    def getGameResult(self,board):
        pass

    
class PlayerHuman:
    def __init__(self,turn):
        self.name="Human"
        self.myturn=turn
        
    def act(self,board):
        while True:
            act = input("Where would you like to place " + str(self.myturn) + " (1-9)? ")
            try:
                pos = int(act)
            except ValueError:
                print (act + " is not a valid move! Please try again.")
                continue
            # Occupied cells are not rejected here; TTTBoard.move treats them as an invalid move.
            if 1 <= pos <= 9:
                return pos-1
            print ("That is not a valid move! Please try again.")
    
    def getGameResult(self,board):
        if board.winner is not None and board.winner!=self.myturn and board.winner!=DRAW:
            print("I lost...")
        


In [4]:
def Human_vs_Random():  # set up and play one match
    
    p1=PlayerHuman(PLAYER_X)
    p2=PlayerRandom(PLAYER_O)
    game=TTT_GameOrganizer(p1,p2)
    game.progress()

Human_vs_Random()  # run it

Turn is Random
   |   |   
-----------
   | O |   
-----------
   |   |   
Turn is Human


Where would you like to place 1 (1-9)?  1


 X |   |   
-----------
   | O |   
-----------
   |   |   
Turn is Random
 X |   |   
-----------
   | O |   
-----------
 O |   |   
Turn is Human


Where would you like to place 1 (1-9)?  3


 X |   | X 
-----------
   | O |   
-----------
 O |   |   
Turn is Random
 X |   | X 
-----------
   | O |   
-----------
 O | O |   
Turn is Human


Where would you like to place 1 (1-9)?  2


 X | X | X 
-----------
   | O |   
-----------
 O | O |   
Winner : Human
Human:1,Random:0,DRAW:0


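## 2.2 Improved random agent vs. human

This agent takes a winning move whenever one is available on its next turn; otherwise it plays randomly.

You can probably still win as long as you pay attention.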

In [5]:
class PlayerAlphaRandom:
    
    
    def __init__(self,turn,name="AlphaRandom"):
        self.name=name
        self.myturn=turn
        
    def getGameResult(self,board):
        pass
        
    def act(self,board):
        acts=board.get_possible_pos()
        #see only next winnable act
        for act in acts:
            tempboard=board.clone()  # scratch copy of the board for simulation
            tempboard.move(act,self.myturn)
            # check if win
            if tempboard.winner==self.myturn:  # take this move if it satisfies the win condition
                #print ("Check mate")
                return act
        i=random.randrange(len(acts))
        return acts[i]


In [7]:
p1=PlayerHuman(PLAYER_X)
p2=PlayerAlphaRandom(PLAYER_O)
game=TTT_GameOrganizer(p1,p2)
game.progress()

Turn is AlphaRandom
   |   |   
-----------
 O |   |   
-----------
   |   |   
Turn is Human


Where would you like to place 1 (1-9)?  3


   |   | X 
-----------
 O |   |   
-----------
   |   |   
Turn is AlphaRandom
   |   | X 
-----------
 O | O |   
-----------
   |   |   
Turn is Human


Where would you like to place 1 (1-9)?  6


   |   | X 
-----------
 O | O | X 
-----------
   |   |   
Turn is AlphaRandom
   |   | X 
-----------
 O | O | X 
-----------
   |   | O 
Turn is Human


Where would you like to place 1 (1-9)?  1


 X |   | X 
-----------
 O | O | X 
-----------
   |   | O 
Turn is AlphaRandom
 X |   | X 
-----------
 O | O | X 
-----------
 O |   | O 
Turn is Human


Where would you like to place 1 (1-9)?  2


 X | X | X 
-----------
 O | O | X 
-----------
 O |   | O 
Winner : Human
Human:1,AlphaRandom:0,DRAW:0


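## Monte Carlo player
For each candidate move, play the game out to the end n times using the improved random policy (take a winning move if one exists on the next turn, otherwise move at random).

Then pick the move with the best average outcome over those n playouts.

This average outcome (+1 for a win, -1 for a loss, 0 for a draw) serves as the value estimate for each move.

The agent's strength depends on n.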
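
To make the scoring concrete, here is a minimal sketch of how n playout results collapse into one score per move. The helper `mean_return` and the win/loss/draw counts are made up for illustration; they are not taken from any run in this notebook.

```python
def mean_return(wins, losses, draws):
    """Average playout outcome with win = +1, loss = -1, draw = 0."""
    n = wins + losses + draws
    return (wins - losses) / n


# Hypothetical tally for one candidate move after n = 50 playouts.
print(mean_return(wins=47, losses=1, draws=2))
# A move that wins every playout scores 1.0, one that always loses scores -1.0.
print(mean_return(wins=50, losses=0, draws=0))
print(mean_return(wins=0, losses=50, draws=0))
```

Because draws contribute 0, a score near 0 can mean either "mostly draws" or "wins and losses cancel out"; the score alone does not distinguish the two.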

In [8]:
class PlayerMC:
    def __init__(self,turn,name="MC"):
        self.name=name
        self.myturn=turn
    
    def win_or_rand(self,board,turn):  # take a winning move if one exists, otherwise move at random
        acts=board.get_possible_pos()
        for act in acts:
            tempboard=board.clone()
            tempboard.move(act,turn)
            # check if win
            if tempboard.winner==turn:
                return act
        i=random.randrange(len(acts))
        return acts[i]
           
    def trial(self,score,board,act):
        tempboard=board.clone()
        tempboard.move(act,self.myturn)
        tempturn=self.myturn
        while tempboard.winner is None:
            tempturn=tempturn*-1
            tempboard.move(self.win_or_rand(tempboard,tempturn),tempturn)
        
        if tempboard.winner==self.myturn:
            score[act]+=1  # reward +1 for a win
        elif tempboard.winner==DRAW:
            pass
        else:
            score[act]-=1  # reward -1 for a loss

        
    def getGameResult(self,board):
        pass
        
    
    def act(self,board):
        acts=board.get_possible_pos()
        scores={}  # maps each candidate move to its score
        n=50
        for act in acts:  # for every move available in this state
            scores[act]=0  # TTT never revisits a state, so scores are keyed by move only
            for i in range(n):  # simulate n episodes to the end for each move
                #print("Try"+str(i))
                self.trial(scores,board,act)
            
            #print(scores)
            scores[act]/=n
        
        max_score=max(scores.values())
        for act, v in scores.items():  # iterate over moves and their scores together
            if v == max_score:
                print(str(act)+"="+str(v))
                return act

Have two Monte Carlo players play 10 games against each other.

In [9]:
#p1=PlayerAlphaRandom(PLAYER_X)
p1=PlayerMC(PLAYER_X,"M1")
p2=PlayerMC(PLAYER_O,"M2")
#p2=PlayerHuman(PLAYER_O)
game=TTT_GameOrganizer(p1,p2,10,False)
game.progress()

4=0.92
2=-0.58
1=0.94
7=-0.4
8=0.94
0=-0.4
5=0.44
3=0.0
6=0.0
Draw Game
4=0.72
2=-0.48
1=1.0
7=-0.42
8=0.94
0=-0.38
3=0.44
5=0.0
6=0.0
Draw Game
5=0.66
0=0.06
3=0.68
4=0.6
8=0.52
2=1.0
1=-1.0
6=1.0
Winner : M1
0=0.68
4=-0.14
3=0.8
6=0.62
2=0.66
1=0.7
7=0.0
5=0.0
8=0.0
Draw Game
4=0.76
6=-0.52
3=1.0
5=-0.5
7=0.8
1=-0.42
0=0.56
8=0.0
2=0.0
Draw Game
8=0.78
6=-0.1
2=0.86
5=-0.18
0=1.0
1=-1.0
4=1.0
Winner : M1
4=0.9
2=-0.56
1=0.92
7=-0.4
8=0.96
0=-0.36
5=0.56
3=0.0
6=0.0
Draw Game
4=0.76
0=-0.5
1=0.96
7=-0.6
3=0.82
5=-0.22
2=0.52
6=0.0
8=0.0
Draw Game
4=0.76
2=-0.6
1=1.0
7=-0.54
5=0.96
3=-0.32
0=0.52
8=0.0
6=0.0
Draw Game
4=0.74
8=-0.56
7=1.0
1=-0.6
2=0.78
6=-0.34
5=0.58
3=0.0
0=0.0
Draw Game
M1:2,M2:0,DRAW:8


In [11]:
#p1=PlayerAlphaRandom(PLAYER_X)
p1=PlayerHuman(PLAYER_O)
p2=PlayerMC(PLAYER_X)
#p2=PlayerHuman(PLAYER_O)
game=TTT_GameOrganizer(p1,p2)
game.progress()

Turn is MC
4=0.76
   |   |   
-----------
   | X |   
-----------
   |   |   
Turn is Human


Where would you like to place -1 (1-9)?  1


 O |   |   
-----------
   | X |   
-----------
   |   |   
Turn is MC
1=0.96
 O | X |   
-----------
   | X |   
-----------
   |   |   
Turn is Human


Where would you like to place -1 (1-9)?  8


 O | X |   
-----------
   | X |   
-----------
   | O |   
Turn is MC
3=0.92
 O | X |   
-----------
 X | X |   
-----------
   | O |   
Turn is Human


Where would you like to place -1 (1-9)?  6


 O | X |   
-----------
 X | X | O 
-----------
   | O |   
Turn is MC
6=0.62
 O | X |   
-----------
 X | X | O 
-----------
 X | O |   
Turn is Human


Where would you like to place -1 (1-9)?  3


 O | X | O 
-----------
 X | X | O 
-----------
 X | O |   
Turn is MC
8=0.0
 O | X | O 
-----------
 X | X | O 
-----------
 X | O | X 
Draw Game
Human:0,MC:0,DRAW:1


## Q-Learning player

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$$

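Learning behaves differently depending on the learning rate alpha, the exploration probability e of epsilon-greedy, and the discount factor gamma.

The Q-value of any state in which the game has been decided is set to 0.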
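
As a worked example of the update rule, the sketch below applies a single step to plain numbers. The function name `q_update` is illustrative; the defaults alpha = 0.3 and gamma = 0.9 and the initial Q-value of 1 mirror the values used by `PlayerQL` in this notebook.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.3, gamma=0.9):
    """One tabular Q-learning step: move Q(s,a) toward the TD target."""
    target = reward + gamma * max_q_next
    return q_sa + alpha * (target - q_sa)


# Non-terminal move while every Q-value is still at its initial value of 1:
# the target is 0 + 0.9 * 1, so Q drops slightly below 1 (about 0.97).
print(q_update(1.0, reward=0, max_q_next=1.0))
# Winning move: the next state is terminal, so max_q_next is 0 and Q stays at 1.
print(q_update(1.0, reward=1, max_q_next=0.0))
# Losing outcome: the target is -1, so Q falls to about 0.4 after one step.
print(q_update(1.0, reward=-1, max_q_next=0.0))
```

Starting all Q-values at 1, the largest possible reward, means untried moves look at least as good as tried ones, which nudges the greedy policy toward moves it has not yet evaluated.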

Each of the 9 cells can take 3 values, so there are at most 3^9 = 19,683 board states.
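
The figure 3^9 is only an upper bound, since many of those boards can never occur in a real game. As a rough sketch, the boards that are actually reachable can be enumerated by search. The helpers `has_winner` and `reachable_boards` are illustrative names; with X always moving first the search yields the well-known 5,478 positions, and the second call allows either player to open, as the game organizer in this notebook does.

```python
EMPTY, X, O = 0, 1, -1
WIN_LINES = ((0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6))


def has_winner(cells):
    return any(cells[a] != EMPTY and cells[a] == cells[b] == cells[c]
               for a, b, c in WIN_LINES)


def reachable_boards(first_players):
    """Collect every board reachable when any of first_players may open."""
    start = (EMPTY,) * 9
    seen = {(start, p) for p in first_players}  # (board, player to move)
    stack = list(seen)
    while stack:
        cells, player = stack.pop()
        if has_winner(cells):
            continue  # the game is over, so no further moves are played
        for i in range(9):
            if cells[i] == EMPTY:
                nxt = (cells[:i] + (player,) + cells[i + 1:], -player)
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return {cells for cells, _ in seen}


print(3 ** 9)                            # upper bound
print(len(reachable_boards([X])))        # X always moves first
print(len(reachable_boards([X, O])))     # either player may move first
```

The Q-table is keyed by (state, action) pairs and is filled lazily, so its size depends on which positions the players actually visit during training.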

In [12]:

class PlayerQL:
    def __init__(self,turn,name="QL",e=0.2,alpha=0.3):
        self.name=name
        self.myturn=turn
        self.q={} #set of s,a
        self.e=e  # exploration probability for epsilon-greedy
        self.alpha=alpha  # learning rate
        self.gamma=0.9  # discount factor
        self.last_move=None
        self.last_board=None
        self.totalgamecount=0
        
    
    def policy(self,board):
        self.last_board=board.clone()
        acts=board.get_possible_pos()
        
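        # epsilon-greedy exploration; the effective epsilon shrinks every 10,000 games
        if random.random() < (self.e/(self.totalgamecount//10000+1)):
            i=random.randrange(len(acts))
            # record the exploratory move too, so learn() updates the action actually taken
            self.last_move = acts[i]
            return acts[i]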
        qs = [self.getQ(tuple(self.last_board.board),act) for act in acts]
        maxQ= max(qs)
        
        # break ties among maximal Q-values at random
        if qs.count(maxQ) > 1:
            best_options = [i for i in range(len(acts)) if qs[i] == maxQ]
            i = random.choice(best_options)
        else:
            i = qs.index(maxQ)

        self.last_move = acts[i]
        return acts[i]
    
    def getQ(self, state, act):
        # Q-values live in a dict keyed by (state, act)
        if self.q.get((state, act)) is None:
            # unseen pairs start at an initial Q-value of 1
            self.q[(state, act)] = 1
        return self.q.get((state, act))
    
    def getGameResult(self,board):
        if self.last_move is not None:
            # Did the previous action lead to a state S(t+1) in which the game is decided?
            # Pass the corresponding reward to learn()
            if board.winner is None:
                self.learn(self.last_board,self.last_move, 0, board)
            else:
                if board.winner == self.myturn:
                    self.learn(self.last_board,self.last_move, 1, board)
                elif board.winner !=DRAW:
                    self.learn(self.last_board,self.last_move, -1, board)
                else:
                    self.learn(self.last_board,self.last_move, 0, board)
                self.totalgamecount+=1
                self.last_move=None
                self.last_board=None

    def learn(self,s,a,r,fs):
        pQ=self.getQ(tuple(s.board),a)
        if fs.winner is not None:
            maxQnew=0  # Q(t+1) of a decided game is 0
        else:
            # game still running: use the largest Q(t+1) over the available moves
            maxQnew=max([self.getQ(tuple(fs.board),act) for act in fs.get_possible_pos()])
        # update Q(t)
        self.q[(tuple(s.board),a)]=pQ+self.alpha*((r+self.gamma*maxQnew)-pQ)
        #print (str(s.board)+"with "+str(a)+" is updated from "+str(pQ)+" refs MAXQ="+str(maxQnew)+":"+str(r))
        #print(self.q)

    
    def act(self,board):
        return self.policy(board)
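
The exploration probability used in `policy` is not constant: `self.e` is divided by `totalgamecount//10000+1`, so it decays in steps as games are completed. A minimal sketch of that schedule follows; `effective_epsilon` is an illustrative name, and the default `e=0.2` matches the `PlayerQL` constructor.

```python
def effective_epsilon(e, total_games, step=10000):
    """Exploration probability after total_games finished games."""
    return e / (total_games // step + 1)


for games in (0, 9999, 10000, 50000, 99999):
    print(games, effective_epsilon(0.2, games))
```

With `e=0.2` the agent explores on about 20% of its moves during the first 10,000 games, 10% during the next 10,000, and roughly 2% by the end of a 100,000-game run.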

## Training Q-Learning players by playing them against each other

In [13]:
pQ=PlayerQL(PLAYER_O,"QL1")
p2=PlayerQL(PLAYER_X,"QL2")
game=TTT_GameOrganizer(pQ,p2,100000,False,False,10000)  # 100,000 games, stats printed every 10,000
game.progress()


QL1:4337,QL2:4408,DRAW:1255
QL1:8418,QL2:8346,DRAW:3236
QL1:11947,QL2:11905,DRAW:6148
QL1:14133,QL2:14201,DRAW:11666
QL1:14975,QL2:15075,DRAW:19950
QL1:15516,QL2:15619,DRAW:28865
QL1:15927,QL2:16046,DRAW:38027
QL1:16346,QL2:16374,DRAW:47280
QL1:16649,QL2:16722,DRAW:56629
QL1:16980,QL2:16955,DRAW:66065
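
The counts in this log are cumulative, which hides how quickly play converges. The short sketch below copies the ten rows from the log and takes differences to recover the results of each block of 10,000 games. The variable names are illustrative.

```python
# Cumulative (QL1 wins, QL2 wins, draws) copied from the training log above.
cumulative = [
    (4337, 4408, 1255),
    (8418, 8346, 3236),
    (11947, 11905, 6148),
    (14133, 14201, 11666),
    (14975, 15075, 19950),
    (15516, 15619, 28865),
    (15927, 16046, 38027),
    (16346, 16374, 47280),
    (16649, 16722, 56629),
    (16980, 16955, 66065),
]

previous = (0, 0, 0)
draws_per_block = []
for row in cumulative:
    # Results of this block alone = current totals minus previous totals.
    block = tuple(now - before for now, before in zip(row, previous))
    draws_per_block.append(block[2])
    print(block, "draw rate: {:.1%}".format(block[2] / sum(block)))
    previous = row
```

Draws rise from 1,255 in the first block to 9,436 in the last, so by the end of training the two players draw most of their games even though some exploration is still switched on.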


## Pitting the trained Q-Learning player against the Monte Carlo player

In [16]:
pQ.e=0  # turn off exploration
p2=PlayerMC(PLAYER_X,"M1")
game=TTT_GameOrganizer(pQ,p2,1000,False,False,10)
game.progress()


0=0.52
2=0.88
7=-0.08
3=0.38
8=0.0
4=0.72
3=1.0
7=0.84
8=0.54
2=0.0
4=0.68
3=0.88
7=0.84
0=0.44
2=0.0
4=0.66
3=0.98
7=0.88
0=0.56
2=0.0
4=0.94
3=0.96
7=0.98
8=0.5
2=0.0
4=0.68
3=0.98
7=0.82
0=0.6
2=0.0
4=0.68
7=0.98
0=0.84
5=0.6
2=0.0
4=0.66
3=0.94
8=0.92
1=0.52
2=0.0
4=0.62
7=0.96
0=0.88
5=0.58
2=0.0
4=0.82
3=0.9
7=0.88
0=0.56
2=0.0
QL1:0,M1:0,DRAW:10
4=0.7
3=0.92
7=0.94
8=0.5
2=0.0
4=0.78
3=0.96
8=0.84
7=0.56
2=0.0
4=0.84
3=0.96
7=0.84
0=0.64
2=0.0
4=0.64
3=0.94
8=0.96
1=0.52
2=0.0
4=0.72
3=0.98
8=0.84
7=0.6
2=0.0
4=0.64
3=0.96
7=0.92
0=0.5
2=0.0
4=0.66
7=0.98
0=0.8
3=0.6
2=0.0
4=0.8
3=1.0
8=0.9
7=0.56
2=0.0
4=0.64
3=0.94
7=0.82
0=0.62
2=0.0
4=0.74
7=0.98
0=0.86
3=0.46
2=0.0
QL1:0,M1:0,DRAW:20
4=0.82
7=1.0
0=0.86
3=0.62
2=0.0
4=0.72
7=1.0
0=0.92
3=0.54
2=0.0
4=0.68
7=0.98
0=0.88
3=0.48
2=0.0
4=0.74
3=0.9
7=0.86
0=0.48
2=0.0
4=0.68
7=0.98
3=0.86
8=0.52
2=0.0
4=0.8
7=0.92
0=0.82
3=0.48
2=0.0
4=0.74
3=0.9
7=0.84
0=0.46
2=0.0
4=0.8
3=0.92
8=0.9
7=0.54
2=0.0
4=0.68
3=0.86
7=0.86
8=0.64
2=

## Playing against the trained Q-Learning player

If you make no mistakes you can hold it to a draw, but once you slip up you lose.

In [15]:
pQ.e=0
p2=PlayerHuman(PLAYER_X)
game=TTT_GameOrganizer(pQ,p2)
game.progress()

Turn is QL1
   |   |   
-----------
   |   | O 
-----------
   |   |   
Turn is Human


Where would you like to place 1 (1-9)?  1


 X |   |   
-----------
   |   | O 
-----------
   |   |   
Turn is QL1
 X |   | O 
-----------
   |   | O 
-----------
   |   |   
Turn is Human


Where would you like to place 1 (1-9)?  9


 X |   | O 
-----------
   |   | O 
-----------
   |   | X 
Turn is QL1
 X |   | O 
-----------
   | O | O 
-----------
   |   | X 
Turn is Human


Where would you like to place 1 (1-9)?  7


 X |   | O 
-----------
   | O | O 
-----------
 X |   | X 
Turn is QL1
 X |   | O 
-----------
 O | O | O 
-----------
 X |   | X 
I lost...
Winner : QL1
QL1:1,Human:0,DRAW:0
