<center><h2>Artificial and Computational Intelligence (Assignment - 2)</h2></center>

## Problem Statement

As part of the 2nd Assignment, we'll implement Bayesian Networks and also learn to use the pomegranate library.

You are required to create a Bayesian network model that predicts the probabilities asked for in the problem description. The detailed problem description, along with the marking scheme, is attached to this assignment as a PDF.

### What is a Bayesian Network?

A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). 

Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. 
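
To make the disease and symptom example concrete, here is a minimal sketch of Bayes' rule for a single binary cause. All numbers and the function name `posterior` are hypothetical and chosen only for illustration; they do not come from the assignment.

```python
# Hypothetical probabilities, chosen only for illustration.
p_disease = 0.01                # prior P(disease)
p_symptom_given_disease = 0.9   # P(symptom | disease)
p_symptom_given_healthy = 0.05  # P(symptom | no disease)


def posterior(prior, likelihood, likelihood_not):
    """Return P(cause | effect) via Bayes' rule for a binary cause."""
    # total probability of observing the effect
    evidence = likelihood * prior + likelihood_not * (1 - prior)
    return likelihood * prior / evidence


p_disease_given_symptom = posterior(
    p_disease, p_symptom_given_disease, p_symptom_given_healthy
)
print(round(p_disease_given_symptom, 4))
```

Even with a symptom that is very likely under the disease, the posterior stays modest here because the assumed prior is small.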

### Dataset

The dataset can be downloaded from https://drive.google.com/drive/folders/1oMtKmmvPkN4O8DmrHMJe6M8CbB93Z5kw . You can access it only with your BITS ID. The same dataset is also attached to the assignment.

#### Dataset Description
##### Sample Tuple

Y	won	5wickets	lost	2nd	vWest_Indies	Home	6-Nov-11

##### Explanation
- The first column indicates whether Ashwin was in the playing 11 (Y) or not (N).
- The second column is the result of the match; won indicates that India won the match.
- The third column is the margin of victory or loss.
- The fourth column is the result of the toss; won indicates that India won the toss.
- The fifth column is the batting order, that is, whether India batted 1st or 2nd.
- The sixth column is the opponent.
- The seventh column is the location of the match: Home (India) or Away.
- The last column is the start date of the match.
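
The column layout above can be sketched as a small parser for one tab-separated row. The field names in `FIELDS` and the helper `parse_row` are assumptions made for this illustration; the spreadsheet's own headers may differ.

```python
from datetime import datetime

# Illustrative field names (assumed), in the column order described above.
FIELDS = ["ashwin_played", "result", "margin", "toss",
          "batting_order", "opponent", "location", "start_date"]


def parse_row(line):
    """Split one tab-separated row into a dict keyed by FIELDS."""
    values = line.strip().split("\t")
    record = dict(zip(FIELDS, values))
    # dates look like 6-Nov-11 (day, abbreviated English month, 2-digit year)
    record["start_date"] = datetime.strptime(
        record["start_date"], "%d-%b-%y"
    ).date()
    return record


sample = "Y\twon\t5wickets\tlost\t2nd\tvWest_Indies\tHome\t6-Nov-11"
record = parse_row(sample)
print(record["result"], record["toss"], record["start_date"])
```

Note that the second and fourth columns share the same vocabulary (won, lost), so keeping them under distinct names avoids mixing up match result and toss result.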


### Evaluation
The evaluation is based on:
- the coding practices followed
- comments that explain the code and the reasoning behind it
- your understanding and explanation of the data
- how well the model performs

In [379]:
# BITS RollNumbers , Names. 
# 2018AB04579, Abhishek.K.R.


In [380]:
#Import libraries
import openpyxl
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set_style('whitegrid')
import numpy as np
from collections import Counter

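# explicit names from the pre-1.0 pomegranate API used in this notebook
from pomegranate import (BayesianNetwork, ConditionalProbabilityTable,
                         DiscreteDistribution, State)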

np.random.seed(0)
np.set_printoptions(suppress=True)

In [381]:
# method to compute prior probability given the 1-d array
#Solution for part 1
def get_prior_probability(arr):
    total_count = arr.size
    counter = Counter(arr)
    counter = dict(counter)
    for key in counter:
        counter[key] = counter[key]/total_count
    return counter

# method to compute P(last column | other columns) for every row combination
# observed in the n-d array, where the nth column is the dependent column
#Solution for part 2
def get_posterior_probability(arr):
    total_columns = arr.shape[1]
    sub_arr = arr[:,:total_columns-1]

    # counts of each parent combination and of each full row
    unique_sub_arr_count = get_unique_row_count(sub_arr)
    unique_arr_count = get_unique_row_count(arr)

    posterior_probability_arr = []

    for t in unique_arr_count:
        temp_arr = list(t)
        # the row without its last (dependent) value
        temp_sub_arr = temp_arr[:-1]

        probability = unique_arr_count[tuple(temp_arr)]/unique_sub_arr_count[tuple(temp_sub_arr)]
        temp_arr.append(float(probability))
        posterior_probability_arr.append(temp_arr)
    return posterior_probability_arr

#helper method to fill in the gaps. This is needed to use the model building available in Pomegranate library
#It appends to posterior_prob in place and returns nothing.
#Note: a parent combination that never occurs in the data gets 0.0 for every result,
#so that row of the table does not sum to 1.
def fill_probability_gaps(dataset, posterior_prob):
    ashwin_playing_dict = get_prior_probability(dataset[:,0])
    toss_dict = get_prior_probability(dataset[:,1])
    batting_dict = get_prior_probability(dataset[:,2])
    location_dict = get_prior_probability(dataset[:,3])
    result_dict = get_prior_probability(dataset[:,4])

    # Counter returns 0 for combinations that are absent from the data
    unique_arr_dict = get_unique_row_count(dataset)
    for ashwin_playing in ashwin_playing_dict:
        for toss in toss_dict:
            for batting in batting_dict:
                for location in location_dict:
                    for result in result_dict:
                        temp_arr = [ashwin_playing, toss, batting, location, result]
                        if unique_arr_dict[tuple(temp_arr)] == 0:
                            temp_arr.append(0.0)
                            posterior_prob.append(temp_arr)


# count how many times each distinct row occurs
def get_unique_row_count(arr):
    unique_arr_count = Counter()
    for x in arr:
        unique_arr_count[tuple(x)] += 1
    return unique_arr_count

# number of worksheet rows that contain at least one non-empty cell
def row_count(ws):
    return len([row for row in ws if not all(cell.value is None for cell in row)])

def unique_rows(data):
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])
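
The counting idea behind `get_posterior_probability` can be sketched without NumPy: each conditional probability is the count of a full row divided by the count of its parent values. The rows below are toy data invented for this sketch, and `conditional_table` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Toy rows (hypothetical): two parent columns followed by the outcome.
rows = [
    ("Y", "Home", "won"),
    ("Y", "Home", "won"),
    ("Y", "Home", "lost"),
    ("N", "Away", "lost"),
]


def conditional_table(rows):
    """Return {(parents..., outcome): P(outcome | parents)} from raw counts."""
    full_counts = Counter(rows)
    parent_counts = Counter(row[:-1] for row in rows)
    return {
        row: count / parent_counts[row[:-1]]
        for row, count in full_counts.items()
    }


table = conditional_table(rows)
for row, probability in table.items():
    print(row, round(probability, 3))
```

For every parent combination that appears in the data, the probabilities over outcomes sum to 1; combinations that never appear are simply missing from the table, which is the gap that `fill_probability_gaps` addresses.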

In [382]:
#Read data
wb_obj = openpyxl.load_workbook("India_Test_stats.xlsx") 
sheet_obj = wb_obj.active

In [383]:
#Data cleaning
dataset = []
# read only as many rows as there are non-empty rows, so trailing blank rows are dropped
total_rows = row_count(sheet_obj)
for row in sheet_obj.rows:
    if total_rows == 0:
        break
    total_rows = total_rows - 1
    dataset.append([cell.value for cell in row])

dataset = np.array(dataset)
# drop the first row (assumed to be the header)
dataset = np.delete(dataset,0,0)
ASHWIN_PLAYED_INDEX = 0
RESULT_INDEX = 1
TOSS_INDEX = 3
BAT_INDEX = 4
LOCATION_INDEX = 6

#removing unwanted columns
dataset = dataset[:, [ASHWIN_PLAYED_INDEX, TOSS_INDEX, BAT_INDEX,LOCATION_INDEX,RESULT_INDEX]]


In [384]:
#building the model in Pomegranate library
##Construction of Bayesian Network

# Prior probability of Ashwin playing
ashwin_playing = DiscreteDistribution(get_prior_probability(dataset[:,0]))

# Prior probability of toss
toss = DiscreteDistribution(get_prior_probability(dataset[:,1]))

# Prior probability of batting
batting = DiscreteDistribution(get_prior_probability(dataset[:,2]))

# Prior probability of location
location = DiscreteDistribution(get_prior_probability(dataset[:,3]))

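# Rows of [ashwin_playing, toss, batting, location, result, probability];
# combinations missing from the data are appended with probability 0.0
posterior_prob = get_posterior_probability(dataset)
fill_probability_gaps(dataset, posterior_prob)

# Conditional probability table of the result given its four parents
result = ConditionalProbabilityTable(
        posterior_prob, [ashwin_playing, toss, batting, location])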


# State objects hold both the distribution, and a high level name.
s1 = State(ashwin_playing, name="ashwin_playing")
s2 = State(toss, name="toss")
s3 = State(batting, name="batting")
s4 = State(location, name="location")
s5 = State(result, name="result")

# Create the Bayesian network object with a useful name
model = BayesianNetwork("Ashwin Selection Model")

# Add the five states to the network 
model.add_states(s1, s2, s3, s4, s5)

# Add edges which represent conditional dependencies, where the fifth node is 
# conditionally dependent on all the other nodes
model.add_edge(s1, s5)
model.add_edge(s2, s5)
model.add_edge(s3, s5)
model.add_edge(s4, s5)

model.bake()
model.structure

((), (), (), (), (0, 1, 2, 3))
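
The structure tuple printed above appears to list, for each node in order, the indices of its parents. Under that reading, a small sketch can turn it into named edges; `edges_from_structure` and the `names` list are hypothetical helpers that mirror the state names used in this notebook.

```python
def edges_from_structure(structure, names):
    """Turn a tuple of parent-index tuples into (parent, child) name pairs."""
    return [
        (names[parent], names[child])
        for child, parents in enumerate(structure)
        for parent in parents
    ]


structure = ((), (), (), (), (0, 1, 2, 3))
names = ["ashwin_playing", "toss", "batting", "location", "result"]
edges = edges_from_structure(structure, names)
for parent, child in edges:
    print(parent, "->", child)
```

The four empty tuples correspond to the root nodes, and the last entry shows that result depends on all four of them, matching the `add_edge` calls.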

In [385]:
model.predict_proba([None,None,None,None,None])

array([{
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Y" :0.8940322344083566,
            "N" :0.10596776559164342
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "lost" :0.585939409742027,
            "won" :0.4140605902579731
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "2nd" :0.401039714485928,
            "1st" :0.5989602855140721
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Home" :0.5062904356045173,
            "Away" :0.49370956439548275
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",

In [386]:
#a. India winning, batting 2nd, Ashwin playing
model.predict_proba(['Y',None,'2nd',None,'won'])


array(['Y',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "lost" :0.6279069767441859,
            "won" :0.3720930232558141
        }
    ],
    "frozen" :false
},
       '2nd',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Home" :1.0,
            "Away" :0.0
        }
    ],
    "frozen" :false
},
       'won'], dtype=object)

In [387]:
#b. India winning, batting 2nd, Ashwin not playing
model.predict_proba(['N',None,'2nd',None,'won'])


array(['N',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "lost" :0.5294117647058824,
            "won" :0.47058823529411764
        }
    ],
    "frozen" :false
},
       '2nd',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Home" :0.5058823529411764,
            "Away" :0.49411764705882355
        }
    ],
    "frozen" :false
},
       'won'], dtype=object)

In [388]:
#c. India losing, batting 2nd, Ashwin playing
model.predict_proba(['Y',None,'2nd',None,'lost'])


array(['Y',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "lost" :1.0,
            "won" :0.0
        }
    ],
    "frozen" :false
},
       '2nd',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Home" :0.07445301432394161,
            "Away" :0.9255469856760584
        }
    ],
    "frozen" :false
},
       'lost'], dtype=object)

In [389]:
#d. India losing, batting 2nd, Ashwin not playing
model.predict_proba(['N',None,'2nd',None,'lost'])


array(['N',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "lost" :0.6923076923076922,
            "won" :0.30769230769230776
        }
    ],
    "frozen" :false
},
       '2nd',
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Home" :0.0,
            "Away" :1.0
        }
    ],
    "frozen" :false
},
       'lost'], dtype=object)
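
Each query above fixes some variables and asks for a distribution over the ones left as `None`. As a rough illustration of the quantity involved, the sketch below computes a posterior over the toss by exact enumeration, summing out the unobserved location. It is not pomegranate's internal algorithm, and every number, as well as the helper `posterior_toss`, is hypothetical.

```python
def posterior_toss(toss_prior, location_prior, p_result):
    """Return P(toss | observed result), summing out the unobserved location."""
    weights = {}
    for toss, p_toss in toss_prior.items():
        # joint weight of this toss value with the evidence, over all locations
        weights[toss] = sum(
            p_toss * p_loc * p_result[(toss, loc)]
            for loc, p_loc in location_prior.items()
        )
    total = sum(weights.values())  # assumed non-zero in this sketch
    return {toss: weight / total for toss, weight in weights.items()}


# Hypothetical priors and likelihoods, not taken from the dataset.
toss_prior = {"won": 0.5, "lost": 0.5}
location_prior = {"Home": 0.5, "Away": 0.5}
# P(result = won | toss, location), with the other evidence held fixed
p_win = {
    ("won", "Home"): 0.8, ("won", "Away"): 0.4,
    ("lost", "Home"): 0.6, ("lost", "Away"): 0.2,
}

toss_posterior = posterior_toss(toss_prior, location_prior, p_win)
print(toss_posterior)
```

If every likelihood for the observed evidence were zero, the normalising total would be zero as well, which is one reason all-zero rows in a conditional probability table deserve attention.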

###### Pre-process data (Whatever you feel might be required)

<h3><center> Happy Coding!</center></h3>