# Modern Cricket Simulation for the Twenty20 (T20) Format
## ------------------------------------------------------------------------------------

## Project Description:

### As of now, no algorithm can correctly predict the next ball in a cricket game. Even the well-known Don Bradman Cricket game did not implement a proper next-ball prediction algorithm. This project aims to predict, at a high level, the outcome of the next ball in a real cricket match, and to use that analysis as the basis for a next-ball prediction algorithm for a cricket game.

### Cricket Game: Cricket is played mostly in Asian and African countries and in Australia. Two teams of 11 players each compete for victory. It is played in three formats: Test, ODI and T20. This project analyses the T20 format. At any given time one side bats and the other bowls; a side never does both at once. In the T20 format each team bats for up to 20 overs, and each over has 6 balls. Each ball has one of five outcomes: runs, a wicket, a defended ball (no run), extras, or a boundary.

### We use the following relations to model the outcome of every ball:
#### 1. Batsman performance (features: batsman name, high score, strike rate, number of boundaries, average, total runs)
#### 2. Bowler performance (features: bowler name, high score, strike rate, average, number of wickets)
#### 3. Bowler-batsman relation
#### 4. Batsman and location (venue)
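The relations listed above can be turned into running, per-ball features with grouped cumulative sums. The following is only a minimal sketch on made-up data: the player names, the numbers, and the column names `total_runs`, `balls_faced`, `runs_per_ball`, `runs_vs_bowler` and `runs_before_ball` are illustrative assumptions, not the project's actual schema.

```python
import pandas as pd

# Toy ball-by-ball data; players and runs are invented for illustration.
balls = pd.DataFrame({
    "Striker": ["A", "A", "B", "A", "B", "B"],
    "Bowler":  ["X", "X", "X", "Y", "Y", "X"],
    "Runs":    [1, 4, 0, 2, 6, 1],
})

# Batsman performance: running runs and balls faced for each striker.
by_striker = balls.groupby("Striker")
balls["total_runs"] = by_striker["Runs"].cumsum()
balls["balls_faced"] = by_striker.cumcount() + 1

# Runs per ball so far (multiply by 100 for the conventional strike rate).
balls["runs_per_ball"] = balls["total_runs"] / balls["balls_faced"]

# Bowler-batsman relation: runs this striker has taken off this bowler so far.
balls["runs_vs_bowler"] = balls.groupby(["Striker", "Bowler"])["Runs"].cumsum()

# A cumulative sum includes the current ball. Subtracting the current ball's
# runs gives the total known *before* the ball is bowled.
balls["runs_before_ball"] = balls["total_runs"] - balls["Runs"]

print(balls)
```

Whether a feature should include the current ball depends on how it is used. For predicting a ball that has not yet been bowled, the "before the ball" variant is the one that would be available at prediction time.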


### Data Preprocessing

In [179]:
import glob
import pandas as pd

path = r'C:\Users\manoj\Desktop\MLproject\Manojkumar_Gaddam\Data\T20'  # folder holding the ball-by-ball CSV files
allFiles = glob.glob(path + "/*.csv")  # every CSV file in that folder
list_ = []  # one DataFrame per match
for file_ in allFiles:
    # Read one match file and assign column names.
    # on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines=False.
    data = pd.read_csv(file_, on_bad_lines='skip',
                       names=["type_of_information", "innings_or_info", "ball",
                              "Batting_team", "Striker", "Non_striker", "Bowler", "Runs",
                              "Extras", "wicket_type", "Batsman_out"])
    # Keep only the ball-by-ball rows; .copy() avoids SettingWithCopyWarning below.
    a = data[data.type_of_information == "ball"].copy()
    # Match-level metadata (date, venue, city) is attached to every ball of the match.
    a['date'] = data[data.innings_or_info == "date"].ball.item()
    a['Venue'] = data[data.innings_or_info == "venue"].ball.item()
    a['city'] = data[data.innings_or_info == "city"].ball.item()
    list_.append(a)
frame = pd.concat(list_)  # combine all matches into a single DataFrame
# Running (cumulative) batting totals for each striker: overall, per match, per venue.
frame['Total runs scored'] = frame.groupby('Striker').Runs.cumsum()
frame['Runs in match'] = frame.groupby(['Striker', 'date']).Runs.cumsum()
frame['Runs_in_this_venue'] = frame.groupby(['Striker', 'Venue']).Runs.cumsum()

# 1 if the ball produced a dismissal of one of these types, else 0.
frame['wickets'] = frame.wicket_type.isin(['caught', 'bowled', 'run out', 'stumped', 'retired hurt',
       'lbw', 'caught and bowled', 'hit wicket']).astype(int)
# Running wicket totals for each bowler: overall and per venue.
frame['TotalWickets'] = frame.groupby(['Bowler']).wickets.cumsum()
frame['TotalWickets_thisVenue'] = frame.groupby(['Bowler', 'Venue']).wickets.cumsum()
# Runs the striker has scored against this bowler (RAB): overall and per venue.
frame['RunsAgainstthisBowler'] = frame.groupby(['Bowler', 'Striker']).Runs.cumsum()
frame['RAB_venue'] = frame.groupby(['Bowler', 'Striker', 'Venue']).Runs.cumsum()
# Number of balls (NOB) faced so far, counted by summing a column of ones.
frame['Dummyforballs'] = 1
frame['NOB_BBC'] = frame.groupby(['Bowler', 'Striker']).Dummyforballs.cumsum()
frame['NOB_venue'] = frame.groupby(['Venue', 'Striker']).Dummyforballs.cumsum()
frame['NOB_BBC_venue'] = frame.groupby(['Venue', 'Striker', 'Bowler']).Dummyforballs.cumsum()
frame['NOB_match'] = frame.groupby(['date', 'Striker']).Dummyforballs.cumsum()
frame['NOB'] = frame.groupby(['Striker']).Dummyforballs.cumsum()
# Strike rates, expressed here as runs per ball (not multiplied by 100).
frame['StrikeRate'] = frame['Total runs scored'] / frame.NOB
frame['SAB'] = frame.RunsAgainstthisBowler / frame.NOB_BBC
frame['SAB_venue'] = frame.RAB_venue / frame.NOB_BBC_venue
frame['SR_match'] = frame['Runs in match'] / frame.NOB_match
# Wickets the batting side has lost so far in the current innings.
frame['Number_of_wickets_lost'] = frame.groupby(['date', 'innings_or_info', 'Batting_team', 'Venue']).wickets.cumsum()

frame=frame.drop(['2010/02/13'],axis=0)

path = r'C:\Users\manoj\Desktop\MLproject\Manojkumar_Gaddam\Data\Bowler details.xlsx'  # Excel file with the bowler details
Bowler_details = pd.read_excel(path, sheet_name='Sheet1')  # the keyword is sheet_name; the older sheetname spelling was removed
# Drop non-ASCII characters such as \xa0 (non-breaking space) and trim surrounding whitespace.
Bowler_details.Player = Bowler_details.Player.str.encode('ascii', 'ignore').str.decode('ascii').str.strip()

path = r'C:\Users\manoj\Desktop\MLproject\Manojkumar_Gaddam\Data\Batsman details.xlsx'  # Excel file with the batsman details
Batsman_details = pd.read_excel(path, sheet_name='Sheet1')
Batsman_details.Player = Batsman_details.Player.str.encode('ascii', 'ignore').str.decode('ascii').str.strip()
# Left-join the player details onto every ball: first by bowler, then by striker.
result = pd.merge(frame, Bowler_details, how='left', left_on='Bowler', right_on='Player')
result = pd.merge(result, Batsman_details, how='left', left_on='Striker', right_on='Player')

result['Powerplay'] = pd.to_numeric(result.ball, errors='coerce') < 6.0  # boolean column: True for balls in the first six overs
result['No_of_boundaries'] = result['4s'] + result['6s']  # sum of the 4s and 6s columns
result = result.fillna(value='0')  # replace missing values with the string '0'
# Outcome classes: 0 = extras, 1 = wicket, 2 = defensed (no run), 3 = runs other than a boundary, 4 = boundary.
# Later assignments overwrite earlier ones, so the order of the lines below matters.
# .loc is used instead of chained indexing to avoid SettingWithCopyWarning.
result["output"] = 'None'
result.loc[result.wicket_type.isin(['caught', 'bowled', 'run out', 'stumped', 'retired hurt',
       'lbw', 'caught and bowled', 'hit wicket']), 'output'] = 1  # wicket
result.loc[result.Extras.isin([1, 4, 2, 3, 5]), 'output'] = 0  # extras
result.loc[result.Runs_x.isin([4, 6]), 'output'] = 4  # boundary (4 or 6)
result.loc[~result.Runs_x.isin([0, 4, 6]), 'output'] = 3  # runs other than 0, 4 and 6
result.loc[result.output == 'None', 'output'] = 2  # anything still unlabelled is a defensed ball
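The labelling rules in the cell above depend on the order of the assignments, which is easy to miss. As a hedged sketch, the same precedence can be written explicitly with `numpy.select`, which takes the first matching condition. The helper name `label_outcome` and the toy data are assumptions made for illustration, and the toy frame uses a plain `Runs` column where the merged notebook data uses `Runs_x`.

```python
import numpy as np
import pandas as pd

WICKET_TYPES = ["caught", "bowled", "run out", "stumped", "retired hurt",
                "lbw", "caught and bowled", "hit wicket"]

def label_outcome(df):
    """Return the 0-4 outcome class for every ball (hypothetical helper).

    np.select picks the FIRST condition that matches, so the conditions are
    listed from highest to lowest precedence. That is the reverse of the
    assignment order in the notebook cell, where later lines win.
    """
    runs = df["Runs"]
    conditions = [
        ~runs.isin([0, 4, 6]),                  # 3: runs other than a boundary
        runs.isin([4, 6]),                      # 4: boundary
        df["Extras"].isin([1, 2, 3, 4, 5]),     # 0: extras
        df["wicket_type"].isin(WICKET_TYPES),   # 1: wicket
    ]
    choices = [3, 4, 0, 1]
    # Anything that matches no condition is a defensed ball (class 2).
    return pd.Series(np.select(conditions, choices, default=2), index=df.index)

# Toy balls: dot ball, single, four, wide, bowled, six.
toy = pd.DataFrame({
    "Runs":        [0, 1, 4, 0, 0, 6],
    "Extras":      [0, 0, 0, 1, 0, 0],
    "wicket_type": [None, None, None, None, "bowled", None],
})
toy["output"] = label_outcome(toy)
print(toy)
```

Writing the precedence out this way makes it visible that, for example, a run-out on a ball that also produced a single would be labelled as runs (class 3) rather than as a wicket.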




In [186]:
final = result[['Total runs scored', 'Runs in match', 'Runs_in_this_venue', 'output', 'Powerplay', 'TotalWickets', 'TotalWickets_thisVenue', 'RunsAgainstthisBowler', 'RAB_venue', 'StrikeRate', 'SAB', 'SAB_venue', 'SR_match', 'Number_of_wickets_lost']].copy()
# Keep only the features needed to predict the next ball, plus the target column `output`, in `final`.
# .copy() makes `final` an independent DataFrame, which avoids SettingWithCopyWarning.
final['output'] = final['output'].astype(int)



In [53]:
final=final.rename(index=str,columns={"4":"No of wickets>4","SR_x":"strikerate_bowler","SR_y":"strikerate_batsman","Ave_x":"average_bowler","Ave_y":"average_batsman","Runs":"runsbowler","Runs_y":"runs_batsman"})
# renaming the columns for a better understanding of the type of the column 

### Final Data

In [187]:
print(final.head(3))

   Total runs scored  Runs in match  Runs_in_this_venue  output Powerplay  \
0                1.0            1.0                 1.0       3      True   
1                0.0            0.0                 0.0       2      True   
2                2.0            2.0                 2.0       3      True   

   TotalWickets  TotalWickets_thisVenue  RunsAgainstthisBowler  RAB_venue  \
0             0                       0                    1.0        1.0   
1             0                       0                    0.0        0.0   
2             0                       0                    2.0        2.0   

   StrikeRate  SAB  SAB_venue  SR_match  Number_of_wickets_lost  
0         1.0  1.0        1.0       1.0                       0  
1         0.0  0.0        0.0       0.0                       0  
2         1.0  1.0        1.0       1.0                       0  


In [182]:
#FINAL TABLE SIZE
final.shape # gives the number of rows and columns

(123062, 14)

In [8]:
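# Bar chart of the number of balls in each outcome class
import matplotlib.pyplot as plt
import numpy as np
labels = ('Runs', 'Defensed', 'Boundaries', 'Wicket', 'Extras')
codes = [3, 2, 4, 1, 0]  # output codes in the same order as the labels
band_width = 0.5
# value_counts() sorts by frequency, so reindex to keep each bar under its own label
x = final.output.value_counts().reindex(codes, fill_value=0)
index = np.arange(len(codes))
rects1 = plt.bar(index, x, band_width, color=list('rgbyw'), edgecolor='black')
plt.xlabel('Output')
plt.ylabel('No of balls')
plt.xticks(index, labels)  # bars are centred on index in current matplotlib
plt.title('Number of balls for each type of output')
plt.tight_layout()
plt.show()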



In [11]:
import matplotlib.pyplot as plt
import numpy as np
labels = ('Runs', 'Defensed', 'Boundaries', 'Wicket', 'Extras')
codes = [3, 2, 4, 1, 0]  # output codes in the same order as the labels
band_width = 0.5
# `final` has no Bowler column, so filter on `result`, which still has it
x = result.loc[result['Bowler'] == 'Z Khan', 'output'].value_counts().reindex(codes, fill_value=0)
index = np.arange(len(codes))
rects1 = plt.bar(index, x, band_width, color=list('rgbyw'), edgecolor='black')
plt.xlabel('Output')
plt.ylabel('No of balls')
plt.xticks(index, labels)
plt.title('Output for bowler Z Khan (bowler relation)')
plt.tight_layout()
plt.show()



In [13]:
import matplotlib.pyplot as plt
import numpy as np
labels = ('Runs', 'Defensed', 'Boundaries', 'Wicket', 'Extras')
codes = [3, 2, 4, 1, 0]  # output codes in the same order as the labels
band_width = 0.5
# `final` has no Bowler or Striker column, so filter on `result`, which still has them
mask = (result['Bowler'] == 'Z Khan') & (result.Striker == 'GC Smith')
x = result.loc[mask, 'output'].value_counts().reindex(codes, fill_value=0)
index = np.arange(len(codes))
rects1 = plt.bar(index, x, band_width, color=list('rgbyw'), edgecolor='black')
plt.xlabel('Output')
plt.ylabel('No of balls')
plt.xticks(index, labels)
plt.title('Output for bowler Z Khan against batsman GC Smith (batsman-bowler relation)')
plt.tight_layout()
plt.show()



# Removing the features which cannot be used to fit the model

In [201]:

data = final.drop(['output'], axis=1)  # feature matrix: every column except the target
data = data.replace('-', 0)   # treat '-' placeholders as 0
data = data.replace('na', 0)  # treat 'na' placeholders as 0
from sklearn import preprocessing
x_scaled = preprocessing.scale(data)  # standardize each feature to zero mean and unit variance
g1 = final.output.values  # target labels as a NumPy array (as_matrix() was removed in pandas 1.0)

In [202]:
x=[]
x.append(x_scaled)
x.append(g1)
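Accuracy figures for a five-class problem are easier to read next to a trivial baseline. The sketch below shows one way to obtain one with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The data here is synthetic and the class proportions are invented; on the notebook's data one would pass `x[0]` and `x[1]` instead of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the scaled features and the 0-4 outcome labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = rng.choice([0, 1, 2, 3, 4], size=500, p=[0.05, 0.05, 0.35, 0.40, 0.15])

# Always predict the most frequent class, ignoring the features entirely.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=10, scoring="accuracy")
print("baseline accuracy:", baseline_scores.mean())
```

A model is only useful to the extent that it beats this number, which is simply the share of the largest class.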

# Fitting Different Models to Choose The Best One

### Gaussian Naive Bayes

In [207]:
from sklearn.naive_bayes import GaussianNB
import numpy as np
clf= GaussianNB()
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
X_new = SelectKBest(f_classif, k=7).fit_transform(x[0], x[1])
scores = cross_val_score(clf,X_new, x[1],cv=10,scoring='accuracy')
print(scores)
print("mean scores:",np.mean(scores))

[ 0.55317248  0.56849204  0.56272343  0.55570001  0.56569432  0.55806583
  0.57131247  0.56505486  0.56087451  0.55892393]
mean scores: 0.562001387025
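In the cell above, `SelectKBest` is fitted on the full dataset before cross-validation, so the feature selection has already seen the labels of every test fold. A possible alternative, sketched here on synthetic data, is to put scaling and selection inside a scikit-learn pipeline so that they are refitted on the training part of each fold. `X_demo` and `y_demo` are made-up stand-ins, and this is not how the notebook's reported scores were produced.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 13 features, 5 classes, one feature made informative.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 5, size=600)
X_demo = rng.normal(size=(600, 13))
X_demo[:, 0] += y_demo

# Scaling and feature selection are fitted on the training folds only.
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=7),
    GaussianNB(),
)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=10, scoring="accuracy")
print(pipe_scores)
print("mean scores:", pipe_scores.mean())
```

The same pipeline shape works for the other classifiers in this notebook by swapping the last step.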


### Logistic Regression with newton-cg solver 

In [206]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
X_new = SelectKBest(f_classif, k=7).fit_transform(x[0], x[1])  # keep the 7 features with the highest ANOVA F-scores
# With a multiclass target, newton-cg fits a multinomial (softmax) model by default in scikit-learn >= 0.22,
# so the multi_class argument, deprecated in recent releases, is not needed.
clf = LogisticRegression(solver='newton-cg')
scores = cross_val_score(clf, X_new, x[1], cv=10, scoring='accuracy')
print(np.mean(scores))


0.610269774222


### Logistic Regression with sag solver

In [208]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
X_new = SelectKBest(f_classif, k=7).fit_transform(x[0], x[1])  # keep the 7 features with the highest ANOVA F-scores

# The sag solver is typically faster than newton-cg on large datasets.
# As with newton-cg, a multinomial model is fitted by default for a multiclass target (scikit-learn >= 0.22).
clf = LogisticRegression(solver='sag')
scores = cross_val_score(clf, X_new, x[1], cv=10, scoring='accuracy')
print(np.mean(scores))

0.610269774222


### Decision tree algorithm

In [209]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=2)
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
X_new = SelectKBest(f_classif, k=7).fit_transform(x[0], x[1])

scores = cross_val_score(clf,X_new, x[1],cv=5,scoring='accuracy')


#print(scores)
print(np.mean(scores))

0.548089581377


### Multinomial Naive Bayes

In [199]:
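from sklearn import preprocessing
x_scaled1 = preprocessing.normalize(data)  # rescale each row (sample) to unit L2 norm; non-negative inputs stay non-negative, as chi2 and MultinomialNB require
g1 = final.output.values  # target labels as a NumPy array (as_matrix() was removed in pandas 1.0)
x = []
x.append(x_scaled1)
x.append(g1)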
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X_new = SelectKBest(chi2, k=10).fit_transform(x[0], x[1])

from sklearn.naive_bayes import MultinomialNB
clf= MultinomialNB(alpha=0.1)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf,X_new, x[1],cv=10,scoring='accuracy')
#print(scores)
print(np.mean(scores))

0.450130229583
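The multinomial naive Bayes cell uses `preprocessing.normalize`, while the earlier models use `preprocessing.scale`. The two do different things, which may partly explain why the scores are not directly comparable. A small illustration on a made-up array:

```python
import numpy as np
from sklearn import preprocessing

# Three made-up samples with two features each.
X_demo = np.array([[3.0, 4.0],
                   [6.0, 8.0],
                   [0.0, 5.0]])

# normalize works row by row: every sample is divided by its own L2 norm.
row_normalized = preprocessing.normalize(X_demo)

# scale works column by column: every feature gets mean 0 and std 1.
col_standardized = preprocessing.scale(X_demo)

print(row_normalized)
print(col_standardized.mean(axis=0), col_standardized.std(axis=0))
```

After row normalization the first two samples become identical, because only the direction of each row is kept and its magnitude is discarded. Column standardization keeps the differences between samples but produces negative values, which `MultinomialNB` and `chi2` do not accept.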


In [195]:
data.head()

   Total runs scored  Runs in match  Runs_in_this_venue Powerplay  \
0                1.0            1.0                 1.0      True
1                0.0            0.0                 0.0      True
2                2.0            2.0                 2.0      True
3                2.0            2.0                 2.0      True
4                2.0            2.0                 2.0      True

   TotalWickets  TotalWickets_thisVenue  RunsAgainstthisBowler  RAB_venue  \
0             0                       0                    1.0        1.0
1             0                       0                    0.0        0.0
2             0                       0                    2.0        2.0
3             0                       0                    2.0        2.0
4             0                       0                    2.0        2.0

   StrikeRate       SAB  SAB_venue  SR_match  Number_of_wickets_lost
0    1.000000  1.000000   1.000000  1.000000                       0
1    0.000000  0.000000   0.000000  0.000000                       0
2    1.000000  1.000000   1.000000  1.000000                       0
3    0.666667  0.666667   0.666667  0.666667                       0
4    0.500000  0.500000   0.500000  0.500000                       0


## The best features in our analysis are:
## Total runs scored and Runs_in_this_venue.
### Because this is a multiclass classification problem, we cannot simply label a feature as positive or negative.
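With five outcome classes, a single accuracy number hides how well each class is predicted. One way to look at this, sketched below on synthetic data, is a confusion matrix and a per-class report built from cross-validated predictions. The data and the mapping of codes to names are assumptions for illustration; the names follow the labels used in the bar charts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the features and the 0-4 outcome labels.
rng = np.random.default_rng(1)
y_demo = rng.integers(0, 5, size=500)
X_demo = rng.normal(size=(500, 3))
X_demo[:, 0] += y_demo  # make one feature informative

labels = [0, 1, 2, 3, 4]
names = ["Extras", "Wicket", "Defensed", "Runs", "Boundaries"]

# Each sample is predicted by a model that did not see it during training.
pred = cross_val_predict(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_demo, pred, labels=labels)
print(cm)
print(classification_report(y_demo, pred, labels=labels,
                            target_names=names, zero_division=0))
```

Rare classes such as wickets can have very low recall even when overall accuracy looks acceptable, and the per-class report makes that visible.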


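# Based on our analysis, logistic regression is the best fit for the data: its mean cross-validated accuracy was about 0.61, compared with 0.56 for Gaussian naive Bayes, 0.55 for the decision tree and 0.45 for multinomial naive Bayes.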
