# Introduction

In this notebook we perform an exploratory data analysis (EDA) of ESRB ratings and develop an ensemble model for predicting them. Before building the ensemble, we also train several scikit-learn classifiers and compare their performance on the dataset.

# What is the ESRB and what are its rating categories?
ESRB stands for Entertainment Software Rating Board. It is an American self-regulatory organization that assigns
content ratings to consumer video games. It was established in 1994 in response to criticism of controversial video
games such as Mortal Kombat ("Fatality", hmm...). The ESRB has the following labels:
* RP => Rating Pending (1994-present) This symbol is used in promotional materials for games that have not yet been assigned a final rating by the ESRB.
* EC => Early Childhood (1994-2018) Games with this rating contain content aimed at a preschool audience. They do not contain content that parents would find objectionable for this audience. The rating was retired in 2018 because few titles used it, and titles that carried it are now rated E.
* E => Everyone (1994-present) Games with this rating contain content which the ESRB believes is "generally suitable for all ages". They can contain content such as infrequent use of "mild"/cartoon violence and mild language. This rating was known as Kids to Adults (K-A) until 1998, when it was renamed "Everyone".
* E10+ => Everyone 10+ (2005-present) Games with this rating contain content which the ESRB believes is generally suitable for those aged 10 years and older. They can contain content with an impact higher than the "Everyone" rating can accommodate, but still not as high as to warrant a "Teen" rating, such as a larger amount of violence, mild language, crude humor, or suggestive content.
* T => Teen (1994-present) Games with this rating contain content which the ESRB believes is generally suitable for those aged 13 years and older; they can contain content such as moderate amounts of violence (including small amounts of blood), mild to moderate use of language or suggestive themes, sexual content, partial nudity and crude humor.
* M => Mature 17+ (1994-present) Games with this rating contain content which the ESRB believes is generally suitable for those aged 17 years and older; they can contain content with an impact higher than the "Teen" rating can accommodate, such as intense and/or realistic portrayals of violence (including blood, gore, mutilation, and depictions of death), strong sexual themes and content, nudity, and more frequent use of strong language.
* AO => Adults Only 18+ (1994-present) Games with this rating contain content which the ESRB believes is only suitable for those aged 18 years and older; they contain content with an impact higher than the "Mature" rating can accommodate, such as graphic sexual themes and content, extreme portrayals of violence, or unsimulated gambling with real currency. The majority of AO-rated titles are pornographic adult video games; the ESRB has seldom issued the AO rating solely for violence.


In [None]:
# Let's import some essential libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Let's load the data into dataframes
trainFrame = pd.read_csv('/kaggle/input/video-games-rating-by-esrb/Video_games_esrb_rating.csv')
testFrame  = pd.read_csv('/kaggle/input/video-games-rating-by-esrb/test_esrb.csv')

In [None]:
# Sanity check: preview the first 10 rows of the training frame
trainFrame.head(10)

In [None]:
# Let's do some EDA
# Plotting the distribution of games per platform

psExclusives = len(trainFrame[trainFrame['console'] == 0])
xboxExclusives = len(trainFrame[trainFrame['console'] == 1])
availableOnBoth = len(trainFrame[trainFrame['console'] == 2])
other = len(trainFrame[trainFrame['console'] == 3])

nameList = ['PlayStation Exclusives', 'Xbox Exclusives', 'Available on both', 'Others']
nameValues = [psExclusives, xboxExclusives, availableOnBoth, other]

plt.figure(figsize=(17, 5))
plt.subplot(1, 2, 1)
plt.bar(nameList, nameValues, color='green')
plt.title('Bar graph: Platform-wise game distribution [Train set]')
plt.grid(True)

plt.subplot(1, 2, 2)
plt.pie(nameValues, labels=nameList, autopct='%1.2f%%')
plt.title('Pie chart: Platform-wise game distribution [Train set]')
plt.show()

In [None]:
# Doing the same for the test set
psExclusives = len(testFrame[testFrame['console'] == 0])
xboxExclusives = len(testFrame[testFrame['console'] == 1])
availableOnBoth = len(testFrame[testFrame['console'] == 2])
other = len(testFrame[testFrame['console'] == 3])

nameList = ['PlayStation Exclusives', 'Xbox Exclusives', 'Available on both', 'Others']
nameValues = [psExclusives, xboxExclusives, availableOnBoth, other]

plt.figure(figsize=(17, 5))
plt.subplot(1, 2, 1)
plt.bar(nameList, nameValues, color='green')
plt.title('Bar graph: Platform-wise game distribution [Test set]')
plt.grid(True)

plt.subplot(1, 2, 2)
plt.pie(nameValues, labels=nameList, autopct='%1.2f%%')
plt.title('Pie chart: Platform-wise game distribution [Test set]')
plt.show()

Note: The given dataset appears to contain no Xbox exclusives and no games for other platforms.
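
A quick way to double-check an observation like this is to count the rows per `console` code directly. The sketch below runs on a small made-up frame (`toyFrame` and its values are hypothetical, not taken from the dataset) and assumes the 0 to 3 encoding used in the cells above.

```python
import pandas as pd

# Hypothetical toy data standing in for trainFrame; the values are made up.
toyFrame = pd.DataFrame({'console': [0, 0, 2, 0, 2, 2, 0]})

# Count rows per console code, then reindex so that codes with no rows
# appear explicitly with a count of 0 instead of being silently absent.
consoleCounts = (
    toyFrame['console']
    .value_counts()
    .reindex([0, 1, 2, 3], fill_value=0)
)
print(consoleCounts)
```

Replacing `toyFrame` with `trainFrame` or `testFrame` should show at a glance which codes are actually present.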

In [None]:
# Now let's do the same for the label distribution, this time
# wrapped in a reusable function

def DistributionPlotter(dataFrame, featureName, setName):
    # Count occurrences of each unique value, sorted by label for a stable order
    counts = dataFrame[featureName].value_counts().sort_index()
    labels = [str(label) for label in counts.index]

    plt.figure(figsize=(17, 5))
    plt.subplot(1, 2, 1)
    plt.bar(labels, counts.values, color='orange')
    plt.title('Bar graph: ' + str(setName))
    plt.grid(True)

    plt.subplot(1, 2, 2)
    plt.pie(counts.values, labels=labels, autopct='%1.2f%%')
    plt.title('Pie chart: ' + str(setName))
    plt.show()
    

In [None]:
DistributionPlotter(trainFrame, 'esrb_rating', 'Rating distribution [Train set]')

In [None]:
DistributionPlotter(testFrame, 'esrb_rating', 'Rating distribution [Test set]')

It also appears that only four ratings are present in the given dataset: E, ET (Everyone 10+), T and M (Mature).
Let's try to do two more things:
* Find the correlation between the features and the ratings.
* Visualize the data in 2D space.

To do this we must first map the rating labels to numerical values. I will use the following mapping, ordered by rating severity:

{ 'E' : 0,
  'ET': 1,
  'T' : 2,
  'M' : 3,
}
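
One thing worth checking with a dictionary mapping like this is whether every label in the data is covered, because `Series.map` produces `NaN` for any label missing from the dictionary. A minimal sketch on made-up labels (`ratings` and `ratingMap` are hypothetical names, and `'AO'` is included only to show an uncovered label):

```python
import pandas as pd

# Hypothetical labels; 'AO' is deliberately absent from the mapping below.
ratings = pd.Series(['E', 'ET', 'T', 'M', 'AO'])
ratingMap = {'E': 0, 'ET': 1, 'T': 2, 'M': 3}

encoded = ratings.map(ratingMap)

# Series.map returns NaN for any label missing from the dictionary,
# so collecting those rows reveals labels the mapping does not cover.
unmapped = ratings[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list here would mean the mapping covers every label in the column.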

In [None]:
mapp = { 'E' : 0,
  'ET': 1,
  'T' : 2,
  'M' : 3,
}

testFrame['esrb_rating'] = testFrame['esrb_rating'].map(mapp)
trainFrame['esrb_rating'] = trainFrame['esrb_rating'].map(mapp)

In [None]:
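# Note: the column list is taken from trainFrame, but the correlations
# themselves are computed on testFrame
plt.figure(figsize=(19, 11))
featuresRequired = trainFrame.columns[2:]
sb.heatmap(testFrame[featuresRequired].corr(), annot=True)
plt.show()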

Some conclusions that can be drawn from the heatmap:
* The features `blood_gore` and `blood` have correlations of 0.5 and 0.41 with the rating, respectively: the more of this content a game has, the more severe its rating tends to be.
* The feature `violence` has a correlation of 0.47 with the rating and follows the same pattern.

Pretty much what we expected.
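
Reading individual values off a large heatmap is error-prone, so it may help to list the correlations with the target as a sorted column. This is only a sketch on a made-up frame: `toyFrame` and the `other_descriptor` column are hypothetical, and the numbers it prints are not the ones quoted above.

```python
import pandas as pd

# Hypothetical toy frame: two binary descriptors and an encoded rating.
toyFrame = pd.DataFrame({
    'blood':            [0, 0, 1, 1, 0, 1],
    'other_descriptor': [1, 0, 0, 1, 0, 0],
    'esrb_rating':      [0, 0, 2, 3, 1, 3],
})

# Pearson correlation of every feature with the target, strongest first.
targetCorr = (
    toyFrame.corr()['esrb_rating']
    .drop('esrb_rating')
    .sort_values(ascending=False)
)
print(targetCorr)
```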


In [None]:
# Moving on to task 2: visualizing the ratings in a reduced (2D) space
from sklearn.preprocessing import StandardScaler

featuresRequired = trainFrame.columns[2:-1]
featureVector = np.array(trainFrame[featuresRequired])

# Standardize the features before applying PCA
stdSc = StandardScaler()
featureVector = stdSc.fit_transform(featureVector)

# Project onto the first two principal components
pca = PCA(n_components=2)
featureVector = pca.fit_transform(featureVector)

dimReducedDataFrame = pd.DataFrame(featureVector)
dimReducedDataFrame['targets'] = trainFrame['esrb_rating'].map({0: 'E', 1: 'ET', 2: 'T', 3: 'M'})
dimReducedDataFrame = dimReducedDataFrame.rename(columns={0: 'V1', 1: 'V2'})

plt.figure(figsize=(10, 10))
sb.scatterplot(data=dimReducedDataFrame, x='V1', y='V2', hue='targets')
plt.grid(True)
plt.show()


In [None]:
# Doing the same for the test set:
# visualizing the ratings in a reduced (2D) space
from sklearn.preprocessing import StandardScaler

featuresRequired = testFrame.columns[2:-1]
featureVector = np.array(testFrame[featuresRequired])

# Standardize the features before applying PCA
stdSc = StandardScaler()
featureVector = stdSc.fit_transform(featureVector)

# Project onto the first two principal components
pca = PCA(n_components=2)
featureVector = pca.fit_transform(featureVector)

dimReducedDataFrame = pd.DataFrame(featureVector)
dimReducedDataFrame['targets'] = testFrame['esrb_rating'].map({0: 'E', 1: 'ET', 2: 'T', 3: 'M'})
dimReducedDataFrame = dimReducedDataFrame.rename(columns={0: 'V1', 1: 'V2'})

plt.figure(figsize=(10, 10))
sb.scatterplot(data=dimReducedDataFrame, x='V1', y='V2', hue='targets')
plt.grid(True)
plt.show()

The two figures above suggest that the classes are at least partly separable, i.e. that a decision boundary exists,
so we can expect our algorithms to work reasonably well!
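
How much a 2D projection can be trusted depends on how much of the total variance the two components retain. A rough way to check this is `explained_variance_ratio_` on the fitted PCA object. The sketch below uses random binary stand-in data (`toyFeatures` is hypothetical), so its numbers say nothing about the ESRB dataset itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical binary feature matrix standing in for the descriptor columns.
rng = np.random.default_rng(0)
toyFeatures = rng.integers(0, 2, size=(200, 10)).astype(float)

scaled = StandardScaler().fit_transform(toyFeatures)
pca = PCA(n_components=2).fit(scaled)

# Fraction of the total variance captured by each of the two components.
retained = pca.explained_variance_ratio_
print(retained, retained.sum())
```

If the sum is small, classes that overlap in the scatter plot may still be separable in the full feature space, and vice versa.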

In [None]:
# Let's define a utility function for visualizing the performance of our classifiers
from sklearn.metrics import accuracy_score

def Vizutil(trueVals, predictedVals, classifierName):
    acc = accuracy_score(trueVals, predictedVals)

    plt.figure(figsize=(7, 7))
    plt.scatter(trueVals, predictedVals, color='green')
    # Reference diagonal: correct predictions fall on this line
    plt.plot([0, 1, 2, 3], [0, 1, 2, 3], color='blue')
    plt.xlabel('True Values')
    plt.ylabel('Predicted Values')
    plt.grid(True)
    plt.title('Performance of ' + str(classifierName) + ' | Acc: ' + str(round(acc * 100, 2)) + '%')
    plt.show()
    

In [None]:
# Now we separate features and labels for the train and test sets
xTrain = np.array(trainFrame[trainFrame.columns[2:-1]])
xTest = np.array(testFrame[testFrame.columns[2:-1]])

yTrain = np.array(trainFrame[trainFrame.columns[-1]])
yTest = np.array(testFrame[testFrame.columns[-1]])

# Note that we don't need to apply standard scaling here,
# as the features are binary in nature

In [None]:
# Classifier 1.) Decision Tree classifier

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(xTrain, yTrain)

predVal1 = dtc.predict(xTest)
Vizutil(yTest, predVal1, 'Decision Tree Classifier')




In [None]:
# Classifier 2.) Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(xTrain, yTrain)

predVal2 = rfc.predict(xTest)
Vizutil(yTest, predVal2, 'Random Forest Classifier')

In [None]:
# Classifier 3.) Gradient Boosting classifier
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(xTrain, yTrain)

predVal3 = gbc.predict(xTest)
Vizutil(yTest, predVal3, 'Gradient Boosting Classifier')

In [None]:
# Classifier 4.) bagging classifier
from sklearn.ensemble import BaggingClassifier
bgc = BaggingClassifier()

bgc.fit(xTrain, yTrain)
predVal4 = bgc.predict(xTest)
Vizutil(yTest, predVal4, 'Bagging Classifier')

In [None]:
# Classifier 5.) Stacked (ensembled) classifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

estimators = [
    ('dt', DecisionTreeClassifier()),
    ('rf', RandomForestClassifier()),
    ('bg', BaggingClassifier()),
    ('rf2', RandomForestClassifier(n_estimators = 100, max_depth= 12))
    
]

clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
clf.fit(xTrain, yTrain)

predVal5 = clf.predict(xTest)
Vizutil(yTest, predVal5, 'Stacked/Ensembled Classifier')

From the runs above we can see that:

* The Decision Tree has an accuracy of 83% on the test set.
* The Random Forest has an accuracy of 85.4% on the test set.
* The Gradient Boosting classifier has an accuracy of 79.6% on the test set (this is unusual, as gradient boosting often outperforms the two models above).
* The Bagging classifier has an accuracy of 83.8% on the test set.
* The ensembled classifier has an accuracy of 85.6% on the test set.
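
A single accuracy figure does not show which ratings get confused with each other. A confusion matrix could complement the scatter plots here. The sketch below uses made-up labels (`yTrue` and `yPred` are hypothetical, not outputs of the models above) with the same 0 to 3 encoding.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels using the 0-3 encoding (E, ET, T, M).
yTrue = np.array([0, 0, 1, 1, 2, 2, 3, 3])
yPred = np.array([0, 1, 1, 1, 2, 3, 3, 3])

# Rows are true classes, columns are predicted classes.
confMat = confusion_matrix(yTrue, yPred, labels=[0, 1, 2, 3])
acc = accuracy_score(yTrue, yPred)
print(confMat)
print(acc)
```

Off-diagonal entries show where the errors fall. In this toy example, every mistake lands on the next more severe rating.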


Todo: implement a TensorFlow model. I expect its accuracy to be around 80%, though,
because neural networks tend to perform similarly to gradient boosting classifiers on this kind of data.

Well, if you have read this far, thanks for reading!