# Hello everyone!

In this notebook we will analyze a dataset of video games. For each game, the data frame contains many content-feature columns (for example, “Is there blood in this game?” or “Are there murders in this game?”) together with the game's ESRB rating. In my opinion this is very interesting data, because it shows, for example, how many video games have each rating. So, let's start.

# 1) What is ESRB rating?

The Entertainment Software Rating Board (ESRB) is an American self-regulatory organization that assigns age and content ratings to consumer video games. The board assigns ratings to games based on their content, using judgment similar to the motion picture rating systems used in many countries, using a combination of six age-based levels intended to aid consumers in determining a game's content and suitability, along with a system of "content descriptors" which detail specific types of content present in a particular game. The ratings are determined by a combination of material provided by the game's publisher in both questionnaires and video footage of the game, and a review of this material by a panel of reviewers who assign it a rating. The ratings are designed towards parents so they can make informed decisions about purchasing games for their children. Once a game is rated, the ESRB maintains a code of ethics for the advertising and promotion of video games - ensuring that marketing materials for games are targeted to appropriate audiences. (https://en.wikipedia.org/wiki/Entertainment_Software_Rating_Board)

Now that we know what the ESRB is, we can look at what letters like E 10+, A, RP, and so on mean. For that, let's look at the table that the author of this dataset (https://www.kaggle.com/imohtn) provided. ![image.png](attachment:image.png)

Here we can see all the ESRB rating letters and what they mean. For example, E stands for Everyone: anyone can play the game.

# 2) Import data and libraries.

After the theory we can move on to practice. First, let's import all the libraries we need. Second, let's load the data.

In [None]:
import os

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# load data
main_data = pd.read_csv("/kaggle/input/video-games-rating-by-esrb/Video_games_esrb_rating.csv")

# 3) A quick look at the data.

Let's see the head of our dataframe and the list of columns.

In [None]:
# quick look: size of the dataframe, column names, first 5 rows
print(f"The dataset has {main_data.shape[0]} rows")
print(f"and {main_data.shape[1]} columns.")

print("List of columns: ", main_data.columns)

main_data.head(5)

Here we can see that this dataframe has 1895 rows and 34 columns, so there are a lot of features, for example blood_and_gore, intense_violence, and others. All of these columns contain only the numbers 0 and 1. The author of the dataset explains that 0 means the game does not have this feature and 1 means it does. There is also a console column, which describes the platforms the game was released on. It contains 0 and 1 too, but with a different meaning: 0 = PS4, 1 = PS4 & Xbox One. Now we understand what the columns in this data mean.
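The claim that every feature column holds only 0 and 1 can be checked directly instead of by eye. Here is a minimal sketch on a tiny made-up frame: the `demo` data is hypothetical, only the column names are borrowed from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only
demo = pd.DataFrame({
    "title": ["Game A", "Game B", "Game C"],
    "console": [0, 1, 1],
    "blood": [0, 1, 0],
    "intense_violence": [1, 0, 0],
    "esrb_rating": ["E", "M", "T"],
})

# Every column except the title and the target should be binary
feature_cols = demo.columns.drop(["title", "esrb_rating"])
is_binary = demo[feature_cols].isin([0, 1]).all()

print(is_binary)
```

On the real data, the same two lines could be run on `main_data`; any column showing `False` would need a closer look.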

After that we can check for NaN and null values.

In [None]:
# fraction of missing values per column
# (isnull() is an alias of isna(), so one check covers both NaN and null)
fraction_missing = main_data.isna().sum() / main_data.shape[0]
print(fraction_missing)

Cool. There are no null or NaN values, thanks to the author.

# 4) Visualizing data.

Well, in my opinion this is the most interesting part of data analysis.
I think the best way to explore the data is to go through the columns in order. So, let's start. The first column is the game title, but I don't think we can visualize anything there. The second column is about consoles, so we can do something here.

In [None]:
# bar chart of game counts per console group
plt.figure(figsize=(8, 8))

consoles_labels = ["Games on PS", "Games on both consoles"]

playstation = main_data[main_data["console"] == 0]    # 0 = PS4
both_consoles = main_data[main_data["console"] == 1]  # 1 = PS4 & Xbox One

consoles_counts = [len(playstation), len(both_consoles)]

plt.bar(consoles_labels, consoles_counts, color = "green")
plt.title("Count of games on different consoles")

# pie chart of the same counts
plt.figure(figsize=(19, 5))
plt.pie(consoles_counts, labels = consoles_labels, shadow = True, colors = list("rgbyc"))
plt.title("Count of games on different consoles")

Here we can see that:

• There are a little fewer than 1000 games on PlayStation only, and only about 900 on both platforms.

• This means that there are more PlayStation exclusives than games on both platforms.

• The ratio of PlayStation-only games to games on both platforms is approximately 51% to 49%.
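The counts and the ratio can also be computed rather than read off the charts. Here is a small sketch using `value_counts`; the `console` series below is made up for illustration and the label strings are my own.

```python
import pandas as pd

# Hypothetical console column: 0 = PS4 only, 1 = PS4 and Xbox One
console = pd.Series([0, 0, 0, 1, 1], name="console")

counts = console.value_counts().sort_index()
shares = console.value_counts(normalize=True).sort_index()

# Put counts and shares side by side with readable labels
summary = pd.DataFrame({"games": counts, "share": shares.round(3)})
summary.index = summary.index.map({0: "PS4 only", 1: "PS4 and Xbox One"})
print(summary)
```

Replacing `console` with `main_data["console"]` should give the exact numbers behind the bullets above.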

The third and every following column says whether a game includes a certain kind of content. That's why I think we can first calculate the correlation between these columns and the rating, find out which columns are most important, and then visualize those columns one by one. We will use Pearson's correlation. If you would like to learn more about correlation, you can check my previous notebook (link: https://www.kaggle.com/artemborzenko/calculating-the-correlation-of-a-youtube-dataset)
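To make Pearson's correlation less of a black box, here is a minimal sketch that computes it by hand and compares the result with NumPy. The two arrays are made-up values (a binary content flag and an ordinal-encoded rating), and `pearson` is a helper name introduced only for this example.

```python
import numpy as np

# Hypothetical data: a binary content flag and an ordinal-encoded rating
flag = np.array([0, 0, 1, 1, 1, 0])
rating = np.array([0, 1, 2, 3, 3, 0])

def pearson(x, y):
    """Pearson's r: covariance divided by the product of standard deviations."""
    x = x - x.mean()
    y = y - y.mean()
    return (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

r_manual = pearson(flag, rating)
r_numpy = np.corrcoef(flag, rating)[0, 1]
print(r_manual, r_numpy)
```

Both values should agree, and a value close to 1 means the flag tends to appear together with higher ratings.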

In [None]:
# correlation heatmap
plt.figure(figsize=(15, 12))

# drop columns that are not content features
useless_cols = ["title", "console"]
corr_data = main_data.drop(useless_cols, axis = 1)

# encode the rating as an ordinal number so it can be correlated
encode = {'E' : 0,
          'ET': 1,
          'T' : 2,
          'M' : 3}

corr_data["esrb_rating"] = corr_data["esrb_rating"].map(encode)

# Pearson correlation is the default method of DataFrame.corr()
main_correlation = corr_data.corr()
sns.heatmap(main_correlation, annot = False, cmap = "Greens")

Here we can see that:

• Columns like Blood, Blood and Gore, and Strong Language have a strong positive correlation with the ESRB rating.

• But nudity and the use of drugs and alcohol have a really low correlation.
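A heatmap with more than 30 columns is hard to read precisely, so it can help to sort the features by their correlation with the rating. This is a sketch on a hypothetical miniature frame (real column names, made-up values), using the same encoding as in the notebook.

```python
import pandas as pd

# Hypothetical miniature frame; real column names, made-up values
demo = pd.DataFrame({
    "blood": [0, 0, 1, 1, 1, 0],
    "nudity": [0, 1, 0, 0, 1, 0],
    "esrb_rating": ["E", "ET", "T", "M", "M", "E"],
})

encode = {"E": 0, "ET": 1, "T": 2, "M": 3}
demo["esrb_rating"] = demo["esrb_rating"].map(encode)

# Correlation of every feature with the encoded rating, strongest first
target_corr = (
    demo.corr()["esrb_rating"]
    .drop("esrb_rating")
    .sort_values(ascending=False)
)
print(target_corr)
```

Applied to `corr_data`, the top of this list would name the columns worth plotting.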

Well, we found out which feature columns are important, so we can visualize them.
Let's start with the Blood column:

In [None]:
# about blood in video games
plt.figure(figsize=(8, 8))
# labels follow the order of the counts below: first 0 (no blood), then 1 (blood)
blood_labels = ["No blood", "There is blood"]

blood_0 = main_data[main_data["blood"] == 0]
blood_1 = main_data[main_data["blood"] == 1]

blood_names = [len(blood_0), len(blood_1)]

plt.bar(blood_labels, blood_names, color = "green")
plt.title("Is there blood?")

plt.figure(figsize=(19, 5))
plt.pie(blood_names, labels = blood_labels, shadow = True, colors = "rgbyc")
plt.title("Is there blood?")

Here we can see that:

• There are more games without blood scenes than with them.

• More than 1400 games in this dataframe have no blood scenes.

Blood and Gore:

In [None]:
# blood and gore column
# size
plt.figure(figsize=(8, 8))
# labels follow the order of the counts below: first 0 (absent), then 1 (present)
blood_and_gore_labels = ["No blood and gore", "There is blood and gore"]

blood_and_gore_0 = main_data[main_data["blood_and_gore"] == 0]
blood_and_gore_1 = main_data[main_data["blood_and_gore"] == 1]

blood_and_gore_names = [len(blood_and_gore_0), len(blood_and_gore_1)]

plt.bar(blood_and_gore_labels, blood_and_gore_names, color = "green")
plt.title("Is there blood and gore?")

plt.figure(figsize=(19, 5))
plt.pie(blood_and_gore_names, labels = blood_and_gore_labels,  colors = "rgbyc")
plt.title("Is there blood and gore?")

Here we can see that:

• There are more games without blood and gore scenes than with them.

• More than 1600 games in this data frame have no blood and gore scenes.

And what about Strong Language:

In [None]:
# about strong language
# size
plt.figure(figsize=(8, 8))
# labels follow the order of the counts below: first 0 (absent), then 1 (present)
strong_lg_labels = ["No strong language", "There is strong language"]

# note: the column name is spelled "strong_janguage" in this notebook's data,
# so the spelling is kept to match the column
strong_lg_0 = main_data[main_data["strong_janguage"] == 0]
strong_lg_1 = main_data[main_data["strong_janguage"] == 1]

strong_lg_names = [len(strong_lg_0), len(strong_lg_1)]

plt.bar(strong_lg_labels, strong_lg_names, color = "green")
plt.title("Is there strong language?")

plt.figure(figsize=(19, 5))
plt.pie(strong_lg_names, labels = strong_lg_labels, shadow = True, colors = "rgbyc")
plt.title("Is there strong language?")

Here we can see that:

• There are more games without strong language than with it.

• More than 1600 games in this data frame have no strong language.
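The three cells above repeat the same count-and-label steps by hand, which makes it easy for labels and counts to drift out of order. One possible way to keep them aligned is a small helper; `flag_counts` and the `demo` frame are hypothetical names introduced here for illustration.

```python
import pandas as pd

def flag_counts(df, column):
    """Count games without (0) and with (1) a content flag, labelled explicitly."""
    counts = df[column].value_counts().reindex([0, 1], fill_value=0)
    counts.index = ["No", "Yes"]
    return counts

# Hypothetical miniature frame, for illustration only
demo = pd.DataFrame({"blood": [0, 0, 0, 1], "blood_and_gore": [0, 0, 0, 0]})

for col in ["blood", "blood_and_gore"]:
    print(col, flag_counts(demo, col).to_dict())
```

The returned Series carries its own labels, so its index and values can be passed to `plt.bar` or `plt.pie` together without a separate label list.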

Well, let's move on to the rating column. If we look carefully at this column, we can see that there are only 4 unique values: E, M, T, and ET. That's why I will prepare lists only for these four values.

In [None]:
# ratings
# size
plt.figure(figsize=(8, 8))
rating_labels = ["E", "ET", "T", "M"]

E_rating = main_data[main_data["esrb_rating"] == "E"]
ET_rating = main_data[main_data["esrb_rating"] == "ET"]
T_rating = main_data[main_data["esrb_rating"] == "T"]
M_rating = main_data[main_data["esrb_rating"] == "M"]

rating_names = [len(E_rating), len(ET_rating), len(T_rating), len(M_rating)]

plt.bar(rating_labels, rating_names, color = "green")
plt.title("Rating distribution")

plt.figure(figsize=(19, 5))
plt.pie(rating_names, labels = rating_labels, shadow = True, colors = "rgbyc")
plt.title("Rating distribution")

Here we can see that:

• There are a lot of games with rating T, but not as many games with ratings M and ET.

• Also, we can see that a bit more than a third of all games have rating T (~690), games with E and ET ratings number about 400 to 410 each, and games for mature audiences are only about 390.
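As a quick arithmetic check, the approximate counts quoted in the bullets above can be turned into shares. The numbers are the rounded values from the text (assigning 410 to E and 400 to ET is an assumption), so the result is only indicative.

```python
# Approximate counts quoted in this notebook (rounded, read off the chart)
approx_counts = {"T": 690, "E": 410, "ET": 400, "M": 390}

total = sum(approx_counts.values())

# Share of each rating in percent, rounded to one decimal place
shares = {rating: round(100 * n / total, 1) for rating, n in approx_counts.items()}
print(total, shares)
```

With these rounded counts, T comes out at roughly 36.5% of the games, and the other three ratings at about a fifth each.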

# 5) Conclusion of the analysis.

In conclusion I can say that we did a good job. We analyzed this data and found that the correlation between the ESRB rating and columns like strong language or blood and gore is strongly positive, while the correlation with, for example, nudity is not really strong (according to the information in this data frame). We also visualized information from many columns and learned a lot of interesting things.

# Prediction part.

Our task is to predict the ESRB ratings of the games in the test data. After analyzing the data, we understand that we can use classifier models to predict them.

# 1) Prepare data.

Let's reload our data into a train data frame and a test data frame.
After that, we can split each of them into features and target. We won't use train_test_split for this, because the test data already comes as a separate file and the column slicing is easy to do with NumPy.

In [None]:
# load data
train_df = pd.read_csv("/kaggle/input/video-games-rating-by-esrb/Video_games_esrb_rating.csv")
test_df = pd.read_csv("/kaggle/input/video-games-rating-by-esrb/test_esrb.csv")

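# split the data into features (x) and target (y):
# columns[2:-1] skips the first two columns (title, console) and the last one,
# columns[-1] is the target column (esrb_rating)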
x_train = np.array(train_df[train_df.columns[2:-1]])
y_train = np.array(train_df[train_df.columns[-1]])

x_test = np.array(test_df[test_df.columns[2:-1]])
y_test = np.array(test_df[test_df.columns[-1]])
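The positional slice above depends on the column order. A name-based split gives the same arrays and is a little more robust. This is only a sketch on a hypothetical two-row frame with the same layout (title, console, flags, target); the values are made up.

```python
import pandas as pd

# Hypothetical miniature frame with the same layout as the dataset
demo = pd.DataFrame({
    "title": ["Game A", "Game B"],
    "console": [0, 1],
    "blood": [0, 1],
    "violence": [1, 1],
    "esrb_rating": ["E", "M"],
})

# Positional slice as in the notebook: skip the first two columns and the last one
x_pos = demo[demo.columns[2:-1]].to_numpy()

# Name-based equivalent, which survives column reordering
x_named = demo.drop(columns=["title", "console", "esrb_rating"]).to_numpy()
y = demo["esrb_rating"].to_numpy()

print(x_named.shape, y)
```

Both approaches select the same feature matrix here; the named version just states explicitly which columns are left out.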

# 2) Modelling

In [None]:
#metric
from sklearn.metrics import accuracy_score

In [None]:
# 1)
from sklearn.tree import DecisionTreeClassifier

# DecisionTreeClassifier version 1 model
dct1_model = DecisionTreeClassifier()

dct1_model.fit(x_train, y_train)
dct1_pred = dct1_model.predict(x_test)

dct1_pred_score_1 = accuracy_score(y_test, dct1_pred)

print("Prediction list of first dct: ", dct1_pred[:5])
print("Score of first dct: ", dct1_pred_score_1)

The first model is a simple Decision Tree Classifier.
It has a decent accuracy score: 0.82. As we can see, the Decision Tree Classifier works really well on this dataset. That's why we will check the Random Forest Classifier score next.
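A score from a single run of an unseeded tree can shift slightly between runs. One common way to get a steadier estimate is to fix `random_state` and use cross-validation. The sketch below uses synthetic binary features (`X` and `y` are made up, not the ESRB data), so the printed score says nothing about this dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 5 binary features, target built from the first two
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0] + X[:, 1]  # three classes: 0, 1, 2

# A fixed random_state makes the tree reproducible between runs
model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation returns one accuracy value per fold
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

On the notebook's data, the same call with `x_train` and `y_train` would show how much the accuracy varies across folds.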

In [None]:
# 2)
from sklearn.ensemble import RandomForestClassifier

# RandomForestClassifier version 1 model
rfc_1_model = RandomForestClassifier()

rfc_1_model.fit(x_train, y_train)
rfc_1_pred = rfc_1_model.predict(x_test)

rfc_1_pred_score_1 = accuracy_score(y_test, rfc_1_pred)

print("Prediction list of rfc: ", rfc_1_pred[:5])
print("Score of rfc: ", rfc_1_pred_score_1)

Random Forest Classifier is better than Decision Tree here, because it has an accuracy score of 0.85.
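A fitted random forest also reports which features it relied on, which can be compared with the correlation analysis. Below is a minimal sketch on synthetic data: the column names are illustrative, and `noise_flag` is a made-up column unrelated to the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with illustrative column names
rng = np.random.default_rng(0)
cols = ["blood", "strong_language", "noise_flag"]
X = pd.DataFrame(rng.integers(0, 2, size=(300, 3)), columns=cols)
y = X["blood"] + X["strong_language"]  # target ignores noise_flag

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; higher means the feature was used more
importances = pd.Series(model.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the notebook's model, `rfc_1_model.feature_importances_` paired with `train_df.columns[2:-1]` would give the same kind of ranking.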

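Finally, we can stack these models and see what the result will be. By default, scikit-learn's StackingClassifier combines the predictions of the base models with a logistic regression as the final estimator.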

In [None]:
# 3)
from sklearn.ensemble import StackingClassifier

estimators = [('dt', DecisionTreeClassifier()),
              ('rf', RandomForestClassifier())]

stc1_model = StackingClassifier(estimators=estimators)

stc1_model.fit(x_train, y_train)
stc1_pred = stc1_model.predict(x_test)

stc_1_pred_score_1 = accuracy_score(y_test, stc1_pred)

print("Prediction list of Stack classifier model: ", stc1_pred[:5])
print("Score of Stack classifier model: ", stc_1_pred_score_1)
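Accuracy is a single number and does not show which ratings get confused with each other. A confusion matrix gives that breakdown. Here is a minimal sketch with made-up labels (`y_true` and `y_pred` below are hypothetical, not the notebook's results).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["E", "ET", "T", "M"]

# Hypothetical true and predicted ratings, for illustration only
y_true = ["E", "E", "ET", "T", "T", "M", "M", "M"]
y_pred = ["E", "ET", "ET", "T", "M", "M", "M", "T"]

acc = accuracy_score(y_true, y_pred)

# Rows are true ratings, columns are predicted ratings, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(acc)
print(cm)
```

Passing the notebook's `y_test` and any of the prediction arrays to `confusion_matrix` would show, for instance, whether neighbouring ratings such as T and M are the ones being mixed up.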

# 3) Conclusion of prediction.

Decision Tree Classifier has 0.82 accuracy score

Random Forest Classifier has 0.85 accuracy score (our “winner”)

Stack model has 0.848 accuracy score

Well, here we created three models, compared them, looked at the results, and understood that a machine learning model can predict the ESRB rating of games very well.

Thank you to everyone who checked this notebook, and especially to the author of this dataset, because I have always wanted to do work like this. If you like my notebook, please upvote it, and if you don't, please write a comment; it will help me improve my skills.
Good luck!