# Video Games Rating By "ESRB"

## 1 - Intro

### 1.1 - What is ESRB?

#### **According to Wikipedia:**

The **Entertainment Software Rating Board (ESRB)** is a **self-regulatory** organization that **assigns age and content ratings to consumer video games** in North America. The ESRB was **established in 1994 by the Entertainment Software Association** (ESA, formerly the Interactive Digital Software Association (IDSA)), **in response to criticism of controversial video games with excessively violent or sexual content**, particularly after the 1993 congressional hearings following the releases of Mortal Kombat and Night Trap for home consoles and Doom for home computers. The industry, pressured with potential government oversight of video game ratings from these hearings, established both the IDSA and the ESRB within it to create a voluntary ratings system based on the Motion Picture Association of America film rating system with additional considerations for video game interactivity.
 
 Link for the full article: https://en.wikipedia.org/wiki/Entertainment_Software_Rating_Board

<img src="https://images.unsplash.com/photo-1476703993599-0035a21b17a9?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=750&q=80" alt="Mom and kids with tablet" width="600"/>

Photo courtesy of Alexander Dummer on Unsplash.

## 2 - Problem Definition

Could we create a machine learning model that predicts the ESRB rating for a video game based on its descriptors?

## 3 - Data

We have data from 1895 games already rated by the ESRB in different categories. 

The dataset has the following columns:

- **title**: Name of the video game (plain text).

- **console**: Checks if the video game has a console version (binary value).

- **alcohol_reference**: Checks if the video game makes reference to and/or images of alcoholic beverages (binary vector).

- **animated_blood**: Checks if the video game has discolored and/or unrealistic depictions of blood (binary vector).

- **blood**: Checks if the video game has depictions of blood (binary vector).

- **blood_and_gore**: Checks if the video game has depictions of blood or the mutilation of body parts (binary vector).

- **cartoon_violence**: Checks if the video game has violent actions involving cartoon-like situations and characters. May include violence where a character is unharmed after the action has been inflicted (binary vector).

- **crude_humor**: Checks if the video game has depictions or dialogue involving vulgar antics, including "bathroom" humor (binary vector).

- **drug_reference**: Checks if the video game has reference to and/or images of illegal drugs (binary vector).

- **fantasy_violence**: Checks if the video game has violent actions of a fantasy nature, involving human or non-human characters in situations easily distinguishable from real life (binary vector).

- **intense_violence**: Checks if the video game has graphic and realistic-looking depictions of physical conflict. May involve extreme and/or realistic blood, gore, weapons and depictions of human injury and death (binary vector).

- **language**: Checks if the video game has mild to moderate use of profanity (binary vector).

- **lyrics**: Checks if the video game has mild references to profanity, sexuality, violence, alcohol or drug use in music (binary vector).

- **mature_humor**: Checks if the video game has depictions or dialogue involving "adult" humor, including sexual references (binary vector).

- **mild_blood**: Checks if the video game has low frequency, intensity or severity depictions of blood (binary vector).

- **mild_cartoon_violence**: Checks if the video game has low frequency, intensity or severity violent actions involving cartoon-like situations and characters. May include violence where a character is unharmed after the action has been inflicted (binary vector).

- **mild_fantasy_violence**: Checks if the video game has low frequency, intensity or severity violent actions of a fantasy nature, involving human or non-human characters in situations easily distinguishable from real life (binary vector).

- **mild_language**: Checks if the video game has low frequency, intensity or severity use of profanity (binary vector).

- **mild_lyrics**: Checks if the video game has low frequency, intensity or severity references to profanity, sexuality, violence, alcohol or drug use in music (binary vector).

- **mild_suggestive_themes**: Checks if the video game has low frequency, intensity or severity provocative references or materials (binary vector).

- **mild_violence**: Checks if the video game has mild scenes involving aggressive conflict. May contain bloodless dismemberment (binary vector).

- **no_descriptors**: Checks if the video game does not have any of the descriptors (binary vector).

- **nudity**: Checks if the video game has graphic or prolonged depictions of nudity (binary vector).

- **partial_nudity**: Checks if the video game has brief and/or mild depictions of nudity (binary vector).

- **sexual_content**: Checks if the video game has non-explicit depictions of sexual behavior, possibly including partial nudity (binary vector).

- **sexual_themes**: Checks if the video game has references to sex or sexuality (binary vector).

- **simulated_gambling**: Checks if the video game has a feature that allows the player to gamble without betting or wagering real cash or currency (binary vector).

- **strong_language**: Checks if the video game has explicit and/or frequent use of profanity (binary vector).

- **strong_sexual_content**: Checks if the video game has explicit and/or frequent depictions of sexual behavior, possibly including nudity (binary vector).

- **suggestive_themes**: Checks if the video game has mild provocative references or materials (binary vector).

- **use_of_alcohol**: Checks if the video game has the consumption of alcoholic beverages (binary vector).

- **use_of_drugs_and_alcohol**: Checks if the video game has the consumption or use of illegal drugs and alcohol (binary vector).

- **violence**: Checks if the video game has scenes involving aggressive conflict. May contain bloodless dismemberment (binary vector).

- **esrb_rating**: ESRB rating of the video game (plain text - more info below).

The ESRB ratings are as follows:

- **E**: Everyone. Content is generally suitable for all ages. May contain minimal cartoon, fantasy or mild violence and/or infrequent use of mild language.

- **ET**: Everyone 10+. Content is generally suitable for ages 10 and up. May contain more cartoon, fantasy or mild violence, mild language and/or minimal suggestive themes.

- **T**: Teen. Content is generally suitable for ages 13 and up. May contain violence, suggestive themes, crude humor, minimal blood, simulated gambling and/or infrequent use of strong language.

- **M**: Mature 17+. Content is generally suitable for ages 17 and up. May contain intense violence, blood and gore, sexual content and/or strong language.

## 4 - Evaluation

Letting a video game aimed at a more mature audience reach a younger crowd is a **very grave** offense.

Therefore, the desired accuracy for the model should be **above 90%**.

The model **must not** assign a game a less restrictive rating than its true ESRB rating.
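Accuracy alone does not capture the second requirement, so it may help to track it separately. The sketch below is one possible way to do that: it assumes the ratings can be ordered E < ET < T < M and reports the fraction of predictions that are less restrictive than the true rating. The function name `under_rating_rate` and the example labels are hypothetical and are not part of the dataset or of scikit-learn.

```python
import numpy as np

# Ratings ordered from least to most restrictive (labels as used in this dataset).
RATING_ORDER = {"E": 0, "ET": 1, "T": 2, "M": 3}


def under_rating_rate(y_true, y_pred):
    """Fraction of games predicted as less restrictive than their true rating."""
    true_rank = np.array([RATING_ORDER[r] for r in y_true])
    pred_rank = np.array([RATING_ORDER[r] for r in y_pred])
    return float(np.mean(pred_rank < true_rank))


# Hypothetical labels, for illustration only.
y_true = ["M", "T", "E", "ET", "M"]
y_pred = ["T", "T", "E", "T", "M"]

# Only the first game (true M, predicted T) is under-rated: 1 out of 5.
print(under_rating_rate(y_true, y_pred))
```

A value of 0 would mean the "must not happen" requirement is met on the evaluated data.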

## 5 - Exploratory Data Analysis (EDA):

### 5.1 - First steps

#### 5.1.1 - Necessary libraries

In [None]:
import pandas as pd 
import numpy as np

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import PCA
from sklearn import metrics

import matplotlib.pyplot as plt
import seaborn as sns

#### 5.1.2 - Importing the dataframe

In [None]:
df = pd.read_csv("../input/video-games-rating-by-esrb/Video_games_esrb_rating.csv")

In [None]:
df.head()

Correcting a small typo in the original dataframe

In [None]:
df = df.rename(columns={"strong_janguage": "strong_language"})

In [None]:
df.head()

#### 5.1.3 - Dataframe checks

Checking for null values

In [None]:
df.isnull().sum()

Checking the data types on the dataframe

In [None]:
df.dtypes

Checking if the int64 variables have only 0's and 1's

In [None]:
col_val = df.drop(["title", "esrb_rating"], axis=1)

In [None]:
unique_val = np.unique(col_val)
unique_val

In [None]:
del col_val, unique_val
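The check above looks at all values at once. If it ever reported something other than 0 and 1, a per-column check would show which column is responsible. A minimal sketch, using a small hypothetical dataframe `toy` in place of the real descriptor columns:

```python
import pandas as pd

# Hypothetical stand-in for the descriptor columns of `df`.
toy = pd.DataFrame({
    "blood": [0, 1, 1],
    "violence": [1, 0, 2],  # contains an invalid value on purpose
    "nudity": [0, 0, 0],
})

# True for each column whose values are all 0 or 1.
is_binary = toy.isin([0, 1]).all()

# Names of the columns that fail the check.
non_binary_cols = is_binary[~is_binary].index.tolist()
print(non_binary_cols)
```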

Checking if we have repeated video game titles

In [None]:
df["title"].is_unique
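`is_unique` only answers yes or no. If it returned `False`, the repeated rows could be listed with `duplicated(keep=False)`. A minimal sketch on a hypothetical dataframe `toy` (the titles are made up):

```python
import pandas as pd

# Hypothetical data with one repeated title.
toy = pd.DataFrame({
    "title": ["Game A", "Game B", "Game A"],
    "esrb_rating": ["E", "T", "ET"],
})

# keep=False marks every occurrence of a repeated title, not only the later ones.
duplicates = toy[toy["title"].duplicated(keep=False)].sort_values("title")
print(duplicates)
```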

#### 5.1.4 - Visualizations

##### 5.1.4.1 - Principal Component Analysis (PCA)

In [None]:
df_vis_x = df.drop("esrb_rating",axis = 1)
df_vis_x.set_index("title", inplace=True)
df_vis_y = df["esrb_rating"]

In [None]:
scaler = StandardScaler()

In [None]:
df_vis_scaled_x = scaler.fit_transform(df_vis_x)

In [None]:
df_vis_scaled_x

In [None]:
pca = PCA(n_components=2)

In [None]:
principal_components = pca.fit_transform(df_vis_scaled_x)

In [None]:
principaldf = pd.DataFrame(data = principal_components,
                           columns = ["pc1", "pc2"])

In [None]:
principaldf

In [None]:
finaldf = pd.concat([principaldf, df_vis_y], axis = 1)

In [None]:
finaldf

In [None]:
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel("Principal Component 1", fontsize = 15)
ax.set_ylabel("Principal Component 2", fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = ["E", "ET", "T", "M"]
colors = ["r", "g", "b", "y"]
for target, color in zip(targets,colors):
    indicesToKeep = df_vis_y == target
    ax.scatter(finaldf.loc[indicesToKeep, "pc1"]
               , finaldf.loc[indicesToKeep, "pc2"]
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()
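A 2D projection is only as informative as the variance the two components retain, which `explained_variance_ratio_` reports. The sketch below shows the check on a random 0/1 matrix that stands in for the descriptor columns; the data is synthetic, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Random 0/1 matrix standing in for the descriptor columns (not the real data).
X = rng.integers(0, 2, size=(200, 10))

# Same steps as in the cells above: scale, then fit a 2-component PCA.
X_scaled = StandardScaler().fit_transform(X)
pca_demo = PCA(n_components=2).fit(X_scaled)

# Share of the total variance captured by each component, and by both together.
ratio = pca_demo.explained_variance_ratio_
print(ratio, ratio.sum())
```

On the real data, the same attribute is available as `pca.explained_variance_ratio_` after the `fit_transform` call.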

##### 5.1.4.2 - Correlation Matrix

In [None]:
one_hot = pd.get_dummies(df["esrb_rating"])

In [None]:
df_heat = df.drop("esrb_rating", axis=1)

In [None]:
df_heat = df_heat.join(one_hot)

In [None]:
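# Exclude the non-numeric "title" column before computing correlations
cor_matrix = df_heat.drop(columns="title").corr()
fig, ax = plt.subplots(figsize=(30,30))
sns.heatmap(cor_matrix, annot=True, ax=ax);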
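The full heatmap is large, so it can be easier to read the correlations for one rating at a time. A minimal sketch with `DataFrame.corrwith`, on a tiny hypothetical dataframe `toy` in which `blood` is set to match the M rating exactly:

```python
import pandas as pd

# Hypothetical data, for illustration only.
toy = pd.DataFrame({
    "blood": [1, 1, 0, 0, 1, 0],
    "mild_language": [0, 0, 1, 1, 0, 1],
    "esrb_rating": ["M", "M", "E", "E", "M", "T"],
})

# One indicator column per rating, as in the one-hot encoding step above.
rating_dummies = pd.get_dummies(toy["esrb_rating"]).astype(int)
features = toy.drop(columns="esrb_rating")

# Correlation of each descriptor with the "M" indicator, strongest first.
corr_with_m = features.corrwith(rating_dummies["M"]).sort_values(ascending=False)
print(corr_with_m)
```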

##### 5.1.4.3 - Number of games by ESRB rating

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
df["esrb_rating"].value_counts().plot.bar(ax=ax);

##### 5.1.4.4 - Most frequent descriptors in the database

In [None]:
df_desc = df.drop(columns=["title", "esrb_rating"])

In [None]:
df_desc = df_desc.sum()

In [None]:
df_desc = df_desc.reset_index() 

In [None]:
df_desc.rename(columns={"index" : "Descriptors", 0 : "Count"}, inplace=True)

In [None]:
df_desc.sort_values("Count", ascending=False, inplace=True)

In [None]:
df_desc.plot.bar(x="Descriptors", y="Count");

## 6 - Machine Learning Model:

### 6.1 - First Steps

In [None]:
clf = SGDClassifier()

### 6.2 - Working the data

In [None]:
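df_x = df.drop(columns=["esrb_rating", "title"])
df_y = df["esrb_rating"]
df_test = pd.read_csv("../input/video-games-rating-by-esrb/test_esrb.csv")
# Apply the same column-name fix as for the training data (a no-op if the typo is absent)
df_test = df_test.rename(columns={"strong_janguage": "strong_language"})
df_test_x = df_test.drop(columns=["esrb_rating", "title"])
df_test_y = df_test["esrb_rating"]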

### 6.3 - Fitting the model

In [None]:
clf.fit(df_x, df_y)

### 6.4 - Model evaluation

In [None]:
clf.score(df_test_x, df_test_y)

In [None]:
cross_val_score(clf, df_x, df_y, cv=10)

In [None]:
preds = clf.predict(df_test_x)

In [None]:
print(classification_report(df_test_y, preds))

In [None]:
labels = ["M", "T", "ET", "E"]

In [None]:
fig, ax = plt.subplots(figsize=(15, 15))
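# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
metrics.ConfusionMatrixDisplay.from_estimator(clf, df_test_x, df_test_y, labels=labels, xticks_rotation="vertical", ax=ax)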

### 6.5 - Trying with a different model:

In [None]:
clf2 = DecisionTreeClassifier(criterion="entropy")

In [None]:
clf2.fit(df_x, df_y)

In [None]:
clf2.score(df_test_x, df_test_y)

In [None]:
cross_val_score(clf2, df_x, df_y, cv=10)

In [None]:
preds2 = clf2.predict(df_test_x)

In [None]:
print(classification_report(df_test_y, preds2))

In [None]:
fig, ax = plt.subplots(figsize=(15, 15))
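# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
metrics.ConfusionMatrixDisplay.from_estimator(clf2, df_test_x, df_test_y, labels=labels, xticks_rotation="vertical", ax=ax)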

## 7 - Conclusion

Even though the model couldn't reach the desired accuracy of above 90%, it came close at 85%, and only a few games aimed at older audiences were classified as suitable for a younger audience.

Next steps: Hyperparameter tuning and possibly better model selection.
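As a starting point for the tuning step, a grid search over a few decision tree settings could look like the sketch below. It runs on synthetic 0/1 data with a made-up target, and the parameter grid is only an assumption about which values might be worth trying; it is not a result from this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic binary features standing in for the descriptor columns.
X = rng.integers(0, 2, size=(300, 8))
# Made-up target derived from the first two columns, for illustration only.
y = np.where(X[:, 0] == 1, "M", np.where(X[:, 1] == 1, "T", "E"))

# Hypothetical grid; the values are a guess at a reasonable starting range.
param_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# Best parameter combination and its mean cross-validated accuracy.
print(search.best_params_, search.best_score_)
```

On the real data, `X` and `y` would be replaced by `df_x` and `df_y`, keeping the test set aside for the final evaluation.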