## Preprocess Metacritic, Genres, and Supported Platforms Columns

In [None]:
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MultiLabelBinarizer
from wordcloud import WordCloud

In [None]:
df = pd.read_csv('../../data/raw/info_base_games.csv', low_memory=False)

In [None]:
df.shape

In [None]:
df.head()

Checking null values

In [None]:
df.isna().sum().sort_values(ascending = False)

Drop Corrupted Sample That Contains The Column Names

In [None]:
df.loc[[9929]]

In [None]:
df = df.drop(index=9929)

### Preprocessing the Metacritic Score Column

[Metacritic](https://www.metacritic.com/) is a website where critics and users review digital content (movies, music, **games**, etc.). The Metacritic score shown on Steam is the **critics'** score (from professional game reviewers), not the users' score.

In [None]:
# Convert the metacritic column to numeric; non-numeric entries are coerced to NaN
df['metacritic'] = pd.to_numeric(df['metacritic'], errors='coerce')

Checking Percentage of Missing Values

In [None]:
total = len(df)
non_null = df['metacritic'].notna().sum()
missing = df['metacritic'].isna().sum()
print(f"Total rows: {total}")
print(f"With metacritic score: {non_null} ({non_null/total:.1%})")
print(f"Missing metacritic score: {missing} ({missing/total:.1%})")

The results above show that almost 97% of the games in our data don't have a Metacritic score (the value is null). This is because Steam leaves it **optional** for game publishers to include a Metacritic score on their Steam page.

In [None]:
df['metacritic'].describe()

In [None]:
plt.hist(df['metacritic'].dropna(), bins=20)
plt.xlabel("Metacritic score")
plt.ylabel("Number of games")
plt.title("Distribution of Metacritic Scores Ignoring Null Values")
plt.show()

The histogram above suggests that the **metacritic scores are roughly normally distributed**, so **standardization is a reasonable choice for this column**.
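
A histogram is only a visual check. As a rough numeric complement, skewness and excess kurtosis can be inspected; both are close to 0 for a normal distribution. The sketch below uses synthetic scores (the mean and spread are arbitrary assumptions) as a stand-in for `df['metacritic'].dropna()`, so the numbers it prints say nothing about the real column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['metacritic'].dropna(); loc and scale are arbitrary.
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(loc=70, scale=10, size=2000))

skewness = scores.skew()         # about 0 for a symmetric distribution
excess_kurtosis = scores.kurt()  # pandas reports excess kurtosis, about 0 for a normal

print(f"skewness: {skewness:.2f}, excess kurtosis: {excess_kurtosis:.2f}")
```

On the real data, values far from 0 would be a hint that the bell-shape reading of the histogram is too optimistic.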

Metacritic scores before standardization

In [None]:
df['metacritic'].dropna().head()

Standardizing metacritic scores

In [None]:
standardizer = StandardScaler()
# IMPORTANT: later on I should fit_transform only on the training data
df['metacritic_preprocessed'] = standardizer.fit_transform(df[['metacritic']])

df['metacritic_preprocessed'].dropna().head()

One problem remains: the missing values. It doesn't make sense to drop all the rows without a metacritic score, since 97% of the data doesn't have one. The best solution that came to my mind is to **set all the NaN metacritic scores** to the **mean of the standardized metacritic scores**, which is **0**, and to create a new boolean feature `has_metacritic` that indicates whether the game has a metacritic score. I hope this helps models learn to ignore the metacritic score when `has_metacritic = 0`. More generally, `has_metacritic` may later turn out to be a useful feature on its own.
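
As a minimal sketch of how this plan could look once the data is split, the example below fits the scaler on a training split only and reuses it for the test split. The toy frame, the split sizes, and the names `toy`, `train`, `test`, and `scaler` are assumptions for illustration, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({'metacritic': [80.0, np.nan, 60.0, np.nan, 70.0, 90.0, np.nan, 50.0]})

train, test = train_test_split(toy, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# Fit on the training split only; StandardScaler skips NaNs when fitting.
scaler = StandardScaler()
scaler.fit(train[['metacritic']])

for part in (train, test):
    # Indicator: 1 if the game has a score, 0 otherwise.
    part['has_metacritic'] = part['metacritic'].notna().astype(int)
    # transform keeps NaNs as NaN, so they can be filled with 0 afterwards.
    scaled = scaler.transform(part[['metacritic']]).ravel()
    part['metacritic_preprocessed'] = pd.Series(scaled, index=part.index).fillna(0)
```

Because the scaler never sees the test split, the test scores are standardized with the training mean and standard deviation, which avoids leaking test information into preprocessing.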

Creating New `has_metacritic` Feature

In [None]:
df['has_metacritic'] = df['metacritic'].notna().astype(int)
print(df['has_metacritic'].head())
print(df['has_metacritic'].loc[[1893]])

#### Replacing Missing Values With the Mean Value

Before

In [None]:
df['metacritic_preprocessed'].head()

After

In [None]:
df['metacritic_preprocessed'] = df['metacritic_preprocessed'].fillna(0)
print(df['metacritic_preprocessed'].head())

In [None]:
df.head()

#### Final Thoughts On Preprocessing Metacritic

This concludes my current attempt at preprocessing the metacritic column. Below are some thoughts that we may wish to revisit in the future:
1. I am not sure the imputation technique I used is the most suitable one for this case, nor whether it is correct to standardize and then impute rather than impute and then standardize. I chose the first approach because I think it better preserves the distribution of the original data, and because I believe the imputed value shouldn't carry any meaning beyond indicating that the row had no metacritic value, which is what I tried to achieve together with the `has_metacritic` column.

2. Most importantly, I believe the `metacritic` column **will turn out not to be useful** for our models and that we will remove it as a feature. This is based on some discussions I read ([discussion1](https://Steamcommunity.com/discussions/forum/10/3057367211653181335/?l=latam), [discussion2](https://www.reddit.com/r/pcgaming/comments/1gjadpf/da_tv_metacritic_user_score_vs_Steam_ratings/)) in which multiple people feel that it doesn't give accurate reviews. I also expect that it might be highly correlated with `reviewScore` in `gamalytic_Steam_games.csv`. Given that, and that only a small number of samples have this score, I expect this feature won't be useful. On the other hand, the new `has_metacritic` feature might be useful on its own. Still, this is all speculation, and we will find out by using proper feature selection methods.
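
To make the ordering question in point 1 concrete, here is a small sketch on made-up scores (3 observed, 7 missing) comparing the two orders. It is only an illustration under those assumptions; `raw`, `a`, and `b` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up scores: 3 observed values and 7 missing ones (missing is the majority).
raw = pd.DataFrame({'metacritic': [60.0, 70.0, 80.0] + [np.nan] * 7})
observed = raw['metacritic'].notna()

# Option A: standardize first (NaNs are skipped when fitting), then fill NaNs with 0.
a = pd.Series(StandardScaler().fit_transform(raw[['metacritic']]).ravel()).fillna(0)

# Option B: fill NaNs with the observed mean first, then standardize.
filled = raw[['metacritic']].fillna(raw['metacritic'].mean())
b = pd.Series(StandardScaler().fit_transform(filled).ravel())

print("A (standardize, then impute):", a[observed].round(2).tolist())
print("B (impute, then standardize):", b[observed].round(2).tolist())
```

In both options the missing rows end up at 0. In option B, however, the many identical imputed values shrink the standard deviation, so the observed scores are pushed further from 0 (roughly ±2.24 instead of ±1.22 with these toy numbers). This supports the intuition that standardizing first keeps the observed scores on the scale of the original distribution.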

### Preprocessing the Genres Column

In [None]:
df['genres'].head()

Split Genres Into Lists

In [None]:
df['genres_split'] = df['genres'].fillna('')
df['genres_split'] = df['genres_split'].apply(lambda x: x.split(', ') if x else [])
df['genres_split'].head()

#### Analyze Genres

Flatten genres into one big list

In [None]:
all_genres = [genre
              for sublist in df['genres_split']
              for genre in sublist]

Count genre frequencies

In [None]:
genre_counts = Counter(all_genres)
genre_freq = (
    pd.DataFrame.from_dict(genre_counts, orient='index', columns=['count'])
      .sort_values('count', ascending=False)
)
print("Number of unique genres in the whole dataset:")
print(len(genre_freq))
print(genre_freq)

Output word cloud for visualization of the genre counts

In [None]:
wc = WordCloud(width=1400, height=700, background_color='white')
wc.generate_from_frequencies(genre_counts)

plt.figure(figsize=(14, 7))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

Multi-Hot Encode The Genres and Merge Them Into the Dataframe

In [None]:
mlb = MultiLabelBinarizer()
# IMPORTANT: later on I should fit_transform only on the training data
genres_encoded = mlb.fit_transform(df['genres_split'])
genres_df = pd.DataFrame(genres_encoded, columns=[f'genre_{c}' for c in mlb.classes_], index=df.index)
df = pd.concat([df, genres_df], axis=1)
df = df.drop(columns=['genres_split'])

In [None]:
pd.set_option('display.max_columns', None)
df.head()

#### Final Thoughts On Preprocessing Genres

1. There are rare genres (Accounting, Nudity, Web Publishing, etc.). After some research, I see two options: either treat them like all the other genres, as I did above, or set a frequency threshold and replace every genre below it with a single "genre_Other" column. I didn't do the latter because I suspect these rare genres might help during prediction, but the best way to know is to test both methods in the "training and evaluation" phase and see which one helps the model make better predictions.

2. To handle unseen genres that may appear later in unseen data: if we do not have the "genre_Other" column, the best approach is to put 1s in the genres we know and ignore the ones we don't. If we do have a "genre_Other" column, we would put a 1 in that column instead.

3. If genres later prove useful for prediction and we select them as features, I believe the best way to handle missing genres in unseen data would be to scrape the game's genres from the web.
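
The two options from points 1 and 2 can be sketched as follows. The genre lists, the `min_count` threshold, and the `bucket` helper are hypothetical and chosen only for illustration. The sketch relies on `MultiLabelBinarizer.transform` ignoring unknown classes (it emits a `UserWarning` when it does).

```python
import warnings
from collections import Counter

from sklearn.preprocessing import MultiLabelBinarizer

# Made-up genre lists; min_count is a hypothetical threshold, not a tuned value.
train_genres = [['Action', 'Indie'], ['Action'], ['Indie', 'Accounting'],
                ['Action', 'RPG'], ['RPG']]
unseen_genres = [['Action', 'Web Publishing'], ['Racing']]
min_count = 2

# Option 1: no "Other" column. Genres unknown to the fitted encoder are ignored.
mlb_plain = MultiLabelBinarizer().fit(train_genres)
with warnings.catch_warnings():
    warnings.simplefilter('ignore', UserWarning)  # silence the "unknown class" warning
    plain = mlb_plain.transform(unseen_genres)

# Option 2: map rare (in training) and unseen genres to "Other" before encoding.
counts = Counter(g for genres in train_genres for g in genres)
kept = {g for g, n in counts.items() if n >= min_count}

def bucket(genres):
    """Replace genres outside `kept` with 'Other', without duplicates."""
    return sorted({g if g in kept else 'Other' for g in genres})

mlb_other = MultiLabelBinarizer().fit([bucket(g) for g in train_genres])
other = mlb_other.transform([bucket(g) for g in unseen_genres])

print(list(mlb_plain.classes_), plain.tolist())
print(list(mlb_other.classes_), other.tolist())
```

Note that in option 2 the "Other" class only exists if at least one training genre falls below the threshold; otherwise it would have to be added to the encoder's classes explicitly.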