<h1 style="text-align: center; font-weight: bold;">Preprocessing</h1>

In order to make the data suitable for the model, we need to preprocess it. In this notebook, we will preprocess the data in the following steps:
- Merge the datasets
- Remove duplicates
- Handle missing values
- Handle outliers
- Encode categorical features

In [1]:
import pandas as pd
import numpy as np

<h2 style="font-weight: bold;">1. Load datasets and prepare data for merging</h2>

### **Code Flow Explanation**

1. **Loading the MetaCritic Data**
   ```python
   metacritic_df = pd.read_csv("./MetaCritic/MetaCritic.csv")
   print(f"MetaCritic: {metacritic_df.shape}")
   print(metacritic_df.columns)
   ```
   - **`pd.read_csv(...)`**: Loads the MetaCritic dataset from the specified file path (`"./MetaCritic/MetaCritic.csv"`) into a DataFrame named `metacritic_df`.
   - **`metacritic_df.shape`**: Returns a tuple with the number of rows and columns in the MetaCritic dataset. This is printed to show the dataset's dimensions.
   - **`metacritic_df.columns`**: Displays the column names of the MetaCritic dataset, providing an overview of the variables it contains.

2. **Loading the OpenCritic Data**
   ```python
   opencritic_df = pd.read_csv("./OpenCritic/game_data.csv")
   print(f"OpenCritic: {opencritic_df.shape}")
   print(opencritic_df.columns)
   ```
   - **`pd.read_csv(...)`**: Loads the OpenCritic dataset from the specified file path (`"./OpenCritic/game_data.csv"`) into a DataFrame named `opencritic_df`.
   - **`opencritic_df.shape`**: Prints the shape (rows, columns) of the OpenCritic dataset.
   - **`opencritic_df.columns`**: Prints the names of the columns in the OpenCritic dataset.

3. **Loading the Steam Data**
   ```python
   steam_df = pd.read_csv("./steam_scraping/data/outputs/steam_data.csv")
   print(f"Steam: {steam_df.shape}")
   print(steam_df.columns)
   ```
   - **`pd.read_csv(...)`**: Loads the Steam dataset from the specified file path (`"./steam_scraping/data/outputs/steam_data.csv"`) into a DataFrame named `steam_df`.
   - **`steam_df.shape`**: Prints the shape (rows, columns) of the Steam dataset.
   - **`steam_df.columns`**: Prints the names of the columns in the Steam dataset.

---

### **Expected Output:**
The expected output will be printed in the following format:

```text
MetaCritic: (rows, columns)
Index(['column1', 'column2', 'column3', ...], dtype='object')

OpenCritic: (rows, columns)
Index(['column1', 'column2', 'column3', ...], dtype='object')

Steam: (rows, columns)
Index(['column1', 'column2', 'column3', ...], dtype='object')
```

Where `(rows, columns)` represents the shape of each dataset, and the column names (`'column1'`, `'column2'`, etc.) represent the names of the columns in each respective dataset.

In [2]:
# metacritic
metacritic_df = pd.read_csv("./MetaCritic/MetaCritic.csv")
print(f"MetaCritic: {metacritic_df.shape}")
print(metacritic_df.columns)

# opencritic
opencritic_df = pd.read_csv("./OpenCritic/game_data.csv")
print(f"OpenCritic: {opencritic_df.shape}")
print(opencritic_df.columns)

# steam
steam_df = pd.read_csv("./steam_scraping/data/outputs/steam_data.csv")
print(f"Steam: {steam_df.shape}")
print(steam_df.columns)

MetaCritic: (24258, 8)
Index(['Game Title', 'Game Genre', 'Pricing', 'Developer', 'Release Date',
       'Platform', 'Rating', 'Number of Ratings'],
      dtype='object')
OpenCritic: (7771, 9)
Index(['Game Title', 'Game Genre', 'Pricing', 'Game Library Size', 'Publisher',
       'Release Date', 'Platform', 'Rating', 'Number of Rating'],
      dtype='object')
Steam: (58320, 8)
Index(['Game Title', 'Game Genre', 'Pricing', 'Publisher', 'Release Date',
       'Platform', 'Rating', 'Number of Ratings'],
      dtype='object')


Some column names differ between the datasets. We will rename them so that all three datasets share the same columns, and remove the columns that are not useful for the model.

- **OpenCritic**
    - Has 'Number of Rating' instead of 'Number of Ratings' and 'Publisher' instead of 'Developer'.
    - Has an additional 'Game Library Size' column which is not needed.

- **Steam**
    - Has 'Publisher' instead of 'Developer'

- **MetaCritic** already uses the target column names, so we leave it unchanged.
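
The comparison above can also be done programmatically. The following is a minimal sketch on empty toy frames; the helper `schema_diff` is hypothetical and not part of this notebook.

```python
import pandas as pd

# Toy frames that only carry column names (hypothetical stand-ins for the real data).
meta = pd.DataFrame(columns=["Game Title", "Developer", "Number of Ratings"])
opencritic = pd.DataFrame(
    columns=["Game Title", "Publisher", "Number of Rating", "Game Library Size"]
)
steam = pd.DataFrame(columns=["Game Title", "Publisher", "Number of Ratings"])


def schema_diff(reference: pd.DataFrame, other: pd.DataFrame) -> dict:
    """Return the columns missing from / extra in `other` relative to `reference`."""
    ref_cols, other_cols = set(reference.columns), set(other.columns)
    return {
        "missing": sorted(ref_cols - other_cols),
        "extra": sorted(other_cols - ref_cols),
    }


opencritic_diff = schema_diff(meta, opencritic)
steam_diff = schema_diff(meta, steam)
print(opencritic_diff)
print(steam_diff)
```

An empty `missing` and `extra` list for every dataset would indicate that the schemas line up.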

### **Code Flow Explanation**

1. **Renaming Columns in the OpenCritic DataFrame**
   ```python
   opencritic_df.rename(columns={"Number of Rating" : "Number of Ratings",
                                 "Publisher" : "Developer"}, inplace=True)
   ```
   - **`rename(...)`**: This operation renames the columns of `opencritic_df`:
     - **"Number of Rating"** is renamed to **"Number of Ratings"**.
     - **"Publisher"** is renamed to **"Developer"**.
   - **`inplace=True`**: The changes are applied directly to the `opencritic_df` DataFrame (without needing to create a new variable).

2. **Dropping the "Game Library Size" Column in the OpenCritic DataFrame**
   ```python
   opencritic_df.drop(columns=["Game Library Size"], inplace=True)
   ```
   - **`drop(columns=[...])`**: Drops the `"Game Library Size"` column from the `opencritic_df` DataFrame.
   - **`inplace=True`**: The operation is done directly on `opencritic_df`.

3. **Renaming Columns in the Steam DataFrame**
   ```python
   steam_df.rename(columns={"Publisher" : "Developer"}, inplace=True)
   ```
   - **`rename(...)`**: This operation renames the `"Publisher"` column in `steam_df` to `"Developer"`.
   - **`inplace=True`**: The changes are applied directly to `steam_df`.

4. **Printing Column Names**
   ```python
   print(opencritic_df.columns)
   print(steam_df.columns)
   print(metacritic_df.columns)
   ```
   - **`opencritic_df.columns`**: Prints the column names of the `opencritic_df` DataFrame after the renaming and column drop.
   - **`steam_df.columns`**: Prints the column names of the `steam_df` DataFrame after the renaming.
   - **`metacritic_df.columns`**: Prints the column names of the `metacritic_df` DataFrame (this DataFrame hasn't been modified in this snippet, so it will remain the same).

---

### **Expected Output:**
The output will display the updated column names of the three DataFrames:

```text
Index(['column1', 'column2', ..., 'Number of Ratings', 'Developer', ...], dtype='object')
Index(['column1', 'column2', ..., 'Developer', ...], dtype='object')
Index(['column1', 'column2', ..., 'columnN'], dtype='object')
```

- For `opencritic_df`, you’ll see that `"Number of Rating"` has been renamed to `"Number of Ratings"` and `"Publisher"` has been renamed to `"Developer"`, and `"Game Library Size"` will be absent.
- For `steam_df`, `"Publisher"` will have been renamed to `"Developer"`.
- For `metacritic_df`, no changes will have occurred, so its columns will remain the same as before.

In [3]:
opencritic_df.rename(columns={"Number of Rating" : "Number of Ratings",
                              "Publisher" : "Developer"}, inplace=True)
opencritic_df.drop(columns=["Game Library Size"], inplace=True)

steam_df.rename(columns={"Publisher" : "Developer"}, inplace=True)

print(opencritic_df.columns)
print(steam_df.columns)
print(metacritic_df.columns)

Index(['Game Title', 'Game Genre', 'Pricing', 'Developer', 'Release Date',
       'Platform', 'Rating', 'Number of Ratings'],
      dtype='object')
Index(['Game Title', 'Game Genre', 'Pricing', 'Developer', 'Release Date',
       'Platform', 'Rating', 'Number of Ratings'],
      dtype='object')
Index(['Game Title', 'Game Genre', 'Pricing', 'Developer', 'Release Date',
       'Platform', 'Rating', 'Number of Ratings'],
      dtype='object')


<h3 style="font-weight: bold;">Normalize <i>Ratings</i> column</h3>

Since our dataset is merged from 3 different sources, the ratings scale may be different. We will normalize the ratings to a scale of 0-100.

### **Code Flow Explanation**

1. **Function Definition: `getMinMaxRatings(df)`**
   ```python
   def getMinMaxRatings(df):
       return df['Rating'].max(), df['Rating'].min()
   ```
   - **Purpose**: This function takes a DataFrame `df` as input and returns the maximum and minimum values from the `"Rating"` column.

2. **Calculate Max and Min Ratings for Each DataFrame**
   ```python
   meta_max, meta_min = getMinMaxRatings(metacritic_df)
   open_max, open_min = getMinMaxRatings(opencritic_df)
   steam_max, steam_min = getMinMaxRatings(steam_df)
   ```
   - This calculates the max and min ratings for `metacritic_df`, `opencritic_df`, and `steam_df`.

3. **Print the Results**
   ```python
   print(f"MetaCritic: Max: {meta_max}, Min: {meta_min}")
   print(f"OpenCritic: Max: {open_max}, Min: {open_min}")
   print(f"Steam: Max: {steam_max}, Min: {steam_min}")
   ```
   - This prints the max and min ratings for each platform: MetaCritic, OpenCritic, and Steam.

---

### **Expected Output**
```text
MetaCritic: Max: {meta_max}, Min: {meta_min}
OpenCritic: Max: {open_max}, Min: {open_min}
Steam: Max: {steam_max}, Min: {steam_min}
```
Where `{meta_max}`, `{meta_min}`, `{open_max}`, `{open_min}`, `{steam_max}`, and `{steam_min}` will be the respective values from the `"Rating"` column of each DataFrame.

In [4]:
def getMinMaxRatings(df):
    return df['Rating'].max(), df['Rating'].min()


meta_max, meta_min = getMinMaxRatings(metacritic_df)
open_max, open_min = getMinMaxRatings(opencritic_df)
steam_max, steam_min = getMinMaxRatings(steam_df)

print(f"MetaCritic: Max: {meta_max}, Min: {meta_min}")
print(f"OpenCritic: Max: {open_max}, Min: {open_min}")
print(f"Steam: Max: {steam_max}, Min: {steam_min}")

MetaCritic: Max: 10.0, Min: 0.2
OpenCritic: Max: 97.0, Min: 12.0
Steam: Max: 100.0, Min: 0.0


As seen from the output above, **MetaCritic** has ratings on a scale of 0-10, while the other two use a scale of 0-100. We will have to rescale **MetaCritic**'s ratings to 0-100.

### **Code Flow Explanation**

1. **Scaling the Ratings**
   ```python
   metacritic_df['Rating'] = metacritic_df['Rating'] * 10
   ```
   - **Purpose**: This line multiplies all the ratings in the `"Rating"` column of the `metacritic_df` DataFrame by 10. This converts the ratings from a scale of 0-10 to a scale of 0-100, matching the other two datasets.

2. **Calculate Max and Min Ratings for MetaCritic**
   ```python
   meta_max, meta_min = getMinMaxRatings(metacritic_df)
   print(f"MetaCritic: Max: {meta_max}, Min: {meta_min}")
   ```
   - This calculates the max and min ratings for the updated `metacritic_df` DataFrame after scaling and prints the result.

---

### **Expected Output**
```text
MetaCritic: Max: {meta_max}, Min: {meta_min}
```
Where `{meta_max}` and `{meta_min}` will be the max and min values of the `"Rating"` column after scaling by 10.

In [5]:
metacritic_df['Rating'] = metacritic_df['Rating'] * 10

meta_max, meta_min = getMinMaxRatings(metacritic_df)
print(f"MetaCritic: Max: {meta_max}, Min: {meta_min}")

MetaCritic: Max: 100.0, Min: 2.0


We should also check for validity of data (e.g. if the ratings are within the expected range).

But since we have seen the minimum and maximum values of the ratings in the previous notebook, we can skip this step.
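
If the check were not skipped, it could look roughly like this. This is a sketch on made-up ratings; the helper name `out_of_range` and the sample values are assumptions, not part of the notebook.

```python
import pandas as pd


def out_of_range(ratings: pd.Series, low: float = 0.0, high: float = 100.0) -> pd.Series:
    """Return the non-missing ratings that fall outside [low, high]."""
    present = ratings.dropna()
    # Series.between is inclusive on both ends by default.
    return present[~present.between(low, high)]


# Hypothetical ratings, including one value that was scaled twice by mistake.
sample = pd.Series([2.0, 55.0, 100.0, None, 970.0])
bad = out_of_range(sample)
print(bad.tolist())
```

An empty result means every available rating lies within the expected range.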

<h2 style="font-weight: bold;">2. Merging data</h2>

We will merge the datasets by `Game Title` column, which is common in all datasets.

According to dataset sizes, the priority order will be:
1. Steam (58k rows)
2. MetaCritic (24k rows)
3. OpenCritic (7.8k rows)

Based on the priority order, we will merge **MetaCritic** and **OpenCritic** into **Steam**.
If a value is missing in **Steam**, we fill it from **MetaCritic** first and then from **OpenCritic**.

### **Code Flow Explanation**

1. **Merging DataFrames**
   ```python
   merged_df = pd.merge(steam_df, metacritic_df,
       on="Game Title",
       how="outer",
       suffixes=('_steam', '_metacritic')
   )
   merged_df = pd.merge(merged_df, opencritic_df,
       on="Game Title",
       how="outer",
       suffixes=('', '_opencritic')
   )
   ```
   - **Purpose**: Merges three DataFrames (`steam_df`, `metacritic_df`, `opencritic_df`) into a single `merged_df` on the `"Game Title"` column using outer joins, which preserve all rows from all DataFrames. The first merge tags the overlapping columns with `_steam` and `_metacritic`. In the second merge the OpenCritic columns no longer clash with any existing name, so they keep their original, unsuffixed names and the `_opencritic` suffix is never applied.

2. **Filling Missing Values Based on Priority**
   ```python
   for col in target_columns[1:]:
       merged_df[col] = merged_df[f"{col}_steam"].fillna(merged_df[f"{col}_metacritic"], axis=0).fillna(merged_df[col], axis=0)
   ```
   - **Purpose**: Fills missing values in each target column (`target_columns` is defined in the code cell). The `_steam` value is used first; if it is missing, the `_metacritic` value is used; if that is also missing, the unsuffixed OpenCritic value (`merged_df[col]`) is used.

3. **Cleaning String Columns**
   ```python
   string_col = ['Game Title', 'Game Genre', 'Pricing', 'Developer', 'Platform', 'Release Date']
   for col in string_col:
       merged_df[col] = merged_df[col].str.strip().str.strip("[").str.strip("]").str.strip("'").str.strip("'").str.replace('\u202A', '', regex=False)
   ```
   - **Purpose**: Cleans string columns by removing extra spaces, square brackets, single quotes, and any unwanted characters like `\u202A`.

4. **Selecting the Final Columns**
   ```python
   merged_df = merged_df[target_columns]
   ```
   - **Purpose**: Reduces `merged_df` to only include the columns specified in `target_columns`.

5. **Display Merged DataFrame**
   ```python
   merged_df
   ```
   - **Purpose**: Displays the final `merged_df` DataFrame containing the cleaned, merged data.

---

### **Result**
The final `merged_df` contains combined data from the three sources, with missing values filled, cleaned string columns, and only the necessary columns selected.
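
The merge and priority fill can be traced on a toy example. The three small frames below are made up for illustration; only the `Rating` column is shown.

```python
import pandas as pd

# Hypothetical miniature versions of the three sources.
steam = pd.DataFrame({"Game Title": ["A", "B"], "Rating": [80.0, None]})
meta = pd.DataFrame({"Game Title": ["B", "C"], "Rating": [70.0, None]})
opencritic = pd.DataFrame({"Game Title": ["C", "D"], "Rating": [60.0, 50.0]})

merged = pd.merge(steam, meta, on="Game Title", how="outer",
                  suffixes=("_steam", "_metacritic"))
merged = pd.merge(merged, opencritic, on="Game Title", how="outer",
                  suffixes=("", "_opencritic"))

# "Rating" has no name clash in the second merge, so it stays unsuffixed.
columns = sorted(merged.columns)
print(columns)

# Priority fill: Steam first, then MetaCritic, then the unsuffixed OpenCritic column.
merged["Rating"] = (
    merged["Rating_steam"]
    .fillna(merged["Rating_metacritic"])
    .fillna(merged["Rating"])
)
result = merged.set_index("Game Title")["Rating"].to_dict()
print(result)
```

Game `A` takes its rating from Steam, `B` from MetaCritic, and `C` and `D` from OpenCritic.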

In [6]:
# Merge dataframes
merged_df = pd.merge(steam_df, metacritic_df,
    on="Game Title",
    how="outer",
    suffixes=('_steam', '_metacritic')
)
merged_df = pd.merge(merged_df,opencritic_df,
    on="Game Title",
    how="outer",
    suffixes=('','_opencritic')
)

target_columns = ['Game Title', 'Game Genre', 'Pricing', 'Developer', 'Release Date', 'Platform', 'Rating', 'Number of Ratings']

# Fill values based on priority: Steam, then MetaCritic, then OpenCritic
for col in target_columns[1:]:
    # OpenCritic columns are unsuffixed after the second merge, hence merged_df[col]
    merged_df[col] = merged_df[f"{col}_steam"].fillna(merged_df[f"{col}_metacritic"], axis=0).fillna(merged_df[col], axis=0)

# strip whitespace, [, ], ' and the U+202A control character from the string columns
string_col = ['Game Title', 'Game Genre', 'Pricing', 'Developer', 'Platform', 'Release Date']
for col in string_col:
    merged_df[col] = merged_df[col].str.strip().str.strip("[").str.strip("]").str.strip("'").str.strip("'").str.replace('\u202A', '', regex=False)



merged_df = merged_df[target_columns]
merged_df

Unnamed: 0,Game Title,Game Genre,Pricing,Developer,Release Date,Platform,Rating,Number of Ratings
0,! That Bastard Is Trying To Steal Our Gold !,"Action, Adventure, Casual, Indie",$2.99,WTFOMGames,Mar 1 2016,,56.0,66.0
1,! Wild Russia !,"Action, Adventure, Casual",$19.99,Andreev Worlds,Apr 28 2020,,61.0,60.0
2,!4RC4N01D!,Arcade,,armogames,"Jan 12, 2018",PC,40.0,4.0
3,!4RC4N01D! 2: Retro Edition,Arcade,,armogames,"Feb 6, 2018",PC,38.0,4.0
4,!4RC4N01D! 2: Retro Edition,Arcade,,armogames,"Feb 6, 2018",PC,38.0,4.0
...,...,...,...,...,...,...,...,...
79856,🔴 Circles,"Casual, Indie",Free To Play,Jeroen Wimmers,Feb 17 2017,,89.0,226.0
79857,🔴 Circles,"Casual, Indie",Free To Play,Jeroen Wimmers,Feb 17 2017,,89.0,226.0
79858,🔴 Circles,"Casual, Indie",Free To Play,Jeroen Wimmers,Feb 17 2017,,89.0,226.0
79859,🚀 Human Rocket Person,"Action, Indie, Simulation",$1.99,2nd Studio,Nov 14 2018,,95.0,43.0


In [7]:
merged_df.dtypes

Game Title            object
Game Genre            object
Pricing               object
Developer             object
Release Date          object
Platform              object
Rating               float64
Number of Ratings    float64
dtype: object

<h2 style="font-weight: bold;">3. Data Cleaning</h2>

<h3 style="font-weight: bold;">3.1. Remove duplicates</h3>

The preview above ends at index 79859, so the merged dataset holds **79860** rows. However, many of them are duplicates, which we will have to remove.
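
One likely source of these duplicates (an assumption; the notebook does not state the cause) is that a title appearing more than once in two sources is paired row by row during the merge. A small sketch with made-up frames:

```python
import pandas as pd

# Hypothetical sources in which the same title appears twice on each side.
left = pd.DataFrame({"Game Title": ["X", "X"], "Rating": [89.0, 89.0]})
right = pd.DataFrame({"Game Title": ["X", "X"], "Rating": [88.0, 88.0]})

# Each of the 2 left rows pairs with each of the 2 right rows: 2 x 2 = 4 rows.
multiplied = pd.merge(left, right, on="Game Title", how="outer",
                      suffixes=("_l", "_r"))
print(len(multiplied))

# De-duplicating the keys first keeps the merge one-to-one.
deduped = pd.merge(
    left.drop_duplicates(subset=["Game Title"]),
    right.drop_duplicates(subset=["Game Title"]),
    on="Game Title",
    how="outer",
    suffixes=("_l", "_r"),
    validate="one_to_one",  # raises MergeError if duplicate keys remain
)
print(len(deduped))
```

Removing duplicates after the merge, as done in this notebook, reaches the same set of titles; de-duplicating beforehand simply keeps the intermediate frame smaller.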

### **Code Explanation**

1. **Stripping Whitespace from the 'Game Title' Column**
   ```python
   merged_df['Game Title'] = merged_df['Game Title'].str.strip()
   ```
   - **Purpose**: This line removes any leading or trailing whitespace from the `'Game Title'` column to ensure consistency in the data.

2. **Removing Duplicate 'Game Title' Entries**
   ```python
   merged_df = merged_df.drop_duplicates(subset=['Game Title'])
   ```
   - **Purpose**: This line removes any duplicate rows based on the `'Game Title'` column, keeping only the first occurrence of each unique game title.

3. **Check the Shape of the DataFrame**
   ```python
   merged_df.shape
   ```
   - **Purpose**: This line returns the shape (number of rows and columns) of the `merged_df` DataFrame after the operations above.

---

### **Result**
- The `merged_df` DataFrame will have:
  - No leading or trailing spaces in the `'Game Title'` column.
  - Only unique game titles (no duplicates).
  - The shape of the DataFrame can be checked to see the updated number of rows and columns.

In [8]:
# strip the string
merged_df['Game Title'] = merged_df['Game Title'].str.strip()
# remove duplicates
merged_df = merged_df.drop_duplicates(subset=['Game Title'])
merged_df.shape

(62373, 8)

<h3 style="font-weight: bold;">3.2. Check for invalidity</h3>

Supposedly, the ratings should be within the range of 0-100. The minimum and maximum values printed in the previous section do not violate this constraint, so we can skip that check.

What remains is the lower end of the range: a rating of exactly 0 is treated as invalid here. The cell below therefore inspects the minimum of the `Rating` column (not `Number of Ratings`).

### **Code Explanation**

```python
print(f"Min Rating: {merged_df['Rating'].min()}")
```

- **Purpose**: This line prints the minimum value in the `'Rating'` column of the `merged_df` DataFrame.
  - The `.min()` function is used to find the smallest rating in the dataset.

---

### **Result**
- The output will display the minimum rating value across all the rows in the `merged_df` DataFrame.

In [9]:
print(f"Min Rating: {merged_df['Rating'].min()}")

Min Rating: 0.0


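There are indeed rows with a `Rating` of 0. We will remove these rows. Because a comparison with NaN evaluates to `False`, the same filter also drops rows whose `Rating` is missing.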

### **Code Explanation**

```python
merged_df = merged_df[merged_df['Rating'] > 0]
```
- **Purpose**: This line filters the `merged_df` DataFrame to only include rows where the `'Rating'` value is greater than 0.
  - The condition `merged_df['Rating'] > 0` creates a boolean mask, and only the rows that satisfy this condition are kept in the DataFrame.

```python
merged_df.shape
```
- **Purpose**: This line returns the shape (number of rows and columns) of the filtered `merged_df` DataFrame.
  - After filtering, this will show how many rows remain, and how many columns are present in the DataFrame.

---

### **Result**
- The output will display the new shape of the DataFrame after filtering out any rows where the `'Rating'` is 0 or less.
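
A side effect of this kind of mask is worth seeing on a small example: a missing rating fails the comparison as well. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([0.0, 56.0, np.nan, 89.0], name="Rating")

mask = ratings > 0  # NaN > 0 evaluates to False
kept = ratings[mask]
print(mask.tolist())
print(kept.tolist())

# To drop only the zeros and keep missing ratings for later imputation,
# the mask has to allow NaN explicitly.
kept_with_nan = ratings[(ratings > 0) | ratings.isna()]
print(len(kept_with_nan))
```

Which variant is appropriate depends on whether rows without a rating should survive to the imputation step.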

In [10]:
merged_df = merged_df[merged_df['Rating'] > 0]
merged_df.shape

(61757, 8)

<h3 style="font-weight: bold;">3.3. Fill or remove missing values</h3>

Since we merged 3 datasets together, some rows have values that are missing in all 3 sources, which shows up as **NaN**. We will fill or remove these missing values column by column.

First, we will check the missing values in the dataset.

### **Code Explanation**

1. **Function to Get Missing Values**
   ```python
   def getMissingValues(df: pd.DataFrame) -> pd.Series:
       return df.isnull().sum()
   ```
   - **Purpose**: Defines a function `getMissingValues` that calculates and returns the count of missing values for each column in the input DataFrame (`df`).
     - `df.isnull()` creates a DataFrame where `True` indicates a missing value.
     - `.sum()` counts the `True` values (missing entries) per column.

2. **Getting Missing Values for `merged_df`**
   ```python
   missing_values = getMissingValues(merged_df)
   ```
   - **Purpose**: Calls the `getMissingValues` function on `merged_df` and stores the resulting count of missing values in the `missing_values` variable.

3. **Output the Missing Values**
   ```python
   missing_values
   ```
   - **Purpose**: Displays the count of missing values for each column in `merged_df`.

---

### **Result**
- Displays the count of missing values for each column in `merged_df`.

In [11]:
def getMissingValues(df: pd.DataFrame) -> pd.Series:
    return df.isnull().sum()

missing_values = getMissingValues(merged_df)
missing_values

Game Title               0
Game Genre            1638
Pricing              12545
Developer               11
Release Date             4
Platform             41866
Rating                   0
Number of Ratings        0
dtype: int64

**Developer** and **Release Date** have very few missing values (11 and 4), so we can safely remove those rows from the dataset without significant impact. **Number of Ratings** has none at this point; it is included in the `subset` only as a safeguard.

### **Code Explanation**

```python
merged_df = merged_df.dropna(subset=['Developer', 'Release Date', 'Number of Ratings'])
```
- **Purpose**: This line removes any rows in the `merged_df` DataFrame where one or more of the specified columns (`'Developer'`, `'Release Date'`, `'Number of Ratings'`) contain missing (null) values.
  - `.dropna(subset=[...])`: The `subset` parameter is used to specify which columns should be checked for missing values. If any of the specified columns have a missing value in a row, that row will be dropped from the DataFrame.
  
---

### **Result**
- After this operation, the `merged_df` DataFrame will no longer contain any rows with missing values in the `'Developer'`, `'Release Date'`, or `'Number of Ratings'` columns.

In [12]:
merged_df = merged_df.dropna(subset=['Developer', 'Release Date', 'Number of Ratings'])

**Game Genre** has *1638* missing values according to the check above. We will fill these missing values with the *mode* because it is a categorical feature.

In [13]:
# count every individual genre label across all rows
genres = merged_df['Game Genre'].str.split(',').explode().str.strip("[").str.strip("'").str.strip("'").str.strip("]").dropna().value_counts()

# labels that follow a comma keep a leading space (e.g. " Indie"), so strip it before filling
mode_genre = genres.idxmax().strip()
merged_df.fillna({'Game Genre': mode_genre}, inplace=True)
print(f"Missing values in Game Genre replaced with mode value: {mode_genre}")

Missing values in Game Genre replaced with mode value: Indie
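
When counting labels in a comma-separated column, whitespace matters: a label after a comma carries a leading space and is counted separately from the same label at the start of a list. A small sketch with made-up genres:

```python
import pandas as pd

genres = pd.Series(["Action, Indie", "Indie", "Casual, Indie", None])

# Without stripping, "Indie" and " Indie" are counted as different labels.
raw_counts = genres.str.split(",").explode().dropna().value_counts()

# Stripping whitespace after exploding merges them into one label.
clean_counts = genres.str.split(",").explode().str.strip().dropna().value_counts()

print(raw_counts.to_dict())
print(clean_counts.idxmax())
```

Stripping before counting gives one count per genre, which is what a mode over labels is meant to measure.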


The **Pricing** column is skewed, so we will fill its missing values with the *median*.

But first, some values in `Pricing` contain no number at all (for example "Free" or "Free To Play"). We have to convert the column to a numeric type, mapping these non-numeric values (which most likely mean free) to 0.

### **Code Explanation**

1. **Importing Regular Expression (re) Library**:
   ```python
   import re
   ```
   - **Purpose**: Imports the `re` library to use regular expressions for pattern matching in strings.

2. **Check Columns in `merged_df`**:
   ```python
   merged_df.columns
   ```
   - **Purpose**: Evaluates the column names of `merged_df`. Because this is not the last expression in the cell, nothing is displayed, so the line has no visible effect.

3. **Select the `Pricing` Column**:
   ```python
   pricing_col = merged_df['Pricing']
   ```
   - **Purpose**: Selects the `Pricing` column from `merged_df` for further processing.

4. **Define Regular Expression Pattern**:
   ```python
   numeric_pattern = re.compile(r'(?:\d+)+')
   ```
   - **Purpose**: Defines a pattern that matches one or more digits. The nested quantifier is redundant; the pattern is equivalent to `\d+`.

5. **Identify Non-Numeric Prices**:
   ```python
   non_numeric_prices = pricing_col[~pricing_col.str.contains(numeric_pattern, na=False)].unique()
   ```
   - **Purpose**: Selects the entries of the `Pricing` column that contain no digit (the `~` negates the match) and retrieves their unique values. `na=False` makes missing prices count as "no match", so `nan` also appears in the result.

6. **Display Non-Numeric Prices**:
   ```python
   non_numeric_prices
   ```
   - **Purpose**: Displays the unique non-numeric pricing values found in the `Pricing` column.

---

### **Result**
- The variable `non_numeric_prices` contains unique non-numeric entries in the `Pricing` column, helping identify any discrepancies.

In [14]:
import re

merged_df.columns
pricing_col = merged_df['Pricing']

numeric_pattern = re.compile(r'(?:\d+)+')
non_numeric_prices = pricing_col[~pricing_col.str.contains(numeric_pattern, na=False)].unique()

non_numeric_prices

array([nan, 'Free', 'Free To Play', '', 'Free to Play', 'Free Demo',
       'Free Mod', 'Play for Free!', 'Install Now', 'Play the Demo',
       'Play WARMACHINE: Tactics Demo'], dtype=object)

Now that we know the non-numeric values in the `Pricing` column are mostly *free* games, we will replace them with `0`.

We also need to convert the `Pricing` column to a numeric type, preferably `float`.

### **Code Explanation**

1. **Replace Non-Numeric Prices with 0**:
   ```python
   merged_df['Pricing'] = merged_df['Pricing'].replace(non_numeric_prices, '0')
   ```
   - Replaces non-numeric prices with `'0'` to handle invalid entries.

2. **Remove Dollar Signs and Convert to Numeric**:
   ```python
   merged_df['Pricing'] = merged_df['Pricing'].str.replace(r'\$\s*', '', regex=True).astype(float)
   ```
   - Removes dollar signs and spaces, then converts the `Pricing` column to float for numerical analysis.

3. **Result**:
   - Ensures that all values in the `Pricing` column are numeric and ready for processing.
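
An alternative way to reach the same result, sketched under the assumption that every non-numeric label means "free", is to let `pd.to_numeric` coerce unparseable strings and then decide separately what to do with prices that were missing from the start. The sample values are made up.

```python
import pandas as pd

prices = pd.Series(["$2.99", "Free To Play", "$ 19.99", "", None, "Install Now"])

# Strip the dollar sign, then coerce anything that is not a number to NaN.
numeric = pd.to_numeric(
    prices.str.replace(r"\$\s*", "", regex=True), errors="coerce"
)

# Labelled non-numeric entries become 0.0 (assumed free),
# while prices that were missing in the source stay NaN.
was_missing = prices.isna()
parsed = numeric.fillna(0.0).mask(was_missing)
print(parsed.tolist())
```

Keeping the originally missing prices as NaN leaves the choice of imputation (for example the median) to a later step.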

In [15]:
# convert the non-numeric prices to 0
merged_df['Pricing'] = merged_df['Pricing'].replace(non_numeric_prices, '0')

# convert pricing to numeric
merged_df['Pricing'] = merged_df['Pricing'].str.replace(r'\$\s*', '', regex=True).astype(float)

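Any NaN values that remain in the `Pricing` column are filled with the median value, since the column is skewed and does not follow a normal distribution. Because `non_numeric_prices` itself contains `nan`, the previous replacement may already have turned missing prices into 0, in which case this fill changes nothing.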

### Code Explanation

1. **Calculate the Median of Pricing**:
   ```python
   median_pricing = merged_df['Pricing'].median()
   ```
   - Finds the median value of the `Pricing` column.

2. **Fill Missing Pricing Values with Median**:
   ```python
   merged_df.fillna({'Pricing' : median_pricing}, inplace=True)
   ```
   - Replaces `NaN` values in the `Pricing` column with the calculated median.

**Result**: Missing values in the `Pricing` column are filled with the median price.
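
The reason for preferring the median on a skewed column can be seen on a handful of made-up prices, where a single expensive item pulls the mean far away from the typical value:

```python
from statistics import mean, median

# Hypothetical right-skewed prices: most games are cheap, one is very expensive.
prices = [0.0, 0.99, 4.99, 9.99, 199.99]

price_mean = mean(prices)
price_median = median(prices)
print(f"mean={price_mean:.2f}, median={price_median:.2f}")
```

Filling gaps with the mean would assign a price that almost no game in this sample actually has, while the median stays close to the bulk of the data.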

In [16]:
median_pricing = merged_df['Pricing'].median()
merged_df.fillna({'Pricing' : median_pricing}, inplace=True)

The **Platform** column has the most missing values (*41866* in the check above). We will fill these missing values with the *mode* since the column is categorical.

### Code Explanation

1. **Get Most Common Platform (Mode)**:
   ```python
   platforms = merged_df['Platform'].str.split(',').explode().str.strip().value_counts()
   mode_platform = platforms.idxmax()
   ```
   - Splits the `Platform` column by commas, flattens the values into individual entries, removes extra spaces, and counts the occurrences of each platform.
   - Finds the most common platform (mode) using `idxmax()`.

2. **Fill Missing Platform Values with Mode**:
   ```python
   merged_df.fillna({'Platform': mode_platform}, inplace=True)
   ```

**Result**: Any missing values in the `Platform` column are replaced with the most frequent platform in the dataset.

In [17]:
platforms = merged_df['Platform'].str.split(',').explode().str.strip().value_counts()
mode_platform = platforms.idxmax()
merged_df.fillna({'Platform': mode_platform}, inplace=True)
print(f"Missing values in Platform replaced with mode value: {mode_platform}")

Missing values in Platform replaced with mode value: PC


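The **Rating** column no longer has missing values: the check above reports 0, because the earlier `Rating > 0` filter also discarded rows with a missing rating. The cell below still fills with the *median* (the appropriate choice, since the rating is skewed) as a safeguard.

But first, we make sure the column has a numerical datatype.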

### Code Explanation

1. **Convert `Rating` Column to Float**:
   ```python
   merged_df['Rating'] = merged_df['Rating'].astype(float)
   ```
   - Ensures that the `Rating` column is in a float format for numerical operations.

2. **Calculate the Median Rating**:
   ```python
   median_rating = merged_df['Rating'].median()
   ```

3. **Fill Missing Rating Values with Median**:
   ```python
   merged_df.fillna({'Rating': median_rating}, inplace=True)
   ```

**Result**: Any missing values in the `Rating` column are replaced with the median rating value. The median value used is printed.


In [18]:
merged_df['Rating'] = merged_df['Rating'].astype(float)
median_rating = merged_df['Rating'].median()
merged_df.fillna({'Rating': median_rating}, inplace=True)
print(f"Missing values in Rating replaced with median value: {median_rating}")

Missing values in Rating replaced with median value: 79.0


Now that we have filled/removed missing values, we will check again to make sure there are no missing values left.

In [19]:
missing_values = getMissingValues(merged_df)
missing_values

Game Title           0
Game Genre           0
Pricing              0
Developer            0
Release Date         0
Platform             0
Rating               0
Number of Ratings    0
dtype: int64

We will also check for the appropriate datatypes for each column.

In [20]:
merged_df.dtypes

Game Title            object
Game Genre            object
Pricing              float64
Developer             object
Release Date          object
Platform              object
Rating               float64
Number of Ratings    float64
dtype: object

**Number of Ratings** should be `uint64` instead of `float`.

### Code Explanation

1. **Convert `Number of Ratings` Column to Unsigned Integer**:
   ```python
   merged_df['Number of Ratings'] = merged_df['Number of Ratings'].astype(np.uint64)
   ```
   - This ensures the `Number of Ratings` column is in an unsigned 64-bit integer format (`np.uint64`), which is suitable for large numerical values and prevents negative numbers.
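
The cast relies on the column having no missing, negative or fractional values. A hypothetical helper (`to_uint64` is not part of this notebook) can make those assumptions explicit before converting:

```python
import numpy as np
import pandas as pd


def to_uint64(counts: pd.Series) -> pd.Series:
    """Cast to uint64 only if every value is present, non-negative and whole."""
    if counts.isna().any():
        raise ValueError("missing values cannot be stored as uint64")
    if (counts < 0).any():
        raise ValueError("negative counts cannot be represented as uint64")
    if (counts % 1 != 0).any():
        raise ValueError("fractional counts would be truncated")
    return counts.astype(np.uint64)


converted = to_uint64(pd.Series([66.0, 60.0, 4.0]))
print(converted.dtype)
```

Raising early makes a data problem visible at the cast instead of letting it surface later as an odd value.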

In [21]:
merged_df['Number of Ratings'] = merged_df['Number of Ratings'].astype(np.uint64)

In [22]:
merged_df.dtypes

Game Title            object
Game Genre            object
Pricing              float64
Developer             object
Release Date          object
Platform              object
Rating               float64
Number of Ratings     uint64
dtype: object

<h3 style="font-weight: bold;">3.4. Handle outliers</h3>

The `Number of Ratings` column is suspected to be highly skewed because some game titles have very few ratings while others received a great many. We will limit the effect of the outliers in this column.

Since we want to mitigate the effect of outliers while retaining the data, we will **winsorize (cap)** the values rather than drop rows. The chosen approach is the percentile method: values below the 5th percentile are raised to it, and values above the 99th percentile are lowered to it.

In [23]:
merged_df['Number of Ratings'].describe().round()

count      61742.0
mean        1782.0
std        40793.0
min            3.0
25%           17.0
50%           43.0
75%          186.0
max      8389102.0
Name: Number of Ratings, dtype: float64

### Code Explanation

1. **Determine Lower and Upper Bounds**:
   ```python
   lower_bound = merged_df['Number of Ratings'].quantile(0.05)
   upper_bound = merged_df['Number of Ratings'].quantile(0.99)
   ```
   - The `quantile()` function is used to compute the 5th percentile (`lower_bound`) and the 99th percentile (`upper_bound`) of the `Number of Ratings` column. This helps define the acceptable range for the data, effectively identifying extreme values.

2. **Clip Outliers**:
   ```python
   merged_df['Number of Ratings'] = merged_df['Number of Ratings'].clip(lower=lower_bound, upper=upper_bound).astype(np.uint64)
   ```
   - The `clip()` function is used to limit the values of `Number of Ratings` to be within the `lower_bound` and `upper_bound`. Any values below the 5th percentile are set to the `lower_bound`, and values above the 99th percentile are set to the `upper_bound`. After clipping, the values are converted to `np.uint64` for consistency.

3. **Describe the `Number of Ratings` Column**:
   ```python
   merged_df['Number of Ratings'].describe().round()
   ```
   - The `describe()` function provides summary statistics for the `Number of Ratings` column. The `.round()` method rounds the results for readability.
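
The effect of capping can be checked on a small made-up sample: the number of rows stays the same, the extremes move to the bounds, and the median is untouched.

```python
import pandas as pd

# Hypothetical rating counts with one extreme value.
counts = pd.Series(
    [1, 5, 10, 20, 50, 100, 200, 500, 1000, 1_000_000], dtype="float64"
)

lower = counts.quantile(0.05)
upper = counts.quantile(0.99)
capped = counts.clip(lower=lower, upper=upper)

print(len(capped) == len(counts))                    # no rows are dropped
print(capped.min() >= lower, capped.max() <= upper)  # extremes sit on the bounds
print(capped.median() == counts.median())            # the middle of the data is unchanged
```

This is the same pattern visible in the summary statistics of the real column, where the quartiles are identical before and after clipping.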

In [24]:
lower_bound = merged_df['Number of Ratings'].quantile(0.05)
upper_bound = merged_df['Number of Ratings'].quantile(0.99)

merged_df['Number of Ratings'] = merged_df['Number of Ratings'].clip(lower=lower_bound, upper=upper_bound).astype(np.uint64)

merged_df['Number of Ratings'].describe().round()

count    61742.0
mean       769.0
std       3100.0
min          7.0
25%         17.0
50%         43.0
75%        186.0
max      24635.0
Name: Number of Ratings, dtype: float64

<h2 style="font-weight: bold;">4. Save cleaned data</h2>

Sort the data alphabetically by `Game Title` and save it as a CSV file.

### Code Explanation

1. **Sort the DataFrame**:
   ```python
   merged_df = merged_df.sort_values(by="Game Title")
   ```
   - The `sort_values()` function is used to sort the `merged_df` DataFrame by the `"Game Title"` column in ascending order.

2. **Save the Cleaned Data**:
   ```python
   merged_df.to_csv("cleaned_data.csv", index=False, encoding='utf-8')
   ```
   - The `to_csv()` function saves the cleaned DataFrame to a CSV file named `cleaned_data.csv`. The `index=False` parameter ensures that the index is not included in the saved file, and `encoding='utf-8'` specifies the encoding for the file.

3. **Display the First Few Rows**:
   ```python
   merged_df.head()
   ```
   - The `head()` function displays the first 5 rows of the DataFrame to verify the final cleaned data.
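
A quick round trip can confirm that the saved file reads back as expected. The sketch below uses an in-memory buffer instead of `cleaned_data.csv` and made-up rows. It also shows that sorting strings follows code-point order, so uppercase letters come before lowercase ones and emoji come last.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Game Title": ["b game", "🔴 Circles", "A game"],
    "Number of Ratings": [7, 226, 66],
})

df = df.sort_values(by="Game Title")

buffer = io.StringIO()  # in-memory stand-in for the CSV file
df.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored["Game Title"].tolist())
print(list(restored.columns))
```

Because `index=False` is used, the restored frame has exactly the original columns and no extra index column.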

In [25]:
merged_df = merged_df.sort_values(by="Game Title")
merged_df.to_csv("cleaned_data.csv", index=False, encoding='utf-8')
merged_df.head()

Unnamed: 0,Game Title,Game Genre,Pricing,Developer,Release Date,Platform,Rating,Number of Ratings
0,! That Bastard Is Trying To Steal Our Gold !,"Action, Adventure, Casual, Indie",2.99,WTFOMGames,Mar 1 2016,PC,56.0,66
1,! Wild Russia !,"Action, Adventure, Casual",19.99,Andreev Worlds,Apr 28 2020,PC,61.0,60
2,!4RC4N01D!,Arcade,0.0,armogames,"Jan 12, 2018",PC,40.0,7
3,!4RC4N01D! 2: Retro Edition,Arcade,0.0,armogames,"Feb 6, 2018",PC,38.0,7
5,!4RC4N01D! 3: Cold Space,Arcade,0.0,armogames,"Mar 8, 2018",PC,30.0,7
