# Lecture 2: Preprocessing, Dataset Splitting, and Feature Engineering

## Overview

In this task, you’ll explore the fundamentals of working with recommendation datasets. You’ll preprocess the data and prepare it for collaborative filtering algorithms, as well as perform basic feature engineering for content-based filtering. By the end of this task, you’ll gain a deeper understanding of how interaction matrices are structured, how to effectively split datasets for training and testing in a machine learning pipeline, and how to engineer features from textual data.

Throughout this Jupyter Notebook, any code that requires your input or completion will be clearly marked with a **Task** and a [TODO] tag. Your job is to fill in these sections with the appropriate code, following the instructions and hints provided to guide you through each step.

By the end of this task, you will have:

- Understood the concept of interaction data.
- Created an interaction matrix using a sparse matrix format.
- Implemented functions to preprocess the interaction data.
- Split the interaction data into training, validation, and test sets using a Strong Generalization strategy.
- Applied One Hot Encoding to convert categorical variables.
- Used TF-IDF on game descriptions to generate new “tags”.

Let’s begin by importing the necessary libraries:

In [168]:
import ast
import numpy as np
import pandas as pd
import sklearn
from sklearn.utils.extmath import density
from tqdm import tqdm
from scipy.sparse import csr_matrix, lil_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

## Task 1: Creating an Interaction Matrix

### Objective
Transform the interaction data into a sparse matrix format, commonly used for collaborative filtering algorithms.

### Overview
1. We will first perform some re-indexing on the interactions data.
2. Next, we convert the interactions data into a CSR matrix using SciPy.
3. Finally, we'll calculate some summary statistics for the interaction matrix.

### Background
An interaction matrix is a two-dimensional matrix where rows represent users, columns represent items, and the values indicate the interaction between them (e.g., ratings or clicks).

<img src="images/matrix.jpg" alt="" width="300"/>

A Compressed Sparse Row (CSR) matrix is a memory-efficient way to store sparse matrices, which are matrices with a large number of zero elements. It comprises three main components: data, indices, and indptr. The data array holds the non-zero values in the matrix, while the indices array contains the column indices corresponding to each value in data. The indptr array acts as a pointer that marks the start of each row in the data and indices arrays. The indptr array has a length of n_rows + 1 where n_rows is the number of rows in the matrix. Each element in the indptr array points to the position in the data and indices arrays where the corresponding row starts. Suppose indptr[i] equals 5 and indptr[i+1] equals 7. This means that the non-zero elements of row i are stored in the data array from index 5 to index 6, and their corresponding column indices are also stored in the indices array from index 5 to 6.

<img src="images/csr.png" alt="" width="500"/>

Note: Unlike the image above, the values in our matrix will be binary. In addition to the CSR (Compressed Sparse Row) format, other sparse matrix formats like COO (Coordinate format) and LIL (List of Lists format) also exist, each with its own advantages and disadvantages.
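
To make the three arrays concrete, the sketch below builds a small toy matrix (the values are made up for illustration) and reads one row back using only `data`, `indices`, and `indptr`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 3x4 matrix; the values are illustrative only.
dense = np.array([
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [4, 5, 0, 0],
])
toy_csr = csr_matrix(dense)

print(toy_csr.data)     # non-zero values, stored row by row
print(toy_csr.indices)  # column index of each stored value
print(toy_csr.indptr)   # row i occupies positions indptr[i]:indptr[i+1]

# Recover row 2 by hand from the three arrays.
start, end = toy_csr.indptr[2], toy_csr.indptr[3]
row2_cols = toy_csr.indices[start:end]
row2_vals = toy_csr.data[start:end]
print(row2_cols, row2_vals)
```

Slicing with `indptr` like this is what makes row access cheap in CSR, which is why the format suits per-user operations.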

### Example
Consider a matrix with dimensions 1000x1000 (users x items), totaling 1,000,000 elements. If only 1% of these elements are non-zero, that means 99% of the matrix is filled with zeros.

In a dense storage format, every element is stored explicitly. Assuming each element is a 4-byte floating-point number, storing all 1,000,000 elements would require 4,000,000 bytes (or 4 MB) of memory. This is inefficient, especially since the majority of the elements are zeros.

In contrast, the Compressed Sparse Row (CSR) format stores only the non-zero elements, together with their column indices and a row pointer array. For the 10,000 non-zero elements (1% of the matrix), and assuming 4-byte values and indices, this requires 40,000 bytes for the values, 40,000 bytes for the column indices, and 4,004 bytes for the row pointers, totaling approximately 84,000 bytes (or ~84 KB).

This approach reduces the memory usage from 4 MB in dense storage to just ~84 KB in CSR format, a reduction of roughly 98%.
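
The arithmetic of this example can be reproduced in a few lines. This is a back-of-the-envelope sketch that assumes 4-byte values and 4-byte indices, as in the text; real SciPy matrices may use wider dtypes.

```python
n_rows, n_cols = 1000, 1000
bytes_per_value = 4  # assumed 32-bit values
bytes_per_index = 4  # assumed 32-bit indices

nnz = n_rows * n_cols // 100  # 1% of the cells are non-zero
dense_bytes = n_rows * n_cols * bytes_per_value
csr_bytes = (
    nnz * bytes_per_value               # data
    + nnz * bytes_per_index             # indices
    + (n_rows + 1) * bytes_per_index    # indptr
)
saving = 1 - csr_bytes / dense_bytes

print(f"dense: {dense_bytes} bytes, CSR: {csr_bytes} bytes, saving: {saving:.1%}")
```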

### Code

### Task 1.1. Load the interactions dataset.

In [169]:
train_interactions = pd.read_csv('train_interactions.csv')

Check that the DataFrame loaded correctly by printing out the first few rows:

In [170]:
train_interactions.head()


|   | user_id | item_id | item_name | playtime |
|---|---|---|---|---|
| 0 | 0 | 0 | Counter-Strike | 6 |
| 1 | 0 | 2555 | Day of Defeat | 7 |
| 2 | 0 | 2556 | Day of Defeat: Source | 4733 |
| 3 | 0 | 1043 | Counter-Strike: Source | 1853 |
| 4 | 0 | 5335 | Psychonauts | 333 |


### Task 1.2. Create new user and item IDs.

The user IDs do not follow a contiguous sequence starting from zero, so let's first re-index the item and user IDs. With this mapping, you can easily convert back and forth between the original IDs and the matrix indices. This simplifies tasks like retrieving user-specific or item-specific data, as you can quickly access the relevant row or column in the matrix. Do note that we need to maintain the integrity of this mapping at all times. When making predictions using a collaborative filtering algorithm, the results will be stored in a sparse matrix format of the same dimensions as the input. To interpret these predictions and make actual item recommendations to users, you need to map the matrix rows and columns back to the original user and item IDs.

In [171]:
train_interactions.rename(columns={'user_id': 'old_user_id', 'item_id': 'old_item_id'}, inplace=True)

user_id_mapping = {old_id: new_id for new_id, old_id in enumerate(train_interactions['old_user_id'].unique())}

train_interactions['user_id'] = train_interactions['old_user_id'].map(user_id_mapping)

item_id_mapping = {old_id: new_id for new_id, old_id in enumerate(train_interactions['old_item_id'].unique())}

train_interactions['item_id'] = train_interactions['old_item_id'].map(item_id_mapping)

new_to_old_user_id_mapping = {v: k for k, v in user_id_mapping.items()}
new_to_old_item_id_mapping = {v: k for k, v in item_id_mapping.items()}

#print(new_to_old_user_id_mapping)
#print(new_to_old_item_id_mapping)
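
A quick way to convince yourself that such a mapping is sound is a round trip on toy data. The IDs below are hypothetical, and the plain-Python dictionaries stand in for the pandas-based mappings above.

```python
# Hypothetical raw user IDs, in order of first appearance.
raw_user_ids = [76561, 120, 76561, 9001, 120]

# Assign contiguous indices in order of first appearance.
old_to_new = {}
for raw_id in raw_user_ids:
    old_to_new.setdefault(raw_id, len(old_to_new))
new_to_old = {new: old for old, new in old_to_new.items()}

# Map to matrix rows and back again.
matrix_rows = [old_to_new[r] for r in raw_user_ids]
round_trip = [new_to_old[i] for i in matrix_rows]

print(matrix_rows)
print(round_trip == raw_user_ids)
```

If the round trip ever fails, predictions would be attributed to the wrong users or items.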

### Task 1.3. Create a CSR matrix that represents user-item interactions.

Have a look at the SciPy CSR matrix documentation if you're unsure how to create the matrix: [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html).

In [172]:
num_users = len(train_interactions['user_id'].unique())
num_items = len(train_interactions['item_id'].unique())

In [173]:
rows = train_interactions['user_id'].values
#print(rows)
cols = train_interactions['item_id'].values
#print(cols)
data = train_interactions['playtime'].values
#print(data)

In [174]:
interaction_matrix_csr = csr_matrix((data, (rows, cols)), shape=(num_users, num_items))
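
Note that the Background section says the values will be binary, whereas `data` above holds the raw `playtime`. If a binary matrix is what you want, one option is to store a 1 for every recorded interaction. The sketch below shows both variants on toy arrays; the `toy_` names are stand-ins for the notebook's `rows`, `cols`, and `data`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins for the notebook's rows, cols and playtime arrays.
toy_rows = np.array([0, 0, 1, 2])
toy_cols = np.array([0, 2, 1, 2])
toy_playtime = np.array([6, 4733, 0, 333])

# Weighted variant: stores playtime, including the explicit zero.
weighted = csr_matrix((toy_playtime, (toy_rows, toy_cols)), shape=(3, 3))

# Binary variant: every recorded interaction becomes 1.
ones = np.ones(len(toy_rows), dtype=np.int8)
binary = csr_matrix((ones, (toy_rows, toy_cols)), shape=(3, 3))

print(weighted.count_nonzero(), binary.count_nonzero())
```

The difference matters for the statistics that follow: `count_nonzero()` ignores stored zeros, while `getnnz()` counts every stored entry, so a game owned but never played is treated differently by the two.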

To highlight the difference in memory usage between dense and sparse interaction matrices, let’s convert the sparse matrix into its dense form:

In [175]:
interaction_matrix_dense = interaction_matrix_csr.toarray()

In [176]:
dense_memory_usage_gb = interaction_matrix_dense.nbytes / 1024 ** 3
print(f"Memory usage of the dense matrix: {dense_memory_usage_gb:.3f} GB")

Memory usage of the dense matrix: 3.386 GB


*That's a lot of memory for a matrix that contains mostly zeroes!*

Compare this to the memory usage of a sparse matrix:

In [177]:
sparse_memory_usage = interaction_matrix_csr.data.nbytes + interaction_matrix_csr.indptr.nbytes + interaction_matrix_csr.indices.nbytes
print(f"Memory usage of sparse CSR matrix: {sparse_memory_usage / 1024 ** 3:.3f} GB")

Memory usage of sparse CSR matrix: 0.026 GB


*That's way better! You can delete the dense interaction matrix if you would like to.*

### Task 1.4. Answer the following questions:
1. What's the density of the interaction matrix?
2. How many games does a user own on average?

Print the results!

In [178]:
nonzero_count = interaction_matrix_csr.count_nonzero()

total_cells = interaction_matrix_csr.shape[0] * interaction_matrix_csr.shape[1]
density = nonzero_count / total_cells

In [179]:
print(f'Density: {density:.4f}')

Density: 0.0050


In [180]:
# Count the stored entries in each row (one row per user)
user_interaction_count = np.diff(interaction_matrix_csr.indptr)

average = user_interaction_count.mean()

In [181]:
print(f'Average interactions per user: {average:.4f}')

Average interactions per user: 41.7452


## Task 2: Implementing Dataset Preprocessors

### Objective
Create functions to filter the sparse matrix based on the number of interactions.

### Overview
1. We first implement the MinUsersPerItem function.
2. Next, we implement the MinItemsPerUser function.
3. We apply these functions to the CSR matrix.
4. We use two checks to ensure that all items have at least 20 interactions and all users have at least 5 interactions.
5. Finally, we calculate some summary statistics for the filtered dataset.

### Background
- MinItemsPerUser: Filters out users who have fewer than a specified number of interactions.
- MinUsersPerItem: Filters out items that have been interacted with by fewer than a specified number of users.

There are other filters and preprocessing steps you might consider applying to interaction data. For example, you may want to check for duplicate entries, which could skew the results, or remove users with an unusually high number of interactions, as their extreme behavior might not be representative of typical user activity.
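
As a rough illustration of these two extra filters, the sketch below works on a toy interaction log. The column names mirror the notebook, the values are made up, and `MAX_ITEMS_PER_USER` is a hypothetical cutoff that you would normally choose from the data (for example, a high quantile of the per-user counts).

```python
import pandas as pd

# Toy interaction log; values are illustrative only.
toy = pd.DataFrame({
    "user_id": [0, 0, 0, 0, 1, 1, 2],
    "item_id": [10, 10, 11, 12, 10, 12, 11],
    "playtime": [5, 5, 30, 7, 2, 8, 1],
})

# 1. Drop repeated (user, item) pairs, keeping the first occurrence.
deduped = toy.drop_duplicates(subset=["user_id", "item_id"], keep="first")

# 2. Drop users with more interactions than the chosen cutoff.
MAX_ITEMS_PER_USER = 2  # hypothetical threshold
items_per_user = deduped.groupby("user_id")["item_id"].transform("count")
trimmed = deduped[items_per_user <= MAX_ITEMS_PER_USER]

print(len(toy), len(deduped), len(trimmed))
```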

### Code

### Task 2.1. Implement the MinUsersPerItem function:

In [182]:
def MinUsersPerItem(interaction_matrix_csr, item_mapping, min_users):
    # Filter out items with strictly less than min_users interactions
    item_user_counting = interaction_matrix_csr.getnnz(axis=0)
    item_mask = item_user_counting >= min_users
    # Hint: use the .getnnz() method of the scipy csr matrix.

    # Apply the mask to the matrix to filter items
    filtered_interaction_matrix = interaction_matrix_csr[:, item_mask]

    # Great, because we're removing items (columns) from the matrix, we break our original mapping between columns and "old" item IDs!
    # Update the mapping:

    sorted_items = sorted(item_mapping.items(), key=lambda x: x[1])
    filtered_items = [key for (key, value), item_to_keep in zip(sorted_items, item_mask) if item_to_keep]
    new_item_mapping = {old_id: new_id for new_id, old_id in enumerate(filtered_items)}

    return filtered_interaction_matrix, new_item_mapping

### Task 2.2. Implement the MinItemsPerUser function:

In [183]:
def MinItemsPerUser(interaction_matrix_csr, user_mapping, min_items):
    # Filter out users with strictly less than min_items interactions
    user_item_counts = interaction_matrix_csr.getnnz(axis=1)
    user_mask = user_item_counts >= min_items
    # Hint: use the .getnnz() method of the scipy csr matrix.
    
    # Apply the mask to the matrix to filter users
    filtered_interaction_matrix = interaction_matrix_csr[user_mask]

    # Great, because we're removing users (rows) from the matrix, we break our original mapping between rows and "old" user IDs!
    # Update the mapping:
    sorted_users = sorted(user_mapping.items(), key=lambda x: x[1])
    filtered_users = [key for (key, value), keep_user in zip(sorted_users, user_mask) if keep_user]
    new_user_mapping = {old_id: new_id for new_id, old_id in enumerate(filtered_users)}

    return filtered_interaction_matrix, new_user_mapping

Let's apply the first filter:

In [184]:
# Pass the old-ID -> column-index mapping; the function sorts it by column index to align with the mask
filtered_matrix, updated_item_id_mapping = MinUsersPerItem(interaction_matrix_csr, item_id_mapping, min_users=20)

Check that all items have at least 20 interactions:

In [185]:
assert filtered_matrix.getnnz(axis=0).min() >= 20

Let's apply the second filter:

In [186]:
# Pass the old-ID -> row-index mapping; the function sorts it by row index to align with the mask
filtered_matrix, updated_user_id_mapping = MinItemsPerUser(filtered_matrix, user_id_mapping, min_items=5)

Check that all users have at least 5 interactions:

In [187]:
assert filtered_matrix.getnnz(axis=1).min() >= 5
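
One caveat: removing users in the second step lowers item counts again, so an item that passed the 20-user threshold may no longer meet it afterwards. The item assertion above ran before the user filter, so it does not catch this. A common remedy is to repeat both filters until the shape stops changing. The helper below is a sketch of that idea; `filter_until_stable` is a hypothetical name, and it ignores the ID mappings for brevity.

```python
import numpy as np
from scipy.sparse import csr_matrix

def filter_until_stable(matrix, min_users, min_items):
    """Alternate item and user filtering until no row or column is removed."""
    matrix = matrix.tocsr()
    while True:
        before = matrix.shape
        matrix = matrix[:, matrix.getnnz(axis=0) >= min_users]  # filter items
        matrix = matrix[matrix.getnnz(axis=1) >= min_items]     # filter users
        if matrix.shape == before:
            return matrix

# Toy matrix where a single pass is not enough: dropping the last user
# leaves the last item with only one remaining user.
toy = csr_matrix(np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]))
stable = filter_until_stable(toy, min_users=2, min_items=2)
print(stable.shape)
```

In a full implementation you would update the user and item mappings on every pass, for example by calling the two functions from this task inside the loop.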

### Task 2.3. Answer the following questions:
1. How many users are left after filtering?
2. How many items are left after filtering?

In [188]:
print(f"Users before filtering: {interaction_matrix_csr.shape[0]}")
print(f"Users left after filtering: {len(updated_user_id_mapping)}")

Users before filtering: 54315
Users left after filtering: 46706


In [189]:
print(f"Items before filtering: {interaction_matrix_csr.shape[1]}")
print(f"Items left after filtering: {len(updated_item_id_mapping)}")

Items before filtering: 8368
Items left after filtering: 4256


## Task 3: Creating a Strong Generalization Splitter

### Objective
Split the dataset into training, validation, and test sets using a Strong Generalization strategy.

### Overview
1. We implement a StrongGeneralizationSplitter class with a split() method.
2. The split() method will take a CSR matrix as input and return three datasets (train, validation, and test), together with the user indices assigned to each.
3. Each user in the validation and test sets should have a fold-in and hold-out part based on a percentage of their interactions.
4. We use some checks to ensure that there is no overlap between users across the datasets and that the fold-in and hold-out interactions don't overlap either.

### Background

Strong Generalization Splitter: This strategy splits the data into three parts: a training set, a validation set, and a test set. The validation and test sets are further divided into fold-in (seen interactions) and hold-out (held-out interactions) for each user. This split ensures that there is no overlap between the users in different sets.
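
Before implementing the full class, it can help to see the idea on a tiny example. The sketch below is not the notebook's implementation: it uses plain lists instead of sparse matrices, hypothetical toy data, and a seeded NumPy generator so that the split is reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for a reproducible split

# Hypothetical data: 10 users, each with 5 item IDs.
user_items = {u: list(range(u, u + 5)) for u in range(10)}

# Step 1: split the USERS (not the interactions) into three disjoint groups.
users = rng.permutation(len(user_items))
train_end = int(0.6 * len(users))
val_end = int(0.8 * len(users))
train_users = users[:train_end]
val_users = users[train_end:val_end]
test_users = users[val_end:]

def fold_in_hold_out(items, fold_in_ratio=0.8):
    """Randomly split one user's items into a seen and a held-out part."""
    shuffled = rng.permutation(items)
    cut = int(len(shuffled) * fold_in_ratio)
    return shuffled[:cut].tolist(), shuffled[cut:].tolist()

# Step 2: for validation (and test) users, hide part of each history.
val_splits = {int(u): fold_in_hold_out(user_items[int(u)]) for u in val_users}
print(val_splits)
```

The model is trained only on `train_users`. At evaluation time it sees the fold-in part of an unseen user and is scored on how well it recovers the hold-out part.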

### Code

### Task 3.1. Implement the StrongGeneralizationSplitter class. Make sure you implement all TODOs.

In [190]:
class StrongGeneralizationSplitter:
    def __init__(self, train_ratio=0.6, val_ratio=0.2, test_ratio=0.2, fold_in_ratio=0.8):
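        # Compare with a tolerance: exact float equality can fail for ratios such as 0.7 + 0.2 + 0.1
        assert np.isclose(train_ratio + val_ratio + test_ratio, 1.0), "Train, validation, and test ratios must sum to 1."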
        self.train_ratio = train_ratio
        self.val_ratio = val_ratio
        self.test_ratio = test_ratio
        self.fold_in_ratio = fold_in_ratio
        self.hold_out_ratio = 1 - fold_in_ratio

    def split(self, interaction_matrix_csr):
        num_users, num_items = interaction_matrix_csr.shape
        
        # Shuffle users for random splitting
        users = np.arange(num_users)
        np.random.shuffle(users)
        
        # Split users into training, validation, and test sets based on the specified ratios
        train_end = int(num_users * self.train_ratio)
        val_end = int(num_users * (self.train_ratio + self.val_ratio))
        # Hint: this is where you'd use your train and validation ratios.

        train_users = users[:train_end]
        val_users = users[train_end:val_end]
        test_users = users[val_end:]
        # Hint: this is where you should slice the "users"
        
        train_matrix = interaction_matrix_csr[train_users, :]
        val_matrix = interaction_matrix_csr[val_users, :]
        test_matrix = interaction_matrix_csr[test_users, :]
        
        # Create fold-in and hold-out splits for validation and test sets
        val_fold_in, val_hold_out = self.split_interactions(val_matrix)
        test_fold_in, test_hold_out  = self.split_interactions(test_matrix)
        
        return train_matrix, (val_fold_in, val_hold_out), (test_fold_in, test_hold_out), (train_users, val_users, test_users)

    def split_interactions(self, interaction_matrix_csr):
        # Convert the matrix to lil format for easier manipulation
        lil = interaction_matrix_csr.tolil()
    
        # Initialize the lists to store fold-in and hold-out data
        fold_in_lil = lil_matrix((interaction_matrix_csr.shape[0], interaction_matrix_csr.shape[1]))
        hold_out_lil = lil_matrix((interaction_matrix_csr.shape[0], interaction_matrix_csr.shape[1]))
    
        # Iterate over each row (user)
        for i in tqdm(range(lil.shape[0]), desc="Splitting interactions: "):
            
            # Get the non-zero entries in this row
            row_data = lil.data[i]
            row_indices = lil.rows[i]
            
            # Determine the number of interactions to include in fold-in
            fold_in_size = int(len(row_data) * self.fold_in_ratio)
            
            # Randomly shuffle the indices
            shuffle_indices = np.random.permutation(len(row_data))
            
            # Split the indices into fold-in and hold-out
            fold_in_indices = shuffle_indices[:fold_in_size]
            hold_out_indices = shuffle_indices[fold_in_size:]
            
            # Assign the fold-in data to the fold_in_lil matrix
            fold_in_lil.rows[i] = np.array(row_indices)[fold_in_indices].tolist()
            fold_in_lil.data[i] = np.array(row_data)[fold_in_indices].tolist()
            
            # Assign the hold-out data to the hold_out_lil matrix
            hold_out_lil.rows[i] = np.array(row_indices)[hold_out_indices].tolist()
            hold_out_lil.data[i] = np.array(row_data)[hold_out_indices].tolist()
    
        # Convert the lil matrices back to csr format
        fold_in_matrix = fold_in_lil.tocsr()
        hold_out_matrix = hold_out_lil.tocsr()
    
        return fold_in_matrix, hold_out_matrix

In [191]:
# Instantiate the splitter and split the dataset
splitter = StrongGeneralizationSplitter()
train, (val_fold_in, val_hold_out), (test_fold_in, test_hold_out), (train_users, val_users, test_users) = splitter.split(filtered_matrix)

Splitting interactions: 100%|██████████| 9342/9342 [00:00<00:00, 56027.37it/s]


Check that there is no overlap between users in the train, validation, and test sets:

In [192]:
assert set(train_users).isdisjoint(val_users), "Train and validation sets overlap!"
assert set(train_users).isdisjoint(test_users), "Train and test sets overlap!"
assert set(val_users).isdisjoint(test_users), "Validation and test sets overlap!"

### Task 3.2. Implement the check_interaction_overlap function. 

For each user in the validation and test sets, you should verify that there is no overlap between the interactions in the fold-in and hold-out sets. Create a function that checks whether there is any overlap:

In [193]:
def check_interaction_overlap(fold_in_matrix, hold_out_matrix):
    overlapping = False
    for user in range(fold_in_matrix.shape[0]):
        fold_in_items = set(fold_in_matrix[user].indices)
        hold_out_items = set(hold_out_matrix[user].indices)
        overlap = fold_in_items.intersection(hold_out_items)
        if overlap:
            overlapping = True
    return overlapping
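
The loop above is easy to read but visits every user in Python. As a possible alternative, the same check can be expressed with a single element-wise product, which is non-zero only where both matrices are non-zero. `has_overlap` is a hypothetical name, and unlike the loop it ignores explicitly stored zeros.

```python
import numpy as np
from scipy.sparse import csr_matrix

def has_overlap(fold_in_matrix, hold_out_matrix):
    """Return True if any (user, item) cell is non-zero in both matrices."""
    product = fold_in_matrix.multiply(hold_out_matrix)
    return product.count_nonzero() > 0

# Toy matrices: a and b share no cell, a and c share cell (1, 1).
a = csr_matrix(np.array([[1, 0, 0], [0, 2, 0]]))
b = csr_matrix(np.array([[0, 3, 0], [0, 0, 4]]))
c = csr_matrix(np.array([[0, 0, 0], [0, 5, 0]]))

print(has_overlap(a, b), has_overlap(a, c))
```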

Check that there is no overlap in interactions between fold-in and hold-out sets:

In [194]:
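# Assert on both results; a bare call only displays the value of the last expression in the cell
assert not check_interaction_overlap(val_fold_in, val_hold_out), "Validation fold-in and hold-out overlap!"
assert not check_interaction_overlap(test_fold_in, test_hold_out), "Test fold-in and hold-out overlap!"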

## Task 4: One-Hot Encoding Genres

### Objective
Convert the categorical "genres" attribute into a format that can be used by machine learning algorithms.

### Overview
This task is straightforward: we will convert the categorical ‘genre’ attribute into a one-hot encoded format.

### Background
In previous tasks, you focused on interactions between users and items, such as ratings, clicks or buys, which are central to collaborative filtering techniques. However, for content-based recommendations, the focus shifts to the attributes of the items themselves.

For the next tasks, we will work with the game data which includes user-generated tags, descriptions, reviews, and other metadata. These features help recommendation systems understand what the games are about and how similar they are to one another, enabling personalized recommendations even for new games that haven’t received much user interaction yet.

### Code

### Task 4.1. Load the games dataset.

In [195]:
games = pd.read_csv('games.csv')

Check that the DataFrame loaded correctly by printing out the first few rows:

In [196]:
games.head()

|   | item_id | item_name | publisher | genres | url | tags | sentiment | metascore | specs | price | release_date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Counter-Strike | Valve | `['Action']` | `http://store.steampowered.com/app/10/CounterSt...` | `['Action', 'FPS', 'Multiplayer', 'Shooter', 'C...` | Overwhelmingly Positive | 88.0 | `['Multi-player', 'Valve Anti-Cheat enabled']` | 9.99 | 2000-11-01 |
| 1 | 1 | Rag Doll Kung Fu | Mark Healey | `['Indie']` | `http://store.steampowered.com/app/1002/Rag_Dol...` | `['Indie', 'Fighting', 'Multiplayer']` | Mixed | 69.0 | `['Single-player', 'Multi-player']` | 9.99 | 2005-10-12 |
| 2 | 2 | Silo 2 | Nevercenter Ltd. Co. | `['Animation &amp; Modeling']` | `http://store.steampowered.com/app/100400/Silo_2/` | `['Animation & Modeling', 'Software']` | Mostly Positive |  |  | 99.99 | 2012-12-19 |
| 3 | 3 | Call of Duty: World at War | Activision | `['Action']` | `http://store.steampowered.com/app/10090/Call_o...` | `['Zombies', 'World War II', 'FPS', 'Action', '...` | Very Positive | 83.0 | `['Single-player', 'Multi-player', 'Co-op']` | 19.99 | 2008-11-18 |
| 4 | 4 | 3D-Coat V4.8 | Pilgway | `['Animation &amp; Modeling']` | `http://store.steampowered.com/app/100980/3DCoa...` | `['Animation & Modeling']` | Very Positive |  | `['Steam Cloud']` | 99.99 | 2012-10-02 |


Let's first convert the stringified lists into actual lists, and then join each list into a comma-separated string before one-hot encoding.

In [197]:
# Function to safely evaluate stringified lists
def convert_str_to_list(x):
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError):
        return []

# Convert string representations of lists into actual lists
games['genres'] = games['genres'].apply(convert_str_to_list)

# Convert lists to strings for one-hot encoding
games['genres'] = games['genres'].apply(lambda x: ','.join(x))

### Task 4.2. Use the pd.Series.str.get_dummies function to create one-hot encoded columns for each unique value in the genres column ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html)).

In [198]:
genres_encoded = games['genres'].str.get_dummies(sep=',')
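
To see what `str.get_dummies` does with multi-valued and empty entries, here is a small sketch on a made-up Series (the genre strings are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy genre strings in the same comma-separated format as above.
toy_genres = pd.Series(["Action,Indie", "Indie", "", "RPG,Action"])
encoded = toy_genres.str.get_dummies(sep=",")

print(encoded)
```

Each distinct value becomes a column, a row gets a 1 in every column whose value it contains, and an empty string yields a row of zeros.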

Concatenate the one-hot encoded genres with the original DataFrame:

In [199]:
games = pd.concat([games, genres_encoded], axis=1)

In [200]:
games.head()

|   | item_id | item_name | publisher | genres | url | tags | sentiment | metascore | specs | price | ... | Photo Editing | RPG | Racing | Simulation | Software Training | Sports | Strategy | Utilities | Video Production | Web Publishing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Counter-Strike | Valve | Action | `http://store.steampowered.com/app/10/CounterSt...` | `['Action', 'FPS', 'Multiplayer', 'Shooter', 'C...` | Overwhelmingly Positive | 88.0 | `['Multi-player', 'Valve Anti-Cheat enabled']` | 9.99 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | Rag Doll Kung Fu | Mark Healey | Indie | `http://store.steampowered.com/app/1002/Rag_Dol...` | `['Indie', 'Fighting', 'Multiplayer']` | Mixed | 69.0 | `['Single-player', 'Multi-player']` | 9.99 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | Silo 2 | Nevercenter Ltd. Co. | `Animation &amp; Modeling` | `http://store.steampowered.com/app/100400/Silo_2/` | `['Animation & Modeling', 'Software']` | Mostly Positive |  |  | 99.99 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | Call of Duty: World at War | Activision | Action | `http://store.steampowered.com/app/10090/Call_o...` | `['Zombies', 'World War II', 'FPS', 'Action', '...` | Very Positive | 83.0 | `['Single-player', 'Multi-player', 'Co-op']` | 19.99 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 3D-Coat V4.8 | Pilgway | `Animation &amp; Modeling` | `http://store.steampowered.com/app/100980/3DCoa...` | `['Animation & Modeling']` | Very Positive |  | `['Steam Cloud']` | 99.99 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## Task 5: Applying TF-IDF to Game Descriptions

### Objective
Transform the text data from game descriptions into numerical vectors that represent the importance of words in each description.

### Overview
Create the TfidfVectorizer with the following specifications:
1. Set the maximum number of features to 50.
2. Remove common English stop words.
3. Consider both unigrams (single words) and bigrams (two-word combinations).
4. Ignore terms that appear in more than 80% of documents.
5. Ignore terms that appear in fewer than 2 documents.

### Background
TF-IDF (Term Frequency-Inverse Document Frequency) is a technique used to convert textual data into numerical representations, highlighting the importance of words within a document relative to a collection of documents (corpus).

**Term Frequency (TF)** measures how often a word appears in a document (description in our case):
$$
\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

**Inverse Document Frequency (IDF)** reduces the weight of common words by measuring how rare a word is across documents:
$$
\text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t} + 1\right)
$$

**TF-IDF Score** combines TF and IDF to highlight important terms in a document:
$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$
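
The three formulas can be checked by hand on a tiny corpus. The sketch below implements them exactly as written above, on hypothetical pre-tokenized "descriptions". Keep in mind that scikit-learn's `TfidfVectorizer` by default uses raw counts for TF, a smoothed IDF of the form $\ln\frac{1 + N}{1 + \text{df}(t)} + 1$, and L2-normalizes each row, so its numbers will not match these.

```python
import math

# Three tiny hypothetical "descriptions", already tokenized.
docs = [
    ["open", "world", "story"],
    ["story", "story", "combat"],
    ["combat", "world", "combat", "online"],
]

def tf(term, doc):
    """Share of the document's tokens that equal `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(N / df + 1), following the formula in the text."""
    n_containing = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / n_containing + 1)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

score_story = tf_idf("story", docs[1], docs)    # frequent in the doc, found in 2 of 3 docs
score_online = tf_idf("online", docs[2], docs)  # appears once, found in 1 of 3 docs

print(f"story: {score_story:.4f}, online: {score_online:.4f}")
```

Here 'online' has the higher IDF because it is rarer across the corpus, but 'story' still scores higher in its document because it makes up two thirds of the tokens.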

While TF-IDF was once a dominant method for text analysis and information retrieval, it has now been surpassed by more complex approaches like word embeddings (e.g., Word2Vec, GloVe) and transformer-based models (e.g., BERT), which capture richer contextual relationships between words. For this assignment, TF-IDF serves as a fundamental example to introduce you to text vectorization, but you are encouraged to explore more advanced methods in your own project, like word embeddings and transformer-based models.

### Task 5.1. Load the extended games dataset.

In [201]:
extended_games = pd.read_csv('extended_games.csv')

In [202]:
extended_games.head()

|   | item_id | item_name | release_date | required_age | price | dlc_count | detailed_description | about_the_game | short_description | reviews | ... | tags_Shop Keeper | tags_Coding | tags_Football (Soccer) | tags_Hobby Sim | tags_Tile-Matching | tags_Mahjong | tags_Birds | tags_Football (American) | tags_Fox | tags_Extraction Shooter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Counter-Strike | Nov 1, 2000 | 0 | 9.99 | 0 | Play the world's number 1 online action game. ... | Play the world's number 1 online action game. ... | Play the world's number 1 online action game. ... |  | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 1 | Rag Doll Kung Fu | Oct 12, 2005 | 0 | 0.99 | 0 | A piece of Steam history - THE FIRST EVER NON ... | A piece of Steam history - THE FIRST EVER NON ... | A piece of Steam history - THE FIRST EVER NON ... |  | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 2 | Silo 2 | Dec 19, 2012 | 0 | 59.99 | 0 | Note: This is a legacy version of Silo which i... | Note: This is a legacy version of Silo which i... | Silo is a focused, lightning-fast standalone 3... |  | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 3 | Call of Duty: World at War | Nov 18, 2008 | 17 | 19.99 | 0 | Call of Duty is back, redefining war like you'... | Call of Duty is back, redefining war like you'... | Call of Duty is back, redefining war like you'... |  | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 13 | Runespell: Overture | Jul 20, 2011 | 0 | 9.99 | 0 | Runespell: Overture is a role-playing game com... | Runespell: Overture is a role-playing game com... | This role-playing game, set in an alternate me... |  | ... |  |  |  |  |  |  |  |  |  |  |


Fill NaN values in 'detailed_description' with an empty string: 

In [203]:
extended_games['detailed_description'] = extended_games['detailed_description'].fillna('')

### Task 5.2. Create the TfidfVectorizer with the following specifications:

1. Set the maximum number of features to 50.
2. Remove common English stop words.
3. Consider both unigrams (single words) and bigrams (two-word combinations).
4. Ignore terms that appear in more than 80% of documents.
5. Ignore terms that appear in fewer than 2 documents.

*Hint:* Have a look at the scikit-learn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [204]:
tfidf_vectorizer = TfidfVectorizer(max_features=50, stop_words='english', ngram_range=(1, 2), max_df=0.8, min_df=2)

Fit and transform the descriptions into TF-IDF features:

In [205]:
tfidf_matrix = tfidf_vectorizer.fit_transform(extended_games['detailed_description'])

Inspect the shape of `tfidf_matrix`. Does it match your expectations based on max_features?

In [206]:
print("Shape of TF-IDF matrix:", tfidf_matrix.shape)

Shape of TF-IDF matrix: (6753, 50)


Convert the TF-IDF matrix into a DataFrame:

In [207]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

Print the columns and first few rows of `tfidf_df` to inspect the TF-IDF values. Do the results make sense?

In [208]:
tfidf_df.columns

Index(['action', 'adventure', 'amp', 'based', 'battle', 'build', 'character',
       'characters', 'combat', 'control', 'create', 'different', 'enemies',
       'experience', 'explore', 'features', 'fight', 'friends', 'game',
       'gameplay', 'games', 'help', 'just', 'key', 'level', 'levels', 'life',
       'like', 'make', 'mode', 'multiplayer', 'new', 'online', 'play',
       'player', 'players', 'power', 'real', 'set', 'skills', 'space', 'steam',
       'story', 'support', 'time', 'unique', 'use', 'way', 'weapons', 'world'],
      dtype='object')

In [209]:
tfidf_df.head()

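|   | action | adventure | amp | based | battle | build | character | characters | combat | control | ... | space | steam | story | support | time | unique | use | way | weapons | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.415054 | 0.0 | 0.0 | 0.427071 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.312139 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.670566 | 0.0 | 0.0 | 0.0 | 0.248035 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.348481 | 0.0 | 0.0 | 0.0 | 0.125504 | 0.0 | 0.125204 | ... | 0.0 | 0.0 | 0.0 | 0.131651 | 0.092684 | 0.0 | 0.108253 | 0.0 | 0.0 | 0.0 |
| 3 | 0.203604 | 0.0 | 0.0 | 0.209499 | 0.0 | 0.0 | 0.082604 | 0.0 | 0.392159 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.05572 | 0.057507 | 0.0 | 0.0 | 0.146174 | 0.255199 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.340807 | 0.0 | 0.0 | 0.350224 | 0.0 | 0.0 | ... | 0.0 | 0.180416 | 0.153893 | 0.0 | 0.0 | 0.133468 | 0.0 | 0.0 | 0.0 | 0.118458 |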


Clearly, there are some issues with the extracted tags, such as words that convey the same meaning (e.g., ‘character’ and ‘characters’). In practice, you would address this by applying techniques like stemming and lemmatization, as well as removing frequent, uninformative words (e.g., ‘based’, ‘use’, ‘way’). The term ‘amp’ is most likely residue from HTML entities such as `&amp;` in the raw descriptions, which suggests unescaping the text before vectorizing. However, even with this basic technique, we could identify interesting tags: ‘story’ might indicate a game focused on storytelling, ‘character’ on character development, and ‘world’ could suggest an open-world game. This was just a simple example of how to extract insights from textual descriptions. Remember, more advanced methods are available, and there is additional text to analyze, such as user reviews for the games.

Concatenate the original DataFrame with the TF-IDF DataFrame:

In [210]:
extended_games_with_tfidf = pd.concat([extended_games, tfidf_df], axis=1)

In [211]:
extended_games_with_tfidf.head()

|   | item_id | item_name | release_date | required_age | price | dlc_count | detailed_description | about_the_game | short_description | reviews | ... | space | steam | story | support | time | unique | use | way | weapons | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Counter-Strike | Nov 1, 2000 | 0 | 9.99 | 0 | Play the world's number 1 online action game. ... | Play the world's number 1 online action game. ... | Play the world's number 1 online action game. ... |  | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.312139 |
| 1 | 1 | Rag Doll Kung Fu | Oct 12, 2005 | 0 | 0.99 | 0 | A piece of Steam history - THE FIRST EVER NON ... | A piece of Steam history - THE FIRST EVER NON ... | A piece of Steam history - THE FIRST EVER NON ... |  | ... | 0.0 | 0.670566 | 0.0 | 0.0 | 0.0 | 0.248035 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2 | Silo 2 | Dec 19, 2012 | 0 | 59.99 | 0 | Note: This is a legacy version of Silo which i... | Note: This is a legacy version of Silo which i... | Silo is a focused, lightning-fast standalone 3... |  | ... | 0.0 | 0.0 | 0.0 | 0.131651 | 0.092684 | 0.0 | 0.108253 | 0.0 | 0.0 | 0.0 |
| 3 | 3 | Call of Duty: World at War | Nov 18, 2008 | 17 | 19.99 | 0 | Call of Duty is back, redefining war like you'... | Call of Duty is back, redefining war like you'... | Call of Duty is back, redefining war like you'... |  | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.05572 | 0.057507 | 0.0 | 0.0 | 0.146174 | 0.255199 |
| 4 | 13 | Runespell: Overture | Jul 20, 2011 | 0 | 9.99 | 0 | Runespell: Overture is a role-playing game com... | Runespell: Overture is a role-playing game com... | This role-playing game, set in an alternate me... |  | ... | 0.0 | 0.180416 | 0.153893 | 0.0 | 0.0 | 0.133468 | 0.0 | 0.0 | 0.0 | 0.118458 |


### Congratulations! You’ve completed the assignment.