# Feature Engineering
* **Author**: Winnie Zhang
* BrainStation, Data Science
* Previous notebook: 1. Cleaning and Preprocessing

## Introduction

In the previous notebook, I did the initial cleaning of the dataset.

In this notebook, I will do the feature engineering.

First, I will load all the packages and data that I need.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import joblib

In [2]:
reviews = joblib.load("data/reviews_clean_final.pkl")
games = joblib.load("data/games_wrangled.pkl")

I will do a quick sanity check on the data that I loaded.

In [3]:
reviews.head()

Unnamed: 0,index,user,rating,comment,ID,name
1,1,avlawn,10.0,I tend to either love or easily tire of co-op ...,30549,Pandemic
2,2,Mease19,10.0,This is an amazing co-op game. I play mostly ...,30549,Pandemic
3,3,cfarrell,10.0,Hey! I can finally rate this game I've been pl...,30549,Pandemic
4,4,gregd,10.0,Love it- great fun with my son. 2 plays so far...,30549,Pandemic
5,5,calbearfan,10.0,"Fun, fun game. Strategy is required, but defin...",30549,Pandemic


In [4]:
games.head()

Unnamed: 0,id,primary,description,yearpublished,minplayers,maxplayers,minplaytime,maxplaytime,minage,boardgameexpansion,...,Bluffing,Humor,Adventure,Deduction,Miniatures,Action / Dexterity,Movies / TV / Radio theme,Medieval,Players: Two Player Only Games,Crowdfunding: Kickstarter
0,30549,Pandemic,"In Pandemic, several virulent diseases have br...",2008,2,4,45,45,8,1,...,0,0,0,0,0,0,0,0,0,0
1,822,Carcassonne,Carcassonne is a tile-placement game in which ...,2000,2,5,30,45,7,1,...,0,0,0,0,0,0,0,1,0,0
2,13,Catan,"In CATAN (formerly The Settlers of Catan), pla...",1995,3,4,60,120,10,1,...,0,0,0,0,0,0,0,0,0,0
3,68448,7 Wonders,You are the leader of one of the 7 great citie...,2010,2,7,30,30,10,1,...,0,0,0,0,0,0,0,0,0,0
4,36218,Dominion,"&quot;You are a monarch, like your parents bef...",2008,2,4,30,30,13,1,...,0,0,0,0,0,0,0,1,0,0


## Functions for Feature Engineering

Two of the features I want are the length of each user review and the length of each game description. Therefore, I will write a function that returns the text length and can be applied to both `reviews` and `games`.

In [5]:
def text_length(text):
    """
    This function takes a string and returns the number of words the string contains.
    """
    num_words = len(text.split(" "))
    return num_words 
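
One caveat about splitting on a single space: consecutive spaces produce empty strings that are counted as words, and an empty string still counts as one word. The sketch below compares that behaviour with `str.split()` called without arguments. `text_length_whitespace` is a hypothetical name used only for illustration; the lengths reported in this notebook come from the original `text_length`.

```python
def text_length(text):
    """Word count as defined in this notebook: split on single spaces."""
    return len(text.split(" "))


def text_length_whitespace(text):
    """Hypothetical variant: split on any run of whitespace."""
    return len(text.split())


sample = "Great  game, would play again"  # note the double space

print(text_length(sample))             # 6, the double space adds an empty "word"
print(text_length_whitespace(sample))  # 5
print(text_length(""))                 # 1
print(text_length_whitespace(""))      # 0
```

For text with irregular spacing the two definitions can differ, so it may be worth deciding which one the feature should use before modelling.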

## Feature Engineering 

### `Reviews` DataFrame

First, I will engineer a few features for the `reviews` dataframe. I want to know the average rating a user gives and the number of reviews they gave, as this may have an impact on the score they give a game. 

In [6]:
# aggregate based on count of number of reviews and mean of rating for each user 
users = reviews.groupby("user").agg(number_of_reviews_by_user=("comment", "size"),
                                          avg_rating=("rating", "mean")).reset_index()

# sanity check
users.head()

Unnamed: 0,user,number_of_reviews_by_user,avg_rating
0,Fu_Koios,2,9.0
1,-DE-,1,10.0
2,-Johnny-,170,5.958824
3,-LucaS-,23,7.717391
4,-Mal-,2,5.0
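
To make the aggregation easier to follow, here is a minimal sketch of the same named-aggregation pattern on a toy frame. The user names, ratings, and comments are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_reviews = pd.DataFrame({
    "user": ["ann", "ann", "bob"],
    "rating": [8.0, 6.0, 9.0],
    "comment": ["fun", "too long", "a classic"],
})

# One output column per keyword: (source column, aggregation function)
toy_users = (
    toy_reviews.groupby("user")
    .agg(
        number_of_reviews_by_user=("comment", "size"),
        avg_rating=("rating", "mean"),
    )
    .reset_index()
)

print(toy_users)
```

Here `"size"` counts every row in the group, including rows where the comment is missing, whereas `"count"` would skip missing comments.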


Now, I will merge these per-user aggregations back into `reviews` to create `reviews_df`.

In [7]:
reviews_df = pd.merge(reviews, users, left_on="user", right_on="user")

# sanity check
reviews_df.head()

Unnamed: 0,index,user,rating,comment,ID,name,number_of_reviews_by_user,avg_rating
0,1,avlawn,10.0,I tend to either love or easily tire of co-op ...,30549,Pandemic,136,6.147059
1,558,avlawn,10.0,hurm. the gameplay changes between this and V...,40692,Small World,136,6.147059
2,5221,avlawn,10.0,"brilliant, but very skill-dependent. And I'm b...",2655,Hive,136,6.147059
3,25958,avlawn,10.0,"Still great, but i've come to vastly prefer pl...",3076,Puerto Rico,136,6.147059
4,314082,avlawn,7.0,Not so much a deckbuilding game as a Fantasy C...,96848,Mage Knight Board Game,136,6.147059
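
A merge like this should keep exactly one row per review. As a hedged sketch (toy data, not the real tables), the `validate` argument of `pd.merge` can enforce that each user appears only once on the right-hand side, and a row-count check confirms nothing was duplicated or dropped.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_reviews = pd.DataFrame({
    "user": ["ann", "ann", "bob"],
    "rating": [8.0, 6.0, 9.0],
})
toy_users = pd.DataFrame({
    "user": ["ann", "bob"],
    "avg_rating": [7.0, 9.0],
})

# validate="many_to_one" raises a MergeError if a user is duplicated in toy_users
merged = pd.merge(toy_reviews, toy_users, on="user", how="left", validate="many_to_one")

# The merge should not change the number of reviews
assert len(merged) == len(toy_reviews)
print(merged)
```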


Next, I will get the length of the review that the user left.

In [8]:
length_of_text = lambda x: text_length(x)
reviews_df["comment_length"] = reviews_df["comment"].apply(length_of_text)

In [9]:
# sanity check 
reviews_df.head()

Unnamed: 0,index,user,rating,comment,ID,name,number_of_reviews_by_user,avg_rating,comment_length
0,1,avlawn,10.0,I tend to either love or easily tire of co-op ...,30549,Pandemic,136,6.147059,76
1,558,avlawn,10.0,hurm. the gameplay changes between this and V...,40692,Small World,136,6.147059,133
2,5221,avlawn,10.0,"brilliant, but very skill-dependent. And I'm b...",2655,Hive,136,6.147059,9
3,25958,avlawn,10.0,"Still great, but i've come to vastly prefer pl...",3076,Puerto Rico,136,6.147059,17
4,314082,avlawn,7.0,Not so much a deckbuilding game as a Fantasy C...,96848,Mage Knight Board Game,136,6.147059,88
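
On a frame with millions of rows, `apply` with a Python function can be slow. One possible alternative, shown here as a sketch on made-up comments, is the vectorised `.str` accessor, which gives the same counts as splitting on a single space.

```python
import pandas as pd

# Toy comments, invented for illustration only
toy_comments = pd.Series(["Love it", "Fun, fun game", "ok"])

# Row-by-row version, equivalent to text_length in this notebook
lengths_apply = toy_comments.apply(lambda text: len(text.split(" ")))

# Vectorised version: split each string on a space, then take the list length
lengths_str = toy_comments.str.split(" ").str.len()

print(lengths_str.tolist())  # [2, 3, 1]
```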


### `Games` DataFrame

Next, I will feature engineer `games` by getting the length of the description.

In [None]:
games.isna().sum()

There is 1 null value in the `description` column. I will replace this null value with "N/A".

In [10]:
games[games["description"].isna()]

Unnamed: 0,id,primary,description,yearpublished,minplayers,maxplayers,minplaytime,maxplaytime,minage,boardgameexpansion,...,Bluffing,Humor,Adventure,Deduction,Miniatures,Action / Dexterity,Movies / TV / Radio theme,Medieval,Players: Two Player Only Games,Crowdfunding: Kickstarter
15338,170984,Timeline: Sports et Loisirs,,2014,2,8,15,15,8,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
# assign the result back instead of using inplace=True on a selected column
games["description"] = games["description"].fillna("N/A")

In [12]:
# check that it worked
games[games["description"].isna()]

Unnamed: 0,id,primary,description,yearpublished,minplayers,maxplayers,minplaytime,maxplaytime,minage,boardgameexpansion,...,Bluffing,Humor,Adventure,Deduction,Miniatures,Action / Dexterity,Movies / TV / Radio theme,Medieval,Players: Two Player Only Games,Crowdfunding: Kickstarter
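
One side effect of the placeholder is worth noting: a description filled with "N/A" gets a length of 1 rather than 0. The sketch below shows this on a toy frame; the game names and description are made up for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_games = pd.DataFrame({
    "primary": ["Game A", "Game B"],
    "description": ["A short tile-placement game", None],
})

# Fill the missing description by assigning the result back to the column
toy_games["description"] = toy_games["description"].fillna("N/A")

# Same word-count definition as text_length: split on single spaces
toy_games["description_length"] = toy_games["description"].apply(
    lambda text: len(text.split(" "))
)

print(toy_games["description"].isna().sum())      # 0
print(toy_games["description_length"].tolist())   # [4, 1]
```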


There are no null values left in the `description` column, so I can get the length of the description:

In [13]:
games["description_length"] = games["description"].apply(length_of_text) 

In [14]:
# sanity check 
games.head()

Unnamed: 0,id,primary,description,yearpublished,minplayers,maxplayers,minplaytime,maxplaytime,minage,boardgameexpansion,...,Humor,Adventure,Deduction,Miniatures,Action / Dexterity,Movies / TV / Radio theme,Medieval,Players: Two Player Only Games,Crowdfunding: Kickstarter,description_length
0,30549,Pandemic,"In Pandemic, several virulent diseases have br...",2008,2,4,45,45,8,1,...,0,0,0,0,0,0,0,0,0,245
1,822,Carcassonne,Carcassonne is a tile-placement game in which ...,2000,2,5,30,45,7,1,...,0,0,0,0,0,0,1,0,0,209
2,13,Catan,"In CATAN (formerly The Settlers of Catan), pla...",1995,3,4,60,120,10,1,...,0,0,0,0,0,0,0,0,0,481
3,68448,7 Wonders,You are the leader of one of the 7 great citie...,2010,2,7,30,30,10,1,...,0,0,0,0,0,0,0,0,0,252
4,36218,Dominion,"&quot;You are a monarch, like your parents bef...",2008,2,4,30,30,13,1,...,0,0,0,0,0,0,1,0,0,285


Now, I will merge the `games` and `reviews_df` together, based on the game id.

In [1]:
df = pd.merge(games, reviews_df, left_on="id", right_on="ID")

# sanity check
df.head()


In [16]:
# sanity check
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3036278 entries, 0 to 3036277
Data columns (total 53 columns):
 #   Column                          Dtype  
---  ------                          -----  
 0   id                              int64  
 1   primary                         object 
 2   description                     object 
 3   yearpublished                   int64  
 4   minplayers                      int64  
 5   maxplayers                      int64  
 6   minplaytime                     int64  
 7   maxplaytime                     int64  
 8   minage                          int64  
 9   boardgameexpansion              int32  
 10  boardgameimplementation         int32  
 11  usersrated                      int64  
 12  average                         float64
 13  Board Game Rank                 object 
 14  owned                           int64  
 15  trading                         int64  
 16  wanting                         int64  
 17  wishing                    

The number of columns is 53, which matches the combined number of columns in `games` and `reviews_df`.
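
The column arithmetic works because the two join keys have different names (`id` and `ID`), so both are kept and no columns are collapsed or suffixed. A minimal sketch of that check on toy frames (the data are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only
toy_games = pd.DataFrame({"id": [1, 2], "primary": ["Game A", "Game B"]})
toy_reviews = pd.DataFrame({
    "ID": [1, 1, 2],
    "name": ["Game A", "Game A", "Game B"],
    "rating": [8.0, 6.0, 9.0],
})

toy_df = pd.merge(toy_games, toy_reviews, left_on="id", right_on="ID")

# With no shared column names, the merged width is the sum of both widths
expected_columns = toy_games.shape[1] + toy_reviews.shape[1]
print(toy_df.shape[1] == expected_columns)  # True
```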

I will check that the `primary` and `name` columns are the same to ensure the two tables were merged correctly.

In [17]:
(df["primary"] == df["name"]).value_counts()

True     3035998
False        280
dtype: int64

There are 280 rows where the two names aren't the same. I will take a closer look at them.

In [18]:
# keep only the rows where the two name columns disagree
df_copy = df[df["primary"] != df["name"]].copy()

In [19]:
df_copy["primary"].value_counts()

Cluedo Super Sleuth                                                              59
Star Trek Red Alert                                                              38
The StoryMaster\'s Tales "Weirding Woods" Hybrid RPG                             35
Der schwarze Pirat: Das Duell                                                    31
The Walking Dead "Don\'t Look Back" Dice Game                                    29
"Oh My God! There\'s An Axe In My Head." The Game of International Diplomacy     24
Admiral Ackbar "It\'s a TRAP!" GAME                                              18
Cluedo  Passport to Murder                                                       12
Cluedo Chocolate Edition                                                         12
Cartaventura: Lhassa                                                              7
Adventure Games: Die Akte Gloom City                                              6
EXIT: Das Spiel – Adventskalender: Die Jagd nach dem goldenen Buch          

In [20]:
df_copy["name"].value_counts()

Cluedo: Super Sleuth                                                            59
Star Trek Red Alert!                                                            38
The StoryMaster's Tales "Weirding Woods" Hybrid RPG                             35
Pirates Blast                                                                   31
The Walking Dead "Don't Look Back" Dice Game                                    29
"Oh My God! There's An Axe In My Head." The Game of International Diplomacy     24
Admiral Ackbar "It's a TRAP!" GAME                                              18
Cluedo: Passport to Murder                                                      12
Cluedo: Chocolate Edition                                                       12
Cartaventura: Lhasa                                                              7
Adventure Games: The Gloom City File                                             6
Exit: The Game – Advent Calendar: The Hunt for the Golden Book                   4
Exit

The 280 rows that don't match are the same games with slightly different names (punctuation, escaped apostrophes, or a translated title). The counts also align between the two columns. That means the two dataframes were merged successfully, and I can drop the `name`, `primary`, `id`, and `ID` columns.
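
Comparing the two `value_counts` outputs by eye works for a short list. As a complementary sketch, one could also check programmatically that every mismatched game id maps to exactly one `primary` title and one `name` title. The titles below are copied from the output above, but the ids are made up for illustration.

```python
import pandas as pd

# Toy rows; the ids are hypothetical
toy_df = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "primary": ["Cluedo Super Sleuth", "Cluedo Super Sleuth", "Catan", "Star Trek Red Alert"],
    "name": ["Cluedo: Super Sleuth", "Cluedo: Super Sleuth", "Catan", "Star Trek Red Alert!"],
})

# Rows where the two title columns disagree
mismatched = toy_df[toy_df["primary"] != toy_df["name"]]

# For each game id, count the distinct titles seen in each column
titles_per_game = mismatched.groupby("id")[["primary", "name"]].nunique()

# True when every mismatched id has a single title in both columns
consistent = bool((titles_per_game == 1).all().all())
print(consistent)  # True
```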

In [21]:
df_clean = df.drop(["name", "primary", "id", "ID"], axis=1)

In [22]:
# sanity check
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3036278 entries, 0 to 3036277
Data columns (total 49 columns):
 #   Column                          Dtype  
---  ------                          -----  
 0   description                     object 
 1   yearpublished                   int64  
 2   minplayers                      int64  
 3   maxplayers                      int64  
 4   minplaytime                     int64  
 5   maxplaytime                     int64  
 6   minage                          int64  
 7   boardgameexpansion              int32  
 8   boardgameimplementation         int32  
 9   usersrated                      int64  
 10  average                         float64
 11  Board Game Rank                 object 
 12  owned                           int64  
 13  trading                         int64  
 14  wanting                         int64  
 15  wishing                         int64  
 16  numcomments                     int64  
 17  numweights                 

There are now 49 columns remaining, so the 4 columns were dropped successfully. I also want to drop the `index` column, as it doesn't provide any value.

In [23]:
df_clean.drop("index", axis=1, inplace=True)

In [24]:
# sanity check
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3036278 entries, 0 to 3036277
Data columns (total 48 columns):
 #   Column                          Dtype  
---  ------                          -----  
 0   description                     object 
 1   yearpublished                   int64  
 2   minplayers                      int64  
 3   maxplayers                      int64  
 4   minplaytime                     int64  
 5   maxplaytime                     int64  
 6   minage                          int64  
 7   boardgameexpansion              int32  
 8   boardgameimplementation         int32  
 9   usersrated                      int64  
 10  average                         float64
 11  Board Game Rank                 object 
 12  owned                           int64  
 13  trading                         int64  
 14  wanting                         int64  
 15  wishing                         int64  
 16  numcomments                     int64  
 17  numweights                 

The column `Board Game Rank` has the `object` dtype. I will convert it to a numeric type.

In [27]:
df_clean["Board Game Rank"] = pd.to_numeric(df_clean["Board Game Rank"])
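
`pd.to_numeric` raises an error by default if any value cannot be parsed. The conversion above produced `int64`, so every rank here was numeric. For data where that is not guaranteed, a hedged sketch of a more defensive version follows; the value "Not Ranked" is a hypothetical example and does not come from this dataset.

```python
import pandas as pd

# Toy ranks stored as strings; "Not Ranked" is a hypothetical non-numeric value
toy_rank = pd.Series(["1", "25", "Not Ranked"])

# errors="coerce" turns unparseable values into NaN instead of raising
numeric_rank = pd.to_numeric(toy_rank, errors="coerce")
print(numeric_rank.isna().sum())  # 1

# The nullable integer dtype keeps whole numbers while allowing missing values
nullable_rank = numeric_rank.astype("Int64")
print(nullable_rank.dtype)  # Int64
```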

In [28]:
# sanity check
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3036278 entries, 0 to 3036277
Data columns (total 48 columns):
 #   Column                          Dtype  
---  ------                          -----  
 0   description                     object 
 1   yearpublished                   int64  
 2   minplayers                      int64  
 3   maxplayers                      int64  
 4   minplaytime                     int64  
 5   maxplaytime                     int64  
 6   minage                          int64  
 7   boardgameexpansion              int32  
 8   boardgameimplementation         int32  
 9   usersrated                      int64  
 10  average                         float64
 11  Board Game Rank                 int64  
 12  owned                           int64  
 13  trading                         int64  
 14  wanting                         int64  
 15  wishing                         int64  
 16  numcomments                     int64  
 17  numweights                 

I have engineered all the features, and now I will save this dataframe.

In [1]:
# joblib.dump(df_clean, "data/data_clean_ver2.pkl")

## Conclusion
In this notebook, we engineered a few features for both the reviews and games datasets and merged them into a single dataframe. We then saved the new dataframe.

**Next Notebook**: 3. EDA 
- We explore the data and see what trends exist.