# Transforming Data into Features

You are a data scientist at a clothing company working with a dataset of customer reviews. The dataset, originally from Kaggle, has a lot of potential for machine learning. Your task is to transform some of its features to make the data more useful for analysis.


## Basic Exploration

1. Let's start with some basic exploration.

   First, import your dataset. It is stored in a file named `reviews.csv`. Save it to a variable called `reviews`.


In [2]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# import data
reviews = pd.read_csv("../data/reviews.csv")


2. Next, we want to look at the column names of our dataset along with their data types. Do the following two steps:

   - Print the column names of your dataset.
   - Check your features' data types by calling `.info()`.


In [3]:
# print column names
print(reviews.columns)
print(reviews.info())

Index(['clothing_id', 'age', 'review_title', 'review_text', 'recommended',
       'division_name', 'department_name', 'review_date', 'rating'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   clothing_id      5000 non-null   int64 
 1   age              5000 non-null   int64 
 2   review_title     4174 non-null   object
 3   review_text      4804 non-null   object
 4   recommended      5000 non-null   bool  
 5   division_name    4996 non-null   object
 6   department_name  4996 non-null   object
 7   review_date      5000 non-null   object
 8   rating           5000 non-null   object
dtypes: bool(1), int64(2), object(6)
memory usage: 317.5+ KB
None
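
The `Non-Null Count` column above shows that `review_title`, `review_text`, `division_name`, and `department_name` have fewer than 5000 non-null entries, so they contain missing values. A minimal sketch of counting those directly, using a small made-up frame (the `toy` data below is hypothetical and only stands in for `reviews.csv`):

```python
import pandas as pd

# Hypothetical toy frame standing in for reviews.csv (values are made up)
toy = pd.DataFrame(
    {
        "review_title": ["Great fit", None, "Too small"],
        "department_name": ["Tops", "Dresses", None],
        "age": [34, 51, 28],
    }
)

# Count missing values per column; this equals the total number of rows
# minus the "Non-Null Count" that .info() reports
missing_counts = toy.isna().sum()
print(missing_counts)
```

The same call on the real data would be `reviews.isna().sum()`.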


## Data Transformations

3. Transform the `recommended` feature. Start by printing the feature's `.value_counts()`.

In [4]:
# look at the counts of recommended
print(reviews["recommended"].value_counts())

recommended
True     4166
False     834
Name: count, dtype: int64


4. Since this is a True/False feature, we want to transform it to 1 for True and 0 for False.

   To do this, create a dictionary called `binary_dict` where:

   - The keys are what is currently in the `recommended` feature.
   - The values are what we want in the new column (0s and 1s).


In [5]:
# create binary dictionary
binary_dict = {
    True: 1,
    False: 0,
}

5. Using `binary_dict`, transform the recommended column so that it will now be binary. Print the results using `.value_counts()` to confirm the transformation.


In [6]:
# transform column
reviews["recommended"] = reviews["recommended"].map(binary_dict)

# print your transformed column
print(reviews["recommended"].value_counts())

recommended
1    4166
0     834
Name: count, dtype: int64
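
Because `recommended` is already a boolean column, casting it to an integer type should give the same 0/1 values as the dictionary mapping. The sketch below compares the two approaches on a small made-up series (the `flags` data is hypothetical); it is an alternative to the step above, not a replacement for it.

```python
import pandas as pd

# Hypothetical stand-in for the boolean recommended column
flags = pd.Series([True, False, True, True], name="recommended")

# Approach used in this project: explicit dictionary mapping
binary_dict = {True: 1, False: 0}
mapped = flags.map(binary_dict)

# Alternative: cast the boolean column straight to integers
cast = flags.astype(int)

# Both approaches should agree element by element
print((mapped == cast).all())
```

The dictionary approach is more verbose, but it generalizes to non-boolean labels such as "yes"/"no".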


6. Let's run through a similar process to transform the `rating` feature. This is ordinal data, so our transformation should make the ordering explicit. Again, start by printing the `.value_counts()`.


In [7]:
# look at the counts of rating
print(reviews["rating"].value_counts())

rating
Loved it     2798
Liked it     1141
Was okay      564
Not great     304
Hated it      193
Name: count, dtype: int64


7. We want to make the following changes to the values:

   - 'Loved it' → 5
   - 'Liked it' → 4
   - 'Was okay' → 3
   - 'Not great' → 2
   - 'Hated it' → 1

    Create a dictionary called `rating_dict` where the keys are what is currently in the feature and the values are what we want in the new column.


In [8]:
# create dictionary
rating_dict = {
    "Loved it": 5,
    "Liked it": 4,
    "Was okay": 3,
    "Not great": 2,
    "Hated it": 1,
}


8. Using `rating_dict`, transform the rating column so it contains numerical values. Print the results using `.value_counts()` to confirm the transformation.

In [9]:
# transform rating column
reviews["rating"] = reviews["rating"].map(rating_dict)

# print your transformed column values
print(reviews["rating"].value_counts())


rating
5    2798
4    1141
3     564
2     304
1     193
Name: count, dtype: int64
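
One thing to watch with `.map()`: any value that is not a key in the dictionary becomes `NaN` rather than raising an error. That also means running the mapping cell a second time on the already-numeric column would turn every value into `NaN`. A minimal sketch of checking for unmapped labels, using a made-up series that contains a deliberately misspelled entry (`"loved it"` in lowercase is hypothetical and does not come from the dataset):

```python
import pandas as pd

rating_dict = {
    "Loved it": 5,
    "Liked it": 4,
    "Was okay": 3,
    "Not great": 2,
    "Hated it": 1,
}

# Hypothetical ratings; "loved it" (lowercase) is not a key in rating_dict
ratings = pd.Series(["Loved it", "Was okay", "loved it", "Hated it"], name="rating")

encoded = ratings.map(rating_dict)

# Labels that were present in the input but produced NaN after mapping
unmapped = ratings[encoded.isna() & ratings.notna()]
print(unmapped.tolist())
```

An empty list here means every label was covered by the dictionary.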


9. Let's now transform the `department_name` feature. This process will be slightly different: we will one-hot encode the feature with pandas' `get_dummies()` and attach the results to our original data frame. Start by printing the `.value_counts()` of the feature.

In [10]:
# get the number of categories in a feature
print(reviews["department_name"].value_counts())

department_name
Tops        2196
Dresses     1322
Bottoms      848
Intimate     378
Jackets      224
Trend         28
Name: count, dtype: int64


10. Use pandas' `get_dummies()` function to one-hot encode our feature. Assign the result to a variable called `one_hot`.

In [11]:
# perform get_dummies
one_hot = pd.get_dummies(reviews["department_name"])

11. Join the results from `one_hot` back to our original data frame. Then print out the column names. What has been added?

In [12]:
# join the new columns back onto the original
reviews = reviews.join(one_hot)

# print column names
print(reviews.columns)

Index(['clothing_id', 'age', 'review_title', 'review_text', 'recommended',
       'division_name', 'department_name', 'review_date', 'rating', 'Bottoms',
       'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
      dtype='object')
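
The new columns take their names directly from the category values, and rows with a missing `department_name` get 0 (or `False`) in every new column. The sketch below shows two optional `get_dummies()` arguments on a made-up frame: `prefix`, which makes the origin of each column obvious, and `dtype=int`, which requests 0/1 integers. The `toy` data and the `dept` prefix are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical toy data; the None simulates a missing department_name
toy = pd.DataFrame({"department_name": ["Tops", "Dresses", None, "Tops"]})

# prefix labels each new column; dtype=int yields 0/1 instead of booleans
one_hot = pd.get_dummies(toy["department_name"], prefix="dept", dtype=int)

# Join on the shared index, as in the step above
toy = toy.join(one_hot)
print(toy)
```

If missing values should get their own indicator column, `get_dummies()` also accepts `dummy_na=True`.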


12. Let's make one more feature transformation!

    The `review_date` feature is listed as an object type, but we want it to be a date-time feature.

    Transform `review_date` into a date-time feature.


In [13]:
# transform review_date to date-time data
reviews["review_date"] = pd.to_datetime(reviews["review_date"])

# print review_date data type
print(reviews["review_date"].dtype)


datetime64[ns]
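
Once a column has a date-time dtype, the `.dt` accessor exposes components such as year and month, which can serve as numeric features. A minimal sketch on made-up date strings follows; the ISO `YYYY-MM-DD` format and the new column names are assumptions, since the actual format of `review_date` is not shown here. Passing `errors="coerce"` turns unparseable entries into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical date strings; the last one is deliberately invalid
dates = pd.Series(["2019-03-15", "2019-11-02", "not a date"], name="review_date")

# errors="coerce" converts unparseable values to NaT rather than raising
parsed = pd.to_datetime(dates, errors="coerce")

# Pull out numeric components via the .dt accessor
features = pd.DataFrame(
    {
        "review_year": parsed.dt.year,
        "review_month": parsed.dt.month,
        "review_dayofweek": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
    }
)
print(features)
```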


## Scaling the Data


13. The final step we will take in our transformation project is scaling our data. We notice that we have a wide range of numbers thus far, so it is best to put everything on the same scale.

    Let's get our data frame to only have the numerical features we created.


In [14]:
# get numerical columns
reviews = reviews[
    [
        "clothing_id",
        "age",
        "recommended",
        "rating",
        "Bottoms",
        "Dresses",
        "Intimate",
        "Jackets",
        "Tops",
        "Trend",
    ]
].copy()

14. Set the index to our `clothing_id` feature, so that it identifies each row instead of being scaled as a numeric feature.

In [15]:
# set clothing_id as the index
reviews = reviews.set_index('clothing_id')

15. We are ready to scale our data! Create a `StandardScaler()`, call `.fit_transform()` on `reviews`, and print the results to see how the features have changed.

In [16]:
scaler = StandardScaler()
scaler.fit_transform(reviews)

array([[-0.34814459,  0.44742824, -0.1896478 , ..., -0.21656679,
        -0.88496718, -0.07504356],
       [-1.24475223,  0.44742824,  0.71602461, ..., -0.21656679,
        -0.88496718, -0.07504356],
       [-0.51116416,  0.44742824,  0.71602461, ..., -0.21656679,
        -0.88496718, -0.07504356],
       ...,
       [-0.59267395,  0.44742824,  0.71602461, ..., -0.21656679,
        -0.88496718, -0.07504356],
       [-1.24475223,  0.44742824,  0.71602461, ..., -0.21656679,
        -0.88496718, -0.07504356],
       [ 1.68960003,  0.44742824,  0.71602461, ..., -0.21656679,
         1.12998541, -0.07504356]])
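
`.fit_transform()` returns a plain NumPy array, so the column names and the `clothing_id` index are no longer attached. `StandardScaler` standardizes each column to mean 0 and standard deviation 1, using the population standard deviation. A minimal sketch of wrapping the result back into a data frame and verifying those properties, on a made-up frame (the `toy` values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data indexed by clothing_id
toy = pd.DataFrame(
    {"age": [25, 35, 45, 55], "rating": [5, 4, 2, 1]},
    index=pd.Index([101, 102, 103, 104], name="clothing_id"),
)

scaler = StandardScaler()

# Re-attach the column names and index that the NumPy array drops
scaled = pd.DataFrame(
    scaler.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# Each column should now have mean ~0 and population std (ddof=0) ~1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Note that pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` is needed to match what the scaler computes.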