# Transforming Data into Features
## Project description - by Codecademy
"You are a data scientist at a clothing company and are working with a data set of customer reviews. This dataset is originally from Kaggle and has a lot of potential for various machine learning purposes. You are tasked with transforming some of these features to make the data more useful for analysis. To do this, you will have time to practice the following:

- Transforming categorical data
- Scaling your data
- Working with date-time features"


## Data description - from data source
"Welcome. This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.

This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:

- Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
- Age: Positive Integer variable of the reviewers age.
- Title: String variable for the title of the review.
- Review Text: String variable for the review body.
- Rating: Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.
- Recommended IND: Binary variable stating where the customer recommends the product where 1 is recommended, 0 is not recommended.
- Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.
- Division Name: Categorical name of the product high level division.
- Department Name: Categorical name of the product department name.
- Class Name: Categorical name of the product class name."

"Anonymous but real source"

## Code

In [27]:
# import necessary packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [28]:
# import data
## note: data used in this notebook is copied from Codecademy;
##       the dataset is smaller than, and slightly different from, the original
reviews = pd.read_csv('reviews.csv')

In [29]:
#print column names
print(reviews.columns)
 
#print .info
print(reviews.info())

Index(['clothing_id', 'age', 'review_title', 'review_text', 'recommended',
       'division_name', 'department_name', 'review_date', 'rating'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   clothing_id      5000 non-null   int64 
 1   age              5000 non-null   int64 
 2   review_title     4174 non-null   object
 3   review_text      4804 non-null   object
 4   recommended      5000 non-null   bool  
 5   division_name    4996 non-null   object
 6   department_name  4996 non-null   object
 7   review_date      5000 non-null   object
 8   rating           5000 non-null   object
dtypes: bool(1), int64(2), object(6)
memory usage: 317.5+ KB
None


In [30]:
#look at the counts of recommended
print(reviews['recommended'].value_counts())
 
#create binary dictionary
binary_dict = {True: 1,
False: 0}
 
#transform column
reviews['recommended'] = reviews['recommended'].replace(binary_dict)
 
#print your transformed column
print(reviews['recommended'].value_counts())

True     4166
False     834
Name: recommended, dtype: int64
1    4166
0     834
Name: recommended, dtype: int64
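
The dictionary replacement above can also be written as a direct cast. The following is a minimal sketch, assuming a small made-up frame named `toy` (hypothetical, not the review data); it shows that `astype(int)` and `Series.map` give the same 0/1 encoding as the dictionary.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'recommended' column
toy = pd.DataFrame({'recommended': [True, False, True, True]})

# Casting bool -> int maps True to 1 and False to 0
toy['recommended_int'] = toy['recommended'].astype(int)

# Series.map with an explicit dictionary gives the same result
binary_dict = {True: 1, False: 0}
toy['recommended_map'] = toy['recommended'].map(binary_dict)

print(toy)
```

Unlike `replace`, `map` turns any value missing from the dictionary into `NaN`, which can help surface unexpected categories.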


In [31]:
#look at the counts of rating
print(reviews['rating'].value_counts())
 
#create dictionary
rating_dict = {'Loved it': 5,
'Liked it': 4,
'Was okay': 3,
'Not great': 2,
'Hated it': 1
}
 
#transform rating column
reviews['rating'] = reviews['rating'].replace(rating_dict)
 
#print your transformed column values
print(reviews['rating'].value_counts())

Loved it     2798
Liked it     1141
Was okay      564
Not great     304
Hated it      193
Name: rating, dtype: int64
5    2798
4    1141
3     564
2     304
1     193
Name: rating, dtype: int64
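
Another way to express the ordinal mapping above is an ordered categorical. This is only a sketch on a made-up series `toy` (hypothetical); it assumes the same five labels and derives the 1 to 5 scale from the category order rather than from a hand-written dictionary.

```python
import pandas as pd

# Hypothetical toy ratings using the same labels as the rating column
toy = pd.Series(['Loved it', 'Hated it', 'Was okay', 'Liked it', 'Not great'])

# Order the labels from worst to best
order = ['Hated it', 'Not great', 'Was okay', 'Liked it', 'Loved it']
rating_cat = pd.Categorical(toy, categories=order, ordered=True)

# Category codes start at 0, so add 1 to get the 1-5 scale
rating_num = pd.Series(rating_cat.codes + 1, index=toy.index)

print(rating_num.tolist())
```

A label that is not listed in `order` gets the code -1 (0 after the shift), so it is worth checking for unexpected values before relying on the result.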


In [32]:
#get the number of categories in a feature
print(reviews['department_name'].value_counts())
 
#perform get_dummies
one_hot = pd.get_dummies(reviews['department_name'])
 
#join the new columns back onto the original
reviews = reviews.join(one_hot)

#print column names
print(reviews.columns)

Tops        2196
Dresses     1322
Bottoms      848
Intimate     378
Jackets      224
Trend         28
Name: department_name, dtype: int64
Index(['clothing_id', 'age', 'review_title', 'review_text', 'recommended',
       'division_name', 'department_name', 'review_date', 'rating', 'Bottoms',
       'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
      dtype='object')
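
The output in this notebook shows the dummy columns as `uint8`. In pandas 2.0 and later, `get_dummies` returns boolean columns by default, and boolean columns are not selected by `select_dtypes(['number'])`. The sketch below, on a made-up frame `toy` (hypothetical), passes `dtype=int` so the dummy columns stay numeric regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data; the last row has a missing department
toy = pd.DataFrame({'department_name': ['Tops', 'Dresses', 'Tops', None]})

# dtype=int forces 0/1 integer columns regardless of the pandas default
one_hot = pd.get_dummies(toy['department_name'], dtype=int)

# Join the dummy columns back onto the original frame
toy = toy.join(one_hot)

print(toy)
print(toy.select_dtypes(['number']).columns.tolist())
```

Rows with a missing department get 0 in every dummy column, which is what happens to the four null `department_name` values reported by `.info()` above.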


In [33]:
reviews.head()
# illustrates that the department name has been one-hot encoded for each row

,clothing_id,age,review_title,review_text,recommended,division_name,department_name,review_date,rating,Bottoms,Dresses,Intimate,Jackets,Tops,Trend
0,1095,39,"Cute,looks like a dress on",If you are afraid of the jumpsuit trend but li...,1,General,Dresses,2019-07-08,4,0,1,0,0,0,0
1,1095,28,"So cute, great print!",I love fitted top dresses like this but i find...,1,General,Dresses,2019-05-17,5,0,1,0,0,0,0
2,699,37,So flattering!,"I love these cozy, fashionable leggings. they ...",1,Initmates,Intimate,2019-06-24,5,0,0,1,0,0,0
3,1072,36,Effortless,"Another reviewer said it best, ""i love the way...",1,General Petite,Dresses,2019-12-06,5,0,1,0,0,0,0
4,1094,32,You need this!,Rompers are my fav so i'm biased writing this ...,1,General,Dresses,2019-10-04,5,0,1,0,0,0,0


In [34]:
#transform review_date to date-time data
reviews['review_date'] = pd.to_datetime(reviews['review_date'])

#print review_date data type 
print(reviews['review_date'].dtype)

datetime64[ns]
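
Once a column is `datetime64[ns]`, its calendar components can be pulled out as numeric features. This is a minimal sketch on a made-up frame `toy` (hypothetical); the column names `review_month` and `review_dayofweek` are illustrative assumptions and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy data using the same date format as review_date
toy = pd.DataFrame({'review_date': ['2019-07-08', '2019-05-17', '2019-12-06']})
toy['review_date'] = pd.to_datetime(toy['review_date'])

# The .dt accessor exposes calendar components as integer columns
toy['review_month'] = toy['review_date'].dt.month
toy['review_dayofweek'] = toy['review_date'].dt.dayofweek  # Monday=0, Sunday=6

print(toy)
```

Because a datetime column is not a numeric dtype, any features of this kind would need to be created before a step that keeps only numeric columns.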


In [35]:
#get numerical columns
reviews = reviews.select_dtypes(['number'])
print(reviews.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   clothing_id  5000 non-null   int64
 1   age          5000 non-null   int64
 2   recommended  5000 non-null   int64
 3   rating       5000 non-null   int64
 4   Bottoms      5000 non-null   uint8
 5   Dresses      5000 non-null   uint8
 6   Intimate     5000 non-null   uint8
 7   Jackets      5000 non-null   uint8
 8   Tops         5000 non-null   uint8
 9   Trend        5000 non-null   uint8
dtypes: int64(4), uint8(6)
memory usage: 185.7 KB
None


In [36]:
#set clothing_id as the index
reviews.set_index('clothing_id',inplace=True)
print(reviews.head())

             age  recommended  rating  Bottoms  Dresses  Intimate  Jackets  \
clothing_id                                                                  
1095          39            1       4        0        1         0        0   
1095          28            1       5        0        1         0        0   
699           37            1       5        0        0         1        0   
1072          36            1       5        0        1         0        0   
1094          32            1       5        0        1         0        0   

             Tops  Trend  
clothing_id               
1095            0      0  
1095            0      0  
699             0      0  
1072            0      0  
1094            0      0  


In [37]:
#instantiate standard scaler
scaler = StandardScaler()

#fit transform data
scaled_reviews = scaler.fit_transform(reviews)
print(scaled_reviews)

[[-0.34814459  0.44742824 -0.1896478  ... -0.21656679 -0.88496718
  -0.07504356]
 [-1.24475223  0.44742824  0.71602461 ... -0.21656679 -0.88496718
  -0.07504356]
 [-0.51116416  0.44742824  0.71602461 ... -0.21656679 -0.88496718
  -0.07504356]
 ...
 [-0.59267395  0.44742824  0.71602461 ... -0.21656679 -0.88496718
  -0.07504356]
 [-1.24475223  0.44742824  0.71602461 ... -0.21656679 -0.88496718
  -0.07504356]
 [ 1.68960003  0.44742824  0.71602461 ... -0.21656679  1.12998541
  -0.07504356]]
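
To make the scaled values above easier to interpret, the sketch below checks on a made-up frame `toy` (hypothetical) that `StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two numeric columns
toy = pd.DataFrame({'age': [39, 28, 37, 36, 32], 'rating': [4, 5, 5, 5, 5]})

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(toy)

# Same computation by hand: (x - mean) / population standard deviation
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```

Note that pandas' `std()` defaults to `ddof=1`, so the manual version only matches when `ddof=0` is passed explicitly.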


In [38]:
# convert the scaled array back into a DataFrame, keeping the original index and column names
reviews = pd.DataFrame(scaled_reviews, index=reviews.index, columns=reviews.columns)
reviews.head()

clothing_id,age,recommended,rating,Bottoms,Dresses,Intimate,Jackets,Tops,Trend
1095,-0.348145,0.447428,-0.189648,-0.451928,1.667977,-0.285977,-0.216567,-0.884967,-0.075044
1095,-1.244752,0.447428,0.716025,-0.451928,1.667977,-0.285977,-0.216567,-0.884967,-0.075044
699,-0.511164,0.447428,0.716025,-0.451928,-0.599529,3.496786,-0.216567,-0.884967,-0.075044
1072,-0.592674,0.447428,0.716025,-0.451928,1.667977,-0.285977,-0.216567,-0.884967,-0.075044
1094,-0.918713,0.447428,0.716025,-0.451928,1.667977,-0.285977,-0.216567,-0.884967,-0.075044
