# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features to use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column indicates whether a customer recommends the product,
where `1` is recommended and `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer Categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive Integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Positive Integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## Load Data

In [1]:
import pandas as pd

# Load data
df = pd.read_csv(
    'data/reviews.csv',
)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


Unnamed: 0,Clothing ID,Age,Title,Review Text,Positive Feedback Count,Division Name,Department Name,Class Name,Recommended IND
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,0,General,Dresses,Dresses,0
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",0,General Petite,Bottoms,Pants,1
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,6,General,Tops,Blouses,1
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",4,General,Dresses,Dresses,0
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,1,General Petite,Tops,Knits,1


## Preparing features (`X`) & target (`y`)

In [2]:
data = df

# separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Labels: [0 1]
Features:


Unnamed: 0,Clothing ID,Age,Title,Review Text,Positive Feedback Count,Division Name,Department Name,Class Name
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,0,General,Dresses,Dresses
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",0,General Petite,Bottoms,Pants
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,6,General,Tops,Blouses
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",4,General,Dresses,Dresses
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,1,General Petite,Tops,Knits


In [3]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)
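
Because the target is a binary label, a plain shuffled split can leave slightly different class proportions in the train and test sets. The following is a minimal sketch, on toy data that is not the review dataset, of how the `stratify` argument of `train_test_split` keeps the class ratio the same in both splits. The names `toy_X` and `toy_y` are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced data for illustration only (not the review dataset)
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([1] * 80 + [0] * 20, name="label")

# stratify=toy_y asks for the 80/20 class ratio to be preserved in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X,
    toy_y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
    stratify=toy_y,
)

# Share of positive labels in each split
print(y_tr.mean(), y_te.mean())
```

Whether stratification changes the results on the real data depends on how imbalanced `Recommended IND` is, which can be checked with `y.value_counts(normalize=True)`.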

# My Work

## Data Exploration

In [4]:
# Select columns by data type
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
cat_cols = df.select_dtypes(include=['object', 'category']).columns
bool_cols = df.select_dtypes(include=['bool']).columns

print("Numerical columns:", num_cols.tolist())
print("Categorical columns:", cat_cols.tolist())
print("Boolean columns:", bool_cols.tolist())


Numerical columns: ['Clothing ID', 'Age', 'Positive Feedback Count', 'Recommended IND']
Categorical columns: ['Title', 'Review Text', 'Division Name', 'Department Name', 'Class Name']
Boolean columns: []


- Numerical: 'Clothing ID', 'Age', 'Positive Feedback Count'
- Categorical: 'Division Name', 'Department Name', 'Class Name'
- Text: 'Title', 'Review Text'
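
The dtype alone does not settle whether an integer column behaves like a number or like a category (the data description calls `Clothing ID` an integer categorical, for example). One way to inform that decision is to look at how many distinct values each column has. This is a minimal sketch using only the five rows shown in the preview above; the frame name `toy` is an assumption, and the same call could be run on `X`.

```python
import pandas as pd

# The five preview rows shown earlier, restricted to three columns
toy = pd.DataFrame({
    "Clothing ID": [1077, 1049, 847, 1080, 858],
    "Age": [60, 50, 47, 49, 39],
    "Division Name": [
        "General", "General Petite", "General", "General", "General Petite",
    ],
})

# Number of distinct values per column, lowest cardinality first
cardinality = toy.nunique().sort_values()
print(cardinality)
```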


## Building Pipeline

In [5]:
from sklearn.pipeline import Pipeline

Split the data into numerical, categorical, and text features.

In [6]:
num_features = (
    X
    .select_dtypes(exclude=['object']).columns
    .drop(
        [
            'Age',  # More of a category than a numerical feature
        ],
    )
)
print('Numerical features:', num_features)

cat_features = (
    X[[
        'Division Name',
        'Department Name',
        'Class Name',
        'Age',
    ]].columns
)
print('Categorical features:', cat_features)


text_features = (
    X[[
        'Title',
        'Review Text',
    ]].columns
)
print('Review Text features:', text_features)



Numerical features: Index(['Clothing ID', 'Positive Feedback Count'], dtype='object')
Categorical features: Index(['Division Name', 'Department Name', 'Class Name', 'Age'], dtype='object')
Review Text features: Index(['Title', 'Review Text'], dtype='object')


In [7]:
# --- Combine Text Columns ---
#def combine_text_columns(X):
 #   return (X['Title'] + ' ' + X['Review Text']).values

#combine_text_step = FunctionTransformer(combine_text_columns, validate=False)

### Numerical Features Pipeline 

In [8]:
# define pipeline for numerical features called `num_pipeline`
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

num_pipeline = Pipeline([
    (
        'imputer',
        SimpleImputer(strategy='mean'),
    ),
    (
        'scaler',
        MinMaxScaler(),
    ),
])

num_pipeline
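
To make the effect of the two steps concrete, here is a minimal sketch on a toy column with one missing value. It rebuilds the same imputer and scaler under the assumed name `toy_pipeline` so that it runs on its own; the input values are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Same two steps as `num_pipeline`, rebuilt so the sketch is self-contained
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])

# One column with a missing value in the middle
toy = np.array([[0.0], [np.nan], [10.0]])

# The NaN is replaced by the column mean (5.0), then values are scaled to [0, 1]
scaled = toy_pipeline.fit_transform(toy)
print(scaled.ravel())  # [0.  0.5 1. ]
```

Since the dataset is described as already cleaned of missing values, the imputer mainly acts as a safeguard for data seen at prediction time.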

### Categorical Features Pipeline

In [9]:
# define pipeline for categorical features called `cat_pipeline`
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

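cat_pipeline = Pipeline([
    (
        # Map each category to an integer; unseen categories become -1
        'ordinal_encoder',
        OrdinalEncoder(
            handle_unknown='use_encoded_value',
            unknown_value=-1,
        ),
    ),
    (
        # Fill any missing values with the most frequent code
        'imputer',
        SimpleImputer(
            strategy='most_frequent',
        ),
    ),
    (
        # One-hot encode; codes not seen during fit become all-zero rows
        'cat_encoder',
        OneHotEncoder(
            sparse_output=False,
            handle_unknown='ignore',
        ),
    ),
])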

cat_pipeline

### Text Feature Pipeline

In [10]:
import spacy
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import RandomizedSearchCV

# --- Load spaCy ---
nlp = spacy.load("en_core_web_sm")

# --- Combine Text Columns ---
def combine_text_columns(X):
    return (X['Title'] + ' ' + X['Review Text']).values

combine_text_step = FunctionTransformer(combine_text_columns, validate=False)

# --- Character Counts Pipeline ---
class CountCharacter(BaseEstimator, TransformerMixin):
    def __init__(self, character: str):
        self.character = character

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = list(X)
        return np.array([[text.count(self.character)] for text in X])

character_counts_pipeline = FeatureUnion([
    ('spaces', CountCharacter(' ')),
    ('exclamations', CountCharacter('!')),
    ('questions', CountCharacter('?')),
])

# --- spaCy Feature Extraction (single pass) ---
class SpacyFeatureExtractor(BaseEstimator, TransformerMixin):
    """
    Run spaCy once per document and extract lemmas, POS ratios, and NER counts together.
    """
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = list(X)
        lemmas = []
        pos_features = []
        ner_features = []

        for doc in self.nlp.pipe(X, batch_size=50):
            # Lemmas
            lemmas.append(' '.join(token.lemma_ for token in doc if not token.is_stop))

            # POS ratios (share of nouns, verbs, and adjectives per document)
            counts = {"NOUN": 0, "VERB": 0, "ADJ": 0}
            for token in doc:
                if token.pos_ in counts:
                    counts[token.pos_] += 1
            total = len(doc) + 1e-6  # small constant avoids division by zero
            pos_features.append([
                counts["NOUN"]/total,
                counts["VERB"]/total,
                counts["ADJ"]/total
            ])

            # Number of named entities
            ner_features.append([len(doc.ents)])

        return np.hstack([
            np.array(pos_features),
            np.array(ner_features)
        ]), lemmas  # Returns (POS + NER feature array, list of lemma strings)

# --- TF-IDF Pipeline for Lemmas ---
class LemmaTFIDF(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.tfidf = TfidfVectorizer(stop_words='english')

    def fit(self, X, y=None):
        self.tfidf.fit(X)
        return self

    def transform(self, X):
        return self.tfidf.transform(X)

# --- Text Pipeline ---
class FastTextPipeline(BaseEstimator, TransformerMixin):
    """
    Build all text features from a single spaCy pass per document.
    """
    def __init__(self):
        self.char_counts = character_counts_pipeline
        self.spacy_features = SpacyFeatureExtractor(nlp)
        self.tfidf = LemmaTFIDF()

    def fit(self, X, y=None):
        X_text = combine_text_columns(X)
        # Character counters are stateless; fit is called for API consistency
        self.char_counts.fit(X_text)
        # spaCy features
        spacy_num_features, lemmas = self.spacy_features.fit_transform(X_text)
        self.spacy_num_features_ = spacy_num_features
        # TF-IDF
        self.tfidf.fit(lemmas)
        return self

    def transform(self, X):
        X_text = combine_text_columns(X)
        # Character counts
        char_features = self.char_counts.transform(X_text)
        # spaCy numerical features
        spacy_num_features, lemmas = self.spacy_features.transform(X_text)
        # TF-IDF
        tfidf_features = self.tfidf.transform(lemmas)
        # Combine all feature blocks into one sparse matrix
        from scipy.sparse import hstack
        return hstack([char_features, spacy_num_features, tfidf_features])

# --- Model Pipeline ---
#text_pipeline = FastTextPipeline()

#model_pipeline = Pipeline([
 #   ('text', text_pipeline),
  #  ('scaler', StandardScaler(with_mean=False)),
   # ('svc', LinearSVC(random_state=27, max_iter=5000, dual=False))
#])
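
The last step of `FastTextPipeline.transform` stacks small dense blocks (character counts, POS ratios) next to a large sparse TF-IDF matrix. The sketch below shows that stacking in isolation, without spaCy, on two made-up review strings. The variable names are assumptions for illustration, and raw text stands in for the lemmatized text used above.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up reviews for illustration
texts = [
    "My favorite buy! I love this jumpsuit",
    "Some major design flaws. Why?",
]

# Dense hand-crafted features: counts of '!' and '?' per text
counts = np.array([[t.count("!"), t.count("?")] for t in texts])

# Sparse bag-of-words features
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(texts)

# scipy's hstack accepts dense and sparse blocks and returns a sparse matrix
combined = hstack([counts, tfidf_matrix]).tocsr()
print(combined.shape)
```

Keeping the result sparse is what makes `StandardScaler(with_mean=False)` the appropriate scaler further down, since centering would require a dense matrix.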




In [11]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import RandomizedSearchCV

# --- Feature Engineering ---
feature_engineering = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features),
    ('text', FastTextPipeline(), ['Title', 'Review Text'])  # Text columns are handled by FastTextPipeline
])

# --- Model Pipeline ---
model_pipeline = make_pipeline(
    feature_engineering,
    StandardScaler(with_mean=False),  # Scale features for LinearSVC; no centering, so sparse input stays sparse
    LinearSVC(random_state=27, max_iter=5000, dual=False)
)
model_pipeline

In [12]:
model_pipeline.fit(X_train, y_train)


### Evaluate Model

In [13]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


y_pred_forest_pipeline = model_pipeline.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)

# Precision
precision_forest_pipeline = precision_score(y_test, y_pred_forest_pipeline)

# Recall
recall_forest_pipeline = recall_score(y_test, y_pred_forest_pipeline)

# F1 score (harmonic mean of precision and recall)
f1_forest_pipeline = f1_score(y_test, y_pred_forest_pipeline)

# Print results
print("Accuracy :", accuracy_forest_pipeline)
print("Precision:", precision_forest_pipeline)
print("Recall   :", recall_forest_pipeline)
print("F1 Score :", f1_forest_pipeline)

Accuracy : 0.8216802168021681
Precision: 0.9052488070892979
Recall   : 0.8748353096179183
F1 Score : 0.8897822445561139
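
The four scores above all derive from the same confusion matrix, which can be useful to inspect directly when the classes are imbalanced. The sketch below uses six made-up labels, not the model's predictions, to show how precision and recall relate to the matrix cells. On the real data the same calls could be applied to `y_test` and the predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only
y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision = tp / (tp + fp), recall = tp / (tp + fn)
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall)
```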


## Fine-Tuning Pipeline

In [14]:

# --- Fine-tuning ---
param_distributions = dict(
    linearsvc__C=[0.01, 0.1, 1, 10, 100],
    linearsvc__tol=[1e-4, 1e-3, 1e-2]
)

param_search = RandomizedSearchCV(
    estimator=model_pipeline,
    param_distributions=param_distributions,
    n_iter=5,          # sample 5 parameter combinations
    cv=3,              # 3-fold cross-validation
    n_jobs=1,         
    refit=True,
    verbose=3,
    random_state=27
)

# --- Fit ---
param_search.fit(X_train, y_train)

# --- Best parameters ---
print(param_search.best_params_)


Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV 1/3] END linearsvc__C=10, linearsvc__tol=0.001;, score=0.838 total time= 6.6min
[CV 2/3] END linearsvc__C=10, linearsvc__tol=0.001;, score=0.837 total time= 6.7min
[CV 3/3] END linearsvc__C=10, linearsvc__tol=0.001;, score=0.830 total time= 6.3min
[CV 1/3] END linearsvc__C=1, linearsvc__tol=0.0001;, score=0.837 total time= 6.4min
[CV 2/3] END linearsvc__C=1, linearsvc__tol=0.0001;, score=0.837 total time= 6.4min
[CV 3/3] END linearsvc__C=1, linearsvc__tol=0.0001;, score=0.830 total time= 6.3min
[CV 1/3] END linearsvc__C=0.1, linearsvc__tol=0.001;, score=0.840 total time= 6.3min
[CV 2/3] END linearsvc__C=0.1, linearsvc__tol=0.001;, score=0.838 total time= 6.2min
[CV 3/3] END linearsvc__C=0.1, linearsvc__tol=0.001;, score=0.832 total time= 6.7min
[CV 1/3] END linearsvc__C=100, linearsvc__tol=0.01;, score=0.839 total time= 6.3min
[CV 2/3] END linearsvc__C=100, linearsvc__tol=0.01;, score=0.841 total time= 6.3min
[CV 3/3] END 
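
The `linearsvc__` prefix in `param_distributions` comes from `make_pipeline`, which names each step after its lowercased class name. The following is a minimal sketch on synthetic data, not the review data, showing that naming and how the per-candidate scores can be read from `cv_results_` after the search. All names ending in `toy` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary classification data for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=27)

toy_model = make_pipeline(
    StandardScaler(),
    LinearSVC(random_state=27, dual=False),
)
# Step names are derived from the class names: standardscaler, linearsvc
print(list(toy_model.named_steps))

search = RandomizedSearchCV(
    estimator=toy_model,
    param_distributions={"linearsvc__C": [0.01, 0.1, 1, 10]},
    n_iter=3,
    cv=3,
    random_state=27,
)
search.fit(X_toy, y_toy)

# One row per sampled candidate, best rank first
results = pd.DataFrame(search.cv_results_)[
    ["param_linearsvc__C", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

Looking at `mean_test_score` across candidates shows whether the tuned parameters make a meaningful difference or whether the candidates score almost identically.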

In [15]:
model_best = param_search.best_estimator_
model_best

In [16]:
y_pred_forest_pipeline = model_best.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)

print('Accuracy:', accuracy_forest_pipeline)

Accuracy: 0.8520325203252033


In [17]:
# Precision
precision_forest_pipeline = precision_score(y_test, y_pred_forest_pipeline)

# Recall
recall_forest_pipeline = recall_score(y_test, y_pred_forest_pipeline)

# F1 score (harmonic mean of precision and recall)
f1_forest_pipeline = f1_score(y_test, y_pred_forest_pipeline)

# Print results
print("Accuracy :", accuracy_forest_pipeline)
print("Precision:", precision_forest_pipeline)
print("Recall   :", recall_forest_pipeline)
print("F1 Score :", f1_forest_pipeline)


Accuracy : 0.8520325203252033
Precision: 0.9141716566866267
Recall   : 0.9051383399209486
F1 Score : 0.9096325719960279
