# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features you can use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column records whether a customer recommends the product,
where `1` is recommended and `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Positive integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## 1. Load Data

In [123]:
import pandas as pd
import spacy

# Load spaCy model for text processing
nlp = spacy.load('en_core_web_sm')

# Read the dataset
df = pd.read_csv('/Users/lem/Python/Pipeline Project/starter/data/reviews.csv')

# Check data info and head
df.info()
df.head()

# Prepare features (X) and target (y)
X = df.drop('Recommended IND', axis=1)  # all columns except the target
y = df['Recommended IND']  # binary target: 1 = recommended, 0 = not recommended

# Splitting the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=27
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB
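The split above does not pass `stratify`, so the class ratio in the 10% test set can differ from the ratio in the full data. The following is a minimal sketch of a stratified split, assuming a hypothetical toy target with an 80/20 imbalance rather than the review data; the names `X_toy` and `y_toy` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data with an 80/20 class imbalance (not the review data).
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, shuffle=True, random_state=27, stratify=y_toy
)

# Share of positive labels in each split.
print(y_tr.mean(), y_te.mean())
```

Whether stratification matters for this project depends on how imbalanced `Recommended IND` is, which `y.value_counts(normalize=True)` would show.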


## 2. Building a Pipeline: Splitting Numerical, Categorical, and Text Data

In [125]:
# Define the feature groups
num_features = X.select_dtypes(exclude=['object']).columns  # every non-object column, including the integer Clothing ID
cat_features = ['Division Name', 'Department Name', 'Class Name']  # categorical columns
text_features = ['Review Text']  # review body (Title is not used)
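Because `Clothing ID` is stored as `int64`, dtype-based selection places it in the numeric group even though the data description calls it categorical. The sketch below shows this on a hypothetical three-row frame (the values are made up); it uses `include='number'`, which is an alternative selector and not the one used in the cell above.

```python
import pandas as pd

# Hypothetical frame mimicking the column types of the review data.
toy = pd.DataFrame({
    'Clothing ID': [1077, 1049, 847],
    'Age': [60, 50, 47],
    'Positive Feedback Count': [0, 6, 1],
    'Class Name': ['Dresses', 'Pants', 'Blouses'],
    'Review Text': ['Love it', 'Too small', 'Nice shirt'],
})

# Select numeric columns by dtype; integer identifiers are picked up too.
num_cols = toy.select_dtypes(include='number').columns.tolist()
print(num_cols)
```

If the identifier should be treated as a category instead, it could be listed explicitly in `cat_features` and dropped from the numeric group; that is a modelling choice the notebook leaves open.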

## 3. Numerical Features Pipeline

In [129]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
])
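To make the two steps concrete, here is a small sketch that runs the same imputer and scaler on a hypothetical age column containing one missing value. The dataset itself has no missing values, so the imputer only acts as a safeguard there; `toy_num_pipeline` and `ages` are illustrative names.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

toy_num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # fill NaN with the column mean
    ('scaler', MinMaxScaler()),                   # rescale each column to [0, 1]
])

# Hypothetical ages with one missing entry.
ages = np.array([[20.0], [40.0], [np.nan], [60.0]])
scaled = toy_num_pipeline.fit_transform(ages)

# The missing value becomes the mean (40), then everything maps onto [0, 1].
print(scaled.ravel())
```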


## 4. Categorical Features Pipeline

In [132]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

cat_pipeline = Pipeline([
    ('ordinal_encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('cat_encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
])



## 5. Text Features Pipeline

In [135]:
# Custom Transformer: Count Characters (for spaces, exclamations, and question marks)
from sklearn.base import BaseEstimator, TransformerMixin

class CountCharacter(BaseEstimator, TransformerMixin):
    def __init__(self, character: str):
        self.character = character

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[text.count(self.character)] for text in X]

# Feature engineering pipeline for counting spaces, exclamations, and question marks
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
import numpy as np

initial_text_preprocess = Pipeline([
    ('dimension_reshaper', FunctionTransformer(np.reshape, kw_args={'newshape': -1})),
])

character_counts_pipeline = Pipeline([
    ('initial_text_preprocess', initial_text_preprocess),
    ('feature_engineering', FeatureUnion([
        ('count_spaces', CountCharacter(character=' ')),
        ('count_exclamations', CountCharacter(character='!')),
        ('count_question_marks', CountCharacter(character='?')),
    ])),
])

# Custom Transformer for spaCy Lemmatization
class SpacyLemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        lemmatized = [
            ' '.join(token.lemma_ for token in doc if not token.is_stop)
            for doc in self.nlp.pipe(X)
        ]
        return lemmatized

# TF-IDF Pipeline with lemmatization
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_pipeline = Pipeline([
    ('dimension_reshaper', FunctionTransformer(np.reshape, kw_args={'newshape': -1})),
    ('lemmatizer', SpacyLemmatizer(nlp=nlp)),
    ('tfidf_vectorizer', TfidfVectorizer(stop_words='english')),
])
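As a quick check of the character-count branch, the sketch below applies the same `CountCharacter` transformer inside a `FeatureUnion` to two hypothetical review strings. The class is repeated so the snippet runs on its own, and the spaCy and TF-IDF steps are left out because they need the downloaded language model.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class CountCharacter(BaseEstimator, TransformerMixin):
    """Count occurrences of a single character in each text."""

    def __init__(self, character: str):
        self.character = character

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[text.count(self.character)] for text in X]


# Hypothetical reviews, already flattened to a 1D sequence of strings.
toy_reviews = ['Love it! Fits well!', 'Too small?']

counts = FeatureUnion([
    ('count_spaces', CountCharacter(character=' ')),
    ('count_exclamations', CountCharacter(character='!')),
    ('count_question_marks', CountCharacter(character='?')),
]).fit_transform(toy_reviews)

# One row per review; columns are spaces, exclamation marks, question marks.
print(counts)
```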


## 6. Combine All Feature Engineering Pipelines

In [138]:
from sklearn.compose import ColumnTransformer

feature_engineering = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features),
    ('character_counts', character_counts_pipeline, text_features),
    ('tfidf_text', tfidf_pipeline, text_features),
])
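`ColumnTransformer` hands each transformer a 2D selection when the columns are given as a list, which is why the text pipelines above start with a reshape step. When a single column is given as a plain string, the transformer receives a 1D series instead. The sketch below illustrates that alternative on a hypothetical three-row frame; it is not the configuration used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame with one numeric and one text column.
toy = pd.DataFrame({
    'Age': [25, 40, 60],
    'Review Text': ['great dress', 'poor fit', 'great fit'],
})

toy_transformer = ColumnTransformer([
    ('num', MinMaxScaler(), ['Age']),                  # list -> 2D input
    ('tfidf_text', TfidfVectorizer(), 'Review Text'),  # string -> 1D input
])

features = toy_transformer.fit_transform(toy)

# 1 scaled numeric column + 4 TF-IDF terms (dress, fit, great, poor).
print(features.shape)
```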


## 7. Build Complete Pipeline (Including Model)

In [141]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

model_pipeline = make_pipeline(
    feature_engineering,
    RandomForestClassifier(random_state=27),
)


## 8. Train the Pipeline

In [144]:
model_pipeline.fit(X_train, y_train)


## 9. Make Predictions

In [146]:
y_pred_forest_pipeline = model_pipeline.predict(X_test)


## 10. Evaluate the Model (Accuracy)

In [148]:
from sklearn.metrics import accuracy_score
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)
print(f"Accuracy: {accuracy_forest_pipeline:.4f}")


Accuracy: 0.8472
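Accuracy alone can hide poor performance on the minority class if the target is imbalanced. The sketch below uses hypothetical labels (not the notebook's predictions) to show a classifier that always predicts `1`: it scores 0.8 accuracy while never identifying a single `0`. The same calls could be applied to `y_test` and `y_pred_forest_pipeline`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 8 recommended, 2 not recommended.
y_true_toy = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A degenerate classifier that always predicts "recommended".
y_pred_toy = [1] * 10

acc = accuracy_score(y_true_toy, y_pred_toy)
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true_toy, y_pred_toy)
# Recall for class 0: share of true "not recommended" items that were found.
recall_not_recommended = recall_score(y_true_toy, y_pred_toy, pos_label=0)

print(acc)
print(cm)
print(recall_not_recommended)
```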


## 11. Fine-Tune the Model Using RandomizedSearchCV

In [150]:
from sklearn.model_selection import RandomizedSearchCV

my_distributions = {
    'randomforestclassifier__max_features': [100, 150, 200], 
    'randomforestclassifier__n_estimators': [50, 100],  
}

param_search = RandomizedSearchCV(
    estimator=model_pipeline,
    param_distributions=my_distributions,
    n_iter=6,
    cv=5,
    n_jobs=-1,
    refit=True,
    verbose=3,
    random_state=27,
)

param_search.fit(X_train, y_train)


Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 3/5] END randomforestclassifier__max_features=100, randomforestclassifier__n_estimators=50;, score=0.845 total time= 4.7min
[CV 2/5] END randomforestclassifier__max_features=150, randomforestclassifier__n_estimators=50;, score=0.856 total time= 4.6min
[CV 4/5] END randomforestclassifier__max_features=150, randomforestclassifier__n_estimators=100;, score=0.849 total time= 4.5min
[CV 4/5] END randomforestclassifier__max_features=200, randomforestclassifier__n_estimators=100;, score=0.855 total time= 3.8min
[CV 1/5] END randomforestclassifier__max_features=100, randomforestclassifier__n_estimators=100;, score=0.846 total time= 4.7min
[CV 4/5] END randomforestclassifier__max_features=150, randomforestclassifier__n_estimators=50;, score=0.849 total time= 4.6min
[CV 2/5] END randomforestclassifier__max_features=200, randomforestclassifier__n_estimators=50;, score=0.864 total time= 4.5min
[CV 5/5] END randomforestclassifier__max_

## 12. Get Best Parameters and Evaluate the Final Model

In [152]:
print(f"Best parameters: {param_search.best_params_}")

# With refit=True, best_estimator_ is already refit on the full training set
# using the best parameters, so no extra fit call is needed
model_best = param_search.best_estimator_

# Make final predictions and evaluate
y_pred_best = model_best.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Final Accuracy after fine-tuning: {accuracy_best:.4f}")


Best parameters: {'randomforestclassifier__n_estimators': 50, 'randomforestclassifier__max_features': 200}
Final Accuracy after fine-tuning: 0.8509
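The search space in section 11 has 3 × 2 = 6 combinations and `n_iter=6`, so every combination is tried once. To compare them side by side, the fitted search object exposes `cv_results_`. The sketch below shows one way to tabulate it, using a hypothetical synthetic dataset and a much smaller forest so that it runs in seconds; the names `toy_pipeline` and `toy_search` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic data standing in for the engineered features.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=27)

toy_pipeline = make_pipeline(MinMaxScaler(), RandomForestClassifier(random_state=27))

toy_search = RandomizedSearchCV(
    estimator=toy_pipeline,
    param_distributions={
        'randomforestclassifier__max_features': [2, 4],
        'randomforestclassifier__n_estimators': [10, 20],
    },
    n_iter=4,  # equals the number of combinations, so all are evaluated
    cv=3,
    random_state=27,
)
toy_search.fit(X_toy, y_toy)

# One row per candidate, best-ranked first.
results = (
    pd.DataFrame(toy_search.cv_results_)
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(results)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between candidates is larger than the fold-to-fold variation.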
