>## **<center>Hierarchical Classification</center>**

### **Importing Necessary Libraries**

In [1]:
# Import necessary libraries
import pandas as pd
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.preprocessing import LabelEncoder
import pickle

[nltk_data] Downloading package stopwords to C:\Users\Subhash
[nltk_data]     Dixit\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### **Reading data**

In [2]:
data = pd.read_csv('data.csv')
# data = data.iloc[:100]

### **Basic analysis and pre-processing**

In [3]:
display(data.head())
display(data.info())
display(data.isnull().sum())
data.dropna(inplace=True)  # Drop rows with missing values


Unnamed: 0,productId,Title,userId,Time,Text,Cat1,Cat2,Cat3
0,B0002AQK70,PetSafe Staywell Pet Door with Clear Hard Flap,A2L6QTQQI13LZG,1344211200,We've only had it installed about 2 weeks. So ...,pet supplies,cats,cat flaps
1,B0002DK8OI,"Kaytee Timothy Cubes, 1-Pound",A2HJUOZ9R9K4F,1344211200,My bunny had a hard time eating this because t...,pet supplies,bunny rabbit central,food
2,B0006VJ6TO,Body Back Buddy,A14PK96LL78NN3,1344211200,would never in a million years have guessed th...,health personal care,health care,massage relaxation
3,B000EZSFXA,SnackMasters California Style Turkey Jerky,A2UW73HU9UMOTY,1344211200,"Being the jerky fanatic I am, snackmasters han...",grocery gourmet food,snack food,jerky dried meats
4,B000KV61FC,Premier Busy Buddy Tug-a-Jug Treat Dispensing ...,A1Q99RNV0TKW8R,1344211200,Wondered how quick my dog would catch on to th...,pet supplies,dogs,toys


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   productId  10000 non-null  object
 1   Title      9995 non-null   object
 2   userId     10000 non-null  object
 3   Time       10000 non-null  int64 
 4   Text       10000 non-null  object
 5   Cat1       10000 non-null  object
 6   Cat2       10000 non-null  object
 7   Cat3       10000 non-null  object
dtypes: int64(1), object(7)
memory usage: 625.1+ KB


None

productId    0
Title        5
userId       0
Time         0
Text         0
Cat1         0
Cat2         0
Cat3         0
dtype: int64

### **Text Pre-Processing**

In [4]:
# Define a function to clean the text data
def preprocess_text(text):
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in stopwords.words('english')])
    return text

# Preprocess both Title and Text columns and combine them
def preprocess_data(data):
    data['clean_text'] = data['Text'].apply(preprocess_text)
    data['clean_title'] = data['Title'].apply(preprocess_text)
    data['combined_text'] = data['clean_title'] + " " + data['clean_text']
    
    # Encode target labels
    le_cat1 = LabelEncoder()
    le_cat2 = LabelEncoder()
    le_cat3 = LabelEncoder()
    data['Cat1'] = le_cat1.fit_transform(data['Cat1'])
    data['Cat2'] = le_cat2.fit_transform(data['Cat2'])
    data['Cat3'] = le_cat3.fit_transform(data['Cat3'])
    
    return data, le_cat1, le_cat2, le_cat3


### **Train Test Split**

In [5]:
# Split the data into train and test sets
def split_data(data):
    X = data['combined_text']
    y_cat1 = data['Cat1']
    y_cat2 = data['Cat2']
    y_cat3 = data['Cat3']
    
    X_train, X_test, y_train_cat1, y_test_cat1 = train_test_split(X, y_cat1, test_size=0.2, random_state=42)
    X_train_cat2, X_test_cat2, y_train_cat2, y_test_cat2 = train_test_split(X, y_cat2, test_size=0.2, random_state=42)
    X_train_cat3, X_test_cat3, y_train_cat3, y_test_cat3 = train_test_split(X, y_cat3, test_size=0.2, random_state=42)
    
    return (X_train, X_test, y_train_cat1, y_test_cat1,
            X_train_cat2, X_test_cat2, y_train_cat2, y_test_cat2,
            X_train_cat3, X_test_cat3, y_train_cat3, y_test_cat3)


### **Model Building**

#### **TF-IDF Vectorization**

- **TF-IDF** stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). TF-IDF is used as a weighting factor in text mining and information retrieval.

##### Components of TF-IDF

1. **Term Frequency (TF)**: Measures how frequently a term appears in a document.
   - \( \text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \)

2. **Inverse Document Frequency (IDF)**: Measures how important a term is across the corpus.
   - \( \text{IDF}(t,D) = \log \left( \frac{\text{Total number of documents } N}{\text{Number of documents containing term } t} \right) \)
   - Terms that are common in many documents will have a lower IDF score.

3. **TF-IDF Score**: The product of TF and IDF.
   - \( \text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D) \)

##### Why Use TF-IDF?

- **Relevance**: TF-IDF helps to highlight important words in the documents by considering their frequency in a particular document relative to the entire corpus. Common words (like "the", "is", "and") are down-weighted.
- **Feature Representation**: Converts text data into a numerical representation that can be fed into machine learning algorithms.
- **Discriminative Power**: It improves the discriminative power of text classification models by focusing on unique and meaningful terms.

##### Why TF-IDF is Preferred?

- **Effective Weighting**: Unlike simple term frequency, TF-IDF considers the distribution of terms across all documents, which helps in distinguishing between relevant and non-relevant terms.
- **Dimensionality Reduction**: It can reduce the dimensionality of the feature space by focusing on the most significant terms.
- **Interpretability**: The resulting weights provide insight into the importance of terms within documents.

##### Parameter: `max_features`

- **Definition**: `max_features` specifies the maximum number of features (terms) to consider when building the TF-IDF matrix.
- **Usage**: In this case, `max_features=5000` means that the top 5000 terms with the highest TF-IDF scores will be selected as features for the model.
- **Benefits**:
  - **Computational Efficiency**: Reduces the size of the feature space, making the model more efficient to train and less prone to overfitting.
  - **Focus on Most Important Terms**: Ensures that only the most relevant terms (with the highest TF-IDF scores) are used as features, improving the performance and interpretability of the model.

##### Summary

- **TF-IDF**: A statistical measure used to evaluate the importance of a word in a document relative to a collection of documents.
- **Benefits**: Improves the relevance and discriminative power of text features.
- **max_features**: Limits the number of features to the most significant 5000 terms, enhancing computational efficiency and model performance.

In [6]:
# Function to build and evaluate models
def build_and_evaluate_model(X_train, X_test, y_train, y_test, classifiers, param_grids):
    best_f1 = 0
    best_model = None
    best_clf_name = ""
    
    for clf_name, clf, param_grid in zip(classifiers.keys(), classifiers.values(), param_grids.values()):
        pipeline = Pipeline([
            ('tfidf', TfidfVectorizer(max_features=5000)),
            ('clf', clf)
        ])
        
        # Base model evaluation
        pipeline.fit(X_train, y_train)
        predictions = pipeline.predict(X_test)
        f1 = f1_score(y_test, predictions, average='weighted')
        print(f"Base Model Classification Report for {clf_name}")
        print(classification_report(y_test, predictions))
        print(f"Base Model F1 Score for {clf_name}: {f1}")
        
        if f1 > best_f1:
            best_f1 = f1
            best_model = pipeline
            best_clf_name = clf_name
        
        # Hyperparameter tuning
        grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1_weighted', n_jobs=-1)
        grid_search.fit(X_train, y_train)
        
        best_clf = grid_search.best_estimator_
        predictions = best_clf.predict(X_test)
        
        f1 = f1_score(y_test, predictions, average='weighted')
        if f1 > best_f1:
            best_f1 = f1
            best_model = best_clf
            best_clf_name = clf_name
        
        print(f"Tuned Model Classification Report for {clf_name}")
        print(classification_report(y_test, predictions))
        print(f"Tuned Model F1 Score for {clf_name}: {f1}")
    
    print(f"\nBest model: {best_clf_name} with F1 Score: {best_f1}")
    return best_model


### **Model Function Calling**

- Let's go through the hyperparameters used in the `param_grids` dictionary for each classifier in detail.

#### **Logistic Regression (`LogisticRegression`)**

1. **`C`**: Inverse of regularization strength (float, default=1.0)
   - Smaller values specify stronger regularization.
   - Controls the trade-off between achieving a low training error and a low testing error.

2. **`solver`**: Algorithm to use in the optimization problem (string, default='lbfgs')
   - `'liblinear'`: A good choice for small datasets. Uses a coordinate descent algorithm.

#### **Multinomial Naive Bayes (`MultinomialNB`)**

1. **`alpha`**: Additive (Laplace/Lidstone) smoothing parameter (float, default=1.0)
   - 0 for no smoothing.
   - Higher values of alpha result in smoother probability estimates.

#### **Support Vector Classifier (`SVC`)**

1. **`C`**: Regularization parameter (float, default=1.0)
   - The strength of the regularization is inversely proportional to `C`.
   - Must be strictly positive.
   - Smaller values specify stronger regularization.

2. **`kernel`**: Specifies the kernel type to be used in the algorithm (string, default='rbf')
   - `'linear'`: Linear kernel.
   - `'rbf'`: Radial basis function (Gaussian) kernel.
   - The kernel defines the decision boundary shape in the feature space.

#### **Random Forest Classifier (`RandomForestClassifier`)**

1. **`n_estimators`**: The number of trees in the forest (int, default=100)
   - Higher values typically improve performance but increase computation time.

2. **`max_depth`**: The maximum depth of the tree (int, default=None)
   - Limits the depth of the tree to prevent overfitting.
   - Higher values increase the risk of overfitting.

#### **Decision Tree Classifier (`DecisionTreeClassifier`)**

1. **`max_depth`**: The maximum depth of the tree (int, default=None)
   - Limits the depth of the tree to prevent overfitting.
   - Higher values increase the risk of overfitting.

2. **`min_samples_split`**: The minimum number of samples required to split an internal node (int or float, default=2)
   - Higher values can prevent the model from learning overly specific patterns and help prevent overfitting.


#### **Summary**
- **Regularization Parameters (`C`, `alpha`)**: Control the trade-off between achieving a low training error and a low testing error.
- **Solver (`solver`)**: Determines the optimization algorithm for Logistic Regression.
- **Kernel (`kernel`)**: Defines the decision boundary shape for SVC.
- **Number of Estimators (`n_estimators`)**: Determines the number of trees (or boosting rounds) in ensemble methods.
- **Tree Depth (`max_depth`)**: Limits the complexity of the model to prevent overfitting.
- **Minimum Samples Split (`min_samples_split`)**: Ensures nodes have enough samples before splitting to prevent overly specific patterns.

These parameters are crucial for tuning the model to achieve the best performance by balancing bias and variance.

In [7]:
# Define classifiers and their hyperparameters for tuning
classifiers = {
    'LogisticRegression': LogisticRegression(),
    'MultinomialNB': MultinomialNB(),
    'SVC': SVC(),
    'RandomForest': RandomForestClassifier(),
    'DecisionTree': DecisionTreeClassifier()
}

param_grids = {
    'LogisticRegression': {'clf__C': [0.1, 1, 10], 'clf__solver': ['liblinear']},
    'MultinomialNB': {'clf__alpha': [0.5, 1.0, 1.5]},
    'SVC': {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear', 'rbf']},
    'RandomForest': {'clf__n_estimators': [50, 100, 200], 'clf__max_depth': [10, 20, 30]},
    'DecisionTree': {'clf__max_depth': [10, 20, 30], 'clf__min_samples_split': [2, 5, 10]}
}

# Process data
data, le_cat1, le_cat2, le_cat3 = preprocess_data(data)

# Save LabelEncoders
pickle.dump(le_cat1, open('le_cat1.pkl', 'wb'))
pickle.dump(le_cat2, open('le_cat2.pkl', 'wb'))
pickle.dump(le_cat3, open('le_cat3.pkl', 'wb'))

(X_train, X_test, y_train_cat1, y_test_cat1,
 X_train_cat2, X_test_cat2, y_train_cat2, y_test_cat2,
 X_train_cat3, X_test_cat3, y_train_cat3, y_test_cat3) = split_data(data)

# Train and evaluate models for Cat1
print("Training and evaluating models for Cat1")
best_model_cat1 = build_and_evaluate_model(X_train, X_test, y_train_cat1, y_test_cat1, classifiers, param_grids)
pickle.dump(best_model_cat1, open('best_model_cat1.pkl', 'wb'))

# Convert predictions to strings and add as new features
X_train_cat2 = X_train.to_frame()
X_test_cat2 = X_test.to_frame()

X_train_cat2['pred_cat1'] = best_model_cat1.predict(X_train).astype(str)
X_test_cat2['pred_cat1'] = best_model_cat1.predict(X_test).astype(str)

X_train_cat2_combined = X_train_cat2.apply(lambda row: ' '.join(row.values), axis=1)
X_test_cat2_combined = X_test_cat2.apply(lambda row: ' '.join(row.values), axis=1)

# Train and evaluate models for Cat2
print("\nTraining and evaluating models for Cat2")
best_model_cat2 = build_and_evaluate_model(X_train_cat2_combined, X_test_cat2_combined, y_train_cat2, y_test_cat2, classifiers, param_grids)
pickle.dump(best_model_cat2, open('best_model_cat2.pkl', 'wb'))

# Convert predictions to strings and add as new features
X_train_cat3 = X_train.to_frame()
X_test_cat3 = X_test.to_frame()

X_train_cat3['pred_cat2'] = best_model_cat2.predict(X_train_cat2_combined).astype(str)
X_test_cat3['pred_cat2'] = best_model_cat2.predict(X_test_cat2_combined).astype(str)

X_train_cat3_combined = X_train_cat3.apply(lambda row: ' '.join(row.values), axis=1)
X_test_cat3_combined = X_test_cat3.apply(lambda row: ' '.join(row.values), axis=1)

# Train and evaluate models for Cat3
print("\nTraining and evaluating models for Cat3")
best_model_cat3 = build_and_evaluate_model(X_train_cat3_combined, X_test_cat3_combined, y_train_cat3, y_test_cat3, classifiers, param_grids)
pickle.dump(best_model_cat3, open('best_model_cat3.pkl', 'wb'))

Training and evaluating models for Cat1
Base Model Classification Report for LogisticRegression
              precision    recall  f1-score   support

           0       0.92      0.67      0.77       135
           1       0.92      0.88      0.90       438
           2       0.96      0.76      0.85       138
           3       0.78      0.94      0.86       601
           4       0.97      0.87      0.92       313
           5       0.90      0.88      0.89       374

    accuracy                           0.88      1999
   macro avg       0.91      0.83      0.86      1999
weighted avg       0.89      0.88      0.88      1999

Base Model F1 Score for LogisticRegression: 0.8754163207407083
Tuned Model Classification Report for LogisticRegression
              precision    recall  f1-score   support

           0       0.90      0.80      0.85       135
           1       0.91      0.90      0.90       438
           2       0.95      0.83      0.88       138
           3       0.85 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Tuned Model Classification Report for LogisticRegression
              precision    recall  f1-score   support

           0       0.81      0.54      0.65        24
           1       0.78      0.54      0.64        26
           2       0.89      0.67      0.76        12
           4       0.63      0.53      0.58        36
           5       0.74      0.47      0.57        30
           6       0.80      0.57      0.67         7
           7       0.81      0.84      0.83        31
           8       0.71      0.50      0.59        10
           9       0.00      0.00      0.00         4
          10       1.00      0.60      0.75         5
          11       0.82      0.86      0.84        21
          12       0.00      0.00      0.00         1
          13       0.85      0.68      0.76        25
          14       1.00      1.00      1.00         1
          15       0.75      0.70      0.72        90
          16       0.00      0.00      0.00         3
          17       0.00 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Base Model Classification Report for MultinomialNB
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        24
           1       0.00      0.00      0.00        26
           2       0.00      0.00      0.00        12
           4       0.67      0.06      0.10        36
           5       1.00      0.03      0.06        30
           6       0.00      0.00      0.00         7
           7       1.00      0.23      0.37        31
           8       0.00      0.00      0.00        10
           9       0.00      0.00      0.00         4
          10       0.00      0.00      0.00         5
          11       1.00      0.19      0.32        21
          12       0.00      0.00      0.00         1
          13       0.00      0.00      0.00        25
          14       0.00      0.00      0.00         1
          15       0.88      0.48      0.62        90
          16       0.00      0.00      0.00         3
          17       0.00      0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Tuned Model Classification Report for MultinomialNB
              precision    recall  f1-score   support

           0       1.00      0.17      0.29        24
           1       0.00      0.00      0.00        26
           2       0.00      0.00      0.00        12
           4       0.77      0.28      0.41        36
           5       1.00      0.13      0.24        30
           6       0.00      0.00      0.00         7
           7       0.88      0.48      0.62        31
           8       0.00      0.00      0.00        10
           9       0.00      0.00      0.00         4
          10       0.00      0.00      0.00         5
          11       1.00      0.38      0.55        21
          12       0.00      0.00      0.00         1
          13       1.00      0.04      0.08        25
          14       0.00      0.00      0.00         1
          15       0.70      0.54      0.61        90
          16       0.00      0.00      0.00         3
          17       0.00      

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Base Model Classification Report for SVC
              precision    recall  f1-score   support

           0       0.89      0.33      0.48        24
           1       1.00      0.35      0.51        26
           2       1.00      0.50      0.67        12
           4       0.62      0.44      0.52        36
           5       0.92      0.37      0.52        30
           6       1.00      0.14      0.25         7
           7       0.82      0.74      0.78        31
           8       0.75      0.30      0.43        10
           9       0.00      0.00      0.00         4
          10       1.00      0.40      0.57         5
          11       0.86      0.90      0.88        21
          12       0.00      0.00      0.00         1
          13       1.00      0.48      0.65        25
          14       0.00      0.00      0.00         1
          15       0.81      0.67      0.73        90
          16       0.00      0.00      0.00         3
          17       0.00      0.00      0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Tuned Model Classification Report for SVC
              precision    recall  f1-score   support

           0       0.73      0.46      0.56        24
           1       0.71      0.58      0.64        26
           2       1.00      0.67      0.80        12
           4       0.62      0.56      0.59        36
           5       0.74      0.57      0.64        30
           6       0.50      0.57      0.53         7
           7       0.84      0.87      0.86        31
           8       0.86      0.60      0.71        10
           9       1.00      0.25      0.40         4
          10       0.57      0.80      0.67         5
          11       0.82      0.86      0.84        21
          12       0.00      0.00      0.00         1
          13       0.80      0.64      0.71        25
          14       0.50      1.00      0.67         1
          15       0.68      0.74      0.71        90
          16       0.00      0.00      0.00         3
          17       1.00      0.50      

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Base Model Classification Report for RandomForest
              precision    recall  f1-score   support

           0       0.50      0.38      0.43        24
           1       1.00      0.31      0.47        26
           2       0.88      0.58      0.70        12
           4       0.33      0.31      0.32        36
           5       0.83      0.33      0.48        30
           6       0.33      0.14      0.20         7
           7       0.65      0.84      0.73        31
           8       0.62      0.50      0.56        10
           9       0.00      0.00      0.00         4
          10       1.00      0.60      0.75         5
          11       0.61      0.90      0.73        21
          12       0.00      0.00      0.00         1
          13       0.79      0.60      0.68        25
          14       0.00      0.00      0.00         1
          15       0.83      0.71      0.77        90
          16       0.00      0.00      0.00         3
          17       0.00      0.

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Tuned Model Classification Report for RandomForest
              precision    recall  f1-score   support

           0       0.83      0.21      0.33        24
           1       0.00      0.00      0.00        26
           2       0.86      0.50      0.63        12
           4       0.73      0.22      0.34        36
           5       0.82      0.30      0.44        30
           6       0.00      0.00      0.00         7
           7       0.59      0.77      0.67        31
           8       0.00      0.00      0.00        10
           9       0.00      0.00      0.00         4
          10       1.00      0.20      0.33         5
          11       1.00      0.62      0.76        21
          12       0.00      0.00      0.00         1
          13       0.86      0.24      0.38        25
          14       0.00      0.00      0.00         1
          15       0.85      0.70      0.77        90
          16       0.00      0.00      0.00         3
          17       0.00      0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Base Model Classification Report for DecisionTree
              precision    recall  f1-score   support

           0       0.29      0.21      0.24        24
           1       0.31      0.31      0.31        26
           2       0.70      0.58      0.64        12
           4       0.33      0.25      0.29        36
           5       0.60      0.30      0.40        30
           6       0.17      0.14      0.15         7
           7       0.81      0.81      0.81        31
           8       0.36      0.50      0.42        10
           9       0.00      0.00      0.00         4
          10       0.67      0.40      0.50         5
          11       0.58      0.86      0.69        21
          12       0.00      0.00      0.00         1
          13       0.67      0.48      0.56        25
          14       0.00      0.00      0.00         1
          15       0.69      0.61      0.65        90
          16       0.00      0.00      0.00         3
          17       0.50      0.

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Tuned Model Classification Report for DecisionTree
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        24
           1       0.00      0.00      0.00        26
           2       0.00      0.00      0.00        12
           4       1.00      0.06      0.11        36
           5       0.71      0.17      0.27        30
           6       0.00      0.00      0.00         7
           7       0.86      0.58      0.69        31
           8       0.50      0.10      0.17        10
           9       0.00      0.00      0.00         4
          10       0.00      0.00      0.00         5
          11       0.00      0.00      0.00        21
          12       0.00      0.00      0.00         1
          13       0.86      0.24      0.38        25
          14       0.00      0.00      0.00         1
          15       0.71      0.60      0.65        90
          16       0.00      0.00      0.00         3
          17       0.00      0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



Training and evaluating models for Cat3
Base Model Classification Report for LogisticRegression
              precision    recall  f1-score   support

           0       1.00      0.33      0.50         3
           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         2
           3       0.00      0.00      0.00         1
           4       1.00      0.27      0.43        11
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1
           7       0.00      0.00      0.00         5
           8       0.50      0.08      0.14        12
           9       0.62      0.60      0.61        30
          10       1.00      0.67      0.80         9
          11       0.00      0.00      0.00         2
          12       0.00      0.00      0.00         6
          17       0.00      0.00      0.00         1
          20       0.00      0.00      0.00         1
          22       0.00      0.00     

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Tuned Model Classification Report for LogisticRegression
              precision    recall  f1-score   support

           0       0.75      1.00      0.86         3
           1       0.00      0.00      0.00         1
           2       1.00      1.00      1.00         2
           3       0.00      0.00      0.00         1
           4       0.67      0.36      0.47        11
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1
           7       0.67      0.40      0.50         5
           8       0.41      0.58      0.48        12
           9       0.65      0.57      0.61        30
          10       1.00      0.67      0.80         9
          11       0.00      0.00      0.00         2
          12       0.50      0.33      0.40         6
          17       0.00      0.00      0.00         1
          20       0.00      0.00      0.00         1
          22       0.20      1.00      0.33         1
          23       0.00 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Base Model Classification Report for MultinomialNB
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         3
           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         2
           3       0.00      0.00      0.00         1
           4       0.00      0.00      0.00        11
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1
           7       0.00      0.00      0.00         5
           8       0.00      0.00      0.00        12
           9       0.86      0.20      0.32        30
          10       1.00      0.11      0.20         9
          11       0.00      0.00      0.00         2
          12       0.00      0.00      0.00         6
          17       0.00      0.00      0.00         1
          20       0.00      0.00      0.00         1
          22       0.00      0.00      0.00         1
          23       0.00      0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Tuned Model Classification Report for MultinomialNB
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         3
           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         2
           3       0.00      0.00      0.00         1
           4       0.00      0.00      0.00        11
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1
           7       0.00      0.00      0.00         5
           8       0.00      0.00      0.00        12
           9       0.73      0.37      0.49        30
          10       1.00      0.56      0.71         9
          11       0.00      0.00      0.00         2
          12       0.00      0.00      0.00         6
          17       0.00      0.00      0.00         1
          20       0.00      0.00      0.00         1
          22       0.00      0.00      0.00         1
          23       0.00      

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Base Model Classification Report for SVC
              precision    recall  f1-score   support

           0       1.00      0.33      0.50         3
           1       0.00      0.00      0.00         1
           2       1.00      0.50      0.67         2
           3       0.00      0.00      0.00         1
           4       1.00      0.18      0.31        11
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1
           7       0.00      0.00      0.00         5
           8       0.00      0.00      0.00        12
           9       0.72      0.60      0.65        30
          10       1.00      0.67      0.80         9
          11       0.00      0.00      0.00         2
          12       0.00      0.00      0.00         6
          17       0.00      0.00      0.00         1
          20       0.00      0.00      0.00         1
          22       0.00      0.00      0.00         1
          23       0.00      0.00      0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Tuned Model Classification Report for SVC
              precision    recall  f1-score   support

           0       0.60      1.00      0.75         3
           1       0.00      0.00      0.00         1
           2       1.00      1.00      1.00         2
           3       0.00      0.00      0.00         1
           4       0.67      0.36      0.47        11
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1
           7       0.67      0.40      0.50         5
           8       0.44      0.67      0.53        12
           9       0.64      0.53      0.58        30
          10       1.00      0.67      0.80         9
          11       0.00      0.00      0.00         2
          12       0.57      0.67      0.62         6
          17       0.00      0.00      0.00         1
          20       1.00      1.00      1.00         1
          22       0.25      1.00      0.40         1
          23       0.00      0.00      

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Base Model Classification Report for RandomForest
              precision    recall  f1-score   support

           0       0.67      0.67      0.67         3
           1       0.00      0.00      0.00         1
           2       1.00      1.00      1.00         2
           3       0.00      0.00      0.00         1
           4       0.67      0.36      0.47        11
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1
           7       0.75      0.60      0.67         5
           8       0.15      0.25      0.19        12
           9       0.55      0.60      0.57        30
          10       1.00      0.78      0.88         9
          11       0.00      0.00      0.00         2
          12       0.50      0.50      0.50         6
          17       0.00      0.00      0.00         1
          20       0.00      0.00      0.00         1
          22       0.17      1.00      0.29         1
          23       0.00      0.

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Tuned Model Classification Report for RandomForest
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         3
           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         2
           3       0.00      0.00      0.00         1
           4       0.57      0.36      0.44        11
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1
           7       0.00      0.00      0.00         5
           8       1.00      0.17      0.29        12
           9       0.47      0.60      0.53        30
          10       1.00      0.78      0.88         9
          11       0.00      0.00      0.00         2
          12       0.00      0.00      0.00         6
          17       0.00      0.00      0.00         1
          20       0.00      0.00      0.00         1
          22       0.00      0.00      0.00         1
          23       0.00      0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Base Model Classification Report for DecisionTree
              precision    recall  f1-score   support

           0       0.20      0.67      0.31         3
           1       0.00      0.00      0.00         1
           2       1.00      1.00      1.00         2
           3       0.00      0.00      0.00         1
           4       0.80      0.36      0.50        11
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1
           7       1.00      0.60      0.75         5
           8       0.17      0.33      0.23        12
           9       0.67      0.47      0.55        30
          10       0.83      0.56      0.67         9
          11       0.00      0.00      0.00         2
          12       0.43      0.50      0.46         6
          15       0.00      0.00      0.00         0
          17       0.00      0.00      0.00         1
          20       1.00      1.00      1.00         1
          22       0.00      0.

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Tuned Model Classification Report for DecisionTree
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         3
           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         2
           3       0.00      0.00      0.00         1
           4       0.80      0.36      0.50        11
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1
           7       1.00      0.40      0.57         5
           8       0.20      0.08      0.12        12
           9       0.70      0.47      0.56        30
          10       0.83      0.56      0.67         9
          11       0.00      0.00      0.00         2
          12       0.00      0.00      0.00         6
          17       0.00      0.00      0.00         1
          20       0.00      0.00      0.00         1
          22       0.00      0.00      0.00         1
          23       0.00      0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### **Perforing Predictions**

In [8]:
# Example prediction for a new text
def predict_new_text(title, body):
    cleaned_title = preprocess_text(title)
    cleaned_body = preprocess_text(body)
    combined_cleaned_text = cleaned_title + " " + cleaned_body

    model_cat1 = pickle.load(open('best_model_cat1.pkl', 'rb'))
    model_cat2 = pickle.load(open('best_model_cat2.pkl', 'rb'))
    model_cat3 = pickle.load(open('best_model_cat3.pkl', 'rb'))

    le_cat1 = pickle.load(open('le_cat1.pkl', 'rb'))
    le_cat2 = pickle.load(open('le_cat2.pkl', 'rb'))
    le_cat3 = pickle.load(open('le_cat3.pkl', 'rb'))

    # Predict Category 1
    pred_cat1_encoded = model_cat1.predict([combined_cleaned_text])[0]
    pred_cat1 = le_cat1.inverse_transform([pred_cat1_encoded])[0]
    # Predict Category 2
    combined_text_with_cat1 = combined_cleaned_text + " " + str(pred_cat1)
    pred_cat2_encoded = model_cat2.predict([combined_text_with_cat1])[0]
    pred_cat2 = le_cat2.inverse_transform([pred_cat2_encoded])[0]
    
    # Predict Category 3
    combined_text_with_cat2 = combined_cleaned_text + " " + str(pred_cat2)
    pred_cat3_encoded = model_cat3.predict([combined_text_with_cat2])[0]
    pred_cat3 = le_cat3.inverse_transform([pred_cat3_encoded])[0]

    return f"Predicted Categories: {pred_cat1} > {pred_cat2} > {pred_cat3}"

# Example usage
new_text_title = "Kaytee Timothy Cubes, 1-Pound"
new_text_body = "My bunny had a hard time eating this because the hay was so dry and it was too small for her to chew on."
predicted_categories = predict_new_text(new_text_title, new_text_body)
print(predicted_categories)

Predicted Categories: pet supplies > bunny rabbit central > food
