# Problem Statement

1. 🛠️ The overall problem is to develop a text classification system using a pre-trained DistilBERT model that can accurately predict categories based on textual input.

2. 📝 The input feature, denoted as 'X', consists of raw text strings sourced from a dataset, which the model processes to predict categorical labels.

3. 🎯 The target variable, referred to as 'label', represents the actual categories of the text, which are used to train the model and evaluate its accuracy.

4. 📊 The model's performance is assessed through metrics such as the confusion matrix and accuracy, comparing the predicted labels against the actual labels.

5. 🔧 The project involves not only fine-tuning a pre-trained language model on a specific dataset but also validating its effectiveness on both a smaller sample and a larger subset to ensure robustness and scalability of the predictions.

# Install the necessary libraries



1. **Install Accelerate**:
2. **Install Transformers and Datasets**:
3. **Upgrade PyArrow**:
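
In a notebook, these three steps are usually run as `pip install accelerate`, `pip install transformers datasets`, and `pip install --upgrade pyarrow` (no versions are pinned here). As a minimal sketch, the check below reports which of those packages are present afterwards; the helper name `installed_versions` is illustrative and not part of any library.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages named in the installation steps above.
REQUIRED = ["accelerate", "transformers", "datasets", "pyarrow"]

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```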

# Import Libraries

1. **Import Standard Data Science Libraries**:
2. **Import PyTorch and Transformers for NLP**:



In [None]:
## Write your code here

# Import Emotions Data. Target variable is 'Label'

Import the data from https://raw.githubusercontent.com/venkatareddykonasani/Datasets/master/Final_Emotion_Data/five_emotions_data.csv

Display the first few rows of the dataset to understand its structure
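
One possible way to load and inspect the file is sketched below. The helper `load_emotions` is a hypothetical name, and the two-row stand-in (made-up rows with assumed column names `Text` and `Label`) exists only so the snippet runs offline; the real header should be checked after downloading.

```python
import io

import pandas as pd

# URL given in the instructions above.
DATA_URL = (
    "https://raw.githubusercontent.com/venkatareddykonasani/Datasets/"
    "master/Final_Emotion_Data/five_emotions_data.csv"
)


def load_emotions(source=DATA_URL):
    """Read the emotions CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


# Offline demonstration with a two-row stand-in (made-up rows, not real data).
# The column names are assumptions; check the real file's header.
demo_csv = io.StringIO("Text,Label\nI am so happy today,0\nThis is terrifying,1\n")
demo = load_emotions(demo_csv)
print(demo.head())   # first few rows
print(demo.shape)    # (rows, columns)
```

Calling `emotions_data = load_emotions()` with no argument would download the real file, after which `emotions_data.head()` shows its structure.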

In [None]:
## Write your code here

# Use the DistilBERT model without fine-tuning

The following steps use a pre-trained DistilBERT model from the `transformers` library to classify text without any fine-tuning:

Step 1: Install and Import Libraries
1. **Install the necessary Python packages** if they're not already installed:

2. **Import the libraries** in your Python script:


Step 2: Set Up the DistilBERT Model
1. **Initialize the DistilBERT pipeline** for text classification. Ensure you have GPU support (CUDA) if available:


Step 3: Prepare and Predict with Your Dataset
1. **Sample and preprocess the dataset**:
   - Assuming `emotions_data` is already loaded:

2. **Classify text** using the model and extract predictions:


Step 4: Evaluate Model Performance
1. **Calculate and print the confusion matrix** and accuracy:
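
The four steps above could be sketched roughly as follows. Several details are assumptions rather than part of the instructions: the checkpoint name `distilbert-base-uncased`, the sample size of 100, the column names `Text` and `Label`, and integer-coded labels. Because that checkpoint has no trained classification head, the pipeline reports generic labels such as `LABEL_0`, and accuracy is expected to be poor, which is the point of the baseline.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def build_classifier(model_name="distilbert-base-uncased"):
    """Create a text-classification pipeline, on GPU when CUDA is available."""
    # Heavy imports are kept local so the small helper below works without them.
    import torch
    from transformers import pipeline

    device = 0 if torch.cuda.is_available() else -1  # -1 selects the CPU
    return pipeline("text-classification", model=model_name, device=device)


def label_to_id(label):
    """Turn a default pipeline label such as 'LABEL_1' into the integer 1."""
    return int(label.rsplit("_", 1)[-1])


def run_baseline(emotions_data, n=100, text_col="Text", label_col="Label"):
    """Sample n rows, classify them, and score against the actual labels."""
    sample = emotions_data.sample(n=n, random_state=42).copy()
    classifier = build_classifier()
    # truncation=True keeps long texts within the model's input limit.
    outputs = classifier(sample[text_col].tolist(), truncation=True)
    sample["predicted"] = [label_to_id(out["label"]) for out in outputs]

    cm = confusion_matrix(sample[label_col], sample["predicted"])
    accuracy = np.trace(cm) / cm.sum()  # correct predictions lie on the diagonal
    print(cm)
    print(f"Accuracy: {accuracy:.3f}")
    return sample, cm, accuracy
```

With the data loaded, `sample_data, cm, acc = run_baseline(emotions_data)` would run the whole baseline in one call.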




In [None]:
## Write your code here

# Fine-tuning the model with our data


1. Begin by importing necessary libraries including `DistilBertTokenizer` and `DistilBertForSequenceClassification` from `transformers`, `load_dataset`, `DatasetDict`, `ClassLabel`, and `Dataset` from `datasets`, `pandas`, `train_test_split` from `sklearn.model_selection`, and `torch`.

2. Load your sample data into a pandas DataFrame, `sample_data`.

3. Convert the DataFrame into a `Dataset` object using `Dataset.from_pandas()` method.

4. Split the data into training and testing sets using the `Dataset.train_test_split()` method from the `datasets` library (not the scikit-learn function of the same name), setting aside 20% of the data for testing with `test_size=0.2`.

5. Create a `DatasetDict` to organize the datasets into 'train' and 'test' subsets.

6. Load the `DistilBertTokenizer` from the `transformers` library using the pretrained 'distilbert-base-uncased' model.

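7. Make sure the tokenizer has a padding token: set the padding token to the end-of-sequence token, update the padding token ID accordingly, and then add a special padding token '[PAD]'. Note that `distilbert-base-uncased` already defines `[PAD]` and has no end-of-sequence token, so the net effect is that `[PAD]` is used for padding.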

8. Define a function, `tokenize_function`, that tokenizes the texts in the dataset. Pad the texts to a maximum length of 100 tokens (`max_length` counts tokens, not characters) and truncate anything longer.

9. Apply the `tokenize_function` to the datasets using the `.map()` method with `batched=True` to process batches of texts at once.
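
The preparation steps above might look like the sketch below. The function names, the `Text` column name, and `seed=42` are assumptions for illustration. Rather than reassigning the padding token, this version only adds `[PAD]` if the tokenizer lacks one, which leaves `distilbert-base-uncased` unchanged.

```python
def build_splits(sample_data, test_size=0.2, seed=42):
    """Convert a DataFrame to a Dataset and split it into train/test subsets."""
    from datasets import Dataset, DatasetDict  # local import: heavy dependency

    dataset = Dataset.from_pandas(sample_data)
    split = dataset.train_test_split(test_size=test_size, seed=seed)
    return DatasetDict({"train": split["train"], "test": split["test"]})


def load_tokenizer(name="distilbert-base-uncased"):
    """Load the tokenizer and guarantee that it has a padding token."""
    from transformers import DistilBertTokenizer  # local import: heavy dependency

    tokenizer = DistilBertTokenizer.from_pretrained(name)
    # Assumption: only add '[PAD]' when no padding token is defined.
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    return tokenizer


def make_tokenize_function(tokenizer, text_col="Text", max_length=100):
    """Build the batch tokenization function used with Dataset.map()."""

    def tokenize_function(examples):
        # Pad every text to max_length tokens and truncate longer ones.
        return tokenizer(
            examples[text_col],
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )

    return tokenize_function


def tokenize_splits(splits, tokenizer):
    """Apply the tokenizer to every split, processing texts in batches."""
    return splits.map(make_tokenize_function(tokenizer), batched=True)
```

A typical call sequence would be `splits = build_splits(sample_data)`, `tokenizer = load_tokenizer()`, then `tokenized_datasets = tokenize_splits(splits, tokenizer)`.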

In [None]:
## Write your code here

## Load and Train the model

1. Load the `DistilBertForSequenceClassification` model from the `transformers` library, using the pretrained 'distilbert-base-uncased' model. Set `num_labels` to the number of categories in your labels and set `pad_token_id` to the tokenizer's padding token ID (`tokenizer.pad_token_id`); the DistilBERT tokenizer defines no end-of-sequence token, so its `eos_token_id` is `None`.

2. Set up the training configuration using `TrainingArguments`. Define an `output_dir` for storing training results, set `num_train_epochs` for the number of training epochs, specify a `logging_dir` for training logs, and set `evaluation_strategy` to 'epoch' (the argument is named `eval_strategy` in recent `transformers` releases) to evaluate at the end of each epoch.

3. Initialize a `Trainer` object with the model, training arguments, and datasets for training and evaluation. Specify the training dataset using `tokenized_datasets['train']` and the evaluation dataset as `tokenized_datasets['test']`.

4. Begin the training process by invoking the `train()` method on the `Trainer`.

5. Specify a directory where you want to save both the trained model and the tokenizer, e.g., "./distilbert_finetuned".

6. Save the trained model and tokenizer to the specified directory using the `save_pretrained()` method for both the model and tokenizer.

7. Additionally, save the model to another directory or under another name using the `save_model()` method of the `Trainer`.

8. Compress the model directory into a ZIP file using a zip utility, specifying the target directory and the name for the ZIP file, e.g., "distilbert_finetuned_10k.zip".
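
A rough sketch of the training and saving steps follows. The number of epochs, the `./results` and `./logs` directories, and the second save location are placeholders, not values given above. It also assumes the tokenized datasets carry integer class ids in a column named `label` or `labels`, which is what `Trainer` looks for by default.

```python
import inspect
import shutil


def train_and_save(tokenized_datasets, tokenizer, num_labels,
                   save_dir="./distilbert_finetuned", epochs=3):
    """Fine-tune DistilBERT with Trainer, then save the model and tokenizer."""
    from transformers import (  # local import: heavy dependency
        DistilBertForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=num_labels,
        pad_token_id=tokenizer.pad_token_id,
    )

    # The per-epoch evaluation argument was renamed across transformers
    # releases, so use whichever name this installation accepts.
    params = inspect.signature(TrainingArguments.__init__).parameters
    eval_key = "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"

    training_args = TrainingArguments(
        output_dir="./results",      # placeholder directory
        num_train_epochs=epochs,     # placeholder epoch count
        logging_dir="./logs",        # placeholder directory
        **{eval_key: "epoch"},
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["test"],
    )
    trainer.train()

    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)
    trainer.save_model(save_dir + "_trainer")  # hypothetical second location
    return trainer


def zip_directory(directory, archive_name):
    """Compress a directory into <archive_name>.zip and return the zip path."""
    return shutil.make_archive(archive_name, "zip", root_dir=directory)
```

After training, `zip_directory("./distilbert_finetuned", "distilbert_finetuned_10k")` would produce `distilbert_finetuned_10k.zip` in the working directory.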

## Make Predictions

1. Define a function, `make_prediction`, which takes a single text string as input. Inside this function:
   - Tokenize the text using your pretrained tokenizer. Specify that tensors should be returned as PyTorch tensors (`return_tensors="pt"`).
   - Send the tokenized inputs to the same device as the model (a CUDA device for GPU acceleration).
   - Pass the inputs to your pretrained model and obtain the outputs.
   - Extract the logits from the outputs and determine the predicted class by finding the index of the maximum logit value.
   - Convert the prediction tensor from GPU to CPU and transform it into a numpy array.
   - Return the prediction.

2. Apply the `make_prediction` function to the "Text" column of your `sample_data` DataFrame, extracting the first element of the returned array to a new column named `finetuned_predicted`.

3. Calculate the confusion matrix using `confusion_matrix` from `sklearn.metrics`, comparing the actual labels and the predicted labels from your DataFrame. Print the resulting confusion matrix.

4. Compute and print the accuracy of your predictions by dividing the sum of the diagonal elements of the confusion matrix by the total number of predictions.

5. Select a large subset (e.g., 40,000 samples) from the full dataset (`emotions_data`), using a fixed random state for reproducibility.

6. Apply the `make_prediction` function to the "Text" column of this large dataset, storing the results in a new column named `finetuned_predicted`.

7. Calculate and print the confusion matrix for this larger dataset to compare the predicted labels against the actual labels.

8. Compute and print the accuracy for this dataset in the same manner as described above.
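
The prediction and evaluation steps could be sketched as below. Here `make_prediction` takes the tokenizer and model as explicit arguments instead of relying on notebook globals, and it sends inputs to whichever device the model is on; both are design choices of this sketch. The column names `Text` and `Label` and the helper names `evaluate` and `predict_and_evaluate` are assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def make_prediction(text, tokenizer, model):
    """Return the predicted class index for one text as a length-1 NumPy array."""
    import torch  # local import: only needed for inference

    device = next(model.parameters()).device  # CUDA if the model was moved there
    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=100
    ).to(device)
    with torch.no_grad():  # no gradients are needed for prediction
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions.cpu().numpy()


def evaluate(data, label_col="Label", pred_col="finetuned_predicted"):
    """Return the confusion matrix and the accuracy derived from its diagonal."""
    cm = confusion_matrix(data[label_col], data[pred_col])
    accuracy = np.trace(cm) / np.sum(cm)
    return cm, accuracy


def predict_and_evaluate(df, tokenizer, model, text_col="Text"):
    """Add a finetuned_predicted column to a copy of df and score it."""
    df = df.copy()
    df["finetuned_predicted"] = df[text_col].apply(
        lambda text: make_prediction(text, tokenizer, model)[0]
    )
    cm, accuracy = evaluate(df)
    print(cm)
    print(f"Accuracy: {accuracy:.3f}")
    return df, cm, accuracy
```

The same call works for both datasets: `predict_and_evaluate(sample_data, tokenizer, model)` for the small sample, and `predict_and_evaluate(emotions_data.sample(n=40000, random_state=42), tokenizer, model)` for the larger subset, where the random state value is arbitrary.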