## Learning Objectives

By the end of this mini-project, you will be able to:

* perform data preprocessing and EDA on the resume dataset
* fine-tune a BERT model for resume classification

## Dataset Description

The data is in CSV format, with two features: Category and Resume.

**Category** - The industry sector to which the resume belongs.

**Resume** - The complete CV (text) of the candidate.

## Information

Companies often receive thousands of resumes for each job posting and employ dedicated screening officers to shortlist qualified candidates. Finding suitable candidates for an open role in a database of thousands of resumes is a tedious task. Automated resume categorization can speed up candidate selection: it eases fair screening and shortlisting and supports quick decision-making.

To learn more about this, click [here](https://www.sciencedirect.com/science/article/pii/S187705092030750X).

**Problem Statement:** Fine-tune a pre-trained BERT model for resume classification.

*For fine-tuning BERT, refer to the Hugging Face platform session held on 17 Aug.*

### Install dependencies

After installing the dependencies below, ***restart the session/runtime***.

In [None]:
!pip -q uninstall pyarrow -y
!pip -q install pyarrow==15.0.2
!pip -q install datasets
!pip -q install accelerate
!pip -q install transformers

# Ignore any error or warning shown after running this cell

### <font color="#990000">Restart Session/Runtime</font>

### Import required packages

In [None]:
from datasets import Dataset, DatasetDict

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
from transformers import pipeline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

### Downloading the data

**Exercise 1: Read the UpdatedResumeDataset.csv dataset [0.5 Mark]**

**Hint:** pd.read_csv( , encoding='utf-8')

In [None]:
# Read the dataset
# YOUR CODE HERE
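
As a minimal sketch of the hint above, the block below reads a tiny in-memory CSV with the same two column names. The rows and the `toy_df` name are made up for illustration; with the real file, the path to `UpdatedResumeDataset.csv` would be passed instead of the buffer.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the real CSV (same column names)
csv_buffer = io.StringIO(
    "Category,Resume\n"
    'Data Science,"Skills: Python, pandas, machine learning"\n'
    'Testing,"Manual testing, Selenium, test plans"\n'
)

# With the real dataset, pass the file path instead of the buffer
toy_df = pd.read_csv(csv_buffer, encoding="utf-8")

print(toy_df.shape)             # (rows, columns)
print(toy_df.columns.tolist())  # column names
```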

### Pre-processing and EDA

**Exercise 2: Display all the categories of resumes and their counts in the dataset**

In [None]:
# Display the distinct categories of resume
# YOUR CODE HERE

In [None]:
# Displaying the number of distinct categories of resume
# YOUR CODE HERE

In [None]:
# Display the distinct categories of resume and the number of records belonging to each category
# YOUR CODE HERE
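
One possible way to approach the three cells above, shown on a made-up `toy_df` rather than the real dataset, is with the pandas methods `unique()`, `nunique()`, and `value_counts()`:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the resume dataset
toy_df = pd.DataFrame({
    "Category": ["HR", "Testing", "HR", "Data Science", "HR", "Testing"],
    "Resume": ["r1", "r2", "r3", "r4", "r5", "r6"],
})

distinct = toy_df["Category"].unique()      # distinct categories, in order of appearance
n_distinct = toy_df["Category"].nunique()   # number of distinct categories
counts = toy_df["Category"].value_counts()  # records per category, most frequent first

print(distinct)
print(n_distinct)
print(counts)
```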

**Exercise 3: Create the count plot of different categories**

**Hint:** Use `sns.countplot()`

In [None]:
# YOUR CODE HERE

**Exercise 4: Create a pie plot depicting the percentage of resume distributions category-wise.**

**Hint:** Use [plt.pie()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.pie.html) and [plt.get_cmap](https://matplotlib.org/stable/tutorials/colors/colormaps.html) for color mapping the pie chart.

In [None]:
targetCounts = df['Category'].value_counts()
targetLabels  = targetCounts.index
# Make square figures and axes
plt.figure(1, figsize=(25,25))
the_grid = GridSpec(2, 2)

# YOUR CODE HERE to display pie chart with color coding (eg. `coolwarm`)
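
The hint can be sketched roughly as follows. This is an illustration on made-up counts (`toy_counts`), and it uses a single `plt.subplots()` axis instead of the `GridSpec` layout in the cell above; the colors are sampled evenly from the `coolwarm` colormap.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical category counts standing in for df['Category'].value_counts()
toy_counts = pd.Series({"HR": 3, "Testing": 2, "Data Science": 1})

# Sample one color per category from the colormap
cmap = plt.get_cmap("coolwarm")
colors = [cmap(x) for x in np.linspace(0, 1, len(toy_counts))]

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(
    toy_counts.values,
    labels=list(toy_counts.index),
    autopct="%1.1f%%",   # print the percentage on each wedge
    colors=colors,
)
ax.set_title("Resume distribution by category")
```

In a notebook the figure is rendered when the cell finishes running.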

**Exercise 5: Convert all the `Resume` text to lowercase and remove trailing spaces**

In [None]:
# Convert all characters to lowercase and remove trailing spaces
# YOUR CODE HERE
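
A minimal sketch of this step on a made-up `toy_df`, using the pandas string accessor. Note that `str.strip()` removes both leading and trailing whitespace; `str.rstrip()` would remove only the trailing part.

```python
import pandas as pd

# Hypothetical toy data with untidy casing and whitespace
toy_df = pd.DataFrame({"Resume": ["  Python Developer  ", "SQL and ETL\n"]})

# Lowercase every resume, then strip surrounding whitespace
toy_df["Resume"] = toy_df["Resume"].str.lower().str.strip()

print(toy_df["Resume"].tolist())
```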

### Cleaning Resume

**Exercise 6: Define a function to clean the resume text**

The text contains special characters, URLs, hashtags, mentions, etc. Remove the following:

* URLs: For reference click [here](https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python)
* RT | cc: For reference click [here](https://www.machinelearningplus.com/python/python-regex-tutorial-examples/)
* Hashtags (#) and mentions (@)
* Punctuation
* Extra whitespace

Note: Use the references above in the same way to remove any other such elements.

After cleaning, store the resume text in a separate column (a new feature, say `Cleaned_Resume`).


In [None]:
import re
def cleanResume(resumeText):
    # YOUR CODE HERE
    #
    #


In [None]:
# Apply the function defined above and save the cleaned text in a new column
df['Cleaned_Resume'] =  # YOUR CODE HERE
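
The cleaning steps listed in Exercise 6 could be sketched with `re.sub` as below. The function name `clean_text_sketch`, the exact patterns, and the sample string are assumptions for illustration, not the reference solution; real resumes may need additional or stricter patterns.

```python
import re
import string

def clean_text_sketch(text):
    """Illustrative cleaner: one regex substitution per item in the list above."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)                       # URLs
    text = re.sub(r"\b(?:rt|cc)\b", " ", text, flags=re.IGNORECASE)     # RT and cc markers
    text = re.sub(r"[#@]\S+", " ", text)                                # hashtags and mentions
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)      # punctuation
    text = re.sub(r"\s+", " ", text).strip()                            # extra whitespace
    return text

sample = "RT @recruiter: Check https://example.com/cv #hiring   Python, SQL & ML!"
print(clean_text_sketch(sample))  # Check Python SQL ML
```

With a DataFrame, a function like this is typically applied row by row with `Series.apply`.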

**Exercise 7: Convert the categorical variable `Category` to a numerical feature and store it in a new column <font color="#990000">`label`</font>, which can be treated as the target variable [0.5 Mark]**

**Hint:** Use [`sklearn.preprocessing.LabelEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) method

In [None]:
from sklearn.preprocessing import LabelEncoder

# YOUR CODE HERE

df["label"] = # YOUR CODE HERE
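
As a small illustration of how `LabelEncoder` behaves (on a made-up `toy_df`, not the real data): it assigns integer ids to the sorted class names, and `inverse_transform` maps ids back to names, which is useful later when interpreting model predictions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy categories
toy_df = pd.DataFrame({"Category": ["HR", "Testing", "HR", "Data Science"]})

encoder = LabelEncoder()
toy_df["label"] = encoder.fit_transform(toy_df["Category"])

print(toy_df)
print(list(encoder.classes_))                # class names in sorted order; index = label id
print(encoder.inverse_transform([0, 1, 2]))  # map label ids back to category names
```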

**Exercise 8: Plot the histogram of word counts of the `Cleaned_Resume` text**

**Hint:** Use `sns.histplot()` (`sns.distplot()` is deprecated since seaborn 0.11)

In [None]:
df['Count'] = # YOUR CODE HERE to count number of words

In [None]:
plt.figure(figsize= (8, 8))

# YOUR CODE HERE

plt.show()
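
One way to count words, sketched on made-up text, is to split each string on whitespace and take the length of the resulting list. The `toy_df` name and its rows are illustrative only.

```python
import pandas as pd

# Hypothetical cleaned resumes, including an empty one
toy_df = pd.DataFrame(
    {"Cleaned_Resume": ["python sql machine learning", "manual testing", ""]}
)

# Split on whitespace and count the tokens per row
toy_df["Count"] = toy_df["Cleaned_Resume"].str.split().str.len()

print(toy_df["Count"].tolist())
```

The resulting `Count` column can then be passed to a histogram function for plotting.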

### Train Test Split

**Exercise 9: Split the dataset into training, validation, and testing sets**

* Do stratified splitting using the `label` column

In [None]:
train_test_df, val_df = # YOUR CODE HERE to split data
train_df, test_df = # YOUR CODE HERE to split data

len(train_df), len(val_df), len(test_df)

In [None]:
train_df.head(2)
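
A three-way stratified split can be sketched as two consecutive calls to `train_test_split`, each stratified on the label column of the frame being split. The toy data, the split ratios (20% validation, then 25% of the remainder for test), and the `random_state` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 40 rows, two classes
toy_df = pd.DataFrame({
    "Cleaned_Resume": [f"resume {i}" for i in range(40)],
    "label": [0, 1] * 20,
})

# Step 1: hold out the validation set
toy_train_test, toy_val = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["label"], random_state=42
)

# Step 2: split the remainder into train and test
toy_train, toy_test = train_test_split(
    toy_train_test, test_size=0.25, stratify=toy_train_test["label"], random_state=42
)

print(len(toy_train), len(toy_val), len(toy_test))  # 24 8 8
print(toy_val["label"].value_counts().to_dict())    # class balance is preserved
```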

### Convert to HuggingFace Dataset

**Exercise 10: Convert Pandas dataframe to HuggingFace Dataset**

**Hint:**

    import pandas as pd
    from datasets import Dataset, DatasetDict

    tdf = pd.DataFrame({"a": [1, 2, 3], "b": ['hello', 'ola', 'thammi']})
    vdf = pd.DataFrame({"a": [4, 5, 6], "b": ['four', 'five', 'six']})
    tds = Dataset.from_pandas(tdf)
    vds = Dataset.from_pandas(vdf)

    ds = DatasetDict()

    ds['train'] = tds
    ds['validation'] = vds

    print(ds)

In [None]:
from datasets import Dataset, DatasetDict

train_ds = # YOUR CODE HERE
val_ds = # YOUR CODE HERE
test_ds = # YOUR CODE HERE

ds = DatasetDict()

ds['train'] = train_ds
ds['validation'] = val_ds
ds['test'] = test_ds

### Tokenizer

**Exercise 11: Load tokenizer for checkpoint `distilbert-base-uncased`**

**Hint:** `AutoTokenizer`

In [None]:
# Load tokenizer
tokenizer = # YOUR CODE HERE

In [None]:
def tokenize_fn(batch):
    # YOUR CODE HERE..


In [None]:
tokenized_datasets = ds.map(tokenize_fn, batched=True)
tokenized_datasets

### Load Pre-Trained Model

**Exercise 12: Load the pre-trained model with checkpoint `distilbert-base-uncased` (DistilBERT, a distilled version of BERT) and show the model summary**

**Hint:** `AutoModelForSequenceClassification`

In [None]:
from transformers import AutoModelForSequenceClassification

model = # YOUR CODE HERE

**Exercise 13: Freeze/Un-Freeze different layers  [0.5 Mark]**

**Hint:** Freeze layers starting with name *distilbert*


In [None]:
# Display layer names

# YOUR CODE HERE

In [None]:
# Freezing

# YOUR CODE HERE..

In [None]:
# Display each layer's requires_grad status

# YOUR CODE HERE..

### Metrics

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [None]:
f1_score(y_true=[1,0,1], y_pred=[1,0,0], average='weighted')

In [None]:
def compute_metrics(logits_and_labels):
    logits, labels = logits_and_labels
    predictions = np.argmax(logits,axis=-1)
    return {'f1_score': f1_score(y_true=labels, y_pred=predictions, average='weighted')}
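
To make the `average='weighted'` setting concrete, the sketch below recomputes the toy example above by hand: it calculates the F1 score per class, weights each by the number of true samples in that class, and compares the result with scikit-learn. The helper `f1_for_class` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1, 0, 1])
y_pred = np.array([1, 0, 0])

def f1_for_class(c):
    """F1 for class c, treating it as the positive class."""
    tp = np.sum((y_pred == c) & (y_true == c))
    fp = np.sum((y_pred == c) & (y_true != c))
    fn = np.sum((y_pred != c) & (y_true == c))
    return 2 * tp / (2 * tp + fp + fn)

# Weight each per-class F1 by its support (number of true samples)
classes, support = np.unique(y_true, return_counts=True)
manual = sum(f1_for_class(c) * n for c, n in zip(classes, support)) / support.sum()

print(manual)
print(f1_score(y_true=y_true, y_pred=y_pred, average="weighted"))
```

Here both classes have an F1 of 2/3, so the weighted average is also 2/3.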


**Exercise 14: Fine-tune the model on the train dataset  [0.5 Mark]**
 * Create a `TrainingArguments` object
 * Create a `Trainer` object
 * Train for a higher number of epochs, say 40 or 50
 * Switch to the GPU runtime if needed

**Hint:** First check that the training code runs without errors on the CPU runtime, then switch to the GPU runtime for faster training. Once trained, save the model, create a zip file of it, and download it to your system.

In [None]:
from transformers import TrainingArguments
from transformers import Trainer

In [None]:
# Set up the training arguments

model_output_path = "/content/bert_model"

training_args = # YOUR CODE HERE..

In [None]:
# Train the model

trainer = # YOUR CODE HERE..

### Save Model

In [None]:
# Save the model
trainer.save_model('saved_bert_model')

In [None]:
!ls

In [None]:
# Create a Zip file and download
!zip -r saved_bert_model.zip saved_bert_model

### Load Model

**Exercise 15: Load the saved model and create a pipeline to perform text classification**

 * Create the pipeline object for text classification
 * Create a `make_prediction` function that uses the pipeline object and returns the predicted label

**Hint:** `pipeline()`

In [None]:
from transformers import pipeline

In [None]:
my_model = # YOUR CODE HERE.. to load text classification pipeline

In [None]:
# Function to predict label for a resume text

def make_prediction(input_text):

    # YOUR CODE HERE..


In [None]:
# Test prediction
make_prediction('programming, web designing, coding')

In [None]:
make_prediction('continuous integration and continuous delivery')

In [None]:
make_prediction('law student and journalist')

In [None]:
make_prediction('machine learning, data, EDA, big data, neural networks')
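
A text-classification pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. Unless the model config defines its own label names, labels appear as `LABEL_<id>`. The sketch below shows one way to turn such output into a category name; `fake_pipeline`, `class_names`, and `make_prediction_sketch` are hypothetical stand-ins so the example runs without a trained model.

```python
# Stand-in for a real pipeline object; it mimics the output structure only
def fake_pipeline(text):
    return [{"label": "LABEL_1", "score": 0.91}]

# Hypothetical class names in LabelEncoder order (list index = numeric label)
class_names = ["Data Science", "HR", "Testing"]

def make_prediction_sketch(input_text, classifier=fake_pipeline):
    """Return the category name for the top prediction."""
    result = classifier(input_text)[0]               # first (only) input
    label_id = int(result["label"].split("_")[-1])   # "LABEL_1" -> 1
    return class_names[label_id]

print(make_prediction_sketch("recruitment, payroll, onboarding"))
```

In the notebook, the real pipeline object would replace `fake_pipeline`, and the class names would come from the fitted label encoder.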

## **Optional**: Create a Gradio-based web interface to test and display the model predictions

In [None]:
!pip -q install gradio

In [None]:
import gradio

In [None]:
# Textbox to take Input from user
in_text = gradio.Textbox(lines=10, placeholder=None, value="Enter resume text here", label='Resume Text')


# Textbox to display Output prediction
out_label = gradio.Textbox(type="text", label='Predicted Class Label')


# Gradio interface to create UI
iface = gradio.Interface(fn = make_prediction,             # fine-tuned model is used inside this function
                         inputs = [in_text],
                         outputs = [out_label],
                         title = "Resume Classification",
                         description = "Using fine-tuned Bert model",
                         allow_flagging = 'never')


# Launch interface
iface.launch(share = True)