
# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: Resume Classification

## Learning Objectives

At the end of the mini-project, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Resume dataset
* perform multinomial Naive Bayes classification on the Resume dataset

### Dataset description

The data is in CSV format, with two features: Category and Resume.

**Category** - Industry sector to which the resume belongs, and

**Resume** - The complete CV (text) of the candidate.

##  Grading = 10 Points

## Information

Companies often receive thousands of resumes for each job posting and employ dedicated screening officers to shortlist qualified candidates. Finding suitable candidates for an open role in a database of thousands of resumes is a tough task. Automated resume categorization can speed up candidate selection, ease the tedious work of fair screening and shortlisting, and support quick decision-making.

To learn more about this, click [here](https://www.sciencedirect.com/science/article/pii/S187705092030750X).

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import accuracy_score
from pandas.plotting import scatter_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from matplotlib.gridspec import GridSpec
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
import string
from wordcloud import WordCloud

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


#### Downloading the data

In [None]:
#@title Download the dataset
!wget -qq https://cdn.iisc.talentsprint.com/CDS/Datasets/UpdatedResumeDataSet.csv
print("Data Downloaded Successfully!!")

Data Downloaded Successfully!!


**Exercise 1: Read the UpdatedResumeDataSet.csv dataset [0.5 Mark]**

**Hint:** pd.read_csv()

In [None]:
# read the dataset
# YOUR CODE HERE
df = pd.read_csv('UpdatedResumeDataSet.csv')

### Pre-processing and EDA

**Exercise 2: Display all the categories of resumes and their counts in the dataset [0.5 Mark]**

In [None]:
# Display the distinct categories of resume
# YOUR CODE HERE
df['Category'].unique()

array(['Data Science', 'HR', 'Advocate', 'Arts', 'Web Designing',
       'Mechanical Engineer', 'Sales', 'Health and fitness',
       'Civil Engineer', 'Java Developer', 'Business Analyst',
       'SAP Developer', 'Automation Testing', 'Electrical Engineering',
       'Operations Manager', 'Python Developer', 'DevOps Engineer',
       'Network Security Engineer', 'PMO', 'Database', 'Hadoop',
       'ETL Developer', 'DotNet Developer', 'Blockchain', 'Testing'],
      dtype=object)

In [None]:
# Display the distinct categories of resume and the number of records belonging to each category
# YOUR CODE HERE
df['Category'].value_counts()

Java Developer               84
Testing                      70
DevOps Engineer              55
Python Developer             48
Web Designing                45
HR                           44
Hadoop                       42
Blockchain                   40
ETL Developer                40
Operations Manager           40
Data Science                 40
Sales                        40
Mechanical Engineer          40
Arts                         36
Database                     33
Electrical Engineering       30
Health and fitness           30
PMO                          30
Business Analyst             28
DotNet Developer             28
Automation Testing           26
Network Security Engineer    25
SAP Developer                24
Civil Engineer               24
Advocate                     20
Name: Category, dtype: int64

**Exercise 3: Create the count plot of different categories [0.5 Mark]**

**Hint:** Use `sns.countplot()`

In [None]:
# YOUR CODE HERE
plt.figure(figsize=(25, 3))
# Name the column explicitly and order the bars by frequency
sns.countplot(data=df, x='Category', order=df['Category'].value_counts().index)
plt.xticks(rotation=90)
plt.tight_layout()

In [None]:
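# Equivalent bar plot built from the pre-computed category counts
counts = df['Category'].value_counts().rename_axis('Category').reset_index(name='count')
sns.barplot(data=counts, x='Category', y='count')
plt.xticks(rotation=90)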

**Exercise 4: Create a pie plot depicting the percentage of resume distributions category-wise [0.5 Mark]**

**Hint:** Use [plt.pie()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.pie.html) and [plt.get_cmap](https://matplotlib.org/stable/tutorials/colors/colormaps.html) for color mapping the pie chart.

In [None]:
targetCounts = df['Category'].value_counts()
# Take the labels from the counts index so each label lines up with its count
targetLabels = targetCounts.index
# Make square figures and axes
plt.figure(1, figsize=(25,25))
the_grid = GridSpec(2, 2)

# YOUR CODE HERE to display pie chart with color coding (eg. `coolwarm`)
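
One possible way to complete the cell above is sketched here on a small made-up frame so that it runs on its own. The `toy` DataFrame and the figure size are assumptions for illustration; in the notebook you would use `df` and the `targetCounts` already computed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the resume DataFrame
toy = pd.DataFrame({"Category": ["HR", "Sales", "HR", "Testing", "HR", "Sales"]})

counts = toy["Category"].value_counts()
labels = list(counts.index)  # same order as the counts

# Sample one colour per category from the `coolwarm` colormap
cmap = plt.get_cmap("coolwarm")
colors = [cmap(i / max(len(counts) - 1, 1)) for i in range(len(counts))]

fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(
    counts.values, labels=labels, colors=colors, autopct="%1.1f%%"
)
ax.set_title("Category distribution")
```

Taking the labels from `counts.index` rather than from `unique()` keeps every wedge paired with its own category name.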

**Exercise 5: Convert all the `Resume` text to lower case [0.5 Mark]**

In [None]:
# Convert all characters to lowercase
# YOUR CODE HERE

### Cleaning resumes' text data

**Exercise 6: Define a function to clean the resume text [2 Marks]**

The text contains special characters, URLs, hashtags, mentions, etc. Remove the following:  

* URLs: For reference click [here](https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python)
* RT | cc: For reference click [here](https://www.machinelearningplus.com/python/python-regex-tutorial-examples/)
* Hashtags (#) and mentions (@)
* Punctuation
* Extra whitespace

PS: Use the provided references in the same way to remove any other such elements.

After cleaning as above, store the cleaned resume text in a separate column (new feature).


In [None]:
import re
def cleanResume(resumeText):
  # YOUR CODE HERE

In [None]:
# apply the function defined above and save the result in a new column, e.g. df['cleaned_resume']
# YOUR CODE HERE
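
As a rough illustration of the cleaning steps listed in Exercise 6, here is a minimal sketch built only on the standard `re` and `string` modules. The function name `clean_resume_sketch`, the exact regular expressions, and the sample string are assumptions for illustration, not the reference solution.

```python
import re
import string

def clean_resume_sketch(text):
    """Remove URLs, RT/cc markers, hashtags, mentions, punctuation and extra whitespace."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)                      # URLs
    text = re.sub(r"\b(?:RT|cc)\b", " ", text, flags=re.IGNORECASE)    # RT and cc markers
    text = re.sub(r"#\S+", " ", text)                                  # hashtags (whole tag)
    text = re.sub(r"@\S+", " ", text)                                  # mentions
    text = re.sub(r"[%s]" % re.escape(string.punctuation), " ", text)  # punctuation
    text = re.sub(r"\s+", " ", text)                                   # collapse whitespace
    return text.strip()

# Made-up sample text
sample = "RT @recruiter: Python dev, 5+ yrs!  See https://example.com/cv #hiring   cc team"
cleaned = clean_resume_sketch(sample)
print(cleaned)
```

In the notebook, something like `df['cleaned_resume'] = df['Resume'].apply(clean_resume_sketch)` would store the result in the new column. Note that this sketch drops the whole hashtag or mention; stripping only the `#` or `@` symbol is an equally reasonable reading of the exercise.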

In [None]:
sent_lens = []
for i in df.cleaned_resume:
    length = len(i.split())
    sent_lens.append(length)

print(len(sent_lens))
print(max(sent_lens))

### Stopwords removal

Stopwords such as `and`, `the`, and `was` appear very frequently in text and carry little predictive signal, so they are usually removed before text analytics and text classification.

1. Tokenize the input text into individual tokens and store them in a list
2. Using `nltk.corpus.stopwords`, remove the stopwords

Hint: See Module 1 - Assignment 4 'Text Classification using Naive Bayes'


**Exercise 7: Use the `nltk` package to find the most common words in the `cleaned_resume` column [2 Marks]**

**Hint:**
* Use `nltk.FreqDist`


In [None]:
# stop words
# YOUR CODE HERE to print the stop words in the English language

In [None]:
# most common words
# YOUR CODE HERE
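
The two steps described above (tokenize, then drop stopwords) followed by a frequency count can be sketched as below. To stay runnable without downloads, this sketch uses `collections.Counter` and a tiny hand-written stopword set; both are stand-ins. In the notebook, `stopwords.words('english')` would supply the stopwords, and `nltk.FreqDist` offers the same `most_common` method.

```python
from collections import Counter

# Tiny illustrative stopword set; an assumption, not the NLTK list
toy_stopwords = {"and", "the", "was", "in", "of", "a"}

# Made-up documents standing in for df['cleaned_resume']
docs = [
    "the developer was skilled in python and sql",
    "python and java developer",
]

# Tokenize on whitespace and drop stopwords
tokens = [w for doc in docs for w in doc.split() if w not in toy_stopwords]

freq = Counter(tokens)
top = freq.most_common(2)
print(top)
```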

In [None]:
# YOUR CODE HERE to show the most common word using WordCloud

**Exercise 8: Convert the categorical variable `Category` to a numerical feature and make a different column, which can be treated as the target variable [0.5 Mark]**

**Hint:** Use [`sklearn.preprocessing.LabelEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) method

In [None]:
from sklearn.preprocessing import LabelEncoder

# YOUR CODE HERE
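
A minimal sketch of label encoding on a made-up frame is shown below. The `toy` DataFrame and the column name `label` are assumptions for illustration; the notebook would apply the same calls to `df['Category']`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the resume DataFrame
toy = pd.DataFrame({"Category": ["HR", "Sales", "HR", "Testing"]})

le = LabelEncoder()
# Classes are sorted alphabetically and mapped to 0..n_classes-1
toy["label"] = le.fit_transform(toy["Category"])

# inverse_transform recovers the original category names from the integer codes
recovered = le.inverse_transform([2])
```

Keeping the fitted encoder around is useful later, because `inverse_transform` turns the model's integer predictions back into category names.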

### Feature Extraction

**Exercise 9: Convert the text to feature vectors by applying the TF-IDF vectorizer to the cleaned resume text; the label-encoded category made above serves as the target [2 Marks]**

`TfidfVectorizer` tokenizes the documents, learns the vocabulary and the inverse document frequency weights, and lets you encode new documents with the same mapping.

**Hint:** Use [`TfidfVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).



In [None]:
# YOUR CODE HERE
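
The sketch below shows the fit/transform pattern of `TfidfVectorizer` on three made-up documents. The corpus and the choice of `stop_words="english"` are assumptions for illustration; in the notebook the input would be the cleaned resume column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for the cleaned resumes
toy_docs = [
    "python developer sql",
    "java developer spring",
    "sales manager retail",
]

vectorizer = TfidfVectorizer(stop_words="english")
# fit_transform learns the vocabulary and IDF weights, then encodes the documents
X = vectorizer.fit_transform(toy_docs)

# transform encodes a new document with the already-learned vocabulary;
# words not seen during fitting (here "analyst") are ignored
new_vec = vectorizer.transform(["python sql analyst"])
```

The result is a sparse matrix with one row per document and one column per vocabulary term, which can be passed directly to `MultinomialNB`.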

## Naive Bayes Classifier

**Exercise 10: Split the data into train and test sets. Apply the Naive Bayes classifier (MultinomialNB) and evaluate the model predictions [1 Mark]**

**Hint:** Use the vectorized features made above as X and the label-encoded category as y.

In [None]:
# YOUR CODE HERE
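
The split, fit, and evaluate steps can be sketched end to end as follows. The twelve short texts, their three classes, the 25% test size, and `random_state=42` are all assumptions chosen so the example runs on its own; they are not taken from the resume dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus: 4 documents for each of 3 classes
texts = [
    "python pandas numpy developer", "python django flask developer",
    "python machine learning pandas", "python scripting automation developer",
    "recruitment payroll employee relations", "hr onboarding recruitment policy",
    "employee engagement payroll hr", "talent acquisition recruitment hr",
    "sales targets retail customers", "customer sales negotiation revenue",
    "retail sales marketing revenue", "sales pipeline customers marketing",
]
labels = [0] * 4 + [1] * 4 + [2] * 4

# Vectorized features as X, encoded categories as y (as the hint suggests)
X = TfidfVectorizer().fit_transform(texts)

# stratify keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.2f}")
```

A stricter variant fits the vectorizer on the training split only and then calls `transform` on the test split, so that no information from the test documents enters the IDF weights.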

## Optional: Create a Gradio based web interface to test and display the model predictions

**Report Analysis**
- Which method(s) other than TF-IDF could be used for text-to-vector conversion?
- Discuss the `alpha`, `class_prior` and `fit_prior` parameters of sklearn's `MultinomialNB`


Several methods other than TF-IDF (Term Frequency-Inverse Document Frequency) can be used for text-to-vector conversion in natural language processing (NLP). Common ones include:

1. **Bag of Words (BoW)**: BoW represents a text as a vector with one dimension per unique word in the corpus. Each element holds the count of the corresponding word in the text, ignoring word order and context. Raw word frequency counts are the same representation.

2. **Word Embeddings (Word2Vec, GloVe)**: Dense vector representations of words that capture semantic relationships. They are typically pre-trained on large text corpora; a document vector can be obtained by, for example, averaging the vectors of its words.

3. **Doc2Vec (Paragraph Vectors)**: An extension of Word2Vec that learns a vector for each document alongside the word vectors, so an entire document is represented by a single dense vector.

4. **FastText**: An extension of Word2Vec that builds word vectors from character n-gram (subword) embeddings. This helps with out-of-vocabulary words and morphological variation.

5. **Latent Semantic Analysis (LSA)**: Applies singular value decomposition (SVD) to a term-document matrix to capture latent semantic relationships between terms and documents, yielding a lower-dimensional representation.

6. **One-Hot Encoding**: Each word is represented by a binary vector in which a single element is 1 (identifying that word) and all others are 0.

7. **Hashing Vectorizer**: A memory-efficient approach that hashes each word to an index in a fixed-size vector, so no vocabulary needs to be stored. Different words can collide on the same index.

8. **BERT Embeddings**: Bidirectional Encoder Representations from Transformers (BERT) produces contextual embeddings, in which a word's vector depends on the surrounding words. Fine-tuning BERT on a specific task is also common.

The choice of method depends on the specific NLP task and the characteristics of your data. TF-IDF, Word2Vec, and BERT embeddings are among the most popular methods, but the suitability of each method should be determined based on the problem you are trying to solve.
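
To make two of the alternatives above concrete, the sketch below compares scikit-learn's `CountVectorizer` (bag of words) with `HashingVectorizer` on two made-up documents. The documents and the choice of 16 hash buckets are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["python developer python", "java developer"]

# Bag of words: one column per vocabulary term, values are raw counts
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)
vocab = sorted(bow.vocabulary_)  # column order is alphabetical

# Hashing: fixed number of columns, no stored vocabulary.
# alternate_sign=False and norm=None keep the values as plain counts.
hasher = HashingVectorizer(n_features=16, alternate_sign=False, norm=None)
X_hash = hasher.transform(docs)

print(vocab)
print(X_bow.toarray())
print(X_hash.shape)
```

The bag-of-words matrix grows with the vocabulary, while the hashed matrix always has `n_features` columns, at the cost of not being able to map a column back to a word.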

Scikit-learn's `MultinomialNB` (Multinomial Naive Bayes) classifier exposes the parameters `alpha`, `class_prior`, and `fit_prior`, which influence how the model is trained and how it makes predictions:

1. **alpha**:
   - `alpha` is the additive smoothing hyperparameter used to handle zero probabilities. `alpha=1` is known as Laplace (add-one) smoothing; values between 0 and 1 are known as Lidstone smoothing.
   - Smoothing avoids a zero probability when a feature (word or term) in the test data was never seen in the training data for a particular class. It adds `alpha` to the count of every feature in every class.
   - A smaller `alpha` means less smoothing, so the model relies more on the observed counts. A larger `alpha` means more smoothing, which pulls the per-class word probabilities toward a uniform distribution and makes the model less sensitive to rare features.
   - The default value is 1.0 (Laplace smoothing). You can tune it, for example by cross-validation, to suit your data.

2. **class_prior**:
   - `class_prior` is an optional array-like of shape `(n_classes,)` that specifies the prior probability of each class in the target variable. The default is `None`.
   - If `class_prior` is given, the priors are fixed to these values and are not adjusted according to the data, whatever the value of `fit_prior`.
   - If `class_prior` is `None`, the priors are determined by `fit_prior`.
   - Specifying `class_prior` is useful when you want to give more weight to certain classes or when the class distribution in your training data does not reflect the real-world distribution.

3. **fit_prior**:
   - `fit_prior` is a Boolean (default `True`) that determines whether class prior probabilities are learned from the training data.
   - If `fit_prior=True`, the classifier estimates the priors from the relative class frequencies in the training data.
   - If `fit_prior=False`, the classifier uses a uniform prior, giving every class the same prior probability.
   - Setting `fit_prior=False` is useful when the class imbalance in the training data is an artifact of how the data was collected and should not bias predictions toward the frequent classes.

In summary, `alpha` controls the amount of smoothing applied to the word probabilities, `class_prior` fixes the class priors to values you supply, and `fit_prior` chooses between priors learned from the data and a uniform prior when `class_prior` is not given. Together these parameters let you adapt Multinomial Naive Bayes to the specifics of your classification problem.
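
The effect of the three parameters can be checked on a tiny example. The count matrix `X`, the 3:1 class imbalance in `y`, and the custom prior `[0.2, 0.8]` below are made-up values for illustration; the attributes `feature_log_prob_` and `class_log_prior_` are the fitted quantities that scikit-learn exposes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 4 documents, 2 words; class 0 has 3 documents, class 1 has 1
X = np.array([[2, 0], [3, 1], [1, 0], [0, 2]])
y = np.array([0, 0, 0, 1])

# alpha: additive smoothing of the per-class word likelihoods.
# Word 0 never occurs in class 1, yet its smoothed likelihood stays positive:
# (0 + 1) / (2 + 1 * 2) = 0.25
clf = MultinomialNB(alpha=1.0).fit(X, y)
p_word0_class1 = np.exp(clf.feature_log_prob_[1, 0])

# fit_prior=True (default): priors are the class frequencies in the training data
p_learned = np.exp(MultinomialNB(fit_prior=True).fit(X, y).class_log_prior_)

# fit_prior=False with no class_prior: uniform priors
p_uniform = np.exp(MultinomialNB(fit_prior=False).fit(X, y).class_log_prior_)

# class_prior given: priors are fixed to the supplied values
p_custom = np.exp(MultinomialNB(class_prior=[0.2, 0.8]).fit(X, y).class_log_prior_)

print(p_word0_class1, p_learned, p_uniform, p_custom)
```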

Dataset Source Reference: [Resume dataset](https://www.kaggle.com/gauravduttakiit/resume-dataset/download)