Please find my answers to the assignment questions in-line.

### Imports

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt 
import time
import sys
import re
import gc
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords

In [None]:
!unzip /kaggle/input/job-salary-prediction/Train_rev1.zip

In [None]:
data = pd.read_csv("Train_rev1.csv")
data = data.sample(2500, random_state = 42) # Randomly selecting 2500 samples from data
data.head()

In [None]:
# Label each posting as above or below the 75th percentile of normalized salary
data['Percentile'] = data['SalaryNormalized'].rank(pct = True)
data.loc[data['Percentile'] > 0.75, 'Target'] = 'Above'
data.loc[data['Percentile'] <= 0.75, 'Target'] = 'Below'  # <= so a rank of exactly 0.75 is still labeled
data.head()

# 1. Build a classification model with text (full job description) as the predictor.
What is the accuracy of your model? Show the confusion matrix. Also show the top 10 words (excluding stopwords) that are most indicative of (i) high salary and (ii) low salary.


---------
The accuracy of the model is ~75% on the test set.

**Low Salary (Below 75th Percentile):**
The most indicative words are **shifts, teacher, teachers, weekends, telesales, hospitality, chef, monday, caterer, friday, secondary**. At a high level, job postings related to telesales, teaching, and cooking (chef, caterer) appear to be associated with low salary. Additionally, mentions of days such as Monday and Friday in the job descriptions are a sign of low salary as well.

**High Salary (Above 75th Percentile):**
The most indicative words are **alignment, mod, implementations, emea, milestones, worlds, soa, architects, multisite, cmd**. Job postings that mention EMEA (Europe, the Middle East and Africa) or SOA (service-oriented architecture) appear to be associated with high salary.

### Pre-processing text

In [None]:
lemmatizer = WordNetLemmatizer()  # needs nltk.download('wordnet')
stop_words = set(stopwords.words('english'))  # built once; needs nltk.download('stopwords')
tokenizer = RegexpTokenizer(r'\w+')
tag_pattern = re.compile('<.*?>')

def preprocess(sentence):
    """Lower-case, strip HTML tags, URLs and digits, drop stopwords and short tokens, then lemmatize."""
    sentence = str(sentence).lower()
    sentence = sentence.replace('{html}', "")
    cleantext = tag_pattern.sub('', sentence)
    rem_url = re.sub(r'http\S+', '', cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]
    lemma_words = [lemmatizer.lemmatize(w) for w in filtered_words]
    # Return the lemmatized tokens so that lemmatization is actually applied
    return " ".join(lemma_words)

data['FullDescription_Clean'] = data['FullDescription'].map(preprocess)

## Binary Counter

Each feature indicates whether a word is present in an individual description, regardless of how often it occurs.

In [None]:
# Features (cleaned descriptions) and labels
X, y = data['FullDescription_Clean'], data['Target']

# Binary count vectorizer: 1 if the word occurs in the description, else 0
cv = CountVectorizer(binary = True)
tdata = cv.fit_transform(X)
ft = cv.get_feature_names_out()  # requires scikit-learn >= 1.0
# NLTK expects a list of (feature_dict, label) pairs
full_set = [(dict(zip(ft, row)), label) for row, label in zip(tdata.toarray(), y)]

# Train-test split
train_set, test_set = train_test_split(full_set, test_size=0.2, random_state=42)

In [None]:
# Model 
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

In [None]:
# Custom function to print the most informative features for each class
def imp_features(self, n=30):
    for unique_label in list(set(self._labels)):
        # Determine the most relevant features, and display them.
        cpdist = self._feature_probdist
        print(f"Most Informative Features for {unique_label}:\n")
        ctr=1 # Counter
        for (fname, fval) in self.most_informative_features(100000000):
            def labelprob(l):
                return cpdist[l, fname].prob(fval)
            labels = sorted(
                (l for l in self._labels if fval in cpdist[l, fname].samples()),
                key=lambda element: (-labelprob(element), element),
                reverse=True,
            )
            l0 = labels[0]
            l1 = labels[-1]
            if len(labels) == 1 or l0 == unique_label:
                continue
            if cpdist[l0, fname].prob(fval) == 0:
                ratio = "INF"
            else:
                ratio = "%8.1f" % (
                    cpdist[l1, fname].prob(fval) / cpdist[l0, fname].prob(fval)
                )
            print(
                "%24s = %-14r %6s : %-6s = %s : 1.0"
                % (fname, fval, ("%s" % l1)[:6], ("%s" % l0)[:6], ratio)
            )
            ctr += 1
            if ctr > n:  # stop once n features have been printed for this label
                break

imp_features(classifier, 20)

In [None]:
# Predicted and true labels for the test set
y_pred = [classifier.classify(features) for features, _ in test_set]
y_test = [label for _, label in test_set]

# Confusion matrix: rows are true labels, columns are predicted labels
class_names = ['Above', 'Below']
print("Confusion Matrix:\n")
mat = confusion_matrix(y_test, y_pred, labels=class_names)
sns.heatmap(mat, annot=True, fmt='g', xticklabels=class_names, yticklabels=class_names)
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()

# Classification Report
from sklearn import metrics
print("\nClassification Report:\n", metrics.classification_report(y_test, y_pred))

# 2. If you wanted to increase the accuracy of the model above, how can you accomplish this using the dataset you have?
---------

We have already incorporated **text pre-processing** techniques such as removing stop words, lemmatization, lower-casing, and removing numbers.

**Vectorization:** Our model uses binary indicators that record only whether a word is present in a document. Count Vectorizer (raw term frequencies) or Tfidf Vectorizer (term frequencies down-weighted for words that are common across documents) can be used as alternatives.
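
A minimal sketch of the TF-IDF alternative, assuming scikit-learn's `TfidfVectorizer` and `MultinomialNB` combined in a pipeline instead of the NLTK classifier used above. The four toy descriptions and their labels are invented for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for data['FullDescription_Clean'] and data['Target'] (hypothetical values)
toy_docs = [
    "senior architect lead emea implementations",
    "solutions architect soa milestones",
    "chef weekend shifts hospitality",
    "telesales monday friday shifts",
]
toy_labels = ["Above", "Above", "Below", "Below"]

# TF-IDF down-weights terms that occur in many documents before Naive Bayes sees them
tfidf_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
tfidf_model.fit(toy_docs, toy_labels)

toy_pred = tfidf_model.predict(["architect soa emea"])
print(toy_pred)
```

On the real data, the same pipeline could be fit on the training split and its accuracy compared against the binary-indicator model.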

**N-gram:** Bigrams are tokens made of two consecutive words, and trigrams of three. These can be used instead of, or alongside, our unigram approach to see whether accuracy improves.

**Log-probabilities:** Naive Bayes (named for its naive independence assumption) multiplies many per-word probabilities, which produces very small numbers. For long descriptions the product can underflow floating-point precision and collapse to zero. To circumvent this, log-probabilities can be summed instead of multiplying the probabilities directly.
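
The underflow can be seen in a small standalone sketch. The per-word likelihoods below are hypothetical values chosen only to make the effect visible.

```python
import math

# Hypothetical per-word likelihoods for one long document under two classes
probs_above = [1e-5] * 1000
probs_below = [2e-5] * 1000

# Direct multiplication underflows to 0.0 for both classes, so they cannot be compared
direct_above = math.prod(probs_above)
direct_below = math.prod(probs_below)

# Summing logs keeps the scores finite and preserves their ordering
log_above = sum(math.log(p) for p in probs_above)
log_below = sum(math.log(p) for p in probs_below)

print(direct_above, direct_below)
print(log_above < log_below)
```

Because the logarithm is monotonic, the class with the larger log-score is the same class that would have had the larger product.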

**Additional Features:** Fields such as **Job Title, Location, Contract Type, Category** can be used as predictors for each job posting in addition to the description text. Since our salary predictions are based on text data only, adding information such as seniority, department, location, and full-time/part-time status could significantly improve our model's performance.
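
One possible way to combine the text with categorical fields is scikit-learn's `ColumnTransformer`. This is a sketch under assumptions: the column names `ContractType` and `Category` and all toy rows are illustrative, and missing values in the real columns would need to be filled first.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with assumed column names and invented rows
toy = pd.DataFrame({
    "FullDescription_Clean": [
        "solutions architect soa emea",
        "chef weekend shifts",
        "lead architect implementations",
        "telesales monday friday",
    ],
    "ContractType": ["full_time", "part_time", "full_time", "part_time"],
    "Category": ["IT Jobs", "Hospitality Jobs", "IT Jobs", "Sales Jobs"],
    "Target": ["Above", "Below", "Above", "Below"],
})

# Text column -> binary bag of words; categorical columns -> one-hot indicators
features = ColumnTransformer([
    ("text", CountVectorizer(binary=True), "FullDescription_Clean"),
    ("cats", OneHotEncoder(handle_unknown="ignore"), ["ContractType", "Category"]),
])
combined_model = make_pipeline(features, MultinomialNB())
combined_model.fit(toy.drop(columns="Target"), toy["Target"])

new_posting = pd.DataFrame({
    "FullDescription_Clean": ["architect emea"],
    "ContractType": ["full_time"],
    "Category": ["IT Jobs"],
})
combined_pred = combined_model.predict(new_posting)
print(combined_pred)
```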

**Sentiment Analysis Results** can also be incorporated as a feature.

**Synonyms:** The performance of Naive Bayes can degrade if the data contains highly correlated features, because such features are effectively counted twice in the model, inflating their importance. Synonyms are one source of such correlation, so mapping them to a single token may help.
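
A minimal sketch of merging synonyms before vectorization. The `SYNONYMS` map and the `merge_synonyms` helper are hypothetical; a real mapping would have to be built from the corpus or a thesaurus.

```python
# Hypothetical synonym map from variant to canonical token
SYNONYMS = {"educator": "teacher", "instructor": "teacher", "cook": "chef"}

def merge_synonyms(text, mapping=SYNONYMS):
    """Replace each token with its canonical form so synonyms share one feature."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())

merged = merge_synonyms("secondary educator and instructor wanted")
print(merged)
```

Applied to the cleaned descriptions, this would collapse correlated synonym features into one before the classifier is trained.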

# End of Assignment