# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Amay Viswanathan Iyer
#### Student ID: 3970066

Date: October 6, 2023

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* os
* pandas
* re
* nltk
* random

## Introduction
In Task 1, I loaded the job advertisements from the data folder, tokenized them according to the specifications provided, and built a vocabulary from the cleaned tokens.

## Importing libraries 

In [38]:
# Importing the libraries needed for this assessment
import os
import pandas as pd
import re

# For NLP tasks, I imported the regular-expression tokenizer from NLTK
from nltk.tokenize import RegexpTokenizer


### 1.1 Examining and loading data
- Examine the data folder, including the categories and job advertisement txt documents, etc. Explain your findings here, e.g., number of folders and format of txt files, etc.
- Load the data into proper data structures and get it ready for processing.
- Extract webIndex and description into proper data structures.


In [39]:
# Code to inspect the provided data file...
# Defining a function to extract data from each job advertisement
def extract_data_from_folder(base_path):
    data = []
    
    # Listing all category folders
    categories = os.listdir(base_path)
    
    for category in categories:
        category_path = os.path.join(base_path, category)
        
        # Making sure I'm looking only at folders and not stray files
        if os.path.isdir(category_path):
            files = os.listdir(category_path)
            
            for file in files:
                file_path = os.path.join(category_path, file)
                
                with open(file_path, 'r', encoding='utf-8') as f:
                    # The description is normally everything after the Webindex line.
                    # A regex split could separate title, webindex, and description,
                    # but for simplicity the full content is kept here and tokenized
                    # with the regex in later cells.
                    content = f.read()
                    data.append([category, file.split('.')[0], content])
                    
    return pd.DataFrame(data, columns=['Category', 'Job_ID', 'Content'])

# Assuming the base path is 'data' 
base_path = 'data'
df = extract_data_from_folder(base_path)

# displaying the first few rows of the dataframe
print(df.head())


  Category     Job_ID                                            Content
0    Sales  Job_00776  Title: Estate Agency Senior Sales Negotiator\n...
1    Sales  Job_00762  Title: Export Sales Executive (French & German...
2    Sales  Job_00763  Title: GRADUATE SALES ENGINEER\nWebindex: 6825...
3    Sales  Job_00749  Title: Sales Representative / Lead Generator\n...
4    Sales  Job_00761  Title: Search Recruitment Consultant  Media an...
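
The loader above keeps each file's full text in `Content`, so the title and webindex lines are tokenized together with the description. The sketch below shows one way the webindex and description could be pulled out separately. It assumes each file has a `Webindex:` line (visible in the output above) and a `Description:` line (an assumption, since the truncated output does not show it). `parse_job_ad` is a hypothetical helper and is not part of the original notebook.

```python
import re

# "Webindex:" appears in the sample output; "Description:" is an assumed label.
WEBINDEX_RE = re.compile(r"^Webindex:\s*(\d+)", re.MULTILINE)
DESCRIPTION_RE = re.compile(r"^Description:\s*(.*)", re.MULTILINE | re.DOTALL)

def parse_job_ad(content):
    """Return (webindex, description); either is None if its label is missing."""
    web = WEBINDEX_RE.search(content)
    desc = DESCRIPTION_RE.search(content)
    return (
        web.group(1) if web else None,
        desc.group(1).strip() if desc else None,
    )

# Made-up sample text that follows the assumed layout
sample = "Title: Sales Executive\nWebindex: 12345\nDescription: Great role.\nApply now."
webindex, description = parse_job_ad(sample)
print(webindex)
print(description)
```

If the real files follow this layout, the helper could be applied to `df['Content']` to fill separate webindex and description columns before tokenizing.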


### 1.2 Pre-processing data
Perform the required text pre-processing steps.


In [40]:
# Tokenizing: each token is a run of letters, optionally followed by one hyphen or apostrophe and more letters
tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

def tokenize(text):
    return tokenizer.tokenize(text.lower())

df['Tokens'] = df['Content'].apply(tokenize)
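
To see what the pattern above keeps and drops, here is a small sketch that applies the same regular expression with `re.findall`. For a pattern like this one, with no capturing groups, that should give the same tokens as `RegexpTokenizer`. The function name `tokenize_demo` and the sample sentences are made up for illustration.

```python
import re

# Same pattern as the tokenizer cell above
PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

def tokenize_demo(text):
    """Lowercase the text and return every non-overlapping match of PATTERN."""
    return re.findall(PATTERN, text.lower())

simple = tokenize_demo("Part-time role at Britain's top firm, 24/7 on-call!")
multi_hyphen = tokenize_demo("state-of-the-art")
print(simple)
print(multi_hyphen)
```

Digits and punctuation are dropped, and because the optional group can match only once, a word with several hyphens is split into more than one token.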


In [41]:
# Removing tokens that are shorter than two characters
def remove_small_words(tokens):
    return [token for token in tokens if len(token) > 1]

df['Tokens'] = df['Tokens'].apply(remove_small_words)


In [42]:
# Loading the stopwords file (one stopword per line)
with open("stopwords_en.txt", "r") as file:
    # A set gives fast membership checks in remove_stopwords
    stopwords = set(file.read().splitlines())

def remove_stopwords(tokens):
    return [token for token in tokens if token not in stopwords]

df['Tokens'] = df['Tokens'].apply(remove_stopwords)


In [43]:
# Removing rare words: tokens that occur only once across the whole collection (term frequency)
all_tokens = [token for sublist in df['Tokens'].tolist() for token in sublist]
freq = pd.Series(all_tokens).value_counts()

def remove_rare_words(tokens):
    return [token for token in tokens if freq[token] > 1]

df['Tokens'] = df['Tokens'].apply(remove_rare_words)


In [44]:

# Removing the 50 most frequent words by term frequency (value_counts sorts in descending order)
top_50 = set(freq.head(50).index)

def remove_top_frequent(tokens):
    return [token for token in tokens if token not in top_50]

df['Tokens'] = df['Tokens'].apply(remove_top_frequent)
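
The two cells above rank words by term frequency, i.e. the total number of occurrences across all advertisements. Some specifications define "most frequent" by document frequency instead, i.e. the number of advertisements a word appears in. Whether that applies here is an assumption, so the sketch below only illustrates the difference on a toy corpus. `term_and_doc_freq` is a hypothetical helper.

```python
from collections import Counter

def term_and_doc_freq(token_lists):
    """Return (term_freq, doc_freq) Counters for a list of token lists."""
    term_freq = Counter()
    doc_freq = Counter()
    for tokens in token_lists:
        term_freq.update(tokens)      # counts every occurrence
        doc_freq.update(set(tokens))  # counts each word once per document
    return term_freq, doc_freq

# Toy corpus of three "advertisements"
docs = [["sales", "sales", "manager"], ["sales", "engineer"], ["nurse"]]
tf, doc_counts = term_and_doc_freq(docs)

rare_words = {word for word, count in tf.items() if count == 1}
most_common_by_doc = [word for word, _ in doc_counts.most_common(1)]
print(tf["sales"], doc_counts["sales"])
print(sorted(rare_words))
print(most_common_by_doc)
```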


In [45]:
# Joining the remaining tokens back into one space-separated string per advertisement
df['Processed_Content'] = df['Tokens'].apply(lambda x: ' '.join(x))


In [46]:
# Building the vocabulary (sorted alphabetically, indexed from 0) and saving it to vocab.txt as word:index
vocabulary = sorted({token for tokens in df['Tokens'] for token in tokens})
vocab_dict = {word: index for index, word in enumerate(vocabulary)}

with open("vocab.txt", "w") as file:
    for word, index in vocab_dict.items():
        file.write(f"{word}:{index}\n")
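
As a quick check on the saved format, the sketch below reads a `word:index` file back into a dictionary. `load_vocab` is a hypothetical helper, and the demo writes its own small temporary file rather than touching the real `vocab.txt`. Splitting on the last colon is a defensive choice; the tokens produced above cannot contain a colon.

```python
import os
import tempfile

def load_vocab(path):
    """Read a word:index file into a {word: index} dictionary."""
    vocab = {}
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            word, index = line.rsplit(":", 1)  # split on the last colon
            vocab[word] = int(index)
    return vocab

# Demo on a temporary file with two made-up entries
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "vocab_demo.txt")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("analyst:0\nbritain's:1\n")
    loaded = load_vocab(demo_path)

print(loaded)
```

Comparing `load_vocab("vocab.txt")` with `vocab_dict` after the cell above has run would confirm that the file round-trips.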

In [47]:
# Displaying 5 random rows from the Processed_Content column
print(df['Processed_Content'].sample(5))


638    commercial catering laundry equipment northamp...
555    security officer higher level carillion plc ov...
286    reporting analyst pepsico qualified accountant...
147    senior negotiator independent estate agency fa...
188    associate director equity car ukstaffsearch as...
Name: Processed_Content, dtype: object


In [48]:
# Inspecting the vocabulary via vocab_dict, which holds the same entries written to vocab.txt
import random

# Extracting 10 random key-value pairs from the vocabulary
random_vocab_entries = random.sample(list(vocab_dict.items()), 10)
for word, index in random_vocab_entries:
    print(f"{word}: {index}")


defining: 1292
public: 3898
contacted: 1075
britain's: 609
merit: 3036
lane: 2717
weekend: 5246
attention: 382
uk's: 5039
addresses: 84


## Saving required outputs
Save the vocabulary, bigrams and job advertisement txt as per specification.
- vocab.txt

## Summary
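This notebook loaded the job advertisements from the category folders into a DataFrame, tokenized the text with a regular expression, and removed single-character tokens, stopwords, words that occur only once, and the 50 most frequent words. The resulting vocabulary was sorted alphabetically and saved to vocab.txt in word:index format.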
