## About This Notebook
> **This notebook classifies job titles into distinct categories and then groups similar job titles under each category.**

The process involves the following steps:
1. Text Classification Pipeline
2. Job Title Classification
3. Aggregation of Job Titles
4. Results Presentation

## Importing Necessary Libraries

In [1]:
from transformers import pipeline
import pandas as pd
from IPython.display import display

# Show every row and column in full when displaying DataFrames
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

Load `fraudDataset.csv`

In [3]:
df = pd.read_csv('../fraudDataset.csv')
df.head()

Unnamed: 0,transaction_time,credit_card_number,merchant,category,amount(usd),first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,transaction_id,merch_lat,merch_long,is_fraud,time,hour_of_day,day_of_week,month,year,age,age_group,distance
0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,NC,28654,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09 00:00:00,0b242abb623afc578575680df30655b9,36.011293,-82.048315,0,2012-01-01 00:00:18,0,Tuesday,1,2019,31,21-40,78.597568
1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,WA,99160,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21 00:00:00,1f76529f8574734946361c461b024d99,49.159047,-118.186462,0,2012-01-01 00:00:44,0,Tuesday,1,2019,41,41-60,30.212176
2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,ID,83252,42.1808,-112.262,4154,Nature conservation officer,1962-01-19 00:00:00,a1a22d70485983eac12b5b88dad1cf95,43.150704,-112.154481,0,2012-01-01 00:00:51,0,Tuesday,1,2019,57,41-60,108.206083
3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,MT,59632,46.2306,-112.1138,1939,Patent attorney,1967-01-12 00:00:00,6b849c168bdad6f867558c3793159a81,47.034331,-112.561071,0,2012-01-01 00:01:16,0,Tuesday,1,2019,52,41-60,95.673231
4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,VA,24433,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28 00:00:00,a41d7549acf90789359a9aa5346dcb46,38.674999,-78.632459,0,2012-01-01 00:03:06,0,Tuesday,1,2019,33,21-40,77.556744


## 1. Text Classification Pipeline
We use the Hugging Face `transformers` library to set up a text classification pipeline. The pipeline loads the `serbog/distilbert-jobCategory_410k` model, a DistilBERT-based model trained to categorize job titles.

In [None]:
# Build the classification pipeline (the model is downloaded from the Hugging Face Hub on first use)
pipe = pipeline("text-classification", model="serbog/distilbert-jobCategory_410k")

## 2. Job Title Classification 
Each job title in the dataframe's `job` column is passed through the pipeline. The model predicts a category for each title from its text, and the predicted label is stored in a new `job_category` column.

In [None]:
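def classify_job(curr_job_title):
    """Return the predicted category label for a single job title."""
    # For one input string the pipeline returns a list holding one {'label', 'score'} dict
    result = pipe(curr_job_title)
    return result[0]['label']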

In [None]:
df['job_category'] = df['job'].apply(classify_job)
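The cell above calls the model once per row, although many rows share the same job title. One possible speed-up is to classify each distinct title once and map the labels back onto the rows. The sketch below illustrates the idea on a toy DataFrame. `fake_pipe` is a hypothetical stand-in that only mimics the shape of the pipeline's output for a list of strings (one `{'label', 'score'}` dict per input), so the example runs without downloading the model. Its labels are made up.

```python
import pandas as pd


def fake_pipe(texts):
    """Hypothetical stand-in for the real pipeline: one result dict per input string."""
    return [
        {"label": "C1" if "manager" in text.lower() else "C2", "score": 1.0}
        for text in texts
    ]


# Toy data with a repeated job title (not taken from fraudDataset.csv)
df = pd.DataFrame(
    {"job": ["Energy manager", "Archivist", "Energy manager", "Theatre manager"]}
)

# Classify each distinct title once instead of once per row
unique_titles = df["job"].dropna().unique().tolist()
predictions = fake_pipe(unique_titles)

# Build a title -> label lookup and map it back onto every row
label_by_title = {
    title: prediction["label"]
    for title, prediction in zip(unique_titles, predictions)
}
df["job_category"] = df["job"].map(label_by_title)

print(df)
```

With the real pipeline, `fake_pipe(unique_titles)` would be replaced by `pipe(unique_titles)`. Whether this helps depends on how many titles repeat in the dataset.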

## 3. Aggregation of Job Titles
After classification, job titles are grouped by their predicted categories. This grouping allows us to aggregate similar jobs under a unified category, making it easier to understand and analyze the distribution of job types in our dataset.

In [25]:
# Order rows by predicted category
df_sorted = df.sort_values('job_category')

# Collect the distinct job titles observed in each category
unique_jobs_per_category = df_sorted.groupby('job_category')['job'].unique()

# Create a new DataFrame holding the first 10 unique jobs for each category
top_jobs_per_category = pd.DataFrame({
    'job_category': unique_jobs_per_category.index,
    'top_jobs': [jobs[:10] for jobs in unique_jobs_per_category.values]
})

# Reset the index so rows are numbered from 0
top_jobs_per_category.reset_index(drop=True, inplace=True)
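The sample of titles per category does not show how large each category is. A possible complement is to count rows and distinct titles per category. The sketch below does this on a toy DataFrame. The `toy` data and the `summary` name are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the classified dataframe (illustrative values only)
toy = pd.DataFrame(
    {
        "job": ["Energy manager", "Archivist", "Energy manager", "Writer", "Theatre manager"],
        "job_category": ["C1", "C2", "C1", "C2", "C3"],
    }
)

# For each category: number of rows and number of distinct job titles
summary = toy.groupby("job_category")["job"].agg(rows="size", unique_titles="nunique")

print(summary)
```

Applied to the real dataframe, the same `groupby` call on `df` would show which categories dominate the dataset and which contain only a handful of titles.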

## 4. Results Presentation
Each row represents one job category together with up to 10 of the unique job titles assigned to it. This format gives a quick overview of how the model distributes job titles across categories.

In [6]:
display(top_jobs_per_category)

| | job_category | top_jobs |
|---|---|---|
| 0 | C1 | [Energy manager, Sales professional, IT, Call centre manager, Drilling engineer, Chief Strategy Officer, Art gallery manager, Chief Financial Officer, Chief Executive Officer, Emergency planning/management officer, TEFL teacher] |
| 1 | C2 | [Designer, ceramics/pottery, Counselling psychologist, Archivist, Toxicologist, Financial adviser, Scientist, biomedical, Physicist, medical, Writer, Editor, film/video, Comptroller] |
| 2 | C3 | [Theatre manager, Administrator, Production manager, Field trials officer, Television production assistant, Water engineer, Probation officer, Agricultural consultant, Accounting technician, Tourist information centre manager] |
| 3 | C4 | [Travel agency manager, Futures trader, Exhibitions officer, museum/gallery, Bookseller, Retail buyer, Chartered loss adjuster, Race relations officer, Magazine features editor, Technical brewer, Public relations account executive] |
| 4 | C5 | [Sub, Barrister's clerk, Health physicist, Prison officer, Equities trader, Barrister, Public house manager, Mental health nurse, Make, Museum/gallery conservator] |
| 5 | C6 | [Land, Horticulturist, commercial, Commercial horticulturist, Arboriculturist, Amenity horticulturist, Nature conservation officer, Forest/woodland manager] |
| 6 | C7 | [Warden/ranger, Glass blower/designer, Mudlogger, Conservator, furniture, Furniture conservator/restorer, Television camera operator] |
| 7 | C8 | [Clothing/textile technologist, Camera operator, Air cabin crew, Garment/textile technologist, Cabin crew, Dancer] |
| 8 | C9 | [Press sub, Fisheries officer, Aid worker, Gaffer] |


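The labels C1 to C9 carry no descriptive names, so the interpretations below are inferred from the sample titles listed for each category.

C1: This category seems to include high-level managerial and executive positions across various industries, including energy, sales, finance, and emergency management.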

C2: These jobs appear to be related to the arts, culture, and advisory services, including psychology, writing, editing, and financial advising.

C3: This category might focus on administrative and managerial roles in the arts, agriculture, and public information services.

C4: These jobs seem to be related to commercial services, trade, and public relations, including roles in management, trading, and editing.

C5: This category mixes legal professions (barrister, barrister's clerk), health-related roles (health physicist, mental health nurse), and other roles such as prison officer, equities trader, and public house manager.

C6: The jobs listed under this category are related to horticulture, conservation, and land management, indicating a focus on outdoor, environmental, and conservation work.

C7: These roles seem to be specialized artisanal or technical jobs related to conservation, media production, and geological services.

C8: This category includes jobs related to textiles, aviation, and performing arts, suggesting a focus on technical craftsmanship, customer service in travel, and entertainment.

C9: The roles here are diverse, ranging from media and journalism (press sub) to environmental and humanitarian work (fisheries officer, aid worker), indicating a category for specialized service and support roles.

## 5. Exporting Dataset with Classified Jobs
The dataframe, now including the `job_category` column, is saved for further use in `data_preprocessing.ipynb`.

In [None]:
df.to_csv('../fraudDataset_jobs_classified.csv', index=False)
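Before relying on the exported file in a later notebook, it can be worth reading it back and checking that the new column survived the round trip. The sketch below does this with a toy DataFrame and a temporary directory. The file name `jobs_classified.csv` and the toy data are hypothetical. It also shows that `index=False` keeps pandas from writing the row index as an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the classified dataframe (illustrative values only)
toy = pd.DataFrame({"job": ["Archivist", "Writer"], "job_category": ["C2", "C2"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "jobs_classified.csv")  # hypothetical file name
    toy.to_csv(path, index=False)  # index=False: do not write the row index
    reloaded = pd.read_csv(path)

# Simple sanity checks on the reloaded data
has_category = "job_category" in reloaded.columns
no_index_column = "Unnamed: 0" not in reloaded.columns
missing_labels = int(reloaded["job_category"].isna().sum())

print(has_category, no_index_column, missing_labels)
```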