## About This Notebook
> **This notebook classifies job titles into distinct categories with a pretrained model and then groups the titles under their predicted categories.**

The process involves the following steps:
1. Text Classification Pipeline
2. Job Title Classification
3. Aggregation of Job Titles
4. Results Presentation

## Importing Necessary Libraries

In [20]:
from transformers import pipeline
import pandas as pd
from IPython.display import display

pd.set_option('display.max_rows', None)  
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None) 
pd.set_option('display.max_colwidth', None)

Load `fraudDataset.csv`, drop its first column, and draw a random sample of 20 rows.

In [21]:
df = pd.read_csv('../fraudDataset.csv').iloc[:,1:].sample(20)
df

Unnamed: 0,transaction_time,credit_card_number,merchant,category,amount(usd),first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,transaction_id,merch_lat,merch_long,is_fraud,time,hour_of_day,day_of_week,month,uear,year,age,age_group,latitudinal_distance,longitudinal_distance
1118444,2020-04-10 19:29:08,4223708906367574214,fraud_Haag-Blanda,food_dining,58.97,Adam,Riddle,M,27718 Mason Bypass,Mount Saint Joseph,OH,45051,39.0965,-84.6431,177,Exhibition designer,1974-05-30 00:00:00,4fa420b164521e5be02fa00c7cf29eda,39.625132,-85.465542,0,2013-04-10 19:29:08,19,Friday,4,2020,2020,46,41-60,0.529,0.822
499880,2019-08-05 08:56:32,4836998673805450,"fraud_Moen, Reinger and Murphy",grocery_pos,158.37,Susan,Hardy,F,516 Brown Parks,Manistique,MI,49854,46.0062,-86.2555,6469,Trade mark attorney,1979-04-12 00:00:00,c9b6a4aa04b82c63d2ba22e68842ef30,45.332489,-86.302714,0,2012-08-05 08:56:32,8,Monday,8,2019,2019,40,21-40,0.674,0.047
1100304,2020-04-02 16:22:52,4471568287204,fraud_Hamill-D'Amore,health_fitness,53.06,Dakota,Maldonado,M,369 Cochran Radial,Pelham,NC,27311,36.4899,-79.4736,3402,Insurance underwriter,1927-10-24 00:00:00,0756025dbdb2b02b84534afcf4080c7a,35.77204,-79.439858,0,2013-04-02 16:22:52,16,Thursday,4,2020,2020,93,81-100,0.718,0.034
983264,2020-02-03 19:17:58,3567697931646329,fraud_Klein Group,entertainment,29.02,John,Stevens,M,428 Morgan River,Hudson,NY,12534,42.247,-73.7552,17867,Travel agency manager,1998-07-29 00:00:00,1b85a076c0ad97c2264f4b95d651318d,41.5609,-73.027502,0,2013-02-03 19:17:58,19,Monday,2,2020,2020,22,21-40,0.686,0.728
1424807,2020-08-04 16:37:51,30263540414123,fraud_Morissette LLC,entertainment,37.13,Erik,Patterson,M,162 Jessica Row Apt. 072,Hatch,UT,84735,37.7175,-112.4777,258,Geoscientist,1961-11-24 00:00:00,4f3ee532a3ac18d3da0aeae408204b95,38.029508,-112.294093,0,2013-08-04 16:37:51,16,Tuesday,8,2020,2020,59,41-60,0.312,0.184
1595877,2020-10-12 07:32:36,180067784565096,fraud_Sawayn PLC,shopping_pos,8.66,Mary,Juarez,F,35440 Ryan Islands,North Prairie,WI,53153,42.9385,-88.395,2328,Applications developer,1942-01-06 00:00:00,485a94beea3000c7d4c1c512306c3562,42.429292,-88.813796,0,2013-10-12 07:32:36,7,Monday,10,2020,2020,78,61-80,0.509,0.419
1003839,2020-02-16 12:15:46,4260128500325,fraud_Jaskolski-Vandervort,misc_net,115.19,Whitney,Gallagher,F,0374 Courtney Islands Apt. 400,Deane,KY,41812,37.2409,-82.7696,230,"Conservation officer, historic buildings",1997-08-04 00:00:00,108eb3928fe255c32f9b0a477f41d6c0,37.395417,-83.749962,0,2013-02-16 12:15:46,12,Sunday,2,2020,2020,23,21-40,0.155,0.98
982535,2020-02-03 13:03:23,213125815021702,fraud_Wilkinson Ltd,entertainment,122.73,Adam,Kirk,M,40847 Stark Junctions,Big Indian,NY,12410,42.074,-74.453,397,Psychiatrist,1931-09-12 00:00:00,ef21443b66345d10511ab353fbc0ff7a,41.669329,-75.268801,0,2013-02-03 13:03:23,13,Monday,2,2020,2020,89,81-100,0.405,0.816
133170,2019-03-14 15:25:09,4610064888664703,fraud_Osinski Inc,personal_care,6.02,Tammy,Maldonado,F,3312 Rachel Parks Suite 474,Stittville,NY,13469,43.2229,-75.2899,847,Systems analyst,1996-01-11 00:00:00,9237b9cd02a8b454bc93a57f85b4324c,43.715245,-75.917424,0,2012-03-14 15:25:09,15,Thursday,3,2019,2019,23,21-40,0.492,0.628
1326638,2020-06-30 23:36:50,30235438713303,fraud_Waters-Cruickshank,health_fitness,79.69,James,Baldwin,M,3603 Mitchell Court,Winfield,WV,25213,38.5072,-81.89,5512,Exhibition designer,1980-03-24 00:00:00,5c98a0d4a9df8be550af3edcf0ea8eda,38.944872,-81.355964,0,2013-06-30 23:36:50,23,Tuesday,6,2020,2020,40,21-40,0.438,0.534


## 1. Text Classification Pipeline
We use the Hugging Face `transformers` library to set up a text-classification pipeline. The pipeline loads the `serbog/distilbert-jobCategory_410k` model, a DistilBERT-based model trained specifically to categorize job titles.

In [22]:
pipe = pipeline("text-classification", model="serbog/distilbert-jobCategory_410k") # initialize

## 2. Job Title Classification 
Each job title in the dataframe's `job` column is passed through the pipeline. The model predicts a category label (for example `C2`) from the text of the title, and that label is stored in a new `job_category` column.

In [23]:
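def classify_job(curr_job_title):
    """Return the predicted category label for a single job title."""
    result = pipe(curr_job_title)  # list holding one {'label', 'score'} dict
    print(result[0]['label'])  # log each prediction as it is made
    return result[0]['label']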
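
Many rows can share the same job title, so one option is to classify each distinct title once and reuse the result. The sketch below shows that idea in plain Python. `label_unique` and `stub_classify` are hypothetical names: `stub_classify` is a stand-in for `classify_job` with made-up labels, so the example runs without downloading the model.

```python
def label_unique(titles, classify):
    """Classify each distinct title once, then map labels back to every row."""
    cache = {}
    labels = []
    for title in titles:
        if title not in cache:
            cache[title] = classify(title)  # one classifier call per distinct title
        labels.append(cache[title])
    return labels


# Hypothetical stand-in for classify_job; the real function calls the model.
calls = []


def stub_classify(title):
    calls.append(title)
    return "C3" if "attorney" in title.lower() else "C2"


titles = ["Exhibition designer", "Trade mark attorney", "Exhibition designer"]
labels = label_unique(titles, stub_classify)
print(labels)      # one label per input row
print(len(calls))  # number of classifier calls actually made
```

With pandas, a similar effect could be had by classifying `df['job'].unique()` into a dict and passing that dict to `df['job'].map(...)`.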

In [24]:
df['job_category'] = df['job'].apply(classify_job)

C2
C3
C3
C4
C2
C2
C2
C2
C2
C2
C2
C4
C2
C2
C2
C2
C2
C2
C1
C2
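
The twenty labels printed above can be tallied to see how the sample splits across categories. This is a minimal sketch using only the standard library; the list is copied from the output above, and on the dataframe itself `df['job_category'].value_counts()` should give the same tally.

```python
from collections import Counter

# Labels copied from the printed output of the classification cell above.
labels = ["C2", "C3", "C3", "C4", "C2", "C2", "C2", "C2", "C2", "C2",
          "C2", "C4", "C2", "C2", "C2", "C2", "C2", "C2", "C1", "C2"]

counts = Counter(labels)

# Print categories from most to least frequent.
for category, n in counts.most_common():
    print(f"{category}: {n}")
```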


## 3. Aggregation of Job Titles
After classification, job titles are grouped by their predicted categories. This grouping allows us to aggregate similar jobs under a unified category, making it easier to understand and analyze the distribution of job types in our dataset.

In [25]:
df_sorted = df.sort_values('job_category')

unique_jobs_per_category = df_sorted.groupby('job_category')['job'].unique()

# create a new DataFrame to hold the first 10 unique jobs for each category
top_jobs_per_category = pd.DataFrame({
    'job_category': unique_jobs_per_category.index,
    'top_jobs': [jobs[:10] for jobs in unique_jobs_per_category.values]
})

# Reset index to make it look neat
top_jobs_per_category.reset_index(drop=True, inplace=True)
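
To make the grouping step concrete, here is a dependency-free sketch of roughly what `groupby('job_category')['job'].unique()` followed by the `[:10]` slice computes: for each category, the distinct titles in the order they are encountered, capped at ten. The function name `unique_titles_by_category` is hypothetical, and the (title, category) pairs are a small subset taken from the outputs shown in this notebook.

```python
from collections import defaultdict


def unique_titles_by_category(rows, limit=10):
    """Group titles by category, keeping distinct titles in first-seen order."""
    grouped = defaultdict(list)
    for title, category in rows:
        if title not in grouped[category]:
            grouped[category].append(title)
    # Order the categories and cap each list at `limit` titles.
    return {cat: titles[:limit] for cat, titles in sorted(grouped.items())}


# (title, category) pairs taken from the notebook's own outputs.
rows = [
    ("Exhibition designer", "C2"),
    ("Trade mark attorney", "C3"),
    ("Insurance underwriter", "C3"),
    ("Travel agency manager", "C4"),
    ("Exhibition designer", "C2"),  # duplicate title, kept only once
    ("Theme park manager", "C1"),
]

result = unique_titles_by_category(rows)
for category, titles in result.items():
    print(category, titles)
```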

## 4. Results Presentation
Each row represents one job category, accompanied by a list of up to ten unique job titles that the model assigned to it. This layout gives a quick overview of how job titles are distributed across the predicted categories.

In [26]:
display(top_jobs_per_category)

Unnamed: 0,job_category,top_jobs
0,C1,[Theme park manager]
1,C2,"[Exhibition designer, Multimedia programmer, Research officer, trade union, Archaeologist, Energy engineer, Naval architect, Engineer, maintenance, Careers information officer, Psychiatrist, Conservation officer, historic buildings]"
2,C3,"[Insurance underwriter, Trade mark attorney]"
3,C4,"[Media buyer, Travel agency manager]"


Don't touch below

In [1]:
from transformers import pipeline
import pandas as pd
from IPython.display import display


In [None]:
X = pd.read_csv('../fraudDataset.csv')

In [None]:
# Load a text classification model pipeline
pipe = pipeline("text-classification", model="serbog/distilbert-jobCategory_410k")

job_categories = []

# Process each input using the reusable pipeline
for text in X['job']:
    result = pipe(
        text
    )
    job_categories.append(result[0]['label'])
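
Calling the pipeline once per row can be slow on the full dataset. Hugging Face pipelines also accept a list of strings and return one result per input, so titles could be sent in chunks instead. The sketch below shows only the chunking logic: `classify_in_chunks` is a hypothetical helper, `stub_pipe` is a stand-in that mimics the `[{'label': ..., 'score': ...}]` result shape with made-up values, and the chunk size is arbitrary.

```python
def classify_in_chunks(texts, pipe, chunk_size=64):
    """Send texts to `pipe` in fixed-size chunks and collect the labels."""
    labels = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results = pipe(chunk)  # expected: one result dict per input string
        labels.extend(r["label"] for r in results)
    return labels


# Hypothetical stand-in for the real pipeline; labels and scores are made up.
def stub_pipe(texts):
    return [{"label": "C2", "score": 1.0} for _ in texts]


titles = ["Psychiatrist", "Geoscientist", "Systems analyst"]
labels = classify_in_chunks(titles, stub_pipe, chunk_size=2)
print(labels)
```

With the real pipeline, the titles would need to be passed as a plain list, for example `X['job'].tolist()`.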

In [None]:
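# Attach the predictions as a column so they are written to the CSV
X['job_categories'] = job_categories
X.to_csv('../X_job_categories.csv', index=False)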

In [2]:
X = pd.read_csv(r"../X_job_categories.csv")

In [6]:
df_sorted = X.sort_values('job_categories')

unique_jobs_per_category = df_sorted.groupby('job_categories')['job'].unique()

# create a new DataFrame to hold the first 10 unique jobs for each category
top_jobs_per_category = pd.DataFrame({
    'job_categories': unique_jobs_per_category.index,
    'top_jobs': [jobs[:10] for jobs in unique_jobs_per_category.values]
})

# Reset index to make it look neat
top_jobs_per_category.reset_index(drop=True, inplace=True)

# Display the DataFrame
pd.set_option('display.max_rows', None)  
pd.set_option('display.width', None) 
pd.set_option('display.max_colwidth', None)

display(top_jobs_per_category)

Unnamed: 0,job_categories,top_jobs
0,C1,"[Energy manager, Sales professional, IT, Call centre manager, Drilling engineer, Chief Strategy Officer, Art gallery manager, Chief Financial Officer, Chief Executive Officer, Emergency planning/management officer, TEFL teacher]"
1,C2,"[Designer, ceramics/pottery, Counselling psychologist, Archivist, Toxicologist, Financial adviser, Scientist, biomedical, Physicist, medical, Writer, Editor, film/video, Comptroller]"
2,C3,"[Theatre manager, Administrator, Production manager, Field trials officer, Television production assistant, Water engineer, Probation officer, Agricultural consultant, Accounting technician, Tourist information centre manager]"
3,C4,"[Travel agency manager, Futures trader, Exhibitions officer, museum/gallery, Bookseller, Retail buyer, Chartered loss adjuster, Race relations officer, Magazine features editor, Technical brewer, Public relations account executive]"
4,C5,"[Sub, Barrister's clerk, Health physicist, Prison officer, Equities trader, Barrister, Public house manager, Mental health nurse, Make, Museum/gallery conservator]"
5,C6,"[Land, Horticulturist, commercial, Commercial horticulturist, Arboriculturist, Amenity horticulturist, Nature conservation officer, Forest/woodland manager]"
6,C7,"[Warden/ranger, Glass blower/designer, Mudlogger, Conservator, furniture, Furniture conservator/restorer, Television camera operator]"
7,C8,"[Clothing/textile technologist, Camera operator, Air cabin crew, Garment/textile technologist, Cabin crew, Dancer]"
8,C9,"[Press sub, Fisheries officer, Aid worker, Gaffer]"


C1: This category seems to include high-level managerial and executive positions across various industries, including energy, sales, finance, and emergency management.

C2: These jobs appear to be related to the arts, culture, and advisory services, including psychology, writing, editing, and financial advising.

C3: This category might focus on administrative and managerial roles in the arts, agriculture, and public information services.

C4: These jobs seem to be related to commercial services, trade, and public relations, including roles in management, trading, and editing.

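C5: This category includes legal professions (barrister, barrister's clerk), health-related roles (health physicist, mental health nurse), and hospitality management (public house manager), alongside roles such as prison officer and equities trader.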

C6: The jobs listed under this category are related to horticulture, conservation, and land management, indicating a focus on outdoor, environmental, and conservation work.

C7: These roles seem to be specialized artisanal or technical jobs related to conservation, media production, and geological services.

C8: This category includes jobs related to textiles, aviation, and performing arts, suggesting a focus on technical craftsmanship, customer service in travel, and entertainment.

C9: The roles here are diverse, ranging from media and journalism (press sub) to environmental and humanitarian work (fisheries officer, aid worker), indicating a category for specialized service and support roles.