# N.Rich Test Assignment EDA

I adapted some code from this repo: https://github.com/Reslan-Tinawi/20-newsgroups-Text-Classification

In [11]:
import pandas as pd
import numpy as np

# data visualization
import seaborn as sns

import matplotlib.pyplot as plt

import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

sns.set()

## Helper functions

### Color palette generator

This helper returns a palette of `n_colors` colors drawn from the seaborn palette `palette_name`, in reversed order and optionally as hex strings.

In [59]:
def get_n_color_palette(palette_name, n_colors, as_hex=False):
    palette = sns.color_palette(palette=palette_name, n_colors=n_colors)
    if as_hex:
        palette = palette.as_hex()
    palette.reverse()
    return palette

palette = get_n_color_palette("flare", 20, True)

## Data Statistics

In [13]:
df_train = pd.read_csv('../data/train_df.csv')
df_train.head()

| | Title | Code |
|---|---|---|
| 0 | senior program analyst navy strike fighter sen... | 15-1121.00 |
| 1 | senior intelligence analyst iii job | 33-3021.06 |
| 2 | retail wireless sales consultant part | 41-2031.00 |
| 3 | test automation engineer w selenium and gerkin... | 15-1121.00 |
| 4 | public sector portfolio management senior asso... | 11-1021.00 |


In [18]:
codes_statistics_df = (
    df_train.groupby(by="Code")["Title"]
    .agg(
        [
            ("count", lambda x: x.size),
            ("mean_length", lambda x: x.str.len().mean()),
            ("max_length", lambda x: x.str.len().max()),
            ("min_length", lambda x: x.str.len().min()),
        ]
    )
    .reset_index()
    .sort_values(by="count", ascending=False)
)
codes_statistics_df

| | Code | count | mean_length | max_length | min_length |
|---|---|---|---|---|---|
| 13 | 29-1141.00 | 1186 | 32.771501 | 150 | 4 |
| 11 | 15-1142.00 | 1082 | 33.0 | 99 | 7 |
| 8 | 15-1132.00 | 1080 | 32.92037 | 88 | 8 |
| 7 | 15-1122.00 | 1051 | 35.229305 | 107 | 10 |
| 9 | 15-1133.00 | 985 | 33.659898 | 209 | 7 |
| 1 | 11-2021.00 | 778 | 35.429306 | 119 | 4 |
| 4 | 13-1111.00 | 690 | 34.813043 | 102 | 7 |
| 15 | 33-3021.06 | 590 | 32.949153 | 90 | 9 |
| 0 | 11-1021.00 | 589 | 31.424448 | 105 | 6 |
| 2 | 11-2022.00 | 477 | 34.67086 | 96 | 8 |


Textual descriptions are rather short (mean lengths in the rows shown range from about 31 to 35 characters) and comparable in length across codes. However, the class imbalance is noticeable, with the largest class having almost 6 times more entries than the smallest one.

### Counts per code

Use a pie chart to display the percentages of entries for each code:

In [64]:
fig = px.pie(
    data_frame=codes_statistics_df,
    names="Code",
    values="count",
    color_discrete_sequence=palette,
    title="Code Percentages",
    width=800,
    height=500,
)

fig.update_layout(
    {
        "plot_bgcolor": "rgba(0, 0, 0, 0)",
        "paper_bgcolor": "rgba(0, 0, 0, 0)",
        "font": {
            "family": "Courier New, monospace",
            "size": 14
        },
    }
)

fig.show()

### Occupation vs. Code

One observation I made about the occupations is that they fall into a small number of groups, which I dub *Manager, Developer, Analyst, Blue Collar*. Leveraging this could help improve the model's performance. Here I manually group the codes and check the statistics for the larger groups.

In [36]:
groups_dict = {
    '11-1021.00': 'Manager', 
    '11-2021.00': 'Manager', 
    '11-2022.00': 'Manager', 
    '11-3031.02': 'Manager', 
    '41-2031.00': 'Manager', 
    '43-4051.00': 'Manager',
    
    '15-1132.00': 'Developer', 
    '15-1133.00': 'Developer', 
    '15-1134.00': 'Developer', 
    '15-1142.00': 'Developer', 
    '15-1151.00': 'Developer',
    
    '13-1111.00': 'Analyst', 
    '13-2051.00': 'Analyst', 
    '15-1121.00': 'Analyst', 
    '15-1122.00': 'Analyst', 
    '33-3021.06': 'Analyst',
    
    '29-1141.00': 'Blue Collar', 
    '31-1014.00': 'Blue Collar', 
    '49-3023.02': 'Blue Collar', 
    '49-9071.00': 'Blue Collar', 
    '53-3032.00': 'Blue Collar'
}

In [38]:
df_occup = pd.read_csv('../data/occup_df.csv')
df_occup['Group'] = df_occup['Code'].map(groups_dict)
df_occup

| | Occupation | Code | Group |
|---|---|---|---|
| 0 | General and Operations Managers | 11-1021.00 | Manager |
| 1 | Marketing Managers | 11-2021.00 | Manager |
| 2 | Sales Managers | 11-2022.00 | Manager |
| 3 | Financial Managers, Branch or Department | 11-3031.02 | Manager |
| 4 | Management Analysts | 13-1111.00 | Analyst |
| 5 | Financial Analysts | 13-2051.00 | Analyst |
| 6 | Computer Systems Analysts | 15-1121.00 | Analyst |
| 7 | Information Security Analysts | 15-1122.00 | Analyst |
| 8 | Software Developers, Applications | 15-1132.00 | Developer |
| 9 | Software Developers, Systems Software | 15-1133.00 | Developer |


In [69]:
occup_dict = df_occup.set_index('Code')['Occupation'].to_dict()

df_train['Occupation'] = df_train['Code'].map(occup_dict)
df_train['Group'] = df_train['Code'].map(groups_dict)
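
Because `Series.map` yields `NaN` for any code missing from the dictionary, it may be worth checking coverage before relying on the new columns. A minimal sketch in plain Python; `find_unmapped` and the toy values below are hypothetical and not part of the original notebook:

```python
def find_unmapped(codes, mapping):
    """Return the sorted list of distinct codes that have no entry in `mapping`."""
    return sorted(set(codes) - set(mapping))


# Toy stand-ins for df_train["Code"] and groups_dict (illustrative values only).
toy_groups = {"15-1132.00": "Developer", "13-1111.00": "Analyst"}
toy_codes = ["15-1132.00", "13-1111.00", "15-1132.00", "99-9999.00"]

missing = find_unmapped(toy_codes, toy_groups)
print(missing)
```

In the notebook this could be called as `find_unmapped(df_train['Code'], groups_dict)`; an empty list would mean every code received a group.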

Here are some texts from each group.

In [70]:
df_train[df_train['Group'] == 'Analyst'].head(30)

| | Title | Code | Group | Occupation |
|---|---|---|---|---|
| 0 | senior program analyst navy strike fighter sen... | 15-1121.00 | Analyst | Computer Systems Analysts |
| 1 | senior intelligence analyst iii job | 33-3021.06 | Analyst | Intelligence Analysts |
| 3 | test automation engineer w selenium and gerkin... | 15-1121.00 | Analyst | Computer Systems Analysts |
| 9 | public sector public financial management pfm ... | 13-1111.00 | Analyst | Management Analysts |
| 11 | principal associate allowance for loan losses ... | 13-2051.00 | Analyst | Financial Analysts |
| 12 | security officer on call operations | 15-1121.00 | Analyst | Computer Systems Analysts |
| 13 | analyst finance | 13-2051.00 | Analyst | Financial Analysts |
| 15 | systems engineer info assurance | 15-1122.00 | Analyst | Information Security Analysts |
| 16 | production analyst | 33-3021.06 | Analyst | Intelligence Analysts |
| 17 | research librarian | 15-1122.00 | Analyst | Information Security Analysts |


Observe that the gold labels are rather dirty or ambiguous. E.g., 'test automation engineer' is labelled as 'Computer Systems Analysts', which does not seem right.

In [68]:
df_train[df_train['Group'] == 'Blue Collar'].head(30)

| | Title | Code | Group | Occupation |
|---|---|---|---|---|
| 21 | imaging tech aide | 29-1141.00 | Blue Collar | Registered Nurses |
| 25 | rn bonus hiring event july schedule your inter... | 29-1141.00 | Blue Collar | Registered Nurses |
| 32 | general technician | 49-9071.00 | Blue Collar | Maintenance and Repair Workers, General |
| 36 | food service technician | 31-1014.00 | Blue Collar | Nursing Assistants |
| 51 | psychiatric nurse rn | 29-1141.00 | Blue Collar | Registered Nurses |
| 53 | lpn outpatient inova urgent care full | 29-1141.00 | Blue Collar | Registered Nurses |
| 60 | engineer ft | 49-9071.00 | Blue Collar | Maintenance and Repair Workers, General |
| 66 | lpn health | 29-1141.00 | Blue Collar | Registered Nurses |
| 68 | maintenance technician lawyers | 49-9071.00 | Blue Collar | Maintenance and Repair Workers, General |
| 71 | dining nutrition support assoc nrv med ctr | 29-1141.00 | Blue Collar | Registered Nurses |


### Class imbalance for occupation groups

In [51]:
groups_statistics_df = (
    df_train.groupby(by="Group")["Title"]
    .agg(
        [
            ("count", lambda x: x.size),
            ("mean_length", lambda x: x.str.len().mean()),
            ("max_length", lambda x: x.str.len().max()),
            ("min_length", lambda x: x.str.len().min()),
        ]
    )
    .reset_index()
    .sort_values(by="count", ascending=False)
)
groups_statistics_df

| | Group | count | mean_length | max_length | min_length |
|---|---|---|---|---|---|
| 2 | Developer | 3698 | 33.232829 | 209 | 6 |
| 3 | Manager | 3081 | 33.911392 | 119 | 4 |
| 0 | Analyst | 3056 | 34.365838 | 109 | 6 |
| 1 | Blue Collar | 2263 | 33.229783 | 174 | 4 |


In [62]:
fig = px.pie(
    data_frame=groups_statistics_df,
    names="Group",
    values="count",
    color_discrete_sequence=palette,
    title="Group Percentages",
    width=800,
    height=500,
)

fig.update_layout(
    {
        "plot_bgcolor": "rgba(0, 0, 0, 0)",
        "paper_bgcolor": "rgba(0, 0, 0, 0)",
        "font": {
            "family": "Courier New, monospace",
            "size": 14,
            # 'color': "#eaeaea"
        },
    }
)

fig.show()

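Ah! Much better: the largest group (Developer, 3698 entries) is only about 1.6 times the size of the smallest (Blue Collar, 2263 entries). I am starting to think about training a separate classifier for occupation groups and then ensembling it with the occupation code predictor.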
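
The group-level balance can be quantified directly from the counts in the table. A minimal sketch in plain Python; the helper name `imbalance_summary` is hypothetical, and the counts are copied by hand from the `groups_statistics_df` output rather than read from the DataFrame:

```python
def imbalance_summary(counts):
    """Return (max/min count ratio, share of each class) for a class -> count mapping."""
    total = sum(counts.values())
    ratio = max(counts.values()) / min(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return ratio, shares


# Group counts copied from the groups_statistics_df output.
group_counts = {"Developer": 3698, "Manager": 3081, "Analyst": 3056, "Blue Collar": 2263}

ratio, shares = imbalance_summary(group_counts)
print(f"max/min ratio: {ratio:.2f}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The same function could be applied to the per-code counts to compare the two levels of granularity.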

## Texts

I would like to take a closer look at the textual descriptions. These are just some random observations. I did some checks manually, as the dataset is rather small, so no code here.

* Texts are more or less normalized (lower-cased, tokens joined with a single whitespace, virtually no stop words, pronouns, etc.)
* There are a lot of abbreviations, some of which might be quite important: *i*, *ii*, *iv* denoting grades, *sr* for seniority level, *chha* for 'Certified Home Health Aide', regular abbreviations like *hr* and *vp*, and some cryptic ones like *rpsgt*, *fc*, *pt*. A knowledge base of abbreviations could be quite useful here.
* I spotted many variants, such as *flexi* and *flexible*, *sr* and *senior*, *icu* and *nicu*, etc. Further preprocessing to normalize them could also be useful.
* Typos, e.g. *spcecialist*, *developler*, *exchnage*. A character-level approach could help improve robustness in such cases.
* Tokenization failures (*officeteam*) and artifacts (*womens*, *socio cultural*)
* Shortenings such as *med surg* alongside *surgery*
* Non-English, e.g. *español*, but extremely rare
* No named entities such as city names, locations, countries (with the notable exception of *us*)
* A lot of multi-word expressions (MWE), so MWE detection could help.
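
The variant and abbreviation normalization mentioned in the list could start from a small hand-made lookup table. A minimal sketch in plain Python; the `VARIANTS` table and `normalize_title` are hypothetical, and the three entries are only the examples named in the observations, not a complete dictionary:

```python
# Hypothetical variant -> canonical form table; entries are illustrative only.
VARIANTS = {
    "sr": "senior",
    "flexi": "flexible",
    "chha": "certified home health aide",
}


def normalize_title(title, variants=VARIANTS):
    """Replace whole tokens found in `variants`; leave all other tokens as-is."""
    return " ".join(variants.get(token, token) for token in title.split())


print(normalize_title("sr chha flexi schedule"))
```

Matching whole tokens only avoids rewriting substrings of longer words, at the cost of missing abbreviations that were glued to a neighbour by tokenization failures.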

Conclusion:

I need to invest some time in feature engineering, focusing on
* character-level features (single-character and two-character abbreviations are abundant, and typos and tokenization artifacts need to be handled)
* abbreviations and shortenings resolution, possibly with a hand-made dictionary
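
To illustrate why character-level features may be robust to typos: a misspelling such as *developler* still shares most of its character trigrams with *developer*. A minimal sketch in plain Python; `char_ngrams` and `jaccard` are hypothetical helper names, and this is not the feature extractor used in the actual model:

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text`, padded with one space on each side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


typo = char_ngrams("developler")
clean = char_ngrams("developer")
other = char_ngrams("librarian")

print(f"typo vs clean: {jaccard(typo, clean):.2f}")
print(f"typo vs other: {jaccard(typo, other):.2f}")
```

In a real pipeline, a library vectorizer would probably be more practical; scikit-learn's `TfidfVectorizer` with `analyzer="char_wb"` is one option.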

I could also try to introduce MWE-level features if I have time (not a priority). I will not bother with morphology, syntax, Unicode, NER, word vectors.