# Labeling Data

This notebook gives a brief introduction to a fundamental question that is rarely covered in ML courses: how to label data. In real life, datasets rarely come pre-labeled. You have to label them yourself, which naively means somebody going through and labeling *every single example* by hand. Human labels are great and often ideal, but labeling everything by hand is usually impractical for reasons of time and cost. It often falls on the data scientist or engineer to label the data themselves. This raises the question: how can you label data well enough to build reliable ML models with a minimal amount of effort and resources?

The topic of *weakly supervised learning* addresses this question by considering ways to combine various data labeling strategies to produce "good enough" labels. For example, we might label some examples algorithmically using rules-based functions, combine those algorithmic labels with a subset of human-labeled examples (if we have them), and then train a model on *all* of these labels to predict the best label for each example.
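To make the idea of rules-based labeling functions concrete, here is a minimal sketch. The three rules, their names, and the toy messages are hypothetical and chosen only for illustration. The votes are combined with a simple majority vote, which is a stand-in for the trained label model described above, not the approach itself.

```python
import numpy as np

# Label conventions assumed for this sketch: 1 = spam, 0 = ham, -1 = abstain.
SPAM, HAM, ABSTAIN = 1, 0, -1

# Hypothetical rules-based labeling functions. Each one votes for a class
# or abstains when its rule does not apply.
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_has_url(text):
    lowered = text.lower()
    return SPAM if "http" in lowered or "www." in lowered else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) <= 5 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_has_url, lf_short_message]

def apply_lfs(texts):
    """Return an (n_examples, n_functions) matrix of votes."""
    return np.array([[lf(t) for lf in LABELING_FUNCTIONS] for t in texts])

def majority_vote(votes):
    """Combine votes per example; ties and all-abstain rows stay ABSTAIN."""
    labels = []
    for row in votes:
        cast = row[row != ABSTAIN]
        spam_votes = int((cast == SPAM).sum())
        ham_votes = int((cast == HAM).sum())
        if spam_votes == ham_votes:
            labels.append(ABSTAIN)
        else:
            labels.append(SPAM if spam_votes > ham_votes else HAM)
    return np.array(labels)

# Toy messages, invented for this example.
texts = [
    "Win a FREE prize now at http://example.com",
    "ok see you soon",
    "Can you send me the report when you get a chance today?",
    "free lunch?",
]
votes = apply_lfs(texts)
weak_labels = majority_vote(votes)
print(weak_labels)
```

Examples where the rules disagree or all abstain end up with no label (`-1`). Those are the cases where a small human-labeled subset, or a learned model over the votes, has to do the work.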

This may seem abstract, so we'll work through an example below using a dataset from earlier: the spam classification dataset from the notebook on text classification. We will pretend that we don't have labels for that dataset except for a small subset (10% of the data), and our job is to use weak supervision to generate labels for ML. At the end we'll also compare the generated labels with the ground-truth labels (since we have them), though keep in mind that you wouldn't be able to do this in real life when using this technique.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import string
import spacy
import nltk

from pathlib import Path
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score
from imblearn.over_sampling import RandomOverSampler

np.random.seed(42)