# Text vs Code Line Classifier

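Train a character n-gram TF-IDF + Logistic Regression pipeline (with a LinearSVC baseline for comparison) to classify individual lines as either natural-language text or source code. Code lines come from `<pre><code>` blocks in Stack Overflow question HTML; text lines come from plain `<p>` paragraphs in the same questions and from Twitch chat messages.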

## Contents
1. Imports & Load StackOverflow 2020 Data
2. HTML Parsing Helpers (`get_code_blocks`, `extract_plain_p_text`)
3. Extract Text & Code Lines — SO 2020
4. Extract Text & Code Lines — SO 2019
5. Extract Text & Code Lines — SO 2018
6. Load Twitch Chat Data (natural-text source)
7. Combine Datasets & Train/Val Split
8. TF-IDF + LinearSVC and Logistic Regression Pipelines
9. Threshold Tuning & Evaluation
10. Test Set Evaluation
11. Save Model

In [5]:
import os
import pandas as pd

In [6]:
data_path_train = os.path.join("data/text-vs-code", 'StackOverflow_questions_2020.csv')

df_train = pd.read_csv(data_path_train)

In [7]:
print(df_train["Body"].tail())

1618    <p>Assume we have a numpy array A with shape (...
1619    <p>I am trying following code:</p>\n<pre><code...
1620    <p>I've been learning about C++ <code>constexp...
1621    <p>I'm still new in SSIS,</p>\n<p>Now I can re...
1622    <p>For a couple of days I am working on a way ...
Name: Body, dtype: object


In [3]:
from bs4 import BeautifulSoup

html_string = """<p>I use to install packages</p>
<p> this is my worlllddd wohoo </p>
<p> i love itsooo </p>
<pre><code>print("hello world")
pip install numpy
pip install pandas
pip install matplotlib
pipe introducce
for(int x = 0; iii)
error code here
ERROR code here
</code></pre>"""


def get_code_blocks(html_string):
    """Return left-stripped lines from <pre><code> blocks, skipping noisy lines."""
    # Pin the parser so results do not depend on which parsers are installed
    soup = BeautifulSoup(html_string, "html.parser")

    # Only block-level code (<pre><code>); inline <code> spans are ignored
    code_blocks = [
        code.get_text(strip=False)
        for code in soup.select("pre > code")
    ]

    code_normalized = []

    for code_block in code_blocks:
        for line in code_block.splitlines():
            stripped_line = line.lstrip()
            # Drop very short lines, pip commands and ERROR log lines
            if len(stripped_line) > 2 and "pip " not in stripped_line and "ERROR" not in stripped_line:
                code_normalized.append(stripped_line)

    return code_normalized


print(get_code_blocks(html_string))

['print("hello world")', 'pipe introducce', 'for(int x = 0; iii)', 'error code here']


In [9]:
html_string = """
<p>I wanted to edit the actions in a table. However I get the error message "Please specify covering index name." when I try to edit the FK. How do I fix this?</p>

<p>The table consists of only two columns:</p>

<p><a href="https://i.stack.imgur.com/Thy7s.png" rel="noreferrer"><img src="https://i.stack.imgur.com/Thy7s.png" alt="picture of database table"></a>  </p>

<p>The foreign keys:</p>

<p><a href="https://i.stack.imgur.com/2Wd4R.png" rel="noreferrer"><img src="https://i.stack.imgur.com/2Wd4R.png" alt="foreign keys"></a></p>

<p>category FK:</p>

<p><a href="https://i.stack.imgur.com/QJGke.png" rel="noreferrer"><img src="https://i.stack.imgur.com/QJGke.png" alt="category FK"></a></p>
"""

html_string_2 = """
<p> I am a paragraph </p>
"""
from bs4 import BeautifulSoup

def extract_plain_p_text(html: str):
    # Pin the parser so results do not depend on which parsers are installed
    soup = BeautifulSoup(html, "html.parser")

    sentences = []
    allowed_tags = {"br", "strong"}

    for p in soup.find_all("p"):
        # Reject <p> if it contains any disallowed child tags
        if p.find(lambda tag: tag.name not in allowed_tags):
            continue

        # Each text node (split at <br> or <strong>) becomes one entry
        for part in p.stripped_strings:
            sentences.append(part)

    return sentences


# Example usage
plain_p_text = extract_plain_p_text(html_string)

for s in plain_p_text:
    print(s)



I wanted to edit the actions in a table. However I get the error message "Please specify covering index name." when I try to edit the FK. How do I fix this?
The table consists of only two columns:
The foreign keys:
category FK:
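
To make the filtering rule concrete, here is a small sketch of the same check on a hand-written snippet. The function name `plain_paragraph_strings` and the sample HTML are made up for illustration, and the parser is pinned to `html.parser`, which is an assumption. Paragraphs that contain any tag other than `<br>` or `<strong>` are dropped entirely, and every remaining text node becomes its own entry.

```python
from bs4 import BeautifulSoup

ALLOWED_INLINE_TAGS = {"br", "strong"}


def plain_paragraph_strings(html):
    """Return text nodes of <p> elements that contain only allowed inline tags."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for p in soup.find_all("p"):
        # Skip the whole paragraph if any descendant tag is not allowed
        if p.find(lambda tag: tag.name not in ALLOWED_INLINE_TAGS):
            continue
        kept.extend(p.stripped_strings)
    return kept


sample_html = """
<p>Plain paragraph.</p>
<p>Has a <a href="#">link</a>, so it is skipped.</p>
<p>Uses <code>inline_code</code>, so it is skipped too.</p>
<p>first line<br/>second <strong>bold</strong> part</p>
"""

kept_strings = plain_paragraph_strings(sample_html)
print(kept_strings)
```

Note that `<strong>` also splits a paragraph into separate strings, so one sentence with bold text can end up as several short training lines.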


In [10]:
code_train = []
text_train = []

not_allowed_tags = ["<sql-server>", "<graph>", "<keras>", "<sql>", "<pip>", "<tensorflow>", "<join>"]
allowed_tags = [
    "<c++>", "<java>", "<python>", "<php>", "<c#>", "<javascript>", "<c>", "<go>",
    "<react-native>", "<laravel>", "<django>", "<typescript>", "<node.js>", "<.net-core>",
]
for index, row in df_train.head(1600).iterrows():
  html_string = row["Body"]

  tags = []
  for tag in row["Tags"].split(">"):
    if tag:
      tag = tag + ">"
      tags.append(tag)
      # print(tags)

  plain_p_text = extract_plain_p_text(html_string)
  for line in plain_p_text:
    text_train.append(line)

  # print(f"Raw tags {tags}")
  if any(tag in tags for tag in not_allowed_tags):
    continue

  if not any(tag in tags for tag in allowed_tags):
    continue

  # print(f"Allowed {tags}")

  code = get_code_blocks(html_string)
  for line in code:
    code_train.append(line)

print(len(code_train))
print(len(text_train))

# for t in text_train:
#   print(t)

# for c in code_train:
#   print(c)


24940
7948
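
The tag handling in the loop above can be read as two small steps: split the `Tags` string back into individual tags, then keep code only from questions with at least one allowed tag and no blocked tag. A minimal sketch, assuming the `Tags` column has the form `<tag1><tag2>` as the `split(">")` call implies; the helper names and the shortened tag lists are hypothetical.

```python
def parse_tags(raw_tags):
    """Split a string like '<python><pandas>' into ['<python>', '<pandas>']."""
    return [part + ">" for part in raw_tags.split(">") if part]


def keep_code_from(tags, allowed, blocked):
    """True when no blocked tag is present and at least one allowed tag is."""
    tag_set = set(tags)
    return not (tag_set & set(blocked)) and bool(tag_set & set(allowed))


# Shortened lists, for illustration only
blocked_tags = ["<sql>", "<pip>"]
allowed_tags = ["<python>", "<java>"]

examples = ["<python><pandas>", "<python><sql>", "<html><css>"]
decisions = {
    raw: keep_code_from(parse_tags(raw), allowed_tags, blocked_tags)
    for raw in examples
}
print(decisions)
```

Wrapping the check in a function like this would also let the 2020, 2019 and 2018 cells share one implementation instead of repeating the loop.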


In [11]:
data_path_train_2019 = os.path.join("data/text-vs-code", 'StackOverflow_questions_2019.csv')

df_train_2019 = pd.read_csv(data_path_train_2019)

In [None]:
print(df_train_2019.shape)
print(df_train.shape)

In [12]:
code_train_2019 = []
text_train_2019 = []

not_allowed_tags = ["<sql-server>", "<graph>", "<keras>", "<sql>", "<pip>", "<tensorflow>", "<join>"]
allowed_tags = [
    "<c++>", "<java>", "<python>", "<php>", "<c#>", "<javascript>", "<c>", "<go>",
    "<react-native>", "<laravel>", "<django>", "<typescript>", "<node.js>", "<.net-core>",
]
for index, row in df_train_2019.head(3000).iterrows():
  html_string = row["Body"]

  tags = []
  for tag in row["Tags"].split(">"):
    if tag:
      tag = tag + ">"
      tags.append(tag)
      # print(tags)

  plain_p_text = extract_plain_p_text(html_string)
  for line in plain_p_text:
    text_train_2019.append(line)

  # print(f"Raw tags {tags}")
  if any(tag in tags for tag in not_allowed_tags):
    continue

  if not any(tag in tags for tag in allowed_tags):
    continue

  # print(f"Allowed {tags}")

  code = get_code_blocks(html_string)
  for line in code:
    code_train_2019.append(line)

print(len(code_train_2019))
print(len(text_train_2019))

48116
14698


In [13]:
data_path_train_2018 = os.path.join("data/text-vs-code", 'StackOverflow_questions_2018.csv')
df_train_2018 = pd.read_csv(data_path_train_2018)

In [14]:
code_train_2018 = []
text_train_2018 = []

not_allowed_tags = ["<sql-server>", "<graph>", "<keras>", "<sql>", "<pip>", "<tensorflow>", "<join>"]
allowed_tags = [
    "<c++>", "<java>", "<python>", "<php>", "<c#>", "<javascript>", "<c>", "<go>",
    "<react-native>", "<laravel>", "<django>", "<typescript>", "<node.js>", "<.net-core>",
]
for index, row in df_train_2018.head(1000).iterrows():
  html_string = row["Body"]

  tags = []
  for tag in row["Tags"].split(">"):
    if tag:
      tag = tag + ">"
      tags.append(tag)
      # print(tags)

  plain_p_text = extract_plain_p_text(html_string)
  for line in plain_p_text:
    text_train_2018.append(line)

  # print(f"Raw tags {tags}")
  if any(tag in tags for tag in not_allowed_tags):
    continue

  if not any(tag in tags for tag in allowed_tags):
    continue

  # print(f"Allowed {tags}")

  code = get_code_blocks(html_string)
  for line in code:
    code_train_2018.append(line)

print(len(code_train_2018))
print(len(text_train_2018))

14098
4669


In [16]:
data_path_twitch = os.path.join("data/text-vs-code", 'gamer.csv')

df_train_twitch = pd.read_csv(data_path_twitch)

In [18]:
display(df_train_twitch.head())

Unnamed: 0,user,channel,message,timestamp
0,itztony1702,healthygamer_gg,BibleThump BibleThump,2021-07-16 14:05:22
1,flaredrip,healthygamer_gg,SUPERHERO BibleThump BibleThump,2021-07-16 14:05:23
2,modxta23,healthygamer_gg,GOOD DAD FeelsGoodMan,2021-07-16 14:05:23
3,reaperdiff,healthygamer_gg,FeelsStrongMan,2021-07-16 14:05:23
4,3rdkira,healthygamer_gg,drhgWeird,2021-07-16 14:05:25


In [20]:
messages = []
for indx, row in df_train_twitch.iterrows():
  if row["message"] not in messages:
    messages.append(row["message"])
print(len(messages))
print(messages[0])

4684
BibleThump BibleThump
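
The loop above tests `row["message"] not in messages` against a growing list, so the list is rescanned for every row. A possible alternative, sketched here on made-up messages, is `dict.fromkeys`, which drops repeats in one pass and keeps the first-seen order (dictionaries preserve insertion order in Python 3.7+).

```python
raw_messages = [
    "BibleThump BibleThump",
    "FeelsStrongMan",
    "BibleThump BibleThump",
    "GOOD DAD FeelsGoodMan",
    "FeelsStrongMan",
]

# Keys of a dict are unique and keep insertion order
unique_messages = list(dict.fromkeys(raw_messages))
print(unique_messages)
```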


In [22]:
import numpy as np
import random
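# NOTE: the labels are sized from the 2018 lists, but the lines zipped with them
# come from text_train / code_train (SO 2020). zip() stops at the shorter input,
# so val_data holds the first len(text_train_2018) text lines and the first
# len(code_train_2018) code lines of the 2020 data, which are also part of the
# training set. The commented-out block below builds the split from 2018 instead.
zeros = [0 for _ in range(len(text_train_2018))]
ones = [1 for _ in range(len(code_train_2018))]
text_val = zip(text_train, zeros)
code_val = zip(code_train, ones)

val_data = list(text_val) + list(code_val)
# np.array turns every cell, labels included, into a string
val_data = np.array(val_data)
np.random.shuffle(val_data)


# val_data = (
#     [(x, 0) for x in text_train_2018] +
#     [(x, 1) for x in code_train_2018]
# )

# NOTE: random.shuffle swaps rows of a 2-D NumPy array through views and can
# duplicate rows (compare the repeated lines in the output below); the array
# was already shuffled by np.random.shuffle above.
random.shuffle(val_data)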

feature_names = ["line", "label"]

dataset_df_val = pd.DataFrame(val_data, columns=feature_names)
display(dataset_df_val.head())

Unnamed: 0,line,label
0,rijAlg.FeedbackSize = 128;,1
1,</Directory>,1
2,</Directory>,1
3,</Directory>,1
4,"authUser: null,",1
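
The cell above builds `(line, label)` pairs by zipping each list of lines with a separately built list of labels. Because `zip()` stops at the shorter input, the two lists have to come from the same source for every line to be kept. The sketch below, on made-up lines, shows the truncation and the list-comprehension form (the one in the commented-out block), which ties each label to the list it labels.

```python
text_lines = ["How do I fix this?", "Thanks!", "Any ideas?"]
code_lines = ["a.sort()", "print('HARD')"]

# zip() silently truncates: two labels for three lines gives two pairs
short_labels = [0, 0]
truncated = list(zip(text_lines, short_labels))

# Deriving the label from the list being labelled keeps lengths in sync
labelled = [(line, 0) for line in text_lines] + [(line, 1) for line in code_lines]

print(len(truncated), len(labelled))
```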


In [24]:
import numpy as np
# Text: SO 2020 + SO 2019 paragraphs + Twitch messages; code: SO 2020 only
text_concat = text_train + text_train_2019 + messages
code_concat = code_train

# train_data = (
#     [(x, 0) for x in text_concat] +
#     [(x, 1) for x in code_concat]
# )
# print(len(text_concat))
# print(len(code_concat))

# random.shuffle(train_data)
zeros = [0 for _ in range(len(text_concat))]
ones = [1 for _ in range(len(code_concat))]
text = zip(text_concat, zeros)
code = zip(code_concat, ones)

train_data = list(text) + list(code)
train_data = np.array(train_data)
np.random.shuffle(train_data)

feature_names = ["line", "label"]

dataset_df = pd.DataFrame(train_data, columns=feature_names)
display(dataset_df.head())


Unnamed: 0,line,label
0,".append(""svg"")",1
1,"""license"": ""MIT"",",1
2,As Rob Newton suggested I edited the convert f...,0
3,"current_p2 = b_curr_val.ToString(),",1
4,Questions:,0
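
Converting the `(line, label)` tuples with `np.array` produces a string array, which is why later cells call `int(row["label"])` or compare labels against `"1"`. A tiny sketch with made-up rows:

```python
import numpy as np

pairs = [("a.sort()", 1), ("Questions:", 0)]
as_array = np.array(pairs)

# Mixed str/int input is coerced to a single string dtype
print(as_array.dtype.kind)

# Labels have to be converted back before they can be used as integers
labels_as_int = [int(label) for label in as_array[:, 1]]
print(labels_as_int)
```

Building the DataFrame directly from the list of tuples, for example `pd.DataFrame(pairs, columns=["line", "label"])`, would keep the label column numeric.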


In [25]:
print(dataset_df.shape)
print(dataset_df_val.shape)

(52270, 2)
(18767, 2)


In [26]:
X_train = []
Y_train = []
for index, row in dataset_df.iterrows():
  line_str = str(row["line"])
  X_train.append(line_str)
  val = int(row["label"])
  if val != 0 and val != 1:
    print(val)
  Y_train.append(val)


In [27]:
X_val = []
Y_val = []
for index, row in dataset_df_val.iterrows():
  line_str = str(row["line"])
  X_val.append(line_str)
  val = int(row["label"])
  Y_val.append(val)

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

model = Pipeline([
    ("tfidf", TfidfVectorizer(
        analyzer="char",
        ngram_range=(3, 5),
        min_df=1
    )),
    ("clf", LinearSVC(
        # random_state = 42,
        # class_weight = "balanced"
    ))
])

model.fit(X_train, Y_train)

y_pred = model.predict(X_val)

from sklearn.metrics import classification_report, f1_score
print(f1_score(Y_val, y_pred, average='macro'))
print(classification_report(Y_val, y_pred))


0.9945641616977929
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      4642
           1       1.00      1.00      1.00     14125

    accuracy                           1.00     18767
   macro avg       0.99      0.99      0.99     18767
weighted avg       1.00      1.00      1.00     18767
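
Both pipelines use `analyzer="char"` with `ngram_range=(3, 5)`, so the features are overlapping character sequences rather than words. The plain-Python sketch below approximates that extraction to show why punctuation-heavy fragments such as `t()` separate code from chat text. It is an illustration only: the real vectorizer additionally lowercases the input and collapses runs of whitespace by default, then applies TF-IDF weighting. The helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n_min=3, n_max=5):
    """All character n-grams of length n_min..n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


code_grams = char_ngrams("a.sort()")
text_grams = char_ngrams("ha ha ha lol")

print(code_grams[:6])
print(len(code_grams))
```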



In [29]:

mode_reg = Pipeline([
    ("tfidf", TfidfVectorizer(
        analyzer="char",
        ngram_range=(3, 5),
        min_df=1
    )),
    ("clf", LogisticRegression(
        # random_state = 42,
        # class_weight = "balanced"
    ))
])

mode_reg.fit(X_train, Y_train)

y_pred = mode_reg.predict(X_val)

from sklearn.metrics import classification_report, f1_score
print(f1_score(Y_val, y_pred, average='macro'))
print(classification_report(Y_val, y_pred))

0.9532507737682298
              precision    recall  f1-score   support

           0       0.93      0.93      0.93      4642
           1       0.98      0.98      0.98     14125

    accuracy                           0.97     18767
   macro avg       0.95      0.95      0.95     18767
weighted avg       0.97      0.97      0.97     18767



In [30]:
# Predict "code" only when P(code) is at least 0.6
y_pred_new = (mode_reg.predict_proba(X_val)[:, 1] >= 0.6).astype(int).tolist()
print(y_pred_new[:10])
print(f1_score(Y_val, y_pred_new, average='macro'))
print(classification_report(Y_val, y_pred_new))

[1, 1, 1, 1, 1, 1, 0, 1, 0, 1]
0.9479020693651898
              precision    recall  f1-score   support

           0       0.89      0.95      0.92      4642
           1       0.98      0.96      0.97     14125

    accuracy                           0.96     18767
   macro avg       0.94      0.96      0.95     18767
weighted avg       0.96      0.96      0.96     18767



In [31]:
code_real = [
        "public String removeTrailingZeros(String num) {",
        "int scheduleCourse(vector<vector<int>>& relations, vector<int>& time) {",
        "k = list(map(int, input().split()))",
        "	a[i] = abs(a[i])",
        "a.sort()",
        "print(filter_multiples_of_index([22, -6, 32, 82, 9, 25]))  # [-6, 32, 25]",
        "print('HARD')",
        "(a, b, c) = (min(a, b, c), a + b + c - min(a, b, c) - max(a, b, c), max(a, b, c))",
]
text_real = [
    "In the first case you can't embed $$$3-gon$$$ in the square. So, the square should be at least $$$2.0971524159829244 \\cdot 2= 1.931851653 \\cdot 2= 3.973703307974929$$$-gon.",
    "In the second case you can't embed $$$5-gon$$$ in the square. So, the square should be at least $$$3.196226611 \\cdot 2= 6.386452622000014 \\cdot 2= 12.772084220999975$$$-gon.",
    "The absolute error in each answer should not exceed $$$10^{-6}$$$- this is the condition that the answer is within a certain precision.",
    "I wrote 7: (anOverAverage(n)) because I thought about a way to make the problem more general. ",
    "The idea was to make the problem more general, because the solution is not unique.",
    "contain one value: the expected answer: two.",
    "Your answer tag must be: #22",
    "The test cases it generated are valid.",
    "Believe me\" i'm pretty sure EndeR is an algorithm.",
    "ALREADY FAILED PEEPEEPEELELE TIFFYY",
    "LET'S LISTEN TO THIS IN SLOW 1 2 3 4 5 6 7 8 9  PSYCH  LKLL ASTAKALKL L!kkk A",
    "LET'S LISTEN TO THIS IN SLOW 1 2 3 4 5 6 7 8 9  PSYCH  LKLL ASTAKALKL L!kkk",
    "IT'S 4AM  I'VE BEEN  THERE ? WHAT DO  I HAVE ?\nLOCKED LIKE I USED TO !",
    "LOCKED LIKE USED TO",
    "I'M WARRAST TO LET YOU   GO ON Dying",
     "REINFOR C   Get Out",
    "G.E  ESKA           ????     hh RICKY",
    "WHAT????? WHAT H*Â.K?!?!?!?!?!?! DID YOU  REALL WATCH METRAM CLAIMT",
    "I... i... i... ",
    "THEIR  PYRKO SKEE LA LE   *:D---------",
    "ha ha ha lol",
    "try that oO oO  Gotta walk it like a runaway rush Oo O So Orbit",
    "CURSEY HAH M0BER",
    "When you strung together Enders heart and her brain might trip off",
    "YOU STRAIGHT  WRAP UP ON THE LET IT ALL TRUK KUDDLE WITH!!!!!И  НО БЛ0КУД И ПРИНЦ ТОНССУЛЯРИ",
    "Windows restart",
    "And  I'm  on flip that if  and  times and CHOOSE THE CURSEY PIZZA CEEEEE",
    "1. Initialize the max_index of result to i=L = 0, R_index = 0;",
    "2. Now keep comparing L[0] * R[i] with L[R_index] * R[0];",
    "If both are greater than max_index and L[0]<= L[R_index] we will keep incrementing R_index.",
]


In [32]:
import numpy as np

zeros = [0 for _ in range(len(text_real))]
ones = [1 for _ in range(len(code_real))]
text = zip(text_real, zeros)
code = zip(code_real, ones)

test_data = list(text) + list(code)
test_data = np.array(test_data)
np.random.shuffle(test_data)

feature_names = ["line", "label"]

dataset_df_test = pd.DataFrame(test_data, columns=feature_names)
display(dataset_df_test.head())

Unnamed: 0,line,label
0,2. Now keep comparing L[0] * R[i] with L[R_ind...,0
1,G.E ESKA ???? hh RICKY,0
2,int scheduleCourse(vector<vector<int>>& relati...,1
3,REINFOR C Get Out,0
4,LET'S LISTEN TO THIS IN SLOW 1 2 3 4 5 6 7 8 9...,0


In [33]:
X_test = []
Y_test = []
for index, row in dataset_df_test.iterrows():
  X_test.append(row["line"].lower())
  val = 1 if row["label"] == "1" else 0 if row["label"] == "0" else 2
  if val == 2:
    print(2)
  Y_test.append(val)

In [34]:
probs = mode_reg.predict_proba(X_test)[:, 1]

thresholds = np.linspace(0.1, 0.9, 81)
scores = [
    f1_score(Y_test, probs >= t, average="macro")
    for t in thresholds
]

best_t = thresholds[np.argmax(scores)]
print(f"Best threshold: {best_t}")


Best threshold: 0.84
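
A minimal sketch of the same threshold sweep, written without scikit-learn so each step is visible. The labels and probabilities are made up, and `macro_f1` is a hypothetical helper that mirrors `f1_score(..., average="macro")` for the two classes used here. In practice the sweep is usually run on a validation split, so that the test score remains an unbiased estimate.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for labels 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up validation labels and predicted P(code) values
val_labels = [0, 0, 0, 0, 1, 1, 1, 1]
val_probs = [0.10, 0.30, 0.62, 0.85, 0.65, 0.80, 0.90, 0.95]

# Candidate cutoffs from 0.10 to 0.90 in steps of 0.05
thresholds = [round(0.10 + 0.05 * i, 2) for i in range(17)]
sweep = [
    (macro_f1(val_labels, [int(p >= t) for p in val_probs]), t)
    for t in thresholds
]
best_score, best_threshold = max(sweep, key=lambda pair: pair[0])
print(best_threshold, round(best_score, 3))
```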


In [35]:
# Fixed cutoff of 0.8 (the sweep above reported 0.84 as the best threshold)
y_pred_new = (mode_reg.predict_proba(X_test)[:, 1] >= 0.8).astype(int).tolist()
print(y_pred_new[:10])
print(f1_score(Y_test, y_pred_new, average='macro'))
print(classification_report(Y_test, y_pred_new))

[0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
0.9621136590229312
              precision    recall  f1-score   support

           0       1.00      0.97      0.98        30
           1       0.89      1.00      0.94         8

    accuracy                           0.97        38
   macro avg       0.94      0.98      0.96        38
weighted avg       0.98      0.97      0.97        38



In [36]:
y_pred = mode_reg.predict(X_test)

from sklearn.metrics import classification_report, f1_score
print(f1_score(Y_test, y_pred, average='macro'))
print(classification_report(Y_test, y_pred))

0.8642857142857143
              precision    recall  f1-score   support

           0       1.00      0.87      0.93        30
           1       0.67      1.00      0.80         8

    accuracy                           0.89        38
   macro avg       0.83      0.93      0.86        38
weighted avg       0.93      0.89      0.90        38



In [37]:
for x, y in zip(X_test, y_pred):
  print(x, y)

2. now keep comparing l[0] * r[i] with l[r_index] * r[0]; 1
g.e  eska           ????     hh ricky 0
int schedulecourse(vector<vector<int>>& relations, vector<int>& time) { 1
reinfor c   get out 0
let's listen to this in slow 1 2 3 4 5 6 7 8 9  psych  lkll astakalkl l!kkk a 0
i'm warrast to let you   go on dying 0
cursey hah m0ber 0
public string removetrailingzeros(string num) { 1
i... i... i...  1
and  i'm  on flip that if  and  times and choose the cursey pizza ceeeee 0
contain one value: the expected answer: two. 0
print('hard') 1
when you strung together enders heart and her brain might trip off 0
the absolute error in each answer should not exceed $$$10^{-6}$$$- this is the condition that the answer is within a certain precision. 0
a.sort() 1
it's 4am  i've been  there ? what do  i have ?
locked like i used to ! 0
ha ha ha lol 0
what????? what h*â.k?!?!?!?!?!?! did you  reall watch metram claimt 0
1. initialize the max_index of result to i=l = 0, r_index = 0; 1
their  pyrko skee l

In [38]:
# Save the trained pipeline to models/ (tries joblib first, then torch, then pickle)
from pathlib import Path
import joblib
import pickle

models_dir = Path("models")
models_dir.mkdir(parents=True, exist_ok=True)

# `mode_reg` is the Logistic Regression pipeline trained above
joblib_path = models_dir / "classifier.joblib"
torch_path = models_dir / "classifier.pt"
pickle_path = models_dir / "classifier.pkl"

try:
    joblib.dump(mode_reg, joblib_path)
    print(f"Model saved with joblib to {joblib_path}")
except Exception:
    try:
        import torch
        if hasattr(mode_reg, "state_dict"):
            torch.save(mode_reg.state_dict(), str(torch_path))
            print(f"PyTorch model state_dict saved to {torch_path}")
        else:
            raise RuntimeError("Not a torch.nn.Module")
    except Exception:
        with open(pickle_path, "wb") as f:
            pickle.dump(mode_reg, f)
        print(f"Model pickled to {pickle_path}")

Model saved with joblib to models\classifier.joblib
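
For completeness, a sketch of how a pipeline saved this way could be loaded back and used. To stay self-contained it trains a tiny stand-in pipeline on made-up lines and writes it to a temporary directory; `toy_lines`, `toy_labels` and `toy_model` are hypothetical and unrelated to the data above. With the real file, only the `joblib.load` call on `models/classifier.joblib` would be needed.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the trained pipeline (illustration only)
toy_lines = [
    "print('hi')", "x = foo(bar)", "for (int i = 0; i < n; i++) {", "return a + b;",
    "how do I fix this?", "thanks for the help", "I am new to this", "what does this error mean",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ("clf", LogisticRegression()),
])
toy_model.fit(toy_lines, toy_labels)

# Round trip: dump to disk, load again, and compare predictions
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classifier.joblib"
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

probs_before = toy_model.predict_proba(toy_lines)[:, 1]
probs_after = restored.predict_proba(toy_lines)[:, 1]
print((probs_before == probs_after).all())
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version used for saving.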
