# MultiLabel Logistic Regression using Tensorflow Core

- Multi-label classification is a type of classification problem in machine learning where each instance (data sample) can belong to multiple classes or have multiple labels. 
- In traditional single-label classification, each instance is associated with a single label, and the goal is to predict the most appropriate label for each instance. However, in multi-label classification, instances can be associated with more than one label simultaneously.

- **Single-label classification:**
> Suppose you have an image classification problem to identify animals in images. Each image contains only one animal, and you want to classify the image into one of the animal categories (e.g., cat, dog, bird).
- **Multi-label Classification:**
> Now, let's consider a multi-label classification scenario. Instead of single animals per image, each image can contain multiple animals, and you want to predict all the animals present in the image (e.g., cat, dog, bird, fish).

There are multiple approaches to solving a multi-label classification problem, such as **Binary Relevance**, **Classifier Chains**, **Label Powerset**, and a **neural network with multiple outputs and a binary loss function**.

- We are going to implement logistic regression (a single-layer perceptron with n_labels outputs and a binary loss function).

## Difference between multi-class classification and multi-label classification

![image.png](attachment:image.png)
<p style="text-align: center;"><b>Figure: Multi-class classification forward pass</b></p>

![image-2.png](attachment:image-2.png)
<p style="text-align: center;"><b>Figure: Multi-label classification forward pass</b></p>
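
To make the difference concrete, here is a minimal sketch in plain Python (no TensorFlow, so it runs anywhere). The logits and the class names are made-up values for illustration. Softmax produces one probability distribution over all classes, while independent sigmoids give each label its own yes/no probability.

```python
import math

def softmax(logits):
    # subtract the maximum for numerical stability
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

names = ["cat", "dog", "bird"]   # hypothetical classes
logits = [2.0, 1.5, -3.0]        # hypothetical scores from the final layer

# multi-class: probabilities compete and sum to 1, so we pick exactly one class
multi_class = softmax(logits)
predicted_class = names[multi_class.index(max(multi_class))]

# multi-label: every label is scored independently, so several can be "on"
multi_label = [sigmoid(z) for z in logits]
predicted_labels = [n for n, p in zip(names, multi_label) if p >= 0.5]
```

With these scores the multi-class model picks only `cat`, while the multi-label model reports both `cat` and `dog`.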

# Dataset
- This time we are going to solve an NLP problem to demonstrate multi-label logistic regression.
- **Name of the dataset:** Questions from Cross Validated Stack Exchange
- **URL**: https://www.kaggle.com/datasets/stackoverflow/statsquestions
- **API to download the dataset**: kaggle datasets download -d stackoverflow/statsquestions
- **Problem statement**: classify each question into multiple tags; a question may belong to zero, one, or more than one tag.

Before downloading the dataset from Kaggle, we need to do some setup:
- install kaggle
```
pip install kaggle -q

```
- download the kaggle.json file from your Kaggle account; we need it for authentication.

In [None]:
# pip install kaggle -q

In [2]:
# I have already downloaded the kaggle.json file
! cd ../../ && ls

'architecture implementations'	 LICENSE
 daily_logs.md			 README.md
 kaggle.json			'tensorflow from basic to advance'
'keras basic to advance'


There you can see the kaggle.json file.

In [3]:
# specify the path to the kaggle.json file
import os

os.environ["KAGGLE_CONFIG_DIR"] = "../../"

In [4]:
# Now we can download file using the API
!kaggle datasets download -d stackoverflow/statsquestions

statsquestions.zip: Skipping, found more recently modified local copy (use --force to force download)


In [5]:
!ls

'Binary Logistic Regression with Tensorflow core.ipynb'
 dataset
'Multiclass Logistic Regression with tensorfow core.ipynb'
'Multilabel Classification.ipynb'
'MultiLabel Logistic Regression using Tensorflow core.ipynb'
 statsquestions.zip
'Vanishing Gradient problem.ipynb'
'with Tensorflow core.ipynb'


There you can see statsquestions.zip file which is our dataset. Now let's unzip it.

In [None]:
import shutil

zip_file_path = "statsquestions.zip"
destination_directory = "dataset"
shutil.unpack_archive(zip_file_path, destination_directory, 'zip')

In [None]:
!ls dataset

We have our dataset ready, let's do some exploratory data analysis.

# EDA
Exploratory Data Analysis (EDA) is an essential step in the data analysis process, where analysts or data scientists examine and summarize the main characteristics of a dataset to gain insights and understand the data better. 

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [8]:
df_questions = pd.read_csv("dataset/Questions.csv", encoding='iso-8859-1')
df_tags = pd.read_csv("dataset/Tags.csv", encoding='iso-8859-1')

In [9]:
df_questions.shape, df_tags.shape

((85085, 6), (244228, 2))

So we have about 85K questions and 244K tag rows. Don't be surprised that there are more tags than questions: remember, one question can have more than one tag.

In [10]:
df_questions.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,6,5.0,2010-07-19T19:14:44Z,272,The Two Cultures: statistics vs. machine learn...,"<p>Last year, I read a blog post from <a href=..."
1,21,59.0,2010-07-19T19:24:36Z,4,Forecasting demographic census,<p>What are some of the ways to forecast demog...
2,22,66.0,2010-07-19T19:25:39Z,208,Bayesian and frequentist reasoning in plain En...,<p>How would you describe in plain English the...
3,31,13.0,2010-07-19T19:28:44Z,138,What is the meaning of p values and t values i...,<p>After taking a statistics course and then t...
4,36,8.0,2010-07-19T19:31:47Z,58,Examples for teaching: Correlation does not me...,"<p>There is an old saying: ""Correlation does n..."


OwnerUserId, CreationDate, and Score are unwanted columns; we can remove them.

In [11]:
df_tags.head()

Unnamed: 0,Id,Tag
0,1,bayesian
1,1,prior
2,1,elicitation
3,2,distributions
4,2,normality


Here, the question with Id 1 has the tags bayesian, prior, and elicitation.
Let's check the frequencies of these tags.

In [12]:
len(df_tags.Tag.value_counts())

1315

We have 1315 unique tags in our dataset.

In [13]:
df_tags.Tag.value_counts().head(10)

Tag
r                     13236
regression            10959
machine-learning       6089
time-series            5559
probability            4217
hypothesis-testing     3869
self-study             3732
distributions          3501
logistic               3316
classification         2881
Name: count, dtype: int64

The tag "r" has the highest frequency, with more than 13K occurrences. Note that we are only looking at the top 10 tags here.

In [14]:
df_tags.Tag.value_counts().sample(10)

Tag
standard-error         629
epidemiology           198
ensemble               159
mann-whitney-u-test    242
bias-node                4
hypergeometric          92
active-learning          9
kaplan-meier            97
toeplitz                 2
hawkes                   1
Name: count, dtype: int64

Checking the frequency of some random tags makes it clear that many tags appear very rarely in the dataset, so we cannot use them for model training. Let's keep only the 100 most frequent tags and build a model that classifies questions into these tags.
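
The idea of keeping only the most frequent tags can be sketched on a toy list with the standard library. The tag list and the cut-off `top_k` below are made-up values for illustration; the notebook itself does the same thing with pandas on the real data.

```python
from collections import Counter

# toy tag column: a few frequent tags and a couple of rare ones
tags = ["r", "regression", "r", "toeplitz", "regression", "r", "hawkes"]

counts = Counter(tags)
top_k = 2  # hypothetical cut-off; the notebook uses 100

# names of the top_k most frequent tags
most_common = {tag for tag, _ in counts.most_common(top_k)}

# drop every occurrence of a tag that is not among the most frequent ones
kept = [tag for tag in tags if tag in most_common]
```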

# Taking only the top 100 frequent tags

In [15]:
NUM_LABELS= 100

In [16]:
grouped_tags = df_tags.groupby("Tag").size().reset_index(name='count')
most_common_tags = grouped_tags.nlargest(NUM_LABELS, columns="count")

In [17]:
len(most_common_tags)

100

In [18]:
most_common_tags

Unnamed: 0,Tag,count
986,r,13236
1020,regression,10959
669,machine-learning,6089
1220,time-series,5559
946,probability,4217
...,...,...
818,nonlinear-regression,514
240,cox-model,510
757,monte-carlo,504
959,proportion,503


In [19]:
MOST_COMMON_TAGS = list(set(most_common_tags.Tag.to_list()))
", ".join(MOST_COMMON_TAGS)

'correlation, self-study, normal-distribution, goodness-of-fit, optimization, statistical-significance, matlab, r, stata, estimation, factor-analysis, model, deep-learning, logistic, experiment-design, panel-data, monte-carlo, least-squares, variance, generalized-linear-model, chi-squared, regression-coefficients, interpretation, poisson, pca, data-visualization, standard-error, binary-data, conditional-probability, feature-selection, data-transformation, multiple-regression, python, classification, machine-learning, ordinal, random-forest, predictive-models, model-selection, survival, linear-model, nonparametric, modeling, random-variable, algorithms, bayesian, nonlinear-regression, arima, svm, proportion, sample-size, binomial, neural-networks, cart, clustering, multilevel-analysis, multiple-comparisons, confidence-interval, t-test, expected-value, cox-model, interaction, missing-data, maximum-likelihood, anova, dataset, references, mean, pdf, hypothesis-testing, data-mining, regress

In [20]:
len(MOST_COMMON_TAGS)

100

Now, let's remove the tags that are not common.

In [21]:
df_tags.Tag = df_tags.Tag.apply(lambda tag: tag if tag in MOST_COMMON_TAGS else None)

In [22]:
df_tags.Tag

0                        bayesian
1                            None
2                            None
3                   distributions
4                            None
                   ...           
244223           machine-learning
244224    mathematical-statistics
244225        normal-distribution
244226                 estimation
244227                       None
Name: Tag, Length: 244228, dtype: object

We have replaced the less common tags with None values so that we can remove them using dropna().

In [23]:
df_tags.dropna(inplace=True, ignore_index=True)

In [24]:
df_tags.Tag

0                         bayesian
1                    distributions
2                    distributions
3         statistical-significance
4                 machine-learning
                    ...           
152908                           r
152909            machine-learning
152910     mathematical-statistics
152911         normal-distribution
152912                  estimation
Name: Tag, Length: 152913, dtype: object

We have successfully removed the less common tags from the tag dataframe.

# Text Preprocessing

In [25]:
df_questions = df_questions[['Id', 'Title', 'Body']]
df_questions.head()

Unnamed: 0,Id,Title,Body
0,6,The Two Cultures: statistics vs. machine learn...,"<p>Last year, I read a blog post from <a href=..."
1,21,Forecasting demographic census,<p>What are some of the ways to forecast demog...
2,22,Bayesian and frequentist reasoning in plain En...,<p>How would you describe in plain English the...
3,31,What is the meaning of p values and t values i...,<p>After taking a statistics course and then t...
4,36,Examples for teaching: Correlation does not me...,"<p>There is an old saying: ""Correlation does n..."


You can see that the **Body** column contains text along with HTML tags; let's clean that up. We can use a regex for this.

In [26]:
import re

# specify pattern for HTML tags
#  .*? is a non-greedy quantifier that matches zero or more occurrences of any character (except for a newline) 
# in a non-greedy or minimal fashion. 
# This means it will match as few characters as possible while still allowing the overall pattern to match.
HTML_TAG_PATTERN = "<.*?>"
test_text = "<HTML><a href='example.com'>click me</a></HTML>"

re.sub(HTML_TAG_PATTERN, "", test_text)

'click me'

You can see \<HTML> and \<a> tags got removed from the text. Now, let's put this into a function.

In [27]:
def clean_text(text):
    HTML_TAG_PATTERN = "<.*?>"
    # strip HTML tags and lowercase the text
    text = re.sub(HTML_TAG_PATTERN, "", text).lower()
    # replace every character except letters, digits, spaces and ' with a space,
    # then collapse repeated whitespace
    return " ".join(re.sub(r"[^a-zA-Z0-9 ']", ' ', text).split())
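
A short side note on the pattern: the sketch below shows why the non-greedy `.*?` matters, and one limitation worth knowing about. The multi-line input is a made-up example; whether such tags actually occur in this dataset has not been checked here.

```python
import re

text = "<HTML><a href='example.com'>click me</a></HTML>"

# Greedy: ".*" runs from the first "<" to the last ">", deleting everything
greedy = re.sub("<.*>", "", text)

# Non-greedy: ".*?" stops at the nearest ">", so only the tags are removed
non_greedy = re.sub("<.*?>", "", text)

# "." does not match a newline by default, so a tag split across lines survives
multi_line = "<a\nhref='example.com'>click me</a>"
without_flag = re.sub("<.*?>", "", multi_line)

# re.DOTALL lets "." match newlines as well
with_dotall = re.sub("<.*?>", "", multi_line, flags=re.DOTALL)
```

If a tag ever spans several lines, passing `flags=re.DOTALL` is one way to remove it too.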

In [28]:
df_questions.iloc[0].Body

'<p>Last year, I read a blog post from <a href="http://anyall.org/">Brendan O\'Connor</a> entitled <a href="http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/">"Statistics vs. Machine Learning, fight!"</a> that discussed some of the differences between the two fields.  <a href="http://andrewgelman.com/2008/12/machine_learnin/">Andrew Gelman responded favorably to this</a>:</p>\n\n<p>Simon Blomberg: </p>\n\n<blockquote>\n  <p>From R\'s fortunes\n  package: To paraphrase provocatively,\n  \'machine learning is statistics minus\n  any checking of models and\n  assumptions\'.\n  -- Brian D. Ripley (about the difference between machine learning\n  and statistics) useR! 2004, Vienna\n  (May 2004) :-) Season\'s Greetings!</p>\n</blockquote>\n\n<p>Andrew Gelman:</p>\n\n<blockquote>\n  <p>In that case, maybe we should get rid\n  of checking of models and assumptions\n  more often. Then maybe we\'d be able to\n  solve some of the problems that the\n  machine learning people can

In [29]:
# checking if the function is working as expected or not
clean_text(df_questions.iloc[0].Body)

"last year i read a blog post from brendan o'connor entitled statistics vs machine learning fight that discussed some of the differences between the two fields andrew gelman responded favorably to this simon blomberg from r's fortunes package to paraphrase provocatively 'machine learning is statistics minus any checking of models and assumptions' brian d ripley about the difference between machine learning and statistics user 2004 vienna may 2004 season's greetings andrew gelman in that case maybe we should get rid of checking of models and assumptions more often then maybe we'd be able to solve some of the problems that the machine learning people can solve but we can't there was also the statistical modeling the two cultures paper by leo breiman in 2001 which argued that statisticians rely too heavily on data modeling and that machine learning techniques are making progress by instead relying on the predictive accuracy of models has the statistics field changed over the last decade i

It seems it is working fine. Let's apply it to the entire dataset.

In [30]:
df_questions["Body"] = df_questions.Body.apply(clean_text)
# New column Text contains input text for the model
df_questions["Text"] = df_questions.Title.apply(clean_text)+ ' ' + df_questions["Body"]

In [31]:
df_questions.Text[0]

"the two cultures statistics vs machine learning last year i read a blog post from brendan o'connor entitled statistics vs machine learning fight that discussed some of the differences between the two fields andrew gelman responded favorably to this simon blomberg from r's fortunes package to paraphrase provocatively 'machine learning is statistics minus any checking of models and assumptions' brian d ripley about the difference between machine learning and statistics user 2004 vienna may 2004 season's greetings andrew gelman in that case maybe we should get rid of checking of models and assumptions more often then maybe we'd be able to solve some of the problems that the machine learning people can solve but we can't there was also the statistical modeling the two cultures paper by leo breiman in 2001 which argued that statisticians rely too heavily on data modeling and that machine learning techniques are making progress by instead relying on the predictive accuracy of models has the

# Merge Two Dataframes
- Now let's merge the two dataframes.
- To merge them, we will store the list of tags that belong to a particular question in a separate column.

In [32]:
# we can get all the tags from tag dataframe based on question id
df_tags[df_tags["Id"] == 25].Tag.values

array(['modeling', 'time-series'], dtype=object)

So, two tags belong to Id == 25.

In [33]:
# since we have removed most of the tags we might get empty list also for a particular id.
df_tags[df_tags["Id"] == 20].Tag.values

array([], dtype=object)

In [34]:
def extract_tags_by_id(q_id):
    return df_tags[df_tags["Id"] == q_id].Tag.values

# this will add the list of tags of a particular id into a new column Tags.
def add_tags_row(row):
    row["Tags"] = extract_tags_by_id(row["Id"])
    return row

In [35]:
df_questions = df_questions.apply(add_tags_row, axis=1)
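
The row-by-row approach above scans the whole tag table once per question, which can be slow on about 85K questions. A common alternative is to group the tags once and then look each question up. Here is a minimal sketch of that idea in plain Python with made-up rows; in pandas, something along the lines of `df_tags.groupby("Id")["Tag"].apply(list)` expresses the same grouping, though that variant was not run in this notebook.

```python
from collections import defaultdict

# toy (Id, Tag) rows standing in for df_tags
tag_rows = [(25, "modeling"), (25, "time-series"), (31, "t-test"), (1, "bayesian")]
# toy question ids standing in for df_questions.Id
question_ids = [20, 25, 31]

# single pass over the tag rows: collect the tags of each question id
tags_by_id = defaultdict(list)
for q_id, tag in tag_rows:
    tags_by_id[q_id].append(tag)

# questions without any remaining tag get an empty list
question_tags = {q_id: tags_by_id.get(q_id, []) for q_id in question_ids}
```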

In [36]:
df_questions.head()[["Id", "Text", "Tags"]]

Unnamed: 0,Id,Text,Tags
0,6,the two cultures statistics vs machine learnin...,[machine-learning]
1,21,forecasting demographic census what are some o...,[forecasting]
2,22,bayesian and frequentist reasoning in plain en...,[bayesian]
3,31,what is the meaning of p values and t values i...,"[hypothesis-testing, t-test, p-value, interpre..."
4,36,examples for teaching correlation does not mea...,[correlation]


Our final dataframe is ready, now we need to tokenize and prepare the input x and y.

# Vectorization
- Vectorization in NLP refers to the process of converting text data into numerical vectors that machine learning algorithms can understand and process.
- We need to vectorize the text as well as Tags.
- This time we are using built-in functions from TensorFlow (Keras) and scikit-learn. We could also write custom functions for the same task, and it is not that complex; when we focus on text preprocessing later, we will implement these functions from scratch.

In [37]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
from sklearn.preprocessing import MultiLabelBinarizer

2023-07-23 21:45:14.462412: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-07-23 21:45:14.506405: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-07-23 21:45:14.507298: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


- MultiLabelBinarizer vectorizes each list of labels into a multi-hot (binary indicator) vector.
- For instance, if we have 3 unique labels and an example belongs to label 1 and label 2, then its encoding will be [1 1 0].
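
The encoding itself is simple enough to sketch by hand. The two helper functions below are hypothetical names written for illustration, not part of scikit-learn; they only mimic the idea of sorting the unique labels and marking which ones are present.

```python
def fit_classes(label_lists):
    # sorted list of every label that appears at least once
    return sorted({label for labels in label_lists for label in labels})

def to_multi_hot(labels, classes):
    # 1 where the class is present in this example, 0 otherwise
    present = set(labels)
    return [1 if c in present else 0 for c in classes]

samples = [["bayesian", "r"], [], ["regression"]]  # toy tag lists
classes = fit_classes(samples)
encoded = [to_multi_hot(labels, classes) for labels in samples]
```

Note that an example with no tags is encoded as an all-zero vector, which is a valid target in multi-label classification.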

In [38]:
multi_label_binarizer = MultiLabelBinarizer()
multi_label_binarizer.fit(df_questions.Tags)

In [39]:
LABELS = multi_label_binarizer.classes_
# this should be equal to NUM_LABELS (100)
len(LABELS)

100

In [40]:
LABELS

array(['algorithms', 'anova', 'arima', 'autocorrelation', 'bayesian',
       'binary-data', 'binomial', 'bootstrap', 'cart', 'categorical-data',
       'chi-squared', 'classification', 'clustering',
       'conditional-probability', 'confidence-interval', 'correlation',
       'covariance', 'cox-model', 'cross-validation', 'data-mining',
       'data-transformation', 'data-visualization', 'dataset',
       'deep-learning', 'distributions', 'econometrics', 'estimation',
       'expected-value', 'experiment-design', 'factor-analysis',
       'feature-selection', 'forecasting', 'generalized-linear-model',
       'goodness-of-fit', 'hypothesis-testing', 'inference',
       'interaction', 'interpretation', 'least-squares', 'linear-model',
       'logistic', 'machine-learning', 'mathematical-statistics',
       'matlab', 'maximum-likelihood', 'mcmc', 'mean', 'missing-data',
       'mixed-model', 'model', 'model-selection', 'modeling',
       'monte-carlo', 'multilevel-analysis', 'multiple-co

Now, let's create a text tokenizer and transform the text into padded sequences.
- This basically involves three tasks:
    - transform raw text into tokens
    - transform tokens into integer values (replace each token with its corresponding index)
    - add padding to, or remove excess items from, each sequence so that all examples have the same sequence length
- To perform these tasks we can use the Tokenizer class and the pad_sequences function
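
The three tasks above can be sketched in plain Python as follows. This is a simplified illustration, not the Keras implementation: the helper names are made up, tokenization is just whitespace splitting, words outside the vocabulary are dropped, and padding and truncation both happen at the front of the sequence (which mirrors the leading zeros you will see in the output further down).

```python
from collections import Counter

def build_vocab(texts, vocab_size):
    counts = Counter(word for text in texts for word in text.split())
    # index 0 is reserved for padding, so real words start at 1
    most_common = counts.most_common(vocab_size - 1)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}

def encode(text, vocab):
    # replace each known token by its index; unknown tokens are skipped
    return [vocab[word] for word in text.split() if word in vocab]

def pad(sequence, max_len):
    sequence = sequence[-max_len:]  # keep only the last max_len ids
    return [0] * (max_len - len(sequence)) + sequence

texts = ["the model fits the data", "the data"]   # toy corpus
vocab = build_vocab(texts, vocab_size=3)          # keeps the 2 most frequent words
ids = encode(texts[0], vocab)
padded = pad(ids, max_len=5)
truncated = pad([1, 2, 3, 4], max_len=3)
```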

In [41]:
MAX_LEN = 180
VOCAB_SIZE = 1000

tokenizer = Tokenizer(num_words=VOCAB_SIZE, lower=True)
tokenizer.fit_on_texts(df_questions.Text)

In [42]:
def text_to_feature_vector(corpus):
    sequences = tokenizer.texts_to_sequences(corpus)
    return pad_sequences(sequences, MAX_LEN)

In [43]:
text_to_feature_vector([df_questions.Text.values[10]])

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0, 243, 339,   5, 220,  66,   3, 387, 243, 962,   9,  66,
        680, 267,  68,  78, 339, 297,  11,   3, 311,  83,   5,  78,  12,
          2,  95,   4, 164, 768, 646, 726, 191, 230, 598,  94, 643,   9,
        134,   2,  18,  63, 629,  38,  49,  73,   2,  49,  90, 393,  67,
          9,  67, 556, 268,  46,  24,  84,  21, 654,   2, 111, 632,  22,
        550, 974,  32,  20, 104,   4,  21,   3, 636

Let's create a function that transforms the model predictions into a {label: prob} dictionary, sorted by probability.

In [236]:
def prediction_to_label(preds):
    label_prob = [(LABELS[i], prob) for i, prob in enumerate(preds.numpy().tolist()[0])]
    return dict(sorted(label_prob, key=lambda kv:kv[1], reverse=True))

In [237]:
# Testing the prediction_to_label function
dummy_preds = tf.random.uniform([1, 10])
dummy_preds

<tf.Tensor: shape=(1, 10), dtype=float32, numpy=
array([[0.5429491 , 0.6879803 , 0.03101194, 0.3544582 , 0.37709188,
        0.22089791, 0.43472993, 0.34998238, 0.5027523 , 0.19111383]],
      dtype=float32)>

In [238]:
prediction_to_label(dummy_preds)

{'anova': 0.6879802942276001,
 'algorithms': 0.5429490804672241,
 'cart': 0.5027523040771484,
 'binomial': 0.4347299337387085,
 'bayesian': 0.3770918846130371,
 'autocorrelation': 0.354458212852478,
 'bootstrap': 0.3499823808670044,
 'binary-data': 0.22089791297912598,
 'categorical-data': 0.19111382961273193,
 'arima': 0.03101193904876709}
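
To turn such a dictionary into an actual set of predicted tags, one simple option is to keep every label whose probability reaches a threshold. The helper below is a hypothetical addition for illustration, and the probabilities are rounded values from the dummy output above.

```python
def labels_above_threshold(label_probs, threshold=0.5):
    # keep the labels whose probability is at least the threshold
    return [label for label, prob in label_probs.items() if prob >= threshold]

label_probs = {"anova": 0.69, "algorithms": 0.54, "cart": 0.50,
               "binomial": 0.43, "arima": 0.03}

predicted = labels_above_threshold(label_probs)
strict = labels_above_threshold(label_probs, threshold=0.6)
```

A higher threshold returns fewer, more confident tags; a lower one returns more tags at the cost of more false positives.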

# Prepare Input data

In [136]:
x = text_to_feature_vector(df_questions.Text) / VOCAB_SIZE
y = multi_label_binarizer.transform(df_questions.Tags)

In [137]:
x.shape, y.shape

((85085, 180), (85085, 100))

In [138]:
from sklearn.model_selection import train_test_split

In [139]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

In [140]:
x_train.shape, y_train.shape, x_test.shape

((59559, 180), (59559, 100), (25526, 180))

- Here you can see that x_train is a 59,559 x 180 matrix (70% of the ~85K examples), where 59,559 is the number of rows (examples) and each row is encoded as a vector of dimension 180.
- Similarly, y_train is of shape 59,559 x 100, where each row is encoded as a vector of dimension 100 (one entry per label).

In [141]:
x_train[0]

array([0.046, 0.842, 0.055, 0.018, 0.004, 0.132, 0.001, 0.005, 0.148,
       0.048, 0.033, 0.021, 0.004, 0.021, 0.228, 0.008, 0.006, 0.017,
       0.009, 0.005, 0.03 , 0.008, 0.009, 0.228, 0.013, 0.03 , 0.017,
       0.009, 0.005, 0.008, 0.046, 0.001, 0.005, 0.059, 0.084, 0.788,
       0.165, 0.008, 0.165, 0.228, 0.148, 0.002, 0.056, 0.03 , 0.008,
       0.005, 0.165, 0.009, 0.05 , 0.005, 0.165, 0.008, 0.165, 0.03 ,
       0.011, 0.058, 0.007, 0.008, 0.009, 0.006, 0.005, 0.03 , 0.008,
       0.009, 0.008, 0.003, 0.001, 0.551, 0.037, 0.895, 0.228, 0.059,
       0.264, 0.046, 0.442, 0.815, 0.165, 0.002, 0.03 , 0.008, 0.165,
       0.148, 0.264, 0.165, 0.264, 0.008, 0.165, 0.228, 0.148, 0.264,
       0.165, 0.008, 0.165, 0.228, 0.148, 0.264, 0.228, 0.059, 0.264,
       0.249, 0.815, 0.054, 0.916, 0.001, 0.019, 0.188, 0.33 , 0.674,
       0.055, 0.033, 0.161, 0.012, 0.165, 0.008, 0.165, 0.228, 0.148,
       0.228, 0.059, 0.033, 0.002, 0.161, 0.037, 0.001, 0.338, 0.259,
       0.012, 0.001,

In [142]:
y_train[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

We can use the tf.data.Dataset.from_tensor_slices() API to create a tf.data.Dataset object, which allows us to iterate over the data in batches.
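
To see what batching does, here is a small sketch in plain Python. It is not the tf.data implementation, and the generator name `batches` is made up for illustration. It also shows the effect of dropping the last incomplete batch, which is what the training loop later in this notebook does with a batch size of 64.

```python
def batches(examples, batch_size, drop_remainder=False):
    # yield consecutive slices of batch_size examples
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            break  # skip the final, incomplete batch
        yield batch

full = list(batches(list(range(10)), 4))
dropped = list(batches(list(range(10)), 4, drop_remainder=True))

# number of full batches of 64 in the 59,559 training examples
steps_per_epoch = 59559 // 64
```

With 59,559 training rows and a batch size of 64, this gives 930 full batches per epoch, which matches the step count printed during training.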

In [143]:
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))

In [144]:
for batch in train_dataset.batch(10):
    x, y = batch
    x = tf.cast(tf.transpose(x), "float32")
    y = tf.transpose(y)
    print("Batch x shape: ", x.shape)
    print("Batch y shape: ", y.shape)
    break

Batch x shape:  (180, 10)
Batch y shape:  (100, 10)


2023-07-23 22:26:41.395090: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_1' with dtype int64 and shape [59559,100]
	 [[{{node Placeholder/_1}}]]


We have prepared input x and y for the model. Now, let's start building model.

# Sigmoid Activation Function

In [145]:
def sigmoid(z):
    exp_z = tf.math.exp(-z)
    return 1 / (1 + exp_z)
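
The sigmoid function maps any real number z to the interval (0, 1) via 1 / (1 + exp(-z)). For very negative z, exp(-z) becomes huge; TensorFlow handles this by returning inf, so the version above still yields 0, but in plain Python `math.exp` would raise an OverflowError. The sketch below shows a commonly used numerically stable formulation in plain Python. It is an illustration only, not the version used in the rest of this notebook; TensorFlow also ships its own `tf.math.sigmoid`.

```python
import math

def stable_sigmoid(z):
    if z >= 0:
        # exp(-z) is at most 1 here, so it cannot overflow
        return 1.0 / (1.0 + math.exp(-z))
    # for negative z, use the equivalent form exp(z) / (1 + exp(z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

middle = stable_sigmoid(0.0)
very_negative = stable_sigmoid(-1000.0)
very_positive = stable_sigmoid(1000.0)
```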

# Loss Function

As shown in the block diagram at the top, the multi-label loss uses binary cross-entropy, which is applied to each output independently (we have NUM_LABELS outputs).
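
Written out, and assuming targets $y$ and predicted probabilities $p$ of shape $(N, K)$ as in the training loop ($N$ examples in the batch, $K$ = NUM_LABELS), the function in the next cell computes:

$$
\mathcal{L} = -\frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{N}\Big[\,y_{ik}\log p_{ik} + (1-y_{ik})\log\big(1-p_{ik}\big)\Big]
$$

The inner sum runs over axis 0 (the examples) and the outer average over the labels, so the value grows with the batch size. Each $p_{ik}$ is first clipped to $[\epsilon, 1-\epsilon]$ so that the logarithm is never evaluated at 0.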

In [169]:
def binary_cross_entropy(y_true, y_pred, from_logits=False):
    y_true = tf.cast(tf.constant(y_true), "float32")
    y_pred = tf.cast(tf.constant(y_pred), "float32")
    if from_logits:
        y_pred = sigmoid(y_pred)
    epsilon = 1e-7  # a small value that keeps y_pred away from 0 and 1, preventing log(0)
    y_pred = tf.clip_by_value(y_pred, epsilon, 1 - epsilon)
    # element-wise log-likelihood of each (example, label) entry
    loss = y_true * tf.math.log(y_pred) + (1 - y_true)*tf.math.log(1-y_pred)
    # sum over axis 0, then average the sums over the remaining axis
    sum_loss = tf.reduce_sum(loss, axis=0)
    return -tf.reduce_mean(sum_loss)

In [170]:
# Let's test 
dummy_preds = tf.random.uniform([10, 2])
dummy_preds

<tf.Tensor: shape=(10, 2), dtype=float32, numpy=
array([[0.3589189 , 0.36954486],
       [0.9341887 , 0.1815654 ],
       [0.18172216, 0.36942506],
       [0.8167914 , 0.50181794],
       [0.74678195, 0.06313574],
       [0.40576112, 0.15353525],
       [0.6497719 , 0.99280787],
       [0.066293  , 0.9857409 ],
       [0.16270995, 0.6415174 ],
       [0.08020401, 0.9425225 ]], dtype=float32)>

In [171]:
dummy_y_true = tf.math.round(tf.random.uniform([10, 2]))
dummy_y_true

<tf.Tensor: shape=(10, 2), dtype=float32, numpy=
array([[1., 1.],
       [0., 0.],
       [1., 1.],
       [0., 0.],
       [0., 1.],
       [1., 1.],
       [0., 0.],
       [1., 0.],
       [1., 1.],
       [1., 0.]], dtype=float32)>

In [172]:
binary_cross_entropy(dummy_y_true, dummy_preds)

<tf.Tensor: shape=(), dtype=float32, numpy=18.767738>

In [173]:
# let's create labels that agree with dummy_preds (every prediction falls on the correct side of 0.5)
dummy_y_true = tf.math.round(dummy_preds)
dummy_y_true

<tf.Tensor: shape=(10, 2), dtype=float32, numpy=
array([[0., 0.],
       [1., 0.],
       [0., 0.],
       [1., 1.],
       [1., 0.],
       [0., 0.],
       [1., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.]], dtype=float32)>

In [174]:
binary_cross_entropy(dummy_y_true, dummy_preds)

<tf.Tensor: shape=(), dtype=float32, numpy=2.5289426>

In [175]:
binary_cross_entropy(dummy_y_true, dummy_y_true)

<tf.Tensor: shape=(), dtype=float32, numpy=1.192093e-06>
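
Notice that even perfect predictions do not give a loss of exactly 0: because of the clipping, each entry contributes about -log(1 - epsilon), and these small values add up. The following plain-Python sketch reproduces the same reduction (sum over rows, mean over columns) on made-up inputs. It runs in double precision, so its numbers differ slightly from the float32 values shown above.

```python
import math

def binary_cross_entropy(y_true, y_pred, epsilon=1e-7):
    """BCE for nested lists of shape (rows, cols): sum over rows, mean over columns."""
    n_cols = len(y_true[0])
    column_sums = [0.0] * n_cols
    for true_row, pred_row in zip(y_true, y_pred):
        for k, (y, p) in enumerate(zip(true_row, pred_row)):
            p = min(max(p, epsilon), 1 - epsilon)  # clip to avoid log(0)
            column_sums[k] += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -sum(column_sums) / n_cols

y_true = [[1.0, 0.0], [0.0, 1.0]]
confident_right = [[0.9, 0.1], [0.1, 0.9]]
confident_wrong = [[0.1, 0.9], [0.9, 0.1]]

loss_right = binary_cross_entropy(y_true, confident_right)
loss_wrong = binary_cross_entropy(y_true, confident_wrong)
loss_perfect = binary_cross_entropy(y_true, y_true)
```

Confident and correct predictions give a small loss, confident and wrong predictions a large one, and perfect predictions a tiny positive value rather than zero.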

# Model Building

In [176]:
class MultiLabelLogisticRegression(tf.Module):
    def __init__(self, n_classes, name=None, **kwargs):
        super().__init__(name=name, **kwargs)
        self.is_built = False
        self.n_class = n_classes
    
    def build(self, num_features):
        # weights are drawn from a standard normal distribution; biases start at zero
        self.weights = tf.Variable(tf.random.normal([num_features, self.n_class]), name="weights")
        self.bias = tf.Variable(tf.zeros([1, self.n_class]), name="bias")
        print("build successfully")
        
    def predict(self, x):
        #compute model output
        # compute logits
        logits = tf.cast(x, "float32") @ self.weights + self.bias
        return sigmoid(logits)
    
    def __call__(self, x):
        if not self.is_built:
            self.build(x.shape[1])
            self.is_built = True
        return self.predict(x)
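
The forward pass of this model is just a matrix product followed by an element-wise sigmoid: an input of shape (N, F) times weights of shape (F, K), plus a bias per label, gives N rows of K probabilities. Here is a minimal sketch with nested lists and made-up numbers, written in plain Python purely to make the shapes visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, bias):
    """x: N rows of F features, weights: F rows of K values, bias: K values."""
    outputs = []
    for row in x:
        # one logit per label: dot product of the row with a weight column, plus bias
        logits = [
            sum(row[f] * weights[f][k] for f in range(len(row))) + bias[k]
            for k in range(len(bias))
        ]
        outputs.append([sigmoid(z) for z in logits])
    return outputs

x = [[1.0, 2.0], [0.0, -1.0]]                   # N=2 examples, F=2 features
weights = [[0.5, -0.5, 0.0], [0.25, 0.0, 1.0]]  # F=2 rows, K=3 labels
bias = [0.0, 0.0, -1.0]

probs = forward(x, weights, bias)
```

Each row of `probs` holds one independent probability per label, so the values in a row do not have to sum to 1.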

In [177]:
model = MultiLabelLogisticRegression(len(LABELS))

for batch in train_dataset.batch(10):
    x, y = batch
    output = model(x)
    print("Output shape: ", output.shape)
    print(output[0])
    break

build successfully
Output shape:  (10, 100)
tf.Tensor(
[2.5911629e-01 7.6438677e-01 2.9824348e-02 8.4170222e-01 9.6226603e-01
 1.8528363e-01 8.4176022e-01 2.7314616e-02 8.2227558e-01 7.8687954e-01
 8.0642849e-01 9.9151081e-01 1.2387007e-01 7.1559834e-01 9.9027050e-01
 9.8861027e-01 4.0504757e-01 2.6046950e-03 7.4719810e-01 2.4214981e-01
 7.4750167e-01 7.6068795e-01 9.6293223e-01 9.1054434e-01 5.8646214e-01
 3.3270833e-01 5.3501898e-01 2.7129704e-01 2.7825844e-01 8.8646702e-02
 4.6064411e-04 1.5543182e-03 1.0150222e-01 8.3940208e-02 6.9908774e-01
 8.1704948e-03 4.8082104e-01 3.3872837e-01 5.4124963e-01 5.7906896e-01
 2.6662073e-01 9.7414714e-01 3.0637300e-01 9.6275330e-01 9.7914803e-01
 1.5983982e-04 9.6667260e-01 3.8349581e-01 1.3375071e-01 8.4665638e-01
 9.6567065e-01 7.7206349e-01 7.5112706e-01 8.7236434e-02 3.5258945e-02
 1.5370038e-01 9.8645443e-01 9.4636271e-05 9.0858972e-01 6.3734114e-02
 6.4067525e-01 6.6949612e-01 9.9832207e-01 9.4888431e-01 9.0908402e-01
 9.9193019e-01 8.46447


# Accuracy
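- Here we use accuracy as the metric, which is one of the worst choices for this problem: each question carries only a few of the 100 tags, so almost every entry of y is 0, and a model that predicts no tags at all would still score a very high accuracy.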

In [178]:
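def calculate_accuracy(y_true, y_pred, threshold=0.5):
    # a label counts as predicted when its probability reaches the threshold
    y_pred = tf.cast(y_pred >= threshold, "int16")
    y_true = tf.cast(y_true, "int16")
    # fraction of (example, label) entries that are predicted correctly
    match = tf.cast(tf.equal(y_true, y_pred), "float32")
    return tf.reduce_mean(match)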
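
To see how misleading element-wise accuracy can be with sparse labels, consider this small sketch in plain Python. The function `micro_scores` and the toy labels are made up for illustration; precision, recall and F1 are computed over all (example, label) entries, which is often called micro-averaging.

```python
def micro_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 over all (example, label) entries."""
    pairs = [(t, p) for true_row, pred_row in zip(y_true, y_pred)
             for t, p in zip(true_row, pred_row)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# 2 questions with 10 labels each; only 3 of the 20 entries are positive
y_true = [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]]
# a useless model that never predicts any tag
all_zero = [[0] * 10, [0] * 10]

accuracy, precision, recall, f1 = micro_scores(y_true, all_zero)
```

The all-zero model reaches 85% accuracy on this toy data while finding none of the true tags, so recall and F1 are both 0. With 100 labels per question the effect is even stronger.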

# Train Function

In [179]:
# defining training function
def train(model, x_train, y_train, learning_rate=0.01):
    # record the forward pass so that gradients can be computed
    with tf.GradientTape() as tape:
        y_pred = model(x_train)
        acc = calculate_accuracy(y_train, y_pred)
        loss = binary_cross_entropy(y_train, y_pred)

    # gradients of the loss with respect to the weights and the bias
    dw, db = tape.gradient(loss, [model.weights, model.bias])
    # plain gradient-descent update
    model.weights.assign_sub(learning_rate * dw)
    model.bias.assign_sub(learning_rate * db)
    return loss, acc
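
tf.GradientTape computes the gradients for us, but for a sigmoid output combined with binary cross-entropy they have a simple closed form: the gradient of the loss with respect to each logit is p - y. The sketch below performs one gradient-descent step by hand for a tiny model with a single feature and a single label. It is plain Python with made-up data, meant only to illustrate what one update does.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def total_loss(w, b, xs, ys):
    # binary cross-entropy summed over the examples
    loss = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss

def train_step(w, b, xs, ys, learning_rate):
    dw = db = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        dw += (p - y) * x  # chain rule: d(loss)/d(logit) * d(logit)/dw
        db += (p - y)      # d(logit)/db is 1
    return w - learning_rate * dw, b - learning_rate * db

xs = [-2.0, -1.0, 1.0, 2.0]  # toy inputs
ys = [0, 0, 1, 1]            # toy labels

w, b = 0.0, 0.0
loss_before = total_loss(w, b, xs, ys)
w, b = train_step(w, b, xs, ys, learning_rate=0.1)
loss_after = total_loss(w, b, xs, ys)
```

After a single step the weight moves from 0 to 0.3 and the loss decreases, which is the same behaviour the training loop below shows on the real data.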

# Training Loop

In [184]:
def display_training_summary(loss, accuracy, epoch, step):
    if step % 10 == 0:
        print(f"EPOCH: {epoch} STEP: {step} --- loss: {loss}---accuracy: {accuracy}")

In [193]:
#model initiate
model = MultiLabelLogisticRegression(len(LABELS))
LOSS_HISTORY = []
ACC_HISTORY = []

In [198]:
from tqdm import tqdm

In [199]:
NUM_EPOCHS = 10
LEARNING_RATE = 0.1

# Training loop
for epoch in range(NUM_EPOCHS):
    step = 0
    for batch in train_dataset.batch(64, drop_remainder=True):
        step += 1
        x, y = batch
        loss, acc = train(model, x, y, LEARNING_RATE)
    # record and report the metrics of the last batch of each epoch
    LOSS_HISTORY.append(loss)
    ACC_HISTORY.append(acc)
    display_training_summary(loss, acc, epoch+1, step)

EPOCH: 1 STEP: 930 --- loss: 7.779109954833984---accuracy: 0.9785937666893005


EPOCH: 2 STEP: 930 --- loss: 7.674633026123047---accuracy: 0.9785937666893005


EPOCH: 3 STEP: 930 --- loss: 7.578596591949463---accuracy: 0.9789062738418579


EPOCH: 4 STEP: 930 --- loss: 7.489894866943359---accuracy: 0.9790624976158142


EPOCH: 5 STEP: 930 --- loss: 7.4071502685546875---accuracy: 0.9790624976158142


EPOCH: 6 STEP: 930 --- loss: 7.329549789428711---accuracy: 0.9792187213897705


EPOCH: 7 STEP: 930 --- loss: 7.256796836853027---accuracy: 0.9793750047683716


EPOCH: 8 STEP: 930 --- loss: 7.188448905944824---accuracy: 0.9795312285423279


EPOCH: 9 STEP: 930 --- loss: 7.123833179473877---accuracy: 0.9795312285423279


EPOCH: 10 STEP: 930 --- loss: 7.06289529800415---accuracy: 0.9795312285423279
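
- The training loop stores the loss and accuracy of the last batch of each epoch only, so every figure printed above describes a single batch of 64 examples rather than the whole epoch.
- A minimal sketch of averaging over all batches instead is given below. `fake_train_step` is a hypothetical stand-in for `train(model, x, y, LEARNING_RATE)`, and the numbers it returns are made up purely so the sketch runs on its own.

```python
def epoch_mean(values):
    """Return the arithmetic mean of the per-batch values of one epoch."""
    values = list(values)
    return sum(values) / len(values)


def fake_train_step(step):
    """Hypothetical stand-in for train(model, x, y, LEARNING_RATE).

    Returns a made-up (loss, accuracy) pair; it does not train anything.
    """
    return 8.0 - 0.5 * step, 0.5 + 0.125 * step


loss_history, acc_history = [], []
for epoch in range(2):
    batch_losses, batch_accs = [], []
    for step in range(4):  # four pretend batches per epoch
        loss, acc = fake_train_step(step)
        batch_losses.append(loss)
        batch_accs.append(acc)
    # Store the epoch average rather than the last batch's value.
    loss_history.append(epoch_mean(batch_losses))
    acc_history.append(epoch_mean(batch_accs))

print(loss_history, acc_history)
```

- In the notebook's loop, the values returned by `train` are tensors, so they would probably need converting (for example with `float(loss)`) before being averaged this way.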


# Testing

In [208]:
text = df_questions.iloc[1].Text
y_true_label = df_questions.iloc[1].Tags 

In [209]:
text

"forecasting demographic census what are some of the ways to forecast demographic census with some validation and calibration techniques some of the concerns census blocks vary in sizes as rural areas are a lot larger than condensed urban areas is there a need to account for the area size difference if let's say i have census data dating back to 4 5 census periods how far can i forecast it into the future if some of the census zone change lightly in boundaries how can i account for that change what are the methods to validate census forecasts for example if i have data for existing 5 census periods should i model the first 3 and test it on the latter two or is there another way what's the state of practice in forecasting census data and what are some of the state of the art methods"

In [210]:
y_true_label

array(['forecasting'], dtype=object)

In [211]:
y_true_label = multi_label_binarizer.transform([y_true_label])
y_true_label

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [213]:
vector_rep = text_to_feature_vector([text]) / VOCAB_SIZE
vector_rep

array([[0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.038, 0.019, 0.063,
        0.005, 0.001, 0.846, 0.004, 0.569, 0.017, 0.063, 0.379, 0.009,
        0.949, 0.063, 0.005, 0.001, 0.01 , 0.821, 0.023, 0.019, 0.003,
        0.568, 0.822, 0.119, 0.007, 0.045, 0.003, 0.15 , 0.004, 0.635,
        0.011, 0.001, 0.725, 0.155, 0.173, 0.027, 0.505, 0.161, 0.002,
        0.018, 0.016, 0.9  , 0.004, 0.047, 0.042, 0.031, 0.443, 0.033,
        0.002, 0.569, 0.02 , 0.164, 0.001, 0.027, 0.063, 0.005, 0.001,
      

In [214]:
vector_rep.shape

(1, 180)

In [215]:
input_x = tf.constant(vector_rep)
input_x.shape

TensorShape([1, 180])

In [216]:
y_pred = model(input_x)

In [217]:
# Round each probability to the nearest integer to get hard 0/1 predictions
tf.math.round(tf.squeeze(y_pred))

<tf.Tensor: shape=(100,), dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)>
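
- Rounding sends every probability below 0.5 to 0, so a model whose largest output is about 0.22 (see the `prediction_to_label` output later in this section) predicts no tags at all.
- One possible workaround, sketched below with NumPy, is to use a lower threshold and fall back to the k most probable labels when nothing passes it. The helper name `decode_labels` and the default values of `threshold` and `top_k` are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def decode_labels(probs, label_names, threshold=0.5, top_k=3):
    """Return the label names whose probability meets the threshold.

    If no label qualifies, fall back to the top_k most probable labels.
    Both `threshold` and `top_k` are illustrative choices.
    """
    probs = np.asarray(probs, dtype=float)
    chosen = np.flatnonzero(probs >= threshold)
    if chosen.size == 0:
        # argsort is ascending, so reverse it and keep the first top_k.
        chosen = np.argsort(probs)[::-1][:top_k]
    return [label_names[i] for i in chosen]


# Four of the probabilities reported by prediction_to_label, rounded.
names = ["probability", "statistical-significance", "regression", "forecasting"]
probs = [0.2247, 0.1557, 0.1470, 0.0661]

fallback = decode_labels(probs, names)                    # nothing reaches 0.5
lowered = decode_labels(probs, names, threshold=0.15)     # keeps labels >= 0.15
print(fallback)
print(lowered)
```

- Neither variant recovers the true tag `forecasting` for this example, since its probability (about 0.066) ranks well below the top labels; the decoding rule cannot compensate for an under-trained model.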

In [218]:
calculate_accuracy(y_true_label, y_pred)

<tf.Tensor: shape=(), dtype=float32, numpy=0.99>
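
- An accuracy of 0.99 is misleading here. The rounded prediction is all zeros, so it agrees with 99 of the 100 label positions while missing the only true tag, `forecasting`. This is consistent with an element-wise comparison, and it also explains why the training accuracy sits near 0.98 from the first epoch.
- Precision, recall and F1 over the positive labels expose the problem. The NumPy sketch below is a minimal illustration; the function name `precision_recall_f1` and the convention of reporting 0.0 when a denominator is zero are assumptions, not part of the notebook.

```python
import numpy as np


def precision_recall_f1(y_true, y_pred):
    """Micro-averaged precision, recall and F1 for binary indicator arrays.

    Ratios with a zero denominator are reported as 0.0 (an assumed convention).
    """
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # labels predicted and actually present
    fp = np.sum(~y_true & y_pred)   # labels predicted but absent
    fn = np.sum(y_true & ~y_pred)   # labels present but missed
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return float(precision), float(recall), float(f1)


# Same situation as above: 100 labels, one true tag (index 31), nothing predicted.
y_true = np.zeros((1, 100), dtype=int)
y_true[0, 31] = 1
y_pred = np.zeros((1, 100), dtype=int)

accuracy = float(np.mean(y_true == y_pred))
scores = precision_recall_f1(y_true, y_pred)
print(accuracy)   # 0.99
print(scores)     # (0.0, 0.0, 0.0)
```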

In [220]:
y_pred.shape

TensorShape([1, 100])

In [239]:
prediction_to_label(y_pred)

{'probability': 0.22474242746829987,
 'statistical-significance': 0.15567928552627563,
 'regression': 0.14701983332633972,
 'r': 0.14516624808311462,
 'time-series': 0.12335717678070068,
 'logistic': 0.099468894302845,
 'data-visualization': 0.09647674858570099,
 'sample-size': 0.07880719006061554,
 'mcmc': 0.07526526600122452,
 'conditional-probability': 0.07357914000749588,
 'hypothesis-testing': 0.06757707893848419,
 'forecasting': 0.06606203317642212,
 'correlation': 0.05511552840471268,
 'machine-learning': 0.04774818569421768,
 't-test': 0.04656628146767616,
 'data-transformation': 0.0453295074403286,
 'random-variable': 0.0439240038394928,
 'categorical-data': 0.040010299533605576,
 'factor-analysis': 0.039026033133268356,
 'python': 0.03863678127527237,
 'sampling': 0.038059305399656296,
 'multiple-comparisons': 0.03742443770170212,
 'normal-distribution': 0.03584735840559006,
 'optimization': 0.029050569981336594,
 'p-value': 0.02873215265572071,
 'model-selection': 0.02831405