# FAKE NEWS CLASSIFIER : Build a system to identify unreliable news articles
Develop a machine learning program to identify when an article might be fake news. Run by the UTK Machine Learning Club.



## Dataset Description
**train.csv** : A full training dataset with the following attributes:

* id: unique id for a news article
* title: the title of a news article
* author: author of the news article
* text: the text of the article; could be incomplete
* label: a label that marks the article as potentially unreliable

    1: unreliable
    
    0: reliable

In [434]:
# !pip install numpy pandas matplotlib -q
# !pip install nltk textblob -q
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import textblob
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# !git clone https://github.com/tikendraw/funcyou.git -q
import funcyou

In [435]:
import re
from sklearn.model_selection import train_test_split

In [436]:
# DOWNLOADING EXTRA FILES
# nltk.download('all')
# !python -m textblob.download_corpora

# GET THE DATA

In [437]:
from funcyou.dataset import download_kaggle_dataset

In [438]:
# IMPORT THE DATA
DATA_LINK = 'https://www.kaggle.com/competitions/fake-news/code'

# download_kaggle_dataset(url = DATA_LINK)

In [439]:
# UNZIP THE DATA
# !unzip fake-news.zip -d dataset

## READ THE DATA

In [440]:
data0 = pd.read_csv('./dataset/train.csv')
print('dataset.shape: ', data0.shape)
data0.info()  # info() prints its report itself and returns None

dataset.shape:  (20800, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [441]:
data = data0.copy()
data = data[:10].copy()  # small subset for quick experiments; remove this line to work on the full dataset

In [442]:
data.shape

(10, 5)

In [443]:
data

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
5,5,Jackie Mason: Hollywood Would Love Trump if He...,Daniel Nussbaum,"In these trying times, Jackie Mason is the Voi...",0
6,6,Life: Life Of Luxury: Elton John’s 6 Favorite ...,,Ever wonder how Britain’s most iconic pop pian...,1
7,7,Benoît Hamon Wins French Socialist Party’s Pre...,Alissa J. Rubin,"PARIS — France chose an idealistic, traditi...",0
8,8,Excerpts From a Draft Script for Donald Trump’...,,Donald J. Trump is scheduled to make a highly ...,0
9,9,"A Back-Channel Plan for Ukraine and Russia, Co...",Megan Twohey and Scott Shane,A week before Michael T. Flynn resigned as nat...,0




Here we drop `id` (a unique identifier carries no predictive information about an article) and `author` (so the model does not become biased towards particular authors).

In [444]:
# DROPPING ID AND AUTHOR
data.drop(['id','author'], axis = 1, inplace = True)

In [445]:
data.head(5)

Unnamed: 0,title,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Let...,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Ever get the feeling your life circles the rou...,0
2,Why the Truth Might Get You Fired,"Why the Truth Might Get You Fired October 29, ...",1
3,15 Civilians Killed In Single US Airstrike Hav...,Videos 15 Civilians Killed In Single US Airstr...,1
4,Iranian woman jailed for fictional unpublished...,Print \nAn Iranian woman has been sentenced to...,1


In [446]:
data.title[0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

# Preprocessing

### Handling Missing values

In [447]:
# CHECK FOR NAN VALUES
data.isnull().sum()

title    0
text     0
label    0
dtype: int64

In [448]:
data.shape

(10, 3)

In [449]:
# CHECK IF TEXT AND TITLE ARE MISSING TOGETHER
# (NaN never compares equal to anything, so use isnull() rather than == np.nan)
data[data['text'].isnull() & data['title'].isnull()].shape

(0, 3)

There are no values where both 'text' and 'title' are missing
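
Missing-value checks like this one need `isnull()`, because a comparison with `np.nan` is always `False`. The sketch below shows the difference on a tiny made-up frame; the column values are invented for illustration and are not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the last row has both fields missing
df = pd.DataFrame({"title": ["a", None, None], "text": ["x", "y", None]})

# Equality against NaN never matches, so this mask is all False
eq_mask = (df["text"] == np.nan) & (df["title"] == np.nan)

# isnull() detects None/NaN as intended
null_mask = df["text"].isnull() & df["title"].isnull()

n_eq = int(eq_mask.sum())
n_null = int(null_mask.sum())
print("rows found with == np.nan:", n_eq)
print("rows found with isnull(): ", n_null)
```

With the equality test the row with both fields missing goes undetected, which would make a "nothing is missing" conclusion unreliable on the full dataset.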

In [450]:
text_nulls = data[data['text'].isnull()].index.tolist()
title_nulls  = data[data['title'].isnull()].index.tolist()

In [451]:
# LABEL COUNTS FOR ROWS WITH A MISSING TITLE
print('missing title values: ', len(title_nulls))
print('missing title label counts: ', data[data['title'].isnull()]['label'].value_counts())

# LABEL COUNTS FOR ROWS WITH A MISSING TEXT
print('\nmissing text values: ', len(text_nulls))
print('missing TEXT  label counts: ', data[data['text'].isnull()]['label'].value_counts())


missing title values:  0
missing title label counts:  Series([], Name: label, dtype: int64)

missing text values:  0
missing TEXT  label counts:  Series([], Name: label, dtype: int64)


When either text or title is missing, the article is explicitly labelled as fake news (the 10-row sample above contains no missing values, so its counts are empty).

So we drop the nulls.

In [452]:
print('before dropping na: ',data.shape)
data.dropna(axis = 0, inplace = True)
print('after dropping na: ',data.shape)

before dropping na:  (10, 3)
after dropping na:  (10, 3)


## Replacing Contractions

In [453]:
#@title this is contractions :: too long don't open
contractions_dict = { 
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

In [469]:
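# FUNCTION TO EXPAND CONTRACTIONS
def cont_to_exp(x):
    """Lowercase `x` and replace every whitespace-separated token found in contractions_dict."""
    x = str(x).lower()
    exp_sentence = []
    for s in x.split():
        # NOTE: the keys use the straight apostrophe ('), so tokens written with the
        # typographic apostrophe (’), such as "didn’t", are left unchanged;
        # keys containing capitals (e.g. "I'd") cannot match the lowercased text either
        s = contractions_dict.get(s, s)
        exp_sentence.append(s)
    return ' '.join(exp_sentence)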

In [470]:
# ss = "you Didn't he don't they can't"
ss  = data.title[0]
ss

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

In [471]:
contractions_dict.get("didn't")

'did not'

In [472]:
sss = cont_to_exp(ss).split()
print(sss)

['house', 'dem', 'aide:', 'we', 'didn’t', 'even', 'see', 'comey’s', 'letter', 'until', 'jason', 'chaffetz', 'tweeted', 'it']


In [473]:
sss[4] in contractions_dict.keys()

False
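
The `False` above comes from the apostrophe: the title contains the typographic `’` (U+2019), while the dictionary keys use the straight `'`. A minimal sketch of one possible fix is to normalize apostrophes before the lookup. The helper names `normalize_apostrophes` and `expand`, and the two-entry `demo_contractions` mapping, are assumptions made for this illustration and are not part of the notebook.

```python
def normalize_apostrophes(text):
    """Replace typographic single quotes with the straight apostrophe."""
    return str(text).replace("\u2019", "'").replace("\u2018", "'")

# Hypothetical, shortened stand-in for contractions_dict
demo_contractions = {"didn't": "did not", "don't": "do not"}

def expand(text, mapping):
    """Lowercase, normalize apostrophes, then expand known contractions."""
    words = normalize_apostrophes(text).lower().split()
    return " ".join(mapping.get(word, word) for word in words)

title = "We Didn\u2019t Even See Comey\u2019s Letter"
expanded = expand(title, demo_contractions)
print(expanded)
```

The same normalization could be applied at the top of `cont_to_exp`; whether that is desirable depends on how the later cleaning steps are expected to treat possessives such as `comey's`.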

In [421]:
%%time
data['text'] = data['text'].apply(cont_to_exp)
data['title'] = data['title'].apply(cont_to_exp)

CPU times: user 6.72 ms, sys: 0 ns, total: 6.72 ms
Wall time: 7.43 ms


In [422]:
data.head()

Unnamed: 0,title,text,label
0,house dem aide: we didn’t even see comey’s let...,house dem aide: we didn’t even see comey’s let...,1
1,"flynn: hillary clinton, big woman on campus - ...",ever get the feeling your life circles the rou...,0
2,why the truth might get you fired,"why the truth might get you fired october 29, ...",1
3,15 civilians killed in single us airstrike hav...,videos 15 civilians killed in single us airstr...,1
4,iranian woman jailed for fictional unpublished...,print an iranian woman has been sentenced to s...,1


## Cleaning and Preprocessing

In [423]:
def text_cleaning(text):
    text = str(text).lower()
    # remove urls first, while "://" and "." are still present
    text = re.sub(r"http\S+|www\.\S+", " ", text)
    # remove possessive "'s" (straight or typographic apostrophe)
    text = re.sub(r"['’]s\b", "", text)
    # replace everything that is not a letter (digits, punctuation, special characters) with a space
    text = re.sub(r"[^a-z]", " ", text)
    # remove all single characters
    text = re.sub(r"\b[a-z]\b", " ", text)
    # replace multiple spaces with a single space
    text = re.sub(r"\s+", " ", text).strip()

    return text

In [424]:
%%time
data['title'] = data['title'].apply(text_cleaning) 
data['text'] = data['text'].apply(text_cleaning) 

CPU times: user 27.6 ms, sys: 0 ns, total: 27.6 ms
Wall time: 28.2 ms


In [425]:
data.head()

Unnamed: 0,title,text,label
0,house dem aide we didn even see comey letter u...,house dem aide we didn even see comey letter u...,1
1,flynn hillary clinton big woman on campus brei...,ever get the feeling your life circles the rou...,0
2,why the truth might get you fired,why the truth might get you fired october the ...,1
3,civilians killed in single us airstrike have ...,videos civilians killed in single us airstrike...,1
4,iranian woman jailed for fictional unpublished...,print an iranian woman has been sentenced to s...,1


In [426]:
# Cleaning the words
lemmatizer = WordNetLemmatizer()
# build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def nltk_clean(text):
    text = str(text).lower()
    text = [lemmatizer.lemmatize(word) for word in word_tokenize(text) if word not in stop_words]
    text = ' '.join(text)
    return text

In [428]:
%%time

# takes a lot of time on the full dataset
data['title'] = data['title'].apply(nltk_clean) 
data['text'] = data['text'].apply(nltk_clean) 

CPU times: user 525 ms, sys: 52.3 ms, total: 578 ms
Wall time: 576 ms


In [429]:
data.head()

Unnamed: 0,title,text,label
0,house dem aide even see comey letter jason cha...,house dem aide even see comey letter jason cha...,1
1,flynn hillary clinton big woman campus breitbart,ever get feeling life circle roundabout rather...,0
2,truth might get fired,truth might get fired october tension intellig...,1
3,civilian killed single u airstrike identified,video civilian killed single u airstrike ident...,1
4,iranian woman jailed fictional unpublished sto...,print iranian woman sentenced six year prison ...,1


In [None]:
data.title[0]

# Feature Engineering

In [None]:
data['title_len'] = data['title'].apply(lambda x: len(str(x).split()))
data['text_len'] = data['text'].apply(lambda x: len(str(x).split()))
data['title_text_ratio'] = data['title_len']/data['text_len']
# train['avg_title_len'] = train['title'].apply(lambda x: len(str(x).split()))

In [None]:
data

In [None]:
# Title Word Count distribution
plt.figure(figsize = (15,7))
plt.grid()
plt.hist(x = data.title_len, bins=25)
plt.xlabel('word length')
plt.ylabel('Count')
# plt.plot(x = , y = 0)
percent = 99
for i in range(95,100):
    plt.axvline(x = np.percentile(data.title_len, i), color = 'b', label = 'axvline - full height')
plt.title(f'Title Word Count distribution: {np.percentile(data.title_len, percent)} words cover {percent}% of title data')
plt.show()

In [None]:
# Text Word Count distribution
plt.figure(figsize = (15,7))
plt.grid()
plt.hist(x = data.text_len, bins=25)
plt.xlabel('word length')
plt.ylabel('Count')
# plt.plot(x = , y = 0)
percent = 99
for i in range(95,100):
    plt.axvline(x = np.percentile(data.text_len, i), color = 'b', label = 'axvline - full height')
plt.title(f'Text Word Count distribution: {np.percentile(data.text_len, percent)} words cover {percent}% of text data')
plt.show()

So a length of up to 22 words covers 99% of the titles in the dataset, and a length of up to 4061 words covers 99% of the texts.
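
As a rough illustration of how a percentile turns into a sequence-length cutoff, the sketch below uses a small made-up array of word counts (not the notebook's data) and checks what share of the samples fits under the chosen cutoff.

```python
import numpy as np

# Hypothetical word counts with one very long outlier
lengths = np.array([5, 7, 8, 9, 10, 11, 12, 13, 14, 60])

percent = 90
cutoff = int(np.percentile(lengths, percent))   # truncate to a whole number of words
covered = float(np.mean(lengths <= cutoff))     # share of samples that fit without truncation

print(f"cutoff: {cutoff} words, covers {covered:.0%} of the samples")
```

Choosing a high percentile rather than the maximum keeps a single outlier (60 words here) from dictating the padded length of every sequence.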

# Text Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
s = data.text[11]
s

In [None]:
from scipy import sparse
tfidf = TfidfVectorizer()
ddd = tfidf.fit_transform(np.array([s]))
ddd

In [None]:
# tfidf.get_feature_names()

In [None]:
data[['text','title','label']]

# Helper functions

In [None]:
# Function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import datetime


def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1 score of a binary classification model.
  Args:
      y_true: true labels in the form of a 1D array
      y_pred: predicted labels in the form of a 1D array
  Returns a dictionary of accuracy, precision, recall, f1-score.
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1 score using the "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                  "precision": model_precision,
                  "recall": model_recall,
                  "f1": model_f1}
  return model_results
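
A short usage sketch of the same scikit-learn calls on made-up labels may help when reading the results later. Note that, as in `calculate_results`, accuracy is scaled to a percentage while precision, recall and F1 stay in the 0 to 1 range. The label lists are invented for illustration.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted labels (1 = unreliable, 0 = reliable)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

acc = accuracy_score(y_true, y_pred) * 100  # percentage
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"      # per-class scores weighted by class support
)

print({"accuracy": acc, "precision": precision, "recall": recall, "f1": f1})
```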

In [None]:
def create_tensorboard_callback(dir_name, experiment_name):
    """
    Creates a TensorBoard callback instance to store log files.
    Stores log files with the filepath:
    "dir_name/experiment_name/current_datetime/"
    Args:
    dir_name: target directory to store TensorBoard log files
    experiment_name: name of experiment directory (e.g. efficientnet_model_1)
    """
    log_dir = dir_name + "/" + experiment_name + "/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=log_dir
    )
    print(f"Saving TensorBoard log files to: {log_dir}")
    return tensorboard_callback


# Splitting data

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(data[['title', 'text']], data['label'], test_size = .1, random_state=3, stratify = data.label)

In [None]:
print('xtrain: ', xtrain.shape)
print('ytrain: ', ytrain.shape)
print('xtest: ', xtest.shape)
print('ytest: ', ytest.shape)
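
The `stratify` argument used in the split keeps the class ratio the same in the train and test parts. A minimal sketch on synthetic, perfectly balanced labels (an assumption for the demo, not the notebook's data):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical balanced dataset: 50 reliable (0) and 50 unreliable (1) items
items = list(range(100))
labels = [0] * 50 + [1] * 50

x_tr, x_te, y_tr, y_te = train_test_split(
    items, labels, test_size=0.1, random_state=3, stratify=labels
)

print("train class counts:", Counter(y_tr))
print("test class counts: ", Counter(y_te))
```

Stratification needs at least a couple of samples per class in each part, so it is worth checking the class counts before splitting a very small subset.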

# Baseline : Model 0

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

In [None]:
model0 = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

#fit model
model0.fit(xtrain.title.to_list(), ytrain.to_list())


In [None]:
ytrue0 = ytest.to_list()
ypred0 = model0.predict(xtest.title.to_list())

In [None]:
calculate_results(ytrue0, ypred0)

# Token Vectorization

In [None]:
from tensorflow.keras.layers import TextVectorization, Conv1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from tensorflow.keras import layers
import tensorflow as tf
from tensorflow import keras

In [None]:
# number of unique words in dataset
# %%time
all_title_words_list = [words.split() for words in xtrain.title]
all_title_words = set(num for sublist in all_title_words_list for num in sublist)
print('total token in titles: ',len(all_title_words))

all_text_words_list = [words.split() for words in xtrain.text]
all_text_words = set(num for sublist in all_text_words_list for num in sublist)
print('total token in text: ',len(all_text_words))

all_words_combined_list = all_text_words_list + all_title_words_list
all_words_combined = set(num for sublist in all_words_combined_list for num in sublist)

print('total token combined: ',len(all_words_combined))

In [None]:
# output_sequence_length
percent_of_the_data_to_cover = 99
output_sequence_len_title = int(np.percentile(data.title_len, percent_of_the_data_to_cover))
print('output_sequence_len_title: ',output_sequence_len_title)
output_sequence_len_text = int(np.percentile(data.text_len, percent_of_the_data_to_cover))
print('output_sequence_len_text: ',output_sequence_len_text)

We do not want to tokenize every word: many words occur only rarely and add little. That is why we cap the vocabulary below the number of unique words, subtracting 2,500 words for the titles and 100,000 words for the texts.

In [None]:
max_token_title = len(all_title_words) - 2500          # number of words to tokenize: 2,500 fewer than the unique words, as we do not want every word tokenized
print(max_token_title)
max_token_text = len(all_text_words) -100_000
print(max_token_text)
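
Subtracting a fixed number from the vocabulary size is one way to choose the cap. An alternative, sketched here as an assumption rather than the notebook's method, is to count word frequencies and keep only words that occur at least `min_count` times. The toy `docs` list and the `min_count` threshold are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned titles
docs = ["trump wins vote", "vote count continues", "trump vote recount xylophone"]

counts = Counter(word for doc in docs for word in doc.split())

min_count = 2  # assumed threshold: drop words seen fewer than 2 times
kept = [word for word, count in counts.most_common() if count >= min_count]

# In "int" output mode TextVectorization reserves two slots (padding and
# out-of-vocabulary), so they are added on top of the kept words
max_tokens = len(kept) + 2

print("kept words:", kept)
print("max_tokens:", max_tokens)
```

A frequency threshold also avoids a negative or zero cap when the notebook is run on a small subset whose vocabulary is smaller than the subtracted constant.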

In [None]:
# VECTORIZER FOR TITLE
title_text_vectorizer = TextVectorization(max_tokens=max_token_title, 
                                          output_sequence_length=output_sequence_len_title,
                                          pad_to_max_tokens = True)

In [None]:
# VECTORIZER FOR TEXT
text_text_vectorizer = TextVectorization(max_tokens=max_token_text, 
                                          output_sequence_length=output_sequence_len_text,
                                          pad_to_max_tokens = True)

In [None]:
# Adapt text vectorizer to training titles
# %%time
title_text_vectorizer.adapt(xtrain.title.to_list())

In [None]:
# how many words are there
total_words =  len(title_text_vectorizer.get_vocabulary())
print('total no. of words: ', total_words)
print('5 Most frequent words', title_text_vectorizer.get_vocabulary()[:5])
print('5 Least frequent words', title_text_vectorizer.get_vocabulary()[-5:])

In [None]:
# Test out text vectorizer
import random
target_title_sentence = random.choice(xtest.title.to_list())
print(f"Text:\n{target_title_sentence}")
print(f"\nLength of text: {len(target_title_sentence.split())}")
print(f"\nVectorized text:\n{title_text_vectorizer([target_title_sentence])}")

**Note:** Instead of adapting directly to a list of strings, we convert the text to a `tf.data.Dataset` so that the data does not have to fit in memory at once. Adapting to a plain list crashed the system, whereas a `tf.data.Dataset` adjusts to the available memory.

In [None]:
# Adapt text vectorizer to training text
# %%time

train_text = tf.data.Dataset.from_tensor_slices(xtrain.text.to_list())
text_text_vectorizer.adapt(train_text)

In [None]:
# how many words are there
total_words =  len(text_text_vectorizer.get_vocabulary())
print('total no. of words: ', total_words)
print('5 Most frequent words', text_text_vectorizer.get_vocabulary()[:5])
print('5 Least frequent words', text_text_vectorizer.get_vocabulary()[-5:])

As seen above, the vocabulary contains low-frequency words that do not need to be tokenized because they occur so rarely.
So set `max_tokens` to less than the actual number of unique words.

In [None]:
# Test out text vectorizer
import random
target_text_sentence = random.choice(xtest.text.to_list())
print(f"Text:\n{target_text_sentence}")
print(f"\nLength of text: {len(target_text_sentence.split())}")
print(f"\nVectorized text:\n{text_text_vectorizer([target_text_sentence])}")

# Word Embedding

In [None]:
from tensorflow.keras.layers import Embedding

### Title Embedding

In [None]:
total_title_words =  len(title_text_vectorizer.get_vocabulary())
print(total_title_words)

In [None]:
title_embedder = Embedding(input_dim = total_title_words,
                           output_dim = 32,
                           mask_zero = True)

In [None]:
print('sentence: ', target_title_sentence ,end = '\n')
print('\n')
vectorized_sentence = title_text_vectorizer([target_title_sentence])
print('vectorized: ', vectorized_sentence)
print('\n')
embedded_sentence = title_embedder(vectorized_sentence)
print('embedded shape', embedded_sentence.shape)
print('embedded: ', embedded_sentence)


### Text embedding

In [None]:
total_text_words =  len(text_text_vectorizer.get_vocabulary())
print(total_text_words)

In [None]:
text_embedder = Embedding(input_dim = total_text_words,
                           output_dim = 32,
                           mask_zero = True)

In [None]:
print('sentence: ', target_text_sentence ,end = '\n')
print('sentence len: ', len(target_text_sentence) ,end = '\n')

print('\n')
vectorized_sentence = text_text_vectorizer([target_text_sentence])
print('vectorized: ', vectorized_sentence)
print('\n')
embedded_sentence = text_embedder(vectorized_sentence)
print('embedded shape', embedded_sentence.shape)
print('embedded: ', embedded_sentence[:,:5,:5]) #This is too long to print 


# Creating `tf.data`

In [None]:
# train tf.data
# train_title = tf.data.Dataset.from_tensor_slices(xtrain.title.to_list())
# train_text = tf.data.Dataset.from_tensor_slices(xtrain.text.to_list())
train_label = tf.data.Dataset.from_tensor_slices(ytrain)


train_features = tf.data.Dataset.from_tensor_slices((xtrain.title.to_list(), xtrain.text.to_list()))
train_dataset = tf.data.Dataset.zip((train_features, train_label))

# test tf.data
# test_title = tf.data.Dataset.from_tensor_slices(xtest.title.to_list())
# test_text = tf.data.Dataset.from_tensor_slices(xtest.text.to_list())
test_label = tf.data.Dataset.from_tensor_slices(ytest)

test_features = tf.data.Dataset.from_tensor_slices((xtest.title.to_list(), xtest.text.to_list()))
test_dataset  = tf.data.Dataset.zip((test_features, test_label))


In [None]:
modela = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

#fit model
modela.fit(xtrain.title.to_list(), ytrain)
ytruea = ytest.to_list()
# predict with the model that was just fitted
ypreda = modela.predict(xtest.title.to_list())

calculate_results(ytruea, ypreda)

In [None]:
# Visualizing the data
for i,j in train_dataset.take(1):
    print(i,j)
    break

In [None]:
# prefetching 
train_dataset = train_dataset.batch(64).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(64).prefetch(tf.data.AUTOTUNE)

# Model 1

In [79]:
input1 = keras.Input(shape = (1,), dtype = tf.string)
tokenize = title_text_vectorizer(input1)
print('tokenize shape: ', tokenize.shape)

embedd = title_embedder(tokenize)
print('embedded shape: ',embedd.shape)

# a conv1d layer followed by pooling, so each title is reduced to a single vector
x = layers.Conv1D(32, 5, 1, activation='relu')(embedd)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(.3)(x)
x = layers.Dense(32, activation='relu')(x)
# sigmoid keeps the output in (0, 1), as BinaryCrossentropy expects a probability
output1 = layers.Dense(1, activation='sigmoid')(x)

# build the model
model1 = keras.Model(inputs = input1, outputs = output1)

#compile
model1.compile(loss = tf.keras.losses.BinaryCrossentropy(),
              optimizer = tf.keras.optimizers.Adam(),
              metrics = ['accuracy'])

tokenize shape:  (None, 22)
embedded shape:  (None, 22, 32)


In [80]:
EPOCHS = 3

In [81]:
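# model1 takes only the title, so keep the first element of each (title, text) feature pair
train_title_dataset = train_dataset.map(lambda features, label: (features[0], label))
test_title_dataset = test_dataset.map(lambda features, label: (features[0], label))

hist1 = model1.fit(train_title_dataset, epochs = EPOCHS,
                      steps_per_epoch = int(.1* (len(train_dataset)/EPOCHS)),
                      validation_steps = int(.2* (len(test_dataset)/EPOCHS)),
                      validation_data=test_title_dataset,
                      callbacks = [create_tensorboard_callback('tb','model1')]
                    )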

Saving TensorBoard log files to: tb/model1/20221109-195505
Epoch 1/3
Epoch 2/3
Epoch 3/3


# Modelaa


In [73]:
from tensorflow.keras import layers
from tensorflow.keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D

In [74]:
input1 = layers.Input(shape = (1,), dtype = tf.string)
token = title_text_vectorizer(input1)
embed = title_embedder(token)
x = layers.LSTM(32,return_sequences=True)(embed)
x = layers.LSTM(32,return_sequences=True)(x)
title_out = GlobalAveragePooling1D()(x)
print('title output shape: ',title_out.shape)

input2 = layers.Input(shape = (1,), dtype= tf.string)
token2 = text_text_vectorizer(input2)
embed2 = text_embedder(token2)
x = layers.LSTM(32,return_sequences=True)(embed2)
x = layers.LSTM(32,return_sequences=True)(x)
text_out = GlobalAveragePooling1D()(x)
print('text output shape: ',text_out.shape)

concat = layers.Concatenate()([title_out, text_out])
print('concatenated shape: ',concat.shape)

x = layers.Dropout(.3)(concat)
x = layers.Dense(128, activation = 'relu')(x)
x = layers.Dense(128, activation = 'relu')(x)
outputs = layers.Dense(1, activation = 'sigmoid')(x)

modelaa = keras.Model(inputs = [input1,input2], outputs = outputs)

#COMPILE
# a single sigmoid output with 0/1 labels calls for binary cross-entropy
modelaa.compile(loss = keras.losses.BinaryCrossentropy(),
               optimizer = keras.optimizers.Adam(),
               metrics = ['accuracy'])

title output shape:  (None, 32)
text output shape:  (None, 32)
concatenated shape:  (None, 64)


In [None]:
EPOCHS = 10
len(train_dataset), len(test_dataset)

In [None]:
historyaa = modelaa.fit(train_dataset, epochs = EPOCHS,
                      steps_per_epoch = int(.5* (len(train_dataset)/EPOCHS)),
                      validation_steps = int(.2* (len(test_dataset)/EPOCHS)),
                      validation_data=test_dataset,
                      callbacks = [create_tensorboard_callback('tb','modelaa')]
                    )

In [None]:
# `hist1` is the History object returned by model1.fit (use `historyaa` for the two-input model)
loss = hist1.history['loss']
val_loss = hist1.history['val_loss']

accuracy = hist1.history['accuracy']
val_accuracy = hist1.history['val_accuracy']

# print(val_accuracy)
plt.figure(figsize = (25,7))

plt.subplot(1,2,1)
plt.grid(True)
plt.plot(np.arange(len(loss)), loss, 'r', label='Training loss')
plt.plot(np.arange(len(val_loss)), val_loss, 'bo', label='Validation loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss Value')
plt.legend()

plt.subplot(1,2,2)
plt.grid(True)
plt.plot(np.arange(len(accuracy)), accuracy, 'r', label='Training accuracy')
plt.plot(np.arange(len(val_accuracy)), val_accuracy, 'bo', label='Validation accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy Value')
plt.legend()

plt.show()