## Introduction to Keras ##
### Building a Model
Keras is a high-level API for building ANNs on top of lower-level machine learning frameworks, such as TensorFlow. Historically, Keras provided access to a variety of machine learning frameworks, but it is now most commonly used with TensorFlow, one of the "big 2" machine learning frameworks (along with PyTorch).

Before we move on to setting up an ANN for an NLP task, like classification, let's learn how to build a simple ANN that can learn a complex pattern we define in some synthetic data.

For our purposes, we're going to use the `Sequential` API within Keras, which means we build our models step-by-step (i.e., each layer is added sequentially). We will also stick with `Dense`, or fully connected, layers.

Let's import the necessary components to get started.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split

If you recall, in the slides we had this neural network:

![image](7A-image.jpg)

Let's describe this in words, and then we'll convert that to an ANN defined with Keras:
1. The input layer has 3 nodes
2. The first hidden layer has 2 nodes
3. The second hidden layer has 4 nodes
4. The third hidden layer has 3 nodes
5. The output layer has one node.

Here's how we'd set this up with Keras:

In [None]:
model = Sequential() # establish the sequential ANN
# First hidden layer:
model.add(Dense(2,input_shape=(3,)))  # first layer requires us to specify input_shape
model.add(Dense(4)) # Second hidden layer
model.add(Dense(3)) # Third hidden layer
model.add(Dense(1)) # Output layer

How do we know if our model was set up correctly? We can use the `summary()` method:

In [None]:
model.summary()

### Fitting the Model
Now that we have the model built, let's look at how we'd train these 39 parameters. We'll evaluate some ANNs with actual data later, but for now let's build a random dataset with 3 input features and one output variable.
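Where does 39 come from? Each `Dense` layer has one weight per input-to-node connection plus one bias per node. The short check below reproduces the count by hand; the helper name `dense_param_count` is just an illustrative choice, not part of Keras.

```python
# Layer sizes from the description above: 3 inputs, hidden layers of 2, 4, and 3 nodes, 1 output
layer_sizes = [3, 2, 4, 3, 1]

def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Pair each layer's input size with its number of nodes
per_layer = [dense_param_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [8, 12, 15, 4] 39
```

These per-layer counts should line up with the "Param #" column reported by `model.summary()`.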

We'll use pandas and numpy to do this:

In [None]:
import pandas as pd, numpy as np
np.random.seed(123)
X1 = np.random.normal(5,2,size=10000) # random variable with mean 5, SD 2
X2 = (np.random.uniform(0,1,size=10000)>0.5).astype(int)
X3 = np.random.exponential(2,size=10000)+2



df = pd.DataFrame(np.stack([X1,X2,X3]).transpose(),
                 columns = ['X1','X2','X3'])

df['Y'] = 5*X1 - 2*X1**2 + X3/4 - 10*X2 + 10*np.log(X3)*X2 - X2/X3*2

In [None]:
import matplotlib.pyplot as plt
plt.scatter(X1,df['Y'])

In [None]:
plt.scatter(X2,df['Y'])

In [None]:
plt.scatter(X3,df['Y'])

Now that we have some data we can train the model. To do so, we must `compile` the model and define the optimizer and loss function we wish to use. Let's define each of these:
- **optimizer**: an algorithm or method used to adjust the parameters of a model during the training process in order to minimize the error or loss function. It determines how the model will update its internal parameters (weights and biases) based on the gradients computed during backpropagation. Examples include SGD, Adam (adaptive moment estimation), Adagrad, etc.
- **loss** (function): the key metric used during training to assess model quality; often determined by the nature of the outcome (e.g., continuous variables use something like MSE)

We'll use "adam" for our optimizer and "mse" for our loss function:

In [None]:
model.compile(optimizer='adam',loss='mse')

Next, we need to fit the model. Before doing so, though, let's train/test split our data so we can evaluate model fit out-of-sample:

In [None]:
train,test = train_test_split(df,train_size=0.80,random_state=123)
train

Now we can fit the model, similar to how we've used the `fit()` method in scikit-learn:

In [None]:
model.fit(train[['X1','X2','X3']],train['Y'],epochs=20)

Before moving on, we're going to introduce a few additional elements to this simple model:
- activation functions: The power of ANNs comes from their ability to learn non-linearities in data, and part of that power arises from the use of activation functions. Keras includes __[a number of built-in activation functions](https://keras.io/api/layers/activations/)__, plus the ability to build custom functions. We'll use "gelu" on the first two hidden layers.
- callbacks: During the fitting process, it's possible to employ "callback functions", which perform an inter-training step, like saving the model or evaluating the possibility of overfitting. We'll use __[EarlyStopping](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping)__ in this next training run.
- validation_data: We know what this is, but we can actually incorporate it into our training procedure, which is very useful with the use of an early-stopping callback. 
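To build intuition for what "patience" means, here is a rough sketch of the early-stopping logic in plain Python. It is a simplified stand-in, not the Keras implementation (which has additional options), and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses: improvement stalls after epoch 4
history = [10.0, 8.0, 7.5, 7.4, 7.6, 7.5, 7.45, 7.7, 7.8, 7.9]
print(early_stopping_epoch(history, patience=5))  # 9
```

Here the best loss occurs at epoch 4, and training stops once 5 consecutive epochs fail to beat it.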

Now we'll re-define the model with these new considerations and re-fit:

In [None]:
# Redefine the model:
model = Sequential()
model.add(Dense(2,input_shape=(3,),activation='gelu'))
model.add(Dense(4,activation='gelu'))
model.add(Dense(3))
model.add(Dense(1))

# Compile
from tensorflow.keras.callbacks import EarlyStopping
model.compile(optimizer='adam',loss='mse')

# Set up callback (early stopper) to monitor validation loss, and 
# have "patience" (how long to wait) of 5
earlystopper = EarlyStopping(monitor='val_loss', patience=5) 

# Fit
model.fit(train[['X1','X2','X3']],train['Y'],
          validation_data= (test[['X1','X2','X3']],test['Y']),
          callbacks=[earlystopper],
          epochs=100)

Suppose we wanted to understand something about the average error in our dataset. We can generate predicted values to do this:

In [None]:
df['predicted_Y'] = model.predict(df[['X1','X2','X3']])
df['error'] = df['Y'] - df['predicted_Y']
df[['Y','error']].describe()

## Using an ANN to detect Fake News 
### Data Setup
Now that we've seen how to build a simple ANN with Keras, let's work on building an ANN to detect fake news!

We're going to use a dataset from Kaggle, available __[here](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset)__. The data consists of a file of fake news stories ("Fake.csv") and a file of real news stories ("True.csv"). Let's load each and set up a label to distinguish between the two datasets:

In [None]:
fake = pd.read_csv('/storage/ice-shared/mgt8833/classdata/Fake.csv')
fake['label'] = 'Fake'
real = pd.read_csv('/storage/ice-shared/mgt8833/classdata/True.csv')
real['label'] = 'True'

These two datasets have the exact same columns, so we can simply stack them. I'm then going to do a random shuffle and reindex so it looks like what we'd expect a normal labeled dataset to look like:

In [None]:
news = pd.concat([fake,real]).sample(frac=1,random_state=123).reset_index(drop=True)
news

We're going to use the "text" column of this dataset in a classifier. Before doing so, though, let's make one simple cleaning-related adjustment based on this quick observation of the data. Namely, "real" news often comes from Reuters, and we don't want that little bit of formatting to drive our results:

In [None]:
news.loc[news['label']=='True','text']

We can use a simple regular expression to clear out this formatting:

In [None]:
news.loc[news['label']=='True','text'].str.replace(r'^[A-Z]+\s+\(Reuters\)','',regex=True)

So we'll use that regular expression, adjusting to account for "/" and whitespace, to clean up the data:
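To see why the adjustment matters, here is a quick check with Python's `re` module on a few made-up datelines (the sample strings are hypothetical, not taken from the dataset). The first pattern only handles single-word, all-caps locations, while the adjusted one also handles spaces and slashes.

```python
import re

first_pattern = r'^[A-Z]+\s+\(Reuters\)'
adjusted_pattern = r'^[A-Z/ ]+\s+\(Reuters\)'

# Hypothetical examples of the kind of dateline we want to strip
samples = [
    "WASHINGTON (Reuters) - Lawmakers met on Monday.",
    "NEW YORK (Reuters) - Stocks rose.",
    "LONDON/PARIS (Reuters) - Talks continued.",
]

first_cleaned = [re.sub(first_pattern, '', s) for s in samples]
adjusted_cleaned = [re.sub(adjusted_pattern, '', s) for s in samples]

for before, after in zip(first_cleaned, adjusted_cleaned):
    print(repr(before), '->', repr(after))
```

Only the first sample is cleaned by the original pattern; the adjusted pattern strips the dateline from all three.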

In [None]:
news['clean_text'] = news['text'].str.replace(r'^[A-Z/ ]+\s+\(Reuters\)','',regex=True)
news

Our labels are currently words, so let's set up an indicator variable. We could just use logic like we have before (e.g., `news['label']=='Fake'`), but instead let's use `map()`, which would be easier to extend to multiple labels:

In [None]:
news['FAKE'] = news['label'].map({'Fake':1,'True':0})
news['FAKE'].mean()

Now, let's create a document term matrix (we'll consider dimensionality reduction afterward). For our document term matrix we'll set the following parameters:
- `max_features` = 1000
- `stop_words` = "english"
- `token_pattern` = "(?u)\b[a-z]{3,}\b" (lowercase letters only, which is sufficient because we also set `lowercase` = True)
- `ngram_range` = (1,2)
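To get a feel for what these settings keep and discard, here is a rough imitation of the tokenization using only Python's `re` module. It is a sketch, not the scikit-learn implementation: the three-word stop list is a tiny stand-in for the full English list, the sentence is made up, and it assumes stop words are dropped before bigrams are formed.

```python
import re

token_pattern = r"(?u)\b[a-z]{3,}\b"
stop_words = {"the", "and", "was"}  # tiny stand-in for the English stop list

# Hypothetical sentence
text = "The senator said on Tuesday the bill was 42 pages and an honest effort"

# Lowercase, keep alphabetic tokens of 3+ letters, then drop stop words
tokens = [t for t in re.findall(token_pattern, text.lower()) if t not in stop_words]

# ngram_range=(1,2) means single tokens plus adjacent pairs
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(tokens)
print(bigrams)
```

Short words ("on", "an") and numbers ("42") never become features, and `max_features` then keeps only the 1,000 most frequent unigrams and bigrams across the corpus.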

In [None]:
news

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

vec = TfidfVectorizer(max_features = 1000,
                      stop_words = 'english',
                      token_pattern = r"(?u)\b[a-z]{3,}\b",
                      lowercase = True,
                      ngram_range = (1,2))
dtm = vec.fit_transform(news['clean_text'])

Next, I want to normalize the data such that each record has unit length. We will use scikit-learn's normalize function (__[docs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html)__). If you picture each numeric representation as a vector in 1,000-dimensional space, normalize will scale all vectors to have unit length (or length of 1). Note that `TfidfVectorizer` already applies L2 normalization by default (`norm='l2'`), so this step mainly makes the normalization explicit.
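As a small illustration of what "unit length" means, the sketch below L2-normalizes the rows of a tiny, made-up matrix with NumPy. It mirrors the idea behind `normalize` rather than its actual implementation, and uses a dense array only for readability.

```python
import numpy as np

# Hypothetical 3-document, 4-term matrix
X = np.array([[3.0, 4.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 0.0]])

# Euclidean (L2) length of each row
norms = np.linalg.norm(X, axis=1, keepdims=True)

# Avoid dividing by zero: all-zero rows are left as they are
safe_norms = np.where(norms == 0, 1.0, norms)
X_unit = X / safe_norms

print(X_unit[0])                       # first row scaled to length 1
print(np.linalg.norm(X_unit, axis=1))  # row lengths after scaling
```

The first row, `[3, 4, 0, 0]`, has length 5, so it becomes `[0.6, 0.8, 0, 0]`. A document with no retained terms stays a zero vector.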

In [None]:
from sklearn.preprocessing import normalize
dtm_normed = normalize(dtm)

Now would be a good time to consider an approach to reducing the dimensionality of the data. For instance, you could use:
- Topic modeling (LDA, NMF)
- PCA (probably "Truncated SVD" since the DTM is sparse)
- UMAP (probably in conjunction with PCA)
- Word embeddings (e.g., Word2Vec)

It turns out the raw data in this instance, though, works pretty well. So for brevity we're going to move on to designing and training the ANN.

### Designing the ANN

Now we'll design an ANN to classify this dataset. Let's first split the data so we can use the holdout (validation) sample during model tuning. Note that Keras doesn't work with sparse matrices, so we need to move to a dense matrix:

In [None]:
trainX,testX,trainY,testY = train_test_split(dtm_normed.todense(),news['FAKE'],train_size = 0.80,random_state=123)

Now we can set up the ANN. How do we choose the number of layers and nodes? There isn't a hard-and-fast rule for this. I decided to start with 2 hidden layers, each with 10 nodes. We'll also use relu activation on the hidden layers. 

Note that since this is a two-class problem with a single 0/1 label and a single output node, we **must** use "sigmoid" activation on the output layer. This will produce a probability we can use to assign labels.

In [None]:
classifier = Sequential()
classifier.add(Dense(10,input_shape=(dtm_normed.shape[1],),activation='relu')) 
classifier.add(Dense(10,activation='relu')) # Second hidden layer
classifier.add(Dense(1,activation='sigmoid'))# Output layer
classifier.summary()

This relatively simple model has over 10,000 parameters to train!

We'll now compile and train the model. We'll use "adam" again for our optimizer, but we must use a different loss function since we're no longer dealing with a continuous output (MSE-type measures don't make sense). For this type of classification problem "binary crossentropy" is appropriate. We'll also include in the compile step an instruction to compute classification accuracy.
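For intuition, here is a minimal NumPy sketch of how a sigmoid output turns into a probability, a label, and a binary cross-entropy loss. The raw output values and labels are made up, and the clipping constant `eps` is an assumption to avoid `log(0)`; Keras handles these details internally.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Average negative log-likelihood of the true 0/1 labels."""
    p = np.clip(p, eps, 1 - eps)  # keep log() finite
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

logits = np.array([2.0, -1.0, 0.0, 3.0])  # hypothetical raw output-node values
y_true = np.array([1, 0, 1, 0])           # hypothetical true labels (1 = Fake)

probs = sigmoid(logits)
labels = (probs > 0.5).astype(int)
loss = binary_crossentropy(y_true, probs)
accuracy = float(np.mean(labels == y_true))

print(probs.round(3), labels, round(loss, 4), accuracy)
```

Notice that the confident wrong prediction (the last observation) contributes far more to the loss than the others, which is exactly the behavior we want from a classification loss.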

To fit, we'll use the following parameters:
- `epochs`: 50
- `callbacks`: We'll use an early stopper again that monitors accuracy of the validation sample ("val_accuracy"). I'll set patience at 5.
- `validation_data`: Pass our "test" datasets, `testX` and `testY`
- `batch_size`: 32 (Note that this is the default, but you can change this. It's another hyperparameter to tune: certain data work best with smaller batch sizes, and others work best with larger).

Let's train!

In [None]:
classifier.compile(loss='binary_crossentropy',optimizer = 'adam',metrics=['accuracy'])
earlystopper = EarlyStopping(monitor='val_accuracy', patience=5)
classifier.fit(trainX,trainY,epochs=50,callbacks=[earlystopper],
               validation_data=(testX,testY), batch_size=32)

Wow! Very accurate, and very quickly! Should we be concerned?

### Revisiting our data 

This truly is the first model I tried, but it did concern me so I checked a few things:
1. Did we "taint" the model by including some function of the output as an input feature? **ANSWER**: No, I don't think so.
2. Which features are important? **ANSWER**: Permutation importance is one way to check (commented code below)
3. Are the input data for each class (fake and real) comparable? **ANSWER**: We normed all the vectors, so they should be comparable.
4. What features are systematically different? **ANSWER**: See below

In [None]:
# These next two cells compute feature importances using permutation importance, and the %%capture (magic) 
# limits the output of the first cell. Permutation importance is a common way to get
# a general sense of feature importance. Each score must come from predictions on the permuted
# data; scoring the unpermuted data would report zero importance for every feature.

In [None]:
# %%capture
# # Uncomment this code if you want to run a feature importance analysis. I've commented so you can try to understand
# # what's done
# def calculate_feature_importance(model, X, y, metric):
#     baseline_score = metric(y, (model.predict(X)>0.5)[:,0].astype(int)) # computes current accuracy as baseline

#     feature_importance = np.zeros(X.shape[1]) # generates vector of 0s of same shape as number of features
#     for feature_index in range(X.shape[1]): # iterates over each feature
#         X_permuted = X.copy() # copies the original X data into a new dataset
#         np.random.shuffle(X_permuted[:, feature_index]) # randomly sorts one column of the data (feature_index)

#         permuted_score = metric(y, (model.predict(X_permuted)>0.5)[:,0].astype(int)) # uses the permuted data to re-compute accuracy
#         feature_importance[feature_index] = baseline_score - permuted_score # computes feature importance as difference between baseline and permuted accuracy

#     feature_importance /= baseline_score  # Normalize the importance values (scale by original accuracy)
#     return feature_importance

# # Calculate feature importance using permutation importance
# from sklearn.metrics import accuracy_score
# feature_importance = calculate_feature_importance(classifier, testX, testY, accuracy_score)

In [None]:
# # Print feature importance values
# for feature_index, importance_value in enumerate(feature_importance):
#     if importance_value > 0:
#         print(f"Feature {vec.get_feature_names_out()[feature_index]}: {importance_value}")

Let's see which features have the largest differences between real and fake news. To evaluate this, we need to:
1. Separately compute the mean value for each feature by label
2. Compute the difference between means
3. Evaluate high and low differences

In [None]:
# compute mean value of each feature, separately for fake and real news
fake_means = np.asarray(dtm_normed[news['FAKE']==1,:].mean(axis=0)).flatten()
real_means = np.asarray(dtm_normed[news['FAKE']==0,:].mean(axis=0)).flatten()
means = pd.DataFrame(np.stack([fake_means,real_means]).transpose(),columns=['Fake','Real'])

In [None]:
# Compute the differences and set index equal to feature names:
means['Difference'] = fake_means - real_means
means.index = vec.get_feature_names_out()

In [None]:
print(means.sort_values(by='Difference').head(25))
print(means.sort_values(by='Difference').tail(25))

Overall, this provides me some comfort the model is actually using the data. The big thing that jumps out at me is that the "real" news tends to use very objective or "checkable" words ("said", days of the week, official titles, etc.). The fake news seems more casual, referring to social media, etc.

### Evaluating Model Performance & Applying to New Data
The last two things we will do are to look at how the model performs by class, similar to how we evaluated other classifiers, and to look at predictions for new data.

#### Evaluating Model Performance
We've discussed that accuracy is not always the best means of evaluating a model. Let's take a deeper dive using some techniques we applied in earlier modules. Before doing so, though, we first need to generate predicted values. Let's look at the output of the `predict()` method:

In [None]:
predicted = classifier.predict(dtm_normed.todense())
predicted

A few things to observe here:
- The output array is a function of the dimensions of the output layer. Our output layer had 1 node, so this is a 1 column array. If we had a three-state classification problem (e.g., negative, neutral, positive), this array would have 3 columns, one per output.
- If we had 3 or more classes, we would likely use `argmax()` to determine which class had the highest probability (assuming a `softmax` classification problem)
- In our case, we'll use probabilities above 50% for 1s (Fake), and otherwise 0 (Real). This conversion is relatively straightforward:

In [None]:
news['pred_fake'] = (predicted>0.50).flatten().astype(int)

Now we can use any of our sklearn classification metrics to evaluate. Let's look at the classification report:

In [None]:
from sklearn.metrics import classification_report
print(classification_report(news['FAKE'],news['pred_fake']))

What about the 1%? If we want to look at any of those, we can simply examine rows where predicted doesn't equal actual:

In [None]:
misclass = news.loc[news.eval("FAKE != pred_fake")]
misclass

Last, let's see how this classifier works for news outside the source used to train. Specifically, I provide a file, "gpt_news.csv", which includes 5 "fake news" articles and 5 "real news" articles generated by ChatGPT. 

Let's first load the data:

In [None]:
gpt = pd.read_csv("/storage/ice-shared/mgt8833/classdata/gpt_news.csv")
gpt

You're welcome to look at any of these articles in more detail, but the fake news articles include topics like aliens, time travel, and vampires. The real articles discuss covid, a Mars rover, and climate change. So, topically they are very different. 

Let's see how the classifier performs:

In [None]:
# Fake articles
classifier.predict(normalize(vec.transform(gpt['gpt_fake'])).todense())

Four of 5 on the fake side. Now the real:

In [None]:
# Real articles
classifier.predict(normalize(vec.transform(gpt['gpt_real'])).todense())

Much worse! Only two of 5 correctly identified.

This illustrates how a model trained on one specific corpus can perform well in that setting but more poorly in others.

### Wrap Up 
This demo gave you a (brief) introduction to Keras and TensorFlow. ANNs are extremely powerful tools for classification and prediction in general, and particularly in processing unstructured data (NLP, image classifiers, etc.). We used Keras for an NLP classification task, but you could also use Keras for:
- regression problems
- multilabel classification (e.g., topic identification)
- multiclass (3 or more classes) classification (e.g., cluster prediction)

Note that each of these requires some tweaks, such as different output layer activation and loss functions.
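As one example of such a tweak, a multiclass model would typically end in a `softmax` output layer with one node per class (paired with a categorical cross-entropy loss), and labels would be assigned with `argmax()` rather than a 50% cutoff. The NumPy sketch below illustrates the idea; the class names and raw output values are made up.

```python
import numpy as np

def softmax(z):
    """Convert each row of raw outputs into probabilities that sum to 1."""
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class_names = ["negative", "neutral", "positive"]  # hypothetical three-class problem

# Hypothetical raw outputs for two observations (one column per class)
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])

probs = softmax(logits)
predicted = [class_names[i] for i in probs.argmax(axis=1)]

print(probs.round(3))
print(predicted)  # ['negative', 'positive']
```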

One thing we did not discuss was tuning hyperparameters in an ANN. Keras includes some functionality that makes it possible to tune Keras models in a manner very similar to how we tuned scikit-learn classifiers. See __[here](https://medium.com/swlh/hyper-parameter-tuning-for-keras-models-with-scikit-learn-library-dba47cf41551)__ for an example of tuning an image classifier.

Keras can also be used to build more advanced ANNs, such as those including LSTM layers, recurrent ANNs, and convolutional ANNs. There are many resources available online that illustrate those topics.