# Assignment 1: Facebook Post Classification

In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
#!pip install imblearn

## Background

As one of the largest social media websites in the world, Facebook is an attractive platform for businesses to reach their consumers. Almost all consumer-facing businesses have a virtual presence on Facebook, in the form of Facebook business pages (e.g., see [here](https://www.facebook.com/target/) for Target's Facebook business page). Every day, Facebook users who visit these business pages generate a large number of posts. These user posts may represent customer complaints, questions, or appreciation directed towards the focal businesses.

For businesses, these user posts contain valuable information about customers' needs and preferences. Understanding what the posts are about is an important opportunity to get to know your customers in real time.

## Dataset and Task

For this assignment, you will use a **labeled dataset** named "FB_posts_labeled.txt". It is a **tab-delimited** file with the following fields:
- postId: this is a unique identifier for each user post. There are 7961 posts in total;
- message: this is the text of each post;
- Appreciation: this is a binary (0/1) indicator of whether a post is an appreciation;
- Complaint: this is a binary (0/1) indicator of whether a post is a customer complaint;
- Feedback: this is a binary (0/1) indicator of whether a post is customer feedback (e.g., questions and suggestions).

Appreciation, Complaint, and Feedback are the three mutually exclusive content categories / classes in this dataset. They were labeled by humans, and the labeling isn't perfect (i.e., there may be ambiguous cases where the labels are not appropriate). However, for the sake of this assignment, let's treat them as the ground truth. **Your task is to build a text classifier to predict the content category of a post based on its textual content.** 
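The mutual-exclusivity property described above can be checked while collapsing the three indicator columns into a single class index. The following is a minimal sketch using only the standard library; the two sample rows and the helper name `to_class_index` are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical two-row sample in the same tab-delimited layout as FB_posts_labeled.txt.
SAMPLE = (
    "postId\tmessage\tAppreciation\tComplaint\tFeedback\n"
    "p1\tlove this store\t1\t0\t0\n"
    "p2\twhere is my order\t0\t0\t1\n"
)

CLASSES = ["Appreciation", "Complaint", "Feedback"]

def to_class_index(row):
    """Return 0, 1, or 2 for a row, or raise if the labels are not one-hot."""
    flags = [int(row[name]) for name in CLASSES]
    if sorted(flags) != [0, 0, 1]:
        raise ValueError(f"post {row['postId']} does not have exactly one label: {flags}")
    return flags.index(1)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
targets = [to_class_index(row) for row in rows]
print(targets)  # [0, 2]
```

A row with zero or several indicators set to 1 raises a `ValueError`, which makes labeling problems visible before training.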

To evaluate the out-of-sample performance of your model, you will use it to make predictions for 2039 posts in an **unlabeled dataset** named "FB_posts_unlabeled.txt". It is also a tab-delimited file, but only has postId and message fields. I keep the ground truth labels for these posts in a private place, in order to objectively evaluate your model's performance. The performance metric I will use is **averaged F-measure** across the three categories.
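To estimate this metric locally, the sketch below computes per-class F-measure and averages it. It assumes "averaged F-measure" means the unweighted mean of the three per-class F1 scores (often called macro-F1); the grader's exact definition is not stated here. The function names and the toy labels are hypothetical.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, treating that class as positive and the rest as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, classes=("Appreciation", "Complaint", "Feedback")):
    """Unweighted mean of the per-class F1 scores (assumed definition)."""
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)

# Toy example with made-up labels.
y_true = ["Appreciation", "Appreciation", "Complaint", "Complaint", "Feedback", "Feedback"]
y_pred = ["Appreciation", "Complaint", "Complaint", "Complaint", "Feedback", "Appreciation"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.6556
```

In the toy example the per-class scores are 0.5, 0.8, and about 0.667, so a model that does well on the largest class alone does not get a high average.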

In [10]:
# text = []
# label = []
# for line in open("FB_posts_labeled.txt"):
#     line = line.rstrip('\n').split('\t')
#     text.append(line[0])
#     label.append(int(line[1]))
# text = np.array(text)
# label = np.array(label)

In [3]:
df = pd.read_table('FB_posts_labeled.txt', delimiter = '\t')
df

Unnamed: 0,postId,message,Appreciation,Complaint,Feedback
0,126016648090_10150802142013091,Great ! ;),1,0,0
1,108381603303_10151136215833304,YUM! YUM!,1,0,0
2,108381603303_3913438087739,Yummm :)),1,0,0
3,110455108974424_343049739048292,sweet,1,0,0
4,110455108974424_350358541650745,nice,1,0,0
...,...,...,...,...,...
7956,179590995428478_390650150989227,Oregon locations,0,1,0
7957,6806028948_10151298595908949,Just had two very long and very expensive flig...,0,1,0
7958,77978885595_10152286635360596,pet smart #1756 Flowery branch ga really gave...,0,1,0
7959,125472670805257_397218893630632,having terrible trouble getting help from delt...,0,1,0


In [4]:
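def map_target(x):
    # collapse the three one-hot indicator columns into a single class id:
    # '0' = Appreciation, '1' = Complaint, '2' = Feedback
    if x['Appreciation'] == 1:
        return '0'
    elif x['Complaint'] == 1:
        return '1'
    else:
        return '2'

df['target'] = df.apply(map_target, axis = 1)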

## Oversampling Approach 

In [5]:
df['target'].value_counts()

1    4255
0    2062
2    1644
Name: target, dtype: int64

In [6]:
from imblearn.over_sampling import RandomOverSampler
resample_df = df[['message', 'target']]
X = resample_df['message']
y = resample_df['target']
ros = RandomOverSampler(random_state=42,sampling_strategy={'0':2500, '1': 4255, '2': 2000})
X_res, y_res = ros.fit_resample(X.to_numpy().reshape(-1, 1), y)
text = X_res.flatten()

In [7]:
label = tf.keras.utils.to_categorical(y_res, num_classes=3)
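For intuition, random oversampling with a per-class target count amounts to drawing extra copies of existing posts, with replacement, until each class reaches its target. The sketch below is a plain-Python illustration of that idea, not imblearn's implementation; `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(texts, labels, target_counts, seed=42):
    """Append randomly chosen duplicates until each class reaches its target count."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    out_texts, out_labels = list(texts), list(labels)
    for label, target in target_counts.items():
        extra = target - len(by_class[label])
        if extra < 0:
            raise ValueError(f"target for class {label!r} is below its current size")
        for _ in range(extra):
            out_texts.append(rng.choice(by_class[label]))  # sample with replacement
            out_labels.append(label)
    return out_texts, out_labels

# Toy data: class '0' is the minority and is raised from 1 to 3 samples.
toy_texts = ["thanks!", "late again", "lost my bag", "rude staff"]
toy_labels = ["0", "1", "1", "1"]
res_texts, res_labels = random_oversample(toy_texts, toy_labels, {"0": 3, "1": 3})
print(Counter(res_labels))
```

In this sketch the extra rows are exact duplicates appended after the originals, so a validation set carved from the tail of the resampled data can contain copies of training posts. Splitting off validation data before oversampling avoids that.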

## Normal Approach

In [None]:
text = df['message'].to_numpy()

In [13]:
label = tf.keras.utils.to_categorical(df['target'], num_classes=3)

In [8]:
text.shape, label.shape

((8755,), (8755, 3))

## Vectorization

In [9]:
vectorize_layer = keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens = None,
    standardize = 'lower_and_strip_punctuation',
    split = 'whitespace',
    ngrams = None,
    output_mode = 'int',
    output_sequence_length = None
)

In [10]:
# apply it to the text data with "adapt"
vectorize_layer.adapt(text)
# check preprocessing results, such as the vocabulary size
#vectorize_layer.get_vocabulary()
len(vectorize_layer.get_vocabulary())

19465

In [11]:
# now use it to process some text
input_text = [['very good movie'], ['Niharika']]
vectorize_layer(input_text)

<tf.Tensor: shape=(2, 3), dtype=int64, numpy=
array([[  95,  113, 2805],
       [   1,    0,    0]], dtype=int64)>
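The output above shows two conventions of the integer output mode: index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, which is why the unseen word maps to 1 and the shorter row is padded with zeros. The sketch below imitates that behavior in plain Python. It is an approximation for illustration only (punctuation handling and tie-breaking among equally frequent words may differ from Keras), and the helper names and toy corpus are hypothetical.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation (approximates 'lower_and_strip_punctuation')."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(corpus):
    """Index 0 = padding, index 1 = out-of-vocabulary, then tokens by descending frequency."""
    counts = Counter(tok for doc in corpus for tok in standardize(doc).split())
    return ["", "[UNK]"] + [tok for tok, _ in counts.most_common()]

def vectorize(batch, vocab):
    """Map each text to token ids and right-pad with 0 to the longest text in the batch."""
    index = {tok: i for i, tok in enumerate(vocab)}
    seqs = [[index.get(tok, 1) for tok in standardize(doc).split()] for doc in batch]
    width = max(len(seq) for seq in seqs)
    return [seq + [0] * (width - len(seq)) for seq in seqs]

toy_corpus = ["Very good movie!", "Good service, good price", "good"]
toy_vocab = build_vocab(toy_corpus)
print(vectorize(["very good movie", "Niharika"], toy_vocab))  # [[3, 2, 4], [1, 0, 0]]
```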

## Model 1 - Simple RNN

In [149]:
model_rnn = keras.Sequential()

model_rnn.add(vectorize_layer)

model_rnn.add(keras.layers.Embedding(
    input_dim = len(vectorize_layer.get_vocabulary()),
    output_dim = 64,
    mask_zero = True
))

model_rnn.add(keras.layers.SimpleRNN(128)) # 128 recurrent units; only the final output is returned

model_rnn.add(keras.layers.Dense(3, activation = 'softmax'))

In [150]:
model_rnn.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_5 (TextV  (None, None)             0         
 ectorization)                                                   
                                                                 
 embedding_5 (Embedding)     (None, None, 64)          1245760   
                                                                 
 simple_rnn_5 (SimpleRNN)    (None, 128)               24704     
                                                                 
 dense_5 (Dense)             (None, 3)                 387       
                                                                 
Total params: 1,270,851
Trainable params: 1,270,851
Non-trainable params: 0
_________________________________________________________________
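The parameter counts in this summary can be reproduced by hand, which is a quick check that the layers are wired as intended. The sketch below redoes the arithmetic using the vocabulary size reported earlier; the variable names are chosen here for illustration.

```python
vocab_size = 19465   # len(vectorize_layer.get_vocabulary()) reported above
embed_dim = 64       # Embedding output_dim
units = 128          # SimpleRNN units
num_classes = 3

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim
# SimpleRNN: input kernel (embed_dim x units) + recurrent kernel (units x units) + bias (units).
rnn_params = units * (embed_dim + units + 1)
# Dense: weight matrix (units x num_classes) + bias (num_classes).
dense_params = units * num_classes + num_classes

print(embedding_params, rnn_params, dense_params)     # 1245760 24704 387
print(embedding_params + rnn_params + dense_params)   # 1270851
```

Almost all of the parameters sit in the embedding table, so the vocabulary size has far more influence on model size than the recurrent layer does.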


In [151]:
# configure training / optimization
model_rnn.compile(loss = keras.losses.CategoricalCrossentropy(),
                  optimizer='adam',
                  metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

In [152]:
# training with 20% validation and 10 epochs.
model_rnn.fit(x = text, y = label, validation_split = 0.2,
              epochs=10, batch_size = 32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x26602656760>
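Keras takes the `validation_split` fraction from the end of the arrays before shuffling, so the validation set reflects whatever order `text` and `label` happen to be in. One possible alternative, not prescribed by the assignment, is to build a shuffled, class-stratified hold-out set first and pass it to `fit` through `validation_data`. The sketch below shows such a split in plain Python; `stratified_split` and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, val_fraction=0.2, seed=42):
    """Hold out val_fraction of each class, chosen at random, for validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)

    def pick(indices, seq):
        return [seq[i] for i in indices]

    train = (pick(train_idx, texts), pick(train_idx, labels))
    val = (pick(val_idx, texts), pick(val_idx, labels))
    return train, val

# Toy data: 10 posts of class '0' followed by 5 posts of class '1'.
toy_texts = [f"post {i}" for i in range(15)]
toy_labels = ["0"] * 10 + ["1"] * 5
(train_x, train_y), (val_x, val_y) = stratified_split(toy_texts, toy_labels)
print(len(train_x), len(val_x))  # 12 3
```

If oversampling is used, applying it only to the training part of such a split keeps duplicated posts out of the validation set.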

In [153]:
# try to make some predictions
model_rnn.predict(['timepass'])



array([[0.30253524, 0.41786057, 0.27960423]], dtype=float32)
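Each row of the prediction is a probability distribution over the three classes, in the order set by `map_target` (0 = Appreciation, 1 = Complaint, 2 = Feedback). A minimal sketch of turning one such row into a class name follows; `to_label` is a hypothetical helper, and the probabilities are the ones printed above.

```python
CLASS_NAMES = ["Appreciation", "Complaint", "Feedback"]  # order follows map_target: 0, 1, 2

def to_label(probabilities):
    """Return the name of the class with the highest predicted probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return CLASS_NAMES[best]

print(to_label([0.30253524, 0.41786057, 0.27960423]))  # Complaint
```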

## Model 2 - LSTM

In [158]:
model_lstm = keras.Sequential()

model_lstm.add(vectorize_layer)

model_lstm.add(keras.layers.Embedding(
    input_dim = len(vectorize_layer.get_vocabulary()),
    output_dim = 64,
    mask_zero = True
))

model_lstm.add(keras.layers.LSTM(128)) # 128 LSTM units; only the final output is returned

model_lstm.add(keras.layers.Dense(3, activation = 'softmax'))

In [159]:
model_lstm.summary()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_5 (TextV  (None, None)             0         
 ectorization)                                                   
                                                                 
 embedding_6 (Embedding)     (None, None, 64)          1245760   
                                                                 
 lstm (LSTM)                 (None, 128)               98816     
                                                                 
 dense_6 (Dense)             (None, 3)                 387       
                                                                 
Total params: 1,344,963
Trainable params: 1,344,963
Non-trainable params: 0
_________________________________________________________________
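The LSTM layer has exactly four times the parameters of a simple recurrent layer of the same width, because it learns one set of input weights, recurrent weights, and biases for each of its four internal transformations (input gate, forget gate, cell candidate, output gate). A short arithmetic check, with variable names chosen here for illustration:

```python
embed_dim = 64   # Embedding output_dim feeding the LSTM
units = 128      # LSTM units

# One block of weights: input kernel + recurrent kernel + bias.
per_gate = units * (embed_dim + units + 1)
lstm_params = 4 * per_gate

print(per_gate, lstm_params)                # 24704 98816
print(19465 * embed_dim + lstm_params + 387)  # 1344963, the total in the summary
```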


In [160]:
# configure training / optimization
model_lstm.compile(loss = keras.losses.CategoricalCrossentropy(),
                   optimizer='adam',
                   metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

In [161]:
# training with 20% validation and 10 epochs.
model_lstm.fit(x = text, y = label, validation_split = 0.2,
              epochs=10, batch_size = 32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2660309b970>

In [162]:
# try to make some predictions
model_lstm.predict(['timepass'])



array([[0.2656086 , 0.39089018, 0.34350124]], dtype=float32)

## Model 3 - GRU

In [19]:
model_gru = keras.Sequential()

model_gru.add(vectorize_layer)

model_gru.add(keras.layers.Embedding(
    input_dim = len(vectorize_layer.get_vocabulary()),
    output_dim = 64,
    mask_zero = True
))

model_gru.add(keras.layers.GRU(128)) # 128 GRU units; only the final output is returned

model_gru.add(keras.layers.Dense(3, activation = 'softmax'))

In [20]:
model_gru.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, None, 64)          1245760   
                                                                 
 gru (GRU)                   (None, 128)               74496     
                                                                 
 dense (Dense)               (None, 3)                 387       
                                                                 
Total params: 1,320,643
Trainable params: 1,320,643
Non-trainable params: 0
_________________________________________________________________
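The GRU count of 74,496 is consistent with three weight blocks (update gate, reset gate, candidate state), each with two bias vectors. The two biases come from the `reset_after=True` default of the Keras GRU layer in TensorFlow 2. A short arithmetic check, with variable names chosen here for illustration:

```python
embed_dim = 64   # Embedding output_dim feeding the GRU
units = 128      # GRU units

# One block: input kernel + recurrent kernel + two bias vectors (reset_after=True).
per_block = units * (embed_dim + units) + 2 * units
gru_params = 3 * per_block

print(per_block, gru_params)                 # 24832 74496
print(19465 * embed_dim + gru_params + 387)  # 1320643, the total in the summary
```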


In [21]:
# configure training / optimization
model_gru.compile(loss = keras.losses.CategoricalCrossentropy(),
                  optimizer='adam',
                  metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

In [22]:
# training with 20% validation and 10 epochs.
model_gru.fit(x = text, y = label, validation_split = 0.2,
              epochs=10, batch_size = 32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1eaba0e8b50>

In [23]:
# try to make some predictions
model_gru.predict(['timepass'])



array([[0.4258635 , 0.23383796, 0.3402986 ]], dtype=float32)

## Model 4 - Bidirectional GRU

In [37]:
model_bigru = keras.Sequential()

model_bigru.add(vectorize_layer)

model_bigru.add(keras.layers.Embedding(
    input_dim = len(vectorize_layer.get_vocabulary()),
    output_dim = 128,
    mask_zero = True
))

model_bigru.add(keras.layers.Bidirectional(keras.layers.GRU(128, activation='relu')))

#model_bigru.add(keras.layers.Dropout(0.3))

#model_bigru.add(keras.layers.Bidirectional(keras.layers.GRU(128, activation='relu')))

model_bigru.add(keras.layers.Dense(3, activation = 'softmax'))

In [38]:
model_bigru.summary()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     
                                                                 
 embedding_8 (Embedding)     (None, None, 128)         2491520   
                                                                 
 bidirectional_9 (Bidirectio  (None, 256)              198144    
 nal)                                                            
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense_6 (Dense)             (None, 3)                 771       
                                                                 
Total params: 2,690,435
Trainable params: 2,690,435
Non-trainable params: 0
_________________________________________________________________
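The bidirectional wrapper runs one GRU forward and one backward over the sequence and concatenates their outputs, which is why the output width is 256 and the recurrent parameter count is twice that of a single GRU. The arithmetic below reproduces the counts in the summary, assuming the Keras GRU default `reset_after=True`; the variable names are chosen here for illustration.

```python
vocab_size = 19465
embed_dim = 128   # this model uses a wider embedding than the earlier ones
units = 128
num_classes = 3

embedding_params = vocab_size * embed_dim
# One direction: three blocks of input kernel + recurrent kernel + two bias vectors.
one_direction = 3 * (units * (embed_dim + units) + 2 * units)
bigru_params = 2 * one_direction
# The dense layer sees the concatenated forward and backward outputs (2 * units).
dense_params = 2 * units * num_classes + num_classes

print(embedding_params, bigru_params, dense_params)     # 2491520 198144 771
print(embedding_params + bigru_params + dense_params)   # 2690435
```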

In [39]:
# configure training / optimization
model_bigru.compile(loss = keras.losses.CategoricalCrossentropy(),
                    optimizer='adam',
                    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

In [40]:
# training with 20% validation and 10 epochs.
model_bigru.fit(x = text, y = label, validation_split = 0.2,
              epochs=10, batch_size = 32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x212b1de8fa0>

In [41]:
# try to make some predictions
model_bigru.predict(['timepass'])



array([[0.24583004, 0.31514028, 0.4390297 ]], dtype=float32)

## Submission

In [42]:
submission = pd.read_table('FB_posts_unlabeled.txt', delimiter = '\t')
submission

Unnamed: 0,postId,message
0,108381603303_10151119973393304,Love. It. To
1,115568331790246_371841206162956,NICE
2,115568331790246_515044031842672,Congrats
3,147285781446_10151010892176447,Awesome!
4,159616034235_10150639103634236,Award
...,...,...
2034,179590995428478_422375854483323,"you guys are terrible, holding goverment check..."
2035,125472670805257_525103854175468,as i platinum elite member of delta and a loya...
2036,179590995428478_377568608964048,Really?
2037,179590995428478_341070505947192,Horrible decision.


In [43]:
target = model_bigru.predict(list(submission['message']))
target_df = pd.DataFrame(target, columns = ['a','c','f'])
target_df['label'] = target_df.idxmax(axis=1)



In [44]:
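# build explicit 0/1 integer indicators from the predicted label; comparing against
# each class name still works when a class is never predicted
submission['Appreciation_pred'] = (target_df['label'] == 'a').astype(int)
submission['Complaint_pred'] = (target_df['label'] == 'c').astype(int)
submission['Feedback_pred'] = (target_df['label'] == 'f').astype(int)
submission.drop('message', axis = 1, inplace = True)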

In [45]:
submission.to_csv('predictions_bi_gru_3.csv', index = False)

## Submit your Predictions

Throughout this assignment, you are encouraged to build different models and submit their predictions as many times as you'd like. To submit a set of predictions, you MUST adhere to the following format (a sample submission file that adheres to all the following requirements is provided on Canvas):

1. The submission must be a csv file, with exactly four columns and 2040 rows;
2. The first row must be the headers, specifically, "postId,Appreciation_pred,Complaint_pred,Feedback_pred". Spellings are case-sensitive;
3. The first column must contain postId. The order of the posts doesn't matter - I will do a join between your predictions and the ground truth table based on postId;
4. The remaining three columns contain your model's predictions for each post. Note that you must generate **binary predictions** for each category. In other words, the numbers in each of those three columns must be either 0 or 1. Also, a post can only belong to one category, so only 1 category can have value 1 and all the others must have value 0.
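The four requirements above can be checked locally before uploading. The sketch below is one possible checker, not the grading system's own validation; `validate_submission` is a hypothetical helper and the sample content is made up.

```python
import csv
import io

EXPECTED_HEADER = ["postId", "Appreciation_pred", "Complaint_pred", "Feedback_pred"]

def validate_submission(lines, expected_rows=2040):
    """Return a list of problems found; an empty list means the file looks valid."""
    rows = list(csv.reader(lines))
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append("header row is missing or misspelled")
    for row in rows[1:]:
        if len(row) != 4:
            problems.append(f"row does not have four columns: {row}")
        elif any(value not in ("0", "1") for value in row[1:]):
            problems.append(f"non-binary prediction for {row[0]}")
        elif row[1:].count("1") != 1:
            problems.append(f"not exactly one category for {row[0]}")
    return problems

# Made-up sample: a header plus two posts, the second with a probability left in.
sample = io.StringIO(
    "postId,Appreciation_pred,Complaint_pred,Feedback_pred\n"
    "p1,1,0,0\n"
    "p2,0.7,0,0\n"
)
print(validate_submission(sample, expected_rows=3))
```

For a real file, pass an open file handle instead of the in-memory sample, for example `validate_submission(open('predictions.csv', newline=''))`, and leave `expected_rows` at its default.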

Because I use an automated system to evaluate prediction performance, if your prediction file does not follow the above format, it won't be recognized. I suggest adapting the following pseudocode to generate the prediction file:

In [None]:
# Don't run before adaptation, this is pseudocode!
f = open('predictions.csv', 'w')
# write the first header row
f.write('postId,Appreciation_pred,Complaint_pred,Feedback_pred' + '\n')

for post in unlabeled_set:
    Appreciation_pred, Complaint_pred, Feedback_pred = YOUR_MODEL_PREDICTIONS
    if Appreciation_pred not in [0,1] or Complaint_pred not in [0,1] or Feedback_pred not in [0,1]:
        SOMETHING_IS_WRONG (did you forget to turn probability predictions into binary predictions?)
    if Appreciation_pred + Complaint_pred + Feedback_pred != 1:
        SOMETHING_IS_WRONG
    f.write(postId + ',' + str(Appreciation_pred) + ',' + str(Complaint_pred) + ',' + str(Feedback_pred) + '\n')
f.close()
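A runnable adaptation of the pseudocode above might look like the following sketch. It assumes the model's output has already been reduced to one class index per post (0 = Appreciation, 1 = Complaint, 2 = Feedback); `write_predictions`, the post ids, and the temporary file location are hypothetical.

```python
import csv
import os
import tempfile

HEADER = ["postId", "Appreciation_pred", "Complaint_pred", "Feedback_pred"]

def write_predictions(path, post_ids, class_indices):
    """Write one one-hot row per post; class_indices holds 0, 1, or 2 per post."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        for post_id, cls in zip(post_ids, class_indices):
            if cls not in (0, 1, 2):
                raise ValueError(f"unexpected class {cls!r} for post {post_id}")
            one_hot = [1 if cls == k else 0 for k in range(3)]
            writer.writerow([post_id] + one_hot)

# Demonstration with made-up post ids, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "predictions.csv")
    write_predictions(demo_path, ["p1", "p2", "p3"], [0, 2, 1])
    with open(demo_path, newline="") as f:
        written_rows = list(csv.reader(f))

print(written_rows)
```

Building each row from a single class index guarantees that exactly one of the three columns is 1, so the two sanity checks in the pseudocode hold by construction.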

**To use the submission system**:
1. Visit [http://18.189.32.82:3838/FBapp](http://18.189.32.82:3838/FBapp) to access the prediction submission system;
2. Enter and select your x500 ID (because I need to keep track of who submitted what). You should see a text display "welcome!" after you enter your ID;
3. Upload the prediction file with the correct format as discussed above. After the file is uploaded, the performance metrics will be shown automatically, including the precision/recall/F-measure of each class and the average F-measure. The entire confusion matrix is not provided to prevent gaming behavior.

If the submission system is not working at any point during this assignment, please contact me via email.

## Grading

Your grade (out of 25 points) for this assignment is determined as follows:
1. I rank everyone based on their highest performance. Say your rank is $A$;
2. I rank everyone based on their second-highest performance. Say your rank is $B$;
3. I rank everyone based on their third-highest performance. Say your rank is $C$;
4. I compute a score ("weighted average ranking") $S = \frac{1}{2}A + \frac{1}{3}B + \frac{1}{6}C$.
5. The person(s) with the lowest $S$ gets 25 points, the person(s) with the second-lowest $S$ gets 24.5 points, and so on.
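As a worked example of the scheme above, the sketch below computes $S$ for three hypothetical students from a larger class and converts the ordering of $S$ into points. It assumes the 0.5-point step continues for every further distinct value of $S$ and that tied students receive the same points; the names and ranks are made up.

```python
from fractions import Fraction

def weighted_rank(a, b, c):
    """S = A/2 + B/3 + C/6, kept as an exact fraction."""
    return Fraction(1, 2) * a + Fraction(1, 3) * b + Fraction(1, 6) * c

def assign_points(scores, top=25.0, step=0.5):
    """Lowest S gets `top`; each further distinct S value gets `step` fewer points (assumed)."""
    distinct = sorted(set(scores.values()))
    return {name: top - step * distinct.index(s) for name, s in scores.items()}

scores = {
    "student_x": weighted_rank(1, 2, 1),   # 1/2 + 2/3 + 1/6 = 4/3
    "student_y": weighted_rank(3, 5, 4),   # 3/2 + 5/3 + 4/6 = 23/6
    "student_z": weighted_rank(2, 1, 2),   # 2/2 + 1/3 + 2/6 = 5/3
}
print(assign_points(scores))
# {'student_x': 25.0, 'student_y': 24.0, 'student_z': 24.5}
```

Because $A$ carries half of the weight, the best submission matters most, but weak second and third submissions still raise $S$.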

The design of this grading scheme **encourages consistent effort that leads to steady performance improvement**, and reduces the relative importance of having one lucky high performance.