# Use Google's BERT to train a classifier and make predictions

Read the given training data with pandas, drop the mismatched rows (those with missing values),
then convert it into [BERT](https://github.com/google-research/bert)-style data.

BERT requires the input data to match a very specific structure, with columns separated by tabs.
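
As a rough illustration of that structure, the sketch below writes a couple of made-up rows in the two layouts used in this notebook: four columns (id, label, dummy letter, text) without a header for the train and dev files, and two columns (id, text) with a header for the test file. The helper `to_tsv` and the example sentences are hypothetical and only exist to show the shape of the files.

```python
import csv
import io


def to_tsv(rows, header=None):
    """Serialize rows as tab-separated text, optionally with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# train.tsv / dev.tsv layout: id, label, dummy 'a', text (no header row).
train_tsv = to_tsv([
    (0, 1, "a", "first example description"),
    (1, 0, "a", "second example description"),
])

# test.tsv layout: id and text only, with a header row.
test_tsv = to_tsv([(0, "unlabelled description")], header=("id", "text"))

print(train_tsv)
print(test_tsv)
```

The cell that follows produces the same layouts from the real data with `DataFrame.to_csv(sep='\t')`.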

In [6]:
# Import pandas and the scikit-learn helpers
import pandas as pd
from sklearn.model_selection import train_test_split  # train/dev split
from sklearn.preprocessing import LabelEncoder  # encode the target labels as 0/1

# Training dataset for BERT
le = LabelEncoder()
# Read the training data, name the columns, and drop rows with missing values
df = pd.read_csv("input_data/training.csv", header=None,
                 names=["description", "relevancy"]).dropna()
# Convert the training data into BERT-style data with four columns:
# id: row id
# label: class label encoded as 0 or 1
# alpha: dummy 'a' on every row
# text: the text to classify, with newlines replaced by spaces
df_bert = pd.DataFrame({'id': df.index,
                        'label': le.fit_transform(df['relevancy']),
                        'alpha': ['a'] * df.shape[0],
                        'text': df['description'].replace(r'\n', ' ', regex=True)})

# Hold out 10% of the training data as the dev set
df_bert_train, df_bert_dev = train_test_split(df_bert, test_size=.1)

# Test dataset for BERT
df_test = pd.read_csv("input_data/validation_data.csv")
df_bert_test = pd.DataFrame({'id': df_test.index,
                             'text': df_test['description'].replace(r'\n', ' ', regex=True)})

# Write the dataframes to disk (only test.tsv keeps a header row)
df_bert_train.to_csv('input_data/train.tsv', sep='\t', index=False, header=False)
df_bert_dev.to_csv('input_data/dev.tsv', sep='\t', index=False, header=False)
df_bert_test.to_csv('input_data/test.tsv', sep='\t', index=False, header=True)
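
It helps to know which integer each label received, because the predictions come back as column positions rather than names. `LabelEncoder` assigns codes in the sorted order of the class names, so the plain-Python sketch below mimics that rule. It assumes the two labels are "Great" and "Low", as used later in this notebook; the helper `encode_labels` is hypothetical and is not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, assigned in sorted order."""
    classes = sorted(set(labels))
    code_of = {name: code for code, name in enumerate(classes)}
    return [code_of[name] for name in labels], classes


# Made-up labels, standing in for the 'relevancy' column.
codes, classes = encode_labels(["Low", "Great", "Great", "Low"])
print(classes)  # ['Great', 'Low']
print(codes)    # [1, 0, 0, 1]
```

With the real encoder, the same mapping can be read from `le.classes_` after `fit_transform` has run.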



Next, train the BERT classifier from bash, setting the flags as needed.


In [22]:
%%bash
python3 bert/run_classifier.py \
  --task_name=cola \
  --do_train=true \
  --do_eval=true \
  --data_dir=input_data/ \
  --vocab_file=bert/uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=bert/uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=bert/uncased_L-12_H-768_A-12/bert_model.ckpt \
  --max_seq_length=50 \
  --train_batch_size=8 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=output_data/

W0722 08:35:40.131162 139810268645184 deprecation_wrapper.py:119] From /mnt/c/Users/PUpreti/Documents/Python Scripts/Briq/bert/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0722 08:35:40.133342 139810268645184 deprecation_wrapper.py:119] From bert/run_classifier.py:981: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

W0722 08:35:40.133876 139810268645184 deprecation_wrapper.py:119] From bert/run_classifier.py:784: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W0722 08:35:40.133986 139810268645184 deprecation_wrapper.py:119] From bert/run_classifier.py:784: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W0722 08:35:40.134450 139810268645184 deprecation_wrapper.py:119] From /mnt/c/Users/PUpreti/Documents/Python Scripts/Briq/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use t

Towards the end of the output above, we can see the evaluation results of the training. Once the classifier has been trained, run the same script to make predictions using the `--do_predict=true` flag. Most of the bash command is the same as above, with a few changes. This makes predictions with the trained model, giving the probability of each of our labels, and writes them to `output_data` as `test_results.tsv`.
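
To make the overlap between the two invocations easier to see, here is a minimal sketch that builds both argument lists from a shared set of flags. Every flag and value is copied from the commands in this notebook; the helper names (`base_flags`, `train_command`, `predict_command`) are hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex

MODEL_DIR = "bert/uncased_L-12_H-768_A-12"


def base_flags(data_dir, output_dir, max_seq_length):
    """Flags that appear in both the training and the prediction command."""
    return [
        "--task_name=cola",
        f"--data_dir={data_dir}",
        f"--vocab_file={MODEL_DIR}/vocab.txt",
        f"--bert_config_file={MODEL_DIR}/bert_config.json",
        f"--init_checkpoint={MODEL_DIR}/bert_model.ckpt",
        f"--max_seq_length={max_seq_length}",
        f"--output_dir={output_dir}",
    ]


def train_command():
    """Training run: shared flags plus the train/eval switches and hyperparameters."""
    return [
        "python3", "bert/run_classifier.py",
        "--do_train=true", "--do_eval=true",
        *base_flags("input_data/", "output_data/", 50),
        "--train_batch_size=8", "--learning_rate=2e-5", "--num_train_epochs=3.0",
    ]


def predict_command():
    """Prediction run: shared flags plus the predict switch only."""
    return [
        "python", "bert/run_classifier.py",
        "--do_predict=true",
        *base_flags("input_data", "output_data", 128),
    ]


print(shlex.join(train_command()))
print(shlex.join(predict_command()))
```

Note that the two runs in this notebook use different `--max_seq_length` values (50 for training, 128 for prediction), which the sketch reproduces as written.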

In [26]:
%%bash
python bert/run_classifier.py \
  --task_name=cola \
  --do_predict=true \
  --data_dir=input_data \
  --vocab_file=bert/uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=bert/uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=bert/uncased_L-12_H-768_A-12/bert_model.ckpt \
  --max_seq_length=128 \
  --output_dir=output_data

W0722 08:44:37.074271 140072122517312 deprecation_wrapper.py:119] From /mnt/c/Users/PUpreti/Documents/Python Scripts/Briq/bert/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0722 08:44:37.076633 140072122517312 deprecation_wrapper.py:119] From bert/run_classifier.py:981: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

W0722 08:44:37.077196 140072122517312 deprecation_wrapper.py:119] From bert/run_classifier.py:784: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W0722 08:44:37.077309 140072122517312 deprecation_wrapper.py:119] From bert/run_classifier.py:784: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W0722 08:44:37.077657 140072122517312 deprecation_wrapper.py:119] From /mnt/c/Users/PUpreti/Documents/Python Scripts/Briq/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use t

The output in `output_data/test_results.tsv` has two columns, one per binary label, each holding the predicted probability of that label. Next, we read that file in and use it to make the test data look like the training data, that is, we add the relevancy to the data.

In [27]:
df_results = pd.read_csv("output_data/test_results.tsv", sep="\t", header=None)
# The test data is still in memory; if not, read it again from input_data
data_results_csv = pd.DataFrame({'id': df_test.index,
                                 'description': df_test['description'],
                                 'prob_of_low': df_results[1],
                                 'relevancy': df_results.idxmax(axis=1)})
# Map the winning column index back to its label (0 -> "Great", 1 -> "Low")
data_results_csv['relevancy'] = data_results_csv['relevancy'].map({0: "Great", 1: "Low"})
data_results_csv.to_csv('output_data/final_prediction.csv', sep=",", index=False)

What we did here was create a dataframe from the test data, take the second column of the results as the probability of "Low", and create a binary label column by selecting, for each row, the column with the larger of the two probabilities. The result is saved in `output_data/final_prediction.csv`.
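
The label-selection step can be checked on a small, made-up table of probabilities. The sketch below assumes column 0 corresponds to "Great" and column 1 to "Low", as in the cell above; the three rows of numbers are invented for illustration and are not real model output.

```python
import pandas as pd

# Toy stand-in for test_results.tsv: one row per example, one probability per class.
df_results = pd.DataFrame([[0.90, 0.10],
                           [0.20, 0.80],
                           [0.55, 0.45]])

# idxmax(axis=1) returns, for each row, the column label with the largest value.
label_names = {0: "Great", 1: "Low"}
predicted = df_results.idxmax(axis=1).map(label_names)

print(predicted.tolist())  # ['Great', 'Low', 'Great']
```

Because `idxmax` returns column labels rather than positions, this only works as intended when the file is read with `header=None`, so that the columns are labelled 0 and 1.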