
## Import Mbox data

**Download all your Gmail messages from your Google account in mbox format**

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
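
The notebook does not show how the mbox export became the CSV that is loaded in the next section. A minimal sketch of one possible conversion, using only the standard-library `mailbox`, `email` and `csv` modules, is given here. The function names, the file paths, and the choice to keep only the bare sender address are assumptions made for illustration, not the author's actual script.

```python
import csv
import mailbox
from email.header import decode_header, make_header
from email.utils import parseaddr


def decode_field(value):
    """Decode an RFC 2047 encoded header value to plain text."""
    if value is None:
        return ""
    try:
        return str(make_header(decode_header(value)))
    except Exception:
        # Fall back to the raw value if the header is malformed
        return str(value)


def mbox_to_csv(mbox_path, csv_path):
    """Write one row per message (date, sender address, subject); return the row count."""
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "from", "subject"])
        for message in mailbox.mbox(mbox_path):
            # parseaddr splits "Name <user@host>" into (name, address)
            sender = parseaddr(decode_field(message["from"]))[1]
            writer.writerow([
                decode_field(message["date"]),
                sender,
                decode_field(message["subject"]),
            ])
            rows += 1
    return rows
```

A call such as `mbox_to_csv('mail.mbox', 'emails.csv')` (both paths hypothetical) would produce the three raw columns; the `Tags` column still has to be added by hand.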


## Explore the CSV file

In [0]:
import pandas as pd
!cp '/content/drive/My Drive/Colab Notebooks/EmailClassifierFinal.csv' .
df = pd.read_csv('EmailClassifierFinal.csv')
#df.head(10)

In [0]:
df['subs'] = df['from'].map(str) + ' - ' + df['subject'].map(str)

In [6]:
df.subs

0        info@airtelmail.info - Mobile Internet Usage I...
1        info@airtelmail.info - Mobile Internet Usage I...
2        info@airtelmail.info - Mobile Internet Usage I...
3        interaction@zorpia.com - We send you RAVIKEERT...
4        interaction@zorpia.com - b"\xe2\x96\xb6 Remind...
                               ...                        
24592    crmf@karvy.com - Scheme wise Monthly/Half Year...
24593    epromotions@custcomm.icicibank.com - Hello tra...
24594    notification+zj4oyt=foo96@facebookmail.com - T...
24595    update+zj4oyt=foo96@facebookmail.com - Murali ...
24596    intimations@cpc.gov.in - Intimation U/S 143(1)...
Name: subs, Length: 24597, dtype: object

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24597 entries, 0 to 24596
Data columns (total 5 columns):
date       24597 non-null object
from       24597 non-null object
subject    24597 non-null object
Tags       24597 non-null object
subs       24597 non-null object
dtypes: object(5)
memory usage: 960.9+ KB


In [0]:
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

**Download the dataset and tag it manually on your local machine**

## Data processing

In [9]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# create the tokenizer
t = Tokenizer()
# fit the tokenizer
t.fit_on_texts(df.subs.values)

# summarize what was learned
#print(t.word_counts)
#print(t.document_count)
#print(t.word_index)
#print(t.word_docs)

# integer encode documents
encoded_text = t.texts_to_sequences(df.subs.values)
encoded_text = pad_sequences(encoded_text)
print(encoded_text.shape)

(24597, 34)
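
To make the shape `(24597, 34)` less opaque, here is a simplified pure-Python sketch of what the two steps above do: words are ranked by frequency and replaced by their rank, then every sequence is left-padded with zeros to the length of the longest one. It only approximates the Keras defaults (the word splitting in particular is an assumption), and the helper names are made up for this illustration.

```python
import re
from collections import Counter


def split_words(text):
    """Lowercase the text and split it on anything that is not a letter, digit or apostrophe."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_word_index(texts):
    """Map each word to an integer rank, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # Stable sort by descending count, so ties keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index):
    """Replace known words by their index; words never seen during fitting are dropped."""
    return [[word_index[w] for w in split_words(t) if w in word_index] for t in texts]


def pad_left(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = seq[len(seq) - maxlen:] if len(seq) > maxlen else seq
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


subjects = [
    "info@airtelmail.info - Mobile Internet Usage",
    "amazon.in - Invoice",
]
word_index = fit_word_index(subjects)
encoded = pad_left(texts_to_sequences(subjects, word_index))
print(encoded)  # [[1, 2, 1, 3, 4, 5], [0, 0, 0, 6, 7, 8]]
```

Because unseen words are dropped, a new subject line made up mostly of words outside the fitted vocabulary reaches the model as a nearly empty, all-padding sequence.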


In [10]:
encoded_labels = pd.get_dummies(df['Tags']).values
print(encoded_labels.shape)

(24597, 3)
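
The three output columns follow the sorted order of the tag names, which matters later when an `argmax` index has to be turned back into a tag. The pure-Python sketch below mimics that ordering for illustration; `one_hot_labels` is a hypothetical helper, not part of pandas, and the four sample tags are made up.

```python
def one_hot_labels(tags):
    """Return the sorted class names and one one-hot row per tag."""
    classes = sorted(set(tags))
    index_of = {name: i for i, name in enumerate(classes)}
    rows = [
        [1 if index_of[tag] == i else 0 for i in range(len(classes))]
        for tag in tags
    ]
    return classes, rows


sample_tags = ["NotFinance", "Finance", "MaybeUseful", "NotFinance"]
classes, one_hot = one_hot_labels(sample_tags)
print(classes)  # ['Finance', 'MaybeUseful', 'NotFinance']
print(one_hot)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

With this ordering, index 0 is `Finance`, index 1 is `MaybeUseful` and index 2 is `NotFinance`.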


## Train/Validation Split

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(encoded_text, encoded_labels, test_size = 0.20, random_state = 101)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(19677, 34) (19677, 3)
(4920, 34) (4920, 3)


## Model Training

In [12]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SpatialDropout1D, Dropout, LSTM, Embedding

# define the LSTM model
model = Sequential()
model.add(Embedding(len(t.word_index)+1, 1024, input_length = X_train.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(512, dropout=0.2, return_sequences=True))
model.add(LSTM(256, dropout=0.1, recurrent_dropout=0.2))
model.add(Dense(Y_train.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 34, 1024)          13453312  
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 34, 1024)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 34, 512)           3147776   
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               787456    
_________________________________________________________________
dense (Dense)                (None, 3)                 771       
Total params: 17,389,315
Trainable params: 17,389,315
Non-trainable params: 0
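
The parameter counts in the summary can be reproduced by hand, which is a quick sanity check on the layer sizes. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the vocabulary size of 13138 is not printed anywhere in the notebook and is inferred here from the embedding row (13,453,312 / 1024).

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units


vocab_size = 13453312 // 1024  # inferred from the summary: len(t.word_index) + 1

counts = {
    "embedding": embedding_params(vocab_size, 1024),
    "lstm": lstm_params(1024, 512),
    "lstm_1": lstm_params(512, 256),
    "dense": dense_params(256, 3),
}
total = sum(counts.values())
print(counts)
print(total)  # 17389315
```

Roughly three quarters of the weights sit in the embedding table.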

In [13]:
from tensorflow.keras.callbacks import ModelCheckpoint
# define the checkpoint; the metric is logged as 'val_acc' on this TensorFlow version ('val_accuracy' on TF 2.x)
checkpoint = ModelCheckpoint("model.h5", monitor='val_acc', verbose=1, save_best_only=True, mode='max')

# Fit the Model to data
model.fit(X_train, Y_train, epochs=30, batch_size=128, validation_data=(X_test, Y_test), verbose=1, callbacks=[checkpoint])

Train on 19677 samples, validate on 4920 samples
Epoch 1/30
Epoch 00001: val_acc improved from -inf to 0.98862, saving model to model.h5
Epoch 2/30
Epoch 00002: val_acc improved from 0.98862 to 0.99126, saving model to model.h5
Epoch 3/30
Epoch 00003: val_acc improved from 0.99126 to 0.99634, saving model to model.h5
Epoch 4/30
Epoch 00004: val_acc did not improve from 0.99634
Epoch 5/30
Epoch 00005: val_acc did not improve from 0.99634
Epoch 6/30
Epoch 00006: val_acc improved from 0.99634 to 0.99654, saving model to model.h5
Epoch 7/30
Epoch 00007: val_acc did not improve from 0.99654
Epoch 8/30
Epoch 00008: val_acc did not improve from 0.99654
Epoch 9/30
Epoch 00009: val_acc improved from 0.99654 to 0.99756, saving model to model.h5
Epoch 10/30
Epoch 00010: val_acc did not improve from 0.99756
Epoch 11/30
Epoch 00011: val_acc did not improve from 0.99756
Epoch 12/30
Epoch 00012: val_acc did 

<tensorflow.python.keras.callbacks.History at 0x7fc9d2d325f8>

## Model Evaluation

In [14]:
# load the model
from tensorflow.keras.models import load_model
bestmodel = load_model("model.h5")

score, acc = bestmodel.evaluate(X_test, Y_test, batch_size=128, verbose=0)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.016311643743221232
Test accuracy: 0.997561


In [16]:
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
# predict_classes is not available in newer TensorFlow releases; argmax over predict is equivalent here
Y_pred = np.argmax(bestmodel.predict(X_test, batch_size=128), axis=1)
# convert the one-hot test labels back to class indices
Y_true = np.argmax(Y_test, axis=1)
print("confusion matrix \n", confusion_matrix(Y_true, Y_pred))
print(classification_report(Y_true, Y_pred))

confusion matrix 
 [[ 225    2    2]
 [   1   71    3]
 [   0    4 4612]]
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       229
           1       0.92      0.95      0.93        75
           2       1.00      1.00      1.00      4616

    accuracy                           1.00      4920
   macro avg       0.97      0.98      0.97      4920
weighted avg       1.00      1.00      1.00      4920
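
The figures in the report can be recomputed directly from the confusion matrix, whose rows are true classes and whose columns are predicted classes. The short sketch below does that arithmetic in plain Python; `per_class_metrics` is an illustrative helper, not a scikit-learn function.

```python
confusion = [
    [225, 2, 2],
    [1, 71, 3],
    [0, 4, 4612],
]


def per_class_metrics(matrix):
    """Return (precision, recall, f1) for each class of a square confusion matrix."""
    size = len(matrix)
    results = []
    for k in range(size):
        true_positive = matrix[k][k]
        predicted_as_k = sum(matrix[row][k] for row in range(size))  # column sum
        actually_k = sum(matrix[k])  # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


metrics = per_class_metrics(confusion)
correct = sum(confusion[k][k] for k in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

for k, (precision, recall, f1) in enumerate(metrics):
    print(k, round(precision, 2), round(recall, 2), round(f1, 2))
print(round(accuracy, 6))  # 0.997561
```

Class 1 is the weak spot: 71 of 77 predictions are correct (precision 0.92) and 71 of 75 true samples are found (recall 0.95), while the overall accuracy is dominated by the 4616 samples of class 2.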



In [0]:
!cp -r 'model.h5' 'drive/My Drive/Colab Notebooks/'

## Load pretrained model

In [0]:
!cp 'drive/My Drive/Colab Notebooks/model.h5' .

In [0]:
# load the model
from tensorflow.keras.models import load_model
bestmodel = load_model("model.h5")
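
Reloading `model.h5` in a fresh session is not enough on its own, because the prediction cell below also needs the fitted tokenizer `t`, which only exists in memory. One common option is to pickle it next to the model. This is a sketch under that assumption; the helper names and the `tokenizer.pickle` file name are hypothetical and do not appear in the original notebook.

```python
import pickle


def save_object(obj, path):
    """Serialize any picklable object, such as a fitted tokenizer, to disk."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Load an object previously written by save_object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical usage in this notebook:
#   save_object(t, "tokenizer.pickle")    # right after t.fit_on_texts(...)
#   t = load_object("tokenizer.pickle")   # in a new session, before predicting
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.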

## Predict on a text

In [33]:
text2 = ['amazon.in - Invoice of your bill']
text1 = ['Mobile Internet Usage Information']
text = ['Great festivities bring great CashBack']
#vectorizing the text by the pre-fitted tokenizer instance
text = t.texts_to_sequences(text)
#padding the text to have exactly the same shape as `embedding` input
text = pad_sequences(text, maxlen=34, dtype='int32', value=0)
#print(text)

label = bestmodel.predict(text, batch_size=1, verbose = 1)[0]
# get_dummies sorts the tag names, so index 0 is Finance, 1 is MaybeUseful and 2 is NotFinance
import numpy as np
labels = ['Finance', 'MaybeUseful', 'NotFinance']
print(labels[np.argmax(label)])

NotFinance


## Save the dataframes

In [27]:
df.Tags.value_counts()
#df.Tags
#print(df.Tags)


NotFinance     23084
Finance         1141
MaybeUseful      372
Name: Tags, dtype: int64

In [0]:
df[df.Tags=='Finance'].to_csv('financial_emails.csv', index=False)

In [0]:
df[df.Tags=='MaybeUseful'].to_csv('maybeuseful_emails.csv', index=False)

**Anonymize the dataset before saving it to Drive**

## Save the CSV files for later use

In [0]:
!cp 'financial_emails.csv' 'drive/My Drive/Colab Notebooks/' 

In [0]:
!cp 'maybeuseful_emails.csv' 'drive/My Drive/Colab Notebooks/' 