# Script content
 
### 1) This script loads the df_emails data frame, which contains one row per email with the tokenized email body and the inferred owner.
 
##### In/out information about each email is then derived from whether the folder name matches the name in the X-From field of the email. We process email bodies in two ways: one for topic modelling and a slightly different one for the recommendation engine.
 
##### The difference is that, for the latter version, if an email contains a forwarded email thread in its body, we remove that part and keep only what the sender wrote (outbox emails) or what was written to the folder owner in the last email of the thread (inbox emails). This is a choice we make in order to infer preferences and expertise levels on topics: for the outbox it is clearly right, and for the inbox it seems correct because it prevents earlier emails by the folder owner in a thread from being counted towards preference levels.
 
##### Furthermore, from each email body we remove the notices at the bottom of emails (privacy, environment, etc.) as well as From, To, X-From, X-To and all other fields that come with a forwarded thread. This is done for both versions of processing.
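
The thread removal described above can be sketched roughly as follows. This is only an illustration: the marker patterns, the header list and the helper name `keep_latest_message` are assumptions, and the code that actually produced df_emails0 is not part of this notebook.

```python
import re

# Hypothetical separators that open a forwarded or quoted thread.
# The patterns really used to build df_emails0 are not shown here.
FORWARD_MARKER = re.compile(
    r'^[ \t]*-{2,}[ \t]*(?:Original Message|Forwarded by\b.*?)[ \t]*-*[ \t]*$',
    re.IGNORECASE | re.MULTILINE,
)

# Header-style lines such as "From: ..." or "X-To: ..." (assumed list).
HEADER_LINE = re.compile(
    r'^[ \t]*(?:From|To|Cc|Bcc|Subject|Sent|X-From|X-To|X-cc|X-bcc):.*$',
    re.IGNORECASE | re.MULTILINE,
)


def keep_latest_message(body):
    """Keep only the text above the first forwarded-thread marker."""
    match = FORWARD_MARKER.search(body)
    if match:
        body = body[:match.start()]
    # Drop header-style lines that remain in the kept part.
    body = HEADER_LINE.sub('', body)
    return body.strip()


example = ("X-From: Mary\n"
           "Please review.\n"
           "-----Original Message-----\n"
           "From: John\n"
           "Old text")
latest = keep_latest_message(example)
```

Cutting at the first marker keeps the most recent message only, which matches the choice described above for both inbox and outbox emails.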

### 2) The script then removes standard English stop words, digits, spaces, special characters and words with 1 or 2 letters; it brings nouns to singular form and removes suffixes where applicable. The result is two nested lists, 'text_clean' and 'text_clean0'.

### 3) For persistence we save both lists to disk with cPickle.


In [2]:
# load df_emails and df_emails0
import pandas as pd
df_emails=pd.read_pickle('/notebooks/LDA models and data/Data Frames and lists/df_emails.pkl')
df_emails0=pd.read_pickle('/notebooks/LDA models and data/Data Frames and lists/df_emails0.pkl')
# LDA models and data/Data Frames and lists  df_emails.pkl

In [7]:
import random
df_emails=df_emails.sample(n=200)
df_emails0=df_emails0.sample(n=200)

In [6]:
import matplotlib.pyplot as plt
%matplotlib inline

# NLP
from nltk.tokenize.regexp import RegexpTokenizer
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
from nltk.stem.porter import PorterStemmer

# LDA
import gensim
from gensim import corpora

import re
# from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
# from sklearn.cluster import KMeans
# from sklearn.decomposition import PCA
# from sklearn.lda import LDA
# from sklearn.decomposition import LatentDirichletAllocation

Using TensorFlow backend.


### We need to process email bodies into sequences of tokenized words. That is, we apply the standard steps for statistical processing of natural language.

### We first write a function 'clean_text' in the first cell below, which removes punctuation, special characters, digits and overly common English words, then lemmatizes and stems the remaining words and drops words shorter than three letters.


In [8]:
def clean_text(text):
    stop = set(stopwords.words('english'))
    stop.update(("to","cc","subject","http","from","sent","aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"))
    exclude = set(string.punctuation) 
    lemma = WordNetLemmatizer()
    porter= PorterStemmer()
    
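    text = text.rstrip()
    # Replace everything that is not a letter (digits, punctuation, symbols) with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Lowercase and drop stop words
    stop_free = " ".join([i for i in text.lower().split() if ((i not in stop) and (not i.isdigit()))])
    # Drop any remaining punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize (brings nouns to singular form), then stem (removes suffixes)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    stem = " ".join(porter.stem(token) for token in normalized.split())
    # Keep only words with at least three letters
    two_letter_words_free = " ".join([i for i in stem.split() if len(i) >= 3])

    return two_letter_words_free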

### Check how it works:

In [9]:
clean_text(' !This is a 45.98*demonstrating. Of af actions ji of this function\'s $#wORKing/')

u'demonstr action function work'

### Next we process all bodies into a list. We write two functions: one that takes only the email bodies and another that also takes the directory path name and the in/outbox identifier. This could be handled in a single function, but we leave such implementation details for a later stage.

In [37]:
# Function for the LDA training and evaluation part

def text_clean_df2list1(data):
    text_clean=[]
    for text in data['body']:
        text_clean.append(clean_text(text).split())
    return text_clean
    
# Function for recommendation engine part
    
def text_clean_df2list0(data):
    # One entry per email: [directory path, in/outbox id, cleaned tokens]
    text_clean0 = []
    for k in range(data.shape[0]):
        text_clean0.append([data['dirpath'].iloc[k],
                            data['inout_id'].iloc[k],
                            clean_text(data['body'].iloc[k]).split()])
    return text_clean0

### Now we process df_emails and df_emails0. Recall that the second one has the text of forwarded email threads removed from the bodies. We also keep the directory path name and the in/outbox id, as we need them later on.
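
The cell that builds the two lists is not shown in this notebook; presumably it calls `text_clean_df2list1(df_emails)` and `text_clean_df2list0(df_emails0)`. The shape of the results can be illustrated without the real data. In the sketch below, plain dicts stand in for data frame rows, `tokenize` is a simplified stand-in for `clean_text(...).split()`, and the paths and `inout_id` values are invented for illustration. The second list is named `text_clean01` only because the save cell below uses that name.

```python
def tokenize(body):
    # Simplified stand-in for clean_text(body).split(); the real function
    # also removes stop words, lemmatizes and stems.
    return body.lower().split()


# Invented rows; the column names follow the functions above.
rows = [
    {'dirpath': 'maildir/smith-j/sent', 'inout_id': 'out',
     'body': 'Gas contract attached'},
    {'dirpath': 'maildir/smith-j/inbox', 'inout_id': 'in',
     'body': 'Meeting moved to Friday'},
]

# List for topic modelling: one token list per email.
text_clean = [tokenize(r['body']) for r in rows]

# List for the recommendation engine: [dirpath, inout_id, tokens] per email.
text_clean01 = [[r['dirpath'], r['inout_id'], tokenize(r['body'])]
                for r in rows]
```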

### The above two lists are sufficient to proceed and we save them to the disk.

In [None]:
import cPickle
with open('/notebooks/LDA models and data/Data Frames and lists/text_clean.pkl', 'wb') as pickle_file:
    cPickle.dump(text_clean, pickle_file, cPickle.HIGHEST_PROTOCOL)

with open('/notebooks/LDA models and data/Data Frames and lists/text_clean0.pkl', 'wb') as pickle_file:
    cPickle.dump(text_clean01, pickle_file, cPickle.HIGHEST_PROTOCOL)
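
To confirm that a saved list can be read back unchanged, a round trip along the following lines may help. It is a sketch that writes to a temporary directory rather than the notebook's data folder, uses a made-up token list, and imports `pickle` instead of `cPickle`, since Python 3 no longer has a separate `cPickle` module.

```python
import os
import pickle
import tempfile

# Made-up stand-in for the nested token list produced above.
text_clean = [['gas', 'contract'], ['meet', 'friday']]

# Temporary location, not the notebook's real data folder.
path = os.path.join(tempfile.mkdtemp(), 'text_clean.pkl')

with open(path, 'wb') as pickle_file:
    pickle.dump(text_clean, pickle_file, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, 'rb') as pickle_file:
    restored = pickle.load(pickle_file)
```

Files must be opened in binary mode ('wb' and 'rb') for both writing and reading.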