# Data Preprocessing:

### Features (as in paper):
1. 11,140 ngram features: tf and tf-idf weighted word and character ngrams stemmed with Porter's stemmer
2. type-token ratio
3. ratio of comments in English
4. ratio of British English vs. American English words
5. 93 features from LIWC 
6. 26 PSYCH features (Preotiuc: Paraphrase Database and MRC Psycholinguistics Database)
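As an illustration of feature 2, the type-token ratio can be sketched as the number of distinct tokens divided by the total number of tokens. This is a minimal sketch and not necessarily how the dataset authors computed it; the function name `type_token_ratio` and the handling of empty input are assumptions.

```python
def type_token_ratio(tokens):
    """Return the number of distinct tokens divided by the total token count."""
    if not tokens:
        # Assumption: an empty comment gets a ratio of 0.0 instead of raising.
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "the cat saw the other cat".split()
# 4 distinct tokens (the, cat, saw, other) out of 6 tokens in total
print(type_token_ratio(sample))
```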

### Columns (from the description of the dataset):
1. 'global':[7,10], #subreddits_commented, subreddits_commented_mbti, num_comments
2. 'liwc':[10,103], #liwc
3. 'word':[103,3938], #top1000 word ngram (1,2,3) per dimension based on chi2
4. 'char':[3938,7243], #top1000 char ngrams (2,3) per dimension based on chi2
5. 'sub':[7243,12228], #number of comments in each subreddit
6. 'ent':[12228,12229], #entropy
7. 'subtf':[12229,17214], #tf-idf on subreddits
8. 'subcat':[17214,17249], #manually crafted subreddit categories
9. 'lda50':[17249,17299], #50 LDA topics
10. 'posts':[17299,17319], #posts statistics
11. 'lda100':[17319,17419], #100 LDA topics
12. 'psy':[17419,17443], #psycholinguistic features
13. 'en':[17443,17444], #ratio of english comments
14. 'ttr':[17444,17445], #type token ratio
15. 'meaning':[17445,17447], #additional psycholinguistic features
16. 'time_diffs':[17447,17453], #commenting time diffs
17. 'month':[17453,17465], #monthly distribution
18. 'hour':[17465,17489], #hourly distribution
19. 'day_of_week':[17489,17496], #daily distribution
20. 'word_an':[17496,21496], #word ngrams selected by F-score
21. 'word_an_tf':[21496,25496], #tf-idf ngrams selected by F-score
22. 'char_an':[25496,29496], #char ngrams selected by F-score
23. 'char_an_tf':[29496,33496], #tf-idf char ngrams selected by F-score
24. 'brit_amer':[33496,33499], #british vs american english ratio
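The ranges above can be sanity-checked with a few lines of plain Python. The sketch below assumes the pairs are half-open `[start, end)` positions, as in Python slicing. That reading is not stated in the dataset description, but it is consistent with `liwc` spanning 93 columns. The names `COLUMN_RANGES`, `group_widths` and `find_gaps` are made up for this example.

```python
# Column ranges copied from the list above, read as half-open [start, end).
COLUMN_RANGES = {
    'global': (7, 10),
    'liwc': (10, 103),
    'word': (103, 3938),
    'char': (3938, 7243),
    'sub': (7243, 12228),
    'ent': (12228, 12229),
    'subtf': (12229, 17214),
    'subcat': (17214, 17249),
    'lda50': (17249, 17299),
    'posts': (17299, 17319),
    'lda100': (17319, 17419),
    'psy': (17419, 17443),
    'en': (17443, 17444),
    'ttr': (17444, 17445),
    'meaning': (17445, 17447),
    'time_diffs': (17447, 17453),
    'month': (17453, 17465),
    'hour': (17465, 17489),
    'day_of_week': (17489, 17496),
    'word_an': (17496, 21496),
    'word_an_tf': (21496, 25496),
    'char_an': (25496, 29496),
    'char_an_tf': (29496, 33496),
    'brit_amer': (33496, 33499),
}


def group_widths(ranges):
    """Number of columns in each feature group."""
    return {name: end - start for name, (start, end) in ranges.items()}


def find_gaps(ranges):
    """Pairs of neighbouring groups whose ranges do not line up exactly."""
    ordered = sorted(ranges.items(), key=lambda item: item[1][0])
    return [
        (prev_name, next_name)
        for (prev_name, (_, prev_end)), (next_name, (next_start, _))
        in zip(ordered, ordered[1:])
        if prev_end != next_start
    ]


widths = group_widths(COLUMN_RANGES)
gaps = find_gaps(COLUMN_RANGES)
print(widths['liwc'])                    # 93
print(widths['psy'] + widths['meaning'])  # 26
print(gaps)                              # []
```

Under the same assumption, a group could be selected from a feature matrix loaded as a DataFrame (here a hypothetical `features`) with `features.iloc[:, start:end]`.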


## Import packages

In [1]:
import nltk
# import ssl

# try:
#     _create_unverified_https_context = ssl._create_unverified_context
# except AttributeError:
#     pass
# else:
#     ssl._create_default_https_context = _create_unverified_https_context
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import bigrams, ngrams
import re
import string
from string import punctuation
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.feature_extraction.text import CountVectorizer 
from collections import Counter
from num2words import num2words 
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import datetime
import random
random.seed(32)

# close nltk download window to continue

[nltk_data] Downloading package punkt to /home/sophia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/sophia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/sophia/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to /home/sophia/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


## Import data

In [2]:
pandora = pd.read_csv('/home/sophia/ma_py/pandora_bigfive1000.csv')
#provide identifier
# lst = []
# for i in range(len(pandora)):
#     lst.append(pandora.author[i] + str(i))
# pandora['ident'] = lst

# multiindex
# pandora = pandora.set_index(['author', 'ident']).sort_index()

authors = pd.read_csv('/home/sophia/ma_py/author_profiles.csv')
# find missing data in big five traits
authorslst = authors['author'].tolist()
print("Author search: ", 'DarthHedonist' in authorslst)
print("Author search: ", 'FonsoTheWhitesican' in authorslst)
print("Author search: ", 'chaosking121' in authorslst)

bigfive = authors[['author','agreeableness','openness','conscientiousness','extraversion','neuroticism']]
bigfive = bigfive.dropna()
# print(bigfive[bigfive['author'] == "DarthHedonist"])

# pandoradf = pd.merge(pandora, bigfive, how='left', on='author')
pandoradf = pandora.merge(bigfive, how='left', on=['author'])
# pandoradf = pandoradf.dropna()
pandoradf = pandoradf.sort_values(by='author')
pandoradf = pandoradf[pandoradf['agreeableness'].notna()]
pandoradf = pandoradf.reset_index()

print("Length of dataframe: ", len(pandoradf))
print("NaN in df? ", pandoradf.isnull().any().any())
print("Sum of NaN in agreeableness", pandoradf['agreeableness'].isnull().values.sum())
print("Sum of NaN in openness", pandoradf['openness'].isnull().values.sum())
print("Sum of NaN in conscientiousness", pandoradf['conscientiousness'].isnull().values.sum())
print("Sum of NaN in extraversion", pandoradf['extraversion'].isnull().values.sum())
print("Sum of NaN in neuroticism", pandoradf['neuroticism'].isnull().values.sum())
# nan_values = pandoradf[pandoradf['neuroticism'].isna()]
# nan_values
pandoradf.head()
# pandoradf[pandoradf.isnull().any(axis=1)]

# number of entries does not fit

Author search:  True
Author search:  True
Author search:  True
Length of dataframe:  975
NaN in df?  True
Sum of NaN in agreeableness 0
Sum of NaN in openness 0
Sum of NaN in conscientiousness 0
Sum of NaN in extraversion 0
Sum of NaN in neuroticism 0


Unnamed: 0,index,author,author_flair_text,body,downs,created_utc,subreddit_id,link_id,parent_id,score,...,subreddit,ups,word_count,word_count_quoteless,lang,agreeableness,openness,conscientiousness,extraversion,neuroticism
0,906,-BigSexy-,Catechumen,Oooh i see,,1510236798,t5_2qra3,t3_7bqplg,t1_dpkf5mq,1.0,...,OrthodoxChristianity,,3,3,en,39.0,92.0,1.0,18.0,4.0
1,145,-BlitzN9ne,,**Quality** material right here,,1549708109,t5_3cx36,t3_aopk72,t3_aopk72,13.0,...,UnethicalLifeProTips,,4,4,en,50.0,85.0,15.0,50.0,30.0
2,367,-CrestiaBell,,EA Indubitably,,1512615144,t5_2qh03,t3_7hzusb,t1_dqvr3u0,15.0,...,gaming,,2,2,en,50.0,85.0,50.0,85.0,50.0
3,182,-CrestiaBell,,That's because we had to watch that cartoon in...,,1475867279,t5_2qh03,t3_56c8z3,t1_d8i510p,22.0,...,gaming,,55,55,en,50.0,85.0,50.0,85.0,50.0
4,245,-CrestiaBell,I bet you thought my account would be here c:,[You will protect these smiles..](http://i.img...,,1505862626,t5_2qh22,t3_716at9,t3_716at9,22.0,...,anime,,5,5,en,50.0,85.0,50.0,85.0,50.0


## Adjust representations of some columns

In [3]:
# change language to numeric representation
def adjust(df):
    # encode lang numerically: en -> 0, es -> 1, nl -> 2, any other language -> 3
    df['language'] = np.select([df.lang == 'en', df.lang == 'es', df.lang == 'nl'],
                               [0, 1, 2],
                               default=3)
    df = df.drop(columns=['lang'])

    # change big five to binary representation
    df['agree'] = df['agreeableness'].apply(lambda x: 0 if x<50 else 1)
    df['openn'] = df['openness'].apply(lambda x: 0 if x<50 else 1)
    df['consc'] = df['conscientiousness'].apply(lambda x: 0 if x<50 else 1)
    df['extra'] = df['extraversion'].apply(lambda x: 0 if x<50 else 1)
    df['neuro'] = df['neuroticism'].apply(lambda x: 0 if x<50 else 1)
    return df
# newpandoradf = adjust(pandoradf)
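The binarization rule in `adjust` can be checked on single rows. The sketch below mirrors the lambda (`0 if x<50 else 1`) in a standalone helper; the name `binarize` and the `threshold` parameter are introduced here for illustration only. The two rows are the trait scores of `-BigSexy-` and `-CrestiaBell` from the `head()` output above.

```python
TRAITS = ['agreeableness', 'openness', 'conscientiousness',
          'extraversion', 'neuroticism']


def binarize(score, threshold=50):
    """Scores below the threshold become 0, all other values become 1."""
    return 0 if score < threshold else 1


big_sexy = dict(zip(TRAITS, [39.0, 92.0, 1.0, 18.0, 4.0]))
crestia_bell = dict(zip(TRAITS, [50.0, 85.0, 50.0, 85.0, 50.0]))

print([binarize(big_sexy[t]) for t in TRAITS])      # [0, 1, 0, 0, 0]
print([binarize(crestia_bell[t]) for t in TRAITS])  # [1, 1, 1, 1, 1]

# Caveat: every comparison with NaN is False, so a missing score would be
# mapped to 1. The notebook drops rows with missing traits before this step.
print(binarize(float('nan')))                       # 1
```

A score of exactly 50 lands in class 1, which matters for authors such as `-CrestiaBell` whose scores sit on the boundary.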

## Feature extraction

In [4]:
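def choose_stopwordlist(df, mode):
    # df is not used here; the parameter is kept so that existing calls still work
    if mode == 'NLTK':
        stopwordList = stopwords.words('english')
    elif mode == 'NLTK-neg':
        # keep negations in the text by taking them off the stopword list
        stopwordList = stopwords.words('english')
        stopwordList.remove('no')
        stopwordList.remove('nor')
        stopwordList.remove('not')
    else:
        raise ValueError("mode must be 'NLTK' or 'NLTK-neg'")
    return stopwordList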

# stopwordList = choose_stopwordlist(pandoradf, mode='NLTK-neg')

# print(stopwordList)
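The effect of the `'NLTK-neg'` mode can be illustrated without downloading the NLTK corpus. In the sketch below, `BASE_STOPWORDS` is a tiny stand-in for `stopwords.words('english')` (the real list is much longer), and `remove_stopwords` is a hypothetical helper, not part of the notebook.

```python
# Tiny stand-in for the NLTK English stopword list (assumption for the demo).
BASE_STOPWORDS = ['i', 'am', 'this', 'is', 'no', 'nor', 'not']
NEGATIONS = {'no', 'nor', 'not'}


def remove_stopwords(text, stopword_list):
    """Drop every whitespace-separated token that is in stopword_list."""
    return ' '.join(w for w in text.split() if w not in stopword_list)


# Full list: the negation is removed and the polarity of the phrase is lost.
without_negation = remove_stopwords('this is not good', BASE_STOPWORDS)

# List without negations, analogous to mode='NLTK-neg'.
neg_kept_list = [w for w in BASE_STOPWORDS if w not in NEGATIONS]
with_negation = remove_stopwords('this is not good', neg_kept_list)

print(without_negation)  # good
print(with_negation)     # not good
```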

### Preprocessing

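1. expand contractions (e.g., n't to not)
2. split into sentences
3. lowercase and drop tokens that are not purely alphanumeric
4. remove stopwords
5. convert numbers to words
6. tokenize
7. stem with Porter's stemmer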

In [5]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

# featuredf['probody'] = featuredf['body'].apply(lambda x:(decontracted(''.join(x))))
# print(featuredf.iloc[5]['probody'])
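Two properties of `decontracted` are easy to miss: the patterns only match the straight apostrophe `'`, and the `'s` rule also rewrites possessives. The sketch below applies the same patterns in the same order, written as a table, and adds one step that is an assumption of this example and not part of the notebook: mapping the typographic apostrophe (U+2019) to `'` first. The names `RULES` and `decontract` are hypothetical.

```python
import re

# Same patterns and order as in decontracted(): specific rules first.
RULES = [
    (r"won\'t", "will not"),
    (r"can\'t", "can not"),
    (r"n\'t", " not"),
    (r"\'re", " are"),
    (r"\'s", " is"),
    (r"\'d", " would"),
    (r"\'ll", " will"),
    (r"\'t", " not"),
    (r"\'ve", " have"),
    (r"\'m", " am"),
]


def decontract(phrase):
    # Assumption: normalise the typographic apostrophe so the rules can match.
    phrase = phrase.replace("\u2019", "'")
    for pattern, replacement in RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase


print(decontract("I won't go"))        # I will not go
print(decontract("She\u2019s here"))   # She is here
print(decontract("John's book"))       # John is book (possessive rewritten)
```

Whether the possessive case matters depends on the downstream features; after stopword removal the inserted `is` disappears again.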

def senttokenize(df):
    sentbody = []
    for row in df['body']:
        sentences = sent_tokenize(row)
        sentbody.append(sentences)
    df['senttokens'] = sentbody
    return df

In [6]:
def low_stop_num_token(workdf, stopwordList):
    # lowercase, drop tokens that are not purely alphanumeric, remove stopwords
    workdf['probody'] = workdf['probody'].apply(lambda x: ' '.join([x.lower() for x in x.split() if x.isalnum()]))
    workdf['probody'] = workdf['probody'].apply(lambda x: ' '.join([x for x in x.split() if (x not in stopwordList)]))
    newbody = []
    newprobody = []
    # num2words
    for sentence in tqdm(workdf['probody']):
        # string to list
        inputtext = sentence.split()
        numlist = []
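        for i in range(len(inputtext)):
            # isdecimal() accepts only decimal digits; isnumeric() is also True for
            # characters such as '½' or '²', which cannot be parsed as a number
            if inputtext[i].isdecimal():
                numlist.append(i)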
        for number in numlist:
            inputtext[number] = num2words(inputtext[number])
        
        # list to string
        celltext = ' '.join(inputtext)
        newprobody.append(celltext)
        # tokenize
        words = word_tokenize(celltext)
        newbody.append(words)
    workdf['probody'] = newprobody
    workdf['tokens'] = newbody
    return workdf

# preprocesseddf = preprocessing(featuredf)
# print(preprocesseddf.iloc[2]['body'])
# preprocesseddf.head()
# preprocesseddf.info()
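The `isalnum()` filter in `low_stop_num_token` drops a whole token as soon as it contains a single non-alphanumeric character, so words with attached punctuation or markdown markers are lost. The sketch below reproduces that behaviour and contrasts it with one possible alternative that strips punctuation before filtering. The alternative is an assumption of this example, not what the notebook does, and both helper names are hypothetical.

```python
import string

_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def keep_alnum_tokens(text):
    """Behaviour of the notebook's filter: keep purely alphanumeric tokens."""
    return ' '.join(tok.lower() for tok in text.split() if tok.isalnum())


def strip_punct_tokens(text):
    """Alternative sketch: remove ASCII punctuation first, then filter."""
    tokens = (tok.translate(_PUNCT_TABLE).lower() for tok in text.split())
    return ' '.join(tok for tok in tokens if tok.isalnum())


comment = "**Quality** material right here"
print(keep_alnum_tokens(comment))   # material right here
print(strip_punct_tokens(comment))  # quality material right here

print(keep_alnum_tokens("Oooh, i see."))   # i
print(strip_punct_tokens("Oooh, i see."))  # oooh i see
```

The first result is consistent with the `probody` value `material right` for this comment, once the stopword `here` has been removed.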

In [7]:
tqdm.pandas()
# Porter Stemmer
def stemming(df):
    ps = PorterStemmer()
    df['tokens'] = df['tokens'].progress_apply(lambda x:([ps.stem(word) for word in x]))
    return df

# stemmeddf = stemming(preprocesseddf)
# print(stemmeddf.iloc[1]['tokens'])
# stemmeddf.head()

In [8]:
## Sort dataframe

In [9]:
# gramsdf.info(verbose=True)

In [10]:
def ordering(df):
    lst = []
    for i in range(len(df)):
        lst.append(df.author[i] + str(i))
    df['ident'] = lst
    
    cols_tomove = ['index', 'author', 'ident', 'body', 'probody', 'tokens', 'senttokens', 'agreeableness', 'openness', 'conscientiousness', 'extraversion', 'neuroticism', 'agree', 'openn', 'consc', 'extra', 'neuro', 'language']
    orderdf  = df[cols_tomove + [col for col in df.columns if col not in cols_tomove]]
#     orderdf.info(verbose=True)
    return orderdf
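Because `ident` concatenates the author name and the row number without a separator, two different (author, row) pairs can in principle produce the same string. The sketch below shows such a case and a variant with a separator. The author names are invented, and `'#'` is only an assumption: any character that never occurs in an author name would do.

```python
def make_ident(author, row):
    """Identifier as built in ordering(): author name followed by row number."""
    return author + str(row)


def make_ident_sep(author, row, sep='#'):
    """Variant with a separator (assumed not to occur in author names)."""
    return f"{author}{sep}{row}"


# Invented example: author 'user1' in row 2 and author 'user' in row 12.
print(make_ident('user1', 2), make_ident('user', 12))          # user12 user12
print(make_ident_sep('user1', 2), make_ident_sep('user', 12))  # user1#2 user#12
```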

## Wrapper

In [11]:
def preprocess(df):
    # adjust some column representations
    df = adjust(df)
    # choose stopwordlist with or without negation
    stopwordList = choose_stopwordlist(df, mode='NLTK-neg')
    # decontract abbreviations (e.g., n't to not)
    df['probody'] = df['body'].apply(lambda x:(decontracted(''.join(x))))
    # create sentence tokens
    df = senttokenize(df)
    # lower, remove stopwords, num2words, tokenize
    df = low_stop_num_token(df, stopwordList)
    # porters stemmer
    df = stemming(df)
    df = ordering(df)
    return df

predf = preprocess(pandoradf)
print(predf.ident)


0          -BigSexy-0
1         -BlitzN9ne1
2       -CrestiaBell2
3       -CrestiaBell3
4       -CrestiaBell4
            ...      
970    zugzwang_03970
971    zugzwang_03971
972    zugzwang_03972
973    zugzwang_03973
974    zugzwang_03974
Name: ident, Length: 975, dtype: object


In [12]:
predf.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 975 entries, 0 to 974
Data columns (total 32 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   index                 975 non-null    int64  
 1   author                975 non-null    object 
 2   ident                 975 non-null    object 
 3   body                  975 non-null    object 
 4   probody               975 non-null    object 
 5   tokens                975 non-null    object 
 6   senttokens            975 non-null    object 
 7   agreeableness         975 non-null    float64
 8   openness              975 non-null    float64
 9   conscientiousness     975 non-null    float64
 10  extraversion          975 non-null    float64
 11  neuroticism           975 non-null    float64
 12  agree                 975 non-null    int64  
 13  openn                 975 non-null    int64  
 14  consc                 975 non-null    int64  
 15  extra                 9

In [13]:
predf['probody']

0                                               oooh see
1                                         material right
2                                         ea indubitably
3      watch cartoon school martin luther king time t...
4                                                protect
                             ...                        
970    keep extra clothing items soon get suit jacket...
971    someone stumbled onto sub 5min thanks sticky g...
972    institutions accommodate religious serious med...
973    toss dirty laundry pile week way worn cheap pe...
974            auto correct hates french thanks catching
Name: probody, Length: 975, dtype: object

## Export dataframe

In [14]:
predf.to_pickle("preprocessed.pkl")