# Data Preprocessing:

### Features (as in paper):
1. 11,140 n-gram features: tf- and tf-idf-weighted word and character n-grams, stemmed with Porter's stemmer
2. type-token ratio
3. ratio of comments in English
4. ratio of British English vs. American English words
5. 93 features from LIWC
6. 26 PSYCH features (Preotiuc: Paraphrase Database and MRC Psycholinguistics Database)

### Columns (from the description of the dataset):
1. 'global':[7,10], #subreddits_commented, subreddits_commented_mbti, num_comments
2. 'liwc':[10,103], #liwc
3. 'word':[103,3938], #top1000 word ngram (1,2,3) per dimension based on chi2
4. 'char':[3938,7243], #top1000 char ngrams (2,3) per dimension based on chi2
5. 'sub':[7243,12228], #number of comments in each subreddit
6. 'ent':[12228,12229], #entropy
7. 'subtf':[12229,17214], #tf-idf on subreddits
8. 'subcat':[17214,17249], #manually crafted subreddit categories
9. 'lda50':[17249,17299], #50 LDA topics
10. 'posts':[17299,17319], #posts statistics
11. 'lda100':[17319,17419], #100 LDA topics
12. 'psy':[17419,17443], #psycholinguistic features
13. 'en':[17443,17444], #ratio of english comments
14. 'ttr':[17444,17445], #type token ratio
15. 'meaning':[17445,17447], #additional psycholinguistic features
16. 'time_diffs':[17447,17453], #commenting time diffs
17. 'month':[17453,17465], #monthly distribution
18. 'hour':[17465,17489], #hourly distribution
19. 'day_of_week':[17489,17496], #daily distribution
20. 'word_an':[17496,21496], #word ngrams selected by F-score
21. 'word_an_tf':[21496,25496], #tf-idf ngrams selected by F-score
22. 'char_an':[25496,29496], #char ngrams selected by F-score
23. 'char_an_tf':[29496,33496], #tf-idf char ngrams selected by F-score
24. 'brit_amer':[33496,33499], #british vs american english ratio
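
The index pairs above read naturally as half-open `[start, end)` column ranges. That is an assumption, but the block widths support it: `liwc` spans 93 columns, matching the 93 LIWC features, and `psy` plus `meaning` add up to 26, which may correspond to the 26 PSYCH features. A minimal sketch with a subset of the ranges; `FEATURE_RANGES`, `block_width`, and `select_block` are hypothetical names, not part of the dataset:

```python
# Subset of the column ranges listed above, read as half-open [start, end).
FEATURE_RANGES = {
    "global": (7, 10),
    "liwc": (10, 103),
    "psy": (17419, 17443),
    "meaning": (17445, 17447),
    "month": (17453, 17465),
    "hour": (17465, 17489),
    "day_of_week": (17489, 17496),
}


def block_width(name):
    """Number of columns in a feature block."""
    start, end = FEATURE_RANGES[name]
    return end - start


def select_block(row, name):
    """Slice one feature block out of a single row given as a sequence."""
    start, end = FEATURE_RANGES[name]
    return row[start:end]


widths = {name: block_width(name) for name in FEATURE_RANGES}
print(widths)

# Toy row whose values equal their own column index, to make slices visible.
toy_row = list(range(33499))
print(select_block(toy_row, "global"))
```

With a pandas frame holding the full feature matrix, the same pair would go into `df.iloc[:, start:end]`.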


## Import packages

In [104]:
import nltk
# import ssl

# try:
#     _create_unverified_https_context = ssl._create_unverified_context
# except AttributeError:
#     pass
# else:
#     ssl._create_default_https_context = _create_unverified_https_context
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import bigrams, ngrams
import re
import string
from string import punctuation
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.feature_extraction.text import CountVectorizer 
from collections import Counter
from num2words import num2words 
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import random
random.seed(32)

# close nltk download window to continue

[nltk_data] Downloading package punkt to /home/sophia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/sophia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/sophia/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to /home/sophia/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


## Import data

In [105]:
pandora = pd.read_csv('/home/sophia/ma_py/pandora_bigfive1000.csv')

authors = pd.read_csv('/home/sophia/ma_py/author_profiles.csv')

bigfive = authors[['author', 'agreeableness','openness','conscientiousness','extraversion','neuroticism']]
bigfive = bigfive.dropna()

pandoradf = pd.merge(pandora, bigfive, on='author', how='outer')
pandoradf = pandoradf.dropna()
pandoradf = pandoradf.reset_index()
pandoradf.tail()

Unnamed: 0,index,author,author_flair_text,body,downs,created_utc,subreddit_id,link_id,parent_id,score,...,subreddit,ups,word_count,word_count_quoteless,lang,agreeableness,openness,conscientiousness,extraversion,neuroticism
82,932,Xaielao,Pharah,It seems to me that the least played character...,0.0,1463691000.0,t5_2u5kl,t3_4k2vck,t3_4k2vck,2.0,...,Overwatch,2.0,98.0,96.0,en,78.0,57.0,38.0,31.0,10.0
83,944,BadgerKid96,19,Close.,0.0,1469684000.0,t5_2rjli,t3_4ur4p7,t1_d5sbnrs,1.0,...,teenagers,1.0,1.0,1.0,en,77.0,73.0,73.0,1.0,98.0
84,962,Ambedo_1,INTJ,ahh gotcha. thanks for replying. i think i do ...,0.0,1479464000.0,t5_2qowo,t3_54j3ww,t1_da5gu9x,1.0,...,intj,0.0,120.0,120.0,en,11.0,6.0,61.0,1.0,45.0
85,968,WhatINeverSaid,[ISFJ],No it isn't too much information. I would say ...,0.0,1429376000.0,t5_2s90r,t3_32vycz,t1_cqfnjda,1.0,...,mbti,1.0,42.0,42.0,en,34.0,10.0,54.0,33.0,46.0
86,988,mdhh99,http://smile.amazon.com/gp/registry/wishlist/2...,What type of skate? The trick to ice skate is ...,0.0,1436839000.0,t5_2tx47,t3_3d6q1c,t1_ct2e6qm,2.0,...,Random_Acts_Of_Amazon,2.0,27.0,27.0,en,8.0,9.0,14.0,14.0,29.0
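
One thing worth checking in the cell above: `dropna()` without arguments drops a row if any column is missing, not only the Big Five scores, and an outer merge followed by `dropna()` ends up keeping only authors present in both tables anyway. Whether this explains the small number of remaining rows has not been verified here. A minimal sketch on made-up data (the frames and values are illustrative, not taken from the dataset) comparing it with an inner merge that only requires the columns used later:

```python
import pandas as pd

# Toy comment table: author "b" has no flair, author "c" has no labels.
comments = pd.DataFrame({
    "author": ["a", "b", "c"],
    "body": ["hi", "hello", "hey"],
    "author_flair_text": ["INTJ", None, "ENFP"],
})
# Toy label table: author "d" has labels but no comments.
labels = pd.DataFrame({
    "author": ["a", "b", "d"],
    "openness": [57.0, 10.0, 73.0],
})

# Pattern from the notebook: outer merge, then drop rows with ANY missing value.
outer = pd.merge(comments, labels, on="author", how="outer").dropna()

# Alternative: inner merge, and only require the columns that are needed.
inner = pd.merge(comments, labels, on="author", how="inner")
inner = inner.dropna(subset=["body", "openness"])

print(list(outer["author"]))
print(sorted(inner["author"]))
```

In the toy data, author `b` is lost in the first variant only because the flair is missing, even though both the comment and the label are present.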


## Adjust representations of some columns

In [106]:
# change language and Big Five scores to numeric representations
def adjust(df):
    # encode lang as an integer: en=0, es=1, nl=2, anything else=3
    df['language'] = np.select([df.lang == 'en', df.lang == 'es', df.lang == 'nl'],
                               [0, 1, 2],
                               default=3)
    df = df.drop(columns=['lang'])

    # change big five to binary representation: 0 if score < 50, else 1
    df['agree'] = df['agreeableness'].apply(lambda x: 0 if x<50 else 1)
    df['openn'] = df['openness'].apply(lambda x: 0 if x<50 else 1)
    df['consc'] = df['conscientiousness'].apply(lambda x: 0 if x<50 else 1)
    df['extra'] = df['extraversion'].apply(lambda x: 0 if x<50 else 1)
    df['neuro'] = df['neuroticism'].apply(lambda x: 0 if x<50 else 1)
    return df
# newpandoradf = adjust(pandoradf)
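
The binarization in `adjust` is a threshold at 50: scores below 50 map to 0, everything else to 1. A small standalone sketch of that rule (`binarize` is a hypothetical helper, not part of the notebook) that also shows an edge case: a missing score does not compare as less than 50 and would be labelled 1, which is why dropping missing values before this step matters.

```python
import math


def binarize(score, threshold=50):
    """Return 0 for scores below the threshold, 1 otherwise."""
    return 0 if score < threshold else 1


scores = [10.0, 49.9, 50.0, 98.0]
print([binarize(s) for s in scores])

# NaN < 50 is False, so a missing score silently becomes 1.
print(binarize(math.nan))
```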

## Feature extraction

In [107]:
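def choose_stopwordlist(df, mode):
    # df is unused; it is kept so existing calls keep working
    if mode not in ('NLTK', 'NLTK-neg'):
        raise ValueError(f"unknown stopword mode: {mode!r}")
    stopwordList = stopwords.words('english')
    if mode == 'NLTK-neg':
        # keep negations in the text by removing them from the stopword list
        for negation in ('no', 'nor', 'not'):
            stopwordList.remove(negation)
    return stopwordList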

# stopwordList = choose_stopwordlist(pandoradf, mode='NLTK-neg')

# print(stopwordList)
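
The `'NLTK-neg'` mode keeps `no`, `nor`, and `not` in the text, presumably because removing them can flip the meaning of a comment. A toy illustration, assuming a tiny hand-written stopword list (`TOY_STOPWORDS` is made up for this sketch and far shorter than NLTK's English list):

```python
# Made-up stopword list for illustration only.
TOY_STOPWORDS = ["i", "do", "this", "is", "it", "no", "nor", "not"]
NEGATIONS = ("no", "nor", "not")


def remove_stopwords(text, stopword_list):
    """Drop every whitespace-separated token found in stopword_list."""
    stop = set(stopword_list)
    return " ".join(tok for tok in text.split() if tok not in stop)


# Full list: the negation disappears together with the other stopwords.
with_negation_removed = remove_stopwords("i do not like this", TOY_STOPWORDS)

# Negation-preserving list: same list minus no/nor/not.
keep_negation = [w for w in TOY_STOPWORDS if w not in NEGATIONS]
with_negation_kept = remove_stopwords("i do not like this", keep_negation)

print(with_negation_removed)
print(with_negation_kept)
```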

### Preprocessing

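1. expand contractions (e.g., "won't" to "will not")
2. split into sentences
3. lowercase and drop tokens that are not purely alphanumeric
4. remove stopwords
5. convert numbers to words
6. tokenize
7. stem (Porter stemmer)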
In [108]:
def decontracted(phrase):
    # specific cases first, so the general n't rule does not turn "won't" into "wo not"
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general patterns; note that 's is always expanded to " is", including possessives
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

# NOTE: featuredf is not defined anywhere above; this line assumes it already exists in the running kernel
featuredf['probody'] = featuredf['body'].apply(lambda x:(decontracted(''.join(x))))
print(featuredf.iloc[5]['probody'])

def senttokenize(df):
    # split each raw comment into a list of sentences
    df['senttokens'] = df['body'].apply(sent_tokenize)
    return df

I do not like this description by the way, too vague and disorganized and short also wth I thought you were TYPE_MENTION there is no way in hell you are TYPE_MENTION
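
The order of the substitutions in `decontracted` matters: the specific patterns for `won't` and `can't` have to run before the general `n't` rule, otherwise `won't` would become `wo not`. The sketch below is a compact restatement of the same rules as a list of (pattern, replacement) pairs; `RULES` and `expand_contractions` are hypothetical names. It also shows a limitation: a possessive `'s` is expanded to ` is`.

```python
import re

# Same substitutions as in decontracted, applied in list order.
RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'d", " would"),
    (r"'ll", " will"),
    (r"'t", " not"),
    (r"'ve", " have"),
    (r"'m", " am"),
]


def expand_contractions(phrase, rules=RULES):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, replacement in rules:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase


print(expand_contractions("I won't go and you can't stay"))

# Possessive 's is treated like the contraction of "is".
print(expand_contractions("John's dog isn't here"))

# Skipping the two specific rules shows why they come first.
print(expand_contractions("won't", rules=RULES[2:]))
```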


In [109]:
def low_stop_num_token(workdf, stopword_list=None):
    # the original relied on a global stopwordList; fall back to the negation-preserving list
    if stopword_list is None:
        stopword_list = choose_stopwordlist(workdf, mode='NLTK-neg')
    stopword_set = set(stopword_list)
    # lowercase and keep only purely alphanumeric tokens; a token with attached
    # punctuation (e.g. "way," or "Close.") is dropped entirely
    workdf['probody'] = workdf['probody'].apply(
        lambda text: ' '.join(tok.lower() for tok in text.split() if tok.isalnum()))
    # remove stopwords
    workdf['probody'] = workdf['probody'].apply(
        lambda text: ' '.join(tok for tok in text.split() if tok not in stopword_set))
    newbody = []
    newprobody = []
    for sentence in tqdm(workdf['probody']):
        # num2words: spell out purely numeric tokens
        inputtext = [num2words(tok) if tok.isnumeric() else tok for tok in sentence.split()]
        # list to string
        celltext = ' '.join(inputtext)
        newprobody.append(celltext)
        # tokenize
        newbody.append(word_tokenize(celltext))
    workdf['probody'] = newprobody
    workdf['tokens'] = newbody
    return workdf

# preprocesseddf = preprocessing(featuredf)
# print(preprocesseddf.iloc[2]['body'])
# preprocesseddf.head()
# preprocesseddf.info()
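
The `isalnum()` filter in `low_stop_num_token` works on whitespace-separated tokens, so a word with punctuation attached is dropped as a whole rather than cleaned. That is consistent with the empty `probody` for the one-word comment `Close.` in the output further down. A sketch of the effect, next to one possible alternative that strips punctuation first (`keep_alnum_tokens` and `strip_then_keep` are hypothetical helpers; the alternative is not what the notebook does):

```python
import string


def keep_alnum_tokens(text):
    """Mirror the notebook's filter: lowercase, keep purely alphanumeric tokens."""
    return " ".join(tok.lower() for tok in text.split() if tok.isalnum())


def strip_then_keep(text):
    """Alternative: strip surrounding punctuation before the alphanumeric check."""
    stripped = (tok.strip(string.punctuation) for tok in text.split())
    return " ".join(tok.lower() for tok in stripped if tok.isalnum())


sample = "Close. I squatted 225x14 by the way, too vague"
print(keep_alnum_tokens(sample))
print(strip_then_keep(sample))

# Mixed tokens such as "225x14" pass isalnum() but not isnumeric(),
# so they are kept and never reach the number-to-words step.
print("225x14".isalnum(), "225x14".isnumeric())
```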

In [110]:
tqdm.pandas()
# Porter Stemmer
def stemming(df):
    ps = PorterStemmer()
    df['tokens'] = df['tokens'].progress_apply(lambda x:([ps.stem(word) for word in x]))
    return df

# stemmeddf = stemming(preprocesseddf)
# print(stemmeddf.iloc[1]['tokens'])
# stemmeddf.head()

In [111]:
## Sort dataframe

In [112]:
# gramsdf.info(verbose=True)

In [113]:
def ordering(df):
    cols_tomove = ['index', 'author', 'body', 'probody', 'tokens', 'senttokens', 'agreeableness', 'openness', 'conscientiousness', 'extraversion', 'neuroticism', 'agree', 'openn', 'consc', 'extra', 'neuro', 'language']
    orderdf  = df[cols_tomove + [col for col in df.columns if col not in cols_tomove]]
#     orderdf.info(verbose=True)
    return orderdf

# orderdf = ordering(gramsdf)

## Wrapper

In [114]:
def preprocess(df):
    # adjust some column representations
    df = adjust(df)
    # choose stopwordlist with or without negation
    stopwordList = choose_stopwordlist(df, mode='NLTK-neg')
    # expand contractions (e.g., n't to not)
    df['probody'] = df['body'].apply(lambda x:(decontracted(''.join(x))))
    # create sentence tokens
    df = senttokenize(df)
    # lowercase, drop non-alphanumeric tokens, remove stopwords, num2words, tokenize
    df = low_stop_num_token(df)
    # Porter stemmer
    df = stemming(df)
    df = ordering(df)
    return df

predf = preprocess(pandoradf)



In [115]:
predf.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   index                 87 non-null     int64  
 1   author                87 non-null     object 
 2   body                  87 non-null     object 
 3   probody               87 non-null     object 
 4   tokens                87 non-null     object 
 5   senttokens            87 non-null     object 
 6   agreeableness         87 non-null     float64
 7   openness              87 non-null     float64
 8   conscientiousness     87 non-null     float64
 9   extraversion          87 non-null     float64
 10  neuroticism           87 non-null     float64
 11  agree                 87 non-null     int64  
 12  openn                 87 non-null     int64  
 13  consc                 87 non-null     int64  
 14  extra                 87 non-null     int64  
 15  neuro                 87 

In [116]:
predf['probody']

0                               subtle enough look like
1        downturned dirty small small obvious would not
2     man would call man guess mouse agree not safe ...
3                                           added thank
4     squatted 225x14 couple weeks ago made sad card...
                            ...                        
82    seems least played characters also characters ...
83                                                     
84    ahh thanks think care people not societal thin...
85    no not much would say probably not fact behavi...
86    type trick ice skate lean would imagine would ...
Name: probody, Length: 87, dtype: object

## Export dataframe

In [117]:
predf.to_pickle("preprocessed.pkl")
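
Pickle keeps Python objects such as the token lists intact, whereas a CSV export would turn them into strings. A minimal round-trip sketch on a toy frame (the data and the temporary path are made up for illustration), using `pd.read_pickle` to load the file back:

```python
import os
import tempfile

import pandas as pd

# Toy frame with a list-valued column, like the tokens column above.
toy = pd.DataFrame({
    "probody": ["added thank", "no not much"],
    "tokens": [["added", "thank"], ["no", "not", "much"]],
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "preprocessed.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The list-valued column survives the round trip as real lists.
print(type(restored.loc[0, "tokens"]))
print(restored.loc[1, "tokens"])
```

As with any pickle file, it should only be loaded from a source that is trusted.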