### What this script does:

- Remove stopwords: English & French stopwords are loaded from two txt files into two sets. When a comment/post is detected as in French, detectable French stopwords will be removed. Otherwise, English stopwords will be removed accordingly. Note: the stopwords list needs to be updated to suit our purpose, for example, when we need to find out about the attitude/sentiment of comments, we should probably exclude words such as 'couldn't', 'cannot' or 'mustn't' from this list. 


- Remove non ascii characters: after this step, only digits, English & French characters are kept. Turn it off if it's unnecessary or removes too much implicit information such as emoji's.


- Tokenization: with MWETokenizer, multi-word tokens can be added based on our needs, for example, I've added 'climate change', 'canada150', 'justin trudeau' as customized tokens and counted their frequency in later step. 


- Stemming: stem the tokens obtained from last step.


- Output tokens' & stemmers' frequency distribution: output csv files can be found in folder 'word_count', frequency of tokens and stemmers are listed in descending order. The freq_perc column is obtained by dividing the frequency of a word by the total number of comment/posts in each original csv file. Note: Note: there are duplicated tokens in each comment/post, we can choose to count the duplicate or not by commenting/uncommenting a line of code.


In [1]:
import glob
import os
import pandas as pd
from langdetect import detect
import nltk
#nltk.download()   # comment after first download
from nltk.tokenize import sent_tokenize, MWETokenizer, wordpunct_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer
from nltk.stem.snowball import EnglishStemmer
from nltk.probability import FreqDist

## Choose a path

In [2]:
#======= Twitter =======
rootPath='../tw/cleaned data/*.csv'
COLUMN_NAME = 'full_text_cleaned'

#======= Facebook =======
#rootPath='../fb/cleaned data/comments/*.csv'
#COLUMN_NAME = 'comment_message_cleaned'

#======= Instagram =======
#rootPath='../in/cleaned data/posts/*.csv'
#COLUMN_NAME = 'caption'

multiWordsPath = './multiwords.txt'

In [3]:
outputDir = rootPath[:-5] + 'word_count/'
if not os.path.exists(outputDir):
    os.makedirs(outputDir)
filePaths = glob.glob(rootPath)  
display(filePaths)

['../tw/cleaned data/ParksCanada_tweets.csv',
 '../tw/cleaned data/ParcsCanada_tweets.csv']

In [4]:
testfile = '../tw/cleaned data/ParcsCanada_tweets.csv'
filename = os.path.basename(testfile)
outputFileName = filename[:-4] + '_word_count.csv'
display(outputFileName)


stopWords_en = set(stopwords.words('english'))
stopWords_fr = set(stopwords.words('french'))

'ParcsCanada_tweets_word_count.csv'

In [5]:
def detect_lang(text):
    try:
        lang = detect(text)
    except:
        return 'error'
    return lang


def filter_stop_words(text):   
    stopWords = stopWords_en  
    if detect_lang(text) == 'fr':
        stopWords = stopWords_fr  
    filtered_text = [w for w in wordpunct_tokenize(text) if w.lower() not in stopWords
            and len(w) > 1 and w.isalnum()]   
    return ' '.join(filtered_text)


def load_multi_words(filepath):
    with open(filepath) as file:
        lines = file.readlines()
        words = [word.strip() for word in lines]
        return words


def tokenize_multi_words(topic_list):
    result = []
    print('>>>Adding custermized topic words/tokens to tokenizer...')
    for words in topic_list:
        print(words)
        result.append(words.split('_'))
    return result

multiWords = load_multi_words(multiWordsPath)       # load custermized tokens
tokenizedMultiWords = tokenize_multi_words(multiWords)
tokenizer = MWETokenizer(tokenizedMultiWords)
#tokenizer = MWETokenizer()    # Uncomment this line if no customized multi-word tokens needed

def tokenize_text(text):
    return tokenizer.tokenize(text.split())   # remove .lower()


def stem_text(text):
    if detect_lang(text) == 'fr':
        stemmer = FrenchStemmer(ignore_stopwords=False)
    else:
        stemmer = EnglishStemmer(ignore_stopwords=False) 
    stems = [stemmer.stem(tok) for tok in text]
    return stems


>>>Adding custermized topic words/tokens to tokenizer...
climate_change
canada150
justin_trudeau


In [6]:
df = pd.read_csv(testfile)   # the first unnamed column already exists in csv file 
#df
#df[COLUMN_NAME].tolist()

In [7]:
df['text_filtered'] = df[COLUMN_NAME].astype(str).apply(filter_stop_words)
pd.options.display.max_rows = 999
df[[COLUMN_NAME, 'text_filtered']]
#df['text_filtered'].tolist()

Unnamed: 0,full_text_cleaned,text_filtered
0,Vivez une véritable eérience de nature sauvage...,Vivez véritable eérience nature sauvage hiver ...
1,Résolution : réserver tôt! Nos réservations po...,Résolution réserver tôt réservations commencen...
2,Donnez une saveur canadienne à vos plans de la...,Donnez saveur canadienne plans veille Jour An ...
3,Nous savons que vous avez vécu de grandes sur ...,savons vécu grandes cette année alors parlez P...
4,Des nuits dans l'arrière-pays étoilées aux abr...,nuits arrière pays étoilées abris rustiques do...
5,Cherchez-vous une escapade du ? À quatre heure...,Cherchez escapade quatre heures Toronto détend...
6,la écouverte famille/groupe est à % de rabais ...,écouverte famille groupe rabais jusqu décembre...
7,a été une année remplie d’ et de écouvertes et...,année remplie écouvertes voulons entendre parl...
8,C’est la Journée du tic-tac de l'horloge qui v...,Journée tic tac horloge veut dire année tire f...
9,Le Père Noël vous a apporté de l’équipement de...,Père Noël apporté équipement camping cadeau re...


## Load multi-word tokens, tokenize comments/posts, and stem tokens

In [8]:
df['text_tokenized'] = df['text_filtered'].apply(tokenize_text)
df['text_stemmed'] = df['text_tokenized'].apply(stem_text)

In [9]:
token_lst = df['text_tokenized'].tolist()
token_lst
token_fdist = FreqDist()
for list_i in token_lst:
    list_i = set(list_i)  # Adding this line would count a word once even if it appears multple times in one comment/post
    for token in list_i:
        token_fdist[token.lower()] += 1
#token_fdist.most_common(30)
#token_fdist['justin_trudeau']       # check the frequency of a token

In [10]:
stemmer_lst = df['text_stemmed'].tolist()
stemmer_lst
stemmer_fdist = FreqDist()
for list_i in stemmer_lst:
    list_i = set(list_i)  # Adding this line would count a word once even if it appears multple times in one comment/post
    for token in list_i:
        stemmer_fdist[token.lower()] += 1
#stemmer_fdist.most_common(30)
 
display(stemmer_fdist['canada150']) # stemmer doesn't change multi-word tokens
display(stemmer_fdist['justin_trudeau'])        

1

0

## Output token/stemmer frequency distribution

In [11]:
token_df = pd.DataFrame(list(token_fdist.items()), columns=['token', 'tok_freq'])
token_df['tok_freq_perc'] = token_df.tok_freq/len(df)
token_df = token_df.sort_values('tok_freq', ascending=False).reset_index(drop=True)

stemmer_df = pd.DataFrame(list(stemmer_fdist.items()), columns=['stemmer', 'stem_freq'])
stemmer_df['stem_freq_perc'] = stemmer_df.stem_freq/len(df)
stemmer_df = stemmer_df.sort_values('stem_freq', ascending=False).reset_index(drop=True)

print('>>> Output word frequency distribution for ' + filename)
output_df = pd.concat([token_df, stemmer_df], axis=1)
output_df.to_csv(outputDir + outputFileName, index=None)   
output_df

>>> Output word frequency distribution for ParcsCanada_tweets.csv


Unnamed: 0,token,tok_freq,tok_freq_perc,stemmer,stem_freq,stem_freq_perc
0,les,751,0.249834,les,722.0,0.240186
1,plus,245,0.081504,parc,319.0,0.106121
2,cette,220,0.073187,plus,245.0,0.081504
3,éfiparcs,204,0.067864,cett,219.0,0.072854
4,parcs,198,0.065868,éfiparc,205.0,0.068197
5,semaine,172,0.057219,site,193.0,0.064205
6,lhn,171,0.056886,endroit,193.0,0.064205
7,canada,168,0.055888,visit,175.0,0.058217
8,écouverte,157,0.052229,semain,173.0,0.057552
9,découvrez,148,0.049235,lhn,171.0,0.056886
